Research scaffold

A model that thinks in latent space.

NOESIS is a research scaffold for continuous-thought reasoning: the model thinks in the residual stream instead of the token stream, allocates its own thinking budget per problem, and verifies its latent thoughts before emitting an answer.


87 tests passing
0 fabricated results
4 / 4 components built
Runs on CPU

What is this?

Reasoning LLMs (o1, R1, Claude Extended-Thinking) have shown that spending more tokens on "thinking" before answering reliably improves accuracy. The current production form — chain-of-thought in token space — has a fundamental inefficiency: most thinking tokens are linguistic glue (transitions, hedges, restatements) rather than load-bearing reasoning steps.

NOESIS thinks in continuous latent space instead. After emitting <bot>, the last hidden state is projected through a learned matrix into the next input embedding — no token is decoded, no commitment is made. The model can keep multiple alternatives alive in superposition. After K such steps, it emits <eot> and resumes language-mode generation.
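The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the repo's implementation: `toy_backbone`, `matvec`, `latent_think`, and the identity `bridge` are hypothetical names, the "backbone" is a stand-in function rather than a transformer, and the `<bot>` / `<eot>` markers are omitted.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def toy_backbone(embeddings):
    """Hypothetical stand-in for a transformer: returns a 'last hidden state'.

    Here it is just tanh of the mean of all input embeddings.
    """
    d = len(embeddings[0])
    n = len(embeddings)
    return [math.tanh(sum(e[i] for e in embeddings) / n) for i in range(d)]

def latent_think(prompt_embeddings, bridge, k):
    """Run k latent steps: hidden state -> bridge -> next input embedding.

    No token is decoded inside the loop; the sequence grows by one
    continuous vector per step.
    """
    seq = list(prompt_embeddings)
    for _ in range(k):
        h = toy_backbone(seq)
        seq.append(matvec(bridge, h))
    return seq

prompt = [[0.1, -0.2], [0.3, 0.4]]
bridge = [[1.0, 0.0], [0.0, 1.0]]  # identity here; a real bridge would be learned
seq = latent_think(prompt, bridge, k=3)
```

With `k=0` the function returns the prompt embeddings unchanged, which is the property the K=0 gating test relies on.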

The four ideas

NOESIS combines four mechanisms that make latent reasoning practical:

1. Continuous-thought backbone

Reasoning happens in the residual stream. The last hidden state is fed back as the next input embedding via a learned bridge.

2. Adaptive think-budget

A small policy decides how many latent steps to spend on a given problem. Trained with reward = accuracy − λ·steps.
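As a worked instance of that reward, the sketch below assumes accuracy is a 0/1 correctness signal for a single problem. The function name `think_reward` and the default `lam=0.01` are illustrative choices, not values taken from the spec.

```python
def think_reward(correct: bool, steps: int, lam: float = 0.01) -> float:
    """reward = accuracy - lambda * steps, for one problem.

    `correct` plays the role of accuracy (1.0 if right, 0.0 if wrong);
    `lam` is the per-step price of thinking (assumed value).
    """
    return float(correct) - lam * steps

# A correct answer reached in fewer latent steps earns more reward.
short = think_reward(True, steps=5)
long_ = think_reward(True, steps=50)
```

The penalty term is what gives the policy a reason to stop early on easy problems: extra steps only pay off when they change the answer from wrong to right.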

3. Latent verifier

A separate head reads the trajectory and emits a confidence score. Low confidence triggers retry. Also serves as the RL baseline.
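One way the retry behaviour could look is sketched below. It assumes that "retry" means re-running the latent thinking and keeping the highest-confidence attempt; the source does not specify this. `answer_with_retry`, `threshold`, and `max_tries` are hypothetical, and `think` / `verify` are placeholder callables.

```python
def answer_with_retry(think, verify, threshold=0.5, max_tries=3):
    """Re-run latent thinking until the verifier's confidence clears `threshold`.

    Returns the best (trajectory, confidence) pair seen within `max_tries`.
    """
    best = None
    for _ in range(max_tries):
        trajectory = think()
        confidence = verify(trajectory)
        if best is None or confidence > best[1]:
            best = (trajectory, confidence)
        if confidence >= threshold:
            break
    return best

# Toy stand-ins: each attempt is labelled by its index, and the
# "verifier" replays a fixed list of confidence scores.
attempts = []
scores = iter([0.2, 0.4, 0.9, 0.1])

def think():
    attempts.append(len(attempts) + 1)
    return [attempts[-1]]

def verify(trajectory):
    return next(scores)

trajectory, confidence = answer_with_retry(think, verify, threshold=0.5, max_tries=5)
```

In this toy run the first two attempts fall below the threshold, so the loop stops on the third.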

4. Continuous-thought RL

Gaussian perturbation on latent thoughts gives a well-defined trajectory log-probability, unlocking REINFORCE through continuous reasoning.
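To make the log-probability concrete: if each latent thought is sampled as z = μ + σ·ε with ε drawn from a standard normal, the trajectory log-probability is a sum of isotropic Gaussian log-densities. The sketch below assumes a fixed scalar σ > 0 and treats the per-step means as given (in a real model each mean would depend on the earlier sampled thoughts). All function names are hypothetical.

```python
import math
import random

def gaussian_logprob(z, mu, sigma):
    """Log-density of z under an isotropic Gaussian N(mu, sigma^2 I)."""
    d = len(z)
    sq = sum((zi - mi) ** 2 for zi, mi in zip(z, mu))
    return -sq / (2 * sigma ** 2) - 0.5 * d * math.log(2 * math.pi * sigma ** 2)

def perturb(mu, sigma, rng):
    """Sample a latent thought z = mu + sigma * eps, eps ~ N(0, I)."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mu]

def trajectory_logprob(zs, mus, sigma):
    """Sum of per-step log-densities over the whole latent trajectory."""
    return sum(gaussian_logprob(z, mu, sigma) for z, mu in zip(zs, mus))

def reinforce_surrogate(logp, reward, baseline):
    """REINFORCE surrogate loss: -(reward - baseline) * log p(trajectory)."""
    return -(reward - baseline) * logp

rng = random.Random(0)
sigma = 0.1
mus = [[0.0, 0.0], [1.0, -1.0]]          # unperturbed latent thoughts (toy values)
zs = [perturb(mu, sigma, rng) for mu in mus]
logp = trajectory_logprob(zs, mus, sigma)
loss = reinforce_surrogate(logp, reward=1.0, baseline=0.25)
```

The density is only defined for σ > 0; at σ = 0 the latent steps are deterministic, which is the setting the step-by-step gating test uses.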

Current state

Every architectural piece of NOESIS exists in code and is tested: 87 / 87 tests pass.

The two gating tests from spec §4 Phase 1 both pass:

Bit-exact K=0: with zero latent steps, the latent reasoner's output is bit-for-bit identical to the bare backbone's. Per the spec, this catches the most common implementation bug.
Step-by-step matches batched at σ=0: running K latent steps one at a time with the KV cache produces the same hidden states as a single batched forward pass over the equivalent input. This catches KV-cache and position-offset bugs.
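The shape of these two checks can be illustrated on a toy model. The sketch below is not the repo's test code: `backbone`, `StepwiseReasoner`, and `run` are hypothetical, inputs are scalars, the bridge is the identity, and a cached running sum plays the role that the KV cache plays in a real transformer.

```python
import math

def backbone(inputs):
    """Toy causal backbone: hidden state t is tanh of the running mean up to t."""
    hidden, total = [], 0.0
    for t, x in enumerate(inputs, start=1):
        total += x
        hidden.append(math.tanh(total / t))
    return hidden

class StepwiseReasoner:
    """Feeds the last hidden state back as the next input, one step at a time."""

    def __init__(self, prompt):
        self.inputs = list(prompt)
        self.hidden = backbone(self.inputs)
        # Cached running sum, accumulated in the same order as backbone().
        self.total = 0.0
        for x in self.inputs:
            self.total += x

    def step(self):
        nxt = self.hidden[-1]          # identity bridge, no noise (sigma = 0)
        self.inputs.append(nxt)
        self.total += nxt
        self.hidden.append(math.tanh(self.total / len(self.inputs)))

def run(prompt, k):
    reasoner = StepwiseReasoner(prompt)
    for _ in range(k):
        reasoner.step()
    return reasoner

prompt = [0.5, -1.0, 0.25]

# Check 1: with K=0 the reasoner is exactly the bare backbone.
assert run(prompt, 0).hidden == backbone(prompt)

# Check 2: K cached steps match one batched pass over the equivalent input.
r = run(prompt, 4)
assert r.hidden == backbone(r.inputs)
```

Exact equality holds here because the incremental path performs the same floating-point operations in the same order as the batched one. An off-by-one in the position count (the `len(self.inputs)` divisor) would break the second check, which is the kind of position-offset bug that test is meant to catch.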

What is intentionally not here

No fabricated benchmark numbers, no lit notes written from memory, no training runs without compute behind them. The repo's "no fabricated results" rule applies to documentation too.

Phase 2 Coconut warmup curriculum — needs paper verification.
Real datasets (ProsQA, GSM8K, MATH) — not yet downloaded or wired in.
End-to-end RL training loop — the primitives are here, the loop isn't.
The 14 spec-required lit notes — would need to be drafted while reading the actual papers.

Read on

Architecture — how the four pieces fit together, with the spec's math.
Getting started — install, run the tests, walk a working example.
Tests — the 87 tests, broken down by what they actually guarantee.