NOESIS

A language model that thinks in continuous latent space, allocates its own thinking budget per problem, and verifies its latent thoughts before emitting an answer.

NOESIS, from νόησις: pure thought; the intellect's direct apprehension

Implementation status

From-scratch reimplementation of the primitives — tested on CPU (2026-05)

A from-scratch reference implementation lives in architectures/05-noesis/noesis-lm and passes 87 tests on CPU. The stochastic latent loop's REINFORCE score-function gradient is verified against the analytical form and a finite-difference check. This is a small-scale reimplementation of published ideas (Coconut / latent reasoning), not an original architecture and not a trained model: the reasoning, budget, and verifier modules exist and are unit-tested, but there is no end-to-end training loop, no task, and no comparative result yet. The "target marquee result" below is a goal from the spec, not a measured outcome.

The thesis in one paragraph

Reasoning LLMs (o1, R1, the Claude Extended-Thinking family) have shown that test-time compute scaling works: spending more tokens “thinking” reliably improves accuracy on hard problems. The current production form, chain-of-thought (CoT) in token space, has a fundamental inefficiency: most thinking tokens are linguistic glue (transitions, hedges, restatements) rather than load-bearing reasoning steps. Coconut (Hao et al., 2024; arXiv 2412.06769) attacks this directly: instead of decoding each reasoning step to a word token, it feeds the last hidden state back as the next input embedding, so the model stays in latent space. NOESIS aims to make latent reasoning practical by combining four ideas: a continuous-thought backbone, a learned adaptive think-budget, a latent verifier, and a clean continuous-thought RL formulation. The intended result is a reasoning model that thinks in latent space, knows when to stop thinking, verifies its conclusions, and can be RL-trained end-to-end.
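The latent feedback idea can be illustrated with a toy sketch. This is not the noesis-lm code: `make_toy_step` and `latent_rollout` are hypothetical names, and a random tanh layer stands in for a full LM forward pass. The only point it shows is that each step consumes the previous hidden state directly, with no decode-to-token step in between.

```python
import math
import random

def make_toy_step(dim, seed=0):
    """Return a toy stand-in for one LM forward pass: h -> tanh(W h).

    Assumption: a real model would run the full transformer here.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    W = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def step(h):
        return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]

    return step

def latent_rollout(step, h0, num_thoughts):
    """Feed the last hidden state back as the next input, never decoding to tokens."""
    thoughts = []
    h = h0
    for _ in range(num_thoughts):
        h = step(h)          # next input is the previous hidden state itself
        thoughts.append(h)
    return thoughts

step = make_toy_step(dim=4)
thoughts = latent_rollout(step, h0=[1.0, 0.0, 0.0, 0.0], num_thoughts=3)
print(len(thoughts), len(thoughts[-1]))  # 3 4
```

A token-space CoT loop would instead sample a token from each hidden state and re-embed it, discarding whatever information the discrete token cannot carry.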

Architecture in one figure

           prompt embeddings
                  │
                  ▼
        ┌─────────────────────┐
        │ Transformer encoder │
        └─────────┬───────────┘
                  │
                  ▼
        ┌──────────────────────┐
        │ Adaptive K-thoughts  │  policy π_K: how many latent
        │ policy π_K(K | h_0)  │  steps to spend? K ∈ {1,2,4,…}
        └─────────┬────────────┘
                  │  K
                  ▼
        ┌──────────────────────┐
        │ Latent thought loop  │  for k=1..K:
        │ (Coconut + noise)    │    h_k = LM(h_{k-1}) + σ ε
        │                      │    (noise ε enables clean RL)
        └─────────┬────────────┘
                  │
                  ▼
        ┌──────────────────────┐
        │ Latent verifier      │  confidence(h_1,…,h_K) → c
        │ (small MLP)          │  if c < τ: extra thinking round
        └─────────┬────────────┘
                  │
                  ▼
              answer head
                  │
                  ▼
              y_1, y_2, …
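
The control flow in the figure can be sketched in plain Python. Everything here is an illustrative assumption rather than the reference implementation: the budget set `K_CHOICES`, the toy policy logits, the `lm_step` placeholder, the sigmoid "verifier", the `max_rounds` cap, and the choice to re-sample K for each extra thinking round are all hypothetical.

```python
import math
import random

K_CHOICES = (1, 2, 4, 8)  # assumed budget set; the figure only says K in {1,2,4,...}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_budget(h0, rng):
    """Toy pi_K(K | h_0): hypothetical logits driven by the norm of h_0."""
    norm = math.sqrt(sum(x * x for x in h0))
    logits = [-abs(math.log2(k) - norm) for k in K_CHOICES]
    return rng.choices(K_CHOICES, weights=softmax(logits), k=1)[0]

def lm_step(h):
    """Placeholder for one LM forward pass on a latent state."""
    return [math.tanh(x) for x in h]

def thought_loop(h, num_steps, sigma, rng):
    """h_k = LM(h_{k-1}) + sigma * eps, with eps ~ N(0, I)."""
    thoughts = []
    for _ in range(num_steps):
        h = [m + sigma * rng.gauss(0.0, 1.0) for m in lm_step(h)]
        thoughts.append(h)
    return thoughts

def verifier_confidence(thoughts):
    """Toy stand-in for the small MLP: sigmoid of the mean of the last thought."""
    last = thoughts[-1]
    return 1.0 / (1.0 + math.exp(-sum(last) / len(last)))

def think(h0, sigma=0.1, tau=0.5, max_rounds=3, seed=0):
    """Run thinking rounds until the verifier confidence reaches tau."""
    rng = random.Random(seed)
    thoughts, h = [], h0
    for _ in range(max_rounds):
        k = sample_budget(h, rng)                 # how many latent steps to spend
        thoughts += thought_loop(h, k, sigma, rng)
        h = thoughts[-1]
        confidence = verifier_confidence(thoughts)
        if confidence >= tau:                     # c < tau triggers another round
            break
    return thoughts, confidence

thoughts, confidence = think([0.5, -0.2, 0.1])
print(len(thoughts), round(confidence, 3))
```

The returned latent states would then go to the answer head; that stage is omitted because the figure gives no detail about it.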

The reference implementation in Appendix B of the master prompt is the stochastic latent loop with a REINFORCE trajectory_log_prob. The score-function gradient d/dμ log p = ε / σ matches both the analytical form and a properly set-up finite-difference numerical gradient. The σ-gradient comes only from the log-normalization term; getting this wrong was a subtle bug, caught and fixed in v1 of the reference during development.
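
The μ-gradient claim can be checked numerically in a few lines. The sketch below is a scalar illustration, not the project's test: the function names are hypothetical, and it assumes that "properly set up" means the sample h = μ + σ ε is frozen before μ is perturbed, so that only the density's dependence on μ is differentiated.

```python
import math

def gaussian_log_prob(h, mu, sigma):
    """Log-density of N(mu, sigma^2) at a fixed sample h (scalar case)."""
    return (-0.5 * ((h - mu) / sigma) ** 2
            - math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi))

def score_mu_analytic(eps, sigma):
    """d/dmu log p at h = mu + sigma * eps: (h - mu) / sigma^2 = eps / sigma."""
    return eps / sigma

def score_mu_finite_diff(h, mu, sigma, step=1e-5):
    """Central difference in mu with the sample h held fixed."""
    upper = gaussian_log_prob(h, mu + step, sigma)
    lower = gaussian_log_prob(h, mu - step, sigma)
    return (upper - lower) / (2.0 * step)

mu, sigma, eps = 0.3, 0.5, -1.2
h = mu + sigma * eps  # freeze the sample before differentiating
analytic = score_mu_analytic(eps, sigma)
numeric = score_mu_finite_diff(h, mu, sigma)
print(abs(analytic - numeric) < 1e-6)  # True
```

If h were recomputed from the perturbed μ inside the finite difference, the quadratic term would not change at all and the numerical gradient would come out as zero, which is the kind of setup mistake the frozen sample avoids.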

Key contributions

Phased plan

NOESIS's phased implementation plan.

Phase | Deliverable | Status
0: Bootstrap | Repo scaffold; reference stochastic latent loop + REINFORCE log-prob | done
1: Coconut reproduction | Vanilla continuous-thought decoding at fixed K | done
2: Stochastic latent + REINFORCE | Noise-injected latent loop; trajectory log-prob verified vs. finite-diff | done
3: Adaptive K policy | Think-budget policy module (built + tested); not yet trained on a difficulty signal | partial (module only)
4: Latent verifier | Confidence head + calibration metrics; mode controller triggers extra thinking | done (module)
5: RL on math/code | GRPO-style RL through latent thoughts; MATH / HumanEval lift | not started
6: Paper | Marquee plot: ≥ 3× fewer thinking tokens at matched accuracy | not started

Required reading

  1. Hao et al. 2024/2025: Coconut (arXiv 2412.06769 v3), the anchor
  2. Pfau et al. 2024: “Let's Think Dot by Dot” (filler tokens used as null padding can boost reasoning)
  3. Goyal et al. 2024: pause tokens / filler tokens
  4. Zelikman et al. 2024: Quiet-STaR (self-taught reasoning with rationales)
  5. Akyürek et al. 2024: test-time training (TTT) for few-shot reasoning
  6. OpenAI 2024: o1 system card; DeepSeek 2025: R1 (discrete-CoT-with-RL baseline)

Target marquee result

A goal from the spec, not a measured result.

≥ 3× fewer total inference tokens than a discrete CoT-RL baseline at matched MATH accuracy. This requires a training loop and task harness that have not been built.