Getting started

Everything here runs on CPU. There's no model download, no dataset fetch, and no training step that takes longer than a few seconds.

Install

pip install -e ".[dev]"
pytest

Expected output: 87 passed. If any gradient or parity test fails, do not proceed: the RL recipe in later phases would silently train on wrong updates.

Example 1 — latent reasoning with fixed K

The simplest invocation: encode a context, run K latent steps, look at the trajectory and its log-probability.

import torch
from noesis.backbone import TinyTransformer
from noesis.thought import (
    LatentReasoner,
    StochasticLatentLoop,
    trajectory_log_prob,
)

backbone = TinyTransformer(vocab_size=128, d_model=64, n_layers=4, max_seq_len=128)
loop = StochasticLatentLoop(d=64, sigma_init=0.1)
reasoner = LatentReasoner(backbone, loop)

input_ids = torch.randint(0, 128, (2, 10))
out = reasoner.think(input_ids, K=4)           # 4 latent thought steps

# Trajectory log-prob; REINFORCE backprops through this.
log_p = trajectory_log_prob(out.mus, out.epsilons, loop.sigma)   # (B,)

The K=0 path is byte-identical to the bare backbone, by test. Try reasoner.think(input_ids, K=0) and compare to backbone.forward(backbone.embed(input_ids)) — they match exactly.
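
For intuition about the quantity trajectory_log_prob returns, here is a minimal pure-Python sketch. It assumes each latent step is sampled as z_k = mu_k + sigma * eps_k with a single shared scalar sigma; that sampling rule and the function name gaussian_trajectory_log_prob are illustrative assumptions, not the library's implementation, and the sketch ignores how gradients are routed back to the means.

```python
import math

def gaussian_trajectory_log_prob(epsilons, sigma):
    """Log-density of one sampled trajectory, assuming z_k = mu_k + sigma * eps_k.

    epsilons: list of K steps, each a list of d standard-normal draws.
    sigma: shared positive scalar standard deviation (assumed isotropic).
    """
    # Per-dimension normalizer of N(z | mu, sigma^2).
    log_norm = -math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    total = 0.0
    for step in epsilons:
        for eps in step:
            # (z - mu) / sigma == eps, so the quadratic term is -0.5 * eps^2.
            total += -0.5 * eps * eps + log_norm
    return total

# Two steps, three dimensions, all noise draws at zero (the mode of each step).
example = gaussian_trajectory_log_prob([[0.0] * 3, [0.0] * 3], sigma=1.0)
```

Under this assumption the log-probability is a sum over steps and dimensions, which is why a single (B,) tensor can summarize a whole K-step trajectory.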

Example 2 — autoregressive generation with thinking

The full inference protocol from spec §2.1: emit <bot>, run K latent steps in the KV cache (no tokens emitted), emit <eot>, resume language mode.

from noesis.thought import ModeController, generate

# Reserve two vocab IDs for bot and eot. Later phases learn to emit them.
ctl = ModeController(bot_token_id=126, eot_token_id=127, default_K=4)

prompt = torch.tensor([[10, 20, 30, 40]])
out = generate(
    reasoner, ctl, prompt,
    max_new_tokens=16,
    temperature=0.0,
    force_think_at={2},     # force a think block at the 3rd new token (testing)
)
# out: (1, 4 + 16) -- latent thoughts are in the KV cache, NOT in the token sequence.

The force_think_at argument is for testing before the model is trained to emit <bot> on its own; drop it once Phase 2 SFT is run.
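
The control flow of a forced think block can be sketched without the model. The toy loop below is not the generate function from noesis.thought; toy_generate and next_token are hypothetical names, and it assumes that <bot> and <eot> each count toward max_new_tokens, which is one reading of the (1, 4 + 16) output shape above.

```python
BOT, EOT = 126, 127  # reserved IDs, matching the ModeController example

def toy_generate(prompt, max_new_tokens, force_think_at, default_k, next_token):
    """Toy decode loop: language mode interleaved with forced latent blocks.

    Assumption: <bot> and <eot> each consume one slot of max_new_tokens.
    Returns the token list and the number of latent steps that were run.
    """
    tokens = list(prompt)
    latent_steps_run = 0
    new = 0
    while new < max_new_tokens:
        if new in force_think_at and max_new_tokens - new >= 2:
            tokens.append(BOT)
            for _ in range(default_k):
                # A real latent step would update hidden state (the KV cache);
                # here we only count it. No token is appended.
                latent_steps_run += 1
            tokens.append(EOT)
            new += 2
        else:
            tokens.append(next_token(tokens))
            new += 1
    return tokens, latent_steps_run

# Stand-in for the language model: never produces the reserved IDs.
tokens, n_latent = toy_generate(
    [10, 20, 30, 40], 16, {2}, 4, next_token=lambda t: (t[-1] + 1) % 126
)
```

The point of the sketch is that the four latent steps leave no trace in tokens: only the <bot> and <eot> markers bracket the place where thinking happened.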

Example 3 — one Phase 5 step (REINFORCE + KL)

This step composes the pieces that the central derivation in spec §A enables. Every function here has tests gating its gradient correctness.

import copy
from noesis.policy import compute_reinforce_loss, compute_trajectory_kl

# Frozen pre-RL reference (for KL constraint).
reference = copy.deepcopy(reasoner)
for p in reference.parameters():
    p.requires_grad = False

# 1. Sample trajectory under current policy.
out = reasoner.think(input_ids, K=4, deterministic=False)
log_p = trajectory_log_prob(out.mus, out.epsilons, loop.sigma)

# 2. Reward = 1[correct] - lambda*K, supplied by the task harness.
rewards = torch.tensor([1.0, 0.0])

# 3. REINFORCE loss with optional verifier baseline.
loss_rl = compute_reinforce_loss(log_p, rewards, baselines=None, normalize=True)

# 4. KL to the frozen reference (prevents capability collapse).
with torch.no_grad():
    out_ref = reference.think(input_ids, K=4, deterministic=False)
kl = compute_trajectory_kl(out.mus, out_ref.mus, loop.sigma.detach())

# 5. Backward through the combined objective.
beta_kl = 0.01
(loss_rl + beta_kl * kl.mean()).backward()
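
The arithmetic behind steps 2 to 4 can be written out in plain Python. This is a sketch under stated assumptions, not the code in noesis.policy: it assumes normalize=True means standardizing rewards across the batch, that both policies are isotropic Gaussians sharing one sigma, and that means are compared step by step. The function names and the lam default are hypothetical.

```python
import math

def reward(correct, k, lam=0.01):
    """1[correct] - lambda * K: correctness minus a per-step thinking cost."""
    return (1.0 if correct else 0.0) - lam * k

def reinforce_loss(log_probs, rewards, eps=1e-8):
    """Score-function surrogate: -mean(advantage * log_prob).

    Advantages are rewards standardized over the batch (an assumption about
    what normalization does here).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / n

def gaussian_kl_shared_sigma(mus_p, mus_q, sigma):
    """KL(N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I)), summed over steps and dims.

    With equal covariances the KL reduces to ||mu_p - mu_q||^2 / (2 sigma^2).
    """
    sq_dist = sum(
        (a - b) ** 2
        for step_p, step_q in zip(mus_p, mus_q)
        for a, b in zip(step_p, step_q)
    )
    return sq_dist / (2.0 * sigma ** 2)

loss = reinforce_loss(log_probs=[-1.0, -2.0], rewards=[1.0, 0.0])
kl = gaussian_kl_shared_sigma([[0.0, 0.0]], [[1.0, 1.0]], sigma=1.0)
```

Because the KL scales as 1 / sigma^2, a small sigma such as the sigma_init=0.1 used earlier makes even small drifts in the means expensive, so beta_kl and sigma have to be tuned together.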

Example 4 — budget head + verifier

The two remaining architectural pieces. Phase 3 samples budgets from the budget head and trains it; Phase 4 trains the verifier on (trajectory, correctness) pairs.

from noesis.policy import BudgetHead
from noesis.verifier import VerifierHead, verifier_bce_loss

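head = BudgetHead(d_model=64, K_max=8)
# h_bot is not defined by the earlier examples. A random (B, d_model) tensor
# stands in here for the hidden state at the <bot> position (assumption).
h_bot = torch.randn(2, 64)
K, budget_lp = head.sample(h_bot)            # K shape (B,), in [0, 8]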

verifier = VerifierHead(d_model=64, n_layers=2)
trajectory = torch.stack(out.e_projecteds, dim=1)   # (B, K, d)
logits = verifier.logits(trajectory)         # (B,)
v_loss = verifier_bce_loss(logits, correct=torch.tensor([1.0, 0.0]))
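
What the two heads compute can be sketched in plain Python. This assumes the budget head is a categorical distribution over {0, ..., K_max} and that verifier_bce_loss is a mean binary cross-entropy on logits; both are assumptions about the library, and sample_budget and bce_with_logits are hypothetical names.

```python
import math
import random

def sample_budget(logits, rng):
    """Sample K from a categorical over {0, ..., K_max}; return (K, log_prob)."""
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    k = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return k, math.log(probs[k])

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy in the numerically stable form
    max(x, 0) - x * y + log(1 + exp(-|x|))."""
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

rng = random.Random(0)
k, log_prob = sample_budget([0.0] * 9, rng)       # K_max = 8, uniform logits
v = bce_with_logits([0.0, 0.0], [1.0, 0.0])       # uninformed verifier
```

The returned log_prob plays the same role for the budget decision that the trajectory log-probability plays for the latent steps: it is the term a score-function estimator would weight by reward.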

What still needs you

The pieces above compose into a Phase 5 training loop, but you need:

A real reward signal — a correctness oracle for the task (GSM8K answer match, ProsQA label, etc.).
A dataset loader — spec §3 calls for train/data/{prosqa,prontoqa,math,code}_loader.py.
The Coconut warmup curriculum — spec §4 Phase 2; needs paper-verification.
An optimizer, scheduler, gradient accumulation, mixed precision, and checkpointing — standard training infrastructure.
GPU compute — the spec assumes 8×H100 for ~4–5 weeks of work.
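
As a starting point for the first item, a correctness oracle for numeric-answer tasks can be as small as the sketch below. It assumes the reference answer ends with its final number (GSM8K solutions end in "#### <answer>") and uses last-number exact match; extract_final_number, answer_match_reward, and the lam default are hypothetical and not part of the repository.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number in text as a float, or None if there is none."""
    # Drop thousands separators so "1,250" parses as one number.
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None

def answer_match_reward(generated, reference, k, lam=0.01):
    """1[correct] - lambda * K, with last-number exact match as the oracle."""
    pred = extract_final_number(generated)
    gold = extract_final_number(reference)
    correct = pred is not None and gold is not None and abs(pred - gold) < 1e-6
    return (1.0 if correct else 0.0) - lam * k

r_good = answer_match_reward("The answer is 42.", "work shown here #### 42", k=4)
r_bad = answer_match_reward("no idea", "work shown here #### 42", k=0)
```

Last-number matching is deliberately crude: it will misread ranges such as "4-5" and ignores units, so a real harness would need task-specific parsing.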

Repository map

noesis-lm/
├─ noesis/
│  ├─ backbone/        # TinyTransformer + Backbone protocol
│  ├─ thought/         # latent_loop, reasoner, mode_controller
│  ├─ policy/          # budget_head, reinforce
│  └─ verifier/        # verifier_head, calibration
├─ tests/              # 87 tests across 6 files
├─ docs/               # THINK_phase{0,1}.md, ADRs
└─ site/               # this GitHub Pages site