Getting started
Everything here runs on CPU. There's no model download, no dataset fetch, and no training step that takes longer than a few seconds.
Install
pip install -e ".[dev]"
pytest
Expected output: 87 passed. If any of the gradient or parity tests fail, the RL recipe in later phases will silently learn wrong updates, so do not proceed.
Example 1 — latent reasoning with fixed K
The simplest invocation: encode a context, run K latent steps, look at the trajectory and its log-probability.
import torch
from noesis.backbone import TinyTransformer
from noesis.thought import (
    LatentReasoner,
    StochasticLatentLoop,
    trajectory_log_prob,
)
backbone = TinyTransformer(vocab_size=128, d_model=64, n_layers=4, max_seq_len=128)
loop = StochasticLatentLoop(d=64, sigma_init=0.1)
reasoner = LatentReasoner(backbone, loop)
input_ids = torch.randint(0, 128, (2, 10))
out = reasoner.think(input_ids, K=4) # 4 latent thought steps
# Trajectory log-prob; REINFORCE backprops through this.
log_p = trajectory_log_prob(out.mus, out.epsilons, loop.sigma) # (B,)
The K=0 path is byte-identical to the bare backbone, and a test enforces this. Try reasoner.think(input_ids, K=0) and compare the result to backbone.forward(backbone.embed(input_ids)): the two match exactly.
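For intuition about what the trajectory log-probability in Example 1 represents, here is a minimal self-contained sketch. It assumes each latent step draws z_k = mu_k + sigma * eps_k from an isotropic Gaussian, and that out.mus and out.epsilons are lists of K tensors of shape (B, d). Neither assumption is confirmed by the text above, and sketch_trajectory_log_prob is a hypothetical name, not the library function.

```python
import torch

def sketch_trajectory_log_prob(mus, epsilons, sigma):
    """Sum of per-step Gaussian log-densities, one scalar per batch row.

    Assumed shapes (not confirmed by the noesis source): mus and epsilons are
    lists of K tensors of shape (B, d); sigma is a positive scalar tensor.
    """
    mu = torch.stack(mus, dim=1)        # (B, K, d)
    eps = torch.stack(epsilons, dim=1)  # (B, K, d)
    # The sampled latents play the role of actions. Detach them so gradients
    # reach the policy only through mu and sigma (score-function estimator).
    z = (mu + sigma * eps).detach()
    dist = torch.distributions.Normal(mu, sigma)
    return dist.log_prob(z).sum(dim=(1, 2))  # (B,)

torch.manual_seed(0)
mus = [torch.randn(2, 3, requires_grad=True) for _ in range(4)]
epsilons = [torch.randn(2, 3) for _ in range(4)]
sigma = torch.tensor(0.1)

log_p = sketch_trajectory_log_prob(mus, epsilons, sigma)  # shape (2,)
log_p.sum().backward()  # gradients flow into every mu
```

Under these assumptions each coordinate contributes -0.5 * eps^2 - log(sigma) - 0.5 * log(2 * pi), so the value depends on the noise draws and sigma, while the gradient with respect to each mu is eps / sigma.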
Example 2 — autoregressive generation with thinking
The full inference protocol from spec §2.1: emit <bot>, run K latent steps in the KV cache (no tokens emitted), emit <eot>, then resume language mode.
from noesis.thought import ModeController, generate
# Reserve two vocab IDs for bot and eot. Later phases learn to emit them.
ctl = ModeController(bot_token_id=126, eot_token_id=127, default_K=4)
prompt = torch.tensor([[10, 20, 30, 40]])
out = generate(
    reasoner, ctl, prompt,
    max_new_tokens=16,
    temperature=0.0,
    force_think_at={2},  # force a think block at the 3rd new token (testing)
)
# out: (1, 4 + 16) -- latent thoughts are in the KV cache, NOT in the token sequence.
The force_think_at argument is for testing before the model is trained to emit <bot> on its own; drop it once Phase 2 SFT has been run.
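The control flow of Example 2 can be illustrated with a toy state machine in plain Python. This is only a sketch of the protocol, not the noesis generate implementation: toy_generate and its next_token callback are hypothetical, and it assumes that <bot> and <eot> each count toward max_new_tokens.

```python
def toy_generate(prompt, next_token, bot_id, eot_id, K, max_new_tokens,
                 force_think_at=frozenset()):
    """Toy mode controller: returns (token list, number of latent steps run).

    next_token is any callable mapping the current token list to the next id.
    Assumption: <bot> and <eot> both count toward max_new_tokens.
    """
    tokens = list(prompt)
    latent_steps = 0
    new = 0
    while new < max_new_tokens:
        tok = bot_id if new in force_think_at else next_token(tokens)
        tokens.append(tok)
        new += 1
        if tok == bot_id:
            # Latent mode: K steps would extend the KV cache here.
            # Nothing is appended to the visible token sequence.
            latent_steps += K
            if new < max_new_tokens:
                tokens.append(eot_id)  # close the think block
                new += 1
    return tokens, latent_steps

tokens, latent_steps = toy_generate(
    prompt=[10, 20, 30, 40],
    next_token=lambda toks: 1,  # stand-in for greedy decoding
    bot_id=126, eot_id=127, K=4,
    max_new_tokens=16, force_think_at={2},
)
```

The returned sequence has length 4 + 16 regardless of K, which mirrors the point that latent thoughts occupy the cache rather than the token sequence.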
Example 3 — one Phase 5 step (REINFORCE + KL)
This is the composition that the central derivation in spec §A enables. Tests gate the gradient correctness of every function used here.
import copy
from noesis.policy import compute_reinforce_loss, compute_trajectory_kl
# Frozen pre-RL reference (for KL constraint).
reference = copy.deepcopy(reasoner)
for p in reference.parameters():
    p.requires_grad = False
# 1. Sample trajectory under current policy.
out = reasoner.think(input_ids, K=4, deterministic=False)
log_p = trajectory_log_prob(out.mus, out.epsilons, loop.sigma)
# 2. Reward = 1[correct] - lambda*K, supplied by the task harness.
rewards = torch.tensor([1.0, 0.0])
# 3. REINFORCE loss with optional verifier baseline.
loss_rl = compute_reinforce_loss(log_p, rewards, baselines=None, normalize=True)
# 4. KL to the frozen reference (prevents capability collapse).
with torch.no_grad():
    out_ref = reference.think(input_ids, K=4, deterministic=False)
kl = compute_trajectory_kl(out.mus, out_ref.mus, loop.sigma.detach())
# 5. Backward through the combined objective.
beta_kl = 0.01
(loss_rl + beta_kl * kl.mean()).backward()
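To make the two loss terms in Example 3 concrete, the sketch below writes out one plausible form of each in plain PyTorch. Both functions are hypothetical stand-ins, not the noesis implementations. The assumptions are that normalize standardizes the advantages across the batch, and that the policy and the reference are Gaussians sharing the same sigma, in which case the per-step KL reduces to the squared distance between means divided by 2 * sigma^2.

```python
import torch

def sketch_reinforce_loss(log_p, rewards, baselines=None, normalize=True, eps=1e-8):
    """Score-function loss: minus the mean of advantage times log-probability."""
    adv = rewards if baselines is None else rewards - baselines
    if normalize and adv.numel() > 1:
        # Assumed meaning of normalize: standardize advantages over the batch.
        adv = (adv - adv.mean()) / (adv.std() + eps)
    # The advantage is a constant weight; gradients flow only through log_p.
    return -(adv.detach() * log_p).mean()

def sketch_trajectory_kl(mus, ref_mus, sigma):
    """KL between equal-variance Gaussians, summed over steps and dimensions."""
    mu = torch.stack(mus, dim=1)       # (B, K, d)
    ref = torch.stack(ref_mus, dim=1)  # (B, K, d)
    return (mu - ref).pow(2).sum(dim=(1, 2)) / (2 * sigma ** 2)  # (B,)

log_p = torch.tensor([-1.0, -2.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0])
loss_rl = sketch_reinforce_loss(log_p, rewards, normalize=False)

mus = [torch.tensor([[1.0, 0.0], [0.0, 0.0]], requires_grad=True)]
ref_mus = [torch.zeros(2, 2)]
kl = sketch_trajectory_kl(mus, ref_mus, sigma=torch.tensor(0.5))

beta_kl = 0.01
(loss_rl + beta_kl * kl.mean()).backward()
```

With normalization switched off, only the rewarded sample receives a gradient through log_p, and the KL term pulls each mean back toward its reference value.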
Example 4 — budget head + verifier
The two remaining architectural pieces. Phase 3 samples K from the budget head and trains it; Phase 4 trains the verifier on (trajectory, correctness) pairs.
from noesis.policy import BudgetHead
from noesis.verifier import VerifierHead, verifier_bce_loss
head = BudgetHead(d_model=64, K_max=8)
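# h_bot is not defined in this snippet; it is assumed to be the backbone
# hidden state at the <bot> position, shape (B, d_model).
K, budget_lp = head.sample(h_bot) # K shape (B,), in [0, 8]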
verifier = VerifierHead(d_model=64, n_layers=2)
trajectory = torch.stack(out.e_projecteds, dim=1) # (B, K, d)
logits = verifier.logits(trajectory) # (B,)
v_loss = verifier_bce_loss(logits, correct=torch.tensor([1.0, 0.0]))
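As a rough picture of what the two heads in Example 4 could look like internally, the sketch below builds toy versions in plain PyTorch. ToyBudgetHead and ToyVerifierHead are hypothetical: the real BudgetHead and VerifierHead may be structured quite differently (the real verifier takes n_layers, which this mean-pooling version ignores). The random h_bot and trajectory tensors are placeholders for real hidden states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBudgetHead(nn.Module):
    """Categorical policy over the number of latent steps, K in 0..K_max."""
    def __init__(self, d_model, K_max):
        super().__init__()
        self.proj = nn.Linear(d_model, K_max + 1)  # one logit per budget

    def sample(self, h_bot):
        dist = torch.distributions.Categorical(logits=self.proj(h_bot))
        K = dist.sample()            # (B,), integers in [0, K_max]
        return K, dist.log_prob(K)   # log-prob for a policy-gradient update

class ToyVerifierHead(nn.Module):
    """Scores a latent trajectory with one correctness logit per sample."""
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def logits(self, trajectory):
        # Mean-pool over the K latent steps, then score: (B, K, d) -> (B,)
        return self.score(trajectory.mean(dim=1)).squeeze(-1)

torch.manual_seed(0)
h_bot = torch.randn(2, 64)          # placeholder hidden state at <bot>
head = ToyBudgetHead(d_model=64, K_max=8)
K, budget_lp = head.sample(h_bot)

verifier = ToyVerifierHead(d_model=64)
trajectory = torch.randn(2, 4, 64)  # placeholder (B, K, d) trajectory
logits = verifier.logits(trajectory)
v_loss = F.binary_cross_entropy_with_logits(logits, torch.tensor([1.0, 0.0]))
```

Returning the log-probability alongside the sampled K is what lets the budget choice be trained with the same score-function machinery as the latent steps.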
What still needs you
The pieces above compose into a Phase 5 training loop, but you need:
- A real reward signal: a correctness oracle for the task (GSM8K answer match, ProsQA label, etc.).
- A dataset loader: spec §3 calls for train/data/{prosqa,prontoqa,math,code}_loader.py.
- The Coconut warmup curriculum: spec §4 Phase 2; needs verification against the paper.
- An optimizer, scheduler, gradient accumulation, mixed precision, and checkpointing: standard training infrastructure.
- GPU compute: the spec assumes 8×H100 for ~4–5 weeks of work.
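As a starting point for the first item, a reward of the form used in Example 3 (1[correct] - lambda*K) could be sketched as follows. The function name, the exact-string-match oracle, and the default lam=0.01 are all illustrative assumptions; a real task harness would need its own answer normalization.

```python
import torch

def sketch_reward(predictions, targets, K, lam=0.01):
    """Reward = 1[correct] - lam * K, with exact string match as the oracle.

    predictions, targets: lists of answer strings, one per sample.
    K: integer tensor of shape (B,), latent steps used per sample.
    """
    correct = torch.tensor(
        [float(p.strip() == t.strip()) for p, t in zip(predictions, targets)]
    )
    return correct - lam * K.to(correct.dtype)

rewards = sketch_reward(["42", " 7"], ["42", "8"], K=torch.tensor([4, 4]))
```

The lam * K term charges a small cost per latent step, so a correct answer reached with fewer steps earns slightly more reward.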
Repository map
noesis-lm/
├─ noesis/
│ ├─ backbone/ # TinyTransformer + Backbone protocol
│ ├─ thought/ # latent_loop, reasoner, mode_controller
│ ├─ policy/ # budget_head, reinforce
│ └─ verifier/ # verifier_head, calibration
├─ tests/ # 87 tests across 6 files
├─ docs/ # THINK_phase{0,1}.md, ADRs
└─ site/ # this GitHub Pages site