Architecture
Four pieces, one composition. The pure-math primitives compose into the Phase 5 RL training step with every line gated by a test.
The think→act loop
The latent thoughts live only in the KV cache — they never appear in the visible token sequence. Subsequent language-mode tokens attend to them through normal causal attention.
The four components
1. StochasticLatentLoop — the projection
Given hidden state h of shape (B, d):
mu = W_loop(h)
eps = randn_like(mu).detach() # Gaussian noise, detached
e_raw = mu + sigma * eps
e_projected = project_to_manifold(e_raw)
return e_projected, mu, eps
The detachment of eps from autograd is load-bearing —
noise is data, not a parameter-dependent variable. The manifold projection
keeps long latent chains on the token-embedding distribution; without it,
cosine similarity to the nearest token drops sharply by K=16.
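The sampling step above can be sketched without any framework. This is a minimal illustration, not the project's implementation: `project_to_norm` is a hypothetical stand-in for `project_to_manifold` (the source does not say how the projection is defined), and rescaling to a fixed L2 norm is only an assumption chosen to keep the example short.

```python
import math
import random

def project_to_norm(vec, target_norm):
    """Hypothetical stand-in for project_to_manifold: rescale to a fixed L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x * target_norm / norm for x in vec]

def latent_step(mu, sigma, target_norm, rng):
    # eps is drawn independently of mu: it is data, not a function of parameters.
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    e_raw = [m + sigma * z for m, z in zip(mu, eps)]
    e_projected = project_to_norm(e_raw, target_norm)
    return e_projected, mu, eps

rng = random.Random(0)
mu = [0.5, -1.0, 2.0]  # stands in for W_loop(h) on one example
e_projected, mu_out, eps = latent_step(mu, sigma=0.1, target_norm=1.0, rng=rng)
e_norm = math.sqrt(sum(x * x for x in e_projected))
```

Returning `mu` and `eps` alongside the projected embedding mirrors the pseudocode: both are needed later to evaluate the trajectory log-probability.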
2. BudgetHead — categorical π(K | h)
A small 2-layer MLP over the hidden state at the <bot>
position, outputting logits over K ∈ {0..K_max}. Trained via
REINFORCE with reward R = accuracy − λ·K. K=0 is a
valid action — for easy questions, the model can decide to spend
no thinking at all.
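As a rough sketch of the action and reward described above, the snippet below samples K from a categorical distribution and scores it with R = accuracy − λ·K. The logits, the value of λ, and the helper names `sample_budget` and `reward` are illustrative assumptions; the MLP that would produce the logits is omitted.

```python
import math
import random

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sample_budget(logits, rng):
    """Sample K from the categorical pi(K | h); return (K, log pi(K | h))."""
    log_probs = log_softmax(logits)
    u = rng.random()
    cumulative = 0.0
    for k, lp in enumerate(log_probs):
        cumulative += math.exp(lp)
        if u < cumulative:
            return k, lp
    return len(log_probs) - 1, log_probs[-1]  # guard against rounding

def reward(correct, k, lam):
    """R = accuracy - lambda * K for a single example."""
    return (1.0 if correct else 0.0) - lam * k

logits = [2.0, 0.5, 0.0, -1.0, -1.0]  # hypothetical logits over K in {0..4}
k, log_pi = sample_budget(logits, random.Random(0))
r_easy = reward(True, 0, lam=0.05)   # correct with no thinking
r_long = reward(True, 4, lam=0.05)   # correct after 4 latent steps
```

Under this reward a correct answer at K=0 scores strictly higher than a correct answer at any K>0, which is what lets the head learn to skip thinking on easy inputs.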
3. VerifierHead — P[correct | trajectory]
A small TransformerEncoder reading the K-step trajectory with a learned
[CLS] token for pooled readout. Two roles: runtime quality
signal (retry on low confidence) and the principled REINFORCE baseline
(variance reduction by ~10× per standard RL practice).
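The baseline role can be illustrated with a few hand-picked numbers. This is a sketch under stated assumptions: the rewards, verifier confidences, and log-probabilities are hypothetical, and `reinforce_loss` is an illustrative helper, not the project's `compute_reinforce_loss`.

```python
from statistics import pvariance

def reinforce_loss(log_probs, rewards, baselines):
    """Score-function loss -mean((R - b) * log_prob); baselines act as constants."""
    advantages = [r - b for r, b in zip(rewards, baselines)]
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(log_probs)

rewards = [1.0, 0.0, 1.0, 0.0]        # hypothetical 1[correct] with lambda = 0
baselines = [0.8, 0.3, 0.7, 0.2]      # hypothetical verifier confidences
log_probs = [-1.2, -0.7, -2.0, -0.4]  # hypothetical trajectory log-probs

advantages = [r - b for r, b in zip(rewards, baselines)]
var_raw = pvariance(rewards)          # spread of the raw reward multipliers
var_adv = pvariance(advantages)       # spread after subtracting the baseline
loss = reinforce_loss(log_probs, rewards, baselines)
```

When the verifier's confidence tracks the outcome, the multipliers on the score function shrink toward zero, which is the mechanism behind the variance reduction. The size of the effect in practice depends on how well the verifier is calibrated.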
4. REINFORCE + KL math
The pure math pieces from spec §A.2 and §A.4. The trajectory log-probability uses the score-function surrogate trick:
log_p = log_norm + eps_term + surrogate - surrogate.detach()
# where:
# log_norm = -K*d/2 * log(2*pi*sigma^2) # gradient via sigma
# eps_term = -0.5 * sum_k ||eps_k||^2
# surrogate = (eps . mu).sum() / sigma.detach() # gradient via mu
The surrogate contributes nothing to the value of log_p, because it is
cancelled by its own detached copy, but its autograd graph delivers
d log_p / d mu = eps / sigma, which is exactly what REINFORCE needs.
The sigma.detach() in the denominator keeps the surrogate from
contaminating the σ gradient, a silent bug that would make RL learn wrong σ updates.
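The gradient identity can be checked numerically without autograd. The sketch below holds one sampled latent fixed, evaluates the Gaussian log-density directly, and compares a central finite difference in mu against eps / sigma. The dimensions and values are arbitrary, and `gaussian_log_prob` is an illustrative helper, not a function from the codebase.

```python
import math
import random

def gaussian_log_prob(e, mu, sigma):
    """log N(e; mu, sigma^2 I) for one latent step."""
    d = len(mu)
    sq = sum((ei - mi) ** 2 for ei, mi in zip(e, mu))
    return -0.5 * d * math.log(2 * math.pi * sigma ** 2) - sq / (2 * sigma ** 2)

rng = random.Random(0)
sigma = 0.3
mu = [0.2, -0.7, 1.1]
eps = [rng.gauss(0.0, 1.0) for _ in mu]
e = [m + sigma * z for m, z in zip(mu, eps)]  # sampled latent, held fixed

# Central finite difference of log p with respect to each mu_i.
h = 1e-5
fd_grad = []
for i in range(len(mu)):
    up = list(mu)
    up[i] += h
    down = list(mu)
    down[i] -= h
    fd_grad.append(
        (gaussian_log_prob(e, up, sigma) - gaussian_log_prob(e, down, sigma)) / (2 * h)
    )

analytic_grad = [z / sigma for z in eps]  # the gradient the surrogate delivers

# The value at the sampled point reduces to log_norm + eps_term.
log_norm = -0.5 * len(mu) * math.log(2 * math.pi * sigma ** 2)
eps_term = -0.5 * sum(z * z for z in eps)
value = gaussian_log_prob(e, mu, sigma)
```

The value check mirrors the first two terms of log_p for K = 1: once e = mu + sigma·eps is substituted, the quadratic term depends only on eps, which is why a surrogate is needed to recover the dependence on mu.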
Composition: one Phase 5 step
Every primitive composes into the canonical RL step:
K, budget_lp = head.sample(h_bot) # log pi(K | h)
out = reasoner.think(input_ids, K) # K latent steps
traj_lp = trajectory_log_prob(out.mus, out.epsilons, sigma)
# verifier baseline (detached, per spec A.2)
with torch.no_grad():
    baseline = verifier.confidence(trajectory_tensor)
loss_rl = compute_reinforce_loss(
    budget_lp + traj_lp,
    rewards,  # 1[correct] - lambda*K
    baselines=baseline,
    normalize=True,
)
# capability-preserving KL (frozen sigma)
kl = compute_trajectory_kl(out.mus, ref_mus, sigma.detach())
loss = loss_rl + beta_kl * kl.mean() - beta_ent * head.entropy(h_bot).mean()
loss.backward()
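For the KL term, one plausible closed form is the divergence between two isotropic Gaussians that share the same σ, summed over the K latent steps. This is an assumption about what `compute_trajectory_kl` computes, based only on the "frozen sigma" comment; `trajectory_kl` below is an illustrative helper.

```python
def trajectory_kl(mus, ref_mus, sigma):
    """Sum over steps of KL(N(mu_k, s^2 I) || N(ref_k, s^2 I)) = ||mu_k - ref_k||^2 / (2 s^2)."""
    total_sq = 0.0
    for mu, ref in zip(mus, ref_mus):
        total_sq += sum((m - r) ** 2 for m, r in zip(mu, ref))
    return total_sq / (2.0 * sigma ** 2)

mus = [[1.0, 0.0], [0.0, 1.0]]      # K = 2 steps, d = 2 (hypothetical values)
ref_mus = [[0.0, 0.0], [0.0, 0.0]]  # reference policy means (hypothetical)

kl_same = trajectory_kl(mus, mus, sigma=0.5)      # identical policies
kl_diff = trajectory_kl(mus, ref_mus, sigma=0.5)  # squared distance 2, so 2 / (2 * 0.25)
```

With σ shared and held fixed, the KL depends on the policy only through the means, so the penalty pulls `out.mus` toward `ref_mus` without producing a gradient for σ.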
Component & test cross-reference
| Spec section | Module | Test file | Gating test |
|---|---|---|---|
| §2.1 protocol | thought/reasoner.py | test_reasoner.py | K=0 bit-exact; step-by-step matches batched at σ=0 |
| §2.1 generation | thought/mode_controller.py | test_mode_controller.py | latent thoughts change post-eot distribution |
| §2.2 budget | policy/budget_head.py | test_budget_head.py | joint policy gradient flows to head + loop |
| §2.3 verifier | verifier/verifier_head.py | test_verifier.py | overfits a 4-example set in 300 steps |
| §2.4 stochastic latent | thought/latent_loop.py | test_latent_loop.py | d/dμ = ε/σ matches finite-difference |
| §A.2 REINFORCE | policy/reinforce.py | test_reinforce.py | gradient sign flips with reward sign |
| §A.4 KL | policy/reinforce.py | test_reinforce.py | KL identity at zero; closed-form match |
| §5 calibration | verifier/calibration.py | test_verifier.py | Brier 0 / 0.25 / ECE miscalibrated = 0.4 |
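The calibration numbers in the last row can be reproduced with textbook definitions of the Brier score and expected calibration error (ECE). The inputs below are one set of values that yields 0, 0.25, and 0.4; the fixtures the real test uses are not shown in the source, so treat these as assumptions. The function names are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted P[correct] and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins over predicted P[correct]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(p for p, _ in bucket) / len(bucket)
        acc = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(acc - conf)
    return ece

labels = [1, 0, 1, 0]
brier_perfect = brier_score([1.0, 0.0, 1.0, 0.0], labels)  # always right and certain
brier_coin = brier_score([0.5, 0.5, 0.5, 0.5], labels)     # always says 0.5

# Overconfident verifier: claims 0.9 everywhere but is right half the time.
ece_over = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```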