Architecture

Four pieces, one composition. The pure-math primitives compose into the Phase 5 RL training step with every line gated by a test.

The think→act loop

context tokens → backbone (decoder LM) → h_t (hidden state) → W_loop · h_t + b → + σ·ε → e_t (next embed) → backbone continues with e_1..e_K in KV cache → language mode resumes

The latent thoughts live only in the KV cache — they never appear in the visible token sequence. Subsequent language-mode tokens attend to them through normal causal attention.
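As a rough illustration of the loop, the sketch below appends K latent embeddings to the input-embedding sequence while the visible `input_ids` stay untouched. Everything here is an assumption for illustration: a tiny `nn.TransformerEncoder` with a causal mask stands in for the decoder LM, the sequence is recomputed on every step instead of using a real KV cache, and the manifold projection is omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, d, vocab, K, sigma = 2, 5, 16, 50, 3, 0.1  # hypothetical toy sizes

embed = nn.Embedding(vocab, d)
layer = nn.TransformerEncoderLayer(
    d_model=d, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)
backbone = nn.TransformerEncoder(layer, num_layers=1)  # toy stand-in for the decoder LM
w_loop = nn.Linear(d, d)                               # W_loop . h + b


def last_hidden(seq: torch.Tensor) -> torch.Tensor:
    """Run the toy backbone causally and return the hidden state at the last position."""
    n = seq.size(1)
    causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
    return backbone(seq, mask=causal)[:, -1]


input_ids = torch.randint(0, vocab, (B, T))  # visible tokens
seq = embed(input_ids)                       # (B, T, d) input embeddings

for _ in range(K):
    h_t = last_hidden(seq)                           # hidden state h_t
    e_t = w_loop(h_t) + sigma * torch.randn(B, d)    # next latent embedding e_t
    seq = torch.cat([seq, e_t.unsqueeze(1)], dim=1)  # later positions can attend to e_t

# The embedding sequence grew by K positions; the token sequence did not.
print(seq.shape, input_ids.shape)
```

A real implementation would keep the keys and values for e_1..e_K in the cache rather than re-running the whole sequence; the recomputation here only keeps the sketch short.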

The four components

1. `StochasticLatentLoop` — the projection

Given hidden state h of shape (B, d):

mu = W_loop(h)                           # (B, d) mean of the next embedding
eps = torch.randn_like(mu).detach()      # Gaussian noise, detached
e_raw = mu + sigma * eps
e_projected = project_to_manifold(e_raw)
return e_projected, mu, eps

The detachment of eps from autograd is load-bearing — noise is data, not a parameter-dependent variable. The manifold projection keeps long latent chains on the token-embedding distribution; without it, cosine similarity to the nearest token drops sharply by K=16.
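The projection can be sketched as a small module. The source does not define `project_to_manifold`, so the version below (rescaling each vector to a fixed L2 norm) is only a placeholder assumption, as are the class name `StochasticLatentLoopSketch`, the `target_norm` parameter, and the choice to store σ as a learnable scalar.

```python
import torch
import torch.nn as nn


def project_to_manifold(e_raw: torch.Tensor, target_norm: float) -> torch.Tensor:
    # ASSUMPTION: placeholder projection. Rescales each vector to a fixed L2 norm;
    # the actual projection used by the project is not shown in the source.
    norms = e_raw.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return e_raw * (target_norm / norms)


class StochasticLatentLoopSketch(nn.Module):
    def __init__(self, d: int, sigma: float = 0.1, target_norm: float = 1.0):
        super().__init__()
        self.W_loop = nn.Linear(d, d)                   # W_loop . h + b
        self.sigma = nn.Parameter(torch.tensor(sigma))  # assumed learnable scalar
        self.target_norm = target_norm

    def forward(self, h: torch.Tensor):
        mu = self.W_loop(h)                  # (B, d)
        eps = torch.randn_like(mu).detach()  # noise is data: no autograd path
        e_raw = mu + self.sigma * eps
        return project_to_manifold(e_raw, self.target_norm), mu, eps


torch.manual_seed(0)
loop = StochasticLatentLoopSketch(d=8)
e, mu, eps = loop(torch.randn(4, 8))
print(e.shape, eps.requires_grad, mu.requires_grad)
```

Returning `mu` and `eps` alongside the projected embedding is what lets the trajectory log-probability be computed later without re-running the loop.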

2. `BudgetHead` — categorical π(K | h)

A small 2-layer MLP over the hidden state at the <bot> position, outputting logits over K ∈ {0..K_max}. Trained via REINFORCE with reward R = accuracy − λ·K. K=0 is a valid action — for easy questions, the model can decide to spend no thinking at all.
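A minimal sketch of such a head, assuming a GELU hidden layer and `torch.distributions.Categorical` for sampling. The class name `BudgetHeadSketch`, the hidden width, and the values of λ and the correctness labels are hypothetical; only the `sample` and `entropy` method names come from the composition step on this page.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class BudgetHeadSketch(nn.Module):
    def __init__(self, d: int, k_max: int, hidden: int = 64):
        super().__init__()
        # 2-layer MLP producing one logit per budget K in {0..k_max}
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, k_max + 1)
        )

    def dist(self, h_bot: torch.Tensor) -> Categorical:
        return Categorical(logits=self.mlp(h_bot))

    def sample(self, h_bot: torch.Tensor):
        pi = self.dist(h_bot)
        K = pi.sample()            # sampled budget per example, K=0 allowed
        return K, pi.log_prob(K)   # log pi(K | h), differentiable w.r.t. the head

    def entropy(self, h_bot: torch.Tensor) -> torch.Tensor:
        return self.dist(h_bot).entropy()


torch.manual_seed(0)
head = BudgetHeadSketch(d=8, k_max=4)
h_bot = torch.randn(4, 8)                      # hidden state at the <bot> position
K, budget_lp = head.sample(h_bot)

correct = torch.tensor([1.0, 0.0, 1.0, 1.0])   # hypothetical 1[correct] per example
lam = 0.05                                     # hypothetical lambda
rewards = correct - lam * K.float()            # R = accuracy - lambda * K
print(K, rewards)
```

Because the sampled K itself carries no gradient, the head learns only through `budget_lp`, which is why REINFORCE is needed here.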

3. `VerifierHead` — P[correct | trajectory]

A small TransformerEncoder reading the K-step trajectory with a learned [CLS] token for pooled readout. It has two roles: a runtime quality signal (retry on low confidence) and a learned REINFORCE baseline for variance reduction (the ~10× figure is the rule of thumb from standard RL practice).
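One way to picture the [CLS] readout is the sketch below. It assumes a single-layer `nn.TransformerEncoder`, a sigmoid output, and no positional encoding; the class name `VerifierHeadSketch` and all sizes are hypothetical, and only the `confidence` method name is taken from the composition step on this page.

```python
import torch
import torch.nn as nn


class VerifierHeadSketch(nn.Module):
    def __init__(self, d: int, nhead: int = 2, num_layers: int = 1):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d))  # learned [CLS] token
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=nhead, dim_feedforward=4 * d,
            dropout=0.0, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d, 1)

    def confidence(self, trajectory: torch.Tensor) -> torch.Tensor:
        # trajectory: (B, K, d) latent steps; prepend [CLS] and pool from position 0
        cls = self.cls.expand(trajectory.size(0), -1, -1)
        x = torch.cat([cls, trajectory], dim=1)             # (B, K + 1, d)
        pooled = self.encoder(x)[:, 0]                      # (B, d)
        return torch.sigmoid(self.out(pooled)).squeeze(-1)  # P[correct | trajectory]


torch.manual_seed(0)
verifier = VerifierHeadSketch(d=8)
conf = verifier.confidence(torch.randn(3, 5, 8))
print(conf)
```

When the output is used as a baseline it would be computed without gradient, so the RL loss does not train the verifier through the advantage.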

4. REINFORCE + KL math

The pure math pieces from spec §A.2 and §A.4. The trajectory log-probability uses the score-function surrogate trick:

log_p = log_norm + eps_term + surrogate - surrogate.detach()
# where:
#   log_norm  = -K*d/2 * log(2*pi*sigma^2)   # gradient via sigma
#   eps_term  = -0.5 * sum_k ||eps_k||^2
#   surrogate = (eps . mu).sum() / sigma.detach()  # gradient via mu

The surrogate contributes zero to the value of log_p because its detached copy is subtracted, but its autograd graph still delivers d log_p / d mu = eps / sigma, which is exactly what REINFORCE needs. The sigma.detach() in the denominator keeps the surrogate from contaminating the σ gradient; without it, RL would silently learn wrong σ updates.
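The gradient claims can be checked numerically. The sketch below implements the three terms for a single trajectory of shape (K, d) and inspects the gradients autograd produces; the function name `trajectory_log_prob_sketch` and the toy sizes are assumptions, not the project's code.

```python
import math
import torch


def trajectory_log_prob_sketch(mus, epsilons, sigma):
    """log p of one trajectory; mus and epsilons are (K, d), sigma is a scalar tensor."""
    K, d = mus.shape
    log_norm = -0.5 * K * d * torch.log(2 * math.pi * sigma ** 2)  # gradient via sigma
    eps_term = -0.5 * epsilons.pow(2).sum()                        # constant w.r.t. params
    surrogate = (epsilons * mus).sum() / sigma.detach()            # gradient via mu only
    return log_norm + eps_term + surrogate - surrogate.detach()


torch.manual_seed(0)
K, d = 3, 4
mus = torch.randn(K, d, requires_grad=True)
eps = torch.randn(K, d)
sigma = torch.tensor(0.5, requires_grad=True)

log_p = trajectory_log_prob_sketch(mus, eps, sigma)
log_p.backward()

# mus.grad equals eps / sigma; sigma.grad comes from log_norm alone (-K*d/sigma).
print(torch.allclose(mus.grad, eps / 0.5))
print(sigma.grad)
```

Dropping the `.detach()` in the denominator would add a `-(eps . mu) / sigma^2` term to `sigma.grad`, which is the contamination described above.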

Composition: one Phase 5 step

Every primitive composes into the canonical RL step:

K, budget_lp = head.sample(h_bot)              # log pi(K | h)
out = reasoner.think(input_ids, K)             # K latent steps
traj_lp = trajectory_log_prob(out.mus, out.epsilons, sigma)
# verifier baseline (detached, per spec A.2)
with torch.no_grad():
    baseline = verifier.confidence(trajectory_tensor)
loss_rl = compute_reinforce_loss(
    budget_lp + traj_lp,
    rewards,                                   # 1[correct] - lambda*K
    baselines=baseline,
    normalize=True,
)
# capability-preserving KL (frozen sigma)
kl = compute_trajectory_kl(out.mus, ref_mus, sigma.detach())
loss = loss_rl + beta_kl * kl.mean() - beta_ent * head.entropy(h_bot).mean()
loss.backward()
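The two loss helpers called in the step above are not shown on this page. The sketch below is one plausible reading, assuming the advantage is reward minus baseline (detached), normalization is a per-batch standardization, and the KL is the closed form between two isotropic Gaussians sharing σ, i.e. ||μ − μ_ref||² / (2σ²). The `_sketch` names and argument shapes are hypothetical.

```python
import torch


def compute_reinforce_loss_sketch(log_probs, rewards, baselines=None, normalize=False):
    """Score-function loss: -(advantage * log_prob), with the advantage held constant."""
    advantages = rewards if baselines is None else rewards - baselines
    advantages = advantages.detach()
    if normalize and advantages.numel() > 1:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return -(advantages * log_probs).mean()


def compute_trajectory_kl_sketch(mus, ref_mus, sigma):
    """KL(N(mu, s^2 I) || N(mu_ref, s^2 I)) summed over steps and dims; mus are (B, K, d)."""
    return (mus - ref_mus).pow(2).sum(dim=(-2, -1)) / (2 * sigma ** 2)


# Gradient sign flips with reward sign.
rewards = torch.tensor([1.0, 0.5, 0.0, 2.0])
lp_pos = torch.zeros(4, requires_grad=True)
compute_reinforce_loss_sketch(lp_pos, rewards).backward()
lp_neg = torch.zeros(4, requires_grad=True)
compute_reinforce_loss_sketch(lp_neg, -rewards).backward()
print(torch.allclose(lp_pos.grad, -lp_neg.grad))

# KL is zero for identical means and matches the closed form otherwise.
sigma = torch.tensor(0.5)
mus = torch.ones(1, 2, 3)
kl_same = compute_trajectory_kl_sketch(mus, mus, sigma)
kl_diff = compute_trajectory_kl_sketch(mus, torch.zeros(1, 2, 3), sigma)
print(kl_same, kl_diff)
```

The two checks mirror the kinds of properties a test suite for these helpers would gate on: a sign flip under reward negation, and a KL that vanishes when the policy and reference means coincide.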

Component & test cross-reference

Components, spec sections, and the test files that gate them.
Spec section	Module	Test file	Gating test
§2.1 protocol	`thought/reasoner.py`	`test_reasoner.py`	K=0 bit-exact; step-by-step matches batched at σ=0
§2.1 generation	`thought/mode_controller.py`	`test_mode_controller.py`	latent thoughts change post-eot distribution
§2.2 budget	`policy/budget_head.py`	`test_budget_head.py`	joint policy gradient flows to head + loop
§2.3 verifier	`verifier/verifier_head.py`	`test_verifier.py`	overfits a 4-example set in 300 steps
§2.4 stochastic latent	`thought/latent_loop.py`	`test_latent_loop.py`	d/dμ = ε/σ matches finite-difference
§A.2 REINFORCE	`policy/reinforce.py`	`test_reinforce.py`	gradient sign flips with reward sign
§A.4 KL	`policy/reinforce.py`	`test_reinforce.py`	KL identity at zero; closed-form match
§5 calibration	`verifier/calibration.py`	`test_verifier.py`	Brier 0 / 0.25 / ECE miscalibrated = 0.4
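
For readers unfamiliar with the calibration metrics named in the last row, here is a dependency-free sketch of the Brier score and an equal-width-bin expected calibration error (ECE). The function names, the bin count, and the example inputs are assumptions chosen for illustration; they are not taken from `verifier/calibration.py` or its tests.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted P[correct] and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |mean confidence - empirical accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(p for p, _ in bucket) / len(bucket)
        acc = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(conf - acc)
    return ece


perfect = brier_score([1.0, 0.0], [1, 0])              # perfect predictions
coin = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # constant 0.5 predictions
# Constructed miscalibrated case: always 0.9 confident, right half the time.
ece = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(perfect, coin, round(ece, 6))
```

In this constructed case the perfect predictor scores a Brier of 0, the constant-0.5 predictor scores 0.25, and the overconfident predictor has an ECE of about 0.4 (confidence 0.9 against accuracy 0.5).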