Tests
87 tests across 7 files. Each test gates a specific correctness invariant; few are coverage padding.
Why so much testing for a research scaffold?
Because most NOESIS bugs are silent. If the score-function gradient is wrong by a constant factor, training still appears to converge, just to the wrong policy. If the KV cache misaligns positions, latent thoughts don't influence downstream tokens, but the loss still drops. If the σ gradient is contaminated by a missing `.detach()`, RL trains, but the noise level drifts to the wrong value.
Every test below was written against a specific failure mode mentioned in the spec (most of them in spec Appendix B's bug-fix story).
Files
| File | Count | What it verifies |
|---|---|---|
| `test_latent_loop.py` | 13 | Score-function gradient correctness. The math heart. |
| `test_backbone.py` | 5 | Backbone forward pass and KV-cache equivalence. |
| `test_reasoner.py` | 9 | The two Phase 1 gating tests plus orchestration sanity. |
| `test_mode_controller.py` | 9 | Generation protocol: no-think parity, forced think positions, EOS, latent-thoughts-change-downstream. |
| `test_reinforce.py` | 18 | REINFORCE gradient sign, baseline subtraction, KL identities. |
| `test_budget_head.py` | 14 | Categorical policy: shape, log-prob consistency, entropy bounds, joint gradient. |
| `test_verifier.py` | 19 | Verifier architecture, overfit-a-tiny-set, Brier/ECE closed-form. |
| Total | 87 | |
The most load-bearing tests
If any of these fails, do not proceed to later phases.
`test_K_zero_is_bit_exact_with_backbone`
With `K=0`, the `LatentReasoner` must produce logits identical to the bare backbone, checked with `torch.equal`, not `allclose`. The spec explicitly calls this out as the most common implementation bug (`W_loop` initialization leaking state).
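The difference between an exact and a tolerance-based comparison can be illustrated without the model. The sketch below is a stdlib-only analogy, not the repo's test: `backbone_logits` and `leaky_reasoner_logits` are hypothetical stand-ins, list equality plays the role of `torch.equal`, and `math.isclose` plays the role of `allclose`.

```python
import math

def backbone_logits(x):
    # Hypothetical stand-in for the bare backbone: a fixed affine map.
    return [2.0 * v + 1.0 for v in x]

def leaky_reasoner_logits(x, leak=1e-7):
    # Hypothetical K=0 path with a bug: a tiny amount of loop state leaks in.
    return [v + leak for v in backbone_logits(x)]

x = [0.5, -1.25, 3.0]
ref = backbone_logits(x)
out = leaky_reasoner_logits(x)

# Tolerance-based comparison (the analogue of allclose) hides the leak.
passes_tolerance = all(math.isclose(a, b, abs_tol=1e-5) for a, b in zip(ref, out))
# Exact comparison (the analogue of torch.equal) exposes it.
passes_exact = ref == out
```

Here `passes_tolerance` is true while `passes_exact` is false, which is why a tolerance would let a small state leak through a K=0 parity check.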
`test_K_positive_deterministic_matches_batched_forward`
With `K>0` and `σ=0`, running K latent steps step-by-step (with the KV cache) must match a single batched forward over the equivalent concatenated input. `allclose` at `atol=1e-5`. Catches KV cache and position-offset bugs.
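As a rough illustration of the equivalence being tested, the following stdlib-only sketch compares a one-pass causal attention with a step-by-step version that reuses a cache. It is a toy under stated assumptions, not the repo's backbone: there are no learned projections (queries, keys, and values are all the embedded input), `pos` is a made-up positional signal, and `offset_bug` is a hypothetical switch that reproduces a position-offset mistake.

```python
import math

def pos(t):
    # Toy positional signal; any function of the absolute position works here.
    return [math.sin(t), math.cos(t)]

def attend(q, keys, values):
    # Softmax attention of one query over all provided keys/values.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(len(q))]

def batched(xs):
    # Reference: embed every position, then apply causal attention in one pass.
    h = [[a + b for a, b in zip(x, pos(t))] for t, x in enumerate(xs)]
    return [attend(h[t], h[: t + 1], h[: t + 1]) for t in range(len(h))]

def stepwise(xs, offset_bug=False):
    # Incremental: one token at a time, reusing a growing key/value cache.
    cache, outs = [], []
    for x in xs:
        t = 0 if offset_bug else len(cache)  # absolute position = cache length
        h = [a + b for a, b in zip(x, pos(t))]
        cache.append(h)
        outs.append(attend(h, cache, cache))
    return outs

def close(a, b, atol=1e-5):
    # Element-wise tolerance check over two lists of vectors.
    return all(math.isclose(u, v, abs_tol=atol) for r, s in zip(a, b) for u, v in zip(r, s))

xs = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.0]]
ok = close(batched(xs), stepwise(xs))
broken = close(batched(xs), stepwise(xs, offset_bug=True))
```

With correct positions the two paths agree (`ok` is true); with every step embedded at position 0 they diverge (`broken` is false), which is the class of bug this test is meant to catch.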
`test_mu_gradient_matches_finite_difference`
The score-function gradient `d log_p / d μ = ε / σ` must match a finite-difference numerical gradient computed against the value-form `-||e - μ||² / (2 σ²)` with the realized thought `e` held fixed. If this fails, REINFORCE silently learns the wrong policy.
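The check described above can be sketched in plain Python. This is a minimal illustration rather than the repo's test: the dimension, `sigma`, step size `h`, and seed are arbitrary choices, and the value-form drops terms that do not depend on μ.

```python
import math
import random

def log_p_value_form(e, mu, sigma):
    # Value-form from the text: -||e - mu||^2 / (2 sigma^2), constants dropped.
    return -sum((ei - mi) ** 2 for ei, mi in zip(e, mu)) / (2.0 * sigma ** 2)

random.seed(0)
d, sigma, h = 4, 0.3, 1e-5
mu = [random.uniform(-1, 1) for _ in range(d)]
eps = [random.gauss(0.0, 1.0) for _ in range(d)]
e = [m + sigma * z for m, z in zip(mu, eps)]  # realized thought, held fixed

# Analytic score-function gradient: d log_p / d mu = eps / sigma.
analytic = [z / sigma for z in eps]

# Central finite differences, perturbing one coordinate of mu at a time.
numeric = []
for i in range(d):
    up = list(mu)
    dn = list(mu)
    up[i] += h
    dn[i] -= h
    diff = log_p_value_form(e, up, sigma) - log_p_value_form(e, dn, sigma)
    numeric.append(diff / (2 * h))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Because the value-form is quadratic in μ, the central difference is exact up to rounding, so `max_err` should be far below any reasonable tolerance.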
`test_sigma_gradient_matches_finite_difference`
`d log_p / d σ` must come only from the log-normalization term, not the surrogate. The original reference implementation got this wrong: analytical `d/dσ = -788` vs numerical `-320`. Fixed by adding `.detach()` to σ in the surrogate denominator. Without this fix, RL silently trains the wrong noise level.
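The effect of detaching σ can be mimicked without autograd by giving the surrogate denominator and the log-normalization term separate σ arguments and differentiating only the one that is allowed to vary. The sketch below assumes the surrogate has the value-form `-||e - μ||² / (2 σ²)` and the normalization term is `-d · log σ`; the repo's actual surrogate is not shown here, so the sign and size of the contamination in this toy are not meant to reproduce the -788 vs -320 figures. All numeric values are illustrative.

```python
import math

def log_p(e, mu, sigma_denominator, sigma_norm):
    # Surrogate term plus log-normalization term. The two sigma arguments are
    # separated so that "detaching" one can be modelled by holding it fixed.
    d = len(e)
    sq = sum((a - b) ** 2 for a, b in zip(e, mu))
    return -sq / (2.0 * sigma_denominator ** 2) - d * math.log(sigma_norm)

def central_diff(f, x, h=1e-6):
    # Symmetric finite-difference estimate of df/dx.
    return (f(x + h) - f(x - h)) / (2.0 * h)

mu = [0.0, 0.0, 0.0, 0.0]
e = [0.05, -0.2, 0.1, 0.15]  # realized thought, held fixed (illustrative)
sigma, d = 0.1, len(mu)

# Sigma detached in the denominator: only the log-normalization term varies.
detached = central_diff(lambda s: log_p(e, mu, sigma, s), sigma)
# Sigma not detached: the surrogate denominator contributes an extra term.
contaminated = central_diff(lambda s: log_p(e, mu, s, s), sigma)

expected_detached = -d / sigma
extra = sum((a - b) ** 2 for a, b in zip(e, mu)) / sigma ** 3
```

In this toy, `detached` matches `-d / σ`, while `contaminated` differs from it by `||e - μ||² / σ³`, the term that leaks in when σ in the denominator is left attached.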
test_thinking_changes_post_eot_token_distribution
After a forced think block, the tokens emitted by greedy decoding must differ from the tokens that would have been emitted at the same absolute positions without the think block. If they match, the latent thoughts had no effect on downstream attention — either the KV cache is wrong or attention is ignoring the latent positions.
`test_verifier_can_overfit_a_tiny_set`
A 4-example train set with the verifier head must reach `final_loss < 0.1 · initial_loss` in 300 Adam steps, and the predictions must end up on the right side of 0.5. This is a sanity check that gradients are useful, not just nonzero.
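The shape of such an overfit check can be sketched with a hand-written Adam loop. This is a stand-in under stated assumptions, not the repo's verifier: the model is a hypothetical two-weight logistic classifier without a bias, the four examples are made up, and the learning rate and Adam constants are conventional defaults chosen for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(params, data):
    # Mean binary cross-entropy of a two-weight logistic model and its gradient.
    loss, g = 0.0, [0.0, 0.0]
    for x, y in data:
        z = params[0] * x[0] + params[1] * x[1]
        loss += math.log1p(math.exp(-z if y == 1 else z))
        p = sigmoid(z)
        for i in range(2):
            g[i] += (p - y) * x[i]
    n = len(data)
    return loss / n, [gi / n for gi in g]

def adam_fit(data, steps=300, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    initial, _ = loss_and_grad(params, data)
    for t in range(1, steps + 1):
        _, g = loss_and_grad(params, data)
        for i in range(2):
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
            m_hat = m[i] / (1 - b1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - b2 ** t)  # bias-corrected second moment
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    final, _ = loss_and_grad(params, data)
    return params, initial, final

# Four made-up examples, separable along the first feature.
data = [((1.0, 0.5), 1), ((1.0, -0.5), 1), ((-1.0, 0.5), 0), ((-1.0, -0.5), 0)]
params, initial, final = adam_fit(data)
preds = [sigmoid(params[0] * x[0] + params[1] * x[1]) for x, _ in data]
right_side = all((p > 0.5) == (y == 1) for p, (_, y) in zip(preds, data))
```

The two assertions of interest are `final < 0.1 * initial` and `right_side`: the first shows the loss actually collapses, the second that it collapses toward the correct labels rather than a degenerate solution.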
Run them
pytest -v
Total time on a modern CPU: about 4 seconds. Most tests use tiny synthetic tensors. The verifier overfit test is the slowest (300 Adam steps on 4 examples), and it still completes in well under a second.
What testing does NOT cover
The repo's "no fabricated results" rule extends here too:
- No benchmark accuracy. No GSM8K, no MATH, no ProsQA. The tasks aren't wired up.
- No claim that the RL recipe actually improves accuracy — only that the math composes correctly. Training-time validation requires GPU compute and real data.
- No claim about scaling. The backbone here is 32-dim; the spec's medium config is 3B parameters.
- No claim about calibration of the trained verifier — only that the metrics return known closed-form values for known inputs.
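The "known closed-form values for known inputs" style of check mentioned in the last bullet can be sketched as follows. The function names and the equal-width binning scheme are assumptions for illustration; the repo's metric implementations may define ECE differently (for example, with a different number of bins or a confidence-based variant).

```python
def brier_score(probs, labels):
    # Mean squared error between predicted probability and the 0/1 outcome.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    # Equal-width bins over [0, 1]; weighted gap between the observed positive
    # rate and the mean predicted probability in each non-empty bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            ece += len(b) / total * abs(rate - conf)
    return ece

# Constant 0.5 predictions on balanced labels: Brier = 0.25, ECE = 0.
probs, labels = [0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]
brier_half = brier_score(probs, labels)
ece_half = expected_calibration_error(probs, labels)

# Confidently wrong predictions: Brier = 1, ECE = 1.
brier_wrong = brier_score([1.0, 0.0], [0, 1])
ece_wrong = expected_calibration_error([1.0, 0.0], [0, 1])
```

Checks like these confirm the metric code itself, which is a separate question from whether a trained verifier is well calibrated.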