Tests

87 tests across 7 files. Each test gates a specific correctness invariant; few are coverage padding.

Why so much testing for a research scaffold?

Because most NOESIS bugs are silent. If the score-function gradient is off by a constant factor, training still appears to converge, just to the wrong policy. If the KV cache misaligns positions, latent thoughts stop influencing downstream tokens, but the loss still drops. If a missing .detach() contaminates the σ gradient, RL still trains, but the noise level drifts to the wrong value.

Every test below was written against a specific failure mode described in the spec, most of them in the bug-fix story in spec Appendix B.

Files

Test files, counts, and what they verify.
File	Count	What it verifies
`test_latent_loop.py`	13	Score-function gradient correctness. The math heart.
`test_backbone.py`	5	Backbone forward pass and KV-cache equivalence.
`test_reasoner.py`	9	The two Phase 1 gating tests plus orchestration sanity.
`test_mode_controller.py`	9	Generation protocol: no-think parity, forced think positions, EOS, latent-thoughts-change-downstream.
`test_reinforce.py`	18	REINFORCE gradient sign, baseline subtraction, KL identities.
`test_budget_head.py`	14	Categorical policy: shape, log-prob consistency, entropy bounds, joint gradient.
`test_verifier.py`	19	Verifier architecture, overfit-a-tiny-set, Brier/ECE closed-form.
Total	87

The most load-bearing tests

If any of these fails, do not proceed to later phases.

`test_K_zero_is_bit_exact_with_backbone`

With K=0, the LatentReasoner must produce logits identical to the bare backbone, checked with torch.equal rather than allclose. The spec explicitly calls this out as the most common implementation bug (W_loop initialization leaking state).
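Why exact equality rather than a tolerance? A minimal sketch in plain Python (not PyTorch, so it runs without dependencies) suggests the reason: a tiny leak into the hidden state passes a tolerance check but fails an exact one. The functions `backbone` and `latent_reasoner` and the `leak` parameter are hypothetical stand-ins, not the repo's API.

```python
import math

def backbone(x):
    """Hypothetical stand-in for the bare backbone's logits."""
    return [math.tanh(0.7 * v + 0.1) for v in x]

def latent_reasoner(x, num_latent_steps, leak=0.0):
    """Hypothetical wrapper: with zero latent steps it should be a pure pass-through.

    `leak` simulates state leaking in from a badly initialized loop projection.
    """
    hidden = [v + leak for v in x]
    for _ in range(num_latent_steps):
        hidden = [0.5 * h + 0.5 * math.tanh(h) for h in hidden]
    return backbone(hidden)

x = [0.25, -1.5, 3.0]
reference = backbone(x)
clean = latent_reasoner(x, num_latent_steps=0)
leaky = latent_reasoner(x, num_latent_steps=0, leak=1e-9)

# Exact comparison: the analogue of torch.equal.
exact_clean = clean == reference
exact_leaky = leaky == reference

# Tolerance comparison: the analogue of allclose, which would hide the leak.
close_leaky = all(
    math.isclose(a, b, rel_tol=0.0, abs_tol=1e-6) for a, b in zip(leaky, reference)
)

print(exact_clean, exact_leaky, close_leaky)
```

In this toy setting the leaky variant is "close" but not equal, which is the kind of defect a tolerance-based assertion would let through.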

`test_K_positive_deterministic_matches_batched_forward`

With K>0 and σ=0, running K latent steps step-by-step (with the KV cache) must match a single batched forward over the equivalent concatenated input. The comparison uses allclose with atol=1e-5. This catches KV cache and position-offset bugs.
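The shape of this equivalence check can be sketched with a toy single-head causal attention layer in plain Python. Everything here is an illustrative assumption (the weight construction, the additive position signal, the `position_offset` parameter used to simulate a bug); the repo's backbone and cache layout may differ.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add_position(x, t):
    # Toy position signal: a wrong offset t changes every downstream value.
    return [xi + 0.1 * t for xi in x]

def attend(q, keys, values):
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

def batched_forward(xs, Wq, Wk, Wv):
    """Causal attention over the whole sequence: position t sees 0..t."""
    hs = [add_position(x, t) for t, x in enumerate(xs)]
    qs = [matvec(Wq, h) for h in hs]
    ks = [matvec(Wk, h) for h in hs]
    vs = [matvec(Wv, h) for h in hs]
    return [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(hs))]

def cached_forward(xs, Wq, Wk, Wv, position_offset=0):
    """Same computation one position at a time, appending to a KV cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in xs:
        # The position is derived from the cache length; a nonzero offset is a bug.
        h = add_position(x, len(k_cache) + position_offset)
        k_cache.append(matvec(Wk, h))
        v_cache.append(matvec(Wv, h))
        outputs.append(attend(matvec(Wq, h), k_cache, v_cache))
    return outputs

def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

dim, seq_len = 4, 6

def make_matrix(seed):
    return [[math.sin(seed + 3 * i + j) for j in range(dim)] for i in range(dim)]

Wq, Wk, Wv = make_matrix(1), make_matrix(2), make_matrix(3)
xs = [[math.cos(t + 2 * j) for j in range(dim)] for t in range(seq_len)]

full = batched_forward(xs, Wq, Wk, Wv)
good_diff = max_abs_diff(full, cached_forward(xs, Wq, Wk, Wv))
buggy_diff = max_abs_diff(full, cached_forward(xs, Wq, Wk, Wv, position_offset=1))

print(good_diff <= 1e-5, buggy_diff > 1e-5)
```

The correct cached path agrees with the batched path within the tolerance, while the off-by-one position variant does not, which is the class of bug this gate is meant to surface.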

`test_mu_gradient_matches_finite_difference`

The score-function gradient d log_p / d μ = ε / σ must match a finite-difference numerical gradient computed against the value-form -||e - μ||² / (2 σ²) with the realized thought e held fixed. If this fails, REINFORCE silently learns the wrong policy.
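The identity follows from e = μ + σ ε: differentiating -||e - μ||² / (2 σ²) with respect to μ gives (e - μ) / σ² = ε / σ. A minimal finite-difference sketch in plain Python, with a hypothetical helper name and arbitrary small dimensions chosen only for illustration:

```python
import random

def log_p_value_form(e, mu, sigma):
    """Value-form log-density up to terms constant in mu: -||e - mu||^2 / (2 sigma^2)."""
    return -sum((ei - mi) ** 2 for ei, mi in zip(e, mu)) / (2 * sigma ** 2)

rng = random.Random(0)
dim, sigma = 4, 0.5
mu = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Realize the thought once, then hold it fixed while mu is perturbed.
e = [m + sigma * z for m, z in zip(mu, eps)]

# Analytic score-function gradient: eps / sigma.
analytic = [z / sigma for z in eps]

# Central finite differences, one coordinate of mu at a time.
h = 1e-5
numeric = []
for i in range(dim):
    up = list(mu)
    up[i] += h
    down = list(mu)
    down[i] -= h
    diff = log_p_value_form(e, up, sigma) - log_p_value_form(e, down, sigma)
    numeric.append(diff / (2 * h))

max_abs_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_abs_err < 1e-6)
```

A constant-factor error (for example ε / σ² instead of ε / σ) would show up here as a mismatch in every coordinate, even though training with it might still appear to converge.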

`test_sigma_gradient_matches_finite_difference`

d log_p / d σ must come only from the log-normalization term, not the surrogate. The original reference implementation got this wrong: analytical d/dσ = -788 vs numerical -320. Fixed by adding .detach() to σ in the surrogate denominator. Without this fix, RL silently trains the wrong noise level.
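As an illustration of the arithmetic, assume an isotropic Gaussian in d dimensions whose log-normalization term is -d · log σ (constants dropped). Its σ gradient is then -d / σ. With d = 32 (the backbone width mentioned later in this document) and σ = 0.1 (an assumption; the source does not state the value used), this gives -320, which is consistent with the numerical figure above. The sketch below checks only that term by finite differences and does not reproduce the buggy surrogate.

```python
import math

def log_normalizer(sigma, dim):
    """Assumed log-normalization term of an isotropic Gaussian, constants dropped."""
    return -dim * math.log(sigma)

def central_difference(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

dim = 32      # backbone width quoted in this document
sigma = 0.1   # assumed noise level, chosen for illustration only

analytic = -dim / sigma
numeric = central_difference(lambda s: log_normalizer(s, dim), sigma)

print(abs(analytic - numeric) < 1e-3)
```

If the surrogate term also carried a gradient path through σ, the analytic value would pick up an extra contribution and stop matching this number, which is the symptom the test is written to detect.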

`test_thinking_changes_post_eot_token_distribution`

After a forced think block, the tokens emitted by greedy decoding must differ from the tokens that would have been emitted at the same absolute positions without the think block. If they match, the latent thoughts had no effect on downstream attention — either the KV cache is wrong or attention is ignoring the latent positions.

`test_verifier_can_overfit_a_tiny_set`

A 4-example train set with the verifier head must reach final_loss < 0.1 · initial_loss in 300 Adam steps, and the predictions must end up on the right side of 0.5. This is a sanity check that gradients are useful, not just nonzero.
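The same pass criteria can be illustrated on a stand-in model. The sketch below swaps the verifier head for a three-parameter logistic regression and hand-writes Adam in plain Python; the data, learning rate, and model are assumptions made for illustration, and only the step count and the two acceptance conditions come from the description above.

```python
import math

def predict(params, x):
    w1, w2, b = params
    return 1.0 / (1.0 + math.exp(-(w1 * x[0] + w2 * x[1] + b)))

def bce_loss(params, data):
    tiny = 1e-12  # clamp to keep log() finite
    total = 0.0
    for x, y in data:
        p = min(max(predict(params, x), tiny), 1.0 - tiny)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def bce_grad(params, data):
    grad = [0.0, 0.0, 0.0]
    for x, y in data:
        residual = predict(params, x) - y
        grad[0] += residual * x[0]
        grad[1] += residual * x[1]
        grad[2] += residual
    return [g / len(data) for g in grad]

# Four linearly separable examples (hypothetical data).
data = [((1.0, 1.0), 1), ((1.0, 0.5), 1), ((-1.0, -1.0), 0), ((-0.5, -1.0), 0)]

params = [0.0, 0.0, 0.0]
m = [0.0, 0.0, 0.0]
v = [0.0, 0.0, 0.0]
lr, beta1, beta2, adam_eps = 0.1, 0.9, 0.999, 1e-8

initial_loss = bce_loss(params, data)
for step in range(1, 301):  # 300 Adam steps
    grad = bce_grad(params, data)
    for i, g in enumerate(grad):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** step)
        v_hat = v[i] / (1 - beta2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + adam_eps)
final_loss = bce_loss(params, data)

loss_ok = final_loss < 0.1 * initial_loss
side_ok = all((predict(params, x) > 0.5) == (y == 1) for x, y in data)
print(loss_ok, side_ok)
```

Checking both conditions matters: a loss that shrinks while a prediction sits on the wrong side of 0.5 would indicate that the gradients are nonzero but not useful.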

Run them

pytest -v

Total time on a modern CPU: about 4 seconds. Most tests are tiny synthetic tensors. The verifier overfit test is the slowest (300 Adam steps on 4 examples), and it still completes in well under a second.

What testing does NOT cover

The repo's "no fabricated results" rule extends here too:

No benchmark accuracy. No GSM8K, no MATH, no ProsQA. The tasks aren't wired up.
No claim that the RL recipe actually improves accuracy — only that the math composes correctly. Training-time validation requires GPU compute and real data.
No claim about scaling. The backbone here is 32-dim; the spec's medium config is 3B parameters.
No claim about calibration of the trained verifier — only that the metrics return known closed-form values for known inputs.
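To make "known closed-form values for known inputs" concrete, here is a sketch of the kind of check that statement describes. The function names and the equal-width 10-bin ECE are assumptions; the repo's metric implementations and binning scheme may differ.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, num_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |mean label - mean prob|."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        mean_label = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_label - mean_prob)
    return ece

labels = [1, 0, 1, 0]

# Always predicting 0.5 on balanced labels: Brier is 0.25 and ECE is 0.
coin_brier = brier_score([0.5] * 4, labels)
coin_ece = expected_calibration_error([0.5] * 4, labels)

# Predicting the label itself: both metrics are 0.
perfect_brier = brier_score([1.0, 0.0, 1.0, 0.0], labels)
perfect_ece = expected_calibration_error([1.0, 0.0, 1.0, 0.0], labels)

# Always predicting 1.0 on balanced labels: Brier is 0.5 and ECE is 0.5.
over_brier = brier_score([1.0] * 4, labels)
over_ece = expected_calibration_error([1.0] * 4, labels)

print(coin_brier, coin_ece, perfect_brier, perfect_ece, over_brier, over_ece)
```

Checks like these confirm that the metric code computes what its formula says; they say nothing about whether a trained verifier is actually calibrated.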