Tests

87 tests across 7 files. Almost every test gates a specific correctness invariant; very little is coverage padding.

Why so much testing for a research scaffold?

Because most NOESIS bugs are silent. If the score-function gradient is wrong by a constant factor, training appears to converge, just to the wrong policy. If the KV cache misaligns positions, latent thoughts don't influence downstream tokens, but the loss still drops. If the σ gradient is contaminated by a missing .detach(), RL trains, but the noise level drifts to the wrong value.

Every test below was written against a specific failure mode described in the spec, most of them in the bug-fix story in spec Appendix B.

Files

Test files, counts, and what they verify.
| File | Count | What it verifies |
|---|---|---|
| test_latent_loop.py | 13 | Score-function gradient correctness. The math heart. |
| test_backbone.py | 5 | Backbone forward pass and KV-cache equivalence. |
| test_reasoner.py | 9 | The two Phase 1 gating tests plus orchestration sanity. |
| test_mode_controller.py | 9 | Generation protocol: no-think parity, forced think positions, EOS, latent-thoughts-change-downstream. |
| test_reinforce.py | 18 | REINFORCE gradient sign, baseline subtraction, KL identities. |
| test_budget_head.py | 14 | Categorical policy: shape, log-prob consistency, entropy bounds, joint gradient. |
| test_verifier.py | 19 | Verifier architecture, overfit-a-tiny-set, Brier/ECE closed-form. |
| Total | 87 | |
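As an illustration of what a "closed-form" metric check in test_verifier.py might look like, here is a minimal sketch in plain Python. The function names, the 10 equal-width bins, and the binary-probability form of ECE are assumptions made for this example, not the repo's actual definitions. The inputs are chosen so that both metrics can be worked out by hand.

```python
def brier_score(probs, labels):
    # Mean squared error between predicted probability and the 0/1 label.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    # Assumed ECE variant: equal-width bins over [0, 1]; each bin contributes
    # (bin size / total) * |mean predicted prob - empirical positive rate|.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        mean_y = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(mean_p - mean_y)
    return ece

# Every prediction is off by exactly 0.2, so by hand:
# Brier = 0.2^2 = 0.04 and ECE = 0.5 * 0.2 + 0.5 * 0.2 = 0.2.
probs = [0.8, 0.8, 0.2, 0.2]
labels = [1, 1, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

A check of this kind compares the implementation against numbers derived on paper, so a wrong bin weighting or a missing square shows up as a plain mismatch.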

The most load-bearing tests

If any of these fails, do not proceed to later phases.

test_K_zero_is_bit_exact_with_backbone

With K=0, the LatentReasoner must produce logits identical to the bare backbone, checked with torch.equal rather than allclose. The spec explicitly calls this out as the most common implementation bug (W_loop initialization leaking state).
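The difference between a bit-exact check and a tolerance check can be sketched without torch. The snippet below uses plain Python floats in place of tensors; `backbone`, `reasoner_k0`, and the `leak` parameter are hypothetical stand-ins, with `leak` playing the role of a small contribution leaking from loop state when K=0.

```python
def backbone(xs):
    # Hypothetical stand-in for the bare backbone: any deterministic map.
    return [2.0 * x + 1.0 for x in xs]

def reasoner_k0(xs, leak=0.0):
    # Hypothetical K=0 path. A correct implementation adds nothing;
    # `leak` models a tiny contribution from state that should be inactive.
    return [y + leak for y in backbone(xs)]

def bit_exact(a, b):
    # Analogue of torch.equal: same length and every element identical.
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

def all_close(a, b, atol=1e-5):
    # Analogue of an allclose check using absolute tolerance only.
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))

xs = [0.25, -1.5, 3.0]
ref = backbone(xs)
clean = reasoner_k0(xs)
leaky = reasoner_k0(xs, leak=1e-9)

print(bit_exact(ref, clean))   # the clean path is identical
print(all_close(ref, leaky))   # a tolerance check accepts the leak
print(bit_exact(ref, leaky))   # the exact check rejects it
```

In this toy, a 1e-9 leak passes a tolerance check at 1e-5 and is caught only by the exact comparison, which is the reason to prefer the stricter check for the K=0 case.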

test_K_positive_deterministic_matches_batched_forward

With K>0 and σ=0, running K latent steps step-by-step (with the KV cache) must match a single batched forward over the equivalent concatenated input, compared with allclose at atol=1e-5. This catches KV cache and position-offset bugs.
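The shape of this equivalence check can be shown with a toy model in plain Python. Everything here is an assumption made for illustration: `attend` is a made-up causal mixing rule whose weights depend on absolute positions, the latent values are fixed numbers (standing in for the deterministic σ=0 case), and `offset_bug` simulates restarting the position counter for the latent block.

```python
def attend(values, positions, q_pos):
    # Toy causal "attention": weight decays with absolute-position distance.
    weights = [1.0 / (1.0 + abs(q_pos - p)) for p in positions]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def batched_forward(xs):
    # One pass over the concatenated input; position t sees positions 0..t.
    return [attend(xs[: t + 1], list(range(t + 1)), t) for t in range(len(xs))]

def cached_forward(prompt, latents, offset_bug=False):
    # Step-by-step pass that appends (value, position) pairs to a cache.
    cache_v, cache_p, outs = [], [], []
    for t, x in enumerate(prompt):
        cache_v.append(x)
        cache_p.append(t)
        outs.append(attend(cache_v, cache_p, t))
    for k, x in enumerate(latents):
        # Correct: latent step k sits at absolute position len(prompt) + k.
        # Bug: the position counter restarts at 0 for the latent block.
        pos = k if offset_bug else len(prompt) + k
        cache_v.append(x)
        cache_p.append(pos)
        outs.append(attend(cache_v, cache_p, pos))
    return outs

def all_close(a, b, atol=1e-5):
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))

prompt, latents = [1.0, 2.0, 3.0], [4.0, 5.0]
reference = batched_forward(prompt + latents)
stepwise = cached_forward(prompt, latents)
buggy = cached_forward(prompt, latents, offset_bug=True)
```

Here `stepwise` agrees with `reference`, while `buggy` diverges at the first latent position, which is the kind of mismatch the real test is meant to expose.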

test_mu_gradient_matches_finite_difference

The score-function gradient d log_p / d μ = ε / σ (where the realized thought is e = μ + σ ε) must match a finite-difference numerical gradient computed against the value-form -||e - μ||² / (2 σ²) with e held fixed. If this fails, REINFORCE silently learns the wrong policy.
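The check can be reproduced on paper-sized inputs. The sketch below uses plain Python rather than torch, with made-up values for μ, ε, and σ, and assumes the realized thought is e = μ + σ ε. It compares the analytic gradient ε / σ against a central finite difference of the value form, and also shows a constant-factor bug (dropping the 1/σ) failing the same comparison.

```python
def log_p_value(e, mu, sigma):
    # Value form from the text: -||e - mu||^2 / (2 sigma^2).
    return -sum((ei - mi) ** 2 for ei, mi in zip(e, mu)) / (2.0 * sigma ** 2)

# Illustrative values only; not taken from the repo.
mu = [0.1, -0.4, 0.9, 0.0]
eps = [0.5, -1.2, 0.3, 2.0]
sigma = 0.7
e = [m + sigma * z for m, z in zip(mu, eps)]  # realized thought, held fixed

analytic = [z / sigma for z in eps]  # claimed gradient: eps / sigma
missing_factor = list(eps)           # constant-factor bug: forgot 1 / sigma

# Central finite difference in each coordinate of mu, with e unchanged.
h = 1e-5
numeric = []
for i in range(len(mu)):
    up = list(mu)
    up[i] += h
    down = list(mu)
    down[i] -= h
    numeric.append((log_p_value(e, up, sigma) - log_p_value(e, down, sigma)) / (2 * h))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
max_err_buggy = max(abs(b - n) for b, n in zip(missing_factor, numeric))
```

Because the value form is quadratic in μ, the central difference is exact up to rounding, so `max_err` is tiny while `max_err_buggy` is of order one.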

test_sigma_gradient_matches_finite_difference

d log_p / d σ must come only from the log-normalization term, not the surrogate. The original reference implementation got this wrong: analytical d/dσ = -788 vs numerical -320. Fixed by adding .detach() to σ in the surrogate denominator. Without this fix, RL silently trains the wrong noise level.
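One way to read this requirement is sketched below in plain Python. It rests on two assumptions the text does not state: that the numerical reference perturbs σ with ε held fixed, and that the surrogate has the form -||e - μ||² / (2 σ²) with e treated as a constant. Under those assumptions the quadratic term is constant in σ, so the only legitimate σ gradient is -D/σ from the log-normalization term, and differentiating through the surrogate denominator adds a spurious term. The -788 and -320 figures come from the reference implementation and are not reproduced here; the sign and size of the contamination depend on the actual surrogate.

```python
import math

def log_p_reparam(eps, sigma):
    # Assumed reference value: Gaussian log-density at e = mu + sigma * eps
    # with eps held fixed, so the quadratic term does not depend on sigma.
    d = len(eps)
    quad = -0.5 * sum(z * z for z in eps)
    log_norm = -d * math.log(sigma) - 0.5 * d * math.log(2.0 * math.pi)
    return quad + log_norm

# Illustrative values only; not taken from the repo.
eps = [0.5, -1.2, 0.3, 2.0]
sigma = 0.7
d = len(eps)

# Central finite difference in sigma.
h = 1e-5
numeric = (log_p_reparam(eps, sigma + h) - log_p_reparam(eps, sigma - h)) / (2 * h)

# Gradient from the log-normalization term alone.
from_log_norm = -d / sigma

# Extra term picked up if sigma in the surrogate denominator is not detached:
# d/dsigma of -||e - mu||^2 / (2 sigma^2) with e fixed is ||e - mu||^2 / sigma^3,
# which equals ||eps||^2 / sigma when e - mu = sigma * eps.
spurious = sum(z * z for z in eps) / sigma
contaminated = from_log_norm + spurious
```

In this toy setting `numeric` agrees with `from_log_norm` and disagrees with `contaminated` by a wide margin, which mirrors the kind of analytic-versus-numerical gap the test is designed to flag.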

test_thinking_changes_post_eot_token_distribution

After a forced think block, the tokens emitted by greedy decoding must differ from the tokens that would have been emitted at the same absolute positions without the think block. If they match, the latent thoughts had no effect on downstream attention — either the KV cache is wrong or attention is ignoring the latent positions.

test_verifier_can_overfit_a_tiny_set

Trained on a 4-example set, the verifier head must reach final_loss < 0.1 · initial_loss in 300 Adam steps, and its predictions must end up on the correct side of 0.5. This is a sanity check that the gradients are useful, not just nonzero.
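The structure of an overfit check like this can be sketched in plain Python. A two-weight logistic regression stands in for the verifier head, Adam is written out by hand with its commonly used defaults, and the data, learning rate, and function names are all assumptions made for this example rather than the repo's setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def bce_and_grad(w, xs, ys):
    # Mean binary cross-entropy and its gradient with respect to w.
    n = len(xs)
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = predict(w, x)
        loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi / n
    return loss, grad

def train(xs, ys, steps=300, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Hand-written Adam with bias correction.
    w = [0.0] * len(xs[0])
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    initial_loss, _ = bce_and_grad(w, xs, ys)
    for t in range(1, steps + 1):
        _, g = bce_and_grad(w, xs, ys)
        for i in range(len(w)):
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    final_loss, _ = bce_and_grad(w, xs, ys)
    return w, initial_loss, final_loss

# Four linearly separable examples (illustrative data).
xs = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
ys = [1, 1, 0, 0]
w, initial_loss, final_loss = train(xs, ys)
right_side = all((predict(w, x) > 0.5) == (y == 1) for x, y in zip(xs, ys))
```

The two conditions mirror the test's criteria: the loss must fall by at least a factor of ten, and every prediction must land on the correct side of 0.5, so gradients that are nonzero but point nowhere useful would fail.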

Run them

pytest -v

Total time on a modern CPU: about 4 seconds. Most tests are tiny synthetic tensors. The verifier overfit test is the slowest (300 Adam steps on 4 examples), and it still completes in well under a second.

What testing does NOT cover

The repo's "no fabricated results" rule extends here too: