Results

Concrete numbers from the project's CPU-bounded test suite. All numbers below are reproducible by running `python -m pytest tests/ -v -s` from `cascade-lm/`. The full suite has 55 passing tests and 2 phased skips, and completes in under 10 seconds on a CPU.

Scope of these results

The numbers below verify correctness wiring: that block-causal attention masks correctly, that the cache reproduces full-forward outputs to near bit-identity, that training propagates gradients and converges, and that REINFORCE can learn a difficulty-conditional policy. They do not include real-corpus benchmark numbers (MMLU, HumanEval, throughput), which require the GPU-bounded sub-phases of Phases 1, 3, and 4. See Phases.

Reference parity (correctness oracle)

The reference attention/cache in `cascade/attention_reference.py` is the correctness oracle for the production multi-head + RoPE path. Three structural properties are verified, all at fp64:

Reference parity numbers from `python cascade/attention_reference.py`.
Property	Max diff (fp64)	Spec	Status
Block-causal mask matches the published BD3-LM construction	—	Structural equality on an explicit (8, 8) example	pass
`block_size = 1` reproduces standard causal attention	0.00 × 10⁰	strictly bit-identical (mask reduces to `tril`)	pass
Cache-reuse path equals full forward (single-layer, no RoPE)	5.55 × 10⁻¹⁷	< 10⁻¹⁰	pass

The `block_size = 1` bit-identity is meaningful: it shows that CASCADE's block-causal construction is a strict generalization of AR attention, not a related-but-different architecture. If a production change breaks this test, it has broken something fundamental.
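The reduction to `tril` can be illustrated with a minimal pure-Python sketch. It assumes the simplest block-causal rule (position i may attend to position j when j's block index is at most i's block index); the function names are hypothetical and are not the ones in `cascade/attention_reference.py`.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed rule: i sees j when j's block index is <= i's block index.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len: int) -> list[list[bool]]:
    """Standard lower-triangular (tril) mask used by autoregressive attention."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# block_size = 1 puts every token in its own block, so the rule collapses to j <= i.
assert block_causal_mask(8, 1) == causal_mask(8)

# With block_size = 4, each token also sees the later tokens of its own block.
for row in block_causal_mask(8, 4):
    print("".join("1" if allowed else "." for allowed in row))
```

With integer division by 1 the block index equals the position, which is why the equality is exact rather than approximate.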

Multi-layer cache equivalence (the headline Phase 2 result)

The central correctness property of CASCADE: a model can generate block b via two paths and get the same answer.

Path A: full forward over the entire sequence of 4 blocks at once.
Path B: commit blocks 0–2 to the cache via a clean forward over just the prefix; then call `forward_block_with_cache` for block 3, reading from the cache.

The logits at block 3 must match to fp64 precision. With three layers, four heads, RoPE enabled:

Multi-layer cache-vs-no-cache equivalence on the full production stack.
Configuration	Max diff (fp64)	Spec	Status
3 layers, 4 heads, RoPE on; 3 committed blocks; final-block logits	1.57 × 10⁻¹⁵	< 10⁻¹⁰	pass
Zero-prefix forward (block 0, empty cache) equals standard `forward()`	0.00 × 10⁰	strictly bit-identical	pass
No future information leak (perturb future-block token, past outputs unchanged)	< 10⁻¹²	fp64 bit-near-identity at past positions	pass

The 1.57 × 10⁻¹⁵ result is the most important number in this repo: it shows that the `BlockCache` reuse pattern is correct on the full multi-head + RoPE stack at the precision floor of fp64, not just on the simplified single-layer reference.
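The two-path comparison and the future-leak check can be sketched on a toy scale. This is a deliberately simplified stand-in, not the production stack: one layer, one head, no RoPE, no projections (queries, keys, and values are the raw inputs), and all names and sizes are hypothetical. Python floats are fp64, so the tolerance matches the spec in the table.

```python
import math
import random

BLOCK = 2      # toy block size (hypothetical)
N_BLOCKS = 4   # four blocks, as in the Path A / Path B description
DIM = 4        # toy head dimension (hypothetical)


def attend(query, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(DIM)]


def full_forward(xs):
    """Path A: each position attends to every position in its own or earlier blocks."""
    outputs = []
    for i in range(len(xs)):
        visible = (i // BLOCK + 1) * BLOCK  # index just past the end of i's block
        outputs.append(attend(xs[i], xs[:visible], xs[:visible]))
    return outputs


def cached_forward(cache_keys, cache_values, block_xs):
    """Path B: the new block attends to the cached prefix K/V plus its own K/V."""
    keys = cache_keys + block_xs
    values = cache_values + block_xs
    return [attend(x, keys, values) for x in block_xs]


def max_abs_diff(rows_a, rows_b):
    return max(abs(a - b) for ra, rb in zip(rows_a, rows_b) for a, b in zip(ra, rb))


random.seed(0)
xs = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BLOCK * N_BLOCKS)]

# Cache equivalence: block 3 outputs from the full forward vs. the cached path.
prefix, last_block = xs[:-BLOCK], xs[-BLOCK:]
max_diff = max_abs_diff(full_forward(xs)[-BLOCK:], cached_forward(prefix, prefix, last_block))
assert max_diff < 1e-10

# No future leak: perturbing a token in block 3 leaves outputs of blocks 0-2 unchanged.
perturbed = [row[:] for row in xs]
perturbed[-1][0] += 1.0
leak = max_abs_diff(full_forward(xs)[:-BLOCK], full_forward(perturbed)[:-BLOCK])
assert leak == 0.0
```

In a single layer the cached keys and values depend only on the inputs, so equality is nearly trivial; the multi-layer test is harder because each layer's cache must hold activations computed under the same mask as the full forward.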

End-to-end training (Phase 1 smoke test)

Train a 139k-parameter nano CASCADE to memorize a single random batch. This verifies the full wiring: `corruption.corrupt_batch` → `CascadeLM.forward` → `losses.masked_diffusion_loss` → `loss.backward` → AdamW step.

Setup: vocab 128, d = 64, 2 layers, 4 heads, block size B = 8, batch size 4, sequence length 32. The mask ratio is sampled as m ~ U[0.2, 1] (a higher floor than production, so that 1/m stays bounded for the test).
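The corruption step can be sketched as follows, assuming it samples one mask ratio m per sequence and masks each token independently with probability m. The function name, the mask token id, and the return shape are hypothetical; the actual `corruption.corrupt_batch` signature is not shown in this section.

```python
import random

MASK_ID = -1  # hypothetical mask token id


def corrupt_sequence(tokens, m_min, rng):
    """Sample m ~ U[m_min, 1] and mask each token independently with probability m."""
    m = rng.uniform(m_min, 1.0)
    is_masked = [rng.random() < m for _ in tokens]
    noisy = [MASK_ID if hit else tok for tok, hit in zip(tokens, is_masked)]
    return noisy, is_masked, m


rng = random.Random(0)
# A long sequence keeps the empirical masked fraction close to m.
tokens = [rng.randrange(128) for _ in range(20_000)]
noisy, is_masked, m = corrupt_sequence(tokens, m_min=0.2, rng=rng)
masked_fraction = sum(is_masked) / len(tokens)
print(f"m = {m:.3f}, masked fraction = {masked_fraction:.3f}")
```

The statistical check in `tests/test_corruption.py` is described as verifying that the masked fraction is approximately m; the comparison above is the same idea on a single long sequence.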

Smoke-train loss trajectory on the memorizable batch.
Statistic	Value
Initial mean loss (first 10 steps)	9.5181
Final mean loss (last 30 steps)	1.2183
Reduction factor	7.81 ×
Spec floor	≥ 2.5 × reduction; final < 4.5
Status	pass

An accompanying test verifies the loss is finite even when m is at the floor (m_min = 10⁻³, so 1/m ≈ 1000): no NaNs in loss or gradients.
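A minimal sketch of a 1/m-weighted masked loss shows why a small m is a numerical concern but not a source of NaNs by itself. The form used here (mean cross-entropy over masked positions, scaled by 1/m) is an assumption chosen to agree with the identity listed for `tests/test_losses.py`, uniform logits → loss = (1/m) · log(V); the real `losses.masked_diffusion_loss` signature may differ.

```python
import math


def masked_diffusion_loss(logits, targets, is_masked, m):
    """Mean cross-entropy over masked positions, scaled by 1/m (assumed form)."""
    total, count = 0.0, 0
    for row, target, hit in zip(logits, targets, is_masked):
        if not hit:
            continue
        top = max(row)
        log_z = top + math.log(sum(math.exp(v - top) for v in row))  # stable log-sum-exp
        total += log_z - row[target]  # equals -log softmax(row)[target]
        count += 1
    return (total / max(count, 1)) / m


V = 128
logits = [[0.0] * V for _ in range(4)]  # uniform logits over the vocabulary
targets = [3, 17, 42, 99]
is_masked = [True, False, True, True]

for m in (1.0, 0.2, 1e-3):
    loss = masked_diffusion_loss(logits, targets, is_masked, m)
    assert math.isfinite(loss)
    assert math.isclose(loss, math.log(V) / m, rel_tol=1e-12)
```

At m = 10⁻³ the loss is about 1000 · log(128), large but finite, which is the behavior the low-m guard test is described as checking.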

Monotone-difficulty diagnostic (Phase 3 headline)

Per Appendix A.3 of the master prompt, this is the #1 diagnostic for whether the adaptive-K head works at all. Setup:

Synthetic task: 5 difficulty buckets d ∈ {0, 1, 2, 3, 4}, each demanding a different K.
Input h_i carries a one-hot difficulty signal at dim d_i plus Gaussian noise (σ = 0.5).
Reward = −|action index − true d| (no λ·K cost term — we test the difficulty-mapping capability, not the speed-quality tradeoff).
Train via REINFORCE with EMA baseline + entropy bonus (β_H = 0.1, annealed to 0 over the first 30 % of 800 steps).

Pass criteria: ≥ 3 of 4 adjacent bucket-mean pairs increasing; Spearman rank correlation ≥ 0.7; mean(K | d = 4) / mean(K | d = 0) ≥ 2.

Bucket mean of chosen K at the end of REINFORCE training, evaluated on a held-out 2048-example batch.
True difficulty d	0	1	2	3	4
Mean chosen K	2.45	2.86	4.30	7.27	12.41

Pass criteria evaluated.
Criterion	Result	Spec	Status
Monotone increases (adjacent pairs)	4 / 4	≥ 3 / 4	pass
Spearman rank correlation	1.000	≥ 0.7	pass
Ratio mean(K | d = 4) / mean(K | d = 0)	5.06 ×	≥ 2 ×	pass

The adaptive-K head successfully learns the difficulty mapping on this synthetic testbed. This verifies that the policy can extract a difficulty-conditional signal from a hidden state. The remaining open question for Phase 3 GPU work is whether the real-corpus NLL-gap reward is informative enough to drive the same learning at scale.
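The three pass criteria can be recomputed from the bucket means reported above. This sketch assumes the Spearman correlation is taken over the five bucket means (the test may compute it per example instead), and it uses the rounded table values, so the ratio can differ from the reported 5.06 × in the last digit.

```python
bucket_mean_k = [2.45, 2.86, 4.30, 7.27, 12.41]  # rounded bucket means from the table
difficulty = list(range(len(bucket_mean_k)))


def ranks(values):
    """Rank of each value (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation via the tie-free formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


increasing_pairs = sum(b > a for a, b in zip(bucket_mean_k, bucket_mean_k[1:]))
rho = spearman(difficulty, bucket_mean_k)
ratio = bucket_mean_k[-1] / bucket_mean_k[0]

passed = increasing_pairs >= 3 and rho >= 0.7 and ratio >= 2.0
print(f"increasing pairs: {increasing_pairs}/4, rho: {rho:.3f}, ratio: {ratio:.2f}, pass: {passed}")
```

Because the bucket means are strictly increasing, the ranks of K and d coincide and the rank correlation is exactly 1.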

REINFORCE 2-action toy

Sanity test for the REINFORCE training loop: a 2-context contextual bandit where the optimal action depends on the input. After 300 steps:

2-action contextual bandit final policy.
Context	Optimal action	P(optimal | context)
[+1, 0]	action 0	1.000
[0, +1]	action 1	1.000

The policy reaches 100 % optimal-action probability in both contexts, so the REINFORCE + EMA baseline + entropy bonus loop is wired correctly.
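A toy of this kind can be sketched in pure Python. Only the ingredients (2 contexts, 2 actions, 300 steps, EMA baseline, entropy bonus) come from the description above; the batch size, learning rate, EMA decay, reward of 1 for the optimal action, and the reuse of β_H = 0.1 with a 30 % anneal are assumptions for illustration, and `train_bandit` is a hypothetical name.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=300, batch=32, lr=1.0, beta0=0.1, anneal_frac=0.3,
                 ema_decay=0.9, seed=0):
    """REINFORCE with an EMA baseline and a linearly annealed entropy bonus.

    Contexts are one-hot, so a linear policy reduces to one row of logits per
    context. Reward is 1 when the action index equals the context index.
    """
    rng = random.Random(seed)
    logits = [[0.0, 0.0], [0.0, 0.0]]  # logits[context][action]
    baseline = 0.0
    anneal_steps = anneal_frac * steps
    for step in range(steps):
        beta = beta0 * max(0.0, 1.0 - step / anneal_steps)
        rewards = []
        for context in (0, 1):
            probs = softmax(logits[context])
            entropy = -sum(p * math.log(p) for p in probs)
            grad = [0.0, 0.0]
            for _ in range(batch):
                action = 0 if rng.random() < probs[0] else 1
                reward = 1.0 if action == context else 0.0
                rewards.append(reward)
                advantage = reward - baseline
                for k in (0, 1):
                    # d log pi(action) / d logit_k = 1[k == action] - probs[k]
                    grad[k] += advantage * ((k == action) - probs[k]) / batch
            for k in (0, 1):
                # d entropy / d logit_k = -probs[k] * (log probs[k] + entropy)
                grad[k] -= beta * probs[k] * (math.log(probs[k]) + entropy)
                logits[context][k] += lr * grad[k]  # gradient ascent
        mean_reward = sum(rewards) / len(rewards)
        baseline = ema_decay * baseline + (1.0 - ema_decay) * mean_reward
    return [softmax(row) for row in logits]


policy = train_bandit()
for context, probs in enumerate(policy):
    print(f"context {context}: P(optimal) = {probs[context]:.3f}")
```

With rewards in [0, 1] the baseline also stays in [0, 1], so every sampled policy-gradient term pushes the optimal action's logit upward or leaves it unchanged; the entropy bonus only slows this during the annealing window.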

Full test suite

Total: 55 passing, 2 skipped (Phase 4 distillation-init, requires AR teacher), 0 failed.

By file, in dependency order:

Test files and counts.
File	Tests	Notable
`tests/test_block_mask.py`	8	Multi-head, RoPE attn matches reference oracle
`tests/test_corruption.py`	9	Includes statistical check that masked fraction ≈ m
`tests/test_losses.py`	5	Exact match: uniform logits → loss = (1/m) · log(V)
`tests/test_rope.py`	4	Verifies “dot product depends only on relative position”
`tests/test_smoke_train.py`	2	End-to-end loss-decrease + low-m NaN guard
`tests/test_cache_consistency.py`	9	Includes the multi-layer cache-vs-no-cache headline test
`tests/test_denoise_loop.py`	6	`generate()` shape, EOS stop, K=1 commit-all
`tests/test_step_predictor.py`	12	Includes the monotone-difficulty diagnostic
`tests/test_distill_init.py`	2 (skipped)	Phase 4
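The RoPE property listed for `tests/test_rope.py` can be illustrated with a small sketch. It assumes the common formulation that rotates consecutive coordinate pairs with frequencies base^(−2i/d); the function names, the base of 10000, and the pairing convention are assumptions and may not match the project's implementation.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position * theta."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # i = 2 * pair index, giving base^(-2 * pair / dim)
        angle = position * theta
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos_a - y * sin_a, x * sin_a + y * cos_a]
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
assert math.isclose(near, far, abs_tol=1e-9)
```

Each pair is rotated by an angle proportional to its position, so the query-key dot product sees only the difference of the two angles, which is why shifting both positions by 100 leaves it unchanged up to rounding.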

What's pending (GPU-bounded)

Real nano training on 200M FineWeb-Edu tokens (~12 h on a single H100).
flash-attn kernel dispatch (the dense fallback works on CPU; production GPU path benchmarked separately).
Real-corpus REINFORCE for the K_b head using the actual NLL-gap quality signal from the trained nano.
λ sweep (∈ {0.01, 0.05, 0.1, 0.5}) and Pareto frontier plot.
AR-to-CASCADE distillation from Qwen-2.5-7B (~150B tokens of continued pretraining).
Eval harness runs: MMLU, GSM8K, HumanEval, MBPP, MT-Bench, LongBench-v2.
Throughput measurement at production batch sizes (16, 64) vs. vLLM-served AR baseline.