Results

Concrete numbers from the project's CPU-bounded test suite. All numbers below are reproducible by running python -m pytest tests/ -v -s from cascade-lm/. The full suite is 55 passing tests + 2 phased skips, completing in under 10 seconds on a CPU.

Scope of these results

The numbers below verify correctness wiring: that block-causal attention masks correctly, that the cache reproduces full-forward outputs to within fp64 rounding error, that training propagates gradients and converges, and that REINFORCE can learn a difficulty-conditional policy. They do not include real-corpus benchmark numbers (MMLU, HumanEval, throughput), which require the GPU-bounded sub-phases of Phases 1, 3, and 4. See Phases.

Reference parity (correctness oracle)

The reference attention/cache in cascade/attention_reference.py is the correctness oracle for the production multi-head + RoPE path. Three structural properties are verified, all at fp64:

Reference parity numbers from python cascade/attention_reference.py.

| Property | Max diff (fp64) | Spec | Status |
|---|---|---|---|
| Block-causal mask matches the published BD3-LM construction | | Structural equality on an explicit (8, 8) example | pass |
| block_size = 1 reproduces standard causal attention | 0.00 × 10^0 | strictly bit-identical (mask reduces to tril) | pass |
| Cache-reuse path equals full forward (single-layer, no RoPE) | 5.55 × 10^-17 | < 10^-10 | pass |

The block_size = 1 bit-identity is meaningful: it proves CASCADE's block-causal construction is a strict generalization of AR, not a related-but-different architecture. If a production change breaks this test, it has broken something fundamental.
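To make the block_size = 1 reduction concrete, here is a minimal sketch of a block-causal visibility rule. It assumes the simplified rule "position i may attend to position j when j's block index is not later than i's"; the function names are illustrative, and the sketch does not attempt to reproduce the published BD3-LM construction or the code in cascade/attention_reference.py.

```python
def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed rule (illustrative): j is visible to i when block(j) <= block(i).
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len):
    """Standard lower-triangular (tril) autoregressive mask."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # With block_size = 4, rows 0-3 see columns 0-3 and rows 4-7 see columns 0-7.
    for row in block_causal_mask(8, 4):
        print("".join("x" if allowed else "." for allowed in row))
    # With block_size = 1 the rule collapses to j <= i, i.e. the causal mask.
    print(block_causal_mask(8, 1) == causal_mask(8))
```

Under this rule the comparison with tril is an equality of boolean masks, which is why a bit-identical result is a reasonable thing to demand at block_size = 1.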

Multi-layer cache equivalence (the headline Phase 2 result)

The central correctness property of CASCADE: a model can generate block b via two paths and get the same answer.

The logits at block 3 must match to fp64 precision. With three layers, four heads, RoPE enabled:

Multi-layer cache-vs-no-cache equivalence on the full production stack.

| Configuration | Max diff (fp64) | Spec | Status |
|---|---|---|---|
| 3 layers, 4 heads, RoPE on; 3 committed blocks; final-block logits | 1.57 × 10^-15 | < 10^-10 | pass |
| Zero-prefix forward (block 0, empty cache) equals standard forward() | 0.00 × 10^0 | strictly bit-identical | pass |
| No future information leak (perturb future-block token, past outputs unchanged) | < 10^-12 | fp64 bit-near-identity at past positions | pass |

The 1.57 × 10^-15 result is the most important number in this repo: it proves the BlockCache reuse pattern is correct on the full multi-head + RoPE stack at the precision floor of fp64, not just on the simplified single-layer reference.
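The two checks in the table (cache path equals full forward, and no leak from future blocks) can be illustrated on a toy single-head attention. This is a sketch under stated assumptions, not the project's BlockCache: queries, keys and values are all the raw inputs (identity projections), there is no RoPE and only one layer, and every function name is hypothetical.

```python
import math
import random


def attend(q, keys, values):
    """Single-query scaled dot-product attention over the given keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d)]


def full_forward(x, block_size):
    """Recompute every position from scratch under a block-causal mask."""
    out = []
    for i in range(len(x)):
        visible = (i // block_size + 1) * block_size  # end of position i's block
        out.append(attend(x[i], x[:visible], x[:visible]))
    return out


def cached_forward(x, block_size):
    """Process one block at a time, reusing keys/values of committed blocks."""
    cache_k, cache_v, out = [], [], []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        keys, values = cache_k + block, cache_v + block
        out.extend(attend(q, keys, values) for q in block)
        cache_k, cache_v = keys, values  # commit the finished block to the cache
    return out


def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
full = full_forward(x, block_size=2)
cached = cached_forward(x, block_size=2)
print("cache vs full within spec:", max_abs_diff(full, cached) < 1e-10)

# Perturb a token in the last block; outputs of the earlier blocks must not move.
perturbed = [row[:] for row in x]
perturbed[7][0] += 1.0
leak = max_abs_diff(full[:6], full_forward(perturbed, block_size=2)[:6])
print("no future leak:", leak < 1e-12)
```

Python floats are fp64, so the thresholds here mirror the spec column. The real stack adds learned projections, multiple heads, RoPE and several layers, each of which is a further place where the cached and uncached paths could diverge.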

End-to-end training (Phase 1 smoke test)

Train a 139k-param nano CASCADE to memorize a single random batch. This verifies the full wiring: corruption.corrupt_batch → CascadeLM.forward → losses.masked_diffusion_loss → loss.backward → AdamW step.

Setup: vocab 128, d = 64, 2 layers, 4 heads, B = 8, batch 4, length 32. The mask rate is sampled as m ~ U[0.2, 1] (a higher floor than production, so that 1/m stays bounded for the test).

Smoke-train loss trajectory on the memorizable batch.

| Statistic | Value |
|---|---|
| Initial mean loss (first 10 steps) | 9.5181 |
| Final mean loss (last 30 steps) | 1.2183 |
| Reduction factor | 7.81× |
| Spec floor | ≥ 2.5× reduction; final < 4.5 |
| Status | pass |

An accompanying test verifies the loss is finite even when m is at the floor (m_min = 10^-3, so 1/m ≈ 1000): no NaNs in loss or gradients.
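As an illustration of why the low-m case deserves its own guard, the sketch below evaluates a 1/m-weighted masked cross-entropy at the floor. It assumes the loss is the mean cross-entropy over masked positions scaled by 1/m, which is one normalization consistent with the uniform-logits identity loss = (1/m) · log(V) listed for tests/test_losses.py; the actual losses.masked_diffusion_loss may be normalized differently, and the signature here is hypothetical.

```python
import math


def masked_diffusion_loss(logits, targets, is_masked, m):
    """Mean cross-entropy over masked positions, scaled by 1/m (assumed form)."""
    total, count = 0.0, 0
    for row, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue  # unmasked positions contribute nothing to the loss
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]  # cross-entropy at this position
        count += 1
    if count == 0:
        return 0.0
    return (total / count) / m


vocab = 128
uniform_logits = [[0.0] * vocab for _ in range(4)]
targets = [3, 17, 42, 99]
is_masked = [True, False, True, True]

m_floor = 1e-3
loss = masked_diffusion_loss(uniform_logits, targets, is_masked, m_floor)
print(math.isfinite(loss), math.isclose(loss, math.log(vocab) / m_floor))
```

With uniform logits every masked position costs log(V), so the 1/m factor is the only thing that changes the scale. At the floor it multiplies the loss by 1000, which is large but finite in floating point.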

Monotone-difficulty diagnostic (Phase 3 headline)

Per Appendix A.3 of the master prompt, this is the #1 diagnostic for whether the adaptive-K head works at all. Setup:

Pass criteria: ≥ 3 of 4 adjacent bucket-mean pairs increasing; Spearman rank correlation ≥ 0.7; mean(K | d = 4) / mean(K | d = 0) ≥ 2.

Bucket mean of chosen K at the end of REINFORCE training, evaluated on a held-out 2048-example batch.

| True difficulty d | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Mean chosen K | 2.45 | 2.86 | 4.30 | 7.27 | 12.41 |
Pass criteria evaluated.

| Criterion | Result | Spec | Status |
|---|---|---|---|
| Monotone increases (adjacent pairs) | 4 / 4 | ≥ 3 / 4 | pass |
| Spearman rank correlation | 1.000 | ≥ 0.7 | pass |
| Ratio mean(K \| d=4) / mean(K \| d=0) | 5.06× | ≥ 2× | pass |

The adaptive-K head successfully learns the difficulty mapping on this synthetic testbed. Critically, this verifies the policy can extract a difficulty-conditional signal from a hidden state — the remaining open question for Phase 3 GPU work is whether the real-corpus NLL-gap reward is informative enough to drive the same learning at scale.
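The three pass criteria can be recomputed from the bucket means as a quick sanity check. The sketch below is an assumption about how the criteria are evaluated: it computes the Spearman correlation over the five bucket means, whereas the project's test may compute it per example, and it works from the rounded table values, so the last digit of the ratio may differ from the reported figure.

```python
def ranks(values):
    """Ranks starting at 1; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


difficulty = [0, 1, 2, 3, 4]
mean_k = [2.45, 2.86, 4.30, 7.27, 12.41]  # rounded bucket means from the table

increases = sum(b > a for a, b in zip(mean_k, mean_k[1:]))
rho = spearman(difficulty, mean_k)
ratio = mean_k[-1] / mean_k[0]

passed = increases >= 3 and rho >= 0.7 and ratio >= 2.0
print(increases, rho, round(ratio, 2), passed)
```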

REINFORCE 2-action toy

Sanity test for the REINFORCE training loop: a 2-context contextual bandit where the optimal action depends on the input. After 300 steps:

2-action contextual bandit final policy.

| Context | Optimal action | P(optimal \| context) |
|---|---|---|
| [+1, 0] | action 0 | 1.000 |
| [0, +1] | action 1 | 1.000 |

Policy reaches 100 % optimal action probability in both contexts — REINFORCE + EMA baseline + entropy bonus is properly wired.
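For readers who want to see the three ingredients together, here is a minimal pure-Python sketch of REINFORCE with an EMA baseline and an entropy bonus on a bandit of this shape. It is not the project's training loop: the tabular logits (equivalent to a linear policy on one-hot contexts), the 0/1 reward, and all hyperparameters (learning rate, EMA decay, entropy coefficient, seed) are assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=300, lr=1.0, ema_decay=0.9, entropy_coef=0.01, seed=0):
    rng = random.Random(seed)
    optimal = [0, 1]                      # context 0 -> action 0, context 1 -> action 1
    logits = [[0.0, 0.0], [0.0, 0.0]]     # one row of action logits per context
    baseline = 0.0                        # EMA of observed rewards
    for step in range(steps):
        context = step % 2                # alternate between the two contexts
        probs = softmax(logits[context])
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == optimal[context] else 0.0
        advantage = reward - baseline
        entropy = -sum(p * math.log(p) for p in probs)
        for k in range(2):
            # d log pi(action) / d logit_k for a softmax policy
            grad_logp = (1.0 if k == action else 0.0) - probs[k]
            # d entropy / d logit_k for a softmax policy
            grad_entropy = -probs[k] * (math.log(probs[k]) + entropy)
            logits[context][k] += lr * (advantage * grad_logp + entropy_coef * grad_entropy)
        baseline = ema_decay * baseline + (1.0 - ema_decay) * reward
    return [softmax(row)[optimal[c]] for c, row in enumerate(logits)]


final_probs = train_bandit()
print([round(p, 3) for p in final_probs])
```

Because the reward lies in [0, 1] and the baseline is an average of past rewards, the advantage is non-negative after an optimal action and non-positive otherwise, so the policy-gradient term never lowers the probability of the optimal action. The small entropy bonus pulls slightly the other way and keeps the policy from saturating too early.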

Full test suite

Total: 55 passing, 2 skipped (Phase 4 distillation-init, requires AR teacher), 0 failed.

By file, in dependency order:

Test files and counts.

| File | Tests | Notable |
|---|---|---|
| tests/test_block_mask.py | 8 | Multi-head, RoPE attn matches reference oracle |
| tests/test_corruption.py | 9 | Includes statistical check that masked fraction ≈ m |
| tests/test_losses.py | 5 | Exact match: uniform logits → loss = (1/m) · log(V) |
| tests/test_rope.py | 4 | Verifies "dot product depends only on relative position" |
| tests/test_smoke_train.py | 2 | End-to-end loss-decrease + low-m NaN guard |
| tests/test_cache_consistency.py | 9 | Includes the multi-layer cache-vs-no-cache headline test |
| tests/test_denoise_loop.py | 6 | generate() shape, EOS stop, K=1 commit-all |
| tests/test_step_predictor.py | 12 | Includes the monotone-difficulty diagnostic |
| tests/test_distill_init.py | 2 (skipped) | Phase 4 |
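The RoPE property quoted in the table ("dot product depends only on relative position") can be checked on a single two-dimensional frequency pair. This is a minimal sketch, not the code in tests/test_rope.py: the rotation frequency theta and the helper names are assumptions, and a real RoPE layer applies one such rotation per pair of feature dimensions with a different frequency for each pair.

```python
import math


def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by the angle position * theta (one RoPE frequency pair)."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)


def score(q, k, q_pos, k_pos):
    """Attention logit between a query at q_pos and a key at k_pos after rotation."""
    qr, kr = rotate(q, q_pos), rotate(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]


q, k = (1.0, 2.0), (0.5, -1.0)

# Same offset (q_pos - k_pos = 3) at two different absolute positions.
same_offset_gap = abs(score(q, k, 5, 2) - score(q, k, 13, 10))
# Different offsets (3 versus 2) from the same query position.
different_offset_gap = abs(score(q, k, 5, 2) - score(q, k, 5, 3))

print(same_offset_gap < 1e-12, different_offset_gap > 1e-6)
```

Shifting both positions by the same amount rotates q and k by the same extra angle, which leaves their dot product unchanged. Only the difference between the two positions survives in the score.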

What's pending (GPU-bounded)