Results
Concrete numbers from the project's CPU-bounded test suite. All numbers below are reproducible by running python -m pytest tests/ -v -s from cascade-lm/. The full suite is 55 passing tests + 2 phased skips, completing in under 10 seconds on a CPU.
Scope of these results
The numbers below verify correctness wiring: that block-causal attention masks correctly, that the cache reproduces full-forward outputs to within fp64 rounding, that training propagates gradients and converges, and that REINFORCE can learn a difficulty-conditional policy. They do not include real-corpus benchmark numbers (MMLU, HumanEval, throughput), which require the GPU-bounded sub-phases of Phases 1, 3, and 4. See Phases.
Reference parity (correctness oracle)
The reference attention/cache in cascade/attention_reference.py is the correctness oracle for the production multi-head + RoPE path. Three structural properties are verified, all at fp64:
| Property | Max diff (fp64) | Spec | Status |
|---|---|---|---|
| Block-causal mask matches the published BD3-LM construction | — | Structural equality on an explicit (8, 8) example | pass |
| block_size = 1 reproduces standard causal attention | 0.00 × 10⁰ | strictly bit-identical (mask reduces to tril) | pass |
| Cache-reuse path equals full forward (single-layer, no RoPE) | 5.55 × 10⁻¹⁷ | < 10⁻¹⁰ | pass |
The block_size = 1 bit-identity is meaningful: it proves CASCADE's block-causal construction is a strict generalization of AR, not a related-but-different architecture. If a production change breaks this test, it has broken something fundamental.
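The reduction to tril can be illustrated with a minimal sketch. It assumes the plain block-causal rule (a query sees every key in its own block or any earlier block); the project's actual mask construction may differ in detail, and the function names here are hypothetical.

```python
from __future__ import annotations


def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] = True when query position i may attend to key position j.

    Assumed rule: a query sees every key in its own block or any earlier block.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len: int) -> list[list[bool]]:
    """Standard lower-triangular (tril) autoregressive mask."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# With block_size = 1 every block holds one token, so the two masks coincide.
assert block_causal_mask(8, 1) == causal_mask(8)

# With block_size = 4, position 0 also sees positions 1-3 (same block) but not 4.
mask = block_causal_mask(8, 4)
assert mask[0][3] and not mask[0][4]
```

Because the comparison is on boolean masks, equality here is exact rather than approximate, which matches the "strictly bit-identical" spec in the table.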
Multi-layer cache equivalence (the headline Phase 2 result)
The central correctness property of CASCADE: a model can generate block b via two paths and get the same answer.
- Path A: full forward over the entire sequence of 4 blocks at once.
- Path B: commit blocks 0–2 to the cache via a clean forward over just the prefix; then call forward_block_with_cache for block 3, reading the cache.
The logits at block 3 must match to fp64 precision. With three layers, four heads, RoPE enabled:
| Configuration | Max diff (fp64) | Spec | Status |
|---|---|---|---|
| 3 layers, 4 heads, RoPE on; 3 committed blocks; final-block logits | 1.57 × 10⁻¹⁵ | < 10⁻¹⁰ | pass |
| Zero-prefix forward (block 0, empty cache) equals standard forward() | 0.00 × 10⁰ | strictly bit-identical | pass |
| No future information leak (perturb future-block token, past outputs unchanged) | < 10⁻¹² | fp64 bit-near-identity at past positions | pass |
The 1.57 × 10⁻¹⁵ result is the most important number in this repo: it proves the BlockCache reuse pattern is correct on the full multi-head + RoPE stack at the precision floor of fp64, not just on the simplified single-layer reference.
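The two-path comparison and the no-leak check can be sketched on a toy, single-layer, single-head attention in pure Python. This is an illustration under simplifying assumptions, not the project's code: queries, keys, and values all use identity projections, there is no RoPE, and toy_forward_block_with_cache is a hypothetical stand-in whose signature is not taken from the repository.

```python
from __future__ import annotations
import math
import random


def attend(query, keys, values):
    """Single-head softmax attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) / total for c in range(dim)]


def full_forward(x, block_size):
    """Path A: one pass over the whole sequence under a block-causal mask."""
    outputs = []
    for i, query in enumerate(x):
        visible = (i // block_size + 1) * block_size  # end of the query's own block
        outputs.append(attend(query, x[:visible], x[:visible]))
    return outputs


def toy_forward_block_with_cache(block, cached_keys, cached_values):
    """Path B: attend from the new block over the cached prefix plus the block itself."""
    keys = cached_keys + block
    values = cached_values + block
    return [attend(query, keys, values) for query in block]


rng = random.Random(0)
block_size, n_blocks, dim = 4, 4, 8
x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(block_size * n_blocks)]

path_a = full_forward(x, block_size)[-block_size:]
prefix = x[: 3 * block_size]  # blocks 0-2 committed to the cache
path_b = toy_forward_block_with_cache(x[3 * block_size:], prefix, prefix)

max_diff = max(abs(a - b) for ra, rb in zip(path_a, path_b) for a, b in zip(ra, rb))
assert max_diff < 1e-10

# No future leak: perturbing a token in block 3 leaves blocks 0-2 untouched.
x_perturbed = [row[:] for row in x]
x_perturbed[-1][0] += 1.0
past = 3 * block_size
assert full_forward(x_perturbed, block_size)[:past] == full_forward(x, block_size)[:past]
```

In this toy the cached keys and values are just the prefix inputs, so equality is almost trivial. The multi-layer case is the meaningful one, since cached keys and values at deeper layers depend on the prefix's own forward pass.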
End-to-end training (Phase 1 smoke test)
Train a 139k-param nano CASCADE to memorize a single random batch — verifies the full wiring: corruption.corrupt_batch → CascadeLM.forward → losses.masked_diffusion_loss → loss.backward → AdamW step.
Setup: vocab 128, d = 64, 2 layers, 4 heads, B = 8, batch size 4, sequence length 32. m ∼ U[0.2, 1] (higher floor than production so 1/m stays bounded for the test).
| Statistic | Value |
|---|---|
| Initial mean loss (first 10 steps) | 9.5181 |
| Final mean loss (last 30 steps) | 1.2183 |
| Reduction factor | 7.81 × |
| Spec floor | ≥ 2.5 × reduction; final < 4.5 |
| Status | pass |
An accompanying test verifies the loss is finite even when m is at the floor (m_min = 10⁻³, so 1/m = 1000): no NaNs in loss or gradients.
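The role of the 1/m weight can be illustrated with a small sketch. The normalization here is an assumption: (1/m) times the mean cross-entropy over masked positions, chosen to agree with the uniform-logits check listed for tests/test_losses.py. The real losses.masked_diffusion_loss may be defined differently, and masked_diffusion_loss_sketch is a hypothetical name.

```python
from __future__ import annotations
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_z for v in logits]


def masked_diffusion_loss_sketch(logits, targets, is_masked, m):
    """(1/m) times the mean cross-entropy over masked positions (assumed form)."""
    losses = [
        -log_softmax(row)[target]
        for row, target, masked in zip(logits, targets, is_masked)
        if masked
    ]
    if not losses:
        return 0.0
    return (1.0 / m) * sum(losses) / len(losses)


vocab, length, m = 128, 32, 0.25
logits = [[0.0] * vocab for _ in range(length)]  # uniform prediction
targets = [i % vocab for i in range(length)]
is_masked = [i % 4 == 0 for i in range(length)]  # 8 of 32 positions masked

loss = masked_diffusion_loss_sketch(logits, targets, is_masked, m)
assert math.isclose(loss, math.log(vocab) / m)

# At the floor m = 1e-3 the weight is 1000 and the loss remains finite.
assert math.isfinite(masked_diffusion_loss_sketch(logits, targets, is_masked, 1e-3))
```

Under this assumed form, the weight grows without bound as m approaches 0, which is why the smoke test raises the floor to 0.2 and a separate test exercises the production floor.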
Monotone-difficulty diagnostic (Phase 3 headline)
Per Appendix A.3 of the master prompt, this is the #1 diagnostic for whether the adaptive-K head works at all. Setup:
- Synthetic task: 5 difficulty buckets d ∈ {0, 1, 2, 3, 4}, each demanding a different K.
- Input h_i carries a one-hot difficulty signal at dimension d_i, plus Gaussian noise (σ = 0.5).
- Reward = −|action index − true d| (no λ·K cost term — we test the difficulty-mapping capability, not the speed-quality tradeoff).
- Train via REINFORCE with EMA baseline + entropy bonus (β_H = 0.1, annealed to 0 over the first 30 % of 800 steps).
Pass criteria: ≥ 3 of 4 adjacent bucket-mean pairs increasing; Spearman rank correlation ≥ 0.7; mean(K | d = 4) / mean(K | d = 0) ≥ 2.
| True difficulty d | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Mean chosen K | 2.45 | 2.86 | 4.30 | 7.27 | 12.41 |

| Criterion | Result | Spec | Status |
|---|---|---|---|
| Monotone increases (adjacent pairs) | 4 / 4 | ≥ 3 / 4 | pass |
| Spearman rank correlation | 1.000 | ≥ 0.7 | pass |
| Ratio mean(K \| d = 4) / mean(K \| d = 0) | 5.06 × | ≥ 2 × | pass |
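The three pass criteria can be recomputed from the bucket means in the table with a short sketch. It assumes the rank correlation is taken over the five bucket means (the actual test may compute it per sample) and uses the no-ties Spearman formula; the helper names are hypothetical.

```python
from __future__ import annotations


def ranks(values):
    """1-based ranks, assuming no ties (true for the bucket means used here)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


difficulty = [0, 1, 2, 3, 4]
mean_k = [2.45, 2.86, 4.30, 7.27, 12.41]  # bucket means from the table

increasing_pairs = sum(b > a for a, b in zip(mean_k, mean_k[1:]))
rho = spearman(difficulty, mean_k)
ratio = mean_k[-1] / mean_k[0]

passed = increasing_pairs >= 3 and rho >= 0.7 and ratio >= 2.0
assert passed
```

Because the table values are rounded to two decimals, the recomputed ratio can differ from the reported 5.06 in the last digit.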
The adaptive-K head successfully learns the difficulty mapping on this synthetic testbed. Critically, this verifies the policy can extract a difficulty-conditional signal from a hidden state — the remaining open question for Phase 3 GPU work is whether the real-corpus NLL-gap reward is informative enough to drive the same learning at scale.
REINFORCE 2-action toy
Sanity test for the REINFORCE training loop: a 2-context contextual bandit where the optimal action depends on the input. After 300 steps:
| Context | Optimal action | P(optimal \| context) |
|---|---|---|
| [+1, 0] | action 0 | 1.000 |
| [0, +1] | action 1 | 1.000 |
Policy reaches 100 % optimal action probability in both contexts — REINFORCE + EMA baseline + entropy bonus is properly wired.
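A minimal sketch of such a loop, assuming a linear softmax policy, a 0/1 reward, a single shared EMA baseline, and a linearly annealed entropy bonus. The learning rate, EMA decay, and function names are illustrative choices, not values taken from the test suite.

```python
from __future__ import annotations
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=300, lr=1.0, beta0=0.1, anneal_frac=0.3, seed=0):
    """REINFORCE on a 2-context, 2-action bandit; returns P(optimal | context)."""
    rng = random.Random(seed)
    contexts = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]  # (input, optimal action)
    weights = [[0.0, 0.0], [0.0, 0.0]]             # weights[action][feature]
    baseline = 0.0

    for step in range(steps):
        x, optimal = contexts[step % 2]            # alternate the two contexts
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == optimal else 0.0
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward   # EMA baseline

        # Entropy coefficient decays linearly to 0 over the first anneal_frac of training.
        beta = beta0 * max(0.0, 1.0 - step / (anneal_frac * steps))
        entropy = -sum(p * math.log(p) for p in probs)
        for k in range(2):
            grad_logp = (1.0 if k == action else 0.0) - probs[k]      # d log pi(a) / d logit_k
            grad_entropy = -probs[k] * (math.log(probs[k]) + entropy)  # d H / d logit_k
            step_dir = advantage * grad_logp + beta * grad_entropy
            for f in range(2):
                weights[k][f] += lr * step_dir * x[f]

    return [
        softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])[optimal]
        for x, optimal in contexts
    ]


p_context_0, p_context_1 = train_bandit()
```

With a 0/1 reward and a baseline between 0 and 1, every policy-gradient update moves probability toward the optimal action, so the sketch should drift toward the optimal policy in both contexts. The exact final probabilities depend on the seed and hyperparameters.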
Full test suite
Total: 55 passing, 2 skipped (Phase 4 distillation-init, requires AR teacher), 0 failed.
By file, in dependency order:
| File | Tests | Notable |
|---|---|---|
| tests/test_block_mask.py | 8 | Multi-head, RoPE attn matches reference oracle |
| tests/test_corruption.py | 9 | Includes statistical check that masked fraction ≈ m |
| tests/test_losses.py | 5 | Exact match: uniform logits → loss = (1/m) · log(V) |
| tests/test_rope.py | 4 | Verifies “dot product depends only on relative position” |
| tests/test_smoke_train.py | 2 | End-to-end loss-decrease + low-m NaN guard |
| tests/test_cache_consistency.py | 9 | Includes the multi-layer cache-vs-no-cache headline test |
| tests/test_denoise_loop.py | 6 | generate() shape, EOS stop, K=1 commit-all |
| tests/test_step_predictor.py | 12 | Includes the monotone-difficulty diagnostic |
| tests/test_distill_init.py | 2 (skipped) | Phase 4 |
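The relative-position property checked in tests/test_rope.py can be illustrated with a small sketch. It assumes the common RoPE convention of rotating consecutive (even, odd) pairs with base 10000; the project's layout (for example a half-split variant) may differ, and the function names are hypothetical.

```python
from __future__ import annotations
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = []
    for pair in range(dim // 2):
        theta = position * base ** (-2.0 * pair / dim)
        a, b = vec[2 * pair], vec[2 * pair + 1]
        out += [
            a * math.cos(theta) - b * math.sin(theta),
            a * math.sin(theta) + b * math.cos(theta),
        ]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (query 3 positions after key) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
assert math.isclose(near, far, abs_tol=1e-12)

# A different offset gives a different score.
other = dot(rope(q, 5), rope(k, 4))
assert abs(near - other) > 1e-3
```

The property follows because each pair is rotated by an angle linear in position, so the product of a rotated query and a rotated key depends only on the difference of the two angles.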
What's pending (GPU-bounded)
- Real nano training on 200M FineWeb-Edu tokens (~12 h on a single H100).
- flash-attn kernel dispatch (the dense fallback works on CPU; production GPU path benchmarked separately).
- Real-corpus REINFORCE for the K_b head using the actual NLL-gap quality signal from the trained nano.
- λ sweep (∈ {0.01, 0.05, 0.1, 0.5}) and Pareto frontier plot.
- AR-to-CASCADE distillation from Qwen-2.5-7B (~150B tokens of continued pretraining).
- Eval harness runs: MMLU, GSM8K, HumanEval, MBPP, MT-Bench, LongBench-v2.
- Throughput measurement at production batch sizes (16, 64) vs. vLLM-served AR baseline.