CASCADE
A diffusion language model that decodes block-by-block left-to-right with KV-cache reuse and entropy-adaptive denoising step count, eliminating the autoregressive throughput ceiling without sacrificing streaming.
CASCADE stands for Causal Adaptive Streaming Cascaded Architecture for Diffusive Emission.
The thesis in one paragraph
Autoregressive language models are bottlenecked at decode time by sequentiality: token T+1 cannot start until the forward pass for token T is done. Diffusion language models attack this directly by denoising spans in parallel, but pure non-autoregressive diffusion has two open problems: no KV-cache reuse (every denoising step is a full forward pass over the noisy sequence) and inflexible output length. Block diffusion (BD3-LM, Arriola et al., ICLR 2025) recovers both by running diffusion within a block of size B and proceeding left-to-right across blocks. CASCADE pushes this to its strongest form by combining block masked diffusion, KV-cache reuse across and within blocks, a unifying objective that interpolates between AR and diffusion via a single knob, and, as the project's headline novelty, an entropy-adaptive per-block denoising step count K_b. Easy blocks denoise in 2 steps; hard blocks in 16.
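To make "diffusion within a block, left-to-right across blocks" concrete, here is a minimal sketch of the attention pattern it implies at decode time. It assumes full (bidirectional) visibility inside a block and causal visibility across blocks; the function names are illustrative and are not taken from the repository, and the training-time mask used by BD3-LM is more involved than this.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    if block_size < 1:
        raise ValueError("block_size must be >= 1")
    # A position sees every position in its own block and in all earlier blocks.
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len: int) -> list[list[bool]]:
    """Standard lower-triangular autoregressive mask, for comparison."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Print the pattern for 6 positions in blocks of 2 ("x" = may attend).
    for row in block_causal_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With block_size equal to 1 the pattern collapses to the ordinary causal mask, and with block_size equal to the sequence length every position sees every other, which is one way to picture the AR-to-diffusion knob.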
What's in this repo
The four CPU-implementable phases of the master plan are complete and tested at fp64 precision. The reference attention/cache implementation produces outputs bit-identical to standard causal attention at block_size = 1 and satisfies the cache-reuse equivalence property to within 1.57 × 10⁻¹⁵ across multiple layers with RoPE.
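The cache-reuse equivalence property can be illustrated with a toy, dependency-free check. This is a sketch under simplifying assumptions, not the repository's oracle: a single attention head, no RoPE, no learned projections, and keys/values that are already final when they enter the cache. All names here are hypothetical. Python floats are IEEE doubles, so the comparison is in fp64.

```python
import math
import random


def attend(query, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def full_forward(q, k, v, block_size):
    """Recompute block-causal attention for every position, with no cache."""
    out = []
    for i in range(len(q)):
        # Position i sees everything up to the end of its own block.
        visible = min((i // block_size + 1) * block_size, len(q))
        out.append(attend(q[i], k[:visible], v[:visible]))
    return out


def cached_forward(q, k, v, block_size):
    """Process one block at a time, appending its keys/values to a growing cache."""
    cache_k, cache_v, out = [], [], []
    for start in range(0, len(q), block_size):
        end = min(start + block_size, len(q))
        cache_k.extend(k[start:end])
        cache_v.extend(v[start:end])
        for i in range(start, end):
            out.append(attend(q[i], cache_k, cache_v))
    return out


def max_abs_diff(a, b):
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def random_vectors(rng, count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
q, k, v = (random_vectors(rng, 12, 8) for _ in range(3))
diff = max_abs_diff(full_forward(q, k, v, 4), cached_forward(q, k, v, 4))
print(f"max |full - cached| = {diff:.3e}")
```

In a real multi-layer model the two paths generally order their floating-point operations differently, which is why a small nonzero tolerance is the natural thing to report rather than exact equality.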
- Phase 0 — Bootstrap. Repo scaffold, 13 lit notes (2 anchors fully drafted), Phase-0 THINK doc, scaffold-completeness checker.
- Phase 1 — Pretrain wiring. Block-causal multi-head attention with RoPE, BD3-LM corruption process, LLaDA 1/m-weighted masked cross-entropy, RMSNorm + SwiGLU blocks, full CascadeLM. End-to-end smoke training: loss falls 9.52 → 1.22 over 300 steps on a memorizable batch.
- Phase 2 — Block cache & denoise. Pre-allocated multi-layer BlockCache; cache-aware forward at every level of the stack; block-by-block generate() with confidence-ranked unmask and EOS-stop. Multi-layer cache-vs-no-cache equivalence verified at 1.57 × 10⁻¹⁵ in fp64.
- Phase 3 — Adaptive K head. AdaptiveKHead over K_CHOICES = (1, 2, 4, 8, 16); EMABaseline; reinforce_step_count_loss. On a synthetic difficulty task the monotone-difficulty diagnostic (Appendix A.3 of the master prompt) achieves Spearman rank correlation 1.000 between true difficulty and mean chosen K per bucket.
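For readers unfamiliar with the Phase 3 ingredients, the following is a generic sketch of a REINFORCE (score-function) loss with an exponential-moving-average baseline over a discrete set of step counts. It is not the repository's EMABaseline or reinforce_step_count_loss: the class and function names, the decay value, and the way the reward is supplied are all assumptions made for illustration. How the reward is actually defined is not specified here, so it is left as an input.

```python
import math

K_CHOICES = (1, 2, 4, 8, 16)  # candidate step counts, as listed above


class EmaBaselineSketch:
    """Exponential moving average of past rewards, used to reduce variance."""

    def __init__(self, decay: float = 0.9):  # decay value is an assumption
        self.decay = decay
        self.value = None  # no reward seen yet

    def update(self, reward: float) -> float:
        if self.value is None:
            self.value = reward
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return self.value


def log_softmax(logits):
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def reinforce_loss(logits, chosen_index, reward, baseline):
    """Surrogate loss: -(reward - baseline) * log pi(chosen K)."""
    advantage = reward - baseline
    return -advantage * log_softmax(logits)[chosen_index]


# Usage: uniform logits over K_CHOICES, the policy picked K = 4 (index 2).
baseline = EmaBaselineSketch()
baseline.update(0.5)                      # reward from an earlier block
loss = reinforce_loss([0.0] * len(K_CHOICES), 2, reward=1.0, baseline=baseline.value)
baseline.update(1.0)                      # fold the new reward into the average
print(f"surrogate loss = {loss:.4f}")
```

Minimizing this loss raises the probability of a chosen K when its reward beats the running average and lowers it otherwise; the baseline changes the variance of the gradient estimate, not its expectation.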
The whole test suite (55 passing, 2 phased skips) runs in under 10 seconds on a CPU.
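The confidence-ranked unmasking mentioned under Phase 2 can be pictured with a small stand-alone sketch. It assumes a block that starts fully masked and a schedule that spreads the remaining positions evenly over the remaining steps; the model is replaced by a random stub, EOS handling is omitted, and every name below is hypothetical rather than the repository's generate() API.

```python
import math
import random

MASK = None  # stand-in for the mask token id


def toy_predict(block, rng):
    """Hypothetical model stub: a (token, confidence) guess for every position."""
    return [(rng.randrange(100), rng.random()) for _ in block]


def denoise_block(block_size, num_steps, predict, rng):
    """Unmask a fully masked block in at most num_steps rounds, most confident first."""
    block = [MASK] * block_size
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break  # finished early (more steps than positions)
        guesses = predict(block, rng)
        # Spread the remaining masked positions evenly over the remaining steps.
        quota = math.ceil(len(masked) / (num_steps - step))
        ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:quota]:
            block[i] = guesses[i][0]  # commit the most confident guesses
    return block


rng = random.Random(0)
print(denoise_block(block_size=8, num_steps=4, predict=toy_predict, rng=rng))
```

The step count passed as num_steps plays the role of the per-block K: a small K commits many positions per round, a large K commits few and lets later rounds condition on earlier commitments.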
Get started
The project lives under cascade-lm/ in the repository. After cloning:
cd cascade-lm
python scripts/check_scaffold.py # verify all 58 scaffold files present
python cascade/attention_reference.py # 4 PASS lines — the correctness oracle
python -m pytest tests/ -q # 55 passed, 2 skipped
To set up a Python environment:
uv venv --python 3.11
uv pip install -e ".[dev]"
# On a GPU box, also:
uv pip install -e ".[gpu]" # adds flash-attn and triton
Status
| Phase | Status | Headline result |
|---|---|---|
| 0 — Bootstrap | complete | 58-file scaffold; 13 lit notes |
| 1 — Pretrain wiring (CPU) | complete | Smoke train 9.52 → 1.22 (7.81×) |
| 1 — Pretrain training | needs GPU | Train nano on 200M FineWeb-Edu tokens |
| 2 — Cache + denoise | complete | Multi-layer cache equivalence 1.57e-15 |
| 3 — Adaptive K (CPU) | complete | Monotone-K Spearman 1.000 |
| 3 — Adaptive K (real corpus) | needs trained model | λ sweep, MMLU preservation |
| 4 — AR-to-CASCADE distillation | needs teacher | Qwen-2.5-7B continued pretrain (~150B tokens) |
| 5 — Scaling + ablations | deferred | Pareto frontier, scaling laws |
| 6 — Paper + release | deferred | Manuscript + repro script |
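The monotone-K entry in the table is a Spearman rank correlation. As a reminder of what that statistic measures, here is a dependency-free sketch that ranks both series (averaging ties) and takes the Pearson correlation of the ranks. The bucket values in the usage lines are made up for illustration and are not results from this project.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5  # undefined if either series is constant


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative only: difficulty bucket index vs. a made-up mean chosen K.
difficulty = [0, 1, 2, 3, 4]
mean_k = [1.2, 2.1, 3.9, 7.5, 14.8]
print(spearman(difficulty, mean_k))
```

A value of 1 means the mean chosen K never decreases as difficulty increases; it says nothing about how large the increases are.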
Citing
If you build on CASCADE, please cite the foundational works it descends from — at minimum, BD3-LM (Arriola et al. ICLR 2025) and LLaDA (Nie et al. 2025). See the lit notes for the canonical reference set. A CASCADE citation block will appear here once Phase 5/6 produces a paper.