CASCADE

A diffusion language model that decodes block by block, left to right, with KV-cache reuse and an entropy-adaptive denoising step count. It is designed to remove the autoregressive throughput ceiling without sacrificing streaming.

CASCADE stands for Causal Adaptive Streaming Cascaded Architecture for Diffusive Emission.

The thesis in one paragraph

Autoregressive language models are bottlenecked at decode time by sequentiality: token T+1 cannot start until token T's forward pass is done. Diffusion language models attack this directly by denoising spans in parallel, but pure non-AR diffusion has two open problems: no KV-cache reuse (every denoising step is a full forward pass over the noisy sequence) and inflexible output length. Block diffusion (BD3-LM, Arriola et al. ICLR 2025) addresses both by running diffusion within a block of size B and proceeding left to right across blocks. CASCADE pushes this to its strongest form by combining four pieces: block masked diffusion; KV-cache reuse both across and within blocks; a unifying objective that interpolates between AR and diffusion via a single knob; and, as the project's headline novelty, an entropy-adaptive per-block denoising step count K_b. For example, an easy block might denoise in 2 steps and a hard block in 16.
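The "diffusion within a block, left to right across blocks" pattern can be pictured as an attention mask. The sketch below is illustrative only and is not taken from the repository: `block_causal_mask` and `causal_mask` are hypothetical helper names, and it shows only the inference-time visibility rule (training-time masks in block-diffusion models can be more involved).

```python
def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed rule: a position sees every position in its own block and in all
    earlier blocks, and nothing in later blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len):
    """Standard lower-triangular causal mask, for comparison."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # With block_size=2, positions 0 and 1 see each other but not 2 or 3.
    for row in block_causal_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a block size of 1 each block holds a single token, so this rule collapses to the ordinary causal mask, which is the AR end of the spectrum.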

What's in this repo

The four CPU-implementable phases of the master plan (Phases 0 through 3) are complete and tested at fp64 precision. The reference attention/cache implementation produces outputs bit-identical to standard causal attention at block_size = 1, and it satisfies the cache-reuse equivalence property (cached and uncached forward passes agree) to within 1.57 × 10⁻¹⁵ across multiple layers with RoPE.
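To make the cache-reuse equivalence property concrete, here is a deliberately tiny sketch in pure Python (whose floats are fp64). It is an assumption-laden toy, not the project's test: it uses a single head and a single layer, omits RoPE, treats queries, keys, and values as given inputs, and the function names are hypothetical.

```python
import math
import random


def attend(q, keys, values):
    """Scaled dot-product attention for one query over the visible keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) / total for c in range(dim)]


def forward_no_cache(qs, ks, vs, block_size):
    """Recompute visibility from scratch for every position."""
    out = []
    for i, q in enumerate(qs):
        visible = min((i // block_size + 1) * block_size, len(qs))
        out.append(attend(q, ks[:visible], vs[:visible]))
    return out


def forward_with_cache(qs, ks, vs, block_size):
    """Process one block at a time, appending its keys/values to a running cache."""
    cache_k, cache_v, out = [], [], []
    for start in range(0, len(qs), block_size):
        stop = min(start + block_size, len(qs))
        cache_k.extend(ks[start:stop])
        cache_v.extend(vs[start:stop])
        for q in qs[start:stop]:
            out.append(attend(q, cache_k, cache_v))
    return out


rng = random.Random(0)


def rand_rows(n, d):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


qs, ks, vs = rand_rows(8, 4), rand_rows(8, 4), rand_rows(8, 4)
uncached = forward_no_cache(qs, ks, vs, block_size=4)
cached = forward_with_cache(qs, ks, vs, block_size=4)
max_diff = max(abs(x - y) for ra, rb in zip(uncached, cached) for x, y in zip(ra, rb))
print(f"max |cached - uncached| = {max_diff:.3e}")
```

In this toy both paths perform the same arithmetic in the same order, so agreement is easy; in a batched multi-layer implementation the summation order can differ, which is why a tolerance near fp64 machine precision is the meaningful target.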

Phase 0 — Bootstrap. Repo scaffold, 13 lit notes (2 anchors fully drafted), Phase-0 THINK doc, scaffold-completeness checker.
Phase 1 — Pretrain wiring. Block-causal multi-head attention with RoPE, BD3-LM corruption process, LLaDA 1/m-weighted masked cross-entropy, RMSNorm + SwiGLU blocks, full CascadeLM. End-to-end smoke training: loss falls 9.52 → 1.22 over 300 steps on a memorizable batch.
Phase 2 — Block cache & denoise. Pre-allocated multi-layer BlockCache; cache-aware forward at every level of the stack; block-by-block generate() with confidence-ranked unmask and EOS-stop. Multi-layer cache-vs-no-cache equivalence verified at 1.57 × 10⁻¹⁵ fp64.
Phase 3 — Adaptive K head. AdaptiveKHead over K_CHOICES = (1, 2, 4, 8, 16); EMABaseline; reinforce_step_count_loss. On a synthetic difficulty task the monotone-difficulty diagnostic (Appendix A.3 of the master prompt) achieves Spearman rank correlation 1.000 between true difficulty and mean chosen K per bucket.
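The confidence-ranked unmask mentioned under Phase 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's `generate()`: the `MASK` id, the function names, the even-split reveal schedule, and the fixed toy predictor are all hypothetical.

```python
import math

MASK = -1  # hypothetical mask-token id


def unmask_step(tokens, probs, n_reveal):
    """Reveal the n_reveal masked positions whose top prediction is most confident."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    # (confidence, position, argmax token) for every still-masked position
    candidates = [(max(probs[i]), i, probs[i].index(max(probs[i]))) for i in masked]
    candidates.sort(key=lambda c: c[0], reverse=True)
    out = list(tokens)
    for _, pos, tok in candidates[:n_reveal]:
        out[pos] = tok
    return out


def denoise_block(block_len, num_steps, predict):
    """Start fully masked and reveal tokens over at most num_steps steps."""
    tokens = [MASK] * block_len
    for step in range(num_steps):
        remaining = sum(t == MASK for t in tokens)
        if remaining == 0:
            break
        # Assumed schedule: split the remaining tokens evenly over the steps left,
        # so the final step always reveals whatever is still masked.
        n_reveal = math.ceil(remaining / (num_steps - step))
        tokens = unmask_step(tokens, predict(tokens), n_reveal)
    return tokens


def toy_predict(tokens):
    """Stand-in for a model: fixed per-position distributions over 3 tokens."""
    return [
        [0.90, 0.05, 0.05],
        [0.40, 0.35, 0.25],
        [0.10, 0.80, 0.10],
        [0.34, 0.33, 0.33],
    ]


print(denoise_block(block_len=4, num_steps=2, predict=toy_predict))
```

A real predictor would re-score the remaining masked positions after each reveal; the step count passed in as `num_steps` is the quantity the adaptive K head is meant to choose per block.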

The whole test suite (55 passing, 2 phased skips) runs in under 10 seconds on a CPU.
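One of the checks described above, the Phase 3 monotone-difficulty diagnostic, can be illustrated with a small stand-alone sketch. The data here is made up for illustration and is not the project's result; the helper names are hypothetical, and Spearman correlation is computed as Pearson correlation on average ranks.

```python
import math


def ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    return pearson(ranks(a), ranks(b))


# Hypothetical samples: difficulty bucket -> step counts chosen for blocks in it.
chosen_k = {
    0: [1, 1, 2],
    1: [2, 2, 4],
    2: [4, 8, 8],
    3: [8, 16, 16],
}
difficulties = sorted(chosen_k)
mean_k = [sum(chosen_k[d]) / len(chosen_k[d]) for d in difficulties]
rho = spearman(difficulties, mean_k)
print(f"mean K per bucket: {mean_k}, Spearman rho = {rho:.3f}")
```

A correlation of 1 only says that the mean chosen K increases strictly with difficulty across buckets; it says nothing about how large the gaps are.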

Get started

The project lives under cascade-lm/ in the repository. After cloning:

cd cascade-lm
python scripts/check_scaffold.py        # verify all 58 scaffold files present
python cascade/attention_reference.py   # 4 PASS lines — the correctness oracle
python -m pytest tests/ -q              # 55 passed, 2 skipped

To set up a Python environment:

uv venv --python 3.11
uv pip install -e ".[dev]"
# On a GPU box, also:
uv pip install -e ".[gpu]"              # adds flash-attn and triton

Status

Phase-by-phase completion state.
Phase	Status	Headline result
0 — Bootstrap	complete	58-file scaffold; 13 lit notes
1 — Pretrain wiring (CPU)	complete	Smoke train 9.52 → 1.22 (7.81×)
1 — Pretrain training	needs GPU	Train nano on 200M FineWeb-Edu tokens
2 — Cache + denoise	complete	Multi-layer cache equivalence 1.57e-15
3 — Adaptive K (CPU)	complete	Monotone-K Spearman 1.000
3 — Adaptive K (real corpus)	needs trained model	λ sweep, MMLU preservation
4 — AR-to-CASCADE distillation	needs teacher	Qwen-2.5-7B continued pretrain (~150B tokens)
5 — Scaling + ablations	deferred	Pareto frontier, scaling laws
6 — Paper + release	deferred	Manuscript + repro script

Citing

If you build on CASCADE, please cite the foundational works it descends from: at minimum, BD3-LM (Arriola et al. ICLR 2025) and LLaDA (Nie et al. 2025). See the lit notes for the canonical reference set. A CASCADE citation block will appear here once Phases 5 and 6 produce a paper.