CASCADE

A diffusion language model that decodes block-by-block left-to-right with KV-cache reuse and entropy-adaptive denoising step count, eliminating the autoregressive throughput ceiling without sacrificing streaming.

CASCADE expands to: Causal Adaptive Streaming Cascaded Architecture for Diffusive Emission

The thesis in one paragraph

Autoregressive language models are bottlenecked at decode time by sequentiality: token T+1 cannot start until token T's forward pass is done. Diffusion language models attack this directly by denoising spans in parallel, but pure non-AR diffusion has two open problems: no KV-cache reuse (every denoising step is a full forward pass over the noisy sequence) and inflexible output length. Block diffusion (BD3-LM, Arriola et al. ICLR 2025) recovers both by running diffusion within a block of size B and proceeding left-to-right across blocks. CASCADE pushes this to its strongest form by combining four ingredients: block masked diffusion; KV-cache reuse both across blocks and within blocks; a unifying objective that interpolates between AR and diffusion via a single knob; and, as the project's headline novelty, an entropy-adaptive per-block denoising step count K_b. Easy blocks denoise in 2 steps; hard blocks in 16.
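To make the entropy-adaptive step count concrete, here is a minimal sketch of how a per-block K_b could be derived. It is an illustration, not the repository's implementation: the function names, the use of the mean normalized entropy of the model's predicted token distributions, and the linear interpolation rule are all assumptions. Only the 2 and 16 step endpoints come from the text above.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_steps(block_probs, k_min=2, k_max=16):
    """Map a block's mean normalized entropy to a denoising step count.

    Hypothetical rule: interpolate linearly between k_min and k_max.
    block_probs holds one predicted distribution per position in the block.
    """
    max_entropy = math.log(len(block_probs[0]))  # entropy of a uniform guess
    mean_entropy = sum(shannon_entropy(p) for p in block_probs) / len(block_probs)
    frac = mean_entropy / max_entropy            # roughly in [0, 1]
    k = k_min + round(frac * (k_max - k_min))
    return max(k_min, min(k_max, k))             # clamp against rounding drift


# A fully confident block versus a maximally uncertain one (vocab of 4).
easy = [[1.0, 0.0, 0.0, 0.0]] * 4
hard = [[0.25, 0.25, 0.25, 0.25]] * 4
print(adaptive_steps(easy), adaptive_steps(hard))
```

Any monotone map from entropy to step count would serve the same purpose; the linear rule is just the simplest one to read.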

What's in this repo

The four CPU-implementable phases of the master plan are complete and tested at fp64 precision. The reference attention/cache implementation produces outputs bit-identical to standard causal attention at block_size = 1, and it satisfies the cache-reuse equivalence property to within 1.57e-15 across multiple layers with RoPE.
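The two properties can be illustrated with a toy, standard-library-only sketch. This is not the code in cascade/attention_reference.py: it assumes a single head, identity Q/K/V projections, no RoPE, and hypothetical helper names. It checks that a block-causal mask with block_size = 1 is the ordinary causal mask, and that processing blocks one at a time against cached earlier states matches a full-sequence pass.

```python
import math
import random


def block_causal_mask(n, block_size):
    """mask[i][j] is True when query i may attend to key j."""
    return [[j // block_size <= i // block_size for j in range(n)]
            for i in range(n)]


def attend(queries, keys, values, mask):
    """Single-head softmax attention on Python floats (fp64)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q, allowed in zip(queries, mask):
        kept = [(k, v) for k, v, ok in zip(keys, values, allowed) if ok]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k, _ in kept]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, (_, v) in zip(weights, kept)) / total
                    for d in range(len(q))])
    return out


def layer(h_new, h_ctx, mask):
    """Toy residual layer with identity Q/K/V projections (an assumption)."""
    mixed = attend(h_new, h_ctx, h_ctx, mask)
    return [[a + b for a, b in zip(row, m)] for row, m in zip(h_new, mixed)]


def full_forward(x, block_size, n_layers=2):
    """Run every layer over the whole sequence under the block-causal mask."""
    mask = block_causal_mask(len(x), block_size)
    h = x
    for _ in range(n_layers):
        h = layer(h, h, mask)
    return h


def cached_forward(x, block_size, n_layers=2):
    """Process one block at a time, reusing per-layer states of earlier blocks."""
    cache = [[] for _ in range(n_layers)]  # cache[l]: layer-l inputs seen so far
    outputs = []
    for start in range(0, len(x), block_size):
        h = x[start:start + block_size]
        for l in range(n_layers):
            cache[l].extend(h)
            # Every cached position is visible to the current block.
            mask = [[True] * len(cache[l]) for _ in h]
            h = layer(h, cache[l], mask)
        outputs.extend(h)
    return outputs


def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


random.seed(0)
x = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]

# block_size = 1 reduces the block-causal mask to the usual causal mask.
causal = [[j <= i for j in range(8)] for i in range(8)]
print(block_causal_mask(8, 1) == causal)
print(max_abs_diff(full_forward(x, 4), cached_forward(x, 4)))
```

The equivalence holds because, under a block-causal mask, the states of an earlier block never depend on later blocks, so they can be computed once and reused.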

The whole test suite (55 passing, 2 phased skips) runs in under 10 seconds on a CPU.

Get started

The project lives under cascade-lm/ in the repository. After cloning:

cd cascade-lm
python scripts/check_scaffold.py        # verify all 58 scaffold files present
python cascade/attention_reference.py   # 4 PASS lines — the correctness oracle
python -m pytest tests/ -q              # 55 passed, 2 skipped

To set up a Python environment:

uv venv --python 3.11
uv pip install -e ".[dev]"
# On a GPU box, also:
uv pip install -e ".[gpu]"              # adds flash-attn and triton

Status

Phase-by-phase completion state.
| Phase | Status | Headline result |
|---|---|---|
| 0: Bootstrap | complete | 58-file scaffold; 13 lit notes |
| 1: Pretrain wiring (CPU) | complete | Smoke train 9.52 → 1.22 (7.81×) |
| 1: Pretrain training | needs GPU | Train nano on 200M FineWeb-Edu tokens |
| 2: Cache + denoise | complete | Multi-layer cache equivalence 1.57e-15 |
| 3: Adaptive K (CPU) | complete | Monotone-K Spearman 1.000 |
| 3: Adaptive K (real corpus) | needs trained model | λ sweep, MMLU preservation |
| 4: AR-to-CASCADE distillation | needs teacher | Qwen-2.5-7B continued pretrain (~150B tokens) |
| 5: Scaling + ablations | deferred | Pareto frontier, scaling laws |
| 6: Paper + release | deferred | Manuscript + repro script |

Citing

If you build on CASCADE, please cite the foundational works it descends from: at minimum, BD3-LM (Arriola et al. ICLR 2025) and LLaDA (Nie et al. 2025). See the lit notes for the canonical reference set. A CASCADE citation block will appear here once Phase 5/6 produces a paper.