Literature notes
CASCADE descends from a specific set of 13 papers. Each note follows a uniform five-section template: corruption process, loss, inference structure, KV-cache use, throughput vs AR comparison. The two anchors (LLaDA and BD3-LM) are fully drafted; the other 11 are structured stubs flagged for verification against the actual papers.
A note on verification status
The stub notes draw from training-data memory of the cited papers. Every numeric claim is marked [verify] or TBD. Before any claim in a stub note enters a downstream document (a decision log, the eventual paper, or external talks), the note should be promoted to verified by reading the actual paper.
Anchor notes (fully drafted)
LLaDA — Large Language Diffusion Models (Nie et al. 2025)
First masked-diffusion LM trained from scratch at 8B parameters, competitive with LLaMA-3 8B on standard benchmarks. The strongest single piece of evidence that the whole CASCADE thesis is sound. Loss recipe (per-sequence mask ratio m ~ U[0, 1], 1/m-weighted CE on masked positions) is inherited directly. Critically, LLaDA's throughput does not hold up as sequence length grows, because every denoising step recomputes attention over the full sequence; the structural block-diffusion fix is what CASCADE bets on.
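The loss recipe can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not LLaDA's or CASCADE's code: `masked_diffusion_loss` and `log_probs_fn` are hypothetical names, the lower bound on m and the division by sequence length are assumptions made here, and `log_probs_fn` stands in for the model.

```python
import math
import random

def masked_diffusion_loss(log_probs_fn, tokens, mask_id, rng):
    """One-sample estimate of the 1/m-weighted masked cross-entropy for one sequence.

    log_probs_fn maps a corrupted token list to per-position log-probabilities
    (a list of lists indexed [position][token id]); it stands in for the model.
    """
    length = len(tokens)
    # Per-sequence mask ratio. The small lower bound is an assumption added
    # here to keep the 1/m weight finite.
    m = rng.uniform(1e-3, 1.0)
    # Mask each position independently with probability m.
    is_masked = [rng.random() < m for _ in range(length)]
    corrupted = [mask_id if hit else tok for tok, hit in zip(tokens, is_masked)]
    log_probs = log_probs_fn(corrupted)
    # Cross-entropy only on masked positions, against the original tokens.
    masked_nll = sum(-log_probs[i][tokens[i]] for i in range(length) if is_masked[i])
    # 1/m reweighting; dividing by length is a normalisation choice (assumption).
    return masked_nll / (m * length)

# Usage with a toy "model" that predicts a uniform distribution over 8 ids.
vocab_size = 8  # ids 0..6 are ordinary tokens, 7 plays the role of [MASK]
uniform = lambda corrupted: [[-math.log(vocab_size)] * vocab_size for _ in corrupted]
loss = masked_diffusion_loss(uniform, [0, 1, 2, 3, 4, 5], mask_id=7, rng=random.Random(0))
```

Unmasked positions contribute nothing to the loss, so a sequence drawn with a small m gives few training targets but each one is weighted up by 1/m.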
BD3-LM — Block Diffusion (Arriola et al., ICLR 2025)
The architectural ancestor CASCADE descends from most directly. Interpolates AR and full masked diffusion via a single knob — the block size B. At B = 1 it is exactly AR; at B = L it is exactly LLaDA-style full masked diffusion; intermediate B (~32) Pareto-dominates both endpoints. CASCADE adds two deltas: adaptive per-block step count (BD3-LM uses fixed K) and AR-to-CASCADE distillation (BD3-LM trains from scratch).
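The block-size knob is easiest to see in the attention visibility pattern. The sketch below is an illustration under stated assumptions: `block_causal_mask` is a hypothetical helper, and it shows only which positions may see which at generation time, not BD3-LM's full training mask.

```python
def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when position i may attend to position j.

    A position sees every position in its own block (bidirectional within
    the block) and every position in earlier blocks (causal across blocks).
    """
    return [
        [j // block_size <= i // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]

# B = 1 reduces to the ordinary causal (lower-triangular) mask.
ar_mask = block_causal_mask(4, 1)
# B = L makes every position visible to every other: full bidirectional attention.
full_mask = block_causal_mask(4, 4)
# An intermediate B gives bidirectional blocks chained causally.
mid_mask = block_causal_mask(4, 2)
```

With `seq_len=4` and `block_size=2`, positions 0 and 1 see only each other, while positions 2 and 3 see all four positions.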
Structured stubs (needs paper read-through)
D3PM — Discrete Denoising Diffusion (Austin et al. 2021)
The foundational paper for discrete-state diffusion. The absorbing-state variant — gradually replacing tokens with [MASK] — is the direct ancestor of LLaDA, BD3-LM, and CASCADE. The ELBO derivation here is the principled justification for the 1/m reweighting.
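A minimal sketch of the absorbing-state forward process, written as a one-step transition matrix. The function names, the constant per-step rate `beta`, and the convention that the last vocabulary index is [MASK] are assumptions for illustration, not details taken from the paper.

```python
def absorbing_transition(vocab_size, beta):
    """Row-stochastic one-step matrix; the last index plays the role of [MASK]."""
    mask = vocab_size - 1
    q = [[0.0] * vocab_size for _ in range(vocab_size)]
    for i in range(vocab_size):
        q[i][i] += 1.0 - beta  # token survives this step
        q[i][mask] += beta     # token is absorbed into [MASK]
    return q  # the [MASK] row ends up as a self-loop with probability 1

def matmul(a, b):
    """Plain square-matrix product, used to compose forward steps."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

q1 = absorbing_transition(4, 0.25)
q2 = matmul(q1, q1)  # two forward steps composed
```

After two steps a token survives with probability (1 - 0.25)^2 and is masked otherwise, which is the sense in which the mask ratio m is one minus the cumulative survival probability.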
SEDD — Score Entropy Discrete Diffusion (Lou et al. 2024)
Alternative loss formulation via score matching for discrete state spaces. Important as a competing theoretical framing, and a candidate ablation for CASCADE if the LLaDA-style CE loss underperforms.
MD4 — Simplified Masked Diffusion (Shi et al. 2024)
The cleanest re-derivation: linear absorbing schedule m(t) = t gives the 1/m-weighted CE loss as the negative ELBO. CASCADE's training loss is literally MD4's loss applied per-block.
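The step from the linear schedule to the 1/m weight can be written out as follows. This is a sketch from memory of the continuous-time bound and should be checked against the paper [verify]; the notation α_t (probability that a token is still unmasked at time t, so m(t) = 1 − α_t) is an assumption introduced here.

```latex
% Assumed notation: \alpha_t = P(token still unmasked at time t), m(t) = 1 - \alpha_t.
\begin{align*}
\mathcal{L}
  &= \int_0^1 \frac{\alpha_t'}{1-\alpha_t}\,
     \mathbb{E}_{q(x_t \mid x_0)}\Big[\sum_{i:\,x_t^i=\texttt{[MASK]}}
       \log p_\theta(x_0^i \mid x_t)\Big]\,dt \\
  &= \int_0^1 \frac{1}{t}\,
     \mathbb{E}_{q(x_t \mid x_0)}\Big[\sum_{i:\,x_t^i=\texttt{[MASK]}}
       -\log p_\theta(x_0^i \mid x_t)\Big]\,dt
     \qquad (\alpha_t = 1-t,\ \alpha_t' = -1) \\
  &= \mathbb{E}_{t \sim U[0,1]}\,\mathbb{E}_{q(x_t \mid x_0)}
     \Big[\frac{1}{m(t)} \sum_{i\ \text{masked}} \mathrm{CE}_i\Big].
\end{align*}
```

The negative sign from α_t' cancels the sign of the log-likelihood, leaving a positive cross-entropy on masked positions weighted by 1/t = 1/m.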
LLaDA-V / LLaDA 1.5 / 2.0 / MoE (You et al., Nie et al. 2025)
Family of LLaDA follow-ups: multimodal (V), MoE, dense scaling. Robustness evidence — masked diffusion works across scales and architecture additions.
Dream-7B — Diffusion via AR Adaptation (Ye et al. 2025)
The blueprint for CASCADE Phase 4. Starts from Qwen-2.5-7B and continues pretraining on the masked-diffusion objective for ~150B tokens [verify], recovering most of the AR teacher's quality at a fraction of the from-scratch cost.
Fast-dLLM and v2 — KV-cache acceleration (Wu et al. 2025)
Training-free accelerator for masked-diffusion LMs: caches K/V at committed positions across denoising steps. The within-block analog of CASCADE's across-block cache. The structural threat: if Fast-dLLM v2 alone is fast enough, CASCADE's pitch must lean on streaming preservation.
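A toy cost model makes the saving from caching committed positions concrete. Everything here is an assumption for illustration: `kv_projections` is a hypothetical function, the count ignores not-yet-generated suffix positions, and the cached variant assumes the committed prefix is recomputed once per block. It is not a description of Fast-dLLM's actual implementation.

```python
def kv_projections(prompt_len, num_blocks, block_size, steps_per_block, use_cache):
    """Count per-position K/V computations needed to decode num_blocks blocks.

    Toy model: without a cache every denoising step recomputes K/V for all
    committed positions plus the active block; with a cache the committed
    positions are computed once per block and only the active block is
    recomputed at each step.
    """
    total = 0
    committed = prompt_len
    for _ in range(num_blocks):
        if use_cache:
            total += committed + steps_per_block * block_size
        else:
            total += steps_per_block * (committed + block_size)
        committed += block_size  # the finished block joins the committed prefix
    return total

no_cache = kv_projections(prompt_len=8, num_blocks=2, block_size=4,
                          steps_per_block=4, use_cache=False)
with_cache = kv_projections(prompt_len=8, num_blocks=2, block_size=4,
                            steps_per_block=4, use_cache=True)
```

In this model the uncached cost grows with the committed length at every step, so the gap between the two counts widens as the sequence gets longer.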
SSD-LM — Semi-autoregressive Simplex Diffusion (Han et al. 2022)
The 2022 antecedent to BD3-LM, in continuous (simplex) state space. Historical interest: BD3-LM cites it as the precedent for block diffusion. Mostly relevant as a “why discrete-state won” data point.
Yin et al. 2025 — From slow bidirectional to fast AR video diffusion
The closest analog to CASCADE's thesis in the video domain — convert bidirectional diffusion to AR-along-one-axis. Cross-modal validation of the “block-causal + diffusion-within-block” pattern.
Li et al. 2025 — Survey on Diffusion Language Models (VILA-Lab)
2025 field survey. Useful for coverage check (anything cited but missing from CASCADE's lit set is a gap) and for the related-work matrix in Phase 6's paper.
Mercury — Inception Labs production diffusion LM
The first production-deployed diffusion language model, advertised with very large throughput multiples over AR baselines [verify]. The visible competitive benchmark for CASCADE. Open question: is Mercury block-diffusion (in which case CASCADE's adaptive-K is the only delta) or full diffusion + Fast-dLLM-style cache (in which case streaming is the differentiator)?
BD3-LM § 5 — Block-vs-token ablations
The empirical evidence for the block-size knob. Sweeps B from 1 (= AR) to L (= full diffusion); shows intermediate B ≈ 32 [verify] Pareto-dominates both endpoints. CASCADE will rerun these ablations to provide head-to-head numbers.
How the notes are structured
Each lit note has the same five mandatory sections:
- Corruption process
  - How is training data corrupted (forward process)? Schedule, per-position vs. per-sequence, independence.
- Loss
  - The exact loss function. Where does each term come from? Reweighting? What ELBO does it correspond to?
- Inference structure
  - The exact generation procedure. Steps. What changes per step. Determinism vs. sampling. Length flexibility.
- KV-cache use
  - Cached or not? Which part of K/V is cached, how invalidated, cost saving? If no cache, why: what structural blocker?
- Throughput vs AR
  - Honest scaling characterization. Does the advantage hold at long sequences, at large batches, in production settings?
The uniform structure makes cross-paper comparison mechanical, which means the related-work matrix in Phase 6's paper essentially writes itself from these sections.
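One way to make that mechanical step concrete is to hold each note as a record with the five mandatory sections as fields. The sketch below is hypothetical: the `LitNote` class, the field names, and the helper functions are illustrative and not part of the CASCADE repository.

```python
from dataclasses import dataclass, fields

@dataclass
class LitNote:
    """One literature note: a paper label plus the five mandatory sections."""
    paper: str
    corruption_process: str = ""
    loss: str = ""
    inference_structure: str = ""
    kv_cache_use: str = ""
    throughput_vs_ar: str = ""

# Section names in template order, derived from the dataclass itself.
SECTIONS = [f.name for f in fields(LitNote) if f.name != "paper"]

def missing_sections(note):
    """Names of mandatory sections that are still empty in this note."""
    return [name for name in SECTIONS if not getattr(note, name).strip()]

def related_work_matrix(notes):
    """Header row plus one row per paper; empty sections show up as 'TBD'."""
    header = ["paper"] + SECTIONS
    rows = [[n.paper] + [getattr(n, s) or "TBD" for s in SECTIONS] for n in notes]
    return [header] + rows

stub = LitNote(paper="LLaDA", loss="1/m-weighted CE on masked positions")
matrix = related_work_matrix([stub])
todo = missing_sections(stub)
```

Because every note shares the same fields, a stub with unfilled sections is visible both in `missing_sections` and as "TBD" cells in the matrix.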