Literature notes

CASCADE descends from a specific set of 13 papers. Each note follows a uniform five-section template: corruption process, loss, inference structure, KV-cache use, throughput vs AR comparison. The two anchors (LLaDA and BD3-LM) are fully drafted; the other 11 are structured stubs flagged for verification against the actual papers.

A note on verification status

The stub notes draw from training-data memory of the cited papers. Every numeric claim is marked [verify] or TBD. Before any claim in a stub note enters a downstream document (a decision log, the eventual paper, or external talks), the note should be promoted to verified by reading the actual paper.

Anchor notes (fully drafted)

LLaDA — Large Language Diffusion Models (Nie et al. 2025)

The first masked-diffusion LM trained from scratch at 8B parameters, reported as competitive with LLaMA-3 8B on standard benchmarks. This is the strongest single piece of evidence that the CASCADE thesis is sound. The loss recipe is inherited directly: sample one mask ratio per sequence, m ~ U[0, 1], and apply 1/m-weighted CE on the masked positions. Critically, LLaDA's throughput does not scale well with sequence length: without a KV cache, every denoising step recomputes attention over the full sequence. Block diffusion is the structural fix CASCADE bets on.
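The loss recipe above can be sketched in a few lines of NumPy. This is a minimal illustration rather than LLaDA's code: `MASK_ID`, `model_fn`, the `m_min` floor on the mask ratio, and the division by sequence length are all assumptions made for the example.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for [MASK]; real vocabularies differ


def corrupt(tokens, m, rng):
    """Mask each position independently with probability m."""
    is_masked = rng.random(tokens.shape) < m
    return np.where(is_masked, MASK_ID, tokens), is_masked


def masked_ce_loss(logits, targets, is_masked, m):
    """Cross-entropy on masked positions, weighted by 1/m."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    # Assumption: normalize by sequence length so losses are comparable.
    return float((token_nll * is_masked).sum() / (m * len(targets)))


def training_loss(model_fn, tokens, rng, m_min=1e-3):
    """One loss evaluation for a single sequence."""
    m = rng.uniform(m_min, 1.0)  # one mask ratio per sequence; floor avoids 1/0
    noisy, is_masked = corrupt(tokens, m, rng)
    logits = model_fn(noisy)  # expected shape: (len(tokens), vocab_size)
    return masked_ce_loss(logits, tokens, is_masked, m)
```

Unmasked positions contribute nothing to the loss, and the 1/m factor upweights sequences where few tokens were masked.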


BD3-LM — Block Diffusion (Arriola et al., ICLR 2025)

The architectural ancestor CASCADE descends from most directly. Interpolates AR and full masked diffusion via a single knob — the block size B. At B = 1 it is exactly AR; at B = L it is exactly LLaDA-style full masked diffusion; intermediate B (~32) Pareto-dominates both endpoints. CASCADE adds two deltas: adaptive per-block step count (BD3-LM uses fixed K) and AR-to-CASCADE distillation (BD3-LM trains from scratch).
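The block-size knob is easiest to see in the attention visibility pattern. The sketch below is an illustration of that pattern only, not BD3-LM's training mask; the function name and the assumption that a position sees its own block plus all earlier blocks are choices made for the example.

```python
import numpy as np


def block_causal_mask(seq_len, block_size):
    """Return a boolean mask where mask[i, j] is True if i may attend to j.

    Positions attend bidirectionally inside their own block and causally
    to every earlier block.
    """
    block_of = np.arange(seq_len) // block_size
    return block_of[:, None] >= block_of[None, :]
```

With `block_size=1` this is the usual lower-triangular causal mask, and with `block_size=seq_len` every position sees every other, which mirrors the two endpoints described above.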


Structured stubs (needs paper read-through)

D3PM — Discrete Denoising Diffusion (Austin et al. 2021)

The foundational paper for discrete-state diffusion. The absorbing-state variant — gradually replacing tokens with [MASK] — is the direct ancestor of LLaDA, BD3-LM, and CASCADE. The ELBO derivation here is the principled justification for the 1/m reweighting.

SEDD — Score Entropy Discrete Diffusion (Lou et al. 2024)

Alternative loss formulation via score matching for discrete state spaces. Important as a competing theoretical framing; ablation possibility for CASCADE if the LLaDA-style CE loss underperforms.

MD4 — Simplified and Generalized Masked Diffusion (Shi et al. 2024)

The cleanest re-derivation: with the linear absorbing schedule m(t) = t, the negative ELBO reduces to the 1/m-weighted CE loss. CASCADE's training loss is MD4's loss applied per block.
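Written out, the objective described here has roughly the following shape. The notation is assumed for this note (x_0 is the clean sequence, x_m the corrupted one at mask ratio m, B the block size) and should be checked against the paper's own symbols on read-through. The second line is one plausible reading of "applied per block", conditioning each block on the clean blocks before it.

```latex
\begin{align}
\mathcal{L}(\theta)
  &= \mathbb{E}_{x_0}\,
     \mathbb{E}_{m \sim U(0,1]}\,
     \mathbb{E}_{x_m \sim q(\cdot \mid x_0, m)}
     \left[ \frac{1}{m} \sum_{i=1}^{L}
       \mathbf{1}\!\left[x_m^{i} = \texttt{[MASK]}\right]
       \left(-\log p_\theta\!\left(x_0^{i} \mid x_m\right)\right)
     \right] \\
\mathcal{L}_{\text{block}}(\theta)
  &= \mathbb{E}_{x_0} \sum_{b=1}^{L/B}
     \mathbb{E}_{m \sim U(0,1]}\,
     \mathbb{E}_{x_m^{b}}
     \left[ \frac{1}{m} \sum_{i \in \text{block } b}
       \mathbf{1}\!\left[x_m^{i} = \texttt{[MASK]}\right]
       \left(-\log p_\theta\!\left(x_0^{i} \mid x_m^{b},\, x_0^{<b}\right)\right)
     \right]
\end{align}
```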

LLaDA-V / LLaDA 1.5 / 2.0 / MoE (You et al., Nie et al. 2025)

Family of LLaDA follow-ups: multimodal (V), MoE, dense scaling. Robustness evidence — masked diffusion works across scales and architecture additions.

Dream-7B — Diffusion via AR Adaptation (Ye et al. 2025)

The blueprint for CASCADE Phase 4. Starts from Qwen-2.5-7B and continues pretraining on the masked-diffusion objective for ~150B tokens [verify], recovering most of the AR teacher's quality at a fraction of the from-scratch cost.

Fast-dLLM and v2 — KV-cache acceleration (Wu et al. 2025)

Training-free accelerator for masked-diffusion LMs: caches K/V at committed positions across denoising steps. The within-block analog of CASCADE's across-block cache. The structural threat: if Fast-dLLM v2 alone is fast enough, CASCADE's pitch must lean on streaming preservation.
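The caching idea can be illustrated with a toy single-projection example. This is a sketch of the general pattern under stated assumptions, not Fast-dLLM's implementation: the class name, the `step` signature, and the rule "freeze a row once its position is committed" are all hypothetical. In a multi-layer model the hidden states at committed positions still depend on context that keeps changing, so reuse of this kind is an approximation; whether and how the paper refreshes the cache needs checking on read-through.

```python
import numpy as np


class CommittedKVCache:
    """Keeps K/V rows for committed positions across denoising steps."""

    def __init__(self, seq_len, d_head):
        self.k = np.zeros((seq_len, d_head))
        self.v = np.zeros((seq_len, d_head))
        self.cached = np.zeros(seq_len, dtype=bool)

    def step(self, hidden, committed, w_k, w_v):
        """Return full K, V for this step and how many rows were recomputed."""
        todo = ~self.cached  # only positions without a frozen row
        self.k[todo] = hidden[todo] @ w_k
        self.v[todo] = hidden[todo] @ w_v
        # Rows at committed positions are reused from now on.
        self.cached |= committed
        return self.k, self.v, int(todo.sum())
```

The returned count makes the saving visible: once a position is committed, later steps project only the positions that are still masked.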

SSD-LM — Semi-autoregressive Simplex Diffusion (Han et al. 2022)

The 2022 antecedent to BD3-LM, in continuous (simplex) state space. Historical interest: BD3-LM cites it as the precedent for block diffusion. Mostly relevant as a “why discrete-state won” data point.

Yin et al. 2025 — From slow bidirectional to fast AR video diffusion

The closest analog to CASCADE's thesis in the video domain — convert bidirectional diffusion to AR-along-one-axis. Cross-modal validation of the “block-causal + diffusion-within-block” pattern.

Han et al. 2025 — Survey on Diffusion Language Models (VILA-Lab)

2025 field survey. Useful for coverage check (anything cited but missing from CASCADE's lit set is a gap) and for the related-work matrix in Phase 6's paper.

Mercury — Inception Labs production diffusion LM

The first production-deployed diffusion language model, advertised with very large throughput multiples over AR models [verify]. The visible competitive benchmark for CASCADE. Open question: is Mercury block-diffusion (in which case CASCADE's adaptive-K is the only delta) or full diffusion + Fast-dLLM-style cache (in which case streaming is the differentiator)?

BD3-LM § 5 — Block-vs-token ablations

The empirical evidence for the block-size knob. Sweeps B from 1 (= AR) to L (= full diffusion) and shows that intermediate B ≈ 32 [verify] Pareto-dominates both endpoints. CASCADE will rerun these ablations to provide head-to-head numbers.
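For the rerun, the Pareto-dominance claim can be checked mechanically. The helper below is a hypothetical sketch: it assumes each configuration is summarized as a (quality, throughput) pair where higher is better on both axes, so a metric like perplexity would need to be negated or inverted first.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the names of configurations that no other configuration dominates."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]
```

An intermediate block size Pareto-dominates both endpoints only if `dominates` returns True against the B = 1 and the B = L measurements alike; the inputs have to come from the rerun ablations.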

How the notes are structured

Each lit note has the same five mandatory sections:

Corruption process
How is training data corrupted (forward process)? Schedule, per-position vs. per-sequence, independence.
Loss
The exact loss function. Where does each term come from? Reweighting? What ELBO does it correspond to?
Inference structure
The exact generation procedure. Steps. What changes per step. Determinism vs. sampling. Length flexibility.
KV-cache use
Cached or not? Which part of K/V is cached, how invalidated, cost saving? If no cache, why — what structural blocker?
Throughput vs AR
Honest scaling characterization. Does the advantage hold at long sequences, at large batches, in production settings?

The uniform structure makes cross-paper comparison mechanical, so the related-work matrix in Phase 6's paper can be assembled directly from these sections.
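The template and the matrix it feeds can be sketched as a small data structure. The class, field, and function names are hypothetical, chosen to mirror the five section titles; the `verified` flag is an assumption that reflects the stub-versus-verified distinction described earlier.

```python
from dataclasses import dataclass

SECTIONS = (
    "corruption_process",
    "loss",
    "inference_structure",
    "kv_cache_use",
    "throughput_vs_ar",
)


@dataclass
class LitNote:
    paper: str
    corruption_process: str = ""
    loss: str = ""
    inference_structure: str = ""
    kv_cache_use: str = ""
    throughput_vs_ar: str = ""
    verified: bool = False  # stubs stay False until the paper has been read


def missing_sections(note):
    """List the mandatory sections that are still empty."""
    return [s for s in SECTIONS if not getattr(note, s).strip()]


def related_work_matrix(notes):
    """One header row plus one row per note; empty sections show as TBD."""
    header = ["paper", *SECTIONS]
    rows = [
        [n.paper, *(getattr(n, s).strip() or "TBD" for s in SECTIONS)]
        for n in notes
    ]
    return [header, *rows]
```

Because every note exposes the same five fields, the matrix is a direct projection of the notes, and `missing_sections` shows which stubs still need work.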