# LLaDA — Large Language Diffusion Models (Nie et al. 2025)

> **Verification status:** drafted from training-data recall of the LLaDA paper and follow-up works (LLaDA 1.5 / 2.0, LLaDA-MoE, LLaDA-V). **Must be checked against the actual paper before any claim here is cited in `docs/architecture.md` or the eventual paper.** Specific numbers (parameter counts, training-token budgets, benchmark scores) are particularly suspect; mark with `[verify]` any claim that depends on them.

## One-paragraph summary

LLaDA is the first masked-diffusion language model trained from scratch at 8B parameters that is competitive with LLaMA-3 8B on standard benchmarks `[verify]`. It establishes that the masked-diffusion objective (a random per-sequence masking ratio, with cross-entropy on masked positions weighted by `1/m`) is a viable pretraining objective at scale, not just a curiosity for small models or toy token-prediction tasks. LLaDA is the strongest single piece of evidence that the CASCADE thesis (block diffusion as a real architecture, not just an academic exercise) is sound.

## Corruption process

- For each training sequence, sample a single masking ratio `m ∼ U[0, 1]` (uniform on `[0, 1]` — *not* a discrete schedule).
- For each token position independently, replace with `[MASK]` with probability `m`.
- The corrupted sequence has roughly `mL` masked positions (binomial, mean `mL`, variance `m(1-m)L`).
- Crucially, `m` is sampled *per sequence*, not per token. This means within a sequence the masking is i.i.d., but across sequences the corruption rate varies widely.

The motivating insight: a single `m` per training example exposes the model to a *spectrum* of noise levels at training time. This is what makes multi-step denoising at inference time work — every intermediate noise level the inference loop visits was in-distribution during training.
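The corruption process described above can be sketched in a few lines. This is a minimal illustration rather than LLaDA's actual code: the `corrupt` helper and the string `MASK` placeholder are names chosen here, and a real implementation would operate on batched token-id tensors.

```python
import random

MASK = "[MASK]"  # placeholder symbol; the real mask token id is not specified in these notes


def corrupt(tokens, rng):
    """Apply the forward (corruption) process to one sequence.

    Samples a single masking ratio m ~ U[0, 1] for the whole sequence, then
    masks each position independently with probability m.
    Returns (corrupted_tokens, m).
    """
    m = rng.random()  # one ratio per sequence, not per token
    corrupted = [MASK if rng.random() < m else tok for tok in tokens]
    return corrupted, m


rng = random.Random(0)
sequence = ["the", "cat", "sat", "on", "the", "mat"]
for _ in range(3):
    noisy, m = corrupt(sequence, rng)
    print(f"m={m:.2f}", noisy)
```

Running it a few times shows the spread of corruption levels: some draws leave the sequence nearly intact, others mask almost everything.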

## Loss

Cross-entropy on the masked positions only, weighted by `1/m`:

```
L = E_{x_0, m} [ -(1/m) * sum_{i : x_t^i = MASK} log p_theta(x_0^i | x_t) ]
```

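The `1/m` factor is the ELBO weight for absorbing-state discrete diffusion (derived in CASCADE `03_CASCADE.md` Appendix A.1, following Sahoo et al. 2024 (MDLM) / Shi et al. 2024 (MD4); `[verify]` which of the two the appendix follows). Without it, high-`m` sequences dominate the unweighted sum simply because they contain more masked positions (about `mL` of them). With it, the expected contribution of a sequence is on the same per-token scale at every `m`.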
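A small sketch of the weighted loss for a single sequence, under simplifying assumptions: the per-position log-probabilities of the true tokens are taken as given (in practice they come from the model), and `masked_diffusion_loss` is a name introduced here for illustration.

```python
import math


def masked_diffusion_loss(true_token_logprobs, is_masked, m):
    """1/m-weighted cross-entropy over the masked positions of one sequence.

    true_token_logprobs[i] stands for log p_theta(x_0^i | x_t), assumed to be
    supplied by a model. is_masked[i] is True where x_t^i is [MASK].
    """
    if not 0.0 < m <= 1.0:
        raise ValueError("m must be in (0, 1]")
    nll = -sum(lp for lp, masked in zip(true_token_logprobs, is_masked) if masked)
    return nll / m


# Toy check: constant per-token loss c, with exactly m*L positions masked.
L, c = 100, 2.0
logprobs = [-c] * L
for m in (0.1, 0.5, 0.9):
    n_masked = round(m * L)
    flags = [i < n_masked for i in range(L)]
    print(m, masked_diffusion_loss(logprobs, flags, m))
```

In this toy setting the weighted loss comes out at about `c * L` for every `m`, whereas the unweighted sum would scale with the number of masked positions.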

Stability trick (not always stated in the paper, but used in practice): clip the sampling to `m ∼ U[m_min, 1]` with `m_min ≈ 1e-3`. Otherwise the `1/m` weight, and with it the variance of the loss, blows up on the few sequences where `m` is sampled near 0. The bias is `O(m_min log(1/m_min))` nats, smaller than batch-to-batch noise.
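The clipped sampler is a one-line affine map of a uniform draw. The sketch below assumes the `m_min = 1e-3` value quoted above, which is not a verified default, and `sample_mask_ratio` is a hypothetical helper name.

```python
import random


def sample_mask_ratio(rng, m_min=1e-3):
    """Draw m ~ U[m_min, 1] by rescaling a U[0, 1) sample."""
    return m_min + (1.0 - m_min) * rng.random()


rng = random.Random(0)
ratios = [sample_mask_ratio(rng) for _ in range(100_000)]
# With the clip in place, the loss weight 1/m can never exceed 1/m_min.
print(min(ratios), max(1.0 / m for m in ratios))
```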

## Inference structure

Iterative denoising on a fully-masked sequence of fixed target length `L`:

1. Initialize all `L` positions as `[MASK]`.
2. For `k = 1 ... K` steps:
    a. One forward pass over the (partly-masked) sequence; get per-position distributions over the vocabulary.
    b. Among the still-masked positions, pick the `L/K` most confident (lowest entropy of the predicted distribution, or highest max-prob).
    c. Sample (or argmax) the predicted token at each of those positions; commit it (replace `[MASK]` with the chosen token).
    d. Optionally, *remask* positions whose committed prediction is low-confidence — a "second guess" mechanism.
3. After `K` steps, every position is committed; that's the generated sequence.
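The loop above can be sketched end to end with a stand-in model. Everything here is illustrative: `toy_model`, `VOCAB`, and `generate` are hypothetical names, the stub returns deterministic pseudo-random distributions instead of running a Transformer, confidence is taken as max-prob with argmax decoding, and the optional remasking step (d) is omitted.

```python
import random

MASK = None  # sentinel for a masked position
VOCAB = ["a", "b", "c", "d"]


def toy_model(seq):
    """Hypothetical stand-in for the denoiser: one distribution per position.

    A real model would be a bidirectional Transformer; this stub only produces
    deterministic pseudo-random distributions so that the loop can be run.
    """
    n_committed = sum(tok is not MASK for tok in seq)
    dists = []
    for i in range(len(seq)):
        rng = random.Random(i * 31 + n_committed)
        weights = [rng.random() for _ in VOCAB]
        total = sum(weights)
        dists.append([w / total for w in weights])
    return dists


def generate(length, num_steps, model=toy_model):
    """Confidence-ordered unmasking, committing about length/num_steps positions per step."""
    seq = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        dists = model(seq)  # one forward pass over the whole sequence
        # Spread the remaining masked positions evenly over the remaining steps.
        steps_left = num_steps - step
        n_commit = -(-len(masked) // steps_left)  # ceiling division
        # Most confident first, using max probability as the confidence score.
        masked.sort(key=lambda i: max(dists[i]), reverse=True)
        for i in masked[:n_commit]:
            best = max(range(len(VOCAB)), key=lambda v: dists[i][v])
            seq[i] = VOCAB[best]  # argmax decoding
    return seq


print(generate(length=8, num_steps=4))
```

Spreading the remaining masked positions over the remaining steps guarantees that nothing is left masked after the last step, even when `length` is not a multiple of `num_steps`.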

The target length `L` must be fixed up front. Variable-length output is awkward (workarounds: generate an `[EOS]` token, or generate multiple lengths and pick the best). This length-inflexibility is one of the two open problems with non-block masked diffusion (`03_CASCADE.md § 1`).

`K` is a hyperparameter typically in `{16, 32, 64, 128}` for L=512. Larger `K` ≈ better quality up to a point, then plateaus. Smaller `K` ≈ faster but rougher generations.

## KV-cache use

**Almost none.** This is the big problem.

Every denoising step is a full forward pass over the entire `L`-length sequence. Each step changes only some token positions, but attention is bidirectional, so above the first layer the K/V at every position depends on the tokens at all other positions; once any token changes, every position's K/V is stale. The cache therefore cannot be reused across steps the way AR reuses it across tokens.

Within a step, of course, all positions attend to all others bidirectionally — the attention computation itself is standard self-attention. But there is no "growing-prefix" KV reuse the way an AR LM has.

Fast-dLLM is the training-free fix for this: it observes that the K/V of *committed* (no-longer-masked) positions change little over the remaining denoising steps and can be cached approximately, while the K/V of the remaining masked positions are still recomputed. Fast-dLLM v2 builds on the same idea but, from recall, relies on fine-tuning rather than being training-free `[verify]`. This recovers a chunk of throughput but is fundamentally a partial fix, because the proportion of committed vs. masked positions evolves through the denoising loop.
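A back-of-the-envelope count makes the "partial fix" concrete. The sketch assumes an idealized schedule (`L` divisible by `K`, exactly `L/K` positions committed per step) and treats cached K/V of committed positions as exactly reusable, which is an idealization; `positions_recomputed` is a name introduced here.

```python
def positions_recomputed(length, num_steps, cache_committed):
    """Count per-position K/V recomputations over one full denoising run.

    Without caching, every step recomputes all positions. With caching,
    only the positions still masked at the start of the step are recomputed.
    """
    if length % num_steps != 0:
        raise ValueError("this sketch assumes length is divisible by num_steps")
    per_step = length // num_steps
    total = 0
    masked = length
    for _ in range(num_steps):
        total += masked if cache_committed else length
        masked -= per_step
    return total


L, K = 512, 32
print(positions_recomputed(L, K, cache_committed=False))  # K * L
print(positions_recomputed(L, K, cache_committed=True))   # L * (K + 1) / 2
```

Under these assumptions caching committed positions saves roughly half the work, a constant factor rather than a change in how the cost scales with `K` and `L`.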

**Block diffusion (BD3-LM / CASCADE)** is the structural fix: by making the model block-causal across blocks, committed *blocks* (not individual tokens) have permanent K/V that's reused for all future block generations. See `bd3lm.md`.

## Throughput vs. AR comparison

Reported in the LLaDA paper (numbers are approximate from memory; `[verify]`):

- LLaDA 8B at K=32 steps: roughly *comparable* wall-clock to LLaMA-3 8B AR generation of the same length on the same hardware, with some configurations *faster* (decode is parallel within a step) and some *slower* (the K full-sequence forward passes can dominate at long L).
- At K=16: noticeably faster than AR, with measurable quality loss.
- The crossover where LLaDA beats AR on wall-clock is roughly `K ≤ L/2` for short sequences and worsens at long L (because each step is O(L)).

The crucial qualitative point: LLaDA's throughput advantage *does not grow* with sequence length. Pure masked diffusion costs O(K·L) token-position forward computations per generation, vs. AR's O(L) with a KV cache (though each AR step is memory-bound and underutilizes the GPU). At long L, that utilization advantage matters less and the O(K) vs. O(1) cost per generated token catches up.

This is *exactly* the problem that block diffusion solves: each block costs O(K·B + K·b·B), where `b` is the prior block count (reading the two terms as `K` passes over the `B` in-block positions plus `K` reads of the `b·B` cached prefix positions). The per-block cost therefore grows linearly with prior context, the same scaling as AR, but each block emits `B` tokens in `K` parallel steps instead of `B` sequential ones.
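The per-block cost expression can be tabulated directly. The cost unit here (positions touched per block: in-block positions recomputed plus cached prefix positions read) is an assumption about how to read the formula, and `block_cost` is a hypothetical helper.

```python
def block_cost(block_size, num_steps, prior_blocks):
    """Positions touched while denoising one block, in the O(K*B + K*b*B) sense."""
    computed = num_steps * block_size                    # in-block positions, recomputed every step
    cache_reads = num_steps * prior_blocks * block_size  # cached prefix positions, read every step
    return computed + cache_reads


B, K = 32, 8
for b in (0, 1, 2, 4, 8):
    print(b, block_cost(B, K, b))
```

The cost rises by a fixed `K * B` for each additional prior block, which is the linear growth in prior context described above, while the number of sequential steps per block stays at `K` rather than `B`.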

## Why this paper matters for CASCADE

- **Proof of concept:** masked diffusion at 8B works. Without LLaDA, CASCADE's whole premise is speculative.
- **Loss recipe:** the `1/m` weighting and random-per-sequence `m` are inherited directly. CASCADE doesn't innovate on the loss; it innovates on the *structure* (block-causal) and the *step count* (adaptive K_b).
- **Distillation precedent:** LLaDA is trained from scratch, but its existence plus Dream-7B's successful adaptation from Qwen weights together suggest that distillation from an AR teacher (CASCADE Phase 4) is plausible.
- **Cautionary data point:** LLaDA's throughput is *not* a wall-clock win over AR at production scales without block structure. CASCADE's pitch is "block structure recovers the AR throughput curve while keeping the parallel-within-block speedup". If we can't beat LLaDA's own throughput in the block-diffusion regime, the project has failed.

## Open questions to check against the paper

- [ ] Is the `1/m` weighting stated explicitly in the LLaDA paper or only in related work (Sahoo et al. MDLM, Shi et al. MD4, etc.)?
- [ ] What `K` does LLaDA actually use for the reported throughput numbers?
- [ ] What is the `m_min` clip (if any) in the published code?
- [ ] LLaDA-MoE and LLaDA-V — do they change the corruption process or only the architecture?

## References

- Nie et al. 2025: *Large Language Diffusion Models* (LLaDA original)
- 2025: LLaDA 1.5 / 2.0 (follow-ups; first authors, and whether these are scaling or post-training work, `[verify]`)
- 2025: LLaDA-MoE (MoE variant)
- You et al. 2025: LLaDA-V (vision)
- Sahoo et al. 2024: MDLM; Shi et al. 2024: MD4 (the loss derivation done cleanly)