# BD3-LM: Block Discrete Denoising Diffusion Language Models (Arriola et al., ICLR 2025)

> **Verification status:** drafted from training-data recall of the BD3-LM paper. **Must be checked against the actual paper before citation.** Specific numbers (block sizes evaluated, downstream benchmark numbers) are `[verify]`-marked.

## One-paragraph summary

BD3-LM is the architectural ancestor CASCADE descends from most directly. It interpolates between fully autoregressive and fully masked-diffusion language models via a single knob: the block size `B`. At `B = 1`, BD3-LM is exactly AR. At `B = L` (the full sequence as one block), it is exactly LLaDA-style masked diffusion. At intermediate `B` (the paper evaluates `B ∈ {4, 8, 16, 32, 64, 128}` `[verify]`), it combines the two: KV-cache reuse across blocks (the AR property) and parallel decoding within blocks (the diffusion property). CASCADE adds an adaptive per-block step count and an AR-to-CASCADE distillation recipe on top of this foundation.

## Corruption process

Given a sequence of length `L = nB` partitioned into `n` blocks of size `B`:

- For each training step, pick a block index `b ∼ Uniform{0, ..., n-1}` (the *target block*).
- Sample a masking ratio `m ∼ U[0, 1]` (LLaDA-style; same recipe as `llada.md`).
- For each position `i` in block `b`, replace with `[MASK]` with probability `m`.
- Blocks `< b` are kept **clean** (this provides the "left-to-right" context that lets KV-cache reuse work at inference).
- Blocks `> b` are **fully masked** in some setups, **dropped** in others (the published variant masks them; the implementation can equivalently mask them all out via the attention mask).

The result: each training example is an autoregressive prefix of `b` clean blocks, followed by a noisy "current" block, followed by a fully-masked or empty suffix. The per-position cross-entropy loss is computed only on the masked positions of block `b`.

This is exactly CASCADE's training corruption (`03_CASCADE.md § 2.1`).
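The corruption recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's or CASCADE's implementation: the `MASK` sentinel, the function name `corrupt_block`, and the choice of the "fully masked suffix" variant are all assumptions made for the example.

```python
import random

MASK = -1  # hypothetical sentinel id standing in for [MASK]

def corrupt_block(tokens, block_size, rng):
    """Corrupt one training example as described above.

    Returns (noisy, b, m, loss_positions), where loss_positions are the
    masked indices inside the target block b.
    """
    assert len(tokens) % block_size == 0, "sequence length must be n * B"
    n_blocks = len(tokens) // block_size
    b = rng.randrange(n_blocks)   # target block, uniform over 0..n-1
    m = rng.random()              # masking ratio, uniform on [0, 1)
    start, end = b * block_size, (b + 1) * block_size

    noisy = list(tokens)          # blocks < b stay clean
    loss_positions = []
    for i in range(start, end):   # target block: mask each position w.p. m
        if rng.random() < m:
            noisy[i] = MASK
            loss_positions.append(i)
    for i in range(end, len(tokens)):
        noisy[i] = MASK           # blocks > b: fully masked variant
    return noisy, b, m, loss_positions
```

Note that a small `m` can leave the target block with no masked positions, in which case the example contributes no loss; how that case is handled in practice is not specified in these notes.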

## Loss

Per-block, identical to LLaDA but applied only to block `b`:

```
L = E_{x_0, b, m} [ -(1/m) * sum_{i in block_b : x^i = MASK} log p_theta(x_0^i | x_<b, x_b^masked) ]
```

The `1/m` weighting is again the ELBO weight. Block index `b` is uniform, so on average every block position gets equal training signal.
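As a sanity check on the formula, here is a hedged sketch of the per-example loss given already-computed log-probabilities. The function name `block_elbo_loss`, the dict-based input, and the `eps` clamp on `m` are assumptions for illustration; the notes do not say how very small `m` is handled.

```python
import math

def block_elbo_loss(log_probs, masked_positions, mask_ratio, eps=1e-6):
    """1/m-weighted negative log-likelihood over masked target-block positions.

    log_probs: dict mapping position -> log p_theta(true token | context).
    masked_positions: indices in the target block that were replaced by [MASK].
    eps: assumed clamp to avoid dividing by a near-zero masking ratio.
    """
    nll = -sum(log_probs[i] for i in masked_positions)
    return nll / max(mask_ratio, eps)

# Example: two masked positions with probabilities 0.5 and 0.25, m = 0.5.
example = block_elbo_loss({4: math.log(0.5), 5: math.log(0.25)}, [4, 5], 0.5)
```

Unmasked positions of block `b`, and all positions outside block `b`, never enter the sum.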

A subtlety: the paper notes that with uniform `b`, the *first* block (b=0) is trained with no left context, and the last block is trained with the most. This creates positional asymmetry that, empirically, is *fine* — the model learns to handle variable left-context lengths because that's exactly what it faces at inference.

## Inference structure

Left-to-right block-by-block, exactly as CASCADE (`03_CASCADE.md § 2.2`):

```
For b = 0, 1, 2, ..., until [EOS] is emitted:
    1. Initialize block b as all [MASK] tokens.
    2. For k = 1 ... K denoising steps:
        a. Forward over [clean blocks <b] + [current block b]. (Block b's K/V are recomputed every step;
           prior blocks' K/V come from the cache.)
        b. Predict per-position distributions over the vocabulary for block b's masked positions.
        c. Commit the ceil(B/K) most-confident still-masked positions.
        d. Optionally remask any low-confidence committed positions.
    3. Block b is now fully resolved; cache its K/V; emit it.
```

`K` is a fixed hyperparameter in BD3-LM. **CASCADE's contribution is making `K` adaptive per block** (`03_CASCADE.md § 2.3`).
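The loop above, with a fixed `K`, can be sketched as runnable Python against a stand-in predictor. Everything named here is hypothetical: `decode_block`, `generate`, the `predict` callback signature, the `max_blocks` safety cap, and `toy_predict` (a deterministic toy, not a model). The optional remasking step (2d) is omitted, and the per-step quota is chosen so the block always resolves within `K` steps even when `K` does not divide `B`.

```python
import math

MASK = -1  # hypothetical sentinel id standing in for [MASK]

def decode_block(prefix, block_size, num_steps, predict):
    """Resolve one block in at most num_steps denoising steps."""
    block = [MASK] * block_size
    for step in range(num_steps):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            break
        # predict returns {position: (token, confidence)} for masked positions
        preds = predict(prefix, block)
        # commit enough positions that the remaining steps can finish the block
        quota = math.ceil(len(masked) / (num_steps - step))
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:quota]:
            block[i] = preds[i][0]
    return block

def generate(prompt, block_size, num_steps, predict, eos, max_blocks):
    """Left-to-right, block-by-block generation until eos or max_blocks."""
    out = list(prompt)
    for _ in range(max_blocks):
        block = decode_block(out, block_size, num_steps, predict)
        out.extend(block)  # block is now frozen; a real model would cache its K/V
        if eos in block:
            break
    return out

def toy_predict(prefix, block):
    """Toy stand-in: predicts the absolute position as the token id."""
    base = len(prefix)
    return {i: (base + i, 1.0 / (1 + i)) for i, t in enumerate(block) if t == MASK}
```

With `B = 4` and `K = 2`, each block costs two calls to `predict`, committing two positions per call. An adaptive scheme would replace the fixed `num_steps` with a per-block value.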

## KV-cache use

This is the structural innovation of BD3-LM (and why CASCADE adopts it directly):

- **Across blocks:** once block `b` is committed, its K/V is *frozen forever* and goes in the cache. All future blocks `b+1, b+2, ...` reuse it. This is exactly the AR-style growing-prefix cache.
- **Within block (K denoising steps for block `b`):** the prior-block K/V is frozen and reused at every step. Block `b`'s own K/V is **recomputed every step**, because the token IDs in block `b` change between steps as `[MASK]`s resolve (and as any remasked positions revert).

So during the generation of block `b` with `K` denoising steps:
- The attention from block `b`'s queries to the cached prior-block K/V is done `K` times, each time reading the same cache. Cost per step: O(B × b·B × d).
- The self-attention within block `b` is done `K` times, recomputing block `b`'s K/V each time. Cost per step: O(B² × d) + O(B × d²) for K/V projection.

This is the cache structure formalized in `cascade/attention_reference.py:BlockKVCache` and tested in `test_cache_reuse_equivalence`.
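Freezing a committed block's K/V is valid because of the block-causal attention pattern: a position never attends to a later block, so nothing computed for an earlier block depends on what comes after it. The sketch below builds that pattern as a boolean matrix in plain Python. The function name `block_causal_mask` is an assumption for illustration and is not the `BlockKVCache` code referenced above.

```python
def block_causal_mask(seq_len, block_size):
    """mask[q][k] is True when query position q may attend to key position k.

    A query sees every position in its own block (bidirectional within the
    block) and every position in earlier blocks, but nothing in later blocks.
    """
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]

# B = 1 reduces to the ordinary causal (lower-triangular) mask;
# B = seq_len reduces to full bidirectional attention.
causal = block_causal_mask(6, 1)
full = block_causal_mask(6, 6)
```

The two limiting cases mirror the `B = 1` and `B = L` endpoints described in the summary.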

## Throughput vs. AR comparison

The BD3-LM paper's headline result `[verify numbers]`:

- Block-diffusion at `B = 32, K = 4` achieves comparable quality to AR with substantial throughput gain on long sequences.
- Throughput can favor block-diffusion only when `K < B`. At `K = B`, block-diffusion commits one token per step, the same rate as AR, while still processing all `B` positions per step.
- At `B = 1`, exactly equivalent to AR (this is the "strict generalization" property — also verified by CASCADE's `test_block_size_1_equals_causal`).
- At `B = L` (one giant block), exactly equivalent to LLaDA-style full masked diffusion with no cache reuse.

The qualitative throughput formula: per block, CASCADE/BD3-LM does `K` forward passes over `B` tokens, each with attention cost `O(B² + B·b·B)`. AR does `B` forward passes over `1` token each, with attention cost `O(b·B + t)` for the `t`-th token of the block. Block diffusion does more total attention work but needs only `K` sequential passes instead of `B`, and it can win on wall-clock because one parallel pass over `B` tokens is much faster than `B` sequential memory-bound steps. The exact crossover depends on the GPU's compute-to-memory ratio.
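To make the formula concrete, the sketch below counts sequential forward passes and attention query-key pairs for generating block `b` under each scheme. It is a back-of-the-envelope model only: the function names are hypothetical, the hidden dimension and projection costs are ignored, and it says nothing about measured wall-clock time.

```python
def block_diffusion_cost(b, B, K):
    """Passes and query-key pairs to generate block b with K denoising steps."""
    per_step = B * B + B * (b * B)  # within-block pairs + pairs against cached prefix
    return {"forward_passes": K, "qk_pairs": K * per_step}

def ar_cost(b, B):
    """Passes and query-key pairs for AR to generate the same B tokens."""
    # token t (1-based) of the block attends to b*B prefix tokens plus t tokens so far
    pairs = sum(b * B + t for t in range(1, B + 1))
    return {"forward_passes": B, "qk_pairs": pairs}

bd = block_diffusion_cost(b=3, B=32, K=4)
ar = ar_cost(b=3, B=32)
```

For `b = 3`, `B = 32`, `K = 4` this gives 4 passes and 16,384 query-key pairs for block diffusion versus 32 passes and 3,600 pairs for AR: more arithmetic, but 8× fewer sequential steps.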

## Ablations from the paper that CASCADE inherits

`[verify]` numbers; from memory the BD3-LM ablations include:

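- Block size sweep: `B ∈ {4, 8, 16, 32, 64, 128}`. Quality monotone in `B` (closer to full diffusion = better) up to about `B = 32`, then plateaus. `[verify direction]`: these notes treat AR (`B = 1`) as the quality reference elsewhere, so the trend may instead favor smaller `B`. Throughput non-monotone (larger `B` means each block's per-step cost grows).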
- `K` sweep: at `K = 4`, BD3-LM matches AR quality at significant throughput gain. At `K = 1`, generated text degrades visibly.
- Remask-low-confidence ablation: helps quality, especially at small `K`.
- Without `1/m` reweighting: small but consistent quality regression.

CASCADE will rerun these ablations in `ablations/` to confirm the picture holds at our scale and to provide head-to-head numbers for the paper.

## Why this paper matters for CASCADE

- **Foundation:** CASCADE's block-causal architecture is BD3-LM's, full stop. The corruption process, the loss, the inference loop, and the cache structure are identical.
- **The two deltas CASCADE adds:** (1) adaptive per-block step count `K_b` (BD3-LM uses fixed `K`); (2) AR-to-CASCADE distillation recipe (BD3-LM trains from scratch).
- **Reviewer test:** if reviewers see CASCADE as "BD3-LM + a head", we lose. The two deltas must each be ablated individually, and the joint quality-speed Pareto plot must be the headline. See the risk matrix in `03_CASCADE.md § 6`.

## Open questions to check against the paper

- [ ] Exact block sizes evaluated and on what corpora.
- [ ] Is the `1/m` reweighting derivation given in the BD3-LM paper or inherited from LLaDA / MD4?
- [ ] Does BD3-LM explore any *non-uniform* block index sampling (e.g. emphasis on later blocks where context is harder)? CASCADE could borrow this.
- [ ] What attention kernel does the BD3-LM open-source release use? (flash-attn with block_mask, or a custom kernel?) — informs Phase 2 production migration.
- [ ] What is the *exact* PPL of BD3-LM-medium vs. AR-medium at matched FLOPs?

## References

- Arriola et al. 2025 — *Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models* (ICLR 2025)
- The BD3-LM open-source repo (URL: TBD — find via the paper)
- Han et al. 2022 — SSD-LM (earlier block-diffusion attempt; BD3-LM cites it as antecedent)