Architecture
CASCADE is a block-causal masked-diffusion language model. The vocabulary contains a special [MASK] token. The model is a Transformer decoder with block-causal attention: tokens within a block see each other bidirectionally; tokens in block b see all tokens in blocks < b.
Corruption process (training)
Given a sequence partitioned into blocks of size B = 32, training corrupts blocks one at a time:
- Pick a block index b uniformly.
- Sample a masking ratio m ~ U[m_min, 1] (LLaDA-style; the random per-sequence ratio is critical, because training with a fixed ratio collapses).
- Mask each position in block b independently with probability m.
- Blocks < b are kept clean (this provides the autoregressive context).
- Blocks > b are fully masked (so the input shape is consistent), but the loss is only computed at the masked positions of block b.
The loss is the per-masked-position cross-entropy weighted by 1/m. The 1/m factor is not a heuristic: it is the weight that appears in the negative ELBO of absorbing-state discrete diffusion (see Sahoo et al., MDLM, 2024; Shi et al., MD4, 2024).
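The corruption steps and the 1/m weighting above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repo's training code: `MASK_ID`, `corrupt`, `weighted_loss`, and the default `m_min = 0.05` are hypothetical names and values chosen for the example, and per-token NLLs are passed in as a plain list instead of coming from a model.

```python
import random

MASK_ID = -1  # hypothetical sentinel; a real tokenizer would reserve a stable ID


def corrupt(tokens, block_size=32, m_min=0.05, rng=random):
    """Corrupt one training sequence block-wise.

    Returns (corrupted, b, m, loss_positions). m_min=0.05 is an assumed value.
    """
    assert len(tokens) % block_size == 0
    n_blocks = len(tokens) // block_size
    b = rng.randrange(n_blocks)        # target block, chosen uniformly
    m = rng.uniform(m_min, 1.0)        # per-sequence masking ratio
    start, end = b * block_size, (b + 1) * block_size

    corrupted = list(tokens)           # blocks < b stay clean
    loss_positions = []
    for i in range(start, end):        # block b: mask each position w.p. m
        if rng.random() < m:
            corrupted[i] = MASK_ID
            loss_positions.append(i)
    for i in range(end, len(tokens)):  # blocks > b: fully masked, no loss
        corrupted[i] = MASK_ID
    return corrupted, b, m, loss_positions


def weighted_loss(token_nll, loss_positions, m):
    """Sum the NLL over masked positions of the target block, weighted by 1/m."""
    return sum(token_nll[i] for i in loss_positions) / m
```

Only positions listed in `loss_positions` contribute to the loss; the fully masked future blocks are present purely to keep the input shape fixed.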
Why 1/m?
For the linear noising schedule m(t) = t, the negative ELBO on log p(x0) reduces (after a change of variables from t to m) to an expectation over m of −(1/m) · Σ_{i masked} log pθ(x0^i | x_m). Without the 1/m factor, high-m (heavily masked) samples dominate the loss because they contain more masked positions: the expected count in a block is m · B. With it, every noise level contributes proportionally.
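Written out, the bound and the normalization argument look as follows. This is a sketch under the stated linear schedule with m drawn uniformly on (0, 1]; training as described above draws m from [m_min, 1] instead, which truncates the integral. The symbol q for the masking distribution is notation introduced here.

```latex
-\log p_\theta(x_0) \;\le\;
\mathbb{E}_{m \sim \mathcal{U}(0,1]}\;
\mathbb{E}_{x_m \sim q(\cdot \mid x_0,\, m)}
\left[\, -\frac{1}{m} \sum_{i \,:\, x_m^i = \texttt{[MASK]}}
\log p_\theta\!\left(x_0^i \mid x_m\right) \right]

\mathbb{E}\!\left[\,\#\{\, i : x_m^i = \texttt{[MASK]} \,\}\,\right] = mB
\quad\Longrightarrow\quad
\frac{1}{m} \cdot mB = B
```

The second line is the reason the weighting equalizes noise levels: the expected number of terms in the sum grows like mB, and the 1/m prefactor cancels that growth.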
Denoising process (inference)
To generate block b given clean blocks < b:
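- Initialize block b as all [MASK] tokens.
- For k = 1 … Kb denoising steps:
  - Forward pass: predict the distribution over true tokens at every masked position.
  - Confidence-rank the predictions.
  - Unmask the top ⌈n_masked / k_remaining⌉ most confident positions, where n_masked counts the positions still masked and k_remaining counts the steps left, including the current one (so the final step unmasks everything that remains).
- Emit block b (now fully filled). Append to context. Proceed to block b + 1.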
Standard block diffusion (BD3-LM) fixes Kb = K for all blocks. CASCADE's contribution is to predict Kb from context.
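The denoising loop above can be sketched as follows. This is an illustration under assumptions, not the repo's sampler: `predict` is a hypothetical stand-in for the model that returns one `(token, confidence)` pair per block position, and "confidence" is taken to be any scalar score (for example the top-1 probability), since the text does not pin the definition down. `toy_predict` is a random placeholder used only to make the sketch runnable.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel for the [MASK] token


def denoise_block(predict, context, block_size, num_steps):
    """Fill one block by confidence-ranked unmasking in at most num_steps passes."""
    block = [MASK_ID] * block_size
    for step in range(num_steps):
        masked = [i for i, t in enumerate(block) if t == MASK_ID]
        if not masked:
            break                                 # already full (num_steps > block_size)
        preds = predict(context, block)           # one forward pass
        remaining = num_steps - step              # steps left, including this one
        n_unmask = math.ceil(len(masked) / remaining)
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_unmask]:               # commit the most confident positions
            block[i] = preds[i][0]
    return block


def toy_predict(context, block):
    """Placeholder model: random tokens and confidences, seeded by the current state."""
    rng = random.Random(len(context) + sum(t != MASK_ID for t in block))
    return [(rng.randrange(1000), rng.random()) for _ in block]
```

Because the quota is recomputed from the positions still masked, the last step always has `remaining == 1` and therefore clears the block, whatever the value of `num_steps`.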
Entropy-adaptive step count (the headline novelty)
Before generating block b, a small head reads the final hidden state at the last clean position and predicts a categorical distribution over K_CHOICES = (1, 2, 4, 8, 16). The actions are discrete and geometrically spaced, covering a 16× cost range with only five choices, which keeps REINFORCE variance manageable.
The head is trained with REINFORCE on a quality-minus-cost reward R = quality(block | K) − λ · K, with an EMA scalar baseline and an entropy bonus that anneals from 0.1 to 0 over the first 30 % of training. Quality is defined per phase: in the dense-reward phase it is the negative-NLL gap relative to a K = 16 reference; in the final fine-tuning it is downstream task accuracy.
The result: average Kb ≈ 4–6 on typical text, while hard passages (math, code identifiers) get Kb ≥ 12. This is the throughput multiplier.
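One REINFORCE update for the step-count policy can be sketched as below. Several things here are assumptions made for the example: the policy is a bare vector of five logits rather than a head on hidden states, the anneal is taken to be linear, and `quality_fn`, `lam`, `lr`, and the EMA decay of 0.9 are hypothetical parameters. Only `K_CHOICES`, the reward form, the EMA baseline, and the 0.1 to 0 anneal over 30 % of training come from the description above.

```python
import math
import random

K_CHOICES = (1, 2, 4, 8, 16)


def softmax(logits):
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_coef(progress, start=0.1, anneal_frac=0.3):
    """Entropy-bonus weight at training progress in [0, 1] (linear anneal assumed)."""
    return start * max(0.0, 1.0 - progress / anneal_frac)


def reinforce_step(logits, baseline, quality_fn, lam, lr, beta, rng, ema=0.9):
    """Sample K, score it with R = quality - lam * K, and update logits and baseline."""
    probs = softmax(logits)
    a = rng.choices(range(len(K_CHOICES)), weights=probs)[0]
    k = K_CHOICES[a]
    reward = quality_fn(k) - lam * k
    advantage = reward - baseline
    ent = -sum(p * math.log(p) for p in probs if p > 0)

    new_logits = []
    for j, p in enumerate(probs):
        grad_logp = (1.0 if j == a else 0.0) - p                 # d log pi(a) / d z_j
        grad_ent = -p * (math.log(p) + ent) if p > 0 else 0.0    # d H / d z_j
        new_logits.append(logits[j] + lr * (advantage * grad_logp + beta * grad_ent))

    new_baseline = ema * baseline + (1.0 - ema) * reward         # EMA scalar baseline
    return new_logits, new_baseline, k, reward
```

Both gradient terms sum to zero across the five logits, so an update only redistributes probability mass between step counts.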
On this repo's CPU-bounded synthetic-task verification, the monotone-difficulty diagnostic from Appendix A.3 of the master prompt — “bucket inputs by independent difficulty, check that mean K per bucket is monotone increasing” — achieves Spearman rank correlation 1.000. See Results → monotone-difficulty diagnostic.
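The diagnostic itself is simple enough to sketch: group the chosen K values by difficulty bucket, take the mean per bucket, and compute the Spearman rank correlation between bucket order and mean K. The function names below are hypothetical and the repo's own implementation may differ; ties are handled with average ranks.

```python
from collections import defaultdict


def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def monotone_diagnostic(bucket_ids, ks):
    """Return (sorted bucket ids, mean K per bucket, Spearman rho between them)."""
    per_bucket = defaultdict(list)
    for b, k in zip(bucket_ids, ks):
        per_bucket[b].append(k)
    buckets = sorted(per_bucket)
    means = [sum(per_bucket[b]) / len(per_bucket[b]) for b in buckets]
    return buckets, means, spearman(buckets, means)
```

A correlation of exactly 1 means the per-bucket means are strictly increasing in bucket order; it says nothing about how large the gaps between buckets are.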
Block-causal attention mask
For sequence length N partitioned into blocks of size B:
M[i, j] = 1 iff ⌊i / B⌋ ≥ ⌊j / B⌋
This is a block-lower-triangular pattern where each B × B diagonal block is full (all ones), not lower-triangular. The full structure for two blocks of size 4 (so N = 8):
           B0     B1
        ┌─────────────┐
    B0  │ FULL    ─   │  ← Block 0 self-attention (bidirectional within block)
    B1  │ FULL   FULL │  ← Block 1 sees all of B0 (causal) and itself (bidirectional)
        └─────────────┘
Implementation: a dense (N, N) bool mask in the CPU reference path; the production GPU path uses PyTorch FlexAttention's block_mask parameter with predicate (kv_idx // block_size) ≤ (q_idx // block_size).
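The dense reference mask follows directly from the rule M[i, j] = 1 iff ⌊i / B⌋ ≥ ⌊j / B⌋. A minimal sketch using nested lists (the function name is hypothetical; the repo's reference path presumably builds a tensor instead):

```python
def block_causal_mask(n, block_size):
    """Dense (n, n) bool mask: mask[i][j] is True iff query i may attend to key j."""
    return [
        [(i // block_size) >= (j // block_size) for j in range(n)]
        for i in range(n)
    ]
```

With `block_size=1` every block is a single token, so the rule reduces to `i >= j`, which is the standard causal mask.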
KV-cache across blocks and within block
Two levels of cache reuse, both essential:
- Across blocks. Once block b is generated and committed, its K/V are cached for the rest of generation. Block b+1's denoising reads this cache, which recovers the AR-style growing-prefix cache.
- Within a block. During the Kb denoising steps of block b, the K/V of the prior blocks (< b) are unchanged across all steps and are read from the cache. The K/V of the tokens being denoised in block b itself change each step (as new tokens are unmasked), so they are recomputed each step.
Two P0 cache bugs to avoid
- Caching block b's K/V across denoising steps. Masked positions resolve to actual tokens between steps, and their K/V depend on those token IDs, so cached entries go stale. The result is a silent quality drop.
- Committing block b's K/V to the cache before all Kb denoising steps complete. An early commit freezes a partially denoised block and contaminates every subsequent block's generation.
Both are verified absent by the multi-layer cache-vs-no-cache equivalence test, which agrees to 1.57 × 10⁻¹⁵ in fp64.
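The shape of such an equivalence check can be illustrated with a toy single-layer, single-head attention in plain Python. Everything here (`ToyLayer`, `run_with_cache`, `run_without_cache`) is hypothetical and much simpler than the multi-layer test cited above: with one layer, K/V depend only on the token at each position, so the check mainly demonstrates the two cache rules (prefix K/V reused, current-block K/V recomputed every step).

```python
import math
import random


def matvec(x, w):
    """Row vector x times matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def attend(q, keys, values):
    """Softmax attention of one query over all given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, values)) / z for c in range(len(values[0]))]


class ToyLayer:
    def __init__(self, vocab, d, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

        self.emb, self.wq, self.wk, self.wv = mat(vocab, d), mat(d, d), mat(d, d), mat(d, d)

    def kv(self, tokens):
        xs = [self.emb[t] for t in tokens]
        return [matvec(x, self.wk) for x in xs], [matvec(x, self.wv) for x in xs]

    def queries(self, tokens):
        return [matvec(self.emb[t], self.wq) for t in tokens]


def run_with_cache(layer, committed_blocks, block_states):
    """Prefix K/V come from the cache; the current block's K/V are recomputed per step."""
    cache_k, cache_v = [], []
    for blk in committed_blocks:          # only fully denoised blocks are committed
        k, v = layer.kv(blk)
        cache_k += k
        cache_v += v
    outs = []
    for toks in block_states:             # one entry per denoising step
        bk, bv = layer.kv(toks)           # never cached across steps
        outs.append([attend(q, cache_k + bk, cache_v + bv) for q in layer.queries(toks)])
    return outs


def run_without_cache(layer, committed_blocks, block_states):
    """Reference path: recompute K/V for the whole sequence at every step."""
    prefix = [t for blk in committed_blocks for t in blk]
    outs = []
    for toks in block_states:
        k, v = layer.kv(prefix + toks)
        outs.append([attend(q, k, v) for q in layer.queries(toks)])
    return outs
```

In a multi-layer model the prefix cache stays valid for a less trivial reason: under block-causal attention, hidden states of committed blocks never depend on the block currently being denoised.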
AR-to-CASCADE distillation
Following the Dream-7B recipe:
- Start from a pretrained AR LLM (Qwen-2.5-7B or Llama-3-8B).
- Convert attention to block-causal (strictly more permissive than fully-causal — the AR weights stay valid).
- Add a [MASK] token: initialize its embedding as the mean of existing embeddings + small Gaussian noise (σ = 0.01 × std).
- Three-phase continued pretraining on ~150B tokens:
  - Phase 2a (warmup, 10B): low mask rates m ~ U[0, 0.3], close to AR, which preserves AR-like capability.
  - Phase 2b (full diffusion, 100B): full m ~ U[m_min, 1].
  - Phase 2c (mixed, 40B): alternate diffusion-loss batches with pure-AR-loss batches at 4 : 1, anchoring against catastrophic forgetting.
- Train the adaptive Kb head on a held-out adaptation set.
This is a projection, not a CASCADE result — no distillation run has been done here. Extrapolating from the recovery curve reported for Dream-7B, one would expect: ~10B tokens → noticeably worse than the AR teacher; ~50B → comparable on most tasks; ~150B → on par with the teacher. Whether CASCADE reproduces this is untested.
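The [MASK] embedding initialization from the recipe above can be sketched as follows. The recipe does not say which standard deviation "std" refers to; this sketch assumes the population standard deviation over all entries of the existing embedding matrix, and `init_mask_embedding` is a hypothetical helper name.

```python
import random
import statistics


def init_mask_embedding(embeddings, noise_scale=0.01, rng=random):
    """Mean of the existing rows plus Gaussian noise with sigma = noise_scale * std.

    Assumption: std is taken over every entry of the embedding matrix.
    """
    n, d = len(embeddings), len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / n for j in range(d)]
    std = statistics.pstdev([x for row in embeddings for x in row])
    sigma = noise_scale * std
    return [mu + rng.gauss(0.0, sigma) for mu in mean]
```

Starting at the mean keeps the new row inside the region of embedding space the pretrained projections already handle, and the small noise keeps it from coinciding exactly with any direction derived from the mean.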
Important subtleties
- Causal consistency at block boundaries. The first token of block b must attend to the last token of block b−1, but not the reverse. Tested explicitly via the “no future information leak” test (perturb a future-block position, verify past-block outputs are bit-identical).
- The [MASK] token must be present in the tokenizer with a stable ID. If extending an existing tokenizer, re-init the relevant embedding carefully (mean + tiny noise, not uniform random, which breaks downstream Q/K/V projections).
- Stop conditions. Generation stops when [EOS] is committed within any block, same as AR.
- Block-AR equivalence. When Kb = B and unmasking is forced left-to-right, CASCADE generates the same sequence as a pure AR model. The implementation-level analog is verified in tests/test_block_mask.py::test_block_size_one_equals_ar (B = 1 → standard causal attention, bit-identical at fp64).
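The idea behind the "no future information leak" test can be illustrated on a toy attention layer. This is a sketch, not the repo's test: `masked_attention` uses identity projections on raw input vectors, and `leaks_future` is a hypothetical helper that perturbs one position and compares every output in strictly earlier blocks for exact equality.

```python
import math


def block_causal_mask(n, block_size):
    """mask[i][j] is True iff query i may attend to key j."""
    return [[(i // block_size) >= (j // block_size) for j in range(n)] for i in range(n)]


def masked_attention(x, mask):
    """Single-head self-attention with identity Q/K/V projections (toy model)."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        allowed = [j for j in range(len(x)) if mask[i][j]]
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d) for j in allowed]
        mx = max(scores)
        w = [math.exp(s - mx) for s in scores]
        z = sum(w)
        out.append([sum(wj * x[j][c] for wj, j in zip(w, allowed)) / z for c in range(d)])
    return out


def leaks_future(x, mask, block_size, perturb_pos):
    """True if perturbing x[perturb_pos] changes any output in an earlier block."""
    base = masked_attention(x, mask)
    y = [row[:] for row in x]
    y[perturb_pos] = [v + 1.0 for v in y[perturb_pos]]
    pert = masked_attention(y, mask)
    first_affected = (perturb_pos // block_size) * block_size
    return any(base[i] != pert[i] for i in range(first_affected))
```

Exact equality is the right comparison here: rows in earlier blocks never read the perturbed position under a correct mask, so their arithmetic is identical operation for operation.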