Architecture

CASCADE is a block-causal masked-diffusion language model. The vocabulary contains a special [MASK] token. The model is a Transformer decoder with block-causal attention: tokens within a block see each other bidirectionally; tokens in block b see all tokens in blocks < b.

Corruption process (training)

Given a sequence partitioned into blocks of size B = 32, training corrupts blocks one at a time:

Pick a block index b uniformly.
Sample a masking ratio m ~ U[m_min, 1] (LLaDA-style; the random per-sequence ratio is critical, because training with a fixed ratio collapses).
Mask each position in block b independently with probability m.
Blocks < b are kept clean (this provides the autoregressive context).
Blocks > b are fully masked (so the input shape is consistent), but the loss is only computed at the masked positions of block b.
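The corruption steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: MASK_ID, the function name corrupt, and the default m_min = 0.05 are assumptions made for the example (the text does not give a value for m_min).

```python
import random

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def corrupt(tokens, block_size=32, m_min=0.05, rng=random):
    """Corrupt one training sequence; returns (noisy, loss_positions, m, b).

    m_min=0.05 is an illustrative default, not a value from the text.
    """
    assert len(tokens) % block_size == 0, "sequence must be whole blocks"
    n_blocks = len(tokens) // block_size
    b = rng.randrange(n_blocks)           # pick a block index uniformly
    m = rng.uniform(m_min, 1.0)           # masking ratio m ~ U[m_min, 1]
    start, end = b * block_size, (b + 1) * block_size

    noisy = list(tokens)                  # blocks < b stay clean
    loss_positions = []
    for i in range(start, end):           # block b: mask each position w.p. m
        if rng.random() < m:
            noisy[i] = MASK_ID
            loss_positions.append(i)      # loss is computed only here
    for i in range(end, len(tokens)):     # blocks > b: fully masked, no loss
        noisy[i] = MASK_ID
    return noisy, loss_positions, m, b
```

Passing a seeded random.Random instance as rng makes a corrupted batch reproducible, which is convenient when debugging the loss mask.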

The loss is per-masked-position cross-entropy weighted by 1/m. The 1/m factor is not a heuristic: it is the weight that appears in the negative ELBO of absorbing-state discrete diffusion (see Sahoo et al., MDLM, 2024, and Shi et al., MD4, 2024).

Why 1/m?

For the linear noising schedule m(t) = t, the negative ELBO (an upper bound on −log p(x₀)) reduces, after a change of variables from t to m, to the expectation over m of −(1/m) · Σ_{masked i} log p_θ(x₀ⁱ | x_m). Without the 1/m factor, high-m (heavily masked) samples dominate the objective, because a block masked at ratio m contributes about m · B masked positions. With it, the expected contribution is balanced across noise levels.
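A small numerical sketch makes the scaling concrete. The function name weighted_masked_loss and the log_probs layout are assumptions for illustration; log_probs[i] stands for log p_θ(x₀ⁱ | x_m) at position i.

```python
import math


def weighted_masked_loss(log_probs, masked_positions, m):
    """1/m-weighted cross-entropy over the masked positions of one block.

    log_probs[i] is the log-probability the model assigns to the true
    token at position i (hypothetical layout for this sketch).
    """
    if not 0.0 < m <= 1.0:
        raise ValueError("m must lie in (0, 1]")
    nll = -sum(log_probs[i] for i in masked_positions)
    return nll / m
```

With a constant per-token NLL c and exactly m · B masked positions, the unweighted sum is c · m · B and grows with m, whereas the weighted loss is c · B at every noise level.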

Denoising process (inference)

To generate block b given clean blocks < b:

Initialize block b as all [MASK] tokens.
For k = 1 … K_b denoising steps:
1. Forward pass: predict the distribution over true tokens at every masked position.
2. Confidence-rank the predictions.
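3. Unmask the top ⌈n_masked / k_remaining⌉ most confident positions, where k_remaining = K_b − k + 1 counts the steps left including the current one (so the final step unmasks everything that remains).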
Emit block b (now fully filled). Append to context. Proceed to block b + 1.
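The per-block loop can be sketched as follows. The predict callback is a hypothetical stand-in for the model forward pass, and MASK_ID is an assumed placeholder id; the sketch takes k_remaining to be the number of steps left including the current one, which is what guarantees the block is full after the last step.

```python
import math

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def denoise_block(predict, block_size, num_steps):
    """Fill one block in `num_steps` confidence-ranked unmasking steps.

    `predict(block)` is a hypothetical model hook returning
    {position: (token_id, confidence)} for every masked position.
    """
    block = [MASK_ID] * block_size
    for k in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(block) if t == MASK_ID]
        if not masked:
            break
        k_remaining = num_steps - k + 1               # steps left, incl. this one
        n_unmask = math.ceil(len(masked) / k_remaining)
        preds = predict(block)                        # step 1: forward pass
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)  # step 2
        for i in ranked[:n_unmask]:                   # step 3: commit the top ones
            block[i] = preds[i][0]
    return block
```

For B = 32 and K_b = 4 this unmasks 8 positions per step; for K_b = 5 the schedule is 7, 7, 6, 6, 6.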

Standard block diffusion (BD3-LM) fixes K_b = K for all blocks. CASCADE's contribution is to predict K_b from context.

Entropy-adaptive step count (the headline novelty)

Before generating block b, a small head reads the final hidden state at the last clean position and predicts a categorical distribution over K_CHOICES = (1, 2, 4, 8, 16). The actions are discrete and geometrically spaced, covering a 16× cost range with only five choices, which keeps REINFORCE variance manageable.

The head is trained with REINFORCE on a quality-minus-cost reward R = quality(block | K) − λ · K, with an EMA scalar baseline and an entropy bonus that anneals from 0.1 to 0 over the first 30 % of training. In the dense-reward phase, quality is the negative NLL gap relative to a K = 16 reference; in the final fine-tuning phase, it is downstream task accuracy.
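The update for the K_b head can be sketched in plain Python as a gradient with respect to the head's logits. Everything beyond the quantities named above is an assumption for illustration: the function names, the linear shape of the entropy anneal, and the EMA decay of 0.99.

```python
import math

K_CHOICES = (1, 2, 4, 8, 16)


def softmax(logits):
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(quality, k, lam):
    """R = quality(block | K) - lambda * K."""
    return quality - lam * k


def update_baseline(baseline, r, decay=0.99):
    """EMA scalar baseline; the decay value is an assumption."""
    return decay * baseline + (1.0 - decay) * r


def entropy_coef(step, total_steps, start=0.1, anneal_frac=0.3):
    """Anneal from `start` to 0 over the first `anneal_frac` of training.

    A linear schedule is assumed; the text does not specify the shape.
    """
    horizon = anneal_frac * total_steps
    return max(0.0, start * (1.0 - step / horizon))


def reinforce_grad(logits, action, r, baseline, ent_coef):
    """Gradient w.r.t. logits of the objective to maximize:
    (r - baseline) * log pi(action) + ent_coef * H(pi)."""
    p = softmax(logits)
    adv = r - baseline
    h = -sum(pj * math.log(pj) for pj in p)
    grad = []
    for j, pj in enumerate(p):
        dlogp = (1.0 if j == action else 0.0) - pj   # d log pi(a) / d z_j
        dent = -pj * (math.log(pj) + h)              # d H(pi) / d z_j
        grad.append(adv * dlogp + ent_coef * dent)
    return grad
```

A positive advantage raises the logit of the sampled K and lowers the others; the entropy term vanishes at the uniform distribution and otherwise pushes the policy back toward it.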

The result: average K_b ≈ 4–6 on typical text, while hard passages (math, code identifiers) average K_b ≥ 12; individual blocks always take a value from K_CHOICES. The saving on easy blocks is the throughput multiplier.

On this repo's CPU-bounded synthetic-task verification, the monotone-difficulty diagnostic from Appendix A.3 of the master prompt — “bucket inputs by independent difficulty, check that mean K per bucket is monotone increasing” — achieves Spearman rank correlation 1.000. See Results → monotone-difficulty diagnostic.
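The diagnostic itself is simple to state in code. The sketch below is a stdlib-only illustration of "bucket, average, rank-correlate"; the function names and the (bucket, K) sample layout are assumptions, and it is not the repo's test.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            out[order[t]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def monotone_difficulty(samples):
    """samples: iterable of (difficulty_bucket, chosen_K) pairs.

    Returns ({bucket: mean K}, Spearman rho between bucket and mean K).
    """
    by_bucket = {}
    for bucket, k in samples:
        by_bucket.setdefault(bucket, []).append(k)
    buckets = sorted(by_bucket)
    means = [sum(by_bucket[b]) / len(by_bucket[b]) for b in buckets]
    return dict(zip(buckets, means)), spearman(buckets, means)
```

A rho of exactly 1 means the per-bucket mean K is strictly increasing in difficulty; any inversion between two buckets pulls it below 1.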

Block-causal attention mask

For sequence length N partitioned into blocks of size B:

M[i, j] = 1   iff   ⌊i / B⌋ ≥ ⌊j / B⌋

This is a block-lower-triangular pattern where each B × B diagonal block is full (all ones), not lower-triangular. The full structure for two blocks of size 4 (so N = 8):

           B0    B1
        ┌────────────┐
   B0   │ FULL   ─   │  ← Block 0 self-attention (bidirectional within block)
   B1   │ FULL  FULL │  ← Block 1 sees all of B0 (causal) and itself (bidirectional)
        └────────────┘

Implementation: a dense (N, N) bool mask in the CPU reference path; the production GPU path uses the block_mask argument of PyTorch FlexAttention (flex_attention), built from the predicate (kv_idx // block_size) ≤ (q_idx // block_size).
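As a minimal illustration of the dense reference mask, the rule M[i, j] = 1 iff ⌊i / B⌋ ≥ ⌊j / B⌋ translates directly into nested list comprehensions. The function names here are hypothetical, and the sketch uses plain Python lists rather than a tensor library.

```python
def block_causal_mask(n, block_size):
    """mask[i][j] is True iff query position i may attend to key position j."""
    return [[(i // block_size) >= (j // block_size) for j in range(n)]
            for i in range(n)]


def render(mask):
    """Text view of the mask: '1' = visible, '.' = hidden."""
    return "\n".join("".join("1" if v else "." for v in row) for row in mask)


print(render(block_causal_mask(8, 4)))
# 1111....
# 1111....
# 1111....
# 1111....
# 11111111
# 11111111
# 11111111
# 11111111
```

Setting block_size = 1 reduces the same predicate to i ≥ j, the standard lower-triangular causal mask.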

KV-cache across blocks and within block

Two levels of cache reuse, both essential:

Across blocks: Once block b is generated and committed, its K/V are cached forever. Block b+1's denoising sees this cache. This recovers the AR-style growing-prefix cache.
Within block: During the K_b denoising steps of block b, the K/V of the prior blocks (< b) are unchanged across all steps and are read from the cache. The K/V of the tokens being denoised in block b itself change each step (as new tokens are unmasked), so they are recomputed each step.

Two P0 cache bugs to avoid

Caching block b's K/V across denoising steps. Token IDs at masked positions resolve to actual tokens between steps, and their K/V depend on those IDs, so cached entries go stale. The symptom is a silent quality drop rather than a crash.
Committing block b's K/V to the cache before all K_b denoising steps complete. An early commit freezes a partially denoised block into the cache and contaminates every subsequent block generation.

Both are verified absent by the multi-layer cache-vs-no-cache equivalence test, which agrees to within 1.57 × 10⁻¹⁵ in fp64.
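One way to make both bugs structurally hard to write is to let the cache hold committed blocks only and to refuse a commit while any [MASK] remains. The sketch below is an illustration under assumptions: BlockKVCache, generate_block, compute_kv, and denoise_step are hypothetical names, and K/V are reduced to per-token placeholders rather than per-layer tensors.

```python
MASK_ID = -1  # hypothetical id standing in for the [MASK] token


class BlockKVCache:
    """Stores K/V for committed blocks only; in-progress blocks never enter."""

    def __init__(self):
        self._committed = []  # one (keys, values) pair per finished block

    def prefix(self):
        """K/V of all committed blocks, read-only during denoising."""
        return tuple(self._committed)

    def commit(self, keys, values, n_masked_left):
        if n_masked_left != 0:
            raise ValueError("refusing to commit a partially denoised block")
        self._committed.append((tuple(keys), tuple(values)))


def generate_block(cache, compute_kv, denoise_step, block_size, num_steps):
    """`compute_kv` and `denoise_step` are hypothetical model hooks."""
    block = [MASK_ID] * block_size
    for _ in range(num_steps):
        prefix = cache.prefix()            # prior blocks: reused, unchanged
        keys, values = compute_kv(block)   # block b: recomputed every step
        block = denoise_step(block, prefix, keys, values)
    keys, values = compute_kv(block)       # K/V of the fully filled block
    cache.commit(keys, values, n_masked_left=block.count(MASK_ID))
    return block
```

Because the in-progress K/V live only in local variables, there is nothing to reuse across steps by accident, and the commit guard turns an early commit into an immediate error.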

AR-to-CASCADE distillation

Following the Dream-7B recipe:

Start from a pretrained AR LLM (Qwen-2.5-7B or Llama-3-8B).
Convert attention to block-causal (strictly more permissive than fully-causal — the AR weights stay valid).
Add a [MASK] token: initialize its embedding as the mean of existing embeddings + small Gaussian noise (σ = 0.01 × std).
Three-phase continued pretraining on ~150B tokens:
- Phase 2a (warmup, 10B): low mask rates m ~ U[0, 0.3]; close to AR, preserves AR-like capability.
- Phase 2b (full diffusion, 100B): full range m ~ U[m_min, 1].
- Phase 2c (mixed, 40B): alternate diffusion-loss batches with pure-AR-loss batches at 4 : 1, anchoring against catastrophic forgetting.
Train the adaptive K_b head on a held-out adaptation set.
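Two pieces of the recipe above are mechanical enough to sketch: the [MASK] embedding initialization and the phase schedule. The helper names are hypothetical, and two details are assumptions rather than statements from the text: "std" is taken over all entries of the embedding matrix, and the 4 : 1 alternation places the AR batch last in each group of five.

```python
import random
import statistics

PHASES = (("2a", 10), ("2b", 100), ("2c", 40))  # token budgets in billions


def init_mask_embedding(embeddings, rel_sigma=0.01, rng=random):
    """Mean of the existing rows plus Gaussian noise, sigma = rel_sigma * std.

    `embeddings` is a list of equal-length rows. std is computed over every
    entry of the matrix (an assumption; the text only says "std").
    """
    dim = len(embeddings[0])
    mean = [statistics.fmean([row[d] for row in embeddings]) for d in range(dim)]
    std = statistics.pstdev([v for row in embeddings for v in row])
    return [mu + rng.gauss(0.0, rel_sigma * std) for mu in mean]


def phase_at(tokens_seen_b):
    """Phase name for a given number of tokens seen (in billions)."""
    upper = 0
    for name, budget in PHASES:
        upper += budget
        if tokens_seen_b < upper:
            return name
    return None  # continued pretraining is finished


def is_ar_batch(batch_idx, n_diffusion=4, n_ar=1):
    """Phase 2c: True for the pure-AR batches in a 4 : 1 alternation."""
    return batch_idx % (n_diffusion + n_ar) >= n_diffusion
```

The budgets sum to 150B, so phase_at returns None from 150B tokens onward, which is where training of the K_b head would begin.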

This is a projection, not a CASCADE result — no distillation run has been done here. Extrapolating from the recovery curve reported for Dream-7B, one would expect: ~10B tokens → noticeably worse than the AR teacher; ~50B → comparable on most tasks; ~150B → on par with the teacher. Whether CASCADE reproduces this is untested.

Important subtleties

Causal consistency at block boundaries. The first token of block b must attend to the last token of block b−1, but not the reverse. Tested explicitly via the “no future information leak” test (perturb a future-block position, verify past-block outputs are bit-identical).
The [MASK] token must be present in the tokenizer with a stable ID. If extending an existing tokenizer, re-init the relevant embedding carefully (mean + tiny noise; not uniform random — that breaks downstream Q/K/V projections).
Stop conditions. Generation stops when [EOS] is committed within any block — same as AR.
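Block-AR equivalence. When K_b = B and unmasking is forced left-to-right, CASCADE commits exactly one token per step in the same order as a pure AR model. Note that K_b = B = 32 lies outside K_CHOICES, so this is a sanity-check configuration rather than one the head can select. The implementation-level analog is verified in tests/test_block_mask.py::test_block_size_one_equals_ar (B = 1 → standard causal attention, bit-identical at fp64).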
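The leak test in the first item can be illustrated on a toy model. The sketch below is not the repo's test: it uses a hypothetical single-head attention over scalar features with q = k = v = x, which is enough to show why perturbing a future block cannot change earlier outputs when hidden keys are excluded from the computation.

```python
import math


def block_causal_attention(x, block_size):
    """Toy single-head attention over scalar features (q = k = v = x)."""
    out = []
    for i, q in enumerate(x):
        # Keys visible to query i: every position in blocks <= block(i).
        visible = [j for j in range(len(x)) if i // block_size >= j // block_size]
        scores = [q * x[j] for j in visible]
        mx = max(scores)
        weights = [math.exp(s - mx) for s in scores]
        total = sum(weights)
        out.append(sum(w * x[j] for w, j in zip(weights, visible)) / total)
    return out


def test_no_future_leak(block_size=4):
    x = [0.1 * i for i in range(12)]          # three blocks of four positions
    base = block_causal_attention(x, block_size)
    perturbed = list(x)
    perturbed[9] += 5.0                       # position 9 lives in block 2
    out = block_causal_attention(perturbed, block_size)
    assert out[:8] == base[:8]                # blocks 0 and 1: bit-identical
    assert out[8:] != base[8:]                # block 2 does see the change


test_no_future_leak()
```

The exact equality on the first two blocks holds because the perturbed value never enters their arithmetic at all; an additive −∞ mask applied to a full score matrix should give the same guarantee, which is what the real test checks.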