Architecture

CASCADE is a block-causal masked-diffusion language model. The vocabulary contains a special [MASK] token. The model is a Transformer decoder with block-causal attention: tokens within a block see each other bidirectionally; tokens in block b see all tokens in blocks < b.

Corruption process (training)

Given a sequence partitioned into blocks of size B = 32, each training example corrupts a single target block:

  1. Pick a block index b uniformly.
  2. Sample a masking ratio m ~ U[m_min, 1] (LLaDA-style: the random per-sequence ratio is critical; fixed ratios collapse).
  3. Mask each position in block b independently with probability m.
  4. Blocks < b are kept clean (this provides the autoregressive context).
  5. Blocks > b are fully masked (so the input shape is consistent), but the loss is only computed at the masked positions of block b.
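The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's training code: the MASK id, the function name corrupt, and the default m_min = 0.05 are all assumptions made for the example (the text does not give a value for m_min).

```python
import random

MASK = -1  # hypothetical id for the [MASK] token


def corrupt(tokens, block_size=32, m_min=0.05, rng=random):
    """Return (corrupted tokens, loss positions, block index b, ratio m).

    m_min=0.05 is a placeholder; the source does not state its value.
    """
    assert len(tokens) % block_size == 0, "sequence must split into whole blocks"
    n_blocks = len(tokens) // block_size
    b = rng.randrange(n_blocks)            # step 1: uniform block index
    m = rng.uniform(m_min, 1.0)            # step 2: masking ratio for this example
    start, end = b * block_size, (b + 1) * block_size
    corrupted = list(tokens)               # step 4: blocks < b stay clean
    loss_positions = []
    for i in range(start, end):            # step 3: independent Bernoulli(m) masking
        if rng.random() < m:
            corrupted[i] = MASK
            loss_positions.append(i)
    for i in range(end, len(tokens)):      # step 5: blocks > b fully masked, no loss
        corrupted[i] = MASK
    return corrupted, loss_positions, b, m
```

The returned loss_positions list is what restricts the loss to the masked positions of block b; positions in later blocks are masked in the input but never appear in it.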

The loss is per-masked-position cross-entropy weighted by 1/m. The 1/m factor is not heuristic: it is the weight that appears in the negative ELBO of absorbing-state discrete diffusion (see Sahoo et al., MDLM, 2024; Shi et al., MD4, 2024).
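As a rough sketch of the weighted loss for one training example: log_probs below is a stand-in for the model's log-softmax output (a list of {token_id: log-probability} dicts), and the function name is hypothetical. Whether the real implementation also normalizes by block size or batch size is not stated in the text, so none is applied here.

```python
import math


def weighted_masked_loss(log_probs, targets, loss_positions, m):
    """Cross-entropy summed over the masked positions, scaled by 1/m.

    log_probs[i] maps token id -> log-probability at position i
    (a toy stand-in for a model output, assumed for this sketch).
    """
    if m <= 0:
        raise ValueError("m must be positive for the 1/m weight to be defined")
    ce = sum(-log_probs[i][targets[i]] for i in loss_positions)
    return ce / m


# Two masked positions, each predicted with probability 0.25, at m = 0.5:
lp = [{7: math.log(0.25)}, {3: math.log(0.25)}, {9: math.log(0.5)}]
loss = weighted_masked_loss(lp, targets=[7, 3, 9], loss_positions=[0, 1], m=0.5)
```

Here the unweighted sum is 2 · log 4 and the 1/m factor doubles it to 4 · log 4.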

Why 1/m?

For the linear noising schedule m(t) = t, the negative ELBO (an upper bound on −log p(x_0)) reduces, after a change of variables from t to m, to the expectation over m of −(1/m) · Σ_{masked i} log p_θ(x_0^i | x_m). Without the 1/m factor, high-m (heavily masked) batches dominate via their large number of masked positions, since the expected count is m · B. With it, every noise level contributes proportionally.
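Written out, the bound has the following form. The notation is assumed for this sketch: x_m is x_0 with each position of the target block replaced by [MASK] independently with probability m, and q is that corruption distribution. Training with m drawn from U[m_min, 1] is then a Monte Carlo estimate of the integral truncated at m_min.

```latex
\mathcal{L}_{\mathrm{NELBO}}(x_0)
  = \int_0^1 \frac{1}{m}\,
    \mathbb{E}_{x_m \sim q(\cdot \mid x_0,\, m)}
    \Bigl[\, \sum_{i \,:\, x_m^i = \texttt{[MASK]}}
      -\log p_\theta\bigl(x_0^i \mid x_m\bigr) \Bigr]\, dm
  \;\ge\; -\log p_\theta(x_0),
\qquad
\mathbb{E}\bigl[\,\bigl|\{\, i : x_m^i = \texttt{[MASK]} \,\}\bigr|\,\bigr] = mB .
```

The second identity is the normalization argument in one line: the sum has mB terms in expectation, so after the 1/m weight each noise level contributes roughly B terms' worth of loss.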

Denoising process (inference)

To generate block b given clean blocks < b:

  1. Initialize block b as all [MASK] tokens.
  2. For k = 1 … Kb denoising steps:
    1. Forward pass: predict the distribution over true tokens at every masked position.
    2. Confidence-rank the predictions.
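    3. Unmask the top ⌈n_masked / k_remaining⌉ most confident positions, where n_masked counts the positions still masked and k_remaining counts the steps left, including the current one.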
  3. Emit block b (now fully filled). Append to context. Proceed to block b + 1.
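The loop above can be sketched as follows. toy_predict is a random stand-in for the model's forward pass and is purely an assumption for illustration; only the ranking and the ⌈n_masked / k_remaining⌉ schedule follow the text.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def toy_predict(context, block):
    """Stand-in for the forward pass (not CASCADE's model).

    Returns {position: (token, confidence)} for every masked position.
    """
    return {i: (random.randrange(100), random.random())
            for i, t in enumerate(block) if t == MASK}


def denoise_block(context, block_size, num_steps, predict=toy_predict):
    block = [MASK] * block_size                              # step 1
    for k in range(num_steps):                               # step 2
        preds = predict(context, block)                      # 2.1 forward pass
        if not preds:
            break                                            # nothing left to unmask
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)  # 2.2
        n_unmask = math.ceil(len(preds) / (num_steps - k))   # 2.3 ceil(n_masked / k_remaining)
        for i in ranked[:n_unmask]:
            block[i] = preds[i][0]
    return block                                             # step 3: caller appends to context
```

With block_size = 32 and num_steps = 5 the schedule unmasks 7, 7, 6, 6, 6 positions. Because k_remaining is 1 on the last step, the final step always clears whatever is still masked.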

Standard block diffusion (BD3-LM) fixes Kb = K for all blocks. CASCADE's contribution is to predict Kb from context.

Entropy-adaptive step count (the headline novelty)

Before generating block b, a small head reads the final hidden state at the last clean position and predicts a categorical distribution over K_CHOICES = (1, 2, 4, 8, 16). Discrete actions, geometric spacing — covers a 16× cost range with only five choices, keeping REINFORCE variance manageable.

The head is trained with REINFORCE on a quality-minus-cost reward R = quality(block | K) − λ · K, with an EMA scalar baseline and an entropy bonus that anneals from 0.1 to 0 over the first 30% of training. Quality is defined per phase: in the dense-reward phase it is the negative-NLL gap relative to a K = 16 reference; in the final fine-tuning phase it is downstream task accuracy.
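A minimal sketch of one such update, under several labeled assumptions: the policy here is a bare vector of five logits rather than a head reading a hidden state, the anneal is taken to be linear (the text only says it anneals), and the values of lam, lr, and ema are placeholders. quality_fn is a hypothetical callable standing in for the quality term.

```python
import math
import random

K_CHOICES = (1, 2, 4, 8, 16)


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy_coef(step, total_steps, start=0.1, frac=0.3):
    """Assumed linear anneal from `start` to 0 over the first `frac` of training."""
    return start * max(0.0, 1.0 - step / (frac * total_steps))


def reinforce_step(logits, quality_fn, baseline, lam=0.05, lr=0.1,
                   ent_coef=0.0, ema=0.9, rng=random):
    """One gradient-ascent REINFORCE step on the logits of the K policy."""
    probs = softmax(logits)
    a = rng.choices(range(len(K_CHOICES)), weights=probs)[0]
    k = K_CHOICES[a]
    reward = quality_fn(k) - lam * k                    # R = quality - lambda * K
    advantage = reward - baseline                       # subtract the EMA baseline
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    new_logits = []
    for j, p in enumerate(probs):
        grad_logp = (1.0 if j == a else 0.0) - p        # d log pi(a) / d logit_j
        grad_ent = -p * (math.log(p) + entropy) if p > 0 else 0.0  # d H / d logit_j
        new_logits.append(logits[j] + lr * (advantage * grad_logp + ent_coef * grad_ent))
    new_baseline = ema * baseline + (1 - ema) * reward  # EMA scalar baseline
    return new_logits, new_baseline, k, reward
```

Keeping the action space at five discrete choices means the gradient estimate is a single categorical score-function term per block, which is the variance argument made above.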

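The result: average Kb ≈ 4–6 on typical text, while on hard passages (math, code identifiers) the average rises to Kb ≥ 12; individual blocks always take a value from K_CHOICES. The gap between that average and a fixed K is the throughput multiplier.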

On this repo's CPU-bounded synthetic-task verification, the monotone-difficulty diagnostic from Appendix A.3 of the master prompt — “bucket inputs by independent difficulty, check that mean K per bucket is monotone increasing” — achieves Spearman rank correlation 1.000. See Results → monotone-difficulty diagnostic.
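The diagnostic quoted above amounts to a bucketed mean followed by a rank correlation. A stdlib-only sketch follows; the function names are hypothetical and the data in the usage lines is made up for illustration, not taken from the repo's results.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r


def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def mean_k_per_bucket(difficulties, ks):
    """Group predicted K by difficulty level and average within each bucket."""
    buckets = {}
    for d, k in zip(difficulties, ks):
        buckets.setdefault(d, []).append(k)
    levels = sorted(buckets)
    return levels, [sum(buckets[d]) / len(buckets[d]) for d in levels]


# Illustrative data only: four difficulty buckets, two inputs each.
levels, means = mean_k_per_bucket([0, 0, 1, 1, 2, 2, 3, 3], [1, 2, 2, 4, 4, 8, 8, 16])
rho = spearman(levels, means)
```

For this toy input the bucket means are 1.5, 3, 6, 12, which are strictly increasing, so rho is 1.0. A correlation of exactly 1 only says the ordering is monotone; it does not measure how well K is calibrated to difficulty.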

Block-causal attention mask

For sequence length N partitioned into blocks of size B:

M[i, j] = 1   iff   ⌊i / B⌋ ≥ ⌊j / B⌋

This is a block-lower-triangular pattern where each B × B diagonal block is full (all ones), not lower-triangular. The full structure for two blocks of size 4 (so N = 8):

           B0    B1
        ┌────────────┐
   B0   │ FULL   ─   │  ← Block 0 self-attention (bidirectional within block)
   B1   │ FULL  FULL │  ← Block 1 sees all of B0 (causal) and itself (bidirectional)
        └────────────┘

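Implementation: a dense (N, N) bool mask in the CPU reference path; the production GPU path uses a block_mask in the style of PyTorch FlexAttention (torch.nn.attention.flex_attention.create_block_mask) with predicate (kv_idx // block_size) ≤ (q_idx // block_size). flash-attn itself does not expose a block_mask parameter.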
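For the dense reference path, the mask definition translates directly into code. This is a stdlib-only sketch using nested lists rather than the repo's actual tensor type; the function name is an assumption.

```python
def block_causal_mask(n, block_size):
    """M[i][j] is True iff query position i may attend to key position j."""
    return [[(i // block_size) >= (j // block_size) for j in range(n)]
            for i in range(n)]


mask = block_causal_mask(8, 4)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# Prints four rows of 11110000 followed by four rows of 11111111.
```

Note that mask[0][3] is True: position 0 attends to position 3 because both sit in block 0, which is exactly where this pattern differs from an ordinary causal mask.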

KV-cache across blocks and within block

Two levels of cache reuse, both essential:

Across blocks
Once block b is generated and committed, its K/V are cached forever. Block b+1's denoising sees this cache. This recovers the AR-style growing-prefix cache.
Within block
During the Kb denoising steps of block b, the K/V of the prior blocks (< b) are unchanged across all steps and are read from the cache. The K/V of the tokens in block b itself change each step (as new tokens are unmasked), so they are recomputed each step.

Two P0 cache bugs to avoid

  1. Caching block b's K/V across denoising steps. Token IDs at masked positions resolve to actual tokens between steps, and the K/V depend on those IDs, so reusing stale entries causes a silent quality drop.
  2. Committing block b's K/V to the cache before all Kb denoising steps complete. An early commit freezes a partially denoised block and contaminates every subsequent block's generation.

Both are verified absent by the multi-layer cache-vs-no-cache equivalence test, which agrees to 1.57 × 10⁻¹⁵ in fp64.
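The shape of such an equivalence check can be shown on a toy. The sketch below is a single attention layer with identity projections and a made-up embed function, so it is far simpler than the multi-layer test referred to above and every name in it is hypothetical. It compares a cached path against full recomputation at each denoising step, then reproduces bug 1 by freezing the in-block K/V at the all-[MASK] step.

```python
import math

MASK = 0  # hypothetical id for the [MASK] token


def embed(tok, d=4):
    """Deterministic toy vector used as query, key and value alike."""
    return [math.sin((tok + 1) * (j + 1)) for j in range(d)]


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(len(q))]


def with_cache(prefix_kv, block_tokens):
    """Cached path: prefix K/V come from the cache, in-block K/V are rebuilt."""
    block_kv = [embed(t) for t in block_tokens]
    kv = prefix_kv + block_kv
    return [attend(q, kv, kv) for q in block_kv]


def without_cache(all_tokens, block_size):
    """Reference path: recompute every K/V from token ids."""
    kv = [embed(t) for t in all_tokens]
    return [attend(q, kv, kv) for q in kv[-block_size:]]


def max_abs_diff(xs, ys):
    return max(abs(a - b) for x, y in zip(xs, ys) for a, b in zip(x, y))


prefix = [5, 9, 3, 7]                                   # block 0, already committed
cache = [embed(t) for t in prefix]
steps = [[MASK] * 4, [MASK, 8, MASK, 2], [4, 8, 6, 2]]  # block 1 over three steps

max_diff = max(max_abs_diff(with_cache(cache, blk), without_cache(prefix + blk, 4))
               for blk in steps)

# Bug 1 reproduced: in-block K/V frozen at the all-[MASK] step, queries from final tokens.
stale_kv = cache + [embed(MASK)] * 4
stale = [attend(embed(t), stale_kv, stale_kv) for t in steps[-1]]
stale_diff = max_abs_diff(stale, without_cache(prefix + steps[-1], 4))

cache = cache + [embed(t) for t in steps[-1]]           # commit only after the last step
```

In this toy the correct cached path matches the reference, while the stale variant does not. In a multi-layer model the K/V of a position at deeper layers also depend on its in-block neighbours through bidirectional attention, which is one reason a single-layer check like this one is not a substitute for the multi-layer test.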

AR-to-CASCADE distillation

Following the Dream-7B recipe:

  1. Start from a pretrained AR LLM (Qwen-2.5-7B or Llama-3-8B).
  2. Convert attention to block-causal (strictly more permissive than fully-causal — the AR weights stay valid).
  3. Add a [MASK] token: initialize its embedding as the mean of existing embeddings + small Gaussian noise (σ = 0.01 × std).
  4. Three-phase continued pretraining on ~150B tokens:
    • Phase 2a (warmup, 10B): low mask rates m ~ U[0, 0.3], close to AR, which preserves AR-like capability.
    • Phase 2b (full diffusion, 100B): full range m ~ U[m_min, 1].
    • Phase 2c (mixed, 40B): alternate diffusion-loss batches with pure-AR-loss batches at 4 : 1, anchoring against catastrophic forgetting.
  5. Train the adaptive Kb head on a held-out adaptation set.
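Two pieces of the recipe above are mechanical enough to sketch: the [MASK] embedding initialization from step 3 and the phase bookkeeping from step 4. Several details are assumptions made for the example. The text does not say whether "std" is per-dimension or global (per-dimension is used here), and the 4 : 1 ratio is read as four diffusion batches per AR batch. All function and constant names are hypothetical.

```python
import random
import statistics


def init_mask_embedding(embeddings, scale=0.01, rng=random):
    """Mean of existing rows plus Gaussian noise with sigma = scale * std.

    Assumption: std is taken per dimension across the vocabulary.
    """
    out = []
    for col in zip(*embeddings):                 # iterate over embedding dimensions
        mu = statistics.fmean(col)
        sigma = scale * statistics.pstdev(col)
        out.append(rng.gauss(mu, sigma))
    return out


# (phase name, token budget in billions), as listed in step 4
PHASES = (("2a", 10), ("2b", 100), ("2c", 40))


def phase_at(tokens_seen_b):
    """Phase name for a cumulative token count given in billions."""
    edge = 0
    for name, budget in PHASES:
        edge += budget
        if tokens_seen_b < edge:
            return name
    return PHASES[-1][0]


def loss_kind(phase, batch_index):
    """Phase 2c: assumed pattern of 4 diffusion batches then 1 AR batch."""
    if phase == "2c" and batch_index % 5 == 4:
        return "ar"
    return "diffusion"
```

The budgets sum to 150, matching the ~150B total in step 4.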

This is a projection, not a CASCADE result — no distillation run has been done here. Extrapolating from the recovery curve reported for Dream-7B, one would expect: ~10B tokens → noticeably worse than the AR teacher; ~50B → comparable on most tasks; ~150B → on par with the teacher. Whether CASCADE reproduces this is untested.

Important subtleties