Phases
The implementation runs in seven phases, each with a THINK document, code, tests, and an explicit exit gate. The CPU-implementable sub-phases of Phases 0, 1, 2, and 3 are complete; the rest require a GPU and data.
Phase 0 — Bootstrap complete
Stand up the repo so subsequent phases have a place to land code without re-litigating layout. Deliverables: directory scaffold matching 03_CASCADE.md § 3 exactly, pinned pyproject.toml, reference attention/cache placed at cascade/attention_reference.py (verified to pass all parity tests), 13 lit notes with the canonical 5-section structure, skeleton modules.
- Exit gate: All 58 scaffold files present (verified by scripts/check_scaffold.py); 4 PASS lines from the reference; LLaDA and BD3-LM anchor lit notes fully drafted.
- Doc: PHASE_0_THINK.md (renders on GitHub; served as plain text from Pages)
Phase 1 — Pretrain wiring
Wire up everything between “raw token tensor” and “trained CASCADE-nano”: corruption.py (BD3-LM corruption), losses.py (LLaDA 1/m-weighted masked CE), modules/rope.py, modules/ffn.py (SwiGLU), modules/block_causal_attn.py (multi-head, RoPE), modules/cascade_block.py (RMSNorm + pre-norm Transformer block), cascade/model.py (CascadeLM).
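The 1/m-weighted masked cross-entropy named above can be illustrated with a minimal pure-Python sketch. It assumes a single sequence, a known masking ratio m, and normalization by sequence length. The function name masked_ce_loss and the m_min argument are illustrative and are not taken from losses.py; the default clip value mirrors the m_min listed in the risk table.

```python
import math

def masked_ce_loss(logits, targets, is_masked, m, m_min=1e-3):
    """1/m-weighted cross-entropy over the masked positions of one sequence.

    logits:    list of per-position lists of vocabulary scores
    targets:   list of true token ids
    is_masked: list of bools, True where the input token was replaced by [MASK]
    m:         masking ratio used to corrupt this sequence
    m_min:     lower clip on m (assumed guard against exploding 1/m weights)
    """
    weight = 1.0 / max(m, m_min)
    total = 0.0
    for scores, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue  # unmasked positions contribute nothing to the loss
        # Numerically stable negative log-softmax of the target score.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
    # Normalize by sequence length (an assumption of this sketch).
    return weight * total / len(targets)
```

With uniform logits over a 4-token vocabulary, half the positions masked, and m = 0.5, the loss reduces to ln 4, which makes a convenient sanity check.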
Phase 1 CPU sub-phase complete
- All modules implemented; cascade/modules/block_causal_attn.py with n_heads = 1 and RoPE off matches the single-head reference oracle within 10⁻¹² in fp64.
- End-to-end smoke train: 9.52 → 1.22 loss over 300 steps on a memorizable batch (see Results).
- Low-m NaN guard test passes.
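For orientation, the attention pattern that block_causal_attn.py is tested against can be sketched as a boolean mask. This is an assumed, simplest form of block-causal masking (full attention inside a block, causal attention across blocks); the helper name block_causal_mask is hypothetical and the repository's actual mask may differ, for example during training.

```python
def block_causal_mask(seq_len, block_size):
    """mask[q][k] is True when query position q may attend to key position k.

    A query sees every key in its own block and in all earlier blocks.
    """
    return [
        [k // block_size <= q // block_size for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

With block_size = 1 this reduces to the ordinary lower-triangular causal mask, which gives a cheap consistency check against an autoregressive reference.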
Phase 1 GPU sub-phase needs GPU
- Train a 12M nano CASCADE on 200M tokens of FineWeb-Edu.
- Verify the “at K = 32, the model fills masked blocks coherently” sanity check (needs a trained model).
- Verify the block-AR equivalence at Kb = B on a trained model.
- flash-attn dispatch in BlockCausalAttention.
- Doc: PHASE_1_THINK.md
Phase 2 — Block cache and denoise complete
The most important phase for correctness: extend the single-layer cache reference to the full multi-layer + multi-head + RoPE stack, and prove the cache-vs-no-cache equivalence.
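The cache-vs-no-cache equivalence can be illustrated on a toy single-layer, single-head case. The sketch below is not the project's BlockCache: it uses plain list concatenation instead of pre-allocated buffers, takes already-projected q, k, v vectors as input, and all function names are hypothetical.

```python
import math
import random

def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(d)]

def full_forward(q, k, v, block_size):
    """No-cache path: each position attends to all keys up to the end of its block."""
    out = []
    for pos in range(len(q)):
        end = (pos // block_size + 1) * block_size
        out.append(attend(q[pos], k[:end], v[:end]))
    return out

def cached_forward(q, k, v, block_size):
    """Cache path: process one block at a time, then commit it to the cache."""
    cache_k, cache_v, out = [], [], []
    for start in range(0, len(q), block_size):
        blk = slice(start, start + block_size)
        keys = cache_k + k[blk]
        vals = cache_v + v[blk]
        out.extend(attend(qi, keys, vals) for qi in q[blk])
        cache_k, cache_v = keys, vals  # commit the finished block
    return out

rng = random.Random(0)
def rand_rows(n, d):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

q, k, v = rand_rows(8, 4), rand_rows(8, 4), rand_rows(8, 4)
ref = full_forward(q, k, v, block_size=4)
got = cached_forward(q, k, v, block_size=4)
max_err = max(abs(a - b) for ra, rb in zip(ref, got) for a, b in zip(ra, rb))
```

In a multi-layer stack the same comparison is harder, because cached keys and values at each layer must come from the same hidden states the no-cache path would produce; that is what the exit gate for this phase checks.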
- BlockCache: per-layer pre-allocated buffers (batch, n_heads, max_blocks · block_size, d_head); no torch.cat on the hot path; validates shape and capacity on commit.
- Cache-aware forward at every level: BlockCausalAttention.forward_with_cache, CascadeBlock.forward_with_cache, CascadeLM.forward_block_with_cache.
- denoise.generate() and denoise_one_block(): confidence-ranked unmask, forbids predicting mask_token_id as a vocab item, EOS-stops generation, supports prefix.
- Exit gate: Multi-layer cache-vs-no-cache equivalence to within 1.57 × 10⁻¹⁵ in fp64; zero-prefix path bit-identical; generate() produces finite outputs with no [MASK] leak; EOS-stop verified.
- Doc: PHASE_2_THINK.md
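The confidence-ranked unmask rule used by denoise_one_block() can be sketched in a few lines. This is an illustration under assumptions, not the repository code: the function name unmask_step, the n_reveal argument, and the use of the top non-mask probability as the confidence score are all hypothetical choices.

```python
def unmask_step(tokens, probs, mask_id, n_reveal):
    """Reveal the n_reveal most confident masked positions in one block.

    tokens: list of token ids, with mask_id at still-masked positions
    probs:  per-position probability lists over the vocabulary
    """
    candidates = []
    for pos, tok in enumerate(tokens):
        if tok != mask_id:
            continue  # already revealed
        # Never predict the mask token itself: exclude it from the argmax.
        best = max(
            (v for v in range(len(probs[pos])) if v != mask_id),
            key=lambda v: probs[pos][v],
        )
        candidates.append((probs[pos][best], pos, best))
    # Highest confidence first; ties broken by position for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    out = list(tokens)
    for _, pos, best in candidates[:n_reveal]:
        out[pos] = best
    return out
```

Calling the function repeatedly until no mask_id remains gives one block of denoising; because the mask token is excluded from the argmax, the output cannot contain a [MASK] leak.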
Phase 3 — Adaptive K head
The project's headline novelty. A small MLP head over K_CHOICES = (1, 2, 4, 8, 16) trained by REINFORCE on a quality-minus-cost reward. EMA scalar baseline + entropy bonus (annealed). The exit-critical diagnostic is monotone-difficulty: mean chosen K per difficulty bucket should be monotone increasing.
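A single REINFORCE update for the K head might look like the sketch below. It works directly on a vector of logits rather than an MLP, and the names reinforce_grad, lam, and beta as well as their default values are assumptions made for illustration; only K_CHOICES and the quality-minus-cost reward shape come from the description above.

```python
import math

K_CHOICES = (1, 2, 4, 8, 16)

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

class EMABaseline:
    """Scalar exponential moving average of observed rewards."""
    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = 0.0

    def update(self, reward):
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return self.value

def reinforce_grad(logits, action, quality, baseline, lam=0.1, beta=0.01):
    """Gradient of the REINFORCE loss with respect to the K-head logits.

    loss   = -(reward - baseline) * log pi(action) - beta * entropy(pi)
    reward = quality - lam * K_CHOICES[action]
    """
    pi = softmax(logits)
    reward = quality - lam * K_CHOICES[action]
    advantage = reward - baseline
    entropy = -sum(p * math.log(p) for p in pi)
    grad = []
    for i, p in enumerate(pi):
        # d(-log pi[action]) / d logit_i = pi_i - 1[i == action]
        d_neg_logp = p - (1.0 if i == action else 0.0)
        # d(-entropy) / d logit_i = pi_i * (log pi_i + entropy)
        d_neg_ent = p * (math.log(p) + entropy)
        grad.append(advantage * d_neg_logp + beta * d_neg_ent)
    return grad, reward
```

A gradient-descent step on these logits raises the probability of the sampled K when its reward beats the baseline and lowers it otherwise; annealing beta toward zero would correspond to the annealed entropy bonus.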
Phase 3 CPU sub-phase complete
- AdaptiveKHead, EMABaseline, reinforce_step_count_loss implemented.
- 2-action contextual bandit reaches 100 % optimal action probability in both contexts.
- Monotone-difficulty diagnostic: bucket K means 2.45 → 2.86 → 4.30 → 7.27 → 12.41, Spearman = 1.000. (See Results.)
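The monotone-difficulty check amounts to a Spearman rank correlation between bucket index and mean chosen K. A minimal stdlib sketch, using the bucket means reported above and hypothetical helper names, is:

```python
def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            out[order[t]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

bucket_ids = [0, 1, 2, 3, 4]                 # easiest to hardest
mean_k = [2.45, 2.86, 4.30, 7.27, 12.41]     # reported bucket means
rho = spearman(bucket_ids, mean_k)
```

Because the five means are strictly increasing, the ranks coincide and rho evaluates to 1, matching the reported value.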
Phase 3 GPU / trained-model sub-phase needs trained nano
- Real-corpus REINFORCE against actual NLL-gap quality signal.
- λ sweep over {0.01, 0.05, 0.1, 0.5}; Pareto plot.
- MMLU-preservation check (Phase 3.5 unfreezes the body with a KL constraint).
- Per-cluster (k-means in h-space) baseline instead of scalar EMA.
- Doc: PHASE_3_THINK.md
Phase 4 — AR-to-CASCADE distillation needs teacher
Convert a pretrained AR LLM (Qwen-2.5-7B or Llama-3-8B) to CASCADE format and continue pretraining on the diffusion objective, following a three-phase schedule (10B / 100B / 40B tokens). This phase is skipped here because (a) the teacher checkpoint is a multi-GB download and (b) training takes hundreds of GPU-hours.
Implementation skeleton lives in cascade/distill.py; the three-phase config is in train/configs/distill_qwen_7b.yaml.
- Exit gate: CASCADE-7B-distilled within 2 pp of the AR teacher on MMLU/HumanEval; ≥ 4 × decode throughput at batch 1; streaming preserved.
Phase 5 — Scaling + ablations deferred
Train from-scratch CASCADE at small / medium / large (125M / 350M / 1.3B); fit scaling laws against a matched-compute AR baseline; produce the quality-speed Pareto frontier; run the full ablation grid (fixed-K, block-size sweep, no-reweighting, no-remask, no-KV-cache, pure AR, full diffusion).
Ablation stubs are scaffolded under ablations/.
Phase 6 — Paper + release deferred
Manuscript at paper/main.tex with a related-work matrix, the headline Pareto plot, scaling laws, ablations, and a clear, honest discussion of where CASCADE underperforms (very-low-entropy completions where AR is already cache-bound; ultra-long needle-in-haystack tasks where bidirectional within-block attention does not help). A repro.sh script reproduces the headline throughput measurement.
Risks tracked across phases
| Risk | Likelihood | Mitigation |
|---|---|---|
| The 1/m reweighting destabilizes training | Medium | Clip m from below (m_min = 10⁻³); warm up from m ∈ [0.2, 0.8] before full range |
| Adaptive K head collapses to K = Kmax | Medium | Cost term −λK; entropy bonus; KL regularization to entropy-proxy baseline |
| Distillation diverges (catastrophic forgetting) | High | Low LR; mix 5 % original AR-loss batches (Phase 2c); checkpoint frequently |
| Block-causal mask gets a subtle bug | High | Exhaustive test against a slow dense reference; verified at fp64 bit-near-identity |
| KV-cache reuse breaks under remask-low-confidence | High | Disable remask within a block (v1); invalidate cache for remasked positions if enabled (v2) |
| Diffusion underperforms on reasoning (math) | Inherent | Honest reporting; combine with CoT distillation in post-training |
| Reviewers think it's just a re-implementation of BD3-LM | High | The adaptive-K head and distillation recipe are the deltas; ablate both clearly |
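The mitigation for the 1/m reweighting risk (clip m from below, warm up on a restricted range) could be implemented as a small sampler. The sketch below is an assumption about how such a schedule might look: the function name sample_mask_ratio, the warmup_steps argument, and the hard switch from the restricted to the full range are hypothetical.

```python
import random

def sample_mask_ratio(step, warmup_steps, rng, m_min=1e-3):
    """Draw a masking ratio m for one training sequence."""
    if step < warmup_steps:
        lo, hi = 0.2, 0.8   # restricted warm-up range from the risk table
    else:
        lo, hi = 0.0, 1.0   # full range after warm-up
    m = rng.uniform(lo, hi)
    return max(m, m_min)    # clip from below so the 1/m weight stays bounded

rng = random.Random(0)
warm = [sample_mask_ratio(0, 100, rng) for _ in range(1000)]
late = [sample_mask_ratio(500, 100, rng) for _ in range(1000)]
```

With m_min = 10⁻³ the loss weight 1/m is capped at 1000, which bounds the contribution of nearly unmasked sequences.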