Phases

The implementation runs in seven phases, each with a THINK document, code, tests, and an explicit exit gate. The CPU-implementable sub-phases of Phases 0, 1, 2, and 3 are complete; the remaining work requires a GPU and training data.

Phase 0 — Bootstrap (complete)

Stand up the repo so subsequent phases have a place to land code without re-litigating layout. Deliverables: a directory scaffold matching 03_CASCADE.md § 3 exactly, a pinned pyproject.toml, the reference attention/cache implementation at cascade/attention_reference.py (verified to pass all parity tests), 13 lit notes in the canonical 5-section structure, and skeleton modules.

Exit gate
All 58 scaffold files present (verified by scripts/check_scaffold.py); 4 PASS lines from the reference; LLaDA and BD3-LM anchor lit notes fully drafted.
Doc
PHASE_0_THINK.md (renders on GitHub; served as plain text from Pages)

Phase 1 — Pretrain wiring

Wire up everything between “raw token tensor” and “trained CASCADE-nano”: corruption.py (BD3-LM corruption), losses.py (LLaDA 1/m-weighted masked CE), modules/rope.py, modules/ffn.py (SwiGLU), modules/block_causal_attn.py (multi-head, RoPE), modules/cascade_block.py (RMSNorm + pre-norm Transformer block), cascade/model.py (CascadeLM).
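As a rough illustration of how the corruption and loss pieces fit together, the sketch below masks tokens with one masking ratio m per block and then applies a 1/m-weighted cross-entropy over the masked positions. It is a NumPy toy under stated assumptions, not the contents of corruption.py or losses.py: MASK_ID, the uniform sampling of m, the per-block granularity, and the normalization by sequence length are illustrative choices, and the function names are hypothetical. The lower clip on m follows the m_min = 10^-3 mitigation listed in the risk table.

```python
import numpy as np

MASK_ID = 0    # hypothetical id reserved for the [MASK] token
M_MIN = 1e-3   # lower clip on the masking ratio m

def corrupt_blocks(tokens, block_size, rng, m_low=M_MIN, m_high=1.0):
    """Mask each token independently with its block's ratio m (assumed scheme).

    tokens: (T,) int array. Returns (noisy, is_masked, m_per_token).
    """
    T = tokens.shape[0]
    n_blocks = -(-T // block_size)                      # ceil division
    m_block = rng.uniform(m_low, m_high, size=n_blocks) # one ratio per block
    m_tok = np.repeat(m_block, block_size)[:T]          # broadcast to tokens
    is_masked = rng.random(T) < m_tok
    noisy = np.where(is_masked, MASK_ID, tokens)
    return noisy, is_masked, m_tok

def masked_ce_1_over_m(logits, targets, is_masked, m_tok, m_min=M_MIN):
    """Cross-entropy on masked positions only, each weighted by 1/m.

    logits: (T, V) float array; targets: (T,) int array.
    """
    z = logits - logits.max(axis=-1, keepdims=True)     # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    weights = is_masked / np.maximum(m_tok, m_min)      # 0 where unmasked
    return float((weights * nll).sum() / len(targets))

rng = np.random.default_rng(0)
tokens = rng.integers(1, 50, size=16)                   # ids 1..49, never MASK_ID
noisy, is_masked, m_tok = corrupt_blocks(tokens, block_size=4, rng=rng)
logits = rng.standard_normal((16, 50))                  # stand-in for model output
print(masked_ce_1_over_m(logits, tokens, is_masked, m_tok))
```

Unmasked positions contribute nothing to the loss, and a small m inflates the weight on the few positions that are masked, which is why the clip matters.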

Phase 1 CPU sub-phase: complete

Phase 1 GPU sub-phase: needs GPU

Doc
PHASE_1_THINK.md

Phase 2 — Block cache and denoise (complete)

The most important phase for correctness: extend the single-layer cache reference to the full multi-layer + multi-head + RoPE stack, and prove the cache-vs-no-cache equivalence.

Exit gate
Multi-layer cache-vs-no-cache outputs agree to within 1.57 × 10^-15 in fp64; zero-prefix path bit-identical; generate() produces finite outputs with no [MASK] leak; EOS-stop verified.
Doc
PHASE_2_THINK.md
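The cache-vs-no-cache equivalence that Phase 2 checks can be illustrated on a much smaller case. The sketch below is a single-layer, single-head NumPy toy without RoPE, so it is not the project's multi-layer test; the function names and weight shapes are assumptions made for illustration. It computes block-causal attention once over the whole sequence with a mask, then again block by block with a growing K/V cache, and compares the two in fp64.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend_full(x, Wq, Wk, Wv, block_size):
    """No cache: one pass over all tokens with a block-causal mask."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    blk = np.arange(T) // block_size
    allowed = blk[None, :] <= blk[:, None]   # key block <= query block
    scores = np.where(allowed, q @ k.T / np.sqrt(d), -np.inf)
    return softmax(scores) @ v

def attend_cached(x, Wq, Wk, Wv, block_size):
    """Cache: process one block at a time, reusing K/V of earlier blocks."""
    T, d = x.shape
    k_cache = np.zeros((0, d))
    v_cache = np.zeros((0, d))
    outputs = []
    for start in range(0, T, block_size):
        xb = x[start:start + block_size]
        k_cache = np.concatenate([k_cache, xb @ Wk])  # append this block's keys
        v_cache = np.concatenate([v_cache, xb @ Wv])  # append this block's values
        scores = (xb @ Wq) @ k_cache.T / np.sqrt(d)   # attends to blocks <= current
        outputs.append(softmax(scores) @ v_cache)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
d, T, block_size = 8, 10, 4
x = rng.standard_normal((T, d))                       # float64 by default
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
diff = np.abs(attend_full(x, Wq, Wk, Wv, block_size)
              - attend_cached(x, Wq, Wk, Wv, block_size)).max()
print(f"max abs difference: {diff:.3e}")
```

The two paths should differ only by floating-point rounding, because a block's keys and values depend only on that block's inputs in a single layer. In a real denoising loop the current block's K/V would be recomputed at each step and appended to the cache only once the block is finalized; the toy skips that detail.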

Phase 3 — Adaptive K head

The project's headline novelty: a small MLP head that chooses K from K_CHOICES = (1, 2, 4, 8, 16), trained by REINFORCE on a quality-minus-cost reward with an EMA scalar baseline and an annealed entropy bonus. The exit-critical diagnostic is monotone difficulty: the mean chosen K per difficulty bucket should increase monotonically with difficulty.
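To make the training signal concrete, here is a minimal sketch of one REINFORCE update over K_CHOICES, assuming a bare logit vector in place of the MLP head. The reward form quality − λK follows the cost term named in the risk table; the helper names, the EMA decay, the learning rate, and the numeric values are hypothetical. The entropy coefficient beta is passed in, so any annealing schedule would live in the caller.

```python
import numpy as np

K_CHOICES = (1, 2, 4, 8, 16)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(quality, k, lam):
    """Quality-minus-cost reward; the cost is linear in the chosen K."""
    return quality - lam * k

def loss_and_grad(logits, action, advantage, beta):
    """REINFORCE loss with entropy bonus, and its gradient w.r.t. the logits.

    loss = -advantage * log pi(action) - beta * H(pi)
    """
    p = softmax(logits)
    logp = np.log(p)
    entropy = -(p * logp).sum()
    loss = -advantage * logp[action] - beta * entropy
    onehot = np.eye(len(logits))[action]
    # d log pi(a)/dz = onehot - p ;  dH/dz_j = -p_j * (log p_j + H)
    grad = -advantage * (onehot - p) + beta * p * (logp + entropy)
    return loss, grad

def ema_update(baseline, r, decay=0.9):
    """Scalar exponential-moving-average baseline."""
    return decay * baseline + (1.0 - decay) * r

# One illustrative update step with made-up numbers.
rng = np.random.default_rng(0)
logits = np.zeros(len(K_CHOICES))
baseline = 0.0
idx = rng.choice(len(K_CHOICES), p=softmax(logits))   # sample an index into K_CHOICES
r = reward(quality=0.8, k=K_CHOICES[idx], lam=0.02)
loss, grad = loss_and_grad(logits, idx, advantage=r - baseline, beta=0.01)
logits -= 0.1 * grad                                  # plain gradient step
baseline = ema_update(baseline, r)
```

Subtracting the baseline leaves the expected gradient unchanged while reducing its variance, and the entropy term pushes back against early collapse onto a single K.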

Phase 3 CPU sub-phase: complete

Phase 3 GPU / trained-model sub-phase: needs trained nano

Doc
PHASE_3_THINK.md
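The monotone-difficulty diagnostic for Phase 3 reduces to a small aggregation. The sketch below assumes each evaluated example carries an integer difficulty bucket (0 = easiest) and the K the head chose for it; the function names are hypothetical, and "monotone increasing" is read here as non-decreasing with an optional tolerance, which is an assumption rather than the project's stated criterion.

```python
from collections import defaultdict

def mean_k_per_bucket(buckets, chosen_ks):
    """Average chosen K for each difficulty bucket, ordered by bucket id."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for b, k in zip(buckets, chosen_ks):
        sums[b] += k
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

def is_monotone_increasing(means, tol=0.0):
    """True if each bucket's mean K is at least the previous one (minus tol)."""
    values = list(means.values())
    return all(later >= earlier - tol for earlier, later in zip(values, values[1:]))

# Illustrative input only, not measured data.
buckets = [0, 0, 1, 1, 2]
chosen_ks = [1, 2, 4, 4, 16]
means = mean_k_per_bucket(buckets, chosen_ks)
print(means, is_monotone_increasing(means))
```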

Phase 4 — AR-to-CASCADE distillation (needs teacher)

Convert a pretrained AR LLM (Qwen-2.5-7B or Llama-3-8B) to CASCADE format and continue pretraining on the diffusion objective. The schedule has three phases (10B / 100B / 40B tokens). Skipped here because (a) the teacher checkpoint is a multi-GB download and (b) training takes hundreds of GPU-hours.

Implementation skeleton lives in cascade/distill.py; the three-phase config is in train/configs/distill_qwen_7b.yaml.

Exit gate
CASCADE-7B-distilled within 2 pp of the AR teacher on MMLU/HumanEval; ≥ 4 × decode throughput at batch 1; streaming preserved.

Phase 5 — Scaling + ablations (deferred)

Train CASCADE from scratch at three sizes, small / medium / large (125M / 350M / 1.3B parameters); fit scaling laws against a matched-compute AR baseline; produce the quality-speed Pareto frontier; run the full ablation grid (fixed-K, block-size sweep, no-reweighting, no-remask, no-KV-cache, pure AR, full diffusion).

Ablation stubs are scaffolded under ablations/.
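One simple way to carry out the scaling-law fit planned for Phase 5 is a least-squares line in log-log space. The sketch below assumes a pure power law, loss ≈ a · N^(−alpha), with no irreducible-loss term; that functional form, the function name, and the loss values are placeholders, since this phase is deferred and no results exist. The same fit applied to the AR baseline would give a second (a, alpha) pair to compare.

```python
import numpy as np

def fit_power_law(n_params, losses):
    """Fit loss ≈ a * N**(-alpha) by linear regression on log-log values.

    Returns (a, alpha).
    """
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Model sizes from the plan; the losses are synthetic, generated from a known
# power law purely to show that the fit recovers its parameters.
sizes = np.array([125e6, 350e6, 1.3e9])
synthetic_losses = 5.0 * sizes ** -0.1
a, alpha = fit_power_law(sizes, synthetic_losses)
print(f"a = {a:.4f}, alpha = {alpha:.4f}")
```

With only three model sizes the fit has one residual degree of freedom, so any real use would need error bars or more points before drawing conclusions.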

Phase 6 — Paper + release (deferred)

The manuscript lives at paper/main.tex and includes a related-work matrix, the headline Pareto plot, scaling laws, ablations, and a clear, honest discussion of where CASCADE underperforms (very-low-entropy completions where AR is already cache-bound; ultra-long needle-in-haystack tasks where bidirectional within-block attention isn't helpful). A repro.sh reproduces the headline throughput measurement.

Risks tracked across phases

Risk matrix carried over from 03_CASCADE.md § 6.
| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| The 1/m reweighting destabilizes training | Medium | Clip m from below (m_min = 10^-3); warm up from m ∈ [0.2, 0.8] before full range |
| Adaptive K head collapses to K = K_max | Medium | Cost term −λK; entropy bonus; KL regularization to entropy-proxy baseline |
| Distillation diverges (catastrophic forgetting) | High | Low LR; mix 5 % original AR-loss batches (Phase 2c); checkpoint frequently |
| Block-causal mask gets a subtle bug | High | Exhaustive test against a slow dense reference; verified at fp64 bit-near-identity |
| KV-cache reuse breaks under remask-low-confidence | High | Disable remask within a block (v1); invalidate cache for remasked positions if enabled (v2) |
| Diffusion underperforms on reasoning (math) | Inherent | Honest reporting; combine with CoT distillation in post-training |
| Reviewers think it's just a re-implementation of BD3-LM | High | The adaptive-K head and distillation recipe are the deltas; ablate both clearly |
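The mitigation for the block-causal mask risk, an exhaustive test against a slow dense reference, might look like the sketch below. Both function names are hypothetical and the project's actual test may differ. The assumed convention is that mask[i, j] is True when query position i may attend to key position j, meaning j lies in the same block as i or in an earlier one. The slow version is derived differently from the fast one (via each query's block end rather than a block-index comparison) so that the two do not share a bug.

```python
import numpy as np

def block_causal_mask(T, block_size):
    """Vectorized mask: key block index <= query block index."""
    blk = np.arange(T) // block_size
    return blk[None, :] <= blk[:, None]

def block_causal_mask_slow(T, block_size):
    """Dense reference: query i sees every position before the end of its block."""
    mask = [[False] * T for _ in range(T)]
    for i in range(T):
        block_end = min((i // block_size + 1) * block_size, T)
        for j in range(block_end):
            mask[i][j] = True
    return np.array(mask, dtype=bool)

def check_all(max_T=16, max_block=8):
    """Compare the two implementations on every small (T, block_size) pair."""
    for T in range(1, max_T + 1):
        for block_size in range(1, max_block + 1):
            fast = block_causal_mask(T, block_size)
            slow = block_causal_mask_slow(T, block_size)
            assert np.array_equal(fast, slow), (T, block_size)

check_all()
```

Two limiting cases make useful extra checks: block_size = 1 should reduce to the ordinary lower-triangular causal mask, and block_size ≥ T should allow every pair.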