Phases

The implementation runs in seven phases (0 through 6), each with a THINK document, code, tests, and an explicit exit gate. Phases 0 and 2 and the CPU-implementable sub-phases of Phases 1 and 3 are complete; the rest requires a GPU and data.

Phase 0 — Bootstrap complete

Stand up the repo so subsequent phases have a place to land code without re-litigating layout. Deliverables: directory scaffold matching 03_CASCADE.md § 3 exactly, pinned pyproject.toml, reference attention/cache placed at cascade/attention_reference.py (verified to pass all parity tests), 13 lit notes with the canonical 5-section structure, skeleton modules.

Exit gate: All 58 scaffold files present (verified by scripts/check_scaffold.py); 4 PASS lines from the reference; LLaDA and BD3-LM anchor lit notes fully drafted.
Doc: PHASE_0_THINK.md (renders on GitHub; served as plain text from Pages)

Phase 1 — Pretrain wiring

Wire up everything between “raw token tensor” and “trained CASCADE-nano”: corruption.py (BD3-LM corruption), losses.py (LLaDA 1/m-weighted masked CE), modules/rope.py, modules/ffn.py (SwiGLU), modules/block_causal_attn.py (multi-head, RoPE), modules/cascade_block.py (RMSNorm + pre-norm Transformer block), cascade/model.py (CascadeLM).
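As a rough illustration of the corruption and loss pieces listed above, here is a minimal pure-Python sketch. It is not the code in corruption.py or losses.py: it assumes a single masking rate m for the whole sequence, a hypothetical `MASK_ID`, and a length-normalized loss. The lower clip `m_min = 1e-3` is the value quoted in the risk matrix.

```python
import math
import random

MASK_ID = 0  # hypothetical id reserved for [MASK] in this sketch


def corrupt(tokens, m, rng):
    """Mask each token independently with probability m."""
    masked = [rng.random() < m for _ in tokens]
    noisy = [MASK_ID if is_masked else tok for tok, is_masked in zip(tokens, masked)]
    return noisy, masked


def weighted_masked_ce(log_probs, targets, masked, m, m_min=1e-3):
    """1/m-weighted cross-entropy over masked positions, divided by sequence length.

    log_probs[i][v] is the model's log-probability of token v at position i.
    """
    weight = 1.0 / max(m, m_min)  # clip m from below so the weight stays finite
    total = 0.0
    for lp, tgt, is_masked in zip(log_probs, targets, masked):
        if is_masked:
            total -= lp[tgt]
    return weight * total / len(targets)


# Toy check: a uniform predictor over 4 tokens with half the positions masked.
uniform = [[math.log(0.25)] * 4 for _ in range(4)]
loss = weighted_masked_ce(uniform, [0, 1, 2, 3], [True, False, True, False], m=0.5)
print(loss, math.log(4))
```

With m = 0.5 the 1/m weight exactly compensates for only half the positions contributing, so the uniform predictor scores log 4 regardless of which positions were masked.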

Phase 1 CPU sub-phase complete

All modules implemented; cascade/modules/block_causal_attn.py with n_heads = 1 and RoPE disabled matches the single-head reference oracle to within 10⁻¹² in fp64.
End-to-end smoke train: loss falls from 9.52 to 1.22 over 300 steps on a memorizable batch (see Results).
Low-m NaN guard test passes.

Phase 1 GPU sub-phase needs GPU

Train a 12M-parameter nano CASCADE on 200M tokens of FineWeb-Edu.
Verify the “at K = 32, the model fills masked blocks coherently” sanity check (needs a trained model).
Verify the block-AR equivalence at K_b = B on a trained model.
Add flash-attn dispatch to BlockCausalAttention.

Doc: PHASE_1_THINK.md

Phase 2 — Block cache and denoise complete

The most important phase for correctness: extend the single-layer cache reference to the full multi-layer + multi-head + RoPE stack, and prove the cache-vs-no-cache equivalence.
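The equivalence being tested can be illustrated on a toy single-head, single-layer case. The sketch below is an assumption-laden stand-in, not the project's test: queries, keys, and values are given directly rather than computed by a model, there is no RoPE, and the cache is a plain Python list rather than a pre-allocated buffer. All function names are hypothetical.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention for one query vector over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(dim)]


def full_forward(qs, ks, vs, block_size):
    """No cache: each query sees every key in its own block and all earlier blocks."""
    out = []
    for i, q in enumerate(qs):
        visible = (i // block_size + 1) * block_size
        out.append(attend(q, ks[:visible], vs[:visible]))
    return out


def cached_forward(qs, ks, vs, block_size):
    """With cache: process one block at a time, then commit its keys and values."""
    cache_k, cache_v, out = [], [], []
    for start in range(0, len(qs), block_size):
        stop = start + block_size
        keys = cache_k + ks[start:stop]
        values = cache_v + vs[start:stop]
        out.extend(attend(q, keys, values) for q in qs[start:stop])
        cache_k, cache_v = keys, values  # commit the finished block
    return out


def random_rows(rng, n, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
qs, ks, vs = (random_rows(rng, 8, 4) for _ in range(3))
full = full_forward(qs, ks, vs, block_size=2)
cached = cached_forward(qs, ks, vs, block_size=2)
max_diff = max(abs(a - b) for ra, rb in zip(full, cached) for a, b in zip(ra, rb))
print(max_diff)
```

The multi-layer case is harder because each layer's keys and values depend on the previous layer's output, which is why the exit gate checks the full stack rather than a single layer.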

BlockCache: per-layer pre-allocated buffers (batch, n_heads, max_blocks · block_size, d_head); no torch.cat on the hot path; validates shape and capacity on commit.
Cache-aware forward at every level: BlockCausalAttention.forward_with_cache, CascadeBlock.forward_with_cache, CascadeLM.forward_block_with_cache.
denoise.generate() and denoise_one_block(): confidence-ranked unmasking; never predicts mask_token_id as a vocabulary item; stops generation at EOS; supports a prefix.
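A single confidence-ranked unmasking step might look like the following sketch. It is illustrative only: `unmask_top_k`, `MASK_ID`, and the plain-list probability table are hypothetical stand-ins for whatever denoise_one_block() does internally, and EOS handling and prefixes are left out.

```python
MASK_ID = 0  # hypothetical id reserved for [MASK]


def unmask_top_k(block, probs, k):
    """Commit the k most confident predictions among still-masked positions.

    probs[i][v] is the model's probability of token v at position i.
    """
    candidates = []
    for pos, tok in enumerate(block):
        if tok != MASK_ID:
            continue  # already committed
        # Never let the model emit [MASK] itself: exclude it from the argmax.
        best = max(
            (v for v in range(len(probs[pos])) if v != MASK_ID),
            key=lambda v: probs[pos][v],
        )
        candidates.append((probs[pos][best], pos, best))
    candidates.sort(reverse=True)  # highest confidence first
    out = list(block)
    for _, pos, tok in candidates[:k]:
        out[pos] = tok
    return out


block = [MASK_ID, 2, MASK_ID, MASK_ID]
probs = [
    [0.70, 0.20, 0.05, 0.05],  # [MASK] scores highest here but is never chosen
    [0.00, 0.00, 1.00, 0.00],
    [0.00, 0.10, 0.10, 0.80],
    [0.10, 0.60, 0.20, 0.10],
]
print(unmask_top_k(block, probs, k=2))
```

Calling the function repeatedly, with fresh probabilities each time, unmasks the block over several steps; the number of steps is the K that Phase 3 learns to choose.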

Exit gate: Multi-layer cache-vs-no-cache equivalence holds to 1.57 × 10⁻¹⁵ in fp64; the zero-prefix path is bit-identical; generate() produces finite outputs with no [MASK] leak; EOS stop verified.
Doc: PHASE_2_THINK.md

Phase 3 — Adaptive K head

The project's headline novelty: a small MLP head over K_CHOICES = (1, 2, 4, 8, 16), trained by REINFORCE on a quality-minus-cost reward with an EMA scalar baseline and an annealed entropy bonus. The exit-critical diagnostic is monotone difficulty: the mean chosen K per difficulty bucket should increase monotonically with difficulty.
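To make the training signal concrete, here is a minimal sketch of REINFORCE with a scalar EMA baseline on a toy two-context, two-action bandit. It is not the project's AdaptiveKHead: the policy is a table of logits rather than an MLP, the reward is a 0/1 stand-in for quality minus cost, and the entropy bonus is omitted. The hyperparameters are arbitrary choices for the sketch.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=4000, lr=0.5, ema_decay=0.9, seed=0):
    """REINFORCE on a toy 2-context, 2-action bandit; returns per-context policies."""
    rng = random.Random(seed)
    logits = [[0.0, 0.0], [0.0, 0.0]]  # one row of action logits per context
    baseline = 0.0  # scalar EMA of observed rewards
    for _ in range(steps):
        ctx = rng.randrange(2)
        probs = softmax(logits[ctx])
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == ctx else 0.0  # the optimal action equals the context id
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. logit a is 1[a == action] - pi(a).
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[ctx][a] += lr * advantage * grad
        baseline = ema_decay * baseline + (1.0 - ema_decay) * reward
    return [softmax(row) for row in logits]


policies = train_bandit()
print(policies)
```

Because the baseline stays between the two possible rewards, every update in this toy moves probability toward the optimal action. The real head has five actions and a noisy reward, where the baseline mainly serves to reduce variance.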

Phase 3 CPU sub-phase complete

AdaptiveKHead, EMABaseline, reinforce_step_count_loss implemented.
2-action contextual bandit reaches 100 % optimal action probability in both contexts.
Monotone-difficulty diagnostic: bucket K means 2.45 → 2.86 → 4.30 → 7.27 → 12.41, Spearman = 1.000. (See Results.)
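One plausible reading of the diagnostic is a Spearman rank correlation between bucket index and mean chosen K. The sketch below applies the no-ties formula to the five bucket means reported above; whether the project computes it on bucket means or on per-sample values is an assumption here.

```python
def ranks(values):
    """Rank values from 1 to n (assumes no ties, which holds for the data below)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via the no-ties formula 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


bucket_index = [0, 1, 2, 3, 4]
mean_k = [2.45, 2.86, 4.30, 7.27, 12.41]  # bucket means reported above
rho = spearman(bucket_index, mean_k)
print(rho)  # 1.0, since the means are strictly increasing
```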

Phase 3 GPU / trained-model sub-phase needs trained nano

Real-corpus REINFORCE against actual NLL-gap quality signal.
λ sweep over {0.01, 0.05, 0.1, 0.5}; Pareto plot.
MMLU-preservation check (Phase 3.5 unfreezes the body with a KL constraint).
Per-cluster (k-means in h-space) baseline instead of scalar EMA.
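For the Pareto plot, each λ in the sweep yields one (cost, quality) point, and the frontier is the set of points that no other point beats on both axes. The helper below is a hypothetical sketch of that filtering step. The sample points are made-up placeholders, not measurements.

```python
def pareto_front(points):
    """Return the non-dominated (cost, quality) points; lower cost and higher quality win."""
    front = []
    # Sort by cost ascending, breaking ties by quality descending.
    for cost, quality in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it improves on the best quality seen at lower or equal cost.
        if not front or quality > front[-1][1]:
            front.append((cost, quality))
    return front


# Placeholder (mean K, quality) pairs purely for illustration.
points = [(1, 0.2), (2, 0.5), (2, 0.4), (4, 0.45), (8, 0.9)]
print(pareto_front(points))
```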

Doc: PHASE_3_THINK.md

Phase 4 — AR-to-CASCADE distillation needs teacher

Convert a pretrained AR LLM (Qwen-2.5-7B or Llama-3-8B) to CASCADE format and continue pretraining on the diffusion objective, following a three-phase schedule (10B / 100B / 40B tokens). Skipped here because (a) the teacher checkpoint is a multi-GB download and (b) training takes hundreds of GPU-hours.

Implementation skeleton lives in cascade/distill.py; the three-phase config is in train/configs/distill_qwen_7b.yaml.

Exit gate: CASCADE-7B-distilled within 2 pp of the AR teacher on MMLU/HumanEval; ≥ 4 × decode throughput at batch 1; streaming preserved.

Phase 5 — Scaling + ablations deferred

Train from-scratch CASCADE at small / medium / large (125M / 350M / 1.3B); fit scaling laws against a matched-compute AR baseline; produce the quality-speed Pareto frontier; run the full ablation grid (fixed-K, block-size sweep, no-reweighting, no-remask, no-KV-cache, pure AR, full diffusion).

Ablation stubs are scaffolded under ablations/.

Phase 6 — Paper + release deferred

Manuscript at paper/main.tex with a related-work matrix, the headline Pareto plot, scaling laws, ablations, and a clear, honest discussion of where CASCADE underperforms (very-low-entropy completions where AR is already cache-bound; ultra-long needle-in-haystack tasks where bidirectional attention within a block does not help). A repro.sh script reproduces the headline throughput measurement.

Risks tracked across phases

Risk matrix carried over from `03_CASCADE.md § 6`.

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| The `1/m` reweighting destabilizes training | Medium | Clip m from below (m_min = 10⁻³); warm up from m ∈ [0.2, 0.8] before full range |
| Adaptive K head collapses to K = K_max | Medium | Cost term −λK; entropy bonus; KL regularization to entropy-proxy baseline |
| Distillation diverges (catastrophic forgetting) | High | Low LR; mix 5 % original AR-loss batches (Phase 2c); checkpoint frequently |
| Block-causal mask gets a subtle bug | High | Exhaustive test against a slow dense reference; verified at fp64 bit-near-identity |
| KV-cache reuse breaks under remask-low-confidence | High | Disable remask within a block (v1); invalidate cache for remasked positions if enabled (v2) |
| Diffusion underperforms on reasoning (math) | Inherent | Honest reporting; combine with CoT distillation in post-training |
| Reviewers think it's just a re-implementation of BD3-LM | High | The adaptive-K head and distillation recipe are the deltas; ablate both clearly |
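
The first mitigation in the matrix (clip m from below, warm up on a narrow range) could be sketched as a masking-rate sampler. This is an assumption about how the schedule might be wired: the hard switch at `warmup_steps`, the uniform sampling, and the function name are all hypothetical. Only m_min = 10⁻³ and the [0.2, 0.8] warm-up range come from the table.

```python
import random


def sample_mask_rate(step, warmup_steps, rng, m_min=1e-3, warm_range=(0.2, 0.8)):
    """Sample a masking rate m: narrow range during warm-up, clipped full range after."""
    if step < warmup_steps:
        low, high = warm_range
    else:
        low, high = m_min, 1.0  # full range, never below m_min
    return rng.uniform(low, high)


rng = random.Random(0)
warm = [sample_mask_rate(s, 100, rng) for s in range(100)]
full = [sample_mask_rate(s, 100, rng) for s in range(100, 300)]
print(min(warm), max(warm), min(full), max(full))
```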