# `docs/PHASE_0_THINK.md` — Phase 0: bootstrap

> Modeled on the bar set by `03_CASCADE.md` Appendix C. The same depth of reasoning is expected at every phase boundary; this is the simplest phase, so it's the shortest THINK.

## 1. What I understand the task to be

Stand up the `cascade-lm` repo so that subsequent phases have somewhere to land code without re-litigating layout, deps, or scope. Deliverables:

- The directory scaffold from `03_CASCADE.md § 3`, exactly.
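- A `pyproject.toml` declaring `torch >= 2.5`, `flash-attn >= 2.6` (in the `gpu` extra), `transformers`, `datasets`, `hydra-core`, `wandb`, `pytest`, `ruff`, and `pyright`. Versions carry lower bounds only; upper bounds are added when a conflict is actually observed (see § 4, item 2).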
- `cascade/attention_reference.py` containing the reference block-causal attention + KV cache implementation from `03_CASCADE.md` Appendix B. **Already verified to pass parity tests** (the proof we have a correct foundation before any other code is written).
- 13 lit notes in `docs/lit/`. At minimum, the two anchors (BD3-LM and LLaDA) are fully drafted with the required structure (corruption process / loss / inference structure / KV-cache use / throughput-vs-AR comparison).
- Empty (but typed) skeletons for every module in `03_CASCADE.md § 3`. No logic, just signatures with `raise NotImplementedError` and shape docstrings.

**Exit when:** all skeleton modules import cleanly; `python cascade/attention_reference.py` prints four PASS lines; BD3-LM and LLaDA notes pass a self-review against their structure template; the other 11 notes exist at minimum as stubs with the required template (`{corruption, loss, inference, kv-cache, throughput}` sections, even if the content is `TBD: read paper §X.Y`).
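
As an illustration of the "signatures only" deliverable, a skeleton might look like the sketch below. It assumes `BlockCache` (named in § 4) is one of the skeleton classes; the method names `append` and `num_blocks` and the shapes in the docstrings are placeholders, not taken from `03_CASCADE.md`. `Any` stands in for `torch.Tensor` so the snippet imports without torch.

```python
from __future__ import annotations

from typing import Any


class BlockCache:
    """Hypothetical skeleton: per-block KV cache. Stub only, no logic."""

    def append(self, keys: Any, values: Any) -> "BlockCache":
        """Store K/V for one finished block.

        Shapes (assumed for illustration):
            keys, values: (batch, heads, block_len, head_dim)
        """
        raise NotImplementedError("BlockCache.append is a Phase 0 stub")

    def num_blocks(self) -> int:
        """Return how many blocks are currently cached."""
        raise NotImplementedError("BlockCache.num_blocks is a Phase 0 stub")
```

The class imports and type-checks, but any call fails immediately, which is the behaviour the exit gate relies on.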

## 2. Alternative bootstrap shapes considered

**Shape A: One flat package, no subpackages.**
- All `.py` in `cascade/`, no `cascade/modules/`.
- Pros: less import noise; easier to grep.
- Cons: `03_CASCADE.md § 3` prescribes `cascade/modules/`, and this repo follows the canonical doc by default. Diverging here would require a deviation note.
- Verdict: rejected. Follow the prescribed layout.

**Shape B: Use `src/cascade/` layout instead of `cascade/`.**
- The Python packaging community considers `src/` layout safer (prevents accidental import of the working directory).
- Pros: idiomatic for new packages in 2026.
- Cons: the doc uses `cascade/` directly, so switching would invalidate every path in it.
- Verdict: rejected for consistency with the canonical doc.

**Shape C (chosen): Mirror `03_CASCADE.md § 3` exactly.**
- Directory tree is 1:1 with the doc.
- Pros: zero ambiguity for the reader; the doc is the map.
- Cons: none material.
- Verdict: do this.

## 3. The chosen design with explicit tradeoffs

**Decision 1: pin `flash-attn` as a separate `[gpu]` extra, not a base dep.**
- *Why:* `flash-attn` typically needs a CUDA toolchain at install time, because it compiles CUDA kernels when no matching prebuilt wheel is available. CI runners and laptops often lack one. Forcing it into the base dependencies means `uv sync` fails on machines that only need to read code or run CPU tests. Making it an extra means `uv sync` works everywhere and `uv sync --extra gpu` works on the H100 nodes.
- *Tradeoff:* a small number of users will install without `--extra gpu` and be confused when fast paths fall back to the dense reference. Document this loudly in `README.md`.
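
One way to make the fallback loud at runtime as well as in the README is a guarded availability check. This is a minimal sketch, assuming the package's import name is `flash_attn`; `HAS_FLASH_ATTN` and `attention_backend` are hypothetical names, not something `03_CASCADE.md` prescribes.

```python
import importlib.util

# True only when the optional `gpu` extra (flash-attn) is installed.
# find_spec looks the module up without importing it, so this is safe on CPU.
HAS_FLASH_ATTN: bool = importlib.util.find_spec("flash_attn") is not None


def attention_backend() -> str:
    """Name the attention path this machine would use (hypothetical helper)."""
    return "flash-attn" if HAS_FLASH_ATTN else "dense-reference"


if __name__ == "__main__":
    if not HAS_FLASH_ATTN:
        print(
            "flash-attn not installed; using the dense reference path. "
            "Run `uv sync --extra gpu` on a CUDA machine for the fast path."
        )
    print(f"attention backend: {attention_backend()}")
```

Checking with `find_spec` rather than a bare `import` keeps base installs working and leaves the decision about which path to call in one place.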

**Decision 2: keep the reference attention implementation as `cascade/attention_reference.py`, not `cascade/modules/block_causal_attn.py`.**
- *Why:* the reference is the *correctness oracle* — it exists to be tested against, not used in training. `block_causal_attn.py` (the production flash-attn variant) is what's imported in training. Naming them apart prevents confusion later.
- *Tradeoff:* one extra file. Worth it for the conceptual separation.

**Decision 3: every skeleton module imports cleanly and either raises `NotImplementedError` or returns dummy values that pass type-checking.**
- *Why:* lets `pyright cascade/` and `pytest --collect-only tests/` work from Day 1. CI can be green even before any feature is written.
- *Tradeoff:* slight risk of forgetting a stub before claiming a phase complete. Mitigation: a `grep -r "NotImplementedError" cascade/` check in CI catches leftover stubs at the end of Phase 1.
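
The grep is enough for CI, but it also matches comments and docstrings. If that becomes noisy, an AST-based scan is one alternative. The sketch below is an assumption about how such a check could look (the name `find_stubs` and the `SAMPLE` source are illustrative); it reports functions whose body contains `raise NotImplementedError`.

```python
import ast

# Illustrative source only; real input would be the text of a skeleton module.
SAMPLE = '''
class BlockCache:
    def append(self, k, v):
        """k, v: (batch, heads, block_len, head_dim)."""
        raise NotImplementedError

    def num_blocks(self) -> int:
        return 0
'''


def find_stubs(source: str) -> list[str]:
    """Return names of functions that raise NotImplementedError somewhere inside."""
    stubs: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for sub in ast.walk(node):
            if not isinstance(sub, ast.Raise) or sub.exc is None:
                continue
            # Handle both `raise NotImplementedError` and `raise NotImplementedError(...)`.
            target = sub.exc.func if isinstance(sub.exc, ast.Call) else sub.exc
            if isinstance(target, ast.Name) and target.id == "NotImplementedError":
                stubs.append(node.name)
                break
    return stubs
```

Like the grep, this only sees stubs that raise; a stub that returns a dummy value passes both checks unnoticed.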

**Decision 4: lit notes use a fixed template.** Every note has these five sections, exactly: `Corruption process`, `Loss`, `Inference structure`, `KV-cache use`, `Throughput-vs-AR comparison`. No prose-only notes.
- *Why:* uniform structure means cross-paper comparisons are mechanical. The "related work matrix" in Phase 6's paper essentially writes itself from these sections.
- *Tradeoff:* some papers (e.g. the discrete-diffusion survey) don't fit cleanly. For those, the section says `N/A — survey, see notes`.
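
Because the template is fixed, checking a note against it can be mechanical. A minimal sketch, assuming each section is a markdown heading whose text matches the template name exactly; `missing_sections` is a hypothetical helper, and the heading level is left open.

```python
import re

# The five section names required by Decision 4.
REQUIRED_SECTIONS = (
    "Corruption process",
    "Loss",
    "Inference structure",
    "KV-cache use",
    "Throughput-vs-AR comparison",
)

# Matches markdown ATX headings such as "## Loss" and captures the title.
_HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(note_text: str) -> list[str]:
    """Return required section titles that have no heading in the note."""
    headings = {match.group(1) for match in _HEADING.finditer(note_text)}
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```

A stub note with five headings and `TBD` bodies passes this check; verifying that anchor-note sections are non-empty would need a further step.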

## 4. What could go wrong (failure modes)

1. **The `attention_reference.py` tests pass on CPU but break on GPU** because of fp16 / nondeterministic reductions.
   *Mitigation:* the reference is fp64 by construction (`.double()` in the test). Production paths will diverge slightly numerically; that's expected. The reference's job is to be the correctness oracle, not the fast path. Phase 2 adds a separate test comparing the production path to the reference at `1e-3` fp16 tolerance.

2. **A pinned dep version conflicts with another** (most likely `torch` ↔ `flash-attn` ABI mismatch).
   *Mitigation:* use lower bounds only until the first conflict is observed, then add an upper bound for the offending package. Don't pre-pin defensively; that creates an illusion of stability.

3. **Lit notes drift from the papers' actual content** because they're being written from training-data memory rather than the PDFs.
   *Mitigation:* anchor notes (LLaDA, BD3-LM) carry a `**Verification status:**` line at the top stating what was checked against the actual paper. Stub notes are explicitly labeled `STUB — needs paper read-through`. No claim in a stub note enters a downstream document (paper, decision log) without promotion to a verified note first.

4. **`pyproject.toml` Python version range disagrees with the version on the dev machine.**
   *Mitigation:* `requires-python = ">=3.11"`. 3.11 is the first release with `typing.Self`, which we'll want for the `BlockCache`, and it includes structural pattern matching (available since 3.10). 3.12+ adds nothing essential. Don't set an upper bound (avoid the SciPy mistake).

5. **Someone tries to start Phase 1 before Phase 0 is closed** and the skeleton modules end up with half-written logic interleaved with stubs.
   *Mitigation:* explicit exit gate. Phase 0 closes only when all skeleton modules `import` cleanly and the four PASS lines print. The next phase's THINK.md cannot reference any function from the skeleton modules until that function has been promoted from stub to implemented.
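
Failure mode 4 can be caught early with a small interpreter check. The sketch below assumes `requires-python` is a single `>=X.Y` specifier, as in the mitigation; `meets_floor` and `read_requires_python` are hypothetical helpers, and anything more complex than one lower bound would call for a real specifier parser.

```python
import re
import sys


def meets_floor(requires_python: str, version: tuple[int, int]) -> bool:
    """Check a (major, minor) version against a single '>=X.Y' specifier."""
    match = re.fullmatch(r"\s*>=\s*(\d+)\.(\d+)\s*", requires_python)
    if match is None:
        raise ValueError(f"unsupported specifier: {requires_python!r}")
    floor = (int(match.group(1)), int(match.group(2)))
    return version >= floor


def read_requires_python(path: str = "pyproject.toml") -> str:
    """Read [project].requires-python from a pyproject file."""
    import tomllib  # stdlib from Python 3.11 onward

    with open(path, "rb") as handle:
        return tomllib.load(handle)["project"]["requires-python"]


def current_interpreter_ok(path: str = "pyproject.toml") -> bool:
    """True when the running interpreter satisfies the declared floor."""
    return meets_floor(read_requires_python(path), sys.version_info[:2])
```

Calling `current_interpreter_ok()` from the repo root on the dev machine is enough to surface a mismatch before any dependency is installed.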

## 5. Evidence-of-success plan

Phase 0 is closed when **all** of these hold:

1. `python cascade/attention_reference.py` prints four PASS lines and the final "All CASCADE block-attention and cache tests passed" line.
2. `python -c "import cascade; import cascade.modules"` exits 0.
3. Every file in the `03_CASCADE.md § 3` repo scaffold exists (checked by a script, `scripts/check_scaffold.py`).
4. `docs/lit/llada.md` and `docs/lit/bd3lm.md` exist and contain all five required sections, each non-empty.
5. The other 11 lit notes exist as files and contain at least the section headers (TBD content is acceptable for stubs but the sections must be present).
6. `pyproject.toml` parses (try `uv pip compile pyproject.toml --dry-run`).

If criterion 1 fails, Phase 2 has no foundation — don't proceed. Everything else in this phase is just paperwork.
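
For criterion 3, `scripts/check_scaffold.py` could be as small as the sketch below. The path list is a hypothetical subset inferred from files this document mentions; the authoritative list is `03_CASCADE.md § 3` and would replace `EXPECTED_PATHS`.

```python
from pathlib import Path

# Hypothetical subset, inferred from this document only.
# The real list must be copied from `03_CASCADE.md § 3`.
EXPECTED_PATHS = (
    "pyproject.toml",
    "cascade/__init__.py",
    "cascade/attention_reference.py",
    "cascade/modules/__init__.py",
    "docs/lit/llada.md",
    "docs/lit/bd3lm.md",
)


def missing_paths(root: Path, expected: tuple[str, ...] = EXPECTED_PATHS) -> list[str]:
    """Return the expected relative paths that do not exist under root."""
    return [rel for rel in expected if not (root / rel).exists()]


def scaffold_ok(root: Path, expected: tuple[str, ...] = EXPECTED_PATHS) -> bool:
    """True when every expected path is present."""
    return not missing_paths(root, expected)
```

Run from the repo root, `missing_paths(Path("."))` returns an empty list when the scaffold is complete; a CI wrapper would exit non-zero otherwise.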

## 6. Estimated effort

1–2 days, if not blocked on dep installation.

- Day 0 (morning): scaffold; `pyproject.toml`; move attention reference; verify tests still pass. **(Done at the time of writing.)**
- Day 0 (afternoon): write anchor lit notes (LLaDA, BD3-LM).
- Day 1 (morning): write 11 stub lit notes; check scaffold completeness; write skeleton module signatures.
- Day 1 (afternoon): close the exit gate.

If `flash-attn` installation drags on (it often does), move to Phase 1 work using the CPU reference; come back to gpu deps when an H100 box is available.

## 7. References

- `03_CASCADE.md` itself — the canonical doc.
- `03_CASCADE.md` Appendix B — the reference implementation that already lives at `cascade/attention_reference.py`.
- `03_CASCADE.md` Appendix C — the sample THINK.md whose bar this document is trying to meet.

## 8. What this phase explicitly does NOT cover

- Any training (Phase 1).
- Any cache reuse logic beyond the reference (Phase 2).
- Step predictor (Phase 3).
- AR teacher distillation (Phase 4).
- Eval harness (Phase 5).

Scope: if I find myself writing logic in a skeleton module, I'm out of scope. Stubs only.