# `docs/PHASE_2_THINK.md` — Phase 2: BlockCache and block-by-block denoising

## 1. What I understand the task to be

Extend the multi-head, multi-layer, RoPE-equipped CASCADE model (from Phase 1) with:

- `BlockCache`: a per-layer pre-allocated buffer for committed K/V.
- `CascadeLM.forward_block_with_cache`: process one block's tokens, reading prior K/V from the cache, returning logits AND the per-layer K/V (for potential commit).
- `denoise.generate`: the full inference loop — generate blocks one at a time, denoise each block in `K_b` steps, commit, repeat.
- `denoise_one_block`: a single block's denoising (used by `generate` and by the adaptive-K head in Phase 3).

The *headline correctness test*: the cached and uncached paths produce matching outputs (within `1e-10` at fp64) on the multi-layer, multi-head, RoPE-equipped model, extending what the single-layer no-RoPE reference (`attention_reference.py`) already proves.

## 2. Design decisions

**Cache layout: pre-allocated per-layer, no concat.** Each layer owns two tensors (one for K, one for V) of shape `(batch, n_heads, max_blocks·block_size, d_head)`. `commit_block` writes into the next slot; `read` returns a view of the committed prefix (a slice along the sequence axis, not a copy). No `torch.cat` on the hot path; that's a Phase 2 performance trap, since every concat allocates a new tensor and copies the whole cache.

Tradeoff: `max_blocks` is fixed at construction. Doubling-growth would be cheap but adds complexity; defer to Phase 5 if/when we generate sequences longer than the initial allocation.
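
A minimal sketch of this layout, assuming PyTorch. The constructor arguments and the per-layer list storage are illustrative guesses, not the actual `BlockCache` signature; only `commit_block`, `read`, `max_blocks`, and `n_committed_blocks` are names taken from this document.

```python
import torch


class BlockCache:
    """Pre-allocated per-layer K/V buffers; commits are slice writes, never torch.cat."""

    def __init__(self, n_layers, batch, n_heads, max_blocks, block_size, d_head,
                 dtype=torch.float64):
        # All constructor arguments are assumptions made for this sketch.
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.n_committed_blocks = 0
        shape = (batch, n_heads, max_blocks * block_size, d_head)
        self.k = [torch.zeros(shape, dtype=dtype) for _ in range(n_layers)]
        self.v = [torch.zeros(shape, dtype=dtype) for _ in range(n_layers)]

    def commit_block(self, new_k, new_v):
        """new_k / new_v: one (batch, n_heads, block_size, d_head) tensor per layer."""
        if self.n_committed_blocks >= self.max_blocks:
            raise RuntimeError("BlockCache is full; max_blocks is fixed at construction")
        start = self.n_committed_blocks * self.block_size
        end = start + self.block_size
        for layer, (k_l, v_l) in enumerate(zip(new_k, new_v)):
            self.k[layer][:, :, start:end] = k_l
            self.v[layer][:, :, start:end] = v_l
        self.n_committed_blocks += 1  # increment only after the write

    def read(self, layer):
        """Views (no copy) of the committed prefix for one layer."""
        n = self.n_committed_blocks * self.block_size
        return self.k[layer][:, :, :n], self.v[layer][:, :, :n]
```

With nothing committed, `read` returns tensors of shape `(batch, n_heads, 0, d_head)`, so the first block can go through the same attention code as later ones.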

**Cache stores post-RoPE, post-projection K/V (multi-head shape).** Two reasons:
1. AR caches universally do this — no need to recompute or re-rotate cached tensors at every step.
2. RoPE is applied with positions matching where the K/V actually sit in the sequence. Cached K from block `b-1` was rotated with positions `[(b-1)·B .. b·B-1]`; it stays valid because of RoPE's relative-position property: the score for a query at `q_pos` attending to a key at `k_pos` depends on the two positions only through `q_pos - k_pos`, regardless of when the key was rotated.
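
The relative-position argument can be checked numerically with a small stdlib-only sketch. Each rotated pair of dimensions is stored as one complex number, and the helper names `rope` and `score` are hypothetical, not functions from the codebase.

```python
import cmath


def rope(pairs, pos, base=10000.0):
    """Rotate each (x_2j, x_2j+1) pair, stored as one complex number, by pos * theta_j."""
    n = len(pairs)
    return [p * cmath.exp(1j * pos * base ** (-j / n)) for j, p in enumerate(pairs)]


def score(q_rot, k_rot):
    """Real dot product of the underlying real-valued vectors."""
    return sum((a * b.conjugate()).real for a, b in zip(q_rot, k_rot))


q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.7j, -1.2 + 0.4j]

cached_k = rope(k, pos=3)                           # rotated once, at commit time
s_early = score(rope(q, pos=10), cached_k)          # offset 7
s_late = score(rope(q, pos=107), rope(k, pos=100))  # same offset, everything shifted by 97
s_misaligned = score(rope(q, pos=2), cached_k)      # query wrongly given a block-local position
```

`s_early` and `s_late` agree up to floating-point rounding because only the offset enters the score. `s_misaligned` differs, which is the symptom of the position-misalignment failure mode listed in section 3.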

**`forward_block_with_cache` returns the new K/V but does NOT commit.** Commitment is the caller's choice. During denoising, only the *last* step's K/V should be committed (after that step's predictions are taken as the block's final tokens). Mid-denoise K/V would freeze a partially-resolved block — the P0 bug from Appendix B.

**No within-block mask needed during cached forward.** From block `b`'s perspective, all positions in `[committed prior | block b itself]` are "current-or-past". The block-causal mask is structurally satisfied. (This matches the reference impl's logic; the dense mask is needed only in the no-cache full-forward path.)
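
A small PyTorch sketch of why the mask drops out. It uses one head, one layer, no RoPE, and fixed random Q/K/V, so it illustrates only the masking argument and is not a substitute for the multi-layer equivalence test. The function names are hypothetical.

```python
import torch


def block_causal_mask(n_blocks, block_size):
    """True where the query's block index >= the key's block index."""
    blk = torch.arange(n_blocks * block_size) // block_size
    return blk[:, None] >= blk[None, :]


def full_forward(q, k, v, n_blocks, block_size):
    """No-cache path: all positions at once, dense block-causal mask."""
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~block_causal_mask(n_blocks, block_size), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


def cached_forward(q, k, v, b, block_size):
    """Block b's queries over [committed prior | block b] keys, with no mask at all."""
    lo, hi = b * block_size, (b + 1) * block_size
    scores = (q[lo:hi] @ k[:hi].T) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v[:hi]


torch.manual_seed(0)
n_blocks, block_size, d = 3, 4, 8
q, k, v = (torch.randn(n_blocks * block_size, d, dtype=torch.float64) for _ in range(3))
full = full_forward(q, k, v, n_blocks, block_size)
max_diff = max(
    (full[b * block_size:(b + 1) * block_size]
     - cached_forward(q, k, v, b, block_size)).abs().max().item()
    for b in range(n_blocks)
)
```

Here `max_diff` is the largest elementwise gap between the two paths across all three blocks. The masked entries contribute exactly zero after the softmax, so the only difference left is floating-point summation order.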

**Multi-step denoising: confidence-ranked unmask.** Each step:
1. Forward the current (mixed mask + tentative tokens) block.
2. Take argmax (or sampled) prediction at every position.
3. Among currently-masked positions, rank by confidence (max-softmax-prob); unmask the top `ceil(n_masked / k_remaining)`.

Tradeoff: this is the BD3-LM standard. CASCADE's adaptive-K head will sit on top; for Phase 2 we use fixed `K_b = K`.
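
The unmask rule in step 3 can be sketched in plain Python. This assumes `k_remaining` counts the current step, so the last step (`k_remaining = 1`) reveals everything still masked. The function name and the fixed confidence values are illustrative only.

```python
import math


def positions_to_unmask(is_masked, confidence, k_remaining):
    """Indices of still-masked positions to reveal this step, highest confidence first."""
    masked = [i for i, m in enumerate(is_masked) if m]
    n_unmask = math.ceil(len(masked) / k_remaining)
    ranked = sorted(masked, key=lambda i: confidence[i], reverse=True)
    return ranked[:n_unmask]


block_size, K = 8, 3
is_masked = [True] * block_size
# Fixed confidences for illustration; a real loop recomputes them from each step's logits.
confidence = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4]
schedule = []
for step in range(K):
    chosen = positions_to_unmask(is_masked, confidence, k_remaining=K - step)
    for i in chosen:
        is_masked[i] = False
    schedule.append(len(chosen))
```

For a block of 8 and `K = 3`, the per-step counts come out as `[3, 3, 2]`: the ceiling front-loads the unmasking slightly and guarantees nothing is left masked after the final step.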

**Remask-low-confidence: disabled in v1.** Adds complexity; interacts with cache (the remasked positions' K/V is "stale" for that step). BD3-LM § 5 reports modest gains. Defer to Phase 5 as an explicit ablation.

## 3. Failure modes I'm watching for

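1. **Cache layout staleness.** `commit_block` must write block-`b` K/V to slot `n_committed_blocks` and only *then* increment the counter. Incrementing first skips a slot, and `read` then returns stale (never-written) data for it. Tests must be able to tell the two orderings apart.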

2. **RoPE positions misaligned across cache reads.** If we cache K at positions `[0..B-1]` (block 0) and `[B..2B-1]` (block 1) but at inference time use positions `[0..B-1]` for both, attention scores are wrong. Test: `forward_block_with_cache` with `block_idx=2` must use positions `[2B..3B-1]` for its own K/V.

3. **Multi-layer cache mismatch.** Layer-`L` cache must store layer-`L`'s K/V, not layer-0's. Test by checking that `forward_block_with_cache` at layer `L` reads from `cache.k[L]`, not `cache.k[0]`.

4. **The full-forward vs. block-by-block paths drift over many blocks.** Each committed block adds its own numerical noise from softmax + projection. Test at 3+ committed blocks to catch accumulating drift; the two paths should still match within `1e-10` at fp64.

5. **EOS detection happens at the wrong granularity.** EOS should be detected per-position within a committed block; once committed (i.e., once the block's tokens are finalized), if EOS is present, generation stops *and we truncate the block at EOS*. Not "block contains EOS so generate one more block" — that yields garbage after EOS.
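
The EOS rule in item 5 can be sketched as a stdlib-only loop. `next_block_fn` is a hypothetical stand-in for "denoise and commit one block", and keeping the EOS token itself in the output is an assumption of this sketch rather than something the plan specifies.

```python
def generate_until_eos(next_block_fn, eos_id, max_blocks):
    """Collect committed blocks; stop at the first EOS and drop everything after it."""
    tokens = []
    for block_idx in range(max_blocks):
        block = next_block_fn(block_idx)  # finalized (committed) tokens for this block
        if eos_id in block:
            # Per-position check: keep tokens up to and including EOS, cut the rest.
            tokens.extend(block[: block.index(eos_id) + 1])
            break
        tokens.extend(block)
    return tokens


requested = []
canned = [[5, 6, 7, 8], [9, 2, 4, 4], [1, 1, 1, 1]]


def fake_next_block(block_idx):
    """Canned blocks standing in for real denoising; records which blocks were requested."""
    requested.append(block_idx)
    return canned[block_idx]


out = generate_until_eos(fake_next_block, eos_id=2, max_blocks=3)
```

With EOS id `2` appearing mid-way through the second block, `out` ends at that token and the third block is never requested.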

## 4. Evidence-of-success plan

Phase 2 closes when:

1. `tests/test_cache_consistency.py` passes: commit/read invariants, empty-read returns shape-correct empty tensor.
2. `tests/test_denoise_loop.py` passes the **multi-layer cache-vs-no-cache equivalence** at fp64 within `1e-10` over at least 3 committed blocks.
3. `generate()` produces sensible output (finite tensors, correct shape, stops on EOS).
4. The reference single-layer test (`attention_reference.py`) still passes — that's the unchanged baseline.

Deferred to a real GPU + trained model:
- Block-AR equivalence at `K_b = block_size` — requires a model that can actually produce sensible predictions; without training it just matches "argmax of random initial logits", which doesn't verify anything semantic.
- Throughput measurement (`eval/throughput.py`).

## 5. Scope explicitly NOT covered

- Adaptive K_b head (Phase 3).
- Remask-low-confidence (Phase 5 ablation).
- Doubling cache growth (Phase 5 if needed).
- flash-attn dispatch in the cache path (GPU work).
- Real-data training (deferred).