Field map

The schools of thought

The reasoning-models field has at least five mechanistic accounts in active circulation. Most are partly true; none alone explains everything we see. This map is honest about that — it shows the schools, where the evidence supports them, and where the open debates live.

Mechanism graph

How the eight chapters depend on each other

Solid edges = mechanism dependencies. Dashed = open debates. Color groups by role.

[Figure: field map showing the eight chapters and their connections]

Schools of thought

Five accounts of "why CoT works"

Each school explains a different fragment of what we observe. The labels are the curator's; they aren't standard, but they're the cleanest cuts. Most papers blend two or three.

1. Compute-depth extension

Merrill & Sabharwal · Li · Feng — formal

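A fixed-depth transformer that answers without CoT is limited to TC0 (under the usual log-precision assumptions). Each CoT token adds one serial step, so T tokens buy T extra steps of sequential computation; with polynomially many steps the model can reach problems that require P-class computation.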

Best for: explaining why CoT helps on inherently serial / multi-step problems. Weak at: explaining stylistic effects, "Aha moments," and unfaithful chains.
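
The serial-steps idea can be illustrated with a toy stand-in. The sketch below is not taken from the cited papers: it uses composition of a list of permutations as an assumed example of a task whose answer depends on carrying state through every step. Each loop iteration plays the role of one CoT token, doing a constant amount of work and writing the running state to a scratchpad. The names `compose` and `solve_with_scratchpad` are hypothetical.

```python
from typing import List, Tuple

Perm = Tuple[int, ...]

def compose(p: Perm, q: Perm) -> Perm:
    """Apply p first, then q: result[i] = q[p[i]]."""
    return tuple(q[p[i]] for i in range(len(p)))

def solve_with_scratchpad(perms: List[Perm]) -> Tuple[Perm, List[Perm]]:
    """Fold the permutations left to right, one constant-size step per input.

    Each iteration stands in for one CoT token: it reads the previous state,
    does a bounded amount of work, and writes the new state back out.
    """
    state: Perm = tuple(range(len(perms[0])))  # start from the identity
    scratchpad: List[Perm] = []
    for p in perms:
        state = compose(state, p)
        scratchpad.append(state)  # the written-out intermediate result
    return state, scratchpad

final, steps = solve_with_scratchpad([(1, 2, 0), (1, 0, 2), (1, 2, 0)])
print(final, len(steps))
```

The scratchpad grows by one entry per input, which is the sense in which T tokens give T extra serial steps; a single fixed-depth pass has no equivalent place to store the intermediate states.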

2. Implicit / amortized search

Stream of Search · R1-Zero post-mortems

RL training compiles a search procedure into the policy distribution. Greedy decoding then mimics what an external search would do — without a search loop at runtime.

Best for: the R1 phenomenology — chain length increase, self-correction. Weak at: formal claims about what is searchable.
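
One way to picture "search compiled into the policy" is to look at what a search looks like once it is flattened into a single sequence, with tries, dead ends, and backtracking all written out in order. The sketch below is an assumption-laden illustration in that spirit: it runs an ordinary depth-first search on a small subset-sum puzzle (a stand-in task, assuming non-negative numbers) and records every move as one trace line. `linearized_search` is a hypothetical name.

```python
from typing import List, Optional

def linearized_search(nums: List[int], target: int) -> List[str]:
    """Depth-first subset-sum search that logs every move as a flat trace line.

    Assumes nums are non-negative, so overshooting the target is a dead end.
    """
    trace: List[str] = []

    def dfs(i: int, total: int, chosen: List[int]) -> Optional[List[int]]:
        if total == target:
            trace.append(f"goal: {chosen} sums to {target}")
            return list(chosen)
        if i == len(nums) or total > target:
            trace.append(f"dead end at {chosen}, backtrack")
            return None
        trace.append(f"try adding {nums[i]} to {chosen}")
        found = dfs(i + 1, total + nums[i], chosen + [nums[i]])
        if found is not None:
            return found
        trace.append(f"skip {nums[i]}")
        return dfs(i + 1, total, chosen)

    dfs(0, 0, [])
    return trace

for line in linearized_search([5, 3, 4], 7):
    print(line)
```

A policy trained to emit traces of this shape produces the exploration and self-correction in its own output stream, with no external search loop at runtime. The trace format here is invented for illustration.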

3. Program synthesis

CoT-as-source-code interpretations

A CoT is the source of a small program that the model "compiles" to its answer. Inference without CoT falls back on compiled artifacts of similar programs encountered during training.

Best for: explaining transfer between problem families. Weak at: cases where the chain doesn't match a program — narrative reasoning, qualitative judgment.

4. Bayesian posterior over thoughts

Xie · Prystawski — extended to multi-step

CoT generation is implicit posterior inference over a latent "solution program." Extends the ICL-as-Bayes account to chains of intermediate states.

Best for: Wei-style few-shot CoT improvements. Weak at: RL-trained reasoners where the chain is the optimization target, not a posterior sample.
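
Read as a formula, the posterior account says the answer distribution marginalizes over latent chains. The notation below is an assumption made for illustration, not taken from the cited papers: x is the prompt (including any few-shot demonstrations), z is a latent "solution program" or chain of intermediate states, and y is the final answer.

```latex
p(y \mid x) \;=\; \sum_{z} p(y \mid z, x)\, p(z \mid x),
\qquad
p(z \mid x) \;\propto\; p(x \mid z)\, p(z)
```

On this reading, sampling one chain and then answering is a single-sample estimate of the sum, and few-shot demonstrations help by sharpening p(z | x).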

5. RL-shaped policy elicitation

DeepSeek-R1 — the practitioner story

Reasoning circuits are already in the base from pretraining. RLVR is an elicitation procedure — it raises probability mass on paths that lead to verifiably correct answers.

Best for: R1-Zero results, the "no-SFT" finding. Weak at: cases where the base lacks the circuit. R1-Zero-style training fails on weak bases, and the capability threshold is uncharacterized.
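
The "raises probability mass on verified paths" claim can be sketched with a toy categorical policy. This is a minimal illustration under strong assumptions, not the DeepSeek-R1 training procedure: the policy chooses among a handful of fixed candidate paths, the verifier reward is 0 or 1, and the update is the exact gradient of expected reward rather than a sampled estimate. `rlvr_step` is a hypothetical name.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rlvr_step(logits: List[float], rewards: List[float], lr: float = 1.0) -> List[float]:
    """One exact policy-gradient step on expected reward.

    For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - expected) for l, p, r in zip(logits, probs, rewards)]

# Path 0 is the only verifiably correct one and starts out unlikely.
logits = [0.0, 2.0, 1.0]
rewards = [1.0, 0.0, 0.0]
before = softmax(logits)[0]
for _ in range(50):
    logits = rlvr_step(logits, rewards)
after = softmax(logits)[0]
print(before < after)
```

The step size for path 0 is proportional to its current probability, so a path the base policy almost never produces moves very slowly. That is one way to read the school's weak spot on bases that lack the circuit.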

The synthesis essay "Why do reasoning models work? A synthesis" argues these aren't competitors so much as views of different layers of the same elephant, and proposes which layer dominates for which task.

Open debates

What the field actively disagrees about

Each row is a real fault line — defenders on each side cite peer-reviewed evidence. The repo's job is to platform the disagreement, not to declare a winner.

Are reasoning gains elicitation of latent capability, or new learning?

If elicitation, RL re-shapes the base distribution. If learning, RL teaches the model new circuits.

Elicitation camp — DeepSeek (R1-Zero), Lambert. Evidence: pure RL from a base model reaches o1-class performance. Predicts new bases will dominate after RL across recipes.
Learning camp — implicit in some Anthropic / OpenAI framings; distillation literature. Evidence: distillation transfers reasoning to weaker bases that wouldn't elicit it via RL alone.

Is CoT faithful enough to use as evidence about model computation?

If yes, CoT-monitoring is a viable safety primitive. If no, the safety story needs alternative tools.

Faithful-enough — practitioners using CoT inspection in eval. Evidence: RL-trained reasoners pass more Lanham probes than instruct-only models.
Not faithful enough — Turpin, Lanham, Anthropic 2025. Evidence: targeted reward-hack scenarios produce systematic post-hoc rationalization in current frontier reasoners.
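
One probe of the kind this debate turns on is early answering: truncate the chain at each step and check whether the final answer already matches the one given with the full chain. The sketch below is a minimal harness under stated assumptions. `answer_fn` is a hypothetical callable standing in for "ask the model to answer given this much of the chain", and the two toy models are invented for illustration.

```python
from typing import Callable, List

def early_answering_curve(
    question: str,
    cot_steps: List[str],
    answer_fn: Callable[[str, List[str]], str],
) -> List[bool]:
    """For each prefix length k, report whether the answer from the first k
    steps equals the answer produced from the full chain."""
    full_answer = answer_fn(question, cot_steps)
    return [
        answer_fn(question, cot_steps[:k]) == full_answer
        for k in range(len(cot_steps) + 1)
    ]

# Toy stand-ins: one "model" reads its answer off the chain, one ignores it.
def uses_chain(question: str, steps: List[str]) -> str:
    return steps[-1] if steps else "unknown"

def ignores_chain(question: str, steps: List[str]) -> str:
    return "10"

steps = ["2+3=5", "5*2=10", "10"]
print(early_answering_curve("q", steps, uses_chain))
print(early_answering_curve("q", steps, ignores_chain))
```

A curve that is true from k = 0 onward suggests the chain was not needed to reach the answer, which is the pattern the "not faithful enough" side treats as post-hoc rationalization. A curve that only becomes true near the end is consistent with the chain doing real work, though it does not prove it.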

Does explicit search add value on top of an RL-trained reasoner?

The intuition: RL has already absorbed the easy search. Practice: it depends.

Search adds value — recursive aggregation, MCTS-over-CoT. Evidence: explicit search improves over single-pass long-CoT at fixed budget on hard tasks.
RL absorbs search — Stream of Search, R1-Zero. Evidence: trained reasoners match search-augmented baselines on benchmarks the RL covered.

Should test-time compute scaling laws hold across model scales?

The Snell-style curves were measured at one scale. Whether the optimal allocation transfers is unsettled.

Universal exchange-rate — Snell et al., follow-ups. Evidence: log-linear curves replicate on multiple model families at multiple scales.
Task- and scale-specific — Yang et al. (thinking-optimal scaling), overthinking literature. Evidence: optimal length depends on task difficulty and base capability — no universal recipe.
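
A "log-linear curve" here means accuracy that grows roughly linearly in the logarithm of test-time compute. The sketch below fits that form by ordinary least squares; the data points are made up for illustration and do not come from any of the cited papers, and `fit_log_linear` is a hypothetical helper.

```python
import math
from typing import List, Tuple

def fit_log_linear(compute: List[float], accuracy: List[float]) -> Tuple[float, float]:
    """Least-squares fit of accuracy ~ a + b * log10(compute). Returns (a, b)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic points: each 10x of compute adds 0.10 accuracy.
a, b = fit_log_linear([1, 10, 100, 1000], [0.30, 0.40, 0.50, 0.60])
print(round(a, 3), round(b, 3))
```

In these terms, the universal exchange-rate position expects the slope b to be similar across model families and scales, while the task- and scale-specific position expects b (and the point where the curve flattens) to move with task difficulty and base capability.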

Is "reasoning" doing anything different from amortized search + style-shaped policy?

The deepest fault line. Determines whether the reasoning-model paradigm is qualitative or quantitative.

Quantitative only — RL elicits, doesn't add. Same circuits, more usage. Predicts scaling will plateau in line with base-model scaling.
Qualitative shift — long-horizon RL produces genuinely new behaviors (Aha moments, self-correction). Predicts the compute axis will keep paying off after pretraining plateaus.

Method × task matrix

Which technique helps which problem

A condensed cross-tabulation. Read each row as: at fixed compute budget, this strategy is the right choice when the problem looks like the column.

Strategy | Math (closed answer) | Code (unit-test) | Open-ended writing | Multi-hop QA | OOD (ARC-AGI)
Greedy long-CoT ×
Self-consistency (cons@K) × ×
Best-of-N + PRM
Tree of Thoughts
Recursive self-aggregation
RLVR-trained reasoner (single pass) ×
Test-time training ×

well-suited   partial / verifier-dependent   × known to underperform
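
Two of the strategies in the matrix are simple enough to sketch directly. The code below is a minimal illustration, not a reference implementation: `self_consistency` takes the final answers extracted from K sampled chains and returns the majority, and `best_of_n` returns the candidate that a scorer rates highest. `score_fn` is a hypothetical stand-in for a process reward model or any other verifier.

```python
from collections import Counter
from typing import Callable, List

def self_consistency(answers: List[str]) -> str:
    """cons@K: return the most frequent final answer among K sampled chains."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates: List[str], score_fn: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate the scorer rates highest."""
    return max(candidates, key=score_fn)

print(self_consistency(["42", "41", "42", "40", "42"]))

# Toy scorer for illustration only: prefers longer candidates.
print(best_of_n(["a", "bbb", "cc"], score_fn=len))
```

Majority voting presupposes that final answers can be compared for equality, and best-of-N is only as good as its scorer, which is why both rows depend on the task column.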

Eight chapters

Open the chapters on GitHub

Reading paths

Cross-chapter sequences

Calibrated for skim / weekend / research depth.