Field map · Awesome Reasoning Models Theory

Schools of thought

Five accounts of "why CoT works"

Each school explains a different fragment of what we observe. The labels are the curator's; they aren't standard, but they're the cleanest cuts. Most papers blend two or three.

1. Compute-depth extension

Merrill & Sabharwal · Li · Feng — formal

A fixed-depth transformer is limited to TC⁰. T tokens of CoT amount to T extra serial steps; with polynomially many steps, the model can in principle reach problems that require P-class computation.

Best for: explaining why CoT helps on inherently serial / multi-step problems. Weak at: explaining stylistic effects, "Aha moments," and unfaithful chains.
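
To make the serial-steps claim concrete, here is a minimal sketch, assuming a toy state-tracking task (composing a sequence of permutations of five items) chosen purely for illustration. The names `step` and `run_with_cot` are hypothetical and do not come from the cited papers. Each emitted state plays the role of one CoT token: the work per token is bounded, and the chain itself carries the state forward across T serial steps.

```python
from typing import List, Tuple

Perm = Tuple[int, ...]  # perm[i] is where item i is sent


def step(state: Perm, move: Perm) -> Perm:
    """One bounded-work update: apply `move` on top of the current state."""
    return tuple(move[s] for s in state)


def run_with_cot(moves: List[Perm]) -> List[Perm]:
    """Emit one intermediate state per move, like one CoT token per serial step."""
    state: Perm = tuple(range(len(moves[0])))  # start from the identity
    chain: List[Perm] = []
    for move in moves:
        state = step(state, move)
        chain.append(state)  # the written-out state is the "scratchpad"
    return chain


swap01: Perm = (1, 0, 2, 3, 4)
cycle: Perm = (1, 2, 3, 4, 0)
chain = run_with_cot([swap01, cycle, swap01])
print(len(chain), chain[-1])
```

The point of the sketch is only that the answer after T moves is read off the last chain entry, so depth grows with chain length rather than with the size of `step`.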

2. Implicit / amortized search

Stream of Search · R1-Zero post-mortems

RL training compiles a search procedure into the policy distribution. Greedy decoding then mimics what an external search would do — without a search loop at runtime.

Best for: the R1 phenomenology — chain length increase, self-correction. Weak at: formal claims about what is searchable.
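
The "search compiled into the policy" idea is easier to picture with an explicit search written out as text. The following is a minimal sketch, assuming a small arithmetic puzzle (combine numbers to hit a target) picked for illustration; the trace format and the function name `dfs` are assumptions, not the format used in any cited work. The flat string at the end is the kind of sequence, dead ends and backtracks included, that a policy could be trained to emit in a single pass.

```python
from typing import List


def dfs(nums: List[int], target: int, trace: List[str]) -> bool:
    """Depth-first search over pairwise combinations; every move is logged."""
    if len(nums) == 1:
        verdict = "goal" if nums[0] == target else "dead end"
        trace.append(f"check {nums[0]} -> {verdict}")
        return nums[0] == target
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            a, b = nums[i], nums[j]
            rest = [n for k, n in enumerate(nums) if k not in (i, j)]
            ops = [
                (f"{a}+{b}", a + b),
                (f"{a}*{b}", a * b),
                (f"{max(a, b)}-{min(a, b)}", abs(a - b)),
            ]
            for expr, val in ops:
                trace.append(f"try {expr}={val}")
                if dfs(rest + [val], target, trace):
                    return True
                trace.append("backtrack")
    return False


trace: List[str] = []
solved = dfs([2, 3, 4], 10, trace)
stream = " | ".join(trace)  # the whole search, linearized into one string
print(solved)
print(stream)
```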

3. Program synthesis

CoT-as-source-code interpretations

A CoT is the source of a small program the model "compiles" to its answer. Non-CoT inference interprets compiled artifacts of similar programs encountered at training.

Best for: explaining transfer between problem families. Weak at: cases where the chain doesn't match a program — narrative reasoning, qualitative judgment.

4. Bayesian posterior over thoughts

Xie · Prystawski — extended to multi-step

CoT generation is implicit posterior inference over a latent "solution program." Extends the ICL-as-Bayes account to chains of intermediate states.

Best for: Wei-style few-shot CoT improvements. Weak at: RL-trained reasoners where the chain is the optimization target, not a posterior sample.
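
A toy version of "posterior inference over a latent solution program" can be written in a few lines. This is a sketch under stated assumptions: the three candidate programs, the uniform prior, and the `noise` likelihood are all invented for illustration and are not the formulation used by Xie or Prystawski. It shows only the mechanism: demonstrations reweight a fixed set of hypotheses.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical latent "solution programs"; this set is made up for illustration.
PROGRAMS: Dict[str, Callable[[int], int]] = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "add_two": lambda x: x + 2,
}


def posterior(demos: List[Tuple[int, int]], noise: float = 0.05) -> Dict[str, float]:
    """Posterior over programs given (x, y) demonstrations, with a uniform prior.

    Assumed likelihood: a demo matches the program's output with
    probability 1 - noise, and mismatches with probability noise.
    """
    weights: Dict[str, float] = {}
    for name, prog in PROGRAMS.items():
        w = 1.0
        for x, y in demos:
            w *= (1.0 - noise) if prog(x) == y else noise
        weights[name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


one_demo = posterior([(2, 4)])           # every program maps 2 to 4: still ambiguous
two_demos = posterior([(2, 4), (3, 9)])  # only "square" explains both demos
print(one_demo)
print(two_demos)
```

With one demonstration the posterior stays uniform because all three programs fit; the second demonstration concentrates the mass on `square`.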

5. RL-shaped policy elicitation

DeepSeek-R1 — the practitioner story

Reasoning circuits are already in the base from pretraining. RLVR is an elicitation procedure — it raises probability mass on paths that lead to verifiably correct answers.

Best for: R1-Zero results, the "no-SFT" finding. Weak at: cases where the base lacks the circuit — R1-Zero fails on weak bases (threshold uncharacterized).
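
The elicitation story can be sketched as policy-gradient ascent over a fixed set of paths. This is a toy under loud assumptions: three hand-picked "paths", a softmax policy over them, a 0/1 verifier reward, and exact (not sampled) gradients. It is not the DeepSeek-R1 training recipe. The names `softmax` and `elicit` are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def elicit(logits: List[float], rewards: List[float],
           lr: float = 1.0, steps: int = 200) -> List[float]:
    """Exact gradient ascent on expected verifier reward for a softmax policy.

    For J = sum_i p_i * r_i, the gradient w.r.t. logit i is p_i * (r_i - J).
    """
    logits = list(logits)
    for _ in range(steps):
        p = softmax(logits)
        expected = sum(pi * ri for pi, ri in zip(p, rewards))
        logits = [z + lr * pi * (ri - expected)
                  for z, pi, ri in zip(logits, p, rewards)]
    return softmax(logits)


# Assumed base policy: the verifiably correct path (index 2) exists but is rare.
base_logits = [2.0, 1.0, -1.0]
rewards = [0.0, 0.0, 1.0]
before = softmax(base_logits)
after = elicit(base_logits, rewards)
print(before[2], after[2])
```

The update only moves mass among paths the base policy already represents. Because each logit's update is scaled by its own probability, a path with probability near zero receives almost no push, which is one way to read the "weak base" failure mode noted above.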

The synthesis essay, "Why do reasoning models work? A synthesis", argues that these are less competitors than views of different layers of the same elephant, and proposes which layer dominates for which task.

Open debates

What the field actively disagrees about

Each row is a real fault line — defenders on each side cite peer-reviewed evidence. The repo's job is to platform the disagreement, not to declare a winner.

Are reasoning gains elicitation of latent capability, or new learning?

If elicitation, RL re-shapes the base distribution. If learning, RL teaches the model new circuits.

Elicitation camp — DeepSeek (R1-Zero), Lambert. Evidence: pure RL from base reaches o1-class. Predicts new bases will dominate after RL across recipes.

Learning camp — implicit in some Anthropic / OpenAI framings; distillation literature. Evidence: distillation transfers reasoning to weaker bases that wouldn't elicit it via RL alone.

Is CoT faithful enough to use as evidence about model computation?

If yes, CoT-monitoring is a viable safety primitive. If no, the safety story needs alternative tools.

Faithful-enough — practitioners using CoT inspection in eval. Evidence: RL-trained reasoners pass more Lanham probes than instruct-only models.

Not faithful enough — Turpin, Lanham, Anthropic 2025. Evidence: targeted reward-hack scenarios produce systematic post-hoc rationalization in current frontier reasoners.

Does explicit search add value on top of an RL-trained reasoner?

The intuition: RL has already absorbed the easy search. Practice: it depends.

Search adds value — recursive aggregation, MCTS-over-CoT. Evidence: explicit search improves over single-pass long-CoT at fixed budget on hard tasks.

RL absorbs search — Stream of Search, R1-Zero. Evidence: trained reasoners match search-augmented baselines on benchmarks the RL covered.

Do test-time compute scaling laws hold across model scales?

The Snell-style curves were measured at one scale. Whether the optimal allocation transfers to other scales is unsettled.

Universal exchange-rate — Snell et al., follow-ups. Evidence: log-linear curves replicate on multiple model families at multiple scales.

Task- and scale-specific — Yang et al. (thinking-optimal scaling), overthinking literature. Evidence: optimal length depends on task difficulty and base capability — no universal recipe.

Is "reasoning" doing anything different from amortized search + style-shaped policy?

The deepest fault line. Determines whether the reasoning-model paradigm is a qualitative shift or only a quantitative one.

Quantitative only — RL elicits, doesn't add. Same circuits, more usage. Predicts scaling will plateau in line with base-model scaling.

Qualitative shift — long-horizon RL produces genuinely new behaviors (Aha moments, self-correction). Predicts the compute axis will keep paying off after pretraining plateaus.

Method × task matrix

Which technique helps which problem

A condensed cross-tabulation. Read each cell as: at a fixed compute budget, how well the row's strategy suits problems that look like the column.

Strategy	Math (closed answer)	Code (unit-test)	Open-ended writing	Multi-hop QA	OOD (ARC-AGI)
Greedy long-CoT	○	○	●	○	×
Self-consistency (cons@K)	●	●	×	○	×
Best-of-N + PRM	●	●	○	●	○
Tree of Thoughts	●	○	○	●	○
Recursive self-aggregation	●	○	●	●	○
RLVR-trained reasoner (single pass)	●	●	○	○	×
Test-time training	○	○	×	○	●

● well-suited · ○ partial / verifier-dependent · × known to underperform
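
Two of the rows above are simple enough to sketch directly. The following is a minimal illustration, assuming sampled chains and final answers are already available as strings; `toy_prm` is a made-up stand-in for a process reward model, not a real scorer. Majority voting needs answers that can be compared for equality, which is one reason self-consistency fits closed-answer math better than open-ended writing.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(answers: List[str]) -> str:
    """cons@K: return the most frequent final answer among K samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[Tuple[str, str]],
              score: Callable[[str], float]) -> str:
    """Best-of-N: return the answer whose chain the scorer rates highest."""
    _chain, answer = max(candidates, key=lambda c: score(c[0]))
    return answer


def toy_prm(chain: str) -> float:
    """Hypothetical scorer: counts steps marked as verified."""
    return float(chain.count("verified"))


voted = self_consistency(["42", "41", "42", "17", "42"])
picked = best_of_n(
    [("guess", "41"), ("step verified; step verified", "42")],
    toy_prm,
)
print(voted, picked)
```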

The schools of thought

How the eight chapters depend on each other

Open the chapters on GitHub

CoT & Scratchpads

Test-Time Compute Scaling

Sampling & Verification

Search at Inference

RL for Reasoning

Overthinking & Length

Faithfulness of Traces

Theoretical Frameworks

Cross-chapter sequences