awesome-list · theory-first

Why do reasoning models actually work?

A theory-and-mechanism-first map of the o-series / R1 / Claude-thinking paradigm. Eight argued chapters, five reproduction notebooks, a monthly benchmarks tracker, and explicit engagement with the faithfulness and overthinking debates that most lists hedge on.

Read on GitHub → · Interactive timeline · The field map

8 argued chapters

60+ indexed papers

13 models compared

5 reproduction notebooks

benchmarks tracked

7 reading paths

The arc

From CoT prompting to RL-for-reasoning in 36 months

The 2022–2026 trajectory: chain-of-thought prompting → process reward models → test-time compute scaling → o1 → DeepSeek-R1 → the overthinking and faithfulness debates. Each transition produced a recipe; each recipe shifted what "reasoning" meant. The interactive timeline lets you click any milestone.

Legend: paper · model · benchmark

2022 · CoT
2023 · PRMs
2024-07 · AlphaProof
2024-08 · Snell
2024-09 · o1
2025-01 · R1
2025-02 · Claude 3.7
2025-05 · Faithfulness
2025-07 · Gemini gold
2026-03 · ARC-AGI-3

→ Open the interactive timeline

Eight chapters

Each argues a position about how reasoning works

Most awesome-lists aggregate titles. This one argues mechanisms. Each chapter has a TL;DR, the proposed mechanism, 10+ annotated papers, the live debates, reading paths, and an open-problems list.

CH 01 · FOUNDATION

CoT & Scratchpads

Intermediate tokens turn a fixed-depth forward pass into an unbounded serial program.

CH 02 · INFERENCE

Test-Time Compute Scaling

Inference compute trades off against parameters at a task-dependent exchange rate.

CH 03 · INFERENCE

Sampling & Verification

Reranking and voting over samples extract quality faster than improving any single sample.

CH 04 · INFERENCE

Search at Inference

Structured exploration over CoT prefixes recovers solutions a greedy decode misses.

CH 05 · TRAINING

RL for Reasoning

RL with verifiable rewards reshapes the policy toward long, self-correcting chains.

CH 06 · FAILURE

Overthinking & Length

Beyond a task-dependent optimum, more CoT hurts — long chains compound errors.

CH 07 · FAILURE

Faithfulness of Traces

CoTs often post-hoc rationalize. The visible chain isn't always the computation.

CH 08 · SYNTHESIS

Theoretical Frameworks

Three accounts compete and partly cooperate: compute-depth, program synthesis, Bayes-over-thoughts.

→ Open the field map (interactive)

Why this list is different

Not an aggregator. An argument.

Typical awesome-reasoning list

Flat list of titles + URLs
Mixes prompt-engineering tricks with theoretical results
No engagement with debates or open problems
Closed-model marketing claims listed as fact
Static; ages poorly post-release-cycle

Awesome Reasoning Models Theory

Chapter-as-position; each argues a mechanism
Five-criterion bar for entries (including primary source and mechanism-not-phenomenon)
Explicit "Debates" section per chapter
Every closed-model number flagged with (vendor-reported)
Monthly tracker digest + WANTED gap list
Five reproduction notebooks at single-GPU scale
Sister-list scope split, boundary cases enumerated

Field map

How the eight chapters depend on each other

Solid arrows are mechanism dependencies. Dashed are open debates. Color marks the role: foundation (blue), inference-time (green), training-time (orange), failure modes (pink), synthesis (purple).

Field map of the eight chapters and their interconnections

Browse the literature

Two interactive registries

The papers and models are also exposed as filterable indexes — slice by chapter, year, status, open vs closed weights, training recipe. Both are powered by versioned JSON files in the repo.

SITE · INTERACTIVE

Papers · 60+ entries, search + filter

Filter by chapter, year, type, and verified/open/vendor status. Free-text search across title, authors, TL;DR.

SITE · INTERACTIVE

Models · 13 reasoning models compared

DeepSeek-R1 → o-series → Claude-thinking → Gemini Deep Think → QwQ → s1 → Tülu. Side-by-side cards or compact table view.

SITE · INTERACTIVE

12 reasoning-model myths

Flippable cards: claim on the front, what the literature actually says on the back. Sourced from the misconceptions essay.

Family tree

How today's reasoners trace back to their bases

The open and closed tracks both run base → SFT → RLVR → deployed reasoner, but only one side discloses the recipe. The dashed lavender arrow is R1's distillation trail — the largest single distribution-shift event the open ecosystem has seen.

Family tree of major reasoning models, 2024-2026

Reproductions you can run

Five notebooks, single-GPU runnable

Each notebook isolates one chapter's empirical claim and reproduces it at single-A10G scale (or CPU for the toys). Documented hardware, documented caveats, runnable end-to-end.

Notebook 01 · CH 2

Test-time compute scaling

Qwen2.5-Math-1.5B on MATH-500, log-x token-budget sweep, plot accuracy.

Notebook 02 · CH 3

BoN vs self-consistency

At fixed budget, compare long-CoT vs self-consistency vs BoN-with-PRM.
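As a rough illustration of the two selection rules compared here (not the notebook's actual code), the sketch below contrasts majority voting over final answers with picking the single highest-scored sample. The `scores` list stands in for PRM or verifier outputs and is a hypothetical input; both function names are made up for this example.

```python
from collections import Counter
from typing import Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote over sampled final answers.

    Ties resolve to the answer that appeared first among the samples.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # max() returns the first maximal key in insertion order.
    return max(counts, key=lambda a: counts[a])


def best_of_n(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Return the answer whose sample received the highest score.

    `scores` is assumed to come from some external scorer (hypothetical here).
    """
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and aligned")
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Five samples for one question, with made-up scorer outputs.
sampled = ["12", "15", "12", "12", "15"]
scored = [0.2, 0.9, 0.3, 0.1, 0.4]

vote_pick = self_consistency(sampled)   # most frequent answer
bon_pick = best_of_n(sampled, scored)   # answer of the top-scored sample
```

The two rules can disagree on the same samples: voting trusts agreement across chains, while best-of-N trusts the scorer, so its quality depends on how reliable that scorer is.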

Notebook 03 · CH 5

Tiny R1-Zero GRPO run

Qwen2.5-0.5B + GSM8K + trl GRPOTrainer. See the signal in an hour.
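For orientation, the core of GRPO is a group-relative advantage: several completions are sampled per prompt, and each reward is normalized against its own group. The sketch below shows only that normalization step; it is not trl's implementation, and the function name, the use of population standard deviation, and the `eps` value are assumptions made for illustration.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    Assumption: population standard deviation plus a small `eps` to avoid
    division by zero when every completion in the group gets the same reward.
    """
    if not rewards:
        raise ValueError("need at least one reward in the group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt with a binary verifiable reward
# (1.0 = final answer correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every completion succeeds carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, while a group with identical rewards yields all zeros, which is one reason prompt difficulty matters for how quickly a small run shows signal.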

Notebook 04 · CH 6

Overthinking on trivial problems

R1-distilled vs base on trivial "2+3=?"-style questions; reproduce Chen et al. 2024.

Notebook 05 · CH 3

PRM vs ORM toy

Synthetic stepwise arithmetic. CPU-only, ~5 minutes. PRM beats ORM cleanly.
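To make the process-versus-outcome distinction concrete, here is a minimal sketch on stepwise addition. It uses exact oracle checks as stand-ins for learned reward models, so it illustrates what each signal can see rather than reproducing the notebook; the `Step` layout and both function names are assumptions for this example.

```python
from typing import List, Tuple

# One arithmetic step: (left operand, right operand, claimed sum).
Step = Tuple[int, int, int]


def outcome_score(chain: List[Step], target: int) -> float:
    """ORM-style signal: looks only at the final claimed result."""
    if not chain:
        return 0.0
    return 1.0 if chain[-1][2] == target else 0.0


def process_score(chain: List[Step]) -> float:
    """PRM-style signal: scores every step and aggregates with min,
    so a single wrong step sinks the whole chain."""
    if not chain:
        return 0.0
    return min(1.0 if a + b == c else 0.0 for a, b, c in chain)


# Task: 3 + 4 + 5 = 12.
sound_chain = [(3, 4, 7), (7, 5, 12)]   # every step is correct
lucky_chain = [(3, 4, 8), (8, 5, 12)]   # two errors that cancel out

sound_scores = (outcome_score(sound_chain, 12), process_score(sound_chain))
lucky_scores = (outcome_score(lucky_chain, 12), process_score(lucky_chain))
```

Both chains reach the right final answer, so the outcome signal cannot separate them, while the stepwise signal flags the chain with compensating errors. Aggregating with `min` is one choice among several; a product or mean of step scores would behave differently on long chains.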

One opinion

"The chain of thought is the model's behavior, not its computation. Reasoning-model gains come from RL elicitation of latent capability, structured by the training distribution and amortized at inference. None of the three current theoretical frameworks alone is sufficient — and the most useful research moves the frontier where they conflict." — from Why do reasoning models work? A synthesis

Essays

Long-form syntheses

Where a survey paragraph isn't enough.

Essay 01

Why do reasoning models work? A synthesis.

A four-component layered account; what each component explains and doesn't.

Essay 02

Is CoT faithful? The state of the debate.

Camp A (oversight-via-behavior) vs Camp B (faithfulness-as-objective). Where we land.

Essay 03

Search vs RL: the deep tension.

Two stories about reasoning. Why both have evidence, neither suffices.

Essay 04

Common misconceptions about reasoning models.

Twelve claims that circulate; why each is misleading; what to say instead.

Essay 05

The closed–open gap, tracked.

Four transitions, a cycle, and what's timing vs structural.

Essay 06

How to read a reasoning-model paper.

A triage checklist for the 30+ papers landing weekly. Red flags. What to look for.

Essay 07

Reasoning and mechanistic interpretability.

The open gap and what's likely to move first.

Auxiliary docs

Reading paths, glossary, BibTeX, model families

DOC

7 cross-chapter reading paths

Sequenced, depth-calibrated (skim / weekend / research).

DOC

Cheat sheet

One-page reference: recipe, strategies, tests, top-5 papers.

DOC

Model families catalog

Closed (o-series, Claude thinking, Gemini) and open (DeepSeek, Qwen, Tülu).

DOC

BibTeX exports

Machine-readable citations for 30+ anchor papers, by chapter.

DOC

FAQ

Predictable scope and curation questions, answered honestly.

DOC

Glossary

60+ field-specific terms (GRPO, RLVR, PRM, ORM, ...).

Cite the list

If this map is useful to your research

@misc{awesome_reasoning_models_theory_2026,
  title  = {Awesome Reasoning Models Theory: A theoretical and
            empirical map of the o-series / R1 / Claude-thinking paradigm},
  year   = {2026},
  url    = {https://github.com/bettyguo/awesome-reasoning-models-theory},
  note   = {Living document}
}

→ BibTeX for the anchor papers (per chapter)