A theory-and-mechanism-first map of the o-series / R1 / Claude-thinking paradigm. Eight argued chapters, five reproduction notebooks, a monthly benchmarks tracker, and explicit engagement with the faithfulness and overthinking debates most lists hedge.
The 2022–2026 trajectory: chain-of-thought prompting → process reward models → test-time compute scaling → o1 → DeepSeek-R1 → the overthinking and faithfulness debates. Each transition produced a recipe; each recipe shifted what "reasoning" meant. The interactive timeline lets you click any milestone.
Most awesome-lists aggregate titles. This one argues mechanisms. Each chapter has a TL;DR, the proposed mechanism, 10+ annotated papers, the live debates, reading paths, and an open-problems list.
Intermediate tokens turn a fixed-depth forward pass into an unbounded serial program.
Inference compute trades off against parameters at a task-dependent exchange rate.
Reranking and voting over samples extract quality faster than improving any single sample.
Structured exploration over CoT prefixes recovers solutions a greedy decode misses.
RL with verifiable rewards reshapes the policy toward long, self-correcting chains.
Beyond a task-dependent optimum, more CoT hurts — long chains compound errors.
CoTs often post-hoc rationalize. The visible chain isn't always the computation.
Three accounts compete and partly cooperate: compute-depth, program synthesis, Bayes-over-thoughts.
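Two of the one-line claims above (voting over samples, and long chains compounding errors) can be made concrete with toy arithmetic. The sketch below is illustrative only and is not taken from the chapters or notebooks. It assumes binary right/wrong answers, independent samples that are each correct with probability `p`, and independent per-step errors with no self-correction. The function names are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a strict majority of k samples is correct.

    Toy assumptions: answers are binary (right/wrong), samples are
    independent, each is correct with probability p, and k is odd.
    """
    need = k // 2 + 1  # smallest number of correct samples that forms a majority
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


def chain_success(step_accuracy: float, n_steps: int) -> float:
    """Probability that an n-step chain contains no erroneous step.

    Toy assumptions: steps fail independently and the chain never
    repairs an earlier mistake.
    """
    return step_accuracy**n_steps


if __name__ == "__main__":
    # Voting: accuracy as the number of sampled chains grows.
    for k in (1, 5, 25):
        print(f"k={k:>3}  majority-vote accuracy={majority_vote_accuracy(0.6, k):.3f}")
    # Compounding: success probability as the chain gets longer.
    for n in (10, 50, 200):
        print(f"n={n:>3}  error-free chain probability={chain_success(0.99, n):.3f}")
```

Under this model voting only helps when `p` exceeds 0.5, and the no-self-correction assumption is exactly what RL-trained self-correcting chains are claimed to relax, so treat both curves as intuition rather than prediction.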
Solid arrows are mechanism dependencies. Dashed arrows are open debates. Color marks the role: foundation (blue), inference-time (green), training-time (orange), failure modes (pink), synthesis (purple).
The papers and models are also exposed as filterable indexes — slice by chapter, year, status, open vs closed weights, training recipe. Both are powered by versioned JSON files in the repo.
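Slicing such an index offline might look like the sketch below. The repo's real JSON schema is not shown here, so the field names (`chapter`, `year`, `status`) and the inline sample records are placeholders, not actual entries. In practice the list would come from `json.load` on one of the versioned files.

```python
import json

# Placeholder records standing in for a versioned JSON index file.
# Field names and values are assumptions, not the repo's real schema.
SAMPLE_INDEX = json.loads("""
[
  {"title": "Placeholder paper A", "chapter": 1, "year": 2022, "status": "published"},
  {"title": "Placeholder paper B", "chapter": 5, "year": 2025, "status": "preprint"},
  {"title": "Placeholder paper C", "chapter": 5, "year": 2025, "status": "published"}
]
""")


def filter_entries(entries, **criteria):
    """Return the entries whose fields equal every given criterion."""
    return [
        entry
        for entry in entries
        if all(entry.get(field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    for entry in filter_entries(SAMPLE_INDEX, chapter=5, year=2025):
        print(entry["title"])
```

Exact-match filtering keeps the sketch short; range queries over years or multi-valued tags would need a predicate per field instead.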
The open and closed tracks both run base → SFT → RLVR → deployed reasoner, but only the open track discloses its recipe. The dashed lavender arrow is R1's distillation trail, the largest single distribution-shift event the open ecosystem has seen.
Each notebook isolates one chapter's empirical claim and reproduces it at single-A10G scale (or CPU for the toys). Documented hardware, documented caveats, runnable end-to-end.
"The chain of thought is the model's behavior, not its computation. Reasoning-model gains come from RL elicitation of latent capability, structured by the training distribution and amortized at inference. None of the three current theoretical frameworks alone is sufficient — and the most useful research moves the frontier where they conflict." — from Why do reasoning models work? A synthesis
Where a survey paragraph isn't enough.
@misc{awesome_reasoning_models_theory_2026,
title = {Awesome Reasoning Models Theory: A theoretical and
empirical map of the o-series / R1 / Claude-thinking paradigm},
year = {2026},
url = {https://github.com/bettyguo/awesome-reasoning-models-theory},
note = {Living document}
}