Reading paths
Curated cross-chapter sequences. Each path picks 6–12 papers and sequences them to build a coherent picture of a sub-topic in a weekend or less.
These supplement the per-chapter Reading paths sections; this file is the place for sequences that cross chapter boundaries.
Path 1 — What is the o1 / R1 paradigm, in two evenings?
For the practitioner who needs to understand the dominant 2024–2026 reasoning recipe end-to-end.
Evening 1 — the empirical anchor:
- Wei et al. 2022, “Chain-of-Thought Prompting” — the origin.
- Lightman et al. 2023, “Let’s Verify Step by Step” — process reward models.
- Snell et al. 2024, “Scaling LLM Test-Time Compute Optimally” — the test-time-compute scaling law.
- OpenAI 2024, “Learning to Reason with LLMs” — o1 announcement (read for context, flagged closed-model).
Evening 2 — the reproductions:
- DeepSeek-AI 2025, “DeepSeek-R1” — full RLVR recipe with open weights.
- Muennighoff et al. 2025, “s1: Simple Test-Time Scaling” — the cleanest open scaling-curve reproduction.
- Shao et al. 2024, “DeepSeekMath” — where GRPO originates.
After this path you can read most of arXiv’s reasoning-model papers with context.
Path 2 — I don’t trust CoT-as-thought. Convince me one way or the other.
For the reader who comes to the faithfulness debate either skeptical or persuaded.
- Turpin et al. 2023, “Language Models Don’t Always Say What They Think” — the originating demonstration.
- Lanham et al. 2023, “Measuring Faithfulness in Chain-of-Thought Reasoning” — the measurement framework.
- Pfau, Merrill, Bowman 2024, “Hidden Computation in Transformer Language Models” — filler tokens partly substitute for content.
- Chen et al. 2024, “Premise Order Matters in Reasoning” — different angle, same phenomenon.
- Anthropic 2025, “Reasoning Models Don’t Always Say What They Think” — the RL-trained-reasoner version.
- Read the faithfulness essay for the synthesis.
Optional dual-use end: Greenblatt et al. 2024, “Alignment Faking”.
Path 3 — Is reasoning search or RL?
For the reader interested in the deep methodological question of whether reasoning-model behavior is best explained as inference-time search, an RL-amortized policy, or both.
- Yao et al. 2023, “Tree of Thoughts” — the search-side framing.
- Gandhi et al. 2024, “Stream of Search” — search as a learnable behavior.
- DeepSeek-AI 2025, “DeepSeek-R1” — pure RL, no explicit search, reaches o1-class.
- Huang et al. 2024, “Large Language Models Cannot Self-Correct Reasoning Yet” — the self-refine negative result.
- DeepMind 2024, AlphaProof (blog) — where explicit search still dominates.
- Read the search-vs-RL essay.
Path 4 — Theory of why CoT helps (formal + informal)
For the researcher wanting both the formal-expressivity story and the empirical-explanatory one. Mixes papers from this list and the sister list.
- (foundations list) Merrill & Sabharwal 2024, “CoT expressivity”.
- (foundations list) Li et al. 2024, “CoT empowers serial problems”.
- (this list) Prystawski et al. 2023, “Why think step by step?”.
- (this list) Pfau et al. 2024, “Dot by dot” — filler-token evidence.
- (this list) Sprague et al. 2024, “To CoT or not to CoT?” — meta-analysis of when CoT helps.
- Read the synthesis essay.
Path 5 — The overthinking debate
For the reader interested in chain-length calibration and the empirical line that “longer is not always better.”
- Chen et al. 2024, “Do Not Think That Much for 2+3=?” — naming paper.
- Hassid et al. 2025, “Don’t Overthink it” — training-time fix.
- Xu et al. 2025, “Chain of Draft” — prompt-time fix.
- Yang et al. 2025, “Towards Thinking-Optimal Scaling” — the principled framing.
- Muennighoff et al. 2025, “s1” §3-4 — budget forcing as the formal knob.
- Sui et al. 2025, “Stop Overthinking” (survey) — the comprehensive index.
Path 6 — RL-for-reasoning, fast track
For someone implementing RLVR on their own model.
- Shao et al. 2024, “DeepSeekMath” — GRPO algorithm.
- Lambert et al. 2024, “Tulu 3” — open RLVR recipe.
- DeepSeek-AI 2025, “DeepSeek-R1” — full pipeline.
- Luong et al. 2024, “ReFT” — pre-R1 reference.
- Gao, Schulman, Hilton 2022, “Scaling Laws for Reward Model Overoptimization” — the load-bearing prior result on reward hacking.
- The TRL library GRPOTrainer source code.
- Reproduce on a tiny model with notebook 03.
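As orientation before reading the GRPO entries in this path, the group-relative advantage at the heart of the algorithm can be sketched in a few lines. This is a minimal sketch, assuming one scalar verifiable reward per sampled completion and the mean/std normalization described in DeepSeekMath. The function name `group_relative_advantages`, the `eps` constant, and the use of population standard deviation are illustrative assumptions, not taken from TRL or any paper's released code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per completion sampled for the SAME prompt.
    The advantage is (reward - group mean) / (group std + eps), so no
    learned value function is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions scored by a 0/1 answer checker.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct completions receive a positive advantage and incorrect ones a negative advantage. When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal in this sketch.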
Path 7 — Sampling and verification, fast track
For the inference-side practitioner.
- Cobbe et al. 2021, “Training Verifiers” — origin of BoN-with-verifier.
- Wang et al. 2022, “Self-Consistency” — the verifier-free baseline.
- Lightman et al. 2023, “Let’s Verify Step by Step” — PRMs.
- Wang et al. 2023, “Math-Shepherd” — auto-labeled PRMs.
- Brown et al. 2024, “Large Language Monkeys” — pass@K scaling.
- Liu et al. 2025, “Inference-Time Scaling for Generalist Reward Modeling” — verifier-side scaling.
- Rohatgi et al. 2025, “Taming Imperfect Process Verifiers” — practical guidance.
- Reproduce on a small model with notebook 02 and notebook 05.
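The two selection strategies that anchor this path, verifier-free majority voting (self-consistency) and verifier-scored best-of-N, can be contrasted in a short sketch. It assumes the final answers have already been extracted from sampled chains as strings. The names `self_consistency`, `best_of_n`, and the toy score table are hypothetical stand-ins, since a real verifier would be a trained outcome or process reward model.

```python
from collections import Counter


def self_consistency(answers):
    """Verifier-free baseline: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates, score):
    """Verifier-based selection: return the candidate with the highest score."""
    return max(candidates, key=score)


# Hypothetical final answers extracted from four sampled chains.
sampled_answers = ["42", "41", "42", "7"]
voted = self_consistency(sampled_answers)

# Toy stand-in for a verifier: a lookup table of made-up scores.
toy_scores = {"42": 0.6, "41": 0.9, "7": 0.1}
picked = best_of_n(sampled_answers, lambda a: toy_scores[a])

print(voted, picked)
```

The made-up scores are chosen so the two strategies disagree: voting returns the most common answer, while best-of-N follows the verifier even when it is wrong, which is why verifier quality matters as much as sample count.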
Calibrating depth
- Skim path = ~90 minutes. Headlines + abstracts + a single figure per paper.
- Deep path = a weekend. Full read of each paper, including methods and ablations.
- Research path = a week. Full read + reading the cited papers’ citations one layer out.
If a chapter’s Reading paths section conflicts with a path here, the chapter version is the authoritative one for that chapter — this file’s value is the cross-chapter sequences.
Filed: 2026-05-14. PR-friendly — propose new paths or revisions via issue.