Reading paths

Curated cross-chapter sequences. Each path picks 6–12 papers and sequences them to build a coherent picture of a sub-topic in a weekend or less.

These supplement the per-chapter Reading paths sections; this file is the place for sequences that cross chapter boundaries.

Path 1 — What is the o1 / R1 paradigm, in two evenings?

For the practitioner who needs to understand the dominant 2024–2026 reasoning recipe end-to-end.

Evening 1 — the empirical anchor:

Wei et al. 2022, “Chain-of-Thought Prompting” — the origin.
Lightman et al. 2023, “Let’s Verify Step by Step” — process reward models.
Snell et al. 2024, “Scaling LLM Test-Time Compute Optimally” — the test-time-compute scaling law.
OpenAI 2024, “Learning to Reason with LLMs” — the o1 announcement (read for context; flagged as a closed model).

Evening 2 — the reproductions:

DeepSeek-AI 2025, “DeepSeek-R1” — full RLVR recipe with open weights.
Muennighoff et al. 2025, “s1: Simple Test-Time Scaling” — the cleanest open scaling-curve reproduction.
Shao et al. 2024, “DeepSeekMath” — where GRPO originates.

After this path you can read most of arXiv’s reasoning-model papers with context.

Path 2 — I don’t trust CoT-as-thought. Convince me one way or the other.

For the reader skeptical of (or persuaded by) the faithfulness debate.

Turpin et al. 2023, “Language Models Don’t Always Say What They Think” — the originating demonstration.
Lanham et al. 2023, “Measuring Faithfulness in Chain-of-Thought Reasoning” — the measurement framework.
Pfau, Merrill, Bowman 2024, “Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models” — filler tokens partly substitute for content.
Chen et al. 2024, “Premise Order Matters in Reasoning” — different angle, same phenomenon.
Anthropic 2025, “Reasoning Models Don’t Always Say What They Think” — the RL-trained-reasoner version.
Read the faithfulness essay for the synthesis.

Optional dual-use end: Greenblatt et al. 2024, “Alignment Faking”.

Path 3 — Is reasoning search or RL?

For the reader interested in the deep methodological question of whether reasoning-model behavior is best explained as inference-time search, an RL-amortized policy, or both.

Yao et al. 2023, “Tree of Thoughts” — the search-side framing.
Gandhi et al. 2024, “Stream of Search” — search as a learnable behavior.
DeepSeek-AI 2025, “DeepSeek-R1” — pure RL with no explicit search, reaching o1-class performance.
Huang et al. 2024, “Large Language Models Cannot Self-Correct Reasoning Yet” — the self-refine negative result.
DeepMind 2024, AlphaProof (blog) — where explicit search still dominates.
Read the search-vs-RL essay.

Path 4 — Theory of why CoT helps (formal + informal)

For the researcher wanting both the formal-expressivity story and the empirical-explanatory one. Mixes papers from this list and the sister list.

(foundations list) Merrill & Sabharwal 2024, “CoT expressivity”.
(foundations list) Li et al. 2024, “CoT empowers serial problems”.
(this list) Prystawski et al. 2023, “Why think step by step?”.
(this list) Pfau et al. 2024, “Let’s Think Dot by Dot” — filler-token evidence.
(this list) Sprague et al. 2024, “To CoT or not to CoT?” — meta-analysis of when CoT helps.
Read the synthesis essay.

Path 5 — The overthinking debate

For the reader interested in chain-length calibration and the empirical line that “longer is not always better.”

Chen et al. 2024, “Do Not Think That Much for 2+3=?” — the paper that named the overthinking problem.
Hassid et al. 2025, “Don’t Overthink it” — training-time fix.
Xu et al. 2025, “Chain of Draft” — prompt-time fix.
Yang et al. 2025, “Towards Thinking-Optimal Scaling” — the principled framing.
Muennighoff et al. 2025, “s1” §3–4 — budget forcing as the formal knob.
Sui et al. 2025, “Stop Overthinking” (survey) — the comprehensive index.

Path 6 — RL-for-reasoning, fast track

For someone implementing RLVR on their own model.

Shao et al. 2024, “DeepSeekMath” — GRPO algorithm.
Lambert et al. 2024, “Tulu 3” — open RLVR recipe.
DeepSeek-AI 2025, “DeepSeek-R1” — full pipeline.
Luong et al. 2024, “ReFT” — pre-R1 reference.
Gao, Schulman, Hilton 2022, “Scaling Laws for Reward Model Overoptimization” — the load-bearing prior result on reward hacking.
The TRL library GRPOTrainer source code.
Reproduce on a tiny model with notebook 03.
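
As a warm-up before the GRPOTrainer source, the group-relative advantage at the core of GRPO can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TRL or DeepSeek implementation: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are choices made here for the example, and real trainers may normalize differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored by a verifiable 0/1 reward
# (hypothetical values for illustration).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every sample gets the same reward carries no learning signal.
flat_group = group_relative_advantages([1.0, 1.0])
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, with the group itself serving as the baseline in place of a learned value function. A group with identical rewards yields all-zero advantages, which is one reason prompt difficulty matters when reproducing these recipes.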

Path 7 — Sampling and verification, fast track

For the inference-side practitioner.

Cobbe et al. 2021, “Training Verifiers” — origin of best-of-N (BoN) sampling with a verifier.
Wang et al. 2022, “Self-Consistency” — the verifier-free baseline.
Lightman et al. 2023, “Let’s Verify Step by Step” — PRMs.
Wang et al. 2023, “Math-Shepherd” — auto-labeled PRMs.
Brown et al. 2024, “Large Language Monkeys” — pass@K scaling.
Liu et al. 2025, “Inference-Time Scaling for Generalist Reward Modeling” — verifier-side scaling.
Rohatgi et al. 2025, “Taming Imperfect Process Verifiers” — practical guidance.
Reproduce on a small model with notebook 02 and notebook 05.
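
To keep the three selection ideas in this path apart, here is a minimal sketch that contrasts them on toy data. The function names, the sample answers, and the verifier scores are all hypothetical; the pass@k function uses the commonly used unbiased estimator, which may differ from the exact metric a given paper reports.

```python
from collections import Counter
from math import comb


def self_consistency(answers):
    """Verifier-free baseline: majority vote over sampled final answers."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, scores):
    """Best-of-N: return the sampled answer with the highest verifier score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Five sampled answers to one question, with hypothetical verifier scores.
answers = ["42", "41", "42", "17", "42"]
scores = [0.55, 0.90, 0.60, 0.10, 0.58]

voted = self_consistency(answers)
picked = best_of_n(answers, scores)
coverage = pass_at_k(n=5, c=3, k=2)
```

On this toy data the majority vote and the verifier disagree: the vote returns the most frequent answer, while best-of-N returns whatever the verifier scores highest, so an imperfect verifier can override a correct majority. pass@k measures something different again, namely whether any of k samples is correct, which is only an upper bound on what a selection rule can recover.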

Calibrating depth

Skim path = ~90 minutes. Headlines, abstracts, and a single figure per paper.
Deep path = a weekend. Full read of each paper, including methods and ablations.
Research path = a week. Full read, plus the papers each one cites, one layer out.

If a chapter’s Reading paths section conflicts with a path here, the chapter version is the authoritative one for that chapter — this file’s value is the cross-chapter sequences.

Filed: 2026-05-14. PR-friendly — propose new paths or revisions via issue.