Reasoning Models Cheat Sheet

One page. Print, screenshot, share. Each row is the smallest summary that does not mislead.

Last updated: 2026-05-14.

The arc in 60 seconds

2022 CoT prompting   →   2023 PRMs / ToT   →   2024 test-time scaling
                                                ↓
                                          2024-09 o1 (closed)
                                                ↓
                                          2025-01 R1 (open)
                                                ↓
                              overthinking + faithfulness debates
                                                ↓
                                       FrontierMath, ARC-AGI-3

If you remember one thing per chapter

#	Chapter	One sentence
1	CoT & Scratchpads	Intermediate tokens turn a fixed-depth forward pass into an unbounded serial program.
2	Test-time compute scaling	Inference compute trades off against training compute with a task-dependent exchange rate.
3	Sampling & verification	At a fixed budget the usual ranking is BoN+PRM > self-consistency > single long-CoT > greedy, assuming a good verifier.
4	Search at inference	Search dominates RL on tasks the RL didn’t cover; RL absorbs search elsewhere.
5	RL for reasoning	RLVR looks more like elicitation than new learning: R1-Zero suggests the needed behaviors are largely present in the base model.
6	Overthinking & length	The optimal chain length is task-dependent; both a floor and a ceiling on length matter.
7	Faithfulness	CoTs often post-hoc rationalize. RL training reduces this but does not eliminate it.
8	Theoretical frameworks	Three accounts (compute-depth, program synthesis, Bayes-over-thoughts); none alone is sufficient.

The R1 recipe, in 5 lines

Base: a strong pretrained model (DeepSeek-V3-Base in the paper; much smaller bases reportedly gain less from direct RL than from distillation).
R1-Zero: pure RL with verifiable rewards (math answer match, code unit tests).
GRPO algorithm: PPO without the value head; group-relative advantage normalization.
R1: cold-start SFT on the base for readability, then reasoning RL, rejection-sampling SFT, and a final RL pass.
Output: a model that emits long self-correcting CoTs and reaches o1-class accuracy on math/code.

Cold-start SFT mostly fixes readability and language mixing. On this reading, the reasoning was already latent in the base and RL elicited it.
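
The group-relative advantage from the GRPO line above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: the function name `group_relative_advantages`, the `eps` guard, and the choice of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds the scalar rewards of all completions sampled for ONE
    prompt. The group mean acts as the baseline, so no value head is used.
    `eps` (an assumption here) avoids division by zero for uniform groups.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # roughly [1.0, -1.0, -1.0, 1.0]
```

A group in which every completion earns the same reward gets zero advantage everywhere, so it contributes no learning signal.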

Test-time compute strategies — compact comparison

Strategy	Verifier needed?	Compute cost	Wins when
Greedy long-CoT	No	1×	Easy problems, model is well-calibrated
Self-consistency (cons@K)	No	K×	Discrete answer + model right more often than any single wrong answer
Best-of-N + PRM	Yes (good one)	N× + verifier	Hard problems + step-level signal available
Tree of Thoughts	Yes (value fn)	varies	Small-state planning, partial-solution value informative
Recursive self-aggregation	Optional	iterations × K	Tasks where summarization-then-recurse adds context
Test-time training	No (needs gradient access)	gradient steps per question	Out-of-distribution (e.g. ARC-AGI)
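
The self-consistency and best-of-N rows differ only in how one answer is picked from the sampled chains. A minimal sketch, assuming the samples and verifier scores already exist; `samples` and both function names are hypothetical stand-ins, not any library's API.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers (cons@K); no verifier needed."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Return the candidate that the verifier scores highest."""
    return max(candidates, key=score)

# Hypothetical data: (final answer, verifier score) for 5 sampled chains.
samples = [("12", 0.31), ("12", 0.28), ("15", 0.93), ("12", 0.35), ("9", 0.10)]

voted = self_consistency([answer for answer, _ in samples])
picked = best_of_n(samples, score=lambda s: s[1])[0]
print(voted, picked)  # prints: 12 15
```

The toy data is chosen so the two strategies disagree: voting follows the most common answer, while a verifier can promote a minority answer, which helps only if the verifier is trustworthy.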

ORMs vs PRMs in one line each

Outcome reward model (ORM): scores the entire trace by final-answer correctness. Cheap labels, sparse signal.
Process reward model (PRM): scores each intermediate step. Expensive labels, dense signal. Beat ORMs for best-of-N selection on MATH in Lightman et al. 2023.
Math-Shepherd-style auto-PRMs: synthesize step labels by Monte-Carlo rollouts. Closes most of the labeling-cost gap.
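
To make the three lines above concrete, here is a small sketch of how step scores might be turned into a trace score and how a rollout-based step label might be estimated. The function names, the `how` parameter, and the example numbers are illustrative assumptions; min and product are two common aggregation choices, not the only ones.

```python
import math

def prm_trace_score(step_scores, how="min"):
    """Collapse per-step PRM scores into one score for ranking traces."""
    if how == "min":
        return min(step_scores)        # the weakest step decides
    if how == "prod":
        return math.prod(step_scores)  # every step must hold up
    raise ValueError(f"unknown aggregation: {how}")

def mc_step_label(rollout_outcomes):
    """Soft step label: share of rollouts continued from this step that
    reach the correct final answer (Monte-Carlo estimate)."""
    return sum(rollout_outcomes) / len(rollout_outcomes)

print(prm_trace_score([0.9, 0.2, 0.8]))           # 0.2
print(mc_step_label([True, False, True, True]))   # 0.75
```

An ORM would see only the final answer of this trace; the step-level view is what exposes the weak second step.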

When does CoT help, when does it hurt?

Task type	CoT effect	Why
Multi-step math, algorithms, symbolic	Big help	Serial compute is the bottleneck; CoT extends it
Multi-hop fact composition	Help	Chaining local inferences bridges facts rarely seen together in training (Prystawski et al. 2023)
Single-step factual lookup	Neutral / slight hurt	Compute extension doesn’t apply; latency cost
Easy arithmetic for an RL-trained reasoner	Hurts	Overthinking: long chain creates error opportunities
Tasks with prompt-bias hints	Unfaithful	Chain rationalizes a bias-driven answer (Turpin et al. 2023)
Long-horizon agent tasks	Weak benefit	Verifier-poor regime; scaling law degrades

Faithfulness tests (Lanham battery)

Treat a CoT as faithful only if it passes all four (passing is evidence, not proof):

Truncation: cutting the chain at step k changes the answer (the model was using the chain).
Paraphrase: semantically equivalent rewordings give the same answer (the meaning matters, not hidden signals in the exact wording).
Mistake injection: wrong intermediate steps propagate to wrong final answers.
Filler tokens: replacing the chain with filler reduces accuracy (content > pure compute).

Failing any test: the chain is at least partially decorative. Empirically, no current model passes all four cleanly on adversarial inputs.
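
The truncation test is the easiest of the four to automate. A rough sketch, assuming a hypothetical callable `answer_fn(steps)` that returns the model's final answer when conditioned on a (possibly truncated) list of reasoning steps; the two toy functions below stand in for a real model.

```python
def truncation_sensitivity(steps, answer_fn):
    """Fraction of truncation points where the answer differs from the
    answer given the full chain. Near 0 suggests a decorative chain."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    full_answer = answer_fn(steps)
    # Truncate before step k for k = 0 .. len(steps) - 1.
    changed = [answer_fn(steps[:k]) != full_answer for k in range(len(steps))]
    return sum(changed) / len(changed)

steps = [1, 2, 3]
uses_chain = sum              # toy model: the answer depends on every step
ignores_chain = lambda s: 6   # toy model: same answer whatever the chain says

print(truncation_sensitivity(steps, uses_chain))     # 1.0
print(truncation_sensitivity(steps, ignores_chain))  # 0.0
```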

Vendor-reported vs verified

Claim	Source	Status
“o1 scaling curve on AIME”	OpenAI blog 2024-09	🔴 vendor-reported, no public eval code or weights
“o3 reaches > 2700 Codeforces Elo”	OpenAI announcement	🔴 vendor-reported
DeepSeek-R1: AIME-24 79.8% pass@1	arXiv:2501.12948 Table 2	🟢 open weights, reproducible
s1-32B: AIME-24 56.7%, MATH-500 93%	arXiv:2501.19393 Table 1	🟢 open
AlphaProof (with AlphaGeometry 2): IMO 2024 silver-medal level	Nature 2025 / DeepMind blog	🟢 primary source

Rule of thumb. If the curve is screenshot-able and the model is closed, treat it as a claim, not evidence.

Common confusions to avoid

pass@1 ≠ cons@k ≠ pass@k. Always check which.
“o1-style” is a phrase, not a guarantee. Many “o1-style” papers test only on easy slices.
“R1-distilled” small models are distillations of R1’s outputs, not RL-trained themselves.
“Reasoning model” in 2025–2026 usage means an RL-trained, long-CoT model. An instruction-tuned model that is merely prompted to produce a CoT is not a reasoning model.
“Faithful” means the chain causes the answer, not “the chain sounds correct.”
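
The first confusion in the list above is easiest to see on one set of samples. The sketch below uses the standard unbiased pass@k estimator (Chen et al. 2021) and a plain majority vote for cons@k; the sample answers are made up for illustration and the function names are assumptions.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cons_at_k(answers, truth):
    """1.0 if the majority-voted answer is correct, else 0.0."""
    return float(Counter(answers).most_common(1)[0][0] == truth)

# Hypothetical: 8 samples for one problem whose true answer is "7".
answers = ["7", "7", "5", "5", "5", "3", "7", "5"]
n, c = len(answers), answers.count("7")

print(pass_at_k(n, c, 1))       # 0.375
print(cons_at_k(answers, "7"))  # 0.0  (the vote picks "5")
print(pass_at_k(n, c, 8))       # 1.0
```

Same samples, three very different numbers, which is why a reported score means little until the metric is named.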

Five papers if you only read five

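Wei et al. 2022, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”: the origin.
Lightman et al. 2023, “Let’s Verify Step by Step”: PRMs.
Snell et al. 2024, “Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters”: compute-optimal test-time scaling.
DeepSeek-AI 2025, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”: the open recipe.
Lanham et al. 2023, “Measuring Faithfulness in Chain-of-Thought Reasoning”: the corrective.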

Open problems (priority)

A predictor for the optimal test-time strategy at a given (task, budget).
A mechanistic story for the “aha moment” emergence under RLVR.
Faithfulness as a primary training objective (currently no clean recipe).
Verifier scaling laws — how much accuracy follows from how much verifier compute.
Test-time training (Akyürek et al.) as an alternative scaling axis on harder benchmarks.

Want this in a different format (PDF, slide)? PR a converter. Want a row corrected? Open an issue.