Reasoning benchmark numbers — sourced, dated, and honest about what's vendor-reported vs. independently verifiable.
Refreshed monthly.
How to read
pass@1 = single-attempt accuracy (the post-o1 standard).
cons@k = consensus / majority-vote over k samples — not the same as pass@k.
🟢 = open weights, methods public. 🔴 = vendor-reported only. 🟠 = mixed / partial.
Cells without a number (—) mean the curator could not verify a stable headline number; see the source link.
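To make the pass@1 / cons@k / pass@k distinction concrete, here is a minimal sketch on made-up sampled answers. The function names and the sample data are illustrative assumptions, not taken from any benchmark harness; pass@1 is estimated here as the mean accuracy over the drawn samples, which is one common convention rather than the only one.

```python
from collections import Counter

def pass_at_1(samples, answer):
    """Estimate single-attempt accuracy as the fraction of samples that are correct."""
    return sum(s == answer for s in samples) / len(samples)

def cons_at_k(samples, answer):
    """Majority vote over the k samples: 1.0 if the most common answer is correct.
    Ties are broken by first occurrence (Counter.most_common behaviour)."""
    top_answer, _ = Counter(samples).most_common(1)[0]
    return 1.0 if top_answer == answer else 0.0

def any_correct_at_k(samples, answer):
    """Simplest form of pass@k with exactly k samples: 1.0 if any sample is correct."""
    return 1.0 if answer in samples else 0.0

# Hypothetical data: 8 sampled final answers for two problems.
problem_a = ["204", "204", "113", "204", "96", "113", "204", "33"]  # truth: "204"
problem_b = ["70", "70", "70", "25", "70", "12", "25", "70"]        # truth: "25"

scores_a = (pass_at_1(problem_a, "204"), cons_at_k(problem_a, "204"), any_correct_at_k(problem_a, "204"))
scores_b = (pass_at_1(problem_b, "25"), cons_at_k(problem_b, "25"), any_correct_at_k(problem_b, "25"))

print(scores_a)  # (0.5, 1.0, 1.0)
print(scores_b)  # (0.25, 0.0, 1.0)
```

On the second problem the correct answer appears among the samples (so the pass@k-style check succeeds), but the majority vote settles on a wrong answer, which is why cons@k and pass@k cannot be swapped for each other.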
AIME 2024 · the 18-month sprint
How the field's headline reasoning benchmark moved from 13.4% (GPT-4o, May 2024) to 91.6% (o3, April 2025).
The DeepSeek-R1 release narrowed the open–closed gap to roughly 4 months for AIME-class math.
Methodological gotchas
AIME pass@1 and cons@64 scores are not directly comparable. Always check which metric a reported number uses.
LiveCodeBench is designed around date cutoffs; the slice (problem date range) materially affects scores.
SWE-bench Verified scores depend heavily on the agent harness; numbers produced with different harnesses are not comparable.
Vendor cherry-picking — closed-model reports sometimes use bespoke prompting that isn't replicable. Prefer system-card / paper sources over blog screenshots.
Contamination — AIME, MATH, Codeforces problem statements appear widely on the web. Models pretrained after the contest date may have seen them.
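The date-slice and contamination points above can be sketched together. Everything below is hypothetical: the per-problem results, the training cutoff, and the helper names are invented for illustration and do not come from LiveCodeBench or any vendor report.

```python
from datetime import date

# Hypothetical per-problem results: (problem release date, solved?)
results = [
    (date(2024, 3, 10), True),
    (date(2024, 6, 2), True),
    (date(2024, 9, 15), False),
    (date(2024, 11, 20), True),
    (date(2025, 1, 8), False),
    (date(2025, 2, 14), False),
]

def slice_accuracy(results, start, end):
    """Accuracy over problems released within [start, end]; None if the slice is empty."""
    hits = [solved for released, solved in results if start <= released <= end]
    return sum(hits) / len(hits) if hits else None

def possibly_contaminated(problem_date, training_cutoff):
    """True if the problem was already public on or before the model's training cutoff."""
    return problem_date <= training_cutoff

cutoff = date(2024, 10, 1)  # assumed training cutoff, for illustration only

full_score = slice_accuracy(results, date(2024, 1, 1), date(2025, 12, 31))
post_cutoff_score = slice_accuracy(results, cutoff, date(2025, 12, 31))
flagged = [released for released, _ in results if possibly_contaminated(released, cutoff)]

print(full_score)         # 0.5
print(post_cutoff_score)  # one of three post-cutoff problems solved
print(len(flagged))       # 3
```

The same model and the same answers give 50% on the full range but about 33% on the post-cutoff slice, so a score without its problem date range (and the model's training cutoff) is hard to interpret.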