Living document

The benchmarks tracker

Reasoning benchmark numbers — sourced, dated, and honest about what's vendor-reported vs. independently verifiable. Refreshed monthly. As-of .

benchmarks tracked
rows / data points
🟢 open / verifiable
🔴 closed / vendor-reported
Filter

Show me only…

How to read. pass@1 = single-attempt accuracy (the post-o1 standard). cons@k = consensus / majority-vote over k samples — not the same as pass@k. 🟢 = open weights, methods public. 🔴 = vendor-reported only. 🟠 = mixed / partial. Cells without a number (—) mean the curator could not verify a stable headline number; see the source link.

AIME 2024 · the 18-month sprint

How the field's headline reasoning benchmark moved from 13.4% (GPT-4o, May 2024) to 91.6% (o3, April 2025). The R1 release closed the open–closed gap to ~4 months for AIME-class math.

AIME 2024 score progression across major reasoning models, May 2024 to April 2025

Methodological gotchas

Want a deeper dig?