Benchmarks tracker · Awesome Reasoning Models Theory

How to read. pass@1 = single-attempt accuracy (the post-o1 standard). cons@k = consensus / majority-vote over k samples — not the same as pass@k. 🟢 = open weights, methods public. 🔴 = vendor-reported only. 🟠 = mixed / partial. Cells without a number (—) mean the curator could not verify a stable headline number; see the source link.
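To make the difference between these metrics concrete, here is a minimal sketch in Python. It assumes each problem has k sampled final answers compared by exact string match; the function names and the toy answers are illustrative and are not taken from the tracker or from any vendor's evaluation code. pass@1 is estimated here as the mean correctness over the samples, which is one common convention.

```python
from collections import Counter


def pass_at_1(samples, truth):
    """Estimate single-attempt accuracy as the fraction of correct samples."""
    return sum(s == truth for s in samples) / len(samples)


def cons_at_k(samples, truth):
    """Majority vote: 1.0 if the most common sampled answer is correct.

    Ties are broken by first occurrence (Counter.most_common ordering);
    real harnesses may break ties differently.
    """
    majority, _ = Counter(samples).most_common(1)[0]
    return float(majority == truth)


def pass_at_k(samples, truth):
    """1.0 if at least one of the k samples is correct."""
    return float(any(s == truth for s in samples))


# Hypothetical samples for one problem whose correct answer is "204".
truth = "204"
mostly_right = ["204", "204", "113", "204", "072"]
mostly_wrong = ["113", "113", "204", "072", "113"]

for name, samples in [("mostly_right", mostly_right), ("mostly_wrong", mostly_wrong)]:
    print(name, pass_at_1(samples, truth), cons_at_k(samples, truth), pass_at_k(samples, truth))
```

In the second case the correct answer appears once, so pass@k counts the problem as solved while the majority vote does not. This is why a cons@k figure cannot be read as pass@k, and why neither can be read as pass@1.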

AIME 2024 · the 18-month sprint

Tracks how the field's headline reasoning benchmark moved from 13.4% (GPT-4o, May 2024) to 91.6% (o3, April 2025). The R1 release narrowed the open–closed gap to ~4 months for AIME-class math.

AIME 2024 score progression across major reasoning models, May 2024 to April 2025

Methodological gotchas

AIME pass@1 and cons@64 are not directly comparable. Always check which metric a reported score uses.
LiveCodeBench is designed around date cutoffs; the slice (problem date range) materially affects scores.
SWE-bench Verified scores depend heavily on the agent harness; numbers produced with different harnesses are not comparable.
Vendor cherry-picking — closed-model reports sometimes use bespoke prompting that isn't replicable. Prefer system-card / paper sources over blog screenshots.
Contamination — AIME, MATH, Codeforces problem statements appear widely on the web. Models pretrained after the contest date may have seen them.
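The date-slice and contamination points above can be illustrated with a small sketch. It assumes each problem record carries a release date and a solved flag; the records, the field names, and the cutoff date are all hypothetical and do not describe any real model or the LiveCodeBench data format.

```python
from datetime import date

# Hypothetical per-problem results for one model (not real benchmark data).
problems = [
    {"id": "p1", "released": date(2023, 6, 1), "solved": True},
    {"id": "p2", "released": date(2023, 9, 1), "solved": True},
    {"id": "p3", "released": date(2024, 1, 1), "solved": True},
    {"id": "p4", "released": date(2024, 5, 1), "solved": False},
    {"id": "p5", "released": date(2024, 8, 1), "solved": True},
    {"id": "p6", "released": date(2024, 11, 1), "solved": False},
]


def accuracy_on_slice(problems, start, end):
    """Accuracy over problems released in [start, end]; None if the slice is empty."""
    in_slice = [p for p in problems if start <= p["released"] <= end]
    if not in_slice:
        return None
    return sum(p["solved"] for p in in_slice) / len(in_slice)


# Assumed training cutoff for the hypothetical model.
training_cutoff = date(2024, 3, 1)

full = accuracy_on_slice(problems, date(2023, 1, 1), date(2024, 12, 31))
post_cutoff = accuracy_on_slice(problems, training_cutoff, date(2024, 12, 31))
print(f"full range: {full:.2f}, post-cutoff only: {post_cutoff:.2f}")
```

On this toy data the same model scores differently depending on the slice, which is why two reported numbers are only comparable when they cover the same problem date range. Restricting to problems released after the training cutoff is one way to reduce the contamination risk, at the cost of a smaller sample.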

Want a deeper dig?

SOURCE-OF-TRUTH

Markdown tracker

The full table with every methodological note.

MONTHLY

Digest log

What moved on the table this month, plus context.

SCRIPT

update_benchmarks.py

The automation behind the monthly refresh.