An adversarial audit of AI agent benchmarks

BenchProbe Leaderboard

Which agent benchmarks survive the Berkeley exploit families.

v0.1 · schema 1Last audit: 2026-05-1515 benchmarks · 8 families · 120 verdictsMethodology · How to cite

Benchmarks audited

15

covering Berkeley/RDI's published catalog

Exploit families

8

each citation-traced to a published source

Verdicts

120

85 pass · 21 vulnerable · 14 inconclusive

Vulnerable findings

18%

21 of 120 verdicts

Inconclusive

12%

required artifact absent in audit

Overview heatmap

Rows are benchmarks; columns are exploit families. Each cell is one verdict from a static audit against the benchmark's source at the recorded commit SHA. Hover or click a cell in the table below for the formal definition and remediation.

assertion_rewriteconfig_lookupempty_response_acceptanceenv_trojanizationgold_answer_leakjudge_prompt_injectionresult_pattern_matchwrapper_no_opagentbenchagentbench · assertion_rewrite: passagentbench · config_lookup: passagentbench · empty_response_acceptance: passagentbench · env_trojanization: passagentbench · gold_answer_leak: vulnerableagentbench · judge_prompt_injection: inconclusiveagentbench · result_pattern_match: passagentbench · wrapper_no_op: passagievalagieval · assertion_rewrite: passagieval · config_lookup: vulnerableagieval · empty_response_acceptance: passagieval · env_trojanization: passagieval · gold_answer_leak: vulnerableagieval · judge_prompt_injection: inconclusiveagieval · result_pattern_match: passagieval · wrapper_no_op: passbfclbfcl · assertion_rewrite: passbfcl · config_lookup: vulnerablebfcl · empty_response_acceptance: passbfcl · env_trojanization: passbfcl · gold_answer_leak: passbfcl · judge_prompt_injection: inconclusivebfcl · result_pattern_match: passbfcl · wrapper_no_op: passcar_benchcar_bench · assertion_rewrite: passcar_bench · config_lookup: passcar_bench · empty_response_acceptance: vulnerablecar_bench · env_trojanization: passcar_bench · gold_answer_leak: passcar_bench · judge_prompt_injection: vulnerablecar_bench · result_pattern_match: passcar_bench · wrapper_no_op: passfieldwork_arenafieldwork_arena · assertion_rewrite: passfieldwork_arena · config_lookup: passfieldwork_arena · empty_response_acceptance: vulnerablefieldwork_arena · env_trojanization: passfieldwork_arena · gold_answer_leak: passfieldwork_arena · judge_prompt_injection: inconclusivefieldwork_arena · result_pattern_match: passfieldwork_arena · wrapper_no_op: passfrontier_csfrontier_cs · assertion_rewrite: passfrontier_cs · config_lookup: passfrontier_cs · empty_response_acceptance: passfrontier_cs · env_trojanization: vulnerablefrontier_cs · gold_answer_leak: passfrontier_cs · judge_prompt_injection: inconclusivefrontier_cs · result_pattern_match: passfrontier_cs · wrapper_no_op: passgaiagaia · assertion_rewrite: passgaia · config_lookup: passgaia · empty_response_acceptance: passgaia · env_trojanization: passgaia · gold_answer_leak: passgaia · judge_prompt_injection: inconclusivegaia · result_pattern_match: vulnerablegaia · wrapper_no_op: passhumanevalhumaneval · assertion_rewrite: passhumaneval · config_lookup: vulnerablehumaneval · empty_response_acceptance: passhumaneval · env_trojanization: passhumaneval · gold_answer_leak: passhumaneval · judge_prompt_injection: inconclusivehumaneval · result_pattern_match: passhumaneval · wrapper_no_op: passlivebenchlivebench · assertion_rewrite: passlivebench · config_lookup: passlivebench · empty_response_acceptance: passlivebench · env_trojanization: passlivebench · gold_answer_leak: vulnerablelivebench · judge_prompt_injection: inconclusivelivebench · result_pattern_match: passlivebench · wrapper_no_op: passmmlummlu · assertion_rewrite: passmmlu · config_lookup: vulnerablemmlu · empty_response_acceptance: passmmlu · env_trojanization: passmmlu · gold_answer_leak: vulnerablemmlu · judge_prompt_injection: inconclusivemmlu · result_pattern_match: passmmlu · wrapper_no_op: passosworldosworld · assertion_rewrite: passosworld · config_lookup: vulnerableosworld · empty_response_acceptance: passosworld · env_trojanization: passosworld · gold_answer_leak: passosworld · judge_prompt_injection: inconclusiveosworld · result_pattern_match: passosworld · wrapper_no_op: passswebenchswebench · assertion_rewrite: passswebench · config_lookup: passswebench · empty_response_acceptance: passswebench · env_trojanization: vulnerableswebench · gold_answer_leak: passswebench · judge_prompt_injection: inconclusiveswebench · result_pattern_match: passswebench · wrapper_no_op: passswebench_proswebench_pro · assertion_rewrite: passswebench_pro · config_lookup: passswebench_pro · empty_response_acceptance: passswebench_pro · env_trojanization: vulnerableswebench_pro · gold_answer_leak: passswebench_pro · judge_prompt_injection: inconclusiveswebench_pro · result_pattern_match: passswebench_pro · wrapper_no_op: passterminal_benchterminal_bench · assertion_rewrite: passterminal_bench · config_lookup: passterminal_bench · empty_response_acceptance: vulnerableterminal_bench · env_trojanization: passterminal_bench · gold_answer_leak: inconclusiveterminal_bench · judge_prompt_injection: inconclusiveterminal_bench · result_pattern_match: passterminal_bench · wrapper_no_op: vulnerablewebarenawebarena · assertion_rewrite: passwebarena · config_lookup: passwebarena · empty_response_acceptance: passwebarena · env_trojanization: passwebarena · gold_answer_leak: vulnerablewebarena · judge_prompt_injection: vulnerablewebarena · result_pattern_match: vulnerablewebarena · wrapper_no_op: pass

Per-benchmark verdicts

Click any cell for the family's formal definition, severity, mitigation class, and citation. Click a column header to sort; type in the filter to narrow.

pass vulnerable inconclusive
Audits against pinned commit SHAs; rendered 2026-05-15. A VULNERABLE verdict means a documented exploit pattern is reachable — it is not a statement about any model's behavior.
BenchmarkSHAAuditedOverallassertion_rewriteconfig_lookupempty_response_acceptanceenv_trojanizationgold_answer_leakjudge_prompt_injectionresult_pattern_matchwrapper_no_op
agentbench2a1b0c9d8e7f2026-05-15VULNERABLEpasspasspasspassvulnerableinconclusivepasspass
agieval1b0c9d8e7f6a2026-05-15VULNERABLEpassvulnerablepasspassvulnerableinconclusivepasspass
bfcl3f2a1b0c9d8e2026-05-15VULNERABLEpassvulnerablepasspasspassinconclusivepasspass
car_bench6c5d4e3f2a1b2026-05-15VULNERABLEpasspassvulnerablepasspassvulnerablepasspass
fieldwork_arena7b6c5d4e3f2a2026-05-15VULNERABLEpasspassvulnerablepasspassinconclusivepasspass
frontier_cs9a8b7c6d5e4f2026-05-15VULNERABLEpasspasspassvulnerablepassinconclusivepasspass
gaiac3d4e5f6a7b82026-05-15VULNERABLEpasspasspasspasspassinconclusivevulnerablepass
humaneval5d4e3f2a1b0c2026-05-15VULNERABLEpassvulnerablepasspasspassinconclusivepasspass
livebench0c9d8e7f6a5b2026-05-15VULNERABLEpasspasspasspassvulnerableinconclusivepasspass
mmlu4e3f2a1b0c9d2026-05-15VULNERABLEpassvulnerablepasspassvulnerableinconclusivepasspass
osworlde5f6a7b8c9d02026-05-15VULNERABLEpassvulnerablepasspasspassinconclusivepasspass
swebencha1b2c3d4e5f62026-05-15VULNERABLEpasspasspassvulnerablepassinconclusivepasspass
swebench_prof0e1d2c3b4a52026-05-15VULNERABLEpasspasspassvulnerablepassinconclusivepasspass
terminal_benchd4e5f6a7b8c92026-05-15VULNERABLEpasspassvulnerablepassinconclusiveinconclusivepassvulnerable
webarenab2c3d4e5f6a72026-05-15VULNERABLEpasspasspasspassvulnerablevulnerablevulnerablepass

Methodology

BenchProbe audits benchmark source, not benchmark runs. Each family check is deterministic, reads only filesystem artifacts the adapter exposes, and traces back to a published source. Verdicts are PASS, VULNERABLE, or INCONCLUSIVE — the third is used (and never silently downgraded to PASS) when the artifact required to decide is missing. Full taxonomy: docs/taxonomy.md.

assertion_rewrite

Severity · CRITICAL · Mitigation: test-framework hook isolation + signed outcomes

The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.

Citation
Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.

config_lookup

Severity · CRITICAL · Mitigation: agent-egress isolation

The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.

Citation
Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.

empty_response_acceptance

Severity · CRITICAL · Mitigation: content-aware validation

The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.

Citation
Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.

env_trojanization

Severity · CRITICAL · Mitigation: filesystem-and-process isolation

The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.

Citation
Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.

gold_answer_leak

Severity · CRITICAL · Mitigation: reference-data isolation

Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.

Citation
Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.

judge_prompt_injection

Severity · HIGH · Mitigation: prompt-structure sanitization

An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.

Citation
Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.

result_pattern_match

Severity · HIGH · Mitigation: semantic-content validation

The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.

Citation
Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.

wrapper_no_op

Severity · HIGH · Mitigation: behavioral validation

The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.

Citation
Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.

How to cite

If you use BenchProbe verdicts in academic work, please cite the tool and the Berkeley/RDI taxonomy it audits against.

@software{benchprobe_2026,
  title        = {BenchProbe: Adversarial Audit Toolkit for AI Agent Benchmarks},
  author       = {{BenchProbe contributors}},
  year         = {2026},
  month        = {may},
  url          = {https://github.com/benchprobe/benchprobe},
  note         = {audit covers 15 benchmarks against 8 exploit families}
}
BibTeX