BenchProbe Leaderboard

Benchmarks audited

covering Berkeley/RDI's published catalog

Exploit families

each citation-traced to a published source

Verdicts

120

85 pass · 21 vulnerable · 14 inconclusive

Vulnerable findings

18%

21 of 120 verdicts

Inconclusive

12%

required artifact absent in audit

Overview heatmap

Rows are benchmarks; columns are exploit families. Each cell is one verdict from a static audit against the benchmark's source at the recorded commit SHA. Hover or click a cell in the table below for the formal definition and remediation.

Per-benchmark verdicts

Click any cell for the family's formal definition, severity, mitigation class, and citation. Click a column header to sort; type in the filter to narrow.

✓ pass✗ vulnerable◐ inconclusive

Audits against pinned commit SHAs; rendered 2026-05-15. A VULNERABLE verdict means a documented exploit pattern is reachable — it is not a statement about any model's behavior.
Benchmark	SHA	Audited	Overall	assertion_rewrite	config_lookup	empty_response_acceptance	env_trojanization	gold_answer_leak	judge_prompt_injection	result_pattern_match	wrapper_no_op
agentbench	2a1b0c9d8e7f	2026-05-15	VULNERABLE	pass	pass	pass	pass	vulnerable	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture `gold_answer_leak` · vulnerable Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
agieval	1b0c9d8e7f6a	2026-05-15	VULNERABLE	pass	vulnerable	pass	pass	vulnerable	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture `config_lookup` · vulnerable Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture `gold_answer_leak` · vulnerable Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
bfcl	3f2a1b0c9d8e	2026-05-15	VULNERABLE	pass	vulnerable	pass	pass	pass	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture `config_lookup` · vulnerable Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
car_bench	6c5d4e3f2a1b	2026-05-15	VULNERABLE	pass	pass	vulnerable	pass	pass	vulnerable	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture `empty_response_acceptance` · vulnerable Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture `judge_prompt_injection` · vulnerable Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
fieldwork_arena	7b6c5d4e3f2a	2026-05-15	VULNERABLE	pass	pass	vulnerable	pass	pass	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture `empty_response_acceptance` · vulnerable Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
frontier_cs	9a8b7c6d5e4f	2026-05-15	VULNERABLE	pass	pass	pass	vulnerable	pass	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture `env_trojanization` · vulnerable Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
gaia	c3d4e5f6a7b8	2026-05-15	VULNERABLE	pass	pass	pass	pass	pass	inconclusive	vulnerable	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture `result_pattern_match` · vulnerable Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
humaneval	5d4e3f2a1b0c	2026-05-15	VULNERABLE	pass	vulnerable	pass	pass	pass	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture `config_lookup` · vulnerable Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
livebench	0c9d8e7f6a5b	2026-05-15	VULNERABLE	pass	pass	pass	pass	vulnerable	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture `gold_answer_leak` · vulnerable Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
mmlu	4e3f2a1b0c9d	2026-05-15	VULNERABLE	pass	vulnerable	pass	pass	vulnerable	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture `config_lookup` · vulnerable Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture `gold_answer_leak` · vulnerable Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
osworld	e5f6a7b8c9d0	2026-05-15	VULNERABLE	pass	vulnerable	pass	pass	pass	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture `config_lookup` · vulnerable Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
swebench	a1b2c3d4e5f6	2026-05-15	VULNERABLE	pass	pass	pass	vulnerable	pass	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture `env_trojanization` · vulnerable Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
swebench_pro	f0e1d2c3b4a5	2026-05-15	VULNERABLE	pass	pass	pass	vulnerable	pass	inconclusive	pass	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture `env_trojanization` · vulnerable Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture `gold_answer_leak` · pass Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
terminal_bench	d4e5f6a7b8c9	2026-05-15	VULNERABLE	pass	pass	vulnerable	pass	inconclusive	inconclusive	pass	vulnerable
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture `empty_response_acceptance` · vulnerable Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture `gold_answer_leak` · inconclusive Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture `judge_prompt_injection` · inconclusive Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture `result_pattern_match` · pass Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture `wrapper_no_op` · vulnerable Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
webarena	b2c3d4e5f6a7	2026-05-15	VULNERABLE	pass	pass	pass	pass	vulnerable	vulnerable	vulnerable	pass
`assertion_rewrite` · pass Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes Formal definition The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture `config_lookup` · pass Severity: CRITICALMitigation class: agent-egress isolation Formal definition The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture `empty_response_acceptance` · pass Severity: CRITICALMitigation class: content-aware validation Formal definition The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture `env_trojanization` · pass Severity: CRITICALMitigation class: filesystem-and-process isolation Formal definition The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture `gold_answer_leak` · vulnerable Severity: CRITICALMitigation class: reference-data isolation Formal definition Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture `judge_prompt_injection` · vulnerable Severity: HIGHMitigation class: prompt-structure sanitization Formal definition An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture `result_pattern_match` · vulnerable Severity: HIGHMitigation class: semantic-content validation Formal definition The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture `wrapper_no_op` · pass Severity: HIGHMitigation class: behavioral validation Formal definition The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior. Citation Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check. Evidence https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture

Methodology

BenchProbe audits benchmark source, not benchmark runs. Each family check is deterministic, reads only filesystem artifacts the adapter exposes, and traces back to a published source. Verdicts are PASS, VULNERABLE, or INCONCLUSIVE — the third is used (and never silently downgraded to PASS) when the artifact required to decide is missing. Full taxonomy: docs/taxonomy.md.

assertion_rewrite

Severity · CRITICAL · Mitigation: test-framework hook isolation + signed outcomes

The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.

Citation: Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.

config_lookup

Severity · CRITICAL · Mitigation: agent-egress isolation

The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.

Citation: Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.

empty_response_acceptance

Severity · CRITICAL · Mitigation: content-aware validation

The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.

Citation: Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.

env_trojanization

Severity · CRITICAL · Mitigation: filesystem-and-process isolation

The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.

Citation: Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.

gold_answer_leak

Severity · CRITICAL · Mitigation: reference-data isolation

Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.

Citation: Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.

judge_prompt_injection

Severity · HIGH · Mitigation: prompt-structure sanitization

An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.

Citation: Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.

result_pattern_match

Severity · HIGH · Mitigation: semantic-content validation

The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.

Citation: Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.

wrapper_no_op

Severity · HIGH · Mitigation: behavioral validation

The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.

Citation: Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.

How to cite

If you use BenchProbe verdicts in academic work, please cite the tool and the Berkeley/RDI taxonomy it audits against.

@software{benchprobe_2026,
  title        = {BenchProbe: Adversarial Audit Toolkit for AI Agent Benchmarks},
  author       = {{BenchProbe contributors}},
  year         = {2026},
  month        = {may},
  url          = {https://github.com/benchprobe/benchprobe},
  note         = {audit covers 15 benchmarks against 8 exploit families}
}

BibTeX

Overview heatmap

Per-benchmark verdicts

assertion_rewrite · pass

config_lookup · pass

empty_response_acceptance · pass

env_trojanization · pass

gold_answer_leak · vulnerable

judge_prompt_injection · inconclusive

result_pattern_match · pass

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · vulnerable

empty_response_acceptance · pass

env_trojanization · pass

gold_answer_leak · vulnerable

judge_prompt_injection · inconclusive

result_pattern_match · pass

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · vulnerable

empty_response_acceptance · pass

env_trojanization · pass

gold_answer_leak · pass

judge_prompt_injection · inconclusive

result_pattern_match · pass

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · pass

empty_response_acceptance · vulnerable

env_trojanization · pass

gold_answer_leak · pass

judge_prompt_injection · vulnerable

result_pattern_match · pass

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · pass

empty_response_acceptance · vulnerable

env_trojanization · pass

gold_answer_leak · pass

judge_prompt_injection · inconclusive

result_pattern_match · pass

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · pass

empty_response_acceptance · pass

env_trojanization · vulnerable

gold_answer_leak · pass

judge_prompt_injection · inconclusive

result_pattern_match · pass

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · pass

empty_response_acceptance · pass

env_trojanization · pass

gold_answer_leak · pass

judge_prompt_injection · inconclusive

result_pattern_match · vulnerable

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · vulnerable

empty_response_acceptance · pass

env_trojanization · pass

gold_answer_leak · pass

judge_prompt_injection · inconclusive

result_pattern_match · pass

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · pass

empty_response_acceptance · pass

env_trojanization · pass

gold_answer_leak · vulnerable

judge_prompt_injection · inconclusive

result_pattern_match · pass

wrapper_no_op · pass

assertion_rewrite · pass

config_lookup · vulnerable

empty_response_acceptance · pass

env_trojanization · pass

gold_answer_leak · vulnerable

judge_prompt_injection · inconclusive

`assertion_rewrite` · pass

`config_lookup` · pass

`empty_response_acceptance` · pass

`env_trojanization` · pass

`gold_answer_leak` · vulnerable

`judge_prompt_injection` · inconclusive

`result_pattern_match` · pass

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · vulnerable

`empty_response_acceptance` · pass

`env_trojanization` · pass

`gold_answer_leak` · vulnerable

`judge_prompt_injection` · inconclusive

`result_pattern_match` · pass

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · vulnerable

`empty_response_acceptance` · pass

`env_trojanization` · pass

`gold_answer_leak` · pass

`judge_prompt_injection` · inconclusive

`result_pattern_match` · pass

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · pass

`empty_response_acceptance` · vulnerable

`env_trojanization` · pass

`gold_answer_leak` · pass

`judge_prompt_injection` · vulnerable

`result_pattern_match` · pass

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · pass

`empty_response_acceptance` · vulnerable

`env_trojanization` · pass

`gold_answer_leak` · pass

`judge_prompt_injection` · inconclusive

`result_pattern_match` · pass

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · pass

`empty_response_acceptance` · pass

`env_trojanization` · vulnerable

`gold_answer_leak` · pass

`judge_prompt_injection` · inconclusive

`result_pattern_match` · pass

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · pass

`empty_response_acceptance` · pass

`env_trojanization` · pass

`gold_answer_leak` · pass

`judge_prompt_injection` · inconclusive

`result_pattern_match` · vulnerable

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · vulnerable

`empty_response_acceptance` · pass

`env_trojanization` · pass

`gold_answer_leak` · pass

`judge_prompt_injection` · inconclusive

`result_pattern_match` · pass

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · pass

`empty_response_acceptance` · pass

`env_trojanization` · pass

`gold_answer_leak` · vulnerable

`judge_prompt_injection` · inconclusive

`result_pattern_match` · pass

`wrapper_no_op` · pass

`assertion_rewrite` · pass

`config_lookup` · vulnerable

`empty_response_acceptance` · pass

`env_trojanization` · pass

`gold_answer_leak` · vulnerable

`judge_prompt_injection` · inconclusive

`result_pattern_match` · pass

`wrapper_no_op` · pass