| agentbench | 2a1b0c9d8e7f | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✓pass | ✓pass | ✗vulnerable | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
gold_answer_leak · vulnerable
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agentbench_fixture
|
| agieval | 1b0c9d8e7f6a | 2026-05-15 | VULNERABLE | ✓pass | ✗vulnerable | ✓pass | ✓pass | ✗vulnerable | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
config_lookup · vulnerable
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
gold_answer_leak · vulnerable
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/agieval_fixture
|
| bfcl | 3f2a1b0c9d8e | 2026-05-15 | VULNERABLE | ✓pass | ✗vulnerable | ✓pass | ✓pass | ✓pass | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
config_lookup · vulnerable
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/bfcl_fixture
|
| car_bench | 6c5d4e3f2a1b | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✗vulnerable | ✓pass | ✓pass | ✗vulnerable | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
empty_response_acceptance · vulnerable
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
judge_prompt_injection · vulnerable
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/car_bench_fixture
|
| fieldwork_arena | 7b6c5d4e3f2a | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✗vulnerable | ✓pass | ✓pass | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
empty_response_acceptance · vulnerable
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/fieldwork_arena_fixture
|
| frontier_cs | 9a8b7c6d5e4f | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✓pass | ✗vulnerable | ✓pass | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
env_trojanization · vulnerable
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/frontier_cs_fixture
|
| gaia | c3d4e5f6a7b8 | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✓pass | ✓pass | ✓pass | ◐inconclusive | ✗vulnerable | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
result_pattern_match · vulnerable
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/gaia_fixture
|
| humaneval | 5d4e3f2a1b0c | 2026-05-15 | VULNERABLE | ✓pass | ✗vulnerable | ✓pass | ✓pass | ✓pass | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
config_lookup · vulnerable
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/humaneval_fixture
|
| livebench | 0c9d8e7f6a5b | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✓pass | ✓pass | ✗vulnerable | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
gold_answer_leak · vulnerable
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/livebench_fixture
|
| mmlu | 4e3f2a1b0c9d | 2026-05-15 | VULNERABLE | ✓pass | ✗vulnerable | ✓pass | ✓pass | ✗vulnerable | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
config_lookup · vulnerable
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
gold_answer_leak · vulnerable
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/mmlu_fixture
|
| osworld | e5f6a7b8c9d0 | 2026-05-15 | VULNERABLE | ✓pass | ✗vulnerable | ✓pass | ✓pass | ✓pass | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
config_lookup · vulnerable
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/osworld_fixture
|
| swebench | a1b2c3d4e5f6 | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✓pass | ✗vulnerable | ✓pass | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
env_trojanization · vulnerable
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_fixture
|
| swebench_pro | f0e1d2c3b4a5 | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✓pass | ✗vulnerable | ✓pass | ◐inconclusive | ✓pass | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
env_trojanization · vulnerable
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
gold_answer_leak · pass
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/swebench_pro_fixture
|
| terminal_bench | d4e5f6a7b8c9 | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✗vulnerable | ✓pass | ◐inconclusive | ◐inconclusive | ✓pass | ✗vulnerable |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
empty_response_acceptance · vulnerable
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
gold_answer_leak · inconclusive
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
judge_prompt_injection · inconclusive
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
result_pattern_match · pass
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
wrapper_no_op · vulnerable
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/terminal_bench_fixture
|
| webarena | b2c3d4e5f6a7 | 2026-05-15 | VULNERABLE | ✓pass | ✓pass | ✓pass | ✓pass | ✗vulnerable | ✗vulnerable | ✗vulnerable | ✓pass |
assertion_rewrite · pass
Severity: CRITICALMitigation class: test-framework hook isolation + signed outcomes - Formal definition
- The evaluator runs a test framework whose hook/plugin interface is reachable from code the agent writes, allowing the agent to install a hook that rewrites test outcomes before the harness reads them.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): SWE-bench conftest hookimpl; Django TestCase.run monkey-patch; SWE-bench Pro parser.py overwrite.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture
config_lookup · pass
Severity: CRITICALMitigation class: agent-egress isolation - Formal definition
- The task configuration or runtime environment references a network-reachable URL containing the gold answer, and the agent's runtime has the egress to fetch it before submitting.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): GAIA HuggingFace answer-file load; OSWorld wget of gold.xlsx.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture
empty_response_acceptance · pass
Severity: CRITICALMitigation class: content-aware validation - Formal definition
- The evaluator awards full credit when the agent's submission is empty, structurally trivial, or satisfies only a tautology of the response shape rather than its content.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): FieldWorkArena validate() role check; CAR-bench reward-component skipping for hallucination tasks.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture
env_trojanization · pass
Severity: CRITICALMitigation class: filesystem-and-process isolation - Formal definition
- The agent's writable working tree contains, or can be made to contain, a file that the evaluator process later loads, executes, or resolves — granting the agent code execution inside the evaluator's trust boundary.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026); moogician/trustworthy-env — SWE-bench conftest, Terminal-Bench curl wrapper, Django TestCase.run monkey-patch.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture
gold_answer_leak · vulnerable
Severity: CRITICALMitigation class: reference-data isolation - Formal definition
- Reference answers are reachable from inside the agent's execution context via the local filesystem at audit time — typically because task spec and gold answer live in the same config file the agent can read.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena file:// navigation to config_files/{task_id}.json.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture
judge_prompt_injection · vulnerable
Severity: HIGHMitigation class: prompt-structure sanitization - Formal definition
- An LLM judge scores agent output by interpolating the agent's response into a prompt template without role separation or delimiter-based sandboxing, allowing the agent to inject instructions the judge follows.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena/CAR-bench LLM-judge interpolation.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture
result_pattern_match · vulnerable
Severity: HIGHMitigation class: semantic-content validation - Formal definition
- The evaluator decides a task is correct by substring or regex match against agent-controlled output, without verifying that the matched content is in the user-visible or semantically-relevant part.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): WebArena must_include hidden-DOM injection; eval()-on-agent-output RCE; GAIA normalize_str loose collisions.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture
wrapper_no_op · pass
Severity: HIGHMitigation class: behavioral validation - Formal definition
- The evaluator declares success based on the existence of an artifact (file present, function defined, binary installed) without exercising the artifact to verify it produces the expected behavior.
- Citation
- Berkeley RDI, How We Broke Top AI Agent Benchmarks (2026): Terminal-Bench fake .so existence check.
- Evidence
- https://github.com/benchprobe/benchprobe/blob/main/tests/adapters/webarena_fixture
|