agent-backtest-lab — audit your LLM trading agent before you trust it

This is a research and evaluation tool. It is not financial, investment, or trading advice. It does not execute trades or connect to brokerages. It exists to help researchers and practitioners rigorously measure how trading-agent frameworks actually perform — including, and especially, when they perform badly. Backtest results are not predictive of live performance.

The problem this tool exists to make impossible to ignore

A trading agent shows +40% in a backtest. Was the agent good — or did it quietly see tomorrow's prices through retroactively-adjusted data, get the best of 50 untracked prompt-tuning attempts, or earn returns no real trader could net of costs? agent-backtest-lab is the library that answers that question. It is not another LLM trading agent. It is the audit layer the entire category lacks.

Example reports

buy_and_hold

The simplest baseline: open a LONG on day one and never trade again. Returns are reported net of costs and rendered with the same scorecard pipeline a real agent goes through.
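As a rough illustration of what "net of cost" means for this baseline, here is a minimal sketch. The function name `buy_and_hold_net_return` and the `cost_bps` parameter are hypothetical and not part of the library; the sketch assumes a single proportional cost charged once on the entry trade, which is much simpler than the cost and impact model in `abl.costs`.

```python
def buy_and_hold_net_return(prices, cost_bps=5.0):
    """Total return of one LONG entry at prices[0], held until prices[-1].

    Assumption: a proportional cost of `cost_bps` basis points is paid once,
    on the entry trade. The position is never closed, so no exit cost applies.
    """
    if len(prices) < 2:
        raise ValueError("need at least two prices")
    growth = prices[-1] / prices[0]   # gross growth factor over the window
    cost = cost_bps / 10_000.0        # basis points -> fraction of notional
    return growth * (1.0 - cost) - 1.0


# A 10% gross gain with a 10 bps entry cost nets slightly less than 10%.
net = buy_and_hold_net_return([100.0, 110.0], cost_bps=10.0)
```

The entry cost scales the whole position, so it reduces the final growth factor rather than being subtracted from the return.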

Available as: HTML scorecard, Markdown, JSON.

naive_momentum

A 5-day momentum signal: LONG if the trailing 5-day return is positive, FLAT otherwise. Useful as an "is your agent doing anything beyond trend-following?" sanity check.
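The rule above can be sketched in a few lines. The names `momentum_signal` and `strategy_returns` are hypothetical, and the timing convention is an assumption: the signal is decided at the close of day t and earns the return from day t to day t+1. The bundled baseline may use a different convention.

```python
def momentum_signal(prices, lookback=5):
    """Return 1 (LONG) or 0 (FLAT) for each day, using only past prices."""
    signal = []
    for t in range(len(prices)):
        if t < lookback:
            signal.append(0)  # not enough history yet: stay FLAT
        else:
            trailing = prices[t] / prices[t - lookback] - 1.0
            signal.append(1 if trailing > 0 else 0)
    return signal


def strategy_returns(prices, signal):
    """Gross daily returns: signal[t] is applied to the move from t to t+1."""
    return [
        signal[t] * (prices[t + 1] / prices[t] - 1.0)
        for t in range(len(prices) - 1)
    ]


prices = [100, 101, 102, 103, 104, 105, 104, 103, 102, 101, 100, 99]
signal = momentum_signal(prices)
# signal == [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
returns = strategy_returns(prices, signal)
```

In this toy series the signal turns LONG just as the trend reverses, so every active day loses money. That lag is the behavior a trend-following sanity check is meant to make visible.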

Available as: HTML scorecard, Markdown, JSON.

These reports are rendered against a deterministic synthetic fixture (geometric Brownian motion with known drift/vol parameters). The framing carries over to real-data runs: net of cost, with CIs, multiple-testing corrected, with leakage / overfitting / reward-hacking flags surfaced loudly when present.
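For readers unfamiliar with this kind of fixture, the following is a minimal sketch of a seeded geometric Brownian motion price path. The function name `gbm_prices` and the default drift, volatility, and seed are illustrative assumptions; they are not the parameters of the bundled fixture.

```python
import math
import random


def gbm_prices(n_days, s0=100.0, mu=0.08, sigma=0.20, seed=0, dt=1.0 / 252):
    """Simulate a daily price path under geometric Brownian motion.

    mu and sigma are annualized drift and volatility (assumed values).
    A fixed seed makes the path deterministic and reproducible.
    """
    rng = random.Random(seed)
    prices = [s0]
    for _ in range(n_days):
        z = rng.gauss(0.0, 1.0)
        # Exact log-return step for GBM over an interval of length dt.
        step = (mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z
        prices.append(prices[-1] * math.exp(step))
    return prices


path_a = gbm_prices(252, seed=42)
path_b = gbm_prices(252, seed=42)  # same seed, same path
```

Because the drift and volatility are known by construction, any "edge" a strategy appears to find on such a path beyond what those parameters imply can be attributed to noise or to a flaw in the evaluation.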

What it does

Capability	Module	Source
Walk-forward + purged-CV with embargo	`abl.backtest.cv`	López de Prado 2018, Ch. 7
Combinatorial Purged K-Fold	`abl.backtest.cv`	López de Prado 2018, Ch. 12
Hard leakage firewall (refuses date > as_of)	`abl.data.firewall`	—
yfinance raw-only loader (refuses `Adj Close`)	`abl.data.yfinance_loader`	—
Reward-hacking detection	`abl.leakage.reward_hacking`	IS/OOS Sharpe drop, drawdown widening, calibration divergence
Transaction-cost + Almgren-style impact	`abl.costs`	Almgren-Chriss 2000, Almgren et al. 2005
BH FDR + BH-Yekutieli	`abl.multipletest.bh`	Benjamini-Hochberg 1995, Benjamini-Yekutieli 2001
Bonferroni-Holm step-down FWER	`abl.multipletest.stepwise`	Holm 1979
Romano-Wolf stepwise	`abl.multipletest.stepwise`	Romano & Wolf 2005
Probabilistic Sharpe Ratio (PSR)	`abl.multipletest.psr`	Bailey & López de Prado 2012/13
Deflated Sharpe Ratio (DSR)	`abl.multipletest.dsr`	Bailey & López de Prado 2014
HAC / Newey-West Sharpe SE	`abl.multipletest.hac`	Lo 2002, Newey-West 1987/1994
BCa bootstrap CIs	`abl.multipletest.bootstrap`	Efron 1987
White's Reality Check / SPA	`abl.multipletest.spa`	White 2000, Politis-Romano 1994
Sortino + Information Ratio	`abl.multipletest.risk_metrics`	Sortino-Price 1994
PBO via CSCV	`abl.overfitting.cscv`	Bailey, Borwein, López de Prado, Zhu 2017
Reliability diagrams + ECE	`abl.calibration.reliability`	Guo et al. ICML 2017
Split conformal w/ rolling window	`abl.calibration.conformal`	Vovk-Gammerman-Shafer 2005; Angelopoulos-Bates 2021
Six baselines (always rendered)	`abl.baselines`	—
Five adapters: TradingAgents, FinGPT, FinRobot, callable, plain	`abl.adapters`	—
Markdown + JSON + HTML scorecards	`abl.scorecard`	HTML embeds plots as base64
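To make the "refuses date > as_of" row concrete, here is a minimal sketch of an as-of firewall. The class `AsOfFirewall`, the exception `LeakageError`, and the dictionary storage layout are hypothetical and do not describe the actual interface of `abl.data.firewall`.

```python
from datetime import date


class LeakageError(ValueError):
    """Raised when a query asks for data dated after the as-of date."""


class AsOfFirewall:
    """Wrap a price series so that nothing dated after `as_of` can be read."""

    def __init__(self, rows):
        # rows: mapping of date -> close price (assumed layout for this sketch)
        self._rows = dict(rows)

    def get(self, day, as_of):
        if day > as_of:
            # Fail loudly instead of silently returning future data.
            raise LeakageError(f"refused: {day} is after as_of {as_of}")
        return self._rows[day]

    def history(self, as_of):
        """All rows visible on `as_of`, in date order."""
        return {d: p for d, p in sorted(self._rows.items()) if d <= as_of}


fw = AsOfFirewall({date(2024, 1, 2): 100.0, date(2024, 1, 3): 101.0})
visible = fw.history(as_of=date(2024, 1, 2))  # only the first row
```

Raising an error rather than returning an empty result is a deliberate choice in this sketch: an agent that asks for tomorrow's price should stop the run, not quietly continue.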

Quickstart

    pip install agent-backtest-lab

    # Evaluate any of the bundled baselines on the synthetic fixture:
    abl evaluate buy_and_hold --window 2020-01-06..2024-12-31 --out out/
    abl evaluate naive_momentum --window 2020-01-06..2024-12-31 --out out/
    abl evaluate naive_mean_reversion --window 2020-01-06..2024-12-31 --out out/

    # Open out/report.html in any browser; identical to the demos linked above.

What it deliberately is not

Companion to TradingAgents

agent-backtest-lab is designed as a companion to TauricResearch/TradingAgents (and to FinGPT, FinRobot, and any callable strategy). Their job is to generate decisions; ours is to audit them. The reference adapter wraps `TradingAgentsGraph().propagate(ticker, date)` behind the same evaluation pipeline every baseline goes through. `tradingagents` remains an optional dependency.
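The adapter idea can be sketched without importing either library. Everything below is hypothetical: `wrap_propagate`, `run_pipeline`, and the stand-in `fake_propagate` are illustrative names, and the sketch assumes the wrapped callable returns either a decision or a tuple whose last element is the decision. It is not the actual `abl.adapters` code.

```python
def wrap_propagate(propagate):
    """Adapt a propagate(ticker, date)-style callable to a uniform decide()."""

    def decide(ticker, date):
        result = propagate(ticker, date)
        # Assumption: tuple results carry the decision as their last element.
        decision = result[-1] if isinstance(result, tuple) else result
        return str(decision).strip().upper()

    return decide


def run_pipeline(decide, ticker, dates):
    """Collect one decision per date; baselines and agents share this path."""
    return [(d, decide(ticker, d)) for d in dates]


def fake_propagate(ticker, date):
    """Stand-in for a real agent call; returns (state, decision)."""
    return {"ticker": ticker, "date": date}, " buy "


def buy_and_hold(ticker, date):
    """Baseline with the same call shape as an agent."""
    return "LONG"


dates = ["2024-05-09", "2024-05-10"]
agent_decisions = run_pipeline(wrap_propagate(fake_propagate), "NVDA", dates)
baseline_decisions = run_pipeline(wrap_propagate(buy_and_hold), "NVDA", dates)
```

Routing the baseline through the same wrapper as the agent is the point of the design: any difference in the resulting scorecards then comes from the decisions, not from the plumbing.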

agent-backtest-lab is a research tool. Not financial advice. Not a trading system. Backtests don't predict the future.

Built by Betty Guo (Dongxin Guo / 郭东欣), PhD candidate, University of Hong Kong, advised by Prof. Siu-Ming Yiu. ORCID: 0009-0000-2388-1072. Apache-2.0.