Before you trust an LLM trading agent, audit it.
A statistical-audit harness for trading-agent frameworks — look-ahead-leak detection, transaction-cost modeling, multiple-testing correction, calibration, and reward-hacking detection.
A trading agent shows +40% in a backtest. Was the agent good, or did it quietly see tomorrow's prices
through retroactively adjusted data, pick the best of 50 untracked prompt-tuning attempts, or earn
returns no real trader could capture net of costs? agent-backtest-lab is the library that answers
that question. It is not another LLM trading agent. It is the audit layer the entire category lacks.
Two pre-rendered scorecards from the bundled synthetic fixture, generated by:

    $ abl evaluate buy_and_hold --window 2020-01-06..2024-12-31 --out out/
    $ abl evaluate naive_momentum --window 2020-01-06..2024-12-31 --out out/

The simplest baseline: open a LONG on day one and never trade again. Reported net of costs and rendered with the same scorecard pipeline a real agent goes through.
A 5-day momentum signal: LONG if the trailing 5-day return is positive, FLAT otherwise. Useful as an "is your agent doing anything beyond trend-following?" sanity check.
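
To make the momentum rule concrete, here is a minimal sketch of a 5-day signal in plain Python. The function name `momentum_positions` and the choice to stay FLAT until five days of history exist are assumptions for illustration; this is not the code behind the bundled `naive_momentum` baseline.

```python
from typing import List


def momentum_positions(closes: List[float], lookback: int = 5) -> List[int]:
    """Return 1 (LONG) or 0 (FLAT) per day, using only prices up to that day.

    Hypothetical helper for illustration; not part of agent-backtest-lab.
    """
    positions: List[int] = []
    for t in range(len(closes)):
        if t < lookback:
            # Assumption: not enough history yet, so stay FLAT.
            positions.append(0)
            continue
        # Trailing return over the last `lookback` days, known at the close of day t.
        trailing_return = closes[t] / closes[t - lookback] - 1.0
        positions.append(1 if trailing_return > 0 else 0)
    return positions


closes = [100, 101, 102, 103, 104, 105, 104, 103, 102, 101, 100]
print(momentum_positions(closes))
```

A signal computed at the close of day t should only earn the return from day t to day t+1. Applying it to day t's own return would be a small look-ahead leak of exactly the kind the audit is meant to catch.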
These reports are rendered against a deterministic synthetic fixture (geometric Brownian motion with known drift/vol parameters). The framing carries over to real-data runs: net of cost, with CIs, multiple-testing corrected, with leakage / overfitting / reward-hacking flags surfaced loudly when present.
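
For readers unfamiliar with this kind of fixture, the sketch below generates a deterministic geometric Brownian motion price path using only the standard library. The function name `gbm_path`, its parameters, and the 252-day year are assumptions for illustration; the bundled fixture's actual parameters and generator are not shown here.

```python
import math
import random
from typing import List


def gbm_path(s0: float, mu: float, sigma: float, n_days: int,
             seed: int = 0, dt: float = 1.0 / 252) -> List[float]:
    """Simulate a daily GBM price path with annualized drift `mu` and vol `sigma`.

    Hypothetical helper; a fixed seed makes the path reproducible.
    """
    rng = random.Random(seed)  # private generator, so global state is untouched
    prices = [s0]
    for _ in range(n_days):
        z = rng.gauss(0.0, 1.0)
        # Exact log-return of GBM over one step of length dt.
        log_ret = (mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z
        prices.append(prices[-1] * math.exp(log_ret))
    return prices


path = gbm_path(s0=100.0, mu=0.08, sigma=0.20, n_days=252, seed=42)
print(len(path), path[0])
```

Because drift and volatility are known by construction, an audit run on such a path has a ground truth to be checked against: a strategy cannot have real predictive edge on it.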
| Capability | Module | Reference / notes |
|---|---|---|
| Walk-forward + purged-CV with embargo | abl.backtest.cv | López de Prado 2018, Ch. 7 |
| Combinatorial Purged K-Fold | abl.backtest.cv | López de Prado 2018, Ch. 12 |
| Hard leakage firewall (refuses date > as_of) | abl.data.firewall | — |
| yfinance raw-only loader (refuses Adj Close) | abl.data.yfinance_loader | — |
| Reward-hacking detection | abl.leakage.reward_hacking | IS/OOS Sharpe drop, drawdown widening, calibration divergence |
| Transaction-cost + Almgren-style impact | abl.costs | Almgren-Chriss 2000, Almgren et al. 2005 |
| BH FDR + BH-Yekutieli | abl.multipletest.bh | Benjamini-Hochberg 1995, Benjamini-Yekutieli 2001 |
| Bonferroni-Holm step-down FWER | abl.multipletest.stepwise | Holm 1979 |
| Romano-Wolf stepwise | abl.multipletest.stepwise | Romano & Wolf 2005 |
| Probabilistic Sharpe Ratio (PSR) | abl.multipletest.psr | Bailey & López de Prado 2012/13 |
| Deflated Sharpe Ratio (DSR) | abl.multipletest.dsr | Bailey & López de Prado 2014 |
| HAC / Newey-West Sharpe SE | abl.multipletest.hac | Lo 2002, Newey-West 1987/1994 |
| BCa bootstrap CIs | abl.multipletest.bootstrap | Efron 1987 |
| White's Reality Check / SPA | abl.multipletest.spa | White 2000, Politis-Romano 1994 |
| Sortino + Information Ratio | abl.multipletest.risk_metrics | Sortino-Price 1994 |
| PBO via CSCV | abl.overfitting.cscv | Bailey, Borwein, López de Prado, Zhu 2017 |
| Reliability diagrams + ECE | abl.calibration.reliability | Guo et al. ICML 2017 |
| Split conformal w/ rolling window | abl.calibration.conformal | Vovk-Gammerman-Shafer 2005; Angelopoulos-Bates 2021 |
| Six baselines (always rendered) | abl.baselines | — |
| Five adapters: TradingAgents, FinGPT, FinRobot, callable, plain | abl.adapters | — |
| Markdown + JSON + HTML scorecards | abl.scorecard | HTML embeds plots as base64 |
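
As one illustration of the multiple-testing rows above, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. The function name `benjamini_hochberg` and its signature are assumptions for illustration and may differ from what `abl.multipletest.bh` exposes.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis, controlling FDR at `alpha`.

    Hypothetical helper; assumes independent or positively dependent tests.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, alpha=0.05))
```

In this setting each p-value would correspond to one strategy variant or prompt-tuning attempt, so the correction only works if every attempt is recorded, including the discarded ones.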

    pip install agent-backtest-lab

    # Evaluate any of the bundled baselines on the synthetic fixture:
    abl evaluate buy_and_hold --window 2020-01-06..2024-12-31 --out out/
    abl evaluate naive_momentum --window 2020-01-06..2024-12-31 --out out/
    abl evaluate naive_mean_reversion --window 2020-01-06..2024-12-31 --out out/

    # Open out/report.html in any browser; identical to the demos linked above.

Designed as a companion to TauricResearch/TradingAgents
(and to FinGPT, FinRobot, and any callable strategy). Their job is to generate decisions; ours
is to audit them. The reference adapter wraps
TradingAgentsGraph().propagate(ticker, date) behind the same evaluation pipeline every
baseline goes through. tradingagents remains an optional dependency.
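
The wrapping idea can be sketched as follows. Everything here except the `propagate(ticker, date)` call shape is hypothetical: `GraphAdapter`, `decide`, the BUY-to-LONG mapping, and the stand-in `FakeGraph` are illustrations, not the classes in `abl.adapters`. The sketch also assumes `propagate` returns a `(state, decision)` pair with a text decision such as BUY, SELL, or HOLD, which should be verified against the installed tradingagents version.

```python
from typing import Any, Protocol, Tuple


class SupportsPropagate(Protocol):
    """Anything with a propagate(ticker, date) method, e.g. a TradingAgentsGraph."""

    def propagate(self, ticker: str, date: str) -> Tuple[Any, str]: ...


class GraphAdapter:
    """Hypothetical adapter: turns a graph decision into a LONG/FLAT position."""

    def __init__(self, graph: SupportsPropagate) -> None:
        self._graph = graph

    def decide(self, ticker: str, as_of: str) -> str:
        # Assumption: propagate returns (state, decision); only the decision is used.
        _state, decision = self._graph.propagate(ticker, as_of)
        # Assumed mapping: BUY means LONG, anything else means FLAT.
        return "LONG" if str(decision).strip().upper() == "BUY" else "FLAT"


class FakeGraph:
    """Stand-in so the sketch runs without tradingagents installed."""

    def propagate(self, ticker: str, date: str) -> Tuple[Any, str]:
        return {}, ("BUY" if ticker == "AAA" else "SELL")


adapter = GraphAdapter(FakeGraph())
print(adapter.decide("AAA", "2024-01-02"), adapter.decide("BBB", "2024-01-02"))
```

Typing the wrapped object structurally rather than importing the real class is one way to keep tradingagents optional: the adapter module can load even when the dependency is absent.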
agent-backtest-lab is a research tool. Not financial advice. Not a trading system. Backtests don't predict the future.
Built by Betty Guo (Dongxin Guo / 郭东欣), PhD candidate, University of Hong Kong, advised by Prof. Siu-Ming Yiu. ORCID: 0009-0000-2388-1072. Apache-2.0.