agent-backtest-lab v0.6.0 · 150 fixture tests passing

Before you trust an LLM trading agent, audit it.

A statistical-audit harness for trading-agent frameworks — look-ahead-leak detection, transaction-cost modeling, multiple-testing correction, calibration, and reward-hacking detection.

This is a research and evaluation tool. It is not financial, investment, or trading advice. It does not execute trades or connect to brokerages. It exists to help researchers and practitioners rigorously measure how trading-agent frameworks actually perform — including, and especially, when they perform badly. Backtest results are not predictive of live performance.

The problem this tool exists to make impossible to ignore

A trading agent shows +40% in a backtest. Was the agent good — or did it quietly see tomorrow's prices through retroactively-adjusted data, get the best of 50 untracked prompt-tuning attempts, or earn returns no real trader could net of costs? agent-backtest-lab is the library that answers that question. It is not another LLM trading agent. It is the audit layer the entire category lacks.

Example reports

Two pre-rendered scorecards from the bundled synthetic fixture, generated by:

$ abl evaluate buy_and_hold --window 2020-01-06..2024-12-31 --out out/
$ abl evaluate naive_momentum --window 2020-01-06..2024-12-31 --out out/

buy_and_hold

The simplest baseline: open a LONG on day one and never trade again. Net of cost; rendered with the same scorecard pipeline a real agent goes through.

naive_momentum

A 5-day momentum signal: LONG if the trailing 5-day return is positive, FLAT otherwise. Useful as an "is your agent doing anything beyond trend-following?" sanity check.
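The rule above is small enough to sketch directly. This is an illustration only, not the bundled baseline's code: the function names, the NumPy representation of prices, and the choice to apply each signal to the following day's return are all assumptions made for the example.

```python
import numpy as np


def naive_momentum_signal(close, lookback=5):
    """Return 1 (LONG) or 0 (FLAT) per day from the trailing `lookback`-day return.

    Hypothetical helper: the first `lookback` days stay FLAT because no
    trailing return exists yet.
    """
    close = np.asarray(close, dtype=float)
    signal = np.zeros(close.shape[0], dtype=int)
    for t in range(lookback, close.shape[0]):
        trailing_return = close[t] / close[t - lookback] - 1.0
        signal[t] = 1 if trailing_return > 0 else 0
    return signal


def strategy_returns(close, signal):
    """Gross daily returns when the signal set at close t earns the t -> t+1 return.

    Using signal[:-1] (not signal[1:]) keeps the position decision strictly
    before the return it is applied to, so no look-ahead is introduced.
    """
    close = np.asarray(close, dtype=float)
    daily = np.diff(close) / close[:-1]
    return signal[:-1] * daily


prices = [100, 101, 102, 103, 104, 105, 104, 103, 102, 101, 100, 99]
sig = naive_momentum_signal(prices)
gross = strategy_returns(prices, sig)
```

On this toy series the signal turns LONG on the sixth day and drops back to FLAT once the 5-day return goes negative. Costs are left out here; the reports themselves are net of cost.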

These reports are rendered against a deterministic synthetic fixture (geometric Brownian motion with known drift/vol parameters). The framing carries over to real-data runs: net of cost, with CIs, multiple-testing corrected, with leakage / overfitting / reward-hacking flags surfaced loudly when present.
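A deterministic geometric-Brownian-motion price path of the kind described above can be sketched as follows. The drift, volatility, start price, seed, and the function name are placeholder assumptions; the fixture's actual parameters are not stated here.

```python
import numpy as np


def gbm_path(n_days, s0=100.0, mu=0.08, sigma=0.20, seed=0, dt=1 / 252):
    """Simulate n_days of daily GBM prices; returns n_days + 1 values starting at s0.

    All defaults are illustrative assumptions, not the bundled fixture's values.
    A fixed seed makes the path reproducible across runs.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_days)
    # Exact discretisation of GBM: log returns are i.i.d. normal.
    log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.concatenate([[0.0], np.cumsum(log_returns)]))


path_a = gbm_path(252, seed=7)
path_b = gbm_path(252, seed=7)  # same seed, same path
```

Because drift and volatility are known by construction, any edge a strategy appears to find beyond them on such a path is a sign of a flaw in the evaluation rather than of skill.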

What it does

| Capability | Module | Source / notes |
| --- | --- | --- |
| Walk-forward + purged-CV with embargo | abl.backtest.cv | López de Prado 2018, Ch. 7 |
| Combinatorial Purged K-Fold | abl.backtest.cv | López de Prado 2018, Ch. 12 |
| Hard leakage firewall (refuses date > as_of) | abl.data.firewall | |
| yfinance raw-only loader (refuses Adj Close) | abl.data.yfinance_loader | |
| Reward-hacking detection | abl.leakage.reward_hacking | IS/OOS Sharpe drop, drawdown widening, calibration divergence |
| Transaction-cost + Almgren-style impact | abl.costs | Almgren-Chriss 2000, Almgren et al. 2005 |
| BH FDR + BH-Yekutieli | abl.multipletest.bh | Benjamini-Hochberg 1995, Benjamini-Yekutieli 2001 |
| Bonferroni-Holm step-down FWER | abl.multipletest.stepwise | Holm 1979 |
| Romano-Wolf stepwise | abl.multipletest.stepwise | Romano & Wolf 2005 |
| Probabilistic Sharpe Ratio (PSR) | abl.multipletest.psr | Bailey & López de Prado 2012/13 |
| Deflated Sharpe Ratio (DSR) | abl.multipletest.dsr | Bailey & López de Prado 2014 |
| HAC / Newey-West Sharpe SE | abl.multipletest.hac | Lo 2002, Newey-West 1987/1994 |
| BCa bootstrap CIs | abl.multipletest.bootstrap | Efron 1987 |
| White's Reality Check / SPA | abl.multipletest.spa | White 2000, Politis-Romano 1994 |
| Sortino + Information Ratio | abl.multipletest.risk_metrics | Sortino-Price 1994 |
| PBO via CSCV | abl.overfitting.cscv | Bailey, Borwein, López de Prado, Zhu 2017 |
| Reliability diagrams + ECE | abl.calibration.reliability | Guo et al. ICML 2017 |
| Split conformal w/ rolling window | abl.calibration.conformal | Vovk-Gammerman-Shafer 2005; Angelopoulos-Bates 2021 |
| Six baselines (always rendered) | abl.baselines | |
| Five adapters: TradingAgents, FinGPT, FinRobot, callable, plain | abl.adapters | |
| Markdown + JSON + HTML scorecards | abl.scorecard | HTML embeds plots as base64 |
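To make the multiple-testing rows concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure applied to a set of per-strategy p-values. It is a textbook implementation written for illustration; the function name and signature are assumptions and do not describe the abl.multipletest.bh API.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha.

    Step-up rule: sort p-values, find the largest rank k with
    p_(k) <= alpha * k / m, and reject every hypothesis ranked 1..k.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank passing its threshold
        reject[order[: k + 1]] = True
    return reject


# Ten hypothetical strategy variants, each with its own p-value.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_vals, alpha=0.05)
```

Five of these ten p-values fall below 0.05 when taken one at a time, but only the two smallest survive the correction. That gap is what untracked tuning attempts hide.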

Quickstart

pip install agent-backtest-lab

# Evaluate any of the bundled baselines on the synthetic fixture:
abl evaluate buy_and_hold     --window 2020-01-06..2024-12-31 --out out/
abl evaluate naive_momentum   --window 2020-01-06..2024-12-31 --out out/
abl evaluate naive_mean_reversion --window 2020-01-06..2024-12-31 --out out/

# Open out/report.html in any browser; identical to the demos linked above.

What it deliberately is not

Companion to TradingAgents

Designed as a companion to TauricResearch/TradingAgents (and to FinGPT, FinRobot, and any callable strategy). Their job is to generate decisions; ours is to audit them. The reference adapter wraps TradingAgentsGraph().propagate(ticker, date) behind the same evaluation pipeline every baseline goes through. tradingagents remains an optional dependency.
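The adapter idea can be pictured as a thin wrapper that gives every strategy the same one-decision-per-date interface. Everything below is a hypothetical sketch: the class names, the decide method, the BUY/SELL/HOLD label set, and the assumption that propagate may return either a decision or a (state, decision) pair are illustrative, and a stub stands in for TradingAgentsGraph so the example runs without the optional dependency.

```python
from dataclasses import dataclass
from typing import Any

VALID_LABELS = {"BUY", "SELL", "HOLD"}  # assumed label set for this sketch


@dataclass
class PropagateAdapter:
    """Wrap any object exposing propagate(ticker, date) (hypothetical adapter)."""

    graph: Any

    def decide(self, ticker: str, date: str) -> str:
        result = self.graph.propagate(ticker, date)
        # Assumption: propagate returns a decision or a (state, decision) tuple.
        decision = result[1] if isinstance(result, tuple) else result
        label = str(decision).strip().upper()
        if label not in VALID_LABELS:
            raise ValueError(f"unrecognised decision {decision!r} on {date}")
        return label


class BuyAndHold:
    """A baseline exposing the same decide() interface as the adapter."""

    def decide(self, ticker: str, date: str) -> str:
        return "BUY"


class StubGraph:
    """Stand-in for an agent graph so the sketch runs without extra packages."""

    def propagate(self, ticker, date):
        return {"ticker": ticker, "date": date}, "buy"


def run(strategy, ticker, dates):
    """Collect one decision per date; agents and baselines share this path."""
    return [strategy.decide(ticker, d) for d in dates]


dates = ["2024-01-02", "2024-01-03"]
agent_decisions = run(PropagateAdapter(StubGraph()), "XYZ", dates)
baseline_decisions = run(BuyAndHold(), "XYZ", dates)
```

Routing the agent and the baselines through one function is what makes their scorecards comparable: any cost model or leakage check applied inside that path applies to both.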

agent-backtest-lab is a research tool. Not financial advice. Not a trading system. Backtests don't predict the future.

Built by Betty Guo (Dongxin Guo / 郭东欣), PhD candidate, University of Hong Kong, advised by Prof. Siu-Ming Yiu. ORCID: 0009-0000-2388-1072. Apache-2.0.