“Honest Backtesting” series, article 1 of 3
A note on this series
This is the first article in a didactic series on quantitative algorithmic trading I’m publishing here on the blog.
I’m a Quant PM with well over a decade building systematic strategies across multiple asset classes. In that time I’ve read hundreds of articles on this material and noticed two recurring failure modes: either trivial (recycled textbook content with no contact with practice) or opaque (formalism for its own sake, academic papers no working practitioner has time to decode). Very few sit in the honest middle, where a researcher or a desk PM actually learns something they can use.
This series tries to live in that middle. I want to share what I’ve learned, and especially what I’ve gotten wrong — across four areas I see consistently underestimated or handled poorly:
- Honest backtesting (the series starts here)
- Risk management — position sizing, leverage, drawdown control
- Portfolio construction — combining signals, allocation, rebalancing
- Microstructure — cost models, slippage, market impact
Each piece stands on its own but builds on what came before. I’m opening deliberately with something tough: simulated incubation. It’s a methodological error I see proliferating, with serious consequences for strategies that hit production. If this critique resonates, follow along — later articles drop a notch in technical density but stay rigorous.
This first topic spans three parts. Part 1 (this one) lays out the problem and the taxonomy of biases. Part 2 quantifies the bias with a Monte Carlo simulation. Part 3 covers what to do instead.
Simulated incubation — declaring the last 12 months of data an out-of-sample window “because we could have designed the strategy a year ago” — is one of the most insidious methodological errors in contemporary quantitative backtesting. Not because researchers are dishonest, but because the procedure inverts the very property that makes incubation informative: the temporal asymmetry of knowledge. In Part 2 I’ll quantify the bias with a Monte Carlo simulation. Here in Part 1 I lay out why the bias is severe even assuming no explicit data leakage, and decompose it into four distinct channels.
1. Why I’m writing this
Over the last 18 months I have increasingly seen, in published papers and in strategy pitches to allocators, a practice I will call simulated incubation. The phrasing varies, but the core is always:
“I designed the strategy as if it were a year ago. The subsequent 12 months are my incubation window.”
The motivation is understandable. Real incubation has a cost: time. A year of serious paper trading means a year without data to convince a risk committee, an allocator, a partner. The temptation to “compress” that cost while seemingly preserving the statistical properties of incubation is strong.
My thesis is that the compression is not free, and its cost is invisible to whoever applies it. The bias is structural, not an implementation error. And it operates even when the researcher is perfectly honest.
2. What incubation actually does
Before attacking the “simulated” version, it helps to be precise about why real incubation works.
Incubation is a commitment device: you freeze the decision process (code, features, hyperparameters, universe, cost model) at an instant t₀, declare you have done so, and observe the strategy’s behavior on data that did not yet exist at t₀. It performs three distinct statistical functions (a minimal sketch of such a freeze appears just after this list):
- It locks in the researcher. After t₀ you cannot change parameters or code. The stopping rule is binding: you cannot “look at the results and try again.”
- It generates a genuine forward realization. The returns observed in the incubation window are strictly independent of design decisions. The researcher has never seen that data.
- It samples an unchosen regime. You do not pick the macro, volatility, or correlation environment. The market imposes it on you.
None of these three properties survive simulated incubation.
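To make the freeze concrete, here is a minimal sketch of one way to implement the commitment at t₀. Everything in it is an illustrative assumption (the file names, the manifest format, the choice of SHA-256); the only point is that the frozen artifacts are hashed and timestamped before the incubation window opens, so any later change is detectable.

```python
# Minimal sketch of a "commitment device" (illustrative, not a prescription):
# hash the frozen artifacts (code, config, universe, cost model) at t0 and
# record the timestamp; any later change breaks the recorded hashes.
import datetime
import hashlib
import json
import pathlib

def freeze_manifest(paths, out_file="incubation_manifest.json"):
    """Hash every frozen artifact and write a timestamped manifest to disk."""
    manifest = {
        "frozen_at_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "artifacts": {
            str(p): hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
            for p in paths
        },
    }
    pathlib.Path(out_file).write_text(json.dumps(manifest, indent=2))
    return manifest

# Hypothetical usage: freeze the strategy code and its hyperparameters at t0.
# freeze_manifest(["strategy.py", "params.yaml", "universe.csv", "cost_model.py"])
```

A git tag, a signed commit, or an email to your own risk committee serves the same purpose; what matters is that the declaration happens before the window starts, not after it.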
3. Real vs. simulated incubation: the temporal asymmetry of knowledge
The core point is simple and, once seen, trivial: you cannot unknow what you know.
A researcher who, in 2026, claims to have “designed the strategy as if it were 2025” is operating in a decision space that has been inevitably shaped by knowledge of 2025:
- they know which regime shifts occurred;
- they know the period’s sector leadership;
- they know which factors performed and which didn’t;
- they know which methodologies are “in vogue” (and the literature they read is itself conditioned on those 12 months);
- they know — the subtlest channel — the set of strategies their community is currently promoting, which is a selection mechanism already operating on them.
None of these channels requires an explicit leakage action. They are default channels. To neutralize them you would have to do something active: e.g., design the strategy in a physically isolated environment, with access only to data truncated at a specific date, with literature frozen. No researcher does this.
4. Taxonomy of biases
Separating the channels is useful because mitigations differ by channel.
4.1 Implicit look-ahead (regime awareness)
The cleanest channel to identify. The researcher, without accessing the returns of the “incubation” window during backtesting, still knows what happened: a crash, a rally, a correlation shift. Design choices — signal type, risk filter, position-sizing scheme — are conditioned on that knowledge. The distribution of returns produced by a strategy designed this way is not independent of the distribution of the “incubation” window.
A concrete example: if 2025 had a brutal February drawdown, your strategy probably includes a volatility filter “because it makes sense.” You might not have included that filter without knowing about February 2025.
4.2 Publication selection bias (multiple testing)
The least visible channel and the most severe. If the researcher has “tried” the procedure on several variants of the strategy (or, equivalently, if different researchers have tried different versions in parallel), only the variants that pass simulated incubation get presented. This is the classic familywise error rate problem, handled rigorously by Harvey, Liu & Zhu (2016) for the cross-section of expected returns: with N independent trials, the t-statistic threshold required to keep a nominal 5% false-positive rate is much higher than 1.96. If a researcher tries 50 strategies and publishes the one with simulated-incubation Sharpe > 1, the probability it’s real alpha is minimal.
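To see why, note that even at a nominal 5% level the chance that at least one of 50 independent zero-alpha trials clears the bar is 1 − 0.95⁵⁰ ≈ 92%. The sketch below is my own back-of-the-envelope illustration, not a reproduction of Harvey, Liu & Zhu’s procedure; the parameters (50 variants, 1% daily volatility, a 252-day window) are assumptions chosen purely for the example. It asks how often the best of 50 zero-alpha variants clears an annualized Sharpe of 1 over its simulated “incubation” year.

```python
# Back-of-the-envelope illustration of the multiple-testing channel.
# Assumptions (mine, for the example): 50 strategies with zero true alpha,
# 252 daily returns each, 1% daily vol. How often does the *best* variant
# clear an annualized Sharpe of 1 over its "incubation" year?
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_strategies, n_days = 2_000, 50, 252
daily_vol = 0.01  # zero true mean => zero true Sharpe

sharpes = np.empty((n_trials, n_strategies))
for i in range(n_trials):
    rets = rng.normal(0.0, daily_vol, size=(n_strategies, n_days))
    sharpes[i] = rets.mean(axis=1) / rets.std(axis=1) * np.sqrt(252)

best = sharpes.max(axis=1)  # the variant that gets "published"
print(f"P(one fixed strategy shows Sharpe > 1): {np.mean(sharpes[:, 0] > 1):.1%}")
print(f"P(best of 50 shows Sharpe > 1):         {np.mean(best > 1):.1%}")
print(f"Sharpe the best of 50 exceeds only 5% of the time: {np.quantile(best, 0.95):.2f}")
```

Under these assumptions a single fixed strategy clears the bar roughly one time in six, while the best of 50 clears it almost every time; the Sharpe that the best-of-50 exceeds only 5% of the time comes out around 3, well above the single-trial 1.96. That is the whole multiple-testing problem in three print statements.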
4.3 Retroactively informed design decisions
Apparently innocuous choices — excluding an asset “for liquidity reasons,” using a 63-day estimation window instead of 42, applying winsorization at 2.5% — are default choices informed by the current context. The literature, the blogs, the arXiv papers the researcher reads are written by people who know 2025. This design noise acquires a non-zero-mean component when the “out-of-sample” window lies in the past.
4.4 No commitment (unbounded stopping rule)
If simulated incubation fails, the researcher can simply try again (change a parameter, a filter, a split date) with no reputational or procedural cost. In real incubation, failure is public (at least to oneself and one’s team) and commitment is binding. The Probability of Backtest Overfitting (Bailey, Borwein, López de Prado & Zhu, 2017) formally captures this channel.
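The channel is easy to dramatize in code. The sketch below is illustrative only (assumptions: zero-alpha variants, 1% daily volatility, and “passing” defined as an annualized Sharpe above 1 over a 252-day window): it simply retries fresh variants until one passes the simulated incubation window, which, with no cost attached to a retry, it always eventually does.

```python
# Illustrative sketch of the unbounded stopping rule (assumptions: zero-alpha
# variants, 1% daily vol, "pass" means annualized Sharpe > 1 over 252 days).
# With retrying free of cost, a passing variant is always found.
import numpy as np

def attempts_until_pass(rng, threshold=1.0, n_days=252, daily_vol=0.01):
    """Draw fresh zero-alpha variants until one clears the Sharpe threshold."""
    attempts = 0
    while True:
        attempts += 1
        rets = rng.normal(0.0, daily_vol, size=n_days)
        sharpe = rets.mean() / rets.std() * np.sqrt(252)
        if sharpe > threshold:
            return attempts

rng = np.random.default_rng(7)
draws = [attempts_until_pass(rng) for _ in range(1_000)]
print(f"median attempts until a zero-alpha variant 'passes': {int(np.median(draws))}")
print("share of researchers who eventually 'pass': 100%, by construction")
```

Bailey et al.’s PBO framework formalizes the consequence: it estimates the probability that the configuration selected in sample underperforms the median configuration out of sample. The toy loop above only shows why, when retrying is free, the reported “incubation” result carries essentially no information about that probability.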