Building a Robust Composite Score

  • May 22, 2026

Four statistical tests, one number. How to design a composite that defends against survival bias without throwing away signal.

 

Why a composite rather than a gate

In the previous post I described four diagnostic tools for survival-bias correction: the Deflated Sharpe Ratio, Haircut Sharpe, Probability of Backtest Overfitting, and bootstrap-based Stability selection. The natural next question is how to use them.

The simplest approach is gating: define a threshold for each (DSR > 0.9, PBO < 0.5, etc.), keep strategies that pass all gates, discard the rest. The result is a binary decision per strategy: in or out.

Binary gates are easy to explain, easy to audit, and easy to misuse. The problem is that they discard information. A strategy with DSR of 0.92 is meaningfully different from one with DSR of 0.99, but a 0.9 threshold treats them the same. A strategy that fails one gate by a small margin can be either fundamentally flawed or simply on the wrong side of a noisy estimate. Binary gates also produce cliff-edge effects: tiny changes in input data can flip a strategy from “in” to “out,” which is operationally unstable.

A continuous composite score addresses all three problems. It preserves the gradation of evidence, it permits graduated portfolio sizing, and it dampens the noise sensitivity of any single component. The price is that you need to commit to a specific combination function — and that choice itself has consequences.

Design principles

A well-designed composite score should satisfy several properties:

Interpretability. The number should have a natural meaning. A composite of 0.5 should mean roughly “moderate evidence of edge”; 0.0 should mean “no evidence”; 1.0 should mean “very strong evidence.” This is a soft requirement but a useful one when communicating with stakeholders.

Range. Bounded to [0, 1] for cross-strategy comparability and easy use as a prior in portfolio weights.

Independence of components. The four input tests should be measuring distinct facets of evidence, not redundant aspects of the same thing. DSR and Haircut both correct for multiple testing, but they do so in different ways and rarely agree fully. Stability bootstraps the data, which is orthogonal to multiple-testing correction. PBO is universe-wide rather than per-strategy. Each adds information the others miss, even where the inputs are correlated in practice.

Failure of any component should hurt the composite. If a strategy passes three tests strongly but fails the fourth, the composite should reflect that. This argues for a multiplicative combination rather than additive: with four [0,1] inputs combined as a product, a weakness in any single component pulls the overall score down toward zero. Additive combinations dilute weaknesses.
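To make the contrast concrete, here is a minimal sketch comparing the two combination rules on one hypothetical strategy that scores well on three components and badly on the fourth. The function names and the input values are illustrative assumptions, not figures from the universe discussed in this post.

```python
from math import prod

def multiplicative(components):
    """Product of [0, 1] scores: one weak component drags the result toward zero."""
    return prod(components)

def additive(components):
    """Equal-weight mean of [0, 1] scores: one weak component is averaged away."""
    return sum(components) / len(components)

# Hypothetical inputs: three strong components, one near-failure.
strong_but_one_failure = [0.9, 0.9, 0.9, 0.05]

print(f"product: {multiplicative(strong_but_one_failure):.3f}")
print(f"mean:    {additive(strong_but_one_failure):.3f}")
```

The product lands below 0.04 while the mean stays near 0.69, which is the dilution effect described above.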

Conservative bias is fine. A composite that under-counts edge is more useful than one that overstates it, because the operational consequences of false positives (allocating to strategies that don’t work) are typically worse than false negatives (missing strategies that do work).

A specific composition

A natural composite that satisfies these properties has the form:

composite_i = DSR_i × stability_i × retention_i × universe_factor

where:

  • DSR_i is the per-strategy Deflated Sharpe probability, in [0,1].
  • stability_i is the mean DSR across bootstrap resamples of the data, also in [0,1]. (Alternatively, the fraction of bootstraps in which the strategy’s DSR exceeds 0.5.)
  • retention_i is the fraction of the strategy’s raw mean daily return that survives Bayesian shrinkage toward zero. A James-Stein shrinkage estimator with cross-sectional prior calibration works well.
  • universe_factor is a universe-wide penalty, typically (1 − PBO).

This is one possible composition among many; the specific formula matters less than the architecture. The four components are conceptually independent: DSR captures multiple-testing correction, stability captures bootstrap robustness, retention captures Bayesian regularization, and PBO captures universe-level ranking quality. A multiplicative combination ensures that weakness in any single dimension is reflected in the final score.
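The formula is simple enough to write down directly. The sketch below assumes the four inputs have already been computed elsewhere; the function name `composite_score` and the example values are hypothetical.

```python
def composite_score(dsr, stability, retention, pbo):
    """Multiplicative composite: DSR x stability x retention x (1 - PBO).

    All four inputs are expected to lie in [0, 1]; pbo is universe-wide,
    so the same value is passed for every strategy in the universe.
    """
    for name, value in [("dsr", dsr), ("stability", stability),
                        ("retention", retention), ("pbo", pbo)]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return dsr * stability * retention * (1.0 - pbo)

# Hypothetical inputs for one strategy in a universe with PBO = 0.30.
print(round(composite_score(dsr=0.8, stability=0.7, retention=0.6, pbo=0.3), 4))
```

Validating the range up front is a design choice of this sketch: a component outside [0, 1] would silently break the bounded-range property the composite relies on.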

Worked intuition for each component

DSR (per-strategy). A strategy with annualized Sharpe of 1.4, observed over 40 months, drawn from N = 75 trials, will typically have DSR between 0.5 and 0.85 depending on the higher moments of its returns and on how widely Sharpe ratios are dispersed across the trials. Negative skew or high kurtosis reduces DSR because the standard error of the Sharpe estimate is larger. The same Sharpe over 10 years would have DSR very close to 1.
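For readers who want to see the moving parts, here is a sketch of the DSR following the formula in Bailey and López de Prado (2014), using only the standard library. The Sharpe ratio and its cross-trial standard deviation are in per-period (not annualized) units, and `kurt` is raw kurtosis (3 for a normal distribution). The function names are mine, and the cross-trial dispersion of 0.4 annualized in the example is an assumption, since the text above does not specify it.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, sharpe_std):
    """Expected best Sharpe among n_trials (>= 2) when the true Sharpe is zero."""
    inv = _STD_NORMAL.inv_cdf
    return sharpe_std * ((1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
                         + EULER_GAMMA * inv(1 - 1 / (n_trials * e)))

def deflated_sharpe(sharpe, n_obs, skew, kurt, n_trials, sharpe_std):
    """Probability that the true Sharpe exceeds the multiple-testing hurdle."""
    hurdle = expected_max_sharpe(n_trials, sharpe_std)
    # Variance inflation of the Sharpe estimate from skew and fat tails.
    denom = sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    return _STD_NORMAL.cdf((sharpe - hurdle) * sqrt(n_obs - 1) / denom)

# Annualized Sharpe 1.4 on monthly data, N = 75 trials.
sr_monthly = 1.4 / sqrt(12)
trial_sd = 0.4 / sqrt(12)  # ASSUMED dispersion of Sharpe across trials

dsr_40 = deflated_sharpe(sr_monthly, 40, 0.0, 3.0, 75, trial_sd)
dsr_fat = deflated_sharpe(sr_monthly, 40, -1.0, 6.0, 75, trial_sd)
dsr_120 = deflated_sharpe(sr_monthly, 120, 0.0, 3.0, 75, trial_sd)
print(f"40m normal: {dsr_40:.2f}, 40m skewed/fat: {dsr_fat:.2f}, 120m: {dsr_120:.2f}")
```

Under these assumptions the skewed, fat-tailed variant scores lower than the normal one, and the longer track scores higher. How close the 10-year figure gets to 1 depends heavily on the assumed dispersion across trials.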

Stability. Run the DSR computation on 200 to 400 bootstrap samples of the strategy’s returns, using a block bootstrap to preserve autocorrelation structure (block length around 30 trading days for daily data). The stability score is either the mean DSR across bootstraps (a smoother, continuous measure) or the fraction of bootstraps in which DSR exceeds some threshold (a more conservative measure). I tend to use the continuous mean for the composite because it integrates evidence smoothly.
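A minimal sketch of the stability computation follows. It uses a circular block bootstrap, which is one of several block-bootstrap variants and is my choice here rather than something fixed by the framework. The `score_fn` argument is a placeholder for whatever DSR function you use; `toy_score` below is only a stand-in so the example runs on its own, and the simulated returns are synthetic.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def circular_block_bootstrap(returns, block_len, rng):
    """Resample by concatenating contiguous blocks that wrap around the end."""
    n = len(returns)
    sample = []
    while len(sample) < n:
        start = rng.randrange(n)
        sample.extend(returns[(start + k) % n] for k in range(block_len))
    return sample[:n]

def stability(returns, score_fn, n_boot=300, block_len=30, threshold=0.5, seed=0):
    """Return (mean score, fraction of resamples scoring above threshold)."""
    rng = random.Random(seed)
    scores = [score_fn(circular_block_bootstrap(returns, block_len, rng))
              for _ in range(n_boot)]
    return mean(scores), sum(s > threshold for s in scores) / n_boot

def toy_score(returns):
    """Stand-in for a DSR function: normal CDF of the t-statistic of the mean."""
    sd = stdev(returns)
    if sd == 0:
        return 0.5
    return NormalDist().cdf(mean(returns) / sd * sqrt(len(returns)))

# Synthetic daily returns, roughly 40 months of trading days.
data_rng = random.Random(42)
daily = [data_rng.gauss(0.0005, 0.01) for _ in range(840)]
mean_score, frac_above = stability(daily, toy_score)
print(f"mean score: {mean_score:.2f}, fraction above 0.5: {frac_above:.2f}")
```

The function returns both versions of the stability score so the smoother mean and the more conservative threshold fraction can be compared on the same resamples.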

Retention. Bayesian shrinkage of the strategy’s mean return toward a prior of zero produces a shrunk estimate that depends on the precision of the raw estimate. The retention factor is the ratio of shrunk to raw mean, clipped to [0,1]. A strategy with a noisy mean estimate (high variance, short track) gets pulled aggressively toward zero and has low retention. A strategy with a precise mean estimate (long track, moderate variance) retains most of its raw signal. The James-Stein form, with cross-sectional empirical Bayes calibration of the prior variance, works well in practice and avoids the need for a fully Bayesian setup.
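One way to sketch the retention factor is a normal-normal empirical Bayes estimator with a zero prior mean, where the prior variance is calibrated from the cross-section by the method of moments. This is a James-Stein-style estimator that allows a different standard error per strategy, not the classical equal-variance James-Stein formula, and it is an assumption about how the calibration could be done. The function name and the input values are hypothetical.

```python
def retention_factors(raw_means, std_errors):
    """Shrunk-to-raw ratio per strategy under a zero-mean normal prior.

    raw_means:  raw mean daily return of each strategy
    std_errors: standard error of each raw mean (same order)
    """
    n = len(raw_means)
    # Method of moments: E[m_i^2] = tau^2 + s_i^2 when the prior mean is zero.
    mean_sq = sum(m * m for m in raw_means) / n
    mean_noise = sum(s * s for s in std_errors) / n
    tau2 = max(0.0, mean_sq - mean_noise)  # cross-sectional prior variance
    factors = []
    for s in std_errors:
        total = tau2 + s * s
        ratio = tau2 / total if total > 0 else 0.0
        factors.append(min(1.0, max(0.0, ratio)))  # clip to [0, 1]
    return factors

# Hypothetical universe: the fourth strategy has a much noisier mean estimate.
raw_means = [0.0010, 0.0005, -0.0002, 0.0008]
std_errors = [0.0002, 0.0002, 0.0002, 0.0010]
print([round(f, 2) for f in retention_factors(raw_means, std_errors)])
```

The noisy fourth strategy retains far less of its raw mean than the three precisely estimated ones, which is the behavior described above.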

PBO (universe-wide). Combinatorially Symmetric Cross-Validation on the full universe gives a single number for the universe’s ranking quality. A PBO of 0.30 means that in 30% of the in-sample/out-of-sample splits the in-sample winner ranked at or below the median out of sample, so the ranking has reasonable out-of-sample predictive power; the multiplier (1 − 0.30) = 0.70 is applied to every strategy’s composite. A PBO of 0.50 collapses the multiplier to 0.50, halving every composite. That is appropriate, because at 0.50 the universe’s ranking is no better than chance.
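The CSCV procedure can be sketched in a few dozen lines. The version below follows the outline in Bailey et al. (2017): split the return history into blocks, use every half-and-half combination of blocks as in-sample versus out-of-sample, and count how often the in-sample winner lands at or below the out-of-sample median. Using the Sharpe ratio as the performance metric, eight blocks, and the function names are choices of this sketch. The example data is synthetic.

```python
from itertools import combinations
from math import log
from statistics import mean, pstdev

def sharpe(returns):
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0

def pbo_cscv(returns_by_strategy, n_blocks=8):
    """PBO estimate; returns_by_strategy is a list of equal-length return lists."""
    n_strat = len(returns_by_strategy)
    size = len(returns_by_strategy[0]) // n_blocks  # trailing remainder is dropped
    blocks = [range(b * size, (b + 1) * size) for b in range(n_blocks)]
    logits = []
    for in_blocks in combinations(range(n_blocks), n_blocks // 2):
        in_idx = [t for b in in_blocks for t in blocks[b]]
        out_idx = [t for b in range(n_blocks) if b not in in_blocks
                   for t in blocks[b]]
        is_sr = [sharpe([r[t] for t in in_idx]) for r in returns_by_strategy]
        oos_sr = [sharpe([r[t] for t in out_idx]) for r in returns_by_strategy]
        best = max(range(n_strat), key=lambda i: is_sr[i])
        # Relative out-of-sample rank of the in-sample winner, in (0, 1).
        omega = sum(sr <= oos_sr[best] for sr in oos_sr) / (n_strat + 1)
        logits.append(log(omega / (1 - omega)))
    # Share of splits where the in-sample winner is at or below the OOS median.
    return sum(lam <= 0 for lam in logits) / len(logits)

# Synthetic universe in which strategy 0 dominates in every block.
dominant = [0.02, 0.00] * 40
flat = [0.01, -0.01] * 40
loser = [-0.01, -0.03] * 40
print(pbo_cscv([dominant, flat, loser]))
```

With a strategy that wins in every block, the in-sample winner is also the out-of-sample winner in all 70 splits, so the estimate is 0.0. Real universes sit somewhere between that and the 0.50 chance level.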

What the composite reveals about the universe

In a universe of 18 strategies that had all passed conventional in-sample, out-of-sample, and live-incubation filters, applying this composite produced a striking stratification.

Three strategies emerged with composite scores above 0.20, representing strong evidence of real edge. Four more scored between 0.10 and 0.20, representing moderate evidence and reasonable satellite candidates. The remaining eleven scored below 0.10, with several below 0.01 — statistically indistinguishable from random.

The stratification was not obvious from raw Sharpe ratios alone. Strategies with the highest raw Sharpes did not always have the highest composites. One strategy with raw Sharpe near 0.85 dropped to composite 0.04 because its retention factor was only 0.23 — the Bayesian shrinkage pulled its mean aggressively toward zero, signaling that the high Sharpe was driven by noise in a high-variance return distribution.

This kind of demotion is exactly what the framework is designed to do. A strategy can have a “great” Sharpe in incubation and still fail to clear the composite because the supporting evidence is fragile. The framework cares about the signal-to-noise ratio of the evidence, not just the point estimate of the metric.

Practical considerations

A few subtleties that matter in implementation:

The choice of N for the DSR. The most defensible choice is the total number of parameter configurations and strategy variants that were explored during development. This is usually much larger than the final number of strategies you ended up with. If you developed 25 strategies but explored 75 to 100 parameter configurations to get there, use N close to 100. If unsure, use the larger plausible number; the cost of being too conservative is small (slightly lower composites), the cost of being too generous is large (failing to correct for actual multiple testing).
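The asymmetry in that advice can be checked numerically. The sketch below evaluates the expected-maximum term from the DSR formula of Bailey and López de Prado (2014), expressed in units of the cross-trial standard deviation of the Sharpe ratio. The function name is mine.

```python
from math import e
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def expected_max_z(n_trials):
    """Expected maximum of n_trials (>= 2) standard-normal Sharpe estimates."""
    inv = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
            + EULER_GAMMA * inv(1 - 1 / (n_trials * e)))

for n in (25, 75, 100):
    print(n, round(expected_max_z(n), 2))
```

The hurdle grows slowly in N: it is about 2.0 standard deviations at N = 25 and about 2.5 at N = 100. Choosing 100 instead of 75 barely moves it, whereas using 25 when the true count was near 100 understates it by roughly a fifth.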

The bootstrap design. Block bootstrap with block length comparable to the typical autocorrelation horizon of the strategy’s returns. For mid-frequency strategies on daily data, 20 to 40 trading days works well. For higher frequencies, shorter blocks. For lower frequencies (one trade per month), the bootstrap may be uninformative — the strategy has too few observations to resample meaningfully.

The shrinkage prior. Empirical Bayes calibration of the prior variance from the cross-section is robust. Setting the prior mean to zero is conservative; setting it to the cross-sectional grand mean is less conservative but exposes you to the possibility that the grand mean itself is positively biased by survival.

The threshold for action. Composite scores can be used as soft weights in portfolio sizing, as hard thresholds (with the threshold chosen by inspection of the composite distribution), or as both. In my framework, I prefer:

  • Composite > 0.20: core position, full weight.
  • 0.10 to 0.20: satellite, reduced weight.
  • 0.01 to 0.10: marginal, minimal weight only if needed for diversification.
  • Below 0.01: exclude.

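These thresholds are heuristic but defensible. A composite of 0.20 corresponds roughly to “the joint evidence places the strategy in the top quartile of credible edge”; 0.10 to “borderline credible”; 0.01 to “barely above noise”; below that to “indistinguishable from a random selection effect.” The cut-offs sit well below 0.5 because a product of four [0,1] factors compresses the scale: four components of 0.7 each already multiply to about 0.24.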
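The tiers translate into a small lookup. The tier labels and the treatment of exact boundary values (0.20 falls in the satellite tier, 0.10 in satellite, 0.01 in marginal) are assumptions of this sketch, since the ranges above do not say which side a boundary belongs to.

```python
def tier(composite):
    """Map a composite score in [0, 1] to a sizing tier."""
    if composite > 0.20:
        return "core"        # full weight
    if composite >= 0.10:
        return "satellite"   # reduced weight
    if composite >= 0.01:
        return "marginal"    # minimal weight, diversification only
    return "exclude"

# Hypothetical composite scores for four strategies.
for score in (0.27, 0.14, 0.04, 0.003):
    print(score, tier(score))
```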

What the composite does not solve

Important to be honest: the composite addresses survival bias and selection effects, but it does not address every problem in systematic strategy evaluation.

It does not detect regime change. A strategy that worked in a particular macro regime and will not work in the next one cannot be flagged by historical DSR. Forward monitoring (next post) is required.

It does not validate the strategy’s logic. A high composite score on a strategy with no causal mechanism is not a guarantee of forward performance. The framework is statistical; understanding why a strategy works requires domain knowledge.

It does not capture transaction costs and slippage that vary with size. A strategy trading one contract faces different costs from one that needs to trade 50 contracts. The composite is computed on backtested returns that may understate the live cost of larger sizes.

It is a snapshot. Composite scores drift with new data. A strategy that scored 0.30 a year ago might score 0.15 today. Re-running the pipeline every six months is good practice.

These caveats notwithstanding, the composite framework is the most defensible quantitative tool I know for converting a noisy set of survivor strategies into a ranked input for portfolio construction. The next post in the series describes how to take that ranked input and turn it into an actual portfolio with margin constraints and discipline.

 

References

           Bailey, D.H., López de Prado, M. (2014). The Deflated Sharpe Ratio. Journal of Portfolio Management 40(5).

           Bailey, D.H., Borwein, J.M., López de Prado, M., Zhu, Q.J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance 20(4).

           Harvey, C.R., Liu, Y. (2015). Backtesting. Journal of Portfolio Management 42(1).

           López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter 3 (meta-labeling) and Chapter 7 (cross-validation).

           Efron, B. (1975). Biased versus unbiased estimation. Advances in Mathematics 16.

           Ledoit, O., Wolf, M. (2004). Honey, I Shrunk the Sample Covariance Matrix. Journal of Portfolio Management 30(4).