The winner’s curse under dependence: repairing empirical Bayes using convoluted densities

The winner’s curse is a form of selection bias that arises when estimates are obtained for a large number of features, but only a subset of most extreme estimates is reported. It occurs in large scale significance testing as well as in rank-based selection, and imperils reproducibility of findings and follow-up study design. Several methods correcting for this selection bias have been proposed, but questions remain on their susceptibility to dependence between features since theoretical analyses and comparative studies are few. We prove that estimation through Tweedie’s formula is biased in presence of strong dependence, and propose a convolution of its density estimator to restore its competitive performance, which also aids other empirical Bayes methods. Furthermore, we perform a comprehensive simulation study comparing different classes of winner’s curse correction methods for point estimates as well as confidence intervals under dependence. We find a bootstrap method by Tan et al. (2015) and empirical Bayes methods with density convolution to perform best at correcting the selection bias, although this correction generally does not improve the feature ranking. Finally, we apply the methods to a comparison of single-feature versus multi-feature prediction models in predicting Brassica napus phenotypes from gene expression data, demonstrating that the superiority of the best single-feature model may be illusory.


Introduction
Contemporary omics technologies allow to routinely gather datasets containing many features.Commonly all these features are tested for association with a variable of interest, or some other quantity is estimated for each of these features.Scientific interest then naturally focuses on the most extreme estimates.Consequently, often only a small number of most extreme estimates or test statistics (e.g.top 10) are reported and considered for follow-up studies.Yet this subset of extreme estimates is subject to a selection bias called the winner's curse (Sun et al., 2011;Zollner and Pritchard, 2007;Efron, 2011); a manifestation of the more general phenomenon of "regression towards the mean" (Galton, 1886;Bland and Altman, 1994).There are two possible reasons why the estimates are extreme: because their true estimand is extreme or because the random estimation error pushes the estimate towards the extremes.Selecting only the most extreme estimates leads to enrichment of extreme estimation errors, engendering the selection bias.More formally, if all features with estimates for a metric of interest γ falling below a threshold c are selected (assuming we are interested in small values), the selection bias is defined as E(γ −γ|γ < c).This bias is usually non-zero and negative, even when the regular estimator is unbiased (E(γ−γ) = 0).Selection bias results from using the same dataset for identifying extreme features as well as for providing the corresponding estimates, not unlike data dredging.It has been suggested as a cause for poor reproducibility of published findings and insufficient power in follow-up studies (Sun et al., 2011;Zollner and Pritchard, 2007;van Zwet et al., 2021).The winner's curse is related to the problem of inference on model parameters after model selection (post-selection inference) (Berk et al., 2013), but here we focus purely on parameter estimation and ranking, without considering model selection.Existing selection bias correction methods are often applied to test statistics in the realm of hypothesis testing (Ghosh et al., 2008;Zhong and Prentice, 2008;Tan et al., 2015;Ferguson et al., 2013;Faye et al., 2011;Bigdeli et al., 2016;Efron, 2011).Yet also estimation and ordering of raw effect sizes can be of interest, e.g. in cases where there is no obvious null value, like predictive modelling.Moreover, even when using test statistics, improving the ordering of top features may improve the success rate of subsequent validation efforts.Multiple testing corrections that control the false discovery rate (FDR), e.g.Benjamini-Hochberg adjustment of p-values (Benjamini and Hochberg, 1995), are widely adopted, but often not all significant features can be investigated further as resources for follow-up studies are limited.Consequently, it is common practice to only focus on a small set of top test statistics.Conversely, researchers may also consider features falling just short of the significance level when the number of discoveries is small.In such settings it is the number of findings that is fixed, rather than the rate of false findings, and the reliability of the findings might be increased by selection bias correction methods that improve the feature ranking.Despite the many available methods for selection bias correction, and although many publications do include simulation studies (Tan et al., 2015;Faye et al., 2011;Bigdeli et al., 2016), exhaustive benchmarking studies are lacking.Also the effect of dependence among estimates on these procedures has not been thoroughly investigated.The estimates may show dependence for a host of different reasons: co-expression of genes for instance causes correlation between test statistics in differential expression testing.But also reliance on a common random quantity for the calculation of the estimates introduces dependence, e.g. a common outcome variable in prediction problems.Tweedie's formula, an empirical Bayes method that is often used to correct for selection bias, has been found to perform poorly in the presence of dependence (Tan et al., 2015;Bigdeli et al., 2016).Tan et al. (2015) claim that dependence invalidates Tweedie's formula, but according to Efron (2011) it remains valid and dependence merely inflates the variability of the resulting estimates.Similarly, Stephens (2017) notes the independence assumption on the estimates as a weakness of his ash method, another empirical Bayes method, but saw no obvious way to relax it.Bigdeli et al. (2016) tackle the problem of correlation for empirical Bayes methods in the context of GWAS studies by splitting the test statistics into non-overlapping sets based on chromosome order, but this is not easily generalizable to other applications.In this work we investigate the problem with empirical Bayes procedures under dependence both theoretically and empirically, and suggest a solution based on convolution.Next we benchmark the refactored empirical Bayes methods against other selection bias correction methods under dependence.Finally we apply the best performing methods to a real dataset from a Brassica napus field trial.

Materials and methods
Given is an n × p data matrix X with n samples and p features, possibly with a vector y of length n that is tested for association with every feature x j .The quantity of interest for every feature j is indicated by γ j ; we assume that small values are of most interest, e.g. the features with the lowest prediction loss.We distinguish four streams of methods to tackle selection bias: 1) separate estimation, 2) conditional likelihood, 3) bootstrapping and 4) empirical Bayes.The first method is inspired on the observation that the selection bias springs from using the same data for best feature selection as well as for parameter estimation.An obvious solution is to split the available samples at random in two sets, and use the first set for selecting the features with the smallest estimates, and the second set for providing parameter estimates of the selected features to be reported (Sun and Bull, 2005).We call this the separate estimation procedure.A second method is conditional likelihood.First γ is estimated for all features using full maximum likelihood (ML) as γ = arg max given parameter value γ (omitting feature subscripts).Next conditional likelihood estimates are found for the subset of features with the smallest estimates, conditional on them falling below some threshold c, which may depend on the other estimates in rank-based selection: c(γ) (Zollner and Pritchard, 2007;Ghosh et al., 2008).Using Bayes' theorem, the conditional likelihood L c for features passing the threshold c is and the conditional ML estimate is γCL = arg max Assuming normality of the parameter estimate around the true value γ, P (γ with Φ the standard normal distribution function and σ γ the estimate's standard error.This probability shrinks with increasing γ, shifting the peak of the conditional likelihood to larger values and attenuating the selection bias of the ML estimator (Zollner and Pritchard, 2007).Conditional likelihood methods are evidently only applicable to likelihood-based estimation.Moreover, their corrected estimates depend on the choice of cut-off value c.A third stream of methods rely on bootstrapping to estimate the selection bias, and then correct the parameter estimates by subtracting the bias.Three distinct bootstrap methods exist: the first one, known as "BR-squared" (Faye et al., 2011), works as follows.For every b = 1, . . ., B: 1. Draw a bootstrap sample by sampling n rows with replacement from X, and if applicable select the corresponding elements from y. 2. Estimate the parameter of interest on both the bootstrap sample (in-sample), yielding γb Dj for feature j, and on the samples not included in the bootstrap sample (out-ofsample), yielding γb Ej .

Rank the in-sample estimates γb
Dj over the bootstrap sample, to yield γb D(k b ) , with (k b ) indexing the feature with the k-th smallest value (k-th order statistic) in γb D .Use the same feature ordering to also order γb Ej and γj (the estimates on the real non-bootstrapped data) into γb and γ(k b ) , respectively.4. Estimate the bias as qb , both estimated from all bootstrap samples.The last term corrects for the fact that γ b D and γ b E are estimated from mutually exclusive sets of samples.
Finally, subtract the average bias to obtain the corrected estimate γBR-squared with γ(k) the k-th order statistic of γ.The second approach, which we will call "Tan2015" (Tan et al., 2015), works as follows: 1. Perform steps 1-3 as for BR-squared, without calculating the out-of-sample estimates.

Set êb
Both these bootstrap methods address dependence between estimates by performing the resampling in the first step jointly for all features, i.e. sampling entire rows.For the Tan2015 method, also parametric bootstrapping is possible if the dependence structure of the data can be estimated.A third bootstrap method is called winnerscurse and only generates a single, parametric bootstrap sample for every feature relying on normality of the estimates.The bias ê(k) is approximated as for the Tan2015 method, and a cubic smoothing spline is fitted through the collection of (γ (k) , ê(k) ) pairs.The fitted values ê * (k) of this spline are subtracted from the raw values to correct the bias (Forde et al., 2023).A fourth class consists of Bayesian methods, as Bayesian statistics is immune to selection bias (Efron, 2011).For the three methods considered, some prior distribution is estimated from the data, making them empirical Bayes methods.Tweedie's formula (Robbins, 1956) is the most popular one for the purpose of selection bias correction (Efron, 2011).For normally distributed estimators, its estimate of the posterior mean is: with σ 2 γj the variance of the estimator γj .l = log f is the marginal log density, where f is a convolution of the prior distribution over all features g(γ) with an estimator distribution k: dγ or f = g * k for short.By exploiting the high-dimensionality of the data, the marginal log density l and its derivative can be estimated directly from the observed estimates γ (Efron, 2011).A second empirical Bayes procedure which has been applied for selection bias correction (Morrison and Simon, 2018) is adaptive shrinkage (ash) by Stephens (2017).It assumes a unimodal prior distribution g of which the parameters are estimated from the data, and assumes the estimates γj to be independently and normally distributed with variance σ 2 γj .A posterior distribution p(γ j | γj , σ2 γj ) ∝ g(γ j )f (γ j | γ j , σ2 γj ) is then calculated for each feature, and the posterior mean serves as a point estimate.Note the difference between both empirical Bayes approaches: in the ash method the prior distribution of the parameters g(γ) is estimated, whereas Tweedie's formula estimates the marginal distribution of the estimates f (γ), which is a convolution of the prior with the estimator sampling distribution.A third empirical Bayes method, which we will call "VanZwet2021", estimates the distribution f (SNR|z) of the signal-to-noise-ratio (SNR) γ j σ γj conditional on the z-value z j = γj σγ j using Gaussian mixture modelling.Its corrected estimate is then the posterior mean of the SNR multiplied by the standard error: E(SNR|z j )σ γj (van Zwet et al., 2021).Finally, Zuber and Strimmer (2009) construct correlation adjusted t-statistics ("cat-scores") by decorrelating the test statistics.Although not directed against the winner's curse, they claim that the decorrelation improves feature ranking.Apart from point estimates, confidence intervals for the h smallest estimates are also desirable.The regular confidence intervals do not achieve their nominal coverage for extreme estimates, and custom interval construction is needed (Morrison and Simon, 2018).For the separate estimation method, the confidence intervals for the selected features are constructed based on the samples from the second group alone.Zollner and Pritchard (2007) and Zhong and Prentice (2008) present a 95% confidence interval for the conditional likelihood estimates consisting of all values γ j for which twice the difference in log conditional likelihood with γCL j lies below the 95th percentile of a χ 2 -distribution with degrees of freedom equal to the number of free parameters of the model.This interval can be asymmetric.Faye et al. (2011) applied a nested bootstrap to construct a symmetric confidence interval for the BR-squared estimates using the bootstrap standard error and relying on normality of the estimator.Morrison and Simon (2018) provide the estimator by Tan et al. (2015) with a confidence interval, also employing a nested bootstrap but using the quantiles of the bootstrap distribution rather than assuming normality of the estimator.This bootstrap interval can be asymmetric to reflect the fact that small estimates are often underestimates.The Bayesian framework of Tweedie's formula also yields an expression for a credible interval on the final estimates, relying on the normality of the posterior distribution.The posterior variance is and the (1 Efron, 2011).Ferguson et al. (2013) also incorporated the variability of the estimator in every point γj through bootstrapping.Credible intervals for the ash and VanZwet2021 methods are derived from their posterior distributions p(γ j | γj , σ2 γj ) and f (SNR|z j ), respectively (Stephens, 2017;van Zwet et al., 2021).

Tweedie's formula is biased under strong dependence
The effect of dependence between estimates on empirical Bayes procedures is subject to debate (Tan et al., 2015;Stephens, 2017;Efron, 2011;Morrison and Simon, 2018;Bigdeli et al., 2016).In our simulations we find that dependence negatively affects Tweedie's formula (Figure S7), in line with Tan et al. (2015).Also ash and to a lesser extent VanZwet2021 are impacted by the dependence.To elucidate why this happens, we simulated 3 Monte-Carlo instances of 1,000 independent and 1,000 dependent estimates γj (Figure 1, see Supplementary Section 2.2.3.1 for details).The distributions for independent estimates are all very similar and approximate the true marginal distribution of all estimates overlaid in blue.Dependence narrows the distribution of estimates and shifts it to higher or lower values, strongly departing from the true marginal distribution; a phenomenon known from the field of multiple testing correction (Efron, 2007;Hawinkel et al., 2022a;Azriel and Schwartzman, 2015).It is easy to envision how this behaviour upsets the performance of Tweedie's formula (2): a naive density estimator will yield wrong derivatives of the logdensity, leading to incorrect shrinkage.Analogously, ash and VanZwet2021 will be affected as the width of their prior and z-value distribution, respectively, is underestimated.
To investigate this phenomenon more formally in line with existing theory (Schwartzman, 2010;Efron, 2011;Azriel and Schwartzman, 2015), we normalize all estimates γj to z-values: . The generative model underlying Tweedie's formula is then: The observed distribution f (z) is a convolution of a prior distribution g of the expectations µ with an estimator distribution k: f = g * k.Since all z-values have the same variance 1, k is an even mixture of p standard normal distributions equal to ϕ(z): the standard normal density.We assume for a moment that g is a point density, so all means µ are the same, and k and f coincide. Azriel and Schwartzman (2015 have shown that a naive estimator f (z) based on an observed sample z only does not converge to f (z) as p → ∞ in presence of strong dependence.Strong dependence means that the average absolute correlation between features does not vanish as p → ∞.This occurs e.g. in presence of a confounder such as a common outcome variable of a prediction model as in our motivating example (see case study below).On the contrary, weak correlation occurs when the average absolute pairwise correlation does tend to zero as p tends to infinity.Weak correlation holds e.g. for autoregressive models (Azriel and Schwartzman, 2015) but also for GWAS settings with only local correlation due to linkage disequilibrium, or for gene correlation networks, as most gene pairs are not correlated.Schwartzman (2010) demonstrates how strong correlation narrows and shifts f (z), making it behave as a random function even as p → ∞.
Following the Gram-Charlier expansion in Schwartzman (2010), more precisely his equation ( 8), this random function can be described as: with h v (z) the v-th Hermite polynomial.W 0 = 1 and the other W 's are random variables with mean 0 and variance-covariance structure with G(ρ) the distribution function of pairwise correlations ρ between z-values, and α v the v-th moment of the correlation distribution.Repeated experiments correspond to repeated draws W = (W 0 , W 1 , . . ., W ∞ ) t .The term W 1 h 1 (z) is responsible for the shift in location of f (z), whereas the terms W 1 h 1 (z) and W 2 h 2 (z) drive its width, usually narrowing f (z) with respect to f (z) (Schwartzman, 2010).These departures from f(z) upset empirical Bayes methods, even though the correlation does not affect the expectation over different experiments: estimator for f(z) but not consistent in the sense that is does not converge to f (z) as p → ∞ (Schwartzman, 2010).
Tweedie's formula relies on the first derivative of the logarithm of the marginal density function, so we investigate the behaviour of this random function too.Taking the logarithm and derivative of (5) yields We now take the expectation over random realizations of W, so not over z as in equation ( 11) in Schwartzman (2010).For ease of notation we set , both of which are functions of W and z, with E W (N ) = 0 and E W (D) = 1.The term N/D is in general not zero and causes the bias in the estimation of the shrinkage term in (2).Its second order Taylor expansion (Seltman, 2019) provides the following approximation: Hence even though = −z because of the dependence.A special case is z=0: since the Hermite polynomials contain only even or only odd powers of z, h v (0)h ′ v (0) = 0 for all v and thus (8) equals 0. In other words, the naive estimator for the derivative of the log-density is biased for values other than z=0, with bias growing in absolute value as z moves away from 0.
Next we approximate the variance of expression ( 7), which we rewrite as b(W) = −z + N/D, through the first-order delta method (Gauss, 1823) as The variance-covariance matrix Σ is diagonal, see ( 6).The v-th element of ∇ (b(W)) is given by Evaluated in the expected value E(W) = (1, 0) t , with 0 a vector of zeroes of infinite length, this equals Plugging this into (9) yields (since Var(W 0 ) = 0) Hence the variance grows as z moves away from zero.Ferguson et al. (2013) already observed that the variability of the dz is larger in the tails because there are fewer observations, causing inaccurate shrinkage.This is sampling variability that is expected to disappear as p → ∞.Yet our results ( 8) and ( 12) uncover two more problems under correlation: 1) the regular estimator for d log f (z) dz is biased for z ̸ = 0, and 2) d log f (z) dz itself is a random function with a variance of at least α 1 (the average pairwise correlation), even as p → ∞.Both absolute value of the bias and variance grow as z moves away from 0. Problem 1) adds nuance to the assertions by Efron (2011) that Tweedie's formula remains valid under dependence: it is biased when z ̸ = 0 under strong dependence in combination with a naive estimator for f (z) that assumes independent z-values.
We confirm the previous results numerically through Monte-Carlo simulation.We repeatedly generate datasets from a standard multivariate normal distribution under compound symmetry, which means that all features are equally correlated.We vary this correlation ρ by increasing it from 0 to 0.6 with steps of 0.15.Marginally, all features are standard normal, so g is a point density.Of course, the simulation results will also reflect sampling variability, so we choose p ∈ (500; 1, 000; 5, 000) to have it decrease.For each combination of ρ and p we perform 5,000 Monte-Carlo simulations to approximate the bias, variance and mean squared error (MSE) with respect to d log ϕ(z) dz = −z across a range of z-values, using Lindsey's method as density estimator (Lindsey, 1974).The approximate bias is shown in Figure 2 and confirms the results from (8).The bias is close to 0 for ρ = 0, but grows in absolute value with increasing z when ρ ̸ = 0.For low correlations, the growth is linear as it is dominated by the first term in (8) (shown as dashed line), but for larger correlations also the higher order terms kick in.The variance and MSE are depicted in Supplementary Figure S1; the variance is inevitably larger than predicted by (12) as it also includes sampling variability, but as expected a larger p yields more precise estimates.The second order term provides a reasonable approximation for the shape of the variance function.
In general, g is not a point density, such that f is a mixture of normal distributions with differing means and only k follows (5) with expectation ϕ(z) as f takes on other shapes.The derivative of the marginal log-density then becomes This implies that the shape of the bias function depends on the prior g.When the spread of g is the main driver of the spread of f , which happens e.g when n → ∞, the change in shape due to the correlation is negligible and the derivative of the logdensity is approximately unbiased.Yet in that scenario, the estimator variances σ 2 γj are small compared to the variability of the true values γ j , and the effect of winner's curse would be minor.Correcting for winner's curse is most relevant when k is the dominant source of variance of the estimates γj , of which g being a point density is the most extreme case.
In the field of multiple testing correction, precise control or estimation of the false discovery rate while classifying the ensemble of all features into significant and non-significant is often the goal.This means that the significance of a certain feature depends on all other features' test statistics, and the distribution of a given experiment f (z) is of primary concern (Efron, 2007;Hawinkel et al., 2022b;Azriel and Schwartzman, 2015).Yet when estimation of the parameters γ is the main purpose, the estimates of other features are not of direct interest as estimation occurs feature-by-feature.The other features' estimates simply provide a way to estimate the prior distribution g or the true marginal distribution f (z), which are needed for Bayesian shrinkage.Yet we have proven that a naive estimator f (z) is not a good choice under strong dependence, so we need a better estimator for f (z).
Convolution reduces bias and variance of the estimator for the derivative of the marginal log-density Schwartzman (2010) has shown that k(z) has expected variance 1 − α 1 over repeated experiments W: with α 1 the average pairwise correlation between the z j 's.Azriel and Schwartzman (2015) add that k can be described by a mixture of M normal distributions: with r(.|ζ, σ 2 ) the normal density with random mean ζ (with expectation 0) and variance σ 2 .In our motivating example (see case study below), a single confounder is causing a compound symmetric correlation structure (see Supplementary Figure S16), such that k(z) can be described using a single normal distribution (M=1) with random mean √ α 1 ζ (with ζ standard normal) and variance 1 − α 1 (Azriel and Schwartzman, 2015).
The derivative of its log-density equals with expected value −z 1−α 1 .For small α 1 this is similar to (7) with the approximation in (8) truncated after one term, which equals −z(1 + α 1 ), proving again that Tweedie's formula is biased when combined with a naive density estimator.
To counter this bias, knowing that correlation narrows k with respect to k, we propose to widen k by convoluting it with a zero-mean normal distribution with variance α 1 .This means that we convolute every component with r(z|0, α 1 ), yielding with expected variance (1 − α 1 ) + α 1 = 1.Consequently, we want to use f = g * k = g * k * r(z|0, α 1 ) as an estimator of f .Yet estimating the parameters ζ m and σ 2 m of k is difficult, as all we observe are the estimates z, reflecting variability of both g and k.Sidestepping the deconvolution of g and k is a key advantage of Tweedie's formula.In this spirit, since our estimates γ are drawn from f = g * k, we could estimate f and then convolute it with r(z|0, α 1 ).Yet even simpler, and avoiding density estimation, is to convolute the observed distribution, formed by point masses of 1/p at each z j , with r(z|0, α 1 ) and estimate f using the following mixture: r j (z|z j , α 1 ). ( If M=1, unbiasedness of the estimator for the derivative of the log-density in (13) based on (17) can be proven for the case where g is a point density or a Gaussian distribution (see Supplementary Section 1.4), but not for arbitrary shape of g.For more complex strong correlation structures (M>1) and/or for arbitrary g, f will still have the correct expected variance but unbiasedness of the estimator for the derivative of the logdensity can no longer be guaranteed for all values of z.Still, numerical studies suggest a reduced estimation bias for the example of block correlation too (Supplementary section 1.2.2).Hence our estimator f should be seen as a working solution that captures the bulk of the dependence structure, especially when k is the dominant source of spread, similar to the empirical normal null by Efron (2010).
On the scale of the original estimates γj , an analogous reasoning leads to the following estimator: α 1 is estimated as the average of all pairwise correlations between γj 's (or a subsample thereof), which are themselves estimated either through likelihood theory or by the bootstrap.Simulations show that the convolution estimate f (z) under dependence is on average as wide as f (z) but remains shifted with respect to it (see Figure 1).There is no way of correcting The horizontal dotted line indicates 0 (no bias).For a correlation of 0, the effect of the convolution is negligible and the red and green lines overlap.
this, as from a single realization z we have got no way of knowing in which direction it is shifted.In case α1 = 0, no convolution is necessary and the regular Tweedie's formula can be employed.If α1 < 0, it cannot be smaller than −1 p−1 due to non-negative definiteness constraints on the covariance matrix (Schwartzman, 2010), and convolution is also not needed.In practice, since p is finite, small positive values of α1 may lead to irregular estimates f , so we do not apply convolution when α1 < 0.05.Instead, to stabilize the density estimation in the tails, we apply a density bagging approach to estimate f (z) in this case (see Supplementary Section 2.1.1)(Breiman, 1996;Bourel and Cugliari, 2019).In the other extreme case of α 1 close to 1, k converges to a point density and f is found by convoluting f with the full estimator distributions with variances σ 2 j following (18).We test the estimators for the derivative of the log-density in a second simulation study, different from the previous in the following ways to allow for estimating the average correlation.Multivariate normal data X of dimension n × p with n = 10 are drawn, with means µ j drawn uniformly on [-1,1] (so dropping the assumption of g being a point density) and standard deviations drawn uniformly on [1,2].The average of every feature serves as estimator γj for µ j , the average Pearson correlation between different columns of X serves as estimator for α 1 .500 Monte-Carlo runs were executed per parameter setting.Figure 3 reveals how the convoluted density yields an unbiased estimator of the derivative of the log marginal density for large p, although it still suffers from small sample bias in the tails when ρ = 0.The bagging estimator largely remedies this bias, although it becomes itself biased in the tails as ρ grows (Supplementary Figure S2).Supplementary Figures S3  and S4 show that the convoluted and bagged estimators also have lower variability than the naive estimator.
The workflow of the modified Tweedie's formula with convolution is then as follows 1. Obtain estimate γj and its standard error σj for every feature j. 2. Estimate α 1 as the average pairwise correlation between estimates γj .3. If α1 < 0.05, estimate the marginal log density with density bagging.We use Lindsey's method for density estimation, as this allows a direct estimate of the log density and its derivative (Lindsey, 1974) and is less biased than nonparametric density estimators (Efron and Tibshirani, 1996).If α1 > 0.05, estimate the first and second order derivative of the log density based on (18) (see Supplementary section 1.3 for the derivatives) by plugging in γj , σ2 j and α1 .4. Plug the estimates from steps 1 and 3 into ( 2) and ( 3) to obtain empirical Bayes estimates and the associated credible intervals.
Like Tweedie's formula, the ash method assumes the estimates γj to be drawn from a convolution of the prior g with a normal sampling distribution k.Yet in reality g is convoluted with k, leading ash to underestimate the width of g and hence also of the posterior.As a solution we estimate ash's prior using random samples from the distributions r j (γ|γ j , σ2 γj α 1 ); a sample size of 10 per feature suffices to obtain an estimate g with spread similar to that of the true prior (see Supplementary Figures S8-S9).The posterior distributions are then found as . Analogously, we modify the VanZwet2021 method by estimating the distribution of zvalues from the same random samples from r j (γ|γ j , σ2 γj α 1 ).Although this method appeared more robust to the dependence, our convolution scheme leads to a more accurate estimation of the width of the distribution of the z-values under high correlation (see Supplementary Figures S10-S11).Presumably, VanZwet2021's approach to estimate the joint distribution of SNRs and z-values is less sensitive to dependence effects than ash's approach to estimate the prior g directly.No instability due to small α 1 was observed for the ash or VanZwet2021 methods, so the convolution was applied for all cases in which α1 > 0.

Simulation studies
We investigate the performance of the 4 methods for selection bias correction (separate estimation, conditional likelihood, bootstrapping and empirical Bayes) in 3 simulation studies with different estimands γ and data generation procedures; details are given in Supplementary Section 2.1.Although correcting selection bias and building confidence intervals for extreme estimates is our primary goal, we also look at the performance of the methods in terms of feature ranking by means of the true discovery rate (TDR) of the top estimates.All analyses are run in R 4.4.0,see Supplementary Section 7 for the package versions.The original Tweedie's formula with naive density estimator based on the observed estimates γ performs so badly that it is not shown in the main text (see Supplementary Figures S7 and S19).

Simulation study 1: mean estimation
The first simulation study investigates selection bias correction for maximum likelihood estimation of the mean of Gaussian data under increasing correlation between features.The results in Figure 4 and Supplementary Figures S12-S14 show how the selection bias of the raw estimates decreases with the number of top features and with increasing correlation between features.This latter effect has been proven by Tan et al. (2015).The coverage of the naive confidence intervals falls short of the nominal level of 95%.The separate estimation method is immune to selection bias, reduces the MSE of the top estimates with respect to the raw estimates in most cases and its confidence intervals almost achieve the nominal coverage.Its performance is not affected by dependence, but it has wide confidence intervals and low TDR.Conditional likelihood estimation greatly reduces selection bias of top estimates, but overcorrects when the correlation between the estimates is high.Its confidence intervals are wider than those of the separate estimation method, and slightly overcover.The BR-squared correction increases selection bias and MSE in case of independence between features; its feature ranking performance is also very bad in this case and its confidence intervals are narrow but undercover.As the correlation between features grows higher, its performance improves over all these performance measures, becoming good under high correlation.The Tan2015 bootstrap correction works well irrespective of correlation strength, reducing selection bias and MSE of the smallest estimates, and the coverage of its confidence intervals hovers around the nominal level under independence but drops under dependence.Its feature ranking is often slightly worse than that of the raw estimate.The winnerscurse method performs well at correcting selection bias but has low TDR under high correlation.Tweedie's formula with convolution has low bias and MSE for small estimates, but the coverage of its credible intervals falls slightly short of the nominal level.The regular ash method performs well under independence, but overcorrects the selection bias as the correlation grows, resulting in deteriorating feature ranking and narrowing credible intervals with coverage plummeting far below the nominal level.Convoluting the density restores its good performance under dependence, resulting in low bias and MSE and good feature ranking, with higher coverage of the credible intervals but still below the nominal level.The VanZwet2021 method performs well in most scenarios and is robust to dependence, except that it overcorrects the selection bias and the coverage of its credible intervals drops below the nominal level for high correlation.Convoluting its density mostly degrades its performance, except under high correlation where it has low selection bias but very wide confidence intervals.Feature ranking using test statistics works better than using only raw estimates for all methods.The raw test statistics are among the best performers mostly, only under the highest correlation do the cat scores provide a slight improvement.Bringing the corrected test statistics back to the scale of the raw estimates by multiplying by the standard errors as suggested by several authors (Ferguson et al., 2013;Ghosh et al., 2008) did not offer a clear improvement for any method (Figure S13).

Simulation study 2: Nonparametric simulation based on B. napus data
In simulation study 2, the predictive performance of a common outcome variable (a plant phenotype) is being estimated for many features (gene expression values), from data nonparametrically simulated based on a real dataset from a Brassica napus field trial (see Supplementary section 2.1.2Fig. 4. Simulation results for mean estimation (simulation study 1) for different performance measures (side panels) and methods (colour) for increasing correlation between features (top panels).The x-axis shows the number of features with smallest estimates considered.For the true discovery rate (TDR), the linetype indicates whether the feature ranking is based on raw estimates or on test statistics.The feature ranking is the same for conditional likelihood as for the raw estimates, and when using test statistics also identical to ash, so in these cases only the raw estimates' results are shown.The feature ranking of Tweedie's formula using test statistics is very similar to the one of the raw test statistics, so their lines largely overlap.The dotted horizontal line indicates zero bias or MSE or the nominal coverage of 95%.
for details).Figure 5 and Supplementary Figure S15 reveal that the separate estimation method has a (slight) positive bias and low TDR; its MSE of top estimates varies across the phenotypes and its MSE of all estimates is higher than for the raw estimates.The Tan2015, ash and VanZwet2021 corrections have the lowest bias and MSE for the top estimates (except for total shoot dry weight), and ash and VanZwet2021 achieve the lowest overall MSE.Also the truncated Tweedie estimates (see Supplementary section 2.1.2) improve upon the raw estimates for all phenotypes in terms of bias and MSE of top estimates.The BR-squared method reduces bias and MSE for most outcome variables, but not as much as the other methods.The winnerscurse method performs poorly overall.Convolution was not found to improve the performance of the ash or VanZwet2021 methods in most cases (except for total shoot dry weight), with ash's credible intervals remaining unlikely narrow.The credible intervals of the convoluted Tweedie's formula and VanZwet2021 are narrower than the confidence intervals for the raw estimate, but appear more believable as they are comparable in width to those of the raw values.Decorrelating the estimates (see Supplementary Section 2.1.2) degrades feature ranking accuracy for most phenotypes.These observations are explained by the correlation between estimates hovering around 0.5 for most phenotypes in this scenario (Figure S16), lower than the highest value of 0.75 in simulation study 1 where cat scores showed a mild increase in TDR.
Simulation study 3: parametric simulation for RMSE estimation of prediction models In simulation study 3, a common outcome variable is again being predicted for p=1,000 features in two scenarios: a null scenario where no feature is predictive and an alternative scenario where 4 features are predictive.The results are shown in Supplementary Section 2.2.3 and are mostly similar to those of simulation study 2, except for the following.Ash's credible intervals are very narrow with coverage close to zero, and ash overcorrects the bias in the alternative scenario, likely because the unimodality assumption for the prior is violated.The VanZwet2021 also has very low coverage of its credible intervals in the null scenario, and overcorrects the selection bias in the alternative scenario, which may also be caused by the estimates not having a unimodal distribution.Yet both shortcomings are largely repaired by the inclusion of convolution for the VanZwet2021 method.

Case study on B. napus field trial
A relevant application for selection bias correction in presence of dependence is in predictive modelling with omics data.Often, several prediction models are built for the same outcome variable, including single-feature models and multi-feature models.It then often seems as though the best singlefeature model outperforms the multi-feature model in terms of predictive power (Zhou et al., 2021;Cruz et al., 2020).This is also observed in a study by De Meyer et al. (2023), where leaf gene expression and several phenotypes were measured for individual field-grown Brassica napus (rapeseed) plants.
The aim was to build prediction models based on leaf gene expression in autumn for phenotypes measured the following spring, and estimate the genes' predictive performance.Figure S21 shows the root mean squared error (RMSE) of all singlegene prediction models for predicting six selected outcome phenotypes.Because of common dependence on a single vector of outcome values, these estimates are correlated (see Supplementary Figure S16).In addition, a multi-gene linear model predicting the phenotype concerned from the expression of all genes combined was estimated through elastic net (EN) (see Supplementary Section 3.2 for methodological details).For four out of six phenotypes considered, namely leaf 8 width, leaf count, plant height and total shoot dry weight, the best singlegene model has a lower estimated RMSE than the multi-gene model.This is a plausible result, as even though the multi-gene model is given at least as much information as the single-gene model, its higher model complexity may degrade prediction accuracy.It is also a desirable result, as a single marker feature is easier to measure than the ensemble of all features.Yet it may also be a case of winner's curse, so we applied the correction methods investigated to the single-gene RMSEs.
Figure 6 shows the smallest raw and corrected single-gene RMSE estimates as well as the multi-gene estimate, with confidence intervals.For leaf count, plant height and total shoot dry weight, the smallest single-gene RMSE point estimate obtained by most selection bias correction methods is no longer smaller than the EN estimate, but for leaf 8 width it still is.Yet also for phenotypes for which the smallest raw RMSE estimate was not smaller than the EN estimate (leaf 8 length and total seed count), the smallest corrected RMSE values are larger than the raw estimate, corroborating our initial suspicion that the latter ones are underestimates.The RMSE estimates of the separate estimation method are mostly the largest with the widest confidence intervals; both observations are due to the smaller sample size on which the RMSEs are estimated.The empirical Bayes credible intervals are narrower than the raw confidence intervals, although often unlikely narrow for the ash method.Often, the feature with the smallest raw RMSE is no longer the smallest one after correction, i.e. the feature ranking has changed (Supplementary Table S1).

Discussion
Empirical Bayes methods have previously been found to perform poorly when applied to dependent estimates (Tan et al., 2015;Bigdeli et al., 2016;Ferguson et al., 2013), leading to claims that Tweedie's formula is invalid in this case (Tan et al., 2015) and to concerns over the performance of ash (Stephens, 2017).Here we prove that these performance issues originate from the random nature of the observed marginal distribution under dependence, causing the width of the true marginal distribution to be underestimated.This underestimation introduces bias into Tweedie's formula and leads ash to underestimate prior and posterior variability.
We provide a remedy in the form of a simple convolution scheme that only requires an estimate of the average correlation between estimators to counter most of the dependence effect.After this modification, Tweedie's formula and ash regained a competitive performance in our simulations.A third empirical Bayes method by van Zwet et al. ( 2021) was found to be more robust to dependence, but when the correlation between estimates is high its performance also deteriorates, which can be remedied by our convolution scheme too.The collection of most promising estimates in a screening experiment are often biased towards extreme values.Although ubiquitous, this winner's curse is often ignored, possibly because of lack of awareness or because existing remedies are not widely known.We have benchmarked four classes of correction methods for selection bias in parametric as well as non-parametric simulation: separate estimation, conditional likelihood estimation, three bootstrap methods (BR-squared, Tan2015 and winnerscurse) and three empirical Bayes methods (Tweedie's formula, ash and VanZwet2021), see Table S3 for an overview.Separate estimation is the most drastic solution to selection bias, tackling its root cause by using different parts of the dataset for feature selection and for estimation.It eradicates selection bias, achieves nominal coverage of the confidence intervals and is robust to dependence between estimates.Yet this comes at a high cost in estimator variance and feature ranking accuracy, and it may still be biased when the estimate depends on the sample size, as for prediction models.In addition, the result of the analysis is random as a result of the single data split, which is undesirable.Although this approach may be commendable when the sample size is large, splitting the dataset in two is no substitute for confirmation from an independent experiment.Conditional likelihood is only available for maximum likelihood estimation, but there it greatly reduces selection bias, although its estimates are imprecise and it tends to overcorrect when the estimates are correlated.The three bootstrap methods considered in our comparison, BR-squared, Tan2015 and winnerscurse, yielded different results.The BR-squared correction performs poorly in absence of correlation between estimates, but recovers in presence of such correlation.In the original publication, it was only applied to SNPs which are naturally correlated, but it appears not to generalise well to settings without correlation.Also its confidence intervals relying on normality of the estimator do not reach the nominal coverage.The bootstrap method by Tan et al. (2015), on the other hand, is a robust method that effectively combats selection bias both in presence and absence of dependence, and its confidence intervals achieve good coverage.Presumably because this work is only available as a preprint, it has received little attention in the literature.Both these methods use the average difference of the ranked estimate of a bootstrap sample with a reference quantity to estimate the selection bias.The BR-squared method uses the estimate from a small, random set of out-of-sample observations as reference, whereas the Tan2015 method uses the estimate from the fixed, larger set of observations from the real dataset.The latter option requires fewer parameters to be estimated, is theoretically better justified and performs best in our simulation study, and hence it should be preferred.Both bootstrap methods do not require an estimate of the standard error and make few distributional assumptions.On the downside, these methods do not work for small sample sizes and they can be computationally demanding, especially for the construction of confidence intervals.The winnerscurse method is much faster as it only requires a single bootstrap sample and does not rely on the original dataset, yet it demands an estimate of the standard error, has a changeable selection bias correction performance and does not provide a confidence interval.Unlike the other methods considered, empirical Bayes methods were not developed specifically for combatting selection bias of extreme estimates, but simply provide a posterior distribution for all estimates, which is easily coupled to downstream analyses.Yet we found them to be a fast and reliable alternative to bootstrap methods for correcting winner's curse, if our convolution scheme is used to counter dependence effects.The empirical Bayes method by van Zwet et al. (2021), based on the joint distribution of signal-to-noise ratio and z-values, was found the be natively most robust to the effects of dependence.The VanZwet2021 method and ash are preferable when the true effects are known to be unimodally distributed; otherwise Tweedie's formula may be best as it is agnostic to the shape of the prior distribution.On the downside, empirical Bayes methods rely on accurate estimation of estimator variance and densities, and on distributional assumptions, which limit computing time but also make them dependent on high-dimensionality and less robust than bootstrap methods.When a clear null value and an estimator for the standard error are available, conversion of the raw estimates to test statistics improves ranking of the features and hence also feature selection.In presence of high correlation, also decorrelating the raw estimates or test statistics can improve feature ranking.On the contrary, methods that work well for reducing selection bias can exhibit a poorer feature ranking than the raw estimates, which suggests a bias-variance trade-off.In a case study on a B. napus field trial, we have demonstrated how the apparent superiority of the best singlegene prediction model over the multi-gene prediction model for some phenotypes was most likely a case of selection bias.This result calls for caution in marker selection: a correction for winner's curse is needed.Based on our results, we recommend the Tan2015 bootstrap and empirical Bayes methods, where applicable with our convolution scheme, for this purpose.

Fig. 1 .
Fig.1.Histograms of independent estimates (left) and dependent estimates (right), see Supplementary Section 2.2.3.1 for details.The true value is shown as a dashed red vertical line, the naive density estimator is overlaid as a black line, the true marginal distribution as a blue line, and the convoluted density estimate (see Results section) as a green line.In the left column the black, blue and green lines largely overlap.Convolution of the estimated densities yields estimates of similar spread as the true densities, even though they are still shifted.

Fig. 2 .
Fig.2.Bias in the estimation of the derivative of the log density of f (y-axis) as a function of z-value (x-axis), correlation between features ρ (rows) and number of features p (colour).The horizontal solid line indicates 0 (no bias), the dashed and dotted lines indicate first order and second order approximations to the bias as in (8) respectively.Notice the different scales of the y-axis; the bias grows in absolute value with ρ.

Fig. 3 .
Fig.3.Bias in the estimation of the derivative of the log density of f (y-axis) as a function of estimand γ (x-axis), correlation between features (rows), number of features p (column) and estimator (colour).The horizontal dotted line indicates 0 (no bias).For a correlation of 0, the effect of the convolution is negligible and the red and green lines overlap.

Fig. 5 .
Fig. 5. Nonparametric simulation results based on B. napus dataset (simulation study 2) for different methods (colour), performance measures (side panels) and phenotypes (top panels).The x-axis shows the number of top features considered, the y-axis shows the performance measure.The bias (top panels) was scaled to the [-1,1] range and the MSE and width of the confidence interval (second and bottom row) to the [0,1] range per phenotype for legibility (see case study below for an impression of their absolute values).The dotted horizontal lines indicate 0 for all performance measures and 1 for MSE and width of the interval.CV: cross-validation; SE: standard error.

Fig. 6 .
Fig.6.Point estimates of the RMSEs (dots) and corresponding 95% confidence or credible intervals (error bars) for the gene with smallest RMSE for different phenotypes of the B. napus dataset (side panels) and correction methods (colour); units are shown between brackets.The RMSE estimates of the multi-gene elastic net models are also shown, the dotted horizontal line corresponds to its point estimate.For the BRsquared and Tan2015 bootstrap methods, no confidence intervals were calculated as this would be too computationally demanding, winnerscurse does not provide confidence intervals.CV SE: cross-validation standard error.