
A simple new approach to variable selection in regression, with application to genetic fine-mapping

Gao Wang, Abhishek Sarkar, Peter Carbonetto, Matthew Stephens
doi: https://doi.org/10.1101/501114

Abstract

We introduce a simple new approach to variable selection in linear regression, and to quantifying uncertainty in selected variables. The approach is based on a new model – the “Sum of Single Effects” (SuSiE) model – which comes from writing the sparse vector of regression coefficients as a sum of “single-effect” vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure – Iterative Bayesian Stepwise Selection (IBSS) – which is a Bayesian analogue of traditional stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We show that the IBSS algorithm computes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally leads to a convenient, novel way to summarize uncertainty in variable selection, providing a Credible Set for each selected variable. Our methods are particularly well suited to settings where variables are highly correlated and true effects are very sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and we illustrate the methods by fine-mapping genetic variants that influence alternative splicing in human cell lines. We also discuss both the potential and the challenges of applying these methods to generic variable selection problems.

1. Introduction

The need to identify, or “select”, relevant variables in regression models arises in a diverse range of applications, and has spurred the development of a correspondingly diverse range of methods (e.g., see O’Hara and Sillanpää, 2009; Fan and Lv, 2010; Desboulets, 2018, for reviews). However, variable selection is a complex problem, and despite considerable work there remain important issues that existing methods do not fully address. Here we develop a new approach to this problem that has several attractive features: it is simple, computationally scalable, and it provides new, more effective ways to capture uncertainty in which variables should be selected. Our new approach is particularly helpful in situations involving highly correlated variables, where it may be impossible to confidently select any individual variable, but it may nonetheless be possible to confidently draw useful conclusions such as “either variable A or B is relevant”. More generally, it may be possible to confidently identify “Credible Sets” of (correlated) variables that each, with high probability, contain a relevant variable. Our new approach can quickly, simply and reliably identify such sets, as well as prioritize the variables within each set.

Our work, although potentially of broad interest, is particularly motivated by genetic fine-mapping studies (e.g., Veyrieras et al., 2008; Maller et al., 2012; Spain and Barrett, 2015; Huang et al., 2017; Schaid et al., 2018), which aim to identify which genetic variants influence some trait of interest (e.g., LDL cholesterol in blood, gene expression in cells). Genetic fine-mapping can be helpfully framed as a variable selection problem, by building a regression model with the trait as the outcome and genetic variants as predictor variables (Sillanpää and Bhattacharjee, 2005). Performing variable selection in this model identifies variants that may causally affect the trait, and this – rather than prediction accuracy – is the main goal here. Fine-mapping is an example of a variable selection problem that often involves highly correlated variables: neighboring genetic variants are often highly correlated, a phenomenon called linkage disequilibrium (Ott, 1999).

Our approach builds on previous work on Bayesian variable selection regression (BVSR) (Mitchell and Beauchamp, 1988; George and McCulloch, 1997), which is already commonly used for genetic fine-mapping and related applications (e.g. Meuwissen et al., 2001; Sillanpää and Bhattacharjee, 2005; Servin and Stephens, 2007; Guan and Stephens, 2011; Bottolo et al., 2011; Maller et al., 2012; Carbonetto and Stephens, 2012; Hormozdiari et al., 2014; Chen et al., 2015; Wallace et al., 2015; Wen et al., 2016; Lee et al., 2018). BVSR has several appealing features compared with other approaches to variable selection. In particular, in principle, BVSR can assess uncertainty about which variables to select, even in the presence of strong correlations among variables. However, applying BVSR in practice remains difficult for two reasons. First, BVSR is computationally challenging, often requiring implementation of sophisticated Markov chain Monte Carlo or stochastic search algorithms (Bottolo and Richardson, 2010; Bottolo et al., 2011; Guan and Stephens, 2011; Wallace et al., 2015; Benner et al., 2016; Wen et al., 2016; Lee et al., 2018). Second, and perhaps more important, the output of existing methods for fitting BVSR is typically a complex posterior distribution, or a sample from it, which can be difficult to distill into easily-interpretable conclusions.

Our new approach addresses these limitations of BVSR through several innovations. First, we introduce a new formulation of BVSR, which we call the Sum of Single Effects (SuSiE) model. This model, while similar to existing BVSR models, has a different structure that naturally leads to a very simple, intuitive and fast fitting procedure – Iterative Bayesian Stepwise Selection (IBSS) – which is a Bayesian analogue of traditional stepwise selection methods. We show that IBSS can be viewed as computing a variational approximation to the posterior distribution under the SuSiE model. Unlike previous variational approaches to sparse regression (Logsdon et al., 2010; Carbonetto and Stephens, 2012), this new approach deals well with correlated variables. Furthermore, the approximate posterior leads immediately to Credible Sets of variables that are designed to be as small as possible while still each capturing a relevant variable. Arguably this is exactly the kind of posterior summary that one would like to obtain from MCMC-based or stochastic search BVSR methods, but doing so would require non-trivial post-processing of their output. In contrast, our method provides this posterior summary directly, and at a fraction of the computational effort.

The structure of the paper is as follows. In Section 2 we provide further motivation for our work, and brief background on BVSR. Section 3 describes the new SuSiE model and fitting procedure. In Section 4 we use simulations, designed to mimic realistic genetic fine-mapping studies, to demonstrate the effectiveness of our approach compared with existing BVSR methods. Section 5 illustrates our methods on fine-mapping of genetic variants affecting splicing, and Section 6 briefly highlights both the promise and the limitations of our methods for other applications, using change point problems as an example. We end with a Discussion highlighting avenues for further work.

2. Background

2.1 A motivating toy example

Suppose the relationship between y (an n-vector) and variables X = (x1,…, xp), an n × p matrix, is modeled as a multiple regression:

y = Xb + e,  e ∼ N(0, σ²In),   (2.1)

in which b is a p-vector of regression coefficients, e is an n-vector of error terms, σ² > 0 is the residual variance, and In is the n × n identity matrix. For brevity, we will refer to variables j with non-zero effects (bj ≠ 0) as “effect variables”. We assume that exactly two variables are effect variables – variables 1 and 4, say – and that each of these two effect variables is completely correlated with a non-effect variable, say x1 = x2 and x3 = x4. We further suppose that no other pairs of variables are correlated.

In this situation, given sufficient data, it should be possible to conclude that there are (at least) two effect variables. However, because the effect variables are completely correlated with other variables, it will be impossible to confidently select the correct variables, even when n is very large. The best we can hope to infer is that

(b1 ≠ 0 or b2 ≠ 0) and (b3 ≠ 0 or b4 ≠ 0).   (2.2)

Our goal, in short, is to provide methods that directly produce this kind of inferential statement. Although this example is simplistic, it mimics the kind of structure that occurs in, for example, genetic fine-mapping applications, where it often happens that an association can be narrowed down to a small set of highly correlated genetic variants, and it is desired to provide a quantitative summary about which genetic variants are, based on the data, the plausible effect variables.
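To make this setup concrete, the following small R sketch (ours, not from the paper) constructs a toy data set of the kind described above, with x1 = x2, x3 = x4, and effect variables 1 and 4; the sample size and effect sizes are arbitrary.

```r
set.seed(1)
n <- 1000

# Duplicate two independent columns so that x1 = x2 and x3 = x4 exactly.
z1 <- rnorm(n)
z2 <- rnorm(n)
X  <- cbind(x1 = z1, x2 = z1, x3 = z2, x4 = z2)

# Variables 1 and 4 are the effect variables; variables 2 and 3 have no effect.
b <- c(1, 0, 0, 0.5)
y <- drop(X %*% b) + rnorm(n)

# Columns 1 and 2 (and likewise 3 and 4) are indistinguishable, so no method
# can confidently choose between them; the best possible inference is (2.2).
identical(X[, 1], X[, 2])  # TRUE
```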

Most existing approaches to sparse regression do not provide statements like (2.2). For example, sparse regression methods based on penalized likelihood (e.g., Lasso; Tibshirani, 1996) would, in our example, select one of the four equivalent configurations (combinations of variables) – {1, 3}, {1, 4}, {2, 3} or {2, 4} – and give no indication that other configurations are equally plausible. Attempting to control error rates of false discoveries at the level of individual variables using methods such as stability selection (Meinshausen and Bühlmann, 2010) or the knockoff filter (Barber and Candes, 2015) will result in no discoveries, since no individual variable can be confidently declared an effect variable. One potential solution would be to first cluster the variables into groups of highly correlated variables and then perform some kind of “group selection” (Huang et al., 2012) or hierarchical testing (Meinshausen, 2008; Mandozzi and Bühlmann, 2016). However, although this might work in our stylized example, in general this approach involves ad hoc decisions about which variables to cluster together – an unattractive feature we seek to avoid.

In principle, Bayesian approaches (BVSR; see Introduction for references) can solve this problem. These approaches introduce a prior distribution on the coefficients b to capture the notion that b is sparse, then compute (approximately) a posterior distribution that assesses relative support for each configuration. In our example, the posterior distribution would typically have equal mass (≈0.25) on the four equivalent configurations. This posterior distribution contains exactly the information necessary to infer (2.2). However, even in this simple example, translating the posterior distribution to the simple statement (2.2) requires some effort, and in more complex settings such translations become highly non-trivial. Practical implementations of BVSR typically summarize the posterior distribution by the marginal posterior inclusion probability (PIP) of each variable,

PIPj := Pr(bj ≠ 0 | X, y).   (2.3)

In our example, they would report PIP1 ≈ PIP2 ≈ PIP3 ≈ PIP4 ≈ 0.5. While not inaccurate, these marginal PIPs nonetheless fail to convey the information necessary to infer (2.2).

2.2 Credible Sets

To define our main goal more formally, we introduce the concept of a Credible Set (CS) of variables:

Definition 1.

In the context of a multiple regression model, we define a level-ρ Credible Set to be a subset of variables that has probability ⩾ρ of containing at least one effect variable (i.e., a variable with non-zero regression coefficient). Equivalently, the probability that all variables in the Credible Set have zero regression coefficients is no more than 1 – ρ.

Our use of the term “Credible Set” here indicates that we have in mind a Bayesian inference approach, in which the probability statements in the definition are statements about uncertainty in the regression coefficients with respect to the available data and modelling assumptions. One could analogously define a “Confidence Set” by interpreting the probability statements as referring to the set, considered random.

Although the term “Credible Set” has been used in fine-mapping applications before, most previous uses either assumed there was a single effect variable (Maller et al., 2012), or defined a CS as a set that contains all effect variables (Hormozdiari et al., 2014), which is a very different definition (and, we argue, both less informative and less attainable). Our definition here is closer to the “signal clusters” from Lee et al. (2018). It is also related to the idea of “minimal true detection” in Mandozzi and Bühlmann (2016).

With Definition 1, our primary aim can be restated: we wish to report as many CSs as the data support, each with as few variables as possible. For example, to infer (2.2) we would like to report two CSs, {1, 2} and {3, 4}. As a secondary goal, we would also like to prioritize the variables within each CS, assigning each a probability that reflects the strength of the evidence for that variable being an effect variable. Our methods achieve both these goals.

2.3 The single effect regression model

The building block for our approach is the “single effect regression” (SER) model, which we define as a multiple regression model in which exactly one of the p explanatory variables has a non-zero regression coefficient. This idea was introduced in Servin and Stephens (2007) to fine-map genetic associations, and subsequently adopted and extended by others, such as Veyrieras et al. (2008) and Pickrell (2014). Although of very narrow applicability, the SER model is trivial to fit. Furthermore, when its assumptions hold, the SER provides exactly the inferences we desire, including CSs. For example, if we simplify our motivating example (Section 2.1) to have a single effect variable – variable 1, for example – then the SER model would, given sufficient data, infer a 95% CS containing both of the correlated variables, 1 and 2, with PIPs of approximately 0.5 each. This CS tells us that we can be confident that one of the two variables has a non-zero coefficient, but we do not know which one.

Specifically, we consider the following SER model, with hyperparameters σ² (the residual variance), σ0² (the prior variance of the non-zero effect) and π = (π1,…, πp) (a p-vector of prior inclusion probabilities, with πj giving the prior probability that variable j is the effect variable):

y = Xb + e   (2.4)
e ∼ N(0, σ²In)   (2.5)
b = γb   (2.6)
γ ∼ Mult(1, π)   (2.7)
b ∼ N(0, σ0²).   (2.8)

Here, y is the n-vector of response data; X is an n×p matrix; b is a p-vector of regression coefficients; e is an n-vector of independent error terms; and Mult (m, π) denotes the multinomial distribution on class counts obtained when m samples are drawn with class probabilities given by π. (We assume that y and the columns of X have been centered to have mean zero, which avoids the need for an intercept term; see Chipman et al. 2001.)

Under the SER model (2.4–2.8), the effect vector b has exactly one non-zero element (equal to the scalar b), so we refer to b as a “single effect vector”. The element of b that is non-zero is determined by the binary vector γ, which also has exactly one non-zero entry. The probability vector π determines the prior probability distribution of which of the p variables is the effect variable. In the simplest case, π = (1/p,…, 1/p); we assume this uniform prior here, but the SER model requires only that π is fixed and known (so in fine-mapping one could incorporate different priors based on genetic annotations; e.g., Veyrieras et al., 2008). To lighten notation, we henceforth make conditioning on π implicit.

Posterior under SER model

Given σ² and σ0², the posterior distribution on b = γb is easily computed:

γ | X, y, σ², σ0² ∼ Mult(1, α)   (2.9)
b | X, y, σ², σ0², γj = 1 ∼ N(μ1j, σ1j²),   (2.10)

where α = (α1,…, αp) is the vector of PIPs, and μ1j and σ1j² are the posterior mean and variance of b given γj = 1. Calculating these quantities simply involves performing the p univariate regressions of y on xj, for j = 1,…, p, as detailed in Appendix A. From α, it is also simple to compute a level-ρ CS (Definition 1), CS(α; ρ), as described in Maller et al. (2012). In brief, this involves sorting variables by decreasing αj, then including variables in the CS until their cumulative probability exceeds ρ (Appendix A).
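For illustration, the following R sketch (our own code, not the susieR implementation) computes the SER posterior summaries (α, μ1, σ1) from the p univariate regressions, assuming centered data, a uniform prior π, and the standard Gaussian Bayes factor; it also constructs a level-ρ CS by the sorting rule just described.

```r
# Single effect regression (SER): posterior summaries given the residual
# variance s2 = sigma^2 and prior effect variance s02 = sigma_0^2.
# Assumes y and the columns of X are centered, and a uniform prior by default.
ser_posterior <- function(X, y, s2, s02, prior = rep(1/ncol(X), ncol(X))) {
  xtx   <- colSums(X^2)             # x_j' x_j for each variable j
  xty   <- drop(crossprod(X, y))    # x_j' y
  bhat  <- xty / xtx                # univariate least-squares estimates
  shat2 <- s2 / xtx                 # their sampling variances
  z2    <- bhat^2 / shat2
  # log Bayes factor for each variable: N(0, s02) prior vs. point mass at zero
  logbf <- 0.5 * log(shat2 / (shat2 + s02)) + 0.5 * z2 * s02 / (s02 + shat2)
  w     <- log(prior) + logbf
  alpha <- exp(w - max(w))
  alpha <- alpha / sum(alpha)                 # PIPs, as in (2.9)
  post_var  <- 1 / (1/s02 + xtx/s2)           # sigma_1j^2
  post_mean <- post_var * xty / s2            # mu_1j, as in (2.10)
  list(alpha = alpha, mu1 = post_mean, sigma1 = sqrt(post_var), logbf = logbf)
}

# Level-rho credible set: smallest set of variables whose PIPs sum to >= rho.
credible_set <- function(alpha, rho = 0.95) {
  ord <- order(alpha, decreasing = TRUE)
  ord[seq_len(which(cumsum(alpha[ord]) >= rho)[1])]
}
```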

For later convenience, we introduce SER as shorthand for the function that returns the posterior distribution for b under the SER model. Since this posterior distribution is uniquely determined by the values of α, μ1 and σ1 in (2.9–2.10), we can define SER as

SER(X, y; σ², σ0²) := (α, μ1, σ1),   (2.11)

where μ1 := (μ11,…, μ1p) and σ1 := (σ11,…, σ1p).

Empirical Bayes for SER model

Although most previous treatments of the SER model assume σ0² and σ² to be fixed and known, we note here the possibility of estimating σ0² and/or σ² by maximum likelihood before computing the posterior distribution of b. This is effectively an “Empirical Bayes” approach. The likelihood for σ0² and σ² under the SER model,

p(y | X, σ0², σ²),   (2.12)

is available in closed form (Appendix A), and can be maximized over one or both parameters numerically.
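As a sketch of this Empirical Bayes step (again ours, and reusing the ser_posterior function above), note that under a uniform prior the marginal likelihood factorizes into a term that does not depend on σ0² times the average Bayes factor, so a one-dimensional optimizer suffices:

```r
# Profile the SER marginal likelihood (2.12) over the prior variance sigma_0^2,
# with sigma^2 held fixed. Assuming a uniform prior pi, and up to a constant
# that does not depend on s02, log p(y | X, s2, s02) = log mean_j BF_j(s02),
# so it suffices to maximize the log mean Bayes factor.
estimate_prior_variance <- function(X, y, s2, upper = 10 * var(y)) {
  loglik <- function(s02) {
    fit <- ser_posterior(X, y, s2, s02)
    m <- max(fit$logbf)
    m + log(mean(exp(fit$logbf - m)))   # log mean BF, computed stably
  }
  optimize(loglik, interval = c(0, upper), maximum = TRUE)$maximum
}
```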

3. The Sum of Single Effects Regression model

We now introduce a new approach to variable selection in multiple regression. Our approach is motivated by the observation that the SER model provides simple inference if there is indeed exactly one effect variable; it is thus desirable to extend the SER to allow for multiple variables. The conventional approach to doing this in BVSR is to introduce a prior on b that allows for multiple non-zero entries (e.g., using the “spike and slab” prior of Mitchell and Beauchamp, 1988). However, this approach no longer enjoys the convenient analytical properties and intuitive interpretation of the posterior distribution under the SER model: posterior distributions become difficult to compute accurately, and CSs even harder.

Here we introduce a different approach, which better preserves the nice features of the SER model. The key idea is straightforward to describe: introduce multiple single-effect vectors, b1,…, bL, and construct the overall effect vector b as the sum of these single effects. We call this the “SUm of SIngle Effects” (SuSiE) regression model:

y = Xb + e   (3.1)
e ∼ N(0, σ²In)   (3.2)
b = b1 + … + bL   (3.3)
bl = γl bl   (3.4)
γl ∼ Mult(1, π)   (3.5)
bl ∼ N(0, σ0l²),   for l = 1,…, L.   (3.6)

For generality, we have allowed the variance of each effect, σ0l², to vary among the components l = 1,…, L. The special case L = 1 recovers the SER model. For simplicity, we initially assume σ² and σ01²,…, σ0L² are given, and defer estimation of these hyperparameters to Section 3.3.

Note that if L≪p then the SuSiE model is approximately equivalent to a standard BVSR model in which L randomly chosen variables have non-zero coefficients. The only difference is that with some (small) probability some of the bl in the SuSiE model may have the same non-zero co-ordinates, and so the number of non-zero elements in b has some (small) probability to be less than L. Thus at most L variables have non-zero coefficients in this model. We discuss choice of L in Section 3.5.

Although the SuSiE model is approximately equivalent to a standard BVSR model, its novel structure has two major advantages. First, it leads to a simple, iterative and deterministic algorithm for computing approximate posterior distributions. Second, it yields a simple way to calculate the CSs. In essence, because each bl captures only one effect, the posterior distribution on each γl can be used to compute a CS that has a high probability of containing an effect variable. The remainder of this section details both these advantages, and other issues that may arise in fitting the model.

3.1 Fitting SuSiE: Iterative Bayesian stepwise selection

The key to the SuSiE model-fitting procedure is that, given b1,…, bL−1, estimating bL involves fitting the simpler SER model, which is analytically tractable. This immediately suggests an iterative algorithm that uses the SER model to estimate each bl in turn, given estimates of the other bl′ (l′ ≠ l). Algorithm 1 details such an algorithm, which is both simple and computationally scalable (computational complexity O(npL) per iteration).

Algorithm 1

Iterative Bayesian stepwise selection


We call Algorithm 1 “Iterative Bayesian Stepwise Selection” (IBSS) because it can be viewed as a Bayesian version of stepwise selection approaches. For example, we can compare it with an approach referred to as “Forward-Stagewise” (FS) selection in Hastie et al. (2009), Section 3.3.3 (although subsequent literature often uses this term slightly differently). In brief, FS first selects the single “best” variable among p candidates by comparing the results of the p univariate regressions. It then computes the residuals from the univariate regression on this selected variable, and selects the next “best” variable by comparing the results of univariate regressions of the residuals on each variable. This process continues, selecting one variable each iteration, until some stopping criterion is reached.

IBSS is similar to FS, but instead of selecting a single “best” variable at each step, it computes a distribution on which variable to select, by fitting the Bayesian SER model (Step 5). Similar to FS, this distribution is based on the results of the p univariate regressions, and so IBSS has the same computational complexity as FS (𝒪(np) per selection). However, by computing a distribution on variables – rather than choosing a single best variable – IBSS captures uncertainty about which variable should be selected at each step. This uncertainty is taken into account when computing residuals (Step 4) by using a model-averaged (posterior mean) estimate for the regression coefficients. In IBSS we incorporate an iterative procedure, whereby early selections are re-evaluated in light of the later selections (as in “backfitting”; Friedman and Stuetzle, 1981). The final output of IBSS is L distributions on variables (parameterized by (αl, μ1l, σ1l), l = 1,…, L), in place of the L selected variables output by FS. Each distribution is easily summarized by, for example, a 95% CS for each selection.
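A minimal R sketch of the IBSS iteration, in our notation rather than the susieR implementation, is given below. It reuses the ser_posterior sketch from Section 2.3, fixes σ² and σ0² (the defaults mirror the settings used later in Section 4), and monitors convergence via the change in posterior means rather than the ELBO.

```r
# Iterative Bayesian stepwise selection (IBSS), fixed hyperparameters.
# Returns, for each of the L single effects, the PIP vector alpha_l and the
# posterior means mu1_l (so the posterior mean of b_l is alpha_l * mu1_l).
ibss <- function(X, y, L = 10, s2 = var(y), s02 = 0.1 * var(y),
                 maxit = 100, tol = 1e-6) {
  p <- ncol(X)
  alpha <- matrix(1/p, L, p)       # q_l parameters, initialized to uniform
  mu1   <- matrix(0, L, p)
  b_bar <- alpha * mu1             # posterior mean of each b_l (L x p)
  for (iter in 1:maxit) {
    b_old <- b_bar
    for (l in 1:L) {
      # Residualize y on the current estimates of all other effects (Step 4) ...
      r_l <- y - X %*% colSums(b_bar[-l, , drop = FALSE])
      # ... then fit the SER model to these residuals (Step 5).
      fit <- ser_posterior(X, drop(r_l), s2, s02)
      alpha[l, ] <- fit$alpha
      mu1[l, ]   <- fit$mu1
      b_bar[l, ] <- alpha[l, ] * mu1[l, ]
    }
    if (max(abs(b_bar - b_old)) < tol) break
  }
  list(alpha = alpha, mu1 = mu1, b = colSums(b_bar))
}
```

Each pass of the inner loop costs O(np), so one full iteration is O(npL), matching the complexity noted above.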

To illustrate, consider our motivating example (Section 2.1) with x1 = x2, x3 = x4, and with variables 1 and 4 having non-zero effects. Suppose for simplicity that the effect of variable 1 is substantially larger than the effect of variable 4. Then FS would first select either variable 1 or 2 (which one being arbitrary), and then select variable 3 or 4 (again, which one being arbitrary). In contrast, given enough data, the first step of IBSS would select both variables 1 and 2 (with equal weight, ≈0.5 each, and small weights on other variables). The second step of IBSS would similarly select variables 3 and 4 (again with equal weight, ≈0.5 each). Summarizing these results would yield two CSs, {1, 2} and {3, 4}, and the inference (2.2) falls into our lap. This simple example is intended only to sharpen intuition; later numerical experiments demonstrate that IBSS works well in realistic settings.

3.2 IBSS computes a variational approximation to the SuSiE posterior distribution

We now provide a more formal justification for the IBSS algorithm. Specifically, we show that it is a coordinate ascent algorithm for optimizing a variational approximation to the posterior distribution for b1,…, bL under the SuSiE model (3.1)–(3.6). This result also leads to a natural extension of the algorithm to estimate the hyperparameters σ² and σ01²,…, σ0L².

The idea behind variational approximation methods for Bayesian models (e.g., Blei et al., 2017) is to find an approximation q(b1,…, bL) to the posterior distribution ppost := p(b1,…, bL | y), by minimizing the Kullback–Leibler (KL) divergence from q to ppost, DKL(q, ppost), subject to constraints on q that make the problem tractable. Although DKL(q, ppost) is hard to compute, it can be written as:

DKL(q, ppost) = log p(y | X) − F(q),   (3.7)

where F is an easier-to-compute function known as the “evidence lower bound” (ELBO). (Note: F depends on the data y, X, but we suppress this dependence to lighten notation.) Because log p(y | X) does not depend on q, minimizing DKL over q is equivalent to maximizing F; and since F is easier to compute, this is how the problem is usually framed. See Appendix B.1 for further details.

Here we seek an approximate posterior, q, that factorizes:

q(b1,…, bL) = ∏l ql(bl).   (3.8)

Under this approximation b1,…, bL are independent a posteriori. We make no assumptions on the form of ql, and in particular we do not assume that ql factorizes over the p elements of bl. This is a crucial difference from previous variational approaches for standard multiple regression models (e.g. Logsdon et al., 2010; Carbonetto and Stephens, 2012), and it means that ql can capture the strong dependencies among the elements of bl that are induced by the assumption that exactly one element of bl is non-zero. Intuitively each ql captures one effect variable, and provides inferences of the form “we need one of variables {A, B, C} but we cannot tell which”. Similarly, the approximation (3.8) provides inferences of the form “we need one of variables {A, B, C} and one of variables {D, E, F, G}, and…”.

Under (3.8), finding the optimal q can now be written as:

(q̂1,…, q̂L) = argmax over (q1,…, qL) of F(q1,…, qL).   (3.9)

Although jointly optimizing F over (q1,…, qL) is not straightforward, it turns out to be very easy to optimize over a single ql (given ql′ for l′ ≠ l), by fitting an SER model, as formalized in the following proposition.

Proposition 1.

argmax over ql of F(q1,…, qL) = SER(X, rl; σ², σ0l²),   (3.10)

where rl denotes the residuals obtained by removing the estimated effects other than l:

rl := y − X Σl′≠l b̄l′,

and b̄l′ := E_ql′(bl′) denotes the expectation of bl′ under the distribution ql′.

For intuition into this proposition, recall that computing the posterior distribution for bl under the SuSiE model, if the other effects bl′ (l′ ≠ l) were known, reduces to fitting an SER model to the residuals y − X Σl′≠l bl′. Now consider computing an (approximate) posterior distribution for bl when the bl′ are not known, but we have approximations ql′ to their posterior distributions. This is, essentially, the problem of finding the ql that maximizes F(q1,…, qL). Proposition 1 says that we can solve this using a similar procedure as for known bl′, but replacing each bl′ with its (approximate) posterior mean b̄l′. The proof is given in Appendix B (Proposition 2).

The following corollary is an immediate consequence of Proposition 1:

Corollary 1.

Algorithm 1 is a coordinate ascent algorithm for maximizing the ELBO F, and therefore for minimizing the KL divergence DKL(q, ppost).

Proof. Step 5 of Algorithm 1 simply computes the right-hand side of equation (3.10). Thus, by Proposition 1, it is a coordinate ascent step for optimizing F(q1,…, qL) over the lth coordinate ql (the distribution ql being determined by the parameters αl, μ1l, σ1l).

3.3 Estimating σ² and σ0²

Algorithm 1 can be extended to estimate the hyperparameters σ² and σ0² := (σ01²,…, σ0L²), by adding steps to optimize the ELBO F over σ² and/or σ0². Estimating hyperparameters by maximizing the ELBO F is a commonly-used strategy in variational inference, and often performs well in practice (e.g., Carbonetto and Stephens, 2012).

Optimizing F over σ2 involves computing the expected residual sum of squares under the variational approximation, which is straightforward; see Appendix B for details.

Optimizing F over σ0² can be achieved by modifying the step that computes the posterior distribution for bl under the SER model (Step 5) to first estimate the hyperparameter σ0l² in the SER model by maximum likelihood, that is, by optimizing (2.12) over σ0l², keeping σ² fixed. This is a one-dimensional optimization which is easily performed numerically (we used the R function uniroot).

Algorithm 4 in Appendix B extends IBSS to include both these steps.

3.4 Posterior inference: posterior inclusion probabilities and Credible Sets

Algorithm 1 provides an approximation to the posterior distribution on b under the SuSiE model, parameterized by (α1, µ11, σ11),…, (αL, µ1L, σ1L). From these results it is straightforward to compute approximations to various posterior quantities of interest, including PIPs and CSs.

3.4.1 Posterior inclusion probabilities

Under the SuSiE model, the effect bj is zero if and only if γlj = 0 for all l = 1,…, L. Under the variational approximation the γl are independent across l, and so:

PIPj := Pr(bj ≠ 0 | X, y) ≈ 1 − ∏l (1 − αlj).

Here, the product is taken over only those single effects l with σ0l² > 0 (equivalently, αlj is treated as 0 for any l with σ0l² = 0), to account for the edge case where some estimated σ0l² = 0 (which can happen when σ0l² is estimated as in Section 3.3).
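Given the L × p matrix α of single-effect inclusion probabilities output by IBSS, the PIPs are a one-line computation; a small sketch (function and argument names ours):

```r
# PIPs from the L x p matrix `alpha` of single-effect inclusion probabilities:
# PIP_j = 1 - prod_l (1 - alpha_lj). Rows corresponding to single effects whose
# estimated prior variance is zero can be dropped via `keep`.
susie_pip <- function(alpha, keep = rep(TRUE, nrow(alpha))) {
  1 - apply(1 - alpha[keep, , drop = FALSE], 2, prod)
}
```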

3.4.2 Credible Sets

Simply computing the sets CS(αl; ρ) (A.17) for l = 1,…, L immediately yields L CSs that satisfy Definition 1 under the variational approximation to the posterior.

If L exceeds the number of detectable effects in the data, then in practice it turns out that many of the L CSs are large, often containing the majority of variables. The intuition is that once all the detectable signals have been accounted for, the IBSS algorithm becomes very uncertain about which variable to include at each step, and so the distributions αl become very diffuse. CSs that contain very many uncorrelated variables are of essentially no inferential value – whether or not they actually contain an effect variable – and so in practice it makes sense to ignore them. To automate this process, in this paper we discard CSs with purity <0.5, where we define purity as the smallest absolute correlation among all pairs of variables within the CS. (To reduce computation for CSs containing >100 variables, we sampled 100 variables at random to compute purity.) The purity threshold of 0.5 was chosen primarily for comparability with Lee et al. (2018), who use a similar threshold in a related context. Although any choice of threshold is somewhat arbitrary, in practice we observed that most CSs are either very pure (>0.95) or very impure (<0.05), with intermediate cases being rare (Figure S2), and so results are robust to this choice of threshold.
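Purity, as defined here, is also simple to compute; a sketch (ours), including the subsampling used for very large CSs:

```r
# Purity of a credible set: the smallest absolute pairwise correlation among
# its variables, computed on a random subsample of at most 100 variables.
cs_purity <- function(X, cs, max_size = 100) {
  if (length(cs) == 1) return(1)
  if (length(cs) > max_size) cs <- sample(cs, max_size)
  min(abs(cor(X[, cs])))
}
```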

3.5 Choice of L

It may seem that the choice of a suitable L would be crucial. However, in practice key inferences are robust to overstating L; for example, in our simulations later the true L is 1–5, but we obtain good results with L = 10. This is because, when L is too large, the method becomes very uncertain about where to place the additional (non-existent) effects; consequently it distributes them broadly among many variables, and so they have little impact on key inferences. For example, setting L too large inflates the PIPs of many variables just very slightly, and leads to some low-purity CSs that are filtered out (see Section 3.4.2).

Although inferences are generally robust to overstating L, we also note that the Empirical Bayes version of our method, which estimates σ0l², also effectively estimates the number of effects: when L is greater than the number of signals in the data, the maximum-likelihood estimate of σ0l² is often 0 for many l, which forces bl = 0. This is connected to “Automatic relevance determination” (Neal, 1996).

4. Numerical Comparisons

We use simulations to assess our methods and compare with standard BVSR methods. Our simulations are designed to mimic genetic fine-mapping studies, in particular fine-mapping of expression quantitative trait loci (eQTLs) – eQTLs are genetic variants associated with gene expression.

In genetic fine-mapping, X is a matrix of genotype data, in which each row corresponds to an individual, and each column corresponds to a genetic variant, typically a single nucleotide polymorphism (SNP). In our simulations, we used real human genotype data from the n = 574 samples collected as part of the Genotype-Tissue Expression (GTEx) project (GTEx Consortium, 2017). To simulate fine-mapping of cis effects on gene expression, we randomly select 150 genes from the >20,000 genes on chromosomes 1–22, then take X to be the genotypes for genetic variants near the transcribed region of the selected gene. For a given gene, between p = 1,000 and p = 12,000 SNPs are included in the fine-mapping analysis; for more details on how SNPs are selected, see Appendix C.

To assess the accuracy of SuSiE inferences by comparing estimates against ground truth, we generate synthetic gene expression data y under the multiple regression model (2.1), with various assumptions on the effects b. We specify our assumptions about the simulated effects b using two parameters: S, the number of effect variables; and ϕ, the proportion of variance in y explained by X (“PVE” for short).

We consider two sets of simulations. In the first set, each data set has exactly p = 1,000 SNPs. We simulate data sets under all combinations of S ∈ {1,…, 5} and ϕ ∈ {0.05, 0.1, 0.2, 0.4}. These settings were chosen to span typical values for eQTL studies. Given choices of S and ϕ, we take the following steps to simulate gene expression data:

  1. Sample the indices 𝒮 of the S effect variables uniformly at random from {1,…, p}.

  2. For each j ∈ 𝒮, draw bj ∼ N(0, 0.6²), and set bj = 0 for all j ∉ 𝒮.

  3. Set σ² to achieve the desired PVE, ϕ; specifically, we solve for σ² in ϕ = Var(Xb) / (Var(Xb) + σ²), where Var(·) denotes the sample variance.

  4. For each i = 1,…, 574, draw yi ∼N (xi1b1 +… +xipbp, σ2).

We simulate two replicates for each gene and each scenario, resulting in a total of 2 × 150 × 4 × 5 = 6,000 simulations.
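For concreteness, a minimal R sketch of steps 1–4 above for a single data set is shown below; geno stands for the (centered) genotype matrix of the selected gene, and the function name is ours.

```r
# Simulate expression for one gene: `geno` is the n x p genotype matrix for
# the selected gene; S, pve and effect_sd correspond to S, phi and the 0.6
# effect standard deviation described in the text.
simulate_expression <- function(geno, S = 3, pve = 0.2, effect_sd = 0.6) {
  p <- ncol(geno)
  b <- rep(0, p)
  effects <- sample(p, S)                      # step 1: pick the effect variables
  b[effects] <- rnorm(S, sd = effect_sd)       # step 2: draw their effect sizes
  xb <- drop(geno %*% b)
  s2 <- var(xb) * (1 - pve) / pve              # step 3: residual variance giving the target PVE
  y  <- xb + rnorm(nrow(geno), sd = sqrt(s2))  # step 4: simulate expression
  list(y = y, b = b, sigma2 = s2)
}
```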

In the second set of simulations, we generate larger data sets with 3,000 to 12,000 SNPs. To simulate gene expression data, we set S = 10 and ϕ = 0.3. Again, we simulate two replicates for each gene, so in total we generate an additional 2 × 150 = 300 data sets in the second set of simulations.

4.1 Illustrative example

We begin with an example to illustrate that, despite its simplicity, the IBSS algorithm (Algorithm 1) can perform well in a challenging situation. This example is given in Figure 1.

Figure 1: Illustration that IBSS algorithm can deal with a challenging case.

Results are from a simulated data set with p = 1,000 variables (SNPs), of which two are effect variables (labeled “SNP 1” and “SNP 2”, in red). This example was chosen because the strongest marginal association is with a non-effect variable (at position 780 on the x-axis); see the p-values in the left-hand panel. Despite its simplicity, the IBSS algorithm converges to a solution in which the two 95% CSs – represented by the light and dark blue open circles in the right-hand panel – each contain a true effect variable. Additionally, neither CS contains the variable that has the strongest marginal association. One CS contains only 3 SNPs, whereas the other CS (in dark blue) contains 37 very highly correlated variables (minimum pairwise absolute correlation of 0.972). In the latter CS, the individual PIPs are small, but the inclusion of the 37 variables in this CS indicates, correctly, high confidence in one effect variable among them.

We draw this example from one of our simulations in which the variable most strongly associated with y is not one of the actual effect variables (in this particular example, there are two effect variables). This situation occurs because at least one variable has moderate correlation with both effect variables, and these effects combine to make its marginal association stronger than the marginal associations of the individual effect variables. Standard forward selection in this case would select the wrong variable in the first step; by contrast, the iterative nature of IBSS allows it to escape this trap. Indeed, in this example IBSS yields two high-purity CSs, each containing one of the effect variables.

Interestingly, in this example the most strongly associated variable does not appear in either CS. This illustrates that multiple regression can sometimes result in very different conclusions compared to a marginal association analysis. An animation showing the iteration-by-iteration progress of the IBSS algorithm can be found at our manuscript resource repository (Wang et al., 2018).

4.2 Posterior inclusion probabilities

Next we seek to assess the effectiveness of our methods more quantitatively. We focus initially on one of the simpler tasks in BVSR: computing posterior inclusion probabilities (PIPs). Most implementations of BVSR compute PIPs, making it possible to compare results across several implementations. Here we compare our methods (henceforth SuSiE for short, implemented in an R package, susieR, version 0.4.29) with three other software implementations aimed at genetic fine-mapping applications: CAVIAR (Hormozdiari et al., 2014, version 2.2), FINEMAP (Benner et al., 2016, version 1.1) and DAP-G (Wen et al., 2016; Lee et al., 2018, GitHub commit ef11b26). These C++ software packages implement different algorithms to fit similar BVSR models, which differ in details such as priors on effect sizes. CAVIAR exhaustively evaluates all possible combinations of up to L non-zero effects among the p variables, whereas FINEMAP and DAP-G approximate this exhaustive approach by heuristics that target the best combinations. Another difference among methods is that FINEMAP and CAVIAR perform inference using summary statistics computed for each dataset – specifically, the marginal association Z scores and the p × p correlation matrix for all variables – whereas, as we apply them here, DAP-G and SuSiE use the full data. The summary statistic approach can be viewed as approximating inferences from the full data; see Lee et al. (2018) for discussion.

For SuSiE, we set L = 10 for the first set of simulations, and L = 20 for the data sets with the larger numbers of SNPs. We assessed performance when estimating both hyperparameters σ² and σ0², and when fixing one or both of these. Overall, results from these different strategies were similar. In the main text, we show results obtained when estimating σ² and fixing σ0² to 0.1 Var(y), to be consistent with the data applications in Section 5; other results are found in Supplementary Data (Figure S4, Figure S5). Parameter settings for other methods are given in Appendix C. Since CAVIAR and FINEMAP were much more computationally intensive than DAP-G and SuSiE, we ran all methods in simulations with S = 1, 2, 3, and only ran DAP-G and SuSiE in simulations with S > 3.

Since these methods differ in their modelling assumptions, we do not expect their PIPs to agree exactly. Nonetheless, we found generally good agreement (Figure 2A). For S = 1, the PIPs from all four methods closely agree. For S > 1, the PIPs from different methods are also highly correlated; correlations between PIPs from SuSiE and other methods vary from 0.94 to 1, and the proportions of PIPs differing by more than 0.1 between methods vary from 0.013% to 0.2%. Visually, this agreement appears less strong because the eye is drawn to the small proportion of points that lie away from the diagonal, while the vast majority of points lie on or near the origin. In addition, all four methods produce reasonably well-calibrated PIPs (Figure S1).

Figure 2 Evaluation of posterior inclusion probabilities (PIPs).

Scatterplots in Panel A compare PIPs computed by SuSiE against PIPs computed using other methods (DAP-G, CAVIAR, FINEMAP). Points drawn in red represent true effect variables; points in black represent variables with no effect. Each scatterplot combines results from many simulations. Panel B summarizes these same results as a plot of power vs. FDR. These curves are obtained by varying the PIP threshold for each method. The open circles in the left-hand plot highlight results at PIP thresholds of 0.9 and 0.95. Here, FDR = FP/(FP + TP) (also known as the “false discovery proportion”), and power = TP/(TP + FN), where FP, TP, FN and TN denote the number of False Positives, True Positives, False Negatives and True Negatives, respectively. (This plot is the same as a precision-recall curve after reversing the x-axis, because precision = 1 − FDR, and recall = power.)

The general agreement of PIPs from four different methods suggests that: (i) all four methods are mostly accurate for computing PIPs for the size of the problems explored in our numerical comparisons; and (ii) the PIPs themselves are typically robust to details of the modelling assumptions. Nonetheless, non-trivial differences in PIPs are clearly visible from Figure 2A. Visual inspection of the results of these simulations suggests that, in many of these cases, SuSiE assigns higher PIPs to the true effect variables than other methods, particularly compared to FINEMAP and CAVIAR; for non-effect variables where other methods report high PIPs, SuSiE often correctly assigns PIPs close to zero. These observations suggest that the PIPs from SuSiE may better distinguish effect variables from non-effect variables. This is confirmed by our analysis of power vs. False Discovery Rate (FDR) for each method, which is obtained by varying the PIP threshold for each method (Figure 2B): the SuSiE PIPs always yield comparable or higher power at a given FDR.

Notably, even implemented in R, SuSiE computations are much faster than others implemented in C++: in the S = 3 simulations, SuSiE is roughly 4 times faster than DAP-G, 30 times faster than FINEMAP, and 4,000 times faster than CAVIAR on average (Table 1).

Table 1.

Runtimes on data sets simulated with S = 3 (all times are in seconds)

In summary, the results suggest that SuSiE produces PIPs that are as or more reliable than existing methods, and does so at a fraction of the computational cost.

4.3 Credible Sets

A key feature of SuSiE is that it yields multiple Credible Sets (CSs), each aimed at capturing an effect variable (Definition 1). The only other BVSR method that attempts something similar – as far as we are aware – is DAP-G, which outputs “signal clusters” defined by heuristic rules (Lee et al., 2018). Although Lee et al. (2018) do not refer to their signal clusters as CSs, and do not give a formal definition of signal cluster, the signal clusters have a similar goal to our CSs, and so for brevity we henceforth refer to them as CSs.

We compared the level-95% CSs produced by SuSiE and DAP-G in several ways. First we assessed their empirical (frequentist) coverage levels, i.e., the proportion of CSs that contain an effect variable. Since our CSs are Bayesian Credible Sets, they are not designed or guaranteed to have frequentist coverage of 0.95 (Fraser, 2011). Indeed, coverage will inevitably depend on the simulation scenario. For example, in completely null simulations (b = 0) every CS would necessarily contain no effect variable, and so coverage would be 0. Nonetheless, one might hope that under reasonable simulations that include effect variables the Bayesian CSs would have coverage near the nominal levels. And indeed, we confirmed this was the case: in these simulations, CSs from both methods typically had coverage slightly below 0.95, but usually > 0.9 (Figure 3; see Figure S3 for additional results).

Having established that the methods produce CSs with similar coverage, we compared them by three other criteria: (i) power (overall proportion of simulated effect variables included in a CS); (ii) average size (median number of variables included in CS) and (iii) purity (here measured as average squared correlation of variables in CS since this is output by DAP-G). By all three metrics the CSs from SuSiE are consistently better: higher power, smaller size and higher purity (Figure 3).

Figure 3: Comparison of 95% credible sets (CS) from SuSiE and DAP-G.

Panels show (A) coverage, (B) power, (C) median size and (D) average squared correlation of the variables in each CS. Scenarios with 1–5 effect variables each involved p = 1,000 variables. The scenario with 10 effect variables involved p = 3,000–12,000 variables (the entire candidate region in the cis-eQTL association analysis of the GTEx data).

Although the way that we construct CSs in SuSiE does not require that they be disjoint, we note that in practice here CSs rarely overlapped (after filtering out low purity CSs; Section 3.4.2). Indeed, across all simulations there was only one instance of a pair of overlapping CSs.

5. Application to splice QTL fine-mapping

5.1 Genome-wide identification of splice QTL in human cell lines

To illustrate SuSiE on a real fine-mapping problem, we analyzed data from Li et al. (2016) aimed at detecting genetic variants (SNPs) that influence splicing (known as “splice QTLs”, sQTLs). These authors quantified alternative splicing by estimating, at each intron in each sample, a ratio capturing how often the intron is used relative to other introns in the same “cluster” (roughly, gene). The data involve 77,345 intron ratios measured on lymphoblastoid cell lines from 87 Yoruban individuals, together with the genotypes of these individuals. Following Li et al. (2016), we pre-process the intron ratios by regressing out the first 3 principal components of the matrix of intron ratios, which aims to control for unmeasured confounders (Leek and Storey, 2007). For each intron ratio we test for its association with SNPs within 100kb of the intron, which is on average ∼600 SNPs. In other words, we run SuSiE on 77,345 data sets with n = 87 and p ≈ 600.
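A hedged R sketch of this pre-processing step (not the exact pipeline of Li et al., 2016; `ratios` stands for the samples × introns matrix of intron ratios, assumed complete):

```r
# Remove the top 3 principal components of the intron-ratio matrix from each
# intron's ratios, as a simple adjustment for unmeasured confounders.
pcs <- prcomp(ratios, center = TRUE)$x[, 1:3]                   # top 3 PCs
ratios_adj <- apply(ratios, 2, function(r) resid(lm(r ~ pcs)))  # regress them out
```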

To specify the prior variance σ0², we first estimated typical effect sizes from the data on all introns. Specifically, we performed simple (SNP-by-SNP) regression analysis at every intron, and estimated the PVE of the top (most strongly associated) SNP. The mean PVE of the top SNP across all introns was 0.096, and so we applied SuSiE with σ0² = 0.096 Var(y) (with the columns of X standardized to have unit variance).

We then applied SuSiE to fine-map sQTLs at all 77,345 introns. After filtering for purity, this yielded a total of 2,652 CSs (level 0.95) which were spread across 2,496 intron units. These numbers are broadly in line with the original study, which reported 2,893 significant introns at 10% FDR. Of the 2,652 CSs identified, 457 contain exactly one SNP, and these represent strong candidates for being actual causal variants that affect splicing. Another 239 CSs contain exactly two SNPs. The median size of CS was 7 and the median purity was 0.94.

The vast majority of intron units with any CS had only one CS (2,357 of 2,496). Thus, at most introns SuSiE could reliably identify (at most) one sQTL. Of the remainder, 129 introns yielded two CSs, 5 introns yielded three CSs, 3 introns yielded four CSs and 2 introns yielded five CSs. This represents a total of 156 (129+10+9+8) additional (“secondary”) signals that would be missed in conventional analyses that report only one signal per intron. Although these data show relatively few secondary signals, this is a relatively small study (n = 87); it is likely that in larger studies the ability of SuSiE (and other fine-mapping methods) to detect secondary signals will be more important.

5.2 Functional enrichment of association signals

Although in these real data we do not know the true causal SNPs, we can provide indirect evidence that both the primary and secondary signals identified here are enriched for real signals using functional enrichment analysis. To perform this analysis, we labelled the CS with the highest purity at each intron as the “primary” CS; the remaining CSs at each intron (if any) were labelled “secondary” CSs. We then tested both primary and secondary CSs for enrichment of SNPs with various biological annotations, by comparing SNPs inside these CSs (with PIP > 0.2) against random control SNPs outside CSs.

We used the same annotations in our enrichment analysis as Li et al. (2016). These were: LCL-specific histone marks (H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1), DNase I hypersensitive sites, transcriptional repressor CTCF binding sites, RNA polymerase II (PolII) binding sites and extended splice sites (defined as 5bp up/down-stream of intron start site and 15bp up/down-stream of intron end site), and intron and coding annotations.

Figure 4 shows the enrichments in both primary and secondary CSs, for annotations that were significant at p < 10⁻⁴ in the primary signals (Fisher’s exact test, two-sided). The strongest enrichment in both primary and secondary signals was for extended splice sites (odds ratio ≈ 5 in primary signals), which is reassuring given that these results are for splice QTLs. Other significantly enriched annotations in primary signals include PolII binding, several histone marks, and coding regions. The only annotation showing a significant depletion was introns. Results for secondary signals were qualitatively similar to those for primary signals, though all enrichments are less significant due to the much smaller number of secondary CSs.
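Each enrichment test reduces to a 2 × 2 contingency table; a small illustration (a sketch, with function and variable names ours):

```r
# Enrichment of one binary annotation among CS SNPs (PIP > 0.2) relative to
# random control SNPs outside any CS. `cs_annot` and `control_annot` are
# logical vectors giving the annotation status of the two groups of SNPs.
enrichment_test <- function(cs_annot, control_annot) {
  tab <- rbind(cs      = table(factor(cs_annot,      levels = c(TRUE, FALSE))),
               control = table(factor(control_annot, levels = c(TRUE, FALSE))))
  fisher.test(tab)  # two-sided by default; returns an odds ratio and p-value
}
```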

Figure 4: Results of enrichment analysis for splice QTLs.

The plot shows the estimated odds ratio, ± 2 standard errors, for each annotation, obtained by comparing the annotations of SNPs inside primary/secondary CSs against random control SNPs outside CSs (see text for definitions of primary and secondary). The p-values are from two-sided Fisher’s exact test.

6. An example beyond fine-mapping: change point detection

Although our methods were initially motivated by genetic fine-mapping applications, they are also applicable to other sparse regression applications. Here we briefly illustrate this by applying SuSiE to an example that is very different from fine-mapping: change point detection. This application also provides a simple example where the IBSS algorithm produces a poor fit – due to getting stuck in a local optimum – which is something we seldom observed in fine-mapping simulations. We believe that examples where algorithms fail are just as important as examples where they succeed – perhaps more so – and that this example could help motivate future methods development and improvements.

In brief, we consider a simple change point model:

yt = μt + et,   t = 1,…, T,

where t indexes location in one-dimensional space (or time), the errors et are independent and identically distributed N(0, σ²), and the mean vector μ := (μ1,…, μT) is assumed to be piecewise constant. Indices t at which μ changes (μt ≠ μt+1) are called “change points”.

The idea that change points are rare can be captured by formulating this model as a sparse multiple regression (2.1), where X has T − 1 columns, the tth column being a step function with a step at t (xst = 0 for s ≤ t and 1 for s > t). The tth element of b then determines the change in the mean at position t, μt+1 − μt, and so the non-zero regression coefficients in this multiple regression model correspond to change points in μ.

Note that the design matrix X has a very different structure here from that in fine-mapping applications. In particular, the correlation matrix of the columns of X has its largest elements near the diagonal, and gradually decays moving away from the diagonal – very different from the “blocky” correlation structure that typically occurs in genetic fine-mapping. (A side note on computation: due to the special structure of this X, SuSiE computations can be made O(TL) rather than the O(T²L) that would be achieved by a naive implementation; for example, the vector Xᵀy is simply given by the cumulative sums of the elements of the reverse of y, which can be computed in linear time.)
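As a small check of this side note (our code, not from the paper), the step-function design and the cumulative-sum identity for Xᵀy can be verified directly:

```r
set.seed(1)
Tn <- 8
y  <- rnorm(Tn)

# Step-function design: column t is 0 for s <= t and 1 for s > t.
X <- sapply(1:(Tn - 1), function(t) as.numeric(seq_len(Tn) > t))

# (X'y)_t = y_{t+1} + ... + y_T, i.e. the reversed cumulative sums of rev(y).
xty_naive <- drop(crossprod(X, y))
xty_fast  <- rev(cumsum(rev(y)))[-1]
all.equal(xty_naive, xty_fast)  # TRUE
```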

Change point detection has a wide range of potential applications, including, for example, segmentation of genomes into regions with different numbers of copies of the genome. Software packages in R that can be used for detecting change points include changepoint (Killick and Eckley, 2014), DNAcopy (Seshan and Olshen, 2018; Olshen et al., 2004), bcp (Erdman and Emerson, 2007) and genlasso (Tibshirani, 2014; Arnold and Tibshirani, 2016); see Killick and Eckley (2014) for a longer list. Of these, only bcp, which implements a Bayesian method, quantifies uncertainty in estimated change point locations, and bcp provides only (marginal) PIPs, not CSs for change point locations. Therefore the ability of SuSiE to provide such CSs is unusual, and perhaps even unique, among existing methods for this problem.

To illustrate the potential of SuSiE for change point estimation, we applied it to a simple simulated example from the DNAcopy package. Figure 5 shows both the SuSiE and DNAcopy results. The two methods provide similar estimates for the change point locations, but SuSiE also provides a 95% CS for each change point. In this case every true change point is contained in a reported CS, and every CS contains a true change point. This is true even though our fit assumed L = 10 change points whereas the truth has only 7 change points: the additional CSs were filtered out here because they contained many essentially uncorrelated variables (and so had low purity). (Actually, SuSiE reported 8 CSs after filtering, two of them overlapping and containing the same true change point. As reported in Section 4, in fine-mapping applications we found such overlapping CSs very rarely.)

Figure 5: Illustration of SuSiE applied to two simple changepoint problems.

Top panel shows a simple simulated example with seven true change points (vertical black lines). The blue horizontal lines show the mean function inferred by DNAcopy::segment. The inference is reasonably accurate, but provides no indication of uncertainty in change point locations. The red vertical strips show the 95% CSs for change point locations inferred by SuSiE. Each CS contains a true change point. Bottom panel shows a simple simulated example with two change points in quick succession, designed to show how the IBSS algorithm used to fit SuSiE can converge to a local optimum. The two lines show the fit from initializing IBSS at the null model with no change points (black), and at the true model with two change points (red). The red line is much closer to the truth and attains a higher value of the objective function (−148.2 vs −181.8).

To highlight an example where IBSS can converge to a poor local optimum, consider the simple simulated example in Figure 5, which consists of two change points in quick succession that approximately cancel each other out (so the means before and after the pair of change points are equal). We designed this example specifically to illustrate a limitation of IBSS: here, introducing any single change point (to the null model of no change points) makes the fit worse, and one really needs to introduce both change points at the same time to improve the fit, which IBSS is not set up to do. Consequently, when run from a null initialization, IBSS finds no change points (and reports no CSs).

We emphasize that this result represents a limitation of the IBSS algorithm for optimizing the objective function, and not a limitation of either the SuSiE model or the variational approximation. To demonstrate this we re-ran the IBSS algorithm, initialized from a solution that contains the two true change points. This yields a fit with two CSs, containing the two correct change points, and a higher value of the objective function than the original fit (−148.2 vs −181.8). Improved fitting algorithms – or more careful initialization of IBSS – could therefore address this problem.

7. Discussion

We presented a new model (SuSiE) and algorithm (IBSS) which together provide a simple new approach to variable selection in regression. Compared with existing methods, the main benefits of our approach are its computational efficiency, and its ability to provide CSs summarizing uncertainty in which variables should be selected. Our numerical comparisons demonstrate that for genetic fine-mapping our methods outperform existing methods at a fraction of the computational cost.

Although our methods apply generally to variable selection in linear regression, further work may be required to improve performance in difficult settings.

In particular, while the IBSS algorithm worked well in our fine-mapping experiments, for change-point problems we showed that IBSS may converge to poor local optima. We have also seen convergence problems in experiments with many effect variables (e.g. 200 effect variables out of 1,000). Such problems may be alleviated by better initialization, for example using fits from convex objective functions (e.g. Lasso) or from more sophisticated algorithms for nonconvex problems (Bertsimas et al., 2016; Hazimeh and Mazumder, 2018). More ambitiously, one could attempt to develop better algorithms to reliably optimize the SuSiE variational objective function in difficult cases. For example, taking smaller steps each iteration, rather than full coordinate ascent, may help.
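One way to implement the Lasso-based initialization suggested here, given a design matrix X and response y, is sketched below. This is our illustration, not the authors' procedure, and it again assumes the susie_init_coef()/s_init interface described above.

    ## Hedged sketch: seed IBSS with the Lasso support selected by cross-validated glmnet.
    library(glmnet)

    cvfit <- cv.glmnet(X, y)
    beta  <- as.numeric(as.matrix(coef(cvfit, s = "lambda.min")))[-1]   # drop the intercept
    idx   <- which(beta != 0)                                           # assumes a nonempty support

    init  <- susie_init_coef(coef_index = idx, coef_value = beta[idx], p = ncol(X))
    fit   <- susie(X, y, L = max(10, length(idx)), s_init = init)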

At its core, the SuSiE model is based on adding up simple models (SERs) to create more flexible models (sparse multiple regression). This additive structure is the key to our variational approximations, and indeed our methods apply generally to adding up any simple models for which exact Bayesian calculations are tractable, not only SER models (Appendix B; Algorithm 3). These observations suggest connections with both additive models and boosting (e.g. Friedman et al., 2000; Freund et al., 2017). However, our methods differ from most work on boosting in that each “weak learner” (here, the SER model) itself yields a model-averaged predictor. Other differences include our use of backfitting, the potential to estimate hyper-parameters by maximizing an objective function rather than by cross-validation, and the interpretation of our algorithm as a variational approximation to a Bayesian posterior. Although we did not focus on prediction accuracy here, the generally good predictive performance of methods based on model averaging and boosting suggests that SuSiE should work well for prediction as well as variable selection.

It would be natural to extend our methods to generalized linear models (glms), particularly logistic regression. In genetic studies with small effects Gaussian models are often adequate to model binary outcomes (e.g. Zhou et al., 2013). However, in other settings this extension may be more important. One strategy would be to directly modify the IBSS algorithm, replacing the SER fitting procedure with a logistic or glm equivalent. This strategy is appealing in its simplicity, although it is not obvious what objective function the resulting algorithm is optimizing.

For genetic fine-mapping it would also be useful to modify our methods to deal with settings where only summary data are available (e.g., the results of the p marginal simple regressions). Many recent fine-mapping methods deal with this setting (e.g., Chen et al., 2015; Benner et al., 2016; Newcombe et al., 2016), and ideas used by these methods can also be applied to SuSiE. Indeed, our software already includes preliminary implementations for this problem.

Finally, we are particularly interested in extending these methods to select variables simultaneously for multiple outcomes (multivariate regression, and multi-task learning). In settings where multiple outcomes share the same relevant variables, multivariate analysis can greatly enhance power and precision to identify relevant variables. The computational simplicity of our approach makes it particularly appealing for this complex task, and we are currently pursuing this by combining our methods with those from Urbut et al. (2018).

8. Data and resources

SuSiE is implemented in the R package susieR, available at https://github.com/stephenslab/susieR. Source code and a website documenting in detail the analysis steps for the numerical comparisons and data applications are available in our manuscript resource repository (Wang et al., 2018) at https://github.com/stephenslab/susie-paper.

Acknowledgements

We thank Kaiqian Zhang and Yuxin Zou for their substantial contributions to the development and testing of the susieR package. Computing resources were provided by the University of Chicago Research Computing Center. This work was supported by NIH grant HG002585 and by a grant from the Gordon and Betty Moore Foundation (Grant GBMF #4559).

Appendix A Details of posterior computations for the SER model

A.1. Bayesian simple linear regression

To provide posterior computations for the SER it helps to start with the even simpler Bayesian simple linear regression model: y = xb + e, with e ∼ N(0, σ²Iₙ) and b ∼ N(0, σ₀²).

Here y is an n-vector of response data (centered to have mean 0), x is an n-vector containing values of a single explanatory variable (similarly centered), e is an n-vector of independent error terms with variance σ², b is the scalar regression coefficient to be estimated, and σ₀² is the prior variance of b.

Given σ₀² and σ², the Bayesian computations for this model are very simple. They can be conveniently written in terms of the usual least-squares estimate of b, b̂ := xᵀy/(xᵀx), its variance, s² := σ²/(xᵀx), and the corresponding z-score, z := b̂/s. The posterior distribution for b is

b | y, σ², σ₀² ∼ N(μ₁, σ₁²), where σ₁² := 1/(1/s² + 1/σ₀²) and μ₁ := (σ₁²/s²) b̂.

And the Bayes Factor (BF) for comparing this model with the null model (b = 0) is BF(x, y; σ₀², σ²) = √(s²/(s² + σ₀²)) exp(z²σ₀²/(2(s² + σ₀²))).

(The form of the BF matches the “asymptotic BF” of Wakefield (2009), but here – because we consider linear regression and condition on σ² – it is an exact expression and not only asymptotic.)
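To make these computations concrete, here is a minimal R sketch (ours, not the susieR implementation) of the posterior mean, posterior variance, and log Bayes factor for a single centered variable:

    ## A minimal sketch of the conjugate computations above for one centered variable x.
    single_reg <- function(x, y, sigma2, sigma0sq) {
      xtx      <- sum(x^2)
      bhat     <- sum(x * y) / xtx                # least-squares estimate of b
      s2       <- sigma2 / xtx                    # its sampling variance
      z        <- bhat / sqrt(s2)                 # z-score
      sigma1sq <- 1 / (1 / s2 + 1 / sigma0sq)     # posterior variance of b
      mu1      <- sigma1sq * bhat / s2            # posterior mean of b
      logBF    <- 0.5 * log(s2 / (s2 + sigma0sq)) +
                  0.5 * z^2 * sigma0sq / (s2 + sigma0sq)
      list(mu1 = mu1, sigma1sq = sigma1sq, logBF = logBF, bhat = bhat, s2 = s2)
    }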

A.2. The single effect regression model

Under the SER model, given Embedded Image, the posterior distribution on b = γb is as given in the main text: Embedded Image Embedded Image where α = (α1,…, αp) is the vector of PIPs: Embedded Image with BF as in (A.8), and where Embedded Image are the posterior mean (A.6) and variance (A.5) from Bayesian simple regression of y on xj: Embedded Image Embedded Image

Our algorithm requires the first and second moments of this posterior distribution, which are: Embedded Image Embedded Image
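A sketch of how these SER quantities can be computed in R, reusing single_reg() from Section A.1. The explicit moment formulas are our rendering rather than a copy of the displayed equations; they follow from the mixture form of the posterior, in which bj is zero with probability 1 − αj and N(μ1j, σ1j²) with probability αj.

    ## A sketch of the SER posterior: PIPs (alpha) and first/second moments of b.
    ser_posterior <- function(X, y, sigma2, sigma0sq,
                              prior = rep(1 / ncol(X), ncol(X))) {
      fits     <- lapply(seq_len(ncol(X)), function(j)
                    single_reg(X[, j], y, sigma2 = sigma2, sigma0sq = sigma0sq))
      logBF    <- sapply(fits, `[[`, "logBF")
      mu1      <- sapply(fits, `[[`, "mu1")
      sigma1sq <- sapply(fits, `[[`, "sigma1sq")
      w        <- log(prior) + logBF
      alpha    <- exp(w - max(w)); alpha <- alpha / sum(alpha)   # PIPs, computed stably
      list(alpha      = alpha,
           post_mean  = alpha * mu1,                    # E[b_j]
           post_mean2 = alpha * (sigma1sq + mu1^2))     # E[b_j^2]
    }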

A.3. Computing Credible Sets

As noted in the main text, under the SER model it is simple to compute a level ρ CS (Definition 1), CS(α; ρ), as described in Maller et al. (2012). For convenience we give the procedure here explicitly.

Given α, let r = (r1,…, rp) denote the indices of the variables ranked in order of decreasing αj, so that αr1 ≥ αr2 ≥ ⋯ ≥ αrp, and let Sk denote the cumulative sum of the k largest PIPs: Sk := αr1 + ⋯ + αrk.

Now take CS(α; ρ) := {r1,…, rk0}, where k0 := min{k : Sk ≥ ρ}. This choice of k0 ensures that the CS is as small as possible while still satisfying the requirement that it has at least level ρ.
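This construction translates directly into a few lines of R (a sketch, not the susieR implementation):

    ## Rank variables by PIP and keep the smallest top-ranked set whose PIPs sum to >= rho.
    credible_set <- function(alpha, rho = 0.95) {
      ord <- order(alpha, decreasing = TRUE)
      k0  <- which(cumsum(alpha[ord]) >= rho)[1]
      sort(ord[seq_len(k0)])
    }

For example, credible_set(alpha, 0.95) returns the indices of the smallest top-ranked set of variables whose PIPs sum to at least 0.95.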

A.4. Empirical Bayes approach

As noted in the main text, it is possible to take an Empirical Bayes approach to estimating the hyperparameters σ₀² and σ². The likelihood is L_SER(σ₀², σ²; y) = p₀(y; σ²) Σj πj BF(xj, y; σ₀², σ²) (A.18), where p₀ denotes the distribution of y under the “null” that b = 0 (i.e. N(0, σ²Iₙ)). The likelihood (A.18) can be maximized over one or both parameters numerically.
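As an illustration of this numerical maximization for σ₀² with σ² held fixed, the sketch below reuses single_reg() from Section A.1 and maximizes the log of Σj πj BF_j (the part of the likelihood that involves σ₀²) with optimize(); as noted in Appendix B.3, the susieR implementation instead finds a zero of the derivative with uniroot.

    ## A hedged sketch of the Empirical Bayes step for sigma0sq (sigma2 fixed).
    estimate_sigma0sq <- function(X, y, sigma2,
                                  prior = rep(1 / ncol(X), ncol(X)), upper = 100) {
      loglik <- function(sigma0sq) {
        logBF <- apply(X, 2, function(x)
          single_reg(x, y, sigma2 = sigma2, sigma0sq = sigma0sq)$logBF)
        w <- log(prior) + logBF
        max(w) + log(sum(exp(w - max(w))))    # log sum_j pi_j BF_j, computed stably
      }
      optimize(loglik, interval = c(1e-8, upper), maximum = TRUE)$maximum
    }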

Appendix B Derivation of Variational Algorithms

B.1. Background: Empirical Bayes and Variational Approximation

Here we introduce helpful background and notation before applying the ideas to our specific application.

B.1.1 Empirical Bayes as a single optimization problem

Consider problems of the form: Embedded Image Embedded Image where y represents observed data, b represents unobserved (latent) variables of interest, g ∈ 𝒢 represents a prior distribution for b (which in the Empirical Bayes paradigm is treated as an unknown to be estimated), and θ ∈ Θ represents parameters to be estimated. This formulation also includes, as a special case, situations where g is pre-specified rather than estimated, simply by making 𝒢 contain a single distribution.

Fitting this model by Empirical Bayes involves the following steps:

  1. Obtain estimates Embedded Image for g, θ, by maximizing the log-likelihood: Embedded Image where Embedded Image

  2. Compute the posterior distribution for b given these estimates, Embedded Image where Embedded Image

This two-step procedure can be conveniently written as a single optimization problem: Embedded Image with Embedded Image where Embedded Image is the Kullback–Leibler divergence from q to p and the optimization over q in (B.6) is over all possible distributions. The function F (B.7) is often called the “evidence lower bound”, or ELBO, because it is a lower bound for the evidence (log-likelihood). (This follows from the fact that KL divergence is non-negative.)

That this single optimization problem (B.6) is equivalent to the usual two-step EB procedure follows from two simple observations:

  1. Since the log-likelihood, l, does not depend on q, we have Embedded Image

  2. Since the minimum of the DKL term over q is 0 for any g, θ (attained by taking q equal to the posterior), we have maxq F (q, g, θ; y) = l(g, θ; y), so Embedded Image

B.1.2. Variational approximation

The optimization problem (B.6) is often intractable. The idea of variational approximation is to adjust the problem to make it tractable, simply by restricting the optimization over q to q ∈ 𝒬 where 𝒬 denotes a suitably chosen class of distributions: Embedded Image

From the definition of F it follows that optimizing F over q ∈ 𝒬 (for given g, θ) corresponds to minimizing the KL divergence from q to the posterior distribution, and so can be interpreted as finding the “best” approximation to the posterior distribution for b among distributions in the class 𝒬. And the optimization of F over g, θ can be thought of as replacing the optimization of the log-likelihood with optimization of the ELBO, a lower bound to the log-likelihood.

We refer to solutions of the general problem (B.6), in which q is unrestricted, as “EB solutions”. We refer to solutions of the restricted problem (B.11) as “Variational EB (VEB) solutions”.

B.1.3. Algebraic form for F

It is helpful to note that, by simple algebraic manipulations, the ELBO F in (B.7) can be written as: Embedded Image Embedded Image
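For reference, the standard variational identities that these displays express can be written as follows (our reconstruction in the notation of this section; the exact form may differ slightly from (B.12)-(B.13)):

    % Reconstructed from the surrounding text; not copied from the original displays.
    F(q, g, \theta; y)
      = \ell(g, \theta; y) - D_{\mathrm{KL}}\bigl(q \,\|\, p_{\mathrm{post}}(\cdot \mid y, g, \theta)\bigr)
      = \mathbb{E}_{q}\bigl[\log p(y \mid b, \theta)\bigr] - D_{\mathrm{KL}}(q \,\|\, g)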

B.2. The additive effects model

We now focus on fitting an additive model, ℳ, that includes the SuSiE model as a special case: Embedded Image Embedded Image Embedded Image where y ∈ Rn, µl ∈ Rn, e ∈ Rn, and In denotes the n × n identity matrix. We let ℳl denote the simpler model that is derived from ℳ by setting µl′ ≡ 0 for all l′ ≠ l (i.e. ℳl is the model that contains only the lth additive term), and use Ll to denote the marginal likelihood for this simpler model: Embedded Image

The SuSiE model corresponds to the special case of ℳ where µl:= Xbl and gl is the “single effect prior” in (2.6)-(2.8). Further, in this special case each ℳl is a “single effect regression” (SER) model.

The key idea is that we can fit ℳ by VEB provided we can fit each simpler model ℳl by EB. To expand on this: consider fitting the model ℳ by VEB, where the restricted family 𝒬 is the class of distributions on Embedded Image that factorize over l. That is, for any q ∈ 𝒬, Embedded Image and we can write q = (q1,…, qL).

For q ∈ 𝒬, using expression (B.13), we obtain the following expression for the ELBO F: Embedded Image where ||y||2:= yT y and g denotes the priors (g1,…, gL). Note that the second term here is the expected residual sum of squares (ERSS) under q, and depends on q only through its first and second moments. Indeed, if we define Embedded Image Embedded Image and Embedded Image, then Embedded Image

(This expression follows from the definition, and independence across l = 1,…, L, by simple algebraic manipulation; see Section B.6).
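A sketch of this ERSS computation in R, taking the first and second moments of each µl as n-vectors; this is our rendering of the dependence described above rather than a transcription of (B.22).

    ## By independence of the q_l, the expected residual sum of squares depends only
    ## on the first moments mu_bar[[l]] and second moments mu2_bar[[l]] of each mu_l.
    erss <- function(y, mu_bar, mu2_bar) {
      fitted  <- Reduce(`+`, mu_bar)                    # E[sum_l mu_l]
      var_sum <- sum(mapply(function(m2, m1) sum(m2 - m1^2), mu2_bar, mu_bar))
      sum((y - fitted)^2) + var_sum                     # ||y - E[fit]||^2 + summed variances
    }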

Fitting ℳ by VEB involves optimizing F in (B.19) over q, g, σ2. Our strategy is to use “coordinate ascent”, using steps that optimize over (ql, gl) (l = 1,…, L) while keeping other elements of q, g fixed, and with a separate step to optimize over σ2 given q, g. This strategy is summarized in Algorithm 2.

Algorithm 2

Coordinate Ascent for F (outline)


The update for σ² in Algorithm 2 is easily obtained by taking the partial derivative of (B.19) with respect to σ², setting it to zero, and solving, giving Embedded Image
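The resulting update has the familiar form for a Gaussian model (our reconstruction; the displayed equation is not reproduced in this version):

    % Reconstructed update, standard for Gaussian likelihoods; ERSS is the expected
    % residual sum of squares under q, as in (B.22).
    \hat{\sigma}^{2} = \frac{\mathrm{ERSS}}{n}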

The update for (ql, gl) turns out to correspond to finding the EB solution to the simpler model ℳl, but with the data y replaced with the expected residuals, Embedded Image. The proof of this is given in the next section (Proposition 2).

Substituting these ideas into Algorithm 2 gives Algorithm 3, which is a generalization of the IBSS algorithm (Algorithm 1) in the main text.

Algorithm 3

Coordinate Ascent for F


B.3. Special case of SuSiE model

The SuSiE model corresponds to the special case µl = Xbl, in which case ℳl is a single-effect regression (SER) model. The first and second moments of µl, Embedded Image and Embedded Image are determined by the first and second moments of bl: Embedded Image Embedded Image Embedded Image where the last line comes from the fact that only one element of bl is nonzero, so the cross terms bljblj′ vanish for j ≠ j′. Because of this we can write Embedded Image as a function of the first and second moments of the bl – say Embedded Image – and Algorithm 3 can be implemented by working with the posterior distributions of b instead of µ.
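In code, these moment calculations are straightforward (a sketch; the second moment uses the elementwise square of X precisely because the cross terms vanish):

    ## Moments of mu_l = X b_l for the SuSiE special case, given first and second
    ## moments of b_l (e.g. post_mean and post_mean2 from ser_posterior() above).
    mu_moments <- function(X, post_mean, post_mean2) {
      list(mu_bar  = drop(X %*% post_mean),             # E[mu_l]
           mu2_bar = drop((X * X) %*% post_mean2))      # E[mu_l^2], elementwise
    }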

For completeness we give this algorithm, which is the one we implemented in the susieR software, explicitly as Algorithm 4. This algorithm is the same as the IBSS algorithm in the main text, but extended to estimate the hyperparameters Embedded Image.

Algorithm 4

Iterative Bayesian stepwise selection (Extended)


We implemented the optional Step 5, which is a one-dimensional optimization, using uniroot in R to find the point where the derivative of L_SER is zero.

B.4. The update for (ql, gl) is the EB solution of ℳl

In this subsection we establish that Step 3 in Algorithm 2 is accomplished by finding the EB solution of ℳl (Steps 4–5 in Algorithm 3). This result is formalized in the following Proposition, which generalizes Proposition 1 in the main text:

Proposition 2.

Optimizing F in (B.19) over ql, gl is achieved by Embedded Image where Fl denotes the ELBO corresponding to model ℳl and Embedded Image is as in (B.20).

Note that the optimization of Fl over ql, gl on the right hand side of (B.27) does not involve restrictions on ql, and so corresponds precisely to finding the EB solution to ℳl (see Section B.1.1).

Proof. Let Fl denote the ELBO for model ℳl. Then, from (B.13) we have: Embedded Image

Further, let µ–l denote the components of (µ1,…, µL) omitting µl, and q–l denote the distribution on µ–l induced by marginalizing q over µl. Finally, let rl denote the residuals obtained by removing all the effects other than the lth from y, and let Embedded Image denote its expectation under q–l: Embedded Image Embedded Image

Now, separating F in (B.19) into the parts that depend on ql, gl, and those that do not (here denoted “const”), we have: Embedded Image Embedded Image Embedded Image

B.5. Computing the evidence lower bound

Although not strictly required to implement Algorithm 3, it can also be helpful to compute the objective function F (e.g., for monitoring convergence and for comparing solutions obtained from different initial points). Here we outline a convenient approach to computing F in practice.

The ELBO F is given by (B.19). Computing the first term is easy, and the second term is the ERSS (B.22). The third term can be computed from the marginal likelihoods Ll in (B.17), whose computation is straightforward when ℳl is the SER model, since it involves only a simple sum over the p possible single effects.

Specifically we have the following lemma:

Lemma 1.

Let Embedded Image. Then Embedded Image

Proof. Rearranging (B.28) with y replaced by Embedded Image, we have Embedded Image

The result then follows from noting that Fl is equal to log Ll at the optimum Embedded Image. That is, Embedded Image.

B.6. Expression for ERSS

The expression (B.22) is derived as follows: Embedded Image Embedded Image Embedded Image

Appendix C Simulation details

C.1. Simulation data set

For the numerical simulations of eQTL fine-mapping (Section 4), we used n = 574 human genotypes collected as part of the Genotype-Tissue Expression (GTEx) project (GTEx Consortium, 2017). Specifically, we obtained imputed genotype data from whole-genome sequencing under dbGaP accession phs000424.v7.p2. In our analyses, we included only SNPs with a minor allele frequency of 1% or greater. All reported SNP base-pair positions are based on Genome Reference Consortium human genome assembly 38.

To select SNPs near each gene, we considered two SNP selection schemes in our simulations: (i) all SNPs within 1 Megabase (Mb) of the gene’s transcription start site (TSS), and (ii) the p = 1,000 SNPs closest to the TSS. Since the GTEx genotype data are very dense, the 1,000 closest SNPs are always less than 0.4 Mb away from the TSS, regardless of the gene considered. The first selection scheme yields genotype matrices X with at least p = 3,022 SNPs and at most p = 11,999 SNPs, with an average of 7,217 SNPs.

C.2. CAVIAR, FINEMAP and DAP-G settings used for numerical comparisons

In CAVIAR, we set all prior inclusion probabilities to 1/p to match the default settings used in FINEMAP and DAP-G. In CAVIAR and FINEMAP, we set the maximum number of effect variables to the value of S that was used to simulate the gene expression data. The maximum number of iterations in FINEMAP was set to 100,000 (the default in FINEMAP).

All computations were performed on Linux systems with Intel Xeon E5-2680 v4 (2.40 GHz) processors. We ran SuSiE in R 3.5.1, with optimized matrix operations provided by dynamically linked OpenBLAS libraries. DAP-G and CAVIAR were compiled from source using GCC 4.9.2, and FINEMAP was run using the pre-compiled binary executables available from its website. Reported results were averaged over 300 data sets.

Footnotes

  • * This work was supported by NIH grant HG002585 and by a grant from the Gordon and Betty Moore Foundation

  • e-mail: gaow{at}uchicago.edu, aksarkar{at}uchicago.edu, pcarbo{at}uchicago.edu, mstephens{at}uchicago.edu

References

  1. Arnold, T. and Tibshirani, R. (2016). Efficient implementations of the generalized lasso dual path algorithm. Journal of Computational and Graphical Statistics 25 (1), 1–27.
  2. Barber, R. F. and Candes, E. J. (2015). Controlling the false discovery rate via knockoffs. Annals of Statistics 43 (5), 2055–2085.
  3. Benner, C., Spencer, C. C., Havulinna, A. S., et al. (2016). FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32 (10), 1493–1501.
  4. Bertsimas, D., King, A., Mazumder, R., et al. (2016). Best subset selection via a modern optimization lens. Annals of Statistics 44 (2), 813–852.
  5. Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), 859–877.
  6. Bottolo, L., Petretto, E., Blankenberg, S., et al. (2011). Bayesian detection of expression quantitative trait loci hot spots. Genetics 189 (4), 1449–1459.
  7. Bottolo, L. and Richardson, S. (2010). Evolutionary stochastic search for Bayesian model exploration. Bayesian Analysis 5 (3), 583–618.
  8. Carbonetto, P. and Stephens, M. (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis 7 (1), 73–108.
  9. Chen, W., Larrabee, B. R., Ovsyannikova, I. G., et al. (2015). Fine mapping causal variants with an approximate Bayesian method using marginal test statistics. Genetics 200 (3), 719–736.
  10. Chipman, H., George, E. I., and McCulloch, R. E. (2001). The practical implementation of Bayesian model selection. In Model Selection, Volume 38 of IMS Lecture Notes, pp. 65–116.
  11. Desboulets, L. D. D. (2018). A review on variable selection in regression analysis. Econometrics 6 (4).
  12. Erdman, C. and Emerson, J. W. (2007). bcp: an R package for performing a Bayesian analysis of change point problems. Journal of Statistical Software 23 (3), 1–13.
  13. Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20 (1), 101–148.
  14. Fraser, D. A. S. (2011). Is Bayes posterior just quick and dirty confidence? Statistical Science 26 (3), 299–316.
  15. Freund, R. M., Grigas, P., and Mazumder, R. (2017). A new perspective on boosting in linear regression via subgradient optimization and relatives. Annals of Statistics 45 (6), 2328–2364.
  16. Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28 (2), 337–407.
  17. Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association 76 (376), 817–823.
  18. George, E. I. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statistica Sinica 7, 339–373.
  19. GTEx Consortium (2017). Genetic effects on gene expression across human tissues. Nature 550 (7675), 204–213.
  20. Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Annals of Applied Statistics 5 (3), 1780–1815.
  21. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). New York, NY: Springer.
  22. Hazimeh, H. and Mazumder, R. (2018). Fast best subset selection: coordinate descent and local combinatorial optimization algorithms. arXiv:1803.01454.
  23. Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B., and Eskin, E. (2014). Identifying causal variants at loci with multiple signals of association. Genetics 198 (2), 497–508.
  24. Huang, H., Fang, M., Jostins, L., et al. (2017). Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature 547 (7662), 173–178.
  25. Huang, J., Breheny, P., and Ma, S. (2012). A selective review of group selection in high-dimensional models. Statistical Science 27 (4), 481–499.
  26. Killick, R. and Eckley, I. (2014). changepoint: an R package for changepoint analysis. Journal of Statistical Software 58 (3), 1–19.
  27. Lee, Y., Francesca, L., Pique-Regi, R., and Wen, X. (2018). Bayesian multi-SNP genetic association analysis: control of FDR and use of summary statistics. bioRxiv doi:10.1101/316471.
  28. Leek, J. T. and Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3 (9), e161.
  29. Li, Y. I., van de Geijn, B., Raj, A., et al. (2016). RNA splicing is a primary link between genetic variation and disease. Science 352 (6285), 600–604.
  30. Logsdon, B. A., Hoffman, G. E., and Mezey, J. G. (2010). A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinformatics 11 (1), 58.
  31. Maller, J. B., McVean, G., Byrnes, J., et al. (2012). Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature Genetics 44 (12), 1294–1301.
  32. Mandozzi, J. and Bühlmann, P. (2016). Hierarchical testing in the high-dimensional setting with correlated variables. Journal of the American Statistical Association 111 (513), 331–343.
  33. Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika 95 (2), 265–278.
  34. Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society, Series B 72 (4), 417–473.
  35. Meuwissen, T. H., Hayes, B. J., and Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157 (4), 1819–1829.
  36. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association 83 (404), 1023–1032.
  37. Neal, R. M. (1996). Bayesian Learning for Neural Networks, Volume 118 of Lecture Notes in Statistics. New York, NY: Springer.
  38. Newcombe, P. J., Conti, D. V., and Richardson, S. (2016). JAM: a scalable Bayesian framework for joint analysis of marginal SNP effects. Genetic Epidemiology 40 (3), 188–201.
  39. O’Hara, R. B. and Sillanpää, M. J. (2009). A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis 4 (1), 85–117.
  40. Olshen, A. B., Venkatraman, E. S., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 (4), 557–572.
  41. Ott, J. (1999). Analysis of Human Genetic Linkage (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
  42. Pickrell, J. K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. American Journal of Human Genetics 94 (4), 559–573.
  43. Schaid, D. J., Chen, W., and Larson, N. B. (2018). From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics 19 (8), 491–504.
  44. Servin, B. and Stephens, M. (2007). Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics 3 (7), 1296–1308.
  45. Seshan, V. E. and Olshen, A. (2018). DNAcopy: DNA copy number data analysis. R package version 1.56.0.
  46. Sillanpää, M. J. and Bhattacharjee, M. (2005). Bayesian association-based fine mapping in small chromosomal segments. Genetics 169 (1), 427–439.
  47. Spain, S. L. and Barrett, J. C. (2015). Strategies for fine-mapping complex traits. Human Molecular Genetics 24 (R1), R111–R119.
  48. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
  49. Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics 42 (1), 285–323.
  50. Urbut, S., Wang, G., Carbonetto, P., and Stephens, M. (2018). Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nature Genetics, forthcoming.
  51. Veyrieras, J.-B., Kudaravalli, S., Kim, S. Y., et al. (2008). High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genetics 4 (10), e1000214.
  52. Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison with P-values. Genetic Epidemiology 33 (1), 79–86.
  53. Wallace, C., Cutler, A. J., Pontikos, N., et al. (2015). Dissection of a complex disease susceptibility region using a Bayesian stochastic search approach to fine mapping. PLoS Genetics 11 (6), e1005272.
  54. Wang, G., Sarkar, A., Carbonetto, P., and Stephens, M. (2018). Code and data accompanying the manuscript Wang et al. (2018).
  55. Wen, X., Lee, Y., Luca, F., and Pique-Regi, R. (2016). Efficient integrative multi-SNP association analysis via deterministic approximation of posteriors. American Journal of Human Genetics 98 (6), 1114–1129.
  56. Zhou, X., Carbonetto, P., and Stephens, M. (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9 (2), e1003264.