Iterative Hard Thresholding in GWAS: Generalized Linear Models, Prior Weights, and Double Sparsity

Background: Consecutive testing of single nucleotide polymorphisms (SNPs) is usually employed to identify genetic variants associated with complex traits. Ideally one should model all covariates in unison, but most existing analysis methods for genome-wide association studies (GWAS) perform only univariate regression. Results: We extend and efﬁciently implement iterative hard thresholding (IHT) for multiple regression, treating all SNPs simultaneously. Our extensions accommodate generalized linear models (GLMs), prior information on genetic variants, and grouping of variants. In our simulations, IHT recovers up to 30% more true predictors than SNP-by-SNP association testing, and exhibits a 2 to 3 orders of magnitude decrease in false positive rates compared to lasso regression. We also test IHT on the UK Biobank hypertension phenotypes and the Northern Finland Birth Cohort of 1966 cardiovascular phenotypes. We ﬁnd that IHT scales to the large datasets of contemporary human genetics and recovers the plausible genetic variants identiﬁed by previous studies. Conclusions: Our real data analysis and simulation studies suggest that IHT can (a) recover highly correlated predictors, (b) avoid over-ﬁtting, (c) deliver better true positive and false positive rates than either

marginal testing or lasso regression, (d) recover unbiased regression coefficients, (e) exploit prior information and group-sparsity and (f) be used with biobank sized data sets.Although these advances are studied for GWAS inference, our extensions are pertinent to other regression problems with large numbers of predictors.

Introduction
In genome-wide association studies (GWAS), modern genotyping technology coupled with imputation algorithms can produce an n × p genotype matrix X with n ≈ 10 6 subjects and p ≈ 10 7 genetic predictors (10; 41).Data sets of this size require hundreds of gigabytes of disk space to store in compressed form.Decompressing data to floating point numbers for statistical analyses leads to matrices too large to fit into standard computer memory.The computational burden of dealing with massive GWAS datasets limits statistical analysis and interpretation.This paper discusses and extends a class of algorithms capable of meeting the challenge of multiple regression models with modern GWAS data scales.
Traditionally, GWAS analysis has focused on SNP-by-SNP (single nucleotide polymorphism) association testing (10; 9), with a p-value computed for each SNP via linear regression.This approach enjoys the advantages of simplicity, interpretability, and a low computational complexity of O(np).Furthermore, marginal linear regressions make efficient use of computer memory, since computations are carried out on genotype vectors one at a time, as opposed to running on the full genotype matrix in multiple regression.Some authors further increase association power by reframing GWAS as a linear mixed model problem and proceeding with variance component selection (20; 26).These advances remain within the scope of marginal analysis.
Despite their numerous successes (41), marginal regression is less than ideal for GWAS.It implicitly assumes that all SNPs have independent effects.In contrast, multiple regression can in principle model the effect of all SNPs simultaneously.This approach captures the biology behind GWAS more realistically because traits are usually determined by multiple SNPs acting in unison.Marginal regression selects associated SNPs one by one based on a pre-set threshold.Given the stringency of the p-value threshold, marginal regression can miss many causal SNPs with low effect sizes.As a result, heritability is underestimated.When p n, one usually assumes that the number of variants k associated with a complex trait is much less than n.If this is true, we can expect multiple regression models to perform better because it a) offers better outlier detection (35) and better prediction, b) accounts for the correlations among SNPs, and c) allows investigators to model interactions.Of course, these advantages are predicated on finding the truly associated SNPs.
Adding penalties to the loss function is one way of achieving parsimony in multiple regression.The lasso (39; 40) is the most popular model selection device in current use.The lasso model selects non-zero parameters by minimizing the criterion where (β) is a convex loss, λ is a sparsity tuning constant, and β 1 = ∑ j |β j | is the 1 norm of the parameters.The lasso has the virtues of preserving convexity and driving most parameter estimates to 0. Minimization can be conducted efficiently via cyclic coordinate descent (15; 44).The magnitude of the nonzero tuning constant λ determines the number of predictors selected.
Despite its widespread use, the lasso penalty has some drawbacks.First, the 1 penalty tends to shrink parameters toward 0, sometimes severely so.Second, λ must be tuned to achieve a given model size.Third, λ is chosen by cross-validation, a costly procedure.Fourth and most importantly, the shrinkage caused by the Figure 1: The 0 quasinorm of IHT enforces sparsity without shrinkage.The estimated effect size (dashed line) is plotted against its true value (diagonal line) for 1 , MPC, and 0 penalties.penalty leaves a lot of unexplained trait variance, which tends to encourage too many false positives to enter the model ultimately identified by cross-validation.Inflated false positive rates can be mitigated by substituting nonconvex penalties for the 1 penalty.For example, the minimax concave penalty (MCP) (50) λ p(β j ) = λ starts out at β j = 0 with slope λ and gradually transitions to a slope of 0 at β j = λ γ.With minor adjustments, the coordinate descent algorithm for the lasso carries over to MCP penalized regression (8; 29).Model selection is achieved without severe shrinkage, and inference in GWAS improves (21).However, in our experience its false negative rate is considerably higher than IHT's rate (22).A second remedy for the lasso, stability selection, weeds out false positives by looking for consistent predictor selection across random halves of the data (32).However, it is known to be under-powered for GWAS compared to standard univariate selection (2).
In contrast, iterative hard thresholding (IHT) minimizes a loss (β) subject to the nonconvex sparsity constraint β 0 ≤ k, where β 0 counts the number of non-zero components of β (3; 4; 7). Figure 1 explains graphically how the 0 penalty reduces the bias of the selected parameters.In the figure λ , γ, and k are chosen so that the same range of β values are sent to zero.To its detriment, the lasso penalty shrinks all β 's, no matter how large their absolute values.The nonconvex MCP penalty avoids shrinkage for large β 's but exerts shrinkage for intermediate β 's.IHT, which is both nonconvex and discontinuous, avoids shrinkage altogether.For GWAS, the sparsity model-size constant k also has a simpler and more intuitive interpretation than the lasso tuning constant λ .Finally, both false positive and false negative rates are well controlled.Balanced against these advantages is the loss of convexity in optimization and concomitant loss of computational efficiency.In practice, the computational barriers are surmountable and are compensated by the excellent results delivered by IHT in high-dimensional regression problems such as multiple GWAS regression.This article has four interrelated goals.First, we extend IHT to generalized linear models.These models encompass most of applied statistics.Previous IHT algorithms focused on normal or logistic sparse regression scenarios.Our software can also perform sparse regression under Poisson and negative binomial response distributions and can be easily extended to other GLM distributions as needed.The key to our extension is the derivation of a nearly optimal step size s for improving the loglikelihood at each iteration.Second, we introduce doubly-sparse regression to IHT.Previous authors have considered group sparsity (46).The latter tactic limits the number of groups selected.It is also useful to limit the number of predictors selected per group.Double sparsity strikes a compromise that encourages selection of correlated causative variants in linkage disequilibrium (LD).Notably, this technique generalizes group-IHT.Third, we demonstrate how to incorporate predetermined SNP weights in IHT.Our simple and interpretable weighting option allows users to introduce prior knowledge into sparse projection.Thus, one can favor predictors whose association to the response is supported by external evidence.Fourth, we present MendelIHT.jl:a scalable, open source, and user friendly software for IHT in the high performance programming language Julia (6).

Model Development
This section sketches our extensions of iterative hard thresholding (IHT).

IHT Background
IHT was originally formulated for sparse signal reconstruction, which is framed as sparse linear least squares regression.In classical linear regression, we are given an n × p design matrix X and a corresponding ncomponent response vector y.We then postulate that y has mean E(y) = Xβ and that the residual vector y − Xβ has independent Gaussian components with a common variance.The parameter (regression coefficient) vector β is estimated by minimizing the sum of squares f (β) = 1 2 y − Xβ 2 2 .The solution to this problem is known as the ordinary least squares estimator and can be written explicitly as β = (X t X) −1 X t y, provided the problem is overdetermined (n > p).This paradigm breaks down in the high-dimensional regime n p, where the parameter vector β is underdetermined.In the spirit of parsimony, IHT seeks a sparse version of β that gives a good fit to the data.This is accomplished by minimizing f (β) subject to β 0 ≤ k for a small value of k, where • 0 counts the number of nonzero entries of a vector.The optimization problem is formally: IHT abandons the explicit formula for β because it fails to respect sparsity and involves the numerically intractable matrix inverse (X t X) −1 .
IHT combines three core ideas.The first is steepest descent.Elementary calculus tells us that the negative gradient −∇ f (x) is the direction of steepest descent of f (β) at x. First-order optimization methods like IHT define the next iterate in minimization by the formula β n+1 = β n + s n v n , where v n = −∇ f (β n ) and s n > 0 is some optimally chosen step size.In the case of linear regression −∇ f (β) = X t (y − Xβ).To reduce the error at each iteration, the optimal step size s n can be selected by minimizing the second-order Taylor expansion with respect to s n .Here d 2 f (β) = X t X is the Hessian matrix of second partial derivatives.Because f (β) is quadratic, the expansion is exact.Its minimum occurs at the step size This formula summarizes the second core idea.
The third component of IHT involves projecting the steepest descent update β n + s n v n onto the sparsity set S k = {β : β 0 ≤ k}.The relevant projection operator P S k (β) sets all but the k largest entries of β in magnitude to 0. In summary, IHT solves problem (3.1) by updating the parameter vector β according to the recipe: with the step size given by formula (3.2).
An optional debiasing step can be added to improve parameter estimates.This involves replacing β n+1 by the exact minimum point of f (β) in the subspace defined by the support { j : β n+1, j = 0} of β n+1 .Debiasing is efficient because it solves a low-dimensional problem.Several versions of hard-thresholding algorithms have been proposed in the signal processing literature.The first of these, NIHT (7), omits debaising.The rest, HTP (14), GraHTP (47), and CoSaMp (34) offer debiasing.

IHT for Generalized Linear Models
A generalized linear model (GLM) involves responses y following a natural exponential distribution with density in the canonical form where y is the data, θ is the natural parameter, φ > 0 is the scale (dispersion), and a(φ ), b(θ ), and c(y, φ ) are known functions which vary depending on the distribution (13; 30).Simple calculations show that y has mean µ = b (θ ) and variance σ 2 = b (θ )a(φ ); accordingly, σ 2 is a function of µ.Table 1 summarizes the mean domains and variances of a few common exponential families.Covariates enter GLM modeling through an inverse link representation µ = g(x t β), where x is a vector of covariates (predictors) and β is vector of regression coefficients (parameters).In statistical practice, data arrive as a sample of independent responses y 1 , . . ., y m with different covariate vectors x 1 , . . ., x m .To put each predictor on an equal footing, each should be standardized to have mean 0 and variance 1.Including an additional intercept term is standard practice.
If we assemble a design matrix X by stacking the row vectors x t i , then we can calculate the loglikelihood,

Family
Mean Domain Var(y) where W 1 and W 2 are two diagonal matrices.The second has positive diagonal entries; they coincide under the identity inverse link g(s) = s.
In the generalized linear model version of IHT, we maximize L(β ) (equivalent to minimizing f (β) = −L(β)) and substitute the expected information (3.2).This translates into the following step size in GLM estimation: This substitution is a key ingredient of our extended IHT.It simplifies computations and guarantees that the step size is nonnegative.

Doubly Sparse Projections
The effectiveness of group sparsity in penalized regression has been demonstrated in general (31; 16) and for GWAS (53) in particular.Group IHT (46) enforces group sparsity but does not enforce within-group sparsity.
In GWAS, model selection is desired within groups as well to pinpoint causal SNPs.Furthermore, one concern in GWAS is that two causative SNPs can be highly correlated with each other due to linkage disequilibrium (LD).When sensible group information is available, doubly sparse IHT encourages the detection of causative yet correlated SNPs while enforcing sparsity within groups.Here we discuss how to carry out a doubly-sparse projection that enforces both within-and between-group sparsity.
Suppose we divide the SNPs of a study into a collection G of nonoverlapping groups.Given a parameter vector β and a group g ∈ G, let β g denote the components of β corresponding to the SNPs in g.Now suppose we want to select at most j groups and at most λ g ∈ Z + SNPs for each group g.In projecting β, the component β i is untouched for a selected SNP i.For an unselected SNP, β i is reset to 0. By analogy with our earlier discussion, we can define a sparsity projection operator P g (β g ) for each group g; P g (β g ) selects the λ g most prominent SNPs in group g.The potential reduction in the squared distance offered by group g is r g = β g 2 2 − P g (β g ) 2  2 .The j selected groups are determined by selecting the j largest values of r g .If desired, we can set the sparsity level λ g for each group high enough so that all SNPs in group g come into play.Thus, doubly-sparse IHT generalizes group-IHT.In Algorithm 1, we write P(β) for the overall projection with the component projections P g (β g ) on the j selected groups and projection to zero on the remaining groups.

Prior weights in IHT
Zhou et al. (53) treat prior weights in penalized GWAS.Before calculating the lasso penalty, they multiply each component of the parameter vector β by a positive weight w i .We can do the same in IHT before projection.Thus, instead of projecting the steepest descent step β = β n + s n v n , we project the Hadamard (pointwise) product w • β of β with a weight vector w.This produces a vector with a sensible support S. The next iterate β n+1 is defined to have support S and to be equal to In GWAS, weights can and should be informed by prior biological knowledge.A simple scheme for choosing nonconstant weights relies on minor allele frequencies.For instance, Zhou et al. ( 51) assign SNP i with minor allele frequency p i the weight w i = 1/ 2p i (1 − p i ).Giving rare SNPs greater weight in this fashion is most appropriate for traits under strong negative selection (49; 37).Alternatively, our software permits users to assign weights geared to specific pathway and gene information.
de Lamare et al. (12) incorporate prior weights into IHT by adding an element-wise logarithm of a weight vector q before projection.The weight vector q is updated iteratively and requires two additional tuning constants that in practice are only obtained through cross validation.Our weighting scheme is simpler, more computationally efficient, and more interpretable.
. CC-BY 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (which was this version posted February 28, 2020.; https://doi.org/10.1101/697755doi: bioRxiv preprint

Algorithm Summary
The final algorithm combining doubly sparse projections, prior weight scaling, and debiasing is summarized in Algorithm 1.
Algorithm 1: Iterative hard-thresholding Input : Design matrix X, response vector y, membership vector g, weight vector w, max number of groups j, and overall sparsity projection P(β). 1 Initialize: β ≡ 0. Output: β with j active groups and λ g active predictors for group g

Results
Readers can reproduce our results by accessing the software, documentation, and Jupyter notebooks at: https://github.com/OpenMendel/MendelIHT.jl

Scalability of IHT
To test the scalability of our implementation, we ran IHT on p = 10 6 SNPs for sample sizes n = 10, 000, 20, 000, ..., 120, 000 with five independent replicates per n.All simulations rely on a true sparsity level of k = 10.Based on an Intel-E5-2670 machine with 63GB of RAM and a single 3.3GHz processor, Figure 2 plots the IHT median CPU time per iteration, median iterations to convergence, and median memory usage under Gaussian, logistic, Poisson, and negative binomial models.The largest matrix simulated here is 30GB in size and can still fit into our personal computer's memory.Of course, it is possible to test even larger sample sizes using cloud or cluster resources, which are often needed in practice.
The formation of the vector µ of predicted values requires only a limited number of nonzero regression coefficients.Consequently, the computational complexity of this phase of IHT is relatively light.In contrast, calculation of the Fisher score (gradient) and information (expected negative Hessian) depend on the entire genotype matrix X.Fortunately, each of the np entries of X can be compressed to 2 bits.Figure 2b and d show that IHT memory demands beyond storing X never exceeded a few gigabytes.Figure 2a and c show that IHT  run time per iteration increases linearly in problem size n.Similarly, we expect increasing p will increase run time linearly, since the bottleneck of IHT is the matrix-vector multiplication step in computing the gradient, which scales as O(np).Debiasing increases run time per iteration only slightly.Except for negative binomial responses, debiasing is effective in reducing the number of iterations required for convergence and hence overall run time.

Cross Validation in Model Selection
In actual studies, the true number of genetic predictors k true is unknown.This section investigates how q-fold cross-validation can determine the best model size on simulated data.Under normal, logistic, Poisson, and negative binomial models, we considered 50 different combinations of X, y, and β true with k true = 10, n = 5000 samples, and p = 50, 000 SNPs fixed in all replicates.Here, k true is chosen so that it is closer to our NFBC and UK Biobank results.On these data sets we conducted 5-fold cross validation across 20 model sizes k ranging from 1 to 20. Figure 3 plots deviance residuals on the holdout dataset for each of the four GLM responses  (mean squared error in the case of normal responses) and the best estimate k of k true .
Figure 3 shows that k true can be effectively recovered by cross validation.In general, prediction error starts off high where the proposed sparsity level k severely underestimates k true and plateaus when k true is reached (Figure 3a-d).Furthermore, the estimated sparsity k for each run is narrowly centered around k true = 10 (Figure 3e-f).In fact, | k − k true | ≤ 4 always holds.When k exceeds k true , the estimated regression coefficients for the false predictors tend to be very small.In other words, IHT is robust to overfitting, in contrast to lasso penalized regression.We see qualitatively similar results when k true is large.This proved to be the case in our previous paper (22) for Gaussian models with k true ∈ {100, 200, 300}.

Comparing IHT to Lasso and Marginal Tests in Model Selection
Comparison of the true positive and false positive rates of IHT and its main competitors is revealing.For lasso regression we use the glmnet implementation of cyclic coordinate descent (15; 43; 44) (v2.0-16 implemented in R 3.5.2);for marginal testing we use the beta version of MendelGWAS (54).As explained later, Poisson regression is supplemented by zero-inflated Poisson regression implemented under the pscl (48) (v1.5.2) package of R. Unfortunately, glmnet does not accommodate negative binomial regression.Because both glmnet and pscl operate on floating point numbers, we limit our comparisons to small problems with 1000 subjects, 10,000 SNPs, 50 replicates, and k = 10 causal SNPs.IHT performs model selection by 3-fold cross validation across model sizes ranging from 1 to 50.This range is generous enough to cover the models selected by lasso regression.We adjust for multiple testing in the marginal case test by applying a p-value cutoff of 5 × 10 −6 .
Table 2 demonstrates that IHT achieves the best balance between maximizing true positives and minimizing false positives.IHT finds more true positives than marginal testing and almost as many as lasso regression.IHT also finds far fewer false positives than lasso regression.Poisson regression is exceptional in yielding an excessive number of false positives in marginal testing.A similar but less extreme trend is observed for lasso

Reconstruction Quality for GWAS Data
Table 3 demonstrates that IHT estimates show little bias compared to estimates from lasso and marginal regressions.These trends hold with or without debiasing as described earlier.The proportion of variance explained is approximately the same in both scenarios.The displayed values are the averaged estimated β 's, computed among the SNPs actually found.As expected, lasso estimates show severe shrinkage compared to IHT.Estimates from marginal tests are severely overestimated, since each SNP are asked to explain more trait variance than it could.As the magnitude of β true falls, IHT estimates show an upward absolute bias, consistent with the winner's curse phenomenon.When sample sizes are small, small effect sizes make most predictors imperceptible amid the random noise.The winner's curse operates in this regime and cannot be eliminated by IHT.Lasso's strong shrinkage overwhelms the bias of the winner's curse and yields estimates smaller than true values.
The results displayed in Table 3 reflect n = 5, 000 subjects, p = 10, 000 SNPs, 100 replicates, and a sparsity level k fixed at its true value k true = 10.The λ value for lasso is chosen by cross validation.To avoid data sets with monomorphic SNPs, the minimum minor allele frequency (maf) is set at 0.05.For linear, logistic and Poisson regressions in marginal tests, we first screen for potential SNPs via a score test.Only top SNPs are used in the more rigorous and more computationally intensive likelihood ratio tests, which gives the beta estimates.This procedure is described in (54).We ran likelihood ratio tests for all SNPs in the negative binomial model because the screening procedure is not yet implemented.However, the inflation in parameter estiamtes are present throughout all marginal tests.

Correlated Covariates and Doubly Sparse Projections
Next we study how well IHT works on correlated data and whether doubly-sparse projection can enhance model selection.Table 4 shows that, in the presence of extensive LD, IHT performs reasonably well even without grouping information.When grouping information is available, group IHT enhances model selection.SNP belongs to 1 of 500 disjoint groups containing 20 SNPs each; j = 5 distinct groups are each assigned 1, 2, ..., 5 causal SNPs with effect sizes randomly chosen from {−0.2, 0.2}.In all there 15 causal SNPs.For grouped-IHT, we assume perfect group information.That is, groups containing 1 ∼ 5 causative SNPs are assigned λ g ∈ {1, 2, ..., 5}.The remaining groups are assigned λ g = 1.As described in the Methods Section, the simulated data show LD within each group, with the degree of LD between two SNPs decreasing as their separation increases.Although the conditions of this simulation are somewhat idealized, they mimic what might be observed if small genetic regions of whole exome data were used with IHT.

The results displayed in
We repeated this examination of doubly sparse projection for the first 30, 000 SNPs of the NFBC1966 (36) data for all samples passing the quality control measures outlined in our Methods Section.We arbitrarily assembled 2 large groups with 2000 SNPs, 5 medium groups with 500 SNPs, and 10 small groups with 100 SNPs, representing genes of different length.The remaining SNPs are lumped into a final group representing non-coding regions.In all there are 18 groups.Since group assignments are essentially random beyond choosing neighboring SNPs, this example represents the worse case scenario of a relatively sparse marker map with undifferentiated SNP groups.We randomly selected 1 large group, 2 medium groups, and 3 small groups to contain 5, 3, and 2 causal SNPs, respectively.The non-coding region harbors 2 causal SNPs.In all there are 19 causal SNPs.Effect sizes were randomly chosen to be −0.2 or 0.2.We ran 100 independent simulation studies under this setup, where the large, medium, small, and non-coding groups are each allowed 5, 3, 2, and 2 active SNPs.The results are displayed in Table 5.We find that even in this worse case scenario where group information is completely lacking that grouped IHT does no worse than ungrouped IHT.

Introduction of Prior Weights
This section considers how scaling by prior weights helps in model selection.Table 6 compares weighted IHT reconstructions with unweighted reconstructions where all weights w i = 1.The weighted version of IHT consistently finds approximately 10% more true predictors than the unweighted version.Here we simulated 50 replicates involving 1000 subjects, 10,000 uncorrelated variants, and k = 10 true predictors for each GLM.For the sake of simplicity, we defined a prior weight w i = 2 for about one-tenth of all variants, including the 10 true predictors.For the remaining SNPs the prior weight is w i = 1.These choices reflect a scenario where one tenth of all genotyped variants fall in a protein coding region, including the 10 true predictors, and where such variants are twice as likely to influence a trait as those falling in non-coding regions.

Hypertension GWAS in the UK Biobank
Now we test IHT on the second release of UK Biobank (38) data.This dataset contains ∼ 500, 000 samples and ∼ 800, 000 SNPs without imputation.Phenotypes are systolic blood pressure (SBP) and diastolic blood pressure (DBP), averaged over 4 or fewer readings.To adjust for ancestry and relatedness, we included the following nongenetic covariates: sex, hospital center, age, age 2 , BMI, and the top 10 principal components computed with FlashPCA2 (1).After various quality control procedures as outlined in the Methods section, the final dataset used in our analysis contains 185, 565 samples and 470, 228 SNPs.For UK biobank analysis, we omitted debiasing, prior weighting, and doubly sparse projections.

Stage 2 Hypertension under a Logistic Model
Consistent with the clinical definition for stage 2 hypertension (S2 Hyp) (42), we designated patients as hypertensive if their SBP ≥ 140mmHG or DBP ≥ 90 mmHG.We ran 5-fold cross validated logistic model across model sizes k = {1, 2, ..., 50}.The work load was distributed to 50 computers, each with 5 CPU cores.Each computer was assigned one model size, and all completed its task within 24 hours.The model size that minimizes the deviance residuals is k = 39.The selected predictors include the 33 SNPs listed in Table 7 and 6 non-genetic covariates: intercept, sex, age, age 2 , BMI, and the fifth principal component.Figure 4, generated by MendelPlots.jl(19), compares univariate logistic GWAS with logistic IHT.SNPs recovered by IHT are circled in black.Our Github page records the full list of significant SNPs detected by univariate GWAS.There are 10 SNPs selected by IHT that have a p-value less than 5 × 10 −8 ; 83 SNPs pass the threshold in the univariate analysis but remain unselected by IHT.IHT tends to pick the most significant SNP among a group of SNPs in LD. to be associated with elevated SBP/DBP (27) or that exhibit genome-wide significance when the same data are analyzed as an ordinal trait (18).Ordinal univariate GWAS treats the different stages of hypertension as ordered categories.Ordinal GWAS has higher power than logistic or multinomial GWAS (18).The known SNPs displayed in Table 7 tend to have larger absolute effect sizes (avg 0.033) than the unknown SNPs (avg = 0.027).Finally, IHT is able to recover two pairs of highly correlated SNPs: (rs1374264,rs1898841) and (rs7497304,rs2677738) with pairwise correlations of r 1,2 = 0.59 and r 3,4 = 0.49.

Cardiovascular GWAS in NFBC1966
We also tested IHT on data from the 1966 Northern Finland Birth Cohort (NFBC1966) (36).Although this dataset is relatively modest with 5402 participants and 364,590 SNPs, it has two virtues.First, it has been analyzed multiple times (22; 36; 17), so comparison with earlier analysis is easy.Second, due to a population bottleneck (28), the participants' chromosomes exhibit more extensive linkage disequilibrium than is typically found in less isolated populations.Multiple regression methods, including the lasso, have been criticized for their inability to deal with the dependence among predictors induced by LD.Therefore this dataset provides an interesting test case.

High Density Lipoprotein (HDL) Phenotype as a Normal model
Using IHT we find previously associated SNPs as well as a few new potential associations.We model the HDL phenotype as normally-distributed and find a best model size k = 9 based on 5-fold cross validation across model sizes k = {1, 2, ..., 20}.Without debiasing, the analysis was completed in 2 hours and 4 minutes with 30 CPU cores on a single machine.Table 8 displays the recovered predictors.SNP rs1800961 was replaced by rs7499892 with similar effect size if we add the debiasing step in obtaining the final model.
Importantly, IHT is able to simultaneously recover effects for SNPs (1) rs9261224, (2) rs6917603, and (3) rs6917603 with pairwise correlations of r 1,2 = 0.618, r 1,3 = 0.984, and r 2,3 = 0.62.This result is achieved without grouping of SNPs, which can further increase association power.Compared with earlier analyses of these data, we find 3 SNPs that were not listed in our previous IHT paper (22), presumably due to slight algorithmic modifications.The authors of NFBC (36) found 5 SNPs associated with HDL under SNP-by-SNP testing.We did not find SNPs rs2167079 and rs255049.To date, rs255049 was replicated (17).SNP rs2167079 has been reported to be associated with an unrelated phenotype (33).If we repeat the analysis with HDL dichotomized into low and high HDL using a cutpoint of 60ml/DL, then we identify the 5 SNPs rs9261224, rs6917603, rs9261256, rs3764261, and rs9898058; all but one of these, SNP rs9898058, is also found under the continuous model.This SNP is not replicated in previous studies.As in the continuous model, rs6917603 has the largest effect of all the selected SNPs.Readers interested in the full result can visit our Github site.

Low Density Lipoprotein (LDL) as a Binary Response
Unfortunately we did not have access to any qualitative phenotypes for this cohort, so for purposes of illustration, we fit a logistic regression model to a derived dichotomous phenotype, high versus low levels of Low Density Lipoprotein (LDL).The original data are continuous, so we choose 145 mg/dL, the midpoint (42) between the borderline-high and high LDL cholesterol categories, to separate the two categories.This dichotomization resulted in 932 cases (high LDL) and 3961 controls (low LDL).Under 5-fold cross validation without debiasing across model sizes k = {1, 2, ..., 20}, we find k = 3.Using 30 CPU cores, our algorithm finishes in 1 hours and 7 minutes.
Despite the loss of information inherent in dichotomization, our results are comparable to the prior results under a normal model for the original quantitative LDL phenotype.Our final model still recovers two SNP predictors with and without debiasing (Table 8).We miss all but one of the SNPs that the NFBC analysis found to be associated with LDL treated as a quantitative trait.Notably we again find an association with SNP rs6917603 that they did not report.

Discussion
Multiple regression methods like iterative hard thresholding provide a principled way of model fitting and variable selection.With increasing computing power and better software, multiple regression methods are likely to prevail over univariate methods.This paper introduces a scalable implementation of iterative hard thresholding for generalized linear models.Because lasso regression can handle group and prior weights, we have also extended IHT to incorporate such prior knowledge.When it is available, enhanced IHT outperforms standard IHT.Given its sharper parameter estimates and more robust model selection, IHT is clearly superior to lasso selection or marginal association testing in GWAS.
Our real data analyses and simulation studies suggest that IHT can (a) recover highly correlated SNPs, (b) avoid over-fitting, (c) deliver better true positive and false positive rates than either marginal testing or lasso regression, (d) recover unbiased regression coefficients, and (e) exploit prior information and group-sparsity.Our Julia implementation of IHT can also exploit parallel computing strategies that scale to biobank-level data.In our opinion, the time is ripe for the genomics community to embrace multiple regression models as a supplement to and possibly a replacement of marginal analysis.
Although we focused our attention on GWAS, the potential applications of iterative hard thresholding reach far beyond gene mapping.Our IHT implementation accepts arbitrary numeric data and is suitable for a variety of applied statistics problems.Genetics and the broader field of bioinformatics are blessed with rich, ultra-high dimensional data.IHT is designed to solve such problems.By extending IHT to the realm of generalized linear models, it becomes possible to fit regression models with more exotic distributions than the Gaussian distributions implicit in ordinary linear regression.In our view IHT will eventually join and probably supplant lasso regression as the method of choice in GWAS and other high-dimensional regression settings.

Data Simulation
Our simulations mimic scenarios for a range of rare and common SNPs with or without LD.Unless otherwise stated, we designate 10 SNPs to be causal with effect sizes of 0.1, 0.2, ..., 1.0.
To generate independent SNP genotypes, we first sample a minor allele frequency ρ j ∼ Uniform(0, 0.5) for each SNP j.To construct the genotype of person i at SNP j, we then sample from a binomial distribution with success probability ρ j and two trials.The vector of genotypes (minor allele counts) for person i form row x t i of the design matrix X.To generate SNP genotypes with linkage disequilibrium, we divide all SNPs into blocks of length 20.Within each block, we first sample x 1 ∼ Bernoulli(0.5).Then we form a single haplotype block of length 20 by the following Markov chain procedure: x i with probability p 1 − x i with probability 1 − p with default p = 0.75.For each block we form a pool of 20 haplotypes using this procedure, ensuring every one of the 40 alleles (2 at each SNP) are represented at least once.For each person, the genotype vector in a block is formed by sampling 2 haplotypes with replacement from the pool and summing the number of minor alleles at each SNP.
Depending on the simulation, the number of subjects range from 1, 000 to 120, 000, and the number of independent SNPs range from 10, 000 to 1, 000, 000.We simulate data under four GLM distributions: normal (Gaussian), Bernoulli, Poisson, and negative binomial.We generate component y i of the response vector y by sampling from the corresponding distribution with mean µ i = g(x t i β), where g is the inverse link function.For normal models we assume unit variance, and for negative binomial models we assume 10 required failures.To avoid overflows, we clamp the mean g(x t i β) to stay within [−20 , 20].(See Ad Hoc Tactics for a detailed explanation).We apply the canonical link for each distribution, except for the negative binomial, where we apply the log link.

Real Data's Quality Control Procedures
UK Biobank. the UK biobank's own quality control procedures, we first filtered all samples for sex discordance and high heterozygosity/missingness.Second, we included only people of European ancestry and excluded first and second-degree relatives based on empiric kinship coefficients.Third, we also excluded people who had taken hypertension related medications at baseline.Finally, we only included people with ≥ 98% genotyping success rate over all chromosomes and SNPs with ≥ 99% genotyping success rate.Calculation of kinship coefficients and filtering were carried out via the OpenMendel modules SnpArrays (52).Remaining missing genotypes were imputed using modal genotypes at each SNP.After these quality control procedures, our UK biobank data is the same data that was used in (18).
Northern Finland Birth Cohort.We imputed missing genotypes with Mendel (24).Following (22), we excluded subjects with missing phenotypes, fasting subjects, and subjects on diabetes medication.We conducted quality control measures using the OpenMendel module SnpArrays (52).Based on these measures, we excluded SNPs with minor allele frequency ≤ 0.01 and Hardy Weinberg equilibrium p-values ≤ 10 −5 .As for non-genetic predictors, we included sex (the sexOCPG factor defined in (36)) as well as the first 2 principal components of the genotype matrix computed via PLINK 2.0 alpha (11).To put predictors, genetic and non-genetic, on an equal footing, we standardized all predictors to have mean zero and unit variance.

Linear Algebra with Compressed Genotype Files
The genotype count matrix stores minor allele counts.The PLINK genotype compression protocol (11) compactly stores the corresponding 0's, 1's, and 2's in 2 bits per SNP, achieving a compression ratio of 32:1 compared to storage as floating point numbers.For a sparsity level k model, we use OpenBLAS (a highly optimized linear algebra library) to compute predicted values.This requires transforming the k pertinent columns of X into a floating point matrix X k and multiplying it times the corresponding entries β k of β.The inverse link is then applied to X k β k to give the mean vector µ = g(X k β k ).In computing the GLM gradient (equation 3.3), formation of the vector W 1 (y − µ) involves no matrix multiplications.Computation of the gradient X t W 1 (y − µ) is more complicated because the full matrix X can no longer be avoided.Fortunately, the OpenMendel module SnpArrays (52) can be invoked to perform compressed matrix times vector multiplication.Calculation of the steplength of IHT requires computation of the quadratic form ∇L(β n ) t X t W 2 X∇L(β n ).Given the gradient, this computation requires a single compressed matrix times vector multiplication.Finally, good statistical practice calls for standardizing covariates.To standardize the genotype counts for SNP j, we estimate its minor allele frequency p j and then substitute the ratio for the genotype count x i j for person i at SNP j.This procedure is predicated on a binomial distribution for the count x i j .Our previous paper (22) shows how to accommodate standardization in the matrix operations of IHT without actually forming or storing the standardized matrix.
Although multiplication via the OpenMendel module SnpArrays (52) is slower than OpenBLAS multiplication on small data sets, it can be as much as 10 times faster on large data sets.OpenBLAS has advantages in parallelization, but it requires floating point arrays.Once the genotype matrix X exceeds the memory available in RAM, expensive data swapping between RAM and hard disk memory sets in.This dramatically slows matrix multiplication.SnpArrays is less vulnerable to this hazard owing to compression.Once compressed data exceeds RAM, SnpArrays also succumbs to the swapping problem.Current laptop and desktop computers seldom have more than 32 GB of RAM, so we must resort to cluster or cloud computing when input files exceed 32 GB.

Computations Involving Non-genetic Covariates
Non-genetic covariates are stored as double or single precision floating point entries in an n × r design matrix Z.To accommodate an intercept, the first column should be a vector of 1's.Let γ denote the r vector of regression coefficients corresponding to Z.The full design matrix is the block matrix (X Z).Matrix multiplications involving (X Z) should be carried out via Adherence to these rules ensures a low memory footprint.Multiplication involving X can be conducted as previously explained.Multiplication involving Z can revert to BLAS.

Parallel Computation
The OpenBLAS library accessed by Julia is inherently parallel.Beyond that we incorporate parallel processing in cross validation.Recall that in q-fold cross validation we separate subjects into q disjoint subsets.We then fit a training model using q − 1 of those subsets on all desired sparsity levels and record the mean-squared prediction error on the omitted subset.Each of the q subsets serve as the testing set exactly once.Testing error is averaged across the different folds for each sparsity levels k.The lowest average testing error determines the recommended sparsity.
MendelIHT.jloffers 2 parallelism strategies in cross validation.Either the q training sets are each loaded to q different CPUs where each compute and test differ sparsity levels sequentially, or each of the q training sets are cycled through sequentially and each sparsity parameter is fitted and tested in parallel.The former tactic requires enough disk space and RAM to store q different training data (where each typically require (q − 1)/q GB of the full data), but offers immense parallel power because one can assign different computers to handle different sparsity levels.This tactic allows one to fit biobank scale data in less than a day assuming enough storage space and computers are available.The latter tactic requires cycling through the training sets sequentially.Since intermediate data can be deleted, the tactic only requires enough disk space and RAM to store 1 copy of the training set.MendelIHT.jluses one of Julia's (6) standard library Distributed.jl to achieve the aforementioned parallel strategies.

Ad Hoc Tactics to Prevent Overflows
In Poisson and negative binomial regressions, the inverse link argument exp(x t i β) experiences numerical overflows when the inner product x t i β is too large.In general, we avoid running Poisson regression when response means are large.In this regime a normal approximation is preferred.As a safety feature, MendelIHT.jlclamps values of x t i β to the interval [−20 , 20].Note that penalized regression suffers from the same overflow catastrophes.

Convergence and Backtracking
For each proposed IHT step we check whether the objective L(β) increases.When it does not, we step-halve at most 5 times to restore the ascent property.Convergence is declared when with the default tolerance being 0.0001.The addition of 1 in the denominator of the convergence criterion guards against division by 0.

( a )
Speed per iteration without debiasing (b) Memory usage without debiasing (c) Median iterations until convergence.Speed per iteration with debiasing (e) Memory usage with debiasing

Figure 2 :
Figure 2: (a, d) Time per iteration scales linearly with data size.Speed is measured for compressed genotype files.On uncompressed data, all responses are roughly 10 times faster.(b, e) Memory usage scales as ∼ 2np bits.Note memory for each response are usages in addition to loading the genotype matrix.Uncompressed data requires 32 times more memory.(c) Debiasing reduces median iterations until convergence for all but negative binomial regression.Benchmarks were carried out on 10 6 SNPs and sample sizes ranging from 10,000 to 120,000.Hence, the largest matrix here requires 30GB and can still fit into personal computer memories.

Figure 3 :
Figure 3: Five-fold cross validation results is capable of identifying the true model size k true .(a-d) Deviance residuals of the testing set are minimized when the estimated model size k ≈ k true .Each line represents 1 simulation.(e-h) k is narrowly spread around k true = 10.

Figure 4 :
Figure 4: Manhattan plot comparing a logistic (univariate) GWAS vs logistic IHT on UK Biobank data.Colored dots are log 10 p-values from a logistic GWAS, and the circled dots are SNPs recovered by IHT.

Table 1 :
Summary of mean domains and variances for common exponential distributions.In GLM, µ = g(x t β) denotes the mean, s = x t β the linear responses, g is the inverse link function, and φ the dispersion.

Table 2 :
IHT achieves the best balance of false positives and true positives compared to lasso and marginal (single-snp) regression.TP = true positives, FP = false positives.There are k = 10 causal SNPs.Best model size for IHT and lasso were chosen by cross validation.() = zero-inflated Poisson regression.regression.The marginal false positive rate is reduced by switching to zero-inflated Poisson regression.This alternative model is capable of handling overdispersion due an excess of 0 values.Interestingly, IHT rescues the Poisson model by accurately capturing the simultaneous impact of multiple predictors.
* = zero true positives observed on average.NA = glmnet does not support negative binomial lasso regression.There are k = 10 true SNPs.

Table 7 :
Table 7 shows 25 SNPs selected by IHT that were previously reported UK Biobank GWAS results generated by running IHT on Stage 2 Hypertension (S2 Hyp) under a logistic model.The SNP ID, chromosome number, position (in basepair), and estimated effect sizes are listed.

Table 8 :
NFBC GWAS results generated by running IHT on high density lipoprotein (HDL) phenotype as a normal response.The SNP ID, chromosome number, position (in basepair), and estimated effect sizes are listed.