Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts

Wei Zhou; Jonas B. Nielsen; Lars G. Fritsche; Jonathon LeFaive; Sarah A. Gagliano Taliun; Wenjian Bi; Maiken E. Gabrielsen; Mark J. Daly; Benjamin M. Neale; Kristian Hveem; Goncalo R. Abecasis; Cristen J. Willer; Seunggeun Lee

doi:10.1101/583278

Abstract

With very large sample sizes, population-based cohorts and biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, gene or region-based multiple variant aggregate tests are commonly used to increase association test power. However, due to the substantial computation cost, existing region-based rare variant tests cannot analyze hundreds of thousands of samples while accounting for confounders, such as population stratification and sample relatedness. Here we propose a scalable generalized mixed model region-based association test that can handle large sample sizes. This method, SAIGE-GENE, utilizes state-of-the-art optimization strategies to reduce computational and memory cost, and hence is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples. Through the analysis of the HUNT study of 69,716 Norwegian samples and the UK Biobank data of 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large sample data (N > 400,000) with type I error rates well controlled.

Introduction

In recent years, large cohort studies and biobanks, such as the Trans-Omics for Precision Medicine (TOPMed) study and the UK Biobank¹, have sequenced or genotyped hundreds of thousands of samples, which are invaluable resources to identify genetic components of complex traits, including rare variants (minor allele frequency (MAF) < 1%). It is well known that single variant tests are underpowered to identify trait-associated rare variants². Gene- or region-based tests, such as the Burden test, SKAT³ and SKAT-O⁴, can be more powerful by grouping rare variants into functional units such as genes. To adjust for both population structure and sample relatedness, gene-based tests have been extended to mixed models^5,6. For example, EmmaX⁵-based SKAT³ approaches (EmmaX-SKAT) have been implemented and used for many rare variant association studies including TOPMed⁷. The generalized linear mixed model gene-based test, SMMAT, has been recently developed⁶. However, these approaches require O(N³) computation time and O(N²) memory, where N is the sample size, and therefore do not scale to large datasets.

Here, we propose a novel method called SAIGE-GENE for region-based association analysis that is capable of handling very large samples (> 400,000 individuals), while inferring and accounting for sample relatedness. SAIGE-GENE is an extension of the previously developed single variant association method, SAIGE⁸, with a modification suitable for rare variants. Like SAIGE, it utilizes state-of-the-art optimization strategies to reduce the computation cost of fitting null mixed models. To ensure computational efficiency while improving test accuracy for rare variants, SAIGE-GENE approximates the variance of score statistics calculated with the full genetic relationship matrix (GRM) using the variance calculated with a sparse GRM and the ratio of these two variances estimated from a subset of genetic markers. Because the sparse GRM, constructed by thresholding small values in the full GRM, preserves close family structures, this approach provides a far more accurate variance estimation for very rare variants (minor allele count (MAC) < 20) than the original approach in SAIGE. By combining single variant score statistics, SAIGE-GENE can perform Burden, SKAT and SKAT-O type gene-based tests. We have also developed conditional analysis, which performs association tests conditioning on a single variant or multiple variants to identify independent rare variant association signals.

We have demonstrated that SAIGE-GENE controls type I error rates in related samples through extensive simulations as well as real data analysis, including the HUNT study of 69,716 Norwegian samples^9,10 and the UK Biobank of 408,910 White British samples¹. By evaluating its computation performance, we have shown the feasibility of SAIGE-GENE for large-scale genome-wide analysis. To perform exome-wide gene-based tests on 400,000 samples with on average 50 markers per gene, SAIGE-GENE requires 2,238 CPU hours and less than 36 Gb of memory, while current methods would require more than 10 Tb of memory. We have further applied SAIGE-GENE to 53 quantitative traits in the UK Biobank and identified several significantly associated genes through exome-wide gene-based tests.

RESULTS

Overview of Methods

SAIGE-GENE consists of two main steps: 1. Fitting the null generalized linear mixed model (GLMM) to estimate variance components and other model parameters. 2. Testing for association between each genetic variant set, such as a gene or a region, and the phenotype. Three association tests, Burden, SKAT, and SKAT-O, have been implemented in SAIGE-GENE. The workflow is shown in Supplementary Figure 1.

SAIGE-GENE uses optimization strategies similar to those in the original SAIGE to achieve scalability for fitting the null GLMM and estimating the model parameters in Step 1. In particular, the spectral decomposition has been replaced by the preconditioned conjugate gradient (PCG) method to solve linear systems without calculating and inverting the N × N GRM. To reduce the memory usage, raw genotypes are stored in a binary vector and elements of the GRM are calculated when needed rather than being stored.

One of the most time-consuming parts of association tests is calculating the variance of the single variant score statistic, which requires O(N²) computation. In SAIGE⁸, BOLT-LMM¹¹, and GRAMMAR-Gamma¹², in order to reduce the computation cost, the variance with the GRM is approximated using the variance without the GRM and the estimated ratio of the two variances. The ratio, which is assumed to be constant, is estimated using a subset of randomly selected genetic markers. However, for very rare variants with MAC below 20, the constant ratio assumption is not satisfied (Supplementary Figure 2, left panel). This is because rare variants are more susceptible to close family structures. Thus, to better approximate the variance, SAIGE-GENE incorporates close family structures through a sparse GRM, in which GRM elements below a user-specified relatedness coefficient are zeroed out and close family structures are preserved. The ratio between the variance with the full GRM and with the sparse GRM is much less variable (Supplementary Figure 2, right panel). To construct a sparse GRM, a small subset of randomly selected genetic markers, e.g. 2,000, is first used to quickly estimate which sample pairs pass the user-specified coefficient of relatedness cutoff, e.g. ≥ 0.125 for up to 3^rd degree relatives. Then the coefficients of relatedness for those related pairs are further estimated using the full set of genetic markers, so that they equal the corresponding values in the full GRM. Once the sparse GRM has been computed for a biobank or a data set, it can be re-used for downstream genetic association tests for any phenotype. Heritability estimates using a sparse GRM with up to 3^rd degree relatives preserved for 24 quantitative traits with sample size ≥ 100,000 in the UK Biobank are close to the estimates using the full GRM (Supplementary Figure 3).
Moreover, given that estimated values for variance ratios may vary by MAC for the extremely rare variants (Supplementary Figure 2, left panel), such as singletons and doubletons, the variance ratio can be estimated separately for different MAC categories. By default, the MAC categories are MAC equal to 1, 2, 3, 4, 5, 6 to 10, 11 to 20, and > 20.

In Step 2, gene-based tests are conducted using single variant score statistics and their covariance estimates, which are approximated as the product of the covariance with the sparse GRM and the pre-estimated ratio. SAIGE-GENE can carry out Burden, SKAT, and SKAT-O approaches. Since SKAT-O is a combined test of Burden and SKAT, and hence provides robust power, SAIGE-GENE performs SKAT-O by default.

If a gene or a region is significantly associated with the phenotype of interest, it is necessary to test if the signal is from rare variants or just a shadow of common variants in the same locus. We have developed the conditional analysis using linkage disequilibrium (LD) information between conditioning markers and the tested gene¹³. Details of the conditional analysis are described in the Online Methods section.

SAIGE-GENE uses the same generalized linear mixed model as SMMAT, whereas SMMAT calculates the variances of the score statistics using the full GRM and hence can be thought of as the “exact” method. When the trait is continuous, the GLMM used by SAIGE-GENE and SMMAT is equivalent to the linear mixed model (LMM) of EmmaX-SKAT. We have further shown that SAIGE-GENE provides association p-values consistent with those of the “exact” methods EmmaX-SKAT and SMMAT (r² of −log₁₀ P-values > 0.99) using both simulation studies (Supplementary Figure 4) and real data analysis for down-sampled UK Biobank and HUNT (Supplementary Figure 5), but with much smaller computation and memory cost (Figure 1).

Figure 1.

Estimated and projected computation cost by sample size (N) for gene-based tests for 15,342 genes, each containing 50 rare variants. Benchmarking was performed on randomly sub-sampled UK Biobank data with 408,144 White British participants for waist-to-hip ratio. The reported run times and memory are medians of five runs with samples randomly selected from the full sample set using different sampling seeds. The reported computation time and memory for EmmaX-SKAT and SMMAT are projected when N > 20,000. A. Log-log plot of the memory usage as a function of sample size (N). B. Log-log plot of the run time as a function of sample size (N). Numerical data are provided in Supplementary Table 1.

Computation and Memory Cost

To evaluate the computation performance of SAIGE-GENE, we randomly sampled subsets of the 408,144 UK Biobank participants with White British ancestry and non-missing measurements for waist-to-hip ratio¹. We benchmarked SAIGE-GENE, EmmaX-SKAT, and SMMAT for exome-wide gene-based SKAT-O tests, in which 15,342 genes were tested, assuming that each has 50 rare variants.

The log10 of memory usage is plotted against sample size in Figure 1A. The memory cost of SAIGE-GENE is linear in the number of markers, M₁, used for kinship estimation, but too few markers may not be sufficient to account for subtle sample relatedness in the data, leading to inflated type I error rates in genetic association tests^8,14. When the sample size N is 400,000, SAIGE-GENE uses 11.74 Gb with M₁ = 93,511 and 35.59 Gb with M₁ = 340,447, making it feasible for large sample data. In contrast, with N = 400,000 the memory usage of EmmaX-SKAT and SMMAT is projected to be nearly 10 Tb, which makes them impractical for large sample data.
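The order of magnitude of these memory figures can be checked with a back-of-envelope calculation. The sketch below is illustrative only and is not taken from the SAIGE-GENE code: it assumes 2 bits per stored genotype (the text says only that genotypes are kept in a binary vector) and 8-byte doubles for a dense N × N matrix, and the function names are hypothetical.

```python
def packed_genotype_gb(n_samples: int, n_markers: int, bits_per_genotype: int = 2) -> float:
    """Size in GB (1e9 bytes) of bit-packed genotypes.

    The default of 2 bits per genotype is an assumption of this sketch.
    """
    return n_samples * n_markers * bits_per_genotype / 8 / 1e9


def dense_matrix_tb(n_samples: int, bytes_per_entry: int = 8) -> float:
    """Size in TB (1e12 bytes) of one dense N x N matrix of doubles."""
    return n_samples ** 2 * bytes_per_entry / 1e12


if __name__ == "__main__":
    n = 400_000
    for m1 in (93_511, 340_447):
        print(f"M1 = {m1}: packed genotypes ~ {packed_genotype_gb(n, m1):.2f} GB")
    print(f"one dense N x N matrix ~ {dense_matrix_tb(n):.2f} TB")
```

Under these assumptions the packed genotypes alone account for roughly 9.4 GB and 34.0 GB at the two values of M₁, which is close to the reported totals, while a single dense N × N matrix already needs about 1.28 TB before any working copies are made.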

The log10 of total computation time for exome-wide gene-based tests is plotted against the sample size in Figure 1B, and the computation times for Step 1 and Step 2 are plotted separately in Supplementary Figure 6, with numbers presented in Supplementary Table 1. The computation time for Step 1 is approximately O(M₁N^1.5) in SAIGE-GENE and O(N³) in SMMAT and EmmaX-SKAT, where M₁ is the number of markers used for estimating the full GRM and N is the sample size. In Step 2, the association test for each gene costs O(qK) in SAIGE-GENE, where q is the number of markers in the gene and K is the number of non-zero elements in the sparse GRM. Compared to O(qN²) in Step 2 of SMMAT and EmmaX-SKAT, SAIGE-GENE decreases the computation time dramatically. For example, in the UK Biobank (N = 408,910) with the relatedness coefficient ≥ 0.125 (corresponding to preserving samples with 3^rd degree or closer relatives in the GRM), K = 493,536, which is of the same order of magnitude as N, and hence O(qK) is much smaller than O(qN²). As the computation time in Step 2 is approximately linear in q, the number of markers in each variant set, the total computation time for exome-wide gene-based tests was projected for different q and is plotted in Supplementary Figure 7. In addition, we plotted the projected computation time for genome-wide region-based tests against the sample size in Supplementary Figure 8, in which 286,000 chunks with 50 markers per chunk were assumed to be tested, corresponding to the 14.3 million markers in the HRC-imputed UKB data with MAF ≤ 1% and imputation info score ≥ 0.8.
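To make the Step 2 comparison concrete, the following sketch plugs the UK Biobank values quoted above into the two cost expressions. It compares operation counts only and ignores the constants hidden in the O(·) notation, so it is an indication of scale rather than a run-time prediction; the function name is hypothetical.

```python
def step2_cost_ratio(n_samples: int, nonzero_sparse: int) -> float:
    """Ratio of O(q N^2) to O(q K) work per gene; the factor q cancels."""
    return n_samples ** 2 / nonzero_sparse


N = 408_910   # UK Biobank sample size quoted in the text
K = 493_536   # non-zero elements in the sparse GRM at cutoff 0.125

if __name__ == "__main__":
    print(f"K / N   = {K / N:.2f}")
    print(f"N^2 / K = {step2_cost_ratio(N, K):.3g}")
    # 286,000 chunks of 50 markers cover the 14.3 million markers mentioned above.
    print(f"markers covered = {286_000 * 50:,}")
```

With these values K is only about 1.2 times N, so the per-gene operation count drops by a factor on the order of 10⁵.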

With M₁ = 340,447, SAIGE-GENE takes 2,238 CPU hours for the exome-wide gene-based tests and 3,919 CPU hours for the genome-wide region-based tests for the waist-to-hip ratio with N = 400,000, where each test contains 50 markers on average. Compared to EmmaX-SKAT and SMMAT, SAIGE-GENE is 25 times faster for the exome-wide gene-based tests and 161 times faster for the genome-wide region-based tests. More details about the computation cost are presented in Supplementary Table 1.

The computation time for constructing the sparse GRM depends on the size of the small set of markers used for the initial determination of related sample pairs based on a relationship coefficient cutoff, which by default is 2,000. This step is needed only once per data set: the constructed sparse GRM is re-used for all phenotypes in the same cohort or biobank. For example, for the UK Biobank with the relationship coefficient ≥ 0.125, corresponding to up to 3^rd degree relatives, it took 312 CPU hours to create the sparse GRM. Parallel computation is allowed for this step.

Gene-based association analysis of quantitative traits in HUNT and UK-Biobank

We applied SAIGE-GENE to analyze 13,416 genes, with at least two rare (MAF ≤ 1%) missense and stop-gain variants that were directly genotyped or imputed from HRC, for high-density lipoprotein (HDL) in 69,716 Norwegian samples from the population-based Nord Trøndelag Health Study (HUNT)⁹. The HUNT study has substantial sample relatedness: ∼55,000 samples have at least one relative of up to 3rd degree present in HUNT. The quantile-quantile (QQ) plot for the p-values of SKAT-O tests from SAIGE-GENE for HDL in HUNT is shown in Figure 2A. As Table 1 shows, eight genes reached the exome-wide significance threshold (P-value ≤ 2.5×10⁻⁶) and all of them are located in previously reported GWAS loci for HDL^15,16. By extending 500kb upstream and downstream, a top significant hit from single-variant association tests was identified around each gene. For genes LIPC, LIPG, NR1H3, and CKAP5, the top hits are common variants with MAF > 5%, and the top hits in FSD1L, ABCA1 and RNF111 are less frequent non-coding variants that are not included in the gene-based tests. After conditioning on top hits, all genes remained exome-wide significant.

Table 1.

Genes that are significantly associated with automated read pulse rate in the UK Biobank and high-density lipoprotein (HDL) in the HUNT study with SKAT-O P-values < 2.5×10⁻⁶ from SAIGE-GENE. Conditional analysis was performed when the top hit in the locus (+/- 500kb of the start and end positions of the gene) is a common or low frequency variant (MAF ≥ 0.01) or a rare variant (MAF < 0.01) not included in the gene-based test. The P-value of conditional analysis is NA when the top hit is a rare missense or stop gain variant included in the gene-based test.

Figure 2.

Quantile-quantile plots of exome-wide gene-based association results for A. high-density lipoprotein (HDL) in the HUNT study (N = 69,214). The SKAT-O approach in SAIGE-GENE was performed for 13,416 genes with stop-gain and missense variants with MAF ≤ 1%, of which the 10,600 genes having at least two variants are plotted. B. automated read pulse rate in the UK Biobank (N = 385,365). The SKAT-O approach in SAIGE-GENE was performed for 15,338 genes with stop-gain and missense variants with MAF ≤ 1%, of which the 12,638 genes having at least two variants are plotted.

We also applied SAIGE-GENE to analyze 15,342 genes for 53 quantitative traits using 408,910 UK Biobank participants with White British ancestry¹. The same MAF cutoff (≤ 1%) for missense and stop-gain variants was applied. Supplementary Table 2 presents all genes with p-values reaching the exome-wide significance threshold (p ≤ 2.5×10⁻⁶). Figure 2B shows the QQ plot for automated read pulse rate as an exemplary phenotype. After conditioning on the most significant variants if not included in the gene-based tests, MYH6, ARHGEF40 and DBH remain significant (Table 1). The genes TBX5, MYH6, TTN, and ARHGEF40 are known genes for heart rate from previous GWAS^17-20. To our knowledge, KIF1C and DBH have not been reported by association studies for heart rate, but both homozygous and heterozygous DBH mutant mice have decreased heart rates²¹. For the gene DBH, no single variant reaches the genome-wide significance threshold (the most significant variant is 9:136149399 (GRCh37) with MAF = 18.7% and P-value = 3.46×10⁻⁶).

Simulation Studies

We investigated the empirical type I error rates and power of SAIGE-GENE through simulation. We followed the steps described in the Online Methods section to simulate genotypes and phenotypes for 10,000 samples in two settings. One has 500 families and 5,000 unrelated samples and the other one has 1,000 families, each with 10 family members based on the pedigree shown in Supplementary Figure 9.

Type I error rates

The type I error rates for SAIGE-GENE, EmmaX-SKAT, and SMMAT have been evaluated based on gene-based association tests performed on 10⁷ simulated gene-phenotype combinations, each with 20 genetic variants with MAF ≤ 1% on average. A sparse GRM with a cutoff 0.2 for the coefficient of relatedness was used in SAIGE-GENE. Two different values of variance component parameter corresponding to the heritability h²=0.2 and 0.4 were considered, respectively. The empirical type I error rates at the α = 0.05, 10⁻², 10⁻³, 10⁻⁴ and 2.5×10⁻⁶ are shown in the Supplementary Table 3. Our simulation results suggest that SAIGE-GENE has well controlled type I error rates. The type I error rates are slightly inflated when heritability is relatively high (h²= 0.4). SAIGE-GENE allows users to apply the genomic control (GC) inflation factor lambda from single variant score statistics (Supplementary Materials) to adjust for gene-based test statistics and this approach successfully attenuated the inflation.
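The number of simulated tests determines how precisely a type I error rate can be measured at a stringent α. The sketch below, with hypothetical function names, shows the arithmetic: the empirical rate is the share of null p-values at or below α, and its precision follows from the binomial standard error.

```python
import math


def empirical_type1_error(p_values, alpha):
    """Share of null-simulation p-values that fall at or below alpha."""
    p_values = list(p_values)
    return sum(p <= alpha for p in p_values) / len(p_values)


def expected_false_positives(alpha, n_tests):
    """Expected count of p-values <= alpha when the test is perfectly calibrated."""
    return alpha * n_tests


def binomial_standard_error(alpha, n_tests):
    """Standard error of the empirical rate under perfect calibration."""
    return math.sqrt(alpha * (1.0 - alpha) / n_tests)


if __name__ == "__main__":
    alpha, n_tests = 2.5e-6, 10 ** 7
    print(expected_false_positives(alpha, n_tests))
    print(binomial_standard_error(alpha, n_tests) / alpha)
```

With 10⁷ null tests, about 25 false positives are expected at α = 2.5×10⁻⁶, so the empirical rate at that level carries a relative standard error of roughly 20%.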

We also evaluated empirical type I error rates of SAIGE-GENE for binary traits with various case-control ratios. As expected, when case-control ratios are relatively balanced (1:1 ∼ 1:9), the type I error rates are well controlled, whereas when the ratios are unbalanced (e.g. 1:99), inflation is observed (Supplementary Table 4).

Power

Next, we evaluated the empirical power of SAIGE-GENE and EmmaX-SKAT. Two different settings for the proportion of causal variants are used: 10% and 40%. In each setting, among causal variants, either 80% or 100% have negative effect sizes. The effect sizes for causal variants are set to be −0.3log₁₀(MAF) and −log₁₀(MAF) when the proportions of causal variants are 0.4 and 0.1, respectively. Supplementary Table 5 shows the power by the proportion of rare variants that were causal. As expected, the power of the three methods is nearly identical for all simulation settings for Burden, SKAT and SKAT-O tests.

Code and data availability

SAIGE-GENE is implemented as an open-source R package available at https://github.com/weizhouUMICH/SAIGE/master-gene.

The SAIGE-GENE results for 53 quantitative phenotypes in UK Biobank are currently available for public download at https://www.leelabsg.org/resources.

DISCUSSION

In summary, we have presented a method, called SAIGE-GENE, to perform gene- or region-based association tests, including Burden, SKAT and SKAT-O tests, in large cohorts or biobanks in the presence of sample relatedness. Similar to SAIGE, which was previously developed by our group for single-variant association tests in large biobanks, SAIGE-GENE uses generalized linear mixed models to account for sample relatedness and cutting-edge computational approaches to make it practical for large sample sizes. SAIGE-GENE successfully controls type I error rates for gene-based tests while accounting for relatedness among samples. It uses several optimization strategies that are similar to those used in SAIGE to make fitting the null generalized linear mixed models feasible for large sample sizes. For example, instead of storing the genetic relationship matrix (GRM) in memory, SAIGE-GENE stores the genotypes that are used for constructing the matrix in a binary vector and computes the elements of the matrix as needed. The preconditioned conjugate gradient algorithm is also used to solve linear systems instead of the Cholesky decomposition method. However, some optimization approaches are applied specifically in the gene-based tests with regard to rare variants. Because computing the variance of score statistics for genetic variants is computationally expensive, as it requires the inversion of the GRM, SAIGE, similar to BOLT-LMM¹¹ and GRAMMAR-Gamma¹², approximates the variance by estimating the ratio of the variance with and without the GRM using a subset of random genetic markers. As the estimated variance of score statistics for rare variants is more sensitive to family structures, we use a sparse GRM to preserve close family structures in SAIGE-GENE rather than ignoring all sample relatedness for the variance ratio. In addition, the variance ratios are estimated for different minor allele count (MAC) categories, especially for those extremely rare variants with MAC lower than or equal to 20.

SAIGE-GENE has some limitations. First, similar to SAIGE and other mixed-model methods, the time for algorithm convergence to fit the generalized linear mixed models may vary among phenotypes and study samples given different heritability levels and sample relatedness. Second, SAIGE-GENE is not yet able to handle unbalanced case-control ratios of binary traits, which cause inflated type I error rates, especially for rare variants. Therefore, we recommend using SAIGE-GENE for quantitative traits and binary traits with relatively balanced case-control ratios. In SAIGE, the issue of imbalanced case-control ratios for binary traits is successfully addressed by approximating the distribution of score statistics using the saddle-point approximation (SPA)²². In future work, we plan to apply SPA to SAIGE-GENE to extend the method to binary traits with imbalanced case-control ratios.

Overall, we have shown that SAIGE-GENE can account for sample relatedness while maintaining test power through extensive simulation studies. By applying SAIGE-GENE to the HUNT study⁹ and the UK Biobank¹ followed by conditioning on most significant variants in the testing loci, we have demonstrated that SAIGE-GENE can identify potentially novel association signals that are independent of the common signals from the single-variant association tests. Currently, our method is the only available mixed effect model approach for gene- or region-based rare variant tests for large sample data. By providing a scalable solution to the current largest and future even larger datasets, our method will contribute to identifying trait-susceptibility rare variants and genetic architecture of complex traits.

URLs

SAIGE (version 0.35.6.3), https://github.com/weizhouUMICH/SAIGE/.

SMMAT (version 1.0.2), https://github.com/hanchenphd/GMMAT.

EmmaX-SKAT (SKAT version_1.3.2.1), https://cran.r-project.org/web/packages/SKAT/index.html.

UK-Biobank analysis results (Gene-based summary statistics for 53 quantitative phenotypes in the UK Biobank by SAIGE-GENE), https://www.leelabsg.org/resources.

AUTHOR CONTRIBUTIONS

W.Z. and S.L. designed experiments. W.Z. and S.L. performed experiments. W.Z. implemented the software with input from W.B. and J.L. J.B., L.G.F. and S.A.G.T. constructed phenotypes for UK Biobank data. M.E.G. and K.H. provided data for the HUNT study. W.Z., C.W., S.L. and G.R.A. analyzed UK Biobank data. Helpful advice was provided by B.M.N. and M.J.D. W.Z. and S.L. wrote the manuscript with input from S.A.G.T. and M.E.G.

COMPETING FINANCIAL INTERESTS STATEMENT

G.R.A. is an employee of Regeneron Pharmaceuticals. He owns stock and stock options for Regeneron Pharmaceuticals. B.N. is a member of Deep Genomics Scientific Advisory Board, has received travel expenses from Illumina, and also serves as a consultant for Avanir and Trigeminal solutions.

ONLINE METHODS

Generalized linear mixed model

In a study with sample size N, we denote the phenotype of the ith individual using y_i for both continuous and binary traits. Let the 1 × (p + 1) vector X_i represent p covariates including the intercept, and the 1 × q vector G_i represent the allele counts (0, 1 or 2) for q variants in the gene to test. The generalized linear mixed model can be written as $g(\mu_i) = X_i \alpha + G_i \beta + b_i$, where μ_i is the mean of the phenotype and b_i is the random effect. The vector of random effects (b₁, …, b_N) is assumed to be distributed as N(0, τ ψ), where ψ is an N × N genetic relationship matrix (GRM) and τ is the additive genetic variance parameter. The link function g is the identity function for continuous traits with an error term and the logit function for binary traits. α is a (p + 1) × 1 coefficient vector of fixed effects and β is a q × 1 coefficient vector of the genetic effect.

Estimate variance component and other model parameters (Step 1)

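As in the original SAIGE⁸, to fit the null GLMM in SAIGE-GENE, the penalized quasi-likelihood (PQL) method⁴² and the computationally efficient average information restricted maximum likelihood (AI-REML) algorithm³⁵ are used to iteratively estimate $(\hat{\alpha}, \hat{b}, \hat{\phi}, \hat{\tau})$ under the null hypothesis of β = 0, where φ is the dispersion parameter. At iteration k, let $(\hat{\alpha}^{(k)}, \hat{b}^{(k)}, \hat{\phi}^{(k)}, \hat{\tau}^{(k)})$ be the estimated (α, b, φ, τ), $\hat{\mu}_i^{(k)}$ be the estimated mean of y_i and $\hat{\Sigma}^{(k)} = (\hat{W}^{(k)})^{-1} + \hat{\tau}^{(k)} \psi$ be an N × N matrix of the variance of the working vector $\tilde{y}$, in which ψ is the N × N GRM. For continuous traits, $\hat{\mu}_i = X_i \hat{\alpha} + \hat{b}_i$ and $\hat{W} = \hat{\phi}^{-1} I$. For binary traits, $\hat{\mu}_i = \mathrm{logit}^{-1}(X_i \hat{\alpha} + \hat{b}_i)$ and $\hat{W} = \mathrm{diag}[\hat{\mu}_i (1 - \hat{\mu}_i)]$. To obtain the log quasi-likelihood and average information at each iteration, SAIGE and SAIGE-GENE use the preconditioned conjugate gradient (PCG) approach^31,32 to obtain the product of the inverse of $\hat{\Sigma}$ and any other vector by iteratively solving a linear system with $\hat{\Sigma}$, which is more computationally efficient than using the Cholesky decomposition to obtain $\hat{\Sigma}^{-1}$. The numerical accuracy of PCG has been evaluated in the SAIGE paper⁸.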
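The idea of solving a linear system through matrix-vector products alone can be illustrated with a small preconditioned conjugate gradient routine. This is a minimal sketch, not the SAIGE-GENE implementation: it assumes the system matrix has the form D + τψ with a diagonal D, uses a Jacobi (diagonal) preconditioner, and stores a toy GRM densely, whereas the method described here computes GRM elements from genotypes on the fly. All names are hypothetical.

```python
def make_sigma_matvec(psi, d_diag, tau):
    """Return v -> (D + tau * psi) v without forming or inverting the matrix."""
    n = len(d_diag)

    def matvec(v):
        return [d_diag[i] * v[i] + tau * sum(psi[i][j] * v[j] for j in range(n))
                for i in range(n)]

    return matvec


def pcg(matvec, diag, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A with Jacobi preconditioning."""
    x = [0.0] * len(b)
    r = list(b)                                   # residual b - A x for x = 0
    z = [ri / di for ri, di in zip(r, diag)]      # preconditioned residual
    p = list(z)                                   # first search direction
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        ap = matvec(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if max(abs(ri) for ri in r) < tol:        # converged
            break
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Toy example: two related samples (kinship 0.5) and one unrelated sample.
psi = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
tau = 0.4
d_diag = [0.6, 0.6, 0.6]                          # diagonal part D of the system
sigma_diag = [d_diag[i] + tau * psi[i][i] for i in range(3)]
solution = pcg(make_sigma_matvec(psi, d_diag, tau), sigma_diag, [1.0, 2.0, 3.0])
```

Each iteration costs one matrix-vector product, which is why the approach avoids both the memory for an explicit inverse and the cubic cost of a factorization.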

Gene-based association tests (Step 2)

Test statistics of the Burden, SKAT and SKAT-O tests for a gene can be constructed based on the score test statistics from the marginal model for individual variants in the gene. Suppose there are q variants in the region or gene to test. The score test statistic for variant j (j = 1, …, q) under H₀: β_j = 0 is $T_j = g_j^T (Y - \hat{\mu})$, where g_j and Y are N × 1 genotype and phenotype vectors, respectively, and $\hat{\mu}$ is the estimated mean of Y under the null hypothesis.

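Let u_j denote a threshold indicator or weight for variant j and U = diag(u₁, …, u_q) be a diagonal matrix with u_j as the jth element. The Burden test statistic can be written as $Q_{Burden} = (\sum_{j=1}^{q} u_j T_j)^2$. Suppose $\tilde{G} = G - X (X^T \hat{W} X)^{-1} X^T \hat{W} G$ is the covariate adjusted genotype matrix, where G = (g₁, …, g_q) is the N × q genotype matrix of the q genetic variants, X is the N × (p + 1) covariate matrix and $\hat{W}$ is the diagonal weight matrix from the fitted null model, and $\hat{P} = \hat{\Sigma}^{-1} - \hat{\Sigma}^{-1} X (X^T \hat{\Sigma}^{-1} X)^{-1} X^T \hat{\Sigma}^{-1}$ with $\hat{\Sigma} = \hat{W}^{-1} + \hat{\tau} \psi$. Under the null hypothesis of no genetic effects, Q_Burden follows $\lambda_B \chi^2_1$, where $\lambda_B = J^T U \tilde{G}^T \hat{P} \tilde{G} U J$, J is a q × 1 vector with all elements being unity and $\chi^2_1$ is a chi-squared distribution with 1 degree of freedom². The SKAT test³ statistic can be written as $Q_{SKAT} = \sum_{j=1}^{q} u_j^2 T_j^2$, which follows a mixture of chi-squared distributions $\sum_{j=1}^{q} \lambda_j \chi^2_{1,j}$, where $\lambda_j$ are the eigenvalues of $U \tilde{G}^T \hat{P} \tilde{G} U$. The SKAT-O test developed by Lee et al. in 2012⁴ uses a linear combination of the Burden and SKAT test statistics, Q_SKATO = (1 – ρ)Q_SKAT + ρQ_Burden, 0 ≤ ρ ≤ 1. To conduct the test, the minimum p-value over a grid of ρ is calculated and the p-value of the minimum p-value is estimated through numerical integration. Following the suggestion in Lee et al²³, we use a grid of eight values of ρ = (0, 0.1², 0.2², 0.3², 0.4², 0.5², 0.5, 1) to find the minimum p-value.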
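How the three statistics are assembled from single variant score statistics can be sketched in a few lines. The example assumes the usual forms Q_Burden = (Σ u_j T_j)² and Q_SKAT = Σ u_j² T_j², and it computes only the statistics on the ρ grid quoted above; p-values are omitted because they require the null covariance of the scores and a mixture chi-squared routine. Function names and the toy scores are hypothetical.

```python
RHO_GRID = (0.0, 0.1 ** 2, 0.2 ** 2, 0.3 ** 2, 0.4 ** 2, 0.5 ** 2, 0.5, 1.0)


def burden_stat(scores, weights):
    """Square of the weighted sum of scores; opposite signs can cancel."""
    return sum(u * t for u, t in zip(weights, scores)) ** 2


def skat_stat(scores, weights):
    """Weighted sum of squared scores; insensitive to effect direction."""
    return sum((u * t) ** 2 for u, t in zip(weights, scores))


def skato_stats(scores, weights, rhos=RHO_GRID):
    """Q_rho = (1 - rho) * Q_SKAT + rho * Q_Burden for every rho in the grid."""
    q_burden = burden_stat(scores, weights)
    q_skat = skat_stat(scores, weights)
    return [(rho, (1.0 - rho) * q_skat + rho * q_burden) for rho in rhos]


if __name__ == "__main__":
    toy_scores = [1.0, -2.0, 0.5]       # hypothetical single variant score statistics
    toy_weights = [1.0, 1.0, 1.0]       # equal weights, for illustration only
    for rho, q in skato_stats(toy_scores, toy_weights):
        print(f"rho = {rho:.2f}: Q = {q:.4f}")
```

In the toy data the scores have mixed signs, so the Burden statistic is small while the SKAT statistic is large, which illustrates why combining the two over a grid of ρ gives robust power.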

Approximate the variance of score statistics

For each gene, given the fitted null model, calculating the covariance matrix of the score statistics with the full GRM requires applying PCG for each variant in the gene, which can be computationally very expensive. Suppose $\tilde{g}$ represents a covariate adjusted single variant genotype vector. To reduce computation cost, an approximation approach has been used in SAIGE, BOLT-LMM¹¹ and GRAMMAR-Gamma¹², in which the ratio r between the variance of the score statistic for $\tilde{g}$ calculated with the GRM and the variance calculated without the GRM is estimated from a small subset of randomly selected genetic markers; this ratio has been shown to be approximately constant for all variants. Given the ratio r, the variance with the GRM for all other variants can be easily obtained as r times the variance without the GRM. However, the variation of the estimated r for extremely rare variants is large, and including some closely related samples in the denominator helps reduce the variation of r, as shown in Supplementary Figure 2. Let ψ_S denote a sparse GRM that preserves close family structure and ψ_f denote a full GRM. We estimate the ratio between the variance of the score statistic calculated with ψ_f and the variance calculated with ψ_S.

In ψ_S, elements below a user-specified relatedness coefficient cutoff, i.e. beyond 3^rd degree relatedness, are zeroed out so that only close family structures are preserved. To construct ψ_S, a subset of randomly selected genetic markers, e.g. 2,000, is first used to quickly estimate which related samples pass the user-specified cutoff. Then the relatedness coefficients for those related samples are further estimated using the full set of genetic markers, so that they equal the corresponding values in ψ_f. The calculations with ψ_S require solving linear systems with a sparse matrix. For this we use a sparse-LU based solve method²⁴, which is implemented in R. The constructed ψ_S is also used for approximating the variance of score statistics with ψ_f. For a biobank or a data set, ψ_S only needs to be constructed once and can be re-used for SAIGE-GENE jobs on any phenotype in the same data set.
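The two-stage construction described above can be sketched as a screen followed by a refinement. This is an illustrative toy, not the SAIGE-GENE implementation: the two kinship callables stand in for estimates from the small marker subset and from the full marker set, keeping the diagonal is an assumption of the sketch, and the all-pairs loop is used only for clarity. All names are hypothetical.

```python
from itertools import combinations


def build_sparse_grm(n_samples, quick_kinship, full_kinship, cutoff=0.125):
    """Return {(i, j): value} with i <= j for pairs that pass the quick screen.

    quick_kinship(i, j): cheap estimate from a small marker subset.
    full_kinship(i, j): estimate from all markers, i.e. the full-GRM entry.
    """
    sparse = {(i, i): full_kinship(i, i) for i in range(n_samples)}
    for i, j in combinations(range(n_samples), 2):
        if quick_kinship(i, j) >= cutoff:           # stage 1: screen
            sparse[(i, j)] = full_kinship(i, j)     # stage 2: refine
    return sparse


# Toy data: samples 0 and 1 are close relatives, sample 2 is unrelated.
FULL = [[1.00, 0.50, 0.01], [0.50, 1.00, 0.02], [0.01, 0.02, 1.00]]
QUICK = [[1.00, 0.46, 0.03], [0.46, 1.00, 0.00], [0.03, 0.00, 1.00]]

sparse_grm = build_sparse_grm(
    3,
    quick_kinship=lambda i, j: QUICK[i][j],
    full_kinship=lambda i, j: FULL[i][j],
)
```

Only the related pair and the diagonal survive, and the stored off-diagonal value is the full-marker estimate rather than the noisier screening estimate.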

SAIGE-GENE estimates variance ratios for different MAC categories. By default, the MAC categories are MAC equal to 1, 2, 3, 4, 5, 6 to 10, 11 to 20, and greater than 20. Once the MAC categorical variance ratios are estimated, for each genetic marker in tested genes or regions, a variance ratio can be obtained according to its MAC. These ratios form a q × q diagonal matrix whose jth diagonal element is the ratio for the jth marker in the gene. For the tested gene with q markers, the covariance matrix of the score statistics with the full GRM can then be approximated from the covariance matrix calculated with the sparse GRM and this diagonal matrix of ratios (see Supplementary Notes for more details).
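The lookup of a variance ratio by MAC category can be sketched as follows. The category boundaries follow the defaults listed above; the ratio values in the example are placeholders rather than estimates from any data set, only the per-variant variance (not the full covariance matrix) is scaled, and the function names are hypothetical.

```python
def mac_category(mac: int) -> str:
    """Map a minor allele count to the default categories 1..5, 6-10, 11-20, >20."""
    if mac < 1:
        raise ValueError("MAC must be at least 1")
    if mac <= 5:
        return str(mac)
    if mac <= 10:
        return "6-10"
    if mac <= 20:
        return "11-20"
    return ">20"


def approx_full_grm_variance(var_sparse: float, mac: int, ratios: dict) -> float:
    """Approximate the full-GRM variance as ratio(MAC category) x sparse-GRM variance."""
    return ratios[mac_category(mac)] * var_sparse


if __name__ == "__main__":
    # Placeholder ratios for illustration only.
    example_ratios = {"1": 1.02, "2": 1.03, "3": 1.03, "4": 1.04,
                      "5": 1.04, "6-10": 1.05, "11-20": 1.05, ">20": 1.06}
    print(approx_full_grm_variance(2.0, mac=7, ratios=example_ratios))
```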

Conditional analysis

In SAIGE-GENE, we have implemented the conditional analysis to perform gene-based tests conditioning on given markers using the summary statistics from the unconditional gene-based tests and the linkage disequilibrium r² between testing and conditioning markers¹³. Let G be the genotypes for a gene to be tested for association, which contains q markers, and G₂ be the genotypes for the conditioning markers, which contains q₂ markers. Let β denote a q × 1 coefficient vector of the genetic effect for the gene to be tested and β₂ be a q₂ × 1 coefficient vector of the genetic effect for the conditioning markers. Let $\tilde{G}$ and $\tilde{G}_2$ denote the corresponding genotype matrices with the non-genetic covariates projected out. In the unconditional association tests, the test statistics are $T = G^T (Y - \hat{\mu})$ and $T_2 = G_2^T (Y - \hat{\mu})$. In conditional analysis, under the null hypothesis β = 0, the means E(T) and E(T₂) depend only on β₂. T and T₂ jointly follow the multivariate normal distribution with mean (E(T), E(T₂)) and a variance matrix composed of the blocks Var(T), Cov(T, T₂) and Var(T₂).

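Thus under the null hypothesis of no association of T, i.e. H₀: β = 0, T|T₂ follows the conditional normal distribution with $E(T \mid T_2) = E(T) + \mathrm{Cov}(T, T_2) \mathrm{Var}(T_2)^{-1} (T_2 - E(T_2))$ and $\mathrm{Var}(T \mid T_2) = \mathrm{Var}(T) - \mathrm{Cov}(T, T_2) \mathrm{Var}(T_2)^{-1} \mathrm{Cov}(T_2, T)$, and p-values can be calculated from the conditional distribution.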
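For intuition, the conditional distribution can be worked through in the simplest case of one tested score and one conditioning marker. The sketch below uses the standard conditional distribution of a bivariate normal and assumes that, under the null hypothesis, the conditional mean reduces to Cov(T, T₂)/Var(T₂) × T₂; the gene-based tests described here operate on vectors of scores, so this scalar version is a simplification, and the function name and input values are hypothetical.

```python
import math


def conditional_score_test(t, t2, var_t, var_t2, cov_t_t2):
    """Scalar conditional test of T given T2.

    Returns (conditional mean, conditional variance, two-sided p-value).
    """
    cond_mean = cov_t_t2 / var_t2 * t2
    cond_var = var_t - cov_t_t2 ** 2 / var_t2
    z = (t - cond_mean) / math.sqrt(cond_var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail
    return cond_mean, cond_var, p_value


if __name__ == "__main__":
    # Hypothetical values: the tested score is correlated with the conditioning marker.
    mean, var, p_cond = conditional_score_test(t=6.0, t2=10.0, var_t=4.0,
                                               var_t2=25.0, cov_t_t2=8.0)
    p_uncond = math.erfc(abs(6.0 / math.sqrt(4.0)) / math.sqrt(2.0))
    print(mean, var, p_cond, p_uncond)
```

In this toy case part of the signal in T is explained by the conditioning marker, so the conditional p-value is larger than the unconditional one.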

Data simulation

We carried out a series of simulations to evaluate and compare the performance of SAIGE-GENE, EmmaX-SKAT^4,5 and SMMAT⁶. We used the sequence data from 10,000 European ancestry chromosomes over 1Mb regions that were generated using the calibrated coalescent model in the SKAT R package⁴. We randomly selected 100,000 regions of 3Kb from the sequence data, followed by the gene-dropping simulation⁴⁴ using these sequences as founder haplotypes that were propagated through the pedigree of 10 family members shown in Supplementary Figure 9. Only variants with MAF ≤ 1% are used for simulation studies. Quantitative phenotypes were generated from the following linear mixed model y_i = X₁ + X₂ + G_iβ + b_i + ε_i, where G_i is the genotype value, β is the vector of genetic effect sizes, b_i is the random effect simulated from N(0, τ ψ), and ε_i is the error term simulated from N(0, (1 – τ)I). Two covariates, X₁ and X₂, were simulated from Bernoulli(0.5) and N(0,1), respectively. Binary phenotypes were also generated from the logistic mixed model logit(π_i0) = α₀ + b_i + X₁ + X₂ + G_iβ, where β is the genetic log odds ratio and b_i is the random effect simulated from N(0, τ ψ) with τ = 1. The intercept α₀ was determined by the given prevalence (i.e. case-control ratios).
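The quantitative phenotype model can be sketched for the simplest case of unrelated samples. This is a minimal illustration, not the simulation code used for the study: it assumes ψ = I, so the random effects are independent draws from N(0, τ), whereas the simulations described here draw them with family structure; the function name and toy genotypes are hypothetical.

```python
import math
import random


def simulate_quantitative(genotypes, beta, tau, seed=0):
    """Draw y_i = X1 + X2 + G_i beta + b_i + e_i for unrelated samples (psi = I)."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    rng = random.Random(seed)
    phenotypes = []
    for g_row in genotypes:
        x1 = 1.0 if rng.random() < 0.5 else 0.0          # Bernoulli(0.5) covariate
        x2 = rng.gauss(0.0, 1.0)                         # N(0, 1) covariate
        genetic = sum(g * b for g, b in zip(g_row, beta))
        b_i = rng.gauss(0.0, math.sqrt(tau))             # random effect, variance tau
        e_i = rng.gauss(0.0, math.sqrt(1.0 - tau))       # error, variance 1 - tau
        phenotypes.append(x1 + x2 + genetic + b_i + e_i)
    return phenotypes


if __name__ == "__main__":
    toy_genotypes = [[0, 1], [2, 0], [0, 0]]             # allele counts, 3 samples x 2 variants
    print(simulate_quantitative(toy_genotypes, beta=[0.0, 0.0], tau=0.2, seed=1))
```

Setting β to zero, as in the usage line, corresponds to the null simulations used for type I error evaluation.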

To evaluate the type I error rates at exome-wide α=2.5×10⁻⁶, we first simulated 10,000 regions, and then simulated 1,000 sets of quantitative phenotypes for each simulated region with different random seeds under the null hypothesis with β = 0. Gene-based association tests were performed using SAIGE-GENE, EmmaX-SKAT, and SMMAT; therefore, in total, 10⁷ tests were carried out for each of Burden test, SKAT, and SKAT-O. Two settings for τ were evaluated, 0.2 and 0.4, and two sample relatedness settings were used: one has 500 families and 5,000 independent samples, and the other has 1,000 families, each with 10 family members. Moreover, 1,000 sets of binary phenotypes were simulated under the null hypothesis with β = 0 with different random seeds for case-control ratios 1:99, 1:9, 1:4, and 1:1. Given τ = 1, the liability scale heritability is 0.23⁴⁵. The gene-based association tests Burden test, SKAT, and SKAT-O were performed on the 10,000 genome regions with 1,000 binary phenotypes using SAIGE-GENE.
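The size of this null simulation is easy to check by hand: 10,000 regions times 1,000 phenotype sets gives 10⁷ tests per method, so at α = 2.5×10⁻⁶ about 25 false positives are expected from a well-calibrated test. The sketch below reproduces that arithmetic and shows how an empirical type I error rate would be read off a vector of null p-values. The function name and the toy p-values are illustrative, and the sample-size line assumes that families in both relatedness settings have 10 members.

```python
n_regions = 10_000
n_phenotype_sets = 1_000
alpha = 2.5e-6

n_tests = n_regions * n_phenotype_sets        # tests per method and test type
expected_false_positives = n_tests * alpha    # under perfect calibration

# Total sample size in the two relatedness settings (assuming 10 members per family).
n_setting_1 = 500 * 10 + 5_000
n_setting_2 = 1_000 * 10

def empirical_type1_error(null_pvalues, alpha):
    """Fraction of null tests whose p-value falls below alpha."""
    null_pvalues = list(null_pvalues)
    return sum(p < alpha for p in null_pvalues) / len(null_pvalues)

# Toy input: one of four null p-values is below alpha.
toy_rate = empirical_type1_error([1e-7, 0.5, 0.2, 0.9], alpha)
```

A ratio of the empirical rate to α close to 1 indicates a well-calibrated test; values well above 1 indicate inflation.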

For the power simulation, phenotypes were generated under the alternative hypothesis β ≠ 0. Two settings for the proportion of causal variants were used, 10% and 40%, corresponding to β = log10(MAF) and β = 0.3log10(MAF), respectively. In each setting, we simulated scenarios in which 80% or 100% of the causal variants have negative effect sizes. We simulated 1,000 datasets in each simulation, and power was evaluated at the test-specific empirical α that yields nominal α=2.5×10⁻⁶. The empirical α was estimated from the previous type I error simulations.
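Evaluating power at an empirical α means replacing the nominal cutoff with the p-value threshold at which the null simulations reject exactly the nominal fraction of tests, so that methods with slightly different calibration are compared at the same realized type I error rate. A minimal sketch of that idea follows. Taking the threshold as a quantile of the null p-values is an assumption about how the empirical α could be obtained, and the inputs are toy stand-ins rather than simulation output.

```python
import numpy as np

def empirical_threshold(null_pvalues, nominal_alpha):
    """P-value cutoff that rejects a fraction nominal_alpha of the null tests."""
    return float(np.quantile(null_pvalues, nominal_alpha))

def power(alt_pvalues, threshold):
    """Fraction of tests under the alternative that fall at or below the cutoff."""
    return float(np.mean(np.asarray(alt_pvalues) <= threshold))

# Toy stand-ins: evenly spaced null p-values and four alternative p-values.
null_p = np.linspace(0.0, 1.0, 1001)
alt_p = np.array([0.001, 0.02, 0.5, 0.005])

cutoff = empirical_threshold(null_p, 0.01)   # toy nominal level, not 2.5e-6
est_power = power(alt_p, cutoff)
```

With 10⁷ null tests per method, the 2.5×10⁻⁶ quantile rests on roughly 25 observations, so the estimated threshold carries some sampling noise.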

HUNT and UK-Biobank data analysis

We applied SAIGE-GENE to high-density lipoprotein (HDL) levels in 69,500 Norwegian samples from the population-based Nord-Trøndelag Health Study (HUNT)⁹. We tested 13,416 genes with rare (MAF ≤ 1%) missense and stop-gain variants that were directly genotyped or successfully imputed from HRC (imputation score ≥ 0.8). Variants were annotated using Seattle Seq Annotations (http://snp.gs.washington.edu/SeattleSeqAnnotation138/). Age, sex, genotyping batch, and the first four PCs were included as covariates in the model. We used 249,749 pruned genotyped markers to estimate relatedness coefficients in the full GRM for step 1 and used a relatedness coefficient cutoff ≥ 0.125 for the sparse GRM.

We also analyzed 53 quantitative traits using SAIGE-GENE in the UK Biobank for 408,910 participants with white British ancestry¹. In total, 17,433 genes were tested, among which 15,342 genes had at least one rare (MAF ≤ 1%) missense or stop-gain variant that was directly genotyped or successfully imputed from HRC (imputation score ≥ 0.8). Sex, age when attending the assessment center, and the first four PCs, estimated using all samples with white British ancestry, were adjusted for in all tests. We used 340,447 pruned genotyped markers to estimate coefficients of relatedness in the full GRM for step 1 and used a relatedness coefficient cutoff ≥ 0.125 for the sparse GRM.
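The sparse GRM used in both analyses keeps only pairs whose estimated relatedness coefficient is at least 0.125 and treats all other pairs as unrelated. The sketch below shows that thresholding on a small dense matrix. It is an illustration only: the function name and the triplet output are assumptions, and at biobank scale the full N × N matrix would not be materialized in memory as it is here.

```python
import numpy as np

def sparsify_grm(grm, cutoff=0.125):
    """Return (rows, cols, values) for GRM entries kept in the sparse matrix.

    Off-diagonal entries below the cutoff are dropped; the diagonal is always kept.
    """
    grm = np.asarray(grm, dtype=float)
    keep = grm >= cutoff
    np.fill_diagonal(keep, True)
    rows, cols = np.nonzero(keep)
    return rows, cols, grm[rows, cols]

# Toy GRM (hypothetical values): samples 0 and 1 are first-degree relatives,
# sample 2 is only distantly related to both.
toy_grm = np.array([
    [1.00, 0.50, 0.01],
    [0.50, 1.00, 0.06],
    [0.01, 0.06, 1.00],
])
rows, cols, vals = sparsify_grm(toy_grm)
kept_pairs = set(zip(rows.tolist(), cols.tolist()))
```

The triplets can be passed to any sparse-matrix container. In related samples most entries fall below the cutoff, which is what makes the sparse matrix cheap to store and factorize.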

Genome build

All genomic coordinates are given in NCBI Build 37/UCSC hg19.

Reporting Summary

Further information on study design is available in the Nature Research Reporting Summary linked to this article.

ACKNOWLEDGMENTS

This research has been conducted using the UK Biobank Resource under application number 45227. SL and WB were supported by NIH R01 HG008773.

REFERENCES

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209, doi:10.1038/s41586-018-0579-z (2018).
Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet 95, 5–23, doi:10.1016/j.ajhg.2014.06.009 (2014).
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89, 82–93, doi:10.1016/j.ajhg.2011.05.029 (2011).
Lee, S., Wu, M. C. & Lin, X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13, 762–775, doi:10.1093/biostatistics/kxs014 (2012).
Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348–354, doi:10.1038/ng.548 (2010).
Chen, H. et al. Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole genome sequencing studies. bioRxiv (2018).
Natarajan, P. et al. Deep-coverage whole genome sequences and blood lipids among 16,324 individuals. Nat Commun 9, 3391, doi:10.1038/s41467-018-05747-8 (2018).
Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet 50, 1335–1341, doi:10.1038/s41588-018-0184-y (2018).
Krokstad, S. et al. Cohort Profile: the HUNT Study, Norway. Int J Epidemiol 42, 968–977, doi:10.1093/ije/dys095 (2013).
Langhammer, A., Krokstad, S., Romundstad, P., Heggland, J. & Holmen, J. The HUNT study: participation is associated with survival and depends on socioeconomic status, diseases and symptoms. BMC medical research methodology 12, 143–143, doi:10.1186/1471-2288-12-143 (2012).
Loh, P. R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet 47, 284–290, doi:10.1038/ng.3190 (2015).
Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat Genet 44, 1166–1170, doi:10.1038/ng.2410 (2012).
Liu, D. J. et al. Meta-analysis of gene-level tests for rare variant association. Nat Genet 46, 200–204, doi:10.1038/ng.2852 (2014).
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat Genet 46, 100–106, doi:10.1038/ng.2876 (2014).
Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nat Genet 45, 1274–1283, doi:10.1038/ng.2797 (2013).
Willer, C. J. et al. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet 40, 161–169, doi:10.1038/ng.76 (2008).
Holm, H. et al. Several common variants modulate heart rate, PR interval and QRS duration. Nat Genet 42, 117–122, doi:10.1038/ng.511 (2010).
Eijgelsheim, M. et al. Genome-wide association analysis identifies multiple loci related to resting heart rate. Hum Mol Genet 19, 3885–3894, doi:10.1093/hmg/ddq303 (2010).
Eppinga, R. N. et al. Identification of genomic loci associated with resting heart rate and shared genetic predictors with all-cause mortality. Nat Genet 48, 1557–1563, doi:10.1038/ng.3708 (2016).
Arking, D. E. et al. Genetic association study of QT interval highlights role for calcium signaling pathways in myocardial repolarization. Nat Genet 46, 826–836, doi:10.1038/ng.3014 (2014).
Swoap, S. J., Weinshenker, D., Palmiter, R. D. & Garber, G. Dbh(-/-) mice are hypotensive, have altered circadian rhythms, and have abnormal responses to dieting and stress. Am J Physiol Regul Integr Comp Physiol 286, R108–113, doi:10.1152/ajpregu.00405.2003 (2004).
Dey, R., Schmidt, E. M., Abecasis, G. R. & Lee, S. A Fast and Accurate Algorithm to Test for Binary Phenotypes and Its Application to PheWAS. Am J Hum Genet 101, 37–49, doi:10.1016/j.ajhg.2017.05.014 (2017).
Lee, S., Teslovich, T. M., Boehnke, M. & Lin, X. General framework for meta-analysis of rare variants in sequencing association studies. Am J Hum Genet 93, 42–53, doi:10.1016/j.ajhg.2013.05.010 (2013).
Davis, T. A. Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). (Society for Industrial and Applied Mathematics, 2006).