An efficient and accurate frailty model approach for genome-wide survival association analysis controlling for population structure and relatedness in large-scale biobanks

With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We developed an efficient and accurate frailty (random effects) model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes in large biobanks by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrated the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 samples in FinnGen, respectively. We further performed genome-wide association analysis for 871 TTE phenotypes in UK Biobank and presented the genome-wide scale phenome-wide association (PheWAS) results with the PheWeb browser.


Introduction
Survival models, especially the Cox proportional hazard model 1 , have been widely used to analyze time-to-event (TTE) outcomes, both in biomedical research [2][3][4] , and in genome-wide association studies (GWAS) [5][6][7][8][9][10][11] . It has been shown that the proportional hazard model can increase the power to detect genetic variants associated with the age-of-onset of TTE phenotypes in cohort studies compared to modelling the disease status using a logistic regression model [12][13][14]  In GWAS analysis, population structure and sample relatedness are often key factors that need to be controlled for. Biobank cohorts often have substantial population structure and relatedness. For example, in the UK Biobank, 91,392 out of 408,582 subjects with White British ancestry have at least one relative (up to 3 rd degree) in the data. Several linear [16][17][18] and logistic 19,20 mixed effects models have been developed to account for relatedness in GWASs for quantitative and binary phenotypes. To account for related subjects in the proportional hazard model, frailty models, which are mixed effects survival models, have been proposed 21,22 , where event times are assumed to be independent conditional on unobserved random effects called "frailties". The frailties are modeled based on the dependence and clustering structure of the observations. Previous research has extensively studied shared frailty models with Gammadistributed frailties 21,23-27 . However, the shared frailty model is limited in its scope to model more complicated dependency structures that arise in cohort-based association studies. To model complicated dependency structures, such as known familial structures and cryptic relatedness, the multivariate frailty model with Gaussian frailty was proposed 28,29 , and was later implemented in the R package COXME 30 , which, however, lacks scalability for GWASs. Recently the COXME method was further improved in COXMEG 31 , which utilizes several computational optimization strategies to make it applicable in genetic association studies, but COXMEG still cannot handle biobank-scale genome-wide datasets. Based on our performance benchmarking, even for 20,000 subjects, COXMEG requires 3,356 CPU-hours to perform a GWAS of 46 million variants, which means even with perfect parallelization on 30 CPUs, it would take over 4.6 days to complete the GWAS.
In large-scale GWASs, the score test is particularly useful among different asymptotic tests, because it requires fitting the model only once under the null hypothesis of no association 20 . Score tests have also been implemented in the COXMEG package 32 .
However, score tests can lead to severe type I error inflation for phenotypes with heavy censoring, which is extremely common in biobank-based phenotypes. In the UK Biobank phenome that we built (see ONLINE METHODS), 871 TTE phenotypes have at least 500 events (cases), out of which 811 phenotypes have censoring rate more than 95%. The inaccuracies of the score test in unbalanced case-control phenotypes have been previously shown for logistic regression and logistic mixed effects models 19,33-35 , and a saddlepoint approximation 36 (SPA)-based adjustment has been proposed and successfully implemented 19 to accurately calibrate the p-values in such scenarios.
Recently, the SPACox 11 method also used SPA to calibrate p-values for time-to-event phenotypes in unrelated samples. However, the SPACox method does not account for sample-relatedness. Through simulations, we show similar inaccuracies are also present in score tests in frailty models for analyzing heavily censored phenotypes.
Here we propose a novel method for genome-wide survival analysis of TTE phenotypes, which accounts for both population structure and sample relatedness, controls type I error rates even for phenotypes with extremely heavy censoring, and is scalable for genome-wide scale PheWASs on biobank-scale data. Our method, Genetic Analysis of Time-to-Event phenotypes (GATE), transforms the likelihood of a multivariate Gaussian frailty model to a modified Poisson generalized linear mixed model (GLMM 20,37 ) likelihood, employs several state-of-the-art optimization techniques to fit the modified GLMM under the null hypothesis, and then performs score tests calculated using the null model for each genetic variant. To obtain well-calibrated p-values for heavily censored phenotypes, GATE uses the SPA to estimate the null distribution of the score statistic instead of the traditionally used normal approximation. Moreover, our method saves the memory requirement substantially by storing the raw genotypes in binary format and calculating the elements of the GRM on the fly instead of storing or inverting a large dimensional GRM.
Through extensive simulations and analysis of TTE phenotypes from the UK Biobank data of 408,582 subjects with White British ancestry and the FinnGen study, we showed that GATE is scalable to biobank-scale GWASs of TTE phenotypes with type I error rates well controlled even for less frequent variants and heavily censored phenotypes.
Benchmarking has shown that GATE can analyze 46 million variants in a GWAS with 408,582 subjects in ∼ 14.5 hours using 30 CPUs with peak memory usage under 11 GB. Step 2 involves scanning the entire genome and testing each variant for association using the score statistic. Since the overall cost of computing the variance of the score statistic for all variants is extremely high because it involves operations on the large-  Figure 14). Therefore, when performing the genome-wide scan, the variance of the score statistic is computed without using the GRM and then calibrated using the variance ratio.  Figure 1A) and has well controlled type I error rates, even for less frequent variants and phenotypes with heavy censoring rates (Supplementary Figure   1B).

Computation and Memory costs
To assess the computational performance of GATE and the score test implemented in the COXMEG package (COXMEG-Score), we randomly sampled subsets of different were plotted in Figure 4 and Supplementary Figure 3, respectively.

GWAS of lifespan in the FinnGen Study and the UK Biobank
We have also applied GATE to the overall lifespan in the FinnGen study (N events = 15,152, N censored = 203,244), in which the age of death ranges from 7 years old to

Simulation Studies
We investigated the type I error rates and power of GATE in presence of sample relatedness using 10,000 simulated samples. Due to computation burden, we used GATE-noSPA instead of COXMEG-Score for type I error evaluation as Supplementary Figure 1C shows the two approaches provide consistent association p-values (ܴ ଶ of -log10 p-values > 0.99).
The type I error rates of GATE were evaluated based on association tests of 9. controlled type I error rates even for low frequency variants (down to MAC = 20) when the phenotype is heavily censored (90%). However, without SPA, the score tests in GATE suffer from inflated type I error rates as the case-control ratios become more unbalanced and the frequency of variants decreases. We also evaluated type I error rates of GATE in a setting with cryptic sample relatedness by randomly selecting 10,000 UKBB participants with white British ancestry. Phenotypes were simulated using the real genotypes to mimic the sample relatedness of a real-world dataset, and association tests were conducted on the imputed genetic markers in the UKBB (see ONLINE

METHODS).
Similarly, we observed that the type I error rates were well controlled in GATE in presence of cryptic sample relatedness with different censoring rates (Supplementary Table 5 Overall simulation studies show that GATE can control type I error rates even when censoring rate is high and has similar power for common variants as COXMEG-Score. In contrast, same as GATE-noSPA, COXMEG suffers type I error inflation and the inflation is especially severe with low MAF and heavy censoring (Supplementary Figure 1B, 1C, 8 and 9).

Discussion
In this paper, we have proposed a novel method to perform scalable genome-wide survival association analysis of censored TTE phenotypes in large biobanks using an efficient implementation of the frailty model. Our method can adjust for population structure and sample relatedness and provide accurate p-values even in extreme cases of very low frequency variants and heavily censored phenotypes (incidence rate ൏ 0 . 1 %  showed that the p-values were generally concordant, and the p-values using The covariate-and-intercept-adjusted genotypes are denoted by is the augmented covariate matrix. Then, the variance of the score statistic under ‫ܪ‬ is given by , and the p-value is given by is the observed adjusted score statistic, We carried out a series of simulations to evaluate the performance of GATE, including the type I error rates and power. To evaluate whether GATE can control type I error rates in presence of sample relatedness, we randomly simulated a set of 1,000,000 base-pair "pseudo" sequences, in which variants are independent to each other. Alleles for each variant were randomly drawn from Binomial(n = 2, p = MAF). Then we performed the gene-dropping 64 simulation using these sequences as founder haplotypes that were propagated through the pedigree of 10 family members shown in  . In

Supplementary
Step 2, we conducted single variant association tests on 1.9 million simulated genetic markers. In totally, about 9.4x10 8 association tests were conducted.
We evaluated the empirical type I error rates at the type I error rate α = 1x10 -6 and 5x10 -8 as shown in Supplementary Table 4 and Supplementary Figure 8A. These results have indicated that GATE can produce well calibrated type I error rates in the presence of sample relatedness at the significance level, while GATE-no SPA (similar to COXMEG) has inflated type I error rates and inflation gets larger than censoring rates is higher (Supplementary Table 4). For example, GATE-no SPA has type I error rate 8.9x10 -6 at α = 5x10 -8 when censoring rate is 75% and 2.8x10 -5 when censoring rate is 90% with    E  f  f  i  c  i  e  n  t  l  y  c  o  n  t  r  o  l  l  i  n  g  f  o  r  c  a  s  e  -c  o  n  t  r  o  l  i  m  b  a  l  a  n  c  e  a  n  d  s  a  m  p  l  e  r  e  l  a  t  e  d  n  e  s  s  i  n  l  a  r  g  e  s  c  a  l  e  g  e  n  e  t  i  c  a  s  s  o  c  i  a  t  i  o  n  s  t  u  d  i  e  s  .   N  a  t  G  e  n  e  t   5  0  ,  1  3  3  5  -1  3  4  1  (  2  0  1  .  E  x  p  l  o  r  i  n  g  a  n  d  v  i  s  u  a  l  i  z  i  n  g  l  a  r  g  e  -s  c  a  l  e  g  e  n  e  t  i  c  a  s  s  o  c  i  a  t  i  o  n  s  b  y  u  s  i  n  g  P  h  e  W  e  b  .   N  a  t  u  r  e  G  e  n  e  t  i  c  s   5  2  ,  5  5  0  -5  5  2  (  2  0  2 A  s  s  o  c  i  a  t  i  o  n  a  n  a  l  y  s  e  s  b  a  s  e  d  o  n  f  a  l  s  e  d  i  s  c  o  v  e  r  y  r  a  t  e  i  m  p  l  i  c  a  t  e  n  e  w  l  o  c  i  f  o  r  c  o  r  o  n  a  r  y  a  r  t  e  r  y  d  i  s  e  a  s  e  .   N  a  t  u  r  e  G  e  n  e  t  i  c  s   4  9  ,  1  3  8  5  -1  3  9  1  (  2  0  1  7  ) .