## Abstract

The availability of electronic health record (EHR)-based phenotypes allows for genome-wide association analyses in thousands of traits, and has great potential to identify novel genetic variants associated with clinical phenotypes. We can interpret the phenome-wide association study (PheWAS) result for a single genetic variant by observing its association across a landscape of phenotypes. Since PheWAS can test thousands of binary phenotypes, and most of them have unbalanced (case:control = 1:10) or often extremely unbalanced (case:control = 1:600) case-control ratios, existing methods cannot provide an accurate and scalable way to test for associations. Here we propose a computationally fast score test-based method that estimates the distribution of the test statistic using the saddlepoint approximation. Our method is much faster (~100 times) than the state-of-the-art Firth’s test. It can also adjust for covariates and control type I error rates even when the case-control ratio is extremely unbalanced. Through application to PheWAS data from the Michigan Genomics Initiative, we show that the proposed method can control type I error rates while replicating previously known association signals even for traits with a very small number of cases and a large number of controls.

## Introduction

Over the last decade, genome-wide association studies (GWASs) have proved instrumental to unravelling the genetic complexities of hundreds of diseases and traits and their associations with common genomic variations. To date, thousands of GWASs have identified more than 4000 significant loci to be associated with human diseases and traits.^{1} However, since most GWASs investigate a single disease or trait, they cannot exploit the cross-phenotype associations or pleiotropy^{2} where a single genetic variant can be associated with multiple phenotypes. The phenome-wide association study (PheWAS) has been proposed as an alternative approach to take advantage of the pleiotropy phenomenon by studying the impact of genetic variations across a broad spectrum of human phenotypes or ‘phenome’. It is a complementary approach to GWAS in the sense that while GWAS attempts to identify phenotype-to-genotype associations, PheWAS uses a genotype-to-phenotype approach. The first PheWAS^{3} was published as a proof-of-principle study, which demonstrated that the PheWAS strategy could be applied to successfully identify the expected gene-disease associations. Additional studies^{4–8} have shown that the PheWAS approach can further identify novel disease-SNP associations.^{9}

The PheWAS approach depends on the availability of detailed phenotypic information. Currently, most of the PheWASs are applied to clinical cohorts linked to electronic health records (EHR) and utilize the International Classification of Disease (ICD) billing codes to define clinical phenotypes. The ICD codes provide an intuitive ordering of the phenotypes based on clinical disease and trait classifications. Since the current genotyping and imputation technologies allow for genotyping tens of millions of variants at very low cost,^{10} an extensive PheWAS can attempt to investigate the genotype-phenotype associations by performing genome-wide association analyses in thousands of traits. We can interpret the PheWAS result of a single genetic variant by observing its associations across the phenome. Such a PheWAS is exhaustive in nature and has great potential to identify novel variants associated with clinical diseases.

One of the main challenges of the PheWAS analysis is that most of the phenotypes are binary phenotypes with unbalanced (1:5) or often extremely unbalanced (1:600) case-control ratios (see *Figure S1*), since the data are collected in cohorts. Although standard asymptotic tests, such as the Wald, score and likelihood ratio tests, are relatively well calibrated and asymptotically equivalent^{11} for common variants (minor allele frequency (MAF) > 0.05) in balanced case-control studies, they can inflate type I error for low frequency (0.01 < MAF ≤ 0.05) and rare variants (MAF ≤ 0.01) in unbalanced case-control studies.^{12} Moreover, since the Wald and likelihood ratio tests need to calculate the likelihood or the maximum likelihood estimator under the full model, which is computationally expensive, they are not scalable to the number of tests that PheWASs attempt. On the other hand, the score test is computationally efficient as it does not need to calculate the maximum likelihood under the full model. However, as mentioned before, it suffers from highly inflated type I error rates in unbalanced studies. Ma *et al*. proposed Firth’s penalized likelihood ratio test^{13} as a solution to control the type I error rates in such situations. Firth’s test, despite being well calibrated and robust for testing low frequency and rare variants in unbalanced studies, lacks computational efficiency as it also involves calculating the maximum likelihood under the full model. For instance, the projected computation time of Firth’s test to test 1500 phenotypes across 10 million SNPs is ~117 CPU-years (2000 cases, 18000 controls). Thus, it is impractical to apply Firth’s test for analyzing large PheWAS datasets.

In this paper, we propose a score-based single variant test for binary phenotypes which is well calibrated for controlling the type I errors and can adjust for covariates even in extremely unbalanced case-control studies. Moreover, our test is computationally efficient and scalable to test thousands of phenotypes across millions of SNPs in large PheWAS datasets. Our proposed test (SPA) is based on the score statistics and estimates the null distribution using the saddlepoint approximation^{14-16} instead of the normal approximation^{17} traditionally used in score tests. We further develop an improvement of our test (fastSPA) which renders the most computationally challenging steps dependent only on the number of carriers (subjects with at least one minor allele) rather than the sample size. This improved test can substantially reduce the computation time, especially for low frequency and rare variants where the number of carriers is very low compared to the sample size. The projected computation time of our method to test 1500 phenotypes across 10 million SNPs is ~400 CPU-days (2000 cases, 18000 controls), which is more than a 100-fold improvement over Firth’s test. In addition, through extensive simulation studies and analysis of the Michigan Genomics Initiative (MGI) data, we demonstrate that the proposed approach can control type I errors and is powerful enough to replicate known association signals.

## Material and Methods

### Logistic regression model and saddlepoint approximation method

We consider a case-control study with sample size *n*. For the *i*^{th} subject, let *Y*_{i} = 1 or 0 denote the case-control status, *X*_{i} the *k* × 1 vector of non-genetic covariates including the intercept, and *G*_{i} the number of minor alleles (*G*_{i} = 0, 1, 2) of the variant to test. To relate genotypes to phenotypes, we use the following logistic regression model,

$$\operatorname{logit}\left[\Pr(Y_i = 1 \mid X_i, G_i)\right] = X_i^{T}\beta + G_i\gamma, \qquad i = 1, \dots, n, \tag{1.1}$$

where *β* is a *k* × 1 vector of coefficients of the covariates, and *γ* is the genotype log-odds ratio. Under this model, we are interested in testing for the genetic association by testing the null hypothesis *H*_{0}: *γ* = 0. Let $\hat{\mu}_i$ be the estimate of *μ*_{i} = Pr(*Y*_{i} = 1 | *X*_{i}), which is the probability of being a case under *H*_{0}. A score statistic for *γ* from the model (1.1) is given by $S = \sum_{i=1}^{n} G_i (Y_i - \hat{\mu}_i)$. Suppose $X = (X_1, \dots, X_n)^{T}$ is the *n* × *k* matrix of covariates, *G* = (*G*_{1},…, *G*_{n})^{T} is the genotype vector, *W* is a diagonal matrix with the *i*^{th} diagonal element being $\hat{\mu}_i(1 - \hat{\mu}_i)$, and $\tilde{G} = G - X(X^{T}WX)^{-1}X^{T}WG$ is a covariate adjusted genotype vector in which covariate effects are projected out from the genotypes (details given in the Appendix). Then *S* can be written as

$$S = \sum_{i=1}^{n} \tilde{G}_i (Y_i - \hat{\mu}_i), \tag{1.2}$$

and the mean and variance of *S* under *H*_{0} are *E*_{H0}(*S*) = 0 and $V_{H_0}(S) = \sum_{i=1}^{n} \tilde{G}_i^{2}\, \hat{\mu}_i (1 - \hat{\mu}_i)$, respectively, where $\tilde{G}_i$ is the *i*^{th} element of $\tilde{G}$.
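The quantities in (1.2) can be illustrated with a small standard-library Python sketch. It is not the SPAtest implementation: the function names (`solve_linear`, `adjusted_genotype`, `score_statistic`) and the toy data are hypothetical. The toy model has an intercept and one binary covariate, so the null maximum likelihood estimate of each subject's case probability is simply the case fraction in that subject's covariate group, which avoids fitting a logistic regression here.

```python
def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting (a is k x k, k small)."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    z = [0.0] * k
    for c in reversed(range(k)):
        z[c] = (m[c][k] - sum(m[c][j] * z[j] for j in range(c + 1, k))) / m[c][c]
    return z


def adjusted_genotype(x, g, mu_hat):
    """Return G~ = G - X (X'WX)^{-1} X'WG with W = diag(mu_hat_i (1 - mu_hat_i))."""
    k = len(x[0])
    w = [m * (1.0 - m) for m in mu_hat]
    xtwx = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, x)) for b in range(k)]
            for a in range(k)]
    xtwg = [sum(wi * xi[a] * gi for wi, xi, gi in zip(w, x, g)) for a in range(k)]
    z = solve_linear(xtwx, xtwg)
    return [gi - sum(xa * za for xa, za in zip(xi, z)) for xi, gi in zip(x, g)]


def score_statistic(g_tilde, y, mu_hat):
    """Return (S, V_H0(S)) as in (1.2)."""
    s = sum(gt * (yi - m) for gt, yi, m in zip(g_tilde, y, mu_hat))
    var = sum(gt * gt * m * (1.0 - m) for gt, m in zip(g_tilde, mu_hat))
    return s, var


# Toy data (hypothetical): intercept plus one binary covariate, 8 subjects.
x = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
y = [1, 0, 0, 0, 1, 1, 0, 0]
g = [1, 0, 0, 2, 0, 1, 0, 0]
mu_hat = [0.25] * 4 + [0.5] * 4   # group-wise case fractions = null MLE in this toy model

g_tilde = adjusted_genotype(x, g, mu_hat)
s, var_s = score_statistic(g_tilde, y, mu_hat)
s_raw = sum(gi * (yi - m) for gi, yi, m in zip(g, y, mu_hat))
```

In this toy example `s` and `s_raw` agree, reflecting that the raw and covariate-adjusted genotypes give the same score statistic once the null model has been fitted, while `var_s` is the weighted sum of squared adjusted genotypes.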

The traditional score test approximates the null distribution using a normal distribution which depends only on the mean and the variance of the score statistic. The *p*-value can be obtained by comparing the observed test statistic, *s*, with *N*(0, *V*_{H0}(*S*)). Normal approximation works well near the mean of the distribution, but performs very poorly at the tails. The performance is especially poor when the underlying distribution is highly skewed, such as in unbalanced case-control outcomes^{12}, since normal approximation cannot incorporate higher moments such as skewness. In addition, the convergence rate of normal approximation^{18–20} is *O*(*n*^{−1/2}), which is not fast enough for rare variants.

Saddlepoint approximation was introduced by Daniels^{14} as an improvement over the normal approximation. Contrary to normal approximation, where only the first two cumulants (mean and variance) are used to approximate the underlying distribution, saddlepoint approximation uses the entire cumulant generating function. Jensen^{21} further showed that saddlepoint approximation has a relative error bound of *O*(*n*^{−3/2}) making it a considerable improvement over the normal approximation.

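To use the saddlepoint approximation, we first derive the cumulant generating function (CGF) of *S* from the fact that *Y*_{i} ~ *Bernoulli*(*μ*_{i}) under *H*_{0}. Let $\hat{\mu}$ be an *n* × 1 vector with the *i*^{th} element being $\hat{\mu}_i$. From (1.2), the estimate of the cumulant generating function of the score statistic *S* is,

$$K(t) = \log E_{H_0}\left(e^{tS}\right) = \sum_{i=1}^{n} \log\left(1 - \hat{\mu}_i + \hat{\mu}_i e^{\tilde{G}_i t}\right) - t \sum_{i=1}^{n} \tilde{G}_i \hat{\mu}_i,$$

and the estimate of the first and second order derivatives of *K* are

$$K'(t) = \sum_{i=1}^{n} \frac{\hat{\mu}_i \tilde{G}_i}{(1 - \hat{\mu}_i) e^{-\tilde{G}_i t} + \hat{\mu}_i} - \sum_{i=1}^{n} \tilde{G}_i \hat{\mu}_i, \qquad K''(t) = \sum_{i=1}^{n} \frac{(1 - \hat{\mu}_i)\, \hat{\mu}_i\, \tilde{G}_i^{2}\, e^{-\tilde{G}_i t}}{\left[(1 - \hat{\mu}_i) e^{-\tilde{G}_i t} + \hat{\mu}_i\right]^{2}},$$

respectively. We note that *K*, *K*' and *K*'' are plug-in estimates in which we plug in $\hat{\mu}_i$ instead of *μ*_{i}. Then, according to the saddlepoint method (Barndorff-Nielsen^{15; 16}), the distribution of *S* at *s* can be approximated by,

$$\Pr(S < s) \approx \tilde{F}(s) = \Phi\left\{ w + \frac{1}{w} \log\left(\frac{v}{w}\right) \right\}, \qquad w = \operatorname{sgn}(\hat{t}) \sqrt{2\left(\hat{t}s - K(\hat{t})\right)}, \qquad v = \hat{t} \sqrt{K''(\hat{t})},$$

where $\hat{t} = \hat{t}(s)$ is the solution to the equation $K'(\hat{t}) = s$, and Φ is the distribution function of a standard normal distribution.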
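As a rough illustration of the saddlepoint calculation described in this section, the following standard-library Python sketch evaluates the CGF of a centered weighted sum of independent Bernoulli variables, solves the saddlepoint equation with a Newton-Raphson step safeguarded by bisection, and returns an upper-tail probability from the Barndorff-Nielsen formula. The function names are hypothetical and this is not the SPAtest code. It assumes the observed *s* lies strictly inside the attainable range of *S* and away from the mean, where the formula divides by a quantity close to zero.

```python
import math


def cgf(t, mu, g):
    """K(t) for S = sum g_i (Y_i - mu_i) with independent Y_i ~ Bernoulli(mu_i)."""
    return sum(math.log(1.0 - m + m * math.exp(gi * t)) - t * gi * m
               for m, gi in zip(mu, g))


def cgf_d1(t, mu, g):
    """First derivative K'(t)."""
    return sum(m * gi / ((1.0 - m) * math.exp(-gi * t) + m) - gi * m
               for m, gi in zip(mu, g))


def cgf_d2(t, mu, g):
    """Second derivative K''(t), positive for all t."""
    total = 0.0
    for m, gi in zip(mu, g):
        e = math.exp(-gi * t)
        total += (1.0 - m) * m * gi * gi * e / ((1.0 - m) * e + m) ** 2
    return total


def norm_cdf(x):
    """Standard normal distribution function, accurate in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def solve_saddlepoint(s, mu, g, tol=1e-10, max_iter=200):
    """Solve K'(t) = s by Newton-Raphson, falling back to bisection inside a bracket."""
    lo, hi = -1.0, 1.0
    for _ in range(40):                      # expand the bracket until it contains the root
        if cgf_d1(lo, mu, g) > s:
            lo *= 2.0
        elif cgf_d1(hi, mu, g) < s:
            hi *= 2.0
        else:
            break
    else:
        raise ValueError("s appears to lie outside the attainable range of S")
    t = 0.5 * (lo + hi)
    for _ in range(max_iter):
        f = cgf_d1(t, mu, g) - s
        if abs(f) < tol:
            break
        if f > 0:
            hi = t
        else:
            lo = t
        newton = t - f / cgf_d2(t, mu, g)
        t = newton if lo < newton < hi else 0.5 * (lo + hi)
    return t


def spa_upper_tail(s, mu, g):
    """Saddlepoint approximation to Pr(S >= s); assumes s is not at the mean."""
    t = solve_saddlepoint(s, mu, g)
    w = math.copysign(math.sqrt(2.0 * (t * s - cgf(t, mu, g))), t)
    v = t * math.sqrt(cgf_d2(t, mu, g))
    return norm_cdf(-(w + math.log(v / w) / w))


def normal_upper_tail(s, mu, g):
    """Normal-approximation tail used by the traditional score test."""
    return norm_cdf(-s / math.sqrt(cgf_d2(0.0, mu, g)))
```

For a strongly right-skewed toy case, for example 200 subjects with case probability 0.02, unit weights and *s* = 8, the saddlepoint tail probability from this sketch is more than an order of magnitude larger than the normal one, which is the kind of discrepancy that inflates the traditional score test in unbalanced designs.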

### Implementation details and approaches to reduce the computation time

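The saddlepoint approximation method involves finding the root of the saddlepoint equation *K*'(*t*) = *s*. It is easy to verify that *K*' is strictly increasing as *K*''(*t*) > 0 for all −∞ < *t* < ∞, and

$$\lim_{t \to -\infty} K'(t) = \sum_{i:\, \tilde{G}_i < 0} \tilde{G}_i - \sum_{i=1}^{n} \tilde{G}_i \hat{\mu}_i, \qquad \lim_{t \to \infty} K'(t) = \sum_{i:\, \tilde{G}_i > 0} \tilde{G}_i - \sum_{i=1}^{n} \tilde{G}_i \hat{\mu}_i,$$

which are the smallest and largest values that *S* can attain, so *K*' crosses any observed *s* strictly inside this range exactly once.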

Therefore a unique root exists, and we can use popular root-finding algorithms (Newton-Raphson,^{22;23} bisection,^{23} secant,^{23} Brent's method^{24}) to efficiently solve this equation. For our simulation studies and real-data applications we applied a combination of the Newton-Raphson and bisection methods to solve the saddlepoint equations.

The most computationally demanding step in this saddlepoint approximation method is calculating the cumulant generating function and its derivatives. Here we propose several approaches to reduce the computational complexities associated with these calculations.

**Faster calculation of the CGF using a partially normal approximation approach**: The most computationally intensive step in the saddlepoint method is the calculation of the cumulant generating function *K* and its derivatives. In each step of the root-finding algorithm we need to calculate *K*, *K*' and *K*'', each of which needs *O*(*n*) computations. Using the fact that many elements of *G* are zeroes (i.e., homozygous major genotypes), we propose a fast computation method that speeds up the computation to *O*(*m*), where *m* is the number of non-zero elements in *G*. Without loss of generality we assume that the first *m* subjects have at least one minor allele each and the rest have homozygous major genotypes. We can then express *S* as *S* = *S*_{1} + *S*_{2} where $S_1 = \sum_{i=1}^{m} \tilde{G}_i (Y_i - \hat{\mu}_i)$ and $S_2 = \sum_{i=m+1}^{n} \tilde{G}_i (Y_i - \hat{\mu}_i)$. Let *Z* = (*X*^{T}*WX*)^{−1} *X*^{T}*WG* and *Z*_{l} be the *l*^{th} element of *Z*. Since *G*_{i} = 0, and hence $\tilde{G}_i = -X_i^{T}Z$, for *i* > *m*, we can further express *S*_{2} as,

$$S_2 = -\sum_{i=m+1}^{n} X_i^{T} Z\, (Y_i - \hat{\mu}_i) = -\sum_{l=1}^{k} Z_l S_{2l},$$

where $S_{2l} = \sum_{i=m+1}^{n} X_{il} (Y_i - \hat{\mu}_i)$ and $X_{il}$ is the *l*^{th} element of *X*_{i}. Now, if we assume that the non-genetic covariates are relatively balanced in the sample, then the normal distribution should be a good approximation for the null distribution of each *S*_{2l}. Since *S*_{2} is a weighted sum of the *S*_{2l}s, we can also approximate the null distribution of *S*_{2} using a normal distribution with mean and variance under *H*_{0} given by *E*_{H0}(*S*_{2}) = 0 and $V_{H_0}(S_2) = \sum_{i=m+1}^{n} \tilde{G}_i^{2}\, \hat{\mu}_i (1 - \hat{\mu}_i)$. Then, the CGF of *S*_{2} can be approximated by

$$K_{S_2}(t) \approx \frac{t^{2}}{2} V_{H_0}(S_2),$$

and the CGF of *S* = *S*_{1} + *S*_{2} can be approximated by,

$$K(t) \approx \sum_{i=1}^{m} \log\left(1 - \hat{\mu}_i + \hat{\mu}_i e^{\tilde{G}_i t}\right) - t \sum_{i=1}^{m} \tilde{G}_i \hat{\mu}_i + \frac{t^{2}}{2} V_{H_0}(S_2). \tag{1.3}$$

In order to calculate the first two terms on the right-hand side of (1.3), we will need $\tilde{G}_i$ for *i* = 1,…, *m*, which can be calculated in *O*(*m*) computations since *G* only has *m* non-zero elements and the quantity $(X^{T}WX)^{-1}X^{T}W$ can be pre-calculated. Then, the first two terms will require only *O*(*m*) computations as both of them sum over *m* elements. Next, the variance *V*_{H0}(*S*_{2}) can be further broken down into,

$$V_{H_0}(S_2) = \sum_{i=m+1}^{n} (X_i^{T}Z)^{2}\, \hat{\mu}_i (1 - \hat{\mu}_i) = Z^{T}(X^{T}WX)Z - \sum_{i=1}^{m} (X_i^{T}Z)^{2}\, \hat{\mu}_i (1 - \hat{\mu}_i).$$

Since *X*^{T}*WX* can be pre-calculated and *Z* is a *k* × 1 vector, the first term requires *O*(*k*) computations, and the second term requires *O*(*m*) computations, which implies that the calculation of *V*_{H0}(*S*_{2}) requires *O*(*m*) calculations assuming *k* < *m*, i.e., the number of non-genetic covariates is smaller than the number of subjects with at least one minor allele each. Hence, the cumulant generating function *K*(*t*) can be calculated in *O*(*m*) computations. Using similar arguments, we can further show that the derivatives *K*'(*t*) and *K*''(*t*) can also be calculated in *O*(*m*) computations. Therefore, this partially normal approximation reduces the computational complexity of our test from *O*(*n*) to *O*(*m*), which is especially useful for rare variants, where *m* is much smaller than *n*.
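The partially normal approximation can be sketched in a few lines of standard-library Python. The exact Bernoulli terms are summed over carriers only, and the non-carriers enter through a single precomputed variance. The names (`fast_cgf`, `var_s2`, and so on) are hypothetical and the toy numbers are made up for illustration; `full_cgf` is included only to compare the approximation with the exact *O*(*n*) sum.

```python
import math


def full_cgf(t, mu, g):
    """Exact plug-in CGF summed over all n subjects (O(n) per evaluation)."""
    return sum(math.log(1.0 - m + m * math.exp(gi * t)) - t * gi * m
               for m, gi in zip(mu, g))


def fast_cgf(t, mu_c, g_c, var_s2):
    """Approximate K(t): exact terms for the m carriers plus t^2/2 * V(S_2)."""
    return full_cgf(t, mu_c, g_c) + 0.5 * t * t * var_s2


def fast_cgf_d1(t, mu_c, g_c, var_s2):
    """Approximate K'(t): carrier terms plus t * V(S_2)."""
    carriers = sum(m * gi / ((1.0 - m) * math.exp(-gi * t) + m) - gi * m
                   for m, gi in zip(mu_c, g_c))
    return carriers + t * var_s2


def fast_cgf_d2(t, mu_c, g_c, var_s2):
    """Approximate K''(t): carrier terms plus V(S_2)."""
    carriers = 0.0
    for m, gi in zip(mu_c, g_c):
        e = math.exp(-gi * t)
        carriers += (1.0 - m) * m * gi * gi * e / ((1.0 - m) * e + m) ** 2
    return carriers + var_s2


# Toy setting (hypothetical): 20 carriers, 1980 non-carriers with small adjusted genotypes.
mu_c, g_c = [0.1] * 20, [1.0] * 20
mu_nc, g_nc = [0.1] * 1980, [-0.01] * 1980
var_s2 = sum(gi * gi * m * (1.0 - m) for m, gi in zip(mu_nc, g_nc))

exact = full_cgf(1.0, mu_c + mu_nc, g_c + g_nc)
approx = fast_cgf(1.0, mu_c, g_c, var_s2)
```

In this toy setting the two values differ only by the higher-order cumulants of the non-carrier part, which are tiny because the adjusted genotypes of non-carriers are close to zero.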

**Using normal approximation near the mean for faster computation**: Since the normal approximation behaves well near the mean of the distribution, we can use it to obtain the *p*-value when the observed score statistic (*s*) lies close to the mean (zero). Moreover, saddlepoint approximation can be numerically unstable very close to the mean of the distribution. Such situations can also be avoided by using normal approximation near the mean. One possible approach is to use a fixed threshold in which we apply normal approximation to obtain the *p*-value if the absolute value of the observed score statistic satisfies |*s*| < *rσ*, where $\sigma = \sqrt{V_{H_0}(S)}$ and *r* is a pre-specified value. For example, we used *r* = 2 in our simulation studies and real data analyses. For a given level *α*, this approach does not inflate type I error rates if *r* < Φ^{−1}(1 − *α*/2), where Φ^{−1} is the inverse function of the standard normal distribution function, Φ(*x*).

Alternatively, we can adaptively select the threshold using the error bound of the normal approximation given by the Berry-Esseen theorem. Suppose we are interested in controlling the type I error rate at level *α*. Let *F*_{n}(*x*) be the true distribution function of the standardized score test statistic *S*/*σ*. Then, according to the Berry-Esseen theorem^{18–20}, the maximum error bound in approximating *F*_{n}(*x*) by Φ(*x*) is

$$\sup_{x} \left| F_n(x) - \Phi(x) \right| \le B_n = C\, \frac{\sum_{i=1}^{n} E\left| \tilde{G}_i (Y_i - \hat{\mu}_i) \right|^{3}}{\left[ \sum_{i=1}^{n} \tilde{G}_i^{2}\, \hat{\mu}_i (1 - \hat{\mu}_i) \right]^{3/2}},$$

where *C* is a constant. As of now, the best known estimate for *C* is 0.56, given by Shevtsova.^{25} Suppose *p*_{F} and *p*_{N} are the *p*-values based on *F*_{n}(*x*) and Φ(*x*), respectively. From the Berry-Esseen theorem, we can show *p*_{N} ≤ *p*_{F} + *B*_{n}. Suppose *q* = *B*_{n} + *α*/2 and *r*_{α} = Φ^{−1}(1 − *q*). Then *p*_{N} ≥ *q* implies *p*_{F} ≥ *α*/2. Therefore, we use *r*_{α}*σ* as a threshold at level *α*, and apply the normal approximation if |*s*| < *r*_{α}*σ*.
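The adaptive threshold can be sketched as follows, assuming the usual Berry-Esseen bound for independent, non-identically distributed summands, with third absolute moments of centered Bernoulli terms and the constant 0.56 quoted above. The function names are hypothetical. When the bound is so loose that *q* ≥ 0.5, the sketch returns a threshold of zero, meaning the normal approximation is never used.

```python
from statistics import NormalDist

SHEVTSOVA_C = 0.56  # constant quoted in the text


def berry_esseen_bound(mu_hat, g_tilde, c=SHEVTSOVA_C):
    """B_n = C * sum E|g_i (Y_i - mu_i)|^3 / (sum g_i^2 mu_i (1 - mu_i))^(3/2)."""
    third = sum(abs(g) ** 3 * m * (1.0 - m) * ((1.0 - m) ** 2 + m ** 2)
                for m, g in zip(mu_hat, g_tilde))
    var = sum(g * g * m * (1.0 - m) for m, g in zip(mu_hat, g_tilde))
    return c * third / var ** 1.5


def adaptive_threshold(mu_hat, g_tilde, alpha):
    """r_alpha = Phi^{-1}(1 - q) with q = B_n + alpha / 2 (0 if q >= 0.5)."""
    q = berry_esseen_bound(mu_hat, g_tilde) + alpha / 2.0
    if q >= 0.5:
        return 0.0
    return NormalDist().inv_cdf(1.0 - q)


def use_normal_approximation(s, sigma, r):
    """Rule from the text: use the normal approximation only if |s| < r * sigma."""
    return abs(s) < r * sigma
```

With equal unit weights, a balanced toy sample (10000 subjects, case probability 0.5) gives a threshold of roughly 2.5 standard deviations at *α* = 5×10^{−8}, while a strongly unbalanced one (20000 subjects, case probability 0.002) gives a noticeably smaller threshold, so the saddlepoint approximation is invoked more often.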

### Numerical Simulations

To evaluate the computation times, type I error rates and power of the proposed method, we carried out extensive simulation studies. We considered three different case-control ratios: balanced with 10000 cases and 10000 controls, moderately unbalanced with 2000 cases and 18000 controls, and extremely unbalanced with 40 cases and 19960 controls. For each choice of case-control ratios, the phenotypes were simulated based on the following logistic model,

$$\operatorname{logit}\left[\Pr(Y_i = 1)\right] = \beta_0 + X_{1i} + X_{2i} + \gamma G_i,$$

where the two non-genetic covariates *X*_{1i} and *X*_{2i} were simulated from *X*_{1i} ~ *Bernoulli*(0.5) and *X*_{2i} ~ *N*(0,1). The intercept *β*_{0} was chosen to correspond to a prevalence of 0.01. The genotypes *G*_{i} were generated from a *Binomial*(2, *p*) distribution where *p* was the minor allele frequency (MAF). The parameter *γ* represents the genotype log odds-ratio.
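A minimal simulation sketch along these lines is given below, using only the Python standard library. It makes several assumptions that the text does not spell out: both covariates enter the linear predictor with unit coefficients, the intercept is calibrated by bisection so that the average case probability under *γ* = 0 equals the target prevalence, and subjects are drawn as a cohort rather than sampled to a fixed number of cases and controls. All function names are hypothetical.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_covariates(n, maf, rng):
    """Draw (X1, X2, G) with X1 ~ Bernoulli(0.5), X2 ~ N(0, 1), G ~ Binomial(2, maf)."""
    rows = []
    for _ in range(n):
        x1 = 1.0 if rng.random() < 0.5 else 0.0
        x2 = rng.gauss(0.0, 1.0)
        g = sum(1 for _ in range(2) if rng.random() < maf)
        rows.append((x1, x2, g))
    return rows


def mean_case_probability(rows, beta0):
    """Average Pr(Y = 1) over the sample when gamma = 0 (assumed unit covariate effects)."""
    return sum(expit(beta0 + x1 + x2) for x1, x2, _ in rows) / len(rows)


def calibrate_intercept(rows, prevalence, lo=-20.0, hi=20.0):
    """Bisection for beta0 so that the mean case probability matches the prevalence."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mean_case_probability(rows, mid) < prevalence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_phenotypes(rows, beta0, gamma, rng):
    """Draw Y_i ~ Bernoulli(expit(beta0 + X1 + X2 + gamma * G))."""
    return [1 if rng.random() < expit(beta0 + x1 + x2 + gamma * g) else 0
            for x1, x2, g in rows]


rng = random.Random(2017)                 # arbitrary seed for reproducibility
rows = simulate_covariates(5000, 0.05, rng)
beta0 = calibrate_intercept(rows, 0.01)
y = simulate_phenotypes(rows, beta0, gamma=0.0, rng=rng)
```

To mimic a fixed case-control design such as 40 cases and 19960 controls, one would keep simulating subjects and retain cases and controls until both quotas are filled; that step is omitted here for brevity.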

To estimate computation times and type I error rates in realistic scenarios, the MAF (*p*) was randomly sampled from the MAF distribution in the MGI data. For the computation time comparisons, we simulated 10^{4} variants with *γ* = 0. For the type I error comparisons, we simulated 10^{9} variants with *γ* = 0 and recorded the number of rejections at *α* = 5×10^{−5} and 5×10^{−8}. We also used fixed MAFs to evaluate the effect of MAF on computation time and type I error rates. For the power calculations, we considered two different choices for MAF, *p* = 0.01 and 0.05, and wide ranges of *γ* (*Figure 1*). For each choice of *p* and *γ* we generated 5000 variants, and recorded the number of rejections at the 5×10^{−8} level.

We compared the computation times of seven different tests: traditional score test using normal approximation (Score); the saddlepoint approximation based test with the standard deviation threshold at 0.1 and 2 (SPA-0.1 and SPA-2); the fast saddlepoint approximation based test with the partially normal approximation improvement and the standard deviation threshold at 0.1 and 2 (fastSPA-0.1 and fastSPA-2); the fastSPA test with the Berry-Esseen bound threshold at level 5×10^{−8} (fastSPA-BE); and Firth’s penalized likelihood test (Firth). To reduce the computational burden of evaluating type I error rates at the genome-wide significance level, we compared the empirical type I errors of only two methods, fastSPA-2 and Score. For power comparisons, we compared the empirical power curves of fastSPA-2 and Firth. In order to compare the *p*-values resulting from different tests, we also simulated 5×10^{6} variants with MAFs randomly sampled from the MAF distribution of the MGI data. We further compared the inflation factors of the genomic controls at different *p*-value quantiles for fastSPA-2 and fastSPA-0.1 in order to explore the effect of the standard deviation threshold on the inflation factor.

### Michigan Genomics Initiative (MGI) data application

To illustrate the performance of the proposed methods in real data application, we analyzed four selected phenotypes in the MGI data. The main goal of MGI is to create an institutional repository of genetic data together with rich clinical phenotypes for a broad portfolio of future medical research. DNA from blood samples of > 20,000 surgical patients at the University of Michigan Health System was genotyped (with their informed consent) on the Illumina HumanCoreExome v12.1 array, which is a combination GWAS plus exome array comprised of > 500,000 single nucleotide polymorphisms. Genotypes of the Haplotype Reference Consortium^{26} (chromosome 1-22: HRC release 1; chromosome X: HRC release 1.1) were imputed into the phased MGI genotypes (SHAPEIT2^{27} on autosomal chromosomes and Eagle2^{28} on chromosome X) using Minimac3.^{29} Excluding variants with low imputation quality (R^{2} < 0.3) resulted in dense mapping at over 39 million quality-imputed genetic markers.

Phenotypes derived from 8,940 ICD-9 billing codes were classified into 1,815 PheWAS disease states of shared disease etiology, of which 1,448 had at least 20 cases. Standard code translations were used to convert the taxonomy of diagnostic ICD-9 codes into PheWAS code groups (PheWAS code translation table version 1.2^{30}). Cases were derived from electronic health records for patients with at least 2 encounters with an ICD-9 billing code. We performed genome-wide association analyses for 4 selected traits, Skin Cancer (PheWAS code: 172), Type-2 diabetes (PheWAS code: 250.2, [MIM: 125853]), Primary Hypercoagulable state (PheWAS code: 286.81, [MIM: 188055]) and Cystic Fibrosis (PheWAS code: 499, [MIM: 219700]), in 18,267 unrelated individuals of European ancestry, with adjustment for age, sex, and four principal components. Genotyped samples with any missing covariate information were excluded from analysis. Since imputation quality is low for very rare variants^{26}, we excluded variants with MAF < 0.001 in our main analysis, which resulted in 13 million variants.

## Results

**Comparison of computation times**: Table 1 shows empirical and projected computation time of the proposed and competing methods. To obtain computation time under realistic scenarios of the MAF distribution, the MAFs of the simulated SNPs were randomly sampled from the MAF spectrum of the MGI data (*Figure S2*). The fastSPA-2 test performs 100-300 times faster than Firth’s test. In the unbalanced case-control setup of 2000 cases and 18000 controls, for example, Firth’s test takes 117 CPU-years whereas fastSPA-2 only takes 1.09 CPU-years to analyze 10 million SNPs across 1500 phenotypes. This indicates that on a cluster with 100 CPU cores, the proposed test would require about 4 days (excluding data reading), whereas Firth’s test would need more than a year. When we compare fastSPA and SPA, fastSPA-0.1 performs 4-6 times faster than SPA-0.1 (e.g. 2.90 vs 12.32 CPU-years when case:control = 2000:18000), and fastSPA-2 performs 1.5-2 times faster than SPA-2 (e.g. 1.09 vs 1.62 CPU-years when case:control = 2000:18000). fastSPA-BE also performs reasonably fast.
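The conversion from projected CPU-years to wall-clock time on a cluster is simple arithmetic, sketched below in Python. It assumes perfect parallel scaling over 100 cores, a 365-day year, and no time spent reading data; the helper name `wall_clock_days` is hypothetical.

```python
def wall_clock_days(cpu_years, cores, days_per_year=365.0):
    """Idealised wall-clock days for a job of `cpu_years` spread evenly over `cores`."""
    return cpu_years * days_per_year / cores


firth_cpu_years = 117.0      # projected, 2000 cases and 18000 controls (from the text)
fastspa2_cpu_years = 1.09    # projected, same setting (from the text)
cores = 100

fastspa2_days = wall_clock_days(fastspa2_cpu_years, cores)   # about 4 days
firth_days = wall_clock_days(firth_cpu_years, cores)         # about 427 days
speedup = firth_cpu_years / fastspa2_cpu_years               # about 107-fold
```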

We also recorded the computation times for variants with three different fixed MAFs 0.1, 0.01 and 0.001 in order to assess the effect of MAF on the performance of the tests. Similar to *Table 1*, *Table 2* also shows the superior performance of fastSPA-2 compared to all other tests. Moreover, while the computation time of SPA increases with decreasing MAFs, which may be due to the slow convergence caused by the discrete nature of the underlying distribution, fastSPA requires less computation time for rarer variants (smaller MAFs) compared to more common variants (larger MAFs). This demonstrates the potential of the partially normal approximation improvement in terms of faster computation of the *p*-values, especially for low-frequency and rare variants.

**Type I error comparison**: The type I error rates from 10^{9} simulated datasets are presented in Table 3. Due to the heavy computational burden of testing such an extremely large number of datasets, in this comparison we only considered the traditional score test and fastSPA-2. The traditional score test had greatly inflated type I error rates for moderately unbalanced and extremely unbalanced case-control ratios, whereas fastSPA-2 can control the type I error in such situations. At the genome-wide significance level of *α* = 5×10^{−8}, for example, the empirical type I error rates of the score test were 32 (1.63×10^{−6}, when case:control = 2000:18000) and 26600 (1.33×10^{−3}, when case:control = 40:19960) times higher than the nominal *α* = 5×10^{−8}. In contrast, fastSPA-2 had empirical type I error rates nearly identical to (4.9×10^{−8}, when case:control = 2000:18000) or slightly lower than (3.5×10^{−8}, when case:control = 40:19960) the nominal *α* = 5×10^{−8}. We also estimated empirical type I error rates at six different MAFs (*Figure S3*). The score test had deflated type I error rates for low-frequency and rare variants for the balanced case-control ratio, and inflated and extremely inflated type I error rates for moderately and severely unbalanced case-control ratios, respectively. The fastSPA-2 method had overall well controlled type I error rates regardless of MAFs and case-control ratios.

**Power comparison**: Next, we compared the power of fastSPA-2 to that of the Firth’s test at *α* = 5 ×10^{−8}. Note that the Firth’s test is a current gold standard method.^{13} Since the traditional score test had greatly inflated type I error rates, we did not include it in the power comparison. *Figure 1* shows power by odds ratios when the MAF of the variant was 0.05 (top panel) and 0.01 (bottom panel). As expected, the power is higher when the case-control ratio is balanced. The empirical powers of fastSPA-2 and the Firth’s test were nearly identical for all case-control ratios and MAFs, which suggests that our proposed test does not suffer from any loss in power compared to the Firth’s test.

**P-value and inflation factor (λ) comparison**: To compare *p*-value distributions of various tests, we generated QQ plots and calculated the inflation factor (λ) of the genomic control. *Figure S4* suggests strong deflation (smaller than expected) in the *p*-values based on the traditional score test in the moderately unbalanced and extremely unbalanced case-control setups, whereas fastSPA-2, SPA-2 and Firth tests resulted in well calibrated QQ plots, which suggest that these methods can control for type I errors. Both fastSPA-2 and fastSPA-0.1 showed no inflation or deflation in genomic controls (λ) in the balanced and moderately unbalanced case-control setups (*Table S1*). In the extremely unbalanced case-control setup, fastSPA-2 resulted in greatly deflated inflation factor (λ = 0.48) at the median of *p*-value (q = 0.5). Interestingly fastSPA-0.1 resulted in inflated λ (λ = 1.83) at q = 0.5, which may be due to the discrete nature of *p*-values. When λ was measured at *p*-value quantiles q = 0.01 and 0.001, however, both tests provided λ very close to unity.
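The inflation factor at a given *p*-value quantile can be computed as sketched below. The sketch assumes the common genomic-control definition, namely the 1-degree-of-freedom chi-square statistic implied by the observed *p*-value at quantile q divided by the chi-square statistic implied by q itself; the text does not state the exact estimator used, and the function names are hypothetical. It relies only on `statistics.NormalDist` from the standard library and uses a simple order-statistic quantile.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def chi2_1_from_p(p):
    """Chi-square (1 df) statistic whose upper-tail probability is p: (Phi^{-1}(p / 2))^2."""
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2


def inflation_factor(p_values, q=0.5):
    """Genomic-control lambda measured at the q-th quantile of the p-values."""
    ordered = sorted(p_values)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return chi2_1_from_p(ordered[idx]) / chi2_1_from_p(q)


# Evenly spaced p-values mimic a perfectly calibrated test.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lam_uniform = inflation_factor(uniform_p, q=0.5)
lam_small_p = inflation_factor([p * p for p in uniform_p], q=0.5)      # p-values too small
lam_large_p = inflation_factor([p ** 0.5 for p in uniform_p], q=0.5)   # p-values too large
```

With this definition, *p*-values that are systematically too small give λ above one and *p*-values that are too large give λ below one, and passing q = 0.01 or q = 0.001 measures calibration further into the tail.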

### MGI Data Analysis

We applied the traditional score test and the fastSPA-2 method to the MGI data with four phenotypes, Skin Cancer, Type-2 diabetes, Primary Hypercoagulable state, and Cystic Fibrosis, which were selected based on case-control ratios. Skin Cancer (2359 cases, 15265 controls) and Type-2 diabetes (1987 cases, 14906 controls) were moderately unbalanced, whereas Primary Hypercoagulable state (168 cases, 16401 controls) and Cystic Fibrosis (28 cases, 18212 controls) were extremely unbalanced phenotypes.

The Manhattan plots (Figure 2) show that the traditional score test produced a large number of spurious associations for all of these phenotypes, whereas all of the significant variants from our proposed test at the genome-wide significance level of *α* = 5×10^{−8} can be verified as truly associated with the phenotypes based on previous findings (Table 4). In the analysis of Skin Cancer, variants in or near *IRF4* (MIM: 601900), *MC1R* (MIM: 155555), *RALY* (MIM: 614663) and *SLC45A2* (MIM: 606202) were significant at *α* = 5×10^{−8} and all of these four genes were previously identified as associated with pigmentation traits and skin cancers.^{34–39} In the other traits, variants in *TCF7L2* (MIM: 602228), *F5* (MIM: 612309) and *CFTR* (MIM: 602421) were significantly associated with Type-2 diabetes,^{40} Primary Hypercoagulable State^{41} and Cystic Fibrosis,^{42} respectively, and all of these genes are well known to be associated with the risk of each disease.

The QQ plots (*Figure S6*) also suggest that the *p*-values based on the traditional score test are much smaller than expected, especially for low-frequency and rare variants, whereas the *p*-values based on fastSPA-2 closely follow the uniform distribution. We also examined the Manhattan plots (*Figure S5*) obtained when the variants with MAF < 0.001 were included in the analysis. The inclusion of the rarer variants resulted in extreme inflation in the number of spurious associations for the traditional score test. However, our proposed test still produced few or no new associations.

Further, based on the *p*-values from our proposed test, we obtained the inflation factor λ of the genomic control at different *p*-value quantiles (q) and different MAF cut-offs (*Table S2*). To evaluate whether using a smaller standard deviation threshold (r) improves the estimation of λ, we also applied fastSPA with r = 0.1 (i.e., fastSPA-0.1) to these four phenotypes. When all the variants were included in the analysis, there was slight inflation (λ = 1.11, Type-2 diabetes) or great deflation (λ = 0.12, Cystic Fibrosis) at the median level for fastSPA-2. However, the genomic controls are very close to unity at q = 0.01 and q = 0.001. If we only consider the variants with MAF > 0.001, then fastSPA-2 does not show any significant inflation in λ at the median for Skin Cancer, Type-2 diabetes, and Primary Hypercoagulable State, although it shows a deflated genomic control for Cystic Fibrosis (λ = 0.63) due to the discrete nature of the underlying distribution. However, if we exclude the rare variants and consider only the variants with MAF > 0.01, then all four of the phenotypes show λ very close to unity. fastSPA-0.1 shows no significant inflation or deflation in λ at all quantiles and MAF cut-offs, except for Cystic Fibrosis (λ = 1.27) when all the variants are considered and genomic control is measured at the median level.

## Discussion

In this paper, we proposed a fast and scalable test to analyze large PheWAS datasets which is well calibrated even in extremely unbalanced case-control settings. The method uses the computationally efficient saddlepoint approximation to accurately calculate *p*-values of score test statistics. We further proposed an improved version of our test which substantially reduces the computation time, especially for low-frequency and rare variants. Our proposed test can also adjust for additional covariates. Through extensive numerical studies we demonstrated that our test can perform 100-300 times faster than the currently used Firth’s test while retaining similar power and well controlled type I error rates. MGI data analysis illustrates that by applying the proposed method to PheWAS, we can identify true association signals while controlling for type I error, even for traits with a very small number of cases and a large number of controls.

Our test calculates *p*-values based on the traditional score test if the score statistic lies sufficiently close to the mean. Even though normal approximation is accurate near the mean, those *p*-values may not be well calibrated. In such cases, since the median *p*-values might come from the traditional score test, we can encounter a slightly inflated or deflated inflation factor at the median. When the case-control ratio is extremely unbalanced, this phenomenon is more pronounced. One way to circumvent this issue is to measure the inflation factor at more extreme quantiles (0.01, 0.001 etc.), or to exclude rare variants when estimating the inflation factor. Another approach is to decrease the standard deviation threshold so that the median *p*-values come from the saddlepoint approximation. In the MGI data analysis, fastSPA-0.1 produced substantially better inflation factor estimates than fastSPA-2. However, using a threshold of 0.1 instead of 2 would increase computation time ~3-4 fold. The choice of the threshold should be based on a careful assessment of the available computational resources.

As sequencing costs continue to drop, whole-exome or whole-genome sequencing will be used for PheWAS to identify rare variants associated with clinical phenotypes.^{31} In rare variant association analysis, gene or region based multiple variant tests are commonly used to improve power.^{32} When case-control ratios are unbalanced, popular rare variant tests, including burden tests, SKAT and SKAT-O, can also have substantially inflated type I error rates. Although resampling based approaches have been developed to address this problem,^{33} the existing methods are not fast enough to be used in PheWAS. One possible approach is first to adjust single variant score statistics using SPA and then to use the adjusted score statistics to control for the type I error. We leave this for future research.

In summary, we have proposed an accurate and scalable method for PheWAS data analysis. With the growing effort to build large research cohorts for precision medicine^{31}, future PheWASs will have hundreds of thousands of samples and hundreds of millions of variants. Our method will provide a scalable solution for this large-scale problem and contribute to finding the genetic components of complex traits. All our tests are implemented in the R package **SPAtest**.

### Acknowledgements

This work was supported by grant R01 HG008773 (RD and SL). We would like to thank the investigators of the MGI project for access to the PheWAS dataset and Dr. Hyun Min Kang for implementing the methods in the Epacts package.

## Web resources

SPAtest R-package: https://sites.google.com/a/umich.edu/leeshawn/software

Michigan Genomics Initiative: https://www.michigangenomics.org/

Online Mendelian Inheritance in Man (OMIM): http://www.omim.org

## Appendix

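**Explanation behind using** $\tilde{G}$ **instead of** *G*: Let $Y = (Y_1, \dots, Y_n)^{T}$. We first note that since $\hat{\mu}$ is the maximum likelihood estimator of *μ* under the null model, $X^{T}(Y - \hat{\mu}) = 0$ and therefore $G^{T}(Y - \hat{\mu}) = \tilde{G}^{T}(Y - \hat{\mu})$. Now, the score function and the observed information matrix under the null model are given by,

$$U = \begin{pmatrix} X^{T}(Y - \hat{\mu}) \\ G^{T}(Y - \hat{\mu}) \end{pmatrix}, \qquad I = \begin{pmatrix} X^{T}WX & X^{T}WG \\ G^{T}WX & G^{T}WG \end{pmatrix}.$$

Therefore, the variance of *S* under *H*_{0} is given by,

$$V_{H_0}(S) = G^{T}WG - G^{T}WX(X^{T}WX)^{-1}X^{T}WG = \tilde{G}^{T}W\tilde{G} = \sum_{i=1}^{n} \tilde{G}_i^{2}\, \hat{\mu}_i (1 - \hat{\mu}_i).$$

So, even though the two expressions of *S* are algebraically the same, the variance can be expressed as a weighted sum of the $\tilde{G}_i^{2}$s where the weights are given by $\hat{\mu}_i(1 - \hat{\mu}_i)$. Therefore, we used $\tilde{G}$ instead of *G* to express the score statistic.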