KS-Burden: Assessing distributional differences of rare variants in dichotomous traits

A number of rare variant tests have been developed to explore the effect of low-frequency genetic variation on complex phenotypes. However, an often neglected aspect of these tests is the position of the genetic variants. Here we propose a way to assess differences in the spatial organization of rare variants by assessing their distributional differences between affected and unaffected subjects. To do so, we have formulated an adaptation of the well-known Kolmogorov-Smirnov (KS) test that combines the KS test with a simple gene burden approach, called KS-Burden. The performance of our test was evaluated under a comprehensive simulation framework using real data and various scenarios. Our results show that the KS-Burden test is able to outperform the commonly used SKAT-O test, as well as others, in the presence of clusters of causal variants within a genomic region. Furthermore, our test is able to maintain competitive statistical power in scenarios unfavorable to its original assumptions. Hence, the KS-Burden test is a valuable alternative to existing tests and provides better statistical power in the presence of causal clusters within a gene.

The advent of genome-wide association studies (GWAS) has contributed significantly to our understanding of complex traits by finding several thousand robust associations between genetic variants and complex phenotypes [1]. However, GWASs survey only common variants (MAF > 0.01) and ignore lower-frequency variants, which make up the majority of polymorphisms. Detection of low-frequency variants has been more challenging, but the recent development of next-generation sequencing-based technologies has provided rich opportunities to study these rare variants and their impact on complex human traits [2].

Indeed, rare variants, which can be defined as genetic variations occurring in less than 1% of the population, have been suggested to play an important role in the etiology of human traits and potentially account for the missing heritability [3,4]. Thus considerable effort has been made to develop and deploy statistical methods to discover important causal relationships between rare variants and complex human traits [5][6][7][8]. In GWASs, a single variant is associated with the trait in question. This approach is largely infeasible for rare variants due to their low frequency and large number, as well as the limited sample size of most studies [9]. Thus most approaches have focused on combining multiple rare variants in order to increase statistical power. This can be done at either the gene or the pathway level, but for simplicity we only consider gene-based tests in this paper.

In general, one can classify rare variant tests into three categories, namely burden, variance-component and omnibus tests, based on their assumptions regarding the underlying genetic architecture [9]. Burden tests aggregate single rare genetic variants, thus assuming that all variants in a given genomic region have the same direction of effect. Examples include the Combined Multivariate and Collapsing (CMC) test [10], as well as the weighted sum statistic [11]. Alternatively, variance-component tests do not assume a uni-directional effect of all included variants. These methods investigate the distribution of genetic effects for a genomic region and are robust to variants with differing directions of effect. Prominent examples of variance-component tests are SKAT [12] and C-Alpha [13]. These tests are more [...] [14], which uses a combination of SKAT and burden test statistics to derive a combined p-value.

An often neglected aspect of rare variant tests is the position of the genetic variants, and only a few tests that take position into account have so far been suggested [15][16][17]. Multiple lines of biological evidence have been reported demonstrating clustering of causal rare variants within the genome [16]. It is biologically plausible that rare deleterious mutations causally related to a given trait are more likely to be located in protein functional domains or gene regulatory elements.

Here we propose a way to assess differences in spatial organization by assessing the distributional differences of rare variants between cases and controls. To do so we make use of the well-known Kolmogorov-Smirnov (KS) test. We demonstrate, through simulations, that our method shows good statistical power compared to commonly used tests, such as SKAT and burden, when the assumptions of the KS test are met. Further, we combine the KS and burden tests to provide an omnibus approach to gene-based association testing.

The two-sample KS test is a non-parametric test for the equality of two one-dimensional probability distributions. The test can be adapted to test for distributional differences of multiple variants in a region for a dichotomous phenotype by computing the respective cumulative distribution functions (see Figure 1).

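July 5, 2018

For a given genotype matrix $G$ of size $n \times p$, one can compute the empirical cumulative distribution function at any given variant position $x$ in $G$ as

$$F(x) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{x} G_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{p} G_{ij}},$$

in which $G_{ij}$ is the allele count of the $i$-th subject at the $j$-th variant, with variants ordered by genomic position.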
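The cumulative distribution over variant positions can be illustrated with a minimal Python sketch. It assumes that F(x) is the share of all minor alleles in the matrix that fall at or before the x-th variant, with columns ordered by genomic position. The function name `variant_ecdf` and the toy matrix are illustrative and are not taken from the authors' C++ implementation.

```python
from itertools import accumulate


def variant_ecdf(genotypes):
    """Return F(x) for every variant column x of an n x p allele-count matrix.

    Assumption: columns are ordered by genomic position and entries are
    minor-allele counts (0, 1 or 2).
    """
    n_variants = len(genotypes[0])
    # Allele count per variant, summed over all subjects (column sums).
    column_counts = [sum(row[j] for row in genotypes) for j in range(n_variants)]
    total = sum(column_counts)
    if total == 0:
        raise ValueError("genotype matrix contains no minor alleles")
    # Running sum over the ordered variants, normalised so that F(p) = 1.
    return [running / total for running in accumulate(column_counts)]


# Toy example: 3 subjects, 4 variants.
G = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 2],
]
ecdf = variant_ecdf(G)
print(ecdf)
```

For this toy matrix the column sums are 1, 1, 0 and 2, so the function steps from 0.25 to 0.5, stays flat at the third variant and reaches 1 at the last one.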
Given the genotype matrices of affected and unaffected individuals ($G_A$ and $G_U$), one can compute $F(x)$ for both groups separately for all genomic positions $x$. The test statistic of the KS test is then given as

$$K = \max_{x} \left| F_A(x) - F_U(x) \right|.$$

Thus the KS test aims to identify distributional differences of multiple variants by identifying the largest absolute difference between the two empirical cumulative distribution functions.

Evaluating distributional differences of variants between affected and unaffected individuals corresponds to testing the null hypothesis $H_0: F_A(x) = F_U(x)$ [...] such as allele count. Indeed, application of the Kolmogorov distribution to obtain critical values for $K$ for discrete data yields conservative estimates [18,19]. Therefore, we applied a permutation-based approach to evaluate $K$ under $H_0$:

$$p = \frac{1}{B} \sum_{b=1}^{B} I(K_b \geq K),$$

in which $K_b$ is the test statistic of the $b$-th permuted sample, $K$ is the observed test statistic, $B$ is the number of permutations and $I(\cdot)$ is the indicator function.
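The KS statistic and its permutation-based evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' C++ code. It assumes the p-value is the fraction of label permutations whose statistic is at least as large as the observed one, and it assigns a flat cumulative distribution of 0 to a group without any minor alleles. All function names and the toy data are hypothetical.

```python
import random
from itertools import accumulate


def variant_ecdf(genotypes, n_variants):
    """Cumulative share of minor alleles up to each ordered variant position."""
    counts = [sum(row[j] for row in genotypes) for j in range(n_variants)]
    total = sum(counts)
    if total == 0:
        # Assumption: a group without any minor allele gets a flat ECDF of 0.
        return [0.0] * n_variants
    return [c / total for c in accumulate(counts)]


def ks_statistic(cases, controls, n_variants):
    """Largest absolute difference between the case and control ECDFs."""
    f_a = variant_ecdf(cases, n_variants)
    f_u = variant_ecdf(controls, n_variants)
    return max(abs(a - u) for a, u in zip(f_a, f_u))


def ks_permutation_test(genotypes, is_case, n_perm=2000, seed=0):
    """Return (observed K, permutation p-value) by shuffling case labels."""
    n_variants = len(genotypes[0])

    def statistic(labels):
        cases = [g for g, lab in zip(genotypes, labels) if lab]
        controls = [g for g, lab in zip(genotypes, labels) if not lab]
        return ks_statistic(cases, controls, n_variants)

    observed = statistic(is_case)
    rng = random.Random(seed)
    labels = list(is_case)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        # Small tolerance so floating-point ties count as "at least as large".
        if statistic(labels) >= observed - 1e-12:
            exceed += 1
    return observed, exceed / n_perm


# Toy data: case alleles sit in the first three variants, control alleles
# in the last three, so the two ECDFs are maximally separated.
genotypes = [
    [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 1],
]
is_case = [True] * 4 + [False] * 4
observed, p_value = ks_permutation_test(genotypes, is_case)
print(observed, p_value)
```

In the toy data the observed statistic equals 1, because every case allele lies before every control allele. Shuffling labels rather than using the Kolmogorov distribution avoids relying on a continuity assumption that allele counts do not satisfy.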

The KS test evaluates distributional differences of multiple variants in a given region, but does not test for overall differences in the number of variants between affected and unaffected individuals. The Burden test, in contrast, does not take distributional differences into account but tests for differences in allele counts between affected and unaffected individuals. One can define the test statistic of the Burden test as [...] most prominently Fisher's product method. We made use of a similar method which has been shown to be more powerful [20]. Specifically, given the ordered p-values of both KS and Burden (indexed by $j$), one can compute $W_j$ as [...] This relatively simple approach allows both KS and Burden to be combined while holding the type 1 error rate stable [20].
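For orientation, the classical Fisher's product method mentioned above can be sketched in a few lines of Python. This is explicitly not the combination method of [20] that KS-Burden uses. It is the textbook baseline, and it assumes independent p-values, which need not hold for two tests computed on the same genotypes. The function name is illustrative.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's product method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, whose survival function has
    a closed form for even degrees of freedom.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    k = len(p_values)
    # Survival function: exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, tail = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        tail += term
    return math.exp(-half_stat) * tail


# Hypothetical KS and Burden p-values for one gene.
combined = fisher_combined_p([0.05, 0.05])
print(combined)
```

For two p-values the result reduces to p1 * p2 * (1 - ln(p1 * p2)), which gives a quick way to check the implementation by hand.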

Implementation of the KS, Burden and combined KS-Burden tests was done in C++ and can be found at https://github.com/rmporsch/ksburden.

[...] was below or equal to 1%. We simulated two different scenarios (see Figure 3 and Figure 4) in which we assume different configurations of causal clusters γ.

The phenotype for each configuration was simulated via a liability threshold model. Hence the phenotype $Y_i$ of the $i$-th subject was generated via $Y_i = G_i \beta + \epsilon_i$, in which $G_i$ are the standardized genotypes of the $i$-th subject with $P$ variants, $\beta$ is the effect size vector of length $P$ and $\epsilon_i$ is a normally distributed error term with a mean of 0 and a variance of $1 - h$, in which $h = \sum_{j=1}^{P} \beta_j$. The effect $h$ was uniformly distributed across all causal variants and therefore represents the effect of the whole genomic region. We assigned case status to each subject whose $Y_i$ was above a certain liability threshold, $q$. This process was repeated until 500 cases and an equal number of controls were collected.
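The liability threshold sampling described above can be sketched as follows. The study draws genotypes from real sequencing data. This sketch instead simulates them from assumed minor allele frequencies under Hardy-Weinberg proportions, and the values chosen for h, q and the frequencies are purely illustrative. The function name `simulate_case_control` is hypothetical.

```python
import math
import random


def simulate_case_control(mafs, causal, h=0.02, q=1.0, n_per_group=500, seed=1):
    """Draw subjects until n_per_group cases and controls are collected.

    Assumptions (not from the source): genotypes are simulated from the
    given minor allele frequencies, and the default h and q are arbitrary.
    """
    rng = random.Random(seed)
    # The regional effect h is spread uniformly over the causal variants.
    beta = [h / len(causal) if j in causal else 0.0 for j in range(len(mafs))]
    noise_sd = math.sqrt(1.0 - h)  # error variance is 1 - h
    cases, controls = [], []
    while len(cases) < n_per_group or len(controls) < n_per_group:
        # Two independent allele draws per variant give a count of 0, 1 or 2.
        genotype = [int(rng.random() < m) + int(rng.random() < m) for m in mafs]
        standardized = [
            (g - 2 * m) / math.sqrt(2 * m * (1 - m)) for g, m in zip(genotype, mafs)
        ]
        liability = sum(b * z for b, z in zip(beta, standardized))
        liability += rng.gauss(0.0, noise_sd)
        group = cases if liability > q else controls
        if len(group) < n_per_group:  # subjects for an already full group are dropped
            group.append(genotype)
    return cases, controls


# Small run for illustration; the study collected 500 subjects per group.
mafs = [0.005] * 20
cases, controls = simulate_case_control(mafs, causal={3, 4}, n_per_group=50)
print(len(cases), len(controls))
```

Sampling continues after one group is full, so the final data set is balanced regardless of how rare the case status is under the chosen threshold.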

Configuration 1 We assumed a single cluster in a given genomic region, called γ, which is located at a random position within the genomic region. All variants in γ were assigned to be causal. Furthermore, the size of γ was expressed as the proportion of variants included in the causal cluster relative to the total number of variants in a given region. For example, given γ = 0.1 and a gene with 100 rare variants, causal status would be assigned to a cluster of 10 variants. [...]

Next, we explored the behavior of our developed tests in situations with more than one causal cluster (see Figure 5). Not surprisingly, the statistical power of the KS test is reduced as the number of causal clusters increases while holding the total size of clusters [...]

Overall, our simulations have shown that the KS-Burden test is able to outperform commonly used tests in some specific scenarios. Furthermore, the test is able to maintain good statistical power in simulations unfavorable to the KS test, therefore providing a valuable alternative to commonly used tests.

Interestingly, the type 1 error rate for all tests is significantly lower than the chosen α of 0.05 (see Figure 7). It is important to note that permutation approaches are known to yield conservative p-values for rare event data, such as rare genetic variants. Nevertheless, the type 1 error rate is noticeably low for both SKAT and SKAT-O, while those of the burden-based approaches, such as CMC and Burden, are slightly higher. The highest type 1 error rate is present in our KS-Burden test, but it is still significantly below 0.05.
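Why permutation p-values become conservative for rare events can be seen in a small exhaustive example. The sketch below enumerates every possible case/control labelling of a toy gene and uses a hypothetical statistic, the absolute difference in allele counts between groups. This is a stand-in chosen for simplicity, not the Burden statistic of this study.

```python
from itertools import combinations


def exact_null_p_values(allele_counts, n_cases):
    """Exact permutation p-value of every possible labelling under the null."""
    total = sum(allele_counts)
    stats = []
    for case_idx in combinations(range(len(allele_counts)), n_cases):
        in_cases = sum(allele_counts[i] for i in case_idx)
        # Hypothetical statistic: |alleles in cases - alleles in controls|.
        stats.append(abs(in_cases - (total - in_cases)))
    n_labellings = len(stats)
    return [sum(s >= observed for s in stats) / n_labellings for observed in stats]


# Toy gene: 10 subjects, only 3 of whom carry a rare allele; 5 are cases.
allele_counts = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p_values = exact_null_p_values(allele_counts, n_cases=5)
alpha = 0.05
type1_error = sum(p <= alpha for p in p_values) / len(p_values)
print(min(p_values), type1_error)
```

With only three carriers the statistic takes just two distinct values, so the smallest attainable p-value is 42/252 (about 0.17) and the test can never reject at α = 0.05. Genes with few rare variants behave similarly, if less extremely, which is consistent with type 1 error rates below the nominal level.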

We have shown that the KS-Burden test is able to outperform commonly used gene-based tests when a single causal cluster is present within a genomic region. Furthermore, our test shows similar performance compared to other tests in scenarios which are unfavorable to its underlying assumptions.

It is important to emphasize that all gene-based tests make assumptions about the underlying genetic architecture [9]. Specifically, while the Burden test assumes that all [...]

In addition, the low type 1 error rate across all tests needs to be discussed, as it is in contrast to previous studies [12-14, 23, 24]. Interestingly, most previous studies used only simulated genotype matrices to estimate statistical power. In contrast, our study made use of a large whole-genome sequencing data set and therefore reflects commonly encountered scenarios in rare variant association studies. Indeed, most genes used in our analysis are relatively small and contain only a few rare variants. This has a direct effect on the commonly used permutation approaches and results in conservative p-value estimates. However, this issue is present across all tests used and should be reduced in larger sample sizes.

Furthermore, it is somewhat surprising that the Burden test is unable to outperform SKAT-O and KS-Burden in the situation most favorable to it (γ = 1.0). However, it is important to note that even under very unfavorable scenarios both KS and SKAT are able to retain some statistical power which is not captured by Burden. This explains the superior performance of the two omnibus tests, SKAT-O and KS-Burden.

In addition to its benefits, the KS-Burden test also has a number of limitations. As shown, given more than one causal cluster the KS test loses statistical power. However, the number of causal clusters depends on the size of the genomic region as well as the underlying genetic architecture. Furthermore, the combined KS-Burden test is able to, at least partially, compensate for this shortcoming.

Other limitations of the KS-Burden test include its inability to use non-binary phenotypes and to make use of covariates as well as variant annotations. While researchers are able to select variants based on available biological information, and therefore indirectly include variant annotations in the test, the inability to include covariates is an important limitation. However, most sequencing-based studies contain relatively homogeneous samples due to potential differences in sequencing platforms and larger population differences in rare variants. Hence, despite these limitations, the KS-Burden test is a valuable alternative to currently used statistical approaches.

The KS-Burden test provides better statistical power, compared to most commonly used gene-based tests, given a single causal cluster. Furthermore, the test is able to maintain appropriate power in scenarios unfavorable to its underlying assumptions, making it a good alternative to current rare variant tests.