Extreme Sampling for Genetic Rare Variant Association Analysis of Dichotomous Traits with Focus on Infectious Disease Susceptibility

As genomic sequencing becomes more accurate and less costly, large cohorts and consortiums of cohorts are providing high power for rare variant association studies for many conditions. When large sample sizes are not attainable and the phenotype under study is continuous, an extreme phenotypes design can provide high statistical power with a small to moderate sample size. We extend the extreme phenotypes design to the dichotomous infectious disease outcome by sampling on extremes of the pathogenic exposure instead of sampling on extremes of phenotype. We use a likelihood ratio test (LRT) to test the significance of association between infection status and presence of susceptibility rare variants. More than 10 billion simulations are studied to assess the method. The method results in high sample enrichment for rare variants affecting susceptibility. Greater than 90% power to detect rare variant associations is attained in reasonable scenarios. The ordinary case-control design requires orders of magnitude more samples to achieve the same power. The Type I error rate of the LRT is accurate even for p-values < 10-7. We find that erroroneous exposure assessment can lead to power loss more severe than excluding the observations with errors. Nevertheless, careful sampling on exposure extremes can make a study feasible by providing adequate statistical power. Limitations of this method are not unique to this design, and the power is never less than that of the ordinary case-control design. The method applies without modification to other dichotomous outcomes that have strong association with a continuous covariate.


32
As high throughput genomic sequencing has become available on a large scale, virtual armies of researchers 33 and huge numbers of study participants have been contributing to the discovery of thousands of genetic 34 associations, largely in common, chronic diseases among populations of European ancestry (1) (2). This is 35 encouraging for progress in these areas of medicine and public health. It also has provided a wealth of 36 knowledge on the technology and spurred development of statistical methods to handle large sample numbers 37 from multiple ancestries. However, most diseases in the world remain untested for genomic associations, and 38 most of these will not have the personnel and funding to assess more than a few hundred or even a few dozen  shows histograms of these two sub-populations: two overlapping bell-shaped curves with different means and 62 far fewer individuals in the more extreme sub-population. Fig. 1B shows the distributions combined, before 63 genotype is known, conferring a longish right tail. In this example, random sampling from the population here 64 would select mostly individuals from the middle of the distribution where most of the weight lies, resulting in 1 65 out of 9 individuals being carriers. In contrast, sampling individuals from the extreme, say individuals with 66 LDL-C > 5mmol/L, increases that ratio by a factor of 80: 90% of those sampled from LDL-C > 5 will be carriers 67 (Fig. 1A). A variant that is uncommon in the population becomes frequent in this EP sample. When those with 68 LDL-C >5 mmol/L ("cases") are compared to individuals with LDL-C < 2mmol/L ("controls") in the opposite 69 extreme of the distribution, Fisher's exact test produces p < 1x10 -16 for association of LDL-C with LDL-R class 70 1 mutations using just 50 high LDL-C cases and 50 low LDL-C controls. The power exceeds 99% here with 71 100 total participants. We use the term "enrichment" for the increase in variant frequency in the extremes.

72
The trade-off for this enrichment is that the odds ratio (OR) from the hypothetical case-control study above is 73 biased relative to the population OR. However, that is of little concern when searching for associations that 74 are difficult to detect for variants that will inevitably undergo functional testing and estimation of prevalence in

92
Researchers noticed some hemophiliacs remained uninfected for HIV-1 even after multiple infusions with virus-93 contaminated blood products. These individuals constitute a resistant extreme. In this example, the exposure 94 to HIV-1 was well-documented and quantifiable. In a second example of high interest, the sera of a group of 95 12 children who were observed to be resistant to severe malaria were compared to that from 11 children who 96 were susceptible (from a total cohort of 784 children)(9). Intense quantitative analysis led to discovery of 97 antibodies that prevent the malaria parasite, Plasmodium falciparum, from reproducing in the hosts' blood,

98
providing a protective effect. While the latter is not strictly genomic, the design principle is completely parallel 99 with that for extreme exposure genetic association testing. Both examples illustrate the role of astute clinical 100 and field observation along with the value of extreme sampling. Note that the term "exposure" throughout this 101 paper refers to pathogen exposure and is not to be confused with genotype as the exposure in a genetic 102 association test. We use the terms "genotype" or "carrier/non-carrier" to denote genetic status. With these 103 examples in mind, we formally extend the extreme sampling for genetic association studies for susceptibility to 104 infectious diseases in general.

106
Overview Execution of extreme exposure sampling is simple in theory. Cases (infected individuals) with low exposure 108 are selected for study along with controls (uninfected) individuals with high exposure. Logistic regression is 109 then applied using infection status as the outcome and genetic score as the independent variable of interest 110 (the genetic score for each locus). We expect the cases to be enriched for deleterious/disease-causing 111 variants and expect the controls to be enriched for protective variants. The likelihood ratio test (LRT) for 112 logistic regression is used to test for association between disease and genetic score. We use simulations to 113 illustrate the principle of enrichment via sampling on pathogen exposure level and to assess the behavior of 114 the method. The "extremeness" of the exposure is measured by percentiles of the exposure distribution.

115
Simulated Data Generation

154
To illustrate a scenario in which very high power is simply not attainable, we use a population size of 50,000 in 155 a series of simulations with MAF=0005 and RR=05.

156
Results pathogen exposure and genotype; we have sampled cases with low exposure and controls with high exposure 159 and tested for association of outcome with genotype. One of the results of the sampling is that individuals with  190 detect RRs at 0.5 or less when sampling below the 0.001 percentile and above the 0.999 percentile. This is in 201 stark contrast to the >7000 needed for random cases and controls (Table 1; S2). The empirical OR in column 202 9 is the OR observed within the extreme sample of the simulation (the mean of 500 trials). This is another 203 measure of how successful the enrichment sampling is: for row one in Table 1, the empirical OR=0001, 204 meaning that a case is about 1/1000 as likely as a control to be a protective rare variant carrier in the extreme 205 sample, compared to 1/6 (RR=1/6) as likely in the general population. Some of the combinations in Table 1 206 are quite extreme, but a very reasonable scenario is one in which RR ~ 025, MAF0005  sampling. The empirical OR will be close to the population RR when there is no enrichment.
216 As expected, Fisher's exact test is less powerful than logistic regression with the likelihood ratio test; but, 218 unexpectedly, Firth regression provided no benefit over ordinary logistic regression and was consistently less 219 powerful the the LRT (S2 Table).

220
Test sizes from 0.05 to 510 -8 were evaluated by performing 10 10 LRT tests under the null and then tabulating 221 how many results showed p< for =005, =0005, …, =510 -8 . Comparing observed and expected, the 222 LRT test was slightly optimistic at higher p-values but was conservative for genome-wide significance levels 223 (Fig. 3). maximums at 14 and 30) and two different overall sample sizes (Fig. 4) for both logistic regression and

232
Fisher's exact test. The case:control ratio providing maximal power was stable over the range of scenarios.

233
The result was the same for RR = 0.04, an extreme value that should show an effect on the optimal 234 case:control ratio if there was one (results not shown).

240
In a previous study of genetic risk factors for HIV-1 infection, we noticed discrepancies in reports of exposure 241 to unprotected sex when reports were obtained from the different partners(10), along with outcomes that were 242 inconsistent with some recorded exposures. Sampling based on these erroneous exposure reports would put 243 non-extreme individuals in place of extreme individuals. To investigate the effect of this, we used 300 244 individuals per group with sampling at the 99 th percentile and then substituted random samples for extreme 245 samples in increasingly large numbers. Fig. 5A shows the effect on power is quite deleterious, with power 246 decreasing to  as the number of observations with error nears 50% of the sample size. In fact, power is 247 slightly better, unexpectedly, when the samples with error are discarded (Fig. 4B). is better than when including the randomly mismeasured samples.

286
High power is attained at the cost of careful sampling and searching in the population to find the extremes.

301
Focusing on a rarer subgroup of outcomes can also be helpful in achieving enrichment of the sample with RVs.

302
The infection under study might have distinct, rare manifestations that are of clinical relevance and that arise 303 only rarely in some infected individuals. Examples are neurological involvement in syphilis and West Nile 304 virus, and development of certain post-acute symptoms following COVID-19 infection.

305
In order to have faith in our power estimates, the test size should be correct. That is, when the null hypothesis 306 is true, the proportion of tests with p-value <  should be quite close to . Inflated test sizes (anti-conservative 307 results) are often seen with small samples and very low p-values. This is not a problem here: we have shown 308 through 10 billion simulations that the test size for logistic regression using the LRT p-value has correct size 309 with only 50 cases and 50 controls at p-values as low as 5x10 -8 . This is itself an important result. The scenario for which the population size is only 50,000 and the power decreases after 400 per group is not 311 just a toy example. Isolated populations with unique background genetics in combination with rarity of the 312 sought-after variants will lead to such a scenarios in reality. This problem is not limited to extreme sampling.

313
One should not adjust for pathogen exposure when using this method. The method shown here is equivalent 314 to performing logistic regression of disease status on H and then sampling the most extreme residuals as in level of 0001, whereas the work here informs us on genome-wide significance levels, a critical addition.

319
One should adjust for other confounders when they exist. In particular, adjusting for ancestry should be 320 done (24).

321
We found that the optimal case:control ratio was empirically 4:3 for different overall sample sizes, different 322 tests and different values of the OR (05 and 011). The predicted optimal ratio under random sampling is 323 1/OR 1/2 , but our results do not reflect that prediction. Staying with a 1:1 case:control ratio seems advisable and 324 provides the best strategy for finding both protective and deleterious variants.

325
We have purposely used a model with only one RV for simplicity. We chose to illustrate this method using a 326 protective variant, but by symmetry of the logistic model, the general methods and results also apply to 327 deleterious variants. Further, the method applies to other dichotomous outcomes that are strongly correlated 328 with a continuous covariate.

329
A limitation of this method is that quantification of the infective agent often is not available. However, it might 330 be possible to obtain quantified exposure measurements in situations where they are not routinely made. For 331 example, aerosolized and windblown Burkholderia pseudomallei, an environmental saprophyte in certain 332 tropical regions that causes the ID melioidosis, is correlated with melioidosis incidence(25). The melioidosis 333 incidence is too low even in the aerosol-exposed areas for this metric alone to be useful in identifying 334 particularly resistant individuals, but it can be useful in identifying highly susceptible individuals who could be 335 compared with exposed, uninfected family members using a paired test. We emphasize that controls in any