Case-Base-Control designs

Most genome-wide association studies (GWASs) use randomly selected samples from the population (hereafter bases) as the control set. This approach works well when the trait of interest is rare; otherwise, a loss in the statistical power to detect disease-associated variants is expected. To address this, we introduce a design that combines the three sample types, cases, controls and bases, for situations in which the disease under study is prevalent. This is done by modelling the bases as a mixture of multinomial logistic functions of cases and controls, weighted according to the disease prevalence. The underlying parameters are estimated by maximum likelihood using the EM algorithm. Three classical tests of association, the score, Wald and likelihood ratio tests, are derived, and their power to detect genetic associations under different designs is compared. Simulations show that combining the three samples can increase the power to detect disease-associated variants, although a very large base sample set can compensate for the lack of controls.


Introduction
In a typical case-control study, which requires controls to be free of the disease of interest, selecting and genotyping a control set for each disease is expensive and time consuming. This has led many researchers to adopt expedient alternatives. For example, one feature that characterizes most large GWASs, e.g., the Wellcome Trust Case Control Consortium (WTCCC, http://www.wtccc.org.uk) [1], is the random selection of their control sets. In other words, controls are random samples from the population of interest with unknown disease status, rather than individuals who have been screened to ensure that they do not have the disease under study. The main advantage of this approach is that genome-wide genetic data for various reference sample sets are freely available.

Examples include the 1958 British Birth Cohort and the three national UK Blood Services collections, both of which have been used in the WTCCC. Although this approach allows large time and cost savings relative to sourcing and genotyping a bespoke control set, its results are reliable only if the disease of interest is rare; otherwise the degree of case contamination in the base set will be high. Examples of common conditions include depression and sub-clinical conditions such as mild acne. In such situations, experimenters may be faced with a choice between analysis using a small control sample set or a large base sample set.
In this paper we combine three data types, controls, bases and cases, in a single design called the case-base-control (CBC) design. We use "reverse regression" techniques, which treat the SNP's genotype as the outcome and the phenotype as the predictor in a linear model. Reversed approaches have previously been preferred over classical approaches for their capacity to incorporate both quantitative and categorical phenotypes simultaneously in the same model; see, for example, the MultiPhen package [2] and the SCOPA software [3], both of which target multiple-phenotype associations. Although "reverse regression" offers technical flexibility, its estimates, especially those of the odds ratios, are non-univocal, in the sense that they cannot be interpreted in terms of the effect of a SNP on a phenotype [3].
The "reverse regression" approach used herein is based on a multinomial logistic regression [4] that models genotypes given disease status (case or control). The respective probabilities are then mixed according to the disease prevalence to form the base set.
We investigate the effect on the statistical power to detect disease-associated variants of capturing information across all three data types in a joint analysis. It is worth noting that, during our work on the CBC model, a similar idea was published by McCarthy et al. [5] to increase power to detect HIV-associated variants, using cases, bases and a set of individuals demonstrating exceptional resistance to HIV infection (HIV-1 high-risk seronegative individuals, HRSN). Although the motivation to analyse the three groups simultaneously is shared between the work of McCarthy et al. and ours, the corresponding approach and the underlying statistical problem are different. Unlike our approach, McCarthy et al. prospectively modelled the probability of an individual being HRSN (i = 2), HIV positive (i = 1), or base (i = 0) conditional on genotype value g (reference allele count), using a logit model that treats the bases as the baseline category:

log[P(i = k | g) / P(i = 0 | g)] = α_k + β_k g,  k = 1, 2.

An unconstrained alternative hypothesis of the form H_1: at least one β ≠ 0 would also capture biologically implausible genetic effects across the groups. For example, if β_1 > 0 and β_2 > 0, having more copies of the mutant allele would increase both the probability that an individual is HIV+ (relative to bases) and the probability that an individual is HRSN (relative to bases). Such irrelevant patterns were excluded using a constrained alternative hypothesis that reflects only plausible risk or protective effects, namely H_1: β_1 β_2 ≤ 0 with at least one β ≠ 0. This formulation captures (1) a risk effect, when β_1 > 0 and β_2 < 0; in other words, having more copies of the mutant allele increases the probability that an individual is HIV+ (relative to bases) and decreases the probability that an individual is HRSN; and (2) a protective effect, when β_1 < 0 and β_2 > 0, i.e., having more copies of the mutant allele decreases the probability that an individual is HIV+ (relative to bases) and increases the probability that an individual is HRSN. Such constraints are not needed in our model, as it is based on reverse regression techniques that assume a multiplicative genetic trend.

General genetic model
The derivation proceeds from a multinomial logistic regression approach predicting an individual's genotype G conditional on trait status (affected or unaffected). We treat bases as a mixture of cases and controls according to the population prevalence K.
We eliminate the correlation between the parameter of interest and the other parameters by subtracting the mean from the independent variable. Indexing trait status by i (case: i = 1, control: i = 0) and genotype status by g (g = 0, 1, 2, the reference allele count), the multinomial logistic model is defined by the logistic function

P(G = g | i) = exp(θ_Gg + θ_g (i − m)) / Σ_{g'=0}^{2} exp(θ_Gg' + θ_g' (i − m)),

with corresponding logit (link) function

log[P(G = g | i) / P(G = 0 | i)] = θ_Gg + θ_g (i − m),  g = 1, 2,

where θ_G0 = 0, θ_0 = 0, and m is the mean value of i over all individuals in the dataset under consideration. θ_1 and θ_2 are the log odds ratios for genotypes G = 1 and G = 2, respectively, relative to the baseline genotype G = 0. Under the null hypothesis of interest here (when both θ_1 and θ_2 are zero), θ_G1 and θ_G2 are the log ratios of the frequencies of genotypes G = 1 and G = 2 relative to the baseline genotype G = 0.

Multiplicative trend model
A special case of the general model that is of particular interest is the multiplicative trend model, in which θ_2 = 2θ_1 [6]. The logit link function can then be written as

log[P(G = g | i) / P(G = 0 | i)] = θ_Gg + gθ(i − m),  g = 1, 2,

where θ is equivalent to θ_1 in the general model. Because of its widespread use in genetic epidemiology, we focus on this model for the remainder of this work.
The data for each SNP can be represented as a set of observed genotype counts stratified by sample set (table 1). The conditional probabilities of observing an individual with genotype G = g, given the sample set to which he or she belongs, are given in table 2.
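As a concrete illustration of the mixture underlying table 2, the base column can be computed directly from the case and control genotype distributions. The genotype frequencies below are hypothetical, chosen only to make the arithmetic visible; a minimal sketch in Python:

```python
import numpy as np

K = 0.2                                  # assumed disease prevalence
h_ctrl = np.array([0.49, 0.42, 0.09])    # hypothetical P(G = g | control), g = 0, 1, 2
h_case = np.array([0.36, 0.48, 0.16])    # hypothetical P(G = g | case)

# Table 2 treats the bases as a prevalence-weighted mixture of cases and controls:
h_base = K * h_case + (1 - K) * h_ctrl   # P(G = g | base), approximately (0.464, 0.432, 0.104)
```

Because each column is a probability distribution and the mixture weights sum to one, the base column automatically sums to one as well.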
In the next section, we derive the score test (ST) statistic for a univariate analysis of genome-wide SNP data.

Score test for association
The test is derived using the derivative of the likelihood with respect to the parameter of interest (here the log odds ratio), with the other parameters (here the intercepts) set to their estimates under the null. The ST is therefore computationally efficient, in the sense that it allows a large number of tests to be performed in a relatively short time compared with other tests. For example, Wald's test requires the standard error (SE) of the parameter of interest to be evaluated under the alternative hypothesis, and the likelihood ratio test requires estimation of parameters under both the null and the alternative hypotheses. Accordingly, in terms of computational speed, neither is as efficient for a genome-wide scan as the ST.
In the CBC design, the conditional counts are assumed to come from three independent multinomial distributions, one per sample set, i.e., (n_0i, n_1i, n_2i) for i = 0, 1, 2. Using tables 1 and 2, the log-likelihood function is

l(θ, θ_G1, θ_G2) = Σ_g Σ_i n_gi log h(g, i),    (5)

where h(g, i) denotes the conditional probability P(G = g | i) of table 2, with h(g, 2) = K h(g, 1) + (1 − K) h(g, 0) for the base set. Here we consider a hypothesis test of the multiplicative trend model of the form H_0: θ = 0 against H_1: θ ≠ 0, where θ is the log odds ratio parameter. We treat θ_G1 and θ_G2 as unknown parameters and K as a known prevalence. The score function is the derivative of the log-likelihood evaluated at θ = 0; for this we need the MLE of θ_Gg under H_0, which is given by

θ̂_Gg = log(n_g. / n_0.),  g = 1, 2.

The resulting score function is

U = Σ_g Σ_i n_gi x_i (g − ḡ) = (n_11 + 2n_21) + K(n_12 + 2n_22) − ḡ(n_.1 + K n_.2),

where x_0 = 0, x_1 = 1, x_2 = K and ḡ = (n_1. + 2n_2.)/n_.. . Next, we need the variance of the score function evaluated at θ = 0. Since the variance is linear in n_gi, we can directly replace the observed cell counts n_gi by their expected values under H_0 (n_.i n_g./n_..). This leads to

Var(U) = s²_g [Σ_i n_.i x_i² − (Σ_i n_.i x_i)²/n_..],  with s²_g = Σ_g n_g.(g − ḡ)²/n_.. .

Therefore, the ST statistic is

T = U²/Var(U),    (9)

which is asymptotically χ²_1 under H_0. It is known that the Cochran-Armitage (CA) test is the ST that corresponds to the logistic regression model [7,8]. Therefore, our ST from the CBC design can be seen as a generalization of the CA test statistic.
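The ST described above can be implemented in a few lines. The sketch below is ours, not code from the paper: it assumes the counts are arranged as a 3 × 3 genotype-by-sample-set array, uses the effective case values x = (0, 1, K), and computes the variance from the expected counts under H_0, as in the derivation above.

```python
import numpy as np

def cbc_score_test(n, K):
    """CBC trend score test statistic.

    n is a 3x3 array of genotype counts n[g, i], with rows g = 0, 1, 2
    (reference allele count) and columns i = 0 (controls), 1 (cases),
    2 (bases); K is the assumed disease prevalence.
    """
    n = np.asarray(n, dtype=float)
    N = n.sum()
    col = n.sum(axis=0)          # sample-set sizes n_.i
    row = n.sum(axis=1)          # genotype totals n_g.
    g = np.array([0.0, 1.0, 2.0])
    x = np.array([0.0, 1.0, K])  # effective case value per sample set
    gbar = (g * row).sum() / N
    # Score function at theta = 0
    U = (n * np.outer(g - gbar, x)).sum()
    # Variance of the score under H0, using expected counts n_.i n_g. / n_..
    s2g = (row * (g - gbar) ** 2).sum() / N
    varU = s2g * ((col * x ** 2).sum() - (col * x).sum() ** 2 / N)
    return U ** 2 / varU
```

With K = 0 (K = 1) the statistic reproduces the CA test obtained by merging the bases into the controls (cases), matching the special cases discussed below.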

Special cases
Here we show the special cases under which the ST from the CBC design is equivalent to the CA test statistic.
• When K = 0, the CBC ST statistic reduces to

T = [Σ_g g n_g1 − ḡ n_.1]² / [s²_g n_.1 (n_.0 + n_.2)/n_..],

where ḡ = (n_1. + 2n_2.)/n_.. and s²_g = Σ_g n_g.(g − ḡ)²/n_.. ; this is the CA statistic with n_.0 replaced by n_.0 + n_.2.
• When K = 1, the CBC ST statistic reduces to

T = [Σ_g g (n_g1 + n_g2) − ḡ (n_.1 + n_.2)]² / [s²_g (n_.1 + n_.2) n_.0 /n_..],

which is also the CA statistic, with n_.1 replaced by n_.1 + n_.2. The above shows how intertwined the prospective and retrospective CBC formulations are. In particular, these special cases of the CBC ST statistic, derived from a retrospective regression model, are equal to the CA statistic, which is derived from a prospective model (supplementary materials). Accordingly, our retrospective approach, which, as described earlier, was adopted merely for practical reasons, is in fact equivalent to the prospective approach, at least when K = 0 or 1. For 0 < K < 1, a discussion is given in the supplementary materials.

Score test as a contrast test
The CA test statistic can be seen as a two-sample test for the difference between two means: the average number of reference alleles in the genotypes of cases and that of controls. In contrast, it is not obvious which groups' means are being contrasted by the CBC's ST given in equation (9). Here, we use simple linear algebra to rewrite the ST in an interpretable way. Specifically, we show that the ST derived from the CBC design can be seen as a contrast test of the form

H_0 : Ḡ_1 + Ḡ_2 = Ḡ_0,1 + Ḡ_0,2,

where

Ḡ_1 = Ê(G | case) = (n_11 + 2n_21)/n_.1,    (14)
Ḡ_2 = Ê(G | base) = (n_12 + 2n_22)/n_.2,    (15)
Ḡ_0,1 = Ê(G | ctrl + case) = (n_10 + n_11 + 2(n_20 + n_21))/(n_.0 + n_.1),    (16)
Ḡ_0,2 = Ê(G | ctrl + base) = (n_10 + n_12 + 2(n_20 + n_22))/(n_.0 + n_.2).    (17)

The interpretation of this result goes beyond what the CBC likelihood equation shows.
The likelihood, which is the joint distribution of the counts across the three groups, shows only the counts information. The score statistic, on the other hand, tests whether there is a statistically significant difference between these counts. A typical genetic association test compares the mean allele count in two groups, cases and controls. However, since the CBC likelihood incorporates information from a third group, the base set, its score test is not a simple test of the difference between two means. Indeed, the demonstration above shows that the CBC score test contrasts four means: it compares the average of the allele means in cases and in bases with the average of the allele means in the merged controls-and-bases group and the merged controls-and-cases group (a proof is given in the supplementary materials).

ML estimation
In constructing the ST, we used constrained MLEs (i.e., estimates derived under the null only). Here, we use the ML method to estimate the underlying parameters in general, to facilitate the construction of other association tests such as Wald's and the likelihood ratio test. The derivatives of the log-likelihood function in equation (5) with respect to θ, θ_G1 and θ_G2 take the form

∂l/∂ψ = Σ_g Σ_i n_gi [1/h(g, i)] ∂h(g, i)/∂ψ,  ψ ∈ {θ, θ_G1, θ_G2}.

Equating the right-hand sides to zero and solving does not yield a closed-form analytical solution, so a numerical solution is needed. The EM algorithm is used here to overcome convergence problems that arise from singular Jacobian matrices.
Closed forms of the second derivatives to form the Fisher-information matrix are given in the supplementary note.
To illustrate the EM algorithm, we start by treating the counts as incomplete data. We think of the bases as a multinomial experiment with six categories: each base individual with genotype g is latently either a case or a control. The complete-data counts are (n_021, n_020, n_121, n_120, n_221, n_220), where n_g21 and n_g20 denote the latent case and control counts among bases with genotype g, so that n_g2 = n_g21 + n_g20 for g = 0, 1, 2, and the corresponding multinomial probabilities are (K h(0,1), (1 − K) h(0,0), K h(1,1), (1 − K) h(1,0), K h(2,1), (1 − K) h(2,0)). The log-likelihood function based on the complete data is

l_c = Σ_g Σ_{i=0,1} n_gi log h(g, i) + Σ_g [n_g21 log(K h(g, 1)) + n_g20 log((1 − K) h(g, 0))].

Since the n_g2i are unobservable, we cannot maximise this log-likelihood directly. This is dealt with by the expectation step, using the fact that

n_g21 | n_g21 + n_g20 ∼ Binomial(n_g2, K h(g, 1)/h(g, 2)),
n_g20 | n_g21 + n_g20 ∼ Binomial(n_g2, (1 − K) h(g, 0)/h(g, 2)).
Given an initial guess θ⁰ of the parameter vector, the E-step requires computation of the expected complete-data log-likelihood

Q(θ | θ⁰) = E[complete-data log-likelihood | observed counts; θ⁰],    (21)

obtained by replacing n_g21 and n_g20 by their conditional expectations n_g2 K h(g, 1)/h(g, 2) and n_g2 (1 − K) h(g, 0)/h(g, 2), evaluated at θ⁰. The M-step maximises (21) with respect to the parameters to be estimated; equating the derivatives to zero, a numerical solution is found using either Newton-Raphson or Fisher scoring. In implementing a practical EM algorithm, two important issues should be taken into consideration: the choice of initial values and the stopping criterion. In this paper, all the work is based on simulated data, so the parameter values used for data generation were also used as initial values in the estimation process. In a real data application, one would ideally use several initial values over multiple runs, together with a strict stopping criterion, to make sure that the algorithm has indeed converged to the true global maximum. The stopping criterion we have used is based on the norm of the score function, which should be zero when the likelihood attains its global maximum. The algorithm therefore stops iterating when the estimated norm of the score function becomes smaller than a pre-specified small constant ε; the smaller the ε, the more stringent the criterion.
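A compact numerical sketch of the E- and M-steps is given below (in Python rather than Mathematica). It is an illustration under simplifying assumptions, not the module from the supplementary note: the centring constant m is absorbed into the genotype intercepts (which leaves θ unchanged), the M-step uses a general-purpose optimiser instead of Newton-Raphson or Fisher scoring, and a fixed iteration count stands in for the score-norm stopping rule. All function names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def genotype_probs(theta, thG, x):
    """P(G = g | effective case value x) under the multiplicative trend
    model; the centring constant m is absorbed into the intercepts."""
    logits = np.array([0.0, thG[0] + theta * x, thG[1] + 2 * theta * x])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def em_fit(n, K, n_iter=100):
    """EM estimation of (theta, thG1, thG2) for the CBC model.
    n[g, i]: genotype counts with i = 0 (ctrl), 1 (case), 2 (base);
    K: known prevalence. A fixed iteration count is used for simplicity;
    the paper's stopping rule is the norm of the score function."""
    n = np.asarray(n, dtype=float)
    par = np.zeros(3)                               # (theta, thG1, thG2)
    for _ in range(n_iter):
        p0 = genotype_probs(par[0], par[1:], 0.0)   # controls
        p1 = genotype_probs(par[0], par[1:], 1.0)   # cases
        # E-step: expected split of each base count into latent case/control
        w = K * p1 / (K * p1 + (1 - K) * p0)        # P(latent case | G = g)
        n_case = n[:, 1] + n[:, 2] * w
        n_ctrl = n[:, 0] + n[:, 2] * (1 - w)
        # M-step: maximise the expected complete-data log-likelihood
        def neg_q(p):
            q0 = genotype_probs(p[0], p[1:], 0.0)
            q1 = genotype_probs(p[0], p[1:], 1.0)
            return -(n_ctrl @ np.log(q0) + n_case @ np.log(q1))
        par = minimize(neg_q, par, method="BFGS").x
    return par
```

Feeding the routine a table whose cells equal their expected values under chosen parameters recovers those parameters, which is a convenient sanity check.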
In the supplementary note, a Mathematica module for the estimation is provided to make the task practical. Since the prevalence K in our study is assumed to be known, it is instructive to investigate the effect of its misspecification on the asymptotic distribution of the test statistic as well as on statistical power. To this end, we varied the prevalence used in the ST statistic while keeping the tables simulated previously under a prevalence of 0.2. In table 3 we summarise the misspecification effect on the distributional behaviour of the simulated test statistics, in terms of the area between the probability-probability (PP) plot and the straight line y = x. It is clear that the area becomes smaller as we increase the sample size. Also, for medium n, the smallest area corresponds to the actual value of the prevalence. A similar investigation was carried out under the alternative hypothesis with θ = log(1.5).
The power of the test is defined as the probability of rejecting H_0 when it is false; in other words, it is the probability that the ST statistic exceeds the critical value of the χ²_1 distribution at significance level α = 0.05 under the alternative hypothesis. Accordingly, in our simulation, the power of the ST is the proportion of the test statistics simulated under H_1 that exceed the pre-specified value of 3.84. Figure 2 shows the power as a function of the misspecified prevalence K, using moderate samples. It is encouraging that misspecifying K by up to ∼0.2 away from the actual value does not lead to a considerable loss of power.

Table 3: Convergence of the ST statistic to its asymptotic distribution under H_0 and H_1, in terms of the area between the PP plot and the line y = x. The actual prevalence is 0.2. The area becomes smaller with increasing sample size; for medium n, the smallest area corresponds to the actual value of the prevalence.
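The power calculation described above is straightforward to reproduce. The sketch below is self-contained: it simulates genotype counts for the three sample sets under the multiplicative model, computes the trend-type score statistic for each replicate, and reports the proportion exceeding 3.84. The sample sizes, effect size and genotype frequencies are illustrative choices of ours, not those behind the paper's figures.

```python
import numpy as np

def trend_probs(theta, p_geno):
    """Genotype distribution tilted by a multiplicative (log-OR theta) trend."""
    w = p_geno * np.exp(theta * np.arange(3))
    return w / w.sum()

def score_stat(n, K):
    """CBC trend score statistic for a 3x3 count table n[g, i]
    (i = 0 ctrl, 1 case, 2 base), as derived in the text."""
    N, col, row = n.sum(), n.sum(axis=0), n.sum(axis=1)
    g, x = np.arange(3.0), np.array([0.0, 1.0, K])
    gbar = (g * row).sum() / N
    U = (n * np.outer(g - gbar, x)).sum()
    s2g = (row * (g - gbar) ** 2).sum() / N
    return U ** 2 / (s2g * ((col * x ** 2).sum() - (col * x).sum() ** 2 / N))

def empirical_power(theta, K, sizes=(300, 200, 500), reps=500, seed=1):
    """Proportion of simulated ST values exceeding the 5% chi2_1 cutoff (3.84)."""
    rng = np.random.default_rng(seed)
    p_ctrl = np.array([0.49, 0.42, 0.09])     # hypothetical control genotype freqs
    p_case = trend_probs(theta, p_ctrl)
    p_base = K * p_case + (1 - K) * p_ctrl    # bases: prevalence-weighted mixture
    hits = 0
    for _ in range(reps):
        n = np.column_stack([rng.multinomial(sizes[0], p_ctrl),
                             rng.multinomial(sizes[1], p_case),
                             rng.multinomial(sizes[2], p_base)])
        hits += score_stat(n.astype(float), K) > 3.84
    return hits / reps
```

Running `empirical_power(0.0, 0.2)` estimates the empirical size, which should sit near 0.05, while `empirical_power(np.log(1.5), 0.2)` estimates the power under H_1.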

Test comparison: LRT, WT and ST
So far we have focused on the ST, merely for practical reasons concerning genome-wide data. The two remaining tests, Wald's test (WT) and the likelihood ratio test (LRT), are known to be asymptotically equivalent to the ST, meaning that their performance in terms of power is approximately the same in large samples. In GWASs, sample sizes are limited, so it is instructive to compare their power under moderate sample sizes. To test the null hypothesis of no association, θ = 0, the WT statistic is given by W = [θ̂/SE(θ̂)]², where SE(θ̂) is the non-null SE obtained from the inverse of the Fisher information matrix. The LRT statistic, on the other hand, is given by

LRT = 2 [l(θ̂, θ̂_G1, θ̂_G2) − l(0, θ̃_G1, θ̃_G2)],

where hats and tildes denote MLEs under the alternative and the null respectively, and it requires parameter estimation under both the null and the general likelihood.
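Both statistics can be computed numerically from the observed-data likelihood. The sketch below is ours and makes two simplifications worth flagging: the centring constant is absorbed into the intercepts, and the SE is taken from the optimiser's approximate inverse Hessian, a crude stand-in for the inverse Fisher information used in the text.

```python
import numpy as np
from scipy.optimize import minimize

def probs(theta, thG, x):
    """P(G = g | effective case value x); centring absorbed into intercepts."""
    logits = np.array([0.0, thG[0] + theta * x, thG[1] + 2 * theta * x])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def nll(par, n, K, null=False):
    """Observed-data negative log-likelihood of the CBC trend model.
    Under the null, par holds only the two genotype intercepts."""
    theta = 0.0 if null else par[0]
    thG = par if null else par[1:]
    h0 = probs(theta, thG, 0.0)             # controls
    h1 = probs(theta, thG, 1.0)             # cases
    h2 = K * h1 + (1 - K) * h0              # bases: table 2 mixture
    return -(n * np.log(np.column_stack([h0, h1, h2]))).sum()

def wald_and_lrt(n, K):
    """Numerical WT and LRT statistics for a 3x3 count table n[g, i]."""
    alt = minimize(nll, np.zeros(3), args=(n, K), method="BFGS")
    nul = minimize(nll, np.zeros(2), args=(n, K, True), method="BFGS")
    lrt = 2.0 * (nul.fun - alt.fun)
    wald = (alt.x[0] / np.sqrt(alt.hess_inv[0, 0])) ** 2
    return wald, lrt
```

On a table built from its expected cell values under H_0, both statistics are numerically zero; under a genuine effect, both grow with the sample size, consistent with their asymptotic equivalence to the ST.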
Using the same simulation procedure under H_1 as before, the empirical power of the three tests is calculated, first under different values of a correctly specified prevalence and then assuming a misspecified prevalence. Figure 3 shows the resulting power curves.

Design comparison: CC, CB and CBC
Here we compare the ST under the three possible designs: CBC, CC and CB. Figure 4 shows that the empirical power of the score test under the CBC equals that under the CC around K = 0.4, which is the proportion of cases in the experiment (cases and controls only). In other words, if K ≠ n_.1/(n_.0 + n_.1), the CBC design is more powerful than the CC design; adding bases is beneficial only if there is an imbalance between the proportion of cases in the bases (i.e., the prevalence) and the proportion of cases in the experiment. The CB design loses power as the prevalence increases, which is expected because when the trait is common, the case contamination in the bases is high, and treating them as a set of controls can therefore reduce the power.
Above, we assumed that the CA test can incorporate either controls or bases, but not a combination of them, because in many cases only one of the two sets is available. However, when both sets exist, researchers would typically merge bases and controls into one set and analyse it together with the cases using the CA test (equivalently, the CBC assuming K = 0), which does not take into account possible misclassification, i.e., that some individuals in the merged set meet the criterion used to define the case set. It is therefore useful to see whether adding controls to bases alleviates the loss in power shown above for the CB design, and how this compares with the empirical power of the CBC score test. For this, we simulate tables from the CBC design, assuming different values of the prevalence, and test for association first using the score test from the CBC and then using the above naive approach, which treats the combined set of bases and controls as one control set.

The case of large base data set
The question of whether to collect controls in addition to bases has so far been investigated using a moderate sample size for the base set. Here, we tackle the case of a very large base set, e.g. n_.2 = 10000, analysed with a relatively smaller experimental sample of n_.0 + n_.1 = 2000.

Correctly specified prevalence
Using simulated tables based on an odds ratio of 1.2, figure 6 shows the power as a function of the proportion of cases in the experiment. When K = 0.8 (0.2), the power decreases (increases) as the proportion increases, which indicates that an optimal design can be obtained by using cases only when the prevalence is low, and controls only when the prevalence is high. To confirm this, we fix the number of cases (controls) at 2000 and plot the power as a function of the number of controls (cases) when K = 0.2 (0.8). From figure 7, it is clear that the change in power as we vary the number of controls (cases) when K = 0.2 (0.8) is trivial. In conclusion, an available large base sample set will compensate for missing cases or controls, depending on which of them is more prevalent.

Figure 5: The plot compares the empirical power of the ST from the CBC (assuming a correctly specified prevalence) with that of the CA test after grouping bases with controls. Medium n: n_.1 = 200, n_.2 = 500, n_.0 = 300, so both tests use a sample of size 1000. When K = 0, the two tests coincide, as shown in section 2.4. When K ≠ 0, the empirical power of the CA test with controls merged into the bases decreases as the prevalence increases. The improvement in power from using the CBC design becomes more evident as K exceeds 0.2.

Misspecified prevalence
Assuming only cases have been collected in addition to an available large base sample set, we investigate the same question of whether controls should be collected, under different misspecifications of the prevalence. Figure 8(a) shows that misspecifying the prevalence by up to 0.2 away from the actual value has a negligible effect on power. Next, we check how serious the misspecification must be to break down the test. Figure 8(b) shows the power curves when the actual value of K is 0.1.

Figure 8: The plots suggest that, when the base set is very large, an optimal design can be obtained by using cases (controls) only when the prevalence is low (high).
The green and blue curves show the power with 20% and 40% over-misspecification, respectively. It is clear that the optimal design can still be obtained by collecting cases only: when the base sample set is very large (n_.2 = 10000) and the expected prevalence is at most 0.5, collecting cases alone is optimal.

Conclusions
This work has presented a novel CBC method which, simulation studies suggest, would improve the power to identify associations. Application of the CBC to genome-wide data would be useful to demonstrate that the method is feasible and can indeed extract new information. One way to show the improvement in power that the CBC offers over other designs is to establish a pattern of smaller p-values across the genome under the CBC than under other designs. Besides the difficulty of finding studies large enough for such a genome-wide pattern of p-values to be observed, the current form of the CBC raises other practical limitations that make power assessment difficult, regardless of sample size. The main problems are (a) the difficulty of finding cases, controls and bases genotyped on the same chip, which is essential to avoid assay artefacts, i.e., detecting an association that is due to differences between chips rather than differences in disease status; and (b) the difficulty of finding studies in which correction for genetic heterogeneity across the three groups is not essential. This matters in genetic association studies because even subtle differences between cases and controls can result in a highly confounded analysis [10].
The current form of the CBC assumes that disease status is the only covariate in the model. Therefore, to allow a swift integration of data across studies, additional covariates representing the principal components should be added to the model to account for differences in genetic background (population stratification).