Population Stratification at the Phenotypic Variance level and Implication for the Analysis of Whole Genome Sequencing Data from Multiple Studies

In modern Whole Genome Sequencing (WGS) epidemiological studies, participant-level data from multiple studies are often pooled and results are obtained from a single analysis. We consider the impact of differential phenotype variances by study, which we term ‘variance stratification’. Unaccounted for, variance stratification can lead to both decreased statistical power, and increased false positives rates, depending on how allele frequencies, sample sizes, and phenotypic variances vary across the studies that are pooled. We describe a WGS-appropriate analysis approach, implemented in freely-available software, which allows study-specific variances and thereby improves performance in practice. We also illustrate the variance stratification problem, its solutions, and a corresponding diagnostic procedure in data from the Trans-Omics for Precision Medicine Whole Genome Sequencing Program (TOPMed), used in association tests for hemoglobin concentrations and BMI.


Summary
In modern Whole Genome Sequencing (WGS) epidemiological studies, participant-level data from multiple studies are often pooled and results are obtained from a single analysis. We consider the impact of differential phenotype variances by study, which we term 'variance stratification'. Unaccounted for, variance stratification can lead to both decreased statistical power, and increased false positives rates, depending on how allele frequencies, sample sizes, and phenotypic variances vary across the studies that are pooled. We describe a WGSappropriate analysis approach, implemented in freely-available software, which allows studyspecific variances and thereby improves performance in practice. We also illustrate the variance stratification problem, its solutions, and a corresponding diagnostic procedure in data from the Trans-Omics for Precision Medicine Whole Genome Sequencing Program (TOPMed), used in association tests for hemoglobin concentrations and BMI.
Large-scale association analyses using whole genome sequence (WGS) data on thousands of participants are now underway, through programs such as the NHLBI's TOPMed and NHGRI's Genome Sequencing Program. Unlike earlier Genome-Wide Association Studies (GWASs), where data were combined using meta-analyses of summary statistics, in WGS analyses participant-level data from multiple studies are often pooled, and results are obtained from a single analysis. Pooled analysis of WGS is useful, due to its computational tractability, ability to control for genetic relatedness across the pooled datasets, and its statistical testing procedures' improved performance when applied to rare variants. However, it is sensitive to a form of population stratification that is not well known. Population stratification in genetic association analysis [1] typically refers to situations where the mean phenotype value and allele frequency both differ across population subgroups. Unless appropriately accounted for in the analysis, e.g. by using regression-based adjustments for ancestry such as principal components or genetic relatedness matrices in linear mixed models, or their combination, it can lead to false-positive associations [2][3][4]. But population stratification can also manifest as differences in phenotype variances, across population subgroups, combined with differences in allele frequencies . In practice, this phenomenon is common in pooled analysis of multi-study data, as small differences in allele frequencies are prevalent, and different studies being pooled often have different measurement protocols, environmental exposures and inclusion criteria, all of which can lead to different phenotype variances among studies.
A standard tool for analysis of quantitative traits is linear or linear mixed model regression. In its widely-used default version, linear regression is fitted under the assumption that the phenotype's residual variance is the same for all individuals in the analyzed sample. Similarly, linear mixed models are usually fitted under the assumption that the variance components due to environmental effects and measurement error are equally-sized across all studies. The extent of the consequences if the variances are not equal-sized can be computed exactly given simplifying assumptions (see Online Methods). Broadly, using default approaches, if a specific subgroup has a larger phenotypic variance than that of other subgroups in the pooled analysis, the estimated precision of the association signal will understate the contribution from such a subgroup. The result is deflation (loss of power) for variants where allele frequency is greater in this subgroup compared to other subgroups, or inflation (too many false positives) for variants with lower allele frequency in this subgroup compared to others.
While default linear regression methods assume the same variance for all subgroup, which leads to mis-calibrated tests if the assumption does not hold, standard computational tools can be adapted to allow for a stratified variance model, yielding correctly calibrated tests. Specifically, by fitting different residual variances for each study, or more generally, appropriately-defined "analysis group" (e.g. all African Americans of a specific study) the problem can be alleviated. This can be viewed as fitting a different variance component for noise within each study, or as a weighted least squares approach, in which the group-dependent weights are estimated. This approach is implemented in some standard genetic analysis software packages (e.g. GENESIS [5]). Full details, including coding examples, are given in the Online Methods, but here we note that such software exists, and is fast enough for high-throughput analysis in large samples.
the TOPMed freeze 4 dataset. For both traits, we performed single-variant association analysis for all variants with minor allele count of at least 20. Detailed breakdown of the studies and populations used in these analyses are provided in the Online Methods, Tables 1 and 2. The analysis strategy for both traits was: first fit a mixed linear regression model, with fixed effects for sex, age (also age 2 for BMI), group defined by study and race/ethnicity, and allowing for genetic relatedness by including a variance component proportional to a Genetic Relationship Matrix (GRM [6]) computed on all available variants with minor allele frequency at least 0.001. Then we took the residuals generated by this model, rank-normalized them, and then re-fit the same model but with the rank-normalized residuals as the trait [7]. For both traits, we compared three analyses: first, a 'homogeneous variance' analysis that estimates a single residual variance parameter across all individuals; second, a 'stratified residual variance' model that allows a different residual variance parameter for each analysis group, and a 'completely stratified' approach which fits models and performs tests in each analysis groups separately, and then combines the results via inverse-variance fixed-effects meta-analysis. Analysis groups were either all individuals from a single study (e.g. Amish), or further defined by both study and race/ethnic group (e.g. European and African Americans from the Cleveland Family Study were separate analysis groups). For BMI, we removed 8 individuals from the 'completely stratified' analysis to ensure individuals were unrelated across groups, defined as less than third-degree relatedness.
In the online methods, we provide mathematical derivation and code for computing expected variance-specific inflation factors ߣ ௩ ௦ and for demonstrating the variance stratification problem, under simplifying assumptions where only residual variance is modeled. Using these, for each variant in the BMI and HGB analyses we computed an approximate variant-specific inflation factor ߣ ௩ ௦ . We investigated the inflation/deflation problems resulting from variance stratification, and verified that the patterns of inflation and deflation in the homogeneous variance analysis agree, across the different variants, with those obtained from the formula and code provided in the Online Methods. Figures 1 and 2  ), and across all variants, for HGB and BMI analyses respectively. The plots overlay the results from the three analyses methods together. While the homogeneous variance model clearly produces inflated and deflated QQ-plots in line with the theoretical expectation, when looking at all tested variants together, this inflation and deflation (i.e. Type I and Type II errors) mask each other, alarmingly. Despite appearances, these problems do not "cancel out"; one creates more Type I errors, one creates more Type II errors, yet the plot of all results may lead investigators to conclude that the analysis is well-calibrated. In contrast, the stratified residual variance model provides better control of Type I errors, as seen in the QQ-plots, with the exception of the bottom left panel in Fig. 2, which provides QQ-plots for the set of variants that are expected to have deflated test statistics under the pooled variance model: here the stratified residual variance model was also somewhat deflated. In linear regression, the stratified residual variance model allows every analysis group to have its own residual variance parameter. In the mixed model setting, where the variance is decomposed into genetic and residual variance, this model keeps the genetic variance component the same but allows for the residual variance to differ across groups. Analysis groups can be defined as study, race, and combinations of these. Our mathematical derivation and code for computing ߣ ௩ ௦ are under simplifying assumptions of no covariate effects and independent observations. Therefore, these make no distinction between genetic and residual variance components. While in the linear regression setting (independent observations) variance stratification model clearly suffices to account for variance heterogeneity, in the mixed model setting, a residual variance stratification model may not be optimal, because it does not account for stratification in the genetic variance as well, which could be the result of study design. For example, in Fig. 4, the estimated genetic variance component of the Cleveland Family Study is much higher than those of other studies, and of the residual variance component of the same study, perhaps because study participants were selected from families with Obstructive Sleep Apnea, which is highly associated with obesity. Heterogeneity in genetic variance is addressed in the 'completely stratified model', but such a model requires that individuals are independent between different groups (strata). A method that allows for complete stratification of analysis groups, while keeping genetically related individuals across these groups, was previously developed [8] and was shown to have good statistical properties. However, it was only introduced for the Wald test, and not the Score test, which is often used for rare variants association analyses, and it is currently computationally costlier than a pooled analysis because individual level data are used both at the individual analysis group computations, and when computing covariances between effect size estimates of all analysis groups. Computational efficiency is critical when testing the large number of variants observed in WGS studies. As sample sizes of TOPMed grow, pooling together more diverse studies and populations, variance stratification problems may be more severe. Models allowing for pooled analysis with both group-specific residual and genetic variances or robust variance estimates may be needed for better control of Type I errors and increased efficiency.

Competing Interest Statement
The authors declare no competing interests.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 5, 2020. ; https://doi.org/10.1101/2020.03.03.973420 doi: bioRxiv preprint Figure 1: QQ-plots comparing observed and expected p-values from the analysis of hemoglobin concentrations using three approaches: "homogeneous variance" model, that assumes that all groups in the analysis have the same variances; "stratified variance" model, that allows for different residual variances across analysis groups; and a "completely stratified" model in which analysis groups were analyzed separately, allowing for both heterogeneous residual and genetic variances across groups, and then combined together in meta-analysis. The QQ-plots are provided across sets of variants classified by their inflation/deflation patterns according to the algorithm for variant-specific approximate inflation factors. We categorized variants as "About right" when they had estimated between 0.97 to 1.03, "Deflated" when estimated lower than 0.97, and "Inflated" when they had estimated higher than 1.03. y . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

Figures
The copyright holder for this preprint this version posted March 5, 2020. ; Figure 2: QQ-plots comparing observed and expected p-values from the analysis of BMI using three approaches: "homogeneous variance" model, that assumes that all groups in the analysis have the same variances; "stratified variance" model, that allows for different residual variances across analysis groups; and a "completely stratified" model in which analysis groups were analyzed separately, allowing for both heterogeneous residual and genetic variances across groups, and then combined together in metaanalysis. The QQ-plots are provided across sets of variants classified by their inflation/deflation patterns according to the algorithm for variant-specific approximate inflation factors. We categorized variants as "About right" when they had estimated between 0.97 to 1.03, "Deflated" when estimated lower than 0.97, and "Inflated" when they had estimated higher than 1.03. e . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 5, 2020. ; https://doi.org/10.1101/2020.03.03.973420 doi: bioRxiv preprint . We categorized variants as "About right" when they had estimated between 0.97 to 1.03, "Deflated" when estimated lower than 0.97, and "Inflated" when they had estimated higher than 1.03. Genomic control inflation factors were computed as the ratio between the median test statistic across variants in the set to the theoretical median of the test statistic under the null hypothesis of no association. in . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 5, 2020. ; https://doi.org/10.1101/2020.03.03.973420 doi: bioRxiv preprint Figure 4: Estimated variance components corresponding to residual and genetic relatedness in the analyses of BMI and hemoglobin concentration. For each analysis group, the estimated variance components were computed based on the analysis of the group alone.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 5, 2020. ; https://doi.org/10.1101/2020.03.03.973420 doi: bioRxiv preprint

Mathematical derivation when observations are independent within study
For a total sample size of ݊ , we assume that the data follow a linear model denoted as is the trait or phenotype value of person ݅ , ݃ is their count of coded alleles (i.e. genotype), ߚ denotes the mean outcome in those with no copies of the coded allele, ߚ denotes the effect on the mean trait of each additional copy of the coded allele, and the ߳ are residual errors, which we for now assume are independent.
To provide intuition for the variance stratification problem, we first demonstrate a mathematical derivation under simplifying assumptions. We assume that the study individuals are independent and identically distributed (iid), that the genotype effect is null (ߚ ൌ 0 ሻ , and that the residuals follow a normal distribution ߳ . We further assume that the phenotypes are centered, the genotypes are centered, and follow a dominant mode of inheritance, i.e. we are using is the genotype under a dominant mode (having values 1 or 0), ‫‬ is its frequency, and ݃ is used in analysis.

The Wald test
The Wald test quantifies the significance of the data by dividing a regression-based estimate of ߚ by its corresponding estimated standard error. The linear regression estimate of the effect is (2) Denoting the residual variance of individual ݅ by ߪ ଶ (which may differ across individuals), the variance of ߚ መ is (3) When the variance of the residuals is homogeneous across all individuals, this is We see that the actual variance is a linear combination of the variance parameters ߪ ଵ ଶ , ߪ ଶ ଶ , and the weight assigned to each depends on the minor allele frequency and sample size in each group. Where the minor allele frequencies are equal the two forms are equal, as there is no association between genotype and outcome, and no confounding occurs. But where they are not equal then the variance of the estimator upweights the residual variance in the group where the variant is more common, which does not happen under homogeneity. This result straightforwardly generalizes for ‫ܯ‬ analysis groups. Next, we provide a more general algorithm for computing variant-specific inflation factors.

Computing approximate variant-specific inflation factors
We can also use mathematical derivations under homogeneity and heterogeneity, relaxing the restrictive assumptions provided earlier, to compute variant-specific inflation factors, using the standard Huber "sandwich" formula [9]. We now allow for additive genetic model, and do not assume that the genotypes are standardized. The variance estimator used by the Wald test is now provided as follows: but allowing for different variances per analysis groups, it becomes: Based on these two expressions, we propose an algorithm to compute an approximate variantspecific inflation factor. For computational purposes, we further simplify these arguments taking advantage of the fact that there are repeated rows (e.g. corresponding to people who have ݃ ൌ 1 are from the same analysis unit, i.e. have the same residual variance). The algorithm below uses the additional assumption that phenotype variance within each study does not vary with genotype -which must hold under the strong null hypothesis of no association in any subpopulation. It also uses the simplifying assumption that variants are in Hardy Weinberg Equilibrium (HWE) within each study population; testing HWE is a standard pre-processing step for genotype data.   Table 2: Analysis groups/strata participating in the analysis of hemoglobin concentration. For each group, we report parent study, race/ethnic group, number of participants, number and percentage of males, and age and hemoglobin's means and standard deviations.

Framingham Heart Study
The Framingham Heart Study (FHS) acknowledges the support of contracts NO1-HC-25195 and . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 5, 2020. ; HHSN268201500001I from the National Heart, Lung and Blood Institute and grant supplement R01 HL092577-06S1 for thisresearch. We also acknowledge the dedication of the FHS study participants without whom this researchwould not be possible. FHS participated in both the hemoglobin and BMI analysis.

Jackson Heart Study
JHS is a longitudinal community-based study designed to assess the causes of the high prevalence of cardiovascular disease among AAs in the Jackson, Mississippi metropolitan area [11]. During the baseline examination period (2000-2004) 5,306 self-identified AAs were recruited from urban and rural areas of the three counties (Hinds, Madison and Rankin) that comprise the Jackson, Mississippi metropolitan area. Participants were between 35 and 84 years old with the exception of a nested family cohort, where those ≥ 21 years old were eligible All participants included in analyses provided written informed consent for genetic studies. Approval was obtained from the institutional review board of the University of Mississippi Medical Center (UMMC). Data on participants' health behaviors, medical history, and medication use were collected at baseline and subjects underwent venipuncture, allowing for assessment of complete blood cell counts and other measures at UMMC (Beckman-Coulter, [12]). JHS participated in both the hemoglobin and BMI analysis.