Abstract
Copy number variations (CNV) represent a significant proportion of the genetic differences between individuals and many CNVs associate causally with syndromic disease and clinical outcomes. Here, we characterize the landscape of copy number variation and their phenome-wide effects in a sample of 472,228 array-genotyped individuals from the UK Biobank. In addition to population-level selection effects against genic loci conferring high-mortality, we describe genetic burden from syndromic and previously uncharacterized CNV loci across nearly 2,000 quantitative and dichotomous traits, with separate analyses for common and rare classes of variation. Specifically, we highlight the effects of CNVs at two well-known syndromic loci 16p11.2 and 22q11.2, as well as novel associations at 9p23, in the context of acute coronary artery disease and high body mass index. Our data constitute a deeply contextualized portrait of population-wide burden of copy number variation, as well as a series of known and novel dosage-mediated genic associations across the medical phenome.
Introduction
Copy number variants (CNV) are a class of structural variation often defined as large deletions or duplications of at least 1 kilobase (kb) of genomic sequence. CNVs exhibit substantial variability in both size and frequency in the population and have been implicated across a variety of common and rare health outcomes1. Regional deletion and duplication syndromes have also been described at many loci, clustering near microsatellite repeats or areas of segmental duplication which may potentiate non-allelic homologous recombination2. For example, CNV-based architectures for neuropsychiatric (e.g. autism spectrum disorder), developmental (e.g. 16p11.2)3,4, and syndromic cardiac disease (e.g. 22q11.2)5 phenotypes have been well established.
Despite a growing body research on CNV-related syndromes and disease etiologies, large-scale studies of CNV effects have been limited by their rarity in the general population. However, burden testing methods which address this rarity by pooling observed variation across gene regions have yielded reproducible associations in the context of congenital heart disease and various neurocognitive outcomes6,7. Moreover, as studies which include either microarray or NGS-based genotype data have grown in size and scope, it has become possible to describe the distribution of CNVs at kilobase-level resolution in the general population8,9. Furthermore, the aggregation of richly annotated phenotype data in biobanks has diversified the set of phenotypes available for well-powered association studies, and allows for more precise characterization of syndromic CNV-associated disease10,11,12.
We here describe the landscape of copy number variation and their associations with 1,937 phenotypes in a cohort of 472,228 participants from the UK Biobank13. We replicate well-established syndromic effects of common CNVs — namely 22q11.2 deletion (DiGeorge) syndrome and two variants of 16p11.2 deletion syndrome — and highlight known and novel associations for body mass index (BMI), acute coronary artery disease (CAD), and related adipose and cardiovascular phenotypes. Summary statistics from traditional genome-wide associations for common CNVs, as well as from gene-level aggregate burden tests of rare variants across all phenotypes are available for download on the Global Biobank Engine14.
Results
Landscape of common and rare CNVs in a large volunteer cohort
To call copy number variants in UK Biobank, we apply PennCNV15 separately within each genotyping batch, resulting in 278,455 unique CNVs among 472,228 individuals after sample quality control. We also observe heavy-tailed distributions in size and allele count of CNVs, with average CNV length ~226kb and the majority of called variants singleton in the cohort (Figure 1a,b). This translates to notable burden of variation for nearly all individuals, with 439,464 (93.1%) of the individuals possessing at least one CNV detectable at kilobase-resolution (Figure 1c,1d). Among individuals with at least one CNV, we estimate an average burden of 5.5 variants covering >200kb of genomic sequence (median 3 variants affecting ~100kb, Figure 1c,d). While in-line with previous reports8, these estimates of individual-level burden are likely conservative, as regions where array markers are sparse or missing limit the accuracy of variant calling. Furthermore, we are unable to call smaller (<1kb) variants due to inconsistent marker density across all chromosomal regions on the Axiom and BiLEVE UK Biobank genotyping arrays. This limitation is visible in the histogram of called CNV lengths (Figure 1a); we call substantially fewer variants on the order of hundreds of base-pairs than on the order of thousands.
We also observe a highly non-uniform burden of variation across genomic position, with breakpoints most common near the ends of chromosomes, and at known regions of segmental duplication (Figure 1e). Among them are 1p36, 8q24.3, 9q34.3, and 19q13, all of which have associated microdeletion syndromes causing developmental delay with uniquely characteristic growth patterns16–19. Other CNV-hotspots like 6p21.33, which contains the major histocompatibility complex protein gene family, may be influenced by high marker density (in this case for HLA allelotyping) in addition to these biological features which underlie structural mutagenesis. However, these loci do not categorically correspond to areas where structural variation is commonly observed in the population (Figure S1). For example, 1p36 and 19q13 are also the respective sites of common CNVs overlapping RHD and FUT2 (Rhesus and Lewis blood groups), but there are no such common variants within the telomeric 16p13 cytoband.
Survivorship bias due to genetic selection against early-onset diseases
We estimate gene-level intolerance to structural variation by adapting a method for estimating regional selective constraint8. Relative to the general population, the volunteers within the UK Biobank are described to have a “healthy-cohort” enrollment bias20 and were enrolled between the ages of 40 to 69, which informs our findings. Within the tail of positive constraint z-scores, which indicate the strongest intolerance to structural variation, we observe enrichment for genes which cause early onset diseases, particularly cancer. Among the top fifteen constrained genes (Table 1) are BRCA1 and BRCA2, which are associated with early-onset breast cancer21,22; MLH1, MSH2, MSH6, which cause early onset colorectal cancer (Lynch syndrome)23–25; and ATM and APC, which are involved with mismatch repair cancers26,27.
Selections from the most highly constrained pathways from Gene Ontology Consortium28 resources (Table 2) also suggest strong intolerance to CNV for genes involved with core biological processes like protein binding, cellular structural integrity (keratinization), development (growth hormone receptor binding), and immune regulation (natural killer cell activation). Similar results at the gene- and pathway-level are observed for deletion-specific constraint (Table S1,S2), whereas duplication-specific analysis suggests autoimmune-related genes and pathways are most strongly intolerant to dosage effects (Table S1,S2). These results indicate strong selective effects occurring prior to enrollment in the UK Biobank during childhood and early adulthood against loss of function variation in core developmental, metabolic, and tumor-suppressing genes, and against dosage-altering variation in immune-related genes.
Association testing identifies CNVs at novel and syndromic loci
We compute genome-wide associations across 1,893 phenotypes for 8,274 common CNVs observed at 0.005% allele frequency (1 in 20,000) in our cohort, using regression as implemented in the analysis software PLINK29. We also perform L1-regularized regression for rare-variant burden tests, pooled by gene. For these tests, we measure the net effect of rare CNVs (AF < 0.1%) overlapping within 10kb of the gene region as defined by HGNC30 for 7,614 protein coding genes with at least 5 individuals affected by such a variant. A complete list of phenotypes analyzed is available in Table S3. Here, we describe representative results for one common disease and one quantitative measure with established genetic risk factors and large sample sizes in UK Biobank: acute coronary artery disease (CAD) and body mass index (BMI).
For Acute CAD, we identify one statistically significant (p < 6×10−6) association after Bonferroni correction for the common CNV GWAS: an intergenic deletion at chromosome 9p23. Intergenic variants at the 9p21 locus have been implicated in previous association studies of blood-based biomarkers relevant to cardiac outcomes, specifically, decreases in hematocrit and hemoglobin concentration31, as well as carotid plaque burden32. A recent meta-analysis33 using data from UK Biobank and CARDIoGRAMplusC4D identified a lead variant in the vicinity of this locus (rs2891168) associated with 6% unit increase in risk for similarly defined coronary artery disease. However, the CNV we here identify confers an estimated 12.4-fold increased risk (95%CI: 4.3-35.9, p=3.6×10−6) and is at least 2Mb distant from the nearest SNPs (rs10961206) at genome-wide significance near the 9p21/9p23 locus in the meta-analysis. This and the absence of linkage between the 9p23 CNVs and rs10961206 (r = 0.013) are suggestive of independent effects.
Gene-level burden testing of rare CNVs in individuals with CAD implicates LDLRAD3, a member of the low density lipoprotein (LDL) receptor family. CNVs called in this gene are predominantly deletions affecting the coding sequence — in aggregate (n=27), these variants confer an estimated 10-fold increase in risk of Acute CAD (95% CI: 3.9-25.6, p=1.4×10−6). Though the role of lipoprotein receptors in cholesterol metabolism is a well established mechanism of risk for cardiovascular disease, LDLRAD3 is not known to participate in cholesterol metabolism. It is, however, a receptor widely expressed throughout adult tissues which may participate in proteolysis in the central nervous system34,35. We therefore sought to replicate these findings using two-sample mendelian randomization36 on expression quantitative trait loci (eQTLs) from CAD summary statistics from a CARDIoGRAMplusC4D meta-analysis37. We identify a nominally significant protective effect between an eQTL increasing expression of LDLRAD3 and CAD (OR=0.85 [95%CI: 0.62-0.97], p=0.012), the direction of which is consistent with a dosage model of LDLRAD3-mediated risk for CAD.
We also find two significant associations for BMI, both deletions at chromosome 16p11.2, a locus implicated in syndromic early onset obesity and developmental delay. Each of these CNVs appears to correspond to a distinct form of 16p11.2 deletion syndrome. The smaller ~220kb deletion (β = 4.5 kg/m2 [95%CI: 2.7-6.3 kg/m2], p = 1.8×10−6, AC=35) has been associated with early onset obesity, and spans ATXN2L, TUFM, SH2B1, ATP2A1, RABEP2, CD19, NFATC2IP, SPNS1, and LAT, with SH2B1 the suspected causal obesity gene3. Obesity is also a phenotypic consequence of a larger ~593kb deletion (β = 7.8 kg/m2 [95%CI: 6.2-9.4 kg/m2], p = 5.0×10−23, AC=58), which is further associated with neurodevelopmental delay and related conditions4. However, this deletion spans a wholly distinct set of genes which are suspected to play complex dosage-dependent roles in the phenotypic consequences of the syndrome40. As both subtypes of 16p11.2 deletion syndrome may present in early childhood, it is noteworthy that the effect we measure on BMI is in a cohort comprised entirely of older individuals, indicating burden of adult disease associated with the CNV locus.
After controlling for multiple comparisons, burden testing for BMI identifies a group of genes at chromosome 22q11.2 and recapitulates the list of genes affected by each of the 16p11.2 deletions. Variation at the 22q11.2 locus also constitutes one of the first named microdeletion syndromes, DiGeorge syndrome, which has variable phenotypic consequences including craniofacial dysmorphisms and conotruncal congenital heart disease, along with increased risk for an adverse cardiovascular outcomes and neuropsychiatric disease later in life5. Among individuals affected with 22q11.2 deletion syndrome obesity is a recognized manifestation of disease42, and we estimate a 0.3-0.4 point increase in BMI for genic CNVs near 22q11.2. as well as 0.55-0.7 for genic CNV at 16p11.2. The presence of these associations in a large volunteer cohort offers further evidence that syndromic alleles may contribute to the risk of common diseases in the general population.
Phenome-wide associations for each of the CNVs at 16p11.2 further highlight changes in biomarkers, biomeasures, and increased risk of common disease, consistent with high BMI over the course of a lifetime (Figure 4). Genome-wide significant phenotypes for the 220kb CNV recapitulate the established syndromic effects from early onset obesity. We observe significant increases, on the order of one standard deviation, in weight, BMI, hip and waist circumference, reticulocyte count, and comparative body size at age 10 for these individuals. The larger 593kb CNV associates with similar measures of body size and fat, as well as hypertension, diabetes, and abdominal hernia. These results are also indicative of effects due to developmental delay; namely, decreased measures of memory, higher Townsend deprivation, and lower lung capacity (FEV, FVC) with higher associated risk of respiratory failure. Taken together these results highlight the variable expressivity of CNV-related disease, as well as its long-term effects across the medical phenome.
Discussion
In calling copy number variants and performing genetic association studies at scale from a large cohort of array-genotyped individuals with richly annotated phenotype data, we provide a portrait of the phenome-wide burden of genomic copy number variation. Our estimates of the individual-level burden of CNV and population-wide allele frequencies are consistent with previous reports, the deep phenotypic information available in the UK Biobank permits more finely tuned measures of the genic intolerance to CNV which include estimates of variation absent from our cohort of predominantly healthy, middle-aged individuals. We consider both rare and common CNVs in our association studies and identify effects of previously known (22q11.2, 16p11.2) syndromes and potentially novel (9p23) loci which have not yet been characterized or associated with disease.
Our study has significant limitations which inform our analysis. While arrays are an inexpensive way to genotype large cohorts, permitting straightforward algorithms to infer the presence of structural variation, the resulting CNV calls are limited by the density and placement of markers across chromosomes. For UK Biobank genotyping arrays in particular, there are large portions of genomic sequence with low marker density (in particular near centromeric regions) which bias our resulting genotype calls away from such regions. Array-derived CNV calls also suffer from limitations, in their inability to differentiate other structural events like inversions or translocations, or to determine breakpoint position at base-pair resolution. Complicating these barriers is the fact that our sample was genotyped on two distinct arrays, which may cause identical CNVs to present with different breakpoints across individuals in the call set. Our choice to present gene-level burden tests which include the vast majority of variants included in our CNV GWAS was informed by this realization.
Our associations are also heavily impacted by a known “healthy-cohort” bias, which results in null results for several phenotypes with known genetic contributions; notably, there are no hits in our burden tests for cancers other than leukemia and lymphomas. With this in mind, our constraint scores constitute a sobering observation of genetic survivorship bias. Estimates of gene-level intolerance to structural variation are derived from people who did not enroll in UK Biobank; the absence of individuals who were not healthy enough to participate or did not survive until age 40 constitutes a enrollment bias against severe early-onset disease. We take this opportunity to honor these non-participating individuals and their implicit contribution to our understanding of genetic disease. The observation of selection bias colors the interpretation of genetic findings from UK Biobank in general, as the cohort is relatively depleted of disease of early-onset morbidity and mortality and any genetic variation associated with these diseases will likewise be difficult to detect. While UK Biobank is unprecedented in size, scope, and scientific yield, our data illustrate that the anticipated findings from the proliferation of large biobank studies around the world will be influenced by implicit and explicit barriers to participation.
Despite selection against high-penetrance alleles causing early-onset disease, we detect a novel and strong association for coronary artery disease at LDLRAD3. While this locus has prior putative association with bone mineral density43, existing large-scale GWAS do not detect a strong association with coronary artery disease or established cardiometabolic risk-factors. However, the absence of gene level intolerance to truncating and missense variation in LDLRAD3 does not suggest a compelling rationale for important biological function or role in disease. In our study, CNVs at this locus are associated with some established cardiometabolic risk factors, such as diabetes onset, smoking status, and arterial stiffness, but not obesity or other fat-related phenotypes (Figure S6). Consistent with our findings that a decrease in LDLRAD3 dosage increases the risk of disease, a strong eQTL increasing LDLRAD3 expression decreases the risk of disease when used as an instrument in a two-sample mendelian randomization in a large-scale study of coronary artery disease. Thus, our findings of an mRNA dosage effect are replicated at the gene level. These results highlight the utility of analyzing genic CNV which, when directly impacting mRNA dosage, offer a more easily interpretable mechanism distinct from alterations of protein structure or small changes in transcriptional regulation.
The observation of variation at the 16p11.2 and 22q11.2 loci sheds further light on the penetrance of syndromic loci in the general population. The 16p11.2 recurrent microdeletion syndrome was first described in individuals with autism and neuropsychiatric disease and may include seizures, brain and other anatomic abnormalities. Even accounting for the enrollment bias inherent to the UK Biobank our PheWAS detects a modest relationship to neurocognitive measurements via secondary markers of intellectual differences such as prospective memory for one variant at this locus. People carrying variation at the 22q11.2 locus within the general population are known to be at increased risk of neuropsychiatric diseases44 for which variable phenotypic penetrance is well recognized345. To wit, individuals with genetic variation at both syndromic loci were by and large sufficiently healthy and capable of volunteering to participate in the Biobank. Our findings support a growing recognition that the penetrance and effect sizes of syndromic alleles will likely require revision in the context of broad population-based surveys of rare genetic variation46,47.
Our findings add to the growing body of literature measuring global burden of structural variation across healthy and diseased individuals. Our estimates of the effects of common CNV suggest a notable role of structural variation in population-wide burden of common disease, and suggest genomic loci where novel CNV-derived syndromic disease may exist. For rare variants, our data offer broad phenotypic characterization of the effects of gene-specific knockouts, which may inform development of pharmacological and genetic therapies. While the functional consequence and pathogenicity of missense, synonymous, and noncoding single nucleotide variation within a gene may be difficult to classify, the mechanism of most genic CNV are clear: a dosage effect upon mRNA transcription. This population-scale catalog of variation and the described associations with a multiplicity of diseases should be of immediate use by genetic clinicians in classification of novel and rare CNV detected in clinical testing. Full support for gene- and variant-level browsing is forthcoming in a future version of the Global Biobank Engine (biobankengine.stanford.edu). In the interim, summary statistics from association studies described here, as well as for all phenotypes present on the engine, are freely available for download on the site. We hope that these data will be leveraged to empower future analyses of the phenome-wide effects of structural variation and gene-level dosage effects.
Methods
CNVs were called using PennCNV v1.0.4 on raw signal intensity data from each genotyping array. Phenotype data was derived from data-fields collected for UK Biobank corresponding to various body measurements, biomarkers, disease diagnoses and medical procedures from medical records, as well as a questionnaire about lifestyle and medical history. Summary-level data from all statistical tests described here, as well as more thorough documentation on phenotyping, will be available on the Global Biobank Engine14 (biobankengine.stanford.edu) and can be found in related publications48.
CNV calling in UK Biobank
Methods for genetic data acquisition and quality control as performed by the UK Biobank have been previously described13. In brief, two similar arrays were used for targeted genotyping within the study population: the UK BiLEVE Axiom Array (n=49,950) by Affymetrix and the UK Biobank Axiom Array (n=438,427), which was custom-designed by Applied Biosystems. Samples and array markers were subject to threshold-based filtration and quality control prior to public release. Specifically, markers were tested for discordance across control replicates, departures from Hardy-Weinberg equilibrium, as well as effects due to batch, plate, array, and sex; affected markers were set as missing in affected batches or removed. Similarly, samples were tested for missingness (>5%) and heterozygosity across a set of high-quality markers, but samples identified as low quality (n=968) were not excluded. We also chose to include these samples in this analysis, considering that large structural variants may have been responsible for their poor quality with respect to metrics used for filtration.
We used PennCNV v1.0.415 to call CNVs within each of the 106 genotyping batches from UK Biobank. We first estimate genomic runs of heterozygosity (RoH) for each sample using a previously developed pipeline in PLINK49,50 using the --homozyg option. We then select n=100 samples with total RoH covering less than 20MB to train a hidden markov model (HMM) of copy state on each chromosome. HMM training was initialized with conditions optimized for Affymetrix arrays (affygw6.hmm), provided in PennCNV resources. We used the general calling mode, which performs likelihood-based testing for copy-number state (CN=0,1,2,3,4) at each input marker using its log-normalized signal intensity and allele balance in a given sample. We also apply adjustment for GC content across sites using waviness factor correction51. After CNV calling, we exclude 1,360 samples with over 30 called CNVs from downstream analysis, resulting in a cohort of 472,228 individuals with 278,455 unique variants.
Gene-level constraint estimation
Regional selective constraint to CNV was estimated for all protein-coding genes, with genic CNV defined as any variant overlapping within 10kb of the HGNC gene region. We estimate a null model of structural mutation empirically, and model burden of genic CNV as a linear function of gene size, fraction of genic sequence covered by regions of segmental duplication as extracted from the UCSC Genome Browser52,53. We also account for biased observations due to array genotyping by including the number of genic markers as a covariate. The formula for this null model can be written as:
From this model, we compute constraint z-scores for each gene using its negated standardized residual for each gene, winsorizing the negative tail at the lowest 5% of values. We also compute the probability of loss of function intolerance (pLI) as the non-normalized residual over the number of expected CNV, with negative values rounded to zero.
Genetic associations
Variant-level associations in UK Biobank were estimated with PLINK v2.00a (5 Jan 2018). We used the --glm firth-fallback option for computation. This option is a hybrid algorithm for logistic regression which defaults to a standard regression solver for computation, falling back to Firth’s regression (https://cran.r-project.org/web/packages/logistf/index.html) in cases where one of the cells of the 2×2 contingency table is zero, or where the traditional method fails to converge in a pre-specified number of iterations. These analyses were performed in a subset of 337,538 unrelated individuals of self-reported white British ancestry, and were controlled for age, sex, and 4 marker-based genomic principal components from the UK Biobank PCA calculation. To ensure adequate power for estimating genetic effects, we perform these tests on 8,274 CNVs observed at a frequency of 0.005% (1 in 20,000, or 18 individuals) in the whole sample of individuals with called CNVs.
Gene-level burden tests were conducted across all gene:phenotype pairs for genes with at least 5 individuals with overlapping CNV. Genic burden was encoded as a binary variable which indicates whether an individual has a CNV which contains any overlap within 10,000 base pairs of the HGNC gene region. CNVs which overlapped several gene regions were used for analysis in each gene. We treat deletions and duplications identically, with the assumption that any CNV which overlaps a gene in this fashion will disrupt its normal function. Effects of genic CNV burden were estimated by linear and regularized logistic regression with the python package Statsmodels, respectively computed using the .fit() and .fit_regularized() methods. We included the following as covariates in both models: age, sex, four marker-based genomic principal components from UK Biobank’s PCA calculation, and the number and combined length of CNVs in each individual.
Two-sample mendelian randomization was performed via the MR Base web app using GWAS summary statistics for LDLRAD3 expression QTLs from a CARDIoGRAMplusC4D meta-analysis37. We report Wald summary statistics from inverse-variance weighted Egger regression; these are the default analysis options for the web interface.
Acknowledgements
This research has been conducted using the UK Biobank Resource under application numbers 24983, 16698, 13721, and 15860. We thank all the participants in the study. The primary and processed data used to generate the analyses presented here are available in the UK Biobank access management system (https://amsportal.ukbiobank.ac.uk/) for application 24983, “Generating effective therapeutic hypotheses from genomic and hospital linkage data” (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf), and the results are displayed in the Global Biobank Engine (https://biobankengine.stanford.edu).
M.A.R. is supported by Stanford University and a National Institute of Health center for Multi- and Trans-ethnic Mapping of Mendelian and Complex Diseases grant (5U01 HG009080). This work was supported by National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) under awards R01HG010140. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.