bioRxiv
New Results

Quantifying the unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects

James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Exome Aggregation Consortium, Shamil Sunyaev, Mark Daly, Daniel G MacArthur
doi: https://doi.org/10.1101/030841
James Zou
1Microsoft Research, One Memorial Drive, Cambridge MA, USA
Gregory Valiant
2Computer Science Department, Stanford University, Palo Alto CA, USA
Paul Valiant
3Computer Science Department, Brown University, Providence RI, USA
Konrad Karczewski
4Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston MA, USA
5Broad Institute of MIT and Harvard, Cambridge MA, USA
Siu On Chan
6Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, China.
Kaitlin Samocha
4Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston MA, USA
5Broad Institute of MIT and Harvard, Cambridge MA, USA
Monkol Lek
4Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston MA, USA
5Broad Institute of MIT and Harvard, Cambridge MA, USA
7Exome Aggregation Consortium (ExAC), Cambridge MA, USA
Shamil Sunyaev
5Broad Institute of MIT and Harvard, Cambridge MA, USA
8Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston MA, USA
Mark Daly
4Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston MA, USA
5Broad Institute of MIT and Harvard, Cambridge MA, USA
9Department of Medicine, Harvard Medical School, Boston MA, USA
Daniel G MacArthur
4Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston MA, USA
5Broad Institute of MIT and Harvard, Cambridge MA, USA
9Department of Medicine, Harvard Medical School, Boston MA, USA

Introduction

Recent efforts aggregating the genomes and exomes of tens of thousands of individuals have provided unprecedented insights into the landscape of rare human genetic variation [1,2] and generated critical resources for clinical and population genetics. The recently announced U.S. Precision Medicine Initiative raises the prospect of growing these databases to encompass hundreds of thousands of human genomes. In the context of these ambitious efforts, it is important to quantify the power of large sequencing projects to discover rare functional genetic variants [3]. In particular, we need to understand, as we sequence ever larger cohorts of individuals, how many new variants we can expect to identify and their expected allele frequencies. Accurate estimates of these quantities will enable better study design and quantitative evaluation of the potential and limitations of these datasets for precision medicine.

Results

Predicting the number of new variants we expect to identify in larger cohorts requires accurate estimates of allele frequencies of all the genetic variation in the human population, including the rare variants that have not been observed in the current sequencing cohorts [4-6]. The population frequencies of the unobserved rare variants determine the discovery rate of new variants as the cohort sizes increase. We developed a new method, UnseenEst, to estimate the frequency distribution of all variants using the observed site frequency spectrum (SFS) of the current cohort. The method is based on linear program estimators of the SFS [7], and our mathematical analysis shows that it enables accurate extrapolation of the SFS from current data to cohort sizes more than an order of magnitude larger (Supplementary Information).
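
To make the two quantities in this paragraph concrete, the following is a minimal sketch, not the UnseenEst linear program itself. It shows how an SFS can be tabulated from per-variant allele counts, and how a given frequency distribution translates into an expected number of distinct variants for a cohort of a given size. The function names, the toy inputs, and the assumption that alleles are sampled independently are ours and are not taken from the paper.

```python
from collections import Counter


def site_frequency_spectrum(allele_counts):
    """Map each allele count k >= 1 to the number of variants seen exactly k times."""
    return dict(Counter(k for k in allele_counts if k > 0))


def expected_distinct_variants(freq_histogram, n_alleles):
    """Expected number of distinct variants seen among n_alleles sampled alleles.

    freq_histogram maps a population allele frequency x to the number of
    variants with that frequency. Assuming independent sampling, a variant of
    frequency x is observed at least once with probability 1 - (1 - x)**n_alleles.
    """
    return sum(count * (1.0 - (1.0 - x) ** n_alleles)
               for x, count in freq_histogram.items())


# Toy, made-up inputs for illustration only (not ExAC data).
observed_counts = [1, 1, 1, 2, 2, 5, 0]
sfs = site_frequency_spectrum(observed_counts)

toy_histogram = {1e-5: 1000, 1e-3: 100, 0.1: 10}
small = expected_distinct_variants(toy_histogram, 2 * 1_000)    # 1K diploid individuals
large = expected_distinct_variants(toy_histogram, 2 * 100_000)  # 100K diploid individuals
```

In this toy setting the common variants are found almost immediately, so the growth from `small` to `large` comes almost entirely from the rarest frequency class, which is why the unobserved rare variants drive the discovery rate.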

Protein-coding variants represent the most readily interpretable and medically relevant slice of human genetic variation, and have been assessed in large sample sizes through the widespread application of exome sequencing approaches [2]. We leveraged data from the Exome Aggregation Consortium (ExAC) [8] to estimate the discovery rates of different classes of protein-coding variants in larger cohorts. We validated UnseenEst by training it on a random 10% of the alleles in ExAC and then using the estimated frequency distribution to predict the number of distinct variants that we can identify in the entire ExAC cohort. For every variant type (Supp. Figure 1) and every population (Supp. Figure 2), UnseenEst accurately predicted the number of unique variants that were identified in the entire ExAC cohort as well as the empirical SFS of ExAC (Supp. Table 1).
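
The hold-out design of this validation can be sketched as follows. This is an illustrative sketch under our own assumptions, not the authors' pipeline: it draws a random 10% of the alleles for each variant independently, uses made-up allele counts, and leaves out the estimator that would be trained on the subsample. All names are hypothetical.

```python
import random


def subsample_allele_count(k, total_alleles, n_sub, rng):
    """Allele count of one variant after keeping n_sub of total_alleles alleles at random.

    Alleles are labelled 0..total_alleles-1 and the first k are treated as
    carriers, so the result follows a hypergeometric distribution.
    """
    kept = rng.sample(range(total_alleles), n_sub)
    return sum(1 for idx in kept if idx < k)


def count_distinct(allele_counts):
    """Number of variants observed at least once."""
    return sum(1 for k in allele_counts if k > 0)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
total_alleles = 2000     # toy cohort of 1000 diploid individuals
full_counts = [1] * 300 + [2] * 100 + [20] * 30 + [500] * 5  # made-up counts

# Training data: allele counts as seen in a random 10% of the alleles.
sub_counts = [
    subsample_allele_count(k, total_alleles, total_alleles // 10, rng)
    for k in full_counts
]

observed_in_full = count_distinct(full_counts)  # the target an estimator should predict
observed_in_sub = count_distinct(sub_counts)    # what the 10% subsample reveals directly
```

An estimator trained only on `sub_counts` would then be scored on how closely its prediction for the full cohort matches `observed_in_full`.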

Supplementary Figure 1. Using 10% of the ExAC alleles to predict the number of unique variants in the entire ExAC cohort.

Each panel corresponds to one variant type. For each variant type, we applied UnseenEst on 10% of the ExAC alleles (5919 individuals) to predict the number of unique variants that we would expect to observe in a cohort of size less than or equal to ExAC (59198 individuals). The blue curves are the average predictions over the different 10% sub-samples and the blue shaded regions correspond to one standard deviation from the average. The red curves are the actual number of unique variants observed in ExAC. For all variant types, the predicted number of unique variants is in good agreement with the observed number of unique variants.

Supplementary Figure 2. Using 10% of the alleles in each ExAC population to predict the total number of observed variants.

For each of the ExAC populations, we trained UnseenEst on a random 10% of the alleles and applied it to predict the total number of unique variants in the entire population. The x-axis of each panel indicates the number of individuals of that population; the first mark (e.g. 5919 in (a)) indicates the size of the training set and the last mark (e.g. 59198 in (a)) is the total cohort size of that population in ExAC. The blue curves are the average predictions over the different 10% sub-samples and the blue shaded regions correspond to one standard deviation from the average. The red curves are the actual number of unique variants observed in ExAC. For all populations, the predicted number of unique variants is in good agreement with the observed number of unique variants.

Supplementary Table 1. Observed and predicted allele counts.

Blue rows are the number of ExAC variants with empirical allele counts in bins of 0-10, 10-100, 100-1000, and greater than 1000. Red rows are the predicted allele counts based on UnseenEst trained on 10% of the samples. The standard deviations are shown in the parentheses.

Figure 1. Predictions for the number of unique variants in 500K individuals.

We trained UnseenEst on the U.S. Census-matched ExAC cohort (“current”) and predicted the number of unique variants we expect to find in up to 500K individuals. The number of unique variants in the cohort was estimated for synonymous, missense and loss-of-function (LoF) variants in (a), and for CpGs, transitions and transversions in (b). The shaded regions correspond to one standard deviation around the estimates. (c) A gene is classified as LoF on a given allele if that allele contains at least one variant that introduces a stop codon, disrupts a splice donor/acceptor site, or disrupts the reading frame. Genes are partitioned into bins based on their LoF allele frequencies: less than 10⁻⁵, 10⁻⁵ to 10⁻⁴, 10⁻⁴ to 10⁻³, and greater than 10⁻³. The y-axis indicates the number of genes with LoF allele frequency belonging to each bin. Error bars correspond to one standard deviation. (d) Estimated number of genes with at least 10 and 20 LoF alleles.

From the full ExAC dataset, we generated a cohort of 33778 healthy individuals that matched the ancestral population breakdown of the 2010 U.S. Census (Supp. Table 2). We trained UnseenEst on this U.S. Census-matched cohort and predicted the frequency distributions of variants in the entire population (Supp. Figure 3). In particular, we estimated the number of distinct variants we expect to identify in cohorts of up to 500K individuals. These results provide a quantitative framework to evaluate the power and limitations of precision medicine initiatives in discovering rare coding variants.
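
One simple way to build an ancestry-matched cohort is to find the largest cohort whose composition follows the target fractions without using more individuals than are available in any group. The sketch below illustrates that idea only; the paper does not state the exact procedure used, and the group labels and numbers are placeholders rather than values from Supp. Table 2.

```python
import math


def matched_cohort_sizes(available, target_fraction):
    """Largest per-group sample sizes that follow target_fraction.

    available maps each ancestry group to the number of sequenced individuals;
    target_fraction maps each group to its share of the reference population.
    The limiting group is the one with the smallest available / fraction ratio.
    """
    total = min(
        available[g] / target_fraction[g]
        for g in target_fraction
        if target_fraction[g] > 0
    )
    return {
        g: min(available[g], math.floor(total * target_fraction[g]))
        for g in target_fraction
    }


# Hypothetical inputs for illustration only.
available = {"group_A": 6000, "group_B": 3000, "group_C": 1000}
target_fraction = {"group_A": 0.5, "group_B": 0.25, "group_C": 0.25}

sizes = matched_cohort_sizes(available, target_fraction)
cohort_size = sum(sizes.values())
```

Here `group_C` is the limiting group, so the matched cohort is much smaller than the full dataset. The same trade-off would explain why a matched cohort can be considerably smaller than the collection it is drawn from.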

Supplementary Figure 3. UnseenEst estimated allele frequencies.

UnseenEst was trained on the U.S. Census-matched ExAC cohort and the synonymous (a), missense (b), LoF (c) and CpG (d) allele frequencies were estimated for the U.S. population. The variants are grouped into bins based on allele frequency: less than 10⁻⁵, 10⁻⁵ to 10⁻⁴, 10⁻⁴ to 10⁻³, and greater than 10⁻³. The y-axes indicate the log10 number of variants in each bin. The error bars correspond to one standard deviation.

Supplementary Table 2. The number of individuals by ancestry.

The top row shows the number of individuals of each ancestry in the 2010 U.S. Census. The middle row shows the ancestry composition of the ExAC cohort. The bottom row shows the number of individuals of each ancestry in the ExAC cohort that was adjusted to match the 2010 U.S. Census.

We categorized the variants by their predicted functional consequence: synonymous, missense, and loss-of-function (LoF), which is defined as point substitutions that introduce stop codons or disrupt splice donor/acceptor sites (Figure 1a). The discovery rate of LoF variants is the lowest, reflecting the fact that LoFs are likely to be deleterious and hence tend to occur comparatively rarely in the healthy population. With 500K individuals, we expect to identify 400K distinct LoF variants, or 7.5% of all possible LoF point mutations in the human exome. In the same cohort, we expect to identify 3.4 million synonymous and 7.5 million missense variants, corresponding to 18% and 12% of possible synonymous and missense variants respectively. These estimates indicate that the discovery rates of rare LoF, missense and synonymous variants are far from saturation, even with 500K individuals. We note that slightly higher numbers of distinct synonymous and missense variants (Supp. Figure 4) would be discovered if the 500K individuals were instead sampled from the same ancestral composition as the current ExAC cohort, which contains higher fractions of South and East Asian individuals than the U.S. This indicates that the overall discovery rate of rare variants can be boosted by optimizing the population composition of the sequencing cohort.
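
As a back-of-the-envelope check, the counts and percentages quoted above imply approximate totals for the number of possible variants of each class. The short calculation below only rearranges the rounded figures from the text, so the results are rough and are not values reported by the paper.

```python
# (expected distinct variants at 500K individuals, fraction of all possible variants)
reported = {
    "LoF": (400_000, 0.075),
    "synonymous": (3_400_000, 0.18),
    "missense": (7_500_000, 0.12),
}

# Implied number of possible variants per class: found / fraction.
implied_possible = {
    name: found / fraction for name, (found, fraction) in reported.items()
}

for name, total in implied_possible.items():
    print(f"{name}: about {total / 1e6:.1f} million possible variants")
```

The implied totals are roughly 5.3 million possible LoF point mutations, 18.9 million possible synonymous variants and 62.5 million possible missense variants, which puts the "far from saturation" statement in absolute terms.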

Supplementary Figure 4. Predicted number of unique variants in cohorts of size up to 500K individuals with the same demographic distribution as the ExAC dataset.

The x-axis indicates the number of individuals in the cohort and the y-axis indicates the fraction of possible variants that we expect to observe in a cohort of that size. We trained UnseenEst on the full ExAC dataset and made the predictions for synonymous (grey), missense (orange) and loss-of-function (brown) variants.

We additionally classified the variants by their biochemical properties (Figure 1b). With the 34K individuals of the current cohort, we can already identify close to 50% of all possible variants at CpG sites (the most highly mutable substitution class), and the discovery rate for this class of variant quickly saturates as cohorts grow larger. Transversions, in contrast, are discovered much more slowly—attaining 7.6% of all possible transversions with 500K individuals—which is consistent with their much lower mutation rate. We further applied UnseenEst to quantify the number of distinct missense variants we expect to discover in specific gene families of interest, for example genes near GWAS hits and known drug target genes (Supp. Figure 5). Missense mutations in drug target genes are particularly suppressed, suggesting that these genes are more likely to be essential to humans.
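
The three substitution classes can be assigned from the reference base, the alternate base and the flanking bases. The sketch below is one possible rule, assuming that the CpG class means transitions at CpG dinucleotides (C>T followed by G, or G>A preceded by C on the reference strand) and that the classes are mutually exclusive. The text does not spell out the exact definition, so treat the rule and the function name as assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref, alt, prev_base, next_base):
    """Classify a single-nucleotide substitution given on the reference strand.

    Returns "CpG", "transition" or "transversion".
    """
    ref, alt = ref.upper(), alt.upper()
    prev_base, next_base = prev_base.upper(), next_base.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ")

    # Transitions stay within purines (A<->G) or within pyrimidines (C<->T).
    is_transition = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    if not is_transition:
        return "transversion"

    # CpG transition: C>T with a following G, or its reverse-strand
    # equivalent G>A with a preceding C.
    if (ref == "C" and alt == "T" and next_base == "G") or (
        ref == "G" and alt == "A" and prev_base == "C"
    ):
        return "CpG"
    return "transition"


examples = [
    substitution_class("C", "T", "A", "G"),  # C>T in a CpG context
    substitution_class("A", "G", "C", "G"),  # transition outside a CpG
    substitution_class("C", "A", "A", "G"),  # transversion
]
```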

Supplementary Figure 5. Predicted number of unique missense variants in gene families.

We trained the model on the cohort that matches U.S. demographics and predicted the fraction of possible missense variants in each gene family that we can expect to observe in cohorts of size up to 500K individuals. (a) Recessive genes (red) and dominant genes (blue). (b) All genes (red) and genes with cerebral specific expression (blue). (c) Genes associated with GWAS loci (red) and drug target genes (blue).

LoF variants likely disrupt the normal function of genes and by studying individuals carrying such variants, we can quantify the phenotypic consequence of disrupting particular genes. Therefore, a catalogue of the number of human alleles harboring candidate LoF variants for each gene is an important resource for drug development and disease diagnosis. We applied UnseenEst to estimate the LoF frequency of genes in the U.S. population (Figure 1c, Supp. Figure 6). About 2900 genes have LoF allele frequency lower than 10⁻⁵, consistent with strong intolerance to inactivation, whereas 1700 genes are expected to harbor LoF variants in at least 0.1% of the population. With 250K individuals, we expect to identify 14K genes that harbor LoFs in at least 10 individuals, substantially expanding the current catalog of 10K such genes in ExAC (Figure 1d, Supp. Figure 7). We estimate that the discovery rate of these genes with multiple LoF occurrences will saturate around 16K, providing an upper bound on the number of genes that can tolerate LoF variants on one allele.
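
Given per-gene LoF allele frequencies, the expected number of genes with at least 10 LoF alleles in a cohort follows from a binomial tail probability. The sketch below assumes that each of the 2n alleles in a cohort of n individuals independently carries a LoF variant in the gene with probability equal to the gene's LoF allele frequency. The frequencies are made up, the helper names are hypothetical, and this is not the UnseenEst procedure used to estimate those frequencies.

```python
import math


def prob_at_least(k, n_alleles, freq):
    """P(X >= k) for X ~ Binomial(n_alleles, freq), with terms computed in log space."""
    if freq <= 0.0:
        return 1.0 if k <= 0 else 0.0
    if freq >= 1.0:
        return 1.0 if k <= n_alleles else 0.0
    below = 0.0  # accumulates P(X < k)
    for i in range(min(k, n_alleles + 1)):
        log_term = (
            math.lgamma(n_alleles + 1)
            - math.lgamma(i + 1)
            - math.lgamma(n_alleles - i + 1)
            + i * math.log(freq)
            + (n_alleles - i) * math.log1p(-freq)
        )
        below += math.exp(log_term)
    return max(0.0, 1.0 - below)


def expected_genes_with_lof(gene_lof_freqs, n_individuals, min_alleles=10):
    """Expected number of genes with at least min_alleles LoF alleles in the cohort."""
    n_alleles = 2 * n_individuals  # diploid individuals
    return sum(prob_at_least(min_alleles, n_alleles, f) for f in gene_lof_freqs)


def frequency_bin(freq):
    """Assign a LoF allele frequency to one of the four bins used in Figure 1c."""
    if freq < 1e-5:
        return "<1e-5"
    if freq < 1e-4:
        return "1e-5 to 1e-4"
    if freq < 1e-3:
        return "1e-4 to 1e-3"
    return ">1e-3"


# Made-up per-gene LoF allele frequencies for illustration only.
toy_freqs = [1e-6, 5e-5, 2e-3]
at_34k = expected_genes_with_lof(toy_freqs, 34_000)
at_250k = expected_genes_with_lof(toy_freqs, 250_000)
```

In this toy example the gene with frequency 5e-5 is unlikely to reach 10 LoF alleles at 34K individuals but is very likely to at 250K, while the gene with frequency 1e-6 remains out of reach at both sizes, which mirrors the saturation behaviour described above.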

Supplementary Figure 6. Validation of the estimated number of genes with at least 10 LoF alleles.

We trained UnseenEst on random subsamples of 10% of the alleles in the U.S. Census-matched cohort and applied it to estimate the number of genes with at least 10 LoF alleles in the entire cohort. The red curve is the actual number of genes with at least 10 LoF alleles and the blue curve is the average prediction over the different subsamples. The shaded blue region corresponds to one standard deviation of the predictions.

Supplementary Figure 7. Discovery rate of LoF genes in non-Finnish Europeans.

Estimated number of genes with at least 10 LoF alleles in non-Finnish Europeans as a function of the sample size. The number of genes with at least 10 LoF alleles saturates around 16K genes, in agreement with the saturation level of LoF genes in the U.S. Census-matched population (Figure 1d).

Discussion

We describe a framework for estimating the power of sequencing cohorts to discover protein-coding variants. We apply it to the largest available collection of sequenced individuals to estimate the discovery power of much larger cohorts such as the ones proposed by the Precision Medicine Initiative. While our predictions here assumed that the samples are representative of the U.S. demography, UnseenEst can be directly applied to estimate the discovery rate of cohorts with different ancestral composition. Our results show that sequencing a cohort of 500K randomly selected U.S. individuals would provide access to over 12% of all possible missense variants and 7.5% of all possible LoF variants, thereby permitting exploration of a substantial fraction of human biological diversity.

References

  1. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
  2. MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–829 (2012).
  3. Collins, F. S. & Varmus, H. A new initiative on precision medicine. N. Engl. J. Med. 372, 793–5 (2015).
  4. Ionita-Laza, I., Lange, C. & Laird, N. M. Estimating the number of unseen variants in the human genome. Proc. Natl. Acad. Sci. U. S. A. 106, 5008–13 (2009).
  5. Gravel, S. Predicting discovery rates of genomic features. Genetics 197, 601–10 (2014).
  6. Henn, B. M., Botigué, L. R., Bustamante, C. D., Clark, A. G. & Gravel, S. Estimating the mutation load in human genomes. Nat. Rev. Genet. 16, 333–43 (2015).
  7. Valiant, P. & Valiant, G. Estimating the unseen: improved estimators for entropy and other properties. in Advances in Neural Information Processing Systems 2157–2165 (2013). at <http://papers.nips.cc/paper/5170-estimating-the-unseen-improved-estimators-for-entropy-and-other-properties>
  8. Exome Aggregation Consortium. Analysis of protein-coding genetic variation in 60,706 humans. (2015).

Posted November 07, 2015.