Abstract
Accurate identification of genotypes is critical in identifying de novo mutations, linking mutations with disease, and determining mutation rates. Calling genotypes correctly from short-read data requires modeling read counts for each base. True heterozygotes may be affected by mapping reference bias and library preparation, leading to a distribution of reads that does not fit a 1:1 binomial distribution, and potentially resulting in failure to call the alternate allele. Homozygous sites can be affected by the alignment of paralogous genes and sequencing error, which could incorrectly suggest heterozygosity.
Previous work has modeled increased variance and skewed allele ratios to some degree. Here, we model reads for all data as a mixture of Dirichlet multinomial distributions. This model fits the data better than previously used models. In most cases we observed two distributions: one corresponding to a large proportion of heterozygous sites with low reference bias and a close-to-binomial distribution, and the other to a small proportion of sites with high bias and overdispersion. The sites with high reference bias have not previously been identified as SNPs in extensive human genome research; thus, we believe these sites are not heterozygous in the individuals studied here, and are falsely identified as heterozygous sites. We propose that this approach to modeling the distribution of NGS data provides a better fit to the data, which should lead to improved genotyping. Furthermore, the mixture of distributions may be used to suggest true and false positive de novo mutations. This approach provides an expected distribution of reads that can be incorporated into a model to estimate de novo mutations using reads across a pedigree.
Background
Identifying genotypes from next-generation sequencing (NGS) data is an important component of modern genomic analysis. Accurate genotyping is key to identifying sequence polymorphisms, detecting de novo mutations, linking genetic variants with disease, and determining mutation rates (Awadalla et al, 2010; Sayed et al, 2009). However, accurately identifying de novo mutations is a particular challenge, as true mutations are rare compared to errors in sequencing and downstream analyses.
Estimating genotypes from NGS data can be computationally and statistically complicated. A typical NGS experiment generates millions of short read fragments, 100 to 650 bp in length, that are aligned to a reference genome if available. At a heterozygous site in a diploid individual, reads are expected to sample each of the two alleles with equal probability. For this reason, variant calling software typically uses a binomial distribution to model base-counts (although see Ramu et al, 2013). However, there are at least three experimental processes that affect the ratio of the alleles. (1) During library preparation, variation in amplification rates can cause some chromosomes to be replicated more than others (Heinrich et al, 2012). This variation is especially a concern if there is little starting material. (2) NGS technologies introduce sequencing errors into sequencing reads. Error rates are on the order of 0.1–1% per base-call. While this may seem small, a 0.1% error rate is equivalent to sequencing the wrong human genome, and 1% is equivalent to sequencing a chimpanzee instead of a human (Fox et al, 2014; Wall et al, 2014). (3) Bioinformatic methods that assemble reads with respect to a reference can misplace reads and penalize non-reference alleles (Degner et al, 2009). Together these processes shift the mean and increase the variance of sequencing-read distributions. Thus, it is possible for both homozygotes and heterozygotes to have an intermediate ratio of two alleles, making identification of true heterozygotes particularly difficult (Malhis and Jones, 2010).
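As an illustration of this idealized binomial model, the following sketch computes per-genotype likelihoods for a single hypothetical site (the counts and error rate here are invented for illustration, not drawn from any dataset):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of observing k reference reads out of n total reads."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical site: 30 reads, 12 matching the reference base.
# Under the idealized model a true heterozygote produces reference
# reads with probability 0.5; homozygotes match (or mismatch) the
# reference except for a small per-base error rate e.
n, k, e = 30, 12, 0.01
lik_het     = binom_pmf(k, n, 0.5)    # heterozygote: 1:1 allele ratio
lik_hom_ref = binom_pmf(k, n, 1 - e)  # homozygous reference
lik_hom_alt = binom_pmf(k, n, e)      # homozygous alternate
```

With 12 of 30 reads matching the reference, the heterozygote is by far the most likely genotype; the processes listed above can push these counts toward intermediate ratios that no longer discriminate genotypes so cleanly.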
These processes do not affect all parts of the genome equally. The genomic context of a site, including the presence of nearby indels, structural variants, or low-complexity regions, influences the probability that reads generated from a given site will be subject to these processes (Malhis and Jones, 2010). The mismatch between observed and expected read distributions created by the processes described above contributes to observed false positive single nucleotide polymorphism (SNP) discovery rates of 3 to 12% (Harismendy et al, 2009). Because putative SNPs are typically validated using another sequencing technology, high false positive rates increase the effort required for validation.
Previous approaches for accurate genotyping
Modeling systematic bias and variation in data has provided some improvements in statistical discrimination of true and false positive heterozygotes. The increased variance and skewed allele ratios produced by mismapped reads can be partially controlled for by including mapping quality data in a genotype-calling procedure. In the simplest approaches, reads with low quality scores are removed from an analysis. In Bayesian approaches to genotype calling, read quality data are included when calculating genotype probabilities (Li et al, 2009b). The increased variance caused by library preparation, sequencing, and errors in mapping reads to a reference genome can be accommodated by modeling read-counts as coming from a beta-binomial distribution (Ramu et al, 2013). The beta-binomial distribution acts as an over-dispersed binomial, allowing the excess variance to be handled in a standard statistical framework. All genotype calling procedures can be combined with machine learning algorithms that attempt to differentiate between true variants and those caused by sequencing artifacts (DePristo et al, 2011). However, maximizing the true positive rate (i.e. maintaining high sensitivity) while minimizing the false positive rate (i.e. maintaining high specificity) remains a significant challenge (Greiner et al, 2000).
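The over-dispersion property of the beta-binomial can be seen numerically; in this sketch the shape parameters are arbitrary illustrative choices:

```python
from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(k, n, a, b):
    """Beta-binomial: a binomial whose success probability is Beta(a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))

n, a, b = 30, 5.0, 5.0  # mean allele ratio 0.5, but over-dispersed
pmf = [betabinom_pmf(k, n, a, b) for k in range(n + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
var_binomial = n * 0.5 * 0.5  # variance of Binomial(30, 0.5) = 7.5
```

The beta-binomial keeps the expected 1:1 allele ratio but inflates the variance, accommodating the excess spread seen in real read-count data.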
Our approach
In this study we introduce a new model for the distribution of reads produced from NGS, in which reads are assumed to come from a mixture of Dirichlet multinomial distributions. The Dirichlet multinomial distribution (DM) is the general case of the beta-binomial, allowing for overdispersion and modeling of more than two outcomes. By fitting a mixture of DMs (MDM) we improve on the beta-binomial models discussed above in two ways. First, we account for the context-dependent nature of genotyping errors by estimating multiple DM models for a given dataset, each with different parameter values and levels of overdispersion. Second, we can explicitly model the presence of bases that are neither the reference nor the likely alternative allele at a given site. This model allows us to directly estimate the probability of sequencing errors in a given DM model.
We first demonstrate the value of our approach by fitting MDMs to sequencing data derived from a haploid human cell line. The MDM produces a superior fit to this data compared to other methods, showing that even relatively simple genetic datasets can be the result of heterogeneous processes, and thus benefit from a mixed-model approach. We then fit MDMs to diploid data generated by the 1000 Genomes Project (1000 Genomes Project Consortium et al, 2010, 2015). For this data, the MDM also improves the fits compared to other models. One component of the MDM model contains most of the true heterozygotes, while the other component contains primarily sites that have not been identified as heterozygous in any previous human research. Therefore we believe this model may be utilized to detect false positive heterozygous sites, leading to a significant reduction in the number of sites requiring validation.
Results
Haploid Dataset
We examined two genomic regions from the CHM1 (haploid human cell line) dataset: all of chromosome 21 and part of chromosome 10. For each region we further split sites into two subsets: the full dataset (FD), in which reads were only filtered to exclude regions with unusually high coverage, and the reference dataset (RD), in which only sites with at least 80% of reads matching the reference base were included.
Best fit models for haploid data
We fit seven models to each genomic region in each dataset: a multinomial, a DM, and MDM models with two to six components. The addition of model components increased the likelihood of the model for all cases (Table 1). Using the Bayesian information criterion (BIC), the best fitting model for each dataset was the two-component MDM. In all cases a single component contains a substantial majority of sites (approximately 75% of sites for the reference dataset from chromosome 21 and 95% of the sites for other datasets). We will refer to the component to which the highest proportion of sites is assigned as the “major component” and all other components as “minor components”.
The overdispersion parameter, φ, describes the degree to which the expected variance of a given DM distribution is greater than that of a corresponding multinomial. φ can take values between 0 and 1, with 0 being identical to the multinomial and 1 being completely overdispersed. For the full dataset, the major component had relatively little overdispersion (φ = 0.00252 and 0.00415 for chromosome 21 and chromosome 10 respectively). The minor component displayed strong overdispersion (φ = 0.892 and 0.948). For the reference dataset, there was also little overdispersion for the major component (φ = 0 for chromosome 21 and 0.00269 for chromosome 10). After removing sites with high proportions of reads in the error categories, the minor component was slightly overdispersed (φ = 0.0153 and 0.0475) (Table 1 and supplementary tables).
Visualizing model fitting for the haploid data
We examined the fit of the data to each model using quantile-quantile (QQ) plots, in which the quantiles of the observed read counts are plotted against the quantiles of the estimated read counts. For the two-component model applied to the reference dataset, the reference and error counts fit the expected values (Figure 1 and supplementary figures).
Diploid dataset
We examined the same two genomic regions for NA12878, the daughter of the CEU trio. In order to investigate the impact of sequencing technology on parameter estimates from our model we repeated our analysis for each of the four released datasets (1000 Genomes Project Consortium et al, 2010, 2015). As potential heterozygotes present the greatest challenge to variant calling, we focused on these sites. Specifically, we identified potential heterozygous sites using the SAMtools heterozygote caller on NA12878 alone and using the trio caller. Sites were only included in the potential heterozygote (PH) dataset if they were called by both methods. We filtered the PH dataset to include only sites identified as SNPs by the 1000 Genomes Project. We considered these sites to be true heterozygous sites (TH dataset). The number of sites and the proportion of true heterozygous sites are summarized in Table 2.
Best fitting models
We fit eight models to each of the 16 CEU datasets (two genomic regions, four release years, PH/TH): a multinomial, multinomial with reference bias, a DM, and MDMs with two to six components. The addition of model components increased the likelihood of the model for all cases (Table 3).
For the TH dataset, the best-fitting model as selected by BIC had two or three components, and the best AIC had three or four components depending on the run year. The majority of the sites (88–99%) were assigned to one component in the model (Table 3 and supplementary tables). The major component of the model for each dataset (the component with the highest proportion of sites) had little overdispersion (φ = 0 to 0.00055). In addition, the major component also had an approximately equal proportion of reference and alternative alleles (49% to 51%), and a relatively small error term (< 0.1%) for the 2011, 2012, and 2013 datasets. Thus, the majority of sites fall into a component that is approximately a binomial distribution. The 2010 dataset has a slightly larger error term (0.2% and 0.3%), and the reference and alternate terms are 53% and 46% respectively. For the datasets with a two-component model, the minor component is similar to the major component but with greater overdispersion (φ = 0.06 − 0.1). For models with three components, the minor components had an elevated proportion of one of the reference, alternate, or error terms, and greater overdispersion. For instance, CEU2013 chromosome 21 has φ = 0.0656 and πerror = 0.278 for the third component.
For the PH dataset, the best fitting model had three to six components (Table 3). The major component of the model contains between 71% and 95% of sites for all years. As with the TH dataset, the major components all had little overdispersion (φ = 0 to 0.00135). Even when models with greater than four components were favored by BIC, the additional components contain a very small proportion of the data (< 1%), and frequently produce estimates of sequencing error very close to zero. Thus, a model with more than three components is likely overfitting the data.
Visualizing model fit
We examined the fit of the data to each model using quantile-quantile (QQ) plots. When we examined the QQ plot for the MDM model with the lowest BIC, all three terms (reference allele, alternative allele, and error term) fit closely to the expected values (Figure 2).
Assignment of sites to model components
We assigned each site from each of the eight PH datasets to a component in the MDM model based on the site likelihood. The minor or combined minor components were always enriched for false positive heterozygous sites; these false positive sites made up between 10% and 51% of the sites in these components. The major component contained only 3–10% of the false positives (Table 3 and supplementary table).
We constructed a receiver operating characteristic (ROC) curve to examine the performance of the MDM model as a classifier of heterozygotes (Figure 3 and supplementary figures). The performance of this classifier on a given dataset can be summarized by the area under the ROC curve (AUC). AUC was between 0.634 and 0.81 (Table 4).
Based on the ROC curve, we selected a threshold value, the probability cutoff for assigning a site to the major component, with sensitivity close to 1 and specificity near 50%. Thus, we can use our model to filter out half of the false positive heterozygous sites without losing true heterozygotes.
Classification of sites as copy number variants
We tested the hypothesis that CNVs produce false positive heterozygous sites (Li, 2014). The proportion of the false positive heterozygous sites belonging to CNV regions is between 8% and 13% for chromosome 21, but < 2% for chromosome 10 (Table 5).
Discussion
We have developed a novel statistical approach to model the distribution of NGS reads. Using an MDM produces a better fit to haploid human cell line data than previous approaches (i.e. the multinomial and DM; Ramu et al, 2013). This result demonstrates that NGS datasets from relatively simple biological samples (i.e. no true heterozygotes and a high quality reference genome) can benefit from the approach we describe here.
Similarly, our MDM model provides a better fit to more complex data, including potentially heterozygous sites in data arising from the 1000 Genomes Project. Our goal in developing this model was to improve the accuracy of genotype calling, and reduce the number of false-positive variant calls produced from NGS data.
Best fitting models
Minor components of our MDMs tended to display bias toward the reference or alternative allele, higher overdispersion, or elevated sequencing error. These results suggest that most sites in an NGS experiment match the idealized expectation of binomially distributed base-counts. On the other hand, a substantial minority of sites appear to be generated by processes that differ from this expectation. Moreover, these minor components are greatly enriched for apparently false positive heterozygous sites.
Our results are similar to that of Muralidharan et al (2012), who observed a high proportion of SNPs with low error rates and a low proportion of SNPs with high error rates. This result was attributed to high alignment error in repetitive regions. We now provide a way of using this approach to distinguish these two types of sites.
Assignment of sites to model components
The MDM classifier shows promise in discriminating true and false positive heterozygotes, as illustrated by our ROC curves. The two exceptions, CEU2010 chromosome 10 and CEU2013 chromosome 10, may be due to an extremely low proportion of false positives in the dataset: with only 5% false positive sites, it is a challenging task for the classification algorithm to identify these sites (Table 6). Thus, the modeling approach described here can be used to remove sites that have been called as heterozygous but are likely to be false positive calls, by selecting a probability cutoff for assigning sites to the major component and filtering out sites belonging to the minor components. Removing such sites reduces the cost and time required for validation.
Copy number variants
It is possible that there is a weak correlation between false positive heterozygous sites and copy number variation on chromosome 21. We expected this correlation to be stronger across all regions. This suggests that factors other than copy number variation also produce false positive heterozygous sites. Species with greater numbers of duplicated regions than humans may have greater numbers of sites incorrectly identified as heterozygous, potentially affecting the identification of de novo mutations.
Conclusion
Our modeling approach is designed to accommodate the correlated and context-specific nature of errors introduced in generating NGS datasets. The datasets we analyzed were produced using a variety of different library preparations and sequencing technologies. These differences are partly reflected in the different parameter values we estimate from our model. In particular, the 2010 dataset appears to have a quite different profile than those from other years: the major component of this model has higher overdispersion and stronger reference-bias than that of any other dataset.
Previous work has suggested that there is reference bias in mapping, and overdispersion due to biological factors (Meyer and Liu, 2014). We observed very little reference bias and overdispersion for true heterozygotes. However, false positive heterozygote calls fall into a distribution with reference bias and overdispersion. By using an MDM model we believe we are able to separate true heterozygotes from false positive calls, which can significantly reduce the time and expense of subsequent validation work. In the future this modeling approach will be incorporated into a pedigree-based approach for accurate genotype calling (Cartwright et al, 2012).
Methods
Data
We extracted datasets from two types of data. The first dataset is the haploid human sequence from a hydatidiform mole cell line (CHM1hTERT SRR1283824 from SRP017546). We refer to this dataset as the CHM1 dataset in this paper.
Second, we obtained sequences from the 1000 Genomes Project for three individuals, a woman (NA12878) and both of her parents (NA12891 and NA12892). Sequencing was repeated for these individuals in different years using different technologies (2010, 2011, 2012, 2013). The 2010 dataset was generated during pilot 2 studies; the 2013 PCR-free dataset was part of the phase 3 release; the 2011 and 2012 datasets were two unofficial release datasets, aligned with a decoy genome that captures reads that failed to align to the standard reference genome (1000 Genomes Project Consortium et al, 2010, 2015).
We refer to this dataset as the CEU dataset. If the release year is appended, for example CEU2013, then we refer to the specific release in 2013. For each of these five datasets (CHM1 and each of the four releases of CEU), we analyzed two genomic regions, the whole chromosome 21 and a subregion of chromosome 10, from positions 85534747 to 135534747, which is approximately the same size as chromosome 21 (48 million base pairs).
For CHM1 we probabilistically called genotypes by first obtaining allele counts for each base at each site using the mpileup function in SAMtools v1.2 (Li et al, 2009a; Li, 2011) and the human reference genome (Genome Reference Consortium human genome build 37). We then used BCFtools v1.2 (Li et al, 2009a; Li, 2011) to identify potential heterozygous sites. For each of these sites we calculated the frequency of the reference allele and the frequency of all non-reference alleles (error). We filtered this dataset based on the read depth for each site: we removed sites with read counts of less than 10 or greater than 150. Sites with high numbers of reads are likely in copy number variable genes that have aligned to a single region of the genome. Apparent heterozygotes at such sites are more likely to be due to paralogs than to variation within a gene. The low read filter limits the data to calls with enough coverage to provide a reasonably accurate call and proportion of reads for each base. We refer to this dataset as the full dataset (FD). Additionally, we removed sites for which less than 20% of the reads contained the reference allele. We refer to this dataset as the reference dataset (RD).
For the CEU data, we obtained allele counts as above for all three individuals. We then called genotypes as above on NA12878 (the daughter of the trio) and by using the BCFtools trio caller with the data from all three individuals. We limited the dataset we used for subsequent analyses to sites that were found by both methods. Sites found only by the trio caller, and not the individual caller, were likely identified through the pedigree despite limited data for the daughter; thus, these low-coverage sites were not included in subsequent analyses. We removed sites with read counts of less than 10 or greater than 150, as for CHM1. We call this the Potential Heterozygote (PH) dataset. For each of these sites we calculated the frequency of the reference allele, the frequency of the alternate allele, and the frequency of any other alleles (error). We compared the frequencies of each allele category (reference, alternate, error) for each possible genotype combination. Because we found no differences in frequencies for different genotypes, all subsequent analyses were only performed on the general reference-alternate-error dataset.
We created an additional dataset by removing sites from the PH dataset not found to be heterozygous by the 1000 Genomes Project (1000 Genomes Project Consortium et al, 2010, 2015). We then discarded sites for which the alternate allele differed from the one previously identified by the 1000 Genomes Project. We call this the True Heterozygote dataset (TH). These datasets allow us to build a model that distinguishes sites found in the PH dataset but not in the TH dataset, which are likely false positive heterozygote calls.
Because the CHM1 dataset was larger than the CEU dataset, we randomly subsampled the CHM1 dataset to have an approximately equal number of sites (40,000 sites) as the CEU dataset.
Model fitting and parameter estimation
We fit seven models to each CHM1 dataset, and eight to each CEU dataset. The models included a multinomial, a multinomial with reference bias (CEU only), Dirichlet multinomial (DM) and mixtures of DM (MDM) distributions with various number of components, ranging from two to six. We estimated the parameters and calculated the genotype likelihood for each model.
The genotype likelihood measures the likelihood of a sample’s genotype, G, given a set of base-calls, R, and is proportional to the probability of observing R if the genotype was G, i.e. L(G|R) ∝ P(R|G). We derived genotype likelihoods using MDM distributions. The Dirichlet multinomial distribution is a compound distribution generated when a Dirichlet distribution is used as a prior for the probabilities of success of a multinomial distribution: p ~ Dirichlet(α) and x ~ Multinomial(N, p), where α is a vector of concentration parameters, p is a vector of proportions, x is a vector of counts, and N is the sample size. After integrating out p, the resulting probability mass function can be expressed as a product of ratios of gamma functions:

$$P(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{N!}{\prod_i x_i!} \, \frac{\Gamma(A)}{\Gamma(N + A)} \prod_i \frac{\Gamma(x_i + \alpha_i)}{\Gamma(\alpha_i)}$$

where A = Σᵢ αᵢ and αᵢ ≥ 0. Furthermore,

$$E(x_i) = N\pi_i \quad\text{and}\quad \mathrm{Var}(x_i) = N\pi_i(1 - \pi_i)\,\frac{N + A}{1 + A}$$

where πᵢ = αᵢ/A.
It is helpful to reparameterize the distribution by letting πᵢ = αᵢ/A and φ = 1/(1 + A), where φ represents the pairwise correlation between samples. As a result, Var(xᵢ) = Nπᵢ(1 − πᵢ)(1 + (N − 1)φ), where φ ∈ [0, 1] is a parameter controlling the amount of excess variation in the Dirichlet multinomial. When φ = 0, the DM reduces to a multinomial. Thus the Dirichlet multinomial can be interpreted as an over-dispersed multinomial distribution: as φ approaches 1, the distribution is completely overdispersed and the dataset is more heterogeneous than expected.
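This parameterization can be checked numerically. The sketch below implements the Dirichlet multinomial pmf in log-space and verifies that the two forms of the variance agree; the concentration parameters are arbitrary illustrative values:

```python
from math import exp, lgamma

def dirmult_pmf(x, alpha):
    """Dirichlet multinomial pmf for counts x with concentration parameters alpha."""
    N, A = sum(x), sum(alpha)
    log_p = lgamma(N + 1) + lgamma(A) - lgamma(N + A)
    for xi, ai in zip(x, alpha):
        log_p += lgamma(xi + ai) - lgamma(ai) - lgamma(xi + 1)
    return exp(log_p)

# Arbitrary illustrative parameters: three categories (reference, alternate, error).
alpha = [4.9, 4.9, 0.2]
A = sum(alpha)
N = 30
phi = 1.0 / (1.0 + A)        # overdispersion parameter
pi = [a / A for a in alpha]  # expected proportions

# The reparameterized variance N*pi*(1-pi)*(1+(N-1)*phi) must equal the
# direct form N*pi*(1-pi)*(N+A)/(1+A).
var_reparam = [N * p * (1 - p) * (1 + (N - 1) * phi) for p in pi]
var_direct = [N * p * (1 - p) * (N + A) / (1 + A) for p in pi]
```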
For a single-component Dirichlet multinomial, we computed the maximum likelihood estimate starting with a method-of-moments estimate and optimizing using the Newton-Raphson method. For all other MDMs, the maximum likelihood estimate was computed using an EM algorithm. This procedure was repeated 1000 times to search for the global maximum likelihood estimate. For each repetition, we started the search with the method-of-moments estimates of the parameters, then calculated the likelihood of the data for each component.
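To illustrate the flavor of the EM procedure, the sketch below fits a deliberately simplified two-component mixture of binomials (a special case of the MDM with two categories and φ = 0) to simulated read counts; the simulation parameters are invented and this is not the implementation used in this study:

```python
import random
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Simulate counts at 2000 sites of depth 30: 70% from a balanced (p = 0.5)
# component, 30% from an artificially reference-biased (p = 0.9) component.
random.seed(1)
n = 30
data = []
for _ in range(2000):
    p = 0.5 if random.random() < 0.7 else 0.9
    data.append(sum(random.random() < p for _ in range(n)))

# EM for the mixture weight w and component proportions p1, p2, starting
# from arbitrary values (a stand-in for the method-of-moments start).
w, p1, p2 = 0.5, 0.3, 0.8
for _ in range(100):
    # E-step: responsibility of component 1 for each site
    resp = []
    for k in data:
        l1 = w * binom_pmf(k, n, p1)
        l2 = (1 - w) * binom_pmf(k, n, p2)
        resp.append(l1 / (l1 + l2))
    # M-step: re-estimate parameters from the weighted counts
    w = sum(resp) / len(resp)
    p1 = sum(r * k for r, k in zip(resp, data)) / (n * sum(resp))
    p2 = sum((1 - r) * k for r, k in zip(resp, data)) / (n * sum(1 - r for r in resp))
```

Because the two simulated components are well separated, the EM recovers the generating parameters closely; real MDM fitting additionally estimates the concentration parameters of each DM component.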
For the Dirichlet multinomial distribution, we estimated φ as a measure of the overdispersion of the data. In addition, we estimated ρ, the proportion of sites belonging to each Dirichlet multinomial component. For the CHM1 dataset we estimated the proportion of the reference allele and error term for each model or model component. For each CEU dataset we estimated the proportion of the reference allele, alternate allele, and the error term.
To determine the optimal number of components in the MDM model, both the Akaike information criterion (AIC) and Bayesian information criterion (BIC) were calculated for each dataset; the model with the lowest AIC or BIC is considered the best model. We discuss the model with the best BIC in the Results section. The AIC and BIC for each model are calculated by the following formulas:

$$\mathrm{AIC} = 2k - 2\ln L \qquad \mathrm{BIC} = k\ln n - 2\ln L$$

where L is the maximized likelihood of the model, k is the number of free parameters, and n is the number of sites in the dataset.
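As a sketch of the model-selection step (the log-likelihoods and parameter counts below are invented for illustration):

```python
from math import log

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * log(n) - 2 * loglik

# Invented log-likelihoods for three candidate models fit to n = 40000 sites.
# Extra components raise the likelihood; BIC penalizes parameters more
# heavily than AIC, so the two criteria can disagree.
n = 40_000
models = {"DM": (-95_000.0, 3), "MDM-2": (-94_200.0, 7), "MDM-3": (-94_190.0, 11)}
best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m], n))
```

In this invented example AIC selects the three-component model while BIC selects the two-component one, mirroring the pattern reported for the TH datasets above.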
Visualizing model fit
We visualized the fit of the data to each model and compared the fit between models using quantile-quantile (QQ) plots. A QQ plot plots the quantiles of the observed read counts against the quantiles of the estimated read counts. Parameters estimated from the EM were used to simulate the expected read counts for the plot. Two QQ plots, one for the reference allele and one for the error term, were used to illustrate the fit of models for the CHM1 dataset. Three separate QQ plots, one each for reference, alternate, and error terms, were used for the CEU datasets.
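The quantile computation underlying these plots can be sketched as follows (plotting omitted; the "observed" and "expected" counts here are both simulated from the same binomial, so the quantile pairs should fall near the diagonal):

```python
import random

def empirical_quantiles(values, probs):
    """Empirical quantiles of a sample at the given probability points."""
    s = sorted(values)
    return [s[min(int(p * (len(s) - 1)), len(s) - 1)] for p in probs]

random.seed(2)
# Simulated reference-allele counts at 5000 sites of depth 30.
observed = [sum(random.random() < 0.5 for _ in range(30)) for _ in range(5000)]
expected = [sum(random.random() < 0.5 for _ in range(30)) for _ in range(5000)]

probs = [i / 100 for i in range(1, 100)]
qq_pairs = list(zip(empirical_quantiles(observed, probs),
                    empirical_quantiles(expected, probs)))
# A good model fit puts these pairs on (or very near) the diagonal.
```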
Assignment of sites to model components
To evaluate the use of the MDM model as a classification method, we calculated the likelihood of every site under each component of the model in each of the CEU datasets, using the parameters estimated by the EM algorithm. We then assigned each site to the component under which it had the highest likelihood.
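A minimal sketch of this assignment rule, using a binomial stand-in for each DM component (the mixture weights and proportions below are invented, not estimates from our data):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Invented two-component model: a near-binomial "major" component and a
# reference-biased "minor" component, with mixture proportions rho.
rho = [0.9, 0.1]
p_ref = [0.5, 0.85]  # reference-allele proportion within each component

def assign(k, n):
    """Assign a site with k reference reads out of n to the most likely component."""
    liks = [r * binom_pmf(k, n, p) for r, p in zip(rho, p_ref)]
    return max(range(len(liks)), key=lambda i: liks[i])
```

A balanced site (e.g. 15 of 30 reference reads) lands in the major component, while a heavily reference-biased site (e.g. 27 of 30) lands in the minor one.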
For the CEU PH dataset, we extracted all sites assigned to the minor components. The number of true and false heterozygous sites and the proportion of false heterozygous sites were calculated using the 1000 Genomes Project.
To determine the performance of the classification algorithm, we implemented an alternative way to assign each site. We computed, for each site, the probability of assignment to the major component of the model, and interpreted it as the probability of being a true heterozygous site. We used these probabilities to construct the receiver operating characteristic (ROC) curve, where sensitivity is plotted against specificity, to examine the performance of our model as a classifier across a range of classification thresholds. The area under the ROC curve (AUC) summarizes the performance of this classification method across a range of cutoff points. An AUC of 1 represents a perfect classifier, while an AUC of 0.5 suggests the prediction is close to random.
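The AUC also has a direct rank interpretation: it is the probability that a randomly chosen true heterozygote receives a higher score than a randomly chosen false positive (ties counting one half). A sketch on invented scores:

```python
def auc(true_scores, false_scores):
    """Probability that a true site outscores a false site (ties count 1/2)."""
    wins = ties = 0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1
            elif t == f:
                ties += 1
    return (wins + 0.5 * ties) / (len(true_scores) * len(false_scores))

# Invented major-component probabilities: one false positive (0.80) scores
# higher than some true heterozygotes, so the classifier is imperfect.
true_het = [0.99, 0.95, 0.90, 0.70, 0.60]
false_het = [0.80, 0.40, 0.30, 0.20, 0.10]
score = auc(true_het, false_het)  # 0.92: well above the random baseline of 0.5
```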
Classification of sites as copy number variants / paralogous
One possible cause of a false positive heterozygous site is that the site lies in a region known to harbor copy number variation (CNV). We extracted all the known CNV sites for NA12878 in the CEU dataset from the 1000 Genomes Project. We extracted all known false positive heterozygous calls, which are sites in the PH dataset but not in the TH dataset, and mapped them to the known CNV sites. We calculated the proportion of false positive sites belonging to known CNV regions for each dataset and each genomic region.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This work was supported by NIH Grants R01-GM101352 to RA Zufall, RBR Azevedo, and RA Cartwright and R01-HG007178 to DF Conrad and RA Cartwright.