Distinct error rates for reference and non-reference genotypes estimated by pedigree analysis

Errors in genotype calling can have perverse effects on genetic analyses, confounding association studies and obscuring rare variants. Analyses now routinely incorporate error rates to control for spurious findings. However, reliable estimates of the error rate can be difficult to obtain because of their variance between studies. Most studies also report only a single estimate of the error rate even though genotypes can be miscalled in more than one way. Here, we report a method for estimating the rates at which different types of genotyping errors occur at biallelic loci using pedigree information. Our method identifies potential genotyping errors by exploiting instances where the haplotypic phase has not been faithfully transmitted. The expected frequency of inconsistent phase depends on the combination of genotypes in a pedigree and the probability of miscalling each genotype. We develop a model that uses the differences in these frequencies to estimate rates for different types of genotype error. Simulations show that our method accurately estimates these error rates in a variety of scenarios. We apply this method to a dataset from the whole-genome sequencing of owl monkeys (Aotus nancymaae) in three-generation pedigrees. We find significant differences between estimates for different types of genotyping error, with the most common being homozygous reference sites miscalled as heterozygous and vice versa. The approach we describe is applicable to any set of genotypes where haplotypic phase can reliably be called, and should prove useful in helping to control for false discoveries.


Introduction
We apply our new method to a dataset of genotypes collected from the whole-genome 79 sequencing of a set of owl monkey (Aotus nancymaae) pedigrees (Thomas et al. 2018). Among 80 sites that could be phased with pedigree information, we found a significant difference in the 81 direction in which phase errors occurred. This departure forms the signal for estimating the rate 82 of genotyping error. Estimated error rates were significantly different among the genotypes, with 83 the most common error being a homozygous reference site miscalled as heterozygous. The 84 principles of our method can be applied to determine the rate of different types of genotyping 85 error in any dataset where phase errors can be identified.  The expected number of phase violations for this genotypic combination can be written as where Ay is a phase violation in the parent A haplotype block, with genotypic combination y and 182 |Ay| is the number of such violations across the genome. Here, y is the abbreviated phase-183 informative genotypic combination, listing abbreviated genotypes for the first four individuals as 184 ordered in x for Gx.

185
In pedigrees with five individuals there are 18 genotypic combinations that are phase-186 informative.

201
Equation 4 is an overdetermined linear system-there are many more equations than rates 223 to be estimated. We can fit the model using a linear least squares approach, solving for I by (v. 3.6.1) script used to abbreviate the genotype-phase combinations, and the R script applying 230 the algorithm, are available on GitHub.

231
Simulating phase violations 232 We tested the performance of our method on simulated genotype combinations at 233 biallelic sites from pedigrees as in Fig. 1a with simulated errors. The phase at each site was where aN is the Watterson correction factor gn is the genotype of the nth-individual in the combination, and the conditional genotype to give a total count, nx,p, for each unique genotype and phase combination, x and p. 254 We added errors to these counts by iterating over the genotypes in each combination, x, 255 and drawing errors for the nx,p sites. Let the genotype be a vector, = {〈1,0,0〉, 〈0,1,0〉, 〈0,0,1〉}, 256 for the homozygous reference, heterozygous, and homozygous alternate genotypes respectively.

257
Given a set of error probabilities, *#+ , for each type of error as listed in Table 2, the probability 258 of a genotypic transition can be written as • , where For each genotype at each combination, the nx,p counts are divided by drawing from a

270
Accuracy of error estimators in simulations 271 We simulated genotypes with varying error rates and numbers of segregating sites, for each combination of parameters tested. In the absence of genotyping errors, the frequency of 276 parental phases at informative sites is expected to be equal due to independent assortment.

282
We assessed the accuracy of our error rate estimation in these simulations by calculating 283 the normalized root mean square deviation (NRMSD). The deviation between the estimators and 284 the simulated rate was normalized by the range of error rates as where ε ‹OE• and ε ‹Ž• are the maximum and minimum error rates in each simulation.

286
There was little change to the estimators' normalized deviation with an increasing 287 number of segregating sites for simulated error rates drawn from either the log-normal or log-288 uniform distribution. The mean NRMSD for simulated rates drawn from the log-normal 289 distribution was slightly higher, ranging from 9.0% to 12.2%, than from rates drawn from the 290 log-uniform distribution, from 6.5% to 7.1%, for simulations with between 10 and 70 million 291 segregating sites (Fig. 2b). These results indicate that our method of estimating error rates is 292 robust across populations with different levels of nucleotide diversity. 2c). We found ε I "#) to be the most accurately estimated, with a mean NRMSD of 2.1% and 305 1.2%, for simulated errors drawn from the log-normal and log-uniform respectively. The 306 estimators ε I $#) and ε I )#$ were the least accurate, with mean NRMSDs ranging from 8.0% to 307 12.4%. The accuracy of the estimator for error rates at the most frequent genotype, ε I "#$ , was 308 intermediate with an NRMSD of 4.3% and 6.1% for respective draws from the log-normal and 309 log-uniform distributions.

310
Application to phased owl monkey genotypes 311 We applied the method developed here to a dataset of genotypes from the whole-genome 312 sequencing of owl monkey (Aotus nancymaae). Individuals in this dataset were part of several 313 three-generation pedigrees, allowing us to unambiguously phase focal individuals (as in Fig. 1).

314
We selected four unrelated pedigrees with genotypes from twenty total individuals for our

322
We selected genotypes from biallelic SNPs on autosomes for the subsequent analyses.

323
Based on genotypic combinations, we were able to identify approximately 5.5 million phase-324 informative sites in each of the pedigrees. These sites were on haplotype blocks that covered, on reference and heterozygous genotypes, with mean rates ε "#$ = 2.9 × 10 -3 and ε $#" = 2.9 × 10 -3 336 per bp. The rate at which homozygotes for the alternate allele were miscalled as heterozygotes 337 was also appreciable, mean rate ε )#$ = 2.1 × 10 -3 per bp. In contrast, our estimate of the rate at 338 which homozygous reference genotypes were mistaken as homozygous alternate was zero in all 339 four pedigrees. 340 We repeated the analysis with genotypes that were filtered by genotype quality (GQ) as 341 calculated in GATK. The value of GQ is based on the difference between the probability of the 342 called genotype and the next most likely genotype. Genotypes associated with higher GQ scores 343 are typically interpreted as being more accurate. As expected, the estimated error rates decreased 344 as we removed sites with more stringent GQ filters ( Fig. 3; Table S1). Filtering by GQ appears to 345 be more effective at removing certain types of errors than others, with the most dramatic 346 reduction in heterozygote false negatives, that is ε $#" .

347
We also calculated an average error rate by weighting each estimator with the genome-348 wide frequency of the corresponding genotype (e.g. the frequency of homozygous reference which was reduced to 1.1 × 10 -3 per bp when genotypes were required to have a GQ > 60.  level of filtering stringency (Table S1). Though their resequencing approach does not distinguish 368 between all types of miscalls, they report a false positive rate for heterozygotes (ε "#$ ) that is 369 much higher than the false negative rate (ε $#" ), consistent with our findings after the application 370 of filters on GQ.

371
Heterozygote false positives at homozygous reference sites occur at the highest rate 372 among all errors, after modest GQ filtering. As the most common type of site in the genome, 373 they are also the most common genotyping error.

393
Simulations demonstrated the power of our approach to estimate error rates even in 394 samples with low levels of diversity. They also indicated differences in the accuracy of the error 395 estimators, but these were not pronounced for the most common types of sites. One caveat to our 396 simulations is that we did not simulate entire haplotype blocks or inaccuracies in phase calling.

397
Low levels of diversity and high rates of recombination can make accurate phasing more 398 difficult, while low rates of recombination could result in an imbalance in the proportion of 399 phases across the genome. The assumption of at most one genotyping error among samples per 400 site may slightly inflate our estimates of the error rate. The potential for two or more genotyping 401 errors is small (on the order of 10 -6 ), and had a negligible effect in our simulations, but may be higher at sites prone to sequencing or assembly errors. Similarly, we ignored the effects of gene 403 conversion and de novo mutation, as they are expected to occur at negligible rates compared to 404 genotyping error (Table 1). Furthermore, the signal from a genotyping error and a gene 405 conversion event are nearly identical, though careful filtering and selection of sites have been 406 successful in identifying gene conversion events (e.g. Williams et al. 2015;Miller et al. 2016).

407
Finally, we note that the estimated error rates are for genotyping from a set of called variants.

408
The heterozygote false positive rate, ε "#$ , for example, does not apply to invariant reference 409 sites.

410
As genomic data continues to accumulate, the consideration of genotyping errors will 411 remain an essential part of genetic analyses. Though we have focused mainly on whole-genome 412 sequence data, our approach is generally applicable to any collection of genotype data (e.g. SNP-   Possible miscall: 0/1 Focal individual inherits the alternate allele from Parent A, giving a 50% chance of transmission to child, leading to apparent phase violation. Frequency: ½ $$)) • ε $#" + $$)) • ε )#" Parent B Observed genotype: 0/1 Possible miscall: 0/0 Implies more than one genotyping error across pedigree. Possible miscall: 1/1 Not detectable as phase violation, as focal individual still inherits alternate allele from Parent B haplotype block.

Focal
Observed genotype: 0/1 Any miscall would imply more than one genotyping error.
Possible miscall: 0/0 Implies more than one genotyping error across pedigree. Any miscall would imply more than one genotyping error.

Focal
Observed genotype: 0/1 Any miscall would imply more than one genotyping error.

Partner
Observed genotype: 0/0 Any miscall would imply more than one genotyping error.

Focal
Observed genotype: 0/1 Any miscall would imply more than one genotyping error.
Possible miscall: 0/1 True genotype 1 has been miscalled as 2. There is a 50% chance the alternate allele was inherited from the focal individual, leading to apparent phase violation.

Focal
Observed genotype: 0/1 Any miscall would imply more than one genotyping error.
Possible miscall: 1/1 True genotype 2 has been miscalled as 1. There is a 50% chance child inherits alternate allele from Parent A, leading to apparent phase violation.
Possible miscall: 1/1 True genotype 2 has been miscalled as 1. There is a 50% chance child inherits alternate allele from Parent A, leading to apparent phase violation.
Possible miscall: 0/1 True genotype 1 has been miscalled as 2. There is a 50% chance the alternate allele was inherited from the focal individual, leading to apparent phase violation.