DeepTrio: Variant Calling in Families Using Deep Learning

Every human inherits one copy of the genome from their mother and another from their father. Parental inheritance helps us understand the transmission of traits and genetic diseases, which often involve de novo variants and rare recessive alleles. Here we present DeepTrio, which learns to analyze child-mother-father trios from the joint sequence information, without explicit encoding of inheritance priors. DeepTrio learns how to weigh sequencing error, mapping error, and de novo rates and genome context directly from the sequence data. DeepTrio has higher accuracy on both Illumina and PacBio HiFi data when compared to DeepVariant. Improvements are especially pronounced at lower coverages (with 20x DeepTrio roughly equivalent to 30x DeepVariant). As DeepTrio learns directly from data, we also demonstrate extensions to exome calling solely by changing the training data. DeepTrio includes pre-trained models for Illumina WGS, Illumina exome, and PacBio HiFi.


Introduction
Genomic sequencing can identify variants informative 1 for diseases 2 , traits 3 , and ancestry 4 . Sequencing is particularly informative in rare genetic disease 5 caused by high-impact pathogenic variants 6 . In a mother-father-child trio, each parent contributes half of their genome, with the addition of a small number of de novo variants 7 . Sequencing for rare disease often includes the parents in order to use this information for accurate variant identification and interpretation 8 . Rare disease studies analyze multiple families with the same suspected disease to resolve undiagnosed cases 9 .
A number of methods can discover germline variants in an individual sample. Traditional methods model the evidence for a variant with known contributors to uncertainty, such as the rate of sequencing errors 10 , the probability that a read is mapped incorrectly 11 , and the reliability of the sequence quality scores 12 . Software tools which employ these statistical approaches include Freebayes 13 , GATK 14 , Octopus 15 , 16GT 16 , and Strelka2 17 .
Recent approaches have used deep learning, which learns representations directly from data 18 . This allows a variant caller to capture aspects of the problem which are incompletely understood, or to rapidly adapt to a new sequencing technology by training on data. Deep learning variant callers include DeepVariant 19 , Clairvoyante 20 , and Neusomatic 21 . Deep learning was used in a majority of short-read and virtually all long-read submissions to the PrecisionFDA Truth Challenge V2 22 .
Some variant callers can use information about a trio to jointly call variants. The approaches range from joint calling without consideration of family information (e.g. GATK GenotypeGVCF), those which model parental transmission probabilities (e.g, FamSeq, GATK CalculateGenotypePosteriors), and those which use a deep learning approach for individual samples and postprocess with statistical methods (e.g. dv-trio 23 ) Correctly incorporating parental information requires integrating the existing uncertainties in error sources across multiple samples. A deep learning approach can directly learn from trio data how to value evidence from parents in making a call. It is also easy to adapt to different coverages, preparations, and technologies.
In this work, we build and assess DeepTrio, a deep learning-based variant caller for parent-child trios. We start from the code base of DeepVariant, a germline caller which won multiple awards in the PrecisionFDA Truth Challenge V2 22 , noted for high accuracy on genomes and exomes 24 , and shown to increase detection rate of pathogenic germline variants 25 .
We train DeepTrio to call variants in both parent and child samples, with one model for Illumina WGS, one for Illumina exome, and one for PacBio HiFi 26 . To ensure accurate performance over a range of conditions, DeepTrio is trained with a diversity of preparations (PCR-free, PCR-positive, and multiple exome kits), and across a range of child and parent coverages. DeepTrio is also trained for duo-calling. DeepTrio can write output as individual VCFs 27 , gVCFs, or as a merged family VCF. The gVCFs of multiple families can be combined with GLnexus 28 , which has been optimized for combining DeepVariant gVCFs 29 , to scalably create large joint callsets of trios, duos, and individual samples.
We show that DeepTrio has superior accuracy to both individual sample and trio-based samples, measured by concordance with the Genome in a Bottle truth set 30,31 . We show that DeepTrio is still able to accurately call de novo variants, despite their lack of support in the parents. Finally, we quantify the performance of DeepTrio across coverage, showing that DeepTrio allows high accuracy to be retained at lower coverage for both proband and parent samples.

Modifying DeepVariant to call trios
DeepVariant calls variants in three steps: make_examples, call_variants, and postprocess_variants. In the make_examples step, a simple heuristic identifies positions which might differ from the reference. For Illumina sequencing, this requires a fraction of reads supporting an alternate allele at 0.12 for SNPs and 0.06 for Indels. For PacBio sequencing, the threshold is 0.12 for both SNPs and Indels.
The call_variants step represents BAM data as a multi-dimensional pileup of a 221-bp window in the genome around a candidate. As of DeepVariant v1.0, there are 8 input channels representing 1) the bases in the read, 2) their base quality, 3) the mapping quality of the read, 4) the strand mapped to, 5) whether the read supports a variant, 6) bases which differ from the reference and (in the case of PacBio data) 7,8), realignments of the reads to the alternate alleles. postprocess_variants converts the output probabilities of the neural network into a variant call and confidence, and resolves multi-allelic candidates into their most likely alleles.
To modify make_examples for Trio calling, we perform candidate generation on each individual sample in the same manner. We also generate candidates from the union of reads from all samples with a reduced threshold for reads supporting the alternate allele, which allows discovery of alleles at a lower fraction but which are reinforced by appearing in multiple samples. When a candidate allele is identified in any single sample, it is also generated at the same position in the other samples. This allows us to generate an output variant probability for every candidate in every sample.
The call_variants stage uses a deep neural network to classify the probabilities for the genotype of each variant candidate. This process learns the important factors for classification directly from the input data. We generate tensor pileups where each sample has a fixed height with the child pileup in the middle, one parent's pileup on top, and the other parent at the bottom. Using labels from Genome in a Bottle, we train two models: a child model and a parent model. To generate calls for each parent, two sets of examples are made, with a different parent as the top pileup.
The concept of Mendelian inheritance, or any explicit modeling of parent-child relationship is never provided to DeepTrio. Training simply creates these pileups and associated labels. The child model would learn that the reads in the middle pileup are most informative for calling, and the parent reads as supporting evidence. The parent model would learn that the topmost reads are most informative for the call, with child reads providing some supporting information, and the other parent marginal additional information.
The postprocess_variants stage is not altered. The final outputs are a VCF and a gVCF for each sample. Merging multiple gVCFs uses GLnexus 28 in a manner optimized for DeepVariant outputs 29 . Figure 1 shows a representation of the calling process.
DeepTrio is trained on Genome in a Bottle samples 32 with sequencing conducted on Novaseq, HiSeqX, HiSeq4000, and PacBio Sequel II instruments with both PCR-Free and PCR+ preparations. Exome models are trained on Agilent SureSelect v7, IDT-xGen, and Truseq capture kits. This data has been previously described and released 33 . DeepTrio is trained on examples from chromosomes 1-19. Chromosomes 21 and 22 are used as a tuning set to select the model checkpoint by determining when accuracy on the withheld set has peaked. Chromosome 20 is fully withheld as an independent evaluation set to assess accuracy.

Assessing variant calling accuracy in the parent samples
Accurate variant calling of parent samples is important in order to give context for proband variants, to accurately catalog incidental findings, and to correctly generate research cohorts. To assess the accuracy of parent calling, we compared DeepTrio's parent model performance with DeepVariant and other trio and non-trio methods (Figure 2 bottom). DeepTrio outperforms other pipelines across a range of coverage, with a larger effect at lower coverages. DeepTrio's advantage is less pronounced for the parent model as compared to its advantage on the child model. This finding is reasonable, since the genotype of the child is less informative regarding the genotype of the parent than the parent's genotype is informative regarding the genotype of the child. The improvement in accuracy allows a lower coverage for the parent sample to be used while retaining the accuracy one would normally achieve with a non-trio pipeline.

Precision and recall of de novo variants
The ability to identify de novo variants which may have a dominant inheritance pattern is of particular interest in identifying rare genetic disease. Because de novo variants violate the assumptions of Mendelian inheritance, it is reasonable to think that trio calling approaches may have a more pronounced effect on the sensitivity and specificity of de novo variants. When the analysis is constrained to cases where a child is called as 0/1 and each parent is confidently called as 0/0, we observe far more pronounced differences between trio and non-trio pipelines, even in cases like GATK4 where the overall accuracy as determined by Genome in a Bottle comparison is very similar.
The trio-aware pipelines (DeepTrio, dv-trio, famseq, GATK CalculateGenotypePosteriors), and Octopus have a greatly reduced false positive rate for de novo variants, but have slightly reduced recall of true de novo variants, whether these are defined as identifying the call as de novo by confidently genotyping the child as 0/1 and the parents as 0/0, or by detecting the child variant, but with an unknown or incorrect all in a parent ( Figure 3, Tables 1,2, Supplementary tables 1,2).

Assessing the effect of parental depth on variant calling accuracy in the proband
In trio sequencing for rare disease, there is often greater importance in sequencing a child proband. To manage sequencing costs, studies often take the approach of sequencing the parents at a lower coverage than the child 36 . In order to evaluate performance in these scenarios, we performed downsampling of the parent samples while keeping the child coverage at 35x. Variant calls were generated with the same tools used for the coverage titration of the full trio and evaluated using the same methods.
For the child sample, which contains the same reads for each titration point, we observe a slight accuracy improvement for DeepTrio in both Illumina and PacBio HiFi across the parent coverage ranges observed (15x-35x). For the parent samples, which do vary in coverage over the titration range, we observe a much greater advantage for DeepTrio compared to DeepVariant and other methods (Figure 4). For PacBio HiFi, the parent model has higher accuracy when the child is at 35x, as compared to when all three samples are coverage-titrated.

Inspecting examples of variant calls improved by DeepTrio
In order to better understand cases where DeepTrio makes a correct call where other methods do not, we inspected IGV 37 images of positions which DeepTrio called correctly that were errors in DeepVariant. Since the F1 of SNP calling at 35x is already 0.9973, all inspected sites were difficult (low coverage, low mappability, presence of repeats and segmental duplications). Errors corrected by DeepTrio were often either through the ability to identify supporting evidence at low coverage and difficult to map regions ( Figure 5), or the ability to better estimate the correct genotype in sites with substantial allelic bias (Supplementary Figure 1). HG002 HG003 HG004

Computational efficiency of DeepTrio and other trio-calling pipelines
To assess the computational efficiency of DeepTrio compared to other methods of generating trio calls, we ran pipelines on the Illumina 35x WGS samples using the same hardware: a 16-CPU thread machine (n1-standard-16) available on Google Cloud Platform. This was chosen as representative of typical runs, though there are faster (and more expensive) or slower (but cheaper) methods to run these pipelines (an analysis of single-machine scaling can be found in the DeepVariant-GLnexus paper 38 ).
Some components of DeepTrio are more computationally efficient when compared to DeepVariant, for example DeepTrio can reuse calculations in the example generation stage.
Other components require more compute in DeepTrio: the size of the pileup images is larger, which corresponds to a larger neural network and more compute. Also, more examples are identified across the samples and any example in one sample requires making a call in the other. We observe that DeepTrio requires more time to run when compared to DeepVariant, but slightly less time than GATK4 when using these hardware settings ( Figure 6). Since chromosomeX in males is inherited from the mother, we performed calling on chromosomeX with only the mother provided as the parent. This reduced heterozygous calls to 3.37% (3518/104427), which is better than in the DeepVariant case. For male samples, this recommends that variant calling should be run with both parents on the autosomal and PAR regions using a BED file to restrict location, and additional variant calling should be performed using only the mother's file provided as parent for the non-PAR regions of chromosomeX, and only the father's provided for the non-PAR regions of chromosomeY.

Figure 6. Time required to run DeepTrio and other pipelines
This experiment indicates that allowing the model to infer a hemizygous chromosome through coverage and explicitly training for hemizygous variants is an opportunity for improvement, both for DeepVariant and DeepTrio.

Discussion
Here we discuss how DeepTrio's performance characteristics relate to the motives for trio calling and the data which is generally available. In this work, we considered accuracy across all variants, and separately for de novo variants. In the context of rare disease, both of these formulations of accuracy are important. Some rare genetic diseases are caused by the combination of recessive variants from each parent. While others, especially those with a dominant inheritance pattern, will arise de novo.
We demonstrate that DeepTrio has strictly superior accuracy compared to DeepVariant and other trio and non-trio methods. When considering highest overall accuracy, DeepTrio would always be preferred. The case of de novo variants is more nuanced. DeepTrio has much higher precision when calling de novo variants but has slightly lower recall. If only F1 is considered, DeepTrio's F1 (0.8947) is superior to the next highest trio method (GATK4 CalculateGenotypePosteriors -0.8235), and non-trio method (DeepVariant -0.7755). But in terms of recall, non-trio methods like DeepVariant have a higher recall of de novo calls. As a result, if investigators have a greater interest in the highest recall of de novo events, as opposed to overall accuracy, one option would be to run DeepVariant on each sample, and to re-run DeepTrio on the smaller number of regions where DeepVariant identified a de novo variant, in order to prioritize real variants first. In addition, we have previously shown that the output of the neural network is very well calibrated (Figure 2 of poplin et al. 2018) 19 , and it may be possible to rank putative de novo events by the reported confidence of the call in order to tune DeepTrio more towards recall of de novo.
The accuracy analyses above are conducted at deep coverage (35x). Sequencing a trio of samples is more expensive than a single sample, and studies often compensate for this extra expense by sequencing at a lower coverage, especially for the parents. Reducing coverage is often considered when sequencing non-human disease samples, where the importance of accurately calling every variant is balanced against larger, more comprehensive studies. DeepTrio's advantage over other methods is substantially greater at lower coverages. DeepTrio is about as accurate in~20x coverage as DeepVariant is at 28-30x coverage for both Illumina WGS and PacBio HiFi. DeepTrio is more accurate at 15x coverage than either GATK4 method at the highest coverage evaluated (35x). By maintaining high accuracy at reduced coverage, as well as by including training examples which reduce parent coverage while keeping child coverage, DeepTrio increases the flexibility of investigators to plan their trio sequencing to maximize cost-benefit.
As a deep learning method, DeepTrio does not explicitly encode the relationship between samples. DeepTrio's ability to improve accuracy of calling, and to do so in a manner which is similar to human intuition regarding de novo variants, demonstrates an ability to capture rules which mirror general knowledge. This is similar to a recent demonstration which re-trained DeepVariant to use population allele frequencies 39 . It is a strong indicator that deep-learning based variant callers can be further improved by finding ways to expose information which captures the underlying biology of samples and populations. Similarly, the framework of DeepTrio could, in theory, be further expanded to use sibling information, or to leverage more distant family relationships. Overall, the success of DeepTrio is a strong demonstration that thoughtfully identifying data which captures relevant biological or bioinformatics intuition, is a critical element to the development of strong machine learning methods in the genomics domain.

Generation of Sequencing Data
The generation of sequencing data for training is described in detail in Baid et al. 2020 33 and the WGS and PacBio evaluation data in Olson et al. 2020 22 . In summary, all WGS and exome runs were conducted with 151-bp paired-end reads at 50x intended coverage from NovaSeq and HiSeqX platforms. For WGS, sequencing for both PCR-Free and PCR-Positive preparations. All sequencing was performed on HG001-HG007, NA12891, and NA12892. For exomes, sequencing was performed with multiple capture kits, Agilent v7, IDT-xGen, and Nextera at a target of 200x coverage from NovaSeq and HiSeq4000 platforms.
For PacBio HiFi data, we requested 3 SMRT Cells 8M for each sample of HG003, HG004, HG006, and HG007. Libraries were prepared targeting a 15kb insert size and sequenced on Sequel II System with Chemistry 2.0. This was supplemented by data from Human Pangenome Reference Consortium (https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0) .

Training models
Training DeepTrio requires a set of BAM or CRAM files and a set of truth labels. Examples are generated in the same manner used for calling and are annotated with the truth labels. A training tutorial for DeepVariant from input data is available at: (https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-training-case-study.md).
Using the trio data sets described in the prior section, DeepTrio was trained across a range of coverages achieved by random downsampling of~50x BAM files at fractions of 0.7 (~35x), 0.5 (~25x), and 0.3 (~15x). This random downsampling helps DeepTrio to generalize well across coverages, and we observe that having more difficult examples results in overall better models. Examples are generated for trios where each sample is downsampled at the same fraction, and those where only the parents are further downsampled while the child is kept at a higher coverage. To train DeepTrio to natively handle duo calling, training also occurs in the same manner, but omitting one of the parents. Training occurs over the entire set of WGS samples to generate the WGS model.
For exomes, the data described in the prior section is used in the same manner as in WGS, but with different downsample fractions. The downsample fractions for exomes are 0.8, 0.6, and 0.4. The resulting coverages are diverse, since exome capture varies substantially by exon and sample. For PacBio, different combinations of 8M SMRT cell runs provide the training inputs, over a range of 2, 3, 4, 5, and 6 merged cells as the input files.

Mapping and Variant Calling
Samples were mapped to GRCh38 40 with BWA MEM 41 in an ALT-aware manner and deduplicated with Picard MarkDuplicates 42 .
For GATK, non-trio aware calling was performed by HaplotypeCaller followed by GenotypeGVCFs. For GATK trio-aware calling, this VCF was further refined by CalculateGenotypePosteriors.
For FamSeq, GATK HaplotypeCaller and GATK GenotypeGVCFs were run, and FamSeq was run on the resulting multi-sample VCF. Since FamSeq only refines the call of SNPs, Indel calls were taken from the GATK VCF without modification.
For dv-trio, single sample calling was performed as described in DeepVariant's best practices in multi-sample calling, and the dv-trio scripts were applied to the output of this file.
For Octopus, single sample variant calling for single samples was performed using the v0.7.2 release, with the matched v0.7.2 germline forest model. Because Octopus does not generate a gVCF, we did not attempt to generate a joint call for the individual samples, to avoid introducing issues with harmonizing allele representation.
For all de novo analyses, only PASS entries were used for calculations across all callers. No call (./.) positions were excluded.

Code and Data Availability
All DeepTrio code is available under a BSD-3 license at: https://github.com/google/deepvariant/tree/r1.1/deeptrio All evaluation data are derived the Illumina and PacBio FASTQ files available from the PrecisionFDA v2 Truth Challenge 22 at: https://precision.fda.gov/challenges/10 Training datasets are described in "An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development" 33 and download links for all sequence data are available in the supplement of that paper: https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.supplementary-material