GenotypeTensors: Efficient Neural Network Genotype Callers

We studied the problem of calling genotypes using neural networks. A machine learning approach to calling genotypes requires a training set, an approach to convert genomic sites into tensors and robust model development and evaluation protocols. We discuss each of these components of our approach and compare four types of neural network training protocols, two fully supervised and two semi-supervised approaches. Semi-supervised approaches use unlabeled data to supplement limited quantities of labeled data. Random hyper-parameter searches identified highly performing models that reach indel F1 of 99.4% on a chromosomes 20, 21, 22 and X of NA12878/HG001. We further validate these models by evaluating performance on HG002, an independent sample used in the PrecisionFDA challenge. We apply GenotypeTensors to evaluate the impact of (1) training with small datasets, (2) training models only with sites inside confidence regions, or (3) training with improved true label annotations. A PyTorch open-source implementation of GenotypeTensors is available at https://github.com/CampagneLaboratory/GenotypeTensors. DNANexus cloud applications are provided to help process new datasets both to train model or call genotypes with trained models.


INTRODUCTION
DNA can be sequenced with high-throughput sequencers to yield short or long reads. Genotype callers are tools that implement computational approaches to transform sequencing reads aligned to a genome into genotype calls across the sites of a genome. Several approaches have been followed to develop genotype callers. Early stage ad-hoc callers used hard thresholds and filters to identify the consensus of bases seen at each site. These approaches were quickly superseded by approaches that leverage probabilistic methods to weight the evidence for each genotype at a site. Examples of probabilistic callers include the Genomic Analysis Toolkit (GATK) [McKenna et al., 2010], or Platypus [Rimmer et al., 2014].
At the heart of the probabilistic approaches are models that given observables at each site of the genome determine the probability of each genotype at the site. The probabilistic tests implemented in traditional callers use a limited amount of observable features. For instance, they may rely on the number of bases supporting a given genotype and the quality scores of these bases. The reason for the focus on a small number of features is that it is non-trivial to manually develop models that model interactions, often non-linear, among many kinds of observables. Poplin et al. [2016] and our group [Torracinta and Campagne, 2016] have independently developed approaches to estimate the probability function for genotype calling using features observed from alignment data and neural network probability estimators. The approach used to develop DeepVariant [Poplin et al., 2016] consists in transforming aligned reads into images and train convolutional neural networks to predict genotypes. In contrast, our approach maps aligned reads to a vector of features and uses feed forward network architectures. We describe this approach in this manuscript and provide details of a robust model development and evaluation process that allows to train strong models at a fraction of the computational cost of the approach implemented in DeepVariant.
A key challenge faced when training genotype callers is access to large amounts of high-quality labeled data, that is, alignments where the genotypes at each site of the genome are known. In this work, we used data measured from NA12878/HG001, one of the samples studied by the platinum genome study [Eberle et al., 2016], subsequently studied by the Genome in a Bottle (GIAB) Consortium [Zook et al., 2014[Zook et al., , 2018. In order to test if models trained on one genome generalize to a second genome, we use sample HG002, which was used as benchmark sample in the PrecisionFDA challenge. This evaluation protocol makes it possible to compare the performance of our approach to the performance reported for other genotype callers in the same condition.
Importantly, GIAB defined confidence regions as a subset of the genome where confidence in genotype calls is strong [Zook et al., 2014[Zook et al., , 2018. Genotypes for sites outside of a confidence region may be correct, but there is less evidence that they are correct, either because data are lacking at the site, or because complexity of the genomic site yields conflicting genotypes when different sequencing technology are used to measure the same site, or across pedigrees. We use confidence regions for evaluation on both HG001 and HG002. The DeepVariant preprint reported training models with sites only contained in confidence regions. In our initial preprint, we warned that this could yield biased models that do well inside confidence regions (where the evaluation is performed), but fare worse in the rest of the genome (on the more complex sites that make up about 14% of the genome).
In this manuscript, we test whether the performance of models trained exclusively on sites located inside confidence regions is representative of performance of models trained with sites from across the entire genome.
Since high-quality genotype calls are rare and millions of these calls are necessary to train strong models, we evaluate semi-supervised training protocols to determine if using large amounts of unlabeled data can be beneficial when training genotype callers. This question is particularly interesting because the performance of neural networks has been shown to increase with larger training sets, but the cost of acquiring millions of high-quality genotype labels is expected to remain very large in the foreseeable future. Semi-supervised approaches could facilitate training strong models with limited amounts of labels. Our results show that semi-supervised approaches do not result in more predictive models when testing on an independent sample. Domain adaptation techniques have been developed to minimize performance drops when a model is trained on a dataset with different distribution as the test set. We therefore also evaluated a domainadaptation approach: Adversarial Discriminative Domain Adaptation (ADDA) [Tzeng et al., 2017]. We found that ADDA did not improve HG002 performance when models trained on HG001 where adapted to the HG002 data distribution.
Further analysis revealed that the performance of models trained on HG001 generalize well to the validation set and test set of both HG001 and HG002, suggesting that differences in performance across the two genomic samples cannot be explained by overfitting. consists of a certain number (typically 1 to 6) of densely connected layers, with decreasing number of neurons. We stack a softmax output on top of the last layer of the funnel, where each softmax element represents the presence or absence of a specific combination of genotypes at the given genomic site.
[Considering a diploid organism and two additional possible genotypes, the softmax has 17 elements (2 2+2 + 1)]. The neural network model is trained by minimizing the difference between the model output and true labels. Hyper-parameters r and c are defined in methods.
reduces the size of the representation by applying a reduction factor r (see methods). The advantage of the funnel architecture is that it does not require a preconceived notion about which feature interactions should be considered. Training determines which combination of feature values is informative to predict the labels by adjusting the weights of the neural network. The feature mapper used in this work considers 5 possible genotypes for every position of the genome, and produces 2,255 features per genomic site (451 features per genotype at each genomic site). The features are very sparse (most values are zero), since for instance, a diploid genome is expected to have one or two genotypes present at each site (when a genotype is not observed in a sample, all 452 features of this genotype have value zero). Throughout this study, we observed that networks with 2 layers are sufficient to yield top performing models (see parameters of best performing models shown in Methods).

Training Protocols
In order to establish if semi-supervised methods are beneficial for this type of models and datasets, we defined four training protocols, two supervised and two semi-supervised: • Supervised training (Direct Supervised). In this straightforward and popular training protocol, 3/13 input features and labels over a training set are used to optimize the parameters of the model by minimizing a loss between model output and true labels.
• Supervised mixup training (Mixup Supervised). This approach was proposed to improve training of images [Zhang et al., 2017]. Supervised mixup trains the model using mixture examples constructed by combining two training examples in a linear fashion. This approach seems particularly relevant to training genotype callers since alignment data is the result of measuring a sample from a mixture of DNA molecules containing different genotypes (mixture of two alleles for most sites in a diploid genome). In this setting, mixup can be seen as a data augmentation strategy that constructs novel training examples using a linear combination of existing examples.
• Semi-supervised training with an adversarial autoencoder (Semi-supervised AAE). AAEs are a semi-supervised approach which has been shown to improve performance of models trained with small number of examples [Makhzani et al., 2015]. • Semi-supervised mixup training (Semi-supervised Mixup). While mixup was initially described as a fully supervised approach, it is possible to adapt this approach in the semi-supervised setting by dreaming up labels. Here, we dream up labels for unlabeled examples by sampling from their distribution. We have evaluated two different sampling methods: either sampling from a uniform distribution, or sampling from the distribution of the labels observed in the training set.
Hyper-parameter Optimization Training a model using neural networks requires choosing a number of hyper-parameters. The performance of neural network models can critically depend on the choice of hyper-parameter, to the extent that some authors have shown that performance advantages obtained with new models can be achieved with baseline models when a similar hyper-parameter search is performed. Given the importance of hyper-parameter tuning, we report all results after tuning each method in a similar manner.
A random hyper-parameter search consists in choosing hyper-parameters randomly given a suitable range and distribution for each hyper-parameter. We performed these searches in two phases.
Performance of semi-supervised approaches with 10,000 training examples First, we generated 160 different random combinations of parameters and trained models with the first 10,000 training examples of the training sets. This produced the range of model performance shown in Figure 2. These results are surprising because semi-supervised methods are expected to perform best when few training examples are available. Instead, we observe that the two supervised methods perform best in these searches.

Domain Adaptation
The key assumption of model generalization is that the training dataset and the dataset where the model is used belong to the same data distribution. One possibility for the loss in performance observed when transferring a model trained on HG001 to HG002 is that the two datasets have different data distributions. Various domain adaptation techniques have been developed to mitigate the drop of performance in such situations. We evaluated the ability of the Adversarial Discriminative Domain Adaptation (ADDA) method to adjust the distribution of the data. Briefly, ADDA fine tunes the feature part of a model to make the features similarly distributed between the training set and the unlabeled set. The classifier part of the model is then used as is for classification. When we used ADDA to adapt a model to the HG002 data distribution, we found that the adapted model performed worse than the non-adapted model (see Table 1).

Performance with 3.9 million training examples
In a second phase of optimization, we use the entire training set, and choose random hyper-parameters from refined ranges. To refine each hyper-parameter range, we select the top three models for each training protocol and reduce the range to that observed in the top models. This process is illustrated in an online notebook [Onl]. We use these refined ranges to choose 16 sets of random hyper-parameters per training protocol. Figure 3 shows the accuracy (panel  A) and indel F1 performance (panel B) of models trained with these optimized parameters. Figure 3 does not show the performance of the AAE training protocol because these models had non-competitive performance when trained on the entire training set (i.e., all accuracies were below 80%). Table 2 presents the performance of the top 3 models, for each training protocol, measured on the validation set. We chose the best model by validation performance and evaluated its performance on the HG001 test set (test set performance shown in Table 3). Since the validation and test sets are fully 5/13 independent, constructed from distinct sets of chromosomes of HG001, these data confirm that the models trained with GenotypeTensors generalize to new sequence variations and genomic contexts.  Table 3. Best performance on HG001 test set across conditions. The top-model selected using validation performance (see Table 2) is evaluated on the test set composed of different chromosomes than the training and validation set. In order to test how well the models generalize to new genomes, and to compare the performance of our approach to other studies, we tested these models with the HG002 sample. Table 4 presents these results and shows that our approach remains strongly predictive on HG002, and substantially outperforms  Table 4 suggests that the test set and validation set are not fully representative of the sites in an entire genome since test and validation performance over-estimates performance measured on the entire genome.
Clairvoyante [Luo et al., 2018]. We also compare the performance of three models trained with the direct supervised protocol when measured on the validation and test sets of HG001 and HG002. These data indicate that the model generalizes well (Table 5), but suggests that neither the validation or test sets are fully representative of the complexity of indels and genomic context over an entire genome (compare performance in test set in HG002 to performance on entire genome HG002, Table 4).
To investigate this effect more carefully, we plot a precision recall curve for all predictions on the HG002 genome ( Figure 4). This plot indicates that while most of the precision/recall curve behaves as expected for a strong model, some sites are erroneously predicted with the wrong genotype with high probability (region of the curve marked with an arrow). We found that this result is representative of the neural network models that we have trained, which when tested on sites with characteristics that they may not have encountered in the training set can fail with high-probability. Inspection of some of these errors suggest that most of them are confusion among homozygote and heterozygote calls. While we cannot rule out that these errors are due to issues remaining in our software implementation, these results suggest that strong indel models may require training with several genomes in the training set in order to capture enough indel diversity that such errors on new samples are minimized.

Impact of training with sites in confidence regions only
Our preferred training protocol is to train with all sites of the genome, in order to avoid biasing results for possibly easier sites that may be contained in the GIAB confidence regions of the genome. Here, we tested this assumption by training models exclusively with sites contained in the GIAB confidence regions. We performed a random hyper-parameter search on HG001 and tested the top model on HG002. Table 6 presents the performance obtained on HG002 and shows that while validation performance is very high (accuracy reaching 99.9681 on the validation set on HG001), the model (MOMWU) performs worse on HG002 than when models are trained with sites both inside and outside of confidence regions (top model shown for reference: OMVIA).  ., 2016]. Since genotype calls have been refined over the years by the GIAB consortium, we trained a model with version 3.3.2 of the GIAB calls to determine the impact of true label refinements on the performance of our models. Table 7 presents these results and shows that small performance improvements can be obtained by using more recent versions of the true labels, but that our training protocol is quite robust to some errors in genotypes (i.e., the platinum calls contain more errors than the GIAB 3.3.2 calls).  . Precision Recall Curve for entire HG002 genome. This figure shows the precision recall curve for a model that performs well on the HG002 test set, but fails on some indel sites of HG002 with high-probability. Most of the curve is well behaved, but some sites yield indel genotype errors with very high probability (see arrow). This result is not an outlier, but instead typical of failure modes we have repeatedly observed when evaluating neural networks for genotype calling. Upon inspection, most of the errors are confusion between heterozygote or homozygote calls (i.e., one of the alleles is called correctly, but the other not, resulting in an overall incorrect call).  Table 7. Impact of genotype label improvements. We compare the performance (HG002 entire genome) of the top model obtained when training with platinum labels or with the latest release of the GIAB genotypes for HG001 (best of 64 models trained with random hyper-parameter search shown for each condition). The comparison shows that a small performance improvement is obtained by using more recent versions of the genotype calls, but also clearly confirms that our approach can learn despite errors typically found in the genotype calls used in the training set. Another practical aspect of computational performance is the ability to quickly determine optimal hyper-parameters for training a model. As illustrated in Figure 3 (Panel B), the performance of models varies substantially with the choice of hyper-parameters, so it is important to determine a set of optimal hyper-parameters for a dataset. Since the data loading time is the bottleneck when training models, we have developed code to load data once and train several models initialized with different hyper-parameters (see Methods for details). This approach makes it possible to train up to 64 models in parallel on the same GPU, and greatly reduces the computational burden of performing random hyper-parameter searches. With this approach, we are able to train 64 models for about 10 epochs on a single GPU in about 6-8 hours (without this approach, training 16 models on 8 GPUs independently can be done in a few days). We do not provide specific timing for this process because the search time depends on the range of hyper-parameters explored (i.e., the number of parameters optimized for each model included in a search will influence the runtime). However, this approach makes hyper-parameter selection much more practical than training models independently.

Cloud applications
We have developed DNANexus cloud applications to help process alignments and generate training and evaluation datasets, or to call genotypes across whole genomes with trained models. The source code of these applications is available at https://github.com/ CampagneLaboratory/variationanalysis-apps.

DISCUSSION
We have presented an efficient approach to train neural network methods for calling genotypes. The method is orders of magnitude more efficient than DeepVariant because it converts genomic sites to a few thousand features and uses a feed forward network instead of a convolutional architecture trained from images. While our approach is computationally efficient, it has not yielded models competitive with the performance reported for DeepVariant on HG002, or other more traditional approaches evaluated in the PrecisionFDA challenge. For instance, the best performance reported on the entire genome of HG002 in the blind evaluation of PrecisionFDA (Truth challenge) reached an indel F1 of 0.9904 (99.4%). This compares to our best model reaching an indel F1 of 0.9852. In the same challenge, DeepVariant reached an indel F1 of 0.9898. While DeepVariant has since improved performance on HG002, the performance statistics are no longer directly comparable to our results because recent DeepVariant models are trained with several genomes, in contrast to the HG001 training protocol we use for evaluation in this study.
Given the loss of performance outside of the test and validation sets that we observed in Figure 4, it is likely that adding more genomes to the training set in order to optimize HG002 performance would yield biased performance estimates (i.e., stronger performance would be observed on HG002 than on a randomly selected genome, where distinct high probability errors similar to those shown in Figure 4 could manifest).
Given these results, it is unclear if the superior performance of DeepVariant is due to the convolutional architecture used in the DeepVariant models, which gives the model access to data about the context of each specific genomic site (e.g., DeepVariant provides data about 150bp around each site) and represents a tradeoff between computational efficiency and model performance, or if the implementation of our approach could be further improved to match deep-variant like model performance at a fraction of the computational cost.

9/13
A new tool, called Clairvoyante, was recently evaluated in a preprint [Luo et al., 2018]. This tool implements a neural-network based method for calling SNPs and indels in short and long-read data. Clairvoyante is described as a convolutional network and therefore also must convert alignment data to three dimensional tensors. The preprint provides data showing that Clairvoyante is faster than DeepVariant. Interestingly, no comparison, either computational efficiency or call performance, is provided with our earlier project, variation analysis (whose performance kept improving since the release of our initial preprint [Torracinta and Campagne, 2016]). We have included Clairvoyante in our evaluation in this manuscript and shown that all the training protocols we tested outperform Clairvoyante on HG002 by a wide margin when considering indel performance (see Table 4). Our approach has an advantage as well for computational performance since (1) we can train an epoch with HG001 in less than a minute (Table 8) when Clairvoyante needs 170 seconds [Luo et al., 2018] and (2) we can perform a hyper-parameter search with 64 models in the time it takes to train one Clairvoyant model. Regarding indel performance, as we have noted in the caption of Table 4, [Luo et al., 2018] presented F1 performance evaluated across both SNPs and indels. This made comparison with prior studies that separate SNP and indel performance difficult. We find this presentation confusing because it could mislead readers who quickly glance at numerical performance metrics without realizing that Full F1 (Indel+SNP) can appear numerically competitive with indel F1, when it is far from competitive. The comparison we provide in Table 4 should clarify how the performance of Clairvoyante compares to prior neural network methods. Additionally, we note that there is no reason that our approach, as described previously and in this manuscript, would be expected to be inferior to the approach implemented in Clairvoyante for long-read datasets. Indeed longread technology does not fundamentally change what type of features are informative when considering a specific genomic site after the reads have been aligned to the genome. We therefore expect that simply performing a hyper-parameter search with long-read data with the open-source software that we provide will yield strong performance both for long-read and short-read sequencing datasets.
Clairvoyante uses a convolutional architecture similar to DeepVariant, but fails to match the performance of either our simpler single-site models, or DeepVariant models. Therefore, it seems unlikely that a convolutional architecture is sufficient to explain the difference in performance observed between our approaches and DeepVariant.
We hypothesize that so-far unnoticed software implementation problems in available code bases, and/or insufficient hyper-parameter tuning for Clairvoyante, could be responsible for the differences observed in model performance, rather than being driven only by differences in model architecture and differences in representation of aligned reads in each genomic context.
We have tested a domain adaptation approach and two semi-supervised approaches for genotype calling. In both cases, we found that these approaches, which have been successful in other domains, failed to improve the generalization of genotype caller models for indels or SNPs. This point supports the notion that data distribution differences are not primarily responsible for variations in performance from one individual sample to another.

Reads Processing
Read files were uploaded to DNANexus and processed as follows. Alignments to the genome were done with BWA (DNANexus application) with default parameters. Alignments were further processed with DNANexus applications developed by our laboratory: alignments were processed with HaplotypeCaller (GATK3) to realign SNPs around indels and reduce the size of the training set to regions likely to contain non-reference calls. Realigned BAM files were converted to Goby alignment format, then to Sequence Base Information (SBI) format. Finally, SBI files were mapped to tensors with the export-tensors tool distributed in the variationanalysis project. Tensors files were then downloaded to the laboratory local server to train models.

True Genotype Labels
Unless otherwise noted under Results, we used the platinum genome VCF file version 2017-1.0 as training data for NA12878. SNP and indel performance were measured against annotations from the GIAB 10/13 consortium by restricting evaluation to confidence region.

Training Neural Network Models
Models were developed with the PyTorch framework (http://pytorch.org/), version 0.3.1. PyTorch was selected because it implements automatic differentiation and is a popular framework with strong community support. All models were trained with a batch size of 1024 examples. The order of training examples is shuffled at each epoch. We trained models on a Linux server with 8 GTX 1080Ti GPUs (11GB of RAM per GPU, 256GB of RAM for the server), but we note that single model training can be conducted on a single GPU overnight in a couple of hours. Hyper-parameter search for 64 models can be conducted on a single GPU overnight (10 epochs).

Hyper-parameter searches
We implemented a fast random hyper-parameter search approach in GenotypeTensors. Since our models occupy only a few megabytes of GPU memory, we are able to train 64 different models (each initialized with different hyper-parameters) at the same time on a single GPU. Briefly, we load tensors for a mini-batch and then perform forward, backward and parameter updates for a single model in independent threads. Because most of the computationally expensive work is done on the GPU, the python Global Interpreter Lock is not a major bottleneck and tens of models can be trained in the same time that a single model would be trained if done sequentially. When all models are trained with a mini-batch, another batch of tensors is loaded and training continues until all batches have been used. Testing can similarly be done for 64 models in different threads. An important advantage of this approach is that a batch of data is loaded only once. This provides an important speedup when the training set is large and disk or solid state drive random access is needed to load batches. This approach is implemented in https: //github.com/CampagneLaboratory/GenotypeTensors/blob/master/src/org/ campagnelab/dl/genotypetensors/autoencoder/SearchHyperparameters.py.

Best model hyperparameters
This section provides the command lines showing hyper-parameters used to train a few of the topperforming models discussed under Results. Note that using these command lines to train a model requires to add the path to the dataset to use for training/validation. This is done by adding an argument of the form -problem genotyping:NA12878-realigned-2018-01-30, where NA12878-realigned-2018-01-30 is the path to a dataset. Adding the option -num-workers 6 on a machine with 6 cores will also result in faster training by loading data in parallel.

11/13
Funnel Architecture (FA) Funnel models were formulated as n fully connected layers with BatchNormalization and with RELU/SELU activation and a fully connected output layers with soft-max activation. The number of output layers n is a hyper-parameter of the architecture that we optimized (see hyper-parameter search). The first dense inner layer contains c times the number of input features (we call c the model capacity). The number of neurons of each layer is calculated by applying a reduction factor r to the number of input to the layer. When r < 1, the model looks like a funnel and progressively reduces the number of neurons until the softmax layer. The exact model architecture used is encoded in the class called org.campagnelab.dl.genotypetensors.autoencoder .genotype softmax classifier.GenotypeSoftmaxClassifer distributed in the Geno-typeTensors project.

Adversarial AutoEncoder Architecture (AAEA)
Adversarial AutoEncoder is implemented as described in [Makhzani et al., 2015]. This architecture is composed of an encoder, made of n fully connected layers of decreasing number of neurons. The encoder reduces the number of features progressively to a configurable number z. A decoder implements the reverse transformation to reconstitute the input features from the encoded representation. A discriminator loss is used to make each of the z features approximately normally distributed (zero mean and unit standard deviation). The decoder and encoder are trained to reconstruct unlabeled examples while semi-supervision is obtained by training the encoder to produce z features that can predict the labels of training set examples.

Direct Supervised Training
Direct supervised training optimizes the supervised loss using an Adagrad optimizer with an L2 loss (an optimized hyper-parameter). We use the PyTorch MultiLabelSoftMarginLoss.

Mixup Supervised Training
Mixup supervised training optimizes the supervised loss using an Adagrad optimizer with an L2 loss (an optimized hyper-parameter). In contrast to direct supervised training, mixup training uses two training examples and combines their features and labels in a linear manner. x mixup = α.x 1 + (1 − α).x 2 , y mixup = α.y 1 + (1 − α).y 2 We use the PyTorch MultiLabelSoftMarginLoss.

Mixup Semisupervised Training
This training procedure is similar to mixup supervised training, but instead of using two training examples, we use one training example and one unlabeled example. The label of the unlabeled example is sampled from a distribution. Two sampling approaches are implemented: uniform (each genotype class is equiprobable), sampling (each genotype class is sampled with probability the frequency of the genotype in the training set).

Label smoothing
During all types of training, we use label smoothing with configurable ε [Goodfellow et al., 2016] (section 7.5.1). An ε of 0 means that labels are not smoothed. Label smoothing is also consistently applied in semi-supervised training schemes when labels are sampled from a distribution.