CONTINUATION: Evaluation of adaptive somatic models in a gold standard whole genome somatic dataset

In http://dx.doi.org/10.1101/079087, we presented adaptive models for calling somatic mutations in high-throughput sequencing data. These models were developed by training deep neural networks with semi-simulated data. In this continuation, I evaluate how well such models predict known somatic mutations in a real dataset. To address this question, I tested the approach using samples from the International Cancer Genome Consortium (ICGC) and the previously published ground-truth mutations (GoldSet). This evaluation revealed that training models with semi-simulation does produce models that exhibit strong performance on real datasets. I found a linear relationship between the performance observed on a semi-simulated validation set and on the independent ground truth of the GoldSet (R² = 0.952, P < 2 × 10⁻¹⁶). I also found that semi-simulation can be used to pre-train models before continuing training with true labels, and that this pre-training substantially improves model performance on the real dataset compared to training only with the real dataset. The best model pre-trained with semi-simulation achieved an AUC of 0.969 (95% confidence interval [0.957-0.982]), compared to 0.911 [0.890-0.932] when training with real labels only. These data demonstrate that semi-simulation can be a very effective approach to training probabilistic models for filtering and ranking.


INTRODUCTION
This manuscript is a continuation of Torracinta et al. [2016]¹. The reader is referred to Torracinta et al. [2016] for background and details of the adaptive deep learning concept tested in this continuation.
¹ A continuation is a preprint that continues where an earlier preprint left off. The term can also refer to the initial preprint together with one or more continuations of it.
The title of a continuation starts with the DOI of the first preprint in the series, followed by the word CONTINUATION in uppercase and a colon. A short sentence summarizes the results presented in the continuation. The authors listed on a continuation should be those who contributed to the material presented in the continuation, rather than to the original preprint (since those authors already received credit in the first preprint).
Instead of repeating the introduction and methods shared with the prior preprint, or revising the initial preprint and forcing readers to re-read old material to discover what is new, this format encourages brevity of reporting. New results and changes to methods are reported in a continuation. An important advantage of the continuation format is that it makes it possible to report results chronologically in preprints and to clearly expose the steps taken during a research study.
A manuscript submitted for publication may later show only a subset of the results presented in these preprints, and may change the order in which results are presented, in order to improve clarity for readers who encounter the ideas for the first time. Since the article can cite the preprints, it is understood that chronology is described accurately in the continuation format, while the peer-reviewed article is a simplifying summary designed to distill the key elements of a new scientific contribution.

RESULTS
To test adaptive models for somatic variation calling, I evaluated their performance with data from the International Cancer Genome Consortium (ICGC). The ICGC recently published a benchmark dataset: the ICGC GoldSet (Alioto et al. [2015]).

The ICGC GoldSet consists of data from a matched normal and tumor sample, both of which were subjected to high-coverage sequencing (about 300x). The high-coverage data were used by members of the Alioto study to determine the ground truth of somatic variation in the tumor sample. Using these data, new somatic mutation calling approaches can be evaluated against ground-truth variations in reduced-coverage datasets. A drawback of the ICGC GoldSet evaluation protocol is that some mutations with low frequencies (e.g., 10%) that are visible in the 300x data can be undetectable in the reduced-coverage datasets. Such mutations are labeled as "GOLD" only in Supplementary […].

Figure 1. Left: the ROC curve. Right: the reliability diagram. Forecast probability is the probability generated by the model. Observed relative frequency is the proportion of true labels in a set of sites. Both plots indicate that the model performs extremely well for a majority of sites (corresponding to about 65% sensitivity), then shows degraded performance and fails to identify some true positive sites described in the ICGC GoldSet.
Despite this drop in performance, such a model is suitable for prediction in a real dataset, because strong performance is obtained for the sites with the highest forecast probabilities.
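The reliability diagram described above can be computed by binning sites on their forecast probability and comparing each bin's mean forecast probability with the observed relative frequency of true labels. A minimal sketch in Python (the function name and default bin count are illustrative, not part of the original implementation):

```python
def reliability_diagram(probs, labels, n_bins=10):
    """Bin sites by forecast probability and return, per non-empty bin,
    (mean forecast probability, observed relative frequency of true labels)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # clamp p == 1.0 into the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    points = []
    for contents in bins:
        if not contents:
            continue
        mean_p = sum(p for p, _ in contents) / len(contents)
        obs_freq = sum(y for _, y in contents) / len(contents)
        points.append((mean_p, obs_freq))
    return points
```

A perfectly calibrated model yields points close to the diagonal, where mean forecast probability equals observed relative frequency; the degraded tail seen in Figure 1 corresponds to bins that fall below that diagonal.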
To evaluate semi-simulation, I trained adaptive models using the ICGC GoldSet normal and tumor samples. Even when GoldSet sites are included in the semi-simulated dataset, their label is completely controlled by semi-simulation and not influenced by the GoldSet ground truth. To determine whether such a semi-simulation-trained model can be predictive on a real dataset, I evaluated the performance of the model on the ICGC gold-set dataset (labeled ICGC gold-set in Table 1).

Figure 2. This plot compares the performance of alternative models obtained on the validation set to the performance obtained on the GoldSet. The strong linear fit (R² = 0.952, P < 2 × 10⁻¹⁶, N = 37 alternative models) with a slope of 0.75 indicates that hyper-parameter search on a semi-simulated dataset can guide model selection even in the absence of a real dataset.
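The fit reported in the Figure 2 caption is an ordinary least-squares regression of GoldSet performance on validation performance. A sketch of how the slope and R² are obtained; the AUC pairs below are invented for illustration only (the study's 37 per-model values are not reproduced here):

```python
def linear_fit(x, y):
    """Ordinary least squares y = a + b*x; returns (intercept, slope, r_squared)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy * sxy / (sxx * syy)
    return intercept, slope, r_squared

# Hypothetical (validation AUC, GoldSet AUC) pairs, for illustration only:
val_auc = [0.55, 0.70, 0.85, 0.95]
gold_auc = [0.54, 0.65, 0.76, 0.84]
intercept, slope, r2 = linear_fit(val_auc, gold_auc)
```

A slope below 1 (0.75 in the study) means GoldSet performance improves more slowly than validation performance, but the near-perfect linearity is what makes the semi-simulated validation set usable for model selection.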
To better characterize how the performance of semi-simulated models translates to the GoldSet, I generated a number of alternative models with random hyper-parameter choices. As usual when sampling hyper-parameters, a full range of performance is expected, from non-predictive models (AUC close to 0.5) all the way to close to the performance of the best model that can be derived from the dataset, including models of intermediate performance. Figure 2 presents the performance of these alternative models on the GoldSet. This figure shows an almost linear relationship between performance estimates obtained on the semi-simulated ICGC-10 validation set and performance on the GoldSet (for models trained exclusively on ICGC-10 with semi-simulation). These data confirm that semi-simulation can help train models that perform well on a similarly distributed real dataset. Furthermore, the plot establishes that validation performance on the semi-simulated dataset can be used as a guide for selecting a model expected to perform well on a real dataset.

[…] was configured to realign reads around indels, call indels, keep sites with at least one base supporting a variation, and keep sites with a single distinct read index.
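The AUC used throughout to compare models can be computed directly from forecast probabilities and labels via the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen true site scores higher than a randomly chosen false one. A generic sketch, not the implementation used in the study:

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the fraction of (positive, negative) pairs where the positive
    site scores higher (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

This O(P·N) pairwise form is fine for small evaluation sets; production implementations sort once and use ranks instead.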

The OneSampleCanonicalSimulationStrategy was used for semi-simulation; it considers sites canonical when the germline site has up to two alleles that account for more than 90% of bases. The plugin was configured to randomly sample 10% of sites across the genome to yield a semi-simulated training set with
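The canonical-site rule above can be read as a predicate over germline allele counts: a site is canonical when its (up to) two most frequent alleles jointly account for more than 90% of observed bases. A sketch of this interpretation, together with the 10% random site sampling (function names and the seed are illustrative; the actual logic lives in the OneSampleCanonicalSimulationStrategy plugin):

```python
import random

def is_canonical(allele_counts, max_alleles=2, min_fraction=0.90):
    """True when the top `max_alleles` alleles account for more than
    `min_fraction` of all observed bases at the germline site."""
    total = sum(allele_counts.values())
    if total == 0:
        return False
    top = sorted(allele_counts.values(), reverse=True)[:max_alleles]
    return sum(top) / total > min_fraction

def sample_sites(sites, fraction=0.10, seed=42):
    """Randomly keep roughly `fraction` of sites for the training set."""
    rng = random.Random(seed)
    return [s for s in sites if rng.random() < fraction]
```

Restricting semi-simulation to canonical sites ensures that planted mutations are introduced on a clean germline background, so the simulated label, not residual germline noise, determines the training signal.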