Adaptive Somatic Mutations Calls with Deep Learning and Semi-Simulated Data

A number of approaches have been developed to call somatic variation in high-throughput sequencing data. Here, we present an adaptive approach to calling somatic variations. Our approach trains a deep feed-forward neural network with semi-simulated data. Semi-simulated datasets are constructed by planting somatic mutations in real datasets where no mutations are expected. Using semi-simulated data makes it possible to train the models with millions of training examples, a usual requirement for successfully training deep learning models. We initially focus on calling variations in RNA-Seq data. We derive semi-simulated datasets from real RNA-Seq data, which offer a good representation of the data the models will be applied to. We test the models on independent semi-simulated data as well as pure simulations. On independent semi-simulated data, models achieve an AUC of 0.973. When tested on semi-simulated exome DNA datasets, we find that the models trained on RNA-Seq data remain predictive (sens 0.4 & spec 0.9 at cutoff of P > = 0.9), albeit with lower overall performance (AUC=0.737). Interestingly, while the models generalize across assay, training on RNA-Seq data lowers the confidence for a group of mutations. Haloplex exome specific training was also performed, demonstrating that the approach can produce probabilistic models tuned for specific assays and protocols. We found that the method adapts to the characteristics of experimental protocol. We further illustrate these points by training a model for a trio somatic experimental design when germline DNA of both parents is available in addition to data about the individual. These models are distributed with Goby (http://goby.campagnelab.org).


INTRODUCTION
somatic variations in matched RNA-Seq data Sheng et al. [2016].
Methods to rank and filter candidate sites are often developed using one of two approaches. Early 48 development following the introduction of a new assay are often ad-hoc and may include the use of 49 hard filters or other heuristic(s). These methods are widely recognized as being sub-optimal, but are 50 nevertheless widely applied to new assays. We believe that these methods are used in these instances 51 because: (1) they do not require a strong knowledge of probabilistic models, (2) they are easy to implement 52 in software and (3) domain experts can look at results and contribute suggestions to improve the next 53 iteration of the approach and (4) they enable ranking interesting candidates in the early days of an assay to 54 demonstrate its biological interest. At these stages, it is more important to identify a few strong candidates 55 than to obtain optimal sensitivity and specificity across a wide range of datasets. 56 In contrast, probabilistic methods are developed when an assay becomes popular and more data is 57 produced that requires more sensitive or specific ranking and filtering tools. Developing probabilistic 58 methods relies on a model of the source of errors in an assay. Developing this model for a new assay can 59 be a slow process, but once introduced, probabilistic methods frequently outperform ad-hoc approaches 60 by a wide margin. 61 In this manuscript, we describe a third approach to the ranking and filtering of genomic sites. We 62 aimed for an approach that (1) would be fast to develop and implement for a new assay (2) would provide 63 domain experts with the opportunity to contribute to the development and refinement of the approach, 64 (3) would yield state of the art probabilistic models, (4) can be applied to a wide range of assays and  Figure 1. Overview of the approach.. A minimum of two samples are necessary to produce a semi-simulated dataset for training. These samples are chosen from data measured from the same individual such that genotypes should match at the majority of sites measured. One sample is arbitrarily assigned the germline role, and the second sample is assigned the somatic role. These samples are provided to the somatic call simulator (whose design can be informed by expert input), which will produce mutated examples using non-mutated examples provided in the input samples (see Methods). Mutations are always added in the sample labeled "somatic". This process yields a training dataset. The same process is repeated with independent pairs of input samples (typically from distinct subjects) to yield a validation and a test set. The training set is used to train a feedforward neural network until performance measured on a small part of the validation set starts to decrease (early stopping). Performance of the fit model can be estimated on the validation set as well as on other benchmark datasets contributed by the community.
To circumvent this problem, we train deep learning models with semi-simulated datasets. Semi-simulated 101 datasets are constructed by simulating signal into samples where no signal is otherwise expected. For instance, in this project, we use two RNA-Seq samples from different types of immune cells from the 103 same subject. Few if any somatic mutations are expected for most sites of the genome in these cells 1 . 104 We simulate somatic variations in one of the two samples by moving bases and associated features of 105 the real data to another genotype (see Material and Methods for details). The resulting semi-simulated 106 dataset should accurately reflect the properties of the real data used as input, and for instance, it is 107 expected to capture characteristics of the library preparation and sequencing assay used to obtain the data.   The samples depicted in Figure 1 consist of reads aligned to a genome. In order to effectively train a 123 deep learning network, it is necessary to convert alignments to features and labels which are compatible 124 with back-propagation. In this work, we converted alignment data to features using the process shown in  The determination of the optimal mapping of read alignment records to features required experimenta-131 tion and was performed in an iterative fashion until we found that the score on the validation set (used 132 for early stopping) did not improve substantially. This iterative model refinement process also included 133 manual inspection of results obtained on the validation set, to identify any unexpected predictions, and 134 correct software bugs that can lead to them (see Figure 3). Neural network architecture and neural net 135 hyper-parameters were also optimized during this process. The final implementation of feature mapping 136 is modular and makes it possible to plug in or out features derived from an alignment so that different 137 combinations can be tested and compared.

138
Somatic mutation calling for an RNA-Seq assay 139 We applied the process shown in Figure 1 to develop models to rank and filter somatic mutations in

5/13
Manual Inspection Expected errors: Unexpected Errors: Investigate simulator code, correct bugs, retrain model, predict  The table is ordered by model probability (that the site is a somatic mutation) and filtered to show only false positive predictions. Manual inspection of such a table helps identify the most extreme errors made by a model. (B) A low signal to noise ratio, as in this case, is an expected source of false positives and may indicate that a model needs further training. (C) This negative example is indistinguishable from a positive example (it looks just like the simulator planted a mutation, when this was not the case). Such errors are possible if the genotype of germline samples differ at some sites (D) This negative example was wrongly classified as a mutation. Upon further investigation of the simulator code, we identified a software bug. During the feature encoding phase, the order of bases was not consistent from sample to sample. Fixing this bug resolved such errors. (D) In a first attempt at simulation, we trained with datasets restricted to have at least 10 counts across the samples at a site. This prevented the model from learning that such sites are more difficult to predict and resulted in over-estimates of model probability. Removing filters from the construction of the training set (to include sites irrespective of their coverage) resolved this issue. To address this question, we developed models for a trio design. Assume somatic calls are needed 167 for a patient for whom germline DNA is not readily available (because blood is suspected to contain 168 cells with the somatic variations and contamination of other tissues by somatic cells cannot be ruled out).

169
Assuming data from DNA from both parents of the patient are available, it should be possible to call 170 somatic variations in the patient sample. We call this experimental design "trio somatic", in contrast to 171 the "pair somatic" design discussed in the previous section. We combined feature mappers developed for 172 the pair somatic problem such that features that depend on a single sample are calculated successively for 173 the father, mother and patient, and features that depend on two samples are calculated for father/patient 174 and mother/patient pairs. All features were concatenated. As shown in The performance of the models developed in this study is summarized in Figure 4. Each panel shows the where we train the model on the same experimental protocol, we find almost optimal model reliability.

184
In Panel B, we test a model trained on RNA-Seq data on the paired exome data. The purpose of this 185 comparison is to determine how transferable a model trained in one assay is to another assay. We find 186 that the model remained predictive for about 40% of the exome sites, but has lower performance on the 187 exome data. The reliability diagram also confirms that the model performs less reliably across assays.

188
Taken together these data suggest that large performance gains can be obtained by training a model to the 189 specific assay that it will be used for, adapting the model to the assay. While we have not measured this The same model trained on Pair Exome data predicts accurately a subset of high-confidence sites (1), with high predicted probability of mutation, but has degraded performance on other sites (2). This is confirmed by the RC, which shows sub-optimal reliability. specific data analysis protocols.  Training probabilistic methods 210 We have presented a new approach to develop probabilistic models for calling somatic variations in high-211 throughput sequencing data. In contrast to previous studies which have manually designed probabilistic 212 models, we show that it is possible to train probabilistic models using semi-simulated datasets and deep 213 learning methods and that such models adapt to the characteristics of the data produced by specific assays 214 and analysis protocols.
Additional validations will be needed to firmly establish that the models trained in this way are predictive 217 on real datasets. Because of regulatory limitations on data sharing of genotype data, which have hampered 218 our ability to obtain suitable validation data, we chose to distribute the software and the models to make it 219 possible for others to conduct independent evaluations on private datasets. We hope to be able to test the 220 models on the ICGC GoldSet Alioto et al. [2015] (request initiated with ICGC in July 2016, approved 221 Sept 2016, we have yet to obtain access to the data files for the ICGC GoldSet which cannot be located 222 through the ICGC portal. An email to the ICGC support mailing list to request access through the ENA, 223 as per ICGC instructions on the ENA web site has remained unanswered as of end of Sept 2016).

225
Should our approach perform competitively on real datasets, one of its key advantage will be that training 226 can be performed for arbitrary combinations of new assays, experimental designs, and analysis protocol.

244
Of the expected mutation rate 245 We showed that our approach has strong reliability when applied on the same type of assay as that used in 246 the training set. The model outputs the probability that a site is mutated, given the prior distribution of 247 mutated sites in the training set and the reliability diagrams indicate that the forecast probability produced 248 by the model is close to optimal (i.e., lying very close to the diagonal on the reliability diagram). When 249 the proportion of mutated sites in a dataset differs from that used in the training set, as is often the case in 250 practice, it will be necessary to apply Bayes Theorem to adjust for the difference in priors. Doing so will 251 require an estimate of the rate of mutation in the dataset where predictions must be made. Importantly, 252 such adjustments will not change rank order, and for this reason we expect that the model probability 253 output by the software is suitable for ranking somatic mutation in new samples. • the data indicate that at most two alleles are observed in either sample. This determination is made 295 when the counts from more than 2 genotypes, with genotypes ordered in decreasing count order, 296 have to be summed to reach 90% of all base counts for the site.

297
• the alleles identified by the 2 genotype rule match across samples at the site.

298
In this study, 18% of RNA-seq and 6% of Haloplex records were found to be non-canonical. Canonical 299 sites are used by the simulator to create two mutated versions of the data record. The one unmutated and 300 two mutated versions are added to the training set. Making a mutated version of a record consisted of a 301 few steps. We used a simple heuristic to determine the original genotype, and picked a random "source" 302 base from the two alleles in the genotype. We generated a frequency of mutation, f , from the uniform 303 distribution between 0 and 0.5 (heterozygote site) or 1 (homozygote site), and chose a random "target" 304 genotype from any of the genotypes not marked as source. We then subtracted a proportion f of bases 305 from the source genotype, and added them to the target genotype. Each base retained the same features it 306 was associated with in the source base, namely its read index location, quality score, and forward/reverse needed to produce a fixed-length output so that these outputs could be concatenated consistently into  Finding that it did not introduce training instability, the learning rate was set to the starting value 353 of 1, with SCORE decay (the learning rate was decreased when the loss on the validation set stopped