A generalizable deep learning framework for inferring fine-scale germline mutation rate maps

Germline mutation rates are essential for genetic and evolutionary analyses. Yet, estimating accurate fine-scale mutation rates across the genome is a great challenge, due to relatively few observed mutations and intricate relationships between predictors and mutation rates. Here we present MuRaL (Mutation Rate Learner), a deep learning framework to predict mutation rates at the nucleotide level using only genomic sequences as input. Harnessing human germline variants for comprehensive assessment, we show that MuRaL achieves better predictive performance than current state-of-the-art methods. Moreover, MuRaL can build models with relatively few training mutations and a moderate number of sequenced individuals, and can leverage transfer learning to further reduce data and time demands. We apply MuRaL to produce genome-wide mutation rate maps for four representative species - Homo sapiens, Macaca mulatta, Arabidopsis thaliana and Drosophila melanogaster, demonstrating its high applicability. As an example, we use improved mutation rate estimates to stratify human genes into distinct groups which are enriched for different functions, and highlight that many developmental genes are subject to high mutational burden. The open-source software and generated mutation rate maps can greatly facilitate related research.

genomes. We tried training MuRaL models with '1in200' rare variants ( Fig. 2a; allele 170 frequency being 1/200, i.e., occurring once in 100 diploid genomes). Because the '1in200' 171 data had a small number of rare variants and thus a low mutation density across the 172 genome, we generally got low k-mer and regional correlations if using them for 173 calculating observed mutation rates (Fig. 2b, c). However, with more dense rare variant 174 datasets as observed mutations ('10in20000' and '5in1000' data; Fig. 2b, c), the '1in200' 175 MuRaL models achieved much increased regional and k-mer mutation rate correlations, 176 suggesting the high predictive performance of these models. This also implies that if only 177 100 human genomes are available, it is still possible to build reasonably good models by 178 using the rare variants from the 100 individuals. correlations for non-CpG C>A mutations at different scales on Chr20 for 1in200 and 1in2000 models, 194 with grey shades indicating observed mutation rates (based on 10in20000 data) and colored lines for 195 predicted rates. As predicted and observed regional mutation rates had different magnitudes, z-score 196 normalization was applied for visualization. Mutation rates at centromeric regions were not available.

199
Moreover, we showed that AT and non-CpG models trained with singleton variants 200 derived from the '1in200' data had similar performance as those trained with the same 201 amount of singleton variants from the '1in2000' data ( Fig. 2b-d). The mutation rate 202 correlations of the CpG model trained with '1in200' data were also close to that of the 203 model trained with '5in1000' data ( Fig. 2b, c). These results further corroborated that 204 MuRaL can train effective models with a relatively small number of rare variants from a 205 moderate number of sequenced individuals. 206

Other factors that might affect the performance of MuRaL 207
As read coverage can affect mutation calling and is usually accessible for mutation 208 data, we further tried incorporating read coverage into MuRaL (Supplementary Fig. 6; 209 see Methods). In well-covered regions, MuRaL models with coverage had similar 210 performance as those without coverage (Supplementary Fig. 7). Since exonic regions 211 usually experienced stronger selection, we also trained MuRaL models using only non-212 exonic rare variants, results of which also showed similar performance with respect to k-213 mer and regional correlations (Supplementary Fig. 8; see Methods). In addition, we 214 tried training AT models using rare variants from only healthy controls in gnomAD or 215 using rare variants after excluding singletons, the results of which showed little difference 216 (Supplementary Fig. 9; see Methods). 217 We noticed that several chromosomes, such as Chr7, Chr9, Chr15 and Chr16, 218 showed smaller regional correlations than others (Supplementary Fig. 7), likely due to 219 their enrichment for recent segmental duplications 29 . The poor regional correlations of Chr8 was ascribable to under-estimated mutation rates in the region from 0Mb to 25Mb 221 ( Supplementary Fig. 10), a region reported to have a strikingly high mutation rate 30 . The 222 relatively small learning space (2Kb) of MuRaL models may not efficiently capture distinct 223 region-specific mutability signals in these complicated regions. 224 For some mutation types (e.g., CpG>GpG), because their numbers of training 225 mutations were much smaller than others (Supplementary Table 2), related model 226 parameters would be more susceptible to underfitting. 227

MuRaL outperforms existing models 228
We compared MuRaL with several recently published models for estimating mutation 229 rates across the human genome ( Fig. 4; Supplementary Table 3). Among all models, 230 MuRaL used the smallest number of training mutations (1.5 millions in total) and did not 231 rely on any functional genomic data.

237
'10in20000' rare variants for AT and non-CpG models, and '5in1000' rare variants for CpG models. (b)

238
Regional mutation rate correlations with bin sizes of 1Mb, 100Kb and 10Kb on the autosomal genome 239 for different mutation types. The color scheme was the same as that for panel a. (c) An example showing regional mutation rate correlations at different scales on Chr20 for three models (MuRaL, 241 'Carlson 7-mer+features' and 'Karczewski 3-mer'), with grey shades indicating observed mutation rates 242 and colored lines for predicted rates. As predicted and observed regional mutation rates had different 243 magnitudes, z-score normalization was applied for visualization. Mutation rates at centromeric regions

247
For genome-wide correlations between observed and predicted k-mer mutation rates, 248 MuRaL, 'Carlson 7-mer+features' and 'Carlson 7-mer' models performed similarly and 249 were much better than the other two models (Fig. 3a). Though for specific mutation types 250 such as A>C and CpG>GpG, 5-mer and 7-mer mutation rate correlations of the Carlson 251 models were slightly better than MuRaL, MuRaL showed better performance in almost all 252 correlations for 9-mer mutation rates, probably because MuRaL considered sequence 253 context beyond 7-mers. At the chromosome level, the patterns were similar to the 254 genome-wide patterns (Supplementary Figs 11-13). 255 For correlations of regional mutation rates, MuRaL performed better than any other 256 model at the genome-wide level (Fig. 3b). In general, the superiority of MuRaL was more 257 pronounced when the bin sizes were smaller, suggesting that MuRaL predicted improved 258 mutation rates at finer scales compared to previous models. (Fig. 3b,

c; Supplementary 259
Figs 11-13). It is worth noting that, if aggregating three mutation types associated with 260 the same reference base (e.g., merging A>C, A>G and A>T mutations), at the 1Kb scale 261 MuRaL models still achieved regional correlations of ~0.3 (Supplementary Fig. 14). At 262 the chromosome level, MuRaL performed best for most chromosomes and most mutation 263 types (Supplementary Figs 11-14). For chromosomes that MuRaL had relatively low  Although previous models using only local sequence context (3-mers or 7-mers) 268 generally had positive correlations for regional mutation rates, for specific mutation types 269 (especially non-CpG C/G mutations), they had poor or even negative correlations (Fig.  270 3b; Supplementary Figs 11-14). This indicates that a short adjacent sequence cannot We noticed that coefficients of variation (CVs) of regional mutation rates from all the 273 models were much smaller than that of observed regional mutation rates at different 274 scales (1Mb, 100Kb and 10Kb; Supplementary Fig. 16). Although larger CVs of 275 observed mutation rates could be partly due to sampling noise (especially for small bin 276 sizes), the big differences between CVs of observed and predicted mutation rates 277 suggested an aspect that needs to be improved in future. 278

Training with DNMs and transfer learning 279
The number of published DNMs in humans is much smaller than that of rare variants.

286
For each model, the lowest loss (mean cross-entropy loss) for each of ten trials was used to generate 287 the boxplots. (b) 3-, 5-, and 7-mer mutation rate correlations for different mutation types, based on 288 predicted single-nucleotide mutation rates on human Chr1. Colors depict three types of models like that 289 in panel a. For each model in panel a, the best trial with the lowest validation loss was used for 290 predicting mutation rates on Chr1. The mutations for calculating observed mutation rates were human 291 DNMs. (c) Regional mutation rate correlations with a 1Mb bin size on Chr1. The predicted mutation 292 rates of multiple mutation types (e.g. A>C/A>G/A>T) were aggregated for calculating regional 293 correlations, as some mutation types had very few observed DNMs in the data. Smaller bin sizes were 294 not assessed due to few DNMs. Colors depict three types of models like that in panel a. P-values of all 295 correlation tests performed for panels b and c were provided in Supplementary Data 1.

296
Transfer learning is widely used in deep learning for scenarios in which the 297 prediction tasks are similar but less training data is available. To study the effectiveness 298 of transfer learning in the MuRaL framework, we trained transfer learning models with 299 same DNMs, using pre-trained weights from aforementioned rare-variant models for 300 parameter initialization. With independent validation DNMs (see Methods), we found that 301 models with transfer learning achieved significantly lower validation losses than those 302 without transfer learning (ab initio DNM models; Fig. 4a). Transfer learning models also 303 showed generally better k-mer and regional correlations when using DNMs as observed 304 mutations (Fig. 4b, c). 305 Furthermore, we computed validation losses of the validation DNMs using the rare-306 variant models described in previous sections. Compared to the ab initio DNM models, 307 the rare-variant models achieved lower or similar validation losses for all three categories 308 of mutations (Fig. 4a). When looking at k-mer and regional correlations, rare-variant 309 models generally performed better than DNM ab initio models, and similarly to the DNM 310 transfer learning models (Fig. 4b, c). This indicates that if DNMs are unavailable, we can 311 reasonably use mutation rates predicted by rare-variant models to approximate de novo 312 mutation rates. 313 When DNMs are available but limited, it might be beneficial to train transfer learning 314 models with DNMs using pre-trained weights of rare-variant models. However, we note 315 that DNMs collected from different studies could introduce biases, which need to be 316 considered for transfer learning. For example, we found that our collected DNMs were 317 substantially depleted in low-complexity regions and segmental duplications relative to rare variants (Supplementary Fig. 17), probably due to conservative variant calling 319 procedures. 320

Generating mutation rate maps for other species 321
We further applied MuRaL to estimate mutation rates for three other species. For 322 species that are evolutionarily close to humans, their genomes have high sequence 323 similarities with the human genome, and many mutational processes are likely shared 324 between them. Hence transfer learning can be leveraged for those species. The rhesus 325 macaque (Macaca mulatta) is a close relative of humans and a widely used model 326 organism. We trained ab initio MuRaL models as well as transfer learning models for M. 327 mulatta using the rare variants from a dataset of 853 individuals 31 . The training data size 328 of transfer learning models was 30% of that for ab initio models (Supplementary Table 5; 329 see Methods). Transfer learning models showed similar performance to that of ab initio 330 models ( Fig. 5a, b), though transfer learning models used less training data and 331 computation time (Supplementary Fig. 18). 332 333 Figure 5 Application of MuRaL to other species. (a) 3-, 5-, 7-and 9-mer mutation rate correlations for 334 different mutation types, based on predicted single-nucleotide mutation rates on rheMac10 Chr20 by two 335 kinds of models for M. mulatta: ab initio models and transfer learning models. Rare variants of M. mulatta were used for calculating observed mutation rates. Separate models were trained for A/T sites, 337 non-CpG C/G sites and CpG sites, respectively. (b) Regional mutation rate correlations with bin sizes of 338 1Mb, 100Kb and 10Kb on rheMac10 Chr20 for different mutation types. (c) 3-, 5-, and 7-mer mutation 339 rate correlations for different mutation types, based on predicted single-nucleotide mutation rates for the 340 A. thaliana genome. Rare variants of A. thaliana were used for calculating observed mutation rates.

341
Separate models were trained for A/T sites, non-CpG C/G sites and CpG sites, respectively. (d)

342
Regional mutation rate correlations with bin sizes of 500Kb, 100Kb and 10Kb for the A. thaliana genome.

343
(e) An example showing regional mutation rate correlations at different scales on Chr3 for A. thaliana, 344 with grey shades indicating observed mutation rates and colored lines for predicted rates. As predicted 345 and observed regional mutation rates had different magnitudes, we applied z-score normalization for

349
We also trained ab initio MuRaL models for two model organisms that are 350 evolutionarily distant to humans -Drosophila melanogaster and Arabidopsis thaliana. As 351 these genomes (<200Mb) are much smaller than the human genome, we used only 352 100,000 mutations for training each of the models (Supplementary Tables 6-7; see 353 Methods). Despite a relatively small amount of training data, predicted results of trained 354 models suggested that MuRaL worked well in these species in terms of k-mer and 355 regional correlations (Fig. 5c-e; Supplementary Fig. 19). For example, for A. thaliana, at 356 the scale of 10Kb bins, regional correlations were >0.5 for all mutation types (Fig. 5d, e), 357 indicating high effectiveness of our method in this species. For D. melanogaster, we used 358 singleton variants from only 205 inbred lines for training and still obtained promising 359 results ( Supplementary Fig. 19), further demonstrating that MuRaL can be applied to 360 scenarios with a relatively small number of sequenced genomes. 361

Using MuRaL-predicted mutation rates for other analyses 362
Based on MuRaL-predicted mutation rates, we generated meta-gene mutation rate 363 plots for regions around coding genes of the human genome. We found that promoter 364 regions exhibited notably elevated average mutation rates (Fig. 6a), as reported 365 previously 32 , indicating that they are mutational hotspots. The high-resolution MuRaL 366 mutation rates allowed us to cluster genes based on mutation rate profiles and plot fine-367 scale heatmaps (Fig. 6b,c), which could not be done with few observed variants. 368 ( Fig. 6d; Supplementary Fig. 20), and such stratification was reflected by meta-gene 371 plots of observed rare variants (Fig. 6b). Notably, the gene cluster with elevated mutation 372 rates in both gene bodies and flanking regions was enriched with various development-373 related GOs (Fig. 6d), suggesting that many developmental genes are subject to high 374 mutational burden. This might have important implications in evolution and disease. It is 375 worth noting that clustering using 'Carlson 7-mer+features' and 'Karczewski 3-mer' rates 376 did not generate stratifications compatible with observed variants (Supplementary Fig.  377   21).

387
As another example, based on MuRaL mutation rates and gnomAD variants, we 388 generated mutational tolerance scores for sliding 500bp windows across the human 389 genome using the depletion rank (DR) method in a recent UK Biobank study 33 (see 390 Methods). We found that regions with lowest DR scores were enriched for pathogenic 391 variants in ClinVar, and the enrichment was significantly higher when using MuRaL 392 mutation rates for DR calculation than using Carlson mutation rates (Fisher's exact test, p 393 < 2.2e-16 for lowest 10% DR, Supplementary Fig. 22). This indicates that DR scores 394 derived from MuRaL mutation rates are informative for prioritizing disease-related 395 variants. 396

397
Estimation of sequence mutation rates in the genome can be traced back to the very 398 early period of molecular evolution research 34 . Being hindered by the lack of genomic 399 data, early work only obtained rough estimates of mutation rates for specific genes or 400 genomes. In past few years, by taking advantage of large-scale variant data, several 401 methods have been proposed to infer fine-scale mutation rates, but there was much 402 room for improvement. 403 In this work, we developed the MuRaL framework to address the challenge. 404 Compared with the previously best-performed model by Carlson  The generated mutation rates and software are relevant for many analyses. For 415 example, we demonstrated that the high-resolution mutation rates are useful for 416 identifying genes or regions with high mutability in the human genome. The predicted 417 mutation rates can help improve previous methods of calculating mutation intolerance 418 scores 11,15,33,35 for prioritizing disease candidate genes or variants, as exemplified by the 419 re-calculated DR scores in our work. The predicted mutation rates should also be 420 informative for detecting regions undergoing selection or introgression in recent evolution. 421 As there are already many sequenced genomes for different human populations, it 422 should not be difficult to generate population-specific mutation rate maps with MuRaL. 423 Comparison of mutation rate maps between populations or species can advance our 424 understanding of mutation rate evolution as well as underlying mutational mechanisms. learning hardware, we believe computational load will become a minor obstacle soon. 431 The MuRaL framework can be extended to estimate mutation rates for sex chromosomes 432 and organelle genomes, though more specific assessments are required. Similar models 433 could also be developed to predict fine-scale mutation rates for other mutations such as 434 small insertions and deletions. 435 To our knowledge, this is the first time that deep learning is used to estimate fine-436 scale germline mutation rates. Unlike many deep learning models in genomics designed 437 for typical classification or regression problems, our method aimed to predict accurate 438 class probabilities. This work provided an exemplary case for addressing similar 439 problems in genomics. How to obtain reliable class probabilities in deep learning models 440 is still a hot research topic in computer science 36 . The smaller CVs of predicted regional 441 mutation rates than that of observed rates might be partly due to the large class imbalance in the training data, which often causes issues in deep learning models. 443 Future progress on these topics in computer science may help improve our model. 444 In summary, we developed a generalizable framework for inferring fine-scale 445 mutation rates of genomes, which can facilitate and stimulate future research in related 446

Design of the MuRaL model 450
Previous studies revealed that adjacent nucleotides of a specific site predominantly 451 affect its mutation rate and properties of a larger sequence context (e.g., GC content, 452 replication timing) are also associated with mutation rate variation. As local and distal 453 sequences likely affect mutation rates in different ways, we constructed two different 454 neural network modules to learn the signals from the two aspects. One module (termed 455 'local' module) was designed for learning signals from a local sequence of the focal 456 nucleotide, the other (termed 'expanded' module) for learning signals from an expanded 457 sequence (Supplementary Fig. 1). 458 The 'local' module consists of an embedding layer and three fully-connected (FC) 459 layers to learn signals from the sequence. We used k-mer embedding because it was 460 reported to offer benefits for deep learning models in genomics 37 . The input local 461 sequence was firstly split into overlapping k-mers and the embedding layer maps these ResNets was followed by a FC layer to produce probabilities of four-class classification of 480 input samples, which were then combined with equal weights to form the probabilities of 481 the 'expanded' module. 482 Finally, probabilities of 'local' and 'expanded' modules were combined using equal 483 weights to form a vector of combined probabilities. 484 The key hyperparameters in the MuRaL model are the length of local sequences, the 485 length of expanded sequences, the length of k-mers in the embedding layer, sizes of two 486 hidden FC layers in the 'local' module, the kernel size and the number of channels for 487 convolutional networks in the 'expanded' module (see Supplementary Fig. 1). 488

Model implementation 489
We implemented the MuRaL model with PyTorch framework 39 , along with APIs from 490 pybedtools 40 and Janggu 41 . For model training, we used the cross-entropy loss function 491 and the Adam optimizer 42 for learning model parameters, and employed Ray Tune 43 to 492 facilitate hyperparameter tuning (Supplementary Fig. 1). The scheduler 493 'ASHAScheduler' in Ray Tune was used to coordinate trials and execute early stopping 494 before reaching the specified maximum number of training epochs (e.g., 10), which can 495 substantially reduce the training time. The mean cross-entropy loss of the validation sites 496 (i.e., validation loss) was calculated at the end of each training epoch. We further set a 497 stopping rule to terminate a trial if three consecutive epochs did not obtain a validation 498 loss smaller than the current minimum validation loss. The 'learning rate' and 'weight 499 decay' of Adam optimizer were two hyperparameters that could affect the learning 500 performance significantly. We used Ray Tune to run trials with different values for the two 501 hyperparameters. To have better convergence, we used the learning rate scheduler 502 'lr_scheduler.StepLR' in PyTorch to decay the learning rate after a number of steps by a 503 specified factor. 504 Rare variants generally arose recently in the genome and were less affected by 507 natural selection and nonadaptive evolutionary processes than common variants. 508 Previous studies 8-10 have established that rare variants can be used for estimating 509 mutation rates. For model training and evaluation, we took advantage of the gnomAD 510 database (v2.1.1) which contained genetic variation of 15,708 whole genomes 15 . Only 511 single nucleotide substitutions in autosomes were considered, as other mutation types 512 and sex chromosomes have specific features that need to be modeled separately. We 513 extracted rare variants from gnomAD to approximate DNMs. 514 When the sample size is as large as that of gnomAD, some mutation types (e.g., 515 CpG>TpG) with high mutation rates could be close to saturation, and the probability of 516 multiple independent mutations (recurrence) at a same position increases. Therefore, we 517 downsampled the gnomAD data into a specified total allele count using a hypergeometric 518 distribution (see the probability density function below), and generated the random 519 alternative allele counts from the hypergeometric distribution: 520 where AN and AC are the total allele count and the alternative allele count in original 521 data, respectively, ANdown and ACdown are the total allele count and the alternative allele 522 count after downsampling, respectively. For each polymorphic position, given values of 523 AN, AC and ANdown, we generated a random number for ACdown using the hypergeometric 524 distribution of equation (1). We then extracted the variants with a specific alternative 525 allele frequency (i.e., ACdown/ ANdown) in the downsampled data to form the rare variant 526 datasets for subsequent analyses. and '1in7000', respectively. We considered the sample size of 7000 because Carlson et mutation rates. In addition, Karczewski et al. 15 used variants with ≤5 copies in a 534 downsampled set of 1000 haploid genomes for mutation rate estimation, so we also did 535 downsampling for the sample size of 1000 and extracted the variants of ACdown ≤5 536 (termed '5in1000'). To increase the mutation density for smaller-scale evaluation, we 537 further did downsampling for the sample size of 20000 and extracted the variants of 538 ACdown ≤10 (termed '10in20000'). For SNVs with two or more different alternative alleles 539 in the original gnomAD data, we did hypergeometric sampling for each alternative allele. 540 In the downsampled dataset, if more than one alternative allele satisfied the rare variant 541 criterion for a specific position, only one alternative allele was randomly selected for 542 downstream analyses (i.e., not allowing multiple rare variants at one position). The 543 numbers of rare variants in different datasets were summarized in Supplementary Table  544 1. 545

De novo mutations 546
We collected DNMs from the gene4denovo database 44 for analysis. Because some 547 data sources in gene4denovo database contributed only a small number of DNMs and 548 different studies used distinct methods for variant calling, we used only the DNMs from 549 three large-scale studies 45-47 for our analysis, which consisted of 445,467 unique de novo 550

SNVs. 551
To check whether the extracted rare variants can well represent properties of DNMs, 552 we compared mutation spectra of rare variants and DNMs. We counted the occurrences 553 of 1-mer and 3-mer mutation types for each dataset and calculated the relative proportion 554 of each mutation type in the specific dataset. We found that mutation spectra of rare 555 variants were highly similar with that of DNMs (Supplementary Fig. 2). However, when 556 the sample size increased, the proportion difference in CpG>TpG mutation subtypes 557 between rare variants and DNMs became larger (Supplementary Fig. 2). This was not 558 surprising as mutation rates of CpG>TpG mutation subtypes were highest among all. 559 Because genomic regions with too low or too high read coverage could have a high 560 probability of false positives/negatives for mutation calls, we utilized the coverage The genome-wide mean coverage per individual in the gnomAD data was 30.5, and 563 genomic positions within the coverage range of from 15 to 45 (2,626,258,019 bp in 564 autosome retained and considered as high-mappability sites) were used for downstream 565 analyses. 566

Training and validation data for human MuRaL models 567
For the human data, we trained separate models for A/T sites, non-CpG C/G sites 568 and CpG C/G sites, respectively. For training each MuRaL model, we randomly chose 569 500,000 mutations and 10,000,000 non-mutated sites. During training, we used an 570 independent validation dataset consisting of 50,000 mutations and 1,000,000 non-571 mutated sites for evaluating training performance. The configuration of key 572 hyperparameters for human MuRaL models was provided in Supplementary Table 8. As 573 shown above, rare variants derived from a large sample of population could lead to 574 depletion of mutation types of high mutability. On the other hand, rare variants derived 575 from a small sample were relatively ancient and more affected by selection or other 576 confounding processes. To balance the two constraints, we used the '1in2000' data for 577 training models of A/T sites and non-CpG C/G sites in human. Because '1in2000' data 578 showed more depletion of CpG>TpG mutations than '1in200' and '5in1000' data, it was 579 not the ideal data for training the model of CpG sites. Although both '1in200' and '5in1000' 580 datasets were rare variants of ACdown/ANdown≤0.005, we chose '5in1000' data for training 581 the CpG model because of its larger number of mutations. For detailed model evaluation, 582 we mainly used '10in20000' rare variants for A/T and non-CpG models, and '5in1000' 583 rare variants for CpG models due to their high mutation densities (Supplementary Table  584 1). 585

Calibrating and scaling predicted probabilities 586
The main aim of our work is to obtain reliable class probabilities rather than accurate 587 classification (e.g., predicting ones or zeros). Because probabilities from neural networks 588 are usually not well calibrated, after training a MuRaL model, we applied a Dirichlet 589 calibration method 27 on the output combined probabilities to obtain better calibrated probabilities. Parameters of a Dirichlet calibrator were estimated by fitting the calibrator to 591 the predicted probabilities of the validation data. Metrics such as Expected Calibration 592 Error (ECE), classwise-ECE and Brier score 27 were used for evaluating the performance 593 of Dirichlet calibration. By comparing predicted mutation rates of validation data before 594 and after calibration, we found that Dirichlet calibration indeed resulted in better ECE, 595 classwise-ECE and Brier scores (Supplementary Fig. 23), although the improvements 596 appeared to be relatively small. Small values of ECE and classwise-ECE scores before 597 calibration (Supplementary Fig. 23) suggested that the original predicted mutation 598 probabilities were already quite well calibrated in terms of such metrics. 599 The absolute values of above combined probabilities were not mutation rates per bp 600 per generation. To obtain a mutation rate per bp per generation for each nucleotide, we 601 may scale the calibrated probabilities using previously reported genome-wide DNM 602 mutation rate and spectrum per generation. The scaling factor ( in the equations 603 below) for a specific site group (e.g. A/T sites) can be calculated using following where ( ) is the expected number of mutations of a specific site group, is 606 the per-base per-generation mutation rate obtained from DNM data, | | is the number of 607 considered genomic sites ( ) for calculating scaling factors, is the proportion of the 608 specific mutation types in the DNM dataset, and is the proportion of the specific 609 site group in the genome. and are single-nucleotide mutation probabilities 610 before and after scaling. For the human genome, we used the of 1.2 x 10 -8 611 (according to a recent study 48 ) and validation sites of MuRaL models ( ) to calculate the 612 scaling factors, and then generated scaled mutation rates. We note that whether to do or 613 not do this scaling does not affect the calculation of k-mer and regional mutation rate

Extending MuRaL with read coverage 616
Read coverage (or read depth) of alignments can affect variant calling and thus 617 observed mutation densities. We tried extending the MuRaL model to incorporate the 618 coverage information (Supplementary Fig. 6). We used the pre-compiled coverage track 619 from gnomAD (v2.1.1). In the 'local' module, we calculated mean coverage of the local 620 sequence of the focal nucleotide, and added it as an additional element to the 621 concatenated vector of embeddings of the local sequence. In the 'expanded' module, we 622 extracted a coverage vector for the nucleotides of the expanded sequence, and merged it 623 with the one-hot encoded matrix of the expanded sequence to form a five-channel input 624 for subsequent convolutional networks. Such a design can also easily incorporate other 625 genome-wide tracks (e.g., replication timing, recombination rate, etc.) to extend the 626

Correlation analysis of k-mer mutation rates 628
We classified mutations into six mutation types according to the reference and For the ith k-mer subtype, we calculated the observed mutated rate and the 641 predicted mutation rate in the considered regions: 642 where mi is the observed number of mutated sites belonging to that k-mer subtype, is 643 the total number of sites harboring the reference k-mer motif (e.g., all AAT 3-mers for the 644 subtype A[A>C]T), and is the predicted mutation probability of the jth valid site. 645 Based on the calculated observed and predicted k-mer mutation rates, we can 646 calculate the Pearson correlation coefficient for any set of k-mer subtypes (one subtype 647 as a datapoint). For example, we can calculate a correlation coefficient for 16 3-mer 648 subtypes associated with the mutation type A>C in a specific chromosome. Note that for 649 CpG sites, there are only four 3-mer subtypes for a basic mutation type as the +1 position 650 is fixed to be 'G'. The Dirichlet calibrated mutation rates were used for calculating k-mer 651 mutation rates unless otherwise specified. 652 Correlation analysis of regional mutation rates 653 Since it was impossible to evaluate accuracy of predicted mutation rates on the 654 single-nucleotide level, we compared average mutation rates in binned regions and 655 calculated correlations between observed and predicted regional mutation rates to 656 evaluate the performance of predicting models. 657 More specifically, for a specific mutation type, we calculated the observed and 658 predicted mutation rates as below. 659 First, we divided a specified region (e.g., a chromosome) into non-overlapping bins 660 with a given bin size (e.g. 10kb, 100kb, etc.). For the ith binned region, we calculated the 661 observed mutated rate and the predicted mutation rate , 662 where is the number of observed mutations of the specific mutation type (e.g. A>C), 663 is the total number of sites with same base as the reference base of that mutation type 664 (e.g. all A/T sites for the mutation type A>C), and is the predicted mutation probability Based on the calculated observed and predicted regional mutation rates, we can 667 calculate the Pearson correlation coefficient for any set of binned regions (one bin as a 668 datapoint). For example, we can calculate the correlation coefficient for regional mutation 669 rates of the mutation type A>C in all 100kb bins in a chromosome. As there are gaps and 670 low-mappability regions in the genome, to avoid using regions with few valid sites for 671 correlation analysis, we only used the bins that fit the criterion > 20% * , where 672 is the median of numbers of valid sites in all bins for a chromosome. The 673 Dirichlet calibrated mutation rates were used for calculating regional mutation rates 674 unless otherwise specified. 675

Comparison of models with different network architectures 676
To see how the 'local' and 'expanded' modules contribute to the model, we 677 considered three models -the 'local-only' model containing only the 'local' module, the 678 'expanded-only' model containing only the 'expanded' module and the full model with 679 both modules. We trained the models with a training dataset of 500,000 mutated and 680 10,000,000 non-mutated sites randomly selected from autosomes. For each network 681 architecture, ten trials were trained with Ray Tune. A validation dataset consisting of 682 50,000 mutated and 1,000,000 non-mutated sites was used to compare performance of 683 three architectures based on the validation losses of trained trials. We also used the best 684 trained trial with lowest validation loss for each of three architectures to predict mutation 685 probabilities on the whole chromosome of human Chr20, results of which were then 686 passed to compute k-mer/regional mutation rates for model comparison. 687

Comparison of models with different hyperparameters 688
Due to the high demand for GPU memory and computing, it is impossible to test the 689 behaviors of all model hyperparameters comprehensively. At the beginning, we set a 690 relatively large search space for hyperparameters and used Ray Tune to run dozens of 691 trials to get more narrowed ranges. We further detailedly investigated the impact of two 692 input-related hyperparameters -'local radius' (length of the local sequence on each side 693 of the focal nucleotide) and 'distal radius' (length of the expanded sequence on each side fixed the setting of other hyperparameters. We used a training dataset consisting of 696 100,000 mutated and 2,000,000 non-mutated A/T sites and a validation dataset 697 consisting of 50,000 mutated and 1,000,000 non-mutated A/T sites. For each setting of 698 hyperparameters, we ran ten trials and used the model with lowest validation loss of each 699 trial for comparison (Supplementary Fig. 3). 700 We also investigated the impact of different training data sizes. We tried four 701 different numbers of A/T sites as training data: 1) 50,000 mutated+1,000,000 non-702 mutated; 2) 100,000 mutated+2,000,000 non-mutated; 3) 200,000 mutated+4,000,000 703 non-mutated; and 4) 500,000 mutated+10,000,000 non-mutated. For model evaluation, 704 we used a validation dataset consisting of 50,000 mutated and 1,000,000 non-mutated 705 A/T sites to calculate validation losses for comparison (Supplementary Fig. 5). 706

Comparison of MuRaL models with previously published models 707
We considered the following four published models in our comparative analysis: 2) 'Carlson 7-mer' model 9 : this model used 7-mer mutation rates estimated from 36 716 million singleton variants from 3560 individuals. Note that some 7-mers didn't have any 717 observed mutations and thus had mutation rates of zero, which was a limitation of this 718 method. We downloaded the 7-mer mutation rates from 'Supplementary Data1' of the 719 study and generated mutation rates of all bases in human autosomes. 720 3) 'Carlson 7-mer+features' model 9 : this model used 7-mer mutation rates of the 721 'Carlson 7-mer' model and 14 genomic features for modeling. We noticed that some sites 722 had zero mutation rates for specific mutation types. In addition, this model did not CpG sites with different methylation levels. We downloaded the 3-mer mutation rates 731 from ' Supplementary Dataset 10' of the study and generated mutation rates of all bases 732 in human autosomes. We used the same methylation data as that described in the study 733 for predicting mutation rates at CpG sites. 734 When performing comparative analyses between our models and other existing 735 models, we excluded the genomic sites without predictive values in at least one model. In 736 total, 2,390,435,721 bases of the autosome genome were used in comparison. Note that 737 among the four existing models, the 'Carlson 7-mer+features' model had strongest data 738 requirements for prediction and its mutation rate profile contains the smallest number of 739 predicted sites. We calculated k-mer and regional mutation rate correlations for the four 740 models using the same method as that for MuRaL models. 741 The MuRaL models used in comparative analysis were those trained with 500,000 742 mutated and 10,000,000 non-mutated sites. The number of trainable parameters for each 743 of AT, non-CpG and CpG models was 86,904. The total number of trainable parameters 744 (86,904*3=260,712) was smaller than that of the 'Carlson 7-mer+features' model 745 (392,128) 9 . 746

Transfer learning 747
Transfer learning is widely used in deep learning for scenarios in which the 748 prediction tasks are similar. After training MuRaL models with rare variants from gnomAD, 749 we took advantage of published human DNMs to perform transfer learning. For each of 750 the AT and non-CpG models, we compiled a training dataset consisting of 150,000 DNMs 751 and 3,000,000 non-mutated sites and an independent validation dataset consisting of training dataset consisting of 50,000 DNMs and 1,000,000 non-mutated sites and a 754 validation dataset consisting of 20,000 DNMs and 400,000 non-mutated sites. We tried 755 two transfer learning strategies: 1) using all pre-trained weights for model initialization 756 and re-training all weights and 2) use all pre-trained weights for model initialization but 757 only re-training the weights of last FC layers of two modules. We chose the results of the 758 first strategy for later comparative analysis, as the second strategy led to poor 759 performance. We also trained ab initio models using the same DNM training datasets, 760 with the same hyperparameter setting as that for the rare-variant models. Because we 761 found that the collected DNMs were highly depleted in low-complexity regions and 762 segmental duplications, we excluded DNMs located in these regions from training and 763 evaluating models. 764 We trained ab initio models as well as transfer learning models for M. mulatta. For 783 ab initio models, we compiled a training dataset consisting of 500,000 mutated and 784 10,000,000 non-mutated sites, and an independent validation dataset consisting of 785 50,000 mutated and 1,000,000 non-mutated sites. We used the same hyperparameter 786 setting as that for human ab initio models. For transfer learning models, we compiled a 787 training dataset consisting of 150,000 mutated and 3,000,000 non-mutated sites and an 788 independent validation dataset consisting of 50,000 mutated and 1,000,000 non-mutated 789 sites. For each model, ten trials were run and the checkpointed model with lowest 790 validation loss among all trials was used for prediction. 791 The variant file of A. thaliana was downloaded from 1001 Genomes project 792 (https://1001genomes.org/) 50 , which included 12,883,854 polymorphic sites for 1135 793 inbred lines. The variants of each individual in the VCF file were all homozygotes 794 because of long-term inbreeding and thus the lowest AC is 2. We excluded the poorly 795 mapped genomic regions by using the coverage information from the 1001 Genomes 796 project. We first calculated the average read depth across 1135 lines for each nucleotide 797 and the mode of the rounded average depths across the genome was 21. We retained 798 the positions whose average read depth was within the range of 10-30 (102,069,978 799 sites in total). For training and validating the AT model, we used singleton variants by 800 requiring AC to be 2 and AN of >= 1000. Because there is a high mutation rate of C>T at 801 both CpG and non-CpG C/G sites, the C>T mutations were depleted in the singleton rare 802 variants. We further compiled a rare variant dataset by requiring AC <= 10 and AN >= 803 1000 for training and validating non-CpG and CpG models. For each of AT, non-CpG and 804 CpG models, we randomly selected 100,000 rare variants and 2,000,000 non-mutated 805 sites for training, and 10,000 rare variants and 200,000 non-mutated sites for validation. 806 For the CpG model, we randomly selected 50,000 rare variants and 1,000,000 non-807 mutated sites for training, and 5,000 rare variants and 100,000 non-mutated sites for 808 validation. 809 variant file in VCF format contained 3,837,601 polymorphic sites (excluding sex 812 chromosomes and heterochromatic sequences). Because of being derived from inbred 813 lines, each polymorphic site was homozygous for each individual and original AN and AC 814 tags in the variant file were corresponding to counts of individuals rather than alleles. We 815 extracted 702,864 singleton rare variants by requiring AC to be 1 and AN of >= 100. Then 816 the dataset of rare variants was divided into A/T sites (285,374) and C/G sites (418,713), 817 respectively. Because there is little methylation at CpG sites in the genome of D. 818 melanogaster 52 and the mutation rate of CpG>TpG is not exceptionally high in this 819 species, we did not separate non-CpG and CpG C/G sites for training. For each of the AT 820 and CG models, we randomly selected 100,000 rare variants and 2,000,000 non-mutated 821 sites for training, and 10,000 rare variants and 200,000 non-mutated sites for validation. 822 The configurations of hyperparameters for MuRaL models of the three species were 823 provided in Supplementary Tables 9-11. For each training task, the checkpointed model 824 with lowest validation loss among all trials was used for predicting base-wise mutation 825 rates in the whole genome. The calculation of k-mer and regional mutation rate 826 correlations was the same as that for the human data. 827

Mutation rates around human genes 828
We downloaded human gene annotations from GENCODE 53 (v37) and extracted the 829 longest transcript for each protein-coding gene as its representative transcript. 830 DeepTools 54 (v3.5.1) was used for plotting meta-gene profiles and heatmaps for mutation 831 rates around gene regions, as well as doing K-means clustering. We tried K-means 832 clustering with multiple K values. Because CpG-related variants were depleted in 833 observed rare variants, we only considered A/T and non-CpG C/G sites for this analysis. 834 ClusterProfiler 55 (v4.4.4) was used to perform GO enrichment analysis and enriched 835 'biological process' GO terms were used for visualization. 836

Depletion rank (DR) 837
We calculated the DR scores using the same method as that in Halldorsson et al. 33 . 838 For each of sliding 500-bp windows with a 50-bp step across the genome, we computed