A Deep Learning approach predicts the impact of point mutations in intronic flanking regions on micro-exon splicing definition

While mammalian exons are on average 140-nt-long, thousands of human genes harbor micro-exons (≤ 39 nt). Large numbers of micro-exons have their splicing altered in diseases such as autism and cancer, and yet there is no systematic assessment of the impact of point mutations in intronic flanking-sequences on the splicing of a neighboring micro-exon. Here, we constructed a model using the Convolutional Neural Network (CNN) to predict the impact of point mutations in intronic-flanking-sequences on the splicing of a neighboring micro-exon. The prediction model was based on both the sequence contents and conservation among species of the two 100-nt intronic regions (5’ and 3’) that flank all human micro-exons and a set with the same number of randomly selected long exons. After training our CNN model, the micro-exon splicing event prediction accuracy, using an independent validation dataset, was 0.71 with an area under the ROC curve of 0.76, showing that our model had identified sequence patterns that have been conserved in evolution in the introns that flank micro-exons. Next, we introduced in silico point mutations at each of the 200 nucleotides in the introns that flank a micro-exon and used the trained CNN algorithm to predict splicing for every mutated intronic sequence version. This analysis identified thousands of point mutations in the flanking introns that significantly decreased the power of the CNN model to correctly predict a neighboring micro-exon splicing event, thus pointing to predictive bases in intronic regions important for micro-exon splicing signaling. We found these predictive bases to locate within conserved RNA-binding-motifs for RNA-binding-proteins (RBPs) known to relate to micro-exon splicing. Experimental data of minigene splicing reporter changes upon intron-base point-mutation confirmed the effect predicted by the CNN model for some of the micro-exon splicing events. The model can be used for validating novel micro-exons de novo assembled from RNA-seq data, and for an unbiased screening of introns, identifying genomic bases that have high micro-exon-splicing predictive power, possibly revealing critical point mutations that would be related in a yet unknown manner to a given disease.

neighboring micro-exon splicing event, thus pointing to predictive bases in intronic regions important 32 for micro-exon splicing signaling. We found these predictive bases to locate within conserved RNA-33 binding-motifs for RNA-binding-proteins (RBPs) known to relate to micro-exon splicing. 34 Experimental data of minigene splicing reporter changes upon intron-base point-mutation confirmed 35 the effect predicted by the CNN model for some of the micro-exon splicing events. The model can be 36 used for validating novel micro-exons de novo assembled from RNA-seq data, and for an unbiased 37 screening of introns, identifying genomic bases that have high micro-exon-splicing predictive power, 38 possibly revealing critical point mutations that would be related in a yet unknown manner to a given 39 disease.

Introduction 41
In eukaryotes, splicing events in pre-mRNAs from several mature transcripts culminate in the 42 production of multiple protein isoforms produced from the same gene structure. These splicing 43 events involve very precise and specific mechanisms which add another layer of complexity in gene 44 regulation (Pan et al., 2008). About 90-95% of multi-exon genes are estimated to have alternative 45 splicing isoforms, affecting the variability of expression between cells and tissues (Wang et al.,46 2008), and modifying cell localization and abundance of various protein isoforms that alter gene 47 regulation of the cell (Gallego-Paez et al., 2017). In order to be a dynamic and well-orchestrated 48 mechanism, several factors influence the splicing process such as spliceosome formation, 49 involvement of RNA-binding proteins (RBPs) and participation of regulatory sequences such as 50 intronic splicing enhancers/silencers (ISE/ISS) and exonic splicing enhancers/silencers (ESE/ESS), 51 among others (Wang et al., 2015). Several studies have already pointed to altered splicing events in 52 genes transcribed in cancer or neuropsychiatric diseases ( defined as exons with lengths ≤ 51 nt (Li et al., 2015). Given their short length, micro-exons would 68 not accommodate large numbers of exonic splicing enhancers/silencers, requiring that these 69 regulatory elements be primarily located in the introns that flank these micro-exons. Li et al. (Li et 70 al., 2015) have shown that, in mammals, the conservation of bases in introns that flank micro-exons 71 is greater than the conservation of introns that flank non-micro-exons. Another documented feature is 72 the differential distribution of certain 6-base motifs (k-mers) in the intronic regions that flank micro-73 exons (Ustianenko et al., 2017). In addition, these short motifs are co-localized with some RNA 74 binding proteins such as RBFOX and PTBP1, as evidenced by CLIP-seq assays with brain tissues 75 and HeLa cells (Li et al., 2015). Silencing and overexpression assays for nSR100 protein showed a 76 large effect on the mechanism of micro-exon splicing in 293T kidney cells (Irimia et al., 2014). In 77 that study, Irimia et al. (Irimia et al., 2014) identified 126 micro-exon splicing events altered in the 78 brain of autistic patients compared with controls, corresponding to 30 % of all micro-exon splicing 79 events in that tissue. 80 The above set of information suggests that there might be specific mechanisms that define 81 micro-exon splicing, but these mechanisms are still not fully explored. randomly selected long exons was identified. For each selected exon, the 100-nt sequence from the 103 intronic region upstream of the exon 5'-end and the 100-nt intronic sequence downstream of the 3'-104 end were extracted. To provide information to the classifier about conservation in the regions that 105 flank micro-exons and long exons, the conservation score in vertebrates (PhastCon100way) for each 106 nucleotide in the two 100 bp intronic regions that flank each of the selected exons was obtained. Where u represents the genomic coordinate of the 5' end of an exon (either a micro-exon or a 117 long exon) and d the genomic coordinate of the 3'end of the same exon. s and c represent 118 respectively the vector containing the one-hot encoder and the nucleotide conservation value of a 119 given coordinate in the flanking intron related to that exon. 120 The model was trained using binary crossentropy as the loss function, learning rate = 0.001, 121 decay = 0.0 and optimized with rmsprop. The final dataset contained 7,067 exons for training (3,534  122 micro-exons and 3,533 long exons), and another 1,767 exons were used for validation (883 micro-123 exons and 884 long exons), while the remaining 982 sequences were used to assemble the ROC 124 curves (491 micro-exons and 491 long exons). 125 Training was performed during 4000 epochs using a 500-size batch. The selected model was 126 the one that obtained the highest accuracy in the validation data during the training. The analyzes 127 were performed with keras (Chollet) in python 2.7. 128

2.2
In silico mutations 129 For each base within a 100-nt intronic region that flanks a micro-exon under analysis, with a given 130 genomic coordinate, a PositionScore value was generated that corresponds to the importance of a 131 given intronic nucleotide n for the prediction of the nearby micro-exon given the trained model. 132 Calculation of the value per n position can be obtained as follows: 133 Where, variable b represents the nucleotide base found in the original sequence and variable i 135 represents one of the 3 other possible nucleotides. The p function is the prediction value of the micro-136 exon by the trained CNN model given all the original parameters of the intron that flanks a micro-137 exon, or after nucleotide nb is replaced by nucleotide ni at the given position. The final PositionScore 138 value of a given n position in any intron that flanks a micro-exon was determined by the sum of the 139 differences between the original base prediction value and the artificially mutated base prediction 140 values. 141

Motif analyses 162
In silico analyses of the RNA-binding-motifs were performed using the MEME program (v. 4.12.0) 163 (Bailey et al., 2009). All bases in the intronic regions that were identified by the machine learning 164 (ML) algorithm as having negative influence on micro-exon prediction were extended by 5 nt at both 165 ends (5 ' and 3'), resulting in an 11-nt-long sequence; for this, the respective genomic sequences were 166 retrieved from ENSEMBL (hg38) assembly is one of the largest resource of genome-wide, quantitative profiles of AS events assembled to date. 249 The vast-tools use Bowtie (Langmead, 2010) for genome mapping; first, reads were divided into 50-250 nt over a 25-nt window, after this process the reads that have not been mapped to the genome were 251 used for mapping at known exon splice junctions, and for de novo junctions a model was built where 252 the 5'-and 3'-end of the same exon needed to be less than 300 nt apart and the junction must have had 253 the canonical splice site donor/acceptor GU/AG (Irimia et al., 2014). 254

SDVs dataset 255
The Multiplexed Functional Assay of Splicing using Sort-seq (MFASS) dataset that has been 256 generated to determine splice-disrupting variants (SDVs) was downloaded from the work by comparisons were performed using hg38 assembly coordinates. 262 3 Results 263

Prediction of splicing of micro-exons with the Convolutional Neural Network algorithm 264
using primary sequence and conservation score among vertebrates 265 In order to determine whether the pattern of bases conservation in introns interferes with micro-exon 266 splicing events in humans, we constructed a prediction deep learning model using a Convolutional 267 Neural Network (CNN) ( Figure 1A), which was based on both the sequence content of the 100-nt-268 long intronic regions that flank micro-exons and long exons at their 5'-and 3'-ends, and the sequence 269 conservation among the species of these 100-nt-long intronic sequences ( Figure 1A). Conservation 270 score values for the human genome bases obtained by comparison with 100 vertebrate genomic 271 sequences (Siepel et al., 2005(Siepel et al., , 2006 were used to obtain the conservation level of intronic regions 272 that flank the micro-exons and long exons (+100 bases downstream to exons and -100 bases 273 upstream to exons). 274 These values were then used to assess the conservation of introns that flank micro-exons of 275 different lengths. We observed that introns that flank symmetrical micro-exons (micro-exons whose 276 lengths were an exact multiple of 3) were more conserved than introns that flank non-symmetrical 277 micro-exons ( Figure 1B, see peaks at 3, 6, 9 nt, etc.). This difference in intronic conservation as a 278 function of micro-exon length was no longer noted for introns that flank micro-exons over 39 nt in 279 length ( Figure 1B). Using the conservation information of Figure 1B, we decided to train our CNN 280 model ( Figure 1A) using introns that flank micro-exons only of lengths ≤ 39 nt. This choice assumes 281 that the elements that are recognized by the splicing machinery were conserved during evolution in 282 the intronic regions that flank the micro-exons. 283 To train the CNN model, we retrieved all 4,908 micro-exons of lengths ≤ 39 nt that were 284 present in the Ensembl annotation (GRch38.76) of the hg38 version of the human genome, and 285 randomly divided the set in three parts: 10 % (491 micro-exons) were set aside for the final test of 286 performance of the model; of the remaining 4,417 micro-exons, 80 % were used for training the 287 model (3,534 micro-exons), while 20 % (883 micro-exons) were used for an independent validation 288 of the trained model. In order to have a balanced CNN model, an equal number of 4,908 randomly 289 selected long exons (> 39 nt) was used. 290 The CNN model was trained with the set of 7,067 intronic 100-nt-long sequences that flanked 291 the 3,534 micro-exons, both upstream of the micro-exon 5'-end and downstream of the 3'-end. An equal amount of 7,067 intronic 100-nt-long sequences that flanked 3,534 long exons on both ends 293 was also used for training. With the trained CNN model, a micro-exon prediction accuracy of 0.71 294 was obtained for a validation test with an independent dataset of 1,766 intronic regions, and the area 295 under the ROC curve was 0.76 ( Figure 1C). 296 As a parallel control, we tested the performance of the CNN model only using the intronic 297 sequences, without the conservation scores; after training without the conservation, the best obtained 298 micro-exon prediction accuracy was only 0.59 for the validation test with the independent dataset, 299 and the area under the ROC curve was 0.61. Given the low performance of this sequence-only model, 300 we did not explore it further. 301 The observation that a good prediction accuracy was obtained with the complete CNN model, 302 using both the sequences and their conservation scores, reinforces the idea that our machine learning 303 approach was finding a pattern in flanking introns that has been conserved in evolution, and that 304 should participate in micro-exon processing events. The red points in the map (Figure 2) show that when in silico point mutations were introduced at 327 certain points in the introns, there was an increased likelihood of those sequences being recognized 328 by the CNN model as introns that flank micro-exons. This could be due to the fact that, for that given 329 intron, the power of the CNN model classification might have been near the significance cutoff when 330 the wild-type sequence was considered, while in the in silico mutated sequence the change in a base 331 in the red region has possibly changed its sequence pattern towards a more robust, conserved pattern 332 of introns that flank micro-exons. 333 groups from GroupA to GroupE is shown in Supplementary Figure 1A. The difference in the median 348

Different density distribution of predictive PositionScore values for bases in the introns
absolute PositionScore between GroupA and GroupB was the largest (0.14) ( Table 1). 349 Analysis of the distribution of GroupA predictive base positions along the introns that flank 350 the micro-exons showed predictive bases more densely located at a median distance of 9 nt up-and 351 downstream from the micro-exon ends ( Figure 3A, Table 1), whereas in GroupB the median was 12 352 nt (Table 1). The average absolute values of PositionScores as a function of the distance to the micro-353 exon end was computed in 20-nt-long windows along the intron (for all groups A to E together) 354 (Supplementary Figure 1B). As the distance between predictive base and micro-exon end increases, 355 the absolute values of PositionScore along the intron decrease (Kendall's rank correlation, tau = -356 0.23, p-value < 2.2e-16, Supplementary Figure 1B). All comparisons between groups showed a 357 difference in distribution as a function of distance (Supplementary Table S1, Komogorov-Smirnov 358 test, p-value < 0.05). 359 Since GroupA showed predictive bases closer to the micro-exons and larger absolute values of 360 PositionScore, these bases were expected to be in more evolutionarily conserved regions compared 361 with the other groups. As expected, in the analysis comparing the PhastCon7way values, which 362 represent the conservation values among 7 vertebrates, GroupA showed higher conservation values 363 when compared with the other groups; the cumulative density of the PhastCon7way value for each 364 group shifted to the left as the group mean absolute PositionScore value decreased ( Figure 3B). For 365 GroupA bases the median value of PhastCon7way was 0.247, while in GroupB it was 0.161 (Table  366 1). (Spearman Correlation rho = -0.14, p-value < 0.05); it can be seen that higher PhastCon7way Scores 371 were associated with higher absolute PositionScore values ( Figure 3C). In this analysis, the algorithm sought, within the 11-nucleotide sequences, to obtain a multiple 383 alignment of all sequences from the same group, with at least 6 to 8 nucleotides aligned in each 384 sequence, to ensure that the predictive base was included within the RNA motif to be found. The 385 results shown in Table 2 include the 3 most abundant motifs in each group. It is worth noting that 386 GroupA had more sequences that matched each of the 3 consensus motifs, suggesting that the bases 387 with the highest absolute PositionScores housed more defined patterns. RBPs were found in common among the analyzes ( Figure 5A, was detected in all groups with the highest significance score. GO biological process enrichment 415 analysis of the six RBPs identified, resulted in 14 significantly enriched GOs (FDR ≤ 5%), of which 416 8 (57%) are for processes involved in splicing ( Figure 5B and Supplementary Table S3).  Table S4), however the number of sequences comprising each of these motifs 425 was lower than 5 % of total ( Figure S4A and Supplementary When these motifs were compared with the ATtRACT database, we found that the motifs in 429 the introns flanking the long exons resulted in the identification of fifteen RBPs ( Figure S4B), and all 430 motifs excepted for PCBP1 were different than those identified in the introns flanking the micro-431 exons. 432

Intronic splicing silencer motif was enriched in the introns that flank micro-exons 433
Other databases interrogating conserved RNA sequences were used to explore whether enriched 434 sequences containing predictive bases could harbor additional regulatory region patterns. For this, 435 two other databases were added to the analysis, one for the ISE motifs and one for the ISS. In the ISE 436 database, none of the consensus sequences found in the introns that flank micro-exons reached the E-437 value similarity threshold ≤ 0.05. It is very interesting to note that in the ISS database, the AGUAGG 438 consensus sequence showed similarity with GroupA Motif 3 (GGRGGAGG, E-value = 0.0175). This 439 motif had not been identified with statistically significant similarity to any RBP motif, in our 440 previous analysis with the ATtRACT database. 441

Motifs containing predictive bases showed occupancy distribution along flanking intronic 442 regions similar to the distribution of RBP-binding-motifs and ISSs 443
In order to investigate whether the RBP motifs identified in the previous analysis were represented at 444 a specific location in the upstream or downstream (100 nt) intronic regions that flank micro-exons, 445 the  Figure 5). 460 Next, we performed the distribution analysis of ISS binding motifs along the intronic regions 461 that flank micro-exons. In GroupA, Motif 3 GGRGGAGG harboring a high G content, showed 462 similarity with ISS consensus # I (AGUAGG) and the distribution is shown in Supplementary Figure  463 6. The distribution suggests that enrichment in the region -40 to -100 nt, away from the micro-exon, 464 was a site for splicing silencers. 465 3.9 eCLIP-seq assays evidenced that PTBP1, U2AF2 and TIA1 bind more abundantly to 466

RNA introns that flank micro-exons compared with long exons 467
To confirm the above in silico findings with experimental approaches, we analyzed publicly available 468 experimental eCLIP-seq data for PTPB1, U2AF2 and TIA1, obtained with two cell lines, namely 469 HepG2 liver carcinoma and K562 leukemia cell lines ( Van Nostrand et al., 2016). The density of 470 reads in the intronic RNA regions that flank the micro-exons or long exons was calculated by the 471 ratio between the signal abundance obtained with RBP-specific antibody and the signal in the 472 negative control (mock). In HepG2 liver cells, all three RBPs were found to bind more intensely in 473 the intronic RNA regions that flank micro-exons of all five groups (GroupA to GroupE) in relation to 474 those flanking long exons, as shown in Figure 6B to 6D. Interestingly, PTBP1 (Polypyrimidine Tract  475 Binding Protein 1) showed a higher abundance near the 3' splice site (3'ss) end, in the RNA introns 476 that flank micro-exons compared with long exons, both upstream (-5 to -50 nt) and downstream of 477 the micro-exons (+60 to +100 nt) ( Figure 6B), while U2AF2 showed higher binding in the RNA 478 introns flanking micro-exons compared with long exons in the region near the 5'ss end, in the introns 479 upstream (-100 to -75 nt) and downstream of the micro-exons (+1 to +50 nt) ( Figure 6C). Lastly, 480 TIA1 was bound more abundantly to RNA introns flanking micro-exons both upstream (-50 to -25 481 nt) and downstream (+25 to +60) of the micro-exon ( Figure 6D). Similar patterns were observed in 482 K562 leukemia cells (Supplementary Figure 7). 483  mapping to this micro-exon when PTBP1 was silenced can be clearly observed in Figure 7C, which 539

RBP knock down evidenced that PTBP1 and U2AF2 predominantly affected micro-exons
shows the RNA-seq reads mapping to this genomic locus in each knock down or control assay, for 540 the two cell lines. Aggregation Consortium (ExAC) database. The authors constructed a synthetic oligonucleotides 550 library that encodes each candidate exon and surrounding intronic sequences with the rare variants 551 (intronic or exonic), and measured the splicing inclusion/exclusion by cloning the synthetic library 552 inside a splicing reporter minigene housing the GFP and mCherry, plus the synthetic sequence 553 flanked by DHFR or SMN1 intron backbone, and integrated the constructs into HEK293T cells using 554 site-specific single-copy integration. If the synthetic exon were excluded, causing an exon skipping, 555 GFP was expressed, otherwise if the synthetic exon were included the mCherry was expressed 556 (Cheung et al., 2019). In total, the work identified 1,050 variants (out of 27,733, i.e. 3.8 %) which 557 were classified as splice-disrupting variants (SDVs) that led to almost complete loss of exon 558 recognition (Cheung et al., 2019), and 6,469 variants (23 %) that caused alteration of ΔPSI ≥ 0.1. 559 Of the 27,733 variants assayed by Cheung et al. (Cheung et al., 2019), only 436 were located at 560 intronic regions of micro-exons, and from these, a total of 27 (6.2 %) were classified as SDVs and 561 133 (30.5 %) caused a ΔPSI ≥ 0.1 comparing mutant and wild-type. From these 436 assayed variants, 562 in the introns that flank micro-exons, we found that 13 correspond to bases that were present in our 563 list of top 5 % most negative PositionScore predictive bases, which would most negatively impact 564 splicing of the flanking micro-exons. Out of these 13 variants assayed, 2 (15.4 %) were classified as 565 SDVs, and 6 (46 %) had an alteration of ΔPSI ≥ 0.1 comparing mutant and wild-type (Supplementary 566 Figure 10). Extending this analysis to the top 25 % predictive bases detected by our CNN model, 567 there were 72 bases screened, of which 6 (8.3 %) were classified as SDVs, and 24 (33 %) presented 568 alteration in ΔPSI ≥ 0.1 comparing mutant and wild type (Supplementary Figure 10). The rate of 569 confirmation of SDVs among the events predicted by the CNN model (8.3 to 15.4 %) was similar to 570 the overall rate of confirmation of SDVs among all assayed rare variants that flank micro-exons (6.2 571 %) (Cheung et al., 2019). This result shows empirical evidence that the CNN model pointed to a set 572 of intronic bases important for micro-exon splicing events that were among the set of rare variants 573 that affect micro-exon splicing, as detected by large-scale screening with a minigene reporter assay. 574

4
Discussion 575 In this work we have built a deep learning model using a CNN architecture to identify 576 conservation patterns of intronic DNA sequences important for the micro-exon splicing mechanism, 577 being the first machine learning approach to identify conservation patterns that discriminate micro- such as skeletal muscle, brain and smooth muscle (Feener et al., 1989). Also, the Dp71 transcript, 628 encoding a 70-75 kDa C-terminal protein product of the DMD gene expressed in the human brain 629 (Austin et al., 2000), shows several isoforms with alternative C-terminal, including one with exon 78 630 skipping, which changes the reading frame and modifies the translated C-terminal, producing 631 dystrophin with a 31 amino acids (aa) tail instead of a shorter 13 aa tail (Austin et al., 2000). 632 Dysregulation of these splicing isoforms were related to cognitive impairments (Tadayoni et al.,633 2012), although the mechanisms of dysregulation are not known. Regarding specifically the isoform 634 without exon 78, it is expressed in embryonic stages in pre-contractile muscle, and re-expression of 635 this isoform instead of the adult isoform contributes to progression of the dystrophic process in 636 myotonic dystrophy type I (Rau et al., 2015). On the other hand, the isoform with exon 78 skipping 637 is the most expressed in neuronal SH-SY5Y cells (Nishida et al., 2015), while in muscle tissue under 638 physiological conditions, only 2.5 % of the expressed gene corresponds to this alternative isoform 639 (Tuffery-Giraud et al., 2017). All this suggests that the correct expression of DMD isoforms is under 640 developmental control and must involve a complex machinery; we speculate that mutations in the 641 conserved region around bases -16 nt to -22 nt upstream of micro-exon 78 might affect its fine-642 tuning splicing regulation. 643 Overall, the deep learning CNN model has pointed to intron bases which had a high predictive 644 value for micro-exon splicing, and a search for conserved patterns has identified RNA-binding-645 motifs of specific RBPs associated with the splicing process. Even more interesting was the finding that the in silico motif predictions could be experimentally confirmed with data from e-CLIP RBP 647 binding assays, from silencing assays of splicing-regulatory proteins, and from splice-disrupting 648 mutations detected with minigene reporter, thus reinforcing the predictive power of the in silico 649 model. Search for the impact of variants on splicing mechanism has gained attention during the past 650 years (Li et al., 2016), which resulted from gathering information about sQTL in the human 651 population (Park et al., 2018)    the micro-exon, the delta value (PositionScore) of the prediction perturbation caused by the in silico 900 point mutation of that base is represented by the color; the delta was calculated by subtracting the 901 intronic sequence prediction value, obtained after the base at a given position was changed to the 902 other 3 possible bases, from the intronic sequence prediction value using the original wild-type base. 903 The heatmap has clustered the micro-exons according to the PositionScore pattern of the intron 904 sequences that flank each micro-exon. On the upper left is the color scale of the perturbation 905 PositionScore values. Positive values indicate that the in silico mutation increased the probability that 906 a given sequence was classified as an intron that flank a micro-exon, and negative values show that 907 the in silico mutation increased the probability that the sequence was mistakenly classified as an 908 intron that flank a long exon. 909 The y-axis shows the cumulative distribution and the x-axis shows the PhastCon7way score. 917 Statistical differences in PhastCon7way scores distribution were observed in all comparisons using 918 GroupA or GroupB (Kolmogorov-Smirnov test, p value <0.05). (C) Box plot of absolute values of 919 PositionScore of intron predictive bases (y-axis) as a function of PhastCon7wayScore computed in 920 intervals of 20 percentile (x-axis). All PositionScores from the five groups (GroupA to GroupE) were 921 plotted together. Correlation between PositionScore and PhastCon7wayScore was calculated 922 (Spearman's correlation, rho = -0.14, p-value < 0.05). 923 RBPs identified as involved in micro-exon splicing. The x-axis bar represents the enrichment rate 943 (observed/expected) and the y-axis shows the GO categories. The names in blue are for the GOs 944 related to the splicing process. 945