Empirical prediction of variant-associated cryptic-donors with 87% sensitivity and 95% specificity

Predicting which cryptic-donors may be activated by a genetic variant is notoriously difficult. Through analysis of 5,145 cryptic-donors activated by 4,811 variants (versus 86,963 decoy-donors not used; any GT or GC), we define an empirical method predicting cryptic-donor activation with 87% sensitivity and 95% specificity. Strength (according to four algorithms) and proximity to the authentic-donor appear important determinants of cryptic-donor activation. However, other factors such as auxiliary splicing elements, which are difficult to identify, play an important role and are likely responsible for current prediction inaccuracies. We find that the most frequent mis-splicing events at each exon-intron junction, mined from 40,233 RNA-sequencing samples, predict with remarkable accuracy which cryptic-donor will be activated in rare disease. Aggregate RNA-Sequencing splice-junction data provides an accurate, evidence-based method to predict variant-activated cryptic-donors in genetic disorders, assisting pathology consideration of possible consequences of a variant for the encoded protein and RNA diagnostic testing strategies.


Introduction
Genetic variants affecting the conserved sequences of the consensus splicing motifs can alter binding of spliceosomal components and induce mis-splicing of precursor messenger RNA (pre-mRNA) 1 , making them a common cause of inherited disorders [2][3][4][5] . Splicing variants can simultaneously induce different missplicing outcomes, including skipping of one or more exons, activation of a cryptic splice-site(s), and/or retention of one or more introns 1 . Whether induced mis-splicing disrupts the reading frame or affects a region of known functional (and clinical) importance, has significant diagnostic implications. Therefore, knowing the specific mis-splicing outcome of genetic variant is necessary to conclusively link it to a disease.
While the accuracy of in silico algorithms in predicting whether a variant will cause mis-splicing has been comprehensively assessed [6][7][8][9] , there is currently no reliable means to predict which mis-splicing event(s) may occur in response to a variant that activates mis-splicing. As a result of this and other factors, the vast majority of splice site variants are classified as variants of uncertain significance (VUS); a non-actionable diagnostic endpoint in genomic medicine 10 .
We recently evaluated the accuracy and concordance of SpliceAI (SAI) 11 and algorithms within Alamut Visual® (Interactive Biosoftware, Rouen, France) 12,13 , to predict splicing outcomes arising from genetic variants identified in 74 families with monogenetic conditions subject to RNA diagnostic studies (79 variants; 19% essential GT-AG splice-site variants and 71% extended splice-site variants) 14 . Algorithmic predictions of the strengths of activated cryptic splice sites were highly discordant, especially for crypticdonors. SAI's deep learning showed the greatest accuracy in predicting activated cryptic splice-site(s) (66% true positive with 34% false negative), whereas historical algorithms within Alamut Visual® resulted in 34% -69% false negatives 14 .
In this study we focus on determining empirical features that inform prediction of variant-associated spliceosomal selection of a cryptic-donor, in preference to the 'authentic-donor' (positioned at the exonintron junction), and other nearby decoy-donors (any GT or GC) that are not used by the spliceosome.
Through analysis of 4,811 variants in 3,399 genes, we show that while intrinsic splice-site strength and proximity to the authentic-donor strongly influence spliceosomal selection of a cryptic-donor, these factors alone are not sufficient for accurate prediction. Importantly, we show that the most frequent mis-splicing events seen at each exon-intron junction across 40,233 publicly available RNA-seq samples predict with remarkable accuracy which cryptic-donor will be activated in rare disease. An improved ability to reliably identify cryptic-donors will greatly assist both theoretical consideration of possible consequences for the encoded protein, as well as guide PCR-based RNA-diagnostic studies seeking to experimentally determine consequences for pre-mRNA splicing 14 . In turn this will improve our ability to classify splicing variants, avoiding the non-actionable diagnostic endpoint of a variant of uncertain significance (VUS).

Reference database of variants activating a cryptic-donor
We collate a database of cryptic-donor variants, defined as variant-associated erroneous use of a donor other than the authentic-donor (positioned at the exon-intron junction). Variants were derived from several sources 11,16,17 (Figure 1A, upper. See methods). The genomic locations and extended donor consensus sequences of the authentic-donor, cryptic-donor(s), as well as any decoy-donors (any GT/GC motif) within 250 nucleotides (nt) of authentic-donors, were compiled for analysis. We define the extended donor splice-site region as spanning the fourth to last nucleotide of the exon (E -4 , E = exon) to the eighth nucleotide of the intron (D +8 ; D = donor) (see Figure 1A).
Cryptic-donor variants fall into three categories ( Figure 1B  87% of activated cryptic-donors lie within 250 nt of the authentic-donor, with an average of 36 other decoy-donors (any GT or GC) within this distance not used by the spliceosome.
95% of cryptic-donors activated by AM-variants and 65% of cryptic-donors activated by CM-variants, lie within 150 nt of the authentic-donor (Figure 2A, B). By definition, AM/CM-variants activate a cryptic-donor that spatially overlaps the authentic-donor; 26% of AM/CM cryptic-donors lie at either the E -4 or D +5 position ( Figure 2C, bottom). For exonic cryptic-donors activated at E -4 , the GT at D +1/+2 of the authenticdonor becomes D +5/+6 of the cryptic-donor; conversely for intronic cryptic-donors activated at D +5 , the GT at D +5/+6 of the authentic-donor becomes D +1/+2 of the cryptic-donor) ( Figure 2C, top).
While decoy-donors are present everywhere, those selected as cryptic-donors by the spliceosome in the context of a genetic variant appears strongly influenced by their proximity to the authentic-donor ( Figure   2A-B), as shown by their enrichment at proximal locations relative to all decoys present in the genome ( Figure 2D). The steep decline in exonic decoys ( Figure 2D, left) is due to the shorter lengths of exons limiting their frequency at these distances (50 th and 90 th percentile for exon length shown). Notably, each authentic-donor in hg19 has on average 36 decoy-donors within +/-250 nt not used by the spliceosomeindicating that features other than proximity to the authentic-donor define a usable cryptic-donor.
Only 31-67% of cryptic-donors are scored as being stronger than the authentic-donor, with discordance between algorithms.
We examined the ability of four algorithmic measures of 'splice-site strength' to predict cryptic-donor activation ( Figure 3). We compared the performance of MaxEntScan (MES) 13 , NNSplice (NNS) 12 and SpliceAI (SAI) 11 as well as our own method termed 'Donor Frequency' (DF) (see methods and Figure S1 for details, Figure S2A-C for full plots). DF calculates the median frequency of four consecutive 'windows' of nine nucleotides in length (from E -4 to D +8 ) among all donors in hg19, converted to a cumulative frequency distribution. For example, if a E -3 to D +6 sequence has a raw frequency of 222, this combination of nine bases occurs at the analogous position for 222 annotated authentic-donors in hg19, corresponding to the  35 th percentile of a cumulative frequency distribution across the human genome (see Figure S1C). DF provides a barometer related to how common a given donor sequence is in humans, as a measure of splicing competence. For these and all further analyses, we excluded the 1,113 cryptic variants in the database derived from SAI predictions already validated on GTEx RNA-seq data 11 Figure S2D).
In summary, four independent algorithms concur that cryptic-donor activation typically occurs in response to weakening of the authentic-donor (85 -99 % weaker) or strengthening of the cryptic-donor (67 -98% stronger). However, only 35 -70% of activated cryptic-donors are stronger than the authentic-donorVAR, and for unmodified cryptic-donors, 29 -62% are not the strongest decoy-donor within 250 nt. Thus, while relative strength of the authentic-and cryptic-donor influence spliceosomal use, there are other factors at play.

Competitive decoy-donors are specifically depleted proximal to the exon-intron junction.
Decoy-donors of comparable or greater strength to the authentic-donor rarely occur within 150 nt ( Figure   4A, top, red). Exonic and intronic regions around donors have characteristic single and dinucleotide frequencies ( Figure S3). In particular, the first 50 nt of the intron often shows enrichment in G and T  Figure S3E). Therefore, we adapt a method 18 to assess if decoy-donors occur more or less commonly than expected by random chance, given these and any other dinucleotide preferences (see Methods and Figure   S4). We next assessed what proportion of decoy-donors in the genome are used, albeit rarely, via unannotated splice-junctions detected across 40,233 publicly available RNA-seq samples (see methods). Unannotated splice-junctions (representing stochastic mis-splicing), seen rarely in RNA-seq samples aggregated across a population, constitute functional evidence that both splicing reactions can be executed using a decoydonor: hence we term these donors 'splice competent'. The remaining decoy-donors, for which no RNA-Seq splice-junctions are detected, we term 'non-splice competent' -acknowledging this assertion is confounded by low read depth for many transcripts.

12
The proportion of exonic splice competent 'GT' decoy-donors (relative to all decoys) dramatically increases with proximity to the authentic-donor, with intronic decoys showing only a modest change (Figure 4Bii), mirroring depletion patterns (4Bi). Decoy-donors closer to the authentic-donor appear to be inherently more splice competent, supported by the fact that they are depleted in the human genome. Less than 4% of exonic 'GC' decoy-donors show evidence of splice competence, even very close to the exon/intron junction, in line with the observed lack of depletion ( Figure S5B).
The ability of DF to measure strength/competition of a splice-site sequence is evidenced by Figure 4C. Defining 'competitive' decoy-donors as those with a DF of at least 10% that of the associated authenticdonor (see Figure 4C), in the first 50 nt of the intron, competitive decoy-donors overlapping G-triplets show no depletion and indeed appear enriched (Figure 5Ai, intron-orange). In contrast competitive decoy- donors not overlapping G-triplets are depleted (Figure 5Ai, intron-grey). Additionally, intronic decoydonors not overlapping G-triplets show greater splice competence that those overlapping G-triplets ( Figure   5Aii, intron-grey). The reciprocity in these data is consistent with a masking effect of intronic G-repeat motifs on (strong) decoy-donors, likely due to RNA secondary structure and/or RNA binding proteins preventing their use. Figure 5B shows an example variant in GAA (NM_000152.3:C.2646+2T>A) identified in an individual affected with glycogen storage disease type II 22 that induces cryptic splicing to a exonic donor 21 nt upstream of the authentic-donor. NNS, MES, and DF rank the decoy-donor at +57 as the most competitive donor -however this donor is enveloped within a G-repeat rich region which may mask it. Mis-splicing data finds no evidence of rare spliceosomal use of the +57 decoy. SAI does not predict the decoy at +57 to be a functional splice-site, instead predicting the cryptic-donor at -21 to be a functional splice site. Notably, natural mis-splicing data shows rare, natural spliceosomal use of the cryptic-donor at -21 in 137 samples, providing evidence that despite its weak primary motif, it is splice competent ( Figure 5B). SAI in silico mutation of the cryptic-donor at -21 and decoy-donor at +57 show that SAI deep-learning perceives the negative impact of the G-repeats on the usability of the +57 decoy-donor ( Figure 5C).

90% of confirmed cryptic-donors in AM-variants are detected as rare splice-junctions in RNA-Seq data, with 91% of unused decoy-donors absent.
We assessed whether evidence of splice competence from RNA-seq data provides a viable means to prioritise cryptic-donors likely to be activated in the context of a genetic variant affecting the authentic-15 donor (i.e. AM-variants). Strikingly, 90% of AM-variant activated cryptic-donors are detected as splice competent events in aggregate RNA-Seq splice-junction data, while 91% of unused decoy-donors are absent. We took the top 4 splice competent events at each splice-junction (or all events if there were less than 4 detected) as our predicted cryptic-donors and this yielded a sensitivity of 87% and specificity of 95% ( Figure 6Ai). Our method had a higher sensitivity than all four algorithms assessed (Figure 6aii, Figure S6A). Therefore, splice-junction data provides potent predictive information with respect to both true positives (cryptic-donors) and true negatives (decoy-donors). Most cryptic-donors activated are observed as very low frequency splice-junctions (44% had a maximum of 4 reads or less, Figure S6B). Sensitivity of using RNA-Seq splice-junction data is inherently influenced by read-depth of the target transcript: more than 85% of cryptic-donors are detected in genes with a read depth of > 250 for the normal exon-exon splicejunction; whereas only 29% of cryptic-donors are detected in mis-splicing data for genes where normal splicing had a maximum read count of < 100 ( Figure S6C).
We define SAI prediction of cryptic-donor activation as a donor D score of 0.1, which accurately predicts 75% of cryptic-donors and inaccurately predicts only 1% of decoy-donors (Figure 6Aii). SAI shows the lowest predictive accuracy for more distal cryptics -only 55% of cryptic-donors further than 100 nt from the authentic splice site had a D score above 0.1 ( Figure S6D). When used in combination, SAI and RNA-Seq  D +2 GC>GT variants commonly activate cryptic-donors ( Figure 1). However, because 'GC' donors are inherently much less splice competent than 'GT' donors ( Figure S5B) they are infrequently seen in missplicing data, limiting predictive ability for these variants. For D +2 CM-variants, only 32% of the crypticdonors are present in the top 4 mis-splicing events, as compared to 85% of cryptic-donors detected in the top 4 events for E -3 variants ( Figure S6E; E -3 is the third to last exonic base). In short, even if a 'GC' decoydonor shows no evidence of splice competence, conversion by a variant to a 'GT' donor presents high risk for cryptic-activation. SAI performed well for CM-variants and AM/CM-variants, correctly predicting 96% of variants which created an essential donor motif and 90% which modified an existing essential motif ( Figure   6B).

Discussion
The ultimate goal of splicing predictions is to determine if and how a genetic variant will induce missplicing of pre-mRNA. Consideration of the likely consequences of an essential splice-site variant is critical for pathology application of the ACMG-AMP code PVS1 15 (null variant due to presumed mis-splicing of the pre-mRNA) and of equal importance to strategise functional testing for RNA diagnostics 14 . While activation of a cryptic-donor 6 nucleotides away will remove or insert two amino-acids, activation of a cryptic-donor 4 nucleotides away will induce a frameshift, with vastly different implications for pathology interpretation.
We learned five key lessons from our analyses of 4,811 variants in 3,399 genes: 1) Decoy-donors shown to be splice competent via aggregate RNA-Seq splice-junction data have the greatest probability of activation as cryptic-donors. 2) Proximity to the authentic splice site increases likelihood of spliceosomal mis-use (splice competence) of a decoy-donor. 3) Cryptic-donors do not necessarily need to be 'stronger' than the authentic-donor to present substantive risk for mis-splicing. 4) Intronic G-repeats diminish the likelihood of spliceosomal recognition and use of intronic decoy splice sites. 5) SAI's deep-learning appreciates the broader sequence context determining spliceosomal 'usability' of a decoy splice-site, including 2-4 above, though less accurately predicts activation of more distal decoy-donors (>100 nt from the authentic-donor).
SAI's deep learning presents a major improvement in predicting cryptic-donor activation. However, use of SAI in a pathology context is limited by the challenge of deriving a clinically-meaningful interpretation of a number on a 0 -1 scale, without independently verifiable evidence. In contrast, aggregate RNA-Seq splicejunction data provides an accurate, evidence-based means to rank cryptic-donors likely to be activated by genetic variants.
Brandão et al. 24 used deep sequencing of twelve major cancer susceptibility genes to catalogue all alternative and aberrantly spliced transcripts. They found variant-activated cryptic splicing was often seen at much lower levels in disease controls, suggesting that the dominant transcript in rare disease may be seen as a stochastic mis-splicing event in other samples. Herein we use the breadth of publicly available RNA-seq data across numerous tissues to capture stochastic mis-splicing events across all genes, to comprehensively catalogue all splice competent cryptic splicing events. Prospectively, the potency of RNA-Seq splice competent data can be enhanced by ultra-deep sequencing.
When used in combination, SAI and RNA-Seq splice competent evidence accurately identify 93% of crypticdonors (true positives) and inaccurately identify 6% of unused decoy-donors (false positives). The heightened sensitivity of our method is of vital importance for pathology assessment of variants affecting the essential donor splice-site: as not considering a likely cryptic-donor activated can lead to profound complications in variant interpretation. It is also easy to envisage extending the method to predict other mis-splicing events such as exon skipping, and an analogous formulation for the prediction of mis-splicing events at the acceptor splice site.
RNA-Seq splice competent data can reliably identify distal cryptic-donors with high likelihood of activation, which may not be identified by SAI. Conversely, SAI can reliably identify cryptic-donors with high likelihood of activation not detected in RNA-Seq splice competent data, due to low read depth of the target gene.
Importantly, RNA-Seq empirical evidence of splice competence can augment pathology consideration of probable consequences of splice site variants on pre-mRNA splicing (and the encoded protein), to assist clinical decision-making and the diagnostic classification of variants.

Creating a database of cryptic-donor variants
Variants were derived from several sources: 1) 439 variants curated from literature, predominantly  Figure 1A).

iii. Finding authentic, cryptic & decoys
From the (up to) 500 nt of sequence we pulled E -4 -D +8 sequences for the authentic-and cryptic-donor before and after each variant (REF and VAR respectively). We also identified any other essential donor dinucleotides (i.e GT or GC) which were present in the sequence and extracted the E -4 -D +8 sequence surrounding them. These sequences we define as 'decoy'-donor-sequences containing the essential donor dinucleotides (i.e. a GT or a GC) but which weren't utilised by the spliceosome as a result of the variant ( Figure 1A, lower). For intronic decoy-donors, we excluded any which would result in an intron too short to be spliced (as defined by the 5 th percentile for intron length in the human genome) 28 . Importantly, without additional filtering, no cryptic-donors in the database violated this rule.

Algorithms for splice site strength
We retrieved predicted scores for authentic-donors, cryptic-donors and decoy-donors in the database in both the REF and VAR sequence context, for four algorithms. 1) MaxEntScan (MES) 13 scores were retrieved using the perl script provided at http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html.
MES scores below 0 were standardised to 0 as predicted 'non-functional' splice sites 2) NNSplice (NNS) 12 scores were retrieved using the online portal (https://www.fruitfly.org/seq_tools/splice.html), set to human, with default settings (i.e. a minimum score of 0.4, with any scores below predicting a 'nonfunctional' splice site) 3) SpliceAI (SAI) 11 scores were retrieved using a script adapted from the SAI github repository (https://github.com/Illumina/SpliceAI) which allows spliceAI to score custom sequences. We rounded the scores to three decimal places, and scores at 0 when rounded as such were termed 'nonfunctional' splice site predictions. 4) Donor Frequency (DF) calculates the median frequency (in hg19) among four 9 nt windows of sequence spanning the donor (see Figure S1B-D) converted to a cumulative percentile distribution scale. DF provides a barometer related to how common a given donor sequence is in humans, as a measure of splicing competence. An example of a DF calculation is shown in figure S1D, where a median DF raw value of 179 lies at the 31 st percentile of a cumulative frequency distribution. After assessing several window lengths (figure S1B) we used 9nt windows as optimally encompassing the splice site ( Figure S1C).

Naturally occurring decoy-donors
Our set of naturally occurring human decoy-donors were derived from the set of all canonical Ensembl transcripts (Release 75) 29 , with first and last introns and single exon transcripts removed. We filtered the set so that junctions were within the open reading frame for the given gene, so we knew that cryptic splicing here would affect the protein. We also removed exons with alternative 5' or 3' ends already annotated in different transcripts. We used these criteria to form a set of 142,014 canonical exon-intron junctions that are not alternatively spliced (or at least not annotated as such). We extracted sequences surrounding authentic-donors and extracted all decoy-donors just as for the cryptic database (see methods section creating a database of cryptic-donor variants)

Decoy-donor depletion
Decoy-donor depletion was calculated using an adapted method 18 that controls for dinucleotide frequencies ( Figure S5). For exonic sequences, we took up to 150 nt or the maximum length of the exon, whichever was shorter (and similarly for the intron, stopping 50nt from the 3'SS). We limited analysis to 150nt of exonic sequence as the majority of exons are shorter than this. We then shuffled exonic and intronic sequences separately, maintaining dinucleotide frequencies (using shuffle_sequences with euler method from universal motif R package 30 ). We performed the shuffle 15 times and took the average count of decoy-donors at each nucleotide position as our 'expected' count at this position. The observed count of decoy-donors was then divided by the expected count at each position.

Creating a database of splice competent events (SCE)
We had two sources of data for our database of splice competent events-RNA-seq data from Intropolis 31 and GTEx 32 . Intropolis is a set of ~42M exon-exon junctions found across 21,504 human RNA-seq samples from the Sequence Read Archive (SRA). Samples were aligned using Nellore et al.'s annotation-agnostic aligned Rail-RNA 33 . Intropolis was downloaded from its dedicated github repository (https://github.com/nellore/intropolis). Per sample splice-junction files were obtained from GTEx (phs000424.v8.p2) . Using Datamash 34 , splice-junction read counts were summarised across all samples for each unique splice-junction and translated from GRCh38 to GRCh37 using liftOver 35 .
For each set of junctions (Intropolis and GTEx), we cross-referenced and located junctions within Ensembl transcripts, keeping only events with at least one annotated splice site. We annotated cryptic-donor events by scanning for any unannotated donors used between the 5' end of the exon and the 3' end of the intron for that respective exon-intron junction, where the junction also spliced to the next annotated 3'SS. Then events from the two sources were merged and sample counts were tallied across the two datasets. If a splice-junction was seen in at least 3 samples, it was defined as 'splice competent'.

Sashimi plots
For figure 5B, and S6B-C sashimi plots were generated using 3 GTEx bam files for each example, each from the tissue with the highest TPM for the respective gene. Sashimi plots were created using ggsashimi 36 .

SpliceAI in silico mutagenesis plots
For figure 5C and S6B-C we performed in-silico mutagenesis as described in Jaganathan et al 11 . The importance score of each nucleotide was calculated as: Where !"#$!% is the score calculated on the genuine sequence, and & , for example, is the score calculated when an A is substituted at this position.

Acknowledgement
The Genotype-Tissue Expression (GTEx Ltd (Australia) has no existing financial relationships that will benefit from publication of these data. The remaining co-authors declare no conflicts of interest.