Functional and in-silico interrogation of rare genomic variants impacting RNA splicing for the diagnosis of genomic disorders

Purpose To develop a comprehensive analysis framework to identify pre-messenger RNA splicing mutations in the context of rare disease. Methods We assessed ‘variants of uncertain significance’ through six in-silico prioritization strategies. Firstly, through comparison to functional analyses, we determined the precise effect on splicing of variants identified through clinical multi-disciplinary meetings. Next, we calculated the sensitivity of in-silico prioritization strategies to distinguish known splicing mutations from common variation (>2% in allele frequency in gnomAD) within relevant disease genes. These approaches defined an accurate in-silico strategy for variant prioritization, which we retrospectively applied to a large cohort of 2783 individuals who had previously received genomic testing for rare genomic disorders. We assessed the clinical impact of such prioritization strategies alongside routine diagnostic testing strategies. Results We identified 21 variants that potentially impacted splicing, and used cell based splicing assays to identify those variants which disrupted normal splicing. These findings underpinned new molecular diagnoses for 14 individuals. This process established that the use of pre-defined thresholds from a machine learning splice prediction algorithm, SpliceAI, was the most efficient method for variant prioritization, with a positive predictive value of 86%. We analysed 1,346,744 variants identified through diagnostic testing for 2783 individuals and observed that splicing variant prioritization strategies would improve clarity in clinical analysis for 15% of the individuals surveyed. Prioritized variants could provide new molecular diagnoses or provide additional support for molecular diagnosis for up to 81 individuals within our cohort. Conclusion We present an in-silico and functional analysis framework for the assessment of variants impacting pre-messenger RNA splicing which is applicable across monogenic disorders. Incorporation of these strategies improves clarity in diagnostic reporting, increases diagnostic yield and, with the advent of targeted treatment strategies, can directly alter patient clinical management. Key Highlights We establish an in-silico and functional analysis framework for the incorporation of splice variant assessment into diagnostic testing that is applicable across monogenic disorders. After assessment of six distinct variant prioritization strategies, we concluded that SpliceAI was the best method to accurately identify genomic variation disrupting normal pre-mRNA splicing. We determined this through (i) functional assessment of novel ‘variants of uncertain significance’ described in this study, and (ii) calculation of sensitivity and specificity for prioritization strategies to distinguish known splicing mutations from common variants in the general population. We describe novel disease-causing variants with support from cell based functional assays which underpin autosomal recessive, autosomal dominant and X-linked Mendelian disorders. This includes variants which are deeply intronic, within the nearby splice region of canonical splice sites and variants which activate cryptic splice sites within the protein-coding regions of genes. We integrated the best performing variant prioritization strategy alongside clinical diagnostic testing for 2783 individuals referred to a well-established targeted gene panel test available through the UK National Health Service. We show that integration of such strategies will increase accuracy and clarity of diagnostic reporting, including the identification of variants which could provide new diagnoses and new carrier findings for referred individuals. Functional assessment is essential for accurate clinical assessment of variants disrupting pre-mRNA splicing. We show through cell based functional assessments that variants impacting splicing may have complex impacts on pre-mRNA splicing, which may cause multiple interpretable consequences according to ACMG guidelines.


Introduction
Pinpointing disease-causing genomic variation informs diagnosis, treatment and management for a wide range of rare disorders. Molecular testing, in a healthcare setting, now frequently includes complete genomic and exome sequencing. [1][2][3] Accurate interpretation and categorization of identified variants remains a key limiting factor despite the availability of guidelines for variant analysis. 4,5 With the advent of whole genome sequencing healthcare strategies, 3,[6][7][8][9] there is an opportunity for diagnostic services to routinely include the analysis of non-coding variation to increase diagnostic yields and improve patient care. Such analysis strategies may include consideration of variants within characterized regulatory elements, [10][11][12] variants disrupting chromatin conformation 13,14 and variants that disrupt vital processes during gene expression, e.g. pre-mRNA splicing. 15,16 However, the capability to interpret variation within the non-coding genome is particularly challenging. Variant interpretation is hindered by the vast number of rare/novel non-coding variants identified in each individual, 7,9 the depleted levels of evolutionary conservation within non-coding regions, 17 and our current lack of understanding of the motifs and interactions that are required for appropriate control of gene expression and regulation. 12,18 Intragenic genomic variants have the potential to impact splicing, 16 the ubiquitous process in eukaryotic cells of converting nascent pre-mRNA molecules into a mature messenger RNA (mRNA) which can be transported out of the nucleus to provide a template for protein synthesis. Genomic variation in proteincoding, splice junction and intronic regions of genes can disrupt normal splicing mechanisms and underpin the onset of rare disease. 19 Known mechanisms of splicing disruption include the introduction of cryptic splice sites, disruption of canonical splice acceptor and donor sites, and the disruption of motifs essential for splicing, e.g. branch points and the polypyrimidine tract. 19 A number of computational tools have been developed to assist in the interpretation of genomic variation impacting splicing, and these tools have been expanded recently to include an array of machine learning tools that have been trained to prioritize splice disrupting variation through diverse means. [20][21][22][23][24] While the initial reports of these tools have shown promising results there is yet to be a formal assessment of their integration, utilization and comparative performance in clinical environments.
In this study we interrogate data derived from targeted gene panel sequencing and whole genome sequencing from individuals with a broad spectrum of rare disorders. We apply in-silico splicing tools to prioritize variants predicted to impact splicing and we assess variant pathogenicity using targeted cDNA amplification and cell based minigene assays. We establish methodologies to identify novel causes of autosomal recessive, autosomal dominant and X-linked genomic disorders and demonstrate a requirement for routine diagnostic services to assess the effect of protein-coding and non-coding variation on mRNA expression.

Patient Recruitment & Genomic Variant Dataset Generation
All individuals included in this study have provided consent for the analysis of relevant disease causing genes through tertiary healthcare centers within the UK. All individuals with whole genome sequencing datasets have consented through the Genomics England 100,000 Genomes Project.
For real-time assessment of variant prioritization strategies, we identified individuals with 'variants of uncertain significance' according to ACMG guidelines for variant interpretation. 4 In all cases we considered inheritance modes associated with monogenic disorders available in OMIM (https://omim.org/) or PanelApp (https://panelapp.genomicsengland.co.uk/), the zygosity of identified variants, additional variants identified to impact the same gene, phenotype-genotype correlations and scores determined by in-silico splicing tools ( Figure 1).

Whole genome sequencing
Whole genome sequencing datasets were created through the UK 100,000 genomes project, 3 using Illumina X10 sequencing chemistry. Sequencing reads were aligned to build GRCh37 of the human reference genome utilizing Issac. Small variants were identified through Starline (SNV and small indels ≤ 50bp), and structural variants were identified utilizing Manta and Canvas (CNV Caller). Variants were annotated and analysed with the Ensembl variant effect predictor (v92), bcftools and bespoke perl scripts within the Genomics England secure research embassy.

Gene panel sequencing
Enrichments were performed on DNA extracted from peripheral blood using Agilent SureSelect Custom Design target-enrichment kits (Agilent, Santa Clara, CA, USA). Enrichment kits were designed to capture known pathogenic intronic variants and the protein-coding regions +/-50 nucleotides of selected NCBI RefSeq transcripts; conditions tested included inherited retinal disease (105 genes or 176 genes), ophthalmic disorders (114 genes), cardiac disorders (72 genes comprised of 10 sub-panels) and severe learning difficulties (82 genes). Genes tested and relevant testing strategies are available through the UK Genetic Testing Network (https://ukgtn.nhs.uk/). All samples included in the large cohort analysis were generated through a previously described methodology, 25 and had been completed prior to August 2017. Briefly, samples were pooled and paired-end sequencing was performed using the manufacturer protocols for the Illumina HiSeq 2000/2500 platform (Illumina, Inc., San Diego, CA, USA). Sequencing reads were demultiplexed with CASAVA v.1.8.2. and aligned to the GRCh37 reference genome using Burrows-Wheeler Aligner short read (BWA-short v0.6.2) software before duplicate reads were removed using samtools v0.1.18. The detection and clinical analysis of single nucleotide variants and small insertions/deletions was performed as described previously, 25,26 and in accordance with ACMG guidelines for variant interpretation. 4

Real-time assessment of variant prioritization strategies
In-silico splicing prediction scores were generated through the latest web server for the algorithm, through access to pre-computed scores, or through 3 rd party software (Ensembl Variant Effect Predictor or Alamut Visual Software). We utilized scores available from the following algorithms: CADD, 27 SpliceAI, 20 SPIDEX, 24 S-CAP 22 and MaxEntScan. 28 Where multiple scores were available for a variant from the in-silico tool, we selected the highest score for consideration. Where scores were unavailable we arbitrarily assigned the variant a score of 0. Pre-defined thresholds were applied to determine whether a variant was 'disruptive' or 'undisruptive' to splicing, as suggested by the authors of the original papers, 20,24 by recent refinements of thresholds, 22 or through nationally recommended guidelines.

Assessment of variant prioritization strategies for known disease genes
To determine the performance of three in-silico splicing tools (CADD, SPIDEX and SpliceAI) within genes known as a cause of inherited retinal disease, we identified sets of variants which could be considered true negatives (no expected impact on splicing) or true positives (expected impact of splicing). Our genes of interest are listed in Table S1. True negative (TN) variants were defined as variants available through the gnomAD web server with an allele frequency above 2%. True positive (TP) variants were defined as 'splicing' variants available through HGMD professional. The TN and TP variant datasets were intersected and any overlapping variants were removed from subsequent analysis. For each of the in-silico tools, we analysed TN and TP variants using the 'pROC' R package to calculate the area under the curve (AUC) and to calculate tool-specific thresholds for the optimal separation of TN and TP variants.

RNA investigations
Appropriate functional assays were selected after consideration of gene expression profiles in GTEX (https://gtexportal.org/home/), and the availability of relevant patient samples.

RNA investigations from patient samples -LCLs and blood
Lymphoblast cell cultures were established for control samples and probands. RNA was extracted using the RNeasy® Mini Kit (Qiagen, UK, Catalogue No. 74104) following the manufacturer's protocol. RNA was extracted from whole-cell blood using the PAXgene™ Blood RNA System Kit (Qiagen, UK. Catalogue No. 762174), following the manufacturer's protocol for control samples and probands. Extracted RNA was reverse transcribed using the High Capacity RNA to cDNA Kit (Applied Biosystems, UK. Catalogue No. 4387406) following the manufacturer's protocol. Gene specific primers (available on request) amplified relevant regions of the genes being investigated. PCR products were visualized on an agarose gel using a BioRad Universal Hood II and the Agilent 2200 Tapestation. Visualized bands were cut out and prepared for capillary sequencing on an ABI 3730xl DNA Analyzer.

RNA investigations using cell based minigene assays
Assays were designed to amplify appropriate genomic regions from patient DNA templates. For variants nearby to wild-type exons, we amplified regions containing one or multiple exons along with flanking ~200 intronic nucleotides. For deeply intronic variants we amplified regions containing at least 500bp of flanking intronic sequence. Primer sequences are available upon request. All regions were amplified from patient DNA templates. For homozygous variants, we also generated a minigene plasmid from a control DNA template. Amplified fragments were checked for size using gel electrophoresis, purified using the QIAquick Gel Extraction kit (Qiagen, UK, Catalogue No. 28706) and then cloned into a customized minigene plasmid (a derivative of the pSpliceExpress vector) 29 containing an RSV-promoter and two control exons (rat insulin exons 2 and 3) using the NEBuilder Ò HiFi DNA assembly (NEB, E2621). Amplified fragments were inserted between the two control exons. Plasmids were transformed into competent bacteria (XL-1 blue) and incubated overnight at 37 o C on LB plates containing Carbenicillin. Individual colonies were cultured overnight before isolation of plasmid DNA using the GenElute™ miniprep kit (Sigma-Aldrich, Catalogue No. PLN350). Purified plasmids were Sanger sequenced to confirm successful cloning, and identify plasmids containing the wild type and variant sequence. Plasmids were transiently transfected into HEK-293 cells using Lipofectamine, and incubated for up to 48h in Dulbecco's Modified Eagle Medium (DMEM) supplemented with 10% foetal bovine serum at 37°C and 5% CO2.
RNA was isolated using TRI Reagent® and further purified using the RNeasy Mini Kit (Qiagen, UK, Catalogue No. 74106) which included a DNase digestion step. cDNA was synthesized from up to 4μg of purified RNA using SuperScriptÔ reverse transcriptase (ThermoFisher Scientific, Catalogue No. 18091200) and subsequently amplified by Phusion high-fidelity polymerase (ThermoFisher Scientific, Catalogue No. F553) using primers designed to amplify all minigene transcripts. PCR products were visualized by electrophoresis on a 1-2% agarose gel and purified using the QIAquick Gel Extraction kit. Purified PCR products were Sanger sequenced and aligned to the reference sequence for the minigene vector using the SnapGene software suite and assessed for differences in splicing between wild-type and variant minigene constructs.

Real-time Assessment of Variant Prioritization Strategies
Over a 6-month period (May 2018 -November 2018), we worked through clinical multidisciplinary meetings to identify individuals with 'variants of uncertain significance' that could be responsible for a presenting clinical phenotype if they impacted pre-mRNA splicing ( Figure 1). We ascertained 21 individuals that fitted this criterion. The broad spectrum of clinical phenotypes for these individuals are described in Table 1, and included individuals undergoing both WGS and targeted gene panel NGS testing.
We calculated in-silico splicing scores for each of the identified variants using a variety of algorithms, and arbitrarily assigned a value of 0 for all variants where in-silico scores were unavailable ( Table 2). We predicted variants to be 'disruptive' or 'undisruptive' according to pre-defined thresholds from each of the in-silico splicing tools, resulting in dissimilar outcomes across tools ( Table 2). None of the identified variants were consistently scored as 'disruptive' across all five in-silico splicing algorithms. Similarly, no variants were consistently assigned scores and defined as 'undisruptive' across all five in-silico splicing algorithms. Three deeply intronic variants were consistently scored as 'undisruptive', but these variants were outside of the regions of consideration for SPIDEX and S-CAP. We also applied an in-silico consensus approach, as suggested by ACMG guidelines (PP3), 4 where variants were considered as 'disruptive' if they exceeded the pre-defined thresholds for 3/5 of the in-silico splicing tools. The 3/5 consensus approach prioritized 11 of the 21 variants (Table 2).

Functional Assessment of Prioritized Variants
To determine the accuracy of the in-silico splicing algorithms, we performed functional investigations to assess the exact impact of the investigated variant on splicing and predicted the effect on protein synthesis. Of the 21 variants functionally investigated, we found that 13 clearly resulted in aberrant splicing and could be reclassified as 'likely pathogenic' (Table 3). These findings provided new molecular diagnoses for 14 individuals.
We determined positive predictive (PPV) and negative predictive values (NPV) for each in-silico splicing tool (Table 2). Of the assessed tools, SpliceAI outperformed the others with a PPV of 86% and a NPV of 86% (Table 2). SpliceAI prioritized 12/13 variants shown to disrupt splicing through functional assays.
We compared the use of defined thresholds from single in-silico splicing tools to a 3/5 consensus approach and observed that both SpliceAI and MaxEntScan outperformed the consensus approach with regards to both PPV and NPV (Table 2) for the 21 investigated variants. The consensus approach prioritized 9/13 variants shown to disrupt splicing through functional assays.
Cell based functional analysis enables accurate delineation of precise consequences of variants on the protein translation reading frame We identified a variety of consequences as a result of disruption to splicing (Table 3), including variants resulting in complete exon skipping (n=7), and cryptic splice site activation leading to partial exon truncation (n=3), intron inclusion (n=4) and pseudo exon inclusion (n=2; Box 1, Case Example).
Four variants were indicated through cell based assays to have complex impacts on splicing, resulting in multiple interpretable consequences as a result of splicing disruption (Table 3). For example, SCN2A c.2919+3A>G produced two transcripts expressed at equal levels ( Figure 2a). The first resulted in a transcript with a truncated exon, NM_001040142.1:r.2563_2710del, and the second resulted in a complete exon skip, NM_001040142.1:r.2563_2919del. While we interpreted both events as 'likely pathogenic' it is noteworthy that these events were considered differently using ACMG criteria; 4 the exon truncation event resulted in a frameshift and introduction of a premature stop codon (PVS1), whereas the complete exon skipping event resulted in the inframe removal of 119 amino acids from the transcript (PM4).
We identified a single instance where a variant caused usage of two new cryptic splice sites, MERTK c.2486+6T>A (Figure 2b). This novel variant is present in two individuals with severe rod-cone dystrophy, and resulted in the simultaneous usage of a cryptic exonic splice acceptor site and a cryptic intronic splice donor site creating a novel exon (chr2: 112,779,939-112,780,082, GRCh37), and a premature stop codon in the penultimate exon, p.(Trp784Valfs*10).

Variant Prioritization Within a Large Cohort of Individuals Referred for Targeted NGS Gene Panel Testing
To evaluate the utility of in-silico splicing tools in a clinical environment we assessed variants identified through a targeted gene panel diagnostic test for individuals with rare genomic conditions, specifically inherited retinal disorders. All 2783 individuals had received clinical testing for known disease genes (Table S1). Overall, we demonstrated that the integration of SpliceAI with standard diagnostic testing could improve clarity and/or accuracy of diagnostic testing for 15% of the individuals analyzed, resulting in new molecular diagnoses and new carrier findings.

In-silico splicing tool assessment for relevant disease genes
We assessed the sensitivity and specificity of SpliceAI, SPIDEX and CADD to distinguish TN and TP variants within genes known as a cause of inherited retinal disease. For a communal pool of 2393 variants (TN=1068,TP=1325) which could be scored by all three prediction algorithms, we show that SpliceAI outperforms both SPIDEX and CADD, with an AUC of 0.98 and an optimal score threshold of 0.15 ( Figure  3c). We also show that SpliceAI outperforms CADD and SPIDEX when considering all variants with scores available for each tool (Figure 3a) and after arbitrarily defining missing variants with a score of 0 from each tool (Figure 3b).

Clinical impact of SpliceAI integration in variant assessment
We calculated SpliceAI scores for 1,346,744 variants (20,617 unique variants) identified in 2783 individuals. Variants exceeding a spliceAI value of 0.2 were considered alongside other variants identified as 'pathogenic' or 'likely pathogenic' after routine diagnostic testing strategies (Figure 1). We identified 758 variants (528 unique variants) in 646 individuals that were hypothesized to impact splicing of known disease genes. These variants displayed an array of predicted molecular consequences and were present in genes across disease inheritance types (Table 4; Figure 4). We defined variants prioritized through SpliceAI as: -New, variant not previously highlighted or reported through diagnostic testing; -Clarified, variant previously reported through diagnostic testing but pathogenicity or pathogenic mechanism was unclear; -Reported, variant already described or established as 'pathogenic' or 'likely pathogenic' through diagnostic testing.
In this manner, we identified 379 new variants in 337 individuals, 87 clarified variants in 83 individuals and 292 reported variants in 274 individuals (Table 4). Of the 758 variants with a SpliceAI score >0.2, 145 (19%) were present in HGMD and 613 (81%) were novel. We found most (n=697) variants to be in genes known as a recessive cause of inherited retinal disease, variants were also prioritized in genes known as a cause of autosomal dominant (22 variants in 22 individuals) and X-linked (39 variants in 39 individuals) disorders.

Prioritized variants from in-silico splicing tools have a variety of molecular consequences
Variants with a SpliceAI score >0.2 were present in our dataset with 9 distinct predicted molecular consequences, including 185 canonical splice site variants, 100 intronic variants, 175 splice region variants, 64 synonymous variants and 177 missense variants. We segregated variants into 3 overlapping groups based on their SpliceAI score (  Figure 4). All canonical splice site mutations prioritized through SpliceAI had been previously reported as disease-causing or carrier findings from standard diagnostic testing (Table 4).
We identified 64 synonymous, 175 splice region and 100 intronic variants above the 0.2 SpliceAI threshold. The scores for intronic, splice region and synonymous variants were wide-ranging ( We identified 177 missense variants with a SpliceAI score >0.2, only 13 of which had been reported through diagnostic testing as 'pathogenic' or 'likely pathogenic' due to known disruption to splicing (Table  4). Thirty-seven of the prioritized missense variants had been clinically reported but analysis through SpliceAI highlighted or further supported that the pathogenic mechanism underpinning disease onset could be splicing disruption. 127 variants were newly prioritized missense variants, 66 of which were present in individuals without a clear molecular diagnosis after standard diagnostic testing, although only 16 of these variants had a high SpliceAI score. Forty-five missense variants with a score >0.2 were present in a disease-causing state, with a further 132 variants present as potential carrier findings.

Identification of new or clarified molecular diagnoses as a result of SpliceAI integration
We assessed the overall impact that the integration of SpliceAI would have in this large diagnostic cohort through the identification of new and clarified variants. In total we identified 466 variants in 407 individuals which had not been previously highlighted as a splicing mutation. Of the 407 individuals with new or clarified variants, 227 (56%) had clear/potential molecular diagnoses identified from diagnostic testing and 180 (44%) had not received a molecular diagnosis.
84 of the 466 new or clarified variants were present in a disease-causing state, including 9 in genes associated with X-linked disorders, 11 with dominant disorders and 64 with recessive disorders (16 homozygous, 48 compound heterozygous). These variants were present in 81 individuals, including 20 without a clear molecular diagnosis identified from diagnostic testing. In 64 individuals, the splicing variant was included on the clinical report as a potential but unconfirmed cause of disease, 29 of these variants were missense variants. 373 of the 466 variants were potential carrier findings in recessive genes, and 9 were potential carrier findings in X-linked genes.
Taken together, our analyses prioritized 758 variants in 646 individuals (23% of the cohort). 62% of the prioritized variants (466/758) were new or clarified variants, representing altered analysis of variants identified in 15% of the 2783 individuals. Such variants require functional investigation to establish precise effects on splicing and protein synthesis, but could account for new or refined molecular diagnoses in 81 individuals.

Discussion
Interpreting splicing variants has been an active area of genetics research for decades. 16,30 As a result, a number of motifs essential for splicing are known, 31,32 variants disrupting the normal splicing process are well documented, 19,33 and targeted treatments are emerging. 34,35 However, expanding splicing mutation analysis approaches to whole genome sequencing and NGS datasets in the context of routine clinical investigation remains significantly limited. The objective of this study was to develop a clinically applicable framework for the prioritization and functional assessment of variants impacting splicing in the context of monogenic disease. In this regard, we compared the performance of a number of in-silico splicing tools to establish methods to efficiently prioritize variants disrupting splicing, and we validated the use of a cell based methodology for variant functional interrogation within patient clinical pathways. To assess the potential impact of incorporating variant prioritization strategies into clinically accredited diagnostic genomic testing, we expanded our in-silico investigations to a large cohort of individuals with rare disease. Overall, we demonstrate that utilizing defined score thresholds for a machine learning splicing prediction algorithm, SpliceAI, 20 enables the efficient identification of variants which disrupt splicing (86% PPV), and would improve clarity in the routine clinical analysis for 15% of individuals undergoing diagnostic testing. While our analyses underpin improvements to diagnostic services and the clinical management of individuals, we encountered a number of notable limitations.

Complexity defining appropriate test variants for comparative assessments of in-silico splicing tools
There are significant limitations associated with comparing in-silico variant prioritization approaches, including information leakage, a concept in machine learning where test datasets become contaminated with data from original training datasets. Information leakage inhibits the use of a pure and unbiased dataset to compare performance across tools. To overcome this obstacle we ascertained clinically relevant variants in real-time, most of which were absent from both mutation and population datasets (Table 1). While this dataset is relatively small, it enabled a fair and unbiased comparison across all of the in-silico tools assessed in this study. For three of the in-silico tools we also compared their performance for relevant disease genes, by assessing their capability to distinguish known splicing mutations from common variants. Significantly, we demonstrated that SpliceAI outperformed other in-silico splicing tools through both of these comparative analysis strategies, and that SpliceAI outperformed a consensus approach to variant prioritization (Table 2; Figure 3).
Another limitation of comparative approaches for splicing variants is the capability to segregate performance assessments for different categories of splicing mutations. In theory, variants could be subdivided by their pathogenic mechanism, their effect on pre-mRNA splicing, by their predicted molecular consequence or by the location of the variant with respect to known splicing motifs. However, splicing mutations are often considered as a single class of variants and are therefore highly susceptible to the over-prioritization of in-silico tools to identify canonical splice site mutations, which represent the majority (~70%) of known splicing mutations. 19,24,33 Recent studies have begun to categorize variants by location or expected effect and have demonstrated variable performances of in-silico splicing tools outside of the immediate splice region. 20,22 In this study we show that while SpliceAI performs well for canonical splice site mutations (Table 4; Figure 4), we can also utilize SpliceAI as an analysis strategy to effectively prioritize intronic, missense and synonymous variants, including variants which led to new molecular diagnoses (Table 3). Interestingly, we prioritized 177 missense mutations within our large diagnostic cohort as potentially 'disruptive' to splicing (Table 4), reemphasizing earlier observations that splicing disruption is a significant pathogenic mechanism associated with missense mutations. 24,[36][37][38] Although we have been able to identify deeply intronic cryptic splice site mutations through the described techniques (Box 1, Case Example), an extensive assessment of whole genome sequencing datasets is required to comprehensively assess the success of SpliceAI and other in-silico splicing tools in this regard.
Lessons learnt from establishing a cell based framework for the functional investigation of prioritized variants We performed functional investigations to determine the precise impact of variants on splicing and predicted the impact on protein synthesis. In two cases we were able to directly extract RNA from patient samples. However, in the remaining 19 cases we were either unable to assess relevant RNA profiles from peripheral blood or LCLs, or unable to obtain additional samples for the referred patient. We therefore assessed the effects of prioritized variants within these regions through the utilization of minigene assays, a cell based technique which can be designed to insert bespoke genomic regions from patient DNA templates into a mammalian expression vector to assess differences in RNA transcripts produced from fragments containing wild-type and variant sequences, respectively. A limitation of minigene assays is the capability of such techniques to replicate in-vivo gene expression profiles. Indeed, splicing machinery may be influenced by cell-/tissue-specific factors or by genomic regions outside of the region inserted into the minigene vector, and variants may have pathogenic impacts on gene expression and/or regulation without any detrimental impact on splicing. 10,17,39 Such factors are outside of the scope of the assays performed in this study, and therefore, whilst we calculate NPVs for each of the variant prioritization strategies, future investigations may uncover pathogenic roles for variants reported here. Despite these limitations, we have validated the use of a customizable minigene assay within real-time clinical investigations and successfully identified variants causing aberrant splicing. In the absence of appropriate patient samples for RNA investigation, this is a critical tool for clinical variant investigation.
Significantly, we demonstrate a range of functional consequences on mRNA splicing as a result of disruption to canonical splice sites, including (i) exon skipping, (ii) intron inclusion, (iii) exon truncation (Figure 2a), (iv) displacement of the reading frame (Figure 2b), and (v) a combination of events (Table 3). These observations demonstrate that the precise effect of splicing variants is an important piece of evidence for consideration during clinical variant interpretation, and may in the future lead to refinements in the exact targeted treatment appropriate for these individuals. 34,40 Hence, while we clearly demonstrate that in-silico splicing analyses can prioritize and guide the identification of new clinicallyrelevant variants (Table 4), functional assessment of variant effect remains an important requirement to ensure clarity in clinical reporting and appropriate patient management.

Significant impact of incorporating SpliceAI scores into routine clinical analysis
We were able to demonstrate that 13 variants caused aberrant splicing, and in so doing, provided new molecular diagnoses for 14 individuals (Table 3). Delineating the cause of disease through novel variant interpretation strategies can lead to altered clinical management. 41 In this study, we elucidated that SCN2A c.2919+3A>G underpinned the molecular diagnosis for an individual with severe epilepsy and learning difficulties (Figure 2), and as a result, modified the use of medication to control seizures in the referred individual. 42 To establish the impact of our variant prioritization strategies in everyday clinical practice, we analyzed variants identified in a large cohort of 2783 individuals with rare ophthalmic disease displaying significant genetic and clinical heterogeneity. Overall, we prioritized 379 variants which had evaded prioritization through standard diagnostic testing strategies and provided additional support for pathogenicity for 89 variants. These findings reflected altered clinical analysis in 15% of individuals, and could lead to new or refined molecular diagnosis in up to 81 individuals. Whilst this represents a significant improvement to molecular diagnostic services, we expect that the true impact of such analysis strategies will be more pronounced. The targeted NGS approaches employed within this large cohort ignore deeply intronic regions of genes, and as shown here (Box 1, Case Example) and in other studies, [43][44][45][46][47] variation within these regions can cause aberrant splicing through the production of novel cryptic exons. We expect, therefore, that extending variant prioritization approaches to large cohorts of individuals with whole genome sequencing datasets will enable the identification of clinically relevant deeply intronic variants.
In summary, the recent availability of thousands of genomic datasets within healthcare amplifies the current limitations in interpreting variation within the non-coding genome. Our findings demonstrate the opportunity to expand bioinformatics analysis to the pre-mRNA regions of known disease genes and provide immediate increases to diagnostic yield. Moreover, we demonstrate a requirement to functionally assess variant impact on pre-mRNA splicing as the delineation of the precise effects may be important considerations for variant pathogenicity. Importantly, variants which impact mRNA expression are amenable to targeted therapy, e.g. antisense oligonucleotides, 34 and have been proven to be effective in cell lines for some of the disorders investigated here. 40,47 The prioritization and identification of pathogenic variants impacting splicing is therefore an important consideration for diagnostic services and for the development of new targeted treatments. Overall, we present an in-silico and functional analysis framework for the incorporation of splice variant assessment that is applicable across monogenic disorders.   Table 2. In-silico splicing scores calculated for 'variants of uncertain significance' functionally investigated to assess impact on pre-mRNA splicing. Scores unavailable from tools were arbitrary given a value of 0. The thresholds for SpliceAI was set at 0.2, and for SPIDEX at 5/-5, as per suggestions of the authors. The high-sensitivity thresholds utilized for S-CAP and CADD were categorized by region, S-CAP: (a) 5'extended=0.005, (b) 3'intronic=0.006, (c) 5'intronic=0.006, exonic=0.009; CADD: (a) 5'extended=7.39, (b) 3'intronic=0.0964, (c) 5'intronic=0.574, exonic=0.390. Grey boxes indicate that the variant exceeds the applied in-silico splicing score and was interpreted as 'disruptive'. PPV=positive predictive value; NPV=negative predictive value.