Abstract
Background Despite numerous molecular and computational advances, roughly half of patients with a rare disease remain undiagnosed after exome or genome sequencing. A particularly challenging barrier to diagnosis is identifying variants that cause deleterious alternative splicing at intronic or exonic loci outside of canonical donor or acceptor splice sites.
Results Several existing tools predict the likelihood that a genetic variant causes alternative splicing. We sought to extend such methods by developing a new metric that aids in discerning whether a genetic variant leads to deleterious alternative splicing. Our metric combines genetic variation in the Genome Aggregation Database (gnomAD) with alternative splicing predictions from SpliceAI to compare observed and expected levels of splice-altering genetic variation. We infer genic regions with significantly less splice-altering variation than expected to be constrained. The resulting model of regional splicing constraint captures differential splicing constraint across gene and exon categories, and the most constrained genic regions are enriched for pathogenic splice-altering variants. Building from this model, we developed ConSpliceML. This ensemble machine learning approach combines regional splicing constraint with multiple per-nucleotide alternative splicing scores to guide the prediction of deleterious splicing variants in protein-coding genes. ConSpliceML more accurately distinguishes deleterious and benign splicing variants than state-of-the-art splicing prediction methods, especially in “cryptic” splicing regions beyond canonical donor or acceptor splice sites.
Conclusion Integrating a model of genetic constraint with annotations from existing alternative splicing tools allows ConSpliceML to prioritize potentially deleterious splice-altering variants in studies of rare human diseases.
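The observed-versus-expected comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Region` type, the region names, and the counts are hypothetical, and a plain O/E ratio stands in for the significance testing the abstract refers to.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical genic region with splice-altering variant counts."""
    name: str
    observed: int    # splice-altering variants seen in population data
    expected: float  # splice-altering variants expected by a background model


def oe_ratio(region: Region) -> float:
    """Observed over expected (O/E); lower values suggest stronger constraint."""
    if region.expected <= 0:
        raise ValueError("expected count must be positive")
    return region.observed / region.expected


def rank_by_constraint(regions: list[Region]) -> list[Region]:
    """Order regions from most constrained (lowest O/E) to least constrained."""
    return sorted(regions, key=oe_ratio)


# Illustrative values only; these are not data from the study.
regions = [
    Region("region_A", observed=2, expected=10.0),
    Region("region_B", observed=9, expected=10.0),
    Region("region_C", observed=5, expected=10.0),
]
ranked = rank_by_constraint(regions)
for r in ranked:
    print(f"{r.name}\tO/E = {oe_ratio(r):.2f}")
```

In this toy example, the region with the fewest observed variants relative to expectation is ranked first, mirroring the idea that depletion of splice-altering variation marks a region as constrained.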
Competing Interest Statement
The authors have declared no competing interest.
Abbreviations
- ACMG
- American College of Medical Genetics
- AD genes
- Autosomal Dominant genes
- AG
- Acceptor Gain
- AL
- Acceptor Loss
- AR genes
- Autosomal Recessive genes
- AUKS
- Au-Kline Syndrome
- Avg. PR Score
- Average Precision Score
- CDS
- Coding sequence
- CI
- Confidence Interval
- CLI
- Command Line Interface
- ConSplice
- Constrained Splicing
- constLevel
- Constitutive Level
- DD/ID
- Developmental Delay and Intellectual Disability
- DG
- Donor Gain
- DL
- Donor Loss
- DM
- Disease-causing Mutations
- DNA-seq
- DNA sequencing
- DNMs
- de novo mutations
- DRVT
- Dravet Syndrome
- FDR
- False Discovery Rate
- GGD
- Go Get Data
- gnomAD
- Genome Aggregation Database
- GTEx
- Genotype-Tissue Expression
- HI genes
- Haploinsufficient genes
- HGMD
- Human Gene Mutation Database
- HHT2
- Hereditary Hemorrhagic Telangiectasia
- IQR
- Interquartile range
- LOEUF
- loss-of-function observed/expected upper-bound fraction
- MFS
- Marfan Syndrome
- NF1
- Neurofibromatosis type 1
- NMD
- Nonsense-Mediated Decay
- O/E
- Observed over Expected
- OLS
- Ordinary Least Squares
- OPA1
- Optic Atrophy 1
- OMIM or MIM
- Online Mendelian Inheritance in Man
- OR
- Odds Ratio
- OR genes
- Olfactory Receptor genes
- PAR
- Pseudoautosomal Region
- pLI
- probability of loss-of-function intolerance
- pLoF
- putative loss-of-function
- PR
- Precision-Recall
- PR AUC
- Precision-Recall Area Under the Curve
- RB1
- Retinoblastoma
- RENS1
- Renpenning Syndrome
- RF
- Random Forest
- RNA-seq
- RNA sequencing
- SEMD
- Spondyloepimetaphyseal Dysplasia
- SMAPME
- Spinal Muscular Atrophy with Progressive Myoclonic Epilepsy
- SNV
- Single Nucleotide Variant
- sOutliers
- Splicing Outliers
- STL1
- Stickler Syndrome
- UDN
- Undiagnosed Disease Network
- VG
- expected genetic variation in gene expression