Abstract
Alternative splicing results in the inclusion or exclusion of exons in an RNA, thereby allowing a single gene to code for multiple RNA isoforms. Genes are often composed of many exons, allowing combinatorial choice to significantly expand the coding potential of the genome. How much coding potential is gained by alternative splicing and what is the main contributor: alternative-splicing-depth or exon-count? Here we develop a splice-site-centric quantification method, allowing us to characterize transcriptome-wide alternative splicing with a simple probabilistic model, enabling species-wide comparison. We use information theory to quantify the coding potential gain and show that an increase in alternative splicing probability contributes more to transcriptome expansion than exon-count. Our results suggest that dominant isoforms are co-expressed alongside many minor isoforms. We propose that this solves two problems simultaneously, that is, expression of functional isoforms and expansion of the transcriptome landscape potentially without a direct function, but available for evolution.
Glossary
- Transcriptome
- Set of all RNA molecules in a sample (e.g. cell, tissue, organism).
- Transcriptome expansion
- Increase of the coding potential of the genome.
- Gene annotation
- Meta information added to the raw DNA sequence, such as exon-intron structure.
- Gene architecture
- Exon-intron structure of genes.
- RNA Splicing
- RNA maturation event leading to removal of introns and joining of exons.
- Intron
- Sequence removed by splicing, often non-coding for proteins.
- Exon
- Sequence retained by splicing, often coding.
- Splice site
- Exon-intron (5’ splice site) or intron-exon boundary (3’ splice site).
- Constitutive splicing
- The process that results in the joining of the same two splice sites in all observed situations.
- Alternative splicing
- The process by which one splice site can be joined to distinct partner splice sites.
- RNA-seq experiment
- Qualitative and quantitative profile of the transcriptome obtained by deep sequencing.
- Extent
- A parameter used to characterize the amount of alternative splicing in any given transcriptome; technically, the extent is defined in terms of α, the exponent of the power-law distribution that describes the amount of alternative splicing in the transcriptome.
- Splice site expression
- Number of RNA-Seq observations per splice site.
- Shannon Entropy
- Metric of the expected information content.
- True Diversity
- An ecological concept which measures both the number of distinct species (richness) and how uniformly they are distributed in a sample (evenness).
- Machine Learning
- Computational algorithms which learn rules (model) to predict an output from an input.
- Random Forest
- A non-linear machine learning model based on an ensemble of decision trees with random feature subset selection at each decision node.
- Lasso Regression
- Linear regression regularized by the sum of the absolute values of all regression coefficients (L1 norm).
- Bootstrapping
- Resampling technique to infer the confidence in a population measurement.
- Probability density function (pdf)
- A function of a random variable X that describes the relative likelihood of X taking each of its possible values.
- Kernel density estimation
- A method of estimating the probability density function based on a finite sample of data.
- Bayesian Inference
- A method of statistical inference in which prior knowledge is recursively updated with new data using Bayes’ Theorem in order to make statements about probabilistic hypotheses.
- Prior distribution
- The distribution (pdf) that mathematically formalizes one’s belief about the state of the system before taking empirical evidence (data) into account (note that the distribution can be a mathematical formalization of being in a state of ignorance).
- Posterior distribution
- The distribution that describes the probability of the random variable in question after the evidence/data is taken into account.
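
As an illustration of how the Shannon Entropy and True Diversity entries relate, the following minimal sketch computes both for a set of isoform proportions. It assumes the order-1 form of true diversity (the exponential of the Shannon entropy), which the glossary does not specify; the function names and the example proportions are hypothetical and are not taken from the study.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution, in units set by `base`."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalize so that raw counts can be passed as well as probabilities.
    normalized = [p / total for p in probs]
    # Zero-probability outcomes contribute nothing to the sum.
    return -sum(p * math.log(p, base) for p in normalized if p > 0)


def true_diversity(probs):
    """Order-1 true diversity (assumed form): 2 ** entropy measured in bits.

    Interpretable as the effective number of equally common types.
    """
    return 2.0 ** shannon_entropy(probs, base=2.0)


# Hypothetical isoform usage for two genes with four isoforms each.
dominant = [0.90, 0.05, 0.03, 0.02]  # one dominant isoform, three minor ones
uniform = [0.25, 0.25, 0.25, 0.25]   # all isoforms used equally

if __name__ == "__main__":
    for name, probs in (("dominant", dominant), ("uniform", uniform)):
        print(name, shannon_entropy(probs), true_diversity(probs))
```

Four equally used isoforms give an entropy of 2 bits and a diversity of 4, whereas a gene with one dominant isoform and several minor ones has the same richness but a lower effective number of isoforms, reflecting its reduced evenness.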