ESGq: Alternative Splicing events quantification across conditions based on Event Splicing Graphs

Alternative Splicing (AS) is a regulation mechanism that contributes to protein diversity and is also associated to many diseases and tumors. Alternative splicing events quantification from RNA-Seq reads is a crucial step in understanding this complex biological mechanism. However, tools for AS events detection and quantification show inconsistent results. This reduces their reliability in fully capturing and explaining alternative splicing. We introduce ESGq, a novel approach for the quantification of AS events across conditions based on read alignment against Event Splicing Graphs. By comparing ESGq to two state-of-the-art tools on real RNA-Seq data, we validate its performance and evaluate the statistical correlation of the results. ESGq is freely available at https://github.com/AlgoLab/ESGq.


Introduction
Alternative Splicing (AS) is a post-transcription regulation mechanism that contributes to isoform and protein diversity in eukaryotes.Due to AS, depending on its environment, a single gene can produce multiple isoforms, hence complicating our understanding of the gene expression process.For instance, more than 95% of multi-exon human genes [1,2] and more than 60% of multi-exon Drosophila Melanogaster genes [3] exhibit more than one isoform.Due to its association to aging [4], cancer [5], and neuro-degenerative diseases [6], the analysis of AS is of the utmost importance.
In the last decade, RNA-Sequencing has become the defacto standard for the analysis of alternative splicing and a plethora of tools have been proposed in the literature.From a very high level point of view, the approaches for the analysis of alternative splicing available in the literature can be divided in two groups, depending at which level they work: transcript-based [7,8] and eventbased approaches.
In this work, we will focus on the second category.This kind of approaches characterizes AS at the most fine-grained level by giving a detailed and strict description of what happens at the exon-exon (or splice) junction level.Although being more rigorous in the description of AS events, this kind of approaches resulted more ac-curate than the transcript-based competitors [9], thus potentially providing a more detailed characterization of alternative splicing.Classical AS events are grouped in 5 categories [10]: exon skipping, alternative 3' (acceptor) splice sites, alternative 5' (donor) splice sites, intron retention, and mutually exclusive exons.Many tools have been developed to perform AS events detection and quantification [11,12,13,14,15].Recent works [16,17] argue that the classic definition of AS events is not satisfactory and not adequate to fully capture the complexity of alternative splicing.To this aim, they introduce Local Splicing Variations, a novel concept that aims to represent complex AS patterns and then increase the expressive power of the classical and more strict classification.However, the detection and quantification of classical AS events is an already hard -and not fully solved -problem that does not need further complications.Indeed, tools and methodologies show several limitations [9].For instance, although AS events exhibit a strict definition, tools available in literature inconsistent results, due to the different definitions and filtering criteria adopted.Moreover, every tool uses its own format to describe the AS events, making any downstream analysis quite complex.
In this context, we focus on the detection and quantification of non-novel AS events across conditions and provide an extensive comparison of two state-of-the-art tools, rMATS [11] and SUPPA2 [12].Moreover, to validate our findings, we introduce ESGq, a novel graph-based methodology for the quantification of AS events across two conditions.Inspired by the recent progress and development in the field of pangenomic and graph algorithms [18,19], ESGq models AS events as local splicing graphs, called event splicing graphs, and then quantify the events by aligning reads to them.The usage of graphs in the transcriptomic world is not new [13,14,20,21,22] but the use of simplified graph-based representation characterizing precise loci of the genome is something that was never investigated.This work provides a first exploratory investigation and may lay the foundations for a new generation of approaches for the efficient and accurate AS events quantification based on pantranscriptome graphs.
Experiments on a very recent real RNA-Seq dataset sequenced from Drosophila Melanogaster flies at two time points show that, by pairing simple graphs with accurate mapping, ESGq is able to achieve comparable results to state-of-the-art without sacrificing its efficiency.Moreover, our comparison proves once again the inconsistency of the results obtained by different methodologies for the detection and quantification of AS events.

Method
We introduce ESGq, a novel graph-based method for the differential quantification of alternative splicing events across conditions.ESGq takes as input a reference genome (in FASTA format), a gene annotation (in GTF format), and a two conditions RNA-Seq dataset with optional replicates (in FASTQ format), and computes the differential expression of annotated AS events (in custom text format).For each event ESGq provides the Percent-Spliced In (PSI, ) with respect to each input replicate and the ∆, summarizing the differential expression of each event across the two conditions.Current implementation focuses on four types of alternative splicing events (exon skipping, intron retention, alternative acceptor site, and alternative donor site) and supports both paired-end and single-end RNA-Seq datasets.
Differently from state-of-the-art approaches, which rely on spliced read alignment to a reference genome or quasi-mapping transcript quantification, ESGq relies on read alignment against a graph-based structure representing the events that need to be quantified.Instead of using a full splicing graph or a pantranscriptome, that are common structures in the literature [13,14,19], ESGq limits its computation to smaller and less complex graphs, the Event Splicing Graphs.An event splicing graph is a splicing graph which encodes only the exons and splice junctions involved in an alternative splicing event.Differently from splicing graphs commonly used in the literature, that represent all known transcripts of a gene, and differently from pantranscriptomes, where entire gene loci and intergenic regions are represented, an Event Splicing Graph encodes only the portions of the two transcripts involved in an event.By using this simpler representation, ESGq is able to achieve great efficiency without sacrificing its accuracy.
ESGq consists of three steps (also depicted in Figure 1): 1. event splicing graphs construction 2. read alignment against event splicing graphs 3.  and ∆ computation ESGq starts its computation by extracting the annotated alternative splicing events from the input gene annotation.To this aim, it employs a module of the SUPPA2 tool [12].The output of this module is a list of annotated alternative splicing events, that are events whose two isoforms are already annotated in the input gene annotation.For each event, SUPPA2 reports the type (one among SE, RI, A3, A5) and the genomic coordinates of the splice junctions involved in it.Starting from this list, ESGq builds the event splicing graphs, one per event.We note that multiple graphs can be built from the same gene.By exploiting the genomic coordinates of an event, ESGq retrieves the corresponding exons and adds them as nodes in the event splicing graph.Depending on the AS event type, ESGq adds edges between these nodes in order to represent the two isoforms involved in the event: the canonical isoform (that, in graph terms, is the path denoted as  ) and the alternative one (denoted as ).
More precisely, the four scenarios, one per event type, contemplated by ESGq are depicted in Figure 2 and can be formally defined as follow: • to model an exon skipping event (SE), ESGq needs to take into account three exons and this implies that the corresponding event splicing graph is composed of three nodes 1, 2, 3.In detail, 2 represents the exon that is spliced out during the event.The canonical isoform is represented by the path involving all three nodes ( = 1 → 2 → 3) while the alternative isoform consists in the skip of 2 ( = 1 → 3); • to model an alternative acceptor site event (A3), ESGq needs to take into account three exons and accordingly three nodes 1, 2, 3.In detail 1 represents the upstream exon, 2 the canonical downstream exon, and 3 the downstream exon with the alternative acceptor splice site.The canonical isoform is represented by the path involving the shared upstream exon and the canonical downstream exon ( = 1 → 2) whereas the alternative isoform changes the downstream exon with the alternative one ( = 1 → 3); • to model an alternative donor site event (A5), ESGq needs to take into account three exons and accordingly three nodes 1, 2, 3.In detail, 1 represents the canonical upstream exon, 2 the canonical downstream exon, and 3 the upstream exon with the alternative donor splicing site.The canonical isoform is represented by the path involving the canonical upstream exon and the shared downstream exon ( = 1 → 2) whereas the alternative isoform changes the up- stream exon with the alternative one ( = 3 → 2); • to model an intron retention event (RI), ESGq needs to take into account three exons.However, this case is harder than the previous ones: the three nodes 1, 2, 3 of the graph do not closely correspond to three exons, but one of them (3) correspond to a portion of it.In detail, 1 and 2 represent the two upstream and downstream exons whereas 3 represents the retained intron (i.e., the internal portion of the exon linking the upstream and downstream canonical exons).The canonical isoform is then represented by the path involving the upstream exon and the downstream exon ( = 1 → 2) whereas the alternative isoform includes the retained intron in the path ( = 1 → 3 → 2).
We note that, from a conceptual point of view, each event contributes to an event splicing graph, but, from a more practical point of view, ESGq builds a single graph with multiple connected components, one per event.
In the second step, ESGq indexes the graph constructed in the previous step.Since the goal is to align the input RNA-Seq reads to the graph using the Giraffe aligner [23], ESGq employs the VG toolkit [24] to build the gBWT (graph Burrows-Wheeler Transform) index [25].Each input replicate is then independently aligned to the graph using Giraffe.We note that, since by default vg breaks each node longer than 32bp into smaller nodes (of length ≤ 32), in order to keep the association between the nodes in the input graph (that is the one built by ESGq) and the nodes in the indexed version, ESGq directly breaks the nodes while building the event splicing graphs and links them accordingly to maintain the same two paths, i.e., isoforms.Due to this, in an event splicing graph we can observe two kind of edges: edges linking the smaller (≤ 32) nodes internal to an exon and edges that represent the real splice junction of interest for the AS event.Although ESGq differentiates between these two kinds of edge, conceptually only the edges representing a splice junction are used by ESGq.For this reason, we decided to omit these edges from Figure 2.
In the third and last step, ESGq computes the  value of the events w.r.t. each replicate and then summarize these values by comparing the two conditions and computing a ∆ value per event.This value represent the differential expression of each event across the two input conditions.
To do so, ESGq analyses the graph alignments computed in the previous step and assign a weight to each edge that represents a splice junction.Since each read is aligned to a path of the graph, computing this weight is straightforward as increasing a counter per edge.Indeed, a read can be aligned to a single node of the graph, hence without using any edge, or to a sequence of nodes, hence using one or more edges.In such a case, ESGq checks every edge used by the alignment and, if an edge is a junction edge, it increases its weight by 1.In other words, since each junction edge represent a splice junction, its weight represents the number of reads that have been spliced aligned over it.
Finally, ESGq uses these weights to compute the  value of each AS event following its classical formulation, i.e., the proportion of reads supporting the standard isoform over the reads supporting both isoforms [26].
Event splicing graphs computed by ESGq.For each AS event type (exon skipping, SE, alternative acceptor site, A3, alternative donor site, A5, and intron retention, RI), we report the event splicing graph  and the two annotated isoforms involved in the event: the canonical transcript   (represented in  by blue edges) and the alternative transcript   (represented in  by dotted green edges).In   and   ,  blocks represent and  blocks represent introns.The edge labels  represent the weights computed by ESGq after the read alignment step.
Differently from other approaches, which rely on both spliced and not spliced reads, ESGq  computation is based only on spliced reads counts, hence the support of an isoform is approximated using only these values and does not take into account its full coverage.We believe that this is a good approximation of the correct  value and this is also confirmed by our experimental evaluation. calculation can be summarized as follows (we also refer the reader to Figure 2): considers the mean of the weights 1, 2 of the canonical isoform and the weight 3 of the alternative isoform (in a similar fashion to [27]); 1 + 2 consider the weight 1 of the canonical isoform and the weight 2 of the alternative isoform considers the weight 1 of the canonical isoform and the mean of the weights 2, 3 of the alternative isoform Starting from these  values (one per event per replicate), which summarize the event quantification for each input replicate, ESGq computes the differential quantification across the two input conditions (∆) as the difference between the absolute value of the  means in the two conditions.Differently from other approaches, ESGq does not assign a p value to the ∆.This is mainly a consequence of the simplified AS events quantification based only on spliced reads counts.Future works will be devoted to improve the statistical validation of ESGq results.

Experimental evaluation
We implemented the ESGq pipeline in Python and the code is freely available at https://github.com/AlgoLab/ESGq.
We note that the most computationally intensive steps of the pipeline (i.e., graph indexing and read-to-graph alignment) rely on the VG toolkit [24], that is implemented in C++.We assessed ESGq efficacy and efficiency on a real dataset of RNA-Seq reads (SRA BioProject ID: PRJNA718442) that comes from a recent study [28] on the correlation between ageing and differential gene expression in Drosophila Melonogaster.More precisely, the study conducted a genome-wide differential expression analysis at two different time points: day 1 and day 60 flyes.The dataset consists of three replicates for two conditions (the two time points), for a total of 6 Illumina Hiseq samples.All samples are paired-end and consist of 151bp-long reads (see Table 1 for more details).
Differently from the aforementioned study, where the focus was the analysis of differential gene expression, in this work, we analyze the same dataset from the perspective of differential quantification of alternative splicing events.To do so, we applied ESGq and two other stateof-the-art approaches for the differential quantification of alternative splicing events across multiple conditions: rMATS [11] (version 4.1.2) and SUPPA2 [12] (version 2.3).The former performs differential quantification starting from read alignment to the reference genome whereas the latter starts from the quasi-mapping transcript quantification of Salmon [29].In this way, we have been able   was run starting from Salmon transcript quantification (version 1.10.1).All tools were run using their default parameters and 16 threads (when possible).In our analyses, we considered the reference, gene annotation, and transcripts provided by FlyBase [31], release 6.52.We compared the ∆ values reported by the three considered tools.ESGq reported 3 276, rMATS reported 3 699, and SUPPA2 reported 1 619 AS events correctly quantified.We note that both ESGq and SUPPA2 start from a list of events and quantify them: each time an event can not be correctly quantified (due to, for instance, no coverage support), they assign ∆   to that event.In that case, the event is not considered as correctly quantified, thus excluded from the analysis.The huge difference in the number of reported events proves the complexity of detecting AS events from RNA-Seq reads and highlights the inconsistency between different methodologies based on different filtering criteria [9].In our analysis we included all those events correctly quantified by all three tools and considered statistically significant by both rMATS and SUPPA2, i.e., events with p value ≤ 0.05.A total of 933 events resulted from this filter: 374 exon skipping (40%), 190 alternative 3' (20%), 154 alternative 5' (17%), and 215 intron retention (23%).We note that we also tried to include Whippet [15] in our evaluation but, due to a different AS event representation that cannot be easily compared to the representation given by the other tools, we ended up omitting it from the analysis.
Figure 3 and Table 3 report the results of our analysis.All tools achieved comparable results.Remarkably, ESGq and rMATS achieved a very good correlation, with a Pearson correlation coefficient equal to 0.918 (Figure 3a).Although both approaches are based on read alignment, they show two main methodological differences that slightly affect their results.Firstly, rMATS uses read counts coming from read alignment to a reference genome whereas ESGq uses (spliced) read counts coming from alignment to event splicing graphs, that are a reduced representation.Secondly, the quantification step implemented in ESGq is quite simplistic and not elaborate as the probabilistic framework implemented in rMATS.However, none of these two differences (that are a wanted restriction and a current -undesired -limitation) seems to affect the results of ESGq.On the other hand, SUPPA2 resulted less correlated to the two other approaches, showing a Pearson correlation coefficient ranging from 0.766 to 0.813 (Figure 3b and 3c).This was somewhat expected since it is based on transcript quantification, and not on spliced read alignment.As proven in the literature, such a difference is expected since current transcript annotation models may result inaccurate [15].

Tool1
Tool2 Table 2 reports the correlation between the three considered tools broken down by event type.Surprisingly there is no clear trend that can be observed.ESGq and rMATS exhibit the highest correlation on exon skipping events and the lowest on intron retentions.ESGq and SUPPA2, instead, show higher correlation on exon skippings and lower correlation on alternative donor events.Finally, rMATS and SUPPA2 exhibit higher correlation on alternative acceptor site and lower correlation on alternative donor events.These results are somewhat unexpected and require a further investigation.
ESGq also resulted very computationally efficient, completing the analysis in half an hour requiring 1GB of RAM.Similarly, SUPPA2 ran in less than 10 minutes and used 1.5GB of RAM.On the contrary, rMATS resulted the most expensive approach, requiring more than 5 hours and 8GB of RAM.The most expensive step is read alignment with STAR, that required from half an hour to two hours per sample.By pairing simple and precise graph representation of well localized loci of the genome (i.e., the event splicing graphs) with fast and accurate read alignment, ESGq is able to achieve results comparable to the other alignment-based approach, while being 10x faster.
Since SUPPA2 is based on the -mer based quasimapping of Salmon, we also analyzed how -mer size affects its results.For this reason, we ran Salmon (and, consequently, SUPPA2) two additional times with  ∈ {13, 21} (we note that 31 is the default value used in the previous results).As shown in Table 3, the results of SUPPA2 seems to be unaffected by the choice of the  parameter.Indeed, the correlation between SUPPA2 (ran with different  value) and other tools changes marginally.
Moreover, to evaluate if the results of the considered methodologies are affected by read length, starting from the 151-bp paired-end dataset, we manually trimmed the input reads to 51bp and 101bp using seqtk and created two additional datasets.Table 3 reports the results of this analysis.Even with very short reads (i.e., 51bp reads), the three tools achieved the same correlation (with a very marginal difference of < 0.015).
Finally, to evaluate how much paired-end information may improve the accuracy of the tools, we merged the two pairs of each replicate into a single sample, in order to simulate a single-end dataset.Surprisingly, there is small to none difference between the results on paired-end and single-end dataset, highlighting the robustness of the considered approaches.For instance, the Pearson correlation coefficient between ESGq and rMATS decreased by a very marginal 0.009 whereas correlation between SUPPA2 and the other approaches decreased by 0.041 (w.r.t.ESGq) and 0.027 (w.r.t.rMATS).
The experimental evaluation has been implemented as a Snakemake workflow [32], thus it is fully reproducible and easily replicable.Scripts and instructions are available at https://github.com/AlgoLab/ESGq.All the experiments were performed on a 64bit Linux (Kernel 5.15.0) system equipped with two 16-core AMD EPYC 7301 2.2GHz processors and 128GB of RAM.

Conclusions
In this paper we introduced ESGq, a novel graph-based approach for the AS event quantification across two conditions.Differently from state-of-the-art tools, ESGq is based on read alignment against local graph structures, introduced here as event splicing graphs, that represent AS events precisely represented in a given gene annotation.An extensive exploratory analysis on real RNA-Seq dataset showed that ESGq is able to achieve comparable results with respect to other approaches based on the alignment of reads to the reference genome, while being 10x faster.
Future works will be devoted to improving the statistical framework behind the  and ∆ computation and to extending the experimental evaluation by assessing the actual accuracy of ESGq (e.g., by computing performance metrics on simulated and RT-PCR validated events).An interesting future direction consists in extending the notion of event splicing graphs to include information on known genetic variations (SNPs and indels) in order to improve the quality of read alignment, as proven in a recent work on pantrascriptomes [19].Moreover, the detection and quantification of novel AS events from pantrancriptomes remain another interesting open problem.

Figure 1 :
Figure 1: ESGq method.(a) From the reference genome and the gene annotation, ESGq builds the Event Splicing Graphs (ESGs in the figure).(b) RNA-Seq reads from the two conditions ( 1 and  2 ) are aligned to the Event Splicing Graphs.(c) Graph alignments are used to weight junction edges and the weights are used to compute the  values (one per condition) and the Δ value (one per dataset).

( a )Figure 3 :
Figure 3: Correlation plots between the Δ values computed by ESGq, rMATS, and SUPPA2.These results refer to our analysis of 151 paired-end sample and  = 31 (for Salmon and SUPPA2).

Table 1
Real dataset used in our experimental evaluation (SRA Bio-Project ID: PRJNA718442).
Table reports the number of reads and the size in GigaByte of each paired-end replicate.
[30]S was run starting from the alignments produced by STAR aligner[30](version 2.7.10b) whereas SUPPA2

Table 2
Pearson correlation coefficients between ESGq, rMATS, and SUPPA2 (over the transcript quantification of Salmon ran with  = 31) broken down by event type on the 151bp paired-end dataset.