AERON: Transcript quantification and gene-fusion detection using long reads

Single-molecule sequencing technologies have the potential to improve measurement and analysis of long RNA molecules expressed in cells. However, analysis of error-prone long RNA reads is a current challenge. We present AERON for the estimation of transcript expression and prediction of gene-fusion events. AERON uses an efficient read-to-graph alignment algorithm to obtain accurate estimates for noisy reads. We demonstrate AERON to yield accurate expression estimates on simulated and real datasets. It is the first method to reliably call gene-fusion events from long RNA reads. Sequencing the K562 transcriptome, we used AERON and found known as well as novel gene-fusion events.

Whole-transcriptome sequencing has become an important method in many research projects. Due to the 2 revolution in short-read sequencing technologies and algorithm development, the detection of expressed 3 RNAs in biological samples with thousands of cells [1] and even single cells [2] is nowadays done routinely. 4 There are many important applications of such technologies, for example the detection of disease specific 5 gene expression patterns [3] or the detection of gene fusion events in cancer cells [4], which opens the door 6 for new therapeutic options and novel biological discoveries. However, short-read whole transcriptome 7 sequencing has its weaknesses. RNA molecules can be thousands of nucleotides long and recent studies 8 have revealed that more than 95% of multi-exonic genes undergo alternative splicing [5,6]. Short-read 9 sequencing thus has important limitations when it comes to the accurate quantification of gene isoform 10 expression levels and the detection of gene fusion events [7]. Recent long read sequencing technologies, 11 like developed by Pacific Biosciences (Pacbio) and Oxford Nanopore Technologies (ONT), have made 12 significant progress in sequencing output per run at dramatically reduced costs. Thus, an important 13 current challenge is to develop methods that can use long read RNA sequencing data for tasks where 14 long reads overpower short reads, such as transcript quantification and gene fusion detection. Given their 15 ability to cover large fractions of each transcript -and frequently complete transcripts -longer reads 16 hold the promise of more accurate expression estimates. Although there is much potential in using long 17 In all the above methods, the aligner finds a seed in the reference sequence and extends the alignment 44 from the seeds. In case of multi-mapping reads, the aligner selects the primary alignment based on the 45 alignment score. This makes the assignment of reads to a transcript biased towards how the primary 46 alignment is selected [22]. Hence, if a read maps to multiple transcripts, which are highly similar to each 47 other, the assignment becomes ambiguous. 48 Already in 2002, the concept of splicing graphs to represent possible transcripts of a gene has 49 been introduced [23]. More recently, general purpose tools for working with sequence graphs have 50 emerged [24][25][26]. Graphs with nodes representing nucleotide characters and edges representing adjacencies 51 have successfully been used for variant calling [24][25][26], genome assembly [27][28][29][30], and short tandem 52 repeat resolution [31]. Furthermore, alternative splicing events can be detected by aligning short reads 53 to splicing graphs [32]. So far, to our knowledge, no algorithm has been proposed to use sequence graphs 54 for long read transcript quantification and gene-fusion detection. 55 Contribution 56 Here, we introduce AERON, an alignment based pipeline for quantification and detection of gene-fusion 57 events using only long RNA reads. We adapt the sequence-to-graph alignment tool GraphAligner [33] 58 to align reads generated from noisy long read technologies, such as ONT, to a reference transcriptome 59 represented as gene-exon graphs. These graphs naturally allow to quantify gene and transcript expression 60 based on the overlap with paths of known transcripts. We introduce the first long-read-specific gene-fusion 61 detection algorithm. We tested AERON on three ONT datasets of varying coverage, including K562 data 62 we generated for this study, and were able to achieve accurate quantification exceeding performance of 63 current techniques. Prediction of fusions on simulated and real datasets illustrates the ability of AERON 64 to discover known and novel fusion events directly from long read data. AERON is an open source 65 software, which can be accessed via https://github.com/SchulzLab/Aeron (MIT licensed) 66

67
A new approach for long read alignment and quantification 68 Here we present the AERON pipeline, which aligns long RNA reads to sequence graphs and reports the 69 abundances of transcripts present in the sample. Figure 1 summarizes the approach. In a pre-processing 70 step, a gene-exon graph is constructed for each gene of the genome (Fig. 1a). Each node in the graph for 71 gene g represents an exon. If an exon is subject to alternative splicing in some of the transcripts, e.g. 72 alternative donor or acceptor sites, the corresponding node is divided into sub-nodes. The 3' end of a node 73 is connected to the 5' end of all the nodes downstream of it. The set of all gene-exon graphs represents 74 the transcriptome. Each annotated transcript is uniquely represented as a path in one gene-exon graph. 75 These graphs are constructed and augmented with the corresponding transcript paths as a preprocessing 76 step, which we refer to as indexing (Fig. 1a,b).

77
In the alignment step, a long read sequencing dataset is aligned to all gene-exon graphs (Fig. 1c) using 78 the GraphAligner [33]. Finally, through comparison of read alignments with graph paths of annotated 79 transcripts, expression values are computed (Fig. 1d). A transcript t is considered as expressed if for 80 any read r the following conditions are met: 1) the E-value [34] of the alignment between r and the 81 gene-exon graph is less than 1.0 and 2) at least 20% of the path covered by r is contained in the path 82 traversed by t. The read r is assigned to transcript t with a score s, where s is the percentage of the 83 path of r contained in t.

84
However, a read can -in principle -get aligned to two different transcripts with the same score. 85 Earlier studies have shown that since sequencing is performed from the 3' to the 5' end, there is a certain 86 bias towards the 3' end of the transcripts [35], i.e, more reads are expected to come from a region near 87 the 3' end of a transcript. We observed the same effect as the density of reads was higher towards the 3' 88 end of the transcripts as compared to the 5' end (Supp Fig.S1 and S2). Hence, in the scenario where a 89 read is aligned to two transcripts with the same score, we assigned the read to the transcript whose 3' 90 end is closest to the 3' end of the read. AERON assumes that each long read represents an individual 91 transcript molecule.

92
The final quantification of transcript t is done by simply counting the number of reads assigned to it 93 and converting the count into Transcripts Per Million (TPM) values (Fig. 1d). Gene-level quantification 94 is achieved by summing up the TPM values for all the transcripts belonging to a gene.

95
Comparison of sequencing protocols 96 Sequencing of a transcriptome generally involves synthesizing the complementary DNA (cDNA) and 97 amplifying the cDNA using PCR. An alternative protocol is to directly synthesize the single stranded 98 RNA (DirectRNA) [36]. To measure the performance of AERON on data generated from the above 99 two protocols, we downloaded two ONT datasets from the NA12878 cell line [37]. The first contained 100 15M sequences obtained using the cDNA protocol and the second 10M sequences obtained using the 101 DirectRNA protocol. We ran AERON on these two datasets separately. As expected, a higher density of 102 reads in both datasets aligned closer to the 3' end of the transcript (Supp. Fig. S2). This 3' bias was 103 stronger in alignments of DirectRNA reads as compared to reads from the cDNA dataset.

104
Further, we computed the expression levels of all known Ensembl genes and transcripts (version 92) 105 for both datasets. Out of 58,336 annotated Ensembl genes, 28,584 and 28,021 genes were identified 106 using the cDNA and DirecRNA dataset, respectively. 25,823 genes were commonly identified in both. 107 Similarly, out of 203,675 annotated transcripts, 102,748 and 107,030 transcripts were identified from the 108 DirectRNA and cDNA subset, respectively. 85,697 transcripts were commonly detected in both. Figure 2 109 shows a comparison between the protocols at gene and transcript level. We found gene expression 110 levels to be highly correlated across the protocols (Spearman r=0.90). Expectedly, at the transcript 111 level, the correlation between the expression estimates from the two protocol was lower, but still in 112 reasonable agreement (Spearman r=0.68, Figure 2b). Further, we randomly divided the reads of NA12878 113 dataset into three subsets and ran AERON on the three subsets seperately with default parameters. 114 We calculated the gene level and transcript level expression for the three subsets. We found both the 115 3/23 consisting of four transcripts T 1 and T 2 in G 1 and T 3 and T 4 in G 2 . The boxes within each transcripts depicts the exons present in the gene. The quantification consists of two steps -1) Indexing: this step begins with (a) graph construction where each exon of a gene serves as a node (grey box) and is split if the corresponding exon has an alternate donor/acceptor site. A node is connected, by an edge (solid arrows), to all nodes which are downstream of it. Graph construction is followed by (b) alignment of transcripts to the graph. 2) Index construction is followed by (c) alignment of reads to all the graphs. Alignments of step (b) and (c) are depicted by solid colored line and the dashed colored lines depicts the path followed by the transcript or the read. In the (d) quantification step, read and transcript alignment are compared. A read is assigned to a transcript if the path followed by the read is contained in the path followed by the transcript. gene level and transcript level estimates to be highly correlated to each other (Supp. Fig. S2 and S3). 116 Based on the above experiments we were able to assert the fact that ONT data quantified using AERON 117 produces reproducible quantification.

119
To further test the performance of AERON, we generated 2M ONT reads from the K562 cell line. We 120 aligned and quantified the NA12878 and K562 reads against the human Ensembl transcriptome. In the 121 K562 dataset, more than 80% of the reads were aligned with a median length of 398. For NA12878, 122 96% of the reads were aligned to the transcripts with a median length of 776. As expected, many of the 123 unaligned reads were either short in length or contained many low quality bases, which suggests they 124 contain too many errors to determine their origin correctly (Supp. Fig. S5).

125
In order to compare estimates from AERON and Minimap2 [19] using our long read datasets, we 126 computed an additional estimate of gene and transcript expression for both cell lines from short read 127 Illumina data using Salmon [10], see Methods. Taking the Salmon estimates as reference allowed us 128 to calculate Spearman correlation and absolute error values (MARD scores, see Methods). In these 129 comparisons we considered all genes and transcripts present in the human genome according to ENSEMBL 130 annotation (v92). We summarized the results in Table 1. For the K562 dataset, AERON is able to assign 131 approximately 50% more reads to genes as compared to Minimap2. For the complete NA12878 dataset, 132 AERON is able to assign 3% more reads than Minimap2. AERON shows better correlation and error 133 rates, i.e, lower MARD scores than Minimap2 when compared to short read gene expression estimates 134 for both datasets.
135 Figure 3 shows a scatterplot of gene expression estimates of AERON (Fig. 3a,c) and Minimap2 (Fig. 136 3b,d) against the short read estimates (y-axis). For this plot, we only take the genes which had non-zero 137 expression estimate either by AERON or by Minimap2. We observe that, for both datasets, AERON 138 is able to assign reads to the correct gene, resulting in a majority of points being located close to the 139 diagonal. Moreover, we see that AERON is able to detect more genes in lower range of expression (1-5 140 TPM) as compared to Minimap2. Similar behavior was observed at transcript level, but the correlation 141 values between the expression estimates from long read data and estimates from short reads were lower 142 (Supp. table S1). Beyond challenges to do transcript quantification from short reads, a possible reason 143 for this might be the presence of a high number of short transcripts against which no reads are mapped. 144 To test this hypothesis, we filtered out all the transcripts below a length cutoff from our correlation 145 analysis experiment. As expected, when we remove short transcripts, the correlation improves for both 146 datasets supporting our hypothesis. For a range of transcript length cutoff of 0-10000bps, the correlation 147 improved from 0.29 to 0.64 for K562 cell line and 0.27 to 0.74 for NA12878 dataset(Supp. Fig. S6).

7/23
Fusion simulation 149 Figure 4. Workflow of the fusion detection step of AERON. Partial alignment: reads are aligned to the gene-exon graphs. All secondary alignments are kept and the read may have alignments to different genes. Tentative fusions: whenever a read has a pair of alignments that end within 20 bp of each other in the read, the read votes for a fusion between the two genes. A read may vote for multiple tentative fusions. Fusion graphs: each tentative fusion induces a fusion graph, where the two genes are connected with a crossover node (N). End-to-end fusion alignments: the reads are aligned to the fusion graphs. Global alignment is used to align the read from start to end. End-to-end nonfusion alignments: the reads are aligned to the individual gene-exon graphs globally. Fusion score: the score difference between the fusion alignment and the nonfusion alignment defines a fusion score. Predicted fusions: the alignments are filtered based on the fusion score. The graph sequence along the alignment is taken as the predicted fusion transcript. Fusion support and alignments: all reads are aligned to the reference transcripts and the predicted fusion transcripts with Minimap2. A read supports a fusion if its primary alignment covers the fusion breakpoint with at least 150 base pairs on both sides.

8/23
The previous experiments underline that AERON is able to align noisy long RNA reads to transcripts 150 accurately. Motivated by a lack of approaches that enable fusion-gene detection with long reads and the 151 high relevance of fusion genes, we implemented a fusion detection method. Figure 4 shows an overview 152 of the pipeline. There are three main steps in the fusion detection pipeline: First, partial alignments 153 from the reads provide tentative fusions. Next, the reads are re-aligned to fusion graphs derived from 154 the tentative fusions. Finally, the alignments to the fusion graphs are compared with alignments to 155 gene-exon graphs to provide a fusion score for each read. Each of the steps produces a list of fusion 156 event candidates, with the later steps filtering out candidates from the previous steps. The methods 157 section describes the individual steps in more detail. 158 We first assessed the performance of our fusion detection approach in a simulation study. We 159 generated fusion events of different "lengths", where the length refers to the amount of sequence from both 160 genes. For example a fusion event with length 200 bp contains 200 base pairs from both genes and has a 161 total length of 400 bp. Events were simulated in 9 length groups, from 100-200 bp, 200-300 bp, and so on 162 until 900-1000 bp. For each of these length ranges, 50 fusions were generated, where a pair of transcripts 163 was selected randomly for each fusion. Then, a random substring of each transcript corresponding to the 164 length of the fusion was selected, and the substrings were concatenated to build the fusion transcript. 165 The reads were simulated at 10× coverage from all simulated fusion and reference transcripts. The fusion 166 detection pipeline was then ran on the simulated reads. The left part of Fig. 5 shows the precision-recall curve at varying fusion score cutoffs for different 168 fusion sizes. We see that 100-400 base pair fusions (bottom) are hard to detect with any fusion score 169 cutoff, and the recall saturates at 15%. The high error rate and short length of the reads stops them 170 from being aligned in the tentative fusion phase, which prevents the pipeline from detecting the fusions. 171 However, 400-700 base pair fusions (middle) are detected, and the recall reaches up to 87%. For the 172 longer fusions of 700-1000 base pairs (top), recall reaches up to 95% and precision around 80%. Based 173 on the curves, we chose 200 as the default fusion score cutoff, as that achieves a precision of 78% and 174 recall of 90% for large fusions.  ENSG00000223361 FTH1P10  chr5  ENSG00000162734 PEA15  chr1  3 reads  ENSG00000172493  AFF1  chr4  ENSG00000162244 RPL29  chr3  2 reads   Table 2. Predicted fusion events for NA12878. The predicted fusion transcripts do not have BLAST hits that cover the fusion breakpoint. The first 6 columns describe the two genes involved in the fusion. The column "Support" counts the number of reads whose primary alignment covers the fusion breakpoint and 150bp from both sides of it.
our three-step approach progressively removes false positive fusion events: while there are 28,696 false 178 positive predictions in the tentative step, this number reduces to 49 in the graph step and further down 179 to 20 in the final step. We see that shorter fusions are not detected even in the tentative fusion phase; 180 this is most likely due to the high error rate preventing short alignments from being found. Once the 181 fusion size grows above 400 bp, the set of tentative fusions contain most of the fusion events. Importantly, 182 the fusion graph approach removes only a small fraction of true fusions, while removing almost all false 183 positives from the tentative fusions. The fusion score cutoff further removes more false positives, but at 184 the cost of removing shorter true positives as well.

185
Fusion detection on real data 186 We ran the fusion detection pipeline on the two datasets: NA12878 and K562. The NA12878 functions 187 as a control, as we do not expect to see any fusions in that dataset. K562, on the other hand, is a highly 188 rearranged cancer cell line [38] with a known BCR-ABL1 fusion. 189 We used predictions which were supported by at least two reads, resulting in 25 events for K562 and 190 24 for NA12878. We then further manually curated the predictions in three steps. First, we removed all 191 fusions involving a mitochondrial gene and removing fusions between a gene and its own pseudogene, 192 leaving 16 events for K562 and 14 for NA12878. Then we used IGV [39] to visually inspect the alignments 193 of the reads to the predicted fusion transcripts generated by the pipeline, and discarded events where 194 the alignments to one of the genes seemed noticeably worse than the other or the read coverage over the 195 breakpoint was noticeably less than for either of the genes. Supplementary Figure S7 shows an example 196 of an event that was rejected based on the IGV visualization. After IGV visualization, 15 events were 197 left for K562 and 9 for NA12878. Since some transcripts can have similar sequences, the high error rate 198 of the reads can cause a false positive fusion prediction between the two transcripts due to sequence 199 similarity. In this case, the predicted fusion transcript should be similar to a reference transcript, because 200 it reflects a reference-guided consensus between the long reads. Hence, the average error rate of the fusion 201 transcript is much lower than the input reads and therefore easier to align with traditional methods. In 202 order to detect transcripts not annotated in the Ensembl release, we used the BLAST webserver to align 203 the predicted fusion transcripts to the human reference genome (GRCh38.p12) and transcriptome (NCBI 204 Homo sapiens Annotation Release 109). We discarded events that mapped to an existing transcript 205 including the fusion breakpoint and retained 8 events for K562 and 2 events for NA12878. 206 Figure 6. The coverage plot and the alignment of reads against the 3655bps long BCR-ABL1 fusion transcript. The fusion breakpoint was found to be at the position 1934. The image was generated using Integrated Genomics Viewer(IGV).

10/23
Ensembl ID  Table 3. Predicted fusion events for K562. The TEN1-CDK3 and BMS1P1-AGAP4 were reported earlier as read-through events. The first 6 columns describe the two genes involved in the fusion. The column "Support" counts the number of reads whose primary alignment covers the fusion breakpoint and 150bp from both sides of it.
The two predicted fusion events for NA12878 are shown in  Table 3 shows the eight predicted fusion events for K562 including the well-known BCR-ABL1 fusion 209 event (Fig. 6). Four of the eight predicted fusion events have been reported in literature before. For 210 the BCR-ABL1 fusion, the TEN1-CDK3 read-through and BMS1P4-AGAP5 read-through events, the 211 AERON predictions mapped to the existing annotations. The NOS3-PRKN predicted fusion mapped to 212 a transcript variant of the NOS3 gene, while the ARPC4-TTLL3 predicted fusion mapped to a transcript 213 variant of ARPC4. The PRIM1-NACA predicted fusion occurred with fusions across two different 214 breakpoints, which the pipeline considers separate events. The HBG2-HBG1 predicted fusion has a very 215 high read support, and the alignment of the predicted transcript to the reference is consistent with an 216 inverted duplication (see Supplementary Figure S8) occurring in the region.

218
In this work, we describe AERON, a novel approach to quantify transcripts from long reads. AERON 219 comes with the first long-read specific fusion detection method. We use GraphAligner [33], a fast sequence-220 to-graph alignment method, to align ONT reads to a reference transcriptome and find better alignments 221 as compared to Minimap2, which is used as part of previous state-of-the-art quantification pipelines. 222 We assign reads to transcripts based on the position of the read mapping within the transcriptome and 223 the fraction of read sequence contained in the transcript sequence. We tested AERON on two different 224 datasets of varying coverage and compared results to expression estimates derived from Minimap2 225 alignments of the reads to the transcripts. We compared the expression estimates from these two methods 226 (AERON and Minimap2) against the estimates generated from short read data. We found that expression 227 estimates from AERON had a better correlation as compared to estimates from Minimap2. We also 228 show that AERON does not depend on the sequencing protocol used to generate the reads. This was 229 evident from the high correlation between the AERON runs on reads generated from the cDNA and the 230 reads generated from the native RNA.

231
Although we show improvement over the existing methods, there are remaining limitations. We find 232 the gene-level quantification to be much better compared to the transcript level quantification. We have 233 shown that the presence of short-length transcripts has an effect on the assignment of reads to the correct 234 transcript, which is further complicated by highly similar transcripts. A significant number of transcripts 235 are truncated versions of longer transcripts. In cases like these, it is difficult to decide whether a read has 236 originated from the truncated transcript or is it a part of the longer transcript. This problem is further 237 aggravated if the read originating from them does not overlap completely with either of the transcripts. 238 Soneson et al. [22] recently proposed an alternate approach where the EM based algorithm of Salmon 239 was run on alignments produced with Minimap2 as input. We tested the suggested approach on both 240 the datasets and found no major effect on the transcript level expression estimate (Supp.tab S3). The 241 11/23 comparatively low correlation between estimates from short read data with estimates from the long read 242 data at transcript level may be explained by the higher sequencing depth of short read datasets. This 243 problem is mitigated to a certain extent in the gene-level quantification, where we add up the expression 244 estimates of all transcripts belonging to a gene. Long read sequencing suffers from lack of sequencing 245 depth, due to which many transcripts are either not identified or misidentified. 246 Furthermore, while our approach is able to align long error-prone reads, the analyzed data sets also 247 contained relatively short reads, which are challenging to align given the high error rate. One way to 248 potentially rescue such reads could be error correction. GraphAligner can perform error correction of 249 long reads based on alignment of long reads against a De Bruijn graph generated from short reads. The 250 process has been tested on genomic reads. Currently, no RNA-seq specific error correction tools are 251 available, which could be integrated in our pipeline. A recent study by Lime et al. 2019 [44], shows that 252 long read DNA correction methods improve analysis of long read RNA data, but have a negative effect 253 on detecting diverse transcript isoforms and alternative splice sites. Therefore, we did not consider such 254 approaches here, but this would be an interesting direction for further research. 255 We have also presented a novel fusion detection pipeline based on aligning reads to a fusion graph 256 between two genes, filling an important gap in the landscape of tools for the analysis of long read 257 RNA-seq data. We tested this on simulated data of different fusion lengths and real data from the K562 258 cell line. Experiments with simulated data show that the accuracy depends on the fusion length and 259 fusions which contain at least 700 base pairs from both genes are detected reliably. Experiments with real 260 data on the K562 cell line recovered four known events, including the BCR-ABL1 fusion, and suggested 261 four novel events. In the future, we plan to automate further curation steps which we performed manually 262 for this study. As the simulation experiments show that long fusion events are recovered with high recall, 263 it is unlikely that the sample has other long fusions with significant expression.

269
Total RNA was isolated from K562 cell line using TRIzol reagent (ThermoFisher Scientific) according 270 to the manufacturer's instructions. Total RNA concentration was measured at 1:10 dilution using a 271 Qubit 3.0 Fluorometer and Qubit HS RNA Assay Kit (ThermoFisher Scientific). RNA quality and RNA 272 integrity (RIN) were evaluated using a NanoDrop 2000 spectrophotometer (ThermoFisher Scientific) and 273 Agilent 4200 TapeStation system with the RNA screentape assay kit (Agilent).

274
Nanopore cDNA library generation and sequencing 275 Poly(A)+ RNA was enriched from 25µg total RNA using the Dynabeads mRNA Purification Kit 276 (ThermoFisher Scientific) according to manufacturer's instructions. The K562 cDNA libraries were 277 generated from 5ng of poly(A)+ RNA using the PCR-cDNA kit, SQK-PCS109 (Oxford Nanopore 278 Technologies) according to manufacturer's instructions. In brief, complementary strand synthesis and 279 strand switching were performed on input full-length poly(A)+ RNA using kit-supplied oligonucleotides, 280 followed by 14-cycle PCR amplification using primers containing 5' tags to generate double-stranded 281 cDNA library, followed by a ligase-free attachment of rapid sequencing adaptors. The splicing graph generated from the reference sequence and the paths in the graph followed by all the 286 annotated (according to ENSEMBL annotation-v92) transcripts in the splicing graph forms the index for 287 transcript quantification and fusion detection. These steps are performed only once and can be re-used 288 for multiple input datasets.

289
Graph construction 290 The set of all possible transcripts of a gene can be expressed as a splicing graph, first introduced by 291 Heber et al., 2002 [23]. In general, a splicing graph is a Directed Acyclic Graph (DAG) where nodes 292 represent the splicing sites of a given gene and edges represents exons and introns between the sites. In 293 this work, we slightly modify the splicing graph construction and name the graph as gene-exon graph, 294 which we later align long reads to. Below, we first describe the construction of the graph from genome 295 sequence. 296 We formally define a DNA sequence as a string consisting of characters from the alphabet Σ = 297 {A, C, G, T }. A gene g is a DNA sequence consisting of exons and introns. We term a base within the 298 gene g as exonic base if it is overlapped by at least one exon belonging to g. Otherwise, the base is 299 termed as intronic base. A position in g is termed as a border, if it serves as a boundary of an exon. 300 We classify borders into 5' borders (a), which are positions of the 5' end of an exon, and 3' borders (b), 301 which are the position of the 3' end of an exon. Each exon x can be characterized a pair these two border 302 positions, i.e, x = (a, b). We refer to a list of all 5' borders and 3' borders as border list ρ. We can divide 303 the border list into two sub lists namely the acceptor list α, which contains all 5' borders and the donor 304 list δ, which contains all 3' borders of g. Members of each list and sub lists are ordered increasingly. For 305 instance, let exon x 1 = (5, 10) and x 2 = (15, 20) be two exons of the same gene g of length 20 bps, then 306 ρ(g) = [5,10,15,20], α(g) = [5,15] and δ(g) = [10,20].

307
The gene-exon graph of a gene g is defined as a directed acyclic graph G g = (V, E) g . Each vertex of our gene-exon graph represents an exon present in the gene and the edges between vertices corresponds to all the possible splicing events occurring in the gene. If an exon has alternate splice sites, then the vertex corresponding to the exon is split into multiple sub-vertices. The split happens at the position of the alternate site. This concept is formalized in the following way. Each vertex V of the graph is considered as a substring between two consecutive borders of g. Let i ∈ ρ(g) and j ∈ ρ(g) be two consecutive borders in g where i < j. Then the vertex v ij ∈ V is created using the following function.
if i ∈ α(g) and j ∈ α(g), if i ∈ α(g) and j ∈ δ(g), if i ∈ δ(g) and j ∈ δ(g), g[i + 1 . . . j − 1] if i ∈ δ(g) and j ∈ α(g) and g[i + 1 . . . j − 1] is composed of exonic bases φ if i ∈ δ(g) and j ∈ α(g) and where v ij = φ represents an null vertex. In a gene-exon graph, there should always be a path in the 308 graph which represents all possible alternative splicing events. For instance, let there be three vertices 309 v 12 ,v 34 and v 56 corresponding to three exons x 12 , x 34 and x 56 respectively. If a transcript is formed by 310 splicing out x 34 and using only x 12 and x 56 , then this event should be represented in the graph as a 311 path passing through vertices v 12 and v 56 . Hence, in the graph, given two non-null vertices v ii and 312 v jj , an edge e i j is created between all vertices with i < j. Figure 7 provides a visual representation 313 of the graph construction process and the implementation details are given in Supplementary Material 314 Section 1. Sequence-to-graph alignment 316 We need to be able to align a sequence to a gene-exon graph for two purposes: first, for aligning known 317 transcripts to the graph during index construction and, second, for aligning reads to the graph for 318 quantification and fusion detection. The goal of this step is to find the path in the graph for each read 319 sequence with the lowest edit distance.

320
Given a gene-exon graph G g = (V, E), we define a path as a list of nodes The path sequence of a path is the concatenation of the node labels of p. 322 Given a read sequence and a graph, the sequence-to-graph alignment problem is to find the path in 323 the graph with the smallest edit distance between the path sequence and the read sequence. We use 324 GraphAligner for aligning the reads [33]. Briefly, GraphAligner is a seed-and-extend aligner that finds 325 maximal exact matches (MEMs) [45] between the read and the node sequences and then extends them 326 with a bit-parallel dynamic programming algorithm [46]. This produces a set of alignments, where each 327 alignment contains a path, edit distance, an E-value [34] and the start and end positions in the read. 328 The parameters used by GraphAligner are the maximum number of MEMs to extend and the minimum 329 MEM length.

330
Alignment of transcripts to graphs 331 Given a genome consisting of n genes and m transcripts, we begin by constructing a gene-exon graph G i 332 for each gene in the genome and add it to a graph set U = {G 1 , G 2 , ....G n }. We align the m reference 333 transcripts to all the n graphs present in set U . For each transcript t, we obtain the path which has the 334 minimal edit distance to the transcript. Since the nodes of a graph are composed of exonic sequences 335 from the genomic regions, the best possible path traversed by transcript t is the gene-exon graph of the 336 gene from which t originates. Hence, each transcript will only have one path associated with it. We 337 collect all such paths in set P = {p 1 , p 2 , ..., p m } and name the set as transcript-path set.

338
Assignment of reads to transcripts and quantification 339 Consider the task of aligning a long read set R = {r 1 , r 2 , ..., r k } consisting of k reads. In principle we align all reads in R to all the graphs in the set U . We only consider alignments with an E-value [34] below 1. In case a read r ∈ R has multiple alignments meeting this criterion, we select the longest 14/23 alignment. The first condition removes poor alignments between the read and the graph and the second condition ensures that the read is not mapped to multiple locations. The E-value [34] of the alignment between a query and a database describes the expected number of spurious alignments that are at least as good as the alignment. The E-value is calculated with the formula: where E is the expected number of spurious hits, K and λ are parameters that depend on the scoring 340 scheme, S is the alignment score, m is the database size in base pairs and n is the query size in base 341 pairs. The original formula by Karlin and Altschul is defined with a database of linear sequences instead 342 of graphs. We use the number of base pairs in the graph as the database size. The K and λ were chosen 343 to correspond to a scoring scheme with match score +1 and mismatch cost -2.

344
Once the reads and the transcripts have been aligned, we extract only the path of the alignment 345 and discard the base pair level alignments. That is, each read will be treated as a string q r = v 1 v 2 ...v n 346 where v i ∈ V and the differences between the read sequence and the path sequence are ignored. We then 347 compare q r to all the paths in the transcript path set. We say that a read q r belongs to a transcript p i if 348 there is a node that both paths include, that is, ∃ v : v ∈ q r ∧ v ∈ p i . For each read path q r belonging to 349 transcript path t i , we align the q r and t i to each others using the Needleman-Wunsch algorithm [47] 350 with nodes weighted according to the amount of sequence in them. We then define the overlap score as 351 the fraction of matches in the alignment over the read length. The overlap score represents the fraction 352 of the read which matches the given transcript. If the overlap score is above 20%, then q r is said to be 353 aligned to t i .

354
A read may get aligned to multiple transcripts. We assign each read to the transcript with which it has 355 the highest overlap score. It is possible that a read can get aligned to multiple transcripts with the same 356 overlap score. In such cases, the read is assigned to the transcript whose 3' end is closest to the 3'end of 357 the read. In cases where the 3' end of the candidate transcripts are located in the same genomic position, 358 the read is assigned randomly to one of them. Finally, the transcript is quantified by simply counting 359 the number of reads assigned to it and converting these counts values to Transcript Per Million (TPM). 360 Evaluation of quantification 361 We tested the quantification algorithm on simulated as well as real datasets. Throughout the paper, we use the widely accepted Transcripts Per Million as transcript and gene level metric for evaluation of predicted expression values. In addition, we also use the Mean Absolute Relative Difference (MARD) metric which was previously used for transcriptome comparisons [10], which we explain briefly below. For each transcript i, we calculate the Absolute Relative Difference: where x i and y i are the true value and estimated value respectively for transcript i. For the simulated dataset, the actual number of reads originating from a transcript i was considered as the true value and the number of reads predicted to have been originated from transcripts i by AERON was considered as the estimated value. However, in case of real datasets, the true count of the reads originating from a transcript is unknown. Hence, for read datasets, TPM value obtained using short reads for a transcript i was considered as the true value of i and the TPM value obtained using AERON was considered as the estimated value of i. For a predicted value closer to the true value, the ARD values tends to be closer to 0. The Mean Absolute Relative Difference (MARD) for M transcripts is calculated simply by: For comparison of expression estimates from long read datasets against the expression estimates 362 obtained from short read data, we calculated the TPM values for all the transcripts present in the 363 15/23 annotation file. Further, we obtained the gene-level quantification of a gene g by summing up the TPM 364 values of all the transcripts belonging to g. We then calculated the Spearman correlation between the 365 TPM values obtained from the long read dataset and the TPM values obtained from the short read 366 data at transcript level. The Spearman correlation was also calculated at gene-level taking all the 367 ENSEMBL(v92) annotated protein coding and non-coding genes into consideration.

368
Fusion detection 369 Figure 4 shows an overview of the fusion detection pipeline, which we explain in the following. The reads are first aligned to the gene-exon graphs. This proceeds the same way as in the quantification 372 pipeline, except that secondary alignments are not removed and the same part of a read may be mapped 373 to multiple gene-exon graphs. The partial alignments are used to create a list of tentative fusions. Whenever a read has a pair of partial 376 alignments to two different genes whose endpoints are within 20 base pairs to each others in the read, 377 the read supports a tentative fusion between the two genes. Each read may support multiple tentative 378 fusions.

379
Fusion graphs 380 Each tentative fusion induces a fusion graph. The fusion graph combines the gene-exon graphs of the two 381 participating genes. An extra crossover node is added to connect the two gene-exon graph. Each base 382 pair in the first gene-exon graph is connected to the crossover node, and the crossover node is connected 383 to each base pair in the second gene-exon graph. This way, the alignment may cross from any point in 384 the first gene-exon graph to any point in the second gene-exon graph.

385
End-to-end fusion alignments 386 The reads are aligned to their fusion graphs. However, here the alignment must span the entire read 387 from start to end. Clipped read ends are considered indels and contribute to the number of mismatches 388 in the alignment.

389
End-to-end nonfusion alignments 390 Each read supported a list of tentative fusions previously. We extract the list of genes involved in those 391 fusions for each read. In addition to this, we extract each pair of tentative fusions which include those 392 genes regardless of which read supports the fusion, and say that these genes are relevant for the read. 393 That is, if read R 1 supports a fusion between genes G 1 and G 2 and nothing else, but an another read 394 supports a fusion between G 2 and G 3 , then all of G 1 , G 2 and G 3 are relevant for R 1 . The reads are then 395 aligned to all of their relevant gene-exon graphs. Again the alignments must span the entire read from 396 start to end. Once the reads have been aligned end-to-end to both the fusion graphs and the gene-exon graphs, the 399 alignment scores are compared to calculate a fusion score. Given the lowest alignment edit distance to 400 a fusion graph C f and the lowest alignment edit distance to a gene-exon graph C n , the fusion score 401 of a read to the fusion graph is defined as C n − C f . This essentially describes how much better the 402 read aligns to a fusion than any individual gene; a fusion score of 0 means that the read aligns to a 403 fusion graph just as well as to a non-fusion graph, and higher fusion scores mean that the read aligns 404 16/23 better to the fusion graph than any non-fusion graph. Note that the alignment edit distance to a fusion 405 graph cannot be worse than the edit distance to a gene-exon graph, since the fusion graphs include the 406 gene-exon graphs as subgraphs. The fusion graph alignment with the lowest edit distance is kept and 407 the others are discarded. At this point each read can only support one fusion, which removes a large 408 number of false positives. 409 Predicted fusions 410 The reads are filtered based on the error rate of the alignment to the fusion graph and the fusion score. 411 Reads whose fusion score is below a user-given threshold (default 200) are discarded. Reads whose 412 alignment error rate to the fusion graph is above 20% are also discarded. The paths of the fusion 413 alignments are taken as the predicted fusion transcripts. When multiple reads align to the same fusion 414 graph and cross over at the same exon, they are considered the same fusion event and one of them is 415 arbitrarily selected as the fusion transcript. If multiple reads align to the same fusion graph but cross 416 over at different exons, they are considered separate events. The output of this step is a list of predicted 417 fusion transcripts.

418
Fusion support and alignments 419 Finally, all reads are aligned to the predicted fusion transcripts and the reference transcriptome using 420 Minimap2 [19]. A read is then considered to support a fusion if its primary alignment crosses the fusion 421 breakpoint with at least 150 base pairs on both sides. This recovers some reads which were missed by 422 the earlier steps and removes some spurious fusions. The output of this step is a BAM file containing 423 the alignments of the reads to the transcriptome and predicted fusion transcripts, and the number of 424 reads that support each predicted fusion transcript.

425
The simulated data experiment produced 20 false positive calls even after the fusion score filtering. 426 We recommend manually inspecting the predictions to filter out more false positives. In the K562 and 427 NA12878 experiments, we curated the predictions by visualizing them with IGV [39] to remove cases 428 where reads align poorly to one side of the fusion, and by aligning the predicted fusion transcripts to 429 the reference genome and transcriptome using BLAST [48] to remove false positives caused by sequence 430 similarity between transcripts.

431
Simulation of ONT and parameter optimization 432 Two important parameters for AERON are the seed length and the number of seeds (NOS) for alignment 433 of reads to the graph. Seed hits are exact matches between a part of the read and a part of a node, and 434 are used for starting the alignments. Seed length is the minimum length of the exact matches. The 435 number of seeds (NOS) denotes how many of the available seeds are used to compute an alignment 436 between the read and the graphs. We wanted that the default values of these parameters in AERON 437 should give us a good accuracy of the quantification in a reasonable runtime. For this, we first simulated 438 1M Oxford Nanopore reads using Nanosim on transcriptome mode (version 2.5.0, [49]). The novel K562 439 ONT data was given as the reference to Nanosim to create the training read profile. All the parameters 440 of Nanosim was set to default. All the parameters of the algorithm were set as default. We performed 441 several runs of AERON on the simulated dataset. Each run consisted of a different combination of seed 442 length and the number of seeds. We measure the accuracy of quantification using the MARD score (See 443 section --). Figure 8 shows the effect of varying the seed length and the number of seeds on the runtime 444 (x-axis) and the MARD score (y-axis). Each curve in the graph represents a single NOS parameter value 445 and each point in the curve represents a seed-length parameter value. As expected, with the increase 446 in the number of seeds accuracy of the quantication improves, evident by the decrease in the MARD 447 score. Note, that MARD score measures the distance between the estimated value and the true value. 448 Hence, lower the score, better the accuracy of the estimation. But setting the NOS parameter too high 449 results in a higher runtime. We also observe that with the increase in seed length, the accuracy of 450 the quantification goes down. For AERON, we selected a combination of parameters (NOS=15 and 451

17/23
Seed-length=17) which was good trade-off between the accuracy and the runtime. Although, there might 452 be other combination which might give similar or better results based on the dataset used in the pipeline. 453 Figure 8. Comparison of the MARD score (y-axis) against runtime in minutes (x-axis) for different parameter configurations of AERON on simulated data. Each line in the plot represents a value for the No. of Seeds parameter. Each point within a line represents a value of theSeed Length parameter. One such value is marked with an arrow, which denotes the default value of Aeron.

454
Two different Oxford Nanopore Technology (ONT) read sets were used for the analysis: a novel dataset 455 with 2.7M reads from K562 cells with an average read length of 750bps, and an available dataset with 25M 456 reads from NA12878 cells with average read length of 1030bps [50]. For the alignment step of AERON, 457 human genome from GRCh38.p12 [51] was used as the reference. We also aligned the two datasets against 458 the human reference genome using Minimap2 and filtered out all the secondary alignments keeping only 459 the primary alignments. For each transcript in the genome, we then counted the number of reads aligned 460 to the transcript and converted the numbers into TPM values.

461
To compare the expression estimates from AERON and Minimap2 against expression estimates from 462 short reads, we downloaded two short read data-sets: 113M paired end reads from the K562 cell line 463 (SRX4958124, [52]) and 188M paired end reads from NA12878 (ERX329208, [53]). We then calculated 464 the expression for the two datasets using Salmon ( [10], v0.11.2) with default parameters except kmer 465 size, which was set to 17.