Scallop2 enables accurate assembly of multiple-end RNA-seq data

Transcript assembly (i.e., to reconstruct the full-length expressed transcripts from RNA-seq data) has been a critical but yet unsolved step in RNA-seq analysis. Modern RNA-seq protocols can produce paired-/multiple-end RNA-seq reads, where information is available that two or more reads originate from the same transcript. The long-range constraints implied in these paired-/multiple-end reads can be much beneficial in correctly phasing the complicated spliced isoforms. However, there often exist gaps among individual ends, which may even contain junctions, making the efficient use of such constraints algorithmically challenging. Here we introduce Scallop2, a new reference-based transcript assembler optimized for multiple-end (including paired-end) RNA-seq data. Scallop2 uses an algorithmic frame-work that first represents reads from the same molecule as the so-called multiple-end phasing paths in the context of a splice graph, then “bridges” each multiple-end phasing path into a long, single-end phasing path, and finally decomposes the splice graph into paths (i.e., transcripts) guided by the bridged phasing paths. An efficient bridging algorithm is designed to infer the true path connecting two consecutive ends following a novel formulation that is robust to sequencing errors and transcript noises. By observing that failing to bridge two ends is mainly due to incomplete splice graphs, we propose a new method to determine false starting/ending vertices of the splice graphs which has been showed efficient in reducing false positive rate. Evaluations on both (multiple-end) single-cell RNA-seq datasets from Smart-seq3 protocol and Illumina paired-end RNA-seq samples demonstrate that Scallop2 vastly outperforms recent assemblers including StringTie2, Scallop, and CLASS2 in assembly accuracy.


Introduction
including scRNAss and RNA-Bloom but failed; see Appendix for more details about the efforts we made. 81 Datasets. Above four methods are compared with two types of RNA-seq data, generated by Smart-seq3 pro-82 tocol and Illumina platforms. Smart-seq3 is a single-cell protocol that generates multiple-end RNA-seq data 83 using barcoding technology. Here we use two public datasets published with Smart-seq3 paper [13], down-84 loaded from https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8735. The first dataset, referred to 85 as HEK293T, contains 192 human cells; the second dataset, referred to as Mouse-Fibroblast, includes 369 86 mouse cells. For Illumina platform, we use 10 human paired-end RNA-seq samples that were downloaded 87 from ENCODE project and were previously used in the Scallop paper [7]. We refer to these 10 samples as 88 ENCODE10. The alignments of these samples are available at http://doi.org/10.26208/8c06-w247. 89 Pipeline and evaluation. We use the pipeline illustrated in Figure 1 to evaluate the performance of all 90 compared assemblers. For Smart-seq3 data, the raw sequencing data will be demultiplexed and preprocessed 91 using zUMIs tool, in which STAR will be called to generate reads alignments. The 10 paired-end Illumina 92 RNA-seq samples are aligned with two popular aligners, STAR [22] and HISAT2 [23]. The read alignments 93 of each individual cell or sample will be piped to the compared assemblers.  The accuracy of the assembled transcripts will be evaluated using gffcompare [24]. We use the annotated 95 transcriptomes as reference (Ensembl GRCh38.104 for human data and Ensembl GRCm38.104 for mouse 96 data). We use the "transcript level" metrics defined by gffcompare: an assembled multiple-exon transcript 97 is considered as "matching" if its intron-chain exactly matches that of a transcript in the reference; an 98 assembled single-exon transcript is defined as "matching" if there is a significant overlap (80% by default) 99 with a single-exon transcript in the reference. We use two metrics calculated by gffcompare: the total number 100 of matching transcripts, which is proportional to sensitivity, and precision, defined as the ratio between the 101 total number of matching transcripts and the total number of assembled (predicted) transcripts. 102 We note that above metrics measured w.r.t. the current transcriptome underestimates the accuracy, as novel 103 transcripts that are correctly assembled will be considered as "not-matching" because they do not yet exist 104 in the reference. Nevertheless, we believe such metrics are fair for the compared assemblers, as they reflect 105 their relative accuracy, and yet we do not know the true expressed transcripts in these biological samples.

106
Comparison of assembly accuracy on Smart-seq3 data. Table 1 summarizes the assembly accuracy of the 107 4 assemblers averaged over all cells in each of the two Smart-seq3 datasets. Scallop2 consistently achieved 108 the highest precision and sensitivity on both datasets. Specifically, Scallop2 produces 0.7%, 29.0%, and 109 20.7% more matching transcripts than StringTie2, Scallop, and CLASS2 on the HEK293T dataset, and 110 assembles 0.4%, 18.1%, and 19.1% more matching transcripts than them on the Mouse-Fibroblast dataset.

111
In terms of precision, Scallop2 improves 107.1%, 12.5% and 119.7% than StringTie2, Scallop, and CLASS2 112 on the HEK293T dataset, and improves 80.0%, 6.7%, and 66.6% than them on the Mouse-Fibroblast dataset.    Comparison of assembly accuracy on Illumina paired-end RNA-seq data.  such filtering goes, the number of matching transcripts of Scallop2 will drop while its precision will likely 132 increase (as lowly-expressed ones are more likely to be false transcripts). Figure 4 shows that, the precision-133 recall curves of Scallop2 always lie to the right of other points, suggesting that Scallop2 always gives higher 134 precision at the same level of sensitivity compared with any other method.

135
To quantitatively measure of the improvement of Scallop2 compared with other methods, we calculate the 136 adjusted precision of Scallop2 w.r.t. any other method, defined as the x-coordinate of the point on the 137 precision-sensitivity curve that has the same y-coordinate with the compared method. This single metric 138 therefore offers a more direct evaluation between two methods (as opposed to using two metrics precision 139 and sensitivity).  Table 3: Comparison of precision (mean and standard deviation) between Scallop2 and each of other 3 methods at the same level of sensitivity. "compared" = (raw) precision of the compared method; "adjusted" = adjusted precision of Scallop2 at the same level of sensitivity with the compared assembler.    Table 4 and Table 5   3 Methods

150
We focus on reference-based transcript assembly. The RNA-seq reads will be first aligned to the reference 151 genome through a splice-aware aligners such as STAR [22] and HISAT2 [23]. Reads that overlap in their 152 aligned genomic coordinates are then partitioned as gene loci. Individual gene locus will be assembled 153 separately as independent instances. In Section 3.1, we describe the procedure we use to construct the 154 data structures that capture all splicing signals and multiple-end constraints in the data. In Section 3.2, we 155 formulate the multiple-end assembly problem. In Section 3.3, we describe our algorithmic framework for 156 solving above formulation. Two core algorithms called in this framework are described in Section 3.4 (i.e., to bridge two consecutive ends in a group) and in Section 3.5 (i.e., to determine false starting/ending vertices in the splice graph).
We denote by C as the collection of distinct multiple-end phasing 181 paths in {C(R 1 ),C(R 2 ), · · · ,C(R k )}. For a gene locus, the above constructed weighted splice graph G together with the multiple-end phasing 184 paths C give a complete representation of the splicing information in the alignments and the long-range 185 phasing information in the multiple-end reads. Our assembly algorithm will take G and C as input, and 186 produces a set of s-t paths P of G and assigns an abundance f (p) for each s-t path p ∈ P: each s-t path p 187 infers an expressed transcript in this gene locus, and f (p) predicts its expression abundance. 188 We now design three objectives to guide reconstructing P and f given G and C . First, since each multiple-end 189 phasing path is constructed from reads sampled from a single transcript, we expect that each multiple-end 190 phasing path appears in one of the reconstructed transcripts (i.e., in P). Formally, we say a multiple-end 191 phasing path C ∈ C is covered by P, if there exists an s-t path p ∈ P such that every path l ∈ C appears in 192 p. We require all multiple-end phasing paths in C be covered by P. Second, for each edge e ∈ E, we expect 193 the abundance of the inferred s-t paths passing through e, i.e., ∑ p∈P:e∈p f (p), is as close to its observed read 194 coverage w(e) as possible, which can be define as to minimize the sum of the deviation between w(e) and

195
∑ p∈P:e∈p f (p) for all edges, denoted as d(P, f ) := ∑ e∈E |w(e) − ∑ p∈P:e∈p f (p)|. Third, we aim to minimize 196 |P| following the parsimony principle. Combining all the three objectives, we informally describe the task 197 of transcript assembly for multiple-end RNA-seq data as follows.

198
Problem 1 (multiple-end transcript assembly) Given weighted splice graph G = (V, E, w) and associated 199 multiple-end phasing paths C , to compute a set of s-t paths P of G and abundance f (p) for each p ∈ P, such 200 that P covers all multiple-end phasing paths in C , and that both d(P, f ) and |P| are as small as possible.

Algorithmic framework 208
We propose a heuristic for Problem 1. This heuristic consists of three major steps, described below.

209
Step 1: bridging multiple-end phasing paths into single-end phasing paths. Let C = {l 1 , l 2 , · · · , l n } ∈ 210 C be a multiple-end phasing path. We sort all paths {l 1 , l 2 , · · · , l n } in C in lexicographical order w.r.t. a 211 topological ordering of all vertices of the splice graph (recall that the splice graph is a directed acyclic 212 graph). See Figure 6. We still write C = (l 1 , l 2 , · · · , l n ) after sorting all the paths lexicographically. All paths 213 in C can be regarded as being sampled from a single (unknown) s-t path of the splice graph, representing the 214 true transcript from which the reads used to construct C are generated. We propose to "bridge" all paths in 215 C as a single-end phasing path, denoted as h(C), in the splice graph. This task amounts to filling the gaps (if 216 any) between adjacent paths in C (see Figure 6). Specifically, if l i and l i+1 overlap, i.e., there exists an suffix 217 of l i that is also a prefix of l i+1 , then l i and l i+1 can be naturally concatenated into a single path; otherwise, 218 we will need to fill the gap by inferring a path connecting l i to l i+1 in the splice graph (see below). After 219 all n − 1 pairs of adjacent paths in C, i.e., (l 1 , l 2 ), (l 2 , l 3 ), · · · , (l n−1 , l n ), are bridged, we can then connect 220 them together as a single-end phasing path, i.e., h(C). The algorithm to infer the true path that connects 221 non-overlapping l i and l i+1 is one major innovation in Scallop2, separately described in Section 3.4.  (1, 2), l 2 = (4), l 3 = (6)) into a single-end phasing path h(C) = (1, 2, 3, 4, 6). Inferred bridging paths for (l 1 , l 2 ) and (l 2 , l 3 ) are marked red.
Step 2: determining false starting/ending vertices. We observe that failing to bridge a multiple-end 223 phasing path is mainly because the underlying splice graph is incomplete, such as certain junctions are 224 missing due to low coverage or alignment errors. We design a new algorithm to determine false starting and 225 ending vertices in the splice graph, separately described in Section 3.5. This algorithm constitutes another 226 algorithmic innovation of Scallop2.

227
Step exactly the same question of decomposing the splice graph in the presence of (single-end) phasing paths.

232
Here, we directly use this algorithm to obtain the set of s-t paths P by piping G and H into this algorithm. Let l = (a 1 , a 2 , · · · , a m ), l = (b 1 , b 2 , · · · , b n ) be any two consecutive, non-overlapping paths in a multiple-end 235 phasing path. The problem of bridging l i and l i+1 amounts to find a path in G from a m to b 1 . Such path, 236 called bridging path, infers the missing portion between l i and l i+1 . We note that there might be multiple 237 bridging paths due to alternative splicing and sequencing/alignment errors and we aim to infer the correct 238 one. Below we first formulate the task of inferring the true bridging path as a new optimization problem and 239 then design an efficient algorithm.

240
Formulation. We explore what's a good formulation to find the true bridging path. The main signal we have is the coverage information, and intuitively a bridging path supported by most reads are most likely the 242 true path. We determined that, a formulation that seeks a bridging path that maximizes bottleneck weight, is 243 suitable for this bridging problem. Below we formally describe this formulation. 244 We define a full ordering of all bridging paths. Let q 1 and q 2 be two arbitrary paths from a m to b 1 in G. Let 245 w j 1 (resp. w j 2 ) be the jth smallest weight in path q 1 (resp. q 2 ). We say q 1 is more reliable than q 2 , if there 246 exists an integer k such that w j 1 = w j 2 for all 1 ≤ j < k, and w k 1 > w k 2 . In other words, each bridging path is 247 represented as the sorted list of its weights in ascending order, and all bridging paths are then (implicitly) 248 sorted in lexicographical order. We formulate the problem of bridging as to find the most reliable path.

249
Intuitively, we seek a path q from a m to b 1 in G such that the smallest weight in it is maximized; in case 250 there are multiple paths with maximized smallest weight, among them we further seek the one whose second 251 smallest weight is maximized, and so on.

252
We believe this formulation is appropriate for RNA-seq bridging. First, by maximizing bottleneck weight, 253 the bridging paths are supported strongest, and hence are more likely to be the true bridging path. Second, 254 through this formulation, false paths which are usually due to sequencing/alignment errors and therefore 255 exhibit low abundances, can be automatically and efficiently excluded. 256 We note that this formulation satisfies the optimal substructure property: if a m → u 1 → u 2 → · · · → u l → b 1 257 is the most reliable path from a m to b 1 , then a m → u 1 → u 2 → · · · → u l is the most reliable path from a m to 258 u l . This allows us to design a dynamic programming algorithm to find the most reliable bridging path.

259
Dynamic programming algorithm. Let (v 1 , v 2 , · · · , v |V | ) be a topological sorting of all vertices of the splice 260 graph G (this is possible because splice graph is a directed acyclic graph). Given a particular v i , we can use 261 a single run to find the most reliable paths from v i to v j for every j > i. To compute the optimal path for v j ,

262
we examine all vertices v k that directly connects to v j , and compare all paths stored in these vertices (each 263 v k already stores the most reliable paths from v i to v k at this time point). The optimal one will be kept and 264 after concatenating v j they become the most reliable paths from v i to v j . We run this subroutine for all v i , will be accepted; otherwise we mark that C is failed to bridge. The constructed splice graph is usually erroneous due to missing junctions, sequencing and alignment errors. 280 We observe that erroneous starting and ending vertices (i.e., those connected to the source and sink vertices, 281 respectively) can be identified through reads that are failed to bridge. Figure 7 gives an example: the two 282 blue reads and the second and the third red reads cannot be bridged, as there is no edge connecting vertices 283 2 and 3; these suggest that vertices 2 and 3 are false starting and ending vertices (there is likely a missing 284 junction between vertices 2 and 3 in this example). 285 We design an algorithm that implements above observation. Let u and v be two consecutive vertices in the 286 splice graph (i.e., there is no any other vertex between u and v). If u is an ending vertex (i.e., (u,t) ∈ E) 287 and v is a starting vertex (i.e., (s, u) ∈ E), then we will add a pseudo-edge (u, v) with weight of 0.5 to the 288 splice graph. (In Figure 7, edge (2, 3) will be added as psuedo-edge.) The expanded splice graph (with 289 pseudo-edges added) will be actually used for bridging all consecutive ends by the algorithm described in 290 Section 3.4. Notice that any non-pseudo-edge will have a weight at least 1; therefore, if the bottleneck-291 weight of the most reliable path p of a pair of ends (l, l ) equals to 0.5, then we know that p must cross 292 at least one pseudo-edge and that (l, l ) cannot be bridged with non-pseudo-edges only. In this case, we 293 will examine all pseudo-edges (u, v) in p, and assign a pseudo-score of value 1 to u and to v. We run this 294 algorithm for all read groups, and the pseudo-score will be accumulated. (For example, in Figure 7, both 295 vertices 2 and 3 will have a pseudo-score of 2; the blue and the red reads contribute 1 each.) Intuitively, the  Figure 7: Illustration of identifying false starting/ending vertices. From the given alignments 4 partialexons (numbered 1-4) are identified. Vertex 2 (resp. 3) is classified as ending (resp. starting) vertex as there is no departing (resp. entering) junction. An pseudo-edge (2, 3) will be added to the splice graph for bridging. The bridging of blue reads and red reads (the second and the third ones) will cross this pseudoedge, giving a pseudo-score of 2 for both vertices.
larger pseudo-score is, the more likely the vertex is a false starting/ending vertex. Let x(v) be the pseudo-297 score of vertex v and let w(v) be the average coverage of v. We calculate z(v) := log(w(v)+1)−log(x(v)+1) 298 and report v is a false vertex only if x(v) ≥ 1 and z(v) is less than a threshold (by default 1.5). We do this 299 to ensure that, starting/ending vertices with large coverage will be determined as false only if there exists a 300 relatively significant number of supporting reads.

301
The determined false starting and ending vertices will be kept in the splice graph for decomposition (Step 3; 302 Section 3.3). A resulting s-t path that contains any false vertex will be classified as "transcript fragment" (as 303 opposed to "full-length transcript"). We keep these transcript fragments as they explain the reads in the false 304 vertices (it's that we have evidence to determine they are not full-length transcripts). In our implementation 305 full-length transcripts and transcript fragments will be written to separate files. The accuracy of assemblies 306 reported in Section 2 are for the full-length transcripts. 307 enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol., 33 (3):290-