Scallop enables accurate assembly of transcripts through phasing-preserving graph decomposition

We introduce Scallop, an accurate, reference-based transcript assembler for RNA-seq data. Scallop significantly improves reconstruction of multi-exon and lowly expressed transcripts. On 10 human samples aligned with STAR, Scallop produces (on average) 35.7% and 37.5% more correct multi-exon transcripts than two leading transcript assemblers, StringTie [1] and TransComb [2], respectively. For transcripts expressed at low levels in the same samples, Scallop assembles 65.2% and 50.2% more correct multi-exon transcripts than StringTie and TransComb, respectively. Scallop obtains this improvement through a novel algorithm that we prove preserves all phasing paths from reads (including paired-end reads), while also producing a parsimonious set of transcripts and minimizing coverage deviation.

given in Supplementary Figure 2.) Further, at minimum coverage equal to 0, the three methods obtain similar multi-exon precision when using TopHat2 alignments, while for STAR and HISAT2 alignments Scallop 110 obtains a significantly higher precision than both StringTie and TransComb.  The precision-sensitivity curves for multi-exon transcripts. Each curve connects 10 points, corresponding to the 10 different minimum coverage thresholds {0, 1, 2.5, 5, 7.5, 10, 25, 50, 75, 100}; the default value of this parameter is circled. (B) The average AUC (area under the precision-sensitivity curve) over the 5 samples. The three groups of bars correspond to TopHat2, STAR, and HISAT2 alignments, respectively (the same for other panels). The error bars show the standard deviation over the 5 samples (the same for other panels). (C) The average sensitivity and precision of multi-exon transcripts for methods running with default parameters. (D) The average sensitivity and precision of multi-exon transcripts for methods running with minimum coverage set to 0. (E) The average sensitivity and precision of single-exon transcripts for methods running with default parameters. (F) The average number of correct transcripts with different number of exons for methods running with minimum coverage set to 0. (G) The average sensitivity and precision of multi-exon transcripts with each subset of transcripts (corresponding to low, middle, and high expression level) as ground truth for methods running with minimum coverage set to 0. StringTie and TransComb obtain higher sensitivity but lower precision than Scallop on single-exon transcripts with default parameters ( Figure 1E and Supplementary Figure 1E). However, the overall number of 113 correct single-exon transcripts obtained by these methods is relatively small compared with that of multi-114 exon transcripts (compare scale of Figure 1E with Figure 1C). Scallop aggressively filters short and lowly 115 expressed single-exon transcripts to make the precision of single-exon transcripts comparable to that of 116 multi-exon transcripts. 117 We further compare the number of correct transcripts with different numbers of exons when their sensitivity 118 is maximized (i.e., the minimum coverage threshold is 0). The results are shown in Figure 1F and Sup-119 plementary Figure 1F. While Scallop is able to identify more correct transcripts for genes with at least up 120 to 17 exons, its advantage is most pronounced on transcripts with 2-7 exons. For example, with TopHat2 121 alignments, on average over the testing samples, Scallop obtains 60.0% and 54.9% more correct transcripts 122 with 2 or 3 exons than StringTie and TransComb, respectively. 123 Scallop also significantly improves identification of lowly expressed transcripts. To perform a quantitative 124 measurement, we use Salmon [31] to quantify the 10 RNA-seq samples using human annotation database 125 (ENSEMBL release 87) as reference. For each sample, we collect all multi-exon transcripts with expression 126 abundance larger than a threshold (TPM ≥ 0.1), sort them according to their expression abundances, and 127 divide them into three equal subsets corresponding to low, middle and high expression levels. We then 128 compute the accuracy of the three methods on multi-exon transcripts with each subset as the ground truth 129 ( Figure 1G and Supplementary Figure 1G). Scallop achieves higher accuracy on all three expression levels, 130 but the advantage is much more significant for low and middle levels. For example, with STAR alignment, and for each of the other two methods with smaller precision, we discard its predicted transcripts from the lowest coverage until the precision equals φ; the adjusted sensitivity is the sensitivity at this precision φ. 141 We compute the adjusted precision analogously, filtering low-coverage transcripts of the two methods with 142 higher sensitivity until all methods have the same sensitivity.    such reads that span u and v. We also add a source vertex s, and for each vertex u ∈ V \ {s} with in-degree Figure 2 for an example.

255
Many reads (including paired mates) can span more than two exons, providing phasing information to recon-256 struct the expressed transcripts. We collect such phasing information as a set of phasing paths of G, denoted 257 as H, as follows. If a read spans vertices v i 1 , v i 2 , · · · , v i m of G (i.e., this read sequentially aligns to these corre-258 sponding exons), m ≥ 3, we then add a phasing path (v i 1 , v i 2 , · · · , v i m ) to H. For the case of paired-end reads, 259 if we have that one read span vertices v i 1 , v i 2 , · · · , v i m , and its mate read spans vertices v j 1 , v j 2 , · · · , v j n , and 260 there exists a unique path (v i m , v k 1 , v k 2 , · · · , v k l , v j 1 ) from v i m to v j 1 in G, and that m + n + l ≥ 3, we then add 261 a phasing path (v i 1 , · · · , v i m , v k 1 , · · · , v k l , v j 1 , · · · , v j n ) to H. In the following, we shall equivalently represent 262 each phasing path with k vertices as a consecutive list of (k − 1) edges. Different reads or paired-end reads  might produce the same phasing path. For each phasing path h ∈ H, we use g(h) to record the number of 264 such reads or paired-end reads that produce h.

265
Based on G, w and H, we compute a set P of s-t paths of G and associate a real-value f (p) for every path 266 p ∈ P. Each path p ∈ P implies an expressed transcript, and f (p) estimates the expression abundance of 267 the corresponding transcript. We now design three objectives to guide reconstructing P and f . First, since 268 each phasing path is constructed from a single read or paired-end reads, which must be sampled from a 269 single transcript, we expect that each phasing path appears as a whole in some reconstructed transcript.

270
Formally, we say a phasing path h ∈ H is covered by P, if there exists an s-t path p ∈ P such that h is a 271 consecutive subset of edges of p. We do not enforce that all phasing paths in H must be covered by P. This 272 is because there exist false positive edges in the splice graph due to alignment errors or sequencing errors.

273
Our algorithm will try to identify and remove these false positive edges. Except these phasing paths with e ∈ E we expect that the superposition of the abundances of the inferred s-t paths passing through e, i.e.,

276
∑ p∈P:e∈p f (p), is as close to its observed read coverage w(e) as possible. Therefore, the second objective is 277 to minimize the deviation between these two quantities, defined as Third, following the principle of parsimony, we expect to use a smaller set of s-t paths to explain G, w and 279 H. That is, the third objective is to minimize |P|.

280
Combining all the three objectives, we informally describe the task of transcript assembly as follows.

281
Problem 1 (Transcript Assembly) Given G, w and H, compute a set of s-t paths P of G and abundance the current splice graph. 294 We say a vertex v ∈ V \ {s,t} is trivial, if its in-degree is 1, or its out-degree is 1; otherwise we say v is  (Figure 3(a,b)); otherwise we say v is splittable ( Figure 5(a,b)). In the following, we 305 design different subroutines to decompose unsplittable vertices, splittable vertices, and trivial vertices.

306
Decomposing Unsplittable Vertices. We now describe the subroutine to decompose an unsplittable vertex 307 v ( Figure 3). The aim of this subroutine is to replace v as a set of trivial vertices so as to locally minimize 308 d(P, f , w) and also preserve all phasing paths.

309
The first step is to balance v by computing new weights w(·) for adjacent edges of v (i.e., S v ∪ T v ). Specif-  Figure 3(b) for an example.

319
The third step of the subroutine is to assign weights for all edges (E v ) in the extended bipartite graph G v 320 so as to locally minimize the deviation w.r.t. w(·), i.e., d(P, f , w). We formulate it as a linear programming The objective function of the linear programming instance is taken to be: 327 minimize ∑ e∈S v y e + ∑ e ∈T v y e .
In most cases, when G v is not a tree, i.e., it contains a cycle (see Figure 4), the above linear programming has multiple optimal solutions. We use the abundance information of the phasing paths stored in g(·) to reassign 329 weights while keeping the optimal deviation w.r.t. w(·). For each edge (e, e ) ∈ E v , we denote by g(e, e ) 330 the number of reads or paired-end reads that continuously go through e and e , which can be computed as 331 g(e, e ) = ∑ h∈H: h contains (e,e ) g(h). Our goal is then to reassign the weights for edges in G v so as to keep the 332 above minimal deviation w.r.t. w(·) but to minimize the deviation w.r.t. g(·, ·). 333 We formulate this problem as another linear programming instance. Specifically, let y * e and y * e , e ∈ S v and 334 e ∈ T v , be the optimal solution of first linear programming instance (thus y * e and y * e are constants rather than We assign the weights for edges in E v to be the optimal value of x e,e of the second linear program. Note 341 that if the first linear program has the unique optimal solution, then both linear programs will have the same 342 optimal solution.

343
Finally, we update splice graph G by replacing v with G v (see Figure 3(c)); we denote by G the updated 344 splice graph. For the edges in G v that added to G , we maintain the information that they are artificially 345 added edges and thus do not correspond to any edge in G. This information will be used after decomposing all vertices to backtrace the paths with respect to the original splice graph (see line 6 of Algorithm 1). For

356
To choose which unsplittable vertex v to apply the above transformation to, we define and select a vertex v that minimizes z(v) to decompose (see line 3 of Algorithm 1).

358
Decomposing Splittable Vertices. We now describe the subroutine to decompose a splittable vertex v 359 ( Figure 5). The aim of this subroutine is to reduce |P| while preserving all phasing paths. Since P is not 360 explicitly available until we have finished decomposing all vertices, we use U := |E|−|V |+2 to approximate 361 |P|. It has been proved that U is an upper bound of |P| in the flow decomposition scenario: for a given flow 362 at most U paths are required to decompose this flow [32,33]. Following this approximation, in order to 363 reduce |P|, our subroutine to decompose a splittable v will increase the number of vertices or decrease the 364 number of edges, while at the same time preserving all phasing paths. Splittable vertices will typically be 365 replaced by two new vertices.

366
The first step of this subroutine is also to balance v by computing w(·) for edges in S v ∪ T v following the 367 same procedure as in decomposing unsplittable vertices. The second step is to split v into two vertices so as 368 to keep all phasing paths and to minimize the balanced weight discrepancy (i.e., to make each of the two new 369 vertices as balanced as possible). Formally, we seek such that for each (e, e ) ∈ E v , either e ∈ S v and e ∈ T v , or e ∈ S v and e ∈ T v , and that |w(S v ) − w(T v )| is minimized. Intuitively, this formulation forces that two edges in some phasing path must be adjacent after 372 splitting, and thus all phasing paths can be preserved.

373
The above problem can be equivalently transformed into the subset-sum problem. Let C be the set of all 374 connected components of G v . We define r(C) := ∑ e∈S v :e∈C w(e) − ∑ e ∈T v :e ∈C w(e ), for any C ∈ C . Then the 375 above problem is equivalent to computing a nonempty and strict subset of {r(C) | C ∈ C } such that the sum 376 of all elements of this subset is closest to 0. In our implementation, we use the existing pseudo-polynomial 377 time dynamic programming algorithm to solve it.

378
Let S * v and T * v be the optimal subsets returned by the above algorithm. We then update splice graph G 379 through performing the following procedure to decompose v. We denote the updated splice graph as G (see 380 Figure 5). Vertex v will be split into two vertices by adding another vertex v to G . Edges in S * v ∪ T * v will 381 be detached from v and attached to v . Notice that the weights for edges in S v ∪ T v are kept unchanged as 382 w(·), i.e., the balanced weight w(·) is only used to compute S * v and T * v .

383
It could be the case that will have either in-degree of 0 and out-degree or 1, or out-degree of 0 and in-degree of 1. In this case, the 385 above procedure of decomposing v will degenerate into removing this edge from G, instead of splitting v 386 into two vertices. If this is the case, it indicates that this particular edge is more likely to be a false positive 387 edge. In other words, this procedure can be used to naturally remove false positive edges in the splice graph.

388
For this case, we remove the appearance of this false positive edge for all phasing paths in H.

389
Notice that in either the general case of splitting v into two vertices, or the degenerate case of removing one 390 edge from G, after decomposing splittable vertex v, we have that U will be reduced by 1. For the degenerated case of removing one edge from G, these spanning paths that contain this edge shall be 392 not covered by G . For the usual case of splitting vertex this subroutine keeps all phasing paths H unchanged.

393
Finally, we define as the measurement to decide which splittable vertex to decompose (see line 4 of Algorithm 1).

395
Decomposing Trivial Vertices. There is a unique way to decompose a trivial vertex. Let v ∈ V be a 396 trivial vertex. Again, let S v be the set of edges that point to v, and let T v be the set of edges that leave v. i.e., h = (· · · , e 1 , e), then we simply remove e from h, i.e., it becomes h = (· · · , e 1 ) ∈ H . Otherwise, if e is 405 not the last element of h, i.e., h = (· · · , e, e 1 , · · · ), we replace e and e 1 as the edge ee 1 in G , i.e., it becomes 406 h = (· · · , ee 1 , · · · ) ∈ H .

407
In the complete algorithm (Algorithm 1), we first decompose all nontrivial vertices before decomposing 408 any trivial vertex. In other words, when we use the above subroutine to decompose a trivial vertex, all 409 vertices in the current splice graph are trivial vertices. We now prove that, when the splice graph contains consider the two cases of a phasing path h ∈ H that contains e. If h = (· · · , e, e 1 , · · · ) ∈ H, then we have 412 h = (· · · , ee 1 , · · · ) ∈ H , and h is covered by G . Since ee 1 is essentially the concatenation of e and e 1 , 413 we have that G covers h. For the other case that h = (· · · , e 1 , e) ∈ H, we have that h = (· · · , e 1 ) ∈ H .

414
(Although G covers h , but this alone does not necessarily imply that G covers h any more.) Let e 1 = (w, u) 415 and e = (u, v) in G. Since we assume that all vertices in G are trivial vertices, in particular u is a trivial 416 vertex, we have that in G all the succeeding edges of e 1 are in the type of ee , where e ∈ T v (see Figure 6).

417
In other words, for any path in G that contains h , the next edge of e 1 in this path must be an concatenated 418 edge with preceding edge of e. Hence, we have that G covers h. 419 We emphasize that if the splice graph contains nontrivial vertex, then decomposing a trivial vertex might not 420 preserve all phasing paths. Figure 7 gives such an example. Thus, it is essential to decompose all nontrivial 421 vertices before decomposing any trivial vertex.