Efficient Heuristic for Decomposing a Flow with Minimum Number of Paths

Mingfu Shao; Carl Kingsford

doi:10.1101/087759

Abstract

Motivated by transcript assembly and multiple genome assembly problems, in this paper, we study the following minimum path flow decomposition problem: given a directed acyclic graph G = (V,E) with source s and sink t and a flow f, compute a set of s-t paths P and assign weight w(p) for p ∈ P such that , and |P| is minimized. We propose an efficient pseudo-polynomialtime heuristic for this problem based on novel insights. Our heuristic gives a framework that consists of several components, providing a roadmap for continuing development of better heuristics. Through experimental studies on both simulated and transcript assembly instances, we show that our algorithm significantly improves the previous state-of-the-art algorithm. Implementation of our algorithm is available at https://github.com/Kingsford-Group/catfish.

1 Introduction

RNA sequencing (RNA-seq) is an established technology that enables identification of novel genes and transcripts as well as accurate measurement of expression abundances [1]. The RNA-seq protocol produces short sequencing reads sampled from the expressed transcripts. Thus, a fundamental computational prob-lem is to recover the set of full-length expressed transcripts and their expression abundances in the sample from these reads. This is referred to the transcript assembly problem. There have been much research interests for this problem and many methods and software packages have been proposed. Depending on whether a reference genome is assumed available, these methods are divided into two categories, reference-based methods (e.g., Cufflinks [2], Scripture [3], IsoLasso [4], SLIDE [5], CLIIQ [6], CEM [7], MITIE [8], iReckon [9], Traph [10], and StringTie [11]), and de novo methods (e.g., TransABySS [12], Rnnotator [13], Trinity [14], SOAPdenovo-Trans [15], Velvet [16], and Oases [17]). However, according to the bench-marking studies [18, 19], the performance of these methods are far from satisfactory, especially when the expressed genes contain multiple splice isoforms. Hence, transcript assembly still remains an open and challenging problem, and thus requires further algorithm development.

Both reference-based and de novo methods share the similar pipeline. The first step of the pipeline is to build the so-called splice graph (or transcript graph, overlap graph, etc) from given reads. For reference-based methods, reads are first aligned to the reference genome using some spliced aligner (e.g., TopHat2 [20], STAR [21], and HISAT [22]). Then the boundaries of exons and introns are inferred from the spliced aligned reads, and the splice graph is consequently constructed: each vertex represents an exon (or partial exon), each edge indicates there exist reads spanning the two corresponding exons, and the weight of this edge is usually computed as the number of such spanning reads. For de novo methods, reads are usually first organized by the de bruijn graph. Then the splice graph is constructed by collapsing simple paths of the de bruijn graph, and the weight of each edge is computed as the number of reads associated with the corresponding path. For both categories of methods, the constructed splice graph is the superposition of the expressed transcripts, each of which corresponds to an (unknown) path in the splice graph. If we assume the ideal case that the read coverage for each expressed transcript is a constant across the whole transcript, then the weights of edges shall form a flow of the spliced graph. In the realistic scenario where the read coverage varies across each transcript, there exist methods that can smooth the edge weights to make it a flow [10].

The second step of the pipeline is usually to decompose the splice graph into a set of paths and associate a weight with each path. These predict the expressed transcripts and their expression abundances. This step is the most algorithmically challenging part in the whole transcript assembly pipeline. Under the principle of parsimony, it is usually formulated as an optimization problem of decomposing a given flow into a minimum number of paths. Formally, let G = (V, E) be a directed acyclic graph (DAG) with a source vertex s and a sink vertex t. We denote the set of all s-t paths of G by . Notice that might be exponential compared with |E|. Let f: E → ℤ be an s-t integral flow of G. We say (P, w) is a decomposition of (G, f) if we have , w: P → ℤ assigns an integral weight w(p) for each path p ∈ P, and f (e) = ∑ _p∈P:e∈p w(p) for every e ∈ E. Our paper focuses on the following problem.

Problem 1

(Minimum Path Flow Decomposition)

Given a DAG G = (V, E) with source vertex s and sink vertex t and an integral s-t flow f, compute a decomposition (P, w) of (G, f) such that |P| is minimized.

While we formulate Problem 1 in the context of transcript assembly, we emphasize that Problem 1 can be naturally used to model a broader class of so-called multiple genome assembly problems [23]. These are problems in genomics where several unknown strings S = {s₁, s₂, …, s_k} and their (unknown) counts C = {c₁, c₂, …, c_k} must be reconstructed from many substrings (reads) that are sampled from these strings. This is a generalization of the standard genome assembly problem that seeks to reconstruct a single string. The above-mentioned transcript assembly problem is one such instance, where S is the set of alternatively spliced isoforms of a gene, and C is the expression levels of these isoforms. Other instances include metagenomic assembly (where S represents the genomes of microbes in a community and C represents their abundances), cancer genome assembly (where S is the set of genomes of subclones in a tumor sample and C is the frequency of each subclone), and virus quasi-species assembly (where S is the set of genomes in a population of viral genomes infecting a given individual and C is their frequency).

Problem 1 has been proven to be strongly NP-hard [24], meaning that even if all the flow values are in polynomial-size, there still does not exist a polynomial-time algorithm unless P = NP. It has been further proved that even if all flow values are chosen from {1, 2, 4}, this problem is still NP-hard [25].

Several algorithms and heuristics have been proposed for Problem 1. Vatinlen et al. [24] proposed two practical greedy algorithms, namely greedy-length and greedy-width algorithms, which are to iteratively choose the shortest path and the path with the largest flow, respectively, in the remaining flow, until the entire flow is decomposed. These two algorithms have been shown that they can guarantee obtaining a decomposition with at most (|E| − |V| + 2) paths, a known upper bound for the number of paths in any optimal solution. Hartman et al. [25] designed two similar algorithms, called bicriteria-width and bicriteria-length, which can guarantee decomposing (1 − ε) fraction of the entire flow into O(|P^∗|/ε²) paths, for any ε < 1. Mumey et al. [26] proposed another two algorithms using parity balancing path flows, which have been proved to be (L^log(F) · log(F))-approximation algorithms, where L is the maximum length of all s-t paths in G, and F is the maximum flow values over all edges. We emphasize that over all existing algorithms, the greedy-width algorithm has been shown as the best heuristic through experimental comparisons [24, 25, 26], and because of that, this algorithm has been implemented in several widely used transcript assembly methods, for example, Traph [10] and StringTie [11].

Several problems related to Problem 1 have been also addressed. Hendel and Kubiak [27] studied the problem of decomposing a given flow into a set of paths such that the length of the longest path is minimized. They proposed a polynomial-time approximation scheme (PTAS) for this problem. Baier et al. [28, 29] studied the k-splittable flow problem, which is to compute a maximum flow such that it can be decomposed into at most k paths. For this problem they devised a (2/3)-approximation algorithm for k = 2 and a (2/k)-approximation algorithm for k ≥ 3.

Our contribution to Problem 1 is a pseudo-polynomial-time heuristic based on the novel properties we prove, and we show that it considerably outperforms the best previous heuristic (greedy-width) through extensive experimental comparisons. The central idea of our heuristic is that equations involving flow values on edges are intuitively suggestive of the structure of paths. For example, if f (e₁) = f (e₂) + f (e₃), then one might suspect that a path goes through e₁ and e₂ and another path goes through e₁ and e₃ (see examples in Figure 1). The first challenge of our heuristic is to identify these equations. Once identified, each equation can then be used to simplify the graph by merging edges that are involved in the same path. We define a operation, called reverse, on the graph that modifies the graph in such a way that the optimal solution does not change, but that attempts to make edges that are to be merged be incident on the same vertex. The entire approach iterates between these two steps: identifying a non-trivial relationship between the flow values on the edges and modifying the graph to facilitate edge merging (and merging the edges). We show that at each stage, either the algorithm fails, or the solution space gets reduced and the gap between the optimal and an upper bound on the optimal is reduced.

Figure 1:

The flow value for each edge is put in parenthesis. (a) Observe that f (e₁) = f (e₄), i.e, nontrivial null vector of (1, 0, 0, −1)^T. We have Δ = 3, . (b) Observe that f (e₁) = f (e₄) + f (e₆), i.e., nontrivial null vector of (1, 0, 0, −1, 0, −1)^T. We have Δ = 4, , , , , ,

2 Motivation

In this Section, we prove some properties about Problem 1 upon which we design our algorithm. We first reformulate some definitions to facilitate using algebra tools. We arbitrarily assign indices to edges in E and paths in , i.e., E = {e₁, e₂, …, e_|E _|} and . We can then write f as a row vector in ℤ^1×|E ^|, with f [k] = f (e_k). We can also write as a matrix in {0, 1}^|P ^|×|E ^|, with if and only if p_i contains edge e_j. Consequently, each subset can also be represented as a binary matrix, which can be modified from matrix by removing these rows whose corresponding paths are not in P. With these formulations, the definition that (P, w) is a decomposition of (G, f) can be equivalently written as f = w · P, where w ∈ ℤ^1×|P|. In the following, we use (and P) to represent both the set of paths as well as the corresponding matrix, but this will be obvious to distinguish. Let (P^∗, w^∗) be an optimal decomposition of (G, f). Let Δ= |E| − |V| + 2. From [24], we know that Δ gives an upper bound for |P^∗|, i.e., |P^∗| ≤ Δ. We call (Δ − |P^∗|) as the optimality gap. Let be the null space of P^∗, i.e., . We call vectors in as null vectors (w.r.t. P^∗). The proofs of all Propositions are in the Appendix C.

Proposition 1

We have f · q = 0 for any .

Proposition 1 is a necessary condition rather than a sufficient condition to characterize . On the other side, if we assume that all weights in w^∗ are randomly assigned, then it is with small probability to observe that f · q = 0 if .

Proposition 2

Suppose is sampled independently and uniformly at random from ℤ_k = {1, 2, …, k} for all j = 1, 2, …, |P^∗|. Then we have .

We further study the structure of . Let V₀ = V \ {s,t}. For each v ∈ V₀, we define a column vector q_v ∈ {−1, 0, +1}^|E ^|×1 where q_v[k] = 0 if e_k is not adjacent to v, q_v[k] = +1 if e_k points to v, and q_v[k] = −1 if e_k leaves v. Let be the linear space spanned by {q_v | v ∈ V₀}. We call vectors in as trivial null vectors, and call vectors in as nontrivial null vectors. We have the following results.

Proposition 3

Proposition 4

Proposition 4 says that the optimality gap is larger than 0 if and only if there exists nontrivial null vectors (see Figure 1 for examples). This inspires our algorithm: suppose that we have two oracles, the first one can return nontrivial null vectors, and the second one can use these vectors to reduce the optimality gap; then we can iteratively apply these two oracles, and when the first oracle returns the empty set, the optimality gap has been closed. Our algorithm then tries to approximate these two oracles.

3 Algorithm

Our heuristic for Problem 1 consists of two phases. In its first phase, in each iteration, it tries to identify nontrivial null vectors (see Section 3.1), and then use them to partially decompose and update the graph and flow (see Section 3.2) in order to obtain a new graph with a smaller optimality gap. We expect that, the optimality gap can be reduced by at least 1 after each iteration, and the entire optimality gap can be closed after the first phase. Its second phase uses the greedy-width algorithm to decompose the remaining graph, which guarantees optimality if the optimality gap can be correctly closed in the first phase. The entire algorithm is formally described and analyzed in Section 3.4.

3.1 Identifying Nontrivial Null Vectors

The idea to identify nontrivial null vectors is based on Proposition 1 and Proposition 2, i.e., to compute vector satisfying that f · q = 0 and that , where is a collection of candidate vectors that we guess in advance. We say a column vector q ∈ ℝ^|E ^|×1 is simple if all elements of q are in {−1, 0, +1}. We choose as the set of all simple vectors. Notice that it could be the case that nontrivial null vectors exist, but none of them is simple (Figure 2). Let . For any subset E₁ ⊆ E, we define f (E1) = . We say two subsets E1, E2 ⊆ E are balanced if f (E1) = f (E2). For a vector , we define E_s(q) = {e_j ∈ E | q[ j] = +1}, and define E_t (q) = {e_j ∈ E | q[ j] = −1}. It is clear that for any , we have if and only if f (E_s(q)) = f (E_t (q)). We say a pair of balanced subsets E₁ and E₂ are indivisible, if there does not exist two strict subsets and such that . Notice that two balanced subsets being indivisible implies that they are disjoint. We also say is indivisible if E_s(q) and E_t (q) are indivisible. Let be the subset that includes all indivisible vectors. We focus on computing indivisible and nontrivial null vectors, formally illustrated as Problem 2. We emphasize that is an approximation of , i.e., vectors in are not necessarily in (because we restrict to simple vectors, see Figure 2), and vectors in are not necessarily in (because f · q = 0 is not a sufficient condition).

Figure 2:

In this example, P^∗ contains six paths, {p₁, p₂, …, p₆}. The set of paths that go through each edge are labeled in the parenthesis. We also have Δ= 7, thus there exists nontrivial null vectors, for example, (2, −1, 1, 0, −1, 1, 0, −1, 0)^T. However, we can verify there does not exist simple ones for this example.

Problem 2

Given G and f, compute .

We give a heuristic for Problem 2. We first devise a dynamic programming heuristic to compute (Algorithm 1), which extends the pseudo-polynomial-time algorithm for the subset-sum problem, then we devise an algorithm (Algorithm 2) based on the properties of and to filter out vectors in .

Algorithm 1

Heuristic to compute

Input: G and f (we define | f | = ∑_e∈E _:e=(s,v) f (e) = ∑_e∈E _:e=(v,t ₎ f (e), i.e., the value of f)

Output:

1. Let A ∈ {0, 1}^|E ^|×| ^f|: A[k, x] = 1 if there exists E₁ ⊆ {e₁, …, e_k} with f (E₁) = x.

2. Let B ∈ {0, 1}^|E ^|×| ^f ^|: B[k, x] = 1 if there exists distinct E₁, E₂ ⊆ {e₁, …, e_k} with f (E₁) = f (E₂) = x.

3. Compute A with the recursion: A[k, x] = 1 if and only if A[k − 1, x] = 1, or A[k − 1, x − f (e_k)] = 1.

4. Compute B with the recursion: B[k, x] = 1 if and only if B[k − 1, x] = 1, or B[k − 1, x − f (e_k)] = 1, or A[k − 1, x] = 1 and A[k − 1, x − f (e_k)] = 1.

5. Compute C = {1 ≤ x ≤ | f | | B[|E |, x] = 1}.

6. Compute D = {x ∈ C | B[|E |, x − f (e)] = 0, ∀e ∈ E}.

7. For every x ∈ D, retrieve the two balanced subsets through tracing back matrices A and B, and then add the corresponding vector to .

By the definition of matrix B, for each x ∈ C, there exists two subsets E₁ and E₂ such that f (E₁) = f (E₂) = x, but these two subsets might be not indivisible (or even not disjoint). We can prove that the two balanced subsets through tracing back from x ∈ D are guaranteed indivisible (see proof of Proposition 5).

Proposition 5

We have that , i.e., the set returned by Algorithm 1, is a subset of . Algorithm 1 runs in O(|E| · |f|) time.

The remaining task is to compute . The following proposition gives an equivalent condition to decide whether a vector is in or not. We say a subset V₁ ⊆ V is connected if the subgraph of G induced by V₁ is weakly connected.

Proposition 6

Let . We have that if and only if there exists a connected subset V₁ ⊆ V₀ such that .

For a subset V₁ ⊆ V₀, we define E_s(V₁) and E_t (V₁) the sets of edges corresponding to cut (V₁,V \ V₁), i.e., E_s(V₁) = {(u, v) ∈ E | u ∈ V \ V₁, v ∈ V₁} and E_t (V₁) = {(u, v) ∈ E | u ∈ V₁, v ∈ V \ V₁}. Following Proposition 6, we have the following result.

Proposition 7

Let . We have that if and only if there exists a connected subset V₁ ⊆ V₀ such that E_s(q) = E_s(V₁) and E_t (q) = E_t (V₁), or E_s(q) = E_t (V₁) and E_t (q) = E_s(V₁).

Proposition 7 transforms the problem of deciding whether a vector is also in into the problem of examining whether E_s(q) and E_t (q) form a set of cut edges, which can be accomplished through two rounds of breadth-first searching (Algorithm 2).

Algorithm 2

Exact algorithm to compute

Input: , G

Output:

1. FOREACH , DO the following procedure for both q′ = q and q′ = −q

2. Traverse G from edges in E_s (q′), stop searching when reaching edges in E_t (q′). IF all searching ends up with edges in E_t (q′), i.e., t is not reached, THEN add q to .

3. Traverse G reversely from edges in E_t (q′), stop searching when reaching edges in E_s (q′). IF all searching ends up with edges in E_s (q′), i.e., s is not reached, THEN add q to .

Algorithm 2 also runs in O(|E | · | f |) time, since and for each , Algorithm 2 takes O(|E|) time to check its membership in . Thus, our heuristic for Problem 2, which consists of Algorithm 1 followed by Algorithm 2, also runs in O(|E | · | f |) time.

3.2 Reducing the Optimality Gap

In this section, we devise a heuristic (Algorithm 3) to use to reduce the optimality gap. The idea of is to iteratively choose one edge from E_s(q) and one edge from E_t (q), merge them through updating G, f and q, until finally q becomes 0.

Algorithm 3

Heuristic to reduce the optimality gap

Input: , G and f

Output: TRUE and updated G and f if the optimality gap is reduced, FALSE otherwise

1. WHILE q ≠ 0

2. Arbitrarily choose e_i = (u₁, v₁) ∈ E_s (q) and e_j = (u₂, v₂) ∈ E_t (q) that are connected; if no such pair can be found, RETURN FALSE. Assume a path from v₁ to u₂ exists; also assume f (e_i) ≤ f (e_j).

3. IF v₁ = u₂ (see Figure 3(a))

4. Add edge e_k = (u₁, v₂) to G with flow value of f (e_i); set q[k] = 0; remove e_i; set q[i] = 0.

5. Remove e _j and set q[ j] = 0 if f (e_i) = f (e _j), otherwise set f (e _j) as f (e _j) − f (e_i).

6. ELSE (see Figure 3(b))

7. Apply Algorithm 6 (Section 3.3) to make e_i and e _j adjacent. IF this succeeds, GOTO step 3.

8. Arbitrarily choose a path p from v₁ to u₂ satisfying that all edges in p have a flow value of at least f (e_i). We assume that p does not contain edges in E_s (q) ∪ E_t (q), otherwise we switch to merge this closer pair of edges. IF such path cannot be found, RETURN FALSE.

9. Add edge e_k = (u₁, v₂) to G with flow value of f (e_i); set q[k] = 0; remove e_i; set q[i] = 0.

10. For edge e _j and every edge in p, subtract their flow value by f (e_i); if their updated flow values become 0, then remove them from the graph and set q[·] = 0 for them.

11. RETURN TRUE.

Suppose that Algorithm 3 succeeds after n iterations. Let q^k, G^k = (V^k, E^k) and f^k be the updated vector, graph, and flow, respectively, after it finishes k iterations, 1 ≤ k ≤ n. Notice that qⁿ = 0. Similar to and , we also define and w.r.t. G^k and f^k. Let (P^k^∗, w^k^∗) be one optimal decomposition of (G^k, f^k), and also defineΔ^k = |E^k| − |V^k | + 2, 1 ≤ k ≤ n. Assume that G⁰ = G and f ⁰ = f for sake of simplicity.

Figure 3:

(a) Merging two adjacent edges. (b) Merging two distant edges (a, b) and (d, e).

Proposition 8

If Algorithm 3 succeeds, then we have , 1 ≤ k ≤ n − 1.

Proposition 9

If Algorithm 3 succeeds, then we have n = |E_s(q)| + |E_t (q)| − 1.

We emphasize that the merging process in Algorithm 3 narrows down the solution space. Specifically, the set of all possible decompositions of (G^k, f^k) is a subset of that of (G^k⁻¹, f ^k−1), 1 ≤ k ≤ n − 1, i.e., every decomposition of (G^k, f ^k) corresponds to a decomposition of (G^k⁻¹, f^k⁻¹). This implies that |P^k ^∗| ≥ |P^(k−1) ^∗|. Notice that it is possible that |P^k^∗| > |P^(k ^−1)∗|, i.e., optimal solutions get eliminated during Algorithm 3. Figure 4 shows that merging two adjacent edges (lines 3–5 of Algorithm 3) might break optimality.

Figure 4:

Merging adjacent edges might break optimality. (a) Original graph with 4 paths in the optimal decomposition. (b) Graph after merging e₃ and e₇, with 5 paths in the optimal decomposition.

We now describe our central result, which states that if Algorithm 3 succeeds, then the optimality gap will be reduced by at least 1. The intuition behind the proof is that during the first (n − 1) iterations, (|E| − |V|) keeps unchanged or decreases, while at the last iteration, this value shall be reduced by at least 1.

Proposition 10

If Algorithm 3 succeeds, then we have Δⁿ − |Pⁿ^∗| ≤ − |P^∗| − 1.

3.3 Making Two Distant Edges Adjacent

We now describe an algorithm (Algorithm 6) to try to make two given distant edges adjacent while keeping optimality, a key subroutine used in Algorithm 3 (line 7). In this Section, we describe the intuition and idea of this algorithm, and its formal and complete description is in Appendix A. We first characterize the hierarchical structure of the graph. We say a pair of vertices u and v is closed, if there does not exist any edge satisfying that exactly one of its ends is in the subgraph induced by these two vertices (see Figure 5 (a,c)).

Figure 5:

(a) Illustration of G = (V, E). (b) Illustration of V (a, h) and E (a, h). (c) Edges represent all closed pairs (i.e., ). (d) Graph after reversing closed pair of (c,t). Notice that edges {e₃, e₁₀} and {e₉, e₁₃} become adjacent, which are distant before reversing.

With a given pair of closed vertices, we then define a reverse operation to modify the graph (Figure 5 (a,d)). We can prove that, performing reverse operations will not change the optimality (Proposition 12). Thus, we focus on the problem of deciding whether there exists a series of reverse operations to make two given distant edges adjacent (Problem 3).

We give a polynomail-time exact algorithm to solve Problem 3 (Algorithm 6). The algorithm first use a DAG to organize all closed pairs (Figure 6). Then through traversing the DAG starting from the vertices corresponding to the two given distant edges, we can decide whether these two edges can be made adjacent, and if so the reverse operations can be retrieved by tracing back the DAG.

Figure 6:

w.r.t. the graph in Figure 5 (a). To make edge e_i = (a, b) and edge e_j = (f, h) adjacent, Algorithm 6 first computes their closures in , which are (a, b) and (f,t) respectively. The following traversing procedure gives S₁ = {a}, T₁ = {b, c}, S₂ = { f, c, s}, and T₂ = {t}. We have T₁ ∩ S₂ = c, i.e., e_i and e_j can be made adjacent at c, and the operations are reversing (a, c), (f,t) and (c,t).

3.4 The Complete Algorithm

With all above components available, we now formally state our heuristic for Problem 1 in Algorithm 4. The first phase (lines 1 to 3) is to iteratively reduce the optimality gap. It then uses the greedy-width algorithm (proposed in [24]) to fully decompose the remaining graph into paths. Notice that these paths are w.r.t. the updated graph, and we need to recover the paths corresponding to the original graph (we maintain the tracing back information when updating the graph).

Algorithm 4

Heuristic for Problem 1

Input: G and f

Output: A decomposition (P^†, w^†) of (G, f)

1. Apply Algorithm 1 and Algorithm 2 to compute . IF , GOTO step 4.

2. FOREACH

3. Apply Algorithm 3 with q. IF it succeeds, GOTO step 1.

4. Apply the greedy-width algorithm on the updated graph.

5. For the paths returned in line 4, recover the corresponding paths w.r.t. the original graph.

Proposition 11

Algorithm 4 runs in O(|V |² · |E |² · | f |) time.

From [24], we know that for an instance with optimality gap of 0, the greedy-width algorithm can guarantee returning one optimal decomposition. This implies that, for any instance, if the first phase of Algorithm 4 closes its optimality gap and does not break the optimality, then Algorithm 4 shall return an optimal decom-position for this instance. We use experimental studies to illustrate its performance in Section 4.

4 Experimental Results

We first describe the experiments in the context of transcript assembly. We use four datasets. The first dataset includes 1000 human RNA-seq samples from SRA (Sequence Reads Archive). For each gene in each sample, we build an instance (i.e., G and f) as follows. We first use Salmon [30] to identify and quantify the expressed transcripts of this gene: the expressed transcripts shall be then used as the ground truth paths, denoted as P^‡, while the expression abundances of these transcripts shall be used as the ground truth weights, denoted as w^‡. With (P^‡, w^‡), we then construct G and f: vertices of G correspond exons, edges are the union of the edges in the ground truth paths, and the flow value for each edge is the superposition of the ground truth weights of the paths that go through this edge (Figure 7). For the generated instance (G, f), we pipe it into the two algorithms (Algorithm 4 and greedy-width). We denote by (P^†, w^†) as the predicted paths and weights by any of these two algorithms. We say (P^†, w^†) is correct if we have |P^†| ≤ |P^‡|. Notice that (P^‡, w^‡) is not necessarily optimal, but only gives an upper bound, i.e., |P^‡| ≥ |P^∗|. The other three datasets are obtained by simulation using Flux-Simulator [31] on three well-annotated species, human, mouse and zebrafish. For each species, we independently simulate 100 samples. For each gene in each sample, we collect the ground truth (P^‡, w^‡) from Flux-Simulator, and use them to build the instance G and f (Figure 7).

Figure 7:

(a) Transcripts and their expression abundances (c_k) of a gene. (b) The splice graph G and flow f derived from the ground truth in (a).

Based on |P^‡|, for each dataset, we classify all instances into 10 categories, i.e., |P^‡| ∈ {1, 2, …, 10}, and remove those instances with |P^‡| > 10 (which is a tiny portion). For each category, we run both algorithms on all instances and compare their accuracy (proportion of instances that are predicted correctly), and average ratio between |P^†| and |P^‡| (i.e., mean value of |P^†|/|P^‡| over all instances).

The results are shown in Figure 8. We can observe that Algorithm 4 gives very high accuracy on all in-stances, while the accuracy of the greedy-width algorithm decreases rapidly as |P^‡| increases. Meanwhile, the decompositions given by the greedy-width algorithm uses more paths and the ratio between the number of predicted paths and the ground truth keeps increasing as |P^‡| increases, while for Algorithm 4 such ratio constantly remains closing to 1. These results demonstrate that in the application of transcript assembly, Algorithm 4 significantly outperforms the greedy-width algorithm.

Figure 8:

Comparison of the two algorithm on transcript assembly. Left: x-axis is |P^‡| and y-axis is the accuracy. Right: x-axis is |P^‡| and y-axis is the average ratio between |P^†|/|P^‡|.

To evaluate the performance of our algorithm on large graphs, we carry out another set of experiments using simulated random instances. The results show that our algorithm can give very high accuracy for a wide spectrum of parameter combinations, while the greedy-width algorithm works well only for easy instances. These results further demonstrate the significant improvement of our algorithm over the greedy-width algorithm. The simulation setup and the detailed results are in Appendix B.

5 Discussion

We give an efficient heuristic for the minimum path flow decomposition problem. The components of this heuristic can be further improved. For example, we shall study how to reduce the optimality gap while keep optimality if a true nontrivial null vector is given, and also study how to identify nontrivial null vectors that are not simple as well as how to use such vectors to reduce the optimality gap. We tackle here only the flow decomposition step of transcript assebmly. Future work also includes extending our algorithm to develop a complete transcript assembler.

Acknowledgements

We thank Meiyue Shao and Guillaume Marc¸ais for helpful discussions and suggestions. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to Carl Kingsford, by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R21HG006913, R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow.

Appendix A Making Two Distant Edges Adjacent

In this Section, we describe an algorithm to make two given distant edges adjacent while keeping optimality. We first give some definitions to characterize the hierarchical structure of G. We say a pair of vertices (u, v) is connected (w.r.t. graph G), if there exists a path from u to v in G. For a pair of connected vertices (u, v), we define a subset V (u, v) ⊆ V as the set of vertices that are on some path from u to v (see Figure 5 (a,c)). Notice that if (u, v) is connected, we always have {u, v} ⊆ V (u, v). For a pair of connected vertices (u, v), we also define a subset E (u, v) ⊆ E as follows (see Figure 5): for any e ∈ E we say e ∈ E (u, v) if the both ends of e are in V (u, v). We say a pair of connected vertices (u, v) is closed, if for every vertex w ∈ V (u, v) \ {u, v}, all incident edges of w in E are also in E (u, v). We denote by the set of all closed pairs. For each pair , we define E_s(u, v) ⊆ E (u, v) as the set of edges in E (u, v) that are adjacent to u, and define E_t (u, v) ⊆ E (u, v) as the set of edges in E (u, v) that are adjacent to v.

We define a reverse operation to modify G (Figure 5 (a,d)). For each pair , we can reverse it by first detaching edges in E_s(u, v) from u and attaching them to v, detaching edges in E_t (u, v) from v and attaching them to u, and then switching the direction of all edges in E (u, v). Notice that we treat the reverse operation in a way that the set of edges in G is kept unchanged, but the locations and orientations of edges might be changed. Formally, let G′ be the graph after reversing . We assume that G′ still have the edge set of E. But for each edge e_i = (u, v) in G, it might become e_i = (u′, v′) in G′ with u′ ≠ u and/or v′ ≠ v. In other words, all edges are recognized by their original identities in E, and their adjacent two vertices might become different after performing reverse operations.

Let G′ be the graph after reversing . Let (P′^∗, w′^∗) be one optimal decomposition of (G′, f). The following result states that solving Problem 1 on (G, f) is equivalent to solving it on (G′, f).

Proposition 12

There exists a one-to-one correspondence between all decompositions of (G, f) and all decompositions of (G′, f). In particular, we have that |P^∗| = |P′^∗|.

Based on Proposition 12, we address the following problem.

Problem 3

Given two edges e_i, e_j ∈ E, decide whether we can perform a series of reverse operations, such that in the resulting graph we have e_i = (u₁, v₁) and e _j = (u₂, v₂) with either v₁ = u₂ or v₂ = u₁.

To solve Problem 3, we first describe a data structure to organize all the reverse operations. We define a binary relation on : for two pairs (u₁, v₁), , we say if V (u₁, v₁) ⊆ V (u₂, v₂). It is easy to verify that this relation forms a partial order. Let be the Hasse diagram on w.r.t. this partial order. For each edge in from (u₁, v₁) to (u₂, v₂), we remove it from if u₁ ≠ u₂ and v₁ ≠ v₂. We denote by this updated graph (see Figure 6).

We now describe an exact algorithm to solve Problem 3, which requires constructing and then search-ing series of reverse operations guided by . We first compute the transitive closure of G, which gives us all possible connected pairs. For each connected pair (u, v), we follow the definition to check whether this pair is in : we compute V (u, v), which can be done in O(|V |) time with the availability of all connected pairs; then for each vertex w ∈ V (u, v) \ {u, v} we check whether all adjacent edges of w are still in V (u, v). It takes O(|E |) time to verify each connected pair, and thus the total running time to compute is O(|V |² · |E |).

We then give an algorithm to compute , i.e., for each pair in we need to compute its parents node in . For a subset V₁ ⊆ V, |V₁| ≥ 2, we say is a closure of V₁, if we have V₁ ⊆ V (u, v), and there does not exist such that V₁ ⊆ V (u′, v′) ⊂ V (u, v). Intuitively, the closure of a given set of vertices is the smallest closed pair whose set of vertices contains the given vertices. As we will see in Proposition 13 such closure is unique. Now we describe an exact algorithm to compute the closure of a given subset, which shall be used in both constructing and the following steps (line 2 of Algorithm 6). The idea of this algorithm is to iteratively add necessary vertices until finally it becomes closed.

Algorithm 5

Exact algorithm to compute closure

Input: V₁ ⊂ V, and G

Output: Closure of V₁

1. Compute a topological ordering O of V, i.e., if (u, v) is connected, then O(u) < O(v).

2. Let S be the set containing all vertices in the final closure, initialized as V₁.

3. Let L and R be the smallest and largest vertex in S according to their orderings in O, respectively.

4. Let Q be a queue that stores all (unexamined) vertices in S \ {L, R}, initialized as S \ {L, R}.

5. REPEAT

6. WHILE Q is not empty

7. Pop element v from Q.

8. FOREACH adjacent vertex u of v that is not in S

9. Insert u into S.

10. IF O(u) < O(L), push L to Q and set L = u.

11. IF O(u) > O(R), push R to Q and set R = u.

12. FOREACH adjacent vertex u of L that is not in S

13. IF , insert u into S and push u to Q.

14. IF Q is empty, RETURN (L, R).

Proposition 13

For any V₁ ⊂ V, |V₁| ≥ 2, Algorithm 5 returns the unique closure of V₁ in O(|E |) time.

We now describe how to build using Algorithm 5. Suppose that (u′, v′) is one of the direct parents of in . We have two cases: either u ∉ {u′, v′}, or v ∉ {u′, v′}. If it is the first case, then (u′, v′) is exactly the closure of V₁ = {u, v} ∪ {w | (w, u) ∈ E} ∪ {w | (u, w) ∈ E}. If it is the second case, then (u′, v′) is exactly the closure of V₂ = {u, v} ∪ {w | (w, v) ∈ E} ∪ {w | (v, w) ∈ E}. In other words, the parents of in are exactly the closures of V₁ and V₂. For each pair , we use this procedure to compute its parents, and thus the whole graph can be constructed. To further build , by definition, we then check each edge of to see whether the two pairs corresponding to its two adjacent vertices share the same starting or ending vertex. The total running time to build is O(|V |² · |E |).

Algorithm 6

Exact algorithm for Problem 3

Input: G, e_i = (u₁, v₁), e_j = (u₂, v₂) ∈ E

Output: TRUE and series of reverse operations if they can be made adjacent, FALSE otherwise

1. Compute .

2. Compute the closures of {u₁, v₁} and {u₂, v₂}; denote them as (u′₁, v′₁) and (u′₂, v′₂), respectively.

3. Initialize S₁ = {u₁}, T₁ = {v₁}, S₂ = {u₂}, T₂ = {v₂}.

4. Traverse from (u′₁, v′₁): for each visited node , insert u to S₁ and insert v to T₁.

5. Traverse from (u′₂, v′₂): for each visited node , insert u to S₂ and insert v to T₂.

6. IF S₁ ∩ T₂ = ∅ and S₂ ∩ T₁ = ∅, RETURN FALSE.

7. Compute arbitrarily w ∈ (S₁ ∩ T₂) ∪ (S₂ ∩ T₁); assume that w ∈ S₂ ∩ T₁.

8. Compute any path from (u′₁, v′₁) to node (·, w), and traverse this path to construct the reverse operations to make e_i point to w: let (u, v) be the current node (i.e., e_i is either points to v or leaves u) and u be the boundary vertex shared by (u, v) and its parent; reverse (u, v) if e_i is not adjacent to u; move to the parent of (u, v) on this path. In the end reverse (·, w) if e_i is not adjacent to w.

9. Perform the same procedure as line 8 to construct the reverse operations to make e_j leave w.

10. RETURN TRUE.

Our complete algorithm to solve Problem 3 is described in Algorithm 6. The idea is first to locate the nodes in for the given two edges, and then traverse starting from these two nodes respectively to compute all the possible positions that the two given edges can be moved to by reverse operations. After that the algorithm examines whether there exists a common vertex where the two edges can be made adjacent. The series of reverse operations can be obtained through tracing back. The correctness and running time of Algorithm 6 is stated as Proposition 14.

Proposition 14

Algorithm 6 gives an exact solution for Problem 3 in O(|V |² · |E |) time.

Appendix B Experiments with Simulated Random Instances

The simulation procedure first generates the ground truth (P^‡, w^‡), and then builds (G, f) based on them. In the simulation, we assume that G = (V, E) contains 1000 vertices, i.e., |V | = 1000, and assume that (1, 2, …, |V |) is one topological sorting of G. To generate a path p^‡ in the ground truth, we first fix the length of this path, which is sampled uniformly at random from {1, 2, …, L}, where L ≤ |V | − 1 is a parameter governing the maximum length of the paths in the ground truth path. We then sample (|p^‡| + 1) vertices uniformly at random without replacement from {1, 2, …, |V |}, and the path connects these vertices in an ascending order. The ground truth weight of this path, i.e., w^‡(p^‡), is sampled uniformly at random from {1, 2, …, 10000}. We repeat this process to generate |P^‡| paths, where |P^‡| is another parameter governing the number of paths in the ground truth. With (P^‡, w^‡), we then construct instance (G, f) by superposition of the weights of the paths that go through the edge (similar to Figure 7). Again, we denote by (P^†, w^†) be the output of the algorithm (Algorithm 4 or greedy-width), and we say (P^†, w^†) is correct if we have |P^†| ≤ |P^‡|.

We perform experiments with two different parameter settings, one is L = 50, |P^‡| ∈ {20, 40, …, 200}, the other is L ∈ {10, 20, …, 100}, |P^‡| = 100. For each parameter combination (L, |P^‡|), we randomly generate 100 instances using the procedure mentioned above. We run both algorithms on these 100 instances, and for each algorithm we record its accuracy (proportion of instances that are predicted correctly), average ratio between |P^†| and |P^‡| (i.e., mean value of |P^†|/|P^‡| over the 100 instances), and total running time.

The results for the first parameter settings are shown in Table 1. First, we can observe that Algorithm 4 gives 100% accuracy when |P^‡| ≤ 140, and it decreases a few percentage for larger |P^‡|. Meanwhile, the accuracy of the greedy-width algorithm decreases rapidly with the increase of |P^‡|, and drops to 0 when |P^‡| ≥ 100. Second, we can see that the number of paths returned by Algorithm 4 is very close to the ground truth for all instances, while greedy-width algorithm uses more paths than the ground truth especially for large |P^‡|. Third, both algorithms can finish these instances in a short amount of time, and Algorithm 4 is about 4 times slower than the greedy-width algorithm.

View this table:

Table 1:

Comparison of the two algorithms on simulations with L = 50.

The results for the second parameter settings are shown in Table 2, which are consistent with that in Table 1. Combining these results we can conclude that Algorithm 4 can give decompositions with the same number of paths as that in the ground truth for a large scope of parameter combinations, while greedy-width algorithm can only give such decompositions for small L and small |P^‡|, and it requires a significant portion of more paths for large L or large |P^‡|. This shows a considerable improvement of Algorithm 4 over the greedy-width algorithm, especially for instances with many paths.

View this table:

Table 2:

Comparison of the two algorithms on simulations with |P^‡| = 100.

Appendix C Proofs of Propositions

Proof of Proposition 1.

Because (P^∗, w^∗) forms a decomposition of (G, f), we know that f = w^∗ · P^∗. By definition, for any we also have P^∗ · q = 0. Thus, we have that f · q = w^∗ · P^∗ · q = 0.

Proof of Proposition 2.

We can write Pr(f · q = 0) = Pr(w^∗ · P^∗ · q = 0). Since , we know that P^∗ · q ≠ 0. Combining the fact that elements in w^∗ are sampled independently and uniformly at random from ℤ_k, using the standard backward analysis, we can conclude that Pr(f · q = 0) ≤ 1/k.

Proof of Proposition 3.

We can easily verify that for every row p_i of and for every v ∈ V₀ we have that p_i · q_v = 0. Particularly, since , we have that P^∗ · q_v = 0. This implies that , for every v ∈ V₀. Hence, we have .

Proof of Proposition 4.

According to the rank-nullity theorem, we have that rank(P^∗). From [24], we have that rows in the optimal solution are linearly independent, i.e., rank(P^∗) = |P^∗|. Besides, we also have that , since all vectors in {q_v | v ∈ V₀} are linearly independent. Combining all these facts, we have that .

Proof of Proposition 5.

We first prove that Algorithm 1 returns a subset of . Let E₁ ⊆ E and E₂ ⊆ E be the two balanced subsets through tracing back from x ∈ D. We only need to show that E₁ and E₂ are indivisible. We first prove that E₁ and E₂ are disjoint. Suppose conversely that e ∈ E₁ ∩ E₂. Then we have B[|E |, x − f (e)] = 1, since there exist two subsets, namely E₁′ = E₁ \ {e} and E₂′ = E₂ \ {e}, such that f (E₁′) = f (E₂′) = x − f (e), a contradiction to the fact that x ∈ D.

Now we show that E₁ and E₂ are indivisible. Suppose conversely that there exist strict subsets E₁′ ⊂ E₁ and E₂′ ⊂ E₂ such that f (E₁′) = f (E₂′) = y. Let e_k be the edge with the highest index in E₁ ∪ E₂. Without loss of generality, we assume that e_k ∈ E₁ \ E₁^′. Let e_k′ be the edge with the highest index in E₁′ ∪ E₂′. We have that B[k′, y] = 1, because of the existence of E₁′ and E₂′. Following the recursion of B (line 4 of Algorithm 1), we also have that B[ j, y + ∑_k′_<i≤ _j f (e_i)] = 1, for all k′ < j ≤ k. This implies that B[k, x − f (e_k)] = 1, a contradiction to the fact that x ∈ D.

We can verify that computing A, B, D and tracing back for all elements from D takes O(|E | · | f |) time, while other steps take less than that. Thus, the total running time of Algorithm 1 is O(|E | · | f |).

Proof of Proposition 6.

On one side, if there exists V₁ ⊆ V₀ such that , then by definition, we have that .

Now we prove the other side. Assume that , i.e., . Let V₁ = {v ∈ V₀ | c_v ≠ 0}. Let G₁ = (V₁, E₁) be the subgraph of G induced by V₁, i.e., we define e ∈ E₁ if both ends of e is in V₁. We first prove that G₁ is weakly connected. Suppose conversely that G₁ contains k ≥ 2 connected components, . We can write q = ∑_1≤i≤k qⁱ, where . Notice that qⁱ contributes to distinct elements of q, i.e., for any edge , we have q[ j] = qⁱ[ j]. Since , we also have that . Further, we have that , since we have . This implies that the two balanced subsets corresponding to q can be split into k pairs of balanced subsets, namely, E_s(qⁱ) and E_t (qⁱ), 1 ≤ i ≤ k, a contradiction to the fact that .

We now show that c_v ∈ ℤ, ∀v ∈ V₁. Let E_s(V₁) = {(u, v) ∈ E | u ∈ V \V₁, v ∈ V₁}, and E_t (V₁) = {(u, v) ∈ E | u ∈ V₁, v ∈ V \V₁}, Let E_st (V₁) = E_s(V₁) ∪ E_t (V₁). Let V₂ ⊆ V₁ be the subset of vertices that are adjacent to some edge in E_st. For any edge e_j = (u, v) ∈ E₁, we have q[ j] = c_v − c_u. Following this formular, we have that c_v ∈ {−1, +1}, ∀v ∈ V₂, since edges in E_st are adjacent to only one vertex in V₁, which is in V₂. Since G′ is weakly connected and q[ j] ∈ {−1, 0, +1}, by propagating from V₂ with the above formular, we can conclude that c_v ∈ ℤ for all v ∈ V₁.

Consider any edge e_j = (u, v) ∈ E₁. If we assume that c_u ≥ 1, then from the above formular we have that c_v = c_u + q[ j] ≥ 0; and since c_v ≠ 0 and c_v ∈ ℤ, we have that c_v ≥ 1. Following this reasoning and the fact that G′ is weakly connected, through propagating we know that {c_v | v ∈ V₁} have the same sign. Without loss of generality, we assume that c_v ≥ 1, ∀v ∈ V₁. This implies that c_v = 1, ∀v ∈ V₂, and further implies that q[ j] = 1 for e_j ∈ E_s(V₁) and q[ j] = −1 for e_j ∈ E_t (V₁). Thus, we have that E_s(V₁) ⊆ E_s(q) and E_t (V₁) ⊆ E_t (q). According to flow conservation, we have that f (E_s(V₁)) = f (E_t (V₁)). Combining these two facts and the assumption that q is indivisible, we must have that E_s(V₁) = E_s(q) and E_t (V₁) = E_t (q). This consequently implies that V₁ = V₂. As we already have showed, c_v = 1, ∀v ∈ V₂, we have that .

Proof of Proposition 7.

Suppose that . According to Proposition 6, without loss of generality, we have that there exists a connected subset V₁ ⊆ V₀ such that . It is clear that q[ j] = 1 if and only if e_j ∈ E_s(V₁), and q[ j] = −1 if and only if e_j ∈ E_t (V₁), since for any edge e_k that are adjacent to two vertices in V₁ we have q[k] = 0. This yields that E_s(q) = E_s(V₁) and E_t (q) = E_t (V₁).

On the other side, suppose that there exists a subset V₁ ⊆ V₀ such that E_s(q) = E_s(V₁) and E_t (q) = E_t (V₁). We now prove that . In fact, exactly has value of +1 for edges in E_s(V₁) = E_s(q), and has value of −1 for edges in E_t (V₁) = E_t (q). This yields that .

Proof of Proposition 8.

According to the algorithm, we can easily verify that , from the fact that . We now show that . Suppose conversely that q^k is not indivisible, i.e., there exist strict subsets E₁ ⊂ E_s(q^k) and E₂ ⊂ E_t (q^k) such that f^k (E₁) = f^k (E₂). Let e_i ∈ E_s(q^k⁻¹) and e_j ∈ E_t (q^k⁻¹) be the two chosen edges in iteration k. Assume that f^k⁻¹(e_i) ≤ f ^k−1(e_j). This implies that e_i ∉ E_s(q^k). Let E₁′ = E₁ ∪ {e_i}. If we further have f ^k−1(e_i) = f^k⁻¹(e_j), we define E₂′ = E₂ ∪ {e_j}, otherwise, we define E₂′ = E₂. Clearly, we have that f^k⁻¹(E₁′) = f ^k−1(E₂′) according to the algorithm, and that E₁′ and E₂′ are strict subsets of E_s(q^k⁻¹) and E_t (q^k ⁻¹), respectively, following from the fact that E₁ and E₂ are strict subsets of E_s(q^k) and E_t (q^k), respectively. Thus, we conclude that q^k⁻¹ is not indivisible, a contradiction.

We then show that . Suppose conversely that . According to Proposition 6, there exists V₁ ⊆ V₀ such that E_s(q^k) = E_s(V₁) and E_t (q^k) = E_t (V₁). Let e_i ∈ E_s(q^k⁻¹) and e_j ∈ E_t (q^k⁻¹) be the two chosen edges in iteration k. Since , we know that either e_i ∈ E_s(V₁) and e_j ∈ E_t (V₁), or e_i ∉E_s(V₁) and e_j ∉ E_t (V₁). We further show the second case is not possible, since it immediately implies that q^k⁻¹ is not indivisible. Thus, we have that e_i ∈ E_s(V₁) and e_j ∈ E_t (V₁). If q^k is obtained from the first situation (line 3) of Algorithm 3, then this already means that . Otherwise (line 6), since we guarantee that the path p used in line 8 of Algorithm 3 does not contain any edge in E_s(q^k⁻¹) ∪ E_t (q^k⁻¹), we can still conclude that , a contradiction, as desired.

Proof of Proposition 9.

Since q^k is indivisible from Proposition 8, we know that during the k-th iteration, the two chosen edges e_i and e_j cannot have that f (e_i) = f (e_j), 1 ≤ k ≤ n − 1. According to Algorithm 3, we have that |E_s(q^k)| + |E_t (q^k)| = |E_s(q^k−¹)| + |E_t (q^k−¹)| − 1. After (n − 1) iterations, we must have that |E_s(q^k−¹)| = |E_t (q^k−¹)| = 1, and both the two subsets become empty after the n-th iteration. Thus, after exactly (|E_s(q)| + |E_t (q)| − 1) iterations, Algorithm 3 terminates.

Proof of Proposition 10.

We first prove that, for 1 ≤ k ≤ n − 1, we have that Δ^k ≤ Δ^k-1 Consider the change of number of edges and vertices during the k-th iteration. Let e_i and e_j be the two chosen edges. According to the proof of Proposition 9, we have that only e_i will be removed, and a new edge (namely, e_k in Algorithm 3) will be added. If it is the first situation (line 3 of Algorithm 3), then the middle vertex (namely, v₁ = u₂) cannot become isolated, since otherwise it implies that q^k⁻¹ is not indivisible. If it is the second situation (line 6 of Algorithm 3), some vertices in path p might be become isolated. But if this happens, then at least the same number of edges will also be removed in p. Thus, we always have that Δ^k ≤ Δ^k−1.

We now show that Δⁿ ≤ Δⁿ⁻¹. This is because, in the n-th iteration, both e_i and e_j are removed, and only one edge will be added. Similar to the above analysis, if some vertices become isolated, then at least the same number of additional edges will be removed. Thus, we have that Δⁿ ≤ Δⁿ⁻¹ − 1.

According to our analysis in the main text, the merging process narrows down the solution space, i.e., we have |P^k^∗| ≥ |P^(k−1)∗|, 1 ≤ k ≤ n. By combining these facts, we have that Δⁿ − |Pⁿ^*| ≤ Δ − |P*| − 1.

Proof of Proposition 11.

The running time of Algorithm 4 is dominated by lines 2 and 3. According to Proposition 9, in Algorithm 3, Algorithm 6 will be called at most |E | times. Thus, according to Proposition 14 and the fact that , the total running time is O(|V |² · |E |² · | f |).

Proof of Proposition 12.

Let (P, w) be any decomposition of (G, f). We can obtain a decomposition (P′, w′) of (G′, f) satisfying that |P′| = |P| and that w′ = w by only reversing the edges from u to v for each path in P. Since reverse operation is invertible, given a decomposition (P′, w′) of (G′, f), we can also obtain a decomposition (P, w) of (G, f) in the same way. This gives a one-to-one correspondence between the decompositions of (G, f) and that of (G′, f). And, in particular we have |P′^∗| = |P^∗|.

Proof of Proposition 13.

The correctness of the algorithm is the consequence of the following facts. First, if , then among the vertices in V (u, v), u and v are the smallest and the largest vertices w.r.t. any topological ordering. This is simply because all vertices in V (u, v) \ {u, v} can be reached from u and can reach v. Second, L and R always store the current smallest and largest vertices in S w.r.t. the topological ordering. Thus, by definition, vertices added to S in line 9 of Algorithm 5 are guaranteed in the closure of V₁. Third, Q always stores the vertices in S \ {L, R}. Thus, following the first fact, we know that vertices in Q cannot be the boundary vertices of the closure, and consequently, all their adjacent vertices (added in line 9) are guaranteed in the closure of V₁. Fourth, when the algorithm terminates (line 14), we have that S= V (L, R). In fact, let p be any path from L to R. From line 13, we know that the first vertex in p is in S. Then following lines 8 to 11, all vertices in p will be added to S. Finally, all vertices in S \ {L, R} are examined such that all their adjacent vertices are in S. Thus, by definition, we know that (L, R) is closed. Besides, since all vertices added to S are guaranteed to in the closure of V₁, we have that S is also the smallest subset, implying that (L, R) is the closure of V₁. This reason also proves that the closure of V₁ is unique. This algorithm is essentially the breadth-first search algorithm, in which edge each is examined at most once. Thus, its running time is O(|E |).

Proof of Proposition 14.

If Algorithm 6 finds w ∈ (S₁ ∩ T₂) ∪ (S₂ ∩ T₁), then clearly we can perform a series of operations to make e_i and e_j adjacent. On the other side, suppose that there exists a series of reverse operations such that in the resulting graph we have that e_i = (w₁, w) and e_j = (w, w₂). Let c₁ and c₂ be the closures of {w₁, w} and {w, w₂}, respectively. Clearly, all the reverse operations that moves e_i (resp. e_j) must operates on closures that are inside c₁ (resp. c₂). Thus, there must be a path from (u′₁, v′₁) (resp. (u′₂, v′₂)) to c₁ (resp. c₂) in , which implies that Algorithm 6 will have w ∈ (S₁ ∩ T₂) ∪ (S₂ ∩ T₁). This reasoning also implies that when Algorithm 6 traces back to recover the reverse operations (lines 8 and 9), the two paths are disjoint, since they are in two different subtrees (rooted at c₁ and c₂, respectively).

The running time of Algorithm 6 is dominated by building , which is O(|V |² · |E |).

Footnotes

↵* mingfu.shao{at}cs.cmu.edu
↵† carlk{at}cs.cmu.edu

References

[1].↵
Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10(1):57–63, 2009.
OpenUrl CrossRef PubMed Web of Science
[2].↵
C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. Van Baren, S.L. Salzberg, B.J. Wold, and L. Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated tran-scripts and isoform switching during cell differentiation. Nat. Biotechnol., 28(5):511–515, 2010.
OpenUrl CrossRef PubMed Web of Science
[3].↵
M. Guttman, M. Garber, J.Z. Levin, J. Donaghey, J. Robinson, X. Adiconis, L. Fan, et al. Ab initio re-construction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincrnas. Nat. Biotechnol., 28(5):503–510, 2010.
OpenUrl CrossRef PubMed Web of Science
[4].↵
W. Li, J. Feng, and T. Jiang. IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J. Comput. Biol., 18(11):1693–1707, 2011.
OpenUrl CrossRef PubMed
[5].↵
J.J. Li, C.-R. Jiang, J.B. Brown, H. Huang, and P.J. Bickel. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc. Natl. Acad. Sci. USA, 108(50):19867–19872, 2011.
OpenUrl Abstract/FREE Full Text
[6].↵
Y.- Y. Lin, P. Dao, F. Hach, M. Bakhshi, F. Mo, A. Lapuk, C. Collins, and S.C. Sahinalp. CLIIQ: Accurate comparative detection and quantification of expressed isoforms in a population. In Proc. 12th Workshop Algs. in Bioinf. (WABI’12), volume 7534 of Lecture Notes in Comp. Sci., pages 178–189, 2012.
[7].↵
W. Li and T. Jiang. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics, 28(22):2914–2921, 2012.
OpenUrl CrossRef PubMed Web of Science
[8].↵
J. Behr, A. Kahles, Y. Zhong, V.T. Sreedharan, P. Drewe, and G. Ratsch¨. MITIE: Simultane-ous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–2538, 2013.
OpenUrl CrossRef PubMed Web of Science
[9].↵
A.M. Mezlini, E.J.M. Smith, M. Fiume, O. Buske, G.L. Savich, et al. iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res., 23(3):519–529, 2013.
OpenUrl Abstract/FREE Full Text
[10].↵
A.I. Tomescu, A. Kuosmanen, R. Rizzi, and V. Mäkinen. A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinformatics, 14(5):1, 2013.
OpenUrl CrossRef PubMed
[11].↵
M. Pertea, G.M. Pertea, C.M. Antonescu, T.-C. Chang, J.T. Mendell, and S.L. Salzberg. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol., 33(3):290–295, 2015.
OpenUrl CrossRef PubMed
[12].↵
G. Robertson, J. Schein, R. Chiu, R. Corbett, M. Field, S.D. Jackman, K. Mungall, S. Lee, H.M. Okada, J.Q. Qian, et al. De novo assembly and analysis of RNA-seq data. Nat. Methods, 7(11):909–912, 2010.
OpenUrl CrossRef PubMed Web of Science
[13].↵
J. Martin, V.M. Bruno, Z. Fang, X. Meng, M. Blow, T. Zhang, G. Sherlock, M. Snyder, and Z. Wang. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics, 11(1):1, 2010.
OpenUrl CrossRef PubMed
[14].↵
M.G. Grabherr, B.J. Haas, M. Yassour, J.Z. Levin, D.A. Thompson, I. Amit, X. Adiconis, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol., 29(7):644, 2011.
OpenUrl CrossRef PubMed
[15].↵
Y. Xie, G. Wu, J. Tang, R. Luo, J. Patterson, S. Liu, et al. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics, 30(12):1660–1666, 2014.
OpenUrl CrossRef PubMed Web of Science
[16].↵
D.R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18(5):821–829, 2008.
OpenUrl Abstract/FREE Full Text
[17].↵
M.H. Schulz, D.R. Zerbino, M. Vingron, and E. Birney. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics, 28(8):1086–1092, 2012.
OpenUrl CrossRef PubMed Web of Science
[18].↵
T. Steijger, J.F. Abril, P.G. Engström, F. Kokocinski, T.J. Hubbard, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods, 10(12):1177–1184, 2013.
OpenUrl CrossRef PubMed Web of Science
[19].↵
K.E. Hayer, A. Pizarro, N.F. Lahens, J.B. Hogenesch, and G.R. Grant. Benchmark analysis of algo-rithms for determining and quantifying full-length mRNA splice forms from RNA-seq data. Bioinfor-matics, 31(24):3938–3945, 2015.
OpenUrl CrossRef PubMed
[20].↵
D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, S.L. Salzberg, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 14(4):R36, 2013.
OpenUrl CrossRef PubMed
[21].↵
A. Dobin, C.A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T.R. Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013.
OpenUrl CrossRef PubMed Web of Science
[22].↵
D. Kim, B. Langmead, and S.L. Salzberg. HISAT: a fast spliced aligner with low memory require-ments. Nat. Methods, 12(4):357–360, 2015.
OpenUrl CrossRef PubMed
[23].↵
M. Pop. Genome assembly reborn: recent computational challenges. Brief. Bioinform., 10(4):354–366, 2009.
OpenUrl CrossRef PubMed Web of Science
[24].↵
B. Vatinlen, F. Chauvet, P. Chrétienne, and P. Mahey. Simple bounds and greedy algorithms for de-composing a flow into a minimal set of paths. Eur. J. Oper. Res., 185(3):1390–1401, 2008.
OpenUrl
[25].↵
T. Hartman, A. Hassidim, H. Kaplan, D. Raz, and M. Segalov. How to split a flow? In Proc. IEEE INFOCOM 2012, pages 828–836, 2012.
[26].↵
B. Mumey, S. Shahmohammadi, K. McManus, and S. Yaw. Parity balancing path flow decomposition and routing. In Proc. IEEE Globecom Workshops, pages 1–6, 2015.
[27].↵
Y. Hendel and W. Kubiak. Decomposition of flow into paths to minimize their length. http://sites.google.com/site/yannhendel/Decomposition.pdf.
[28].↵
G. Baier, E. Köhler, and M. Skutella. On the k-splittable flow problem. In Proc. 10th Eur. Symp. Algs. (ESA’02), pages 101–113, 2002.
[29].↵
Georg Baier, E. Köhler, and M. Skutella. The k-splittable flow problem. Algorithmica, 42(3):231–248, 2005.
OpenUrl
[30].↵
R. Patro, G. Duggal, and C. Kingsford. Salmon: Accurate, versatile and ultrafast quantification from RNA-seq data using lightweight-alignment. bioRxiv, page 021592, 2015.
[31].↵
T. Griebel, B. Zacher, P. Ribeca, E. Raineri, V. Lacroix, R. Guigó, and M. Sammeth. Modelling and simulating generic RNA-seq experiments with the flux simulator. Nucleic Acids Res., 40(20):10073–10083, 2012.
OpenUrl CrossRef PubMed Web of Science

View the discussion thread.

Posted November 16, 2016.

Download PDF

Citation Tools

Subject Area

Bioinformatics

Subject Areas

All Articles

Animal Behavior and Cognition (5209)
Biochemistry (11730)
Bioengineering (8743)
Bioinformatics (29179)
Biophysics (14964)
Cancer Biology (12080)
Cell Biology (17399)
Clinical Trials (138)
Developmental Biology (9417)
Ecology (14174)
Epidemiology (2067)
Evolutionary Biology (18294)
Genetics (12233)
Genomics (16791)
Immunology (11858)
Microbiology (28051)
Molecular Biology (11575)
Neuroscience (60919)
Paleontology (451)
Pathology (1870)
Pharmacology and Toxicology (3238)
Physiology (4955)
Plant Biology (10422)
Scientific Communication and Education (1682)
Synthetic Biology (2881)
Systems Biology (7338)
Zoology (1650)

[1] [1].↵
Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10(1):57–63, 2009.
OpenUrl CrossRef PubMed Web of Science

[2] [2].↵
C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. Van Baren, S.L. Salzberg, B.J. Wold, and L. Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated tran-scripts and isoform switching during cell differentiation. Nat. Biotechnol., 28(5):511–515, 2010.
OpenUrl CrossRef PubMed Web of Science

[3] [3].↵
M. Guttman, M. Garber, J.Z. Levin, J. Donaghey, J. Robinson, X. Adiconis, L. Fan, et al. Ab initio re-construction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincrnas. Nat. Biotechnol., 28(5):503–510, 2010.
OpenUrl CrossRef PubMed Web of Science

[4] [4].↵
W. Li, J. Feng, and T. Jiang. IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J. Comput. Biol., 18(11):1693–1707, 2011.
OpenUrl CrossRef PubMed

[5] [5].↵
J.J. Li, C.-R. Jiang, J.B. Brown, H. Huang, and P.J. Bickel. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc. Natl. Acad. Sci. USA, 108(50):19867–19872, 2011.
OpenUrl Abstract/FREE Full Text

[6] [6].↵
Y.- Y. Lin, P. Dao, F. Hach, M. Bakhshi, F. Mo, A. Lapuk, C. Collins, and S.C. Sahinalp. CLIIQ: Accurate comparative detection and quantification of expressed isoforms in a population. In Proc. 12th Workshop Algs. in Bioinf. (WABI’12), volume 7534 of Lecture Notes in Comp. Sci., pages 178–189, 2012.

[7] [7].↵
W. Li and T. Jiang. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics, 28(22):2914–2921, 2012.
OpenUrl CrossRef PubMed Web of Science

[8] [8].↵
J. Behr, A. Kahles, Y. Zhong, V.T. Sreedharan, P. Drewe, and G. Ratsch¨. MITIE: Simultane-ous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–2538, 2013.
OpenUrl CrossRef PubMed Web of Science

[9] [9].↵
A.M. Mezlini, E.J.M. Smith, M. Fiume, O. Buske, G.L. Savich, et al. iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res., 23(3):519–529, 2013.
OpenUrl Abstract/FREE Full Text

[10] [10].↵
A.I. Tomescu, A. Kuosmanen, R. Rizzi, and V. Mäkinen. A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinformatics, 14(5):1, 2013.
OpenUrl CrossRef PubMed

[11] [11].↵
M. Pertea, G.M. Pertea, C.M. Antonescu, T.-C. Chang, J.T. Mendell, and S.L. Salzberg. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol., 33(3):290–295, 2015.
OpenUrl CrossRef PubMed

[12] [12].↵
G. Robertson, J. Schein, R. Chiu, R. Corbett, M. Field, S.D. Jackman, K. Mungall, S. Lee, H.M. Okada, J.Q. Qian, et al. De novo assembly and analysis of RNA-seq data. Nat. Methods, 7(11):909–912, 2010.
OpenUrl CrossRef PubMed Web of Science

[13] [13].↵
J. Martin, V.M. Bruno, Z. Fang, X. Meng, M. Blow, T. Zhang, G. Sherlock, M. Snyder, and Z. Wang. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics, 11(1):1, 2010.
OpenUrl CrossRef PubMed

[14] [14].↵
M.G. Grabherr, B.J. Haas, M. Yassour, J.Z. Levin, D.A. Thompson, I. Amit, X. Adiconis, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol., 29(7):644, 2011.
OpenUrl CrossRef PubMed

[15] [15].↵
Y. Xie, G. Wu, J. Tang, R. Luo, J. Patterson, S. Liu, et al. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics, 30(12):1660–1666, 2014.
OpenUrl CrossRef PubMed Web of Science

[16] [16].↵
D.R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18(5):821–829, 2008.
OpenUrl Abstract/FREE Full Text

[17] [17].↵
M.H. Schulz, D.R. Zerbino, M. Vingron, and E. Birney. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics, 28(8):1086–1092, 2012.
OpenUrl CrossRef PubMed Web of Science

[18] [18].↵
T. Steijger, J.F. Abril, P.G. Engström, F. Kokocinski, T.J. Hubbard, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods, 10(12):1177–1184, 2013.
OpenUrl CrossRef PubMed Web of Science

[19] [19].↵
K.E. Hayer, A. Pizarro, N.F. Lahens, J.B. Hogenesch, and G.R. Grant. Benchmark analysis of algo-rithms for determining and quantifying full-length mRNA splice forms from RNA-seq data. Bioinfor-matics, 31(24):3938–3945, 2015.
OpenUrl CrossRef PubMed

[20] [20].↵
D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, S.L. Salzberg, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 14(4):R36, 2013.
OpenUrl CrossRef PubMed

[21] [21].↵
A. Dobin, C.A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T.R. Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013.
OpenUrl CrossRef PubMed Web of Science

[22] [22].↵
D. Kim, B. Langmead, and S.L. Salzberg. HISAT: a fast spliced aligner with low memory require-ments. Nat. Methods, 12(4):357–360, 2015.
OpenUrl CrossRef PubMed

[23] [23].↵
M. Pop. Genome assembly reborn: recent computational challenges. Brief. Bioinform., 10(4):354–366, 2009.
OpenUrl CrossRef PubMed Web of Science

[24] [24].↵
B. Vatinlen, F. Chauvet, P. Chrétienne, and P. Mahey. Simple bounds and greedy algorithms for de-composing a flow into a minimal set of paths. Eur. J. Oper. Res., 185(3):1390–1401, 2008.
OpenUrl

[25] [25].↵
T. Hartman, A. Hassidim, H. Kaplan, D. Raz, and M. Segalov. How to split a flow? In Proc. IEEE INFOCOM 2012, pages 828–836, 2012.

[26] [26].↵
B. Mumey, S. Shahmohammadi, K. McManus, and S. Yaw. Parity balancing path flow decomposition and routing. In Proc. IEEE Globecom Workshops, pages 1–6, 2015.

[27] [27].↵
Y. Hendel and W. Kubiak. Decomposition of flow into paths to minimize their length. http://sites.google.com/site/yannhendel/Decomposition.pdf.

[28] [28].↵
G. Baier, E. Köhler, and M. Skutella. On the k-splittable flow problem. In Proc. 10th Eur. Symp. Algs. (ESA’02), pages 101–113, 2002.

[29] [29].↵
Georg Baier, E. Köhler, and M. Skutella. The k-splittable flow problem. Algorithmica, 42(3):231–248, 2005.
OpenUrl

[30] [30].↵
R. Patro, G. Duggal, and C. Kingsford. Salmon: Accurate, versatile and ultrafast quantification from RNA-seq data using lightweight-alignment. bioRxiv, page 021592, 2015.

[31] [31].↵
T. Griebel, B. Zacher, P. Ribeca, E. Raineri, V. Lacroix, R. Guigó, and M. Sammeth. Modelling and simulating generic RNA-seq experiments with the flux simulator. Nucleic Acids Res., 40(20):10073–10083, 2012.
OpenUrl CrossRef PubMed Web of Science