On bridging paired-end RNA-seq data

Motivation The widely-used high-throughput RNA-sequencing technologies (RNA-seq) usually produce paired-end reads. We explore if full fragments can be computationally reconstructed from the sequenced two ends—a problem here we refer to as bridging. Solving this problem provides longer, more informative RNA-seq reads, and hence benefits downstream RNA-seq analysis such as transcriptome assembly and expression quantification. However, bridging is a challenging and complicated task owing to alternative splicing, transcript noises, and sequencing errors. It remains unclear if the data itself provides sufficient information for accurate bridging, let alone proper models and efficient algorithms that characterize and determine the true bridges. Algorithmic Results We studied this problem in two settings: reference-based bridging, which assumes reads alignments are available and reconstructs the alignments of full fragments, and de novo bridging, which reconstructs sequences of entire fragments from sequences of the two ends. We proposed a novel mathematical formulation that works for both settings—to seek a path in an underlying graph data structure (i.e., splice graph for reference-based bridging, and compacted de Bruijn graph for de novo bridging) such that its bottleneck weight is maximized. This formulation characterizes true bridges and is efficient in filtering out false bridges. This formulation admits optimal substructure property, and hence efficient dynamic programming algorithms can be designed. For reference-based bridging, we designed such an algorithm to calculate the top N bridging paths, followed by a voting approach to select one using the distribution of fragment length. For de novo bridging, we designed a new truncated Dijkstra’s algorithm. To further speed up, we proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra’s algorithm from scratch for all vertices. These innovations result in scalable algorithms that can bridge all paired-end reads in a compacted de Bruijn graph with millions of vertices. Experimental Results We showed that paired-end RNA-seq reads can be accurately bridged to a large extend. Our reference-based bridging tool could correctly bridge more than 79.6% of reads. For de novo bridging, high precision was observed with varied sensitivity. We also showed that bridging can improve reference-based transcript assembly: the improvement was significant (up to 14.4% measured with adjusted precision), and universal in all combinations with different aligners and assemblers. Availability Implementations of the algorithms for reference-based and de novo bridging are available at https://github.com/Shao-Group/rnabridge-align and https://github.com/Shao-Group/rnabridge-denovo, respectively. Scripts, datasets, and documentations that can reproduce the experimental results in this manuscript are available at https://github.com/Shao-Group/rnabridge-test.

We explore computationally inferring full fragments from given paired-end RNA-seq reads, a problem we refer to as bridging. We study this problem in two settings, namely reference-based bridging and de novo bridging. For reference-based bridging, we assume that the alignments of the paired-end reads are avail-able (which can be obtained using RNA-seq aligners such as STAR and HISAT2), and we aim to reconstruct the alignments of entire fragments. For de novo bridging, we aim to reconstruct the sequences of the full fragments from the sequences of paired-end reads.
We believe solving this problem could benefit various downstream RNA-seq analysis such as transcript assembly, isoform-level expression quantification, and splicing quantification. The inferred full fragments likely contain more splicing junctions than individual reads, and hence provide additional long-range information that helps resolve more complicated splicing variants in transcript assembly. Longer sequences will be less ambiguously located to transcripts expressed from the same gene or homologous genes, and hence improves isoform quantification. The reconstructed full fragments may reveal missing junctions in the unsequenced portion, which may lead to more accurate estimation of junction abundance, and hence improves splicing quantification.
However, bridging is a challenging task. It remains unclear if the reads themselves contain sufficient information to enable accurate bridging. Due to the complicated mechanism of alternative splicing and the dynamic nature of transcription and mRNA degradation, different types of splicing junctions will be captured by RNA-seq reads. According to our experiments, in many gene loci, thousands of different splicing junctions can be observed (a majority of them are with low abundance though). For those loci the possible ways of bridging a pair of reads will be enormous, and it's unclear what signal can be used to determine the true bridge. Besides, sequencing errors and alignment errors also produce false bridges, making this task more challenging to solve.
We explore modeling above bridging problem as a graph problem. Alignments or sequencing reads can be organized by a graph data structure, for instances, the splice graph for alignments and de Bruijn graph (dBG) or compacted de Bruijn graph (cdBG) for sequencing reads [3]. Naturally, the sequenced two ends of a fragment can be mapped to the graph and then represented as a pair of paths (two lists of vertices or edges) in the graph. Therefore, the bridging task becomes to find a path that connects the two paths. (Our work follows this framework for bridging; see Section 2.) However, the resulting graphs are often erroneous due to transcript noises and sequencing/alignment errors. It remains open that what is a good characterization of the true path in bridging-in other words, what objective function should be in formulating bridging as a graph optimization problem. In de novo bridging, the resulting dBG or cdBG may contain millions of vertices. This urges scalable algorithms for de novo bridging.
Although many existing RNA-seq analysis tools use paired-end information, limited efforts have been made to directly reconstruct full fragments. Existing method for reference-based bridging includes MapPER [8].
MapPER implemented a probabilistic framework: starting from splicing junctions of all end reads, Map-PER constructed potential splicing paths connecting paired-end reads; an expectation maximization method then assigned likelihood values to all splice junctions and assigned the most probable alignment for each fragment. (However, we were not able to compare with MapPER; see Appendix.) To our best knowledge, no available tool for de novo RNA-seq bridging exists.

Algorithms
We formulate bridging as a new optimization problem (Section 2.1). We then design efficient algorithms for reference-based bridging and de novo bridging in Section 2.2 and Section 2.3 respectively.

Formulation
In either setting, the given RNA-seq data can be organized by a graph data structure, i.e., splice graph for reads alignments and compacted de Bruijn graph (cdBG) for sequencing reads (see Sections 2.2 and 2.3 for their constructions). Let G = (V, E) be the underlying graph. Let f be a fragment, for which we know its two sequenced ends. Each end of f can be represented as a path in G, and f can then be represented as a pair of paths in G. Note that multiple fragments may correspond to the same pair of paths in G; we cluster them into equivalent classes. Formally, an equivalent class is a pair of paths (p 1 , p 2 ) in G that represent all fragments with two ends being corresponding to p 1 and p 2 respectively. Let F = (p 1 = (a 1 , a 2 , · · · , a m ), p 2 = (b 1 , b 2 , · · · , b n )) be an equivalent class, where path p 1 is represented as a list of vertices (a 1 , · · · , a m ) in the graph (the same for p 2 ). The problem of bridging fragments in F becomes to find a path in G from a m to b 1 ; we call such path as a bridging path. We assume that, all fragments in an equivalent class have the same true bridging path. This is because fragments in an equivalent class are similar, as their two ends are mapped to exactly the same list of vertices in the graph. Algorithmically, this assumption allows us to reduce computational load, as all fragments can be bridged in a single run. In the case of reference-based bridging (see Section 2.2), this also enables a robust voting approach to select the best bridging path from a set of candidates by integrating the distribution of fragment length.
We explore what's a good formulation to find the best bridging path. The main signal we have is the abundances (i.e., the numbers of reads that support vertices/edges of the graph, modeled as weights of vertices/edges; see Sections 2.2 and 2.3). Intuitively, a bridging path supported by more reads are more likely to be the true bridge. We determined that, a formulation that seeks a bridging path with maximized bottleneck weight is appropriate for both reference-based bridging and de novo bridging, and performs well on experimental comparisons (see Results). Below we describe this formulation.
We define a full ordering of all bridging paths w.r.t. an equivalent class F = (p 1 = (a 1 , a 2 , · · · , a m ), p 2 = (b 1 , b 2 , · · · , b n )). Let q 1 and q 2 be two arbitrary paths from a m to b 1 in G. Let w i 1 (resp. w i 2 ) be the ith smallest weight in path q 1 (resp. q 2 ). We say q 1 is more reliable than q 2 , if there exists an integer k such that w i 1 = w i 2 for all 1 ≤ i < k, and w k 1 > w k 2 . We now formulate the bridging problem as to find the most reliable path. Intuitively, we seek a path q from a m to b 1 in G such that the smallest weight in this path is maximized, and in case there are multiple paths with maximized smallest weight, among them we seek the one whose second smallest weight is maximized, and so on.
We believe this formulation is appropriate for bridging paired-end RNA-seq reads. First, by maximizing bottleneck weight, the weakest point of the bridging path is supported strongest, which avoids being dominated by vertices/edges with large weights. Second, through this formulation, false paths coming from sequencing/alignment errors can be efficiently excluded, as they usually exhibits low bottleneck abundance.
We also explored formulations such as maximizing the sum of weights in a path but they didn't perform as well as this one.
We emphasize that this formulation satisfies the optimal substructure property. Specifically, if a m → v i 1 → v i 2 → · · · → v i k → b 1 is the most reliable path, then a m → v i 1 → v i 2 → · · · → v i k is the most reliable path from a m to v i k . This implies that polynomial-time dynamic programming algorithms can be designed to find the optimal solution.
We use this formulation for both reference-based bridging and de novo bridging. For reference-based bridging, instead of just picking the most reliable path, we design an efficient dynamic programming algorithm that finds top N paths, and then incorporate the fragment-length distribution to select the best one. For de novo bridging, we propose novel, scalable algorithms to identify the most reliable path in large cdBGs with millions of vertices.
The bottleneck weight (smallest weight of the optimal path) in this formulation can be used as a filtering criterion to decide if a bridging path is true. We set this threshold as 1 (i.e., don't filter) in reference-based bridging but explicitly use this parameter to balance sensitivity and precision in de novo bridging (see Table 3 and Table 4).

Constructing Splice Graph
The input for reference-based bridging is standard RNA-seq alignment in sam/bam format. We first group fragments into gene loci: reads that overlap on their alignment coordinates will be assigned to the same gene locus. We make sure that the two ends from the same fragment will be assigned to the same gene locus.
Each gene locus will be treated as independent instances and solved independently.
In each gene locus, junctions will be extracted from spliced reads. The collected set of splicing coordinates will be used to partition the reference genome into continuous segments, called partial exons. We use the well-known splice graph to represent the partial exons and junctions in a single gene locus (see Figure 1): each partial exon corresponds to a vertex, and each junction corresponds to a direct edge that connects its two corresponding exons. Two additional vertices, source s and sink t, are added and connected to possible starting and ending partial exons. The weight of an edge is calculated as the number of reads that contain the corresponding junction; the weight of a vertex is calculated as the average coverage of the corresponding partial exon.  Figure 1: Example of the reads alignment in a gene locus (upper part). Reference genome are portioned into partial exons (numbered from 1 to 6) using the splicing coordinates. Lower part shows the corresponding splice graph; the weight of each edge is given close to it.

Bridging Algorithm
Our core algorithm to bridge all fragments in an equivalent class F = (p 1 = (a 1 , a 2 , · · · , a m ), consists of two steps: nominating and voting. The nominating step computes the N most reliable paths from a m to b 1 in G (in our implementation, N is a parameter with default value of 10), and the voting step selects one through a consensus approach using the fragment length distribution.
We implemented a dynamic programming algorithm for nominating. Specifically, let (v 1 , v 2 , · · · , v n ) be a topological sorting of all vertices of the splice graph G (this is possible because splice graph is a directed acyclic graph). Given a particular v i , we can use a single run to find N most reliable paths from v i to v j for every j > i. To compute these N paths for v j , we examine all vertices v k that directly connects to v j , and compare all paths stored in these vertices (each v k already stores N most reliable paths from v i to v k at this time point). The best N of them will be kept and after concatenating v j they become the N most reliable We now describe the voting step. Let (q 1 , q 2 , · · · , q N ) be the N candidates for an equivalent class F. Assume that these N paths are sorted (i.e., q i is more reliable than q i+1 ). We now examine each fragment f in F.
For each candidate path q i , 1 ≤ i ≤ N, we calculate the actual sequence length of fragment f , denoted as | f i |, assuming that f is bridged with q i . This can be done because once the bridging path of f is given, the alignment of the entire fragment f is fixed and hence the length of its entire sequence can be calculated. If | f i | is within a reasonable range (precomputed by sampling 100,000 paired-end reads with unique bridging path, and an empirical distribution of fragment length is then estimated; the value from percentile 2 to percentile 98 will be used as the reasonable range), then fragment f votes for path q i ; otherwise, we try next best path (i.e., q i+1 ) and check whether | f i+1 | is within the reasonable range. In other words, each fragment votes for the best candidate that leads to a fragment length within the reasonable range. Candidate path with the largest number of votes will be accepted as the bridging path for the equivalent class, and all fragments in it will be bridged using this path. If none of the candidate paths receives any vote, then this equivalent class fails bridging.

Correcting Alignment Errors
We implemented a method to use paired-end information to correct one type of alignment errors while bridging. See Figure 2. Reads that span a junction but flank one side shortly are prone to alignment errors (left red read in Figure 2). Information of paired-end can be used to detect and correct such er- be an equivalent class that fails bridging. If the length of the aligned portion on a m is small (a parameter with default value of 10 base-pairs), we will then try to can be successfully bridged. If so, the corrected and bridged alignments will be reported. In experimental studies (see Table 1), we show that this method could rescue about 1.6-1.9% of the wrongly aligned reads.  6)). The only bridging path is (2, 3, 4, 5, 6), which is most likely beyond the reasonable range of fragment length. As the aligned portion on segment 2 of the first-end is short, we will shorten (1, 2) as (1) and then try to bridge it in equivalent class ( (1), (6)), for which there is an additional candidate (1, 5, 6), which likely gives this fragment a sequence length within the reasonable range.

Constructing Compacted de Bruijn Graph (cdBG)
We use cdBG to represent the given paired-end RNA-seq reads. See Figure 3. In the de Bruijn graph (dBG), each vertex represents a distinct k-mer, and its weight equals to the number of appearance this k-mer in the reads. The corresponding cdBG is defined as concatenating each simple path (i.e., every vertex in it except the first and the last one has in-degree of 1 and out-degree of 1) of the dBG as a single vertex (the resulting sequence is called a unitig). To comply with our formulation of finding the most reliable path, we assign the weight of each vertex in cdBG as the smallest weight of the corresponding simple path in dBG. In implementation, we directly construct the cdBG and calculate vertex weights, instead of explicitly constructing dBG as an intermediate. Specifically, we use library Bifrost [7] to build the cdBG. In order to assign vertex weights, we first build a hashing table that stores the frequency of each k-mer; we then examine all k-mers in this vertex (a unitig of length l contains l − k + 1 k-mers) and assign the smallest frequency as the weight of the vertex.

Bridging Algorithms
Following our formulation, we aim to find the most reliable bridging path for all equivalent classes. As the cdBG constructed from a typical RNA-seq sample may consist of millions vertices (see Table 5), it is simply not affordable to run a standard dynamic programming algorithm for every possible starting vertices (note that a single run of dynamic programming algorithm starting from a vertex will find the most reliable path connecting it to all other vertices). We propose two algorithmic innovations to address this main challenge in de novo bridging. First, we implemented a truncated Dijkstra's algorithm to find the most reliable bridging path starting from a starting vertex a m to any other vertex up to a certain length D (default value is 400 + 2L where L is read length). This is to incorporate the prior knowledge that fragment length follows an empirical distribution and an upper bound of fragment length can be assumed. In our truncated Dijkstra's algorithm, for each vertex v we maintain the total length of the most reliable path from the current starting a m to v, and when such length for v reaches D, we won't extend v in the Dijkstra's algorithm. Similar to the algorithm in reference-based bridging, a list that stores the smallest M weights also needs to be maintained. The second algorithmic innovation is that we reuse the optimal solutions obtained for the current starting vertex a m to construct the optimal solutions for next starting vertex a m . This allows us to avoid running above truncated Dijkstra's algorithm from scratch for all possible starting vertices. Specifically, for the current starting vertex a m , we maintain a shortest path tree, denoted as T (a m ), to store the most reliable paths from a m to all other vertices (up to length D), i.e., the unique path in T from root a m to any vertex v gives the most reliable path from a m to v in the cdBG. See Figure 4. This tree can be constructed in linear time with a tracing back procedure after running the truncated Dijkstra's algorithm. We note that, again according to the property of optimal substructures, any subtree of T (a m ), say the one rooted at u, also gives the most reliable path from u to any other vertex w in the subtree. This suggests we can reuse the subtree rooted at u to calculate all reliable paths starting from u (and then construct the corresponding shortest path tree). More specifically, in implementation we directly load the subtree rooted at u to the priority queue (recall that the core data structure of Dijkstra's algorithm is a priority queue that gets updated iteratively). In other words, for next starting vertex u we run the truncated Dijkstra's algorithm in the middle rather than from scratch, as we already know the optimal solutions for a subset of vertices (i.e., those in the subtree of T (a m ) rooted at u). To benefit from this property to the largest extent, we determine the starting vertex whose subtree is largest as the next one to bridge.

Resulting Tools
We developed a tool for reference-based bridging, available at https://github.com/Shao-Group/rnabridgealign. The input for this tool is the alignments of paired-end reads, in standard sam/bam format. The output is the alignments of entire fragments, again in sam/bam format. The algorithms for de novo bridging were implemented as another tool, available at https://github.com/Shao-Group/rnabridge-denovo. The input for it is sequencing reads in fasta format, and it generates sequences of full fragments again in fasta format.

Datasets
We use 2 datasets to evaluate the accuracy of bridging and its effectiveness in improving downstream RNAseq analysis. The first dataset includes 80 paired-end RNA-seq samples simulated using Flux-Simulator [5].
We vary two parameters in the simulation: the average length of fragments (300 and 500 were used) and the read length (75 and 100 were used). For each combination of parameters, we independently simulated 20 samples. The number of reads in samples with fragment length being 300 and 500 are roughly 150M and 90M, respectively. The second dataset was previously used in the Scallop paper [19]: it contains 10 biological RNA-seq samples.

Accuracy of Reference-based Bridging
We evaluate the bridging accuracy of rnabridge-align using simulation data, for which the ground-truth (i.e., full fragments) are available. We define a bridged fragment is correct only if it's exactly the same as the ground-truth fragment. As the input for rnabridge-align is the alignments produced by another aligner (we use STAR here), which may not be perfect, we evaluate the bridging accuracy separately on the fragments that are correctly and wrongly aligned.
The averaged accuracies for each simulation type are summarized in Table 1. Overall the bridging accuracy is high with 85.8-86.2% being correctly bridged when the average fragment length is 300. For fragment length of 500 the accuracy slightly drops to 79.6-81.2%. This is expected as longer fragments are harder to bridge and the coverage of these samples are lower. Notice that rnabridge-align also correctly bridges 1.6-1.9% fragments even when the paired-end alignments are wrong, thanks to the procedure in Section 2.2.3.

Improving Reference-based Transcript Assembly
Reference-based bridging infers the alignment of full fragments, which provides longer alignments with enhanced long-range information. Such information is particularly useful in assembling complicated splicing variants. Below, we evaluate the effectiveness of reference-based bridging in improving transcript assembly.

Pipeline
We follow the workflow illustrated in Figure 5. For each aligned RNA-seq sample, we compare directly assembling with first bridging using rnabridge-align followed by assembling the bridged alignments.

Evaluation Measures
Each method assembles a set of transcripts. We define an assembled transcript is known if its intron-chain coordinates exactly match those of an existing transcript in the reference transcriptome. We use a thirdparty tool, gffcompare [17], to compute the number of known transcripts (proportional to sensitivity), and precision (the ratio between the number of known transcripts and the total number of predicted transcripts).
Notice that one method might assemble more known transcripts but exhibit lower precision, or vise versa, on certain samples. To enable comparing in these cases, we first balance them in one measure and then compare the other one. Specifically, let k 1 and k 2 be the number of known transcripts and p 1 and p 2 be the precisions obtained by two methods X 1 and X 2 . Assume that k 1 > k 2 . We sort the assembled transcripts obtained by X 1 based on the expression abundance, and gradually filter out them starting from ones with low abundance. In this way, the known transcripts of X 1 will decrease while its precision will likely increase (as lowly-expressed ones are more likely to be false positive transcripts). We calculate the two measures for X 1 , denoted as k 1 and p 1 , as filtering goes, and stop when k 1 = k 2 . We then report p 1 at this time as the adjusted precision of X 1 . Note that the relationship between p 1 and p 2 can therefore reflect the overall performance of the two methods, as their numbers of known transcripts have been adjusted identical. Similarly, we can compute the adjusted number of known transcripts, through gradually filtering out transcripts with lowest abundance in the method with lower precision, until their (adjusted) precisions match. This way of adjusting can be generalized into multiple methods; in our case, we have 4 methods. Such adjusted measures have been used in evaluating assembly accuracy [19].

Results on Simulated Dataset (Dataset 1)
The average assembly accuracies of the 4 methods on dataset 1 are illustrated in Figure 6 and

Results on Biological Dataset (Dataset 2)
The average assembly accuracies over the 10 RNA-seq samples in dataset 2 are shown in Figure 8 and

Further Analysis
Transcript assembly has long been recognized as a very challenging but critical task. We think the improvement achieved by bridging RNA-seq reads is quite significant: up to 14.4% for adjusted precision and 10.1% for adjusted sensitivity. Although such increase varies, improvements are observed on all settings with different aligners, assemblers, and simulation parameters. This confirms that our bridging tool is robust in improving reference-based transcript assembly.
StringTie and Scallop exhibit differently on simulated and biological data. This has also been observed in other studies. One of the possible reasons is that Scallop has been tailored to identify false positives during assembly (in general biological RNA-seq data is more noisy than simulated data). Again, the focus here is to show the improvement when bridging is incorporated, regardless of downstream assembler used.
On biological samples, the adjusted precision (ranging from 30%-50%) is low comparing with simulated samples with fragment length being 300. This is because we use reference transcriptome as ground-truth (as for these biological RNA-seq data we don't know the true expressed transcripts). The current transcriptome has been regarded as incomplete and over-annotated. Therefore, the precision and known transcripts reported here underestimated the actual performance of all approaches. But these measures are fair for all methods in comparing their relative performances. The low precision for simulated samples with fragment length being 500 are likely due to that they are harder instances and that these samples are with low coverage.

Running Time of Reference-based Bridging
The running time of rnabridge-align on samples in dataset 2 is given in Table 2. This is faster or comparable to assemblers like StringTie or Scallop on most biological RNA-seq samples. Simulated samples are less noisy and rnabridge-align runs even faster on them.

Accuracy of de novo Bridging
We first evaluate the accuracy of rnabridge-denovo using simulation data, for which the ground-truth are available. We define a bridged fragment is correct only if it's exactly the same as the ground-truth fragment.
The sensitivity is defined as the number of correctly bridged fragments divided by the total number of fragments (i.e., paired-end reads), and precision is defined as the number of correctly bridged fragments divided by the total number of bridged fragments.
The results are summarized in Table 3. The bottleneck threshold used to filter bridges is set to 5 for all simulated samples. Overall our tool exhibits high accuracy. The accuracy drops with long fragment length.
This is expected as in this case the missing portion is longer and therefore harder to bridge. This may also be partly due to that the coverage of these simulated samples is low. Higher accuracy is observed when the read length is longer at the same fragment length, again as expected.
We then evaluate using the biological data in dataset 2. As we don't have ground-truth for them, we again use the sequences in the reference transcriptome to evaluate. We align all the bridged fragments to reference using BLAT [9]. We define a bridged fragment is correct only if it is hit by one of the reference sequences with at least 95% sequence identity. The sensitivity and precision is defined the same as in evaluating with simulation data.
The results are summarized in Table 4. Overall the precision keeps high but the sensitivity varies quite a lot.
This is because we prioritize precision by setting a high bottleneck-threshold: 20 for samples with less than 50M paired-end reads and 100 otherwise. Comparing with simulated data, biological data are more noisy and harder to bridge. together with the size of the resulting cdBGs on the 10 biological samples in Table 5. Note that module 3 takes the least CPU time among 3 modules, which proves the efficiency and scalability of the optimized core bridging algorithms.

Conclusion
We conclude that the formulation of finding the path with maximum bottleneck weight (breaking ties with second bottleneck weight, and so on) is an appropriate model for both reference-based bridging and de novo bridging. This formulation admits optimal substructure, a nice property that leads to polynomial-time, exact algorithms. The new techniques we introduced in de novo bridging, in particularly reusing shortest path tree to speed up, are likely of independent algorithmic interests in other applications involved in cdBGs.
On both simulated and biological data, the precision of bridged reads is high (around 80-90%), suggesting that our tools can be safely combined with other RNA-seq tools and pipelines. The sensitivity, although not consistent, is quite significant (above 50% for most cases). This suggests a positive answer to the question that whether the data itself contains enough information for bridging.
We also showed that our reference-based bridging tool could improve reference-based transcript assembly.
Such improvement is universal, regardless of involved aligner, assembler, or simulation parameters.
Both tools are modular, with input and output being standard sam/bam/fasta formats. Therefore they can be easily incorporated into existing RNA-seq analysis pipeline. We devote to reproducibility: the experimental results in this manuscript can be reproduced with the scripts available at GitHub.

Discussion
We explored if de novo bridging could improve de novo transcript assembly. To this end we piped the bridged fragments to one leading assembler TransLiG [14], but only observed marginal improvement. This may be because TransLiG is not optimized to make use of mixed short and long sequences. (On the contrary, some reference-based assemblers like Scallop and StringTie does take advantage of reads that thread multiple vertices to improve assembly accuracy, and hence benefit from bridged fragments.) Developing a new de novo assembler that can fully use such bridged data is on our research agenda. Experimenting if de novo bridging could improve isoform quantification and splicing quantification is also an interesting future research topic for us.
The sensitivity of rnabridge-denovo is low on some biological samples. One reason is that we use a high bottleneck-threshold to keep high precision, which consequently disconnects paired-end reads in lowcoverage gene loci. We are developing a post-bridging algorithm to use full-range information in the reads to decide if a bridged fragment is correct (rather than just using a bottleneck-threshold), in a hope of keeping high precision while improving sensitivity. Specifically, note that the cdBG is not a loss-free representation of sequencing reads, as it breaks reads into k-mers, and any phasing information beyond (k + 1)-mer is not represented. Let s be the bridged sequence of a fragment f constructed using above algorithm. We can examine each sliding window of length L, where L is the read length, and determine the number of input reads that are identical to this L-mer (i.e., these reads support this L-mer). This gives a supporting profile, a vector of length |s| − L + 1 for this bridged sequence. Intuitively, a profile with few 0s suggests that the bridged sequence is likely a true one, while a long consecutive 0s suggests a false bridge. We are experimenting if such supporting profile could lead to more efficient algorithms in boosting bridging accuracy.

Funding
This work is partly supported by the US National Science Foundation (DBI-2019797), by the National Institutes of Health (R01HG011065), and by the Charles K. Etner Early Career Professorship to M.S. awarded by The Pennsylvania State University.