Abstract
Motivation Third-generation sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle the problem by sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies assessing these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used.
Results In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research.
Availability The source code is available at https://github.com/haowenz/LRECE.
Contact aluru{at}cc.gatech.edu
1. Introduction
Third-generation sequencing technologies produce long reads with average length of 10 Kbp or more that are orders of magnitude longer than the short reads available through second-generation sequencing technologies (typically a few hundred bp). In fact, the longest read length reported to date is > 1 million bp (Sedlazeck et al., 2018). Longer lengths are attractive because they enable disambiguation of repetitive regions in a genome or a set of genomes. The impact of this valuable long-range information has already been demonstrated for de novo genome assembly (Loman et al., 2015; Chin et al., 2016; Jain et al., 2018), novel variant detection (Sedlazeck et al., 2017; Chaisson et al., 2015), RNA-seq analysis (Gordon et al., 2015), and epigenetics (Rand et al., 2017; Simpson et al., 2017).
The benefit of longer read lengths, however, comes with the major challenge of handling high error rates. Currently, there are two widely used third-generation sequencing platforms – Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Both sequencing platforms are similar in terms of their high error rates (ranging from 10-20%), with most errors occurring due to insertions or deletions (indels); however, the error distributions differ (Carneiro et al., 2012; Jain et al., 2015, 2018). PacBio sequencing errors appear to be randomly distributed over the sequence (Korlach and Biosciences, 2013). For ONT, on the other hand, the error profile has been reported to be biased. For example, A to T and T to A substitutions are less frequent than other substitutions, and indels tend to occur in homopolymer regions (Jain et al., 2015; Ashton et al., 2015). These error characteristics pose a challenge for long read data analyses, particularly for detecting correct read overlaps during genome assembly and variants at single base pair resolution, thus motivating the development of error correction methods.
Error correction algorithms are designed to identify and fix or remove sequencing errors, thereby benefiting resequencing or de novo sequencing analysis. In addition, the algorithms should be computationally efficient to handle increasing volumes of sequencing data, particularly in the case of large, complex genomes. Numerous error correction methodologies and software have been developed for short reads; we refer readers to Yang et al. (2012) and Alic et al. (2016) for a thorough review. Given the distinct characteristics of long reads, i.e., significantly higher error rates and lengths, specialized algorithms are needed to correct them. To date, several error correction tools for long reads have been developed, including PacBioToCA (Koren et al., 2012), LSC (Au et al., 2012), ECTools (Lee et al., 2014), LoRDEC (Salmela and Rivals, 2014), proovread (Hackl et al., 2014), NaS (Madoui et al., 2015), Nanocorr (Goodwin et al., 2015), Jabba (Miclotte et al., 2016), CoLoRMap (Haghshenas et al., 2016), LoRMA (Salmela et al., 2016), HALC (Bao and Lan, 2017), FLAS (Bao et al., 2017), FMLRC (Wang et al., 2018), HG-CoLoR (Morisse et al., 2018) and Hercules (Firtina et al., 2018).
In addition, error correction modules have been developed as part of long read de novo assembly pipelines, such as Canu (Koren et al., 2017) and HGAP (Chin et al., 2013). In the assembly pipeline, correction helps by increasing alignment identities of overlapping reads, which facilitates overlap detection and improves assembly. Many long read error correction tools require and make use of highly accurate short reads to correct long reads (accordingly referred to as hybrid methods). Others, referred to as non-hybrid methods, perform self-correction of long reads using overlap information among them.
A few review studies have compared these rapidly evolving error correction algorithms to assess the state of the art. Laehnemann et al. (2015) provide an introduction to error rates/profiles and a methodology overview of some correction tools for various short and long read sequencing platforms, although no benchmark is included. A review and benchmark for hybrid methods is also available (Mahmoud et al., 2017). However, that study used only simulated reads and focused more on speed than on correction accuracy; moreover, it does not include non-hybrid methods in the assessment. More recently, LRCstats (La et al., 2017) was developed for evaluating long read error correction software; however, it is restricted to benchmarking with simulated reads, and it does not provide a comprehensive evaluation of many of the current state-of-the-art correction tools.
While benchmarking with simulated reads is useful, it fails to convey performance in real-world scenarios. Besides the base-level errors (i.e., indels and substitutions), real sequencing data sets also contain larger structural errors, e.g., chimeras (Fichot and Norman, 2013). However, state-of-the-art simulators such as SimLoRD (Stöcker et al., 2016) only generate reads with base-level errors rather than structural errors. Furthermore, Miclotte et al. (2016) consistently observed worse performance when using real reads instead of simulated reads, suggesting that simulation may fail to match the characteristics of actual error distribution. Therefore, benchmarking with real datasets is important.
In this study, we establish benchmark datasets, present an evaluation methodology suitable to long reads, and carry out comprehensive evaluation of the quality and computational resource requirements of state-of-the-art long read correction software. We also study the effect of trimming and different sequencing depths on correction quality. To understand impact of error correction on downstream analysis, we perform assembly using corrected reads generated by various tools and assess quality of the resulting assemblies.
2. Overview of long read error correction methods
2.1 Hybrid methods
Hybrid methods take advantage of the high accuracy of short reads (error rates often < 1%) for correcting errors in long reads. An obvious requirement is that the same biological sample be sequenced using both short read and long read technologies. Based on how these methods make use of short reads, we further divide them into two categories: alignment-based and assembly-based. The first category includes Hercules, CoLoRMap, Nanocorr, NaS, proovread, LSC and PacBioToCA, whereas HG-CoLoR, HALC, Jabba, LoRDEC, and ECTools are in the latter. The ideas underlying the methods are summarized below.
2.1.1 Short-read-alignment-based methods
As a first step, these methods align short reads to long reads using a variety of aligners, e.g. BLAST (Altschul et al., 1990), Novoalign (http://www.novocraft.com/products/novoalign/). As long reads are usually error-prone, some alignments can be missed or biased. Thus, most of the tools in this category utilize various approaches to increase accuracy of alignments. Drawing upon the alignments, these methods use distinct approaches to generate corrected reads.
PacBioToCA
Consensus sequences for long reads are generated by multiple sequence alignment of short reads using AMOS consensus module (Pop et al., 2004).
LSC
Short reads and long reads are compressed using homopolymer compression (HC) transformation prior to alignment. Then error correction is performed at HC points, mismatches and indels by temporarily decompressing the aligned short reads and then generating consensus sequences. Finally, the corrected sequences are decompressed.
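The HC transformation described above can be sketched as follows; the function names and the round-trip structure are illustrative, not LSC's actual implementation:

```python
def hc_compress(seq):
    """Collapse each homopolymer run to a single base, keeping
    the run lengths so the original sequence can be restored."""
    bases, runs = [], []
    for b in seq:
        if bases and bases[-1] == b:
            runs[-1] += 1
        else:
            bases.append(b)
            runs.append(1)
    return "".join(bases), runs

def hc_decompress(compressed, runs):
    """Invert hc_compress by expanding each base to its run length."""
    return "".join(b * n for b, n in zip(compressed, runs))
```

For example, "AAACCGTT" compresses to "ACGT" with run lengths [3, 2, 1, 2]; alignment and correction are performed in the compressed space, and the result is then expanded back.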
proovread
Similar to PacBioToCA and LSC, short reads are mapped to long reads and then the resulting alignments are used to call consensus. But its alignment parameters are carefully selected and adapted to the PacBio sequencing error profile. To further improve correction, the phred quality score and Shannon entropy value are calculated at each nucleotide for quality control and chimera detection, respectively. To reduce run time, an iterative correction strategy is employed. Three pre-correction steps are performed using increasing subsamples of short reads. In each step, the long read regions are masked to reduce alignment search space once they are corrected and covered by a sufficient number of short read alignments. In the final step, all short reads are mapped to the unmasked regions to make corrections.
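The per-nucleotide Shannon entropy used for chimera detection can be illustrated with a minimal sketch; how proovread actually forms the alignment columns it scores is an assumption here:

```python
from collections import Counter
from math import log2

def shannon_entropy(column):
    """Entropy (in bits) of the base composition of short reads
    aligned at one position: a unanimous column has entropy 0,
    while a highly mixed column (potentially an unreliable or
    chimeric region) has higher entropy."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())
```

A unanimous column such as "AAAA" scores 0 bits, while an evenly split column such as "AACC" scores 1 bit.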
NaS
Like the other tools in this category, it first aligns short reads with long reads. However, only the stringently aligned short reads are kept as seed-reads. Then, instead of calling consensus, short reads similar to these seed-reads are retrieved. Micro-assemblies of these short reads are performed to generate contigs, which are regarded as corrected reads. In other words, the long reads are only used as templates to select seed-reads.
Nanocorr
It follows the same general approach as PacBioToCA and LSC, by aligning short reads to long reads and then calling consensus. But before the consensus step, a dynamic programming algorithm is utilized to select an optimal set of short read alignments that span each long read.
CoLoRMap
CoLoRMap does not directly call consensus. Instead, for each long read region, it runs a shortest path algorithm to construct a sequence of overlapping short reads aligned to that region with minimum edit distance. Subsequently, the region is corrected by the constructed sequence. In addition, for each uncovered region (called gap) on long reads, any unmapped reads with corresponding mapped mates are retrieved and assembled locally to fill the gap.
Hercules
It first aligns short reads to long reads. Then unlike other tools, Hercules uses a machine learning-based algorithm. It creates a profile Hidden Markov Model (pHMM) template for each long read and then learns posterior transition and emission probabilities. Finally, the pHMM is decoded to get the corrected reads.
2.1.2 Short-read-assembly-based methods
These methods first perform assembly with short reads, e.g., generate contigs using an existing assembler, or only build the de Bruijn graph (DBG) based on them. Then the long reads are aligned to the assemblies, i.e., contigs/unitigs or paths in the DBG, and corrected. Algorithms for different tools in this category are summarized below.
ECTools
First, unitigs are generated from short reads using any available assembler and aligned to long reads. Afterwards, the alignments are filtered to select a set of unitigs which provide the best cover for each long read. Finally, differences in bases between each long read and its corresponding unitigs are identified and corrected.
LoRDEC
Unlike ECTools which generates assemblies, LoRDEC only builds a DBG of short reads. Subsequently, it traverses paths in the DBG to correct erroneous regions within each long read. The regions are replaced by the respective optimal paths which are regarded as the corrected sequence.
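LoRDEC's notion of correctable regions rests on "solid" k-mers, those occurring frequently enough in the short reads to be trusted as DBG vertices. The following is a minimal sketch of this idea (parameter names and thresholds are illustrative, not LoRDEC's defaults):

```python
from collections import Counter

def solid_kmers(short_reads, k, min_count=2):
    """k-mers seen at least min_count times in the short reads;
    these form the vertices of the de Bruijn graph."""
    counts = Counter()
    for r in short_reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return {km for km, n in counts.items() if n >= min_count}

def weak_positions(long_read, solid, k):
    """Positions whose k-mers are absent from the solid set;
    contiguous runs delimit the erroneous regions to be replaced
    by an optimal DBG path between flanking solid k-mers."""
    return [i for i in range(len(long_read) - k + 1)
            if long_read[i:i + k] not in solid]
```

In LoRDEC proper, the weak regions are then bridged by searching for a best-scoring path in the DBG between the bounding solid k-mers.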
Jabba
It adopts a similar strategy as in LoRDEC, and builds a DBG of short reads followed by aligning long reads to the graph to correct them. The improvement is that Jabba employs a seed-and-extend strategy using maximal exact matches (MEMs) as seeds to accelerate the alignment.
HALC
Similar to ECTools, short reads are used to generate contigs as the first step. Unlike other methods which try to avoid ambiguous alignments (Koren et al., 2012; Yang et al., 2010), HALC aligns long reads to the contigs with a relatively low identity requirement, thus allowing long reads to align with their similar repeats which might not be their true genomic origin. Then long reads and contigs are split according to the alignments so that every aligned region on a read has its corresponding aligned contig region. A contig graph is constructed with the aligned contig regions as vertices. A weighted edge is added between two vertices if there are adjacent aligned long read regions supporting it; the more regions support the edge, the lower the weight assigned to it. Each long read is corrected by the path with minimum total weight in the graph. Furthermore, the corrected long read regions are refined by running LoRDEC if they are aligned to similar repeats.
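The minimum-total-weight path selection described above is a standard shortest-path computation; the following sketch uses Dijkstra's algorithm over a toy contig-region graph (the adjacency-list encoding is an assumption for illustration, not HALC's data structure):

```python
import heapq

def min_weight_path(graph, source, target):
    """Dijkstra's algorithm over a graph given as
    {vertex: [(neighbor, weight), ...]}. Returns the path of
    aligned contig regions (which yields the corrected sequence)
    and its total weight."""
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[target]
```

Because more supporting long read regions mean lower edge weights, the minimum-weight path corresponds to the best-supported sequence of contig regions.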
FMLRC
This software uses a DBG-based correction strategy similar to LoRDEC. However, the key difference in the algorithm is that it makes two passes of correction using DBGs with different k-mer sizes. The first pass does the majority of correction, while the second pass with a longer k-mer size corrects repetitive regions in the long reads. Note that a straightforward implementation of a DBG does not support dynamic adjustment of k-mer size. As a result, FMLRC uses an FM-index to implicitly represent DBGs of arbitrary k-mer sizes.
HG-CoLoR
Similar to FMLRC, it avoids using a fixed k-mer size for the de Bruijn graph. Accordingly, it relies on a variable-order de Bruijn graph structure (Kowalski et al., 2015). It also uses a seed-and-extend approach to align long reads to the graph. However, the seeds are found by aligning short reads to long reads rather than directly selecting them from the long reads.
2.2 Non-hybrid methods
These methods perform self-correction with long reads alone. They all contain a step to generate consensus sequences using pairwise alignment/overlap information. However, the respective methods vary in how they find the overlaps and generate consensus sequences. The details are as follows.
FLAS
It takes all-to-all long read overlaps computed using MECAT (Xiao et al., 2017) as input, and clusters the reads that are aligned with each other. In case of ambiguous instances, i.e., the clusters that share the same reads, FLAS evaluates the overlaps by computing alignments using sensitive alignment parameters either to augment the clusters or discard the incorrect overlaps. The refined alignments are then used to correct the reads. To achieve better accuracy, it also corrects errors in the uncorrected regions of the long reads. Accordingly, it constructs a string graph using the corrected regions of long reads, and aligns the uncorrected ones to the graph for further correction.
LoRMA
By gradually increasing the k-mer size, LoRMA iteratively constructs DBGs using k-mers from long reads exceeding a specified frequency threshold, and runs LoRDEC to correct errors based on the respective DBGs. After that, a set of reads similar to each read, termed friends, is selected using the final DBG, which should be more accurate after several rounds of correction. Then, each read is corrected by the consensus sequence generated from its friends.
Canu error correction module
As a first step during the correction process, Canu computes all-versus-all overlap information among the reads using a modified version of MHAP (Berlin et al., 2015). It uses a filtering mechanism during the correction to favor true overlaps over the false ones that occur due to repetitive segments in genomes. The filtering heuristic ensures that each read contributes to correction of no more than D other reads, where D is the expected sequencing depth. Finally, a consensus sequence is generated for each read using its best set of overlaps.
3. Materials and Methods
We selected data sets from recent publicly accessible genome sequencing experiments. For benchmarking the different programs, our experiments used genome sequences from multiple species and different sequencing platforms with recent chemistry, e.g., R9/R7 for ONT or P6-C4/P5-C3 for PacBio. We describe our evaluation criteria and use them for a comprehensive assessment of the correction methods/software.
3.1. Benchmark data sets
Our benchmark includes resequencing data from three reference genomes – Escherichia coli K-12 MG1655 (E. coli), Saccharomyces cerevisiae S288C (yeast), and Drosophila melanogaster ISO1 (fruit fly). The biggest hurdle when benchmarking with real data is the absence of ground truth (i.e., perfectly corrected reads). However, the availability of reference genomes of these strains enables us to evaluate the output of correction software in a reliable manner using the reference. Essentially, differences in a corrected read with respect to the reference imply uncorrected errors. A summary of the selected read data sets is listed in Table 1. We leveraged publicly available high coverage read data sets of the selected genomes available from all three platforms – Illumina (for short reads), PacBio, and ONT. In addition, some of these samples were sequenced using multiple protocols, yielding reads of varying quality. This enabled us to conduct a thorough comparison among error correction software across various error rates and error profiles.
Dataset D1-O1 is a recent MinION sequencing of the E. coli genome (Loman et al., 2015). Its 2D reads were also extracted from the raw reads using poretools (Loman and Quinlan, 2014), and were included in the benchmark as D1-O2. Note that raw reads are more erroneous than the 2D reads, which enabled the evaluation of the tools across different error rates. Giordano et al. (2017) recently released a bundle of PacBio, MinION, and MiSeq sequencing data of the yeast genome. For the same purpose, pass 2D reads and the combination of pass and fail 2D reads of the MinION data were downloaded and regarded as two separate data sets in our benchmark (D2-O1 and D2-O2).
To conduct performance evaluation under different sequencing depths, yeast sequencing reads (D2-P and D2-O1) were subsampled randomly using Seqtk (https://github.com/lh3/seqtk). Subsamples with average depth of 10x and 20x were generated for MinION reads. In addition, 10x, 20x, 30x, 60x and 90x PacBio read subsamples were generated from D2-P. Details of these subsamples are available in Supplementary Table 1.
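Random subsampling of this kind can be reproduced with seqtk itself, or approximated by a short script such as the following sketch, which keeps each FASTQ record with a given probability (the seed value and I/O handling are illustrative):

```python
import random

def subsample_fastq(in_path, out_path, fraction, seed=11):
    """Keep each read with probability `fraction`, in the spirit of
    `seqtk sample`; a fixed seed makes the subsample reproducible."""
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # FASTQ: 4 lines/read
            if not record[0]:
                break  # end of file
            if rng.random() < fraction:
                fout.writelines(record)
```

The target fraction is chosen so that (total sampled bases) / (genome size) matches the desired depth, e.g., 10x or 20x.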
3.2 Evaluation methodology
Our evaluation method takes uncorrected reads, corrected reads, and a reference genome as input. Both the uncorrected and corrected reads were filtered using a user-defined length threshold (default 500). Reads too short to include in downstream analysis were dropped during the filtration. Filtered reads were aligned to the reference genome using Minimap2 (Li, 2018). The majority of reads align to a single position in the reference, and the fraction of base pairs with ambiguous or split read mappings was found to be insignificant (Supplementary Table 2). This can be attributed to two reasons. First, the reads were sequenced from the same reference genome to which they are aligned. Second, as the reads are long (> 500 bp), the majority of base pairs map uniquely to the reference. As a result, we retain only the primary alignment for a read with multiple mappings or split alignments.
In an ideal scenario, an error correction software should take each erroneous long read and produce the error-free version of it, preserving each read and its full length. To assess how close to the ideal one can get, measures such as error rate post-correction or percentage of errors removed (termed gain; see Yang et al. (2012)) can be utilized. However, long read error correction programs do not operate in this fashion. They may completely discard some reads or choose to split an input read into multiple reads when the high error rate cannot be resolved. In addition, short-read-assembly-based error correction programs use long read alignments to de Bruijn graphs, and produce sequences corresponding to the aligned de Bruijn graph paths as output reads instead. Though original reads may not be fully preserved, all that matters for effective use of error correction software is that its output consists of a sufficient number of high quality long reads that reflect adequate read lengths, sequencing depth, and coverage of the genome. Accordingly, our evaluation methodology reflects such assessment.
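The gain measure of Yang et al. (2012), gain = (TP − FP) / (TP + FN), reduces to the fraction of errors removed when expressed in terms of error counts before and after correction; a one-line sketch:

```python
def gain(errors_before, errors_after):
    """Fraction of errors removed by correction. In Yang et al.'s
    notation, errors_before = TP + FN and errors_after = FP + FN,
    so this equals (TP - FP) / (TP + FN). Negative values mean the
    tool introduced more errors than it fixed."""
    return (errors_before - errors_after) / errors_before
```

For example, a tool that reduces 100 errors to 10 achieves a gain of 0.9.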
We measure the number of reads and total bases output by each error correction software, along with the number of aligned reads and total number of aligned bases extracted from alignment statistics, because together they reveal the effectiveness of correction. In addition, statistics that convey the read length distribution, such as maximum length and N50, were calculated to assess the effect of the correction process on read lengths. The fraction of the genome covered by output reads is also reported, to assess whether there are regions of the genome that lost coverage or suffered significant deterioration in coverage depth post-correction. Any significant drop in these metrics can be a potential sign of information loss during correction. Finally, alignment identity is calculated as the number of matched bases divided by the alignment length, averaged over all reads. Tools which achieve maximum alignment identity with minimum loss of information are desirable.
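The identity computation described above amounts to the following sketch; in practice the match counts and alignment lengths would be taken from the aligner's output (e.g., the matching-bases and alignment-block-length columns of Minimap2's PAF records):

```python
def alignment_identity(matches, alignment_length):
    """Per-read identity: matched bases over alignment length,
    where the alignment length also counts mismatch and indel
    columns, so indels lower the identity."""
    return matches / alignment_length

def mean_identity(alignments):
    """Average identity over all primary alignments, given an
    iterable of (matches, alignment_length) pairs."""
    pairs = list(alignments)
    return sum(alignment_identity(m, l) for m, l in pairs) / len(pairs)
```

For instance, a read with 95 matched bases in a 100-column alignment has identity 0.95.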
As part of this study, we provide an evaluation tool to automatically generate the evaluation statistics mentioned above. We also provide a wrapper script that can run state-of-the-art error correction software on a grid engine given any input data from the user. Using the scripts, two types of evaluations can be conducted. Users can either evaluate the performance of a list of tools on their own data to find a suitable tool for their studies, or run any correction tool on the benchmark data and compare it with other state-of-the-art tools.
4 Experimental results and discussion
4.1 Experimental setup
All tests were run on the Swarm cluster located at Georgia Institute of Technology. Each compute node in the cluster has dual Intel Xeon CPU E5-2680 v4 (2.40GHz) processors equipped with a total of 28 cores and 256GB main memory. The cluster runs a 64-bit Red Hat Linux kernel, version 2.6.32.
4.2. Evaluated software
We evaluated 15 long read error correction programs in this study: Hercules, HG-CoLoR, FMLRC, HALC, CoLoRMap, Jabba, Nanocorr, proovread, LoRDEC, ECTools, LSC, PacBioToCA, FLAS, LoRMA and the error correction module in Canu. NaS was not included in the evaluation because it requires the Newbler assembler, which is no longer available from 454. The command line parameters were chosen based on the user documentation of each software (Supplementary note section “Versions and configurations”). The tools were configured to run exclusively on a single compute node and allowed to leverage all 28 cores where multi-threading is supported. A wall-time cutoff of three days was imposed.
4.3. Performance on benchmark data sets
We evaluated the quality and computational resource requirements of each software on our benchmark data sets (Table 1). Results for the three different datasets are shown in Tables 2, 3 and 4, respectively. Because multiple factors are at play when considering accuracy, it is important to consider their collective influence in assessing quality of error correction. In what follows, we present a detailed and comparative discussion on correction accuracy, runtime and memory-usage. In addition, to guide error correction software users and future developers, we provide further insights into the strengths and limitations of various approaches that underpin the software. This includes evaluating their resilience to handle various sequencing depths, studying the effect of discarding or trimming input reads to gain higher accuracy, and impact on genome assembly.
4.3.1 Correction quality
We measure quality using the number of reads and total bases output in comparison with the input, the resulting alignment identity, the fraction of the genome covered, and the read length distribution including maximum size and N50 length. From Tables 2, 3 and 4, we gather that the best performing hybrid methods (e.g., FMLRC) are capable of correcting reads to achieve base-level accuracy in the high 90’s. For the E. coli and yeast data sets, many of these programs achieve alignment identity > 99%. A crucial aspect to consider here is whether the high accuracy is achieved while preserving input read depth and N50. A few tools (e.g., Jabba, proovread) attain high alignment identity at the cost of producing shorter reads and reduced depth because they choose to either discard uncorrected reads or trim the uncorrected regions. This may have a negative impact on downstream analyses. This trade-off is discussed further in Section 4.3.4.
Among the hybrid methods, a key observation is that short-read-assembly-based methods tend to show better performance than short-read-alignment-based methods. We offer the following explanation. Given that long reads are error-prone, short read alignment to long reads is more likely to be wrong (or ambiguous) than long read alignment to graph structures built using short reads. Errors in long reads can cause false positives in identifying the true positions at which the respective short reads should align, leading to false corrections later. For example, during the correction of D3-P, the alignment identity of corrected reads generated by CoLoRMap in fact decreased when compared to the uncorrected reads. The reason is that CoLoRMap uses BWA-MEM (Li, 2013) to map short reads, which is designed to report the best mapping. However, due to the high error rates, the best mapping is not necessarily the true mapping. The large volume of erroneous long reads in D3-P led to many false alignments, which affected the correction process. On the other hand, long read lengths make it possible to have higher confidence when aligning them to paths in the graph. Therefore, in most of the experiments, assembly-based methods were able to produce reasonable correction.
Non-hybrid correction is more challenging as it relies solely on overlaps between erroneous long reads, yet the tools in this category yield competitive accuracy in many cases. However, non-hybrid methods may significantly reduce read count and/or read lengths, and completely fail when the original long reads are highly erroneous. For example, neither Canu nor LoRMA was able to correct D1-O1 where average input identity is only 63.46%. FLAS also discarded most of the reads.
4.3.2 Runtime and memory usage
Scalability of the correction tools is an important aspect to consider in their evaluation. Slow speed or high memory usage makes it difficult to apply them to correct large data sets. Our results show that hybrid methods, in particular assembly-based methods, are much faster than the rest. For instance, PacBioToCA and LSC failed to generate corrected reads in three days for D1-P, while most of the assembly-based tools finished the same job in less than one hour. Nanocorr, ECTools and LSC were unable to finish the correction of D2-O2 in three days, a job that FMLRC and LoRDEC each finished in 30 minutes. Although proovread completed the corrections of D2-P, D2-O1 and D2-O2, its run-time was 49.3, 17.5 and 29.3 times longer, respectively, than that needed by FMLRC. Moreover, assembly-based methods, e.g., LoRDEC and FMLRC, used less memory in most of the experiments. Therefore, in terms of computational performance, users should give priority to assembly-based methods over short-read-alignment-based methods.
Among the non-hybrid methods, LoRMA generally had the highest memory usage of all the tools and was slower than the assembly-based methods. Canu, however, showed superior scalability. Owing to a fast long read overlap detection algorithm using MinHash (Berlin et al., 2015), Canu was able to compute long read overlaps and use them to correct the reads in a time comparable to most of the hybrid methods. The memory footprint of Canu was also lower than that of many hybrid methods. However, Canu did not finish the correction of D3-P in three days, probably because this data set is too large for pairwise overlap computation. FLAS showed performance comparable to Canu, as FLAS also leverages the fast overlap computation method in MECAT (Xiao et al., 2017).
4.3.3 Effect of long read sequencing depth on error correction
Requiring high sequencing coverage for effective error correction can impact both cost and time consumed during sequencing and analysis. The relative cost per base pair using third-generation sequencing is still several folds higher when compared to the latest Illumina sequencers (Sedlazeck et al., 2018). Accordingly, we study how varying long read sequencing depth affects correction quality, while keeping the short read data set fixed. We conducted this experiment using data sets D2-P and D2-O1 with various depth levels obtained using random sub-sampling. The output behavior of the correction tools is shown in Supplementary Tables 3-7.
For corrected reads generated by hybrid methods, no significant change on the metrics was observed except those generated by CoLoRMap. The alignment identity of its corrected reads increased with decreased sequencing depth. This observation is consistent with the experimental results reported by its authors. Similarly, CoLoRMap did not perform well on large data sets such as D3-P and D3-O as large data sets increase the risk of false positive alignments (discussed previously in Section 4.3.1).
On the other hand, the performance of non-hybrid methods deteriorated significantly when sequencing depth was decreased. As non-hybrid methods leverage overlap information to correct errors, they require sufficient long read coverage to make true corrections. The genome fraction covered by corrected reads produced by LoRMA decreased from 99.59% to 82.97% when sequencing depth dropped from 90x to 60x, and further decreased to 9.61%, 5.39% and 3.78% for 30x, 20x and 10x respectively, implying loss of many long reads after correction. The alignment identities were still greater than 99% for all subsamples because LoRMA trimmed the uncorrected regions. For corrected reads generated by Canu, no significant change in genome fraction was observed, but the alignment identity dropped from above 99% to 97.03% and 95.63% for 20x and 10x sequencing depths, respectively. FLAS showed similar performance, but its genome fraction for 10x sequencing depth was only 90.204%, which is lower than the 99.919% achieved by Canu.
4.3.4 Effect of discarding reads during correction
Many correction tools opt for discarding input reads, or regions within reads, that they fail to correct. As a result, the reported alignment identity is high (> 99%), but far fewer bases survive the correction. This effect is most pronounced in corrected reads generated by Jabba, proovread, ECTools, PacBioToCA and LoRMA. They either trim uncorrected regions at sequence ends, or even in the middle, to avoid errors in the final output, which eventually yields high alignment identity. However, aggressive trimming also makes the correction lossy and may influence downstream analysis, because long range information is lost if the reads are shortened or broken into smaller pieces. Therefore, users should be conservative in trimming and turn it off when necessary. One good practice is to keep the uncorrected regions and let downstream analysis tools perform the trimming, e.g., overlap-based trimming after read correction in Canu.
A direct implication of discarding or trimming reads is a change in read length distribution. Figures 1 and 2 show the original and corrected read length distributions. Among all the tools, Hercules, FMLRC, HALC, CoLoRMap and LoRDEC maintain a similar read length distribution after correction, whereas Nanocorr, Jabba, ECTools and proovread lost many long reads after correction due to their trimming step. Nanocorr drops a long read when no short read aligns to it. This procedure can remove many error-prone long reads, which leads to a higher alignment identity after correction. However, the fraction of discarded reads is significant in many cases. For example, a mere 1.5 million bp of cumulative sequence length survived out of the 245.7 million bp D1-O1 data set after correction. ECTools also generated only 10.7 million corrected bases on this data set. Canu changed the read length distribution significantly after correction, although for a different reason (Figure 1): Canu estimates the read length after correction and tries to keep the longest 40x of reads for subsequent assembly. FLAS kept most of the short reads while losing many of the long ones.
A few hybrid methods managed to generate enough corrected reads with relatively high alignment identity. Notably, FMLRC and HG-CoLoR substantially outperformed other tools on D1-O1 by producing high alignment identities of 98.98% and 99.56%, respectively, while maintaining long read lengths (Table 2, Figure 2). HG-CoLoR even generated one extremely long read of 73,992 bp, substantially longer than the longest read (43,624 bp) in D1-O1, perhaps due to its use of an assembly de Bruijn graph during the correction process.
4.3.5 Effect of error correction on genome assembly
We examine the effect of error correction on genome assembly, and evaluate whether the quality of error correction correlates well with the quality of genome assembly performed using the corrected reads. To do so, we computed genome assemblies using corrected PacBio and ONT 2D reads of E. coli, i.e., the corrected reads for D1-P and D1-O2. Assemblies were computed using Canu, and their quality was assessed using QUAST (Gurevich et al., 2013); results are shown in Table 5.
Considering the assemblies generated using corrected PacBio reads (D1-P), an NGA50 score of >3 million bases was obtained when using reads corrected by FLAS, Canu, FMLRC, Nanocorr, LoRDEC or ECTools. Surprisingly, the highest NGA50 was obtained using corrected reads generated by LoRDEC, even though the alignment identity of its corrected reads was lower than that of most tools. Similarly, the highest NGA50 for D1-O2 was achieved using corrected reads generated by Canu, although the alignment identity of those reads was only 93.32%. Therefore, higher alignment identity does not necessarily translate to a higher NGA50, i.e., a more contiguous assembly.
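For reference, an NGA50-style statistic can be computed from the lengths of aligned contig blocks (contigs broken at misassembly breakpoints) rather than raw contig lengths. The sketch below is our own simplified illustration with toy numbers, not QUAST's implementation:

```python
def ngx50(block_lengths, genome_size):
    """Largest length L such that blocks of length >= L together cover at
    least half the reference genome. With aligned block lengths this gives
    NGA50; with raw contig lengths it gives NG50."""
    target = genome_size / 2
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0  # the blocks cover less than half the genome

# Toy 10 kb genome with aligned blocks of 4, 3, 2 and 1 kb:
print(ngx50([4000, 3000, 2000, 1000], 10_000))  # -> 3000
```

Because NGA50 is normalized by the reference genome size rather than the assembly size, it penalizes assemblies that miss genomic regions, which makes it a stricter contiguity measure than N50.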
We also examined the frequency of mismatches and indels in the assemblies. For both data sets D1-P and D1-O2, corrected reads generated by HALC and ECTools produced assemblies with >500 mismatches, significantly more than the other tools, even though the alignment identity of their corrected reads was competitive with, or superior to, that of other tools. Notably, both HALC and ECTools use contigs assembled from short reads for error correction. Mis-assemblies of short reads, especially in repetitive and low-complexity regions, may cause false corrections, which lead to errors during assembly (Wang et al., 2018). Corrected reads produced by FMLRC yielded the fewest errors in assembly; meanwhile, its alignment identity was also the highest among the methods that avoid trimming. Therefore, higher alignment identity of corrected reads can lead to, but does not guarantee, fewer errors in genome assemblies.
Non-hybrid methods such as LoRMA and Canu produced more indels than mismatches in their assemblies, while most of the hybrid methods showed the opposite behavior. To investigate further, we visualized the alignments of corrected reads generated by Canu and FMLRC for D1-O2 in Supplementary Figures 1, 2 and 3. More indels were observed in the alignments of reads corrected by Canu than by FMLRC. Moreover, for D1-O2, indels mostly occurred in homopolymers, which is consistent with the ONT sequencing error profile. These observations suggest that self-correction methods are not as good at handling indels as hybrid methods.
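The mismatch/indel balance can be tallied directly from read or contig alignments. A minimal sketch, assuming alignments reported with extended CIGAR operations ('=' match, 'X' mismatch, 'I'/'D' indel); the example CIGAR string is illustrative:

```python
import re

def error_counts(cigar):
    """Return (mismatch bases, indel bases, alignment identity) parsed from
    an extended CIGAR string using '=', 'X', 'I' and 'D' operations."""
    matches = mismatches = indels = 0
    for length, op in re.findall(r"(\d+)([=XID])", cigar):
        n = int(length)
        if op == "=":
            matches += n
        elif op == "X":
            mismatches += n
        else:  # 'I' or 'D'
            indels += n
    identity = matches / (matches + mismatches + indels)
    return mismatches, indels, identity

# Toy alignment: mostly matches, 2 mismatches, a 3 bp deletion, a 1 bp insertion
mm, ind, ident = error_counts("50=2X30=3D20=1I10=")
print(mm, ind, round(ident, 3))  # 2 4 0.948
```

Aggregating such counts over all alignments separates the two error classes, which matters because indels are the harder class for assembly polishing.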
5 Conclusions and Future Directions
In this work, we established benchmark data sets and evaluation methods for a comprehensive assessment of long read error correction software. Our results suggest that hybrid methods, aided by accurate short reads, can achieve better correction quality than non-hybrid methods, especially when handling long reads of low coverage depth. Among the hybrid methods, assembly-based methods are superior to alignment-based methods in terms of scalability to large data sets. Moreover, better correction performance, such as preserving a higher proportion of input bases or achieving higher alignment identity, may lead to, but does not guarantee, better results in downstream applications such as genome assembly. Tools with superior correction performance should be further tested in the context of the intended application to determine which are best suited for it.
Users can also select tools based on our experimental results and their specific requirements. When speed is a concern, assembly-based hybrid methods are preferred, as long as short reads are available. In addition, hybrid methods are less sensitive to low sequencing depth than non-hybrid methods; thus, users are advised to choose hybrid methods when sequencing depth is relatively low. In cases where indel errors may seriously harm downstream analyses, hybrid methods should also be preferred over non-hybrid ones.
FMLRC outperformed the other hybrid methods in almost all experiments. Among the non-hybrid methods, Canu and FLAS performed better than LoRMA. Hence, these three are recommended as defaults when users want to avoid laborious testing of all the error correction tools.
For future work, better self-correction algorithms are needed to avoid hybrid sequencing, thereby reducing the experimental labor of short read sequencing. In addition, most correction algorithms run for days to correct errors in the sequencing data of even moderately large and complex genomes such as the fruit fly, and thus become a bottleneck in sequencing data analysis. More efficient or parallel correction algorithms should therefore be developed to ease the computational burden. Furthermore, except for CoLoRMap, none of the hybrid tools makes use of paired-end information during correction, and even in CoLoRMap the use of paired-end reads did not significantly improve correction performance according to previous studies. Paired-end reads have already been used to resolve repeats and remove entanglements in de Bruijn graphs (Bankevich et al., 2012). Since many error correction tools build de Bruijn graphs to correct long reads, paired-end information may also be able to improve error correction.
Most published error correction tools focus on correcting long DNA reads sequenced from a single genome, which also served as the motivation for our review. Long read sequencing is increasingly gaining traction in transcriptomics and metagenomics applications. Whether the existing tools can be leveraged or extended to work effectively in such scenarios remains unclear and is an active area of research (de Lima et al., 2018).
Funding
This work is supported in part by the National Science Foundation under CCF-1718479.
Key Points
Despite the high error rate of long reads, the state-of-the-art correction tools achieve high correction accuracy and throughput.
The best hybrid methods show better performance than non-hybrid methods in terms of correction quality and computing resource usage.
Several correction tools discard reads or trim regions that they fail to correct; practitioners should be mindful of this behavior.
Evaluation of long read error correction should also consider its effect on downstream analysis, since better correction quality does not always imply better accuracy in downstream analysis.
Biographical Note
Haowen Zhang and Chirag Jain are PhD students in the School of Computational Science and Engineering at the Georgia Institute of Technology. Srinivas Aluru, PhD, is a Professor in the School of Computational Science and Engineering and Co-Executive Director of the Institute for Data Engineering and Science at the Georgia Institute of Technology. He is a Fellow of the AAAS and IEEE.