ABSTRACT
The high computational cost of current assembly methods for the long, noisy single molecular sequencing (SMS) reads has prevented them from assembling large genomes. We introduce an ultra-fast alignment method based on a novel global alignment score. For large human SMS data, our method is 7X faster than MHAP for pairwise alignment and 15X faster than BLASR for reference mapping. We develop a Mapping, Error Correction and de novo Assembly Tool (MECAT) by integrating our new alignment and error correction methods, with the Celera Assembler. MECAT is capable of producing high quality de novo assembly of large genome from SMS reads with low computational cost. MECAT produces reference-quality assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and reconstructs the human CHM1 genome with 15% longer NG50 in only 7600 CPU core hours using 54X SMS reads and a Chinese Han genome in 19200 CPU core hours using 102X SMS reads.
Introduction
Determining the genome sequence of a species or an individual in a population is one of the most important tasks in genomics1–6. De novo assembly is a process that reconstructs the genome from sequencing reads without a reference genome7–10. While technical advances in next generation sequencing (NGS) have enabled to assemble a genome in significantly lower cost and higher throughput comparing to the first-generation Sanger sequencing11, two inherited drawbacks make assembly of a genome from NGS short reads difficult12–14. First, NGS reads are only few hundreds base pair in length, which are shorter than the lengths of most repetitive sequences in either microbial or eukaryotic genomes15–17. Second, PCR amplification in library preparation causes sequencing biases, resulting in some sequence contexts, such as GC-rich regions, not being sequenced12,17. Both drawbacks lead to incomplete, fragmented assemblies18. The recently emerged third generation single molecular sequencing (SMS) technologies19, such as PacBio single molecule real time (SMRT)20,21 and Oxford Nanopore21–26, posses two distinguishing characteristics, namely, the long read length and the unbiased sequencing17,27–29, which can overcome the deficiencies of NGS17,27–29. These two properties together may help better resolve repeats and biased region, and thus obtain high-quality de novo genome assemblies21,30–33.
The SMRT and Nanopore reads usually have high error rates34–37. For example, the error rate of PacBio SMRT reads is generally 13-18%35,38. However the errors of SMRT are random and dominated by point insertions and deletions34 with no preference on particular genome regions29. Both theoretical and practical studies have shown that the SMRT reads can be corrected with high accuracy provided the sequencing coverage is high enough21. Therefore, a “correction then assembly” approach has been used by assemble pipelines, such as PBcR35, FALCON39 and HGAP21, for single molecular sequencing reads. In those pipelines, raw noisy reads are first corrected and then fed into an overlap graph based assembler, such as the Celera Assembler3,35. Previous practices have demonstrated that the “correction then assembly” approach can reconstruct highly continuous and accurate genome assemblies35,39.
Although SMRT sequencing have already been widely used to reconstruct small bacteria and archaea genomes, assembling middle or large size genomes from SMRT reads have suffered from high computational cost in the correction step of “correction then assembly” pipelines3,38–40. In PBcR-MHAP pipeline, about 84% computational time is used in read correction step. Recently, with new algorithm advancing, the assembly of D. melanogaster from SMRT reads takes only 1060 CPU core hours using PBcR-MHAP, which is dramatically reduced from 631, 000 CPU hours using original PBcR3,41. However, it still take a very long time for pipelines, such as PBcR-MHAP and FALCON, to assemble large genomes42. For example, it costs 260, 000 CPU core hours for PBcR-MHAP and FALCON to complete a human genome from 54X raw SMRT sequences3,42,43.
The high computational cost of SMRT assemble pipelines is mainly due to the all-pair alignment step to determine overlaps between read pairs for the correction. There are two sub-steps in the all-pair alignment step. First, the k-mer mapping based approach is used to identify candidate matched read pair. Then, local alignment is used to determine final matched read pair. Due to highly repetitive nature of biology genomes44,45, reads sampled from repetitive regions can lead to a high number of k-mer matches42–44, which lead to a lot of excessive44 candidate pairs. Meanwhile, the local alignment of two long noisy reads is also slow, even with linear local alignment program, like diff46. The local alignment of excessive candidate pairs is the major computational time waste in read correction.
To reduce the excessive matched reads after k-mer mapping, and then reduce the total computational time, we present a novel alignment filtering algorithm based on fast global k-mer scoring. Our filtering algorithm is inspired by the observation that the frequency of repeat subsequences decreases dramatically with their size. Thus, if we can find long matched read pairs, then these alignments can be non-repetitive matches with high confidence. We develop a novel k-mer seed score that is correlated with the overlap size between two reads, and then is able to represent the global matching information. For each read, we can select top matched reads according to their k-mer seed score for the further read correction. Noted, selecting top matched reads based on number of matched k-mers may lead to many non-informative matches since the repetitive region has more matched k-mers. Our global k-mer scoring algorithm allows us to dramatically reduce the number of non-informative matched read pairs, as well as selecting smaller number of informative matched reads for the read correction; both can lead to the significantly reducing of computational cost.
We have also presented a new SMS read error correction method by combining counting-based method and local partial order graph, which can achieve high correction accuracy and high correction speed simultaneously. With those new algorithms, we develop an ultra-fast Mapping, Error Correction and de novo Assembly Tool (MECAT) for SMRT reads. MECAT achieves superior computing efficiency to current assembly pipelines. In particular, MECAT takes only about 7600 CPU core hours to assemble a high quality human CHM1 genome using 54x SMRT data47 (CHM1) on a single 32-threads computing node with 2.0 GHz CPU, which is 34 times faster than the current PBcR-MHAP pipeline3. The MECAT makes it possible to de novo assemble large genome using SMRT reads with the similar computational cost as that the assembling of NGS reads needs.
Results
Alignment filtering in MECAT
The initial step of our alignment is also finding the candidate alignments by mapping the k-mers of two blocks3,40,48 with size of 1,000 to 2,000 bp. Two blocks are considered matched if the number of matched k-mer beyond a predefined threshold. We find a candidate alignment between two SMS reads or between a SMS read and reference genome if there is at least one block pair between them matched. The k-mer matching based method can filter out random pairs and quickly find seed alignments with high sensitivity40. However, a read often aligns to many other reads or many locations in the genome due to highly repetitive nature of genomes42,44. Local alignments are needed to find the good matched reads or best matched genome locations48. However, the computational cost for local alignments between two long SMS reads or between a SMS read and reference genome is high40,46. Meanwhile, most of SMS applications, such as SMS read correction and reference genome mapping, only need limited number of matches48–50. To quickly select high quality candidate alignments for the further local alignments, we develop a new pseudo linear global scoring algorithm to filter candidate alignments (Figure 1). Our algorithm works by scoring matched k-mer pair in two steps using distance difference factors (DDF). First, we mutually score the k-mer pair in a selected matched block pair. The k-mer pair with max score is selected as the seed, and then the seed k-mer pair is scored by the matched k-mer pairs in other matched block pairs. The score of the seed k-mer pair is supported by all informative matched k-mer pairs and their interval distance when there is a good alignment between them. Thus, our scoring algorithm integrates the global matching information between two SMS reads or between a SMS read and reference genome. Figure 2a shows that the scores of seed k-mer pairs between SMS pairs grow linearly with their overlapping lengths in four SMS data sets. Therefore, by selecting SMS read pairs with high scores, we can filter out the non-informative candidate alignments. After filtering by global scoring, we have reduced 50% to 70% candidate alignments for further local alignment (Figure 2b). And this makes the alignment tool 2-3 times faster than those without global scoring filtering. The candidate alignments are then further filtered by local alignment using diff program40,46.
Pairwise alignment performance of MECAT
To evaluate the performance of MECAT in pairwise alignment of SMS reads, we first compare memory usage and computational time cost of MECAT to those of three widely used SMS read alignment tools, MHAP3, BLASR48 and DALIGN40. We evaluate the alignment tools using five real datasets47. As shown in Table 1, the MECAT is faster than all other aligners, except it is slightly slower than DALIGNER40 on the E. coli dataset, which is due to computation cost of pre-processing procedure in MECAT. For large human genome, MECAT is 7 times faster than the second best aligner, MHAP-fast, and 15 times faster than DALIGNER. Meanwhile, MECAT uses similar amount of memory that DALIGNER uses, which is only about 1/10 of the amount of memory used by MHAP. Thus, MECAT has used only small amount of memory to achieve fast pairwise alignment.
Next, we evaluate the sensitivity and precision of aligners on three simulated datasets, including a 20x coverage E. coli, a 20x coverage yeast and a 5x coverage human chr1 datasets (Table 2). Since we knew the beginning and ending position of each read in reference genome in the simulated datasets, we can calculate the true pairwise overlap relationships between all reads. The sensitivity of an aligner indicates its ability to identify true overlaps and the precision of an aligner indicates the correctness of identified overlaps. The sensitivities of DALIGNER40 are the best among four aligners, but its precisions are the lowest. On the other hand, the BLASR48 and MHAP3 have high precisions, but low sensitivities. Meanwhile, MECAT maintains high precision as well as high sensitivity at the same time. The sensitivities of MECAT are consistently higher than those of both BLASR and MHAP while maintaining similar precisions. Comparing to DALIGNER, MECAT has higher precisions, but lower sensitivities. The precision and sensitivity of DALIGNER become extremely unbalanced for human chr1 data, with only 9.1% precision. These results show that MECAT has achieved a good balance between sensitivity and precision for both small and large genomes.
Reference genome alignment performance of MECAT
The MECAT can also be used to align SMS reads to the reference genome. We evaluate MECAT with two popular SMS reads reference genome aligners, BLASR48 and BWA-mem49. We have first compared the aligning time cost using five real datasets (Table 3). For four small genomes (E.coli, Yeast, Arabidopsis, fly)47, MECAT is 40 to 85 times faster than BLASR and 20 to 83 faster than BWA-mem. For human genome, MECAT is 15 times faster than BLASR and 5 times faster than BWA-mem. Then, we have compared the sensitivities, precisions and coverages of aligners using 20X simulated SMS data of E. coli, yeast and human genomes (Table 4)51, in which we know the read positions on the genomes. Comparing to BLASR48 and BWA-mem49, MECAT maps slightly less amount of reads to the reference genome, but it map more reads correctly for all three data sets. As a result, MECAT has higher sensitivities, precisions as well as coverages for all three data sets. The results show that MECAT can align SMS reads to the reference genome ultra-fast and maintain high sensitivity and precision. In five real datasets, the mapping overlap rates of three algorithms are as high as 95-99% of the same alignment positions (Supplementary Note 6 and Supplementary Figure 1-5), showing high confidence of MECAT alignment.
Error Correction Performance of MECAT
Due to high error rates of SMS reads, error correction is an indispensable step when they are used for genome assembly34–37. Currently, FC_Consensus, DAGCon35 and FalconSense3 are three most widely used SMS read error correction methods. DAGCon35 represents a multiple read alignment for each read to be corrected as a partial order graph (POG) and find the correct consensus sequence using dynamic programming. DAGCon is accurate but very slow. On the other hand, FalconSense simply corrects the template sequence by counting consistent base alignments. FalconSense is fast but less accuracy. The accuracy of SMS reads is important for genome assembly35,38. Here, we develop a new SMS read error correction method by combining the principles from both DAGCon and FalconSense. For regions with consistent matches/deletions without insertion (trivial regions), we use counting-based method. And for other complicate regions, we construct a local POG and solve it with dynamic programming. As the complicate regions are generally less than 10 bases, the local POGs are very small and can be solved very fast.
The first step of read correction is to perform the alignments between the template and the relative reads, which need random access the storage that stores the reads. Both DAGCon and FalconSense store the read in hard driver, which does not support random access. The slow read loading process in DAGCon and FalconSense lead to only 20% CPU usage. To accelerate the correcting process, we load all reads into memory, which supports random access. We encode each base using 2 bits. Thus, the memory occupation of MECAT is about 1/4 of the total read size. Loading reads to memory makes the CPU usage of MECAT over 96%.
We have compared MECAT error correction to FalconSense in PBcR-MHAP3 and Canu, as well as FC_Consenses in FALCON using four real datasets. Table 5 shows that the correction running speed from MECAT error correction is 5∼20 times faster than FalconSense and 3∼10 times faster than FC_Consense for four datasets. To evaluating the accuracy of corrected long reads, all corrected long reads were mapping into reference genome by dnadiff program52 (Table 5 and Supplementary Note 8). The mapping results show that the accuracies of reads corrected by MECAT are always higher than 99% and are the best in three of four datasets. Specially, for D. melanogaster dataset, the accuracies of other tree method are less than 99%, while the accuracy of MECAT is as high as 99.26% (Table 6 and Supplementary Note 8).
Assembly performance of MECAT
Generally, there are three steps in the assembly of genome using SMS reads: overlapping SMS reads to selected template reads; correcting the selected reads and constructing the contig using corrected reads3,35,38. To test the efficiency and effectiveness of MECAT aligner and error correction in genome assembly, we develop two genome assembly pipelines. In first pipeline, the SMS reads are overlapped and corrected by MECAT, and then fed into Celera Assembler (CA). We call the first pipeline MECAT-CA. In the second pipeline, which is called the MECAT, the SMS reads are also first overlapped and corrected by MECAT. Then, the corrected reads are overlapped by MECAT aligner again and the overlap graph is fed into the “Unitig Construction” module of Canu42 (v1.0) to construct the contigs. In both pipelines, we do not perform local alignment using diff program during SMS reads overlapping. We only select the top mapped reads using global scores and feed the mapping information to error correction step. We compare two pipelines to other three SMS assembly pipelines, PBcR-MHAP3, FALCON and Canu. We evaluate assembly pipelines using previously released whole genome SMRT reads of five genomes: E. coli K12, S. cerevisiae W303, D. melanogaster ISO1, Arabidopsis thaliana Ler-0 and the complete hydatidiform mole CHM147. All the genome assemblies are polished by Quiver53 to correct sequencing errors.
Table 7 lists the running time of five pipelines on the same computer. We evaluate total assembly time as well as the running time for read overlap, error correction and contig construction separately. For small E. coli K12 genome, the MECAT-CA and MECAT are 3.7 to 5.0, 3.9 to 5.4 and 2.2 to 2.9 times faster than FALCON, PBcR-MHAP and Canu, respectively. For another small S. cerevisiae W303 genome, the MECAT-CA and MECAT are 1.5 to 2.1, 3.2 to 4.3 and 3.9 to 5.3 times faster than FALCON, PBcR-MHAP and Canu, respectively. For medium D. melanogaster ISO1, the MECAT-CA and MECAT are 7.32 to 14.7, 5.3 to 10.5 and 3.6 to 7.2 times faster than FALCON, PBcR-MHAP and Canu, respectively. For another medium Arabidopsis thaliana Ler-0 genome, the MECAT-CA and MECAT are 15.5 to 20.9, 13.1 to 17.6 and 8.2 to 11.1 times faster than FALCON, PBcR-MHAP and Canu, respectively. For large human genome, we are not able to run other assembly pipelines on our single computer, thus we compare our running time to the results of previous paper3. Our MECAT-CA is 5.2 and 12 times faster than PBcR-MHAP-fast and PBcR-MHAP-sensitive, and MECAT shows remarkable 24.9 times speedup than PBcR-MHAP-fast and 56.4 times speedup than PBcR-MHAP-sensitive. The larger the size of genome is, the greater the speedups of MECAT are.
As shown in Table 7, the overlapping and correcting steps are the most time consuming steps among the three assembly steps of PBcR-MHAP, FALCON and Canu. And the speedups of MECAT-CA and MECAT are mostly coming from the efficient of MECAT aligner and error correction in these two steps. For small or medium size genomes, the running times of three steps in MECAT-CA pipeline are similar. However, the running time of contig construction became the bottleneck comparing to other two steps for large genome. For human genome, the running time of contig construction step are 3.5 and 39 times longer than those of overlapping and correcting steps in MECAT-CA pipeline. Meanwhile, for human genome, the running time of contig construction of MECAT is only 7.5% of that of MECAT-CA, which make contig construction step not a bottleneck in MECAT, in which MECAT pairwise alignment replaces OverlapIncore program in CA.
We further examine the quality of assemblies of five pipelines using four measures: assembly size, NG503,36, number of contigs and the average number of contigs >200 bps per chromosome (ctg/chr) 3,36 (Table 8). In all five compared genomes, MECAT-CA and MECAT obtain comparable or improved assemblies. For E. coli K12, both MECAT-CA and MECAT recover the complete genome with just 1 contig. For S. cerevisiae W30354, both MECAT-CA and MECAT report close to perfect continuity with only 22 and 21 contigs, respectively. However, the MECAT obtain best NG50s with 100% assembly performance (even better than that of reference assembly S228C55), while MECAT-CA only report 89% assembly performance, which is similar to the results of PBcR-MHAP and Canu and better than the result of FALCON. For Arabidopsis thaliana Ler-0, MECAT reports only 56 contigs with better than reference assembly NG50 (100% assembly performance), which is much higher than those NG50 reported by other four pipelines with assembly performance from 68% of FALCON to 85% of PBcR-MHAP. For D. melanogaster, the MECAT also reports the highest NG50 with 82% assembly performance while it reports less total assembly size. For human CHM1, the assembly size reported by MECAT-CA is the closest to the size of human Ref 38 genome56, but it reported the largest number of contigs and largest ctg/chr ratio. Meanwhile, MECAT reported slightly better assembly size than those of PBcR-MHAP, but less than that of MECAT-CA. The NG50 reported by MECAT is 15% longer than those of PBcR-MHAP sensitive and MECAT-CA. And the number of contigs and ctg/chr reported by MECAT is similar to those of PBcR-MHAP sensitive, much less than those of MECAT-CA and PBcR-MHAP fast. Given its ability that MECAT can assemble the human CHM117 genome in less than 10 days on a single 32-thread computer with comparable assembly quality, it can be an ultra-fast tool to assemble large genomes using SMS reads.
Validation analysis of assembly
We further validate the assemblies by comparing them to the reference genomes. Since MECAT is always faster and obtain better or comparable assembly performance, we only compared the assembled results of MECAT to those of PBcR-MHAH, Canu and FALCON. We first map the assemblies of E coli, Yeast, Arabidopsis thaliana and D. melanogaster to reference genomes using Nucmer57 (Supplementary Note 10), and then evaluate the mapping results using GAGE scripts58. Among four assemblies, only assemblies of E coli and D. melanogaster are generated from SMS read sampled from the same strains of reference genomes. All four assemblies are structurally consistent with reference genomes except some minor structural variation (Supplementary Figs. 6-22). Supplementary Table 3 provides GAGE58 accuracy metrics for these assemblies. With all discrepancies between assembly and reference genome sequence being counted as error, the assemblies reported by MECAT are still at least 99.99% accuracy (QV=403) compared to the reference genomes. We also align four assemblies before and after Quiver53 polishing onto reference genome using dnadiff program52, and count the single-nucleotide polymorphisms (SNPs) and big indels (>10bps). The numbers of SNP and indels in assemblies reported by all four pipelines are similar, especially after Quiver polishing (Supplemental Table 3). We further map all 17294 annotated genes of D. melanogaster59,60 onto the assemblies. We identify a total of 16972, 17044, 17055 and 16839 genes mapped to a single contig in a single alignment from assemblies of PBcR-MHAP, FALCON, Canu and MECAT, respectively, while 16944, 17020, 17037 and 16812 genes of these have over 99% identity. The results show that the qualities of assemblies from MECAT are comparable to those from other pipelines.
Solving the repeat regions is the most import task in genome assembly. We evaluate four assemblies of D. melanogaster by comparing the completeness of transposable element (TE) families61 (Supplementary Note 12 and Supplemental Table 4). In all 5,433 annotated TEs from the flybase3. MECAT assembly contains 5301 (97.6%) TEs, in which 5141 (94.6%) aligned perfectly to the reference. Meanwhile, PBcR-MHAP, FALCON, Canu assemblies contain 5274, 5306 and 5319, respectively. And 4984, 5190 and 5165 of them are aligned perfectly. We have examined two TE families, roo and juan, in detail. In MECAT assembly, 131 of 138 copies in roo TE family are aligned. Of these, 123 are perfectly aligned. For 11 elements of the juan family, all are perfectly confirmed. Those results are similar to the assemblies of other three pipelines (Supplementary Table 4). The TE analysis in D. melanogaster demonstrates that MECAT is capable of reconstructing TE repeats sequences accurately.
To further evaluate the ability of MECAT that reconstructs the repeat regions of genomes, we have examined the telomeric repeats in S. cerevisiae assembly of MECAT (Supplementary Note 13 and Supplementary Table 5). We are able to map telomeric repeats of 15 chromosomes to assembled contigs. Among them, seven chromosomes assembled in a single contig have at least 50% of terminal telomeric repeats mapped on both ends. For chromosomes contains more than one contig, we have mapped the two end telomeric repeats onto two contigs of six chromosomes and one end telomeric repeats onto one contig of one chromosome. There are telomeric repeats of two chromosomes cannot be mapped onto any contig, which may due to either assembly error or different strains. Our results are similar to those obtained from assemblies of PBcR-MHAP, FALCON and Canu (Supplemental Table 3-5).
De novo assembly of a human diploid genome
Finally, to demonstrate MECAT in de novo assembly of large genome, we have assembled 102x SMRT sequencing reads from a diploid Han Chinese genome using MECAT on a 32-core computer. It takes only 25 days to finish the whole assembly. The Han assembly is submitted and assessed by NCBI (https://www.ncbi.nlm.nih.gov/assembly/GCA_001856745.1/). We compare our Han assembly to another Han Chinese genome assembly from BGI, YH1, which is assembled from Illumina data (http://yh.genomics.org.cn/). The total size of our Han assembly is 2,908,568,123bp, which is much longer than the 2.2G bp of YH1. The NG50 of Han assembly is 8,583,694 and is 2263 times longer than NG50 of YH1. The Han assembly only has 4456 contigs and is only 0.6% of total number of contigs of YH1. Furthermore, the Han assembly has six contigs with size greater than 40M. The length of longest contig of Han is 66M and is much greater than the 0.9M of YH1. Moreover, the Han assembly shows much better continuity of genome comparing to the YH1 assembly (Figure 3). Thus, our assembly can be a better reference genome for Han Chinese and it also show that SMRT sequencing can significantly improve de novo assembly quality and integrity.
We also map the Han assembly onto hg19 human reference genome using Nummer software and the dotplot figure (Supplementary Note 14 and Supplementary Fig. 25) shows that our assembly is structurally consistent with hg19 genome except for some minor structural variation. Furthermore, we have aligned the Illumina datasets from YH1 onto Han, YH1 assemblies and hg19 human reference genome using bowtie250 (Supplementary Note 14 and Supplementary Fig. 24). The Han assembly gains the best mapping rate (83.81%) comparing to the mapping rates to YH1 (73.06%) and hg19 (82.05%). This result validates that Han assembly is a better reference genome.
To find structural variations between Han Chinese genome and European genome, we have mapped both Han and YH1 assemblies onto hg19 human reference genome. We find 29836 structural different genome region (≥ 20bp) in Han assembly and most of them (65%) can also be identified in YH1 assembly (Supplementary Table 6). We have also compared Han assembly to the Korean genome62. The result shows that the Han assembly is structurally consistent with Korean genome except minor structural variants (Supplementary Figs. 28). We have further compared MHC region of Han assembly with those of Korean and hg19 genomes. The result (Supplementary table 6) shows that Han and Korean assemblies have 157 and 147 variants (>10bp) comparing to hg19 and 50 of them are the same site. Among two recently validated max. variants (54896bp and 10286bp) in MHC region of Korean 62, only one variant (10286bp) has been found in Han assembly. This results show that there are structural different between the MHC regions in Han Chinese and Korean although 33% of those regions are the same.
Discussion
The repeat regions in the genome lead to the higher frequency of k-mer mapping between SMS reads. When k-mer matching is used to filter random pairs, all-pair alignment may produce excessive candidate pairs, and local alignment of excessive candidate pairs consume the major computational time in SMS read correction. Thus, reducing the redundant repetitive k-mer matches is the key to reduce excessive candidate pairs, and then computational cost. However, completely masking low-complexisty sequence or ignoring highly repetitive k-mer may lead to the lost of some correct overlaps42. Recently, the Canu pipeline employed a tf-idf k-mer weighting method to reduce the effects of repetitive k-mer matches. However, even with k-mer weighting, the MinHap algorithm in Canu only report the local matched k-mers pairs without considering the arrangement of k-mer pairs, which may still lead to excessive matches. In BLASR, the best arrangement of k-mer pairs is solved by slow sparse dynamic programing. Here, our scoring algorithm considers matched k-mer pairs between two reads as well as intervals between matched k-mer pairs. Furthermore, the repetitive matched k-mer pairs are removed from scoring, which reduces the effect of small repetitive region in read overlapping. Our algorithm provides a heuristic global alignment score between two reads, which is more sensitive to the true overlap. One proven of this is that the MECAT can be used to align SMS reads to reference genome with high sensitivity and precision similar to BLASR
Another benefit of our global alignment score is selecting the top informative matched reads for a give read template. Since the top informative matched reads selected by the global score are so reliable, we even do not need to perform local alignment using diff program46 to further filter them, which has reduced the computational cost for the whole read correction step.
The alignment tool in MECAT can also be plugged into other pipelines. For example, the Celera Assembler (CA35), an overlap-layout-consensus (OLC) based assembler, need the overlap length between the reads to obtain high quality assemblies. Since our alignment score is correlated with the overlap size between two reads. Thus, we can replace the overlapInCore in Canu42 (v1.0) (a new version of CA for SMS reads), which uses a slow blast-like algorithm for computing overlaps of corrected reads, with alignment tool in MECAT. This allows us to dramatically reduce the computational time for contig construction.
With the new alignment algorithm as well as the improved read correction method, we are able to develop a new assembler pipeline, MECAT, which is capable to produce high quality de novo assembly of large genome from long noisy SMRT reads with low computational cost. Our experimental results show that MECAT can assemble a high quality human genome assembly using 54X SMRT reads in only 7600 CPU core hours and a high quality Chinese Han human assembly using 102X SMRT reads in 19200 CPU core hours, which is ten times faster than current fastest assembler. The MECAT makes it possible to assembly large genomes on a single server computer and small clusters using SMRT reads.
Currently, the structurally divergent alleles are not considered in the MECAT pipeline. One of our future works to improve MECAT is designing new algorithm to distinguish structurally divergent alleles, and then make MECAT able to assemble polyploidy genomes. In this paper, we focus on assembling genomes using PacBio SMRT reads. Since Oxford Nanopore reads have similar characteristics as SMRT reads, we will evaluate the applicability of MECAT to Nanopore reads in the future.
Methods
Indexing and matching of reads
The finding of potential matching between reads is based on the matching of k-mers (substrings with length of k) in the reads. A read r of length L has total L-k+1 k-mers. We first index the reads using a hash table with the k-mers as key. We consider the overlapping k-mers between the blocks of reads. For each read, we break it into multiple blocks with each block of length B, which is usually be 1,000 bp or 2,000 bp. The values in the hash table are the position of k-mers in the blocks of reads.
To search for the matching reads, we scan the k-mers in blocks of reads and look up the matchings in the hash table. We break the reads into blocks of same length B. In order to reduce the computer time, we only sample the k-mers in each searching block. We slide a k-sized window along each block with a step length of sl. Thus, the number of k-mers in the search is only about 1/sl of the number of total k-mers from reads. A typical value of sl is 10. A searching block is matched to an indexing block if the number of their overlapping k-mers is greater than a predefined threshold m. Two reads are considered as matched if at least a pair of blocks is matched between them.
Given two read blocks of length B, the number of k-mers sampled from the search block is (B/sl-1). Let O be the overlapping length of a pair of matched blocks, O≤B. The expected number of matched k-mers in O is3 where, Prandom is the probability of a random k-mer and Pmatch is the probability that two k-mers are matched. As the block length B is fixed, for a given error rate and no repetitive sequence, the number of matched k-mers between two blocks grows with the overlapping length O. For a highly matched block pair, Pmatch >>Prandom. The expected number can roughly be estimated as:
Filtering false matched reads using global score
We develop a new pseudo linear global scoring algorithm to filter the excessive, non-informative matched reads. Our scoring algorithm has two steps. The first is the mutual scoring step. For each matched read pair, we first randomly select a matched block pair and mark it. Then, we score the matched k-mer pairs in this matched block pair. Let (pi, pj) be the positions of i-th and j-th k-mer in one block and (p′i, p′j) be the position of i-th and j-th k-mer in another block of matched pair. We define the distance difference factor (DDFi,j) between i-th and j-th k-mer as following:
If DFi,j<ε, which indicates that the both k-mers supporting each other, we increase the scores of both k-mers by 1. The ε is set to 0.3 in our experiments. By calculating the DF between all possible pair of k-mers, we obtained scores for all overlapping k-mers of matched blocks. We only use the non-repetitive k-mer pairs in our scoring. If a k-mer is matched more than once in the same block, it will be excluded from scoring. If the score of k-mer with highest score is significant (greater than a threshold), we set it as the seed position for future alignment. If there are multiple k-mers having the same score, we randomly select one as seed.
The second is the extension scoring step. In order to increase the reliability of seed and reduce the computation of whole scoring process, we extend the scoring process from selected block pair to its neighbor matched block pairs if a seed k-mer is obtained. For each overlapping k-mer in neighbor block pair, we calculate the DF between the k-mer and the seed k-mer in original block pair. If DF<ε, we increase the score of seed k-mer by 1. If 80% of DF values of overlapping k-mers in a neighbor block pair satisfy DF<ε, we mark the block and do not score the k-mers in this block pair in the future. After one loop of mutual and extension scoring process, if there are still matched block pairs having not been marked, we continue the scoring process on those block pairs. The mutual scoring is done in O(N2) time and the extension scoring is done in O(N) time, where N is the number of k-mer matches As the number of k-mers in mutual scoring is small, the overall scoring process can be done in a pseudo linear time.
Aligning SMS reads
The pair alignment of SMS reads is the first step for correcting reads and assembling genome using SMS reads. We only consider two SMS reads have more than 2000bp overlapped. Namely, we set the block length to 2000bp. After scoring the matched k-mers between two SMS reads, we sort the k-mers based on their scores. Then, we use the top ranked k-mers as seeds to perform local alignment of two reads using diff algorithm40,46. If the overlapped length between two SMS reads is longer than 2000 bp and the mismatch rate of overlapped sequence is less than twice of SMS read error rate, we consider there is a match and output the align results. All detail parameters are described in Supplementary Note 2.
Aligning SMS reads to a reference genome
The procedure of aligning SMS reads to a reference genome is similar to those of aligning two reads. We index the reference genome sequence and search the reads from the index table. We first break the reference genome into blocks with length of B and index the k-mers in each block. Then, we break the reads into blocks of same length B and sample the k-mers for looking up the index table. The matched k-mers between a read and reference genome are also scored. The top ranked k-mers are used as seeds to perform further local alignment using diff.
In order to obtain high sensitivity of the alignment of SMS reads to a reference genome and keep the computational cost low, we present a two-steps approach. In first step, we use the block length B of 1000bp and k-mer sampling step length sl of 20 to align reads to reference genome. As some SMS reads have less match k-mer or the distribution of their matched k-mers is uneven, those SMS reads cannot find the matching position in first step. In the second step, we double the block length B to 2000bp and half the k-mer sample step length sl to 10, and align so far unmatched reads again. In practice, we can find the matching for significant amount of reads in first step and can find the matching for most of reads after second step. As the computational cost of second step is much higher than that of first step, our two-steps approach allows us reduce the computational cost and maintain high sensitivity at the same time. All detail parameters are described in Supplementary Note 1.
Correcting SMS reads
The random and independent properties of errors in SMS reads make it possible to correct them. Generally, there are two steps to correct SMS reads. The first step is building a multiple read alignment for each read to be corrected. The second step is constructing the correct read from the consensus of alignment. For the first step, we use our own alignment tool. We perform pairwise alignment (without local alignment using diff) between all reads with length greater than 5000 bp. Then, we filter out alignment if the overlapped sequence is less than 90% of shorter read in the pair. The output of the alignment is written into multiple files. Each file includes the alignment information of 200,000 reads. For the second step, we develop a new SMS read error correction method by combining the principles from both DAGCon and FalconSense. We summarize the pairwise alignment to construct a consensus table with the counts of match, insertion and deletion. For trivial regions with consistent matches: match_count/(match_count+deletion_count)>0.8 and no significant insertion occurring (insertion_count<6), or consistent deletions: deletion_count/(match_count+deletion_count)>0.8 and no significant insertion occurring (insertion_count<6), we can simply determine the consensus base according to the count. For other complicate regions, we construct a local POG and solve the consensus with dynamic programming. All detail of this algorithm is described in Supplementary Note 2.
De novo assembly using SMS reads
We develop a new pipeline for assembling SMS reads by integrating our new alignment and error correction method with the Celera Assembler (CA). Our pipeline has three steps. In first step, for each reads longer than 3000bp, we perform pairwise alignment against other reads and select 100 matched reads with top matched scores. No detail local alignment is performed in this step. In second step, we correct all template reads (>3000bp) using their matched reads. Finally, we pairwise align the corrected reads using our alignment method and feed the results into the “Unitig Construction” module of CA or Canu to construct the unitigs. We can also feed our correct reads to the CA directly, which used the overlapInCore for pairwise alignment.
Evaluation
We evaluate MECAT tools using both simulated reads and raw reads from five model organisms. We compare our alignment tool to previous pairwise alignment tools, including BLASR, MHAP and Daligner, and to genome alignment tools, including BLASR and BWA-mem. We compare our correction tools to those in PBcR and Falcon. We also systematically evaluate our assembly tools by comparing with Canu, PBcR-MHAP, Falcon. The details of those comparisons are reported in Supplemental Note 5-9.
Accession codes
Assembly and annotation files of Han-1 Chinese Human are available from GenBank: GCA_001856745.1. All source codes for MECAT and the analyses presented here are available from https://github.com/xiaochuanle/MECAT. The software and data used for this manuscript (including supplementary files and scripts) are available from http://sysbio.sysu.edu.cn/software/MECAT. Note: Any Supplementary Information and Source Data files are available in the online version of our paper.
AUTHOR CONTRIBUTIONS
C.L.X. conceived and designed this project. C.L.X. conceived, designed and implemented alignment algorithm. Y.C. and C.L.X. conceived, designed and implemented the consensus algorithm. Y.C. and C.L.X integrate all programs into MECAT pipeline and the documentation writing. S,Q.X. C.L.X. and Y.C. ran and analyzed the genome assemblies and our algorithms performance. K.N.C. and Y.W. coordinated data release and assisted with pipeline executions. C.L.X and Y.C. drafted the manuscript. C.L.X. and S.Q.X. drafted supplementary files and the analysis scripts of results. F.L. provided the theoretical analysis of algorithms in this paper. F.L. rewrote and improved all manuscripts and Z.X. revised supplementary files and some sections in paper. All authors read and approved the final manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online version of the paper.
ACKNOWLEDGMENTS
We are grateful to De-Pei Wang for supplying Chinese human Dataset and some advices throughout this project. We thank NCBI assembly group for Han-1 Chinese annotation. Here, We also thank Pacific Biosciences and all those involved in generating and freely releasing the data analyzed here. This work was collectively supported by National Natural Science Foundation of China (31471789, 31600667 and 31200612), the Fundamental Research Funds for the Central Universities (15ykjc23d) and Guangdong Natural Science Foundation (2015A030313127).