Assembly of long, error-prone reads using repeat graphs

Abstract

Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers.
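As a rough illustration of the NGA50 metric mentioned above, the sketch below assumes that the contigs have already been aligned to a reference and broken at misassemblies, so that only a list of aligned block lengths and the reference length remain. The function name `nga50` and the toy numbers are hypothetical and are not taken from the paper or from any evaluation tool.

```python
def nga50(aligned_block_lengths, reference_length):
    """Return the largest length L such that aligned blocks of length >= L
    together cover at least half of the reference; 0 if they never do."""
    target = reference_length / 2
    covered = 0
    # Walk the blocks from longest to shortest, accumulating covered bases.
    for length in sorted(aligned_block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Toy input (assumed values): four aligned blocks and a 1,500 bp reference.
example_blocks = [400, 300, 200, 100]
example_nga50 = nga50(example_blocks, 1500)
```

Because the threshold is half of the reference length rather than half of the assembly length, an assembly that covers less than 50% of the reference gets a value of 0 in this sketch.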

Fig. 1: Flye outline.
Fig. 2: Constructing the approximate repeat graph from local self-alignments.
Fig. 3: Resolving an unbridged repeat.
Fig. 4: An SD from the Flye assembly of the HUMAN dataset and the distribution of the lengths and complexities of all SDs from the same assembly.
Fig. 5: Constructing the repeat plot of a tour in the graph and constructing the repeat graph from a repeat plot.

Data availability

All described datasets are publicly available through the corresponding repositories. The supplementary files, including the assemblies generated by Flye, are available at https://doi.org/10.5281/zenodo.1143753; NCTC PacBio reads: http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/; PacBio metagenome dataset: https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun; PacBio C. elegans dataset: https://github.com/PacificBiosciences/DevNet/wiki/C.-elegans-data-set; PacBio/ONT S. cerevisiae dataset: https://github.com/fg6/YeastStrainsStudy. The ONT reads from the HUMAN and HUMAN+ datasets are available at https://github.com/nanopore-wgs-consortium/NA12878. The matching Illumina reads are available as SRA project ERP001229. The Canu HUMAN+ assembly was downloaded from https://genomeinformatics.github.io/na12878update. MaSuRCA assemblies are available from http://masurca.blogspot.com/.

Code availability

The Flye code used in this study is available in the online version of the paper. The most recent Flye version is freely available at http://github.com/fenderglass/Flye.

Acknowledgements

We are indebted to S. Nurk for his multiple rounds of critique and suggestions that have improved the paper. We are also grateful to A. Mikheenko, B. Behsaz, L. Pu, and G. Tesler for their comments. This work is supported by NSF/MCB-BSF grant no. 1715911.

Author information

Contributions

All authors contributed to developing the Flye algorithms and writing the paper. M.K., Y.L., and J.Y. implemented the Flye algorithm. M.K. benchmarked Flye and other assembly tools. P.A.P. directed the work.

Corresponding author

Correspondence to Pavel A. Pevzner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 A comparison of the Flye and HINGE assembly graphs on bacterial genomes from the BACTERIA dataset.

(Left) The Flye and HINGE assembly graphs of the KP9657 dataset. A single unique edge enters (and a single unique edge exits) the unresolved “yellow” repeat, connecting it to the rest of the graph. Thus, this repeat can be resolved if one excludes the possibility that it is shared between a chromosome and a plasmid. In contrast to HINGE, Flye does not rule out this possibility and classifies the yellow repeat as unresolved. (Right) The Flye and HINGE assembly graphs of the EC10864 dataset show a mosaic repeat of multiplicity four formed by yellow, blue, red and green edges (the two copies of each edge represent complementary strands). HINGE reports a complete assembly into a single chromosome.

Supplementary Figure 2 The assembly graph of the YEAST-ONT dataset.

Edges that were classified as repetitive by Flye are shown in color, while unique edges are black. Flye assembled the YEAST-ONT dataset into a graph with 21 unique and 34 repeat edges and generated 21 contigs as unambiguous paths in the assembly graph. A path $v_1, \ldots, v_i, v_{i+1}, \ldots, v_n$ in the graph is called unambiguous if there exists a single incoming edge into each vertex of this path before $v_{i+1}$ and a single outgoing edge from each vertex after $v_i$. Each unique contig is formed by a single unique edge and possibly multiple repeat edges, while repetitive contigs consist of the repetitive edges which were not covered by the unique contigs. The visualization was generated using the graphviz tool (http://graphviz.org).
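The definition of an unambiguous path above can be checked mechanically. The following is a minimal sketch, assuming the assembly graph is given as a plain list of directed edges (pairs of vertex labels); the function name `is_unambiguous`, the edge-list representation, and the toy graph are illustrative assumptions and do not reflect Flye's internal data structures.

```python
from collections import Counter


def is_unambiguous(path, edges):
    """Check whether `path` (a list of vertices) is unambiguous in the directed
    graph given by `edges` (a list of (source, target) pairs).

    The path qualifies if some split index i exists such that every vertex up
    to and including v_i has exactly one incoming edge and every vertex from
    v_{i+1} onward has exactly one outgoing edge.
    """
    edge_set = set(edges)
    # Every consecutive pair of vertices must be joined by an edge.
    if any((u, v) not in edge_set for u, v in zip(path, path[1:])):
        return False

    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)

    # Try each split point between v_i and v_{i+1}.
    for i in range(1, len(path)):
        left_ok = all(in_degree[v] == 1 for v in path[:i])
        right_ok = all(out_degree[v] == 1 for v in path[i:])
        if left_ok and right_ok:
            return True
    return False


# Toy graph (assumed): vertex "b" branches to "e", but the path a-b-c-d is
# still unambiguous when split between "b" and "c".
toy_edges = [("s", "a"), ("a", "b"), ("b", "c"), ("c", "d"), ("d", "t"), ("b", "e")]
toy_result = is_unambiguous(["a", "b", "c", "d"], toy_edges)
```

In this toy graph, adding a second incoming edge to "b" would make the same path ambiguous, since no split point would then satisfy both degree conditions.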

Supplementary Figure 3 The assembly graph of the WORM dataset.

Edges that were classified as repetitive by Flye are shown in color, while unique edges are black. Flye assembled the WORM dataset into a graph with 127 unique and 61 repeat edges and generated 127 contigs as unambiguous paths in the assembly graph. The visualization was generated using the graphviz tool (http://graphviz.org).

Supplementary Figure 4 Dot plots showing the alignment of reads against the Flye assembly, the Miniasm assembly and the reference C. elegans genome.

(a) The reference genome contains a tandem repeat of length 1.9 kb (10 copies) on chromosome X with the repeated unit having length ≈190 nucleotides. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 5.5 kb (27 copies) and 2.8 kb (13 copies), respectively. Fifteen reads that span the tandem repeat support the Flye assembly (the mean length between the flanking unique sequences matches the repeat length reconstructed by Flye), suggesting that the Flye length estimate is more accurate. (b) The reference genome contains a tandem repeat of length 2 kb on chromosome 1. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 10 kb and 5.6 kb, respectively. A single read that spans the tandem repeat supports the Flye assembly. Since the mean read length in the WORM dataset is 11 kb, one expects only a single read to span a given 10.0 kb region but many more reads to span any 5.6 kb region (as implied by the Miniasm assembly) or 2.0 kb region (as implied by the reference genome). Six out of 23 reads cross the “left” border (two out of 18 reads cross the “right” border) of this tandem repeat by more than 5.6 kb, thus contradicting the length estimate given by Miniasm and suggesting that the Flye length estimate is more accurate. (c) The reference genome contains a tandem repeat of length 3 kb on chromosome X. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 13.6 kb and 8 kb, respectively. A single read that spans the tandem repeat reveals the repeat cluster to be of length 12.2 kb, which suggests that the Flye length estimate is more accurate. (d) The reference genome contains a tandem repeat of length 1.5 kb on chromosome 1. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 17 kb and 4.3 kb, respectively. One read that spans the tandem repeat reveals the repeat cluster to be of length 18.0 kb, which suggests that the Flye length estimate is more accurate.

Supplementary information

Supplementary Figures and Text

Supplementary Figures 1–4, Supplementary Tables 1 and 2, and Supplementary Notes 1–16

Reporting Summary

Cite this article

Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540–546 (2019). https://doi.org/10.1038/s41587-019-0072-8

  • DOI: https://doi.org/10.1038/s41587-019-0072-8
