Sequence assembly demystified

This article has been updated

Key Points

  • Advances in sequencing technologies and increased access to sequencing have led to recent renewed interest in sequence assembly algorithms and tools.

  • Assembly continues to be a computationally challenging problem in which engineering 'details' play a more important part than the choice of a specific assembly paradigm in defining the performance and accuracy of assemblers.

  • Modern sequence assemblers continue to explore new ways to capture and to analyse graph structures to carry out assembly in a time- and memory-efficient manner.

  • Most assembly programs are based on heuristics and ad hoc techniques and provide no guarantees on the correctness of the reconstructed sequence. Recent tools have sought to address this need by focusing on assembly tasks in which exact algorithms are feasible.

  • The availability of multiple sequencing technologies and library preparation protocols has brought into focus the importance of experimental design in sequence assembly.

  • Coupling of experimental design with the development of assembly algorithms may be key to optimizing assembly results in the future.

  • A combination of in silico assessment and validation using independent experimental data is currently used to assess the reliability of sequence assembly, although computational tools for assembly validation are still limited in number.

  • Sequence assembly is increasingly used for applications other than the traditional role of assembling genomes, including transcriptome analysis, reconstruction of microbial communities (metagenomics) and the discovery of genomic variants.

  • Application-specific assemblers, which exploit characteristics of the sequences to be reconstructed, have emerged as an important area of focus for assembly research.

Abstract

Advances in sequencing technologies and increased access to sequencing services have led to renewed interest in sequence and genome assembly. Concurrently, new applications for sequencing have emerged, including gene expression analysis, discovery of genomic variants and metagenomics, and each of these has different needs and challenges in terms of assembly. We survey the theoretical foundations that underlie modern assembly and highlight the options and practical trade-offs that need to be considered, focusing on how individual features address the needs of specific applications. We also review key software and the interplay between experimental design and efficacy of assembly.

Figure 1: Methods for assembly validation.
Figure 2: Comparison of isolate genome, single-cell and metagenomic assembly.

Change history

  • 05 February 2013

    In the above article, the paper cited as reference 89 was incorrect. The correct reference has now been added in its place.

References

  1. Conway, T. C. & Bromage, A. J. Succinct data structures for assembling large genomes. Bioinformatics 27, 479–486 (2011).

  2. Ye, C., Ma, Z. S., Cannon, C. H., Pop, M. & Yu, D. W. Exploiting sparseness in de novo genome assembly. BMC Bioinformatics 13 (Suppl. 6), S1 (2012).

  3. Koren, S., Treangen, T. J. & Pop, M. Bambus 2: scaffolding metagenomes. Bioinformatics 27, 2964–2971 (2011).

  4. Namiki, T., Hachiya, T., Tanaka, H. & Sakakibara, Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 40, e155 (2012).

  5. Peng, Y., Leung, H. C., Yiu, S. M. & Chin, F. Y. Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics 27, i94–i101 (2011).

  6. Peng, Y., Leung, H. C., Yiu, S. M. & Chin, F. Y. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420–1428 (2012).

  7. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012). This paper describes new assembly algorithms that are targeted at data generated in single-cell experiments through whole-genome amplification. The authors had to develop strategies for dealing with the highly uneven coverage of the data as well as numerous experimental errors.

  8. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotech. 29, 644–652 (2011). Presented here is a collection of tools, called Trinity, for de novo assembly-based analysis of transcriptome data. This paper demonstrates that complete transcripts, including their splice forms, can be reconstructed from RNA-seq data.

  9. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nature Methods 7, 909–912 (2010).

  10. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotech. 30, 693–700 (2012).

  11. Ribeiro, F. J. et al. Finished bacterial genomes from shotgun sequence data. Genome Res. 22, 2270–2277 (2012).

  12. Wetzel, J., Kingsford, C. & Pop, M. Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies. BMC Bioinformatics 12, 95 (2011).

  13. Pham, S. K. et al. Pathset graphs: a novel approach for comprehensive utilization of paired reads in genome assembly. J. Comput. Biol. 17 Jul 2012 (doi:10.1089/cmb.2012.0098).

  14. Nagarajan, N. & Pop, M. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comput. Biol. 16, 897–908 (2009). An overview is provided here of the algorithmic challenges that underlie genome assembly; the paper has a specific focus on the interplay between read length and the size of repeats that can be correctly assembled.

  15. Peltola, H., Soderlund, H. & Ukkonen, E. SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res. 12, 307–321 (1984).

  16. Peltola, H., Soderlund, H., Tarhio, J. & Ukkonen, E. in IFIP 9th World Computer Congress (ed. Mason, R. E. A.) 53–64 (North-Holland, 1983).

  17. Pevzner, P. A., Tang, H. & Waterman, M. S. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753 (2001).

  18. Ronen, R., Boucher, C., Chitsaz, H. & Pevzner, P. SEQuel: improving the accuracy of genome assemblies. Bioinformatics 28, i188–i196 (2012).

  19. Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012).

  20. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008). The Velvet assembler is the first widely used de Bruijn graph assembler, and this is the first paper to demonstrate that high-quality assembly of ultra-short reads is feasible.

  21. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009). The assembler described in this study, ABySS, is the first parallel genome assembler capable of assembling human-sized data sets.

  22. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010).

  23. Kelley, D. R., Schatz, M. C. & Salzberg, S. L. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11, R116 (2010).

  24. Salmela, L. & Schroder, J. Correcting errors in short reads by multiple alignments. Bioinformatics 27, 1455–1461 (2011).

  25. Ferragina, P. & Manzini, G. in Proc. 41st Annu. Symp. Foundations Comput. Sci. 390–398 (2000).

  26. Liu, Y., Schmidt, B. & Maskell, D. L. Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics 12, 354 (2011).

  27. Xing, L. PASQUAL: parallel techniques for next generation genome sequence assembly. IEEE Trans. Parallel Distrib. Syst. 10 Aug 2012 (doi:10.1109/TPDS.2012.190).

  28. Pell, J. et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl Acad. Sci. USA 109, 13272–13277 (2012).

  29. Pevzner, P. A. & Tang, H. Fragment assembly with double-barreled data. Bioinformatics 17 (Suppl. 1), S225–S233 (2001). This paper introduces the de Bruijn graph paradigm for assembly and the Euler assembler. The concepts described here have formed the basis for almost all de Bruijn-graph-based assemblers that are available in the community.

  30. Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810–820 (2008).

  31. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genet. 44, 226–232 (2012).

  32. Pop, M., Kosack, D. S. & Salzberg, S. L. Hierarchical scaffolding with Bambus. Genome Res. 14, 149–159 (2004).

  33. Dayarian, A., Michael, T. P. & Sengupta, A. M. SOPRA: scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics 11, 345 (2010).

  34. Gao, S., Sung, W. K. & Nagarajan, N. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J. Comput. Biol. 18, 1681–1691 (2011). In this study, it is demonstrated that the genome scaffolding problem can be solved exactly for commonly encountered data despite the computational intractability of this problem. This paper also introduces the scaffolder Opera, which outperforms other stand-alone scaffolding packages.

  35. Tsai, I. J., Otto, T. D. & Berriman, M. Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol. 11, R41 (2010).

  36. Gao, S., Bertrand, D. & Nagarajan, N. FinIS: improved in silico finishing using an exact quadratic programming formulation. Lect. Notes Comput. Sci. 7534, 314–325 (2012).

  37. Medvedev, P., Georgiou, K., Myers, G. & Brudno, M. Computability of models for sequence assembly. Lect. Notes Comput. Sci. 4645, 289–301 (2007).

  38. Alkan, C., Sajjadian, S. & Eichler, E. E. Limitations of next-generation genome sequence assembly. Nature Methods 8, 61–65 (2011). The many errors found in a de novo assembly of the human genome are highlighted here, and the authors argue for the continued development of experimental techniques aimed at fully reconstructing genomes.

  39. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).

  40. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl Acad. Sci. USA 108, 1513–1518 (2011). This paper introduces the ALLPATHS-LG assembler, which is the first assembler that is specifically designed in concert with a specific 'recipe' for the sequencing experiment.

  41. Bashir, A., Bansal, V. & Bafna, V. Designing deep sequencing experiments: structural variation, haplotype assembly, and transcript abundance. BMC Genomics 11, 385 (2010).

  42. Earl, D. et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241 (2011). The Assemblathon competition compared the performance of modern genome assemblers on a simulated human-sized diploid genome. The assemblies were contributed by the community, thus reflecting the best results that could be obtained with the corresponding assemblers. The paper also includes a detailed description of methods for validating the quality of the resulting assemblies.

  43. Salzberg, S. L. et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012). The GAGE competition compared the performance of several modern genome assemblers on real sequencing data from bacterial to eukaryotic genomes. The assemblies were carried out by the authors of the study, and the validation of the assemblies was done by comparison to known references for the genomes included. In addition, the paper provides full 'assembly recipes', which allow readers directly to reproduce the results presented.

  44. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196–2204 (2000).

  45. Zhou, S. et al. A whole-genome shotgun optical map of Yersinia pestis strain KIM. Appl. Environ. Microbiol. 68, 6321–6331 (2002).

  46. Nagarajan, N., Read, T. D. & Pop, M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics 24, 1229–1235 (2008).

  47. Istrail, S. et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc. Natl Acad. Sci. USA 101, 1916–1921 (2004).

  48. Zimin, A. V. et al. A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol. 10, R42 (2009).

  49. Meader, S., Hillier, L. W., Locke, D., Ponting, C. P. & Lunter, G. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res. 20, 675–684 (2010).

  50. Gnerre, S., Lander, E. S., Lindblad-Toh, K. & Jaffe, D. B. Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biol. 10, R88 (2009).

  51. Phillippy, A. M., Schatz, M. C. & Pop, M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9, R55 (2008).

  52. Huson, D. et al. in Proc. First Int. Workshop Algorithms Bioinf. 294–306 (2001).

  53. Waterston, R. H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).

  54. Prufer, K. et al. The bonobo genome compared with the chimpanzee and human genomes. Nature 486, 527–531 (2012).

  55. Blakesley, R. W. et al. An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res. 14, 2235–2244 (2004).

  56. Choi, J. H. et al. A machine-learning approach to combined evidence validation of genome assemblies. Bioinformatics 24, 744–750 (2008).

  57. Schatz, M. C. et al. Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. Brief. Bioinform. 23 Dec 2012 (doi:10.1093/bib/bbr074).

  58. Narzisi, G. & Mishra, B. Comparing de novo genome assembly: the long and short of it. PLoS ONE 6, e19175 (2011).

  59. Haiminen, N., Kuhn, D. N., Parida, L. & Rigoutsos, I. Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results. PLoS ONE 6, e24182 (2011).

  60. Lin, Y. et al. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27, 2031–2037 (2011).

  61. Zhang, W. et al. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS ONE 6, e17915 (2011).

  62. Barthelson, R., McFarlin, A. J., Rounsley, S. D. & Young, S. Plantagora: modeling whole genome sequencing and assembly of plant genomes. PLoS ONE 6, e28436 (2011).

  63. Birol, I. et al. De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872–2877 (2009).

  64. Tyson, G. W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).

  65. Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010). This is a large-scale catalogue of metagenomic data generated through de novo assembly of short read sequencing data. This paper is the first to demonstrate that metagenomic data can be effectively analysed through next-generation sequencing technologies.

  66. Laserson, J., Jojic, V. & Koller, D. Genovo: de novo assembly for metagenomes. J. Comput. Biol. 18, 429–443 (2011).

  67. Dean, F. B. et al. Comprehensive human genome amplification using multiple displacement amplification. Proc. Natl Acad. Sci. USA 99, 5261–5266 (2002).

  68. Raghunathan, A. et al. Genomic DNA amplification from a single bacterium. Appl. Environ. Microbiol. 71, 3342–3347 (2005).

  69. Chitsaz, H. et al. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nature Biotech. 29, 915–921 (2011).

  70. Hansen, K. D., Brenner, S. E. & Dudoit, S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 38, e131 (2010).

  71. Surget-Groba, Y. & Montoya-Burgos, J. I. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res. 20, 1432–1440 (2010).

  72. Schulz, M. H., Zerbino, D. R., Vingron, M. & Birney, E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28, 1086–1092 (2012).

  73. Zhao, Q. Y. et al. Optimizing de novo transcriptome assembly from short-read RNA-seq data: a comparative study. BMC Bioinformatics 12 (Suppl. 14), S2 (2011).

  74. Feldmeyer, B., Wheat, C. W., Krezdorn, N., Rotter, B. & Pfenninger, M. Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genomics 12, 317 (2011).

  75. Charuvaka, A. & Rangwala, H. Evaluation of short read metagenomic assembly. BMC Genomics 12 (Suppl. 2), S8 (2011).

  76. The Human Microbiome Project Consortium. A framework for human microbiome research. Nature 486, 215–221 (2012).

  77. Weinstock, G. M. Genomic approaches to studying the human microbiota. Nature 489, 250–256 (2012).

  78. The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).

  79. Hajirasouliha, I. et al. Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics 26, 1277–1283 (2010).

  80. Newman, T. L. et al. A genome-wide survey of structural variation between human and chimpanzee. Genome Res. 15, 1344–1356 (2005).

  81. Khaja, R. et al. Genome assembly comparison identifies structural variants in the human genome. Nature Genet. 38, 1413–1418 (2006).

  82. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 6, 677–681 (2009).

  83. Chen, K. et al. BreakFusion: targeted assembly-based identification of gene fusions in whole transcriptome paired-end sequencing data. Bioinformatics 28, 1923–1924 (2012).

  84. Warren, R. L. & Holt, R. A. Targeted assembly of short sequence reads. PLoS ONE 6, e19816 (2011).

  85. Aguiar, D. & Istrail, S. HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data. J. Comput. Biol. 19, 577–590 (2012).

  86. Bansal, V. & Bafna, V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24, i153–i159 (2008).

  87. Eriksson, N. et al. Viral population estimation using pyrosequencing. PLoS Comput. Biol. 4, e1000074 (2008).

  88. Prosperi, M. C. et al. Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing. BMC Bioinformatics 12, 5 (2011).

  89. Astrovskaya, I. et al. Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinformatics 12 (Suppl. 6), S1 (2011).

  90. Prosperi, M. C. & Salemi, M. QuRe: software for viral quasispecies reconstruction from next-generation sequencing data. Bioinformatics 28, 132–133 (2012).

  91. Fullwood, M. J., Wei, C. L., Liu, E. T. & Ruan, Y. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 19, 521–532 (2009).

  92. Schwartz, D. C. et al. Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science 262, 110–114 (1993).

  93. Miller, J. M., Malenfant, R. M., Moore, S. S. & Coltman, D. W. Short reads, circular genome: skimming solid sequence to construct the bighorn sheep mitochondrial genome. J. Hered. 103, 140–146 (2012).

  94. Loman, N. J. et al. Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotech. 30, 434–439 (2012).

  95. Sutton, G. G., White, O., Adams, M. D. & Kerlavage, A. R. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci. Technol. 1, 9–19 (1995).

  96. Jeck, W. R. et al. Extending assembly of short DNA sequences to handle error. Bioinformatics 23, 2942–2944 (2007).

Acknowledgements

N.N. was supported by the Agency for Science, Technology and Research (A*STAR), Singapore. M.P. was supported in part by the US National Science Foundation (grants IIS-1117247 and IIS-0844494) and by the Bill and Melinda Gates Foundation.

Author information

Corresponding author

Correspondence to Mihai Pop.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary information S1 (table)

Other assembly programs. (PDF 143 kb)

Related links

FURTHER INFORMATION

Niranjan Nagarajan's homepage

Mihai Pop's homepage

The Assemblathon

GAGE

Genome 10K Project

i5K — ArthropodBase wiki

phrap.doc

Glossary

Paired-end data

Data from a pair of reads sequenced from the two ends of the same DNA fragment. The genomic distance between the reads is approximately known and is used to constrain assembly solutions. See also 'mate-pair data'.

Mate-pair data

Data from a pair of reads sequenced from the same circularized DNA fragment. The circularization step allows larger fragment sizes to be used. Mate-pair reads provide the assembler with the same information as paired-end reads.

Contiguous sequence (contig)

A sequence reconstructed by assembling multiple reads.

Read

The sequence generated by a sequencing machine from a DNA fragment.

Overlap

The relationship between two reads, the ends of which have highly similar sequences. The minimum length allowed for the corresponding sequence is an important parameter in assembly.
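To illustrate the minimum-overlap parameter, here is a minimal Python sketch that is not drawn from any particular assembler. It assumes exact matching between the end of one read and the start of another, whereas real assemblers tolerate mismatches and indels ('highly similar' rather than identical sequences). The function name `suffix_prefix_overlap` and the parameter `min_length` are hypothetical.

```python
def suffix_prefix_overlap(read_a, read_b, min_length=3):
    """Return the length of the longest suffix of read_a that exactly
    matches a prefix of read_b, or 0 if it is shorter than min_length.

    Simplifying assumption: exact matching only (no mismatches or indels).
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    longest_possible = min(len(read_a), len(read_b))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(longest_possible, min_length - 1, -1):
        if read_a[-length:] == read_b[:length]:
            return length
    return 0


print(suffix_prefix_overlap("ACGTACG", "TACGGA"))                # 4 ("TACG")
print(suffix_prefix_overlap("ACGTACG", "TACGGA", min_length=5))  # 0
```

Raising `min_length` discards short, possibly spurious overlaps at the cost of missing some true ones, which is why this threshold matters in practice.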

Scaffolds

An ordered collection of contiguous sequences (contigs), the relative placement of which is typically inferred from mate-pair reads and other information. The sequence within the gaps between the contigs is usually not known.

Library

A collection of paired-end or mate-pair reads derived from DNA fragments with a tightly controlled size range.

Depth of coverage

The average number of reads covering a particular base in the sequence being assembled.
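As a rough illustration, the average depth can be estimated as the total number of sequenced bases divided by the length of the sequence being assembled. The Python sketch below assumes that every read base maps somewhere on the target sequence; the function name `mean_depth` is hypothetical.

```python
def mean_depth(read_lengths, genome_length):
    """Estimate average depth of coverage as total read bases / genome length.

    Assumption: all read bases fall on the target sequence.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


# Thirty 100-base reads over a 1,000-base sequence give 3x average depth.
print(mean_depth([100] * 30, 1000))  # 3.0
```

This is an average only; depth at individual bases can deviate substantially from it, for example in single-cell or metagenomic data.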

N50

A statistic used for assessing the contiguity of a genome assembly. The contigs in an assembly are sorted by size and their sizes are added, starting with the largest. The N50 is the size of the contig whose addition brings the running total to at least 50% of the genome size.
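The N50 calculation can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used by any specific tool. The function name `n50` and the optional `genome_size` argument are hypothetical; when no genome size is supplied, the sketch assumes that the total length of the contigs stands in for it, as is common when the true genome size is unknown.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 of a set of contig lengths.

    Assumption: if genome_size is not given, the sum of the contig
    lengths is used in its place.
    """
    if genome_size is None:
        genome_size = sum(contig_lengths)
    running_total = 0
    # Add contigs from largest to smallest.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        # Stop at the contig that brings the total to >= 50% of genome size.
        if 2 * running_total >= genome_size:
            return length
    return None  # contigs cover less than half of the stated genome size


contigs = [80, 70, 50, 40, 30, 20]        # total length 290
print(n50(contigs))                       # 70 (80 + 70 = 150 >= 145)
print(n50(contigs, genome_size=400))      # 50 (80 + 70 + 50 = 200 >= 200)
```

Because the result depends on the denominator, N50 values are comparable between assemblies only when the same genome size is used for each.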

Isolate genome

The genome of a single organism isolated through culture, for which a substantial quantity of DNA can be obtained.

k-mers

Strings of k consecutive letters extracted from a longer sequence, such as a read or a reference assembly.
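Extracting k-mers from a sequence is simple to sketch in Python. The function name `kmers` is hypothetical, and the example makes no attempt to handle reverse complements or ambiguous bases, both of which practical tools typically account for.

```python
def kmers(sequence, k):
    """Return all substrings of length k, in order of their start position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 k-mers (none if k > n).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

Consecutive k-mers overlap by k - 1 letters, which is the property that de Bruijn graph assemblers exploit.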

About this article

Cite this article

Nagarajan, N., Pop, M. Sequence assembly demystified. Nat Rev Genet 14, 157–167 (2013). https://doi.org/10.1038/nrg3367

  • DOI: https://doi.org/10.1038/nrg3367
