Efficient de novo assembly of large genomes using compressed data structures

Jared T. Simpson; Richard Durbin

doi:10.1101/gr.126953.111

Jared T. Simpson and Richard Durbin 1

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

Abstract

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory-efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers, which rely on de Bruijn graphs, and is simple to parallelize. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of its low memory requirements and parallelization without inter-process communication, SGA is, to our knowledge, the first practical assembler for a mammalian-sized genome on a low-end computing cluster.
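
To make the central data structure concrete, the following is a minimal Python sketch of an FM-index built from a Burrows-Wheeler transform, with the standard backward search used to count exact occurrences of a pattern. It is an illustration only, not SGA's implementation: the function names (bwt, build_fm_index, count_occurrences) are hypothetical, the suffix sort is naive, and the occurrence table is stored uncompressed, so it suits toy inputs rather than billions of reads.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via a naive suffix sort (toy inputs only)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII order
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT symbol for each sorted suffix is the character preceding it.
    return "".join(text[i - 1] for i in order)


def build_fm_index(bwt_str):
    """Return (C, occ) for an uncompressed, illustrative FM-index."""
    counts = Counter(bwt_str)
    # C[c]: number of symbols in the text lexicographically smaller than c.
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt_str[:i].
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    """Count exact matches of pattern using backward search."""
    lo, hi = 0, n  # half-open interval of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

A caller would compute b = bwt(reads_text), build C, occ = build_fm_index(b), and then query count_occurrences(pattern, C, occ, len(b)). Each query takes time proportional to the pattern length, independent of the text size, which is the property that makes the index attractive for finding overlaps between reads.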

Footnotes

1 Corresponding author. E-mail: rd@sanger.ac.uk.
[Supplemental material is available for this article.]
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.126953.111.

Received May 31, 2011.
Accepted December 5, 2011.

Freely available online through the Genome Research Open Access option.
