Toward a statistically explicit understanding of de novo sequence assembly

Mark Howison; Felipe Zapata; Casey W Dunn

doi:10.1093/bioinformatics/btt525

Bioinformatics. 2013 Dec 1;29(23):2959-63. doi: 10.1093/bioinformatics/btt525. Epub 2013 Sep 10.

Authors

Mark Howison¹, Felipe Zapata, Casey W Dunn

Affiliation

¹ Center for Computation and Visualization and Department of Ecology and Evolutionary Biology, Brown University, Providence, RI 02912, USA.

PMID: 24021385

Abstract

Motivation: Draft de novo genome assemblies are now available for many organisms. These assemblies are point estimates of the true genome sequences. Each is a specific hypothesis, drawn from among many alternative hypotheses, of the sequence of a genome. Assembly uncertainty, the inability to distinguish between multiple alternative assembly hypotheses, can arise from real variation between copies of the genome in the sample, errors and ambiguities in the sequenced data, and the assumptions and heuristics of the assemblers. Most assemblers select a single assembly according to ad hoc criteria and do not yet report or quantify the uncertainty of their outputs. Those assemblers that do report uncertainty take different approaches to describing multiple assembly hypotheses and the support for each.

Results: Here we review and examine the problem of representing and measuring uncertainty in assemblies. A promising recent development is the implementation of assemblers that are built according to explicit statistical models. Some new assembly methods, for example, estimate and maximize assembly likelihood. These advances, combined with technical advances in the representation of alternative assembly hypotheses, will lead to a more complete and biologically relevant understanding of assembly uncertainty. This will in turn facilitate the interpretation of downstream analyses and tests of specific biological hypotheses.
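
As an illustration of what estimating and comparing assembly likelihoods can mean in practice, the following is a minimal sketch under a deliberately simplified model. It is not the model of any particular assembler. It assumes each read is sampled uniformly from the start positions of a candidate assembly and that each base is miscalled independently with a fixed probability. The function names (read_likelihood, assembly_log_likelihood, relative_support) and the error_rate parameter are hypothetical, chosen only for this example.

```python
import math


def read_likelihood(read, assembly, error_rate=0.01):
    """P(read | assembly) under a toy model (an assumption, not a published one):
    the read starts at a uniformly chosen position, and each base is
    miscalled independently with probability error_rate."""
    n_starts = len(assembly) - len(read) + 1
    if n_starts <= 0:
        return 0.0  # the read cannot fit inside this assembly
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for offset, base in enumerate(read):
            if assembly[start + offset] == base:
                p *= 1.0 - error_rate
            else:
                # a miscall is assumed equally likely to give any other base
                p *= error_rate / 3.0
        total += p
    return total / n_starts


def assembly_log_likelihood(reads, assembly, error_rate=0.01):
    """Sum of per-read log-likelihoods, treating reads as independent."""
    log_lik = 0.0
    for read in reads:
        lik = read_likelihood(read, assembly, error_rate)
        if lik == 0.0:
            return float("-inf")
        log_lik += math.log(lik)
    return log_lik


def relative_support(reads, candidates, error_rate=0.01):
    """Normalize likelihoods across candidate assemblies so they sum to 1.
    Assumes at least one candidate has a finite log-likelihood."""
    logs = [assembly_log_likelihood(reads, c, error_rate) for c in candidates]
    best = max(logs)
    weights = [math.exp(x - best) for x in logs]  # subtract max for stability
    norm = sum(weights)
    return [w / norm for w in weights]


# Two alternative assembly hypotheses that differ at a single base.
reads = ["ACGT", "CGTA", "GTAC"]
candidates = ["ACGTAC", "ACGAAC"]
support = relative_support(reads, candidates)
```

Here support holds one value per candidate, which gives a simple way to report how strongly the reads favor one assembly hypothesis over another instead of returning only the best-scoring one. Real likelihood-based methods also account for factors this sketch ignores, such as paired-end insert sizes, coverage variation and both strands.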

Publication types

Review

MeSH terms

Computational Biology*
High-Throughput Nucleotide Sequencing / methods*
Sequence Analysis, DNA / methods*