An ORFome assembly approach to metagenomics sequences analysis

Yuzhen Ye; Haixu Tang

doi:10.1142/s0219720009004151

J Bioinform Comput Biol. 2009 Jun;7(3):455-71. doi: 10.1142/s0219720009004151.

Authors

Yuzhen Ye¹, Haixu Tang

Affiliation

¹ School of Informatics, Indiana University, Bloomington, IN 47408, USA. yye@indiana.edu

Abstract

Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. Current analyses of metagenomic data largely rely on computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads are searched directly against databases of known genes (or proteins) to identify homologous sequences. This read-by-read search may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The computational framework consists of three steps. First, each read from a metagenomics project is annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e., the ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low-coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, ORFome assembly significantly increases the sensitivity of homology searching and may improve the diversity analysis of metagenomic data. This improvement is especially useful for metagenomic projects in which genome assembly does not work because of low sequence coverage.
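
To make the first of the three steps concrete, the following is a minimal sketch of annotating a single read with putative ORFs. It assumes that a putative ORF is simply a stop-codon-free stretch in any of the six reading frames, translated with the standard genetic code. The abstract does not state how MetaORFA predicts ORFs, so this is an illustration and not the authors' method. The function names and the `min_len` threshold are hypothetical, and the EULER-based peptide assembly of the second step is not reproduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def putative_orfs(read, min_len=20):
    """Hypothetical helper: list (strand, frame, peptide) for every
    stop-free stretch of at least min_len residues in the six frames."""
    read = read.upper()
    orfs = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Stop codons split the translation into candidate peptides.
            for piece in protein.split("*"):
                if len(piece) >= min_len:
                    orfs.append((strand, frame, piece))
    return orfs


if __name__ == "__main__":
    for strand, frame, peptide in putative_orfs("ATGGCTGCTTAA", min_len=3):
        print(strand, frame, peptide)
```

Because short reads rarely contain a complete gene, the sketch keeps partial ORFs that run to the end of a read rather than requiring both a start and a stop codon. Such fragments are the kind of input that a peptide-level assembly step could then join into longer sequences.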

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Amino Acid Sequence
Computational Biology
Databases, Genetic
Genetics, Microbial / statistics & numerical data*
Genomics / statistics & numerical data
Molecular Sequence Data
Open Reading Frames*
Polymorphism, Genetic
Seawater / virology
Sequence Alignment / statistics & numerical data
Sequence Analysis / statistics & numerical data*
Sequence Analysis, Protein / statistics & numerical data
Viral Proteins / genetics

Substances

Viral Proteins