Spaced seeds improve k-mer-based metagenomic classification

Bioinformatics. 2015 Nov 15;31(22):3584-92. doi: 10.1093/bioinformatics/btv419. Epub 2015 Jul 25.

Abstract

Motivation: Metagenomics is a powerful approach to studying the genetic content of environmental samples, and it has been strongly promoted by next-generation sequencing technologies. To cope with the massive data involved in modern metagenomic projects, recent tools rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes.

Results: Within this general framework, we show that spaced seeds provide a significant improvement of classification accuracy, as opposed to traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects.

Availability and implementation, Supplementary information: Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.
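As an illustration of the contrast drawn in the Results between contiguous k-mers and spaced seeds, the following minimal Python sketch counts how many seed positions of a read produce a key that also occurs in a reference sequence. It is not taken from the authors' scripts: the function names `seed_keys` and `shared_hits`, the mask `11011`, and the toy sequences are all assumptions chosen for illustration. A mask is written as a string of `1` (position must match) and `0` (position is ignored), so a contiguous k-mer is simply a mask of k ones.

```python
def seed_keys(seq, mask):
    """Slide `mask` over `seq` and return the key read at each offset.

    Only positions marked '1' in the mask contribute to the key;
    positions marked '0' are skipped (treated as wildcards).
    """
    keep = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    return ["".join(seq[s + i] for i in keep)
            for s in range(len(seq) - span + 1)]


def shared_hits(read, reference, mask):
    """Count read offsets whose key also occurs somewhere in the reference."""
    ref_keys = set(seed_keys(reference, mask))
    return sum(key in ref_keys for key in seed_keys(read, mask))


# Hypothetical toy data: READ is REFERENCE with three substitutions.
REFERENCE = "ACGTTGCAAGCTTCGA"
READ = "ACATTGCTAGCTGCGA"

CONTIGUOUS = "1111"   # ordinary 4-mer, weight 4
SPACED = "11011"      # illustrative spaced seed, also weight 4

if __name__ == "__main__":
    print("contiguous hits:", shared_hits(READ, REFERENCE, CONTIGUOUS))
    print("spaced hits:", shared_hits(READ, REFERENCE, SPACED))
```

Both masks have the same weight (four matching positions), so the keys are equally specific; the spaced mask can still produce a hit when a substitution falls on its ignored position. The input here is hand-picked and tiny, so it only shows the mechanism and says nothing about classification accuracy on real data.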

Contact: gregory.kucherov@univ-mlv.fr.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Bacillus / genetics
  • Databases, Genetic
  • Genome, Bacterial
  • Metagenomics / classification*
  • Mycobacterium / genetics
  • Probability
  • Sequence Alignment
  • Statistics, Nonparametric