Abstract
Microbial communities are often studied by sequencing marker genes such as 16S ribosomal RNA. Marker gene sequences can be used to assess diversity and taxonomy, but they do not directly measure functions arising from other genes in the community metagenome. Such functions can be predicted by algorithms that associate marker genes with experimentally determined traits in well-studied species. Typically, such methods use ancestral state reconstruction. Here I describe SINAPS, a new algorithm that predicts traits for marker gene sequences by fast, simple word counting and does not require alignments or trees. A measure of prediction confidence is obtained by bootstrapping. I tested SINAPS predictions from 16S V4 query sequences for traits including energy metabolism, Gram-positive staining, presence of a flagellum, V4 primer mismatches, and 16S copy number. Accuracy was >90% for all traits except copy number, where a large majority of predictions were within ±2 of the true value.
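As an illustration only, the idea of word counting with bootstrapped confidence can be sketched in Python. This is a minimal sketch under stated assumptions, not the SINAPS implementation: the function names (`words`, `predict_trait`), the parameter values (`word_len`, `subsample`, `iterations`), the choice of the reference sharing the most words as the "top hit", and the toy sequences and trait labels are all hypothetical and chosen for readability.

```python
import random
from collections import Counter


def words(seq, word_len):
    """Return the set of distinct words (k-mers) of length word_len in seq."""
    return {seq[i:i + word_len] for i in range(len(seq) - word_len + 1)}


def predict_trait(query, references, word_len=8, subsample=32,
                  iterations=100, seed=0):
    """Predict a trait for query; return (trait, bootstrap_confidence).

    references is a list of (sequence, trait) pairs with known traits.
    All defaults are illustrative assumptions, not published settings.
    """
    rng = random.Random(seed)
    ref_words = [(words(seq, word_len), trait) for seq, trait in references]
    query_words = sorted(words(query, word_len))
    if not query_words or not ref_words:
        raise ValueError("query too short or no reference sequences")

    votes = Counter()
    for _ in range(iterations):
        # Bootstrap step: draw query words at random, with replacement.
        sample = set(rng.choices(query_words, k=subsample))
        # Top hit: the reference sharing the most words with the sample
        # (ties go to the earliest reference in the list).
        _, best_trait = max(ref_words, key=lambda r: len(sample & r[0]))
        votes[best_trait] += 1

    # The most frequent trait is the prediction; its share of the
    # iterations serves as the confidence value.
    trait, count = votes.most_common(1)[0]
    return trait, count / iterations


# Toy reference set with made-up sequences and trait labels.
toy_refs = [
    ("ACACACACACACACAC", "aerobic"),
    ("GTGTGTGTGTGTGTGT", "anaerobic"),
]
print(predict_trait("ACACACACAC", toy_refs, word_len=4, subsample=8,
                    iterations=20))
```

No alignment or tree is built at any point: each iteration only intersects sets of words, which is what keeps this style of prediction fast. A query whose sampled words split between references with different traits would receive a confidence below 1.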