Identifying Core Operons in Metagenomic Data

An operon is a functional unit of DNA whose genes are co-transcribed on a polycistronic mRNA in a co-regulated fashion. Operons are a powerful mechanism for introducing functional complexity in bacteria, and are therefore of interest in microbial genetics, physiology, biochemistry, and evolution. Here we present a Pipeline for Operon Exploration in Metagenomes, or POEM. At the heart of POEM lies the concept of a core operon, a functional unit enabled by a predicted operon in a metagenome. Using a series of benchmarks, we show the high accuracy of POEM, and demonstrate its use on a human gut metagenome sample. We conclude that POEM is a useful tool for analyzing metagenomes beyond the genomic level, and for identifying multi-gene functionalities and possible neofunctionalization in metagenomes. Availability: https://github.com/Rinoahu/POEM_py3k

experimentally assayed for operons. Furthermore, experimental studies tend to use data from model species, and cross-species prediction may not work well [11]. The challenge of discovering operons is compounded in metagenomic data. Major additional confounders include the large loss of genomic information, short contigs that rarely assemble into a full genome, and misassembly that might produce chimeric contigs [45]. At the same time, metagenomic data contain rich information that cannot be gleaned from clonal cultures; it is therefore necessary to investigate how well we can predict operons in metagenomic data. Some work has been done in this direction, including the use of proximity and guilt-by-association [41,42].

While a genome contains the total genetic information of an organism, a metagenome is a partial snapshot of a population of genomes. We therefore can rarely expect an operon discovery method to provide the entire content of operons from metagenomic data. However, predicting whether genes participate in an operon, and which functions are carried out by operons, provides valuable additional information for the functional annotation of a metagenome. In this study we present a method that (1) classifies gene pairs in metagenomes into "operonic" and "non-operonic" classes, and (2) provides functional annotations for the operons it reconstructs from metagenomic data. We introduce the concept of metagenomic core operons. A core operon comprises a set of intra-operonic

Number of contigs    48,508   54,274   61,093   87,992   107,718  146,313  55,925
Max contig length    947,260  549,191  569,707  484,034  249,170  106,439  327,893
Min contig length    100      200      101      100      200      101      500
Mean contig length   2,725    2,475    1,651    1,493    1,254    582      2,188
N50                  12,732   7,681    9,312    4,

on full operon prediction, we report on the precision / recall analysis as illustrated in Figure 1. The total number of true operons in the simulated metagenome was determined to be 2,589. The results are shown in Table 3. POEM's CNN
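The two-step idea described above (label adjacent gene pairs as operonic or non-operonic, then evaluate the reconstructed operons by precision and recall) can be sketched in a few lines of Python. This is a minimal illustration, not POEM's implementation: the gene identifiers are hypothetical, strand and contig boundaries are ignored, and a predicted operon is assumed to count as a true positive only when it matches a true operon exactly, a criterion this excerpt does not specify.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple


def chain_pairs_into_operons(
    genes: List[str], operonic_pairs: Set[Tuple[str, str]]
) -> List[FrozenSet[str]]:
    """Merge consecutive genes into operons.

    `genes` is the gene order along one contig; `operonic_pairs` holds the
    adjacent pairs that a classifier labelled "operonic" (assumed input).
    """
    operons: List[FrozenSet[str]] = []
    current: List[str] = genes[:1]
    for left, right in zip(genes, genes[1:]):
        if (left, right) in operonic_pairs:
            current.append(right)  # extend the running operon
        else:
            if len(current) > 1:  # single genes are not operons
                operons.append(frozenset(current))
            current = [right]
    if len(current) > 1:
        operons.append(frozenset(current))
    return operons


def precision_recall(
    predicted: Iterable[FrozenSet[str]], truth: Iterable[FrozenSet[str]]
) -> Tuple[float, float]:
    """Exact-match precision and recall over sets of operons (assumed criterion)."""
    predicted_set, truth_set = set(predicted), set(truth)
    true_positives = len(predicted_set & truth_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall


# Hypothetical toy contig with six genes and three operonic pairs.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
pairs = {("g1", "g2"), ("g2", "g3"), ("g5", "g6")}
predicted = chain_pairs_into_operons(genes, pairs)
truth = [frozenset({"g1", "g2", "g3"}), frozenset({"g4", "g5"})]
precision, recall = precision_recall(predicted, truth)
print(precision, recall)  # 0.5 0.5
```

In the simulated metagenome, the recall denominator would be the 2,589 true operons reported above.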

To show the utility of our method in discovering core functions facilitated by predicted operons, we ran POEM on the metagenome sample SRR2155174, which contains human gut microbiome data. Figure 2A shows a core function predicted from the SRR2155174 data set. The annotations of this core function indicate that it is related to lipid transport and metabolism. We found several

Table 4. Comparing core operons discovered by POEM in the simulated metagenome and in SRR2155174. See Methods and Figure 5 for details. Intersection with True Operons: the number of core functions shared between true operons and predicted operons. SE: standard error.

In this study we introduce POEM, a complete pipeline for predicting operons in genomic and metagenomic data. We also introduce the concept of a core operon,

An overview of the POEM pipeline is shown in Fig. 3. At the heart of the pipeline lie the operon identification and core operon structure steps that POEM performs. The other steps are performed with third-party tools, and are modular. Below

Removing ORF Redundancies
Once ORFs are identified, we remove redundant ORFs with a sequence identity (ID) of >98% using CD-HIT [12,23]. The assumption is that genes with a very high sequence ID were taken from the same species or highly similar strains and are therefore

[7]. Since the CNN only accepts a fixed-size matrix, we that k = 3 produced the best accuracy (Supplementary Figure S1).
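The redundancy-removal step can be illustrated by assembling a CD-HIT command line. The sketch below only builds the argument list and does not run anything. The file names are hypothetical, and the choice of the protein-level `cd-hit` program (rather than `cd-hit-est` for nucleotides) is an assumption, since this excerpt does not say which POEM calls. The flags `-i`, `-o`, `-c` and `-T` are standard CD-HIT options for input, output, identity threshold and thread count.

```python
from typing import List


def build_cdhit_command(
    input_fasta: str,
    output_fasta: str,
    identity: float = 0.98,   # threshold taken from the text above
    threads: int = 1,
    program: str = "cd-hit",  # assumption: protein-level clustering
) -> List[str]:
    """Return an argument list that clusters ORFs at the given identity."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return [
        program,
        "-i", input_fasta,    # input FASTA of predicted ORFs
        "-o", output_fasta,   # representative (non-redundant) sequences
        "-c", str(identity),  # sequence identity threshold
        "-T", str(threads),   # number of threads
    ]


# Hypothetical file names, for illustration only.
command = build_cdhit_command("orfs.faa", "orfs_nr98.faa")
print(" ".join(command))
# To execute, one could pass the list to subprocess.run(command, check=True).
```

Keeping the command as a list rather than a shell string avoids quoting problems with unusual file names.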

To show the CNN's utility, we compared its performance to a simple baseline

that can then be used by Cytoscape [37] to visualize the core operons.
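As an illustration of output that Cytoscape can load, the sketch below writes core operons as a network in Cytoscape's Simple Interaction Format (SIF), where each line reads `source<TAB>relation<TAB>target`. The excerpt does not state which file format POEM produces, so SIF is an assumption, as are the gene family labels and the `adjacent` relation name.

```python
from typing import List, Sequence, Set, Tuple


def core_operons_to_sif(
    core_operons: Sequence[Sequence[str]], relation: str = "adjacent"
) -> str:
    """Serialize consecutive gene families of each core operon as SIF edges."""
    seen: Set[Tuple[str, str]] = set()
    lines: List[str] = []
    for operon in core_operons:
        for left, right in zip(operon, operon[1:]):
            if (left, right) not in seen:  # skip edges already emitted
                seen.add((left, right))
                lines.append(f"{left}\t{relation}\t{right}")
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical core operons sharing the famB-famC adjacency.
example = [["famA", "famB", "famC"], ["famB", "famC", "famD"]]
sif_text = core_operons_to_sif(example)
print(sif_text, end="")
```

The resulting text can be saved with a `.sif` extension and imported into Cytoscape as a network file.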