Abstract
An operon is a functional unit of DNA whose genes are co-transcribed on polycistronic mRNA, in a co-regulated fashion. Operons are a powerful mechanism of introducing functional complexity in bacteria, and are therefore of interest in microbial genetics, physiology, biochemistry, and evolution. Here we present a Pipeline for Operon Exploration in Metagenomes or POEM. At the heart of POEM lies the concept of a core operon, a functional unit enabled by a predicted operon in a metagenome. Using a series of benchmarks, we show the high accuracy of POEM, and demonstrate its use on a human gut metagenome sample. We conclude that POEM is a useful tool for analyzing metagenomes beyond the genomic level, and for identifying multi-gene functionalities and possible neofunctionalization in metagenomes. Availability: https://github.com/Rinoahu/POEM_py3k
Background
It is estimated that 5-50% of bacterial genes reside in operons [6, 44], and the characterization and understanding of operons is central to bacterial genomic studies. Experimental approaches, chiefly RNA-Seq, are the most reliable way to identify operons; however, it is not feasible to perform experiments to characterize all operons. Over the years, several computational operon-prediction techniques have been developed. Generally, computational operon identification methods include three steps: 1. identify genes that are in an operon and, conversely, genes that do not participate in an operon; 2. identify features typical of each group; 3. train a classifier with these features and build a discriminating model.
Computational operon prediction methods have been developed since the late 1990’s (For a comprehensive review see: [46]). Naïve Bayes models have been used since early 2000’s for predicting operons [3, 10, 18]. Another method used microarray data to identify the different expression profiles of adjacent gene pairs in operons and outside of operons. The differential expression profiles and intergenic distances were used as as features to train a Bayesian classifier [35]. Comparative genomic methods were also used to identify operons by detecting conserved gene clusters across several species [5, 26, 31]. Other methods include particle swarm optimization [8, 9], and neural networks [39].
There are several operon databases that include automated and experimental-based operon annotation [13,25,29,33,38]. However, a manual curation method is not suitable for the rapid growing number of bacterial genomes, few of which are experimentally assayed for operons. Furthermore, experimental studies tend to use data from model species, and cross-species prediction may not work well [11].
The challenge of discovering operons is compounded when trying to discover operons in metagenomic data. Major additional confounders include the large loss of genomic information, short contigs that rarely assemble into a full genome, and misassembly that might produce chimeric contigs [45]. At the same time, metagenomic data contain rich information that cannot be gleaned from clonal cultures; it is therefore necessary to investigate how well we can predict operons in metagenomic data. Some work has been done including use of proximity and guilt-by-association [41, 42].
While a genome contains the total genetic information of an organism, a metagenome is a partial snapshot of a population of genomes. We therefore can rarely expect an operon discovery method to provide the entire content of operons from metagenomic data. However, predicting whether genes participate in an operon, and which functions are carried out by operons, provide valuable additional information to the functional annotation of a metagenome. In this study we present a method that (1) classifies gene pairs in metagenomes into “operonic” and “non-operonic” classes, and (2) provides functional annotations for the operons it reconstructs from metagenomic data. We introduce the concept of metagenomic core operons. A core operon comprises a set of intra-operonic gene pairs that have orthologs in several species in the metagenome, and are concatenated using guilt-by-association. Additionally, we introduce the core functions of operons, which identifies which functions in the metagenome are executed by operons. Commonly, metagenomic analysis pipelines provide the distribution of biological function the metagenome has based on a normalized count of functionally-annotated ORFs. Our method, a Pipeline for Operon Exploration in Metagenomes or POEM, adds more information as it considers the evolutionary conservation of co-transcribed genes in the species constituting the microbial community. This additional information is valuable for understanding the genetic potential of a microbial community introducing structural information in the form of predicted operons.
Results
We ran POEM on two different data sets. One includes simulated reads generated by ART [15] from 48 genomes of Operon DataBase v2 [29]. The genome species used and parameters of ART are shown in Supplementary Table S1. The second set is the human microbiome set SRR2155174 downloaded from ENA [21]. As a standard of truth for the operons, we used operons from Operon DataBase v2 that are supported by literature (henceforth: “true operons”). This dataset contains 8,194 genes and 5,621 adjacent genes in 2,589 operons.
Metagenome Assembly and Gene Prediction
We used IDBA-UD, MegaHIT, and Velvet [47] to assemble the simulated and experimental reads; the results are shown in Table 1. IDBA-UD provided the maximal N50 and minimal number of contigs in both datasets. MegaHIT provided the largest genome size and the most protein-coding genes.
Metagenemark found 7,855 genes of the 8,194 true operon genes in the whole genomes. In the simulated reads assembly, the number of genes numbers are 5,116, 5,078, and 3,530 (out of 8,194) using IDBA, Megahit, and Velvet respectively.
Operon Prediction and Adjacent Genes Within the Operon
We tested the operon prediction module’s performance on whole genomes and simulated metagenome assembly. The 4,425 operonic and 2,097 non-operonic adjacent genes mentioned above were used as a True Positive (TP) set and True Negative (TN) set, respectively. The precision, recall, and F1 for predicted operonic adjacency are defined in the following equations and the statistical results are shown in Table 2.
However, these results only reflect POEM’s performance on classification of operonic and non-operonic adjacency. To further evaluate POEM’s performance on full operon prediction, we report on the precision / recall analysis as illustrated in Figure 1. The total number of true operons in the simulated metagenome was determined to be 2,589. The results are shown in Table 3. POEM’s CNN performs much better than the linear baseline method when tasked with a perfect recovery of operons. For a 0.6 or better recovery, the CNN and the baseline perform similarly. This suggests that high quality longer assemblies, perhaps from longer reads, may perform better.
Core Functions Facilitated by Predicted Operons in Metagenomic Data
To functionally analyse operons in metagenomes we use core operons, which are described in the Background section. Briefly, core operons are weighted-edge undirected graphs that capture information about predicted orthologous operons or subsets of operons in the metagenome. The nature of the fragmented and partial nature of metagenomic data prohibits a clear binning of reads and a full assembly into component genomes. Therefore, we may not be able to provide an accurate prediction of all genes in the operons or their precise taxonomic affiliation. See Methods / Constructing Core Operons and Figure 5 for an explanation of how core operons are constructed. To see how well core operons capture the function of true operons on our different data sets, we examined the overlap of operonic genes with identical functions as shown in Figure 5. The results of this analysis is shown in Table 4.
To show the utility of our method in discovering core functions facilitated by predicted operons, we ran POEM on the metagenome sample SRR2155174, containing the human gut microbiome data. Figure 2A shows a core function predicted from the SRR2155174 data set. The annotations of the core functions indicates that it is related to lipid transport and metabolism. We found several predicted operons (Figure 2B-E) from the SRR2155174 data set that match the core function. Of the loci in the core operon, only lp 1674 and lp 1675 loci in Lactobacillus plantarum WCFS1 (Figure 2E) can be found in the predicted operons of Operon DataBase [29]. To find the functions of these predicted operonic genes, we examined the functional annotations for these operonic genes from GenBank [2]. The functional annotations (Supplementary Table S2) show that these operonic genes are likely to be involved in fatty acid biosynthesis. We mapped the predicted operonic genes of Lactobacillus plantarum WCFS1 (Figure 2E) to KEGG database [19] and found most of the genes involved in fatty acid biosynthesis (Supplementary Figure S2). These results show these predicted operons are likely involved in fatty acid biosynthesis and have a high probability of being true operons. Although core operons are involved in the same biological pathway, the genes outside the core function (Figure 2B-E) are diverse. The core function reflects the conservation of operons across species and is more robust and error-tolerant than operons. Core functions may reconstruct the metabolism pathways from the incomplete genome assembly data leveraging the conservation of genes across species. The ability to use core functions as familiar ground from which to explore new conserved proximal genes makes core functions a new and powerful tool for discovering novel operon-encoded pathways in metagenomic data.
Discussion
In this study we introduce POEM, a complete pipeline for predicting operons in genomic and metagenomic data. We also introduce the concept of a core operon, a functional unit of proximal genes in a metagenome, which is composed of the common functions of orthologous operons. POEM’s CNN predicts intra-operonic genes with high precision, considerably more so than the baseline method of a linear classifier. The recall rate of POEM is lower than that of the linear classifier, but that is expected as the linear classifier recovers all proximal genes with a distance of ≤ 500 bp. This means that the recall is high, but the number of false positives is also high, as indicated by the lower precision when compared to the CNN (Table 2, 69.84).
When recovering operons from metagenomes (Table 2), POEM’s results depend heavily upon the choice of gene-calling software and metagenome assembly. POEM outperforms the linear baseline method indicating that higher quality assemblies and longer reads will lead to a higher overall accuracy in POEM’s performance relative to the linear classifier. Furthermore, when recovering full operons, POEM’s CNN outperforms the linear classifier. The recovery overall is around 39% (1025 out of 2589), but it is considerably higher than that of the linear classifier (469/1,025).
the chief utility of POEM lies in identifying the functions carried out by the predicted operons in a metagenome. To that end, we introduced the core operon, identified by counting proximal predicted inter-operonic gene pairs in assembled contigs, and concatenating them using guilt-by-association. (Figure 5). The most frequent functions in the operons containing a large number of orthologous genes will be represented in the core operon. A high overlap in the count of functions (as COGs) between the core operons and the true operons indicates that while not all genes in an operon can be recovered in a metagenome, the basic functionality enabled by core operons can be recovered. The high precision and recall values shown in Table 4 indicate the the use of core operons can indeed inform us of those functions that are carried out by operons in a metagenome. In providing a characterization of core operons and their functions, POEM allows the annotation of a metagenome beyond the simple assignment of functions to genes, but to incorporate a level of annotation than includes an element of gene structure which is crucial in understanding bacterial function.
In sum, POEM is a novel and highly useful addition to the arsenal of tools helping us to better understand the functionality of metagenome, and is distinguished by offering a structural view of the metagenome, rather than a bag-of-genes-and-functions that most tools offer.
Methods
An overview of the POEM pipeline is shown in Fig. 3. The heart of the pipeline lie the Operon identification and operon core structure that POEM performs. The other steps are performed with third-party tools, and are modular. Below we elaborate upon the various stages in the POEM pipeline.
Metagenome Assembly
POEM uses, as default, the IDBA-UD de-novo assembler, but the user may supply an alternative assembler. Short read assemblers are usually based on De Bruijn graphs and are sensitive to the sequencing depth, repetitive regions, and sequencing errors [24]. For clonal bacteria, this assembly algorithm works relatively because it is easy to estimate the sequencing depth and the bacterial genomes are often compact and have few repetitive regions. However, in metagenomes it is hard to estimate the amount of sequence data that are needed for good functional coverage, and the genomes from closely related species may contain many highly conserved genes which may be interpreted as repetitive regions. Although de novo assemblers for metagenomes are still at an early stage [40], there are several tools developed for this task including MetaVelvet-SL, IDBA-UD,and Megahit [1, 22, 27, 30, 32]. In this study we also compare the effect these assemblers have on the accuracy of POEM.
Gene Prediction
We chose to use an ab-initio method for gene calling, as opposed to calling by sequence similarity. First, because ab-initio gene calling is faster in bacterial and archaeal genomes, with little accuracy sacrificed: the predicted accuracy of some methods can reach 98% [16, 17, 43, 48]. Second, metagenomic data contain many genes with no similarity to known genes, so using a homology based method may result in a large number of open reading frames (ORFs) that are not predicted as such (false negatives). Several gene prediction tools have been developed or optimized for metagenomic data, including Glimmer-MG, Metagene, Metagenemark, Prokka, Prodigal, and Orphelia [14, 17, 20, 28, 36, 48]. POEM uses Metagenemark or Prokka to predict genes. As in the contig assembly stage, this part can be modified by the user.
Removing ORF Redundancies
Once ORFs are identified, we remove redundant ORFs with an ID of >98% using CD-HIT [12, 23]. The assumption is that genes with a very high sequence ID were taken from the same species or highly similar strains and are therefore redundant information.
Gene Function Annotation
While there are many ways to annotate gene function [34], a fast and acceptably accurate way to do so typically employs sequence similarity matching against a reliable functionally annotated sequence database. Here we used the COG database as a reference. POEM uses both BLAST and DIAMOND [4], which trades off speed and sensitivity. The functional assignment is done by choosing the top hit in COG above the e-value threshold (Evalue = 10−3).
Operon Prediction
At the core of POEM lies a novel method we developed for predicting operons. POEM predicts if any given pair of adjacent genes are intra-operonic by classifying intergenic regions into intra- or extra-operonic. Thus, the operon prediction problem is cast as a binary classification problem.
POEM’s operon prediction method goes through the following steps. First, the intergenic DNA sequences of 4,425 operonic and 2,097 non-operonic adjacent genes were extracted from Operon DataBase v2 [29]. The intergenic regions are represented as a k-mer-position matrix (KPM, Figure 4). Two-thirds of the data were used for training a Convolutional Neural Network (CNN) based binary classification model and the remaining 1/3 of the data were used as the test set. We used a CNN model from the Keras package (v1.2.0) to train the classification model [7]. Since the CNN only accepts a fixed size matrix, we convert the KPM to a fixed size matrix by truncating the middle columns or adding all zero columns to the middle of the matrix. Trial-and-error has shown that k = 3 produced the best accuracy (Supplementary Figure S1).
To show the CNN’s utility, we compared its performance to a simple baseline predictor. The baseline linear classifier works as follows: if two genes on the same strand have an intergenic distance < 500 nt, then their adjacency is classified as within the same operon (operonic). A larger distance would classify them as non-operonic. The predicted operonic adjacent genes were then connected to form a full operon prediction.
Identifying Core Operons
To characterize operons in metagenomes, we introduce the concept of core operons. Core operons are weighted-edge undirected graphs that capture information about predicted orthologous operons or fractions of operons in the metagenome. Each node is a set of orthologous genes that are all annotated by at least one common COG term. An edge is drawn between two nodes if they are determined to be an intra-operonic pair. The weight of the edge is determined by the frequency of the adjacency of the intra-operon adjacent genes. To determine how well a core operon captures the real operons in a metagenome, we ran a precision-recall analysis using the operons in the simulated database as our standard-of-truth, see Figure 5. Here, precision is the number of correctly predicted intra-operonic genes (true positives) divided by the number of all predictions (true positive and false positive predictions). Recall is the number of correctly predicted intra-operonic genes divided by the all real intra-operonic genes. Finally, POEM produces a file that can then be used by Cytoscape [37] to visualize the core operons.
Availability of source code and requirements
The software and related information are listed below:
Project Name: POEM
Project Home Page: https://github.com/Rinoahu/POEM_py3k
Operating System(s): POEM was tested on GNU/Linux distribution Ubuntu 16.04 64-bit, but we expect POEM to work on most Unix-like systems.
Programming Language: Python
Other Requirements: Python 3.7 and Conda
License: GPLv3
Acknowledgments
Will be provided upon acceptance
Footnotes
Several corrections and clarifications on the concepts of core operons and core functions. Also, fixed a few typographical errors.