Abstract
Motivation A fundamental attribute of life is complex systems: systems made of parts that together perform functions that a single component, or most subsets containing individual components, cannot. Examples of molecular complexity include protein structures such as the F1F0-ATPase, the ribosome, or the flagellar motor. Each one of these structures requires most or all of its components to function properly. Given the ubiquity of complex systems in the biosphere, understanding the evolution of complexity is central to biology. At the molecular level, operons are a classic example of a complex system. An operon's genes are co-transcribed under the control of a single promoter to a polycistronic mRNA molecule. The operon's gene products often form molecular complexes or metabolic pathways. With the large number of complete bacterial genomes available, we now have the opportunity to examine the evolution of operons and identify possible intermediate states.
Results In this work, we used a maximum parsimony algorithm to reconstruct ancestral operon states, and show a simple vertical evolution model of how operons may evolve from the individual component genes. We offer the software as the Reconstruction of Ancestral Genomes Using Events or ROAGUE.
Availability and implementation The software is available on https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction
Contact huyn{at}iastate.edu, idoerg{at}iastate.edu
1 Introduction
The evolution of complex systems is an open problem in biology[1], and has recently been studied intensively in genomes[2, 3]. To better understand how complex systems evolve, we focus on the problem of the evolution of orthologous gene blocks and operons in bacteria. Orthologous gene blocks or orthoblocks are sequences of genes co-located on the chromosome, whose evolutionary conservation is apparent[4]. Operons can be viewed as a special case of gene blocks where the genes are co-transcribed to polycistronic mRNA and are often associated with a single function, such as a metabolic pathway or a protein complex. Several models have been proposed to explain gene block and operon evolution, and it may very well be that the models are not mutually exclusive, and different operons may evolve by different models, or indeed a single operon may be the result of the combination of several models[5, 6, 7, 8].
Previously, we proposed a method that explains the evolution of orthoblocks and operons as a combination of events that take place in vertical evolution from common ancestors. In the evolution of an orthoblock, the different gene blocks may gain or lose genes, have genes duplicated, or have them split off. By determining the frequency of the events for any orthoblock in a studied clade, we can determine a cost for each event, and thus create a cost function to determine an optimal vertical path for the evolution of orthoblocks. We have used the cost function to determine the conservation of some operons and orthoblocks in proteobacteria, and show that orthoblocks that perform cellular information processing (such as mRNA translation) are more conserved than those that are associated with adaptation to specific environments [4].
In this study, we use the orthoblock evolution cost function model to reconstruct ancestral gene blocks. Reconstructing plausible ancestral states of extant complex entities can help us understand how they evolve, and which forces might affect their evolution. The rest of this paper is structured as follows: first, we present two algorithms that reconstruct ancestral states of orthoblocks. We then use these algorithms to reconstruct the ancestral states of orthoblocks in a clade of Gram-negative bacteria and a clade of Gram-positive bacteria. This reconstruction involves orthoblocks comprising genes orthologous to those found in operons in Escherichia coli and in Bacillus subtilis, respectively. Finally, we present our findings and discuss our results. Our reconstructions of ancestral states show that: (1) some operons can rapidly evolve independently in several branches in their respective clades, suggesting that positive selection plays a major role in the evolution of gene blocks in bacteria; (2) other operons are highly conserved, their evolution predating the last common ancestor of the clades we chose; and (3) some operon conservation is sporadic and cannot be explained solely by vertical transmission, suggesting horizontal gene transfer.
2 Methods
2.1 Definitions
2.1.1 Gene block-based evolutionary events, and event-based distances
The terms reference taxa, neighboring genes, gene blocks, events, and orthoblocks are elaborated upon in [4]. Briefly, a reference taxon is a taxon where operons have been identified by experimental means. Here we use E. coli K-12 MG1655 and B. subtilis as reference taxa. The reference taxon serves as a standard of truth to determine if the genes on a suspected orthoblock do indeed reside, at least in one species, in an operon or similar co-regulated gene block. We chose these species because their genomes are expertly and comprehensively annotated, and experimental evidence exists for many of their operons [9]. Neighboring genes: two genes are considered neighboring if they are 500 nucleotides or fewer apart and on the same strand. A gene block comprises no fewer than two open reading frames (ORFs) that are neighboring. Orthoblocks, gene blocks that are orthologous, are defined as follows: two organisms have orthoblocks when each organism has at least two neighboring genes that are homologous to genes in a gene block in the reference taxon's genome. An event is a change in the gene block between any two species with homologous gene blocks.
We identify three types of pairwise events between orthoblocks in different taxa: splits, deletions, and duplications. The event-based distance between any two orthoblocks is the sum of the minimized count of splits, duplications, and deletions.
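The neighboring-gene and gene-block definitions above can be illustrated with a minimal sketch. The `Gene` record, its coordinate fields, and the choice to measure the gap as the next gene's start minus the previous gene's end are assumptions made for illustration; they are not taken from the ROAGUE code.

```python
from typing import List, NamedTuple

MAX_GAP = 500  # maximum distance (nucleotides) between neighboring genes


class Gene(NamedTuple):
    """Hypothetical gene record used only for this illustration."""
    name: str
    start: int   # start coordinate on the chromosome
    end: int     # end coordinate on the chromosome
    strand: str  # "+" or "-"


def are_neighbors(left: Gene, right: Gene, max_gap: int = MAX_GAP) -> bool:
    """Genes are neighboring if they are on the same strand and the gap
    between them is at most max_gap nucleotides (assumed gap definition)."""
    gap = right.start - left.end
    return left.strand == right.strand and gap <= max_gap


def gene_blocks(genes: List[Gene], max_gap: int = MAX_GAP) -> List[List[str]]:
    """Return maximal runs of consecutive neighboring genes that contain
    at least two genes, i.e. the gene blocks."""
    runs: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if runs and are_neighbors(runs[-1][-1], gene, max_gap):
            runs[-1].append(gene)
        else:
            runs.append([gene])
    return [[g.name for g in run] for run in runs if len(run) >= 2]


genes = [
    Gene("a", 100, 900, "+"),
    Gene("b", 1000, 1800, "+"),  # 100 nt after a, same strand
    Gene("c", 2300, 3000, "+"),  # exactly 500 nt after b, still neighboring
    Gene("d", 3100, 3900, "-"),  # close to c, but on the opposite strand
    Gene("e", 6000, 6800, "-"),  # far from d
]
print(gene_blocks(genes))  # [['a', 'b', 'c']]
```

Genes d and e each end up alone, so neither forms a gene block under the two-gene minimum.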
2.1.2 Choosing species
The species tree for each clade was built using rpoB as the species marker. For the study of Gram negatives with E. coli as a reference species, we use the group of taxa from[4]. For the study of Gram positives with B. subtilis as the reference species, we use the Phylogenetic Diversity Analysis program (PDA)[10, 11] to select 33 equidistant species.
2.1.3 Orthoblocks in Phylogenetic Trees
For each orthoblock studied, we use a phylogenetic species tree T comprising a set of extant species related to either one of our reference taxa. The topology of T is determined using multiple sequence alignment of gene rpoB followed by the Neighbor Joining algorithm as described in [4]. Each leaf node v in T contains the orthologs to the genes in an operon in the reference species (E. coli or B. subtilis). For any two genes a and b, if the chromosomal distance is at most 500 bp, the genes are written as ab. If the distance is greater than 500 bp, they are written with the separator character ‘|’ thus: a|b. For a species tree T, we define the following:
V(T): the set of nodes of T.
E(T): the set of edges of T.
L(T): set of leaf nodes of T.
I(T): set of inner nodes of T.
If a node v is an inner node, it can be one of three types (illustrated in Figure 1):
vdl: an inner node both of whose children are leaf nodes.
vhl: an inner node with exactly one immediate child that is a leaf node.
vnl: an inner node none of whose children are leaf nodes.
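The three inner-node types can be told apart by counting leaf children. The sketch below assumes a binary tree stored as a dictionary from each inner node to its two children; this encoding, the node names `n1` to `n4`, and the topology are hypothetical and are not meant to reproduce Figure 1.

```python
from typing import Dict, List

# Hypothetical encoding: every inner node maps to its two children;
# a node that is not a key of the dictionary is a leaf.
Tree = Dict[str, List[str]]


def node_type(tree: Tree, node: str) -> str:
    """Classify a node as 'leaf', 'vdl', 'vhl' or 'vnl'."""
    if node not in tree:
        return "leaf"
    children = tree[node]
    if len(children) != 2:
        raise ValueError("this sketch assumes a binary tree")
    leaf_children = sum(1 for child in children if child not in tree)
    # 2 leaf children -> vdl, 1 -> vhl, 0 -> vnl
    return {2: "vdl", 1: "vhl", 0: "vnl"}[leaf_children]


tree: Tree = {
    "n1": ["A", "B"],    # both children are leaves
    "n2": ["C", "D"],    # both children are leaves
    "n3": ["n1", "n2"],  # no child is a leaf
    "n4": ["n3", "E"],   # exactly one child is a leaf
}
for node in ["n1", "n2", "n3", "n4"]:
    print(node, node_type(tree, node))
```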
For node v ∈ V(T), let O be the gene block assigned to v, we define:
v.gene[g]: the set that represents the choice of including gene g in O. There are only 3 possible cases.
v.gene[g] = {1}: this means that gene g has to be in O.
v.gene[g] = {0}: this means that gene g cannot be in O.
v.gene[g] = {0,1}: this means that gene g can either be in O or not.
Ig(v): the identity function of gene g in O. It takes the value 0 if g does not appear in O and 1 otherwise.
v.dup[g]: the set that represents the duplication status of gene g in O. There are only 3 possible cases.
v.dup[g] = {1}: this means that gene g has to be duplicated in O.
v.dup[g] = {0}: this means that gene g cannot be duplicated in O.
v.dup[g] = {0, 1}: this means that gene g can either be duplicated or not in O.
Gene(O): the set of genes in O.
Dup(O): the set of genes that are duplicated in O.
HasLeaf(v): the set of leaf nodes that can be reached from node v in postorder traversal.
FREQg(v): frequency of gene g in HasLeaf(v).
DUPg(v): frequency of duplications of gene g in HasLeaf(v).
Figure 1. The reference taxon has an experimentally verified operon composed of genes a, b and c. The leaf nodes are extant species with different orthoblocks whose genes are orthologous to the operon in the reference taxon. As an example of events, a duplication event (of a gene b homolog) occurs in the pairwise comparison of A and D. Of the internal nodes, two are vdl type nodes, one is a vhl type node, and one is a vnl type node. See Methods for details. (Based on [4])
Figure 1 shows an example of orthoblocks and node types on a phylogenetic tree.
2.2 Orthoblock distance functions
The distance between any two homologous gene blocks O, O' found in target organisms is defined as in [4]. We provide the definition and the formula to calculate each distance function as follows:
1. Split distance (ds) is the absolute difference in the number of relevant gene blocks between the two taxa. Relevant gene blocks between two taxa can be computed by only including the genes that appear in both taxa. We define Rel(O, O') as relevant gene blocks of O to O' and formalize the split distance as:
$$d_s(O, O') = \bigl|\, |Rel(O, O')| - |Rel(O', O)| \,\bigr|$$
Example: for the reference gene block with genes (abcdefg), genome A has blocks O:= ((ab), (def)) and genome B has O':= ((abc), (de), (fg)). We then compute the relevant gene blocks Rel(O, O') = ((ab), (def)) and Rel(O', O) = ((ab), (de), (f)) (removing genes c, g). Therefore, ds(O, O') = |2 − 3| = 1.
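The split-distance example can be reproduced with a short sketch. It assumes single-letter gene names and represents an orthoblock as a list of strings, one string per gene block; the function names are illustrative and are not taken from the ROAGUE code.

```python
from typing import List, Set


def genes_of(blocks: List[str]) -> Set[str]:
    """All genes appearing in any gene block of an orthoblock."""
    return set("".join(blocks))


def relevant_blocks(o: List[str], other: List[str]) -> List[str]:
    """Rel(O, O'): the gene blocks of O restricted to genes that appear in
    both O and O'; blocks that become empty are dropped."""
    shared = genes_of(o) & genes_of(other)
    filtered = ("".join(g for g in block if g in shared) for block in o)
    return [block for block in filtered if block]


def split_distance(o: List[str], other: List[str]) -> int:
    """Absolute difference in the number of relevant gene blocks."""
    return abs(len(relevant_blocks(o, other)) - len(relevant_blocks(other, o)))


block_a = ["ab", "def"]        # genome A: O  = ((ab), (def))
block_b = ["abc", "de", "fg"]  # genome B: O' = ((abc), (de), (fg))

print(relevant_blocks(block_a, block_b))  # ['ab', 'def']
print(relevant_blocks(block_b, block_a))  # ['ab', 'de', 'f']
print(split_distance(block_a, block_b))   # 1
```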
2. Duplication distance (du) is the pairwise count of duplications between two gene blocks. We define Dif(O, O') as the set of duplicated genes of gene block O, so that these genes also appear in O' but are not duplicated in O'. We formalize the duplication distance as:
$$d_u(O, O') = |Dif(O, O')| + |Dif(O', O)|$$
Example: For a reference gene block (abcde), genome A has gene block O = ((abd)) and genome B has gene block O' = ((abbcc)), respectively. The ortholog of gene b is duplicated in genome B, creating a duplication distance du(O, O') of 1. However, since gene c does not exist in O, it has no bearing on the duplication distance between the homologous gene blocks O and O'. We then compute Dif(O, O') = ø and Dif(O', O) = {b}. Therefore, du(O, O') = 0 + 1 = 1.
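The duplication-distance example can be checked in the same way. This sketch again assumes single-letter gene names and an orthoblock stored as a list of strings; `duplicated` and `dif` are illustrative names, not functions from the ROAGUE code.

```python
from collections import Counter
from typing import List, Set


def duplicated(blocks: List[str]) -> Set[str]:
    """Genes that occur more than once in the orthoblock."""
    counts = Counter("".join(blocks))
    return {gene for gene, n in counts.items() if n > 1}


def dif(o: List[str], other: List[str]) -> Set[str]:
    """Dif(O, O'): genes duplicated in O that also appear in O'
    but are not duplicated in O'."""
    present_in_other = set("".join(other))
    duplicated_in_other = duplicated(other)
    return {
        gene
        for gene in duplicated(o)
        if gene in present_in_other and gene not in duplicated_in_other
    }


def duplication_distance(o: List[str], other: List[str]) -> int:
    """Sum of the sizes of Dif(O, O') and Dif(O', O)."""
    return len(dif(o, other)) + len(dif(other, o))


block_a = ["abd"]    # genome A: O  = ((abd))
block_b = ["abbcc"]  # genome B: O' = ((abbcc))

print(dif(block_a, block_b))                   # set()
print(dif(block_b, block_a))                   # {'b'}
print(duplication_distance(block_a, block_b))  # 1
```

Gene c is duplicated in genome B but absent from genome A, so it is excluded from Dif(O', O), matching the example.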
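3. Deletion distance (dd) is the number of orthologs that are in the homologous gene blocks of the genome of one organism, or the other, but not in both. In short, it is the size of the symmetric difference between the sets of orthologous genes of the two gene blocks O, O'. We formalize the deletion distance as:
$$d_d(O, O') = |Gene(O) \,\triangle\, Gene(O')| = |Gene(O) \setminus Gene(O')| + |Gene(O') \setminus Gene(O)|$$
In addition, the deletion distance can also be defined using the identity function:
$$d_d(O, O') = \sum_{g} |I_g(O) - I_g(O')|$$
where the sum runs over the genes of the reference gene block and I_g takes the value 1 if gene g appears in the gene block and 0 otherwise.
Example: For a reference gene block (abcde), genome A has gene block O = ((abd)) and genome B has gene block O' = ((abce)), respectively. Since only genes a and b appear in both genomes, the symmetric difference is {c, d, e} and dd(O, O') = 3.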
The duplication distance and the split distance both depend on the deletion distance. Intuitively, a gene g can only be duplicated in gene block O if it appears in O. The split distance depends on the relevant gene blocks from the two taxa, and hence on the genes that appear in both taxa. Using the three distance functions above, we define the total distance between any two homologous gene blocks O, O' as: $d(O, O') = d_s(O, O') + d_u(O, O') + d_d(O, O')$
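The two formulations of the deletion distance (symmetric difference and identity function) can be compared on the example gene blocks O = ((abd)) and O' = ((abce)). The sketch below assumes single-letter gene names and an orthoblock stored as a list of strings; the function names are illustrative only.

```python
from typing import List, Set


def genes_of(blocks: List[str]) -> Set[str]:
    """Gene(O): the set of genes in an orthoblock."""
    return set("".join(blocks))


def deletion_distance(o: List[str], other: List[str]) -> int:
    """Size of the symmetric difference of the two gene sets."""
    return len(genes_of(o) ^ genes_of(other))


def deletion_distance_identity(o: List[str], other: List[str], reference: str) -> int:
    """Same distance written as a sum over the reference genes of the
    absolute difference of presence indicators (0 or 1)."""
    genes_o, genes_other = genes_of(o), genes_of(other)
    return sum(abs(int(g in genes_o) - int(g in genes_other)) for g in reference)


reference = "abcde"
block_a = ["abd"]   # genome A: O  = ((abd))
block_b = ["abce"]  # genome B: O' = ((abce))

print(deletion_distance(block_a, block_b))                      # 3
print(deletion_distance_identity(block_a, block_b, reference))  # 3
```

Genes c, d and e each appear in exactly one of the two gene blocks, which gives a distance of 3 under both formulations.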

2.3 Problem definition
Let T be a tree, and G be the set of genes in a reference operon. We define Ω as the set of all possible orthoblocks over gene set G. Let λ: L(T) ↦ Ω be the labeling of the leaves (it assigns orthoblocks from Ω to the leaf nodes of T; this can include empty orthoblocks). We define the function $\hat{\lambda}: V(T) \mapsto \Omega$ to be an extension of λ on T if it coincides with λ on the leaves of T (it assigns an orthoblock to each node of T). If $\hat{\lambda}(v) = O$, we say that vertex v is labelled with orthoblock O. Furthermore, given orthoblock O, we define GeneBlock(O) as the set of gene blocks in O. Given a labelling $\hat{\lambda}$ and an edge (u, v) ∈ E(T), we define the distance between the two labellings of the endpoints u, v as $d(\hat{\lambda}(u), \hat{\lambda}(v))$ and the total distance function as
$$d(\hat{\lambda}) = \sum_{(u, v) \in E(T)} d(\hat{\lambda}(u), \hat{\lambda}(v)).$$
The Maximum Parsimony problem is now defined as follows: given a tree T, an operon gene set G, the orthoblock set Ω and a leaf labeling λ, find a labeling $\hat{\lambda}$ that extends λ and minimizes $d(\hat{\lambda})$.
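The objective of the Maximum Parsimony problem, the sum of edge distances under a full labeling of the tree, can be sketched as follows. For brevity the sketch reduces orthoblocks to gene sets and plugs in only the deletion distance as the edge distance; both simplifications, the tiny two-leaf tree, and all names are assumptions for illustration, not the full distance used by ROAGUE.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

GeneSet = FrozenSet[str]
Edge = Tuple[str, str]


def deletion_distance(a: GeneSet, b: GeneSet) -> int:
    """Stand-in edge distance: size of the symmetric difference."""
    return len(a ^ b)


def total_distance(
    edges: List[Edge],
    labeling: Dict[str, GeneSet],
    dist: Callable[[GeneSet, GeneSet], int],
) -> int:
    """Sum of the distances between the labels at the two ends of each edge."""
    return sum(dist(labeling[u], labeling[v]) for u, v in edges)


# Hypothetical tree: root r with two leaf children x and y.
edges = [("r", "x"), ("r", "y")]
leaf_labels = {"x": frozenset("abc"), "y": frozenset("ab")}

# Try a few candidate labels for the root and score each full labeling.
candidates = [frozenset("ab"), frozenset("abc"), frozenset()]
costs = {
    "".join(sorted(c)): total_distance(edges, {**leaf_labels, "r": c}, deletion_distance)
    for c in candidates
}
print(costs)  # {'ab': 1, 'abc': 1, '': 5}
```

Any distance function with the same signature can be passed as `dist`, so the same scoring loop would apply to a combined split, duplication and deletion distance.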
3 Approach
Here we explore two related Maximum Parsimony heuristic approaches, local and global, to reconstruct ancestral gene blocks.
3.1 Local Maximum Parsimony
Briefly, the local approach focuses on finding the optimal parent ancestral gene block given its child gene blocks. For each internal node u, let u1 and u2 be its 2 direct children. We present a greedy local optimization algorithm.
For proof of correctness and runtime, please refer to section 6.1.
3.2 Global Maximum Parsimony
In section 2.2, we determined that the split distance and duplication distance depend on the deletion distance. While finding the global minimum for each separate distance is simple, this dependency makes finding the global minimum of the aggregate of the three distances challenging. In the following example, we demonstrate the minimization of the deletion distance, and then of the split distance. After that, we provide an optimal solution that minimizes the aggregate sum of the two distances.
Given an inner node v and its two child nodes v1 and v2, let O be the gene block to be assigned to v. Consider the orthoblocks O1 and O2 of v1 and v2 respectively as:

We define the set of genes that appear in both O1 and O2 as S = {b, c, d, e, f}, and the union gene set of O1 and O2 as G = {a, b, c, d, e, f, g, k, o}. Any gene i ∈ S will contribute a deletion distance of 2 to $d_d(O, O_1) + d_d(O, O_2)$ if O does not contain gene i. Any gene i ∈ G but $i \notin S$ will contribute a deletion distance of 1 to $d_d(O, O_1) + d_d(O, O_2)$ whether or not O contains it. Hence, including exactly the genes from S in O gives $d_d(O, O_1) + d_d(O, O_2) = |G \setminus S| = 4$, which is the minimum deletion distance. On the other hand, if we just want to minimize the split distance, the most naive way is not including any genes in O. Then no relevant gene blocks remain between O and either child, therefore $d_s(O, O_1) + d_s(O, O_2) = 0$. However, if we choose to do it this way, our deletion distance becomes large: $d_d(O, O_1) + d_d(O, O_2) = 2|S| + |G \setminus S| = 14$. Evidently, decreasing the split distance might increase the deletion distance and vice versa.
If we focus on minimizing the deletion distance, then Gene(O) = S, which means that O has to include all genes in S. Then, the relevant gene blocks between O and its children O1,O2 become:

The split distance is then $d_s(O, O_1) + d_s(O, O_2) = 2$. If we remove gene f from Gene(O), the relevant gene blocks between O and its two children become:

Hence, by setting our gene block O as either Rel(O, O1) or Rel(O, O2), the deletion distance increases by 2 since we excluded a gene that is in S, while the split distance decreases by 2. Therefore, the new deletion distance is $d_d(O, O_1) + d_d(O, O_2) = 4 + 2 = 6$, and the new split distance is $d_s(O, O_1) + d_s(O, O_2) = 2 - 2 = 0$.
Consider another possibility: if we include gene g in Gene(O) (which does not increase the deletion distance), the relevant gene blocks between O and its children become:

By setting O:= b|cd|ef|g, the new split distance is ds(O, O1) + ds(O, O2) = 1 and the deletion distance is $d_d(O, O_1) + d_d(O, O_2) = 4$. Therefore, we achieve a lower aggregate sum of deletion and split distances (5 compared to 6). We can keep on adding or removing genes that appear in only one taxon. This process requires iterating through all the subsets of the symmetric difference $Gene(O_1) \,\triangle\, Gene(O_2)$, which takes exponential time. We therefore provide a heuristic approach that guarantees minimum deletion and duplication distances, but not split distances.
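The deletion-distance part of this argument can be checked by brute force. The sketch below uses hypothetical child gene sets chosen only to be consistent with S = {b, c, d, e, f} and G = {a, b, c, d, e, f, g, k, o}; how genes a, g, k and o are divided between the two children is an assumption and does not affect the summed deletion distance. Split distances are not modeled, since they depend on the block structure.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Iterator

S = frozenset("bcdef")      # genes present in both children
G = frozenset("abcdefgko")  # union of the children's genes

# Hypothetical gene sets for the two children, consistent with S and G.
GENES_O1 = frozenset("abcdefg")
GENES_O2 = frozenset("bcdefko")
assert GENES_O1 & GENES_O2 == S and GENES_O1 | GENES_O2 == G


def deletion_cost(candidate: FrozenSet[str]) -> int:
    """Summed deletion distance from a candidate parent gene set to both children."""
    return len(candidate ^ GENES_O1) + len(candidate ^ GENES_O2)


def subsets(items: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """All subsets of items; there are 2**k of them for k items."""
    ordered = sorted(items)
    for size in range(len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


optional = GENES_O1 ^ GENES_O2  # genes found in only one child: a, g, k, o
candidates = [S | extra for extra in subsets(optional)]

print(len(candidates))                      # 16 candidate gene sets
print({deletion_cost(c) for c in candidates})  # {4}: optional genes never change the cost
print(deletion_cost(frozenset()))           # 14 when the parent is empty
print(deletion_cost(S - {"f"}))             # 6 when a shared gene is dropped
```

All 16 candidates tie on deletion distance, so choosing among them rests entirely on the split distance, and the number of candidates doubles with every additional gene found in only one child.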
4 Results and Discussion
We used E. coli and B. subtilis genomes as gold standards for deriving operons from Gram-negative and Gram-positive bacteria, respectively. We picked these two species because both have well-annotated genomes, including experimentally verified operons. We applied our method to groups of Gram-negative and Gram-positive bacteria, using the operons experimentally identified in E. coli K-12 and B. subtilis str. 168 for the two groups, respectively.
4.1 Operons from Escherichia coli
We chose E. coli as a representative of proteobacteria, a major group of Gram-negative bacteria. Here, we examine 33 taxa of proteobacteria from [4]. Our selection resulted in a set of proteobacteria species comprising three ε-proteobacteria, six α-proteobacteria, seven β-proteobacteria and 17 γ-proteobacteria. The latter includes the reference species E. coli. Our selection included two γ-proteobacteria insect endosymbionts: Buchnera aphidicola and Candidatus Blochmannia. These two species have unusually small genomes due to their endosymbiotic nature, and display massive gene loss. We reconstructed ancestors for the following operons from E. coli: atpIBEFHAGDC, paaABCDEFGHIJK, rbsDACBKR, and the regulon bamA-skp-lpxD-fabZ-lpxAB-rnhB-dnaE.
atpIBEFHAGDC
The atpIBEFHAGDC operon codes for F1F0-ATPase, which catalyzes the synthesis of ATP from ADP and inorganic phosphate [12]. ATP synthase is composed of two fractions: F1 and F0 [13]. The F1 fraction contains the catalytic sites and its proteins are coded by five genes (atpA, atpC, atpD, atpG, atpH) [13]. The F0 complex constitutes the proton channel and its proteins are coded by three genes: atpF, atpE, and atpB.
Figure 2. A phylomatrix of gene block atpIBEFHAGDC. Each matrix square depicts the degree of relative conservation of the event between any two species. Blue is more conserved, red is less conserved. Left to right: conservation of deletions, duplications, splits. z-score value calculated as in [4]. As can be seen, there are few deletion and split events, and no duplication events in the pairwise comparison of this gene block, showing a high conservation. Reproduced from [4] under Creative Commons CC-BY-NC 4.0 license. A larger version can be found here http://iddo-friedberg.net/operon-evolution/
Figures 3 and 4 show ancestral reconstruction using the local and global maximum parsimony algorithms, respectively. Both local and global reconstructions consistently place the orthoblocks atpACDGH and atpBF in the most recent common ancestors of the different Gram-negative bacteria. This finding agrees with the long-standing hypothesis that the F0 and the F1 fractions have evolved separately, with the respective fractions having homologs in the hexameric DNA helicases and with flagellar motor complexes. Although we find the gene atpI in several species, the reconstruction predicts that atpI is not in the same cluster with other genes. Gene atpI is not an essential component of the F1F0-ATPase[14]. Another interesting finding is the duplication of atpF in ε-proteobacteria, which appears to predate their common ancestor. Note that all genes exist as a gene block even in the endosymbionts Blochmannia and B. aphidicola.
Figure 3. Ancestral reconstruction of operon atpIBEFHAGDC using the local optimization approach. Brown: ε-proteobacteria; blue: α-proteobacteria; black: β-proteobacteria; pink: γ-proteobacteria.
Figure 4. Ancestral reconstruction of operon atpIBEFHAGDC using the global optimization approach. Brown: ε-proteobacteria; blue: α-proteobacteria; black: β-proteobacteria; pink: γ-proteobacteria.
The ε-, α-, β-, and γ-proteobacteria species all have a conserved intact F1 complex (coded by the atpACDGH cluster), which predates their common ancestor. The genes coding for the F0 complex in ε-proteobacteria (atpB, atpE, atpF) are not in the same cluster as the genes making up F1. Furthermore, it is unclear whether the gene split that is only found in ε-proteobacteria is a split that predates the last common ancestor with the other proteobacteria clades, or whether it is a split introduced in the ε-proteobacteria. From the reconstructions provided, the scenario appears to be the latter. However, this observation may also be a result of the small number of species studied here. The species in the ε- and α-proteobacteria display a known duplication of gene atpF. atpF' appears as a sister group to atpF[15].
Figure 5. Phylomatrices of gene block paaABCDEFGHIJK, each showing the degree of relative conservation of the event between any two species. Left to right: deletions, duplications, splits. Blue to red scale is high-to-low conservation z-score [4]. A larger version can be found here http://iddo-friedberg.net/operon-evolution/
Figure 6. Ancestral gene block reconstruction of paaABCDEFGHIJK using the local reconstruction approach. Clade colors from top: brown: ε-proteobacteria; blue: α-proteobacteria; black: β-proteobacteria; pink: γ-proteobacteria. Asterisks in front of species names indicate that a minimal orthoblock (two or more proximal orthologs to the reference operon) was not found.
Figure 7. Ancestral gene block reconstruction of operon paaABCDEFGHIJK using the global reconstruction approach. Color coding is the same as in Figure 6.
paaABCDEFGHIJK
The operon paaABCDEFGHIJK codes for genes involved in the catabolism of phenylacetate[16]. The ability to catabolize phenylacetate varies greatly between proteobacterial species, and even among different E. coli K-12 strains. In contrast with the atpIBEFHAGDC operon, which is highly conserved across many species, the operon paaABCDEFGHIJK is only found in full complement as an operon in some E. coli K-12 strains and some Pseudomonas putida strains. The orthoblock paaABCDE is found in three Bordetella species and also in Bradyrhizobium diazoefficiens. The products of paaA, paaB, paaC and paaE make up the subunits of the 1,2-phenylacetyl-CoA epoxidase, and paaD is hypothesized to form an iron-sulfur cluster with the product of paaE[17]. We did not find orthologs in the endosymbionts B. aphidicola and Blochmannia.
In both the local and global reconstructions, only the ancestor of the Bordetella species has a combination of the paaABC complex with paaE. According to Grishin et al. [17], only this combination has full activity. In addition, the global approach only predicts gene blocks for the ancestors of α- and most of the γ-proteobacteria. Only the common ancestor of the Bordetella genus contains the cluster paaABCE. It has been confirmed that this cluster of genes is identical to those of E. coli [18]. In both approaches, genes paaF and paaG are not found to be in the same gene blocks, hence the ancestors are most likely missing the hydratase-isomerase complex. The paaJ thiolase catalyzes two steps in phenylacetate catabolism[19, 20, 21]. In addition, paaH is the NAD+-dependent 3-hydroxyadipyl-CoA dehydrogenase involved in phenylacetate catabolism[19]. Therefore, it is reasonable that genes paaJ and paaH appear in most of the ancestral nodes that have gene blocks.
The study of these two operons illustrates how gene blocks with different degrees of conservation evolve. In both cases, the global approach performs better in terms of minimizing events. For brevity, we only provide the global ancestral reconstruction henceforth.
Figure 8. Ancestral reconstruction of gene block bamA-skp-lpxD-fabZ-lpxAB-rnhB-dnaE. Brown: ε-proteobacteria; blue: α-proteobacteria; black: β-proteobacteria; pink: γ-proteobacteria.
Figure 9. Ancestral reconstruction of rbsDACBKR. Brown: ε-proteobacteria; blue: α-proteobacteria; black: β-proteobacteria; pink: γ-proteobacteria.
bamA-skp-lpxD-fabZ-lpxAB-rnhB-dnaE
The operon bamA-skp-lpxD-fabZ-lpxAB-rnhB-dnaE participates in DNA replication, repair, immune reaction, and signal transduction. It is actually a complex regulon with several promoter sites [22]. Gene bamA is highly conserved [23] and is required for Gram-negative outer membrane protein assembly [24, 25]. Gene dnaE encodes the alpha-catalytic subunit of the DNA polymerase III holoenzyme [26]. The reconstruction shows that these two genes appear in all the ancestors. Note that bamA is predicted not to be in the same regulatory block as the rest of the operon in γ-proteobacteria. Likewise, gene dnaE is not in the same block as the rest of the operon in β-proteobacteria. However, these two splits should not affect the overall operon functionality, since neither bamA nor dnaE is found to form a subunit with another gene in the operon. In addition, the cluster lpxD-fabZ-lpxA is involved in lipid A biosynthesis in many bacteria[27, 28].
rbsDACBKR
The operon rbsDACBKR expresses genes associated with the ribose transport complex in E. coli [29, 30]. The rbsABC genes compose an ATP-dependent ribose transporter that is a member of the ATP-Binding Cassette (ABC) superfamily of transporters [31]. Mutations in each of the components eliminated transport of ribose at an external concentration of 1μM, indicating that the components make up a transport system that is responsible for high-affinity ribose transport [32]. From the reconstruction, we observe that the core gene cluster of the transporter rbsABC starts forming in three different inner nodes: (1) the common ancestor of α-proteobacteria; (2) γ-proteobacteria (genus Pseudomonas), and (3) γ-proteobacteria (Enterobacteriaceae, Pasteurellaceae families). The three other genes, rbsK, rbsD and rbsR are not essential for ribose transport. rbsR codes for the repressor protein which regulates the operon [33, 34]. rbsD, and rbsK are involved in the conversion of D-ribose to D-ribose 5-phosphate [35]. The gene block is most complete in the γ-proteobacteria, but the core transport genes appear also at the common ancestors of the α-proteobacteria.
4.2 Operons from Bacillus subtilis
B. subtilis is a Gram-positive, spore-forming bacterium commonly found in soil, and is also a normal gut commensal in humans. It is a model organism for Gram-positive spore-forming bacteria, and as such its genome of about 4,450 genes is well annotated. Here we used ROAGUE to reconstruct the ancestors of two B. subtilis gene blocks across 33 species. We selected species from the order Bacillales using PDA. Species from the following families were selected: Bacillaceae (including the reference organism B. subtilis), Staphylococcaceae (genera Macrococcus and Staphylococcus), Alicyclobacillaceae, Listeriaceae and Planococcaceae.
lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO
Gene block lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO facilitates the heat shock response in B. subtilis and the gene block hrcA-grpE-dnaK-dnaJ was the first identified heat shock operon within Bacillus spp[36]. The four genes hrcA, grpE, dnaK, dnaJ (e,c,b,a in Figure 10) form a tetracistronic structure, which is essential to the heat shock response role[37]. The four genes are proximal in all the species examined, and form the core of the orthoblock. Overall, this operon is quite conserved, and the ancestral reconstructions are highly similar to the reference operon.
Figure 10. Ancestor reconstruction of lepA-hemN-hrcA-grpE-dnaKJ-yqeTUV. Family color codes: brown: Macrococcus; black: Paenibacillaceae; blue: Staphylococcus; green: Alicyclobacillaceae; pink: Bacillaceae; purple: Bacillales Family XII; magenta: Listeriaceae bacteria; yellow: Planococcaceae.
Figure 11. Ancestor reconstruction of mmgABCDE-yqiQ. Family color codes are the same as in Figure 10.
mmgABCDE-prpB
The operon mmgABCDE-prpB is expressed during endosporulation [38]. The breakdown of fatty acids by the mmgABC subunit is a means of attaining energy to drive the cell's preparation for dormancy [39]. Hence, it is reasonable to see that the common ancestor has this subunit. In addition, gene mmgD and gene prpB/yqiQ are predicted to be proximal. Several studies predicted that genes mmgD, prpB, and prpD encode the proteins of the putative methylcitrate shunt [40]. However, they did not specify whether deletion mutations might contribute to a loss of functionality.
5 Conclusions
We developed ROAGUE, a method for the reconstruction of ancestral gene blocks using maximum parsimony. ROAGUE accepts a set of bacterial genomes, a species tree, and a reference gold-standard orthoblock. ROAGUE then identifies the orthoblocks to the gold-standard genome in all the species provided, using the best-orthoblock identification method developed in [4]. ROAGUE then proceeds to reconstruct the ancestral genomes using local or global parsimony. ROAGUE's output contains the species tree with the extant orthoblocks and the reconstructed orthoblocks. We provided several examples of ancestral gene block reconstructions based on reference operons in E. coli and B. subtilis.
A few interesting observations emerge regarding conservation and ancestry of operons. It appears that essentiality (the trait of being essential to life) and the formation of a protein complex are the main drivers for gene block conservation. This is most apparent in the atp operon coding for F1F0-ATPase in proteobacteria. There are few evolutionary events identified in the atp operon ancestry. The ribose transporter block also seems to preserve the core ribose transporter (rbsABC), but not the ribose phosphorylation genes rbsD and rbsK.
ROAGUE does not account for horizontal gene transfer, which is considered to be a major driver in operon evolution[7]. This can ostensibly be dealt with by reconciling a species tree with an operon tree, in the same way that phylogenomic analyses do for gene trees and species trees[41]. In addition, the gene order in a gene block is ignored. While the relationship between gene organization and expression in operons is not well understood, it is clear from several studies that gene order does have an effect on expression and on the functionality of the operon in general (e.g.[42, 43, 44]). Adding the parameters of horizontal gene transfer, gene order preservation, or both to ROAGUE would be highly valuable. We invite the community to contribute to ROAGUE, as well as use the tool for phylogenetic analyses of bacterial gene blocks.
Acknowledgements
I.F. was supported, in part, by National Science Foundation award ABI-1551363. O.E. was supported, in part, by National Science Foundation award CCF-1617626.