Summary
Two datasets, the geologic record and the genetic content of extant organisms, provide complementary insights into the history of how key molecular components have shaped or driven global environmental and macroevolutionary trends. Changes in global physicochemical modes over time are thought to be a consistent feature of this relationship between Earth and life, as life is thought to have been optimizing protein functions for the entirety of its ∼3.8 billion years of history on Earth. Organismal survival depends on how well critical genetic and metabolic components can adapt to their environments, reflecting an ability to optimize efficiently to changing conditions. The geologic record provides an array of biologically independent indicators of macroscale atmospheric and oceanic composition, but provides little in the way of the exact behavior of the molecular components that influenced the compositions of these reservoirs. By reconstructing sequences of proteins that might have been present in ancient organisms, we can identify a subset of possible sequences that may have been optimized to these ancient environmental conditions. How can extant life be used to reconstruct ancestral phenotypes? Configurations of ancient sequences can be inferred from the diversity of extant sequences, and then resurrected in the lab to ascertain their biochemical attributes. One way to augment sequence-based, single-gene methods to obtain a richer and more reliable picture of the deep past, is to resurrect inferred ancestral protein sequences in living organisms, where their phenotypes can be exposed in a complex molecular-systems context, and to then link consequences of those phenotypes to biosignatures that were preserved in the independent historical repository of the geological record. As a first-step beyond single molecule reconstruction to the study of functional molecular systems, we present here the ancestral sequence reconstruction of the beta-carbonic anhydrase protein. We assess how carbonic anhydrase proteins meet our selection criteria for reconstructing ancient biosignatures in the lab, which we term paleophenotype reconstruction.
Main Text
The history of life on Earth has left two main repositories of evidence from which we can try to reconstruct it: the geological record, and the extant genetic diversity of organisms. Contemporary organisms, however, can be complex and cryptic vehicles for information about their history [1, 2]. Many genetic signatures in known life have been largely overwritten due to changing conditions of natural selection, evolutionary convergences, or simply genetic drift [3, 4]. However, functional links between the evolution of a protein (or a metabolic network) and biosignatures preserved in the geological record may provide clues to the timing and origin of major phylogenetic groups of organisms, including clades that went extinct and are no longer accessible for direct comparative study [5-9].
One way to connect the geological and genomic data sets is through ancestral sequence reconstruction [10-12]. One first infers ancestral sequences of biological molecules by phylogenetic reconstruction methods, and then uses these proposed sequences to synthesize models of paleoenzymes either computationally or experimentally [13-16]. In some cases the enzymes may be used to replace their modern counterparts in living organisms, being brought back to life as ‘revenant genes’, to obtain in situ expressions of “paleo-phenotypes” of the organisms that once harboured them [17, 18].
Efforts to reconstruct paleophenotypes require careful design, to avoid misinterpreting artefacts of reconstruction bias, or using host organisms that may be not faithfully reproduce ancestral phenotypes because too many of their systems have since adapted to other conditions. We present here a paleophenotype reconstruction approach that builds on prior efforts in paleoenzymology, extending the utilization of inferred ancestral gene / enzyme sequences engineered within modern organisms. Our functional framework builds on applying paleophenotypes to complex biology and on experimentally testing historical geobiological models and hypotheses. We begin by outlining the logical motivation for paleophenotype reconstruction and describe the criteria that should be addressed as a basis for selecting an enzymatic system for paleophenotype reconstruction at the systems level. We then use ancestral sequence reconstruction to determine the evolutionary history of a critical component of the photosynthetic CO2 fixation pathway – the beta-carbonic anhdyrase protein – and critically evaluate the selection criteria for candidate revenant genes suitable for paleopheotype reconstruction studies.
Our three linked goals are: 1) to learn (by solving concrete cases) when one must look beyond single-gene phylogeny to reconstruct entire functioning molecular systems, in order to correctly link enzyme properties to geological signatures; 2) by studying cases such as Calvin-cycle carbon fixation, where an isotope signature is inherently linked to a functional criterion such as molecular selectivity and an environmental property such as oxygen activity, to demonstrate a consistent multi-factor reconstruction of an organism’s phenotype in its environmental context; and 3) to search for features in which ancient proteins may have been truly more primitive than any proteins that have survived in extant organisms, to understand the evolutionary progression from the first (we may suppose fitful) invasions of new modes of cellular life, metabolism, or bioenergetics, and the refined forms of modern organisms that have made it difficult to infer the paths through which these emergences could have taken place.
THE LOGICAL MOTIVATION FOR PALEOPHENOTYPE RECONSTRUCTION
To understand why – and for which systems – experimental paleophenotype reconstruction is likely to be an important scientific advance, it is helpful to reflect on the limited forms of information about evolutionary processes that are actually employed in conventional methods of sequence-based phylogenetic reconstruction. Efforts to map out more of the genotype-phenotype correspondence, whether through modelling or by in vivo expression, and to correlate these with independent evidence carried geologically, may be understood as a way to bring in other dimensions of information about evolution that can contribute to historical reconstruction.
Phenotype Information Can Augment Relatively Simple Sequence Substitution Models
The field of phylogenetic inference, after nearly a half century of dedicated work, has addressed most problems of consistent sampling and error estimation [19-27]. Yet the probability models that are the workhorses of most phylogenetic inference are disconnected from their context: they typically are site-local insertion, deletion, and substitution models with no semantics of the functioning objects produced.1 Often it is necessary to restrict probability models to evaluating substitution events independently at each site, to keep computations affordable especially for large data sets. However, such models are by construction incapable of reflecting interaction properties that can cause some joint variations to have very different likelihood to produce viable organisms than the marginal variations do independently.
Information about interaction effects can often be obtained from models of protein domain structure, folding, and function [30-37]. Learning how to use functional protein models to identify the most important non-local interactions and represent their effects on viability and fitness is one goal that can be pursued as more ancestral enzymes are reconstructed. Substitution probabilities must also be estimated jointly with alignments, and systematically biased alignment estimates can lead to mis-specified substitution models [38-41]. Information about folding and function can be particularly informative for ambiguous alignments, as substitutions or crossovers that preserve domain structures should yield viable organisms more often than those that would be incompatible with maintaining functional domains.
Synthetic-Biology Methods Offer Ways to Test the Internal Consistency of Reconstructions
Much current-generation phylogenetic inference, because of the big-data survey nature of its questions [42], yields independently-derived proposals for the presence or absence of genes in ancient genomes, along with putative sequences for ancient proteins [43-45]. However, the probability models generating these claims at present include no information (as part of the Monte Carlo generate-and-test cycle itself) about the consistency of the physiologies they predict. The use of synthetic biology methods to insert reconstructed genes into living organisms, or to test proposed molecular systems either in vitro or with modified genomes in vivo provides ways to test historical models at the system level. It can help bridge the gap between ancestral sequences inferred with algorithmically sophisticated but information-poor probability methods, and proposals for how they might have co-occurred in ancient cells.
PROPOSED FRAMEWORK FOR REBUILDING PALEOPHENOTYPES IN THE LABORATORY
We outline here three criteria in choosing enzyme systems for which a paleophenotype reconstruction and systems-engineering approach may be feasible and may yield interesting insights beyond those delivered by simple sequence-phylogenetic methods alone. They are directed both at properties of the enzymes and properties of the clades and environments in which these occurred over time (see Figure 1).
Geology: Is the problem geochronologically constrained? Does the protein system of interest mediate a biosignature that is recoverable from the rock record? Is there temporal structure in that biosignature that can be correlated with important evolutionary transitions in either enzyme context or function? Do major changes in enzyme function correspond to events of phylogenetic divergence, which can then be calibrated against geochronology?
Phenotype: Can information be provided by in vivo resurrection of the protein that resolves important ambiguities in the usual methods of ancestral sequence inference, or that shows important errors in the assumptions usually made about sequence inference? This is information we think of as being reported by the phenotype of the protein, whether it is revealed by resurrection or by computational modelling.
Ancestry: Do we have a current organism that is similar enough to the host organisms for the ancestral proteins that expressing them in our current organism will reveal the phenotypic characters that governed their function in the past? Is the proposed host a well-studied model organism? Are other essential components of metabolic pathways present in contemporary organisms also remnants of ancient life [46], and can their major evolutionary innovations also be inferred where these are significant to system functions?
Through reconstructing and examining the evolutionary history of contemporary components and then tying their phenotypes into biosignatures in mineral form, we can provide insight into innovations that are grounded in the rock record and thus in the geological and ecological context.
Enzymes involved in carbon-fixation such as Ribulose-1,5-bisphosphate carboxylase / oxygenase (Rubisco) proteins are thought to be one of the main causes of a distinct biosignature preserved in the rock record. This biosignature is revealed through comparison of 13C-isotope measurements between carbonate originally derived from atmospheric CO2 and organic carbon sequestered from biomass. 13C-isotope fractionation differences are the oldest record of living organisms, extending to at least ∼3.5 billion years in the past [47-50]. Rubisco is the distinctive catalyst and putative isotopic bottleneck for the Calvin-Benson-Bassham cycle [51, 52] – the predominant photosynthetic Carbon-fixation pathway by volume – and is therefore at the heart of many fundamental questions about the co-evolution of early life and the development of biogeochemical cycles of the planet [53]. Indeed, while we do not know whether ancient Rubisco proteins exhibited paleophenotypic properties that are comparable to those produced by contemporary Rubisco, or how efficiently the ancestral Rubisco proteins functioned under ancient environmental conditions, Rubisco plays a pivotal role in biogeochemical interpretation of the C-isotope fractionation patterns in deep time [47-50, 54-56]. Undoubtedly, characterization of ancient Rubisco as a means of elucidating steps of biochemical adaptation and resulting protein biochemical and organismal behaviour at the key nodes of phylogeny would be crucial and would be applicable for a paleophenotype reconstruction approach, suitable for subsequent isotope fractionation measurement and even phenotypic resurrection of isotope fractionation through engineering these ancient genes inside modern cyanobacteria [57-59].
Rubisco proteins don’t function in isolation in a cellular system. In bacteria, carbonic anhydrase proteins support Rubisco activity, by mediating efficient CO2 transport into and around the cell [60]. Carbonic anhydrase converts bicarbonate to carbon dioxide in the carboxysomes where Rubisco is localized – organelles thought to have evolved as a consequence of the increase in atmospheric oxygen concentration in the ancient Earth [61-64] – thereby alleviating the stringency required of Rubisco for carboxylase over oxygenase activity and reducing the energy and carbon loss that result from photorespiration. Additionally, carbonic anhydrases have essential roles in facilitating the transport of carbon dioxide and protons in the intracellular space, across biological membranes and in the layers of the extracellular space [65].
To understand the later-diverging innovations of photosynthetic systems associated with the rise of oxygen, the drawdown of atmospheric and oceanic CO2, and the colonization of land, it may even become essential to jointly reconstruct innovations in Rubisco and carbonic anhydrase with other metabolic and compartmental systems that serve as Carbon-concentrating mechanisms [66, 67]. The joint evolution of pathways associated with photorespiration may also provide evidence about O2/CO2 discriminatory capabilities of ancestral enzymes as well as the interpretation of the ancient isotope signals.
As the first step beyond single-molecule reconstruction to the study of functional molecular systems, we present here the phylogenetic history of carbonic anhydrase enzymes and the ancestral sequence for the beta-carbonic anhydrase protein. We assess whether/how carbonic anhydrase proteins meet our selection criteria for paleophenotype reconstruction, and demonstrate that events of horizontal gene transfer in an evolutionary tree for a given gene need to be recognized prior to a laboratory paleophenotype reconstruction.
A UNIVERSAL CARBON SHUTTLE IN PHOTOSYNTHESIS AND BEYOND: CASE STUDY OF THE RECONSTRUCTION OF ANCIENT CARBONIC ANHYDRASE PROTEINS
Carbonic anhydrase is found in metabolically diverse species representing all three domains of life, [68-70]. The three main classes of carbonic anhydrase (alpha, beta and gamma) are not homologous and are thought to be a result of convergent evolution [71, 72]. Although molecular dates based only on sequence comparison should be regarded with caution, it has been suggested that both the gamma and the beta classes are ancient enzymes, which existed before the split between archaeal and bacterial domains [73, 74].
In a coarse assessment, carbonic anhydrase meets our paleophenotype selection criteria. It carries out an essential and ancient function in the carbon concentration machinery. While no particular study (to our knowledge) has attributed a specific biosignature to the activity of bacterial carbonic anhydrases, this enzyme mediates CO2 efflux in the carboxysome, potentially impacting the interpretation of Rubisco kinetic isotope selectivity, which is correlated with molecular CO2/O2 discrimination and turnover rate, in terms of the ambient CO2 and O2 activities in the cellular environment. The root of the gamma-class is inferred to have extended to approximately 4.2 billion years ago [73]. Moreover, the presence of carbonic anhydrase in thermophilic chemolithoautotrophs suggests that other ancient CO2-fixation pathways besides the Calvin cycle also depended on carbonic anhydrase function for efficient C-fixation [75, 76].
In this study, we focus on the beta-carbonic anhydrase, also called the prokaryotic carbonic anhydrase (although it has been found in eukaryotes as well). Beta-carbonic anhydrase is an ancient enzyme, it is widely represented in prokaryotes and despite its critical role for Earth’s biosphere, to date, not many studies focused on the molecular evolution of beta-carbonic anhydrases [77, 78]. Beta-carbonic anhydrases have been subdivided into four main clades, A to D. One group of enzymes belonging to the B clade of the beta-carbonic anhydrase is a probable example of neofunctionalization: these enzymes take CS2 (and not CO2) as a substrate [79].
The alignment of the 457 beta-carbonic anhydrase proteins revealed a strong conservation of about 200 amino-acids, of which a handful are identical in all or almost all sequences. The phylogenetic reconstruction of the carbonic anhydrase from 388 genomes confirms the existence of four, relatively well-supported clades (Figure 2 and Supp. Figure 1). Clade D seems to be the most distant one, and based on this we have chosen to root the tree from this group. The presence of archaeal homologs close to the root of three out of the four clades gives some credence to the hypothesis that duplication of the beta-carbonic anhydrase is very ancient, perhaps occurring before the archaea/bacteria separation. Except for the main clades and a few other shallower groups, the bootstrap values are altogether low, an ambiguity often observed when reconstructing deep phylogenies based on short protein alignments, due to small amount of genetic information available [80]. This is further emphasized by the fact that the main bacterial phyla – with the notable exception of cyanobacteria – were rarely reconstructed as monophyletic, even inside each clade. The incongruence between the carbonic anhydrase tree and the generally accepted tree of life can be explained by either one, or a combination of two, main factors: (a) the amount of phylogenetic information contained in this 200-site alignment might be too low to reliably approximate the true tree, or (b) the carbonic anhydrase has been horizontally transferred several times. Amongst the few clades, the two that are almost solely comprised of sequences from the same phylum are two groups of cyanobacteria, one belonging to clade B (bootstrap support: 98) and one to clade C (low bootstrap support) (Figure 2). In addition, sequences from cyanobacteria are found almost exclusively in these two clades. Despite the poor support values of the tree, even inside the group encompassing the cyanobacterial sequences of clade B, a likely scenario for the evolution of the carbonic anhydrase in cyanobacteria can be drafted. We hypothesize that the last common cyanobacterial ancestor had at least two copies of the gene, one clade B-like and the other clade C-like, and that one or the other copy was lost (and perhaps further copies gained) following different patterns in different cyanobacterial descendants. In consequence, we performed an ancestral reconstruction only for the well supported, monophyletic cyanobacterial group in clade B (see Supp. Figure 2).
Previously reconstructed phylogenetic history of the Rubisco proteins, an essential partner of carbonic anhydrase in the carboxysome, displays a highly supported phylogenetic tree, which recapitulates the organismal phylogeny [55, 81]. In contrast, here we show that the evolutionary history of the carbonic anhydrase is more complex: several enzymes with no common ancestor (the different classes of carbonic anhydrases) catalyse the same reaction. Even within classes of carbonic anhydrases, duplications and horizontal gene transfers seem frequent. This is an expected outcome of the greater generality and modular function of carbonic anhydrase compared to the specialized role of Rubisco: modular components of a metabolic system are much more readily transferred2 or re-evolved through convergent evolution. The interpretation that carbonic anhydrase function is general and modular is further suggested by its redundant presence in many organisms: the number of homologs of the beta-carbonic anhydrase ranged from none to as many as six carbonic anhydrase genes in cyanobacteria, with most genomes having more than one gene. This suggests that the cost of exchanging (by horizontal gene transfer) or losing one copy of carbonic anhydrase (either completely or by neofunctionalization, as in the case of the CS2 hydrolase [79] is small. Nevertheless, ancestral reconstruction, in parts of the tree where a phylogenetic signal can be established with significant confidence, as in the cyanobacterial group in the B clade of the beta-carbonic anhydrase, shows that the conserved parts of the alignment are even more conserved in the ancestors of the clade than in the extant species (Figure 3). None of the highly conserved residues differs in the ancestors of the group (Node 67, depicted with an asterisk on Figure 2, and its descendants, but not the extant species), suggesting that the function of the enzyme hasn’t significantly changed inside the group.
We further analysed the reliability of the ancestral carbonic anhydrase (Node 67) in the last cyanobacterial common ancestor by examining the posterior probabilities for each reconstructed residue (Figure 4). Multiple alignment of the reconstructed ancestor sequence with the sequences from all of the known extant species shows that the sequence of the large domain (located on the N-terminus side) can be confidently established (Figure 4). This fragment ranges from residues 40 to 240, which covers the functionally important zinc-binding core [83, 84]. On the other hand, the other parts of the reconstructed protein sequence are less reliable, as evidenced by the fact that the second- and third most probable amino-acid are closer to the most probable one (Figure 4).
SIGNATURES OF PRIMITIVENESS: EXTENDING LESSONS LEARNED FROM PHYLOGENETIC RECONSTRUCTION TO MORE GENERAL PRINCIPLES OF MOLECULAR EVOLUTION
From joint phylogenetic/geochronological reconstructions such as our reconstruction of Rubisco and carbonic anhydrase, we may attempt to identify more general principles of molecular evolution, either by finding signatures of primitiveness – characters in which the first invaders of a new niche or new functionality still reflect what they were before, and are not yet well-adapted to their new mode of life – or by studying the dynamic by which forces of selection become displaced from one. molecular system to another, when some primitive, inherited molecular function is incapable of responding to selective criteria from the new environment that have become irresistible.
Relevant signatures of primitiveness can vary across protein families and ancestral functions. In some cases, it may be expected that proteins which are now sub-functionalized to specific substrates were once multifunctional, a change that we would expect to see [85-87] in a deep past when error rates in genome replication and also in translation should have been higher [88], favouring fewer and shorter genes, at the cost that each gene may have been required to catalyse multiple reactions in order for pathway formation to be possible.3
The very existence of carboxylase / oxygenase discrimination in Rubisco has the character of an emerging but stalled sub-functionalization. Molecular oxygen was by all evidence a minor component of the environments in which Rubisco emerged, and the enzyme mechanism was selected on the only relevant substrate: CO2. The ability of cyanobacteria to perform oxygenic photosynthesis is thought to have converted the early reducing atmosphere into an oxidizing one, which dramatically changed the composition of life forms on Earth by simultaneously enabling new life forms tolerant to oxygen and leading to the near-extinction of the existing ones [91, 92]. By enabling the massive proliferation of oxygenic photosynthesizers, Rubisco introduced the need for a substrate discrimination that had not existed when it arose, potentially creating conditions for its own failure. The reaction mechanism to which the whole enzyme structure is committed is one for which discrimination is costly and only partially successful, even under intense selection pressure. The result was displacement of selective pressure from Rubisco onto other enzymes such as carbonic anhydrase, and onto cellular ultrastructure in forming the carboxysome.
Due to the correlation between isotope selectivity and substrate discrimination in Rubisco, a further signature of primitiveness is suggested, which can readily be empirically tested. All modern Rubiscos fall along a rather tight linear regression between turnover rate and CO2/O2 discrimination, with a less-tightly correlated isotope shift [51, 52]. The apparently bright horizon, beyond which no Ribiscos are found, is the reason for the interpretation of an inherent trade-off in the mechanism that fixes CO2 using only the free energy of hydrolysis, which forces turnover to be sacrificed as the price of discrimination. The absence of dispersion on the low-performance side of the regression has been interpreted as evidence of evolutionary optimization: that turnover is always maximized against the futile cycle of photorespiration in the CO2/O2 environment of the enzyme. By studying turnover versus discrimination in ancestral Rubiscos, we can test whether they seem to reflect the same optimization horizon as modern Rubiscos. If not, one possibility is that the enzymes were more primitive; another is that naive sequence reconstruction methods miss essential information needed to identify the true ancestral form for this protein. Coupling an optimality analysis with functional measures of carbonic anhydrase will then allow us to compare the implied CO2/O2 environment to ambient conditions in eras suggested by the phylogenies of Rubisco and other molecular clocks or ancient biosignatures.
EXTRAPOLATING BIOLOGY BACK TO EARLIEST- OR PRE-BIOLOGICAL CONDITIONS
Comparative sequence reconstruction, even augmented by geochronology, can only reveal directly historical evidence within the era from which sequence divergences have been preserved to the present. The geochronological evidence may extend to earlier times – as is potentially the case for organic carbon signatures, although these become sparse near the time when life emerged on Earth about 4 billion years ago [93-95] – and it is plausible that complex cellular life also existed and evolved within these eras; but to understand them we will need interpretive methods beyond simple sequence comparison.
Phylogenetic reconstruction should move beyond sequence reconstruction and the functional deductions based on phenotypes that are heavily impacted by the sequence reconstruction quality. Indeed, early examples in the young field of paleoenzymology attempted to make inferences about the temperature of ancient environments that no longer exist, by interpreting ancestral protein sequences from organisms long-since extinct [96-99]. Its predictions should come to integrate enzymatic structure and folding information as well as systems-level effects on protein network interactions and physiology – hence the emphasis on “paleophenotype” rather than “paleosequence” or “paleostructure”.
Engineering ancient pathways whose behaviour could recapitulate certain past phenotypes and innovations has the potential to reconcile ambiguities in phylogenetic reconstruction. This may be realized by focusing on assessing those particular phenotypes that facilitate effective comparison to an independent historical record of component or organismal phenotype contained in the rock record. The same dynamical effects – error-prone replication, horizontal gene transfer [88, 100-103]– that tend to erase memory about genes and genomes in the deepest eras of life, also remove one of the aspects of biology that interferes most with modelling from first principles: the capacity for historically-contingent features to contribute essential context for function. If we can use a combination of deep sequence reconstruction, functional modelling, and geological constraints on phenotype, we may be able to identify the “rules of assembly” for very early living systems. These are the rules that, as we extend to epochs in which memory was less robust, should have governed strong evolutionary convergences, and for the earliest molecular systems that were bound in the most detail to their geological environments, they may permit a limited amount of prediction from first principles about what those systems could have been. The reconstruction of carbonic anhydrase demonstrates the potential for integrative investigation. Though the carbonic anhydrase tree shows that horizontal gene transfer is a barrier for deep time reconstructions of some enzymes using single gene trees, carbonic anhydrase’s intimate interactive relationship with other enzymes of the carbon uptake and carbon-concentrating mechanism apparatus leaves open the possibility of isolating its functional behaviour in conjunction with enzymes with more clearly resolved genetic histories such as Rubisco.
CONCLUSION
The ability of cyanobacteria to perform oxygenic photosynthesis is thought to have converted the early reducing atmosphere into an oxidizing one, which dramatically changed the composition of life forms on Earth by simultaneously enabling new life forms tolerant to oxygen and leading to the near-extinction of species acclimated to anoxic conditions. Much of this information is derived from the geologic record- evidence of carbon cycling (and biological activity) can be inferred from carbon isotopes which lies at the interfaces between enzymatic activity, organismal phenotype and the formation of sedimentary rocks [104]. As a corollary to these interpretive schemes, it is assumed that the controlled fixation of inorganic carbon to organic carbon is a precondition for the emergence of living systems. While selectivity in the carbon isotope composition of biologically produced organic matter is evidence that is preserved in the rock record that reflects the metabolic activity of ancient organisms, this record is increasingly problematic near the time when life emerged on Earth about 4 billion years ago. Means of investigating life’s early evolution and origins that complements the loss of a robust geologic record may prove insightful.
Characterizing paleophenotypes in the laboratory is now feasible with evolutionary techniques already available and new techniques currently under development. However, the best chance of succeeding is in the integration of ancestral sequence reconstruction, genetic engineering and laboratory evolution. Current efforts attempt to establish the extent to which biochemical properties of paleoenzymes may be correlated with biosignatures retrievable from the rock record, such as stable isotope ratios [57]. This behavioural information is critical to understanding ancient biological innovations and their effect upon (and shaping by) the Earth’s environment. It is likely that the specific genotype / phenotype relationship in present-day organisms has evolved from organisms with very different metabolisms. If so, when and where did specific extant phenotypes evolve? How conserved are the genes of interest through time? How frequently were these genes transferred to other organisms? What can paleophenotypes tell us about the origins of critical metabolic pathways? Reprogramming contemporary organisms by engineering their genomes with ancestral DNA is a fundamentally new methodology with which to extrapolate back to the origins of life. The technique is part of our broader approach that seeks to reconstruct macroevolutionary phenotypic trends across geologic time to ultimately infer conditions that approach the Last Universal Common Ancestor and possibly life’s origins.
Methods
Genome selection
Representative genomes for different taxonomic groups were selected using the software phyloSkeleton (manuscript in revision; available at https://bitbucket.org/lionelguy/phyloskeleton) as follows: one for each species of Streptococcus, one for each family in Proteobacteria and Cyanobacteria, and one for each order otherwise. After manual curation of poorly classified genomes, 388 genomes were retained (Supp. Table 1).
Carbonic anhydrase homologs identification
Genomes were search with HMMer v3.1b2 [105], using the PFAM profile for the so-called prokaryotic carbonic anhydrase (http://pfam.xfam.org/family/PF00484), which contains representatives of the beta-carbonic anhydrase clades A, B and C (but not D), as defined in [79]. All homologs with an E-value < 1e-10 were retained. In the 388 selected genomes, between 0 and 6 homologs of CA per genome were found, with most genomes having 0 (102 genomes), 1 (176 genomes) or 2 (72 genomes) copies. In total, 457 sequences were retrieved. The sequence length varied from 118 to 867 amino acids, with eukaryotic homologs being the longest. Most sequences were between 200 and 250 amino acids.
Sequence alignment
The 457 CA homologs found by phyloSkeleton and the 35 sequences that could be retrieved from Smeulders et al. (2011) were aligned with mafft-linsi v7.215 [106] and the resulting alignment were filtered to remove positions that had >50% gaps, using trimal [107]. The filtered alignment counted 200 positions and was visually inspected for any obvious misaligned regions.
Phylogenetics
RAxML 8.2.8 [108] with the PROTCATLG model was used to infer the phylogenetic tree depicted in Figure 2. A hundred parametric bootstraps were drawn to estimate branch reliability. Trees were visualized and edited with FigTree 1.4.3 (Andrew Rambaut, available from http://tree.bio.ed.ac.uk/software/figtree/).
Ancestral reconstruction
Ancestral reconstruction of the clade B carbonic anhydrase in cyanobacteria was done by first collecting all homologs of the carbonic anhydrase in cyanobacteria, choosing one representative genome per genus in cyanobacteria with phyloSkeleton, and adding genomes branching close to cyanobacteria, both clade A and clade B (Sorangium cellulosum, Sandaracinus amyloliticus, and Beijerinckia indica) in Figure 2, resulting in 83 genomes (Supp. Table 2). Sequences belonging to clade B were then extracted and uploaded to Phylobot [109]. Sequences were analyzed with muscle and msaprobs, and trees were drawn under the PROTCATLG and PROTGAMMALG models. The complete results are available at Phylobot: http://phylobot.com/38899544/. The result of the muscle alignment and the tree drawn under PROTCATLG were further analyzed and visualized in Jalview 2.10.1 [110].
Authors’ Contributions
All authors analyzed the data, contributed to writing the final manuscript and gave final approval for publication.
Competing Interests
The authors declare that they have no competing interests.
Funding Statement
We acknowledge funding from the John Templeton Foundation grant (ID# 58562) (BK), the NASA Astrobiology Institute Reliving the History of Life CAN7 (BK, ES) and the Earth-Life Science Institute Origins Network (BK, ES) and the Harvard Origins Initiative (BK). The views expressed here do not necessarily reflect the views of any particular organization.
Supplementary Tables
Supplementary Table 1: List of 388 genomes initially searched for the presence of carbonic anhydrase. The file is a partial output of phyloSkeleton.
Supplementary Table 2: List of 79 cyanobacterial genomes and 4 outgroups. The file is a partial output of phyloSkeleton.
Footnotes
↵1 Here we make a distinction that is standard in the information theory of symbol sequences, between syntactic information, derived from probability models for uninterpreted sequence strings like the stringsubstitution models used in molecular phylogenies, and semantic information, which informally we think of as deriving from functional interpretations of those strings, and which can sometimes be formalized as mutual information between the symbol sequences and their environments [28] Adami, C. 2002 Sequence complexity in Darwinian evolution. Complexity 8, 59-56, [29] Adami, C. 2004 Information theory in molecular biology. Physics of Life Reviews 1, 3-22.
↵2 Perhaps the most striking contrast between modular and non-modular molecular components in their degree of horizontal gene transfer is recognized between proteins and RNA associated with the ribosome, and the aminoacyl-tRNA synthases. The former are proxies for a reference Tree of Life, while the latter have been ubiquitously subject to waves of replacement [82] Woese, C. R., Olsen, G. J., Ibba, M. & Soll, D. 2000 Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process. Microbiol Mol Biol Rev 64, 202-236.
↵3 Such an interpretation has been advanced for homologies in proteins of the rTCA cycle in Aquificaceae [89] Aoshima, M. & Igarashi, Y. 2008 Nondecarboxylating and decarboxylating isocitrate dehydrogenases: oxalosuccinate reductase as an ancestral form of isocitrate dehydrogenase. J Bacteriol 190, 2050-2055. (DOI:10.1128/JB.01799-07), [90] Aoshima, M., Ishii, M. & Igarashi, Y. 2004 A novel enzyme, citryl-CoA synthetase, catalysing the first step of the citrate cleavage reaction in Hydrogenobacter thermophilus TK-6. Mol Microbiol 52, 751-761. (DOI:10.1111/j.1365-2958.2004.04009.x)., an alternative carbon-fixation cycle likely predating the Calvin cycle.
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].
- [7].
- [8].
- [9].↵
- [10].↵
- [11].
- [12].↵
- [13].↵
- [14].
- [15].
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].↵
- [38].↵
- [39].
- [40].
- [41].↵
- [42].↵
- [43].↵
- [44].
- [45].↵
- [46].↵
- [47].↵
- [48].
- [49].
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].
- [59].↵
- [60].↵
- [61].↵
- [62].
- [63].
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵
- [84].↵
- [85].↵
- [86].
- [87].↵
- [88].↵
- [89].↵
- [90].↵
- [91].↵
- [92].↵
- [93].↵
- [94].
- [95].↵
- [96].↵
- [97].
- [98].
- [99].↵
- [100].↵
- [101].
- [102].
- [103].↵
- [104].↵
- [105].↵
- [106].↵
- [107].↵
- [108].↵
- [109].↵
- [110].↵
- [111].