Reconstructing and Analysing The Genome of The Last Eukaryote Common Ancestor to Better Understand the Transition from FECA to LECA

It is still a matter of debate whether the First Eukaryote Common Ancestor (FECA) arose from the merger of an archaeal host with an alphaproteobacterium, or was a proto-eukaryote with significant eukaryotic characteristics well before endosymbiosis occurred. The Last Eukaryote Common Ancestor (LECA), as its descendant, is thought to be an entity that possessed functional and cellular complexity comparable to modern organisms. The precise nature and physiology of both of these organisms have been a long-standing, unanswered question in evolutionary and cell biology. Recently, a much broader diversity of eukaryotic genomes has become available, and this means we can reconstruct early eukaryote evolution with greater precision. Here, we reconstruct a hypothetical genome for LECA from modern eukaryote genomes. The constituent genes were mapped onto 454 pathways from the KEGG database covering cellular, genetic, and metabolic processes across six model species to provide functional insights into its capabilities. We reconstruct a LECA that was a facultatively anaerobic, single-celled organism, similar to a modern protist, possessing complex predatory and sexual behaviour. We go on to examine how many of these capabilities arose along the FECA-to-LECA transition period. We see at least 1,554 genes gained by FECA during this evolutionary period, with extensive remodelling of pathways relating to lipid metabolism, cellular processes, genetic information processing, protein processing, and signalling. We extracted the BRITE classifications from the KEGG database for the genes that arose during the FECA-to-LECA transition, and examined the types of genes that saw the most gains and what novel classifications were introduced. Two-thirds of our reconstructed LECA genome appears to be prokaryote in origin and the remaining third consists of genes with functional classifications that originate from prokaryote homologs in our LECA genome.
Signal transduction and post-translational modification elements stand out as the primary novel classes of genes developed during this period. These results suggest that the eukaryote common ancestors achieved the defining characteristics of modern eukaryotes largely by expanding on prokaryote biology and gene families.


Introduction
Ernst Mayr proposed that the differences between prokaryotes and eukaryotes are the biggest phenotypic split in all of cellular life 1 . The formation of the eukaryotic cell constitutes a major transition in life's history, and as such is a major focus for study in evolutionary biology [2][3][4] . Recently the archaeal supergroup, the Asgard Archaea, has challenged much of our understanding about how distinct eukaryotes and prokaryotes are, consistently showing a robust phylogenetic affiliation with eukaryotes [5][6][7] . Many proteins that until recently were thought to be eukaryote-specific now seem to have homologs in these Asgard genomes. Specifically, components of the ESCRT system, the TRAPP membrane-trafficking systems, the ubiquitin modifier system, the actin cytoskeleton, and an expanded range of GTPases have been found in the metagenomic assemblies that identified the Asgardarchaeota species [6][7][8] . Increasingly, the structures and biology previously used to characterize eukaryotes appear to have their origins within this archaeal group.
While the Asgard Archaea challenge the distinctiveness of many of the characteristic elements of the eukaryote lineage, the eukaryotic cell still possesses many notable unique features. For example, the eukaryotic cell is, on average, 1,000-fold larger by volume than bacterial and archaeal cells, so it is governed by different physical principles; prokaryotes can rely on free diffusion for intracellular transport, whereas eukaryotic cells possess elaborate cytoskeleton and endomembrane systems [9][10][11][12] . Eukaryotes have also compartmentalized their nuclear genetic material, their mitochondria and mitochondrion-related organelles (MROs), and their cytoplasm [13][14][15][16][17][18] . The organizational and structural complexity that eukaryote cells exhibit is accompanied by a multitude of sophisticated signaling networks, including the kinase-phosphatase and ubiquitin systems [19][20][21][22][23][24][25][26] . Eukaryote gene expression is distinct from prokaryote mechanisms, with the transcriptional regulation of individual genes occurring separately from their translation (which is regulated by microRNAs) and with epigenetic gene regulation via chromatin remodeling 13 , [27][28][29][30][31][32] .
In her seminal paper on endosymbiotic theory, Margulis proposed that the eukaryotic lineage was established from the merger of an archaeal host with an alphaproteobacterium, giving rise to the First Eukaryote Common Ancestor (FECA) 33 . This bacterial symbiont would give rise to the eukaryotic mitochondria and MROs, a theory supported by evidence that these organelles are monophyletic 34 . Modern hypotheses largely accept the central role of mitochondrial endosymbiosis, though disagreement persists on whether this occurred early or late during eukaryogenesis 34 . Mito-early hypotheses argue that the mitochondria emerged early (or first) in the process of eukaryogenesis, while mito-late hypotheses posit that mitochondria were incorporated after some or much of eukaryote complexity was acquired. While these hypotheses may agree on the nature of the mitochondrial ancestor, the nature of the host that engulfed it is hotly contested. Typically, mito-early hypotheses assume an archaeal host, whereas mito-late hypotheses tend to posit a host with some eukaryotic features, sometimes referred to as a "proto-eukaryote" 35,36 . The level of complexity of this purported proto-eukaryote varies widely among different hypotheses. The archezoa hypothesis argued that amitochondriate eukaryotes were primitive and that eukaryotic complexity was acquired before endosymbiosis; early molecular phylogenies that placed amitochondriate eukaryotes as early-branching lineages in the tree of eukaryotes supported this proposition 37 . This evidence was later found to be a phylogenetic artefact, and mitochondrially derived genes are present in the nuclear genomes of the supposed amitochondriate archezoa species 38 . But with the increasing body of literature on the surprising eukaryote-like properties of the Asgard Archaea [5][6][7]8 , it increasingly appears that prokaryote complexity has been consistently underestimated.
For example, while members of the Planctomycetes, Verrucomicrobiae, and Chlamydiae (PVC) bacterial superphylum do not have truly compartmentalized or nucleated cells, they demonstrate a striking level of intracellular structural complexity, and there are other examples of true membrane-bound vesicles in prokaryotes 39,40 .
With recent advances in DNA sequencing technology, an expanded repertoire of eukaryotic genomes has become available. Consequently, we can trace ancestral states in the eukaryotic lineage more accurately than ever before. In this paper, we have used this expanded diversity of genomes, in combination with homolog identification across all of life, to create a parsimonious reconstruction of ancestral character states and thereby reconstruct the LECA genome. We have used this reconstructed genome to uncover crucial aspects of the biology of eukaryotes during the period in which FECA became LECA. By reconstructing the LECA genome and then tracing the elements that arose during the transition from First to Last Common Ancestor, we attempt to better understand what molecular functions and biological processes were developed and expanded to make eukaryotes so unique.

Reconstruction of LECA Pathways.
We identified a total of 4,462 gene clusters that (a) had a broad taxonomic distribution across eukaryotes and (b) formed a monophyletic group on a phylogeny inferred from the gene sequences (herein referred to as monophyletic clusters). If we include gene clusters with a broad distribution of homologs among eukaryotes but whose genes were not monophyletic, we recover a total of 1,476 additional homologous families (herein referred to as clusters with complex history), making a sum total of 5,938 clusters. Both datasets were mapped onto 454 pathways from the KEGG database covering cellular processes, genetic information processing, environmental information processing, and metabolic pathways across six model species (Homo sapiens, Saccharomyces cerevisiae, Arabidopsis thaliana, Dictyostelium discoideum, Methanobrevibacter smithii, and Escherichia coli) (SI Figures S1-S5). These gene clusters, along with their FASTA gene annotations and their matches to KEGG components and corresponding KEGG pathways, are available to the interested reader as a neo4j network (SI Data 1). We have created a series of heatmaps of pathway completeness for each model organism used in the analyses that summarise the breadth of the data contained in the network (Figure 1).
We analysed a variety of pathways of interest including central metabolism, as well as many pathways involved in membrane biology, cell division, the spliceosome, endocytosis, and the phagosome (Figure 1, SI Figures S1-S5). These analyses confirmed what many previous studies have shown, proposed, and inferred about the biology and complexity of LECA 41 . Our reconstruction indicates that LECA possessed largely the same metabolic capacity as the majority of modern eukaryotes to process glucose into pyruvate, oxidize acetate into CO2 and water, and then produce ATP from NADH through electron transport (SI Figures S1-S5). Our reconstruction also suggests that mitosis, recombination, and sexual reproduction were all present in LECA in a form not dissimilar to modern mechanisms (SI Figures S1-S5). Our LECA reconstruction has a largely intact set of membrane biosynthesis pathways, as well as the capacity to manipulate these membranes for the purposes of phagocytosis and endocytosis (SI Figures S1-S5). Overall, our reconstruction strongly supports the idea, and the growing body of evidence, that LECA was a complex organism that would have more or less resembled a modern protist 41 .

Notable Expansion of Genes during The FECA-to-LECA transition.
Becoming a recognisable eukaryote took some time, and in this span between the first and last common ancestor of eukaryotes many biological functions were invented 42,43 . By assessing our reconstruction to identify genes present in LECA that lack prokaryote homologues, we can identify the biological functions that arose during the FECA-to-LECA transition period. This subset of our data, independently cross-referenced against the taxonomic gene tree data in the KEGG database, reveals 1,554 genes involved in 212 KEGG pathways across four species, comprising 95 unique pathways that were biological innovations of the FECA-to-LECA transition period. These pathways have been organised into eight categories based on their KEGG classifications: Energy Metabolism; Lipid Metabolism; Miscellaneous Metabolism; Cellular Processes; Genetic Information Processing; Protein Processing; Signalling; and Amino Acid and Nucleotide Metabolism (Figure 2). The categories showing the largest number of elements introduced during this time period are Cellular Processes, Genetic Information Processing, Protein Processing, Lipid Metabolism, and Signalling (SI Table 2). As one might predict, these results indicate that the majority of the changes in biology relate to membrane biology and how it was used to create distinct internal environments and structures within the cell, as well as to systems that regulate the interaction between these structures and complex behaviours.

Substantial Increases in the number of genes do not show a corresponding substantial increase in functional novelty.
The KEGG database employs a hierarchical system of functional annotations for genes and other biological entities, called BRITE 89 . We extracted these BRITE classifications for the genes that arose between FECA and LECA. While the most common was the generic designation "Enzymes", the other major groups of genes belonged to the following classifications: Membrane trafficking, Ubiquitin system, Exosome, Protein kinases, Messenger RNA biogenesis, Chromosome and associated proteins, and DNA repair and recombination proteins (Figure 3). Membrane trafficking as a major area of expansion is a predictable result, as organelles and the endomembrane system developed substantially between our nearest archaeal ancestor and modern eukaryotes, and transport between these subcompartments became necessary 44 . The ubiquitin system co-ordinates a large number of processes by targeting proteins for degradation, and is of primary interest in this context as it has roles in the control of signal transduction pathways, transcriptional regulation, and endocytosis 19,21 . Protein kinases are well-studied signalling components related to a large number of relevant biological processes of early eukaryotes, particularly intracellular signaling, nuclear transcription, translocation of transcription factors, and nuclear receptors 25,45 . Alongside the membrane trafficking and signal transduction elements, the remaining four functional classes relate to genetic information processing tasks. We see increases in genes involved in Messenger RNA biogenesis: mRNA now had to be moved between discrete locations within the cell, from the nucleus to the cytoplasm, and systems arose to facilitate this 46 . RNA quality control mechanisms were also further expanded upon within the lineage, including the exosome, a highly conserved complex responsible for RNA processing 47 .
DNA repair and recombination proteins increased in number; we see an expansion of repair machinery and mechanisms [48][49][50] , and, because early eukaryotes were required to interact with chromatin in order to accomplish these tasks, a corresponding increase in chromosome and associated proteins 51 . Dealing with the consequences of subcompartmentalised biology and increasing cellular size appears to have been the driving cause of many of these genetic additions to the early eukaryote lineage.

Novel Functional Classes
Both in the classes that see the largest increases and overall, there is little demarcation between functional classes that belong to prokaryote genes and those that belong to genes developed at the early origins of eukaryotes (Figures 4-8). However, we undertook to examine the genes that display a novel classification. During the transition to becoming LECA, FECA expanded the repertoire of prenyltransferases in the Lipid Metabolism pathways examined (Figure 4). Isoprenoid biosynthesis is not unique to eukaryotes, and prenylquinones are employed as electron carriers required for mitochondrial metabolism 52 . Protein prenylation is an important post-translational modification that plays a key role in the membrane association of signal transduction regulatory elements, and isoprenoids, in particular dolichol, are involved in the production of glycoproteins 52 . While glycosylation is found in all domains of life, it is much more restricted in prokaryotes 53 , and this expansion of the process appears to have been important at the beginning of the eukaryotic lineage.
Genetic Information Processing pathways see the addition of GTP-binding proteins, cytoskeleton proteins, and peptidases during the FECA-to-LECA transition (Figure 5). GTP-binding proteins are signal transduction elements, which control a wide variety of processes from metabolism to gene expression 54 . While peptidases are a generic class of enzyme that exists across all domains of life 55 , new family members arose in the basal transcription factors, homologous recombination, and the RNA transport pathways (SI Table 2). Cytoskeleton proteins control nuclear morphology and chromatin organization, so it is unsurprising that these types of elements were developed during this period 51 . Specifically, the element in question is the SUMO family protein SMT3 (SI Table 2); one of the main functions of sumoylation is nuclear-cytosolic transport, and it also plays a role in DNA repair and recombination 56,57 . Nuclear transport and transmitting signals to the nucleus posed new challenges for biological life and appear to have required new functional classes of genes.
Protein Processing pathways also see an addition of GTP-binding proteins and cytoskeleton proteins, as well as genes introduced to the transcription machinery and their repertoire of transcription factors (Figure 6). G-proteins have been discussed above in their myriad roles relating to signal transduction, and cytoskeleton proteins in this context interact with the organelles responsible for protein synthesis, modification, and trafficking 58 . XRN2 adds a transcription machinery component relating to RNA degradation, which promotes termination of transcription 59 , and XBP1 adds a transcription factor relating to cellular stress response in the protein processing in endoplasmic reticulum pathways 60 . Signal transduction and cytoskeleton elements, needed for coordinating the newly established structures of the cell, are a recurring theme across these pathway categories. We also see novel elements in how transcription and translation operated in early eukaryotes.
G protein-coupled receptors, Secretion system, Glycosylphosphatidylinositol (GPI)-anchored proteins, and Polyketide biosynthesis proteins are introduced to the Cellular Processes pathways during this period (Figure 7). The Polyketide biosynthesis proteins class sees the addition of the flavonoid biosynthesis gene CHS 61 , which is present in the circadian rhythm pathway of plants as a downstream target (SI Table 2). Polyketides are common to plants, bacteria, and some opisthokonts, largely fungi 61,62 . GPI-anchored proteins are minor plasma membrane components involved in signal transduction and, perhaps critically in this context, play a role in clathrin-independent endocytosis 63 . G protein-coupled receptors are part of the G-protein signal transduction pathways, another example of expansions in signal transduction machinery occurring during this evolutionary period. The secretion system elements belong to the Phagosome pathway and comprise all subunits of the ER membrane protein translocator Sec61 (SI Table 2). The bacterial gene SecY is widely reported in the literature as a homolog of Sec61 genes. However, when these genes were cross-referenced against the gene trees in the KEGG database they were not shown to possess bacterial homologs there, so we investigated this apparent false positive. Sequence analysis shows that little sequence similarity has been conserved between SecY and the A1 (e-value of 0.011) and A2 (e-value of 0.016) subunits, and almost none against the other subunits; with sequence conservation this poor, it is unsurprising that our method failed to identify this gene 64 .
Signaling pathways see the largest number of classes added, with G protein-coupled receptors, Cytokine receptors, Glycosylphosphatidylinositol (GPI)-anchored proteins, lectins, and glycosyltransferases (Figure 8). Given that many of the genes with novel classifications in the other categories relate to signal transduction, this is perhaps unsurprising. Many of the classes are shared with the other pathway categories, namely the signal transduction G proteins and GPI-anchored proteins; additionally, the cytokine receptors are a diverse class of signal transduction molecules 65 . Glycosyltransferases and lectins are involved in the production or recognition of glycoproteins 66,67 , which is a recurring theme in this data. The development of signalling pathway elements, both in these pathways and in the other categories discussed, reflects how critical systems of coordination become when cellular subcompartmentalisation and increasing cellular size make passive diffusion ineffective and active transport across internal membranes is necessary.

Discussion
Previous studies have shown that diverse representatives of the different eukaryote supergroups possess traits that can be mapped back to LECA 2,13,14,41,68-70 . The results of all these reconstructions, based for the most part on distributions of phenotypic data across eukaryotes, consistently paint a picture of a LECA that already possessed significant complexity and contained many of the signature functional systems and structures of the modern eukaryotic cell. The benefit of whole-genome analyses is that we have greater precision in understanding the functional potential of LECA, as well as the functions that predate this time period, so that we can better understand these traits. Our reconstructed LECA genome is composed of 4,462 gene clusters when restricted to monophyletic clusters and 5,938 gene clusters when we include gene families with a complex history. Future reconstructions will no doubt refine and expand these results to identify more hypothetical LECA genes that are not identifiable with the current sampling. It is, of course, also likely that there will be a subset of genes in LECA that have been lost in the last two billion years of evolution and will therefore remain impossible to reconstruct with this type of method regardless of sampling. Nonetheless, we can reconstruct a vivid picture of LECA and its lifestyle. Our results suggest that LECA was likely to have been a facultatively anaerobic heterotroph, generating energy by a number of mechanisms, though not capable of photosynthesis (Figure 1, Figures S1-S5). LECA was able to engulf its prey using phagocytosis, but it remains unclear whether the elements with prokaryotic origin were sufficient to give FECA a rudimentary ancestral form of the process, leaving us unable to infer support for either the phagotrophic or autotrophic models of eukaryotic origin 71 .
LECA had a functioning Golgi apparatus and active vesicle trafficking within the cell but was not capable of contemporary forms of apoptosis. The cells were quite complex and RNA processing was carried out by spliceosomes (Figure 1, Figures S1-S5). It is likely that LECA had an endoplasmic reticulum and engaged in protein processing in this organelle. LECA cells divided by a recognisable mitosis, and recombination and cell division were also occurring via meiosis (Figure 1, Figures S1-S5).
A substantial portion of the genes from the pathways we reconstructed arose during the FECA-to-LECA transitional phase of eukaryote evolution: 1,554 of the 5,938 genes in total. The majority of these came from pathways relating to genetic information processing, protein production and modification, membrane-bound organelles, signalling, and lipid biosynthesis. These all constitute areas of biological function that were greatly modified as nuclear material was moved into the nucleus and the processes of transcription and translation were isolated and localised to specific regions within the cell, forcing the machinery for them to be moved into the correct organelle and the products of these processes to be transported correctly 13,72 . As processes became specialised and sub-localised in their own organelles and vesicles, the categories of pathways affected are entirely in line with those that would have needed to be heavily modified 14,44 . Accordingly, the functional annotations of the genes involved show a large propensity to be involved in membrane trafficking, various aspects of genetic information processing, and signal transduction, consistent with establishing a system of sub-compartmentalised intracellular processes and the regulatory systems for controlling them. The inclusion of substantial additions to DNA repair and recombination proteins specifically is also interesting because the basis of meiosis has been argued to have involved the recruitment and modification of this class of proteins 73 .
The genes that display novel functional classes show a large overlap across the different categories, with G protein pathway components and glycoprotein biosynthesis and interacting factors each being seen in all but one of the largest categories: Lipid Metabolism pathways lack G protein pathway components and Genetic Information Processing pathways lack glycoprotein factors. The other classes we see that are widespread across these pathways are cytoskeleton proteins involved in coordinating both the nucleus and endomembrane system, and a plethora of receptors, both nuclear and cell surface.
The functional classes of genes that were created during this transition from FECA to LECA primarily appear to have prokaryote origins, which lends additional credence to an increasing trend in the literature. Given the plethora of research in recent years on prokaryote encroachment into many aspects of biology that have been traditionally considered eukaryote-specific [6][7][8]39,[74][75][76] , it is perhaps worth examining the basis for the competing theories of eukaryogenesis. The apparent conflict between mito-late and mito-early theories hinges on our understanding of the complexity of prokaryotes. The distinction between the theories becomes largely semantic; one man's "proto-eukaryote" is simply another's archaeal host, if we accept that archaea are increasingly eukaryote-like 6-8,39,74-76 . Overall, our results show an interconnected collection of signalling elements, trafficking and transport complexes, and nuclear and cell surface receptors, as well as changes to genetic information and protein processing, and structural elements relating to the newly forming organelles (Figures 4-8, SI Table 2). These cellular components appear to be a consequence of having to cope with the challenges of establishing subcompartments that need to be navigated in a coordinated manner and of functioning at a different scale to bacteria; signalling, membrane regulation, and trafficking become increasingly important as diffusion becomes less feasible as a mechanism to transport necessities around the cell 77 . These methods cannot disentangle whether the establishment of organelles or increasing cell size were downstream of the other or occurred in parallel.
However, it seems likely that a driving force for this process would be the increased bioenergetic potential eukaryotes gained from endosymbiosis, which, alongside the increasing availability of oxygen in their environment 78 , allowed levels of protein production orders of magnitude higher than life was ever capable of before 79,80 . The idea that mitochondria were necessary to fuel the size increase and subcompartmentalisation, which in turn led to the need for increases in membrane trafficking and signal transduction related gene functions, has enormous appeal due to the neatness of the concept: endosymbiosis and the resultant mitochondria as the defining feature of eukaryotic life and the driving force of all other observable differences.

Methods and materials

Genome choice
A dataset consisting of 32 taxonomically diverse eukaryote genomes, and 105 prokaryotes, comprising 36 Archaea and 69 Bacteria (total 855,880 genes) was acquired from publicly available genome databases, either Ensembl, Genomes version 33, or PATRIC (SI Table 1) 81,82 .

Gene Cluster Creation
Eukaryotes were selected to provide at least two representatives from all the major phylogenetic supergroups 83 , including the SAR supergroup, opisthokonts, amoebozoa, archaeplastids, excavates, and orphan species. An all-versus-all BLAST search with an e-value cut-off of 10^-5 was performed to identify homologous protein sequences. The BLAST output was clustered using the Markov Cluster algorithm (MCL) 84 with an inflation value of two, chosen after a series of parameter tests. The clusters were filtered according to their level of conservation across the six major taxa. Gene clusters conserved across at least four taxa were considered to be of a sufficient level of conservation to be strong LECA candidate genes (SI Data 3).
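The conservation filter described above can be illustrated with a minimal sketch. It assumes that clusters arrive one per line with tab-separated members (as in MCL's label output) and that each member is named "<genome>|<gene>"; the genome codes, the genome-to-taxon table, and the function names are hypothetical placeholders and do not reproduce SI Table 1 or the scripts actually used.

```python
# Hypothetical genome-to-taxon table; the codes below are placeholders
# for illustration and are not taken from the study's genome list.
GENOME_TO_TAXON = {
    "hsap": "Opisthokonta",
    "scer": "Opisthokonta",
    "atha": "Archaeplastida",
    "ddis": "Amoebozoa",
    "pfal": "SAR",
    "tbru": "Excavata",
    "ehux": "Orphan",
}

MIN_TAXA = 4  # a cluster must span at least four of the six major taxa


def parse_mcl_line(line):
    """Split one cluster line (tab-separated member labels) into a list."""
    return [member for member in line.strip().split("\t") if member]


def taxa_in_cluster(members, genome_to_taxon):
    """Return the set of major taxa represented in one cluster.

    Members are assumed to be named "<genome>|<gene>". Members whose
    genome is missing from the table (e.g. prokaryotes) are ignored.
    """
    taxa = set()
    for member in members:
        genome = member.split("|", 1)[0]
        if genome in genome_to_taxon:
            taxa.add(genome_to_taxon[genome])
    return taxa


def filter_conserved(clusters, genome_to_taxon, min_taxa=MIN_TAXA):
    """Keep clusters whose members cover at least `min_taxa` major taxa."""
    return [
        cluster
        for cluster in clusters
        if len(taxa_in_cluster(cluster, genome_to_taxon)) >= min_taxa
    ]
```

Counting taxa rather than genomes means that two opisthokont genomes contribute only once, which matches the stated criterion of conservation across major taxa rather than across individual species.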

Phylogenetic analysis
Clusters were aligned using MAFFT 85 and alignments were trimmed with trimAl 86 using the heuristic algorithm automated1. Where trimAl produced uninformative alignments, the trimming step was skipped and the full sequence alignment was used. Phylogenetic hypotheses were reconstructed using the IQ-TREE software 87 , with parameters set to combine ModelFinder, tree search, SH-aLRT test and ultrafast bootstrap with 1000 replicates. Monophyly amongst the eukaryote sequences in these phylogenetic trees was assessed using the Python package ETE3 88 , in order to control for horizontal gene transfer (HGT). We then attempted to functionally characterise two datasets: a purely monophyletic dataset, and all sufficiently conserved gene clusters regardless of whether they demonstrated monophyly or a more complex evolutionary history.
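To make the monophyly criterion concrete, the following is a small pure-Python stand-in, not the ETE3 call used in the analysis (ETE3 tree objects provide a `check_monophyly` method for this purpose). It assumes a rooted tree written as nested tuples with string leaf labels, and a hypothetical "euk_" label prefix to mark eukaryote sequences. Maximum-likelihood trees are typically unrooted, so a real analysis would also need to handle rooting.

```python
def leaf_set(node):
    """Collect leaf labels under a node; leaves are strings, clades are tuples."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset()
    for child in node:
        leaves |= leaf_set(child)
    return leaves


def clades(node):
    """Yield the leaf set of every clade, including single leaves and the root."""
    yield leaf_set(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, is_target):
    """Return True if the target leaves form exactly one clade of the rooted tree.

    `is_target` is a predicate on leaf labels, e.g. "is this a eukaryote
    sequence?". A tree with no target leaves is reported as not monophyletic.
    """
    targets = frozenset(leaf for leaf in leaf_set(tree) if is_target(leaf))
    if not targets:
        return False
    return any(clade == targets for clade in clades(tree))


def is_eukaryote(label):
    """Hypothetical labelling convention: eukaryote leaves start with 'euk_'."""
    return label.startswith("euk_")
```

Under this sketch, a cluster whose eukaryote sequences are interrupted by a prokaryote sequence fails the test and would fall into the "complex history" dataset rather than the monophyletic one.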

Functional annotation of Clusters
We used KEGG pathways from modern organisms that relate to metabolism, genetic information processing, environmental information processing, and cellular processes in order to ask whether a comparable pathway was likely to have been present in LECA (SI Table 2). We used exemplar protein sequences from the KEGG model organisms Homo sapiens, Saccharomyces cerevisiae, Arabidopsis thaliana, Dictyostelium discoideum, Methanobrevibacter smithii, and Escherichia coli, where available, and compared sequences from these organisms to our gene clusters in order to determine presence or absence. Homology to the highly conserved gene clusters was determined using BLAST with an e-value cutoff of 10^-6.
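As an illustration of the presence/absence mapping, the sketch below computes a simple pathway completeness score. It assumes BLAST tabular output (`-outfmt 6`, where the query ID is column 1 and the e-value is column 11), treats the cutoff as inclusive, and defines completeness as the fraction of a pathway's elements with at least one significant hit; the function names, the toy identifiers, and that definition of completeness are assumptions rather than the procedure used here.

```python
import csv
import io

EVALUE_CUTOFF = 1e-6  # cutoff stated in the text; treated here as inclusive


def significant_queries(blast_tsv, cutoff=EVALUE_CUTOFF):
    """Return the query IDs with at least one hit at or below the cutoff.

    `blast_tsv` is the text of a BLAST tabular (-outfmt 6) report, in which
    each row has 12 tab-separated fields and the e-value is the 11th.
    """
    found = set()
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed rows
        if float(row[10]) <= cutoff:
            found.add(row[0])
    return found


def pathway_completeness(pathways, present):
    """Map each pathway name to the fraction of its elements that are present.

    `pathways` maps a pathway name to a list of element IDs (hypothetical
    identifiers here); an empty pathway is scored 0.0.
    """
    scores = {}
    for name, elements in pathways.items():
        unique = set(elements)
        scores[name] = len(unique & present) / len(unique) if unique else 0.0
    return scores
```

A table of such scores, one row per pathway and one column per model organism, is the kind of input from which a completeness heatmap could be drawn.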

Dataset Subsetting and Analysis
For the analysis of the FECA-to-LECA evolutionary transition, the KEGG pathway-mapped data was subset to include only clusters that did not possess prokaryote homologs 89 . Each KEGG pathway element that matched a eukaryote-specific gene cluster was cross-referenced against the gene tree in the KEGG database to independently verify the evolutionary origin of the gene, producing a list of elements found in LECA with no known prokaryote homology. This list of genes created during the FECA-to-LECA transition was then categorised and functionally annotated by extracting functional information from the KEGG database BRITE classifications.
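The subsetting and tallying steps can be sketched as follows. The sketch assumes cluster members named "<genome>|<gene>", a known set of prokaryote genome codes, and pre-built lookup tables from KEGG elements to clusters and to BRITE classes; all identifiers and function names are hypothetical, and the independent cross-check against KEGG gene trees is not reproduced.

```python
from collections import Counter


def eukaryote_specific_clusters(cluster_members, prokaryote_genomes):
    """Return the IDs of clusters that contain no prokaryote sequence.

    `cluster_members` maps a cluster ID to its member labels, each assumed
    to be named "<genome>|<gene>" (a hypothetical convention).
    """
    specific = set()
    for cluster_id, members in cluster_members.items():
        genomes = {member.split("|", 1)[0] for member in members}
        if genomes.isdisjoint(prokaryote_genomes):
            specific.add(cluster_id)
    return specific


def transition_elements(element_to_cluster, specific_clusters):
    """Keep the KEGG elements whose matching cluster is eukaryote specific."""
    return sorted(
        element
        for element, cluster_id in element_to_cluster.items()
        if cluster_id in specific_clusters
    )


def tally_brite(elements, element_to_brite):
    """Count BRITE classes over the retained elements.

    One element may carry several BRITE classes; elements without an
    annotation contribute nothing to the tally.
    """
    counts = Counter()
    for element in elements:
        counts.update(element_to_brite.get(element, []))
    return counts
```

Sorting the tally with `counts.most_common()` gives the ranking of functional classes by number of genes gained.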
The categories used in this analysis of genes created in the FECA to LECA transition were modified slightly from the KEGG classifications as follows: The Energy Metabolism category is altered from the KEGG database schema to include the 1.