The economics of endosymbiotic gene transfer and the evolution of organellar genomes

The endosymbioses of the bacterial progenitors of the mitochondrion and the chloroplast are landmark events in the evolution of life on Earth. While both organelles have retained substantial proteomic and biochemical complexity, this complexity is not reflected in the content of their genomes. Instead, the organellar genomes encode fewer than 5% of the genes found in living relatives of their ancestors. While some of the 95% of missing organellar genes have been discarded, many have been transferred to the host nuclear genome through a process known as endosymbiotic gene transfer. Here we demonstrate that the energy liberated or consumed by a cell as a result of endosymbiotic gene transfer can be sufficient to provide a selectable advantage for retention or nuclear transfer of organellar genes in eukaryotic cells. We further demonstrate that, for realistic estimates of protein abundances, organellar protein import costs, host cell sizes, and cellular investment in organelles, it is energetically favourable to transfer the majority of organellar genes to the nuclear genome. Moreover, we show that the selective advantage of such transfers is sufficiently large to enable such events to rapidly reach fixation. Thus, endosymbiotic gene transfer can be advantageous in the absence of any additional benefit to the host cell, providing new insight into the processes that have shaped eukaryotic genome evolution.

One sentence summary: The high copy number of organellar genomes renders endosymbiotic gene transfer energetically favourable for the vast majority of organellar genes.

To test this hypothesis, we assessed the conditions under which it is more energetically favourable to encode a gene in the organellar or nuclear genome. Here, the free energy of endosymbiotic gene transfer (which we define as the difference in energy cost between a cell which encodes a given gene in the organellar genome and a cell which encodes the same gene in the nuclear genome and imports the requisite amount of gene product into the organelle, see Methods) was computed for an average length bacterial gene as a function of protein abundance, protein import cost, and organellar genome copy number. This revealed a simple relationship: the higher the copy number of the organellar genome, the more energy is liberated by endosymbiotic gene transfer and thus the more protein can be imported into the organelle while still reducing the overall energetic cost of the cell (Figure 2A). To simulate the organellar genome reduction that would result if all such energetically favourable endosymbiotic gene transfers occurred, the complete genomes with measured protein abundances for an alphaproteobacterium (Bartonella henselae) and a cyanobacterium (Microcystis aeruginosa) were subjected to a simulated endosymbiosis. Here, a range of host cell sizes was simulated such that they encompassed the majority of diversity exhibited by extant eukaryotes (Milo 2013) and would thus likely encompass the size range of the host cell that originally engulfed the bacterial organellar progenitors. This range extended from a small unicellular yeast-like cell (10^7 proteins), to a typical unicellular algal cell (10^8 proteins), to a large metazoan/plant cell (10^9 proteins). Each of these cell types was then considered to allocate a realistic range of total cellular protein to mitochondria/chloroplasts representative of extant eukaryotic cells (Supplemental Table S1).
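The relationship between genome copy number and the amount of protein that can be imported before transfer stops paying off can be sketched numerically. The following is a minimal illustration under stated assumptions, not the implementation used in the study: the 1,000 bp gene length is hypothetical, the nucleus is taken to be diploid, the per-base-pair DNA cost is approximated as the mean of the A:T and G:C values quoted in the Methods (40.55 and 40.14 ATP), and the function names are invented for this example.

```python
# Illustrative sketch only: break-even protein abundance for gene transfer.
# ASSUMPTIONS (not from the paper): a 1,000 bp gene, a flat per-bp DNA cost,
# protein length = gene length // 3, and all function names used here.

ATP_PER_BP = (40.55 + 40.14) / 2  # mean of the A:T and G:C base-pair costs (ATP)


def dna_saving(gene_length_bp, organellar_copies, nuclear_copies=2):
    """ATP saved per cell by encoding the gene in the nucleus, not the organelle."""
    return (organellar_copies - nuclear_copies) * gene_length_bp * ATP_PER_BP


def break_even_abundance(gene_length_bp, organellar_copies,
                         import_cost_per_residue, nuclear_copies=2):
    """Protein copies per cell at which the import cost equals the DNA saving."""
    protein_length = gene_length_bp // 3
    saving = dna_saving(gene_length_bp, organellar_copies, nuclear_copies)
    return saving / (protein_length * import_cost_per_residue)


if __name__ == "__main__":
    # Sweep organellar genome copy number at a fixed import cost of 2 ATP per residue.
    for copies in (10, 100, 1000, 10000):
        n_p = break_even_abundance(1000, copies, import_cost_per_residue=2.0)
        print(f"{copies:>6} genome copies -> break-even at {n_p:,.0f} proteins per cell")
```

In this sketch the break-even abundance grows linearly with organellar genome copy number, which is the qualitative pattern described for Figure 2A.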
For each simulated endosymbiosis, the free energy of endosymbiotic gene transfer was calculated for each gene given its measured protein abundance (Wang, et al. 2015) and a realistic range of protein import costs (including the total biosynthetic cost of the protein import machinery, see Methods). This revealed that for a broad range of estimates of cell size, organellar genome copy number, and organellar fraction (i.e. the fraction of the total number of protein molecules in a cell that are contained within the organelle) it is energetically favourable to the cell to transfer the majority of organellar genes to the nuclear genome and import the proteins back into the organelle (Figure 2B and 2C). Here, only the proteins with the highest abundance, and thus which [...] even if extreme costs for protein import ten times those that have been measured are considered (Supplemental Figure S1), or if organellar protein import is inefficient (or organellar protein turnover is higher than cytosolic protein turnover) such that 50% of cytosolically synthesised protein fails to be imported (or is turned over) and is thus wasted (Supplemental Figure S2). Thus, it is more energy efficient for the cell to transfer the majority of organellar genes to the nuclear genome and import the proteins into the organelle.

To estimate the strength of selection that would act on the change in energy incurred from an endosymbiotic gene transfer event, the free energy of endosymbiotic gene transfer for each gene was placed in the context of the total energy required to replicate the cell. As above, this analysis was conducted for a broad range of host cell sizes, organellar fractions, endosymbiont genome copy numbers, and protein import costs that are representative of a broad range of eukaryotic cells (Figure 3A and B, Supplemental Figures S3-S8, Supplemental Table S2).
This revealed that for even modest per-cell endosymbiont genome copy numbers (≥100 copies per cell) the selection coefficients for the transfer of the majority of endosymbiont genes are relatively large (~1 x 10^-4; Figure 3, Supplemental Figures S3-S8), ~10,000 times stronger than the selection coefficient acting against disfavoured synonymous codons (Hartl, et al. 1994). Moreover, for high per-cell endosymbiont genome copy numbers (≥1000 genome copies per cell) these selection coefficients are large (~1 x 10^-3) and similar to the strength of selection that caused the allele conferring lactose tolerance to rapidly sweep through human populations in ~500 generations (Bersaglieri, et al. 2004). In contrast, selection coefficients favouring retention of genes in the organellar genome only occur when organellar genome copy numbers are low, and/or when large proportions of cellular resources are invested in organelles (Figure 3A and B, Supplemental Figures S3-S8). However, with the exception of very highly abundant proteins (discussed below) these selection coefficients are generally weaker. Thus, over a broad range of host cell sizes, organellar genome copy numbers, organellar fractions, and per-protein ATP import costs, endosymbiotic gene transfer of the majority of genes is sufficiently energetically advantageous that any such transfer events, if they occurred, would rapidly reach fixation (Supplemental Figure S9). Consequently, endosymbiotic gene transfer of the majority of genes is intrinsically advantageous to the cell in the absence of any other benefits.

Although the energetic advantage is sufficiently high to explain why such events would become fixed if they occurred, it is not proposed that it is the only factor that influences the location of an organellar gene.
Instead, a large cohort of factors including the requirement for organellar-mediated RNA editing, protein chaperones, protein folding, post-translational modifications, escaping mutation hazard, Muller's ratchet, enhanced nuclear control, and drift will act antagonistically or synergistically with the free energy of endosymbiotic gene transfer to influence the set of genes that are retained in, or transferred from, the organellar genomes. Moreover, the free energy of endosymbiotic gene transfer provides a mechanistic basis for selection to act with or against Doolittle's "You are what you eat" ratchet for endosymbiotic gene transfer (Doolittle 1998). It is noteworthy in these contexts that if the protein encoded by the endosymbiont gene can provide its function outside of the endosymbiont (e.g. by catalysing a reaction that could occur equally well in the cytosol of the host as in the endosymbiont) then the energetic advantage of gene transfer to the nuclear genome is further enhanced, as the cost of protein import is not incurred. Similarly, although gene loss is predominantly thought to be mediated by mutation pressure and drift (Lynch, et al. 2006), the elevated per-cell endosymbiont genome copy number also provides a substantial energetic reward to the host cell for complete gene loss. Thus, the high genome copy number required to protect DNA from damage (Shokolenko, et al. 2009) and facilitate high levels of protein production (Bendich 1987) also provides the energetic incentive for the cell to delete endosymbiont genes as well as transfer them to the nuclear genome.

The analysis presented here shows that, for a broad range of cell sizes and resource allocations, endosymbiotic gene transfer of the majority of organellar genes is energetically favourable and thus advantageous to the cell.
However, it also showed that retention of genes in the organellar genomes is energetically favourable under conditions where the encoded organellar protein is required in very high abundance and/or the copy number of the organellar genome is low (Figure 2B, 2C, 3A and 3B). For example, in large plant cells such as those in the leaves of Arabidopsis thaliana it is unfavourable to transfer the rbcL gene encoding the RuBisCO large subunit from the chloroplast genome to the nuclear genome: although it would save 8.7 x 10^7 ATP per cell in DNA biosynthesis costs, it would incur a daily cost of ~3.96 x 10^12 ATP per cell (0.17% of the daily energy budget of the cell) just to import the required amount of RuBisCO large subunit back into the chloroplast (see Methods). Consequently, from a cost perspective it is energetically favourable to the cell to retain this gene in the chloroplast genome. The same is also true for 62 of the 88 genes currently found in the chloroplast genome in Arabidopsis thaliana (Supplemental Table S3) such that selection would act against transfer of these genes from the chloroplast genome. In contrast, it is energetically favourable to transfer the majority of genes from the mitochondrial genome to the nuclear genome in Arabidopsis (99 out of 122), and all of the genes encoded in the human mitochondrial genome to the human nuclear genome (Supplemental Table S3). Thus, the requirement for high protein abundance and low genome copy numbers will act synergistically with factors such as the requirement for redox regulation of gene expression (Allen 2015) to favour retention of genes in the organellar genomes.
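The rbcL comparison above can be checked with a few lines of arithmetic. The sketch below uses the per-cell protein count (2.5 x 10^10), the RuBisCO large subunit fraction (0.165) and the 2 ATP import cost given in the Methods, together with the DNA saving quoted in the text. The protein length of 479 residues is an assumption introduced here, since it is not stated in this section, and the function name is hypothetical.

```python
# Worked rbcL arithmetic (illustrative sketch, not the authors' code).
TOTAL_PROTEINS_PER_CELL = 2.5e10  # proteins per A. thaliana leaf cell (Methods)
FRACTION_RBCL = 0.165             # fraction of cellular protein that is RuBisCO large subunit (Methods)
IMPORT_COST_PER_RESIDUE = 2.0     # ATP per residue imported into the chloroplast (Methods)
DNA_SAVING_ATP = 8.7e7            # ATP saved in DNA biosynthesis (quoted in the text)
RBCL_LENGTH_RESIDUES = 479        # ASSUMPTION: protein length is not given in this section


def rbcl_budget():
    """Return the copy number, import cost and net free energy for rbcL transfer."""
    copies = TOTAL_PROTEINS_PER_CELL * FRACTION_RBCL
    import_cost = copies * RBCL_LENGTH_RESIDUES * IMPORT_COST_PER_RESIDUE
    net = DNA_SAVING_ATP - import_cost  # negative means retention is favoured
    return {"copies": copies, "import_cost": import_cost, "net": net}


if __name__ == "__main__":
    budget = rbcl_budget()
    print(f"RuBisCO large subunit copies per cell: {budget['copies']:.3g}")
    print(f"Import cost per cell (ATP): {budget['import_cost']:.3g}")
    print(f"Net free energy of transfer (ATP): {budget['net']:.3g}")
```

Under these inputs the import cost exceeds the DNA saving by more than four orders of magnitude, so the net term is negative and retention in the chloroplast genome is favoured.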
While the analysis presented here focussed on the energetic cost measured in ATP so that the cost of protein import and the cost of biosynthesis of DNA could be evaluated on a common basis, endosymbiotic gene transfer also results in changes in the elemental requirements of a cell. Specifically, as the monophosphate nucleotides that constitute DNA are composed of carbon (A = 10, C = 9, G = 10, T = 10), nitrogen (A = 5, C = 3, G = 5, T = 2), and phosphorus (A = 1, C = 1, G = 1, T = 1) atoms, endosymbiotic gene transfer can also result in substantial savings of these resources (Supplemental Figure S10). Thus, if organisms encounter carbon, nitrogen or phosphorus limitation in their diet and environment then the advantage of endosymbiotic gene transfer to the cell will be further enhanced.

While we do not know precisely what the cells that engulfed the progenitors of the mitochondrion or the chloroplast looked like (as only extant derivatives survive), it is safe to assume that cell size and investment in organelles have altered since these primary endosymbioses first occurred. Accordingly, the selective advantage (or disadvantage) of transfer of any given gene is transient and will have varied during the radiation of the eukaryotes as cell size and organellar volume evolved and changed in disparate eukaryotic lineages. This, coupled with the lack of an organellar protein export system (i.e. from the organelle to the host cytosol) and the presence (and acquisition) of introns in nuclear encoded genes (Rogozin, et al. 2012), means that it is more [...] organelle). Collectively, this would create a ratchet-like effect trapping genes in the nuclear genome even if subsequent changes in cell size and organellar fraction mean that it became energetically advantageous to return the gene to the organelle later in evolution.
Thus, current organellar and nuclear gene contents predominantly reflect past pressures to transfer genes to the nuclear genome.

Endosymbiotic gene transfer is a recurring theme in the evolution of the eukaryotic tree of life. The discovery that the free energy of endosymbiotic gene transfer can provide an energetic advantage to the cell for retention or transfer of organellar genes to the nuclear genome uncovers a novel process that has helped shape the content and evolution of eukaryotic genomes. Moreover, it reveals that organelles have surrendered the vast majority of their genes to the nuclear genome for the sake of the greater good of the cell.
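To give a feel for how selection coefficients in the range discussed above (10^-4 to 10^-3) translate into population dynamics, the sketch below uses two textbook approximations in place of the Wright-Fisher simulations described in the Methods: the diffusion approximation for the fixation probability of a new mutation in a haploid population, and the deterministic haploid selection recursion, under which the odds p/(1-p) of the favoured allele are multiplied by (1+s) each generation. The population size of 10^7 is a hypothetical value chosen for illustration, and the function names are invented here.

```python
import math

# Illustrative sketch only. ASSUMPTIONS: haploid population, constant size,
# and a hypothetical population size; these are not the paper's simulations.


def fixation_probability(s, population_size):
    """Diffusion approximation for a single new mutation with advantage s >= 0."""
    if s < 0:
        raise ValueError("this sketch only handles s >= 0")
    if s == 0:
        return 1.0 / population_size  # neutral case
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return -math.expm1(-2.0 * s) / -math.expm1(-2.0 * population_size * s)


def deterministic_sweep_generations(s, population_size):
    """Generations to rise from frequency 1/N to 1 - 1/N, ignoring drift."""
    if s <= 0:
        raise ValueError("a deterministic sweep requires s > 0")
    start_odds = 1.0 / (population_size - 1)
    end_odds = population_size - 1.0
    return math.log(end_odds / start_odds) / math.log1p(s)


if __name__ == "__main__":
    n = 10**7  # hypothetical population size
    for s in (1e-5, 1e-4, 1e-3, 1e-2):
        p_fix = fixation_probability(s, n)
        t = deterministic_sweep_generations(s, n)
        print(f"s={s:g}: P(fix)={p_fix:.3g}, sweep of about {t:,.0f} generations")
```

In this approximation the sweep time scales roughly as 1/s, so a tenfold larger energetic advantage shortens the sweep about tenfold.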

Data sources
The Arabidopsis thaliana genome sequence and corresponding set of representative gene models were downloaded from Phytozome V13 (Goodstein, et al. 2012). The human genome sequence and gene models from assembly version GRCh38.p13 (GCA_000001405.28), the Bartonella henselae genome sequence and gene models from assembly version ASM4670v1, and the Microcystis aeruginosa NIES-843 genome sequence and gene models from assembly version ASM1062v1 were each downloaded from Ensembl (Yates, et al. 2020). The Saccharomyces cerevisiae sequence and gene models from assembly version R64-2-1_20150113 were downloaded from the Saccharomyces Genome Database (Cherry, et al. 2012). Protein abundance data for all species were obtained from PAXdb v4.1 (Wang, et al. 2015).

The ATP biosynthesis cost of nucleotides and amino acids was obtained from (Chen, et al. [...] Table S5).

Constants used to evaluate the per cell ATP costs of genes and chromosomes
The average gene length used for the simulation study in [...] Table S7). Similarly, if the total ATP biosynthesis cost of all TOM/TIM proteins in the cell in Homo sapiens, Saccharomyces cerevisiae and Arabidopsis thaliana is distributed equally among all of the proteins that are imported into the mitochondrion in those species then it would add an additional 0.2 ATP, 0.7 ATP, and 0.2 ATP per residue imported, respectively (Supplemental Table S7). In all cases the proteins that were predicted to be imported into the organelle were identified using TargetP-2.0 (Almagro Armenteros, et al. 2019) and protein abundance was calculated using measured protein abundance estimates for each species obtained from PAXdb 4.0 (Wang, et al. 2015), assuming a total cell protein content of 1 x 10^9 proteins for a human cell, 1 x 10^7 proteins for a yeast cell and 2.5 x 10^10 proteins for an Arabidopsis thaliana cell. As we modelled ATP import costs from 0.05 ATP to 50 ATP per residue, the cost of the import machinery was considered to be included within the bounds considered in this analysis.

Evaluating the proportion of the total proteome invested in organelles
To provide estimates of the fraction of cellular protein resources invested in organellar proteomes the complete predicted proteomes and corresponding protein abundances were quantified. [...] Table S1 and were used to provide the indicative regions of parameter space occupied by metazoa, yeast and plants shown on Figure 2B and C. Specifically, ~5% of total cellular protein is contained within mitochondria in H. sapiens, S. cerevisiae and A. thaliana and ~50% of total cellular protein is contained within chloroplasts in A. thaliana.
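The two bookkeeping steps described in this part of the Methods, namely summing abundances over proteins predicted to be organelle-targeted and spreading the cost of the import machinery over every imported residue, can be sketched as follows. This is a minimal illustration with invented protein identifiers and toy numbers; it does not call TargetP or PAXdb, and the function names are hypothetical.

```python
# Illustrative sketch only. All identifiers and values below are made up.


def organellar_fraction(abundance, targeted_ids):
    """Fraction of total protein abundance found in organelle-targeted proteins.

    abundance: dict mapping protein id -> abundance (any consistent unit, e.g. ppm)
    targeted_ids: set of protein ids predicted to be imported into the organelle
    """
    total = sum(abundance.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    inside = sum(value for pid, value in abundance.items() if pid in targeted_ids)
    return inside / total


def amortised_machinery_cost(machinery_atp, imported):
    """Import machinery cost per imported residue.

    machinery_atp: total ATP cost of synthesising the import machinery
    imported: iterable of (copies_per_cell, length_in_residues) per imported protein
    """
    residues = sum(copies * length for copies, length in imported)
    if residues <= 0:
        raise ValueError("no residues are imported")
    return machinery_atp / residues


if __name__ == "__main__":
    toy_abundance = {"protA": 600.0, "protB": 300.0, "protC": 100.0}  # hypothetical
    print(organellar_fraction(toy_abundance, {"protC"}))
    print(amortised_machinery_cost(1000.0, [(10, 50)]))
```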
Calculating the free energy of endosymbiotic gene transfer
The free energy of endosymbiotic gene transfer (ΔE_EGT) is evaluated as the difference in ATP biosynthesis cost required to encode a gene (ΔD) in the endosymbiont genome (D_end) and the nuclear genome (D_nuc) minus the difference in ATP biosynthesis cost required to produce the protein (ΔP) in the organelle (P_end) vs in the cytosol (P_cyt) and the ATP cost to import the protein into the organelle (P_import) [...]

The energetic costs of producing a protein in the endosymbiont and in the cytosol are assumed to be equal and thus

$$\Delta P = P_{import} \qquad [4]$$

It should be noted here that although P_end and P_cyt are assumed to be equal for the majority of calculations, an analysis was conducted wherein an inefficient protein import system was assumed such that 50% of protein failed to be imported and thus must be turned over (Supplemental Figure S2). Even under these conditions it is still energetically favourable to encode organellar genes in the nuclear genome for realistic estimates of cell sizes and investment in organelles.

P_import is evaluated as the product of the length of the amino acid sequence (L_prot), the ATP cost of importing a single residue from the contiguous polypeptide chain of that protein (C_import), and the number of copies of that protein contained within the cell that must be imported (N_p) such that

$$P_{import} = L_{prot} \times C_{import} \times N_p \qquad [5]$$

Measured estimates of C_import range from ~0.05 ATP per amino acid to 5 ATP per amino acid [...] where C_end and C_nuc are the per-cell copy number of the endosymbiont and nuclear genomes respectively and the ATP biosynthesis costs for the complete biosynthesis of an A:T base pair and a G:C base pair are 40.55 ATP and 40.14 ATP respectively (Chen, et al. 2016).
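The cost terms defined so far can be combined as in the following sketch, which counts A:T and G:C base pairs in a gene sequence, scales the DNA cost by genome copy number, and subtracts the import cost. It is a minimal illustration under the stated assumption that protein synthesis costs are equal in both compartments; the function names and the example sequence are hypothetical and are not taken from the study.

```python
# Illustrative sketch only: composition-aware free energy of gene transfer.
ATP_PER_AT_PAIR = 40.55  # ATP to synthesise one A:T base pair (Chen, et al. 2016)
ATP_PER_GC_PAIR = 40.14  # ATP to synthesise one G:C base pair (Chen, et al. 2016)


def dna_cost(sequence, genome_copies):
    """ATP cost of maintaining one gene across all copies of a genome in the cell."""
    seq = sequence.upper()
    at_pairs = seq.count("A") + seq.count("T")
    gc_pairs = seq.count("G") + seq.count("C")
    return genome_copies * (at_pairs * ATP_PER_AT_PAIR + gc_pairs * ATP_PER_GC_PAIR)


def import_cost(protein_length, cost_per_residue, copies_per_cell):
    """ATP cost of importing every copy of the protein into the organelle."""
    return protein_length * cost_per_residue * copies_per_cell


def delta_e_egt(sequence, c_end, c_nuc, protein_length, c_import, n_p):
    """Positive: nuclear encoding is cheaper. Negative: organellar encoding is cheaper."""
    delta_d = dna_cost(sequence, c_end) - dna_cost(sequence, c_nuc)
    return delta_d - import_cost(protein_length, c_import, n_p)


if __name__ == "__main__":
    toy_gene = "ATGGCC" * 100  # hypothetical 600 bp sequence
    for n_p in (1_000, 100_000):
        value = delta_e_egt(toy_gene, c_end=100, c_nuc=2,
                            protein_length=200, c_import=2.0, n_p=n_p)
        print(f"N_p={n_p}: free energy of transfer = {value:.3g} ATP")
```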
Thus

$$\Delta E_{EGT} = \Delta D - P_{import} \qquad [9]$$

where positive values of ΔE_EGT correspond to genes for which it is more energetically favourable to be encoded in the nuclear genome, and negative values correspond to genes for which it is more energetically favourable to be encoded in the endosymbiont genome.

Simulating endosymbiotic gene transfer of mitochondrial and chloroplast genes
The complete genomes with measured protein abundances for an alphaproteobacterium (Bartonella henselae) and a cyanobacterium (Microcystis aeruginosa) were selected to serve as models for an ancestral mitochondrion and chloroplast, respectively. To account for uncertainty in the size and complexity of the ancestral pre-mitochondrial and pre-chloroplast host cells, a range of potential ancestral cells was considered to be engulfed by a range of different host cells with protein contents representative of the diversity of extant eukaryotic cells (Milo 2013).
Specifically, the size of the host cell ranged from a small unicellular yeast-like cell (10^7 proteins), to a medium sized unicellular algal-like cell (10^8 proteins), to a typical metazoan/plant cell (10^9 proteins). Each of these host cell types was then considered to allocate a realistic range of total cellular protein to mitochondria/chloroplasts typical of eukaryotic cells (i.e. ~2% for yeast (Uchida, et al. 2011), ~20% for metazoan cells (David 1977) and ~50% of the non-vacuolar volume of plant cells (Winter, et al. 1994)). It is not important whether the organellar fraction of the cell is composed of a single large organelle or multiple smaller organelles as all costs, abundances, and copy numbers are evaluated at a per-cell level. For each simulated cell, ΔE_EGT was evaluated for each gene in the endosymbiont genome using real protein abundance data (Wang, et al. 2015) for a realistic range of endosymbiont genome copy numbers using equation 9. In all cases the host cell was assumed to be diploid. The simulations were repeated for three different per-residue [...] where ΔE_EGT was positive was recorded as these genes comprise the cohort that are energetically favourable to be encoded in the nuclear genome. All calculated values for ΔE_EGT for both the model organisms are provided in Supplemental Table S2.

Estimating the strength of selection acting on endosymbiotic gene transfer
To model the proportion of energy that would be saved by an individual endosymbiotic gene transfer event a number of assumptions were made. It was assumed that the ancestral host cell had a cell size that is within the range of extant eukaryotes (i.e. between 1 x 10^7 proteins per cell
and 1 x 10^9 proteins per cell). It was assumed that the endosymbiont occupied a fraction of the total cell proteome that is within the range exhibited by most eukaryotes today (2% to 50% of total cellular protein is located within the endosymbiont under consideration). It was assumed that endosymbiont genome copy number ranged between 1 copy per cell (as it most likely started out with a single copy) and 10,000 copies per cell.

We assumed an ancestral host cell with a 24-hour doubling time such that all genomes and proteins are produced in the required abundance every 24-hour period. As previously defined (Lynch and Marinov 2015), the energy required for cell growth was modelled as

$$C_G = 26.92 \, V^{0.97} \qquad [10]$$
In addition, all cells, irrespective of whether they are bacterial or eukaryotic, consume ATP (C_m) in proportion to their cell volume (V) at approximately the rate of [...] where C_m is in units of 10^9 molecules of ATP cell^-1 hour^-1, and V is in units of µm^3 (Lynch and Marinov 2015). Thus, the total energy (E_R) needed to replicate a cell was considered to be [...]

The proportional energetic advantage or disadvantage (E_A/D) to the host cell from the endosymbiotic gene transfer of a given gene is evaluated as the free energy of endosymbiotic [...]

Given that E_A/D describes the proportional energetic advantage or disadvantage a cell has from a given endosymbiotic gene transfer event, E_A/D can be used directly as the selection coefficient (s) to evaluate the strength of selection acting on the endosymbiotic gene transfer of a given gene. Such [...]

As ΔE_EGT can be positive or negative as described above, s is therefore also positive or negative depending on endosymbiont genome copy number, endosymbiont fraction, host cell protein content, the abundance of the protein that must be imported and the ATP cost of protein import. [...] Similarly, the effect of protein turnover was not included as estimates for protein turnover were not available for each protein considered in these analyses. However, the effect of protein turnover is to increase the total amount of protein that must be produced within the life cycle of the cell. Thus, for the purposes of this analysis it can be considered equivalent to increasing the cellular investment in organelles.

Estimating time to fixation
Fixation times for endosymbiotic gene transfer events for a range of observed selection coefficients from 1 x 10^-5 to 1 x 10^-2 were estimated using a Wright-Fisher model with selection and drift [...] representative of unicellular eukaryotes (Lynch and Conery 2003) and multicellularity in eukaryotes
is not thought to have evolved until after the endosymbiosis of either the mitochondrion or the chloroplast.

The cost of transferring the rbcL gene encoding RuBisCO large subunit from the chloroplast to the nuclear genome in Arabidopsis thaliana
The total number of proteins contained in an Arabidopsis thaliana leaf cell is 2.5 x 10^10 proteins (Heinemann, et al. 2020). The fraction of cellular protein that is invested in RuBisCO large subunit (F_rbcL) is 0.165 (Li, et al. 2017) and thus the number of RuBisCO large subunit proteins per cell (N_p) is estimated to be 4.13 x 10^9. The cost of import (P_import) of a protein to the chloroplast is 2 ATP per