Beyond RuBisCO: convergent molecular evolution of multiple chloroplast genes in C4 plants

Background The recurrent evolution of the C4 photosynthetic pathway in angiosperms represents one of the most extraordinary examples of convergent evolution of a complex trait. Comparative genomic analyses have unveiled some of the molecular changes associated with the C4 pathway. For instance, several key enzymes involved in the transition from C3 to C4 photosynthesis have been found to share convergent amino acid replacements along C4 lineages. However, the extent of convergent replacements potentially associated with the emergence of C4 plants remains to be fully assessed. Here, we conducted an organelle-wide analysis to determine if convergent evolution occurred in multiple chloroplast proteins besides the well-known case of the large RuBisCO subunit encoded by the chloroplast gene rbcL.

Methods Our study was based on the comparative analysis of 43 C4 and 21 C3 grass species belonging to the PACMAD clade, a focal taxonomic group in many investigations of C4 evolution. We first used protein sequences of 67 orthologous chloroplast genes to build an accurate phylogeny of these species. Then, we inferred amino acid replacements along 13 C4 lineages and 9 C3 lineages using reconstructed protein sequences of their reference branches, corresponding to the branches containing the most recent common ancestors of C4-only clades and C3-only clades. Pairwise comparisons between reference branches allowed us to identify both convergent and non-convergent amino acid replacements between C4:C4, C3:C3 and C3:C4 lineages.

Results The reconstructed phylogenetic tree of 64 PACMAD grasses was characterized by strong support at all nodes used for analyses of convergence. We identified 217 convergent replacements and 201 non-convergent replacements in 45 of 67 chloroplast proteins in both C4 and C3 reference branches. C4:C4 branches showed higher levels of convergent replacements than C3:C3 and C3:C4 branches. Furthermore, we found that more proteins shared unique convergent replacements in C4 lineages, with both RbcL and RpoC1 (the RNA polymerase beta’ subunit 1) showing a significantly higher ratio of convergent to non-convergent replacements in C4 branches. Notably, more C4:C4 pairs of reference branches showed more convergent than non-convergent replacements than did C3:C3 and C3:C4 pairs. Our results suggest that, in the PACMAD clade, C4 grasses experienced higher levels of molecular convergence than C3 species across multiple chloroplast genes. These findings have important implications for our understanding of the evolution of the C4 photosynthesis pathway.


Introduction
carboxylase/oxygenase (RuBisCO), a large multimeric enzyme that catalyzes the carboxylation of ribulose-1,5-bisphosphate (RuBP), allowing plants to fix atmospheric carbon (Andersson and Backlund, 2008). RuBisCO also initiates oxygenation of RuBP, which leads to a more limited production of energy and to loss of carbon in the process of photorespiration (Andersson and Backlund, 2008; Maurino and Peterhansel, 2010). RuBisCO's limited ability to discriminate between CO2 and O2 has been attributed to the much higher CO2 to O2 atmospheric partial pressure until ~400 million years ago (Sage, 1999, 2004; Sage et al., 2012).

Previous studies have revealed multiple convergent amino acid replacements in the large RuBisCO subunit, encoded by the chloroplast gene rbcL, in C4 lineages (Kapralov and

Given the number of biochemical, physiological and anatomical traits that were affected in each evolutionary transition from C3 to C4 photosynthesis (Heyduk et al., 2019), it is likely that many genes experienced analogous selective pressures across taxa that include C4 plants. This could have led to the widespread occurrence of convergent amino acid replacements among a significant fraction of proteins encoded by genes involved in photosynthesis processes. An important recent study produced the first analysis of convergent replacements across multiple proteins involved in the metabolism of C4 and crassulacean acid metabolism (CAM) among species belonging to the portullugo clade (Caryophyllales). Goolsby and colleagues (2018) compared evolutionary patterns in 19 gene families with critical roles in metabolic pathways of both C4 and CAM plants, also known as carbon-concentrating mechanism (CCM) genes, and in 64 non-CCM gene families. They found convergent replacements in proteins from C4 and CAM lineages, as well as higher levels of convergent replacements in CCM vs.
non-CCM gene families (Goolsby et al., 2018). Additionally, several amino acid replacements that are prevalent among C4 and CAM taxa compared to C3 lineages were identified in this study (Goolsby et

1995). Single-copy putative orthologs that were present in more than 95% of the species were retained for further analysis (Table S1).

Multiple sequence alignment
We aligned the individual sequences using TranslatorX ver. 1.1 (Abascal et al., 2010), and further adjusted the alignments manually using BioEdit ver. 7.0.9.0 (Hall, 1999). Stop codons and sites that could not be aligned unambiguously were removed.

Phylogeny reconstruction
We concatenated the individual sequence alignments and extracted third codon position sites for phylogeny reconstruction. We ran PartitionFinder ver. 1.1.1 (Lanfear et al., 2012) to identify the best partitioning scheme (partitioning by gene) for the downstream analysis using both Akaike information criterion (AIC) (Akaike, 1973) and Bayesian information criterion (BIC) (Schwarz, 1978

Ancestral state reconstruction
We reconstructed ancestral states at each phylogenetic node for each individual gene using the program codeml from the software package PAML ver. 4.9a (Yang, 2007) and the basic codon substitution model (model = 0, NSsites = 0).

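The extraction of third codon position sites used for phylogeny reconstruction can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the input layout (gene -> species -> aligned in-frame coding sequence) and the padding of missing genes with gaps are all assumptions made for illustration.

```python
def third_codon_positions(cds: str) -> str:
    """Return every third nucleotide of an in-frame coding sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return cds[2::3]


def concatenate_third_positions(alignments: dict) -> dict:
    """Concatenate third-position sites across genes for each species.

    `alignments` maps gene name -> {species name -> aligned CDS}.
    Assumption: a species lacking a gene is padded with gap characters.
    """
    species = sorted({sp for gene in alignments.values() for sp in gene})
    parts = {sp: [] for sp in species}
    for gene in sorted(alignments):
        per_species = alignments[gene]
        # Number of third-position sites contributed by this gene.
        n_sites = len(next(iter(per_species.values()))) // 3
        for sp in species:
            if sp in per_species:
                parts[sp].append(third_codon_positions(per_species[sp]))
            else:
                parts[sp].append("-" * n_sites)
    return {sp: "".join(chunks) for sp, chunks in parts.items()}
```

Genes are processed in sorted order so that the concatenated columns are reproducible across runs; any fixed order would serve equally well.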
Inference of convergent and divergent replacements
We extracted the reconstructed ancestral states from the codeml output. The corresponding amino acid sequences were then compared to investigate individual site changes along selected branches in the reconstructed phylogenetic tree in the context of the emergence of the C4 trait. For each group of species descended from a single C4 ancestor, we chose the branch between the most recent C3 ancestor and the most ancestral C4 node, i.e., the branch along which the C4 adaptation presumably emerged (referred to as "C4 ancestral branch" throughout this article, see Figs. 1 and 2). For C3 species, we chose the most ancestral branch that did not share ancestry with any C4 lineage ("C3 ancestral branch", see Figs. 1 and 2). In either case, if only a single species was available in a given lineage, that terminal branch was used. The outgroup species (O. sativa and B. distachyon) were not included in this analysis (Fig. 1).

We searched for amino acid changes that occurred along pairs of ancestral branches. Replacements in both branches that resulted in the same state at a given site in the two descendants were considered convergent, regardless of whether the corresponding ancestral states were the same or different (

We identified putative convergent and divergent amino acid changes in each gene product individually. We summarized those data within each of the three categories: (1) two C4 ancestral branches (C4-C4), (2) C3 ancestral branch and C4 ancestral branch (C3-C4), and (3) two C3 ancestral branches (C3-C3).

Phylogeny reconstructions
We examined 63 grass chloroplast genomes to identify gene orthologs for Zea mays chloroplast genes and extracted the corresponding coding and protein sequences. The resulting dataset included up to 67 DNA/protein sequences in 64 grass species that were retained for further analysis (Table S1).
One to four sequences were absent in thirteen species. Out of 64 species, 43 were classified as C4 and 21 (including two outgroup species) as C3. The reconstructed phylogeny is well supported, except for three branches with low to moderate bootstrap values, and it is consistent under both AIC and BIC (Fig. 1 and Figs. S1-S3). We identified thirteen C4 ancestral branches that represent putative C3 to C4 transitions, and nine C3 ancestral branches (Fig. 1
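The pairwise classification of replacements described in the Methods can be sketched as follows. This is an illustrative reading of the text, not the authors' code. Convergent sites follow the stated definition (both branches changed and the two descendants share the same state, whatever the ancestral states were). The definition of divergent sites is cut off in this copy, so the sketch assumes they are sites replaced along both branches that end in different states. Function names and the 0-based site indexing are hypothetical.

```python
def changes_along_branch(ancestor: str, descendant: str) -> dict:
    """Map site index -> (ancestral state, derived state) for replaced sites."""
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned to the same length")
    return {
        i: (a, d)
        for i, (a, d) in enumerate(zip(ancestor, descendant))
        if a != d
    }


def classify_pair(branch1: tuple, branch2: tuple) -> dict:
    """Classify sites replaced along both branches of a pair.

    Each branch is (ancestral protein sequence, descendant protein sequence).
    Sites that changed along only one branch are ignored.
    """
    changes1 = changes_along_branch(*branch1)
    changes2 = changes_along_branch(*branch2)
    result = {"convergent": [], "divergent": []}
    for site in sorted(changes1.keys() & changes2.keys()):
        same_derived_state = changes1[site][1] == changes2[site][1]
        # Assumption: "divergent" means both changed, to different states.
        key = "convergent" if same_derived_state else "divergent"
        result[key].append(site)
    return result
```

For example, with toy sequences, `classify_pair(("MAVLT", "MSVIT"), ("MGVLT", "MSVFA"))` reports site 1 as convergent (A to S and G to S) and site 3 as divergent (L to I and L to F), while site 4 is ignored because it changed along only one branch.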

The phylogenetic tree was obtained using RAxML (GTR+Γ model) based on the third codon position sites in 67 chloroplast genes. The partitioning scheme was selected according to Akaike

We then searched for convergent replacements that occurred along more than two C4 branches at sites that remained otherwise conserved in C3 and C4 lineages, reasoning that such changes could result from selective pressure rather than drift. We identified fourteen C4-specific convergent sites in proteins from eight genes: atpB, ccsA, matK, ndhF, ndhH, ndhI, rbcL and rpoC2 (Table S3). Five of these sites were found in RbcL, whereas two sites were identified in both

Molecular convergence in individual chloroplast proteins
Convergent and divergent amino acid replacements were detected in the products of 45 chloroplast genes, thirteen of which had at least one site with four or more replacements (Fig. 4, Table 1 and Table S2). Twenty-four genes had convergent changes in C4-C4, 26 in C3-C4, and 13 in C3-C3 types of pairs (Table 1). Although the convergent/divergent replacement ratio was higher in C4-C4 pairs than in C3-C4 and C3-C3 pairs, the differences between the three photosynthesis types were not statistically significant (P ≥ 0.05, Boschloo's test; Table 1). The lack of replacements was the single most common state for chloroplast proteins across photosynthesis types; however, in C4-C4 pairs there were more genes with a higher number of convergent than divergent replacements (Fig. 4 and Table S4).
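For readers unfamiliar with Boschloo's test, the following pure-Python sketch shows the idea for a 2x2 table of convergent and divergent counts. It is a simplified, one-sided version that maximizes over the nuisance parameter on a grid; the article does not state which alternative or software was used, so the one-sided choice, the grid size and all function names are assumptions. SciPy's `scipy.stats.boschloo_exact` provides a tested implementation and would be preferable in practice.

```python
from math import comb


def fisher_one_sided(x1: int, n1: int, x2: int, n2: int) -> float:
    """Fisher's exact one-sided p-value: P(X1 >= x1 | x1 + x2 successes)."""
    k = x1 + x2
    upper = min(n1, k)
    tail = sum(comb(n1, a) * comb(n2, k - a) for a in range(x1, upper + 1))
    return tail / comb(n1 + n2, k)


def binom_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def boschloo_one_sided(x1: int, n1: int, x2: int, n2: int, grid: int = 500) -> float:
    """Approximate one-sided Boschloo p-value (group 1 proportion larger).

    Fisher's p-value is used as the test statistic; the p-value is the
    largest probability, over a grid of common success probabilities, of
    observing a table at least as extreme as the one observed.
    """
    observed = fisher_one_sided(x1, n1, x2, n2)
    region = [
        (a, b)
        for a in range(n1 + 1)
        for b in range(n2 + 1)
        if fisher_one_sided(a, n1, b, n2) <= observed + 1e-12
    ]
    return max(
        sum(binom_pmf(a, n1, p) * binom_pmf(b, n2, p) for a, b in region)
        for p in (i / grid for i in range(1, grid))
    )


# RpoC1 counts reported in this article: 4 convergent and 1 divergent
# replacement in C4-C4 pairs versus 1 and 5 in C3-C4 pairs.
p_fisher = fisher_one_sided(4, 5, 1, 6)
p_boschloo = boschloo_one_sided(4, 5, 1, 6)
```

Because Boschloo's procedure is unconditional, its p-value never exceeds the corresponding Fisher p-value, which is why it is often chosen for small counts such as these.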

Overall, 26 proteins showed a higher number of convergent vs. divergent replacement sites, of which 16, 13 and 10 were found in C4-C4, C3-C4 and C3-C3 pairs, respectively (Fig. 5 and Table S4). We found statistically significant differences in the number of replacements between C4-C4 and C3-C4 pairs, but not C3-C3 pairs, in the products of the genes rbcL, rpoC1 and rpoC2 (P < 0.05, Boschloo's test; Table S4). In RbcL and RpoC1, C4-C4 pairs shared a much higher number of convergent replacements, whereas the opposite was true in RpoC2. RpoC1 was also the only protein showing more convergent than divergent replacements in C4-C4 pairs compared to C3-C3 and C3-C4 pairs. In C4-C4 pairs, RpoC1 shared 4 convergent and 1 divergent replacement, compared to 1 and 2 in C3-C3 pairs and 1 and 5 in C3-C4 pairs, respectively. Additionally, the proteins NdhG, NdhI, PsaI, RpoA, Rps4 and Rps11 exhibited convergent replacements only in C4-C4 pairs (Table S4). When considering the number of affected sites rather than the number of replacements, no genes showed a significantly different pattern between photosynthesis types (P ≥ 0.05, Boschloo's test; Table S4).

The proteins encoded by matK, rpoC2 and ndhF shared much higher numbers of both convergent and divergent replacements than other chloroplast proteins across all photosynthesis type comparisons (Table S4). Both matK and ndhF are known to be rapidly evolving and have been consistently used in low taxonomic level phylogenetic studies in flowering plants (Barthet and Hilu, 2008; Patterson and Givnish, 2002). The gene rpoC2 has also been recently described as a useful phylogenetic marker in angiosperms (Walker et al., 2019).

Molecular convergence across ancestral branches
The comparison of ancestral branch pairs with convergent and divergent replacements revealed remarkable differences between photosynthesis types.
Overall, C4-C4 pairs of ancestral branches showed a distribution skewed toward more convergent and divergent replacements than the two other categories (Fig. 6). There were significantly fewer pairs of C4-C4 ancestral branches with no replacements and with no convergent replacements than C3-C4 and C3-C3 pairs (Table 2). Conversely, significantly more C4-C4 pairs shared more convergent than divergent replacements, and at least two convergent changes, compared to C3-C4 and C3-C3 pairs (P < 0.05, Boschloo's test; Table 2). No significant difference was observed between pairs of C3-C4 and pairs of C3-C3. We found identical patterns when the same analyses were performed after excluding all replacements in the RbcL protein, except for the lack of a significant difference between C4-C4 and C3-C3 in the proportion of pairs with divergent replacements and pairs with more convergent than divergent changes (Table S6).
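The branch-pair categories compared above (pairs with no replacements, with no convergent replacements, with more convergent than divergent replacements, and with at least two convergent changes) can be tallied with a short sketch like the one below. The function name, the input format and the example counts are hypothetical and are not taken from Table 2.

```python
def summarize_branch_pairs(pairs: list) -> dict:
    """Tally branch pairs by category.

    `pairs` is a list of (n_convergent, n_divergent) counts, one tuple per
    pair of ancestral branches within a photosynthesis-type comparison.
    """
    return {
        "pairs": len(pairs),
        "no_replacements": sum(1 for c, d in pairs if c + d == 0),
        "no_convergent": sum(1 for c, d in pairs if c == 0),
        "more_convergent_than_divergent": sum(1 for c, d in pairs if c > d),
        "at_least_two_convergent": sum(1 for c, d in pairs if c >= 2),
    }


# Hypothetical counts for four branch pairs, for illustration only.
example = summarize_branch_pairs([(0, 0), (2, 1), (0, 3), (1, 1)])
```

The resulting counts per category, computed separately for C4-C4, C3-C4 and C3-C3 pairs, are the kind of 2x2 inputs that an exact test such as Boschloo's would then compare.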

Distribution of amino acid replacements across PACMAD lineages
Convergent and divergent replacements were preferentially found in specific pairs of ancestral branches. In C4-C4 pairs, convergent sites were most abundant between Danthoniopsis dinteri and Aristida purpurea (ten sites, branches P and V in Fig. 1), whereas divergent sites were most common between Centropodia glauca and Aristida purpurea (ten sites, branches S and V in Fig. 1). In pairwise C3 branch comparisons, convergent sites were most abundant in two pairs, Zeugites pittieri and Danthonieae (branches N and R in Fig. 1) and Danthonieae and Sartidia spp. (branches R and U in Fig. 1), whereas divergent sites were most abundant between Zeugites pittieri and Sartidia spp. (eight sites, branches N and U in Fig. 1; Table S5).

Molecular convergence in the RuBisCO large subunit
We further inspected the evolution of the RuBisCO large subunit across the PACMAD clade. Four of the nine RbcL amino acid sites with convergent changes in C4 ancestral branches (V101I, A281S, M309I and A328S) have been identified in previous studies on PACMAD grasses (Christin et al., 2008; Piot et al., 2018) as sites that experienced adaptive evolution in C4 species (Table 3).

A further site, T143A, was found to evolve under positive selection in C3 to C4 transitions in monocots (Studer et al., 2014). Interestingly, an adaptive S143A replacement has also been detected in the gymnosperm Podocarpus (Sen et al., 2011). Three more sites with convergent replacements, at positions 93, 94 and 461, correspond to amino acids that were reported to evolve under positive selection in different groups of seed plants by Kapralov and Filatov (2007). Thus, all of the rbcL codons that appear to have evolved convergently among the PACMAD C4 lineages we have examined are also known to have experienced adaptive evolution in seed plants, but not all of them have been shown to evolve adaptively in C4 grasses.

Discussion
The varieties with augmented resistance to high temperature and low water availability.

For these aims to be fully realized, a robust framework to assess the extent and phenotypic impact of convergent molecular changes is necessary. Along the lines of strategies applied in vertebrate research (Castoe et al., 2009; Thomas and Hahn, 2015), we presented here the results of a novel methodological approach to the study of molecular convergence in C4 grasses. We investigated patterns of convergent and divergent amino acid changes in nearly 70 chloroplast proteins across multiple C4 and C3 lineages in the PACMAD clade, with the goal of testing a specific hypothesis: do chloroplast proteins show stronger signatures of convergent amino acid replacements in C4 lineages than in C3 lineages? This analysis also allowed us to establish if proteins other than enzymes involved in the CCM biochemistry underwent parallel amino acid changes in C4 lineages.
Our reasoning is that many proteins expressed in the chloroplast could have experienced similar selective pressure across multiple C3 to C4 transitions and might have accumulated convergent replacements as a result.

We based our analysis on the identification of amino acid replacements shared by pairs of ancestral C4 branches, defined here as branches corresponding to C3 to C4 transitions in the PACMAD phylogeny. We compared these changes to those identified in ancestral C3 branches, namely all C3 lineages that include only C3 species (Figs. 1 and 2), and to changes found between ancestral C3 and C4 branches. For each of the three possible pairs of photosynthesis types, C4-C4, C3-C4 and C3-C3, we determined the number of amino acid sites, genes and pairs of ancestral branches with convergent replacements. We detected signatures of convergent evolution in all types of datasets. First, we identified many individual replacements that emerged repeatedly and uniquely in C4 ancestral branches, particularly in the proteins RbcL, NdhH, NdhI and MatK. We also observed C3-specific convergent replacements in NdhF and RpoC2, and a case of multiple C4 and C3 convergent changes in Rps3. Additionally, we identified 8 chloroplast genes with one or more C4-specific convergent sites. Second, we found evidence of significantly higher rates of convergent replacements in C4 lineages in both RbcL and RpoC1, and several convergent replacements that occurred exclusively in C4-C4 pairs in proteins encoded by ndhG, ndhI, psaI, rpoA, rps4 and rps11. These genes are involved in a variety of biological processes in the chloroplast, from cyclic electron transport in photosystem I (ndhG and ndhI) and the stabilization of photosystem I (psaI), to transcription (rpoA and rpoC1), translation (rps4 and rps11) and CO2 fixation (rbcL).
Third, we identified statistically significant differences in pairs of C4 branches with convergent replacements (Table 2). Crucially, we observed more pairs with more convergent than divergent replacements in C4-C4 compared to both C3-C3 and C3-C4, even after removing replacements identified in the RuBisCO large subunit, RbcL.

Altogether, these findings suggest that multiple biochemical processes occurring in the chloroplast might have experienced recurrent adaptive changes associated with the emergence of C4 photosynthesis. Notably, some of these proteins are not directly involved in the light-dependent or light-independent reactions of photosynthesis, implying that processes such as the regulation of gene expression and protein synthesis in the chloroplast also experienced significant selective pressures during the transition from C3 to C4 plants. These results should motivate further studies to determine the prevalence of convergent amino acid replacements due to transitions to CCMs among the thousands of proteins encoded by nuclear genes but expressed in the chloroplast (Jarvis and López-Juez, 2013). Although such analyses are currently hindered by the limited number of sequenced nuclear genomes in taxa with multiple C3 and C4 lineages, including the PACMAD clade, genome-wide investigations of convergent replacements will be possible in the near future given the current pace of DNA sequencing in plants.

A further important conclusion drawn from these results is that convergent replacements are not uncommon between C3-C3 and C3-C4 lineages. This is possibly due to some environmental factors affecting the evolution of chloroplast genes that are shared across grass lineages regardless of their photosynthesis type.

The analysis of individual convergent replacements in the RuBisCO large subunit both confirmed previous findings and highlighted novel potentially adaptive changes among PACMAD species. Importantly, these novel convergent replacements are known to evolve under positive selection in non-PACMAD seed plants. This underscores the potential of our approach to identify novel changes with functional significance in the transition to CCMs in grasses, as opposed to standard statistical tests of positive selection. Alternatively, some RbcL sites could experience convergence across a variety of seed plants because of selective pressures other than those associated with C3 to C4 transitions.

Overall, our results are robust to several possible confounding factors. First, we analyzed branches that are strongly supported in our phylogeny reconstruction. The phylogenetic tree built using the 67 chloroplast genes is well supported, with the exception of three branches with fairly low bootstrap support. However, all three branches are short and have minimal impact upon our conclusions regarding C4 evolution (Fig. 1 and Figs. S1-S3). Moreover, the tree is largely consistent with a comprehensive recent study of 250 grasses based on complete plastome data (Saarela et al., 2018). Second, by focusing only on ancestral branches and ignoring amino acid replacements that may have occurred after the divergence of species within a given C4 clade, our strategy provided a conservative estimate of the number of convergent changes that could have occurred during the evolution of PACMAD grasses. Third, we eliminated genes with possible paralogous copies, which could have introduced false positive replacements.

We recognize some potential caveats in our approach. By relying on a relatively small sample of PACMAD species, our statistical power to detect signatures of convergent evolution was limited.
Increasing the number of ancestral C4 and C3 lineages should provide a broader representation of convergent replacements in C4 clades. Furthermore, we applied a strict definition of convergence that ignores changes to amino acids with similar chemical properties. We think that a conservative approach was necessary given that amino acids with similar chemical properties might have a very different functional effect on protein activity given their size and three-dimensional interactions with nearby residues. Third, we assumed that all the observed convergent replacements were the result of convergent phenotypic changes, which fall under the general category of homoplasy (Avise and Robinson, 2008). However, some of these replacements could instead represent hemiplasy, or character state changes due to introgression between different C4 lineages, incomplete lineage sorting (ILS) of ancestral alleles or horizontal gene transfer (Avise and Robinson, 2008). Recombination between chloroplast genomes, which is required for introgression to occur, has been documented but appears to be rare (Carbonell-Caballero

Olofsson et al., 2016). However, these transfers were limited to a few nuclear genes. Moreover, only very few cases of horizontal transfer between chloroplast genomes have been reported in plants (Stegemann et al., 2012). Therefore, the contribution of hemiplasy to the observed pattern of convergent replacements in C4 lineages is likely to be minimal. Finally, we treated all C4 species alike regardless of their photosynthesis subtype (NADP-ME, NAD-ME and PEPCK), which is known to vary among PACMAD subfamilies (Taylor et al., 2010). We argue that our results are conservative with regard to this aspect because convergent replacements should be expected to occur more often between C4 groups sharing the same photosynthesis subtype.

Conclusions
In this study, we showed that convergent molecular evolution in the form of recurrent amino acid replacements affected multiple chloroplast proteins in C4 lineages of the PACMAD clade of grasses. This finding significantly broadened the number of genes known to have evolved convergently in C4 species. We observed for the first time that genes not directly involved in photosynthesis-related processes experienced convergent changes, suggesting that future efforts should rely whenever possible on genome-wide analyses of amino acid changes rather than focus primarily on candidate key metabolic genes, similarly to previous investigations on gene expression patterns in C4 and CAM plants. Our methodological approach based on the comparison of convergent and divergent replacements among photosynthesis types underscores the importance of a more rigorous hypothesis-based testing of convergent evolution signatures in C4 plant evolution. Our results should inform more nuanced approaches to introduce CCM-like processes in C3 crops.

Acknowledgements
The project was supported by the National Institute of Food and Agriculture, U.S. Department of Agriculture, under award number TEX0-1-9599, the Texas A&M AgriLife Research, and the Texas A&M Forest Service.