Abstract
Survival in harsh environments is associated with several adaptations in plants. Species in the Portulacineae (Caryophyllales) have adapted to some of the most extreme terrestrial conditions on Earth, including extreme heat, cold, and salinity. Here, we generated 52 new transcriptomes and combined these with 30 previously generated transcriptomes, forming a dataset containing 68 species of Portulacineae, seven from its sister clade Molluginaceae, and seven outgroups. We performed a phylotranscriptomic analysis to examine patterns of molecular evolution within the Portulacineae. Our inferred species tree topology was largely congruent with previous analyses. We also identified several nodes that were characterized by excessive gene tree conflict and examined the potential influence of outlying genes. We identified gene duplications throughout the Portulacineae, and found corroborating evidence for previously identified paleopolyploidy events along with one newly identified event associated with the family Didiereaceae. Gene family expansion within Portulacineae was associated with genes previously identified as important for survival in extreme conditions, indicating possible molecular correlates of niche changes that should be explored further. Several of these genes also showed evidence for positive selection. The correlation between gene function and expansion suggests that gene and genome duplications have likely contributed to the extreme adaptations seen in the Portulacineae.
Introduction
Temperature and water availability are two major ecological variables that influence plant distribution and survival (e.g., Peel et al. 2007). Plants living in severe environments that are characterized by extremely low precipitation, extreme temperature ranges, intense sun and/or dry winds, often exhibit morphological, physiological, and ecological characteristics that allow them to cope with such highly variable abiotic stresses (Gibson 1996; Kreps et al. 2002). We may expect to find some signature of the transitions in the phenotypic adaptations to survive these environments within genes and genomes. While researchers have addressed these questions in several ways in different clades (e.g., Christin et al. 2007), until recently it has not been possible to obtain large scale genomic and transcriptomic data for large clades.
The Portulacineae comprises ~2200 species in nine families and 160 genera (Nyffeler and Eggli 2010a; Angiosperm Phylogeny Group 2016) and is inferred to have diverged from its sister group ~54 Mya (Arakaki et al. 2011). Species in this suborder are incredibly morphologically diverse, ranging from annual herbs to long-lived iconic stem succulents to trees. They also exhibit variable habitat preferences ranging from rainforests to deserts (Smith et al. 2018). While some species exhibit a worldwide distribution, the majority are restricted to seasonally dry areas of the northern and southern hemispheres, under either hot arid or cold arid conditions (Hernández-Ledesma et al. 2015). Many specialized traits, such as fleshy or succulent stems, leaves, and/or underground parts, have arisen in association with the adaptation of this clade to xeric, alpine and arctic environments. Portulacineae also includes several transitions to Crassulacean acid metabolism (CAM) and C4 photosynthesis that improve photosynthetic efficiency and hence minimize water loss in hot and dry environments compared to C3 photosynthesis (Edwards and Walker 1983; Borland et al. 2009; Edwards and Ogburn 2012).
The Cactaceae encompasses roughly 80% of species in Portulacineae (~1850 species; Nyffeler and Eggli 2010b) including one of the most spectacular New World radiations of succulent plants (Mauseth 2006; Arakaki et al. 2011). In addition to succulent structures that enable water storage, they exhibit an array of adaptations to cope with arid and semiarid conditions. These include spines (modified leaves) with some advantageous effects (Anderson 2001) and extensive, but relatively shallow, root systems that facilitate quick absorption of rainfall (Gibson and Nobel 1990). In addition to Cactaceae, Didiereaceae exhibit similar but less pronounced adaptations to the hot and seasonally dry environments of Madagascar (Arakaki et al. 2011). With only 20 species, Didiereaceae are far less diverse than Cactaceae, but they are no less interesting in their specialized adaptations, including the evolution of thorns and succulent leaves. In contrast to the largely tropical and subtropical Cactaceae and Didiereaceae, Montiaceae has a cosmopolitan distribution with most species occurring in colder environments. Some even inhabit high alpine zones and/or the high Arctic, especially species within the genera Claytonia and Montia (Ogburn and Edwards 2015; Stoughton et al. 2017). The repeated adaptations to harsh environments and remarkable habitat disparity within the Portulacineae suggest that ancestral genomic elements may be predisposed to repeated co-option within each clade.
In the face of rapid environmental changes, understanding the capacity of plants to survive and evolve in different environments requires study of the genetic basis of adaptation to abiotic stresses (e.g., cold, hot, and dry conditions; Chaves et al. 2003; Fournier-Level et al. 2011). For example, genes involved in the abscisic acid (ABA) signaling pathway [e.g., ABRE-binding protein/ABRE-binding factors, the dehydration-responsive element binding (DREB) factors, and the NAC transcription factors (e.g., Qin et al. 2004, Nakashima et al. 2012)] have been demonstrated to function in drought, cold, and heat responses (Nakashima et al. 2014). However, plant stress responses may be complicated by multiple stresses (Nakashima et al. 2014), and so the roles of many genes in stress responses remain largely unknown. Thus, by examining genes and genomes in plant lineages that have experienced macroevolutionary adaptations to stressful conditions we may yet gain new insights into the evolution of stress responses that may be broadly applicable to plant biology. Given their propensity for high stress environments, diverse ecophysiology, and relatively high number of species, the Portulacineae is an excellent group to explore such patterns.
Recent phylogenetic work has helped resolve relationships within the Portulacineae, although several key nodes around the early divergence of Portulacineae remain only moderately or poorly supported (Arakaki et al. 2011; Ogburn and Edwards 2015; Yang et al. 2018; Moore et al. 2017; Walker et al. 2018a). Several of these nodes may reflect rapid radiations (Arakaki et al. 2011) that deserve further examination. In addition, recent studies have inferred multiple paleopolyploidy events in Portulacineae, including one at the base of the clade (Yang et al. 2018). With increased taxon sampling, we may detect additional whole genome duplication (polyploidy) events. In this study, we analyzed 82 transcriptomes (of which 52 are newly generated) to better understand the evolutionary history of Portulacineae. By closely examining gene-tree/species-tree conflict, patterns of gene duplication and paleopolyploidy, and clade-specific evolutionary patterns of genes associated with stress-related responses, we aimed to 1) identify nodes with significant gene tree discordance and reveal the underlying mechanisms that are responsible for conflicting hypotheses; 2) confidently identify lineage-specific genome duplication events through improved taxon sampling within Portulacineae; and 3) test whether gene/genome duplications are associated with clade-specific adaptive traits.
Results and Discussion
Phylogeny of Portulacineae
Our concatenated supermatrix contained 841 gene regions and had a total aligned length of 1,239,871 bp. The matrix had a gene occupancy of 95.4% and character occupancy of 84.5%. Of the ingroup taxa, only Pereskia aculeata had a gene occupancy less than 80%, while the majority (67 taxa) had over 90% of genes present in the final supermatrix.
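The two occupancy figures can be illustrated with a minimal Python sketch. It assumes that gene occupancy is the fraction of taxon-by-gene cells in which a sequence is present, and that character occupancy is the fraction of aligned characters that are neither gaps nor missing data; the function names and the toy data are hypothetical and are not taken from the analysis pipeline.

```python
def gene_occupancy(presence):
    """presence: dict mapping taxon -> set of gene names present for that taxon.
    Returns the fraction of taxon-by-gene cells that are filled."""
    genes = set().union(*presence.values())
    filled = sum(len(g) for g in presence.values())
    return filled / (len(presence) * len(genes))


def character_occupancy(alignment, missing="-?"):
    """alignment: dict mapping taxon -> aligned sequence (all of equal length).
    Returns the fraction of characters that are not gap/missing symbols."""
    total = sum(len(seq) for seq in alignment.values())
    present = sum(1 for seq in alignment.values() for c in seq if c not in missing)
    return present / total


# Toy example (hypothetical data): two taxa, two genes, one gene missing in taxon B.
toy_presence = {"taxonA": {"gene1", "gene2"}, "taxonB": {"gene1"}}
toy_alignment = {"taxonA": "ACGT", "taxonB": "AC--"}
print(gene_occupancy(toy_presence))        # 3 of 4 cells filled
print(character_occupancy(toy_alignment))  # 6 of 8 characters present
```

Under these definitions a matrix can have high gene occupancy but lower character occupancy, as reported here, because a gene counted as present may still be only partially sequenced.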
Both the concatenated maximum likelihood tree (hereafter the CML tree) and the maximum quartet support species tree (MQSST) recovered the same topology with similar support values (fig. 1), all of which were largely consistent with previous estimates of the Portulacineae phylogeny. However, because bootstraps may be a poor indicator of support in large phylogenomic datasets (e.g., an incorrect phylogeny can be increasingly supported as more sequence data are added, Alfaro et al. 2003; Phillips et al. 2004), we also conducted gene tree conflict analyses (fig. 2A, Smith SA et al. 2015). For example, we found strong support (100% multi-locus bootstrap from ASTRAL and ML bootstrap) for the sister relationship of Molluginaceae and Portulacineae (fig. 1), consistent with several recent studies (Edwards and Ogburn 2012; Yang et al. 2015, 2018; Moore et al. 2017), but also found that most gene trees were either uninformative for this clade or conflicted with the ML result. Our analysis recovered the ACPT clade (Anacampserotaceae, Cactaceae, Portulacaceae and Talinaceae; Hershkovitz and Zimmer 1997; Applequist and Wallace 2001; Applequist et al. 2006; Nyffeler 2007; Nyffeler and Eggli 2010b; Ocampo and Columbus 2010; Arakaki et al. 2011; Moore et al. 2017), with high support across gene trees for a topology of (((A, P), C), T) (figs. 1-2). The resolution of major clades within Portulacineae generally agrees with that of Moore et al. (2017) using a targeted enrichment approach.
Our broad taxon sampling also allowed for resolution of many deeper relationships within Portulacineae families. We found support for the sister relationship of Didiereaceae subfamilies Portulacarioideae (Ceraria and Portulacaria) and Didiereoideae (Madagascar Didiereaceae; Applequist and Wallace 2003; fig. 1), and although many gene trees were uninformative at this node, the ML resolution was the dominant gene tree topology (fig. 2A). While relationships among the genera within the Didiereoideae have been difficult to resolve with targeted-gene analyses (Applequist and Wallace 2000; Nyffeler and Eggli 2010b), we recovered very strong support (100%) in gene tree analyses for a clade of ((Alluaudia+Alluaudiopsis), (Didierea+Decarya)). This result differs from that of Bruyns et al. (2014), who found Alluaudiopsis to be sister to the remaining Didiereoideae.
Within Cactaceae, we recovered three major clades (Pereskia s.s., Leuenbergeria, and the core cacti, as in Edwards et al. 2005, Bárcenas et al. 2011) with strong support (> 97%, fig. 1), but with substantial gene tree discordance (see below, fig. 2A). In fact, the majority of genes (>85%) in the earliest diverging nodes within Cactaceae were either uninformative or conflicted with each other (fig. 2A). Within the core cacti, we recovered Maihuenia as sister to Cactoideae with strong support, and this clade as sister to Opuntioideae as found by Edwards et al. (2005) and Moore et al. (2017). Nevertheless, the position of Maihuenia has previously been found to be highly unstable (e.g., within Opuntioideae, Butterworth and Wallace 2005; sister to Opuntioideae, Nyffeler 2002; or sister to Opuntioideae+Cactoideae, Hernández-Hernández et al. 2014; Moore et al. 2017). Unlike other areas within the Cactaceae that have high gene tree discordance, the monophyly of Opuntioideae (e.g., Barthlott and Hunt 1993; Wallace and Dickie 2002; Griffith and Porter 2009; Hernández-Hernández et al. 2011) and Cactoideae (Nyffeler 2002; Bárcenas et al. 2011) had relatively high concordance (95-97% of gene trees support these monophyletic groups; fig. 2A). The high level of discordance among gene trees within Opuntioideae likely reflects the difficulty of resolving relationships among tribes that may not be monophyletic (e.g., Cylindropuntieae, Opuntieae; Edwards et al. 2005; Bárcenas et al. 2011; Hernández-Hernández et al. 2011). However, these conflicts may also be attributed to limited taxon sampling as we only have one or two species representing each tribe (figs. 1 and 2A). The topology of the Cactoideae was largely congruent with previous studies (e.g., Butterworth et al. 2002; Nyffeler 2002; Bárcenas et al. 2011; Hernández-Hernández et al. 2011).
Within Cacteae, the resolution of the five sampled genera was strongly supported and is consistent with what was recovered by Hernández-Hernández et al. (2011) and Vázquez-Sánchez et al. (2013) using five loci. However, the core Cactoideae (sister to Cacteae, Hernández-Hernández et al. 2011) was poorly supported (52% CML and 54.5% MQSST, fig. 1) in both species tree inferences. With limited taxon sampling, some relationships (e.g., the monophyly of the Pachycereeae) are largely consistent with previous classification (e.g., Nyffeler and Eggli 2010b), whereas others (e.g., the position of Gymnocalycium) are still controversial among studies (Arakaki et al. 2011; Bárcenas et al. 2011; Hernández-Hernández et al. 2014).
Assessing conflict on specific nodes
As genomes and transcriptomes have become increasingly available, our ability to analyze phylogenetic conflict and concordance across hundreds or thousands of genes has increased significantly. One of the major findings of these genome-scale studies is that gene tree conflicts are prevalent throughout the tree of life (e.g. Degnan and Rosenberg 2009, Jarvis et al. 2014). Incongruence among genealogies can result from numerous processes, including incomplete lineage sorting (ILS), horizontal gene transfer (HGT), and gene duplication and loss (Maddison, 1997). Methods that explicitly model the source of discordance among multi-locus data are very important for correctly inferring the underlying species tree (Liu and Pearl 2007; Liu et al. 2008), but a close examination of gene-specific discordance is also valuable because it may reveal the underlying processes that characterize the evolutionary history of a given clade (e.g., Shen et al. 2017). Above, we discussed gene tree conflict as it pertains to support, or lack thereof, for different clades. To specifically address gene-specific support for alternative relationships, we examined individual genes and their support for nodes, focusing on phylogenetic relationships that had previously been identified to have significant conflicts. With these conflict assessments, we examined one focal node. This allowed for alternative resolutions at a node while allowing for the remainder of the tree to be optimized for the highest maximum likelihood topology. Therefore, even though two genes may support the same resolution for the node of interest, they may have completely different topologies in areas excluding this node. This focused analysis allowed us to isolate the signal at particular nodes without having to accommodate the many diverse processes that may shape conflict in other parts of the tree.
Within Portulacineae, several regions of the tree have been identified in previous studies as having significant conflict and/or low support, e.g. the early divergences of Cactaceae, the relationship among Portulacaceae, Cactaceae, and Anacampserotaceae, and the positions of Basellaceae and Didiereaceae (e.g., Moore et al. 2017, Walker et al. 2018a). For example, Portulacaceae has been recovered as sister to Anacampserotaceae (Ogburn and Edwards 2015; Walker et al. 2018a; and recovered here) and to Cactaceae (Arakaki et al. 2011; Moore et al. 2017). We examined the three topological positions among these families to determine the gene-wise pattern of support for each topology. Within the above three topological constraints, 476 (out of 841) genes supported Anacampserotaceae+Portulacaceae, 164 genes supported Cactaceae+Portulacaceae, and 201 genes supported Cactaceae+Anacampserotaceae.
Recently, studies have demonstrated that only a few genes in phylogenomic datasets can drive the resolution of nodes (e.g., Brown and Thomson 2017; Shen et al. 2017; Walker et al. 2018b). Using these node-specific analyses, we found several outlying genes (i.e., genes that strongly favor one topology over the others) despite high topological support as measured by bootstraps (fig. 2B). For example, one gene (cluster4488, with homology to the "ARABILLO 1-like" genes in Beta vulgaris) supported Cactaceae+Portulacaceae strongly over the others (>60 lnL units). Another gene (cluster7144, with homology to the "UV-B induced protein chloroplastic-like" gene in Chenopodium quinoa) supported Cactaceae+Anacampserotaceae strongly (>100 lnL units over the other relationships). Neither of these genes exhibited obvious alignment errors.
We conducted similar analyses for the resolution of Basellaceae and Didiereaceae and for the resolutions of the early diverging relationships within the Cactaceae. Specifically, 506 genes supported (Basellaceae, (Didiereaceae+others)) and 334 genes supported Basellaceae+Didiereaceae as found by Soltis et al. (2011) and Anton et al. (2014). No significant outlying genes were observed supporting either of the above topological constraints (fig. 2B). For the early diverging Cactaceae, we examined the genes that supported the grade of relationships (Leuenbergeria, (Pereskia+others)) versus Leuenbergeria+Pereskia. Overall, 472 genes supported the grade, with 368 supporting Leuenbergeria+Pereskia. We also found that one gene (cluster4707, with homology to the "NF-X1-type zinc finger protein") strongly supported the grade topology (>110 lnL units), without any obvious misalignment.
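The gene-wise tallies and outlier checks described above can be sketched as follows. This is a minimal illustration in Python, assuming that each gene has already been scored with one log-likelihood per constrained topology; the function name, the 50 lnL cutoff, and the toy values are hypothetical and are not the thresholds or data used in this study.

```python
def summarize_support(lnl, outlier_margin=50.0):
    """lnl: dict mapping gene -> {topology label: log-likelihood under that constraint}.
    Returns (number of genes favoring each topology,
             list of (gene, favored topology, lnL margin over the runner-up))
    for genes whose margin exceeds outlier_margin (an arbitrary, assumed cutoff)."""
    counts = {}
    outliers = []
    for gene, scores in lnl.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best_topology, best_lnl = ranked[0]
        margin = best_lnl - ranked[1][1]  # difference from the second-best topology
        counts[best_topology] = counts.get(best_topology, 0) + 1
        if margin > outlier_margin:
            outliers.append((gene, best_topology, margin))
    return counts, outliers


# Toy input (hypothetical values): three genes scored under three constraints.
toy_lnl = {
    "geneA": {"A+P": -1000.0, "C+P": -1002.5, "C+A": -1003.0},
    "geneB": {"A+P": -2100.0, "C+P": -2035.0, "C+A": -2098.0},
    "geneC": {"A+P": -1500.0, "C+P": -1501.0, "C+A": -1499.0},
}
counts, outliers = summarize_support(toy_lnl)
print(counts)    # one gene favors each topology in this toy example
print(outliers)  # geneB favors C+P by 63 lnL units
```

Separating the tally from the margin makes the contrast in the text explicit: a topology can be favored by most genes while a single gene with a very large margin still dominates a concatenated likelihood.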
WGDs in the evolution of Portulacineae
Whole-genome duplications (WGD) have been recognized as having a profound influence on the evolutionary history of extant lineages, especially for plants (e.g., Cui et al. 2006; Soltis and Soltis 2009; Jiao et al. 2011; Wendel 2015; Yang et al. 2015; Smith et al. 2018). Major WGD events have been inferred within the Caryophyllales (Yang et al. 2015, 2018; Walker et al. 2017) including three within Portulacineae: in the ancestor of Portulacineae, in the ancestor of Basellaceae, and within Montiaceae. The improved taxon sampling in the current study enabled us to infer additional events of putative WGD and large-scale gene duplications (fig. 1). Specifically, large-scale gene duplications were found in the ancestor of Portulacineae (node 13), subclades of Montiaceae [along the origin of Calyptridium (node 17), and the sister clade of Calandrinia (node 19)], in the ancestor of Didiereoideae (node 27), and in the ancestor of Cactaceae, Cactoideae and Opuntioideae (nodes 35, 39 and 50).
We considered a high percentage of gene duplications (13.4%) and a Ks peak (~0.4-1.0) shared by all members of the clade to be evidence for a WGD event in the ancestor of Portulacineae (Yang et al. 2015). Our identification of putative WGD at the base of Portulacineae contrasts with the analyses of Yang et al. (2018) that found increased gene duplication (24%) in the ancestor of Portulacineae+Molluginaceae. We did not detect a similar pattern of gene duplications at this node (1%, fig. 1), suggesting that the estimate of Yang et al. (2018) could result from phylogenetic uncertainty. We found a high percentage of gene duplication for the Didiereoideae (50.1%, fig. 1) accompanied by a very recent Ks peak (between 0.05-0.10) for all the species within this clade. We also identified genome duplications within the Montiaceae. Yang et al. (2018) identified WGD in the ancestor of Claytonia species. With expanded sampling, we placed this WGD event at the ancestor of Lewisia and Claytonia, as all four species in this clade exhibited a Ks peak between 0.2-0.4 (supplementary figs. S2 and S3, Supplementary Material online). We also identified an elevated number of gene duplications in the two Calyptridium species, accompanied by a very recent Ks peak (between 0.05-0.10; fig. 1).
While we identified a high number of gene duplications at the origin of the Cactaceae, we did not infer a WGD event as we did not detect a corresponding elevated Ks peak. This finding is consistent with other recent studies (Walker et al. 2017; Yang et al. 2018). Several other nodes within the phylogeny (e.g., nodes 50 and 39) exhibited similar patterns with high numbers of duplications but no corresponding Ks peak (fig. 1). There are several complications (gene loss, gene tree conflict, life history shifts, etc.) that may obscure the detection of gene/genome duplications, and further investigation with more taxa is likely to help shed light on these patterns in future studies.
For identifying duplications within a single lineage, we examined Ks plots as gene tree-based methods cannot accurately infer WGD on terminal branches. Our Ks analyses inferred two additional genome duplication events occurring within single taxa: in Basella alba (Ks=0.45) and Mollugo verticillata (Ks=0.25) (supplementary fig. S2, Supplementary Material online). Both species showed higher chromosome counts relative to close allies (Yang et al. 2018). These WGD inferences are consistent with previous studies (Yang et al. 2015, 2018).
Although Ks plots have been widely applied to WGD inference, higher Ks values (e.g., >0.75) are associated with increasingly large error (Li 1997), so it is more appropriate to detect recent WGDs with Ks values less than 2 (Vanneste et al. 2013). On the other hand, peaks at smaller Ks values (e.g., < 0.25) can also be difficult to interpret due to splice variants or other anomalies produced during transcriptome assembly. However, in the case of slowly evolving lineages or recent duplications, examining only larger Ks values can miss some potential duplications. We note that this may be the case with the Didiereoideae, where the Ks peak falls in a range that also may contain splice variants (i.e., the Ks peak occurred at < 0.1, supplementary fig. S2, Supplementary Material online).
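The role of the Ks window in these interpretations can be illustrated with a minimal Python sketch that bins paralog-pair Ks values and reports the fullest bin. It is a simplified stand-in for inspecting a Ks plot, not the procedure used in this study; the function name, the default window, the bin width, and the toy values are all assumptions made for illustration.

```python
def ks_peak(ks_values, lo=0.0, hi=2.0, bin_width=0.1):
    """Histogram Ks values within [lo, hi) and return
    (midpoint of the fullest bin, number of paralog pairs in that bin)."""
    n_bins = int(round((hi - lo) / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if lo <= ks < hi:
            idx = min(int((ks - lo) / bin_width), n_bins - 1)
            counts[idx] += 1
    best = max(range(n_bins), key=counts.__getitem__)
    return lo + (best + 0.5) * bin_width, counts[best]


# Toy Ks values (hypothetical): a cluster near 0.45 plus a few very low values
# of the kind that splice variants or assembly artifacts could produce.
toy_ks = [0.02, 0.03, 0.04, 0.42, 0.43, 0.44, 0.46, 0.47, 1.3]
print(ks_peak(toy_ks))           # fullest bin is centred near 0.45
print(ks_peak(toy_ks, hi=0.1))   # restricting to Ks < 0.1 reports the low-Ks bin
```

Raising `lo` excludes the low-Ks range in which splice variants may accumulate, at the cost of missing genuinely recent duplications, which is the trade-off described above.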
Some studies have suggested that some WGDs may be associated with adaptations to extreme environments (e.g., Stebbins 1971; Soltis and Soltis, 2000; Brochmann et al. 2004), speciation/diversification (Stebbins 1971; Wood et al. 2009), and success at colonization of new regions (Soltis and Soltis, 2000). Smith et al. (2018), using the information regarding duplications from Yang et al. (2018), found WGDs within the Caryophyllales to be associated with shifts in climatic niche. The addition of a WGD involving the Didiereoideae supports this finding. Unlike their sister clade (including Ceraria pygmaea and Portulacaria afra), which is mainly distributed in mainland southern Africa, all species in Didiereoideae live in the ecoregion of southwest Madagascar. While we do not suggest that WGDs are always associated with shifts in niche or ecological setting, within the Caryophyllales, this appears to be a common pattern.
Lineage-specific gene expansions associated with adaptive traits
Analyses of Gene Ontology (GO) overrepresentation were conducted on genes duplicated at nodes 15, 17, 19, 26, 27, 35, 39, 50, and 71 (fig. 1). Of the genes duplicated at each clade, 30%-55% had identifiable Arabidopsis IDs and almost all could be mapped in PANTHER (table 1). Genes belonging to “calcium ion binding” (GO:0005509) showed significant overrepresentation at the origins of the Montiaceae (node 15; except for Phemeranthus parviflorus) and the Didiereoideae (node 27); genes belonging to “sulfur compound metabolic process” (GO:0006790) also exhibited significant overrepresentation at the origin of Didiereoideae; and GO overrepresentation in “Hydrolase activity, acting on ester bonds” (i.e., esterase, GO:0016788, p = 0.028, table 2) was found at the origin of the Cactaceae (node 35). Many of the enriched GO terms described functional classes that are involved in diverse processes, such as “calcium ion binding”, linked to the universal secondary messenger calcium, or “hydrolase activity”, linked to ester bond chemistry. However, others may be plausibly linked to potential adaptive traits in the Portulacineae. In this regard, the expansion of genes belonging to the “sulfur compound metabolic process” (GO:0006790) is of interest because sulfur-bearing evaporite compounds (principally gypsum) are commonly found in soils in areas with low rainfall and high evaporation rates (Watson 1979), such as the hot and cold deserts in which many lineages of the Portulacineae have diversified. Researchers have also connected primary sulfur metabolism (e.g., sulfate transport in the vasculature, its assimilation in leaves, and the recycling of sulfur-containing compounds) with drought stress responses (Chan et al. 2013). Specifically, sulfate is implicated in ABA-induced stomatal closure, and drought-responsive metabolites have been shown to be coordinated with sulfur metabolic pathways (Chan et al. 2013).
Consequently, duplication and overrepresentation of these sulfur metabolic genes could potentially be evidence of adaptation to the harsh environments of hot and cold deserts (e.g., Lee et al. 2011). However, while it is tempting to speculate that these duplications have been maintained through adaptation, we recognize that such ‘duplication’ events may simply be the remnants of ancient polyploidy. In the absence of functional characterization, it is also necessary to interpret all GO term analyses with caution.
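For readers unfamiliar with overrepresentation tests, the underlying logic can be sketched with a one-sided hypergeometric test, which is one common formulation of the question "are more duplicated genes annotated with this GO term than expected by chance?". This Python sketch is an assumption-laden illustration and does not claim to reproduce the PANTHER calculation; the function name and the toy counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).
    N: annotated genes in the background set
    K: background genes carrying the GO term
    n: duplicated genes tested
    k: duplicated genes carrying the GO term"""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy counts (hypothetical): 10 background genes, 3 carry the term,
# 4 genes are duplicated, and 2 of those carry the term.
print(hypergeom_enrichment_p(k=2, n=4, K=3, N=10))  # 70/210, i.e. one third
```

In practice such p-values are computed for many GO terms at once, so a correction for multiple testing is normally applied before any term is reported as significant.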
Gene families show broad expansions across Portulacineae
In addition to looking at lineage-specific gene expansions, we also examined gene-family expansion without reference to a specific organismal lineage within Portulacineae. This same approach has been instrumental in revealing gene expansion and neo-functionalization implicated in the evolution of betalains (Yang et al. 2015; Brockington et al. 2015) in Caryophyllales. Here, we explored the top 20 most expanded gene families (hereafter called TOP20, table 3). The TOP20 included genes encoding transporters, proteases, cytoskeletal proteins and enzymes that are involved in photosynthetic pathways (table 3). Among them, some are implicated in responses to drought: the gene encoding plasma membrane intrinsic protein (PIP) is an aquaporin that can regulate water flux through membranes (Vandeleur et al. 2009), heat shock proteins can prevent proteins from aggregating or being rendered nonfunctional (Kiang and Tsokos 1998), and ubiquitin can correct or degrade malfunctioning proteins (Hochstrasser 2009). But perhaps the most interesting gene radiations involve those encoding phosphoenolpyruvate carboxylase (PEPC, Silvera et al. 2014) and NADP-dependent malic enzyme (Ferreyra et al. 2003), which are key enzymes involved in CAM/C4 photosynthesis (Smith and Winter 1996; Cushman 2001). The role of gene duplication in the evolution of C4 photosynthesis has been contentious, and some authors have proposed that neo-functionalization of genes following duplication has not played a major role in the evolution of C4 photosynthesis (Gowik and Westhoff 2011; Williams et al. 2012; van den Bergh et al. 2014). However, recent analyses provide evidence that duplication and subsequent retention of genes is associated with the evolution of C4 photosynthesis (Emms et al. 2016), specifically including the NADP-dependent malic enzyme.
Moreover, convergent evolution in several key amino acid residues of PEPC has been suggested to be associated with the origin of both C4 and CAM (Christin et al. 2007; Christin et al. 2014). A recent study has also confirmed the occurrence of multiple rounds of duplication within the major PEPC paralog (PEPC1E1) in the ancestral Portulacineae (Christin et al. 2014). Our analyses confirm that these genes have experienced expansions within the Portulacineae, suggesting a role of this process in the evolution and adaptation in this group.
Targeted analyses of drought and cold associated genes
We focused on a set of 29 functionally annotated genes known to be involved in drought and cold tolerance (table 4) as the Portulacineae have repeatedly evolved into cold and dry environments. We found that almost all of these genes have experienced gene duplications, including at the origin of Didiereoideae, within Montiaceae and within Cactaceae, when only including nodes with SH-like support > 80 (fig. 3). Some of these lineages have experienced WGDs, and so detecting gene duplications in these genes is, perhaps, unsurprising. However, their maintenance and, in some cases, further expansion is notable. For example, we found duplications in WIN1 (SHN1) proteins in both Didiereoideae and Montiaceae (table 4). These proteins belong to the Apetala2/Ethylene Response Factor (AP2/ERF) family and are transcription factors associated with epicuticular wax biosynthesis that increases leaf surface wax content. Overexpression of this gene can reduce direct water loss through the cuticular layer and increase plant tolerance to abiotic stresses (Sajeevan et al. 2017).
It is also noteworthy that many of the duplications in these 29 focal genes occurred at the origin of Portulacineae (6 genes in table 4) or earlier (fig. 3). This suggests the possibility that some genes associated with adaptations to harsh environments diversified early in the evolution of the Portulacineae and the Caryophyllales. While different lineages may have adapted to different environments throughout the evolution of the Portulacineae, their predisposition to become adapted to these environments may be the result of the early diversification of the gene families. Additionally, genes from the same family may experience different evolutionary histories. For example, six homologs that are part of the CDPK (calcium-dependent protein kinase, also CPK) gene family are known to be involved in drought stress regulation (Geiger et al. 2010; Brandt et al. 2012). However, the number of duplications varied between different lineages, regardless of the similarity in habitat [four duplicated at the origin of Didiereoideae, one within Montiaceae (node 17), one at Portulacaceae, and two within the Cactaceae (nodes 35 and 57)].
These 29 targeted genes also exhibited variable lineage-specific positive selection within Portulacineae (table 4). The strongest signals of positive selection were all found in genes associated with drought and/or cold tolerance in the ABA signaling pathway (such as the NAC10/29 and WRKY33 TFs, the CDPK18 and the PP2Cs, table 4, e.g., Golldack et al. 2014; Nakashima et al. 2014; Huang et al. 2015; Li et al. 2017), a central regulator of abiotic stress resistance in plants. Interestingly, the ABI1 and PP2CA genes (Mishra et al. 2006; Merlot et al. 2001), which encode two proteins of the PP2C family, were among the genes with the highest number of lineages under positive selection in our tests. As negative regulators, PP2Cs can regulate numerous ABA responses, such as stomatal closure, osmotic water permeability of the plasma membrane, drought-induced resistance and rhizogenesis, seed germination and cold acclimation (Gosti et al. 1999; Merlot et al. 2001; Mishra et al. 2006). In addition to PP2C genes, the positive regulators SNF1-related protein kinase 2s (SnRK2s) are also related to the ABA signaling module (Hubbard et al. 2010). While we did not detect positive selection in the three SnRK2 genes (i.e., SnRK2.4, 2.5, 2.6, table 4), we found that they experienced ancient duplications at the origin of Portulacineae or earlier and recent duplications within Montiaceae, Didiereaceae, or Cactaceae (nodes 17, 27, 39; table 4). Positive selection and gene family expansion in some of these gene families within Cactaceae, Didiereaceae, and Montiaceae suggest that these genes warrant further investigation for their potential role in the evolution of adaptations to challenging environmental conditions.
Conclusion
Using a transcriptomic dataset spanning 82 species, we reconstructed a phylogeny of Portulacineae that reveals significant underlying gene-tree/species-tree discordance. These conflicts may arise from an array of biological sources (e.g., hybridization, ILS), and taxon sampling (even with a large number of genes, Walker et al. 2017) or sequence type may also influence species topologies in phylogenomic analyses (Reddy et al. 2017). We examined specific nodes in detail to document how much support there may be for particular relationships, regardless of conflict in the rest of the tree. Our findings, along with others recently published (Smith SA et al. 2015; Brown and Thomson 2017; Shen et al. 2017; Walker et al. 2018b), further demonstrate the need to examine gene tree relationships in detail. As genome and transcriptome studies continue to document significant conflict, we need to consider different means of resolving relationships that can filter or accommodate these signals. Furthermore, the conflicts within these data likely hold interesting evolutionary questions and answers that should be analyzed in more detail.
With increased sampling, we also recorded new paleopolyploidy events. This contributes to a growing body of literature that is documenting the importance of these events in the evolution of plants (e.g., Cui et al. 2006; Smith et al. 2018). Also, as has been repeatedly demonstrated, with additional sampling effort, we are likely to find additional duplication events. The importance of these duplications for the evolution of individual clades needs to continue to be explored in detail. The gene expansion and genome duplications found within the Portulacineae seem to be associated with shifts into different environments and climates. This has been suggested by previous researchers (e.g., Smith et al. 2018) and should be examined further across other diverse angiosperm lineages.
Finally, by examining gene families in detail we are better able to highlight potential molecular evolutionary patterns associated with particular adaptations. However, the approaches explored here and by other researchers have major limitations, as they rely on annotations based on model organisms that may be distantly related to the species of interest. While a gene may show interesting patterns, the GO label assigned to it may or may not carry much meaning. These genes may have different functions in different lineages, so the results from these analyses should not be overinterpreted. Despite these limitations, these analyses uncover interesting gene regions that should be explored further, regardless of the accuracy of the label, as the patterns of selection and duplication alone suggest their importance in the evolution of the lineage, even if not related to the particular adaptation of interest. As angiosperms continue to be more thoroughly sequenced, we will continue to shed light on the molecular evolutionary patterns and processes that have shaped the tree of life.
Material and Methods
Taxon sampling and transcriptome generation
Sixty-eight ingroup species were included in this study, representing all Portulacineae families (Anacampserotaceae, Cactaceae, Basellaceae, Didiereaceae, Montiaceae, Talinaceae, and Portulacaceae) sensu APG IV (Angiosperm Phylogeny Group, 2016) except for Halophytaceae (supplementary tables S1 and S2, Supplementary Material online). We also included 14 Caryophyllales species as outgroups, including seven taxa of Molluginaceae (supplementary tables S1 and S2, Supplementary Material online). Of the 82 transcriptomes included in the study, 52 were newly generated following the protocol of Yang et al. (2017), which is briefly described below (supplementary table S1, Supplementary Material online). RNA was extracted from flash-frozen young leaves and/or flower buds using either the Ambion PureLink Plant RNA Reagent (ThermoFisher Scientific Inc, Waltham, MA, USA), Aurum™ Total RNA Mini Kit, or a hot acid phenol protocol (Yang et al. 2017). Total RNA was quantified with the Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). It was then used for library preparation using either the TruSeq RNA Sample Preparation Kit v2 (Illumina, Inc., San Diego, CA) or the KAPA Stranded mRNA-Seq kit (KAPA Biosystems, Wilmington, Massachusetts, USA), and sequenced on an Illumina HiSeq platform at the University of Michigan DNA Sequencing Core (see deposited SRA, BioProject: XXX) (supplementary table S1, Supplementary Material online).
On average, ~30 million paired-end reads were generated for each transcriptome and error-corrected with Rcorrector (Song and Florea 2015). Corrected reads were assembled using Trinity v2.2.0 (Grabherr et al. 2011) after filtering of Illumina adapters (Illuminaclip 2:30:10) and low-quality reads (sliding window 4:5, leading 5, trailing 5, and min length 25) using Trimmomatic v0.36 (Bolger et al. 2014). Transdecoder v2.0 (Haas et al. 2013) was then used to translate assembled transcripts with guidance of BLASTP against a concatenated proteome database including Arabidopsis thaliana (TAIR10) and Beta vulgaris (Dohm et al. 2014, RefBeet-1.2.2) (release-34 from EnsemblPlants, http://plants.ensembl.org/info/website/ftp/index.html). Translated amino acid sequences (PEP) and coding DNA sequences (CDS) of the newly collected samples, as well as 30 publicly available datasets (Yang et al. 2015; Brockington et al. 2015; Walker et al. 2018a), were used for subsequent analyses (supplementary table S2, Supplementary Material online). Highly similar sequences were reduced using CD-HIT v4.6 for each dataset (PEP: -c 0.995 -n 5; CDS: -c 0.99 -n 10; Fu et al. 2012).
Homology Inference
Homology inference was carried out following the pipeline of Yang and Smith (2014) with minor modifications (code available at https://bitbucket.org/ningwang83/Portulacineae). We describe the procedure briefly below. An all-by-all BLASTN search was conducted using CDS data (-evalue 10 -max_target_seqs 1000) and hits with a hit fraction > 0.49 (i.e., aligned length divided by query and by subject length both > 0.4) were clustered using MCL 14-137 (-tf 'gq(5)' -I 1.4; van Dongen 2000). We retrieved clusters including at least 26 species in Cactaceae, seven species in Montiaceae, seven species in Didiereaceae, two species in each of Portulacaceae, Talinaceae and Molluginaceae, and one species in each of Anacampserotaceae and outgroup as putative homolog groups (supplementary tables S1 and S2, Supplementary Material online). Each cluster with fewer than 1000 sequences was aligned by MAFFT v7 (--genafpair --maxiterate 1000; Katoh and Standley 2013) and aligned columns with more than 90% missing data were removed using Phyutility v2.2.6 (-clean 0.1; Smith and Dunn 2008). A maximum likelihood (ML) tree for each cluster was inferred in RAxML v8.1.22 (Stamatakis 2015) with the GTR+GAMMA model. The six clusters containing more than 1000 sequences were aligned with PASTA v1.6.4 (Mirarab et al. 2015), trimmed with “-clean 0.05” in Phyutility, and their ML trees were inferred in FastTree v2.1.7 (Price et al. 2009, 2010) with the GTR model.
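The hit-fraction filter applied before MCL clustering can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the `Hit` record, the function names, and the assumption that query and subject lengths are available alongside each BLASTN hit are all hypothetical, and the cutoff is left as an argument rather than fixed.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One BLASTN hit (hypothetical record; field names are illustrative)."""
    query: str
    subject: str
    aln_len: int  # aligned length
    qlen: int     # full query length
    slen: int     # full subject length


def passes_hit_fraction(hit: Hit, cutoff: float) -> bool:
    """True when the aligned length covers more than `cutoff` of BOTH sequences."""
    return hit.aln_len / hit.qlen > cutoff and hit.aln_len / hit.slen > cutoff


def filter_hits(hits: Iterable[Hit], cutoff: float) -> List[Hit]:
    """Keep only the hits that would be passed on to clustering."""
    return [h for h in hits if passes_hit_fraction(h, cutoff)]
```

Requiring the fraction on both the query and the subject side keeps a short fragment from linking two otherwise unrelated long sequences into one cluster.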
We trimmed the resulting homolog trees by 1) removing spurious terminal branches that were longer than 0.5 (absolute cutoff), or longer than 0.4 and ten times longer than their sister taxon (relative cutoff); 2) masking both monophyletic and paraphyletic tips of the same taxon to remove possible isoforms, in-paralogs, or recent gene duplications; and 3) separating clades connected by internal branches longer than 0.6 and saving those with more than five taxa into different clusters, as they might represent alternate transcripts or in-paralogs. A new set of CDS sequence files was generated based on the refined trees, and the process of tree building and filtering was repeated three times.
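The tip-trimming rule in step 1 reduces to a two-part test on branch lengths. The sketch below is only an illustration of that rule; the function name and argument names are hypothetical, and the defaults simply restate the cutoffs given above.

```python
def is_spurious_tip(tip_length: float,
                    sister_length: float,
                    abs_cutoff: float = 0.5,
                    rel_cutoff: float = 0.4,
                    ratio: float = 10.0) -> bool:
    """Flag a terminal branch for removal.

    A tip is flagged if it exceeds the absolute cutoff, or if it exceeds the
    relative cutoff AND is more than `ratio` times longer than its sister.
    """
    if tip_length > abs_cutoff:
        return True
    return tip_length > rel_cutoff and tip_length > ratio * sister_length
```

For example, a tip of length 0.45 next to a sister of length 0.01 is flagged by the relative rule, whereas the same tip next to a sister of length 0.1 is kept.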
The PyPHLAWD package (https://github.com/FePhyFoFum/pyphlawd) was used to add data from two additional species (Anacampseros filamentosa and Basella alba) during the third round of homolog filtering, which avoided repeating the time-consuming all-by-all BLASTN. Specifically, we modified several default settings in conf.py: “takeouttaxondups = False” to keep duplicated taxa and, for filtering BLASTN hits, “length_limit = 0.6” to exclude sequences shorter than 60% of the alignment, “evalue_limit = 5” to keep hits with an e-value less than 5, and “perc_identity = 20” to exclude sequences with less than 20% identity. After preparing the CDS files of the two species, we ran add_internal_seqs_to_clusters.py twice (one species at a time) to BLAST the target species’ CDS sequences against each homolog, add the hits that met the above criteria to the existing homologs, and merge them into the existing alignments using MAFFT v7. In total, we obtained 8,592 final homolog clusters for inferring orthologs and conducting gene duplication analyses and GO annotation.
Orthology Inference and Species Tree Estimation
We used the rooted tree (RT) method in Yang and Smith (2014) to extract orthologs from homolog trees with Beta vulgaris as the outgroup. Sister clades with duplicate taxa were compared, and the side with fewer taxa was removed. This procedure was carried out from root to tips iteratively on all subclades until no duplicate taxa were present. Orthologous clades with more than 30 ingroup taxa were retained and subjected to calculation of taxon occupancy statistics (supplementary fig. S1, Supplementary Material online). We further extracted orthologs with at least 76 taxa out of 82, aligned and cleaned them with MAFFT (--genafpair --maxiterate 1000) and Phyutility (-clean 0.3), and inferred an ML tree for each ortholog using RAxML with the GTR+GAMMA model. After removing taxa with a terminal branch longer than 0.1 and 10 times longer than the sister clade from 58 orthologs, the 841 orthologs that included at least 77 taxa and 500 aligned DNA characters were realigned with PRANK v.140110 (Löytynoja and Goldman 2008) using default settings and trimmed in Phyutility (-clean 0.3). The resulting alignments were used for phylogenetic tree reconstruction.
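The root-to-tip pruning at the heart of the RT method can be sketched on a toy tree representation. The code below is a simplified illustration, not the implementation of Yang and Smith (2014): trees are assumed to be nested 2-tuples with tip labels of the hypothetical form "taxon@seqid", outgroup handling is omitted, and ties are resolved by keeping the left clade, which is an arbitrary choice.

```python
def taxa(tree):
    """Set of taxon names under a node; tips are strings like 'taxon@seqid'."""
    if isinstance(tree, str):
        return {tree.split("@")[0]}
    left, right = tree
    return taxa(left) | taxa(right)


def prune_paralogs(tree):
    """Walk from root to tips; where sister clades share any taxon,
    discard the side with fewer taxa and continue on the retained side."""
    while not isinstance(tree, str):
        left, right = tree
        if taxa(left) & taxa(right):
            # Duplicate taxa across the two sides: keep the larger side.
            tree = left if len(taxa(left)) >= len(taxa(right)) else right
        else:
            return (prune_paralogs(left), prune_paralogs(right))
    return tree
```

Applied to `((("A@1", "B@1"), "C@1"), ("A@2", "B@2"))`, the sketch drops the two-taxon clade and returns `(("A@1", "B@1"), "C@1")`, in which every taxon appears once.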
The species tree was inferred using two methods. First, a concatenated matrix was built with the 841 orthologous genes and an ML tree was estimated in RAxML with the GTR+GAMMA model, partitioned by gene. Node support was evaluated with 200 fast bootstrap replicates. Second, we constructed an ML tree and 200 fast bootstrap trees for each ortholog using RAxML with the GTR+GAMMA model. The maximum quartet support species tree (MQSST) was estimated in ASTRAL 4.10.12 (Mirarab et al. 2014) using the ML trees from RAxML, with uncertainty evaluated by 200 bootstrap replicates using a two-stage multilocus bootstrap strategy (Seo 2008).
Assessing conflicts among gene trees
Conflicts among gene trees were assessed using the 841 rooted ortholog trees mapped onto the concatenated ML tree (which has the same topology as the MQSST, see Results and Discussion). When rooting orthologous trees with Phyx (Brown et al. 2017), the outgroup was chosen based on a three-level preference: 1) Beta vulgaris was used preferentially, 2) Limeum aethiopicum and/or Stegnosperma halimifolium were used if Beta was not present in the tree, and 3) Sesuvium portulacastrum, Delosperma echinatum, Anisomeria littoralis, and Guapira obtusata were used if the first three species were not present. Phyparts (Smith SA et al. 2015) was used for comparing gene trees to the inferred species tree topology, restricted to nodes with >= 70% bootstrap support in the gene trees. We explored conflict and alternative topologies in more detail for three areas of the tree: 1) Cactaceae and relatives, 2) the early diverging branch within Cactaceae, and 3) the placement of Basellaceae and Didiereaceae. To do this, we constructed constraint trees for each alternative topology, calculated likelihood scores for each gene tree with the individual topology constraints, and then compared the likelihood scores for each alternative resolution in each gene.
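The per-gene comparison of constrained likelihoods can be tallied as in the following Python sketch. It assumes a hypothetical dictionary mapping each gene to the log-likelihood obtained under each constraint topology (at least two per gene); the `min_delta` threshold and the "uninformative" label are illustrative additions, not part of the procedure described above.

```python
from collections import Counter
from typing import Dict


def tally_best_topology(lnl: Dict[str, Dict[str, float]],
                        min_delta: float = 0.0) -> Counter:
    """Count, for each constraint topology, the genes that favor it.

    lnl[gene][topology] is the log-likelihood of the gene under that
    constraint. A gene counts toward its best topology only if the best
    score beats the runner-up by more than `min_delta` (assumed threshold).
    """
    counts = Counter()
    for gene, scores in lnl.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        (best, best_lnl), (_, second_lnl) = ranked[0], ranked[1]
        if best_lnl - second_lnl > min_delta:
            counts[best] += 1
        else:
            counts["uninformative"] += 1
    return counts
```

Raising `min_delta` (for example to 2 log-likelihood units) separates genes that strongly prefer one resolution from genes that barely distinguish the alternatives.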
Inference of gene and genome duplication
To infer gene duplication, we first assessed clade support using the SH-like test (Anisimova et al. 2011) in RAxML for each homolog tree. Nodes with SH-like support > 80 were considered in the gene duplication inference. Using Beta vulgaris as the outgroup, we extracted 8,332 rooted clusters (each including 30 or more taxa, Yang et al. 2015), identified duplication events from them, and mapped those events to the corresponding node or MRCA (if the gene tree had missing taxa or a conflicting topology) on the rooted concatenated ML tree using Phyparts. A gene duplication was recorded at a node if its two child clades shared two or more taxa.
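The duplication criterion in the last sentence can be expressed compactly. The following Python sketch is an illustration only and does not reproduce Phyparts: it assumes trees stored as nested 2-tuples with tip labels of the hypothetical form "taxon@seqid", and it ignores the SH-like support filter and the mapping onto the species tree.

```python
def tip_taxa(tree):
    """Frozen set of taxon names under a node; tips look like 'taxon@seqid'."""
    if isinstance(tree, str):
        return frozenset([tree.split("@")[0]])
    left, right = tree
    return tip_taxa(left) | tip_taxa(right)


def duplication_nodes(tree, min_shared=2):
    """Return the shared-taxon set of every node whose two child clades
    have at least `min_shared` taxa in common (post-order traversal)."""
    if isinstance(tree, str):
        return []
    left, right = tree
    found = duplication_nodes(left, min_shared) + duplication_nodes(right, min_shared)
    shared = tip_taxa(left) & tip_taxa(right)
    if len(shared) >= min_shared:
        found.append(shared)
    return found
```

Requiring two shared taxa rather than one means that a single taxon with two tips, which may simply reflect isoforms or assembly artifacts, does not count as a duplication.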
In addition to mapping phylogenetic locations of gene duplications, potential genome duplication events were also inferred using the frequency distribution of synonymous substitution rates (Ks plots), following the procedure of Yang et al. (2015). Briefly, we reduced highly similar PEP sequences (by CD-HIT: -c 0.99 -n 5) for each of the 82 species, conducted an all-by-all BLASTP (-evalue 10 -max_target_seqs 20), and removed highly divergent hits with < 20% similarity (pident) or < 50 aligned amino acids (nident). Sequences with ten or more hits were also removed to eliminate large gene families. We then used the PEP and corresponding CDS from paralogous pairs to calculate Ks values using the pipeline at https://github.com/tanghaibao/biopipeline/tree/master/synonymous_calculation, which aligns paralogous pairs with ClustalW (Larkin et al. 2007) and infers Ks values using codeml in PAML (Yang 2007) with Nei–Gojobori correction for multiple substitutions (Nei and Gojobori 1986). Peaks in the distribution of Ks suggest the presence and relative timing of ancient genome duplication events (Lynch and Conery 2000; Blanc and Wolfe 2004; Schlueter et al. 2004; Cannon et al. 2014).
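The hit filtering that precedes the Ks calculation can be sketched as follows. This is an illustrative reading of the steps above, not the script of Yang et al. (2015): the `PepHit` record is hypothetical, self-hits are assumed to be discarded, and hits per sequence are assumed to be counted after the divergence filter.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class PepHit(NamedTuple):
    """One within-species BLASTP hit (hypothetical record)."""
    query: str
    subject: str
    pident: float  # percent identity
    nident: int    # number of identical aligned amino acids


def filter_paralog_hits(hits: Iterable[PepHit],
                        min_pident: float = 20.0,
                        min_nident: int = 50,
                        max_hits: int = 10) -> List[PepHit]:
    """Drop self-hits and divergent hits, then drop every pair that involves
    a sequence with `max_hits` or more remaining hits (large gene families)."""
    kept = [h for h in hits
            if h.query != h.subject
            and h.pident >= min_pident
            and h.nident >= min_nident]
    per_query = Counter(h.query for h in kept)
    return [h for h in kept
            if per_query[h.query] < max_hits and per_query[h.subject] < max_hits]
```

The surviving pairs are the ones that would be aligned and passed to codeml for Ks estimation.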
To determine whether a potential WGD event occurred before or after a particular speciation event, we calculated between-species Ks distributions using orthologous gene pairs (Cannon et al. 2014). The procedure was similar to that described above, except that a reciprocal BLASTP was carried out between two species instead of an all-by-all BLASTP within one taxon. Gene pairs with reciprocal best hits were used for calculating the Ks distribution. The between-species Ks distribution was then compared to the within-species Ks distribution of each species to determine the relative timing of the WGD versus the speciation event.
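A crude way to compare within-species and between-species Ks distributions is to compare their histogram modes. The Python sketch below makes several assumptions that are not part of the procedure above: peaks are located as the fullest fixed-width bin, values outside a hypothetical `ks_min` to `ks_max` window are ignored (for example to skip the near-zero peak from very recent duplicates), and substitution rates are taken to be comparable between the two species so that a larger Ks implies an older event.

```python
from collections import Counter
from typing import Iterable


def ks_mode(ks_values: Iterable[float],
            bin_width: float = 0.05,
            ks_min: float = 0.05,
            ks_max: float = 3.0) -> float:
    """Center of the fullest histogram bin within [ks_min, ks_max]."""
    counts = Counter(int(ks / bin_width)
                     for ks in ks_values if ks_min <= ks <= ks_max)
    if not counts:
        raise ValueError("no Ks values inside the requested range")
    # Ties are broken in favor of the bin with the smaller Ks.
    fullest = max(counts, key=lambda b: (counts[b], -b))
    return (fullest + 0.5) * bin_width


def wgd_predates_speciation(paralog_ks: Iterable[float],
                            ortholog_ks: Iterable[float],
                            **kwargs) -> bool:
    """True if the within-species (paralog) peak lies at a higher Ks than
    the between-species (ortholog) peak, i.e. the duplication looks older."""
    return ks_mode(paralog_ks, **kwargs) > ks_mode(ortholog_ks, **kwargs)
```

In practice the peaks are usually judged by eye or with mixture models, so the single-mode comparison here should be read as a sketch of the logic rather than a substitute for inspecting the plots.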
Annotation of expanded gene families
To investigate gene function and the overrepresentation of GO (gene ontology) terms, the CDSs from Arabidopsis thaliana (release-34) were obtained from EnsemblPlants and used as a BLASTN database, and the corresponding gene IDs were used for GO analyses. We conducted two separate analyses. First, we identified clades with a large number of gene duplications [i.e., > 160 (2%) gene duplications, as at nodes 15, 17, and 19 of Montiaceae, node 27 of Didiereaceae, nodes 35, 39, and 50 of Cactaceae, and node 71 of Portulacaceae; fig. 1] and conducted BLASTN searches (-evalue 10) of the duplicated genes in each clade (ten randomly selected sequences from each gene) against the A. thaliana database. A random A. thaliana gene ID was chosen when multiple query sequences from one gene returned divergent top hits, and the percentage of identifiable genes was similar among clades (~35%, table 1). Assuming that genes duplicated at the origin of Portulacineae (node 13) were retained in all descendant clades, we used them as the background to detect potential overrepresentation of genes that duplicated more recently in a specific descendant clade, using the GO overrepresentation test in the online version of PANTHER (http://www.pantherdb.org). We corrected for multiple tests using a Bonferroni correction. Here, overrepresentation is unrelated to gene expression level; instead, it refers to the overrepresentation of genes in a specific functional category in one clade compared with another.
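The logic of an overrepresentation test with Bonferroni correction can be sketched in a few lines of Python. This is not PANTHER's code, and the text above does not state which test statistic PANTHER applied; the sketch assumes a one-sided binomial test in which the background list supplies the expected proportion of genes per GO term. All function names are hypothetical.

```python
from math import comb
from typing import List


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    genes with a GO term among n clade-specific duplicated genes when the
    background proportion of that term is p."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k, n + 1))


def bonferroni(pvalues: List[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```

For one GO term, `p` would be the number of background genes carrying the term divided by the background size, `n` the number of annotated duplicated genes in the focal clade, and `k` the number of those carrying the term.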
Second, we summarized statistics for each of the 8,592 homologous genes, such as the total number of tips, the total number of taxa, the clades with the maximum average number of taxon repeats, and the taxon with the maximum number of tips. We then conducted manual annotation through BLASTX against the nonredundant protein database in NCBI by using a relatively long and complete PEP sequence from each of the top 20 genes with the highest total number of tips.
Identification of lineage-specific selection on stress response genes
To test whether genes associated with abiotic stress response are under positive selection, we carried out selection analyses on a targeted set of 29 genes that were previously known to be associated with cold and/or drought adaptation and that are also present in our dataset of 8,592 homologs (table 4). We first repeated the homolog filtering process described above three more times to reduce potential assembly or clustering errors in the chosen homologs. Their CDS sequences were then aligned based on the corresponding peptide MAFFT alignments using the phyx program (Brown et al. 2017). After trimming the CDS alignment with Phyutility (-clean 0.1), an ML tree was built for each homolog in RAxML with the GTR+GAMMA model and rooted with outgroup species using the same three-level criteria (see above). Lineage-specific selection was calculated for each homologous gene in HyPhy v2.2.4 (Pond and Muse 2005) using an adaptive branch-site random effects likelihood test (aBSREL, Smith MD et al. 2015). Generally, the ratio of nonsynonymous to synonymous substitutions (ω) was used to measure the strength of natural selection, with ω > 1 representing positive selection, ω ~ 1 indicating neutral evolution, and ω < 1 representing purifying or negative selection (Yang 2006). aBSREL automatically infers an appropriate model for each branch. It allows the nonsynonymous rate to vary across sites and the synonymous rate to vary from branch to branch, so that ω varies among branches as well as among sites. The aBSREL test first fits the MG94xREV nucleotide substitution model to estimate a single ω for each branch, then greedily adds ω categories to each branch after sorting the branches by length. The optimal number of ω categories was selected according to AICc scores. In our analyses, all internal branches were tested.
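Model selection by AICc, as used to choose the number of ω categories, can be illustrated with the standard small-sample formula AICc = -2 lnL + 2k + 2k(k + 1)/(n - k - 1). The Python sketch below is not HyPhy's implementation; the function names, and the way log-likelihoods and parameter counts are supplied, are hypothetical.

```python
from typing import Dict, Tuple


def aicc(lnl: float, k: int, n: int) -> float:
    """Small-sample corrected AIC for a model with log-likelihood `lnl`,
    `k` free parameters, and `n` observations (requires n > k + 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return -2.0 * lnl + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def best_by_aicc(models: Dict[int, Tuple[float, int]], n: int) -> int:
    """Pick the number of omega categories with the lowest AICc.

    `models` maps a category count to (log-likelihood, parameter count)."""
    return min(models, key=lambda c: aicc(models[c][0], models[c][1], n))
```

A richer model is preferred only when its gain in log-likelihood outweighs the penalty for its extra parameters, which is what stops the greedy addition of ω categories.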
Branches under episodic positive selection, i.e., those with a significant proportion of sites with ω > 1, were identified at p < 0.05 after applying the Holm-Bonferroni multiple testing correction. Finally, lineages subject to positive selection were labeled on each homologous tree. We also summarized the number of lineages under positive selection both on the whole tree and within each focal clade (e.g., the major families within Portulacineae, fig. 1).
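The Holm-Bonferroni step-down correction applied to the per-branch p-values can be written out as a short Python sketch. HyPhy performs this correction internally; the function below is only an illustration with a hypothetical name, returning adjusted p-values in the input order.

```python
from typing import List


def holm_adjust(pvalues: List[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    The i-th smallest p-value (i = 0, 1, ...) is multiplied by (m - i);
    a running maximum keeps the adjusted values monotone, and each is
    capped at 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

With raw p-values of 0.01, 0.04, 0.03, and 0.005 for four tested branches, the adjusted values are 0.03, 0.06, 0.06, and 0.02, so only the first and last branches remain significant at 0.05.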
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
The authors thank Wynn Anderson, the Bureau of Land Management, the US Forest Service, and the staff of Desert Botanical Garden, Sukkulenten-Sammlung Zürich, Cambridge University Botanic Garden, Missouri Botanical Garden, and the Oberlin College Greenhouse for permission to collect specimens, and thank Hilda Flores, Helga Ochoterena, and Norman Douglas for help with collecting. The authors also thank Jeet Sukumaran for constructive discussion on gene tree discordance. Special thanks to Oscar Vargas for a thorough review of the paper. This work was supported by NSF DEB 1354048 to SAS and NSF DEB 1352907 to MJM.