Abstract
A variable environment affects the physiological states of individuals and, in the long run, modifies their shapes. These changes, together with geographic barriers, generate population structure. Here, we propose a graphical representation of significant associations between genes, environments, and traits. A unique feature of the graph is the node of genome FST. The subnetwork around this node suggests the causes and effects of population structure and segregation. The global structure of the graph gives a comprehensive picture of adaptation to the environment. A focused look at the neighbors of the environmental factors identifies the adaptive traits and the genetic background that supported the adaptation of those traits. Isolated nodes express genetic differentiation that is not explained by the population structure, implying the presence of some unrecognized environmental factor. We show the potential usefulness of our graphical representation through a detailed analysis of a public dataset of wild poplar.
Living organisms are adapted to their environment. This environmental adaptation can create significant differences in phenotypes and traits among populations of a species. For example, populations of sockeye salmon exhibit diversity with regard to life history traits such as spawning time and habitat, adult body size and shape, rearing time in freshwater and seawater, and adaptation to local spawning and rearing habitats within complex lake systems (Hilborn et al. 2003). Populations of walking-stick insects have diverged in body size, shape, host preference and behavior in parallel with the divergence of their host-plant species (Nosil et al. 2002). Aridity gradients may be the cause of geographically structured populations of Poaceae characterized by cytotype segregation of diploids and allotetraploids (Manzaneda et al. 2012). When correlated with variation in environmental factors over local populations, such variation in traits and phenotypes can offer an opportunity for understanding natural selection processes (Coop et al. 2010). Adaptation to environmental factors can change traits and phenotypes of a species, thereby creating population structure. Geographical isolation, which can lead to reproductive isolation and consequent differences in allele frequencies, also contributes to population structuring (Wright 1965). Population structure therefore needs to be considered when analyzing correlations among genes, traits and environmental factors across population samples taken from a wide range of geographical regions.
Genome-wide association studies (GWASs) are widely used to identify associations between genes and traits/environments (Visscher et al. 2017). When data are obtained from a metapopulation exhibiting population structure, the effect of genotypes can be inferred by eliminating population structure effects (Devlin and Roeder 1999) to avoid spurious associations (Pritchard and Rosenberg 1999). One representative software program, TASSEL (Yu et al. 2006; Bradbury et al. 2007), performs this type of analysis using a unified mixed model. Alternatively, associations can be tested in a Hardy-Weinberg population that has been decomposed from a structured population (Pritchard et al. 2000). Future challenges for large-scale GWASs from wild populations (wild GWASs) include the development of methods that take population structure into account (Santure and Garant 2018).
So-called “genome scan methods” consider geographically structured populations and detect SNPs related to environmental variables, traits and phenotypes (De Mita et al. 2013; De Villemereuil et al. 2014). For instance, BayeScan (Foll and Gaggiotti 2008) detects SNPs that create major differentiation in terms of global FST over a metapopulation. As illustrated in Figure 1A (top), 16 outliers were detected out of 281 SNPs in Atlantic herring in one study (Limborg et al. 2012); these outliers included a SNP in a heat-shock protein (HSP70) whose allele frequency was negatively correlated with mean sea surface salinities in spawning grounds (Figure 1A, bottom). As another example, Bayenv (Coop et al. 2010) and the latent factor mixed model (LFMM (Frichot et al. 2013)) can detect SNPs that are highly correlated with environmental factors and traits.
To obtain a comprehensive picture of population structure and adaptation using related genes, we propose a novel graphical representation of gene–environment–trait associations. The graph consists of a set of nodes and edges that connect pairs of nodes with significant association. Our graph describes correlations among allele frequencies of SNPs, states of traits, and environmental and location factors. The unique feature of our method is the use of a genome-wide population differentiation node, which enables inference of the determinants of population structure. Environmental factor nodes around this node may be the causal force for the population structure, whereas the among-locality variation of nearby traits may be the result of population differentiation, or vice versa.
In the conceptual figure of Figure 1B, a location factor, L1, is correlated with E1, an environmental factor that is correlated in turn with genome FST. Two traits, T1 and T2, are affected by this environmental factor. G4 and G6 are the candidate genes behind the differentiation of T1. Likewise, G9, G10 and G11 are the candidate genes for T2. Population structure (genome FST) may have differentiated according to some unknown traits related to G1, G2 and G3, as well as trait T2. By examining the functions of genes G7 and G8, inference of the traits selected by environmental factor E1 may be possible. Of interest, the hidden factor that differentiates gene G5 can be investigated by plotting the allele frequencies of G5 relative to location factor L2. In this way, our method provides a comprehensive perspective for understanding the genetic and ecological mechanisms of environmental adaptation of a species.
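The way such a graph is read can be sketched in a few lines of Python. The edge list below is only our reading of the verbal description of Figure 1B (in particular, the G5 to L2 edge is an assumption), and the function names are hypothetical; the sketch simply collects the nodes within a given number of edges of a chosen node.

```python
from collections import deque

# Edge list transcribed from the verbal description of Figure 1B.
# This is an illustrative assumption, not the published figure itself.
EDGES = [
    ("L1", "E1"), ("E1", "FST"), ("E1", "T1"), ("E1", "T2"),
    ("T1", "G4"), ("T1", "G6"),
    ("T2", "G9"), ("T2", "G10"), ("T2", "G11"),
    ("FST", "G1"), ("FST", "G2"), ("FST", "G3"), ("FST", "T2"),
    ("E1", "G7"), ("E1", "G8"),
    ("G5", "L2"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def neighborhood(adj, center, radius=1):
    """Return all nodes within `radius` edges of `center` (breadth-first search)."""
    depth = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if depth[node] == radius:
            continue  # do not expand beyond the requested radius
        for nxt in adj.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth) - {center}


adj = build_adjacency(EDGES)
print(sorted(neighborhood(adj, "FST", radius=1)))
```

With this edge list, the radius-one neighborhood of genome FST contains E1, T2 and G1 to G3, whereas G5 is reachable only from L2 and therefore forms an isolated component.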
Materials and Methods
Significance of gene–environment–trait associations
The node of genome FST is in the form of a distance matrix between pairs of local populations. Likewise, all other nodes of SNPs, traits and environmental factors are represented by matrices whose elements are the differences between pairs of local populations (Supplementary Figure S1). Consequently, the correlation between a pair of nodes is the correlation between their between-population distance matrices. The significance of the correlation between a pair of nodes is measured by simple linear regression analysis. Here, the dependent variable is the distance matrix of one node, and the explanatory variable is the distance matrix of the other node. Because the elements of a distance matrix share populations, the error terms are correlated; to take account of this, we carried out bootstrap resampling of populations and individuals. For each of the bootstrap datasets, we calculated among-population distance matrices for each node pair, and obtained the regression coefficient for each node pair. The z value is the ratio of the original regression coefficient to its bootstrap standard deviation, and a p-value was obtained from each z value. By applying the Benjamini-Hochberg method (Benjamini and Hochberg 1995) to these p-values, we selected significant correlations with a false discovery rate (FDR) of 0.01. A node pair with a significant correlation was connected by an edge. We note that these edges represent the total associations of direct and indirect effects.
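The testing procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GET.graph implementation: each node is reduced to one value per population, only populations (not individuals) are resampled, the two-sided p-value is taken from a standard normal approximation of z, and all function names are hypothetical.

```python
import math
import random
import statistics


def distance_vector(values):
    """Absolute differences between all pairs of populations (upper triangle)."""
    n = len(values)
    return [abs(values[i] - values[j]) for i in range(n) for j in range(i + 1, n)]


def slope(x, y):
    """Least-squares slope of y regressed on x (simple linear regression)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx if sxx > 0 else 0.0


def bootstrap_z(node_x, node_y, n_boot=500, seed=1):
    """z = original slope / bootstrap SD of the slope (populations resampled)."""
    rng = random.Random(seed)
    n = len(node_x)
    b0 = slope(distance_vector(node_x), distance_vector(node_y))
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample populations
        bx = [node_x[i] for i in idx]
        by = [node_y[i] for i in idx]
        boots.append(slope(distance_vector(bx), distance_vector(by)))
    sd = statistics.stdev(boots)
    return b0 / sd if sd > 0 else 0.0


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues, fdr=0.01):
    """Return a list of booleans marking p-values significant at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            cutoff = rank  # largest rank that satisfies the BH condition
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        keep[i] = rank <= cutoff
    return keep
```

In this sketch, a node pair would be joined by an edge when its entry in the list returned by `benjamini_hochberg` is `True`.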
Estimation of pairwise FST
A locus pairwise FST at a single marker is the normalized difference of the allele frequencies and measures the genetic differentiation between a pair of local populations. To capture the fine-scale population structure even under high gene flow, we adopted an empirical Bayes estimator using EBFST function of the R package FinePop (Kitada et al. 2017). By averaging both numerators and denominators over multiple markers, we obtained genome FST. Genome FST indicates the magnitude of population differentiation over the genome, while locus FST indicates the contribution of each gene to population differentiation.
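The difference between locus FST and genome FST can be illustrated with a small sketch. It does not use the empirical Bayes estimator of FinePop; as a stand-in, it assumes a simple heterozygosity-based pairwise FST for biallelic markers with known allele frequencies, and the function names are hypothetical. The point is only the "ratio of averages" construction of genome FST.

```python
def fst_components(p1, p2):
    """Numerator and denominator of a simple two-population FST at one marker.

    Uses FST = (H_T - H_S) / H_T with expected heterozygosities; this is an
    illustrative stand-in, not the empirical Bayes estimator (EBFST).
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return h_total - h_within, h_total


def locus_fst(p1, p2):
    """Pairwise FST at a single marker."""
    num, den = fst_components(p1, p2)
    return num / den if den > 0 else 0.0


def genome_fst(freqs_pop1, freqs_pop2):
    """Genome FST: numerators and denominators are pooled over markers first."""
    comps = [fst_components(a, b) for a, b in zip(freqs_pop1, freqs_pop2)]
    num = sum(c[0] for c in comps)
    den = sum(c[1] for c in comps)
    return num / den if den > 0 else 0.0


# One fully differentiated marker and one undifferentiated marker.
print(locus_fst(0.0, 1.0), locus_fst(0.5, 0.5), genome_fst([0.0, 0.5], [1.0, 0.5]))
```

Pooling numerators and denominators before dividing keeps markers with low heterozygosity from dominating the genome-wide value, which is why a ratio of averages is used rather than an average of per-locus ratios.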
Application to wild poplar data
We demonstrate how the graphical representation provides a comprehensive picture of population differentiation and environmental adaptation by analyzing a publicly available dataset. It contains genetic and trait information for 445 individuals of wild poplar (Populus trichocarpa), which were collected from various regions over a range of 2,500 km, near the Canadian-US border at a latitude of 44° to 59° N, a longitude of 121° to 138° W, and an altitude of 0 to 800 m (McKown et al. 2014). The data included genotypes of 34,131 SNPs (3,516 genes) and values of stomatal anatomy, leaf tannin, ecophysiology, morphology and disease. Here, we focused on four traits: adaxial stomata density (ADd), abaxial stomata density (ABd), and leaf rust disease morbidity (AUDPC) measured in 2010 and 2011 (DP10 and DP11, respectively). Each sampling location was described by 11 environmental/geographical variables: latitude (lat), longitude (lon), altitude (alt), longest day length (DAY), frost-free days (FFD), mean annual temperature (MAT), mean warmest month temperature (MWMT), mean annual precipitation (MAP), mean summer precipitation (MSP), annual heat-moisture index (AHM) and summer heat-moisture index (SHM).
We performed a clustering analysis using the geographical distribution and divided the 445 individuals into subpopulations. We applied model-based clustering (Fraley and Raftery 2016) with three types of spatial information (latitude, longitude and altitude) as the explanatory variables. Using the Bayesian information criterion based on the mclustBIC function in the R package mclust (Scrucca et al. 2016) under the VEV model (ellipsoidal, equal shape), we obtained 22 subpopulations: 5 in northern British Columbia (NBC), 11 in southern British Columbia (SBC), 3 in inland British Columbia (IBC) and 3 in Oregon (ORE).
Marker screening before analysis of the graphical representation
Because our major concern was identifying correlations between among-population differentiations of genes, traits and environmental factors, we selected the SNP with the highest global FST value over the 22 populations, designated as the tag SNP, from each of the 3,516 gene regions. Out of the 3,516 tag SNPs, only those that were differentiated among populations were subjected to the graphical representation analysis. We note that the scaled global FST values approximately follow a chi-squared distribution with k degrees of freedom, where k is the number of populations minus 1 (Weir and Hill 2002). We performed chi-squared tests on the 3,516 genes and identified 507 tag SNPs with significant differentiation among populations (p < 0.05). Therefore, we used a total of 523 variables: 4 traits, 11 location/environmental factors, genome FST and 507 genes (Supplementary Table S1).
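For orientation, a commonly used form of such a scaled statistic is given below. It is stated here as an assumption in the spirit of the Lewontin-Krakauer-type test discussed by Weir and Hill (2002), not as the exact formula used in this study; the symbols $F_{ST,\ell}$ (global FST at tag SNP $\ell$) and $\bar{F}_{ST}$ (mean global FST over loci) are introduced only for illustration.

```latex
X_\ell \;=\; \frac{k\,F_{ST,\ell}}{\bar{F}_{ST}} \;\overset{\text{approx.}}{\sim}\; \chi^2_{k},
\qquad k = (\text{number of populations}) - 1 = 22 - 1 = 21 .
```

Under this form, a tag SNP would be retained when the upper-tail probability of $X_\ell$ under $\chi^2_{21}$ falls below 0.05.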
Results and Discussion
Global structure of the network
Our generated network then identified relationships between genome FST, 8 environmental and 2 location factors, 4 traits and 317 genes (Figure 2A, Supplementary Table S2). The network consisted of a large cluster centered around genome FST along with several isolated small clusters (Figure 2A, Supplementary Table S2). The location and environmental factors lat, lon, MAT and DAY were directly connected to genome FST in the estimated network, whereas alt was not included in the graph. In contrast, four water-related factors, MAP, MSP, AHM and SHM, were several steps away from genome FST. Several isolated clusters of genes were present at the boundary of the network. Although these clusters were differentiated between local populations, the absence of a significant correlation with genome FST implies that the diversity of these traits was not simply the result of population differentiation, but was instead due to adaptation to the local environment.
Determinants of population structure
The radius-one neighborhood of genome FST suggested that temperature and day length were the main environmental factors causing population structure, with the observation that the edges of the graph collected significant correlations (Supplementary Figure S2). Assisted by the Entrez summaries and GO terms of any genes shown in the graph window, we found that many genes related to fertility were affected by the population structure (Supplementary Figure S2, Table S3). An example was SHT (spermidine hydroxycinnamoyl transferase), which is related to pollen development and pollen exine formation (Grienenberger et al. 2009). The scatter plot visualized in the graph window provided information that correlated genome FST with the SHT gene. Other fertility-related genes included MYB5 (myb domain protein 5), HAB1 (hyper sensitive to ABA1) and ACT7 (actin 7) functioning in seed germination (Li et al. 2009; Saez et al. 2006; Gilliland et al. 2003), AT3G08640 (alphavirus core family protein, DUF3411) and HOG1 (S-adenosyl-L-homocysteine hydrolase) involved in embryo development (Rocha et al. 2005), LUG (transcriptional corepressor LEUNIG) and VRN1 (AP2/B3-like transcriptional factor family protein) related to flower development (Conner and Liu 2000; Levy et al. 2002) and REV (homeobox-leucine zipper family protein/lipid-binding START domain-containing protein) associated with flower morphogenesis (Talbert et al. 1995).
Daylight, latitude, stomatal density and disease
Consistent with McKown et al. (2014), our network confirmed a strong connection between ADd and disease progress (DP10 and DP11) (Figure 2B). In contrast, ABd was not directly connected to the DPs, but exhibited a strong connection to DAY, as did the DPs. All these nodes were directly connected to genome FST.
Average ADd was constant in the southern region up to 50° N, but increased with latitude in the northern region (Figure 3A). In contrast, average ABd decreased with latitude in the northern region (Figure 3B). DAY, which occurs in early summer, increased monotonically with latitude (Figure 3C), while MAT decreased on average with latitude (Figure 3D). This result indicates that poplar trees in northern populations experience longer and weaker sunshine in summer but drop their leaves earlier. Interestingly, the pore size of abaxial stomata was larger at lower values of ABd (Figure 4A), which demonstrates that northern populations had larger abaxial stomata, but their density was lower because the leaf area was limited. In contrast, the large variation in the pore size of adaxial stomata displayed no relationship with ADd (Figure 4B). The presence of larger stomata causes leaves to have a lower stomatal density but a greater photosynthetic efficiency (Lawson and Blatt 2014). These results suggest that northern populations must increase photosynthetic efficiency to adapt to an environment with weak sunshine and a shorter period before leaf shed. The increased ADd of northern populations suggests that adaxial stomata compensate for the decrease in abaxial stomata. Stomatal closure is part of the innate immune response to bacterial invasion (Melotto et al. 2006). An increase in abaxial stomata size and adaxial stomata density might increase the risk of disease invasion. Our results suggest that wild poplar can expand its habitat northward by increasing photosynthetic capacity while heightening its risk of disease, although the latter is less significant in northern areas (McKown et al. 2014). This ecological trade-off may be a cause of the population structure of wild poplar.
Photosynthesis and circadian rhythm in response to day length
Geraldes et al. (2014) identified a large number of FST outliers that are overrepresented in genes involved in circadian rhythm and response to red/far-red light. In our graph, the allele frequencies of genes related to photosynthesis and the circadian cycle were found to be influenced by day length (Figure 5, Supplementary Table S4). For example, ACT7 (actin 7) is related to response to light stimulus (McDowell et al. 1996), and its allele frequencies were negatively correlated with DAY and lon (Figure 5, lower left). Geographical mapping of ACT7 allele frequencies and day length confirmed this correlation (Figure 6A). Other genes included PRR7 (pseudo-response regulator 7) and TOC1 (CCT motif-containing response regulator protein) related to circadian rhythm (Farré et al. 2005; Alabadí et al. 2001), APX2 (ascorbate peroxidase 2) associated with response to high light intensity and response to oxidative stress (Karpinski et al. 1997), EXPA1 (expansin A1) involved in response to red light (Esmon et al. 2006) and SUS4 (sucrose synthase 4) related to the carbon assimilation process (Bieniawska et al. 2007). These results suggest that day length is the most important factor controlling photosynthesis and that latitude causes differentiation of photosynthetic genes. Finally, the allele frequencies of SYP121 (syntaxin of plant 121) related to stomatal movement (Bassham and Blatt 2008) and PIP3 (plasma membrane intrinsic protein 3) participating in response to abscisic acid and water channel activity (Weig et al. 1997) were also significantly correlated with day length (figures not shown).
Damage response, the circadian system and stomata related to disease susceptibility
Morbidity due to leaf rust disease (DP10 and DP11) showed a close relationship to adaxial stomatal density (ADd) and day length (DAY) (Figures 2B and 5, Supplementary Table S5). Genes closely connected to DAY, such as ACT7 (related to response to wounding), PRR7, APX2 and PIP3, were also closely linked to morbidity. Other genes, namely, FHY3 (far-red elongated hypocotyls 3) related to circadian rhythm (Allen et al. 2006) and DRT100 (DNA-damage repair/toleration 100) functioning in DNA repair (Pang et al. 1993), were also involved in this cluster. Because the DAY-related genes control stomatal opening and closing, our subgraph (Figure 5) suggests that fungal invasion into tissues occurs through stomata (Melotto et al. 2006). SHT (spermidine hydroxycinnamoyl transferase) was closely connected to leaf rust disease morbidity. As described above, SHT is related to pollen development and connected to genome FST. In addition, spermidine is known as a modulator of the immune process (Theoharides 1980). This result thus implies that the functions of SHT in immunity and reproduction play important roles in population differentiation and adaptation through disease resistance. DRT100 allele frequencies were negatively correlated with DP11 (figure not shown), and the geographical gradients of DRT100 allele frequencies and DP11 explained the correlation well (Supplementary Figure S3A). This result suggests that leaf rust disease affects fertility and promotes population differentiation. Principal component analysis using these genes, which were neighbors of ADd, DP10 and DP11, clearly revealed differences in morbidity between locations from north to south (Supplementary Figure S4). This result implies that the phenotypes controlled by circadian and light-responsive genes have adapted to local environments according to latitude and day length and are responsible for the morbidity-related population differentiation.
Body growth affected by temperature
Genes in the subgraph around MAT and FFD were those involved in shoot development (Supplementary Figure S5, Table S6), such as LAS (lateral suppressor, GRAS family transcription factor) related to secondary shoot formation (Greb et al. 2003) and REV (homeobox-leucine zipper family protein/lipid-binding START domain-containing protein) linked to primary shoot apical meristem specification and leaf morphogenesis. LAS allele frequencies were negatively correlated with MAT (Supplementary Figure S5, lower left). The geographical gradients of LAS allele frequencies and MAT supported this correlation (Supplementary Figure S3B). These results imply that temperature strongly influences body growth of poplar.
Drought stress resistance depends on water conditions
The environmental factors MAP, MSP, AHM and SHM exhibited no direct connection to genome FST (Figure 2A, Supplementary Figure S6, Table S7). An indirect connection was apparent, however, through genes with functions related to water stress. These genes were XERICO (RING/U-box superfamily protein), HK2 (histidine kinase 2) and ABA1 (ABA deficient 1, zeaxanthin epoxidase) involved in response to osmotic stress and response to salt stress (Ko et al. 2006; Tran et al. 2007; Xiong et al. 2002), CBF4 (C-repeat binding factor 4) related to response to drought (Haake et al. 2002) and AGP14 (arabinogalactan protein 14) participating in root hair elongation (Lin et al. 2011). A close examination of CBF4, directly connected to AHM, revealed that its allele frequencies were negatively correlated with AHM and clustered by geographical groups (Supplementary Figure S6, lower left). CBF4 allele frequencies were particularly differentiated in IBC, where AHM was high (Figure 6B). This result suggests that the CBF4 gene has differentiated as an adaptive response to dry weather. The apparently weak relationship between water stress and FST may be a consequence of the relatively small differences in water conditions in this dataset.
Vernalization depends on some unknown environmental conditions
Several isolated gene clusters, which were unconnected to genome FST, environmental factors or traits, appeared in the global network (Figure 2A). Each cluster contained genes whose functions were strongly related. For example, the largest isolated cluster consisted of vernalization genes (Supplementary Figure S7, Table S8), such as FUS6 related to regulation of flower development and seed germination (Chory et al. 1996), GA3OX1 associated with response to gibberellin and response to red light (McGinnis et al. 2003) and VRN1 linked to vernalization response and regulation of flower development. Although these genes may not be directly responsible for population structure, the appearance of the isolated cluster in the network implies a latent relationship between vernalization and population differentiation. With regard to the geographical distribution of their allele frequencies, FUS6 and GA3OX1 had similar, complicated patterns (Figure 6C, Supplementary Figure S3C). Populations in SBC and eastern IBC had a similar pattern of allele frequencies, while those in northern NBC, southern ORE and western IBC displayed a different pattern. This result may imply an adaptation to a microenvironment not observed in these data. For example, the direction of a mountain slope can create different habitats with different daylight conditions.
Gene ontology enrichment analysis
No significant GOs were predicted by gene ontology enrichment analysis (Subramanian et al. 2005; Alexa et al. 2006) for the union of the set of 317 genes selected for the graph and the sets of neighboring genes mentioned above relative to the other complementary gene sets. To obtain a comprehensive picture based on solid evidence, we focused on geographically differentiating SNPs and selected pairs of nodes by controlling FDR. As a consequence, we may have diminished the ability to identify differences between the two sets of genes. Alternatively, mutations in a few members of the relevant pathways may have enabled adaptation to the variable environments.
Our method identifies genes related to environmental adaptation with an FDR of 1% and visualizes their network, including genome FST, environmental and location factors, and traits. Our example using wild poplar has revealed the potential of our graphical model representation to aid comprehensive understanding of ecological and genetic mechanisms underlying environmental adaptation and population structuring. While conventional GWAS and genome scanning effectively search for genes related to some given factors or traits, our method captures the overall picture of the relationship among genes, environmental factors and traits in association with population structure. By following the sub-network of genes around target environmental factors and traits, we can obtain a detailed understanding of the relationship of genes behind environmental adaptation and population differentiation. In particular, detection of collaboratively adapted gene clusters, which are not directly associated with the given environment/trait factors, is an advantage of our graphical representation. Our R software module GET.graph aids this process by displaying subgraphs and scatter plots of allele frequencies of genes vs. environmental factors/traits. GET.graph retains the biological functions of genes retrieved from public databases, such as GO and ENTREZ, and helps us smoothly interpret the graph. Through this process, we can reach a comprehensive understanding of population structure and adaptation by characterizing the sub-networks of the graph (Figure 2A).
Our graph collects significant correlations that sum up both direct and indirect relationships, while partial correlations extract direct relationships (Kishino and Waddell 2000; De La Fuente et al. 2004; Liu 2013). Collection of significant partial correlations in this setting is left for future study. Finally, we must be aware of computational feasibility. The calculation load greatly increases depending on the number of variables and is roughly proportional to the square of the number of variables. The analysis for this paper took several hours on an Intel Core i7 (6 core) workstation. The above step of prescreening variables is therefore indispensable. As a final remark, the data, especially genomic data, often include missing values. Our graphical representation method describes relationships between population means of allele frequencies, trait values and environmental/location factors; therefore, like Bayenv (Coop et al. 2010), it analyzes sample means among measured data. As long as the means of measured allele/environmental/trait variables are unbiased estimates of the corresponding sample means, the procedure is also unbiased.
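The quadratic growth in computational load noted above can be made concrete by counting node pairs. The sketch below uses only the variable counts reported in this paper (4 traits, 11 location/environmental factors, genome FST, and either 507 screened or 3,516 unscreened tag SNPs); the function name is hypothetical, and the count ignores the additional cost of the bootstrap replicates needed for every pair.

```python
from math import comb


def n_node_pairs(n_variables):
    """Number of unordered node pairs to be tested for n_variables nodes."""
    return comb(n_variables, 2)


screened = 4 + 11 + 1 + 507      # variables used in the analysis (523)
unscreened = 4 + 11 + 1 + 3516   # if all tag SNPs had been kept (3532)

print(n_node_pairs(screened))    # 136503
print(n_node_pairs(unscreened))  # 6235746
print(round(n_node_pairs(unscreened) / n_node_pairs(screened), 1))
```

By this count, prescreening reduces the number of regressions by a factor of roughly 46.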
The R software module (GET.graph) that implements the network analysis described in this paper is available in the FinePop package at CRAN (https://CRAN.R-project.org/package=FinePop).
Acknowledgements
This study was supported by the Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research 25280006 and 16H02788 to HK and 18K05781 to SK.