Abstract
Many adaptive events in natural populations, as well as response to artificial selection, are caused by polygenic action. Under selective pressure, the adaptive traits can quickly respond via small allele frequency shifts spread across numerous loci. We hypothesize that a large proportion of current phenotypic variation between individuals may be best explained by population admixture.
We thus consider the complete, genome-wide universe of genetic variability, spread across several ancestral populations originally separated. We experimentally confirmed this hypothesis by predicting the differences in quantitative disease resistance levels among accessions in the wild legume Medicago truncatula. We discovered also that variation in genome admixture proportion explains most of phenotypic variation for several quantitative functional traits, but not for symbiotic nitrogen fixation. We shown that positive selection at the species level might not explain current, rapid adaptation.
These findings prove the infinitesimal model as a mechanism for adaptation of quantitative phenotypes. Our study produced the first evidence that the whole-genome modeling of DNA variants is the best approach to describe an inherited quantitative trait in a higher eukaryote organism and proved the high potential of admixture-based analyses. This insight contribute to the understanding of polygenic adaptation, and can accelerate plant and animal breeding, and biomedicine research programs.
Introduction
All living organisms adapt to the changing environment. The adaptive traits of humans include, among others, skin pigmentation (Sturm and Duffy, 2012), altitude adaptation (Huerta-Sánchez et al., 2014) and lactose tolerance (Tishkoff et al., 2007). Plants may alter flowering time, length of developmental stages, or photosynthesis (Craufurd and Wheeler, 2009). Selective pressure is often imposed by geographic variables such as climate conditions, by pathogen exposure or food resources. Thus, in free living species, genetics and geography are closely and measurably associated (Novembre et al., 2008). The species respond by changing population structure via migration, by allele sorting due to random events (genetic drift) and by natural selection (Hamilton, 2009). Local adaptations among populations are a response to heterogeneity of the natural landscape. In general, a locally adapted population has higher fitness at its native site than any other population introduced to that site (Agren and Schemske, 2012; Leimu and Fischer, 2008; Savolainen et al., 2013). Typically, local adaptation favors biodiversity, as the alleles are selected under distinct environmental conditions (Kawecki and Ebert, 2004). In most cases adaptive traits represent measurable phenotypes (i.e. “quantitative traits”), such as height, yield or pathogen resistance, that depend on the cumulative actions of many genes, with variants occuring across multiple loci with often poorly understood relationships between them (Savolainen et al., 2013).
When adaptation affects one or a few loci with large individual effect, its inheritance is explained by classical Mendelian and population genetics. The set of variants leading to adaptation is rapidly fixed in the population, along with neighbouring genomic regions, according to the “selective sweep” model (Maynard Smith and Haigh, 1974). Selective sweeps create a genomic signature which consists of reduced genetic diversity and extended linkage disequilibrium in the genomic region surrounding the loci under selection (Campo et al., 2013; Coop et al., 2009; Scheinfeldt and Tishkoff, 2013). In plants, major selective drivers are associated with selective sweeps, for example for soil conditions (Turner et al., 2008), and climate adaptation (Fournier-Level et al., 2011; Hancock et al., 2011; Pyhäjärvi et al., 2013). For most candidate signals, we know neither the exact effect on phenotype, nor the nature of the selective pressure (Coop et al., 2009). This lack of understanding of the functional mechanisms is the major drawback of the “selective sweep” model (Pavlidis et al., 2012). Due to the existence of multiple mechanisms (parallel adaptive pathways, redundant signaling and metabolic networks, etc.), an adaptation to the same selective stress can be achieved by different combinations of sweeps (Berg and Coop, 2014), resulting in a large heterogeneity of “sweeps” across populations. Therefore the signature’s content is highly dependent upon population structure and restricted gene flow, particularly in plants (Siol et al., 2010).
When adaptation has a polygenic basis, the quantitative trait will evolve rapidly, via small changes in the population frequencies of a large number of pre-existing polymorphisms (Pritchard and Di Rienzo, 2010). Adaptation of quantitative traits is described by the “infinitesimal model”, proposed by Fisher (Fisher, 1930). This approach originally consists of purely phenotype-based modeling, with no regard to molecular basis of inheritance. Consequently, in classical animal and plant breeding, the degree of expression of selective traits (size, yield etc...) are predicted based on the performances of the individuals and their relatives, ignoring molecular markers. Such phenotypic prediction makes the models very robust (Hill, 2010), and they are widely applied in breeding and in some evolutionary and ecological studies (Hadfield et al., 2010; Henderson, 1984). However, phenotype-based models obviously lack the biological mechanisms behind the phenotypic traits.
Other quantitative methodologies link molecular data to phenotype via Quantitative Trait Loci (QTLs). One set of methods is focused on studying the QTLs within relatively small families (typical for human diseases) (Daniels et al., 1996) and in the context of controlled crosses in cultivated plants (Mackay et al., 2009). The other approaches deal with large populations, focusing on unrelated individuals, such as Genome-Wide Association Studies (GWAS) (Atwell et al., 2010; Hirschhorn and Daly, 2005). In either case, the studies follow a filtering workflow aimed at identifying relatively few QTLs or variants that are most relevant to the phenotype (be it a trait or a disease). Over recent years, this approach, fueled by increasingly affordable genome sequencing or genotyping, has led to an explosion of disease-related gene discoveries in human (Welter et al., 2014) and has become a method of choice in plant and animal breeding (Wen et al., 2014) and in studies of adaptations in natural populations (Atwell et al., 2010).
However, GWAS (and QTL-based methodology in general) have also substantial drawbacks. First, only a few (if any) SNPs or QTLs are likely to have statistical significance in any given GWAS on human or plants (Rivadeneira et al., 2009). Therefore, the variants found by GWAS typically explain only a minor fraction of the heritability of a specific trait (10-30%), making them poor predictors of phenotypes (Gusev et al., 2013). To address the weak predictive power of GWAS-derived variants for phenotype (so-called “missing” heritability), some scientists have adapted the whole-genome prediction method proposed by Meuwissen et al. (Meuwissen et al., 2001). Such methods simultaneously use the full set of genome-wide SNPs to predict phenotypes (de los Campos et al., 2013). It is, however, difficult to correctly allow for population structure within these methods. Second, due to high level of genetic heterogeneity and epigenetic modification in organisms, most SNPs and QTLs are “unstable”, i.e their statistical relevance to phenotype is not supported by follow-up studies (Korte and Farlow, 2013). These drawbacks substantially limit applicability of the results of QTL or “big data” omics studies in clinical settings (such as disease predisposition multi-gene panels) or in plant and animal breeding (Bian et al., 2014).
Here, we offer a novel empirical paradigm, that we call “whole-genome modeling” -WhoGEM – for testing polygenic adaptation. Essentially, we consider the complete, genome-wide universe of genetic variability, spread across several ancestral populations originally separated (e.g. by climate, geography, tissue differentiation), ultimately displaying distinct phenotypes. Assuming that each individual is a descendant of one or several ancestral populations, the relative contributions of such populations to the genome of each individual – the so-called admixture components - are estimated using a likelihood algorithm. Our innovative working hypothesis postulates that a large proportion of current phenotypic variation between individuals may be best explained by population admixture. We therefore use admixture proportions, instead of SNP-based analysis of genome scans, to test for correlation between genomic variation and quantitative phenotypes, or environmental variables. As such, our method combines phenotypic anchoring of molecular markers with “whole- genome modeling” of the genotype, avoiding the major drawbacks of the abovementioned models by explicitly integrating the population structure.
Unlike animals, plants feature complex mating systems including selfing and limited gene dispersal through seeds and pollen, and a distinct immune system. Importantly, plants must survive under permanent selective pressure from local environmental conditions. These features make plants excellent objects for testing polygenic adaptation hypotheses (Leimu and Fischer, 2008). We test our paradigm on Medicago truncatula, a model legume with detailed genomic data available across a number of the Mediterranean populations (Branca et al., 2011; Formey et al., 2014; Tang et al., 2014; Yoder et al., 2014). As a short-lived and self-compatible species, M. truncatula probably has a more differentiated population structure and is expected to be a better model to study local adaptation than long-lived or outcrossing species (Linhart and Grant, 1996). Existing knowledge of the exact population structure of M. truncatula was incomplete, with contradictory versions described (Bonhomme et al., 2013; Ellwood et al., 2006). As this is one prerequisite for our study, we thus infer that structure with precision.
We define admixture components using a high-quality SNP set, based on the genome information of 262 M. truncatula accessions located around the Mediterranean Basin (Yoder et al., 2014). As a calculation tool, we adapt the recent Geographic Population Structure algorithm (Elhaik et al., 2014) to plants (plantGPS). We use plantGPS to refine our understanding of population structure and assess the optimal number of admixture components. We demonstrate that WhoGEM, based on an admixture model, accurately predicts quantitative traits. Focusing on disease resistance, we predict resistance to root pathogens. Using an independent set of accessions not previously characterized, we experimentally validate the prediction of disease resistance levels from whole-genome data. We also unveil relations with admixture components by analysing other key quantitative functional traits, such as plant height. Our analysis shows that geographic adaptations also rely upon a polygenic basis. Our results are among the first to experimentally validate Fisher’s infinitesimal model for the adaptation of a quantitative phenotype in plants. We also argue that our WhoGEM method is directly applicable to a wide range of biomedicine problems, such as prediction of drug response and carcinogenesis, and to accelerate breeding programs in agriculturally important plants and animals.
Results
A pattern of eight discrete populations around the Mediterranean Basin is revealed by admixture-based analysis of the Medicago truncatula genome
Many population genetic analyses assume a unique origin for the populations, with stepping stone divergence patterns. We take a different approach, relying on the glacial refugia hypothesis postulated for M. truncatula (de Mita et al., 2011), and interpret our data assuming refuge areas with subsequent plant population growth and admixture. To define M. truncatula populations, we conduct a three-step analysis combining admixture-based tools, Principal Component Analysis (PCA) and plantGPS.
First, using a likelihood-based admixture analysis (Alexander et al., 2009), we identify a substructure at K = 7 and K = 8 in which individuals appear homogeneous in their admixture composition (fig. 1, supplementary fig. S1). Higher K values yield noise which appears as ancestry shared by very few individuals within the same populations.
Second, the most suitable number of admixture components is verified using a PCA-based analysis. For K = 9 and higher, putative populations are not independently spread in the PC space, adding a supplementary argument for K = 8 being the optimal maximum number of source populations (supplementary fig. S2 to supplementary fig. S5).
Third, we apply our novel tool plantGPS to evaluate the accuracy of geographic assignment for each sample, using the distance between its recorded and predicted location. The plantGPS tool uses the admixture components of each sample. The empirical cumulative curve of distances between predicted and observed locations shows that K = 8 results in best predictions (supplementary fig. S6). With eight admixture components, 50% of accessions have their location predicted to within 100 km of their recorded location, and 75% within 800 km.
This three-step analysis demonstrates that population structure in M. truncatula can be adequately explained using eight admixture components. Pair-wise Wright’s FST divergences (Holsinger and Weir, 2009) between the admixture components (comparing the variance in allele frequencies among the components) indicate that they are strongly differentiated (table 1A). Therefore, we use this number of components for subsequent analyses. The structure of the current M. truncatula population is obtained by hierarchical clustering of the samples, based on their admixture patterns (supplementary fig. S7, supplementary table S2).
Fig. 2 displays the distribution of the eight putative M. truncatula populations around the Mediterranean Basin, showing the genome admixture proportions of the 262 samples. Based on this picture, we assign each population to a representative geographical region (table 1B). Estimates of FIS values (the inbreeding coefficient of an individual relative to its sub-population) are similar among the eight populations, suggesting no obvious intra-population heterogeneity (table 1B). All populations are clearly differentiated, even over short geographical distances, such as with the two Spanish populations.
Relationships among the populations are estimated based on the 840K SNP dataset, using accessions that have at least 90% of their genome assigned to a given ancestral population to represent that population. The resulting dendrogram (supplementary fig. S8) reveals two main clades that reflect the major divergence event. Clade 1 contains populations from the South West of the Mediterranean Basin: “Algiers” (K1), “Spanish Coastal” (K2) and “Spanish Morocco Inland” (K8). Clade 2 contains the accessions from the North East of the Mediterranean Basin. We speculate that the clade 1 / clade 2 divergence reflects expansion from glacial refugia during the early Holocene (Médail and Diadema, 2009). Within clade 2, the “French” (K6) population is clearly separated from the “Greek” (K7) population, in agreement with the “Maritim and Ligurian Alps” glacial refugia hypothesis (Médail and Diadema, 2009).
Genome admixture components predict quantitative resistance to plant pathogens, as demonstrated by experimental validation
As a first test of our hypothesis, we explore whether our admixture-based M. truncatula population structure might relate to adaptation for polygenic traits. Plant-microbe interactions are key drivers for adaptation, and we check for associations between admixture components and quantitative disease resistance (QDR).
Two types of disease resistance are described in plants (i) complete resistance conditioned by a single gene (Flor, 1971) and (ii) partial resistance, also called quantitative disease resistance, conditioned by multiple genes of partial effect (Poland et al., 2009). QDR often confers broad-spectrum resistance, being predicted to be critical for efficient control of epidemics. It is characterized by a continuous range of phenotypes from susceptible to fully resistant. QDR is often described by QTL that support the resistant phenotype and suggest modes of polygenic adaptation (Poland et al., 2009). Studies that attempt to dissect a QDR trait have reported genes with various biological functions such as ABC transporters (Krattinger et al., 2009) or atypical kinases (Huard-Chauveau et al., 2013). However, these genes do not explain all of the genetic variances reported using controlled crosses or GWAS studies. The aggregating of QDR loci has been useful in decreasing disease symptoms (Fukuoka et al., 2015), but fully resistant phenotypes are rarely described, suggesting that additional, possibly numerous loci, are required.
M. truncatula is prone to infection by the soil-borne fungal root pathogen Verticillium alfalfae. Verticillium wilt response in M. truncatula is a QDR, regulated by QTLs that differ across resistant accessions and vary according to the fungus strains (Ben et al., 2013; Negahi et al., 2014). Both plant and fungal species co-exist around the Mediterranean Basin (CABI database, PlantWise database http://www.plantwise.org/, August,25 2015).
Fig. 3 depicts the geographical partition of the Maximum Symptom Score (MSS) of 242 M. truncatula accessions when infected with the V. alfalfae strain V31-2 (supplementary table S11), together with their admixture patterns. Accessions located west of the Mediterranean Basin are mainly resistant to the V31-2 strain (low MSS), while accessions located east of the Mediterranean Basin are susceptible. Testing WhoGEM’s working hypothesis, we explore whether the degree of quantitative resistance to V. alfalfae can be explained by values of admixture components. Our findings (table 2) show that the proportions of four admixture components are significantly related to MSS, as suggested by fig. 3 (r2 = 0.31, P≤ 2.2 × 10-16). The average MSS value of the “Spanish Coastal”, “Spanish-Morocco Inland” and “South Tunisian Coastal” genome components are 1.04, 1.6 and 1.71 respectively, making them resistant genomic backgrounds. The average MSS value of the “Greek” genome component is ⋍ 3, making it a clearly susceptible genomic background.
We experimentally validate these computational results using an independent set of accessions not re-sequenced previously. We predict the phenotypes from admixture components using the model resulting from our above analysis as follows : uncharacterized accessions located within the geographic zone of resistant (respectively susceptible) accessions should exhibit resistant (respectively susceptible) phenotypes. We test 32 new accessions from the ‘‘Spanish Coastal” or “Spanish-Morocco” geographic zone, and 39 new accessions from the ‘Greek” geographic zone (fig. 4A), and assess their disease resistance level (supplementary table S12).
Fig. 4B shows the observed MSS of accessions predicted to be resistant or susceptible, along with MSS of Spanish-Moroccoan (resistant) and “Greek” (susceptible) reference samples. MSS values of samples expected to be resistant or susceptible differ significantly (ANOVA P < 2 × 10-16) and samples predicted to be resistant are significantly different from those predicted to be susceptible (adjusted P < 2 × 10-16). Interestingly, multiple means comparisons show that predicted resistant (or susceptible) accessions are not significantly different from their respective reference populations (adjusted P = 0.55 and 0.90 respectively, table 3). Defining the value MSS= 2 as a threshold for resistant (MSS < 2) or susceptible (MSS ≤ 2) accessions, we find that 26/32 of the samples predicted to be resistant from their admixture proportions actually are resistant, while 27/39 samples predicted to be susceptible are actually susceptible (χ2 = 16.03, P = 6.22 × 10-5). This shows that patterns of QDR can be predicted from genomic admixture analysis, arguing for the utility of WhoGEM when predicting complex phenotypes.
We have thus shown experimentally validated theoretical predictions, by WhoGEM, of the QDR level in M. truncatula. We show that the phenotypic difference between predicted resistant and susceptible accessions is around two points on a scale from 0 to 4, i.e. 50% of the phenotypic difference between the extremes of the phenotype distribution. This value is far greater than those previously reported (0.28-0.5) for phenotypic differences between alleles at the major QTLs detected in response to V. alfalfae (Ben et al., 2013; Negahi et al., 2014). Given the estimated narrow sense heritability of the trait (Ben et al., 2013; Negahi et al., 2014), we suggest that genome admixture components are explaining the majority of the genetic control of this disease.
QDR is typically broad-spectrum, making the arms race between hosts and pathogens probably not critical and, consequently, currently not reported in the literature. Our results re-enforce the idea that QDR in plants is likely to result from changes to a large number of genes scattered throughout the genome, and that this is reflected in admixture proportions. Because of the co-occurrence of both the plant species and the pathogen around the Mediterranean Basin, we hypothesize that the observed pattern of quantitative resistance in the M. truncatula/V. alfalfae pathosystem may be due to natural selection, with unknown additional contributions from drift and migration.
Genome admixture components can be significant predictors of quantitative functional traits in plants
Knowledge of the selective pressure acting on the phenotype, could lead to insights into the respective contributions of adaptive selection and drift toward phenotypic differentiation among populations. This can easily be tested using plant-microbe interactions, in particular for QDR, by comparing the geographical distribution of plants and pathogens.
Aphanomyces euteiches is a soil-borne pathogen of legume crops, mainly occurring North of the 45th parallel. Using data reported by Bonhomme et al. (2013), we depict the geographical structure of root-rot index (RRI), a typical quantitative phenotype measuring susceptibility of M. truncatula to this oomycete, together with the admixture patterns of the studied accessions (supplementary fig. S9). We test whether the proportions of admixture components (supplementary table S2) are predictors for RRI and find a significant relationship (P =1.8 × 10−7, supplementary table S3). The WhoGEM model accounts for 19.2% of variation in the phenotype, and provides a lower bound for heritability. Intriguingly, the populations of the Maghreb area show a contrasting response to the pathogen, whose presence is not reported in that geographic zone (CABI database, PlantWise database http://www.plantwise.org/, Aug.,25 2015; Bonhomme et al. (2013)). Hence, we hypothesize that the phenotypic differentiation among the resistant and susceptible populations may be due to genetic drift or migration. This hypothesis also suggests that the cost of resistance may be negligible in the absence of pathogen, in contrast with previous results described for foliar pathogens (Tian et al., 2003). An alternative hypothesis is that resistance to A. euteiches is driven by, or linked to, resistance to other factors, as suggested by Djébali et al. (2013), and as such, not a consequence of natural selection acting toward resistance to the oomycete.
Having demonstrated that genome admixture proportions can be used to predict different QDR, we further test our approach by examining relationships between admixture component proportions (reported supplementary table S2) and several quantitative functional traits related to development as reported by Stanton-Geddes et al. (2013). Supplementary fig. S10d and S10e depict the geographical structure of plant height and leaf number combined with admixture proportions of the recorded accessions. Plant height (supplementary table S3a) and the number of leaves (supplementary table S3b) exhibit different results regarding association with genome admixture components. The influence of population structure on plant height is very significant (r2 =0.21, P =2 × 10-11), but less so on number of leaves (r2 =0.05, P =7 × 10-4). The results suggest that a latitudinal cline for leaf numbers may exist, with accessions south of the Mediterranean Basin harboring more leaves.
Geographical and bioclimatic variables significantly shape part of genetic variation in M. truncatula
The aspects of environmental variation that generate selective gradients are poorly understood for most species. The repartition of the eight M. truncatula populations in the climatic zones defined by the Köppen-Geiger climate classification (Kottek et al., 2006) shows that several populations are present in the same climate zone; and that the “Greek” population, is scattered through several climatic zones (supplementary fig. S11). This makes it unlikely that global climate types shaped populations.
We check for associations between admixture components and 19 local bioclimatic variables, defined by WorldClim (http://www.worldclim.org). Correlations between the proportions of the various admixture components and any of the bioclimatic variables are shown in supplementary table S5, along with tests of significance (supplementary table S6). Associations with bioclimatic variables depend strongly on the admixture component. For example, the “Spanish Coastal” component is negatively correlated with temperature seasonality (BIO4) and temperature annual range (BIO7), indicating that this genome is present in accessions that grow in regions with moderate annual temperature and small temperature seasonal contrasts. Interestingly, the admixture proportions of the “North Tunisian Coastal” population are not related to any bioclimatic variables, suggesting that the differentiation of this genome may be due to other factors. Friesen et al. (Friesen et al., 2014) described how accessions belonging to this population harbor alleles that assort non-randomly with soil salinity, suggesting a differentiation of the “North Tunisian Coastal” population due to this particular abiotic condition.
Next we use redundancy analysis (RDA, Legendre and Legendre (2012)) to partition genomic variation within the species, summarized by the admixture proportions, into components explained by climate and geography. It allows estimating the change in the structure of genomic variation across spatial scales (latitude, longitude and elevation) and climatic variables. Fig. 5 shows that about half of the genomic variation is due to climate or geography (r2 = 0.46; P ≤ 0.001), with climate as a major source of variation in admixture component (41.4%). The variation explained jointly by geography and climate is 17.6%, geography alone contributes to 5% only.
This partition of genomic variation in response to climate is clearly different to A. thaliana, in which Isolation By Distance (IBD) is important and climate variation among sites of origin explained only slightly more genomic variation than geographical distance (Lasky et al., 2012). Here we show that genome admixture proportions are correlated to bioclimatic variables. This relationships suggests that a large number of loci/genes are involved in response to climate, even if the phenotypes involved in this response are still unknown.
Genome admixture components are not significant predictors of phenotypes that depend on tight genotype × genotype interactions
Polygenic adaptation is not expected for phenotypes measuring association with symbiotic bacteria because of the tight genotype × genotype interactions which govern these plant-symbiont interactions (Oldroyd, 2013). However signatures of selection have been reported for only a few genes involved in the control of symbiotic nitrogen fixation (de Mita et al., 2011; Paape et al., 2013). We thus test the WhoGEMM approach by looking for relationships between admixture component proportions and quantitative phenotypes measuring association to symbiotic nitrogen fixation bacteria.
Supplementary fig. S10a, supplementary fig. S10b and supplementary fig. S10c depict the geographical structure of nodule numbers in different root parts, used as a proxies to test for nodulation efficiency (Stanton-Geddes et al., 2013). The spatial patterns for each nodulation phenotype are complex and little correspondence appears between nodulation in upper and lower root parts. Ancestral genome proportions are significantly related to nodulation efficiency in the top 5cm of the roots (supplementary table S3d; r2 =0.15, P =1.1 × 10−8), but only to a much lower extent in the region below 5cm (supplementary table S3c; r2 =0.06, P =4.2 × 10−4). Consequently, the total number of nodules per root is only moderately related to admixture proportions (supplementary table S3e; r2 =0.1, P =5.999 × 10−6) in that experiment.
The analysis reported by Stanton-Geddes et al. (2013) used a mix of two different strains of S. meliloti simultaneously applied to all accessions in a common-garden design. In that experiment, the number of nodules (whether using the total, upper or lower 5cm of root) was not correlated to plant height or number of leaves (supplementary table S7), indicating that symbiosis was not efficiently engaged between the host plants and the selected microbial symbiotic strains. This unexpected fact, not discussed by the authors, may preclude any further conclusions on symbiosis efficiency and putative adaptation to given symbionts in this particular experimental design. It is also reported that some M. truncatula accessions present better performances with S. medicae strains rather than S. meliloti strains. The procedure described by Stanton-Geddes et al. (Stanton- Geddes et al., 2013) thus probably masks putative local adaptations to particular symbiont strains.
The weak relationship between nodulation performances and population structure contrasts with the situation for plant height, or quantitative resistances towards V. alfalfae or A. euteiches. This finding suggests that adaptation to local strains of rhizobia plays an important role, re-inforcing the genotype × genotype interaction hypothesis for the control of symbiotic nitrogen fixation.
Long-term positive selection is weakly involved in current population structure
The above results suggest that response to selective pressure contributed to the differentiation among M. truncatula populations. When trying to understand the role of natural selection, proteincoding sequences offer the great advantage in that they allow us to distinguish synonymous and non-synonymous substitutions.
The non-synonymous/synonymous rate ratio, ω = dN/dS, measures selective pressure at the protein level (Hamilton, 2009). A non-synonymous rate which is significantly higher than the synonymous rate, resulting in ω > 1, provides evidence for positive protein selection, whereas ω < 1 evidence purifying selection. To test if adaptation to biotic and abiotic conditions involves selection over a long period of time, as compared to rapid changes in allelic frequencies, we compute ω in each M. truncatula population.
We conduct a population-wise binomial test to determine if the dN/dS value for each gene differs from the genome-wide average dN/dS value. In each population, we obtain clear evidence of a trend towards extreme dN/dS values; the QQ-plot of p-values is shown in fig. 6. We identify 15 genes with significant purifying selection in at least one population (table 4 and supplementary table S8). The majority of these are related to development (cell wall and growth) and energy (photosynthesis and correlative UV protection), both involved in establishing fitness level.
Among the eight populations, we identify only 10 genes with both dN/dS ≥ 5 and a coefficient of variation ≥ 0.40 (supplementary table S9 and supplementary table S10). Five of the 10 genes are transcription factors of unknown function, indicative of the importance of regulatory pathways in adaptation. The FRIGIDA ortholog exhibits a strong differential selection pressure among the eight populations, in accordance with the necessary adaptation of vernalization and flowering times over contrasted environments (Johanson et al., 2000).
Interestingly, we identify 137 genes that have ω ≤ 7 and a low coefficient of variation (fig. 7). These genes could be involved in the overall selection response for all accessions of M. truncatula. GO term enrichment analysis of these genes does not identify specific biological processes, suggesting that all biological processes contribute to adaptation at the same pace.
Tests based on synonymous/non-synonymous comparisons are expected to be less sensitive to demographic assumptions than site frequency spectrum or LD-based methods (Siol et al., 2010). However, they cannot distinguish between past and current selection. Our results reveal strong purifying selection acting on genes directly involved in fitness, and show that putative positive selection is affecting all biological processes. Analysis of patterns of nucleotide variation shows that observed population structure is probably due to ongoing selection rather than long-term selection. We also show that the number of genes under putative long-term selection is low.
Genome admixture component prediction supports the hypothesis that quantitative traits (forming the majority of eukaryotes’ functional traits) are affected by many genes, each being under moderate to weak selection pressure. This will be true even if the overall selection pressure is high, and especially if an additive mode of action is predominant. As a consequence, the number of genes that can be detected to be under strong positive selection may be low. This agrees with our analysis of the dN/dS ratio among the eight M. truncatula populations. Our findings challenge previously reported results (Paape et al., 2013; Yoder et al., 2014), that were based on a preliminary dataset including fewer samples, no population structure, and using SNPs called on a previous version of the genome.
Geographic localization of the reference genome of M. truncatula: plant-GPS confirms that genetics helps in predicting geography
As a supplementary tool in defining the population structure, we modified the algorithm described by Elhaik et al. (2014) to adapt it to plants (plantGPS), and to reflect the expected reduced levels of migrations and outcrossing in self-pollinating plants like A. thaliana and M. truncatula.
Using the plantGPS method, we infer the geographic source for the M. truncatula A17 reference genome (Tang et al., 2014). This accession has been isolated from the Australian Jemalong cultivar (T. Huguet, personal communication) but Jemalong’s origin in the Mediterranean Basin is not documented in the literature. With plantGPS, we determine a likely primary geographical position of the Jemalong-A17 accession within the “Spanish Coastal” population, as illustrated in fig. 8. The location of A17 within this population is in accordance with its resistant phenotype in response to V. alfalfae (Ben et al., 2013). This independent result provides additional evidence of the role of admixture components in determining phenotypes.
Use of the plantGPS algorithm also provides us with an objective function to minimize i.e. the distance between predicted and reported locations, when looking for the most likely number of admixture components. This investigation shows the potential of the plantGPS method in locating unknown plant samples based on their admixture components, it may have similar applications in forensic sciences and technologies.
Discussion
Understanding the relationships between genotype and phenotype is a fundamental challenge in modern biology. Linking specific genomic variations with selective traits in plants and animals (yield, fitness, etc.) and human (disease predisposition, drug response, etc.) is key for many fields, from plant and animal breeding to individualized healthcare and drug discovery.
Most adaptive events in natural populations, or selected traits in breeds of domesticated species, occur via the evolution of quantitative, polygenic traits rather than via the fixation of a single (or few) beneficial mutations (Pritchard and Di Rienzo, 2010), with some exceptions, such as monogenic human diseases and a few plant traits described by Mendelian laws. Therefore, the phenotypic variability found in natural populations is due to a complex underlying genetic interplay of multiple, often unknown, loci with allelic effects affected by the environmental conditions (Hill, 2010; Mackay et al., 2009). Not surprisingly, the models attempting to describe genotype / phenotype relationships suffer from major drawbacks and are reductionist (focusing on relatively few genetic features). For instance, there are typically no clear “selective sweep” signals for most traits, as a result of low specificity for the corresponding model. Likewise, the “major QTL” model, which is currently the preferred concept in GWAS and controlled cross studies, shows limited performance in identifying the loci critical for breeding programs.
Here, we offer a novel methodology for predicting quantitative traits based on a whole-genome model of genomic variants and population admixture-based algorithms. We propose to call our approach “whole-genome modeling” – WhoGEM –. We move away from focusing on “large impact” variants and instead we propose calculating a simple descriptor of “mixing proportions” in individuals believed to originate from distinct ancestral populations. Our approach is akin to the calculation of phenotypic resemblance as in the whole-genome genetic resemblance, popularized by Meuwissen and others (Meuwissen and Goddard, 2010; Meuwissen et al., 2001). Unlike the latter approach, we explicitly embed the inferred population structure in our calculations, thus expanding the method’s applicability. We demonstrate the relationship between admixture proportions and quantitative traits in the model legume M. truncatula and support the view that phenotype can be explained by an additive action of a large number of loci. These components of inheritance represent combinations of genes (either protein-encoding or regulatory, such as non-coding RNAs) manifested as alleles, polymorphisms and some other genetic variants. Using examples of pathogen resistance and plant development traits, we demonstrate that the admixture proportions descriptor can predict the degree of local adaptation better than the feature selection-based methods. The WhoGEM concept is likely to be expandable to all quantitative functional traits that involve complex genetic determinism. Admixture proportions offer a meta-view of genome structure, providing information integrated across the entire genome. As admixture proportions embed information from the linked loci, they may be a straightforward way to address the long-standing debate about the relative contribution of protein-coding changes (micro-evolution) versus regulatory changes (expected to act through regulatory pathways, and thus phenotypic plasticity).
To conclude, we show that genome admixture components provide strong insights into the genetic landscape of polygenic adaptation. To the best of our knowldege, this is the first analysis to experimentally validate the infinitesimal model for natural adaptation of a quantitative phenotype in plants. Comparative analysis of response to two different pathogens clearly demonstrates that phenotypic differentiation among populations, supported by an infinitesimal model, may or may not result from natural selection. We also suggest that response to climate could be investigated using an infinitesimal model. Predicting phenotypes on the basis of genome components will help in inferring future trends of local adaptation related to global climate change. Adaptations of our method may well be applicable to “omics” data-based modeling of metastasis, clonal selection and genetic heterogeneity in cancer research. Moreover, an extension of WhoGEM would be capable of integrating and calculating admixture proportions from multiple types of genome-wide “big data”, such as whole-genome genetic landscapes, epigenetics and expression profiling. Our findings contribute to the establishment of a strong theoretical and experimental corpus of methods to detect and explain signals of polygenic adaptation within genomes.
Materials and Methods
SNP selection. A set of 262 genuine Medicago truncatula accessions (Yoder et al., 2014) was used to extract SNPs (http://medicago.hapmap.org). After quality checking and LD-pruning using PLINK, using the options --geno 0.05 --maf 0.01 --indep 300 60 1.3, we selected a total of 843 171 SNPs covering the eight chromosomes of the M. truncatula genome (Table S1).
Development of the plantGPS algorithm. plantGPS is an adaptation of the admixturebased Geographic Population Structure (GPS) algorithm (Elhaik et al., 2014) to plant species. This modification takes into account ties encountered when the genetic distances between different closely related accessions are identical. Ties are also considered when computing the contribution of other reference accessions to the sample’s genetic make-up. plantGPS calculated the Euclidean distance between the sample’s admixture proportions and a reference dataset, for predicting the provenance of sequenced accessions. The matrix of admixture proportions was calculated with the ADMIXTURE software package (Alexander et al., 2009). The ‘shortest distance’ measure, representing the test sample’s deviation from its nearest reference population, was subsequently converted into geographical distance using the linear relationship observed between genetic and geographic distances. The final position of the sample on the map was calculated by a linear combination of vectors, with the origin at the geographic center of the best matching population weighted by the distances to 10 nearest reference populations and further scaled to fit on a circle with a radius proportional to the geographical distance. If the smallest distance that represented the sample’s deviation from the best matching accession was identical for several accessions, those were considered as a set of accessions. Numerical values may thus contain ties and the initial geographical position of an unknown accession was defined as the centroid of the geographical positions of the identical, or nearest accessions. The contribution of other reference accessions m = 2..N to the sample’s genetic make-up might also contain ties. The computation of the weight was thus modified accordingly.
To estimate the assignment accuracy of plantGPS, we used the ‘leave-one-out’ approach at the individual level. In brief, we excluded each reference individual from the data set, recalculated the mean admixture proportions of its reference population, predicted its biogeography, computed the geographical distance between predicted and reported locations, tested whether it is within the geographic regions of the reported origin and then computed the mean accuracy per population. More specifically, we index our individual as the jth sample from the ith population that consists of ni individuals. For all populations, excluding the individual in question, the average admixture proportion and geographical coordinates were calculated as where is the parameter vector for the sth individual from the mth population, and nm is the size of the mth population. For the ith population the adjusted average will be
A set of 245 genuine M. truncatula accessions with geographical coordinates (latitudes and longitudes) served as the reference set for plantGPS. Seventeen accessions, among which the Jemalong- A17 accession that is used as the reference genome (Tang et al., 2014), were of unknown origin and not included in the reference set.
Strategy for population structure determination. The strategy used to identify populations combines three steps: admixture-based tools, Principal Component Analysis (PCA) and plantGPS.
First, a likelihood-based analysis is run in the unsupervised mode for K = 2 to 12. We use the ADMIXTURE software package (Alexander et al., 2009) based on the collection of high-quality LD-pruned SNPs. Each plant sample was characterized by a vector of n proportions that sum to one, n being the number of admixture components (i.e. n = K), possibly ranging from K = 2 to K = 12. This vector summarizes the proportion of the plant’s genome that belong to each of the n admixture component. A tree that summarizes the partition of the set of accessions into successive subsets defined by their major admixture component supports K = 7 or K = 8 as providing a stable structure for population stratification (fig. S1). Computations were conducted two times independently and came to almost identical results.
Second, the most suitable number of admixture components was verified using a PCA-based analysis. ADMIXTURE outputs the inferred allele frequencies of each SNP for each hypothetical population. This allowed us to simulate samples (called “zombies”) for each hypothetical population, viewed as reconstructed hypothetical ancient-like individuals, purged of centuries or millennia of admixture. The PCA thus included both the sampled accessions and zombies, as hypothetical ancestral individuals for each hypothetical population. Supplementary fig. S2 to supplementary fig. S5 show pairwise PCA plots of the actual M. truncatula accessions together with simulated samples (7 zombies per admixture component), for K = 7 to K = 10. For each level of K, the ancestral populations were distributed in the space of the first six eigenvectors. PCA computations were performed using the R package SNPRelate using PLINK formated files. Similar results were obtained using 30 simulated individuals per each hypothetical population, or using computations with EIGENSTRAT (data not shown).
Third, we applied the “leave-one out” cross-validation approach of the plantGPS algorithm at the “accession” level, to estimate the difference between predicted and reported location, for each sample. This procedure was repeated for K = 2 to K = 12 and the empirical cumulative distribution of the geographical distances was used as a criteria to estimate the accuracy of geographical prediction at each number of putative genome components (supplementary fig. S6).
We then established the definitive correlation between geographic and genetic distances between pairs of individuals, for K = 8. Given the (relatively) small distances across the Mediterranean Basin, we computed a “naive” geographical distance using pairwise Euclidean distance based on the longitude/latitude reported for the accessions. The matrix of pairwise genetic distances was computed using the admixture component proportions of each accession with K = 8. The Mantel test applied to the initial matrices revealed a significant, yet moderate, correlation between geographical and genetic distances (r=0.294, P =1 × 10−4). Fig. S12 shows that the linear relationship between geographical and genetic distances is restricted to distances less than 950 km. When filtering out the distance matrices for distances > 950km, the Mantel correlation coefficient raises to 0.78, a highly significant value (P =1 × 10−4). We thus fitted a linear relationship between geographical and genetic distances for geographical distances less than 950 km. The regression equation is Geo = 0.204 + 4.973 × Gen + ∊ with adjusted R2 = 0.61 and model P < 2.2 × 10−16.
The average distance between original and predicted locations is 471 km, the median is 214 km and 75% of the samples are predicted to be less than 677 km from their reported origin (Supplementary fig. S13).
To provide population identification, the final admixture frequencies of the eight components for the 262 M. truncatula accessions were calculated by applying ADMIXTURE in the supervised mode. Accessions were then clustered into populations, using hierarchical clustering based on the genome admixture proportions, using Euclidean distance and the ‘average’ link. The name of each population name is determined by the region which is the geographical centroid of the accessions of that population.
Relationships among populations were computed based on selected individuals that had at least 90% of their genome assigned to a given ancestral population. With that aim, genetic distances were computed based on the 840K SNP dataset (R package SNPRelate) and dendrogram was computed and drawn using R packages ape and geiger.
Maps and sample locations were drawn using the rworldmap, mapplots and maptools R packages.
dN/dS computations. To test for selection in M. truncatula populations, ancestral alleles of truncatula clade were estimated using MrBayes (Ronquist et al., 2012). MrBayes estimates ancestral alleles based on a given phylogenetic tree. We used the phylogenetic tree of Yoder et al.(Yoder et al., 2013) for this purpose. In the truncatula clade, M. soleirolii, M. turbinata, and M. doliata were used as outgroups in the estimation procedure. Ancestral alleles that were predicted with probability lower than 0.9 were removed from further consideration. Non-synonymous and synonymous mutations in M. truncatula populations were then identified based on comparison to the inferred ancestral allele states. The significance of the dN/dS ratio for each gene in each population was then calculated using a two-tailed binomial test. Specifically, we assumed the dN(dS) values were distributed as Binomial(N,p), where N is the observed number of mutations in that gene and p is set to be 1/6, which is the overall ratio of in the data. This test will then indicate genes for which the dN/dS ratio is higher or lower than is typical.
Evaluation of quantitative resistance to Verticillium alfalfae in M. truncatula. A set of 313 accessions of M. truncatula has been assessed for their response to Verticillium wilt, including 242 already sequenced accessions from the HapMap project (Yoder et al., 2014). M. truncatula seeds were from our own collection or obtained from the INRA Medicago truncatula Stock Center (Montpellier, France). All the M. truncatula accessions have been phenotyped using an Augmented Randomized Block Design in 3 independent replicates for the already sequenced (reference) accessions and 2 replicates for the other accessions. Between 4 and 10 plants per genotype were used in each replicate. 10-day-old plants were root inoculated as described in Ben et al.(Ben et al., 2013). Disease development was monitored for 32 days two or three times a week and rated using a scale from 0 (no symptoms) to 4 (dead plants). At the end of the experiment, the Maximum Symptom Score (MSS) was obtained for each plant. The LS-mean of the MSS for each accession was calculated using the linear model yijk = μ + accessioni + blockj + ∊ijk (yijk the maximum disease score for the kth plant of the ith accession of the jth block; ∊ijk, the residual) using R.
Relationship between admixture proportions and quantitative phenotypic variables. The 19 WorldClim bioclimatic variables (30 seconds resolution, downloaded at http://www.worldclim.org/current) were extracted for each accessions’ location, using the reported latitude and longitude for that accession (raster R package). The relationships between genome components, the 19 WorldClim bioclimatic variables and geography (latitude, longitude and altitude) were modeled using redundancy analysis (RDA), an approach that examines how well of the variation in one set of variables (the bioclimatic variables and/or the geography) explains the variation in another set of variables (the genome admixture proportions of each sample). The RDA was computed using the vegan R package. RDA of admixture proportions with bioclimatic variables conditionnal to geography was also computed to estimate effects of climate “corrected for” the geography.
The relationships between genome components and phenotypes were estimated using linear models. Because of dependencies among the predictors (the proportions of genome components must sum to one), a systematic search for the best minimum model was done using the leaps R package or use of the step function with both direction, employing a significance level of α = 5% as the benchmark for using a predictor.
Spatial interpolation of phenotypic traits was performed using a thin plate spline method, with a smoothing parameter of λ = 0.005, as implemented in the R package fields.
Unless otherwise stated, all computations were done using the R statistical environment (R Core Team, 2015).
Supplementary Material
Supplementary tables S1 - S10 and Supplementary figures S1 - S13 will be available at Molecular Biology and Evolution online
Acknowledgments
We thank Xavier Tassus and Christine Tayeh for providing data of plant pathogen distributions. Peter Ralph and Eran Elhaik provided valuable comments on the manuscript. We thank Jean-Marie Prosperi for providing seeds and maintaining a large part of the Medicago truncatula collections. We thank R2n for their participation in plant phenotyping. This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors. Mélanie Mazurier was supported by a PhD scholarship from the French “Ministère de la Recherche et de l’Enseignement Supérieur” and a “Visiting Student” fellowship from INPT for a stay at USC. Laurent Gentzbittel and Cécile Ben were supported by a “Visiting Scholar” fellowship from INPT for a stay at USC and US Feed the Future Innovation Lab “Climate Resilient Chickpea”. Tatiana Tatarinova was supported by a “Visiting Scholar” fellowship from INPT for a stay at INPT and by NSF Division of Environmental Biology award # 1456634.