Summary
Grasses are essential plants for ecosystem functioning. Thus, quantifying the selection pressures that act on natural variation in grass species is essential for biodiversity maintenance. In this study, we investigated the selection pressures that act on natural populations of the grass model Brachypodium distachyon without prior knowledge about the traits under selection. To do so, we took advantage of whole-genome sequencing data produced for two natural populations of B. distachyon and used complementary genome-wide scans of selection (GWSS) methods to detect genomic regions under balancing and positive selection. We show that selection is shaping genetic diversity at multiple temporal and spatial scales in this species and affects different genomic regions across the two populations. Gene Ontology annotation of candidate genes reveals that pathogens may constitute important factors of selection in B. distachyon. Finally, we cross-validated our results with QTL data available for leaf-rust resistance in this species and demonstrated that, when paired with classical trait mapping, GWSS can help pinpoint candidate genes for further molecular validation. Our study revealed widespread signatures of natural selection on genes involved in adaptation in B. distachyon and suggests that pathogens may constitute an important driving force of genetic diversity and evolution in this system. Thanks to a near base-perfect reference genome and the large collection of freely available natural accessions collected across its natural range, B. distachyon appears as a prime system for studies in ecology, population genomics and evolutionary biology.
Introduction
Grasses cover more than 40% of the world land area (Gibson, 2009) and dominate a wide variety of ecosystems, from tropical to temperate regions (Clayton, 1981; Gibson, 2009). Grasses also play a key role in eco- and agrosystem functioning as they provide habitats for many animal species (Groves, 2000) and represent the main source of grain and forage (Stromberg, 2011). Increasing crop production to meet the food and energy requirements of the world’s growing population is however putting great pressure on natural grasslands (Wallace, 1997; Helm et al., 2009; Ceballos et al., 2010). Faced with constant deterioration and fragmentation due to anthropic activities (Kiviniemi, 2002), these ecosystems are highly endangered (Ceballos et al., 2010), but little is known about their evolutionary resilience. Assessing the genetic basis of adaptation and quantifying the selection pressures that act on natural variation in grass species is therefore crucial with respect to biodiversity maintenance and food security.
To date, reciprocal transplant experiments have been extensively used to test the effect of selection on adaptive differentiation across populations (for review see Savolainen et al., 2013). Based on a “home vs. foreign” effect on fitness, reciprocal transplants are indeed powerful tools to unravel overall genotype by environment (GxE) interactions and have demonstrated the prevalence of local adaptation in grasses and plants in general (for review see Bischoff et al., 2006; Wadgymar et al., 2017). However, reciprocal transplant experiments use proxies such as survival, vegetative growth or seed production to measure the effect of the habitat on fitness (Bischoff et al., 2006). Hence, they provide little information about the functional and genetic bases of adaptation, unless combined with trait mapping such as quantitative trait locus (QTL) analyses and genome-wide association studies (GWAS) (Latta, 2009). QTL analyses and GWAS, on the other hand, are largely constrained by the effort and time necessary for high-resolution mapping. In grasses, while these trait-by-trait approaches have been valuable to decipher the genetic architecture of important characters with regard to crop genetic improvement (Huang et al., 2002; Barbieri et al., 2012; Morris et al., 2013; Slavov et al., 2014), they remain of limited value to grasp the overall selective forces that act on natural populations.
An efficient alternative to provide insights about evolutionary forces in natural populations consists in identifying genes under various types of selection at a whole-genome scale, then describing their function and the type of selection acting on them (Mitchell-Olds et al., 2007). For instance, new mutations that are beneficial in some populations will be positively selected and are more likely to quickly increase in frequency. Such so-called selective sweeps tend to reduce genetic diversity, increase differentiation among populations, and lead to extended haplotypes in the vicinity of the locus under selection due to genetic hitchhiking (Nielsen, 2005; Hermisson, 2009). Various genome-wide scans of selection (GWSS) methods have been developed to detect such footprints of positive selection in genomes while taking into account demographic history (Tang et al., 2007; Gautier et al., 2012; Stamatakis et al., 2013; Messer, 2015), and thanks to the remarkable progress of sequencing technologies, GWSS are now emerging as complementary approaches to classical trait mapping.
While local adaptation is commonly associated with positive selection on new advantageous polymorphisms, recent studies have demonstrated that balancing selection also plays an important role in this evolutionary process (Mitchell-Olds et al., 2007; Rasmussen et al., 2014; Wu et al., 2017). The term balancing selection is an “umbrella” concept (Fijarczyk & Babik, 2015) which describes the maintenance of genetic diversity over longer periods of time through adaptation to spatial heterogeneity, heterozygote advantage and negative frequency-dependent selection (Mitchell-Olds et al., 2007; Rasmussen et al., 2014). Leading to the recycling of polymorphisms and to selection on standing variation (Richman, 2000; Turchin et al., 2012), this process is more difficult to detect than positive selection (Fijarczyk & Babik, 2015) since older alleles have had more time to recombine, which may lead to narrow signatures around selected sites. As a consequence, the effect of balancing selection is still largely overlooked in genome scans, which remain strongly biased towards the detection of recent positive selection (Haasl & Payseur, 2016).
In this study, we capitalize on the near base-perfect quality of the reference genome of the Mediterranean grass Brachypodium distachyon (https://phytozome.jgi.doe.gov) to investigate how both positive and balancing selection are shaping diversity in this species. In the last decade, B. distachyon has been developed as a powerful model for research on temperate grass species as it is closely related to major crop cereals and to some of the grasses used for biofuel production (The International Brachypodium Consortium, 2010). Entirely sequenced, its small diploid genome (272 Mb) is fully assembled into five chromosomes and has been exhaustively annotated (The International Brachypodium Consortium, 2010). In addition, B. distachyon is broadly distributed around the Mediterranean rim (Dell’Acqua et al., 2014; Gordon et al., 2014; Tyler et al., 2016), providing access to natural populations from contrasting habitats, for which a large collection of accessions has been assembled. It therefore constitutes a unique and prime system to investigate the genetic basis of local adaptation in natural grass populations, opening the way to further fundamental and applied research.
Here, we took advantage of whole-genome sequencing data produced for 44 B. distachyon natural accessions originating mainly from Spain and Turkey (Gordon et al., 2017). We identified over 6 million SNPs and used four complementary GWSS methods to detect genomic regions under different regimes of selection (Figure 1). Namely, we asked i) at what time and geographical scale is selection acting in B. distachyon populations? ii) what are the selective constraints that shape diversity and adaptation in these populations? iii) is positive selection acting on the same genomic regions in the two populations or, on the contrary, on distinct loci?
Results
Population structure
In this study, we used whole-genome sequencing data (paired-end; Illumina technology) with an 86-fold median coverage of 44 B. distachyon accessions originating from Turkey, Iraq, Spain and France (Figure 2A, Table S1; Gordon et al., 2017). After filtering, we identified 6,204,029 SNPs. An ADMIXTURE analysis, where K=2 was identified as the best model, highlighted two distinct genetic clusters, an eastern and a western one, with very little admixture between the two (Figure 2B). For the rest of the study, accessions from Turkey and Iraq will be referred to as the eastern population while accessions from Spain and France will be referred to as the western population. The western population showed a lower level of nucleotide diversity (Wilcoxon test; P-value < 2.2e-16, Figure 2C) and haplotype diversity (P-value < 2.2e-16, data not shown) than the eastern one. Excluding the reference accession Bd21, which was artificially inbred before sequencing, the average level of heterozygosity in these accessions is 8% and ranges from 4 to 17.4% (Table S1).
Functional clustering of the genome of B. distachyon
GWSS outputs provide information about the likelihood for a given locus to be under selection. A classical approach to analyzing the results of GWSS consists in selecting genomic regions containing top 1% outliers for signals of selection and then assessing whether some biological functions or processes are significantly over-represented in the gene sets under selection through a Gene Ontology (GO) annotation (Kelley et al., 2006; Hancock et al., 2011; Nelson et al., 2017). While recombination rate is relatively high in B. distachyon (Huo et al., 2011), signals of selection around focal loci may decrease slowly due to locally stronger linkage disequilibrium and subsequent genetic hitchhiking. Top 1% outlier regions may thus contain several adjacent genes. Because genes having the same function or being involved in the same biological process tend to be physically clustered (Hammond-Kosack & Jones, 1996; Michelmore & Meyers, 1998; Takos et al., 2011; Nutzmann & Osbourn, 2014; Singh et al., 2015), we anticipated that this non-random organization of genomes could lead to an over-representation of some biological pathways or functions in small genomic regions and to an artificial enrichment for some GO terms in GWSS 1% outlier regions (Pavlidis et al., 2012).
To assess whether the genome of B. distachyon harbors such functional clusters of genes, we first performed a GO annotation for the 32,712 genes annotated in the reference genome. We then controlled for potential gene clustering by following the procedure described in Al-Shahrour et al. (2010). Briefly, we split the genome into overlapping windows of 50 consecutive genes and performed enrichment analyses on each window. We identified 272 windows significantly enriched for at least one biological process (Table S2). Several windows were enriched for processes that may be associated with adaptation to local environmental conditions, such as response to stress and defense response (Table S2). This prompted us to narrow down top 1% outlier regions by keeping only the genes located at and in the immediate vicinity (-10% of the peak value) of each of the peaks of selection. With the exception of the coalescence analysis, which is a window-based approach (see Methods), all analyses subsequent to GWSS reported in the following sections were performed on these filtered outputs.
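The windowing step can be illustrated with a minimal Python sketch. It is not the implementation of Al-Shahrour et al. (2010): the one-sided hypergeometric test, the step of one gene, the function names and the input layout (one set of GO terms per gene, in chromosomal order) are all assumptions made for illustration, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """P(X >= k): probability of at least k annotated genes among n genes
    drawn from a genome of N genes, K of which carry the GO term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def scan_windows(annotations, term, size=50, step=1):
    """Scan overlapping windows of `size` consecutive genes.

    annotations: list of sets of GO terms, one set per gene, ordered along
    a single chromosome (hypothetical input format).
    Returns a list of (window_start_index, annotated_gene_count, p_value).
    """
    N = len(annotations)
    K = sum(term in terms for terms in annotations)
    results = []
    for start in range(0, N - size + 1, step):
        window = annotations[start:start + size]
        k = sum(term in terms for terms in window)
        results.append((start, k, enrichment_p(k, size, K, N)))
    return results
```

Under these assumptions, a tight cluster of genes sharing one term yields very small p-values for every window that covers the cluster, which is the artifact the peak-based filtering is meant to limit.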
Genes under balancing selection due to environmental heterogeneity
Variation in abiotic conditions can drive local adaptation at a fine-grained spatial scale and lead to correlations between genotypes and environment. We first used an environmental association analysis approach to detect loci that may have been repeatedly selected by convergent climatic conditions across the two populations. If alleles at these loci have been recycled in response to environmental conditions common to Spain and Turkey, they should display a detectable signal of association at the scale encompassed by our study. Note that geographically varying selection is sometimes considered as being distinct from balancing selection. While it can also be referred to as local adaptation in the literature (Mitchell-Olds et al., 2007), we kept it here under the term of balancing selection.
To detect such loci and to highlight alleles shared between the two populations, we selected seven bioclimatic variables showing variation within both the eastern and western populations but little variation between them (Figure 2D). As all these variables were related either to temperature or to precipitation and are likely to be correlated, we summarized them with a PCA. The first axis of the PCA explained 62.6% of the variance but did not discriminate populations (Figure 2D) and was used to perform an environmental association analysis with all the SNPs identified across the 44 accessions. We identified 26 genomic regions associated with the first PCA axis. These regions harbored 71 genes (Figure 3A). No significant enrichment for any specific biological process was observed (Table 1).
Genes harboring extremely long coalescence time
Balancing selection can be detected with coalescence approaches, as ancient alleles are associated with older coalescence times (Charlesworth, 2006). To detect additional candidate regions for balancing selection in B. distachyon, we used the software ARGWeaver (Rasmussen et al., 2014). Briefly, ARGWeaver models the coalescent process along chromosomes and across non-recombining blocks of sequences to address their evolutionary history. It allows recovering several statistics that describe local genealogies and recombination, such as times since the most recent common ancestor (TMRCA), which should be increased near ancient alleles such as those under ancient balancing selection. By doing so, we identified 72 regions harboring 115 genes under what will be referred to as long-term balancing selection in the following. We observed a significant enrichment for genes involved in phosphorylation (Table 1).
Genes under disruptive selection between western and eastern populations
As new mutations providing a selective advantage rise in frequency through positive selection, neutral mutations that are physically close tend to remain strongly linked to them. Recent positive selection should therefore lead to a signature of long haplotypes near selected mutations. We used the Rsb test (Tang et al., 2007) to detect such signatures of recently completed or almost completed selective sweeps. This test detects haplotypes that are positively selected in one population by estimating the length of haplotypes around each allele at a core SNP, then comparing these lengths between populations. In contrast to the two approaches presented above, the Rsb test should detect regions that are genetically differentiated across the two populations (Figure 1). We identified 312 regions harboring 824 genes and 319 regions harboring 1212 genes in the eastern and western populations respectively (Figure 3A). The selected regions contained more genes in the western than in the eastern population (Wilcoxon test, P-value=0.001; Figure 3B). We observed a significant enrichment in genes involved in response to stress, particularly for defense response and response to oxidative stress in the eastern population (Table 1, Table S3). The gene set associated with the process of defense response contained well-known types of resistance genes (R-genes), i.e. genes with NBS-LRR domains (Table S3). Finally, we observed a significant enrichment for genes involved in nitrogen transport in the western population (Table 1, Table S3).
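The between-population comparison at the heart of the Rsb statistic can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used here: it takes as given a per-SNP summary of haplotype length in each population (the integrated site-specific haplotype homozygosity, here called `ies`), and it standardizes the log ratio by its genome-wide median and standard deviation, as is commonly done for this statistic. The function name and input values are hypothetical.

```python
from math import log
from statistics import median, stdev


def standardized_rsb(ies_pop1, ies_pop2):
    """Standardized log ratio of haplotype-length summaries at each SNP.

    ies_pop1, ies_pop2: per-SNP integrated haplotype homozygosity in the
    two populations (hypothetical, precomputed inputs; all values > 0).
    Large positive scores point to longer haplotypes in population 1.
    """
    log_ratios = [log(a / b) for a, b in zip(ies_pop1, ies_pop2)]
    center = median(log_ratios)
    spread = stdev(log_ratios)
    return [(x - center) / spread for x in log_ratios]
```

A SNP whose surrounding haplotypes are much longer in one population than in the other stands out as an extreme score, while SNPs with similar haplotype lengths in both populations stay near zero.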
Genes under ongoing selection within population
In the case of an ongoing and partial selective sweep (Figure 1), not all individuals within the population will display haplotype extension in the region under selection, which can lead to more subtle patterns that are not detected by the Rsb test. We used the program H-scan to detect such incomplete patterns and genes under ongoing positive selection. We identified 142 regions harboring 487 genes and 79 regions harboring 463 genes in the eastern and western populations respectively (Figure 3A). The gene set under selection in the eastern population was significantly enriched for genes involved in stress response, including the processes of defense response and response to cadmium ion (Table 1). The gene set associated with the process of defense response also contained many R-genes (Table S3). The gene set under selection in the western population showed a significant enrichment for genes involved in pyruvate metabolic process (Table 1, Table S3). As in the previous Rsb analysis, the selected windows contained more genes in the western than in the eastern population (Wilcoxon test, P-value=0.003; Figure 3B).
GWSS outliers display allele frequency spectra and coalescence patterns consistent with expectations
To test whether candidate regions displayed an allele frequency spectrum consistent with the type of selection they were supposed to detect, we computed different statistics and compared candidate regions to the rest of the genome. We first computed Tajima’s D (Tajima, 1989), a statistic summarizing the allele frequency spectrum that is influenced by both selection and demographic variation. Positive values are associated with balancing selection and bottlenecks, while negative values suggest recent expansion or positive/purifying selection. We then computed nucleotide diversity, genome-wide relative (FST) and absolute (dXY) measures of population differentiation (Cruickshank & Hahn, 2014). Note that dXY can be interpreted in a similar way to FST, but is correlated with the time to coalescence of alleles from all populations, making it independent of diversity within populations. In addition to TMRCA values, we finally extracted relative TMRCA halftime (RTH) values from the output of ARGWeaver. RTH captures whether coalescence events are skewed toward the recent past but is independent of the overall coalescence rate (Rasmussen et al., 2014). Assuming similar selection strength acting on Rsb and H-scan outliers, RTH is thus expected to be smaller in regions under ongoing positive selection (H-scan) than in regions under disruptive selection where one allele already reached near-fixation in one of the two populations (Rsb).
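For readers less familiar with Tajima’s D, the statistic can be computed from a sample size, the number of segregating sites and the mean number of pairwise differences, following the standard formulas of Tajima (1989). The sketch below is illustrative only: the function names and the input format (haplotypes as equal-length strings of 0/1 alleles) are assumptions, and it is not the software used in this study.

```python
from itertools import combinations
from math import sqrt


def tajimas_d_from_summary(n, seg_sites, pi):
    """Tajima's D from sample size n, number of segregating sites,
    and mean number of pairwise differences pi."""
    if n < 4 or seg_sites < 1:
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length haplotypes (hypothetical format)."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    # A site is segregating if more than one allele is observed in the sample.
    seg_sites = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)
    return tajimas_d_from_summary(n, seg_sites, pi)
```

In this sketch, a sample dominated by rare variants (as after a sweep or an expansion) gives a negative D, whereas a sample in which alleles segregate at intermediate frequencies (as under balancing selection) gives a positive D.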
Regions associated with bioclimatic variables or under long-term balancing selection displayed significantly higher Tajima’s D than the genomic background (Wilcoxon test, all P-values < 1.0e-10, Figure 3C) except for associated loci in Spain for which Tajima’s D values were only marginally higher than in the rest of the genome (P-value = 0.055). These regions also displayed a significantly higher level of nucleotide diversity within regional groups (P-values < 1.0e-10) than the rest of the genome.
Coalescence times (TMRCA) within regional groups were higher for windows covering loci associated with environmental variables than for genomic background (P-values = 1e-14 and 0.02 for eastern and western populations respectively). As expected, regions under long-term balancing selection also displayed higher TMRCAs as they were specifically selected as belonging to the top 1% outliers for this statistic. Tajima’s D, nucleotide diversity and TMRCA statistics are thus consistent with our expectations and confirmed that both the environmental association and coalescence analyses detected older polymorphisms shared across the two populations.
On the other hand, regions identified with the H-scan and Rsb approaches harbored negative and significantly lower Tajima’s D than the rest of the genome, which is consistent with positive selection (all P-values < 1e-5, Figure 3C). Outlier windows for Rsb also displayed higher relative (FST, P-values < 2.2e-16) and absolute levels of differentiation (dXY; P-values < 2.2e-16) than the rest of the genome. These latter patterns are consistent with the high frequencies of divergent haplotypes between the western and eastern populations for Rsb outliers, resulting in an increased differentiation between them (Figure 1). H-scan and Rsb outliers also displayed a lower RTH compared to genomic background (all P-values < 2.2e-16 except for Turkish Rsb outliers for which P-value = 0.002), which is also consistent with selective sweeps where selection skews coalescence times towards the recent past for most (but not necessarily all) lineages (Rasmussen et al., 2014). Finally, H-scan outliers displayed lower RTH than Rsb outliers (all P-values < 2.2e-16), which confirms that this test tended to detect more recent sweeps than the Rsb test.
Overlap between outputs from the four different approaches
As expected given the specificity of each test, we found little overlap between the outputs of the four different tests (Figure 3A). Out of a total of 1,262 and 1,617 genes under positive selection, only 49 and 58 genes were common to the H-scan and Rsb approaches in the eastern and the western population respectively. Similarly, little overlap (10 genes in total) was observed between the environmental association analysis and the Rsb or H-scan tests (Figure 3A). No overlap was observed between the gene sets detected to be under selection with the coalescence approach and with any of the three other tests.
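The totals of genes under positive selection follow from the per-test counts reported in the Rsb and H-scan sections by inclusion-exclusion. A short check in Python, using only numbers given in the text (the helper name is ours):

```python
def union_size(size_a, size_b, shared):
    """Number of distinct genes in two gene sets that share `shared` genes."""
    return size_a + size_b - shared


# Rsb genes, H-scan genes, genes common to both tests
east_total = union_size(824, 487, 49)
west_total = union_size(1212, 463, 58)
print(east_total, west_total)  # 1262 1617
```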
Genomic regions affected by positive selection
We also tested whether distinct loci/genomic regions were affected by recent positive selection (Rsb and H-scan outliers combined) in the two populations using linear models. We found no association between the density of genes under selection in the eastern and western populations on chromosomes 2 and 3 (p-value = 0.4 and 0.3 respectively), which indicates that, as observed in Figure 3D, distinct regions are affected by positive selection on these two chromosomes. We found a significant association between these variables on chromosomes 1, 4 and 5 (p-value = 0.004, 1.07e-11 and 0.02 respectively). In these latter cases, however, R2 values were small (7.366e-05, 0.0004 and 4.42e-05), suggesting that many regions affected by selection remain specific to each population on these chromosomes as well, as depicted in Figure 3D.
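The distinction between a significant association and a small R2 can be made concrete with an ordinary least-squares fit. The sketch below is a generic illustration, not the model specification used here: the function name and the idea of using per-window densities of selected genes as inputs are assumptions.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    x, y: equal-length lists, e.g. densities of genes under selection per
    genomic window in the two populations (hypothetical inputs).
    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared
```

R2 is the share of variance in one variable explained by the other. With many data points a slope can differ significantly from zero while R2 stays close to zero, in which case most of the variation in gene density remains unexplained by the other population.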
Identification of candidate genes in known QTL regions
Combining association mapping and analyses of selection constitutes a powerful approach to identify candidate genes and to address their selective regime. We identified many candidate regions harboring resistance genes. As a proof of concept, we aimed at assessing whether regions identified as resistance loci against known pathogens were also highlighted in our scans of selection. In B. distachyon, the genetic basis of resistance to the rust fungus Puccinia brachypodii has been deciphered through QTL mapping (Barbieri et al., 2012), which showed that leaf-rust resistance is controlled by three main QTL located on chromosomes 2 (from nucleotide 37,949,269 to 40,903,216), 3 (from nucleotide 13,943,000 to 14,512,222) and 4 (from nucleotide 9,649,152 to 10,679,750). For the rest of the study, these regions will be referred to as QTLrust-2, QTLrust-3 and QTLrust-4.
We screened these three regions for evidence of selection and found strong Rsb signals in QTLrust-3 and QTLrust-4 (Figure 4A). These regions were not detected as outliers with any other test but belonged to the top 0.05% p-values of Rsb outliers. The Rsb signal in QTLrust-3 reaches its highest point in a serine/threonine phosphatase (Bradi3g16320: 14,486,831-14,488,838; Figure 4A, left panel), 60 kb upstream of the QTL peak identified in this region (Barbieri et al., 2012). The two adjacent Rsb signals in QTLrust-4 reach their highest points in two NBS-LRR resistance genes (Bradi4g10153: 9,807,879-9,812,927; and Bradi4g10171: 9,828,236-9,835,003; Figure 4A, right panel). These two genes are respectively located 6 kb and 22 kb upstream of the QTL peak. We observed extended haplotypes in the eastern population in both QTLrust-3 and QTLrust-4 (Figure 4B). Congruent with the signal detected by the Rsb test, the large majority of the eastern accessions displayed the extended haplotype in these regions, indicating a nearly completed selective sweep, especially in QTLrust-4 (Figure 4B).
QTLrust-4 showed a striking enrichment for genes involved in defense response (displayed by grey boxes in Figure 4A, P-value = 8E-09) and in immune signaling processes such as phosphorylation (P-value = 5.6E-06). Among the 113 genes covered by QTLrust-4, 40 correspond to genes with an NBS-LRR, a receptor-like protein kinase (RLK) or an F-box domain, three types of genes that can confer resistance in plants. The presence of such large gene clusters, as found in other regions of the genome (Table S2), further demonstrates the importance of narrowing down regions under selection to top outlier genes for unbiased GO annotation.
Discussion
Assessing the time and spatial scales at which selection acts is key to understanding how genetic diversity is maintained or lost through adaptation (Stinchcombe & Hoekstra, 2008; Fuller et al., 2015). In plants, and especially in grasses, this question has been largely restricted to crops, which biases our understanding of evolutionary processes that have shaped genomes in their natural ancestors and extant relatives. In this study, we investigated the selective forces influencing adaptation in two populations consisting of 44 natural accessions of the wild Mediterranean grass B. distachyon. We found that ancient balancing and recent positive selection left distinct signatures on specific gene categories, and that positive selection affects distinct loci across the two populations. Importantly, our results support a role for pathogens in driving population differentiation and confirm that GWSS constitute effective approaches to pinpoint candidate genes as a complement to classical trait mapping.
Time- and space-varying selection is shaping diversity in B. distachyon
The end of the last glaciation period 10,000 years ago led to drastic and recent changes of plant communities in Eurasia (Svenning et al., 2008; Binney et al., 2017). At that time, climate warmed and species distributions expanded over Europe (Hewitt, 1999). Pollen-based studies show that vegetation expansion was fast, reaching up to 2 km per year for some species (Hewitt, 1999). To our knowledge, no fossil pollen records are available for B. distachyon, which prevents reconstructing the geographical distribution of this species before and during the last ice age. Yet, a previous study showed that the two populations analyzed here experienced a severe population size reduction during the last glaciation followed by a rapid expansion within the last 10,000 years (Stritt et al., in press). Even though unraveling the history of populations in southern peninsulas is more complex than in northern regions (Hewitt, 2000; Feliner, 2011), these results are congruent with the recent global postglacial recolonization of Europe by plants (Hewitt, 1999; Svenning et al., 2008; Binney et al., 2017) and imply that B. distachyon populations had to adapt to newly colonized habitats in the recent past.
Balancing selection associated with spatial heterogeneity (Richman, 2000; Charlesworth, 2006; Fijarczyk & Babik, 2015; Wu et al., 2017) may have maintained ancestral polymorphisms over long periods of time in natural populations of B. distachyon. Our environmental association analysis indeed shows that loci associated with bioclimatic variables display higher diversity, older alleles and more shared variation between regional groups when compared to genomic background. We exclusively focused the analysis on bioclimatic variables that displayed variation across localities within populations but little variation between populations (Figure 2D). This approach, together with our relatively small sample size, could explain why this analysis highlighted fewer genes than the other methods used in the rest of the study. Our observations nonetheless suggest that adaptation to climate after recolonization does not necessarily involve de novo mutations, even after strong bottlenecks. As a further support of balancing selection, we identified an even older set of shared polymorphisms with a coalescence approach. The environmental heterogeneity encountered by B. distachyon populations (Lopez-Alvarez et al., 2015) thus seems to have provided selective pressure strong enough to maintain polymorphisms over long periods of time within each population. This result is congruent with a previous analysis of natural populations of B. distachyon originating exclusively from Turkey, which showed, at a smaller geographical scale than the one investigated here, that populations are adapted to local habitats (Dell’Acqua et al., 2014).
On the other hand, we also found evidence for positive selection acting on younger polymorphisms with both the Rsb and H-scan tests. These tests also revealed that positive selection is targeting different loci in the two populations. Interestingly, these loci appear to be non-randomly distributed along chromosomes (Figure 3D). As recombination rate is relatively high in B. distachyon (Huo et al., 2011), we do not believe that this pattern is due to extended linkage disequilibrium and the subsequent process of linked selection along such large genomic regions (Cutter & Payseur, 2013; Slotte, 2014). Rather, the peaks of selection we identified were narrow and allowed us to pinpoint genes (Figure 4), indicating that while B. distachyon is primarily inbreeding, outcrossing events must be frequent enough to limit extended linkage disequilibrium. This is also congruent with the higher level of heterozygosity we observed here in B. distachyon compared to other selfing plants such as A. thaliana (Platt et al., 2010). Whether these regions form islands of divergence and result from a complex interaction between recombination rate variation, gene flow and selection (Renaut et al., 2013; Samuk et al., 2017) remains to be investigated. As we obtained enrichments for different GO terms in the eastern and western populations, our study nonetheless shows that contrasting abiotic and biotic factors are shaping population diversity at different regions of the genome of B. distachyon through positive selection.
B. distachyon occurs exclusively in Mediterranean habitats, which may appear at first glance to be homogeneous and unlikely to promote local adaptation. Our results contradict this expectation and reveal that natural selection affected different genes and genomic regions across populations. Even though we identified more genes under positive than under balancing selection, it would be premature to conclude that the former selection regime, less challenging to detect (Delph & Kelly, 2014), is the predominant process shaping diversity in this species. Rather, we believe that we provide here genomic evidence that large-scale balancing selection also leads to the adaptation of B. distachyon populations to local environmental conditions.
Pathogens as a potential driving force of population evolution
Host-pathogen interactions lead to strong coevolutionary dynamics and are considered a major factor shaping diversity (Karasov et al., 2014; Fumagalli et al., 2011; Krattinger & Keller, 2016a). Two main types of interaction have been proposed. Under an arms race model, repeated innovation from both sides results in repeated fixation of advantageous alleles (Brown & Tellier, 2011). This interaction can therefore lead to positive selection that can be detected by tests focusing on extended haplotypes. The other type of interaction is often referred to as Red Queen dynamics or trench warfare, where alleles involved in the interaction are recycled by negative frequency dependence and can therefore subsist for long periods of time in populations through balancing selection (Brown & Tellier, 2011).
The plant immune system is complex. On the one hand, it is composed of two tiers of extracellular and intracellular receptors (Krattinger & Keller, 2016a,b) that efficiently detect the presence of pathogens and constitute a first level of defense (for review Greeff et al., 2012; Couto & Zipfel, 2016; Eckardt, 2017). Further mechanisms, such as oxidative bursts, which are produced at an early stage of pathogen invasion, can act as additional levels of defense and prevent pathogen proliferation (Wojtaszek, 1997; Torres et al., 2006; Fones & Preston, 2011; Sewelam et al., 2016). In this study, we found a significant enrichment of signals of selection at genes involved in these two levels of defense in the eastern population. More specifically, we found many well-characterized R-genes, i.e. genes displaying NBS-LRR domains (Mchale et al., 2006; Jacob et al., 2013; Liu et al., 2014; Couto & Zipfel, 2016; Eckardt, 2017; Ooijen et al., 2017), and genes involved in the oxidative stress response to be under ongoing and/or disruptive selection in the eastern population (Table 1). On the other hand, we also found additional R-genes and genes involved in phosphorylation, a process especially important for immune signaling in plants (for review Park et al., 2012), to be under balancing selection. Our results are thus consistent with the two classical models of host-pathogen coevolution (Mondragón-palomino et al., 2002; Mchale et al., 2006; Gos et al., 2012; Mace et al., 2014; Zhong et al., 2015; Wu et al., 2017). Overall, and as shown in other organisms (Karasov et al., 2014; Fumagalli et al., 2011; Krattinger & Keller, 2016a,b; Bourgeois et al., 2017), our genome-wide approach suggests that pathogens may constitute an important driving force of population and genome evolution in B. distachyon.
Surprisingly, the significant enrichments for resistance genes in the Rsb and H-scan outlier gene sets were only observed in the eastern population. The geographical origin of the populations could play a role in this pattern. The Middle East is indeed the center of origin of many grasses, including B. distachyon, and of their associated pathogens (Wyand & Brown, 2003; Stukenbrock et al., 2005; Opanowicz et al., 2008; Hovmøller et al., 2011). Many studies found that both resistance genes in plants and effector genes in pathogens can be organized in clusters evolving through an arms race resulting in a gene birth-and-death process (Michelmore & Meyers, 1998; Dong et al., 2015; Singh et al., 2015). Because centers of origin are usually associated with higher diversity, it is possible that a higher level of pathogen diversity drove selection at a larger number of resistance genes in the eastern population.
We cannot, however, rule out technical biases inherent to the sampling design and to the history of the studied populations. The western population indeed experienced a stronger bottleneck than the eastern one (Stritt et al., in press), which, together with the smaller geographical area sampled, may be responsible for the reduced nucleotide and haplotype diversity we observed in this population. In addition, the bottleneck may have led to longer haplotypes in the western population, which could explain why windows under positive selection were on average larger and contained more genes in this population than in the eastern one. Discriminating the effect of selection from that of demography remains difficult in such bottlenecked populations (Long et al., 2013). As a consequence, confounding effects may have resulted in the detection of more false positives and blurred the GO annotation in the western population. As we applied stringent filtering criteria and used approaches which are expected to be robust to demographic history (Tang et al., 2007), we nonetheless believe that we provide the community with a reliable set of candidate loci, even for the western population. It is worth noting that, despite no significant enrichment for defense response, footprints of positive selection at R-genes were also found in the western population. This indicates that pathogens may constitute a selection pressure, albeit a milder one, in this population as well.
GWSS as a complementary approach to QTL and GWAS
Disentangling the mechanisms that promote or prevent adaptation requires more integrated studies, using both experiments in controlled conditions and methods to characterize genetic diversity in natural populations (Feder & Mitchell-olds, 2003; Stinchcombe & Hoekstra, 2008; Flood & Hancock, 2017). Several studies used GWSS to validate genes functionally characterized or previously identified through GWAS and to propose stronger hypotheses on the mode of selection operating on traits relevant for adaptation (Roulin et al., 2016; Tang et al., 2007; Fumagalli et al., 2011; Bourgeois et al., 2017). Following this idea, we inspected three QTL regions responsible for the resistance of B. distachyon to the leaf-rust fungus Puccinia brachypodii, a natural pathogen of B. distachyon expected to exert strong selection on natural accessions.
For two of these QTL regions, we found strong Rsb outliers and reduced haplotype diversity in the eastern cluster in the vicinity of the QTL peaks, but no sign of balancing selection. These results strongly indicate that large-scale positive selection is shaping rust resistance in B. distachyon natural populations, as suggested in other species (Dodds & Thrall, 2009; Chavan et al., 2015). Interestingly, the QTL region identified on chromosome 3 displays a strong signal of selection in a serine/threonine phosphatase, a class of genes known for their role in defense response and stress signaling (País et al., 2009; Durian et al., 2016). The region identified on chromosome 4 is more complex and consists of a cluster of resistance and stress signaling genes. Nevertheless, and while other genes display evidence of positive selection, a strong peak of selection co-localizes with the peak of the QTL and points at two R-genes. Such genes have been shown to confer resistance to rust in other species (Bettgenhaeuser et al., 2014) and constitute prime candidates for further functional characterization.
B. distachyon is closely related to major crop cereals as well as to grass species used for biofuel production. Translating research from B. distachyon to plants of agronomic and economic interest will require a deeper understanding of the genetic architecture of traits involved in the response to environmental stresses. The molecular basis of tolerance to various abiotic stresses such as drought, salt and cold has been investigated in this species (Luo et al., 2011; Manzaneda, 2013; Carmo & Charron, 2014; Gordon et al., 2014; Marais & Juenger, 2015; Sun et al., 2015; Mur & Bosch, 2016). Here, we also highlighted cadmium pollution as a potential factor of selection in the eastern population. As pollution with heavy metals including cadmium has been reported in Turkey in regions where accessions were collected for this study (Bakirdere & Yaman, 2008; Mor & Ceylan, 2017), our results suggest that B. distachyon could be used to investigate tolerance to this stress. As genetic transformation is highly efficient in this species relative to other grasses, we anticipate that combining classical trait mapping analyses with GWSS will assist allele mining for additional eco-responsive traits.
Conclusion
Our results revealed widespread signatures of natural selection at genes involved in adaptation in B. distachyon. We also found that pathogens may constitute an important driving force of genetic diversity and evolution in this system. While we limited our analysis to classical point mutations, recent studies showed that copy number variants (CNVs) and transposable element polymorphisms are abundant across B. distachyon populations (Gordon et al., 2017; Stritt et al., in press). Hence, the important genomic resources currently developed in this species open new avenues of research to further investigate the role of structural variation in natural population evolution and adaptation. To date, B. distachyon remains a classical model for research on grass genomics with a strong orientation towards applied research. Thanks to the high quality of its reference genome and the existence of large collections of freely available natural accessions collected from the species native range, it also constitutes a prime system for studies in ecology, population genomics and evolutionary biology.
Experimental procedures
SNP calling, population structure and genetic diversity
We used paired-end Illumina sequencing data generated for 44 accessions of B. distachyon (Gordon et al., 2017) originating from Spain (N=16), France (N=1), Turkey (N=23) and Iraq (N=4; see Table S1 for the origin of the accessions and sequencing effort). Reads were aligned to the reference genome v2.0 with BWA-MEM (standard settings; Li, 2013). After removing duplicates with Sambamba (Tarasov et al., 2017), single nucleotide polymorphisms (SNPs) were called with Freebayes (Garrison & Marth, 2016). The output was then filtered by removing SNPs with more than 10 missing genotypes, more than two alleles, a quality lower than 20, a minor allele frequency lower than 0.05, or a mean depth lower than 20 or higher than 200. Data were phased with BEAGLE v4 (Browning & Browning, 2007) using default settings.
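The filtering criteria above can be sketched as a single predicate applied to each variant site. This is only an illustration and not the pipeline used in the study: the Site record, its field names and the 0/1 genotype coding (one call per inbred accession) are hypothetical, and the thresholds are simply those quoted in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """Hypothetical, minimal stand-in for one variant record."""
    qual: float                     # variant quality
    n_alleles: int                  # reference plus alternate alleles
    mean_depth: float               # mean read depth across accessions
    genotypes: List[Optional[int]]  # 0 = reference, 1 = alternate, None = missing


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt = sum(called) / len(called)
    return min(alt, 1.0 - alt)


def passes_filters(site, max_missing=10, min_qual=20.0, min_maf=0.05,
                   min_depth=20.0, max_depth=200.0):
    """Return True when the site survives every filter listed in the text."""
    n_missing = sum(g is None for g in site.genotypes)
    return (
        n_missing <= max_missing
        and site.n_alleles <= 2
        and site.qual >= min_qual
        and minor_allele_frequency(site.genotypes) >= min_maf
        and min_depth <= site.mean_depth <= max_depth
    )
```

Keeping all thresholds as keyword arguments makes it easy to check how sensitive the retained SNP set is to each criterion.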
We then used the program Admixture (Alexander & Novembre, 2009) to identify the genetic structure of the two populations. The analysis was run for K values from 1 to 5, and the best model was determined as the model with the lowest cross-validation error. Summary statistics such as within-population nucleotide and haplotype diversity were computed with the R package PopGenome (Pfeifer et al., 2014). Nucleotide and haplotype diversity values were square-root transformed to fit a normal distribution and a t-test was used to compare the two within-population distributions. Levels of heterozygosity were calculated with VCFtools (Danecek et al., 2011).
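The two decision rules in this paragraph can be illustrated with a short sketch. It assumes that cross-validation errors have already been collected into a mapping from K to error, and it computes Welch's t statistic on square-root-transformed values; the text does not state which variant of the t-test was used, so the Welch form is an assumption, and both function names are hypothetical.

```python
import math
import statistics


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error ({K: error})."""
    return min(cv_errors, key=cv_errors.get)


def welch_t_on_sqrt(sample_a, sample_b):
    """Welch's t statistic after square-root transforming both samples."""
    a = [math.sqrt(x) for x in sample_a]
    b = [math.sqrt(x) for x in sample_b]
    # Standard error of the difference in means, unequal variances allowed
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se
```

A positive statistic means that the first population has the higher mean transformed diversity.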
Detecting balancing selection with an environmental association analysis
Bioclimatic variables were downloaded at a 30 arc-seconds resolution from http://www.worldclim.org/ and extracted for each locality using the R libraries gdal and raster. We then picked seven bioclimatic variables (bio6, bio8, bio11, bio12, bio13, bio16, bio18) that displayed substantial variation within geographical groups, but little between them. Bio6, bio8, bio11, bio12, bio13, bio16 and bio18 correspond respectively to minimum temperature of coldest month, mean temperature of wettest quarter, mean temperature of coldest quarter, annual precipitation, precipitation of wettest month, precipitation of wettest quarter and precipitation of warmest quarter. As these variables are likely to be correlated, they were summarized in a principal component analysis (PCA) in R. The coordinates of the first axis were used to test for correlation between allele frequencies and environment in the R package GenABEL (Aulchenko et al., 2007). We accounted for relatedness between samples using a PCA correction and used corrected P-values of association.
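As an illustration of how correlated environmental variables can be collapsed into a single axis, the sketch below computes the coordinates of each locality on the first principal component. It is not the R code used in the study: it assumes that variables are standardized before the PCA (the text does not say whether they were), uses plain power iteration instead of a library routine, and the function names are hypothetical.

```python
import math


def standardize(columns):
    """Center each variable and scale it to unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        out.append([(v - mean) / sd for v in col])  # fails if a variable is constant
    return out


def first_pc_scores(rows, n_iter=500):
    """rows: one list of variable values per locality. Returns PC1 coordinates."""
    cols = standardize([list(c) for c in zip(*rows)])
    n, p = len(rows), len(cols)
    # Correlation matrix of the standardized variables
    corr = [[sum(cols[i][k] * cols[j][k] for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]
    # Power iteration; the uneven start vector avoids the most obvious
    # case of starting orthogonal to the leading eigenvector
    vec = [float(i + 1) for i in range(p)]
    for _ in range(n_iter):
        new = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    # Project each locality onto the leading eigenvector
    return [sum(cols[j][k] * vec[j] for j in range(p)) for k in range(n)]
```

The sign of a principal component is arbitrary, so only the relative ordering of localities along the axis is meaningful.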
Detecting balancing selection with ancestral recombination graphs (ARG)
We used the software ARGWeaver to detect additional candidate regions for balancing selection with a coalescence approach (Rasmussen et al., 2014). To limit computation time, we included in the analysis a subset of 12 accessions with high sequencing depth and covering the largest geographical range (6 accessions from each population). We used a mutation rate of 1.4 × 10⁻⁹/bp/generation and a recombination rate of 5.9 × 10⁻⁸/bp/generation. The mutation rate was estimated by aligning the orthologs of 100 genes and using rice as an out-group (divergence estimated at 40 My; The International Brachypodium Consortium, 2010). Note that we subsequently used an outlier approach to identify the oldest polymorphisms present in the two populations (top 1% outliers). Therefore, potential biases inherent to the use of a molecular clock do not affect our analysis. ARGWeaver is also flexible with regard to recombination rate as it reconstructs ancestral recombination graphs and accommodates variable recombination rates and genealogies along the genome. The algorithm was run for 1000 iterations, using 20 discretized time steps, a maximum coalescence time of 3 million generations and a prior effective population size of 100,000 individuals.
Detecting disruptive positive selection
We used the Rsb test (Tang et al., 2007) to detect signatures of recent or almost completed hard sweeps. This test detects haplotypes that are positively selected in one population by using a second population as a contrast. While the output of the test provides a P-value, it also indicates in which population a given allele is under selection. Rsb statistics were computed for each SNP with the R package rehh 2.0 (Gautier et al., 2012) with default settings. We further visualized the extension of haplotypes at candidate regions using the bifurcation.diagram() function of rehh.
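To make the logic of the test concrete, the sketch below standardizes the log ratio of integrated site-specific haplotype homozygosity between two populations and converts a standardized score into a two-sided P-value under a standard normal null. It follows our understanding of the statistic described by Tang et al. (2007) and is not the rehh implementation; the per-SNP input values and the function names are hypothetical.

```python
import math
import statistics


def rsb_scores(ies_pop1, ies_pop2):
    """Standardized log ratios of integrated haplotype homozygosity per SNP."""
    log_ratios = [math.log(a / b) for a, b in zip(ies_pop1, ies_pop2)]
    med = statistics.median(log_ratios)
    sd = statistics.stdev(log_ratios)
    # Positive values: longer haplotypes in population 1;
    # negative values: longer haplotypes in population 2
    return [(lr - med) / sd for lr in log_ratios]


def two_sided_p(z):
    """Two-sided P-value of a standardized score under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))
```

The sign of each score is what indicates in which population the extended haplotype is found, while its magnitude drives the P-value.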
Detecting ongoing positive selection within populations
Finally, we used the software H-scan (Schlamp et al., 2016) to detect incomplete, ongoing positive selection. To do so, we calculated average pairwise haplotype lengths using the number of segregating sites spanned by each tract within each population. The statistic is expected to be larger as the number of extended haplotypes increases in a population. This method is specifically dedicated to the detection of ongoing sweeps where one (hard sweep) or several (soft sweep) haplotypes are under positive selection and provides a statistic for each SNP. We ran the method on Turkish and Spanish accessions independently to detect selective sweeps within each geographical group.
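The idea of an average pairwise haplotype length can be sketched as follows. For a focal SNP, each pair of haplotypes contributes the number of consecutive segregating sites around that SNP at which the two haplotypes are identical, or zero if they differ at the focal SNP itself. This is a simplified illustration under those assumptions and not the H-scan code; haplotypes are represented as strings over segregating sites only, and the function names are hypothetical.

```python
from itertools import combinations


def tract_length(hap_a, hap_b, focal):
    """Number of segregating sites in the identical tract spanning `focal`."""
    if hap_a[focal] != hap_b[focal]:
        return 0
    left = focal
    while left > 0 and hap_a[left - 1] == hap_b[left - 1]:
        left -= 1
    right = focal
    while right < len(hap_a) - 1 and hap_a[right + 1] == hap_b[right + 1]:
        right += 1
    return right - left + 1


def h_statistic(haplotypes, focal):
    """Average tract length over all pairs of haplotypes at the focal SNP."""
    pairs = list(combinations(haplotypes, 2))
    return sum(tract_length(a, b, focal) for a, b in pairs) / len(pairs)
```

Because every pair is counted, the statistic rises both when one haplotype is common and long (hard sweep) and when several distinct haplotypes are each shared over long tracts (soft sweep).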
Testing for functional clustering
We first performed a GO annotation for the 32,712 genes annotated in the reference genome (version 2.1) with Blast2GO (Conesa et al., 2005). We then controlled for potential gene clustering by following the procedure described by Al-Shahrour et al. (2010). The entire gene set of the reference genome was split into windows of 50 consecutive genes. Windows were moved along chromosomes in steps of 25 genes to allow for half-window overlaps. Enrichment analyses of biological processes were then performed for all the generated windows with the R package GOstats (Falcon & Gentleman, 2017) using Fisher’s exact test. P-values were subsequently adjusted for multiple testing with a Benjamini-Hochberg correction. Regions were considered significantly enriched for a biological process when they displayed a corrected P-value ≤ 0.01 and also harbored at least five genes associated with the given process.
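The windowing scheme and the multiple-testing correction described above can be sketched as follows. This is an illustration and not the GOstats workflow used in the study; the function names are hypothetical, and the handling of the last genes on a chromosome (dropped here when they do not fill a complete window) is an assumption, since the text does not describe it.

```python
def gene_windows(genes, size=50, step=25):
    """Split an ordered gene list into half-overlapping windows."""
    last_start = max(len(genes) - size, 0)
    # Trailing genes that do not fill a complete window are left out
    return [genes[i:i + size] for i in range(0, last_start + 1, step)]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With a window of 50 genes and a step of 25, every gene away from chromosome ends is tested in exactly two windows.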
Subsequent filtering of GWSS results
Both the H-scan and the Rsb tests compute statistics at each SNP. To limit false positives, we first selected 10 kb windows displaying at least four significant SNPs within the top 1% outliers. Overlapping significant windows were merged. We narrowed down the selected windows by keeping only the genes located at and around each of the top 1% peaks of selection, i.e. where the statistic remained within 10% of the peak value. For the association test, we also selected 10 kb windows displaying at least four significant SNPs, i.e. with a corrected P-value ≤ 0.001 (-log10(P-value) ≥ 3), and narrowed them down in the same manner. These filtering criteria, however, were not applied to the output of ARGWeaver, which is a window-based approach and for which we only kept the top 1% outlier windows.
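The window-based filter can be sketched as below for one chromosome. The sketch assumes fixed, non-overlapping 10 kb bins and merges bins that touch each other; whether the original windows were fixed or sliding is not stated, so this choice, like the function names, is an assumption made for illustration. The peak-based narrowing step is not covered.

```python
import math
from collections import Counter


def top_fraction_threshold(scores, fraction=0.01):
    """Smallest score that still belongs to the top `fraction` of values."""
    ranked = sorted(scores, reverse=True)
    n_top = max(1, math.floor(len(ranked) * fraction))
    return ranked[n_top - 1]


def candidate_windows(positions, scores, window=10_000, min_snps=4, fraction=0.01):
    """Return merged (start, end) intervals holding enough outlier SNPs."""
    cutoff = top_fraction_threshold(scores, fraction)
    # Count outlier SNPs per fixed-size bin
    counts = Counter(pos // window
                     for pos, score in zip(positions, scores) if score >= cutoff)
    hits = sorted(w for w, c in counts.items() if c >= min_snps)
    merged = []
    for w in hits:
        start, end = w * window, (w + 1) * window
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], end)  # extend the previous interval
        else:
            merged.append((start, end))
    return merged
```

Requiring several outlier SNPs per window discards isolated extreme values, which are more likely to be noise than the clustered signal expected under a sweep.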
Overlap between the different approaches
Venn diagrams were drawn with the R package Vennerable to visualize potential overlap between the different gene sets under selection. To compare the distribution along chromosomes of candidate genes for recent positive selection (H-scan and Rsb candidate genes combined) in each population, we used linear models where the densities of selected genes along each chromosome in the eastern and western populations (100,000 bins per chromosome) were entered as variables. Finally, we used the function plotBed of the R package Sushi (Phanstiel, 2015) to visualize the density of genes under positive selection as a heat map along each chromosome. The R package ggplot2 (Wickham, 2009) was used to display the density of all the annotated genes in the genome along each chromosome as a line.
Summary statistics and coalescence characterization at candidate loci
Tajima’s D, FST and dXY were computed in the R package PopGenome (v2.2.3) over 5kb windows across the genome. We then compared values between windows overlapping with candidate regions and windows outside these regions. We extracted TMRCA and relative TMRCA halftime (RTH) from the output of ARGWeaver. Because the coalescence approach is window-based, summary statistics were averaged across windows including 300 non-recombining blocks (around 10kb windows). Significant windows were merged. We then compared TMRCA and RTH values between windows overlapping with candidate regions and those outside.
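For readers unfamiliar with dXY, the sketch below computes it for one window from alternate-allele frequencies at biallelic SNPs in each population: the probability that two alleles drawn from different populations differ is summed over SNPs and divided by the window length. This is a textbook formulation given for illustration; the exact estimator implemented in PopGenome may differ, and the function name and inputs are hypothetical.

```python
def dxy_window(freqs_x, freqs_y, window_length):
    """Average between-population pairwise differences per site in a window.

    freqs_x, freqs_y: alternate-allele frequencies at the same biallelic
    SNPs in populations X and Y; window_length: window size in bp.
    """
    differences = sum(px * (1 - py) + py * (1 - px)
                      for px, py in zip(freqs_x, freqs_y))
    return differences / window_length
```

A SNP fixed for different alleles in the two populations contributes one full difference, whereas a SNP at the same intermediate frequency in both populations still contributes because of shared polymorphism.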
GO annotation of genes under selection
For each test, we extracted the genes located in the filtered regions with bedtools (Quinlan & Hall, 2010). We then examined potential enrichment for biological processes for each of the selected gene sets with the R package GOstats (Falcon & Gentleman, 2017) using Fisher’s exact test. P-values were subsequently adjusted for multiple testing with a Benjamini-Hochberg correction. Gene sets were considered significantly enriched for a biological process when they displayed a P-value ≤ 0.01 and harbored at least five genes associated with the given process. The ancestor and child terms of each significant process were determined using QuickGO (http://www.ebi.ac.uk/QuickGO) and used to simplify Fisher test outputs and keep non-redundant terms.
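The enrichment rule can be illustrated with a one-sided Fisher's exact test written as a hypergeometric upper tail, followed by the two-part significance criterion. This is a sketch and not the GOstats implementation; the function names are hypothetical, and it assumes that the 0.01 threshold is applied to the adjusted P-value.

```python
from math import comb


def enrichment_p(n_genome, n_term, n_set, n_hits):
    """Probability of at least n_hits annotated genes in a set of n_set genes.

    n_genome: annotated genes in the genome; n_term: genes carrying the GO
    term; n_set: candidate genes; n_hits: candidate genes carrying the term.
    """
    upper = min(n_term, n_set)
    total = comb(n_genome, n_set)
    tail = sum(comb(n_term, k) * comb(n_genome - n_term, n_set - k)
               for k in range(n_hits, upper + 1))
    return tail / total


def is_enriched(p_adjusted, n_hits, alpha=0.01, min_genes=5):
    """Apply both criteria: adjusted P-value and minimum number of genes."""
    return p_adjusted <= alpha and n_hits >= min_genes
```

The minimum gene count guards against terms that reach significance only because they are rare in the genome and hit once or twice in a small candidate set.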
QTL for leaf-rust resistance validation
A QTL analysis performed in B. distachyon revealed three genomic regions involved in the resistance to P. brachypodii (Barbieri et al., 2012). The coordinates of these three QTL on v2.0 of the B. distachyon reference genome were extracted from Barbieri et al. (2012). We then assessed whether those three regions were identified as outliers in at least one of the tests of selection.
Availability of data and materials
The raw outputs of each GWSS will be archived upon acceptance of the manuscript. All whole-genome sequence data are available at the NCBI Sequence Read Archive (SRA; accessions available in Gordon et al., 2017).
Conflict of interest
There is no conflict of interest issue related to this work.
Supporting Information
Table S1: Geographical coordinates of the 44 accessions and sequencing effort.
Table S2: Significant GO terms in each 50 gene-window of the reference genome
Table S3: Genes under selection with functional annotation
Acknowledgements
We thank the Genetic Diversity Center-ETH Zurich for providing access to the Euler high-performance cluster. We would also like to thank Beat Keller’s group as well as Simon Krattinger and Mahendra Mariadassou for their advice during the elaboration of the study. This work is supported by the Swiss National Science Foundation (PZ00P3_154724). The work conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported under Contract No. DE-AC02-05CH11231.