Adaptive introgression and standing genetic variation, two facilitators of adaptation to high latitudes in European aspen (Populus tremula L.)

Understanding local adaptation in plants from a genomic perspective has become a key research area given the ongoing climate challenge and the concomitant requirement to conserve genetic resources. Perennial plants, such as forest trees, are good models to study local adaptation given their wide geographic distribution, largely outcrossing mating systems and demographic histories. We evaluated signatures of local adaptation in European aspen (Populus tremula) across Europe by means of whole genome re-sequencing of a collection of 411 individual trees. We dissected admixture patterns between aspen lineages and observed a strong genomic mosaicism in Scandinavian trees, evidencing different colonization trajectories into the peninsula from Russia, Central and Western Europe. As a consequence of the secondary contacts between populations after the last glacial maximum (LGM), we detected an adaptive introgression event in a genome region of ∼500kb in chromosome 10, harboring a large-effect locus that has previously been shown to contribute to adaptation to the short growing seasons characteristic of northern Scandinavia. Demographic simulations and ancestry inference suggest an Eastern origin - probably Russian - of the adaptive Nordic allele which nowadays is present in a homozygous state at the north of Scandinavia. The strength of introgression and positive selection signatures in this region is a unique feature in the genome. Furthermore, we detected signals of balancing selection, shared across regional populations, that highlight the importance of standing variation as a primary source of alleles that facilitate local adaptation. Our results therefore emphasize the importance of migration-selection balance underlying the genetic architecture of key adaptive quantitative traits.


Introduction 64
Local adaptation, the means by which populations of a species genetically adjust to 65 local environments, is a powerful process of evolution. It occurs because multiple 66 environmental factors imposing different selective pressures exist and the strength 67 of each factor varies across habitats. When a population colonizes a new habitat, 68 certain environmental conditions will impose higher selective pressures, while 69 natural selection may be relaxed for other environmental factors. The overall shift in 70 the selection landscape leads to local adaptation and consequent fitness trade-offs 71 [1]. Two fundamental genetic sources for local adaptation, particularly in temperate 72 forest trees that have large effective population size and low nucleotide substitution 73 rates per unit of time, are standing variation and intra/inter-species hybridization. 74 Hybridization occurs when reproductive isolation is not complete between species 75 or when species lineages that are separated geographically meet after a secondary 76 contact. Species capable of hybridization will therefore have access to a larger pool 77 of genetic variation that provides the raw material for selection and accelerated 78 adaptation. At the same time, standing variation can be maintained through 79 balancing selection (BS) within populations. The signatures of BS include increased 80 diversity around the target of selection, differentiation between populations 81 departing from the genome-wide average and increased linkage disequilibrium, 82 among others [2]. When selection varies geographically, it may favor locally adapted 83 alleles in the derived lineages or subpopulations, giving rise to genomic regions with 84 elevated FST and absolute divergence DXY [3,4]. there were no refugia for temperate trees north of 45°N in Europe, which means that 93 present day populations of temperate trees in northern Europe are essentially 94 young, in contrast to conifers and other boreal tree species [6]. This also implies 95 that temperate trees in northern Europe derive from southern populations, where 96 they survived in glacial refugia until temperature and moisture conditions allowed 97 for the recolonization of higher latitudes [7]. The distribution of glacial refugia across 98 the continent and the successive contact of isolated populations following range 99 expansions often led to hybridization events that played a major role in the evolution 100 of forest trees and other plants [8,9]. 101 102 European aspen, Populus tremula, is a dioecious temperate angiosperm tree with a 103 very extensive range across both the European and Asian continents. Its role in many 104 ecosystems is that of a primary colonizer for forests, growing quickly and being able to 105 cover large areas across both different latitudes and elevations [10]. Natural variation 106 of P. tremula along a north-south gradient on the Scandinavian peninsula has been 107 widely studied. GWAS and population differentiation analyses have identified a region 108 on chromosome 10 harboring the PtFT2 locus, a gene known to be involved in 109 controlling seasonal phenology [11], as a key contributor to climate adaptation at high 110 latitudes in Sweden. Other loci harboring genes related to senescence processes or 111 individuals included in the study (after the quality control steps that follow) and their 161 geographic origins are listed (Additional File2: Table S1).  HiSeq 2000 and HiSeq X) were used to cover the SwAsp collection, we observed 205 noisy signals at the population structure level, presumably as a result of the different 206 sequencing equipment and library preparation methods. By means of principal 207 component analyses (PCA) we identified differences in the grouping of the 208 individuals where differences among samples along PC1 and, to a lesser extent, PC2 209 were explained by the sequencing platform and not the geographic origin of samples 210 (Additional File1: Figure S1A; PC1 separates the SwAsp collection in two sets, each 211 corresponding to a different Illumina sequencing platform). To address these batch 212 effects, we removed SNPs associated with both components without losing the 213 geographic structure of the Swedish population. For this purpose, we generated a 214 genotype file with vcftools for each chromosome. We ran independent PCA analyses 215 on each SNP configuration (homozygous reference, heterozygous, homozygous 216 alternative) and realized it was at the heterozygous level where the batch effect was 217 present. Using the libraries "ggfortify", "factoextra" and "FactoMineR", we estimated 218 the contribution of each variant to the components affected by the platform effect. We 219 started by assuming a uniform contribution of each variant to each component, and 220 removed those that would deviate from this premise in both PC1 and PC2 using the 221 We kept only those variants that did not contribute more than expected under the 238 uniformity hypothesis (var_contrib <= T). While removing the batch effect using this 239 procedure, we observed loss of more than 4e 6 SNPs (Additional File1: Figure S1B). 240

241
In order to evaluate if we could remove the batch effect without compromising such a 242 large number of variants, we assigned a p-value to the contribution of each SNP to 243 the components using dimdesc (FactoMineR) at the chromosome level. We tested 244 different cut-offs (Additional File1: Figure S2) and observed that the batch effect was 245 removed while maintaining the geographic distribution of samples when using a p-246 value threshold of 0.05 for PC1 and 0.01 for PC2 (Additional File1: Figure S2B). With 247 these thresholds, ~1.8 e 6 SNPs were filtered out to remove the batch effects but 248 without compromising the overall population structure of the samples.

Gene flow 274
We combined several statistics developed to identify introgressed genomic regions. 275 These included the classic ABBA-BABA test in its developed Fd statistic form. For 276 these calculations we used the popgen pipeline developed by Simon Martin, 277 available at https://github.com/simonhmartin/genomics_general. We treated each 278 chromosome independently; the corresponding vcf file was converted to .geno format 279 using parseVCF.py. From these files, we calculated diversity and divergence values 280 for each subpopulation (DXY, FST and p) in non-overlapping 10 Kb windows, using the 281 script genomics.py (Additional File2: Table S2). Finally, we used the script 282 ABBABABAwindows.py to compute the four-taxon D statistic and f estimators in non- In order to avoid stochastic errors that could produce meaningless values, we only 293 considered windows with at least 100 biallelic SNPs. 294

295
We also calculated the basic distance fraction, Bdf (PopGenome; v2.7.1; [26], which 296 combines both fd and distances estimations and that is less sensitive to the time of 297 gene-flow. Just as in the previous analysis, we set a four-taxon tree hypothesis to 298 estimate the proportion of introgression from Russia into Northern Scandinavia. We 299 calculated Bdf in 10 Kb windows treating each chromosome independently. We 300 combined all 19 chromosome estimations and assigned a Z-score and p-value to 301 each Bdf value using genome-wide results and did an extra FDR correction. We 302 selected those regions that had an FDR<0.05. 303 304 Finally, we ran the Efficient local ancestry inference (ELAI) method [27] to confirm the 305 introgression event on chromosome 10. We used two upper-layer clusters and 10 306 lower-level clusters; 20 Expectation-Maximization steps and 100 generations of 307 admixture between the ancestral populations, that corresponded to the Russian and 308 Latvian aspen populations. The plotted allele dosages correspond to the average 309 over all the Swedish individuals from the northern population. We ran five 310 independent EM runs. 311 312

Positive and balancing selection 313
We scanned the genome of P. tremula for signals of positive selection using a newly 314 developed strategy called "integrated selection of alleles favoured by evolution", 315 iSAFE [28]. This method first calculates haplotype allele frequency scores based on 316 the presence of derived alleles in a particular haplotype, which is then used to 317 calculate SAFE scores. These SAFE scores are in turn calculated across a region 318 of given size in 50% overlapping windows of 300 bp to culminate in an iSAFE signal. 319 These statistics can be calculated for large regions of phased haplotypes, which we 320 obtained with BEAGLE v.4.1 [29], so chromosomes were divided into 3 Mb windows 321 for each iSAFE iteration. The iSAFE software can be set to run under a case-control 322 mode, with the case populations being, for example, the Northern Scandinavian 323 population, and the control population being all remaining individuals not used in the 324 case population. We ran this screening for Northern and Southern Scandinavia,325 Russia, and combining all the Nordic individuals from Norway, Sweden and Russia 326 together. As recommended by the authors, we considered iSAFE values significant 327 when they were >0.1. demographic models (Additional File1: Figure S3) for the Russian and Scandinavian 360 subpopulations using the demographic modelling workflow (dadi_pipeline) from [37]. 361 We tested different classes of models: simple models of divergence with and without 362 migration; simple models plus instantaneous size changes, ancient migration or 363 secondary contact, ancient migration plus instantaneous size change, and island 364 models of vicariance and founder events. In total, we tested 19 different demographic 365 scenarios using all polymorphic sites on chromosome 10. First, we ran the general 366 optimization routine (dadi_Run_Optimizations.py), which includes fitting the model 367 using particular settings for a given number of replicates, then using the parameters 368 from the best scoring replicate to seed a subsequent round of model fitting using 369 updated settings. We used four rounds, with 10, 20, 30 and 40 replicates. The     Latvian and Russian populations respectively, the region encompassing 4.5-4.9 Mb 504 has a predominant Russian ancestry ( Figure 3B; Additional file 1: Figure S8). 505 506 A.

B.
In terms of gene content, two relevant loci annotated as Heading-date 3A-like, or 507 Flowering Locus T (PtFT2) genes, are encoded in this introgressed genomic block: 508 Potra2n10c20839 (4772117 -4776457 bp) and Potra2n10c20842 (4789807 -509 4792846 bp) and the region surrounding these two genes is known to have been 510 subject to a selective sweep in the northern Swedish population [11]. implying that the sweep is shared among these populations derived from more 531 northern latitudes in the Eurasian continent. As mentioned before, this region is 532 centered around two loci annotated as Heading-date 3A-like, or PtFT2, in 533 agreement with previous analyses that have associated PtFT2 with climate 534 adaptation in P. tremula [11].  Figure S9). This suggests that another selective force may be 550 acting to maintain alleles at high frequencies, for instance balancing selection (BS). 551 In order to evaluate this possibility, we ran betascan [30] on all 19 chromosomes, 552 BS was close to 0.14 (t-test, p<0.01; Additional File2: Table S3), while p values also 564 increased in BS targeted regions compared to genome-wide average (Additional 565 File2: Table S3). Average FST values ded not differ between BS affected regions and 566 genome wide estimations, however. 567 568 Furthermore, we also observed significant bins that were unique to a specific 569 population and we therefore ran independent functional enrichment tests for 570 biological processes affected by balancing selection for each of these regions. 571 Interestingly, we found a few common terms with significant p-values (Fisher test, 572 p<0.05), such as ethylene metabolic and biosynthetic processes (pNordic=1,4e -3 ; 573 pCentralEurope=9.7e -4 ), jasmonic acid mediated signaling pathways 574 (pSouthWestScandinavia=1.8e -2 ; pCentralEurope =4.7e -3 ; pScotland=1.5e -2 ), and intramembrane 575 576 577 Figure 5. Balancing selection. Green dots highlight shared signals between subpopulations.

586
The expansion of P. tremula across the Eurasian continent provides a good scenario 587 to test for changes in population sizes across time. We evaluated population size 588 changes through time using the stairway plot method, which is based on the 589 composite likelihood of the SFS of the species and/or selected subpopulations. We 590 used 50 random samples from our collection to generate the overall behavior of the 591 Norway+Iceland Central Europe species, as well as for specific collection sites (Russia and Northern Scandinavia). 592 We observed the strongest reduction in Ne around ~700-800 ka this sharp decrease 593 in Ne was observed across all subpopulations, which means it was a bottleneck that 594 affected the entire species and that predated its dispersal in the Eurasian continent.   Our scans for positive selection using the iSAFE algorithm revealed a very strong 788 signal of a selective sweep located on chromosome 10. While we also observed 789 other regions across the aspen genome that seemly have experienced selective 790 sweeps, the calculated iSAFE values in these regions were generally too weak to 791 pass the significance threshold and did not display a clear geographic pattern, which 792 precluded them from being interpreted as selective sweeps involved in conferring 793 local adaptation. A natural conclusion from the lack of non-neutral outliers in our 794 analyses is that local adaptation in aspens is primarily polygenic and/or driven by 795 natural selection acting on standing variation rather than from new mutations that 796 would induce hard sweeps, as is the case for the hard sweep identified on 797 chromosome 10. 798 The role of standing variation in adaptive evolution has been widely debated. As 799 highlighted by [15], beneficial alleles that are present as standing variation are 800 generally older than new mutations, implying they could have undergone a selective 801 filter during past environmental conditions or in different parts of a species range. 802 These polymorphisms can be maintained for a long time due to balancing selection 803 that persists for many generations and minimizes the effect of drift. When we think of 804 temperate tree species, such as aspens, that have experienced several population 805 contraction and expansion cycles throughout the last millennia, such "pre-selected" 806 standing variation could represent a very useful pool for adaptive alleles that can 807 rapidly be brought together to mediate local adaptation to novel environmental 808 conditions encountered during range expansion. For instance, one climatic 809 consequence for plants during the LGM was water stress due to low CO2 810 concentrations and the presence of permafrost. Even if boreal trees can grow on 811 continuous permafrost, other factors such as soil texture or timing of the spring-812 summer thaw determine the amount of water available for growth, thus influencing 813 the species distribution [7]. 814 We evaluated the importance of standing variation for local adaptation in aspen by 815 scanning the genome for signals of balancing selection (BS). The combination of 816 significant beta-scores and the increased genetic diversity (p) and pairwise absolute 817 divergence (DXY) in several regions across multiple chromosomes, shared by at least 818 two aspen lineages, indicated that BS has maintained ancestral polymorphisms in 819 the species. Indeed, excess diversity around selected loci may be due to the 820 retention of ancestral polymorphisms or due to the accumulation of derived Our large panel of re-sequenced aspen individuals allowed us to unravel key 853 genomic aspects behind local adaptation across the Eurasian continent. It is clear 854 from our results that intra-species hybridization has played a major role 855 homogenizing the genomic background of the species and has promoted the 856 movement of adaptive alleles between populations. Of major relevance is the 857 observation of a recent adaptive introgression event between Nordic populations 858 around the Flowering locus T, that has facilitated the survival of aspens in high 859 latitudes. This event is however the only such event that can be detected in the 860 species, showing that the emergence of advantageous alleles and their propagation 861 is rather rare. Standing variation and its conservation through balancing selection 862 across lineages seems to be a more efficient way of keeping advantageous alleles 863 and maintaining high levels of diversity. This combination of evolutionary scenarios 864 suggests that aspens may have the capacity to adapt rapidly to new challenging B.