Abstract
Genetic variation is the fuel of evolution but analysing the spatio-temporal dynamics of genetic changes in natural populations is challenging, comprehensive sampling logistically difficult, and sequencing of entire populations costly. Here we address these issues by performing the first continent-wide genomic analysis of genetic variation in European Drosophila melanogaster, based on 48 pool-sequencing samples from 32 populations. Our analyses uncover a novel pattern of major longitudinal population structure; establish previously unknown clines in inversions and transposable elements across Europe; and provide evidence for non-local, continent-wide selective sweeps that are shared among the majority of populations. We also find pronounced variation among populations in the composition of the fly microbiome and identify five new DNA viruses adding to a single example known so far for this species. Our study has important implications for the evolution and demography of D. melanogaster, an ancestrally African species that first colonized Europe before becoming cosmopolitan.
Introduction
Studying the processes that create and maintain genetic variation in natural populations is fundamental to understanding the process of evolution (Dobzhansky 1970; Lewontin 1974; Kreitman 1983; Kimura 1984; Hudson et al. 1987; McDonald & Kreitman 1991; Adrian & Comeron 2013). Until recently, technological constraints have limited studies of natural genetic variation to small genomic regions and small numbers of individuals. With the development of population genomics, we can now analyse patterns of genetic variation on a genome-wide scale for large numbers of individuals, with sampling structured across space and time. As a result, we have gained fundamental new insights into evolutionary dynamics of genetic variation in natural populations (e.g., Hohenlohe et al. 2010; Cheng et al. 2012; Begun et al. 2007; Pool et al. 2012; Harpur et al. 2014; Zanini et al. 2015). Despite this recent technological progress, extensive large-scale sampling and genome sequencing of populations remains prohibitively expensive in terms of cost and labor for any individual research group.
Here, we present the first comprehensive, continent-wide genomic analysis of genetic variation in European Drosophila melanogaster, based on 48 pool-sequencing samples from 32 populations collected in 2014 (Figure 1) by the European Drosophila Population Genomics Consortium (DrosEU; https://droseu.net). D. melanogaster offers several advantages for studying the relevant spatio-temporal scales of evolution: a relatively small genome, a broad geographic range, a multivoltine life history that allows sampling across generations over short timescales, ease of sampling natural populations using standardized techniques, and a well-developed context for population genomic analysis (e.g., Powell 1997; Keller 2007; Hales et al. 2015). Importantly, this species is studied by an extensive research community, with a long history of developing shared resources (Larracuente & Roberts 2015; Bilder & Irvine 2017).
The current study complements and extends previous studies of genetic variation in D. melanogaster, both from its native range in sub-Saharan Africa and from its world-wide expansion as a human commensal into Europe 10,000–20,000 years ago and into North America and Australia in the last few centuries (e.g., Lachaise et al. 1988; David & Capy 1988; Li & Stephan 2006; Keller 2007; Sprengelmeyer et al. 2018; Arguello et al. 2019; also cf. Kapopoulou et al. 2018a). The colonization of novel habitats and climate zones on multiple continents makes D. melanogaster especially powerful for studying parallel local adaptation. Previous studies of genomic variation have uncovered latitudinal clines in allele frequencies (e.g., Schmidt & Paaby 2008; Turner et al. 2008; Kolaczkowski et al. 2011b; Fabian et al. 2012; Bergland et al. 2014; Machado et al. 2016; Kapun et al. 2016a), structural variants such as chromosomal inversions (reviewed in Kapun & Flatt 2019), transposable elements (TEs) (Boussy et al. 1998; González et al. 2008; 2010), and complex phenotypes (de Jong & Bochdanovits 2003; Schmidt & Paaby 2008; Schmidt et al. 2008; Kapun et al. 2016b; Behrman et al. 2018). Thus far, sampling across these latitudinal gradients has been restricted to single transects on the east coasts of Australia and North America; in addition to parallel local adaptation, clines on these continents may be due to admixture between cohorts of flies with different colonization histories (Caracristi & Schlötterer 2003; Yukilevich & True 2008a; b; Duchen et al. 2013; Kao et al. 2015; Bergland et al. 2016).
In contrast, the population genomics of D. melanogaster on the European continent remains largely uncharacterized (Božičević et al. 2016; Pool et al. 2016; Mateo et al. 2018). Because Eurasia was the first continent colonized by D. melanogaster as it migrated out of Africa, we sought to understand how this species has adapted to new habitats and climate zones in Europe, where it has been established the longest (Lachaise et al. 1988; David & Capy 1988). We analyse our data at three levels: (1) variation at single-nucleotide polymorphisms (SNPs) in nuclear and mitochondrial (mtDNA) genomes (∼5.5 × 10^6 SNPs in total); (2) structural variation, including TE insertions and chromosomal inversions; and (3) variation in the microbiota associated with flies, including bacteria, fungi, protists, and viruses.
Results
As part of the DrosEU consortium, we collected 48 population samples of D. melanogaster from 32 geographical locations across Europe in 2014 (Table 1; Figure 1). We performed pooled sequencing (Pool-Seq) of all 48 samples, with an average autosomal coverage ≥50x (Table S1). Of the 32 locations, 10 were sampled at least once in summer and once in fall (Figure 1), allowing a preliminary analysis of seasonal change in allele frequencies on a genome-wide scale.
European and other derived populations exhibit similar amounts of genetic variation
For each sample, we estimated genome-wide levels of nucleotide diversity (π and Watterson’s θ, corrected for pooling; Futschik 2010; Kofler et al. 2011). We find that most European populations have similar levels of genetic variation (Table S1). Moreover, our estimates of pairwise nucleotide diversity are similar to those from derived non-African (North American and Australian) populations, whether sequenced as individuals or as pools (Figure 2 and Table S2). Thus, although European populations are considerably older than North American and Australian populations, they exhibit similar levels of DNA sequence variability.
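For orientation, the two diversity statistics can be sketched for individually sequenced chromosomes as follows. This is a minimal illustration of the textbook estimators only: it omits the pooling corrections (Futschik 2010; Kofler et al. 2011) applied in this study, and the function names and toy input are hypothetical.

```python
def harmonic_number(n):
    """Return a_n = sum_{i=1}^{n-1} 1/i, Watterson's normalising constant."""
    return sum(1.0 / i for i in range(1, n))


def nucleotide_diversity(minor_allele_counts, n, seq_len):
    """Per-site pi for biallelic SNPs among n sampled chromosomes.

    minor_allele_counts: one count per segregating site (hypothetical input).
    seq_len: number of sites surveyed, including monomorphic ones.
    """
    total = 0.0
    for k in minor_allele_counts:
        p = k / n
        # Unbiased heterozygosity at one site: n/(n-1) * 2p(1-p)
        total += (n / (n - 1)) * 2.0 * p * (1.0 - p)
    return total / seq_len


def watterson_theta(num_segregating, n, seq_len):
    """Per-site Watterson's theta from the number of segregating sites."""
    return num_segregating / (harmonic_number(n) * seq_len)


# Toy example: 4 chromosomes, 10 sites, two SNPs with minor counts 1 and 2.
pi_hat = nucleotide_diversity([1, 2], n=4, seq_len=10)
theta_hat = watterson_theta(2, n=4, seq_len=10)
```

Under neutrality and constant population size both statistics estimate the same quantity, so a systematic difference between them is what Tajima’s D summarises.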
We next tested for associations between geographic variables and genome-wide average levels of genetic variation. We found that neither π nor θ was correlated with latitude or longitude, but both strongly decreased with altitude (Table 2). This contrasts with previous studies of flies collected from a broader range of altitudes, which found increased genetic diversity in high-elevation populations (Lian et al. 2018). Finally, we tested for a correlation between genome-wide variation and the season of collection, finding no relationship (Table 2). Together, these results suggest that there is little spatio-temporal variation among European populations in overall levels of sequence variability.
For all populations, the ratio of X-linked to autosomal variation (πX/πA) was well below the value of 0.75 expected under neutrality with equal sex ratios (ranging from 0.53 to 0.66, one-sample Wilcoxon rank test, p < 0.001). These estimates are broadly consistent with those from previous studies of European and other non-African populations (e.g. Andolfatto 2001; Kauer et al. 2002; Hutter et al. 2007; Betancourt et al. 2004; Mackay et al. 2012; Langley et al. 2012). Surprisingly, the πX/πA ratio increased significantly, albeit weakly, with latitude (Spearman’s ρ = 0.315, p = 0.0289). This observation is at odds with the predictions of a simple model of periodic bottlenecks leading to a lower X/A ratio in northern populations (Hutter et al. 2007; Pool & Nielsen 2007), but might be consistent with stronger selection or more male-biased sex-ratios in the south as compared to the north (Charlesworth 2001; Hutter et al. 2007).
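The neutral expectation of 0.75, and the direction in which a male-biased sex ratio shifts it, can be illustrated with the standard effective population size formulas for autosomes and the X chromosome. The sketch below assumes random mating and Poisson-distributed offspring number; the function name and the census numbers are hypothetical and are not estimates from our data.

```python
def expected_x_to_a_ratio(n_males, n_females):
    """Neutral expectation of Ne(X)/Ne(A) for given numbers of breeding adults."""
    # Autosomes: Ne = 4 * Nm * Nf / (Nm + Nf)
    ne_autosome = 4.0 * n_males * n_females / (n_males + n_females)
    # X chromosome: Ne = 9 * Nm * Nf / (4 * Nm + 2 * Nf)
    ne_x = 9.0 * n_males * n_females / (4.0 * n_males + 2.0 * n_females)
    return ne_x / ne_autosome


equal = expected_x_to_a_ratio(1000, 1000)        # equal sex ratio
male_biased = expected_x_to_a_ratio(2000, 1000)  # twice as many males as females
```

With equal numbers of males and females the ratio is 3/4; an excess of breeding males lowers it, which is why a more male-biased sex ratio in the south is one candidate explanation for the latitudinal trend.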
Genetic variation was heterogeneous across the genome, as has been previously reported (Begun & Aquadro 1992; Mackay et al. 2012; Langley et al. 2012; Huang et al. 2014). Both π and θ were markedly reduced close to centromeric and telomeric regions (Figure 3), and strongly positively correlated with recombination rate (linear regression against fine-scale recombination rate estimates from Comeron et al. (2012), p < 0.001; not accounting for autocorrelation; Table S3). Recombination rate explained 41–47% and 31–38% of the variation in π, for the autosomes and X chromosome, respectively. Using broad-scale recombination rate estimates (Fiston-Lavier et al. 2010) yielded a qualitatively similar, but slightly stronger correlation in autosomes and weaker in the X chromosome (Figure 3, Table S3, Figure 3 - figure supplement 1).
In contrast to π and θ, the European populations showed major differences in mean Tajima’s D (Table S1). Tajima’s D measures deviations from neutral expectations in allele frequencies, which can be due either to selection or complex demography, with negative D indicating an excess of low-frequency variants (Tajima 1983). Approximately half of the European samples have negative D. It is possible that this result is artefactual, caused by heterogeneity in the proportion of sequencing errors among multiplexed sequencing runs. However, this is unlikely, because including sequence run as a covariate in the statistical model did not improve its fit (Supplementary File 2; Table S4). In all of these analyses, we controlled for confounding effects of spatio-temporal autocorrelations between samples by accounting for similarity among spatial neighbors (Moran’s I ≈ 0, p > 0.05 for all tests). When comparing D in European samples with ancestral African populations from Zambia and Rwanda, the values were generally lower in the European populations, possibly due to the recent range and population size expansion (Figure 3 and Table S5). Similar to genetic diversity, D was also heterogeneous across the genome. Tajima’s D was broadly reduced in the vicinity of telomeric and centromeric regions, possibly reflecting purifying selection or selective sweeps whose effects extend further in these regions of reduced recombination close to heterochromatin.
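To make the sign convention concrete, the classical form of Tajima’s D for individually sequenced chromosomes can be sketched as below. This is an illustration only: the estimates reported here were obtained from Pool-Seq data, which requires corrections not shown, and the helper names and toy allele counts are hypothetical.

```python
import math


def tajimas_d(mean_pairwise_diff, num_segregating, n):
    """Tajima's D from the mean number of pairwise differences, S, and n chromosomes."""
    if num_segregating == 0:
        return float("nan")  # D is undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    s = num_segregating
    variance = e1 * s + e2 * s * (s - 1)
    # Numerator: difference between the pi-based and S-based estimates of theta
    return (mean_pairwise_diff - s / a1) / math.sqrt(variance)


def mean_pairwise_differences(minor_allele_counts, n):
    """Sum over sites of the expected differences between two sampled chromosomes."""
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_allele_counts)


# Toy data (hypothetical): 10 chromosomes, 5 SNPs.
rare = [1, 1, 1, 1, 1]    # all singletons: excess of low-frequency variants
common = [5, 5, 5, 5, 5]  # all at 50% frequency
d_rare = tajimas_d(mean_pairwise_differences(rare, 10), 5, 10)
d_common = tajimas_d(mean_pairwise_differences(common, 10), 5, 10)
```

In this toy case the singleton-only sample yields a negative D and the intermediate-frequency sample a positive D, matching the interpretation given above.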
Several genomic regions show signatures of continent-wide selective sweeps
Genomic regions that show localized reductions in Tajima’s D are attractive candidates for having undergone recent selective sweeps. To identify such genomic regions, we used Pool-hmm (Boitard et al. 2013; Table S6A), which – like Tajima’s D – identifies candidate sweep regions via distortions in the allele frequency spectrum. Several genomic regions identified in this way coincide with previously identified, well-supported sweeps in the proximity of Hen1 (Kolaczkowski et al. 2011b), Cyp6g1 (Daborn et al. 2002), wapl (Beisswanger et al. 2006), and around the chimeric gene CR18217 (Rogers & Hartl 2012), among others (Table S6B). These regions also showed local reductions in Tajima’s D and genetic variation, again consistent with selection (Figure 4 and Figure 4-figure supplement 1 and 2). The putative sweep regions included 145 of the 232 genes previously identified using Pool-hmm in an Austrian population (Boitard et al. 2012; Table S6C). Other regions identified have not previously been described as harboring sweeps; these represent potential novel targets of positive selection deserving of further investigation (Table S6A). Overall, we identified 64 genes that showed signatures of selection across all European populations analysed (Table S6D); thirty-five of them were located in regions with low Tajima’s D. This pattern suggests the existence of continent-wide sweeps that either predate the colonization of Europe (e.g., Beisswanger et al. 2006), or that have swept across the majority of European populations more recently (Table S6D). Finally, we classified the populations according to the Köppen-Geiger climate classification (Peel et al. 2007) and identified several candidate sweeps exclusive to arid, temperate or cold regions (Table S6A).
For temperate climates, candidate sweep regions were enriched for functions such as ‘response to stimulus’, ‘transport’, and ‘nervous system development’; for cold climates, they were enriched for ‘vitamin and co-factor metabolic processes’ (Table S6E). In contrast, we did not find any significant GO enrichment for arid candidate sweep regions. In summary, this dataset represents a rich genomic resource for future in-depth studies of selective sweeps and adaptation to different climates in Drosophila.
European populations are structured along an east-west gradient
We next investigated patterns of genetic differentiation due to demographic substructure. Overall, pairwise differentiation as measured by FST was relatively low, particularly for the autosomes (autosomal FST: 0.013–0.059; X-chromosome FST: 0.043–0.076; Mann-Whitney-U test; p < 0.001; Table S1). The slightly elevated FST for the X chromosome is expected given its smaller effective population size (Hutter et al. 2007). One population, from Sheffield (UK), was unusually differentiated from the others (Table S1) and was excluded from analyses of neutral genetic differentiation. Despite overall low levels of among-population differentiation, European populations showed evidence of geographic substructure. To analyse this pattern in detail, we focused on SNPs most likely to reflect neutral population structure: those at 4-fold degenerate sites, in regions outside those showing signatures of selective sweeps, in regions of high recombination (r > 3 cM/Mb; Comeron et al. 2012) and at least 1 Mb away from the breakpoints of common inversions.
The final filtered data set consisted of 8,727 SNPs. Within Europe, we found a weak but significant pattern of isolation by distance (IBD). That is, pairwise FST, though low overall, increased significantly with geographic distance (Mantel test; p < 0.001; r = 0.65, max. FST ∼ 0.05; Figure 5A and Figure 5A – figure supplement 1A).
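The logic of a Mantel test for isolation by distance can be sketched as a permutation test on two distance matrices. This is a minimal, self-contained illustration rather than the implementation used for the analysis above; the number of permutations, the random seed, the helper names and the toy matrices are all assumptions.

```python
import math
import random


def _upper_triangle(matrix, order):
    """Flatten the upper triangle of a square matrix after reordering rows/columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(genetic, geographic, permutations=999, seed=1):
    """One-sided Mantel test for a positive correlation between two
    distance matrices. Returns (r, permutation p-value)."""
    n = len(genetic)
    identity = list(range(n))
    geo = _upper_triangle(geographic, identity)
    r_obs = _pearson(_upper_triangle(genetic, identity), geo)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel populations in one matrix only
        if _pearson(_upper_triangle(genetic, order), geo) >= r_obs - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Toy example (hypothetical): six populations on a line, FST proportional to distance.
positions = [0, 1, 3, 6, 10, 15]
geo_dist = [[abs(a - b) for b in positions] for a in positions]
fst = [[0.003 * d for d in row] for row in geo_dist]
r, p = mantel_test(fst, geo_dist)
```

Permuting population labels, rather than individual matrix entries, preserves the non-independence of pairwise distances that share a population, which is the reason a permutation scheme is used instead of a standard correlation test.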
We investigated population substructure using principal components analysis (PCA) on allele frequencies from the same set of SNPs at 4-fold degenerate sites. The first three PC axes explained >25% of the total variance (PC1: 17.88%, PC2: 5.2%, PC3: 4.7%, eigenvalues = 410, 101, and 92, respectively), with PC1 strongly correlated with longitude and to a lesser extent with altitude (Table 2). This longitudinal stratification is expected under a simple model of IBD, as the continent extends further in longitude than latitude. As there was significant spatial autocorrelation between samples (as indicated by Moran’s test on residuals from linear regressions with PC1), we repeated the analysis with an explicit spatial error model; the association between PC1 and longitude remained significant. Like PC1, PC2 is correlated with longitude and altitude. PC3, by contrast, is not associated with any variable examined (Table 2). No major PC axes were correlated with season, indicating that there were no shared seasonal differences across samples in our data. However, based on linear regressions comparing summer and fall values of PC1 (adjusted R² = 0.98; p < 0.001), PC2 (R² = 0.79; p < 0.001) and PC3 (R² = 0.93; p < 0.001), we found very strong associations of genetic variation across seasons in the 10 locations that were sampled in summer and fall. This indicates a high degree of spatio-temporal stability in the levels of genetic variation.
Hierarchical model fitting based on the first three PC axes resulted in three distinct clusters (Figure 5B) separated along PC1, supporting the notion of strong longitudinal differentiation among European populations. Importantly, these results remain qualitatively unchanged when restricting the analysis to SNPs located in short introns (< 60 bp), which are also assumed to be relatively unaffected by selection (Figure 5 – figure supplement 1B; Haddrill et al. 2005; Singh et al. 2009; Parsch et al. 2010; Clemente & Vogl 2012; Lawrie et al. 2013).
Model-based spatial clustering showed qualitatively similar results, with populations separated mainly by longitude (Figure 5C; using ConStruct, with K=7 spatial layers chosen by a cross-validation-based model selection procedure). We could also infer levels of admixture among populations from this analysis; population samples from eastern and northwestern Europe showed low levels of admixture, while those from central Europe appeared locally well-mixed (Figure 5C).
In addition to restricted gene flow between geographic areas, local adaptation may explain population substructuring, even at neutral sites, if closely related populations tend to respond to similar selective pressures. We thus probed whether this spatial substructuring is associated with any of nineteen climatic variables, obtained from the WorldClim database (Hijmans et al. 2005). These climatic variables represent interpolated averages across more than 50 years of observation at the geographic coordinates corresponding to our sampling locations. Only two associations were significant after Bonferroni correction (adjusted α = 0.0026): between PC1 and ‘temperature seasonality’ (BioVar 4; Hijmans et al. 2005; R² = 0.62, p < 0.001; Figure 5 – figure supplement 1C) and between PC1 and ‘minimum temperature of the coldest month’ (R² = 0.3, p < 0.001; Figure 5 – figure supplement 1C). This suggests that the pronounced longitudinal differentiation along the European continent could at least partly be driven by the transition from oceanic to continental climate, leading to gradual changes in temperature seasonality and the severity of winter conditions which might impact demography, especially local survival. To the best of our knowledge, such strongly pronounced longitudinal structure and differentiation on a continent-wide scale has not yet been reported for D. melanogaster.
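The adjusted threshold quoted above is consistent with dividing a nominal α by the number of climatic variables tested. A minimal sketch of that arithmetic follows; the nominal α = 0.05 is an assumption inferred from the reported value, and the function names are hypothetical.

```python
def bonferroni_alpha(alpha, num_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / num_tests


def is_significant(p_value, alpha=0.05, num_tests=19):
    """True if p_value passes the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_alpha(alpha, num_tests)


threshold = bonferroni_alpha(0.05, 19)  # 19 WorldClim variables
```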
Mitochondrial haplotypes also exhibit longitudinal population structure
Our finding that European populations show strong longitudinal structure is also supported by an analysis of mitochondrial haplotypes. We identified two main mitochondrial haplotypes in Europe, separated by 41 mutations (G1.2 and G2.1; Figure 6A), with highly variable frequencies among populations (Figure 6B). Qualitatively, three types of European populations can be distinguished based on these haplotypes: (1) central European populations with a high frequency (> 60%) of the G1 haplotypes, (2) eastern European populations in summer, with a low frequency (< 40%) of G1 haplotypes, and (3) Iberian and eastern European populations in fall, with a combined frequency of G1 haplotypes between 40% and 60% (Figure 6 - figure supplement 1A). These results are consistent with analyses of mitochondrial haplotypes from a North American population (Cooper et al. 2015) as well as from worldwide samples (Wolff et al. 2016), which revealed a high level of haplotype diversity.
Mitochondrial haplotypes also showed shifts in the relative frequencies of the two haplotype classes between summer and fall, but only in 2 of 9 possible comparisons. While there was no correlation between latitude and the frequency of G1 haplotypes, we found a weak but significant negative correlation between G1 haplotypes and longitude (r² = 0.10; p < 0.05), consistent with the longitudinal east-west population structure observed for SNPs at 4-fold degenerate sites. In a subsequent analysis, we divided the dataset at 20° longitude into an eastern and a western subset because in northern Europe 20° longitude corresponds to the division of two major climatic zones, temperate and cold (Peel et al. 2007). This split revealed a clear correlation between longitude and the combined frequency of G1 haplotypes, explaining as much as 50% of the variation in the western group (Figure 6 - figure supplement 1B). Similarly, in eastern populations, longitude and the combined frequency of G1 haplotypes were correlated, explaining approximately 20% of the variance (Figure 6 - figure supplement 1B). Thus, these data on mitochondrial haplotypes clearly confirm the pronounced east-west structure and differentiation among European populations of D. melanogaster.
The frequency of polymorphic TEs varies with longitude and altitude
To examine the population genetics of structural variants, we first focused on transposable elements (TEs). The repetitive content of the 48 samples ranged from 16% to 21% with respect to nuclear genome size (Figure 7). The vast majority of detected repeats were TEs, mostly represented by long terminal repeats (LTR) and long interspersed nuclear elements (LINE), as well as a few DNA elements (Class II). LTRs best explained total TE content (LINE+LTR+DNA) (Pearson’s r = 0.87, p < 0.01, vs. DNA r = 0.58, p = 0.0117, and LINE r = 0.36, p < 0.01; Figure 7 - figure supplement 1A).
We next estimated population frequencies of 1,630 TE insertions annotated in the D. melanogaster reference genome v.6.04 using T-lex2 (Table S7, Fiston-Lavier et al. 2015). On average, 56% of the TEs annotated in the reference genome were fixed in all samples. The majority of the remaining polymorphic TEs segregated at low frequency in all samples (Figure 7 - figure supplement 1A), potentially due to the effect of purifying selection (González et al. 2008; Petrov et al. 2011; Kofler et al. 2012; Cridland et al. 2013; Blumenstiel et al. 2014). However, we also observed 142 TE insertions present at intermediate (>10% and <95%) frequencies, which might be consistent with transposition-selection balance (Figure 7 - figure supplement 1B; Charlesworth et al. 1994).
In each of the 48 samples, TE frequency and recombination rate were negatively correlated on a genome-wide level (Spearman rank correlation test; p < 0.01), as previously reported (Bartolomé et al. 2002; Petrov et al. 2011; Kofler et al. 2012). This pattern still held when only polymorphic TEs (population frequency <95%) were analysed, although it was not statistically significant for some chromosomes and populations (Table S8). In either case, the correlation was more negative when using broad-scale (Fiston-Lavier et al. 2010), rather than fine-scale (Comeron et al. 2012), recombination rate estimates, indicating that broad-scale recombination patterns may best capture long-term population recombination patterns (Materials and methods, Table S8).
We further tested whether the distribution of TE frequencies among samples could be explained by geographical or temporal variables. We focused on the 141 TE insertions that showed frequency variability among samples (interquartile range (IQR) > 10; see Materials and Methods) and were located in regions of non-zero recombination according to both fine-scale (Comeron et al. 2012) and broad-scale (Fiston-Lavier et al. 2010) estimates. Of these, 57 TEs showed significant associations with geographical or temporal variables after multiple testing correction (Table S9). We found significant correlations of 13 TEs with longitude, 13 with altitude, five with latitude, and three with season (Table S9). In addition, the frequencies of the other 23 insertions were significantly correlated with more than one of the above-mentioned variables. These TEs were scattered along the five main chromosome arms, with the majority located inside genes (42 out of 57; Table S9).
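The variability filter described above can be sketched as follows. This is an illustration under stated assumptions: frequencies are taken to be expressed in percent, so that the threshold of 10 corresponds to 10 percentage points, quartiles are computed by linear interpolation (other quantile definitions give slightly different values), and the TE identifiers and frequencies are hypothetical.

```python
import statistics


def interquartile_range(values):
    """IQR using linear interpolation between order statistics."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def variable_insertions(freqs_by_te, min_iqr=10.0):
    """Return TE identifiers whose frequency IQR across samples exceeds min_iqr.

    freqs_by_te maps a TE identifier to its frequencies (in %) across samples.
    """
    return [te for te, freqs in freqs_by_te.items()
            if interquartile_range(freqs) > min_iqr]


# Hypothetical identifiers and frequencies, for illustration only.
toy = {
    "TE_variable": [0, 10, 20, 30, 40],
    "TE_stable": [50, 51, 52, 53, 54],
}
kept = variable_insertions(toy)
```

Filtering on the IQR rather than the full range makes the selection robust to a single outlying sample, since one extreme frequency does not move the quartiles much.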
Two TE families were enriched in the 57 TE dataset: the LTR 297 family with 11 copies, and the DNA pogo family with five copies (χ² p-values after Yates’ correction < 0.05; Table S10). Interestingly, 14 of these 57 TEs coincide with previously identified adaptive candidate TEs, suggesting that our dataset might be enriched for adaptive insertions, several of which seem to exhibit spatial frequency clines (Table S9; Rech et al. 2019).
Inversions exhibit latitudinal and longitudinal clines in Europe
Another class of structural variants, chromosomal inversions, show spatial patterns in North American and Australian populations, potentially due to selection (reviewed in Kapun & Flatt 2019). In contrast to North America and Australia, inversion clines in Europe are poorly characterized (Lemeunier & Aulard 1992). Here, we examined the presence and frequency of six cosmopolitan inversions (In(2L)t, In(2R)NS, In(3L)P, In(3R)C, In(3R)Mo, In(3R)Payne) in our European samples, using a panel of inversion-specific marker SNPs (Kapun et al. 2014). All samples were polymorphic for one or more inversions (Figure 7). However, only In(2L)t segregated at substantial frequencies in most populations (average frequency = 20.2%); all other inversions were either absent or rare (average frequencies: In(2R)NS = 6.2%, In(3L)P = 4%, In(3R)C = 3.1%, In(3R)Mo = 2.2%, In(3R)Payne = 5.7%).
Despite their overall low frequencies, several inversions exhibited pronounced clinality (Table 3). In particular, we observed significant latitudinal clines for In(3L)P, In(3R)C and In(3R)Payne. Although they differed in overall frequencies, In(3L)P and In(3R)Payne showed latitudinal clines in Europe qualitatively similar to clines previously observed along the North American and Australian east coasts (Figure 7 - figure supplement 2 and Table S11, Kapun et al. 2016a), which, at least in the case of In(3R)Payne, are maintained by spatially varying selection (Kapun et al. 2016a,b; Durmaz et al. 2018; Anderson et al. 2005; Umina et al. 2005; Kennington et al. 2006; Rako et al. 2006).
We also detected – for the first time – longitudinal clines for In(2L)t and In(2R)NS, with both polymorphisms decreasing in frequency from east to west, a result consistent with the strong longitudinal population differentiation in Europe. In(2L)t also increased in frequency with altitude (Table 3). Except for In(3R)C, we did not find significant residual spatio-temporal autocorrelation among samples for any inversion tested (Moran’s I ≈ 0, p > 0.05 for all tests; Table 3), suggesting that our analysis was not confounded by spatial autocorrelation for most of the inversions. Further studies are necessary to determine the extent to which these clines of inversion frequencies in Europe are shaped by selection.
European Drosophila microbiomes contain Entomophthora, trypanosomatids and unknown DNA viruses
We examined the bacterial, fungal, protist, and viral microbiota associated with D. melanogaster using the Pool-Seq data. The microbiota can affect life history traits, immunity, hormonal physiology, and metabolic homeostasis of their fly hosts (e.g., Trinder et al. 2017; Martino et al. 2017).
We characterised the taxonomic origin of the non-Drosophila reads in our dataset using MG-RAST, which identifies and counts short protein motifs (‘features’) within reads (Meyer et al. 2008). We examined 262 million reads in total and of these most were assigned to Wolbachia (mean 53.7%; Figure 8), a well-known endosymbiont of Drosophila (Werren et al. 2008). The abundance of Wolbachia protein features relative to other microbial protein features (relative abundance) varied strongly between samples, ranging from 8.8% in a sample from the UK to almost 100% in samples from Spain, Portugal, Turkey and Russia (Table S12). Similarly, Wolbachia loads varied 100-fold between samples, as estimated from the ratio of Wolbachia protein features to Drosophila protein features (Table S12).
Acetic acid bacteria of the genera Gluconobacter, Gluconacetobacter, and Acetobacter were the second largest group, with an average relative abundance of 34.4% among microbial protein features. Furthermore, we found evidence for the presence of several genera of Enterobacteria (Serratia, Yersinia, Klebsiella, Pantoea, Escherichia, Enterobacter, Salmonella, and Pectobacterium). Serratia occurs only at low frequencies or is absent from most of our samples, but reaches a very high relative abundance among microbial protein features in the Nicosia (Cyprus) summer collection (54.5%). This high relative abundance was accompanied by an 80x increase in Serratia bacterial load.
We also detected several eukaryotic microorganisms, although they were less abundant than the bacteria. The fraction of fungal protein features, for example, is larger than 3% in only three samples (from Finland, Austria and Turkey; Table S12). Among the eukaryotic microbiota, we found trypanosomatids in 16 samples. Trypanosomatids have previously been reported to be associated with Drosophila (Wilfert et al. 2011; Chandler & James 2013; Hamilton et al. 2015), and our data confirm this association in the first systematic survey across a wide geographic range in D. melanogaster. We also found the fungal pathogen Entomophthora muscae (Elya et al. 2018) in 14 samples. Somewhat surprisingly, we found few yeast sequences. Yeasts are commonly found on rotting fruit, the main food substrate of D. melanogaster, and have been found in association with Drosophila before (Barata et al. 2012; Chandler et al. 2012). This result suggests that, although yeasts can attract flies and play a role in food choice (Becher et al. 2012; Buser et al. 2014), they might not be highly prevalent in or on D. melanogaster bodies but are rather actively digested and thus not part of the microbiome.
Our data also allowed us to identify DNA viruses. Only one DNA virus has been previously described for D. melanogaster (Kallithea virus; Webster et al. 2015; Palmer et al. 2018) and only two more from other drosophilid species (Drosophila innubila Nudivirus [Unckless 2011], Invertebrate Iridovirus 31 in D. obscura and D. immigrans [Webster et al. 2016]).
Here, we found six different DNA viruses, five of which are new (Table S13). Approximately two million reads came from Kallithea nudivirus (Webster et al. 2015), allowing us to assemble the first complete Kallithea genome (>300-fold coverage in the Ukrainian sample UA_Kha_14_46; Genbank accession KX130344). We also identified around 1,000 reads from a novel nudivirus closely related to both Kallithea virus and to Drosophila innubila nudivirus (Unckless 2011) in sample DK_Kar_14_41 from Karensminde, Denmark (Table S13). As the reads from this virus in our data set were insufficient to assemble the genome, we identified a publicly available dataset (SRR3939042: 27 male D. melanogaster from Esparto, California; Machado et al. 2016) with sufficient reads to complete the genome (provisionally named “Esparto Virus”; KY608910).
We further identified two novel Densoviruses (Parvoviridae). The first is a relative of Culex pipiens densovirus, provisionally named “Viltain virus”, found at 94-fold coverage in sample FR_Vil_14_07 (Viltain; KX648535). The second is “Linvill Road virus”, a relative of Dendrolimus punctatus densovirus, represented by only 300 reads here, but with high coverage in dataset SRR2396966 from a North American sample of D. simulans (KX648536; Machado et al. 2016). In addition, we detected a novel member of the Bidnaviridae family, “Vesanto virus”, a bidensovirus related to Bombyx mori densovirus 3 with approximately 900-fold coverage in sample FI_Ves_14_38 (Vesanto; KX648533 and KX648534). Finally, in one sample (UA_Yal_14_16) we detected a substantial number of reads from an Entomopox-like virus, which we were unable to fully assemble (Table S13). Using a detection threshold of >0.1% of the Drosophila genome copy number, the most commonly detected viruses were Kallithea virus (30/48 of the pools) and Vesanto virus (25/48), followed by Linvill Road virus (7/48) and Viltain virus (5/48), with Esparto virus being the rarest (2/48).
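One way to express such a detection threshold is to compare length-normalised read counts for the virus and the fly genome. The sketch below is an assumption about how relative copy number might be computed, not a description of the exact procedure used here; the function names, genome lengths and read counts are hypothetical.

```python
def relative_copy_number(virus_reads, virus_genome_len, fly_reads, fly_genome_len):
    """Viral genome copies per Drosophila genome copy, from length-normalised read counts."""
    virus_depth = virus_reads / virus_genome_len
    fly_depth = fly_reads / fly_genome_len
    return virus_depth / fly_depth


def is_detected(copy_number, threshold=0.001):
    """Apply the >0.1% detection threshold to a relative copy number."""
    return copy_number > threshold


# Hypothetical numbers: a 150 kb virus and a 140 Mb fly genome.
ratio = relative_copy_number(virus_reads=3_000, virus_genome_len=150_000,
                             fly_reads=70_000_000, fly_genome_len=140_000_000)
```

Normalising by genome length matters because viral genomes are several orders of magnitude smaller than the fly genome, so raw read counts alone would understate viral copy number.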
Discussion
In recent years, large-scale population re-sequencing projects have produced major insights into the biology of both model (Mackay et al. 2012; Langley et al. 2012; Auton et al. 2015; Lack et al. 2015; Alonso-Blanco et al. 2016; Lack et al. 2016) and non-model organisms (e.g., Hohenlohe et al. 2010; Wolf et al. 2010). In particular, such massive datasets contribute greatly to our growing understanding of the processes that create and maintain genetic variation in natural populations. However, the relevant spatio-temporal scales for population genomic analyses remain largely unknown (e.g., Guirao-Rico and González 2019). Here we have applied – for the first time – a continent-wide sampling and sequencing strategy to European populations of D. melanogaster (Figure 1), allowing us to uncover previously unknown aspects of this species’ population biology and evolutionary genetics. This is particularly important because the population genomics of this species in Europe has been poorly characterized to date.
We find that European D. melanogaster populations exhibit pronounced longitudinal differentiation. We observed this pattern for a genome-wide set of SNPs at 4-fold degenerate sites, which presumably evolve neutrally (Figure 5), as well as for mitochondrial haplotypes, inversions and TEs, which might be subject to spatially varying selection (Figure 6 and 7). Longitudinal differentiation might be due to the transition from oceanic to continental climate along the longitudinal axis (Figure 5-Figure 5 supplement 1). While spatial differences in climatic conditions likely play a major role in driving this pattern, we note that it is remarkably similar to that observed for human populations (e.g., Cavalli-Sforza 1966; Xiao et al. 2004; Francalacci & Sanna 2008; Novembre et al. 2008). Indeed, east-west structure has been previously found in sub-Saharan African populations of D. melanogaster, with the split between eastern and western African populations having occurred ∼70 kya (Michalakis & Veuille 1996; Aulard et al. 2002; Kapopoulou et al. 2018b), a period that, interestingly, coincides with a wave of human migration from eastern into western Africa (Nielsen et al. 2017). However, in contrast to the pronounced pattern observed in Europe, African east-west structure is relatively weak, explaining only ∼2.7% of variation, and is due to an inversion whose frequency varies longitudinally. Our demographic analyses, by contrast, are based on SNPs located >1 Mb from the breakpoints of the most common inversions. This makes it very unlikely that the strong longitudinal pattern we have observed is driven by inversions.
Spatial patterns of differentiation were stronger for longitude than for latitude. In contrast, differentiation in North America has mainly been observed across latitude, for both neutral and adaptive polymorphisms (e.g., Machado et al. 2016; Kapun et al. 2016a; reviewed in Adrion et al. 2015). Although our present analysis showed that putatively neutral SNPs were primarily differentiated along longitude, latitudinal clines may still exist for adaptive polymorphisms. In fact, we detected latitudinal frequency clines for both inversions and TEs (Table 3 and Table S9). For the inversions In(3L)P and In(3R)Payne, the observed latitudinal clines were in qualitative agreement with parallel clines reported from North America and Australia, with the inversions decreasing in frequency as distance from the equator increases (Mettler et al. 1977; Knibb et al. 1981; Lemeunier & Aulard 1992; Fabian et al. 2012; Kapun et al. 2014; Rane et al. 2015; Kapun et al. 2016a). This pattern is widely thought to be a result of climate adaptation, with the inversions containing variants that make them better adapted to tropical or subtropical than to temperate, more seasonal climates (e.g., Kapun et al. 2016a). Several euchromatic TE insertions also showed geographic (or seasonal) patterns of variation (Table S9), indicating that they might play a role in local adaptation, particularly since many of them are located in regions where they might affect gene regulation. Further, 17 of them also show significant correlations with either geographical or temporal variables in North American populations (Lerat et al. 2019). Additionally, several inversions and TEs also exhibited longitudinal gradients.
We also examined signatures of selective sweeps in our data. Several of the identified regions have previously been reported as potential targets of positive selection (Figure 4, Table S6B and S6C). However, most of these sweeps were originally identified by analysing a small number of populations (e.g., Kolaczkowski et al. 2011b; Daborn et al. 2002; Rogers & Hartl 2012). Here, we identified 64 genes (including wapl, CR18217, and mgl) which showed clear signatures of selection and which were widespread across Europe, thus strengthening the case for their adaptive significance. In addition, we found several regions with evidence of hard sweeps, some of them showing evidence of local climatic adaptation (Table S6); these candidate regions represent a valuable resource for future analyses of adaptation in European Drosophila.
Finally, our continent-wide analysis of the microbiota suggests that natural populations of European D. melanogaster vary greatly in the composition and abundance of microbes and viruses over space and time. Recent work suggests that at least part of this variation in microbiomes follows geographic patterns (Walters et al. 2018; Wang et al. 2019) and contributes to phenotypic differences and local adaptation among populations, especially given that there might be tight and presumably local co-evolutionary interactions between fly hosts and their endosymbionts (e.g., Haselkorn et al. 2009; Richardson et al. 2012; Staubach et al. 2013; Kriesner et al. 2016; Wang and Staubach 2018). Most notably, we discovered five new DNA viruses of D. melanogaster. Although this species is known to host a wide diversity of RNA viruses, we have now found that DNA viruses of D. melanogaster are also widespread, with Kallithea virus, for instance, detected in most populations.
Our study demonstrates that sampling on a continent-wide scale and pooled sequencing of a large number of natural populations can reveal fundamental and novel aspects of population biology, even for a well-studied model species such as D. melanogaster. Our extensive sampling was feasible only due to synergistic collaboration among many research groups. Our efforts in Europe are paralleled in North America by the Dros-RTEC consortium, with whom we are collaborating to compare population genomic data across continents. Together, we have sampled both continents annually since 2014; we aim to continue to sample and sequence European and North American Drosophila populations with increasing spatio-temporal resolution in future years. With these efforts we hope to provide a rich community resource for biologists interested in molecular population genetics and adaptation genomics.
Materials and methods
The 2014 DrosEU dataset represents the most comprehensive spatio-temporal sampling of European D. melanogaster populations to date (Table 1). It comprises 48 samples of D. melanogaster collected from 32 geographical locations across Europe at different time points in 2014 through a joint effort of 18 research groups. Collections were mostly performed with baited traps using a standardized protocol (see Supplementary File 2). From each collection, we pooled 33–40 wild-caught males. We used males as they are more easily distinguishable morphologically from similar species than females. Despite our precautions, we identified a low level of D. simulans contamination in our sequences; we computationally filtered these sequences from the data prior to further analysis (see below).
DNA extraction, library preparation and sequencing
We extracted DNA from each sample after homogenization with bead beating and standard phenol/chloroform extraction. A detailed extraction protocol can be found in Supplementary File 2. In preparation for sequencing, 500 ng of DNA from each sample was sheared with a Covaris instrument (Duty cycle 10, intensity 5, cycles/burst 200, time 30). Library preparation was performed using NEBNext Ultra DNA Lib Prep-24 and NEBNext Multiplex Oligos for Illumina-24 following the manufacturer’s instructions. Each sample was sequenced as a pool (Pool-Seq; Schlötterer et al. 2014), as paired-end fragments on an Illumina NextSeq 500 sequencer at the Genomics Core Facility of Pompeu Fabra University. Samples were multiplexed in 5 batches of 10 samples, except for one batch of 8 samples (Table S1). Each multiplexed batch was sequenced on 4 lanes at ∼50x raw coverage per sample. The read length was 151 bp, with a median insert size of 348 bp (range 209-454 bp). The data are available from NCBI Bioproject PRJNA388788.
Mapping pipeline and variant calling
Prior to mapping, we trimmed and filtered raw FASTQ reads to remove low-quality bases (minimum base PHRED quality = 18; minimum sequence length = 75 bp) and sequencing adaptors using cutadapt (v. 1.8.3; Martin 2011). We retained only pairs for which both reads fulfilled our quality criteria after trimming. FastQC analyses of trimmed and quality-filtered reads showed overall high base qualities (median range 29-35), with ∼1.36% of bases lost after trimming. We used bwa mem (v. 0.7.15; Li 2013) with default parameters to map the trimmed reads. To avoid paralogous mapping, we mapped to a compound reference, consisting of the genomes of D. melanogaster (v.6.12) and common commensals and pathogens, including Saccharomyces cerevisiae (GCF_000146045.2), Wolbachia pipientis (NC_002978.6), Pseudomonas entomophila (NC_008027.1), Commensalibacter intestini (NZ_AGFR00000000.1), Acetobacter pomorum (NZ_AEUP00000000.1), Gluconobacter morbifer (NZ_AGQV00000000.1), Providencia burhodogranariea (NZ_AKKL00000000.1), Providencia alcalifaciens (NZ_AKKM01000049.1), Providencia rettgeri (NZ_AJSB00000000.1), Enterococcus faecalis (NC_004668.1), Lactobacillus brevis (NC_008497.1), and Lactobacillus plantarum (NC_004567.2). We used Picard (v.1.109; http://picard.sourceforge.net) to remove duplicate reads and reads with a mapping quality below 20. In addition, we re-aligned sequences flanking indels with GATK (v3.4-46; McKenna et al. 2010).
After mapping, we removed reads originating from D. simulans contamination, using the method of Bastide et al. (2013). To do this, we used fixed differences between D. simulans and D. melanogaster to identify reads from D. simulans. For the nine samples that had a contamination level > 1% (range 1.2 - 8.7%; Table S1), we used custom software to remove reads that mapped preferentially to the D. simulans genome (Hu et al. 2013) using competitive mapping to references from both species. After applying our decontamination pipeline, contamination levels dropped below 0.4% for all nine samples.
We used Qualimap (v. 2.2., Okonechnikov et al. 2016) to evaluate average mapping qualities per population and chromosome, which ranged from 58.3 to 58.8 (Table S1). Sequencing depth ranged from 34x to 115x for autosomes and from 17x to 59x for X-chromosomes (Table S1). We then combined individual bam files from all samples into a single mpileup file using samtools (v. 1.3; Li & Durbin 2009). Due to the large number of samples, we implemented quality control criteria for all libraries jointly to call SNPs. To call SNPs, we developed custom software (PoolSNP; see Supplementary File 2; available at doi: https://doi.org/10.5061/dryad.rj1gn54) using stringent heuristic parameters: (1) minimum coverage 10x for each sample, (2) maximum coverage < 95th coverage percentile for a given chromosome and sample (to avoid paralogous regions duplicated in the sample but not in the reference), (3) for each allele, a minimum read count > 20x and a minimum read frequency > 0.001, across all samples pooled. These parameters were optimized based on simulated Pool-Seq data to maximize true positives and minimize false positives (Supplementary File 2). We also excluded SNPs (1) for which more than 20% of all samples did not fulfil the above-mentioned coverage thresholds, (2) which were located within 5 bp of an indel with a minimum count larger than 10x in all samples pooled, and (3) which were located within known TEs based on the D. melanogaster TE library v.6.10. We annotated our final set of SNPs with SNPeff (v.4.2; Cingolani et al. 2012) using the Ensembl genome annotation version BDGP6.82.
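The heuristic thresholds listed above can be illustrated with a minimal Python sketch. This is not the PoolSNP implementation: the function and argument names are hypothetical, and the per-sample upper coverage cutoffs are assumed to have been computed beforehand as the 95th coverage percentile for the chromosome in question.

```python
def passes_coverage(depths, max_depths, min_cov=10, max_missing=0.2):
    """Return True if at most `max_missing` of the samples fail the coverage criteria.

    depths: per-sample read depth at the site.
    max_depths: per-sample upper cutoff (assumed precomputed 95th percentile).
    """
    failed = sum(1 for d, hi in zip(depths, max_depths) if d < min_cov or d >= hi)
    return failed / len(depths) <= max_missing


def called_alleles(allele_counts, min_count=20, min_freq=0.001):
    """Return alleles whose count and frequency, pooled over samples, exceed the thresholds.

    allele_counts: one dict per sample, mapping allele -> read count.
    """
    pooled = {}
    for sample in allele_counts:
        for allele, n in sample.items():
            pooled[allele] = pooled.get(allele, 0) + n
    total = sum(pooled.values())
    return sorted(a for a, n in pooled.items()
                  if n > min_count and n / total > min_freq)


def is_snp(depths, max_depths, allele_counts):
    """A site is a SNP candidate if coverage passes and at least two alleles are called."""
    return (passes_coverage(depths, max_depths)
            and len(called_alleles(allele_counts)) >= 2)
```

The indel-proximity and TE filters described above would be applied as additional masks and are omitted from this sketch.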
Additional samples
We obtained genome sequences from African flies from the Drosophila Genome Nexus (DGN; http://www.johnpool.net/genomes.html; see Table S5 for SRA accession numbers). We used data from 14 individuals from Rwanda and 40 from Siavonga (Zambia). We mapped these data as described above and built consensus sequences for each haploid sample by only considering alleles with > 0.9 allele frequencies. We converted consensus sequences to VCF and used VCFtools (Danecek et al. 2011) for downstream analyses.
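The consensus rule for haploid samples can be sketched as follows. The function name is hypothetical, and masking sites without a sufficiently frequent allele as "N" is an assumption of this sketch, not a detail stated above.

```python
def consensus_base(counts, min_freq=0.9):
    """Return the major allele if its frequency exceeds `min_freq`, otherwise 'N'.

    counts: dict mapping allele -> read count at one site of a haploid sample.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"  # no coverage: mask the site (assumed behaviour)
    allele, n = max(counts.items(), key=lambda kv: kv[1])
    return allele if n / total > min_freq else "N"
```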
Genetic variation in Europe
We characterized patterns of genetic variation among the 48 samples for the five major chromosomal arms (X, 2L, 2R, 3L, 3R) by estimating π, Watterson’s θ and Tajima’s D (Watterson 1975; Nei 1987; Tajima 1989), using corrections for Pool-Seq data (Kofler et al. 2011). To perform these analyses for our set of SNPs, we re-implemented the methods of Kofler et al. (2011) in Python (PoolGen; doi: https://doi.org/10.5061/dryad.rj1gn54). To calculate unbiased window-wise estimates of parameters, we used an output file of our SNP calling pipeline (PoolSNP; doi: https://doi.org/10.5061/dryad.rj1gn54), which indicates for any given site in the reference whether it passed the filtering parameters used for SNP calling. These data allow for the calculation of the effective window-size, which is the difference between the total window-size and the number of sites that did not pass the quality criteria. Using effective window-sizes as the denominator for the calculation of window-wise averages yields unbiased average estimates. In contrast, dividing the summed statistics in a given window by the total window-size, which is common practice in most software tools, results in an underestimation of averaged parameters. Before calculating the estimators, we subsampled the data to an even coverage of 40x for autosomes and 20x for the X-chromosome, as Watterson’s θ and Tajima’s D are sensitive to coverage variation (Korneliussen et al. 2013). We calculated chromosome-wide averages of π, θ and Tajima’s D for autosomes and X chromosomes using R (R Development Core Team 2009). We tested for correlations between these estimators and latitude, longitude, altitude, and season using a linear regression model: yi = Lat + Lon + Alt + Season + εi, where yi represents π, θ or D.
We used Lat, Lon and Alt as continuous predictors (Table 1) and Season as a categorical factor with two levels, corresponding to collection dates before and after 1st September (‘summer’ and ‘fall’), respectively, following Bergland et al. (2014) and Kapun et al. (2016a). To test for residual spatio-temporal autocorrelation among the samples (Kühn & Dormann 2012), we calculated Moran’s I (Moran 1950) with the R package spdep (v.06-15., Bivand & Piras 2015) for the residuals of the above models. For this analysis, we considered samples within 10° latitude / longitude to be neighbours, based on the pairwise geographical distances between collection locations. Whenever these tests revealed significant autocorrelations indicating non-independence, we repeated the above regressions using a spatial weights matrix based on nearest neighbours as described above to test for remaining spatial patterning in residuals as implemented in spdep. We also fitted models with run ID as a random factor using the R package lme4 (v.1.1-14; see Supplementary File 2) to test for confounding effects of variation in error rates among sequencing runs. As these models did not fit significantly better than simpler models, we excluded run ID from the final analysis (see Supplementary File 2 and Table S3).
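To make the autocorrelation statistic concrete, the following Python sketch computes Moran’s I for a set of model residuals using binary neighbour weights. It is a simplified illustration and not the spdep implementation used in the analysis; the function name is hypothetical, and treating two samples as neighbours when both their latitude and longitude differ by at most 10° is one assumed reading of the neighbourhood rule.

```python
def morans_i(values, coords, max_deg=10.0):
    """Moran's I with binary (0/1) neighbour weights.

    values: residuals, one per sample.
    coords: (latitude, longitude) per sample, in degrees.
    Returns None if there are no neighbour pairs or the values have no variance.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weight_sum = 0.0
    cross = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dlat = abs(coords[i][0] - coords[j][0])
            dlon = abs(coords[i][1] - coords[j][1])
            if dlat <= max_deg and dlon <= max_deg:  # assumed neighbour rule
                weight_sum += 1.0
                cross += dev[i] * dev[j]
    denom = sum(d * d for d in dev)
    if weight_sum == 0 or denom == 0:
        return None
    return (n / weight_sum) * (cross / denom)
```

Values near 0 indicate little spatial autocorrelation in the residuals, whereas values approaching 1 or -1 indicate that neighbouring samples have similar or dissimilar residuals, respectively.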
To investigate genome-wide patterns of variation, we averaged π, θ, and D in 200 kb non-overlapping windows for each sample and chromosomal arm separately and plotted the distributions in R. In addition, to investigate fine-scale deviations from neutral expectations, we also calculated Tajima’s D in 50 kb sliding windows with a step size of 10 kb. We normalized diversity statistics using log-transformation and tested for correlations between π and recombination rate for 100 kb non-overlapping windows in R and plotted these data using the ggplot2 (v.2.2.1., Wickham 2016). We used both fine-scale (Comeron et al. 2012) and broad-scale (Fiston-Lavier et al. 2010) estimates of recombination rate, after converting their coordinates to reference genome v 6.
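The window-wise averaging with effective window-sizes can be sketched in Python as follows. This is an illustrative simplification rather than the PoolGen code: the function name and input layout are hypothetical, with per-site statistics supplied as a dictionary and the filter mask as a list with one entry per reference position.

```python
def window_averages(site_values, passed, window_size):
    """Average a per-site statistic in non-overlapping windows.

    site_values: dict mapping 0-based position -> per-site value (e.g. pi at a SNP).
    passed: list of booleans, one per reference position; True if the site
        passed the SNP-calling filters.
    Returns (start, effective_size, mean) tuples; mean is None when no site
    in the window passed the filters.
    """
    out = []
    for start in range(0, len(passed), window_size):
        end = min(start + window_size, len(passed))
        effective = sum(passed[start:end])  # number of sites that passed
        total = sum(v for pos, v in site_values.items()
                    if start <= pos < end and passed[pos])
        out.append((start, effective, total / effective if effective else None))
    return out
```

In a 10 bp window where only 5 sites pass the filters and the per-site values sum to 1.0, this returns a mean of 0.2, whereas dividing by the total window-size would give 0.1.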
To identify regions under selection, we used Pool-hmm to calculate the SFS (Site Frequency Spectrum) for each sample in the pileup format file with the following parameters: --prefix (to assign a name to each sample), -n (number of chromosomes), --only-spectrum (for the SFS calculation), --theta 0.005 (default), and -r 100 (subsampling of 1/100 SNPs). We then split the pileups by chromosome and ran Pool-hmm with the following parameters: --prefix, -n, -k (per site transition probability between hidden states), -s (frequency spectrum file from previous step) and -e sanger (Phred+33 quality encoding). For the 18 samples for which Tajima’s D was very low, Pool-hmm identified the majority of the genome to be under selection; we thus removed those samples from our analysis. We used three different k parameters depending on the sample: k=1e-10, k=1e-30, and k=1e-40 (Table S6A). For windows with significantly low Tajima’s D in euchromatic regions, we identified genes using bedtools intersect (v2.27.1) and the D. melanogaster v6.12 annotation file from Flybase (Thurmond et al. 2019). For genes significant in all populations, we checked whether average Tajima’s D was among the lowest 10% per chromosome. We tested for enrichment of involvement in particular biological processes using DAVID with default parameters (Huang et al. 2009).
Genetic differentiation and population structure in European populations
To estimate genome-wide pairwise genetic differences, we used custom software to estimate SNP-wise FST using the approach of Weir and Cockerham (1984) for all pairwise combinations of samples. For each sample, we averaged pairwise FST between that sample and the other 47 samples and ranked the 48 population samples by overall differentiation.
We inferred demographic patterns by focusing on putatively neutrally evolving SNPs. For this, we used either 4-fold degenerate sites (defined using the genome sequences and the annotation features of the D. melanogaster reference genome version 6.12) or short introns (<60 bp; Haddrill et al. 2005; Singh et al. 2009; Parsch et al. 2010; Clemente & Vogl 2012; Lawrie et al. 2013). We also restricted our analyses to SNPs that were at least 1 Mb distant from major chromosomal inversions (see below) and those located in genomic regions with high recombination rates (r > 3cM/Mb; Comeron et al. 2012) to minimize the effects of linkage, which may confound analyses of neutral evolution. As the Sheffield (UK) population showed unusually high differentiation from other populations, we repeated the following analyses without the Sheffield sample. To assess isolation by distance (IBD), we averaged pairwise FST values across all neutral markers. We calculated geographic distance using the haversine formula (Green & Smart 1985), which takes the spherical curvature of the planet into account. We tested for correlations between linearized genetic differentiation (Slatkin’s distance: FST/([1-FST]) and log10-scaled geographic distance (Slatkin 1985) using Mantel tests implemented in ade4 (v.1.7-8., Dray & Dufour 2007) with 1,000,000 iterations. In addition, we plotted the 5% smallest and largest FST values from all 1,128 pairwise comparisons among the 48 population samples onto a map to visualize geographic patterns of genetic differentiation.
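The two quantities correlated in the isolation-by-distance test can be sketched in Python as follows. The function names are hypothetical, and the Earth radius of 6,371 km is an assumption of this sketch; the Mantel test itself is not reproduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def slatkin_distance(fst):
    """Linearized genetic differentiation, FST / (1 - FST)."""
    return fst / (1.0 - fst)


def ibd_point(lat1, lon1, lat2, lon2, fst):
    """Return (log10 geographic distance in km, linearized FST) for one sample pair."""
    return (math.log10(haversine_km(lat1, lon1, lat2, lon2)),
            slatkin_distance(fst))
```

Applying `ibd_point` to every pair of samples yields the two distance matrices whose correlation is then assessed by permutation.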
We tested for population substructure using two different approaches. First, we performed principal component analysis (PCA) based on unscaled allele frequencies of the neutral marker SNPs, as suggested by Menozzi et al. (1978) and Novembre and Stephens (2008), using LEA (v. 1.2.0., Frichot et al. 2013). We focused on the first three principal components (PCs) and used mclust (v. 5.2., Fraley & Raftery 2012) to estimate the number of clusters via maximum likelihood and assigned population samples to clusters via k-means. In addition, we examined the first three PCs for correlations with latitude, longitude, altitude, and season using general linear models and tested for spatial autocorrelation as above. A Bonferroni-corrected α threshold (α’= 0.05/3 = 0.017) was used to correct for multiple testing.
In a second, complementary approach, we inferred population delineation using model-based clustering as implemented in ConStruct (v.1.0.2; Bradburd et al. 2018). In contrast to most clustering-based methods, ConStruct incorporates continuous isolation by distance to avoid inflating estimates of the number of clusters and allows estimating admixture among populations. We ran spatial models with three MCMC chains per run and 10,000 iterations and compared the goodness of fit for models incorporating 1 to 10 spatial layers by cross-validation.
Mitochondrial DNA
To obtain consensus mitochondrial sequences for each of the 48 European populations, we aligned reads from individual FASTQ files and replaced minor variants with the major variant using Coral (Salmela & Schröder 2011). This method prevents ambiguities from interfering with the assembly process. We assembled a genome for each population from the modified FASTQ files using SPAdes with standard parameters and k-mers of size 21, 33, 55, and 77 (Bankevich et al. 2012). Mitochondrial contigs were retrieved by blastn, using the D. melanogaster NC_024511 sequence as a query and each genome assembly as the database. To avoid nuclear mitochondrial DNA segments (numts), we ensured that only contigs with a higher than average coverage of the genome were retrieved. When multiple contigs were available for the same region, the one with the highest coverage was selected. Possible contamination with D. simulans was assessed by looking for two or more consecutive sites that show the same variant as D. simulans and looking for alternative contigs for that region with similar coverage. As an additional quality control measure, we also examined the presence of pairs of sites showing four gametic types using DNAsp 6 (Rozas et al. 2017); given that there is no recombination in mitochondrial DNA, no such sites are expected. The very few sites presenting such features were rechecked by looking for alternative contigs for that region and were corrected if needed. The uncorrected raw reads for each population were mapped on top of the different consensus haplotypes using eXpress as implemented in Trinity (Grabherr et al. 2011). If most reads for a given population mapped to the consensus sequence derived for that population, the consensus sequence was retained; otherwise it was discarded as a possible chimera between different mitochondrial haplotypes.
The repetitive mitochondrial hypervariable region is difficult to assemble and was therefore not used; the mitochondrial region was thus analysed as in Cooper et al. (2015). Mitochondrial genealogy was estimated from the surviving mitochondrial haplotypes using statistical parsimony (TCS network; Clement et al. 2000), as implemented in PopArt (http://popart.otago.ac.nz). Frequencies of the different mitochondrial haplotypes were estimated from FPKM values using the surviving mitochondrial haplotypes and eXpress as implemented in Trinity (Grabherr et al. 2011).
Transposable elements
To quantify transposable element (TE) abundance in each sample, we assembled and quantified repeats from unassembled sequenced reads using dnaPipeTE (v.1.2., Goubert et al. 2015). Only the left read of each pair was used. As the vast majority of high-quality trimmed reads were longer than 135 bp, we discarded reads shorter than this before sampling. Reads matching mtDNA were filtered out by mapping to the D. melanogaster reference mitochondrial genome (NC_024511.2) with bowtie2 (v. 2.1.0., Langmead & Salzberg 2012). Prokaryotic sequences, including reads from symbiotic bacteria such as Wolbachia, were filtered out from the reads using the implementation of blastx vs. the non-redundant protein database (nr) using DIAMOND (v. 0.8.7, Buchfink et al. 2015). To quantify TE content, we subsampled a proportion of the raw reads (after filtering) corresponding to a genome coverage of 0.1X (assuming a genome size of 175 Mb), and then assembled these reads with Trinity (Grabherr et al. 2011). Due to the low coverage of the genome obtained with the subsampled reads, only repetitive DNA present in multiple copies should be fully assembled (Goubert et al. 2015). To assess the consistency of the estimates, we repeated this process with three iterations per sample, as recommended by the program guidelines.
We further estimated frequencies of TEs present in the reference genome with T-lex2 (v. 2.2.2., Fiston-Lavier et al. 2015), using all annotated TEs (5,416 TEs) in version 6.04 of the D. melanogaster genome from flybase.org (Gramates et al. 2017). For 108 of these TEs, we used the corrected coordinates as described in Fiston-Lavier et al. (2015), based on the identification of target site duplications at the site of the insertion. We excluded TEs nested or flanked by other TEs (<100 bp on each side of the TE), and TEs that are part of segmental duplications, since T-lex2 does not provide accurate frequency estimates in complex regions (Fiston-Lavier et al. 2015). We additionally excluded the INE-1 TE family, as this TE family is ancient, with 2,234 insertions in the reference genome, which appear to be mostly fixed (Kapitonov & Jurka 2003). After applying these filters, we were able to estimate frequencies of 1,630 TE insertions from 113 families from the three main orders (LTR, non-LTR, and DNA) across all DrosEU samples. Because the mapper used by T-lex2 to detect the presence of insertions (presence module) only accepts reads ≤127 bp, we trimmed reads longer than 100 bp into two equally sized fragments using Trimmomatic (v. 0.35; Bolger et al. 2014) with the CROP and HEADCROP parameters.
To avoid inaccurate TE frequency estimates due to very low numbers of reads, we only considered frequency estimates based on at least 3 reads. Although T-lex2 stringently selects only high-quality reads, we additionally discarded frequency estimates supported by more than 90 reads, i.e. 3 times the average coverage of the sample with the lowest coverage (CH_Cha_14_43, Table S1), in order to avoid non-uniquely mapping reads. This filtering allowed us to estimate TE frequencies for ∼96% (92.9% to 97.8%) of the TEs in each population. For 85% of the TEs, we were able to estimate their frequencies in more than 44 out of 48 DrosEU samples.
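The read-support filter on TE frequency estimates amounts to a simple range check, sketched here in Python. The function name and the tuple layout of the input are hypothetical and do not reflect the T-lex2 output format.

```python
def filter_te_estimates(estimates, min_reads=3, max_reads=90):
    """Keep TE frequency estimates supported by a plausible number of reads.

    estimates: list of (te_id, frequency, n_reads) tuples for one sample.
    Estimates with fewer than `min_reads` or more than `max_reads`
    supporting reads are discarded.
    """
    return [(te, freq) for te, freq, n in estimates
            if min_reads <= n <= max_reads]
```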
We tested for correlations between TE insertion frequencies and recombination rates using Spearman’s rank correlations as implemented in R. As for SNPs, we used recombination rates from Comeron et al. (2012) and from Fiston-Lavier et al. (2010) in non-overlapping 100 kb windows and assigned to each TE insertion the recombination rate of the corresponding window.
To test for spatio-temporal variation of TE insertions, we excluded TEs with an interquartile range (IQR) < 10. We tested the population frequencies of the remaining 141 insertions for correlations with latitude, longitude, altitude, and season using generalized linear models (ANCOVA) following the method used for SNPs but with a binomial error structure in R. We further tested if significant correlations with either of the predictor variables deviated from expectations under neutral evolution. To this end, we repeated the ANCOVA analyses on 8,727 presumably neutrally evolving 4-fold degenerate sites that we described previously in the demographic analyses. Based on F-ratios obtained from the ANCOVA models for each neutral SNP and predictor, we built empirical density functions and calculated empirical p-values for each TE by integrating over the area of the curve that is delineated by the F-value specific for the given TE and the maximum F-ratio in the neutral dataset.
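The comparison of TE-specific F-ratios against the neutral SNPs can be illustrated with the Python sketch below. It is a simplification: instead of integrating over a fitted empirical density function, it uses the proportion of neutral F-ratios that are at least as large as the observed one, which approximates the same upper-tail area. The function name is hypothetical.

```python
def empirical_p(f_observed, neutral_f):
    """Proportion of neutral F-ratios greater than or equal to the observed F-ratio.

    f_observed: F-ratio of one TE insertion for a given predictor.
    neutral_f: F-ratios of the neutral SNPs for the same predictor.
    """
    if not neutral_f:
        raise ValueError("neutral F-ratio distribution is empty")
    exceed = sum(1 for f in neutral_f if f >= f_observed)
    return exceed / len(neutral_f)
```

With 8,727 neutral SNPs, the smallest non-zero value this estimator can return is 1/8,727, so an observed F-ratio above the neutral maximum yields 0.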
We also tested for residual spatio-temporal autocorrelations in TE insertion frequencies with Moran’s I test (Moran 1950; Kühn & Dormann 2012). We used Bonferroni corrections to account for multiple testing (α’= 0.05/141 = 0.00035) and only considered Bonferroni-corrected p-values < 0.001 to be significant. To test for TE family enrichment among the significant TEs, we performed a χ2 test and applied Yates’ correction to account for the low counts in some of the cells.
Inversion polymorphisms
Since Pool-Seq data precludes a direct assessment of the presence and frequencies of chromosomal inversions, we indirectly estimated inversion frequencies using a panel of approximately 400 inversion-specific marker SNPs (Kapun et al. 2014) for six cosmopolitan inversions (In(2L)t, In(2R)NS, In(3L)P, In(3R)C, In(3R)Mo, In(3R)Payne). We averaged allele frequencies of these markers in each sample separately. To test for clinal variation in the frequencies of inversions, we tested for correlations with latitude, longitude, altitude and season using generalized linear models with a binomial error structure in R to account for the biallelic nature of karyotype frequencies. In addition, we Bonferroni-corrected the α threshold (α’= 0.05/7 = 0.007) to account for multiple testing, accounted for residual spatio-temporal autocorrelations and tested if F-ratios of the ANCOVAs deviated from neutral expectations as explained above.
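The marker-based estimate of inversion frequency can be sketched in Python as follows. The function name and input layout are hypothetical; skipping markers without coverage is an assumption of this sketch.

```python
def inversion_frequency(marker_counts):
    """Estimate an inversion's frequency in one pooled sample.

    marker_counts: list of (inversion_allele_reads, total_reads), one tuple
        per inversion-specific marker SNP.
    Returns the mean frequency of the inversion-specific allele across
    markers, or None if no marker has coverage.
    """
    freqs = [inv / tot for inv, tot in marker_counts if tot > 0]
    return sum(freqs) / len(freqs) if freqs else None
```

For example, two covered markers with 10/100 and 30/100 inversion-specific reads give an estimated inversion frequency of 0.2.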
Microbiome
Raw sequences were trimmed and quality-filtered as described for the genomic data analysis. The remaining high-quality sequences were mapped against the D. melanogaster genome (v.6.04) including mitochondria using bbmap (v. 35; Bushnell 2016) with standard settings. The unmapped sequences were submitted to the online classification tool MG-RAST (Meyer et al. 2008) for annotation. Taxonomy information was downloaded and analysed in R (v. 3.2.3; R Development Core Team 2009) using the matR (v. 0.9; Braithwaite & Keegan) and RJSONIO (v. 1.3; Lang) packages. Metazoan sequence features were removed. For microbial load comparisons, the number of protein features identified by MG-RAST for each taxon and sample was divided by the number of sequences that mapped to D. melanogaster chromosomes X, Y, 2L, 2R, 3L, 3R and 4.
We also surveyed the datasets for the presence of novel DNA viruses by performing de novo assembly of the non-fly reads using SPAdes 3.9.0 (Bankevich et al. 2012) and using conceptual translations to query virus proteins from Genbank using DIAMOND ‘blastp’ (Buchfink et al. 2015). In three cases (Kallithea virus, Vesanto virus, Viltain virus), reads from a single sample pool were sufficient to assemble a (near) complete genome. In two other cases, fragmentary assemblies allowed us to identify additional publicly available datasets that contained sufficient reads to complete the genomes (Linvill Road virus, Esparto virus; completed using SRA datasets SRR2396966 and SRR3939042, respectively). Novel viruses were provisionally named based on the localities where they were first detected, and the corresponding novel genome sequences were submitted to Genbank (KX130344, KY608910, KY457233, KX648533-KX648536). To assess the relative amount of viral DNA, unmapped (non-fly) reads from each sample pool were mapped to repeat-masked Drosophila DNA virus genomes using bowtie2, and coverage normalized relative to virus genome length and the number of mapped Drosophila reads.
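One way to express the normalization of viral read counts is sketched below in Python. The exact formula used in the analysis is not given above, so this is an assumed form: read density on the virus genome divided by read density on the fly genome, which approximates viral genome copies per fly genome copy. All names are hypothetical, and the default threshold corresponds to the 0.1% detection threshold applied in the Results.

```python
def relative_copy_number(virus_reads, virus_length, fly_reads, fly_genome_length):
    """Approximate viral genome copies per Drosophila genome copy.

    virus_reads / virus_length: reads per base on the virus genome.
    fly_reads / fly_genome_length: reads per base on the fly genome.
    """
    virus_density = virus_reads / virus_length
    fly_density = fly_reads / fly_genome_length
    return virus_density / fly_density


def is_detected(relative_copies, threshold=0.001):
    """True if the virus exceeds the detection threshold (0.1% of fly genome copies)."""
    return relative_copies > threshold
```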
Additional information
Funding
Author contributions
Martin Kapun, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Supervision, Methodology, Investigation, Data curation, Project administration, Validation, Resources, Software; Maite G. Barrón, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Data curation, Project administration, Validation, Resources, Software; Fabian Staubach, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Supervision, Funding acquisition, Methodology, Investigation, Data curation, Validation, Resources, Software; Jorge Vieira, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Validation, Resources; Darren J. Obbard, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Validation, Resources; Clément Goubert, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Investigation, Resources; Omar Rota-Stabelli, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Resources; Maaria Kankare, Writing-original draft preparation, Conceptualization, Writing-review & editing, Methodology, Investigation, Resources; María Bogaerts-Márques, Alejandro Sánchez-Gracia, Formal analysis, Writing-review & editing, Investigation, Validation, Resources; Annabelle Haudry, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Investigation, Validation, Resources; R. Axel W. Wiberg, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Resources, Software; Lena Waidele, Svitlana Serga, Patricia Gibert, Damiano Porcelli, Sonja Grath, Eliza Argyridou, Lain Guio, Mads Fristrup Schou, Conceptualization, Writing-review & editing, Investigation, Resources; Iryna Kozeretska, Conceptualization, Writing-review & editing, Methodology, Investigation, Resources; Elena G. Pasyukova, Marta Pascual, Alan O. Bergland, Conceptualization, Writing-review & editing, Funding acquisition, Methodology, Investigation, Resources; Volker Loeschcke, Catherine Montchamp-Moreau, Jessica Abbott, Nico Posnien, Maria Pilar Garcia Guerreiro, Banu Sebnem Onder, Conceptualization, Writing-review & editing, Funding acquisition, Investigation, Resources; Cristina P. Vieira, Visualization, Formal analysis, Conceptualization, Writing-review & editing, Investigation, Resources; Élio Sucena, Conceptualization, Writing-review & editing, Methodology, Investigation, Project administration, Resources; Cristina Vieira, Michael G. Ritchie, Thomas Flatt, Josefa González, Writing-original draft preparation, Conceptualization, Writing-review & editing, Supervision, Funding acquisition, Methodology, Investigation, Project administration, Validation, Resources; Bart Deplancke, Conceptualization, Writing-review & editing, Funding acquisition, Investigation; Bas J. Zwaan, Visualization, Writing-original draft preparation, Conceptualization, Writing-review & editing, Supervision, Funding acquisition, Methodology, Investigation, Project administration; Eran Tauber, Writing-original draft preparation, Conceptualization, Writing-review & editing, Funding acquisition, Methodology, Investigation, Resources; Dorcas J. Orengo, Eva Puerma, Conceptualization, Writing-review & editing, Investigation, Validation, Resources; Montserrat Aguadé, Writing-original draft preparation, Conceptualization, Writing-review & editing, Methodology, Investigation, Validation, Resources; Paul S. Schmidt, John Parsch, Writing-original draft preparation, Conceptualization, Writing-review & editing, Funding acquisition, Methodology, Investigation, Validation, Resources; Andrea J. Betancourt, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Supervision, Funding acquisition, Methodology, Investigation, Project administration, Validation, Resources.
Supplementary Files
Supplementary File 1. Supplementary Tables.
This file contains the 13 supplementary tables mentioned in the text.
Supplementary File 2. Additional methods.
This file contains the additional methods mentioned in the text.
Acknowledgments
We are grateful to all members of the DrosEU and Dros-RTEC consortia and to Dmitri Petrov (Stanford University) for support and discussion. DrosEU is funded by a Special Topic Networks (STN) grant from the European Society for Evolutionary Biology (ESEB). Computational analyses were partially executed at the Vital-IT bioinformatics facility of the University of Lausanne (Switzerland), at the computing facilities of the CC LBBE/PRABI in Lyon (France) and at the bwUniCluster of the state of Baden-Württemberg (bwHPC).
Footnotes
§ Members of the Drosophila Real Time Evolution (Dros-RTEC) Consortium
Competing interests: The authors declare that no competing interests exist.