ABSTRACT
Drosophila melanogaster is a premier model in population genetics and genomics, and a growing number of whole-genome datasets from natural populations of this species have been published over the last 20 years. A major challenge is the integration of these disparate datasets, often generated using different sequencing technologies and bioinformatic pipelines, which hampers our ability to address questions about the evolution and population structure of this species. Here we address these issues by developing a bioinformatics pipeline that maps pooled sequencing (Pool-Seq) reads from D. melanogaster to a hologenome consisting of fly and symbiont genomes and estimates allele frequencies using either a heuristic (PoolSNP) or a probabilistic variant caller (SNAPE-pooled). We use this pipeline to generate the largest data repository of genomic data available for D. melanogaster to date, encompassing 271 population samples from over 100 locations in >20 countries on four continents. Several of these locations are sampled at different seasons across multiple years. This dataset, which we call Drosophila Evolution over Space and Time (DEST), is coupled with sampling and environmental meta-data. A web-based genome browser and web portal provide easy access to the SNP dataset. Our aim is to provide this scalable platform as a community resource which can be easily extended via future efforts for an even more extensive cosmopolitan dataset. Our resource will enable population geneticists to analyze spatio-temporal genetic patterns and evolutionary dynamics of D. melanogaster populations in unprecedented detail.
Introduction
The vinegar fly Drosophila melanogaster is one of the oldest and most important genetic model systems and has played a key role in the development of theoretical and empirical population genetics (Schneider 2000; Hales et al. 2015; Haudry et al. 2020). Through decades of work, we now have a basic picture of the evolutionary origin (David and Capy 1988; Lachaise et al. 1988; Keller 2007; Sprengelmeyer et al. 2020), colonization history and demography (Caracristi and Schlötterer 2003; Li and Stephan 2006; Duchen et al. 2013; Grenier et al. 2015; Arguello et al. 2019; Kapopoulou et al. 2020), and spatio-temporal diversification patterns of this species and its close relatives (Kolaczkowski et al. 2011; Fabian et al. 2012; Bergland et al. 2014; Lack et al. 2016; Machado et al. 2016; Kapun et al. 2016, 2020). The availability of high-quality reference genomes (Adams 2000; Celniker and Rubin 2003; dos Santos et al. 2015) and genetic tools (Schneider 2000; Duffy 2002; Jennings 2011; Hales et al. 2015; Haudry et al. 2020) efficiently facilitates placing evolutionary studies of flies in a mechanistic context, allowing for the functional characterization of ecologically relevant polymorphism (e.g., de Jong and Bochdanovits 2003; Paaby et al. 2010, 2014; Mateo et al. 2014; Kapun et al. 2016; Durmaz et al. 2018, 2019; Ramaekers et al. 2019).
Recently, work on the evolutionary biology of Drosophila has been fueled by the growing number of population genomic datasets from field collections across a large portion of D. melanogaster’s range (Grenier et al. 2015; Machado et al. 2019; Guirao-Rico and González 2019; Arguello et al. 2019). These genomic data consist either of re-sequenced inbred (or haploid) individuals (e.g., Mackay et al. 2012; Langley et al. 2012; Grenier et al. 2015; Lack et al. 2015, 2016; Mateo et al. 2018; Kapopoulou et al. 2020) or pooled sequencing (Pool-Seq; e.g., Kolaczkowski et al. 2011; Fabian et al. 2012; Bastide et al. 2013; Campo et al. 2013; Bergland et al. 2014; Machado et al. 2016, 2019; Kapun et al. 2016, 2020) of outbred population samples. Pooled re-sequencing provides accurate and precise estimates of allele frequencies across most of the allele frequency spectrum (Zhu et al. 2012; Lynch et al. 2014; Schlötterer et al. 2014) at a fraction of the cost of individual-based sequencing. Although Pool-Seq retains limited information about linkage disequilibrium (LD) relative to individual sequencing (Feder et al. 2012), Pool-Seq data can be used to infer complex demographic histories (e.g., Cheng et al. 2012; Bergland et al. 2016; Deitz et al. 2016; Gould et al. 2017; Corbett-Detig and Nielsen 2017; Giesen et al. 2020), characterize levels of diversity (Kofler et al. 2011a, 2011b; Ferretti et al. 2013; Kapun et al. 2020), and infer genomic loci involved in recent adaptation in nature (Flatt 2016; Kapun et al. 2016, 2020; Gould et al. 2017; Machado et al. 2019; Bogaerts-Márquez et al. 2020) and during experimental evolution (e.g., Turner et al. 2011; Orozco-terWengel et al. 2012; Burke 2012; Kofler and Schlötterer 2014). However, the rapidly increasing number of genomic datasets processed with different bioinformatic pipelines makes it difficult to compare results across studies and to jointly analyze multiple datasets. Differences among bioinformatic pipelines include filtering methods for the raw reads, mapping algorithms, the choice of the reference genome or SNP calling approaches, potentially generating biases when combining processed datasets from different sources for joint analyses (e.g., Gautier et al. 2013; Hoban et al. 2016).
To address these issues, we have developed a modular bioinformatics pipeline to map Pool-Seq reads to a hologenome consisting of fly and microbial genomes, to remove reads from potential D. simulans contaminants, and to estimate allele frequencies using two complementary SNP callers. Our pipeline is available as a Docker image (available from https://dest.bio) to standardize versions of software used for filtering and mapping, to make the pipeline available independently of the operating system used and to facilitate future updates and modification of the pipeline. In addition, our pipeline allows using either heuristic or probabilistic methods for SNP calling, based on PoolSNP (Kapun et al. 2020) and SNAPE-pooled (Raineri et al. 2012). We also provide tools for performing in-silico pooling of existing inbred (haploid) lines that exist as part of other Drosophila population genomic resources (Pool et al. 2012; Langley et al. 2012; Grenier et al. 2015; Kao et al. 2015; Lack et al. 2015, 2016). This pipeline is also designed to be flexible, facilitating the streamlined addition of new population samples as they arise.
Using this pipeline, we generated a unified dataset of pooled allele frequency estimates of D. melanogaster sampled across large portions of Europe and North America. This dataset is the result of the collaborative efforts of the European DrosEU (Kapun et al. 2020) and DrosRTEC (Machado et al. 2019) consortia and combines both novel and previously published population genomic data. Our dataset combines samples from 100 localities, 55 of which were sampled at two or more time points across the reproductive season (~10-15 generations/year) for one or more years. Collectively, these samples represent >13,000 individuals, cumulatively sequenced to >16,000x coverage. The cost-effectiveness of Pool-Seq has enabled us to estimate genome-wide allele frequencies across geographic (continental and sub-continental) and temporal (seasonal, annual and decadal) scales, thus making our data a unique resource for advancing our understanding of fundamental adaptive and neutral evolutionary processes. We provide data in two file formats (VCF and GDS; Danecek et al. 2011; Zheng et al. 2017), thus allowing researchers to utilize a variety of tools for computational analyses. Our dataset also contains sampling and environmental meta-data to enable various downstream analyses of biological interest.
Materials and Methods
Data sources
The genomic dataset presented here has been assembled from a combination of Pool-Seq libraries and in-silico pooled haplotypes. We combined 246 Pool-Seq libraries of population samples from Europe, North America and the Caribbean that were sampled through space and time by two collaborating consortia in North America (DrosRTEC: https://web.sas.upenn.edu/paul-schmidt-lab/dros-rtec/) and Europe (DrosEU: http://droseu.net) between 2003 and 2016. In addition, we integrated genomic data from >900 inbred or haploid genomes from 25 populations in Africa, Europe, Australia, and North America available from the Drosophila Genome Nexus dataset (DGN; Lack et al. 2015, 2016). We further included the D. simulans haplotype, built as part of the DGN dataset, as an outgroup, making this repository of 272 (246 Pool-Seq + 25 DGN + 1 D. simulans) whole-genome sequenced samples the largest dataset of genome-wide SNPs available for D. melanogaster to date.
Metadata
We assembled uniform meta-data for all samples (Supplemental Material, Table S1). This information includes collection coordinates, collection date, and the number of flies per sample. Samples are also linked to bioclimatic variables from the nearest WorldClim (Hijmans et al. 2005) raster cell at a resolution of 2.5° and to weather stations from the Global Historical Climatology Network (GHCND; ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/) for future analysis of the environmental drivers that might underlie genetic change. We also provide summaries of basic attributes of each sample derived from the sequencing data including average read depth, PCR-duplicate rate, D. simulans contamination rate, relative abundances of non-synonymous versus synonymous polymorphisms (pN/pS), the number of private polymorphisms, and diversity statistics (Watterson’s θ, π and Tajima’s D).
Sample collection
The majority of population samples contributed by the DrosEU and the DrosRTEC consortia were collected in a coordinated fashion to generate a consistent dataset with minimized sampling bias. In brief, fly collections were performed exclusively in natural or semi-natural habitats, such as orchards, vineyards and compost piles. For most European collections, flies were collected either by sweep netting or in traps baited with mashed banana or apples with live yeast and left at sampling sites for multiple days (see Kapun et al. 2020 for more details). For North American collections, flies were collected by sweep-net, aspiration, or baiting over natural substrate or using baited traps (see Behrman et al. 2018; Machado et al. 2019 for details). Samples consisted of field-caught flies (n=227), F1 offspring of wild-caught females (n=7), a mixture of F1 and wild-caught flies (n=7), or flies kept as isofemale lines in the lab for 5 generations or less (n=4); see Supplemental Material, Table S1 for more information. To minimize cross-contamination with the closely related sympatric sister species D. simulans, we only sequenced male D. melanogaster specimens, allowing for higher confidence discrimination between the two species based on the morphology of male genitalia (Capy and Gibert 2004; Markow and O’Grady 2005). Samples were stored in 95% ethanol at −20°C before DNA extraction.
DNA extraction and sequencing
The DrosEU and DrosRTEC consortia centralized extractions from pools of flies. DNA was extracted either using chloroform/phenol-based (DrosEU: Kapun et al. 2020) or lithium chloride/potassium acetate extraction protocols (DrosRTEC: Bergland et al. 2014; Machado et al. 2019) after homogenization with bead beating or a motorized pestle. DrosEU samples from the 2014 collection were sequenced on an Illumina NextSeq 500 sequencer at the Genomics Core Facility of Pompeu Fabra University in Barcelona, Spain. Libraries of the previously unpublished DrosEU samples from 2015 and 2016 were constructed using the Illumina TruSeq PCR Free library preparation kit following the manufacturer’s instructions and sequenced on the Illumina HiSeq X platform as paired-end fragments with 2 × 150 bp length at NGX Bio (San Francisco, California, USA). The previously published samples of the DrosRTEC consortium were prepared and sequenced on GAIIX, HiSeq2000 or HiSeq3000 platforms, as described in Bergland et al. (2014) and Machado et al. (2019). For information on DNA extraction and sequencing methods of the various DGN samples see Lack et al. (2016).
Mapping pipeline
The joint analysis of genomic data from different sources requires the application of uniform quality criteria and a common bioinformatics pipeline. To accomplish this, we developed a standardized pipeline that performs filtering, quality control and mapping of any given Pool-Seq sample (see Supplemental Information Figure S1). This pipeline performs quality filtering of raw reads, maps reads to a hologenome (see below), performs realignment and filtering around indels, and filters for mapping quality. The output of this pipeline includes quality control metrics, bam files, pileup files, and allele frequency estimates for every site in the genome (gSYNC, see below). Our pipeline is provided as a Docker image, which automatically installs external software and executes the pipeline across various operating systems. Our pipeline will facilitate the integration of future samples to extend the worldwide D. melanogaster SNP dataset presented here.
The mapping pipeline includes the following major steps. Prior to mapping, we removed sequencing adapters and trimmed the 3’ ends of all reads using cutadapt (Martin 2011). We enforced a minimum base quality score ≥ 18 (-q flag in cutadapt) and assessed the quality of raw and trimmed reads with FASTQC (Andrews 2010). Trimmed reads with minimum length < 75 bp were discarded and only intact read pairs were considered for further analyses. Overlapping paired-end reads were merged using bbmerge (v. 35.50; Bushnell et al. 2017). Trimmed reads were mapped against a compound reference genome (“hologenome”) consisting of the genomes of D. melanogaster (v.6.12) and D. simulans (Hu et al. 2013) as well as genomes of common commensals and pathogens, including Saccharomyces cerevisiae (GCF_000146045.2), Wolbachia pipientis (NC_002978.6), Pseudomonas entomophila (NC_008027.1), Commensalibacter intestini (NZ_AGFR00000000.1), Acetobacter pomorum (NZ_AEUP00000000.1), Gluconobacter morbifer (NZ_AGQV00000000.1), Providencia burhodogranariea (NZ_AKKL00000000.1), Providencia alcalifaciens (NZ_AKKM01000049.1), Providencia rettgeri (NZ_AJSB00000000.1), Enterococcus faecalis (NC_004668.1), Lactobacillus brevis (NC_008497.1), and Lactobacillus plantarum (NC_004567.2), using bwa mem (v. 0.7.15; Li 2013) with default parameters. Using samtools (Li et al. 2009), we retained only reads with mapping quality greater than 20 and no secondary alignment. PCR duplicate reads were removed using Picard MarkDuplicates (v.1.109; http://picard.sourceforge.net). Sequences were re-aligned in the proximity of insertions-deletions (indels) with GATK (v3.4-46; McKenna et al. 2010). We identified and removed any reads that mapped to the D. simulans genome using a custom python script, following methods outlined previously (Machado et al. 2019; Kapun et al. 2020; for a more in-depth analysis of D. simulans contamination see Wallace et al. 2020).
Incorporation of the DGN dataset
We incorporated population allele frequency estimates derived from inbred-line and haploid embryo sequencing data from populations sampled throughout the world. These samples have been previously collected and sequenced by several groups (Pool et al. 2012; Langley et al. 2012; Grenier et al. 2015; Kao et al. 2015; Lack et al. 2015, 2016) and form the Drosophila Genome Nexus dataset (DGN; Lack et al. 2015, 2016). We included 25 DGN populations with ≥ 5 individuals per population, plus the D. simulans haplotype built as part of the DGN dataset. The DGN populations that we used are primarily from Africa (n=18) but also include populations from Europe (n=2), North America (n=3), Australia (n=1), and Asia (n=1).
To incorporate the DGN populations into the DrosEU and DrosRTEC Pool-Seq datasets, we used the pre-computed FASTA files (“Consensus Sequence Files” from https://www.johnpool.net/genomes.html) and calculated allele frequencies at every site, for each population, using custom bash scripts. We calculated allele frequencies per population by summing reference and alternative allele counts across all individuals. Since estimates of allele frequencies and total allele counts for the DGN samples only consider unambiguous IUPAC codes, heterozygous sites or sites masked as N’s in the original FASTA files were converted to missing data. We used liftover (Kuhn et al. 2013) to translate genome coordinates to Drosophila reference genome release 6 (dos Santos et al. 2015) and formatted them to match the gSYNC format (described below).
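The in-silico pooling step can be illustrated with a minimal Python sketch. It is not the custom bash code used for DEST: the function names (`pooled_counts`, `alt_frequency`) and the toy sequences are hypothetical, and the sketch assumes that all consensus sequences of a population are already aligned to the same reference coordinates. Only unambiguous bases (A, C, G, T) are counted, so heterozygous IUPAC codes and N's contribute nothing and lower the total count at that site.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")

def pooled_counts(sequences):
    """Return one Counter of A/C/G/T counts per site across aligned haploid sequences."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all consensus sequences must have the same length")
    counts = []
    for column in zip(*sequences):
        # Ambiguity codes (e.g. Y, R) and N are skipped, i.e. treated as missing data.
        bases = (base.upper() for base in column)
        counts.append(Counter(base for base in bases if base in UNAMBIGUOUS))
    return counts

def alt_frequency(site_counts, ref, alt):
    """Alternative allele frequency at one site, or None if no allele was counted."""
    total = site_counts[ref] + site_counts[alt]
    return site_counts[alt] / total if total else None

# Toy population of four haploid genomes (hypothetical data).
population = ["ACGT", "ACNT", "ATGY", "ATGT"]
counts = pooled_counts(population)
freq_site2 = alt_frequency(counts[1], ref="C", alt="T")
```

In this toy example the second site carries two C and two T alleles, whereas the third and fourth sites are each based on only three counted alleles because one individual is masked or heterozygous.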
SNP calling strategies
We used two complementary approaches to perform SNP calling. The first was PoolSNP (Kapun et al. 2020), a heuristic tool which identifies polymorphisms based on the combined evidence from multiple samples. This approach is similar to other common Pool-Seq variant calling tools (Koboldt et al. 2009, 2012; Kofler et al. 2011a, 2011b). PoolSNP integrates allele counts across multiple independent samples and applies stringent minor allele count and minor allele frequency thresholds for variant detection. PoolSNP is expected to be good at detecting variants present in multiple populations, but is not very sensitive to rare private alleles. The second approach was SNAPE-pooled (Raineri et al. 2012), a Bayesian tool which identifies polymorphic sites for each population independently using pairwise nucleotide diversity estimates as a prior. SNAPE-pooled is expected to be more sensitive to rare private polymorphisms, but also might have a higher false positive rate for variant detection.
gSYNC generation and filtering
Our pipeline utilizes a common data-format (SYNC; Kofler et al. 2011b) to encode allele counts for each population sample. A “genome-wide SYNC” (gSYNC) file records the number of A, T, C, and G for every site of the reference genome. Because gSYNC files for all populations have the same dimension, they can be quickly combined and passed to a SNP calling tool. They can be filtered and are also relatively small for a given sample (~500Mb), enabling efficient data sharing and access. The gSYNC file is analogous to the gVCF file format as part of the GATK HaplotypeCaller approach (McKenna et al. 2010) but is specifically tailored to Pool-Seq samples.
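To make the format concrete, the following Python sketch parses and merges SYNC-style lines. It assumes the tab-separated layout of the SYNC format (chromosome, position, reference allele, then one `A:T:C:G:N:del` count field per sample); the helper names are hypothetical, and the encoding of masked sites in the DEST gSYNC files is not handled here.

```python
SYNC_ALLELES = ("A", "T", "C", "G", "N", "del")

def format_sync_field(counts):
    """Encode a dict of allele counts as an A:T:C:G:N:del string."""
    return ":".join(str(counts.get(allele, 0)) for allele in SYNC_ALLELES)

def parse_sync_line(line):
    """Split one SYNC line into chromosome, position, reference and per-sample counts."""
    chrom, pos, ref, *fields = line.rstrip("\n").split("\t")
    samples = [dict(zip(SYNC_ALLELES, map(int, field.split(":")))) for field in fields]
    return chrom, int(pos), ref, samples

def merge_sync_lines(lines):
    """Combine single-sample lines describing the same site into one multi-sample line."""
    parsed = [parse_sync_line(line) for line in lines]
    sites = {(chrom, pos, ref) for chrom, pos, ref, _ in parsed}
    if len(sites) != 1:
        raise ValueError("lines must describe the same site")
    chrom, pos, ref = parsed[0][:3]
    fields = [format_sync_field(s) for _, _, _, samples in parsed for s in samples]
    return "\t".join([chrom, str(pos), ref] + fields)

# Two hypothetical samples covering the same site on chromosome arm 2L.
sample_a = "2L\t100\tA\t12:0:3:0:0:0"
sample_b = "2L\t100\tA\t7:0:0:0:0:0"
merged = merge_sync_lines([sample_a, sample_b])
```

Because every gSYNC file has one line per reference position, merging reduces to pasting the count fields of all samples side by side, which is why files of identical dimension can be combined quickly.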
To generate a PoolSNP gSYNC file, we first converted BAM files to the MPILEUP format with samtools mpileup using the -B parameter to suppress recalculations of per-base alignment qualities and filtered for a minimum mapping quality with the parameter -q 25. Next, we converted the MPILEUP file containing mapped and filtered reads to the gSYNC format using custom python scripts, which are available at https://dest.bio. To generate a SNAPE-pooled gSYNC file, we ran the SNAPE-pooled version specific to Pool-Seq data for each sample with the following parameters: θ=0.005, D=0.01, prior=‘informative’, fold=‘unfolded’ and nchr=number of flies (x2 for autosomes and x1 for the X chromosome) following Guirao-Rico and González (2021). We converted the SNAPE-pooled output file to a gSYNC file containing the counts of each allele per position and the posterior probability of polymorphism as defined by SNAPE-pooled using custom python scripts. We considered positions with a posterior probability ≥ 0.9 as polymorphic and positions with a posterior probability ≤ 0.1 as monomorphic. All other positions were marked as missing data.
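The posterior-probability thresholds and the nchr setting described above can be sketched as follows. The function names are hypothetical and the code does not call SNAPE-pooled itself; it only illustrates how a per-site posterior probability would be translated into a site state, and how the number of chromosomes follows from the number of (male) flies in a pool.

```python
def n_chromosomes(n_flies, chromosome, males_only=True):
    """Number of sampled chromosomes: 2 per fly on autosomes, 1 per male fly on the X."""
    if chromosome == "X" and males_only:
        return n_flies
    return 2 * n_flies

def classify_site(posterior, polymorphic_min=0.9, monomorphic_max=0.1):
    """Map the posterior probability of polymorphism to a site state."""
    if posterior >= polymorphic_min:
        return "polymorphic"
    if posterior <= monomorphic_max:
        return "monomorphic"
    # Intermediate posterior probabilities are treated as missing data.
    return "missing"

states = [classify_site(p) for p in (0.95, 0.5, 0.02)]
```

Sites with intermediate support are deliberately left uncalled, which trades a higher missing data rate for fewer ambiguous calls.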
We masked gSYNC files for PoolSNP and SNAPE-pooled using a common set of filters. Sites were filtered from gSYNC files if they: (1) had a read depth < 10; (2) had a read depth above the 95% coverage percentile of a given chromosomal arm and sample; (3) were located within repetitive elements as defined by RepeatMasker; or (4) were located within 5 bp up- or downstream of indel polymorphisms identified by the GATK IndelRealigner. Filtered sites were converted to missing data in the gSYNC file. The location of masked positions for every sample was recorded as a BED file.
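The four masking criteria can be combined as in the following Python sketch for a single sample and chromosomal arm. This is an illustration under simplifying assumptions rather than the DEST implementation: the function names are hypothetical, the 95% coverage percentile is computed with a nearest-rank rule (the rule used by the pipeline is not stated here), and repeats and indels are given as plain lists of positions instead of RepeatMasker and GATK output.

```python
def depth_cutoff(depths, percentile=95):
    """Nearest-rank percentile of the observed read depths."""
    ordered = sorted(depths)
    rank = max(1, (percentile * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def mask_sites(depths, repeat_positions=(), indel_positions=(), min_depth=10, flank=5):
    """Return the sorted positions to mask.

    depths maps position -> read depth for one chromosomal arm of one sample.
    """
    max_depth = depth_cutoff(list(depths.values()))
    repeats = set(repeat_positions)
    near_indel = {pos + offset
                  for pos in indel_positions
                  for offset in range(-flank, flank + 1)}
    masked = [pos for pos, depth in depths.items()
              if depth < min_depth        # (1) too little coverage
              or depth > max_depth        # (2) above the 95% coverage percentile
              or pos in repeats           # (3) inside a repetitive element
              or pos in near_indel]       # (4) within 5 bp of an indel
    return sorted(masked)

# Hypothetical arm of 100 sites with uniform depth 30, one low and one high outlier.
depths = {pos: 30 for pos in range(1, 101)}
depths[3] = 5
depths[50] = 500
masked = mask_sites(depths, repeat_positions=[20], indel_positions=[80])
```

Each masked position would then be set to missing data in the gSYNC file and written to the accompanying BED file.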
VCF generation
We combined the masked PoolSNP-gSYNC files into a two-dimensional matrix, where rows correspond to each position in the reference genome and columns describe chromosome, position and reference allele, followed by allele counts in SYNC format for every sample in the dataset. This combined matrix was then subjected to variant calling using PoolSNP, resulting in a VCF formatted file. We performed SNP calling only for the major chromosomal arms (X, 2L, 2R, 3L, 3R) and the 4th (dot) chromosome.
We first evaluated the choice of two heuristic parameters applied to PoolSNP: global minor allele count (MAC) and global minor allele frequency (MAF). Using all 272 samples, we varied MAF (0.001, 0.01, 0.05) and MAC (5-100) and called SNPs at a randomly selected 10% subset of the genome. We calculated pN/pS and used this value to tune our choice of MAF and MAC. We found that a global MAF=0.001 and a global MAC=50 provided reasonable estimates of pN/pS for all populations. We therefore used these parameters for genome-wide variant calling (see Results: Identification and quality control of SNPs). We kept a third heuristic parameter, the missing data rate, constant at a minimum of 50%.
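The role of the two global thresholds can be illustrated with a simplified Python sketch. It is not the PoolSNP code: the function name and the example counts are hypothetical, the missing data rate filter is omitted, and the sketch simply assumes that an allele is retained when its count summed across all samples reaches the MAC threshold and its global frequency reaches the MAF threshold, with a site called polymorphic when at least two alleles are retained.

```python
def call_site(sample_counts, min_count=50, min_freq=0.001):
    """Return the retained alleles if the site is polymorphic, otherwise None.

    sample_counts is a list with one dict (allele -> read count) per sample.
    """
    totals = {}
    for counts in sample_counts:
        for allele, n in counts.items():
            totals[allele] = totals.get(allele, 0) + n
    depth = sum(totals.values())
    if depth == 0:
        return None
    kept = sorted(allele for allele, n in totals.items()
                  if n >= min_count and n / depth >= min_freq)
    return kept if len(kept) >= 2 else None

# Hypothetical site: the T allele is seen 60 times across three samples.
common = call_site([{"A": 980, "T": 20}, {"A": 960, "T": 40}, {"A": 1000}])
# Hypothetical site: the T allele is seen only 30 times and fails MAC=50.
rare = call_site([{"A": 990, "T": 10}, {"A": 980, "T": 20}, {"A": 1000}])
```

Under this scheme an allele confined to a single sample must be supported by at least 50 reads in that sample alone, which illustrates why high MAC thresholds remove rare private variants together with sequencing errors.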
We generated three versions of the variant files, which differ in their inclusion of the DGN samples and the SNP calling strategy. For PoolSNP variant calling, we generated two variant tables: the first version incorporates all 272 samples of the Pool-Seq (DrosRTEC, DrosEU) and in-silico Pool-Seq populations (DGN). The second version only considers the 246 Pool-Seq samples. We combined masked SNAPE-pooled gSYNC files into a two-dimensional matrix, as described above, and generated a VCF formatted output based on allele counts for any site found to be polymorphic in one or more populations. Based on this dataset we then generated a SNAPE-pooled VCF file, which included the 246 Pool-Seq samples. Final VCF files were annotated with SNPeff (version 4.3; Cingolani et al. 2012) and stored in VCF and BCF (Danecek et al. 2011) file formats alongside an index file in TABIX format (Li 2011). Besides VCF files, we also stored SNP data in the GDS file format using the R package SeqArray (Zheng et al. 2017).
Population genetic analyses
We estimated allele frequencies for each site across populations as the ratio of the alternate allele count to the total site coverage. We also calculated per-site averages for nucleotide diversity (π, Nei 1987), Watterson’s θ (Watterson 1975) and Tajima’s D (Tajima 1989) across all sites or in non-overlapping windows of 100 kb, 50 kb and 10 kb length. To estimate these summary statistics, we converted masked gSYNC files (with positions filtered for repetitive elements, low and high read depth, and proximity to indels; see gSYNC generation and filtering) back to the mpileup format using custom-made scripts. mpileup files were processed using npstat v.1 (Ferretti et al. 2013) with parameters -maxcov 10000 and -nolowfreq m=0 in order to include all filtered positions for analysis. We only considered sites identified as being polymorphic by PoolSNP or SNAPE-pooled for analysis, using the -snpfile option of npstat. For the DGN populations, chromosome-wide summary statistics were estimated only for samples with less than 50% missing data per chromosome. Due to small sample sizes, Tajima’s D was not estimated for 7 African DGN populations that consisted of only 5 haploid embryos. In addition, we calculated pN/pS ratios based on SNP annotations with SNPeff (Cingolani et al. 2012) using a custom-made python script. To compare population genetic estimates between the PoolSNP and SNAPE-pooled datasets, we calculated Pearson’s correlations for the 210 populations present in both datasets (see Identification and quality control of SNPs) using the stats package of R v3.6.3. The effects of pool size (number of individuals sampled per population) on genome-wide estimates of π, Watterson’s θ, Tajima’s D and pN/pS were examined for European and North American populations using the PoolSNP dataset and a generalized linear model (GLM) in R v3.6.3.
Finally, for 48 European populations we estimated Pearson’s correlations between π, Watterson’s θ and Tajima’s D as estimated from the PoolSNP dataset versus previous estimates by Kapun et al. (2020) using the stats package of R v3.6.3.
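Two of the quantities above are simple enough to sketch in Python. The allele frequency follows the definition given in this section (alternate allele count divided by total site coverage). For Watterson’s θ, the sketch uses the classical estimator for a sample of n haploid genomes, θ = S / (a_n × L) with a_n the sum of 1/i for i from 1 to n-1, which is appropriate for individually sequenced genomes; it is not the pool-corrected estimator implemented in npstat, and the function names and example numbers are hypothetical.

```python
def alt_allele_frequency(alt_count, total_depth):
    """Alternate allele count divided by total site coverage (None if uncovered)."""
    if total_depth == 0:
        return None
    return alt_count / total_depth

def watterson_theta(n_segregating, n_haploid, n_sites):
    """Classical per-site Watterson estimator for n haploid genomes."""
    if n_haploid < 2 or n_sites <= 0:
        raise ValueError("need at least two genomes and one site")
    a_n = sum(1.0 / i for i in range(1, n_haploid))
    return n_segregating / (a_n * n_sites)

freq = alt_allele_frequency(alt_count=12, total_depth=48)
# Hypothetical window: 25 segregating sites in 1,000 sites among 5 haploid genomes.
theta = watterson_theta(n_segregating=25, n_haploid=5, n_sites=1000)
```

For n=5 the normalizing constant a_n is only about 2.08, which gives a sense of how little information such small samples carry and is in line with omitting Tajima’s D for the populations of 5 haploid embryos.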
Next, we examined patterns of between-population differentiation by calculating window-wise estimates of pairwise FST, based on the method from Hivert et al. (2018) implemented in the computePairwiseFSTmatrix() function of the R package poolfstat (v1.1.1). This analysis was performed for the dataset composed of 271 samples processed with PoolSNP, focusing on SNPs shared across the whole dataset. Finally, we averaged pairwise FST within and among phylogenetic clusters (Africa [17 samples], North America [76 samples], Eastern Europe [83 samples] and Western Europe [93 samples]; not included: China and Australia). These FST tracks at window sizes of 100 kb, 50 kb and 10 kb are available at https://dest.bio (Supplemental Figures S2, S3).
To assess population structure in the worldwide dataset, we applied PCA, population clustering, and population assignment based on a discriminant analysis of principal components (DAPC; Jombart et al. 2010) to all 271 PoolSNP-processed samples. For these analyses, we subsampled a set of 100,000 SNPs spaced apart from each other by at least 500 bp. We optimized our models using cross-validation by iteratively dividing the data as 90% for training and 10% for learning. We extracted the first 40 PCs from the PCA and ran Pearson’s correlations between each PC and all loci. We subsequently extracted the top 33,000 SNPs with large and significant correlations to PCs 1-40. We chose 33,000 SNPs as a compromise between panel size and differentiation power. For example, depending on the number of individuals surveyed, these 33,000 DIMs can discern divergence (T) between two populations with parametric FST of 0.001-0.0001 for sample sizes (n) of 10-1000. These estimates come from the phase change formula $T \approx F_{ST} = 1/\sqrt{nm}$ (Patterson et al. 2006). Here, each of the two populations is sampled for n/2 individuals, and all individuals are genotyped at m=33,000 markers. Furthermore, we included SNPs as a function of the %VE of each PC. PCAs, clustering, and assignment-based DAPC analyses were carried out using the R packages FactoMineR (v. 2.3), factoextra (v. 1.0.7) and adegenet (v. 2.1.3), respectively.
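The FST range quoted above can be checked by evaluating the phase change formula directly. The short Python sketch below uses a hypothetical function name and simply computes 1/(nm)^(1/2) for m=33,000 markers and the two sample sizes bounding the stated range.

```python
import math

def detectable_fst(n_individuals, n_markers):
    """Phase change threshold: FST below which structure is not expected to be detectable."""
    return 1.0 / math.sqrt(n_individuals * n_markers)

m = 33_000
fst_small_sample = detectable_fst(10, m)     # n = 10 individuals in total
fst_large_sample = detectable_fst(1000, m)   # n = 1000 individuals in total
print(f"n=10:   FST ~ {fst_small_sample:.5f}")
print(f"n=1000: FST ~ {fst_large_sample:.6f}")
```

The two values are approximately 0.0017 and 0.00017, of the same order of magnitude as the range of 0.001-0.0001 given in the text.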
Web-based genome browser
Our HTML-based DEST browser (Supplemental Information Figure S2) is built on a JBrowse Docker container (Buels et al. 2016), which runs under Apache on a CentOS 7.2 Linux x64 server with 16 Intel Xeon 2.4 GHz processors and 32 GB RAM. It implements a hierarchical data selector that facilitates the visualization and selection of multiple population genetic metrics or statistics for the 272 samples based on the PoolSNP-processed dataset, taking into account sampling location and date. Importantly, our genome browser provides a portal for downloading allelic information and pre-computed population genetics statistics in multiple formats (Supplemental Information Figures S2A+C, S3), a usage tutorial (Supplemental Information Figure S2B) and versatile track information (Supplemental Information Figure S2D). Bulk downloads of full variation tracks are available in BigWig format (Kent et al. 2010) and Pool-Seq files (in VCF format) are downloadable by population and/or sampling date using custom options from the Tools menu (Supplemental Information Figure S2C). All data, tools, and supporting resources for the DEST dataset, as well as reference tracks downloaded from FlyBase (v.6.12) (dos Santos et al. 2015), are freely available at https://dest.bio.
Results and Discussion
Integrating a worldwide collection of D. melanogaster population genomics resources
We developed a modular and standardized pipeline for generating allele frequency estimates from pooled resequencing of D. melanogaster genomes (Supplemental Figure 1). Using this pipeline, we assembled a dataset of allele frequencies from 271 D. melanogaster populations sampled around the world (Figure 1A, Supplemental Material, Table S1). Many of these samples were collected at the same location, at different seasons and over multiple years (Figure 1B). The nature of the genomic data for each population varies as a consequence of biological origin (e.g., inbred lines or Pool-Seq), library preparation method, and sequencing platform.
To assess whether these features affect basic attributes of the dataset, we calculated six basic quality metrics (Figure 1C, Supplemental Material, Table S2). On average, median read depth across samples is 62X (DGN samples range: 1-190X; Pool-Seq samples range: 10-217X). Missing data rates were less than 7% for most (95%) of the samples. Excluding populations with high missing data rate (>7%), the proportion of sites with missing data was positively correlated with read depth (p=1.2×10⁻⁹, R²=0.4). The positive correlation between read depth and missing data rate is surprising and likely a consequence of masking sites with high coverage. The number of flies per sample varied from 40 to 205, with considerable heterogeneity among the DrosRTEC samples (standard deviation [sd] = 30), but not among DrosEU samples (sd = 0.04). Variation in the number of flies and in sequencing depth is reflected in the effective read depth, an estimate of the number of independent reads after accounting for the double binomial sampling that occurs during Pool-Seq (Eff. RD; Kolaczkowski et al. 2011; Feder et al. 2012; Figure 1C). There was considerable variation in PCR duplicate rate among samples, with notable differences between batches of DrosEU samples (~6% in 2014 vs. 18% in 2015/16; t-test p=1.8×10⁻¹⁹) and DrosRTEC samples (~3% in samples collected as part of Bergland et al. (2014) vs. ~14% in samples collected as part of Machado et al. (2019); p=6.37×10⁻³). Curiously, the 2015/2016 DrosEU samples were made with a PCR-free kit, suggesting that the observed PCR duplicates were optical duplicates and not amplification artifacts. Contamination of samples by D. simulans varied among populations but was generally negligible (<1% D. simulans-specific reads).
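The idea behind the effective read depth can be sketched in Python. The exact expression used for Figure 1C is not given in this section, so the formula below is an assumption: a commonly used approximation that combines the number of sampled chromosomes and the read depth as (1/N_chr + 1/RD)^(-1). The function name and example values are hypothetical.

```python
def effective_read_depth(n_flies, read_depth, ploidy=2):
    """Approximate number of independent reads under double binomial sampling.

    Assumed approximation: (1/N_chr + 1/RD)^-1 = N_chr * RD / (N_chr + RD),
    where N_chr = ploidy * n_flies. Use ploidy=1 for the X chromosome of male pools.
    """
    n_chr = ploidy * n_flies
    if n_chr <= 0 or read_depth <= 0:
        raise ValueError("number of chromosomes and read depth must be positive")
    return n_chr * read_depth / (n_chr + read_depth)

# Hypothetical pool of 40 flies sequenced to the average median depth of 62X.
eff_rd_autosome = effective_read_depth(n_flies=40, read_depth=62)
eff_rd_x = effective_read_depth(n_flies=40, read_depth=62, ploidy=1)
```

Under this approximation the effective read depth is always smaller than both the number of chromosomes and the read depth, so neither deeper sequencing of a small pool nor a larger pool at low depth can compensate fully for the other.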
Identification and quality control of SNPs
In order to determine appropriate SNP calling and filtering parameters, and to identify potentially problematic population samples, we first calculated the ratio of non-synonymous to synonymous polymorphism (pN/pS) for each population sample. We chose this metric because it can reflect the presence of sequencing errors that would disproportionately inflate pN relative to pS.
For the PoolSNP dataset, we varied the global minor allele count (MAC) and global minor allele frequency (MAF) and then calculated pN/pS. We observed that pN/pS was negatively correlated with MAC (linear regression; p<0.001; Figure 2A). MAC thresholds <50 resulted in large variances of pN/pS caused by 36 populations characterized by unusually high pN/pS ratios (Supplemental Material, Table S3; Figures 2A and 2C). Some (n=21) of these samples had previously been found to show positive values of Tajima’s D across the whole genome (Kapun et al. 2020) and are characterized by a large number of private polymorphisms (Supplemental Material, Table S3; see below), indicating that there may be elevated numbers of sequencing errors in some samples. Applying a MAC threshold of 50 reduced the elevated pN/pS ratios to values similar to the rest of the dataset, suggesting that the potential sequencing errors had been largely removed. To minimize false positive variant calling, we therefore conservatively chose MAC=50 and MAF=0.001 as threshold parameters for SNP calling with PoolSNP. Using these parameters, PoolSNP identified 4,381,144 polymorphisms segregating among the 271 D. melanogaster samples (Pool-Seq plus DGN), and 4,042,456 polymorphisms segregating among the 246 Pool-Seq samples (excluding DGN).
SNAPE calls variants in each sample separately using a probabilistic approach, in contrast to PoolSNP, which integrates allelic information across all populations for heuristic SNP calling. To quantify the amount of putative sequencing errors among low frequency variants we varied the local MAF threshold per sample and calculated pN/pS for each sample in the SNAPE-pooled dataset. Similar to PoolSNP, we found that elevated pN/pS was negatively correlated with a local MAF threshold (linear regression; p<0.001; Figure 2B) and that the 36 above-mentioned problematic samples also had a strong effect on the variance and mean of pN/pS ratios. Accordingly, we removed these 36 samples and applied a conservative MAF filter of 5% for the remainder of the SNAPE-pooled analysis. Our results identified 8,541,651 polymorphisms segregating among the remaining 210 samples. Below, we discuss the geographic distribution and global frequency of SNPs identified using these two methods in order to provide insight into the stark discrepancy in the number of SNPs that they identify.
Patterns of polymorphism between PoolSNP and SNAPE-pooled
We calculated three metrics related to the amount of polymorphism discovered by our pipelines: the abundance of polymorphisms segregating in n populations across each chromosome (Figure 3A), the difference in the number of polymorphisms discovered by SNAPE-pooled and PoolSNP (defined as the absolute value of PoolSNP minus SNAPE-pooled; Figure 3B), and the amount of polymorphism discovered per minor allele frequency bin (Figure 3C). We evaluated these three metrics across a 2×2 filtering scheme: two MAF filters (0.001, 0.05) and two sample sets (the whole dataset of 246 samples; and the 210 samples that passed the sequencing error filter in SNAPE-pooled; see Identification and quality control of SNPs). Notably, PoolSNP was biased towards identification of common SNPs present in multiple samples, whereas SNAPE-pooled was more sensitive to the identification of polymorphisms that appeared in only a few populations (Figure 3B). For example, at a MAF filter of 0.001, SNAPE-pooled discovered more polymorphisms that were shared among fewer than 25 populations (relative to PoolSNP), and these accounted for ~79% of all polymorphisms discovered by the pipeline. Likewise, at a MAF filter of 0.05, SNAPE-pooled discovered more polymorphisms that were shared among fewer than 97 populations; these accounted for ~71% of all discovered polymorphisms. SNAPE-pooled identifies fewer polymorphic sites that are shared among a large number of populations than PoolSNP does because SNAPE-pooled does not integrate information across multiple populations. As a consequence, it can fail to identify SNPs that segregate at low overall frequencies and are therefore called as monomorphic or missing in a subset of populations given the posterior-probability thresholds that we employed (see Materials and Methods).
We also compared allele frequency estimates between the two callers using the aforementioned dataset of 210 populations and applying a MAF filter of 0.05 (see Supplemental Material, Table S2). Among the positions identified as polymorphic by both calling methods, our frequency estimates were consistent for the great majority of SNPs in all samples analyzed (> 97% of samples). A very small proportion differed by less than 5% in frequency between the two methods (< 2.3% in all samples), and very few polymorphic SNPs differed by a frequency of between 5-10% (< 0.15% in all samples) or greater than 10% (< 0.03% in all samples) (Supplemental Material, Table S4). Positions with discordant calls represented less than 25% of all common positions in all samples (Supplemental Material, Table S4), the majority of them being called polymorphic by PoolSNP and classified as missing data by SNAPE-pooled (Supplemental Material, Table S4). This is consistent with the SNAPE-pooled method as well as the stringent parameters used (see Materials and Methods).
Mutation-class frequencies
We estimated the percentage of mutation classes (e.g., A→C, A→G, A→T, etc.) accepted as polymorphisms in both our SNP calling pipelines, and classified these loci as being either “rare” (i.e., allele frequency < 5% and shared in less than 50 populations) or “common” (allele frequency > 5% and shared in more than 150 populations). For this analysis, we classified the minor allele as the derived allele. Figure 4A shows the percentage of each mutation class for the 210 populations which passed filters in both SNAPE-pooled and PoolSNP. In addition, we overlaid, as horizontal lines, the expected mutation frequencies for rare (blue; Assaf et al. 2017) and common (red; Mackay et al. 2012) mutations. For example, A→C variants are expected to be more abundant as common mutations than as rare mutations, and the opposite is true for C→A variants. In general, our SNP discovery pipelines produced mutation-class relative frequencies of rare and common mutations that are consistent with empirical expectations; however, there were some exceptions to this pattern. For example, the frequency of the C/G rare mutation class was consistently underestimated by both callers, a phenomenon that might be related to the known GC bias of modern sequencing machines (Benjamini and Speed 2012). The correlation between SNP calling pipelines was high across both common and rare mutation classes, with marginal discrepancies observed for rare variants (Figure 4B).
Comparison to previously published datasets
We compared the allele frequency and read depth estimates from the DEST dataset (based on PoolSNP) to previously published estimates by Bergland et al. (2014), Machado et al. (2019), and Kapun et al. (2020). For these datasets we employed two types of correlations, the nominal correlation (i.e., Pearson’s correlation; CO) and the concordance correlation coefficient (CCC; Lin 1989; Liao and Lewis 2000). The CCC determines how much the observed data deviate from the line of perfect concordance (i.e., the 45 degree-line on a square scatter plot).
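The difference between the two coefficients can be made concrete with a small sketch. It implements Pearson's correlation and Lin's concordance correlation coefficient, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population (1/n) moments as in Lin (1989). The function name and the toy values are illustrative assumptions, not part of the DEST code base.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def pearson_and_ccc(x, y):
    """Return (Pearson r, Lin's CCC) for two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mx, my = _mean(x), _mean(y)
    var_x = _mean([(a - mx) ** 2 for a in x])
    var_y = _mean([(b - my) ** 2 for b in y])
    cov = _mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    if var_x == 0 or var_y == 0:
        raise ValueError("both inputs must vary")
    pearson = cov / math.sqrt(var_x * var_y)
    # The squared mean difference penalizes systematic shifts away from
    # the 45-degree line, which Pearson's r ignores.
    ccc = 2 * cov / (var_x + var_y + (mx - my) ** 2)
    return pearson, ccc


# Toy allele frequencies: the second set is shifted upward by a constant.
freqs_a = [0.1, 0.2, 0.3, 0.4]
freqs_b = [f + 0.1 for f in freqs_a]
r, ccc = pearson_and_ccc(freqs_a, freqs_b)
print(round(r, 3), round(ccc, 3))
```

A constant offset leaves Pearson's r at 1 but lowers the CCC, which is the kind of pattern that distinguishes preserved sample identity from systematic differences between datasets.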
Estimates of allele frequency were strongly correlated and consistent with previously published data. The strongest correlation of DEST allele frequencies and previously published allele frequencies was observed with the data of Kapun et al. (2020) (average CO and CCC > 0.99; Figure 5, top row; Supplemental Material, Figure S4). Allele frequency correlations with Machado et al. (2019) are also generally high (average CO and CCC > 0.98; Figure 5, top row; Supplemental Material, Figure S5). Allele frequency correlations with the data from Bergland et al. (2014) were lower (0.94; Supplemental Material, Figure S6), likely reflecting differences in data processing and quality control.
We also examined two aspects of read depth, i.e., nominal coverage and effective coverage. Nominal coverage is the number of reads mapping to a site that has passed quality control. Effective coverage is the approximate number of independent reads, after accounting for double binomial sampling, and is useful for obtaining unbiased estimates of the precision of allele frequency estimates (Kolaczkowski et al. 2011; Kofler et al. 2011a; Feder et al. 2012; Schlötterer et al. 2014). Similar to allele frequency estimates, the Pearson correlation coefficients for both coverage and effective coverage were large (0.92, 0.95, 0.90 for Machado et al. (2019), Kapun et al. (2020), and Bergland et al. (2014), respectively; see Supplemental Material, Figures S7-12), indicating that sample identity was preserved appropriately. However, the concordance correlation coefficients were substantially lower between the datasets (0.24, 0.88, 0.79, respectively), indicating systematic differences in read depth between the DEST dataset and previously published data. Indeed, read depth estimates were on average ~12%, ~14% and ~20% lower in the DEST dataset as compared to the previously published data in Machado et al. (2019), Kapun et al. (2020), and Bergland et al. (2014), respectively. The lower read depth and effective read depth estimates in the DEST dataset reflect our more stringent quality control and filtering.
Genetic diversity
We estimated nucleotide diversity (π), Watterson’s θ and Tajima’s D for both the PoolSNP and SNAPE-pooled datasets (Supplemental Material, Table S5). Results for the African, European and North American population samples are presented in Figure 6 (also see Supplemental Material, Figure S13 for estimates by chromosome arm). All estimates were positively correlated between PoolSNP and SNAPE-pooled (p<0.001), with Pearson’s correlation coefficients of 0.88, 0.94 and 0.73 for π, Watterson’s θ, and Tajima’s D, respectively. Higher values of genetic diversity were obtained for the SNAPE-pooled dataset, probably due to its higher sensitivity for detecting rare variants (see Patterns of polymorphism between PoolSNP and SNAPE-pooled). Pool size had no significant effect on these summary statistics in European or in North American populations (GLMs, all p>0.05), suggesting that data from populations with heterogeneous pool sizes can be safely merged for accurate population genomic analysis.
The highest levels of genetic variability were observed for ancestral African populations (mean π = 0.0060, mean θ = 0.0059); North American populations exhibited higher genetic variability (mean π = 0.0054, mean θ = 0.0054) than European populations (mean π = 0.0049, mean θ = 0.0048). These results are consistent with previous observations based on individual genome sequencing (e.g., see Lack et al. 2016; Kapun et al. 2020). Our observations are also consistent with previous estimates based on pooled data from three North American populations (mean π = 0.00577, mean θ = 0.00597; Fabian et al. 2012) and 48 European populations (mean π = 0.0051, mean θ = 0.0052; Kapun et al. 2020). Estimates of Tajima’s D were positive when using PoolSNP, and slightly negative using SNAPE. These results are expected given biases in the detection of rare alleles between these two SNP calling methods. In addition, our estimates for π, Watterson’s θ and Tajima’s D were positively correlated with previous estimates for the 48 European populations analyzed by Kapun et al. (2020) (all p<0.01). Notably, slightly lower levels of Tajima’s D in North America compared to both Africa and Europe (Figure 6B) may be indicative of admixture (Stajich and Hahn 2005), which has been identified previously along the North American east coast (Caracristi and Schlötterer 2003; Kao et al. 2015; Bergland et al. 2016).
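To illustrate why the sensitivity of a caller to rare alleles changes the sign of Tajima’s D, the following sketch computes π, Watterson’s θ and Tajima’s D from minor allele frequencies using the classical estimators for a sample of n chromosomes (Tajima 1989). This is an assumption-laden simplification: it ignores the Pool-Seq-specific corrections that dedicated software applies, and the function name and toy inputs are hypothetical.

```python
import math


def diversity_stats(minor_freqs, n_chrom, n_sites):
    """Classical pi, Watterson's theta (both per site) and Tajima's D.

    minor_freqs: minor allele frequency at each segregating site.
    n_chrom: number of sampled chromosomes (n >= 2).
    n_sites: total number of sites surveyed (polymorphic or not).
    """
    n, s = n_chrom, len(minor_freqs)
    if n < 2 or n_sites <= 0:
        raise ValueError("need n_chrom >= 2 and n_sites > 0")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    # Mean number of pairwise differences, with small-sample correction.
    k = sum(2.0 * p * (1.0 - p) for p in minor_freqs) * n / (n - 1)
    pi = k / n_sites
    theta_w = s / a1 / n_sites
    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    tajima_d = (k - s / a1) / math.sqrt(variance) if variance > 0 else float("nan")
    return pi, theta_w, tajima_d


# Toy contrast: ten singleton-like SNPs versus ten intermediate-frequency SNPs.
rare = diversity_stats([0.1] * 10, n_chrom=10, n_sites=1000)
common = diversity_stats([0.5] * 10, n_chrom=10, n_sites=1000)
print("rare-biased D:", round(rare[2], 2))
print("common-biased D:", round(common[2], 2))
```

With the same number of segregating sites, a call set dominated by rare alleles yields a negative D and one dominated by intermediate-frequency alleles yields a positive D, in line with the contrast between SNAPE-pooled and PoolSNP described above.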
Phylogeographic clusters in D. melanogaster
We performed PCA on the PoolSNP variants in order to include samples from the North American (DrosRTEC), European (DrosEU), and African (DGN) datasets (excluding all Asian and Oceanian samples). Prior to analysis we filtered the joint datasets to include only high-quality biallelic SNPs. Because LD decays rapidly in Drosophila (Comeron et al. 2012), we only considered SNPs at least 500 bp away from each other. PCA on the resulting 100,000 SNPs revealed evidence for discrete phylogeographic clusters that correspond to geographic regions (Supplemental Material, Figure S14B). PC1 (24% variance explained [VE]) partitions samples between Africa and the other continents (Figure 7A). PC2 (9% VE) separates European from North American populations, and both PC2 and PC3 (4% VE) divide Europe into two population clusters (Figure 7B). Notably, these spatial relationships become evident when PCA projections from each sample are plotted onto a world map (Figure 7C). Interestingly, the emergent clusters in Europe are not strictly defined by geography. For example, the western cluster (diamonds in Figure 7D) includes Western Europe as well as Finland, Turkey, Cyprus, and Egypt. The eastern cluster, on the other hand, consists of several populations collected in former Soviet republics as well as Poland, Hungary, Serbia and Austria, raising the possibility that recent geopolitical division in Europe could have affected migration and population structure. Whether this result arises as a relic of recent geopolitical history within Europe, more ancient migration and colonization (e.g., following post-glacial range expansion, Kapun et al. 2020), local adaptation, or sampling strategy (Novembre and Stephens 2008; cf. Kapun et al. 2020) remains unknown. Future targeted sampling is needed to resolve these alternative explanations.
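The distance-based thinning step described above can be sketched as a greedy filter. This is only one plausible way to enforce a minimum spacing of 500 bp; the function name and the (chromosome, position) tuple representation are assumptions made for illustration, and the actual filtering in the analysis scripts may have been done differently.

```python
def thin_snps(snps, min_distance=500):
    """Greedily keep SNPs that lie at least min_distance bp apart.

    snps: iterable of (chromosome, position) tuples in any order.
    Returns the retained SNPs sorted by chromosome and position.
    """
    kept = []
    last_chrom, last_pos = None, None
    for chrom, pos in sorted(snps):
        # Always keep the first SNP on a chromosome; afterwards require
        # the minimum spacing relative to the last retained SNP.
        if chrom != last_chrom or pos - last_pos >= min_distance:
            kept.append((chrom, pos))
            last_chrom, last_pos = chrom, pos
    return kept


# Toy positions on two chromosome arms.
toy = [("2L", 100), ("2L", 300), ("2L", 600), ("2L", 1100), ("2R", 150)]
print(thin_snps(toy))
```

Because the filter compares each SNP with the last retained one rather than with its immediate neighbor, the result depends on the starting position on each chromosome, which is usually acceptable when the goal is simply to reduce linkage among markers before PCA.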
A unique feature of this dataset is that it contains a mixture of Pool-Seq and inbred (or haploid) genome data. For some geographic regions, the DEST dataset contains both data types. Inbred and Pool-Seq samples from nearby geographic regions clustered in the same regions of PC space (Supplemental Material, Figure S15). Excluding the DGN-derived African samples, no PC was significantly correlated with data type (PC1 p = 0.352, PC2 p = 0.223, PC3 p = 0.998).
Geographic proximity analysis
The geographic distribution of our samples allows leveraging basic principles of phylogeography and population genetics to assess the biological significance of rare SNPs (Wright 1943; Battey et al. 2020). Accordingly, we expect to observe young neutral alleles at low frequencies among geographically close populations. We tested this hypothesis by estimating the average geographic distance among pairs of populations that share SNPs only occurring in these two populations (doubletons), among three populations that share tripletons, and so forth. Without imposing a MAF filter, both SNAPE-pooled and PoolSNP pipelines produced patterns concordant with the expectation. Populations in close proximity were more likely to share rare mutations relative to random chance pairings (Figure 8A). Notably, the PoolSNP dataset showed an elevated number of rare alleles, which violate the phylogeographic expectation (Figure 8A); however, this only affects 0.31% of all PoolSNP mutations. To further evaluate this pattern, we estimated the probability that any given population pair belongs to a particular phylogeographic cluster (Supplemental Material, Figure S16) as a function of their shared variants. Our results indicate that rare variants, private to geographically proximate populations, are strong predictors of phylogeographic provenance (see Figure 8B).
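The distance calculation behind this analysis can be sketched as follows: for the set of populations sharing a rare variant, compute the mean great-circle distance over all pairs. The sketch uses the haversine formula with a mean Earth radius of 6371 km; the function names and coordinates are illustrative assumptions, and the comparison against random population pairings is omitted for brevity.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_pairwise_distance_km(coords):
    """Mean distance over all pairs of (lat, lon) sampling locations."""
    pairs = list(combinations(coords, 2))
    if not pairs:
        raise ValueError("need at least two locations")
    return sum(haversine_km(*p, *q) for p, q in pairs) / len(pairs)


# A doubleton shared by two hypothetical nearby sites versus two distant ones.
near = mean_pairwise_distance_km([(48.2, 16.4), (47.5, 19.0)])
far = mean_pairwise_distance_km([(48.2, 16.4), (38.0, -78.5)])
print(round(near), round(far))
```

Averaging such distances over all doubletons, tripletons, and so on, and comparing them with the same quantity for randomly drawn population sets, gives the kind of contrast summarized in Figure 8A.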
Demography-informative markers
An inherent strength of our broad biogeographic sampling is the potential to generate a panel of core demography SNPs to investigate the provenance of current and future samples. We created a panel of demography-informative markers (DIMs) by conducting a DAPC to discover which loci drive the phylogeographic signal in the dataset. We trained two separate DAPC models: the first utilized the four phylogeographic clusters identified by principal components (PCs; Figure 6AB, Supplemental Material, Figure S16, Table S1); the second utilized the geographic localities where the samples were collected (i.e., countries in Europe and the US states). This optimization indicated that the information contained in the first 40 PCs maximizes the probability of successful assignment (Figure 9A). This resulted in the inclusion of 30,000 DIMs, most of which were strongly associated with PCs 1-3 (Figure 9B inset). Moreover, the correlations were larger among the first 3 PCs and decayed monotonically for the additional PCs (Figure 9B). Lastly, our DIMs were uniformly distributed across the fly genome (Figure 9C).
We assessed the accuracy of our DIM panel using a leave-one-out cross-validation approach (LOOCV). We trained the DAPC model using all but one sample and then classified the excluded sample. We performed LOOCV separately for the phylogeographic cluster groups, as well as for the state/country labels. The phylogeographic model used all DrosRTEC, DrosEU, and DGN samples (excluding Asia and Oceania with too few individuals per sample); the state/country model used only samples for which each label had at least 3 samples. Our results showed that the model is 100% accurate in terms of resolving samples at the phylogeographic cluster level (Figure 9D) and 89% at the state/country level (Figure 9E). We anticipate that this set of DIMs will be useful for future analysis of geographic provenance of North American and European samples. We provide a tutorial on the usage of the DIMs in Supplemental Methods.
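The leave-one-out procedure itself is independent of the classifier and can be sketched in a few lines. Note that the example below deliberately swaps in a simple nearest-centroid classifier as a stand-in for DAPC, which is not reimplemented here; the function names, feature vectors and labels are all hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(samples):
    """Mean feature vector per label; samples are (features, label) pairs."""
    sums, counts = {}, defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for j, value in enumerate(features):
            sums[label][j] += value
        counts[label] += 1
    return {lab: [v / counts[lab] for v in vec] for lab, vec in sums.items()}


def classify(features, centroids):
    """Assign the label of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda lab: math.dist(features, centroids[lab]))


def loocv_accuracy(samples):
    """Train on all samples but one, classify the held-out one, repeat."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        if classify(features, class_centroids(training)) == label:
            correct += 1
    return correct / len(samples)


# Toy PC coordinates for two well-separated clusters, three samples each.
toy = [
    ([0.0, 0.1], "Europe"), ([0.1, 0.0], "Europe"), ([0.05, 0.05], "Europe"),
    ([5.0, 5.1], "Africa"), ([5.1, 5.0], "Africa"), ([4.9, 5.0], "Africa"),
]
print(loocv_accuracy(toy))
```

Requiring several samples per label, as done for the state/country model, ensures that a class still has training examples when one of its members is held out.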
Conclusions and Outlook
Here we have presented a new, modular and unified bioinformatics pipeline for processing, integrating and analyzing SNP variants segregating in population samples of D. melanogaster. We have used this pipeline to assemble the largest worldwide data repository of genome-wide SNPs in D. melanogaster to date, based both on previously published data (DGN: Africa; Lack et al. 2015, 2016) as well as on new data collected by our two collaborating consortia (DrosRTEC: mostly North America; Machado et al. 2019; DrosEU: mostly Europe; Kapun et al. 2020). We assembled this dataset using two SNP calling strategies that differ in their ability to identify rare polymorphisms, thereby enabling future work studying the evolutionary history of this species. We are dubbing this data repository and the supporting bioinformatics tools Drosophila Evolution over Space and Time (DEST).
One of the biggest challenges in the present “omics” era is the rapidly growing number of complex large-scale datasets which require technically elaborate bioinformatics know-how to become accessible and utilizable. This hurdle often prohibits the exploitation of already available genomics datasets by scientists without a strong bioinformatics or computational background. To remedy this situation for the Drosophila evolution community, our bioinformatics pipeline is provided as a Docker image (to standardize across software versions, as well as make the pipeline independent of specific operating systems) and a new genome browser makes our SNP dataset available through an easy-to-use web interface (see Supplemental Information Figures S2, S3; available at https://dest.bio).
The DEST data repository and platform will enable the population genomics community to address a variety of longstanding, fundamental questions in ecological and evolutionary genetics. The current dataset might for instance be valuable for providing a more accurate picture of the demographic history of D. melanogaster populations, in particular in Europe and North America, and with respect to multiple bouts of out-of-Africa migration and recent patterns of admixture.
The DEST dataset will likewise be useful for an improved understanding of the genomic signatures underlying both global and local adaptation, including a more fine-grained view of selective sweeps, their evolutionary origin and distribution (e.g., see Glinka et al. 2003; Beisswanger et al. 2006; Ometto 2010; Stephan 2016; Kapun et al. 2020). In terms of local adaptation, the broad spatial sampling across latitudinal and longitudinal gradients on the North American and European continents, encompassing a broad range of climate zones and areas of varying degrees of seasonality, will allow examining the parallel nature of local (clinal) adaptation in response to similar environmental factors in greater depth than possible before (e.g., Turner et al. 2008; Kolaczkowski et al. 2011; Fabian et al. 2012; Reinhardt et al. 2014; Kapun et al. 2016, 2020; Machado et al. 2019; Bogaerts-Márquez et al. 2020; Waldvogel et al. 2020).
Another major opportunity provided by the DEST dataset lies in studying the temporal dynamics of evolutionary change. Sampling at dozens of localities across the growing season and over multiple years will help to advance our understanding of the short-term population and evolutionary dynamics of flies living in diverse environments, thereby providing novel insights into the nature of temporally varying selection (e.g., Wittmann et al.; Bergland et al. 2014; Machado et al. 2019) and evolutionary responses to climate change (e.g., Umina 2005; Rodríguez-Trelles et al. 2013; Waldvogel et al. 2020).
Moreover, by integrating these worldwide estimates of allele frequencies with those from lab- and field-based ‘evolve and resequence’ (E & R; Turner et al. 2011; reviewed in Kofler and Schlötterer 2014; Schlötterer et al. 2014; Flatt 2020) and mesocosm experiments (e.g., Rudman et al. 2019; Erickson et al. 2020), we might be able to gain deeper insights into the genetic basis and evolutionary history of variation in fitness components (e.g., Flatt 2020).
The real value of the DEST dataset lies in the future: its long-term utility will grow as natural and experimental populations are continually being sampled, resequenced and added to the repository by the community of Drosophila evolutionary geneticists. The pipeline that we have established will make future updates to the data-repository straightforward. Furthermore, since it is not easily feasible for any single research group to sample flies densely through time and across a broad geographic range, the growing value of the DEST dataset will depend upon the synergistic collaboration among research groups across the globe, as exemplified by the DrosRTEC and DrosEU consortia. Importantly, in an era of rapidly decreasing sequencing costs, comprehensive population genomic analyses are no longer limited by genetic marker density but by the availability of biological samples from standardized, collaborative long-term collection efforts through space and time (e.g., Machado et al. 2019; Kapun et al. 2020). In this vein, the collaborative framework presented here might allow us, as a global community, to fill some important gaps in the current data repository: for example, many areas of the world (notably Asia and South America) remain largely uncharted territory in Drosophila population genomics, and the addition of phased sequencing data (e.g., providing information on haplotypes, LD, linked selection) will be crucially important for future analyses of demography, selection and their interplay.
We are convinced that the DEST platform will become a valuable and widely-used resource for scientists interested in Drosophila evolution and genetics, and we actively encourage the community to join the collaborative effort we are seeking to build.
Data availability
All scripts to make figures and perform analyses associated with this manuscript are available here: https://github.com/DEST-bio/data-paper. All scripts to build the dataset, including the mapping pipeline, SNP calling scripts, and meta-data are available here: https://github.com/DEST-bio/DEST_freeze1. All output from the DEST pipeline, including intermediate output files, metadata, etc. can be found here: https://dest.bio. The genome browser associated with the DEST dataset can be found here: http://dgvbrowser.uab.cat/dest/browser/. The mapping and SNP calling pipeline can be found here: https://hub.docker.com/r/destbiodocker/destbiodocker
Author contributions
Martin Kapun: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Project Administration, Resources, Software, Supervision, Visualization, Writing - original draft, Writing - review & editing. Joaquin Nunez: Formal Analysis, Methodology, Software, Visualization, Writing - original draft, Writing - review & editing. María Bogaerts-Márquez: Formal Analysis, Methodology, Software, Visualization, Writing - original draft, Writing - review & editing. Jesús Murga-Moreno: Formal Analysis, Methodology, Software, Visualization, Writing - original draft, Writing - review & editing. Margot Paris: Formal Analysis, Methodology, Software, Visualization, Writing - original draft, Writing - review & editing. Joseph Outten: Software, Writing - review & editing. Marta Coronado-Zamora: Formal Analysis, Methodology, Software, Visualization, Writing - original draft, Writing - review & editing. Aleksandra Patenkovic: Resources. Amanda Glaser-Schmitt: Resources. Anna Ullastres: Resources. Antonio J. Buendía-Ruíz: Resources. Banu S. Onder: Resources. Brian P Lazzaro: Resources, Writing - review & editing. Catherine Montchamp-Moreau: Resources. Christopher W. Wheat: Resources, Writing - review & editing. Cristina P. Vieira: Resources, Writing - review & editing. Daniel K. Fabian: Resources. Darren J. Obbard: Resources. Dmitry V. Mukha: Resources. Dorcas J. Orengo: Resources, Writing - review & editing. Elena Pasyukova: Resources. Eliza Argyridou: Resources. Emily L. Behrman: Resources, Writing - review & editing. Eran Tauber: Resources. Eva Puerma: Resources, Writing - review & editing. Fabian Staubach: Resources, Writing - review & editing. Francisco D Gallardo-Jiménez: Resources. Iryna Kozeretska: Resources. J. Roberto Torres: Resources. Jessica K. Abbott: Resources. John Parsch: Funding acquisition, Resources, Writing - review & editing. Jorge Vieira: Resources, Writing - review & editing. M. Josefa Gómez-Julián: Resources. 
Katarina Eric: Resources. Kelly A. Dyer: Resources. Lain Guio: Resources. Lino Ometto: Writing - review & editing. M. Luisa Espinosa-Jimenez: Resources. Maaria Kankare: Resources, Writing - review & editing. Mads F. Schou: Resources, Writing - review & editing. Maria P. García Guerreiro: Resources, Writing - review & editing. Marija Savic Veselinovic: Resources. Marija Tanaskovic: Resources. Marina Stamenkovic-Radak: Funding acquisition, Resources. Mihailo Jelic: Resources. Miriam Merenciano: Resources. Oleksandr M. Maistrenko: Writing - review & editing. Omar Rota-Stabelli: Resources. Sara Guirao-Rico: Resources, Writing - review & editing. Sònia Casillas: Resources, Writing - review & editing. Sonja Grath: Resources. Stephen W. Schaeffer: Resources. Subhash Rajpurohit: Resources. Svitlana V. Serga: Resources. Thomas J.S. Merritt: Resources. Vivien Horváth: Resources. Vladimir E. Alatortsev: Resources. Volker Loeschcke: Resources. Yun Wang: Resources. Antonio Barbadilla: Software, Writing - review & editing. Dmitri Petrov: Conceptualization, Funding acquisition, Project Administration, Resources, Writing - review & editing. Paul Schmidt: Conceptualization, Funding acquisition, Project Administration, Resources, Writing - review & editing. Josefa Gonzalez: Conceptualization, Funding acquisition, Project Administration, Resources, Supervision, Writing - original draft, Writing - review & editing. Thomas Flatt: Conceptualization, Funding acquisition, Project Administration, Resources, Supervision, Writing - original draft, Writing - review & editing. Alan Bergland: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Project Administration, Resources, Software, Supervision, Visualization, Writing - original draft, Writing - review & editing
Competing interests
The authors declare no competing interests.
Acknowledgements
We are grateful to all the members of the DrosEU and DrosRTEC consortia for their long-standing support, collaboration and for discussion. DrosEU is funded by a Special Topic Networks (STN) grant from the European Society for Evolutionary Biology (ESEB). MK (M. Kapun) was supported by the Austrian Science Foundation (grant no. FWF P32275); JG by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (H2020-ERC-2014-CoG-647900) and by the Spanish Ministry of Science and Innovation (BFU-2011-24397); TF by the Swiss National Science Foundation (SNSF grants PP00P3_133641, PP00P3_165836, and 31003A_182262) and a Mercator Fellowship from the German Research Foundation (DFG), held as an EvoPAD Visiting Professor at the Institute for Evolution and Biodiversity, University of Münster; AOB by the National Institutes of Health (R35 GM119686); MK (M. Kankare) by Academy of Finland grant 322980; VL by Danish Natural Science Research Council (FNU) grant 4002-00113B; FS by Deutsche Forschungsgemeinschaft (DFG) grant STA1154/4-1, Project 408908608; JP by the Deutsche Forschungsgemeinschaft Projects 274388701 and 347368302; AU by FPI fellowship (BES-2012-052999); ET by Israel Science Foundation (ISF) grant 1737/17; MSV, MSR and MJ by a grant from the Ministry of Education, Science and Technological Development of the Republic of Serbia (451-03-68/2020-14/200178); AP, KE and MT by a grant from the Ministry of Education, Science and Technological Development of the Republic of Serbia (451-03-68/2020-14/200007); and TM by NSERC grant RGPIN-2018-05551.