Genomes from 25 historical Drosophila melanogaster specimens illuminate adaptive and demographic changes across more than 200 years of evolution

The ability to perform genomic sequencing on long-dead organisms is opening new frontiers in evolutionary research. These opportunities are especially profound in the case of museum collections, from which countless documented specimens may now be suitable for genomic analysis. Here, we report 25 newly sequenced genomes from museum specimens of the model organism Drosophila melanogaster, including the oldest extant specimens of this species. By comparing historical samples ranging from the early 1800s to 1933 against modern day genomes, we document evolution across thousands of generations, including time periods that encompass the species’ initial occupation of northern Europe and an era of rapidly increasing human activity. At the genome-wide level, we find that historical flies from the same time and place show much greater evidence for relatedness than flies from modern collections, and some show evidence of inbreeding as well, potentially reflecting either much smaller local population sizes in the past or else the specific circumstances of the collections. We also find that the Lund, Sweden population underwent local genetic differentiation during the early 1800s to 1933 interval (potentially due to accelerated drift) but then became more similar to other European populations thereafter (potentially due to increased migration). Within each time period, our temporal sampling allows us to document compelling candidates for recent natural selection. In some cases, we gain insights regarding previously implicated selection candidates, such as ChKov1, for which our inferred timing of selection favors the hypothesis of antiviral resistance over insecticide resistance. Other candidates are novel, such as the circadian-related gene Ahcy, which yields a selection signal that rivals that of the DDT resistance gene Cyp6g1. These insights deepen our understanding of recent evolution in a model system, and highlight the potential of future museomic studies.


INTRODUCTION
Drosophila melanogaster individuals has enabled the study of genomic evolution

302
In line with the chromosome-wide distances cited above, we inferred many IBD windows 303 spanning various lengths, up to whole chromosome arms (Table S2; Table S3

313
In addition to IBD between genomes, we found within-genome IBD due to inbreeding 314 -several samples from 1933 had long regions of chromosomes with depleted heterozygosity 315 ( Table S2; Table S4   for (2L)t), whereas no inversions were detected in any 1800s genome (Table S5). Inversion 346 frequencies were however, also somewhat low in the modern Lund data, with only (2L)t (at 347 13%) giving a non-zero frequency estimate. The inferred frequency changes in (2L)t were 348 statistically significant when modern Lund was compared to all historical genomes (Fisher exact frequencies through time, but further temporal sampling will be needed to evaluate the 352 previous suggestion of a recent African origin for most inversions now present in European D.

367
(A) Genetic differentiation between population samples is summarized by F ST (below 368 diagonal, red heat map), and by the rate of pairwise sequence differences D xy (above 369 diagonal, blue heat map). Along the diagonal, the rate of pairwise sequence differences 370 within populations (nucleotide diversity) is given. Data reflects the X chromosome only (to 371 avoid effects of inversions), with down-sampling to four alleles per population at each site.

372
The Lund population appears to show decreasing diversity with time, subject to caveats

397
We further investigated a simple model of the demographic history of Lund by using 398 population genetic simulations to fit single-population models of genetic drift. We varied the 399 effective population size (N e ) for a given time interval and compared the mean difference in 400 allele frequency at non-singleton SNPs between a given pair of time points from simulated 401 versus empirical data, using a Bayesian approach to identify the best-fitting N e for each time

459
We complemented this SNP-based approach with a parallel statistical methodology to 460 search for copy number variants (CNVs). We treated each population sample's mean read 461 depth within a given window (scaled relative to its chromosome arm average) as a quantitative

517
The top outlier for the 1933-2010s interval (denoted region B1) spanned 75 kb and 518 included a known target of insecticide resistance evolution, Cytochrome P450 6g1 (Cyp6g1, 519 shown in Figure 5D). This gene is associated with resistance to dichlorodiphenyl-

889
For those museum genomes with a total coverage of < 80% and average per-site read 890 depth of < 10, we concluded that diploid genotype calls could not be confidently made. These for these low-quality genomes. When more than a single base was identified at a (pseudo-894 haploid) site, we retained the most frequent allele (or selected one at random in cases of a tie).

895
For diploid genotypes, we required at least 25% of reads to represent a minor allele, otherwise

953
We then sought to account for genome-specific differences in genetic distances, such as 954 due to sample quality. For each chromosome window of the ith genome, a background 955 genetic distance d i is the median Hamming distance between the ith genome and unrelated a distance correction factor m i that accounts for its tendency to yield higher or lower distances 958 than others from its population sample. The baseline population-specific distances D for 1800s

968
(7/8)D. The reasoning is that if one allele from each individual is IBD, then ¼ of pairwise 969 allelic comparisons will yield zero distance due to this IBD, and hence the overall 970 expected distance is ¾ of that from a non-IBD pair. For haploid to haploid (male X 971 chromosome comparison) the same rule with a factor of ½ is applied, while for haploid 972 to diploid (male to female X chromosome comparison) a factor of ¾ is used.

973
B. When pairwise IBD regions are identified, this window is masked in one genome of the 974 pair. The masking is prioritized so that low-quality genomes are preferentially masked 975 when compared to a higher quality genome in order to minimize data loss.

976
C. To avoid false negatives due to IBD regions covering partial windows, when a window 977 was masked, the proximate half of the "upstream" and the "downstream" window was 978 also masked. While this practice may mask some non-IBD regions, it errs on the 979 conservative side for avoiding oversampling IBD alleles (since IBD covering more than 980 half of a flanking window would be likely to satisfy the above masking criteria and cause 981 the full window to be masked).

982
In addition to IBD among genomes, there was also evidence for depleted 983 heterozygosity as a consequence of inbreeding in a subset of genomes from both the 1800s the heterozygosity should approximately equal nucleotide diveristy ( ). Therefore, to identify enrichment of homozygosity in inbred genomes, we compared the observed heterozygosity 987 per window to the median pairwise distance among non-IBD pairs from the same population 988 in that window. If the within-window heterozygosity of a sample was less than /2, the window 989 was flagged as IBD and a randomly selected allele of the two in the window was masked.

990
We included samples believed to originate from other sites and time points (18DZ5 991 from Zealand, Denmark and 18SL6, which was labelled as being from Smäland, Sweden) in the 992 IBD analyses, because the latter showed low divergence when compared to some of the 1800s 993 Lund samples. There was unambiguous evidence of close relatedness between 18SL6 and 994 18SL10, suggesting that the former was in fact from Lund and thus could be grouped with this 995 population in subsequent analyses. In contrast, 18DZ5 showed roughly 1% of its genome with 996 apparent IBD with one Lund genome, and lesser levels with two others. As these could be 997 false positives due to conservative filtering criteria, we continued to treat this sample as being          Table S4. Regions of each museum fly genome that were masked due to inbreeding IBD.

1681
For each GO category, the total number of genes in the analyzed regions is given, along with 1682 the number of outlier regions associated with these genes, and the identities of these outlier-1683 overlapping genes. A permutation P value is also given for each GO category, which does not