The landscape of nucleotide diversity in Drosophila melanogaster is shaped by mutation rate variation

What shapes the distribution of nucleotide diversity along the genome? Attempts to answer this question have sparked debate about the roles of neutral stochastic processes and natural selection in molecular evolution. However, the mechanisms of evolution do not act in isolation, and integrative models that simultaneously consider the influence of multiple factors on diversity are lacking; without them, confounding factors lurk in the estimates. Here we present a new statistical method that jointly infers the genomic landscapes of genealogies, recombination rates and mutation rates. In doing so, our model captures the effects of genetic drift, linked selection and local mutation rates on patterns of genomic variation. We then formalize a causal model of how these microevolutionary mechanisms interact, and cast it as a linear regression to estimate their individual contributions to levels of diversity along the genome. Our analyses reclaim the well-established signature of linked selection in Drosophila melanogaster, but we estimate that the mutation landscape is the major driver of the genome-wide distribution of diversity in this species. Furthermore, our simulation results suggest that in many evolutionary scenarios the mutation landscape will be a crucial factor shaping diversity, depending notably on the genomic window size. We argue that incorporating mutation rate variation into the null model of molecular evolution will lead to more realistic inferences in population genomics.


Introduc on
Understanding how various evoluonary mechanisms shape nucleode diversity -typically measured as the average pairwise heterozygosity, π -is a major goal of populaon genomics (Charlesworth, 2010;Ellegren & Galer, 2016), with a rich history of theorecal and empirical studies that have the fruit (y Drosophila melanogaster as its centerpiece (Casillas & Barbadilla, 2017;Charlesworth & Charlesworth, 2017;Haudry et al., 2020). For many years, the debate focused on the relave importance of genec dri/ and natural selecon to the genome-wide average π (Kimura, 1968;Ohta, 1992). The observaon that π does not scale linearly with populaon size across species (Lewonn, 1974) was termed "Lewonn's Paradox", and recent work has taken a new stab at this old problem by modeling the e>ect of natural selecon (Bu>alo, 2021;Galer & Rousselle, 2020). Later on, with recognion that linkage and recombinaon wrap the genome in regions of correlated evoluonary histories (Hudson, 1983;Hudson & Kaplan, 1985), focus shi/ed toward understanding how diversity levels vary along chromosomes of single species (Pouyet & Gilbert, 2021). In 1992, Begun and Aquadro found a posive correlaon between π and local recombinaon rate in D. melanogaster, (Begun & Aquadro, 1992) which was interpreted as the signature of linked selecon (CuEer & Payseur, 2013;Hudson & Kaplan, 1988) -at Frst in terms of selecve sweeps (Smith & Haigh, 1974;Stephan et al., 1992;Wiehe & Stephan, 1993) and soon re-framed in the light of background selecon (Charlesworth et al., 1993;Hudson & Kaplan, 1995, 1994Nordborg et al., 1996). In the three decades since these seminal works, idenfying the drivers of the genome-wide distribuon of diversity became a leading quest in the Feld of populaon genecs. Nevertheless, this search has so far been incomplete. The literature has mostly considered how paEerns of diversity are a>ected by selecon (AndolfaEo, 2007;Comeron, 2014;Elyashiv et al., 2016;McVicker et al., 2009;Murphy et al., 2022) or introgression (Hubisz et al., 2020;Stankowski et al., 2019), whereas spaal variaon in de novo mutaon rates (μ) has been largely ignored as an actual mechanism of variaon in π along the genome, presumably due to challenges in its esmaon (Besenbacher et al., 2019;Jónsson et al., 2018). Yet a study based on human trios advocates that the impact of the mutaon landscape on polymorphism may be greater than previously recognized: up to 46% of the human-chimpanzee divergence, and up to 69% of within-human diversity, can potenally be explained by variaon in de novo mutaon rates at the 100 kb scale (Smith et al., 2018). It is unclear, however, how well these results generalize to species with disnct genomic features and life history traits. The few studies conducted in non-human organisms relied on proxies of the local mutaon rate, such as synonymous diversity or divergence with a closely-related outgroup (Castellano et al., 2020(Castellano et al., , 2018. Sll, these indirect measures of the mutaon rate are suscepble to the confounding e>ect of selecon, which can act both directly ( e.g. codon usage (Lawrie et al., 2013;Machado et al., 2020)) and indirectly (e.g. recent background selecon in the case of synonymous diversity (Charlesworth et al., 1993;Hudson & Kaplan, 1995;Nordborg et al., 1996) as well as background selecon in the ancestral populaon in the case of synonymous divergence (Phung et al., 2016)). Therefore, developing dedicated stascal methods to infer mutaon rate variaon from polymorphism data is of high interest. Through simultaneous inference of the genomic landscapes of genec dri/, linked selecon, recombinaon and mutaon, confounding factors can be beEer teased apart and, subsequently, the relave contribuon of each of these micro-evoluonary mechanisms to the distribuon of diversity can be more meaningfully quanFed.
Disentangling the e>ects of mulple factors shaping the evoluon of DNA sequences is challenging because di>erent mechanisms can produce similar phenomena (sensu (Baetu, 2019)). For example, a genomic region with reduced nucleode diversity (relave to some baseline reference) can be causally explained by either linked selecon, dri/, low mutaon rate or a combinaon thereof. In an elegant e>ort to tease these mechanisms apart, Zeng and Jackson developed a likelihood-based model that jointly infers the e>ecve populaon size (N e ) (Charlesworth, 2009) and μ with high accuracy in di>erent parts of the genome (Zeng & Jackson, 2018). However, since it relies on the single-site frequency spectrum, their method is restricted to unlinked loci. While this approach avoids the confounding e>ect of linkage disequilibrium in the inference procedure (Slatkin, 2008), it discards sites in the genome where local variaon in the mutaon rate may be relevant as well as dismisses the gradual impact of recombinaon and linked selecon on spaal variaon in diversity. In this arcle, we put forward a new model to Fll in this gap. We have previously described a stascal framework (the integrave sequenally Markovian coalescent, iSMC) that jointly infers the demographic history of the sampled populaon together with variaon in the recombinaon rate along the genome via a Markov-modulated Markov process (Barroso et al., 2019). We now extend this framework to also account for sequenal changes in the mutaon rate. This integraon allows stascal inference of variaon along the genome in both recombinaon and mutaon rates, as well as in Times to the Most Recent Common Ancestor (τ), that is, the ancestral recombinaon graph of two haploids (Rosenberg & Nordborg, 2002). Whereas dri/ causes stochasc (uctuaons in τ around its expected value under neutrality (in diploid organisms, E [ τ ]=2× Ne ), natural selecon disturbs τ away from its neutral distribuon near funconally constrained regions of the genome (Palamara et al., 2018;Rasmussen et al., 2014;Stern et al., 2019;Zeng & Charlesworth, 2011). Thus, iSMC o>ers esmators of all relevant micro-evoluonary mechanisms, and we can further use causal inference (Pearl & Mackenzie, 2018) to simultaneously esmate their e>ects on diversity. Our analyses of D. melanogaster genomes reveals the impact of linked selecon; however, it suggests that the rate of de novo mutaons is quantavely the most important factor shaping nucleode diversity in this species.

Modeling spa al varia on in θ
We now introduce our approach to modeling the mutaon landscape starng from the original pairwise SMC process. Because iSMC models pairs of genomes, the genealogies underlying each orthologous site can be conveniently summarized by τ, the me to their most recent common ancestor (Li & Durbin, 2011;Schi>els & Wang, 2020). The pair of DNA sequences is described as a binary string where 0 represents homozygous states and 1 represents heterozygous states (thus, once genome pairs are combined into diploids, phasing informaon is discarded). The probability of observing 0 or 1 at any given posion of the genome depends only on τ and the scaled mutaon rate θ. If the hidden state conFguraon of the model excludes spaal variaon in the mutaon rate, then θ is assumed to be a global parameter such that the emission probabilies of homozygous and heterozygous states can be computed for every site as P (0|τ )=e (− θ× τ ) , and P (1|τ )=1− e (− θ × τ ) respecvely, as presented by Li & Durbin (2011).
We esmate the per-site, genome-wide average θ 0 as the average number of pair-wise di>erences observed between all pairs of genomes. Therefore, the e>ecve populaon size implicit in θ 0 =4 × Ne × μ is the average of N e along the genome, accounng for selecve e>ects. We Fx θ 0 to this point esmate and exclude it from the opmizaon step conducted with the HMM. To incorporate spaal heterogeneity in the mutaon rate along the genome, we modulate θ 0 by drawing scaling factors from a discrezed Gamma distribuon with mean equal to 1. The parameter shaping this prior distribuon (α θ = β θ ) is esmated by maximum likelihood (via the forward HMM algorithm) together with other parameters of the model (using the Powell opmizaon procedure (Powell, 1964)). We model the changes in mutaon rate along the genome as a Markov process with a single parameter δ θ , the transion probability between any class of mutaon rate, which is independent of the genealogical process. The jusFcaon for the Markov model is that sites in close proximity are expected to have similar mutaon rates. For example, as is the case when the eZciency of the replicaon machinery decreases with increasing distance from the start of the replicaon fork (Francioli et al., 2015). Of note, Felsenstein & Churchill (1996) used a similar approach to model substuon rate variaon across sites in a phylogenec model. Let n ( τ ) be the number of discrezed τ intervals, and n (θ ) be the number of discrezed categories of the prior distribuon of scaling factors of θ. The ensuing Markov-modulated HMM has n=n (τ ) × n (θ) hidden states. The transion matrix for spaal variaon in θ is: where δ θ is the aforemenoned auto-correlaon parameter. The resulng process is a combinaon of the SMC and the mutaon Markov model, so that its transion probabilies are funcons of the parameters from both processes, that is, the coalescence rates (parameterized by splines, similarly to (Terhorst et al., 2017)), δ θ and the global recombinaon rate ρ (Barroso et al., 2019). The forward recursion for this model evaluated at genomic posion i can be spelled out as: where θ m is the product of θ 0 and the value of the m-th discrezed category drawn from its prior Gamma distribuon. The emission probability of binary state S i depends on the height of the t-th genealogy and the focal mutaon rate θ m . More speciFcally, the emission probabilies of θ-iSMC are P ( 0 | τ t , θ m ) =e (−θ m × τ t ) , and P ( 1 | τ t ,θ m ) =1 −e ( −θ m × τ t ) . Thus, the forward recursion integrates over all n (θ ) categories of θ and over all n ( τ ) intervals of τ, for all sites in the genome. In the double-modulated model (ρ-θ-iSMC), where both mutaon and recombinaon are allowed to vary along the genome, this integraon is performed over θ, τ as well as ρ (giving a total of n ( τ ) × n (θ) ×n ( ρ) hidden states, Figure 1).
Since spaal variaon in ρ contributes to the transion probability between genealogies, the complete forward recursion is now given by: The full ρ-θ-iSMC model remains parsimonious, being characterized by a total of 11 parameters, namely, ρ 0 , θ 0 , α θ , α ρ , δ θ , δ ρ plus Fve parameters describing constrained cubic splines that embody the demographic curve over me (Barroso et al., 2019). (Such parsimony is a>orded by the structure of the Markov-modulated HMM which readily leverages physical linkage among sites in the same chromosome to Ft distribuons of recombinaon rates, mutaon rates and TMRCA that are shared throughout the genome, even if site-speciFc realizaons of these values may di>er.) Running its forward recursion independently on each pair of genomes gives the composite likelihood of the model. A/er parameter opmizaon, we seek to reconstruct single-nucleode landscapes (ρ, θ or τ) for each diploid separately. We Frst compute the posterior probability of each hidden state for every site i in the diploid genomes using regular HMM procedures (Durbin et al., 1998). A/erward, since in ρ-θ-iSMC the hidden states are triplets (Figure 1), compung the posterior average of each landscape of interest amounts to Frst marginalizing the probability distribuon of its categories and then using it to weight the corresponding discrezed values (Barroso et al., 2019). Let M be the inferred discrezed Gamma distribuon shaping mutaon rate variaon, and θ l be the product of the esmated genome-wide average mutaon rate θ 0 and the value of M inside category l. Similarly, let R be the inferred discrezed Gamma distribuon shaping recombinaon rate variaon, and ρ k be the product of the esmated genome-wide average recombinaon rate ρ 0 and the value of R inside category k. Then the posterior average θ at posion i is where P i ( θ l , ρ k , τ j ) is the probability of the triplet {θ l , ρ k , τ j } (which denotes a unique hidden state of the model) underlying the i-th site of the genome. Likewise, the posterior average ρ at posion i is given Finally, the posterior average τ at posion i is presented in units of 4 × Ne generaons and obtained with:τ For each diploid, we can then bin the inferred single-nucleode landscapes into non-overlapping windows of length L by averaging our site-speciFc esmates over all sites within each window. A consensus map of the populaon is obtained by further averaging over all n individual (binned) maps in our sample, i.e.: is our esmate of the consensus mutaon rate in a single genomic window of length L, where n is the number of pairs of genomes analyzed by iSMC, and likewise for ρ and τ: We Fnally note that the auto-correlaon parameters δ θ and δ ρ represent the probabilies of switching mutaon and recombinaon rates between adjacent sites, averaged along the genome. That is, although we include two layers of complexity in comparison to the original SMC models, we assume here that such transion probabilies are themselves spaally homogeneous. In reality, genomic regions may di>er in the rate of change between local mutaon and recombinaon rates. Nevertheless, in pracce, the reconstrucon of mutaon and recombinaon maps with posterior decoding should be somewhat robust to this model mis-speciFcaon.

Simula on study
Using SCRM (Staab et al., 2015), we simulated 10 haploid sequences of length 30 Mb with parameters based on those inferred from ρ-θ-iSMC in D. melanogaster (see Results): θ = 0.0112; ρ = 0.036; α θ (connuous Gamma distribuon used as mutaon rate prior) = 3.0; α ρ (connuous Gamma distribuon used as recombinaon rate prior) = 1.0; δ θ (mutaon rate transion probability) = 1e-05; δ ρ (recombinaon rate transion probability) = 1e-04. Note that such transion probabilies lead to landscapes where blocks of constant mutaon and recombinaon span, on average, 10 kb and 100 kb, respecvely, with stochasc variaon coming from the geometric distribuons used to model them. Supplemental Figure S1 displays a sketch of the smoothed demographic history used in the coalescent simulaons (see Results). Figures 4 and 5 display the mean R 2 value of the ANOVA performed on the inferred landscapes from 10 simulated replicates (see Results), but the standard deviaon of these esmates are very small, and conFdence intervals were, therefore, omiEed. Data leading to Figure 5 was also simulated with SCRM, with parameters described in the Results secon.
Next, we used SLiM 3.00 (Haller & Messer, 2018) to simulate the genealogy of a chromosome undergoing purifying selecon, using D. melanogaster's chromosome 2L as a template. The simulated region was 23.51 Mb long, and we used Comeron's recombinaon map in 100 kb windows (Comeron et al., 2012). We used Ensembl (Cunningham et al., 2022) release 103 gene annotaons for D. melanogaster and extracted all exons coordinates, merging overlapping exons. Forward simulaons were conducted using SLiM, with only deleterious mutaons in exons being modeled. The Ftness e>ect of mutaons was drawn from a negave gamma distribuon with a shape of 1.0 and a mean of -5/10,000. The populaon size was kept constant and equal to 10,000 and the populaon evolved for 700,000 generaons. To compensate for the low populaon size, we scaled the mutaon and recombinaon rates by a factor of 10 to result in a scenario closer to the D. melanogaster demography. The deleterious mutaon rate was set to 1e-7 bp -1 along the genome. Ten replicates were generated and saved as tree sequences (Kelleher et al., 2018), which were then further processed by the 'pyslim' python module to run a recapitaon procedure to ensure that all lineages coalesced into a single root at all genome posions. Ten genomes were then sampled uniformly at random and the underlying tree sequence exported. Finally, 'msprime' (Kelleher et al., 2016) was used to add neutral mutaons to the tree sequence and save the resulng sequence alignments. A random mutaon rate map was generated by sampling relave rates from a Gamma distribuon with mean equal to 1.0 and with a shape parameter equal to 2.5, in segments with lengths drawn from a geometric distribuon with mean equal to 100 kb. The resulng mutaon relave rate map was then scaled by the genome average mutaon rate of 1e-7 bp -1 .

Analyses of Drosophila data
Model F^ng and posterior decoding by ρ-θ-iSMC in D. melanogaster data were performed using a hidden-states conFguraon of 30 τ intervals, Fve ρ categories and Fve θ categories. We used publicly available data -haplotypes ZI103, ZI117, ZI161, ZI170, ZI179, ZI191, ZI129, ZI138, ZI198 and ZI206 coming from the Zambia populaon in the Drosophila Populaon Genomics Project Phase 3 (Lack et al., 2015). Note that the following Flters have been previously applied to these data by the original authors: A) heterozygous regions (maintained in the inbred individuals by selecon due to recessive lethal alleles); B) three bp around called in-dels; C) long identy-by-descent stretches between genomes from the same locaon; as well as D) segments showing evidence of recent admixture (from outside Africa back into Africa) were all masked (turned to 'N' in the FASTA Fles). We assigned gaps and masked nucleodes in these FASTA sequences as "missing" data (encoded by the observed state '2' within iSMC, for which all hidden states have emission probability equal to 1.0 (Li & Durbin, 2011)). To opmize computaonal me, ρ-θ-iSMC was Frst FEed to chromosome 2L only. Maximum likelihood esmates from this model were then used to perform posterior decoding on all other autosomes. Prior to F^ng the linear models, for each scale in which the iSMC-inferred landscapes were binned (50 kb, 200 kb and 1 Mb), we Fltered out windows with more than 10% missing data in the resulng maps. Genomic coordinates for coding sequences and their summary stascs (π N , and π S ) were taken from (Mounho et al., 2019).

Linear modeling
Linear models implemenng our causal model of diversity (Figure 3) were built based on genomic maps of 50 kb, 200 kb and 1 Mb resoluon. It is worth reiterang that the binning of the singlenucleode landscapes happens a/er opmizaon by the HMM such that it does not in(uence model complexity (as detailed in the model descripon, the 11 iSMC parameters are jointly esmated for the enre dataset, i.e., the model is aware of all individual sites in the sequences during opmizaon). When building linear models from real data, we Frst FEed GLS models independently to each autosome arm (2L, 2R, 3L, 3R), correcng for both auto-correlaon of and heteroscedascity of the residuals. A/er using Bonferroni correcon for mulple tesng, we observed (across the autosome arms and for di>erent window sizes) signiFcant and posive e>ects of θ and τ on π, whereas the e>ect of ρ was only signiFcant for chromosome 3L at the 200 kb scale, and the interacon between θ and τ is posive and signiFcant except for arms 2R and 3L at the 1 Mb scale (Supplemental Tables S6, S7, S8). Since the trends in coeZcients are overall consistent, we pulled the autosome arms and in the Results secon we present linear models FEed to the enre genome, for ease of exposion. Because we cannot rely on the GLS to paron the variance explained by each variable using type II ANOVA, we used OLS models to compute R 2 and restricted the GLS to assess the sign and signiFcance of variables. We standardized all explanatory variables (subtracted the mean then divided by the standard deviaon) before F^ng the regression models to aid in both computaon of variance in(aon factors and interpretaon of the coeZcients.

The sequen ally Markov coalescent with heterogeneous muta on and recombina on
The sequenally Markovian Coalescent (SMC) frames the genealogical process as unfolding spaally along the genome (Marjoram & Wall, 2006;McVean & Cardin, 2005;Wiuf & Hein, 1999). Its Frst implementaon as an inference tool derives the transion probabilies of genealogies between adjacent sites as a funcon of the historical variaon in N e (i.e., demographic history) and the genome-wide average scaled recombinaon rate ρ=4 × Ne ×r (Li & Durbin, 2011). Model F^ng is achieved by casng the SMC as a hidden Markov model (HMM) (Dutheil, 2017) and le^ng the emission probabilies be funcons of the underlying Time to the Most Recent Common Ancestor (TMRCA, τ) and the scaled mutaon rate θ=4 × Ne × μ (see Methods). The SMC has proven to be quite (exible and serves as the theorecal basis for several models of demographic inference (see Spence et al. (2018) for a review, and Sellinger et al. (2020) for another compelling, more recent development). We have previously extended this process to account for the variaon of ρ along the genome, thereby allowing for a heterogeneous frequency of transions between local genealogies in di>erent parts of the genome (Barroso et al., 2019). In this more general process called iSMC, recombinaon rate heterogeneity is captured by an auto-correlaon parameter, δ ρ , where the localized values of ρ are taken from a discrete distribuon and the transion between recombinaon rates along the genome follows a Frst-order Markov process.
In the general case, the iSMC process is a Markov-modulated Markov process that can be cast as an HMM where the hidden states are n-tuples storing all combinaons of genealogies and discrezed values of each parameter that is allowed to vary along the genome (Dutheil, 2021). If one such parameter contributes to either the transion or emission probabilies of the HMM, then the hyper-parameters that shape its prior distribuon can be opmized, e.g. by maximum likelihood (see Methods). In the iSMC with heterogeneous recombinaon (ρ-iSMC) the hidden states are pairs of genealogies and recombinaon rates (Barroso et al., 2019). Here, we extend this model by allowing the mutaon rate to also vary along the genome (Figure 1), following an independent Markov process, i.e., le^ng the hidden states of the HMM be {θ-category, ρ-category, genealogy} triplets. The signal that spaal variaon in ρ and θ leaves on the distribuon of SNPs is discernible because their contribuons to the likelihood are orthogonal: the recombinaon and mutaon rates a>ect the transion and emission probabilies of the forward HMM algorithm, respecvely. Parameter opmizaon and subsequent posterior decoding is performed as in Barroso et al. (2019). Under strict neutrality (which results in N e being homogeneous along the genome (Charlesworth, 2009)), the inferred θ landscape re(ects the landscape of de novo mutaons (μ). iSMC can, therefore, be used to infer genome-wide variaon in mutaon rates with single-nucleode resoluon and stascal noise is reduced by averaging the posterior esmates of θ within larger genomic windows (see Methods).
In order to increase power, we further extend iSMC to accommodate mulple haploid genomes. In this augmented model, input genomes are combined in pairs such that the underlying genealogies have a trivial topology reduced to their τ ( Figure 1). Although under Kingman's Coalescent (Kingman, 1982) the genealogies of mulple pairs of genomes are not independent, we approximate and compute the composite log-likelihood of the enre dataset by summing over such "diploid" log-likelihoods, similarly to MSMC2 (Malaspinas et al., 2016). Furthermore, iSMC enforces all diploids to share their prior distribuons of τ, ρ and θ so that mulple sequences provide aggregate informaon to our parameter inference during model F^ng; it does not, however, explicitly enforce that they have idencal genomic landscapes upon posterior decoding. Rather, iSMC uses posterior probabilies to reconstruct recombinaon and mutaon maps separately for each diploid.
Especially at the single-nucleode level, accuracy of the inferred posterior landscapes is limited by the large stochascity of the coalescent (Hein et al., 2004). The combinaon of genealogical and mutaonal variance leads to di>erences among the posterior landscapes of θ and ρ inferred from each diploid because it creates departures from the expected number of SNPs along pairs of genomes (hence variaon in the amount of informaon diploids bear, in di>erent regions of the genome, about ancestral processes such as mutaon and recombinaon). To reduce noise from the individual diploid esmates and obtain consensus landscapes of the whole sample, iSMC averages the posterior esmates of θ and ρ over all diploids, for each site in the genome (see Methods). On the other hand, di>erences in the τ landscapes among diploids primarily re(ect the stochasc nature of the ancestral recombinaon graph along the genome, which has intrinsic value itself. We therefore average these diploid τ landscapes not to reduce esmaon noise but to obtain a measure of dri/ in neutral simulaons. Note, however, that the average τ of the sample within a genomic window also contains informaon about natural selecon (Palamara et al., 2018) -a property we exploit in the analyses of Drosophila data. This cartoon model has three me intervals, three recombinaon rate categories and three mutaon rate categories. The genome-wide distribuon of diversity depends on the mutaon landscape (top) and on the τ landscape (boEom), which is modulated by the recombinaon landscape (middle). Discrezed values of these distribuons (le/) are combined in triplets as the hidden states of our Hidden Markov Model (right).

Muta on rate varia on impacts nucleo de diversity more than linked selec on in Drosophila
We sought to quanfy the determinants of genome-wide diversity in D. melanogaster using 10 haploid genome sequences from the Zambia populaon. To infer the genomic landscapes, we employed a ρ-θ-iSMC model with Fve mutaon rate classes, Fve recombinaon rate classes and 30 coalescence me intervals, leading to 750 hidden states. We note that the number of classes and me intervals do not a>ect the number of esmated parameters, in parcular because our implementaon of the demographic model uses splines in place of the emblemac "skyline" model (Li & Durbin, 2011;Schi>els & Durbin, 2014) (see Methods). In general, Fner discrezaon of these three distribuons leads to more precise inference unl a plateau is reached, as well as impacts the minimum and maximum values that the posterior esmates can take. However, the memory use and the likelihood computaon me scale linearly and quadracally with the total number of hidden states, respecvely. We selected 30 classes for the TMRCA and Fve classes for each rate distribuon because this conFguraon provided a good trade-o> between computaonal resources and accuracy during our tesng phase. The total run-me for F^ng the model with 750 hidden states to chromosome 2L of D. melanogaster was about 1 month on a highperformance cluster. Therefore, we proceeded in two steps: we Frst esmated model parameters on a subset of the data (chromosome arm 2L), and then used the FEed model to infer the landscape of mutaon, recombinaon and TMRCA for all autosomes (see Methods). The jusFcaon for this approach is that the HMM posterior decoding is able to reconstruct chromosome-speciFc landscapes, even from idencal prior distribuons. At the same me, we have no a priori reason to believe that the shape of these distribuons will di>er substanally among autosomes. The similarity among the results obtained with each chromosome in the downstream analyses (see "Linear Modeling" sub-secon within Methods) supports such intuion (Supplemental Tables S6, S7 and S8).
iSMC also inferred that the change in recombinaon rate across the genome was more frequent (autocorrelaon parameter δ ρ ~0.9999, corresponding to a change of recombinaon rate on average every 10 kb) than the change in mutaon rate (auto-correlaon parameter δ θ ~0.99999, corresponding to a change of mutaon rate on average every 100 kb). This suggests that our model mostly captures largescale rather than Fne-scale variaon in the mutaon rate. Our inferred genome-wide average ρ (0.036) is in line with previous esmates (Chan et al., 2012), and the coalescence rates (which, in the context of this arcle, comprise a collecon of nuisance parameters used to reFne our esmates of τ, ρ and θ along the genome) suggest a ~4-fold boEleneck followed by recovery (Supplemental Figure S1). As an empirical validaon of this new iSMC method, Spearman's rank correlaons (herea/er referred to as Spearman's rho) between our inferred recombinaon map of chromosome 2L and Comeron's map based on experimental crosses (Comeron et al., 2012) are 0.594 at the 50 kb scale, 0.693 at the 200 kb scale and 0.865 at the 1 Mb scale (all p-values < 1e-5), higher than the correlaons reported with previously published populaon genec methods applied to D. melanogaster (Adrion et al., 2019;Barroso et al., 2019;Chan et al., 2012). We used the parameters esmated from D. melanogaster to simulate 10 replicate datasets under a purely neutral scenario (see Methods). The aims of these simulaons are two-fold: (1) to benchmark iSMC's accuracy in reconstrucng the mutaon landscape; and (2) to understand how ρ, θ and τ interact to in(uence diversity levels under neutrality, thereby providing a measure of contrast for the analyses of real data (where natural selecon is present). Throughout this arcle, we analyze the determinants of nucleode diversity at di>erent scales by binning the landscapes of mutaon, recombinaon and TMRCA into non-overlapping windows of 50 kb, 200 kb and 1 Mb. We Frst report strong correlaons between inferred and simulated maps, ranging from 0.975 to 0.989 (Spearman's rho, Figure 2A, Supplemental Table S1), showcasing that our model is highly accurate under strict neutrality and when mutaon rate varies along the genome in Markovian fashion.
We then used the raw genomic landscapes from these simulated (neutral) datasets to invesgate how evoluonary mechanisms shape the distribuon of nucleode diversity along the genome, measured as π, the average per-site heterozygosity of the sample. The structure of our hypothesized causal model of diversity (solid lines in Figure 3) is rid of "backdoor paths" that would otherwise create spurious associaons between recombinaon, mutaon, TMRCA and nucleode diversity (Pearl & Mackenzie, 2018). We could thus cast our causal model as an ordinary least squares regression (OLS) that seeks to explain π as a linear combinaon of the standardized variables ρ, θ and τ and stascal associaons between our explanatory variables and the outcome variable π then represent causal relaonships that merit scienFc explanaon. The jusFcaon for a linear model of π is that for suZciently small genomewide average diversity θ 0 (a requirement which is met in D. melanogaster, as E [ θ 0 ] ~1e-2) the per-site heterozigosity P ( Heterozygous )=π =1− e (− θ × τ ) can be well approximated by θ × τ , the Frst term in the Taylor series expansion of 1 −e (−θ × τ ) . Since simulaons grant direct access to the true genomic landscapes, then by deFnion the ensuing OLS models are free of esmaon noise in the explanatory variables and serve as a ground truth assessment of how neutral evoluonary mechanisms in(uence nucleode diversity. Because of the interplay between genealogical and mutaonal variance, we tested the improvement that including an interacon term between θ and τ brought to the Ft of the linear models. In all replicates, we found that model selecon using Akaike's informaon criterion favors a regression with an interacon term between the two variables that directly in(uence nucleode diversity, the simpler model  Table S2, upper panel). This is expected since both deeper ancestry and higher mutaon rate lead to increased nucleode diversity and the in(uence of recombinaon rate on π is mediated by τ, thus disappearing due to its inclusion in the linear model. There is also a signiFcant and posive e>ect of the interacon between θ and τ, highlighng the interplay between genealogical and mutaonal variance, where the e>ect of the mutaon rate on diversity can only be fully manifested if ancestry is deep enough (reciprocally, ancestry can only be seen clearly if the local mutaon rate is high enough). Moreover, the standardizaon that we employed on the explanatory variables prior to F^ng the linear models (see Methods) allows us to evaluate their relave importance to the π distribuon straight from the esmated coeZcients. We observe that the linear coeZcient of θ is ~6 mes larger than the linear coeZcient of τ at the 50 kb scale, ~11 mes larger at the 200 kb scale and ~16 mes larger at the 1 Mb scale (Supplemental Table S2, upper panel). Besides the linear coeZcients, we further quanFed the relave in(uence of mutaon, dri/ and recombinaon to local diversity levels by paroning the R 2 contributed by each explanatory variable with type II ANOVA. Consistently with the previous results, our esmates show that the θ landscape explains most of the variance in π in our simulaons and that its contribuon increases with the genomic scale (96.3% at 50 kb, 98.6% at 200 kb and 99.3% at 1 Mb Figure 4A). On the other hand, the contribuon of the τ landscape decreases with the genomic scale (2.7% at 50 kb, 1% at 200 kb and 0.54% at 1 Mb). We propose that these trends stem from the minuscule scale of variaon in τ (changing on average every 48.42 bp due to recombinaon events in our coalescent simulaons, median = 19 bp), which smooth out more rapidly than does mutaon variaon when averaged within larger windows. Conversely, the broader scale of heterogeneity in θ (changing every 100 kb on average) makes it comparavely more relevant at larger window sizes. Strikingly, the total variance explained by the model is >99% at all scales, suggesng that these three landscapes are suZcient to describe the genome-wide distribuon of diversity, as illustrated by our causal model (Figure 3). To test whether we could recover such trends with the landscapes inferred by our HMM, we FEed the OLS models to the same genomic landscapes of nucleode diversity except using the maps inferred by iSMC as explanatory variables (i.e., θ, τ and ρ instead of the true, simulated ones: θ, τ and ρ). The sign and signiFcance of the esmated OLS coeZcients remained unchanged (Supplemental Table S2, middle panel), as do the ranking of their e>ect sizes, but in some replicates the residuals of the model were found to be correlated and/or with heterogeneous variance. As this violaon of the OLS assumpon could bias the esmates of the p-values of the linear coeZcients, we also FEed Generalized Least Squares (GLS) models accounng for both deviaons, which reassuringly produced coherent results (Supplemental Table S2, lower panel). Although co-linearity between θ and τ arises due to confounding in their esmaon by iSMC, the variance in(aon factors are always < 5, indicang that the coeZcients are robust to this e>ect (Ferré, 2009). The trends in the linear coeZcients obtained with iSMC-inferred landscapes are the same as those obtained with simulated (noise-free) landscapes, except that the e>ect of τ is esmated to be larger than that of τ. Similarly, type II ANOVA using the inferred landscapes shows that the contribuon of τ is slightly higher than when using the true landscapes (5.1%, 2.9% and 1.4%, increasing window size) whereas the contribuon of θ is slightly lower (92.5%, 95.4% and 97.5%, increasing window size), but the variance explained by each variable closely agrees between the two cases (middle and right panels in Figure 4A). Therefore, we conclude that the joint-inference approach of iSMC can infer the genomic landscapes of τ, ρ and θ and that the linear regression representaon of our causal model (Figure 3) is able to quanfy their e>ect on the distribuon of nucleode diversity, π. 3. Directed acyclic graphs depic ng our abstract causal model for the determinants of genome-wide diversity. A) for a single, hypothecal nucleode that is independent of any neighbors, its probability of being heterozygous is solely in(uenced by the local mutaon rate (μ) and TMRCA (τ), which in turn is a>ected by dri/ (D) and selecon (s). B) when conguous sites are grouped into genomic windows, their correlated histories imply that the local recombinaon rate (r) plays a role in modulang both D and the breadth of linked selecon via τ, which together with local μ in(uences π. Relaonships that may be relevant in other model systems are shown by dashed lines (where selecon a>ects μ and r through modiFer genes and where recombinaon is mutagenic). Note that P(het) in A has exactly the same form as the emission probability of the We Fnally employed the landscapes obtained with ρ-θ-iSMC to quanfy the determinants of genomewide diversity in D. melanogaster. In the following analyses, our interpretaons of the OLS models assume that sequencing errors are unbiased with respect to the explanatory variables and that the populaon is broadly panmicc (or that geographic structure is implicitly accounted for by the TMRCA, e.g. (Beichman et al., 2018)). We also follow previous work suggesng that recombinaon is not mutagenic in this system (Begun et al., 2007;Castellano et al., 2016;McGaugh et al., 2012), thus we ignore this potenal relaonship. We used our inferred D. melanogaster maps to Ft an OLS regression of the form π i = β 1 ⋅τ i +β 2 ⋅θ i + β 3 ⋅ρ i + β 4 ⋅θ i :τ i +ϵ i . As in our simulaons, the regression model shows posive e>ects of both θ and τ , but not of ρ , on π across all scales ( Table 1). Likewise, a GLS model correcng for the idenFed auto-correlaon of and heteroscedascity of the residuals yields the same trends, and its variance in(aon factors are < 5, indicang that the esmated coeZcients are robust to co-linearity (Ferré, 2009). Showcasing its dominant impact on π in the fruit (y, the linear coeZcient of θ is between three and four mes larger than that of τ , a trend that is akin to that obtained with inferred maps in the coalescent simulaons. Moreover, paroning of variance shows a small contribuon of τ that decreases with increasing genomic scale (5.9% at 50 kb, 2.1% at 200 kb and 2.1% at 1 Mb) whereas the opposite applies to θ (91.7% at 50 kb, 96.7% at 200 kb and 96.8% at 1 Mb, le/ panel in Figure 4A).
Our linear model explains >99% of the variaon in π along D. melanogaster autosomes, and the e>ects of our inferred landscapes on diversity are remarkably close to those from our neutral simulaons ( Figure  4A), suggesng that iSMC is robust to the occurrence of selecon in this system. Unlike neutral simulaons; however, the simple correlaon test between ρ and π ends up posive and signiFcant in D.
melanogaster data, at least at smaller scales (Spearman's rho = 0.20, p-value = 2e-13 at the 50 kb scale; Spearman's rho = 0.15, p-value = 0.0025 at the 200 kb scale; Spearman's rho = 0.20, p-value = 0.07 at the 1 Mb scale), recapitulang the classic result of Begun & Aquadro (1992) and indicang the presence of linked selecon. We also found a posive correlaon between ρ and τ (Spearman's rho = 0.48, p-value < 2.2e-16 at 50 kb; Spearman's rho = 0.45, p-value < 2.2e-16 at 200 kb; Spearman's rho = 0.48, p-value < 2.2e-16 at 1 Mb), once again contrasng the results under neutrality and suggesng that the e>ect of linked selecon is indeed captured by the distribuon of genealogies and modulated by the recombinaon rate (CuEer & Payseur, 2013). Although τ is primarily in(uenced by demography in SMCbased models (by means of a Coalescent prior taming the transion probabilies of the HMM (Li & Durbin, 2011;Schi>els & Durbin, 2014), it has also been demonstrated to carry the signature of selecon due to local changes in coalescence rates that have been interpreted as spaal variaon in N e (Palamara et al., 2018;Zeng & Charlesworth, 2011). Shortly, Palamara's ASMC method reconstructs the TMRCA landscape of several pairs of genomes and interprets recurrent (shallow) outliers in the τ distribuon as the outcome of linked selecon (i.e., regions where pair of genomes consistently coalesce faster than expected under neutrality). We tested the sensivity of our regression framework to this e>ect by a F^ng linear model without τ as an explanatory variable, π i = β 1 ⋅θ i + β 2 ⋅ρ i +ϵ i , hypothesizing that in the absence of its mediator the recombinaon rate would show a signiFcant and posive e>ect on diversity. Indeed, this is what we found at all genomic scales (Table 1, Model 3), corroborang our interpretaon of the causal relaonships in the presence of selecon (Figure 3B), from which the direct correlaon between ρ and π, o/en reported in the literature, is a special case. In summary, our results show that recombinaon shapes diversity via the τ distribuon and linked selecon, but that in D. melanogaster, the impact of genec hitchhiking on the diversity landscape is smaller than that of mutaon rate variaon. To invesgate the signature of selecon, we analyzed the relaonship between the local mutaon rate and the levels of synonymous (π S ) and non-synonymous (π N ) diversity across D. melanogaster genes (see Methods). We computed these summary stascs across exons and matched their coordinates with our Fnest (50 kb-scale) genomic landscapes to increase resoluon (i.e., to maximize variaon in mutaon and recombinaon rates among genes). We observed a stronger relaonship between θ and π S (Spearman's rho = 0.68, 95% CI a/er 10,000 bootstrap replicates = [0.64, 0.72], paral correlaon accounng for τ ) than between θ and π N (Spearman's rho = 0.27, 95% CI a/er 10,000 bootstrap replicates = [0.22, 0.32], paral correlaon accounng for τ ) indicang that selecon parally purges the excess of non-synonymous deleterious variants in genes with elevated mutaon rate, whereas synonymous variants segregate more freely either because they are not directly a>ected by selecon (but are sll linked to selected sites) or because selecon on codon usage (Lawrie et al., 2013;Machado et al., 2020) is not as strong as selecon on protein funcon. Since synonymous variants are interdigitated with non-synonymous variants, the contrast between these correlaon tests cannot be explained by a bias in iSMC's esmaon of θ in funconally constrained regions of the genome.
Furthermore, a correlaon test between θ and the proporon of exonic sites in the same 50 kb windows (Spearman's rho = -0.037, p-value = 0.19, paral correlaon accounng for τ ) fails to reveal such putave bias (see Discussion for a (ip side view on this test). Conversely, we observed a negave and signiFcant correlaon between τ and the proporon of exonic sites (Spearman's rho = -0.158, p-value = 2e-12, paral correlaon accounng for θ), as expected since background selecon should reduce the TMRCA more abruptly in densely constrained regions (Charlesworth, 2013;Palamara et al., 2018). We also FEed linear models considering only 50 kb windows with more than 20,000 coding sites. Once again, there were signiFcant and posive e>ects of both θ and τ , but not of ρ , on π. Moreover, the mutaon landscape remains the most important factor, explaining 93.2% of the distribuon of diversity in generich regions.

Figure 4. Variance in the distribu on of diversity explained by each genomic landscape.
Paroning of variance according to window size (x-axis, shown in log 10 scale), using either simulated data (true landscapes: right panels; inferred landscapes: middle panels) or real Drosophila data (le/ panels). A) comparison between real Drosophila data and results from neutral simulaons. B) comparison between real Drosophila data and results from simulaons under background selecon. In each panel, shapes represent explanatory variables in the linear model: θ (circles), ρ (triangles), τ (plus sign), θ:τ interacon (crosses) and the total variance explained by the model (squares) is the sum of the individual R 2 . Each point represents the average R 2 over 10 replicates. Variaon among replicates resulted in conFdence intervals too small to be ploEed.

Muta on rate varia on shapes genome-wide diversity in neutral scenarios
Our analyses of D. melanogaster data and D. melanogaster-inspired simulaons suggest that the mutaon landscape is the main factor in(uencing levels of diversity along the genome. But are there scenarios where τ has a more pronounced e>ect on π? We addressed this queson by exploring the parameter space of our neutral simulaons. For Fxed values of the long-term average populaon size (N e = 100,000), the average mutaon rate per site per generaon (μ = 2e-09), the Gamma distribuon of scaling factors of θ (α θ = β θ = 2.5) and the Gamma distribuon of scaling factors of ρ (α ρ = β ρ = 1.0), we varied the demographic history ((at N e ; 10-fold boEleneck happening 0.5 coalescent me units ago), the average recombinaon rate per site per generaon (r = 1e-08; 1e-09) and the (uctuaons of the mutaon landscape, where the realized lengths of genomic blocks of constant μ were drawn from geometric distribuons with averages equal to 50 kb, 500 kb, or instead taken as a perfectly (at mutaon landscape. We reasoned that the extent of the variaon in τ along the genome compared to that of μ (equivalently, θ) should modulate their relave in(uence on π. We FEed OLS models to explain π using the true, simulated landscapes as explanatory variables, and computed their average R 2 over all replicates for each evoluonary scenario (Figure 5). The OLS models included an interacon term between θ and τ but its individual R 2 was excluded from the plots because it is overall low (~1%) and of no direct interest. We observed clear trends emerging from these simulated data. First, for a given demographic history and paEern of variaon in the mutaon rate, increasing r reduces the in(uence of τ on π. This happens because with high recombinaon rates the genealogies change more o/en along the genome, thus displaying more homogeneous maps when averaged within windows (50 kb, 200 kb, 1 Mb). Second, for a given r and paEern of variaon in the mutaon rate, τ has a larger impact on π in the boEleneck scenario compared to the scenario of constant populaon size. This happens because when N e varies in me, the distribuon of coalescence mes becomes mul-modal (Hein et al., 2004) and therefore more heterogeneous along the genome. Third, for a given demographic history and Fxed value of r, frequent changes in μ along the genome (on average every 50 kb) reduce its impact on π relave to rare changes in μ (on average every 500 kb). This happens because frequent changes in μ lead to more homogeneity along the genome, when averaged within the window sizes used in our analyses.
Finally, if the mutaon landscape is (at, then, as expected, the variance explained by our linear model is enrely aEributed to τ. Note that although in these neutral simulaons τ varies along the genome as a result of genec dri/ alone, it sll has a non-negligible e>ect on the distribuon of diversity in most scenarios (i.e., binning into large genomic windows does not (aEen the TMRCA landscapes completely). This is in agreement with an observaon that heterogeneous recombinaon rates lead to outliers in genome-wide F ST scans, even under neutrality (Booker et al., 2020), which in turn happens because the recombinaon landscape enlarges the variance of the τ distribuon by making the frequency of genealogy transions a funcon of the local ρ (conFrming the causal e>ect ρ → τ depicted in Figure 3B). From a praccal standpoint, it means that dri/ should not be neglected as an explanaon for the distribuon of π, especially at narrow window sizes (≤ 10 kb). This is relevant because it is also at narrow window sizes that the e>ect of selecon on diversity levels along the genome can be more easily confounded by the e>ect of dri/, with extreme examples happening during populaon range expansions and especially in regions of low recombinaon (Schlichta et al., 2022).
More generally, our simulaon study of neutral scenarios shows that the relave impacts of evoluonary mechanisms on π depend primarily on (1) the joint paEerns of variaon of ρ, τ and θ along the genome; and (2) the window size used in the analysis, because of averaging e>ects when building the genomic landscapes. In light of these results, the genome of D. melanogaster -with its high e>ecve recombinaon rate, broad (as detectable by iSMC) paEern of variaon in the mutaon rate and high density of funconal sites -seems to be parcularly suscepble to the e>ect of the mutaon landscape on its large-scale distribuon of diversity. Yet, since the mutaon landscape stood out as the most relevant factor in all of the explored (neutral) scenarios where it was allowed to vary (Figure 5), we predict that it is likely to shape genome-wide diversity paEerns in other species as well. Figure 5. Variance in the distribu on of diversity explained by each genomic landscape (neutral simula on study). Paroning of variance according to window size (x-axis, shown in log 10 scale). A) Constant populaon size. B) Populaon boEleneck. Results are displayed according to parameters (rows = recombinaon rate, columns = scale of mutaon rate variaon). In each panel, point shapes represent explanatory variables in the linear model: θ (circles), ρ (triangles), τ (asterisks) and the total variance explained by the model (squares) is the sum of the individual R 2 . Each point represents the average R 2 over 10 replicates, and variaon among replicates resulted in conFdence intervals too small to be ploEed. All linear models were built using simulated (true) landscapes.

iSMC can disentangle muta on rate varia on from linked selec on
Finally, we simulated 10 replicate datasets under a background selecon model with genomic features parally mimicking those of D. melanogaster chromosome 2L (see Methods). Brie(y, we included in these simulaons the posions of exons from the Ensembl database (Cunningham et al., 2022) (whose non-synonymous mutaons had selecon coeZcients drawn from a negave Gamma distribuon), the (Comeron et al., 2012) recombinaon map of the fruit (y, and spaal variaon in mutaon rates using parameters esmated in our previous analyses. These datasets are not meant to precisely reproduce paEerns of nucleode diversity in real data -there are far too many biological processes not captured by the simulaons -, but instead to assess iSMC's ability to disentangle the θ and τ landscapes when the laEer is heavily distorted by linked selecon. As such, we used a distribuon of selecon coeZcients with a shape parameter equal to 1.0 (which contrasts with the range from ~0.3 tõ 0.4 reported in the literature, e.g. Castellano et al. (2018)) in order to arFcially exacerbate the e>ect of linked selecon (see below). As before, we FEed a ρ-θ-iSMC model with Fve mutaon rate classes, Fve recombinaon rate classes and 30 coalescence me intervals, and a/erwards binned the simulated and inferred maps into windows of 50 kb, 200 kb and 1 Mb. We then assessed the accuracy of our framework in two ways: Frst, by compung Spearman's rho between simulated and iSMC-inferred landscapes; second, by contrasng the variance explained by each variable in the OLS regression (FEed with inferred versus true maps).
We report strong and posive correlaons between simulated and inferred landscapes under such complex a scenario ( Figure 6). As expected, the inherently Fne-scale variaon in the distribuon of genealogies is the hardest to reconstruct: the Spearman's rho between the true TMRCA landscapes and the inferred ones ranges from 0.385 to 0.465 at 50 kb and increases substanally with window size (up to 0.787 at 1 Mb, Supplemental Table S3). In comparison, the correlaon between the true and inferred mutaon landscapes ranges from 0.751 (at 1 Mb) to 0.894 (at 50 kb, Supplemental Table S4). The recombinaon landscape is also well recovered, with correlaon coeZcients ranging from 0.830 (at 50 kb) to 0.963 (at 1 Mb, Supplemental Table S5). We postulate that iSMC's power to disnguish between the signal that θ and τ leave on sequence data stems exactly from the di>erence in scale at which they vary along the simulated chromosomes. Although linked selecon can increase the correlaon among genealogies around constrained sites (McVean, 2007), in most genomic regions the extent of such e>ect is sll short in comparison to the scale of mutaon rate variaon, allowing their e>ects on the distribuon of diversity to be teased apart. In summary, although under linked selecon the accuracy of inferred mutaon maps is lower than under strict neutrality (cf. Supplemental Table S1), it remains high enough to validate the robustness of our new model of mutaon rate variaon.
Given the high accuracy of iSMC in these challenging simulaons, one would expect he/y resemblance in the linear models when using the inferred versus the true, noise-free landscapes. This is indeed what we found at all scales (middle and right panels in Figure 4B), suggesng that residual biases in iSMC due to linked selecon do not carry over to the regression analyses noceably. Moreover, there is a closer agreement of real D. melanogaster data with neutral simulaons than with simulaons of background selecon. This is probably a consequence of the unrealiscally strong background selecon in the simulaons, where the distribuon of Ftness e>ects we used has a high density of weakly deleterious mutaons which segregate longer in the populaon, leading to a more localized and pronounced distoron of genealogies (Zeng & Charlesworth, 2011). This would also explain why the R 2 aEributed to θ and τ respecvely decrease and increase with window size (Figure 4B), a reverse relaonship than observed in the real data and in neutral simulaons ( Figure 4A, Figure 5). These results are corroborated by the linear coeZcients (Supplemental Table S9). In the presence of selecon, the coeZcient of θ decreases with window size and is disncvely closer to the coeZcient of τ than under neutrality, a relaonship that is reproduced when F^ng the linear models with θ and τ . Note also that under intense background selecon the coeZcient of ρ is generally larger than in the previous scenarios (and the corresponding p-values smaller), mis-matching real D. melanogaster data as well. We further inquired into these phenomena visually, by looking into the simulated landscapes, which provided crical insight into the interplay among micro-evoluonary mechanisms shaping diversity (Figure 7). Close inspecon shows that only in regions of extremely reduced recombinaon (the le/ p and right tail of the simulated chromosome) does linked selecon introduce enough correlaon among selected and neutral sites as to in(uence diversity to a larger degree than mutaon rate variaon, and this e>ect seemingly grows with window size. Otherwise, the distribuon of π predominantly mirrors that of θ, endorsing our previous results. As a side note, Figure 7 lays out a rather encing graphic of our linear models: they seek to represent π (the top row) as a linear combinaon of τ, ρ, and θ (the other three rows), plus the interacon τ:θ. From this perspecve, it becomes apparent that the mutaon landscape contributes the most to variaon in diversity along the chromosome, even in such a conservave scenario where linked selecon is arFcially strong. Taken together, the results from these simulaons provide compelling evidence that the high R 2 aEributed to θ in D. melanogaster is a solid Fnding. First, it cannot be explained either by the increased noise in our inference of τ compared to θ (Figure 6), or by potenal absorpon of linked selecon e>ects into θ, since in both of these cases we would not expect the close correspondence between OLS results FEed with inferred versus true maps (Figure 4b). Second, raw simulated data clearly demonstrate that the e>ect of linked selecon can be overwhelmed by mutaon rate variaon (Figure 7). We conclude that the modeling framework illustrated in Figures 1 and  3 sasfactorily captures the essence of the genome-wide determinants of nucleode diversity, and is likewise adequate to the study of D. melanogaster.

Discussion
The presence of mutaon rate variaon along the genome has been recognized for many years (some of the evidence in mammals reviewed over a decade ago by Hodgkinson & Eyre-Walker (2011)), although its implicaons have been largely overlooked by the populaon genecs literature. The contribuons of the present work are not to simply recapitulate this phenomenon in D. melanogaster, but mainly to (1) present a novel stascal method that can infer such variaon using populaon genec data and (2) use this method to show that the mutaon landscape has a lasng e>ect on nucleode diversity paEerns that can be quantavely larger than that of natural selecon. This awareness is long overdue, as the relave strengths of selecon and dri/ in shaping genome-wide diversity have been debated for several decades (reviewed in Hey (1999); and, more recently, Kern & Hahn (2018) and Jensen et al. (2019)), with the in(uence of local mutaon rate only recently brought to light (Castellano et al., 2018;Harpak et al., 2016;Smith et al., 2018). We were able to employ our extended iSMC model to jointly infer mutaon, recombinaon and TMRCA landscapes and to use causal inference to esmate their impact on π along the genome. Our analyses revealed that these combined landscapes explain >99% of the distribuon of , the recombinaon landscape (shared among replicates) and the mutaon landscape (shared among replicates). All landscapes are binned into non-overlapping windows of diversity along the D. melanogaster genome; when looking into the detailed paEerns, we found the footprints of linked selecon, but the major driver of genome-wide diversity in this species seems to be the mutaon landscape. Importantly, this conclusion holds whether we base the discussion on esmates of linear coeZcients or on the proporon of variance explained.
These results do not imply that linked selecon cannot extend beyond the 18.3% of the D. melanogaster genome that is exonic (Alexander et al., 2010), but rather that variaon in the mutaon rate is strong enough to contribute relavely more to the variaon in π, in the genomic scales here employed (Figure 4). Our Fndings, however, sharply contrast with an esmate by Comeron that up to 70% of the distribuon of diversity in D. melanogaster can be explained solely by background selecon at the 100 kb scale (Comeron, 2014), where the author further argued that many regions of increased diversity may be experiencing balancing selecon. Instead, we propose that mutaon rate variaon is responsible for most of these e>ects. We believe that such discrepancy can be mainly aEributed to Comeron's 70% Fgure deriving from the (rank) correlaon between π and B-value maps alone, without including other causal factors (like dri/ and local μ). The B-value represents the expected reducon of diversity due to selecon against linked deleterious mutaons (Charlesworth, 2013;Matheson & Masel, 2022;McVicker et al., 2009). This is equivalent to a scaling of the expected TMRCA between two (uniformly) random samples, which in our model is captured by τ . Indeed we Fnd that despite τ explaining liEle variance in diversity in the mulple regression se^ng, the simple correlaons between τ and π are of the same order as found by Comeron (Spearman's rho = 0.70, p-value < 2.2e-16 at the 50 kb scale; Spearman's rho = 0.66, p-value < 2.2e-16 at the 200 kb scale; Spearman's rho = 0.86, p-value < 2.2e-16 at the 1 Mb scale, cf. Table 1 in (Comeron, 2014)). The central point being that the linear models were able to reliably pinpoint θ as the main driver of π just because its e>ect was jointly esmated with those of τ and ρ. Taking a step back, it is also conceivable that selecon is not only manifested as distorons in the distribuon of genealogies, but also biases our esmate of the mutaon landscape. We note, however, that there actually seems to be a small overesmaon of the importance of the TMRCA in our results (compare values obtained with true vs inferred landscapes in Figure 4 and Supplemental Tables S2, S9), which goes in the opposite direcon to the presumed bias under linked selecon. In this way our results appear to be conservave with respect to the discussions we submiEed throughout this arcle. On top of that, based on the high similarity between real D. melanogaster data and our neutral simulaons ( Figure 4A) as well as on iSMC's robustness to the presence of linked selecon ( Figure 2B, Figure 4B, Figure 6), we argue that a bias induced by linked selecon would likely be insuZcient to overturn our conclusion of a major impact of mutaon rate variaon on the distribuon of diversity.
We also note that selecon should have a stronger impact on π when binning is performed at smaller genomic scales (≤10 kb, e.g., Figure 4 in Hudson & Kaplan (1995)), which we have not explored because of increased genealogical and mutaonal variance at such small window sizes. Besides Comeron, Elyashiv et al. (2016) also used paEerns of nucleode variaon to Ft models of linked selecon along the fruit (y genome. Using substuon rates at synonymous sites as a proxy for local mutaon rates, they employed their selecon esmates to predict genome-wide diversity in windows from 1 kb to 1 Mb. Their models predict 44% (100 kb) and 76% (1 Mb) of the distribuon of scaled nucleode diversity in D. melanogaster. However, owing to the scaling that removes the e>ect of mutaon rate variaon, the percentages in the Elyashiv et. al study represent the part of variance explained by linked selecon once the e>ect of mutaon rate variaon has been discarded (see also Murphy et al. (2022) for a similar and improved model). As such, the R 2 values they report quanfy the goodness-of-Ft of di>erent models of selecon (e.g., background selecon alone vs background selecon + selecve sweeps) instead of the actual importance of linked selecon to π, and are, therefore, not directly comparable with our esmates. Sll, we note that the remaining variance in their models may be due to mutaon rate variaon not grasped by synonymous divergence -an imperfect proxy for μ, either because of selecon on codon usage or because the mutaon landscape has evolved since the divergence of the two species. (Along the same vein, the correlaon between our 50 kb mutaon maps and genome-wide divergence between D. melanogaster and D. yakuba is only moderate, Spearman's rho = 0.197, p-value = 3e-09.) The di>erences between our approaches to capture linked selecon are also worth discussing. While Elyashiv et al. (2016) relied on elaborate models of selecon that embody strong assumpons, we leaned on the spaal distribuon of τ, similarly to Palamara et al. (2018). This heurisc renders our approach more parsimonious (11 parameters compared to 36 in the Elyashiv et. al model) and less suscepble to mis-speciFcaons of the selecon model, which could be commonplace (e.g., the presence of epistasis and/or (uctuang Ftness e>ects over me). Developing an explicit model of spaal variaon in N e into the iSMC framework is desirable but presents considerable obstacles, and is therefore le/ as a future perspecve.
Our results provide evidence that similarly to humans (Harpak et al., 2016;Smith et al., 2018), the mutaon landscape is a crucial determinant of the distribuon of diversity in D. melanogaster. The simulaon study (Figure 5, Figure 7) further suggests that in many evoluonary scenarios the mutaon landscape will remain the most relevant factor shaping π along the genome, depending notably on the window size used in the analysis. Future work using integrave models like the ones we introduced here (Figure 1, Figure 3) and applied to species with disnct genomic features and life-history traits will help elucidate how o/en -and by how much -the mutaon landscape stands out as the main driver of nucleode diversity.
We emphasize that we have not directly argued in favor of either genec dri/ or natural selecon in the classic populaon genecs debate, but instead we have highlighted the importance of a third element -the mutaon landscape -in shaping genome-wide diversity. Nevertheless, the mutaon landscape should play a role in the dynamics of natural selecon by modulang the rate at which variaon is input into genes (and other funconally important elements) depending on their posion in the genome. Consequently, levels of selecve interference, genec load and rates of adaptaon should vary accordingly (Castellano et al., 2016). In D. melanogaster, our inferred mutaon landscape varies ~10fold between minimum and maximum values at the 50 kb scale, meaning that the impact of mutaon rate variaon on selecve processes can be substanal. This opens intriguing lines of inquiry. For example, under what condions can the shape of the mutaon landscape itself be selected for? It has been shown that modiFers of the global mutaon rate are under selecon to reduce genec load (Lynch, 2008;Lynch et al., 2016). It remains to be seen whether the posion of genes or genomic features correlated with local μ (e.g., replicaon ming (Francioli et al., 2015)) can likewise be opmized (Marncorena & Luscombe, 2013). A/er all, populaon genecs theory predicts that at equilibrium the reducon in mean Ftness of the populaon due to recurrent mutaons is equal to the sum of mutaon rates among sites where they hit deleteriously and actually independent of their selecve e>ects (Haldane, 1937). Curiously, during the eding of this manuscript, the Frst evidence of an adapve mutaon landscape was reported in Arabidopsis thaliana, with coding regions experiencing fewer de novo mutaons than the rest of the genome, and essenal genes even less so (Monroe et al., 2022). This suggests that local mutaon rates have been themselves under selecve pressure to reduce genec load in at least one model system, and indicates that perhaps an even smaller fracon of the depleon of nucleode diversity near genes can be directly aEributed to linked selecon than previously inferred. In D. melanogaster, we failed to Fnd a relaonship between local mutaon rate and selecve constraint (recall that the correlaon test between θ and the proporon of exonic sites yielded Spearman's rho = -0.037 with p-value = 0.19, at the 50 kb scale); however, this could also be due to lack of power in the test because of the relavely large window size we used, combined with Drosophila's high gene density. At any rate, much more e>ort is needed to explore the causes and consequences of mutaon rate variaon across the tree of life. As a starng point, we can ask how conserved the mutaon landscape is in closelyrelated species (or, equivalently, how (uid is its evoluon within populaons). Analogous work on the recombinaon landscape has revealed overall fast evoluon of "hotspots" in mammals and has helped uncover the molecular architecture responsible for the placement of double-strand breaks (Berg et al., 2011;Jabbari et al., 2019). Moreover, adapve dynamics have been evoked to explain the di>erences in the recombinaon landscape between populaons of D. pseudoobscura, (Samuk et al., 2020). It will be interesng to test whether mutaon events follow similar paEerns, now that the impact of various sequence mofs on local μ is being more thoroughly invesgated (DeWiE et al., 2021;Kim et al., 2021;Oman et al., 2022). Unraveling the factors that shape the mutaon landscape at di>erent genomic scales will likely provide important insight. For example, can the large-scale variaon in mutaon rates that we found in D. melanogaster be parally explained by aggregaon of short (di>erenally mutable) sequence mofs, or is it driven by independent genomic features? As the molecular underpinnings of adapve mutaon landscapes become elucidated (e.g., what kind of proteins, sequence mofs and epigenec markers are involved in increasing replicaon Fdelity in funconally constrained regions and eventually decreasing it where polymorphism would tend to be beneFcial) we will gain a beEer understanding of how (exible such phenotype is and how prevalent it is expected to be in di>erent phylogenec groups. It is plausible that modiFers of the mutaon landscape may be successfully opmized, at least in species with high enough N e for such second-order e>ects to be seen by selecon (Lynch, 2010;Marncorena & Luscombe, 2013;Sung et al., 2012). Recent work notably highlighted the importance of epigenec factors in shaping the mutaon landscape and started to shed light on its evoluonary consequences Möller et al., 2021). The variety of molecular agents recruited to tweak the mutaon landscape and create pockets of decreased or increased de novo mutaon rates can be plenty, and it only outlines the complexity of evoluonary biology.
In hindsight, it is perhaps not surprising that mutaon rate variaon has a profound impact on nucleode diversity. Mutaons are, a/er all, the "stu> of evoluon" (Nei, 2013), and disnct genomic regions displaying di>erenal in(ux of SNPs must have sharp consequences to the analyses and interpretaon of genec data. This argument is naturally transferred to the ongoing discussion about incorporang complex demography and background selecon into the null model of molecular evoluon (Comeron, 2017(Comeron, , 2014Johri et al., 2020), which is movated by the goal of providing more sensible expectaons for rigorously tesng alternave scenarios. Our results suggest that a more realisc null model should also include variaon in the mutaon rate along the genome. By doing so, genome-wide scans (e.g., looking for regions with reduced diversity summary stascs as candidates for selecve sweeps) may become less suscepble to both false negaves (in regions of high mutaon rate) and false posives (in regions of low mutaon rate), paving the way to more robust inference (Booker et al., 2017;Haasl & Payseur, 2016;Venkat et al., 2018).