## Abstract

The reduction of genetic diversity due to genetic hitchhiking is widely used to find past selective sweeps from sequencing data, but very little is known about how spatial structure affects hitchhiking. We use mathematical modeling and simulations to find the unfolded site frequency spectrum (SFS) left by hitchhiking in the genomic region of a sweep in a population occupying a one-dimensional range. For such populations, sweeps spread as Fisher waves, rather than logistically. We find that this leaves a characteristic three-part SFS at loci very close to the swept locus. Very low frequencies are dominated by recent mutations that occurred after the sweep and are unaffected by hitchhiking. At moderately low frequencies, there is a transition zone primarily composed of alleles that briefly “surfed” on the wave of the sweep before falling out of the wavefront, leaving a spectrum close to that expected in well-mixed populations. However, for moderate-to-high frequencies, there is a distinctive scaling regime of the SFS produced by alleles that drifted to fixation in the wavefront and then were carried throughout the population. For loci slightly farther away from the swept locus on the genome, recombination is much more effective at restoring diversity in one-dimensional populations than it is in well-mixed ones. We find that these signatures of space can be strong even in apparently well-mixed populations with negligible spatial genetic differentiation, suggesting that spatial structure may frequently distort the signatures of hitchhiking in natural populations.

## Introduction

A selective sweep reduces neutral genetic diversity at linked loci via “genetic hitchhiking”: the genetic background on which the sweep occurred increases in frequency, while others decrease (Maynard Smith and Haigh 1974). In large populations with rapid adaptation and limited recombination, hitchhiking could be the primary factor limiting neutral genetic diversity (Gillespie 2000a,b; Weissman and Barton 2012). On the other hand, in populations where most loci are unaffected by hitchhiking, the loss of genetic diversity in specific regions of the genome is the primary method for detecting recently completed selective sweeps in natural populations (Kim and Stephan 2002; Vitti *et al*. 2013; Stephan 2019). Selection scans based on hitchhiking have found widespread sweeps in many populations and inferred their properties (Li and Stephan 2006; Sabeti *et al*. 2007; Karasov *et al*. 2010; Sattath *et al*. 2011; Vitti *et al*. 2013; Garud *et al*. 2015; Smith *et al*. 2018; Stephan 2019; Hejase *et al*. 2020; Bourgeois and Warren 2021). Two properties of particular interest are the strength of selection driving the sweep and whether the sweep is “hard” (starting from a single mutation) or “soft” (starting from multiple independent mutations) (Hermisson and Pennings 2005). Inferring these quantities requires models relating them to observable patterns of genetic diversity. Even methods that simply locate swept regions based on identifying empirical outliers need to be tested by simulations (i.e., models). It is therefore important that models for hitchhiking be at least roughly accurate descriptions of the process in natural populations.

The models underlying standard sweep-finding methods assume that the population is well-mixed (Garud *et al*. 2015; Schrider and Kern 2016; Smith *et al*. 2018; Stern *et al*. 2019; Bisschop *et al*. 2021). However, natural populations typically occupy extended spatial ranges. Often in such populations, allele frequencies are fairly constant across the range; in other words, spatial structure as measured by *F*_{ST} is weak (Hartl and Clark 1997). This reflects the fact that the mixing time is short compared to the neutral coalescence time, so well-mixed models are good approximations for many aspects of neutral evolution. However, selective sweeps are necessarily very fast compared to neutral coalescence, so for many populations, sweeping alleles will be strongly affected by spatial structure even when most neutral alleles are not (Ralph and Coop 2010). Indeed, strong patterns of spatial differentiation are one of the signatures used to detect ongoing sweeps (Sabeti *et al*. 2007; Tang *et al*. 2007; Coop *et al*. 2009). Even for methods that use these patterns to detect local adaptation rather than ongoing full sweeps (reviewed in Vitti *et al*. (2013) and Bourgeois and Warren (2021)), the premise is the same: often, dispersal is fast enough relative to drift to equalize neutral allele frequencies across space, but too slow relative to selection to maintain uniform allele frequencies at strongly selected loci.

The dynamics of sweeps in spatially structured populations can be completely different from those in well-mixed populations: instead of growing logistically, the restricted competition makes them spread as much slower traveling waves (Fisher 1937). These different dynamics can produce very different hitchhiking patterns (Slatkin and Wiehe 1998; Kim and Maruki 2011; Barton *et al*. 2013). Specifically, Barton *et al*. (2013) showed that a sweep with a given selective advantage reduces genetic diversity much less in a spatially structured population than it does in a well-mixed population of the same size. This suggests that spatial structure likely needs to be included in models of hitchhiking to get even a good first approximation.

A full description of the genetic diversity of a large sample is extremely complicated. Most of this complexity arises due to the linkage disequilibrium among alleles at different loci. Fortunately, the one-locus statistics are much simpler: when variation in allele frequencies across space is negligible, they can be completely described by the site frequency spectrum (SFS), which is the number of mutations *ξ*(*f*) found at frequency *f* in a sample of the population (Hartl 2020). Since it ignores linkage disequilibrium, *ξ*(*f*) can be measured in unphased data. For this reason, many sweep-inference methods are based on the SFS (DeGiorgio *et al*. 2016; Pavlidis *et al*. 2013; Harris *et al*. 2018; Kern and Schrider 2018). These inference methods are multi-locus in the sense that they concatenate information across genomic windows, but they are still based on one-locus statistics, i.e., they do not consider linkage disequilibrium.

In this paper, we model the hitchhiking caused by a sweep in a spatially structured population, focusing on the simplest case of a one-dimensional spatial range. We use heuristic mathematical arguments and simulations to find the effect of a sweep on the expected SFS at linked neutral loci. We start by considering a completely linked locus, and then find how hitchhiking decays as we move away from the swept locus on the genome and the recombination rate increases. We find that even in weakly spatially structured populations, space creates a distinctive tail in the expected site frequency spectrum (SFS) and changes the width of the region of the genome affected by hitchhiking. We discuss the potential implications of our results for the accuracy of inferences about adaptation based on well-mixed models.

## Model

We use a one-dimensional stepping-stone model, in which there is a line of *L* demes, each with a fixed population size of *ρ* haploid individuals, for a total population size of *N* = *Lρ*. Within each deme, we use a Wright-Fisher model, with individuals having a probability *m* each generation of being a migrant from a neighboring deme. The migrants have an equal probability of coming from the neighboring deme to the right or the left. At the edges of the range, the leftmost and rightmost demes have only one neighboring deme each, so individuals in these demes have a probability of only *m*/2 per generation of being a migrant. We focus on the case *ρm* ≫ 1 in which neighboring demes exchange many migrants per generation, and we expect neutral genetic variation to vary nearly continuously across space.

At some point in time, an individual acquires a beneficial mutation with selection coefficient *s*. Most such mutations rapidly go extinct due to drift; we discard these simulations and instead keep only those in which the mutation is lucky enough to escape extinction and sweep to fixation. For simplicity, we assume the beneficial mutation occurs in the leftmost deme in most of our results. As we discuss in Appendix C, this captures all the main qualitative patterns, and generalizing the quantitative patterns to mutations starting in other locations is straightforward. We focus on the case *s* < *m*, in which the traveling wave of the successful beneficial allele has a front that is approximately continuous in space, with a characteristic width and speed (Barton *et al*. (2013); illustrated by the black curve in Figure 1). At finite densities *ρ*, the wave speed is slightly smaller than in the deterministic (infinite-density) limit (Barton *et al*. 2013), so in all plots we use the speed measured directly from the simulations (see Appendix B for details).
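The forward step of this model can be sketched in a few lines. The block below is a minimal illustration, not the authors' simulation code: the function names, the default parameter values, and the order of operations within a generation (migration, then selection, then Wright-Fisher resampling) are assumptions made for the example.

```python
import numpy as np

def simulate_sweep(L=30, rho=200, m=0.25, s=0.1, seed=0, max_gen=5000):
    """One forward run of a beneficial allele in a 1D stepping-stone model.

    Returns the list of per-deme mutant counts (one array per generation) if
    the allele fixes, or None if it is lost or does not fix within max_gen.
    """
    rng = np.random.default_rng(seed)
    n = np.zeros(L, dtype=np.int64)
    n[0] = 1  # single beneficial mutation in the leftmost deme
    trajectory = [n.copy()]
    for _ in range(max_gen):
        x = n / rho
        # Migration: probability m/2 of coming from each neighbor. Padding with
        # the edge deme itself gives edge demes a migrant probability of m/2.
        left = np.concatenate(([x[0]], x[:-1]))
        right = np.concatenate((x[1:], [x[-1]]))
        x_mig = (1 - m) * x + 0.5 * m * (left + right)
        # Selection (assumed form): mutants have relative fitness 1 + s.
        p = np.clip(x_mig * (1 + s) / (1 + s * x_mig), 0.0, 1.0)
        # Drift: binomial (Wright-Fisher) resampling within each deme.
        n = rng.binomial(rho, p)
        trajectory.append(n.copy())
        total = n.sum()
        if total == 0:
            return None
        if total == L * rho:
            return trajectory
    return None

def simulate_successful_sweep(max_tries=10000, **params):
    """Discard runs in which the mutation is lost; return the first that fixes."""
    for seed in range(max_tries):
        trajectory = simulate_sweep(seed=seed, **params)
        if trajectory is not None:
            return trajectory
    raise RuntimeError("no successful sweep found")
```

Calling `simulate_successful_sweep(L=20, rho=100, m=0.25, s=0.2)` returns a trajectory whose first entry has a single mutant copy and whose last entry has the mutant fixed in every deme.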

For the hard, complete sweeps in our model, all neutral diversity after the end of the sweep must be produced by mutation or recombination. We follow a single neutral locus with mutation rate *U*_{n} and recombination rate *r* with the selected locus. We use the infinite allele model in which every new neutral mutation is unique. This is a good first approximation as long as the population size *N* times the per-base neutral mutation rate is small compared to one. Note that our neutral locus could extend over multiple bases, as long as recombination among them is negligible; see the Discussion for consideration of possible locus sizes.

We focus on the parameter regime in which spatial structure has a negligible effect on neutral allele dynamics in the absence of the sweep, but a strong effect on the sweep dynamics. To satisfy the first condition, we set the parameters so that the mixing time from dispersal, *T*_{mix} ∝ *L*^{2}/*m*, is much smaller than the well-mixed coalescent time scale, *T*_{coal} = *N*. This ensures that neutral heterozygosity is nearly unaffected by space (Maruyama 1971). To satisfy the second condition, we make sure that the mixing time *T*_{mix} is long compared to the duration of the sweep, *T*_{sweep} ≈ *L/v*. This is roughly equivalent to requiring that the range size is long compared to the wavefront of the sweep, *L* ≫ √(*m*/*s*), or that the mixing time be long compared to the time the sweep would take in a well-mixed population, *T*_{mix} ≫ ln(*Ns*)/*s*. We can summarize these conditions as *T*_{sweep} ≪ *T*_{mix} ≪ *T*_{coal}.
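As a quick check of whether a given parameter set falls in this regime, the three time scales can be compared directly. The sketch below is illustrative only. It assumes a unit prefactor in *T*_{mix} = *L*^{2}/*m*, and it uses the deterministic Fisher speed √(2*ms*) (which follows from a diffusion constant of *m*/2) as a stand-in for the speed measured in simulations. The function names and the separation factor of 10 are hypothetical.

```python
import math

def timescales(L, rho, m, s):
    """Rough time scales (in generations) for the 1D stepping-stone model."""
    N = L * rho
    v = math.sqrt(2 * m * s)  # ASSUMPTION: deterministic Fisher wave speed
    return {
        "T_sweep": L / v,                           # duration of the 1D sweep
        "T_mix": L**2 / m,                          # ASSUMPTION: unit prefactor
        "T_coal": N,                                # well-mixed coalescent time
        "T_sweep_well_mixed": math.log(N * s) / s,  # logistic sweep time
    }

def in_target_regime(ts, factor=10.0):
    """True if T_sweep << T_mix << T_coal, each by at least `factor`."""
    return (ts["T_sweep"] * factor <= ts["T_mix"]
            and ts["T_mix"] * factor <= ts["T_coal"])

ts = timescales(L=200, rho=10_000, m=0.25, s=0.05)
print(ts, in_target_regime(ts))
```

With these example values the sweep takes about 1.3 × 10^3 generations, mixing takes 1.6 × 10^5, and coalescence takes 2 × 10^6, so both separations hold. Increasing *L* tenfold at fixed *ρ* breaks the second one.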

In this parameter regime, common alleles at sites unlinked to the sweep have little variation in their frequencies across space. During and immediately after the sweep, variation at loci linked to the sweep may vary strongly with space, but dispersal will erase these spatial patterns in a time *T*_{mix} much shorter than the time *T*_{coal} that it takes drift to erase the effect of hitchhiking on the overall allele frequencies (Ralph and Coop 2010). Thus, for most potentially detectable sweeps, there will be no direct signature of the effect of spatial structure. In other words, we do not expect to see spatial variation directly from allele frequencies, but the strong spatial structure that existed during the sweep may still leave a large signature in the overall SFS. Throughout the paper, we will focus on the unfolded SFS, assuming that there is an outgroup such that ancestral and mutant alleles can be distinguished, as the folded SFS can be immediately derived from the unfolded one.

We follow Barton et al. (2013) in conducting our simulations in two steps. First, we simulate the dynamics at just the selected locus forward in time, saving the trajectory of the sweep in time and space. We then use this saved trajectory to conduct structured coalescent simulations of the linked neutral locus backward in time. This increases computational efficiency because most of the stochasticity in the SFS is produced in the coalescent process rather than the sweep trajectory, and we can simulate many independent coalescent histories on each simulated sweep trajectory. We find the expected SFS by averaging our results over many independent forward simulations and many more independent backward simulations for each forward simulation. Because we focus on the expected SFS, we do not need to simulate neutral mutations explicitly but can instead simply multiply the branch lengths of our simulated coalescent trees by *U*_{n} to find the expected number of mutations. The neutral mutation rate *U*_{n} is, therefore, simply an overall scaling factor in all our results. For the details of the simulations, see Appendix A. All code can be found at https://github.com/weissmanlab/SFS_spatial_sweep.
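The last step, turning coalescent branch lengths into an expected SFS by multiplying by *U*_{n}, can be illustrated on its own. The sketch below uses a plain well-mixed neutral coalescent rather than the structured coalescent conditioned on a sweep trajectory, so it shows only the branch-length bookkeeping. All names are hypothetical and the code is not taken from the authors' repository.

```python
import random

def branch_lengths_by_size(n, N, rng):
    """Simulate one neutral coalescent tree for n haploid samples.

    Returns lengths[k] = total length (in generations) of branches that are
    ancestral to exactly k of the sampled lineages.
    """
    lengths = [0.0] * (n + 1)
    sizes = [1] * n  # number of sampled descendants of each active lineage
    while len(sizes) > 1:
        k = len(sizes)
        rate = k * (k - 1) / 2 / N  # each pair coalesces at rate 1/N
        t = rng.expovariate(rate)
        for size in sizes:
            lengths[size] += t
        i, j = rng.sample(range(k), 2)  # pick the pair that coalesces
        merged = sizes[i] + sizes[j]
        sizes = [sz for idx, sz in enumerate(sizes) if idx not in (i, j)]
        sizes.append(merged)
    return lengths

def expected_sfs(n, N, U_n, replicates=2000, seed=1):
    """Average branch lengths over trees and scale by the mutation rate U_n."""
    rng = random.Random(seed)
    total = [0.0] * (n + 1)
    for _ in range(replicates):
        for k, length in enumerate(branch_lengths_by_size(n, N, rng)):
            total[k] += length
    return [U_n * t / replicates for t in total]

sfs = expected_sfs(n=8, N=1000, U_n=1e-3)
print([round(x, 2) for x in sfs[1:8]])
```

For this neutral example the entries should approach 2*NU*_{n}/*k* for a mutation present in *k* sampled copies, which is the sample-count analogue of the neutral expectation used throughout the paper.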

## Results

### Background: Hitchhiking at a completely linked locus in a well-mixed population

We first review the expected SFS left by hitchhiking in a well-mixed population. Here, we focus on the simplest case of a completely linked locus, sampled immediately after the fixation of the beneficial allele; we consider the effects of drift after the sweep and recombination below. All diversity is, therefore, from mutations that occurred after the beginning of the sweep in individuals carrying the beneficial allele. The expected SFS follows two power-laws (Figure 2, grey curve). Almost all individuals carrying the sweeping allele who could potentially be ancestral to the sample lived in the late stages of the sweep, so most mutations are relatively recent and occurred too late to hitchhike to high frequencies. Their dynamics are, therefore, simply controlled by drift and follow the neutral expectation (Wakeley 2008):

$$\xi(f) \approx \frac{2 N U_n}{f}. \tag{1}$$

Eq. (1) is shown as a pink dashed curve in Figure 2 and following figures. Since these mutations occurred less than ln(*Ns*)/*s* generations before sampling (i.e., during the sweep), they do not have time to drift to a frequency of more than about ln(*Ns*)/(*Ns*) (Desai and Fisher 2007) (Figure 2, vertical dotted turquoise line). Only the lucky older mutations that occurred when the sweeping allele was rare can hitchhike to frequencies *f* ≫ ln(*Ns*)/(*Ns*). Since the beneficial background was growing approximately exponentially at this time, the frequency spectrum of mutations follows the classic scaling ∝ 1/*f*^{2} of Luria and Delbrück (1943). Specifically, for a mutation to hitchhike to a frequency of at least *f*, it must have occurred when the sweeping allele was present in ≾ 1/*f* copies. The total number of mutational opportunities up to this point of the sweep is equal to the number of individuals with the beneficial allele integrated over time, ≈ 1/(*sf*), so the expected number of mutations with frequency > *f* is ≈ *U*_{n}/(*sf*). Differentiating the expected number of mutations with respect to frequency gives the SFS:

$$\xi(f) \approx \frac{U_n}{s f^2}. \tag{2}$$
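The counting argument for the 1/*f*^{2} tail can be checked numerically. The sketch below is a deterministic illustration under stated assumptions: the sweeping background grows exactly as exp(*st*) from a single copy, a mutation arising when there are *n* copies ends at frequency 1/*n*, and drift is ignored. The function names are hypothetical.

```python
import numpy as np

def mutations_above_frequency(f, s, U_n, steps=200_000):
    """Expected number of mutations that hitchhike to frequency > f.

    Integrates the mutational input U_n * n(t) over the early phase of the
    sweep, up to the time at which n(t) = exp(s t) reaches 1/f copies.
    """
    t_max = np.log(1.0 / f) / s
    t = np.linspace(0.0, t_max, steps)
    n = np.exp(s * t)
    dt = t[1] - t[0]
    integral = dt * (n.sum() - 0.5 * (n[0] + n[-1]))  # trapezoid rule
    return U_n * integral

def hitchhiking_sfs(f, s, U_n, eps=1e-3):
    """SFS as minus the derivative of the cumulative count (central difference)."""
    upper = mutations_above_frequency(f * (1 - eps), s, U_n)
    lower = mutations_above_frequency(f * (1 + eps), s, U_n)
    return (upper - lower) / (2 * eps * f)

f, s, U_n = 1e-3, 0.05, 1e-4
print(mutations_above_frequency(f, s, U_n), U_n / (s * f))
print(hitchhiking_sfs(f, s, U_n), U_n / (s * f**2))
```

Both printed pairs should agree closely: the cumulative count approaches *U*_{n}/(*sf*) and its derivative approaches *U*_{n}/(*sf*^{2}).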

### Completely linked locus in a one-dimensional population

We will start by sketching an intuitive argument for the qualitative features of the SFS left by hitchhiking in a one-dimensional population. As above, we will first focus on a completely linked locus sampled immediately after the sweep. We do not attempt to characterize spatial patterns in genetic diversity at the locus even though we consider just-completed sweeps because we are interested in more typical older sweeps, finished at least ≈ *T*_{mix} generations ago. In a one-dimensional population, the sweep proceeds as a Fisher wave (Fisher 1937) (Figure 1, black curves). Most neutral mutations ancestral to the sample will occur in the bulk of the wave, where the beneficial allele is already fixed and there is no more hitchhiking (Figure 1, pink). These mutations should follow the neutral SFS and should be limited to low frequencies (Figure 2, pink dashed line). However, mutations that occur within the wavefront can hitchhike, following one of two possible trajectories. First, most will “surf” in the wavefront temporarily, but eventually be left behind and stop hitchhiking (Figure 2, blue). Second, the luckiest mutations will manage to fix in the wavefront and hitchhike to very high frequencies as they are carried by the wave of the beneficial allele (Figure 2, orange). We, therefore, expect to see three basic regimes in the hitchhiking SFS of a one-dimensional population, rather than just the two regimes in the well-mixed population. This intuition is confirmed by simulations (black solid curve in Figure 2). These show that the expected SFS of a one-dimensional population is higher than that of a well-mixed population, implying less genetic diversity is lost due to hitchhiking, in accord with the results of Barton *et al*. (2013) on pairwise coalescent probability.

We now give approximate analytical expressions for the form of the SFS in the three different regimes. Since the low-frequency alleles that occurred in the bulk of the wave (Figure 1, pink) did not hitchhike, they simply follow the neutral expectation *ξ*(*f*) ≈ 2*NU*_{n}/*f* (Figure 2, pink), as in a well-mixed population. But since the sweep time is now much longer than it would be in a well-mixed population, *T*_{sweep} ≈ *L/v* ≫ ln(*Ns*)/*s*, these mutations can drift for longer and reach higher frequencies, *f* ≾ *T*_{sweep}/*N* ≈ 1/(*ρv*) (Figure 2, vertical green dotted line).

To reach higher frequencies, *f* ≫ *T*_{sweep}/*N*, mutations have to hitchhike. For most of this frequency range, the SFS is approximately uniformly distributed, a pattern unique to the spatially extended population. This tail reflects the mutations that succeed in fixing in the wavefront (Figure 2, orange). Since the dynamics within the front are neutral, mutations fix in the front at a steady rate of *U*_{n}. For such a mutation to reach frequency at least *f*, it must have fixed in the wavefront before the sweeping allele reached frequency 1 – *f*. The total amount of time that the sweeping allele spends at frequency less than 1 – *f* is ≈ (1 – *f*) *L/v*, so the total number of mutations that we expect to fix in the front and reach frequencies greater than *f* is *U*_{n} (1 – *f*) *L/v*. Differentiating with respect to *f* (and taking the magnitude, since this count decreases with *f*) gives the expected SFS:

$$\xi(f) \approx \frac{U_n L}{v}. \tag{3}$$

This analytical approximation is in good agreement with simulations (Figure 2, orange horizontal dotted line) until very high frequencies, *f* → 1, at which point the initial waiting time for the beneficial allele to settle down to a Fisher wave becomes important; see Appendix D.
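The low-frequency neutral regime and the flat high-frequency regime can be written down as small helper functions. This is a sketch with hypothetical names and parameter values. It encodes only the two heuristic expressions given in the text (the neutral 2*NU*_{n}/*f* and the flat tail *U*_{n}*L/v*) together with the drift cutoff 1/(*ρv*), and it does not attempt to describe the intermediate surfing regime.

```python
def neutral_sfs(f, N, U_n):
    """Neutral expectation, valid below the drift cutoff."""
    return 2.0 * N * U_n / f

def fixed_in_front_above(f, U_n, L, v):
    """Expected number of wavefront-fixed mutations reaching frequency > f."""
    return U_n * (1.0 - f) * L / v

def flat_tail_sfs(U_n, L, v):
    """Height of the uniform high-frequency tail."""
    return U_n * L / v

def drift_cutoff_1d(rho, v):
    """Highest frequency reachable by drift during the sweep, ~ T_sweep / N."""
    return 1.0 / (rho * v)

L, rho, v, U_n = 500, 100, 0.2, 1e-3
N = L * rho
# Finite difference of the cumulative count recovers the flat tail height.
slope = (fixed_in_front_above(0.3, U_n, L, v)
         - fixed_in_front_above(0.5, U_n, L, v)) / 0.2
print(slope, flat_tail_sfs(U_n, L, v))
print(drift_cutoff_1d(rho, v), (L / v) / N)
```

For these illustrative values the tail height is 2.5 and the drift cutoff is 0.05, which equals *T*_{sweep}/*N* as expected.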

Between the low-frequency neutral regime and the high-frequency flat regime described above, there can also be an intermediate-frequency hitchhiking regime (frequencies ~ 10^{−3} to ~ 10^{−1} in Figure 2). This regime consists of mutations that occurred in the wavefront in the sweeping background and surfed for a while but were then left behind (Figure 1, blue area). The expected SFS of a one-dimensional population in this frequency range is similar to but distinct from that of a well-mixed population with the same *N* and *s* (*ξ*(*f*) ≈ *U*_{n}/(*sf*^{2}), as described above). Intuitively, the similarity is because these mutations only surf over a small portion of the range, so they have little opportunity to “feel” the spatial structure. To characterize the distinctive behavior of the one-dimensional spectrum in this regime, we can treat the wavefront as its own small population with a coalescent time scale *T*_{front} given by the power-law fits of Barton *et al*. (2013) and Birzu *et al*. (2018). Coarse-graining over this time scale then produces an effective offspring distribution with a heavy-tailed “sweepstakes” pattern (Okada and Hallatschek 2021). This gives a site frequency spectrum:
with *α* ≈ 0.3 (Barton *et al*. 2013); see Appendix E for details. The prefactor of 0.6 is a fit to the simulations shown in Figure 3. Because mutations can only persist in the wavefront for a limited amount of time before falling behind or fixing, this regime can only cover a limited frequency range that shrinks in longer habitats. This regime is also the first to be eroded as time passes from the end of the sweep (Figure 4). Taken together, these factors suggest that the higher-frequency uniform regime may be the more relevant portion of the spectrum for data from natural populations.

### The signature of spatial structure persists through time

Above, we have assumed that the population is sampled immediately after the sweep, but in reality this is unusual. More often, one is interested in all the sweeps that may have happened recently enough to still be detected. Since the signature of the sweep lasts much longer than the sweep itself, the waiting time *T*_{past} between the end of the sweep and sampling is typically longer as well: *T*_{past} ≫ *T*_{sweep}. Intuitively, this waiting time allows more time for new mutations to appear and grow neutrally via genetic drift; therefore, it extends the part of the SFS that matches the neutral expectation *ξ*(*f*) ≈ 2*NU*_{n}/*f*. More specifically, since drift has ≾ *T*_{sweep} + *T*_{past} generations to act, the neutral part of the spectrum extends to *f* ≾ (*T*_{sweep} + *T*_{past})/*N*, until drift erases the entire signature of hitchhiking at *T*_{past} ≈ *N*. For mutations at much higher frequencies *f* ≫ (*T*_{sweep} + *T*_{past})/*N*, the change in frequency due to drift is minor, so this part of the spectrum is nearly unaffected. (The exception to this is the extreme high-frequency tail where the frequency of the ancestral allele is small, 1 – *f* ≾ (*T*_{sweep} + *T*_{past})/*N*; these mutations can be driven to fixation by drift.) Figure 4 shows that this intuition matches simulations. As a result, the high-frequency uniform tail of the hitchhiking SFS in one-dimensional populations is the longest-lasting part of the spectrum, while the intermediate-frequency regime relaxes to neutrality first.
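The heuristic boundaries in this argument can be collected into a small classifier. The sketch below is illustrative: it treats the order-of-magnitude conditions as sharp thresholds, which the text does not claim, and the function name and regime labels are hypothetical.

```python
def sfs_regime_after_sweep(f, N, T_sweep, T_past):
    """Classify frequency f by which process dominates the expected SFS there."""
    cutoff = (T_sweep + T_past) / N  # frequency change reachable by drift
    if cutoff >= 1.0:
        return "fully_neutral"      # drift has erased the hitchhiking signature
    if f <= cutoff:
        return "neutral"            # recent mutations, 2 N U_n / f
    if 1.0 - f <= cutoff:
        return "eroded_tail"        # ancestral allele rare; drift can fix mutant
    return "hitchhiking"            # signature of the sweep is retained

N, T_sweep = 1_000_000, 1_000
for T_past in (0, 10_000, 100_000, 2_000_000):
    labels = [sfs_regime_after_sweep(f, N, T_sweep, T_past)
              for f in (0.005, 0.05, 0.5, 0.995)]
    print(T_past, labels)
```

As *T*_{past} grows, the neutral label spreads upward from low frequencies and the eroded label spreads downward from *f* = 1, squeezing the retained hitchhiking window from both sides.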

### The effect of recombination

So far, we have focused on neutral loci that are completely linked to the positively selected locus. We now consider a neutral locus farther away on the genome that recombines at rate *r* with the selected locus. We can do this by combining the picture of the one-dimensional dynamics developed above with the known effects of recombination on the SFS produced by hitchhiking. In this section, we will first review those effects and their mathematical expressions in well-mixed populations, and then introduce the equivalent expressions for one-dimensional populations. We can see recombination as having two related effects: first, it brings new alleles onto the sweeping background, allowing them to hitchhike, effectively increasing the mutation rate; and second, it breaks down the positive linkage disequilibrium that drives the hitchhiking. While these effects are really two sides of the same coin, it is helpful to think of them separately because they vary in importance at different genomic scales.

For the first effect of recombination, bringing new alleles onto the sweeping background, a typical recombinant will be very distantly related to the original sweeping background, with an expected time to the most recent common ancestor of *T*_{coal} ≈ *N*. They will therefore differ by an average of 2*NU*_{n} mutations, *NU*_{n} each on the original background and the recombinant. The fact that the recombinant brings in *NU*_{n} new mutations means that the effective neutral mutation rate of the sweep is increased by *rNU*_{n}, for a total of *U*_{n,eff} = (1 + *Nr*) *U*_{n}. On the other hand, the fact that the recombinant has the ancestral allele at *NU*_{n} sites where the original sweeping background was mutated means that those mutations are prevented from fixing by the hitchhiking of the *ancestral* alleles. (In the absence of recombination, the mutant alleles would have fixed and contributed to divergence from an outgroup but not polymorphism.) This produces a mirror uptick in the unfolded SFS at high frequencies, where at effective mutation rate *NrU*_{n}, ancestral alleles arrive on the sweeping background and hitchhike following the same expressions as previously found for the mutant alleles, but with the ancestral allele frequency 1 – *f* replacing the mutant frequency *f*. Thus, in a well-mixed population, Eq. (2) for the SFS becomes:

$$\xi(f) \approx \frac{(1 + Nr)\,U_n}{s f^2} + \frac{N r\,U_n}{s\,(1 - f)^2}. \tag{5}$$

As with a completely linked locus, Eq. (5) is only valid for allele frequencies *f* above that which can be reached by drift. It is also cut off at an upper limit of *f* ≈ 1 – (*r/s*) ln(*Ns*), above which the SFS flattens out (Fay and Wu 2000). Within these bounds, Eq. (5) matches simulations well, although it slightly underestimates the rate at which the SFS relaxes to neutral expectation for high recombination rates (Figure 5a-c, cyan dot-dashed lines). We see that this first effect of recombination is important even very close to the selected locus, with recombination dominating the effective mutation rate *U*_{n,eff} for *r* ≿ 1/*N*, and the high-frequency mirror-image portion of the spectrum being visible at even closer loci in large samples.
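The well-mixed expression with recombination, as described in the text (a mutant term with effective mutation rate (1 + *Nr*)*U*_{n} plus a mirror-image ancestral term with rate *NrU*_{n}), can be evaluated directly. The sketch below uses hypothetical function names and example parameters, and it takes the lower bound of validity to be the drift limit ln(*Ns*)/(*Ns*) from the completely linked case.

```python
import math

def well_mixed_sfs_with_recombination(f, N, s, r, U_n):
    """Hitchhiking SFS at a locus recombining at rate r with the sweep."""
    mutant_term = (1 + N * r) * U_n / (s * f**2)
    ancestral_term = N * r * U_n / (s * (1 - f)**2)  # mirror image in 1 - f
    return mutant_term + ancestral_term

def validity_range(N, s, r):
    """Approximate frequency bounds within which the expression applies."""
    lower = math.log(N * s) / (N * s)        # reachable by drift below this
    upper = 1 - (r / s) * math.log(N * s)    # SFS flattens out above this
    return lower, upper

N, s, U_n = 1_000_000, 0.05, 1e-3
for r in (0.0, 1e-6, 1e-4):
    low, high = validity_range(N, s, r)
    print(r, well_mixed_sfs_with_recombination(0.2, N, s, r, U_n),
          well_mixed_sfs_with_recombination(0.8, N, s, r, U_n), (low, high))
```

At *r* = 0 the function reduces to *U*_{n}/(*sf*^{2}). Once *Nr* ≫ 1 the spectrum becomes nearly symmetric under *f* → 1 – *f*, which is the mirror uptick described above.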

The second effect of recombination, reducing hitchhiking by breaking down positive linkage disequilibrium between the sweeping allele and alleles at neutral loci, is more familiar than the effect described above, but it is only effective at more distant loci. Intuitively, this reflects a comparison between the time scale for hitchhiking to increase the frequency of a neutral allele in positive linkage disequilibrium with the sweeping allele, and the time scale 1/*r* for the linkage disequilibrium to decay. For a well-mixed population, the time for hitchhiking to boost an allele initially established at ≈ 1/*s* copies to frequency *f* is ≈ ln(*Nfs*)/*s*. Thus, recombination prevents hitchhiking from affecting the bulk of the SFS at loci at map distance *r* ≿ *s*/ln(*Ns*) from the sweep (Stephan *et al*. 1992; Kaplan *et al*. 1989; Barton 1998, 2000). The effect on the upper tail of the SFS persists out to slightly more distant loci because it involves alleles that hitchhike for a slightly shorter time and because the neutral spectrum is lowest there (Fay and Wu 2000); this is barely visible in Figure 5d, but even this is effectively gone before the recombination rate reaches *r* ≈ *s*.

In a one-dimensional population, the logic is exactly the same, except that two expressions, Eq. (3) and Eq. (4), need to be adjusted. For the alleles that fix in the wavefront, Eq. (3) becomes:

$$\xi(f) \approx \frac{(1 + 2Nr)\,U_n L}{v}. \tag{6}$$

Note that the factor (1 + 2*Nr*) *U*_{n} is the sum of mutations on the sweeping background (*U*_{n}), mutations that recombine onto the sweeping background (*NrU*_{n}), and ancestral alleles that replace mutations when they recombine onto the sweeping background (another factor of *NrU*_{n}). Since there is no *f* dependence, these all combine into one prefactor, rather than appearing in front of separate terms as in Eq. (5). Eq. (6) matches simulations well (Figure 5a, orange dotted line).
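The three contributions to the flat-tail prefactor can be tallied explicitly. This is a small illustrative sketch. The function name, dictionary keys, and parameter values are hypothetical, and *T*_{sweep} is taken to be *L/v*.

```python
def flat_tail_with_recombination(N, r, U_n, L, v):
    """Height of the uniform tail at recombination rate r, split by source."""
    T_sweep = L / v
    contributions = {
        "mutations_on_background": U_n * T_sweep,
        "mutations_recombined_in": N * r * U_n * T_sweep,
        "ancestral_alleles_recombined_in": N * r * U_n * T_sweep,
    }
    return sum(contributions.values()), contributions

total, parts = flat_tail_with_recombination(N=50_000, r=1e-5, U_n=1e-3,
                                            L=500, v=0.2)
print(total, parts)
```

With *Nr* = 0.5 in this example, the two recombination terms together contribute as much as new mutations, so the tail height doubles relative to the completely linked case.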

For the alleles that only surf temporarily in the front before being dropped, Eq. (4) becomes:

As in Eq. (5), the first term comes from the hitchhiking of mutations, while the second comes from the hitchhiking of ancestral alleles preventing the fixation of hitchhiking mutations. Eq. (7) matches simulations well, but only applies over small ranges of frequencies (Figure 5a, blue dot-dashed lines); as in the well-mixed case, it does not apply to very low or high frequencies where drift becomes more important, and it also does not apply to the middle of the frequency range, where Eq. (6) dominates.

In a one-dimensional population, sweep and thus hitchhiking dynamics are slower than in a well-mixed population. Therefore, smaller recombination rates *r* are sufficient to produce linkage disequilibrium decay times 1/*r* that are short compared to the time needed to hitchhike, the necessary condition for the second effect of recombination (i.e., recombination blocks hitchhiking). The one-dimensional population differs from the well-mixed one in that there are now two different hitchhiking time scales, corresponding to the two regimes of the completely linked SFS: the long time *T*_{sweep} ≈ *L/v* over which alleles that fix in the front hitchhike, and the shorter time *T*_{front} (Barton *et al*. 2013; Birzu *et al*. 2018) over which alleles that only temporarily surf in the front can hitchhike. As we move away from the swept locus on the genome, we expect the uniform portion of the hitchhiking SFS, Eq. (6), to relax to neutrality first, at *r* ≈ *v/L*, while the intermediate-frequency signal persists until map distances that are similar to those at which the intermediate-frequency signal vanishes in well-mixed populations.
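To get a feel for how different the two genomic scales can be, the map distances at which hitchhiking is blocked can be compared numerically. The sketch below uses hypothetical names and example values. The wave speed *v* is passed in as a measured quantity rather than computed, and the thresholds are order-of-magnitude estimates only.

```python
import math

def hitchhiking_map_distances(L, rho, s, v):
    """Approximate map distances (Morgans) at which hitchhiking is blocked."""
    N = L * rho
    r_well_mixed = s / math.log(N * s)  # bulk of the SFS, well-mixed population
    r_flat_tail_1d = v / L              # uniform tail, 1D: 1 / T_sweep
    return r_well_mixed, r_flat_tail_1d

r_wm, r_1d = hitchhiking_map_distances(L=500, rho=100, s=0.05, v=0.1)
print(r_wm, r_1d, r_wm / r_1d)
```

For these example values the flat tail is confined to a region roughly thirty times narrower on the genetic map than the well-mixed hitchhiking region.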

## Discussion

We have shown that the hitchhiking caused by a selective sweep in a one-dimensional spatially structured population produces a very different expected site frequency spectrum from that left by hitchhiking in well-mixed populations. This is true even if the spatial structure is very weak, in the sense that frequencies of common neutral alleles vary very little from location to location. The most striking feature of the expected one-dimensional hitchhiking SFS is its long flat tail, consisting of mutations that fixed in the wavefront during the course of the sweep and were carried through much of the population. Intuitively, the underlying difference in the dynamics is that in a spatially structured population, an allele that begins to hitchhike midway through the sweep can reach very high frequencies, whereas in a well-mixed population hitchhiking is largely confined to those alleles present in the sweeping background very early in the sweep (Coop and Ralph 2012). Sweeps are also much slower in spatially structured populations than they are in well-mixed ones, giving mutation and recombination more time to act. We found that this makes the overall diversity higher, even at loci completely linked to the swept locus, and makes the intensity of hitchhiking decay more rapidly as one moves away from the swept locus along the genome. The high-frequency (flat) portion of the SFS relaxes back to the neutral expectation particularly quickly as genetic map distance to the swept locus increases.

The most pressing question is to what extent these effects might be seen in data from natural populations. For the restriction of hitchhiking to a narrow region of the genome, our results actually predict an absence of signal relative to a well-mixed population, so this would necessarily be present in natural populations with one-dimensional spatial structure. Since the width of the hitchhiking region on the genome is used to infer the selective coefficient of the sweep under the assumption that it is ≈ *s*/ln(*Ns*), our results suggest that spatial structure should produce a bias towards underestimating the selective coefficients driving sweeps.

The question of whether the distinctive flat tail of the SFS should be visible in data is somewhat more subtle. Let us first consider the number of new mutations that are expected to hitchhike to high frequencies in essentially complete linkage with the sweeping allele. We predict that the expected SFS is *ξ*(*f*) ≈ *U*_{n}*T*_{sweep} in this regime. The regime extends over a range of frequencies of order 1, so *U*_{n}*T*_{sweep} is also roughly the total expected number of alleles found in this frequency range. It might appear that the slower the sweep, the more alleles we should see, because of the factor of the sweep time *T*_{sweep} in *ξ*. But increasing *T*_{sweep} also decreases the length of the completely linked portion of the genome, which is ≈ 1/*T*_{sweep} in Morgans, reducing the mutation rate *U*_{n}. Let *ν* be the mutation rate per Morgan, which is typically of order one. Then the locus-wide mutation rate will be *U*_{n} ≈ *ν*/*T*_{sweep}. We see that the factors of *T*_{sweep} approximately cancel and the total expected number of high-frequency linked mutations is ≈ *ν*, so for many organisms it would be reasonable to expect to find such a mutation. Note that this is higher than the equivalent number for well-mixed populations by a factor ≈ ln(*Ns*), which might be an order of magnitude.
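The cancellation of *T*_{sweep} in this estimate is easy to verify with a few lines of arithmetic. The sketch below is illustrative, with hypothetical function names. It takes the completely linked map length to be exactly 1/*T*_{sweep} and the well-mixed count to be smaller by exactly ln(*Ns*), both of which are only order-of-magnitude statements in the text.

```python
import math

def expected_linked_high_frequency_mutations(nu, T_sweep):
    """Expected number of high-frequency mutations linked to a 1D sweep."""
    linked_map_length = 1.0 / T_sweep  # Morgans
    U_n = nu * linked_map_length       # mutation rate of the linked region
    return U_n * T_sweep               # flat SFS height times O(1) freq range

def well_mixed_equivalent(nu, N, s):
    """Corresponding number for a well-mixed population (smaller by ln(Ns))."""
    return nu / math.log(N * s)

for T_sweep in (1e3, 1e4, 1e5):
    print(T_sweep, expected_linked_high_frequency_mutations(1.0, T_sweep))
print(well_mixed_equivalent(1.0, N=1_000_000, s=0.05))
```

The first function returns *ν* regardless of the sweep duration, while the well-mixed value for these example parameters is roughly a tenth of that.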

While one might expect to find a mutation or two in the high-frequency tail of the SFS around a sweep, one expects to find many more alleles that were introduced by recombination. At the characteristic map length *r* ≈ 1/*T*_{sweep} of the hitchhiking region, the recombinant alleles outnumber mutations that occurred during the sweep by a factor *U*_{n,eff}/*U*_{n} ≈ *N*/*T*_{sweep} ≫ 1. (Here, *N* is really the neutral coalescence time *T*_{coal}, i.e., the effective population size *N*_{e}.) Note that this is mostly because each recombination event brings in a new haplotype that typically differs from the original sweeping background at multiple sites; specifically, over the map distance ≈ 1/*T*_{sweep} over which linkage disequilibrium is maintained during the sweep, they typically differ at ≈ *Nν*/*T*_{sweep} sites. Thus, if instead of alleles we count high-frequency recombinant haplotypes, we find that the expected number is order one, independent of any of the population parameters. Intuitively, this is because the limits of the region of the genome affected by hitchhiking are set by the map distance *r* ≈ 1/*T*_{sweep} at which we expect to find recombinants that successfully hitchhiked for long distances. In other words, the hitchhiking typically extends out to the first high-frequency recombinant in each direction on the genome from the swept locus. Thus, there will usually be a few high-frequency haplotypes, but not many more, as at larger map distances *rT*_{sweep} ≫ 1 recombination prevents haplotypes from hitchhiking to high frequencies. Here the contrast with well-mixed populations is stark: in a well-mixed population, hitchhiking is still broken up by successful recombinants occurring midway through the sweep, but there are typically a large number of these recombinants, each of which barely hitchhikes, and so only the original haplotype of the sweep reaches high frequency (Garud *et al*. 2015).

Schrider *et al*. (2015) noted this phenomenon of potential high-frequency recombinant haplotypes around the selected locus, calling them the “soft shoulders” of the sweep and suggesting that they could potentially mislead inference methods into mistaking “hard” sweeps descended from a single mutation for “soft” sweeps descended from multiple independent mutations. But they only considered well-mixed populations and found that the soft shoulders could be reliably distinguished from the hard center by considering large genomic windows. It is unclear if this is possible in one-dimensional populations, as the soft shoulders are much more frequent and much closer to the swept locus, as described above. It is possible that spatial structure may often lead to misidentification of hard sweeps as soft. While it is known that misspecified demography can interfere with sweep inference and distinguishing hard from soft sweeps (see, e.g., Harris *et al*. (2018)), the effect of space we have found here is particularly insidious because it is strong even in populations that appear to be only very weakly spatially structured by standard measures such as *F*_{ST}.

Recent studies have found evidence for widespread soft sweeps and few hard sweeps in different species (Garud *et al*. 2015; Schrider and Kern 2017). However, most of these might better be termed “firm” sweeps, in that they must have begun from only a few mutations to be detectable; very soft sweeps with significant contributions from dozens or hundreds of mutations would likely leave only a very modest trace. This is a somewhat surprising result that suggests some fine-tuning in nature. In standard population genetics models, a broad range of parameter space produces low beneficial mutation supplies and adaptation driven by hard sweeps, while a similarly broad range of parameter space produces abundant standing variation and adaptation via modest shifts in allele frequencies. Nonetheless, only a relatively narrow range reliably produces a few successful beneficial mutations. This suggests that either there must be some mechanism that drives populations to the right region of parameter space to produce firm sweeps, or the apparent prevalence of firm sweeps may be due to some other factor. For the first option, it may be that in populations with high levels of standing variation under selection, interference automatically reduces the number of fit backgrounds in such a way that the beneficial mutation supply effectively becomes of order one. However, to our knowledge there is as yet no model demonstrating this. Our results here suggest that spatial structure may be an example of the second option: a population feature which generically leads hard sweeps to appear firm.

There are a number of ways that the present work needs to be extended before we can fully assess the potential importance of spatial structure to hitchhiking in natural populations. We have focused on just the expected SFS, but (as shown by the extensive averaging that we need to do across simulation runs and allele frequencies; see Appendix A) in any particular sweep there will be large amounts of variation around the expectation, particularly at high frequencies. In addition, while we have touched on potential haplotype structure here in the Discussion, our analysis has focused on just the SFS, i.e., we have not explicitly modeled haplotypes and linkage among neutral loci. Modeling these is most likely necessary to find patterns that can distinguish hard sweeps from soft sweeps in spatially structured populations; this may be a difficult problem, given that some of the features we find are reminiscent of spatial soft sweeps (Ralph and Coop 2010).

We have only considered a simple one-dimensional stepping-stone model of spatial structure, but most natural populations are likely to be two-dimensional and have some long-range dispersal. Both of these are likely to reduce the effects of spatial structure on hitchhiking, as they increase the number of individuals that are contributing to the spread of the sweep at any point in time (Ralph and Coop 2010; Barton *et al*. 2013; Hallatschek and Fisher 2014; Paulose and Hallatschek 2020; Fusco *et al*. 2016). However, sweeps are still far slower, and the distribution of reproductive value across individuals carrying the sweeping allele is still far more skewed than in a well-mixed population, so we expect some of the qualitative features of our results to persist. In particular, as long as the dispersal distribution is not too broad, it is still true that a single mutant or recombinant arising midway through the sweep can be the ancestor of a significant fraction of the population, so we still expect the SFS to have a flatter tail than the well-mixed *ξ*(*f*) ∝ 1/*f*^{2}.

These extensions to our analysis are likely to be quite challenging analytically. One could also hope to try simply simulating hitchhiking under realistic parameter ranges for natural populations and measuring the resulting genetic diversity. Unfortunately, we know little about what the ranges of the relevant parameter values are in natural populations. The problem is particularly acute for the parameters related to spatial structure like the density *ρ*, the dispersal rate *m*, and the pattern of dispersal across space, especially the frequency and distribution of long-range jumps, which can determine the sweep dynamics (Ralph and Coop 2010; Hallatschek and Fisher 2014). This challenge could potentially be addressed by first using genomic regions far from putative sweeps to infer the population structure, and then using this information to simulate sweeps.

## Data availability

All simulation code, the Jupyter notebook used to make the figures, and the simulation data used in the notebook are available at https://github.com/weissmanlab/SFS_spatial_sweep.

## Acknowledgements

The authors thank Ben Good and Adrian Gushin for helpful discussions. This work was supported by a QBio Fellowship (to JM) from the NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard (grant DMS-1764269), the National Science Foundation (grant PHY-1914916 to MMD and grant PHY-2146260 to DBW), the Simons Foundation (Simons Investigator award in the Mathematical Modeling of Living Systems to DBW) and the Sloan Foundation (Sloan Research Fellowship to DBW). The computations in this paper were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.

## Appendix A. Simulation methods

As described in the Model section above, we use two-part simulations, first simulating the sweep forward in time, and then simulating coalescence at the neutral locus backward in time conditional on the sweep. The forward simulations are a standard stepping-stone Wright-Fisher model. Here, we will briefly explain the form of the backward simulations. At the start of each backward simulation, we have n lineages, sampled uniformly in space. We follow these backward through time, keeping track of their spatial location and which background they are on at the selected locus. (For the values of *n* used in each figure, see Table 1.) At the time of sampling, all are necessarily on the background of the sweeping allele, since the sweep has already been completed. Let *p*(*x, t*) be the sweep trajectory obtained from the forward simulation; going backward in time, the motion and coalescence of the lineages depends on this trajectory.

Each backward generation consists of three steps: recombination, dispersal, and coalescence. Because all rates are small, we do not expect the precise order to be very important. In the recombination step, each lineage that was in the sweeping background in deme *x* at time *t* + 1 moves to the ancestral background with probability *r*(1 – *p*(*x, t*)), and each lineage in the ancestral background moves to the sweeping background with probability *rp*(*x, t*). In the dispersal step, if a lineage is in the sweeping background in deme *x* at time *t* + 1, then at time *t* it will be in deme *x, x* – 1, or *x* + 1 with probabilities:
where *Z*(*x, t*) = *P*(*x* → *x, t*) + *P*(*x* → *x* – 1, *t*) + *P*(*x* → *x* + 1, *t*) is a normalization factor to ensure that the probabilities sum to one. For lineages on the ancestral background, the dispersal probabilities are the same but with *p* replaced by 1 – *p* in all the equations above. In the coalescence step, two lineages that are both in deme *x* at time *t* coalesce with probability 1 /(*ρp*(*x, t*)) if they are both on the sweeping background, or probability 1/(*ρ*(1 – *p*(*x, t*))) if they are both on the ancestral background. (If one is on the ancestral background and the other is on the sweeping background, they cannot coalesce.)
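The three steps of a backward generation can be sketched as follows. This is a minimal illustration, not the code from the repository: the function names are hypothetical, and the dispersal weights, (1 – *m*) times the local background frequency for staying and (*m*/2) times the neighboring frequency for moving, are an assumed form chosen only to be consistent with the normalization *Z*(*x, t*) described above.

```python
import numpy as np

def dispersal_probs(x, freq, m):
    """Backward dispersal probabilities (to x-1, x, x+1) for a lineage in deme x.

    freq[x] is the local frequency of the lineage's own background at time t:
    p(x, t) on the sweeping background, 1 - p(x, t) on the ancestral one.
    ASSUMED weights: (1 - m) * freq[x] to stay, (m / 2) * freq[x +/- 1] to move.
    """
    n_demes = len(freq)
    left = 0.5 * m * freq[x - 1] if x > 0 else 0.0
    right = 0.5 * m * freq[x + 1] if x < n_demes - 1 else 0.0
    stay = (1.0 - m) * freq[x]
    weights = np.array([left, stay, right])
    z = weights.sum()  # normalization factor Z(x, t)
    if z == 0.0:       # guard for an inconsistent trajectory; keep the lineage in place
        return np.array([0.0, 1.0, 0.0])
    return weights / z

def backward_lineage_step(x, on_sweep, p_t, r, m, rng):
    """Recombination, then dispersal, for one lineage; returns (new_x, new_on_sweep)."""
    p_here = p_t[x]
    # Recombination: switch background with probability r times the other background's frequency.
    switch_prob = r * (1.0 - p_here) if on_sweep else r * p_here
    if rng.random() < switch_prob:
        on_sweep = not on_sweep
    freq = p_t if on_sweep else 1.0 - p_t
    step = rng.choice([-1, 0, 1], p=dispersal_probs(x, freq, m))
    return int(x + step), on_sweep

def pair_coalescence_prob(x, on_sweep, p_t, rho):
    """Per-generation coalescence probability for two lineages sharing deme x and background."""
    freq = p_t[x] if on_sweep else 1.0 - p_t[x]
    return min(1.0, 1.0 / (rho * freq)) if freq > 0 else 0.0
```

Here `p_t` is assumed to be a NumPy array holding *p*(*x, t*) for every deme at the earlier time *t*, and `rng` is a `numpy.random.Generator`.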

For a locus completely linked to the selected locus, the entire sample must coalesce by the beginning of the sweep, since the sweep begins with a single mutant individual. For a locus that can recombine with the selected locus, some lineages may escape coalescence at the beginning of the sweep, in which case the sample will usually take far longer to fully coalesce, *T*_{coal} ~ *N*. Simulating this time in full detail would be prohibitive computationally, and wasteful, since over these long time scales the population is effectively well-mixed. Therefore, once the simulations reach times prior to the beginning of the sweep, *t* < 0, we continue them for 3200 generations to allow them to complete the “scattering phase” (Wakeley and Aliacar 2001), and then switch to a simple well-mixed Kingman coalescent until the sample is fully coalesced.
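The final well-mixed phase can be sketched as a standard Kingman coalescent applied to whatever lineages remain. This is an illustrative sketch with hypothetical names; it assumes that each pair of lineages coalesces at rate 1/*N* per generation and that each remaining lineage carries a count of how many sampled individuals descend from it.

```python
import numpy as np

def finish_with_kingman(descendant_counts, N, n, rng):
    """Complete the genealogy under a well-mixed Kingman coalescent.

    descendant_counts: number of sampled individuals descended from each remaining lineage.
    N: ASSUMED coalescent time scale (pairwise coalescence rate 1 / N per generation).
    Returns branch_length, where branch_length[k] is the added time ancestral to k samples.
    """
    counts = list(descendant_counts)
    branch_length = np.zeros(n + 1)
    while len(counts) > 1:
        k = len(counts)
        # Waiting time until the next coalescence among k lineages.
        wait = rng.exponential(N / (k * (k - 1) / 2))
        for c in counts:
            branch_length[c] += wait
        # A uniformly chosen pair of lineages merges.
        i, j = rng.choice(k, size=2, replace=False)
        merged = counts[i] + counts[j]
        counts = [c for idx, c in enumerate(counts) if idx not in (i, j)] + [merged]
    return branch_length
```

The loop stops once a single lineage remains, so no branch length is ever assigned to the ancestor of the whole sample.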

Rather than store full coalescent trees for our large samples, for each simulation we only keep track of the information we need to build the SFS across generations. Specifically, each generation we increment the total branch length ancestral to *k* sampled individuals, for all *k* from 1 to *n* – 1.
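A minimal sketch of this bookkeeping is given below. The function names are hypothetical, and the conversion to an expected SFS assumes that neutral mutations fall on the genealogy at rate *U*_{n} per generation of branch length.

```python
import numpy as np

def tally_generation(branch_length, descendant_counts):
    """Add one generation of branch length for every lineage currently in the genealogy.

    branch_length has length n + 1; entry k accumulates time ancestral to k samples.
    descendant_counts lists, for each current lineage, how many samples descend from it.
    """
    n = len(branch_length) - 1
    for k in descendant_counts:
        if 1 <= k <= n - 1:  # a lineage ancestral to all n samples adds no segregating sites
            branch_length[k] += 1.0
    return branch_length

def expected_sfs(branch_length, U_n):
    """Expected number of neutral mutations carried by exactly k of the n samples."""
    return U_n * np.asarray(branch_length, dtype=float)
```

For example, a sample of four lineages that passes through the configurations (1, 1, 1, 1), (2, 1, 1), (2, 2) and (4) in successive generations accumulates six generations of singleton branch length and three generations of doubleton branch length.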

For every set of parameter values, we run *K*_{forward} independent forward simulations. (Here *K*_{forward} is the number of successful simulations, not counting the many more where the beneficial mutant goes extinct without sweeping.) For each forward simulation, we run *K*_{backward} independent backward simulations. All simulation results presented are averages over these *K*_{forward} × *K*_{backward} runs. The averaging over multiple backward simulations is particularly important, especially for estimating the spectrum at high frequencies *f*, and especially with recombination, as there can be substantial stochasticity in when the last few lineages coalesce. To deal with this further, in the figures we smooth the simulation curves by averaging over sliding windows of width Δ*f* for all frequencies *f* > *f*_{smooth}. (Because the lower-frequency portions of the SFS receive many independent contributions from near the tips of each coalescent tree, they are already nearly deterministic and do not need additional smoothing.) The values of *K*_{forward}, *K*_{backward}, *f*_{smooth}, and Δ*f* for each figure are specified in Table 1.
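The smoothing step can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes that each window is centered on the frequency being smoothed and that the average is unweighted, neither of which is specified above.

```python
import numpy as np

def smooth_sfs(freqs, sfs, f_smooth, delta_f):
    """Average the SFS over a sliding window of width delta_f for all f > f_smooth.

    ASSUMPTIONS: the window is centered on each frequency and the mean is unweighted.
    Values at f <= f_smooth are returned unchanged.
    """
    freqs = np.asarray(freqs, dtype=float)
    sfs = np.asarray(sfs, dtype=float)
    smoothed = sfs.copy()
    for i, f in enumerate(freqs):
        if f > f_smooth:
            in_window = np.abs(freqs - f) <= delta_f / 2
            smoothed[i] = sfs[in_window].mean()
    return smoothed
```

Windows near *f* = 1 contain fewer points in this sketch, so the highest frequencies are averaged over a narrower effective range.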

For the well-mixed SFS in Figure 2, we use two sets of simulations. For low frequencies *f* < 8 × 10^{−4} we need to use a large sample size n, but as mentioned above we need relatively few simulation runs. For higher frequencies, we can use a smaller sample size but must run more simulations. For Figure 6, we simulate 21 different possible starting locations for the sweep (ranging from *x* = 0 to *x* = 5000 in increments of 250). For each of these, we run five independent simulations. See Table 1 for the full simulation settings for both figures.

## Appendix B. Measuring wave speed

The speed of the wave of advance of the sweeping allele actually only reaches its limiting value *v*_{∞} in infinitely dense populations. In real populations with finite density *ρ*, fluctuations reduce the speed of the wave. For each value of *ρ*, we therefore measure *v* directly in a single forward simulation and use this value in all plots. To do this, we use Barton *et al*. (2013)’s definition of the wave speed *v*(*t*). Since we are interested in the average speed, not the instantaneous speed of the wavefront, we average *v*(*t*) over the middle half of the sweep, discarding the first *T*_{sweep}/4 generations to avoid non-equilibrium effects while the wave is first establishing, and discarding the last *T*_{sweep}/4 generations to avoid edge effects as the wave hits the far boundary of the range. We list the values we obtain for *v* in Table 2. They are all within ≈ 20% of the limiting value *v*_{∞}, so these corrections are fairly minor.
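The averaging over the middle half of the sweep can be sketched as below. The function name is hypothetical, and the instantaneous speed used here, the per-generation increase in the summed allele frequency across demes, is an assumed stand-in for the definition of Barton *et al*. (2013) rather than a transcription of it.

```python
import numpy as np

def mean_wave_speed(p):
    """Average wave speed over the middle half of a sweep.

    p[t, x] is the frequency of the sweeping allele in deme x at generation t,
    with rows assumed to span the sweep from its start to its completion.
    ASSUMPTION: v(t) is the per-generation growth of sum_x p(x, t), in demes per generation.
    """
    p = np.asarray(p, dtype=float)
    occupied = p.sum(axis=1)   # range length effectively taken over by the sweep
    v_t = np.diff(occupied)    # instantaneous speed v(t)
    t_sweep = len(v_t)
    start, stop = t_sweep // 4, t_sweep - t_sweep // 4  # drop first and last quarters
    return v_t[start:stop].mean()
```

For a sweep spreading in both directions from the middle of the range, this quantity would measure the combined advance of the two fronts, so the sketch applies as written only to a sweep starting at one edge.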

## Appendix C. Sweeps starting from the middle of the range

If the genetic sweep starts somewhere in the middle of the range, the Fisher wave becomes bi-directional. Therefore, neutral alleles can fix in the left or the right wavefront. The sweep time also depends on the initial location of the beneficial mutation.

Suppose the beneficial mutation starts at spatial position *x* at *t* = 0. Then if a neutral mutation fixes in a wavefront at *t* = *t*_{seed}, the allele frequency after the fixation is roughly *f* = (*x* – *vt*_{seed})/*L* if it is on the left side and *f* = (*L* – *x* – *vt*_{seed})/*L* if it is on the right. There are different upper bounds for *t*_{seed} (0 ≤ *t*_{seed} ≤ *x/v* for the left, 0 ≤ *t*_{seed} ≤ (*L* – *x*)/*v* for the right) because the mutation has to fix before the wave arrives at the boundary. If the probability distribution of the starting location of the sweep is *q*(*x*), the expected high-frequency tail of the SFS is

In the main text, we assume that the sweep starts in the leftmost deme (*q*(*x*) = *δ*(*x*)), and therefore get *ξ*(*f*) = *U*_{n}*L/v*. If the sweep starts at some other position *l*, which we can assume without loss of generality to be < *L*/2, then *q*(*x*) = *δ*(*x* – *l*) and we have:

If we instead consider the SFS averaged over the genomic neighborhoods of many independent sweeps with starting positions uniformly distributed over the range, we have *q*(*x*) = 1/*L* and *ξ*(*f*) = (2*U*_{n}*L/v*)(1 – *f*), as shown in Figure 6.
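The cases discussed in this appendix can be checked numerically with a short sketch. It assumes that neutral mutations fix in each wavefront at rate *U*_{n} per generation, which is what the main-text result *ξ*(*f*) = *U*_{n}*L/v* implies for a sweep starting at the left edge; the function name and the discretization of *q*(*x*) into weighted start positions are hypothetical.

```python
import numpy as np

def sfs_tail(f, start_positions, weights, L, v, U_n):
    """Expected high-frequency SFS tail, averaged over sweep starting positions.

    start_positions, weights: a discretized q(x), with weights summing to one.
    ASSUMPTION: mutations fix in each wavefront at rate U_n per generation.
    """
    f = np.asarray(f, dtype=float)[:, None]
    x = np.asarray(start_positions, dtype=float)[None, :]
    q = np.asarray(weights, dtype=float)[None, :]
    # The left front yields f = (x - v t_seed) / L, covering each f below x / L once;
    # the right front yields f = (L - x - v t_seed) / L, covering each f below 1 - x / L once.
    left = (f < x / L).astype(float)
    right = (f < 1.0 - x / L).astype(float)
    # Changing variables from t_seed to f contributes the factor |dt_seed / df| = L / v.
    return (U_n * L / v) * ((left + right) * q).sum(axis=1)
```

With a single start position at *x* = 0 this returns the flat value *U*_{n}*L/v*, and with uniformly spaced start positions of equal weight it approaches (2*U*_{n}*L/v*)(1 – *f*). For a start at an interior position *l* < *L*/2, the sketch gives twice the flat value for *f* < *l*/*L*, the flat value for *l*/*L* < *f* < 1 – *l*/*L*, and zero above that.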

## Appendix D. Deviation from the uniform distribution at very high frequencies

Neutral mutations are unlikely to reach very high frequencies *f* → 1 because they take a finite amount of time to fix in the wavefront, and during this time the wave advances. We can therefore estimate at which frequency the simulated expected SFS starts deviating from the uniform distribution, *ξ*(*f*) = *U*_{n}*L/v*, by considering how long it takes for a lineage to take over the wavefront. Using Barton *et al*. (2013)’s approximation for the wavefront coalescence time *T*_{front}, a neutral mutation is very unlikely to fix in the first fraction *vT*_{front}/*L* of the range. In other words, the uniform tail of the SFS should be cut off at *f* ≈ 1 – *vT*_{front}/*L*. For the parameter values in Figure 2, this is *f* ≳ 1 – *vT*_{front}/*L* ≈ 0.8, which roughly agrees with the simulation results.

## Appendix E. Site frequency spectrum from surfing mutations

We want to find an approximate expression for the SFS at intermediate frequencies where it is primarily composed of alleles that surf in the wavefront before being dropped (Figure 3). To do this, we follow the wavefront in the co-moving frame, viewing it as a small population of size *N*_{front}. If we imagine tracing out the trajectory of the frequency of the allele in the wavefront over time, the total frequency in the entire population will be proportional to the area under this curve (Okada and Hallatschek 2021). Specifically, suppose that the allele persists in the wavefront for *t* generations at a frequency of ≈ *y* before being dropped. The total number of copies of the allele will then be ~ *vtyρ*, for a total frequency of *f* ~ *vtyρ*/(*ρL*) = *vty/L*.

We now need to approximate the joint distribution of the frequency *y* that an allele reaches in the wavefront and the time *t*(*y*) for which it surfs conditional on reaching that frequency. Coarse-graining over the time scale of coalescence in the wavefront (roughly, the time ~ 1/*s* for it to travel its own width, up to logarithmic factors (Barton *et al*. 2013; Birzu *et al*. 2018)), we can treat the wavefront as having an effective “sweepstakes” offspring distribution with tail exponent –1 (Okada and Hallatschek 2021). The typical persistence time then is short and only logarithmically dependent on *y* (Okada and Hallatschek 2021). The exact form of this logarithmic dependence is unclear though, and most likely it only describes extremely dense populations (Birzu *et al*. 2018; Barton *et al*. 2013). Instead, we will follow Barton *et al*. (2013) and Birzu *et al*. (2018) in approximating it with a power law. We guess that we can generalize Barton *et al*. (2013)’s expression for the overall coalescence time to a typical persistence time for a lineage that reaches wavefront frequency *y*. With this guess, we can rewrite the allele’s overall frequency *f* as a function of its wavefront frequency *y*:
up to numerical factors.

To find the probability *P*(*f*) that an allele reaches overall frequency *f*, note that the probability that the allele reaches a wavefront frequency of at least *y* is just the standard *P*(*y*) ≈ 1/(*yN*_{front}) (Okada and Hallatschek 2021). Inverting Eq. (10) to find *y* in terms of *f* then gives . Differentiating with respect to *f*, we find the probability density *p*(*f*):
again ignoring numerical factors. From the density, we can immediately obtain the site frequency spectrum *ξ*(*f*) by multiplying by the total number of mutations that occur in the wavefront, which is the product of the wavefront mutation supply *N*_{front}*U*_{n} and the sweep time *T*_{sweep} = *L*/*v*:
where in the last line we have used that the total population size is *N* = *Lρ*. Note that we have neglected numerical constants throughout this argument, so there is an undetermined constant of proportionality. Comparing Eq. (11) to simulations, it appears that this constant is ≈ 0.6 (Figure 3).

To check the accuracy of our guesses and approximations, we test how well Eq. (11) fits simulations over a range of parameter values. While computational limitations prevent us from varying the parameters over even an order of magnitude, the prediction does appear to match well (Figure 7).