## Abstract

Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory — for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.

**Author summary** Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying genomic regions under positive selection. However, methods that detect positive selective sweeps do not typically identify the favored allele, or even the haplotypes carrying the favored allele. The main contribution of this paper is the development and analysis of a new statistic (the HAF score), assigned to individual haplotypes. Using both theoretical analyses and simulations, we describe how the HAF scores differ for carriers and non-carriers of the favored allele, and how they change dynamically during a selective sweep. We also develop an algorithm, PreCIOSS, for separating carriers and non-carriers. Our tool has broad applicability as carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory — for example, in contexts involving drug-resistant pathogen strains or cancer subclones.

## Introduction

With genome sequencing, we now have an opportunity to more completely sample genetic diversity in human populations, and probe deeper for signatures of adaptive evolution [1–3]. Genetic data from diverse human populations in recent years have revealed a multitude of genomic regions believed to be evolving under recent positive selection [4–16].

Methods for detecting selective sweeps from DNA sequences have examined a variety of signatures, including patterns represented in variant allele frequencies as well as in haplotype structure. Initially, the problem of detecting selective sweeps was approached primarily by considering variant allele frequencies, exploiting the shift in frequency at ‘hitchhiking’ sites linked to a favored allele relative to non-hitchhiking sites [17,18]. The site frequency spectrum (SFS) within and across populations is often used as a basis for such inference [4,6,19–25]. More recently, methods based on haplotype structure have been developed using a variety of approaches, including the frequency of the most common haplotype [26], the number and diversity of distinct haplotypes [27], the haplotype frequency spectrum [28], and the popular approach of long-range haplotype homozygosity [29–32].

In general, haplotype-based methods seek to characterize the population with summary statistics that capture the frequency and length of different haplotypes. However, the haplotypes are related through a genealogy, and relationships among them are inherently lost in such analyses. In addition, data on the site frequency spectrum can be lost or hidden in analyses focused on haplotype spectra. In this paper, we connect related measures of haplotype frequencies and the site frequency spectrum by merging information describing haplotype relationships with variant allele frequencies. Our main contribution is a statistic that we term the *haplotype allele frequency* (HAF) score, which captures many of the properties shared by haplotypes carrying a favored allele.

Consider a sample of haplotypes in a genomic region. We assume that all sites are biallelic, and at each site, we denote ancestral alleles by 0 and derived alleles by 1. We also assume that all sites are polymorphic in the sample. The *HAF vector* of a haplotype *h*, denoted *c*, is obtained by taking the binary haplotype vector and replacing non-zero entries (derived alleles carried by the haplotype) with their respective frequencies in the sample (Fig. 1A). For parameter 𝓁, we define the 𝓁-HAF *score* of *c* as:
where the sum proceeds over all segregating sites *j* in the genomic region. The 1-HAF score of a haplotype amounts to the sum of frequencies of all derived alleles carried by the haplotype. The 𝓁-HAF score is equivalent to the 𝓁-norm of c raised to the 𝓁^{th} power, or (||c||_{𝓁})𝓁. We will show that during a selective sweep, the HAF score of a haplotype serves as a proxy to its relative fitness.

### Selective sweeps

The classical model for selection, and the one that has received most attention, is the “hard sweep” model, in which a single mutation conveys higher fitness immediately upon occurrence, and rapidly rises in frequency, eventually reaching fixation [17,33]. Under this model, we can partition the haplotypes into carriers of the favored allele, and non-carriers. In the absence of recombination, the favored haplotypes form a single clade in the genealogy. As a sweep progresses, HAF scores in the favored clade will rise due to the increasing frequencies of alleles hitchhiking along with the favored allele. HAF scores of non-carrier haplotypes will decrease, as many of the derived alleles they carry become rare (Fig. 1B). After fixation of the favored and hitchhiking alleles, HAF scores will decline sharply (Fig. 1C), as the selected site and other linked sites are no longer polymorphic. Thus, this reduction in the HAF score results from the sudden loss of many high-frequency derived alleles from the pool of segregating sites [18,20,24]. Finally, as the site-frequency spectrum recovers to its neutral state due to new mutations and drift [23], so will the HAF scores.

Recombination is a source of ‘noise’ for the properties of the HAF score, predicted under the assumption of a hard sweep and no recombination, as it allows haplotypes to cross *into* and *out of* the favored clade. Recombination can lead to (i) haplotypes that carry the favored allele but little of the hitchhiking variation, thus having relatively low HAF scores despite their high fitness, or (ii) haplotypes that do not carry the favored allele but do carry much of the hitchhiking variation, thus having relatively high HAF scores despite their low fitness. By the same logic, recombination adds ‘noise’ after fixation by making the otherwise sharp decline in HAF scores more subtle and gradual. This more gradual decline occurs due to recombination weakening the linkage between the favored allele and hitchhiking variants.

Recently, “soft sweeps” have generated significant interest [34–36]. A soft sweep occurs when multiple sets of hitchhiking alleles in a given region increase in frequency, rather than a single favored haplotype. Soft sweeps may take place by one or more of the following mechanisms: (i) selection from standing variation: a neutral segregating mutation, which exists on several haplotypic backgrounds, becomes favored due to a change in the environment; (ii) recurrent mutation: the favored mutation arises several times on different haplotypic backgrounds; or, (iii) multiple adaptations: multiple favored mutations occur on multiple haplotypic backgrounds. Several methods have been developed for detecting soft sweeps [37,38], as well as for distinguishing between soft and hard sweeps [39–41]. In soft sweeps, multiple sets of hitchhiking alleles rise to intermediate frequencies as the favored allele fixes. This could cause the pre-fixation peak and post-fixation trough in HAF scores to be less pronounced and to occur more gradually compared to a hard sweep.

We find (see Results) that the properties of the HAF score remain robust to many soft sweep scenarios. Moreover, the HAF score could potentially be used to detect soft sweeps. However, in this paper, we focus on the foundations, developing theoretical analysis and empirical work that predicts the dynamics of the HAF score. We also develop a single application. Recall that most existing methods for characterizing selective sweeps focus on identifying regions under selection. Here, given a region already identified to be undergoing a selective sweep, we ask if we can accurately predict which haplotypes carry the favored allele, without knowledge of the favored site. Successfully doing so may provide insight into the future evolutionary trajectory of a population. Haplotypes in future generations are more likely to be descended from, and therefore to resemble, extant carriers of a favored allele. This predictive perspective is of particular importance when a sweep is undesirable and measures may be taken to prevent it. For instance, consider a set of tumor haplotypes isolated from single cells, some of which are drug-resistant and therefore favored under drug exposure. Given a genetic sample of the tumor haplotypes, the HAF statistic may be applied to identify the resistant haplotypes — carriers of a favored allele — before they clonally expand and metastasize.

Below, we start with a theoretical explanation of the behavior of the HAF score under different evolutionary scenarios, validating our results using simulation. We then develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to detect carriers of selective sweeps based on the HAF score. We demonstrate the power of PreCIOSS on simulations of both hard and soft sweeps, as well as on real genetic data from well-known sweeps in human populations. While our theoretical derivations make use of coalescent theory, and explicitly use tree-like genealogies, we note that HAF scores can be computed for any haplotype matrix including those with recombination events. Our results on simulated and real data imply that the utility of the HAF score extends to cases with recombination as well as other evolutionary scenarios.

## Results

### Theoretical and empirical modeling of HAF scores

We consider a sample of *n* haploid individuals chosen at random from a larger population of size *N*. Let *μ* denote the mutation rate per generation per nucleotide, and let *θ*= 2*NμL* denote the population-scaled mutation rate in a region of length *L* bp. We consider both constant-sized and exponentially growing populations. For exponentially growing populations, let *N*_{0} denote the final population size, let *r* denote the growth rate per generation, and let *α* = 2 *N*_{0} *r* the population-scaled growth rate. Let *ρ* denote the population-scaled recombination rate. In our theoretical calculations, we assume no recombination (*ρ* = 0), and we derive expressions for the general 𝓁-HAF score. We use simulations to demonstrate the concordance of theoretical and empirical values of the 𝓁-HAF score, and show that the values are robust to the presence of recombination (see ‘Simulations’ in Methods for parameter choices). Although some of our theoretical calculations below derive expressions for the general 𝓁-HAF score, we primarily use 1-HAF in the applied sections. Applications of 𝓁-HAF with 𝓁 > 1 will be explored in future work.

#### Expected 𝓁-HAF score under neutrality, constant population size

First, we assume that the genomic region of interest is evolving neutrally, the population size remains constant at *N*, and that the ancestral states are known or can be derived. In a sample of size *n*, let **c**(*v*) denote the HAF vector c for the *v*^{th} haplotype (*v* ∊ {1,…, n}). Let ξ_{w} be the number of sites with derived allele frequency *w*. We only consider polymorphic sites in the sample, so the frequency is in the range *w* ∊ {1,…, *n* − 1}; a mutation present in all or none of the haplotypes in the sample would not be detectable. Each of the ξ_{w} sites of frequency *w* contributes *w*^{𝓁} to the 𝓁-HAF score of each of the *w* haplotypes with the mutation, and contributes 0^{𝓁} = 0 for each of the other *n* − *w* haplotypes. The mean of the 𝓁-HAF scores of all *n* haplotypes in the sample is

Under the coalescent model, [42, Eq. (22)] shows that 𝔼[ξ_{w}] = 0/*w* for all 1 ≤ *w* ≤ *n* − 1. By averaging over all haplotypes in all genealogies, the expected 𝓁-HAF score is computed as

The first two cases (𝓁 = 1, 2) yield

#### Expected 𝓁-HAF score, variable population size

Our derivation of expected 𝓁-HAF scores for constant, neutrally evolving populations does not immediately extend to other demographic scenarios. We describe a second approach that separates coalescence times from the genealogy, and we apply it to compute the expected 𝓁-HAF in an exponentially growing population.

For a sample of size *n*, partition the time spanning from the present back to the sample MRCA into *n* − 1 epochs. Let epoch *k* ∊ {2,…, *n*} be the span of time during which the genealogy contains exactly k lineages (Fig. 1). Note that mutations on a given lineage in a given epoch share the same frequency, as they appear in exactly the same leaves. For example, mutations 2 and 3 in Fig. 1A occur on the same lineage in epoch 2, and they share the frequency 3. Consider the path leading from a randomly chosen haplotype back to the sample MRCA. We can write the 𝓁-HAF score of the haplotype as
where *m*_{k} is the number of mutations that occurred on the path in the *k*^{th} epoch, and *w*_{k} is the frequency or weight of those mutations. For a given genealogy with haplotypes *v* ∊ {1,…, *n*}, let **c**(*v*) denote **c** (the HAF vector) for the *v*^{th} haplotype. Similarly, let *m*_{k}(*v*) and *w*_{k}(*v*) denote the number of mutations and their frequency in the *k*^{th} epoch for the *v*^{th} haplotype. Epoch *k* splits the haplotypes into *k* equivalence classes, which we call *k-clades.* Let *m*_{k, i} and *w*_{k, i} denote the corresponding values on the *i*^{th} lineage of the *k*^{th} epoch. We compute the expected value by summing over all haplotypes and genealogies and dividing by *n*. The sum is

Let *M*_{k, i} and *W*_{k, i} be random variables denoting the number of mutations, and their frequency respectively, on the *i*^{th} lineage of the *k*^{th} epoch. As the genealogy of a neutrally evolving sample is independent of branch lengths [43], *M*_{k,i} and *W*_{k,i} are independent random variables. Thus, we can compute the expected 𝓁-HAF score of a randomly chosen haplotype as

To compute , we start with a related quantity. For positive integer 𝓁, denote the *rising factorial*
and set *w*^{(0)} = 1. We show in the Appendix that

We have *w*^{(1)} = *w* and *w*^{(2)} = *w*(*w* + 1) = *w*^{2} + *w*, so *w*^{2} = *w*^{(2)} − *w*^{(1)}, Which leads to:

In the Appendix, we generalize this equation to compute 𝔼[(*W*_{k,i})^{𝓁}]. In addition, we show that for a constant-sized population, the general form in Eq. (7) produces the same result as Eq. (3).

#### Exponential population growth

Eq. (7) can potentially be used to obtain E[𝓁-HAF] under arbitrarily complex demographics. Consider a population of current size *N*_{0} that has been growing exponentially at a rate *r*. The population size at time *t* in the past is given by *N*(*t*) = *N*_{0} *e*^{−rt}. Exponential population growth is of particular interest, as it has been used to analyze the state of a population under a selective sweep *shortly after fixation.* This is a low point (or *trough*) of observed 𝓁-HAF scores, as early hitchhiking sites have fixed by this time point, and the (relatively recent) sample MRCA is a carrier of the favored mutation. Immediately after fixation, the population — all of which are carriers of the favored allele — has been growing for the duration of the sweep at a rate that is approximately exponential (with growth rate related to the selection coefficient s). In addition, all extant and ancestral haplotypes since the sample MRCA are carriers and therefore equally favored, implying that the independence between *W*_{k,i} and *M*_{k, i} is kept. While the branch lengths and distribution of *M*_{k, i} values change under exponential growth, the distribution for *W*_{k,i} remains unchanged as described in Eq. (10). This key insight allows us to use Eq. (7) to estimate the expected HAF scores under exponential population growth.

In order to use 𝔼[*M*_{k,i}] = *μ*𝔼[*T*_{k}] under exponential growth, we implement two numerical methods to compute 𝔼[*T*_{k}]: a ‘cumulative time’ method that uses an approximate distribution of *T*_{k} given in [44, p. 559], and a ‘conditional expectation’ method (see Appendix for details). In the conditional expectation method, we compute the expected value of *T*_{k} conditioned on *T*_{k+1},…, *T*_{n}, as follows (in the order *k* = *n*, *n* − 1,…, 2):
where *α* = 2 *N*_{0} *r* is the scaled growth rate, *τ*_{k} = *t*_{k+1} + ⋯ + *t*_{n} (with *τ*_{n} = 0), and *E*_{1}(*x*) is the exponential integral

We then use 𝔼[(*W*_{k,i})^{𝓁}] (evaluated in Eq. (S21)) to evaluate Eq. (7), yielding 𝔼[𝓁-HAF] for exponential population growth as
where *S*(*𝓁, q*) denotes the Stirling number of the second kind [45, Ch. 6.1]. We describe these procedures fully in the Appendix.

#### Empirical validation of expected HAF score computation

We tested our theoretical calculations against empirical observations of HAF scores using simulations for neutral evolution with constant population size *N* = 2000 (lowered for subsequent forward simulations; see ‘Simulations’ in Methods). For example, for *θ* = 48 and *n* = 200, the expected 1-HAF is exactly 4776.0 (Eqs. (3),(7)), whereas the empirically observed mean 1-HAF score of 20,000 simulated samples is 4777.1 (Supplementary Fig. S1).

We also modeled exponential growth in population size using the scaled growth rate *α*, using the conditional expectation method in Eq. (12) (see Appendix). As expected, the HAF score is much lower than for constant population size. For *α* = 80, *θ* = 48, *n* = 200, the theoretical mean 1-HAF score is 126.9, whereas the empirical mean of 20,000 simulations is 128.1 (Fig. S2).

We compared the simulations with theoretical expected 𝓁-HAF scores for multiple values of 𝓁 ∊ {1, 2, 3, 4}, and different choices of the population-genetic parameters (*n* ∊ {100, 200, 300,400}, scaled mutation rate *θ* ∊ {24,48}, scaled recombination rate *ρ* ∊ {0, 7.568,15}, and scaled growth rate *α* ∊ {0, 30, 60, 80}). Note that theoretical expected values were computed assuming scaled recombination rate *ρ* = 0, while simulations included *ρ* ∊ {0, 7.568,15}. See Methods for a detailed description of the choice of simulation parameters. Fig. S3 shows the concordance between theoretical and empirical means for each choice of parameters. The concordance improves slightly for increasing values of *n*, 𝓁. Surprisingly, the values are robust to choice of *ρ*, and the variance even reduces slightly for higher *ρ* (Supplementary Fig. S3D).

### HAF score dynamics in selective sweeps

We now consider the dynamics of HAF scores in a population undergoing a selective sweep. To do this, we use data simulated under several scenarios including both hard and soft sweeps. Though soft sweeps can arise under different circumstances, in what follows we restrict our attention to soft sweeps from the standing variation. Fig. 2 illustrates the HAF score dynamics in a single simulated population undergoing a soft sweep, due to standing variation, with selection coefficient *s* = 0.05 and favored allele frequency *v*_{0} = 0.3 when selection begins. See ‘Simulations’ in Methods for a detailed description of the simulation parameters. Initially (leftmost, time 0) the HAF scores of carriers and non-carriers of the favored allele are similar. As the sweep progresses (times 50−200), carrier HAF scores increase to a peak value (HAF-peak). Soon after fixation (time ∼400), we observe a sharp decline in HAF scores (HAF-trough), followed by slow and steady recovery due to new mutation and drift (times 500−4000).

Below, we provide a theoretical description of these dynamics, as well as empirical validation using simulations. This allows us to predict HAF scores in (a) the postfixation trough; (b) the pre-fixation peak; and (c) the rate of growth of HAF scores from pre-sweep to peak value.

#### Empirical validation of the post-fixation HAF-trough

We showed using simulations that the HAF score computations for an exponentially growing population (Eq. (12)) also approximate a population evolving under a selective sweep *shortly after fixation.* This enables prediction of the HAF-trough value.

The HAF-trough of a sweep is the value of 1-HAF at fixation. We took the mean of the HAF-trough values over 200 populations simulated under selective sweeps with coefficients *s* ∊ [0.02,0.08] (see ‘Simulations’ in Methods), and compared it to 1-HAF values in simulated neutral populations growing exponentially at rates *α* ∊ [32,101]. Fig. 3A shows a close similarity between the 1-HAF values under exponential growth (blue) and the selective sweep trough (red). The resemblance is closer for higher values of *s* and *α*. To some extent, this may stem from the increased simulation variance we see for lower values of *s*. On a deeper level, selective sweeps with fixed population size demonstrate a logistic curve [46, Sec. 6.1.3]. However, for higher values of *s* and *α*, the concordance between selective sweeps and exponential growth is increased.

#### The pre-fixation 1-HAF-peak

As the selective sweep progresses, the value of the HAF score of haplotypes carrying the favored allele increases, eventually reaching a peak value. Consider *n* haplotypes sampled from a fixed population of *N* haploid individuals under a selective sweep. Let *μ* denote the mutation rate per base per generation in the genomic region of interest, and assume that there is no recombination. The scaled mutation rate is given by *θ* = 2*Nμ*.

We let *v* denote the fraction of carrier haplotypes in the sample. When *v* ≤ 1/n (i.e., 0 or 1 carriers), there is no selection going back in time, and the time to MRCA can be computed using the neutral Wright-Fisher model [47]. The expected 1-HAF scores for carriers and non-carriers are identical (Eq. (3)). At the time when *v* first equals 1, there are no non-carriers, and the HAF-scores are given by the exponential growth model. In the Appendix, we model the 1-HAF scores for all intermediate values of *v*.

Let 1-HAF^{car} (respectively, 1-HAF^{non}) denote the 1-HAF score of a random haplotype carrying the favored allele (respectively, a non-carrier) when a fraction *v* of the *n* sampled haplotypes carry the favored allele. In the Appendix, we show that under strong selection (*N*_{s} > > 1) and no recombination (*ρ* = 0),

For any sample of size *n*, the carrier haplotypes reach a peak value of 1-HAF^{car} as v varies along its trajectory. We do not compute the expected value of this peak (𝔼[max_{v}(1-HAF^{car}(*v,n*))]) directly. Instead, we compute the peak value of 𝔼[1-HAF^{car}(*v, n*)] (maximizing over all *v* ∊ [0,1]) as

Note that under strong selection, this peak does not depend on *s* (see Fig. 3B).

The trough for each trajectory is computed as the 1-HAF score at fixation (when *v* = 1 is first reached).

#### Empirical validation

Simulated data under selective sweeps with coefficients *s* ∊ [0.02, 0.08] show that (i) the pre-fixation HAF peak scores appear to be independent of the selection coefficient (Fig. 3B), and (ii) as predicted by Eq. (15), the mean value of the HAF peak score is approximately *θn* (Fig. S10 in Appendix). We also simulated (1-HAF^{car})/(*nθ*) as a function of *v* (Fig. 3C and Fig. S9). The results show a tight correspondence between theory and empirical observations.

### HAF score Application: Characterizing carriers and non-carriers

Our understanding of the dynamics of HAF scores of a haplotype during a selective sweep has many potential applications. For example, we could compare the dynamics of hard and soft sweeps to distinguish between the two events. Second, conditioning on known or deduced selective sweeps in a population sample, we can predict the state (carrier/non-carrier) of the favored allele in its haplotypes. Below we explore the latter, leaving the former to future work.

In Fig. 4A we show the distributions of haplotype 1-HAF scores aggregated from 500 simulated populations undergoing a hard selective sweep (see ‘Simulations’ in Methods for detailed parameter choices). Scores were computed for random samples of *n* = 200 haplotypes taken at regular time intervals. They are stratified by the frequency of the favored allele at the time of sampling. Further, scores are stratified into carrier and non-carrier classes (of the favored allele). As with a single population, HAF scores of carriers and non-carriers diverge as the sweep progresses in frequency. We note, however, that even close to fixation (frequencies 80−100%) the distributions of HAF scores between carriers and non-carriers maintain considerable overlap. The high variance in HAF scores makes them only weakly informative of sweep carrier status when comparing across population samples (or genomic regions within a single population). Within a single population sample, however, the HAF scores are highly informative of the carrier status. This is illustrated in Fig. 4B, showing the distributions of HAF score percentile rank within their respective samples. We observe that the rank distributions have minimal overlap for carriers and non-carriers of the favored allele. Any remaining overlap in the percentile rank distributions in the final stages of a sweep (favored allele frequency ≥70%) stems mostly from recombination, which allows the favored allele to recombine onto haplotypes outside the selected clade (creating low HAF score carriers) and vice-versa (creating high HAF score non-carriers). The overall strong separation between carriers and non-carriers is further illustrated by the highly significant *P*-values of Wilcoxon rank sum tests rejecting the null hypothesis of identically distributed HAF scores among carriers and non-carriers *within* each population sample (Fig. 4C).

Fig. 4 does not show how HAF scores are distributed following fixation of the sweep. Starting at fixation, we see a strong decline in HAF scores owing to the loss of many high frequency derived alleles from the pool of segregating sites. However, crossover events may unlink hitchhiking alleles from the favored allele, and they may remain segregating in the population even after fixation of the favored allele. Therefore, the decline in HAF scores may be abrupt or gradual, depending on the linkage between the favored and hitchhiking alleles. Finally, after reaching a trough, HAF scores gradually recover to their neutral levels over time. The post-fixation dynamics of HAF scores are shown in Fig. S5.

#### PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps

Our simulations suggest that, in a region undergoing a selective sweep, we could use HAF scores to predict whether a haplotype is carrying the favored allele. We implemented a simple algorithm (PreCIOSS) to carry out this prediction by clustering HAF scores in a sample. PreCIOSS takes as input a set of binary haplotypes sampled from a population undergoing a selective sweep. For each haplotype, the 𝓁-HAF score is computed (𝓁 = 1 by default). We then fit a Gaussian Mixture Model (GMM) with exactly two Gaussians to the haplotype HAF scores. The fit is performed using Expectation Maximization (EM). Finally, we apply the fitted model to assign a label to each haplotype according to the Guassian component to which it is assigned. Haplotypes whose HAF score is higher are denoted as ‘carriers’.

We apply PreCIOSS to data from simulated populations undergoing hard and soft sweeps (see ‘Simulations’ in Methods). The haplotypes predicted as carriers might in fact be carriers (True Positives, TP) or non-carriers (False Positives, FP). Similarly, the haplotypes predicted as non-carriers could be True Negatives (TN) or False Negatives (FN). We measure the *balanced accuracy*
which is more appropriate to use than *Rand accuracy* (*TP* + *TN*)/(total predictions) when the positive and negative classes appear at different proportions in the sample [48].

The balanced accuracy of PreCIOSS on hard sweeps is shown in Fig. 5A for a specific choice of parameters (500 samples with *n* = 200, *θ* = 48, *ρ* = 7.568, *s* = 0.05). Once the sweep reaches frequencies above 30%, the balanced accuracy increases (median ∼70%) and remains high (median ∼90%) for the remainder of the sweep. At the very beginning and end of the sweep, the balanced accuracy, despite being asymptotically unbiased, suffers from high variance due to the severe class imbalance (few carriers in the beginning, few non-carriers at the end). The accuracy is high for soft sweeps (Fig. 5B, run with similar parameters, and *v*_{0} = 0.3), where the higher initial carrier frequency makes for a faster increase in accuracy. As expected under soft sweeps, the overall balanced accuracy is somewhat reduced relative to hard sweeps.

We tested PreCIOSS under a wide range of population-genetic and sweep parameters (Table S1), and observed consistently high balanced accuracy in carrier-state prediction as the sweeps progress (see Fig. S6). Generally, PreCIOSS has slightly better balanced accuracy on hard sweeps than on soft sweeps. A higher recombination rate has only a limited impact (see Fig. S6A,E), suggesting that PreCIOSS will be robust to different recombination scenarios.

### Applying PreCIOSS to human selective sweeps

To evaluate the effectiveness of PreCIOSS in distinguishing carriers of a selective sweep from non-carriers, we applied it to several genomic regions (e.g., [39]) where (i) there is strong evidence of a selective sweep, and (ii) the favored allele has been characterized. In applying PreCIOSS to the datasets, we assumed that the region was known, but did not supply the favored allele to PreCIOSS. In each case, we tested if PreCIOSS could separate the haplotypes that carried the favored allele. We use phased haplotypes from the HapMap project [49], setting the ancestral allele to that observed in orthologous Chimpanzee sequence [50].

#### LCT

We consider the well-known sweep in the lactase (LCT) gene region in Northern Europeans. The best characterized variant is C/T-13910 (rs4988235), for which the T allele was found to be 100% associated with lactase persistence in the Finnish population [51]. T-13910 was further shown to be causal by in-vitro analysis, where it was found to increase enhancer activity [52,53]. We considered haplotypes from the CEU population, applying PreCIOSS to a 50 kb window centered at C/T-13910 (Fig. 6A). This yielded 100% accuracy in classifying carriers from non-carriers. Increasing the window size above 50 kb, the balanced classification accuracy reduced to ∼90% (Fig. 6B). LCT shows the highest and most prolonged (with increasing distance from the causal site) statistical significance in separating carrier and non-carrier HAF scores, remaining highly significant for haplotypes of 1.5 Mb (Fig. 6B, blue line). Despite this highly significant separation, the classification accuracy is initially unstable, alternating between ∼100% (perfect classification) and 90%. This is due to the pattern of HAF scores observed in the LCT region, where carriers form a tight cluster with the highest scores, but several non-carriers cluster closer to carriers than to the majority of other non-carriers (Fig. 6A). These haplotypes are therefore sometimes included (90% accuracy) and sometimes excluded (100% accuracy) from the reported ‘carriers’ class.

#### TRPV6

Transient Receptor Potential Cation Channel, Subfamily V, Member 6 (TRPV6) is a membrane calcium channel thought to mediate the rate-limiting step of dietary calcium absorption. It is reportedly under strong positive selection in several non-African populations [54,55]. Following Peter et al. (2012) [39], we focus on the CEU population and set rs4987682 as the favored allele. This site, of the three non-synonymous SNPs with highest allele frequency differentiation among human populations, is the only one located in the the N-terminal region of TRPV6, thought to be the target of selection [56]. Applying PreCIOSS to a 50 kb window centered at this site, we obtain ∼99% balanced classification accuracy (Fig. 6C), which gradually decays to ∼80% when considering a 1.5 Mb window (Fig. 6D). As with LCT, the separation in HAF scores between carriers and non-carriers of the allele is highly statistically significant (P < 10^{−7}; see Fig. 6D). Unlike LCT, accuracy decays stably with distance from the favored site. This appears to be due to the less complex clustering pattern of non-carrier HAF scores in the region (Fig. 6C).

#### PSCA

Prostate Stem Cell Antigen (PSCA) has been proposed to be under selection in a global analysis of allele frequency differentiation [57]. The putative causal site is a non-synonymous SNP (rs2294008) known to be involved in several cancer types [58,59]. Interestingly, the derived allele is observed in all human populations, but at vastly different frequencies. It is most common in West Africa and East Asia, where it segregates at ∼75% frequency. We consider haplotypes from the YRI population and apply PreCIOSS to a 50 kb window centered at rs2294008, yielding balanced accuracy of 97% (Fig. 6E). Unlike LCT and TRPV6, accuracy decays more noticeably with distance, reaching 63% at 1.25 Mb (Fig. 6F). This sharper decay in accuracy is even more pronounced when considering the sweep in the CHB population (Fig. S4). Such decay is consistent with a (*soft*) sweep from the standing variation, which would allow more time for recombination to break the linkage between the favored allele and hitchhiking variation. Indeed, the sweep in PSCA was proposed to be from the standing variation by Bhatia et al. (2011) [57], and further substantiated as such by Peter et al. (2012) [39].

#### ADH1B

ADH1B encodes one of four subunits of Alcohol dehydrogenase (ADH1), which plays a key role in alcohol degradation. ADH1 genes (including ADH1B) have been studied extensively on both a functional and a population-genetic level, as they are thought to be one of the major drivers of alcoholism risk [60]. These genes have also been suggested to cause the “alcohol flush” phenotype common in Asian populations [61]. A specific non-synonymous mutation in ADH1B (Arg47His, rs1229984) has been proposed to be the target of selection. This is because (i) the derived allele has been shown to cause increased enzymatic activity [62,63], and (ii) the estimated age of the allele coincides with rice domestication [61,64] and the availability of fermented beverages [65]. Computing HAF scores for phased haplotypes from East Asian populations (CHB+JPT) and applying PreCIOSS, we obtained balanced classification accuracy of 92% using a 50 kb window centered at rs1229984 (Fig. 6G). Both accuracy and statistical significance (of class separation) gradually decay with increasing window size (Fig. 6H). As before, statistical significance decays more stably than classification accuracy.

#### EDAR

EDAR encodes a cell-surface receptor and has been associated with development of distinct hair and teeth morphologies [66,67]. Specifically, a non-synonymous SNP (rs3827760, V370A) has been associated with these phenotypes [68]. The SNP is located within a DEATH-domain, which is highly conserved in mammals [69], and has been experimentally confirmed (in vitro) to increase EDAR activity [68]. It is found at very high frequencies in East Asian and American populations, while being completely absent from Europeans and Africans [68]. The EDAR gene has been found to be under selection in multiple studies [30,54,70], showing one of the strongest signatures of selection genome wide among the 1000 Genomes populations [39]. Applying PreCIOSS to phased CHB+JPT haplotypes in a 50 kb region centered at rs3827760, we obtained 98% balanced accuracy in predicting carriers vs. non-carriers of the allele.

In each case, PreCIOSS was applied to a 50 kb window centered at the favored allele, and separated the carriers and non-carriers with high accuracy of 97-100% (Fig. 6A–F). The accuracy decayed with increasing window size, but in many cases stayed high even for windows of 1.5 Mbp.

## Discussion

This paper introduces a new perspective on the genetic signatures of selective sweeps. From identifying and characterizing sweeps in a population sample — the topic of typical studies of selective sweeps — we progress to considering the role of individual haplotypes within an ongoing sweep. Using both simulated and real data, we show that the HAF score is well-correlated with the *relative* fitness of individual haplotypes, and that our algorithm (PreCIOSS) is highly effective at predicting carriers of selective sweeps.

The HAF framework has many natural extensions and potential applications. On the theoretical side, we have obtained the expected HAF score in both constant-sized and exponentially growing populations evolving neutrally (Eqs. 3,12). However, we do not yet know the variance. This quantity would provide a better understanding of the respective distributions, and a means to to statistically test for deviations from neutrality. Moreover, although we have observed in simulation and in practice that our theoretical argument is robust to recombination (genealogies violating a tree structure), a theoretical argument supporting these observations would be valuable.

In terms of application, several additional directions are worth investigating. The HAF framework is potentially useful in distinguishing hard from soft sweeps. Intuitively, hard sweep genealogies will likely have a single hitchhiking branch dominating the HAF scores, and leading to near-uniform scores in favored haplotypes. However, soft sweep genealogies may have several hitchhiking branches, potentially leading to distinct HAF score peaks. Even if the different favored clades happen to have similar scores, the haplotypes within them will not form a highly-related group as expected in hard sweeps.

Our results on known selective sweeps in humans illustrates this idea already (Fig. 6). A recent study by Peter et al. (2012) [39] assigned posterior probabilities to hard vs. soft sweeps occurring in the same genes. Peter et al. assigned the highest likelihood of a hard sweep to LCT (0.99), followed by EDAR (0.89), ADH1B (0.78), TRPV (0.45), and finally PSCA (0.24). This is in striking concordance with the spread in HAF scores in Fig. 6. The clusters capturing the carriers in LCT and EDAR have tightly distributed HAF scores (Fig. 6A,I). The cluster for ADH1B (Fig. 6G) has more variance by comparison, and the variance increases for TRPV6 (Fig. 6C) and PSCA (Fig. 6E), with PSCA showing the highest variance of HAF scores in carriers.

Finally, perhaps the highest potential impact of the HAF score could be in predicting the ‘MRCA of the future’. We know that future haplotypes are more likely similar to favored individuals than to unfavored ones, and that HAF scores correlate well with relative fitness in ongoing selective sweeps. Therefore, high HAF haplotypes are more likely to be similar to future generations. This relationship is particularly valuable when action may be taken based on such predictions. For instance, rapid influenza viral evolution is known to change the strain composition from year to year. The mutations are a mix of favored and deleterious mutations. The fitness and frequency of the current year‘s strain have been used to predict the next year’s dominant strain [71]. The HAF score may allow for a careful look at the dynamics of the current strain and possibly offer better insight into the problem. As a second example, tumor cells show great heterogeneity and much variation occurs at the single cell level. This intra-tumor variation allows sub-population of cells to resist therapy and proliferate [72]. Once again, HAF scores of haplotypes in cells undergoing treatment can potentially distinguish between carriers and non-carriers of drug resistance mutations, and thereby improve our insight into mechanisms of drug resistance.

## Methods

### Simulations

We simulated data for various evolutionary scenarios. Neutral samples were generated using the coalescent (backwards) simulator *ms* [73]; sweep samples were generated using the Wright-Fisher (forwards) simulator *mpop* [5]. All simulations generated samples of *n* ∊ {100, 200, 300, 400} haplotypes from a larger effective population of *N* = 2000 haplotypes, each of length 50 kb. A mutation rate of approximately *μ* = 2.4 · 10^{−8} mutations per bp per generation was used [74,75]. For simulating populations from a region of length *L* = 50 kb, and assuming effective population size of 10, 000 individuals (20, 000 haplotypes), we obtain *θ* = 2*N μL* = 24. For our simulations, we choose *θ* ∊ {24, 48}. However, we worked with a smaller value of *N* = 2, 000 haplotypes to reduce the computational requirements of forward simulation, and compensated by appropriately scaling the mutation rate to 2.4 · 10^{−7} to maintain the scaled mutation rate, *θ*.

The simulations either had no recombination, or used 3.784 · 10^{−8} crossovers per generation, per base (e.g. [76]). For a 50 kb region, we used the population-scaled rate

We also worked with a higher population-scaled rate of *ρ* = 15.

For exponential growth, we used *N* = 2000 as the size of the final (current) population. Let *r* denote the growth rate per generation, so that at *t* generations prior to the current generation, the population size was *N*(*t*) = *Ne*^{−rt}. Define the scaled growth rate *α* = 2*Nr*. We set α to a range of values in [30,100] (corresponding to *r* ∊ [0.0075, 0.025]).

For selective sweeps, we used forward simulations run assuming a diploid population with recombination and mutation parameters as described above. While diploid populations were simulated to incorporate recombination, we used phased haplotypes for our analysis. We assumed a single favored allele under selection coefficient *s* ∊ [0.02, 0.8] and heterozygosity 0.5 (haploid carriers get half the fitness advantage of diploid carriers). When *s* is ‘low’ (0.001 ≤ *s* ≤ 0.01), none of the available tests can detect a selective sweep with reasonable power [21,23]. Selection with *s* ≥ 0.1 is considered ‘high’ (e.g., see [21]). For high values of *s*, the carrier haplotypes are identical or very similar in simulations, making the problem of detecting carriers easy. Therefore, we chose intermediate values (*s* ∊ [0.02, 0.08]) in our simulations.

Soft sweeps can arise either due to standing variation or due to multiple favored alleles. Here, we focus on the former, where the favored allele is present in at least one carrier in the population (*v*_{0} ∊ [1/*N*, 1]), and drifting at the onset of selection. In our simulations, we set *v*_{0} at the beginning of the sweep to *v*_{0} = 1/*N* for hard sweeps and *v*_{0} = 0.3 for soft sweeps.

### Data Preprocessing

We downloaded pre-phased haplotype data from the HapMap [49] project website. Both HapMap 3 [49] and HapMap 2 [77] project data were used depending on whether the causal allele was sampled or not. For LCT (rs4988235), we used 88 CEU individuals haplotypes from HapMap 3; for PSCA (rs2294008), we used 60 YRI individuals from HapMap 2; for TRPV6, 88 CEU individuals from HapMap 3; for ADH1B (rs1229984), 90 CHB+JPT individuals from HapMap 2; and, for EDAR (rs3827760), we chose 90 CHB+JPT individuals from HapMap 2. The number of phased haplotypes was twice the number of individuals in each case.

We downloaded Chimpanzee genome alignments [50] to identify the ancestral allele. A total of ∼93% of the sites analyzed had were covered by the Chimpanzee data. For these sites, we set the ancestral allele to the Chimpanzee allele, and we discarded sites that were not covered.

## Software

The PreCIOSS software is available from the website http://bix.ucsd.edu/projects/precioss/

## Empirical validation of PreCIOSS

In Fig. 5A, we showed the balanced accuracy of PreCIOSS for predicting carriers of ongoing hard sweeps. For soft sweeps, the balanced accuracy stays robust but is slightly reduced (Fig. 5B). In Fig. S6 we apply PreCIOSS to data from 500 trials of a simulated population for each sweep type (hard or soft), and for different population genetics parameters provided in Table S1. For soft sweeps, we set *v*_{0} = 0.3 (frequency of the favored allele at the onset of selection).

## Derivation of the expected 𝓁-HAF score

We compute the expected value by averaging over all haplotypes in a genealogy (sum over the haplotypes and divide by *n*) and then averaging over all genealogies. Recall from the main text that *M*_{k,i} and *W*_{k,i} refer to random variables denoting the number of mutations and the frequency, respectively on the *i*^{th} lineage of the *k*^{th} epoch. As the genealogy of a neutrally evolving sample is independent of branch lengths [43], *M*_{k,i} and *W*_{k,i} are independent random variables. Recall Eq. (7):

### Expected value of *M*_{k,i}, constant population size

Let *T*_{k} denote the duration of epoch *k*. For a population of constant size *N*, the duration *T*_{k} is exponentially distributed with rate (see [78]):)
and the number of mutations on a lineage *i* in epoch *k*, denoted *M*_{k,i}, is Poisson distributed with rate *μT _{k}*. For a constant-sized population, this implies

Later, we will explore 𝔼[*T*_{k}] and 𝔼[*M*_{k,i}] for exponentially growing populations.

### Rising factorials and moment calculations

The probability distribution of *W*_{k, i}, or the number of descendants of the *i*^{th} lineage in epoch *k*, is given by [42, Eq. (14)]:
where 1 ≤ *i* ≤ *k*, 2 ≤ *k* ≤ *n*, and 1 ≤ *w* ≤ *n* − *k* + 1. We wish to evaluate Eq. (7) (or (S1)) for 𝓁 = 1, and generally for any 𝓁. This requires 𝔼[*M*_{k,i}], which we already have for constant populations, and , which we derive next. For a positive integer 𝓁, denote the *rising factorial*
and set *w*^{(0)} = 1. Then denotes the 𝓁^{th} rising factorial of random variable *W*_{k,i}, while denotes the ordinary 𝓁^{th} power. We will compute , and use it to compute .

### Deriving

Let 𝓁 be a nonnegative integer. We will prove that which is equivalent to Eq. (9) in the text. Using Eq. (S4), the expected value is the sum

Here we use that rising factorials and binomial coefficients are related via

The rightmost sum may be evaluated using a combinatorial method: we will count all (*k* + 𝓁 − 1)-element subsets of {1, 2,…, *n* + 𝓁 − 1} by two methods. First, it is the binomial coefficient . Second, we will show that it also equals the rightmost sum in (S7).

To show this, we may order any such subset and partition it into three parts, (*A, b, C*): *A* is the set of its smallest 𝓁 elements; *b* is the (𝓁 + 1)^{th} smallest element; and *C* is the set of the largest *k* — 2 elements.

Element *b* must be in the range 𝓁 + 1 ≤ *b* ≤ *n* − *k* + 𝓁 + 1 to allow for 𝓁 smaller elements and *k* − 2 larger elements. We rewrite *b* as *b* = 𝓁 + *w*, where *w* is in the range 1 ≤ *w* ≤ *n* − *k* + 1.

Given *b*, we may choose *A* in one of ways and choose *C* in one of ways. Multiply these counts for a given *b*, and sum over all *b* (by summing over all *w*) to obtain that the total number of (subsets) is the rightmost sum in (S7). But we also showed the number of such subsets is . Thus, (S7) evaluates to

### Deriving 𝔼[𝓁-HAF] in a constant-sized population

Our derivation of 𝔼[𝓁-HAF] consists of three main steps. We start by defining a more general form of the HAF score for an arbitrary function *f* (*w*), which we denote HAF_{f}. We then set *f*(*w*) to the rising factorial function and derive 𝔼[HAF_{w(𝓁)}]. Finally, we use Stirling Numbers to convert between rising factorials and powers, and obtain 𝔼[𝓁-HAF].

#### Step 1: The generalized form 𝔼[HAF_{f}]

Consider an arbitrary polynomial *f*(*w*) (or more generally, an arbitrary function *f* : ℤ → ℤ). Define the HAF_{f} score of a HAF vector **c** as

We now generalize the derivation of Eq. (7) to HAF_{f}. Sum the above equation over all haplotypes in a genealogy:

Note that in the sum over *j*, all summands are independent of *j*, so that sum is replaced by multiplying by the number of terms, *w*_{k, i}. We divide the above sum by *n* and then average over all genealogies to obtain

Further, since the random variables *W*_{k,i} for *i* = 1,…, *k* are identically distributed, all terms in the inner sum are the same, and may be consolidated into *k* times the value of the term on one branch:

In a constant-sized population, Eq. (S3) states that 𝔼[*M*_{k,i}] = θ/(*k*(*k* − 1)). Plug this in:

#### Step 2: The rising factorial form 𝔼[HAF_{w(𝓁)}]

We will show that for *f*(*w*) = *w*^{(𝓁)}, for any nonnegative integer 𝓁:

For this proof only, abbreviate *W* = *W*_{k,1}. Note that

Using this and Eq. (S6) gives

We will split this sum into two telescoping sums. One may show that

Using this with shifted *k*‘s and 𝓁’s, part of the sum in Eq. (S16) can be written
which is a telescoping sum that evaluates to

Note that 1^{(𝓁+2)} = (𝓁 + 2)!, simplifying this to

Similarly,

Plugging Eqs. (S18),(S19) into Eq. (S16), we obtain:

This proves Eq. (S13).

#### Step 3: The final form 𝔼[𝓁-HAF]

Powers *w*^{𝓁} and rising factorials *w*^{(𝓁)} may be expressed as linear combinations of each other via Stirling numbers. Let *c*(*𝓁, q*) denote the Unsigned Stirling Number of the First Kind, and *S*(*𝓁, q*) denote the Stirling Number of the Second Kind. Then [45, p. 264]

The expected value of (*W*_{k, i})^{𝓁} is then:

We evaluate 𝔼[𝓁-HAF] = 𝔼[HAF_{w(𝓁)}] as a linear combination of terms 𝔼[HAF_{w}(q)], using the same coefficients that express *w*^{𝓁} as a linear combination of terms of *w*^{(q)} (Eq. (S20a)):

For 𝓁 = 0:

For 𝓁 = 1:

For 𝓁 = 2:

It is straightforward to compute this for any 𝓁 in the same fashion.

### Equivalence of Eq. (3) and Eq. (7) for 𝔼[𝓁-HAF] in a constantsized population

Eq. (3) has simple derivation, but a variable number of terms, *n*, that grows as *n* grows. Eq. (7) leads to an evaluation (S22) that is a polynomial in *n* of degree *t*. Both represent 𝔼[t-HAF] in a constant-sized population; we now show that they are equal as a consequence of the following lemma:

*In a constant-sized population, for any polynomial f (w), we have*

Applying this to *f*(*w*) = *w*^{𝓁} gives equivalence of the two formulas (3) and (S22) for 𝔼[𝓁-HAF].

Both sides are linear in *f*, so it suffices to prove it on any basis of polynomials. We use the basis of rising factorials, *f*(*w*) = *w*^{(𝓁)} for 𝓁 ≥ 0. Eq. (S13) evaluates 𝔼[HAF_{f}] on this basis. We now evaluate . Note that

So

This leads to a telescoping sum:

Multiplying this sum by *θ*/n gives the same value as (S13) gives for 𝔼[HAF_{w}^{(𝓁)}]. Thus, the lemma holds on the basis *f*(*w*) = *w*^{(𝓁)} for 𝓁 ≥ 0. By linearity, it holds on all polynomials *f*(*w*).

### Deriving 𝔼[𝓁-HAF] in an exponentially growing population

Let the population size at time *t* in the past be given by *N*(*t*) = *N*_{0} *e*^{−rt}, where *N*_{o} is the current population size. *M*_{k,i} depends on *T*_{k}, which in turn depends on the size of the population in epoch *k*. Slatkin and Hudson (1991) [44, p. 559] derived an iterative formula to generate a sequence of random times *T*_{k} (*k* = *n*, *n* − 1,…, 2). The actual formula implemented in the simulator *ms* [73] is slightly different; in our notation, it is as follows. Let *α* = 2 *N*_{0} *r* be a scaled growth rate. Generate random values *U*_{n},…, *U*_{2} that are uniformly distributed in (0,1), and iteratively compute *T*_{n},…, *T*_{2} via

Then generate a random value for each *M*_{k, i} using a Poisson distribution with mean *μ*,*T*_{k} (*k* = 2,…, *n* and *i* = 1,…, *k*). This does not lead to a closed form expression for so the expected HAF scores, but we may nonetheless use it to estimate the expected HAF scores computationally, as illustrated in Fig. S2.

The first method, *cumulative time,* is to generate random values of *T*_{k}’s and *M*_{k},*i*’s as above, and plug them into (S10), along with exact values of 𝔼[*W*_{k, i} · f (*W*_{k, i})] computed using (S6) or (S21).

The second method is *conditional expectation.* For a given *k*, we may compute the expected value of (S24) given the condition that *T*_{k+1}, … , *T*_{n} are known. This gives estimates *T*_{k} = *t*_{k} (computed in order *k* = *n*, *n* − 1,…, 2) as follows:
where *τ*_{k} = *t*_{k}+_{1} + · · + *t*_{n} (with *τ*_{n} = 0) and *E*_{1}(*x*) is the exponential integral .

In practice, Eq. (S25) is used to generate scaled times,

In terms of these times, 𝔼[*M*_{k, i}] = *μ*𝔼[*T*_{k}] ≈ *μt _{k}* (which is approximate since the formula for

*t*

_{k}uses conditional expectation while this equation does not). We then estimate 𝔼[HAF

_{f}] via (S11):

For *f*(*w*) = *w*^{(𝓁)}, similarly to (S15), this gives an estimate
and for *f*(*w*) = *w*^{𝓁}, we use (S20) to obtain

In terms of *α* = 2 *N*_{0} *r* and *θ* = 2 *N*_{0} *μ*, the coefficient *μ*/*r* may be replaced by *θ*/*α*.

We used Maple (maplesoft.com) to generate the sequence of scaled times *r* · *t*_{k} by iterating Eq. (S25). Maple supports arbitrary precision numerical computations, which is needed due to complications with double precision arithmetic. Let . Small rounding errors in x due to limited precision are amplified in e^{x}, which may lead to slightly different results when iterating the recursion using double precision arithmetic vs. Maple’s arbitrary precision arithmetic. Additionally, as *x* increases, *e*^{x} grows rapidly and leads to an overflow in double precision arithmetic at approximately *x* ≥ 709.7827, while *E*_{1}(*x*) decays rapidly and leads to an underflow at approximately *x* ≥ 701.8334, even though *e*^{x}*E*_{1}(*x*) is usually representable in double precision. For example, *E*_{1}(702) ≈ 1.9 · 10^{−308} underflows in double precision arithmetic, but *e*^{702}*E*_{1}(702) ≈ 0.0014 is representable. The largest value of *x* used in Eq. (S25) is *x* = *n*(*n* − 1)/*α* in iteration *k* = *n*, leading to an underflow in double precision if *n*(*n* − 1)/*α* > 701.8.

### Number of mutations in a genealogy

Let *f*(*w*) = 1/*w* for *w* > 0 and *f*(0) = 0. Note that for an individual HAF vector **c**, some entries may be 0, but in a genealogy, all *w*_{k, i} ≠ 0 since every branch of the observed tree has at least one haplotype as its descendant. The sum (S9) of HAF_{f} (**c**) over all haplotypes in a genealogy becomes
and thus (S11) becomes

In a constant-sized population, Eqs. (S3) and (S29) give that mutations in a genealogy is where is the Harmonic number.

Under exponential growth, we instead estimate times *t*_{k} or scaled times *r* · *t*_{k} via (11). Then we estimate the expected number of mutations by

### Computing the mean HAF_{f} score from a SNP matrix

For a sample of *n* individuals whose genealogy has *q* segregating sites, the SNP matrix *A* consists of n rows (each row representing one individual) and *q* columns, with entry 0 denoting ancestral alleles and 1 denoting derived alleles.

The *v*^{th} row of A is a haplotype vector as depicted as **h** in Fig. 1.

Let **w** ^{a11} denote the sum of the rows of *A*. The frequency of the *j*^{th} allele is , the sum of all entries in the *j*^{th} column of *A*.

In , the *j*^{th} mutation contributes *f*(*w*_{j}) for each of the *w*_{j} haplotypes that have that mutation (*A*_{vj} = 1) and contributes 0 on each of the *n* − *w*_{j} others. Thus, the average of HAF_{f} over the genealogy is

We use this to compute empirical averages of HAF_{f} in simulations (*ms* [73] outputs a SNP matrix rather than a tree), as well as in further theoretical developments in this Appendix. In particular, given a haplotype vector **h** and the corresponding HAF vector **c**, we have

### HAF score dynamics and peak values

Consider *n* haplotypes randomly sampled from a Wright-Fischer (WF) fixed-size population of *N* haploid individuals under a hard sweep with selection coefficient s. Let v denote the fraction of individuals that carry the favored allele. We assume the sample has at least one carrier and at least one non-carrier, so that at fixation (*v* = 1), there are no non-carriers remaining. We assume strong selection (*N*_{s} ≫ 1) and no recombination in the region being sampled. We measure time in generations going backwards. See Fig. S7. The current time is time 0; let *T*^{car} denote the time when all sampled carriers coalesce to their most recent common ancestor (denoted by MRCA^{car}). Let *T*^{a11} denote the time when all sampled individuals coalesce to a common ancestor (denoted by MRCA^{a11}). Let *T*(*k*) denote time to MRCA of *k* randomly chosen haplotypes in a population of size *N*. From Nordborg [47], we have

At time *T*^{car}, we have exactly one ancestor of the favored allele carrier, and assume we have *m* ancestors of non-carriers. In a hard-sweep scenario, the favored mutation arises at the same time as the onset of selection; therefore, the time between *T*^{car} and *T*^{a11} is governed by the neutral WF model with population *N*, and is well approximated by neutral coalescent theory. Applying Eq. (S35) to the remaining sample of *m* + 1 individuals at time *T*^{car} gives the expected time to coalesce:

Let *A* denote the SNP matrix of a sample of *n* individuals, with *vn* carriers of the favored allele (Fig. S8). We order the columns so that all mutations are ordered chronologically from left to right. Similarly, the rows are ordered so that the first *vn* rows correspond to carrier haplotypes.

The *i*^{th} haplotype is represented by the *i*^{th} row of *A* (denoted by row vector **h _{i}**) and has 1-HAF score denoted by 1-HAF

_{i}. From Eq. (S34), we have

We partition the matrix into 3 submatrices: *A*_{1} consists of the rows corresponding to carriers, and columns corresponding to mutations occurring prior to the onset of selection (times between *T*^{car} and *T* ^{a11}). *A*_{2} is the submatrix from all non-carrier rows, and mutations prior to onset of selection. Finally, *A*_{3} is the submatrix of all columns after the onset of selection (times between 0 and *T*^{car}). Define *w*^{car} as the sum of all carrier haplotypes; similarly, let *w*^{non} be the sum of all non-carrier haplotypes. Thus,

Also, we can partition the equation for computing 1-HAF_{i} by rows as

We partition a haplotype vector **h** in the matrix by separating SNPs that occurred before (subvector **h**^{b}) and after (subvector **h**^{a}) the onset of mutation. A similar partitioning works for **w**^{a11}, **w**^{car}, and **w**^{non}. Thus, for a random haplotype **h**,

The expected 1-HAF score of a carrier (or non-carrier) haplotype h can similarly be decomposed as

To compute the expected 1-HAF score, we bound each of these constituent terms.

Consider a sample of n individuals under a hard sweep with vn carriers . Let *m* be the number of non-carrier haplotypes remaining at time *T*^{car}. Then for a random carrier haplotype **h**, we have
while for a random non-carrier haplotype **h**, these become

Mutations that happen on the lineage from MRCA^{a11} to MRCA^{car} are shared by all carriers, so all rows of matrix *A*_{1} are identical. Thus, for all carrier haplotypes **h**_{i}, **h**_{j}, we have . On restricting to the mutations from before the onset of selection, we obtain

The term is the expected number of ones in **h**^{b}, which is the expected number of mutations on the lineage from MRCA^{a11} to MRCA^{car}. Therefore, . By Eq. (S36),

On restricting to the mutations from before the onset of selection, we obtain

Going back from time *T*^{car} to *T*^{a11}, we observe the coalescence of the ancestors of carrier haplotype **h** and non-carrier haplotype **h**_{i} at a time *T*′, where *T*^{car} ≤ *T*′ ≤ *T*^{a11}. The term is the expected number of ones in common between these two haplotypes among the SNPs occurring before the onset of selection. This equals the expected number of mutations in the lineage from *T*′ to *T*^{a11}:

Note that *T*′ — *T*^{car} is the time for a sample of 2 individuals (ancestors of **h** and **h**_{i} at time *T*^{car}) to coalesce. We evaluate Eq. (S45) using Eq. (S35):

Plug this into Eq. (S44). All *n* − *vn* = (1 − *v*)*n* terms are the same, so it simplifies to Eq. (S39).▀

We have

Here, **h** is a non-carrier while **h**_{i} is a carrier, so they are distinct and they coalesce prior to the onset of selection. Thus, Eq. (S46) applies here. All *vn* terms of the sum in Eq. (S47) evaluate to Eq. (S46), so it simplifies to Eq. (S40).▀

We have

The term is the expected number of ones in common in haplotypes **h** and **h**_{i} among the SNPs occurring before the onset of selection.

We apply Lemma 3 (upcoming) to show that two random non-carrier haplotypes **h** and **h**_{i} have the same ancestor at time *T*^{car} with probability
or different ancestors with probability 1 − *p*. Thus, with probability *p*, we have , and with probability 1 − *p*, we have . Therefore,

Term is the expected number of ones in **h**^{b}, which is . Term is the expected number of ones in common between and **h**^{b}, which is . These expected times were evaluated in Eqs. (S43) and (S45). Then

Eq. (S48) is a sum of (1 − *v*)*n* terms, each of form (S50), so Eq. (S48) evaluates to Eq. (S41).▀

In the preceding proof, the probability *p* that two non-carriers share the same ancestor was computed using the following lemma, applied to *n*_{0} = (1 − *v*)*n* non-carriers in the sample and *m* ancestors of non-carriers at time *T*^{car}.

Consider a subsample of n_{0} individuals, which coalesce to m ancestors at some time T, while the rest of the sample coalesces to one or more ancestors distinct from these m. Draw two individuals (with replacement) from the subsample. The probability that they have the same ancestor at time T is

Throughout this proof, pairs of individuals (*x*, *y*) are drawn from the subsample of size *n*_{0} *with replacement.* There are such pairs.

Consider the m lineages at time *T*. The number of pairs of individuals deriving from the *i*^{th} lineage is (*w*_{m,i})^{2}. Sum over all lineages to obtain the total number of pairs of individuals with the same time *T* ancestor: . The expected number of pairs with the same ancestor is

Note that Pr(*W*_{m,i} = *w*) is given by Eq. (S4), and holds upon restricting the full genealogy to the subsample. Divide this expected value by pairs and simplify to obtain Eq. (S51).▀

Consider a sample of *n* individuals under a hard sweep. For a random haplotype **h** (carrier or non-carrier), we have

We have

The dimension of vectors and **h** is the number of mutations in [0, *T*^{car}]. The expected value of this dimension is *μ***E**[*T*^{car}]. The dot product of two binary vectors is at least 0, and is bounded above by their dimension, so

Consider a population of size *N* with selection coefficient s, under a hard sweep with strong selection (*N*_{s} ≫ 1). Then

Campbell [79] shows that under strong selection, the fixation time is approximately . As *T*^{car} is the time for the favored allele to reach frequency *vn*, it is dominated by the fixation time. Therefore,

Plug Eq. (S56) and the scaled mutation rate *θ* = 2*Nμ* into Lemma 4 to obtain Eq. (S55). ▀

Consider a sample of n individuals under a hard sweep with vn carriers . In the time [0, *T*^{car}], let the number of coalescent events among the (1 − *v*)*n* non-carrier haplotypes be *ε*_{v}(1 − *v*)*n*. Then

Since we have (1 − *v*)*n* non-carrier haplotypes in our sample, the expected number of non-carriers in the whole population is (1 − *v*)*N*. Let , and let be the expected number of non-carriers in the whole population when the *u*^{th} coalescent event occurs among the non-carrier samples. As we go backwards in time (increasing *u*), the expected number of non-carriers increases:

Additionally, the chance of coalescing decreases and the expected coalescent time increases. For a fixed size population of size *N*, the expected time for one coalescent event in a sample of size *n* is (see [47]). Hence, the expected time for the *u*^{th} coalescent event is at least

On the other hand, the sum of all coalescent times during the selective sweep is at most *T*^{car}. Therefore,

Comparing this lower bound on 𝔼[*T*^{car}] with the upper bound in Eq. (S56) gives

This gives the upper bound on *ε*_{v} in Eq. (S57). ▀

Consider a sample of *n* individuals under a hard sweep with *vn* carriers . Assume strong selection (*N*_{s} ≫ 1). In the time [0, *T*^{car}], let the number of coalescent events among the (1 − *v*)*n* non-carrier haplotypes be *ε*_{v}(1 − *v*)*n*. The expected 1-HAF score of a random carrier haplotype is
while for a random non-carrier haplotype,

Recall the decomposition of 𝔼[1-HAF] given in Eq. (S37):

The number of ancestors of non-carriers at time *T*^{car} is *m* = (1 − *ε*_{v})(1 − *v*)*n*.

For a random carrier haplotype **h**, we evaluate the first two terms in Eq. (S63) by using Eqs. (S38) and (S39), and bound the third term by using Lemma 5, to obtain:

For a random non-carrier haplotype h, we use Eqs. (S40) and (S41) and Lemma 5:

Ignoring the small term , we obtain the desired expressions (S61) and (S62).▀

We find the maximum of Eq. (S61) over 0 ≤ *v* ≤ 1 using differentiation; the result is

We simulated selective sweeps for a variety of parameters and compared their trajectories against these results. In Fig. S9, we compared the trajectories of both carriers and non-carriers in 500 selective sweeps for each pair (*θ*, *n*) with *θ* ∊ {24, 48}, *n* ∊ {100, 200, 300, 400}, *s* = 0.08, and *n* = 2000. Over the course of a trajectory, the frequency *v* of the favored allele varies from *v* = 1/*n* (1 carrier) to *v* = 1 (fixation), but different trajectories will go through a different sequence of values of *v*. We aligned the trajectories by the values of *v*. For each *v*, we separated carriers and non-carriers and averaged together (1-HAF)/(*nθ*) over the generation (if any) with allele frequency *v* in each sweep. We plotted these averaged trajectories in green (*θ* = 24) and pink (*θ* = 48). We compared these against the expected value (Theorem 7). The expected value is bounded above in blue (*ε*_{v} = 0) and below in red ; see Lemma 6. These are bounds on the expected value, not on simulation means. Simulation means may range over the whole distribution, but tend to vary around the expected value.

In Fig. S10, for each (*n*, *θ*), we compute average simulated values of (1-HAF)/*θ* over 500 trajectories, stratified by *v* by the same method. The maximum value of this average over all *v* is determined and plotted, and compared against the expected value (Eq. (S64)).

In this paper, due to the run time for forward simulations, we used a relatively small value for population size (*N* = 2000); in reality, *N* is a lot larger (10,000), so that *ε*_{v} becomes smaller (see Lemma 6). Therefore, the upper and lower bounds in the proof of Theorem 7 move closer together, and we approximate *ε*_{v} = 0. To simplify the presentation in the main text, we set *ε*_{v} = 0, giving:

## Acknowledgments

This work was supported in part by National Science Foundation grants CCF-1115206 and IIS-1318386.

## Footnotes

↵** vbafna{at}ucsd.edu