Abstract
The inference of positive selection in genomes is a problem of great interest in evolutionary genomics. By identifying putative regions of the genome that contain adaptive mutations, we are able to learn about the biology of organisms and their evolutionary history. Here we introduce a composite likelihood method that identifies recently completed or ongoing positive selection by searching for extreme distortions in the spatial distribution of the haplotype frequency spectrum relative to the genome-wide expectation taken as neutrality. Furthermore, the method simultaneously infers two parameters of the sweep: the number of sweeping haplotypes and the “width” of the sweep, which is related to the strength and timing of selection. We demonstrate that this method outperforms the leading haplotype-based selection statistics. Then, as a positive control, we apply it to two well-studied human populations from the 1000 Genomes Project and examine haplotype frequency spectrum patterns at the LCT and MHC loci. To facilitate use of this method, we have implemented it in user-friendly open source software.
Introduction
The identification and classification of genomic regions undergoing positive selection in populations has been of long-standing interest for studying organisms across the tree of life. By investigating regions containing putative adaptive variation, one can begin to shed light on a population’s evolutionary history and the biological changes well-suited to cope with various selection pressures.
The genomic footprint of positive selection is generally characterized by long high-frequency haplotypes and low nucleotide diversity in the vicinity of the adaptive locus, the result of linked genetic material “sweeping” to high frequency faster than mutation and recombination can introduce novel variation. These selective sweeps are often described by two paradigms: “hard sweeps” and “soft sweeps”. Whereas a hard sweep is the result of a beneficial mutation that brings a single haplotype to high frequency [Przeworski, 2002], soft sweeps are the result of selection on multiple haplotype backgrounds, often arising from selection on standing variation or a high adaptive mutation rate. Soft sweeps are thus characterized by multiple sweeping haplotypes rising to high frequency [Hermisson and Pennings, 2005, Pennings and Hermisson, 2006a].
Many statistics have been proposed to capture these haplotype patterns to make inferences about recent or ongoing positive selection [Sabeti et al., 2002, Voight et al., 2006, Sabeti et al., 2007, Ferrer-Admetlla et al., 2014, Garud et al., 2015, Harris et al., 2018, Torres et al., 2018, Harris and DeGiorgio, 2020, Szpiech et al., 2020], most of which focus on summarizing patterns of haplotype homozygosity in a local genomic region. A particularly novel approach, the T statistic implemented in LASSI [Harris and DeGiorgio, 2020], employs a likelihood model based on distortions of the haplotype frequency spectrum (HFS). In this framework, Harris and DeGiorgio [2020] model a shift in the HFS toward one or several high-frequency haplotypes as the result of a hard or soft sweep in a local region of the genome. In addition to the likelihood test statistic T, for which larger values suggest more support for a sweep, LASSI also infers the parameter m̂. This parameter estimates the number of sweeping haplotypes in a genomic region, and values greater than one indicate support for a soft sweep.
A drawback of the original formulation of the T statistic implemented in LASSI is that it does not account for or make use of the genomic spatial distribution of haplotypic variation expected from a sweep. Specifically, Harris and DeGiorgio [2020] demonstrated that if the spatial distribution of T was directly accounted for in the machine learning approach (Trendsetter) of Mughal and DeGiorgio [2019], the power for detecting sweeps was greatly enhanced. Indeed, modern statistical learning machinery to detect sweeps has been greatly enhanced by incorporating spatial distributions of summary statistics [Lin et al., 2011, Schrider and Kern, 2016, Sheehan and Song, 2016, Kern and Schrider, 2018, Mughal and DeGiorgio, 2019, Mughal et al., 2020]. However, these machine learning methods need extensive simulations under an accurate and explicit demographic model to train the classifier. An alternative approach is to directly integrate this spatial distribution into the likelihood model, as has been performed for site frequency spectrum (SFS) composite likelihood methods to detect sweeps [Kim and Stephan, 2002, Nielsen et al., 2005, Chen et al., 2010, Huber et al., 2015, Vy and Kim, 2015, DeGiorgio et al., 2016, Racimo, 2016, Lee and Coop, 2017, Setter et al., 2020]. Here we incorporate the spatial distribution of HFS variation into the LASSI framework and introduce the Spatially Aware Likelihood Test for Improving LASSI, or saltiLASSI. For easy application to genomic datasets, we implement saltiLASSI in the open source program lassip along with LASSI [Harris and DeGiorgio, 2020], and other HFS-based statistics H12, H2/H1, G123, and G2/G1 [Garud et al., 2015, Harris et al., 2018]. lassip is available at https://www.github.com/szpiech/lassip.
After validating saltiLASSI through simulations and comparing it to other popular haplotype-based selection scans, we apply saltiLASSI to two well-studied populations from the 1000 Genomes Project [The 1000 Genomes Project Consortium, 2015] as a positive control in an empirical data set. We reproduce several well-known signals of selection in the European CEU population and the African YRI population, including the LCT (CEU), MHC (CEU and YRI), and APOL1 (YRI) loci, demonstrating that this method works well in real data.
Results
In this section we begin by developing a new likelihood ratio test statistic, termed Λ, that evaluates spatial patterns in the distortion of the HFS as evidence for sweeps. We then demonstrate that Λ has substantially higher power than competing single-population haplotype-based approaches, across a number of model parameters related to the underlying demographic and adaptive processes. Similar to the T statistic implemented in the LASSI framework of Harris and DeGiorgio [2020], we also show that Λ is capable of approximating the softness of a sweep by estimating the current number of high-frequency haplotypes m̂. We then apply the Λ statistic to whole-genome sequencing data from two human populations from the 1000 Genomes Project [The 1000 Genomes Project Consortium, 2015].
Definition of the statistic
Here we extend the LASSI maximum likelihood framework for detecting sweeps based on haplotype data [Harris and DeGiorgio, 2020] by incorporating the spatial pattern of haplotype frequency distortion in a statistical model of a sweep. Recall that Harris and DeGiorgio [2020] defined a genome-wide background $K$-haplotype truncated frequency spectrum vector $\mathbf{p} = (p_1, p_2, \ldots, p_K)$, which they assume represents the neutral distribution of the $K$ most-frequent haplotypes, with $p_1 \ge p_2 \ge \cdots \ge p_K \ge 0$ and normalization such that $\sum_{k=1}^{K} p_k = 1$. Harris and DeGiorgio [2020] then define the vector $\mathbf{q}^{(m)} = (q_1^{(m)}, q_2^{(m)}, \ldots, q_K^{(m)})$ with $q_1^{(m)} \ge q_2^{(m)} \ge \cdots \ge q_K^{(m)} \ge 0$ and $\sum_{k=1}^{K} q_k^{(m)} = 1$. This represents a distorted $K$-haplotype truncated frequency spectrum vector in a particular genomic region with a distortion consistent with $m$ sweeping haplotypes. To create these distorted haplotype spectra, Harris and DeGiorgio [2020] used the equation
$$q_k^{(m)} = \begin{cases} p_k + f_k \sum_{j=m+1}^{K} \left(p_j - q_j^{(m)}\right), & k = 1, 2, \ldots, m \\ U - \dfrac{k - m - 1}{K - m - 1}\,(U - \varepsilon), & k = m+1, m+2, \ldots, K \end{cases}$$
where $f_k \ge 0$ for $k \in \{1, 2, \ldots, m\}$ and $\sum_{k=1}^{m} f_k = 1$ defines the way in which mass is distributed to the $m$ “sweeping” haplotypes from the $K - m$ non-sweeping haplotypes with frequencies $p_{m+1}, p_{m+2}, \ldots, p_K$. Harris and DeGiorgio [2020] propose several reasonable choices of $f_k$, and for all computations here we use $f_k = e^{-k} / \sum_{j=1}^{m} e^{-j}$. The variables $U$ and $\varepsilon$ are associated with the amount of mass from non-sweeping haplotypes that is converted to the $m$ sweeping haplotypes [see Harris and DeGiorgio, 2020]. The schematic in Figure 1A illustrates the LASSI framework of generating the distorted haplotype spectra.
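The construction of a distorted spectrum can be illustrated with a minimal Python sketch. It assumes the piecewise form described by Harris and DeGiorgio [2020], in which the K − m non-sweeping classes decline linearly from U to ε and the mass they give up is added to the m sweeping classes in proportions f_k. The exponential weights f_k ∝ exp(−k), the function name distorted_spectrum, and the example values of p, U, and ε are illustrative assumptions, not the lassip implementation.

```python
import numpy as np


def distorted_spectrum(p, m, U, eps):
    """Distort a K-class background spectrum p toward m sweeping haplotypes.

    Hypothetical helper: classes m+1..K fall linearly from U to eps, and the
    mass they lose is redistributed to classes 1..m with weights f_k.
    """
    p = np.asarray(p, dtype=float)
    K = p.size
    if not 1 <= m < K:
        raise ValueError("m must satisfy 1 <= m < K")
    q = np.empty(K)
    # Non-sweeping classes m+1..K: linear decline from U down to eps.
    # (With a single non-sweeping class, linspace simply returns [U].)
    q[m:] = np.linspace(U, eps, K - m)
    # Assumed weights f_k proportional to exp(-k), normalised to sum to 1.
    f = np.exp(-np.arange(1, m + 1))
    f /= f.sum()
    # Mass removed from the non-sweeping classes is handed to classes 1..m.
    q[:m] = p[:m] + f * np.sum(p[m:] - q[m:])
    return q


# Illustrative background spectrum with K = 5 classes (made-up values).
p = np.array([0.40, 0.25, 0.15, 0.10, 0.10])
q_hard = distorted_spectrum(p, m=1, U=0.05, eps=0.01)  # one sweeping class
q_soft = distorted_spectrum(p, m=2, U=0.05, eps=0.01)  # two sweeping classes
```

Because the mass removed from the non-sweeping classes is exactly the mass added to the sweeping ones, each distorted spectrum still sums to one.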
To incorporate the spatial distribution of haplotypic variation into the LASSI framework, consider an index set $\mathcal{W} = \{1, 2, \ldots, I\}$ of $I \in \mathbb{Z}^+$ contiguous (potentially overlapping) windows such that window $i \in \mathcal{W}$ has position along a chromosome denoted $z_i$. This position could be in physical units (such as bases), in genetic map units (such as centiMorgans), in number of polymorphic sites [such as employed by nSL in Ferrer-Admetlla et al., 2014], or in window number. We model the relative contribution of a sweep with $m$ sweeping haplotypes at the target window with index $i^* \in \mathcal{W}$ by a parameter $\alpha_i \in [0, 1]$ on window $i \in \mathcal{W}$ and the relative contribution of neutrality by $1 - \alpha_i$.
Following a similar powerful framework introduced by Cheng and DeGiorgio [2020] for modeling balancing selection, we employ a mixture model for the $K$-haplotype truncated frequency spectrum in window $i$, with a proportion $\alpha_i(A)$ deriving from a sweep model and a proportion $1 - \alpha_i(A)$ deriving from the genome-wide background haplotype spectrum to represent neutrality. Here, $A$ is a parameter that we optimize over, describing the rate of decay of the effect of the sweep at target window $i^*$ on the flanking windows a certain distance away. Specifically, we model the $K$-truncated haplotype spectrum in window $i$ as the vector $\mathbf{p}_i^{(m)}(A) = (p_{i1}^{(m)}(A), p_{i2}^{(m)}(A), \ldots, p_{iK}^{(m)}(A))$, where
$$p_{ik}^{(m)}(A) = \alpha_i(A)\, q_k^{(m)} + [1 - \alpha_i(A)]\, p_k$$
for $k = 1, 2, \ldots, K$ and $i \in \mathcal{W}$, with $q_k^{(m)}$ and $p_k$ denoting the $k$th entries of the distorted and background spectra, respectively. Note that for target window $i^*$, $\alpha_{i^*}(A) = 1$, and hence $\mathbf{p}_{i^*}^{(m)}(A) = \mathbf{q}^{(m)}$; i.e., the target window is on top of the sweep, and so it is entirely determined by the distorted $m$-sweeping haplotype spectrum. However, given a fixed $A$ value, for windows $i$ far enough away from the central window $i^*$, we have $\alpha_i(A) \approx 0$, and therefore $\mathbf{p}_i^{(m)}(A) \approx \mathbf{p}$; i.e., the expectation of a neutral window. Based on these trends, windows far from the putatively selected target window are modeled as neutral, and windows close to the target window are heavily distorted due to the sweep. Moreover, because $\alpha_i(A)$ tends to zero for windows far enough away from the central window, the model of neutrality is nested within our proposed sweep model. The schematic in Figure 1B illustrates the saltiLASSI framework of generating the spatially-distorted haplotype spectra.
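To make the mixture concrete, the following Python sketch builds the per-window spectra as a weighted average of a distorted spectrum and the background spectrum. The exponential decay α_i(A) = exp(−A |z_i − z_i*|) is an assumed functional form chosen because it equals one at the target window and tends to zero far away; the function names and the numerical values of p, q, and A are likewise illustrative.

```python
import numpy as np


def sweep_proportions(z, target, A):
    """Assumed decay: alpha_i(A) = exp(-A * |z_i - z_target|)."""
    z = np.asarray(z, dtype=float)
    return np.exp(-A * np.abs(z - z[target]))


def window_spectra(p, q, alpha):
    """Row i is alpha_i * q + (1 - alpha_i) * p (one spectrum per window)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    a = np.asarray(alpha, dtype=float)[:, None]  # column vector for broadcasting
    return a * q + (1.0 - a) * p


# Made-up background (p) and distorted (q) spectra with K = 5 classes.
p = np.array([0.40, 0.25, 0.15, 0.10, 0.10])
q = np.array([0.88, 0.05, 0.04, 0.02, 0.01])
z = np.arange(11)  # window positions measured in window number
alpha = sweep_proportions(z, target=5, A=0.5)
spectra = window_spectra(p, q, alpha)
```

In this sketch the row for the target window equals q exactly, and rows farther from the target move smoothly back toward p, with larger A giving a faster return to the background.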
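Assume that in window $i \in \mathcal{W}$ there is a $K$-truncated vector of counts $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iK})$, which are the observed counts of the $K$ most-frequent haplotypes, with $x_{i1} \ge x_{i2} \ge \cdots \ge x_{iK} \ge 0$ and normalized such that $\sum_{k=1}^{K} x_{ik} = n_i$, where $n_i$ is the total number of sampled haplotypes in window $i$. Following Cheng and DeGiorgio [2020] and Harris and DeGiorgio [2020], we then compute the log composite likelihood for the null hypothesis of neutrality at target window $i^*$ as
$$\ell_0 = \sum_{i \in \mathcal{W}} \sum_{k=1}^{K} x_{ik} \log p_k$$
and for the alternative hypothesis of $m$ sweeping haplotypes at target window $i^*$ as
$$\ell_1(m, A) = \sum_{i \in \mathcal{W}} \sum_{k=1}^{K} x_{ik} \log\!\left\{ \alpha_i(A)\, q_k^{(m)} + [1 - \alpha_i(A)]\, p_k \right\},$$
where $q_k^{(m)}$ is the $k$th entry of the distorted spectrum with $m$ sweeping haplotypes.
Using these log likelihoods, we follow Harris and DeGiorgio [2020] and construct a log likelihood ratio test statistic of a sweep at target window $i^*$ as
$$\Lambda = 2\left[\ell_1(\hat{m}, \hat{A}) - \ell_0\right],$$
where
$$(\hat{m}, \hat{A}) = \underset{(m, A)}{\arg\max}\; \ell_1(m, A).$$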
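The test statistic can be sketched in Python as a grid search over candidate numbers of sweeping haplotypes and decay rates. This is a hedged illustration rather than the lassip implementation: it assumes multinomial composite log likelihoods, a factor of two in the ratio, an exponential decay α_i(A) = exp(−A |z_i − z_i*|), strictly positive spectra, and candidate distorted spectra supplied by the caller in a dictionary keyed by m. All names and numbers are hypothetical.

```python
import numpy as np


def log_composite_likelihood(counts, spectra):
    """Sum over windows i and classes k of x_ik * log(spectrum_ik)."""
    return float(np.sum(counts * np.log(spectra)))


def lambda_statistic(counts, p, q_by_m, z, target, A_grid):
    """Return (Lambda, m_hat, A_hat) for one target window via grid search."""
    counts = np.asarray(counts, dtype=float)
    p = np.asarray(p, dtype=float)
    z = np.asarray(z, dtype=float)
    # Null model: every window follows the background spectrum p.
    ell0 = log_composite_likelihood(counts, np.broadcast_to(p, counts.shape))
    dist = np.abs(z - z[target])
    best = (-np.inf, None, None)
    for m, q in q_by_m.items():
        q = np.asarray(q, dtype=float)
        for A in A_grid:
            alpha = np.exp(-A * dist)[:, None]  # assumed decay form
            ell1 = log_composite_likelihood(counts, alpha * q + (1.0 - alpha) * p)
            if ell1 > best[0]:
                best = (ell1, m, A)
    ell1, m_hat, A_hat = best
    return 2.0 * (ell1 - ell0), m_hat, A_hat


# Toy data: five windows of 100 haplotypes; only the middle window is distorted.
p = np.array([0.4, 0.3, 0.2, 0.1])
q_by_m = {1: [0.85, 0.08, 0.05, 0.02], 2: [0.50, 0.40, 0.07, 0.03]}
counts = np.array([[40, 30, 20, 10]] * 2 + [[85, 8, 5, 2]] + [[40, 30, 20, 10]] * 2)
lam, m_hat, A_hat = lambda_statistic(counts, p, q_by_m, z=np.arange(5),
                                     target=2, A_grid=[0.1, 1.0, 10.0])
```

In the toy data the flanking windows match the background exactly, so the search prefers the fastest decay on the grid and the single-haplotype candidate.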
Power to detect sweeps
The power to detect sweeps will depend on a number of factors, including the window size used to compute a statistic, whether phasing information for genotypes is used, the selection strength of the beneficial mutation s, the age of the sweep t, the number of selected haplotypes ν, and the underlying demographic history. To explore the performance of Λ, we evaluate its power to detect sweeps of varying strengths, softness, and ages. Under each setting, we interrogated its robustness to demographic history, both through idealized constant-size histories and histories with recent severe bottlenecks. Moreover, we gauged whether Λ yields false sweep signals under settings of background selection. Furthermore, for each setting described, we investigated the power and robustness of using unphased multilocus genotypes as input to Λ instead of phased haplotypes. Finally, we compared Λ to competing contemporary methods that use the same type of input data, using the T statistic of Harris and DeGiorgio [2020] for phased and unphased input data, and also considered the H12 [Garud et al., 2015], nSL [Ferrer-Admetlla et al., 2014], and iHS [Voight et al., 2006] statistics for phased data and the G123 statistic [Harris et al., 2018] for unphased data. The simulation protocol for all settings is described in the Methods section.
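As an illustration of how power comparisons of this kind are commonly computed, the Python sketch below takes one score per simulated replicate, sets a threshold at a fixed false positive rate from the neutral replicates, and reports the fraction of sweep replicates exceeding it. The 1% false positive rate, the function name, and the toy scores are assumptions made for the example; the protocol actually used is the one given in the Methods section.

```python
import numpy as np


def power_at_fpr(neutral_scores, sweep_scores, fpr=0.01):
    """Fraction of sweep replicates above the (1 - fpr) neutral quantile."""
    neutral = np.asarray(neutral_scores, dtype=float)
    sweep = np.asarray(sweep_scores, dtype=float)
    # Threshold chosen so that a proportion fpr of neutral replicates exceeds it.
    threshold = float(np.quantile(neutral, 1.0 - fpr))
    power = float(np.mean(sweep > threshold))
    return power, threshold


# Toy scores: 100 neutral replicates and 3 sweep replicates.
neutral = np.arange(100.0)
sweep = np.array([98.5, 99.5, 50.0])
power, threshold = power_at_fpr(neutral, sweep, fpr=0.01)
```

Applying the same function to each statistic and each combination of s, t, and ν would give curves comparable across methods, since every method is held to the same neutral false positive rate.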
To begin, we compare the performance of Λ to T, H12, nSL, and iHS under a constant-size demographic history with an effective size of N = 10⁴ diploid individuals. The Λ, T, and H12 statistics were computed for different window sizes, consisting of 51, 101, or 201 SNPs per window. Figures 2A and S1 show that across sweeps of varying degrees of softness (beneficial mutation on ν ∈ {1, 2, 4, 8, 16} distinct haplotypes) and for both moderate (per-generation selection coefficient s = 0.01) and strong (s = 0.1) sweeps, the method with highest power regardless of time of selection (t ∈ {500, 1000, 1500, 2000, 2500, 3000} generations prior to sampling) is Λ, thereby outperforming the competing methods. Interestingly, Λ applied to 51 SNP windows has generally higher power than with 101 and 201 SNP windows. Furthermore, smaller window sizes enable Λ to achieve high power even for old sweeps, with this elevated power often substantially higher than that of the closest competing method. This result recapitulates a finding of Harris and DeGiorgio [2020], who observed that if the spatial distribution of the T statistic was used within a machine learning framework, computing the T statistic in a greater number of small windows yielded higher power for ancient sweeps than when a smaller number of large windows was used. This is an intriguing result, because smaller windows have poorer estimates of the distortion of the HFS, yet it appears that for detecting ancient sweeps what matters is capturing the overall spatial trend of the distortion of the HFS. That is, when windows are too large, Λ averages the HFS across too large a region, which for ancient sweeps has likely been broken up over time by recombination. Instead, smaller windows focus on genomic segments with less shuffling of haplotype variation due to recombination events, such that distortions in the HFS are due to the effect of a sweep at a nearby selected site.
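The window-based quantities discussed above rest on counting haplotypes within fixed-size SNP windows. A minimal Python sketch of that bookkeeping follows, assuming a 0/1 haplotype matrix with one row per sampled haplotype; the function names, the step size, and the rescaling of the truncated counts so that they sum to the sample size are illustrative choices, not a description of lassip internals.

```python
from collections import Counter

import numpy as np


def snp_windows(n_snps, size, step):
    """Half-open (start, end) SNP index ranges of a fixed size."""
    return [(s, s + size) for s in range(0, n_snps - size + 1, step)]


def truncated_hfs_counts(haps, K):
    """Counts of the K most-frequent haplotypes in one window.

    haps is an (n_haplotypes, n_snps) array; counts are sorted in
    descending order and rescaled to sum to n_haplotypes.
    """
    haps = np.asarray(haps)
    n = haps.shape[0]
    # Identical rows are identical haplotypes within the window.
    tally = Counter(row.tobytes() for row in haps)
    top = sorted(tally.values(), reverse=True)[:K]
    x = np.zeros(K)
    x[:len(top)] = top
    return x * n / x.sum()


# Toy window: six haplotypes over three SNPs, three distinct haplotypes.
haps = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1],
                 [1, 0, 0], [1, 0, 0],
                 [0, 1, 0]])
x = truncated_hfs_counts(haps, K=2)
windows = snp_windows(n_snps=10, size=4, step=3)
```

Smaller windows contain fewer SNPs and therefore fewer distinct haplotypes, which is one way to see why their individual spectra are noisier even though they resolve the spatial trend more finely.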
Figure S1 also highlights a key distinction between moderate and strong sweeps. Specifically, regardless of method considered, each achieves its highest power when strong sweeps are recent, whereas for moderate sweeps, highest power for each method is shifted farther into the past, toward more ancient sweeps. This pattern was also found previously for H12 [Harris et al., 2018] and T [Harris and DeGiorgio, 2020]. The likely reason for this result is that moderate sweeps require more time for the beneficial allele to reach high frequency and leave a conspicuous genomic footprint. In contrast, strong sweeps cause an immediate selection signature to appear in the genome due to the rapid rise in frequency of a beneficial mutation, but traces of this sweep pattern erode over time due to recombination, mutation, and drift. Regardless, the Λ statistic paired with a small window size yields uniformly better or comparable sweep detection ability than the other approaches we examined.
During a scan with Λ, the composite likelihood ratio is optimized over the number of high frequency (sweeping) haplotypes m and the footprint size of the sweep A, leading to respective estimates m̂ and Â. Therefore, at a genomic location with evidence for a sweep (high Λ value), we may better understand properties of the putative sweep by evaluating its softness through m̂ and its strength or age through Â. Figure S2 shows that for moderate sweeps, the estimated number of sweeping haplotypes is considerably different from the actual number of initially-selected haplotypes ν, regardless of window size used or age of the sweep. In contrast, Figures 2A and S2 reveal that for strong hard sweeps (ν = 1), the estimate of the number of sweeping haplotypes when using 51 SNP windows is often consistent with hard sweeps provided that the sweep is recent enough (within the last 500 generations). Similarly, under these same settings but with soft sweeps of ν ∈ {2, 4} selected haplotypes (Figures 2A and S2), the estimated number of sweeping haplotypes tends to be consistent with sweeps of the same level of softness. Moreover, for softer sweeps (ν ∈ {8, 16}) the number of sweeping haplotypes tends to be underestimated but is still consistent with a soft sweep. Therefore, provided that a sweep is recent enough, when using 51 SNP windows the value of the estimated number of sweeping haplotypes can be used to lend evidence of a hard or a soft sweep.
Similarly, the other parameter estimate Â may also help characterize identified sweeps. Specifically, Figures 2A and S3 show that the footprint size of the sweep (measured as log10 Â) is substantially elevated compared to expectation for neutral simulations for sweep times at which there is high power to detect sweeps (Figures 2A and S1). Interestingly, the shape of the curves relating the mean sweep footprint size over time mirrors the power of the Λ statistic with corresponding window size as a function of sweep initiation time (t), sweep softness (ν), and sweep strength (s). These results suggest that the estimate of the sweep footprint size (log10 Â) can be used to learn about the age or strength of a candidate sweep (the signatures of which appear to be confounded between the two parameters). Coupled with an estimate of the sweep softness (m̂), our saltiLASSI framework provides a means to not only detect sweeps with high power, but to also learn the underlying parameters that may have shaped the adaptive evolution of candidate sweep regions.
Obtaining phased haplotypes for input to Λ represents an error-prone step that, without sufficient reference panels or high-enough quality genotypes, may make identification of sweeps difficult or potentially impossible for a number of diverse study systems. It is therefore beneficial if the favorable performance of Λ transfers to datasets that have not been phased. Similar to prior studies [e.g., Harris et al., 2018, Kern and Schrider, 2018, Harris and DeGiorgio, 2020, Harpak et al., 2021], we sought to evaluate the power of Λ when applied to unphased multilocus genotype data, and to compare its performance with the T statistic and G123 [analogue of H12 for use with unphased data; Harris et al., 2018], both of which are also applied to unphased multilocus genotypes. Figure S4 shows that Λ maintains high power to detect sweeps of differing ages, strengths, and softness. Consistent with the results on haplotype data (Figures 2A and S1), Λ generally displays higher power than, or comparable power to, T and G123, with the best performance deriving from Λ with a small window size of 51 SNPs, and with substantially higher power for old sweeps compared to other approaches. An exception is that for recent (t ≤ 1000 generations) and highly soft (ν = 16) sweeps, using a window size of 101 SNPs for Λ had substantially higher power than using the smaller 51 SNP window. Moreover, for strong (s = 0.1), highly soft (ν = 16), and ancient (t ≥ 2000) sweeps, the power of Λ is much lower with unphased multilocus genotypes compared to phased haplotypes (compare Figures S1 and S4). Interpretation of m̂ is more difficult for multilocus genotypes compared to haplotypes. However, consistent with the results for haplotypes (Figure S2), Figure S5 shows that when using 51 SNP windows, Λ tends to estimate a smaller number of sweeping multilocus genotypes (smaller m̂) for harder sweeps (smaller ν) than for softer sweeps (larger ν).
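The unphased input can be illustrated with a short Python sketch that collapses pairs of haplotypes into multilocus genotypes and computes G123 from the resulting frequencies, using the definition of Harris et al. [2018] as the squared sum of the three largest multilocus genotype frequencies plus the sum of squares of the rest. The pairing of consecutive rows into one individual, the function names, and the toy data are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def multilocus_genotypes(haps):
    """Sum rows (2j, 2j+1) of a 0/1 haplotype matrix into 0/1/2 genotypes."""
    haps = np.asarray(haps)
    if haps.shape[0] % 2 != 0:
        raise ValueError("expected an even number of haplotypes")
    return haps[0::2] + haps[1::2]


def sorted_frequencies(rows):
    """Frequencies of distinct rows, in descending order."""
    tally = Counter(tuple(r) for r in np.asarray(rows).tolist())
    counts = np.array(sorted(tally.values(), reverse=True), dtype=float)
    return counts / counts.sum()


def g123(freqs):
    """G123 = (g1 + g2 + g3)^2 + sum of g_i^2 for i > 3."""
    f = np.sort(np.asarray(freqs, dtype=float))[::-1]
    return float(f[:3].sum() ** 2 + np.sum(f[3:] ** 2))


# Three individuals; the first two carry different haplotype pairs
# but the same multilocus genotype once phase is discarded.
haps = np.array([[0, 1], [1, 0],
                 [1, 1], [0, 0],
                 [1, 1], [1, 1]])
mlg = multilocus_genotypes(haps)
freqs = sorted_frequencies(mlg)
```

The first two individuals in the toy data show the information lost without phase: distinct haplotype pairs can yield an identical multilocus genotype, which is one reason the estimated number of sweeping classes is harder to interpret for unphased data.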
While adaptive processes generally affect variation locally in the genome, neutral processes such as demographic history influence overall levels of genome diversity. Specifically, it is common to consider that demographic processes impact the mean value of genetic diversity, and numerous likelihood approaches for detecting sweeps [Kim and Stephan, 2002, Nielsen et al., 2005, Chen et al., 2010, Huber et al., 2015, Vy and Kim, 2015, DeGiorgio et al., 2016, Racimo, 2016, Lee and Coop, 2017, Harris and DeGiorgio, 2020, Setter et al., 2020] and other forms of natural selection [DeGiorgio et al., 2014, Cheng and DeGiorgio, 2019, 2020] have been created to specifically account for this average effect of demographic history on genome diversity. However, demographic processes, such as recent severe bottlenecks, not only alter mean diversity but also influence higher-order moments of diversity, potentially making it insufficient to account solely for the mean effect of diversity [Barton, 1998, Jensen et al., 2005, Pavlidis et al., 2008]. Given that Λ does not account for higher moments than the mean effect of demographic history on the HFS, we sought to evaluate its properties under recent strong bottlenecks—a setting that has proven challenging for other sweep statistics in the past.
The Λ statistic generally exhibits superior power to T, H12, nSL, and iHS when applied to haplotype data (Figures 2B and S7) or to T and G123 when applied to unphased multilocus genotype data (Figure S10). Moreover, the general trends in method power as a function of sweep strength, softness, and age observed for the constant-size history (Figures 2A, S1, and S4) hold for this complex demographic setting (Figures 2B, S7, and S10), with the caveat that, as expected, power for all methods is generally lower under the bottleneck compared to the constant-size history. A clear difference between these two demographic settings is that, whereas Λ had exhibited uniformly superior or comparable power with smaller 51 SNP windows compared to larger 101 or 201 SNP windows (Figures 2A and S1), under the bottleneck model the best window size depends on the age of the sweep (Figures 2B and S7). In particular, recent sweeps often had highest power with 201 SNP windows, sweeps of intermediate age with 101 SNP windows, and ancient sweeps with 51 SNP windows. Therefore, under complex demographic histories, choice of window size for Λ is more nuanced than with constant-size histories. This result is consistent with those of Harris and DeGiorgio [2020], who demonstrated that, when accounting for the spatial distribution of the T statistic in a machine learning framework (referred to as T-Trendsetter), power to detect recent sweeps is higher for larger windows and power to detect ancient sweeps is higher for smaller windows under the bottleneck history considered here.
In addition to demographic history, a pervasive force acting to reduce variation across the genome is background selection [McVicker et al., 2009, Lohmueller et al., 2011, Comeron, 2014, Wilson Sayres et al., 2014], which is the loss of genetic diversity at neutral sites due to negative selection at nearby loci [Charlesworth et al., 1993, Hudson and Kaplan, 1995a, Charlesworth, 2012]. Background selection has been demonstrated to alter the neutral SFS [Charlesworth et al., 1993, 1995, Seger et al., 2010, Nicolaisen and Desai, 2013], and masquerade as false signals of positive selection [Charlesworth et al., 1993, 1995, Hudson and Kaplan, 1995a,b, Nordborg et al., 1996, McVean and Charlesworth, 2000, Boyko et al., 2008, Akashi et al., 2012, Charlesworth, 2012, Huber et al., 2015]. However, because this process does not generally lead to haplotypic variation consistent with sweeps [Enard et al., 2014, Fagny et al., 2014, Schrider, 2020], like prior studies developing haplotype approaches for detecting sweeps [Harris et al., 2018, Harris and DeGiorgio, 2020] we sought to evaluate the robustness of Λ to background selection. We find that under both simple and complex demographic histories, using either phased haplotype or unphased multilocus genotype data, all methods considered here demonstrate robustness to background selection by not falsely attributing genomic regions evolving under background selection as sweeps (Figure S13).
Application to empirical data
We next apply the Λ statistic to empirical data representing a European (CEU) and an African (YRI) population from the 1000 Genomes Project Phase 3 [The 1000 Genomes Project Consortium, 2015] to demonstrate its use and check that it identifies sensible results.
We plot the genome-wide Λ statistics for the CEU population in Figure 3A and the YRI population in Figure 3B. We find several conspicuous peaks of notably large Λ values, indicating strong support for a highly distorted HFS in these regions compared to the genome-wide mean HFS. Using a conservative threshold for determining significance (maximum observed Λ from genome-wide neutral simulations; see Methods), we identify several regions in both populations with scores consistently above this threshold, including five regions in the CEU population (Table 1) and 14 in the YRI population (Table 2). Among these regions, we find several well-studied genes that are known to have been under selection in these populations. These include the lactase gene [LCT; Tishkoff et al., 2007, Field et al., 2016, Ségurel and Bon, 2017, Taliun et al., 2021], the major histocompatibility complex [MHC; Field et al., 2016, Pierini and Lenz, 2018, Taliun et al., 2021], and the apolipoprotein L1 gene [APOL1; Ko et al., 2013].
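The thresholding step can be sketched in Python as follows, taking the threshold to be the maximum Λ observed across neutral simulations, as described above. Merging runs of consecutive above-threshold windows into a single candidate region is an assumption about how regions might be delimited, and the function name and all numbers are hypothetical.

```python
def candidate_regions(positions, scores, threshold):
    """Merge consecutive windows scoring above threshold into regions.

    Returns a list of (start_position, end_position, peak_score) tuples.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a new run of significant windows begins
        elif score <= threshold and start is not None:
            regions.append((positions[start], positions[i - 1],
                            max(scores[start:i])))
            start = None
    if start is not None:  # run extends to the last window
        regions.append((positions[start], positions[-1], max(scores[start:])))
    return regions


# Toy scan: threshold is the largest score seen in neutral simulations.
neutral_sim_scores = [4.0, 6.5, 7.2]
threshold = max(neutral_sim_scores)
positions = [100, 200, 300, 400, 500, 600]
scores = [1.0, 9.0, 12.0, 2.0, 8.0, 3.0]
regions = candidate_regions(positions, scores, threshold)
```

Using the neutral maximum rather than a quantile is conservative by construction, since no neutral replicate exceeds it.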
We next explore two peaks in detail, the LCT and MHC loci in each population (Figure 4). The LCT locus has been previously identified as under selection in some northern European populations and eastern African populations [Tishkoff et al., 2007]. As the CEU population has largely northern European ancestry and the YRI population is from western Africa, we expect to find a peak near LCT in CEU but not in YRI. Indeed, this is what we see in Figure 4A, which plots Λ statistics in the vicinity of the LCT locus on Chromosome 2. Furthermore, we examine the truncated HFS among eleven windows spanning LCT in both YRI (Figure 4B) and CEU (Figure 4C). We see in Figure 4B that YRI has haplotype frequencies similar to the genome-wide mean (plotted and highlighted on the left), whereas Figure 4C shows that the CEU population is dominated largely by a single haplotype near 80% frequency. Indeed, the saltiLASSI method also infers m̂ = 1 in this region (Table 1), indicating a single sweeping haplotype (i.e., a hard sweep). Furthermore, we can see the HFS in this region trending toward the genome-wide mean as the windows move farther from the sweep’s focal point, illustrating the pattern that the saltiLASSI method was designed to capture.
Figures 4D-F illustrate the Λ statistics and HFS patterns in the vicinity of the MHC locus. This locus contains a large cluster of immune system genes, and selection at this locus is distinguished from LCT in that high diversity is preferred in order for the body to be able to mount a robust response to unknown pathogen exposure. As expected, both populations have extreme Λ values (Figure 4D) and a greatly distorted HFS in this region (Figures 4E and F). However, we note that the HFS is clearly distorted in favor of multiple haplotypes, in contrast to LCT, as we would expect at a locus that favors diversity. Indeed, the saltiLASSI method infers m̂ to be between seven and nine in the CEU population and between eight and 11 for YRI (variance due to multiple regions within the MHC being separately identified; Tables 1 and 2).
Finally, we repeated our analyses of these two populations and two loci using the unphased multilocus-genotype approach (Figures S14 and S15; Tables S2 and S3), and we find good concordance with the phased haplotype approach.
Discussion
In this study, we developed a new likelihood ratio test statistic Λ that examines the spatial distribution of the HFS for evidence of sweeps. We demonstrated that this statistic has high power to detect both hard and soft sweeps, with performance substantially better than competing haplotype-based approaches for the same task. Moreover, while optimizing the model parameters of Λ we obtain estimates of sweep softness m and footprint size A, which is correlated with age and strength of the sweep. These additional parameters have the potential to further characterize well-supported sweep signals from large Λ values.
In addition to lending exceptional performance on simulated data, application of Λ to whole-genome variant calls from central European and sub-Saharan African individuals recapitulated the well-established signal at the LCT gene in Europeans due to lactase persistence [Bersaglieri et al., 2004], as well as sweep footprints at the MHC locus in both populations related to immunity, which have previously been detected with other sweep statistics [Albrechtsen et al., 2010, Goeury et al., 2018, Harris and DeGiorgio, 2020]. Though not novel findings, the clear (Figure 4) and strong (Figure 3) signals at these two loci serve as positive controls to highlight the efficacy of Λ. Furthermore, these findings were similarly recapitulated with unphased multilocus genotype data (Figures S14 and S15), lending support for the utility of Λ when applied to study systems for which obtaining phased haplotype data is challenging.
A key parameter that must be chosen when applying Λ is the number of SNPs per window. Specifically, we found that larger windows had greatest power for more recent sweeps, and smaller windows for more ancient sweeps (Figures 2, S1, and S7), mirroring the window size results observed in Figures S8 and S9 of Harris and DeGiorgio [2020] for the spatial distribution of the T statistic using a different modeling approach. Therefore, choice of window size may be informed by the time frame of selective events that is being investigated. As highlighted in Figures 2B and S7, the Λ statistic computed within windows of 201 SNPs had the highest power among all tested window sizes within the past 2000 generations under the central European demographic history. Because selective events within this time frame are consistent with adaptive events in recent evolution of modern humans [Gravel et al., 2011, Gronau et al., 2011, Schiffels and Durbin, 2014], we selected this size so that we could recapitulate expected well-established sweeps (e.g., Figures 3 and 4 highlight the sweep signal at LCT). In addition to using simulation results to aid in selecting appropriate window sizes, an alternate method such as choosing sizes based on the expected decay of linkage disequilibrium in the genome has been demonstrated to also work well in practice [e.g., Garud et al., 2015, Harris and DeGiorgio, 2020].
The T statistic of Harris and DeGiorgio [2020] presented the first likelihood approach that evaluated distortions in the HFS to detect selective sweeps, which is important because neutrality and soft sweeps leave similar signatures in the SFS but different signatures in the HFS [Pennings and Hermisson, 2006b]. As demonstrated by Harris and DeGiorgio [2020], using the spatial distribution of the T statistic within a machine learning framework enhanced its detection ability, specifically for ancient sweeps. However, machine learning frameworks require extensive simulations to train [e.g., Schrider and Kern, 2016, Sheehan and Song, 2016, Mughal and DeGiorgio, 2019], and these simulations must be based on a set of critical assumptions, such as demographic, mutation rate, and recombination rate parameters. Yet, accurate inference of these parameters is not always possible, or can be highly error prone, and prior studies have found that these machine learning methods can make highly incorrect predictions if the distribution of training data is different from that of the test or empirical data [Mughal and DeGiorgio, 2019, Mughal et al., 2020]. Furthermore, generating these training datasets and training the models on them often requires substantial computational time and resources. Instead, our Λ statistic is the first likelihood method to model the spatial distribution of the HFS, providing the power of modeling the spatial distribution of T afforded by current machine learning frameworks (e.g., compare Figures S1 and S7 with Figures S8 and S9 of Harris and DeGiorgio [2020]), yet with massive savings in computational time and with predictions not hinging on accurate estimates of genetic and evolutionary model parameters to generate training sets.
While optimizing the Λ statistic, we also obtain estimates of the number of presently-sweeping haplotypes m and the footprint size A. For recent strong sweeps, estimates of m correlate well with the number of initially-selected haplotypes ν. For older and weaker sweeps, mutation and recombination events accumulate, leading to more distinct haplotypes and thereby inflating m estimates. Moreover, estimates of the footprint size A correlate with the power of Λ, suggesting that the estimated footprint size will be large under scenarios in which sweeps are highly supported. The relationship between A and the power of Λ is related to the prominence of the distortions in the HFS, which also erode due to mutation and recombination. Therefore, though we found that estimates of m and A were not highly accurate under non-ideal sweep settings, they may still be useful. Specifically, though not directly associated with population-genetic parameters such as ν or the strength s and time t of a sweep, estimated Λ, m, and A values can be used as input features to machine learning regression algorithms to predict the underlying evolutionary model parameters ν, s, and t [Hastie et al., 2009]. Such strategies are typically computationally expensive, but may be required for accurate characterization of sweep footprints, even though they are unnecessary for detecting sweeps due to the already high power of Λ.
The Λ statistic developed here represents an important step in advancing methodology for sweep detection by interrogating the spatial distribution of distortions in the HFS. Prior studies focused either on spatial distributions of the SFS, which cannot distinguish between hard and soft sweeps, or only on local distortions in the HFS. Specifically, methods that explore skews in the SFS typically do so with an explicit analytical population-genetic model [Kim and Stephan, 2002, Nielsen et al., 2005, Huber et al., 2015, Vy and Kim, 2015, DeGiorgio et al., 2016]. Such methods can be underpowered if the assumed model is incorrect and are unable to detect soft sweeps [Pennings and Hermisson, 2006b]. In contrast, analytical population-genetic modeling of distortions in the HFS is difficult, and alternative statistical models that capture relevant features of sweeps are often used, focusing either on local distortions in the HFS [Harris and DeGiorgio, 2020] or on haplotype length distributions [Voight et al., 2006, Ferrer-Admetlla et al., 2014]. Our Λ statistic represents a compromise between these two extremes, permitting simultaneous interrogation of haplotype frequency distributions and correlates of their length distributions in a computationally efficient framework that leads to expected patterns that are informed by theoretical results. Our methodological framework therefore provides a foundation for developing tools that can identify other evolutionary processes that may act locally in the genome, enhancing future investigations of sweeps and other forces across a variety of study systems.
Methods
In this section we outline the methods used to assess the power of a diversity of sweep statistics using simulations. These simulations examine an array of model parameters, including sweep strength, age, and softness as well as the confounding effects of demographic history, background selection, and haplotype phasing. We also describe pre- and post-analysis processing for the application of the Λ statistic to phased haplotype and unphased multilocus genotype data from a pair of human populations.
Power Analysis
To assess the ability of Λ to detect sweeps, we conducted forward-time simulations using SLiM v3.2 [Haller and Messer, 2019] for sweeps of varying strength, age, and softness under a constant-size demographic history as well as under a realistic non-equilibrium demographic history inspired by human studies. Specifically, for each simulation scenario, we generated 1000 independent replicates of length 500 kb, so that Λ was able to interrogate the spatial distribution of variation across a large genomic segment. We employed a mutation rate of µ = 2.5 × 10⁻⁸ per site per generation [Nachman and Crowell, 2000] and a recombination rate of r = 10⁻⁸ per site per generation [Payseur and Nachman, 2000]. For the constant-size demographic history, we considered a population size of N = 10⁴ diploid individuals [Takahata, 1993], and to investigate complex non-equilibrium demographic histories, we also employed the model inferred in Terhorst et al. [2017] of central European humans (CEU), which incorporates a recent bottleneck with a severe population collapse followed by rapid population expansion. In particular, we used this non-equilibrium model as it was inferred by the contemporary method SMC++ [Terhorst et al., 2017], which attempts to fit model parameters that can recapitulate both haplotype diversity and allele frequency distributions [Beichman et al., 2017] observed in genomic data from the CEU population of the 1000 Genomes Project dataset [The 1000 Genomes Project Consortium, 2015].
In addition to these genetic and demographic parameters, for selection simulations, we modeled sweeps on ν ∈ {1, 2, 4, 8, 16} initially-selected haplotypes, where each of these haplotypes harbored a beneficial allele in the center of the simulated genomic segment with strength s ∈ {0.01, 0.1} per generation that immediately appeared and became beneficial at time t ∈ {500, 1000, 1500, 2000, 2500, 3000} generations prior to sampling. To ensure that a sweep signature had the potential to be uncovered (especially under settings with s = 0.01), we required that the beneficial allele establish in the population by reaching a frequency of 0.1. Simulation replicates in which the beneficial allele did not reach a frequency of 0.1 were repeated until the beneficial allele established in the population. All neutral and selection simulations were run for 11N generations, where the first 10N generations were used as burn-in and n = 50 diploid individuals were sampled from the population after 11N generations (i.e., the present). Because forward-time simulations are computationally intensive, as is commonly practiced [Yuan et al., 2012, Ruths and Nakhleh, 2013], we scaled all constant-size demographic history simulations by a factor λ = 10 and the European human history simulations by λ = 20, such that the selection coefficient, mutation rate, and recombination rate were multiplied by λ, and the population size at each generation and the total number of simulated generations were divided by λ. This scaling leads to a speedup of approximately λ² in computing time, such that the constant-size simulations run roughly 100 times faster than without scaling and the CEU model simulations run approximately 400 times faster, making a large-scale simulation study feasible.
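The parameter scaling described above can be illustrated with a short sketch. The function name `scale_parameters` and the dictionary keys are hypothetical and are not part of SLiM or lassip; the numerical values are those stated in the text for the constant-size model with s = 0.01.

```python
import math

def scale_parameters(N, mu, r, s, generations, lam):
    """Rescale forward-simulation parameters by a factor lam (illustrative helper).

    Rates (mutation, recombination, selection) are multiplied by lam;
    population size and the number of generations are divided by lam.
    """
    return {
        "N": N / lam,
        "mu": mu * lam,
        "r": r * lam,
        "s": s * lam,
        "generations": generations / lam,
        "approx_speedup": lam ** 2,  # approximate speedup stated in the text
    }

# Constant-size model from the text: N = 10^4, mu = 2.5e-8, r = 1e-8, 11N generations
N = 10_000
scaled = scale_parameters(N=N, mu=2.5e-8, r=1e-8, s=0.01, generations=11 * N, lam=10)
print(scaled)

# Population-scaled products such as 4*N*mu are left unchanged by the rescaling
theta_unscaled = 4 * N * 2.5e-8
theta_scaled = 4 * scaled["N"] * scaled["mu"]
print(math.isclose(theta_unscaled, theta_scaled))
```

Because the products of population size with the mutation rate, recombination rate, and selection coefficient are preserved, the scaled simulation is intended to approximate the unscaled one while simulating fewer individuals for fewer generations.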
When analyzing each simulated replicate, we compared the performance of Λ with that of the likelihood T statistic [Harris and DeGiorgio, 2020], which does not account for the spatial distribution of genomic variation, the summary statistic H12 [Garud et al., 2015], which was developed to detect hard and soft sweeps with similar power, and the standardized iHS [Voight et al., 2006] and nSL [Ferrer-Admetlla et al., 2014] methods, which summarize the lengths of haplotypes centered on core SNPs. To investigate the effect of window size on the relative powers of Λ, T, and H12, we considered their applications in central windows of 51, 101, and 201 SNPs, and analyzed windows every 25 SNPs across a simulated sequence. We chose SNP-delimited windows rather than windows based on physical length as they should be more robust to variation in recombination and mutation rate across the genome, as well as to random missing genomic segments due to poor mappability, alignability, or sequence quality. That is, we expect SNP-delimited windows to be more conservative than windows based on the physical length of an analyzed genomic segment. We also examined the application of Λ, T, and G123 [analogue of H12; Harris et al., 2018] to unphased multilocus genotype input data to evaluate the relative powers of these three approaches when applied to study systems for which obtaining phased haplotypes is difficult, unreliable, or impossible [Mallick et al., 2009]. We used the lassip software released with this article to compute the saltiLASSI Λ statistic, the LASSI T statistic, and H12 (and G123), and the selscan software [Szpiech and Hernandez, 2014] to compute standardized iHS and nSL.
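To make the notion of SNP-delimited windows concrete, the following minimal sketch computes H12 in sliding windows defined by SNP count rather than physical length. It is not the lassip implementation; the function names and the toy data are illustrative assumptions, and haplotypes are represented simply as strings of 0/1 alleles. H12 is taken here as the squared sum of the two largest haplotype frequencies plus the sum of squares of the remaining frequencies, following Garud et al. [2015].

```python
from collections import Counter

def haplotype_frequencies(haplotypes):
    """Return haplotype frequencies sorted from most to least common.

    `haplotypes` is a list of equal-length strings, one per sampled chromosome.
    """
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return sorted((c / n for c in counts.values()), reverse=True)

def h12(freqs):
    """H12: pool the two most common haplotypes before computing homozygosity."""
    if len(freqs) < 2:
        return sum(freqs) ** 2
    return (freqs[0] + freqs[1]) ** 2 + sum(p * p for p in freqs[2:])

def snp_windows(n_snps, winsize, step):
    """Yield (start, end) SNP-index pairs for windows of `winsize` SNPs."""
    for start in range(0, n_snps - winsize + 1, step):
        yield start, start + winsize

def scan_h12(haplotypes, winsize, step):
    """Compute H12 in each SNP-delimited window along the sequence."""
    n_snps = len(haplotypes[0])
    results = []
    for start, end in snp_windows(n_snps, winsize, step):
        window = [h[start:end] for h in haplotypes]
        results.append((start, end, h12(haplotype_frequencies(window))))
    return results

# Toy data (hypothetical): 6 chromosomes, 6 SNPs
toy = ["000000", "000000", "000000", "000011", "110011", "111111"]
scan = scan_h12(toy, winsize=4, step=2)
for start, end, value in scan:
    print(start, end, round(value, 4))
```

In the analyses described above, the corresponding settings would be windows of 51, 101, or 201 SNPs with a step of 25 SNPs; the toy values here are chosen only to keep the example small.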
Analysis of 1000 Genomes Data
We extracted the phased genomes of CEU (99 diploids) and YRI (108 diploids) populations, separately, from the full 1000 Genomes Project Phase 3 dataset (2504 diploids) [The 1000 Genomes Project Consortium, 2015]. For each population, we retained only autosomal biallelic SNPs that were polymorphic in the sample. In order to avoid potentially spurious signals, we also filtered any regions with poor mappability as indicated by mean CRG100 < 0.9 [Derrien et al., 2012, Huber et al., 2015]. This left 12,400,078 SNPs in CEU and 20,417,698 SNPs in YRI.
We compute saltiLASSI Λ statistics for both phased (haplotype-based) and unphased (multilocus-genotype-based) analyses with lassip. We set --winsize 201 and --winstep 100, and we choose --k 20 to use the ranked HFS for the top 20 most frequent haplotypes. By default, lassip assumes phased data and computes haplotype-based statistics; when the --unphased flag is set, all statistics are computed using multilocus genotypes.
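As an illustration of what a ranked HFS truncated to the K most frequent haplotypes looks like, the sketch below counts distinct haplotypes in a single window and keeps the top K frequencies. This is not the lassip code: the function name `truncated_hfs` is hypothetical, and the optional renormalization of the truncated spectrum so that it sums to one is an assumption of this sketch.

```python
from collections import Counter

def truncated_hfs(haplotypes, k, normalize=True):
    """Ranked haplotype frequency spectrum truncated to the k most frequent classes.

    `haplotypes` is a list of hashable items (e.g., strings), one per sampled
    chromosome, all taken from the same SNP window.
    """
    counts = Counter(haplotypes)
    n = len(haplotypes)
    # Sort frequencies in decreasing order and keep the top k
    freqs = sorted((c / n for c in counts.values()), reverse=True)[:k]
    # Pad with zeros if fewer than k distinct haplotypes were observed
    freqs += [0.0] * (k - len(freqs))
    if normalize:
        # Assumption of this sketch: rescale so the truncated spectrum sums to 1
        total = sum(freqs)
        freqs = [p / total for p in freqs]
    return freqs

# Toy window (hypothetical): 10 chromosomes carrying 4 distinct haplotypes
window = ["A"] * 5 + ["B"] * 3 + ["C"] + ["D"]
spec_top2 = truncated_hfs(window, k=2)
spec_padded = truncated_hfs(window, k=6, normalize=False)
print(spec_top2)
print(spec_padded)
```

The same counting would apply to unphased data if each element encoded one individual's multilocus genotype across the window instead of one haplotype.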
To determine significance thresholds, we simulated neutral whole genomes with a realistic recombination map and demographic history using stdpopsim [Adrion et al., 2020] and msprime [Kelleher et al., 2016]. Using the OutOfAfrica_2T12 demographic history [Tennessen et al., 2012] and the HapMapII_GRCh37 genetic map [Consortium, 2007] in stdpopsim, we simulate 100 replicates of all 22 autosomes for each population separately, sampling 99 diploid individuals for CEU simulations and 108 diploid individuals for YRI simulations. For each replicate, we then compute saltiLASSI Λ statistics for both phased and unphased analyses with lassip, setting --winsize 201, --winstep 100, and --k 20. We then compute the max Λ, the top-0.1% Λ, and the top-1% Λ across all replicates for each population and each analysis (phased/unphased), which are given in Table S1. Putatively selected regions were identified by concatenating consecutive windows with Λ greater than the max observed across all simulations for a given population and analysis (phased/unphased).
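The thresholding and window-concatenation steps can be sketched as follows. This is a minimal illustration rather than the pipeline used for the analysis: the function names, the tuple layout of the windows, and the exact convention for the top-0.1% and top-1% cutoffs (the smallest value among the top fraction of pooled neutral statistics) are assumptions.

```python
def neutral_thresholds(neutral_values):
    """Return max, top-0.1%, and top-1% cutoffs from pooled neutral statistics."""
    ordered = sorted(neutral_values, reverse=True)
    n = len(ordered)

    def top_fraction(frac):
        # Smallest value among the top `frac` of the pooled distribution
        idx = max(0, round(frac * n) - 1)
        return ordered[idx]

    return {
        "max": ordered[0],
        "top_0.1%": top_fraction(0.001),
        "top_1%": top_fraction(0.01),
    }

def merge_significant_windows(windows, threshold):
    """Concatenate consecutive windows whose statistic exceeds `threshold`.

    `windows` is a list of (start, end, value) tuples in genomic order
    for a single chromosome.
    """
    regions = []
    current = None
    for start, end, value in windows:
        if value > threshold:
            if current is None:
                current = [start, end]   # open a new region
            else:
                current[1] = end         # extend the open region
        elif current is not None:
            regions.append(tuple(current))
            current = None
    if current is not None:
        regions.append(tuple(current))
    return regions

# Toy neutral distribution (hypothetical values 1..1000)
cutoffs = neutral_thresholds([float(v) for v in range(1, 1001)])

# Toy empirical windows (hypothetical coordinates and statistic values)
empirical = [(0, 200, 5.0), (100, 300, 12.0), (200, 400, 15.0),
             (300, 500, 3.0), (400, 600, 20.0)]
regions = merge_significant_windows(empirical, threshold=10.0)
print(cutoffs)
print(regions)
```

Using the maximum over all neutral replicates as the threshold, as in the text, is the most conservative of the three cutoffs.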
Supplementary material
Acknowledgments
This work was supported by National Institutes of Health grant R35GM128590, by National Science Foundation grants DEB-1949268 and BCS-2001063, and by Pennsylvania State University startup funds. Computations for this research were performed using the services provided by Research Computing at the Florida Atlantic University and using the Pennsylvania State University’s Institute for Computational Data Sciences’ Roar supercomputer.