ABSTRACT
What is the genetic architecture of local adaptation and what is the geographic scale that it operates over? We investigated patterns of local and convergent adaptation in five sympatric population pairs of traditionally cultivated maize and its wild relative teosinte (Zea mays subsp. parviglumis). We found that signatures of local adaptation based on the inference of adaptive fixations and selective sweeps are frequently exclusive to individual populations, more so in teosinte compared to maize. However, for both maize and teosinte, selective sweeps are frequently shared by several populations, and often between the subspecies. We were further able to infer that selective sweeps were shared among populations most often via migration, though sharing via standing variation was also common. Our analyses suggest that teosinte has been a continued source of beneficial alleles for maize, post domestication, and that maize populations have facilitated adaptation in teosinte by moving beneficial alleles across the landscape. Taken together, our results suggest local adaptation in maize and teosinte has an intermediate geographic scale, one that is larger than individual populations, but smaller than the species range.
Introduction
As populations diverge and occupy new regions, they become locally adapted to the novel ecological conditions that they encounter. Decades of empirical work have carefully documented evidence for local adaptation, including the use of common garden and reciprocal transplant studies demonstrating that populations express higher fitness in their home environment (Clausen et al. 1948) as well as quantitative genetic approaches that show selection has acted on individual traits to make organisms better suited to their ecological conditions (Savolainen et al. 2013). It is clear from these studies that local adaptation is pervasive in natural populations.
One important but understudied aspect of local adaptation is its geographic scale. Empirical studies have documented adaptation at multiple scales, from microgeographic differentiation among mesic and xeric habitats along a single hillside (Hamrick and Allard 1972) to regional (Lowry et al. 2008; Whitehead et al. 2011) and even global scales (Colosimo et al. 2005). A key factor determining the geographic scale of local adaptation is the distribution of the biotic and abiotic challenges to which organisms are adapting, as these features place limits on the locations over which an allele remains beneficial. Environmental features overlap with each other to varying degrees (Tuanmu and Jetz 2015). The degree of overlap between environmental features may be important if mutations are pleiotropic, as an allele may not be beneficial when integrating its effect over multiple selective pressures (Chevin et al. 2010).
The geographic scale of local adaptation depends, too, on population structure. Gossmann et al. (2010) showed that the estimates of the proportion of new mutations fixed by natural selection across a number of plant species tended to overlap with zero, suggesting there is little evidence for adaptation at non-synonymous sites. One potential explanation raised by the authors for this surprising finding was that natural populations are often structured, such that very few adaptations would be expected to be common over the entire range of the species’ distribution. Indeed, even when selective pressures are shared across populations, structure can hinder a species’ adaptation by limiting the spread of beneficial alleles across the species range (Bourne et al. 2014). Consistent with this, Fournier-Level et al. (2011) conducted a continent-scale survey across strongly structured populations of Arabidopsis thaliana, finding that alleles which increase fitness tended to occur over a restricted geographic scale. Despite the progress made, it remains unclear if the scales identified in Arabidopsis are common to local adaptation across the entire genome, and how similar the general patterns are across taxa.
The majority of local adaptation studies are motivated by conspicuous differences in the phenotypes or environments of two or more populations. As such, many instances of local adaptation that are occurring, as well as the underlying beneficial mutations being selected, may go overlooked. This hinders our ability to draw more general conclusions about the overall frequency and impact of local adaptation on a given population’s evolutionary history. Efforts to more systematically survey evidence of local adaptation have led to “reverse ecology” approaches, where signatures of adaptive evolution are first identified from genomic sequencing data, and are then related to the traits, history, or environmental conditions of the populations in which they occur (Li et al. 2008; Levy and Borenstein 2012). Given their ability to survey numerous loci and populations simultaneously, reverse ecology and other population genetic approaches are well-suited to investigating the geographic scale of local adaptation. While the use of genome-scale reverse ecology approaches is increasing (Hoban et al. 2016; Bragg et al. 2015), sampling across multiple populations and explicit evaluation of the scale of local adaptation are still uncommon.
Using reverse ecology, we can compare the observed distribution of beneficial alleles across multiple populations. The patterns of selective sweeps and adaptive fixations that are exclusive to or shared among multiple populations can be used to measure a beneficial allele’s geographic extent, which is influenced by the factors outlined above. If we infer multiple structured populations have fixed the same beneficial allele, this suggests that pleiotropy has not disrupted the adaptive value of the allele across environments or that the populations share a sufficiently similar set of selective pressures. Assessment of the relative frequency and geographic extent of unique and shared beneficial alleles thus allows us to quantify the scale of local adaptation. Additionally, when multiple populations do share an adaptive allele, we can infer the mode by which sharing occurred (Lee and Coop 2017), providing further insights about the environmental and genetic context of each adaptation as well as the processes underlying allele sharing among populations.
Motivated to improve our understanding of the genetic basis of local adaptation and its geographic scale, we set out to use the reverse ecology approach to understand patterns of adaptation via selective sweeps in multiple discrete populations of domesticated maize Zea mays ssp. mays and its wild relative teosinte Zea mays ssp. parviglumis growing across their native range in Mexico. Zea mays is an annual grass, native to southern Mexico. Maize was domesticated ≈ 9,000 years ago (Piperno et al. 2009) from its common ancestor with the extant annual grass teosinte, but traditional open-pollinated populations maintain an extremely large population size and a surprising amount of diversity (Bellon et al. 2018). Maize is also the world’s most productive crop (Ranum et al. 2014), and an important model system for genetics (Nannas and Dawe 2015).
Previous work in both maize and teosinte has demonstrated clear population structure at both regional (Pyhäjärvi et al. 2013) and fine (Van Heerwaarden et al. 2010) scales, and population genetic and common garden studies in both subspecies have shown clear signatures of populations being adapted to their ecological conditions. In maize this includes local adaptation to high elevation (Fustier et al. 2019; Gates et al. 2019), phosphorus (Rodriguez-Zapata et al. 2021), temperature (Butler and Huybers 2013), and day length (Swarts et al. 2017). Similarly, studies of teosinte have documented local adaptation based on features such as the differential patterns of microbial community recruitment (O’Brien et al. 2019), elevation (Fustier et al. 2019, 2017), and temperature and phosphorus (Aguirre-Liguori et al. 2019).
Studying local adaptation of maize and teosinte across the same geographic locations presents opportunities to disentangle multiple processes that interact with adaptation. For example, the effect of the domestication process in maize populations and their ongoing interaction and dependence on humans has created changes in the timing and types of selection imposed across all populations, as well as changes in demography (Wright et al. 2005). Based on population structure and differences in the abiotic environment among populations, we anticipated that local adaptation would have a small geographic scale. We predicted that sweeps would be exclusive to individual populations, and that adaptations shared between subspecies would be limited to sympatric pairs of populations growing in similar environments and with ample opportunity for genetic exchange. Because of domestication and the ongoing migration facilitated by humans, we expected that maize would show more shared adaptations, leading to a relatively larger geographic scale. Contrary to our predictions, our results suggest adaptations are often shared across two or more populations, and are commonly shared between maize and teosinte, rather than being exclusive to individual populations. We also found that migration and standing variation have played an important role as a source of beneficial alleles, including many that are shared across the two subspecies.
Results
We sampled teosinte (Zea mays subsp. parviglumis) individuals from six locations across its native range, along with a nearby (sympatric) population of traditionally cultivated open-pollinated maize (commonly referred to as landraces) at five of these locations (Figure 1C). We sampled ten individuals from each population for each subspecies, with the exception of the Palmar Chico populations, where we took advantage of 55 and 50 individuals previously sampled for maize and teosinte, respectively (Table S1, Supplement I, Chen et al. 2020). In most cases, results for both Palmar Chico populations were down-sampled to ten randomly selected individuals to facilitate comparisons to the other populations. However, we used a second random sample to assess false positives in our inference of selective sweeps (see Results and Methods).
To further evaluate patterns of local versus geographically widespread adaptation, we compared many results of our geographically well-defined populations to results found over the entire study range (referred to as “rangewide” throughout). Rangewide samples for each subspecies were constructed by randomly selecting one individual from each population. All 195 individuals were sequenced at 20 to 25x coverage and aligned to version 5 of the Zea mays B73 reference genome (Hufford et al. 2021; Portwood et al. 2019). Analyses were based on genotype likelihoods (Korneliussen et al. 2014) except in cases where called genotypes were required (see Methods).
Subspecies and populations are genetically distinct despite evidence of gene flow
To assess the relationships among our sampled populations, we constructed a population-level phylogeny using treemix (Pickrell and Pritchard 2012 v.1.13). As anticipated from previous work (Buckler and Holtsford 1996; Hufford et al. 2012), we found clear divergence between two clades composed of maize and teosinte populations, though the relationship among geographic locations differed between the subspecies clades (Figure 1B).
Within subspecies, populations were genetically distinct from one another. Using NGSadmix (Skotte et al. 2013), there was little evidence of admixture between populations of the same subspecies; only two of the sampled individuals revealed mixed population ancestry (Figure 1A and 1D).
Despite the clear phylogenetic separation between the two subspecies, there is evidence for gene flow between maize and teosinte populations. We conducted f4 tests using treemix (Pickrell and Pritchard 2012), and found that all populations showed some evidence of gene flow with various populations of the other subspecies, as measured by the high absolute Z-Scores of the f4 statistic. However, Z-Scores were sensitive to the specific combinations of non-focal populations included in each test (Table S2). More specifically, we found that elevated f4 tests almost always included the maize population from Crucero Lagunitas (p < 2 × 10⁻¹⁰), which was true whether or not the f4 test included its sympatric teosinte. More generally, there is little evidence for increased gene flow between sympatric pairs (Figure 1E).
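The f4 statistic and its Z-score can be illustrated with a minimal sketch. It assumes per-site allele frequencies are already available for four populations and uses a delete-one block jackknife for the standard error. The function names and the block size are illustrative assumptions, and treemix's own implementation may differ in detail.

```python
from math import sqrt


def f4(pa, pb, pc, pd):
    """Mean over sites of (pA - pB) * (pC - pD)."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]
    return sum(products) / len(products)


def f4_z_score(pa, pb, pc, pd, block_size):
    """Return (f4, standard error, Z) using a delete-one block jackknife.

    Assumption: sites are ordered along the genome and split into
    equal-sized blocks; a trailing partial block is dropped.
    """
    products = [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]
    n_blocks = len(products) // block_size
    if n_blocks < 2:
        raise ValueError("need at least two complete blocks")
    products = products[: n_blocks * block_size]
    total = sum(products)
    estimate = total / len(products)

    # Recompute the statistic with each block left out in turn.
    leave_one_out = []
    for j in range(n_blocks):
        block = products[j * block_size:(j + 1) * block_size]
        leave_one_out.append((total - sum(block)) / (len(products) - block_size))

    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (x - mean_loo) ** 2 for x in leave_one_out
    )
    se = sqrt(variance)
    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z
```

A large absolute Z indicates that the allele frequency differences in the two population pairs are correlated, which is the signal interpreted above as gene flow.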
Populations vary in their diversity, demography, and history of inbreeding
We estimated pairwise nucleotide diversity (π) and Tajima’s D in non-overlapping 100 kb windows along the genome in our sampled populations using ANGSD (Korneliussen et al. 2014). For all populations, π was in the range of 0.006 to 0.01, consistent with both previous Sanger (Wright et al. 2005) and short-read (Hufford et al. 2012) estimates for both subspecies. Variation in Tajima’s D and π was greater among populations of teosinte than maize (Figure 2D, Table S2).
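For readers unfamiliar with the two summary statistics, the following sketch computes π and Tajima's D for a single window from called derived-allele counts. This is a simplification: ANGSD works from genotype likelihoods rather than called counts, and the function names and input format here are assumptions made for illustration.

```python
from math import sqrt


def pairwise_diversity(derived_counts, n):
    """Sum over segregating sites of the expected pairwise differences.

    derived_counts: derived-allele count k (0 < k < n) at each segregating site.
    n: number of sampled chromosomes (e.g. 20 for ten diploid individuals).
    Divide by the window length in base pairs to obtain per-site pi.
    """
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in derived_counts)


def tajimas_d(derived_counts, n):
    """Tajima's D for one window, using the standard variance constants."""
    s = len(derived_counts)
    if s == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = pairwise_diversity(derived_counts, n)
    theta_w = s / a1  # Watterson's estimator
    return (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of rare variants (for example after population expansion or a sweep) pushes D below zero, while an excess of intermediate-frequency variants pushes it above zero.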
We independently estimated the demographic history for each population from their respective site frequency spectra using mushi (DeWitt et al. 2021 v0.2.0). All histories estimated a bottleneck that started approximately 10 thousand generations ago, assuming a mutation rate of 3 × 10⁻⁸ (Clark et al. 2005) (Figure 2E).
Teosinte is a primarily outcrossing grass (Hufford 2010), and regional maize farming practices promote outcrossing as well (Bellon et al. 2018). To validate our estimated demography and characterize the history of inbreeding in each population, we compared the empirical quantiles of homozygosity by descent (HBD) segments inferred using IBDseq (Browning and Browning 2013) to those simulated under the demography of each population. With the exception of the smallest HBD segments, which are more prone to inaccurate estimation, the simulated quantiles generally resemble the empirical quantiles (Figure 2D). This indicates that the inbreeding history of our populations is adequately captured by the demography. However, consistent with previous studies of teosinte (Hufford 2010), we do see variation in the distributions of HBD among populations. For example, the sizes of HBD segments in San Lorenzo and Los Guajes were consistently larger than those simulated from their demographies, particularly for the smallest segments. This likely reflects inbreeding caused by demographic changes, particularly those further in the past that were not detected in our demography inferences. These results are consistent with previous studies that found evidence for historical inbreeding in teosinte, particularly in individuals sampled from San Lorenzo (Pyhäjärvi et al. 2013). Lastly, we estimated average pairwise inbreeding coefficients (F) using ngsRelate (Hanghøj et al. 2019). Although for some pairs of individuals F was as high as 0.37, the mean value of F was 0.017 ± 0.001 (SE) and 0.033 ± 0.001 (SE) for maize and teosinte, respectively (Figure 2E), suggesting there has been relatively little inbreeding in either subspecies in the recent past.
Linked selection has a larger effect on diversity in maize
Census size is known to vary widely among maize and teosinte populations (Hufford 2010; Wilkes et al. 1967), yet genomewide pairwise sequence diversity (π) was relatively similar across our studied populations (Figure 2). Under strictly neutral models, census size differences should translate into differences in genetic diversity. Widespread selection at linked sites, however, can attenuate correlations between census size and diversity (Lewontin et al. 1974; Corbett-Detig et al. 2015; Buffalo 2021), and previous work has demonstrated differential impacts of linked selection on diversity in maize and teosinte (Beissinger et al. 2016).
In an effort to better understand the forces shaping variation among our studied populations, and to be able to better assess the reliability of our demographic inferences (which ignored linked selection), we fit a model of linked selection to predict π from estimates of recombination rate and the density of functional base pairs. As recombination rate increases, neutral variants are able to disassociate from nearby selected variants. Combined with gene annotation information, this signature allowed us to estimate what the population scaled mutation rate (θ = 4Neμ) would be in the absence of linked selection (Coop and Ralph 2012; Corbett-Detig et al. 2015).
We used AIC (Wagenmakers and Farrell 2004) to select between different models that predicted neutral θ, finding that the full linked selection model, which includes background selection and hitchhiking parameters, was favored for all populations. In all populations, the estimated confidence intervals of neutral θ from the full model did not overlap those of the intercept-only model, which provides an estimate of the average genomewide value of π (Figure 3B). The two estimates of θ are relatively similar for all teosinte populations, while the difference was consistently larger in maize, suggesting linked selection has played a more important role in shaping maize diversity (Figure 3B). Similarly, the expected value of π approaches the estimated neutral diversity at a lower recombination rate in teosinte (Figure 3A). Specifically, averaged across populations, π in teosinte reached 95% of the neutral θ estimate at a minimum recombination rate of 0.01 cM/Mb ± 0.0006 (SE) compared to 1.04 cM/Mb ± 0.05 (SE) in maize. Relatedly, in maize, approximately 92% ± 0.2% (SE) of the available genomic windows fell below the recombination rate minimum described above, compared to 12% ± 0.9% in teosinte. Previous inferences of linked selection in maize and teosinte found the reverse pattern, with a strong impact of linked selection in teosinte (Beissinger et al. 2016). One potential explanation for this discrepancy is that previous analyses did not directly estimate θ, but instead assumed that levels of π distant from coding sequences were reflective of the true neutral diversity. This approach would underestimate the effects of linked selection we observed in maize, where the majority of the genome has levels of π below θ. Overall, our results indicate that linked selection has played a larger role in shaping diversity for maize than teosinte populations.
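The model comparison step can be sketched as follows. The log-likelihoods, parameter counts, and model names below are hypothetical placeholders chosen only to show the calculation; they are not values from this study.

```python
from math import exp


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood


def akaike_weights(aic_values):
    """Convert AIC values into relative weights that sum to one."""
    best = min(aic_values.values())
    rel = {name: exp(-0.5 * (value - best)) for name, value in aic_values.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}


# Hypothetical log-likelihoods and parameter counts, for illustration only.
fits = {
    "intercept_only": (-1250.0, 1),
    "background_selection": (-1190.0, 3),
    "full_linked_selection": (-1175.0, 5),
}
scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
weights = akaike_weights(scores)
favored = min(scores, key=scores.get)
```

The extra parameters of the fuller model are only favored when the gain in log-likelihood outweighs the penalty of two AIC units per parameter.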
Rangewide estimates of the proportion of mutations fixed by natural selection (α) are commensurate with that of individual populations
If populations are relatively isolated and adaptation occurs primarily via local selective sweeps, then we expect that most adaptive fixations will happen locally in individual populations rather than across the entire species range. If adaptation via sweeps is commonly restricted to individual populations, using a broad geographic sample to represent a population could underestimate the number of adaptive substitutions that occur (Gossmann et al. 2010). To test this, we estimated the proportion of mutations fixed by adaptive evolution (α) (Smith and Eyre-Walker 2002) across all of our populations and the rangewide samples for both subspecies. We estimated α jointly among all populations by fitting a non-linear mixed-effect model based on the asymptotic McDonald–Kreitman test (Messer and Petrov 2013). Across populations, α varied between 0.097 (teosinte San Lorenzo) and 0.282 (teosinte Palmar Chico), with more variation among teosinte populations (Figure 3C). In contrast to our expectations, rangewide estimates of α were commensurate with individual populations. We additionally evaluated estimates of α for specific mutation types, which has been shown to be lower at sites mutating from A/T to G/C, due to the effects of GC biased gene conversion in Arabidopsis (Hämälä and Tiffin 2020). While we do find some evidence that α predictions varied by mutation type (see Figure S3, Supplement IV), the patterns are the opposite of that found in Arabidopsis, perhaps because of the increased level of methylation in maize and the higher mutation rate at methylated cytosines. Even after accounting for differences among mutation types, rangewide values remained commensurate with that of the populations.
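The logic of the asymptotic McDonald–Kreitman estimate can be sketched for a single population. The sketch computes α(x) in each derived-allele frequency class, fits α(x) = a + b·exp(−c·x), and extrapolates to x = 1. It is a simplification under stated assumptions: the analysis above fits a joint non-linear mixed-effect model across populations, whereas this version uses a grid search over c with ordinary least squares for a and b, and all input names are hypothetical.

```python
from math import exp


def alpha_by_frequency(d_n, d_s, p_n, p_s):
    """alpha(x) = 1 - (d_s / d_n) * (p_n(x) / p_s(x)) for each frequency class.

    d_n, d_s: nonsynonymous and synonymous substitution counts.
    p_n, p_s: nonsynonymous and synonymous polymorphism counts per class.
    """
    return [1.0 - (d_s / d_n) * (pn / ps) for pn, ps in zip(p_n, p_s)]


def asymptotic_alpha(freqs, alphas, c_grid):
    """Fit alpha(x) = a + b * exp(-c * x) and return the value at x = 1."""
    n = len(freqs)
    best = None
    for c in c_grid:
        z = [exp(-c * x) for x in freqs]
        z_bar = sum(z) / n
        a_bar = sum(alphas) / n
        s_zz = sum((zi - z_bar) ** 2 for zi in z)
        if s_zz == 0:
            continue
        # Ordinary least squares for a and b given this value of c.
        b = sum((zi - z_bar) * (ai - a_bar) for zi, ai in zip(z, alphas)) / s_zz
        a = a_bar - b * z_bar
        sse = sum((ai - (a + b * zi)) ** 2 for zi, ai in zip(z, alphas))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    if best is None:
        raise ValueError("no usable value of c in c_grid")
    _, a, b, c = best
    return a + b * exp(-c)
```

Extrapolating to x = 1 reduces the downward bias that slightly deleterious polymorphisms at low frequency introduce into the classical estimate.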
The proportion of mutations fixed by natural selection (α) is associated with neutral diversity in teosinte
The variable demographic histories of our populations may have affected the process of adaptive evolution. Under a neutral mode of evolution, population size and selection have no impact on the rate of substitutions, which are predicted to be equal to the mutation rate. When adaptive mutations are introduced, theory predicts that the rate of substitutions depends on the mutation rate, effective population size, and the strength of selection (Gillespie 2004). It follows that populations with larger long-term effective population sizes should have a higher proportion of substitutions that are adaptive.
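The contrast between the two substitution rates can be made concrete with a short sketch. It assumes a diploid population, additive beneficial mutations with heterozygous advantage s, and the classical approximation that a new beneficial mutation fixes with probability of about 2s when s is small and 2Ns is much greater than one. The function names are illustrative.

```python
def neutral_substitution_rate(n_e, mu):
    """2N new mutations per generation, each fixing with probability 1/(2N)."""
    return 2.0 * n_e * mu * (1.0 / (2.0 * n_e))


def adaptive_substitution_rate(n_e, mu_beneficial, s):
    """2N new beneficial mutations per generation, each fixing with
    probability of roughly 2s (an approximation for small s, 2Ns >> 1)."""
    return 2.0 * n_e * mu_beneficial * 2.0 * s
```

The population size cancels in the neutral case but not in the adaptive case, which is why larger populations are expected to accumulate a higher proportion of adaptive substitutions.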
Motivated by theory, we used simulations to confirm that α increases with population size, and that the relationship is robust to variation in recent demographic changes and the distribution of fitness effects (Table S3 and Figure S2). We then investigated the empirical relationship between θ and α in our populations by testing for an association between the two separately for each subspecies. We used the mean α predictions and associated them with the values of neutral θ estimated from the full linked selection model. We found that θ and α are strongly correlated for teosinte populations (r = 0.97, p < 3.6 × 10⁻⁹), but not for maize (r = 0.09, p = 0.77) (Figure 3).
Teosinte populations have a higher proportion of private sweeps
Our inferences of α are based on substitutions at non-synonymous sites. The functional space for selection to act on occurs over many other parts of the genome besides protein coding bases, especially for large repetitive plant genomes (Mei et al. 2018). To identify signatures of adaptation occurring anywhere in the genome, we used RAiSD (Alachiotis and Pavlidis 2018) to identify putative selective sweeps in each population, where sweeps were selected using an outlier cutoff defined by the 99.9th percentile of the μ summary statistic for data generated using coalescent simulations under each population’s estimated demography (see Methods). Simulations suggest this approach has high accuracy and power compared to alternative methods over a broad range of demographic and sweep scenarios (Alachiotis and Pavlidis 2018). To further assess the accuracy of our sweep inferences, we compared the overlap in sweep regions from a second random sample from both maize and teosinte from Palmar Chico. We estimated the false positive rate to be 0.6 and 0.5 for maize and teosinte, respectively, suggesting many of the putative sweep regions in other populations are also false positives, and that the total number of sweeps we have identified is likely an overestimate. Approximately 10% of all base pairs fell within a sweep region found in one or more of the sampled populations, although the true proportion is likely lower based on our estimated false positive rate. The density of coding sequence was lower in sweep regions (0.029 coding sequence base pairs/sweep base pairs) than the genome-wide gene density (0.042 coding sequence base pairs/total base pairs).
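The outlier cutoff can be sketched as below. The sketch assumes the μ values from neutral coalescent simulations and from the observed data are already available as plain lists; it does not call RAiSD or run any simulation, and the function names are illustrative.

```python
from statistics import quantiles


def sweep_cutoff(simulated_mu):
    """99.9th percentile of the statistic under neutral simulations."""
    # quantiles(..., n=1000) returns 999 cut points; the last one is
    # the 99.9th percentile of the simulated null distribution.
    return quantiles(simulated_mu, n=1000)[-1]


def flag_outliers(observed_mu, cutoff):
    """Indices of observed windows whose statistic exceeds the cutoff."""
    return [i for i, value in enumerate(observed_mu) if value > cutoff]
```

Because the null distribution is simulated under each population's own demography, the cutoff differs among populations even though the percentile is the same.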
We used the inferred sweep regions to assess the degree to which adaptation is shared or locally restricted. We determined how many sweep regions were exclusive to one population (private), along with the number and size of overlapping sweep regions shared across two or more populations within and between the two subspecies. Overall, sharing was common, though fewer sweeps were exclusively shared between teosinte populations (Figure 4A). In teosinte populations, 59% of sweeps were private, which was significantly greater than the 40% found in maize (binomial glm, p = 2.63 × 10⁻⁷; Figure 4B). We saw a general pattern that larger sweep regions tended to be shared between more populations, though the pattern was negligible for sweeps exclusive to teosinte (Figure 4D). Additionally, 31 out of 1550 sweep regions were exclusive to and shared by all maize populations, compared to 0 out of 660 for teosinte.
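One way to classify sweep regions as private or shared is sketched below. It assumes sweeps are given as (start, end) intervals on a single chromosome and treats any overlap, including chains of overlaps, as sharing. That merging rule and the population names in the usage example are assumptions for illustration and may differ from the procedure used in this study.

```python
def merge_sweeps(sweeps_by_pop):
    """Merge overlapping sweep intervals across populations.

    sweeps_by_pop maps a population name to a list of (start, end) intervals.
    Returns a list of (start, end, populations) tuples, one per merged region.
    """
    intervals = sorted(
        (start, end, pop)
        for pop, regions in sweeps_by_pop.items()
        for start, end in regions
    )
    merged = []
    for start, end, pop in intervals:
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, pops = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), pops | {pop})
        else:
            merged.append((start, end, {pop}))
    return merged


def proportion_private(merged):
    """Fraction of merged regions found in exactly one population."""
    private = sum(1 for _, _, pops in merged if len(pops) == 1)
    return private / len(merged)


# Hypothetical intervals for three populations.
example = {
    "maize_A": [(100, 200), (500, 600)],
    "teosinte_A": [(150, 250)],
    "teosinte_B": [(900, 950)],
}
regions = merge_sweeps(example)
```

In this example the first region is shared between one maize and one teosinte population, and the other two regions are private.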
Sympatric population pairs do not share more sweeps
If local adaptation favors certain alleles in a given environment, we might expect to see increased sharing of sweeps between sympatric populations of maize and teosinte. To look for evidence of such higher sharing, we used a hypergeometric test based on the number of sweeps in each population, and the number of shared sweeps between population pairs, which allowed us to test if sympatric population pairs tended to have more sharing than expected by chance. In conducting this test, we incorporated the estimated false positive rate (see Methods). Sympatric pairs did not tend to have a lower p value than allopatric pairs, and only the population pair in Palmar Chico showed more sharing than expected by chance (Figure 4). We additionally found that, despite evidence that sweeps are commonly shared between maize and teosinte (Figure 4C), there were zero sweeps exclusive to sympatric pairs; sweeps that were shared between sympatric pairs always included at least one other allopatric population.
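The core of the hypergeometric test can be sketched as an upper-tail probability. This sketch does not include the false positive rate adjustment described in the Methods, and the total number of candidate regions (`n_regions`) is a hypothetical input that must be chosen by the analyst.

```python
from math import comb


def hypergeom_upper_tail(n_regions, sweeps_a, sweeps_b, shared):
    """P(X >= shared) when population B's sweeps are placed at random.

    n_regions: total number of regions that could contain a sweep.
    sweeps_a, sweeps_b: number of sweep regions in populations A and B.
    shared: observed number of regions that are sweeps in both.
    """
    upper = min(sweeps_a, sweeps_b)
    total = comb(n_regions, sweeps_b)
    tail = sum(
        comb(sweeps_a, k) * comb(n_regions - sweeps_a, sweeps_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / total
```

A small probability for a population pair indicates more shared sweeps than expected if sweep locations were independent between the two populations.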
Convergent adaptation from migration is common among maize and teosinte populations
In instances when two or more populations shared a sweep region, we used rdmc (Lee and Coop 2017; Tittes 2020) to infer the most likely mode of convergence. We classified sweeps based on which composite log-likelihood model was greatest out of four possible models of convergence (independent mutations, migration, neutral, and standing variation). Of the 1734 sweeps that were shared by two or more populations, there were 17, 663, 627, and 427 sweeps inferred to be convergent via independent mutations, migration, neutral, and standing variation, respectively (Figure 5C). While the high proportion of neutral models inferred by rdmc is consistent with a relatively high false positive rate, confirmation of non-neutral patterns of diversity from this second approach increases our confidence that regions identified for other modes indeed represent sweeps. The strength of support (measured as the composite likelihood score of the best model relative to the next best) varied among sweeps and modes of convergence, but in general a single model tended to be clearly favored among the alternatives (Figure 5A). Selection coefficients for sweeps varied among modes, with convergence via migration having the highest average estimate (Figure 5B). When migration was the mode of convergence and sweeps were shared by both subspecies, teosinte Palmar Chico and maize from Crucero Lagunitas were the most frequent source populations (Figure 5D). In convergence models with migration, we considered only a low and high migration rate, set to 0.001 and 0.1, respectively. The most likely migration rate varied across sweeps, but the migration rate was high for a majority of sweeps shared by the two subspecies and sweeps exclusive to maize. In contrast, the low migration rate was always most likely for sweeps exclusive to teosinte (Figure 5F). 
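The classification rule can be sketched as picking the model with the greatest composite log-likelihood and recording its margin over the runner-up. This is not the rdmc interface; the dictionary of scores is a hypothetical input, and measuring support as a simple difference is an assumption.

```python
def classify_sweep(composite_loglik):
    """Pick the model with the greatest composite log-likelihood.

    Returns (best_model, support), where support is the difference between
    the best and the next-best composite log-likelihood.
    """
    ranked = sorted(composite_loglik.items(), key=lambda item: item[1], reverse=True)
    (best_model, best_value), (_, second_value) = ranked[0], ranked[1]
    return best_model, best_value - second_value


# Hypothetical composite log-likelihoods for one shared sweep region.
example = {
    "neutral": -5120.0,
    "independent_mutations": -5080.5,
    "standing_variation": -5064.0,
    "migration": -5051.5,
}
best, support = classify_sweep(example)
```

Regions where the neutral model wins are the ones treated above as likely false positives from the initial sweep scan.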
Together, these findings indicate that many alleles are adaptive in the genomic background of both maize and teosinte, and that adaptive alleles are commonly shared between the two subspecies.
Discussion
Local adaptation occurs at intermediate scales
Our findings overall suggest that adaptation in maize and teosinte does occur locally, but frequently at scales larger than individual populations. Gossmann et al. (2010) hypothesized that population structure within a species could limit the fixation of adaptive alleles across a species range, causing a reduction in the proportion of mutations fixed by positive selection (α). Based on this hypothesis and the strong population structure we observed (Figure 1), we expected that rangewide samples would have smaller estimates of α. Instead, estimates of α for the rangewide samples of both maize and teosinte were commensurate with those of individual populations (Figure 3), a pattern that persisted even when we considered α estimated from several different mutation types (Figure S3). This is inconsistent with the patterns we would expect fine-scale local adaptation to generate, where adaptive substitutions for a given population should not be shared by other populations experiencing their own distinct local selective pressures.
We found a similar pattern from our analysis of shared versus unique selective sweeps, which were more often shared by at least one other allopatric population. Similar to our predictions for α, we expected that local adaptation would lead most sweeps to be exclusive to individual populations. Instead, the average proportion of sweeps exclusive to a single population was low to moderate for maize and teosinte populations, respectively (Figure 4). We also expected that maize and teosinte populations growing in close proximity would share similar local selective pressures and would therefore share more signatures of adaptation. However, only maize and teosinte sampled from Palmar Chico showed evidence of sharing more sweeps than would be expected by chance, and overall sympatric pairs did not show increased sharing of selective sweep regions compared to allopatric pairs (Figure 4). The regional scale of local adaptation is consistent with patterns seen in maize adaptation to the highlands (Calfee et al. 2021), where sympatric maize and teosinte populations show little evidence of adaptive gene flow, and adaptive teosinte introgression appears widespread among highland maize. To our surprise, private sweeps made up a higher proportion of sweeps for the rangewide sample of maize. One possible explanation is that rangewide samples are enriched for older sweeps that are shared across all populations but difficult to detect in any single sample. Consistent with this, we find evidence that the sweep statistic μ for individual maize populations was elevated in the sweep regions exclusive to the rangewide sample, but did not reach the cutoff to be considered outliers (Figure S6, 22).
There are a number of considerations to make in the interpretation of our results. The two methods we used to identify signatures of adaptation, estimating α and identifying signatures of selective sweeps, both depend on the fixation of beneficial mutations. For the moderate population sizes and selection coefficients observed here, fixation of new beneficial mutations takes a considerable amount of time, on the order of 4log(2N)/s generations (Charlesworth 2020), or thousands of years. Compared to the sojourn time of adaptive mutations, our populations may have occupied their current locations for relatively few generations. As a result, the selective sweeps underlying local adaptation to the selective pressures that populations currently face are more likely to be incomplete, so may be more difficult to detect (Xue et al. 2021; Pritchard et al. 2010). Likewise, the adaptive sweeps that have completed, and are therefore more likely to be detected, may have been under selection in ancestral populations that occupied different environments than that of the sampled individuals. Another complication in detecting local adaptation relates to the size and complexity of plant genomes. Large genomes may lead to more soft sweeps, where no single mutation driving adaptive evolution would fix (Mei et al. 2018). Like incomplete sweeps, soft sweeps are harder to identify (Schrider and Kern 2016; Pritchard et al. 2010), which could obscure the signatures of local adaptation. Even if our populations have occupied their current locations for a sufficient duration for local adaptation to occur, the completion of selective sweeps may be hindered by changes and fluctuations in the local biotic and abiotic conditions. Relatively rapid change in local conditions could result in fluctuating selection, such that most alleles do not remain beneficial for long enough to become fixed (Rudman et al. 2021).
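To give a sense of the time scale implied by 4log(2N)/s, the sketch below evaluates the expression for a few selection coefficients. It assumes the logarithm is natural, and the population size and selection coefficients are hypothetical values chosen for illustration rather than estimates from this study.

```python
from math import log


def fixation_time_generations(n, s):
    """Approximate sojourn time of a new beneficial mutation: 4 * ln(2N) / s."""
    return 4.0 * log(2.0 * n) / s


# Hypothetical values; no specific N or s is reported in this passage.
for s in (0.1, 0.01, 0.001):
    t = fixation_time_generations(100_000, s)
    print(f"N=100000, s={s}: about {t:.0f} generations")
```

Because Zea mays is an annual, generations map roughly one to one onto years, so weaker selection coefficients imply sojourn times of thousands of years or more.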
While our focus has been on the trajectory of individual beneficial alleles, the genetic basis of many adaptive traits may be highly polygenic. Allele frequency changes underlying polygenic adaptation are more subtle than those assumed under selective sweeps, making them harder to detect (Pritchard et al. 2010). Evaluating local adaptation in maize and other systems will be facilitated by studying the contribution of polygenic adaptation to the evolution of complex traits. However, if adaptation across our studied populations were strictly polygenic, and especially if it were acting on alleles with small effect sizes, we would expect to find few or no shared sweeps. The fact that we find many instances of sharing across populations suggests that a non-trivial amount of local adaptation is occurring via selective sweeps, or through polygenic adaptation acting on a few loci with large effects that leave a signature similar to that of a selective sweep.
Differences in diversity and demography influence adaptation in maize and teosinte
While our results were generally similar between the two subspecies and among the sampled populations, there are several important differences. The most obvious difference between the subspecies is the ongoing interaction and dependence of maize on humans via domestication and farming. Compared to teosinte, maize had lower average genomewide estimates of diversity (Figure 2A), which persisted after accounting for the effects of linked selection (Figure 3B). These differences are consistent with the previously discovered pattern that diversity tends to be lower in crops compared to their wild relatives (Doebley 1989; Hufford et al. 2012), a pattern putatively driven by domestication bottlenecks (Eyre-Walker et al. 1998). In line with this argument, the few teosinte populations with lower diversity than those in maize (El Rodeo and San Lorenzo) were inferred to have the most substantial bottlenecks and historical inbreeding (Figure 2). More generally, we found that π and Tajima’s D were more variable among teosinte populations, indicative of differences in their demographic histories.
Our demographic inferences suggest that all populations had signatures of a bottleneck, the timing of which coincides with the beginning of maize domestication ≈ 9,000 years ago (Piperno et al. 2009). The severity of the bottleneck varied considerably across populations, particularly for teosinte. While finding a bottleneck in the maize populations is consistent with domestication, it is less clear why we found a consistent bottleneck for the teosinte populations at approximately the same time. One possibility is that the teosinte bottlenecks reflect land use change induced by human colonization. For example, Mesoamerican phytolith records in lake sediment show evidence of anthropogenic burning as early as 11K years B.P. (Piperno 1991). The establishment and spread of human populations over the subsequent millennia would require ever increasing area for farming, dwellings, transportation, and trade (Haines et al. 2000). Such land use changes would likely encroach on the habitat available for teosinte and drive species-wide census size declines. Given the success of maize breeding and domestication, we anticipated a recent expansion for maize populations as previously seen (Beissinger et al. 2016; Wang et al. 2017). However, with only 10 individuals per population, recent expansions will be difficult to detect (Keinan and Clark 2012; Coventry et al. 2010). The demography of the rangewide samples for both subspecies showed little evidence of the bottleneck inferred in individual populations, likely due to the reduced sample size (5 and 6 individuals) for the rangewide data. We additionally used strong regularization penalties to avoid overfitting (see Methods), which limits the detection of rapid and dramatic changes in population size. The near-constant size of the rangewide samples and the lack of recent expansion in maize are both likely influenced by this modeling choice.
Lastly, given the strong effects of linked selection inferred in maize, our demographic models may have underestimated their effective population sizes. However, all populations independently converged to similar values in the oldest generation times, around the time when we would expect the ancestral lineages to have coalesced (Figure 2C). This suggests any biases in the estimated population sizes that are specific to maize are occurring in the more recent past. Similar to the arguments above, this provides another reason that recent population sizes in maize may be underestimated.
Theory and simulations, including those presented here, support that the rate of fixation of new beneficial mutations should increase with the population size (Ne) and consequently the population-scaled mutation rate (θ = 4Neμ) (Gillespie 2004; Figure 3D and Supplement III). However, previous research in plants found little evidence for positive estimates of α, and an apparent lack of correlation between inferred population size and α for new non-synonymous mutations (Gossmann et al. 2010). The discrepancy between those results and our own, with regard to the overall estimates of α, is likely explained by the use of different methods and the sampling design. Our simulations suggest the relationship between α and θ is positive but non-linear. The rate of increase for α declines at larger values of θ (Figure S2). This suggests that finding an association between θ and α may be sensitive (in theory and practice) to the parameter space under consideration.
We found a significant correlation between neutral θ and α, but only for teosinte populations (Figure 3C). While we consider this finding preliminary based on the small number of populations in the sample, the positive association found in teosinte populations is consistent with our simulations and some theory cited above that suggest larger populations increase the rate of adaptive fixation. However, it has also been suggested that the relationship between effective population size (Ne) and α is driven by increased efficiency of purifying selection to reduce the fixation of nearly-neutral substitutions, rather than an increase in the fixation rate of beneficial mutations per se (Galtier 2016; Hämälä and Tiffin 2020). We cannot confidently determine whether the association is driven by the increased efficiency of larger populations to purge weakly deleterious mutations or to fix adaptive ones. However, these are not mutually exclusive explanations, and both may contribute to differing degrees in actual populations. The lack of correlation in maize is likely driven by the recent divergence between populations and the strong recent selection during domestication that is shared by all the populations. As a result, populations have not had sufficient time for differences in population size to impact patterns of adaptive fixation. As recent changes in environment and population size are likely important aspects of the biology of many domesticates, we predict that the observed lack of correlation between θ and α in maize may be a common pattern across many taxa. However, another important caveat concerning these associations is the lack of independence among lineages, particularly for maize populations, which have diverged more recently. Further studies with larger samples of populations could correct for non-independence when testing the associations between α and θ, which would bolster the validity of any associations (or lack thereof) found.
Although the asymptotic MK method we employed has been shown to provide reliable estimates of α when fixations are due to strong beneficial mutations (Messer and Petrov 2013), it does not account for the influence of background selection and the rate of fixation of weakly beneficial mutations (Uricchio et al. 2019). Our analyses suggest linked selection is acting similarly across populations within subspecies (Figure 3B). As such, while we may be systematically underestimating α, the bias is unlikely to change our general conclusions.
Another consideration is that our estimates of α contrasted non-synonymous and synonymous sites only. We observed a lower density of coding sequence in sweep regions, suggesting selective sweeps are more often acting on non-coding regulatory regions rather than modifying protein coding sequences directly. This result is consistent with the proposed importance of non-coding sequences for adaptation in species with large genomes (Mei et al. 2018) and highlights the importance of studying signatures of adaptation using multiple methods and regions of the genome in order to construct a comprehensive history.
Differences in adaptation between maize and teosinte, and among populations, were apparent based on differences in the patterns of selective sweeps. Maize had a higher proportion of selective sweeps shared with at least one other population (Figure S1). The greater number of shared sweeps in maize populations is likely the result of their recent shared selective history during the process of domestication, resulting in a set of phenotypes common to all maize (Stitzer and Ross-Ibarra 2018). In comparison, the higher proportion of unique sweeps in teosinte suggests local adaptation has played more of a role in shaping their recent evolutionary history. Teosinte grows untended, and did not undergo domestication, leaving more opportunity for divergence and local selection pressures to accumulate differences among populations. This is reflected in the inferred population history, which had longer terminal branch lengths for teosinte (Figure 1B), suggesting there is increased genetic isolation among teosinte populations due to longer divergence times, reduced gene flow, or both.
Convergent adaptation is ubiquitous
We found convergent adaptation to be common among populations and subspecies (Figures 4 and 5). The frequency of convergence further suggests there are a large number of mutations that are beneficial in more than one population, even when placed in the different genomic backgrounds of the two subspecies. Our approach allowed us to distinguish between multiple potential modes of convergence, including a neutral model that models allele frequency covariance by drift alone (Lee and Coop 2017). The distribution of most likely selection coefficients of the inferred beneficial alleles suggests the strength of selection is moderate to strong, though this estimate is likely biased as strong positive selection will be easier to detect. Convergence via independent mutations was by far the least frequent mode. This is consistent with previous analyses of domestication (Hufford et al. 2012) and adaptation (Wang et al. 2020) in maize, and unsurprising given evidence for ongoing gene flow (Figure 1E), the relatively short evolutionary time scales, and the low probability that even strongly selected new mutations can overcome drift multiple times independently. For convergent sweeps that occurred via standing variation within maize, or that were shared between maize and teosinte, the distribution of the number of generations that the selected variant was standing before the onset of selection tended to be bimodal, with both long and short standing times, whereas sweeps exclusive to teosinte were consistently inferred to have been standing variation for more generations (Figure 5E). Sweeps that occurred via standing variation and were shared between subspecies were often found in only a subset of maize populations. Many of these sweeps likely reflect the presence of structure in ancestral populations, suggesting different alleles beneficial to maize were likely derived from more than one teosinte population.
The most common mode of convergent adaptation was via migration, which frequently occurred between geographically disparate populations (Figure S4). This included a relatively large number of shared sweeps via migration between maize and teosinte (Figures 5 and S5). There is ample evidence that maize and teosinte are capable of hybridizing (Wilkes et al. 1967; Ellstrand et al. 2007; Ross-Ibarra et al. 2009). Gene flow between geographically disparate populations has been inferred from patterns of introgression between maize and Zea mays mexicana (Calfee et al. 2021). Further, convergence via migration between geographically disparate maize populations has been inferred during adaptation to high elevations (Wang et al. 2020).
Our findings on convergence via migration point to an intriguing hypothesis, namely, that some non-trivial number of alleles that are beneficial to teosinte may have originally arisen in another teosinte population and moved between populations via gene flow with maize, an idea suggested by Ross-Ibarra et al. (2009) based on allele sharing at a small set of loci. There are several aspects of our results consistent with this hypothesis. Firstly, we found relatively few shared sweeps exclusive to teosinte populations (Figure 5C), which is what we would expect if maize populations facilitate the movement of beneficial teosinte alleles. However, it is important to note that there are fewer sweeps exclusive to teosinte for all modes of convergence, not just via migration, so this alone is not sufficient evidence. Secondly, sweeps shared via migration that were exclusive to teosinte were always inferred to have the lower of the two migration rates (1 × 10⁻³) (Figure 5F), whereas those shared by both subspecies or exclusive to maize were primarily inferred to have the higher migration rate (1 × 10⁻¹). This indicates that alleles that are only beneficial in teosinte move between populations more slowly. Despite the patterns of population structure in both subspecies, there is evidence of gene flow based on f4 tests. In particular, f4 tests that included maize from Crucero Lagunitas were consistently elevated across both subspecies (Figure 1E and S1). This is in accordance with our finding that Crucero Lagunitas maize was the second most commonly inferred source population of migration sweeps that were shared between maize and teosinte (Figure 5D). Together, these results suggest that geographically widespread varieties of maize such as Celaya (Crucero Lagunitas) (Orozco-Ramírez et al. 2017) may have played a prominent role as a source of and/or transport for beneficial alleles among maize and teosinte populations.
However, teosinte populations were more often the source of beneficial alleles for sweeps shared between both subspecies. In particular, the teosinte Palmar Chico was the most likely source population for the vast majority of sweeps shared between maize and teosinte as well as for sweeps exclusive to teosinte (Figure 5F). Palmar Chico is in the Balsas region, near the putative site of maize domestication (Piperno et al. 2009), and our demographic inferences suggest Palmar Chico has maintained the largest effective population size of all the populations (Figure 2), meaning it is capable of generating the most mutations for selection to act on. As such, teosinte Palmar Chico and other populations in the Balsas region may have also been a particularly important source of adaptive variation for both teosinte and maize.
Materials and Methods
Samples and whole genome re-sequencing
We sampled seeds from five populations of Zea mays ssp. parviglumis, and four populations of Zea mays ssp. mays from plants growing across the species’ native range. We additionally included populations of parviglumis and maize from Palmar Chico that were previously analyzed and reported in Chen et al. (2020), for a total of 6 and 5 populations of maize and teosinte (see Supplement I for accession IDs and further sample details). All maize and teosinte populations from each named location were less than 1 km from one another, with the exception of Crucero Lagunitas, which were separated by approximately 18 km.
DNA extraction for teosinte followed Chen et al. (2020). Genomic DNA for landraces was extracted from leaf tissue using the E.Z.N.A.® Plant DNA Kit (Omega Biotek), following manufacturer’s instructions. DNA was quantified using Qubit (Life Technologies). 1 μg of DNA per individual was fragmented using a bioruptor (Diagenode) with 30 seconds on/off cycles.
DNA fragments were then prepared for Illumina sequencing. First, DNA fragments were repaired with the End-Repair enzyme mix (New England Biolabs). A deoxyadenosine triphosphate was added at each 3′ end with the Klenow fragment (New England Biolabs). Illumina Truseq adapters (Affymetrix) were then added with the Quick ligase kit (New England Biolabs). Between each enzymatic step, DNA was washed with Sera-Mag SpeedBeads (Fisher Scientific). The libraries were sequenced with an average coverage of 20 to 25x PE150 on the Xten at Novogene (Sacramento, USA).
We additionally grew one individual of Zea diploperennis from the UC Davis Botanical Conservatory as an outgroup. DNA for Zea diploperennis was extracted and libraries prepared as above, and then sequenced to 60X coverage using PE250 on 3 lanes of Illumina 2000 rapid run (UC Davis Genome Center, Davis, USA).
Sequencing reads have been deposited in the NCBI Sequence Read Archive under project number (to be submitted).
Sequencing and variant identification
All paired-end reads were aligned to version 5 of the maize B73 reference genome (Hufford et al. 2021) using bwa-mem (v0.7.17) (Li 2013). Default options were used for mapping except -M to enable marking short hits as secondary, -R for providing the read group, and -K 10000000 for processing 10 Mb of input in each batch. Sentieon (v201808.01) (Freed et al. 2017) was used to process the alignments to remove duplicates (option --algo Dedup) and to calculate various alignment metrics (GC bias, MQ value distribution, mean quality by cycle, and insert size metrics) to ensure proper mapping of the reads.
All downstream analyses were based on genotype likelihoods estimated with ANGSD (v0.934) (Korneliussen et al. 2014) using the following command line flags and filters:
Genetic Diversity
We estimated per base nucleotide diversity (π) and Tajima’s D (D) in non-overlapping 1kb, 10kb and 100kb windows with the thetaStat utility from ANGSD, though estimates did not substantively differ between window sizes. To estimate π and the unfolded site frequency spectra for each population, we polarized alleles as ancestral and derived using short-read sequence data for Zea luxurians and Zea diploperennis as outgroups. Zea luxurians sequence from Tenaillon et al. (2011) was downloaded from the NCBI Sequence Read Archive (study SRR088692). We used the alignments from the two species to make minor allele frequency (MAF) files using ANGSD. We used the MAF files to construct a table of genotypes found at each locus. Sites with minor allele frequency estimates greater than 0.001 were treated as heterozygous. Sites that were homozygous in both species were imputed onto the maize v5 reference and assumed to be the ancestral allele. As there were substantially more called bases in Zea luxurians than in Zea diploperennis, we also assumed sites that were homozygous in luxurians and missing in diploperennis were ancestral, but excluded sites that were missing from luxurians. Sites that were classified as heterozygous were treated as missing and imputed onto the maize reference as ‘N’.
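The polarization rules described above can be summarized as a small decision function. The sketch below is illustrative only: the function names and the input representation (a major allele plus a minor allele frequency per outgroup) are hypothetical, and the handling of sites where the two outgroups are homozygous for different alleles is an assumption, since the text does not state it.

```python
from typing import Optional

HET_MAF_THRESHOLD = 0.001  # MAF above this is treated as heterozygous (from the text)


def outgroup_state(major_allele: Optional[str], maf: Optional[float]) -> Optional[str]:
    """Classify one outgroup site as an allele, 'het', or None (missing)."""
    if major_allele is None or maf is None:
        return None
    if maf > HET_MAF_THRESHOLD:
        return "het"
    return major_allele


def ancestral_allele(luxurians: Optional[str], diploperennis: Optional[str]) -> str:
    """Return the inferred ancestral base, or 'N' when it cannot be assigned."""
    # Sites missing from luxurians are excluded; heterozygous sites are missing.
    if luxurians is None or luxurians == "het" or diploperennis == "het":
        return "N"
    # Homozygous in luxurians and missing in diploperennis: assumed ancestral.
    if diploperennis is None:
        return luxurians
    # Homozygous for the same allele in both species: ancestral.
    if luxurians == diploperennis:
        return luxurians
    # ASSUMPTION: discordant homozygous calls are left unresolved.
    return "N"


if __name__ == "__main__":
    lux = outgroup_state("A", 0.0)
    dip = outgroup_state("A", 0.2)  # heterozygous in diploperennis
    print(ancestral_allele(lux, dip))
```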
Population Structure and Introgression
We used NGSadmix (Skotte et al. 2013) to assess population structure within subspecies. To do so we used a SNP-calling procedure in ANGSD with the same filters as listed above, along with a SNP p value cutoff of 1 × 10⁻⁶. We looked for evidence of gene flow between subspecies using f4 statistics and Z-scores calculated with blocked jackknifing, implemented using treemix (Pickrell and Pritchard 2012). Trees for f4 tests were always of the form (Maize_X, Maize_X; Teosinte_focal, Teosinte_X); each unique combination of populations was considered for the “_focal” and “_X” positions of the tree. We considered any tree with a Z-score greater than or equal to 3 significant, indicating a departure from the allele frequency differences expected if population history matched the hypothesized tree. We assessed Z-scores separately based on whether the focal population was maize or teosinte, and for trees that include the sympatric pair of the focal population.
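As an illustration of the logic behind the f4 test, the sketch below computes f4(A, B; C, D) as the mean product of allele frequency differences and derives a Z-score from a delete-one block jackknife. This is a simplified stand-in under stated assumptions (equal-sized blocks of consecutive SNPs, unweighted jackknife, made-up allele frequencies); it is not the treemix implementation used for the analyses.

```python
import math
from typing import List, Sequence, Tuple


def f4_terms(pa: Sequence[float], pb: Sequence[float],
             pc: Sequence[float], pd: Sequence[float]) -> List[float]:
    """Per-SNP products (pA - pB) * (pC - pD)."""
    return [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]


def f4(pa: Sequence[float], pb: Sequence[float],
       pc: Sequence[float], pd: Sequence[float]) -> float:
    """f4(A, B; C, D): mean of the per-SNP products."""
    terms = f4_terms(pa, pb, pc, pd)
    return sum(terms) / len(terms)


def f4_block_jackknife(pa: Sequence[float], pb: Sequence[float],
                       pc: Sequence[float], pd: Sequence[float],
                       block_size: int) -> Tuple[float, float]:
    """Return (f4 estimate, Z-score) using a delete-one block jackknife."""
    terms = f4_terms(pa, pb, pc, pd)
    blocks = [terms[i:i + block_size] for i in range(0, len(terms), block_size)]
    g = len(blocks)
    if g < 2:
        raise ValueError("need at least two blocks for a jackknife")
    total, n = sum(terms), len(terms)
    estimate = total / n
    # Leave-one-block-out estimates of f4.
    loo = [(total - sum(block)) / (n - len(block)) for block in blocks]
    loo_mean = sum(loo) / g
    variance = (g - 1) / g * sum((x - loo_mean) ** 2 for x in loo)
    z = estimate / math.sqrt(variance) if variance > 0 else float("nan")
    return estimate, z


if __name__ == "__main__":
    # Made-up allele frequencies at four SNPs in four populations.
    est, z = f4_block_jackknife([0.9, 0.8, 0.7, 0.9], [0.1, 0.2, 0.1, 0.2],
                                [0.8, 0.9, 0.6, 0.7], [0.2, 0.1, 0.1, 0.3],
                                block_size=2)
    print(f"f4 = {est:.3f}, Z = {z:.2f}, significant = {abs(z) >= 3}")
```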
Demographic and Inbreeding History
We inferred each population’s demography using a single unfolded site frequency spectrum with mushi (v0.2.0) (DeWitt et al. 2021). In an effort to reduce overfitting given our modest sample sizes, we increased the regularization penalty parameters to alpha_tv=1e4, alpha_spline=1e4, and alpha_ridge=1e-1.
We assessed homozygosity by descent (HBD) in each population using IBDseq (v2.0) (Browning and Browning 2013). We compared empirical results to simulations in msprime (Kelleher et al. 2016) using each population’s inferred demographic history. We performed ten replicates of each of these simulations. Replicates were similar across all populations; only one replicate was chosen at random for visual clarity.
We estimated recent inbreeding using ngsrelate (Hanghøj et al. 2019) with default parameters. Input files were generated using ANGSD with the same filters as listed above, along with a SNP p value cutoff and maf filter of respectively.
Linked Selection
Following an approach similar to Corbett-Detig et al. (2015), we modeled the effects of linked selection (background selection and hitchhiking) on observed values of π across the genome in 1kb windows.
The full model (equation 1) jointly modeled both hitchhiking and background selection. In this model, α is the population-scaled rate of selective sweeps (Coop and Ralph 2012), fd_i is the proportion of functional bases in window i, and recombination enters as the mean recombination rate per base pair in window i. Lastly, G_i models the effects of background selection following Hudson and Kaplan (1995), where U is the deleterious mutation rate, s and h are the selection and dominance coefficients for deleterious mutations (respectively), P is an index of panmixis, and M_i and M_k represent the Morgan position of the focal locus and all other loci along the chromosome, respectively. The proportion of functional bases, fd_i, was estimated using the number of protein coding sequences in a window based on the most recent gene annotation of the maize v5 reference genome (Hufford et al. 2021). The position of loci in Morgans was approximated using the approx function in R based on a previously generated genetic map for maize that was lifted on to the v5 reference genome (Ogut et al. 2015). We treated the parameters s and U as fixed, but refit the model over a grid of possible values for each. Specifically, we allowed s to vary over 30 evenly spaced values between 1 × 10⁻⁸ and 1 × 10⁻¹, U took on values 1 × 10⁻⁷ or 3 × 10⁻⁸, while P and h were always fixed at 1 and 0.5, respectively.
We compared the full linked selection model (equation 1) to three simpler models: one that does not account for hitchhiking, one that does not include background selection, and finally a simple intercept model that estimates neutral θ.
The same refitting procedure was used for all models that included U, s, and h. We fit all four equations separately for each population using a Maximum Likelihood framework. Since all values of θ are bounded between (0,1), we assumed observations were generated from a beta distribution following
We calculated the standard errors for parameters from the Hessian matrix (H) produced by the optim function as
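A conventional form for such Hessian-based standard errors is sketched below. It assumes that optim minimized the negative log-likelihood, so that H is the observed information matrix evaluated at the maximum likelihood estimate; this is an assumption, as the objective function is not stated here, and the symbol \hat{\theta}_k for the kth fitted parameter is introduced only for illustration.

```latex
\mathrm{SE}\left(\hat{\theta}_k\right) = \sqrt{\left[\mathbf{H}^{-1}\right]_{kk}}
```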
We calculated AIC, ΔAIC and AIC weights (Wagenmakers and Farrell 2004) and used them to compare among the alternative models for each population.
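The model comparison quantities can be computed directly from each model's maximized log-likelihood and parameter count. The sketch below follows the standard definitions of AIC, ΔAIC, and Akaike weights as described by Wagenmakers and Farrell (2004); the log-likelihoods and parameter counts in the example are invented for illustration and do not correspond to any fitted model in this study.

```python
import math
from typing import List, Sequence, Tuple


def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_lik


def aic_weights(aics: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (delta AIC, Akaike weights) for a set of candidate models."""
    best = min(aics)
    deltas = [a - best for a in aics]
    rel_lik = [math.exp(-0.5 * d) for d in deltas]
    total = sum(rel_lik)
    return deltas, [r / total for r in rel_lik]


if __name__ == "__main__":
    # Hypothetical (log-likelihood, number of parameters) for four models:
    # full, background selection only, hitchhiking only, intercept only.
    fits = [(-1000.0, 4), (-1004.0, 3), (-1010.0, 3), (-1030.0, 2)]
    scores = [aic(ll, k) for ll, k in fits]
    deltas, weights = aic_weights(scores)
    for score, delta, weight in zip(scores, deltas, weights):
        print(f"AIC={score:.1f}  dAIC={delta:.1f}  weight={weight:.3f}")
```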
Estimating the Rate of Positive Selection, α
We modeled the rate of positive selection, α (note this is different from the α estimated in our linked selection models above), of 0-fold non-synonymous mutations using the asymptotic extension of the McDonald–Kreitman (MK) test (Messer and Petrov 2013), where α is calculated at each allele frequency bin of the uSFS (from 1/n to (n − 1)/n).
At each allele frequency bin, each α was calculated as α = 1 − (d/d0)(p0/p), where d0 and d are the number of derived fixed differences for selected and putatively neutral sites, respectively, and p0 and p are the number of selected and putatively neutral polymorphic sites. We identified 0-fold and 4-fold sites using Python (https://github.com/silastittes/cds_fold). We fit the asymptotic MK extension as a nonlinear Bayesian mixed-model using the R package brms (Bürkner 2017, 2018), with αij = a + b exp(−c xij), where αij is the value of α calculated at the ith allele frequency bin of the jth population and xij is the corresponding allele frequency bin. In the nonlinear brms model, all three free parameters of the asymptotic function (a, b, and c) were treated as random effects of population and nucleotide type (see Supplement IV), and the subspecies was treated as a fixed effect.
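The per-bin calculation can be illustrated with a short sketch. It assumes the standard MK form written in this section's symbols, α(x) = 1 − (d/d0)(p0(x)/p(x)), and the exponential asymptotic function a + b·exp(−c·x) of Messer and Petrov (2013), whose value at x = 1 is taken as the asymptotic α. The counts and parameter values are invented, and no model fitting (which was done with brms) is attempted here.

```python
import math
from typing import List, Sequence


def alpha_per_bin(d0: float, d: float,
                  p0_bins: Sequence[float], p_bins: Sequence[float]) -> List[float]:
    """alpha(x) = 1 - (d / d0) * (p0(x) / p(x)) for each frequency bin.

    d0, d: fixed differences at selected and putatively neutral sites.
    p0_bins, p_bins: polymorphic counts per bin at selected and neutral sites.
    """
    return [1.0 - (d / d0) * (p0 / p) for p0, p in zip(p0_bins, p_bins)]


def asymptotic_alpha(x: float, a: float, b: float, c: float) -> float:
    """Exponential asymptotic function a + b * exp(-c * x)."""
    return a + b * math.exp(-c * x)


if __name__ == "__main__":
    # Invented counts: selected polymorphism is enriched at low frequencies,
    # so alpha rises across bins as weakly deleterious variants drop out.
    alphas = alpha_per_bin(d0=120.0, d=100.0,
                           p0_bins=[90.0, 40.0, 20.0], p_bins=[60.0, 35.0, 20.0])
    print([round(a_x, 3) for a_x in alphas])
    # Asymptotic alpha is the fitted curve evaluated at x = 1 (hypothetical a, b, c).
    print(round(asymptotic_alpha(1.0, a=0.3, b=-0.6, c=5.0), 4))
```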
Identifying Selective Sweeps
We used RAiSD (Alachiotis and Pavlidis 2018) to infer regions that have experienced a selective sweep in each population, including the rangewide samples. Across all populations, we used a minor allele frequency threshold of 0.05 and a window size of 100 SNPs. We called SNPs and generated VCF files for each population using the dovcf utility from ANGSD, using a SNP calling p value cutoff of 1 × 10⁻⁶. We used the Python package mop (https://pypi.org/project/mop-bam/) to exclude SNPs that fell in regions with low and excessive coverage and/or poor quality. Here we required each locus to have at least 70% of individuals with a depth between 5 and 100, and to have Phred-scaled quality scores above 30 for base and mapping quality. We additionally used mop to rescale the μvar sub-statistic in each population based on the proportion of high quality bases available in each of the RAiSD SNP windows, after which we recalculated the overall μ statistic as the product of the three sub-statistics (Alachiotis and Pavlidis 2018). After correcting the μ statistic, we defined outliers using simulations of demography using msprime (Kelleher et al. 2016). We considered estimates of μ that were greater than the 99.99% quantile of the neutral simulations as evidence for being within a sweep region. RAiSD outliers within 100kb were merged using Bedtools merge (Quinlan and Hall 2010) and treated as a single sweep region in downstream analyses. We then used Bedtools intersect to identify overlapping sweep regions among populations, which we treated as evidence for being a shared sweep.
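The merging and overlap steps can be illustrated with plain interval arithmetic. The sketch below mimics, for a single chromosome, the behavior of merging outlier windows that lie within 100kb of each other and then flagging regions that overlap between two populations. It is a simplified stand-in for the Bedtools commands, with hypothetical coordinates and function names.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # (start, end) on one chromosome


def merge_regions(regions: Sequence[Region], max_gap: int = 100_000) -> List[Region]:
    """Merge regions separated by at most max_gap bases into single sweep regions."""
    merged: List[List[int]] = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def overlaps(a: Region, b: Region) -> bool:
    """True if two half-open regions share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def shared_regions(pop_a: Sequence[Region], pop_b: Sequence[Region]) -> List[Region]:
    """Sweep regions in pop_a that overlap any sweep region in pop_b."""
    return [ra for ra in pop_a if any(overlaps(ra, rb) for rb in pop_b)]


if __name__ == "__main__":
    # Hypothetical outlier windows for two populations on one chromosome.
    pop1 = merge_regions([(0, 10), (50_000, 60_000), (500_000, 510_000)])
    pop2 = merge_regions([(55_000, 70_000), (900_000, 910_000)])
    print(pop1)
    print(shared_regions(pop1, pop2))
```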
Assessing the number of false positive sweeps and the number of shared sweeps expected by chance
To assess false positives, we used a second random sample of non-overlapping individuals from the two Palmar Chico populations. False positives were assessed based on the number of sweep regions that did not overlap between the 2 replicate Palmar Chico samples from each subspecies. To account for differences in the total number of sweep regions for each replicate, we averaged the two proportions, FP = ½[(nP1 − nS)/nP1 + (nP2 − nS)/nP2], where nS is the number of sweeps shared between the replicates, and nP1 and nP2 were the number of sweeps in the first and second replicates, respectively. In downstream analyses, we used the average value of FP for the two subspecies, although it was higher for maize than for teosinte (0.60 and 0.49 for maize and teosinte, respectively).
We evaluated the number of sweeps that we would expect populations to share by chance using a simple statistical test based on the hypergeometric distribution, where N is the total number of loci tested; n1 and n2 are the number of outlier loci in the first and second populations, respectively; and x is the number of shared outliers between the two populations. The population with the larger number of outliers was always designated as the first population. We accounted for false positives by multiplying the raw values of N, n1, n2, and x by the FP value described above.
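A hypergeometric test of this kind can be sketched with the standard library alone. The function below returns the upper-tail probability of observing at least x shared outliers when n2 loci are drawn from N, of which n1 are outliers in the first population. Using the upper tail, and requiring integer inputs (for example after rounding any rescaled counts), are assumptions made for this sketch; the values in the example are invented.

```python
from math import comb


def shared_sweep_pvalue(n_total: int, n1: int, n2: int, x: int) -> float:
    """P(X >= x) for X ~ Hypergeometric(N = n_total, K = n1, n = n2).

    n_total: total loci tested; n1, n2: outlier loci in populations 1 and 2;
    x: observed number of shared outliers. All inputs must be integers.
    """
    if not (0 <= x <= min(n1, n2) and max(n1, n2) <= n_total):
        raise ValueError("inconsistent counts")
    denom = comb(n_total, n2)
    upper = min(n1, n2)
    tail = sum(comb(n1, k) * comb(n_total - n1, n2 - k) for k in range(x, upper + 1))
    return tail / denom


if __name__ == "__main__":
    # Invented counts: 1000 loci, 50 and 40 outliers, 8 shared.
    print(f"{shared_sweep_pvalue(1000, 50, 40, 8):.3g}")
```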
We corrected p-values for multiple tests using the Benjamini and Yekutieli method, implemented with the R function, p.adjust (Benjamini and Yekutieli 2001).
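For readers unfamiliar with the Benjamini and Yekutieli correction, the sketch below reimplements the adjustment in a few lines. It is intended to mirror the behavior of p.adjust with method "BY" (adjusted values are monotone and capped at 1), but it is an illustrative reimplementation rather than the code used for the analyses, and the example p-values are invented.

```python
from typing import List, Sequence


def benjamini_yekutieli(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic-sum penalty that makes the procedure valid under dependence.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = m - offset  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print([round(p, 4) for p in benjamini_yekutieli([0.01, 0.04, 0.03])])
```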
Inferring modes of convergent adaptation
For sweep regions that were overlapping in 2 to 9 of the 11 populations, we used rdmc to infer the most likely mode of convergent adaptation (Lee and Coop 2017; Tittes 2020). To ensure a sufficient number of loci were included to estimate decay in covariance across the sweep regions, we added 10% of each sweep region’s total length on each of its ends prior to fitting the models. To reduce the computation time, we excluded sites that had an allele frequency less than 1/20 across all populations. As sweep regions differed in size, we subset from their total number of sites to maintain approximately 250K SNPs per centiMorgan. All sweep regions were required to have fewer than 100K SNPs. Sweep regions near the ends of chromosomes, for which we could not estimate the number of centiMorgans, were subset to 10K SNPs. To contrast allele frequency covariance in sweep regions, we used allele frequencies at 100K random loci to estimate the neutral covariance among all populations. To reduce computation time when fitting the models, we estimated a covariance matrix using a random subset of 50K of these loci. When fitting rdmc, we assumed the effective population size was 50K. The recombination rate was approximated for each sweep region as the median interpolated value based on a previously generated genetic map for maize that was lifted on to the v5 reference genome (Ogut et al. 2015). The rdmc function arguments that control the grid of parameter values over which composite likelihoods were computed were:
We compared composite likelihoods over four convergent adaptation modes, “neutral”, “independent”, “standing”, and “migration”. We assigned each sweep to the mode with the highest log composite likelihood. To assess the overall performance of the method to distinguish between the four modes, we computed differences between the highest composite likelihood and the next highest for each sweep.
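The assignment rule and the accompanying performance measure can be expressed compactly. The sketch below takes the log composite likelihood for each of the four modes, picks the mode with the highest value, and reports its margin over the runner-up. The function name and the likelihood values are hypothetical; the composite likelihoods themselves would come from rdmc.

```python
from typing import Dict, Tuple


def best_mode(log_comp_lik: Dict[str, float]) -> Tuple[str, float]:
    """Return (most likely mode, difference to the next-highest log likelihood)."""
    if len(log_comp_lik) < 2:
        raise ValueError("need at least two modes to compare")
    ranked = sorted(log_comp_lik.items(), key=lambda kv: kv[1], reverse=True)
    (mode, top), (_, runner_up) = ranked[0], ranked[1]
    return mode, top - runner_up


if __name__ == "__main__":
    # Hypothetical log composite likelihoods for one sweep region.
    sweep = {"neutral": -120.0, "independent": -110.0,
             "standing": -100.5, "migration": -95.0}
    mode, margin = best_mode(sweep)
    print(mode, margin)
```

A small margin indicates that two modes explain the covariance pattern almost equally well, so the assignment for that sweep should be read with more caution.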
All wrangling to prepare input data for statistical analyses was done using R (R Core Team 2020) with appreciable reliance on functions from the tidyverse package suite (Wickham et al. 2019). Figures were made using ggplot2 (Wickham 2016), patchwork (Pedersen 2019), and cowplot (Wilke 2019). Code and summarised data for the entirety of the analyses, including Jupyter notebooks for reproducing figures, are available from https://github.com/silastittes/parv_local (permanent repository to follow).
Supplement I
Population sampling locations
Supplement II
Further assessment of f4 statistic inferences
Supplement III
Validating empirical relationship between α and θ
To better understand and validate the observed empirical relationship between α, the proportion of new mutations fixed by natural selection, and θ, the population scaled mutation rate (4Neμ), we conducted simulations using SLiM (Haller and Messer 2019). Each simulation consisted of a 20Mb sequence with constant recombination and mutation rates of 1 × 10⁻⁶ and 3 × 10⁻⁶, respectively, both of which are 100-fold higher than our estimated rates in maize. Matching the average values for the maize genome, the simulated 20Mb sequence had 300 gene regions separated by 30Kb each. Each gene region consisted of four 200bp exons and three 300bp introns. Mutations were only simulated within exons. Fitness effects of mutations were simulated to be neutral, strictly deleterious, or strictly positive, where the two non-neutral mutation classes were drawn from independent gamma distributions with mean and shape parameters (μ+, shape+, and μ−, shape−, for the positive and negative distributions, respectively). Following the empirical ratio of 0-fold nonsynonymous to 4-fold synonymous sites, the proportion of non-neutral mutations to neutral mutations was simulated as 3.5 to 1. The ratio of positively to negatively selected mutations (p−) was allowed to vary across simulations. We implemented a simple 3-epoch demographic model where the three population sizes (“ancestral” = NA, “bottleneck” = NB, and “modern” = N0) and the generation times (“bottleneck” = TB and “modern” = T0) for changing to a new epoch were allowed to vary across simulations. For each simulation, we computed α from ρ+, ρ−, and ρ0, the frequencies of positive, negative, and neutral fixations that occurred over the course of the simulation. Simulations were always initiated with a burn-in of 10 × NA generations to ensure simulations were near equilibrium conditions prior to implementing demographic changes.
To make direct comparisons with the empirical data, we converted the demographic parameters to a single value of θ as the harmonic mean of the three population size parameters, weighted by the number of generations spent at that population size. This resulted in 7 free parameters that were drawn from uniform priors (Table S3). From 1000 simulations, we fit a linear model predicting α from the free parameters.
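As a minimal sketch of the conversion described above, the following Python computes a generation-weighted harmonic mean of the epoch sizes and scales it to θ = 4Neμ. The function names are hypothetical, and the number of generations assigned to each epoch (in particular the ancestral epoch) is an assumption left to the caller, since the text does not specify it.

```python
def weighted_harmonic_mean_ne(sizes, generations):
    """Harmonic mean of population sizes, weighted by the number of
    generations spent at each size."""
    if len(sizes) != len(generations):
        raise ValueError("need one duration per population size")
    total_generations = sum(generations)
    return total_generations / sum(t / n for n, t in zip(sizes, generations))


def theta_from_ne(ne, mutation_rate):
    """Population scaled mutation rate, theta = 4 * Ne * mu."""
    return 4 * ne * mutation_rate


# Illustrative values only: ancestral, bottleneck, and modern sizes,
# each held for 100 generations.
ne = weighted_harmonic_mean_ne([1000, 100, 1000], [100, 100, 100])
theta = theta_from_ne(ne, 3e-6)
```

Because the harmonic mean is dominated by the smallest sizes, even a short bottleneck pulls the effective size, and therefore θ, well below the ancestral and modern values.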
As expected from theory, the population scaled mutation rate (θ = 4Neμ) had a positive effect on both α (Table S3 and Figure S2) and the total number of adaptive substitutions (Table S4 and Figure S2). However, in both cases, the relationship appears to be nonlinear and plateaus for higher values of θ. The non-linearity persists after accounting for the effect of other parameters (Figure S2 C and D). Nonetheless, α and the number of adaptive substitutions increase over the broad range of θ simulated.
Unsurprisingly, the means of both the positive and negative fitness distributions, and the proportion of non-neutral mutations simulated as deleterious, were also all significant predictors of α. As noted in the discussion, given the relative ease of fixing strongly adaptive mutations, it is now widely believed that differences in the ability to purge weakly deleterious alleles play an important role in determining α. Our simulations were conducted over a broad distribution of fitness effects. As such, we cannot directly identify to what extent the relationship between α and θ is driven by the increasing efficacy of larger populations in fixing positive mutations versus their ability to purge weakly deleterious ones. However, based on our simulations, it seems likely that both phenomena contribute to α, though which is more important will depend on the particular combination of parameters that a population occupies.
Supplement IV
Predicting α by mutation type
Estimates of α may be affected by differences in the mutation rates of different nucleotides and genomic regions. GC biased gene conversion has been shown to reduce α by making it harder to purge slightly deleterious alleles (Hämälä and Tiffin 2020). Likewise, the higher mutation rates observed at methylated cytosine bases increase the rate of C → T mutations (Ossowski et al. 2010), which is another mechanism that could result in variation in α by changing the ability to purge deleterious alleles, or by changing the probability of fixation of new adaptive mutations.
To study this, we used the same approach as Hämälä and Tiffin (2020), where we separated the site frequency spectra based on mutation types according to whether the ancestral and derived nucleotides form weak (A/T, two hydrogen bonds) or strong (G/C, three hydrogen bonds) base pairs between the DNA strands. As such, we studied three mutation types: A/T → G/C mutations (WS), G/C → A/T mutations (SW), and C/G → G/C or A/T → T/A mutations (SS_WW).
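The classification into WS, SW, and SS_WW can be sketched as follows. This is a hypothetical helper written for illustration, not the code used in the analysis. It assumes each site is described by its ancestral base, derived base, and derived allele count, and it builds one unfolded site frequency spectrum per mutation type.

```python
WEAK = {"A", "T"}    # weak base pairs
STRONG = {"G", "C"}  # strong base pairs


def classify_mutation(ancestral, derived):
    """Return 'WS', 'SW', or 'SS_WW' for an ancestral -> derived change."""
    a, d = ancestral.upper(), derived.upper()
    if a == d or a not in WEAK | STRONG or d not in WEAK | STRONG:
        raise ValueError(f"not a valid substitution: {ancestral}->{derived}")
    if a in WEAK and d in STRONG:
        return "WS"
    if a in STRONG and d in WEAK:
        return "SW"
    return "SS_WW"  # strong-to-strong or weak-to-weak


def split_sfs(sites, n_chromosomes):
    """Build one unfolded SFS per mutation type.

    `sites` is an iterable of (ancestral, derived, derived_count) tuples;
    index i of each spectrum counts sites with i derived copies.
    """
    spectra = {k: [0] * (n_chromosomes + 1) for k in ("WS", "SW", "SS_WW")}
    for ancestral, derived, count in sites:
        spectra[classify_mutation(ancestral, derived)][count] += 1
    return spectra


example = split_sfs([("A", "G", 1), ("C", "T", 2), ("C", "G", 1)], 4)
```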
Unlike patterns found in Arabidopsis (Hämälä and Tiffin 2020), α was highest for WS mutations, although there was considerable overlap between the credible intervals for all mutation types.
Demystifying the high proportion of sweeps unique to the rangewide maize sample
Why does the rangewide sample for maize have such a high proportion of private sweeps? One plausible explanation is that rangewide sweeps represent older sweeps that are shared among all maize, but due to drift and divergence in the individual populations, the sweep signal has been attenuated. When maize individuals are combined from multiple populations, the sweep signal is amplified above the background noise that is generated independently in each population. If this is the case, regions where a sweep unique to the rangewide sample appears should show elevated (but perhaps not statistically significant) evidence of a sweep in the individual populations.
We tested this hypothesis by comparing the distribution of the RAiSD μ statistic within the regions we identified as private sweeps in the rangewide sample to a random sample of μ values drawn genomewide from all the maize populations. Consistent with our expectations, all maize populations have elevated μ values within the regions identified as private sweeps in the rangewide sample (Figure S6).
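The comparison described above can be sketched in Python under simplifying assumptions. The sketch assumes that μ values are available as (position, μ) pairs for one population on a single chromosome, that sweep regions are (start, end) intervals, and that the two distributions are summarized by their medians. The function names and the choice of the median are illustrative and are not taken from the analysis itself.

```python
import random
import statistics


def mu_within_regions(windows, regions):
    """Collect mu values whose position falls inside any sweep region."""
    return [
        mu
        for position, mu in windows
        if any(start <= position <= end for start, end in regions)
    ]


def compare_to_background(windows, regions, n_random, rng):
    """Return (median mu inside regions, median mu of a random
    genome-wide sample of n_random windows)."""
    inside = mu_within_regions(windows, regions)
    if not inside:
        raise ValueError("no windows overlap the supplied regions")
    background = rng.sample([mu for _, mu in windows], n_random)
    return statistics.median(inside), statistics.median(background)


# Toy data: 100 windows, with elevated mu between positions 2000 and 2900.
toy_windows = [(i * 100, 5.0 if 20 <= i < 30 else 1.0) for i in range(100)]
inside_median, background_median = compare_to_background(
    toy_windows, [(2000, 2900)], 50, random.Random(0)
)
```

Under the hypothesis in the text, the median inside the rangewide-private sweep regions should exceed the background median in each individual population.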
Acknowledgments
We would like to thank Andi Kur for providing the corn art, along with Matthew Gibson, Tom Booker, Cathy Rushworth, and other members of the Ross-Ibarra lab for feedback and suggestions on early drafts of the manuscript. This work was funded in part by grants from the National Science Foundation (1822330 and 1238014). We would also like to acknowledge Felix Andrews for statistical advice, although we did not follow it.