Abstract
North America has seen a massive increase in cropland use since 1800, accompanied more recently by the intensification of agricultural practices. Through genome analysis of present-day and historical samples spanning environments over the last two centuries, we studied the impact of these changes in farming on the extent and tempo of evolution in the native common waterhemp (Amaranthus tuberculatus), a now pervasive agricultural weed. Modern agriculture has imposed strengths of selection rarely observed in the wild (0.027-0.10), with striking shifts in allele frequency trajectories since agricultural intensification in the 1960s. An evolutionary response to this extreme selection was facilitated by a concurrent human-mediated range shift. By reshaping genome-wide diversity and variation for fitness, agriculture has driven the success of this 21st-century weed.
One Sentence Summary Modern agriculture has dramatically shaped the evolution of a native plant into an agricultural weed through imposing strengths of selection rarely observed in the wild.
Main text
Agricultural practices across North America have rapidly intensified over the last two centuries, through cropland expansion (1), habitat homogenization (2), and increased chemical inputs (3). Since the beginning of the 1800s, cropland usage has expanded from 8 million to 200 million hectares in Canada and the United States alone (1). Since the 1960s, increased reliance on pesticides, irrigation, large-scale mechanization, and newly developed crop varieties has greatly improved the efficiency of food production in all farming sectors, a transformation oft-referenced as the agricultural “Green Revolution” (4, 5). For pesticides, however, their effectiveness has been limited by the evolution of resistance across numerous pest species (6). While technological innovation for efficient food production has risen with increasing global food demands, the concomitant conversion of our landscape has become one of the foremost drivers of global biodiversity loss (7).
Species that have managed to survive, and even thrive, in the face of such extreme environmental change provide remarkable examples of rapid adaptation on contemporary timescales and illustrate the evolutionary consequences of anthropogenic impacts. One such species is common waterhemp (Amaranthus tuberculatus), which is native to North America and persists in large part in natural, riparian habitats (8), providing a unique opportunity to investigate the timescale and extent of contemporary agricultural adaptation in this prevalent weed. Understanding the genetic changes underlying weediness is particularly important in A. tuberculatus, as it has become one of the most problematic weeds in North America due to widespread adaptation to herbicides, persistence in fields across seasons, and strong competitive ability with both soy and corn (9, 10).
To understand how changing agricultural practices have shaped the success of a ubiquitous weed, we analyzed genomic data from contemporary paired natural and agricultural populations alongside historical samples collected from 1828 until 2018 (Fig 1). With this design, we identify agriculturally adaptive alleles (those that are consistently higher in frequency in agricultural sites than in geographically close natural sites, which constitute contrasts in selective pressures), track their frequency across nearly two centuries, and link the tempo of weed adaptation to demographic changes and key cultural shifts in modern agriculture.
A) Map of 14 contemporary paired natural-agricultural populations (n=187, collected and sequenced in Kreiner et al., 2021), along with 108 novel sequenced herbarium specimens dating back to 1828 collected across three environment types (Ag=Agricultural, Nat=Natural, Dist=Disturbed). B) Distribution of sequenced herbarium samples through time.
The genome-wide signatures of agricultural adaptation
To find alleles favored under current farming practices, we looked for those alleles that were consistently overrepresented in extant agricultural populations compared to neighboring natural populations (11), using Cochran–Mantel–Haenszel (CMH) tests (Fig 2A). Alleles involved in agricultural adaptation (the 0.1% of SNPs with lowest CMH p-values; n=2,055) are significantly enriched for 21 GO-biological process terms related to growth and development, cellular metabolic processes, and responses to biotic, external, and endogenous stimuli, including response to chemicals (Table S1). The importance of chemical inputs in shaping weed agricultural adaptation is clear in that the most significant agriculturally associated SNP (raw p-value = 8.551 × 10^-11, [FDR corrected] q-value = 0.00062) falls just 80 kb outside the gene protoporphyrinogen oxidase (PPO), the target of PPO-inhibiting herbicides (Fig 2B). Other genes with the strongest agricultural associations include ACO1, which has been shown to confer oxidative stress tolerance (12); HB13, involved in pollen viability (13) as well as drought and salt tolerance (13); PME3, involved in growth via germination timing (14); CAM1, a regulator of senescence in response to stress (15, 16); and both CRY2 and CPD, two key regulators of photomorphogenesis and flowering via brassinosteroid signaling (17–20) (Table S2). These signals of agricultural adaptation are notable given that genome-wide differentiation among environments as measured by FST is negligible, a mere 0.0008 (with even lower mean FST between paired sites = -0.0029; Fig 2C), suggesting that despite near panmixia among environments, strong antagonistic selection acts to maintain spatial differentiation for particular alleles.
A) Results from Cochran–Mantel–Haenszel (CMH) tests for SNPs with consistent differentiation among environments across contemporary natural-agricultural population pairs. A 10% FDR threshold is indicated by the lower dashed horizontal black line, while the Bonferroni q-value < 0.1 cutoff is shown by the upper dashed horizontal gray line. Red points indicate focal adaptive SNPs after aggregating linked variation (r2 > 0.25 within 1 Mb). Candidate agriculturally adaptive genes for peaks that are significant at a 10% FDR threshold are shown. B) CMH results from the scaffold containing the most significant CMH p-value, corresponding to variants linked to the PPO210 deletion conferring herbicide resistance and to the nearby herbicide-targeted gene ALS. C) Distribution of FST values between all agricultural and natural samples for ∼3 million genome-wide SNPs (minor allele frequency > 0.05). Vertical lines indicate FST values for the 10 candidate genes named in A. D) Pairwise frequency of six common herbicide resistance alleles across agricultural and natural habitats sampled in 2018; the first four are nonsynonymous variants in ALS and EPSPS, the EPSPSamp is a 10 Mb-scale amplification that includes EPSPS, and the last one is an in-frame single-codon deletion in PPO (each dot represents on average ∼5 individuals). Per migrant natural cost: agricultural benefit ratio relative to migration (C:B) is shown in the top right corner of each locus-specific comparison of frequencies across population pairs.
To further investigate the extent to which herbicides shape adaptation to agriculture, we assayed signals of selection for two complex resistance variants—a deletion of codon 210 within PPO, which is causal for resistance to PPO-inhibiting herbicides (21), and amplification of 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS), which confers resistance to glyphosate herbicides (22). Natural-vs-agricultural FST (highly correlated with the CMH test statistic, Fig S1) at the PPO210 deletion, 0.21, is higher than anywhere else in the genome and is even stronger when calculated within-population pairs (FST= 0.27) (Fig 2C). Similarly, the EPSPS amplification is ranked 20th among genome-wide biallelic FST values, 0.14 (within-pair FST= 0.22), in support of herbicides as a foremost driver of agricultural adaptation.
How differences in selection relative to migration among environments may mediate agricultural adaptation is pivotal for understanding the consequences of agricultural selection in natural environments and the persistence of resistance mutations through time. For 6 common alleles, previously shown to be causal for conferring herbicide resistance (10), as well as the top 30 independent CMH outliers, we implemented a Wright-Fisher allele-frequency-based migration-selection balance model to infer the relative strength of selection favoring resistance alleles in agricultural environments versus selection favoring susceptible alleles in natural environments at equilibrium. While resistance alleles were at intermediate frequency in agricultural populations, ranging from 0.08 to 0.35, they were rarer in natural populations, with frequencies from 0.04 to 0.22, consistent with on-going migration from agricultural into natural environments balanced by selection against these alleles in the absence of herbicides (Fig 2D). Assuming these sites are at equilibrium, we inferred that the costs of resistance per migrant arriving in natural environments are stronger than the costs of susceptibility per migrant arriving in agricultural environments (per migrant costs: benefit ratio ranges from 1.09 for ALS653 to 4.2 for the PPO210 deletion, with a mean = 1.99; Fig 2D, Table S3). For the top 30 independent CMH outliers, the costs in natural environments were about equally likely to be stronger or weaker (12/28, 42%) than the benefits in agricultural environments, scaled to the migration rate (Fig S2). Thus, while substantial gene flow between agricultural and natural sites repeatedly introduces locally unfit alleles across environments, the spread of herbicide resistance alleles appears to be strongly constrained by their cost in herbicide-free, natural environments.
Agriculturally-adaptive alleles change rapidly with intensified regimes
With a strong set of agriculture-associated alleles (251 loci after aggregating linked SNPs), we searched for signatures of temporal evolution using newly collected whole genome sequence data from a set of historical samples (n=108) dating back to 1828, collected from natural, agricultural, and disturbed environments (Fig 1). Of the 165 loci for which we had sufficient information in the historical SNP set (sequenced to 10x coverage on average), 151 were segregating with the same reference/alternate allele combination, and only three were invariant. To model allele frequency change through time on these loci, we implemented logistic regressions of genotypes (within individual allele frequencies) at each locus on collection year, where the slope of the logit-transform is equivalent to the strength of selection (s). Because our historical collections sampled both natural and human-mediated environments through time, we were able to compare allele frequency trajectories and selection across environments.
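The link between the logistic slope and the selection coefficient can be illustrated with a short sketch. Under a simple allele-level model in which the favored allele has relative fitness e^s per year, the log-odds of its frequency increase linearly with time at slope s, so a logistic regression of allele counts on collection year estimates s directly. The function below is an illustrative Newton-Raphson fit written for this text; the function name, the data layout (alternate allele counts out of 2 per diploid sample), and the fitting routine are assumptions and are not taken from the authors' scripts.

```python
import math

def fit_logistic_trend(years, alt_counts, totals, max_iter=100, tol=1e-12):
    """Fit logit(p) = a + s * year by Newton-Raphson; return (a, s).

    years      : collection year of each sample
    alt_counts : copies of the focal allele in each sample (0..2 for diploids)
    totals     : number of allele copies scored per sample (2 for diploids)
    """
    center = sum(years) / len(years)      # centre years for numerical stability
    t = [y - center for y in years]
    a, s = 0.0, 0.0
    for _ in range(max_iter):
        g_a = g_s = h_aa = h_as = h_ss = 0.0
        for ti, k, n in zip(t, alt_counts, totals):
            p = 1.0 / (1.0 + math.exp(-(a + s * ti)))
            resid = k - n * p             # score contribution
            w = n * p * (1.0 - p)         # binomial variance weight
            g_a += resid
            g_s += resid * ti
            h_aa += w
            h_as += w * ti
            h_ss += w * ti * ti
        det = h_aa * h_ss - h_as * h_as
        step_a = (h_ss * g_a - h_as * g_s) / det
        step_s = (h_aa * g_s - h_as * g_a) / det
        a += step_a
        s += step_s
        if abs(step_a) + abs(step_s) < tol:
            break
    # Convert the intercept back to the uncentred year scale.
    return a - s * center, s
```

Fitting each locus separately within each environment type would give locus-specific slopes comparable to the per-locus estimates of s discussed in the text; pooling loci gives a collective estimate.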
Consistent with the rapid change in land use and farming practices in the recent past, the frequency of these 154 contemporary agricultural alleles has increased substantially over the last two centuries. Whereas in present-day natural environments agriculturally-adaptive alleles have increased by 6% on average since 1870, the earliest time point at which we have collections across environment types, these same alleles have increased by 22% in disturbed and agricultural environments (Fig 3A). This observed change greatly exceeds the expected change over this time period, under null processes (drift, migration, and selection) (null 95% interquantile range for allele frequency change in agricultural and disturbed sites = [3.3, 7.9%]; for change in natural sites = [-2.7, 2.0%]). We generated these null expectations by randomly sampling the same number of observed loci across the genome and calculating their allele frequency change through time, where each of the 1000 randomized sets matched the frequency distribution observed for extant agricultural alleles (Fig S4) and were constrained to alleles across the genome that were at a higher frequency in agricultural compared to natural habitats. These randomizations were performed separately across environment types.
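The frequency-matched randomization described above can be sketched as follows. This is a minimal illustration, assuming that matching is done by binning contemporary allele frequencies and that the background list has already been restricted to alleles more common in agricultural than in natural habitats; the bin width, function names, and nearest-rank quantile rule are all assumptions made for this example rather than details from the original analysis.

```python
import random
from collections import defaultdict

def frequency_bin(freq, n_bins=20):
    """Assign an allele frequency in [0, 1] to one of n_bins equal-width bins."""
    return min(int(freq * n_bins), n_bins - 1)

def matched_null_sets(focal_freqs, background_freqs, n_sets=1000, n_bins=20, seed=1):
    """Draw n_sets random locus sets whose frequency bins match the focal loci.

    Returns a list of lists of indices into background_freqs.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for idx, freq in enumerate(background_freqs):
        pools[frequency_bin(freq, n_bins)].append(idx)
    null_sets = []
    for _ in range(n_sets):
        draw = []
        for freq in focal_freqs:
            pool = pools.get(frequency_bin(freq, n_bins))
            if not pool:
                raise ValueError("no background locus in the frequency bin of a focal locus")
            draw.append(rng.choice(pool))
        null_sets.append(draw)
    return null_sets

def interquantile_range(values, lower=0.025, upper=0.975):
    """Nearest-rank (lower, upper) quantiles of a list of null statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]
```

For each null set, the mean allele frequency change through time would be computed in the same way as for the observed loci, and `interquantile_range` applied to the resulting 1000 values gives an interval of the kind reported above.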
A) Agricultural allele frequency trajectories for each locus, in agricultural and disturbed habitats (left), and natural habitats (right). Trajectories coloured by the quantile of frequency change in agricultural and disturbed habitats. Transparent lines indicate those with non-significant evidence of selection at α=0.05 after FDR=10% correction. B) The strength of selection on agricultural alleles for each locus in natural (dark gray) and agricultural and disturbed (light gray) habitats between 1870 and 2018. C) Agricultural allele frequency trajectories in each environment type, before and after the start of agricultural intensification in 1960. Vertical dashed line represents an inferred breakpoint in the data in a segmented regression. Environmental regression lines represent logistic fits to data that either predate or are subsequent to 1960. Large circles represent moving averages (over both loci and individuals) of allele frequencies, whereas dots represent raw genotype data for each locus and sample from which the allele frequency trajectory is estimated. Cropland use per capita in North America data from (1), rescaled by use in 1600. D) The trajectory of alleles at known herbicide resistance loci through time, fit by logistic regression for each of the seven alleles present in our contemporary data. Dots represent genotypes for each historical and contemporary sample at each herbicide resistance locus. 95% credible interval of the maximum likelihood estimate of selection between 1960-2018 provided in the legend for each resistance allele.
The considerable increase in frequency of these alleles across environments corresponds to remarkably strong selection even when estimated over century-long time periods. The 154 agriculture-associated alleles collectively exhibit since the 1870s in agricultural and disturbed habitats but exhibit much weaker selection,
, in natural habitats (agricultural and disturbed null interquantile range = [0.0013, 0.0034]; natural null interquantile range = [-0.0009, 0.0009]). The range of selection estimated across loci varies between -0.098 and 0.075 in natural habitats, and -0.045 and 0.186 in agricultural and disturbed habitats (Fig 3B, Fig S5). The top 15 agriculture-associated alleles that have experienced the strongest, significant selection over the last ∼150 years include SNPs that map near PPO, ACO1, CCB2, WRKY13, BPL3, and ATPD (Table S3). We find that both the total frequency change of agriculture-associated alleles and the estimated strength of selection in agricultural and disturbed environments are positively correlated with the extent of contemporary linkage disequilibrium around these loci (the number of SNPs with r2 > 0.25 within 1Mb) (frequency change: F = 5.16, p = 0.024; strength of selection: F= 3.99, p = 0.048; Fig S6), consistent with theoretical expectations for the genomic signatures around alleles that have recently been impacted by positive selection (23, 24).
Along with evidence of much stronger selection and frequency change of agriculturally adaptive alleles in agricultural versus natural environments over the last 150 years, we find that the trajectory of these alleles among environments varies considerably through time (Fig 3C, Fig S7). While extant pairs of agricultural and natural populations are differentiated by 18% at these loci, this decreases as we look further back in time, so that around 1900, these alleles still had equal frequencies in both environments (predicted 1900 frequency in agricultural and disturbed sites = 41.9% [SE=2.7%], predicted 1900 frequency in natural sites = 38.6% [SE=2.7%]) (Fig 3D). Moreover, when we split out samples into those that predate or are subsequent to the intensification of agriculture during the Green Revolution, we find that the increase in frequency of agricultural alleles was negligible in agricultural and disturbed environments before the 1960s (predicted 1870-1960 change = 0.005), with the subsequent change near completely accounting for the observed rise in frequency of the alleles more common today in agricultural environments (predicted 1960-2018 change = 0.219, versus total 2018-1870 change = 0.221) (Fig 3C). Corresponding estimates of selection by logistic regression using only data from before 1960 shows no evidence of selection on these loci in disturbed and agricultural (, null interquantile range = [-0.0022,0.0010]) or in natural habitats (
, null interquantile range = [-0.002,0.002]). However, samples collected subsequent to 1960 reflect a dramatic shift in selection—a collective
in disturbed and agricultural environments and a collective
in natural environments (ag null interquantile range = [0.0032, 0.0098]; nat null interquantile range = [-0.0028, 0.0027]) (Fig 3C; Fig S8). Together, these results suggest that while most contemporary agricultural alleles were present in historical populations, these alleles only became associated with agricultural and human-managed sites over the last century, on timescales and rates consistent with the rapid uptake and intensification of agrochemicals, controlled irrigation, and mechanization in agriculture.
The historical trajectory of known herbicide resistance alleles epitomizes extreme selection over the last 50 years (Fig 3D). Five out of seven known herbicide resistance loci present in our contemporary collection are absent from our historical samples, consistent with the suggested importance of resistance adaptation from de novo mutation (25, 26). Only three out of 108 historical samples show variation for herbicide resistance, two samples homozygous for resistance at ALS574 and one heterozygous for resistance at ALS122—all of which were sampled after the onset of herbicide applications in the 1960s (Fig 3D). Since 1960, we find that these seven known resistance alleles in our contemporary samples have collectively experienced selection of (Z = 2.11, p = 0.035) per year—ranging from s > 0.097 for PPO210, s > 0.057 for EPSPS106, and s > 0.044 for ALS574 to no evidence of selection on ALS122 and ALS197 (Fig 3D; Table S3). As expected, selection has been particularly strong on these alleles in disturbed and agricultural environments (
, Z = 2.121, p = 0.034), but selection on these known resistance alleles remains high when estimated from samples taken over time in natural environments (
, Z = 1.912, p = 0.056), where presumably the alleles have been recurrently introduced by migration from agricultural sites.
Concurrent temporal shifts in ancestry underlie agricultural adaptation
Finally, we explored whether historical demographic change over the last two centuries has played a role in agricultural adaptation. Early taxonomy described two different A. tuberculatus varieties as separate species, with few distinguishing characteristics (seed dehiscence and tepal length (8)). Sauer’s 1955 revision of the genus, which used herbarium specimens to gauge the distribution and migration of congeners over the last two centuries (27), led him to describe an expansion of the southwestern var. rudis type (at the time, A. tamariscinus (Sauer)) northeastward into the territory of var. tuberculatus (A. tuberculatus (Sauer)), sometime between 1856-1905 and 1906-1955. Our sequencing of over 100 herbarium samples dating back to 1828, combined with nearly 200 contemporary sequences, allowed us to directly observe the change in the distribution of these two ancestral types, adding resolution to Sauer’s morphological observations of the species’ contemporary range shifts at a genome-wide level and over more recent timescales.
Range-wide, we see clear shifts in the distribution of var. rudis ancestry based on faststructure inference at K=2 (Fig S9) across three time spans, 1830-1920, 1920-1980, and 1980-2018 (timespan: F = 5.47, p = 0.0045), and particularly so in the East (timespan x longitude: F = 5.49, p = 0.0045), consistent with a recent expansion of var. rudis ancestry (Fig 4A). Furthermore, we see strong state and province-specific shifts in ancestry through time in our historical sequences (time span by state interaction: F = 4.22, p = 7 × 10^-5), highlighting not only the shift of var. rudis eastwards (with increases through time in Ontario, Ohio, Illinois, and Missouri) but also the very recent introduction of var. tuberculatus ancestry into the westernmost part of the range in Kansas (Fig 4B). A. tuberculatus demography thus appears to have been drastically influenced by human-mediated landscape change over the last two centuries, consistent with the massive recent expansion of effective population size we have previously inferred over this same timeframe (26). That this shift has been most notable over the last 40 years is further consistent with the timescale of rampant herbicide resistance evolution within the species (10, 26, 28), suggesting selection on resistance may facilitate the colonization of var. rudis ancestry outside its historical range. Along these lines, we find this contemporary expansion has facilitated the sorting of var. rudis ancestry across environments (a longitude by time span by environment interaction: F = 5.13, p = 4 × 10^-5; Fig 4C), with increasing overrepresentation of var. rudis ancestry in agricultural and disturbed environments in the eastern portion of the range through time, as previously suggested (11).
A) Longitudinal clines in individual-level var. rudis ancestry over three timespans, illustrating the expansion of var. rudis ancestry eastwards over the last two centuries. B) The distribution of individual-level var. rudis ancestry by state and through time, illustrating state-specific changes in ancestry. Vertical lines represent first, second, and third quantiles of ancestry within each timespan and state. Timespans indicated in A) C) Increasing sorting of individual-level var. rudis ancestry into agricultural environments on contemporary timescales. D) Environment-specific metrics of selection (CMH p-value and cross-population extended haplotype homozygosity (XPEHH)) across the genome in 100 kb windows positively correlate with var. rudis ancestry in agricultural, but not natural habitats.
To investigate whether agricultural adaptation has preferentially favored var. rudis ancestry, we reconstructed fine-scale ancestry across the genome. Based on analyses in 100 kb windows, we find a least squares mean of 5.6% more var. rudis ancestry genome-wide in agricultural environments compared to the adjacent natural habitat (Fig S10). The environment-specific proportion of var. rudis ancestry is not only positively correlated with recombination rate (F = 16.67, p = 4.5 × 10^-5) and gene density (F = 5.85, p = 0.016) but also with SNP and haplotype-based evidence of environment-specific selection. Agricultural, but not natural populations, have an excess of cross-population haplotype homozygosity (agricultural vs natural XPEHH) and within-pair environmental differentiation (CMH p-value) in genomic regions of high var. rudis ancestry (Env x XPEHH: F = 9.34, p = 0.002; Env x CMH: F = 99.70, p < 10^-16; Fig 4D), implying that ancestry composition genome-wide in large part determines the extent of polygenic agricultural adaptation. Together, these findings suggest that the expansion of var. rudis ancestry across the range, particularly in the last 40 years, has facilitated adaptation to novel agricultural selective pressures through providing preadapted genetic variation.
In summary, agricultural adaptation in A. tuberculatus, a native plant in North America, has occurred over extremely rapid timescales, facilitated by range shifts in response to the agriculturalization of its native habitat. The human-mediated expansion of the southern lineage of the species northeastwards since the latter half of the 20th century has introduced new genetic variation across the genome on which selection in agricultural settings could act. Through our paired sampling design, we identified 251 independent SNPs across 240 genes that are implicated in agricultural adaptation; these genes tend to be enriched for expanded southwestern ancestry, with functions affecting growth, development, abiotic tolerance, and herbicide resistance. Concurrent with the intensification of agriculture, the prevalence of agricultural alleles has increased rapidly over just the last 60 years, in agricultural environments by nearly 3% per year, and even in natural sites by more than 1% per year. The first empirical estimates of selection coefficients for herbicide resistance provided here (10% per year range-wide over a 60-year period) emphasize the long-lasting impact of selection on genetic variation even across heterogeneous environments. Modern, industrial agriculture thus imposes strengths of selection rarely observed in the wild.
These results highlight that anthropogenic change not only leads to the formation of new habitats but also provides an opportunity for range expansion that may facilitate, and feed back on, local adaptation, reshaping genetic variation for fitness within native species.
Funding
JMK was supported by the Biodiversity Research Institute at the University of British Columbia and a Killam Fellowship. SIW was supported by an NSERC Discovery Grant and a Canada Research Chair. JRS was supported by an NSERC Discovery Grant. SML, HAB and DW were supported by the Max Planck Society.
Author Contributions
JMK, JRS, and SIW conceptualized the paired sampling design, JMK, HAB, DW, JRS, and SIW conceptualized the use of herbarium data, JMK performed contemporary collections and curated the herbarium samples, SML and HAB conceptualized and designed the molecular work with herbarium specimens, SML coordinated the clean room facility work, JMK and SML performed DNA extraction and library preparations of herbarium tissue, SML oversaw the sequencing of herbarium specimens. JMK performed analyses with input from SPO, SIW, and JRS. SPO wrote the migration-selection and maximum likelihood models. JMK wrote and revised the paper with inputs from all authors.
Competing interests
The authors declare that they have no competing interests.
Data and materials availability
All novel sequence data will be archived on SRA, while scripts and accompanying metadata will be archived on GitHub and Dryad upon acceptance.
Supplementary Materials
Materials & Methods
Herbarium collections
In 2019, we obtained 10 mg tissue collections of herbarium specimens from 7 herbaria across Canada and the United States and one governmental organization: the Royal Ontario Museum Herbarium, the Museum of Biological Diversity at Ohio State University Herbarium, the Dean Herbarium at Indiana State University, the Michigan State University Herbarium, the Illinois Natural History Survey Herbarium, Missouri Botanical Gardens, The McGregor Herbarium at Kansas State University, and Agriculture and Agri-Food Canada. We selected samples to have an even representation of habitats through time. Samples were classified as natural (n=54), agricultural (n=28), or disturbed (n=20) based on collectors’ annotations on each plate: any reference to a cultivated field was treated as an ‘agricultural’ collection; general environmental descriptions such as dry grassland or riverbank were treated as a ‘natural’ collection; and any reference to disturbed soil, railroad tracks, or manicured or managed land was treated as a ‘disturbed’ collection. For inference of contemporary allele frequency and ancestry change through time, samples collected from disturbed habitats were grouped together with the agricultural category, since waterhemp exists as a weed in both (Table S5). When geographic coordinates were not provided, we referred to the state, county, section, intersection, and landmark descriptions to infer the geographic coordinate of a given sample. In total, we collected samples from 172 specimens, 108 of which were selected for whole-genome sequencing.
Herbarium DNA extractions & library preparations
The work was performed in the ancient DNA lab at the University of Tübingen. For DNA extraction of the herbarium samples, we followed basic protocol 1 outlined in (29). Briefly, under sterile conditions, ∼10 mg of each sample was ground and incubated overnight with an N-phenacylthiazolium bromide (PTB)-based mix to release DNA. After a shredding step with QIAshredder spin columns, DNA was purified and eluted with DNeasy Mini spin columns. Sequencing libraries were prepared using the basic protocol 2 outlined in (29), performing blunt-end repair, adapter ligation, a fill-in reaction, indexing, and finally PCR amplification (10 cycles) and a cleaning step. The libraries were sequenced on an Illumina NovaSeq instrument on a single flow cell. The sequencing run produced ∼3,442 Gb of data, an average of 32 Gb per sample.
Mapping, damage correction, SNP calling and filtering
We used fastp (30) to remove adapters and polyG tails from the herbarium sequencing reads and to merge overlapping read pairs. Because of the small fragment size of historical DNA, this resulted in a sizable loss of sequence coverage, from 46X to a mean of 11X. On average, 89% of merged reads mapped to the female reference genome from (31), suggesting low rates of contamination by exogenous DNA. Finally, we performed de-duplication of merged reads with DeDup (32), which is optimized for merged paired-end sequencing data. This resulted in a final mean per-sample coverage of 9.7X.
We used the program MapDamage (33) to quantify damage patterns in the historical DNA. The fraction of C deamination, which leads to C-to-T substitutions, was low, at the first base ∼2% on average across samples, barely inflated above the C-to-T substitution rate across the rest of the reads (Fig S3). Nonetheless, the fraction of C-to-T substitutions at the first base was positively correlated with the age of the samples (Fig S3). We thus used MapDamage to rescale mapping quality scores to take into account the patterns of DNA damage. We called SNPs with freebayes (v1.3.2) in 100 kb regions in parallel across the genome, merged, and then filtered SNPs based on quality (QUAL > 30) and missing data (< 0.30).
Herbicide resistance alleles in herbarium samples were identified based on known locations of non-synonymous substitutions within ALS and EPSPS. Initially, two genotype calls from herbarium samples that predated the onset of ALS herbicide use in the 1950s showed standing variation for resistance at ALS574 and ALS122: one individual heterozygous for Trp-574-Leu collected in 1930 from a sandy agricultural field in St. Louis, Missouri, USA (HB0973); and another individual heterozygous for Ala-122-Ser collected in 1895 from a corn field in Fayette, Ohio, USA (HB0914). Upon further inspection, read-level support for resistance alleles was low, with the allelic bias at these genotype calls being highly skewed (reference to alternate ratio = 1:9 and 2:18, respectively). Similarly, one individual collected in 1967 from the Bottom of Maumee River, Ohio (HB0977) was heterozygous for ALS122, but the alternate resistance allele had support at only one read (reference to alternate ratio = 1:7). We subsequently dropped these genotype calls from analyses of selection on herbicide resistance alleles through time.
Metrics of differentiation across Environments: CMH, FST, & XPEHH
We used the 7,262,599 genome-wide high-quality SNPs called from contemporary agricultural-natural paired populations (n=187 individuals total from 17 pairs of populations, 34 populations in total) from (11) (Fig 1). Previously, these data had been only used for genome-wide PCA and faststructure based individual-level ancestry estimates. To make use of our paired sampling design, we used plink (34) to perform a Cochran–Mantel–Haenszel test, testing an (environment by SNP | pair) effect after applying a minor allele frequency cutoff of 0.01. We identified candidate agriculturally-adaptive genes based on the nearest gene (bedtools closest) to each LD-clumped, FDR q-value < 0.1 SNP. We found the Arabidopsis thaliana orthologues of our A. tuberculatus genes with orthofinder (35). For genes where orthofinder found no A. tuberculatus orthologue and in which our annotation identified no orthologue in closely related species based on gene expression data, we used blastn (36) to perform a conclusive search for similar genes across species.
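As an illustration of what the CMH test evaluates at a single SNP, the sketch below computes the Cochran–Mantel–Haenszel statistic from one 2×2 table per population pair (rows: agricultural vs natural; columns: alternate vs reference allele counts). It is a plain re-implementation of the textbook statistic without continuity correction, written for this text; it is not plink's code, and the function name and input layout are assumptions.

```python
import math

def cmh_test(tables):
    """Cochran-Mantel-Haenszel test across strata of 2x2 tables.

    tables: iterable of (a, b, c, d), one tuple per population pair, where
        a = alternate allele count in the agricultural population
        b = reference allele count in the agricultural population
        c = alternate allele count in the natural population
        d = reference allele count in the natural population
    Returns (chi-square statistic with 1 df, p-value).
    """
    diff = 0.0   # sum over pairs of observed minus expected count in cell a
    var = 0.0    # sum over pairs of the hypergeometric variance of cell a
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a pair with fewer than two allele copies is uninformative
        expected = (a + b) * (a + c) / n
        diff += a - expected
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if var == 0:
        return 0.0, 1.0
    stat = diff * diff / var
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Because deviations are summed across pairs before squaring, an allele only scores highly when it is enriched in the same environment in most pairs, which is the property the paired design relies on.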
Additionally, we used plink to calculate Weir and Cockerham’s FST, both between all natural and agricultural samples, and between environments within each population pair, which we later averaged to obtain the mean pairwise FST. For calculation of FST at the EPSPS amplification, we recoded individuals as 0, 1, 2 based on copy number amplitude (<1.5, 1.5 < copies < 2.5, and >2.5, respectively). We used selscan (37) to calculate the cross-population extended haplotype homozygosity, after read-back and population-level phasing with Shapeit2 (38), both of which required knowledge of recombination rates, which we supplied in the format of our imputed LD-based map from (31).
Models of Migration-Selection Balance
We modeled migration-selection balance between natural and agricultural habitats in our contemporary data for 6 common target-site resistance alleles, based on a two-patch, allele-focused model. In each patch, we modeled the frequency of the resistance allele and the susceptible (xS, xR and yS, yR, respectively) from year to year as:
where xS*, yS* represent the frequency of the susceptible allele and xR*, yR* represent the frequency of the resistance allele after a bout of selection in natural (x) and agricultural (y) sites within a generation. We then allowed for migration of surviving genotypes and modeled their frequency in the next generation as follows:
where mN and mA represent immigration rates into natural and agricultural sites, respectively. Assuming additivity (h=0.5) and that migration at the loci is much weaker than selection (m << s), a given pair of populations is expected to approach a polymorphic equilibrium, where:
While it is not possible to solve for selection directly in the absence of data on migration rates, these formulae allow us to estimate the strength of divergence by inferring the strength of selection relative to migration in natural and agricultural environments, as presented in Table S3. The ratio of these metrics gives the ratio of the cost faced per migrant arriving in natural environments versus the benefit per migrant in agricultural environments, assuming that the pair of populations is near equilibrium. We note that the approach to migration-selection balance occurs exponentially at a rate proportional to the selection coefficient (when m << s << 1) and so should occur rapidly at sites under strong selection (Supplemental Index 1).
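The two-patch dynamics described above can be sketched numerically. The Python example below is a simplified, haploid-style allele recursion written under stated assumptions: the resistance allele has relative fitness 1 - cost in natural sites and 1 + benefit in agricultural sites, selection precedes migration, and m_nat and m_ag are the immigration rates into each patch. The parameterization and all names are hypothetical and are not taken from the original equations.

```python
def next_generation(x_r, y_r, cost, benefit, m_nat, m_ag):
    """One generation of selection followed by migration.

    x_r, y_r: resistance allele frequency in natural and agricultural sites.
    Assumed fitnesses (hypothetical): resistance is 1 - cost in natural
    sites and 1 + benefit in agricultural sites; susceptible is 1 in both.
    """
    # Frequencies after selection within each patch
    x_sel = x_r * (1 - cost) / (1 - cost * x_r)
    y_sel = y_r * (1 + benefit) / (1 + benefit * y_r)
    # Frequencies after migration between patches
    x_next = (1 - m_nat) * x_sel + m_nat * y_sel
    y_next = (1 - m_ag) * y_sel + m_ag * x_sel
    return x_next, y_next

def run_to_equilibrium(cost, benefit, m_nat, m_ag,
                       x_r=0.5, y_r=0.5, generations=5000):
    """Iterate the recursion long enough to approach the polymorphic equilibrium."""
    for _ in range(generations):
        x_r, y_r = next_generation(x_r, y_r, cost, benefit, m_nat, m_ag)
    return x_r, y_r

# With migration much weaker than selection, the resistance allele is kept
# rare in natural sites (roughly m_nat / cost) and common in agricultural sites.
x_eq, y_eq = run_to_equilibrium(cost=0.1, benefit=0.1, m_nat=0.001, m_ag=0.001)
```

Under these assumptions the equilibrium frequencies depend on selection only through its ratio to migration, which is why the observed frequencies can inform selection relative to migration but not selection alone.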
Logistic models of temporal allele frequency change
We used CMH outliers from the contemporary paired population scan to investigate patterns of agricultural-allele frequency change over the last two centuries. We were interested in tracking independent allele frequency trajectories, so from the 403 SNPs with CMH p-values that passed a 10% FDR threshold (p < 6 × 10⁻⁶), we performed a subsequent clumping step, retaining a set of largely unlinked SNPs (Fig S11), each representing the most significant SNP in its region. Specifically, we used plink --clump to find the most significant hit genome-wide, scan 1 Mb around it, and exclude from the resulting output any SNP with r2 > 0.25 with the focal SNP. This procedure is repeated until all SNPs passing the genome-wide significance threshold have been clumped. This resulted in 251 loci that on average showed a 17.9% allele frequency difference between extant agricultural and natural environments. Because some of the alternate alleles across these loci were more frequent in natural environments, we redefined the alleles based on which one was more common in agricultural compared to natural sites.
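The greedy clumping procedure described above can be sketched as follows. This Python example is an illustration of the general algorithm rather than plink's implementation; the data layout, the function name, and the user-supplied r2 lookup are assumptions made for the example.

```python
def clump(snps, r2, p_threshold, window_bp=1_000_000, r2_threshold=0.25):
    """Greedy LD clumping of significant SNPs.

    snps: list of dicts with keys 'id', 'chrom', 'pos', 'p' (hypothetical layout).
    r2:   function (id_a, id_b) -> squared correlation between two SNPs.
    Returns the ids of the retained index SNPs, most significant first.
    """
    # Only SNPs passing the significance threshold take part, ordered by p-value
    candidates = sorted((s for s in snps if s["p"] < p_threshold),
                        key=lambda s: s["p"])
    assigned = set()     # SNPs already used as an index SNP or clumped away
    index_snps = []
    for focal in candidates:
        if focal["id"] in assigned:
            continue
        index_snps.append(focal["id"])
        assigned.add(focal["id"])
        for other in candidates:
            if other["id"] in assigned:
                continue
            nearby = (other["chrom"] == focal["chrom"]
                      and abs(other["pos"] - focal["pos"]) <= window_bp)
            if nearby and r2(focal["id"], other["id"]) > r2_threshold:
                assigned.add(other["id"])   # clumped with the focal SNP
    return index_snps

# Hypothetical example: B is in strong LD with A and is absorbed into its clump.
example = [
    {"id": "A", "chrom": "1", "pos": 100, "p": 1e-9},
    {"id": "B", "chrom": "1", "pos": 5_000, "p": 1e-8},
    {"id": "C", "chrom": "1", "pos": 3_000_000, "p": 1e-7},
    {"id": "D", "chrom": "2", "pos": 100, "p": 2e-7},
    {"id": "E", "chrom": "1", "pos": 200, "p": 0.5},
]
ld = {frozenset(("A", "B")): 0.9}
kept = clump(example, lambda a, b: ld.get(frozenset((a, b)), 0.0), p_threshold=6e-6)
```

In this toy input the retained index SNPs are A, C, and D: B is removed because of its linkage with A, and E never enters because it does not pass the significance threshold.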
We then found the intersection of these agriculture-associated alleles, identified in our contemporary paired collections, with the historical, filtered SNPs from the herbarium sequence data. 154 loci were present in the historical samples with the same reference/alternate allele combinations. We extracted a matrix of 0, 0.5, 1 values, representing the frequency of the agricultural allele for each locus within each individual, for samples from both our contemporary and historical collections. Combining these individual agricultural allele frequencies at each locus across historical and contemporary datasets, we then performed a logistic regression in R (glm function, family=“binomial”) of genotype on collection year, separately on samples from either natural or agricultural environments. From each logistic regression, we extracted the logit-transformed slope (selection coefficient, s), p-value, and standard error, as well as the predicted value (allele frequency) at 1870 and 2018, representing the minimum sample year and maximum sample year. While we have samples dating back to 1828, we constrained this analysis to samples collected after 1870, as the density of samples before then is low (n=4), with no representation of samples from agricultural environments.
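To make the per-locus regression concrete, the sketch below fits a logistic regression of agricultural allele frequency on collection year by Newton-Raphson in plain Python. It stands in for the R glm call only loosely: treating each individual as two independent allele copies, centering years at 1950 for numerical stability, and all function names are assumptions of this example.

```python
import math

def fit_logistic(years, freqs, copies=2, center=1950.0, max_iter=100):
    """Fit logit(p) = a + s * (year - center) by Newton-Raphson.

    freqs are per-individual agricultural allele frequencies (0, 0.5, or 1);
    each individual is assumed to contribute `copies` allele copies.
    Returns (a, s, standard error of s).
    """
    a, s = 0.0, 0.0
    se_s = float("nan")
    for _ in range(max_iter):
        g_a = g_s = h_aa = h_as = h_ss = 0.0
        for year, f in zip(years, freqs):
            t = year - center
            p = 1.0 / (1.0 + math.exp(-(a + s * t)))
            resid = copies * (f - p)          # score contribution
            w = copies * p * (1.0 - p)        # information weight
            g_a += resid
            g_s += resid * t
            h_aa += w
            h_as += w * t
            h_ss += w * t * t
        det = h_aa * h_ss - h_as * h_as
        if det <= 0:
            break                             # information matrix is singular
        se_s = math.sqrt(h_aa / det)
        step_a = (h_ss * g_a - h_as * g_s) / det
        step_s = (h_aa * g_s - h_as * g_a) / det
        a += step_a
        s += step_s
        if abs(step_a) + abs(step_s) < 1e-10:
            break
    return a, s, se_s

def predict(a, s, year, center=1950.0):
    """Predicted allele frequency in a given year."""
    return 1.0 / (1.0 + math.exp(-(a + s * (year - center))))
```

With a fitted intercept and slope, the difference predict(a, s, 2018) - predict(a, s, 1870) gives the predicted total allele frequency change over the sampled period, and the slope s is the per-year change in log-odds that is interpreted as a selection coefficient.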
The total allele frequency change at each locus was calculated by taking the difference between the predicted frequency of the allele in 2018 and 1870. We merged the output of these locus-specific logistic regressions in agricultural environments, with both SNP and haplotype-based statistics from these same individuals to identify contemporary correlates of the magnitude of allele frequency change and selection through time. Specifically, we examined how well contemporary recombination rate, XPEHH, the CMH p-value, the number of SNPs in linkage (r2 > 0.25) with the focal SNP (< 1Mb; i.e. number of SNPs in a clump), and the distance between linked SNPs, explained both the total allele frequency change and the estimated strength of selection (Fig S6).
We also performed a separate set of analyses, where a logistic regression was used to analyze the trajectory of all agricultural alleles or known herbicide resistance alleles at once, first across samples from natural environments and then for samples from agricultural and disturbed environments (‘genotype ∼ year + locus’; Fig 3D). We further partitioned samples in each environment into those that predate or are subsequent to the 1960s, to infer the importance of the intensification of agriculture and herbicides in shaping the strength of selection on contemporary agricultural loci. For each of the four logistic regressions run on these partitioned sets of data, the slope of the year term represents a joint estimate of the strength of selection for agricultural alleles, between 1870-1960 or 1960-2018, in natural or agricultural environments. We refer to this joint estimate of selection at multiple loci as s.
To test whether a comparison of selection before and after the 1960s was statistically supported, we also compared our full model analyzing temporal signatures of allele frequency change between 1870-2018 to one that fits either two or three logistic regression lines over that time frame (i.e. a segmented logistic regression with one or two breakpoints). A segmented logistic regression with two breakpoints provides the best fit to our data, compared to a model with either one or no breakpoints (two-break segmented AIC = 54360.55, one-break segmented AIC = 54437.66, non-segmented AIC = 54444.67), and converges on breakpoints at 1913 and 1961, the latter supporting our a priori hypothesis and our interest in interrogating signals before and after the start of the Green Revolution in 1960 (Fig 3D).
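The model comparison can be restated as AIC differences relative to the best-fitting model. The short Python sketch below uses only the three AIC values reported above; the dictionary labels are ours.

```python
# AIC values as reported in the text; lower is better
aic = {
    "no breakpoint": 54444.67,
    "one breakpoint": 54437.66,
    "two breakpoints": 54360.55,
}

best_model = min(aic, key=aic.get)
# Difference in AIC between each model and the best-fitting model
delta_aic = {name: value - aic[best_model] for name, value in aic.items()}
```

The two-breakpoint model is favored by roughly 77 AIC units over the one-breakpoint model and roughly 84 units over the non-segmented model.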
We designed a randomization test based on observed allele frequency changes across the genome to obtain an expected distribution under null processes (drift, migration, and selection). In particular, we were interested in quantifying the potential bias in higher frequency agricultural alleles having the leeway to change more through time, as compared to a set of lower frequency alleles. We thus randomly sampled 154 loci from our contemporary collections (the same number as our observed clumped and historically matched set of agricultural alleles), 1000x across the genome, exactly matching the frequency distribution observed for extant agricultural alleles, first in extant agricultural and then in extant natural environments. To emphasize, this randomization was done independently in each environment, such that the alleles sampled to match the extant agricultural-allele frequency distribution in agricultural environments in one iteration were different from the alleles sampled to match the frequency distribution in natural environments (Fig S4). To account for the ascertainment bias in our set of putatively agriculturally adaptive alleles—finding alleles that show the greatest excess of allele frequency in agricultural compared to natural environments—we further constrained these randomizations to alleles across the genome which were at greater frequency in agricultural than in natural environments. On each of the 1000 randomizations within each environment, we then performed the same analyses as above: matching these alleles in our historical samples, producing a matrix of genotype data for both contemporary and historical sets, and performing a logistic regression for each locus, as well as logistic regression on all loci at once, for either samples from natural or agricultural environments, and for those that either preceded or were subsequent to 1960. 
For the 1000x randomizations within agricultural and natural environments, we then computed the 2.5% and 97.5% quantiles (“null 95% interquantile range”) of the statistics of interest (total allele frequency change and selection coefficients) to compare against our observed values.
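The logic of the frequency-matched randomization can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline used here: matching is done within frequency bins of width 0.05 rather than exactly, the per-locus statistic is supplied as a precomputed list, and all names are hypothetical.

```python
import random

def matched_sample(observed_freqs, genome_freqs, rng, bin_width=0.05):
    """Pick one genome-wide locus per observed allele from the same frequency bin."""
    bins = {}
    for idx, f in enumerate(genome_freqs):
        bins.setdefault(int(f / bin_width), []).append(idx)
    used, chosen = set(), []
    for f in observed_freqs:
        pool = [i for i in bins.get(int(f / bin_width), []) if i not in used]
        if not pool:
            raise ValueError("no unused locus available in this frequency bin")
        pick = rng.choice(pool)
        used.add(pick)          # sample without replacement within an iteration
        chosen.append(pick)
    return chosen

def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def null_interval(observed_freqs, genome_freqs, genome_stat, n_iter=1000, seed=1):
    """Null 95% interquantile range of the mean statistic across matched sets."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_iter):
        idx = matched_sample(observed_freqs, genome_freqs, rng)
        means.append(sum(genome_stat[i] for i in idx) / len(idx))
    return quantile(means, 0.025), quantile(means, 0.975)
```

An observed mean that falls outside the returned interval would be more extreme than expected for alleles of comparable starting frequency.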
Maximum likelihood estimate of selection
For the 7 known herbicide resistance alleles, we were particularly interested in individual estimates of selection on each allele over time. We used a maximum likelihood approach to estimate the strength of selection for each resistance allele between 1960-2018, along with a 95% confidence interval using profile likelihood. Summing over all years (t), the log-likelihood of observing the data is given by the binomial sampling formula describing the chance of observing the number of resistance (nR) and susceptible alleles (nS) in any given year:
where p represents the initial frequency of the allele when t = 0 (defined as the present) and s represents the strength of selection, both of which are unknown and estimated by maximizing the likelihood. Because many of the resistant alleles were only observed in contemporary samples, selection must be sufficiently strong to explain this rise, but the maximum strength of selection cannot be determined (the likelihood surface becomes flat). We thus only present the 95% confidence interval in the text (i.e., those values of s for which ln(L) falls within χ²₁[0.05]/2 of the maximum likelihood). We implemented this algorithm in R, using the mle2 function from the bbmle package.
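A minimal Python sketch of this kind of estimate is given below. It assumes a logistic allele frequency trajectory, logit(p_t) = logit(p) + s t with t = 0 at the present and negative in the past, and a binomial log-likelihood summed over years; these forms, the grid search over s, and all names are assumptions of the example and not a transcription of the mle2 implementation.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_lik(logit_p0, s, data):
    """Binomial log-likelihood (constant terms dropped).

    data: list of (t, n_resistant, n_susceptible), with t = year - present.
    """
    total = 0.0
    for t, n_r, n_s in data:
        z = logit_p0 + s * t
        total += n_r * log_sigmoid(z) + n_s * log_sigmoid(-z)
    return total

def profile_log_lik(s, data, lo=-30.0, hi=30.0, tol=1e-7):
    """Maximize the log-likelihood over logit(p) for fixed s (golden-section search)."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    fc, fd = log_lik(c, s, data), log_lik(d, s, data)
    while b - a > tol:
        if fc > fd:                      # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = log_lik(c, s, data)
        else:                            # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = log_lik(d, s, data)
    return max(fc, fd)

def estimate_selection(data, s_grid):
    """Return (s_hat, lower, upper) using a profile-likelihood 95% interval."""
    profile = [(s, profile_log_lik(s, data)) for s in s_grid]
    s_hat, max_ll = max(profile, key=lambda pair: pair[1])
    cutoff = max_ll - 3.841458820694124 / 2.0   # chi-square(1 df, 0.05) / 2
    inside = [s for s, ll in profile if ll >= cutoff]
    return s_hat, min(inside), max(inside)

# Hypothetical allele counts (t, resistant, susceptible) at six time points
counts = [(-50, 0, 40), (-40, 1, 39), (-30, 2, 38),
          (-20, 5, 35), (-10, 11, 29), (0, 20, 20)]
s_hat, s_low, s_high = estimate_selection(counts, [i / 1000 for i in range(301)])
```

When an allele is absent from all but the most recent samples, the upper end of such an interval runs to the edge of any finite grid, which mirrors the flat likelihood surface noted above.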
Ancestry inference
For genome-wide ancestry inference, we merged filtered SNPs from herbarium samples with high-quality SNP sets from (11) (n=187, collections from 2018) and (31) (n=162, collections from 2015), resulting in 457 individuals and representing all resequenced A. tuberculatus whole genomes. We used faststructure (39) to infer individual-level ancestry, taking the proportion of an individual’s assignment to a grouping at K=2 to represent either var. rudis or var. tuberculatus ancestry. An individual’s proportion of var. rudis ancestry was then analyzed in a multivariate regression that tested how well var. rudis ancestry was explained by longitude, latitude, environment (natural or agricultural), timespan (1800-1920 [n=39], 1920-1980 [n=44], 1980-2020 [n=374]), a two-way timespan by longitude interaction, a two-way timespan by state interaction, and a three-way timespan by environment by longitude interaction:
Individual ancestry assignment ∼ longitude + latitude + environment + timespan + timespan:longitude + timespan:state + timespan:environment:longitude
We also used plink to perform a principal component analysis of merged SNPs from just herbarium samples (Fig S12) and all 457 samples jointly (Fig S13).
We were interested in the distribution of var. rudis ancestry across the genome, and so used LAMP (40) to assign ancestry to SNPs, based on two reference populations homogeneous for either var. rudis or var. tuberculatus ancestry (Kansas and Ontario natural populations, respectively; (31)). Ancestry-informative SNPs were those with an FST > 0.40 (2x the mean genome-wide ancestry differentiation between varieties, in these two populations) between these reference populations and that were also in common between datasets (<20% of samples with missing data) after merging historical sequences with the contemporary paired sequence data (11). Since LAMP requires recombination rate information, we also imputed the LD-based genetic map from (31) to the ancestry-informative SNPs to get the genetic distance between each. Finally, we performed the LAMP analysis, one population at a time, one scaffold at a time. After merging SNP-wise ancestry assignments across scaffolds, we calculated the mean, 5%, and 95% quantile of var. rudis ancestry in 100 kb regions for each population, and eventually, each environment (Fig S10).
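The windowed summary in the final step can be illustrated with a short Python sketch. It assumes SNP positions and SNP-wise var. rudis ancestry values for a single scaffold and population are already in memory; the function names and the interpolation rule used for quantiles are choices made for this example.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def window_summary(positions, ancestry, window_bp=100_000):
    """Mean, 5%, and 95% quantile of ancestry in non-overlapping windows.

    positions: SNP coordinates (bp) on one scaffold.
    ancestry:  var. rudis ancestry at each SNP (values between 0 and 1).
    Returns {window start: (mean, 5% quantile, 95% quantile)}.
    """
    windows = {}
    for pos, value in zip(positions, ancestry):
        windows.setdefault(pos // window_bp, []).append(value)
    summary = {}
    for index, values in sorted(windows.items()):
        summary[index * window_bp] = (
            sum(values) / len(values),
            quantile(values, 0.05),
            quantile(values, 0.95),
        )
    return summary

# Hypothetical SNPs falling into three 100 kb windows
result = window_summary([10, 50_000, 99_999, 100_000, 250_000],
                        [0.0, 0.5, 1.0, 1.0, 0.2])
```

Windows without any ancestry-informative SNP are simply absent from the output, so downstream averaging across populations would need to handle missing windows explicitly.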
To understand the relationship between ancestry, agricultural selection, and genomic architecture, we performed a multiple regression to quantify drivers of fine-scale ancestry across the genome. We regressed the individual proportion of var. rudis ancestry in 100 kb windows across the genome against gene density, recombination rate, scaffold, environment, average CMH score, average XPEHH (difference in extended haplotype homozygosity across environments), the interaction between environment and average CMH score in each window, and the interaction between environment and the mean XPEHH in each window.
100kb mean ancestry ∼ scaffold + mean_genedensity + mean_recomb + mean_xpehh:env + mean_cmh:env + env
The least squares effect of environment on ancestry was taken to calculate the average difference in ancestry between agricultural and natural environments.
GO Enrichment results for the top 0.01% CMH outliers (n=2055 SNPs).
Gene and orthologue information for the 50 SNPs with the most significant CMH p-values, sorted by Scaffold and then CMH p-value. AMATA=Amaranthus tuberculatus, AT=Arabidopsis thaliana.
Selection-migration differentiation statistics for 6 common resistance alleles, along with estimates of selection inferred by logistic regression of allele frequency through time. Ag, agricultural sites; Nat, natural sites. Cost and benefit estimates are shown here for the additive (h=0.5) case. s (1960-2018) represents the maximum likelihood estimate of selection from the binomial sampling equation of allele frequency change, and we provide the associated 95% confidence interval.
The top 15 loci with the strongest evidence of temporal selection between 1970 and 2018.
Metadata on herbarium collections.
CMH X2 statistic against between-environment FST, with the latter not stratifying for population pair.
Agricultural versus natural strength of selection relative to migration for the 30 independent loci with the most significant CMH scan hits compared to 6 common herbicide resistance alleles. Diagonal line represents equal agricultural benefits compared to natural costs, scaled by migration.
Percent C-to-T substitution at first base, and its correlation with read position and collection year for 108 sequenced herbarium samples. Right figure additionally illustrates C-to-T substitution level for the six samples with known target-site resistance alleles (three TSR genotypes excluded from selection analyses due to allelic bias).
The distribution of frequencies for agriculturally adaptive alleles in agricultural samples along the x-axis, and in natural samples along the y-axis. Null distributions for an expectation of change in the frequency in our focal set of contemporary alleles was generated by producing randomized allele sets of the same size (n=154) matching the extant agricultural-allele frequency distributions shown here, first in natural environments (top histogram), and then in agricultural environments (right histogram).
Inferred strength of selection on 154 agricultural alleles through time, in either agricultural or natural environments. Selection coefficients were extracted from logit-transformed logistic regressions of genotype on year, run separately for each locus in each environment. Gray ribbon for each locus represents the bounds of the standard error associated with the estimate of selection in each environment.
The association between contemporary patterns of linkage and selection and allele frequency change observed over the last 150 years across herbarium samples. Regression line shows the least square mean effect of contemporary linkage from a multiple regression analysis.
Cubic splines that illustrate the environment-specific frequency change of agricultural alleles through time since 1870. Gray ribbon denotes the 95% CI.
Logistic estimates of selection before (left) and after (right) the 1960s, the start of agricultural intensification, for agriculturally-associated alleles in natural (dark gray) versus agricultural and disturbed (light gray) environments.
Longitudinal and state-wise patterns of ancestry across 457 A. tuberculatus individuals from contemporary and historical sampling, inferred from faststructure. Samples sorted by longitude, from west (left) to east (right). White dashed lines denote clusters of specimens sampled from different states and provinces across this longitudinal gradient. K=2 taken as var. rudis versus var. tuberculatus ancestry, as in (31).
Excess of var. rudis ancestry in agricultural compared to natural environments, in 100 kb regions across the genome. Lines depict the mean ancestry across all populations within each environment, with error bars showing the mean 5th and 95th percentile of ancestry across populations. Fine-scale ancestry estimates were inferred with LAMP (40).
Heatmap of r2 values alongside a dendrogram of the 254 agriculturally associated SNPs identified through CMH tests across paired contemporary natural-agricultural samples, illustrating independence among the 154 LD-clumped CMH outliers.
PCA of herbarium samples, coloured by state/province.
Acknowledgements
We appreciate the pivotal contribution of numerous herbaria towards this research, especially the help of Eric Knox at Dean Herbarium at Indiana State University, Jamie Lynn Minnaert-Grote at the University of Illinois, Tedesse Mesfin at the University of Ohio Herbarium, Anton Reznicek at the University of Michigan Herbarium, Jim Solomon at the Missouri Botanical Gardens, Caleb Morse at the McGregor Herbarium at Kansas State University, Tyler Smith and Song Wang at Agriculture and Agri-Food Canada, and Deb Metsger and Tim Dickinson at the Royal Ontario Museum. We thank the Whitlock lab (University of British Columbia), as well as Aneil Agrawal and Tyler Kent (University of Toronto) for input on the work; Christa Lanz and Rebecca Schwab (Max Planck Institute) for coordinating sequencing of herbarium samples; Ella Reiter (University of Leipzig) for scheduling and coordinating logistics for clean room facility work; and Patricia Lang, Sonja Kersten and Heike Budde (Max Planck Institute) for advice on troubleshooting molecular protocols.