Abstract
Species often include multiple ecotypes that are adapted to different environments. But how do ecotypes arise, and how are their distinctive combinations of adaptive alleles maintained despite hybridization with non-adapted populations? Re-sequencing of 1506 wild sunflowers from three species identified 37 large (1-100 Mbp), non-recombining haplotype blocks associated with numerous ecologically relevant traits, and soil and climate characteristics. Limited recombination in these regions keeps adaptive alleles together, and we find that they differentiate several sunflower ecotypes; for example, they control a 77 day difference in flowering between ecotypes of silverleaf sunflower (likely through deletion of a FLOWERING LOCUS T homolog), and are associated with seed size, flowering time and soil fertility in dune-adapted sunflowers. These haplotypes are highly divergent, associated with polymorphic structural variants, and often appear to represent introgressions from other, possibly extinct, congeners. This work highlights a pervasive role of structural variation in maintaining complex ecotypic adaptation.
Local adaptation is common in species that experience different environments across their range. This can result in the formation of ecotypes, ecological races with distinct morphological and/or physiological characteristics that provide an environment-specific fitness advantage. Despite the prevalence of ecotypic differentiation, much remains to be understood about its genetic basis and the evolutionary mechanisms leading to its establishment and maintenance. In particular, a longstanding evolutionary question, dating back to criticisms of Darwin’s theories by his contemporaries1, concerns how such ecological divergence can occur in the presence of hybridization with non-adapted populations2. Local adaptation typically requires alleles at multiple loci contributing to increased fitness in the same environment; however, different ecotypes are often geographically close and interfertile, and gene flow between them should break up adaptive allelic combinations by promoting recombination with non-adaptive alleles3.
To better understand the genetic basis of local adaptation and ecotypic differentiation, we conducted an in-depth study of genetic, phenotypic, and environmental variation in three annual sunflower species, which include multiple reproductively compatible ecotypes. Two of these species have broad, overlapping distributions across North America; Helianthus annuus, the common sunflower, is the closest wild relative of cultivated sunflower, which was domesticated from it around 4,000 years ago in East-Central North America4. Populations of H. annuus are generally found on mesic soils, but can grow in a variety of disturbed or extreme habitats, including semi-desertic or frequently flooded areas, as well as salt marshes. An especially well-characterized ecotype (formally H. annuus subsp. texanus), is adapted to the higher temperatures and herbivore pressures in Texas, USA5. Helianthus petiolaris, the prairie sunflower, prefers sandier soils, and ecotypes of this species are adapted to sand sheets and dunes6. The third species, Helianthus argophyllus, the silverleaf sunflower, is found exclusively in southern Texas and includes both an early flowering coastal island ecotype and a late flowering inland ecotype7.
Population structure of wild sunflowers
In a common garden experiment, we grew ten plants from each of 151 populations of these three sunflower species (H. annuus = 71 populations; H. petiolaris = 50 populations; H. argophyllus = 30 populations), selected from across their native range (Fig. 1a), and for which we collected corresponding soil samples. We generated extensive records of developmental and morphological traits throughout the growth of the plants, and re-sequenced the genome of 1401 of them. An additional 105 H. annuus individuals were re-sequenced to fill gaps in geographic coverage, as well as twelve annual and perennial sunflowers to be used as outgroups (Supplementary Table 1). Sunflower genomes are relatively large (H. annuus = 3.5 Gbp; H. petiolaris = 3.3 Gbp; H. argophyllus = 4.3 Gbp8) and composed of >75% retrotransposon sequences9. We used enzymatic depletion10 to reduce the proportion of repetitive sequences, resulting in an average 6.34-fold coverage of gene space (median = 6.03; Supplementary Table 1). Sequencing reads were aligned to the reference genome of cultivated sunflower9, 11, 12 (Extended Data Fig. 1), resulting in sets of >4M high-quality single nucleotide polymorphisms (SNPs) for each species.
A phylogeny based on these and previously re-sequenced sunflower samples agrees with earlier studies13, 14: H. annuus and H. argophyllus are sister species, whereas H. petiolaris is placed in a separate clade. We found three separate lineages within our H. petiolaris collection, corresponding to the subspecies fallax, petiolaris, and canescens. However, subsp. canescens falls within H. niveus, supporting an earlier classification15; due to its smaller sample size (86 individuals), we have omitted the H. niveus canescens clade from further analyses. Lastly, dune-adapted ecotypes of H. petiolaris from Colorado and Texas fall within H. petiolaris fallax (despite the Texas populations being formally designated as H. neglectus16), and are therefore analyzed as part of that clade (Fig. 1b).
Large haplotypes linked to adaptive traits
The large effective population size and outcrossing mating system of wild sunflowers17 represent a major advantage for genome-wide association (GWA) studies, since the rapid decay of linkage disequilibrium (LD) permits mapping of phenotype-genotype associations to narrow genomic regions if (as in this case) sufficient marker densities are available. GWA analyses of 87 traits identified numerous, strong links between phenotypic variation and regions of the sunflower genome (Supplementary Table 2). We observed, for example, extensive variation in flowering time for all three species (Extended Data Fig. 2a), consistent with its fundamental role in plant (and sunflower) adaptation18, 19. For H. annuus in particular, significant associations were found with the sunflower homologs of known flowering time regulators, including FLOWERING LOCUS T20 (FT), FLOWERING LOCUS M21 (FLM), and EARLY FLOWERING 722 (ELF7; Fig. 1c). We also identified several regions of the genome strongly associated with environmental and soil variables in genotype-environment association (GEA) analyses, suggesting a role in adaptation to particular habitats (Supplementary Table 2). For example, several temperature-related variables showed strong associations with the sunflower homolog of HEAT-INTOLERANT 1 (HIT1), which mediates resistance to heat stress by regulating plasma membrane thermo-tolerance in Arabidopsis thaliana23 (Fig. 1d).
In several cases, we noticed GWA and GEA signals spanning very large regions of the genome for traits that are known to be important for local adaptation, and to differentiate ecotypes in sunflower. One of the most striking examples of these GWA plateaus occurred between coastal island and inland populations of H. argophyllus. While inland populations flower late in summer, and can grow to be extremely large (up to >4 meters), smaller early flowering individuals occur at high frequency on the barrier islands of the Gulf of Mexico (Fig. 2a,b). Selection experiments indicate that late flowering in the interior is favoured7, presumably to avoid flowering during the extremely hot and dry summer, whereas early flowering appears to be advantageous under less harsh conditions on the barrier islands. Flowering time GWA analyses in H. argophyllus identified a single, highly significant association spanning ∼30 Mbp on chromosome 6 (Fig. 2c,d), also associated with leaf nitrogen and carbon content (Extended Data Fig. 2b). Principal component analysis (PCA) of this region suggested the presence of two main haplotypes, with intermediate individuals being heterozygotes (Fig. 2e). We extracted haplotype-informative sites and visualized ancestry across the region, revealing very limited recombination. A 10 Mbp region (130-140 Mbp) is perfectly correlated with flowering time phenotypes and explains 88.2% of the variance in days to bud (Fig. 2f). The early haplotype acts dominantly, with plants carrying at least one copy of it flowering on average 77 days earlier than late-flowering individuals (Fig. 2g). This region contains five of the six sunflower homologs of the flowering time regulator FT (HaFT1-3, HaFT5, HaFT6; Fig. 2f). Surprisingly, the GWA signal drops sharply around the HaFT1 locus (Fig. 2d), which is known to play a role in differences in photoperiodic responses between wild and cultivated sunflower24. 
Analysis of an unfiltered SNP dataset revealed that this pattern is due to the almost complete absence of reads mapping to the region in plants carrying the late-flowering haplotype (only SNPs with data for at least 90% of the individuals were used for GWAs). This is consistent with the presence of one or more deletions, including the HaFT1 locus, in late flowering H. argophyllus (Fig. 2h; Extended Data Fig. 2c). Accordingly, the HaFT1 sequence cannot be amplified from genomic DNA from late-flowering plants, and no HaFT1 expression is detected in those plants (Fig. 2i,j; Extended Data Fig. 2d). Early flowering plants carry instead at least one functional copy of HaFT1, which complements the otherwise late-flowering A. thaliana ft-10 mutant (Fig. 2k,l). To explore the origins of these haplotypes, we constructed a phylogeny of the non-recombined 10 Mbp region in chromosome 6 (Fig. 2m). We found that the two haplotypes are highly divergent, and that the early haplotype was introgressed from H. annuus (D-stat = 0.844 ± 0.006, p < 10⁻²⁰, two-sided; see also Fig. 2g). While it is not possible to exclude a role of the other FT homologs (Extended Data Fig. 2e-g) or other genes in the region, these results strongly suggest that introgression of a functional HaFT1 copy from H. annuus played a major role in the establishment of early-flowering H. argophyllus.
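As an illustration of the D-statistic quoted above, the following minimal sketch computes Patterson's D from counts of ABBA and BABA site patterns. It is not the pipeline used in this study: the assignment of taxa to P1, P2, P3 and the outgroup, and the coding of sites as pattern strings, are assumptions made for the example, and the standard error (commonly obtained by block jackknife) is not computed.

```python
def d_statistic(site_patterns):
    """Patterson's D from allele patterns at biallelic sites.

    Each pattern is a 4-character string giving the allele carried by
    (P1, P2, P3, outgroup), with "A" = ancestral and "B" = derived.
    This input format is hypothetical and chosen for illustration.
    """
    abba = sum(1 for p in site_patterns if p == "ABBA")
    baba = sum(1 for p in site_patterns if p == "BABA")
    if abba + baba == 0:
        raise ValueError("no informative (ABBA or BABA) sites")
    # D > 0 indicates excess allele sharing between P2 and P3.
    return (abba - baba) / (abba + baba)


# Toy data: nine ABBA sites, one BABA site, five uninformative sites.
toy_patterns = ["ABBA"] * 9 + ["BABA"] + ["BBAA"] * 5
toy_d = d_statistic(toy_patterns)
```

Sites with any other pattern (for example BBAA) do not enter the statistic, which is why only the ABBA and BABA counts are tallied.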
We found another clear example of GWA and GEA plateaus underlying ecotypic differentiation in H. petiolaris, which has repeatedly adapted to sand dunes in Texas and Colorado, USA6. Dune populations exhibit starkly distinctive phenotypes compared to populations growing just off the same dunes (Fig. 3a-d), the most striking of which are seed size and length (Fig. 3b,c); large seeds confer a strong fitness advantage on sand dunes6, possibly by providing seedlings with enough resources to emerge after being buried by sand. Dunes also are low in resources, and dune sunflowers use soil nutrients more efficiently than their non-dune counterparts25. GWA analyses for seed size and flowering time, and GEA analyses of soil characteristics including cation exchange capacity (CEC, a measure of soil fertility) in H. petiolaris fallax identified three multi-Mbp regions on chromosomes 9, 11 and 14 (Fig. 3d,e; Extended Data Fig. 3a,b). The plateaus on chromosomes 11 and 14 co-localize with known QTL for seed size differentiating dune and non-dune populations26, and weaker associations with flowering time are observed in H. petiolaris petiolaris for the chromosome 9 and 11 regions (Extended Data Fig. 3c). All three regions are highly differentiated between dune and non-dune populations from Texas, and two of the three regions differentiate dune and non-dune populations in Colorado27 (Fig. 3f), suggesting a fundamental role in maintaining the dune ecotype.
Highly divergent haploblocks are common
The identification of these GWA/GEA plateaus suggests a broader role of large, non-recombining haplotype blocks (henceforth “haploblocks”) in adaptation. To test this hypothesis, we used a local PCA approach to identify other large genomic regions with distinct population structure28 (Fig. 4a). Across the three species, we found 37 such regions, ranging from 1 to 100 Mbp and representing 4-16% of the total genome (Fig. 4b; Extended Data Table 1). They are characterized by high LD, and PCAs in these regions separated individuals into three clusters, with the middle cluster having higher heterozygosity. This is consistent with the two extreme clusters representing individuals homozygous for two distinct haplotypes, and the middle cluster representing heterozygotes. No or very little recombination is observed between haplotypes, but generally no reduction in recombination is found within haplotypes (Fig. 4c-e; Extended Data Fig. 4,5).
These patterns match the expectations for large, segregating structural variants (SVs). Theoretical and empirical work indicates that SVs can facilitate adaptive divergence in the face of gene flow by reducing recombination between locally adaptive alleles29, 30. In particular, inversions have been shown to control adaptive phenotypic variation (e.g. migration31, colour32, flowering time33), and to be associated with environmental clines34, 35. We used three different approaches to determine whether these haploblocks are associated with SVs (Extended Data Table 1). First, we compared the genome assemblies of two cultivars of H. annuus that have opposite genotypes at haploblock regions on chromosomes 1 and 5 (ann01.01 and ann05.01). We found one and two large inversions at these regions, respectively (Fig. 4f; Extended Data Fig. 6a). We also aligned ten H. annuus and four H. petiolaris genetic maps to the sunflower reference genome; we observed suppressed recombination at ten haploblocks, and evidence for three haploblocks being caused by large inversions (Fig. 4g; Extended Data Fig. 6b,c). Lastly, we used Hi-C sequencing36 to compare pairs of early- and late-flowering H. argophyllus and dune and non-dune H. petiolaris, and looked for differences in physical linkage at haploblock regions. We found support for SVs, ranging from likely full-length inversions to more complex rearrangements, at 11 regions in H. petiolaris and one in H. argophyllus (Fig. 4h, Extended Data Fig. 7). For one haploblock for each species, we could find no evidence of SVs in our Hi-C data, suggesting that recombination between haplotypes might be suppressed by other mechanisms in these regions. We also confirmed the presence of large SVs underlying four of the haploblocks we detected in wild H. annuus by comparing these Hi-C data to those for the HA412-HO reference cultivar (itself H. annuus; Extended Data Fig. 7).
These results point to SVs being associated with most of the haploblock regions we detected.
Of the 37 haploblocks we identified, two correspond to the chromosome 6 region associated with flowering time in H. argophyllus (arg06.01 and arg06.02), and three with the H. petiolaris seed size, flowering time and CEC plateaus (pet09.01, pet11.01 and pet14.01; Fig. 4b; Extended Data Table 1). We also identified four additional haploblocks co-localizing with regions of high genetic differentiation between dune and non-dune ecotypes of H. petiolaris (Fig. 3f, Extended Data Fig. 3d), bringing the total of dune-associated haploblocks to seven, four of which are shared between both independent dune ecotypes (Texas and Colorado; Extended Data Fig. 5). Phylogenetic analysis finds that these dune-associated haploblocks predate the split between the H. petiolaris subspecies fallax and petiolaris, and that five are polymorphic in both subspecies (Fig. 5b; Extended Data Fig. 3e). Such high levels of divergence are common to most haploblock regions (Fig. 5a). For the two haploblocks that are polymorphic between H. annuus reference genomes (ann01.01 and ann05.01; Fig. 4f; Extended Data Fig. 6a), sequence identity between haplotypes is 94-95%, much lower than the 99.4% for the rest of the genome. Divergence times between all but one of the haploblocks exceed 1 MYA, in most cases (32/37) before the H. annuus-H. argophyllus speciation event37 (Fig. 5b). This seems at odds with the observation that haploblock polymorphisms are not shared between sunflower species. Ancient haploblocks could have been maintained in selected lineages, possibly by balancing selection38, but this should result in trans-specific polymorphisms. Alternatively, they could be more recently introgressed from divergent taxa, a hypothesis supported for four H. argophyllus haploblocks, in which one haplotype is phylogenetically closer to H. annuus than to H. argophyllus (Fig. 2m).
However, a donor species could not be identified for more divergent haploblocks, raising the intriguing possibility that they may be introgressed from one or more extinct taxa.
Haploblocks underlie ecotype divergence
As we have shown, haploblocks can have remarkably strong associations with phenotypic traits and environmental variables (Fig. 2c, 3f; Extended Data Fig. 2b, 3b,c), but these examples represent only a small proportion of the total haploblock regions we identified. Are other haploblocks also involved in local adaptation? Theory suggests that SVs are likely to establish by capturing multiple adaptive alleles39; consistent with this, when we treated haploblocks as individual loci, we found that haploblocks are often associated with multiple types of traits (Fig. 5c; Extended Data Fig. 8,9).
Surprisingly, some of the strongest associations we identified with this approach did not show up in our initial GWA and GEA analyses. Haploblocks are in fact large enough to affect the genome-wide estimates of relatedness between individuals (kinship and PCA) routinely employed to compensate for population structure in GWA and GEA analyses, which can result in their association signal being masked40. This is particularly evident for ann13.01, which at ∼100 Mbp is the largest of the haploblocks we identified; significant plateaus for temperature difference (TD, a measure of climate continentality) and flowering time are only revealed once haploblock regions are removed from the kinship covariate (Fig. 5d,e). Interestingly, this and several other haploblocks appear to differentiate Texas populations of H. annuus from the rest of the range (Fig. 5f; Extended Data Fig. 5), consistent with the distribution of the texanus ecotype of H. annuus41. Similar to the H. petiolaris dune comparison (Extended Data Fig. 3d), haploblocks are more differentiated than SNPs in comparisons between Texas and other populations (t(10) = 4.01, p = 0.0024, two-sided t-test; Fig. 5g), supporting a role for haploblocks in local adaptation of this subspecies, or in increasing its reproductive isolation with local congeners (i.e. reinforcement42).
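The removal of haploblock regions from the kinship covariate can be pictured as a coordinate filter applied to the SNP set before relatedness is estimated. The sketch below shows only that filtering step; the tuple-based data layout and the toy coordinates are assumptions for illustration, and the kinship estimation itself is not reproduced.

```python
def outside_haploblocks(snps, haploblocks):
    """Keep SNPs that do not fall inside any haploblock.

    snps: list of (chromosome, position) tuples.
    haploblocks: list of (chromosome, start, end) tuples, bounds inclusive.
    Both layouts are hypothetical and chosen for illustration.
    """
    kept = []
    for chrom, pos in snps:
        inside = any(
            block_chrom == chrom and start <= pos <= end
            for block_chrom, start, end in haploblocks
        )
        if not inside:
            kept.append((chrom, pos))
    return kept


# Toy coordinates (not real haploblock boundaries).
toy_snps = [("chr13", 5_000), ("chr13", 50_000), ("chr01", 50_000)]
toy_blocks = [("chr13", 10_000, 100_000)]
kinship_snps = outside_haploblocks(toy_snps, toy_blocks)
```

Only the retained SNPs would then be passed to whichever kinship estimator is in use, so that a single very large region cannot dominate the relatedness estimates.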
Conclusions
We identified a large number of highly divergent, multi-Mbp-long haploblocks in wild sunflowers, many of which appear to underlie ecotype formation; four in the early flowering ecotype from H. argophyllus; seven in the texanus ecotype of H. annuus; and seven in dune ecotypes of H. petiolaris (Extended Data Fig. 5). These haploblocks are often, but not always, linked to large SVs (especially inversions), which provide a straightforward mechanism for suppressing recombination between haplotypes, therefore maintaining adaptive allelic combinations. The total number and effects of such haplotypes are likely even larger than this, since our approach is biased towards detection of divergent and large (>1 Mbp) haploblocks. Ecotypic differentiation is often seen as a first step toward the generation of new species43, and the ecotypes discussed above appear to represent different stages in the speciation continuum. The coastal island ecotype of H. argophyllus is least divergent, and the only reproductive barrier we are aware of is the difference in flowering time7, which provides only modest protection from gene flow. In contrast, multiple intrinsic and extrinsic reproductive barriers differentiate the two dune ecotypes of H. petiolaris fallax from nearby non-dune populations6, 44, reducing but not eliminating gene migration16, 45. It is noteworthy that several haploblocks are associated with both traits favouring local adaptation and contributing to reproductive isolation (e.g., seed size and flowering time, respectively, in the dune ecotypes), an architecture that facilitates speciation with gene flow29. More generally, flowering time mapped to one or more haploblocks in all ecotypes, suggesting that it plays an especially important role in successful ecotype formation, perhaps due to its dual role in local adaptation and assortative mating46.
An unanswered question is how the linked combinations of locally favoured mutations found in haploblocks arose. Possibly, sets of locally adaptive alleles initially developed in geographically isolated populations47. Secondary contact and hybridization would favour the evolution of reduced recombination among such alleles through the establishment of SVs39 or other recombination modifiers28. An origin through introgression would also help account for the high divergence and massive size of many of the haploblocks, as well as the lack of shared haploblock polymorphisms between species. The alternative explanation for the latter, incomplete lineage sorting, would require extensive haploblock loss. After haploblock establishment (regardless of how it occurred), new locally adaptive mutations would be more likely to persist under migration-selection balance if linked to other adaptive alleles48, 49, potentially leading to the outsized effects reported here. Our work reveals a modular genetic architecture underlying ecotype formation, a surprising and unforeseen origin of many locally adapted gene modules through introgression, and a critical role of recombination modifiers, especially structural variants, in adaptive divergence with gene flow.
Methods
Seed and soil collection
During the summer of 2015 we visited 192 wild populations spanning the native distribution of H. annuus, H. petiolaris, and H. argophyllus, and collected seeds from 21-37 individuals from each population. Seeds from ten additional populations of H. annuus had been previously collected in the summer of 2011. Three to five soil samples (0-25 cm depth) were collected with a corer at each population, from across the area in which seeds were collected. Soils were air dried in the field, further dried at 60 °C in the lab, and passed through a 2 mm sieve to remove roots and rocks. Soils were then submitted to Midwest Laboratories Inc. (Omaha, NE, USA) for analysis.
Common garden
Ten plants from each of 151 selected populations were grown at the Totem Plant Science Field Station of the University of British Columbia (Vancouver, Canada) in the summer of 2016. Pairs of plants from the same population of origin were sown using a completely randomized design. At least three flowers from each plant were bagged before anthesis to prevent pollination, and manually crossed to an individual from the same population of origin. Phenotypic measurements were performed throughout plant growth, and leaves, stem, inflorescences and seeds were collected and digitally imaged to extract relevant morphometric data (see Supplementary Table 1).
Library preparation and sequencing
Whole-genome shotgun (WGS) sequencing libraries were prepared for 719 H. annuus, 488 H. petiolaris, 299 H. argophyllus individuals, and twelve additional samples from annual and perennial sunflowers (Supplementary Table 1). Genomic DNA was sheared to ∼400 bp fragments using a Covaris M220 ultrasonicator (Covaris, Woburn, Massachusetts, USA) and libraries were prepared using a protocol largely based on Rowan et al., 201551, the TruSeq DNA Sample Preparation Guide from Illumina (Illumina, San Diego, CA, USA) and Rohland et al., 201252. In order to reduce the proportion of repetitive sequences, libraries were treated with a Duplex-Specific Nuclease (DSN; Evrogen, Moscow, Russia), following the protocols reported in Shagina et al. 201010 and Matvienko et al. 201353, with modifications (see Supplementary Methods for details). All libraries were sequenced at the McGill University and Génome Québec Innovation Center on HiSeq2500, HiSeq4000 and HiSeqX instruments (Illumina, San Diego, CA, USA), to produce paired end, 150 bp reads. Libraries with fewer reads were re-sequenced to increase genome coverage. After quality filtering (see below), a total of 60.7 billion read pairs were retained, equivalent to 14.5 Tbp of sequence data.
Variant calling
The call set included the 1518 samples described above, the Sunflower Association Mapping (SAM) population (a set of cultivated H. annuus lines 54), and wild Helianthus samples previously sequenced for other projects54–56, for a total of 2392 samples (Supplementary Table 1). The additional samples were included to improve SNP calling, and to identify haploblock genotypes. Sequences were trimmed for low quality using Trimmomatic57 (v0.36) and aligned to the H. annuus XRQv1 genome9 using NextGenMap58 (v0.5.3). We followed the best practices recommendations of The Genome Analysis ToolKit (GATK)59, and executed steps documented in GATK’s germline short variant discovery pipeline (for GATK 4.0.1.2). During genotyping, to reduce computational time and improve variant quality, genomic regions containing transposable elements were excluded9. Since performing joint-genotyping on the whole ensemble of samples would have been computationally impractical, genotyping was performed independently on three per-species cohorts (H. annuus, H. argophyllus and H. petiolaris).
Variant quality filtering
Genotyping produced VCF files featuring an extremely large number of variant sites (222M, 78M and 167M SNPs and indels for H. annuus, H. argophyllus and H. petiolaris, respectively). Over the called portion of the genome, this corresponds to 0.07 to 0.2 variants per bp, with 30-47% of variable sites being indel variation. To remove low-quality calls and produce a dataset of a more manageable size, we used GATK’s VariantRecalibrator (v4.0.1.2), which filters variants in the call set according to a machine learning model inferred from a small set of “true” variants. In the absence of an externally-validated set of known sunflower variants to use as calibration, we computed a stringently-filtered set from top-N samples with highest sequencing coverage for each species (N=67 (SAM) samples for H. annuus, and N=20 otherwise). The stringency of the algorithm in classifying true/false variants was adjusted by comparing variant sets produced for different parameter values (tranche 100.0, 99.0, 90.0, 70.0, and 50.0). For each cohort, results for tranche = 90.0 were chosen for downstream analysis, based on heuristics: the number of novel SNPs identified, and improvements to the transition/transversion ratio (towards GATK’s default target of 2.15).
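One of the heuristics mentioned above, the transition/transversion (Ts/Tv) ratio, can be computed directly from the reference and alternate alleles of biallelic SNPs. The sketch below is a minimal illustration rather than the procedure applied by GATK; the input format (pairs of single-base alleles with the reference differing from the alternate) is an assumption.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Transition/transversion ratio for an iterable of (ref, alt) SNP alleles.

    Assumes biallelic SNPs with single-base alleles and ref != alt.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


# Toy call set: two transitions and one transversion.
toy_ratio = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

Computing this ratio for the variant set retained at each tranche gives one number per tranche that can be compared against the target value.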
Remapping sites to the HA412-HOv2 reference genome
Our initial analysis of haploblocks (see section “Population genomic detection of haploblocks”), as well as GWA/GEA results for haploblock regions, found many instances of disconnected haploblocks and high linkage between distant parts of the genome, suggesting problems in contig ordering. We remapped genomic locations from XRQv19 to HA412-HOv211 using BWA60. Measures of LD using vcftools61 showed that remapping significantly improved LD decay (Extended Data Fig. 1a) and produced more contiguous haploblocks (Extended Data Fig. 1b), supporting the accuracy of the new genome assembly and our remapping procedure. While we recognize that this approach reduces accuracy at the local scale, and would not be appropriate, for example, for determining the effects of variants on coding sequences, it produces a more accurate reflection of the genome and linkage structure.
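For readers unfamiliar with LD measures, the sketch below computes one common statistic: the squared Pearson correlation between genotype dosages (0, 1 or 2 copies of the alternate allele) at two SNPs. It is an illustration only; the exact vcftools options used for the LD decay analysis are not stated in this section, and the handling of missing data and monomorphic sites is an assumption.

```python
def genotype_r2(dosages_a, dosages_b):
    """Squared correlation between genotype dosages at two SNPs.

    Each argument lists one dosage (0, 1, 2) per individual, with None for
    missing calls. Individuals missing at either SNP are dropped.
    Returns None when r2 is undefined (too few samples or no variation).
    """
    pairs = [(a, b) for a, b in zip(dosages_a, dosages_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return (cov * cov) / (var_a * var_b)


# Toy genotypes: two perfectly linked SNPs and two uncorrelated SNPs.
linked = genotype_r2([0, 1, 2, 0], [2, 1, 0, 2])
unlinked = genotype_r2([0, 0, 2, 2], [0, 2, 0, 2])
```

Averaging such values over SNP pairs binned by physical distance yields an LD decay curve; with a correctly ordered assembly, the average should fall as distance increases.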
Phylogenetic analysis
Variants were called for 20 windows of 1 Mbp, randomly selected across the genome. Indels were removed and SNP sites were filtered for <20% missing data and minor allele frequency >0.1%. All sites were then concatenated and analyzed using IQtree62–64 with ascertainment bias and otherwise default parameters.
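The two site filters described above (less than 20% missing data, minor allele frequency above 0.1%) can be sketched as follows. This is a minimal illustration under an assumed input format (alternate-allele counts of 0, 1 or 2 per diploid sample, None for missing calls), not the commands used in the study.

```python
def passes_site_filters(genotypes, max_missing=0.20, min_maf=0.001):
    """Return True if a biallelic SNP site passes both filters.

    genotypes: alternate-allele counts (0, 1, 2) per diploid sample,
    with None for missing calls (a hypothetical encoding).
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    missing_fraction = (len(genotypes) - len(called)) / len(genotypes)
    if missing_fraction >= max_missing:
        return False
    alt_frequency = sum(called) / (2 * len(called))
    minor_allele_frequency = min(alt_frequency, 1 - alt_frequency)
    return minor_allele_frequency > min_maf


# Toy sites: a polymorphic site, a site with 40% missing data,
# and a monomorphic site.
keep_polymorphic = passes_site_filters([0, 1, 2, 0, 0])
keep_sparse = passes_site_filters([0, None, None, 1, 0])
keep_monomorphic = passes_site_filters([0, 0, 0, 0])
```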
Genome-wide association mapping
Genome-wide association analyses were performed for 86, 30 and 69 phenotypic traits in H. annuus, H. argophyllus and H. petiolaris, respectively, using EMMAX (v07Mar2010) or the EMMAX module in EasyGWAS65; an annotated list of candidate genes is reported in Supplementary Table 2. Inflorescence and seed traits could not be collected for H. argophyllus, since most plants of this species flowered very late in our common garden, and failed to form fully-developed inflorescences and set seeds before temperatures became too low for their survival.
Genome-environment association analyses
Twenty-four topo-climatic factors were extracted from climate data collected over a 30-year period (1961-1990) for the geographic coordinates of the population collection sites, using the software package Climate NA66. Soil samples from each population were also analyzed for 15 soil properties (Supplementary Table 1). The effects of each environmental variable were analyzed using BayPass67 version 2.1. Following Gautier, 201567, we employed Jeffreys’ rule68, and quantified the strength of associations between SNPs and variables as “strong” (10 dB ≤ BFis < 15 dB), “very strong” (15 dB ≤ BFis < 20 dB) and “decisive” (BFis ≥ 20 dB). An annotated list of candidate genes from GEA analyses is reported in Supplementary Table 2.
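The thresholds above translate into a simple classification of Bayes factors expressed in deciban units (dB), where a raw Bayes factor BF corresponds to 10·log10(BF) dB. The sketch below is illustrative; the function names are hypothetical, and returning None for values below 10 dB is an assumption, since the text defines no label for that range.

```python
import math


def to_decibans(bayes_factor):
    """Convert a raw Bayes factor to deciban units: 10 * log10(BF)."""
    return 10 * math.log10(bayes_factor)


def classify_association(bf_db):
    """Label a Bayes factor (in dB) using the thresholds given in the text."""
    if bf_db >= 20:
        return "decisive"
    if bf_db >= 15:
        return "very strong"
    if bf_db >= 10:
        return "strong"
    return None  # below the lowest threshold used here


# A raw Bayes factor of 100 corresponds to 20 dB.
label = classify_association(to_decibans(100))
```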
Transgenes and expression assays
The complete coding sequences (CDS) of HaFT1, HaFT2 and HaFT6 were amplified from complementary DNA (cDNA) from H. argophyllus individuals carrying the early and late haplotype for arg06.01. Two alleles of the HaFT2 CDS were identified in late-flowering H. argophyllus plants (one of them identical to the HaFT2 CDS from early-flowering individuals), differing only by two synonymous substitutions at positions 285 and 288. All alleles were placed under control of the constitutive CaMV 35S promoter in pFK210 derived from pGREEN69. Constructs were introduced into plants by Agrobacterium tumefaciens-mediated transformation70. Col-0 and ft-10 seeds were obtained from the Arabidopsis Biological Resource Center. All primer sequences are reported in Supplementary Table 3.
Population genomic detection of haploblocks
The program lostruct (local PCA/population structure) was used to detect genomic regions with abnormal population structure28. Lostruct divides the genome into non-overlapping windows and calculates a PCA for each window. It then compares the PCAs derived from each window and calculates a similarity score. The matrix of similarity scores is then visualized using a multidimensional scaling (MDS) transformation. Lostruct analyses were performed on the H. annuus, H. argophyllus, H. petiolaris petiolaris, and H. petiolaris fallax datasets, as well as in a H. petiolaris dataset including both H. petiolaris petiolaris and H. petiolaris fallax individuals. For each dataset, lostruct was run with 100 SNP-wide windows and independently for each chromosome. Each MDS axis was then visualized by plotting the MDS score against the position of each window in the chromosome.
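The first step of the procedure, dividing each chromosome into non-overlapping windows of 100 SNPs, can be sketched as below. Only the windowing is shown; the per-window PCA, the pairwise comparison of PCAs and the MDS transformation are performed by lostruct and are not reproduced here. Dropping a trailing partial window is an assumption of this sketch, and the chromosome names and SNP counts are toy values.

```python
def snp_windows(n_snps, window_size=100):
    """Non-overlapping windows over SNP indices, as (start, end) pairs.

    `end` is exclusive. A trailing window with fewer than `window_size`
    SNPs is dropped (an assumption of this sketch).
    """
    return [
        (start, start + window_size)
        for start in range(0, n_snps - window_size + 1, window_size)
    ]


def windows_by_chromosome(snp_counts, window_size=100):
    """Window each chromosome independently.

    snp_counts: dict mapping chromosome name to its number of SNPs.
    """
    return {
        chrom: snp_windows(count, window_size)
        for chrom, count in snp_counts.items()
    }


# Toy SNP counts (not real data).
toy_windows = windows_by_chromosome({"chr01": 250, "chr02": 99})
```

Because windows are defined by SNP count rather than physical length, they span more base pairs in regions of low SNP density.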
Many localized regions of extreme MDS values with high variation in MDS scores and sharp boundaries were detected (Fig. 4a; Extended Data Fig. 4). Localized changes to population structure could occur due to selection or introgression, but both the size and discrete nature of the regions are consistent with underlying structural changes defining the boundaries and preventing recombination. For example, inversions prevent recombination between orientations and if inversion haplotypes are diverged enough, they will show up in lostruct scans28. Since we are interested in recombination suppression in the context of adaptation, we focused on regions that had the following features: (1) a PCA in the region should divide samples into three groups representing 0/0, 0/1 and 1/1 genotypes, (2) the middle 0/1 genotype should have higher average heterozygosity and (3) there should be high linkage disequilibrium (LD) within the region.
The combined evidence of PCA and linkage suggests that the lostruct outlier regions are characterized by long haplotypes with little or no recombination between haplotypes. We refer to these as haploblocks. To explore the haplotype structure underlying the haploblocks, sites correlated (R² > 0.8) with PC1 in the PCA of the haploblock were extracted as haplotype diagnostic sites and used to genotype the haploblocks. Since there is seemingly little recombination between haplotypes, this is conceptually similar to a hybrid index, and we expect all samples to be consistently homozygous for one haplotype's alleles or heterozygous at all sites (i.e. similar to an F1 hybrid). Haploblock genotypes were assigned to all samples using equation (1), where p is the proportion of haplotype 1 alleles and h is the observed heterozygosity. The haplotype structure was also visualized by plotting diagnostic SNP genotypes for each sample, with samples ordered by the proportion of alleles from haplotype 1 (e.g. Fig. 2f).
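As a minimal sketch of the diagnostic-site step, the code below selects sites whose squared Pearson correlation with PC1 exceeds 0.8 and then computes, for one sample, the proportion of haplotype 1 alleles (p) and the observed heterozygosity (h). It assumes genotypes coded as 0/1/2 with no missing data, and it polarizes each site by the sign of its correlation so that haplotype 1 corresponds to high PC1 scores; both choices are illustrative assumptions. The thresholds of equation (1) are not reproduced here, so no genotype call is made.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def diagnostic_sites(genotypes, pc1, r2_min=0.8):
    """genotypes[site][sample] in {0, 1, 2}. Returns [(site_index, sign), ...]."""
    selected = []
    for idx, site in enumerate(genotypes):
        r = pearson_r(site, pc1)
        if r * r > r2_min:
            selected.append((idx, 1 if r > 0 else -1))
    return selected

def haplotype_summary(genotypes, selected, sample):
    """Return (p, h): proportion of haplotype 1 alleles and observed heterozygosity."""
    hap1_alleles = 0
    het_sites = 0
    for idx, sign in selected:
        g = genotypes[idx][sample]
        hap1_alleles += g if sign > 0 else 2 - g  # flip sites polarized the other way
        het_sites += 1 if g == 1 else 0
    n = len(selected)
    return hap1_alleles / (2 * n), het_sites / n

# Hypothetical toy data: six samples, three sites, PC1 separating three clusters.
pc1 = [-1, -1, 0, 0, 1, 1]
genotypes = [
    [0, 0, 1, 1, 2, 2],  # perfectly correlated with PC1
    [2, 2, 1, 1, 0, 0],  # perfectly anti-correlated (opposite polarity)
    [0, 1, 2, 0, 1, 2],  # weakly correlated, not diagnostic
]
selected = diagnostic_sites(genotypes, pc1)
```

Under the expectation stated above, samples homozygous for either haplotype should show p near 0 or 1 with h near 0, and heterozygous samples should show p near 0.5 with h near 1.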
Lostruct was run on SNP datasets containing H. petiolaris petiolaris, H. petiolaris fallax, and both subspecies together. Although each dataset produced a collection of haploblocks, they were not identical. Some haploblocks were identified in one subspecies but not the other, and some were only identified when both subspecies were analyzed together. In some cases, it was clear that haploblocks identified in both subspecies represented the same underlying haploblock because they physically overlapped and had overlapping diagnostic markers. We manually curated the list of haploblocks and merged those found in multiple datasets. We set the boundaries of these merged haploblocks to be inclusive (i.e. include windows found in either) and the diagnostic markers to be exclusive (i.e. only include sites found in both). For this merged set of haploblocks, all H. petiolaris samples were genotyped using diagnostic markers.
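The merging rule (inclusive boundaries, exclusive markers) can be expressed as a set union of windows and a set intersection of diagnostic sites. The sketch below is only an illustration of that rule; the dictionary layout, coordinates and function name are hypothetical, and the manual curation step that decides which haploblocks correspond to each other is not modelled.

```python
def merge_haploblocks(block_a, block_b):
    """Merge two calls of the same haploblock from different datasets.

    Each block is a dict with 'windows' (iterable of (start, end) tuples)
    and 'markers' (iterable of diagnostic SNP positions).
    """
    windows = set(block_a["windows"]) | set(block_b["windows"])   # inclusive
    markers = set(block_a["markers"]) & set(block_b["markers"])   # exclusive
    return {
        "start": min(start for start, _ in windows),
        "end": max(end for _, end in windows),
        "windows": sorted(windows),
        "markers": sorted(markers),
    }

# Hypothetical coordinates for one haploblock detected in two datasets.
petiolaris_call = {"windows": [(100, 200), (200, 300)], "markers": [120, 150, 260]}
fallax_call = {"windows": [(200, 300), (300, 400)], "markers": [150, 260, 350]}
merged = merge_haploblocks(petiolaris_call, fallax_call)
```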
Design of genetic markers for haploblock screening
Diagnostic SNPs for haploblocks were extracted from filtered vcf files. The resulting markers (Cleaved Amplified Polymorphic Sequence (CAPS) or direct sequencing markers) were tested on representative subsets of individuals included in the original local PCA analysis (Fig. 4a, Extended Data Fig. 4), for which the genotype at haploblocks of interest was known. Marker information is reported in Supplementary Table 3.
Sequencing coverage analysis
To detect potential deletions in the late-flowering allele of arg06.01, SNPs in the haploblock region with an average coverage of at least 4 in at least one of the genotypic classes were selected (in order to exclude positions with overall low mapping quality). SNP positions with coverage 0 or 1 in one genotypic class were counted as missing data for that genotypic class (Extended Data Fig. 2c).
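The coverage filter can be sketched as follows. The text does not state whether "coverage 0 or 1" refers to the class average or to individual samples; this sketch assumes the class average and treats a mean depth of 1 or less as missing. The data layout, class labels and function name are hypothetical.

```python
def coverage_by_class(depths, min_mean=4.0, missing_max=1.0):
    """Filter SNPs by coverage and flag per-class missing data.

    depths: {snp_id: {genotype_class: [per-sample read depth, ...]}}
    Returns {snp_id: {genotype_class: mean depth, or None if missing}}
    for SNPs whose mean depth reaches min_mean in at least one class.
    """
    kept = {}
    for snp, classes in depths.items():
        means = {c: sum(d) / len(d) for c, d in classes.items()}
        if max(means.values()) < min_mean:
            continue  # low coverage in every class: likely poor mappability
        kept[snp] = {c: (None if m <= missing_max else m) for c, m in means.items()}
    return kept

# Hypothetical depths for two SNPs in early- and late-haplotype homozygotes.
depths = {
    "snp1": {"early": [6, 5, 7], "late": [0, 1, 0]},  # candidate deletion in late
    "snp2": {"early": [1, 2, 1], "late": [2, 1, 0]},  # low everywhere, excluded
}
result = coverage_by_class(depths)
```

A run of consecutive SNPs that are missing in one genotypic class but well covered in the other would be consistent with a deletion in that haplotype.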
H. annuus reference assemblies comparisons
Masked reference sequences for the H. annuus cultivars HA412-HOv2 and PSC8 (refs. 11, 12) were aligned using MUMmer71 (v4.0.0b2). The programs nucmer (parameters -b 1000 -c 200 -g 500) and dnadiff within the MUMmer package were used. Only orthologous chromosomes were aligned to each other, because of the high similarity and known conservation of chromosome structure. The one-to-one output file was then visualized in R, retaining only alignments in which both sequences were > 5000 bp. Inversion boundaries and sequence identity between haplotypes were further determined using SyRI72.
Genetic maps comparisons
Fourteen genetic maps were used: the seven H. annuus genetic maps used in the creation of the XRQv1 genome9; three newly generated H. annuus maps obtained from wild X cultivar F2 populations (E.B.M.D., M.T., G.L.O., L.H.R., in preparation); two previously published H. petiolaris genetic maps obtained from F1 crosses50; and two newly generated H. petiolaris maps (K.H., Rose L. Andrews, G.L.O., K.L.O., L.H.R., in preparation). Whenever necessary, marker positions relative to XRQv1 were re-mapped to the HA412-HOv2 assembly (see above). Six of the previously described H. annuus maps were obtained from crosses between cultivars (the seventh one was obtained from a wild X cultivar cross); in order to determine which haploblocks could be expected to segregate in the genetic maps, all of the H. annuus SAM population lines were genotyped for each H. annuus haploblock using diagnostic markers identified in wild H. annuus. Ann01.01 and ann05.01 were found to be highly polymorphic in the SAM population, while other haploblocks were fixed or nearly fixed for a single allele. For all fourteen maps, marker order was compared to physical positions in the HA412-HOv2 reference assembly, and evidence for suppressed recombination or structural variation was recorded (Extended Data Table 1).
Hi-C
Pairs of H. petiolaris and H. argophyllus populations that diverged for a large number of haploblocks were selected. Individuals from these populations were genotyped using haploblock diagnostic markers (see “Design of genetic markers for haploblock screening”) to identify, for each species, a pair of individuals with different genotypes at the largest possible number of haploblocks. Chromosome conformation capture sequencing36, 73 (Hi-C) libraries were prepared by Dovetail Genomics (Scotts Valley, CA, USA) and sequenced on a single lane of HiSeq X with 150 bp paired-end reads. Reads were trimmed for enzyme cut site and base quality using the tool trim in the package HOMER74 (v4.10) and aligned to the HA412-HOv2 reference genome using NextGenMap58 (v0.5.4). Interactions were quantified using the calls ‘makeTagDirectory -tbp 1 -mapq 10’ and ‘analyzeHiC -res 1000000 -coverageNorm’ from HOMER. Hi-C data were used in two ways to identify structural changes. First, the difference between interaction matrices for samples of the same species was plotted for each haploblock region where the two samples had different genotypes. Second, the difference between interaction matrices for H. annuus (using the Hi-C data that were generated to scaffold the HA412-HOv2 reference assembly11) and each H. petiolaris and H. argophyllus sample was plotted.
Haploblock phenotype and environment associations
Since haploblocks are large enough to affect genome-wide population structure, their associations with phenotypes or environmental variables may be masked when controlling for population structure. Therefore, a version of the variant file was created with all haploblock sites removed; GWA and GEA analyses were performed as before, but kinship, PCA and genetic covariance matrix were calculated using this haploblock-free variant file. Regions of high association co-localizing with haploblock regions were identified, and haploblocks were also directly tested by coding each haploblock as a single bi-allelic locus.
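The two data-preparation steps in this paragraph can be sketched briefly: dropping variants that fall inside any haploblock before computing structure covariates, and recoding haploblock genotypes as allele counts at a single bi-allelic locus. The coordinates, chromosome labels, genotype strings and function names below are hypothetical and serve only to illustrate the logic.

```python
def outside_haploblocks(variants, haploblocks):
    """Keep variants (chrom, pos) that do not fall in any (chrom, start, end) block."""
    return [
        (chrom, pos)
        for chrom, pos in variants
        if not any(c == chrom and start <= pos <= end for c, start, end in haploblocks)
    ]

def code_haploblock(genotype_calls):
    """Recode haploblock genotypes '0/0', '0/1', '1/1' as allele counts 0, 1, 2."""
    counts = {"0/0": 0, "0/1": 1, "1/1": 2}
    return [counts[g] for g in genotype_calls]

# Hypothetical example: one haploblock on chromosome 6 spanning 1,000 to 5,000.
haploblocks = [("chr06", 1000, 5000)]
variants = [("chr06", 500), ("chr06", 1000), ("chr06", 4200), ("chr07", 4200)]
structure_variants = outside_haploblocks(variants, haploblocks)
haploblock_locus = code_haploblock(["0/0", "0/1", "1/1", "0/1"])
```

Here `structure_variants` would feed the kinship, PCA and covariance calculations, while `haploblock_locus` would be tested like any other bi-allelic marker.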
To examine the relative importance of haploblocks to trait evolution and environmental adaptation, association results were compared between haploblocks and SNPs. Using SNPs as a baseline allows us to control for the correlation between traits or environmental variables. To make values comparable, both SNPs and SVs with minor allele frequency ≤ 0.03 were removed. Each locus was classified as associated (p < 0.001 or BFis > 10 dB) or not with each trait. The number of traits or climate variables each locus was associated with was then counted. The proportion of loci with ≥ 1 traits/climate variables associated for SNPs and haploblocks was then compared using prop.test in R75 (Extended Data Fig. 9b).
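The counting and comparison steps can be illustrated in a few lines. The sketch below counts, per locus, the traits or variables passing the stated thresholds, and then compares two proportions with a Yates-corrected chi-square test on a 2 × 2 table, which is the calculation R's prop.test performs by default for two groups. The input layout, function names and all counts used in the example are hypothetical, not results from this study.

```python
from math import erfc, sqrt

def count_associations(results, p_max=0.001, bf_min=10.0):
    """Count traits/variables associated with one locus.

    results: list of dicts holding either 'p' (GWA p-value) or 'bf'
    (GEA Bayes factor in dB) for each trait or environmental variable.
    """
    return sum(
        1 for r in results
        if ("p" in r and r["p"] < p_max) or ("bf" in r and r["bf"] > bf_min)
    )

def two_proportion_test(x1, n1, x2, n2):
    """Yates-corrected chi-square test of equal proportions (1 d.f.)."""
    table = [[x1, n1 - x1], [x2, n2 - x2]]
    total = n1 + n2
    cols = [x1 + x2, total - x1 - x2]
    rows = [n1, n2]
    expected = [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]
    if any(e == 0 for row in expected for e in row):
        raise ValueError("expected count of zero; test undefined")
    diffs = [abs(table[i][j] - expected[i][j]) for i in range(2) for j in range(2)]
    yates = min(0.5, min(diffs))
    stat = sum(
        (abs(table[i][j] - expected[i][j]) - yates) ** 2 / expected[i][j]
        for i in range(2) for j in range(2)
    )
    # Upper tail of a chi-square distribution with one degree of freedom.
    return stat, erfc(sqrt(stat / 2))

# Hypothetical counts: loci with at least one association out of loci tested.
stat, p_value = two_proportion_test(30, 37, 1000, 10000)
```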
Haploblocks phylogenies and dating
The phylogeny of each haploblock region was estimated by Bayesian inference using BEAST76 1.10.4 for 100 genes within the region. The dataset was partitioned, assuming unlinked substitution and clock models for the genes, and analyzed under the HKY model with 4 Gamma categories for site heterogeneity, a strict clock, and a “Constant Size” tree prior with a Gamma distribution with shape parameter 10.0 and scale parameter 0.004 for the population size. Default priors were used for the other parameters. A custom Perl script was used to combine FASTA sequences and the model parameters into XML format for BEAST input. The Markov chain Monte Carlo (MCMC) process was run for 1 million iterations and sampled every 1000 states. The convergence of chains was inspected in Tracer77 1.7.1. In order to estimate divergence times, the resulting trees were calibrated using a mutation rate estimate of 6.9 × 10⁻⁹ substitutions/site/year for sunflowers78, and visualized with the R package ggtree79 and Figtree v1.4.480. Divergence times were extracted from the trees and plotted showing the 95% highest posterior density (HPD) interval based on the BEAST posterior distribution. This was repeated for 100 non-haploblock genes to estimate the species divergence times.
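The calibration step amounts to dividing node heights by the mutation rate. The sketch below assumes that the strict-clock trees report node heights in substitutions per site (i.e. the clock rate was left at 1.0), which is an assumption about the analysis setup rather than something stated above; the node height and HPD bounds in the example are hypothetical.

```python
MUTATION_RATE = 6.9e-9  # substitutions/site/year, the estimate cited in the text

def height_to_years(height, rate=MUTATION_RATE):
    """Convert a node height in substitutions/site to years."""
    return height / rate

def calibrate_node(median_height, hpd_lower, hpd_upper, rate=MUTATION_RATE):
    """Convert a node's median height and 95% HPD bounds to years."""
    return tuple(height_to_years(h, rate) for h in (median_height, hpd_lower, hpd_upper))

# Hypothetical node: median 0.0069 subs/site, 95% HPD 0.00345 to 0.0138.
median_years, lower_years, upper_years = calibrate_node(0.0069, 0.00345, 0.0138)
```

With these illustrative inputs the node would be dated to roughly 1 million years, with an interval of roughly 0.5 to 2 million years.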
For the 10 Mb region on chromosome 6 controlling flowering time in H. argophyllus, the early flowering haplotype grouped with H. annuus. To determine if it is the product of an ancient haplotype that has retained polymorphism only in H. annuus or if it is introgressed from H. annuus, the phylogeny of 10 representative H. argophyllus samples homozygous for each haploblock allele, as well as 200 H. annuus samples, was inferred using IQtree. SNPs from the 10 Mb region were concatenated and the maximum likelihood tree was constructed using the GTR model with ascertainment bias correction. Branch support was estimated using the ultrafast bootstrap implemented in IQtree62–64 with 1,000 bootstrap replicates. Phylogenies of haploblocks arg03.01, arg03.02 and arg06.02 were inferred using the same approach. To explore the intra-specific history of the H. petiolaris haploblocks, all samples homozygous for either allele of each haploblock were selected, and phylogenies were constructed using IQtree with the same settings.
Data availability
All raw sequence data are stored in the Sequence Read Archive (SRA) under BioProjects PRJNA532579, PRJNA398560 and PRJNA564337. Accession numbers for individual samples are listed in Supplementary Table 1. The HA412-HOv2 and PSC8 genome assemblies are available at https://sunflowergenome.org/ and https://heliagene.org/. GWA results are available at https://easygwas.ethz.ch/gwas/results/xxx/. HaFT1, HaFT2 and HaFT6 sequences have been deposited in GenBank under accession numbers MN517758-MN517761.
Code availability
All code associated with this project is available at https://github.com/owensgl/haploblocks.
Author Contributions
L.H.R., S.Y., J.M.B., L.A.D., N.B. conceived the study; M.T., N.B., D.O.B., E.B.M.D., I.I., M.A.P., W.C., L.A.D. collected data and performed experiments; M.T., G.L.O., J.S.L., S.S., K.H., K.L.O., E.B.M.D., K.L. analyzed data; S.E.S., S.M. contributed resources; R.N. provided conceptual advice; M.T., G.L.O., L.H.R. wrote the manuscript, with contributions from all the authors.
Competing interests
The authors declare no competing interests.
Additional Information
Supplementary Information is available for this paper.
Acknowledgments
We thank Jérôme Gouzy and Nicolas B. Langlade for providing access to the HA412-HOv2 annotation and PSC8 genome assembly, Brook T. Moyers for discussion and providing the H. argophyllus picture, Julie Lee-Yaw and Armando J. Moreno-Geraldes for comments, Dominique Skonieczny, Amy Kim, Ana Parra and Cassandra Konecny for assistance with field work and data acquisition, James D. Herndon for providing the dune H. petiolaris picture, Mihir Nanavati and Andrew Warfield for computing advice, and Compute Canada for computing resources. Funding was provided by Genome Canada and Genome BC (LSARP2014-223SUN), the NSF Plant Genome Program (DBI-1444522), the International Consortium for Sunflower Genomic Resources, a HFSP long-term postdoctoral fellowship to M.T. (LT000780/2013), and a Banting postdoctoral fellowship to G.L.O.