Abstract
Genomic outcomes of hybridization depend on selection and recombination in hybrids. Whether these processes have similar effects on hybrid genome composition in contemporary hybrid zones versus ancient, stabilized hybrid lineages is unknown. Here we show that patterns of introgression in a contemporary hybrid zone in Lycaeides butterflies predict patterns of ancestry in geographically adjacent, ancient hybrid populations. We find a particularly striking lack of ancestry from one of the hybridizing taxa, Lycaeides melissa, on the Z chromosome in both the ancient and contemporary hybrids. The same pattern of reduced L. melissa ancestry on the Z chromosome is seen in two other ancient hybrid lineages. More generally, we find that patterns of ancestry in ancient hybrids are remarkably predictable from contemporary hybrids, which suggests selection affects hybrid genomes in a similar way across disparate time scales and during distinct stages of speciation and species breakdown.
Numerous species hybridize [1, 2] or show genomic evidence of ancient admixture [3, 4, 5]. Consequently, many organisms have genomes that are mosaics of chromosomal segments with ancestry from different lineages or species (e.g., [6, 7, 8, 9, 10, 11]). In hybrid zones, where gene flow and hybridization are ongoing, parental chromosomal segments are repeatedly introduced, and hybrids often vary substantially in genome composition [12, 13, 14]. In contrast, when gene flow and hybridization cease, as occurs during hybrid speciation or if one of the hybridizing taxa goes extinct, genome stabilization can occur whereby recombination causes ancestry segment size to decay and ancestry segments fix by drift or selection in the nascent hybrid lineage [7, 15, 16].
In both cases, the genomic mosaic of ancestry segments is shaped by selection, which can include selection against developmental incompatibilities and selection for locally adaptive ancestry combinations [12, 14]. In contemporary hybrid zones with ongoing gene flow, selection results in differential or restricted introgression of some ancestry segments (e.g., barrier loci) [17, 18, 19, 20, 21], whereas selection causes shifts in the frequency of ancestry segments (including fixation) in ancient, stabilized hybrid populations [22, 10, 16]. Changes in genome composition and patterns of linkage disequilibrium during the transition from hybrid zone to stabilized hybrid lineage could alter selection on ancestry segments. Likewise, the primary sources of selection could change from, for example, selection against developmental incompatibilities to selection for novel allele combinations that enhance fitness and persistence in a new environment. Unfortunately, comparisons of genome composition in contemporary versus ancient hybrids are mostly lacking, especially for natural hybrids (for lab hybrids, see [22, 16]). Consistency between patterns of introgression in contemporary hybrid zones and the genomic mosaic of ancestry in ancient hybrids would suggest a major role for selection in determining hybrid genome composition, and would establish connections between early and late stages of speciation, especially speciation with gene flow and hybrid speciation. For example, consistent patterns would suggest that the same genes or gene regions that prevent the fusion of hybridizing species also experience selection during the origin of hybrid lineages or species. 
Such consistency could allow genes or genetic regions of general importance for adaptation and speciation to be identified (e.g., genes with environment-independent effects on fitness) and aid in interpreting patterns of ancestry in cases where ancient admixture occurred but where contemporary hybrids are lacking (e.g., Homo sapiens × H. neanderthalensis [8]). In contrast, a lack of consistency might imply that outcomes of genome collisions are highly context dependent, and thus difficult to predict (e.g., [23]).
Here, we take advantage of natural hybridization in North American Lycaeides butterflies to test the consistency of genome composition between a contemporary hybrid zone and multiple, ancient hybrid lineages that have progressed towards genome stabilization. In North America, Lycaeides consists of a complex of four nominal species of small blue butterflies and numerous ancient hybrid populations or lineages [6, 24, 25]. Ancient hybrid populations in the central Rocky mountains and Jackson Hole (hereafter Jackson Hole Lycaeides) arose following hybridization between L. idas and L. melissa within the past 14,000 years [26]. Like L. idas (and unlike L. melissa), these Jackson Hole Lycaeides exhibit obligate diapause and are univoltine; they also use the same larval host plant (Astragalus miser) as nearby L. idas populations [26, 27, 28]. In contrast, hybrid lineages in the Sierra Nevada and Warner mountains of western North America occupy extreme alpine habitats not used by the parental species and exhibit novel, transgressive phenotypes, such as strong preference for an alpine endemic host plant (A. whitneyi) and a lack of egg adhesion to the host substrate [6, 29]. We recently found evidence suggestive of a narrow (1–2 kilometers), contemporary hybrid zone between L. melissa and Jackson Hole Lycaeides near the town of Dubois, Wyoming at the edge of the Rocky mountains [25]. Butterflies in the putative hybrid zone use feral (i.e., naturalized), roadside alfalfa (Medicago sativa) as their host plant (similar to some L. melissa populations), and are found in close proximity (along the road) to this nonnative plant.
In the current study, we first verify that the Dubois population constitutes a contemporary hybrid zone between L. melissa and Jackson Hole Lycaeides; the latter is an ancient hybrid lineage derived from L. melissa and L. idas. We then use a mixture of partial and whole genome sequence data to compare patterns of introgression in this hybrid zone to the genomic mosaic of ancestry in Jackson Hole Lycaeides, and to additional ancient hybrid lineages in the Sierra Nevada and Warner mountains, and thereby quantify the consistency of hybrid genome composition across these cases. We expect less consistency with the Sierra Nevada and Warner mountains lineages as they differ in their ecology and in their taxonomic origin; nonetheless, these lineages represent outcomes of collisions between Lycaeides genomes and are useful for understanding the general aspects of what transpires from initial hybridization, through isolation and stabilization. We thus use the contemporary Dubois hybrid zone as a window on the evolutionary process and ask whether similar or very different processes operated during the origin and establishment of multiple, ancient hybrid lineages.
Results
Evidence of a contemporary hybrid zone in Dubois
Patterns of genetic variation across 39,139 SNPs and 23 populations (N = 835 butterflies) show that the Dubois, Wyoming population constitutes a contemporary hybrid zone between L. melissa and the ancient Jackson Hole hybrid lineage, and thus differs from the ancient Jackson Hole Lycaeides populations (Table S1, Figs. 1, 2). Admixture proportions from entropy (ver. 1.2; [25]) and a principal component analysis (PCA) of genetic variation show that Jackson Hole Lycaeides and the Dubois population are genetically intermediate between L. idas and L. melissa (Figs. 2B, S1). Whereas Jackson Hole Lycaeides form a relatively tight cluster in PCA space (especially within populations) and show similar admixture proportions, butterflies from the Dubois population span the entire genomic gradient from L. melissa to Jackson Hole Lycaeides (Fig. 2C). This single population, which spans ∼1–2 kilometers (kms) along roadside alfalfa, exhibits greater variation in genome composition than the entirety of Jackson Hole Lycaeides, with a range of more than 10,000 km² [26]. Also consistent with ongoing hybridization in the Dubois population, these butterflies exhibit elevated linkage disequilibrium (Figs. S2, S3) and intermediate allele frequencies (Figs. S4, S5, S6) at ancestry informative SNPs (i.e., SNPs with an allele frequency difference of 0.3 or more between L. idas and L. melissa). Taken together, these results demonstrate ongoing hybridization in Dubois between L. melissa and nearby Jackson Hole Lycaeides, which are themselves a product of ancient hybridization between L. idas and L. melissa [26].
Patterns of restricted and directional introgression
We fit a Bayesian genomic cline model with bgc (ver. 1.04b; [30]) to quantify genome-wide variability in introgression between L. melissa and Jackson Hole Lycaeides in the Dubois hybrid zone. This method estimates clines in ancestry for individual genetic loci (e.g., SNPs) along a genome-average admixture gradient [31, 14] (Fig. S7). As such, it can be applied in cases in which hybrid zones are confined to a single geographic locality, such as the Dubois population. We focused this analysis on 1164 ancestry informative markers (AIMs; SNPs with allele frequency differences between L. idas and L. melissa ≥ 0.3). We detected credible variation in patterns of introgression across the genome (defined as cases where Bayesian 95% credible intervals [CIs] for genomic cline parameters did not span zero; Fig. 3). Of the 1164 AIMs, 34 (2.9%) showed credible evidence of restricted introgression (95% CIs for cline parameter β > 0) and constitute candidates for genomic regions harboring barrier loci between L. melissa and Jackson Hole Lycaeides ([32]; Figure 3B). 189 of the AIMs (16.2%) had credible evidence of excess Jackson Hole Lycaeides introgression relative to genome-average expectations (i.e., directional introgression of Jackson Hole alleles; 95% CIs for cline parameter α > 0), and 273 AIMs (23.5%) had credible evidence of excess L. melissa introgression (i.e., directional introgression of L. melissa alleles; 95% CIs for cline parameter α < 0). Similar results were obtained when only male butterflies were analyzed; males carry two copies of the Z chromosome (Fig. S8).
Genetic loci exhibiting directional or restricted introgression were distributed across the 23 Lycaeides chromosomes (Figure 3B). However, randomization tests showed that AIMs with restricted introgression (34 loci with β > 0) and those with excess Jackson Hole introgression (189 loci with α > 0) were found on the Z sex chromosome more than expected by chance (β > 0, number on Z = 31, x-fold enrichment = 4.70, P < 0.001; α > 0, number on Z = 67, x-fold enrichment = 1.84, P < 0.001) (Fig. 3C,D). This was not true for AIMs with directional introgression of L. melissa alleles (273 loci with α < 0, number on Z = 37, x-fold enrichment = 0.70, P = 0.998). There was little evidence that genetic loci showing exceptional patterns of introgression in Dubois were clustered on genes or other structural genomic features (we did find some evidence of AIMs with credible L. melissa introgression being over-represented in coding sequences and proteins; Tables S2, S3 and S4).
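The logic of such a randomization test can be sketched as follows. This is a minimal illustration, not the procedure used for the reported values: it assumes the null distribution is built by drawing the focal number of AIMs at random, without replacement, from all 1164 AIMs (225 of which are on the Z), and the function name and arguments are hypothetical.

```python
import random

def z_enrichment(n_aims, n_on_z, n_focal, focal_on_z, n_perm=10000, seed=1):
    """Permutation test for an excess of focal AIMs on the Z chromosome.

    Assumed null model: the focal AIMs are a random draw (without
    replacement) from all AIMs, irrespective of chromosome.
    """
    rng = random.Random(seed)
    labels = [1] * n_on_z + [0] * (n_aims - n_on_z)  # 1 marks a Z-linked AIM
    # Number of Z-linked AIMs in each random draw of n_focal AIMs
    null_counts = [sum(rng.sample(labels, n_focal)) for _ in range(n_perm)]
    expected = sum(null_counts) / n_perm
    x_fold = focal_on_z / expected
    # One-sided p-value with the usual +1 correction
    p_value = (1 + sum(c >= focal_on_z for c in null_counts)) / (n_perm + 1)
    return x_fold, p_value

# Restricted introgression: 34 AIMs with beta > 0, 31 of them on the Z
x_fold, p_value = z_enrichment(n_aims=1164, n_on_z=225, n_focal=34,
                               focal_on_z=31, n_perm=2000)
print(round(x_fold, 2), p_value)
```

Under this null the expected count is close to 34 × 225 / 1164 ≈ 6.6 Z-linked AIMs, so 31 observed gives an enrichment of roughly 4.7, in line with the value reported above.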
Genome composition and genome stabilization in the ancient Jackson Hole hybrid lineage
We next quantified genome-wide variation in the frequency of L. idas versus L. melissa ancestry segments in each of ten Jackson Hole Lycaeides populations (Table S1). This includes populations adjacent to the Dubois hybrid zone (e.g., Bald Mountain [BLD] which is ∼5 km from the Dubois site), and much more distant populations (e.g., Bunsen Peak [BNP], which is on the Yellowstone plateau 173 km from Dubois) (Fig. 2A). We once again focused on the 1164 AIMs. We estimated ancestry segment frequencies using the correlated beta process model implemented in popanc (ver. 0.1; [15]). This method is similar to a hidden Markov model and accounts for the expected autocorrelation in ancestry along chromosomes, but allows ancestry frequencies to vary along the genome. The average frequency of L. idas ancestry varied from 0.76 (BNP) to 0.51 (BCR), resulting in a northwest-southeast cline in mean genome composition (consistent with [26]) (Figure 4). However, ancestry frequencies varied considerably within and among chromosomes, with 7.7–33.2% of the genome fixed or nearly fixed (frequency ≥ 0.95) for L. melissa or L. idas ancestry (Figure 4, S9). More of the genome was fixed (or nearly fixed) for L. idas ancestry than L. melissa ancestry (mean = 10.9% versus 3.5%), and overall rates of fixation were higher on the Z than autosomes (mean = 30.1% on the Z). Similar results were obtained when only male butterflies were analyzed (Fig. S10, S11).
Regions of exceptionally high L. idas ancestry frequencies were especially pronounced on the Z chromosome, and this held across all ten Jackson Hole populations (21.3–32.0% of the 225 AIMs on the Z had L. idas ancestry frequencies ≥ 0.95). Specifically, randomization tests showed that the 10% of AIMs with the highest L. idas ancestry frequencies were found on the Z chromosome 1.28 to 3.32 times more often than expected by chance (Table S5). In contrast, we found little evidence that regions of high L. melissa ancestry were over-represented on the Z chromosome (Table S6). Genomic regions with the highest levels of L. idas ancestry (i.e., the top 10% of AIMs with the highest L. idas ancestry) were in or near (within 1000 bp) genes (observed = 63, x-fold enrichment = 1.20, P = 0.024) and in or near gene coding sequences (observed = 55, x-fold enrichment = 1.22, P = 0.024) more often than expected by chance (Table S7). A similar pattern held for regions of high L. melissa ancestry (Table S8).
Introgression in contemporary hybrids predicts genome composition in ancient Jackson Hole hybrids
Regions of the genome showing exceptional introgression in the contemporary Dubois hybrid zone coincided with genomic regions exhibiting extreme ancestry frequencies in the ancient Jackson Hole hybrid lineage to a much greater extent than expected by chance (Figure 5). For example, AIMs exhibiting restricted introgression in the Dubois hybrid zone (top 10% of AIMs with the highest β values from bgc) overlapped with AIMs showing the most extreme ancestry frequencies in Jackson Hole (top 10% closest to 0 or 1) over four times more often than expected by chance (observed = 49, x-fold enrichment = 4.23, P < 0.001; Figure 5C). The degree of overlap increased when we considered more extreme cutoffs for the comparison, up to the top 1% of AIMs with the greatest degree of restricted introgression or most extreme ancestry frequencies (x-fold enrichments ranged from 4.23 to 35.42; Figure 5D). Similar patterns were observed when comparing AIMs with evidence of excess directional introgression of Jackson Hole alleles in Dubois (top 10% with highest values of α) and those with the highest (i) L. idas (observed = 22, x-fold enrichment = 1.91, P = 0.0007) or (ii) L. melissa (observed = 26, x-fold enrichment = 2.24, P < 0.001) ancestry frequencies in Jackson Hole Lycaeides (Figure 5A,D). At least for the case of high L. idas ancestry, the strength of this signal of excess overlap was again more pronounced when considering more extreme cutoffs up to the top 1% of AIMs (x-fold enrichments ranged from 1.91 to 16.93; Fig. 5F). In contrast, we found no evidence that AIMs with excess directional introgression of L. melissa alleles in Dubois (top 10% with lowest values of α) coincided with those AIMs with the highest L. idas or L. melissa ancestry frequencies in Jackson Hole Lycaeides (Figs. 5B,E,F).
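The x-fold enrichments reported here compare an observed overlap between two top-ranked sets of AIMs to the overlap expected by chance. A minimal sketch of that comparison is given below. It assumes the chance expectation is the product of the two set sizes divided by the total number of AIMs (the mean under independent random sets), and that "extreme" ancestry is scored as distance from a frequency of 0.5; the function names are hypothetical, and the reported P-values would additionally require a randomization step.

```python
def top_fraction(scores, frac=0.10):
    """Indices of the loci holding the largest `frac` of the scores."""
    k = int(len(scores) * frac)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def x_fold_overlap(set_a, set_b, n_total):
    """Observed overlap, chance expectation, and their ratio."""
    observed = len(set_a & set_b)
    expected = len(set_a) * len(set_b) / n_total  # mean overlap of independent sets
    return observed, expected, observed / expected

# Toy example with 20 loci and a top-25% cutoff (values are made up)
beta = list(range(20))                   # stand-in for cline parameter beta
freq = [0.5] * 20                        # stand-in for ancestry frequencies
for i, f in {0: 0.01, 1: 0.99, 17: 0.02, 18: 0.98, 19: 1.0}.items():
    freq[i] = f
extremeness = [abs(f - 0.5) for f in freq]

restricted = top_fraction(beta, 0.25)
extreme = top_fraction(extremeness, 0.25)
print(x_fold_overlap(restricted, extreme, len(beta)))
```

With 1164 AIMs and 116 AIMs in each top-10% set (an assumption about how the cutoff was rounded), the chance expectation is about 11.6 shared AIMs, so an observed overlap of 49 corresponds to an enrichment of roughly 4.2.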
Evidence of consistency (i.e., predictability) of genome composition between contemporary Dubois hybrids and the ancient Jackson Hole hybrids comes from both the autosomes and Z sex chromosome. Specifically, when we repeated the above comparisons with either the 22 autosomes or the Z chromosome alone, we obtained mostly similar results. For example, AIMs showing restricted introgression (top 10% with the highest values for β) and those with the most extreme ancestry frequencies in Jackson Hole Lycaeides (i.e., closest to being fixed for L. idas or L. melissa ancestry) coincided 2.4 times (observed = 23, P < 0.0001) and 3.9 times (observed = 37, P < 0.0001) more often than expected by chance for the autosomes and Z chromosome, respectively (Fig. 5G,H). Similarly, AIMs showing evidence of excess directional introgression of Jackson Hole alleles in Dubois (top 10%) and the highest frequencies of L. idas ancestry in the Jackson Hole populations coincided 2.2 (autosomes, observed = 21, P < 0.0001) and 2.8 (Z, observed = 12, P = 0.0003) times more often than expected by chance (Fig. 5G,H). We obtained similar results when defining AIMs as SNPs with an allele frequency difference of greater than 0.2 rather than 0.3 (2126 SNPs) (Figure S12), and when basing our analyses only on male butterflies, which carry two copies of the Z chromosome (Figs. S13, S14).
We next analyzed the potential functional significance of genetic regions harboring the AIMs that were in the top 10% for both restricted introgression in Dubois (high β) and extreme (high L. idas or high L. melissa) ancestry frequencies in the ancient Jackson Hole Lycaeides lineage. We focus on this set of 49 loci as we think these are our best candidates for tagging regions of the genome harboring barrier loci, that is regions of the genome that constitute the basis of (partial) reproductive isolation among these lineages. Twenty-seven of these 49 AIMs were in or near genes (i.e., within 1000 bps of annotated gene boundaries) (in general, these barrier loci were not over-represented in or near specific structural features of the genome; Table S9, also see Tables S10 and S11). These genes exhibited a range of predicted functions (Tables S12, S13), with several genes standing out as being of particular interest. This includes the Z-linked 6-phosphogluconate dehydrogenase gene, 6-pgd (Fig. S15), which is associated with cold hardiness and diapause in other insects [33, 34, 35], and constitutes or is linked to a barrier locus in swallowtail butterflies [36, 37, 38]. Four AIMs were in or near two immunoglobulin superfamily genes; these were also on the Z chromosome (Tables S12, S13, Fig. S15). This gene superfamily is crucial for pathogen defense in insects [39], and is associated with reproductive isolation in mice [40]. Also among this set of genes was an autosomal olfactory receptor/odorant binding gene (Fig. S15). Such genes are known to affect host plant use in other butterflies [41, 42]. One of the 49 AIMs was within the autosomal nuclear pore gene Nup93 (Tables S12, S13). This is part of the nuclear pore complex, which is involved in multiple Dobzhansky-Muller incompatibilities in Drosophila.
Finally, two AIMs in armadillo or armadillo-like genes (one autosomal and one Z-linked) that are involved in the Wnt-signalling pathway and affect wing development in Heliconius and other butterflies were among this set of candidate barrier loci [43]. A diversity of functions were also predicted for the subset of AIMs with both high directional introgression of Jackson Hole alleles in Dubois and high L. idas ancestry in Jackson Hole Lycaeides (13 out of the 22 of these were in or near genes); this includes immunoglobulin superfamily genes (Tables S14, S15) (also see Tables S16 and S17).
Extending predictability to additional, ancient hybrid lineages
We next asked whether patterns of introgression in the contemporary, Dubois hybrid zone were also predictive of patterns of ancestry in two additional, ancient hybrid lineages from the Sierra Nevada and Warner mountains of western North America [6, 29, 25]. These ancient hybrid lineages occupy alpine habitat on isolated mountains ∼900–1000 kilometers from Dubois. Past work suggests that these ecologically distinct populations arose within the past two million years (and perhaps much, much more recently) following hybridization between L. anna and L. melissa (with a possible contribution from L. idas or a close relative of L. idas/L. melissa) [6, 44, 29, 25]. Using high-coverage, whole genome sequences from L. anna, L. melissa, L. idas, and the Sierra Nevada and Warner mountain lineages, we estimated phylogenetic relationships for the entire genome and for 1000 SNP windows along each chromosome to characterize patterns of ancestry in the ancient hybrids. Based on the results from Dubois and the Jackson Hole populations, we predicted reduced L. melissa ancestry on the Z chromosome, and topologies suggesting reduced introgression for the candidate barrier loci (i.e., regions containing the 49 AIMs with restricted introgression in Dubois and extreme ancestry frequencies in Jackson Hole).
Consistent with past work, the whole-genome consensus or “species” tree showed that the Warner mountain population was intermediate between L. anna and L. melissa/L. idas, whereas the Sierra Nevadan population was genetically more similar to L. anna (Fig. 6). Nonetheless, trees based on 1000 SNP windows varied across the genome. The species tree was the most common topology overall (29.9%), but several “introgression” trees were also common, especially on the autosomes (Figs. 6, S16, S17). Whereas the trees that differ from the species tree could reflect incomplete lineage sorting, tree topologies were autocorrelated along the genome which suggests that introgression has contributed to these patterns as well (Fig. S17).
The species (i.e., consensus) tree was seen about 30% of the time for autosomal windows, Z chromosome windows and the 1000 SNP windows that contained the 49 candidate barrier loci (Fig. S16). The second most common autosomal tree suggested introgression from L. melissa into the Warner mountain lineage (24.6% of topologies). As predicted, this tree was significantly and substantially underrepresented on the Z chromosome (2.3%, x-fold = 0.10, P < 0.001) and for the candidate barrier loci (6.1%, x-fold = 0.43, P = 0.047) (Fig. 6). In contrast, a tree uniting the two ancient hybrid lineages was rare on the autosomes (4.3%), but commonly observed on the Z chromosome (26.2%, x-fold = 5.36, P < 0.001) and for the barrier loci (28.6%, x-fold = 3.62, P = 0.008) (Fig. 6). This suggests an ancient shared ancestry between these hybrid lineages that has been especially resistant to gene flow, or alternatively, Z-biased introgression between them. Thus, we find evidence of exceptional patterns of introgression and ancestry in these ancient hybrids consistent with patterns observed in the Dubois hybrid zone and ancient Jackson Hole hybrid lineage. Because the majority of the candidate barrier loci were Z-linked, the signal for these loci is not independent of the signal for the Z chromosome.
Discussion
Biologists have endeavored to connect microevolutionary processes to macroevolutionary changes that have occurred over longer periods of time (e.g., [45, 46, 47]). Here we bridge this gap by showing that microevolutionary patterns and processes in a contemporary hybrid zone predict the genome composition of ancient, (partially) stabilized hybrid taxa (a macroevolutionary-scale outcome). This mirrors past work in Helianthus showing the genome composition of ancient hybrid sunflower species could be predicted from evolution in synthetic hybrids and QTL mapping experiments [48, 22]. However, our results are novel in that they demonstrate that the outcomes of genome collisions can be predicted from natural hybrids as well, and despite differences in the ecological and genomic context of the different hybridization events. Additional work in other organisms where ancient and contemporary hybrids are found is needed to determine the generality of this result (e.g., Populus and Helianthus; [49, 50, 51]). Beyond this, genomic analyses of multiple lineages at various points along the continuum from hybrid zone to stabilized hybrid species could prove particularly interesting. Such analyses could help distinguish between consistency arising from rapid fixation of ancestry segments versus sustained, consistent selection pressures over long periods of time. More generally, our results suggest that genomic analyses of both hybrid zones and hybrid lineages or species might offer a particularly tractable framework for assessing the ways in which and degree to which speciation (especially hybrid speciation) might be predictable from microevolutionary processes of selection and recombination.
Patterns of parallel or repeatable evolution strongly suggest a major role for the deterministic process of natural selection [52, 53]. Thus, genes and gene regions with restricted introgression in Dubois and extreme ancestry frequencies or phylogenetic relationships in Jackson Hole and the other ancient hybrid lineages represent strong candidates for harboring barrier loci (i.e., speciation genes). Such predictability (i.e., consistency) was highest when considering the Z chromosome, where we detected a lack of L. melissa ancestry across the hybrid zone and hybrid lineages. This could be explained by a higher density of selected loci on the Z in general [54], and specifically by an excess of intrinsic incompatibilities arising from epistatic interactions, especially as the hybrid populations generally harbored less L. melissa ancestry (lower L. melissa ancestry could reflect selection or demographic aspects of the hybridization process). Nonetheless, our ability to predict the genome composition of ancient hybrids, especially the Jackson Hole lineage, held even when excluding the Z. Thus, while our results are consistent with theory and other empirical studies that suggest a disproportionate role for sex chromosomes in speciation (e.g., [55, 54, 21, 14]), this was not the sole factor that made genome composition predictable in this system. Indeed, our analyses highlighted candidate barrier loci on autosomes (e.g., an autosomal olfactory receptor/odorant binding gene and Nup93), as well as on the Z chromosome.
In conclusion, the fact that putative barrier loci in the Dubois hybrid zone exhibit exceptional ancestry patterns in ancient hybrid lineages is consistent with the hypothesis that the same genes or gene regions that prevent species fusion during hybridization experience selection during the evolution of hybrid lineages or species. This hypothesis is also supported by studies in other systems showing that incompatibility loci contribute to hybrid speciation [48, 16], and is generally consistent with accumulating data suggesting the same genes are repeatedly involved in adaptation (e.g., [56, 57]). Thus, seemingly distinct facets of speciation, such as the maintenance of taxonomic boundaries across hybrid zones and the origin of novel biodiversity through admixture, may have a predictable, common genetic basis.
Methods
Genome assembly and annotation
We generated a new, chromosome-scale reference genome for L. melissa from information on proximity ligation of DNA in chromatin and reconstituted chromatin. Our previous genome assembly comprised 14,029 scaffolds (total assembly length = 360 mega base pairs [mbps]; scaffold N50 = 65 kilo base pairs [kbps]), which had been joined in a linkage map with 23 linkage groups (Lycaeides has 22 autosomes and a ZW sex chromosome system) [58, 59]. In the current study, we improved upon this assembly using DNA sequence data from Chicago and Hi-C libraries [60, 61]. Creation and sequencing of the Chicago and Hi-C libraries was outsourced to Dovetail Genomics. The new sequence data and our old assembly were combined using the HiRise assembler (also outsourced to Dovetail Genomics). The new L. melissa genome assembly has a final N50 of 15.5 mbps, and 90% of the genome comprises only 21 scaffolds. A whole genome comparative alignment with mummer (version 3.2 with the maximal unique matches setting; [62]) of this new genome assembly with the previously published genome assembly and linkage map shows that each of our previously defined linkage groups corresponds with one (or in one case two) of the new, large scaffolds.
We annotated structural and functional features of the new L. melissa genome using the maker pipeline (version 2.31.10) [63, 64]. This pipeline uses repeat masking, protein and RNA alignment and ab initio gene prediction to perform evidence-based gene prediction, which generates annotations that are supported by quality scores. Prior to using maker, we identified repeats in the L. melissa genome using repeatscout (version 1.0.5) [65]. This program identifies repeat elements, including tandem repeats and low complexity elements, and removes them from the genome. We took this approach to avoid missing repeat elements not already in standard databases. We provided this de novo repeat library to maker, and used this along with repbase in repeatmasker to mask repetitive elements in the genome. maker can use protein and RNA sequence data for genome annotation. Since we lacked protein sequences for L. melissa, we downloaded 28 protein sequence fasta files from 15 butterfly species (see Table S18) from LepBase (version 4) [66]. We concatenated these fasta files to generate a protein sequence database for maker. We used data from 24 L. melissa transcriptomes (Forister et al., manuscript in prep.) as additional evidence for genome annotation. We first used trimgalore (version 2.6.6, https://github.com/FelixKrueger/TrimGalore) for adapter trimming and quality filtering of paired-end RNA sequences. We then used these trimmed reads to generate a de novo transcriptome assembly with trinity (version 2.6.6) [67, 68]. The assembled transcriptome was passed to maker.
We ran two rounds of the maker pipeline. We first ran maker without using any information from ab initio gene predictors (e.g., augustus), to generate de novo gene models for the L. melissa genome. We then ran the maker pipeline again and used the gene models from the first run to train two gene predictors: augustus and snap. We ran snap (version 2006-07-08) [69] by using models with AED scores of 0.25 or better and a length of 50 or more amino acids. We ran augustus (version 3.3) with the insect predictions [70]. We then used both of these sets of gene predictions in the second run of maker. We then used the output from maker to obtain functional annotations of the L. melissa genome. We assigned putative gene functions by using blastp to query the maker output against the UNIPROT/SWISSPROT database [71]. We also used interproscan [72] to add protein and gene ontology information to the gene models. The final annotation included 11,247 putative genes, 48,765 putative coding sequences, and 8893 UTR sequences.
DNA sequencing, alignment, and genetic variant detection
We analyzed partial genome sequences from 835 Lycaeides butterflies from 23 populations: eight L. melissa populations (N = 238 butterflies), four L. idas populations (N = 156 butterflies), 10 Jackson Hole Lycaeides populations (N = 326 butterflies) and the Dubois hybrid zone (N = 115 butterflies) (Table S1). The sequence data from 643 of these butterflies were previously described in a study of admixture across the Lycaeides species complex [25]. Data from 192 of the butterflies were generated for the current study, and this includes many (but not all) of the Dubois individuals. We extracted DNA, generated genotyping-by-sequencing (GBS) libraries and sequenced these libraries following the protocols described in [73]. The GBS libraries were sequenced on an Illumina HiSeq 2500 (100 bp, single-end reads) by the Genome Sequencing and Analysis Facility at the University of Texas (Austin, TX).
We used bwa (version 0.7.17) to align the GBS sequences from 835 individuals to the draft L. melissa genome by using the mem algorithm [74, 75]. We ran bwa mem with a minimum seed length of 15, internal seeds longer than 20 bp, and only output alignments with a quality score of ≥30. We then used samtools (version 1.5) to compress, sort and index the alignments [76]. We used samtools (version 1.5) and bcftools (version 1.6) for variant calling. For variant calling, we used the recommended mapping quality adjustment for Illumina data (-C 50), skipped alignments with mapping quality less than 20, skipped bases with base quality less than 15, and ignored insertion-deletion polymorphisms. We set the prior on SNPs to 0.001 (-P) and called SNPs when the posterior probability that the nucleotide was invariant was ≤0.01 (-p). We filtered the initial set of variants to retain only those SNPs with sequence data for at least 80% of the individuals, a mean sequence depth of 2 per individual, at least 4 reads of the alternative allele, a minimum quality score of 30, a minimum (overall) minor allele frequency of at least 0.005, and no more than 1% of the reads in the reverse orientation (this is an expectation for our GBS method). We further removed SNPs with excessive coverage (3 standard deviations above the mean) or that were tightly clustered (within 3 bp of each other), as these could reflect poor alignments (e.g., reads from multiple paralogs mapping to the same region of the genome). Finally, because we combined data from two different sequencing runs, we also removed any SNPs with a difference in sequence coverage between the published and new data that was more than half the mean coverage for the two data sets combined. This left us with 39,139 SNPs for downstream analyses.
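The per-SNP filters listed above amount to a conjunction of threshold checks, which can be sketched as follows. The record type and field names are hypothetical (they are not the output format of samtools or bcftools), the depth criterion is assumed to mean a mean depth of at least 2 per individual, and the clustering step is assumed to drop every SNP that lies within 3 bp of another SNP on the same scaffold.

```python
from dataclasses import dataclass

@dataclass
class SnpStats:
    """Hypothetical per-SNP summary; field names are illustrative only."""
    frac_ind_with_data: float   # fraction of individuals with sequence data
    mean_depth: float           # mean sequence depth per individual
    alt_reads: int              # reads supporting the alternative allele
    qual: float                 # variant quality score
    maf: float                  # overall minor allele frequency
    frac_reverse_reads: float   # fraction of reads in reverse orientation

def passes_filters(s: SnpStats) -> bool:
    """Apply the threshold filters described in the text."""
    return (s.frac_ind_with_data >= 0.80
            and s.mean_depth >= 2.0      # assumed to be a lower bound
            and s.alt_reads >= 4
            and s.qual >= 30
            and s.maf >= 0.005
            and s.frac_reverse_reads <= 0.01)

def drop_clustered(positions, max_dist=3):
    """Keep SNP positions (one scaffold) with no neighbor within max_dist bp."""
    pos = sorted(positions)
    keep = []
    for i, p in enumerate(pos):
        near_prev = i > 0 and p - pos[i - 1] <= max_dist
        near_next = i < len(pos) - 1 and pos[i + 1] - p <= max_dist
        if not (near_prev or near_next):
            keep.append(p)
    return keep

good = SnpStats(0.92, 6.5, 40, 99.0, 0.12, 0.0)
rare = SnpStats(0.92, 6.5, 4, 99.0, 0.002, 0.0)   # fails the MAF filter
print(passes_filters(good), passes_filters(rare))
print(drop_clustered([10, 12, 50, 100, 103, 200]))
```

The coverage-based filters (excess coverage and between-run coverage differences) depend on summary statistics across all SNPs and are not shown.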
Estimating genotypes, admixture proportions, population allele frequencies, and linkage disequilibrium
We used the admixture model from entropy (version 1.2) [25] to obtain Bayesian estimates of genotypes and admixture proportions. This analysis was based on the full data set of 835 individuals and 39,139 SNPs. The admixture model in entropy is similar to that in structure [77], but differs by accounting for uncertainty in genotypes arising from limited sequence coverage and sequence errors, and by allowing simultaneous estimation of genotypes and admixture proportions [25]. We fit the model with k ∈ {2, 3, 4, 5} source populations. For each value of k, we ran three Markov chain Monte Carlo (MCMC) chains, each with 15,000 MCMC iterations, a burn-in of 5000 iterations and a thinning interval of 5. We used assignments from a discriminant analysis of principal components to initialize the MCMC algorithm; this speeds convergence to the posterior and avoids label switching during MCMC without affecting the posterior probability distribution [25]. We obtained genotype estimates as the posterior mean allele count for each individual and locus across chains and values of k (i.e., this integrates over uncertainty in the number of hypothetical source populations). We focused on admixture proportions for k = 2, as we were interested in the two nominal species and hybrids between them. We summarized patterns of population structure and admixture across the sampled populations and individuals based on these admixture proportions and a principal component analysis (PCA) of the genotypic data. We performed the PCA in R on the centered but not standardized genotype matrix with the prcomp function.
We used the expectation-maximization (EM) algorithm implemented in estpEM (version 0.1) [53] to estimate allele frequencies for each SNP (39,139 SNPs) and population (N = 23 populations). The EM algorithm estimates allele frequencies while allowing for uncertainty in genotypes [78, 53]. We used the genotype likelihoods calculated with bcftools for this analysis, and ran the algorithm with a convergence tolerance of 0.001 and a maximum of 20 EM iterations. We used these allele frequency estimates to designate ancestry informative SNPs/markers (AIMs) as those with an allele frequency difference ≥0.3 between L. melissa (mean of BST, SIN, CDY, and CKV) and L. idas (mean of KHL, SDC, and SYC) (population IDs are defined in Table S1).
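To make the logic of this step concrete, the following Python sketch shows a textbook EM update for the allele frequency at one biallelic SNP given per-individual genotype likelihoods, followed by the AIM criterion. It is a minimal illustration, not the estpEM implementation: the Hardy-Weinberg genotype prior, the starting value of 0.5, the clamping of the frequency away from 0 and 1, and the function names are assumptions.

```python
def em_allele_freq(gls, tol=0.001, max_iter=20, p_init=0.5):
    """Estimate the frequency of the counted allele at one SNP.

    gls: one (L0, L1, L2) tuple per individual, giving the likelihood of
         the reads for 0, 1 or 2 copies of the allele (not log-scaled).
    """
    p = p_init
    for _ in range(max_iter):
        # Genotype prior under Hardy-Weinberg proportions (assumption).
        prior = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
        expected_copies = 0.0
        for gl in gls:
            weights = [lik * pr for lik, pr in zip(gl, prior)]
            total = sum(weights)
            # Posterior mean allele count for this individual.
            expected_copies += (weights[1] + 2 * weights[2]) / total
        p_new = expected_copies / (2 * len(gls))
        if abs(p_new - p) < tol:
            return p_new
        # Keep p inside (0, 1) so the prior never becomes degenerate.
        p = min(max(p_new, 1e-6), 1 - 1e-6)
    return p


def is_aim(freqs_melissa, freqs_idas, min_diff=0.3):
    """True if the mean allele frequencies of the two sets of
    populations differ by at least min_diff."""
    mean_m = sum(freqs_melissa) / len(freqs_melissa)
    mean_i = sum(freqs_idas) / len(freqs_idas)
    return abs(mean_m - mean_i) >= min_diff
```

When genotypes are known without error (likelihood 1 for one genotype and 0 for the others), the estimate reduces to the simple allele count divided by twice the number of individuals.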
Finally, we calculated a metric of linkage disequilibrium (LD), the Pearson correlation coefficient between genotypes at pairs of loci, for all pairwise combinations of AIMs in each population where 20 or more butterflies were collected. Our goal was to ask whether LD was elevated in the Dubois hybrid zone, as predicted by theory (e.g., [79]). We first polarized genotypes such that positive LD (i.e., positive Pearson correlations) coincided with an association between coupling alleles, that is, between alleles that were both more common in L. idas or both more common in L. melissa, whereas negative LD (i.e., negative Pearson correlations) coincided with an association between repulsion alleles, that is, between an allele more common in L. melissa and an allele more common in L. idas. Ongoing hybridization should cause a shift towards higher positive estimates of LD, even for unlinked markers (via admixture LD). Correlations were calculated in R (version 3.5.1).
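The polarization step can be illustrated with a small Python sketch (the analysis itself was run in R). The function names and the use of parental allele frequencies to decide the polarity of each locus are assumptions for illustration. Genotypes are allele counts (0, 1 or 2), and each locus is recoded so that it counts the allele that is more common in L. idas.

```python
from math import sqrt


def polarize(genotypes, freq_idas, freq_melissa):
    """Recode allele counts so they count the L. idas-associated allele."""
    if freq_idas >= freq_melissa:
        return list(genotypes)
    return [2 - g for g in genotypes]


def pearson(x, y):
    """Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for an invariant locus
    return sxy / sqrt(sxx * syy)


def polarized_ld(geno_a, geno_b, freqs_a, freqs_b):
    """LD between two AIMs; freqs_* are (freq_idas, freq_melissa) pairs."""
    a = polarize(geno_a, *freqs_a)
    b = polarize(geno_b, *freqs_b)
    return pearson(a, b)
```

After polarization, two loci whose raw allele counts are negatively correlated only because different alleles were counted at each locus yield a positive correlation, as intended for coupling associations.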
Estimating cline parameters in the Dubois hybrid zone
We fit Bayesian genomic clines for each of the 1164 AIMs using bgc (ver. 1.04b; [30]) to quantify genome-wide variability in introgression between L. melissa and Jackson Hole Lycaeides in the Dubois hybrid zone. This method estimates clines in ancestry for individual genetic loci (e.g., SNPs) along a genome-average admixture gradient [31, 14]. As such, it can be applied in cases where hybrid zones are confined to a single geographic locality, such as the Dubois population. Deviations between genome-average introgression and introgression for each locus are measured with two cline parameters, α and β. Cline parameter α denotes an increase (for positive values of α) or decrease (for negative values of α) in the probability of ancestry from reference (parental) species 1 relative to null expectations from an individual’s hybrid index. Genomic cline parameter β describes an increase (positive values) or decrease (negative values) in the rate of transition from parental species 0 ancestry to parental species 1 ancestry along the genome-wide admixture gradient. When placed in a geographic context, α is equivalent to twice the shift in cline center, and β measures the decrease (or increase when β < 0) in cline width relative to the average [79, 14]. Cline parameters can be affected by genetic drift and selection in hybrids [32, 14]. However, in the absence of major geographic barriers to gene flow, high positive values of β (i.e., the equivalent of narrow clines in a geographic context) are most readily explained by selection.
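The roles of α and β can be illustrated numerically. The sketch below assumes the cline function commonly given for this model, φ = h + 2(h − h²)(α + β(2h − 1)), where h is the hybrid index and φ is the probability of species 1 ancestry at a locus. This form and the simple truncation to [0, 1] are assumptions here and were not checked against the bgc version used in this study.

```python
def cline_probability(h, alpha=0.0, beta=0.0):
    """Probability of species 1 ancestry at a locus for an individual
    with hybrid index h, under the assumed cline function.

    alpha shifts the cline up (alpha > 0) or down (alpha < 0);
    beta steepens (beta > 0) or flattens (beta < 0) the transition.
    """
    phi = h + 2.0 * (h - h * h) * (alpha + beta * (2.0 * h - 1.0))
    # Simple truncation to keep the value a valid probability (assumption).
    return min(1.0, max(0.0, phi))


if __name__ == "__main__":
    for h in (0.25, 0.5, 0.75):
        print(h,
              cline_probability(h),            # null expectation
              cline_probability(h, alpha=0.5),  # directional introgression
              cline_probability(h, beta=1.0))   # restricted introgression
```

With α = β = 0 the function returns h, which is the null expectation from genome-wide admixture. A positive β pushes φ below h for h < 0.5 and above h for h > 0.5, which corresponds to a steeper transition along the admixture gradient.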
We fit the bgc genomic clines model for the 115 Lycaeides butterflies from the Dubois hybrid zone. We used L. melissa (LAN, SIN, and CKV; N = 131 butterflies) and Jackson Hole Lycaeides (BLD and FRC; N = 94 butterflies) as reference or parental species for the analysis. These specific populations were chosen as the source populations because they were nearest to the Dubois hybrid zone. Genotype likelihoods from bcftools were used as input for the analysis (the model fit incorporates uncertainty in genotype as captured by the genotype likelihoods). We ran the analysis using only the 1164 AIMs. We fit the model using MCMC, with five chains each with 25,000 iterations, a 5000 iteration burn-in and a thinning interval of 5. We inspected the MCMC output to assess convergence of chains to the stationary distribution and combined the output of the five chains. We repeated this analysis with only males and with AIMs defined as those SNPs with an allele frequency difference of ≥0.2 between L. melissa and L. idas (2126 SNPs) to assess the robustness of our results.
We defined credible deviations from null expectations given genome-wide admixture as cases where the 95% credible intervals (specifically the equal-tail probability intervals) for α or β for a given locus (AIM) excluded 0. We focused specifically on cases of credible directional (α ≠ 0) or restricted (β > 0) introgression relative to the genome-wide average. We used randomization tests to ask whether (and to what extent) such loci were over- (or under-) represented on the Z chromosome or in or near (within 1000 bp of) certain annotated structural features of the genome, such as genes, coding sequences, transposable elements and annotated protein sequences or motifs. Null expectations were derived from 1000 permutations of cline parameter estimates across the 1164 AIMs.
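The criterion for a credible deviation can be written compactly. The following Python sketch computes an equal-tail interval from a vector of posterior samples for one cline parameter and checks whether it excludes 0. The function names and the linear-interpolation rule for quantiles are assumptions; other quantile definitions give slightly different bounds.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def equal_tail_interval(samples, level=0.95):
    """Equal-tail probability interval from posterior samples."""
    tail = (1.0 - level) / 2.0
    ordered = sorted(samples)
    return quantile(ordered, tail), quantile(ordered, 1.0 - tail)


def excludes_zero(samples, level=0.95):
    """True if the equal-tail interval lies entirely above or below 0."""
    low, high = equal_tail_interval(samples, level)
    return low > 0 or high < 0
```

Applied to the combined MCMC samples of α or β for each AIM, a function like excludes_zero would flag the loci treated as credible deviations, and the sign of the interval would distinguish the direction of the deviation.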
Analysis of population ancestry in Jackson Hole Lycaeides
We estimated ancestry segment frequencies across the genome for each of the ten Jackson Hole ancient hybrid populations using the correlated beta process model implemented in popanc (ver. 0.1; [15]). This method is similar to a hidden Markov model and accounts for the expected autocorrelation in ancestry along chromosomes, but allows ancestry frequencies to vary along the genome. It is particularly well suited for cases where genome stabilization has begun but is not yet complete [15]. We ran popanc for each of the Jackson Hole populations to estimate the frequency of L. idas-derived alleles along the genome. We used a window size of 3 SNPs and focused on the 1164 AIMs. We set SYC and KHL as representative of the L. idas parental species, and BST, SIN, CDY and CKV as the putative L. melissa parents. Maximum likelihood allele frequency estimates from estpEM were used as input for the analysis. We ran the MCMC analysis for each population with a 10,000 iteration chain, a 5000 iteration burn-in and a thinning interval of 10. We based our inferences on point estimates (posterior means) of L. idas ancestry frequencies for individual populations or on averages of these estimates across all ten populations. We repeated this analysis with only males and with AIMs defined as those SNPs with an allele frequency difference of ≥0.2 between L. melissa and L. idas to assess the robustness of our results. Randomization tests were used to ask whether and to what extent AIMs with the highest L. idas or L. melissa ancestry frequencies (top 10% in each case) were over-represented on the Z chromosome or in or near certain structural features of the genome as described above for the Dubois hybrid zone analyses.
Quantifying consistency in genomic outcomes of hybridization between contemporary and ancient hybrids
We next tested for excess overlap between AIMs showing the most extreme ancestry frequencies in the ancient Jackson Hole hybrids and those with the greatest deviations from genome-wide average introgression in the contemporary Dubois hybrid zone. We focused on the following five comparisons: (i) excess introgression of Jackson Hole alleles in Dubois (α > 0) and high L. idas ancestry in Jackson Hole, (ii) excess introgression of Jackson Hole alleles in Dubois (α > 0) and high L. melissa ancestry in Jackson Hole, (iii) excess introgression of L. melissa alleles in Dubois (α < 0) and high L. idas ancestry in Jackson Hole, (iv) excess introgression of L. melissa alleles in Dubois (α < 0) and high L. melissa ancestry in Jackson Hole, and (v) restricted introgression in Dubois (β > 0) and extreme (high L. idas or high L. melissa) ancestry frequencies in Jackson Hole. We were especially interested in the last comparison, as the restricted introgression AIMs constitute our best candidates for regions of the genome associated with reproductive isolation. We initially focused on the top 10% of AIMs in each of the categories (e.g., the 116 out of 1164 AIMs with the highest point estimates of α from bgc). We conducted randomization tests by permuting parameters across AIMs to test whether and to what extent AIMs in the top 10% for each category coincided more for each comparison than expected by chance. We conducted 10,000 permutations to generate null expectations. The null distribution was used to calculate a P value and x-fold enrichment for each comparison. As an example, an x-fold enrichment of 2.0 would indicate that twice as many AIMs exhibited a pair of patterns (e.g., restricted introgression in Dubois and extreme ancestry frequencies in Jackson Hole) as expected by chance (based on the mean of the null).
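A randomization test of this kind can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the study. The function names are hypothetical, the P value is taken as the one-sided proportion of permutations with an overlap at least as large as the observed one, and ties in parameter values are ignored. Shuffling the top-set membership flags across AIMs is used as a shortcut that is equivalent to permuting one parameter across AIMs and re-ranking.

```python
import random


def top_indices(values, frac=0.10):
    """Indices of the AIMs with the highest values (top frac)."""
    k = max(1, round(len(values) * frac))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(ranked[:k])


def overlap_enrichment(a, b, frac=0.10, n_perm=10000, seed=1):
    """Observed overlap between the top sets of a and b, its x-fold
    enrichment over the permutation mean, and a one-sided P value."""
    rng = random.Random(seed)
    top_a = top_indices(a, frac)
    top_b = top_indices(b, frac)
    observed = len(top_a & top_b)

    # One flag per AIM: does it belong to the top set for parameter b?
    member_b = [i in top_b for i in range(len(b))]
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(member_b)  # permute b's labels across AIMs
        null_counts.append(sum(member_b[i] for i in top_a))

    null_mean = sum(null_counts) / n_perm
    fold = observed / null_mean if null_mean > 0 else float("inf")
    p_value = sum(c >= observed for c in null_counts) / n_perm
    return observed, fold, p_value
```

For a comparison such as (i), a would hold the α point estimates and b the L. idas ancestry frequencies; for comparisons involving α < 0 or high L. melissa ancestry, the corresponding vector would be negated before ranking.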
We repeated these analyses considering the top 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2% and 1% of AIMs in each category (i.e., across more and more extreme quantiles), and with only autosomal or only Z-linked AIMs. We also repeated this analysis with only males and with AIMs defined as those SNPs with an allele frequency difference of ≥0.2 between L. melissa and L. idas to further assess the robustness of our results.
Finally, we extracted IPR and GO terms and descriptions generated by maker for the set of AIMs in the top 10% for each comparison where we had evidence of significantly greater overlap between categories than expected by chance. We filtered the IPR terms based on their hierarchical classifications (i.e., superfamily, family, domain or repeat). We retained unique terms and dropped terms that were at lower or overlapping levels in the InterProScan database [72]. For example, if a SNP was annotated for a superfamily IPR term and multiple domains within the superfamily, we retained only the superfamily term.
Whole genome phylogenomics
We next asked whether patterns of introgression in the contemporary Dubois hybrid zone were also predictive of patterns of ancestry in two additional, ancient hybrid lineages from the Sierra Nevada and Warner Mountains of western North America [6, 29, 25]. To do this, we generated high-coverage, whole genome sequences from L. anna (from Yuba Gap, CA), L. melissa (from Bonneville Shoreline, UT), L. idas (from Trout Lake, WY), and the Sierra Nevada (from Carson Pass, NV) and Warner Mountains (from Buck Mountain, CA) lineages. We extracted DNA from one female (ZW) butterfly per population using Qiagen’s MagAttract HMW DNA extraction kit (Qiagen, Inc.) following the manufacturer’s suggested protocol. We then outsourced library construction and sequencing to Macrogen Inc. (Seoul, South Korea). One standard paired-end shotgun library (180 bp insert) was constructed for each butterfly/lineage using a TruSeq library preparation kit (Illumina, Inc.). Each library was sequenced in its own lane on a HiSeq 2000 with 2 × 100 bp paired reads. We obtained > 20 million high quality (≥ Q30) reads for each library.
We ran bwa mem with a minimum seed length of 15, internal seeds longer than 20 bp, and output only alignments with a quality score of ≥30. We then used samtools (version 1.5) to compress, sort and index the alignments, and to remove PCR duplicates [76]. We used GATK’s HaplotypeCaller (version 3.5) to call SNPs across the five genomes [80]. We excluded bases with mapping base quality scores less than 30, assumed a prior heterozygosity of 0.001, applied the aggressive PCR indel model, and only called variants with a minimum confidence threshold of 50. We used the intermediate g.vcf file approach followed by joint variant calling with the GenotypeGVCFs command when calling variants. We then filtered the initial set of variants to include only SNPs with an average minimum coverage of 6x, maximum absolute values of 2.5 for the base quality rank sum test, the mapping quality rank sum test and the read position rank sum test, a minimum ratio of variant confidence to non-reference read depth of 2.5, and a minimum mapping quality of 30. We also excluded all indels and SNPs with more than two alleles, and all SNPs on smaller scaffolds not assigned to linkage groups. Finally, SNPs with exceptionally high coverage (>450x) or that were clustered together (within 5 bp of each other) were dropped. This left us with 2,054,096 SNPs for phylogenomic analyses.
We first estimated a genome consensus or “species” tree from a concatenated alignment of the autosomal SNPs (i.e., this analysis did not include SNPs on the Z chromosome). We estimated the tree with RAxML (version 8.2.9) [81] under the GTR model with no rate heterogeneity. We generated and analyzed 100 bootstrap replicates to assess confidence in the bifurcations in the estimated phylogeny. Next, to examine variation in (unrooted) tree topologies across the genome, we split the alignment into non-overlapping 1000-SNP windows. We estimated phylogenies for each window using RAxML as described above. We then used the R packages ape (version 5.2) [82] and phytools (version 0.6.60) [83] to identify trees (across windows) with the same topology and to quantify the proportion/number of trees on each linkage group with each topology. Finally, we used randomization tests to determine whether autosomes, the Z chromosome, or the 49 candidate barrier loci (i.e., regions containing the 49 AIMs with restricted introgression in Dubois and extreme ancestry frequencies in Jackson Hole) exhibited a significant excess or deficit of a given topology. This was done in R as well. We used 1000 randomizations (permutations) of tree topologies across 1000-SNP windows (and thus linkage groups) for each analysis.
Data Archiving
DNA sequence data have been archived on NCBI’s SRA (accession numbers pending). Computer code and other key input files are available on DRYAD (accession number pending).
Acknowledgements
Thanks to Amy Springer and Megan Brady for their help in the field. We are also grateful for comments on earlier drafts of this manuscript from Jeff Feder and Patrik Nosil. This work was funded by the National Science Foundation (DEB-1638768 to ZG, DEB-1050355 to CCN, DEB-1050149 to CAB), Utah State University and the Ecology Center at Utah State University (Ecology Center Graduate Student Award to SC). The support and resources from the Center for High Performance Computing at the University of Utah are gratefully acknowledged.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].