Abstract
Never before have we had the luxury of choosing a continent, picking a large phylogenetic group of animals, and obtaining genomic data for its every species. Here, we sequence all 845 species of butterflies recorded from North America north of Mexico. Our comprehensive approach reveals the pattern of diversification and adaptation occurring in this phylogenetic lineage as it has spread over the continent, which cannot be seen on a sample of selected species. We observe bursts of diversification that generated taxonomic ranks: subfamily, tribe, subtribe, genus, and species. The older burst around 70 Mya resulted in the butterfly subfamilies, with the major evolutionary inventions being unique phenotypic traits shaped by high positive selection and gene duplications. The recent burst around 5 Mya is caused by explosive radiation in diverse butterfly groups associated with diversification in transcription and mRNA regulation, morphogenesis, and mate selection. Rapid radiation correlates with more frequent introgression of speciation-promoting and beneficial genes among radiating species. Radiation and extinction patterns over the last 100 million years suggest the following general model of animal evolution. A population spreads over the land, adapts to various conditions through mutations, and diversifies into several species. Occasional hybridization between these species results in accumulation of beneficial alleles in one, which eventually survives, while others become extinct. Not only butterflies, but also the hominids may have followed this path.
Butterflies are among the most beloved animals. Beautiful and harmless, they have attracted human attention since prehistoric times (1). Although butterflies are among the best-studied insects phenotypically (2), they remain largely unexplored by genomics. Until recently, genomic studies of butterflies were confined to a couple of model organisms and pests, such as Heliconius, monarch, and cabbage white, with initial studies leading to groundbreaking insights into mimicry, migration, and toxin resistance (3-5). We have been expanding these efforts on butterfly genomics to cover a broader range of species (6-11). With the rapid decrease in the price of DNA sequencing and the constant development of analytical methods, the time is ripe to sequence the genomes of all butterfly species over a continent.
The diversity of butterflies, which form a clade within moths (12), is captured in 7 families worldwide (13). Six of these families are represented in North America north of Mexico, and the butterfly fauna of this region is well-documented (2, 14, 15). Here, we obtain and analyze the genomes of all 845 (Table S1) butterfly species in the United States and Canada (USC). A number of these species are of conservation concern, including 25 endangered and threatened taxa (16). The new genomic datasets comprehensively covering USC butterflies reveal the detailed history of their speciation and adaptation and suggest the genetic basis of their unique phenotypic traits. As a result, we find a bewildering pattern of phylogenetic diversification that we rationalize in a general model of animal evolution reaching beyond butterflies and insects. Moreover, recently developed analytical methods have demonstrated the power of extracting information from thick protein sequence alignments to accurately model spatial structures (17), screen interacting partners (18) and predict functions (19) of proteins. The diverse datasets of protein sequences we have obtained will allow structure and function prediction for many eukaryotic gene products, enabling future discoveries.
Reference genomes of butterflies
We sequenced and annotated 23 reference genomes of butterflies from the United States. Combined with the 13 genomes published previously (3-11, 20-23) (8 by our group), a total of 36 reference genomes are used in this study. The N50 of the new genomic assemblies ranges from 50 to 3,700 kb, and they are over 95% complete in essential genes (24) (Table 1). While the size of these genomes is drastically different, between 217 and 1,040 Mb, they encode comparable numbers of proteins, about 15,000. In contrast, the fraction of repeats correlates strongly with the genome size (Pearson’s correlation coefficient > 0.7), suggesting that the genome size in butterflies varies due to repetitive and transposable elements. Genome size does not conform with butterfly phylogeny and can differ even for close relatives, likely due to the activity of transposons.
The reference genomes were selected to cover diverse phylogenetic groups of the USC butterflies, allowing us to carry out genomic comparisons across the phylogeny. In butterfly genomes, we detected 530 Ultra-Conserved genomic Elements (UCE, Table S2). Similarly to mammals (25, 26), UCEs in butterflies mostly reside in the intergenic regions (Fig. 1A) and around the genes functioning in transcription regulation and developmental processes (Fig. 1B and Table S3). The UCEs constitute merely 0.01% of the genomes, and most other regions have diverged rapidly among butterflies. Only 7–14% of the genomic sequence can be confidently aligned between species from different families (27), and the remainder contains variable and repetitive regions. The observed tolerance to transposon activity in butterflies may be adaptive, allowing them to exploit the retrotransposition mechanism for gene duplication and expansion (28). We identified 8,581 orthologous gene groups present in at least 75% of the reference genomes, and each species experienced gene duplications in 2–9% of these groups. In addition to the high sequence divergence, frequent gene duplication may be another reason for phenotypic diversity and adaptation.
We systematically catalogued lineage-specific gene expansions (Table S4) in reference genomes, and a number of prominent examples are given in Fig. 1C. Many of the gene expansions occur in protein families involved in acquiring nutrients from food and resisting toxins and pathogens, but some (magenta boxes in Fig. 1C) may explain the lineage-specific phenotypes. We previously reported an expansion of farnesyl pyrophosphate synthase (FPPS) homologs in Pterourus glaucus (11), and now we find this expansion in all sequenced Papilionidae species. FPPSs function in steroid and terpene synthesis (30), and the Papilionidae-specific organ, the osmeterium, secretes terpenes to repel birds (31). Therefore, the unique expansion and diversification of FPPSs may produce the bouquet of terpenes aiding caterpillars’ arms races with predators. Other notable gene expansions involve sugar transporters and trehalases in Lycaenidae. The caterpillars of Lycaenidae secrete nectar-like liquids as a “reward” to trick ants into protecting and even feeding the caterpillars (32). Some Riodinidae species are also associated with ants and possess similar gene expansions. The additional sugar transporters may play a role in secreting the sugars that are fed to ants, while trehalase may convert trehalose from the food plant to sweeter-tasting molecules, contributing to this adaptation.
Furthermore, some gene expansions may underlie convergence in phenotypes, such as the expansion of catalases in the two distant phylogenetic lineages—the satyrs (Satyrini) and grass-skippers (Hesperiinae)—whose caterpillars converged to feeding on monocots. The catalases decompose hydrogen peroxide and thus protect against oxidative damage (33). Feeding on nutrient-deficient grasses, sedges and palms (34) leads to a prolonged caterpillar stage and may increase the likelihood of oxidative damage that would be mitigated by additional catalases. Apparently switching to monocot feeding was an evolutionarily successful innovation that resulted in explosive speciation in both satyrs and grass-skippers (35). They became the most species-rich phylogenetic groups among American butterflies, and their parallel diversifications of catalases are intriguing.
Phylogeny of USC butterflies
We obtained whole genome shotgun sequences of all 845 butterfly species recorded from the United States and Canada (15). Phylogenetic trees constructed from protein-coding genes in the nuclear genome (10,000–15,000 kb positions), Z chromosome (360–641 kb positions) and mitochondrial genome (11 kb positions) are largely consistent with each other (Fig. 2 and Fig. S1) and support established views about the deeper phylogeny of butterflies (family and subfamily). Namely, swallowtails (family Papilionidae) are sister to all other butterflies (36), and the topology of other butterfly families agrees with previous studies (13). Genetic divergence between gossamer-winged butterflies (family Lycaenidae) and metalmarks (family Riodinidae) is smaller than that between some subfamilies of brush-footed butterflies (Nymphalidae). Thus, it may be best to view metalmarks as one of the subfamilies of Lycaenidae as discussed previously (13).
However, our genome-level analysis revealed a number of problems with the current butterfly classification at a shallower phylogenetic level (tribe and genus). We rectified these problems in dedicated publications (37, 38). Briefly, we proposed 6 new genera and 2 new subgenera, and reclassified 40 species (Table S1). Thus, the names of 6% of USC butterflies were changed due to this expanded examination of genomes. In accordance with our previous findings (39), we found additional unexpected cases of rapid divergence and mimicry in wing patterns and shapes, where butterflies do not look similar to their close relatives but resemble more distant species (Table S5). These insights illustrate the power of genomics in reshaping our knowledge of the taxonomy and phylogeny of life.
Comparison of the three trees (autosome, Z chromosome, and mitogenome: Fig. S1) reveals confident incongruence between them. The incongruence is rampant among close relatives and is present in almost every large genus. These inconsistencies likely reflect alternative evolutionary paths taken by different genomic segments of the same organism as a result of incomplete lineage sorting, introgression and hybridization (3, 40, 41). Generally, the nuclear trees correlate better with phenotypes than the mitogenome tree. In the genera that have experienced rapid radiation, such as Colias, Euphilotes and Speyeria, mitochondrial phylogeny is semi-random compared to nuclear phylogeny and phenotypes, and all 9 Celastrina species carry essentially identical mitochondrial DNA. Incongruence between the trees constructed from autosomes and the Z chromosome (Table S6) can originate in cases with extensive introgression (40). Sex chromosome-linked genes have been shown to resist introgression in multiple species (42-44), and thus the Z chromosome tree may better reflect the history of speciation rather than the averaged history of introgression. For instance, in accord with morphology, the Z chromosome suggests a sister relationship between the morphologically similar but allopatric species Junonia coenia and Junonia grisea (45, 46). In contrast, the autosomal tree groups the morphologically different but sympatric Junonia grisea and Junonia nigrosuffusa, which experience frequent hybridization and introgression.
Diversification, extinction and bursts in radiation
We analyzed patterns of diversification in the time-calibrated phylogenetic tree of USC butterflies constructed from all protein-coding genes. The number of species from currently non-extinct lineages at each time point in the past is shown in Fig. 3A. This curve reflects both speciation and extinction and is close to exponential, but exhibits a decreased diversification rate in the last 2 million years (Myr). This apparent decrease is due to both the variation between individuals of a species (i.e., terminal branches lead to individuals, not species) and incomplete speciation events: some populations are on the way to becoming distinct species but are not recognized as such today.
Excluding the last 2 Myr, we fitted the diversification rate to a model with constant speciation and extinction rates, yielding an estimated speciation rate of 0.15 per Myr and an extinction rate of 0.08 per Myr (Fig. 3B). We excluded time points before 94 million years ago (Mya) because a small number (≤ 5) of lineages is prone to random fluctuations. Both simulated and observed data show an increase in rate when approaching the present time. This increase arises because species that were present at an earlier time point but are extinct by now are not counted, leading to underestimation of the number of species that existed in the past (47). The observed species diversification rate (per 5 Myr) shows larger fluctuations than simulations, and these fluctuations are biologically meaningful. For instance, the minimum around 63–67 Mya reflects the Cretaceous–Paleogene extinction dated to 66 Mya (48), thus indirectly supporting the time-scale of the tree. Interestingly, the maxima around 90, 70, 60, 35, and 15 Mya (Fig. 3B) approximate the origins of clades corresponding to major levels in the taxonomic hierarchy: family, subfamily, tribe, subtribe, and genus, respectively. Starting from the diversification into subfamilies 70 Mya, which is the global maximum (from 94 Mya until now), there are 4 major peaks in the curve, as many as there are taxonomic levels from subfamily to genus. Discreteness of these levels may thus be a consequence of rapid speciation across phylogenetic lineages followed by extinctions that break the continuity of animal forms, leading to survival of only a few distant ones.
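The rise in apparent rate toward the present follows from the standard constant-rate birth-death model. The sketch below uses a textbook result rather than the fitting procedure of this study, and its function names are illustrative. Assuming the speciation rate λ exceeds the extinction rate μ, the probability that a lineage alive t Myr ago leaves at least one descendant today is (λ − μ)/(λ − μ·exp(−(λ − μ)t)), so the branching rate visible in a tree of surviving species is λ times that probability:

```python
import math

def survival_probability(lam, mu, t):
    """Probability that one lineage alive t Myr ago has a descendant today.

    Assumes constant rates with lam > mu (standard birth-death result).
    """
    r = lam - mu  # net diversification rate per Myr
    return r / (lam - mu * math.exp(-r * t))

def reconstructed_rate(lam, mu, t):
    """Per-lineage branching rate seen in a tree of surviving species only."""
    return lam * survival_probability(lam, mu, t)

# Rates reported in the text: speciation 0.15 and extinction 0.08 per Myr.
for t in (0, 5, 20, 90):
    print(t, round(reconstructed_rate(0.15, 0.08, t), 3))
```

At t = 0 the apparent rate equals the speciation rate (0.15 per Myr), and deep in the past it approaches the net rate 0.15 − 0.08 = 0.07 per Myr.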
The latest increase in the diversification rate since 8 Mya (Fig. 3B) is the origin of species, and it brings a surprise. Inspection of the tree reveals many recent bursts of radiation, i.e., rapid diversification in some lineages leading to the origin of many species around the same time. To quantify this effect, we studied the time progression of the number of nodes in the tree that produce at least 4 branches within 2 Myr (Fig. 3C). Observed data (blue) show profoundly more radiation events in the recent past than simulations that assume an equal chance of speciation in every lineage (orange). These recent radiations recur across the tree of butterflies and were investigated in detail.
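One way to count such radiation nodes is sketched below. The `Node` class, the function names, and the reading of "at least 4 branches within 2 Myr" (at least 4 descendant lineages crossing a time slice 2 Myr after the node) are assumptions made for illustration, not the implementation used here:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    age: float                      # node age in Myr before present (leaves: 0)
    children: list = field(default_factory=list)

def lineages_at(node, cut_age):
    """Number of branches in this subtree that cross the time slice cut_age."""
    if node.age <= cut_age or not node.children:
        return 1                    # the branch leading to this node crosses the slice
    return sum(lineages_at(child, cut_age) for child in node.children)

def burst_nodes(root, window=2.0, min_branches=4):
    """Internal nodes whose descendants reach min_branches within `window` Myr."""
    bursts = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            cut = max(0.0, node.age - window)   # never slice beyond the present
            if lineages_at(node, cut) >= min_branches:
                bursts.append(node)
            stack.extend(node.children)
    return bursts
```

Under this definition, nested nodes inside one radiation can each be counted, so the same rule should be applied to observed and simulated trees before the counts are compared.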
Introgression of speciation-associated genes from distant relatives leads to radiation
We identified 18 clusters of species as undergoing explosive radiation (Table S7). These clusters belong to genera from four of the largest butterfly families. We looked for genes that diverge rapidly (P-value < 0.05) among species in each radiating cluster, but evolve relatively slowly in closely related non-radiating species. We found 273–846 such genes in each cluster, about 4% of all genes. A significant overlap in these genes exists among 18 radiating clusters with 430 common genes recurrently showing elevated divergence during radiation (P-value < 0.05, Table S8). The top 21 genes rapidly diverging in over a third of radiation clusters are shown in Fig. 4A. Proteins encoded by these 21 genes mostly belong to 4 major functional categories (green cells in Fig. 4A).
First, 7 out of the 21 most frequent radiation-associated proteins are associated with splicing and silencing of mRNA. This observation echoes the studies of radiation in cichlids (49), suggesting that the increased complexity of mRNA regulation and maturation may be a general mechanism to rapidly generate divergence in animals, bypassing the need for extensive variation in gene sequences. Second, 4 of the 21 proteins are directly related to mating (Fig. 4A). For instance, CACOPHONY is a calcium channel that senses species-specific mating song in Drosophila (50), and CLOCK is a transcription factor that regulates mating time (51, 52). Elevated divergence in such genes may directly alter mating behavior of butterflies, contribute to prezygotic isolation, and accelerate speciation. Finally, proteins involved in sperm and egg generation and neurological processes stand out not only among the 21 most frequent players (Fig. 4A), but also among all 430 recurrent radiation-associated proteins. Additionally, enrichment analysis of gene ontology (GO) terms associated with these 430 proteins reveals transcription regulation and morphogenesis as major functional categories for radiation-associated genes (Fig. 4B, Table S9). Differences in transcription factors are likely associated with divergence in the DNA regulatory elements they bind to, and the latter has been shown to play an important role in Drosophila speciation (53, 54).
To understand why radiation-associated proteins exhibit elevated divergence among species in a radiating lineage, we first studied whether they are positively selected. Unexpectedly, radiation genes are characterized by a lower nonsynonymous substitution rate compared to other genes in all radiating genera but Phyciodes (Table S10). Thus, they are not undergoing stronger positive selection on their individual mutations than other genes. Instead, we find that radiation genes tend to introgress between distantly, but not closely, related species (P-value < 4.9e-12, see below). We hypothesize that introgression of speciation-promoting genes from more distant relatives is a mechanism that speeds up speciation of close relatives, causing explosive radiation.
To test this hypothesis, we compared the 18 radiating genera with others. We analyzed 63 pairs of sister species from radiating clusters and 68 pairs from non-radiating lineages (Table S11), and compared the distribution of sequence divergence of individual genes in autosomes and Z chromosome (Fig. 4C). Species pairs from radiating clusters show lower divergence in Z chromosome than in autosomes (green vs. orange in Fig. 4C), while non-radiating clusters do not display such a trend. Sex chromosome’s resistance to introgression has been documented (42, 55, 56), and thus the higher divergence of autosomal genes in radiating genera is a likely consequence of introgression from distant species. The elevated introgression of autosomal genes in radiating genera also leads to a larger deviation in the divergence of individual genes than that in non-radiating genera (orange vs. pink in Fig. 4C). More directly, we detected introgressed regions from relatively distant species by ABBA-BABA tests in 2,557 radiating triplets of species and 115 non-radiating ones (Table S12). We find that radiation clusters possess significantly more introgressed genes than non-radiating lineages at the same divergence level (Fig. 4D).
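For readers unfamiliar with the ABBA-BABA test, the following minimal sketch computes Patterson's D statistic for a species triplet plus an outgroup from aligned sequences. It shows the general technique only. The function name and input format are illustrative, and the significance assessment (commonly a block jackknife) and the detection of individual introgressed regions are not reproduced:

```python
def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA and BABA sites and return (abba, baba, D).

    p1 and p2 are the sister species, p3 the more distant species, and the
    outgroup defines the ancestral allele. D > 0 indicates excess allele
    sharing between p2 and p3; D < 0 between p1 and p3.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue                # skip gaps and ambiguous bases
        if c == o:
            continue                # p3 must carry the derived allele
        if a == o and b == c:
            abba += 1               # pattern: ancestral, derived, derived, ancestral
        elif b == o and a == c:
            baba += 1               # pattern: derived, ancestral, derived, ancestral
    total = abba + baba
    d = (abba - baba) / total if total else 0.0
    return abba, baba, d

print(d_statistic("AAAAG", "GAAGA", "GGAGG", "AAAAA"))
```

Without gene flow, incomplete lineage sorting is expected to produce ABBA and BABA sites in equal numbers, so D should not differ significantly from zero.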
Next, we investigated the functions of introgressed genes. In the radiation clusters, we identified 2,273 genes that tend (P-value < 0.05) to introgress and 2,362 genes that are more resistant to introgression (P-value < 0.05). In the non-radiating lineages, we identified 2,159 genes that are more likely (P-value < 0.05) to introgress and 4,001 genes that never introgressed in any of the 115 triplets. Functional analysis of genes that tend to introgress between distant relatives in radiating genera versus those that are resistant to introgression using GO term enrichment is shown in Table S13. Unexpectedly, species from radiating clusters tend to acquire genes encoding proteins that function in mate recognition and selection (GO terms: courtship behavior, odorant binding, neuropeptide hormone activity), and with roles in transcription and translation. Such genes resist introgression in non-radiating genera (Fig. 4E). Divergence in mate choice genes along with transcription/translation regulators is typically associated with speciation and may confer hybrid incompatibilities (54, 57). Introgression of such genes from distant species may facilitate speciation by promoting reproductive isolation.
Furthermore, radiating clusters introgress genes related to morphogenesis (GO terms: embryo development, pattern specification process, eye-antennal disc development, sex determination), while non-radiating genera do not. Acquiring new morphological traits by individual mutations may be slow, but introgressed alleles would immediately prompt a variety of phenotypes, e.g., introgression of wing pattern genes explains mimicry among Heliconius species (3, 58). Finally, the trend of radiating clusters to exchange genes with roles in salivary gland development and peritrophic matrix (59) (a membrane structure present between food and midgut tissue) formation may allow caterpillars’ adaptation to additional food plants, thus expanding ecological niches of these species. Similarly, preferred introgression of energy-producing mitochondrial genes, DNA repair factors, and starvation-resistance molecules is observed in radiating clusters. Such exchanges may increase the chance for a species to survive hard conditions by gathering advantageous alleles that originated in other species.
Uneven evolutionary rates and positive selection
The phylogenetic tree constructed from all protein-coding genes reveals drastic variation in evolutionary rates between lineages of USC butterflies. Distances from the root to all leaves in this tree show a wide distribution, and some lineages evolve at least twice as fast as others (Fig. 5A). The clades with the largest rate (> 0.55 in Fig. 5A) prominently standing out from the rest are the blues (Polyommatinae) and the whites (Pierini). They may have experienced rapid evolution due to specialized interactions with ants (the blues) (60, 61) and adaptation to caterpillar feeding on mustards (the whites), which are toxic to many insects (62).
To better understand the reasons for the variation in evolutionary rate, we studied the effects of positive selection in each branch of the phylogenetic tree using reconstructed sequences of the internal nodes. The ratio of nonsynonymous and synonymous substitutions (Y-axis) is plotted versus the length of each tree branch (X-axis, Fig. 5B). The blues (Polyommatinae) are indeed a prominent outlier, indicating that they have been evolving under stronger positive selection. Notably, branches leading to more species (size of the circles in Fig. 5B is proportional to the number of species) generally show a higher rate of nonsynonymous substitutions. Apparently, stronger positive selection may lead to the development of adaptive traits giving advantage to a lineage and enabling it to diversify more than others. Furthermore, we find that positive selection is typically lowered in children of a long branch with strong positive selection (color of circles in Fig. 5B).
We identified genes under positive selection in the longest branches leading to diverse clades (9 subfamilies and 8 tribes) using modified McDonald-Kreitman tests (63) (Table S14). The blues (Polyommatinae) and the whites (Pierini) have the largest number of such genes. Biclustering partitions the genes into three groups: under strong positive selection in a lineage-specific fashion (top in Fig. 5C), not positively selected in any lineage (middle in Fig. 5C), and positively selected in multiple lineages (bottom in Fig. 5C). The genes that are positively selected in all lineages (Fig. 5D, red background) include a wound healing factor, a detoxification molecule and a carbohydrate transporter, which may participate in absorption of nutrients. These genes may have helped these lineages survive in tough conditions, e.g., during the Cretaceous-Paleogene extinction. In contrast, genes with the lowest positive selection (Fig. 5D, green background) function in fundamental processes, e.g., encode ribosomal proteins. Common to all life forms and polished thoroughly by evolution, these genes have few nonsynonymous substitutions.
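The logic of a McDonald-Kreitman test can be illustrated with its classic form, sketched below. This study used a modified version (63) that is not reproduced here. The function names and the example counts are illustrative only. The test compares nonsynonymous and synonymous counts among fixed differences (Dn, Ds) with those among polymorphisms (Pn, Ps):

```python
import math

def mk_alpha(dn, ds, pn, ps):
    """Estimated fraction of nonsynonymous fixed differences driven by selection.

    alpha = 1 - (Ds * Pn) / (Dn * Ps); requires dn > 0 and ps > 0.
    """
    return 1.0 - (ds * pn) / (dn * ps)

def fisher_one_sided(dn, ds, pn, ps):
    """One-sided Fisher's exact P-value for an excess of nonsynonymous fixed differences."""
    fixed = dn + ds                 # all fixed differences
    nonsyn = dn + pn                # all nonsynonymous changes
    total = dn + ds + pn + ps
    upper = min(fixed, nonsyn)
    tail = sum(math.comb(nonsyn, k) * math.comb(total - nonsyn, fixed - k)
               for k in range(dn, upper + 1))
    return tail / math.comb(total, fixed)

# Illustrative counts, not data from this study.
print(mk_alpha(7, 17, 2, 42), fisher_one_sided(7, 17, 2, 42))
```

An alpha near zero with a non-significant P-value is the neutral expectation, whereas a positive alpha with a small P-value points to adaptive fixation of amino acid changes.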
The blues (Polyommatinae) exhibit the fastest evolution driven by strong positive selection in the largest number of genes. Their well-documented relationship with ants may be the driver. Ants protect caterpillars of blues from predators and feed on the liquid secreted by special glands of the caterpillars (64). Some blue species even evolved to feed on ant larvae, while producing chemicals and sounds that fool the ants into accepting them as their own kind (65). We identified GO terms associated with genes that have been positively selected only in the blues (Table S15, Fig. 5E). Many of these genes encode proteins possibly related to interactions with ants, e.g., proteins of oenocyte development. In ants, oenocytes secrete cuticular hydrocarbons used to recognize nestmates (66), and the blues may produce similar chemicals in their oenocytes to trick the ants. The ability to secrete the ant-feeding liquid from special glands in a caterpillar should require a number of transporters, and we indeed observe strong positive selection in a number of transporters for sugars, ions, and water. Finally, we find unique positive selection in metabolic proteins such as regulators of lipid storage and enzymes for carbohydrate metabolism. Caterpillars of blues frequently feed on a nutrient-rich diet, such as flowers, fruits or even ants (61), instead of leaves, and therefore they may have altered their metabolism to adapt to this difference in food resource and achieve fast development.
Evolutionary hypotheses: a broader perspective
As a model for gene exchange between eukaryotic species through introgression, butterflies, like Darwin's finches (67), hold promise for discovering new general principles of evolution. Errors in replication generate variations for evolution to select from. Similarly, errors in mate selection introduce genomic segments from another species, providing a shortcut to accumulating mutations. Interspecies hybridization and introgression, a more efficient way to generate variation than point mutations, are emerging as a powerful evolutionary force that shapes the adaptive landscape in multicellular organisms (68).
Our investigation into the diversification pattern of all USC butterflies provides direct evidence for the role of introgression in promoting radiation. We find that radiating genera show significantly higher introgression between distantly related (non-sister) species than non-radiating genera. Surprisingly, genes that are possibly related to mate recognition and speciation, such as those involved in courtship behavior and morphogenesis, are preferentially introgressed between distant species in radiating genera, while such genes resist introgression in non-radiating genera. Acquisition of speciation genes from a distant relative helps an incipient species to diverge from its sister by recombination rather than by point mutations, speeding up speciation and contributing to radiation.
We observe abundant bursts of radiation in the USC butterfly tree during the last 8 Myr. However, the scarcity of such radiations before 15 Mya suggests that only a small number of species from each radiating cluster persist in time. Each radiating cluster exists as a community of closely related species that exchange genes and compete with each other for resources, but eventually survives mostly as a single lineage. We find that genes involved in food digestion, energy production, starvation resistance, and DNA repair tend to introgress between species in radiating genera. Adapted alleles of such genes likely play a crucial role in the survival of a species in hard conditions, such as food shortage or temperature fluctuations. Therefore, the surviving lineage may gather advantageous alleles from other lineages that have since become extinct, allowing it to rapidly adapt to the changing environment and avoid extinction.
In sum, the patterns of diversification, radiation, introgression, and extinction observed in USC butterflies suggest the following evolutionary model. A species spreads over a large geographic area, increasing its population size and accumulating variations by mutations. Geographical isolation between populations drives them to speciate by accumulating Dobzhansky-Muller incompatibilities (69). Now, evolving as a set of closely related but reproductively semi-isolated species, these organisms further diverge and adapt to their local conditions. Still porous reproductive barriers between these incipient species allow them to exchange genes, and such exchanges speed up diversification by recombining speciation genes to generate new alleles, leading to radiation. While these exchanges are uncommon and not likely to propagate throughout the population, particularly beneficial alleles may get fixed due to selective sweeps (70). As a result of such introgression, each species can acquire beneficial traits from others and become adapted to more diverse conditions. Then, either as a result of direct competition or toughening environmental conditions, most species undergo extinction, and the species that gathered the most beneficial alleles moves forward in time. The cycle of diversification-radiation-introgression-extinction repeats, generating the diversification patterns we observe today.
Looking beyond butterflies, we see parallels in the evolution of hominids. Having diversified into several species, including Neanderthals and Denisovans, in the last 1 Myr, Homo experienced introgression similar to what we see in butterflies. Most non-African modern human populations carry about 2% Neanderthal DNA (71), and the fraction of Denisovan genes varies (72), reaching 4-6% in Melanesians (73). Meanwhile, archaic human genomes also contain genes that are traced back to Homo sapiens (74). Today, only modern humans survive, but genes of archaic humans remain in our genomes. Although many introgressed genes may be selected against and are being eliminated from modern human genomes with time (75), a fraction of them may be beneficial, increasing in frequency as a result of selection (72). These introgressed genes were proposed to help modern humans adapt to diverse climates (76, 77) and fight against pathogens (78, 79).
MATERIALS AND METHODS
Reference genome assembly and annotation
We sequenced, assembled and annotated genomes as previously described (6, 9, 11). Briefly, paired-end libraries with insert sizes of 250 bp and 500 bp and mate-pair libraries with insert sizes of 2 kb, 5 kb, and 10 kb were constructed and sequenced. All reads were processed by Trimmomatic (80) to remove adapter sequences and low-quality bases (quality score < 20), and by Quake (81) to correct sequencing errors. We used Platanus (82) to assemble the genomes. The initial assemblies from Platanus were frequently redundant: highly heterozygous equivalent segments in the paternal and maternal chromosomes were assembled separately, and thus they were present twice in the assemblies. We detected and corrected such problems as described before (6, 9, 11).
The repeats in the genomes were identified by RepeatModeler (83). In addition, since repeats with highly similar sequences may be erroneously combined into one in the genome assemblies, we identified them using very high sequence depth (more than 4 times of the expected value) after all the sequence reads were mapped to the draft genomes using BWA (84). We combined the repeats identified by RepeatModeler and our sequence depth criteria with repeats in Repbase (85) to generate species-specific repeat libraries, and these libraries were supplied to RepeatMasker (86) to annotate repeats in the genomes.
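The depth criterion can be illustrated with a short sketch. It assumes that per-base read depth has already been computed from the BWA alignments and that the expected depth is approximated by the median, and both choices, along with the function name, are assumptions for illustration. Positions exceeding 4 times the expected depth are grouped into half-open intervals:

```python
from statistics import median

def high_depth_regions(depth, fold=4.0, expected=None):
    """Return [start, end) intervals where depth exceeds fold * expected depth.

    depth: per-base read depth along one scaffold (list of numbers).
    expected: expected depth; defaults to the median depth (an assumption).
    """
    if expected is None:
        expected = median(depth)
    threshold = fold * expected
    regions = []
    start = None
    for position, value in enumerate(depth):
        if value > threshold:
            if start is None:
                start = position        # open a new high-depth interval
        elif start is not None:
            regions.append((start, position))
            start = None
    if start is not None:               # interval runs to the scaffold end
        regions.append((start, len(depth)))
    return regions
```

Intervals found this way would be candidate collapsed repeats to add to the species-specific repeat library.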
We annotated protein-coding genes using three approaches: homology-based, transcript-based, and de novo gene prediction. We used protein sets from Papilio machaon (21), Pieris rapae (9), Calycopis cecrops (10), Calephelis nemesis (8), Danaus plexippus (4), Cecropterus lyciades (7), Bombyx mori (87), and Drosophila melanogaster (88) as references for homology-based annotation. These references include one species from each of the 6 butterfly families and the established model organisms, silkworm and fruit fly with expected high quality of gene models. The reference protein sets were aligned to draft genomes using Exonerate (89). We had RNA-seq reads for 22 of the 23 new reference genomes, and we used TopHat/Cufflinks pipeline (90, 91) to perform transcript-based annotation for them. Three de novo gene prediction methods: Augustus (92), GeneMark_ES (93), and SNAP (94) were used to obtain de novo gene annotations. We trained these de novo predictors for each species using confident gene models derived from the consensus between transcript-based and homology-based annotations. Finally, annotations by different approaches were combined in EvidenceModeler (95) to obtain their consensus as the final gene predictions. We predicted the functions of these proteins by finding the closest sequence hits in Flybase (96) and Swissprot (97) using BLASTP (E-value < 0.00001) and transferred the Gene Ontology (GO) (98) terms and function annotations.
Identification and analysis of Ultra Conserved Elements (UCE) in the butterfly genomes
A total of 36 reference genomes of USC butterflies were used in our study, and 31 of them were sequenced by us. The five genomes sequenced by others (4, 21-23) were obtained from LepBase v4 (http://ensembl.lepbase.org/index.html). We used the published gene models but annotated the protein function using our pipeline described above. For each butterfly family, we selected a representative genome with high N50: Heliconius erato (Nymphalidae) (23), Megathymus ursus (Hesperiidae) (6), Apodemia nais (Riodinidae), Heraclides cresphontes (Papilionidae), Pieris rapae (Pieridae), and Feniseca tarquinius (Lycaenidae). We masked the repetitive regions in these genomes using RepeatMasker (86) and removed short (less than 10 kb) scaffolds. We used Heliconius erato (the assembly with the highest N50) as the primary reference, and aligned the other five genomes to it using LASTZ (27). Aligned segments were processed sequentially by axtChain (99) and ChainNet (100) to generate pairwise whole genome alignments. These pairwise alignments were merged into a multiple genome alignment using MULTIZ (101). For segments where all six genomes were aligned, we counted identical positions in overlapping sliding windows of 50 bp. Windows with more than 96% identical positions in all six genomes were considered candidate UCEs, and adjacent candidate UCEs were merged.
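The sliding-window step can be sketched as follows. This is an illustration under stated assumptions, not our production code: the alignment is represented as a list of columns (one character per genome), windows are moved 1 bp at a time (the step size is not specified above), and overlapping or touching windows are merged. The name `candidate_uces` is hypothetical.

```python
def candidate_uces(columns, window=50, min_identity=0.96):
    """Return merged half-open (start, end) intervals of candidate UCEs.

    columns -- list of alignment columns, each a string with one base per genome
    A column counts as identical when every genome carries the same base.
    """
    identical = [len(set(col)) == 1 and col[0] not in "-N" for col in columns]
    merged = []
    running = sum(identical[:window])            # identical columns in the first window
    for start in range(0, len(columns) - window + 1):
        if start > 0:                            # slide the window by one column
            running += identical[start + window - 1] - identical[start - 1]
        if running / window > min_identity:      # strictly more than 96% identical
            end = start + window
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], end)   # extend the previous candidate
            else:
                merged.append((start, end))
    return merged
```

With a 50 bp window, the 96% threshold means that at most one of the 50 columns may differ among the six genomes.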
We identified 764 candidate UCEs from the 6 selected genomes and searched for these UCEs in the remaining 30 genomes using BLASTN (102). A UCE was considered valid in a genome if a single confident hit (E-value < 0.001) was found with higher than 96% sequence identity to the query UCE. As a result, we obtained 530 UCEs confirmed in at least 30 of the 36 genomes. We found genes that are less than 10 kb away from the UCEs in the Heliconius erato genome, and detected GO terms that are preferentially associated with these genes using binomial tests (p = probability for this GO term to be associated with any gene in the genome, m = number of genes near UCEs that are associated with this GO term, N = total number of genes that are less than 10 kb away from UCEs). The most significant GO terms (false discovery rate (103) < 0.1) were visualized in REVIGO (29).
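The enrichment test can be sketched with the standard library alone. The one-sided binomial tail below follows the definitions of p, m, and N given above; the Benjamini-Hochberg procedure is used here as an assumed implementation of the false discovery rate correction, and both function names are hypothetical. The exact summation is only suitable for moderate N.

```python
from math import comb


def binom_sf(m, n, p):
    """P(X >= m) for X ~ Binomial(n, p): the enrichment P-value for one GO term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted values (q-values) in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):                 # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


# Toy example: a GO term annotates 2% of all genes but 12 of 100 genes near UCEs.
p_term = binom_sf(12, 100, 0.02)
```

GO terms whose adjusted value falls below 0.1 would be the ones passed on for visualization.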
Identification of lineage-specific gene expansions
We used OrthoMCL (104) to identify the groups of orthologous proteins encoded by the 36 reference genomes. We mapped proteins in each group to the closest protein (BLAST E-value < 0.00001) of Drosophila melanogaster from Flybase (96). We assigned a Drosophila protein to an orthologous group if more than 50% of the members in this group mapped to the protein. Furthermore, we merged orthologous groups if at least 50% of Drosophila proteins in them were the same. After merging, 5089 orthologous groups were present in at least 50% of the 36 butterfly species and included Drosophila proteins; these groups were used in the following analysis. We used the cumulative protein length instead of the number of proteins to identify gene expansions because the number of proteins can be more easily affected by scaffold discontinuity in draft genomes and by errors in annotation. In addition, gene expansions tend to occur as tandem repeats, and we used this property to identify candidate gene expansions.
Among the 36 reference genomes, we have 5, 4, 6, 2, 11, and 8 species from the Papilionidae, Pieridae, Lycaenidae, Riodinidae, Nymphalidae, and Hesperiidae families, respectively. We further divided three families with more than 5 members (Lycaenidae, Nymphalidae, and Hesperiidae) into smaller groups. Lycaenidae were partitioned into Polyommatinae (3 genomes) and the rest (3 genomes); Hesperiidae were partitioned into Hesperiinae (3 genomes) and the rest (5 genomes). We grouped the Nymphalinae and Heliconiinae subfamilies of Nymphalidae because they both have scoli covering the caterpillars, and the other Nymphalidae we sequenced (no scoli) were considered the other group. Thus, we partitioned the reference genomes into 9 lineages. We calculated the total length of proteins in each orthologous group for each lineage, and identified lineage-specific gene expansions using three criteria: (1) the average cumulative protein length for species within this lineage is at least twice the average for other species; (2) the minimal total protein length for species in this lineage is larger than that of 90% of species from other lineages; (3) 50% of proteins in this lineage are encoded next to another protein from the same orthologous group in the genomes. A total of 22 cases passed all these criteria. We manually inspected them to remove 8 cases without functional annotation and those of possible transposon origins. The remaining cases are shown in Fig. 1C.
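The three criteria can be written as a small predicate. This sketch makes two interpretive assumptions that are not stated above: criterion 2 is read as "the lineage minimum exceeds the value in at least 90% of the other species", and criterion 3 is read as "at least 50%". The function name and inputs are hypothetical.

```python
def is_lineage_expansion(in_lengths, out_lengths, tandem_fraction):
    """Apply the three expansion criteria to one orthologous group and one lineage.

    in_lengths      -- cumulative protein length per species inside the lineage
    out_lengths     -- cumulative protein length per species outside the lineage
    tandem_fraction -- fraction of the lineage's proteins encoded next to another
                       member of the same orthologous group
    """
    mean_in = sum(in_lengths) / len(in_lengths)
    mean_out = sum(out_lengths) / len(out_lengths)
    crit1 = mean_in >= 2 * mean_out                      # at least twice the outside average
    lineage_min = min(in_lengths)
    exceeded = sum(1 for x in out_lengths if x < lineage_min)
    crit2 = exceeded / len(out_lengths) >= 0.9           # assumed reading of criterion 2
    crit3 = tandem_fraction >= 0.5                       # assumed reading of criterion 3
    return crit1 and crit2 and crit3
```

Groups passing this predicate would still require the manual inspection described above.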
Protein-coding sequence assembly for all USC butterflies
We developed a pipeline to assemble the protein-coding sequences from the whole genome shotgun reads of a target species using the protein sequences of a reference genome as baits, and the genome of a species more distant from the reference species than the target species as an outgroup. Different reference genomes were used for the same target species for different purposes. To obtain the most complete protein sets for each species, we used the closest reference genome; for phylogenetic analysis of a family, we selected a single reference for all species in that family; for phylogenetic analysis of all USC butterflies together, we used Cecropterus lyciades (7) as reference.
We split the reference proteins into exons and searched against sequence reads of a target species using DIAMOND (105) with the following parameters: -l 1 --comp-based-stats 1 --masking 0 --evalue 0.01. From the DIAMOND results of all exons in the reference, we kept the reads that could be unambiguously mapped to one locus by both E-value (< 1e-5 × E-value for other loci) and sequence identity (> identity for other loci + 10). We further filtered the alignments by requiring at least 80% coverage over the reads or the query exon and sequence identity higher than that between the reference and the outgroup. Because we used a number of old dry museum specimens whose DNA can be contaminated by fungi, bacteria, and surrounding specimens, we applied the following protocol to detect and remove contaminants.
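The unambiguous-mapping rule can be sketched as below. The input format (one best hit per locus for each read, with percent identity on a 0 to 100 scale) and the function name `unambiguous_locus` are assumptions made for illustration.

```python
def unambiguous_locus(hits, evalue_ratio=1e-5, identity_margin=10.0):
    """Return the locus a read maps to unambiguously, or None.

    hits -- list of (locus, evalue, percent_identity), one best hit per locus
    The best hit must have an E-value below evalue_ratio times that of every
    other locus and an identity more than identity_margin points higher.
    """
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h[1])    # lowest E-value first
    best_locus, best_evalue, best_identity = ranked[0]
    for _, evalue, identity in ranked[1:]:
        clear_evalue = best_evalue < evalue_ratio * evalue
        clear_identity = best_identity > identity + identity_margin
        if not (clear_evalue and clear_identity):
            return None                          # a competing locus is too similar
    return best_locus
```

Reads for which this returns None would be excluded before the coverage and identity filters are applied.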
For each 30 bp sliding window applied to the alignment between the reference and the reads, we clustered all the reads into groups of similar sequences using the following procedure. We ranked reads by their sequence identity to the query from high to low. The first read initiated a cluster. Starting from the second read, a new read was compared to the first sequence of each cluster and assigned to the first cluster whose first sequence had no more than one mismatch from the current sequence. If a new read could not be assigned to existing clusters, a new cluster was initiated with this read as the first member. For each cluster, we computed its size and the average number of mismatches to the query, and we considered a cluster to be good if its size was at least half of the largest cluster size and number of mismatches was no larger than minimal mismatches + 2. If the number of good clusters was no more than 2 (diploid genome), we marked the reads that were not included in the good clusters as bad reads; otherwise, we marked all reads as bad. All the bad reads were discarded.
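The clustering procedure for a single window can be sketched as follows. Reads are assumed to be already trimmed to the window and given as equal-length strings; "minimal mismatches" is interpreted here as the smallest average mismatch count among clusters, which is our reading rather than a stated detail. The function names are hypothetical.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def clean_window(query, reads):
    """Return the reads kept for one window after contaminant filtering."""
    # Rank reads by identity to the query, most similar first.
    ranked = sorted(reads, key=lambda r: mismatches(r, query))
    clusters = []
    for read in ranked:
        for cluster in clusters:
            if mismatches(read, cluster[0]) <= 1:    # compare to the cluster's first read
                cluster.append(read)
                break
        else:
            clusters.append([read])                  # start a new cluster
    if not clusters:
        return []
    largest = max(len(c) for c in clusters)
    mean_mm = [sum(mismatches(r, query) for r in c) / len(c) for c in clusters]
    min_mm = min(mean_mm)
    good = [c for c, mm in zip(clusters, mean_mm)
            if 2 * len(c) >= largest and mm <= min_mm + 2]
    if len(good) > 2:                                # more than two haplotypes: discard all
        return []
    return [r for c in good for r in c]
```

For example, with four reads identical to the query, three reads of a second allele two mismatches away, and a single divergent read, the sketch keeps the seven reads of the two alleles and drops the divergent one.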
The dominant nucleotide (frequency > 0.6) at each position in the sequence alignment after this cleaning procedure was used to generate the exon sequences of the target species. The exon sequences were further translated to amino acid sequences and sequences of different exons of a protein were concatenated to obtain the protein sequence of the target species.
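A minimal sketch of the consensus step is shown below. It assumes the cleaned reads are supplied as equal-length aligned strings with "-" marking positions a read does not cover, and it writes "N" where no nucleotide exceeds the frequency threshold; the handling of such positions is an assumption, as it is not described above.

```python
from collections import Counter


def consensus(aligned_reads, min_freq=0.6, unknown="N"):
    """Build a consensus from aligned reads using the dominant nucleotide per column."""
    if not aligned_reads:
        return ""
    result = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "-"]      # ignore uncovered positions
        if not bases:
            result.append(unknown)
            continue
        base, count = Counter(bases).most_common(1)[0]
        result.append(base if count / len(bases) > min_freq else unknown)
    return "".join(result)


exon = consensus(["ACGT", "ACGA", "ACTA", "A-GA"])
```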
Phylogenetic analysis of USC butterflies
Phylogenetic trees were constructed for each butterfly family from the protein-coding sequences using one reference genome per family: Pterourus glaucus (Papilionidae), Phoebis sennae (Pieridae), Calycopis cecrops (Lycaenidae), Calephelis nemesis (Riodinidae), Heliconius melpomene (Nymphalidae), and Cecropterus lyciades (Hesperiidae). Since the sequences for other samples were assembled using the reference as baits, they were all aligned to the reference and could be readily converted to multiple sequence alignments. Three datasets: autosomal, Z-linked and mitochondrial proteins, were used to construct phylogenetic trees. Strong conservation of gene content was reported for Lepidoptera Z chromosome (106). Therefore, we considered exons to be Z-linked if their best TBLASTN (102) hits were on Heliconius erato Z chromosome, and a gene to be Z-linked if more than 80% of its exons were Z-linked. Multiple sequence alignments of proteins were concatenated in each dataset, and positions containing more than 60% gaps were removed.
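Two of the filters in this paragraph are simple enough to sketch directly: the Z-linkage call for a gene and the removal of gap-rich alignment positions. The data structures (a list of booleans per gene, a dictionary of aligned sequences) and function names are assumptions made for illustration.

```python
def is_z_linked(exon_on_z, min_fraction=0.8):
    """A gene is Z-linked if more than 80% of its exons have their best hit on Z."""
    return sum(exon_on_z) / len(exon_on_z) > min_fraction


def drop_gappy_columns(alignment, max_gap_fraction=0.6):
    """Remove alignment positions containing more than 60% gaps.

    alignment -- dict mapping species name to an aligned sequence (equal lengths)
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [j for j in range(length)
            if sum(alignment[n][j] == "-" for n in names) / len(names) <= max_gap_fraction]
    return {n: "".join(alignment[n][j] for j in keep) for n in names}
```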
For autosome- and Z-chromosome-based phylogeny, we built trees for 100 partitions, and each partition was generated by randomly drawing 20 kb positions from the alignment. We used IQ-TREE (107) (model: GTR+I+G) to construct the maximum-likelihood trees for each partition and summarized them to obtain a consensus tree using sumtrees.py (-f 0.0) (108). For mitochondrial proteins, we used the entire alignment and applied IQ-TREE with model selection and 1000 ultrafast bootstrap replicates (-bb 1000) to construct the tree.
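The partition sampling can be sketched as follows. Whether positions were drawn with or without replacement is not stated above; this sketch draws without replacement and keeps the sampled columns in their original order, both of which are assumptions. The function name is hypothetical.

```python
import random


def draw_partitions(alignment, n_partitions=100, n_positions=20000, seed=0):
    """Return a list of sub-alignments, each built from randomly drawn columns.

    alignment -- dict mapping species name to an aligned sequence (equal lengths)
    """
    rng = random.Random(seed)                        # seeded for reproducibility
    names = list(alignment)
    length = len(alignment[names[0]])
    size = min(n_positions, length)
    partitions = []
    for _ in range(n_partitions):
        cols = sorted(rng.sample(range(length), size))   # without replacement (assumed)
        partitions.append({n: "".join(alignment[n][j] for j in cols) for n in names})
    return partitions
```

Each sub-alignment would then be written to a file and passed to the tree-building step.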
To resolve the relationships between families, we used a single reference genome, Cecropterus lyciades. Sequences for all other species were derived by mapping to this single reference, resulting in multiple sequence alignments of all USC butterflies. We constructed trees for autosomal, Z-linked, and mitochondrial proteins as described above. These trees were expected to be less accurate in resolving shallower phylogeny due to the lower sequence similarity between Cecropterus lyciades and species in other families. Therefore, we replaced the clades for each family in these trees with the trees constructed for each individual butterfly family using the Python ETE3 module (109) to generate the USC butterfly trees used in this study. These trees were rescaled as previously described (39, 110), and the time axis was added to the tree constructed from all nuclear genes based on our published calibration (8) to match the ages of common nodes between the current and previous trees.
Simulation of diversification process under constant speciation and extinction rate
We developed an in-house script to simulate species growth under constant speciation and extinction rates. Here, the speciation rate (RS) is the probability for a taxon to split into two within 1 Myr, and the extinction rate (RE) is the probability for a taxon to go extinct within 1 Myr. Our simulation started with one taxon and iteratively introduced speciation and extinction events 122 times, corresponding to the 122 Myr of butterfly evolution we observed from the data. To introduce random fluctuation, we used a random number to determine whether a taxon should undergo speciation or extinction with the expected probabilities defined by RS and RE, respectively.
We computed the observed diversification rate per 5 Myr, and the highest value is 2, which corresponds to a speciation rate of 0.15 ($(1 + 0.15)^5 \approx 2.0$) in the absence of extinctions. Therefore, we fixed RS at 0.15 and tested different RE values in the range between 0.0 and 0.1 with an increment of 0.01. We ran 1000 simulations for each RE value, and a value of 0.08 gave the best chance of producing about the same (±10%) number of species as observed. We therefore ran 10000 simulations under these parameters (RS = 0.15 and RE = 0.08) and selected the trajectories that produced about the same (±10%) number of species as observed. From these trajectories, we analyzed the apparent diversification rate (a result of both speciation and extinction) every 5 Myr and counted the number of radiation events at each time point for comparison with the observed data.
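A minimal sketch of such a simulation is given below; it is not the in-house script itself. It assumes that within each 1 Myr step a taxon either speciates, goes extinct, or persists unchanged (the two events are treated as mutually exclusive), and the function names are hypothetical.

```python
import random


def simulate(rs=0.15, re=0.08, steps=122, seed=None):
    """Return the number of taxa after each 1 Myr step, starting from one taxon."""
    rng = random.Random(seed)
    n = 1
    trajectory = [n]
    for _ in range(steps):
        births = deaths = 0
        for _ in range(n):
            u = rng.random()
            if u < rs:                 # taxon splits into two
                births += 1
            elif u < rs + re:          # taxon goes extinct
                deaths += 1
        n += births - deaths
        trajectory.append(n)
        if n == 0:                     # the whole clade died out
            break
    return trajectory


def rate_per_5myr(trajectory):
    """Apparent diversification rate: fold change in taxon number every 5 Myr."""
    return [trajectory[t + 5] / trajectory[t]
            for t in range(0, len(trajectory) - 5, 5) if trajectory[t] > 0]
```

Running `simulate` repeatedly with different seeds and keeping only trajectories whose final count is within 10% of the observed number of species mirrors the selection step described above.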
Identification of radiation events and radiation-associated proteins
We identified radiating nodes in the tree as those generating at least 5 lineages in less than 2 Myr, and if a clade started with a radiating node and included at least 6 species, we considered it a radiation cluster. We thus identified 18 non-overlapping radiation clusters from 18 genera: Pterourus, Colias, Callophrys, Satyrium, Euphilotes, Celastrina, Oeneis, Polygonia, Phyciodes, Chlosyne, Boloria, Speyeria, Cecropterus, Erynnis, Euphyes, Hesperia, Atrytonopsis, and Agathymus. For each radiating genus, we used other closely related genera as external references to identify proteins with elevated divergence within this radiation cluster.
For each protein, we calculated its average divergence for any pair of species within the radiation cluster (DIVinternal), and the average divergence between any species in the cluster and external references from other genera (DIVexternal). We mapped proteins in each genus to their closest (E-value < 0.00001) Drosophila proteins in Flybase. If multiple proteins were mapped to the same Drosophila protein, we computed the average DIVinternal and DIVexternal weighted by the length of each protein. We detected proteins with elevated divergence in radiation clusters using two criteria: first, DIVinternal is significantly (P-value < 0.01) higher than the average DIVinternal over all proteins; second, DIVinternal is higher than DIVexternal by at least 1.5 times. The second criterion ensured that we selected proteins that tended to diverge within radiation clusters instead of generally fast evolving ones, because DIVexternal should be higher than DIVinternal for most proteins due to the larger evolutionary distance between genera than within a genus. These criteria selected 273–846 proteins in each radiating genus, and we considered them to be radiation-associated proteins.
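The length-weighted averaging and the second criterion can be sketched as below. The significance test behind the first criterion is not specified above, so it is left out of this illustration; the function names are hypothetical.

```python
def weighted_mean(values, lengths):
    """Average divergence over proteins mapped to one Drosophila protein,
    weighted by protein length."""
    return sum(v * l for v, l in zip(values, lengths)) / sum(lengths)


def diverges_within_cluster(div_internal, div_external, min_ratio=1.5):
    """Second criterion: DIVinternal is at least 1.5 times DIVexternal."""
    return div_internal >= min_ratio * div_external


# Two butterfly proteins (100 and 300 residues) map to the same Drosophila protein.
div_int = weighted_mean([0.02, 0.04], [100, 300])
```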
We only considered proteins that were present in at least 9 radiating genera and we identified recurrent radiation-associated proteins using binomial tests (p = total number of radiation-associated proteins in all 18 genera / total number of proteins being analyzed in all 18 radiating genera, m = number of genera where this protein is among the radiation-associated proteins, N = number of genera where this protein is being analyzed, alternative hypothesis: greater). Proteins with P-values less than 0.05 were considered as recurrent radiation-associated proteins, and we further identified GO terms that were enriched among them using another binomial test (p = probability for this GO term to be associated with any protein, m = total number of recurrent radiation-associated proteins that are associated with this GO term, N = total number of recurrent radiation-associated proteins).
Comparison of radiating and non-radiating lineages
To investigate the differences between radiating and non-radiating genera, we first compared sister species in both types of genera. We used the 18 radiating clusters identified above, and we found non-radiating genera using the following criteria: (1) the genus does not contain any consecutive speciation events separated by less than 0.67 Myr; (2) the genus is not rich in species south of the United States. We extracted 63 pairs of sister species from radiating genera. The distances between these pairs in the phylogenetic tree were mostly below 0.03 substitutions per position. We further selected 68 pairs of sister species whose distances in the tree were below 0.03 from non-radiating lineages. We binned the sister species pairs from radiating or non-radiating genera according to their average divergence in gene sequences into the following bins: 0.05 - 0.1, 0.1 - 0.15, 0.15 - 0.2, 0.2 - 0.25. We partitioned genes into autosomal and Z-linked ones. In each bin and each partition, we plotted the distribution of sequence divergence (percentage of positions with different nucleotides) for individual genes in radiating and non-radiating genera using the Python seaborn package (https://seaborn.pydata.org/).
The comparisons of sister species suggested a higher level of introgression in radiating genera, and thus we further tested the extent of introgression more rigorously using ABBA-BABA tests (71). The ABBA-BABA test requires 4 taxa following the tree topology (((S1,S2),S3),O), where S1 and S2 are closely related, S3 is more distant, and O is the outgroup. The test is used to identify introgression from the distant group S3 to either S1 or S2 based on excessive similarity between S3 and S1 or S2, respectively. We identified taxa following the topology of ((S1,S2),S3) among radiating genera and non-radiating genera, and we required the grouping of S1 with S2 to be strongly supported with a bootstrap value of 1. We obtained outgroups from the sister genera. The divergence between taxon S3 and taxon S1 or S2 in the cases we identified from radiating genera was mostly below 0.06, and therefore we selected the cases from non-radiating genera with the same level of divergence (< 0.06). As a result, we obtained 115 non-radiating cases and 2,557 radiating cases.
For each gene in each case, we counted the number of positions following the pattern of ABBA or BABA in taxa S1, S2, S3, and O. A pattern of ABBA means that taxa S1 and O share the same nucleotide, and taxa S2 and S3 share the same nucleotide that is different from S1. A pattern of BABA means that taxa S2 and O share the same nucleotide, and taxa S1 and S3 share the same nucleotide that is different from S2. Since we used multiple taxa as outgroups, and these outgroups may not support the same pattern, we counted the fraction of outgroups supporting a certain pattern at each position. In the absence of introgression (null hypothesis), the expected total number of ABBA positions should be equal to the total number of BABA positions. We tested for significant deviation from the null hypothesis using binomial tests (p = 0.5, m = count of ABBA positions, N = count of ABBA or BABA positions). If a gene had significantly (P-value < 0.05) more ABBA positions, we considered this to indicate introgression between S3 and S2, while significantly more BABA positions indicated introgression between S3 and S1.
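The counting scheme with multiple outgroups can be sketched as follows. Sequences are assumed to be aligned strings of equal length; positions with non-ACGT characters in S1, S2, or S3 are skipped, and the fractional counts are rounded to integers before the binomial test. Both of these choices, and the function names, are assumptions made for illustration.

```python
from math import comb


def abba_baba_counts(s1, s2, s3, outgroups):
    """Return (ABBA, BABA) counts, weighting each site by the fraction of
    outgroups that support the pattern."""
    abba = baba = 0.0
    for j in range(len(s1)):
        a, b, c = s1[j], s2[j], s3[j]
        outs = [o[j] for o in outgroups if o[j] in "ACGT"]
        if not outs or any(x not in "ACGT" for x in (a, b, c)):
            continue
        if b == c and a != b:          # S2 and S3 share a base that differs from S1
            abba += sum(o == a for o in outs) / len(outs)
        elif a == c and a != b:        # S1 and S3 share a base that differs from S2
            baba += sum(o == b for o in outs) / len(outs)
    return abba, baba


def abba_excess_pvalue(abba, baba):
    """One-sided binomial P-value (p = 0.5) for an excess of ABBA positions."""
    n = round(abba + baba)
    m = min(round(abba), n)
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(m, n + 1)) * 0.5 ** n
```

Swapping the two arguments of `abba_excess_pvalue` gives the corresponding test for an excess of BABA positions.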
We identified 2,273 genes that were significantly more likely (P-value < 0.05, alternative hypothesis: greater) to introgress and 2,362 genes that were resistant (P-value < 0.05, alternative hypothesis: less) to introgression among radiating genera using binomial tests (p = average fraction of introgressed genes in all cases, m = number of cases where this gene is introgressed, N = number of cases where this gene is being analyzed). Meanwhile, for the non-radiating genera, we identified 2,159 genes that were more likely (P-value < 0.05) to introgress and 4,001 genes that never introgressed in the 115 cases. We analyzed the functional enrichment of genes prone to introgression versus resistant ones in both radiating and non-radiating genera using GO terms as described above.
Reconstruction of ancestral sequences and analysis of selection pressure
To study the evolutionary history and adaptation in different lineages of butterflies, we reconstructed the sequence for each gene at each node of the phylogenetic tree of USC butterflies. We derived the sequence of a target node based on its sister node and its two children using a fast in-house script. The probability of each nucleotide i at a position was computed using the following formula: where fS(i), fC1(i), and fC2(i) were the frequencies of nucleotide i in the sister node, and the two children nodes, respectively; wS, wC1, and wC2 were the weights of the sister and two children nodes, and the weights were inversely correlated with the distances (in the USC butterfly tree of nuclear genes) between these nodes and the target node.
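One reading consistent with the definitions above is a normalized weighted average of the three frequency vectors. The sketch below assumes that form, takes each weight as the reciprocal of the corresponding tree distance, and requires all distances to be positive; none of these details is confirmed by the text, and the function name is hypothetical.

```python
def ancestral_probabilities(f_sister, f_child1, f_child2,
                            d_sister, d_child1, d_child2):
    """Return the probability of each nucleotide at one position of the target node.

    f_*  -- dicts mapping nucleotide to its frequency in the sister and child nodes
    d_*  -- positive tree distances from each of these nodes to the target node
    """
    # Assumed weighting: closer nodes contribute more (weight = 1 / distance).
    w_s, w_c1, w_c2 = 1.0 / d_sister, 1.0 / d_child1, 1.0 / d_child2
    total = w_s + w_c1 + w_c2
    return {
        base: (w_s * f_sister.get(base, 0.0)
               + w_c1 * f_child1.get(base, 0.0)
               + w_c2 * f_child2.get(base, 0.0)) / total
        for base in "ACGT"
    }


probs = ancestral_probabilities({"A": 1.0}, {"A": 1.0}, {"G": 1.0}, 0.2, 0.1, 0.1)
```

In this toy case the sister node is twice as far away as either child, so it receives half the weight of each child.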
For each tree branch starting from node A and ending with node B, we compared each gene’s DNA sequences at nodes A and B. For each codon in a gene, we enumerated all possible single-substitution paths to change from the codon of node A to that of node B, and we considered the path with the lowest number of nonsynonymous substitutions (most parsimonious) to be the most likely path. We recorded these most likely substitution paths for each branch and each gene and counted the number of synonymous (NSS) and nonsynonymous (NNS) substitutions along the paths. The totals TNSS and TNNS were obtained as the sums of NSS and NNS over all genes along a tree branch, and the ratio between them (TNNS/TNSS) was used as an indicator of positive selection in the branch.
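The path enumeration for a single codon can be sketched as follows, using the standard genetic code. Paths passing through stop codons are not treated specially here (a change to or from a stop codon simply counts as nonsynonymous), which is an assumption; the function name is hypothetical.

```python
from itertools import permutations, product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(codon_a, codon_b):
    """Return (synonymous, nonsynonymous) counts along the single-substitution
    path from codon_a to codon_b with the fewest nonsynonymous steps."""
    diff = [i for i in range(3) if codon_a[i] != codon_b[i]]
    best = None
    for order in permutations(diff):             # every order of changing the sites
        current, syn, nonsyn = codon_a, 0, 0
        for i in order:
            nxt = current[:i] + codon_b[i] + current[i + 1:]
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
        if best is None or nonsyn < best[1]:
            best = (syn, nonsyn)
    return best
```

For AGA to CGT, changing the first position first passes through CGA (arginine throughout, two synonymous steps), whereas changing the third position first passes through AGT (serine, two nonsynonymous steps); the sketch keeps the former path. Summing the two counts over all codons and genes on a branch gives TNSS and TNNS.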
Identification and analysis of positively selected proteins
To investigate both the universal and lineage-specific adaptation, we identified all the relatively long branches (branch length > 0.03 substitutions per position) leading to an entire subfamily, tribe or subtribe with at least 10 species in the United States and Canada. We manually inspected these branches to select a set of representatives following three rules: (1) these representatives should not overlap in species belonging to them; (2) a group originating after a longer branch and with more species is preferred. A total of 17 lineages were selected, including 9 subfamilies, 7 tribes and 1 subtribe. We identified genes showing significant positive selection along the branches leading to these lineages, respectively, using modified McDonald-Kreitman (MK) (63) tests.
A standard MK test compares the rate of nonsynonymous substitutions in a gene between species to the rate of nonsynonymous polymorphisms within species (63). We generalized this method and instead evaluated whether the nonsynonymous substitution rate of a gene along a tree branch was significantly higher than the average nonsynonymous substitution rate in lineages originating from this branch. The rationale is that if some derived beneficial trait, such as mutualism between the blues and ants, originated in a branch, its descendants may tend to retain it by reducing the nonsynonymous substitution rate. Significance of a higher nonsynonymous substitution rate in a gene along a branch was evaluated using a binomial test (p = rate of nonsynonymous substitution among species in a lineage, m = number of nonsynonymous substitutions along the branch leading to this lineage, N = number of all substitutions along the branch). To compare the positively selected genes in different lineages, we mapped genes in the 17 lineages to Drosophila genes, and rescaled the P-values for positive selection to significance scores as $\mathrm{score} = 1 - 100 \cdot P$ if $P < 0.01$, and $\mathrm{score} = 0$ if $P \geq 0.01$. We clustered the resulting scores for each gene in each lineage using the clustermap function in the Python seaborn package (https://seaborn.pydata.org/). In addition, GO term enrichment analysis of positively selected genes in each lineage was used to capture the functions of these genes.
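The binomial test and the rescaling to significance scores can be sketched as below. Here p is interpreted as the fraction of substitutions within the lineage that are nonsynonymous, which is our reading of the definition above rather than a stated detail, and P-values equal to the threshold are given a score of 0. The function names are hypothetical.

```python
from math import comb


def mk_branch_pvalue(nonsyn_branch, total_branch, p_background):
    """One-sided binomial P-value for an excess of nonsynonymous substitutions
    on a branch, given the background nonsynonymous fraction in the lineage."""
    return sum(comb(total_branch, k)
               * p_background ** k
               * (1 - p_background) ** (total_branch - k)
               for k in range(nonsyn_branch, total_branch + 1))


def selection_score(p_value, threshold=0.01):
    """Rescale a P-value to a score in (0, 1]; non-significant genes score 0."""
    return 1.0 - 100.0 * p_value if p_value < threshold else 0.0


# Toy gene: 9 of 10 substitutions on the branch are nonsynonymous,
# against a background nonsynonymous fraction of 0.3 within the lineage.
score = selection_score(mk_branch_pvalue(9, 10, 0.3))
```

A gene-by-lineage matrix of such scores is the kind of input that a hierarchical clustering heat map would take.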
ACKNOWLEDGMENTS
We acknowledge Leina Song and Ping Chen for excellent technical assistance. We are grateful to David Grimaldi and Courtney Richenbacher (AMNH: American Museum of Natural History, New York, NY, USA), Jason Weintraub (ANSP: Academy of Natural Sciences of Drexel University, Philadelphia, PA, USA), Jonathan P. Pelham (BMUW: Burke Museum of Natural History and Culture, Seattle, WA, USA), Vince Lee and the late Norm Penny (CAS: California Academy of Sciences, San Francisco, CA, USA), Boris Kondratieff (CSUC: Colorado State University Collection, Fort Collins, CO, USA), Crystal Maier and Rebekah Baquiran (FMNH: Field Museum of Natural History, Chicago, IL, USA), Weiping Xie (LACM: Los Angeles County Museum of Natural History, Los Angeles, CA, USA), Andrew D. Warren and Debbie Matthews-Lott (MGCL: McGuire Center for Lepidoptera and Biodiversity, Gainesville, FL, USA), Edward G. Riley, Karen Wright, and John Oswald (TAMU: Texas A&M University Insect Collection, College Station, TX, USA), Alex Wild (TMMC: University of Texas Biodiversity Center, Austin, TX, USA), Jeff Smith and Lynn Kimsey (UCDC: Bohart Museum of Entomology, University of California, Davis, CA, USA), Robert K. Robbins, John M. Burns, and Brian Harris (USNM: National Museum of Natural History, Smithsonian Institution, Washington, DC, USA) for granting access to the collections under their care and for stimulating discussions; to Jim P. Brock, Jack S Carter, Bill R. Dempwolf, James McDermott, the late Edward C. Knudson (specimens now at MGCL), Harry Pavulaan, James A. Scott, John A. Shuey, and Mark Walker for specimens and leg samples. Greg Kareofelas and Matthew Garhart collected needed specimens and placed them in RNAlater for molecular analysis. Boris Kondratieff, Chuck Harp, and James Scott curated the butterfly collection at the C. P. Gillette Museum of Arthropod Diversity, Colorado State University, which facilitated the accurate sampling of remaining species needed to complete the analysis.
Evi Buckner-Opler assisted by providing emotional and logistic support and helped to collect specimens. We are indebted to Texas Parks and Wildlife Department (Natural Resources Program Director David H. Riskind) for the research permit 08-02Rev, to U. S. National Park Service for the research permits: Big Bend (Raymond Skiles) for BIBE-2004-SCI-0011 and Yellowstone (Erik Oberg and Annie Carlson) for YELL-2017-SCI-7076 and to the National Environment & Planning Agency of Jamaica for the permission to collect specimens. We acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources. The study has been supported in part by grants from the National Institutes of Health GM127390 and the Welch Foundation I-1505.