Abstract
The development of complex phenotypes requires the coordinated action of many genes across space and time, yet many species have evolved the ability to develop multiple discrete, alternate phenotypes1–3. Such polymorphisms are often controlled by supergenes, sets of tightly-linked mutations in one or more loci that function together to produce a complex phenotype4. Although theories of supergene evolution are well-established, the mutations that cause functional differences between supergene alleles remain essentially unknown. doublesex is the master regulator of insect sexual differentiation but functions as a supergene in multiple Papilio swallowtail butterflies, where divergent dsx alleles control development of discrete non-mimetic or mimetic female wing color patterns5–7. Here we demonstrate that the functional elements of the mimetic allele in Papilio alphenor are six new cis-regulatory elements (CREs) spread across 150 kb that are bound by DSX itself. Our findings provide experimental support to classic supergene theory and suggest that the evolution of auto-regulation may provide a simple route to supergene origination and to the co-option of pleiotropic genes into new developmental roles.
While sex chromosomes are the most obvious supergenes, numerous complex polymorphisms spanning alternative social structures in ants2,8, flower structures9, and male plumage10 have now been traced back to allelic variation in supergenes. Theory predicts that supergenes evolve when selection favors linkage disequilibrium between alleles at multiple loci that function together to control development of an adaptive phenotype11. Reduced recombination between supergene alleles, caused by a variety of mechanisms including inversions, allows supergene alleles to accumulate subsequent mutations that refine the supergene’s function and therefore the adaptive phenotype. Although the genetic architectures of several supergenes are well-characterized, the genetic variants that cause functional differences between supergene alleles, and therefore the evolutionary origins of the alleles, their architectures, and polymorphisms they control, remain elusive.
doublesex functions as a supergene in at least five species of Papilio swallowtail butterflies, where it controls the switch between discrete female wing color patterns5–7,12. The switch between male-like non-mimetic color patterns and a novel mimetic color pattern in the closely-related species P. polytes and P. alphenor is caused by the dominant H allele5,6. H is inverted relative to the ancestral h allele, has an extremely divergent sequence from h, and contains dsx and a novel non-coding gene, untranslated three exons (U3X; Fig 1A-B)5–7. A unique spike of dsxH expression in early pupal wings initiates the mimetic color development program, but the genetic basis of this expression pattern remained unknown (Fig 1C)13.
We therefore investigated the cis-regulatory architecture and evolution of the P. alphenor supergene alleles (Fig 1). We first narrowed our search for CREs controlling dsx wing expression by identifying topologically associating domains (TADs) containing each allele, which we expected to define the local regulatory region14. Each allele was contained within a single TAD harboring dsx and the adjacent genes sir-2, rad51, and nach; the mimetic H TAD additionally contained U3X (Fig 1D). While the right TAD boundaries coincided with the right inversion breakpoint in both alleles, the left TAD boundaries were 21.9 kb and 17.9 kb outside of the inversion, suggesting that the inversion itself caused only minor changes to the local topology despite high sequence and structural divergence between the alleles (Fig 1B). TAD boundaries were similar in early and mid-pupal female and male wings (Supplementary Figure 1). We therefore expected CREs controlling dsx wing expression to be within or just to the left of the inversion.
We then identified potential dsx CREs using the assay for transposase-accessible chromatin (ATAC)15. We expected that the CRE(s) controlling dsxH expression, and therefore the mimicry polymorphism, would be 1) unique to the H allele, 2) differentially accessible (DA) between female and male wings, and 3) not DA or potentially inactive in other tissues. Alternatively, CREs may be shared between alleles but have accumulated H-specific divergence that altered their function in the developing wing. To identify CREs involved in the mimicry switch, we performed ATAC in early and mid-pupal wings and heads from H/H and h/h females and males (Fig 1E; Supplementary Figures 2-5)13,15.
We first focused on peaks in early pupal wings, where dsxH expression spikes in mimetic females (Fig 1C)13. The H allele contained 28 ATAC peaks in early pupal wings, 12 of which were wing-specific (Fig 1D). Most peaks fell within or near dsxH: 22 within introns plus peaks in exon 1, exon 6, and 0.47 kb and 3.9 kb upstream of the dsxH promoter. Peaks were also found at the UXT and U3X promoters. The h allele contained 33 ATAC peaks in early pupal wings, 16 of which were wing-specific. These h peaks were located in similar relative locations to H peaks, but were absent from ∼4 kb upstream of dsxh (Fig 1E).
We assigned orthology between these putative h and H CREs and polarized CRE gain and loss using BLAST to search for P. alphenor CRE sequences in P. alphenor, P. polytes and three additional Papilio species (Fig 1F; Supplementary Tables 4-5). Despite the enormous sequence divergence between the h and H alleles (Fig 1B), we found surprising conservation of CRE sequences and synteny over 20 million years of evolution. Importantly, all h CRE sequences were found in the P. polytes h allele and at least two other species, but six H CRE sequences were unique to the P. alphenor and P. polytes H allele, strongly suggesting that the H allele has gained multiple novel dsx CREs (Fig 1F). Both dsx alleles perform their roles in sexual differentiation equally well, so H-specific CREs cannot be essential for dsxH expression.
We expected that CREs responsible for dsxH expression would show a correspondingly unique pattern of accessibility in mimetic females. The extreme divergence between the alleles prevented direct comparisons of peak accessibility between mimetic and non-mimetic females (Fig 1B). Instead, we identified DA peaks between males and females of the same genotype and compared them based on their orthology assignments (Fig 1F; Supplementary Tables 6-7). We found that 39.3% (11/28) of H CREs were DA between the sexes, including half (3/6) of H allele-specific CREs. In contrast, 15.2% (5/33) of h CREs were DA, perhaps reflecting the similar dsx expression patterns and small amount of color pattern dimorphism between non-mimetic females and males (Fig 1E)16. This sexually dimorphic accessibility is likely involved in color pattern development, as no peaks were DA specifically in heads and only one CRE was DA in both heads and wings. These results strongly support the role of a small number of dsxH-specific CREs in activating the P. alphenor mimicry switch.
We experimentally tested if these H-specific CREs and six conserved CREs were required for dsxH expression, and therefore mimetic color pattern development, using CRISPR/Cas9 knockouts (Fig 2; Supplementary Figure 6; Supplementary Tables 8-9). We expected that knocking out CREs required for dsxH expression should cause mimetic females to develop non-mimetic color patterns.
Consistent with their requirement for dsx wing expression and mimetic color pattern development, knockouts of three conserved CREs caused mimetic females to develop mosaic non-mimetic color patterns (Fig 2; Supplementary Figure 6). We observed no color pattern differences in non-mimetic females when we targeted these six orthologous CREs. Importantly, mosaic knockouts (mKOs) of the H-specific CRE 22670 contained numerous patches that switched from the mimetic to the non-mimetic color pattern, strongly suggesting that this CRE controls the unique dsxH expression pattern in mimetic females (Fig 2D). We therefore hypothesize that the dsx supergene originated through recruitment of one or more novel CREs that significantly increased expression in early pupal wings. An inversion that linked this novel CRE and the dsx promoter, two functional elements located 100 kb apart, without deleterious pleiotropic effects on expression of dsx and nearby genes would then have been favored by selection for mimicry.
We next sought to identify the transcription factor(s) (TFs) that bind to these CREs and control dsx wing expression. The TFs that directly control dsx expression are unknown in any organism. Genetic analyses in Drosophila showed that Hox and other TFs, including Sex-combs reduced, Abdominal-B, and Caudal17, are required for DSX expression in certain contexts, but it is unknown if these TFs directly regulate dsx transcription. On the other hand, ChIP-seq experiments identified binding sites for over 230 different DNA binding proteins within dsx in whole adult Drosophila, chiefly DSX itself (25 sites)18. To identify TFs that regulate dsx wing expression, we first searched for known TF binding site motifs enriched in dsxH CREs using HOMER (Supplementary Figure 7). CREs were most significantly enriched with motifs for Paired (p = 1e-13), Caudal (p = 1e-11), Extradenticle (p = 1e-10), and DSX (p = 1e-9), immediately suggesting that the mimetic supergene allele could be auto-regulated. To test this hypothesis and identify the direct targets of DSX in wing development, we assayed genome-wide patterns of DSX binding, the active promoter/enhancer histone mark H3K4 tri-methylation, and a negative control in early and mid-pupal h/h and H/H wings using CUT&RUN (Fig 3; Supplementary Figures 8-10; Supplementary Table 10)19.
Overall, we found 9160 DSX peaks genome-wide among all samples, with the majority of peaks (84.0%) initially called only in mimetic H/H female samples where DSX expression is highest. DSX peaks mostly overlapped ATAC peaks (89.5%), and DSX peaks were enriched with a motif similar to the known Drosophila DSX binding site (p = 1e-596; Supplementary Figure 9), supporting the quality of the data. For comparison, 81.7% (9264/11339) of H3K4me3 CUT&RUN peaks overlapped ATAC peaks. In addition, we found strong DSX peaks in genes known to be bound by DSX in Drosophila, including bric-a-brac 1 (Supplementary Figure 11).
Importantly, we found multiple DSX CUT&RUN peaks within and just upstream of dsx in all samples (Fig 3). In fact, the strongest peaks on chromosome 17 in non-mimetic females were found within the inversion (Fig 3C). The h allele contained four DSX peaks in early pupal wings, all within conserved CREs, while the mimetic H allele contained 22 DSX peaks: 19 intronic, one 0.8 kb and one 4.1 kb upstream of dsxH, and one at the U3X promoter (Fig 3D). Surprisingly, five of the six H-specific CREs contained DSX peaks, strongly suggesting that the mimetic allele gained novel auto-regulatory interactions (Fig 3D). Importantly, the H-specific CRE 22670 that yielded strong mKO phenotypes (Fig 2D) also contained a strong DSX peak that was differentially bound between females and males.
Overall, 55% of DSX peaks (12/22) in the H allele and three of four h peaks were significantly differentially bound (DB) between males and females in early pupal wings. All three DB h peaks were conserved and also DB in the H allele, suggesting that DSX binding in these CREs is involved in sexual dimorphism. Thus, differential binding in the H allele may primarily contribute to the unique spike of dsxH expression. The different patterns of DSX binding in conserved CREs appear to be caused by differential use of those CREs rather than by mutations that disrupt binding sites, as the binding sites are well-conserved between the supergene alleles and across species (Fig 3E-F). Log-odds probabilities, which measure the strength of matches between predicted DSX binding sites and the consensus motif, were not significantly different between genome-wide DSX peaks, conserved peaks, or H-specific peaks (Fig 3F; all Welch’s t-test p-values > 0.10). In fact, two of the top three matches are found in H-specific CREs (Supplementary Table 11). These results suggest that the mimetic H allele has gained multiple new CREs with strong DSX binding sites that regulate dsx expression in the developing wing.
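As a rough illustration of what a log-odds match score measures, the sketch below scores a candidate site against a position frequency matrix by summing, over positions, the log ratio of the motif frequency to a uniform background. The matrix TOY_PWM is hypothetical and is not the DSX motif used in this study; the uniform background and the natural logarithm are likewise assumptions, and the scoring used by HOMER is not reproduced here.

```python
import math

# Hypothetical 4-position frequency matrix for illustration only;
# it is NOT the DSX motif from this study.
TOY_PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
]


def log_odds_score(site, pwm, background=0.25):
    """Sum of log(motif frequency / background frequency) over all positions.

    A uniform background of 0.25 per base is assumed.
    """
    site = site.upper()
    if len(site) != len(pwm):
        raise ValueError("site and motif must have the same length")
    return sum(math.log(column[base] / background)
               for base, column in zip(site, pwm))


if __name__ == "__main__":
    for candidate in ("ACGT", "ACAT", "TTGA"):
        print(candidate, round(log_odds_score(candidate, TOY_PWM), 3))
```

Higher scores indicate closer matches to the motif, so comparing score distributions between peak classes asks whether one class carries systematically stronger predicted sites.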
Finally, in addition to the spike of widespread dsxH expression in early pupal wings, DSXH becomes uniquely expressed in regions of the wing that will become white in mid-pupal wings13. Interestingly, no H-specific CREs were even accessible in mid-pupal wings (Supplementary Figure 4). Instead, three conserved peaks were DB by DSX between the sexes in both h/h and H/H comparisons, and a single conserved peak was DB specifically between H/H males and females (Supplementary Figure 10). These results suggest that H-specific CREs and DSX binding may be required early in development to trigger the developmental switch, but that conserved CREs are differentially used to sustain the mimetic color pattern program throughout development. In other words, changes in conserved wing CREs may have helped refine the mimetic program after its origination via novel, auto-regulated dsx expression in the early wing.
Antibody stains showed that DSXH expression never fully pre-figured the adult mimetic pattern, suggesting that DSXH initiates mimetic pattern development in early pupal wings but quickly becomes decoupled from it13. We expected that DSXH initiated the mimetic program by directly regulating one or more downstream genes in early pupae that execute the mimetic program5,6,13. To identify these direct targets and characterize the consequences of DSX binding on chromatin accessibility and gene expression, we directly compared genome-wide patterns of DSX CUT&RUN, ATAC-seq, and differential gene expression (DE) between mimetic and non-mimetic females (Fig 4).
We first identified DE genes by re-analyzing pupal wing RNA-seq data from five developmental stages13, then intersected those genes with our DSX and ATAC peak data. Consistent with our previous work, 1523 genes were DE between mimetic and non-mimetic females across development, with the majority (54.8%) of those genes being DE specifically in early pupal (15%) wings (Fig 4A; Supplementary Figure 12; Supplementary Table 13). First, 1103 (12.0%) DSX peaks in 776 genes were DB between early pupal mimetic and non-mimetic females, including 62 peaks in 49 DE genes. These potentially direct targets include rotund (a TF previously implicated in the mimicry switch in P. polytes21), the T-box TF midline, and at least three other DNA-binding proteins (Fig 4B; Supplementary Table 13). The small number of genes that DSX appears to directly regulate in the P. alphenor wing is consistent with work in Drosophila that showed DSX binds many targets genome-wide, but affects expression of a small number of those genes, presumably due to the presence of the appropriate co-factors22. By mid-pupal development, only 177 (1.9%) DSX peaks were DB, with just 15 peaks in 15 DE genes, including rotund.
Patterns of chromatin accessibility were opposite those of DSX binding. While 1.7% of ATAC peaks were DA between early pupal female wings, 27.7% were DA in mid-pupal female wings, suggesting that DSX binding early in development alters the wing regulatory landscape throughout development (Supplementary Figure 13). More specifically, 5.6% of DE genes contained DA peaks in early pupal wings, yet 40.8% contained DA ATAC peaks in mid-pupal wings, including genes known to be involved in the dsx mimicry switch and to specify color patterns in other butterflies, including invected, engrailed, and aristaless-1 (Fig 4)13. Thus, while DSX directly regulates few genes early in pupal development, those effects are propagated to later stages while being decoupled from DSX binding itself.
Our results provide several new insights into the structures and evolution of supergenes that have remained hidden by the extensive LD and complexity that characterize many of the best-characterized supergenes. First, we experimentally showed that multiple functional elements are required for supergene function, supporting predictions that inversions and other mechanisms of recombination suppression link distant functional loci together11. Second, our results suggest that the dsx supergene originated via the gain of a novel auto-regulatory element(s) in the 3’ half of the gene that enabled positive reinforcement of dsxH expression in the early pupal wing, causing a spike in dsx expression that initiates the mimetic pattern program6,13. An inversion then linked this novel DSX binding site to the dsx promoter and facilitated recruitment of additional H-specific CREs that reinforced this positive feedback loop and refined DSX expression across wing development.
Beyond supergenes, our findings provide novel insight into the genetic mechanisms by which pleiotropic genes are co-opted into new developmental roles. First, auto-regulation may avoid many of the deleterious pleiotropic effects from ectopic expression because it can only occur in tissues or stages where the gene is already expressed. Second, auto-regulation, particularly up-regulation, increases the probability that novel alleles are dominant and therefore immediately exposed to natural selection. Auto-regulation could play an identical role in the evolution of multi-gene supergenes, where key TFs regulate their own and/or nearby gene expression. Recombination suppression and subsequent divergence between alleles, in CREs or protein-coding sequences, would then refine the supergene alleles’ functions.
Author contributions
NWV - Conceptualization, Investigation, Visualization, Writing - original draft, Writing - review & editing; SIS - Conceptualization, Investigation; DM - Investigation; WL - Investigation; MRK - Conceptualization, Funding acquisition, Writing - review & editing
Funding
This work was supported by NIH R35 GM131828 to MRK.
Data Availability
Illumina sequencing data are publicly available in the National Center for Biotechnology Information (NCBI) under BioProject PRJNA1062051. Genome assemblies, annotations, R projects, and full analysis results are publicly available in the Dryad repository under accession doi://AAAAAAAAAA.
Materials and Methods
Butterfly care
Papilio alphenor pupae were purchased from Philippines breeders and allowed to emerge in the University of Chicago greenhouses. New adults were sexed, labeled with a unique number on the forewing, and the sexes separated until use. We determined each individual’s dsx genotype using DNA from a single leg and a custom TaqMan (Life Technologies, USA) assay. We set up crosses between multiple virgin adults carrying the desired alleles in 2 m³ mesh cages, allowed them to mate, and provided Citrus (Meyer lemon) shrubs for oviposition. Adults were fed Bird’s Choice artificial nectar and supplied with Lantana. Pre-pupae were collected and placed into labeled boxes in an incubator set to 25°C, 16h:8h light/dark cycle, and constant 65% humidity. Pupal development takes approximately 15 days under these conditions; experiments focused on two days and five days after pupation (15% and 35% pupal development, respectively).
Genome sequencing and assembly
We extracted high molecular weight (HMW) genomic DNA from the thorax of freshly killed P. alphenor females that were homozygous for the h or H alleles using the QIAgen GenomicTip G-100 kit (QIAgen, USA). Extractions followed the manufacturer’s instructions, except we incubated chopped fresh tissue in lysis buffer and proteinase K overnight at 50°C with shaking at 200 rpm before purification. We then constructed Oxford Nanopore sequencing libraries using the ONT Ligation Sequencing Kit (LSK-110) and eliminated fragments <10 kb using the PacBio SRE XS kit before sequencing on a MinION Mk1B and R9.4.1 flow cells to 30X - 40X coverage.
We called bases using Guppy with the super-accuracy basecalling model (dna_r9.4.1_450bps_sup.cfg), then assembled the mimetic H and non-mimetic h genomes separately using these raw reads and Flye v2.9.1 with default settings and the expected genome size set to 250 Mb. The initial Flye assemblies were each polished using the Guppy basecalls and Medaka v1.7.2 (medaka_consensus) with the appropriate error model (r941_min_sup_g507). We then purged duplicates using purge_dups v1.2.5.
We then scaffolded contigs together into 31 chromosomes using Hi-C data (see below) and the 3d-dna pipeline23. We identified the Z and W chromosomes by analyzing coverage of re-sequencing data from five males and five females5. The remaining 29 chromosomes were ordered and labeled by decreasing size. We assembled the P. alphenor mitochondrial genome using NOVOPlasty v4.2 24 with sequencing data from SRR1108726 and the RefSeq mtDNA assembly for P. polytes (NC_024742.1) as the seed sequence. This resulted in a single circularized sequence of 15,247 bp. We added this sequence as chrM to each assembly.
We identified repeat sequences in the mimetic assembly using RepeatModeler, then used our custom library, the RepBase 20181026 “arthropoda” database, and Dfam 20181026 database to identify and mask repeats genome-wide in both assemblies using RepeatMasker. Finally, we hard masked regions in the nuclear genomes with homology to chrM (blastn e-value < 1e-50).
We sequenced and assembled a new Papilio nephelus genome following the same protocol, using DNA from a single random female. Papilio nephelus is a sexually monomorphic, non-mimetic species that diverged ∼15 mya from the P. alphenor lineage. Final assembly statistics can be found in Supplementary Table 14.
TAD Plot
Plotted regions: mimetic assembly, chr17:5850000-6183000; non-mimetic assembly, chr17:5704000-6023000.
Inversion: mimetic assembly, chr17:5937900-6095400; non-mimetic assembly, chr17:5791937-5930539.
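The coordinates listed above can be turned into inversion lengths and flanking distances with a few lines of arithmetic. The sketch below is only a convenience check: the dictionary layout and function names are ours, and lengths are computed as end minus start, which assumes the coordinates can be treated as simple positions on chr17.

```python
# Coordinates copied from the TAD plot notes; the dictionary layout is ours.
REGIONS = {
    "mimetic": {"plot": (5850000, 6183000), "inversion": (5937900, 6095400)},
    "non-mimetic": {"plot": (5704000, 6023000), "inversion": (5791937, 5930539)},
}


def span_bp(interval):
    """Length of an interval in bp, computed as end - start."""
    start, end = interval
    return end - start


def summarize(regions):
    """Per-assembly inversion length and the flanking sequence inside the plot window."""
    summary = {}
    for assembly, coords in regions.items():
        plot_start, plot_end = coords["plot"]
        inv_start, inv_end = coords["inversion"]
        summary[assembly] = {
            "inversion_bp": span_bp(coords["inversion"]),
            "left_flank_bp": inv_start - plot_start,
            "right_flank_bp": plot_end - inv_end,
        }
    return summary


if __name__ == "__main__":
    for assembly, stats in summarize(REGIONS).items():
        print(assembly, stats)
```

Under these assumptions the inversion spans 157,500 bp in the mimetic assembly and 138,602 bp in the non-mimetic assembly, with roughly 88 to 92 kb of flanking sequence on each side of the plotted window.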
Annotation
We annotated the mimetic assembly using EvidenceModeler 1.1.1 25. We first assembled a high-quality transcript database using PASA 25, SE50 data generated in VanKuren et al. 13, and PE100 and SE50 data from Nallu et al. 26. After adapter trimming, we performed de novo and genome-guided assembly using Trinity v2.10.0 27 and genome-guided assembly using StringTie v1.3.3 28. RNA-seq data were also mapped to the mimetic P. alphenor assembly using STAR 2.6.1d 29, and the resulting alignments were used to generate genome-guided assemblies with Trinity and StringTie 1.3.1 28. We combined de novo and genome-guided assemblies using PASA 2.4.1 30. Evidence for protein-coding regions came from mapping the UniProt/Swiss-Prot (2020_06) database and all Papilionoidea proteins available in NCBI’s GenBank nr protein database (downloaded 6/2020) using exonerate 31. We identified high-quality multi-exon protein-coding PASA transcripts using TransDecoder (transdecoder.github.io), then used these models to train and run Genemark-ET 4 32 and GlimmerHMM 3.0.4 33. We also predicted gene models using Augustus 3.3.2 34, the supplied heliconius_melpomene1 parameter set, and hints derived from the RNA-seq and protein mapping above. Augustus predictions with >90% of their length covered by hints were considered high-quality models. Transcript, protein, and ab initio data were integrated using EVM with the weights in Supplementary Table 15.
Raw EVM models were then updated twice using PASA to add UTRs and identify alternative transcripts. Gene models derived from transposable element proteins were identified using BLASTp and removed from the annotation set. We manually curated the dsx region and key color patterning genes. The full annotation comprises 20,674 genes encoding 26,074 protein-coding transcripts, containing 97.2% complete and missing 1.9% of endopterygota single-copy orthologs according to BUSCO v5 and OrthoDB v10. We functionally annotated protein models using eggNOG’s emapper-2.0.1b utility and the v2.0 eggNOG database35. We used liftoff 36 to transfer these annotations to the non-mimetic reference genome assembly.
Downstream analyses used sequence, transcripts, and proteins from only the 31 main chromosomes and chrM.
Hi-C sequencing and analysis
We performed Hi-C on developing hindwings using Dovetail Genomics’ (USA) Omni-C kit. Hindwings were dissected from staged pupae then snap frozen in liquid nitrogen and stored at -80°C until use. Wings from three individuals were pooled and pulverized in liquid nitrogen before proceeding with fixation and proximity ligation according to the manufacturer’s protocol. We performed Hi-C separately on hindwings from males and females homozygous for each dsx allele at two days after pupation. All eight libraries were pooled and sequenced 2 x 150 bp on an Illumina NovaSeq 6000 S1 flow cell at the University of Chicago Functional Genomics Facility (RRID:SCR_019196). We trimmed adapters and low-quality regions using Trimmomatic 0.39 37, then used Juicer v1.6 23 and bwa 0.7.17 38 to map, sort, and de-duplicate reads.
We used these merged_nodups.txt files as input to the 3d-dna pipeline for genome assembly (above). Separately, we used HiCExplorer 39 to identify TADs in each sample and to plot results used in Fig 1. Juicer output files were converted to .h5 format using hicConvertFormat at 5 kb resolution (except mfp6 and mmp6, which used 15 kb resolution due to their lower quality), normalized using hicCorrectMatrix and the Knight-Ruiz method 40, then used to call TADs with hicFindTADs and default settings except "--correctForMultipleTesting fdr --minDepth 20000 --maxDepth 50000 --step 10000". Samples mfp6 and mmp6 used settings "--minDepth 60000 --maxDepth 150000 --step 30000" to account for their lower resolution.
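To make the two parameter sets easier to compare, the sketch below assembles the corresponding hicFindTADs command lines without running them. The depth, step, and multiple-testing values are those stated above; the matrix file name and output prefix are hypothetical, and the --matrix and --outPrefix options are given as we understand the HiCExplorer command-line interface, so they should be checked against the installed version.

```python
import shlex

# Depth and step settings quoted in the text, keyed by matrix resolution in bp.
TAD_SETTINGS = {
    5000: {"minDepth": 20000, "maxDepth": 50000, "step": 10000},
    15000: {"minDepth": 60000, "maxDepth": 150000, "step": 30000},
}


def find_tads_command(matrix, out_prefix, resolution):
    """Build (but do not run) a hicFindTADs argument list for one sample."""
    settings = TAD_SETTINGS[resolution]
    return [
        "hicFindTADs",
        "--matrix", matrix,          # corrected .h5 matrix (hypothetical file name)
        "--outPrefix", out_prefix,   # hypothetical output prefix
        "--correctForMultipleTesting", "fdr",
        "--minDepth", str(settings["minDepth"]),
        "--maxDepth", str(settings["maxDepth"]),
        "--step", str(settings["step"]),
    ]


if __name__ == "__main__":
    # Print shell-ready command lines for a 5 kb and a 15 kb sample.
    print(shlex.join(find_tads_command("sample_5kb.h5", "sample_5kb", 5000)))
    print(shlex.join(find_tads_command("mfp6_15kb.h5", "mfp6_15kb", 15000)))
```

In both parameter sets the minimum depth is four bins, the maximum depth ten bins, and the step two bins, so the 15 kb settings are the 5 kb settings scaled by the change in resolution.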
Genome blacklists
We generated genome blacklists by identifying low-mappability regions using genmap v1.3.0 41. We calculated mappability for 50-mers, allowing for 1 mismatch (-K 50 -E 1), then identified low-mappability regions as those with mappability < 1. We merged low-mappability regions within 300 bp of each other using bedtools merge, then kept regions > 100 bp. We combined these regions with all regions with mtDNA homology (identified using BLASTn with E < 1e-50) into a single blacklist. This was performed separately for the mimetic and non-mimetic genome assemblies.
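The merging and filtering logic described above can be sketched in plain Python. This is an illustration of the interval arithmetic only, not the bedtools implementation: intervals are assumed to be BED-style (chrom, start, end) tuples, and the example coordinates are invented.

```python
def merge_intervals(intervals, max_gap=300):
    """Merge (chrom, start, end) intervals separated by at most max_gap bp."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= max_gap:
            last[2] = max(last[2], end)  # extend the previous interval
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


def build_blacklist(low_mappability, mtdna_hits, max_gap=300, min_len=100):
    """Merge low-mappability regions, keep those longer than min_len, add mtDNA hits."""
    merged = merge_intervals(low_mappability, max_gap)
    kept = [iv for iv in merged if iv[2] - iv[1] > min_len]
    return sorted(kept + list(mtdna_hits))


if __name__ == "__main__":
    # Invented coordinates for illustration.
    low_map = [("chr1", 100, 150), ("chr1", 400, 520),
               ("chr1", 2000, 2050), ("chr2", 10, 300)]
    mtdna = [("chr5", 1000, 1200)]
    for region in build_blacklist(low_map, mtdna):
        print(*region, sep="\t")
```

In this toy input the first two chr1 regions are 250 bp apart and are merged into one 420 bp region, while the isolated 50 bp region is dropped by the length filter.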
ATAC-seq
We performed ATAC experiments following Lewis and Reed42 and Buenrostro et al.15 with minor modifications. Wings were dissected in room temperature PBS then immediately transferred into ice cold sucrose buffer in 2 mL Dounce homogenizers. Tissues were dissociated using 30 - 50 strokes with the tight pestle, then transferred to cold 1.5 mL tubes. Cells and nuclei were pelleted by centrifugation for 5 min at 1000 x g at 4°C, then resuspended in 200 uL lysis buffer (wash buffer: 10 mM Tris-HCl pH7.5, 10 mM NaCl, 3 mM MgCl2 plus 0.2% NP-40) and lysed on ice for 5 min. Nuclei were harvested by centrifugation for 5 min at 1000 x g and 4°C, then resuspended in 750 uL ice cold wash buffer and counted. Aliquots of 500,000 cells were pelleted for 5 min at 1000 x g and 4°C, then resuspended in transposition mix (25 uL TD buffer, 2.5 uL TDE1 enzyme, 22.5 uL water). Transposition was performed for 30 min at 37°C with shaking at 1000 rpm, then cleaned up using the Zymo DNA Clean and Concentrator 5 kit. Libraries were amplified for 10 cycles before double-sided cleanup (0.5X - 1.8X) with SPRI Select Beads (Beckman-Coulter, USA). We sequenced libraries PE50 on a single lane of a NovaSeq X 10B flowcell, for an average of 30M read pairs per sample.
ATAC-seq analysis
Raw ATAC-seq sequencing reads were trimmed using Trimmomatic 0.39 37, then mapped to each reference genome using bwa 0.7.17 38. Duplicate reads were removed using picard43. We assessed sample quality using ATACseqQC 1.22.0 44, and used the cleaned BAM files for downstream analysis. Bigwig tracks for visualization were created and normalized using the RPKM method and deepTools2 45. We called peaks using F-Seq2 following the authors’ recommendations for ATAC-seq data (-pe -l 600 -f 0 -t 4.0 -nfr_upper_limit 150 -pe_fragment_size_range auto) in each sample, then merged peaks from biological replicates using bedtools intersect46, requiring reciprocal 20% overlap between at least two replicates. Finally, we combined sample peaksets using bedtools merge to generate a comprehensive peakset that we used to identify differentially accessible peaks (DAPs) using DiffBind 3.8.4 47,48. We performed all relevant pairwise comparisons between sexes, genotypes, and stages, then corrected p-values using the Benjamini-Hochberg method49. Only DAPs with global FDR < 0.05 were used in downstream analyses.
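The replicate filter can be illustrated with a small sketch of the reciprocal-overlap criterion: two peaks support each other when their overlap covers at least 20% of each peak. This is a plain-Python stand-in for the bedtools-based step, written under the assumption that a peak is retained if any peak in another replicate satisfies the criterion; the peak coordinates are invented.

```python
from itertools import combinations


def reciprocal_overlap(a, b, min_frac=0.2):
    """True if peaks a and b (chrom, start, end) overlap by >= min_frac of each peak."""
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a[2] - a[1])
            and overlap >= min_frac * (b[2] - b[1]))


def reproducible_peaks(replicates, min_frac=0.2):
    """Peaks supported by a reciprocally overlapping peak in at least one other replicate."""
    kept = set()
    for rep_a, rep_b in combinations(replicates, 2):
        for a in rep_a:
            for b in rep_b:
                if reciprocal_overlap(a, b, min_frac):
                    kept.update((a, b))
    return sorted(kept)


if __name__ == "__main__":
    # Invented peak calls from three replicates.
    rep1 = [("chr17", 100, 200), ("chr17", 1000, 1100)]
    rep2 = [("chr17", 150, 250), ("chr17", 1095, 1500)]
    rep3 = [("chr17", 5000, 5100)]
    print(reproducible_peaks([rep1, rep2, rep3]))
```

The all-against-all comparison is quadratic and is meant only to show the criterion; on genome-scale peak sets a sorted sweep or an interval tree would be the usual choice.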
We annotated the full ATAC peaksets using annotatePeaks.pl from HOMER 4.11.0 50.
Peak and peak sequence orthology
We assigned orthology between h and H ATAC peak sequences and outgroup genomes using BLAST 2.2.24. First, we identified the genome region bounded by nach and UXT in the P. polytes (GCF_000836215.1)6, P. protenor (GCA_029286645.1), P. nephelus, and P. bianor (available at http://gigadb.org/dataset/100653)51 genomes by mapping the non-mimetic P. alphenor nach, dsx, and UXT transcripts to each genome using minimap2 with the “-xsplice” option52. We then converted the minimap2 results to GTF format using paftools.js and UCSCtools bedToGenePred and genePredToGtf tools. We extracted and oriented each region, then used blastn to identify h and H sequences in each target region. We used the following command, for example:
blastn -task blastn-short -evalue 1e-3 \
  -outfmt "6 qseqid qlen qstart qend sseqid sstart send length pident evalue" \
  -max_hsps 1 -query papAlpH.peaks.fa -db papBia.fa
We collated BLAST results and P. alphenor peak sequences in R, then plotted the results using the gggenomes package53. An R project containing the analysis and plotting pipelines can be found in Dryad (AAAAAAAAA).
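For readers who prefer to inspect the tabular BLAST output outside R, the sketch below parses the ten columns requested in the -outfmt string of the blastn command. It is not part of the authors' R pipeline; the derived strand and query-coverage fields are our additions, and the example record is invented.

```python
import csv
import io

# Column order taken from the -outfmt string in the blastn command.
COLUMNS = ["qseqid", "qlen", "qstart", "qend", "sseqid",
           "sstart", "send", "length", "pident", "evalue"]
INT_FIELDS = ("qlen", "qstart", "qend", "sstart", "send", "length")
FLOAT_FIELDS = ("pident", "evalue")


def parse_blast_tab(handle):
    """Parse tab-separated BLAST hits into dictionaries with typed fields."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(COLUMNS, row))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        # Derived fields (our additions): subject strand and fraction of query aligned.
        hit["strand"] = "+" if hit["sstart"] <= hit["send"] else "-"
        hit["query_cov"] = (hit["qend"] - hit["qstart"] + 1) / hit["qlen"]
        hits.append(hit)
    return hits


if __name__ == "__main__":
    # Invented record for illustration.
    example = "peak_1\t200\t1\t150\tchr17\t5000\t4851\t150\t92.0\t1e-20\n"
    for hit in parse_blast_tab(io.StringIO(example)):
        print(hit["qseqid"], hit["sseqid"], hit["strand"], hit["query_cov"])
```

A subject start greater than the subject end marks a minus-strand hit, which is how hits inside a region that is inverted relative to the query would appear.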
CUT&RUN
We assessed genome-wide Dsx binding in homozygous males and females at 15% and 35% pupal development using CUT&RUN54, following the protocol described by Meers et al.19 with minor modifications. We dissected hindwings in room temperature (RT) PBS, then enzymatically dissociated cells following Prakash and Monteiro55. Wings were immediately transferred to 750 uL TrypLE Select Enzyme (Gibco, USA) diluted to 5X in PBS, then broken up by gently pipetting 20X with a p1000 pipet set to 500 uL. Tissues were allowed to dissociate at 32°C / 1200 rpm in a thermomixer for 20 minutes, pipetting 5X every 5 minutes. Cells were harvested by centrifugation at 600 x g for 3 min, washed twice with 1 mL RT wash buffer (WB), then resuspended in 1 mL RT WB. Cells were bound to concanavalin A-bound magnetic beads (Bangs Labs) for 10 min rotating at RT. Beads were separated using a magnetic stand, then immediately resuspended in ice cold antibody buffer (WB + 0.025% digitonin). Permeabilized cells (500 000 per reaction) were aliquoted to 0.2 mL tubes before adding 0.5 ug primary antibody and nutating overnight at 4°C. Washing, digestion, and purification followed Meers et al. (2019).
We used rabbit anti-Dsx13, mouse anti-H3K4me3 (Abcam ab8580), or goat anti-rabbit IgG (Cell Signaling Technologies 45262).
We constructed sequencing libraries using the NEBNext Ultra II DNA Library Prep kit following the protocol outlined in Liu (2021) “Library Prep for CUT&RUN with NEBNext® Ultra™ II DNA Library Prep Kit for Illumina® (E7645) V.2” (https://www.protocols.io/view/library-prep-for-cut-amp-run-with-nebnext-ultra-ii-kxygxm7pkl8j/v2). The key differences between this protocol and the manufacturer’s protocol are lower annealing temperatures during end repair and PCR. Libraries were pooled and sequenced PE50 on a NovaSeq 6000 in a single SP flow cell to yield ∼10M read pairs per sample (Supplementary Table 10).
CUT&RUN analysis
Raw CUT&RUN sequencing reads were trimmed using Trimmomatic 0.39 37, then mapped to each reference genome using bwa 0.7.17 38. Duplicate reads were removed using picard43. Bigwig files for visualization were generated from these filtered BAM files using RPKM normalization in deepTools2 45. We called peaks using MACS3 56 for each sample and merged peak calls from biological replicates using bedtools46, requiring reciprocal 20% overlap between at least two of three replicates. Finally, we combined peaksets from different samples using bedtools merge to generate a comprehensive peakset that we used to identify differentially bound peaks (DBPs) using DiffBind 3.8.4 47,48. We performed all pairwise comparisons between sexes, genotypes, and stages, then corrected p-values using the Benjamini-Hochberg method49. Only DBPs with global FDR < 0.05 were used in downstream analyses.
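Both the ATAC-seq and CUT&RUN analyses rely on Benjamini-Hochberg correction. The sketch below shows the standard step-up adjustment for a list of p-values; it illustrates the method only and does not reproduce how DiffBind pools comparisons or reports its FDR values. The example p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # are non-decreasing in p (the step-up procedure).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.005]  # invented values
    for p, q in zip(pvals, benjamini_hochberg(pvals)):
        status = "keep" if q < 0.05 else "drop"
        print(f"p={p:<6} adjusted={q:.3f} {status}")
```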
We identified enriched motifs and annotated the full Dsx peakset relative to gene models using HOMER 4.11.0 50. We also identified transcription factor binding site motifs enriched in Dsx peaks (summits ±100 bp) using HOMER’s findMotifsGenome.pl script with default parameters.
To identify potential TFBSs in Drosophila melanogaster doublesex, we downloaded ChIP-seq peak calls for all 546 DNA binding proteins assayed by the modENCODE project in whole flies18, then intersected those data with the r6.55 doublesex gene model (plus 1 kb) using bedtools intersect46.
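The intersection step amounts to counting, for each assayed protein, the peaks that overlap the gene model extended by 1 kb on each side. The sketch below shows that logic in plain Python rather than with bedtools; the factor names, peak coordinates, and gene coordinates are invented placeholders and are not the r6.55 doublesex coordinates.

```python
from collections import Counter


def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals share at least one base."""
    return start_a < end_b and start_b < end_a


def sites_per_factor(peaks, gene, flank=1000):
    """Count peaks per factor that overlap the gene model extended by flank bp."""
    chrom, gene_start, gene_end = gene
    window_start = max(0, gene_start - flank)
    window_end = gene_end + flank
    counts = Counter()
    for factor, peak_chrom, peak_start, peak_end in peaks:
        if peak_chrom == chrom and overlaps(peak_start, peak_end,
                                            window_start, window_end):
            counts[factor] += 1
    return counts


if __name__ == "__main__":
    gene = ("3R", 10000, 50000)  # placeholder coordinates
    peaks = [
        ("factorA", "3R", 9500, 9700),    # inside the 1 kb upstream flank
        ("factorA", "3R", 20000, 20300),  # inside the gene body
        ("factorB", "3R", 60000, 60200),  # outside the extended window
        ("factorB", "2L", 20000, 20300),  # different chromosome arm
        ("factorC", "3R", 50900, 51200),  # overlaps the downstream flank
    ]
    for factor, n in sites_per_factor(peaks, gene).most_common():
        print(factor, n)
```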
CRISPR/Cas9
We designed pairs of guide RNAs (gRNAs) to flank target ATAC peaks using Integrated DNA Technology’s (IDT’s) gRNA Design Tool, then purchased them as 2 nmol single guide RNAs (sgRNAs). sgRNAs were resuspended to 1 ug/uL (32 uM) in water. Injection mixes with two sgRNAs consisted of 125 ng/uL each sgRNA and 500 ng/uL IDT SpCas9 V3 in PBS. Reagents were mixed in a PCR tube, incubated at 37°C for 10 min to complex, then stored at -80°C in small aliquots until use.
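The injection mix follows from a standard C1V1 = C2V2 dilution. The sketch below computes the volumes for a mix with two sgRNAs at the stated final concentrations. The sgRNA stock concentration (1 ug/uL) is from the text; the total volume of 10 uL and the Cas9 stock concentration of 10 ug/uL are assumptions made for illustration and should be replaced with the actual values.

```python
def stock_volume(final_conc, stock_conc, total_volume):
    """Solve C1 * V1 = C2 * V2 for the stock volume V1 (consistent units)."""
    if final_conc > stock_conc:
        raise ValueError("final concentration exceeds stock concentration")
    return final_conc * total_volume / stock_conc


def injection_mix(total_ul, sgrna_stock_ng_ul=1000.0, cas9_stock_ng_ul=10000.0):
    """Volumes (uL) for a two-sgRNA mix at 125 ng/uL each sgRNA and 500 ng/uL Cas9.

    cas9_stock_ng_ul is an assumed stock concentration, not a value from the text.
    """
    sgrna_ul = stock_volume(125.0, sgrna_stock_ng_ul, total_ul)
    cas9_ul = stock_volume(500.0, cas9_stock_ng_ul, total_ul)
    return {
        "sgRNA_1": sgrna_ul,
        "sgRNA_2": sgrna_ul,
        "Cas9": cas9_ul,
        "PBS": total_ul - 2 * sgrna_ul - cas9_ul,  # diluent makes up the remainder
    }


if __name__ == "__main__":
    for reagent, volume in injection_mix(10.0).items():  # 10 uL is an assumed total
        print(f"{reagent}: {volume:.2f} uL")
```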
We allowed females to lay on a fresh Citrus shrub for 1 - 3 hours, then collected eggs for injections. Eggs were aligned on a small strip of double-sided tape on a glass slide. We injected a small amount of injection mix into the bottom of each egg using pulled borosilicate needles. The double sided tape was then placed directly onto fresh Citrus leaves, where the eggs were allowed to hatch and develop. Results are shown in Supplementary Table 9. We clipped wings from adults and extracted gDNA from thorax using the QIAgen QIAwave Blood and Tissue Kit. DNA extractions followed the manufacturer’s instructions except we incubated tissue in buffer ATL/proteinase K overnight at 56°C shaking at 600 rpm in a thermomixer.
RNA-seq analysis
We re-analyzed SE50 RNA-seq data from VanKuren et al. (2023) following their pipeline. Sequencing data are available in NCBI BioProject PRJNA882073. We quantified transcript expression levels using the raw reads and Salmon 1.9.0 58. We used a k-mer size of 23 and quantified against the transcripts from the mimetic P. alphenor assembly, using the whole genome sequence as a decoy. We allowed Salmon to correct for GC, positional, and sequence bias. We then loaded gene-level quantifications using tximport59 and identified differentially expressed genes (DEGs) between mimetic and non-mimetic females at each developmental stage. We also identified genes with significantly different developmental expression profiles using maSigPro60,61 following our previously described protocol13, keeping significant genes with fit correlations >= 0.9.
Acknowledgements
We thank the University of Chicago greenhouse staff and the University of Chicago Functional Genomics Facility for research support, and the University of Chicago’s Research Computing Center for computational support.