Abstract
The multi-allelic S-locus, which contains the pistil S-RNase gene and dozens of S-locus F-box (SLF) genes, is responsible for the genetic control of self-incompatibility (SI) in Plantaginaceae. Loss of SI is a prerequisite for the mating system transition from self-incompatibility to self-compatibility, which is very frequent in flowering plants. The transition to selfing is expected to have noticeable effects on adaptive potential and genome evolution. However, genome-wide comparative studies of this evolutionary shift and of the S-locus have been limited by the lack of high-quality genomic data. The genus Antirrhinum has long been used as a model for studying self-incompatibility. Here, by integrating sequencing data from multiple platforms, we present chromosome-level reference and haplotype-resolved genome assemblies of an A. hispanicum line, AhS7S8. Comparative genomics revealed that the self-incompatible A. hispanicum diverged from its self-compatible relative A. majus approximately 3.6 million years ago (Mya), just after the Zanclean flood. The gene families expanded in each species are enriched in distinct functional terms, implying quite different evolutionary trajectories for the outcrossing and selfing snapdragons and a contribution to their adaptation under different mating systems. In particular, we reconstructed two complete S-haplotypes spanning ~1.2 Mb, each consisting of 32 SLFs and one S-RNase, as well as a pseudo S-haplotype from self-compatible A. majus. The multiple SLF copies were derived from proximal or tandem duplication, likely mediated by retroelements, at a very early time, indicating the ancient history of the S-locus. Furthermore, we detected a candidate cis-transcription factor (TF) associated with the regulation of SLF expression, and two miRNAs that may control the expression of this TF at a higher level.
In general, our abundant genomic data provide excellent resources for studying the mating system transition of snapdragon, and will facilitate functional genomics and evolutionary research on the S-RNase-based self-incompatibility system.
Introduction
Self-incompatibility (SI) is a molecular recognition system used by angiosperms to prevent inbreeding and promote outcrossing, thereby maintaining the genetic diversity that increases stress tolerance and survival in constantly changing, harsh environments. Although SI systems are widespread in flowering plants, loss of SI, often accompanied by a shift to selfing, has occurred many times independently in different lineages1,2. The transition is favored by selection when the short-term advantages of selfing, such as reproductive assurance and the lower energy cost of producing selfed seed, outweigh the costs of inbreeding depression3. However, Stebbins claimed that self-fertilization may be an evolutionary dead end, because it may reduce population genetic diversity and consequently preclude adaptation to new or changing environments in the longer term1. While the underlying evolutionary mechanism remains unclear, Stebbins suggested that decreased adaptive potential would lead to a higher species extinction rate, which is supported by theoretical modeling4.
The molecular mechanisms of SI in dicotyledons have been studied extensively. In the families Solanaceae, Rosaceae, Rubiaceae, Rutaceae and Plantaginaceae, SI is controlled by a single Mendelian locus, called the S-locus5, with a variety of haplotypes. This locus encodes two types of products: the pistil S determinant (a cytotoxic S-ribonuclease, S-RNase) and the pollen S determinant (dozens of pollen-specific S-locus F-box genes, called SLFs). A unique S-haplotype consists of a specific set of SLFs and a specific S-RNase, and successful fertilization occurs only between individuals harboring different S-haplotypes6. The S-RNase can be degraded through the collective action of the non-self SLF proteins of a given S-haplotype7. The SLFs and S-RNases of different haplotypes are polymorphic, and they are expected to have co-evolved and to be physically linked in order to maintain a stable SI system. However, the origin and evolution of the S-locus in Plantaginaceae remain unknown.
There has been intense interest in the role of parallel molecular changes underlying the evolutionary shifts to selfing associated with loss of SI8. Loss of function or loss of expression of S-RNase can confer self-compatibility (SC) in some Solanaceae species9,10, which has also been observed in Antirrhinum11. The genus Antirrhinum (Plantaginaceae), commonly known as snapdragon, includes more than 20 species distributed across the Mediterranean basin12. Self-incompatibility is the ancestral state in this genus, whereas A. majus is self-compatible2. The genus therefore serves as an ideal system for exploring the impact of mating system shifts on evolutionary processes and the dynamics of adaptation.
An accurate and contiguous genome assembly is a fundamental basis for comprehensively understanding the genomic diversity and evolutionary trajectory of a species. Yet assembly is frequently hampered by heterochromatin and complex regions in plant genomes. Because of its inherent heterozygosity and its large span in different species13,14,15, the S-locus has been notoriously difficult to assemble completely and contiguously. In this study, we assembled a high-quality chromosome-level reference genome, with a total length of ~480 Mb, of self-incompatible Antirrhinum hispanicum (AhS7S8, heterozygous at the S-locus, hereafter referred to as ‘SI-Ah’) and improved the previously released assembly of self-compatible A. majus (AmSCSC, hereafter referred to as ‘SC-Am’) with the help of advanced sequencing technologies. Together with the genome of its cultivated sister species, this outcrossing snapdragon genome provides an opportunity to perform comparative genomic analyses and to decipher the causes and proximal consequences of the breakdown of self-incompatibility.
Recently, Pacific Biosciences (PacBio) introduced high-fidelity (HiFi) sequencing technology, which generates long reads (>15 kb) with base accuracy comparable to that of Illumina data. Given adequate sequencing depth, HiFi reads have made it possible to assemble haplotype-resolved genome sequences of several species, showing great potential for genome assembly of ordinary non-model species. We likewise took advantage of HiFi data to obtain two haploid genomes of SI-Ah, each containing one S-haplotype. The complete structure and sequence of the S-locus also enabled us to examine, without bias, how and when the SLF genes originated, as well as the variation between different S-haplotypes. Inter- and intra-specific structural variations were detected among several S-haplotypes, revealing the highly polymorphic nature of the locus.
The genomic data and analyses reported in this work are of great value for understanding the genetic basis of adaptive traits, and provide further insights into the evolution of SI in Antirrhinum. A better understanding of SI evolution allows manipulations that reduce mating restrictions, with direct or indirect applications ranging from plant conservation to agricultural development16.
Results
Genome assembly and evaluation
In this study, we combined several sequencing strategies to derive the A. hispanicum reference genome assembly: Illumina paired-end (PE) reads, PacBio single-molecule reads, BioNano optical maps, and a Hi-C library (Fig. S1 and Table S1). To characterize the basic genome features of A. hispanicum, PE reads of line AhS7S8 were used to generate a 21-mer spectrum for a genome survey. The genome size was estimated to be ~473 Mb, with a modest heterozygosity of 0.59% (Fig. S2), consistent with the reproductive mechanism of this outcrossing species. The genome was then assembled from PacBio continuous long reads (CLR) and circular consensus sequencing (CCS) reads using Canu17 and Falcon-phase18 separately. The contigs were subsequently scaffolded with BioNano cmaps and Hi-C data in turn. In the end, we obtained three chromosome-level genome assemblies: a consensus assembly with a mosaic sequence structure (Fig. 1A) and two haplotype-resolved assemblies, each containing one copy of the S-locus (the S7 haplome with the S7-haplotype and the S8 haplome with the S8-haplotype). However, long stretches of gaps remained in the S8-haplotype of the S8 haplome and could not be filled directly with the sequencing data of AhS7S8, owing to the heterochromatic character of the S-locus. To overcome the difficulty of assembling the heterozygous S-locus, we therefore applied a genetic approach, crossing SI-Ah with SC-Am to generate two F1 genotypes (AhS7SC and AhS8SC). Because each F1 individual carries a haploid chromosome set from AhS7S8, their sequencing data were used for error correction and gap filling to improve the haplome assemblies. Comparison of the phased A. hispanicum genome sequences with the consensus assembly revealed generally good alignment, apart from some assembly gaps and inversions on a few chromosomes, which might be due to the collapse of heterozygous regions in the Canu assembly (Fig. S4). In addition, using new BioNano optical maps and Hi-C data for SC-Am, we improved the previously released genome scaffolds19 into eight pseudo-chromosomes.
The sizes of the four genome assemblies were very close to the estimate from the k-mer spectrum analysis. The lengths of homologous chromosomes ranged from 49.7 to 77.1 Mb (Table S3).
The completeness of the genome assemblies was benchmarked using the BUSCO embryophyta_odb1020 core gene set, which includes 2121 single-copy orthologous genes. The results indicated that 91.4-95.3% of the conserved genes were completely recovered in the four assemblies (Fig. S6 and Table S5). When the S7 and S8 haplomes were combined and evaluated in protein mode, 98.6% (2,090) of BUSCO genes were complete and 92.2% (1,955) were duplicated. A total of 30,956 Antirrhinum EST records downloaded from NCBI (2021-04-23) were mapped to the four genome assemblies, and nearly 90% of them could be mapped (Table S5). PE reads from the DNA and RNA libraries generated in this study were mapped to the genome assemblies using BWA-mem21 and STAR22, respectively. The mapping rate ranged from 70% to 94% (properly mapped or uniquely mapped), as listed in Table S5. In addition, the LTR Assembly Index (LAI)23, which is based on LTR-RTs, was calculated to evaluate assembly continuity. The LAI values of the two haplomes exceed 20, which corresponds to ‘gold’ standard level, and the other two assemblies (the consensus assemblies of SI-Ah and SC-Am) also reached ‘reference’ grade quality. Taken together, the S7 haplome outperforms the S8 haplome and the consensus assembly in sequence continuity, but is less accurate than they are (Table S5).
Taken together, these results support the reliability of all the genome assemblies. In particular, the S-locus of self-incompatible snapdragon, which spans more than 1.2 Mb, was completely reconstructed, with a continuous BioNano molecule cmap as strong supporting evidence (Fig. S7). To the best of our knowledge, the A. hispanicum reference and haplotype-resolved genome assemblies are among the most contiguous and complete de novo plant genome assemblies covering a complex S-locus to date.
Genome annotation
Repetitive elements accounted for more than 43% of the snapdragon genome (Table S4). Long terminal repeat (LTR) elements were the largest class of transposons, covering ~23% of the nuclear genome (Fig. 1A); among them, Ty3/gypsy (SI-Ah: 12.0%; SC-Am: 12.4%) and Ty1/copia (SI-Ah: 11.6%; SC-Am: 12.1%) were the most abundant transposable elements (TEs). DNA transposons comprise ~12% of the Antirrhinum genomes (Table S4). According to theoretical modeling, if the effects of TE insertions are mostly recessive or co-dominant, increased homozygosity would result in more efficient elimination24, so selfing species would be expected to carry fewer TEs. On the other hand, natural selection against slightly deleterious TE transpositions could be compromised in selfing species because of their reduced effective population size25. In this scenario, a transition to selfing would lead to an increase in TE content.
The insertion times of intact LTR-RTs in Antirrhinum, estimated from the divergence between the two flanking terminal repeat sequences, revealed a recent burst of LTR-RT activity at 0.3 Mya and indicated that most old LTR-RTs have been lost (Fig. S9). The younger insertions have not accumulated enough substitutions in their two flanking repeats, which makes this type of genome more difficult to assemble correctly without long-range information.
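As an illustration of how an insertion time can be derived from the divergence of the two flanking repeats, the following minimal sketch applies the commonly used relation T = K / (2r), where K is the Jukes-Cantor corrected distance between the 5' and 3' LTRs and r is the substitution rate. The function names are hypothetical, the choice of the Jukes-Cantor correction is an assumption, and the rate used in the usage note is borrowed from the neutral mutation rate reported later in this work purely for illustration; the text does not state which rate or model was used for Fig. S9.

```python
import math


def jukes_cantor_distance(ltr5: str, ltr3: str) -> float:
    """Jukes-Cantor distance between two pre-aligned LTR sequences of equal length.

    Columns containing gaps or ambiguous bases are ignored.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(ltr5.upper(), ltr3.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)  # proportion of differing sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time_years(distance: float, rate_per_site_per_year: float) -> float:
    """Insertion time T = K / (2r); both LTRs accumulate substitutions independently."""
    return distance / (2.0 * rate_per_site_per_year)
```

Under these assumptions, a pair of LTRs with a corrected distance of 0.00336 and an assumed rate of 5.6e-9 substitutions per site per year would give an insertion time of about 0.3 million years.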
Gene models for all genome assemblies were predicted with the same analysis pipeline, as described in Methods. For the consensus assembly of SI-Ah, 42,667 protein-coding genes were predicted from the soft-masked genome, 89.3% of which could be annotated in at least one functional database; the other assemblies gave similar annotation results (Table S7). This provides further evidence that all the genomes were well assembled and annotated. The average transcript and CDS lengths of protein-coding genes were 3,023 bp and 1,146 bp, respectively. The average exon and intron lengths were 232 bp and 479 bp, with 4.8 exons per gene on average. The length distributions of CDSs, exons, and introns in snapdragon are similar to those of other plant species (Fig. S10), which supports the credibility of the gene predictions. GC content is distributed unevenly along most pseudo-chromosomes and tends to be higher in gene-poor regions. The density of protein-coding genes on the pseudo-chromosomes was highest in distal regions and declined toward centromeric regions (Fig. 1A), suggesting that gene distribution is not random. As expected, the pattern of gene density is opposite to that of LTR-RT elements and methylation level. A total of 2112 transcription factors belonging to 58 families were annotated; the bHLH, ERF, MYB, C2H2, NAC, and B3 families had the largest numbers of members (Fig. S11).
Analysis of small-RNAseq libraries from four tissues identified 108 known miRNA genes of 42 families of SI-Ah and 105 miRNA genes of 42 families (details listed in Table S9 and S10 respectively). In addition, 719 transfer RNA genes (tRNAs), 311 small nuclear RNA genes (snRNAs), and 791 ribosomal RNA (rRNAs) genes for SI-Ah were predicted (Table S8).
In land plants, the chloroplast DNA is a small circular molecule, but it is nevertheless a vital one because it encodes at least 100 proteins involved in important metabolic processes such as photosynthesis. The complete circular A. hispanicum chloroplast genome has a typical quadripartite structure spanning 144.8 kb and comprises four parts: a large single-copy region (LSC), inverted repeat B (IRB), a small single-copy region (SSC), and inverted repeat A (IRA). Annotation of the chloroplast genome revealed a total of 125 genes, of which 92 are protein-coding genes, 6 are rRNA genes, and 27 are tRNA genes. The mitochondrial genome was 517.4 kb in length, harboring at least 40 protein-coding genes, 5 rRNA genes, and 33 tRNA genes. The mitogenome also contains sequences similar to chloroplast DNA. The landscapes of the chloroplast and mitochondrial genomes are presented in Fig. 1B and 1C.
Taken together, the three components of the snapdragon genome (nuclear, chloroplast, and mitochondrial) are highly complete and accurate, making them a valuable resource for comparative and evolutionary studies.
Comparative genomic analysis
We used the consensus assembly to represent A. hispanicum in the comparative genomic analysis, in order to infer its phylogenetic position and estimate species divergence times. The proteomes of SI-Ah and SC-Am, together with those of two monocots, 16 eudicots, and Amborella (Table S11), were clustered into 41,209 orthogroups (Table S12), which were subsequently used to construct the species tree. Our analysis revealed that SI-Ah and SC-Am split from their common ancestor around 3.6 Mya [95% CI=1.7-6.0 Mya], which coincides with the emergence of the modern Mediterranean climate and agrees with previously reported results26 obtained with various datasets and methodologies. The divergence of the genus Antirrhinum from other members of the order Lamiales occurred ~59.8 Mya (Fig. 2A), during the Paleocene-Eocene transition.
Whole-genome duplication (WGD) is known to be a major driving force in plant evolution, as it provides abundant genetic material. To identify WGD events in Antirrhinum, we first compared the Antirrhinum genome with the reconstructed ancestral eudicot karyotype (AEK)27 genome and with the Vitis vinifera genome, which has undergone no large-scale genome duplication since the whole-genome triplication (WGT) of AEK28. Synteny analysis of SI-Ah against the AEK and Vitis vinifera genomes showed synteny depth ratios of 1:6 and 1:2, respectively (Fig. S12), and large-scale segmental duplications can be observed in the self-alignment of SI-Ah (Fig. 1A). The Ks distribution of the paranome displayed two peaks (Fig. 2D), suggesting that the genus Antirrhinum has experienced at least one WGD event after the gamma-WGT. We next performed intergenomic comparisons between A. hispanicum and two other species, Solanum lycopersicum and Aquilegia coerulea (Fig. S13). Given the known lineage-specific WGD in Aquilegia coerulea29 and the two lineage-specific WGTs in Solanum lycopersicum30, our results further confirmed two rounds of WGD in snapdragon.
To approximately date the WGD in snapdragon, we examined the distributions of synonymous substitutions per synonymous site (Ks) among paralogous genes within a genome. The Ks distributions of the SI-Ah, Collinsia rattanii31 and Antirrhinum linkianum paranomes showed a distinct peak at 2.23, corresponding to the well-documented gamma WGT shared by all eudicots (Fig. 2D and S14), as well as an additional peak at ~0.78, corresponding to a relatively recent duplication event shared among Plantaginaceae members. Taking the whole-genome triplication time as ~117 Mya32 and the corresponding Ks peak as 2.23, we estimated that the younger WGD occurred ~40.9 Mya, after the divergence of Antirrhinum from other Lamiales plants, indicating a Plantaginaceae-specific WGD event. The neutral mutation rate of the Antirrhinum lineage was inferred using the formula μ=D/2T, where D is the evolutionary distance between SI-Ah and SC-Am (peak value of the log-transformed Ks distribution = 0.04, Fig. 2C) and T is the divergence time of the two species (3.6 Mya, 95% CI=1.7-6.0 Mya). The neutral mutation rate for the genus Antirrhinum was estimated to be 5.6e-9 (95% CI: 3.3e-9 - 1.2e-8) substitutions per site per year, which is very close to the estimate obtained from a genotyping-by-sequencing dataset26.
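The two calculations in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes that the age of the younger WGD was obtained by linear scaling of Ks against the gamma-WGT calibration point (which matches the reported ~40.9 Mya), and applies μ=D/2T to the point estimate and to both ends of the divergence-time interval. The function names are hypothetical.

```python
def scale_age_by_ks(ks_target: float, ks_calibration: float, age_calibration_mya: float) -> float:
    """Age (Mya) of an event with Ks peak ks_target, scaled linearly against a
    calibration event with known Ks peak and age (assumes a constant rate)."""
    return ks_target / ks_calibration * age_calibration_mya


def neutral_mutation_rate(distance: float, divergence_time_years: float) -> float:
    """mu = D / (2T): substitutions per site per year along each lineage."""
    return distance / (2.0 * divergence_time_years)


# Younger WGD: Ks peak ~0.78, calibrated on the gamma WGT (Ks 2.23, ~117 Mya).
wgd_age_mya = scale_age_by_ks(0.78, 2.23, 117.0)

# Neutral rate from D = 0.04 and T = 3.6 Mya; the 6.0 and 1.7 Mya bounds
# give the lower and upper ends of the rate interval, respectively.
rate_point = neutral_mutation_rate(0.04, 3.6e6)
rate_low = neutral_mutation_rate(0.04, 6.0e6)
rate_high = neutral_mutation_rate(0.04, 1.7e6)

print(f"younger WGD: ~{wgd_age_mya:.1f} Mya")
print(f"mu = {rate_point:.1e} ({rate_low:.1e} - {rate_high:.1e}) per site per year")
```

Note that the older bound of the divergence time yields the lower bound of the rate, and vice versa, because T appears in the denominator.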
Mating system changes from outcrossing to selfing have occurred many times during the evolutionary history of angiosperms33. To investigate the genome evolution trajectories of the two Antirrhinum species, we conducted a gene family analysis based on the orthologous relationships among SI-Ah, SC-Am and 19 other plant species (Fig. 2A). A total of 2,137 gene families were expanded and 1,387 were contracted in the Antirrhinum lineage. The results also revealed 789 expanded (3,978 family members) and 689 contracted (1,768 family members) gene families in SI-Ah, and 574 expanded (3,858 family members) and 623 contracted (1,618 family members) gene families in SC-Am. Among the changed families, 193 orthogroups in SI-Ah (108 expansions and 85 contractions) and 213 in SC-Am (130 expansions and 83 contractions) were rapidly evolving gene families (Fig. 2A). These rapidly evolving gene families may offer clues about the consequences of losing self-incompatibility in the Antirrhinum lineage.
GO enrichment analysis of the 789 expanded gene families of SI-Ah indicated that the significantly enriched GO terms included transferase activity (GO:0016740), pollination (GO:0009856), recognition of pollen (GO:0048544), pollen-pistil interaction (GO:0009875) and protein kinase activity (GO:0004672) (Fig. 3C), consistent with its outcrossing behavior. In SC-Am, by contrast, the enriched GO terms of the expanded families were mostly related to response to endogenous stimulus (GO:0009719), hormone binding (GO:0042562), and signaling transducer activity (GO:0060089) (Fig. 3D), which may contribute to the resilience of the self-compatible species when facing biotic and abiotic stresses. According to gene emergence theory, this may suggest that the self-compatible species underwent a change of fitness optima34, and that the shift of mating system made the species more responsive to external environmental changes. These changes are also reflected in the morphological differences between the two species: the outcrossing one exhibits a significantly reduced flower size (Fig. 2B). For the complete GO enrichment results, see Table S13 and S14.
Therefore, although the two species maintain a stable genomic architecture (Fig. 3A), significant differences related to adaptation have already accumulated in their genomes during the few million years since their separation. In addition, 87 gene families have significantly expanded in the Antirrhinum lineage, including auxin responsive protein (some genes of this family have been reported to regulate flower opening and closure35), cytochrome P450 domains (CYP), FAD binding domain, F-box domain, major pollen allergen Bet V1 signature, NB-ARC (a domain-containing disease resistance protein), protein phosphatase 2C, and UDP-glucoronosyl and UDP-glucosyl transferase (Fig. 3B). The abundance of transposable elements across the whole genome may have contributed to the copy number variation of these gene families in Antirrhinum.
Sequence comparison of haplomes
Genomic variation is a major source of the raw material that contributes to genetic diversity, and may be an important cause of speciation36. The A. hispanicum genome harbors a large number of heterozygous sites because of its outcrossing nature. To provide an accurate evaluation of the divergence between the two haplomes, we identified polymorphisms between the eight pairs of homologous chromosomes (Fig. 4 and Fig. S15-21). Based on DNA alignments of the two haplotypes, 18,283 highly similar segments (95% identity, with length greater than 5 kb) were detected, covering 35.1% of the S7 haplome and 36.2% of the S8 haplome. Within these syntenic blocks, we found 2,374,480 SNPs, 236,929 insertions, and 212,829 deletions. The overall intragenomic divergence was estimated to be 0.576%, which is consistent with the heterozygosity rate predicted by the k-mer-based approach (Fig. S2). The small variants were examined in genomic regions with different putative functions (intergenic, upstream and downstream, exon, intron, and splice sites). Regulatory regions harbored the most SNVs (48.2%), followed by intergenic regions (38.4%) and genic regions (13.0%). A total of 36,857 SNPs/InDels are predicted to have a deleterious impact on the encoded protein.
We also searched for large-scale structural variants (SVs, >50 bp) between the two phased genomes (S7 haplome and S8 haplome, Fig. 5 and Fig. S27-33) and between the two reference genome assemblies (SI-Ah and SC-Am, Fig. 3A). We identified 2,876 insertions, 2,888 deletions, 151 inversions and 6 translocations in the S7-S8 haplome comparison (Table S15), generally fewer than were identified in the SI-Ah versus SC-Am alignments. Deletions and insertions were clearly the dominant variant types, accounting for most SVs in both the interspecific and intraspecific alignments. We then analyzed the intersection of SVs with various classes of genic and intergenic functional elements. In the haplome alignments, 4,714 SVs fall into intergenic regions and introns, and only a few SVs overlap coding regions, revealing conserved functional synteny between the snapdragon genomes. Inversions are among the most intriguing variants and have been shown to be involved in recombination repression and reproductive isolation in animals37. We found 151 inversions ranging in size from 3,216 bp to 3.72 Mb, with a mean size of 415.9 kb. In the comparison between SI-Ah and SC-Am, a minor inversion spanning 238.5 kb was found upstream of S-RNase on chromosome 8. Although many SVs can be observed, it is hard to determine whether they are specific to individuals or general to the species. Nevertheless, the observed sequence polymorphisms could be valuable for investigating allele-specific expression, epigenetic regulation, and genome structure in evolutionary analyses of snapdragon.
In addition, we compared the three haplomes for potential differences in genomic characteristics, such as the distribution of various TE families, GC content, and gene density. However, no distinct pattern was found between any homologous chromosomes (Fig. S4, S18-S20).
Position, structure, and flanking genes of the S-locus
Identifying the sequence details of the S-locus is the first step towards understanding SI in Antirrhinum. The Antirrhinum S-locus was found to lie in the peri-telomeric region on the short arm of chromosome 8, as opposed to the sub-centromeric location in Solanaceous species. F-box genes are found throughout the entire genome, with 275 copies altogether. A cluster of 32 SLF (S-locus F-box) genes, which corresponds to the candidate S-locus, is located on chromosome 8 (Fig. 1A). A gapless S-locus of 1.25 Mb in SI-Ah and a pseudo S-locus of 804 kb in SC-Am were both reconstructed, with BioNano long-range molecules as supporting evidence (Fig. S7 and S8). Besides the SLFs, the S-locus also contains genes encoding G-beta repeat, matE, zinc finger, cold acclimation, and auxin responsive proteins, as well as dozens of genes of unknown function (Table S19). Additionally, the S-locus is flanked by two copies of an organ-specific protein gene on the left side (OSP1, OSP2) and five copies of phytocyanin genes on the right side (here named PC1, PC2, PC3, PC4, PC5). We also searched the snapdragon genome for paralogs of the OSP and PC genes.
The contiguity and completeness of the genome assembly, together with a solid annotation, make it possible to propose a model for the origin of the S-locus. Gene duplication undoubtedly plays a key role in the evolution of such a supergene. To examine how the multiple SLFs were generated and proliferated, we compared the Ks distributions of pairwise SLFs and of non-SLF F-box genes; the synonymous substitution rates of SLF paralogs are much lower than those of non-SLF F-box paralogs (Fig. 4B). If all the SLFs had originated via WGDs or large segmental duplications, their paralogs should colocalize elsewhere in the genome and show a divergence similar to that of the S-locus genes. By contrast, if the SLFs were the result of stepwise local duplications, they should differ less from one another, and there should be distant paralogs scattered elsewhere in the genome. Based on the classification of gene duplication types, the two OSPs were derived from WGD, whereas the PCs arose from tandem duplications at different times (assuming a constant substitution rate); both inferences are supported by the topology of the phylogenetic trees of the paralogs (Fig. S22, S23). As for the SLFs, 24 and 6 of them are proximal and tandem duplicates, respectively, and the other 2 SLFs were derived from dispersed duplication (Table S19). The phylogenetic tree and the expression heatmap of the 275 F-box genes indicated that the SLFs not only are more closely related to one another, but also display a similar pollen-specific expression pattern that distinguishes them from other F-box genes (Fig. 4A). This pattern has changed little in SC-Am even after millions of years (Fig. S21). Tandem and proximal duplicates have been shown to evolve toward functional roles in plant self-defense and adaptation to dramatically changing environments38.
In addition, 27 of the 32 SLFs are intronless, suggesting that the SLFs most likely proliferated in a retroelement-mediated way. With the estimated Antirrhinum mutation rate, the peak of the Ks distribution of SLF paralogs (ranging from 1.35 to 1.40) corresponds to ~122 Mya39, indicating that the S-locus structure is a very conserved and long-lived remnant40,41 that has survived long-term balancing selection (Fig. 5B). This observation leads to the conclusion that the S-allelic polymorphisms were inherited from common ancestors and that the process of differentiation began before the formation of the currently extant species. Dissection of the SLFs and flanking genes indicates that the S-locus is still experiencing gene gains (Table S21).
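The conversion from the SLF Ks peak to an absolute age can be checked with the relation T = Ks / (2μ). The short sketch below assumes that the rate used is the neutral mutation rate of 5.6e-9 substitutions per site per year estimated earlier in this work; the function name is hypothetical.

```python
def ks_to_age_mya(ks: float, rate_per_site_per_year: float) -> float:
    """Age in million years from T = Ks / (2 * mu)."""
    return ks / (2.0 * rate_per_site_per_year) / 1e6


MU = 5.6e-9  # assumed: neutral mutation rate estimated for Antirrhinum in this work

age_low = ks_to_age_mya(1.35, MU)   # lower end of the SLF Ks peak range
age_high = ks_to_age_mya(1.40, MU)  # upper end of the SLF Ks peak range

print(f"SLF duplication age: ~{age_low:.1f} to ~{age_high:.1f} Mya")
```

Under this assumption the Ks range of 1.35 to 1.40 maps to roughly 120 to 125 Mya, which brackets the reported value of about 122 Mya.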
From the intraspecies similarity matrix heatmap of the SLFs of SI-Ah and SC-Am, we can also observe that the divergence among paralogs is much higher than that between interspecific orthologous pairs (Fig. S24). The pairwise sequence identity of the 32 SLF paralogs ranged from 33.1 to 77.3%. Moreover, the closer the physical distance between two SLFs, the higher their sequence similarity and identity. In addition, the non-synonymous/synonymous substitution rate ratio (Ka/Ks) of nearly all SLFs from different alleles is below one, a signature of purifying selection that maintains the protein function required to collectively detoxify S-RNase (Fig. S25).
Using the MEME suite, we identified a significant motif (GATCCTAXAATATCTC) located upstream of most SLFs, and the motif annotation implied that the SLFs are likely to be co-regulated by a MYB-related family transcription factor (Fig. 5C). Examination of the expression of all MYB-related family members in the genome showed that AnH01G01437.01, AnH01G01435.01, AnH01G03933.01, AnH01G28556.01, AnH01G01110.01, AnH01G16778.01, AnH01G14798.01, AnH01G03969.01, AnH01G42539.01, AnH01G33501.01, AnH01G41740.01, AnH01G31594.01, and AnH01G17038.01 were expressed in pollen but barely in style tissue (Fig. 5D). Among them, AnH01G33501.01 is located about 1.2 Mb upstream of the S-locus, making it a potential regulator involved in activating SLF expression and, subsequently, pollen-pistil recognition. The orthologous gene AnM01G38924.01 was also found to be specifically expressed in pollen of SC-Am, suggesting that the linkage between these genes controlling SI has been maintained as a genetic toolkit during the SI to SC transition. Moreover, examination of the miRNA target prediction results revealed two candidate miRNAs (aof-miR396b and zma-miR408b-5p) that bind to different regions of AnH01G33501.01, and their expression patterns were negatively correlated with that of the TF (Fig. S26). The miR396 family member aof-miR396b is believed to influence the expression of GRF transcription factors42, while over-expression of zma-miR408b in maize has been reported to respond to abiotic stress so as to maintain leaf growth and improve biomass43. These results suggest that the two miRNAs may also play an important role in self-incompatibility.
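Once a consensus motif is known, its occurrences in upstream regions can be located with a simple exact-match scan on both strands. The sketch below is not the MEME discovery procedure itself (MEME works with position weight matrices and statistical scoring); it only illustrates how the reported consensus could be searched for. It assumes that the "X" in the consensus stands for any nucleotide, and the function names are hypothetical.

```python
import re

MOTIF = "GATCCTAXAATATCTC"  # consensus reported above; "X" assumed to mean any base

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def motif_to_regex(motif: str) -> str:
    """Turn the consensus into a regular expression, treating X as a wildcard."""
    return "".join("[ACGT]" if base == "X" else base for base in motif.upper())


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(upstream: str, motif: str = MOTIF) -> list:
    """Return sorted (start, strand) hits; start is 0-based on the forward strand."""
    seq = upstream.upper()
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    n, k = len(seq), len(motif)
    # Convert reverse-strand match positions back to forward-strand coordinates.
    hits += [(n - m.start() - k, "-") for m in pattern.finditer(rc)]
    return sorted(hits)
```

For example, calling find_motif on a hypothetical 2 kb upstream sequence of each SLF and counting the genes with at least one hit would give a rough picture of how widespread the motif is, although a matrix-based scanner would be the more faithful choice for real data.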
S-haplotype comparative analysis
For simplicity, we use ‘S7-locus’ and ‘S8-locus’ to refer to the S-loci of the S7 and S8 haplotypes, each of which contains an S-RNase and SLFs, and ‘S7-SLF’ and ‘S8-SLF’ to refer to their respective SLF genes, while ‘SC-locus’, which contains no S-RNase, denotes the ψS-locus of SC-Am. There are 32 SLFs annotated in each of the S7-locus (S7-SLF1 ~ S7-SLF32) and the S8-locus (S8-SLF1 ~ S8-SLF32). Their chromosomal positions and sequence information are listed in Table S16. We also compared the SC-locus with the two S-haplotypes of SI-Ah and with the S-haplotype of the self-incompatible species A. linkianum (referred to as Al-S-locus). All of the S-haplotypes were aligned to the SC-locus of A. majus to assess structural differences and to identify S-locus-specific features (Fig. 6). The synteny plot showed that the structure and gene content of the flanking ends of the different S-haplotypes were well conserved. The SC-locus is highly collinear with the S7-locus (Fig. 6B), the S8-locus and the Al-S-locus, except for a small inversion at the start of the SC-S8 alignment (Fig. 6A and 6C). In the comparison between the SC-locus and the S7- and S8-loci, a large deletion of about 120 kb that includes the RNase could be observed, involving 7 genes in the S7-haplotype and 8 genes in the S8-haplotype. These genes encode a DNase I-like protein, a zinc-binding domain protein, and DUF4218, and most of the others encode uncharacterized proteins. This structural variation may be responsible for the loss of self-incompatibility. Between the S7- and S8-loci, a small inversion could be observed, which may be involved in recombination repression, and several neighboring genes of S7-RNase have no corresponding orthologs in the S8-haplotype (Fig. 6D), suggesting divergent evolutionary histories in different regions of the S-locus. The relative position of S-RNase in the Al-S-locus differs from that in A. hispanicum, indicating that a change in the physical position of S-RNase does not affect its function and that the polymorphic S-locus in Antirrhinum has been changing very actively. Indeed, we observed a CACTA transposable element, Tam, in the vicinity of the A. linkianum S-RNase, which may have contributed to the gene transposition.
Allelic expression and DNA methylation profiling
Because a haplotype-resolved assembly is available and gene order is highly consistent between the two haplomes (Fig. 4), we can identify the two alleles of a physical locus and investigate allelic gene expression directly. Based on synteny and homology relationships, 79,398 genes (90.6% of all predicted genes) were found to have homologs on the counterpart haplotype (see “Materials and methods”). Most allelic genes displayed a low level of sequence dissimilarity (Fig. 7A).
To understand the expression profiles of allelic genes, RNA-seq datasets were analyzed using an allele-aware quantification tool with the combined coding sequences of the two haplomes as reference. Across the four tissues surveyed, most expressed loci did not exhibit large variance; only 2,504 gene pairs (3.2% of total genes, see Fig. S28) displayed significant expression imbalance between the two alleles (|fold-change| ≥ 2, p ≤ 0.01), suggesting that most alleles are expressed without bias in the SI-Ah genome.
We also analyzed four WGBS libraries to profile the methylation state of the two haplomes. About 51.6-59.5 million PE reads were generated for each sample (Table S2), and about 92.4-92.9% of total reads could be properly mapped to the haplomes. Reads from each sample were also mapped to the snapdragon chloroplast genome; the conversion rate was higher than 99%, confirming the reliability of the experiments. Methylcytosines were identified in CpG, CHH, and CHG contexts. A large fraction of cytosines in the CpG (~80.6%) and CHG (~55%) contexts were methylated, whereas a smaller fraction of cytosines in the CHH context (~8%) were methylated. No significant variation in methylation level was observed between the two haplomes (Fig. 4), as also reported in potato44.
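The three sequence contexts can be told apart from the two bases downstream of each cytosine, where H stands for A, C or T. The following minimal Python sketch illustrates the classification on the forward strand only; the function name `cytosine_context` is hypothetical and this is not the code used in the study.

```python
from typing import Optional

def cytosine_context(seq: str, pos: int) -> Optional[str]:
    """Classify the cytosine at 0-based `pos` as CpG, CHG or CHH.

    Returns None if the base is not a C, or if the downstream bases
    needed for the call are missing or ambiguous (N).
    Illustrative helper; name and interface are assumptions.
    """
    seq = seq.upper()
    if pos < 0 or pos >= len(seq) or seq[pos] != "C":
        return None
    downstream = seq[pos + 1 : pos + 3]
    # CpG only depends on the next base.
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"
    # CHG / CHH need two unambiguous downstream bases.
    if len(downstream) < 2 or "N" in downstream:
        return None
    return "CHG" if downstream[1] == "G" else "CHH"
```

Cytosines on the reverse strand could be handled the same way after reverse-complementing the sequence.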
After examining SLF and S-RNase expression in the two haplotypes, we found that S7-RNase and S8-RNase were both expressed, at fairly high levels, in the AhS7S8 style (Table S18), consistent with the situation in crab cactus45, whereas they were nearly undetectable in leaf and petal. As expected, the two genes were expressed only in the styles of AhS7SC and AhS8SC, respectively, yet at higher levels than in the AhS7S8 style. Since the gene sequences did not change after a single cross, it is very likely that S-RNase expression was altered through other mechanisms. As for the expression profiles of SLFs in pollen of the three genotypes, S7-SLFs and S8-SLFs tended to be expressed at lower levels in the two offspring samples. These genes could be divided into two groups. Group 1 consists of 4 genes that were highly expressed in AhS7S8 pollen with obvious tissue specificity but barely expressed in the two hybrid progenies; these genes also displayed allelic bias. The other 28 SLFs, in group 2, maintained their expression pattern across the three genotypes, although their expression levels in the two hybrid progenies were slightly lower than in AhS7S8 (Fig. 6b). We presume that the group 2 genes are necessary for degrading toxic S-RNase.
Epigenetic modifications are believed to be crucial in controlling gene expression46. To determine whether methylation level is responsible for the difference between the two groups, we compared DNA methylation in pollen from AhS7S8. We calculated the average DNA methylation rates for the SLF gene regions and found that methylation in gene bodies was higher than in flanking regions, and that the methylation rate was highest in the CpG context, followed by CHG and CHH (Fig. 6c). The average methylation rates of the two groups in the CpG, CHG, and CHH contexts were comparable in both gene bodies and flanking regions. Thus, the similar methylation levels and patterns of the two SLF groups are unlikely to explain the difference in expression, which may instead be caused by other kinds of epigenetic factors.
Discussion
Assembling the genomes of self-incompatible angiosperms has always been a challenge because of their inevitable heterozygosity and large proportion of repetitive sequences. In the current study, we used multiple sequencing technologies and a genetic strategy to assemble the genome of self-incompatible A.hispanicum, and updated the assembly of its cultivated relative A.majus. This is a substantial improvement that enables us to understand the genome evolution of the Antirrhinum lineage and to gain knowledge of the proximate and ultimate consequences of losing self-incompatibility47. Benefiting from the robust assembly, we can determine the detailed genomic landscape of Antirrhinum. Apart from the gamma whole-genome triplication shared by eudicots, the A.hispanicum genome underwent a Plantaginaceae-specific WGD event dating back to the Eocene, which may have been involved in large-scale chromosomal rearrangement and TE proliferation. The phylogenomic analysis clearly reveals that A.hispanicum split from its relative A.majus during the Pliocene Epoch, coinciding with the establishment of the modern Mediterranean climate in the Mediterranean basin48. Mating system shifts are ubiquitous throughout angiosperms, and the evolutionary consequences of this transition have long attracted extensive attention47,49,50. The comparative genomic analysis indicated that, after the loss of self-incompatibility, the genomic architecture of SC-Am has not changed much but its gene content has undergone certain fluctuations. To increase environmental tolerance, the self-compatible organism appears to have recruited alternative metabolic pathways to improve survival and reproductive rate, and accumulated more genes to cope with various external stimuli, since self-compatible species can no longer rely on recombination to generate new variants for adapting to volatile environments.
Owing to long-range sequencing information, the complete S-locus, consisting of dozens of genes, was fully reconstructed. Our results indicate that SLF members within this gene cluster share common cis-regulatory elements, ensuring spatially coordinated expression of the protein products that degrade S-RNase. Additionally, the physical position and expression pattern of AnH01G33501 suggest that it may be a candidate transcription factor regulating downstream SLF expression, and that it may provide a target for molecular markers for controlling mating and reproduction. Such a marker could perhaps be extended to other economically important species with S-RNase-based self-incompatibility, such as potato and pear. Moreover, two miRNAs are considered to act as regulators of the expression of this transcription factor51. In addition, based on the Ks values of SLFs, we found that the ancestral SLF had an ancient origin. These findings enrich our understanding of the S-locus from a broader perspective, and indicate that the other members of this locus are far more important than previously thought. Duplication type analysis of SLFs revealed that all of them originated earlier than the two WGDs in Plantaginaceae, yet the flanking genes of the S-locus are still expanding gradually. The S-locus thus combines both ancient and young features.
Chromosome-level structural variations such as translocations and inversions are considered drivers of genome evolution and speciation52,53. Such rearrangements have likely contributed to the divergence of these species because they can lead to reproductive isolation if they are large enough to interfere with pairing and recombination during meiosis54. By comparing the sequences of the two S-haplotypes, we speculate that the inversion at this locus may be involved in recombination suppression, thus maintaining the integrity and specificity of each haplotype. Inter-species and intra-species S-haplotype comparisons revealed that the snapdragon S-locus is undergoing continuous dynamic change. Ultimately, our study provides unprecedented insights into the genome dynamics of this ornamental flower and offers abundant genomic resources for studies on Antirrhinum. A combination of advanced sequencing technologies, comparative genomics, and multi-omics datasets will help decipher the genetic mechanisms of self-incompatibility, thereby expediting horticultural and genetic research in the future.
Materials and methods
Sample collection and library construction
All plant samples were grown in and collected from a greenhouse. For de novo assemblies, fresh young leaf tissues of three Antirrhinum hispanicum lines (AhS7S8, AhS7Sc, and AhS8Sc) and of Antirrhinum linkianum were collected for DNA extraction using the cetyltrimethylammonium bromide (CTAB) method. Among them, AhS7S8 and A.linkianum are self-incompatible, whereas the other two are partially self-incompatible. PCR-free short paired-end DNA libraries with an insert size of ~450 bp were prepared for each individual and then sequenced on an Illumina HiSeq4000 platform, following the manufacturer’s instructions (Illumina).
For PacBio Sequel II sequencing, high molecular weight (HMW) genomic DNA was extracted to construct two independent SMRTbell libraries with insert sizes of 20 kb and 15 kb, respectively, using the SMRTbell Express Template Prep Kit 2.0 (PacBio, #100-938-900), and the libraries were sequenced on the PacBio Sequel II system. The latter library was sequenced to generate high-quality consensus reads (HiFi reads) using the ccs software in the PacBio analysis toolkit (https://github.com/PacificBiosciences) with default parameters. In addition, two 20 kb PacBio libraries were also constructed and sequenced for AhS7Sc, AhS8Sc and A.linkianum.
For BioNano optical mapping, HMW genomic DNA of AhS7S8 and AmSCSC was extracted using the BioNano Prep Plant DNA Isolation Kit (BioNano Genomics), following the BioNano Prep Plant Tissue DNA Isolation Base Protocol (part # 30068). Briefly, fresh leaves were collected and fixed with formaldehyde, followed by DNA purification and extraction. Prepared nuclear DNA was labeled using BspQ1 as the restriction enzyme with the BioNano Prep Labeling Kit (Berry Genomics, Beijing). The fluorescently labelled DNA was stained at room temperature and then loaded onto a Saphyr chip for scanning on the BioNano Genomics Saphyr System by the sequencing provider Berry Genomics Corporation (Beijing, China).
Hi-C libraries were constructed for AhS7S8 and AmSCSC following the protocol developed in a previous study55. Young leaves were fixed with formaldehyde and lysed, and the cross-linked DNA samples were then digested with the restriction endonuclease DpnII overnight. After repairing sticky ends and reversing the cross-linking, DNA fragments were purified, sheared, enriched and sequenced on an Illumina NovaSeq6000 in paired-end mode.
Four representative tissues (leaf, petal, pollen, and stamen) of lines AhS7S8 and AmSCSC, plus pollen and style of lines AhS7SC and AhS8SC, were collected in the greenhouse for RNA-seq. Two biological replicates were prepared for each tissue. Strand-specific RNA-seq libraries were prepared following the manufacturer’s recommended standard protocol and then sequenced on an Illumina platform in paired-end mode. At least 20 million paired-end reads were generated for each library.
Small RNAs were separated from the total RNA described above using a PAGE gel, followed by adapter ligation, reverse transcription, PCR product purification, and library quality control. Samples were sequenced on a Nextseq500 (50 bp single-end) to yield 20 million reads per sample.
The same tissues were also used for whole-genome bisulfite sequencing. Genomic DNA was extracted and fragmented to ~450 bp, then treated with bisulfite using the EZ DNA Methylation-Gold Kit according to the manufacturer’s instructions. The libraries were sequenced on a HiSeq4000 system, and at least 50 million PE reads were generated for each sample.
Genome Assembly
To estimate the genome size independently of assembly, we characterized basic genome features using high-quality Illumina short reads of A.hispanicum. First, PE reads were trimmed using Trim_galore (v0.6.1)56 with parameters “-q25 --stringency 3” to remove low-quality bases and adapters. Based on the 21-mer spectrum derived from the clean data, the heterozygosity rate and haploid genome length of the four snapdragon lines were evaluated using GenomeScope257.
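As an illustration of how a k-mer spectrum relates to genome size, the sketch below counts k-mers and divides the total number of k-mer occurrences by the depth of the main coverage peak. This is a deliberately simplified stand-in for the mixture-model fit performed by GenomeScope2; the function names, the `min_depth` error cutoff, and the use of non-canonical k-mers are assumptions made for the example.

```python
from collections import Counter

def kmer_spectrum(reads, k=21):
    """Return {depth: number of distinct k-mers seen `depth` times}.

    Simplification: k-mers are not collapsed with their reverse
    complements, and k-mers containing N are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i : i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return Counter(counts.values())

def estimate_genome_size(spectrum, min_depth=2):
    """Naive estimate: total k-mer occurrences / depth of the tallest peak.

    Depths below `min_depth` are treated as sequencing errors
    (hypothetical cutoff chosen for illustration).
    """
    usable = {d: n for d, n in spectrum.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers above the error cutoff")
    peak_depth = max(usable, key=lambda d: usable[d])
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth
```

For a highly heterozygous genome the tallest peak may correspond to heterozygous k-mers at half the homozygous depth, in which case this naive estimate would be roughly doubled; a model-based fit avoids that pitfall.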
For line AhS7S8, we generated a consensus assembly consisting of collapsed sequences and haplotype-resolved assemblies consisting of scaffolds from phased alleles. First, we removed reads of poor quality (RQ < 0.8) and short length (< 1 kb) from the PacBio subreads of the 20 kb library before assembly. Contig-level assembly was performed on the filtered subreads using the Canu package (v1.9)17 with parameters “genomeSize=550m corMinCoverage=2 corOutCoverage=200 ‘batOptions = -dg 3 -db 3 -dr 1 -ca 500 -cp 50’”. After that, alternative haplotig sequences were removed using purge_dups58, and only primary contigs were kept for further scaffolding. The draft genome of A.linkianum was assembled using the same procedure. To obtain haplotype-resolved genome assemblies, ~40X PacBio HiFi reads together with Hi-C data were used to separate the maternal and paternal alleles using FALCON-phase18. The two resulting sets of phased contig-level haplotigs were used independently for downstream analysis. These three contig assemblies were scaffolded into pseudo-chromosomes following the same pipeline, described below (Fig. S1).
Low-quality BioNano optical molecular maps shorter than 150 kb or with a label density below 9 per 100 kb were removed. The BioNano genome maps, together with the contigs, were then fed into the BioNano Solve hybrid scaffolding pipeline v3.6 (BioNano Genomics) to produce hybrid scaffolds under non-haplotype-aware mode. When there were conflicts between the sequence maps and the optical molecular maps, both were cut at the conflict site and assembled with parameters “-B2 -N2”. Following BioNano scaffolding, Hi-C data were incorporated to cluster contigs into 8 groups. Adapters were removed from the raw Hi-C reads and low-quality bases were trimmed. Clean reads were then mapped to the BioNano-derived scaffolds using BWA-mem21. The chromosome-level assembly was then generated using the 3D-DNA analysis pipeline to correct mis-joins, order, orientation, and translocations in the assembly59. Manual review and refinement of the candidate assembly were performed in Juicebox Assembly Tools (v1.11.08)60 for interactive correction. The long pseudo-chromosomes were then regenerated using the script “run-asm-pipeline-post-review.sh -s finalize --sort-output --build-gapped-map” in the 3D-DNA package with the reviewed assembly file as input.
In total, three chromosome-level assemblies of AhS7S8 were constructed, which we named the consensus assembly SI-Ah (the mosaic one), the S7 haplome and the S8 haplome (containing the S7 haplotype and the S8 haplotype, respectively). The sequences of the two haplomes were named Chr*.1 and Chr*.2 to distinguish homologous chromosomes. After manual curation, we used the ultra-high-depth PacBio dataset and the program PBJelly61 to fill gaps in the haplome assemblies, followed by two rounds of polishing with HiFi reads and the gcpp program (https://github.com/PacificBiosciences/gcpp) in the pbbioconda package. Finally, Illumina DNA-seq reads were mapped to the polished assemblies for small-scale error correction using Pilon62.
Moreover, the AmSCSC assembly published in a previous study19 was further improved with the new BioNano and Hi-C sequencing data generated in this study. Since AmSCSC is a self-compatible inbred cultivar and displays high homozygosity, this A.majus consensus assembly is also referred to as the SC haplome.
The organellar genomes of Antirrhinum were identified using BLASTn against the A.hispanicum contigs with plant plastid sequences as reference (https://www.ncbi.nlm.nih.gov/genome/organelle/). After circularizing the most confident hits, we reconstructed a complete circular mitochondrial genome of 517,247 bp and a circular chloroplast genome of 144,858 bp. The organellar genome sequences were submitted to the CHLOROBOX63 website for automatic annotation and visualization of genome maps.
Assemblies comparison and variation detection
The MUMmer package64 was used to inspect differences between pairwise genome assemblies. SNVs (SNPs and InDels) were called using show-snps with parameters ‘-rlTHC’, then parsed, and adjacent variants were collapsed with our custom Python script. Structural variations (SVs), including duplications, deletions, inversions, and translocations, were called and visualized using SyRI65 based on reciprocal alignments from nucmer and show-coords. The effects of SNVs were estimated using SnpEff66, and SVs were annotated with the command ‘bcftools annotate’.
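Because show-snps reports InDels one base per record, with ‘.’ marking the absent base, neighbouring records have to be merged to recover whole events. The sketch below shows one possible way to do this; it is not the authors' script, and the record layout `(chrom, pos, ref, alt)` and the function names are hypothetical.

```python
def variant_type(ref, alt):
    """Classify a single-base record; '.' marks the absent base."""
    if ref == ".":
        return "INS"
    if alt == ".":
        return "DEL"
    return "SNP"

def collapse_adjacent(records):
    """Merge runs of same-type records at identical or consecutive positions.

    `records` must be sorted by chromosome and reference position.
    Inserted bases share one reference position (gap 0); deleted or
    substituted bases sit at consecutive positions (gap 1).
    """
    merged = []
    for chrom, pos, ref, alt in records:
        kind = variant_type(ref, alt)
        ref_base = "" if ref == "." else ref
        alt_base = "" if alt == "." else alt
        if merged:
            last = merged[-1]
            same_event = last["chrom"] == chrom and last["type"] == kind
            if same_event and 0 <= pos - last["end"] <= 1:
                last["end"] = pos
                last["ref"] += ref_base
                last["alt"] += alt_base
                continue
        merged.append({"chrom": chrom, "start": pos, "end": pos,
                       "type": kind, "ref": ref_base, "alt": alt_base})
    return merged
```

With this layout, a 3-bp deletion reported as three consecutive lines becomes a single record spanning three reference positions.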
Repetitive DNA elements discovery
To identify repetitive elements within each genome sequence, we first used RepeatModeler (open-1.0.11) to build a de novo repetitive element library from each assembled genome sequence independently. The GenomeTools suite67 was used to annotate LTR-RTs with protein HMMs from the Pfam database. In addition, LTR_retriever68 was used to combine the output of GenomeTools and LTR_FINDER69 into a whole-genome LTR annotation. The parameters of LTR_FINDER and LTRharvest were set to require minimum and maximum LTR lengths of 100 bp and 7 kb.
Each of these libraries was classified with RepeatClassifier, and the libraries were then combined and made non-redundant using USEARCH (https://www.drive5.com/usearch/). The unclassified sequence library was then searched with BLASTX against transposase and plant protein databases to remove putative protein-coding genes. Unknown repetitive sequences were submitted to CENSOR (https://www.girinst.org/censor/index.php) for further classification and annotation. De novo searches for miniature inverted-repeat transposable elements (MITEs) used the MITE_Hunter70 software with parameters “-w 2500 -n 5 -L 80 -I 80 -m 2 -S 12345678 -P 1”. Finally, all genome assemblies were soft-masked with the repeat library using RepeatMasker (open-4-0-9)71 with parameters “-div 22 -cutoff 220 -xsmall”.
Gene prediction and annotation
Gene prediction was performed using the BRAKER272 annotation pipeline with parameters ‘--etpmode --softmasking’, which integrates RNA-seq datasets and evolutionarily related protein sequences as well as ab initio prediction results. Clean RNA-seq reads were aligned to the genome assemblies using STAR22, then converted and merged into BAM format. The hint files were incorporated with the ab initio prediction tools AUGUSTUS73 and GeneMark-EP74 to train gene models. Predicted genes with protein length < 120 and TPM < 0.5 in any RNA-seq sample were removed. The SLF gene models in the S-locus were examined in a genome browser and manually curated.
The tRNA genes were annotated by homology searching against the whole genome using tRNAscan-SE (v2.0)75; rRNA and snRNA genes were identified using the cmscan program in the INFERNAL (v1.1.2)76 package to search against the Rfam database. Four small-RNA sequencing datasets were collapsed and combined with known miRNA sequences downloaded from miRBase (v22)77 as input for miRDP2 (v1.1.4)78 to identify miRNA genes across the genome. Next, mature miRNA sequences were extracted to predict potential target genes using miRanda (http://www.microrna.org) by searching against all predicted full-length cDNA sequences.
For functional analysis of the predicted gene models, protein sequences were searched against the InterPro consortium databases79, including PfamA, PROSITE, TIGRFAM, SMART, SuperFamily, and PRINTS, as well as the Gene Ontology database and pathway (KEGG, Reactome) databases, using InterProScan (v5.51-85.0)80. The protein sequences were also submitted to the PlantTFDB online server81 to identify transcription factors and assign TF families. In addition, biological functions were assigned to predicted genes using DIAMOND82 to search against the NCBI non-redundant protein and UniProtKB (downloaded on 21st Mar 2021, UniProt Consortium)83 databases with an e-value of 1e-5. COG annotation of genes was performed using eggNOG-mapper84 with a threshold of 1e-5.
Whole-genome duplication and intergenomic analysis
Syntenic blocks (at least 5 genes per block) of A.hispanicum were identified using MCscan (Python version)85 with default parameters. Intra-genome syntenic relationships were visualized using Circos86 in Fig. 1. We also compared the A.hispanicum genome with several other plant genomes, including those of self-compatible A.majus19, Salvia miltiorrhiza87, Solanum lycopersicum30, Aquilegia coerulea29, and Vitis vinifera28. Dotplots of pairwise genome synteny were generated using the command ‘python -m jcvi.graphics.dotplot --cmap=Spectral --diverge=Spectral’.
For the Ks plots, we used the wgd package88 with an inflation factor of 1.5 and the ML estimation repeated 3 times for reciprocal best hit search and MCL clustering. The synonymous substitution rates (Ks) for paralogs and orthologs were calculated using the codeml program of the PAML package89. We plotted the output data from the result files using a custom Python script, fitting the distribution with a Gaussian mixture model (1-5 components) to determine peak Ks values. The estimated mean peak values were used for dating WGD events.
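The mixture-fitting step can be sketched as follows. This is a minimal, dependency-free expectation-maximization fit of a one-dimensional Gaussian mixture, not the custom script used in the study. Selecting the number of components by the lowest BIC, the quantile-based initialization, and the `min_sigma` floor are all assumptions of this example, since the text only states that 1-5 components were tried.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, means, sigmas):
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

def fit_gmm_1d(values, k, n_iter=200, min_sigma=0.01):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    xs = sorted(values)
    n = len(xs)
    # Deterministic start: means at evenly spaced quantiles.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    sigmas = [max((xs[-1] - xs[0]) / (2.0 * k), min_sigma)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            parts = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(parts) or 1e-300
            resp.append([p / total for p in parts])
        # M-step: update weights, means and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), min_sigma)
    loglik = sum(math.log(mixture_density(x, weights, means, sigmas) or 1e-300) for x in xs)
    bic = (3 * k - 1) * math.log(n) - 2.0 * loglik  # 3k - 1 free parameters
    return {"k": k, "means": means, "sigmas": sigmas, "weights": weights, "bic": bic}

def best_gmm(values, max_k=5):
    """Try 1..max_k components and keep the fit with the lowest BIC."""
    return min((fit_gmm_1d(values, k) for k in range(1, max_k + 1)),
               key=lambda fit: fit["bic"])
```

The component means of the selected fit would then serve as candidate Ks peaks; in practice a tested library implementation of Gaussian mixtures would be preferable to this hand-written loop.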
Comparative genomics and divergence time estimation
Orthologous gene relationships were built from the predicted proteomes derived from the consensus assembly of A.hispanicum and 20 other angiosperm species (Table S11) using OrthoFinder2 (v2.4.1)90. Only the longest protein sequence was selected to represent each gene model. The rooted species tree inferred by OrthoFinder2 using the STRIDE (Species Tree Root Inference from Duplication Events)91 and STAG (Species Tree Inference from All Genes)92 algorithms was used as the starting species tree for downstream analysis.
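Selecting one representative protein per gene can be sketched as below. The dictionaries mapping transcript IDs to sequences and to gene IDs are hypothetical inputs chosen for illustration; ties are resolved by keeping the first transcript encountered, which is an assumption.

```python
def longest_isoforms(proteins, gene_of):
    """Pick the transcript with the longest protein for each gene.

    proteins: {transcript_id: protein_sequence}
    gene_of:  {transcript_id: gene_id}
    Returns {gene_id: transcript_id}.
    """
    best = {}
    for transcript_id, seq in proteins.items():
        gene_id = gene_of[transcript_id]
        current = best.get(gene_id)
        # Strict '>' keeps the first transcript seen when lengths tie.
        if current is None or len(seq) > len(proteins[current]):
            best[gene_id] = transcript_id
    return best
```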
Species divergence times were estimated using MCMCtree93 in the PAML89 package with branch lengths estimated by BASEML, with Amborella as the outgroup. The Markov chain Monte Carlo (MCMC) process consisted of 500,000 burn-in iterations and 400,000 sampling iterations. The analysis was run twice with the same parameters to confirm that the results were robust. Divergence times for Amborella trichopoda-Oryza sativa (173-199 Mya), Vitis vinifera-Petunia axillaris (111-131 Mya) and Solanum lycopersicum–Solanum tuberosum (5.23-9.40 Mya), obtained from the TimeTree database94, were used to calibrate the estimation model, and the root age was constrained to <200 Mya.
To determine gene family expansion and contraction, the orthologous gene count table and the species tree topology inferred by OrthoFinder2 were supplied to CAFE595, which employs a random birth and death model to determine expansion and contraction of gene families at a given node. A p value was provided for each node and branch of the phylogenetic tree, and a cutoff of 0.05 was used to identify gene families that had undergone significant expansion or contraction in a specific lineage or species. KEGG and GO enrichment analyses of expanded gene family members were performed using TBtools96 to identify significantly enriched terms.
Gene expression analysis
Apart from aiding gene model prediction, RNA-seq datasets were also used for transcript quantification. For expression analysis, we used STAR22 to map clean RNA-seq reads to the reference genomes of SI-Ah and SC-Am, respectively, with parameters ‘--quantMode TranscriptomeSAM --outSAMstrandField intronMotif --alignIntronMax 6000 --alignIntronMin 50’. Transcripts per million (TPM) values were obtained using the expectation-maximization tool RSEM97. The reproducibility of RNA-seq samples was assessed using the Spearman correlation coefficient. Samples from the same tissue displayed strong correlation, with a Pearson’s correlation coefficient of r > 0.85 (Fig. S29), indicating good reproducibility. Thus, these RNA-seq data were reliable for downstream analysis.
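For reference, the TPM normalization itself can be written in a few lines. The sketch below uses raw counts and plain transcript lengths, whereas RSEM additionally resolves multi-mapping reads by expectation maximization and works with effective lengths; the function name and input dictionaries are hypothetical.

```python
def tpm(counts, lengths):
    """Convert read counts to transcripts per million.

    counts:  {transcript_id: read count}
    lengths: {transcript_id: transcript length in bp}
    Simplified: uses raw lengths, not effective lengths.
    """
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: rate / total * 1e6 for t, rate in rates.items()}
```

Two transcripts with equal counts but lengths of 1 kb and 3 kb receive TPM values in a 3:1 ratio, which illustrates the length correction.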
Owing to the haplotype-resolved assembly and the gene structure annotation of the A.hispanicum genome, allelic genes from the same locus can be identified using a synteny-based strategy combined with an identity-based method. Reciprocal best hits between the two haplotypes were first identified using MCscan, and genes not in synteny blocks were searched against the coding sequences of the counterpart haplotype to complete the allelic relationship table. We used MUSCLE98 to align the coding sequences of the two allelic genes, and then calculated the Levenshtein edit distance to measure allelic divergence. The divergence rate was defined as the edit distance divided by the total length of aligned bases.
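The divergence measure defined above can be sketched as follows. The example assumes that gap characters are stripped before computing the edit distance and that the denominator is the number of alignment columns; both choices, as well as the function names, are assumptions of this illustration.

```python
def levenshtein(a, b):
    """Levenshtein edit distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def allelic_divergence(aligned_a, aligned_b):
    """Edit distance between two aligned alleles divided by alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    distance = levenshtein(aligned_a.replace("-", ""), aligned_b.replace("-", ""))
    return distance / len(aligned_a)
```

For two aligned rows of 10 columns that differ by one substitution and one single-base gap, the function returns 0.2.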
Allelic transcript quantification was conducted using cleaned RNA-seq datasets and the allele-aware tool kallisto v.0.46.099. We applied this software to obtain the expression levels (read counts and TPM) of genes of both haplotypes. Differential expression analysis of allelic genes was performed using the R package edgeR100. The cutoff criteria for allele-imbalanced expressed genes were set as adjusted P value < 0.01, false discovery rate < 0.01 and |log2(FC)| > 1.
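The cutoff step can be illustrated with a short Python sketch. It assumes that the false discovery rate is obtained by Benjamini-Hochberg adjustment of the per-pair p values and that log2 fold changes are already available from the differential expression analysis; the function names and input layout are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def imbalanced_pairs(pair_ids, log2_fc, pvalues, max_fdr=0.01, min_abs_log2fc=1.0):
    """Return allele pairs passing FDR < max_fdr and |log2(FC)| > min_abs_log2fc."""
    fdr = benjamini_hochberg(pvalues)
    return [pid for pid, lfc, q in zip(pair_ids, log2_fc, fdr)
            if q < max_fdr and abs(lfc) > min_abs_log2fc]
```

A |log2(FC)| threshold of 1 corresponds to a two-fold difference in expression between the two alleles.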
Phylogenetic analyses of genes
F-box genes in the whole genome were identified by searching the InterProScan annotation results. Sequences annotated with both PF00646 (F-box domain) and PF01344 (Kelch motif) or PF08387 (FBD domain) or PF08268 (F-box associated domain) were considered F-box genes, and the F-box genes located in the S-locus region of snapdragon were considered potential SLFs. Organ-specific protein genes and their paralogs were selected using PF10950 as keyword, and plastocyanin genes using PF02298. Sequence alignments were constructed using MUSCLE and manually checked. Maximum likelihood phylogenetic gene trees were constructed using raxml-ng101 with 100 bootstrap replicates and parameter ‘-m MF’. Duplicate genes were classified into different categories using DupGen_finder38 with parameters ‘-d 15’.
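One reading of the domain rule above is that a gene must carry PF00646 together with at least one of PF01344, PF08387 or PF08268. The sketch below encodes that interpretation, which is an assumption, and then restricts the hits to a genomic interval; the gene record layout and the coordinates in the usage example are hypothetical.

```python
FBOX_DOMAIN = "PF00646"
PARTNER_DOMAINS = {"PF01344", "PF08387", "PF08268"}  # Kelch, FBD, F-box associated

def is_fbox_gene(pfam_ids):
    """True if the gene has the F-box domain plus at least one partner domain."""
    domains = set(pfam_ids)
    return FBOX_DOMAIN in domains and bool(domains & PARTNER_DOMAINS)

def candidate_slfs(genes, locus):
    """Return IDs of F-box genes lying entirely within the given interval.

    genes: iterable of dicts with keys id, chrom, start, end, pfam
    locus: (chrom, start, end), 1-based inclusive
    """
    chrom, start, end = locus
    return [g["id"] for g in genes
            if g["chrom"] == chrom and g["start"] >= start and g["end"] <= end
            and is_fbox_gene(g["pfam"])]
```

For example, `candidate_slfs(genes, ("Chr8", 1000, 5000))` would return only the F-box genes annotated inside that made-up interval.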
Cis-regulatory element analysis of SLFs promoters
Cis-regulatory elements are specific DNA sequences located upstream of the coding part of a gene that are involved in regulating gene expression by binding transcription factors (TFs). We therefore examined the 2000 bp upstream sequences of the 32 SLFs of SI-Ah to discover TF binding sites using MEME102, and Tomtom was used to compare the discovered motifs against the JASPAR database103.
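Extracting the upstream window has to respect gene orientation: for genes on the minus strand the promoter lies at higher coordinates and must be reverse-complemented. The following sketch shows this under the assumption that the window is measured from the annotated gene start in 1-based inclusive coordinates; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def upstream_region(start, end, strand, chrom_len, size=2000):
    """1-based inclusive coordinates of the upstream window, clipped to the chromosome.

    Returns None when the gene sits at the chromosome edge and no
    upstream sequence exists.
    """
    if strand == "+":
        lo, hi = max(1, start - size), start - 1
    else:
        lo, hi = end + 1, min(chrom_len, end + size)
    return (lo, hi) if lo <= hi else None

def promoter_sequence(chrom_seq, start, end, strand, size=2000):
    """Upstream sequence in the orientation of the gene (5' to 3')."""
    region = upstream_region(start, end, strand, len(chrom_seq), size)
    if region is None:
        return ""
    lo, hi = region
    seq = chrom_seq[lo - 1 : hi]
    return seq if strand == "+" else revcomp(seq)
```

The resulting sequences could be written to a FASTA file and passed to a motif discovery tool.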
Whole-genome bisulfite sequencing analysis
The raw WGBS reads were processed to remove adapter sequences and low-quality bases using Trim_galore with default parameters. The cleaned reads were mapped to the two haplomes using the abismal command from the MethPipe104 package with parameters ‘-m 0.02’. All reads were also mapped to the snapdragon chloroplast genome (normally unmethylated) to estimate the bisulfite conversion rate. The non-conversion ratio of chloroplast genome Cs to Ts was taken as a measure of the error rate.
Only cytosines with a sequencing depth ≥5 were considered as reliable methylcytosine sites. The methylation level at each methylcytosine site was estimated as a probability based on the ratio of methylated reads to the total reads mapped to that locus. Methylation levels in genes and 2 kb flanking regions were determined using Python scripts. Gene body refers to the genomic sequence from the start codon to the stop codon coordinates in the gff file. Each gene and its flanking regions were partitioned into ten bins of equal size, and the average methylation level in each bin was calculated by dividing the number of reads indicating methylation by the total number of reads observed in that bin.
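The binning procedure can be sketched as below for a single region (a gene body or one flank). This is not the script used in the study; the input layout `(position, methylated_reads, total_reads)`, the function name, and the reversal of bins for minus-strand genes are assumptions made for illustration.

```python
def binned_methylation(sites, start, end, strand="+", n_bins=10, min_depth=5):
    """Average methylation level in equal-sized bins across a region.

    sites: iterable of (pos, methylated_reads, total_reads), 1-based positions
    start, end: region boundaries, 1-based inclusive
    Returns a list of n_bins levels (None for bins without covered sites),
    ordered 5' to 3' relative to the gene.
    """
    meth = [0] * n_bins
    total = [0] * n_bins
    length = end - start + 1
    for pos, m_reads, t_reads in sites:
        # Skip low-depth sites and sites outside the region.
        if t_reads < min_depth or not (start <= pos <= end):
            continue
        b = min((pos - start) * n_bins // length, n_bins - 1)
        meth[b] += m_reads
        total[b] += t_reads
    levels = [m / t if t else None for m, t in zip(meth, total)]
    return levels if strand == "+" else levels[::-1]
```

Pooling reads within a bin, rather than averaging per-site ratios, follows the definition given above of methylated reads divided by total reads in the bin.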
Data availability
All raw genome sequencing datasets, assembled genome sequences, and predicted gene models have been deposited in the GSA database at the National Genomics Data Center, Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under accession number PRJCA006918, and are publicly accessible at http://bigd.big.ac.cn/gsa. RNA-seq datasets have been deposited under accession id CRA005238.
Conflict of interests
The authors declare no competing interests.
Author contributions
S.Z. and Y.X. designed research; Y.Z. prepared plant samples for sequencing; S.Z. analyzed the data; S.Z. and Y.X. wrote the paper. All authors discussed the results and commented on the final manuscript.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (32030007) and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB27010302).