Abstract
We present genome sequences for the caecilians Geotrypetes seraphini (3.8Gb) and Microcaecilia unicolor (4.7Gb), members of a limbless, mostly soil-dwelling amphibian clade with reduced eyes and unique, putatively chemosensory tentacles. We identify signatures of positive selection unique to caecilians in 1,150 orthogroups, with enrichment of functions for olfaction and detection of chemical signals. All our caecilian genomes are missing the ZRS enhancer of Sonic Hedgehog, shown by in vivo deletions to be required for limb development in mice and also absent in snakes, thus revealing a shared molecular target implicated in the independent evolution of limblessness in snakes and caecilians.
Living amphibians (frogs, salamanders and caecilians) have diverged since the Triassic. They, or their ancestors, survived all mass extinctions including the Permian-Triassic which obliterated most terrestrial vertebrates1. In our current extinction crisis, amphibians are amongst the most threatened groups. Reference-quality genomes will undoubtedly aid conservation, disease ecology and evolution studies, and breeding programs, yet amphibian genomes are amongst the most challenging of vertebrate genomes, in part because they are large and highly repetitive2,3. Gymnophiona (caecilians) are the deepest diverging of the three extant amphibian orders and comprise ∼215 species which are classified into 10 families. They display an often underappreciated diversity of life history traits, including oviparity with either aquatic larvae or direct development (with or without post hatching parental care and skin feeding4) and viviparity. The Rhinatrematidae, the deepest diverging (∼225 MYA) of the ten caecilian families5, is represented by the only previously published caecilian genome, that of Rhinatrema bivittatum, which is 5.3 Gb in size and was sequenced by the Vertebrate Genomes Project (VGP)6.
Reference genomes of Geotrypetes seraphini (Dermophiidae) and Microcaecilia unicolor (Siphonopidae) were assembled according to the VGP’s 6.7.P5.Q40.C90 metric standards, the same used for Rhinatrema bivittatum and other vertebrates6. The assemblies have, respectively, contig N50 of 20.6Mb and 3.6Mb, scaffold N50 of 272Mb and 376Mb, and Phred-scaled base accuracy of Q43 and Q37, and 99% and 97% of the assemblies were assigned to chromosome models (Supplementary Table S1). Manual curation was performed as in Howe et al.7, resulting in 69 and 55 removals of misjoins, 122 and 84 joins that had been missed by automated scaffolding, and 18 and 0 removals of false duplications for G. seraphini and M. unicolor respectively (Supplementary Figure S1). Chromosomal units were identified and named by size. The final assembly sizes were 3.8Gb and 4.7Gb, respectively (Supplementary Table S1). Chromosome content and gene order are conserved to a remarkable extent across caecilian chromosomes, with large blocks of colinear synteny up to chromosome scale further conserved to anurans (common frog and toad) across more than 600 million years of evolution (Figure 1).
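The contiguity and accuracy metrics quoted above follow standard definitions, which can be sketched as below. This is a minimal illustration and not the VGP evaluation code: the function names and the toy contig lengths are assumptions made for the example, and the Phred conversion is the usual Q = -10 log10(error probability).

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def phred_to_error_rate(q):
    """Convert a Phred-scaled quality (e.g. 40 for Q40) to an error probability."""
    return 10 ** (-q / 10)


# Toy contig lengths (hypothetical values, not taken from the assemblies).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))           # 70
print(phred_to_error_rate(40))    # about one error per 10,000 bases
```

Under these definitions, the Q40 threshold in the VGP standard corresponds to roughly one error per 10,000 bases.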
Substantial proportions of the caecilian genomes consist of repeats: a total of 67.7%, 72.5% and 69.3% for R. bivittatum, G. seraphini and M. unicolor respectively (Supplementary Table S2). Class I transposable elements (TEs; retrotransposons) are around 20 times more abundant (in %bp) than Class II TEs (DNA transposons) and make up more than 30% of each caecilian genome. LINEs are the most abundant transposon type, followed by DIRSs. These relative proportions differ from those found in the large genomes of other amphibians: a genomic low-coverage shotgun analysis of the caecilian Ichthyophis bannanicus (genome size 12.2 Gb) revealed the prevalent presence of DIRSs followed by LINEs8, while salamander genomes are dominated by LTRs, and DIRSs never surpass 7% of the genomes2,9. These results reinforce the notion that repeated instances of extreme TE accumulation in amphibians do not reflect a failure to control a specific type of TE8.
Comparing the protein coding regions across 22 vertebrate genomes we identified a set of 31,385 orthogroups, of which 15,216 contained caecilian genes. We identified 265 gene families present across vertebrates but missing in amphibia, and an additional 260 orthogroups lost specifically in caecilians (Supplementary Table S3). In contrast, 1,150 orthogroups are present only in caecilians (Supplementary Table S4), and are enriched for functions such as olfaction and detection of chemical signals (p-value<0.01). At least 20% of these caecilian specific genes contained one of three protein domains (zf-C2H2, KRAB, 7tm_4). The 7tm_4 proteins are transmembrane olfactory receptors10; enrichment of this domain amongst the novel protein families in caecilians suggests an intense selective pressure on chemosensory perception at the origin of the caecilians, as they adapted to life underground with reduced vision and compensatory elaboration of chemosensory tentacles. Proteins containing zf-C2H2 and KRAB domains are known to have functions in regulating transcription, with zf-C2H2 containing proteins in humans shown to recognize more motifs than any of the other transcription factors combined. In addition, KRAB and zf-C2H2-containing proteins have been shown to bind currently active and ancient families of specific TEs (e.g. LINEs and LTRs/ERVs)11. The emergence of novel gene families with these functional capacities at the origin of caecilians may have contributed to the unique pattern of TE accumulation we observe in this group; further work is needed.
We performed a gene birth and death analysis using CAFE v512 on the remaining 13,541 orthogroups, examining the ancestral and extant caecilian nodes where possible. The majority of these (10,035) orthogroups were excluded from the birth and death analysis because they had no net change in gene family size between caecilian species and the ancestral amphibian node (8,065 orthogroups), or had insufficient sampling (1,970 orthogroups). We reconstructed ancestral states for the remaining 3,506 orthogroups (Supplementary Table S5). There were 156 orthogroups that were completely absent in G. seraphini and M. unicolor (most likely lost in their most recent common ancestor) (Supplementary Table S3). Only 13 orthogroups showed significant changes in caecilians (Figure 2, Supplementary Table S6), with 5 expansions at the ancestral caecilian node (ACN), and 3 at the internal caecilian node (ICN), of which one gene family is significantly expanded at both nodes. There are a total of three gene families with significant contractions, all of which are on the ACN. The gene families displaying significant expansions are: cytochrome P450 family 2 (ACN), these monooxygenases catalyse many reactions involved in metabolism of a large number of xenobiotics and endogenous compounds13; butyrophilin (BTN) family (ACN), involved in milk lipid secretion in lactation and regulation of the immune response14; tripartite motif (TRIM) family (ACN and ICN) involved in a broad range of biological processes that are associated with innate immunity15; and H2A and H2B histones (ICN), which together with H3 and H4 histones and DNA form a nucleosome16. In contrast, while immune function related butyrophilin and TRIM families have significant expansions at the ACN and/or ICN, both immunoglobulin heavy and light variable gene families have significant contractions at the ACN. 
The final family displaying significant contractions is gamma crystallin, a structural protein found largely in the nuclear region of the lens of the eye at very high concentrations17. Changes in these gene family repertoires may have contributed to the transition to a fossorial lifestyle and packaging of a large genome.
We assessed selective pressure variation on the lineages leading to each extant caecilian and the ancestral caecilians (ACN and ICN) as compared to all other vertebrates in our dataset. In total, we detected 453 orthologous families with evidence of positive selective pressure acting across these nodes (Supplementary Table S7). On the ACN there was no statistically significant GO enrichment across the positively selected genes. Examples of genes with signatures of positive selection are: FBN1 (under positive selection on both the ACN and the ICN), AGTPBP1, and CEP290, all of which are involved in eye morphogenesis18–20. On the ICN there was significant GO enrichment for intermediate filament cytoskeleton function (GO:0045111). A sample of the genes under positive selection follows (the specific internal caecilian node or lineage implicated is shown in parentheses): HESX1 (M. unicolor and R. bivittatum), required for the normal development of the forebrain, eyes and other anterior structures such as the olfactory placodes and pituitary gland21; NFE2L2 (G. seraphini), a transcription factor that plays a key role in the response to oxidative stress: it binds to antioxidant response elements present in the promoter region of many cytoprotective genes, such as phase 2 detoxifying enzymes, promoting their expression, thereby neutralizing reactive electrophiles22–25; LGR4 (R. bivittatum), which is involved in the development of the anterior segment of the eye26 and is required for the development of various organs, including kidney, intestine, skin and reproductive tract27,28; COL9A3 (M. unicolor and R. bivittatum), which encodes a component of Collagen IX - a structural component of cartilage, intervertebral discs and the vitreous body of the eye29,30. In summary, the cohort of genes under positive selection does not yield statistically significant enrichment for biological processes and functions, but there are a number of genes implicated in organ (especially eye) development and morphogenesis.
Enhancer sequence conservation between vertebrates is favoured in developmental regulator genes. For example, the I12a enhancer element, located between homeobox bigenes Dlx1 and Dlx2, is known to be conserved from bony fish to mice31. Analysis of the ortholog of the I12a enhancer across the 22 vertebrate species confirms that it is easily identifiable and conserved in all vertebrates, including the three caecilians (Figure 2). Snakes contain a mutant form of an otherwise well-conserved enhancer element known as ZRS that, when placed into mice, produces a “serpentised” phenotype, directly implicating it in vertebrate limblessness32. The ZRS enhancer element is highly conserved and located within the LMBR1 intron between orthologous exons in vertebrates. However, the conserved ZRS element is absent in the three caecilian species. In contrast, ZRS is intact in limbless lizards where a more complex and lineage-specific route to limblessness has been proposed33. Here, the absence of ZRS in caecilians and the functional work on the mutated form of ZRS in snakes together provide us with a common molecular target for the convergent loss of limbs in snakes and caecilians.
Methods
Sample collection
Genomes were produced from wild-caught animals that had been maintained in captivity for several years. Voucher specimens are at the Natural History Museum, London: Geotrypetes seraphini (MW 11051, from Kon, Cameroon), Rhinatrema bivittatum (MW11052) and Microcaecilia unicolor (MW11053), both from Camp Patawa, Kaw Mountains, French Guiana.
DNA preparation, Sequencing and optical mapping
All DNA extractions were from liver tissue using the Bionano Animal Tissue Plug preparation (https://bionanogenomics.com/wp-content/uploads/2018/02/30077-Bionano-Prep-Animal-Tissue-DNA-Isolation-Soft-Tissue-Protocol.pdf). Pacific Biosciences libraries were prepared with the Express Template Prep Kit 1.0 and Blue Pippin size selected. Pacific Biosciences CLR data was generated from 36 SMRTcells of M. unicolor and 6 SMRTcells of G. seraphini sequenced with the S/P2-C2/5.0 sequencing chemistry on the Pacific Biosciences Sequel machine. A further 5 SMRTcells of G. seraphini were sequenced with S/P3-C1/5.0-8M sequencing chemistry on a Pacific Biosciences Sequel II machine. The Hi-C libraries were created with a Dovetail Hi-C kit for G. seraphini and an Arima Genomics kit (version 1) for M. unicolor and sequenced on an Illumina HiSeq X. A 10X Genomics Chromium machine was used to create the linked-read libraries, which were sequenced on an Illumina HiSeq X. Optical maps were created for both species using a Bionano Saphyr instrument.
Genome assembly
Assembly for Geotrypetes seraphini and Microcaecilia unicolor was conducted mainly as for Rhinatrema bivittatum described in Rhie et al.6 using four data types and the Vertebrate Genomes Project (VGP) assembly pipeline (version 1.6 for G. seraphini and version 1.5 for M. unicolor; Supplementary Figure S2). In brief, the Pacific Biosciences CLR data for each species was input to the diploid-aware long-read assembler FALCON and its haplotype-resolving tool FALCON-UNZIP34. The resulting primary and alternate assemblies of M. unicolor were input to Purge Haplotigs35 and G. seraphini assemblies were input to Purge_dups36 for identification and removal of remaining haplotigs. Next, both species’ primary assemblies were subject to two rounds of scaffolding using 10X long molecule linked-reads and Scaff10X (https://github.com/wtsi-hpag/Scaff10X) and one round of Bionano Hybrid-scaffolding with pre-assembled Cmaps from 1-enzyme non-nicking (DLE-1) and the Solve Pipeline. The resulting scaffolds were then further scaffolded into chromosome-scale scaffolds using SALSA237 with the Hi-C data (Dovetail library for G. seraphini, Arima library for M. unicolor). The scaffolded primary assemblies plus the FALCON-phased haplotigs were then subjected to Arrow38 polishing with the PacBio reads and two rounds of short read polishing using the 10X Chromium linked reads, longranger align39, freebayes40 and consensus calling with bcftools41 (further details can be found in Rhie et al.6 and Suppl Fig 1). Assemblies were checked for contamination and were manually curated using the gEVAL system42, HiGlass43 and PretextView (https://github.com/wtsi-hpag/PretextView) as described previously7. Mitochondria were assembled using mitoVGP44. Assemblies and full annotations are available on NCBI under the accession numbers GCF_902459505.1 and GCF_901765095.1. Raw read statistics, accession numbers and software versions employed can be found in Supplementary Table S8 A, B and C.
Repeats prediction and annotation
All caecilians were submitted to homology-based and de novo approaches for repeat identification and annotation. A de novo library of repeats was created for each species using the RepeatModeler2 package45. This library was then combined with Repbase “Amphibia” library (release 26.04) forming the final library for each species. Each assembly was searched for repeats with RepeatMasker (http://www.repeatmasker.org/). Repeat landscape plots were created with perl scripts from the RepeatMasker package.
Genome annotation
The three caecilian genomes were annotated using the NCBI Eukaryotic Genome Annotation Pipeline, which produces homology-based and ab initio gene predictions to annotate genes (protein-coding and non-coding, such as lncRNAs and snRNAs), pseudogenes, transcripts, and proteins (for details see Annotation HandBook https://www.ncbi.nlm.nih.gov/genbank/eukaryotic_genome_submission_annotation/). In brief, repeats are first masked with RepeatMasker (http://www.repeatmasker.org/) and WindowMasker46. Next, transcripts, proteins and RNA-Seq from the NCBI database are aligned to the genomes using Splign47 and ProSplign (https://www.ncbi.nlm.nih.gov/sutils/static/prosplign/prosplign.html). Those alignments are submitted to Gnomon (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/gnomon/) for gene prediction. Models built on RefSeq transcript alignments are given preference over overlapping Gnomon models with the same splice pattern. Supplementary Table S9 presents a summary of caecilian annotations and details can be found on NCBI at the accessions GCF_902459505.1, GCF_901765095.1.
Data Assembly for Comparative study
The Coding DNA sequences (CDSs) for 21 vertebrate species (Supplementary Figure S3) were downloaded from Ensembl release 10048. In those cases where a more contemporary version of the genome was available on RefSeq (Release 200)49 we used the RefSeq genome and corresponding annotations (Supplementary Table S9). The longest canonical protein coding region for each gene was retained for further analysis.
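Retaining one coding region per gene can be sketched as follows. This is a minimal illustration assuming the CDSs have already been parsed into (gene, transcript, sequence) records; the function name, the record layout, and the tie-breaking rule (keep the first transcript seen) are assumptions and are not taken from the authors' pipeline.

```python
def longest_cds_per_gene(records):
    """Keep the longest coding sequence for each gene.

    records -- iterable of (gene_id, transcript_id, cds_sequence) tuples
    Returns a dict mapping gene_id to (transcript_id, cds_sequence).
    Ties are resolved in favour of the transcript encountered first.
    """
    best = {}
    for gene_id, transcript_id, sequence in records:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (transcript_id, sequence)
    return best


# Hypothetical toy records: two transcripts for geneA, one for geneB.
toy_records = [
    ("geneA", "tx1", "ATGAAATAG"),
    ("geneA", "tx2", "ATGAAACCCTAG"),
    ("geneB", "tx3", "ATGTAG"),
]
for gene, (transcript, cds) in longest_cds_per_gene(toy_records).items():
    print(gene, transcript, len(cds))
```

Reducing each gene to a single representative in this way keeps alternative transcripts of the same gene from being mistaken for paralogs during orthogroup inference.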
Orthogroup prediction and gene birth and death analysis
We identified 31,385 orthogroups for the 419,877 protein coding regions across 21 vertebrate species using OrthoFinder50. We extracted the corresponding uncontroversial species tree from timetree.org51. The phylogenetic distribution of the orthogroups revealed 1,150 that were gained in caecilians, and 525 that were absent in all three caecilians. We used a phylostratigraphic approach to explore caecilian specific losses in the context of the vertebrate phylogeny. Information about species-specific losses elsewhere in the tree was not carried forward for further analysis. We parsed the orthogroups that lack caecilians in the following ways: (1) to identify orthogroups that lack representation across all amphibia: we identified orthogroups that contained at least two fish species and two tetrapod (non-amphibian) species - totalling 265 orthogroups, (2) to identify orthogroups that are absent only in caecilians: we extracted those orthogroups with at least two fish species and two tetrapod species (including at least one frog species) - totalling 238 orthogroups, (3) to identify orthogroups that are present across amphibia and amniota but absent in caecilians: we extracted orthogroups containing two frog species and two amniota species - totalling 22 orthogroups. Orthogroups that did not satisfy these filters had patterns of loss that were spurious across vertebrata. Combining the set of orthogroups that contain caecilian representatives (13,541) plus those that passed our filters 1-3 above (525), produced our final set of 14,066 orthogroups for analysis in CAFE v5 with the lambda parameter estimated for each species12. Statistically significant contractions or expansions of gene families are detailed in the main text, and all expansions and contractions are provided in Supplementary Table S5.
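The three filters above can be expressed as a small classification routine. The sketch below is one possible reading of the text, not the authors' code: the group labels, the function name, the order in which the filters are tested, and the assumption that every non-caecilian amphibian in the dataset is a frog are all illustrative choices.

```python
def classify_absent_orthogroup(species, group_of):
    """Assign an orthogroup to one of the three loss filters.

    species  -- names of species with at least one gene in the orthogroup
    group_of -- dict mapping each species name to one of
                "fish", "frog", "amniote" or "caecilian" (hypothetical labels)
    """
    counts = {"fish": 0, "frog": 0, "amniote": 0, "caecilian": 0}
    for name in set(species):
        counts[group_of[name]] += 1
    if counts["caecilian"]:
        return "has_caecilians"
    fish, frogs, amniotes = counts["fish"], counts["frog"], counts["amniote"]
    # Filter 1: at least two fish and two non-amphibian tetrapods, no amphibians.
    if fish >= 2 and frogs == 0 and amniotes >= 2:
        return "filter_1_absent_in_amphibia"
    # Filter 2: at least two fish and two tetrapods, at least one of them a frog.
    if fish >= 2 and frogs >= 1 and frogs + amniotes >= 2:
        return "filter_2_absent_in_caecilians"
    # Filter 3: at least two frogs and two amniotes.
    if frogs >= 2 and amniotes >= 2:
        return "filter_3_absent_in_caecilians"
    return "unresolved"


# Hypothetical species labels used only for illustration.
toy_groups = {
    "fish_a": "fish", "fish_b": "fish",
    "frog_a": "frog", "frog_b": "frog",
    "amniote_a": "amniote", "amniote_b": "amniote",
    "caecilian_a": "caecilian",
}
print(classify_absent_orthogroup(["fish_a", "fish_b", "amniote_a", "amniote_b"], toy_groups))
```

Orthogroups returned as "unresolved" correspond to those described above as having spurious patterns of loss, which were not carried forward.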
Analysis of selective pressure variation
Our selective pressure variation analysis focussed on 3,236 single-copy and 9,690 multi-copy genes from our orthogroups. The maximum likelihood (ML) method we employed requires a minimum of 7 species52, thus we removed families that did not meet this criterion. The 9,690 multi-copy genes could be broken down into the following cohorts based on the CAFE predictions: there were 5,993 orthogroups with species-specific duplications, of which, after this filter, 3,464 were designated single-copy gene orthogroups (SGOs) and 2,529 multi-copy gene orthogroups (MGOs). There were 6,226 orthogroups (including the 2,529 MGOs) that were divided into their constituent single-copy paralogous groups using UPhO53. Note that species-specific gene duplications that were not specific to caecilians were removed. A total of 14,807 single-copy gene orthogroups were identified in this way. We used a range of different alignment methods (MAFFT54, MUSCLE55, and Prank56) on each gene family and used MetAl57 to choose the best fitting alignment method per gene. The corresponding gene trees were reconstructed using IQtree58. Robinson-Foulds distances between gene trees and the species tree were estimated using Clann59, and only those gene trees with zero distance were retained for further analysis, i.e. the gene and species tree were in full agreement, thus minimising the risk of hidden paralogy in our SGOs. We assessed the selective pressure variation using codon based models of evolution in codeml60 using Vespasian61 across all resulting 2,047 SGOs that satisfied all of the criteria described above. All alignments for the selective pressure analyses are at DOI:10.5281/zenodo.5780326.
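The zero-distance criterion can be illustrated with a simplified Robinson-Foulds calculation. The sketch below is not Clann: it assumes rooted trees written as nested Python tuples, compares clades rather than unrooted bipartitions, and prunes the species tree to the taxa present in the gene tree before comparison. The function names and the toy trees are hypothetical.

```python
def clades(tree, keep=None):
    """Return (all_leaves, non_trivial_clades) for a nested-tuple tree.

    If `keep` is given, leaves outside it are pruned before clades are collected.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):  # a leaf
            if keep is None or node in keep:
                return frozenset([node])
            return frozenset()
        child_sets = [s for s in map(walk, node) if s]
        leaves = frozenset().union(*child_sets)
        if len(child_sets) >= 2:  # ignore nodes left unary by pruning
            found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade carries no information
    return all_leaves, found


def rf_distance(gene_tree, species_tree):
    """Clade-based Robinson-Foulds distance after pruning the species tree."""
    gene_leaves, gene_clades = clades(gene_tree)
    _, species_clades = clades(species_tree, keep=gene_leaves)
    return len(gene_clades ^ species_clades)


# Hypothetical toy trees.
species_tree = ((("frog", "caecilian"), ("mouse", "chicken")), "zebrafish")
concordant = (("frog", "caecilian"), "zebrafish")
discordant = (("frog", "zebrafish"), "caecilian")
print(rf_distance(concordant, species_tree))
print(rf_distance(discordant, species_tree))
```

A gene tree would be retained under the criterion described above only when the distance is zero, as for the concordant toy tree here.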
GO Enrichment Analysis
The GO terms were predicted for all caecilian CDSs using eggNOG-mapper with default parameters (eggnog-mapper.embl.de)62, and GO term enrichment analysis was carried out using goatools63.
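Over-representation of a GO term in a gene set is commonly assessed with a one-sided hypergeometric (Fisher-type) test. The sketch below illustrates that calculation using only the standard library; it does not call goatools, and the function name and example counts are assumptions. The specific test settings and multiple-testing correction used in the analysis are not stated in the text, so none are implied here.

```python
from math import comb


def enrichment_p_value(study_hits, study_size, population_hits, population_size):
    """One-sided hypergeometric p-value for over-representation.

    Probability of observing at least `study_hits` annotated genes when
    `study_size` genes are drawn without replacement from a population of
    `population_size` genes, `population_hits` of which carry the GO term.
    """
    max_hits = min(study_size, population_hits)
    total = comb(population_size, study_size)
    tail = sum(
        comb(population_hits, k)
        * comb(population_size - population_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 8 of 10 study genes carry a term found in 20 of 1,000 genes.
print(enrichment_p_value(8, 10, 20, 1000))
```

In practice one such p-value is computed per GO term and then corrected for multiple testing before a threshold is applied.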
Comparative analysis of homologous enhancer elements
The ZRS enhancer sequence was identified using the method in Kvon et al.32. The ZRS enhancer in mouse is located within an intron between exons 5 and 6 of the LMBR1 gene sequence (Gene ID: 105804842). In brief, the approach involved extracting the LMBR1 sequence from the genomes of each species in our sample set (Supplementary Table S10) and identifying the homologous intron sequence containing the ZRS sequence across all species. Using BLASTn64 the ZRS region was readily identifiable across all 22 species. The level of sequence conservation was quantified between mouse ZRS and all other species (Figure 2, detailed alignment of the E1 element within ZRS Supplementary Figure S4). The ZRS sequence was also searched against the complete genomes of all three caecilians (to account for possible relocation of the enhancer) and we did not identify a ZRS-like sequence in an alternative location in the caecilian genomes. Using the same approach, we quantified the level of sequence conservation across our set of vertebrates for an additional enhancer, I12a (AF349438.2), located between the homeobox bigene cluster paralogs DLX1 and DLX2 (Supplementary Table S11). For Crocodylus porosus we used the region between METAP1D and DLX2 because the DLX1 gene was not annotated in this species.
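One simple way to quantify conservation against the mouse reference is percent identity over a pairwise alignment. The sketch below assumes the two sequences have already been aligned (for example, taken from a BLASTn hit) and are of equal length with "-" marking gaps; the function name and the convention of scoring over reference (mouse) positions are assumptions, since the exact measure used is not specified in the text.

```python
def percent_identity(aligned_reference, aligned_query, gap="-"):
    """Percentage of reference positions matched by the query in an alignment.

    Columns where the reference has a gap are skipped; a gap in the query
    opposite a reference base counts as a mismatch.
    """
    if len(aligned_reference) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    reference_positions = 0
    matches = 0
    for ref_base, query_base in zip(aligned_reference.upper(), aligned_query.upper()):
        if ref_base == gap:
            continue
        reference_positions += 1
        if ref_base == query_base:
            matches += 1
    if reference_positions == 0:
        return 0.0
    return 100.0 * matches / reference_positions


# Hypothetical toy alignment, not real ZRS sequence.
print(percent_identity("ACGT-A", "ACGA-A"))
```

Under this convention a species with no detectable homologous sequence, represented as a query consisting only of gaps, scores 0%.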
Author contributions
MW supplied all biological samples and contributed to the interpretation of results. MS performed DNA extractions and optical mapping. KO coordinated the creation of sequencing libraries and genomic sequencing. SAM generated the genome assemblies. YS, JT and JW performed the manual curation of the assemblies. MUS performed repeat analyses and BUSCO synteny analyses. MJO’C and VO performed and interpreted the selective pressure analyses and birth and death analyses. MJO’C and VO carried out the comparative analysis of ZRS and I12a enhancer elements and interpreted results. RD supervised the genomics aspects of the project and MJO’C the comparative analyses. All authors contributed to writing the manuscript.
Supplementary Figures and legends
Acknowledgements
MUS, YS, JW, JT, KO and MS are supported by Wellcome grant WT206194. SAM and RD are supported by Wellcome grant WT207492. MW thanks the Direction de l’Environment de l’Aménagement et du Logement and Le Comité Scientifique Régional du Patrimonie Naturel, French Guiana. MJO’C would like to thank the University of Nottingham for awarding funds to support this work. MJO’C and VO are grateful for access to the University of Nottingham’s Augusta HPC service. For the purpose of open access, and as this research was funded in part by the Wellcome Trust [Grant number WT206194 and WT207492], the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.