ABSTRACT
The fission yeast Schizosaccharomyces pombe is an important model organism, but its natural diversity and evolutionary history remain under-studied. In particular, the population genomics of the S. pombe mitochondrial genome (mitogenome) has not been thoroughly investigated. Here, we de novo assembled the complete circular-mapping mitogenomes of 192 S. pombe isolates, and found that these mitogenomes belong to 69 non-identical types ranging in size from 17618 bp to 26910 bp. Using the assembled mitogenomes, we identified 20 errors in the reference mitogenome and discovered two previously unknown mitochondrial introns. Analysing sequence diversity of these 69 types of mitogenomes revealed that, unexpectedly, they mainly fall into two highly distinct clades, with only three mitogenomes exhibiting signs of inter-clade recombination. This diversity pattern suggests that currently available S. pombe isolates descend from two long-separated ancestral lineages. This conclusion is corroborated by the diversity pattern of the recombination-repressed K-region located between donor mating-type loci mat2 and mat3 in the nuclear genome. We estimated that the two ancestral S. pombe lineages diverged about 40 million generations ago. These findings shed new light on the evolution of S. pombe, and the datasets generated in this study will facilitate future research on genome evolution.
INTRODUCTION
The fission yeast Schizosaccharomyces pombe is a unicellular fungal species belonging to the Taphrinomycotina subphylum of the Ascomycota phylum (Liu et al. 2009). The first descriptions of this species in the 1890s reported it as a microorganism associated with fermented alcoholic drinks, including its presence in East African millet beer and in the fermenting sugar-cane molasses used for making a distilled liquor (Batavia Arrack) in Indonesia (Lindner 1893; Vorderman 1893; Eijkman 1894; Barnett & Lichtenthaler 2001). Since then, S. pombe has been found in various human-associated environments throughout the world, but has never been isolated in truly wild settings (Brown et al. 2011; Jeffares 2018; Jeffares et al. 2015). In 1947, Urs Leupold, the founder of fission yeast genetics, selected an S. pombe isolate from French grape juice as the subject of his PhD research, and this strain (hereafter referred to as the Leupold strain) has essentially been the only strain used for modern S. pombe molecular biology studies (Osterwalder 1924; Leupold 1950; Hu et al. 2015). In 2002, the complete genome sequence of the Leupold strain was published, making S. pombe the sixth eukaryotic species with a sequenced genome (Wood et al. 2002). Today, S. pombe is recognized as one of the most prominent model organisms for understanding the molecular mechanisms of cellular processes (Hayles & Nurse 2018; Hoffman et al. 2015).
In recent years, the intraspecific genomic diversity of S. pombe has begun to be investigated (Brown et al. 2011; Rhind et al. 2011; Avelar et al. 2013; Clément-Ziza et al. 2014; Fawcett et al. 2014; Zanders et al. 2014; Hu et al. 2015; Jeffares et al. 2015, 2017). In particular, Jeffares et al. sequenced the genomes of 161 S. pombe isolates and comprehensively explored the genomic variations within this species (Jeffares et al. 2015, 2017). However, the breadth and depth of knowledge on the natural diversity and evolutionary history of S. pombe remain limited, especially compared to the other model yeast species, Saccharomyces cerevisiae (Duan et al. 2018; Peter et al. 2018).
The mitochondrion originates from a bacterial endosymbiont and, after extensive reductive evolution, still retains a small genome (Lang et al. 1997). Compared to the nuclear genome, the smaller size, higher copy number, and lower level of recombination of the mitochondrial genome (mitogenome) have long made it an attractive subject for intraspecific comparative studies, shedding light on the evolution of many species including humans (Ingman et al. 2000). Recently, the population mitogenomic approach has begun to be applied to fungal species, not only revealing how mitogenomes vary within a fungal species, but also helping to elucidate the population structure and evolutionary trajectory of the species (Freel et al. 2015; Jung et al. 2012; Leducq et al. 2017; Wolters et al. 2015).
The mitogenome of the Leupold strain of S. pombe was completely sequenced more than 10 years before the nuclear genome (Lang 1984; Lang et al. 1985, 1987; Trinkl et al. 1989). It contains 2 rRNA genes (rnl and rns), 8 protein-coding genes (atp6, atp8, atp9, cob, cox1, cox2, cox3, and rps3), a gene encoding the RNA component of mitochondrial RNaseP (rnpB), and 25 tRNA genes (Bullerwell et al. 2003; Schäfer 2003). No complete mitogenome sequences of any other S. pombe isolates have been reported thus far. Restriction fragment analysis and limited Sanger sequencing have indicated that presence-absence polymorphisms of mitochondrial introns are widespread among S. pombe isolates (Zimmer et al. 1984, 1987), but an accurate and thorough understanding of the intraspecific variations of S. pombe mitogenomes is still lacking.
In this study, we used both published and our own genome sequencing data to perform de novo assembly of the mitogenomes in 199 S. pombe isolates. We successfully assembled the complete mitogenome sequences of 192 isolates. Analysing these mitogenome sequences led to the discovery of reference mitogenome errors, new mitochondrial introns, and unexpected divergence patterns that provide new insights into the evolutionary history of S. pombe.
MATERIAL AND METHODS
Previously published genome sequencing data of 161 JB strains
The Illumina genome sequencing data of 161 S. pombe strains with names that begin with the initials JB were downloaded from European Nucleotide Archive (ENA) according to the ENA accession numbers given in Supplementary Table 7 of Jeffares et al. (Jeffares et al. 2015), and are listed in Supplementary Table S1. For 11 sequencing runs that belong to the ENA Study accession number PRJEB6284, we noticed discrepancies between the read numbers reported in Supplementary Table 7 of Jeffares et al. 2015 and the read numbers in the data downloaded from ENA. The authors of the prior study confirmed that strain name mix-ups had occurred during the submission of the sequencing data to the ENA by The Genome Analysis Centre (TGAC) (Daniel Jeffares, personal communication). A list of these 11 sequencing runs with their correct corresponding strain names is provided as Supplementary Table S2.
Genome sequencing of 38 strains from culture collections in China and the USA
To explore intraspecific diversity beyond the previously analysed strains, we acquired 38 S. pombe strains from four culture collections in China and one culture collection in the USA: 20 strains from CGMCC (China General Microbiological Culture Collection Center), 13 strains from CICC (China Center of Industrial Culture Collection), 1 strain from CICIM (Culture and Information Centre of Industrial Microorganisms of China Universities), 1 strain from CFCC (China Forest Culture Collection Center), and 3 strains from NRRL (United States Department of Agriculture Agricultural Research Service Culture Collection) (Supplementary Table S1). Only three of these 38 strains have isolation information: CGMCC 2.1043 was isolated from fermented grains for making Moutai, a Chinese liquor; NRRL Y-11791 was from reconstituted lime juice (location unknown); NRRL Y-48646 was from a wine-producing company (location unknown). Single-cell-derived clones of these strains were deposited into our laboratory strain collection (DY collection) and given strain names that begin with the initials DY (Supplementary Table S1). Cells grown on YES solid media were used for genomic DNA preparation using the MasterPure Yeast DNA Purification Kit (Epicentre). The kit manufacturer’s protocol was followed, with the exception of lysing the cells by glass bead beating in a FastPrep-24 homogenizer (MP Biomedicals) for 20 seconds at a speed setting of 6.4 m/s. The sequencing library for DY15505 was constructed using the NEBNext DNA Library Prep Master Mix (NEB). For the other 37 strains, tagmentation-based sequencing library preparation was performed using home-made Tn5 transposase (Picelli et al. 2014). Post-tagmentation gap filling and PCR amplification were performed using the KAPA HiFi HotStart PCR Kit (Kapa Biosystems) with the following cycling parameters: 3 min at 72°C, 30 sec at 95°C, and then 11 cycles of 10 sec at 95°C, 30 sec at 55°C, and 30 sec at 72°C.
AMPure XP beads (Beckman Coulter) were used to select PCR product in the size range of 400 bp to 700 bp. Paired-end sequencing was performed using Illumina HiSeq 2000 (2×96 read pairs), HiSeq 2500 (2×100 read pairs or 2×101 read pairs), or HiSeq X Ten sequencer (2×150 read pairs). Sequencing data for these 38 strains have been deposited at NCBI SRA under accession numbers SRR8698890–SRR8698927.
De novo assembly of the mitogenomes
The genome sequencing data for the above-mentioned 199 strains (161 JB strains and 38 DY strains) were cleaned by Trimmomatic version 0.32 with options LEADING:30, TRAILING:30, SLIDINGWINDOW:4:30, and MINLEN:80 (MINLEN:130 for 150-bp HiSeq X Ten reads) (Bolger et al. 2014). We found empirically that de novo assembly of the mitogenome requires data downsampling, and that 300,000 cleaned read pairs is a suitable downsampling target, except for the longer-read HiSeq X Ten data, which require a lower target read number. Downsampling was performed using the software seqtk (https://github.com/lh3/seqtk). Three independent sets of randomly downsampled data were obtained using the seed numbers 100, 500, and 800. De novo assembly was performed using A5-miseq version 20150522 (Coil et al. 2015). From the A5-miseq output we selected the mtDNA-containing contigs based on length and sequence. Custom Perl scripts were used to trim overlapping sequences at the ends of the mtDNA contigs and set the starting position of the circular-mapping mtDNA to that of the reference S. pombe mitogenome (accession number NC_001326.1). Full-length mitogenome assemblies were usually obtained from all three sets of downsampled data. Pilon version 1.21 was used to polish the assemblies (Walker et al. 2014). Polished assemblies were verified by mapping all cleaned reads of a strain to the corresponding assembly and manually examining the mapping results on a genome browser to ensure the absence of mismatches. In total, we obtained full-length mitogenome assemblies for 192 of the 199 strains (Supplementary Table S1). There are 69 types of mitogenome sequences among these 192 full-length assemblies. We designate them MT type 1 to 69 or, for brevity, MT1 to MT69 (Supplementary Table S1).
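The contig post-processing described above (trimming the terminal overlap of a circular contig, resetting its start to match the reference, and grouping identical assemblies into types) can be illustrated with a minimal Python sketch. This is not the custom Perl code used in this study. The function names, the anchor-based way of locating the start position, the minimum overlap length, and the order-of-appearance type numbering are all assumptions made for illustration, and strand orientation is assumed to have been harmonized beforehand.

```python
def trim_terminal_overlap(contig: str, min_overlap: int = 20) -> str:
    """Remove a suffix that repeats the contig's own prefix.

    Assemblers often report a circular molecule as a linear contig whose
    end overlaps its start; the longest such overlap is trimmed here.
    min_overlap is a hypothetical threshold, not a value from the study.
    """
    for k in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[:k] == contig[-k:]:
            return contig[:-k]
    return contig


def rotate_circular(seq: str, anchor: str) -> str:
    """Rotate a circular sequence so that it begins with `anchor`.

    `anchor` stands in for the first bases of the reference mitogenome
    (a hypothetical way of fixing the start position).
    """
    if not anchor:
        raise ValueError("anchor must not be empty")
    # Append the first bases so an anchor spanning the linearization
    # point can still be found.
    doubled = seq + seq[: len(anchor) - 1]
    start = doubled.find(anchor)
    if start == -1:
        raise ValueError("anchor not found in sequence")
    return seq[start:] + seq[:start]


def assign_types(assemblies: dict[str, str]) -> dict[str, int]:
    """Give identical sequences the same type number (order of first appearance)."""
    type_of_seq: dict[str, int] = {}
    result: dict[str, int] = {}
    for strain, seq in assemblies.items():
        if seq not in type_of_seq:
            type_of_seq[seq] = len(type_of_seq) + 1
        result[strain] = type_of_seq[seq]
    return result
```

In this sketch, each contig would first pass through trim_terminal_overlap and then rotate_circular, after which assign_types collapses identical strings into types. The MT numbering used in this study does not follow order of appearance, so the numbers produced here are only placeholders.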
Mitogenome annotation
Protein-coding genes in the assembled mitogenomes were annotated using MFannot based on genetic code 4 (the only difference between genetic code 4 and the standard code is UGA being a tryptophan codon, not a stop codon) (Lang et al. 2007; Valach et al. 2014). MFannot was also used for predicting intronic regions and intron types (group I or group II intron). tRNA and rRNA annotations were transferred from the reference mitogenome using the software RATT (Otto et al. 2011). The EMBL-format reference annotation file required by RATT was generated from the GenBank format file using the software Artemis (Carver et al. 2012), which was also used to convert the RATT output from EMBL format to GenBank format. Results of software-based annotation were verified by manual inspection. The annotations of mt-tRNAArg(UCU) and mt-tRNAGlu(UUC) were revised according to the recently published S. pombe mitochondrial transcriptome analysis (the starting and ending positions of the former were shifted upstream for one nucleotide and six nucleotides, respectively, and the ending position of the latter was shifted downstream for one nucleotide) (Shang et al. 2018). The 69 types of mitogenomes (MT1–MT69) together with their annotations have been deposited at GenBank under accession numbers MK618072– MK618140. The lengths of these mitogenomes, the total lengths of different types of sequence features in these mitogenomes, and the intron presence-absence patterns are listed in Supplementary Table S3.
Published sequences and annotations of the mitogenomes of the three other Schizosaccharomyces species were used for the analysis of genes in these three mitogenomes (accession numbers NC_004312.1 and AF275271.2 for S. octosporus, accession numbers NC_004332.1 and AF547983.1 for S. japonicus, and accession number MK457734 for S. cryophilus) (Bullerwell et al. 2003; Rhind et al. 2011).
De novo assembly of the recombination-repressed K-region in the nuclear genome
Using the same set of Illumina genome sequencing data of 199 S. pombe strains, we performed targeted de novo assembly of the K-region. For this purpose, we employed the assembler software TASR version 1.6.2 and ran it in the de novo assembly mode (-i 1 mode) (https://github.com/warrenlr/TASR) (Warren & Holt 2011). Because the 4.3-kb centromere-repeat-like cenH element within the K-region cannot be assembled using short sequencing reads, we chose the 4.6-kb reference genome sequence of the mat2–cenH interval (nucleotides 4557–9169 of GenBank accession number FP565355.1) and the 1.9-kb reference genome sequence of the cenH–mat3 interval (nucleotides 13496–15416 of GenBank accession number FP565355.1) as the input sequences provided to TASR for read recruitment. Based on read mapping, 12 of the 199 strains lack the K-region (Supplementary Table S1). These 12 strains include JB22 (Leupold’s 972 strain), an h−S mating type strain in which the K-region is known to be absent (Beach & Klar 1984). For 150 (80%) of the remaining 187 strains, we were able to fully assemble sequences corresponding to the two target sequences (Supplementary Table S1). The failure to fully assemble the sequences for the other 37 strains appeared to be mainly due to insufficient sequencing depth, as 92% (110/119) of the strains with >40× average nuclear genome sequencing depth (based on cleaned reads) have fully assembled sequences, whereas only 59% (40/68) of the strains with <40× sequencing depth have fully assembled sequences (Supplementary Table S1). We concatenated the fully assembled mat2–cenH interval, 100 Ns (representing the unassembled cenH sequence), and the fully assembled cenH–mat3 interval together as the K-region sequence. Among the K-region sequences of 150 strains, there are 29 types of K-region sequences. We designated them K-region type 1 to 29 or, for brevity, K1 to K29 (Supplementary Table S1).
In particular, we assigned K-region type 1 (K1) to the K-region sequence in JB50 (Leupold’s 968 h90 strain), a strain that should have the same K-region sequence as that in the reference genome. However, K1 differs from the K-region sequence in the reference genome in 34 positions, including 26 nucleotide substitutions and 8 one-base indels. For all but one of these 34 positions, the reference genome alleles do not exist in any K-region types, whereas the alleles of K1 are shared by other K-region types. Thus, these differences are most likely due to reference sequence errors. The 29 types of K-region sequences (K1–K29) have been deposited at GenBank under accession numbers MK618141–MK618169.
Phylogenetic tree construction
We used two methods to construct phylogenetic trees based on gene sequences present in all 69 MT types. In the first method, we used the non-intronic nucleotide sequences of 9 genes (rnl, rns, cox1, cox3, cob, atp6, atp8, atp9, and cox2) present in the mitogenomes of all four fission yeast species to construct a neighbour-joining tree based on the p-distance model in MEGA 7.0.18 (Kumar et al. 2016). Bootstrap analysis with 1000 replicates was performed. In the second method, we employed MEGA to construct a maximum likelihood tree using the non-intronic nucleotide sequences of the above 9 genes plus rps3 and rnpB. The model recommended by MEGA, TN93+G+I, was used. Bootstrap analysis with 1000 replicates was performed. A maximum likelihood tree of the K-region was constructed using MEGA. The model recommended by MEGA, T92, was used. Bootstrap analysis with 1000 replicates was performed. For the phylogenetic trees of introns, a maximum likelihood tree of each intron was constructed with 100 bootstrap replicates using the model suggested by MEGA. For the phylogenetic trees of IEPs, we obtained protein sequences closely related to Schizosaccharomyces IEPs by BLASTP search of the NCBI nr database. Maximum likelihood trees were constructed with 100 bootstrap replicates using the model suggested by MEGA.
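For readers unfamiliar with the p-distance used for the neighbour-joining tree, it is simply the proportion of compared alignment sites at which two sequences differ. The following minimal Python sketch illustrates the calculation; it is not MEGA's implementation. Skipping sites with gaps or ambiguous bases on a per-pair basis is an assumption made here, as the gap treatment is not specified in the text above.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned nucleotide sequences.

    Sites where either sequence has a gap or a non-ACGT character are
    skipped (pairwise deletion; an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared
```

A matrix of such pairwise values over all sequences is the input that a neighbour-joining algorithm would then cluster.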
ADMIXTURE analysis and heatmap analysis
For the 69 MT types, nucleotide substitution variants were identified from the sequence alignment of intron-removed sequences. Bi-allelic single-nucleotide variants (SNVs) were merged into bi-allelic multi-nucleotide variants (MNVs) if two neighbouring bi-allelic SNVs are less than 15 bp apart and share the same allelic partition of the 69 MT types. We used custom Perl scripts and PLINK 1.07 to generate a binary PLINK BED format file, which was used as input for ADMIXTURE version 1.3.0 (Alexander et al. 2009). K values were varied from 2 to 8. For each K value, 10 replicate ADMIXTURE runs were performed using seeds from 1 to 10. Post-processing and visualization of ADMIXTURE results were carried out using the CLUMPAK web server (Kopelman et al. 2015). The major modes identified by CLUMPAK are presented. Results from K > 3 appear no longer informative and are not shown. SNVs and MNVs in non-intronic sequences were visualized in a heatmap by employing the R package ComplexHeatmap. For the 29 types of K-region sequences, bi-allelic SNVs and MNVs were identified in the same way, and were used for heatmap analysis.
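The SNV-to-MNV merging rule described above can be sketched in Python as follows. This is an illustration and not the custom Perl code used in this study. The input representation (a position plus one allele character per MT type) and the function names are hypothetical, and "neighbouring" is interpreted here as consecutive in position order, with runs of consecutive qualifying SNVs chained into a single MNV.

```python
def allelic_partition(alleles: str) -> tuple[bool, ...]:
    """Describe how a bi-allelic site splits the sequences into two groups.

    Each sequence is labelled by whether it carries the same allele as the
    first sequence, so the result does not depend on which bases are involved.
    """
    if len(set(alleles)) != 2:
        raise ValueError("site is not bi-allelic")
    first = alleles[0]
    return tuple(a == first for a in alleles)


def merge_snvs(snvs: list[tuple[int, str]], max_gap: int = 15) -> list[list[int]]:
    """Merge neighbouring SNVs into MNVs.

    snvs: (position, alleles) pairs, with one allele character per sequence.
    Two consecutive SNVs are merged when they are less than max_gap bp apart
    and share the same allelic partition. Returns the positions per variant.
    """
    merged: list[list[int]] = []
    last_partition = None
    for position, alleles in sorted(snvs):
        part = allelic_partition(alleles)
        if merged and position - merged[-1][-1] < max_gap and part == last_partition:
            merged[-1].append(position)
        else:
            merged.append([position])
        last_partition = part
    return merged
```

Under this interpretation, two SNVs with the same partition are not merged if an SNV with a different partition lies between them; whether the original scripts behave the same way in that case is not stated in the text.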
Divergence time estimation
BEAST version 2.4.7 and its associated programs were used for divergence time estimation (Bouckaert et al. 2014). For the MT types, the third codon positions in the concatenated alignment of the 8 protein-coding genes were used for the analysis because selection is weak at the third codon positions. For the K-region sequences, all positions were used because the K-region lacks protein-coding genes and is probably under little selective constraint. The site model for each dataset was selected using bModelTest version 1.0.4. A strict molecular clock was assumed. We compared two tree priors, the Yule model and the birth-death model, and chose the latter for our datasets according to evaluations performed using Tracer version 1.6. For each dataset, five independent runs were performed for 10 million generations each. We initiated runs on random starting trees, and sampled the trees every 1000th generation. Effective sample sizes were above 100 for all parameters. Results of the five runs were combined, with 10% removed as burn-in, using LogCombiner. Maximum clade credibility trees were summarized using TreeAnnotator, with the posterior probability limit set to 0.5. Trees were visualized using FigTree.
RESULTS
De novo assembly of the complete mitogenomes of 192 S. pombe strains
A previous study generated genome sequencing data of 161 S. pombe isolates (Jeffares et al. 2015). In that study, based on SNVs in the nuclear genome, 129 of these 161 isolates were classified into 25 “clonal clusters” each composed of multiple strains with nearly identical nuclear genomes. The remaining 32 isolates each possess a uniquely distinct nuclear genome. The authors of that study chose a set of 57 isolates, called “non-clonal strains”, to represent the 57 types of distinct nuclear genomes that differ from each other by no less than 1,900 SNVs (Jeffares et al. 2015). We used the publicly available genome sequencing data of these 161 isolates (Supplementary Tables S1 and S2), which we call “JB strains” here, to perform de novo assembly of their mitogenomes. We were able to assemble the complete circular-mapping mitogenomes of 154 (96%) JB strains (Supplementary Table S1). These 154 mitogenomes encompass 59 non-identical types, which we term MT types (Supplementary Table S1).
The 59 MT types present among the JB strains by and large correlate with the 57 previously defined nuclear genome types present among these strains (Supplementary Table S1). 55 of the 57 non-clonal strains have fully assembled mitogenomes. These 55 mitogenomes fall into 53 MT types, with three non-clonal strains, JB1205, JB1206, and JB1207, sharing the same MT type. The two non-clonal strains without fully assembled mitogenomes, JB842 and JB874, belong to clonal clusters 18 and 24, respectively. Other strains belonging to these two clusters do have fully assembled mitogenomes, which indicates that these two clusters correspond to two additional MT types (MT47 and MT21). The remaining 4 MT types (MT25, MT39, MT52, and MT65) are each highly similar to the MT type of a non-clonal strain and thus represent intra-cluster variations, with differences being a single SNV (MT25 vs. MT26 for cluster 10 and MT52 vs. MT53 for cluster 2), a single one-nucleotide indel (MT65 vs. MT64 for cluster 23), or the presence-absence polymorphisms of mitochondrial introns (MT39 vs. MT40 for cluster 15). The fact that we have obtained full-length mitogenomes from JB strains representing all 57 nuclear genome types indicates that the mitogenome diversity among the JB strains has been comprehensively captured.
To explore intraspecific diversity beyond that of the JB strains, we obtained from Chinese and US culture collections 38 additional S. pombe isolates, which we call “DY strains”, performed genome sequencing on them, and successfully assembled the full-length mitogenomes for all of them (Supplementary Table S1). These 38 mitogenomes fall into 19 non-identical types, including 9 MT types present among the JB strains, and 10 MT types not present among the JB strains. Thus, overall, we identified 69 MT types from the fully assembled mitogenomes of 192 S. pombe isolates. We performed gene annotation on these MT types. The annotated sequences of the 69 MT types have been deposited at GenBank (accession numbers MK618072–MK618140).
Identification of 20 errors in the reference S. pombe mitogenome
The reference S. pombe mitogenome (accession numbers NC_001326.1 and X54421.1) is that of a Leupold strain with the genotype h− ade7-50 (Lang 1984; Zimmer et al. 1984). The reference S. pombe nuclear genome is that of Leupold’s 972 h−S strain (Wood et al. 2002), which is called JB22 in the JB strain set. JB22 is the non-clonal strain representing clonal cluster 1 of the JB strains (Jeffares et al. 2015, 2017). Our de novo assembly of the mitogenomes showed that the mitogenome of JB22 is identical to those of other clonal cluster 1 strains. We designate this type of mitogenome MT1 (accession number MK618072).
Despite both being from the Leupold strain background, MT1 differs from the reference S. pombe mitogenome in 20 positions, including 13 single-nucleotide substitutions, 1 double-nucleotide substitution, 5 single-nucleotide indels, and 1 three-nucleotide insertion (Table 1). 13 of these 20 differences are located in protein-coding genes, and 11 of them alter amino acid sequences (Table 1). A previous study has uncovered 16 of these 20 differences by mapping Illumina sequencing reads to the reference mitogenome, but did not ascertain whether the differences are due to reference errors or polymorphisms (Iben et al. 2011). Our mitogenome assemblies showed that, in all these 20 positions, the sequence of MT1 is identical to those of the other 68 MT types. Thus, these 20 differences are caused by reference errors, not by naturally existing polymorphisms.
Discovery of two new mitochondrial introns
Using restriction fragment analysis, a previous study of 26 S. pombe isolates estimated that the mitogenomes vary in length from 17.6 to 24.6 kb (Zimmer et al. 1987). We found here that among the 69 MT types, the mitogenome length varies between 17618 bp and 26910 bp, and this length variation is almost entirely due to intron presence-absence polymorphisms (Figure 1A and Supplementary Table S3; intron presence-absence polymorphisms are described in more detail in a later section).
There are seven previously known mitochondrial introns in S. pombe, including cox1-I1a, cox1-I1b, cox1-I2a, cox1-I2b, cox1-I3, cob-I1, and cox2-I1 (Schäfer 2003) (Figure 1B). In our de novo assembled mitogenomes, we identified two new introns, which we named cox1-I1b′ and cox1-I4, respectively (Figure 1B). cox1-I1b′ is located at the exact same position as cox1-I1b, and its presence is mutually exclusive with the presence of cox1-I1b. Both cox1-I1b′ and cox1-I1b are group I introns, and both encode proteins containing two LAGLIDADG endonuclease domains. The first 249 nucleotides of these two introns are identical, but the remaining portions are rather divergent, with the LAGLIDADG-domain-coding sequences exhibiting only 63% identity (Figure 1C). Despite this divergence, among the proteins in the NCBI nr database, the closest homolog of the IEP of cox1-I1b′ is the IEP of cox1-I1b, suggesting a recently shared ancestry of these two proteins (Supplementary Figure S1).
The IEP of cox1-I1b has been shown to possess both homing endonuclease and intron maturase activities (Pellenz et al. 2002; Schäfer 2003; Schäfer et al. 1994). For LAGLIDADG proteins, the nuclease activity requires that the 8th residue in the namesake LAGLIDADG motif must be an acidic residue to allow coordination with metal ions essential for catalysis (Chevalier et al. 2004). The 8th residues in the two LAGLIDADG motifs in the IEP of cox1-I1b are acidic residues (Figure 1D). In contrast, the 8th residues in the two LAGLIDADG motifs in the IEP of cox1-I1b′ are non-acidic residues (Figure 1D), suggesting that this protein probably has lost the homing endonuclease activity and acts solely as a maturase. This kind of degeneration of endonuclease function is remarkably common among the Schizosaccharomyces group I intron IEPs, as half of them (7/14) have non-acidic residues at the 8th position of at least one LAGLIDADG motif (Supplementary Figure S2).
In S. pombe, all previously analysed mitochondrial intron IEPs are thought to be translated as fusions with upstream exons, as the coding sequences of IEPs are always in-frame with 5′ exons (Schäfer 2003). We observed an exception to this rule in some cox1-I1b′ sequences. In three MT types (MT52, MT53, and MT66), the LAGLIDADG-domain-coding sequences in cox1-I1b′ are out-of-frame with 5′ exons due to a one-nucleotide insertion about 70 bp upstream of the LAGLIDADG domains (Supplementary Figure S3). This observation raises the possibility that S. pombe mitochondrial intron IEPs may not always be translated as in-frame extensions of the preceding exons.
The other intron newly identified in this study, cox1-I4, is located at a position downstream of all previously known cox1 introns in S. pombe (Figure 1B). It is a group II intron. Our phylogenetic analysis showed that the IEP of cox1-I4 does not share a close relationship with any of the other Schizosaccharomyces group II intron IEPs (Supplementary Figure S4). Instead, it is most closely related to the IEPs encoded by cox1-ai1 and cox1-ai2 introns in Saccharomyces cerevisiae and cox1-ai1 introns in other species of the family Saccharomycetaceae (Figure 1E and Supplementary Figure S4). Thus, S. pombe cox1-I4 may have arisen through horizontal transfer from a Saccharomycetaceae species.
Phylogenetic relationship of 69 MT types based on non-intronic sequences
Using the non-intronic sequences of nine genes (cox1, cox3, cob, atp6, atp8, atp9, cox2, rnl, and rns), which are conserved among the four Schizosaccharomyces species, we constructed a neighbour-joining tree of the 69 MT types (Figure 2, left). Remarkably, in this tree, the 69 MT types mostly fall into two highly distinct clades, with the single exception being MT15, which is at an intermediate position between the two clades. The same tree topology was obtained when we constructed a maximum likelihood tree using the non-intronic sequences of the above nine genes plus rps3 and rnpB (Supplementary Figure S5). The smaller of the two clades contains MT1, the mitogenome in the Leupold strain background, from which the S. pombe reference genome was derived. Thus, we term this clade, which contains 14 MT types, the REF clade, and term the other clade, which contains 54 MT types, the NONREF clade.
The REF clade has a substantially lower within-clade diversity than the NONREF clade. Nonetheless, MT types in the REF clade can be clearly divided into two subclades, which we term REF-A and REF-B. Within the NONREF clade, the relatedness among the MT types is highly uneven, with 40 MT types (MT16 to MT55) falling into a closely related cluster, which we term the NONREF-S subclade (S stands for similar). The large number of closely related MT types in the NONREF-S subclade may be partly due to non-random sampling of S. pombe isolates (see Discussion). The other 14 MT types (MT56 to MT69) in the NONREF clade are much more diverse, and we group them into a subclade, termed NONREF-D (D stands for diverse).
For the REF clade and the NONREF-D subclade, affiliated strains tend to share geographic origins (Figure 2, middle). MT types in the REF clade are mainly associated with strains collected from Europe, whereas MT types in the NONREF-D subclade are mainly associated with strains collected from Asia-Pacific. Among the 30 REF clade strains with known collection locations, 22 were collected from Mediterranean European countries including France, Spain, Italy, and Malta (Supplementary Table S1), suggesting a possible Southern European origin of this MT clade. Among the 8 NONREF-D strains with known collection locations, 6 were collected from Asian countries (Supplementary Table S1), suggesting that this subclade of high diversity is mainly distributed in Asia. In contrast, the NONREF-S subclade, despite its low internal diversity, has the broadest geographic distribution, with associated strains coming from all continents where S. pombe has been isolated. One possible explanation is that the NONREF-S strains have been distributed around the world by human migration (see Discussion).
To verify and complement the results obtained using phylogenetic tree construction, we identified from the alignment of non-intronic sequences a total of 223 bi-allelic SNVs and MNVs (Supplementary Table S4), and performed maximum-likelihood clustering analysis using the ADMIXTURE program (Figure 2, right) (Alexander et al. 2009). In addition, we directly visualized these bi-allelic SNVs and MNVs in a heatmap (Figure 3). These analyses lent support to the clade and subclade division. For the ADMIXTURE analysis, when K = 2, the REF clade and the NONREF clade are clearly distinguished; when K = 3, the NONREF clade is further separated into two clusters, corresponding to the NONREF-S subclade and the NONREF-D subclade. MT15, the MT type situated between the REF clade and the NONREF clade in the phylogenetic trees, exhibits an inter-clade mosaic pattern for both K values, suggesting that it may be a recombination product between REF and NONREF mitogenomes. Interestingly, two other MT types, MT22 and MT23, also consistently exhibit inter-clade mosaic patterns in the ADMIXTURE results, albeit with a lesser degree of mosaicism than MT15.
Inspecting the heatmap confirmed that MT15, MT22, and MT23 are the products of inter-clade recombination (Figure 3). MT15 appears to be composed of four long stretches of sequences, with two stretches resembling the REF-A subclade and the other two stretches resembling the NONREF-S subclade. The bulk of the sequences in MT22 and MT23 match those in other NONREF-S mitogenomes, with two small stretches of sequences in the rnl gene of MT22 and one small stretch of sequence spanning the rnpB gene in MT23 exhibiting REF clade patterns. Together, the above analyses of the non-intronic sequences demonstrate that present-day S. pombe mitogenomes descend from two well-separated ancient lineages, with only rare mitogenome recombination having occurred between lineages. This is unexpected, because the previously published study of the JB strains did not uncover such a diversity pattern (Jeffares et al. 2015), perhaps owing to the masking of phylogenetic signal by extensive recombination of the nuclear genomes (see a later Results section and Discussion).
Presence-absence polymorphisms and phylogeny of mitochondrial introns
There are 18 types of intron presence-absence patterns in the 69 MT types (Figure 4A and Supplementary Table S3). For each of the 8 intron insertion sites, introns are only present in some but not all MT types, indicating that intron gain and/or loss have happened at all sites. The 4 group I intron sites are occupied in appreciably higher proportions (93%, 70%, 87%, and 67% for cox1-I1b/I1b′, cox1-I2a, cox1-I2b, and cox1-I3, respectively) than the 4 group II intron sites (41%, 14%, 28%, 54% for cox1-I1a, cox1-I4, cob-I1, and cox2-I1, respectively).
The REF clade and the NONREF clade show distinct intron presence-absence patterns, with 6 of 8 intron sites exhibiting statistically significant differences between the clades (Supplementary Figure S6). cox1-I1a, cox1-I2a, cox1-I3, and cox2-I1 are completely or almost completely absent in the REF clade, but are common or ubiquitous in the NONREF clade. In contrast, cob-I1 is present in 93% of MT types in the REF clade but is present in only 9% of the MT types in the NONREF clade. For the cox1-I1b/I1b′ site, cox1-I1b is present in all MT types in the REF clade, whereas cox1-I1b′ is present in 83% of the MT types in the NONREF clade. These opposing patterns suggest that the two ancient S. pombe mitogenome lineages evolved different intron contents after their divergence.
Intron presence-absence patterns also exhibit a correlation with the subclade division within the REF clade. The REF-A subclade and the REF-B subclade are perfectly distinguished by the presence-absence patterns of cox1-I2b and cox1-I4, with the REF-A MT types all having cox1-I2b but not cox1-I4, and the REF-B MT types all having cox1-I4 but not cox1-I2b.
Within the low-nucleotide-diversity NONREF-S subclade, two group II introns, cox1-I1a and cox2-I1, are respectively present in 45% and 67.5% of the 40 MT types, and their presence-absence patterns do not obviously correlate with the nucleotide-based phylogeny, indicating that these two introns may have undergone extensive gain and/or loss events during the evolution of NONREF-S mitogenomes. Remarkably, all 18 NONREF-S MT types containing cox1-I1a also contain cox2-I1 (P = 0.00008, Fisher’s exact test), suggesting that, for reason(s) unclear to us, the presence of cox1-I1a in this subclade may be dependent on the presence of cox2-I1.
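The co-occurrence test above can be reproduced from the counts given in the text. The following is a minimal sketch, assuming a two-sided Fisher's exact test on a 2 × 2 table whose cells are derived from the reported figures: 18 of the 40 NONREF-S MT types carry cox1-I1a (45%), 27 carry cox2-I1 (67.5%), and all 18 cox1-I1a carriers also carry cox2-I1. The function name and table layout are illustrative, not the software actually used in the study.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Rows: cox1-I1a present / absent; columns: cox2-I1 present / absent
p_value = fisher_exact_two_sided(18, 0, 9, 13)
print(f"P = {p_value:.5f}")
```

Under these assumptions the result rounds to the reported P = 0.00008.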
We constructed maximum likelihood trees for each of the 9 introns (Figure 4B). By and large, the phylogeny based on the sequences of a given intron mirrors the phylogeny of the MT types harbouring that intron, and shows clear distinction between the REF clade and the NONREF clade for the four introns with appreciable presence in both clades (cox1-I1b, cox1-I2b, cox1-I4, and cob-I1). Thus, during S. pombe evolution, mitochondrial introns have rarely crossed the boundary between the two clades, consistent with the low extent of inter-clade recombination described earlier. There are a few notable exceptions. MT7 is the only REF clade MT type containing cox1-I1a and cox2-I1, and these two introns in MT7 are respectively identical to those in the majority of the NONREF-S MT types, suggesting that they may originate from cross-clade transfer. MT16 is one of the few NONREF MT types harbouring cox1-I1b and cox1-I4, and these two introns in MT16 are respectively identical to those in the REF-B MT types but different from those in the other NONREF MT types, suggesting that they may also result from cross-clade transfer.
De novo assembly and phylogenetic analysis of the K-region
As the above results indicate, contemporary S. pombe mitogenomes descend from two distinct ancient lineages. We wondered whether S. pombe nuclear genomes share a similar evolutionary history. Because recombination is known to cause inaccuracy in phylogeny inference (Posada & Crandall 2002), we chose to analyse the K-region in the nuclear genome (Grewal & Klar 1997), which is situated between two donor mating-type loci mat2 and mat3, and is a known “cold spot” for both meiotic recombination and mitotic recombination (Egel 1984; Thon & Klar 1993). Using the genome sequencing data of the 199 S. pombe strains described above, we performed read mapping analysis and found that among these strains, 12 lack the K-region (Supplementary Table S1). For 150 of the 187 K-region-containing strains, we obtained by de novo assembly the complete sequences of the two unique sections of the K-region, and found that these K-region sequences belong to 29 non-identical types, which we term K-region types (Figure 5A and Supplementary Table S1).
A maximum likelihood tree of the 29 K-region types shows that, like the mitogenomes, K-region sequences fall into two highly distinct clades (Figure 5A, bottom left). Moreover, similar to the situation with the mitogenomes, the smaller clade, consisting of 6 K-region types (K1–K6), has a low internal diversity, whereas the larger clade, consisting of 23 K-region types (K7–K29), has a substantially higher internal diversity. For clarity, we refer to the two K-region clades as “groups”. The K-region type found in the reference nuclear genome, which we denote K1, belongs to the low diversity group. A heatmap analysis visualizing all bi-allelic SNVs and MNVs in the K-region sequences confirmed the deep divergence of the two groups and showed a lack of inter-group recombination (Figure 5A, bottom middle, and Supplementary Table S5). These results indicate that present-day S. pombe nuclear genomes also descend from two long-separated lineages, which probably correspond to the two ancient lineages of the mitogenomes. Based on this idea, the low-diversity K-region group should correspond to the REF clade of the MT types, and the high-diversity K-region group should correspond to the NONREF clade of the MT types.
The 150 strains with assembled K-region sequences are associated with 60 MT types (Figure 5A, bottom right, and Supplementary Table S1). Any given MT type usually corresponds to only one of the 29 K-region types, except for MT53, whose associated strains have two types of K-regions (K7 and K8, differing by one single-nucleotide indel). The correlation between mitogenome clade affiliation and K-region group affiliation is only barely statistically significant (P = 0.042, Fisher’s exact test), with 58% (7/12) of REF clade MT types corresponding to K-region types in the low-diversity group, and 74% (35/47) of the NONREF clade MT types corresponding to K-region types in the high-diversity group. This is not completely surprising, because it has been suggested that S. pombe has a weak global population structure due to extensive interbreeding (Jeffares et al. 2015). It is likely that interbreeding between populations has resulted in “MT-K inter-clade mixed” strains, in which the mitogenome and the K-region from different ancient lineages are brought together by hybridization.
We separately examined the extent of MT-K inter-clade mixing for each subclade of the MT types (Figure 5B). For the REF-A, REF-B, and NONREF-S subclades, 30% (3/10), 100% (2/2), and 35% (12/34) of the MT types are respectively MT-K inter-clade mixed. However, for the NONREF-D subclade, none of the 13 MT types are MT-K inter-clade mixed. Thus, strains harbouring the NONREF-D mitogenomes appear to have historically undergone less cross-lineage interbreeding.
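As a consistency check, the per-subclade counts above can be pooled to recover the clade-level figures reported elsewhere in the text (7/12 and 35/47 concordant MT types, and 28.8% mixed MT types overall when MT15 is excluded). This is a minimal sketch; the dictionary layout and helper name are illustrative only.

```python
# (mixed, total) MT types per subclade, as reported in the text
mixing = {
    "REF-A": (3, 10),
    "REF-B": (2, 2),
    "NONREF-S": (12, 34),
    "NONREF-D": (0, 13),
}

def pooled(subclades):
    """Sum mixed and total MT type counts over the given subclades."""
    mixed = sum(mixing[s][0] for s in subclades)
    total = sum(mixing[s][1] for s in subclades)
    return mixed, total

ref_mixed, ref_total = pooled(["REF-A", "REF-B"])
non_mixed, non_total = pooled(["NONREF-S", "NONREF-D"])
all_mixed, all_total = pooled(mixing)

# Concordant = MT type and K-region type from the same ancient lineage
print(f"REF concordant: {ref_total - ref_mixed}/{ref_total}")
print(f"NONREF concordant: {non_total - non_mixed}/{non_total}")
print(f"Overall mixed: {all_mixed}/{all_total} = {100 * all_mixed / all_total:.1f}%")
```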
Estimation of the divergence time of the two ancient lineages of S. pombe
In a published mutation accumulation (MA) study, 96 replicate lines of a Leupold background S. pombe strain were allowed to accumulate spontaneous mutations for 1716 generations, and upon whole-genome sequencing analysis, no mitochondrial mutations were found in these MA lines (Farlow et al. 2015). If we assume that a total of one mitochondrial base-substitution mutation in these 96 MA lines represents an upper limit of the mutation level, the mitochondrial mutation rate should be lower than 3.12 × 10⁻¹⁰ substitutions per site per generation. The same study reported a nuclear mutation rate of 2.00 × 10⁻¹⁰ substitutions per site per generation. Thus, the mitochondrial mutation rate in S. pombe is likely to be comparable to or lower than the nuclear mutation rate. This is consistent with observations showing that, for fungal species, the mitogenome usually evolves at a rate slower than or comparable to that of the nuclear genome in the same species (Sandor et al. 2018; Sharp et al. 2018). Using 3.12 × 10⁻¹⁰ substitutions per site per generation as the mutation rate prior, we employed the Bayesian evolutionary analysis software BEAST to perform divergence dating on the S. pombe MT types, excluding the three recombinant MT types (MT15, MT22, and MT23) (Figure 6A). A divergence time of 39.6 million generations was calculated for the two ancient lineages. This would be an underestimate if the mutation rate we used is indeed an overestimate.
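The upper bound on the mitochondrial mutation rate follows from dividing one hypothetical mutation by the total number of site-generations surveyed. The sketch below assumes that the number of sites equals the length of the reference mitogenome, taken here as 19,431 bp; this length is not stated in this section and is our assumption for illustration.

```python
ma_lines = 96          # replicate MA lines (Farlow et al. 2015)
generations = 1716     # generations per line
mitogenome_bp = 19431  # ASSUMPTION: reference mitogenome length in bp
assumed_mutations = 1  # upper limit assumed in the text (none were observed)

# Total site-generations surveyed across all MA lines
site_generations = ma_lines * generations * mitogenome_bp

upper_bound = assumed_mutations / site_generations
print(f"{upper_bound:.2e} substitutions per site per generation")
```

With this assumed length, the bound agrees with the 3.12 × 10⁻¹⁰ value quoted above to three significant figures.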
To obtain an independent estimate of the divergence time, we also used BEAST to analyse the 29 K-region types (Figure 6B). Based on the nuclear mutation rate of 2.00 × 10⁻¹⁰ substitutions per site per generation (Behringer & Hall 2015; Farlow et al. 2015), we obtained a divergence time of 39.7 million generations for the two ancient lineages. Taken together, our molecular dating results suggest that the two ancient lineages of S. pombe diverged about 40 million generations ago.
DISCUSSION
In this study, we de novo assembled and annotated full-length mitogenomes that encompass the mitogenome diversity existing in 161 JB strains, the largest previously analysed set of S. pombe isolates, and 38 additional isolates (DY strains). This comprehensive dataset allowed us to thoroughly examine the intraspecific mitogenome diversity existing among currently available S. pombe isolates and obtain new insights into the evolutionary history of this species.
Our analyses of the diversity patterns of mitogenome sequences and K-region sequences revealed that S. pombe isolates descend from two long-separated ancient lineages. The phenomenon of MT-K inter-clade mixing suggests that these two lineages have undergone admixture in recent historical time. The extent of MT-K inter-clade mixing we observed (28.8% of the MT types with fully assembled K-regions, excluding MT15) is probably an underestimate of the extent of inter-lineage admixture among S. pombe isolates, because the K-region only represents a minute fraction of the nuclear genome. Indeed, a preprint recently posted on bioRxiv also identified the two ancestral lineages of S. pombe, and through analysing the admixture proportions of the nuclear genomes, concluded that only a small minority of the JB strains have a nuclear genome mainly originating from a single ancestral lineage (Tusso et al. 2019).
Unlike in animals, where mitogenomes are usually inherited uniparentally, in fungi, biparental transmission of mitogenomes is common and thus allows recombination between parental mitogenomes (Xu & Li 2015). For budding yeast species belonging to the family Saccharomycetaceae, naturally occurring mitogenome recombination appears to be common (Leducq et al. 2017; Peris et al. 2017; Wu & Hao 2014; Wu et al. 2015). In contrast, we show here that, despite a high level of inter-lineage admixture existing among the S. pombe isolates, inter-lineage recombination of S. pombe mitogenomes has rarely happened. A likely explanation is that, unlike the budding yeasts, S. pombe is a haplontic species, growing vegetatively as haploids, and only forming diploids transiently during sexual reproduction. The formation of a zygotic S. pombe diploid cell is immediately followed by meiosis and sporulation, and as a result, mitogenomes from two parental haploid cells may rarely have a chance to mix and recombine before being partitioned into four separate haploid progeny spores.
Even though S. pombe isolates have mostly been collected by chance rather than through dedicated search for this species, there have been a few cases of isolating multiple S. pombe strains from one relatively small geographic region. In particular, Carlos Augusto Rosa and his colleagues have isolated S. pombe from cachaça distilleries in the southeastern Brazilian state of Minas Gerais (Gomes et al. 2002; Pataro et al. 2000), and from the frozen fruit pulps acquired in markets in the eastern Brazilian state of Sergipe (Trindade et al. 2002). These Brazilian strains correspond to 10 MT types, with 6 MT types (MT45, MT47, MT49, MT50, MT51, and MT54) associated with the cachaça strains and 4 MT types (MT19, MT30, MT32, and MT37) associated with the fruit pulp strains. These 10 MT types all fall into the homogeneous NONREF-S subclade and account for 25% of the MT types in this subclade, suggesting that non-random sampling partly contributes to the large size of this subclade.
In the 1960s, Tommaso Castelli deposited into the DBVPG culture collection 13 S. pombe strains isolated from grape must and wine from the Mediterranean islands of Sicily and Malta (DBVPG online catalog, http://www.dbvpg.unipg.it/index.php/en/database). The six Sicily strains share the same MT type (MT9, subclade REF-A), whereas the seven Malta strains correspond to 3 MT types (MT4 in subclade REF-A, and MT24 and MT29 in subclade NONREF-S). The fact that MT types in both clades are found in strains isolated from wine-related substrates within a localized area (the size of Malta is only 316 km²) suggests ongoing opportunities for inter-clade exchange.
Given that the Rosa strains and the Castelli strains, the only notable S. pombe isolates with restricted geographic origins, together account for only 30% of the MT types in the NONREF-S subclade, the large size of this low-diversity subclade requires explanation(s) in addition to geographic sampling bias. Also in need of explanation are the extraordinarily wide distribution and the high extent of MT-K inter-clade mixing of this subclade. We speculate that S. pombe strains harbouring NONREF-S mitogenomes may have by chance become associated with humans earlier than those harbouring other types of mitogenomes and, as a result, gained a world-wide distribution through co-migration with humans. In turn, the spreading of these strains may also have led to their encountering and hybridizing with REF clade strains. An alternative and non-exclusive explanation is that the NONREF-S mitogenomes may provide selective advantages in human-related substrates where S. pombe has most often been found, including cultivated fruits (raw and processed), cultivated sugar cane (raw and processed), and fermented beverages.
Based on our estimation, the two ancient lineages of S. pombe diverged about 40 million generations ago. To our knowledge, the shortest generation time (doubling time) reported for S. pombe under optimal laboratory growth conditions is approximately 2 hours (Johnson 1968). At such a growth rate, S. pombe can go through 12 generations per day, or 4,383 generations per year, and a divergence time of 40 million generations corresponds to 9,126 years. However, it is highly unlikely that S. pombe can proliferate continuously at this high rate in the wild. Taking inevitable encounters with unfavourable growth conditions into consideration, previous studies have estimated that the average generation time of S. cerevisiae in the wild can be more than 10 times longer than the shortest generation time observed in the laboratory (Fay & Benavides 2005; Ruderfer et al. 2006). Applying the same rationale, if we assume that S. pombe may go through as few as 400 generations per year in the wild, 40 million generations correspond to as many as 100,000 years. Thus, the divergence time of the two ancient lineages of S. pombe may fall within the most recent glacial period (“ice age”), which occurred from approximately 110,000 to 12,000 years ago (van Ommen 2015). The expansion of ice sheets and permafrost during a glacial period can lead to vicariance, the splitting of a population through the formation of geographic barriers (Hewitt 2000; Neiva et al. 2018). We speculate that glacial vicariance may have resulted in the allopatric separation of an ancestral population of S. pombe into isolated subpopulations. One of these subpopulations may have survived the glacial period in a glacial refugium in southern Europe, and become the low-diversity lineage with the REF clade mitogenomes.
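The conversion from generations to calendar years in the paragraph above is simple arithmetic and can be sketched as follows. The 365.25-day year and the variable names are our assumptions, chosen because they reproduce the 4,383 generations per year quoted in the text.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumes a 365.25-day year

def years_from_generations(generations, generations_per_year):
    """Convert a divergence time in generations to calendar years."""
    return generations / generations_per_year

divergence_generations = 40e6

# Laboratory optimum: one generation every 2 hours
lab_gen_per_year = HOURS_PER_YEAR / 2
lab_years = years_from_generations(divergence_generations, lab_gen_per_year)

# Slower growth assumed for the wild: about 400 generations per year
wild_years = years_from_generations(divergence_generations, 400)

print(f"Lab rate: {lab_gen_per_year:.0f} generations/year, {lab_years:.0f} years")
print(f"Wild rate: 400 generations/year, {wild_years:.0f} years")
```

The two results bracket the divergence time between roughly 9,000 and 100,000 years, depending on the generation time assumed.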
The intraspecific S. pombe divergence patterns observed in this study are consistent with the following speculative evolutionary scenario: During the last glacial period, an ancient population of S. pombe was split by glacial vicariance; one isolated subpopulation suffered a bottleneck and became a low-diversity lineage mainly distributed in Southern Europe, whereas another subpopulation became a higher-diversity lineage mainly distributed in Asia; after the glacial period ended, perhaps aided by human migration, these two long-separated lineages came into secondary contact and began to hybridize; human migration has also shaped the worldwide distribution of S. pombe, and in particular, has spread strains with the NONREF-S mitogenomes all over the world.
ACKNOWLEDGEMENT
We thank Wen Hu for generating the Illumina sequencing library of DY15505. This work was supported by the Ministry of Science and Technology of China and by the Beijing Municipal Government.