Horizontal transfer and recombination fuel Ty4 retrotransposon evolution in Saccharomyces

Horizontal transposon transfer (HTT) plays an important role in the evolution of eukaryotic genomes; however, the detailed evolutionary history and impact of most HTT events remain to be elucidated. To better understand the process of HTT in closely related microbial eukaryotes, we studied Ty4 retrotransposon subfamily content and sequence evolution across the genus Saccharomyces using short- and long-read whole genome sequence data, including new PacBio genome assemblies for two S. mikatae strains. We find evidence for multiple independent HTT events introducing the Tsu4 subfamily into specific lineages of S. paradoxus, S. cerevisiae, S. eubayanus, S. kudriavzevii and the ancestor of the S. mikatae/S. jurei species pair. In both S. mikatae and S. kudriavzevii, we identified novel Ty4 clades that were independently generated through recombination between resident and horizontally transferred subfamilies. Our results reveal that recurrent HTT and lineage-specific extinction events lead to a complex pattern of Ty4 subfamily content across the genus Saccharomyces. Moreover, our results demonstrate how HTT can lead to the coexistence of related retrotransposon subfamilies in the same genome, which can fuel the evolution of new retrotransposon clades via recombination.

Table S1: Ty4 and Tsu4 subfamily content in whole genome assemblies of multiple Saccharomyces species. Data for Ty4 and Tsu4 subfamily content for S. cerevisiae assemblies from [1] and assembly quality metadata for all assemblies can be found in Additional File 2. Five truncated Tsu4 copies in S. kudriavzevii IFO1802 (denoted with an asterisk) are divergent FLEs from a new subfamily in the Ty4 family.

[...] 3B. The tree scale bar for branch lengths is in units of substitutions per site. For S. paradoxus clades, the geographic source is annotated, and taxon labels consist of strain identifier, FLE identifier and host lineage. For S. cerevisiae clades, the host lineage is annotated, and taxon labels consist of strain identifier and FLE identifier.

The phylogeny shown here is a rescaled, taxon-labeled sub-tree from Figure 3B, only showing the Tsu4 clades from S. mikatae, S. jurei and S. kudriavzevii. Bootstrap support is based on 100 replicates; clade numbers are the same as in Figure 3. The tree scale bar for branch lengths is in units of substitutions per site. Taxon labels consist of strain identifier and FLE identifier for all clades. Representative FLEs from S. mikatae and S. kudriavzevii recombinant clades 11 and 12 selected for sliding window divergence analysis are labeled with asterisks.

Figure S2: Phylogenetic placement of IFO1815 and NBRC 10994 in S. mikatae. ML tree of S. mikatae strains based on empirical short-read WGS datasets from S. mikatae [11] plus short reads simulated from the new IFO1815 and NBRC 10994 WGAs reported in this study. Clustering of IFO1815 samples based on empirical [11] and simulated (this study) short-read data confirms that our simulation approach gives valid taxonomic placement, and reveals that NBRC 10994 represents a new S. mikatae clade (Asia C).
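The idea of placing an assembly in a short-read phylogeny by first simulating reads from it can be illustrated with a minimal sketch. This is not the simulation procedure used in the study (the caption does not specify one); it assumes error-free, single-end reads sampled uniformly from both strands, and the function names, read length, and coverage values are hypothetical.

```python
import random

# Complement table for reverse-complementing reads (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_reads(contig, read_length=100, coverage=10, seed=0):
    """Sample error-free single-end reads uniformly from one contig.

    Assumption: no sequencing errors, no paired ends, no quality scores.
    Real read simulators model all of these.
    """
    if len(contig) < read_length:
        return []
    rng = random.Random(seed)
    # Number of reads needed to reach the requested average depth.
    n_reads = (len(contig) * coverage) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(contig) - read_length + 1)
        read = contig[start:start + read_length]
        # Draw roughly half of the reads from the reverse strand.
        if rng.random() < 0.5:
            read = reverse_complement(read)
        reads.append(read)
    return reads
```

Reads produced this way can be passed through the same mapping and variant-calling steps as empirical reads, which is what allows an assembled strain and a resequenced strain to be compared in one tree.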

Figure S3: Phylogenetic network and tree of FLEs from the Ty4 family in Saccharomyces excluding recombinant clades 11 and 12. (A) Phylogenetic network for internal coding regions of Ty4/Tsu4 FLEs based on the NeighborNet algorithm. To simplify visualization, this network only includes Ty4 subfamily FLEs from WGAs reported in [4]. Lineages in the network are labeled according to monophyletic groups identified in Panel (B). Note that the signal for recombination between the Ty4 and Tsu4 subfamilies seen in Figure 3 is eliminated by exclusion of clades 11 and 12. (B) Midpoint rooted ML phylogeny of internal coding regions from Ty4/Tsu4 FLEs. Bootstrap support based on 100 replicates is shown for major nodes. The scale bar for branch lengths is in units of substitutions per site. All monophyletic groups are collapsed as triangles. Two singleton Tsu4 elements (f267 from Hawaiian S. paradoxus strain UWOPS91-917.1 and f256 from S. mikatae strain NBRC 10994) are denoted as dots at tips. Triangles, tip dots, and ranges are colored for each species. Vertical heights of triangles are proportional to the number of taxa. Horizontal widths of triangles are equal to the maximum branch length within the clade. Note that the monophyletic clade for the Ty4 subfamily from S. cerevisiae (annotated with an asterisk) is re-scaled to 5% of the real sample size both horizontally and vertically, due to the large number of Ty4 sequences (n=273) in S. cerevisiae genomes.

Figure S5: Annotated phylogeny of Ty4/Tsu4 FLEs in S. uvarum and S. eubayanus genomes. The phylogeny shown here is a rescaled, taxon-labeled sub-tree from Figure 3B, only showing the Tsu4 groups from S. uvarum and S. eubayanus. Bootstrap support is based on 100 replicates; clade numbers are the same as in Figure 3B. The tree scale bar for branch lengths is in units of substitutions per site. The host lineage is annotated, and taxon labels consist of strain identifier and FLE identifier for all clades.

Figure S7: Host phylogeny of S. eubayanus annotated with Ty4/Tsu4 copy number estimates. Shown are the copy number estimates for Ty4 (B) and Tsu4 (C) subfamilies from worldwide S. eubayanus lineages. The ML tree was reconstructed using 319,865 genome-wide SNPs from 292 S. eubayanus strains and is midpoint rooted. Plotting details are as described in Figure 1. Major lineages are annotated according to previously reported population structure [16]. S. eubayanus strain yHAB565, whose strain-specific Tsu4 consensus sequence clusters with S. uvarum, is indicated with an arrowhead in the Patagonia B lineage.

Figure S8: Host phylogeny of S. uvarum annotated with Ty4/Tsu4 copy number estimates. Shown are the copy number estimates for Ty4 (B) and Tsu4 (C) subfamilies from worldwide S. uvarum lineages. The ML tree was reconstructed using 253,252 genome-wide SNPs from 62 S. uvarum strains and is midpoint rooted. Plotting details are as described in Figure 1. Major lineages are annotated according to previously reported population structure [17].

Figure S10: Sequence divergence between recombinant and pure Tsu4 from S. mikatae versus S. uvarum Tsu4 and S. cerevisiae Ty4. Shown are sliding window analyses of pairwise sequence divergence between (A) "Recombinant" Tsu4 in S. mikatae vs. Tsu4 in S. uvarum; (B) "Pure" Tsu4 in S. mikatae vs. Tsu4 in S. uvarum; (C) "Recombinant" Tsu4 in S. mikatae vs. Ty4 in S. cerevisiae; (D) "Pure" Tsu4 in S. mikatae vs. Ty4 in S. cerevisiae. The structure of a Ty4 FLE is annotated at the bottom of each panel in colored rectangles (Gag in black, Pol in darker gray, and LTRs in lighter gray). The dashed red line in panel (A) indicates the boundary between the 5' and 3' internal regions, which is later used for partitioning the Ty4/Tsu4 phylogeny in Figure 5. Elements used in this analysis include IFO1815 f256 for "Recombinant" Tsu4 in S. mikatae, IFO1815 f286 for "Pure" Tsu4 in S. mikatae, CBS7001 f32 for Tsu4 in S. uvarum, and YPS128 f49 for Ty4 in S. cerevisiae. The IFO 1815 assembly was generated in this study. The CBS 7001 and YPS128 assemblies were previously published in Chen et al. [12] and Yue et al. [4], respectively. Divergence, measured in substitutions per site, was calculated using a Kimura 2-parameter model in overlapping 50 bp windows with a 10 bp step size.
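The sliding window calculation described in this caption can be sketched as follows. This is an illustrative sketch rather than the code used in the study: it assumes the two elements are already pairwise aligned to equal length, skips gapped or ambiguous columns, and uses the standard Kimura 2-parameter estimator d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. The function names are hypothetical; the 50 bp window and 10 bp step follow the caption.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance for two aligned sequences.

    Returns None when no comparable sites exist or when the sequences
    are too divergent for the estimator to be defined.
    """
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        return None
    p = transitions / sites
    q = transversions / sites
    x = 1 - 2 * p - q
    y = 1 - 2 * q
    if x <= 0 or y <= 0:
        return None
    return -0.5 * math.log(x) - 0.25 * math.log(y)


def sliding_k2p(seq1, seq2, window=50, step=10):
    """Return (window_start, distance) pairs over overlapping windows."""
    length = min(len(seq1), len(seq2))
    results = []
    for start in range(0, length - window + 1, step):
        end = start + window
        results.append((start, k2p_distance(seq1[start:end], seq2[start:end])))
    return results
```

Plotting the returned distances against window start positions gives a divergence profile along the element; an abrupt shift in divergence along that profile is the kind of signal that marks a candidate recombination boundary.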

Table S2: Statistics for de novo whole genome assemblies of S. mikatae strains IFO 1815 and NBRC 10994 generated in this study.
[...] FLEs, truncated elements, and solo LTRs, respectively. Colored boxes show interquartile ranges (IQR), whiskers extend to values within 1.5×IQR of the upper or lower quartiles, and dots indicate outliers beyond 1.5×IQR. Outliers for Tsu4 elements in strain CQS are annotated.
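The box, whisker, and outlier convention described above can be made concrete with a short sketch. It assumes the common Tukey convention in which whiskers end at the most extreme data points lying within 1.5×IQR of the quartiles; the quartile interpolation method and the example values are assumptions for illustration, not data or settings from this study.

```python
import statistics


def tukey_summary(values):
    """Summarize values with quartiles, whisker ends, and outliers."""
    # "inclusive" treats the data as the full population of interest;
    # plotting libraries may use a slightly different interpolation.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical copy-number estimates with one unusually high strain.
example = tukey_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this hypothetical example the value 100 falls above the upper fence and would be drawn as a dot, which is how a strain with an exceptional copy number would stand out from its lineage.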