High-quality genome assembly and annotation (v1) of the eukaryotic terrestrial microalga Coccomyxa viridis SAG 216-4

Unicellular green algae of the genus Coccomyxa are recognized for their worldwide distribution and ecological versatility. Most species described to date live in close association with various host species, such as in lichen associations. However, little is known about the molecular mechanisms that drive such symbiotic lifestyles. We generated a high-quality genome assembly for the lichen photobiont Coccomyxa viridis SAG 216-4 (formerly C. mucigena). Using long-read PacBio HiFi and Oxford Nanopore Technologies in combination with chromatin conformation capture (Hi-C) sequencing, we assembled the genome into 21 scaffolds with a total length of 50.9 Mb, an N50 of 2.7 Mb and a BUSCO score of 98.6%. While 19 scaffolds represent full-length nuclear chromosomes, two additional scaffolds represent the mitochondrial and plastid genomes. Transcriptome-guided gene annotation resulted in the identification of 13,557 protein-coding genes, of which 68% have annotated PFAM domains and 962 are predicted to be secreted.


Introduction
Green algae are photosynthesizing eukaryotic organisms that differ greatly in terms of morphology and colonize a large variety of aquatic and terrestrial habitats. Phylogenetically, green algae form a paraphyletic group that has recently been proposed to comprise three lineages including the Prasinodermophyta in addition to the Chlorophyta and Streptophyta (Li et al., 2020). This new phylum diverged before the split of the Chlorophyta and Streptophyta that occurred between 1,000 and 700 million years ago (Morris et al., 2018). While the streptophyte lineage encompasses charophyte green algae as well as land plants, the chlorophyte lineage consists of 7 prasinophyte classes, which gave rise to 4 phycoplast-containing core chlorophyte classes (Chlorodendrophyceae, Trebouxiophyceae, Ulvophyceae, Chlorophyceae) with one independent sister class (Pedinophyceae) (Leliaert et al., 2012; Marin, 2012).
The Coccomyxa genus is represented by coccoid unicellular green algae that belong to the class of Trebouxiophyceae. Morphologically, Coccomyxa spp. are characterized by irregular elliptical to globular cells that range from 6-14 x 3-6 μm in size, with a single parietal chloroplast lacking pyrenoids and the absence of flagellate stages (Schmidle, 1901). Members of this genus are found in freshwater, marine, and various terrestrial habitats where they occur free-living or in symbioses with diverse hosts (Darienko et al., 2015; Gustavs et al., 2017; Malavasi et al., 2016). Several Coccomyxa species establish stable, mutualistic associations with fungi that result in the formation of complex three-dimensional architectures, known as lichens (Faluaburu et al., 2019; Gustavs et al., 2017; Jaag, 1933; Yahr et al., 2015; Zoller and Lutzoni, 2003). Others associate with vascular plants or lichens as endo- or epiphytes, respectively (Cao et al., 2018a; Cao et al., 2018b; Tagirdzhanova et al., 2023; Trémouillaux-Guiller et al., 2002), and frequently occur on the bark of trees (Kulichovà et al., 2014; Štifterovà and Neustupa, 2015) where they may interact with other microbes. One novel species was recently found in association with carnivorous plants, even though the nature of this relationship remains unclear (Sciuto et al., 2019). In addition, Coccomyxa establishes parasitic interactions with different mollusk species, affecting their filtration ability and reproduction (Gray et al., 1999; Sokolnikova et al., 2016; Sokolnikova et al., 2022; Vaschenko et al., 2013). Despite this ecological versatility, little is known about the molecular mechanisms that determine the various symbiotic lifestyles in Coccomyxa. One short read-based genome is available for C. subellipsoidea C-169, which was isolated in Antarctica where it occurred on dried algal peat (Blanc et al., 2012), whereas another high-quality genome has recently been made available for a non-symbiotic strain of C. viridis that was isolated from a lichen thallus (Tagirdzhanova et al., 2023). For Coccomyxa sp. Obi, LA000219 and SUA001, chromosome-, scaffold- and contig-level assemblies, respectively, are available on NCBI, as well as two metagenome-assembled genomes of C. subellipsoidea. To facilitate the study of Coccomyxa symbiont-associated traits and their evolutionary origin, we here present the generation of a high-quality chromosome-scale assembly of the phycobiont C. mucigena SAG 216-4 using long-read PacBio HiFi and Oxford Nanopore Technology (ONT) combined with Hi-C and RNA sequencing. Recent SSU and ITS rDNA sequencing-based re-evaluations of the Coccomyxa phylogeny placed the SAG 216-4 isolate in the clade of C. viridis (Darienko et al., 2015; Malavasi et al., 2016). Hence, this isolate will be referred to as C. viridis here and data have been deposited under the corresponding Taxonomy ID.

DNA and RNA extraction
Cells of a 7-week-old C. viridis culture were harvested over 0.8 μm cellulose nitrate filters (Sartorius, Göttingen, Germany) using a vacuum pump. Material was collected with a spatula, snap-frozen and ground in liquid nitrogen using mortar and pestle. The ground material was used for genomic DNA extraction with the RSC Plant DNA Kit (Promega, Madison, WI, USA) using the Maxwell RSC device according to the manufacturer's instructions. To prevent shearing of long DNA fragments, centrifugation was carried out at 10,000 g during sample preparation.
Following DNA extraction, DNA fragments <10,000 bp were removed using the SRE XS kit (Circulomics, Baltimore, MD, USA) according to the manufacturer's instructions. DNA quantity and quality were assessed using the Nanodrop 2000 spectrometer and Qubit 4 fluorometer with the dsDNA BR assay kit (Invitrogen, Carlsbad, CA, USA), and integrity was confirmed by gel electrophoresis. High-molecular-weight DNA was stored at 4°C.
For total RNA extraction, algal cells were collected from a dense nine-day-old culture and ground in liquid nitrogen using mortar and pestle. RNA was extracted with the Maxwell RSC Plant RNA kit (Promega, Madison, WI, USA) using the Maxwell RSC device according to the manufacturer's instructions. RNA quality and quantity were determined using the Nanodrop 2000, and the RNA was stored at -80°C.

Pacific Biosciences High-Fidelity (PacBio HiFi) sequencing
HiFi libraries were prepared with the Express 2.0 Template kit (Pacific Biosciences, Menlo Park, CA, USA) and sequenced on a Sequel II/Sequel IIe instrument with 30 h movie time. HiFi reads were generated using SMRT Link (v10; Pacific Biosciences, Menlo Park, CA, USA) with default parameters.

Chromosome conformation capture (Hi-C) and sequencing
C. viridis cells were cross-linked in 3% formaldehyde for 1 hour at room temperature. The reaction was quenched with glycine at a final concentration of 250 mM. Cells were collected by centrifugation at 16,000 g for 10 min. Pellets were flash-frozen in liquid nitrogen and ground using mortar and pestle. Hi-C libraries were prepared using the Arima-HiC+ kit (Arima Genomics, Carlsbad, CA, USA) according to the manufacturer's instructions, and subsequently paired-end (2x150 bp) sequenced on a NovaSeq 6000 instrument (Illumina, San Diego, CA, USA).

RNA sequencing
Library preparation for full-length mRNASeq was performed using the NEB Ultra II Directional RNA Library Prep with NEBNext Poly(A) mRNA Magnetic Isolation Module and 500 ng total RNA as starting material, except for W-RNA Lplaty, where library prep was based on 100 ng total RNA as starting material. Sequencing was performed on an Illumina NovaSeq 6000 device with a 2x150 bp paired-end sequencing protocol and >50 M reads per sample.
Whenever gaps between contigs were spanned by at least five reads with a mapping quality of 30, the contigs were fused in the assembly.
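The gap-closing criterion can be illustrated with a minimal Python sketch. The `Alignment` tuple, the function name and the toy coordinates are hypothetical and introduced only for illustration; the two thresholds come from the text, with the mapping quality treated here as a minimum value (an assumption). The actual analysis relied on ONT reads mapped onto the assembly, not on hand-written intervals.

```python
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    """Simplified read alignment (hypothetical structure, for illustration only)."""
    start: int  # 0-based leftmost aligned reference position
    end: int    # exclusive rightmost aligned reference position
    mapq: int   # mapping quality reported by the aligner


def gap_is_spanned(alignments: Iterable[Alignment], gap_start: int, gap_end: int,
                   min_reads: int = 5, min_mapq: int = 30) -> bool:
    """Return True if enough confidently mapped reads cover the entire gap."""
    spanning = sum(
        1
        for aln in alignments
        if aln.mapq >= min_mapq and aln.start <= gap_start and aln.end >= gap_end
    )
    return spanning >= min_reads


# Toy data: five reads span the gap with high mapping quality, one spans it
# with low mapping quality, and one starts inside the gap.
reads = [Alignment(100, 900, 60)] * 5 + [Alignment(100, 900, 10), Alignment(450, 900, 60)]
print(gap_is_spanned(reads, gap_start=400, gap_end=500))
```

Only reads that cover the gap from its first to its last position are counted, so partially overlapping reads do not contribute to the decision.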
Potential telomeres were identified using tapestry (v1.0.0) with "AACCCT" as telomere sequence (Davey et al., 2020). To check for potential contaminations, Blobtools (v1.1.1) and BLAST (v2.13.0+) were used to create a Blobplot including taxonomic annotation at genus level (Camacho et al., 2009; Laetsch and Blaxter, 2017). To check completeness of the assembly and retrieve ploidy information, kat comp from the Kmer Analysis Toolkit (v2.4.2) was used, and results were visualized using the kat plot spectra-cn function with the -x 800 option to extend the x-axis (Mapleson et al., 2016). Genome synteny to the closest sequenced relative C. subellipsoidea C-169 was determined using Mummer3 (Blanc et al., 2012; Kurtz et al., 2004). In detail, the two assemblies were first aligned using Nucmer, followed by a filtering step with Delta-filter using the many-to-many option (-m). Finally, the alignment was visualized with Mummerplot.
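The idea behind the telomere search can be sketched as counting tandem copies of the "AACCCT" motif, or its reverse complement, at both scaffold ends. The following Python example is a simplified illustration and does not reproduce the Tapestry implementation; the function names, the window size and the toy sequence are assumptions.

```python
import re

TELOMERE = "AACCCT"  # telomere motif given in the text


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def max_tandem_copies(seq: str, motif: str) -> int:
    """Length, in motif copies, of the longest uninterrupted motif run in seq."""
    runs = re.findall(f"(?:{motif})+", seq)
    return max((len(run) // len(motif) for run in runs), default=0)


def telomere_copies(scaffold: str, window: int = 1000, motif: str = TELOMERE):
    """Longest telomere repeat run within `window` bp of the (start, end) of a scaffold."""
    rc = reverse_complement(motif)
    ends = (scaffold[:window].upper(), scaffold[-window:].upper())
    return tuple(
        max(max_tandem_copies(end, motif), max_tandem_copies(end, rc)) for end in ends
    )


# Toy scaffold: six motif copies at the start, five reverse-complement copies at the end.
toy = "AACCCT" * 6 + "ACGTACGTAC" * 20 + "AGGGTT" * 5
print(telomere_copies(toy, window=50))
```

A scaffold would be a candidate full-length chromosome under this sketch when both returned counts reach a chosen minimum number of repeats.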

Annotation
To annotate repetitive elements in the nuclear genome, a database of simple repeats was created with RepeatModeler (v2.0.3) that was expanded with transposable elements (TE) from the TransposonUltimate resonaTE (v1.0) pipeline (Flynn et al., 2020; Riehl et al., 2022). This pipeline uses multiple tools for TE prediction and combines the prediction output. For the prediction of TEs in Coccomyxa viridis, helitronScanner, ltrHarvest, mitefind, mitetracker, RepeatModeler, RepeatMasker, sinefind, tirvish, transposonPSI and NCBICDD1000 were used within TransposonUltimate resonaTE, and TEs that were predicted by at least two tools were added to the database. TEclass (v2.1.3) was used for classification (Abrusán et al., 2009). To softmask the genome and obtain statistics on the total TE and repetitive element content in the genome, RepeatMasker (v4.1.2-p1) (Smit et al., 2012) was used with the -excln option to exclude Ns in the masking.
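The rule that a TE enters the database only when at least two tools predict it can be sketched in Python as follows. The element identifiers are hypothetical, and matching predictions by a shared identifier is a simplifying assumption; a real pipeline would typically reconcile predictions by genomic coordinates.

```python
from collections import defaultdict
from typing import Dict, Set


def consensus_predictions(predictions: Dict[str, Set[str]], min_tools: int = 2) -> Set[str]:
    """Keep elements that are supported by at least `min_tools` independent tools."""
    support = defaultdict(set)
    for tool, elements in predictions.items():
        for element in elements:
            support[element].add(tool)
    return {element for element, tools in support.items() if len(tools) >= min_tools}


# Toy input: tool names are taken from the text, element IDs are made up.
toy_predictions = {
    "ltrHarvest": {"TE1", "TE2"},
    "RepeatModeler": {"TE2", "TE3"},
    "tirvish": {"TE3", "TE4"},
}
print(sorted(consensus_predictions(toy_predictions)))
```

Requiring agreement between tools trades some sensitivity for fewer false positives, since elements reported by a single predictor are discarded.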
Organelle genomes were annotated separately. Scaffolds were identified as organelle genomes based on their lower GC content and smaller size. The mitochondrial genome was annotated using MFannot (Lang et al., 2023) as well as GeSeq (Tillich et al., 2017) and the annotation was combined within the GeSeq platform. The plastid genome was annotated using GeSeq alone. The annotations were visualized using the OGDraw webserver (Greiner et al., 2019).
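Flagging organelle candidates by size and GC content can be sketched as below. The text does not state explicit cutoffs, so the thresholds, function names and toy sequences in this Python example are assumptions chosen only to demonstrate the principle.

```python
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def flag_organelle_candidates(scaffolds: Dict[str, str], max_length: int,
                              max_gc: float) -> List[str]:
    """Names of scaffolds that are both short and GC-poor (hypothetical cutoffs)."""
    return [
        name
        for name, seq in scaffolds.items()
        if len(seq) <= max_length and gc_content(seq) <= max_gc
    ]


# Toy scaffolds: one long and GC-rich, one short and GC-poor.
toy_scaffolds = {
    "scaffold_1": "GGCCGA" * 100,
    "scaffold_20": "ATGCAT" * 20,
}
print(flag_organelle_candidates(toy_scaffolds, max_length=200, max_gc=0.45))
```

Such a filter only nominates candidates; confirmation still requires sequence-based evidence such as the presence of organelle genes.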

Results
The version 1 genome of C. viridis was assembled from 32.2 Gbp of PacBio HiFi reads with a mean read length of 15 kb, 0.95 Gbp of Nanopore reads with a mean read length of 8.8 kb and 15 million read pairs of Hi-C data. The PacBio HiFi reads were first assembled using Raven (Vaser and Šikić, 2021), yielding 27 contigs. These contigs were scaffolded and manually curated using Hi-C data (Dudchenko et al., 2017; Durand et al., 2016a; Durand et al., 2016b; Li and Durbin, 2009). To close the remaining gaps between contigs within scaffolds, ONT reads were mapped onto the assembly (Danecek et al., 2021; Li, 2021) and gaps that were spanned by at least 5 ONT reads with a mapping quality >30 were manually closed, finally resulting in 21 scaffolds consisting of 26 contigs with a total length of 50.9 Mb and an N50 of 2.7 Mb (Figure 1, Table 1). Using Tapestry (Davey et al., 2020), telomeric regions ((AACCCT)n, 5 repeats) were identified at both ends of nine of the 21 scaffolds (Figure 1a), suggesting that these represent full-length chromosomes, which was confirmed by Hi-C analysis (Figure 1b). Additionally, the Hi-C contact map indicated centromeres for some of the chromosomes. However, the determination of exact centromere locations on all chromosomes will require ChIP-seq analysis and CenH3 mapping. While Tapestry detected telomeric sequences at only one end of eight other scaffolds and none for scaffolds 18 and 19, the Hi-C map points towards the presence of telomeric repeats at both ends of all scaffolds 1-19 (Figure 1b), suggesting that the v1 assembly contains 19 full-length chromosomes that compose the nuclear genome. Scaffolds 20 and 21 were considerably shorter, at 162 kb and 70 kb, and displayed a markedly lower GC content of 41-42% (Figure 1a), suggesting that these scaffolds represent the chloroplast and mitochondrial genomes, respectively. BLAST analyses confirmed the presence of plastid and mitochondrial genes on the respective scaffolds, and the overall scaffold lengths corresponded with the sizes of the plastid and mitochondrial genomes of Coccomyxa subellipsoidea C-169 with 175 kb and 65 kb, respectively (Blanc et al., 2012). Full annotation of scaffolds 20 and 21 showed that they indeed represent chloroplast and mitochondrial genomes, respectively (Figure 2).
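For readers unfamiliar with the N50 statistic reported above, it is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal Python sketch follows; the function name and the toy lengths are illustrative and are not the scaffold lengths of this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that scaffolds of length >= L cover at least half of the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one scaffold length")


# Toy example: total length 15, so the cumulative sum must reach 7.5.
print(n50([5, 4, 3, 2, 1]))  # 5 + 4 = 9 >= 7.5, so the N50 is 4
```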
To rule out the presence of contaminants, the assembly and PacBio HiFi raw reads were used to produce a Blobplot (Camacho et al., 2009;Laetsch and Blaxter, 2017), which indicates that 98.76% of the reads match only the Coccomyxa genus (Figure 3) and, consequently, that the original sample was free of contaminating organisms.Finally, a KAT analysis showed a single peak of k-mer multiplicity based on HiFi reads that were represented once in the assembly (Figure 4) (Mapleson et al., 2016), indicative of a high-quality, haploid genome.
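The logic of a copy-number spectrum can be sketched as a table that records, for every distinct k-mer in the reads, its multiplicity in the reads and its copy number in the assembly. The Python example below is a strongly simplified illustration and not the KAT implementation: it ignores strand (no canonical k-mers), uses a tiny k and toy sequences, and all names are assumptions.

```python
from collections import Counter
from typing import Iterable


def kmer_counts(sequences: Iterable[str], k: int) -> Counter:
    """Count every k-mer over all sequences (forward strand only)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def spectra_cn(reads: Iterable[str], assembly: Iterable[str], k: int) -> Counter:
    """Map (assembly copy number, read multiplicity) to the number of distinct k-mers."""
    read_counts = kmer_counts(reads, k)
    assembly_counts = kmer_counts(assembly, k)
    table = Counter()
    for kmer, multiplicity in read_counts.items():
        table[(assembly_counts.get(kmer, 0), multiplicity)] += 1
    return table


# Toy data: two error-free reads and one read carrying a single substitution.
toy_assembly = ["ACGTAC"]
toy_reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
for key, n_kmers in sorted(spectra_cn(toy_reads, toy_assembly, k=3).items()):
    print(key, n_kmers)
```

In such a table, k-mers with assembly copy number 0 and low read multiplicity typically stem from sequencing errors, whereas a single peak of k-mers present once in the assembly is the pattern expected for a haploid genome.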
To annotate the nuclear genome, we first assessed the presence of repetitive elements. In total, we found 8.9% of the genome to be repetitive (Table 2), comparable to the 7.2% of repetitive sequences found in the genome of C. subellipsoidea C-169 (Blanc et al., 2012). These 8.9% repetitive elements were annotated as either simple repeats (2.3%) or transposable elements (6.6%). Of the transposable elements, 36% were annotated as retrotransposons and 64% as DNA transposons. The distribution of the repetitive elements was even across the genome with only a few repeat-rich regions (Figure 5). Next, we aimed to produce a high-quality genome annotation using RNA sequencing data. In total, 13,557 genes were annotated with an average length of 3.1 kb (Table 2). The amount of alternative splicing in the genome is predicted to be very low, given the average of one transcript per gene model. To confirm the actual amount of alternative splicing, however, further analyses will be required. Of the 13,557 genes, 68% have annotated PFAM domains and 962 are predicted to carry a signal peptide for secretion. A total of 1,489 (98.6%) complete gene models among 1,519 conserved Benchmarking Universal Single-Copy Orthologs (BUSCO) (Manni et al., 2021) in the chlorophyta_odb10 database were identified (Table 2), suggesting a highly complete genome annotation.
Until recently, the taxonomic classification and definition of Coccomyxa species was based on environmentally variable morphological and cytological characteristics. This classification was reviewed based on the phylogenetic analyses of nuclear SSU and ITS rDNA sequences, which resulted in the definition of 27 currently recognized Coccomyxa species (Darienko et al., 2015; Malavasi et al., 2016). Dot plot analysis of the high-quality genome assembly of C. viridis SAG 216-4 with the assembly of the most closely related sequenced relative C. subellipsoidea C-169 revealed a lack of synteny, since the few identified orthologous sequences were < 1 kb and, therefore, do not represent full-length genes (Figure 6a, Table 2). This lack of synteny was not a technical artifact, since the C. viridis assembly could be fully aligned to itself (Figure 6b), and BLAST analyses with five out of six non-identical ITS sequences identified in the C. viridis SAG 216-4 assembly confirmed its species identity. A comparison of the assembly of C. subellipsoidea C-169 to that of Chlorella variabilis (Chlorophyte, Trebouxiophyceae) has previously identified few syntenic regions, which displayed poor gene collinearity (Blanc et al., 2012). Future studies will help to clarify whether the absence of synteny between C. viridis and C. subellipsoidea is due to the quality of the available assemblies or whether it has biological implications.

Figure 1. Genome assembly of Coccomyxa viridis SAG 216-4. (a) An overview of the C. viridis genome assembly depicts chromosome-scale scaffolds. Green bars indicate scaffold sizes and red bars represent telomeres. Variations in color intensities correlate with read coverage. Read coverage per scaffold is determined by mapping PacBio HiFi reads onto the assembly. Scaffolds 20 and 21 were identified as chloroplast and mitochondrial genomes based on size and low GC contents, and BLAST analyses. (b) Hi-C contact map showing interaction frequencies between regions in the nuclear genome of Coccomyxa viridis. Scaffolds are framed by blue lines while contigs within scaffolds are depicted in green.

Figure 2. Scaffolds 20 and 21 represent the plastid and mitochondrial genomes of C. viridis SAG 216-4. Gene maps of the chloroplast (a) and mitochondrial (b) genomes. The inner circles indicate the GC content and mapped genes are shown on the outer circles. Genes that are transcribed clockwise are placed inside the outer circles, and genes that are transcribed counterclockwise are placed outside the outer circles.

Figure 3. Taxonomic annotation indicates absence of contaminations in the genome assembly. (a) Taxon-annotated GC coverage scatter plot (Blobplot) of the contigs from the genome assembly shows that all scaffolds are taxon-annotated as Coccomyxa and all scaffolds that belong to the nuclear genome have similar GC contents (~54%). The GC contents of the mitochondrial and plastid genomes are considerably lower (~41%). (b) In total, 98.76% of the reads can be mapped onto the assembly and are therefore classified as Coccomyxa reads.

Figure 4. The Coccomyxa viridis SAG 216-4 genome is haploid. The KAT spectra-cn plot depicts the 27-mer multiplicity of the PacBio HiFi reads against the genome assembly. Black areas under the peaks represent k-mers present in the reads but absent from the assembly; colored peaks indicate k-mers that are present once to multiple times in the assembly. The single red peak in the KAT spectra-cn plot suggests that Coccomyxa viridis has a haploid genome, while the black peak at low multiplicity shows that the assembly is highly complete and that all reads are represented in the assembly.

Figure 5. Circos plot summarizing the nuclear genome annotation of Coccomyxa viridis SAG 216-4. From outside to inside the tracks display: GC content (over 1-kb windows), gene density (blue) and repetitive element density (red).

Figure 6. No synteny detected between related Coccomyxa species. (a) Dot plot of orthologous sequences in the genome assemblies of C. viridis SAG 216-4 and C. subellipsoidea C-169. Violet and blue dots represent orthologous sequences on same and opposite strands, respectively. Dot sizes do not correlate with the length of the sequences they represent, which were all < 1 kb. The width of each box corresponds to the length (bp) of the respective scaffold. (b) Dot plot of the genome assembly of C. viridis SAG 216-4 against itself.

Table 1. Genome features of C. viridis SAG 216-4 including the mitochondrial and plastid genomes.