Mapping the Algal Secret Genome: The small RNA Locus Map for Chlamydomonas reinhardtii

Small (s)RNAs play crucial roles in the regulation of gene expression and genome stability across eukaryotes where they direct epigenetic modifications, post-transcriptional gene silencing, and defense against both endogenous and exogenous viruses. The green alga Chlamydomonas reinhardtii is a well-studied unicellular alga species with sRNA-based mechanisms that are distinct from those of land plants. It is, therefore, a good model to study sRNA evolution but a systematic classification of sRNA mechanisms is lacking in this and any other algae. Here, using data-driven machine learning approaches including Multiple Correspondence Analysis (MCA) and clustering, we have generated a comprehensively annotated and classified sRNA locus map for C. reinhardtii. This map shows some common characteristics with higher plants and animals, but it also reveals distinct features. These results are consistent with the idea that there was diversification in sRNA mechanisms after the evolutionary divergence of algae from higher plant lineages.

Small (s)RNAs in many organisms are involved in regulation of gene expression at the tran-2 scriptional (epigenetic marks) and post-transcriptional (RNA degradation/translational 3 repression) levels, as well as in host defenses against viruses [25]. In eukaryotes, their 4 double stranded or highly structured RNA precursors are processed by Dicer-like (DCL) 5 endonucleases into short 20-25nt RNA duplexes that are bound by Argonaute (AGO) 6 proteins. One of the sRNA strands guides an AGO-containing effector complex toward 7 complementary DNA/RNA to mediate the various mechanisms of RNA silencing [23]. 8 Based on their origin and biogenesis, sRNAs can be classified in diverse classes including 9 small-interfering (si)RNAs, micro(mi)RNA, and piwi-interacting (pi)RNAs [10], but 10 there are many different sRNA subtypes [2] and diffuse boundaries between sRNA 11 classes, as demonstrated for Arabidopsis thaliana [13]. Consequently, development and 12 improvement of comprehensive classification methods is required for understanding of 13 sRNA-based regulation networks.
Due to its small genome, vegetative/sexual reproduction, fast growth, motility and 15 capacity to use acetate as carbon source, the photosynthetic green alga Chlamydomonas 16 reinhardtii (hereafter referred to as Chlamydomonas) has been an important model 17 organism for decades [31] and it was the first unicellular organism in which miRNAs 18 were described [26,44]. These Chlamydomonas miRNAs, however, are distinct from 19 those of land plants: they have certain animal-like features including their biogenesis 20 and mode of action [6,7,36,42]. The Chlamydomonas key proteins in RNA silencing 21 pathways (three AGO and three DCL proteins) are also distinct from homologues in 22 land plants. Phylogenetic, structural and functional analyses indicate a divergence of 23 both protein families since the common ancestor of algae and land plant about 1 billion 24 years ago [4,7,32,36]. There is also doubt about whether DCLs in Chlamydomonas 25 contain PAZ domains [32,36], which are known to be important in cleavage of precursors 26 for sRNAs of specific lengths [17]. Other differences with land plants include (i) the 27 absence in Chlamydomonas of RNA-dependent RNA polymerases (RDRs) that generate 28 dsRNAs from single stranded RNA [5] and (ii) the almost complete absence of non-CG 29 methylation in transposons that is a hallmark of sRNA-directed DNA methylation [9,19]. 30 These previously described characteristics of Chlamydomonas sRNA pathways suggest 31 potential divergence from land plants. However, to-date, there has been no comprehensive 32 characterization of the sRNA species found in in this alga. To address this issue we 33 examined Chlamydomonas sRNAs, including miRNAs and siRNAs, based on the distinct 34 loci from which they are produced. We used a Bayesian approach to generate the first 35 comprehensive sRNA locus map for Chlamydomonas. Annotation of the loci based 36 on intrinsic and extrinsic features allowed us to carry out a Multiple Correspondence 37 Analysis (MCA) followed by clustering. We identified 6 classes of sRNAs, which may 38 correspond to distinct RNA silencing pathways. Through comparison of the results 39 with those previously reported for Arabidopsis [13], distinct features to those found in 40 higher plants were uncovered, such as a particular sRNA loci distribution across the 41 genome and their association with the epigenetic landscape. Together, these results help 42 to understand the function of sRNAs in this single-celled alga Chlamydomonas and the 43 evolution of sRNA-related pathways in green algae and land plants. 44

46
In order to construct a complete locus map we first obtained a comprehensive collection 47 of 145 sRNA libraries encompassing 54 replicate groups (Supplementary Table S1). To 48 capture a maximum diversity of sRNA the libraries represent a wide range of conditions, 49 strains and stages of the life cycle. After initial trimming and filtering (see Methods), 50 the libraries contained a total of 22.3 million non-redundant (336 million redundant) 51 sRNA reads mapping to the Chlamydomonas reference assembly genome [24].

52
To obtain a general overview of the sRNA population within our datasets, we first 53 computed the same intrinsic key determinants found to be informative in the sRNA 54 locus classification of Arabidopsis [13]: locus size, 5' nucleotide, repetitiveness of genomic 55 mapping locations and abundance of individual sRNA species. To avoid a bias towards 56 non-representative conditions, for this first part of the analysis we selected only libraries 57 made from wild type strains.

58
The size distribution of sRNAs ( Figure 1A) is dominated by 21nt species comprising 59 28% of all sRNAs, in line with previous studies [6,26,36,38,44] and unlike Arabidopsis 60 and other land plants in which the 24nt sRNA fraction predominates ( Figure 1A). Also, 61 the 21nt species are the only class exhibiting counts with more than 1 million copies 62 per sRNA species which makes up about 20% within that class( Figure 1B). Many reads 63 (43% redundant and 70% non-redundant) map uniquely to the reference genome, which 64 is comparable to Arabidopsis (52% and 80%), albeit more evenly distributed across 65 size range as shown in Figure 1C. This difference is most likely due to the absence of a 66 RdDM pathway in Chlamydomonas. We did not find an abundant 23nt class found in 67 Chlamydomonas from northern blot experiments [4]. 68 Studies have shown the importance of 5' nucleotide for AGO binding specificity 69 in Chlamydomonas [38] and Arabidopsis [34]. For Chlamydomonas, and in line with 70 previous reports, we found that most sRNAs have adenine (A) and uracil (U) as 5' 71 nucleotide. As in Arabidopsis, however, the 5'-end nucleotide varies greatly for different 72 size classes ( Figure 1D). For the predominant 21nt fraction, there was a preference for A 73 and U (26% and 53%) with a higher proportion of 5' A sRNAs in the larger fractions up 74 to 27nt. Overall, we found general agreement of our data with other datasets and we 75 concluded that they are suitable for subsequent locus map generation and classification. 76  To generate a locus map with these datasets we used the R package SegmentSeq. It 78 employs a heuristic approach based on sRNA densities to derive an initial locus map 79 which is then refined using Bayesian methods to take into account replicate groups [12]. 80 The locus map based on a false discovery rate of less than 0.05 (FDR) had 6164 loci 81 ( Figure 2A) and covering 4.1% (4.57Mb) of the reference genome (110Mb). While the 82 size of the libraries varies markedly, the number of loci per library scales roughly with 83 library size ( Figure 2B) and a cumulative analysis indicates that very few extra loci are 84 likely to have been identified with further sequencing ( Figure 2C). Figure 2D shows a 85 degree of conservation of loci across replicate groups, although it should be noted that 86 the libraries consist of a variety of strains, mutants and growth conditions, so complete 87 conservation is not expected. For more stringent FDRs we saw a greater conservation of 88 loci across replicate groups ( Figure 3).  overlap with genomic features (e.g. genes, transposons, methylation level). In a strong 96 validation of the locus map, all 42 miRNAs previously identified in Chlamydomonas 97 appear as defined sRNA loci [36]. An example is depicted in Figure 4 showing the loci 98 CRSL0041450 which represents miR1157 and miR1157* as part of the 22th intron of 99 the gene Cre12ġ537671.

100
Most annotation features are categorical in nature (i.e. overlap with genomic features 101 is either true or false), but others are quantitative (i.e. size and phasing score). In 102 preparation for MCA (see below) the quantitative features were classified into discrete 103 groups according to the modality of the density distributions ( Figure 5). Figure 5A shows 104 a clear bimodal distribution of high or low locus repetitiveness and so we annotated loci 105 in three groups corresponding to the two modes and the intervening section. Strand 106 bias ( Figure 5B) shows multiple modal peaks which can be neatly divided into strong 107 bias (0.2 < x < 0.8) and medium bias (0.2-0.4 and 0.6-0.8). Figure 5C shows phasing to 108 have just a single peak with a cut-off of 60 capturing the modal peak (< 60) and the 109 long-tail (> 60). Finally, locus size cut-offs were chosen based on marked changes in 110 gradient ( Figure 5D).

127
Having assigned intrinsic and extrinsic features to each locus (a detailed breakdown of 128 intrinsic and extrinsic features are listed in Supplementary Table S3), we used the MCA 129 function from the FactoMineR R package [16] to search for underlying patterns and 130 feature associations. Following dimensional reduction with MCA, k-means clustering was 131 used to group loci according to the annotation patterns (see Methods for more details). 132 We hypothesized that such grouping might identify distinct sRNA types and therefore 133 potentially reveal distinct biogenesis or effector pathways.

134
To optimize the number of clusters and dimensions to be used we followed an approach 135 used in a similar analysis of Arabidopsis [13]. We first evaluated the stability of the 136 clustering using different combinations of clusters and dimensions through random sub-137 sampling ( Figure 6A). We also calculated the additional variance explained by inclusion 138 of an additional dimension ( Figure 6B). Taken together, this analysis indicated that 139 seven dimensions is the optimal number. In addition to the stability tests, computation 140 of the gap statistic [35] and the normalised mutual information (NMI) between the 141 clusters and annotation features both suggested six clusters to be optimal ( Figure 6C-E). 142 The presence of robust clustering for two and three clusters was noted and used to 143 generate a hierarchy plot to demonstrate how loci from clusters grouped together for 144 lower values of k (Figure 7). 145 The resulting six clusters, referred to as locus class (LC) 1-6, have relatively similar 146 sizes and their association with different annotation features is shown in Figure 8 (also 147 see Supplementary Table S4). The cluster hierarchy (see Figure 7) suggests that the 148 primary division in clusters is between LC1-2 and LC3-6. LC1-2 have a high levels of 149  repetitiveness and a stronger association with genomic methylated regions than LC3- 6. 150 As observed in Arabidopsis [21], the repetitiveness of loci in the genome may correlate 151 with different sRNA silencing mechanisms. LC1 is more associated with transposons 152 (specifically retrotransposons) than LC2. Figure S1 illustrates the overlap of LC1 loci 153 with transposons, low mappability (which corresponds to high repetitiveness) and absence 154 of overlapping genes. Loci were split up mostly due to varying coverage.

155
LC2 is associated with genic regions, exemplified by a LC2 paragon (CRSL003890) in 156 Figure S2 further can be seen in Supplementary Figures S2-5. Interestingly, a CRTOC1 157 transposon is superimposed on both the loci and the gene. LC2 contains the largest 158 loci of all classes. Indeed, the shown sRNA locus is 6kb long, which is well outside the 159 normal locus size range.

160
All miRNA containing loci are in LC3 along with the majority of DCL3-and 161 AGO3-dependent loci. Of the AGO3-dependent loci, 73% have a predominance for 162 U at the 5' end, consistent with AGO3's strong U preference [38]. LC3 loci also 163 demonstrated common expression across wild type libraries and enrichment for 21 nt 164 sRNAs. Interestingly, LC3 contained 130 out of a total of 149 loci that exhibit evidence 165 of phasing.

166
LC4 sRNAs were similarly represented in most wild type libraries suggesting con-167 stitutive or housekeeping roles. However, unlike LC3, there was a lack of sRNA size 168 specificity with 95% having a bias for sRNAs larger or smaller than the modal 20 and 21 169 nt sRNAs. DNA transposons are associated with 11% of LC5 for which there is a strong 170 bias for sRNAs with a C at the 5'-end potentially indicative of different AGO protein 171 8/23  followed an approach that allows the data-driven identification of distinct types of sRNA 186 loci [12,13]. Importantly, the MCA uses a wide range of features as inputs to enable the 187 robust identification of clusters, which could have not been derived by using individual 188 features. Inspection of the feature associations for each cluster enables the validation of 189 the clustering, as well as the dissection of important and unimportant features.

190
By reporting the first comprehensive map of sRNA loci in a unicellular organism 191 we demonstrate the multi-applicability of our pipeline, which was previously used for 192 Arabidopsis [12,13], for locus map generation, annotation and clustering. Our results 193 confirm (i) the overall size bias for 21nt sRNAs, (ii) the lack of enrichment in the 24nt 194 fraction associated with the RdDM in higher plants, and (iii) the bias for U and A at the 195 5'-end of sRNAs [26,36,38,44]. Importantly, our analyses groups known Chlamydomonas 196 miRNAs into the same cluster, LC3. Together, these results indicate the robustness of 197 our approach and the overall validity of our findings.

198
The characteristics of LC3 loci indicate that this cluster includes canonical miRNAs 199 along with other sRNAs, all produced by DCL3-mediated cleavage of precursors and 200 then bound by the AGO3 effector protein. As these non-miRNAs have most, but not all, 201 features corresponding to bona fide miRNAs, we propose that they derive from immature 202 miRNA precursors, which will evolve to either become canonical pre-miRNAs or are on 203 a pathway for potential elimination from the genome.

204
A potential novel class of Chlamydomonas sRNAs, in LC4, is independent of DCL3 205 and AGO3 and, among other peculiarities, they lack bias for U/A at the 5'-end and 206 they are variable in size. The variability in size may indicate imprecise processing by 207 DCL1/2 possibly due to their lack of PAZ domain, which is thought to confer measuring 208 specificity [20]. Importantly, loci in LC4 tend to be large, commonly expressed, and 209 have low repetitiveness, suggesting that they do not represent noise. Further analyses of 210 sRNA species from dcl1/2 and ago1/2 mutants will provide key insight on this matter. 211 It is likely that LC1-3 are also processed by DCL1/2 because they have a high 21nt 212 bias. In that scenario a PAZ domain-like measuring function may be performed by other 213 10/23  dsRNA-binding proteins, as with DGCR8 in the microprocessor complex of animals [28]. 214 An RNA-binding protein DUS16, similarly partners with DCL3 for the proper processing 215 of miRNAs [40,41].

216
Our findings also raise questions about DNA methylation and genomic defense 217 from transposons in Chlamydomonas. In Arabidopsis, the RdDM pathway is well 218 characterized [22] and, in the Arabidopsis map, a subset of loci show a combination of 219 RDR2 dependence, bias for 24nt sRNAs (the size class which directs RdDM), and overlap 220 with methylated DNA regions including transposons (the primary targets for RdDM 221 pathway) [13]. The crucial RdDM machinery has not been identified in Chlamydomonas 222 and, in this study consistent with previous findings, no enrichment in the 24nt size fraction 223 was found in any of the sRNA types. However, the presence of loci overlapping with 224 methylated retrotransposons (LC1) and with methylated genic regions (LC2) suggests a 225 possible role of sRNAs during establishment and/or maintenance of methylation states 226 in Chlamydomonas genome. If this connection between sRNA and methylation does 227 exist, then there is a possibility that these loci may represent a distinct form of RdDM 228 in Chlamydomonas that is highly divergent from that of higher plants. Moreover, LC5 229 loci, with their zygotic-specific expression and DNA transposon overlap, could possibly 230 represent a transposon silencing system activated specifically during the zygotic stage. 231 Our data-driven approach to identify and classify sRNA loci is intended primarily for 232 the purpose of hypothesis generation, giving possible insights into the biosynthesis and 233 function of sRNAs in Chlamydomonas and, potentially, in other unicellular eukaryotes. 234 The results have allowed the identification of a number of areas for further exploration, 235 as discussed above. Overall, when compared to the previously presented Arabidopsis 236 locus map [13], the results indicate both convergence with higher plants (e.g. LC3) 237 11/23 as well as diversification (e.g. LC4). These findings support a view that significant 238 diversification in sRNA pathways in Chlamydomonas occurred after division from higher 239 plants. Further studies with mutant strains will enable deeper characterisation aiming to 240 elucidate the functional significance of sRNAs in the unicellular algae Chlamydomonas 241 as well as the evolution of RNA silencing pathways in diverse linages.  Chlamydomonas cells were grown in liquid TAP media with constant shaking at 25 250 degree celcius under continuous illumination until cultures reached saturation. Total 251 RNA was extracted from cell pellets with TRIzol reagent (ThermoFisher) by following 252 a protocol previously described [26]. sRNA libraries were prepared directly from total 253 RNAs by using the TruSeq v2 RNA Sample Preparation Kit (Illumina) following the 254 manufacturer instructions, and then they were further sequenced on a HiSeq 2000 255 sequencer. Sequencing data were preprocessed using the ADDAPTS pipeline and 256 tracking system [13,27]. After 3' adaptor removal, all sequences <15 nt in length were 257 discarded, and the remaining sequences were aligned against C. reinhardtii genome using 258 the bowtie alignment program tolerating zero mismatches [15,24]. Only sequences with 259 at least one perfect match were included in further analyses. The C. reinhardtii reference 260 genome and transcriptome used were Phytozome v5.0 and version 281, respectively. 145 sRNA libraries consisting of 54 replicate groups were used as the basis for anal-263 ysis. 142 libraries were internally generated laboratory datasets [26,37] while 3 were 264 from [18]. A locus map for Chlamydomonas sRNAs was produced using the Bioconductor 265 (www.bioconductor.org) package segmentSeq [12]. This package uses a heuristic approach 266 based on sRNA densities to establish an initial locus map which is then refined using 267 Bayesian methods to take into account the separate replicate groups. Sequences that 268 aligned to the genome more than 200 times were excluded from the segmentation. Any 269 gap of greater than 100nt with no reads was sufficient to split a locus. The quality of 270 the segmentation was analysed using a series of diagnostic plots (Figure 2). The locus 271 map was formed from all loci with a false discovery rate (FDR) of less than 0.05.

272
Locus Map Annotation

273
The locus map was annotated with intrinsic locus characteristics and publicly available 274 annotations using functions mainly run in the R programming language [30]. The full 275 annotated locus map can be found in Supplementary File S1. To investigate whether there was a predominance for particular 5' nucleotides of the 282 sRNAs originating from each locus, each of the four nucleotides were tested as to whether 283 levels differed significantly from the normal ratio across all loci assuming a binomial 284 distribution [3]. The Benjamini-Hochberg procedure was used to minimise the FDR [1]. 285

286
Repetitiveness is a measure of the extent to which small RNAs that align to a given 287 location may also align to other genomic locations. We assessed this at each locus using 288 the following equation: where x i is the number of times the ith small RNA within the locus is sequenced 290 and m i is the number of genomic locations to which that small RNA aligns. This gave 291 a score between 1 and 0 (1 being highly repetitive, 0 not at all repetitive) which was 292 divided into three groups (low R < 0.6, median 0.6 < R < 0.9, and high with R > 0.9) 293 corresponding to the peaks of the distribution shown in 5A. Strand bias ratios were calculated from loci in wild type samples with more than five 296 reads. Confidence intervals for strand bias at each locus were calculated assuming a 297 binomial distribution and using a modified Jeffreys interval [3]. Loci were then classified 298 as having a strong (< 0.2 or > 0.8), medium (between 0.2 and 0.4 or between 0.6 and 299 0.8) or no bias (between 0.4 and 0.6) 5B. Since both 20nt and 21nt long sRNAs were shown to be predominant sRNA classes in 302 Chlamydomonas [26] and Figure 1 we calculated the ratio between them. Furthermore, 303 there are potential physiological roles proposed for larger [43] and smaller [33] sRNAs 304 in Arabidopsis. We therefore assigned each locus according its predominant sRNA 305 population namely using 20nt and 21nt sizes as thresholds.

306
Phasing 307 Secondary phased (pha)siRNAs production was detected using PhaseTank, a perl based 308 tool which searches for regions of a minimum of four 21nt sRNAs and computes a phasing 309 score for each one [11]. Overlap between sRNA loci and phased regions were then used to 310 annotate loci as phased or not. The phasing score was reported as part of the annotation 311 in Supplementary File S1 ranging between 0 (no phasing) and 313. This allowed the 312 identification of loci overlapping with a medium phasing region (0 < score < 60) or with 313 a high phasing region (score > 60).

315
Loci were classified as to how ubiquitous their expression was. 10 wild type replicate 316 groups were identified and loci present in more than 5 defined as common (expres-317 sion:common in Fig 5), between 1 and 5 as inbetween and only 1 as specific.

318
Mutant, Strain and Developmental Stage Annotation 319 DCL3 and AGO3 dependence was calculated by determining loci present in at least one 320 wild type replicate group but not in any of the mutant libraries. We also determined loci 321 found specifically in one of the three used strains (CC4350, CC1883, and J3). Presence 322 of loci specifically in libraries of either vegetative or zygotic developmental stages was 323 also calculated.

325
Overlap with various genome features was calculated including predictions for genes, 326 5' and 3' UTRs, exons and coding sequences obtained from the Phytozome genomics 327 portal (phytozome.jgi.doe.gov) using the most recent Chlamydomonas genome anno-328 tation (v5.5) [24]. Promoter regions were calculated as the 500 bp flanking each gene. 329 Transposon locations were established by processing the Chlamydomonas repeat masker 330 file to remove any sequence not explicitly identified as known transposons. Transposons 331 were then classified (Supplementary Table S5) according to the unified system proposed 332 by Wicker et al. and using the extensive literature concerning transposon identities [39]. 333 Predictions of miRNAs, inverted repeats (IRs) and tandem repeats (TRs) were sourced 334 from internal lab data with the IRF and TRF algorithms used to identify IRs and TRs 335 respectively [36].

337
Bisulphite sequencing data generated was processed using yama 338 (https://github.com/tjh48/YAMA) with the heuristics based functionality of 339 segmentSeq used to determine loci enriched in CG, CHH or CHG methylation. sRNA 340 Loci were then probed for overlap with these methylation enriched loci.

14/23
Multiple Correspondance Analysis 342 MCA was used to cluster the loci according to their annotations using the CRAN 343 (cran.r-project.org) package FactoMineR with the HCPC function adapted to enable 344 K-means clustering [16]. Some annotations were used as supplementary where they were 345 not predictive of the clustering but their correlations were calculated (Supplementary 346  Table S3).

347
The numbers of dimensions and clusters to select was determined by integrating 348 information from a number of analyses consistent with that applied by Hardcastle et 349 al. [13]. Dimensional reduction techniques like MCA are designed to concentrate the 350 variance explained in the lower dimensions. Thus, at a certain cut-off, higher dimensions 351 can be excluded as not being particularly significant for explaining overall variation. The 352 graphical elbow-method is a common means to do this by considering the % of variation 353 explained for each dimension, with a clear elbox displayed suggesting 6-10 dimensions 354 would be appropriate 6C-E. Using random sampling from the sRNA loci with replacement 355 to generate a consistently sized sample and re-calculating the clustering results we can 356 observe the stability of the clustering for different combinations of dimensions (1-10) and 357 clusters (2-10) ( Figure 6A). For between six and ten dimensions clustering was generally 358 only stable for two, three or six clusters. Based on the combination of these two analyses 359 we determined seven dimensions to be appropriate.

360
To determine whether two, three or six clusters would be most appropriate, we 361 calculated the gap statistic for seven dimensions seeing that two three and six clusters [35]. 362 All showed large differences between the observed and expected sum-of-squares within 363 the clusters compared to the cluster means (Wss) ( Figure 6E). Computing the normalised 364 mutual information (NMI) compared the clustering to annotation feature overlap showed 365 a large increase in NMI for six clusters ( Figure 6C and D). Thus six clusters was selected 366 for primary analyses.

367
A cluster hierarchy was also generated by carrying out clustering for two, three and 368 six clusters each time using seven dimensions which allowed the determination of how 369 loci clustered together for lower values of k 7.   Figure S1. Genome browser view of example LC1 paragon loci (CRSL014210-350). Tracks are annotated as in 4 Figure S2. Genome browser view of an example LC5 paragon loci (CRSL000070). Tracks are annotated as in Figure 4.