Homing in on the rare virosphere reveals the native host of giant viruses

Summary Giant viruses (phylum Nucleocytoviricota) are globally distributed in aquatic ecosystems1,2. They play major roles as evolutionary drivers of eukaryotic plankton3 and regulators of global biogeochemical cycles4. Recent metagenomic studies have significantly expanded the known diversity of marine giant viruses1,5–7, but we still lack fundamental knowledge about their native hosts, thereby hindering our understanding of their lifecycle and ecological importance. Here, we aim to discover the native hosts of giant viruses using a novel, sensitive single-cell metatranscriptomic approach. By applying this approach to natural plankton communities, we unraveled an active viral infection of several giant viruses, from multiple lineages, and identified their native hosts. We identify a rare lineage of giant virus (Imitervirales-07) infecting a minute population of protists (class Katablepharidaceae) and revealed the prevalence of highly expressed viral-encoded cell-fate regulation genes in infected cells. Further examination of this host-virus dynamics in a temporal resolution suggested this giant virus controls its host population demise. Our results demonstrate how single-cell metatranscriptomics is a sensitive approach for pairing viruses with their authentic hosts and studying their ecological significance in a culture-independent manner in the marine environment.

impact on the aquatic environment remains elusive. Current approaches to predict host-virus pairs 48 include computationally correlating between viruses and hosts in metagenomic data 13,14 , finding 49 homology between horizontally transferred genes in the host or viral genomes 1 , and single-cell 50 genomics to detect host and virus DNA in the same cell 15 . Despite these attempts, most 51 metagenome-derived giant viruses still lack reliable predictions of their native hosts. Notably, even 52 single-cell genomics cannot confidently link viruses to their specific hosts because many 53 heterotrophic protists feed on a wide range of viruses in the environment, leading to the detection 54 of viral DNA that is derived from ingestion rather than infection of the native host cells [16][17][18] . 55 Single-cell transcriptomics is an innovative approach to tracking host-virus dynamics by detecting 56 the co-expression of a virus and its host transcriptomes within individual cells 19,20 . This sensitive 57 approach enables the detection of active viral infection even in rare viruses that would otherwise 58 be difficult to detect through conventional analysis of bulk metagenomic or metatranscriptomic 59 data 19,20 . 60 Here, we develop a single-cell metatranscriptomic approach to map specific viral gene markers to 61 host cells across tens of thousands of single-cell transcriptomes using samples collected in natural 62 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint planktonic communities. Using this method, we found dozens of infected cells representing eight 63 distinct pairs of hosts and viruses and unraveled the hosts of several giant viruses from multiple 64 lineages, even when the host comprises less than half a percent of the protist community. Overall, 65 single-cell metatranscriptomics provides a sensitive tool for the identification of the native host of 66 giant viruses and tracking their dynamics in the natural environment. 67 68 Results and discussion 69 To identify novel host-virus interactions in the ocean, we sampled natural plankton communities 70 from an induced Emiliania huxleyi bloom during a mesocosm experiment in the Raunefjorden 71 fjord near Bergen, Norway, in May 2018 21 . During this experiment, seven bags were filled with 72 fjord water and monitored for plankton succession for 24 days. Ten samples at a size fraction of 73 3-20 µm were obtained from four of these bags and fixed on-site, before being processed in the 74 lab (Fig 1a, b). Samples were sequenced using 10X Genomics single-cell (SC) RNA-seq (Fig. 1c), 75 so that cells were barcoded, and each transcript received a unique molecular identifier (UMI). We 76 then mapped the transcripts to a reference database of genes that are conserved in NCLDV and are 77 considered giant virus marker genes, such as viral DNA polymerase beta (PolB), viral type II 78 topoisomerase (TopoII), and major capsid protein (MCP). The expression of these viral marker 79 genes was quantified (Fig. 1d, see Methods for details) and cells with high viral expression were 80 selected (Fig. 1e). Reads from these selected cells were assembled to recover longer transcripts 81 ( Fig. 1f, g). To predict the host, we used sequence homology of assembled transcripts to the 18S 82 rRNA gene (Fig. 1h, i). To identify which viruses are infecting these host cells, reads from selected 83 cells were mapped to the database of viral marker genes (Fig. 1h,

Novel host-virus interactions at a single-cell resolution 105
Out of the 218 'highly infected' cells identified, 75 host-virus pairs were defined at the class or 106 division level for the host, and the family level for the virus (Fig. 2, Supplementary Data 2). Viral 107 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint genes were expressed in protists belonging to diverse and ecologically important taxa such as 108 Chrysophyceae (29%), Prymnesiophyceae (21%), and Dinoflagellata (multiple classes, 9%), as 109 well as the understudied class of Katablepharidaceae (13%) (Fig 2). In about half (56%) of infected 110 cells, viral reads were mapped to multiple families rather than a specific match to one virus lineage. 111 This may imply that the specific virus infecting this host is yet to be discovered and is missing in 112 the reference database. Alternatively, it may also suggest that a single cell can be infected by more 113 than one lineage of viruses, which is known as superinfection 24 . In 44% of the cells (n = 33), at 114 least 90% of viral reads were mapped to a specific virus family (Fig 2). These infected cells 115 represent distinct pairs between eight protist taxa and giant viruses from the order Imitervirales 116 Cryptophytes 27 . So far, no nuclear genome of any Katablepharidaceae has been sequenced, and no 130 virus was described to infect this class. To our knowledge, this is the first report of a specific host 131 infected by an IM_07 virus. Only 19 metagenome-assembled genomes (MAGs) from this viral 132 group are currently available, and all have been found in aquatic ecosystems 8 . To verify the 133 phylogenetic position of the newly identified host-virus pairs, we constructed phylogenetic trees 134 from transcripts assembled from both host cells and their infecting viruses (Extended Data Fig. 1). 135 We thus elucidated multiple connections, including previously unknown interactions, between 136 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint important protist classes and a variety of virus families. Future research should explore if newly 137 identified viruses, that are currently absent from our database, can be detected in these samples. 138

Tracking multiple viral infections co-occurring in a natural population 143
To examine co-occurring viral infections in different protist populations during bloom succession, 144 all reads were mapped to a custom host-virus reference database which represents the different 145 protist groups in the population. This database was generated based on the single-cell 146 transcriptomes of the selected infected cells (Fig. 2). To this host-virus reference, we added genes 147 from EhV and E. huxleyi, which dominated the bloom 22 . Data derived from a total of 18,793 RNA-148 sequenced single cells were mapped to the host-virus reference and visualized using Uniform 149 Manifold Approximation and Projection (UMAP) representation. Each cell was assigned 150 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint taxonomy by 18S rRNA homology (Fig. 3a). As expected, the largest cluster was dominated by 151 Prymnesiophyceae, the class of E. huxleyi, the dominant species in the bloom 21,23 . A smaller, yet 152 distinct cluster consists of mainly MArine STramenopiles (MAST)-3, a group suggested to be one 153 of the most abundant MAST groups in the ocean 28 (Fig. 3a, in red). A secondary cluster consists 154 of diverse assemblages and includes 81 infected cells, identified by highly expressing contigs of 155 viral origin (Fig. 3b, see methods). These cells belong to previously described taxa (Fig 2):  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint Our approach enables mapping active infection events among diverse protist host cells and can 173 provide a sensitive means to detect rare infected cells. As a case study, we tracked 174 Katablepharidaceae cells for which we detected infection by giant viruses of the IM_07 family 175 (Fig 2, Fig. 3b). Katablepharidaceae represent less than 0.5% of all detected cells ( (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint with other viruses of the Imitervirales group (Extended Data Fig. 2), suggesting that these genes 204 were horizontally transferred from host genomes to viral genomes early in their evolution 3 . HSP90 205 and HSP70 are also among the most highly expressed viral genes (Fig 4b, Supplementary Data 4). 206 Heat shock proteins play a role in the life cycle of many viruses, mostly in viral replication, and in 207

Population dynamics of Katablepharidaceae following a viral infection 233
Upon pairing a specific giant virus to its host, we track host-virus dynamics during algal bloom 234 succession in the mesocosm experiment (Fig 4a). The relative abundance of Katablepharidaceae 235 was monitored using amplicon sequence variants (ASVs) of 18S rDNA 21 and compared to 236 different taxonomic classes that were detected in the SC-RNAseq (Fig. 4c). These results reveal 237 the heterogeneity of the population and confirm the presence of the classes that were found on the 238 single-cell level using SC-metatranscriptomics. The ASV analysis confirmed the rarity of 239 Katablepharidaceae in the community compared to other taxa analyzed here, such as 240 Prymnesiophyceae, Dinoflagellata, and Cercozoa. Interestingly, the Katablepharidaceae class 241 increased in abundance from 1% of the community on day 15 to a maximum of 6% on day 19, 242 followed by a population decline back to 1% only two days later, on day 21 (Fig. 4d). This sharp 243 demise in the relative abundance of Katablepharidaceae was observed after day 20, the same day 244 in which we detected that 86% (n=26) of the observed Katablepharidaceae single cells were 245 infected. This strongly suggests that the IM07 lineage virus was responsible for the population's 246 demise. The decrease in the population coincides with a sharp increase in the population of 247 Labyrinthulea, which was suggested to act as decomposers during bloom demise 21 . If these protists 248 benefited from decomposing the infected E. huxleyi cells, it is also possible that they similarly 249 benefitted from decomposing the infected Leucocryptos. to which 900 µl of chilled HPLC grade 100% methanol was added. Then, samples were incubated 306 for 15 min on ice, and stored at -80°C until further analysis. 307 308

Library preparation and RNA-seq sequencing using 10X Genomics 309
For analysis by 10X Genomics, tubes were defrosted and gently mixed, and 1.7 ml of the samples 310 were transferred into an Eppendorf Lowbind tube and centrifuged at 4°C for 3 min at 3000g. The 311 PBS/methanol mix was discarded and replaced by 400 µl of PBS. Cell concentration was measured 312 using an iCyt Eclipse flow cytometer (SONY), based on forward scatter. Cell concentration ranged 313 from 1044 cells ml -1 to 9855 cells ml -1 . All concentrations were brought to 1000 cells ml -1 to target 314 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint 7000 cells recovery, according to the 10X Genomics Cell Suspension Volume Calculator Table  315 provided in the user guide. Two libraries were prepared and sequenced on different occasions: 316 B4T19 and B7T17 in January 2020 and B3T15, B3T20, B4T13, B4T15, B4T20, B6T17, B7T16, 317 and B7T18 in August 2020. All the following steps for library preparation were accomplished 318 according to the manufacturer's protocol for Chromium Next GEM Single Cell 3' Reagent Kit 319 v3.1. Libraries were sequenced using NextSeq ® 500 High Output kit (75 cycles). 320 Raw 10X reads were trimmed using TrimGalore (v. 0.6.5), a Cutadapt wrapper 39 , and the poly-A 321 tail was removed. 322 323

Detection of infected cells in the single-cell RNA-seq data using a custom viral genes database 324
To detect viral transcripts, a reference was built from a database of conserved genes in NCLDVs 6 . 325 To remove redundancy, the proteins were clustered using CD-HIT at 90% identity 40 . From this 326 database, a reference was created using the 10X Genomics Cell Ranger mkref command. The Cell 327 Ranger Software Suite (v. 5.0.0) was used to perform barcode processing and single-cell unique 328 molecular identifier (UMI) counting on the fastq files using the count script (default parameters). 329 For downstream analysis, 972 cells were selected as 'highly infected' based on the following 330 criteria: (a) expression of more than one viral gene (>1), (b) expression of at least one gene with a 331 UMI count greater than one (>1), and (c) cell expresses in total ≥10 viral UMIs. 332

Identifying the taxonomy of individual cells by sequence homology to ribosomal RNA 333
Raw reads from each cell were pulled by the cell's unique barcode identifier using seqtk v. 1.2. 334 Reads were then trimmed, and poly-A was removed, using TrimGalore (v. 0.6.5), a Cutadapt 335 wrapper 39 . Trimmed reads from each cell were assembled using rnaSPAdes 3.15 41] with kmer 336 sizes: 21,33. Raw reads pulling, trimming, and assembly was wrapped using an in-house script. 337 To identify the taxonomy of the cells, assembled contigs from each cell were matched against 18S 338 rRNA sequences from the Protist Ribosomal Reference (PR2) 42 and metaPR2 43 . To remove 339 redundancy, the sequences in each of these databases were clustered using CD-HIT v. 4.6.6 at 99% 340 identity 40 . Contigs were filtered using SortMeRNA v. 4.3.6 44 with default parameters against the 341 PR2 database and then aligned to the PR2 and metaPR2 databases using Blastn 45 , at 99% identity, 342 E-value ≤ 10 -10 and alignment length of at least 100 bp. Contigs were ranked by their bitscore, and 343 only the best hit was kept for each contig. Each contig was assigned to one of the following 344 taxonomic groups that were prevalent in the sample: the classes Bacillariophyta, 345 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint Pseudofungi, Lobosa (Amoebozoa), Ciliphora (Ciliates), Dinoflagellata and Cercozoa. Contigs 347 that matched other groups were assigned as "other eukaryotes". Contigs that matched more than 348 one of these taxonomic groups were considered non-specific and were therefore ignored. Cells that 349 transcribe 18S rRNA transcripts homologous to more than one taxonomic group were 350 conservatively omitted. None of the cells that were assigned "other eukaryotes" had contigs with 351 conflicting annotations (contigs matching different classes). 754 Cells whose best matching virus was coccolithovirus were omitted from the downstream 366 analysis since EhV-infected cells were already reported to be abundant in the algal bloom 23 , and 367 our analysis aims to explore other host-virus pairs. 368

Plotting host-virus pairs in a Sankey plot for host cells and their infecting giant viruses 369
Of the 218 cells detected as infected by viruses other than EhV, 75 cells were selected that could 370 be identified using 18S rRNA and have at least 10 reads mapped to one of the virus families 371

Single-cell RNA-seq data alignment to a custom reference 398
A new host-virus reference database was curated from the transcriptome of the infected cells (Fig.  399 2). Repetitive sequences were removed using BBduk (BBtools 38.90) 55 . An Additional long 400 repetitive sequence was removed manually. A database of E. huxleyi and EhV genes, which were 401 shown to be abundant in the samples 23 , was also added to this reference to specifically detect E. 402 huxleyi cells and to avoid a non-specific mapping of reads from these cells to other contigs. For 403 EhV, the predicted CDSs in the EhVM1 were used as a reference 56 . For the host, an integrated 404 transcriptome reference of E. huxleyi was used as a reference 57 . Viral transcripts in the database 405 were identified using a homology search against a custom protein database as described above. A 406 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint reference was created from the database using the Cell Ranger mkref command. Fastq files were 407 mapped to this reference database using 10X Genomics Cell Ranger v. 5.0.0 count analysis. To identify the taxonomy of each of the detected cells, reads from each cell were assembled 424 independently. The taxonomy of the cells was determined by 18S rRNA homology to one of the 425 following groups, which were abundant in the population: the classes Bacillariophyta, 426 Prymnesiophyceae, Chrysophyceae, MAST-3 and Katablepharidaceae, the divisions Pseudofungi, 427 Ciliphora (Ciliates), Dinoflagellata and Cercozoa. Other taxonomic groups were clustered under 428 "Other eukaryotes". 18,793 cells were identified this way, and 11,728 cells that could not be 429 identified were excluded from the plot for convenience. Cells with 18S rRNA contigs homologous 430 to more than one taxonomic group were also conservatively omitted. Cells expressing at least 10 431 viral UMIs as described above were considered infected. 432 433

Identifying the Leucocryptos host and its virus using homology search 434
To better identify the detected Katablepharidaceae cells and to identify their infecting virus, 26 435 infected Katablepharidaceae cells from bag #4, day 20 were selected. Reads from these cells were 436 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint retrieved using the unique molecular identifier, and then trimmed using TrimGalore v. 0.6.5, a 437 Cutadapt wrapper 39 . Trimmed reads from all these cells were assembled altogether using 438 rnaSPAdes v. 3.15 41 . To identify the specific Katablepharidaceae host, assembled contigs were 439 matched against the PR2 rRNA database and the using blastn at 90% identity, E-value ≤ 10 -10 , and 440 alignment length ≥ 100bp. Contigs best matched to an unknown Katablepharidaceae (>99% 441 nucleotide identity), but after removing unidentified genera, these contigs best matched (>95% 442 nucleotide identity) the Katablepharidaceae species Leucocryptos marina. Transcripts that 443 matched classes other than Katablepharidaceae were matched against the entire NCBI database 444 using the NCBI web server 58 . They too mostly matched Katablepharidaceae genes, specifically 445 parameters). A diagnostic tree was first made with FastTree 2.1.10 50 for pruning long branches 464 before making the final tree with IQ-TREE 51 . All but three ASVs and one PR2 sequence clustered 465 together with the assembled Leucocryptos transcript, verifying the taxonomy of 97% of the ASVs 466 used in the relative abundance analysis (Extended Data Fig. 3). 467 468 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 27, 2023. ; https://doi.org/10.1101/2023.06.27.546645 doi: bioRxiv preprint