Legionella pneumophila CRISPR-Cas suggests recurrent encounters with Gokushovirinae

Legionella pneumophila is a ubiquitous freshwater pathogen and the causative agent of Legionnaires’ disease. This pathogen and its ability to cause disease is closely tied to its environmental encounters. From phagocytic protists, L. pneumophila has “learned” how to avoid predation and exploit conserved eukaryotic processes to establish an intracellular replicative niche. Legionnaires’ disease is a product of these evolutionary pressures as L. pneumophila uses the same molecular mechanisms to replicate in grazing protists and in macrophages of the human lung. L. pneumophila growth within protists also provides a refuge from desiccation, disinfection, and other remediation strategies. One outstanding question has been whether this protection extends to phages. L. pneumophila isolates are remarkably devoid of prophages and to date no Legionella phages have been identified. Nevertheless, many L. pneumophila isolates maintain active CRISPR-Cas defenses. So far, the only known target of these systems has been an episomal element that we previously named Legionella Mobile Element-1 (LME-1). In this study, we have identified over 150 CRISPR-Cas systems across 600 isolates, to establish the clearest picture yet of L. pneumophila’s adaptive defenses. By leveraging the sequence of 1,500 unique spacers, we can make two main conclusions: current data argue against CRISPR-Cas targeted integrative elements beyond LME-1 and the heretofore “missing” L. pneumophila phages are most likely lytic gokushoviruses. IMPORTANCE The causative agent of Legionnaires’ disease, an often-fatal pneumonia, is an intracellular bacterium, Legionella pneumophila, that normally grows inside amoebae and other freshwater protists. Unfortunately for us, this has two major consequences: the bacterium can take what it has learned in amoebae and use similar strategies to grow inside our lungs; and these amoebae can protect Legionella from various forms of chemical and physical disinfection regimes. Legionella are ubiquitous in the environment and frequently found in man-made water systems. Understanding the challenges to Legionella survival before it reaches the human lung is critical to preventing disease. We have leveraged our earlier discovery that L. pneumophila CRISPR-Cas systems are active and adaptive – meaning that they respond to contemporary threats encountered in the environment. In this way, CRISPR arrays can be considered genomic diaries of past encounters, with spacer sequences used to identify elements that may impinge on the pathogen’s survival. One outstanding question in the field is whether L. pneumophila is susceptible to phage, given the presumptive protection provided by intracellular replication within its eukaryotic hosts. In this work, we use CRISPR spacer sequences to suggest that the heretofore “missing” L. pneumophila phage are most likely lytic gokushoviruses. Such information is critical to the long-term goal of developing of new strategies for preventing colonization of our water systems by Legionella and subsequent human exposure to the pathogen.

clearest picture yet of L. pneumophila's adaptive defenses. By leveraging the sequence of 1,500 24 unique spacers, we can make two main conclusions: current data argue against CRISPR-Cas 25 targeted integrative elements beyond LME-1 and the heretofore "missing" L. pneumophila 26 phages are most likely lytic gokushoviruses. 27 28

IMPORTANCE 29
The causative agent of Legionnaires' disease, an often-fatal pneumonia, is an intracellular 30 bacterium, Legionella pneumophila, that normally grows inside amoebae and other freshwater 31 protists. Unfortunately for us, this has two major consequences: the bacterium can take what it 32 has learned in amoebae and use similar strategies to grow inside our lungs; and these amoebae 33 can protect Legionella from various forms of chemical and physical disinfection regimes. 34 Legionella are ubiquitous in the environment and frequently found in man-made water systems. 35 Understanding the challenges to Legionella survival before it reaches the human lung is critical 36 to preventing disease. 37 We have leveraged our earlier discovery that L. pneumophila CRISPR-Cas systems are 38 active and adaptive -meaning that they respond to contemporary threats encountered in the 39 environment. In this way, CRISPR arrays can be considered genomic diaries of past encounters, 40 with spacer sequences used to identify elements that may impinge on the pathogen's survival. 41 One outstanding question in the field is whether L. pneumophila is susceptible to phage, given 42 the presumptive protection provided by intracellular replication within its eukaryotic hosts. In 43 this work, we use CRISPR spacer sequences to suggest that the heretofore "missing" L. 44 pneumophila phage are most likely lytic gokushoviruses. Such information is critical to the long-45 term goal of developing of new strategies for preventing colonization of our water systems by 46 Legionella and subsequent human exposure to the pathogen. 47

INTRODUCTION 49
Legionella pneumophila is a Gram-negative, intracellular bacterium that is ubiquitous in 50 freshwater environments (1-3), where it replicates within a wide range of protist hosts (4). If a 51 contaminated water source becomes aerosolized and inhaled, L. pneumophila can infect human 52 lung macrophages and cause a severe pneumonia known as Legionnaires' disease (5, 6). 53 Replication in the accidental human host uses similar molecular strategies to those used to infect 54 protists (7). As humans are an evolutionary dead end to the pathogen (8), understanding how L. 55 pneumophila is able to persist and replicate in environmental reservoirs is critical to limiting its 56 ability to cause human disease. 57 Protozoan hosts not only serve as a replicative niche to L. pneumophila (1, 4), but also 58 provide protection from desiccation (9), temperature changes (10, 11) and disinfectants (10-13). 59 One outstanding question is whether these hosts also protect L. pneumophila from predation by 60 foreign genetic elements, such as viral bacteriophages (phages). Notably, phages for the obligate 61 intracellular pathogens Chlamydia psittaci and Chlamydia pecorum have been described (14-17), 62 raising the possibility that phages may access L. pneumophila even within the protection of the 63 host. 64 The first report of Legionella infection by lytic phages isolated from freshwater samples 65 is equal parts promise and obfuscation (18

CRISPR-Cas target search 145
A pipeline to search for targets of L. pneumophila CRISPR-Cas was written as a Unix 146 shell script. L. pneumophila genomes were first searched for putative type I-C, I-F and II-B 147 CRISPR arrays using the repeat structure as a guide. If a CRISPR array was detected, it was 148 subsequently masked to minimize false positive hits to the input genomes. A BLAST database 149 was then constructed containing the Legionella genomes with masked CRISPR arrays to look for 150 integrated genetic elements (see Table 1). The unique spacer library (see above) was queried 151 against the BLAST database (BLASTn parameters: -gapopen 10 -gapextend 2 -reward 1 -penalty 152 -1 -evalue 0.01 -word_size 7) (47). The BLAST search was done twice, once for each strand. 153 The BLAST output was then processed to score each hit. The scoring metric subtracted the 154 length of the alignment between a spacer and a hit from the length of the spacer, then added the 155 number of mismatches between the two sequences. For example, an alignment length of 30, a 156 spacer length of 32 and 0 mismatches would generate a score of 2. This was done to take both 157 alignment length and number of mismatches into account when ordering the hits for downstream 158 analyses (for example, so a hit with 0 mismatches and a short alignment length would not be 159 ranked higher than a hit with 2 mismatches but an alignment length across the entire spacer). 160 After scoring, the BLAST output was converted into a BED format. During this step, the putative 161 target sequences were extended to account for alignment length, followed by the addition of a 3 162 nucleotide extension on either end of the sequence for downstream protospacer adjacent motif 163 (PAM) filtering. SAMtools (55) and BEDTools (56) were used to extract the putative target 164 sequences and their flanking regions from the input genomes. Finally, the pipeline searched for 165 canonical PAMs for the type I-C, I-F and II-B systems in the flanking region of the putative 166 target sequence. Although our pipeline searched for type I-C, I-F and II-B PAMs, we considered 167 this secondary information and were intentionally PAM agnostic when assigning spacers to 168 target groups to avoid missing potential targets that have since escaped through PAM mutations 169 or those targeted by non-canonical PAMs. Analysis of the PAM sequences present in the target 170 sequences showed that many target sequences did contain the canonical PAM ( Figure S3). 171 The pipeline was also used to search for exogenous threats to the bacterium in viral, 172 plasmid and metagenomic sequences and run as described above (data sets listed in Table 1). The 173 output from the pipeline can be found in Table S6 for complete/draft genomes, and Table S7 for 174 metagenomic data. CRISPR arrays containing spacers with ascribed targets were visualized with 175 CRISPRStudio (57). PAM sequence logos for the hits were plotted using ggseqlogo (v 0.1) (58). 176 177

Major capsid protein phylogeny of targeted Microviridae genomes 178
A phylogeny of Microviridae major capsid proteins was generated using the amino acid 179 sequences from 3014 genomes (Table S8). The amino acid sequences were aligned using 180 MUSCLE (44) and the tree was generated with FastTree (v 2.1.11) using the default settings 181 (45). The resulting tree was rooted to the Bullavirinae clade and visualized using the iTOL web 182 server (46). The sub-family designations for each phage based on the literature were plotted as a 183 colour-coded ring circling the tree (Table S8). The only target of L. pneumophila CRISPR-Cas identified to date is an integrated mobile 204 element, LME-1 (21). It has been previously reported that LME-1 contains a number of predicted 205 proteins of phage origin (20, 21), so we decided to revisit whether variants of LME-1 or other 206 integrated L. pneumophila phage-like elements could be identified using established prophage 207 prediction tools. We subjected a set of over 600 publicly available L. pneumophila genomes to 208 analysis by two complementary prophage prediction programs, PhiSpy (41) and VirSorter (42) 209 (Table S1), which use different methods to identify potential phage-like sequences. PhiSpy  Of 35 predictions that overlapped between VirSorter and PhiSpy, 7 were LME-1 variants 216 -flanked by the previously described LME-1 attachment site and containing nearly identical 217 gene content (Table S2). Each of the remaining 28 predicted regions contained csrT, a regulatory 218 gene which is co-inherited with type IV secretion systems from Legionella integrative 219 conjugative elements (ICEs) (43) ( Table S2). By CsrT phylogeny, these ICEs all fall into the 220 previously described group IV Lgi elements ( Figure S1) (43). 221 To determine if any additional LME-1 variants might have been missed by our analyses, 222 we searched each L. pneumophila genome for a signature of LME-1 integration: the presence of 223 two LME-1 attachment (att) sites. This analysis identified 8 att-flanked insertions of >29 kb 224 (Table S3). These integrants corresponded to all the previously identified LME-1 variants above, 225 along with one additional LME-1 variant missed by PhiSpy. We also identified 24 instances of a 226 short, 469 bp insertion at the LME-1 att site that shares no sequence similarity to LME-1, 227 phages, or any other known sequence (Table S3). Unlike LME-1 insertions, the second att site 228 for these instances contain several substitutions and insertions, suggesting against a 229 contemporary integration event. 230 We next examined this expanded set of LME-1 variants for differences in gene content 231 and sequence similarity to phage genes. Based on core genome phylogeny, the LME-1s 232 segregate into 5 distinct variants with a set of 32 core genes and 17 accessory genes (Figure 1, 233 Table S4). We examined each LME-1 predicted gene product for similarity to phage proteins 234 using HHpred and BlastP. Strikingly, 27/32 of 235 the core LME-1 genes share similarity to 236 phage proteins (and 6/17 accessory genes). 237 The preponderance of phage-like genes in the 238 LME-1 core (including key phage structural 239 proteins) clearly points to a phage origin (20,

CRISPR-Cas defenses 247
Based on our analysis of 18 L. 248 pneumophila isolates, we previously 249 identified LME-1 as the only recurrent target 250 of L. pneumophila CRISPR-Cas (21). 251 Figure 1 | Core orthologous genes in LME-1 enriched for sequence similarity to known phage genes. Orthologous groups for LME-1 genes were determined using OrthoMCL. The core LME-1 genomes were aligned with MUSCLE; a phylogeny was constructed using FastTree and subsequently visualized in ggtree. Orthologous group presence/absence in each LME-1 genome was visualized with ggplot2. Red dots adjacent to each orthologous group indicate sequence similarity to known phage genes via BLASTp and/or HHPred predictions (Table S4) (Table S5) containing type I-C systems ( Figure 2B). This 271 is consistent with our previous observations that at least two of these systems (I-C and I-F) are 272 adaptive (21), suggesting that additional sampling of spacer sequences will continue to reveal 273 new encounters between L. pneumophila and its invaders. 274

LME-1 is the only integrative element targeted by Legionella CRISPR-Cas systems 275
With our expanded spacer library, we next sought to identify the matching protospacer 276 sequences targeted by these systems. Our previous analysis, limited in both spacer sequences 277 (queries) and L. pneumophila genomes (to search), nevertheless identified two variants of a 278 LME-1 as frequent targets of all 3 types of L. pneumophila CRISPR-Cas (21). Backed by our 279 prophage analysis above, we sought to determine if LME-1 remained uniquely targeted relative 280 to other integrated genomic elements. We queried our expanded spacer set against a local 281 BLAST database of over 600 L. pneumophila genomic sequences to search for any integrated 282 elements that overlap with our prophage analysis or may have been missed (Table 1, Figure 3A) 283 (38, 39). 284 Even within the expanded spacer catalog, LME-1 remains the only integrated target of L. 285 pneumophila CRISPR-Cas. Thirty-two spacers target the element, representing approximately 286 2% of the unique spacer library (Table S6). LME-1 targeting spacers are present in 12 distinct 287 CRISPR-Cas arrays, covering all three CRISPR-Cas sub-types (Table 2, Figure 3B). Within 288 individual arrays, we observe multiple instances of adjacent LME-1 targeting spacers, consistent 289 with repeat exposure and primed spacer acquisition (28) ( Figure 3B). Of the 32 spacers that 290 target LME-1, the vast majority (28) target all variants. 291 While two additional spacers mapped back to other L. pneumophila genomic targets 292 (Table 2, Figure 3B, Table S6), neither fall within regions predicted by PhiSpy or VirSorter.  (Table S1) 299 or even a larger, general database of plasmid sequences (63) ( Table 1). Somewhat surprisingly, 300 LME-1 remains the only integrative element (prophage-like or otherwise) targeted by L. 301 pneumophila CRISPR-Cas. 302 303 Figure 3 | L. pneumophila CRISPR-Cas systems repeatedly target LME-1 and Microviridae phages. A) A Unix shell script was written to search for targets of L. pneumophila CRISPR-Cas. This pipeline masked putative type I-C, I-F and II-B CRISPR arrays in input genomes before creating a local BLAST database. It then queried the unique spacer library against this database using BLAST and extracted putative target sequences for downstream analyses. B) L. pneumophila CRISPR arrays possessing spacers with ascribed targets were visualized with CRISPRStudio; greyed boxes represent spacers without ascribed targets, while coloured boxes represent spacers with ascribed targets. Related arrays are grouped together, with spacer loss shown as dashes and shared spacers between those arrays shown in a lighter opacity.

L. pneumophila CRISPR-Cas targets phages 304
The observation that LME-1 is the only prophage-like element targeted by CRISPR-Cas 305 argues against widespread lysogeny in L. pneumophila and is consistent with prior observations 306 that the Legionella genus is remarkably prophage-poor (22). The question remains, however, as 307 to whether one or more lytic phages might pose an exogenous threat to L. pneumophila survival. 308 To look for evidence of past L. pneumophila-phage encounters, we used the unique spacer 309 library to query an extensive database of viral and phage sequences with over 28,000 viral 310 genomes from NCBI (Table 1) (39). Within this collection, only one class of targets presented a 311 signature similar to that of LME-1 (both in the number of unique spacers and the diversity of 312 systems involved). Phages from the Microviridae family were, like LME-1, frequent targets of L. 313 pneumophila CRISPR-Cas: targeted by 28 spacers across 11 distinct arrays (Table 2, Figure 3B, 314 Table S6). Many of these spacers are found near the 5' end of the array, suggesting a recent 315 acquisition event as this is where new spacers are generally acquired (28) ( Figure 3B). As with 316 LME-1, several instances of adjacent Microviridae-targeting spacers can be found within 317 individual arrays -again suggestive of multiple exposures and consistent with primed spacer 318 acquisition (28) ( Figure 3B). 319 Since many of the targeted Microviridae genomes were assembled from metagenomic 320 reads and subsequently deposited into NCBI, we reasoned that there could be additional targets 321 of L. pneumophila CRISPR-Cas present in metagenomic data. As such, we next searched for 322 spacer hits across several metagenomic datasets (Table 1) (64-69). These analyses identified 7 323 additional spacers that targeted contigs of likely Microviridae origin, along with one additional 324 spacer that targeted a contig of likely Caudovirales origin (Table 2, Figure 3B, Table S7). 325 In total, 35 L. pneumophila spacers target sequences of Microviridae origin: 25 spacers 326 target the conserved major capsid protein VP1, 4 target the pilot protein VP2, and 5 target the 327 DNA replication protein VP4. One spacer targets a non-coding region (Table 3). As with the 328 spacers against LME-1, Microviridae-targeting spacers come from a genetically and 329 geographically diverse set of L. pneumophila isolates. As with LME-1, this suggests that one or To determine if we could further narrow the targets of L. pneumophila CRISPR-Cas to 344 one or more of these subfamilies, we first generated a major capsid (VP1) phylogeny with all 345 3000 available Microviridae genomes (Figure 4, Table S8). We also included 14 spacer-targeted the only additional hit is to genomes clustering with the proposed Pichovirinae sub-family; a 367 single spacer notable mainly because it targets a non-coding region which one would not expect 368 to be well conserved (76) (Fig S2). 369 Not surprisingly, the vast majority of Gokoshovirinae-targeting spacers map back to regions of 390 high conservation, likely reflective of the overrepresentation of conserved sequences within any 391 database ( Figure 5A). Nevertheless, two spacers flank the predicted host-determinant variable 392

Figure 5 | Spacers that meet the stringent criteria largely map to conserved regions of their target genes. A)
The VP1, VP2 and VP4 gene sequences were aligned for the targeted Gokushovirinae genomes that meet the stringency criteria (5 or fewer mismatches with its protospacer, a canonical PAM for its CRISPR subtype, no mismatches in the seed sequence region (positions 1-5, 7, 8)) and are the top hit for a spacer. Nucleotide sequence conservation is shown as a heatmap within the arrow, while spacers mapped onto the alignment are shown as purple boxes. The two boxes marked with a red asterisk contain an insertion in the protospacer region that is present in 1/15 genomes used in the alignment. The host determinant region of the major capsid protein is denoted with a grey box. B) Sequence alignments of two spacers that border the host determinant region with their respective top hit phage genome. The spacer name is shown in purple, mismatches between the spacer and the protospacer are non-bolded characters, and the grey box shows the portion of the spacer that is in the host determinant region. C) Sequence alignment of a spacer that is contained within the host determinant region. Labelling as in B. region and extend into it ( Figure 5B) and one spacer maps to internal variable sequence (Figure  393 5C). This is potentially quite informative, given the contribution of the capsid variable region to 394 Gokushovirinae host range. In short, these spacers likely provide the first glimpse into specific 395 phage sequences required to infect L. pneumophila. 396

DISCUSSION 398
We previously demonstrated that a 30+ kb episomal element, LME-1, is a frequent target 399 of an evolutionarily diverse set of L. pneumophila CRISPR-Cas arrays. We followed this 400 observation by demonstrating that these systems actively defend against the transfer of this 401 element between strains (21). We report here that LME-1 represents the only integrated element 402 targeted by L. pneumophila's CRISPR-Cas defenses. Unexpectedly, we also show that L. could support the frequent treatment of high-risk water systems (to either eliminate or prevent 415 colonization by Legionellae) (97). Secondly, the suitability of phage therapy for intracellular 416 pathogens remains an open question. Compared to Chlamydia, L. pneumophila is much more 417 genetically tractable, can be cultured outside of host cells in the laboratory, and grows inside a 418 staggering diversity of eukaryotic hosts (4). As such L. pneumophila would represent a powerful 419 model system by which to understand the feasibility and constraints of using phages to target 420 intracellular pathogens. 421 By leveraging the active, adaptive nature of L. pneumophila CRISPR-Cas, we have 422 developed an updated, sequence-based composite sketch of L. pneumophila phage. Unlike LME-423 1, which we were able to isolate from defenseless "lysogens," the Gokushovirinae sequences 424 identified in this study are exclusively metagenomic contigs from mixed environmental samples.     Supplemental Figure 3. PAM analysis from putative targets reveals that many targets possess the canonical PAM for their CRISPR sub-type. A sequence logo of PAMs from putative targets shows that many targets possess the canonical PAM for their CRISPR sub-type. This suggests spacers were not over-designated to target groups despite being PAM agnostic during data analysis.