Selective maintenance of multiple CRISPR arrays across prokaryotes

Prokaryotes are under nearly constant attack by viral pathogens. To protect against this threat of infection, bacteria and archaea have evolved a wide array of defense mechanisms, singly and in combination. While immune diversity in a single organism likely reduces the chance of pathogen evolutionary escape, it remains puzzling why many prokaryotes also have multiple, seemingly redundant, copies of the same type of immune system. Here, we focus on the highly flexible CRISPR adaptive immune system, which is present in multiple copies in a surprising 28% of the prokaryotic genomes in RefSeq. We use a comparative genomics approach looking across all prokaryotes to demonstrate that, on average, organisms are under selection to maintain more than one CRISPR array. We hypothesize that a tradeoff between memory span and learning speed could select for both “long-term memory” and “short-term memory” CRISPR arrays, and we go on to develop a mathematical model to show that such a tradeoff could, in theory, lead to selection for multiple arrays.


Introduction 1
Just as larger organisms must cope with the constant threat of infection by pathogens, 2 so too must bacteria and archaea. To defend themselves in a given pathogenic 3 environment, prokaryotes may employ a range of different defense mechanisms, and 4 oftentimes more than one (Makarova et al., 2011b(Makarova et al., , 2013Houte et al., 2016). While 5 having multiple types of immune systems may decrease the chance of pathogen 6 evolutionary escape (Iranzo et al., 2015), having multiple instances of the same type of 7 system is rather more puzzling. Here we explore this apparent redundancy in the 8 context of CRISPR-Cas immunity.

9
The CRISPR-Cas immune system is a powerful defense mechanism against the 10 viruses that infect bacteria and archaea and is the only known example of adaptive 11 immunity in prokaryotes (Makarova et al., 2006;Goren et al., 2012). This system allows 12 prokaryotes to acquire specific immune memories, called "spacers", in the form of short 13 viral genomic sequences which they store in CRISPR arrays in their own genomes CRISPR systems appear to be widespread across diverse bacterial and archaeal 23 lineages, with previous analyses of genomic databases indicating that ∼ 40% of bacteria 24 and ∼ 80% of archaea have at least one CRISPR system (Makarova et al., 2011a;Sorek 25 et al., 2013;Burstein et al., 2017). These systems vary widely in cas gene content and 26 targeting mechanism, although the cas1 and cas2 genes involved in spacer acquisition 27 are universally required for a system to be fully functional (Barrangou et al., 2007; 28 Makarova et al., 2011a). Such prevalence suggests that CRISPR systems effectively 29 defend against phage in a broad array of environments. The complete story seems to be 30 more complicated, with recent analyses of environmental samples revealing that some 31 major bacterial lineages almost completely lack CRISPR systems and that the 32 distribution of CRISPR systems across prokaryotic lineages is highly uneven (Burstein 33 et al., 2016). Other studies suggest that particular environmental factors can be 34 important in determining whether or not CRISPR immunity is effective (e.g., in 35 thermophilic environments Iranzo et al. 2013;Weinberger et al. 2012b). While previous 36 work has focused on the presence or absence of CRISPR across lineages and habitats, 37 little attention has been paid to the number of systems in a genome. 38 In fact, the multiplicity of CRISPR systems per individual genome varies greatly, 39 with many bacteria having multiple CRISPR arrays and some having multiple sets of 40 cas genes as well (e.g., Horvath et al. 2009;Cai et al. 2013). CRISPR and other immune 41 systems are horizontally transferred at a high rate relative to other genes in bacteria 42 (Puigbò et al., 2017), meaning that any apparent redundancy of systems may simply be 43 the result of the selectively neutral accumulation of systems within a genome. 44 Alternatively, there are a number of reasons, discussed below, why having multiple sets 45 of cas genes or CRISPR arrays might be under selection. 46 We suspected that prokaryotes may be under selection to maintain multiple CRISPR 47 arrays, given that it is common for organisms across lineages to have multiple systems 48 (as detailed below) and, in some clades, these appear to be conserved over evolutionary 49 time (e.g. Boudry et al. 2015;Andersen et al. 2016). Because microbial genomes have a 50 deletion bias (Mira et al., 2001;Kuo and Ochman, 2009), we would expect extraneous 51 systems to be removed over time. Here we construct a test of neutral CRISPR array 52 accumulation via horizontal transfer and loss. Using publicly available genome data we 53 show that the number of CRISPR arrays in a wide range of prokaryotic lineages 54 deviates from this neutral expectation by approximately two arrays. Thus we conclude 55 that, on average, prokaryotes are under selection to have multiple CRISPR arrays. We 56 go on to discuss several hypotheses for why having multiple arrays might be adaptive. 57 Finally, we suggest that a tradeoff between the rate of acquisition of immune memory 58 and the span of immune memory could lead to selection for multiple CRISPR arrays. December 23, 2017. Genomes were scanned for the presence of CRISPR arrays using 65 the CRISPRDetect software . We used default settings except that 66 we did not take the presence of cas genes into account in the scoring algorithm (to avoid 67 circularity in our arguments), and accordingly used a quality score cutoff of three, 68 following the recommendations in the CRISPRDetect documentation. CRISPRDetect 69 also identifies the consensus repeat sequence and determines the number of repeats for 70 each array. Presence or absence of cas genes were determined using genome annotations 71 2/21 from NCBI's automated genome annotation pipeline for prokaryotic genomes (Tatusova 72 et al., 2016). We discarded genomes that lacked a CRISPR array in any known 73 members of their taxon. In this way we only examined genomes known to be compatible 74 with CRISPR immunity.

75
Test for selection maintaining multiple arrays 76 Our power to detect selection hinges on our ability to differentiate between 77 non-functional (i.e., neutrally-evolving) and functional (i.e., potentially-selected) 78 CRISPR arrays. Since all known CRISPR systems require the presence of cas1 and 79 cas2 genes in order to acquire new spacers, we use the presence of both genes as a 80 marker for functionality and the absence of one or both genes as a marker for 81 non-functionality. Henceforth we will consider CRISPR arrays in genomes with both 82 cas1 and cas2 genes to be "functional" and CRISPR arrays in genomes lacking either 83 cas1, cas2, or both genes to be "non-functional". This differentiation allows us to 84 consider the probability distributions of the number of CRISPR arrays i in 85 non-functional (N i ) and functional (F i ) genomes, respectively. 86 We start with our null hypothesis that in genomes with functional CRISPR systems 87 possession of a single array is highly adaptive (i.e. viruses are present and will kill any 88 susceptible host) but additional arrays provide no additional advantage. Thus these 89 additional arrays will appear and disappear in a genome as the result of a neutral 90 birth/death horizontal transfer and loss process, where losses are assumed to remove an 91 array in its entirety. This hypothesis predicts that the non-functional distribution will 92 look like the functional distribution shifted by one (S i ): for i ≥ 0. We take two approaches to testing this prediction: one parametric from first 94 principles with greater power but more assumptions and one non-parametric with less 95 power but also fewer assumptions. 96 We begin by deriving a functional form for the distribution N i from first principles 97 following a neutral process. If CRISPR arrays arrive in a given genome at a constant 98 rate via rare horizontal transfer events, then we can model their arrivals using a Poisson 99 process with rate η. Assuming arrays are also lost independently at a constant rate, the 100 lifetime of each array in the genome will be exponentially distributed with rate ν. This 101 leads to a linear birth-death process of array accumulation, which yields a Poisson 102 equilibrium distribution with rate λ = η ν . While this rate might be constant for a given 103 taxon, it will certainly vary across taxa due to different intrinsic (e.g. cell wall and 104 membrane structure) and extrinsic factors (e.g. density of neighbors, environmental pH 105 and temperature) (Puigbò et al., 2017). We model this variation by allowing genome j 106 to have rate λ j = ηj νj and assuming λ j ∼ Gamma(α, β), which we pick for its flexibility 107 and analytic tractability. This combination of gamma and Poisson distributions leads to 108 the number of arrays i in a random genome following a negative binomial distribution 109 N i = NB(r, p) where r = α and p = β 1+β .

110
Now we can fit this distribution to data to find maximum likelihood estimates of r 111 and p for the distribution of array counts in both the set of non-functional genomes (N i ) 112 and the set of functional genomes as shifted under our null hypothesis (S i ). This allows 113 us to construct a parametric test of multi-array adaptiveness. We expect thatr N ≈r S 114 andp N ≈p S under our null hypothesis (where subscripts correspond to the distribution 115 to which the parameters were fit). When our null hypothesis is violated it should shift 116 the means of these distributions. Therefore we estimate and compare these means 117 µ k = p k r k 1−p k , k ∈ N, S. We expect thatμ S >μ N if more than one array is selectively 118 3/21 maintained, and we bootstrap confidence intervals on these estimates to determine 119 whether the effect is significant. 120 We also construct a non-parametric test for selection by determining at what shift s 121 the mismatch between F i+s / ∞ j=s F j and N i , measured as the sum of squared 122 differences between the distributions, is minimized: ( Under our null hypothesis s = 1, and a value of s > 1 implies that selection maintains 124 more than one array. Our parametric test is superior to s because it can detect if 125 selection maintains more than one array across the population on average, but not in all 126 taxa, so that the optimal shift is fractional.

127
Correcting for phylogenetic correlations in HGT 128 Differential rates of horizontal gene transfer (HGT) between lineages could produce an 129 observed correlation between cas1 and cas2 presence and array count in the absence of 130 any selection for having multiple CRISPR arrays. In other words, some genomes would 131 be functional and have many arrays due to a high arrival rate of foreign genetic 132 material, and other lineages would be non-functional and lack CRISPR arrays simply 133 because of low rates of HGT. If this were the case, then comparisons between these 134 lineages would lead to a spurious result of selection.

135
There are several ways to control for such correlation. First, we can perform our 136 parametric test on a subset of the data such that we take an equal number of functional 137 and non-functional genomes from each species to control for lineage-specific effects.

138
Second, we can also perform a species-wise parametric test. In this case, for each species 139 k we calculate ∆µ k =μ S k −μ N k and then bootstrap the mean of the distribution of 140 these values (∆µ) to detect if there is a significant difference from zero. We also map 141 these values onto a phylogeny (SILVA Living Tree 16s rRNA tree; Yarza et al. 2008) 142 and perform a formal test for phylogenetic signal (the K statistic; Blomberg et al. 2003;143 Revell 2012), which indicates whether or not any signal of selection is isolated to a 144 particular portion of the tree.

145
Finally, our test for selection can also be conceptualized of in terms of a regression, 146 which has standard methods for phylogenetic correction. Essentially, if we fit a model of 147 array counts per genome as predicted by functionality, our null hypothesis states that 148 the slope of the regression should be approximately one, with a slope larger than one 149 indicating that selection maintains multiple arrays. We fit a linear model of mean  Linkage between CRISPR array and cas genes 156 Often CRISPR arrays and cas genes are collocated such that loss of one may be linked 157 to loss of the other. In a theoretical sense this should not matter, as it will not alter the 158 asymptotic distribution of array counts per genome that we would expect in the case of 159 either functional or non-functional genomes. That is, while linked-loss could lower the 160 number of arrays seen in non-functional genomes in the short term, it should not change 161 the value of the eventual equilibrium array count a genome tends toward over time.

4/21
Nevertheless, genomes may not be at or near their equilibrium array counts. We can 163 test this assumption directly by regressing the species-specific values of ∆µ k (defined 164 above when correcting for lineage-specific trends in HGT) against the minimum distance 165 between CRISPR arrays and cas genes on a genome. If CRISPR-cas linkage were 166 driving our results, we would see a strong relationship between these values. We include 167 only completely assembled genomes in this analysis as genomic distances are needed.
168 CRISPR spacer turnover model 169 We develop a simple deterministic model of the spacer turnover dynamics in a single 170 CRISPR array of a bacterium exposed to n viral species (i.e., disjoint protospacer sets): 171 where µ L is the spacer loss rate parameter and a i is a function of time representing the 172 viral environment. Here we let a i (t, where µ A is the spacer acquisition 173 rate, v i is a composite parameter describing the maximum density of a viral species in 174 the environment multiplied by the adsorption rate, and f i (t) is a function describing the 175 fluctuations of the viral population over time that takes values between zero and one.

176
The rate of per-spacer loss increases linearly with locus length. This assumption is 177 based on the observation that spacer loss appears to occur via homologous 178 recombination between repeats Gudbergsdottir et al., 2011;179 Weinberger et al., 2012a), which becomes more likely with increasing numbers of spacers 180 (and thus repeats). Using this model we can determine optimal spacer acquisition rates 181 given a particular pathogenic environment. If there are multiple optima, or if optima 182 cluster in different regions of parameter space for different pathogenic environments, 183 this indicates that having multiple-arrays may be the best solution in a given 184 environment or set of environments. 185 We analyze a simple case with two viral species where there is one "background" 186 species (B) representing the set of all viruses persisting over time in the environment 187 (f B (t) = 1) and another "transient" species (T ) that leaves and returns to the 188 environment after some interval of time (f T (t) is a binary function that takes a value of 189 one if virus T is present in the environment and zero otherwise). This allows us to 190 effectively illustrate any tradeoff between the ability to maintain defenses towards a 191 constant threat and the ability to defend against threats that periodically reappear in 192 the environment. In practice we will focus on one interval in which T leaves and then 193 returns to the system in order to see if immune memory is lost, and if so how long it 194 takes to regain that memory, given a particular spacer acquisition rate (µ A ). We also 195 find how long it would take to acquire immunity towards a completely novel phage 196 species given a particular acquisition rate in order to assess any tradeoff between 197 learning-speed and memory-span. 198 We can also include the phenomenon of priming in our model, wherein if a CRISPR 199 array has a spacer targeting a particular viral species, the rate of spacer acquisition 200 towards that species is increased (Datsenko et al., 2012;Swarts et al., 2012). Thus is a stepwise function determining the presence or absence of at least one spacer towards 203 a given viral species and p > 1 is the degree of priming. For details of model analysis 204 see S1 Text.

205
There is evidence that mature crRNA transcripts from the leading end of the 206 CRISPR array are far more abundant than those from the trailing end, and that this 207 decay over the array happens quickly (most transcripts are from the first few spacers, 208 though some spacers farther down can occasionally show high representation) (Bernick 209 et al., 2012;Hale et al., 2012;Richter et al., 2012). This suggests an alternative model 210 wherein the length of an "effective" array is capped at a constant number of spacers and 211 "loss" (i.e. movement out of the zone where crRNAs are actively transcribed and 212 processed) occurs directly due to the acquisition of novel spacers. Therefore we built a 213 modified version of the model above with a hard cap on array length (S1 Text).

215
Having more than one CRISPR array is common 216 About half of the prokaryotic genomes in the RefSeq database have at least one 217 CRISPR array (44%). Of these genomes, more than half have more than one CRISPR 218 array (63%). When restricting ourselves only to putatively functional genomes where 219 the CRISPR spacer acquisition machinery was present (cas1 and cas2 ) the proportion 220 of genomes with more than one array increases to 68%. In contrast to this result, 221 having more than one set of cas targeting genes is not nearly as common. Signature 222 targeting genes are diagnostic of CRISPR system type. We counted the number of 223 signature targeting genes for type I, II, and III systems in each genome (cas3, cas9, and 224 cas10 respectively Makarova et al. 2015). Only 5% of all genomes have more than one 225 targeting gene (either multiple copies of a single type or multiple types). Even when 226 restricting ourselves again to genomes with a CRISPR array, only 10% of genomes had 227 multiple signature targeting genes. However, of those genomes with more than one set 228 of targeting genes, many had multiple types (48%).

229
Some taxa are overrepresented in RefSeq (e.g. because of medical relevance), and we 230 wanted to avoid results being driven by just those few particular taxa. We controlled for 231 this by randomly sub-sampling 10 genomes from each taxon with more than 10 genomes 232 in the database and found broadly similar results. After sub-sampling, approximately 233 40% of genomes had at least one CRISPR array, and of these 61% had more than one. 234 Of genomes with intact spacer acquisition machinery, 62% had more than one CRISPR 235 array. Of these sub-sampled genomes, restricting to those with at least one CRISPR 236 array, 9% had more than one set of cas targeting genes. Of these multi-cas genomes, 237 many had multiple types (56%).

238
Validation of functional / non-functional classification 239 Our power to detect selection depends critically on our ability to classify genomes as 240 CRISPR functional vs. non-functional. Functional CRISPR arrays should, on average, 241 contain more spacers than non-functional arrays (Gophna et al., 2015). Thus we 242 compared the number of repeats in CRISPR arrays in genomes with both cas1 and cas2 243 present ("functional", 16.01 repeats on average) to the number of spacers in genomes 244 lacking either or both genes ("non-functional", 12.23 repeats on average) and confirmed 245 that the former has significantly more than the latter (t = −36.516, df = 55340, 246 p < 2.2 × 10 −16 ; S1 Fig). This difference in length (3.80 repeats) is not as large as one 247 might expect, possibly because some systems are able to acquire or duplicate spacers via 248 homologous recombination (Kupczok et al., 2015) and arrays may have been inherited 249 6/21 recently from strains with active cas machinery. The mean array length across the 250 dataset was 15.12 repeats.

251
Instantaneous array loss vs. gradual decay 252 There are two possible routes to complete CRISPR array loss: (1) an all-at-once loss of 253 the array (e.g. due to recombination between flanking insertion sequences Almendros  (2) gradual decay due to spacer loss. Limited 255 experimental evidence supports (1) spontaneous loss of the entire CRISPR array (Jiang 256 et al., 2013), as do comparisons between closely related genomes (Shah and Garrett,257 2011). The distinction above is important, because if CRISPR array loss were to occur 258 primarily via (2) gradual decay, then functional genomes would have an intrinsically 259 lower rate of array loss than non-functional genomes. This is because in functional 260 genomes spacer acquisition would counteract spacer loss, reducing the rate of array 261 decay, whereas this compensation would not occur in non-functional genomes. This 262 could lead us to spuriously accept a result of selection maintaining multiple arrays.

263
If arrays were primarily lost via gradual decay we would expect a positive 264 relationship between the number of arrays in a genome and the average array length in 265 a genome, because arrays experiencing more decay (either due to increased spacer loss 266 rates or reduced acquisition rates) should be shorter and prone to eventual deletion. In 267 functional genomes with the complete spacer acquisition machinery (cas1 and cas2 ) this 268 trend would be due to the higher probability of stochastically reaching a 0-spacers state 269 in shorter arrays, and arrays will in general be shorter in genomes with lower spacer 270 acquisition rates. In non-functional genomes that lack the complete spacer acquisition 271 machinery, this trend would result from differences in time since loss of acquisition 272 machinery, where genomes that had lost that machinery farther in the past would have 273 both shorter arrays and fewer arrays on average.

274
Overall we see no relationship between mean array length and array count in a 275 genome (m = −0.001, p = 0.109, R 2 = 5.55 × 10 −5 ). Surprisingly, in functional genomes 276 we find a slightly negative linear relationship between mean array length in a genome 277 and the number of arrays in a genome (m = −0.0081, p < 2 × 10 −16 , R 2 = 0.0032). In 278 non-functional genomes we see a slightly positive relationship (m = 0.0054, 279 p = 7.23 × 10 −10 , R 2 = 0.0026). While both of these relationships are significant, they 280 are extremely weak and probably spurious. Thus we reject any clear array-length vs. 281 array count relationship and accordingly rule out array loss via spacer-wise decay as a 282 major driver of the patterns we will explore later.

283
Selection maintains multiple CRISPR arrays 284 We leveraged the difference between functional and non-functional genomes, within each 285 of which the process of CRISPR array accumulation should be distinct (Fig. 1, Table 1). 286 Non-functional CRISPR arrays should accumulate neutrally in a genome following 287 background rates of horizontal gene transfer and gene loss. We constructed two point 288 estimates of this background accumulation process using our parametric model to infer 289 the distribution of the number of arrays. One estimate came directly from the 290 non-functional genomes (μ N , Fig. 1a). The other came from the functional genomes, 291 assuming that having one array is adaptive in these genomes, but that additional arrays 292 accumulate neutrally (μ S , Fig. 1b). If selection maintains multiple (functional) arrays, 293 then we should find thatμ N <μ S . We found this to be overwhelmingly true, with 294 about two arrays on average seeming to be evolutionarily maintained across prokaryotic 295 taxa (∆µ =μ S −μ N = 1.09 ± 0.03). We bootstrapped 95% confidence intervals of our 296 estimates ( Table 1) and found that the bootstrapped distributions did not overlap,

8/21
Sub-sampling overrepresented taxa altered our parameter estimates slightly, but did 299 not change our overall result (∆µ = 1.13 ± 0.09, S2 Fig). To control for the possibility 300 that multiple sets of cas genes in a small subset of genomes could be driving this 301 selective signature, we restricted our dataset only to genomes with one or fewer 302 signature targeting genes (cas3, cas9, or cas10 Makarova et al. 2011aMakarova et al. , 2015 and one or 303 fewer copies each of the genes necessary for spacer acquisition (cas1 and cas2 ). Even 304 when restricting our analyses to genomes with one or fewer sets of cas genes, there is 305 selection to maintain more than one (functional) CRISPR array, though the effect size 306 is smaller ( ∆µ = 0.61 ± 0.02, S3 Fig; with   controls for differences in rates of HGT between lineages also confirms a signature of 318 multi-array selection (∆µ = 0.70 ± 0.14). Because there is a low number of genomes for 319 most species and this test restricts us to only within-species comparisons, our 320 species-wise parameter-based suffers somewhat in terms of power.

321
Third, we fit a linear model accounting for phylogeny (Tung Ho and Ané (2014)) of 322 the mean number of CRISPR arrays on a genome per species versus the proportion of 323 functional genomes per species. This yielded a slope near two, indicating maintenance 324 of two arrays on average (m = 1.9283, p < 2 × 10 −16 ), and was actually higher than the 325 slope found not considering phylogeny (m = 1.69788, p < 2 × 10 −16 ).

326
In order to determine if the signature of selection on multiple arrays we observed was 327 confined to a particular set of clades, we mapped all species-specific ∆µ k values onto 328 the SILVA Living Tree 16s rRNA tree (Yarza et al., 2008). Of the 623 species with at 329 least one functional and one non-functional genome, 568 were represented on the tree. 330 Positive and strongly positive (> 1) values of ∆µ k were distributed across the tree, 331 indicating this phenomenon was not isolated to a particular group (S12 Fig). Formal 332 testing revealed no significant phylogenetic signal in the ∆µ k values (K = 1.88 × 10 −9 , 333 p = 0.604; Blomberg et al. 2003;Revell 2012) 334 We also regressed the species-specific ∆µ k 's found above against the minimum 335 distance between CRISPR arrays and cas genes in a genome to verify that linkage was 336 not driving our result. We saw a slight positive relationship between CRISPR-cas  Finally, to confirm that assembly level had no effect on our outcome, we ran our 341 parametric test restricted to completely assembled genomes in the dataset. This   See Fig 1 and S2 Fig-S4 Fig. selection maintaining multiple arrays using our test, since the production of 348 "neo-CRISPR arrays" would only occur in functional genomes. A simple way to control 349 for this is to merge all CRISPR arrays with identical consensus repeat sequences in a 350 genome, thus removing any duplicates. Doing this, we find that the signature of 351 multi-array selection remains, albeit being somewhat less strong (∆µ = 0.46 ± 0.02). 352 We were considerably surprised that this signature of selection still remained after 353 merging, since such merging will also remove a large portion of arrays acquired through 354 horizontal transfer, assuming such transfers most often happen between closely related 355 individuals. In any case, while the production of neo-CRISPR arrays may be driving 356 our result in part, it cannot account for the overall signal. It is unclear if neo-CRISPR 357 arrays are commonly produced in bacteria via off-target integration, though Nivala et al. 358 (2018) found circumstantial evidence it may occur in two other species. The CRISPR 359 system of E. coli is not naturally active (Savitskaya et al., 2017) and requires artificial 360 up-regulation of the spacer acquisition machinery.

361
Validation of CRISPRDetect array predictions 362 We ran our tests for selective maintenance of multiple arrays on the same dataset 363 excluding arrays with a CRISPRDetect score lower than 6 (double the default 364 threshold). We found no qualitative differences in our results when we used this 365 increased detection threshold (∆µ = 1.00 ± 0.02). By default, CRISPRDetect identifies 366 arrays with repeats matching experimentally-verified CRISPR arrays as well as de novo 367 repeats. If we restrict to only arrays with a positive hit on this list we again found the 368 same pattern (∆µ = 0.76 ± 0.03). 369 We also downloaded the set of CRISPR arrays and array-lacking genomes available 370 on the CRISPR Database (Grissa et al., 2007a). This database uses an alternative 371 algorithm for array detection (Grissa et al., 2007b) and thus serves as an independent 372 verification of our results. This dataset showed a clear signature of selection 373 maintaining multiple arrays (∆µ = 1.49 ± 0.17).

374
Evidence for array specialization 375 In genomes with multiple arrays, the dissimilarity between consensus repeat sequences 376 of arrays in a single genome spanned a wide range of values (Levenshtein Distance, S7 377 Fig and S8 Fig), though the mode was at zero (i.e., identical consensus repeats). When 378 limiting our scope to only genomes with exactly two CRISPR arrays, we saw a bimodal 379 distribution of consensus repeat dissimilarity, with one peak corresponding to identical 380 arrays within a genome and the other corresponding to arrays with essentially randomly 381 drawn repeat sequences except for a few conserved sites between them (S7D Fig). We 382 also observed that among functional genomes, the area of the peak corresponding to 383 dissimilar repeat sequences was significantly higher than among non-functional genomes 384 (χ 2 = 61.432, df = 1, p < 4.582 × 10 −15 , S7 Fig). This suggests that the observed 385 10/21 signature of adaptiveness may be related to the diversity of consensus repeat sequences 386 among CRISPR arrays in a genome. 387 We attempted to independently confirm this result using repeat sequences obtained 388 from the CRISPR Database. While the distribution of pairwise-distances between 389 repeat sequences in two-array genomes was approximately the same shape as that we 390 observed for our dataset (S8 Fig), the relationship between diversity and functionality 391 was reversed, with non-functional genomes having more diverse consensus repeats 392 among their arrays (χ 2 = 4.3952, df = 1, p = 0.03604). This opposing result calls into 393 question the pattern observed in the CRISPRDetect data, though this may be due to 394 the smaller size of the CRISPR Database dataset.

395
A tradeoff between memory span and acquisition rate could 396 select for multiple arrays in a genome 397 We hypothesized that having multiple systems with different acquisition rates could 398 allow prokaryotes to respond to a range of pathogens with different characteristics (e.g. 399 residence time in the environment, frequency of recurrence). To investigate this 400 possibility we built a simple model of spacer turnover dynamics in a single CRISPR 401 array in the presence of "background" and "transient" viral species (see Methods). We 402 constructed phase diagrams of the model behavior, varying spacer acquisition rates and 403 the relative population sizes of viral species or the extent of priming, respectively (Fig. 404  2a, S9 Fig). We found that for very high spacer acquisition rates, the system is able to 405 maintain immunity to both background and transient viral populations ("short-term 406 memory"/"fast-learning"). High rates of spacer acquisition are unrealistic as they lead 407 to high rates of autoimmunity (S2 Text, Wei et al. 2015;Kumar et al. 2015;Yosef et al. 408 2012;Levy et al. 2015;Stern et al. 2010). Our analysis also reveals that there is a region 409 of parameter space with low spacer acquisition rates in which immunity is maintained 410 ("long-term memory"/"slow-learning"). This is the region where low spacer turnover 411 rates allow immune memory to remain in the system over longer periods of time (Fig.   412 2a).

413
The "long-term memory"/"slow-learning" region of parameter space is separated 414 from the "short-term memory"/"fast-learning" region of parameter space by a 415 "memory-washout" region in which spacer turnover is high so that memory is lost but 416 acquisition is not rapid enough to quickly re-acquire immunity towards the transient 417 virus (Fig. 2a). The relative densities of the different viral species modulate the relative 418 importance of fast-acquisition versus memory span. Thus for a range of pathogenic 419 environments two distinct CRISPR immune strategies exist with respect to the spacer 420 acquisition rate ("long-term memory" vs. "fast-learning"). We also note that high levels 421 of priming expand the "washout" region separating the two strategies, as high spacer 422 uptake from background viruses will crowd out long term immune memory (S9 Fig).

423
A single CRISPR array fulfilling one of the two CRISPR strategies is sufficient in 424 the case shown in Fig 2a, although only one of those strategies is likely accessible due to 425 limits on spacer acquisition rates. When a third, novel viral species enters the system, a 426 two-array solution may become necessary due to the memory span vs. learning speed 427 tradeoff (Fig 2b). Extremely high spacer acquisition rates might allow the host to 428 rapidly respond to both novel and returning threats, but, as noted above, such rates are 429 unrealistic due to physical constraints on the speed of adaptation as well as the 430 evolutionary constraint of autoimmunity (S2 Text, Wei et al. 2015;Kumar et al. 2015;431 Yosef et al. 2012;Levy et al. 2015;Stern et al. 2010). CRISPR adaptation is rapid, but 432 it is not instantaneous, and infected hosts with arrays lacking an appropriate spacer will 433 often perish before a spacer can be acquired (Hynes et al., 2014). This means that the 434 region of rapid immune response to both transient and novel threats by a single array, 435 11/21 Immune memory is maximized at intermediate and low spacer acquisition rates, creating a tradeoff with the speed of immune response to novel threats. (a) Phase diagram of the behavior of our CRISPR array model with two viral species, a constant "background" population and a "transient" population that leaves and returns to the system at some fixed interval. The yellow region indicates that immunity towards both viral species was maintained. The green region indicates where immune memory was lost towards the transient phage species, but reacquired almost immediately upon phage reintroduction (t I < 10 −5 , where t I is the time to first spacer acquisition after the return of the species to the system following an interval of absence). The light blue region indicates that only immunity towards the background species was maintained (i.e., immune memory was rapidly lost and t I > 10 −5 ). Dark blue indicates where equilibrium spacer content towards one or both species did not exceed one despite both species being present in the system (S1 Text). (b) The tradeoff between memory span and learning speed. The speed of immune response to the transient viral species (as in (a), with v B /v F = 1) is plotted against the speed of response to a novel viral species to which the system has not been previously exposed (so that there are no spacers targeting this species), over a range of µ A values (µ A ∈ [10 −3.5 , 1]). The speed of immune response to the transient species is defined as 1/(1 + t I ) (where t I = 0 if memory is maintained). The speed of response to the novel species is similarly defined as 1/(1 + t N ) where t N is the time to first spacer acquisition towards this species. For specifics on calculating t I and t N see S1 Text.

12/21
shown in the top-right corner of Fig 2b, is probably inaccessible. Thus, in order to 436 maximize novel spacer acquisition and memory span simultaneously, a two-system 437 solution will be required.

438
An alternative model with fixed-length arrays, corresponding to a situation in which 439 only the first few spacers in an array are processed into mature crRNAs (Bernick et al., 440 2012;Hale et al., 2012;Richter et al., 2012), demonstrated qualitatively similar behavior 441 to the model described above (S1 Text, S10 Fig). with at least one each of functional and non-functional genomes. We performed our test 445 for adaptiveness on each of these taxa individually (  (Touchon and Rocha, 2010;Touchon et al., 2011). All of these taxa are human 454 pathogens, and can occupy a diverse set of environmental niches on the human body. It 455 is unclear at this time what is causing the differences in the adaptive landscape each 456 taxon experiences.

457
A very small portion of the genomes used in our analyses were from archaea (< 1%). 458 We ran our analyses on these genomes alone to see if they differed qualitatively from 459 their bacterial counterparts. Archaeal genomes showed a clear signal of selection 460 maintaining multiple arrays, although the confidence interval around our statistic is 461 rather broad (∆µ = 1.05 ± 0.56). We note that the large majority of archaeal genomes 462 had CRISPR arrays and were also functional, making our approach less powerful (S11 463 Fig). Further, if those few non-functional genomes lost their cas spacer acquisition 464 machinery recently, then our power would be reduced even more because these genomes 465 might still bear the remnants of past selection.

466
Type-specific signatures of selection 467 While we do not have direct information on system type for the majority of arrays in 468 our dataset, we can subdivide genomes into those containing the signature cas targeting 469 genes for type I,II, or III CRISPR systems (cas3, cas9, and cas10 respectively), 470 13/21 assuming that this is a reliable indicator of system type (Makarova et al., 2015). The 471 number of arrays per genome differed significantly between each of these pairs (S13 Fig), 472 though the largest difference was between genomes with class I targeting proteins which 473 had around 2 arrays on average (type I and type III, 2.10 and 1.96 respectively) and 474 class II targeting proteins which only had one array on average (type II, 1.05). We 475 excluded genomes with multiple types of targeting genes for this analysis. 476 We cannot run our test for selection directly on these subsets of the data, since they 477 exclude genomes without arrays or cas genes. Instead we classified species into types if 478 the only observed targeting gene type among all representatives of that species 479 corresponded to a a particular type. Thus we can test for our signature of selection 480 among species that "favor" a particular type of CRISPR system. All types showed a 481 signature of multi-array selection (∆µ = 1.09 ± 0.05, 0.62 ± 0.02, 1.79 ± 0.06 482 respectively). In particular type III "species" had a particularly strong signal, and 483 organisms in this group may be under selection to maintain three arrays.

485
Selection maintains multiple CRISPR arrays across prokaryotic 486 taxa 487 On average, prokaryotes are under selection to maintain more than one CRISPR array. 488 This surprising result holds controlling for both overrepresented taxa and the influence 489 of multiple sets of cas genes. However, the degree of selection appears to vary between 490 taxa, likely as a function of the pathogenic environment each experiences based on its 491 ecological niche.

492
The number of CRISPR arrays in a genome appears to follow a negative binomial 493 distribution quite well (Figs 1b and 1a, S2 Fig-S4 Fig), consistent with our theoretical 494 prediction. This pattern is robust to sub-setting of the data in a variety of ways. We 495 note that, due to the large size of this dataset, formal goodness-of-fit tests to the 496 negative binomial distribution always reject the fit due to small but statistically 497 significant divergences from the theoretical expectation.

498
Our test for selection is conservative to the miscategorization of arrays as 499 "functional"or "non-functional". Miscategorizations could occur because intact targeting 500 machinery still allows for preexisting spacers to confer immunity, some CRISPR arrays 501 may be conserved for non-immune purposes (e.g. Touchon and Rocha 2010;Li et al. 502 2016), or intact acquisition machinery is no guarantee of system functionality. That 503 being said, our test is conservative precisely because of such miscategorizations, as they 504 should increaseμ N and decreaseμ S respectively. Selection against having a CRISPR 505 array in genomes lacking spacer acquisition machinery could produce a false signature of 506 selection maintaining arrays in genomes with acquisition machinery. This is unlikely 507 because there is no reason a non-functional CRISPR array should be under strong 508 negative selection given the low or nonexistent associated costs.

509
Our test should also be robust to false positive or negative array discovery rates.

510
Because we used CRISPRDetect settings such that cas-gene presence was not taken into 511 account when scoring arrays, CRISPRDetect followed identical procedures for detecting 512 arrays in functional and non-functional genomes. Thus an elevated false-positive or 513 false-negative rate should have no effect on our tests for selection, because these tests 514 rely on relative differences between array counts in functional and non-functional 515 genomes, not their actual values. We further confirmed this by changing our 516 CRISPRDetect score threshold and comparing to the distribution of arrays per genome 517 in the CRISPR Database (Grissa et al., 2007a).

518
Finally, we note that µ N and µ S take on a range of values depending on what subset 519 14/21 of taxa/genomes is considered. This is to be expected as each set of species will occupy 520 a distinct environment in terms of both the rate of horizontal gene transfer and the 521 usefulness of CRISPR immunity. Nevertheless, our qualitative result of multi-CRISPR 522 adaptiveness is robust to this quantitative variability.

524
Possibly, multiple arrays could be selectively maintained even in the absence of any 525 fitness advantage if each array acquired complementary spacer content towards distinct 526 viral targets by chance. If arrays share acquisition machinery such complementarity is 527 unlikely because priming will ensure both arrays contain spacers towards any target 528 encountered, meaning that the content of the two arrays will be largely redundant. We 529 note that this argument only holds for type I and II systems which demonstrate priming 530 (Datsenko et al., 2012). Type III systems are unprimed, have slow spacer acquisition 531 rates, and can target even mutated viral sequences (Pyenson et al., 2017). Thus type III 532 systems may be maintained via spacer complementarity, perhaps explaining why species 533 favoring type III systems appear to experience selection maintaining three rather than 534 just two CRISPR arrays.

535
Even in type I and II systems, if each array is associated with a separate set of 536 spacer acquisition machinery, then cross-priming will be less likely and complementarity 537 could arise. Nevertheless, this does not explain the multi-array conservation we see in 538 genomes with only a single set of cas genes. Thus we consider here several potential 539 adaptive explanations for the selection that maintains multiple arrays.

540
Our data show significant numbers of both functionally similar and dissimilar 541 CRISPR systems within the same genome, so either could potentially be adaptive.

542
While CRISPR systems are generally highly flexible, a prokaryote might still gain an 543 advantage in the former case if multiple similar systems lead to improved immunity 544 through redundancy and in the latter case if multiple dissimilar systems allow for 545 specialization towards multiple types of threats. The relevance of the different 546 advantages depends on whether an individual has multiple sets of cas genes, CRISPR 547 arrays, or both.

548
In the case of similar systems, immunity could be improved by (a) an increased 549 spacer acquisition rate, (b) an increased rate of targeting, or (c) a longer time to 550 expected loss of immunity. Duplication of cas genes could, in principle, increase uptake 551 (a) and targeting rates (b) through increased gene expression, but our data show that 552 multiple sets of cas genes are rare, which suggests this is, at best, a minor force.

553
Alternatively, duplication of CRISPR arrays could increase targeting (b) via an 554 increased number of crRNA transcripts or increase memory duration (c) through spacer 555 redundancy. However, the effectiveness of crRNA may actually decrease in the presence 556 of competing crRNAs (Stachler and Marchfelder, 2016;Stachler et al., 2017) and, since 557 a single array can have multiple spacers with the same target, there is a diminished 558 advantage to having multiple arrays in terms of memory span (S3 Text). Redundant 559 arrays might also be a form of bet-hedging since CRISPR functionality is lost at a high 560 rate in some prokaryotes (Jiang et al., 2013;Weissman et al., 2018).

561
In the case of dissimilar systems, immunity could be aided if diverse features are 562 advantageous. For example some viruses encode proteins that deactivate Cas targeting 563 proteins (Bondy-Denomy et al., 2013;Pawluk et al., 2014;Rauch et al., 2017). Diverse 564 cas genes may allow hosts to evade the action of these anti-CRISPR proteins, which are 565 often extremely broadly acting (Bondy-Denomy et al., 2013;Pawluk et al., 2014). 566 Alternatively, it has recently been shown that promiscuous type III Cas proteins are 567 often encoded alongside type I systems and can function as a "backup", using spacers 568 from the same array to target phages that have mutated the protospacer-adjacent 569 motifs necessary for type I targeting (Silas et al., 2017). Many genomes with multiple 570 15/21 cas signature genes also had multiple types of such genes, possibly indicating some 571 diversifying force. The inclusion of these multi-cas genomes also increased the effect size 572 of our test for adaptiveness, despite low representation in the dataset. In any case, while 573 these hypotheses remain interesting candidates to explain CRISPR multiplicity in some 574 prokaryotes, the majority of the genomes in the dataset have only one set of cas genes 575 and thus these mechanisms cannot explain the signature for multi-array adaptiveness 576 observed in the majority of the dataset.

577
The theoretical model we present demonstrates one possible explanation for selective 578 maintenance of multiple CRISPR arrays. Rate variation is an attractive hypothesis since 579 it can explain the signature of selective maintenance we observe even in the absence of 580 multiple sets of cas genes. Additionally, rates vary between systems, and rate variation 581 among arrays on a single genome has been observed in a diverse set of organisms (e.g. 582 (Paez-Espino et al., 2015;Westra et al., 2015)). Arrays with slightly different, or even 583 identical, consensus repeat sequences may differ in length, despite sharing a set of cas 584 genes (Zeng et al., 2017). It has even been shown directly that acquisition rates can 585 vary among CRISPR arrays with identical consensus repeat sequences sharing a single 586 set of cas genes . The factors influencing acquisition rate appear to 587 be idiosyncratic, perhaps related to the genomic position of the CRISPR array. 588 We do not provide empirical evidence that rate variation drives the observed 589 signature of selection of multiple arrays. Such verification is not currently possible, and 590 will require the characterization of spacer acquisition and loss rates across a large 591 number of taxa. Nevertheless, we develop this hypothesis as a promising candidate and 592 illustrate how factors intrinsic to the mechanism of CRISPR immunity can create strong 593 tradeoffs between memory span and learning speed. We show how such tradeoffs can 594 lead to selection for both high acquisition rate (i.e., short term memory) and low 595 acquisition rate (i.e., long-term memory) systems, depending on the pathogenic 596 environment of the host. As an array increases in length (i.e., the number of repeats 597 increases) the rate of spacer loss should also increase because loss occurs via homologous 598 recombination. Thus a length-dependent spacer loss rate causes high acquisition rate 599 systems to also have high loss rates, producing the aforementioned tradeoff.

600
Our primary model demonstrates that array-length dependent spacer loss can 601 produce a "learning" versus "memory" tradeoff. Nevertheless, in some systems spacer 602 loss may occur at a very low rate, making our proposed mechanism less relevant. There 603 is some evidence of spacers that persist in a population over a timespan of years (Tyson 604 and Banfield, 2008), though such conservation is possibly due to selection on the 605 population rather than a low mechanistic loss rate. A slightly modified model that 606 imposes a hard cap on the number of "effective" spacers in an array (S1 Text) and 607 would not require a strong spacer loss vs. acquisition rate link produced qualitatively 608 similar results. We based this model on evidence that the abundance of mature crRNA 609 transcripts from the leader end greatly exceeds that of those from the trailing end 610 (Bernick et al., 2012;Hale et al., 2012;Richter et al., 2012), implying that only the first 611 few spacers in an array will provide effective immunity. Thus CRISPR immunity may 612 face a memory-learning tradeoff due to multiple, distinct mechanisms, but in both cases 613 spacer acquisition rate variation among systems can lead to an optimal solution for the 614 host. 615 We note that when partial spacer-target matches exist, variability in spacer 616 acquisition rates among arrays will be largely irrelevant since priming will ensure rapid 617 acquisition of new spacers. On the other hand, when no match exists, either due to 618 spacer loss or the introduction of a truly novel viral species into the environment, 619 primed spacer uptake will not occur. Thus the rate at which a host encounters novel 620 threats will determine the relative importance of the baseline spacer acquisition rate 621 versus the primed acquisition rate. In environments where novel viruses are frequently 622 16/21 encountered, small differences in acquisition rate can be important for host fitness, 623 whereas in environments where host and virus pairs consistently coevolve over time 624 priming will be the more important phenomenon.

625
As more genome sequences from environmental samples become available, it will be 626 possible to explicitly link particular array configurations to specific features of the 627 pathogenic environment or host lifestyle. Even then, open questions remain. One 628 phenomenon that we do not address here is that a small, but non-trivial number of 629 genomes have greater than 10 arrays. It is difficult to imagine so many arrays 630 accumulating neutrally in a genome. If high array counts are a product of high 631 horizontal transfer rates, then genomes with extremely high array counts should also be 632 larger due to accumulation of foreign genetic material. This was not the case (S14 Fig), 633 indicating that rates of horizontal transfer alone cannot explain these outliers.

634
Finally, our examination of immune configuration is likely relevant to the full range 635 of prokaryotic defense mechanisms. In contrast to previous work focusing on 636 mechanistic diversity (e.g. Iranzo et al. 2013Iranzo et al. , 2015Kumar et al. 2015;Westra et al. 637 2015), we emphasize the importance of the multiplicity of immune systems in the 638 evolution of host defense. As we suggest, a surprising amount of strategic diversity may 639 masquerade as simple redundancy.