## Abstract

Some bacteria and archaea possess an adaptive immune system that maintains a memory of past viral infections as DNA elements called spacers, stored in the CRISPR loci of their genomes. This memory is used to mount targeted responses against threats. However, cross-reactivity of CRISPR targeting mechanisms suggests that incorporation of foreign spacers can also lead to autoimmunity. We show that balancing antiviral defense against autoimmunity predicts a scaling law relating spacer length and CRISPR repertoire size. By analyzing a database of microbial CRISPR-Cas systems, we find that the predicted scaling law is realized empirically across prokaryotes, and arises through the proportionate use of different CRISPR types by species differing in the size of immune memory. In contrast, strains with nonfunctional CRISPR loci do not show this scaling. We also demonstrate that simple population-level selection mechanisms can generate the scaling, along with observed variations between strains of a given species.

Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated (Cas) proteins form a prokaryotic defense against phage [1]. CRISPR loci are composed of DNA repeats alternating with variable DNA segments called spacers, acquired from phage and other foreign genetic material. In a process called interference, spacer RNA guides sequence-specific binding and cleavage of target DNA by Cas proteins. In this way, spacers acquired during phage attack confer heritable resistance against subsequent invasions.

CRISPR-Cas systems are remarkably diverse, characterized by functionally divergent Cas proteins and distinct mechanisms for each stage of immune defense [2]. Spacer acquisition is mediated by the conserved Cas1–Cas2 adaptation module, which sets spacer lengths within a narrow range varying by system [3, 4]. CRISPR arrays are also broadly distributed in size, ranging from fewer than 10 to hundreds of spacers, and the full repertoire of a host may comprise several CRISPR arrays [5]. Maintaining a broad spacer repertoire confers resistance against many phages and possible escape mutants [6]. However, there are constitutive costs associated with Cas protein expression [7], and diminishing returns of broad defense due to finite Cas protein copy numbers [8, 9]. In addition, CRISPR-Cas systems can prevent horizontal transfer of beneficial mobile genetic elements [10, 11].

CRISPR-Cas systems also cause autoimmunity, occurring when a spacer guides interference somewhere on the host genome, leading to cell death and strong mutational pressure in the CRISPR-cas locus and target region [12–15]. The patchy incidence of CRISPR-Cas systems in prokaryotes (roughly 40% of bacteria and 85% of archaea [2]), and the presence of diverse mechanisms for self–nonself discrimination [16], suggest that avoiding autoimmunity is a constraint in the evolution of CRISPR-Cas systems [2, 13–19].

Several mechanisms exist in divergent CRISPR-Cas types for suppressing autoimmunity arising from different forms of potential self-targeting [16]. In type I and II systems, interference requires presence of a protospacer-adjacent motif (PAM), a 2–5-nt-long sequence adjacent to target DNA but absent in CRISPR repeats, preventing interference within the CRISPR array [20, 21]. In type III systems, interference requires transcription of target DNA, which avoids targeting phages integrated into the host chromosome (prophage) [22]. Spacers acquired from the host genome are naturally self-targeting, but there are mechanisms to suppress such acquisition [23, 24]. For example, type I-E systems acquire spacers preferentially at double-stranded DNA breaks, which occur primarily at stalled replication forks of replicating phage DNA, and acquisition is confined by Chi sites which are enriched in bacterial genomes [23].

Here we propose that CRISPR evolution is also shaped by *heterologous autoimmunity*, which occurs if an acquired foreign spacer and a segment of the host genome are sufficiently similar. The likelihood of this effect depends on sequence statistics and the specificity of CRISPR targeting mechanisms. Heterologous autoimmunity is analogous to off-target effects that are an important concern in CRISPR-Cas genome editing [25, 26], but the possible effects on prokaryotic adaptive immunity have not been explored. We combine a probabilistic modeling approach with comparative analyses of CRISPR repertoires across prokaryotes to show that: (a) heterologous autoimmunity is a significant threat caused by CRISPR-Cas immune defense, (b) avoidance of autoimmunity leads to a scaling law in CRISPR repertoires, and (c) the scaling law can be achieved by population-level selection. Our work suggests that avoidance of heterologous autoimmunity is a key factor shaping CRISPR repertoires and the evolution of CRISPR-Cas systems.

## I. RESULTS

## A. Cross-reactivity leads to autoimmunity

We approach heterologous self-targeting as a sequence-matching problem [27–29], and derive estimates for the probability of a spacer being sufficiently similar to at least one site in the host genome. For a spacer of length *l*_{s} and PAM of length *l*_{p} (where it exists), an exact match at a given position requires *l* ≡ *l*_{s} + *l*_{p} complementary nucleotides. In a host genome of length *L*, where *L* ≫ *l*, there are *L* − *l* + 1 ≈ *L* starting positions for a match. At leading order, and ignoring nucleotide usage biases, we may treat matches as occurring independently with probability 4^{−l}. Thus, the probability of an exact match anywhere on the genome is (see Methods)

$$p_0 = 1 - \left(1 - 4^{-l}\right)^{L} \approx L\,4^{-l}. \tag{1}$$

Considering order-of-magnitude parameter estimates for the E. coli type I-E system of *L* = 5 × 10^{6} nt, *l*_{s} = 32 nt, and *l*_{p} = 3 nt gives a negligible probability *p*_{0} ∼ 10^{−15}.
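As a quick numerical check of this estimate, the sketch below evaluates the exact-match probability under the independence assumption described above, using the leading-order form *p*_{0} ≈ *L* 4^{−l}. The function name is illustrative and not taken from any existing code.

```python
def exact_match_probability(genome_length: float, spacer_length: int, pam_length: int) -> float:
    """Leading-order probability that a random spacer + PAM matches somewhere in a random genome."""
    l = spacer_length + pam_length        # number of nucleotides that must all match
    return genome_length * 4.0 ** (-l)    # ~L starting positions, each matching with probability 4^-l


# Order-of-magnitude E. coli type I-E parameters quoted in the text.
p0 = exact_match_probability(genome_length=5e6, spacer_length=32, pam_length=3)
print(f"p0 ~ {p0:.1e}")
```

The result falls in the 10^{−15} range quoted above, so exact matches alone cannot make heterologous self-targeting relevant.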

However, CRISPR interference tolerates several mismatches between spacer RNA and target DNA depending on position and identity [25, 26, 30, 31]. In general, mismatches in the PAM are not allowed, and mismatches in the PAM-distal region are tolerated to a greater extent than mismatches in the PAM-proximal region [21]. Up to ∼ 5 mismatches are allowed in type II systems [25, 26], while in type I-E systems, errors are mostly tolerated at specific positions with a 6-nt periodicity [30, 31].

Partial spacer-target matching may also trigger primed spacer acquisition, which is the rapid acquisition of new spacers from regions surrounding target DNA [30, 32, 33]. In type I-E and I-F systems, primed acquisition tolerates many (up to 10) mismatches in the PAM and target region [30, 33]. Thus, a foreign spacer that does not cause direct interference may still trigger primed acquisition of self-spacers [33] and hence cause autoimmunity.

Given that the specificity of CRISPR interference and primed acquisition has been characterized for only a few systems, we consider two general classes of mismatch tolerance that include the above scenarios: (a) mismatches at *k*_{fix} fixed positions, and (b) mismatches at *k*_{var} variable positions anywhere else in the target region. These increase the per-spacer self-targeting probability by a combinatorial factor *α*(*k*_{fix}, *k*_{var}, *l*) (see Methods), so that

$$p_{\mathrm{self}} \approx L\,\alpha(k_{\mathrm{fix}}, k_{\mathrm{var}}, l)\,4^{-l}. \tag{2}$$

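A greater number of allowed mismatches greatly increases the likelihood of heterologous self-targeting (Fig. 1b). To gain intuition we can rewrite Eq. 2 as

$$p_{\mathrm{self}} \approx L\,4^{-l_{\mathrm{eff}}}, \qquad l_{\mathrm{eff}} \approx l - k_{\mathrm{fix}} - k_{\mathrm{var}} \log_4\!\left[3\,(l - k_{\mathrm{fix}})\right], \tag{3}$$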

where *l*_{eff} is the effective spacer length after discounting for allowed mismatches (see Methods). This shows that mismatches exponentially increase the probability of self-targeting, and variable-position mismatches particularly so. Considering the E. coli system as before, the matching probability increases to *p*_{self} ∼10^{−4} with *k*_{fix} = 5 nt and *k*_{var} = 5 nt (see Fig. 1b). Other CRISPR-Cas systems may similarly lie in parameter regimes with appreciable *p*_{self}, especially when including indirect self-targeting through primed acquisition [34]. Furthermore, the probability of self-targeting is likely higher than implied by our calculations as it can be increased by correlations in sequence statistics between host and phage genomes [28, 29]. Given our estimates, we thus hypothesize that heterologous autoimmunity may occur generally and be a significant cost of CRISPR-Cas immunity.
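The effect of mismatch tolerance can be checked numerically. The sketch below assumes the counting described in Methods: each of the *k*_{fix} fixed positions is unconstrained (4 choices), and up to *k*_{var} of the remaining positions may carry one of 3 wrong nucleotides. Function names are illustrative.

```python
from math import comb, log


def alpha(k_fix: int, k_var: int, l: int) -> int:
    """Number of distinct target sequences that one spacer + PAM of length l tolerates."""
    free = l - k_fix  # positions not covered by the fixed-position tolerance
    return 4 ** k_fix * sum(comb(free, i) * 3 ** i for i in range(k_var + 1))


def p_self(L: float, l: int, k_fix: int, k_var: int) -> float:
    """Per-spacer self-targeting probability, p_self ~ L * alpha * 4^-l."""
    return L * alpha(k_fix, k_var, l) * 4.0 ** (-l)


def l_eff(l: int, k_fix: int, k_var: int) -> float:
    """Effective length defined through p_self = L * 4^-l_eff."""
    return l - log(alpha(k_fix, k_var, l), 4)


# E. coli-like parameters from the text: L = 5e6 nt, l = 32 + 3 nt, k_fix = k_var = 5.
p = p_self(5e6, 35, 5, 5)
print(f"p_self ~ {p:.1e}, l_eff ~ {l_eff(35, 5, 5):.1f} nt")
```

With these parameters the probability rises by roughly eleven orders of magnitude relative to the exact-match case, in line with the ∼10^{−4} estimate quoted above.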

## B. Spacer length scales with repertoire size

To test this hypothesis, we exploited the large natural variability in CRISPR systems across different microbial species. As the self-targeting probability depends exponentially on spacer length (Eqs. 2–3), we expect small differences in length to lead to large variations in the risk of autoimmunity. If CRISPR repertoire sizes are selected to balance broader immunity against the risk of autoimmunity, then qualitatively we expect that species with shorter spacers should have smaller repertoires, while species with longer spacers should have larger ones (Fig. 1c).

To make this prediction more quantitative, suppose prokaryotes tolerate a maximum probability *P* of self-targeting, and that CRISPR-Cas systems are selected to maximize protection against pathogens subject to this constraint. Repertoires with *N* spacers incur a self-targeting probability of ∼ *Np*_{self}, and thus Eq. 2 implies

$$N \lesssim \frac{P}{p_{\mathrm{self}}} \approx \frac{P}{L}\,\frac{4^{l}}{\alpha(k_{\mathrm{fix}}, k_{\mathrm{var}}, l)}. \tag{4}$$

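Linearizing the dependence of the combinatorial factor *α* around typical spacer lengths *l*_{0} (see Methods) predicts a scaling relationship between spacer length and the logarithm of repertoire size,

$$\ln N \approx \mathrm{const} + \left(\ln 4 - \frac{k_{\mathrm{var}}}{l_0 - k_{\mathrm{fix}}}\right) l \approx \mathrm{const} + 1.2\,l, \tag{5}$$

where we arrived at the latter estimate by taking *k*_{var} ∼ *k*_{fix} ∼ 5 and *l*_{0} ∼ 35.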
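To see how the constraint translates into a slope, the following sketch evaluates ln *N* numerically. It assumes that the per-spacer probability has the form *p*_{self} ≈ *L α* 4^{−l}, with *α* counted as in Methods, and that repertoires saturate the bound, *N* ≈ *P*/*p*_{self}. The tolerance `P = 1e-2` is an arbitrary placeholder that only shifts the intercept, and the function names are illustrative.

```python
from math import comb, log


def alpha(k_fix: int, k_var: int, l: int) -> int:
    """Number of distinct targets tolerated by one spacer + PAM of length l."""
    free = l - k_fix
    return 4 ** k_fix * sum(comb(free, i) * 3 ** i for i in range(k_var + 1))


def log_max_repertoire(l: int, k_fix: int = 5, k_var: int = 5, L: float = 5e6, P: float = 1e-2) -> float:
    """ln of the largest N satisfying N * p_self <= P, with p_self = L * alpha * 4^-l."""
    return log(P) - log(L) + l * log(4) - log(alpha(k_fix, k_var, l))


# A central finite difference around l0 = 35 approximates d(ln N)/dl.
slope = (log_max_repertoire(36) - log_max_repertoire(34)) / 2
print(f"local slope of ln N versus l: {slope:.2f}")
```

The slope is slightly below ln 4 ≈ 1.39 because the number of tolerated mismatch configurations also grows with *l*.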

We analyzed a database of CRISPR-Cas systems identified in publicly available bacterial and archaeal genomes [5, 35] (see Methods). To sample widely from CRISPR-Cas systems while eliminating oversampling of certain species, we first selected strains carrying both CRISPR and cas loci, and then picked one strain at random from each species for further analysis (see Methods). We observed a multimodal distribution of spacer lengths acquired by these representative strains (Fig. 2a), consistent with different CRISPR-Cas types having narrow spacer length distributions (Fig. 3a). The distribution of spacer repertoire sizes, defined as the sum of CRISPR array sizes in each genome, was broad, ranging from 1 to 812 spacers (Fig. 2b).

A linear regression between spacer length and log-repertoire size gave a slope of 1.1 ± 0.1 (Fig. 2c), in line with the predicted scaling (Eq. 5). A range of cross-reactivity parameters is broadly consistent with this scaling, with a best-fit value of *k*_{var} = 3.41 ± 0.02 obtained assuming *k*_{fix} = *l*_{s}*/*6 (consistent with a 6-nt periodicity in tolerated fixed-position mismatches as in type I-E systems) (see Fig. S1). The empirical law holds over two orders of magnitude in CRISPR repertoire size, but over this range the spacer length changes only modestly. These changes, however, lead to significant differences in the self-targeting probability, which is exponential in spacer length (Eq. 3).

Some prokaryotes may tolerate self-targeting spacers because they have defective cas genes [12] or contain anti-CRISPRs [36]. To further test the link between autoimmune risk and spacer length, we investigated the incidence of missing cas genes across CRISPR-Cas systems. We expected that a higher autoimmune risk in species with shorter spacers would lead to a higher rate of cas gene loss. Thus, we analyzed how the fraction of strains with missing cas loci depends on spacer length, which shows the expected relationship (Fig. 2d). Once immunopathology from self-targeting is avoided by the loss of cas interference genes, the relation between spacer length and repertoire size should no longer be selected for. Indeed, we found no clear relation in strains with missing cas loci (Fig. 2e). Taken together, these observations strengthen the interpretation of the scaling law as arising from the modulation of autoimmunity risk by spacer length.

## C. Variable CRISPR-Cas type use underlies scaling

CRISPR-Cas systems are classified into different types and subtypes based on their evolutionary relationships and the use of different cas genes [2]. We wondered whether the aggregate scaling relationship between spacer length and repertoire size (Fig. 2) reflected differences at the level of CRISPR-Cas type usage. We thus grouped species by subtype when the genome contains a single CRISPR-Cas system, and placed species carrying multiple subtypes in a separate group.

For species carrying a single cas type, we aggregated all spacers found across species of each type to quantify the statistics of acquired spacer lengths. We observed differences in the spacer length distributions between types (Fig. 3a): (a) Type II-A and II-C systems have narrow distributions tightly clustered around 30 nt; (b) Type I-E and I-F systems also have narrow distributions, clustered around 32 nt, while other type I systems have spacers that are longer and more broadly distributed; (c) Type III systems have even longer and more broadly distributed spacers, with median lengths in the 36–39 nt range.

A broader distribution of acquired spacer lengths leads to a higher risk of autoimmunity than a narrow distribution with the same mean, since the self-targeting probability increases exponentially for shorter spacers. To account for an increase in autoimmune risk for broader distributions, we focused on the lower quartile of spacer lengths for each cas type as a proxy for autoimmune risk. Also, to account for the requirement of PAM recognition in type I and II (but not type III) systems, we added a PAM length of 3 nt to types I and II to obtain the overall length *l*. Strikingly, we observed that the predicted relationship between *l* and repertoire size also broadly holds between CRISPR-Cas types (Fig. 3b): Type II systems have the shortest spacers and the smallest repertoires, and among type I subtypes those with shorter spacers generally have smaller repertoires. Type III systems have smaller repertoires than type I systems despite somewhat longer spacers, but this is explained by the absence of PAMs and the broader spacer length distributions for the type III systems, both of which increase autoimmune risk.
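The length proxy described above can be sketched as follows. The spacer lengths are made-up illustrative values rather than data from the database, and `risk_proxy_length` is a hypothetical helper name. The quartile convention is that of Python's `statistics.quantiles` and may differ slightly from the one used for Fig. 3b.

```python
from statistics import quantiles


def risk_proxy_length(spacer_lengths, has_pam: bool, pam_length: int = 3) -> float:
    """Lower quartile of spacer lengths, plus the PAM length where a PAM is required."""
    lower_quartile = quantiles(spacer_lengths, n=4)[0]
    return lower_quartile + (pam_length if has_pam else 0)


# Illustrative values only: a narrow PAM-dependent type and a broad PAM-independent type.
narrow_with_pam = [32, 32, 33, 31, 32, 34, 32, 33]
broad_without_pam = [35, 39, 36, 42, 37, 34, 40, 38]

print(risk_proxy_length(narrow_with_pam, has_pam=True))
print(risk_proxy_length(broad_without_pam, has_pam=False))
```

Using the lower quartile rather than the mean penalizes broad distributions, because the shortest spacers dominate the autoimmune risk.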

We next tested whether this relation also carries over to species carrying multiple CRISPR-Cas systems, in the form of a differential use of cas types as a function of repertoire size. We divided species with multiple cas types into three equally sized groups by repertoire size, and determined the relative incidence of CRISPR subtypes within each group (Fig. 3c). We found that the use of types II, I-E, and I-F decreases with repertoire size in line with expectations, whereas two of the type III systems and the type I systems with the longest spacers show the opposite pattern. The relation between total repertoire size and spacer length in species with multiple cas subtypes was further reinforced by a direct analysis of the incidence of spacers of different lengths as a function of repertoire size, with a greater proportion of longer spacers present in larger repertoires (Fig. S2).

Taken together, we find that species carrying either single or multiple CRISPR-Cas systems differentially use CRISPR-Cas types having different spacer length distributions to form repertoires of different sizes. This differential use gives rise to the aggregate scaling observed in Fig. 2c, and is consistent with the hypothesis of minimizing the risk of heterologous autoimmunity.

## D. Dynamical origin of the scaling law

Dynamical mechanisms can give rise to the scaling law that our theory predicts, and which is found in the empirical data. While spacer dynamics involves complex epidemiological feedbacks [8, 37–44], here we consider a simple effective model in which spacer acquisition and loss are described as a birth-death process, such that spacers are acquired at a rate *b* and lost at a per-spacer rate *d* (Fig. 4a, left panel). This gives rise to a Poisson distribution of repertoire sizes at steady state, with mean *b/d* (see SI). Our statistical theory requires that the mean of the distribution should shift with spacer length. There are two mechanisms by which selection could lead to such a dependence. First, the negative fitness effect of acquiring self-targeting spacers [13] purges lineages that undergo deleterious acquisition events. Indeed, CRISPR arrays are selected for the absence of self-targeting spacers [12]. Effectively, this reduces the net acquisition rate among surviving lineages. Second, over longer evolutionary timescales, different CRISPR-Cas systems may be selected to acquire spacers at different rates depending on their respective risks of autoimmunity. These differences in rates could arise from the maintenance of multiple copies of cas genes, or through regulation of cas expression [45]. Indeed, spacer repertoire size increases with the number of cas loci (Fig. S3), suggesting that larger gene copy numbers of cas1 and cas2, necessary for spacer acquisition, result in greater acquisition rates. Interestingly, strains having exactly one copy of both cas1 and cas2 still obey a scaling relationship (Fig. S4), suggesting that regulation of these genes also contributes to minimizing autoimmune risk.

Let us suppose that one or both of these selection mechanisms lead to an effective spacer acquisition rate inversely proportional to the risk of self-targeting, *b* ∝ 1*/p*_{self}. To replicate the empirical analysis, we created a synthetic dataset of the same size by sampling strains at random from steady-state distributions at different spacer lengths, which have different *p*_{self} (Fig. 4a, middle panel) (see Methods). Plotting spacer length against mean repertoire size in the same way as we did for the empirical data, we recover a scaling law as predicted by our theory (Fig. 4a, right panel).

In addition to providing a dynamical explanation for scaling of the means, this birth-death model produces substantial variability around the mean relationship (Fig. 4a, right panel, green dots). In fact, the Poisson variance of Fig. 4a, originating from a constant per-spacer loss rate, is likely an underestimate. Spacer loss occurs through double-stranded DNA breaks followed by homologous recombination at a different CRISPR repeat, a process which may delete chunks of an array in a single deletion event (see e.g. [46, 47]). Including such correlated spacer loss greatly increases the variance. For example, in a simple analytically solvable limiting case where entire arrays are lost at once, the distribution of repertoire sizes becomes geometric (see SI) and thus very broad (Fig. 4b, left panel). Given this substantial variability, we wondered whether a comparative analysis of many species could still recover a scaling in this model. We thus sampled strains from geometric steady-state distributions whose means obey a scaling law as in panel a, observing a much larger variability in individual strains (Fig. 4b, right panel, green dots). However, the scaling of the mean is recovered with a fit to the dataset and when strains are binned by repertoire size (Fig. 4b, right panel).

Prompted by the broad variability predicted by correlated spacer loss, we analyzed the repertoire size distributions of species with many sequenced strains (see Methods). We indeed observed a broad distribution even among strains of the same species (Fig. 4c). We compared the repertoire size distributions of four pairs of highly-sampled species with the same CRISPR-Cas type, and found that the within-species variability comprises a substantial part of the overall variance. Additional variability between species, leading to different mean repertoire sizes for species with the same CRISPR-Cas type, might originate from different microbes inhabiting environments that differ in viral diversity and thus pressure to acquire broad immune defense. We tested the robustness of the comparative analysis to this additional source of variability, by sampling strains from steady-state distributions where we additionally sample the prefactor in *b* ∝ 1*/p*_{self} from a wide distribution (see Methods). This further increases the variability of individually sampled strains, but the means still show scaling (Fig. S5).

## II. DISCUSSION

An adaptive immune system is dangerous equipment to have in an organism. There is always the risk that the immune receptors, intended as defenses against foreign invaders, will instead target the self. In CRISPR-Cas systems, biophysical mechanisms avoiding various forms of autoimmunity such as targeting of the CRISPR locus and self-spacer acquisition are known [16, 20, 21, 23], but here we propose that heterologous autoimmunity, where spacers acquired from foreign DNA seed self-targeting, is a significant threat to microbes carrying CRISPR-Cas. This threat is analogous to off-target effects in genome-editing applications [25, 26], and has been observed in an experimental CRISPR-Cas system [33], but its wider implications for the evolution of CRISPR-Cas systems are unexplored. We showed that avoidance of this form of autoimmunity while maximizing antiviral defense predicts a scaling law relating spacer length and CRISPR repertoire size. The scaling depends on the number and nature of sequence mismatches permitted during CRISPR interference and primed acquisition.

To test our prediction we used a comparative approach analyzing the natural variation in CRISPR-Cas systems across microbial species, and demonstrated that: (a) the predicted scaling law is realized, (b) the observed scaling constrains parameters for cross-reactive CRISPR targeting to lie in a range consistent with experimental studies, (c) the scaling arises in part from differential usage of different CRISPR-Cas subtypes having different spacer length distributions, and (d) the scaling, and hence a balanced tradeoff between successful defense and autoimmunity, can be achieved by population-level selection mechanisms. In addition, we demonstrated a negative control: CRISPR arrays in species that no longer have functional Cas proteins, and thus are not at risk of autoimmunity, do not show the predicted scaling relation. We propose two further tests of the link between spacer length and autoimmune risk: (1) if cross-reactivity leads to self-targeting, then in addition to a depletion of self-targeting spacers in CRISPR arrays [12, 36], we predict a depletion of spacers several mismatches away from self-targets, and (2) our theory predicts that CRISPR-Cas subtypes with longer spacers should acquire spacers more readily.

A similar tradeoff between sensitivity to pathogens and autoimmune risk shapes the evolution of vertebrate adaptive immune systems [27, 48]. In the light of our results it would be interesting to determine whether this tradeoff also leads to a relation between the size of the immune repertoire and specificity in vertebrates. Such a relation will likely be harder to ascertain for vertebrates as patterns of cross-reactivity between lymphocyte receptors and antigens are more complex. Interestingly, however, T cell receptor hypervariable regions in humans are several nucleotides longer on average than those found in mice [49], which accompanies a substantial increase in repertoire size in humans. If longer hypervariable regions translate to a greater specificity on average, one might view the increased human receptor length as an adaptation to a larger repertoire. The key to our current work was the ability to compare microbial immune strategies across a large panel of phylogenetically distant species. Further insight into how this tradeoff shapes vertebrate immune systems might thus be gained by building on recent efforts to survey adaptive immune diversity in a broader range of vertebrates [50, 51].

Many theoretical studies of adaptive immunity in both prokaryotes [8, 37–44] and vertebrates [52–55] consider detailed dynamical models of evolving immune repertoires. For prokaryotes, such dynamical models can be regarded as describing the role of CRISPR-Cas as a short-term memory for defense against a co-evolving phage [56]. Studying adaptive immunity in this way requires detailed knowledge of the parameters controlling the dynamics, many of which are not well-characterized experimentally. In this paper, we took an alternative approach of focusing on the statistical logic of adaptive immunity, where we regard the bacterial immune system as a functional mechanism for maintaining a long-term memory of a diverse phage landscape [57], via probabilistic matching of genomic sequences. Previous work taking this perspective offered an explanation for why prokaryotic spacer repertoires lie in the range of a few dozen to a few hundred spacers [56]. As in our discussion of possible mechanisms for generating the observed scaling law, evolution should select dynamics that achieve the statistical organization that we predict, because this is what is useful for achieving a broad defense against phage while avoiding autoimmunity. A probability theory perspective of this kind has been applied to the logic of the adaptive immune repertoire of vertebrates [58–60], but to our knowledge we are presenting a novel approach to the study of CRISPR-based autoimmunity.

## III. MATERIALS AND METHODS

### a. Derivation of self-targeting probability

We estimate the probability of an alignment between a spacer + PAM sequence of length *l* and a host genome of length *L*. We assume that both sequences are random and uncorrelated, with nucleotide usage frequencies of 1/4. In a length-*L* genome, where *L* ≫ *l*, there are *L* − *l* + 1 ≈ *L* starting positions for an alignment. The matching probability at each position, *p*_{m}, depends on the number and nature of mismatches tolerated. In regimes where *p*_{m} is small, the matching probabilities at the different positions may be treated independently. Thus, the probability of having at least one alignment within the length-*L* genome is

$$p_{\mathrm{self}} = 1 - \left(1 - p_m\right)^{L} \approx L\,p_m. \tag{6}$$

If no mismatches are tolerated, *p*_{m} = 4^{−l} as in Eq. 1. At each site where a mismatch is allowed, four alternative nucleotide choices are possible. This gives a certain number *α* of unique complementary sequences matching to a given spacer, so that *p*_{m} = *α* 4^{−l}, which we compute as a function of the number and nature of mismatches. If up to *k*_{fix} mismatches are tolerated at fixed positions in the alignment, $\alpha = 4^{k_{\mathrm{fix}}}$. If instead up to *k*_{var} mismatches are tolerated anywhere in the complementary region, naively $\alpha = \binom{l}{k_{\mathrm{var}}} 4^{k_{\mathrm{var}}}$, where the binomial coefficient is the number of combinations of sites where mismatches are allowed. This is however an upper bound as matching sequences are overcounted, and the precise expression is

$$\alpha = \sum_{i=0}^{k_{\mathrm{var}}} \binom{l}{i}\, 3^{i}, \tag{7}$$

where each term in the sum is the number of unique complementary sequences having exactly *i* mismatches. The largest term dominates, giving $\alpha \approx \binom{l}{k_{\mathrm{var}}} 3^{k_{\mathrm{var}}}$. Thus, combining *k*_{fix} mismatches at fixed positions and up to *k*_{var} mismatches at any of the remaining *l* − *k*_{fix} positions gives

$$\alpha(k_{\mathrm{fix}}, k_{\mathrm{var}}, l) \approx 4^{k_{\mathrm{fix}}} \binom{l - k_{\mathrm{fix}}}{k_{\mathrm{var}}} 3^{k_{\mathrm{var}}}. \tag{8}$$

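We can introduce an effective spacer length, $l_{\mathrm{eff}} \equiv l - \log_4 \alpha$, so that *p*_{m} = 4^{−l_{eff}}. To leading order the binomial expression in Eq. 8 is approximated by $(l - k_{\mathrm{fix}})^{k_{\mathrm{var}}}$, dropping the *l*-independent factor 1/*k*_{var}!. This gives *l*_{eff} ≈ *l* − *k*_{fix} − *k*_{var} log_{4}[3(*l* − *k*_{fix})] as in Eq. 3.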
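The counting argument above can be verified by brute force for a short sequence. The sketch below enumerates all sequences over a 4-letter alphabet, counts those within a given number of mismatches of a target, and compares the result with the sum over exact mismatch numbers and with the naive bound. It then compares the effective length computed from the full sum with the leading-order expression. All function names are illustrative.

```python
from itertools import product
from math import comb, log


def brute_force_count(l: int, k: int) -> int:
    """Count sequences of length l that differ from a fixed target at no more than k positions."""
    target = "A" * l
    return sum(
        1
        for seq in product("ACGT", repeat=l)
        if sum(x != y for x, y in zip(seq, target)) <= k
    )


def exact_count(l: int, k: int) -> int:
    """Sum over i of (ways to place i mismatches) * (3 wrong nucleotides each)."""
    return sum(comb(l, i) * 3 ** i for i in range(k + 1))


def naive_bound(l: int, k: int) -> int:
    """Choose k sites and let each take any of 4 nucleotides; overcounts sequences."""
    return comb(l, k) * 4 ** k


def l_eff_exact(l: int, k_fix: int, k_var: int) -> float:
    return l - log(4 ** k_fix * exact_count(l - k_fix, k_var), 4)


def l_eff_leading_order(l: int, k_fix: int, k_var: int) -> float:
    return l - k_fix - k_var * log(3 * (l - k_fix), 4)


print(brute_force_count(6, 2), exact_count(6, 2), naive_bound(6, 2))
print(l_eff_exact(35, 5, 5), l_eff_leading_order(35, 5, 5))
```

The leading-order expression gives a smaller effective length than the full sum, since it omits the 1/*k*_{var}! factor; it is best read as a guide to the functional dependence on *l* rather than as a precise value.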

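The probability that a repertoire of *N* spacers avoids self-targeting, 1 − *P*_{self}, is the probability that none of its spacers self-targets. Treating the spacers as independent, this gives

$$P_{\mathrm{self}} = 1 - \left(1 - p_{\mathrm{self}}\right)^{N} \approx N\,p_{\mathrm{self}}. \tag{9}$$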

If CRISPR repertoires are selected to maximize repertoire size subject to the constraint *P*_{self} ≤ *P*, we obtain Eq. 4. Taylor expanding ln *N* around *l* = *l*_{0} gives Eq. 5 to linear order in *l* − *l*_{0}.

### b. Comparative analyses

For our comparative analyses we use CRISPRCasdb [5], which is a database of CRISPR and cas loci identified using CRISPRCasFinder [61] in public bacterial and archaeal whole-genome assemblies [35]. CRISPR arrays are assigned evidence levels 1–4, 4 being the highest confidence [61]. We restricted our analysis to level 4 CRISPR arrays only. Strains containing both annotated CRISPR and cas loci were used for the analyses in Figs. 2a–c, 3, and 4c. Strains containing annotated CRISPR but no cas loci were used for the analyses in Figs. 2d–e. In order to eliminate oversampling of certain species, we picked one strain at random from each species for further analysis (2,449 species with annotated CRISPR and cas loci, and 340 species with annotated CRISPR but no cas loci). To produce Fig. 3, the randomly chosen strains were grouped by annotated cas subtype, or into a separate group if they contain multiple subtypes. The 12 subtypes shown in Figs. 3a and c have > 10 species represented in CRISPRCasdb. To produce Fig. 4c, 4 pairs of species, each with > 50 sequenced strains of the same CRISPR-Cas type, were chosen for analysis.
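The filtering and subsampling steps described above might look as follows in code. This is a minimal sketch: the record layout (`species`, `strain`, `has_cas`, `arrays`, `evidence_level`, `n_spacers`) is a hypothetical in-memory representation chosen for illustration and is not the CRISPRCasdb schema.

```python
import random
from collections import defaultdict


def representative_strains(strains, seed: int = 0):
    """Keep level-4 arrays only, then pick one strain at random per species."""
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for strain in strains:
        arrays = [a for a in strain["arrays"] if a["evidence_level"] == 4]
        if arrays:  # drop strains left without any high-confidence array
            repertoire = sum(a["n_spacers"] for a in arrays)
            by_species[strain["species"]].append({**strain, "arrays": arrays, "repertoire_size": repertoire})
    return [rng.choice(group) for group in by_species.values()]


# Toy records, for illustration only.
toy = [
    {"species": "A", "strain": "A1", "has_cas": True, "arrays": [{"evidence_level": 4, "n_spacers": 12}]},
    {"species": "A", "strain": "A2", "has_cas": True, "arrays": [{"evidence_level": 2, "n_spacers": 3}]},
    {"species": "B", "strain": "B1", "has_cas": False,
     "arrays": [{"evidence_level": 4, "n_spacers": 7}, {"evidence_level": 4, "n_spacers": 5}]},
]

with_cas = representative_strains([s for s in toy if s["has_cas"]])
without_cas = representative_strains([s for s in toy if not s["has_cas"]])
print([(s["strain"], s["repertoire_size"]) for s in with_cas + without_cas])
```

Repertoire size is computed here as the sum of array sizes in a genome, following the definition used in the Results.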

### c. Synthetic data generation and analysis

A synthetic dataset producing a scaling law was generated in the following way: (1) A spacer of length *l*_{s} was drawn from the length distribution of Fig. 2a, and (2) a repertoire size distribution with mean *A/p*_{self} was created, from which one strain was sampled and added to the dataset. Parameter values of *L* = 5 × 10^{6}, *l*_{p} = 3, *k*_{fix} = *l*_{s}*/*6, *k*_{var} = 3, and *A* = 10^{−5.5} were used. The steady-state distributions are Poisson in Fig. 4a, and geometric with the same mean in Fig. 4b. In Fig. S5, *A* was sampled from a log-normal distribution with the same mean, and standard deviation chosen such that the coefficient of variation is 1.2.
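The procedure above can be sketched with the standard library alone. Several choices here are assumptions made for illustration: spacer lengths are drawn uniformly from 28 to 40 nt as a stand-in for the empirical distribution of Fig. 2a, *k*_{fix} = *l*_{s}/6 is rounded down to an integer, and repertoire sizes are drawn from the geometric distribution (as in Fig. 4b) because it can be sampled by inversion without extra dependencies.

```python
import random
from collections import defaultdict
from math import comb, floor, log, log1p

L_GENOME, L_PAM, K_VAR, A = 5e6, 3, 3, 10 ** -5.5  # parameter values quoted in the text


def p_self(l_s: int) -> float:
    l = l_s + L_PAM
    k_fix = l_s // 6  # assumption: k_fix = l_s / 6 rounded down
    free = l - k_fix
    alpha = 4 ** k_fix * sum(comb(free, i) * 3 ** i for i in range(K_VAR + 1))
    return L_GENOME * alpha * 4.0 ** (-l)


def mean_repertoire(l_s: int) -> float:
    return A / p_self(l_s)


def sample_geometric(mean: float, rng: random.Random) -> int:
    """Inverse-transform sample of P_n = q (1 - q)^n, whose mean is (1 - q) / q."""
    q = 1.0 / (1.0 + mean)
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return floor(log(u) / log1p(-q))


def synthetic_dataset(n_strains: int, rng: random.Random, lengths=range(28, 41)):
    lengths = list(lengths)  # stand-in for the empirical length distribution
    return [(l_s, sample_geometric(mean_repertoire(l_s), rng))
            for l_s in (rng.choice(lengths) for _ in range(n_strains))]


def fit_slope(data) -> float:
    """Least-squares slope of ln(mean repertoire size) against spacer length."""
    sizes = defaultdict(list)
    for l_s, n in data:
        sizes[l_s].append(n)
    pts = [(l_s, log(sum(v) / len(v))) for l_s, v in sizes.items() if sum(v) > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    return sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)


dataset = synthetic_dataset(2449, random.Random(0))
print(f"fitted slope: {fit_slope(dataset):.2f}")
```

Individual strains scatter widely around the mean under geometric sampling, so the fit is done on per-length means rather than on single strains.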

## SUPPLEMENTARY INFORMATION

### SI text on population dynamics

Consider a host population acquiring spacers of length *l*. Let the number of individuals in the population that have repertoire size *n* (*n* ≥ 0) be *X*_{n}. Consider spacer acquisition to occur at a rate *b*:

$$X_n \xrightarrow{\;b\;} X_{n+1}. \tag{10}$$

Spacer acquisition is balanced by spacer loss leading to a well-defined steady-state distribution of repertoire size. Spacer loss occurs through double-stranded DNA breaks followed by homologous recombination at a subsequent repeat, which deletes chunks of the CRISPR array (see e.g. [46, 47]). The precise rate and mechanism by which this occurs are not well understood. Here, we consider 3 solvable scenarios of this process:

$$X_n \xrightarrow{\;d\;} X_{n-1}, \tag{11}$$

$$X_n \xrightarrow{\;nd\;} X_{n-1}, \tag{12}$$

$$X_n \xrightarrow{\;d\;} X_0. \tag{13}$$

The first scenario represents spacer loss at the end(s) of the CRISPR array, hence independent of *n*. The second represents a constant per-spacer loss rate. For the third scenario, all spacers are lost in a deletion event, which is a solvable limit of several spacers being deleted at a time.
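A stochastic simulation gives a direct check on the three loss scenarios. The sketch below is a basic Gillespie-type simulation of a single lineage, assuming acquisition at rate *b* and the loss rules as described in the preceding paragraph; the rates `b = 1` and `d = 2` are arbitrary illustrative values, and the function name is hypothetical.

```python
import random


def time_averaged_size(b: float, d: float, scenario: int, t_max: float = 20000.0, seed: int = 1) -> float:
    """Time-averaged repertoire size of one lineage under the given loss scenario."""
    rng = random.Random(seed)
    n, t, area = 0, 0.0, 0.0
    while True:
        if n == 0:
            loss_rate = 0.0        # nothing to lose from an empty repertoire
        elif scenario == 2:
            loss_rate = n * d      # constant per-spacer loss rate
        else:
            loss_rate = d          # scenarios 1 and 3: loss events at rate d
        total = b + loss_rate
        dt = rng.expovariate(total)
        if t + dt >= t_max:
            area += n * (t_max - t)
            break
        area += n * dt
        t += dt
        if rng.random() < b / total:
            n += 1                 # acquisition
        elif scenario == 3:
            n = 0                  # whole repertoire lost at once
        else:
            n -= 1                 # one spacer lost
    return area / t_max


for scenario in (1, 2, 3):
    print(scenario, round(time_averaged_size(1.0, 2.0, scenario), 2))
```

For these rates, the time averages can be compared with the analytical steady-state means derived for each scenario.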

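*Scenario 1:* $X_n \xrightarrow{\;d\;} X_{n-1}$. The probabilities *P*_{n} (*n* ≥ 0) obey the following master equation:

$$\frac{dP_0}{dt} = -b P_0 + d P_1, \tag{14}$$

$$\frac{dP_n}{dt} = b P_{n-1} - (b + d) P_n + d P_{n+1} \qquad (n \ge 1). \tag{15}$$

The steady state fulfills the detailed balance condition,

$$b P_n = d P_{n+1}. \tag{16}$$

We can solve the recursion equation (Eq. 16) for the steady-state distribution,

$$P_n = \left(1 - \frac{b}{d}\right) \left(\frac{b}{d}\right)^{n}, \tag{17}$$

which is geometric with parameter 1 − *b/d*. Its mean is *b/*(*d* − *b*), implying that a well-defined steady state is only possible if *d > b*.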

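*Scenario 2:* $X_n \xrightarrow{\;nd\;} X_{n-1}$. The master equation is:

$$\frac{dP_0}{dt} = -b P_0 + d P_1, \tag{18}$$

$$\frac{dP_n}{dt} = b P_{n-1} - (b + nd) P_n + (n + 1) d P_{n+1} \qquad (n \ge 1). \tag{19}$$

At steady state, detailed balance again holds:

$$b P_n = (n + 1) d P_{n+1}. \tag{20}$$

Eq. 20 implies that the steady-state distribution is Poisson with mean *b/d*:

$$P_n = \frac{(b/d)^{n}}{n!}\, e^{-b/d}. \tag{21}$$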

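*Scenario 3:* $X_n \xrightarrow{\;d\;} X_0$. Here, the master equation is:

$$\frac{dP_0}{dt} = -b P_0 + d \sum_{m \ge 1} P_m, \tag{22}$$

$$\frac{dP_n}{dt} = b P_{n-1} - (b + d) P_n \qquad (n \ge 1). \tag{23}$$

Here there is no detailed balance, but probability flux is conserved,

$$b P_{n-1} = (b + d) P_n. \tag{24}$$

Eq. 24 implies that the steady-state distribution is geometric with parameter *d/*(*b* + *d*):

$$P_n = \frac{d}{b + d} \left(\frac{b}{b + d}\right)^{n}. \tag{25}$$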

The mean of this distribution is *b/d*.
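The three steady states can be checked numerically. The sketch below builds each distribution on a truncated range of *n*, verifies the corresponding balance condition term by term, and compares the means with the closed forms. The rates `b = 1` and `d = 2` and the truncation `N_MAX = 100` are arbitrary illustrative choices.

```python
from math import exp, factorial, isclose

b, d = 1.0, 2.0   # illustrative rates with d > b
N_MAX = 100       # truncation; the neglected tails are negligible for these rates

r = b / d
q = d / (b + d)
scenario1 = [(1 - r) * r ** n for n in range(N_MAX)]                  # geometric, parameter 1 - b/d
scenario2 = [exp(-r) * r ** n / factorial(n) for n in range(N_MAX)]   # Poisson, mean b/d
scenario3 = [q * (1 - q) ** n for n in range(N_MAX)]                  # geometric, parameter d/(b+d)


def mean(dist) -> float:
    return sum(n * p for n, p in enumerate(dist))


# Balance conditions: b P_n = d P_{n+1}, b P_n = (n+1) d P_{n+1}, b P_{n-1} = (b+d) P_n.
balance1 = all(isclose(b * scenario1[n], d * scenario1[n + 1]) for n in range(N_MAX - 1))
balance2 = all(isclose(b * scenario2[n], (n + 1) * d * scenario2[n + 1]) for n in range(N_MAX - 1))
balance3 = all(isclose(b * scenario3[n - 1], (b + d) * scenario3[n]) for n in range(1, N_MAX))

print(balance1, balance2, balance3)
print(mean(scenario1), mean(scenario2), mean(scenario3))
```

Scenarios 2 and 3 share the mean *b/d* but differ strongly in variance, which is the point made in the main text about correlated spacer loss broadening the repertoire size distribution.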

## Acknowledgements

We thank Serena Bradde for helpful comments on this paper. VB and HC were supported in part by a Simons Foundation grant in Mathematical Modeling for Living Systems (400425) for Adaptive Molecular Sensing in the Olfactory and Immune Systems, and by the NSF Center for the Physics of Biological Function (PHY-1734030). AM was supported by a Lewis–Sigler fellowship. VB thanks the Aspen Center for Physics, which is supported by NSF grant PHY-160761, for hospitality during this work.

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].