Abstract
Some bacteria and archaea possess an adaptive immune system that maintains a memory of past viral infections as DNA elements called spacers, stored in the CRISPR loci of their genomes. This memory is used to mount targeted responses against threats. However, cross-reactivity of CRISPR targeting mechanisms suggests that incorporation of foreign spacers can also lead to autoimmunity. We show that balancing antiviral defense against autoimmunity predicts a scaling law relating spacer length and CRISPR repertoire size. By analyzing a database of microbial CRISPR-Cas systems, we find that the predicted scaling law is realized empirically across prokaryotes, and arises through the proportionate use of different CRISPR types by species differing in the size of immune memory. In contrast, strains with nonfunctional CRISPR loci do not show this scaling. We also demonstrate that simple population-level selection mechanisms can generate the scaling, along with observed variations between strains of a given species.
Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated (Cas) proteins form a prokaryotic defense against phage [1]. CRISPR loci are composed of DNA repeats alternating with variable DNA segments called spacers, acquired from phage and other foreign genetic material. In a process called interference, spacer RNA guides sequence-specific binding and cleavage of target DNA by Cas proteins. In this way, spacers acquired during phage attack confer acquired, heritable resistance against subsequent invasions.
CRISPR-Cas systems are remarkably diverse, characterized by functionally divergent Cas proteins and distinct mechanisms for each stage of immune defense [2]. Spacer acquisition is mediated by the conserved Cas1–Cas2 adaptation module, which sets spacer lengths within a narrow range varying by system [3, 4]. CRISPR arrays are also broadly distributed in size, ranging from less than 10 to hundreds of spacers, and the full repertoire of a host may comprise several CRISPR arrays [5]. Maintaining a broad spacer repertoire confers resistance against many phages and possible escape mutants [6]. However, there are constitutive costs associated with Cas protein expression [7], and diminishing returns of broad defense due to finite Cas protein copy numbers [8, 9]. In addition, CRISPR-Cas systems can prevent horizontal transfer of beneficial mobile genetic elements [10, 11].
CRISPR-Cas systems can also cause autoimmunity, which occurs when a spacer guides interference somewhere on the host genome, leading to cell death and strong mutational pressure in the CRISPR-cas locus and target region [12–15]. The patchy incidence of CRISPR-Cas systems in prokaryotes (roughly 40% of bacteria and 85% of archaea [2]), and the presence of diverse mechanisms for self–nonself discrimination [16], suggest that avoiding autoimmunity is a constraint in the evolution of CRISPR-Cas systems [2, 13–19].
Several mechanisms exist in divergent CRISPR-Cas types for suppressing autoimmunity arising from different forms of potential self-targeting [16]. In type I and II systems, interference requires presence of a protospacer-adjacent motif (PAM), a 2–5-nt-long sequence adjacent to target DNA but absent in CRISPR repeats, preventing interference within the CRISPR array [20, 21]. In type III systems, interference requires transcription of target DNA, which avoids targeting phages integrated into the host chromosome (prophage) [22]. Spacers acquired from the host genome are naturally self-targeting, but there are mechanisms to suppress such acquisition [23, 24]. For example, type I-E systems acquire spacers preferentially at double-stranded DNA breaks, which occur primarily at stalled replication forks of replicating phage DNA, and acquisition is confined by Chi sites which are enriched in bacterial genomes [23].
Here we propose that CRISPR evolution is also shaped by heterologous autoimmunity, which occurs if an acquired foreign spacer and a segment of the host genome are sufficiently similar. The likelihood of this effect depends on sequence statistics and the specificity of CRISPR targeting mechanisms. Heterologous autoimmunity is analogous to off-target effects that are an important concern in CRISPR-Cas genome editing [25, 26], but the possible effects on prokaryotic adaptive immunity have not been explored. We combine a probabilistic modeling approach with comparative analyses of CRISPR repertoires across prokaryotes to show that: (a) heterologous autoimmunity is a significant threat caused by CRISPR-Cas immune defense, (b) avoidance of autoimmunity leads to a scaling law in CRISPR repertoires, and (c) the scaling law can be achieved by population-level selection. Our work suggests that avoidance of heterologous autoimmunity is a key factor shaping CRISPR repertoires and the evolution of CRISPR-Cas systems.
I. RESULTS
A. Cross-reactivity leads to autoimmunity
We approach heterologous self-targeting as a sequence-matching problem [27–29], and derive estimates for the probability of a spacer being sufficiently similar to at least one site in the host genome. For a spacer of length ls and PAM of length lp (where it exists), an exact match at a given position requires l ≡ ls + lp complementary nucleotides. In a host genome of length L, where L ≫ l, there are L − l + 1 ≈ L starting positions for a match. At leading order, and ignoring nucleotide usage biases, we may treat matches as occurring independently with probability $4^{-l}$. Thus, the probability of an exact match anywhere on the genome is (see Methods)

$$p_0 \approx L\,4^{-l}. \tag{1}$$

Considering order-of-magnitude parameter estimates for the E. coli type I-E system of $L = 5 \times 10^{6}$ nt, ls = 32 nt, and lp = 3 nt gives a negligible probability $p_0 \sim 10^{-15}$.
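As a quick numerical check of this order-of-magnitude estimate, the following sketch evaluates the leading-order expression L · 4^(−l) for the quoted E. coli parameters. The function name is illustrative, and the formula assumes a random genome with uniform nucleotide usage, as in the text.

```python
def exact_match_probability(genome_length: float, spacer_len: int, pam_len: int = 0) -> float:
    """Leading-order probability that a spacer (+ PAM) matches exactly
    somewhere on a random genome: p0 ~ L * 4**(-l), with l = ls + lp."""
    l = spacer_len + pam_len
    return genome_length * 4.0 ** (-l)


# E. coli type I-E order-of-magnitude parameters quoted in the text.
p0 = exact_match_probability(genome_length=5e6, spacer_len=32, pam_len=3)
print(f"p0 = {p0:.1e}")  # of order 1e-15
```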
However, CRISPR interference tolerates several mismatches between spacer RNA and target DNA depending on position and identity [25, 26, 30, 31]. In general, mismatches in the PAM are not allowed, and mismatches in the PAM-distal region are tolerated to a greater extent than mismatches in the PAM-proximal region [21]. Up to ∼ 5 mismatches are allowed in type II systems [25, 26], while in type I-E systems, errors are mostly tolerated at specific positions with a 6-nt periodicity [30, 31].
Partial spacer-target matching may also trigger primed spacer acquisition, which is the rapid acquisition of new spacers from regions surrounding target DNA [30, 32, 33]. In type I-E and I-F systems, primed acquisition tolerates many (up to 10) mismatches in the PAM and target region [30, 33]. Thus, a foreign spacer that does not cause direct interference may still trigger primed acquisition of self-spacers [33] and hence cause autoimmunity.
Given that the specificity of CRISPR interference and primed acquisition have been characterized for only a few systems, we consider two general classes of mismatch tolerance that include the above scenarios: (a) mismatches at kfix fixed positions, and (b) mismatches at kvar variable positions anywhere else in the target region. These increase the per-spacer self-targeting probability by a combinatorial factor α(kfix, kvar, l) (see Methods), so that

$$p_{\rm self} \approx L\,\alpha(k_{\rm fix}, k_{\rm var}, l)\,4^{-l}. \tag{2}$$
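A greater number of allowed mismatches greatly increases the likelihood of heterologous self-targeting (Fig. 1b). To gain intuition we can rewrite Eq. 2 as

$$p_{\rm self} \approx L\,4^{-l_{\rm eff}}, \qquad l_{\rm eff} \approx l - k_{\rm fix} - k_{\rm var}\log_4\!\left[3\,(l - k_{\rm fix})\right], \tag{3}$$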
FIG. 1. a, Sketch of the main components of CRISPR-Cas immune defense. b, Per-spacer probability of heterologous self-targeting, pself, as a function of the number of tolerated mismatches at fixed and variable positions along the spacer, kfix and kvar, respectively. c, We hypothesize that the evolution of CRISPR-Cas systems is constrained by the risk of heterologous autoimmunity. As the self-targeting probability depends strongly on spacer length, this predicts a scaling of repertoire size with spacer length.
where leff is the effective spacer length after discounting for allowed mismatches (see Methods). This shows that mismatches exponentially increase the probability of self-targeting, and variable-position mismatches particularly so. Considering the E. coli system as before, the matching probability increases to $p_{\rm self} \sim 10^{-4}$ with kfix = 5 nt and kvar = 5 nt (see Fig. 1b). Other CRISPR-Cas systems may similarly lie in parameter regimes with appreciable pself, especially when including indirect self-targeting through primed acquisition [34]. Furthermore, the probability of self-targeting is likely higher than implied by our calculations as it can be increased by correlations in sequence statistics between host and phage genomes [28, 29]. Given our estimates, we thus hypothesize that heterologous autoimmunity may occur generally and be a significant cost of CRISPR-Cas immunity.
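The effect of mismatch tolerance on the self-targeting probability can be explored numerically. The sketch below assumes the dominant-term form of the combinatorial factor described in Methods, α = 4^kfix · C(l − kfix, kvar) · 3^kvar, and defines the effective length as l − log4 α. Function names are illustrative, not taken from the original analysis.

```python
from math import comb, log


def alpha(k_fix: int, k_var: int, l: int) -> int:
    """Number of target sequences recognized by one spacer
    (assumed dominant-term form: fixed and variable mismatch positions)."""
    return 4 ** k_fix * comb(l - k_fix, k_var) * 3 ** k_var


def p_self(genome_length: float, l: int, k_fix: int, k_var: int) -> float:
    """Per-spacer self-targeting probability, p_self ~ L * alpha * 4**(-l)."""
    return genome_length * alpha(k_fix, k_var, l) * 4.0 ** (-l)


def effective_length(l: int, k_fix: int, k_var: int) -> float:
    """Effective spacer length l_eff = l - log4(alpha)."""
    return l - log(alpha(k_fix, k_var, l), 4)


L_genome, l = 5e6, 35  # E. coli estimates from the text (spacer + PAM)
for k_fix, k_var in [(0, 0), (5, 0), (0, 5), (5, 5)]:
    p = p_self(L_genome, l, k_fix, k_var)
    l_eff = effective_length(l, k_fix, k_var)
    print(f"k_fix={k_fix} k_var={k_var}  p_self={p:.1e}  l_eff={l_eff:.1f}")
```

With kfix = kvar = 5 the probability lands at the 10^-4 order quoted above, compared with 10^-15 when no mismatches are tolerated.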
B. Spacer length scales with repertoire size
To test this hypothesis, we exploited the large natural variability in CRISPR systems across different microbial species. As the self-targeting probability depends exponentially on spacer length (Eqs. 2–3), we expect small differences in length to lead to large variations in the risk of autoimmunity. If CRISPR repertoire sizes are selected to balance broader immunity against the risk of autoimmunity, then qualitatively we expect that species with shorter spacers should have smaller repertoires, while species with longer spacers should have larger ones (Fig. 1c).
To make this prediction more quantitative, suppose prokaryotes tolerate a maximum probability P of self-targeting, and that CRISPR-Cas systems are selected to maximize protection against pathogens subject to this constraint. Repertoires with N spacers incur a self-targeting probability of ∼ N pself, and thus Eq. 2 implies

$$N \approx \frac{P}{p_{\rm self}} \approx \frac{P}{L}\,\frac{4^{l}}{\alpha(k_{\rm fix}, k_{\rm var}, l)}. \tag{4}$$

Linearizing the dependence of the combinatorial factor α around typical spacer lengths l0 (see Methods) predicts a scaling relationship between spacer length and the logarithm of repertoire size

$$\ln N \approx \left[\ln 4 - \frac{k_{\rm var}}{l_0 - k_{\rm fix}}\right] l + \text{const.} \approx 1.2\,l + \text{const.}, \tag{5}$$

where we arrived at the latter estimate by taking kvar ∼ kfix ∼ 5 and l0 ∼ 35.
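The predicted slope can be checked numerically. This sketch assumes N ∝ 4^l / α with α = 4^kfix · C(l − kfix, kvar) · 3^kvar, and compares the linearized slope ln 4 − kvar/(l0 − kfix), which follows from approximating the binomial coefficient by a power law in (l − kfix), with a one-nucleotide finite difference of ln N. Function names are illustrative.

```python
from math import comb, log


def slope_linearized(l0: int, k_fix: int, k_var: int) -> float:
    """d(ln N)/dl at l0, using binomial(l - k_fix, k_var) ~ (l - k_fix)**k_var."""
    return log(4) - k_var / (l0 - k_fix)


def log_repertoire_size(l: int, k_fix: int, k_var: int) -> float:
    """ln N up to the additive constant ln(P/L), with N = (P/L) * 4**l / alpha."""
    alpha = 4 ** k_fix * comb(l - k_fix, k_var) * 3 ** k_var
    return l * log(4) - log(alpha)


l0, k_fix, k_var = 35, 5, 5
analytic = slope_linearized(l0, k_fix, k_var)
finite_diff = log_repertoire_size(l0 + 1, k_fix, k_var) - log_repertoire_size(l0, k_fix, k_var)
print(f"linearized slope: {analytic:.2f}, finite difference: {finite_diff:.2f}")
```

Both values come out near 1.2, below the mismatch-free value ln 4 ≈ 1.39.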
We analyzed a database of CRISPR-Cas systems identified in publicly available bacterial and archaeal genomes [5, 35] (see Methods). To sample widely from CRISPR-Cas systems while eliminating oversampling of certain species, we first selected strains carrying both CRISPR and cas loci, and then picked one strain at random from each species for further analysis (see Methods). We observed a multimodal distribution of spacer lengths acquired by these representative strains (Fig. 2a), consistent with different CRISPR-Cas types having narrow spacer length distributions (Fig. 3a). The distribution of spacer repertoire sizes, defined as the sum of CRISPR array sizes in each genome, was broad, ranging from 1 to 812 spacers (Fig. 2b).
FIG. 2. a, b, Distributions of spacer lengths (a) and repertoire sizes (b) across prokaryotes. For each of 2,449 species with CRISPR and cas loci we randomly picked a single strain (see Methods), and calculated its repertoire size as the sum of all CRISPR array sizes present in the genome. The length distribution of all spacers found in these filtered strains is shown in (a). Bins in (b) were formed by dividing each decade into 10 equal bins on a log scale. c, Scaling of repertoire size with spacer length. A linear fit of the mean spacer length against log repertoire size was performed on all 2,449 species, and is shown alongside the data, which is binned by repertoire size (50 strains/bin). The fitted slope is consistent with theory predictions (Eq. 5). d, Fraction of species with missing cas genes decreases with spacer length. e, Spacer length and repertoire size do not show a clear relation in strains with nonfunctional CRISPR loci. A linear fit was performed on 340 species with CRISPR arrays but no cas loci, and is shown alongside the data, which is binned by repertoire size (50 strains/bin). Error bars in panels c–e denote the standard error of the mean in each bin, which in (d) are calculated assuming a binomial probability distribution for the absence of cas at each spacer length.
FIG. 3. a, Length distributions of all spacers found in single-type strains aggregated by CRISPR-Cas type (for types with > 10 species in CRISPRCasdb [35]; see Methods). Also indicated are the median (solid vertical line) and lower quartile (dotted vertical line) for each distribution. The subtypes are presented in order of lower quartile. b, A trend is observed between spacer + PAM length and repertoire size for different CRISPR-Cas types. For spacer lengths, the central dot is the lower quartile and the whisker runs between the lowest decile and the median. Repertoire sizes are indicated as the mean ± standard error. To indicate the requirement of PAM recognition, a length of 3 nt was added to all type I and II (but not type III) subtypes. c, Variable usage of cas subtypes among multiple-type strains. A total of 826 strains with multiple CRISPR-Cas systems, randomly picked from different species, were analyzed. They were divided into 3 groups of 275, 275 and 276 strains having small, medium, and large repertoire size, respectively. Each subfigure was normalized to 1, so that the bars indicate the relative incidence of a subtype in each repertoire size bin. The order of subtypes is the same as in panel a.
A linear regression between spacer length and log-repertoire size gave a slope of 1.1 ± 0.1 (Fig. 2c), in line with the predicted scaling (Eq. 5). A range of cross-reactivity parameters is broadly consistent with this scaling (Fig. S1), with a best-fit value of kvar = 3.41 ± 0.02 obtained assuming kfix = ls/6 (consistent with a 6-nt periodicity in tolerated fixed-position mismatches as in type I-E systems) (see Fig. S1). The empirical law holds over two orders of magnitude in CRISPR repertoire size, but over this range the spacer length changes only modestly. These changes nevertheless lead to significant differences in the self-targeting probability, which is exponential in spacer length (Eq. 3).
FIG. S1. Cross-reactivity parameters obtained by a fit to the empirical data lie in a plausible range. The blue points are data from 2,449 species binned in increasing windows of repertoire size (50 species/bin), and the orange line is the linear fit to all species as in Fig. 2c. The green line is the naive ln 4 scaling (Eq. 5). The fitted slope is consistent with a broad range of cross-reactivity parameters (yellow region). A best fit to Eq. 4 was performed, in which lp was fixed at 3, and kfix was set to ls/6, consistent with a 6-nt periodicity in mismatch tolerance in type I-E systems [30, 31] and approximately 5 allowed mismatches in type II systems in which most spacer lengths are ∼ 30 nt [25, 26]. We found best-fit values of kvar = 3.41 ± 0.02 and log10(L/P) = 11.47 ± 0.02, where the errors are 90% confidence intervals. The estimate of kvar is consistent with primed acquisition tolerating many mismatches, up to 10 in some systems [30, 33], and the estimate of L/P implies a maximum risk of self-targeting P in the range of $10^{-4}$ to $10^{-5}$. We expect these cross-reactivity parameters to show significant variation around these means in individual species and systems (see Fig. 4).
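The binning and fitting steps can be sketched as follows. This is a minimal illustration on hypothetical toy data, not the original analysis code. It assumes that ln N is regressed on spacer length (one reading of the fit described above) and that strains are grouped into consecutive bins of 50 after sorting by repertoire size.

```python
from math import exp, log, sqrt
from statistics import mean, stdev


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def bin_by_repertoire_size(strains, bin_size=50):
    """Sort (repertoire_size, spacer_length) pairs by size and summarize each bin
    as (mean ln N, mean spacer length, standard error of the mean length)."""
    ordered = sorted(strains)
    bins = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        lengths = [length for _, length in chunk]
        sem = stdev(lengths) / sqrt(len(lengths)) if len(lengths) > 1 else 0.0
        bins.append((mean(log(n) for n, _ in chunk), mean(lengths), sem))
    return bins


# Hypothetical toy data: 10 strains at each length, with ln N = 1.1 * l - 32.
toy = [(round(exp(1.1 * l - 32)), l) for l in range(30, 40) for _ in range(10)]
slope = ols_slope([l for _, l in toy], [log(n) for n, _ in toy])
bins = bin_by_repertoire_size(toy)
print(f"slope = {slope:.2f}, number of bins = {len(bins)}")
```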
Some prokaryotes may tolerate self-targeting spacers because they have defective cas genes [12] or contain anti-CRISPRs [36]. To further test the link between autoimmune risk and spacer length, we investigated the incidence of missing cas genes across CRISPR-Cas systems. We expected that a higher autoimmune risk in species with shorter spacers would lead to a higher rate of cas gene loss. Thus, we analyzed how the fraction of strains with missing cas loci depends on spacer length, which shows the expected relationship (Fig. 2d). Once immunopathology from self-targeting is avoided by the loss of cas interference genes, the relation between spacer length and repertoire size should no longer be selected for. Indeed, we found no clear relation in strains with missing cas loci (Fig. 2e). Taken together, these observations strengthen the interpretation of the scaling law as arising from the modulation of autoimmunity risk by spacer length.
C. Variable CRISPR-Cas type use underlies scaling
CRISPR-Cas systems are classified into different types and subtypes based on their evolutionary relationships and the use of different cas genes [2]. We wondered whether the aggregate scaling relationship between spacer length and repertoire size (Fig. 2) reflected differences at the level of CRISPR-Cas type usage. We thus grouped species by subtype when the genome carries a single CRISPR-Cas system, and placed species carrying multiple subtypes in a separate group.
For species carrying a single cas type, we aggregated all spacers found across species of each type to quantify the statistics of acquired spacer lengths. We observed differences in the spacer length distributions between types (Fig. 3a): (a) Type II-A and II-C systems have narrow distributions tightly clustered around 30 nt; (b) Type I-E and I-F systems also have narrow distributions, clustered around 32 nt, while other type I systems have spacers that are longer and more broadly distributed; (c) Type III systems have even longer and more broadly distributed spacers, with median lengths in the 36–39 nt range.
A broader distribution of acquired spacer lengths leads to a higher risk of autoimmunity than a narrow distribution with the same mean, since the self-targeting probability increases exponentially for shorter spacers. To account for an increase in autoimmune risk for broader distributions, we focused on the lower quartile of spacer lengths for each cas type as a proxy for autoimmune risk. Also, to account for the requirement of PAM recognition in type I and II (but not type III) systems, we added a PAM length of 3 nt to types I and II to obtain the overall length l. Strikingly, we observed that the predicted relationship between l and repertoire size also broadly holds between CRISPR-Cas types (Fig. 3b): Type II systems have the shortest spacers and the smallest repertoires, and among type I subtypes those with shorter spacers generally have smaller repertoires. Type III systems have smaller repertoires than type I systems despite somewhat longer spacers, but this is explained by the absence of PAMs and the broader spacer length distributions for the type III systems, both of which increase autoimmune risk.
We next tested whether this relation also carries over to species carrying multiple CRISPR-Cas systems, in the form of a differential use of cas types as a function of repertoire size. We divided species with multiple cas types into three equally sized groups by repertoire size, and determined the relative incidence of CRISPR subtypes within each group (Fig. 3c). We found that the use of types II, I-E, and I-F decreases with repertoire size in line with expectations, and found the opposite pattern for two of the type III systems and the type I systems with the longest spacers. The relation between total repertoire size and spacer length in species with multiple cas subtypes was further reinforced by a direct analysis of the incidence of spacers of different lengths as a function of repertoire size, with a greater proportion of longer spacers present in larger repertoires (Fig. S2).
FIG. S2. Variable usage of spacer lengths among multiple-type strains. 826 species with multiple CRISPR-Cas systems were divided into 3 groups of small, medium and large repertoire sizes containing 275, 275 and 276 species, respectively. Each repertoire size bin was normalized to 1, so that the bars indicate the fraction of spacers in each repertoire size bin with that length. The usage of spacers of length 32 nt decreases with repertoire size, while usage of spacers of length ≥ 35 nt increases with repertoire size among these strains.
Taken together, we find that species carrying either single or multiple CRISPR-Cas systems differentially use CRISPR-Cas types having different spacer length distributions to form repertoires of different sizes. This differential use gives rise to the aggregate scaling observed in Fig. 2c, and is consistent with the hypothesis of minimizing the risk of heterologous autoimmunity.
D. Dynamical origin of the scaling law
Dynamical mechanisms can give rise to the scaling law that our theory predicts, and which is found in the empirical data. While spacer dynamics involves complex epidemiological feedbacks [8, 37–44], here we consider a simple effective model in which spacer acquisition and loss are described as a birth-death process, such that spacers are acquired at a rate b and lost at a per-spacer rate d (Fig. 4a, left panel). This gives rise to a Poisson distribution of repertoire sizes at steady state, with mean b/d (see SI). Our statistical theory requires that the mean of the distribution should shift with spacer length. There are two mechanisms by which selection could lead to such a dependence. First, the negative fitness effect of acquiring self-targeting spacers [13] purges lineages that undergo deleterious acquisition events. Indeed, CRISPR arrays are selected for the absence of self-targeting spacers [12]. Effectively, this reduces the net acquisition rate among surviving lineages. Second, over longer evolutionary timescales, different CRISPR-Cas systems may be selected to acquire spacers at different rates depending on their respective risks of autoimmunity. These differences in rates could arise from the maintenance of multiple copies of cas genes, or through regulation of cas expression [45]. Indeed, spacer repertoire size increases with the number of cas loci (Fig. S3), suggesting that larger gene copy numbers of cas1 and cas2, necessary for spacer acquisition, result in greater acquisition rates. Interestingly, strains having exactly one copy of both cas1 and cas2 still obey a scaling relationship (Fig. S4), suggesting that regulation of these genes also contributes to minimizing autoimmune risk.
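The steady state of this birth-death process can be illustrated with a short stochastic simulation. The sketch below uses a standard Gillespie algorithm with acquisition at rate b and loss at per-spacer rate d, and checks that the time-averaged repertoire size approaches b/d. The parameter values are arbitrary choices for illustration.

```python
import random


def time_averaged_repertoire(b: float, d: float, t_max: float, seed: int = 0) -> float:
    """Gillespie simulation of spacer acquisition (rate b) and loss
    (rate d per spacer); returns the time-averaged repertoire size."""
    rng = random.Random(seed)
    n, t, area = 0, 0.0, 0.0
    while t < t_max:
        total_rate = b + d * n
        # Waiting time to the next event, truncated at the end of the run.
        dt = min(rng.expovariate(total_rate), t_max - t)
        area += n * dt
        t += dt
        if t >= t_max:
            break
        # Choose acquisition or loss in proportion to their rates.
        if rng.random() < b / total_rate:
            n += 1
        else:
            n -= 1
    return area / t_max


b, d = 20.0, 1.0  # hypothetical rates; expected steady-state mean b/d = 20
average = time_averaged_repertoire(b, d, t_max=2000.0)
print(f"time-averaged repertoire size: {average:.1f} (b/d = {b / d:.0f})")
```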
FIG. S3. Spacer repertoire size is correlated with the number of CRISPR and cas loci. Data from 2,449 representative strains belonging to different species are binned by repertoire size (50 strains/bin). Error bars denote the standard error of the mean in each bin.
FIG. S4. Repertoire size versus mean spacer length for strains restricted to one annotated gene copy of cas1 and cas2. 1,578 out of the 2,449 sampled species contain one gene copy of cas1 and cas2. The orange line is a linear fit to these species, shown alongside the data, which are binned by repertoire size (50 species/bin). Error bars denote the standard error of the mean in each bin.
FIG. 4. a, In our model, strains acquire spacers at a rate b and lose them with a per-spacer rate d, giving rise to a Poisson distribution at steady state with mean b/d (left panel). b is selected to minimize the risk of heterologous autoimmunity, such that species differing in pself have different mean repertoire sizes (middle panel). We generate a synthetic dataset of strains by sampling from steady-state distributions with different spacer lengths and hence pself (see Methods). The synthetic data displays scaling of the mean and variability on the single-strain level (right panel). The green points show 100 individually sampled strains, the blue points show means after binning by repertoire size (50 species/bin, 2,450 species total), and the orange line is a fit to all 2,450 sampled points. b, Correlated spacer loss broadens the predicted distribution of repertoire sizes. We consider a model in which all spacers are lost simultaneously during a deletion event, which leads to a geometric steady-state distribution (see SI). Despite this additional variability, a synthetic sample generated as in panel a shows scaling of the means. c, The distributions of repertoire sizes of sequenced strains belonging to the same species are broad. 4 pairs of species with > 50 sequenced strains and with the indicated CRISPR-Cas type are displayed. Vertical lines denote the mean for each species.
Let us suppose that one or both of these selection mechanisms lead to an effective spacer acquisition rate inversely proportional to the risk of self-targeting, b ∝ 1/pself. To replicate the empirical analysis, we created a synthetic dataset of the same size by sampling strains at random from steady-state distributions at different spacer lengths, which have different pself (Fig. 4a, middle panel) (see Methods). Plotting spacer length against mean repertoire size in the same way as we did for the empirical data, we recover a scaling law as predicted by our theory (Fig. 4a, right panel).
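A synthetic dataset of this kind can be generated in a few lines. The sketch below is a simplified stand-in for the procedure described in the text. It assumes the dominant-term form α = 4^kfix · C(l − kfix, kvar) · 3^kvar with kfix = kvar = 5, Poisson steady states with mean proportional to 1/pself, a hypothetical range of lengths, and a prefactor chosen so that the shortest length has a mean repertoire of 5 spacers. Strains that happen to have no spacers are dropped before taking logarithms.

```python
import random
from math import comb, log
from statistics import mean


def p_self_relative(l: int, k_fix: int = 5, k_var: int = 5) -> float:
    """Self-targeting probability up to the constant factor L (assumed alpha form)."""
    return 4 ** k_fix * comb(l - k_fix, k_var) * 3 ** k_var * 4.0 ** (-l)


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Poisson sample: count unit-rate exponential arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


rng = random.Random(1)
lengths = range(33, 39)                # spacer + PAM lengths (hypothetical choice)
prefactor = 5 * p_self_relative(33)    # mean repertoire size of 5 at l = 33
data = []
for l in lengths:
    mean_size = prefactor / p_self_relative(l)  # steady-state mean b/d, b ∝ 1/p_self
    for _ in range(50):
        n = sample_poisson(mean_size, rng)
        if n > 0:  # strains without spacers would not enter a CRISPR dataset
            data.append((l, log(n)))

mx = mean(x for x, _ in data)
my = mean(y for _, y in data)
slope = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
print(f"fitted slope of ln N against l: {slope:.2f}")
```

The fitted slope should fall close to the value near 1.2 expected from Eq. 5 for these mismatch parameters, despite the Poisson scatter of individual strains.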
In addition to providing a dynamical explanation for scaling of the means, this birth-death model produces substantial variability around the mean relationship (Fig. 4a, right panel, green dots). In fact, the Poisson variance of Fig. 4a, originating from a constant per-spacer loss rate, is likely an underestimate. Spacer loss occurs through double-stranded DNA breaks followed by homologous recombination at a different CRISPR repeat, a process which may delete chunks of an array in a single deletion event (see e.g. [46, 47]). Including such correlated spacer loss greatly increases the variance. For example, in a simple analytically solvable limiting case where entire arrays are lost at once, the distribution of repertoire sizes becomes geometric (see SI) and thus very broad (Fig. 4b, left panel). Given this substantial variability, we wondered whether a comparative analysis of many species could still recover a scaling in this model. We thus sampled strains from geometric steady-state distributions whose means obey a scaling law as in panel a, observing a much larger variability in individual strains (Fig. 4b, right panel, green dots). However, the scaling of the mean is recovered with a fit to the dataset and when strains are binned by repertoire size (Fig. 4b, right panel).
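To see how much broader the geometric steady state is than the Poisson one, the sketch below evaluates the moments of the geometric distribution numerically. It assumes the limiting case in which spacers are acquired at rate b and the whole repertoire is lost at rate d, for which the steady state is P(n) = (1 − q) q^n with q = b/(b + d). This is the standard result for such a reset process; the SI derivation is not reproduced here.

```python
def geometric_pmf(n: int, mean_size: float) -> float:
    """Steady-state P(n) for acquisition at rate b and whole-array loss at rate d,
    parameterized by the mean b/d (so that q = b / (b + d))."""
    q = mean_size / (1.0 + mean_size)
    return (1.0 - q) * q ** n


def moments(pmf, n_max: int):
    """Mean and variance of a distribution on 0..n_max-1 by direct summation."""
    m1 = sum(n * pmf(n) for n in range(n_max))
    m2 = sum(n * n * pmf(n) for n in range(n_max))
    return m1, m2 - m1 ** 2


mean_size = 20.0  # hypothetical b/d
geo_mean, geo_var = moments(lambda n: geometric_pmf(n, mean_size), n_max=5000)
print(f"geometric: mean = {geo_mean:.1f}, variance = {geo_var:.1f}")
print(f"Poisson with the same mean: variance = {mean_size:.1f}")
```

For the same mean m, the geometric variance is m(1 + m) rather than m, which is why individual strains scatter so widely in this limit.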
Prompted by the broad variability predicted by correlated spacer loss, we analyzed the repertoire size distributions of species with many sequenced strains (see Methods). We indeed observed a broad distribution even among strains of the same species (Fig. 4c). We compared the repertoire size distributions of four pairs of highly-sampled species with the same CRISPR-Cas type, and found that the within-species variability comprises a substantial part of the overall variance. Additional variability between species, leading to different mean repertoire sizes for species with the same CRISPR-Cas type, might originate from different microbes inhabiting environments that differ in viral diversity and thus pressure to acquire broad immune defense. We tested the robustness of the comparative analysis to this additional source of variability, by sampling strains from steady-state distributions where we additionally sample the prefactor in b ∝ 1/pself from a wide distribution (see Methods). This further increases the variability of individually sampled strains, but the means still show scaling (Fig. S5).
FIG. S5. Species- and system-specific stochasticity increases the variability of the sampled data, but a scaling law is recovered by binning by repertoire size. The sampling procedure on synthetically generated data is replicated as in Fig. 4. Individuals were drawn from steady-state distributions with mean proportional to 1/pself, but each time the prefactor A was also drawn from a wide (log-normal) distribution with the same mean as in Fig. 4a–b, and standard deviation chosen such that the coefficient of variation is 1.2. This results in large variability in the data, but binning recovers a clear relation between mean repertoire size and spacer length. The green points show 100 individually sampled strains, the blue points show means after binning by repertoire size (50 species/bin, 2,450 species total), and the orange line is a fit.
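Drawing a log-normal prefactor with a prescribed mean and coefficient of variation requires converting these to the parameters of the underlying normal distribution. The sketch below shows one way to do this, using σ² = ln(1 + CV²) and μ = ln(mean) − σ²/2. The mean of 1 is an arbitrary normalization, and the function name is illustrative.

```python
import random
from math import exp, log, sqrt


def lognormal_params(mean_value: float, cv: float):
    """Parameters (mu, sigma) of the underlying normal distribution for a
    log-normal variable with the given mean and coefficient of variation."""
    sigma2 = log(1.0 + cv ** 2)
    mu = log(mean_value) - sigma2 / 2.0
    return mu, sqrt(sigma2)


rng = random.Random(0)
mu, sigma = lognormal_params(mean_value=1.0, cv=1.2)
samples = [rng.lognormvariate(mu, sigma) for _ in range(200_000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
sample_cv = sqrt(sample_var) / sample_mean
print(f"sample mean = {sample_mean:.2f}, sample CV = {sample_cv:.2f}")
```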
II. DISCUSSION
An adaptive immune system is dangerous equipment to have in an organism. There is always the risk that the immune receptors, intended as defenses against foreign invaders, will instead target the self. In CRISPR-Cas systems, biophysical mechanisms avoiding various forms of autoimmunity such as targeting of the CRISPR locus and self-spacer acquisition are known [16, 20, 21, 23], but here we propose that heterologous autoimmunity, where spacers acquired from foreign DNA seed self-targeting, is a significant threat to microbes carrying CRISPR-Cas. This threat is analogous to off-target effects in genome-editing applications [25, 26], and has been observed in an experimental CRISPR-Cas system [33], but its wider implications for the evolution of CRISPR-Cas systems are unexplored. We showed that avoidance of this form of autoimmunity while maximizing antiviral defense predicts a scaling law relating spacer length and CRISPR repertoire size. The scaling depends on the number and nature of sequence mismatches permitted during CRISPR interference and primed acquisition.
To test our prediction we used a comparative approach analyzing the natural variation in CRISPR-Cas systems across microbial species, and demonstrated that: (a) the predicted scaling law is realized, (b) the observed scaling constrains parameters for cross-reactive CRISPR targeting to lie in a range consistent with experimental studies, (c) the scaling arises in part from differential usage of different CRISPR-cas subtypes having different spacer length distributions, and (d) the scaling, and hence a balanced tradeoff between successful defense and autoimmunity, can be achieved by population-level selection mechanisms. In addition, we demonstrated a negative control: CRISPR arrays in species that no longer have functional Cas proteins, and thus are not at risk of autoimmunity, do not show the predicted scaling relation. We propose two further tests of the link between spacer length and autoimmune risk: (1) if cross-reactivity leads to self-targeting, then in addition to a depletion of self-targeting spacers in CRISPR arrays [12, 36], we predict a depletion of spacers several mismatches away from self-targets, and (2) our theory predicts that CRISPR-Cas subtypes with longer spacers should acquire spacers more readily.
A similar tradeoff between sensitivity to pathogens and autoimmune risk shapes the evolution of vertebrate adaptive immune systems [27, 48]. In the light of our results it would be interesting to determine whether this tradeoff also leads to a relation between the size of the immune repertoire and specificity in vertebrates. Such a relation will likely be harder to ascertain for vertebrates as patterns of cross-reactivity between lymphocyte receptors and antigens are more complex. Interestingly, however, T cell receptor hypervariable regions in humans are several nucleotides longer on average than those found in mice [49], which accompanies a substantial increase in repertoire size in humans. If longer hypervariable regions translate to a greater specificity on average, one might view the increased human receptor length as an adaptation to a larger repertoire. The key to our current work was the ability to compare microbial immune strategies across a large panel of phylogenetically distant species. Further insight into how this tradeoff shapes vertebrate immune systems might thus be gained by building on recent efforts to survey adaptive immune diversity in a broader range of vertebrates [50, 51].
Many theoretical studies of adaptive immunity in both prokaryotes [8, 37–44] and vertebrates [52–55] consider detailed dynamical models of evolving immune repertoires. For prokaryotes, such dynamical models can be regarded as describing the role of CRISPR-Cas as a short-term memory for defense against a co-evolving phage [56]. Studying adaptive immunity in this way requires detailed knowledge of the parameters controlling the dynamics, many of which are not well-characterized experimentally. In this paper, we took an alternative approach of focusing on the statistical logic of adaptive immunity, where we regard the bacterial immune system as a functional mechanism for maintaining a long-term memory of a diverse phage landscape [57], via probabilistic matching of genomic sequences. Previous work taking this perspective offered an explanation for why prokaryotic spacer repertoires lie in the range of a few dozen to a few hundred spacers [56]. As in our discussion of possible mechanisms for generating the observed scaling law, evolution should select dynamics that achieve the statistical organization that we predict, because this is what is useful for achieving a broad defense against phage while avoiding autoimmunity. A probability theory perspective of this kind has been applied to the logic of the adaptive immune repertoire of vertebrates [58–60], but to our knowledge we are presenting a novel approach to the study of CRISPR-based autoimmunity.
III. MATERIALS AND METHODS
a. Derivation of self-targeting probability
We estimate the probability of an alignment between a spacer + PAM sequence of length $l$ and a host genome of length $L$. We assume that both sequences are random and uncorrelated, with nucleotide usage frequencies of 1/4. In a length-$L$ genome, where $L \gg l$, there are $L - l + 1 \approx L$ starting positions for an alignment. The matching probability at each position, $p_m$, depends on the number and nature of mismatches tolerated. In regimes where $p_m$ is small, the matching probabilities at the different positions may be treated independently. Thus, the probability of having at least one alignment within the length-$L$ genome is
$$p_{\mathrm{self}} = 1 - (1 - p_m)^{L} \approx L\, p_m \quad (L\, p_m \ll 1). \tag{6}$$
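If no mismatches are tolerated, $p_m = 4^{-l}$ as in Eq. 1. At each site where a mismatch is allowed, four alternative nucleotide choices are possible. This gives a certain number $\alpha$ of unique complementary sequences matching a given spacer, so that $p_m = \alpha\, 4^{-l}$; we compute $\alpha$ as a function of the number and nature of mismatches. If up to $k_{\mathrm{fix}}$ mismatches are tolerated at fixed positions in the alignment, $\alpha = 4^{k_{\mathrm{fix}}}$. If instead up to $k_{\mathrm{var}}$ mismatches are tolerated anywhere in the complementary region, naively $\alpha = \binom{l}{k_{\mathrm{var}}} 4^{k_{\mathrm{var}}}$, where the binomial coefficient is the number of combinations of sites where mismatches are allowed. This is, however, an upper bound, as matching sequences are overcounted, and the precise expression is
$$\alpha = \sum_{i=0}^{k_{\mathrm{var}}} \binom{l}{i}\, 3^{i}, \tag{7}$$
where each term in the sum is the number of unique complementary sequences having exactly $i$ mismatches. The largest term dominates, giving $\alpha \approx \binom{l}{k_{\mathrm{var}}} 3^{k_{\mathrm{var}}}$. Thus, combining $k_{\mathrm{fix}}$ mismatches at fixed positions and up to $k_{\mathrm{var}}$ mismatches at any of the remaining $l - k_{\mathrm{fix}}$ positions gives
$$\alpha \approx 4^{k_{\mathrm{fix}}} \binom{l - k_{\mathrm{fix}}}{k_{\mathrm{var}}}\, 3^{k_{\mathrm{var}}}. \tag{8}$$
We can introduce an effective spacer length, $l_{\mathrm{eff}} = l - \log_4 \alpha$, so that $p_m = 4^{-l_{\mathrm{eff}}}$. To leading order the binomial expression in Eq. 8 is approximated by $\binom{l - k_{\mathrm{fix}}}{k_{\mathrm{var}}} \sim (l - k_{\mathrm{fix}})^{k_{\mathrm{var}}}$, up to an $l$-independent factor $1/k_{\mathrm{var}}!$. This gives $l_{\mathrm{eff}} \approx l - k_{\mathrm{fix}} - k_{\mathrm{var}} \log_4 3(l - k_{\mathrm{fix}})$ as in Eq. 3.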
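The counting argument above can be checked numerically. The following is a minimal Python sketch, assuming the forms described in the text (any base at $k_{\mathrm{fix}}$ fixed positions, up to $k_{\mathrm{var}}$ substitutions elsewhere); the function names are illustrative and not part of the original analysis.

```python
from math import comb, expm1, log, log1p

def alpha_exact(l, k_fix, k_var):
    """Count sequences matching a spacer: any base at k_fix fixed positions,
    and up to k_var substitutions among the remaining l - k_fix positions."""
    m = l - k_fix
    return 4**k_fix * sum(comb(m, i) * 3**i for i in range(k_var + 1))

def alpha_largest_term(l, k_fix, k_var):
    """Keep only the i = k_var term of the sum (the dominant one for l >> k_var)."""
    m = l - k_fix
    return 4**k_fix * comb(m, k_var) * 3**k_var

def l_eff_exact(l, k_fix, k_var):
    """Effective length defined through p_m = 4**(-l_eff)."""
    return l - log(alpha_exact(l, k_fix, k_var), 4)

def l_eff_leading(l, k_fix, k_var):
    """Leading-order expression quoted in the text (the 1/k_var! factor is dropped)."""
    return l - k_fix - k_var * log(3 * (l - k_fix), 4)

def p_self(l, k_fix, k_var, L):
    """Probability of at least one match among ~L independent genome positions."""
    p_m = alpha_exact(l, k_fix, k_var) / 4**l
    # 1 - (1 - p_m)**L, written with log1p/expm1 to stay accurate for tiny p_m.
    return -expm1(L * log1p(-p_m))
```

Because the leading-order expression drops the factor $1/k_{\mathrm{var}}!$, it slightly underestimates $l_{\mathrm{eff}}$ by an amount that does not grow with $l$, so the slope of the predicted scaling is unaffected.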
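The probability that a repertoire of $N$ spacers avoids self-targeting, $1 - P_{\mathrm{self}}$, is one minus the probability that at least one spacer self-targets. Treating the spacers as independent, this gives
$$1 - P_{\mathrm{self}} = (1 - p_{\mathrm{self}})^{N}, \tag{9}$$
so that $P_{\mathrm{self}} \approx N p_{\mathrm{self}}$ when $N p_{\mathrm{self}} \ll 1$.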
If CRISPR repertoires are selected to maximize repertoire size subject to the constraint Pself ≤ P, we obtain Eq. 4. Taylor expanding ln N around l = l0 gives Eq. 5 to lowest order in l.
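To make the constrained maximization concrete, here is a small Python sketch. It assumes independent spacers, a per-spacer self-targeting probability `p_self`, and the small-probability form $p_{\mathrm{self}} \approx L\, 4^{-l_{\mathrm{eff}}}$; the function names are hypothetical, and the exact form of Eq. 4 is not reproduced here.

```python
from math import floor, log, log1p

def max_repertoire_size(p_self, P):
    """Largest integer N satisfying 1 - (1 - p_self)**N <= P."""
    return floor(log1p(-P) / log1p(-p_self))

def approx_repertoire_size(p_self, P):
    """Small-probability approximation N ~ P / p_self."""
    return P / p_self

def log_repertoire_size(l_eff, L, P):
    """ln N under p_self ~ L * 4**(-l_eff): linear in l_eff with slope ln 4."""
    return log(P / L) + l_eff * log(4)
```

Under these assumptions, $\ln N$ grows linearly with the effective spacer length, which is the qualitative content of the scaling law discussed in the text.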
b. Comparative analyses
For our comparative analyses we use CRISPRCasdb [5], which is a database of CRISPR and cas loci identified using CRISPRCasFinder [61] in public bacterial and archaeal whole-genome assemblies [35]. CRISPR arrays are assigned evidence levels 1–4, 4 being the highest confidence [61]. We restricted our analysis to level 4 CRISPR arrays only. Strains containing both annotated CRISPR and cas loci were used for the analyses in Figs. 2a–c, 3, and 4c. Strains containing annotated CRISPR but no cas loci were used for the analyses in Figs. 2d–e. To eliminate oversampling of certain species, we picked one strain at random from each species for further analysis (2,449 species with annotated CRISPR and cas loci, and 340 species with annotated CRISPR but no cas loci). To produce Fig. 3, the randomly chosen strains were grouped by annotated cas subtype, or into a separate group if they contain multiple subtypes. The 12 subtypes shown in Figs. 3a and c have > 10 species represented in CRISPRCasdb. To produce Fig. 4c, 4 pairs of species, each with > 50 sequenced strains of the same CRISPR-Cas type, were chosen for analysis.
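The subsampling step (level 4 arrays only, one randomly chosen strain per species) might be sketched as follows. The record layout and field names (`species`, `strain`, `evidence_level`) are hypothetical placeholders, not the actual CRISPRCasdb schema.

```python
import random
from collections import defaultdict

def one_strain_per_species(records, seed=0):
    """Pick one strain at random per species from level-4 records.

    `records` is an iterable of dicts with the hypothetical keys
    'species', 'strain', and 'evidence_level'.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_species = defaultdict(set)
    for rec in records:
        if rec["evidence_level"] == 4:
            by_species[rec["species"]].add(rec["strain"])
    # Sort species and strains so the result depends only on the seed.
    return {sp: rng.choice(sorted(strains))
            for sp, strains in sorted(by_species.items())}
```

Sampling a single strain per species keeps heavily sequenced species from dominating the cross-species statistics.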
c. Synthetic data generation and analysis
A synthetic dataset producing a scaling law was generated in the following way: (1) A spacer of length $l_s$ was drawn from the length distribution of Fig. 2a, and (2) a repertoire size distribution with mean $A/p_{\mathrm{self}}$ was created, from which one strain was sampled and added to the dataset. Parameter values of $L = 5 \times 10^{6}$, $l_p = 3$, $k_{\mathrm{fix}} = l_s/6$, $k_{\mathrm{var}} = 3$, and $A = 10^{-5.5}$ were used. The steady-state distributions are Poisson in Fig. 4a, and geometric with the same mean in Fig. 4b. In Fig. S5, $A$ was sampled from a log-normal distribution with the same mean, and standard deviation chosen such that the coefficient of variation is 1.2.
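The generation procedure can be sketched in Python as below. This is an illustration under stated assumptions rather than the code used for the figures: the spacer length distribution of Fig. 2a is not available here, so a uniform choice over 30 to 40 nt stands in for it, and $p_{\mathrm{self}}$ is taken as $L\, 4^{-l_{\mathrm{eff}}}$ with the leading-order $l_{\mathrm{eff}}$ (which accepts the non-integer $k_{\mathrm{fix}} = l_s/6$).

```python
import math
import random

L = 5e6          # genome length
L_P = 3          # PAM length l_p
K_VAR = 3        # mismatches tolerated anywhere
A = 10**-5.5     # prefactor setting the mean repertoire size

def p_self(l_s):
    """Per-spacer self-targeting probability, assumed to be L * 4**(-l_eff)."""
    l = l_s + L_P
    k_fix = l_s / 6
    l_eff = l - k_fix - K_VAR * math.log(3 * (l - k_fix), 4)
    return L * 4.0**(-l_eff)

def sample_poisson(mean, rng):
    """Poisson sample: count arrivals of a rate-`mean` process in unit time."""
    n, t = 0, rng.expovariate(mean)
    while t < 1.0:
        n += 1
        t += rng.expovariate(mean)
    return n

def sample_geometric(mean, rng):
    """Sample P(n) = (1 - q) q**n, q = mean / (1 + mean), by inverse transform."""
    q = mean / (1.0 + mean)
    u = 1.0 - rng.random()          # uniform on (0, 1]
    return int(math.log(u) / math.log(q))

def synthetic_dataset(n_strains, sampler, rng, lengths=range(30, 41)):
    """Return (spacer length, repertoire size) pairs; `lengths` is a placeholder
    for the empirical length distribution."""
    data = []
    for _ in range(n_strains):
        l_s = rng.choice(lengths)
        data.append((l_s, sampler(A / p_self(l_s), rng)))
    return data
```

For example, `synthetic_dataset(500, sample_poisson, random.Random(0))` yields a dataset in which mean repertoire size rises steeply with spacer length; swapping in `sample_geometric` gives a broader spread at the same mean.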
SUPPLEMENTARY INFORMATION
SI text on population dynamics
Consider a host population acquiring spacers of length $l$. Let the number of individuals in the population that have repertoire size $n$ ($n \ge 0$) be $X_n$. Consider spacer acquisition to occur at a rate $b$:
$$X_n \xrightarrow{\;b\;} X_{n+1}. \tag{10}$$
Spacer acquisition is balanced by spacer loss, leading to a well-defined steady-state distribution of repertoire size. Spacer loss occurs through double-stranded DNA breaks followed by homologous recombination at a subsequent repeat, which deletes chunks of the CRISPR array (see e.g. [46, 47]). The precise rate and mechanism by which this occurs are not well understood. Here, we consider 3 solvable scenarios of this process:
$$X_n \xrightarrow{\;d\;} X_{n-1}, \tag{11}$$
$$X_n \xrightarrow{\;nd\;} X_{n-1}, \tag{12}$$
$$X_n \xrightarrow{\;d\;} X_0. \tag{13}$$
The first scenario represents spacer loss at the end(s) of the CRISPR array, hence independent of n. The second represents a constant per-spacer loss rate. For the third scenario, all spacers are lost in a deletion event, which is a solvable limit of several spacers being deleted at a time.
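Scenario 1: $X_n \xrightarrow{\;d\;} X_{n-1}$. The probabilities $P_n$ ($n \ge 0$) that an individual has repertoire size $n$ obey the following master equation:
$$\frac{dP_0}{dt} = -b P_0 + d P_1, \tag{14}$$
$$\frac{dP_n}{dt} = b P_{n-1} + d P_{n+1} - (b + d) P_n, \quad n \ge 1. \tag{15}$$
The steady state fulfills the detailed balance condition,
$$b P_n = d P_{n+1}. \tag{16}$$
We can solve the recursion equation (Eq. 16) for the steady-state distribution,
$$P_n = \left(1 - \frac{b}{d}\right) \left(\frac{b}{d}\right)^{n}, \tag{17}$$
which is geometric with parameter $1 - b/d$. Its mean is $b/(d - b)$, implying that a well-defined steady state is only possible if $d > b$.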
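Scenario 2: $X_n \xrightarrow{\;nd\;} X_{n-1}$. The master equation is:
$$\frac{dP_0}{dt} = -b P_0 + d P_1, \tag{18}$$
$$\frac{dP_n}{dt} = b P_{n-1} + (n + 1) d P_{n+1} - (b + n d) P_n, \quad n \ge 1. \tag{19}$$
At steady state, detailed balance again holds:
$$b P_n = (n + 1)\, d P_{n+1}. \tag{20}$$
Eq. 20 implies that the steady-state distribution is Poisson with mean $b/d$:
$$P_n = \frac{(b/d)^{n}}{n!}\, e^{-b/d}. \tag{21}$$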
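Scenario 3: $X_n \xrightarrow{\;d\;} X_0$. Here, the master equation is:
$$\frac{dP_0}{dt} = -b P_0 + d \sum_{m \ge 1} P_m, \tag{22}$$
$$\frac{dP_n}{dt} = b P_{n-1} - (b + d) P_n, \quad n \ge 1. \tag{23}$$
Here there is no detailed balance, but probability flux is conserved at steady state,
$$b P_n = (b + d) P_{n+1}. \tag{24}$$
Eq. 24 implies that the steady-state distribution is geometric with parameter $d/(b + d)$:
$$P_n = \frac{d}{b + d} \left(\frac{b}{b + d}\right)^{n}. \tag{25}$$
The mean of this distribution is $b/d$.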
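As a consistency check on the three scenarios, the stated steady-state distributions can be substituted back into the corresponding master equations. The Python sketch below does this numerically; the master equations are written out as inferred from the rate descriptions in the text (acquisition at rate $b$; loss at rate $d$, at rate $nd$, or complete loss at rate $d$), and the function names are illustrative.

```python
import math

def steady_state(scenario, b, d, n):
    """Stated steady-state probability P_n for each loss scenario."""
    if scenario == 1:                      # loss n -> n-1 at rate d (needs d > b)
        q = b / d
        return (1 - q) * q**n
    if scenario == 2:                      # loss n -> n-1 at rate n*d: Poisson(b/d)
        mu = b / d
        return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))
    if scenario == 3:                      # loss n -> 0 at rate d
        q = b / (b + d)
        return (1 - q) * q**n
    raise ValueError("scenario must be 1, 2, or 3")

def dPdt(scenario, b, d, n):
    """Right-hand side of the master equation for n >= 1 at the stated steady state."""
    P = lambda m: steady_state(scenario, b, d, m)
    gain = b * P(n - 1)                    # acquisition from n-1
    if scenario == 1:
        return gain + d * P(n + 1) - (b + d) * P(n)
    if scenario == 2:
        return gain + (n + 1) * d * P(n + 1) - (b + n * d) * P(n)
    return gain - (b + d) * P(n)

def mean_size(scenario, b, d, n_max=500):
    """Mean repertoire size from a truncated sum over n."""
    return sum(n * steady_state(scenario, b, d, n) for n in range(n_max))
```

With, for instance, $b = 1$ and $d = 2$, the residuals vanish to numerical precision in all three scenarios, and the means come out as $b/(d - b) = 1$ for scenario 1 and $b/d = 0.5$ for scenarios 2 and 3.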
Acknowledgements
We thank Serena Bradde for helpful comments on this paper. VB and HC were supported in part by a Simons Foundation grant in Mathematical Modeling for Living Systems (400425) for Adaptive Molecular Sensing in the Olfactory and Immune Systems, and by the NSF Center for the Physics of Biological Function (PHY-1734030). AM was supported by a Lewis–Sigler fellowship. VB thanks the Aspen Center for Physics, which is supported by NSF grant PHY-160761, for hospitality during this work.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].