Abstract
Antimicrobial resistance is widely recognized as a serious global public health problem. To combat this threat, a thorough understanding of bacterial genomes is necessary. The current wide availability of bacterial genomes provides us with an in-depth understanding of the great variability of dispensable genes and their relationship with antimicrobials. Some of these accessory genes are those involved in CRISPR-Cas systems, which are acquired immunity systems present in only some bacterial genomes. They prevent viral infections through small DNA fragments called spacers. However, the vast majority of these spacers have not yet been associated with the virus they recognize, and this has been named CRISPR dark matter. By analyzing the spacers of tens of thousands of genomes from six bacterial species highly resistant to antibiotics, we have been able to reduce the CRISPR dark matter from 80-90% to as low as 15% in some of the species. In addition, we have observed that, when a genome presents CRISPR-Cas systems, this is accompanied by particular collections of membrane proteins. Our results suggest that when a bacterium presents membrane proteins that make it compete better in its environment, and these proteins are in turn receptors for specific phages, it would be forced to acquire CRISPR-Cas immunity systems to avoid infection by these phages.
Introduction
The fight against antimicrobial resistant bacteria is one of the major challenges facing mankind in the near future (1). The World Health Organization prioritized a list of multidrug-resistant bacteria in 2017 to support research and development of effective drugs (2). The most relevant species in this list form the so-called ESKAPE group, whose acronym refers to two Gram-positive bacteria (Enterococcus faecium and Staphylococcus aureus) and four Gram-negative bacteria (Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter cloacae). All of them cause infections, some of which are acquired in the hospital environment (3, 4). A major difference between Gram-positive and Gram-negative bacteria, which conditions the way to fight against them, is that Gram-negative bacteria have a second lipid membrane outside the cell wall, which gives them a richer variability of membrane proteins, such as the outer membrane proteins found in this second layer (5).
Bacteriophages can be used to control bacterial growth and are even used to treat infections in humans (6). In turn, bacteria defend themselves against phage infection by means of different molecular systems. Restriction-modification systems are by far the most abundant, being present in 83% of prokaryotic genomes, followed by CRISPR-Cas with about 40% (7). CRISPR-Cas systems are adaptive immunity systems found in most archaea and in less than half of the bacteria sequenced (8). They provide acquired immune resistance against phages and other foreign nucleic acid molecules such as plasmids, thus restricting gene transfer (9). There are different types of CRISPR-Cas systems based on genes that are part of the different steps of this immune system (adaptation or spacer integration, expression, and interference) and are generically called cas (CRISPR-associated genes). ESKAPE bacteria have CRISPR-Cas systems of the most common types, from I to IV, but only in a fraction of strains, with frequencies ranging from less than 1% to 60% of genomes, depending on the species (10).
The acquired immunity of CRISPR-Cas systems is based on short nucleotide fragments, called spacers, cropped from the foreign nucleic acid sequence of a previous entry into the bacterial cell, called protospacers, which are inserted into the bacterial DNA next to the cas genes. These spacers are mostly similar to phage sequences and to a lesser extent to other extrachromosomal nucleic acid molecules. However, a large proportion of them have no known protospacer (over 80-90%). These spacers of unknown origin are believed to originate from as yet unsequenced phages and constitute what has been called the CRISPR “dark matter” (11).
Bacterial genomes with CRISPR-Cas systems have cas genes along with the spacers, which are separated by repeats (short identical or nearly identical sequences). However, other genes have been linked with these systems because specific functionalities have been found in strains that have CRISPR-Cas systems. For example, a relationship has been demonstrated with the formation of multicellular structures called biofilms in P. aeruginosa, Streptococcus mutans and Yersinia pestis (12–14), connections with the regulation of outer membrane proteins have also been described in Salmonella Typhi (15), and a specific relationship with virulence has also been shown in multiple bacterial species (16–18).
In earlier work with A. baumannii, we found that the group of strains that usually have CRISPR-Cas systems had genes involved in biofilm at a high frequency, and we also found genes encoding proteins with signal peptides and membrane lipoproteins (19). The analysis was based on a pangenome constructed from this species (all of the different genes found in the genomes of the species). The establishment of these pangenomes is now easier due to the large number of genomes available in public databases, and they allow analysis of the accessory genome, which is the set of genes that is not present in all the strains of a species (20) and is usually obtained by horizontal gene transfer (21). If strains with CRISPR-Cas systems have special functions not found in strains without these systems, accessory genes involved in these functions should appear almost exclusively in the former.
We have created a large pangenome for each ESKAPE species and compared strains with or without CRISPR-Cas systems to discover genes present with a significant frequency in the former. Then, these genes have been functionally analyzed and we found that they are enriched in genes encoding membrane proteins, which reveals a possible relationship with phages that infect bacteria and could take advantage of these membrane proteins as entry receptors. In addition, our results demonstrate that the study of thousands of genomes of the same species allows us to reduce the CRISPR dark matter and to trace the origin of most of the spacers found in them.
Materials and Methods
Genome collection and annotation
The assembled sequences of ESKAPE species available in the National Center for Biotechnology Information (NCBI) Genome database on June 14, 2021, including complete and draft genomes, were collected (22). Genomes and metadata were downloaded with the tools datasets 12.1.0 and dataformat 12.4.0 (a total of 68,352 genomes). Genomes with a low number of total genes or a low average number of shared genes (>5 times the interquartile range) were removed on suspicion that they did not correspond to the species studied.
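The outlier criterion above can be illustrated with a minimal sketch. It assumes one possible reading of the threshold, namely that a genome is discarded when its value falls more than 5 interquartile ranges below the first quartile; the function name low_outliers and the parameter k are hypothetical and not part of the original pipeline.

```python
from statistics import quantiles


def low_outliers(values, k=5.0):
    """Return indices of values lying more than k * IQR below the first quartile.

    Assumption: the '>5 times the interquartile range' rule is applied
    relative to the first quartile; the source does not spell this out.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the distribution
    cutoff = q1 - k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < cutoff]


# Hypothetical gene counts per genome; the last genome is suspiciously small.
gene_counts = [5000, 5100, 4900, 5050, 4950, 5020, 4980, 1000]
suspect = low_outliers(gene_counts)
```

The same function could be applied to the average number of shared genes per genome, removing any genome flagged by either measure.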
The protein-coding genes were predicted using Prokka version 1.14.5 (23), and the pangenome was created by Roary version 3.12.0 with an identity threshold of 90% and the -s parameter for not separating paralogs at this identity threshold (24). Protein sequences were functionally annotated using Sma3s v2 and the UniProt bacterial taxonomic division bacteria 2019_01 as the reference database (25). Gene names provided by Sma3s were preferentially assigned to each protein. When a gene name was repeated, a sequential number separated by two underscores was added. In cases where Sma3s did not assign a gene name, the one proposed by Prokka was taken, if available, preceded by an underscore. A gene was classified as a core gene if it appeared in ≥99% of the genomes of the species.
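The gene naming rules described above could be sketched as follows. This is an illustration under stated assumptions, not the script used in the study: it assumes that the first occurrence of a name stays unnumbered and later occurrences receive __1, __2 and so on, and that a gene with neither a Sma3s nor a Prokka name keeps its cluster identifier. The function assign_gene_names and the input layout are hypothetical.

```python
from collections import defaultdict


def assign_gene_names(clusters):
    """Name pangenome genes from (cluster_id, sma3s_name, prokka_name) tuples.

    Sma3s names take priority; Prokka names are prefixed with an underscore.
    Assumptions: repeats are numbered from the second occurrence onward, and
    unnamed genes fall back to their cluster identifier.
    """
    seen = defaultdict(int)
    names = {}
    for cluster_id, sma3s_name, prokka_name in clusters:
        if sma3s_name:
            base = sma3s_name
        elif prokka_name:
            base = "_" + prokka_name
        else:
            base = cluster_id  # assumption: no name available from either tool
        count = seen[base]
        seen[base] += 1
        # Two underscores separate the name from the sequential number.
        names[cluster_id] = base if count == 0 else f"{base}__{count}"
    return names


example = assign_gene_names([
    ("g1", "ompA", None),
    ("g2", "ompA", "ompA_2"),
    ("g3", None, "fhuA"),
    ("g4", None, None),
])
```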
CRISPR-Cas systems and their specific types were assigned using CRISPRCasTyper 1.4.1 (26). Types I-Fa and I-Fb of A. baumannii were distinguished by looking for their different integration site. To discover the spacers of CRISPR-Cas systems, CRISPRCasFinder 4.2.20 was used with default parameters (27). Only CRISPR arrays with an evidence level equal to 4 were considered. Identical spacers were collapsed together, taking into account both strands. The number of sequences of each type for each species is available in Suppl. Table S1.
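Collapsing identical spacers across both strands can be sketched by mapping each spacer to a canonical form, here the lexicographically smaller of the sequence and its reverse complement. This is one common way to do it and is an assumption about the implementation; the helper names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Strand-independent representative of a spacer sequence."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def collapse_spacers(spacers):
    """Count spacers after merging identical sequences on either strand."""
    return Counter(canonical(s) for s in spacers)


# "ACTT" is the reverse complement of "AAGT", so three inputs collapse to one.
collapsed = collapse_spacers(["AAGT", "ACTT", "aagt", "GGGG"])
```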
Search for specific gene groups in the pangenomes
Antibiotic resistance genes were found by AMRFinderPlus 3.10.1 using the databases of the 6 bacterial species analyzed (28). Virulence genes were found by performing a similarity search with BLASTP 2.9.0+ (29) against the VFDB database version December 2020 (Virulence Factor Database), requiring at least 90% sequence identity and 90% database sequence coverage (30).
Genes encoding membrane proteins were searched for in the functional annotation performed by Sma3s, to which genes encoding outer membrane proteins were specifically added by a BLASTP similarity search against the OMPdb database release 2021, requiring at least 90% sequence identity and 90% database sequence coverage (31).
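The identity and database-coverage thresholds used in the similarity searches above can be illustrated with a small filter over BLAST tabular output. The sketch assumes the search was run with a custom column layout (-outfmt "6 qseqid sseqid pident length qlen slen sstart send") and that coverage is computed from a single alignment per hit; both points are assumptions, since the source does not describe how coverage was derived.

```python
def passes_thresholds(pident, sstart, send, slen,
                      min_identity=90.0, min_coverage=90.0):
    """Check identity and database (subject) sequence coverage of one hit."""
    covered = abs(send - sstart) + 1          # aligned subject positions
    coverage = 100.0 * covered / slen
    return pident >= min_identity and coverage >= min_coverage


def filter_hits(lines):
    """Return query genes with at least one hit passing both thresholds.

    Assumed columns: qseqid sseqid pident length qlen slen sstart send.
    """
    kept = set()
    for line in lines:
        if not line.strip():
            continue
        qseqid, _sseqid, pident, _length, _qlen, slen, sstart, send = (
            line.rstrip("\n").split("\t")
        )
        if passes_thresholds(float(pident), int(sstart), int(send), int(slen)):
            kept.add(qseqid)
    return kept


# Hypothetical hits: only geneA meets both the identity and coverage cutoffs.
hits = [
    "geneA\tvf1\t95.0\t190\t200\t200\t5\t194",
    "geneB\tvf1\t95.0\t100\t200\t200\t1\t100",
    "geneC\tvf2\t85.0\t200\t200\t200\t1\t200",
]
selected = filter_hits(hits)
```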
Genes from plasmids were searched using the annotation ‘Plasmid’ in the UniProt keyword field. Then, genes with ≥90% sequence identity and ≥90% query coverage with a sequence of the PLSDB database v2020_06_23_v2 were added (32). Viral genes were searched following the same protocol but using the IMG/VR database v3 (IMG_VR_2020-10-12_5.1) (33), and viral genes from the functional annotation with Sma3s. Genes that appeared between 2 genes annotated as viral were also added. The Phaster web server was used to search for complete phages (34).
Search for protospacers
Protospacers, putative genes recognized by spacers, were searched by performing a similarity search with BLASTN and the blastn-short option turned on, using a threshold of ≥95% sequence identity and 100% spacer coverage. Those protospacers that were also found following the same strategy but using sequences from CRISPR repeats instead of the spacers were discarded to avoid misannotated sequences (35).
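The protospacer selection described above can be sketched as a two-step filter: keep spacer hits with at least 95% identity over the full spacer length, then discard targets that are also matched by CRISPR repeats under the same criteria. The dictionary layout of the hits and the decision to discard at the level of the whole target gene are assumptions made for illustration.

```python
def find_protospacers(spacer_hits, repeat_hits, min_identity=95.0):
    """Map each spacer to the genes it recognizes, excluding repeat-matched genes.

    Each hit is a dict with keys: query, gene, pident, qstart, qend, qlen
    (hypothetical layout derived from BLASTN tabular output).
    """
    def is_valid(hit):
        full_length = hit["qstart"] == 1 and hit["qend"] == hit["qlen"]
        return hit["pident"] >= min_identity and full_length

    # Genes matched by repeats are likely misannotated CRISPR arrays.
    repeat_targets = {h["gene"] for h in repeat_hits if is_valid(h)}

    protospacers = {}
    for hit in spacer_hits:
        if is_valid(hit) and hit["gene"] not in repeat_targets:
            protospacers.setdefault(hit["query"], set()).add(hit["gene"])
    return protospacers


spacer_hits = [
    {"query": "sp1", "gene": "phageA", "pident": 100.0, "qstart": 1, "qend": 32, "qlen": 32},
    {"query": "sp2", "gene": "array1", "pident": 96.9, "qstart": 1, "qend": 32, "qlen": 32},
    {"query": "sp3", "gene": "phageB", "pident": 100.0, "qstart": 3, "qend": 32, "qlen": 32},
]
repeat_hits = [
    {"query": "rep1", "gene": "array1", "pident": 100.0, "qstart": 1, "qend": 28, "qlen": 28},
]
found = find_protospacers(spacer_hits, repeat_hits)
```

In this toy input, sp2 is dropped because its target is also matched by a repeat, and sp3 is dropped because the alignment does not cover the whole spacer.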
Searching for genes associated with CRISPR-Cas types with inference on random forests
We used inference on multiple iterations of random forests to search for genes associated with specific CRISPR-Cas types. For each species, we compiled one dataset containing all strains that have no CRISPR-Cas systems, and other datasets with the strains containing each CRISPR-Cas type present in the data, respectively. From 20 iterations with different random seeds, the most important features were selected and counted. In this context the features are binary indicators of the presence of the genes for each strain. After all iterations, the count of a gene can indicate how often the random forest deemed it important for the difference between the respective CRISPR-containing and CRISPR-deficient genomes. Cas genes were removed from consideration, to be able to focus on non-directly related genes. Genes identified as important features in multiple iterations were considered to be associated with CRISPR-Cas systems, when they were more abundant in CRISPR-Cas containing genomes than in non-CRISPR-Cas containing genomes.
The random forest implementation was done in Python with the scikit-learn package 1.0.2 (36). With each iteration, random parts of the datasets were divided into train and test sets with the ratio of 0.8 to 0.2. For all six species and their CRISPR-Cas systems, the random forests achieved average accuracies higher than 0.93 over all iterations. The trained random forest object has the resulting feature importance by the mean decrease in impurity available as a parameter. The permutation feature importance may be more informative for high cardinality features, but since we only have two values for each feature, the mean decrease in impurity feature importance measurement is sufficient.
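The iterative procedure described in this section could look roughly like the sketch below. It follows the stated settings (20 iterations with different seeds, a 0.8/0.2 train/test split, mean decrease in impurity as importance), but the number of top features kept per iteration (top_k), the minimum count (min_count), the default forest hyperparameters and the function names are assumptions, since the source does not report them.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def count_important_genes(X, y, gene_names, n_iter=20, top_k=10):
    """Count how often each gene ranks among the top_k features across seeds.

    X: binary presence/absence matrix (genomes x genes).
    y: 1 for genomes with the CRISPR-Cas type, 0 for genomes without systems.
    """
    X, y = np.asarray(X), np.asarray(y)
    counts, accuracies = Counter(), []
    for seed in range(n_iter):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        forest = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
        accuracies.append(forest.score(X_test, y_test))
        # feature_importances_ holds the mean decrease in impurity.
        top = np.argsort(forest.feature_importances_)[::-1][:top_k]
        counts.update(gene_names[i] for i in top)
    return counts, float(np.mean(accuracies))


def crispr_associated(X, y, gene_names, counts, min_count=2):
    """Keep recurrent genes that are more frequent in CRISPR-Cas genomes."""
    X, y = np.asarray(X), np.asarray(y)
    freq_with = X[y == 1].mean(axis=0)
    freq_without = X[y == 0].mean(axis=0)
    return [
        g for i, g in enumerate(gene_names)
        if counts[g] >= min_count and freq_with[i] > freq_without[i]
    ]
```

Because every feature is binary, the impurity-based importance is not biased by feature cardinality here, which is the rationale given above for not using permutation importance.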
Functional enrichment analysis
To discover the functional enrichment of genes associated with specific CRISPR-Cas types, we used the R package topGO version 2.40.0 (37), which uses GO terms from a specific ontology. The GO terms used were those annotated by Sma3s. Figures were created using the R ggplot2 library in a custom script.
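As a rough illustration of what an enrichment test measures, the sketch below computes a one-sided hypergeometric p-value for a single GO term. This is a simplified stand-in and not the procedure used in the study: topGO offers several algorithms that also account for the GO graph structure, and the source does not state which one was applied. The function name and arguments are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_universe, n_annotated, n_selected, n_overlap):
    """Probability of seeing at least n_overlap annotated genes by chance.

    n_universe:  genes in the background set
    n_annotated: background genes carrying the GO term
    n_selected:  genes associated with the CRISPR-Cas type
    n_overlap:   selected genes carrying the GO term
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 3 selected genes, all 3 carry a term annotated to 4 of 10 genes.
p_value = enrichment_pvalue(10, 4, 3, 3)
```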
MLST assignment, gene profiles and molecular phylogenies
MLST numbers were assigned to each genome by compiling the genes used in the species-specific schemes in PubMLST 23 Nov 2021 (38) and searching them in the genome sequences using the mlst program (https://github.com/tseemann/mlst). The MLST number assigned to each genome, along with the CRISPR-Cas systems it has, is available in Suppl. Table S3.
MLST phylogenetic trees were constructed using the MLST sequences. Nucleotide sequences were aligned with mafft v7.271 using the G-INS-I option (39). The phylogeny was constructed with RAxML v8.2.9 with the GTRCAT model and bootstrap of 1000 (40). The model was selected with ModelFinder implemented in IQ-TREE (41). The phylogeny in Fig. 6d was constructed with the same protocol but using all genomes, the PROTGAMMAWAG model and 500 core proteins.
The gene profiles for each species were constructed using a binary representation of the bacterial genome, where a gene is either absent or present in a strain, without accounting for the number of paralogs. These data are condensed to the MLST level by assigning 0 or 1 to the gene in the MLST group by majority vote of all strains in the group. The MLST groups are then subjected to a pairwise Jaccard distance measurement, resulting in an N × N matrix of Jaccard distances between MLST groups, with N equal to the number of MLST groups for the respective species dataset. The pairwise Jaccard distances were computed with scikit-learn.
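The condensation to MLST level and the distance calculation can be sketched without external dependencies as follows. The study computed the distances with scikit-learn; this pure-Python version is only an equivalent illustration. Treating ties in the majority vote as absence is an assumption, and all function names are hypothetical.

```python
from collections import defaultdict


def mlst_profiles(genome_profiles, genome_to_mlst):
    """Condense binary gene profiles to one profile per MLST group.

    A gene is set to 1 when present in a strict majority of the group's
    genomes (assumption: ties count as absent).
    """
    groups = defaultdict(list)
    for genome, profile in genome_profiles.items():
        groups[genome_to_mlst[genome]].append(profile)
    return {
        st: [1 if 2 * sum(column) > len(profiles) else 0
             for column in zip(*profiles)]
        for st, profiles in groups.items()
    }


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the genes present in two profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def distance_matrix(condensed):
    """Return the sorted MLST labels and their N x N Jaccard distance matrix."""
    labels = sorted(condensed)
    matrix = [[jaccard_distance(condensed[i], condensed[j]) for j in labels]
              for i in labels]
    return labels, matrix


profiles = {"g1": [1, 1, 0, 0], "g2": [1, 0, 0, 0],
            "g3": [1, 1, 1, 0], "g4": [0, 1, 1, 0]}
groups = {"g1": "ST1", "g2": "ST1", "g3": "ST1", "g4": "ST2"}
labels, matrix = distance_matrix(mlst_profiles(profiles, groups))
```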
These pairwise distances were used to construct a profile of genetic distances between the MLST groups for each species. We used Ward linkage and descending distance sort for the hierarchical clustering and the dendrogram. Dendrograms were produced using SciPy 1.6.2 (42), and correlation plots were plotted with seaborn 0.11.2 and matplotlib 3.5.0 (43, 44).
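A minimal sketch of the clustering step with SciPy is shown below, assuming a precomputed square distance matrix between MLST groups. The function name cluster_groups and the cut into a fixed number of clusters are illustrative additions. Note that Ward linkage is formally defined for Euclidean distances, so applying it to Jaccard distances is a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_groups(distances, labels, n_clusters=2):
    """Ward clustering of MLST groups from a square distance matrix.

    Returns the leaf order of the dendrogram and a label -> cluster mapping.
    """
    condensed = squareform(np.asarray(distances, dtype=float))
    Z = linkage(condensed, method="ward")
    # no_plot=True computes the layout without requiring matplotlib.
    tree = dendrogram(Z, labels=labels, no_plot=True,
                      distance_sort="descending")
    assignment = fcluster(Z, t=n_clusters, criterion="maxclust")
    return tree["ivl"], dict(zip(labels, assignment.tolist()))


# Hypothetical distances: ST1/ST2 are close, as are ST3/ST4.
D = [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.85, 0.9],
     [0.9, 0.85, 0.0, 0.2],
     [0.8, 0.9, 0.2, 0.0]]
order, clusters = cluster_groups(D, ["ST1", "ST2", "ST3", "ST4"])
```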
Results
A large proportion of CRISPR dark matter spacers could be annotated by the pangenome analysis
The genomes of the different ESKAPE species were initially obtained and they were both structurally and functionally annotated with a special emphasis on the protein-coding genes and the elements that are part of the CRISPR-Cas systems. According to the number of genes, the smallest pangenome was found for S. aureus despite having started from a larger number of genomes (Fig. 1a, b). The species with the largest number of genomes with CRISPR-Cas systems were K. pneumoniae and P. aeruginosa, which also had the largest pangenomes (along with the other Gram-negative species) and more than 3500 core genes, i.e., genes present in at least 99% of the genomes of the species. The difference between Gram-negative and Gram-positive bacteria is also evident when comparing the number of genes per genome, with P. aeruginosa showing both the largest number of genes and average number of shared genes (Fig. 1c). Finally, the largest difference between genes per genome and average number of shared genes is found in E. cloacae, which could be explained by the low number of genomes used for this species (n=317).
The overall proportion of genomes exhibiting CRISPR-Cas systems is low, with Gram-positive bacteria having them in only 1% of their genomes, both A. baumannii and E. cloacae in 12%, and K. pneumoniae and P. aeruginosa in 30% and 47%, respectively (Fig. 1d). Since the spacers of CRISPR-Cas systems usually recognize mobile genetic elements, we first separated pangenome genes that could come from plasmids (an average of 28% of the genes) and bacteriophages (an average of 8% of the genes) (Fig. 1e). Interestingly, an average of 5% of the genes were annotated as originating from both genetic elements, which could come from what is known as phage-plasmids (45).
Next, the spacers were obtained and their cognate protospacers were searched for within the pangenome genes. The largest number of different spacers was found in the three species with CRISPR-Cas IV and/or I-F types (Fig. 1f). As expected, protospacers belonged in a much larger proportion to phage genes. Remarkably, it was possible to annotate more than 65% of the spacers in the same three species mentioned, with a maximum of 85% in P. aeruginosa. The species with the lowest proportion of annotated spacers (about 25%) were E. cloacae and S. aureus. It is noteworthy that about 18% of the spacers in CRISPR-Cas type I species appear to recognize genes annotated as phage-plasmids.
Since the presence of CRISPR-Cas systems prevents the entry of foreign DNA (including resistance and virulence plasmids) into the bacterium, the number of genes involved in these functions was compared between genomes with and without CRISPR-Cas systems. On average, genomes with CRISPR-Cas systems presented a lower number of resistance and virulence genes (Suppl. Fig. S1). The most significant difference was found with CRISPR-Cas type II and III, except for resistance genes in S. aureus. On the other hand, P. aeruginosa presented the highest number of virulence genes overall, although the genomes without CRISPR-Cas systems presented a lower number than those with CRISPR-Cas systems, a result that was repeated with the resistance genes. So, we cannot conclude that all genomes with CRISPR-Cas systems tend to carry a lower number of resistance and virulence genes.
CRISPR-Cas systems appear and disappear throughout the entire phylogeny
At this point we wanted to know if the CRISPR-Cas systems were linked to a branch of the phylogeny of the species studied. To define phylogenetic relationships between genomes with and without CRISPR-Cas systems, multi-locus sequence typing (MLST) was used. This is based on several housekeeping genes of each species (38), and two adjacent groups reflect genomes arising from a recent common ancestor. Except for E. faecium type II and other specific aggregations in other species, CRISPR-Cas systems appear to be spread throughout the phylogenetic tree (Fig. 2, top), suggesting a possible gain by horizontal gene transfer in genomes for which it could provide an evolutionary advantage, and a possible subsequent loss when that advantage no longer exists.
If CRISPR-Cas systems are associated with other bacterial physiological functions, one possible hypothesis is that genomes with CRISPR-Cas systems always have a similar collection of accessory genes, regardless of cas genes. To test this idea, distance trees were constructed between the same MLST groups in the phylogeny, in this case based on the gene profile of the genomes (gene presence/absence matrix). These gene profiles showed a similar dispersion to that found with molecular phylogeny, except, again, for some particular aggregations, suggesting that CRISPR-Cas systems do not appear in genomes with a fixed collection of accessory genes (Fig. 2, bottom). However, when the two types of relationships are compared, a high correlation is observed between distances of the molecular phylogeny and the gene profile in species with a smaller number of genomes with CRISPR-Cas systems (Suppl. Fig. S2). Nevertheless, this correlation decreases in those with a higher number of genomes with CRISPR-Cas systems. Taken together, this reinforces the idea that CRISPR-Cas systems do not appear to be linked to strains phylogenetically related or to specific accessory genomes. But this raised the question of whether genomes with CRISPR-Cas systems presented particular genes at higher frequencies than genomes without these systems.
Genes associated with CRISPR-Cas systems mainly encode membrane proteins
We have already seen that CRISPR-Cas systems are associated with different accessory genomes and appear and disappear at any evolutionary time. However, assuming that CRISPR-Cas systems are associated with other physiological functions of bacteria, we should find some accessory genes more frequently in genomes having these systems. With this in mind, we searched for genes significantly associated with genomes showing CRISPR-Cas systems, excluding the cas genes themselves. Thus, a mean of 147±62 genes per genome were found associated to the different CRISPR-Cas systems, while only 39±47 genes were associated with not having CRISPR-Cas systems (Suppl. Table S2). To test whether the CRISPR-Cas associated genes were significantly involved in any biological process or function, enrichment analysis was performed, and it was found that genes encoding membrane proteins were highly prominent (Fig. 3). These included the pili proteins of P. aeruginosa and K. pneumoniae, as well as outer membrane proteins of A. baumannii and K. pneumoniae, and other A. baumannii and P. aeruginosa proteins involved in type II secretion systems. This relationship with membrane proteins was especially relevant in CRISPR-Cas type I systems, while other genes involved in catabolic processes or DNA metabolism were found more notably in types II, III, and IV (Suppl. Fig. S3). In addition, types IV were also notable for the annotation “extrachromosomal DNA”, reflecting the origin of these systems from mobile genetic elements (46).
Genomes with specific types of CRISPR-Cas systems have different sets of membrane proteins
Since the genomes bearing CRISPR-Cas systems presented specific types of membrane proteins, we wanted to know if all genomes with a specific CRISPR-Cas type had the same set of membrane proteins. To assess this, genes encoding membrane proteins previously associated with CRISPR-Cas were searched for in genomes having these systems. In general, there were at least two clusters of genomes, especially when type I was analyzed, each one presenting a different collection of genes encoding membrane proteins (Fig. 4), except for type I-F in E. cloacae, as the number of genomes in this species is low. However, the separation between these two clusters of genomes was not as clear with the other types (Suppl. Fig. S4). As an example, about half of the A. baumannii genomes with the CRISPR-Cas type I-Fb system carry the opuD and betP genes, involved in choline and glycine betaine transport, which could protect the bacterium from osmotic stress (47). The other half of the genomes have porins such as benP, or a specific variant of the TonB-dependent siderophore receptor bauA (48). The clusters found among genomes with CRISPR-Cas systems do not generally seem to be associated with the phylogeny of the corresponding species, since the genomes that present the membrane proteins that define them are distributed throughout this phylogeny (Suppl. Fig. S5).
Genomes with both CRISPR-Cas type I systems and specific membrane proteins show exclusive spacers and phage genes
Genomes with CRISPR-Cas type I systems show specific collections of membrane proteins. These proteins could provide the bacterium with important characteristics such as specific stress responses, or the detoxification that certain outer membrane proteins can offer. Since the majority of CRISPR-Cas systems studied here seem to recognize phage sequences (Fig. 1f), we hypothesized that these membrane proteins could be acting as receptors of specific phages, and genomes with them are forced to acquire CRISPR-Cas systems to defend against infection by these viruses, while maintaining the beneficial functions for the bacterium. Indeed, the A. baumannii cluster 1 for type I-Fb includes the ompA gene, an outer membrane protein, the P. aeruginosa cluster 1 for type I-C shows the fhuA gene and the two A. baumannii clusters for I-Fa show different variants of the btuB gene, both encoding TonB-dependent proteins. These three genes have long been known to act as phage receptors in Escherichia coli and Salmonella (49).
To test the above hypothesis, we looked for spacers that recognized phage genes present in each of the two membrane protein-based clusters of genomes that did not appear (neither the spacer nor the phage gene) in the other cluster. We then searched for genomes in each cluster that might carry the phage genes recognized by these spacers. Thus, within a given cluster we could find genomes containing a cluster-specific spacer, or a phage gene recognized by one of these spacers. But it could also be the case that both elements appear in the same genome, and this would imply having two new alternatives: either a cluster-specific spacer and the viral gene recognized together, or a spacer and a viral gene recognized by one of the other cluster-specific spacers. When evaluated in CRISPR-Cas type I systems, the number of genomes with unique spacers for each cluster was high, with a predominance of genomes with unique spacers in K. pneumoniae I-E and P. aeruginosa I-C, with spacers and phage genes in A. baumannii I-F types, and with phage genes in P. aeruginosa I-E and I-F (Fig. 5).
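The selection of cluster-exclusive spacers and the resulting genome categories can be expressed with set operations, as in the sketch below. The data layout (one dictionary per cluster, mapping each genome to its spacers and phage genes) and the category labels are hypothetical, chosen only to mirror the four situations described above.

```python
def exclusive_spacers(cluster_a, cluster_b, targets):
    """Spacers of cluster_a whose spacer and phage targets are absent from cluster_b.

    cluster_*: dict genome -> {"spacers": set, "genes": set}
    targets:   dict spacer -> set of phage genes it recognizes
    """
    def pooled(cluster, key):
        return set().union(*(content[key] for content in cluster.values()))

    spacers_b, genes_b = pooled(cluster_b, "spacers"), pooled(cluster_b, "genes")
    return {
        s for s in pooled(cluster_a, "spacers")
        if targets.get(s) and s not in spacers_b and not (targets[s] & genes_b)
    }


def classify_genomes(cluster, spacers, targets):
    """Label each genome by the cluster-specific elements it carries."""
    target_genes = set().union(*(targets[s] for s in spacers))
    labels = {}
    for genome, content in cluster.items():
        own = content["spacers"] & spacers
        genes = content["genes"] & target_genes
        if own and genes:
            # Either the spacer coexists with its own target, or with a gene
            # recognized by another cluster-specific spacer.
            same = any(targets[s] & genes for s in own)
            labels[genome] = "spacer and own target" if same else "spacer and other target"
        elif own:
            labels[genome] = "spacer only"
        elif genes:
            labels[genome] = "phage gene only"
        else:
            labels[genome] = "neither"
    return labels


targets = {"s1": {"p1"}, "s2": {"p2"}, "s3": {"p3"}}
cluster_1 = {
    "a1": {"spacers": {"s1"}, "genes": set()},
    "a2": {"spacers": {"s2"}, "genes": {"p1"}},
    "a3": {"spacers": {"s1"}, "genes": {"p1"}},
    "a4": {"spacers": set(), "genes": {"p2"}},
    "a5": {"spacers": {"s3"}, "genes": set()},
}
cluster_2 = {"b1": {"spacers": {"s3"}, "genes": {"p3"}}}
specific = exclusive_spacers(cluster_1, cluster_2, targets)
categories = classify_genomes(cluster_1, specific, targets)
```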
In A. baumannii, cluster 2 of CRISPR-Cas I-Fb genomes (see Fig. 5) has specific spacers against a phage gene (unknow5433) that appear in 466 distinct genomes, with the phage gene itself also appearing in 68 of them (Fig. 6a). The complete phage is integrated near a tRNA-Arg gene and has a total of 59 proteins (Suppl. Fig. S6). In addition, the phage gene also appears in 283 genomes lacking CRISPR-Cas systems. When the specific membrane proteins of cluster 2 are also searched for in genomes lacking the CRISPR-Cas system but having the phage gene, the best match occurs with the T630_1336 membrane protein (which we will refer to as cam1 for CRISPR-associated membranome gene 1), while the rest of the membrane proteins appear more frequently in genomes lacking the phage gene (Fig. 6b). Specifically, 278 of the genomes that have cam1 also have the phage gene (63%). In fact, of the 283 genomes without CRISPR-Cas systems that have the phage gene, the cam1 gene could not be found in only 5.
Cam1 is a 349 amino acid protein that shows a transmembrane region followed by a Rhs repeat-associated core domain at its N-terminal half (InterPro:IPR022385). This domain appears in bacterial toxins involved in type VI secretion systems (T6SS), but when present in proteins of less than 400 amino acids, found in bacteria such as Pseudomonas putida, it has an unknown function (50). The gene encoding this membrane protein is part of a cluster of 6 genes that is integrated next to a tRNA-Glu and the gene fhuA, and close to this region there is another T6SS spike gene (vgrG2__8). Since the presence of this protein seems to be a sine qua non condition for the appearance of the phage, we could expect that the gain of this putative T6SS by the bacterium would imply that it would be exposed to infection by the phage, something that could be counteracted by the bacterium with the acquisition of a CRISPR-Cas system (Fig. 6c). This appears to be supported when analyzing the group of genomes containing this membrane protein, together with genomes that have a similar gene profile. There is a set of 65 genomes with a gene profile similar to cluster 2 CRISPR-Cas I-Fb genomes which lack the membrane protein, phage and CRISPR-Cas systems (Fig. 6d). Then, the phylogeny of this group of genomes shows a first divergence supporting the gain of the membrane protein Cam1 that seems to allow phage entry. Later in the phylogeny, a new divergence allows the gain of the CRISPR-Cas system that seems to prevent phage integration. Indeed, while genomes with only the phage gene or CRISPR-Cas systems are rare (5 and 21 genomes, respectively), there are 278 genomes with both Cam1 and the phage gene, and 401 with both Cam1 and the CRISPR-Cas system but not the phage gene (Fig. 6e), supporting the dependence of phage on Cam1, and the dependence on CRISPR-Cas systems to prevent phage.
Discussion
A limited number of bacterial genomes have CRISPR-Cas systems to prevent the entry of foreign DNA. We have analyzed tens of thousands of genomes of bacterial species of the ESKAPE group and found that type I is the most frequent CRISPR-Cas system in the Gram-negative species of this group. Our results are consistent with previous reports that found, for example, a low proportion of genomes with CRISPR-Cas systems in E. faecium (51), and about fifty percent of those in P. aeruginosa (9). When a bacterial genome has a CRISPR-Cas system, it is expected to have fewer genes. Some of the missing genes may be those involved in antibiotic resistance and originating from plasmids or integrative and conjugative elements, or those involved in virulence and originating from phages. This was mostly found in our study with ESKAPE pangenomes, and coincides with previous studies with the same species, except for the fact that we show a more complete collection of type IV systems, and we have used twice as many genomes (10, 52).
Bacteria have different defense systems against phages. But to study in silico the relationship of these systems with their target sequences, CRISPR-Cas systems have the advantage of having spacers, which reflect previous encounters with exogenous sequences (7). Thus, spacers of CRISPR-Cas systems usually recognize sequences originating from phages (11) and to a lesser extent from plasmids, such as we previously showed in A. baumannii (19). However, most spacers have unknown origin, as the corresponding protospacer cannot be found, and have been lumped into what is known as CRISPR dark matter (11, 53). This dark matter accounts for 80-90% of the spacers and is expected to recognize phage sequences that are still unknown or have diverged from known phage sequences. And it is believed that this percentage may decrease with the future increase of sequences in the databases (54). We show that by analyzing complete pangenomes, which can include all the variation of phages infecting the species, dark matter can be greatly minimized (Fig. 1f). Thus, we were able to annotate 85% of the 9950 P. aeruginosa spacers and 72% of the 7345 A. baumannii spacers, with approximately 70% of these corresponding to phages or phage-plasmids. These phage-plasmids include phages that can remain as extrachromosomal elements in the bacterium (45) and against which we have found more associated spacers than against plasmid genes.
We have also found that genomes with CRISPR-Cas systems do not appear phylogenetically restricted, nor do they have a unique accessory genome. This suggests that bacteria acquire these systems when they provide an important evolutionary advantage. In a study carried out to measure the impact of CRISPR-Cas systems on horizontal gene transfer in bacteria, it was concluded that these systems would play an important role at the population level but not at the evolutionary scale (55). In fact, these systems are often recruited by mobile genetic elements on independent phylogenetic times (56). All of this would support our results, positioning CRISPR-Cas systems as functional modules that are acquired and discarded under certain circumstances.
We found dozens of genes that co-occur with CRISPR-Cas systems, although not necessarily close in the bacterial chromosome, many of which encode membrane proteins. This association suggests that one such circumstance could be the defense against phages that use these proteins as receptors. A previous report had already suggested that misfolded membrane proteins may trigger an envelope stress response that activates a CRISPR-Cas system (57), and other reports have found genes with probable association to the CRISPR-Cas systems, some of which encoded integral membrane proteins (58, 59). Many of these genes were related to type III systems, which is the type in S. aureus, where we found more than 20 genes annotated as integral component of membrane associated with its CRISPR-Cas system. However, we mainly found this association with membrane-related accessory genes among the different classes of type I CRISPR-Cas and propose that this may be related to the acquisition of beneficial functions for the bacterium that conversely make it more vulnerable to certain phages. These membrane proteins may help form biofilms or allow for certain virulence-related advantages (12–18), but at the same time, these membrane proteins can be receptors for specific phages.
It has been shown that phages can increase the virulence of the bacterium they infect when integrated into the bacterial chromosome, as they can carry toxins, resistance genes, or adhesion factors (60). CRISPR-Cas systems would prevent these phages from proliferating. However, we have found that in many cases spacers coexist with the cognate phage gene, especially in A. baumannii. These cases indicate that the phage is integrated into the bacterial genome, suggesting that the immune system has not been fully effective. Indeed, in P. aeruginosa, and partly in K. pneumoniae, we have seen that the number of virulence genes in strains carrying CRISPR-Cas systems may be higher than in those without them (suppl. Fig. S1). The coexistence of the protospacer with the cognate spacer has been proposed to represent autoimmunity processes with a negative effect on the bacterium (61), although it may also be explained by the prophage expressing anti-CRISPR systems (62). However, other studies have shown that CRISPR-Cas systems can prevent the lytic cycle of phages while tolerating integration of the virus as a prophage, allowing the bacteria to co-opt the phage genes for possible use as virulence factors (63).
We observed that genomes with specific types of CRISPR-Cas systems can be separated into clusters based on the membrane proteins they have, and that the spacers of these CRISPR loci would recognize different non-overlapping phage genes (Fig. 4 and 5). By analyzing these relationships between membrane proteins, spacers, and phage genes, we propose that a gene encoding a component of a putative T6SS is a candidate receptor for a phage found in genomes with and without CRISPR-Cas systems. Phylogenetic data suggest that the gain of this immune system would protect the bacterium against this phage while allowing it to maintain the secretion system, which could be useful for intra- or interspecific competition (64). Interestingly, the gene cluster to which this membrane gene belongs is integrated next to the fhuA gene, and it has been speculated that a specific class of receptors (TonB) could be critical for phage genome injection through their interaction with FhuA (65). Thus, this protein could help both the entry of the phage and the proper functioning of the T6SS. Other reports have shown that secretion systems and CRISPR-Cas systems can depend on quorum sensing (66, 67), which enables the coordination of bacterial population growth and is therefore also related to biofilm formation. It could therefore be hypothesized that the bacterium expresses both systems at the same time, to avoid being exposed to phages that could take advantage of the activation of the T6SS when the bacterial population and cell-to-cell contact are increased.
Conclusions
We demonstrate that the use of large pangenomes makes it possible to annotate a large proportion of the spacers of CRISPR-Cas systems, which will enable further research in this field. Here, we describe a “Membrane protein-Phage-CRISPR” triad, in which CRISPR-Cas systems might be especially necessary when the bacterium expresses accessory genes that encode membrane proteins. These proteins would give the bacterium a particular advantage in that situation, but they would also represent a gateway for phages that use them as receptors.
Data availability
All data generated or analyzed during this study are included in this published article and its supplementary information files.
The code used to analyze the data and the pangenomes as well as the phylogenetic distances can be found in our GitHub repository: https://github.com/UPOBioinfo/crispromeskape.
Funding
This research was supported by the Ministry of Science and Innovation of the Spanish Government with grant PID2020-114861GB-I00, and by the European Regional Development Fund and the Consejería de Transformación Económica, Industria, Conocimiento y Universidades de la Junta de Andalucía with grant PY20_00871.
Conflict of Interest
The authors declare that they have no conflicts of interest.
Supplementary data
Suppl. Table S1. Distribution of CRISPR-Cas systems and spacers among all species.
Suppl. Table S2. Genes associated with CRISPR-Cas genomes. Each worksheet contains the inference results of 20 iterations of a random forest for one species and its respective CRISPR-Cas systems. The first worksheet has a header explaining the columns as follows: genes (the encoded gene names); known_genes_retained (the number of times the gene was among the top 30 most important features of the trained random forest); and log2fc_gene and diff_gene (the log2 fold change and the simple difference of means, respectively, between the samples with and without a CRISPR-Cas system).
Suppl. Table S3. Genomes used for each species with the MLST to which they belong, and the CRISPR-Cas systems found in them. Incomplete CRISPR-Cas systems are labeled as “ambiguous”. The A. baumannii I-Fb clusters have been added, along with the metadata colors used in Fig. 6d.
Acknowledgements
We would like to thank C3UPO and the HPC group of the JGU for their HPC support.