Comparative genomics reveals the diversity of CRISPR-Cas systems among neonatal 1 sepsis causing group B Streptococcus agalactiae 2

26 The pathogen Streptococcus agalactiae , or Group B Streptococcus (GBS) infection is the 27 leading cause of neonatal sepsis and meningitis in neonates. . In this study, we aimed to 28 investigate the occurrence and diversity of the CRISPR-Cas system in S. agalactiae genomes 29 using computational biology approaches. A total of 51 out of 52 complete genomes (98.07%) 30 of S. agalactiae possess CRISPR arrays (75 CRISPR arrays) with 17 strains possessing 31 multiple CRISPR arrays. There were only two CRISPR-Cas systems – type II-A system and 32 type I-C system in S. agalactiae strains. RNA secondary structure analysis through direct repeat 33 analysis showed that the analyzed strains could form stable secondary structures. The 16S 34 rRNA phylogeny exhibited clustering of the strains into three major clades grouped on the type 35 of CRISPR-Cas system. The anti-CRISPRs that contribute to CRISPR-Cas system diversity 36 and prevent genome editing were also detected. These results provide valuable insights into 37 elucidating the evolution, diversity, and function of CRISPR/Cas elements in this pathogen. 38 39


INTRODUCTION
Streptococcus agalactiae (also known as Group B Streptococcus, GBS), a Gram-positive bacterium, is a common commensal of intestinal and reproductive tracts in healthy adults.It can be transmitted from mother to newborn during birth [1].The GBS is a cause of stillbirth, chorioamnionitis, and neonatal infections including pneumonia, bacteremia, and meningitis.
GBS-sepsis has a mortality rate of 10-18%, with a colonization rate of approximately 18-35% in pregnant women, and neonatal infection rates of 0.4 to 1.1 cases per 1000 live births [2,3].
The infection process is mediated by multifunctional GBS virulence factors that could be a challenge to the immune-deficient neonates [4].GBS displays virulence factors including a potent hemolytic toxin, proteases, and multiple surface proteins to conquer host tissues [5].
Moreover, microbes may have more than one type of CRISPR-Cas system which function towards specific template based recognition, targeting, and degradation of exogenous nucleic acids [7].These systems could differ in type of Cas proteins present and spacer sequences and also the length and number of CRISPR repeats.Although, initially known for its involvement in viral defense, recent findings suggest involvement of CRISPR-Cas systems in regulation of expression of virulence genes and escape host immunity [8].CRISPR-Cas systems were earlier considered as adaptive immune system and widely studied in Streptococcus thermophilus [9].
Studies suggest that mainly three types of CRISPR-Cas9 systems are employed by Streptococcus sp.type I, type II, and type III.In addition to these, they also harbor a single type V and unknown CRISPR loci.[10] .The recent reports on the emergence of hypervirulent S. agalactiae suggest the contribution of phages and other mobile genetic elements (MGE) in adaptation to different hosts and its virulence profile [11].These phage-associated genes may play a major role in biological success of the strains by acting as delivery vehicles of resistance and virulence genes [12].Recently, CRISPR analysis has been used as tool to follow maternal GBS colonization and also as a typing technique over traditional subtyping systems [13] .While extensive details are available on S. thermophilus and other animal pathogenic streptococci, detailed information on the CRISIPR-Cas systems in human pathogenic S. agalactiae are lacking.Therefore, in this regard, we sought to investigate the occurrence and diversity of CRISPR-Cas systems in S. agalactiae genomes.We used CRISPRminer2 server [14] and CRISPRCasFinder [15] -two most inclusive and widely used resources for the identification of CRISPR arrays and cas genes.We report here the diversity and provide insights into existing CRISPR-Cas systems in S. agalactiae based on 52 complete genomes of GBS of human origin.

Sequence selection and retrieval
The data set included complete genomes of Streptococcus agalactiae.Only the complete genomes with human/homo sapiens as hosts were selected and retrieved from NCBI website (https://www.ncbi.nlm.nih.gov/, last accessed on 21-08-2022).Except for the reference strain NGBS128 none of the other genomes had any plasmid sequences.A total of 52 such sequences were selected and their fasta files downloaded for NCBI.

Detection of CRISPR-Cas features
The complete genomes of the 52 strains were screened for the presence of complete CRISPR-Cas loci using CRISPRminer2 server.CRISPRminer2 is a comprehensive tool that uses a comparative genomics approach to identify and annotate CRISPR-Cas loci.This tool also helps with multiple detection options, including anti-CRISPR detection and annotation, selftargeting spacer search, repeat type identification, bacteria-phage interaction detection, and prophage detection.Only "confirmed CRISPRs" identified by the CRISPRminer2 tool were selected for further analysis.The strains which then did not show CRISPR loci were eliminated and rest of the strains were retained for further analysis.The results were also corroborated by checking with CRISPRCasFinder.

Signature genes
The data from CRISPRminer2 was tabulated and a list of signature genes were determined.A tile map was generated to visualise the presence and distribution of these genes amongst the 43 strains.CRISPR map server was also used to obtain in depth information on each of the strains.
The CRISPR repeats were analysed through multiple sequence alignment and the aligned direct repeats visualised using the WebLogo program [16].

RNAFold Webserver
Direct repeats (DR) obtained via CRISPRminer2 were then compiled and 11 unique repeats found were then used to generate free energy structures via the RNAFold server [17] .The RNAfold Webserver set to default parameters was used to predict the RNA secondary structure and minimum free energy (MFE) of each DR.

Phylogenetic Trees
To understand CRISPR-Cas distribution in the genomes from a phylogenetic perspective, complete 16S rDNA sequences from 52 genomes were retrieved from NCBI and aligned using MUSCLE in MEGAX [18].ML statistical method with model selection was used to compute BIC score and AICc value of 24 different nucleotide substitution models.A maximum likelihood phylogenetic tree was constructed (Kimura-2 model of nucleotide substitution) and bootstrap analysis with 1000 random replicates.The cas1 and cas9 genes were aligned and a ML phylogenetic tree was constructed with 1000 bootstrap values.Streptococcus pyogenes was taken as the outgroup.

Spacer analysis
The spacer targets were identified using the CRIPSRminer2.The visual representation of the CRISPR spacers was performed using Excel macros, with each unique colour combination representing one unique spacer sequence

Sequence selection and retrieval
A search for S. agalactiae genomes in the NCBI database listed 1515 sequenced genomes amongst which 128 were complete genomes.Out of these only the ones affecting human/ homo sapiens hosts which resulted in a total of 52 genomes were considered for further analysis.Out of these 52 strains, 51 were determined to possess CRISPR arrays with a total of 75 CRISPR arrays detected and 21 strains possessing multiple of these CRISPR arrays.BJ01 and Sag27 were noted to have the highest number of individual arrays each with three CRISPR-Cas arrays.

Genomic context analysis of confirmed CRISPR-Cas loci
The selected strains were uploaded to CRISPRminer2 web server and results are tabulated in the Table 1.CRISPRminer2 provided details on the number and type of CRISPR locus found, number of spacers and direct repeats (DRs), including DR types, signature genes found within each locus.Other information such as number of prophages, anti-CRISPRs, mobile genetic elements and self-targeting spacers were all obtained from this server.All the strains were classified either into Type II, Type I or orphan CRISPR types (Supplementary table 1).Two

Cas genes
The tile map generated using the presence/absence matrix shows the distribution of signature genes amongst the 51 genomes (Figure 1).From the given cas genes only cas1, cas2, csn2 (Casein Beta which is a Protein Coding gene) and cas9 were seen to be present in all the 51 strains.The strains B507, CUGBS591, CU_GBS_08, CU_GBS_98, NGBS572, Sag153, Sag37, SG-M1, SG-M158, SG-M63, SG-M29 and SG-M50 possessed 2 CRISPR arrays and hence have the maximum cas proteins.None of the genomes possessed any transposons in the CRISPR loci.

Direct repeats
The DRs from all the CRISPRs were collected and the duplicates were removed.The 11 unique repeats were then uploaded to the RNAFold Webserver from which the free energy structure was obtained as seen in Figure 2. Shorter or incomplete DR sequences were eliminated and the remaining 9 structures were taken into consideration.DR1 and DR2 are seen to have the highest minimum free energy (MFE) value whilst DR2 has the lowest making it the most stable out of the 9 DR structures.The MFE of ribonucleic acids (RNAs) increases at an apparent linear rate with sequence length and the lower the MFE, the more stable the structure [19].In this case DR2 with -13.10 kcal/mol is seen to be the most stable out of the 9 predicted structures.Both DR1 and DR3 have the least stable structure with -0.3 kcal/mol and -0.4 kcal/mol respectively.

Spacer analysis
In total, 862 spacers were identified among GBS genomes positive for CRISPR loci.Of the identified spacers, 812 were unique (Supplementary figure 1).The spacers in each array ranged from 23 to 104.Among the genomes, the least number of spacers (1) was seen within the CRISPR locus of BJ01, while the highest number of spacers (31) was seen in GBS28 genome with an average of 11.4 spacers per array.An analysis of spacer sequences showed 212 spacers to match plasmids (24.79%) and 568 spacers (66.43%) to match phages.The CRISPRminer2 prediction indicated the absence of self-targeting spacers.Furthermore, 16 genomes had duplicate spacers within their genome with a total of 50 duplicate spacers across all the GBS genomes studied.

Phylogenetic Trees
Two separate phylogenetic trees were constructed for 16S sequences, and cas9 of the selected genomes (Figure 3).The 16S phylogenetic analysis showed all sequences clustering into 2 major clades based on their CRISPR-Cas status.This close clustering of strains may be indicative of close intra-genus relationship among them.Cas9 phylogenetic tree showed clustering of the strains into 3 major clades grouped on the variations seen in their respective genes (Supplementary figure 1).

DISCUSSION
In this study, we investigated the CRISPR-Cas systems in the GBS genomes isolated from humans' origin to gain insights into the occurrence, diversity, and features of its adaptive immune system.GBS had a high frequency of occurrence of the complete CRISPR-Cas system (91.4%).This is comparable to the reported prevalence of complete CRISPR loci for Streptococcus genera [9].High CRISPR-Cas prevalence has been attributed to high viral abundance coupled with lower viral diversity in the ecosystem [20].Bacterial CRISPR-Cas systems have been associated with interaction of pathogens with host cells, immune evasion and other bacterial virulence [21].Interestingly, contradictory functions have been reported on the functioning of CRISPRs.Short or complete absence of CRISPR arrays have led to increased pathogenicity as seen in gastroenteritis causing Campylobacter jejuni strains [8], while cas genes has been shown to enhance virulence in S. agalactiae mutant studies [22].On the one hand, CRISPR-Cas system may lessen the potential virulence by preventing MGE from introducing new virulence genes, while on the other hand, CRISPR-Cas may enhance virulence by regulating gene expression and promoting host colonization.GBS expresses various surface and secreted virulence factors to colonise and infect neonates, which also supports survival in the bloodstream.
The CRISPR-Cas systems are classified into two classes, Class I and Class II, 6 types and 33 subtypes based on the crRNA-effector complex [23].The genera Streptococci fundamentally harbor type I, type II and type III CRISPR-Cas systems in addition to the individual type V and unknown CRISPR loci [24].The type II system is involved in pathogenesis, quorum sensing, invasion and stress response among others while type I systems drives DNA targeting and cleavage associated with antiviral defense.Type III systems provides transcription-dependent immunity against diverse nucleic acid invaders [25].In our study, out of selected 52 genomes, 51 genomes contain CRISPR arrays with a total of 75 CRISPR arrays detected and 17 strains possessing multiple of these CRISPR arrays further classified into Type II, Type I or orphan CRISPR types.
A majority, 29 genomes (55.76%) of the CRISPR-Cas systems of the GBS genomes were of Type II-A, while 15 (28.84%) genomes contained both Type II-A and I-C type of the CRISPR-Cas system.This composition is similar to the that of CRISPR-Cas of other Streptococcus species like S. canis [26] and S. pyogenes [27].The type I-C in GBS contains seven cas genes (cas3, cas5c, cas8c, cas7, cas4, cas1 and cas2) similar to the ones found in S.
pyogenes [27].Cas9 was found in all 51 genomes.The recent studies indicates that Type II CRISPR-associated protein 9 (cas9) influenced virulence in GBS strains [28,29].The virulence factors of GBS have been implicated in vaginal colonization and invasive disease through Cas9 based regulators [29].None of the genomes contained any transposon or retrotransposon elements in the CRISPR loci.
Interestingly, we are also able to detect anti-CRISPRs from GBS, which contributes to CRISPR-Cas system diversity and which also prevents genome editing.Two types of Anti CRISPR (Acr) regions were detected from selected strains as AcrIIA21 being present in 26 strains whilst AcrIIA18 in just GBS1-NY strain.AcrIIA21 exhibits broad spectrum action by inhibiting Streptococcus pyogenes Cas9 (SpyCas9), Staphylococcus aureus Cas9 (SauCas9), and Streptococcus iniae Cas9 (SinCas9), exhibiting high efficacy against SinCas9 [30].An in depth understanding of its mechanism remains elusive.Furthermore, the modulation of Cas9 through sgRNA has also been reported from AcrIIA17 and AcrIIA18.The AcrIIA18 performs Cas9-dependent truncation of sgRNA which lead to generation of a shortened sgRNA which are incapable of triggering Cas9 activity [31].
CRISPR repeats are known to produce hairpin loops like secondary structure owing to its palindrome repeats.The stem-loop structure of DRs are known to facilitate the interaction between spacers and cas proteins.An investigation of the RNA secondary structures and their MFE values indicated that all but one DRs could form stable structures with ΔG < −10 kcal mol −1 .DR1, DR3 and DR5 had lower MFE values in comparison to DR5, DR6, DR7, DR8 and DR9.Studies indicate that active CRISPR arrays tend to be long due to the continuous acquisition of spacers [32].In this study, a maximum of 31 spacers were present in CRISPR loci indicating an active system.The average spacer length in the GBS genomes was 39 bp.In comparison, some genomes like that of E. coli contains an average length of 31 bp while it was found to be between 28 and 32 nucleotide bp length in S. thermophilus [33].Studies indicate that CRISPR systems containing spacers of length >30 bp are more active than loci with shorter spacer lengths and more spacers allow bacteria to mount a better defense against viruses [34].
Many of the geographically close strains carried a CRISPR cassette with diverse spacers.Such observations have recorded earlier from S. thermophilus where spacer hypervariability has been directly linked to phage exposure [35].Some of the spacers within the CRISPR loci were duplicated within the genome, the exact significance of this is not clear.Further experimental evidences are needed to investigate the functioning of the CRISPR-Cas systems on gene expression and regulation especially during host-pathogen interaction in GBS genomes.
types of DRs were found, II having 47 repeats and I having only 12, whilst the remaining 16 repeats were determined to be NA (Not applicable).GBS28 and NGB061 had the highest number of DRs with 31 whilst possessing only 1 CRISPR array.Meanwhile, Sag158 and BJ01 have 31 and 35 DRs respectively but with multiple CRISPR arrays.The individual CRISPR length was observed to have a wide range with FDAARGO_670 and B509 having 7693bp and 7363bp being on the higher end.Meanwhile, BGS-M002 has the shortest CRISPR with only 102 bp.Two types of Anti CRISPR (Acr) regions were also detected with AcrIIA21 being 108 aa long and being present in 26 strains whilst AcrIIA18 being 176 aa in length and present in just GBS1-NY.NGBS061 and BJ01 both possess the highest amount of self-targeting spacers with 11 each.Strain NGBS128 was found to have the greatest number of Mobile Genetic Elements (MGEs) with 37 in its single CRISPR array.

Figure 1 :
Figure 1: Heatmap of presence/absence of various signature cas genes amongst the 52 strains.The tiles in dark blue denote the presence whilst the ones in light blue show absence.

Figure 2 :
Figure 2: The secondary structure for consensus 11 unique direct repeat sequences of CRISPR arrays in GBS strains.The kcal/mol indicates the minimum free energy (MFE) which is known to increase at a clear linear rate with sequence length.The colours represent the base-pair probability range.

Figure 3 :
Figure 3: Phylogeny of GBS used in this study.The tree was based on 65 non-redundant complete 16S rRNA sequences from 51 species of GBS. S. pyogenes was taken as the outgroup.Numbers next to nodes indicate bootstrap values (%) based on 1000 iterations.Branch length scale indicates the number of substitutions per site.The phylogeny tree was constructed in MEGA10 using the maximum likelihood method.

Table 1 :
Genomic properties and CRISPR-Cas type of the 52 GBS strains used in this study.