Diversification of the restriction–modification system of Streptococcus pyogenes through its acquisition of mobile elements

Restriction–modification (RM) systems are typically regarded as “primitive immune systems” in bacteria. The roles of methylation in gene regulation, segregation, and mismatch repair are increasingly recognized. To analyze methyltransferase (MTase) diversity in Streptococcus pyogenes, we compared the RM system distribution in eight new complete genome sequences obtained here and in the database-deposited complete genome sequences of 51 strains. The MTase gene distribution showed that type I MTases often change DNA sequence specificity via switching target recognition domains between strains. The type II MTases in the included strains fell into two groups: a prophage-dominant one and a CRISPR-dominant one. Some highly variable type II MTases were found in the prophage region, suggesting that MTases acquired from phage DNA can generate methylome diversity. Additionally, to investigate the possible contribution of DNA methylation to phenotype, we compared the methylomes and transcriptomes from the four most closely related strains, the results of which suggest that phage-derived methylases possibly regulate the methylome, and, hence, regulate expression levels in S. pyogenes. Our findings will benefit further experimental work on the relationship between virulence genes and pathogenicity in S. pyogenes.

of a Mod (MTase) protein, which recognizes asymmetric target sequences, and an REase protein.
There are relatively few type III MTases in prokaryotic genomes (6.8% of all 26,582 RM systems in 5,568 analyzed bacterial genomes) (7). Finally, type IV RM systems lack methyltransferase activity and consist of only one or two DNA methylation-specific REases.
The roles of the RM system besides its function in the prokaryotic immune system have become more apparent in recent years. The type I specificity subunits of some bacterial species have been found to become switched with one another, thereby producing heterogeneity in gene expression (8)(9)(10)(11)(12). This "phase variation" can also contribute to control bacterial virulence, immune evasion, and niche adaptation (13). "Solitary methyltransferases" can also mediate DNA methylation independently of an active cognate REase. Solitary MTase, a type II RM system, has no REase but is nevertheless involved in the global gene regulation of several bacterial species (14,15). Two examples of solitary bacterial MTases are DNA adenosine methyltransferase (Dam) and cell cycle-regulated methyltransferase (CcrM). Dam methylation of GATC sites in Escherichia coli has been shown to regulate the timing of DNA replication, whereas in Caulobacter crescentus, CcrM regulates the cell cycle (16). These studies have extended our understanding, such that we now view DNA methylation as an epigenetic signaling mechanism used by bacteria to help with their ongoing adaptation.
Streptococcus pyogenes causes uncomplicated pharyngitis, scarlet fever, rheumatic fever, and streptococcal toxic shock syndrome (STSS). Invasive strains of S. pyogenes are often reported as "flesh-eating" bacteria because infections with these pathogens can progress rapidly and dramatically after illness onset, with an approximate mortality rate of 40% (18). This is why many researchers study S. pyogenes despite it not being a model organism. The incidence of S. pyogenes infection has risen since 2011 (17)(18)(19), and elucidating its invasion mechanism has become a pressing issue. It is thought that the virulence factors encoded by phages are involved in bacterial virulence to some extent (20). Comparative genome analyses have suggested the pathogenic diversity of S. pyogenes might result from single-nucleotide polymorphisms (SNPs) in genes such as those encoding two-component systems (TCSs) (21). However, the presence or absence of some specific genes and SNPs in TCSs were not found to contribute to the pathogenesis widely (22). Thus, other factors, like epigenetic function, should be investigated for their potential to define the pathogenicity of S. pyogenes.
To analyze the MTase gene diversity in S. pyogenes, the eight S. pyogenes strains we sequenced herein were compared with 51 complete genome database-deposited sequences to produce an updated view of the phylogenetic diversity of this species. The results of our methylome and RNA-Seq analyses prompt us to suggest that phage-derived methylases regulate the methylome, and hence, regulate expression levels in S. pyogenes. Our findings contribute to the current knowledge base about the effects of epigenetic modification on the pathological diversification of S. pyogenes.

Strains and DNA preparation
For genome sequence determination, we obtained four S. pyogenes strains from the American Type Culture Collection (ATCC) (ATCC14918, ATCC14919, MGAS9429, and MGAS2096) and four strains from the Department of Infectious Diseases, Kobe Institute of Health (KIH05-6, KIH06-45, KIH06-201, and KIH08-6) ( Table 1). ATCC14918 and ATCC14919 were isolated from a human pharynx in 1986. ATCC14919 is an indicator strain lysed by a phage of ATCC14918. MGAS9429 was isolated from an animal swab (23). MGAS2096 was isolated from a patient with acute poststreptococcal glomerulonephritis in 1993 (23). The emm type of all four of these strains is emm12.  Table S1). They were used as references for the comparative genomics analyses.

SMRT sequencing and sequence annotation
Complete genome sequences were determined for the strains using the PacBio RSII platform (Pacific Biosciences, Menlo Park, CA, USA), a 20-kb insert-sized library, and P4C2 chemistry ( genes from the RM systems were annotated by BLASTP searching (29) against the protein sequences using the gold-standard library for RM systems obtained from the database as queries.
All hits were clustered with Blastclust in the NCBI Blast package (29), and the resulting clusters were merged when they showed over 90% sequence similarity in the alignable regions.
Target recognition domain (TRD) sequences were determined and classified based on the BLASTP results. Sequences lacking similarity in the BLASTP search (e-value: <1e −5 ) against known TRDs in REBASE were classified as new TRD variants. Besides the gold standard, we used information on the methylation patterns of the S. pyogenes strains available at REBASE PacBio (http://rebase.neb.com/cgi-bin/pacbio; last accessed March 16, 2016) to predict the recognition sequences of the type I and II RM systems.

Characterizing prophage and CRISPR regions
Prophages in each genome were predicted using PhiSpy ver. 2.3 (30) run with the default parameters, and the prophage sequences were re-annotated using Prokka ver. 1.11. The prophage region sequences were represented from Easyfig (http://mjsull.github.io/Easyfig/). Each CDS was compared by BLASTN. The percentage average nucleotide identity (ANI) was calculated by applying the ANIm tool in the JSpecies package(31). We also analyzed the number of spacer arrays and prophages in each genome. Acquired CRISPR loci were predicted below.

Predicting CRISPR/Cas loci
CRISPR loci were predicted using the CRISPR Recognition Tool ver. 1.2 with a minimum repeat length of 30, maximum repeat length of 48, and maximum spacer length of 80 (38). To characterize the spacer sequences, the spacer list was subjected to a BLASTN search against the following four databases: 1) the "gold standards" dataset of REBASE (accessed October 2016), 2) 59 S. pyogenes genome sequences, 3) predicted RM systems in 59 S. pyogenes strains, and 4) prophage region sequences in 59 S. pyogenes strains. The BLASTP cut-off value was set at 95% for identity and 90% for query coverage.

Methylome analysis
Genomic DNA was sequenced on the PacBio RS II platform. Base modifications were analyzed using the RS_Modification_Detection.1 tool from the SMRT analysis package version 2.3.0.
Briefly, after the reads were mapped to the genome sequences, the interpulse durations were measured for all nucleotide positions in the genomes and compared with the expected durations in a kinetic model of the polymerase (39) to identify significant associations. The number of methylated bases in a 1-kb window was counted with a sliding 500-bp window for each strand to overview the methylation distribution.

General features of the genome, RM system, and methylome
We sequenced eight S. pyogenes strains with a coverage range of 76-1372. The final assemblies were gathered into each one contig, and determined as complete sequences (Table   1). On average, a total of 1876 genes with 68 tRNA genes and 18 rRNA operons were predicted.
We analyzed the whole genome data for all 59 S. pyogenes strains. The pan-genome size and the number of core genes were estimated, and we confirmed the previously proposed assertion (40) that S. pyogenes is an open pan-genome species (Supplementary Figure S1). The general characteristics of the genomes are shown in Table 1 and Supplementary Table S1.
We compared the RM systems from all 59 S. pyogenes strains using REBASE, which is a comprehensive and fully curated database of RM systems. We found type I and II RM systems in these 59 S. pyogenes strains ( Table 2); none of the strains in the present study had type III or IV systems. Our methylome analysis of the eight newly sequenced strains showed that each strain has different methylation motifs (Table 3). A DNA synthesis kinetics analysis revealed that the predominant modification type was m6A, whereas other types (e.g., m4C and m5C) were infrequent but did exist (Supplementary Figure S2). By counting the number of methylated bases per 1,000 bases (Supplementary Figure S3), we found that the distribution of methylated bases across the genome differed on each strand. Pathogenesis-related genes, such as the fibronectinbinding protein, were identified in the densely methylated region (Supplementary Table S2). A motif analysis showed that all the motifs targeted by type 1 RMs on each genome were 100% methylated. In contrast, the other motifs were not. For example, only 51.4% of 5ʹ-GMGGTAVYR motifs on the ATCC14918 genome were methylated. Finally, we clustered the methylated bases to identify motifs related to recognition sequences (Table 3).

Relationship between type I RM systems and genome phylogeny
The S subunit of a type I RM system contains two tandem TRD sequences, TRD1 and TRD2 ( Figure 1). In each TRD pair of the eight strains whose genome sequences were determined here, the combination of TRD1 and TRD2 in the four emm12 strains was the same and recognized GCANNNNNNTTAA ( Figure 1, Table 3, and Supplementary Figure S4). Similarly, the TRD combination in the emm6 strains (KIH05-6, KIH08-6) was the same and recognized CAAYNNNNNNTTYG. The emm type differed between KIH06-45 and KIH06-201, but the TRD combination in these strains was the same and recognized GCANNNNNNRTTG.
We compared the maximum parsimony tree constructed using genome-wide SNPs with the type I TRD combination of S. pyogenes (Figure 1). With the clades represented by the emm12 strains, the TRD combinations were all a and b, meaning that they share the same recognition sequences (Supplementary Figure S4). The TRD combination appears to be mostly consistent with the phylogeny; however, despite being close in phylogeny, the 12_Manfredo strain TRD combination was different from that of the 24_M23ND strain.

Features and distributions of the type II RM systems
We identified 127 type II RM systems in the 59 strains, and 104 of the genes from 41 of the strains were solitary MTases ( Table 2). The MTase-encoding genes were found in close proximity to the REase gene or both MTase and REase were encoded by a single gene. When no closely located partners were predicted, the MTases were considered solitary MTases. Despite passing quality control and having high genome coverages, the motifs detected in S. pyogenes were not specific ( Table 3). Compared with the type I motifs, these candidate type II motifs were not 100% methylated. For example, in ATCC14918, 51.4% of the 5ʹ-GMGGTAVYRs were methylated, and 24.6% of the 5ʹ-RCATAVADs were methylated. Hence, type II RM systems could change the methylome in S. pyogenes to a large extent.
We found that some type II MTases are located in the prophage region. As shown in Figure

DISCUSSION
In bacteria, the number of type I RM systems varies among species, and the RM systems are usually encoded by 5 to 10 genes in the genome (41). For example, in Helicobacter pylori, about 30 type I RM systems are encoded in the chromosomal region of each genome. In contrast, only one or two type I RM system genes are encoded in the S. pyogenes genome, which is a smaller amount than that in other streptococcal species, and these genes are not highly variable ( Table   2).
Shuffling of TRDs as "phase variation" has been described in several species over the last 20 years (42-44). One example of phase variation is the switching of TRDs in the type I RM systems of some species, which produces multiple methyltransferase variants and, consequently, variable methylation (11). Another example is type III TRD switching in Haemophilus influenzae, Neisseria gonorrhoeae, and Neisseria meningitidis, where such switching can regulate the expression of multiple genes. Recently, a type I TRD S. pyogenes knockout strain was found to show different methylation and expression profiles from those of the wildtype, but the strain lacked TRD switching, and the phenotype of the knockout was not different from that of the wildtype (45).
Hence, the switching of different TRD combinations is considered to make a relatively small contribution to phase variation in S. pyogenes.
In the present study, although the type II MTases of the four emm12 strains were all prophage-  Table 4), suggesting that genetic diversification in S.
pyogenes has frequently occurred via prophage acquisition.
It is known that the CRISPR spacers in S. pyogenes share sequence similarity with prophage sequences. It is also known that the presence of few CRISPR spacers in the genome increases the chance of S. pyogenes acquiring and integrating more prophages (46). This suggests that S.
pyogenes use CRISPR to control prophage invasion, and this facilitates genetic diversification. In this study, the sequences of some spacers showed high similarity to those of MTases, suggesting that genetic diversification in S. pyogenes has accelerated via acquiring MTase-encoding phages.
It is thought that the main reason phages have MTases is to avoid the host defense mechanism, i.e., the RM system, which is incorporated into the host genome. However, it is possible that the host is actively incorporating phages that contain MTase, which in turn can diversify its methylome. An active methylase encoded on a prophage in S. pyogenes has already been reported (49).
Expression levels and methylome changes in S. pyogenes have also been described quite recently (45). The authors of that study elucidated the contribution of the type I RM system to the methylome and phenotype. The present study differs in that the RM systems in multiple S.
pyogenes strains were comprehensively detected and compared with the methylome. Our methylome analysis was conducted in four closely related strains so that we could compare the transcriptome and methylome analyses. The mRNA expression level of each allele was dissimilar, the type and position of the recognition sequence in the genome also differed (Figure 4), and the four strains fell within the CRISPR-dominant group (Figure 3).

CONCLUSION
To understand the effect of the RM system as related to genome diversification in S. pyogenes, we acquired and compared the complete genome sequences of 59 S. pyogenes strains. We found that the S. pyogenes RM system is diversely distributed in the genome, likely at least partially due to phage-driven gene transfer. Diversification of DNA methylation was also observed in the strains.
These findings support the hypothesis that phages can be incorporated into the S. pyogenes genome by circumventing the biological defense mechanism of the restriction system by retaining their methylation enzymes. Furthermore, it seems likely that the diversity of DNA modification seen in our strains is regulated by methylases derived from external factors, or phages. It will be necessary to confirm that expression levels and phenotype changes have occurred by using a       Type  IV  R  M  S  R  M  R  M  R  01_SF370  1  3  2  0  2  0  0  0   02_MGAS8232  1  2  2  1  5  0  0  0   03_MGAS315  1  2  2  1  2  0  0  0   04_SSI1  1  2  2  1  2  0  0  0   05_MGAS10394  1  1  2  2  3  0  0  0   06_MGAS6180  1  2  2  1  2  0  0  0   07_MGAS5005  1  3  2  0  2