Four additional natural 7-deazaguanine derivatives in phages and how to make them

Bacteriophages and bacteria are engaged in a constant arms race, continually evolving new molecular tools to survive one another. To protect their genomic DNA from restriction enzymes, the most common bacterial defence systems, double-stranded DNA phages have evolved complex modifications that affect all four bases. This study focuses on modifications at position 7 of guanines. Eight derivatives of 7-deazaguanines were identified, including four previously unknown ones: 2’-deoxy-7-(methylamino)methyl-7-deazaguanine (mdPreQ1), 2’-deoxy-7-(formylamino)methyl-7-deazaguanine (fdPreQ1), 2’-deoxy-7-deazaguanine (dDG), and 2’-deoxy-7-carboxy-7-deazaguanine (dCDG). These modifications are inserted in DNA by a guanine transglycosylase named DpdA. Three subfamilies of DpdA had been previously characterized: bDpdA, DpdA1, and DpdA2. Two additional subfamilies were identified in this work: DpdA3, which allows for complete replacement of the guanines, and DpdA4, which is specific to archaeal viruses. Transglycosylases have now been identified in all phages and viruses carrying 7-deazaguanine modifications, indicating that the insertion of these modifications is a post-replication event. Three enzymes were predicted to be involved in the biosynthesis of these newly identified DNA modifications: 7-carboxy-7-deazaguanine decarboxylase (DpdL), dPreQ1 formyltransferase (DpdN), and dPreQ1 methyltransferase (DpdM), which was experimentally validated and harbors a unique fold not previously observed for nucleic acid methylases.


Because of their intrinsic properties, such as resistance to nucleases (1) or fluorescence quenching (2), 7-deazaguanine derivatives have long been employed in synthetic biology. Two of these derivatives are tRNA modifications, queuosine (Q) and archaeosine (G+). They are respectively involved in the avoidance of translational errors and in tRNA stabilization (3). Recently, 7-deazaguanine derivatives have been found in DNA as components of restriction/modification systems in bacteria (4, 5), and anti-restriction systems in phages (4, 6, 7). Epigenetic modifications are common among phages (8-11), allowing them to resist various bacterial defense systems (11-16).

Members of a transglycosylase superfamily are responsible for the incorporation of 7-deazaguanine derivatives into both tRNA and DNA. Proteins of the Tgt subgroup modify tRNA, while DpdA subgroup proteins modify DNA (3), both by replacing the target guanine with a specific 7-deazaguanine derivative. Tgt enzyme 7-deazaguanine substrates differ between organisms. One of these substrates is queuine (q), which is inserted at position 34 of

(bDpdA), and two phage DpdA (DpdA1 and DpdA2) (7). Of note, DpdA homologs have not been identified in some of the phages that contain modified 7-deazaguanine derivatives (6, 7).

PreQ0, the key intermediate in all experimentally validated pathways, is synthesized from guanosine triphosphate (GTP) by a pathway involving four proteins (FolE, QueD, QueE and QueC, see Figure 1A) found in archaea, bacteria, and some phages (3, 7). The pathways then diverge, producing various modifications. PreQ0 is reduced by QueF into preQ1 in bacteria through an NADPH-dependent reaction (18). QueF proteins can be categorized into two subgroups. Members of the unimodular subgroup harbor the NADPH binding site and the catalytic residues on the same domain.
Members of the bimodular subgroup contain two repeating domains: the N-terminal domain with the NADPH binding site, and the C-terminal domain with the catalytic residues (19). PreQ1 is inserted in tRNA by the bacterial tRNA transglycosylase bTGT (20) and further modified in two steps to produce Q (3). PreQ0 is directly inserted in tRNA in archaea by arcTGT, where it is further modified into G+. The distant TGT paralog, ArcS (21), as well as Gat-QueC, a fusion protein of QueC and a glutamine amidotransferase (22), and QueF-L, a paralog of the unimodular QueF that lacks the NADPH-dependent reduction activity (22, 23), have been found as interchangeable proteins for this reaction.

in their genome, we identified four unique 7-deazaguanine derivatives not previously observed in DNA: 7-deazaguanine (DG), 7-(methylamino)methyl-7-deazaguanine (mpreQ1), 7-(formylamino)methyl-7-deazaguanine (fpreQ1) and 7-carboxy-7-deazaguanine (CDG). We predicted and validated a preQ1 methyltransferase enzyme and predicted the involvement of five additional proteins in the synthesis of these modifications, including two additional

(2.1 x 100 mm, 1.6 μm particle size) equilibrated with 98 % solvent A (0.1 % v/v formic acid in water) and 2 % solvent B (0.1 % v/v formic acid in acetonitrile) at a flow rate of 0.25 mL/min and eluted with the following solvent gradient: 2-12 % B in 10 min; 12-2 % B in 1 min; hold at 2 % B for 5 min. The HPLC column was coupled to an Agilent 1290 Infinity DAD and an Agilent 6490 triple quadrupole mass spectrometer (Agilent, Santa Clara, CA). The column was kept at 40 °C and the auto-sampler was cooled at 4 °C.
The UV wavelength of the DAD was set at 260 nm and the electrospray ionization of the mass spectrometer was performed in positive ion mode with the following source parameters: drying gas temperature 200 °C with a flow of 14 L/min, nebulizer gas pressure 30 psi, sheath gas temperature 400 °C with a flow of

ranged from 0.1 to 1 fmol for the modified 2'-deoxynucleosides. Data acquisition and processing were performed using MassHunter software (Agilent, Santa Clara, CA).

Unknown DNA modification analysis was performed using an Agilent 1290 ultrahigh pressure liquid chromatography system equipped with a DAD and a 6550 QTOF mass detector managed by a MassHunter workstation. The column used for the separation was a Waters ACQUITY HSS T3 column (2.1 × 150 mm, 1.8 μm). The oven temperature was set at 45 °C. The gradient elution involved a mobile phase consisting of (A) 0.1 % formic acid in water and (B) 0.1 % formic acid in acetonitrile. The initial condition was set at 2 % B. A 25 min linear gradient to 7 % B was applied, followed by a 15 min gradient to 100 % B which was held for 5 min, then returned to starting conditions over 0.1 min. Flow rate was set at 0.3 mL/min, and 2 μL of sample was injected. The electrospray ionization mass spectra were acquired in positive ion mode. Mass data were collected between m/z 100 and 1000 Da at a rate of two scans per second.
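The two solvent gradients described above can be written as time/composition breakpoints, which makes it easy to check the expected % B at any retention time. The following is a minimal sketch only: it assumes that the stated durations are consecutive segments starting at injection (t = 0) and that every segment is linear, and the names `percent_b`, `QQQ_GRADIENT` and `QTOF_GRADIENT` are illustrative rather than part of any instrument software.

```python
from bisect import bisect_right

def percent_b(program, t):
    """Return % solvent B at time t (min) by linear interpolation.

    `program` is a list of (time_min, percent_B) breakpoints sorted by time.
    Times before the first or after the last breakpoint are clamped.
    """
    times = [point[0] for point in program]
    if t <= times[0]:
        return program[0][1]
    if t >= times[-1]:
        return program[-1][1]
    i = bisect_right(times, t) - 1          # index of the segment start
    (t0, b0), (t1, b1) = program[i], program[i + 1]
    return b0 + (b1 - b0) * (t - t0) / (t1 - t0)

# Triple quadrupole method: 2-12 % B in 10 min; 12-2 % B in 1 min; hold 5 min.
# Assumption: segments are consecutive and start at t = 0.
QQQ_GRADIENT = [(0.0, 2.0), (10.0, 12.0), (11.0, 2.0), (16.0, 2.0)]

# QTOF method: start at 2 % B; 25 min to 7 % B; 15 min to 100 % B;
# 5 min hold; return to 2 % B over 0.1 min (same assumption as above).
QTOF_GRADIENT = [(0.0, 2.0), (25.0, 7.0), (40.0, 100.0), (45.0, 100.0), (45.1, 2.0)]

if __name__ == "__main__":
    for t in (5.0, 10.0, 12.0):
        print(f"QQQ  t={t:5.1f} min  %B={percent_b(QQQ_GRADIENT, t):.1f}")
    for t in (12.5, 32.5, 42.0):
        print(f"QTOF t={t:5.1f} min  %B={percent_b(QTOF_GRADIENT, t):.1f}")
```

Under these assumptions the triple quadrupole run lasts 16 min and the QTOF run 45.1 min, excluding any re-equilibration time not stated in the text.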

The electrospray ionization of the mass spectrometer was performed in positive ion mode with

Protein sequence detection in phages
The HHpred online tool (https://toolkit.tuebingen.mpg.de/tools/hhpred) (24, 25) was used with default settings against the Pfam database (Pfam-A_v35) (26) to investigate the deduced proteins encoded by genes flanking the 7-deazaguanine modification genes in Cellulophaga phage phiSM, Cellulophaga phage phiST, and Halovirus HVTV-1. DpdL, DpdM, DpdA3 and DpdA4 were predicted this way. DpdN was discovered by looking at the annotations of the genes in the vicinity of the 7-deazaguanine modification genes in Flavobacterium phage vB_FspM_immuto_2-6A. These proteins were then used as queries to retrieve homologs in the proteomes of viruses publicly available in the NCBI GenBank database (July 2022) using psiBLAST version 2.13.0 (27), with at most three iterations. Other previously discovered proteins involved in the 7-deazaguanine derivative DNA modifications (Data S1) (7) were used to identify homologs in viral genomes encoding at least one of DpdL, DpdM, DpdN, DpdA3, or DpdA4 using BLASTp version 2.13.0 (28). HHpred and expert annotation were used to sort these proteins and curate false positives. All protein matches are summarized in Data S1.

FolE, QueD, QueE, QueC, and QueF (Figure 1, Data S1) and thus should harbor preQ1 in its genome, as previously observed for Halovirus HVTV-1 (7). To test this hypothesis, we used liquid chromatography coupled to diode array UV detection and a tandem mass spectrometer (LC-UV-MS/MS) to analyze the nucleosides obtained from enzymatic digestion of phiSM genomic DNA, as we previously described (4, 7). A 2'-deoxynucleoside form of preQ1 (dPreQ1) was indeed detected (3,790 modifications per 10^6 nucleotides, ~1.1 % of the Gs, Table 1).
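Modification levels are reported here both per 10^6 nucleotides and as a percentage of the Gs. The two are related through the fraction of nucleotides in the genome that are G. The sketch below shows that conversion under the simple assumption that the percentage equals modified nucleosides divided by total G. The genome G fraction is not stated in this section, so the value used in the example is purely illustrative and is not taken from Table 1; the function name `percent_of_g` is likewise hypothetical.

```python
def percent_of_g(mods_per_million_nt, g_fraction):
    """Convert modifications per 10^6 nucleotides into % of G positions.

    g_fraction is the fraction of all nucleotides that are G (0 < g <= 1).
    Assumption: % of Gs = modified nucleosides / total G nucleosides.
    """
    if not 0.0 < g_fraction <= 1.0:
        raise ValueError("g_fraction must be in (0, 1]")
    return 100.0 * (mods_per_million_nt / 1e6) / g_fraction

if __name__ == "__main__":
    # Illustrative numbers only: 1,000 modifications per 10^6 nt in a
    # genome where one nucleotide in four is a G.
    print(f"{percent_of_g(1000, 0.25):.2f} % of the Gs")
```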
In addition to the UV peaks for the four canonical nucleosides, dA, dC, dT and dG,

corresponding to a methylamino group (Figure 2B). The loss of the methylamino group was observed in the MS/MS spectra at low CID energy, indicating that the methyl group is likely linked to the 7-amino group, which is less stable than the linkage to the 2-amino group in CID MS/MS experiments.

We chemically synthesized 7-(methylamino)methyl-2'-deoxy-7-deazaguanine (mdPreQ1, Scheme 1), which was purified by HPLC and characterized by NMR and HRMS, to test

Given the presence of dPreQ1 in DNA (7) and its subsequent modification to mdPreQ1, one must ask which precursor molecule (preQ0 or preQ1) is directly inserted in the genome. Indeed, to date, all characterized DpdAs insert preQ0 into DNA (5, 7). Hence, the phage QueF should

To validate this prediction, an E. coli ΔqueF mutant was transformed with plasmids expressing queF genes from three phages/viruses, namely Cellulophaga phage phiSM, Vibrio phage VH7D, and Halovirus HVTV-1. We observed that expression of the phiSM and VH7D queF genes, but not of the HVTV-1 one, complemented the ΔqueF strain's Q-deficiency phenotype (Figure S3C). Because HVTV-1 is a virus infecting a hyper-saline archaeon, Haloarcula vallismortis, expressing its queF gene in E. coli in a low salt environment may have been challenging. Nonetheless, these experiments confirmed that phage QueF, like its bacterial counterpart, catalyzed the reduction of preQ0 to preQ1.

To confirm that phage DpdAs co-encoded with a QueF-like reductase switched specificity to preQ1, we cloned phiSM dpdA1 and VH7D dpdA2 in the pBAD24 vector and expressed them in several mutants of E. coli. In our experiments, phiSM DpdA1 was found to be inactive, while VH7D
DpdA2 inserted preQ1 in DNA (2,765 modifications per 10^6 nucleotides), proving that this DpdA substrate specificity indeed adapted to preQ1. Interestingly, VH7D DpdA2 also inserted preQ0 at a lower efficiency (712 modifications per 10^6 nucleotides) in a strain that does not produce preQ1 and accumulates preQ0 (ΔqueF, see pathway in Figure 1A), as well as CDG at a very low efficiency (67 modifications per 10^6 nucleotides) in a strain that accumulates CDG (ΔqueC, see Figure 1A).

Prediction and validation of a preQ1 methyltransferase
Phages that harbor the mdPreQ1 modification should encode a methyltransferase that appends a methyl group onto the nitrogen of the methylamino group of preQ1 in genomic DNA. There

viruses (Data S4). This candidate was eliminated because no deazaguanine DNA modification was ever found in eukaryotic viruses (7) and eukaryotes do not produce any preQ1 (3). Finally, CEPG_00054 homologs were found in seven other phages, including Vibrio phages phi-Grn1, phi-ST2, and VH7D, which were predicted to encode preQ1 modification pathways (Figure 1B, Data S1 and S5). This protein belongs to the DUF3109 family (Data S6) and has an E. coli homolog, YkgJ, which is annotated as a zinc or iron binding protein. CEPG_00054 was therefore the leading candidate for the missing preQ1 methyltransferase, and we tentatively renamed it DpdM.

homologs (Figure 1B, Data S1), were also modified with mdPreQ1 (Figure S4, Table 1) at a rate of 0.01 % of the Gs for both phages (35 and 44 modifications per 10^6 nucleotides, respectively, Table 1). Finally, expressing the predicted VH7D dpdM gene in an E. coli strain already expressing the dpdA2 of Vibrio phage VH7D resulted in the formation of low levels of mdPreQ1 in plasmid DNA (Table 2). Taken together, these data linked mdPreQ1 with the presence of DpdM (Figure 1B, Data S1).

As shown above with VH7D dpdA2 expression alone, when both the VH7D dpdA2 and dpdM genes were expressed in a ΔqueF background, which does not produce preQ1 but accumulates preQ0, preQ0 was inserted into bacterial DNA at a ~5-fold lower efficiency. Similarly, when a ΔqueC background that accumulates CDG was used, CDG was found in DNA with a ~40-fold decrease in efficiency (Table 2).
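For comparison, the fold differences for VH7D DpdA2 expressed alone can be computed directly from the insertion levels quoted in the text (2,765, 712 and 67 modifications per 10^6 nucleotides for preQ1, preQ0 and CDG). The short sketch below does only that arithmetic; the Table 2 values for the dpdA2 and dpdM co-expression are not reproduced in this section and are therefore not included, and the names used are illustrative.

```python
# Insertion levels for VH7D DpdA2 expressed alone, as quoted in the text
# (modifications per 10^6 nucleotides).
DPDA2_ALONE = {"preQ1": 2765, "preQ0": 712, "CDG": 67}

def fold_decrease(reference_level, other_level):
    """Fold decrease of `other_level` relative to `reference_level`."""
    if other_level <= 0:
        raise ValueError("other_level must be positive")
    return reference_level / other_level

if __name__ == "__main__":
    ref = DPDA2_ALONE["preQ1"]
    for base in ("preQ0", "CDG"):
        fold = fold_decrease(ref, DPDA2_ALONE[base])
        print(f"{base}: {fold:.1f}-fold lower than preQ1")
    # preQ0: 3.9-fold lower than preQ1
    # CDG: 41.3-fold lower than preQ1
```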

DpdM proteins likely bind two metals
Although the initial amino acid sequence analysis of DpdM from Cellulophaga phage phiSM revealed a CxxxCxxCC metal binding motif (Data S6), this motif was missing in the Vibrio phage phi-ST2 homolog. We found that the ORF encoding this protein was miscalled: selecting a start codon 171 nucleotides upstream of the originally predicted one resulted in a polypeptide containing the CxxxCxxCC motif (phi-ST2 corrected in Alignment S2).

The tertiary and quaternary structures of DpdM from both Cellulophaga phage phiSM and Vibrio phage VH7D were predicted using AlphaFold2. Both proteins were predicted to be monomeric, with only a few amino acids interacting between monomers (data not shown). The phiSM DpdM prediction (Figure S5A) had a higher confidence score than the VH7D prediction (Figure S5B). We found small domains of unknown function around the VH7D predicted structure, as shown in the alignment. However, the core parts of the protein were well aligned (Figure S5C).
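The motif check and the start-codon correction described above can be illustrated with a small script. This is a sketch under stated assumptions, not the procedure used in the study: the protein strings are short made-up sequences (not DpdM), and the function names are hypothetical. The regular expression encodes CxxxCxxCC literally, with `x` as any residue, and the second function shows that a 171-nucleotide upstream extension is in frame and adds 57 residues to the N-terminus.

```python
import re

# CxxxCxxCC, where x is any amino acid.
CXXXCXXCC = re.compile(r"C.{3}C.{2}CC")

def find_motif(protein):
    """Return (1-based start position, matched residues) for each motif hit."""
    return [(m.start() + 1, m.group()) for m in CXXXCXXCC.finditer(protein)]

def upstream_extension_residues(nucleotides):
    """Residues added by moving the start codon upstream by `nucleotides`.

    Raises ValueError if the extension is not a whole number of codons,
    since the new start codon would then be out of frame.
    """
    if nucleotides % 3 != 0:
        raise ValueError("extension is not in frame")
    return nucleotides // 3

if __name__ == "__main__":
    # Hypothetical sequences for illustration only.
    full_length = "MSTCAKLCGHCCQR"
    truncated = full_length[4:]      # mimics an ORF called too far downstream
    print(find_motif(full_length))   # [(4, 'CAKLCGHCC')]
    print(find_motif(truncated))     # []
    print(upstream_extension_residues(171))  # 57
```

The truncated example loses the first cysteine of the motif, which is the kind of effect a miscalled start codon would have on a motif located near the N-terminus.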

The phiSM DpdM structure contains a tunnel that has an electro-positively charged groove on one side (Figure 3A), which could be a candidate site for DNA binding, and a second groove on the opposite side (Figure 3B), which could be a site for methyl donor binding. Surprisingly, the majority of the conserved residues are clustered around this tunnel (Figures 3C and D). The CxxxCxxCC motif appears to be divided into two metal binding sites rather than one. The CxxxCxxC motif (orange in Figure S5H; representing C33, C38, and C41) is a known motif for Fe4S4 cluster and SAM binding (37), but the presence of a fourth cysteine, C150 (red in Figure 3E), in the pocket would disrupt the Fe4S4 binding and may bind another metal instead, as well as a different methyl donor. It appears that the fourth cysteine in the CxxxCxxCC motif is involved in another metal binding pocket containing three other cysteine residues (yellow in Figure 3F; representing C42, C92, C102, and C112). Both these metal binding pockets are found in the DpdM tunnel, implying that they both participate in the transfer of the methyl group from the methyl donor to preQ1 in DNA.

The tunnel observed in phiSM DpdM, the positively charged groove (Figure S3D

Data S1) (7). The product from the reaction catalyzed by QueE, 7-carboxyl-7-deazaguanine (CDG; see Figure 1A) (3) was not detected in this phage DNA using LC-UV-MS/MS analysis.

Data S10) and its function is discussed in the sections below. CGPG_00067 is highly similar to QueD (99.89 % probability matching to PF01242.22 by HHpred, Data S11). T-fold enzymes like QueD bind pterins or purines (39), and three of them are involved in preQ0 synthesis (3).
This gene was also found in other phages encoding a DpdA, FolE, QueE, and QueD but not QueC (Figure 1B; Data S1 and S12). Because of these findings, CGPG_00067 was chosen as the best candidate for the missing CDG decarboxylase and renamed DpdL.

To investigate structural differences between QueD and DpdL, we aligned the sequences of

The signature motif of QueD, CxxxHGH (40), is also changed to LxxxHRHxF in DpdL. Both histidines of the motif coordinate the zinc ion in the active site, and the cysteine is required for catalysis. Because the glycine residue is not involved in ligand binding or catalysis, changing it to arginine would not change any essential properties of the active site. The conversion of cysteine to leucine does, as QueD is inactive without this cysteine (41). The predicted structure of DpdL indicated that it would catalyze the reaction on the base (Supplementary Text, Figure S8) via an alkaline decarboxylation mechanism involving zinc or another bound metal. This would imply that the specificity of the co-encoded DpdA would be changed from preQ0 to DG.

We expressed dpdL genes from phage phiST and Acidovorax phage ACP17 in E. coli alongside their respective dpdA genes, but we were unable to detect any dDG in this heterologous system (data not shown). The proteins may be inactive in E. coli due to temperature, salt, or codon optimization differences with their host organisms, or other unknown enzymes may be required to complete the reaction.

A DpdA is encoded in all phages that harbor 7-deazaguanine derivatives
As previously stated, CGPG_00065 is a distant homolog of a TGT/DpdA and is also found in Campylobacter phages (Figure 1B; Data S1 and S13), which have been previously shown to be modified by ADG (6). This DpdA3 family had not previously been identified (6) and is the most logical candidate for the enzyme inserting a 7-deazaguanine derivative in the DNA of both phiST and Campylobacter phages (6).

It is difficult to predict the substrate specificity of the DpdA3 family (Figure 1A). DpdA3 is unlikely to insert preQ0, as the full pathway is absent in phiST and the Campylobacter phages stop the synthesis at CDG (6). As a result, DpdA3 may insert CDG, a common precursor of dADG and dDG. Because the nucleoside form of ADG was detected in the cytoplasm of Campylobacter jejuni infected with phage CP220 (6), members of the DpdA3 family might have shifted their substrate specificity to insert DG or ADG.

With the discovery of the DpdA3 subfamily, only a few of the phages/viruses identified in our previous study remained with no encoded DpdA (7). We reanalyzed the genome of Halovirus HVTV-1, which is modified with preQ1. The HVTV1_69 gene product had a 100 % probability of matching with PF20314.1, a domain of unknown function (DUF6610), by HHpred, but also 92.5 % with PF01702.21, a tRNA-guanine transglycosylase (Data S14). Furthermore, homologs of this protein were found to be encoded in other archaeal viruses that also contain preQ1 synthesis genes, as well as a singleton modification gene in a few other viral genomes, including Halorubrum phage HF2 (Figure 1B; Data S1 and S15). With the discovery of this final DpdA subgroup, renamed DpdA4, all phages known to harbor a 7-deazaguanine in their DNA encode a DpdA family protein, which now could be considered a signature protein family for the presence of such DNA modifications.

and QueF (Figure 1, Data S1) and should thus have complete guanosine replacement to preQ1. However, dPreQ1 was not detected in this phage genome using LC-UV-MS/MS analysis.

Supplementary Data S1, S16). We found six phages that encode a similar protein and shared

Figure S12).
S1), it should harbor dG+ in its genome. As we previously described (4, 7), we used LC-UV-

10^6 nucleotides, Figure S13, Table 1).

There was no other neighboring gene that was clearly shared with other phages or viruses (data not shown). Surprisingly, the host archaeon, Sulfolobus tengchongensis, does not encode any proteins involved in the Q or G+ biosynthesis pathway (data not shown). We believed that its

Sulfolobus virus STSV2, Supplementary Data S17, and 99.9 % for VPFG_00169 encoded by Vibrio phage nt-1, Data S18). It has previously been demonstrated that ArcS proteins have a high degree of diversity (21). Initially, four domains were identified in ArcS (Nt, C1, C2 and PUA).
The PUA domain is specific to RNA binding, the Nt domain is similar to the TGT catalytic domain and the C1 domain is specific to ArcS and contains the catalytic core of the functions.

We decided to test the ArcS of phage nt-1 because the ArcS from a hyperthermophile organism might be inactive in our E. coli double plasmid test system, as hypothesized previously (7). Both nt-1 dpdA2 and arcS were cloned in pBAD24 and pBAD33 vectors, expressed in E. coli, and the plasmids were extracted. We found that dPreQ0 is inserted into DNA when nt-1 DpdA2 is expressed alone, and dG+ is present when nt-1 ArcS is co-expressed (Table 3). This suggested that STSV-2 ArcS, which is less degenerate than nt-1 ArcS, may have the same function, generating dG+. Therefore, additional STSV-2 proteins yet to be identified are required to catalyze the insertion of CDG in DNA in this virus.

In this current era of active discoveries of new bacterial defence systems against phages driven by genomic data mining (12-14), the identification of phage counter defences (11), including DNA modifications, is also rising. This study focused on a group of guanine modifications, known as 7-deazaguanines, where the nitrogen in position 7 of guanine is replaced by a carbon, allowing an easier addition of various side chains at this position. In a previous study, we presented four side chains, namely dPreQ1 and dG+ (7) in the genome of some viruses, and enzymes involved in the synthesis of these modified bases (Figure 1).

Phage genomes encoding QueF homologs always contain dPreQ1 or its derivatives.
Indeed, viral QueF proteins are preQ0 reductases (Figure S3), like the bacterial ones (7). Two hypermodified dPreQ1 that each require an additional enzymatic step for their synthesis were identified. One of the enzymes involved in this step is the dPreQ1 methyltransferase, now named

the protein structures, we propose that DpdM methylates preQ1 already inserted in DNA using two metal groups (Figures 3 and S5). We also identified a potential preQ1 or dPreQ1 formyltransferase, DpdN, leading to fdPreQ1, in the genome of Flavobacterium phage vB_FspM_immuto_2-6A. Finally, we identified a candidate protein, Cellulophaga phage phiST DpdL, that most likely promotes alkaline decarboxylation of CDG to lead to dDG in phage genomes. Unfortunately, we were unable to demonstrate its activity.

The presence of dCDG in Sulfolobus virus STSV-2 is puzzling because this virus only encodes discernible DpdA and ArcS homologs. Furthermore, its host does not modify its tRNA with 7-

Cellulophaga phage phiST, and Flavobacterium phage vB_FspM_immuto_2-6A (Table 1).

Our attempts to test members of the DpdA3 and DpdA4 families in our E. coli model were unsuccessful.

In this study, we showed that the substrate specificity of some DpdA has shifted toward other 7-deazaguanines. For example, the change in substrate specificity between dPreQ0 and dPreQ1 seems to have occurred several times in evolution, as phages acquired both unimodular and bimodular QueF (Figure 3), and almost all DpdA sub-families have a member that may insert preQ1 into DNA (Data S1). We predict that dPreQ0 was the first 7-deazaguanine DNA modification, as it is the modification that requires the fewest enzymes. Interestingly, Vibrio phage VH7D DpdA2 inserted various 7-deazaguanine derivatives in its DNA with different efficiencies (Table 2). Vibrio phage nt-1 DpdA2 did not insert preQ1 in DNA in our assay but could insert preQ0 and possibly G+ (Table 3). It was previously reported that this phage harbored three 7-deazaguanine DNA modifications, at various levels (7). The DpdA2 family thus exhibits substrate promiscuity.

We previously showed that 7-deazaguanines protect DNA from restriction enzymes at various levels depending on the modification (7

predicted to bind metal are colored in yellow, orange, and red as described in the text. The view of these cysteines was zoomed in to show both metal pockets (E and F).