N-glycan core synthesis by Agl24 strengthens the hypothesis of an archaeal origin of eukaryal N-glycosylation

Protein N-glycosylation is one of the most common posttranslational modifications and is found in all three domains of life. Crenarchaeal N-glycosylation begins with the synthesis of a lipid-linked chitobiose core structure, identical to that in eukaryotes. Here, we report the first identification of a thermostable archaeal β-1,4-N-acetylglucosaminyltransferase, named archaeal glycosylation enzyme 24 (Agl24), responsible for the synthesis of the N-glycan chitobiose core. Biochemical characterization confirmed its function as an inverting β-D-GlcNAc-(1→4)-α-D-GlcNAc-diphosphodolichol glycosyltransferase. Substitution of a highly conserved histidine residue, also found in the eukaryotic and bacterial homologs, demonstrated its importance for the function of Agl24. Furthermore, bioinformatics and structural modeling revealed strong similarities between Agl24 and the eukaryotic Alg14/13, and a more distant relationship to the bacterial MurG; these enzymes catalyze the identical or a similar process, respectively. Our data, complemented by a phylogenetic analysis of Alg13 and Alg14 that revealed similar sequences in Asgardarchaeota, further support the hypothesis that the Alg13/14 homologs in eukaryotes were acquired during eukaryogenesis.

Highlights
First identification and characterization of a thermostable β-D-GlcNAc-(1→4)-α-D-GlcNAc-diphosphodolichol glycosyltransferase (GT family 28) in Archaea.
A highly conserved histidine, within a GGH motif in Agl24, Alg14, and MurG, is essential for the function of Agl24.
Agl24-like homologs are broadly distributed among Archaea.
The eukaryotic Alg13 and Alg14 are closely related to the Asgard homologs, suggesting their acquisition during eukaryogenesis.


Introduction
Asparagine (N)-linked glycosylation is one of the most common co- and posttranslational protein modifications found in all three domains of life (Larkin and Imperiali, 2011, Jarrell et al., 2014, Nothaft and Szymanski, 2010). In Eukarya, N-glycosylation is an essential process and is evolutionarily highly conserved from yeast to humans (Lehle et al., 2006). It is estimated that more than half of all eukaryotic proteins are glycoproteins (Apweiler et al., 1999, Zielinska et al., 2010). N-glycosylation is required for correct protein folding and stability, intra- and extracellular recognition of protein targets, and enzyme activity (Caramelo and Parodi, 2007, Helenius and Aebi, 2004, Varki, 1993). Therefore, the biological functions of protein glycosylation range from relatively minor to crucial for survival of an

sequence identity (Table S1). Thus, the most likely candidate for the β-1,4-N-acetylglucosamine transferase was found to be Saci1262 (Uniprot: Q4J9C3), now named Agl24, a hypothetical 327 amino acid protein. In contrast to Euryarchaeota, where the genes of the N-glycosylation enzymes are clustered with aglB, clustering of these genes is uncommon in Crenarchaeota (Kaminski et al., 2013, Nikolayev et al., 2020), which makes the identification of GTs involved in the N-glycosylation process more challenging, as in the Sulfolobales. Interestingly, the gene saci1262 / agl24 is located only eight genes downstream of the OST aglB in S. acidocaldarius (Fig S1).

The amino acid sequence alignment of the representative bacterial MurG, eukaryal Alg14-13, and archaeal homologs revealed conservation of specific patches and individual residues (Fig 1B and 1D).

predicted to contain a C-terminal helix, which interacts with the N-terminal part of the protein (Fig 2A). We identified a conserved GGxGGH14 motif (Agl24 numbering) that is also present in the N-terminal sequences of MurG and Alg14 (Fig 1D, S2, and 2B). This motif is located in the cavity between the two different structural domains, next to the substrate binding pocket. The crystal structure of MurG (PDB: 3s2u) in complex with UDP-GlcNAc shows the close proximity of the sugar donor to the conserved residues (Fig 2B) (Brown et al., 2013). In proximity to the GGxGGH14 motif is the conserved glutamic acid (E114) (Fig 1D, S2, and 2B). Interestingly, both these conserved residues are absent in the archaeal MurG-like GT28 family orthologs of pseudomurein-producing Archaea, where aspartic acid (D) and leucine (L) are more frequently found replacing H14 and E114 (Fig 1D and S2). We propose that these substitutions might be required to accommodate the different acceptor molecule.
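As an illustration of how the motif described above can be located in a protein sequence, the following minimal sketch scans a sequence for the MurG-type pattern GGxGGH and for the shorter GGH core. The function names and the two toy sequences are hypothetical and were invented for this example; they are not the real Agl24 or MurG sequences, and this is not the alignment procedure used in the study.

```python
import re

# MurG-type motif from the text: G-G-x-G-G-H, where x is any residue.
MURG_MOTIF = re.compile(r"GG.GGH")
# Shorter core motif shared by Agl24, Alg14, and MurG according to the text.
CORE_MOTIF = re.compile(r"GGH")


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for every non-overlapping hit."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence.upper())]


def conserved_his_position(sequence, pattern=CORE_MOTIF):
    """Return the 1-based position of the His that closes the first hit, or None."""
    m = pattern.search(sequence.upper())
    return m.end() if m else None


# Hypothetical N-terminal fragments, invented for illustration only.
toy_murg_like = "MKRVLIAAGGTGGHIFPAL"
toy_agl24_like = "MKILFIVSSGGHLAE"

print(find_motif(toy_murg_like, MURG_MOTIF))
print(conserved_his_position(toy_murg_like))
print(conserved_his_position(toy_agl24_like))
```

A simple pattern search of this kind only flags candidate motifs; whether a hit is truly homologous still has to be judged from a multiple sequence alignment and structural context, as done in Fig 1D and 2B.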

Agl24 is essential in S. acidocaldarius
Due to the overall low sequence similarity of Agl24 to Alg14-13 and MurG, we sought to demonstrate the predicted function of Agl24 in vivo by generating an agl24 deletion mutant in the genome of S. acidocaldarius. This deletion mutant should arrest the N-glycosylation process after the generation of Dol-PP-GlcNAc and should contain either non-glycosylated glycoproteins or N-glycosylated proteins containing only a single GlcNAc residue linked to the asparagine in the conserved N-glycosylation motif. The genomic integration by homologous recombination of the plasmid pSVA1312, via either the up- or downstream region of agl24, and of the pyrEF selection genes was confirmed by PCR (Fig S3A). However, we were not successful in generating the marker-less in-frame agl24 deletion mutant. As an alternative approach, we attempted to delete the agl24 gene in a single homologous recombination step by integration of the pyrEF selection cassette. All attempts to generate this gene disruption mutant failed. This strongly suggests that the agl24 gene is essential in S. acidocaldarius, at least under the conditions tested. To exclude that the other GTs identified in the bioinformatic homology search (Table S1) are involved in N-glycosylation, marker-less deletion mutants of saci1907, saci1921, saci0807, saci1094, saci1201, saci1821, and saci1249 were successfully generated. Only the deletion of saci0807 showed a significant effect on the N-glycosylation of the S-layer proteins; this gene has been characterized to encode the GT Agl16, which transfers the terminal glucose residue to the N-glycan (Meyer et al., 2013).

Agl24 transfers a single GlcNAc residue onto a GlcNAc acceptor
In agreement with the observed membrane association in S. acidocaldarius (Fig S4), Agl24 was also found in the membrane fraction during its purification from E. coli. Recombinant Agl24, produced in E. coli (Fig S5), was assayed for activity using the predicted nucleotide sugar donor UDP-GlcNAc and either one of two synthetic acceptor substrates designed to mimic the native lipid-linked acceptor: C13H27-PP-GlcNAc (acceptor 1) or phenyl-O-C11H22-PP-GlcNAc (acceptor 2) (Fig 3C). The reaction product was purified and characterized. The MALDI-MS spectra confirmed that Agl24 transfers a single GlcNAc to both acceptor substrates when incubated with UDP-GlcNAc (Fig 5B and D) (Fig 3A). The Agl24 assay revealed that the product peaks are shifted by 203 Da to m/z = 811 [M-1H+2Na]+ and m/z = 833 [M-2H+3Na]+ (Fig 3B), corresponding to the addition of one dehydrated GlcNAc (203 Da). The same shift was observed for acceptor 2 (Fig 3D). We also tested whether Agl24 can use UDP-glucose as the sugar donor, but no mass shift was observed (Fig 3E), indicating that Agl24 uses exclusively UDP-GlcNAc.
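The 203 Da shift quoted above can be checked with a short calculation. The sketch below computes the monoisotopic mass added when one GlcNAc (C8H15NO6) is attached with loss of water, and the spacing expected between the two sodium adducts. The atomic masses are standard monoisotopic values; the acceptor peak positions it prints are back-calculated from the reported product peaks and are not values stated in the text.

```python
# Standard monoisotopic atomic masses in Da.
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
        "O": 15.99491462, "Na": 22.98976928}


def formula_mass(formula):
    """Sum monoisotopic masses for a composition given as {element: count}."""
    return sum(MASS[element] * count for element, count in formula.items())


GLCNAC = {"C": 8, "H": 15, "N": 1, "O": 6}   # free N-acetylglucosamine
WATER = {"H": 2, "O": 1}                      # lost on glycosidic bond formation


def glycosidic_increment():
    """Mass added to the acceptor by one GlcNAc residue (GlcNAc minus water)."""
    return formula_mass(GLCNAC) - formula_mass(WATER)


def sodium_exchange_step():
    """Mass difference between adducts that differ by one H-to-Na exchange."""
    return MASS["Na"] - MASS["H"]


# Product peaks reported in the text (nominal m/z).
product_peaks = (811, 833)
# Implied nominal acceptor peaks: an inference, not values given in the text.
acceptor_peaks = tuple(round(mz - glycosidic_increment()) for mz in product_peaks)

print(round(glycosidic_increment(), 3))
print(round(sodium_exchange_step(), 3))
print(acceptor_peaks)
```

The 22 Da spacing between the two product peaks (833 - 811) matches a single H-to-Na exchange, which is consistent with the adduct assignments given in the text.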

Activity of Agl24
Since S. acidocaldarius is a thermophilic microorganism with an optimal growth temperature of 75°C, the temperature dependence of Agl24 activity was investigated using our established mass spectrometry assay. The substrates are stable under the conditions tested, and MALDI analysis of the negative control lacking the enzyme revealed only the acceptor-1 mass (Fig 4A). At elevated temperatures, the peak intensity of the product increased while the peak intensity of the acceptor molecule decreased (Fig 4B-F). The highest activity was detected at 70°C, close to the optimal growth temperature of S. acidocaldarius. Furthermore, the addition of EDTA did not affect the activity of Agl24, demonstrating that Agl24 is a metal-independent GT (Fig S7, 5D). This result is in agreement with the lack of a conserved Asp-X-Asp (DxD) motif in the Agl24 sequence. This motif has been shown to be important for coordinating the metal ions in GT-A fold GTs, whereas GT-B GTs are metal ion-independent (Lairson et al., 2008, Gloster, 2014).

Conserved His14 is essential for Agl24 function
Two conserved amino acid residues were targeted by mutagenesis to investigate their role in the Agl24 enzyme. Alanine substitution of the histidine residue H14 within the conserved GGH14 motif, found in all Alg14 (GSGGH), MurG (GGxGGH), and Agl24 (GGH) homologs, rendered the enzyme inactive (Fig 5B). This demonstrated that this highly conserved residue, located next to the nucleotide-binding site, is important for enzyme function. In MurG, this GGxGGH motif and a subsequent second glycine-rich motif (GGY) have been proposed to interact with the diphosphate group of the lipid acceptor, as these two motifs resemble phosphate-binding loops of nucleotide-binding proteins (Baker et al., 1992, Carugo and Argos, 1997). Alanine substitution of the conserved His residue in MurG resulted in undetectable activity and loss of the ability to complement a temperature-sensitive MurG mutant (Crouvoisier et al., 2007, Hu et al., 2003). In contrast, the substitution of the second conserved residue E114, opposite the nucleotide-binding site, had no effect on the activity of Agl24 (Fig 5C)

Agl24 is an inverting β-1,4-N-acetylglucosamine-transferase
For kinetic analysis of Agl24, an HPLC assay was used to monitor conversion of the mono-GlcNAcylated-PP-lipid acceptor-2 substrate to the bi-GlcNAcylated product (Fig S6). A Km value for acceptor 2 could not be determined using this system due to significant substrate inhibition above 0.5 mM; however, apparent kinetic parameters for UDP-GlcNAc were determined: Km(app) = 1.37 ± 0.13 mM, Vmax = 32.5 ± 0.9 pmol min⁻¹.

presence of two differently-coupled anomeric protons (Fig 6 and S8). One anomeric proton appeared as a doublet of doublets (5.35 ppm, J(H1-H2) = 3.1 Hz, J(H1-P) = 7.2 Hz) typical of an α-linked GlcNAc residue (Table 1). 
Another anomeric signal appeared as a doublet with a large J coupling value (4.49 ppm, J(H1-H2) = 8.5 Hz), indicative of a β-linked GlcNAc residue. The substrate acceptor 2 contains an α-linked GlcNAc; this strongly suggested that the terminal GlcNAc introduced by Agl24 was β-linked. 1D total correlation spectroscopy (TOCSY) (Fig S8) and 2D COSY (Fig S9) experiments resolved the detailed proton signals from each of the two sugar rings. In addition, HSQC experiments (Fig S10) were used to assign the identity of each proton and carbon signal. The C4 signal of the α-GlcNAc residue is shifted to 79.6 ppm, strongly suggesting that the terminal β-GlcNAc was linked at this position. This was further supported by a relative increase in the shift of the H4 proton of the α-GlcNAc residue compared to the unmodified acceptor (Zorzoli et al., 2019). Relative shifts of other protons reported for the acceptor aligned with our experimental data. In conclusion, a combination of 1H NMR and 1D and 2D TOCSY, COSY, and HSQC experiments confirmed that Agl24 is an inverting β-1,4-GlcNAc transferase.

The eukaryotic GTs Alg14 and Alg13 are closely related to Asgard homologs
To analyze the phylogenetic relationship of the eukaryotic N-glycosylation GTs Alg13 and Alg14 with archaeal and bacterial homologs, an extensive phylogenetic analysis was performed. The eukaryotic Alg13 and Alg14 sequences cluster with homologs from the Asgard phyla Thorarchaeota and Odinarchaeota, as well as with sequences from Verstraetearchaeota, Micrarchaeota, Geothermarchaeota, and an unclassified archaeon (Fig 7A and B). Even though the sister clades of eukaryotes are Thorarchaeota and Odinarchaeota for Alg13 and Alg14, respectively, the phylogeny is not very well resolved. No closely related Alg13- and Alg14-like sequences were found in other Asgard phyla, suggesting that the remaining Asgardarchaea, e.g. Lokiarchaeota and Heimdallarchaeota, could be using a different enzyme for synthesizing N-glycans.

Next to the Alg13- or Alg14-like cluster is a second distinct cluster of EpsF- or EpsE-like sequences,

The remaining clades in the Alg13 and Alg14 phylogenies consist of sequences from the archaeal TACK superphylum and some Euryarchaeota, mainly methanogens (Fig 7A and B). The Agl24 identified in this study is found among them, within a Cren- and Bathyarchaeota branch. Close homologs of Agl24 are found in all presently available genomes of the Crenarchaeota, e.g. in the orders Fervidicoccales, Acidilobales, Desulfurococcales, and Sulfolobales, with the exception of members of the order Thermoproteales, which lack any homolog (Fig S11, Sup Data 1).

The remaining phyla of the TACK superphylum (Aig-, Bathy-, Geotherm-, Kor-, Thaum-, Nezha-, Mars-, and Verstraetearchaeota), along with some additional Crenarchaeota, including Geoarchaeota, and an Odinarchaeota sequence (Fig 7A and B), form a separate branch containing Agl24-like sequences. Next to this branch is a clade of methanogens and Altiarchaeales. Because the large number of very distant Alg13/14 homologs with low similarity resulted in poor alignments, for the phylogenies in Fig 7 we sought to reproduce and expand the clades presented previously (Lombard, 2016). Nevertheless, in the homology searches used to expand the previous set (Lombard, 2016), some of the hits formed well-supported branches in preliminary phylogenies (Sup Data 1). When some of them were included in expanded Alg13/EpsF and Alg14/EpsE datasets, none branched within the Alg13/EpsF-like or Alg14/EpsE-like clades. Instead, they are fused Agl24-like sequences, mainly from Euryarchaeota and various DPANN phyla. 
Either way, the Cren-/Bathyarchaeota Agl24 clade, along with a small Nanohaloarchaeota clade, appears to be the closest relative of the Alg13/EpsF-like and Alg14/EpsE-like proteins (Sup Data 1).

The fact that we are able to trace Agl24-like homologs to the ancestor of TACK and several Euryarchaeota lineages indicates that, despite multiple instances of lateral gene transfer, they are very ancient and diverged from the bacterial MurG at the separation of the two domains, as proposed previously (Lombard, 2016). Given that both MurG (Lombard, 2016) and the various Agl24-like homologs are fused genes, a major split event seems to have occurred once at the base of the Alg/Eps-like clade (Fig 8A and B

linkage to the Dol, identical to that in Eukarya. This is in contrast to Euryarchaeota, which use a monophosphate linkage. A summary of the similarities and differences between archaeal N-glycosylation and that of Eukarya and Bacteria is given in Table S4. All these observations strengthen the hypothesis that eukaryotic N-glycosylation emerged from an ancient archaeon.

We believe that future detailed characterization of the N-glycosylation process in members of the TACK and Asgard superphyla will reveal further molecular similarities to the eukaryotic N-glycosylation process, as well as unique properties, and will provide further support for the archaeal origin of eukaryotic N-glycosylation.

Strains and growth conditions
The strain Sulfolobus acidocaldarius MW001 (ΔpyrE) (Wagner et al., 2012) and all derived modified strains (Table S3) were grown in Brock medium at 75°C, pH 3, adjusted using sulfuric acid. The medium was supplemented with 0.1% w/v NZ-amine and 0.1% w/v dextrin as carbon and energy source (Brock et al., 1972). Selection gelrite (0.6%) plates were supplemented with the same nutrients with the addition of 10 mM MgCl2 and 3 mM CaCl2. For second selection plates, 10 mg ml⁻¹ uracil and 100 mg ml⁻¹ 5-fluoroorotic acid (5-FOA) were added. For the growth of the uracil auxotrophic mutants, 10 mg ml⁻¹ uracil was added to the medium. Cell growth was monitored by measuring the optical density at 600 nm. Protein expression in S. acidocaldarius was conducted in medium supplemented with 0.1% w/v NZ-amine and 0.1% w/v L-arabinose as carbon and energy source. The E. coli strains DH5α, BL21, and ER1821 were grown in LB medium at 37°C in a shaking incubator at 200 rpm. According to the antibiotic resistance of the transformed vector(s), media were supplemented with the antibiotics carbenicillin (amp) at 100 μg ml⁻¹ and/or kanamycin (kan) at 50 μg ml⁻¹.

Construction of deletion plasmids
Agl24 was predicted to generate the lipid-linked chitobiose core of the N-glycan. To test this, construction of a marker-less deletion mutant of this gene in S. acidocaldarius MW001 was attempted, as previously described (Wagner et al., 2012). Briefly, the strain MW001, auxotrophic for uracil biosynthesis, was transformed with the plasmid pSVA1312. To construct this plasmid, two ~1000 bp DNA fragments, one from the upstream and one from the downstream region of the agl24 (saci1262) gene, were PCR amplified. Restriction sites ApaI and BamHI were introduced at the 5´ ends of the upstream forward primer (4168) and of the downstream reverse primer (4165), respectively.

For heterologous Agl24 protein expression, 6 x 1 liter LB medium was inoculated with 10 ml from an overnight BL21 culture previously transformed with pHD0499. Cells were grown at 16˚C overnight in auto-induction medium (Studier, 2014) containing 30 µg ml⁻¹ kanamycin. The cells were harvested by centrifugation at 4,200 x g for 25 minutes and used to isolate the membrane fractions. All subsequent purification steps were carried out at 4˚C. Cells were disrupted by passing four times through an Avestin C3 High Pressure Homogeniser (Biopharma, UK), followed by a 20 min low-speed spin at 4,000 x g. The resulting supernatant was centrifuged at 200,000 x g for 2 h to obtain the membrane fraction, and 2-3 g of membranes were routinely used for isolation of Agl24-GFP-His8 proteins. Samples were solubilized in 18 ml Buffer 1 (500 mM NaCl, 10 mM Na2HPO4, 1.8 mM KH2PO4, 2.7 mM KCl, pH 7.4, 20 mM imidazole, 4 mM TCEP) with the addition of 1% n-dodecyl-β-maltoside (DDM) for 2 h at 4˚C. The sample was diluted two-fold with Buffer 1 and centrifuged at 200,000 x g for 2 h. The supernatant was loaded onto a Ni-Sepharose 6 Fast Flow (GE Healthcare) column with 1 ml of prewashed Ni-NTA beads. The column was washed with 20 ml of wash buffer (500 mM NaCl, 10 mM Na2HPO4, 1.

HPLC analysis
With the exception of kinetics reactions, which are detailed below, Agl24 reactions were analyzed using an HPLC assay with, typically, 50 µl samples containing 2.5 mM UDP-GlcNAc, 1.5 mM lipid acceptor, and 1.8 µg of purified Agl24 in TBS buffer (150 mM NaCl, 50 mM Tris-HCl, pH 7.5) containing 5 mM MgCl2. Reactions were incubated at 60°C (or alternative temperatures) for the desired time period and quenched with one equivalent of acetonitrile to precipitate Agl24. Following filtration to remove the precipitate, reactions were injected onto an XBridge BEH Amide OBD Prep column (130 Å, 5 µm, 10 x 250 mm) at a flow rate of 4 ml min⁻¹ using a Dionex UltiMate 3000 system (Thermo Scientific) fitted with a UV detector set to detect the O-phenyl functional group of the acceptor molecule at 270 nm. Each run lasted 35 minutes using running Buffer A (95% acetonitrile, 10 mM ammonium acetate, pH 8) and Buffer B (50% acetonitrile, 10 mM ammonium acetate, pH 8). A linear gradient from 20-80% Buffer B was run over 20 min, followed by an immediate drop back to 20% Buffer B for the remaining 15 minutes of the run to re-equilibrate the column to starting conditions. The Agl24 reaction substrates and products typically eluted 8-11 minutes into a run. For kinetic analyses, assays were performed in triplicate (with the exception of the 1 mM concentration, which was performed only in duplicate) and altered to contain a fixed concentration of lipid acceptor (0.5 mM) whilst varying the concentration of UDP-GlcNAc (0.5, 0.75, 1, 1.5, 3, 6, 9, 12 mM). Reactions were run for 10 hours before being quenched; areas of substrate and product peaks were then calculated to determine the reaction conversion. Conversion over 10 hours was converted to pmol per minute, and the resulting data were analyzed using GraphPad Prism v8 to generate a Michaelis-Menten curve and the resulting kinetic parameters.
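The conversion-to-rate step and the Michaelis-Menten model described above can be sketched as follows. This is a minimal illustration, not the GraphPad Prism analysis itself: the function names are hypothetical, the 50 µl reaction volume is assumed to apply to the kinetic assays as well (the text states it only for typical reactions), and the rate obtained is an average over the 10 h incubation. The Km and Vmax values are the apparent parameters reported for UDP-GlcNAc in this paper.

```python
def rate_pmol_per_min(conversion_fraction, acceptor_mM=0.5,
                      volume_ul=50.0, minutes=600.0):
    """Average rate from fractional conversion of the lipid acceptor.

    mM x µl gives nmol of acceptor; x 1000 converts nmol to pmol.
    The 50 µl volume is an assumption for the kinetic assays.
    """
    total_pmol = acceptor_mM * volume_ul * 1000.0
    return conversion_fraction * total_pmol / minutes


def michaelis_menten(s_mM, vmax, km_mM):
    """Michaelis-Menten rate law: v = Vmax * [S] / (Km + [S])."""
    return vmax * s_mM / (km_mM + s_mM)


# Apparent parameters reported for UDP-GlcNAc in the text.
KM_APP_mM = 1.37
VMAX_PMOL_MIN = 32.5

# UDP-GlcNAc concentrations used in the kinetic assays (mM).
UDP_GLCNAC_mM = (0.5, 0.75, 1, 1.5, 3, 6, 9, 12)

# Model rates at each assayed concentration (pmol per minute).
predicted = {s: michaelis_menten(s, VMAX_PMOL_MIN, KM_APP_mM)
             for s in UDP_GLCNAC_mM}

for s, v in predicted.items():
    print(f"{s:>5} mM UDP-GlcNAc -> {v:.1f} pmol/min")
```

Under these assumptions a reaction holds 25 nmol of acceptor, so the reported Vmax of 32.5 pmol min⁻¹ corresponds to roughly 78% conversion over 10 hours, and half-maximal rate is reached at a UDP-GlcNAc concentration equal to Km(app).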

NMR Analysis
The HPLC-purified Agl24 products (0.5-2 mg) were dried using a Christ RVC 2-25 speed vacuum fitted with a Christ CT 02-50 cold trap to remove excess acetonitrile, then freeze-dried (Alpha 1-2 LDplus, Christ) to remove residual water. Products were subsequently dissolved in 600 µl of D2O and NMR spectra were recorded at 293 K. The spectra were acquired on a Bruker AVANCE III HD 500 MHz NMR spectrometer equipped with a 5 mm QCPI cryoprobe. For 1D TOCSY experiments, H1' was irradiated at 5.35 ppm, H2' at 3.90 ppm, and H1" at 4.49 ppm (Fig S8). A combination of 1H-1H COSY, 1D TOCSY, and 1H-13C HSQC experiments was used to fully assign the 1H and 13C signals for the Agl24 reaction product. Full 1H and 13C chemical shift assignments can be found in Table 1 and are referenced to the residual HDO signal at 4.7 ppm.
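The anomeric assignment relies on the size of the H1-H2 coupling constant: in gluco-configured pyranoses such as GlcNAc, a small coupling (around 3-4 Hz) is typical of an α-anomer and a large coupling (around 7-9 Hz) of a β-anomer. The sketch below applies this rule of thumb to the two anomeric signals reported for the Agl24 product. The function name and the cut-off values are illustrative assumptions, not thresholds taken from the paper.

```python
def anomeric_configuration(j_h1_h2_hz, alpha_max=4.5, beta_min=7.0):
    """Classify a gluco-configured anomeric proton from its H1-H2 coupling.

    The cut-offs are rule-of-thumb values chosen for illustration only.
    """
    if j_h1_h2_hz <= alpha_max:
        return "alpha"
    if j_h1_h2_hz >= beta_min:
        return "beta"
    return "ambiguous"


# Anomeric signals reported for the Agl24 product: (shift in ppm, J(H1-H2) in Hz).
signals = {"H1'": (5.35, 3.1), 'H1"': (4.49, 8.5)}

assignments = {name: anomeric_configuration(j) for name, (_, j) in signals.items()}
print(assignments)
```

A classification like this supports, but does not replace, the full assignment from the COSY, TOCSY, and HSQC data, which is also what locates the new linkage at C4.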

Phylogenetic analysis
The eukaryotic Alg13-Alg14 and the bacterial EpsF-EpsE sequences were taken from the datasets in (Lombard, 2016), since our own databanks (25118 bacterial and 1611 eukaryotic genomes) contained too many and too divergent hits, which could have rendered downstream analyses unfeasible. For Archaea, HMM searches were impractical, since they recovered too many and too divergent possible homologs to handle, while DIAMOND searches were not sensitive enough. Ultimately, we searched for homologs with DIAMOND (--more-sensitive -k 0 & default e-value cutoff) (Buchfink et al., 2015) using multiple seeds from (Lombard, 2016) to cover the different clades of archaeal homologs: 1) the S.

Figure S10: 2D HSQC spectra used to assign carbon signals arising from both α- and β-linked GlcNAc residues. Note that the signal for C4' is significantly shifted (79.6 ppm) due to the presence of the glycosidic linkage at this position.

Figure S11: Universal distribution of Agl24 homologs in Crenarchaeota, with the exception of the order Thermoproteales. BLAST analysis (https://blast.ncbi.nlm.nih.gov) of Agl24 (Q4J9C3), restricted to Crenarchaeota (Taxid:28889; 126 genomes), revealed 100 sequences (30-100% sequence identity) and no homologs within the 27 genomes of the order Thermoproteales. The orders of Crenarchaeota are background colored: Fervidicoccales (green), Acidilobales (blue), Desulfurococcales (orange), Sulfolobales (yellow), Thermoproteales (red).