Highly clade-specific biosynthesis of rhamnose: present in all plants and in only 42% of prokaryotes. Only Pseudomonas uses both D- and L-rhamnose

Rhamnose is a constituent of lipo- and capsular polysaccharides, and cell surface glycoproteins. L-rhamnose is biosynthesized by the rml or udp pathway and D-rhamnose by the gdp pathway. Disruption of its biosynthesis affects survival, colonisation, etc. Rhamnosides are commercially important in pharmaceutical and cosmetics industries. HMM profiles were used to investigate the prevalence of the three pathways in completely sequenced genomes and metagenomes. The three pathways are mutually exclusive except in Pseudomonas which has both rml and gdp pathways. The rml pathway is restricted to bacteria (42% genomes), archaea (21%) and bacteriophages, and absent in eukaryotes and other viruses. The gdp pathway is restricted to Pseudomonas and Aneurinibacillus. The udp pathway is primarily found in plants, fungi and algae, and in human faecal metagenomic samples. The rml pathway is found in >40% genomes of Actinobacteria, Bacteroidetes, Crenarchaeota, Cyanobacteria, Fusobacteria and Proteobacteria but in <20% genomes of Chlamydiae, Euryarchaeota and Tenericutes. The udp pathway is found in all genomes of Streptophyta, <=25% genomes of Ascomycota and Chordata, and none of the genomes of Arthropoda and Basidiomycota. Some genera which lack any of these pathways are Chlamydia, Helicobacter, Listeria, Mycoplasma, Pasteurella, Rickettsia and Staphylococcus. Organisms such as E. coli and Salmonella enterica showed significant strain-specific differences in the presence/absence of rhamnose pathways. Identification of rhamnose biosynthesis genes facilitates profiling their expression pattern, and in turn, better understanding the physiological role of rhamnose. Knowledge of phylogenetic distribution of biosynthesis pathways helps in fine graining the taxonomic profiling of metagenomes.

AUTHOR SUMMARY
In the present study, we have investigated the prevalence of rhamnose biosynthesis pathways in completely sequenced genomes and metagenomes. It is observed that the prevalence of rhamnose is highly clade specific: present in all plants but in less than half of all prokaryotes. Among chordates, only the Chinese rufous horseshoe bat has a rhamnose biosynthesis pathway and this exclusive presence is quite baffling. The effect of disrupting rhamnose biosynthesis has been reported in a few prokaryotes and all these cases pointed to the essentiality of rhamnose for critical physiological processes such as survival, colonisation, etc. Against this background, it is surprising that many prokaryotes such as Escherichia coli and Salmonella enterica show significant strain-specific differences in the presence/absence of the rhamnose pathway. This study will facilitate the experimental characterization of rhamnose biosynthesis genes in organisms where this pathway has not been characterised yet, eventually leading to the elucidation of the biological role of rhamnose. Phylum-, genus-, species- and strain-level differences found with respect to the presence of rhamnose biosynthesis pathway genes can be used as a tool for taxonomic profiling of metagenome samples. This study could also annotate a significant number of orphan proteins in the TrEMBL database.


INTRODUCTION
Rhamnose is 6-deoxy-mannose. Both L- and D-enantiomers of rhamnose are found in nature. Its presence has so far been reported in bacteria, archaea, plants and fungi. It is found as part of polysaccharides (1), glycoproteins (2) and small molecules such as flavonoids, terpenoids and saponins (3)(4)(5). L-rhamnose is a common component of the O-antigen of lipopolysaccharides (LPS) of Gram-positive bacteria such as Lactococcus lactis (6) and Enterococcus faecalis (7,8), and Gram-negative bacteria such as Salmonella enterica (9), Shigella flexneri (10) and Escherichia coli (11). D-Rhamnose is a constituent of LPS of Gram-negative bacteria such as Pseudomonas aeruginosa (12) and Pseudomonas syringae (13,14), and Gram-positive thermophilic bacterium Aneurinibacillus thermoaerophilus (15). The structure of rhamnose-containing glycan moiety has been elucidated in some bacteria. For example, the 93-kDa S-layer glycoprotein SgsE of Geobacillus stearothermophilus NRS 2004/3a is O-glycosylated at Ser/Thr residues (2). In this, rhamnan chains consisting of 12 L-rhamnose is a constituent of the glycan moiety of glycoproteins that make up flagella, fimbriae and pili. For example, flagellin from P. syringae pv. tabaci 6605 consisted solely of L-Rha whereas the P. syringae pv. glycinea race 4 flagellin contained L-Rha and D-Rha in 4:1 ratio (14). Fap1 adhesin is a major fimbrial subunit of Streptococcus parasanguis and its glycan contains rhamnose, glucose, galactose, GlcNAc and GalNAc in the ratio of 1:29:5:39:1 (16). Glycosylation of flagellin has been shown to be essential for bacterial virulence and host specificity (17,18).
repeating units and play a major role in the development and growth of all vascular plants (19).
Bacterial cell surface polysaccharides contain rhamnose and play important roles in cell growth and development, survival and interaction between bacteria. Disruption of the rhamnose biosynthesis pathway in Enterococcus faecalis attenuates the pathogen in a mouse model (30). Deletion or disruption of the rml pathway in Pseudomonas aeruginosa is effectively lethal (31). Deletion of rmlB or rmlD in Vibrio cholerae results in defective colonisation (32). Deletion of any of the four rml genes in Streptococcus mutans inhibits cell-wall polysaccharide synthesis and mutants cannot initiate or sustain an infection (33). In the uropathogenic strain O75:K5 of E. coli, lack of functional RmlD leads to loss of serum resistance (34). In Mycobacterium tuberculosis, L-rhamnose covalently links arabinogalactan to peptidoglycan layer and inhibitors of rml pathway enzymes result in growth inhibition (35)(36)(37). These observations confirm the role of rhamnose in bacterial pathogenicity. To date, neither rhamnose nor the genes responsible for its synthesis have been identified in humans. Thus, inhibitors of rhamnose biosynthesis pathways can be used for novel therapeutic intervention.
Elucidation of the pathways of glycosylation and biosynthesis of glycan building blocks can alleviate the challenges associated with relating the structure of glycans to their functions. It will also greatly facilitate chemoenzymatic synthesis of building blocks. In addition, knowledge of the pathways will help in investigating the regulation of their expression. Against this background, the present study was undertaken to identify rhamnose biosynthesis pathways in sequenced genomes. The Protein Data Bank and Swiss-Prot databases were mined to extract protein family-specific sequence and/or structural patterns. Sequence patterns in the form of hidden Markov model (HMM) profiles were used to identify rhamnose biosynthesis genes. Genus-, species- and strain-specific differences in the use of rhamnose were also analysed. Genomic context was used to assign the biological process level annotations in prokaryotes. Human-associated and environmental metagenomes were analysed.

Specificity and sensitivity of HMM profiles:
Ascertaining the specificity or sensitivity of an HMM profile is not straightforward since the only reliable validation is by direct enzyme activity assay. This is impractical because the number of proteins to be assayed is extremely large; in fact, HMMs are typically used to narrow down the possible activities a protein may be associated with. In the background of this caveat, the performance of the HMM profiles generated in this study was assessed using proteins which have molecular function annotations in the Swiss-Prot, Pfam and CATH databases.
Assessment using fusion proteins: The same protein was obtained as a hit for both RmlC and RmlD in 69 of the 21,964 genomes. Lengths of these hits (441-494 residues) suggested that these might be fusion proteins. Inspection of alignments showed that distinct regions of the proteins align to the RmlC and RmlD profiles, strengthening the possibility that they are indeed fusion proteins. This observation was used to validate the bit score thresholds of these two profiles. Hits of RmlC and RmlD from the same genome were paired together, irrespective of whether they form fusion proteins or not, and their scores against the respective profiles were plotted against each other (Figure S1). It is seen that most of the RmlC-RmlD fusion proteins scored just above the threshold. Since only functionally interacting proteins are found as fused proteins, the choice of bit score threshold was taken as validated. The lower bit scores are suggestive of sequence divergence.
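The first step of this check, finding proteins that are hits for both profiles within one genome, can be sketched as follows. This is only an illustration: the record layout (genome, protein ID, profile, bit score), the function name and the example identifiers are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def find_fusion_candidates(hits):
    """Return {genome: [protein_id, ...]} for proteins that are hits for
    both the RmlC and the RmlD profile in the same genome.

    `hits` is an iterable of (genome, protein_id, profile, bit_score)
    tuples; this layout is an assumption made for the sketch.
    """
    profiles_by_protein = defaultdict(set)
    for genome, protein_id, profile, _score in hits:
        profiles_by_protein[(genome, protein_id)].add(profile)

    candidates = defaultdict(list)
    for (genome, protein_id), profiles in profiles_by_protein.items():
        # A fusion candidate carries both profile labels on one protein.
        if {"RmlC", "RmlD"} <= profiles:
            candidates[genome].append(protein_id)
    return dict(candidates)

# Hypothetical example: p1 is hit by both profiles, p2 and p3 by one each.
example_hits = [
    ("genomeA", "p1", "RmlC", 210.0),
    ("genomeA", "p1", "RmlD", 305.0),
    ("genomeB", "p2", "RmlC", 250.0),
    ("genomeB", "p3", "RmlD", 330.0),
]
print(find_fusion_candidates(example_hits))  # {'genomeA': ['p1']}
```

Candidates found this way would still need the length and alignment-region inspection described above before being treated as fusion proteins.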
Assessment using genomic context: Co-occurrence of genes in a genome is a strong indicator of functional similarity at the level of biological process (38). Prokaryotic genomes which have homologs for all enzymes of a pathway were chosen and homologs in such genomes were classified as contiguous, neighbourhood and dispersed. The distribution of bit scores of hits of these three categories was very similar (Figure S2), indicating the validity of the bit score thresholds.
(Table S1). These five Pfam families contain proteins with a variety of molecular functions.
Full-length sequences for all members of each of these Pfam families were scanned using the corresponding HMM profiles generated in this study (Table 1 and Pfam Validation sheet of rhamnose.xlsx). It is found that all the profiles generated in this study have high specificity and some also have high sensitivity; detailed analyses are given in Table S1. Some of the entries from Pfam families PF01370 and PF16363 are annotated as Gmd but they are not hits for the Gmd profile; hence, these are likely to be RmlBs.
Listing of all enzymes viz., UniProt IDs, PubMed IDs, organism and strain names, gene names are given in the Profile dataset worksheet in the file rhamnose.xlsx.
¶ These were used as keywords to search databases. Uger has two E.C. numbers since it is a bifunctional enzyme. For the enzymes of the rml pathway, no distinction is made between enzymes which use dTDP-sugar and TDP-sugar. For Uger, the gene name UER was also used. For Ugd and Uger, rhamnose synthase / synthetase were also used for enzyme activity.
‡ This is the number of non-redundant sequences that were eventually used to generate the HMM profile. An 80% sequence identity cut-off was used as the redundancy criterion, i.e., no two sequences within each dataset have more than 80% sequence identity. For the RmlB+Ugd profile, although numbers are shown separately for RmlB and Ugd, all sequences were used together to identify the non-redundant dataset.
# A database sequence is considered as a hit for the HMM profile only when it meets both the bit score and length thresholds. The latter is the minimum number of amino acid residues of the database sequence that should align with the HMM profile.
in two distinct ways: structural clusters (based on structure similarity) and functional families (based on sequence similarity). All the sequences in the CATH database were scanned against the HMM profiles (CATH Validation sheet of rhamnose.xlsx).
There were no false positives for any of the profiles as judged by CATH annotations (Table S2). Taking into consideration annotations of hits obtained from the Swiss-Prot, Pfam and CATH databases, it is seen that the profiles generated in this study have high specificity (Table S1-S2).

Identifying homologs of rhamnose biosynthesis genes from TrEMBL database:
The database annotation matched with the profile annotation for more than 90% of the hits (except in case of Uger), thereby providing manual curation for these electronically annotated entries (Table 2). The remaining 10% of hits include those with annotation at the biological process level that is consistent with the profile annotation, those with no annotation whatsoever and those with incomplete molecular function annotation (consistent or inconsistent). These hits were (re)assigned profile annotation. Annotations of hits whose molecular function annotation does not match the profile annotation were resolved after error analysis, wherever possible, based on careful scrutiny. The details are given as supplementary material (Annotation resolution -TrEMBL and TrEMBL hits worksheets of rhamnose.xlsx). Only a small fraction of the 140 million entries of the TrEMBL database were found to be homologs of rml, udp and gdp pathway enzymes. The number of homologs of Uger is lower than that of the rml pathway enzymes and Gmd. This can be attributed to the udp pathway being restricted to plants, algae and fungi, whose contribution to the TrEMBL database is just 10-14%.
Very few homologs of Rmd are present in the TrEMBL database in contrast to the number of homologs found for Gmd and rml pathway enzymes ( (15,41,42). Other Gmds have not been assayed for Rmd activity and hence, it is not clear whether all Gmds have Rmd activity also or there are two classes of Gmds: with and without Rmd activity. Therefore, it may be envisaged that genomes in which an Rmd homolog could not be found have bifunctional Gmds and thus, can carry out the function of Rmd too. It is also to be noted that the gdp pathway is for the biosynthesis of D-rhamnose and hence, finding fewer homologs can be suggestive of restricted usage of D-rhamnose.
* Current annotations for genomes are taken from NCBI database and for TrEMBL hits from UniProt database. It is observed that for some proteins, NCBI annotation is at variance with that in UniProt. For example, a protein from Ketogulonicigenium vulgare (strain WSH-001) annotated as mannose-1-phosphate guanyltransferase in the NCBI database (YP_005796237.1) is annotated as glucose-1-phosphate thymidylyltransferase in UniProt database (F9Y751).
¶ These are fusion proteins that have been assigned the function of only one domain in the database. For example, a hit for the RmlA profile is annotated in the database as RmlB; however, this is an RmlA-RmlB fusion protein. Details of how such apparent mismatches in annotation were resolved are given in Supplementary material (Annotation resolution -TrEMBL and Annotation resolution -Genomes worksheets of rhamnose.xlsx).
% Numbers are given in the format (x+y) where x and y are the number of hits whose annotation mismatches could be and could not be resolved, respectively. Here, "resolved" means that the mismatch in annotation is either because a bifunctional fused protein is partially annotated or the database annotation is incorrect. These details are given in Supplementary material (Annotation resolution -TrEMBL and Annotation resolution -Genomes worksheets of rhamnose.xlsx).

Homologs of rhamnose biosynthesis genes from genomes:
The number of hits obtained against genomes is tabulated in Table 2.

Presence of rhamnose biosynthesis pathway in genomes:
Presence of homologs for all enzymes of the rml, gdp or udp pathway suggests that the organism uses rhamnose as a glycan building block.
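As a minimal sketch of this rule, a pathway can be called present only when every one of its enzymes has a homolog in the genome. The grouping of enzymes into pathways below is an assumption read off the enzyme names used in this paper (RmlA-RmlD, Ugd and Uger, Gmd and Rmd); the function names are hypothetical, and the sketch does not model the combined RmlB+Ugd profile or bifunctional Gmds.

```python
# Enzyme sets per pathway (assumed grouping, for illustration only).
PATHWAYS = {
    "rml": {"RmlA", "RmlB", "RmlC", "RmlD"},
    "udp": {"Ugd", "Uger"},
    "gdp": {"Gmd", "Rmd"},
}

def pathways_present(enzymes_found):
    """Return the sorted names of pathways whose enzymes are all present."""
    found = set(enzymes_found)
    return sorted(name for name, required in PATHWAYS.items()
                  if required <= found)

def missing_enzymes(enzymes_found, pathway):
    """Return the enzymes of `pathway` that have no homolog in the genome."""
    return sorted(PATHWAYS[pathway] - set(enzymes_found))

# A genome with complete rml and gdp pathways but no udp pathway.
print(pathways_present(["RmlA", "RmlB", "RmlC", "RmlD", "Gmd", "Rmd"]))
# A genome with a partial rml pathway.
print(missing_enzymes(["RmlA", "RmlB", "RmlD"], "rml"))
```

Reporting the missing enzymes separately makes it easy to tell a genome with a partial pathway from one with no homologs at all.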
Overall, it is found that the rml pathway, the most prevalent of the three, is found only in 42% and 21% of bacterial and archaeal genomes, respectively (Table 3). The udp pathway is found in 18% of eukaryotic genomes. The gdp pathway is found mostly in Pseudomonas. Gmds from some organisms, e.g., Paramecium bursaria Chlorella virus 1, are bifunctional, i.e., they have Rmd activity also (42). The gdp pathway will be functional in many other genomes wherein only a Gmd homolog is found but not that of Rmd, if such Gmd homologs are also bifunctional.
# The genomic context of hits is not applicable for the udp pathway, since it is present in eukaryotes. Acanthamoeba polyphaga mimivirus also has the udp pathway genes, which are dispersed in the genome.
% The gdp pathway homologs are contiguous in Pseudomonas, whereas Aneurinibacillus has two copies of Gmd: one is contiguous with respect to Rmd, while the other is dispersed. It is possible that the copy of Gmd which is dispersed with respect to Rmd might be involved in some other biosynthesis pathway such as that of GDP-L-fucose.
One or more genes of the biosynthesis pathway are missing in some of the genomes ( Furthermore, in some genomes, more than one hit for rhamnose biosynthesis genes is found (Genomes worksheet of rhamnose.xlsx). It is possible that the other copy of these genes might be involved in some other pathway, e.g., TDP-Fuc4NAc biosynthesis pathway (44)
(a) The "Possibly related -other than transferases" category includes genes whose annotations are related to the following: (i) CPS/EPS/LPS/Capsid biosynthesis, (ii) secondary metabolite production and (iii) cell surface appendage biosynthesis and modification. The "Possibly related -transferases" category includes glycosyl- and methyl-transferases. The "Maybe related" category includes genes annotated as epimerases, dehydratases, hydratases, etc. without any specification of the substrate.
(b) Sub-grouping of genes in the "Possibly related -other than transferases" category. The "nonspecific" category includes genes which might be involved in multiple biological functions such as genes involved in the biosynthesis of 4-aminoquinovosamine which are involved in both LPS biosynthesis and flagellin modification.

Prevalence of rhamnose biosynthesis pathways in some common genera and species:
Pseudomonas

Fig. 3 Genus-specific variation in the prevalence of the rml pathway.
Azospirillum brasilense: Rhamnose is a part of LPS and disruption of rhamnose biosynthesis genes results in modified LPS core structure, non-mucoid colony morphology, increased EPS production, and also affects maize root colonization (59). Rml pathway genes are found in both strains for which genome sequence is available.
Aggregatibacter actinomycetemcomitans: The presence of rhamnose is strain-specific in this species (60,61). Rml pathway is found in 6 of the 8 strains for which genome sequence is available (Figure 3). It is absent in the strains D7S-1 and 624 in agreement with the literature reports (62)(63).
Klebsiella: The O12 antigen of K. pneumoniae is said to be composed of an N-acetylglucosamine and rhamnose polymer (64). The rml pathway is found in 133 of the 320 strains of Klebsiella pneumoniae. However, genome IDs could not be mapped to serotypes in which the rhamnose pathway genes are known to be a part of the cps cluster (65)(66)(67)(68)(69)(70)(71). The sole exception is K. pneumoniae subsp. pneumoniae MGH 78578 which is mapped to serotype K52 and in this strain rml genes are contiguous (Genomes worksheet of rhamnose.xlsx).
Lactococcus and Streptococcus: The presence of rml pathway is strain-specific in S. pyogenes, S. mutans, S. suis and S. pneumoniae, but is found in all strains of L. lactis and S. thermophilus (Genomes worksheet of rhamnose.xlsx). dTDP-L-rhamnose is an important precursor of CPS and EPS in lactococci, lactobacilli and streptococci (1,72). The rml pathway genes have been found to be essential in L. lactis MG1363 (6). Rml genes have been characterized and appear to play a vital role in the production of serotype-specific, rhamnose-containing CPS antigens in S. mutans (73,74) and S. pneumoniae (75)(76)(77). Rml mutations in S. mutans resulted in a change in the composition of CPS and absence of the serotype-specific O antigen. S. pneumoniae cps19fL-cps19fO (RmlA-RmlD) mutants exhibited a so-called rough non-encapsulated phenotype and did not have the capacity to produce CPS, indicating that the rfb analogues play an essential role in CPS-19F production (78).
Mycobacterium: L-rhamnose covalently links arabinogalactan to peptidoglycan (79), which is critical for the overall architecture of the mycobacterial cell wall, making L-rhamnose biosynthesis essential (36,80). Rml genes are dispersed in the mycobacterial genome (37). The L-rhamnose biosynthesis pathway is considered a promising drug target, since this pathway is absent in humans (35,81).
consists of a repeat unit of four sugars: abequose, mannose, rhamnose and galactose (9). The O-antigens form hydrophilic surface layers that protect the bacterium from complement-mediated cell lysis (89). All the four strains of S. bongori (a species that is predominantly associated with cold-blooded animals, but in rare cases, infects humans also) have a partial rml pathway, implying that the rml genes might be involved in a biosynthesis pathway other than rhamnose biosynthesis. Salmonella sp. SSDFZ54 and SSDFZ69 have the rml pathway; however, no literature is available for these strains.
Table S3.

DISCUSSION
Glycosylation is found in all domains of life, i.e., eukaryotes, archaea, eubacteria and viruses. S-layer glycoproteins, pili, flagella, fimbriae and secreted glycoproteins are some examples of prokaryotic glycoconjugates which play an important role in cell physiology, microbe-host interactions, immune escape mechanisms and biofilm formation. Prokaryotic glycans seem to show far more structural diversity than eukaryotic glycans; even serotype-specific differences are known. However, not much is known, especially in prokaryotes, about molecular details of glycan biosynthesis and factors that govern structural diversity. Consequently, establishing the relationship between structure and function has become very hard. This is in stark contrast to proteins where relating sequence changes to functional changes via site-directed mutagenesis has become routine.
Rhamnose is a typical constituent of capsular polysaccharides and glycan moieties of flagella, fimbriae and pili. It is also a precursor for secondary metabolites such as streptomycin in bacteria (90) and flavonoids in plants (3,5). So far, rhamnose has not been found in any chordate. Pathways for the biosynthesis of rhamnose have been elucidated in a few organisms. This knowledge is exploited in the present study to investigate the prevalence of rhamnose biosynthesis pathways in organisms with completely sequenced genomes.
Only a small fraction of protein sequences deposited in the databases have experimentally proven biological activity. The molecular function of a protein is often inferred on the basis of its sequence similarity to a protein whose function has been experimentally determined. In this study, protein family-specific profiles for rhamnose biosynthesis enzymes were generated by including such experimentally characterised proteins and the specificity and sensitivity of these profiles was put to test by running them against curated sets of proteins from Swiss-Prot, Pfam and CATH databases.
An HMM profile for Rmd could not be generated because too few Rmds have been experimentally characterised. In this case, a BLASTp approach was taken to identify homologs of Rmd.
The presence of rhamnose, though demonstrated to be important in virulence of several bacteria (30,32,33), is not limited to pathogenic bacteria. There are many non-pathogenic bacteria which also have rhamnose such as Lactobacillus and Lactococcus. Also, rhamnose is found in intracellular (e.g., Lawsonia intracellularis), facultative intracellular (e.g., Mycobacterium tuberculosis) and extracellular organisms (e.g., Streptococcus pyogenes). Rhamnose is found in both aerobic (e.g., Mycobacterium tuberculosis) and anaerobic bacteria (e.g., Prevotella).
The rml pathway is observed in bacteria, archaea and phages but not in any of the eukaryotes. Some of the approaches to assign biological process level annotation rely upon genomic context since co-occurrence of genes suggests functional interaction. However, these approaches are ineffective in cases such as the rhamnose biosynthesis genes. This is because rhamnose is used by organisms for multiple functions, e.g., in CPS/EPS/LPS, in cell surface appendages and in the synthesis of secondary metabolites.
Large volumes of genomic and transcriptomic data have been generated but deciphering the biological information hidden in these data is a challenge. A significant portion of identified proteins remain annotated as proteins or domains with unknown function (PUFs and DUFs). Typical enzyme activity and binding assays are low-throughput "one protein at a time" approaches but they provide high quality molecular function level annotation. Experimental techniques are constrained by time and cost, and are not practically possible in all cases. To overcome this, various electronic annotation tools which annotate a protein at the molecular function level have been developed. In a large number of cases, proteins which share (high) sequence similarity perform the same molecular function. But electronic annotations are "putative" until proven experimentally. Such annotations may be incomplete or fully/partly erroneous. It is also known that some proteins with similar sequences catalyse the same reaction but with different substrates. Additionally, molecular function of a protein can be assigned with varying levels of details. In general, the extent to which sequence similarity translates to similarity in molecular function varies and this poses a challenge for electronic annotation.
Metagenomics has turned out to be an important tool in unravelling microbial communities of biomes. Understanding the mutual influence of microbiota and metabolism in humans has been the focus of several studies. Towards this, identification of operational taxonomic units (OTUs) that constitute a biome is important. Determination of surface polysaccharide determinants and biosynthesis pathways can facilitate relatively more fine-grained identification of OTUs. Rhamnose biosynthesis pathway can be one such determinant because it is a common constituent of surface polysaccharides and glycoproteins.

MATERIALS AND METHODS
Databases and software tools: Protein sequences are from UniProtKB, PDB, NCBI, Pfam and CATH databases (Table S4). Software tools available in the public domain (Table S4) (Table S4).
The corresponding BioProject metadata were extracted from the SRA section of NCBI. Gene ontology consortium's prescription of assigning functions to a protein at three levels, namely, molecular function, cellular component and biological process, has been adapted in this study.
Procedure used for generating HMM profiles: All proteins that catalyse the same step of a pathway (Figure 1) were considered to form a functional family. For each functional family, a family-specific HMM profile was generated as follows: PubMed, Swiss-Prot and PDB were searched using EC number and primary gene name as keywords (Table 1). Corresponding original research articles were read to prepare a carefully curated set of experimentally characterized proteins. All-against-all pairwise alignments were used to ensure that proteins are sequence homologs. Query and subject coverages were used to identify relevant domains in case of multi-domain or fusion proteins. A sequence identity cut-off of 80% was applied to obtain a non-redundant dataset and this was designated as the Exp dataset; here, Exp signifies that the molecular function of all entries of this dataset has been experimentally characterized by direct activity assay. A multiple sequence alignment of Exp dataset proteins was used to generate an HMM profile, designated as the Exp profile. Exp dataset proteins were scored against the Exp profile and the floor value of the bit score of the lowest scoring protein was set as the threshold for the Exp profile. "Best 1 domain" bit scores (T_exp bits) were used instead of E-values to facilitate comparison independent of the size of the search space.
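Two of the steps above, the 80% identity filter and the choice of T_exp, can be sketched in a few lines. The text does not say which redundancy-removal procedure was used, so the greedy filter below is an assumption, as are the function names, the identity lookup and the example scores.

```python
import math

def non_redundant(seq_ids, identity, cutoff=80.0):
    """Greedy filter (an assumed procedure): keep a sequence only if its
    percent identity to every already-kept sequence is at most `cutoff`."""
    kept = []
    for sid in seq_ids:
        if all(identity(sid, other) <= cutoff for other in kept):
            kept.append(sid)
    return kept

def exp_threshold(bit_scores):
    """T_exp: floor of the lowest bit score among the Exp dataset proteins."""
    return math.floor(min(bit_scores))

# Hypothetical pairwise identities (percent) for three sequences.
_identities = {
    frozenset(("a", "b")): 92.0,
    frozenset(("a", "c")): 40.0,
    frozenset(("b", "c")): 45.0,
}

def identity(x, y):
    return _identities[frozenset((x, y))]

print(non_redundant(["a", "b", "c"], identity))  # ['a', 'c']
print(exp_threshold([312.7, 298.4, 305.1]))      # 298
```

Note that a greedy filter depends on input order: here "b" is dropped because "a" was seen first, and a different ordering would keep "b" instead.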
To enhance the HMM profile's coverage of sequence divergence within a functional family, an extended dataset (referred to henceforth as the Extend dataset) was created starting from the Exp dataset by adding proteins that meet any of the following criteria: (i) Swiss-Prot entries that score >= T_exp against the Exp profile.
(ii) Swiss-Prot entries that score < T_exp but whose annotation is consistent with the molecular function of the corresponding functional family.
(iii) PDB entries whose annotation is consistent with the molecular function of the functional family but direct activity assay has not been reported. The protein annotated as "NafoA.00085.b" (A0A1W2VMZ8) was included in the Gmd profile since it has residue conservation.
(iv) Proteins which have been inferred to have the molecular function by complementation assays or product characterization.
In case of (ii), (iii) and (iv), binding and catalytic site residues were ascertained to be conserved.
Functionally important residues were identified as follows: (a) From the literature e.g., data based on site directed mutagenesis.
(b) Multiple sequence alignment of Exp dataset proteins.
(c) From 3D structures i.e., those that form H-bond with the ligand through side chain (Table S5).
A multiple sequence alignment of Extend dataset proteins was used to generate the Extend HMM profile. Details of proteins viz., source organism, UniProt ID, PubMed ID (for proteins for which experimental data is available) and gene name are given in the worksheet named Profile Dataset (electronic supplement file rhamnose.xlsx) and a summary is given in Table 1. Proteins satisfying all three criteria viz., bit score and length thresholds of Extend profiles (Table 1) and conservation of key residues (Table S5) were considered as hits for that HMM profile, i.e., functional homologs.
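The three-way test for a functional homolog can be written as a single predicate. The sketch below is illustrative only: the class and function names, the numeric thresholds and the residue positions are hypothetical, and the actual values are those listed in Table 1 and Table S5.

```python
from dataclasses import dataclass, field

@dataclass
class ProfileCriteria:
    min_bit_score: float        # bit score threshold of the Extend profile
    min_aligned_length: int     # minimum residues aligned to the profile
    # profile position -> set of residues accepted at that position
    key_residues: dict = field(default_factory=dict)

def is_functional_homolog(bit_score, aligned_length, observed, criteria):
    """True only if all three criteria are met.

    `observed` maps profile positions to the residue the candidate
    sequence has at that position in the alignment.
    """
    if bit_score < criteria.min_bit_score:
        return False
    if aligned_length < criteria.min_aligned_length:
        return False
    return all(observed.get(pos) in allowed
               for pos, allowed in criteria.key_residues.items())

# Hypothetical thresholds and key residues, for illustration only.
criteria = ProfileCriteria(250.0, 280, {57: {"D"}, 120: {"K", "R"}})
print(is_functional_homolog(301.2, 290, {57: "D", 120: "K"}, criteria))  # True
print(is_functional_homolog(301.2, 290, {57: "N", 120: "K"}, criteria))  # False
```

Checking the cheap numeric thresholds first means residue conservation only has to be inspected for sequences that already pass on score and length.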

Setting bit score thresholds for
Applying the residue conservation criterion resulted in the elimination of 7-30% of the hits obtained from HMM profiles (Table S6). It is possible that some of these hits have intragenic compensatory mutations that will restore activity. It is also possible that the enzyme activity is lost in these homologs. Applying the length criterion further increased the specificity of search for homologs primarily in the case of Uger and Gmd; these homologs are fragments, not full-length sequences.
RmlC: The gene name rmlC is associated in literature with two very similar, but distinct, enzyme activities: 3,5-epimerisation or 3-epimerisation of (d)TDP-4-keto-6-deoxy glucose (Figure S5). Of the 12 RmlCs characterized experimentally (Profile Dataset worksheet, rhamnose.xlsx), nine have been reported to have 3,5-epimerase activity; these proteins have not been tested for 3-epimerase activity. Streptomyces niveus RmlC has both 3-epimerase (major product) and 3,5-epimerase (minor product) activities (95,96). Streptomyces bikiniensis RmlC has only 3-epimerase activity (97). Streptomyces sp. GERI-155 RmlC has been characterized as RmlC 3,5-epimerase or 3-epimerase (98
or not assigned were further classified. For hits whose molecular function does not match with that of the profile annotation, possible reason(s) for the mismatch were analysed by scanning them against HMM profiles / BLASTp corresponding to the annotated activity; for many hits, mismatch could be resolved using this approach. Some hits have biological process, cellular component or incomplete molecular function annotation; some have no annotation whatsoever. Such hits were assigned molecular function by manual curation. For hits which have incomplete molecular function annotation, the profile annotation can be either consistent or inconsistent. For example, an RmlA (a thymidylyltransferase) hit is annotated as nucleotidyltransferase and this is consistent. In contrast, an RmlB/Ugd (a dehydratase) hit is annotated as serine carboxypeptidase and this is inconsistent. In some cases, the inconsistency could be rationalised: an RmlB/Ugd (a dehydratase) hit is annotated as NAD-dependent epimerase, and dehydratases and epimerases which use nucleotide-sugars as substrates belong to the SDR superfamily.

Finding genomic context in genomes:
The dataset containing the unique set of proteins from all genomes was scanned against all HMM profiles to identify homologs. These were mapped to their respective genomes (Genome Hits worksheet of rhamnose.xlsx) and their loci were extracted from the feature table of each genome. An extra column was added to the feature table to denote the locus/order of the protein in the genome. The transcripts were numbered sequentially starting from the 5' end of the genome. The letters "C" and "P" were prefixed to the number to denote the chromosomal and plasmid origin, respectively. In organisms which have more than one plasmid, plasmids were numbered sequentially as "P1.", "P2.", etc. If, in a genome, homologs for all enzymes of a pathway are present, then these homologs were classified as follows: (i) Contiguous: the location of the homologs is contiguous in the genome; homologs need not necessarily be in the same strand of DNA.
(ii) Neighbourhood: homologs are not contiguous in the genome but are located within 10 genes of each other.
(iii) Dispersed: homologs are dispersed in the genome.
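The three classes can be illustrated with a short function over gene-order positions. This is a sketch under stated assumptions: each homolog locus is given as a (replicon, gene index) pair rather than the prefixed strings described above, "within 10 genes of each other" is read as a gap of at most 10 between neighbouring homologs, homologs on different replicons are treated as dispersed, and wrap-around on circular replicons is ignored. The function name is hypothetical.

```python
def classify_context(loci, window=10):
    """Classify the homologs of one pathway in one genome.

    `loci` is a list of (replicon, gene_index) pairs, one per homolog,
    e.g. ("C", 120) for the 120th gene on the chromosome.
    """
    replicons = {replicon for replicon, _ in loci}
    if len(replicons) > 1:
        return "dispersed"  # assumption: split across replicons

    # A fusion protein contributes the same index twice; count it once.
    positions = sorted({index for _, index in loci})
    gaps = [b - a for a, b in zip(positions, positions[1:])]

    if all(gap == 1 for gap in gaps):
        return "contiguous"
    if all(gap <= window for gap in gaps):
        return "neighbourhood"
    return "dispersed"

print(classify_context([("C", 10), ("C", 11), ("C", 12), ("C", 13)]))    # contiguous
print(classify_context([("C", 10), ("C", 14), ("C", 20), ("C", 21)]))    # neighbourhood
print(classify_context([("C", 10), ("C", 11), ("C", 300), ("C", 301)]))  # dispersed
```

Strand is deliberately not part of the input, which matches the statement that contiguous homologs need not be on the same strand of DNA.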
In case of (i) and (ii), annotations of neighbouring proteins were extracted to decipher the possible biological function of these homologs. The annotations of the neighbouring genes were categorised as follows: (i) Possibly related -other than transferases: Proteins which might be involved in the same biological process as rhamnose based on literature, i.e., CPS/EPS/LPS/capsid biosynthesis and assembly, cell surface appendage biosynthesis or modification and assembly, secondary metabolite production.
(ii) Possibly related -transferases: Glycosyl- and methyl-transferases.
(iii) Maybe related: Proteins which might be involved in the same biological process as rhamnose, but their substrate specificity is not known. For example, epimerases, dehydratases, ABC transporters, etc.
(iv) No annotation: Proteins which do not have any annotation or are annotated as DUFs.
(v) Not (directly) related: Proteins which are not related to the biological processes which rhamnose is known to be involved in. For e.g., L-histidine biosynthesis, transcription and translation factors, etc.
Proteins in the "Possibly related -other than transferases" category were further sub-grouped as follows: (a) CPS/EPS/LPS/Capsid biosynthesis, (b) secondary metabolite production, (c) cell surface appendage biosynthesis and modification and (d) Non-specific. The "non-specific" category includes genes which might be involved in multiple biological functions, e.g., genes for the biosynthesis of Qui4N (4-aminoquinovosamine), which is involved in both LPS biosynthesis and flagellin modification. Based on the number of proteins with these types of annotations present in the neighbourhood of rhamnose biosynthesis genes, additional biological process level annotations were provided to rhamnose biosynthesis genes.
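To make the categorisation and tallying concrete, the sketch below assigns neighbouring-gene annotations to the top-level categories by keyword and counts them. It is not the procedure used in this study: the keyword lists and function names are hypothetical, and plain keyword matching cannot tell whether an annotation specifies a substrate, which the "Maybe related" category depends on.

```python
from collections import Counter

# Hypothetical keyword lists; the first matching category wins.
CATEGORY_KEYWORDS = [
    ("Possibly related -transferases",
     ("glycosyltransferase", "glycosyl transferase", "methyltransferase")),
    ("Possibly related -other than transferases",
     ("capsular", "lipopolysaccharide", "exopolysaccharide",
      "o-antigen", "flagell", "pilin", "polyketide")),
    ("Maybe related",
     ("epimerase", "dehydratase", "abc transporter")),
]

def categorise(annotation):
    """Map one free-text annotation to a category label."""
    text = (annotation or "").strip().lower()
    if (not text or "hypothetical protein" in text
            or "duf" in text or "unknown function" in text):
        return "No annotation"
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return "Not (directly) related"

def tally(annotations):
    """Count how many neighbouring genes fall into each category."""
    return Counter(categorise(a) for a in annotations)

neighbours = [
    "Glycosyltransferase family 2 protein",
    "Capsular polysaccharide biosynthesis protein",
    "NAD-dependent epimerase/dehydratase",
    "DUF4422 domain-containing protein",
    "Histidinol dehydrogenase",
]
for category, count in tally(neighbours).items():
    print(category, count)
```

In practice, such a keyword pass would at most pre-sort annotations for manual review.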

Identifying homologs of rhamnose biosynthesis genes in metagenomes:
The total set of proteins from human-associated and environmental metagenomes was scanned against all HMM profiles to identify homologs of rhamnose biosynthesis genes. Hits were mapped to their respective runs and BioSamples. The BioSamples were mapped to their respective BioProjects and BioProject to their Biomes. The hits found were collated at the level of BioSample.
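The chain of mappings described here (hit to run, run to BioSample, BioSample to BioProject, BioProject to biome) amounts to a few dictionary lookups. The sketch below assumes hits arrive as (run ID, profile) pairs and that the mappings are plain dictionaries; all identifiers, biome labels and function names are hypothetical.

```python
from collections import defaultdict

def collate_by_biosample(hits, run_to_biosample):
    """Count hits per profile for each BioSample.

    `hits` is an iterable of (run_id, profile) pairs.
    Returns {biosample: {profile: count}}.
    """
    table = defaultdict(lambda: defaultdict(int))
    for run_id, profile in hits:
        table[run_to_biosample[run_id]][profile] += 1
    return {sample: dict(counts) for sample, counts in table.items()}

def biome_of(biosample, biosample_to_bioproject, bioproject_to_biome):
    """Follow BioSample -> BioProject -> biome."""
    return bioproject_to_biome[biosample_to_bioproject[biosample]]

# Hypothetical identifiers, for illustration only.
run_to_biosample = {"run1": "sampleA", "run2": "sampleA", "run3": "sampleB"}
biosample_to_bioproject = {"sampleA": "projectX", "sampleB": "projectY"}
bioproject_to_biome = {"projectX": "human gut", "projectY": "soil"}

hits = [("run1", "RmlA"), ("run2", "RmlA"), ("run2", "RmlB"), ("run3", "Uger")]
collated = collate_by_biosample(hits, run_to_biosample)
for sample, counts in collated.items():
    print(sample, biome_of(sample, biosample_to_bioproject, bioproject_to_biome), counts)
```

Collating at the BioSample level merges hits from several runs of the same sample, as with run1 and run2 in the example.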

ADDITIONAL FILES
Fig. S1 Scatter plot of bit score of RmlC hit (against RmlC profile) versus the bit score of RmlD hit (against RmlD profile) from the same genome.
Fig. S2 Bit score distribution of rml hits when they are in contiguity (blue), neighbourhood (red) or dispersed (green) with respect to other rml pathway hits within the same genome.
Excel workbook rhamnose.xlsx: description of training set proteins and data pertaining to (i) validation of profiles, (ii) resolving annotations of TrEMBL and genome hits, (iii) summary of hits obtained from TrEMBL and genomes, (iv) prevalence of the three pathways at domain, phyla, species and strain level, and (v) hits from metagenomes. The "Summary" worksheet has a description of the data in different worksheets.