Abstract
GapMind is an automated web-based tool for annotating amino acid biosynthesis pathways in bacterial and archaeal genomes. We updated GapMind to include recently identified enzymes, including new enzymes that we identified by using high-throughput genetics and comparative genomics. Across 206 prokaryotes that have high-quality genomes and are reported to grow in minimal media, the average number of unexplained missing steps or gaps dropped from 1.3 per genome to 0.8 per genome. The majority of the remaining gaps involve the gain or loss of phosphate groups.
Introduction
Most free-living bacteria can probably make all of the standard amino acids (Ramoneda et al. 2023), but in many of their genomes, the genes for some biosynthetic enzymes cannot be identified. These gaps occur even in the genomes of bacteria and archaea that are experimentally confirmed to be prototrophic (Price et al. 2020). These gaps make it challenging to predict the growth requirements of an organism from its genome sequence, except for well-studied groups such as Enterobacteria or Pseudomonas (Seif et al. 2020). For other bacteria, predictions of amino acid auxotrophies are often incorrect (Price et al. 2018a; Price 2023). These gaps in biosynthetic pathways also imply that many novel biosynthetic enzymes remain to be discovered.
We previously described GapMind, a fast web-based tool for annotating amino acid biosynthesis pathways in bacteria and archaea (Price et al. 2020). Because many of the biosynthetic enzymes are not described in standard databases such as Swiss-Prot, MetaCyc, or BRENDA (Caspi et al. 2020; Chang et al. 2021; UniProt Consortium 2023), we curated lists of experimentally-characterized enzymes that can carry out each biosynthetic step. We also included novel enzymes that we identified using randomly-barcoded transposon sequencing (RB-TnSeq) (Price et al. 2018a; Price et al. 2020). Given a genome of interest, GapMind uses these characterized enzymes, along with curated families (hidden Markov models) from TIGRFams (Haft et al. 2013), to identify candidates for each step, at varying levels of confidence. GapMind also uses similarity to proteins with computationally-predicted functions (from Swiss-Prot) to identify additional candidates, but these are never considered high-confidence candidates. GapMind then reports the highest-confidence candidates for each step. Finally, if there are alternate pathways, GapMind selects the pathway that has the most high-confidence candidates and the fewest gaps (steps with only low-confidence candidates or no candidates). An example of GapMind’s results is shown in Figure 1.
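The pathway-selection rule described above can be illustrated with a small sketch. This is not GapMind's actual code: the function names, the confidence labels, and the tie-breaking order (fewest gaps first, then most high-confidence steps) are assumptions made for this example.

```python
from typing import Dict, List

# Confidence of the best candidate for each step (hypothetical encoding).
HIGH, MEDIUM, LOW, NONE = "high", "medium", "low", "none"

def count_gaps(step_conf: List[str]) -> int:
    """A gap is a step with only low-confidence candidates or no candidates."""
    return sum(1 for c in step_conf if c in (LOW, NONE))

def pick_best_path(paths: Dict[str, List[str]]) -> str:
    """Return the name of the path with the fewest gaps; ties go to the path
    with more high-confidence steps (this ordering is an assumption)."""
    def key(name: str):
        conf = paths[name]
        return (count_gaps(conf), -conf.count(HIGH))
    return min(paths, key=key)

# Toy example: two alternative routes to the same amino acid.
paths = {
    "standard": [HIGH, HIGH, NONE],
    "alternative": [HIGH, MEDIUM, MEDIUM],
}
best = pick_best_path(paths)
```

In this toy case the alternative route is chosen because it has no gaps, even though the standard route has more high-confidence steps.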
Since the original publication of GapMind, over six hundred relevant enzymes have been reported or curated. Also, novel enzymes involved in amino acid biosynthesis were predicted based on comparative genomics (Price et al. 2021; Ashniev et al. 2022). Furthermore, we have collected RB-TnSeq data from ten more organisms, including additional phyla of bacteria and archaea (Shiver et al. 2023; Day et al. 2024). As we will show, this allowed us to identify 17 novel or diverged biosynthetic genes. Based on this new knowledge, we updated GapMind and improved its ability to explain amino acid biosynthesis in diverse bacteria and archaea. GapMind is available at http://papers.genomics.lbl.gov/gaps.
Results and Discussion
We will first describe the novel genes that we identified, using RB-TnSeq and/or comparative genomics. Second, we will discuss how we incorporated previously-published predictions from comparative genomics into GapMind (Price et al. 2021; Ashniev et al. 2022). Third, we will describe how we improved GapMind’s coverage by annotating diverged enzymes, which would otherwise be identified as low-confidence candidates, or might be missed entirely. Finally, we will assess how well the revised GapMind covers amino acid biosynthesis in prototrophic bacteria and archaea, and whether it incorrectly annotates biosynthetic pathways in auxotrophic organisms.
Novel genes for glycine synthesis
The standard pathway for glycine biosynthesis in bacteria and archaea involves a single enzyme, serine hydroxymethyltransferase (GlyA), which converts serine plus tetrahydrofolate (THF) to glycine plus 5,10-methylene-THF. Bifidobacterium breve, a human gut bacterium, and Methanococcus maripaludis, a methanogen, are capable of glycine synthesis despite lacking any apparent glyA gene. We examined previously-published RB-TnSeq data from both organisms grown in defined media that lack glycine (Shiver et al. 2023; Day et al. 2024). In both organisms, these experiments highlighted two genes which are important for fitness whenever glycine is not available (Figure 2A & 2B). We will call them glyXL and glyXS. (GlyXL is MMP_RS07345 or BBR_RS12920, which are 61% amino acid identical; GlyXS is MMP_RS03450 or BBR_RS12915, which are 46% identical.) As shown in Figure 2B, in M. maripaludis, glyXL and glyXS are important for fitness in most of the experiments without added glycine, but there are a few exceptions. These exceptions occur because there was relatively little growth in these conditions, and hence relatively little decrease in the relative abundance of auxotrophic mutants. This can be seen by comparing the fitness pattern of glyXL to another amino acid biosynthesis gene, leuA (Figure 2C). The two patterns are very similar, except that mutants of glyXL are rescued by added glycine or (to a lesser extent) by dipeptides that contain glycine.
GlyXL is distantly related to anaerobic ribonucleotide reductases, which use a glycyl radical and two nearby cysteine residues for catalysis. When we aligned the predicted structure of GlyXL from B. breve (AlphaFold version 2.0 for UniProt A0A6B4WGC7; (Varadi et al. 2022)) to the experimental structure of an anaerobic ribonucleotide reductase (PDB 4COM) using the RCSB web site (Bittrich et al. 2024), the structures were only moderately similar (root mean square deviation 4.33 Å, TM score 0.66), and these catalytic residues were not conserved. So GlyXL may use a different mechanism. GlyXS is an ACT domain protein (PFam PF13740; (Finn et al. 2014)); members of this family often bind amino acids to regulate metabolism. We predict that GlyXL and GlyXS catalyze the formation of glycine, but we do not have a specific proposal for the precursors.
A role for the GlyXL family in glycine synthesis was previously reported in Streptococcus pneumoniae (Kazmierczak et al. 2009). Specifically, S. pneumoniae D39 is auxotrophic for glycine and has a truncation in glyXL (spr0218), and glycine prototrophy can be restored by replacing glyXL with the full-length version from a prototrophic strain (Kazmierczak et al. 2009). A complication in interpreting these data is that S. pneumoniae D39 also encodes glyA. However, metabolic labeling patterns imply that strain D39 uses GlyA in reverse, to convert glycine to serine (Härtel et al. 2012). Furthermore, although strain D39 seems to lack the standard pathway for forming serine (no 3-phosphoglycerate dehydrogenase is apparent in the genome), it does not require serine for growth (Kazmierczak et al. 2009; Härtel et al. 2012); this is consistent with the conversion of glycine to serine.
Across diverse bacteria and archaea, GlyXL and GlyXS usually co-occur and are usually in a putative operon. For example, in the fast.genomics database of representative genomes (Price and Arkin 2024a), potential orthologs of GlyXL and GlyXS (above 30% of the best possible bit score) are found in 885 genera, and they are encoded within 5 kb and on the same strand in 815 genera. (But they are not encoded nearby in M. maripaludis.) Furthermore, in Bacillus methanolicus, expression of the glyXS-glyXL operon appears to be regulated by a glycine riboswitch (see BMMGA3_03000 in (Irla et al. 2015)). Although many of the organisms with GlyXL and GlyXS are anaerobic, B. methanolicus is obligately aerobic (Arfman et al. 1992), so we expect that GlyXL-GlyXS can function in the presence of oxygen.
Many of the genomes that encode glyXL and glyXS encode glyA as well, but unlike S. pneumoniae, most of the genomes with all three of these genes seem to encode the standard pathway for forming serine as well. We suspect that these genomes encode two different paths to glycine.
Overall, genetic data from three genera show that glyXL is involved in glycine synthesis, and both genetic data and comparative genomics show that glyXS is involved as well. In the revised GapMind, glyXL-glyXS is included as an alternative pathway for glycine synthesis.
An alternative to phosphoserine phosphatase from the DUF1015 family
In most prokaryotes, serine is formed from 3-phosphoglycerate, which is an intermediate of both glycolysis and gluconeogenesis, in three steps: 3-phosphoglycerate dehydrogenase (SerA), phosphoserine transaminase (SerC) in reverse, and phosphoserine phosphatase (SerB). No phosphoserine phosphatase (SerB) is apparent in the genome of Clostridioides difficile strain 630Δerm, but cell extracts of C. difficile can convert 3-phosphoglycerate to serine (Hofmann et al. 2018). In C. difficile, SerA and SerC are found in an operon with a DUF1015 family protein, CDIF630erm_01132, so it was proposed that CDIF630erm_01132 is the missing phosphatase (Hofmann et al. 2018).
To test this hypothesis, we examined the gene neighbors of CDIF630erm_01132 in diverse bacteria. As shown in Figure 3, homologs of CDIF630erm_01132 are often encoded near serA, and are often encoded near a likely serC as well. Also, if CDIF630erm_01132 is a replacement for SerB, then its close homologs should be found in genomes that lack the previously-known forms of SerB. We randomly selected 50 representative genomes (all from different genera) that have relatively high-scoring homologs of the putative hydrolase (at least 445 bits, corresponding to roughly 50% identity or above) encoded near serA. We ran the updated version of GapMind on these 50 genomes and it identified high-confidence candidates for SerB (at least 40% amino acid identity to a characterized protein) in just three of those 50 genomes. (One of these three was a metagenome-assembled genome, GCA_022072225.1, that also contained two serA genes, so it may reflect contamination.) Overall, the vast majority of the genomes that encode close homologs of CDIF630erm_01132 in proximity to serA lack serB, which again suggests that CDIF630erm_01132 could be a replacement for SerB.
The original study suggested that CDIF630erm_01132 “shows weak homologies to hydrolases” (Hofmann et al. 2018), but we did not identify any sequence similarity to characterized proteins. To identify structural homologs, we used the predicted structure (AlphaFold version 2.0 for UniProt A0A6B4WGC7; (Varadi et al. 2022)) as a query in Foldseek (van Kempen et al. 2024). The top hit from Foldseek was the predicted structure of serine kinase SbnI (UniProt Q2G1M5, root mean square deviation 7.2 Å, e-value 2.7 · 10⁻⁷). Although this homology is remote, it again suggests a link between CDIF630erm_01132 and serine metabolism. However, the structural alignment suggested that the ATP binding and catalytic residues are conserved. Also, when we docked CDIF630erm_01132 with ATP using AlphaFold 3 (Abramson et al. 2024), it made a confident prediction of a binding site (interface predicted template modeling score ipTM = 0.91). Thus, even though the comparative genomic evidence suggests that CDIF630erm_01132 can replace serB, it appears to be a kinase rather than a phosphatase. One possibility we considered is that CDIF630erm_01132 might be a serine kinase operating in reverse, but this reaction would be thermodynamically unfavorable (estimated equilibrium constant of 2 · 10⁻³, (Flamholz et al. 2012)). In the updated GapMind, CDIF630erm_01132 is included as an alternative form of serB, but experimental studies will be needed to clarify its role.
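To put the equilibrium constant quoted above on a more familiar scale, it can be converted to a free energy change with ΔG = −RT ln K. The sketch below is only a worked conversion: the temperature of 298.15 K is an assumption, and the result inherits whatever conditions (pH, ionic strength) were used for the original estimate.

```python
import math

R_KJ = 8.314462618e-3   # gas constant in kJ / (mol K)
T_KELVIN = 298.15       # assumed temperature; not stated in the text
K_EQ = 2e-3             # equilibrium constant quoted in the text

# Free energy change implied by the equilibrium constant.
delta_g_kj = -R_KJ * T_KELVIN * math.log(K_EQ)
print(f"delta G is roughly {delta_g_kj:.1f} kJ/mol")
```

Under these assumptions the value is positive, roughly 15 kJ/mol, which is the sense in which the reverse kinase reaction is unfavorable.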
An alternative N-acetyl-L-glutamate synthase
Steroidobacter denitrificans DSM 18526 was reported to grow in defined minimal medium ((Fahrbach et al. 2008); DSMZ medium 1116), but its pathway for synthesizing arginine appears to be incomplete. In particular, from its genome, we could not identify any proteins with similarity to characterized N-acetyl-L-glutamate synthases (ArgA). However, its genome does encode a cluster of genes for arginine synthesis, which includes a hypothetical protein (ACG33_RS14135; UniProt A0A127FCT3). Homologs of this hypothetical protein are virtually always found in arginine synthesis operons (Figure 4), which suggests a role in arginine synthesis. S. denitrificans is also missing the expected N-acetylcitrulline hydrolase (argE), but some homologs of ACG33_RS14135 are found near plausible candidates for argE (purple genes in Figure 4). So we propose that ACG33_RS14135 is the missing ArgA.
If ACG33_RS14135 is replacing the usual ArgA, then homologs of both proteins should not be found in the same genome. We compared the distribution of homologs of ACG33_RS14135 to that of homologs of ArgA from Pseudomonas aeruginosa (AT700_RS27105), using fast.genomics (Price and Arkin 2024a). Reasonably-scoring homologs (above 25% of the maximum bit score) were never found together in the same genome.
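The co-occurrence test described above can be sketched as follows. This does not call fast.genomics; the function name, the genome identifiers, and all bit scores are made up for illustration, and the "maximum bit score" is taken to be the query's score against itself, which is an assumption.

```python
from typing import Dict, Set

def genomes_with_homolog(best_bits: Dict[str, float], self_bits: float,
                         min_ratio: float) -> Set[str]:
    """Genomes whose best hit exceeds min_ratio of the query's self-score
    (taken here as the maximum possible bit score)."""
    return {g for g, bits in best_bits.items() if bits / self_bits > min_ratio}

# Made-up best bit scores per genome for two query proteins.
hits_a = {"genome1": 310.0, "genome2": 40.0, "genome3": 0.0}
hits_b = {"genome1": 55.0, "genome2": 420.0, "genome3": 30.0}

with_a = genomes_with_homolog(hits_a, self_bits=600.0, min_ratio=0.25)
with_b = genomes_with_homolog(hits_b, self_bits=800.0, min_ratio=0.25)

# An empty intersection means the two families never co-occur at this cutoff.
both = with_a & with_b
```

An empty `both` set in this toy data corresponds to the mutually exclusive distribution expected if one protein replaces the other.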
Using Foldseek (van Kempen et al. 2024), we found weak similarity between the predicted structure of ACG33_RS14135 (UniProt A0A127FCT3) and the experimentally-determined structure of a D-glutamate acetyltransferase ((Yu et al. 2023); PDB:7XRJ). Although this homology is remote (root mean square deviation = 5.6 Å, e-value = 1.4 · 10⁻⁷), it supports our proposal. ACG33_RS14135 is included in the revised GapMind as a predicted ArgA.
An alternative N-succinyl-L,L-diaminopimelate desuccinylase
In most bacteria, the N-succinyl-L,L-diaminopimelate desuccinylase DapE is required for the synthesis of diaminopimelate, which is a precursor to both lysine and peptidoglycan. DapE is missing in several prototrophic members of the phylum Bacteroidota which encode the tetrahydrodipicolinate succinylase DapD and hence are expected to use this pathway (Echinicola vietnamensis DSM 17526, Mucilaginibacter yixingensis YX-36, and Pedobacter sp. GW460-11-11-14-LB5). Because peptidoglycan synthesis is essential for growth, the gene that replaces dapE is expected to be essential. Using RB-TnSeq data for all three of these bacteria ((Price et al. 2018b); see Methods), we searched for conserved essential genes that did not have known functions, and we identified the putative amidohydrolase Echvi_1427 as a candidate to replace DapE. The homolog from another genus of Bacteroidota, Pontibacter actiniarum KMM 6156, is also essential (Price et al. 2018b).
Echvi_1427 is similar to several characterized enzymes that cleave amide bonds, including 42%-43% identity to dipeptidases (Ishikawa et al. 2001; Jamdar et al. 2015) and more distant homology to N-acetyl-L,L-diaminopimelate deacetylase. Echvi_1427 has conserved active site residues, and docking N-succinyl-L,L-diaminopimelate to its predicted structure with AutoDock Vina (Trott and Olson 2010) gave a plausible binding site, with the amide bond of the substrate near the catalytic arginine (R264), and a predicted binding energy of −6.5 kcal/mol. Echvi_1427 is included in the revised GapMind as a predicted DapE.
Short regulatory subunits of acetolactate synthase
Acetolactate synthase / acetohydroxybutanoate synthase (AHAS) is a bifunctional enzyme that is involved in the biosynthesis of both valine and isoleucine. AHAS has a catalytic subunit and a regulatory subunit; the regulatory subunit is not strictly required for activity, but the catalytic subunit has far less activity on its own (Vyazmensky et al. 2009). The regulatory subunit usually has two domains: an ACT domain and an ACT-like domain, but some strains of E. coli encode an isozyme of AHAS whose regulatory subunit (IlvM) has a single ACT domain. The single-domain IlvM can activate all three isozymes of AHAS from E. coli in vitro (Vyazmensky et al. 2009).
We noticed that some genera of bacteria that grow in minimal media lack the two-domain form of the regulatory subunit. Instead, we found proteins that were encoded near the catalytic subunit but have only a single ACT domain, and are distantly related to E. coli IlvM (27-30% identity). To test the roles of these proteins, we used RB-TnSeq data for Rhodanobacter denitrificans FW104-10B01, Lysobacter soli OAE881, and Xanthomonas campestris pathovar campestris str. 8004 ((Luneau et al. 2022); see Methods). We found that in all three cases, mutants of the short ilvM-like gene (LRK54_RS10305, ABIE51_RS17555, or Xcc-8004.1058.1) had phenotypes similar to those of mutants in the adjacently-encoded catalytic subunit (Figure 5). This confirms that these are regulatory subunits, and they are included in the revised GapMind.
We also noticed that the regulatory subunit appeared to be missing from many Thermoproteota, including from Pyrolobus fumarii, which is prototrophic for amino acids (Blöchl et al. 1997). In many Thermoproteota, we found an ACT domain protein conserved near the catalytic subunit. For instance, KCR_RS03285 from Korarchaeum cryptofilum has an ACT domain and is encoded adjacent to the catalytic subunit (KCR_RS03290). When we used the AlphaFold 3 server (Abramson et al. 2024) to predict a protein complex between the two putative subunits from K. cryptofilum, it reported a high-confidence interaction (ipTM, the predicted template modeling score for the interface, was 0.83). Furthermore, the predicted interaction is similar to the experimentally determined interaction of the AHAS subunits from yeast (Figure 6). So, in the revised GapMind, KCR_RS03285 is included as a predicted regulatory subunit of AHAS.
Previously-proposed alternate enzymes
The revised GapMind also includes a number of previously-proposed alternate enzymes (Table 1). Three of these have genetic evidence for their role, while the others are computational predictions.
First, Methanocaldococcus jannaschii synthesizes proline from ornithine via ornithine cyclodeaminase, but the protein responsible had not been identified (Graupner and White 2001). A recent study identified a novel family of ornithine cyclodeaminases in Anabaena and in Methanococcus maripaludis (Burnat et al. 2019). This family is present in M. jannaschii as well. Furthermore, the enzyme from M. maripaludis (UniProt Q6LXX7) is important for fitness unless proline is provided (data of (Day et al. 2024)). The revised GapMind includes these ornithine cyclodeaminases.
Second, Ashniev and colleagues associated a family of putative transaminases with serine synthesis, via genome context, and named this family SerC2 (Ashniev et al. 2022). A mutant of SAUSA300_1669, which is a member of this family, is a serine auxotroph (Verstraete et al. 2018). This confirms that SerC2 is a family of phosphoserine transaminases.
Third, we previously discussed a putative alternative to homoserine kinase (BT2402) that appears to be required for threonine synthesis in Bacteroides thetaiotaomicron (Price et al. 2020). Close homologs of BT2402 (over 80% identical) appear to be involved in threonine synthesis in two representatives of a related genus, Phocaeicola. Specifically, both homologs have fitness patterns similar to that of threonine synthase, with Pearson correlation r = 0.88 in P. dorei (Surya Tripathi, personal communication) and r = 0.79 in P. vulgatus (data of (Huang et al. 2024)). This confirms that BT2402 and related proteins (also known as TIGR02535 (Haft et al. 2013) or ThrB2 (Ashniev et al. 2022)) are alternatives to homoserine kinase. BT2402 is related to phosphoglycerate mutases and probably uses a different phosphate donor than ATP, perhaps phosphoenolpyruvate or phosphoglycerate. Because BT2402 probably performs a different reaction than homoserine kinase (ThrB), it is included in GapMind as a different step. We chose the name HomK, as it is an alternative to the usual homoserine kinase ThrB.
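For readers unfamiliar with how two fitness patterns are compared, the Pearson correlation between two genes' fitness values across the same set of experiments can be computed as in the sketch below. The function name and the fitness values are invented for illustration; they are not data from the studies cited above.

```python
import math
from typing import Sequence

def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length fitness profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a profile with no variation has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Invented log2 fitness values for two genes across five experiments.
gene_a = [-3.1, -2.8, 0.1, -0.2, -3.4]
gene_b = [-2.9, -3.0, 0.0, 0.3, -2.7]
r = pearson_r(gene_a, gene_b)
```

Two genes that are required in the same conditions, such as two steps of one biosynthetic pathway, are expected to give values of r close to 1.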
Fourth, Ashniev and colleagues associated another family of putative transaminases with serine synthesis, which they named SerC3 (Ashniev et al. 2022). This family includes the transaminase in the serine synthesis operon from C. difficile discussed above (Figure 3).
Fifth, Ashniev and colleagues associated a family of putative epimerases with lysine synthesis, which they named DapF2 (Ashniev et al. 2022). This family includes the Alr2 protein from Staphylococcus aureus, which is often annotated as alanine racemase. But Alr2 is not involved in D-alanine synthesis in S. aureus (Panda et al. 2024), which is consistent with a role in lysine synthesis instead.
Sixth, DapF2 is also in a conserved operon with putative amidohydrolases, such as AA076_RS07060 (UniProt Q2FH40). These amidohydrolases have been proposed to be the missing N-acetyl-diaminopimelate deacetylase DapL (Jiang et al. 2015; Ashniev et al. 2022). AA076_RS07060 is 35% identical to N-acetyl-L-cysteine deacetylase SndA, which acts on a similar substrate, and has conserved catalytic residues.
Seventh, UniProt W3Y6L2 was originally proposed to be an alternate N-acetylglutamate synthase, based on its conserved proximity to arginine synthesis genes (“ArgA3” in (Ashniev et al. 2022)). However, W3Y6L2 is 42% identical to an N-acetyl-cysteine deacetylase (SndA from Bacillus subtilis), and the only analogous reaction in arginine synthesis is N-acetylornithine deacetylase (ArgE). Indeed, using fast.genomics, we found that representative genomes with closer homologs of W3Y6L2 (11 genera contain a homolog with a bit score ratio above 0.4) never contained likely ArgE proteins (homologs of E. coli ArgE with a bit score ratio above 0.2). So, we propose that W3Y6L2 is an alternative N-acetylornithine deacetylase.
Finally, we previously proposed several novel families of methionine synthases, including split MetE (cobalamin-independent synthase) and corrinoid-protein-dependent methionine synthase MesC (Price et al. 2021). These predictions are included in the revised GapMind.
Diverged enzymes
GapMind only considers a protein to be a high-confidence candidate for a step if it is at least 40% identical to a characterized enzyme. (For enzymes that have curated models in TIGRFam, high-confidence candidates can also be identified via the hidden Markov model, but 38% of steps in GapMind are not associated with any TIGRFam, and non-canonical enzymes are usually not described in TIGRFam.) To improve the coverage of GapMind, we searched for diverged enzymes that had experimental evidence but were missing from the curated databases that GapMind relied on. Using this approach, we added 13 enzymes (Table 2). Five of these were from large-scale complementation assays that we recently conducted (Biggs et al. 2024). HisN in Bifidobacterium breve was recently identified by using RB-TnSeq (Shiver et al. 2023). Two enzymes in Brevundimonas vesicularis_C GW460-12-10-14-LB2 were identified by conducting RB-TnSeq assays in minimal medium (see Methods): A4249_RS00005 is a diverged homoserine dehydrogenase and A4249_RS06835 is a diverged prephenate dehydratase. For the missing phosphoserine phosphatase (SerB) from Lysobacter soli OAE881 or Xanthomonas campestris 8004, we identified proteins that are similar to the biochemically characterized SerB from Synechocystis sp. PCC 6803 (Klemke et al. 2015). The genes from L. soli and X. campestris are in conserved operons with serB and both appear, based on RB-TnSeq data, to be essential ((Luneau et al. 2022); see Methods). These genes probably encode the missing SerB. Finally, TK0276 from Thermococcus kodakarensis is LysZ (LysW-glutamate kinase and LysW-2-aminoadipate 6-kinase) (Yoshida et al. 2016).
We also identified 16 diverged enzymes that lack experimental evidence but are supported by conserved gene context and are required to explain the biosynthesis of the amino acid in a prototrophic prokaryote. For these predicted enzymes, where possible, we also checked that catalytic residues were present, using SitesBLAST (Price and Arkin 2022). These predictions are listed in Table 3 (for functional residues see Supplementary Table S1).
The accuracy of GapMind for prototrophs
To test the coverage of the revised GapMind, we analyzed the predicted proteomes of 206 diverse bacteria and archaea that can grow in defined minimal medium (see Methods). We ensured that the growth data and the genome sequence were from the same strain. These 206 prototrophs cover 19 phyla and 160 genera from the Genome Taxonomy Database (Chaumeil et al. 2019). Across 3,690 pathway x organism combinations, 84% were fully complete and consisted only of high-confidence steps. Another 11% of pathways contained one or more medium-confidence steps, such as a comparative genomics prediction or a diverged enzyme. (Unless the step is described by a hidden Markov model, high-confidence candidates must be at least 40% identical to a characterized enzyme.) The remaining 5% of pathways had one or more gaps, with no high- or medium-confidence candidates for those steps.
Overall, we identified 204 gaps in these 206 genomes. However, for 41 of these gaps, a high- or medium-confidence candidate was identified when analyzing the six-frame translation of the genome. Most of these discrepancies were due to missing gene calls, but we also identified seven distinct frameshifts that led to 11 gaps. (Some steps occur in more than one pathway.) We previously confirmed that frameshifts in amino acid biosynthesis genes from three prototrophic bacteria were actually errors in their genome sequences (Price et al. 2018a; Price et al. 2020). That includes one of these seven frameshifts, but we expect that the other six are spurious as well. This left 163 genuine gaps, corresponding to 0.8% of all steps on the best paths. The genome with the most gaps was Desulfotalea psychrophila LSv54, which is a sulfate-reducing bacterium that grows with lactate as the sole carbon source (Rabus et al. 2004). The genome has apparent frameshifts in ilvD (which is part of three pathways), aroB, and L,L-diaminopimelate aminotransferase. The genome was sequenced at only 6.4x coverage (Rabus et al. 2004), which suggests that these frameshifts could be spurious. This genome also has four genuine gaps: phosphoribosyl-ATP pyrophosphatase (hisI), histidinol-phosphate phosphatase (hisN), phosphoserine phosphatase (serB), and prephenate dehydrogenase are all missing.
Across all 206 prototrophic genomes, if we consider each step once, even if it occurs in multiple pathways, then we have 157 genuine gaps, or an average of 0.8 per genome. In contrast, the original GapMind had 275 genuine gaps for these genomes (an average of 1.3). Just four steps account for 61% of the remaining gaps: histidinol-phosphate phosphatase hisN is missing in 41 genomes (20% of genomes), phosphoserine phosphatase serB is missing in 20 genomes (10%), phosphoribosyl-ATP pyrophosphatase hisI is missing in 18 genomes (9%), and homoserine kinase thrB is missing in 16 genomes (8%). All four of these enzymes catalyze the gain or loss of phosphate groups, which can be carried out by many different protein families. We speculate that other kinases or phosphatases have evolved to act on these substrates and sometimes replace the known families. In any case, the absence of these genes should not be used to predict auxotrophies.
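The summary statistics in this paragraph follow directly from the counts it reports, as the short calculation below shows. Only the numbers stated in the text are used; the variable names are ours.

```python
n_genomes = 206
gaps_revised = 157    # genuine gaps with the revised GapMind
gaps_original = 275   # genuine gaps with the original GapMind

# Genomes in which each of the four most frequently missing steps is a gap.
top_steps = {"hisN": 41, "serB": 20, "hisI": 18, "thrB": 16}

per_genome_revised = gaps_revised / n_genomes      # about 0.76
per_genome_original = gaps_original / n_genomes    # about 1.33
top_share = sum(top_steps.values()) / gaps_revised  # about 0.61

# Fraction of genomes lacking each step.
step_fraction = {step: n / n_genomes for step, n in top_steps.items()}

print(f"{per_genome_original:.1f} -> {per_genome_revised:.1f} gaps per genome")
print(f"top four steps: {top_share:.0%} of remaining gaps")
```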
As shown in Figure 7, most prototrophic genomes from the phylum Pseudomonadota do not have any genuine gaps. This is probably because Pseudomonadota is the best-studied phylum. (Of the 160 known prototrophic genera in our collection, 76 are Pseudomonadota.) In contrast, most of the other bacteria and archaea that we studied have at least one genuine gap in their amino acid biosynthesis pathways. Even for the Pseudomonadota, the true rate of gaps may be higher, as we used the same set of prototrophic genomes to help us fill gaps in biosynthetic pathways.
The accuracy of GapMind for auxotrophic bacteria
We also tested GapMind on prokaryotes with known auxotrophies. We identified 26 prokaryotes (all bacteria) with experimentally-confirmed requirements for one or more amino acids (see Methods). These gave a total of 106 genome x pathway combinations for which the pathway was reported not to be functional. GapMind found that 101 of these (95%) had at least one gap or low-confidence step.
We manually examined the five cases where the GapMind results might incorrectly imply that the strain was prototrophic for that amino acid. Three of these cases involve Streptococcus pneumoniae D39. As previously reported, S. pneumoniae appears to contain all of the genes necessary for isoleucine and valine synthesis, and it is not clear why these amino acids are required for growth (Kazmierczak et al. 2009). Also, S. pneumoniae is auxotrophic for glycine even though it encodes a likely serine hydroxymethyltransferase (glyA, SPD_RS04890), which in most bacteria converts serine to glycine. As discussed above, it appears that S. pneumoniae does not synthesize or take up serine, and its GlyA functions in reverse, to convert glycine to serine (Härtel et al. 2012). Next, Lactobacillus delbrueckii subsp. lactis CRL581 requires glycine for growth (Hébert et al. 2004), but it appears to encode glyXL-glyXS.
Finally, Fusobacterium varium ATCC 27725 requires methionine for growth (Resmer and White 2011). Although GapMind did not identify any low-confidence steps in methionine synthesis in F. varium, two steps were medium-confidence: homoserine dehydrogenase and cystathionine gamma-synthase (metB). The potential homoserine dehydrogenase (C4N18_RS12320) is only 39% identical to a characterized homoserine dehydrogenase, but the active-site residues are conserved and C4N18_RS12320 is encoded in an apparent operon with two other genes for homoserine biosynthesis (aspartate kinase and aspartate-semialdehyde dehydrogenase). So we expect that C4N18_RS12320 is truly a homoserine dehydrogenase. Furthermore, F. varium grows in the absence of threonine (Resmer and White 2011), which suggests that homoserine dehydrogenase (which is usually required for threonine synthesis) is present. For metB, GapMind identified two medium-confidence candidates, both of which have greater similarity to methionine gamma-lyases (73-77% identity) than to known MetB proteins (44-48%). Given the growth data for F. varium, these proteins probably lack MetB activity. Overall, if a bacterium was reported to require an amino acid for growth, GapMind identified at least one low-confidence step in that pathway 95% of the time.
Conclusions
The revised GapMind includes novel genes for the synthesis of arginine, glycine, lysine, serine, and branched-chain amino acids, as well as many diverged enzymes. Across prototrophic bacteria and archaea, 95% of pathways are complete, with no low-confidence steps, while for experimentally-identified auxotrophies, 95% of pathways had at least one low-confidence step. The majority of the remaining gaps in functioning pathways are due to just four steps: hisN, serB, hisI, and thrB.
Materials and Methods
Updating GapMind
Besides including the additional characterized enzymes and predictions described above, we also updated GapMind to use newer versions of the curated databases Swiss-Prot (UniProt Consortium 2023), MetaCyc (Caspi et al. 2020), BRENDA (Chang et al. 2021), and the Fitness Browser reannotations (Price et al. 2018b; Price and Arkin 2024b). These were updated in November 2023. (GapMind also uses CharProtDB, which, as far as we know, has not been updated (Madupu et al. 2012).)
GapMind for amino acid biosynthesis now includes 159 different biosynthetic steps. These are described by 2,261 characterized proteins from curated databases, 138 characterized proteins that we curated, 4,469 curated predictions from Swiss-Prot, and 30 proteins whose function we predicted. Candidates that were identified from their similarity to proteins with predicted functions -- either from the predictions described above, or from curated annotations in Swiss-Prot that lack experimental evidence -- are treated as one level of confidence lower than if they were based on similarity to a characterized protein. For instance, if a candidate is at least 40% identical to a protein with a predicted function, and the alignment covers at least 70% of the protein with a predicted function, then the candidate will be considered medium confidence. GapMind also uses hidden Markov models of protein families to describe some steps. These are based on 141 TIGRFams (Haft et al. 2013) and four PFams (Finn et al. 2014).
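The confidence rules described above can be summarized in a small sketch. This is not GapMind's implementation. The 40% identity and 70% coverage cutoffs and the "one level lower" rule for predicted functions come from the text, and the 30% identity and half-coverage filter is the one GapMind applies when ignoring weak homologs. Applying the 70% coverage cutoff to characterized proteins, and placing weaker hits that pass the filter one tier below, are assumptions of this example, as are all names. Hidden Markov model hits are not covered.

```python
from typing import Optional

LEVELS = ["low", "medium", "high"]

def classify_hit(identity: float, coverage: float,
                 characterized: bool) -> Optional[str]:
    """Assign a confidence level to one homology hit.

    identity: percent identity (0-100) to the reference protein.
    coverage: fraction (0-1) of the reference protein covered by the alignment.
    characterized: True if the reference has experimental evidence, False if
        its function is only predicted.
    """
    # Hits below these cutoffs are ignored entirely.
    if identity < 30.0 or coverage < 0.5:
        return None
    # Strong hits start at "high", weaker ones at "medium" (assumed tiers).
    level = 2 if (identity >= 40.0 and coverage >= 0.7) else 1
    # Similarity to a merely predicted function costs one level.
    if not characterized:
        level -= 1
    return LEVELS[level]

examples = [
    classify_hit(45.0, 0.9, characterized=True),
    classify_hit(45.0, 0.9, characterized=False),
    classify_hit(35.0, 0.9, characterized=False),
    classify_hit(25.0, 0.9, characterized=True),
]
```

The second example matches the case given in the text: a candidate that is at least 40% identical to a protein with a predicted function, with sufficient coverage, comes out as medium confidence.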
When run from the command line, GapMind now provides the option to use DIAMOND (Buchfink et al. 2015) instead of UBLAST (Edgar 2010) for pairwise searches. DIAMOND is not as fast as UBLAST, but is more sensitive. Because GapMind ignores homologs that are less than 30% identical or that cover less than half of the characterized protein, the choice of search tool makes little difference to the final results of GapMind.
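The filter on identity and coverage can be sketched as follows. This is a hedged illustration rather than GapMind's implementation: the `Hit` record and its field names are hypothetical, and coverage is assumed to be the aligned fraction of the characterized (subject) protein.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One pairwise alignment to a characterized protein (hypothetical fields)."""
    identity_pct: float   # percent identity of the alignment
    subject_start: int    # 1-based alignment start on the characterized protein
    subject_end: int      # 1-based alignment end on the characterized protein
    subject_length: int   # full length of the characterized protein

def subject_coverage(hit: Hit) -> float:
    """Fraction of the characterized protein covered by the alignment."""
    return (hit.subject_end - hit.subject_start + 1) / hit.subject_length

def keep_hit(hit: Hit, min_identity: float = 30.0, min_coverage: float = 0.5) -> bool:
    """Ignore hits below 30% identity or covering less than half of the subject."""
    return hit.identity_pct >= min_identity and subject_coverage(hit) >= min_coverage

if __name__ == "__main__":
    hits = [
        Hit(55.0, 1, 300, 310),    # close homolog: kept
        Hit(24.0, 10, 290, 310),   # distant homolog: dropped on identity
        Hit(42.0, 1, 120, 310),    # partial alignment: dropped on coverage
    ]
    print([keep_hit(h) for h in hits])
```

Distant hits that only a more sensitive search would report tend to fall below these cutoffs, which is consistent with the search tool having little effect on the final results.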
Essential proteins
Essential proteins for Echinicola vietnamensis KMM 6221 (DSM 17526), Pedobacter sp. GW460-11-11-14-LB5, and Pontibacter actiniarum KMM 6156 (DSM 19842) were reported previously (Price et al. 2018b). Essential proteins were defined as proteins with unusually low coverage by transposon mutants (Price et al. 2018b). Mutants of these genes might grow very slowly, rather than being truly essential. An RB-TnSeq library for Mucilaginibacter yixingensis YX-36 will be described elsewhere (M. Torres et al., in preparation); from this library, 447 essential proteins were identified, using the same approach. Similarly, based on an RB-TnSeq library for Lysobacter soli OAE881 (see below), we identified 253 essential proteins.
Sequence analysis
Characterized homologs were identified using PaperBLAST (Price and Arkin 2017). Functional residues were identified using SitesBLAST (Price and Arkin 2022). Conserved gene neighbors were identified using fast.genomics (Price and Arkin 2024a). RB-TnSeq data was examined using the Fitness Browser (Price and Arkin 2024b). Docking was conducted using AutoDock Vina via the DockingPie plugin to PyMOL (Trott and Olson 2010).
For finding homologs within GapMind, we used UBLAST (Edgar 2010) from 64-bit usearch 11.0.667 (available at https://github.com/rcedgar/usearch_old_binaries/) and HMMer 3.3.1 (http://hmmer.org/; Eddy 2011).
Prokaryotes that grow in defined media
We previously reported a list of 148 strains of prokaryotes that grow in minimal media and have complete genome sequences available (Price et al. 2020; derived from Dos Santos et al. 2012 and Oberhardt et al. 2015). One of those genomes (GCF_000195895.1, for Methanosarcina barkeri str. Fusaro) has been removed from GenBank, so we removed it from our list. We also reported a list of 35 bacteria that had RB-TnSeq data during growth in minimal media (Price et al. 2020). Recent studies reported RB-TnSeq data for three additional bacteria growing in minimal media: Magnetospirillum magneticum AMB-1 (McCausland et al. 2022), Nitratidesulfovibrio vulgaris Hildenborough (Trotter et al. 2023), and Xanthomonas campestris pv. campestris strain 8004 (Luneau et al. 2022). RB-TnSeq data for Mucilaginibacter yixingensis YX-36 (DSM 26809) growing in a minimal medium was provided by M. Torres (in preparation). For this study, we conducted RB-TnSeq assays for Brevundimonas vesicularis_C GW460-12-10-14-LB2, Lysobacter soli OAE881 (DSM 113522), and Rhodanobacter denitrificans FW104-10B01 in minimal medium (see below).
We identified additional prototrophic prokaryotes from the literature: Thermoanaerobacter kivui LKT-1 (Basen et al. 2018), Thermus aquaticus YT-1 (Brock and Freeze 1969), Haloferax volcanii DS2 (Trieselmann and Charlebois 1992), and Nitrosopumilus maritimus SCM1 (Qin et al. 2017). Also, a recent study identified several isolates from the human gut that are prototrophic (Soto-Martin et al. 2020); we included Anaerobutyricum hallii DSM 3353, Roseburia faecis M72, and Clostridium tyrobutyricum FAM22553. A caveat for these three strains is that cysteine was included in the growth medium as a reductant; however, cysteine synthesis appears to be complete in all three strains, so we expect that they are truly prototrophic. Similarly, cysteine was present in the growth medium for T. kivui, but since it is autotrophic (Basen et al. 2018), it is presumably able to make cysteine.
Finally, to increase diversity of prokaryotes in our list, we used the IJSEM database (Barberán et al. 2017) to identify representatives of taxonomic orders that were not included in the list, but had been reported to utilize specific carbon sources. We then checked the species descriptions to see if they used a truly defined medium with no added amino acids. By this approach, we identified ten prototrophic type strains (Algiphilus aromaticivorans DG1253, Aquimarina longa SW024, Carboxydothermus pertinax Ug1, Halococcus hamelinensis 100A6, Hippea alviniae EP5-r, Methanospirillum lacunae Ki8-1, Nocardioides dokdonensis FR1436, Phaeacidiphilus oryzae TH49, Thiohalospira halophila DSM 15071 HL 3, and Tistlia consotensis USBA 355). Overall, we expanded our list of prototrophic prokaryotes with complete genomes from 183 to 206.
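As a check on the bookkeeping, the counts given in this section can be tallied in a short script. The labels are ours, and the tally assumes that the two previously reported lists do not overlap, which is consistent with the stated starting total of 183.

```python
# Tally of the prototroph list, using only counts stated in this section.
previous_lists = {
    "grow in minimal media, complete genomes (Price et al. 2020)": 148,
    "RB-TnSeq data in minimal media (Price et al. 2020)": 35,
}

changes = [
    ("M. barkeri genome removed from GenBank", -1),
    ("RB-TnSeq data from recent studies", +3),
    ("M. yixingensis RB-TnSeq data", +1),
    ("RB-TnSeq assays from this study", +3),
    ("prototrophs from the literature", +4),
    ("human gut isolates", +3),
    ("type strains found via the IJSEM database", +10),
]

previous_total = sum(previous_lists.values())
updated_total = previous_total + sum(delta for _, delta in changes)

if __name__ == "__main__":
    print(previous_total, updated_total)  # 183 206
```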
Experimentally-determined amino acid auxotrophies
Most of the experimentally-determined auxotrophies were taken from previous reports (Ashniev et al. 2022; Ramoneda et al. 2023). We removed some species because the original reference did not identify which exact strain was auxotrophic, and we wanted to ensure that the genome sequence analyzed was from an auxotrophic strain. We removed some strains because the precise amino acids that were required for growth were not determined. We also identified several strains that were proposed to be auxotrophic for cysteine, but the proposal was based on experiments with sulfate as the sole source of sulfur, and the pathway for sulfate assimilation is absent from the genome (e.g., Hébert et al. 2004; Ferrario et al. 2015). In these cases, reduced sulfur compounds such as sulfide may be the natural sulfur source, and if a suitable sulfur source was provided, cysteine synthesis may well occur. Finally, we removed Klebsiella pneumoniae KP11 because growth data was not reported, and we removed Yersinia ruckeri YRB because there was significant residual growth after individual amino acids were removed from the media (Seif et al. 2020).
We added some additional auxotrophies from the literature: Cysteiniphilum litorale DSM 101832 is auxotrophic for cysteine (Liu et al. 2017); Bacillus subtilis 168 is auxotrophic for tryptophan (Zeigler et al. 2008); Myxococcus xanthus DK101 is auxotrophic for all three branched-chain amino acids (Bretscher and Kaiser 1978); Kytococcus sedentarius DSM 20547 is auxotrophic for methionine (Stackebrandt et al. 1995); Legionella pneumophila Philadelphia 3 is auxotrophic for arginine, methionine, serine, threonine, and valine (Ristroph et al. 1981); and Streptococcus pneumoniae D39 is auxotrophic for arginine, glycine, histidine, isoleucine, leucine, and valine (Kazmierczak et al. 2009).
In total, we had 26 auxotrophic bacteria, with 106 auxotrophies for amino acids. Four of these bacteria were auxotrophic for all three aromatic amino acids. In principle, these bacteria could be auxotrophic for chorismate (the common precursor for the aromatic amino acids) while being capable of converting chorismate to all three aromatic amino acids. This would complicate the comparison of the growth requirements to GapMind’s pathways, because GapMind describes chorismate synthesis separately from the biosynthetic pathways for phenylalanine, tyrosine, and tryptophan. In reality, all four of the bacteria have gaps in all three pathways downstream of chorismate. Two of the four have gaps in chorismate synthesis as well (Lactobacillus delbrueckii subsp. lactis CRL581 and Lactobacillus paracasei LC2W).
RB-TnSeq libraries
The mutant library for B. vesicularis_C was described previously (Liu et al. 2018). The mutant library for R. denitrificans will be described elsewhere (T. Owens et al., in preparation).
To construct a mutant library in Lysobacter soli OAE881 (Coker et al. 2022), we used conjugation from E. coli WM3064 harboring the pHLL250 mariner transposon vector library (strain AMD290), which was previously built via Golden Gate assembly (Liu et al. 2018). Specifically, we grew 10 mL of wild-type OAE881 in LB overnight at 30°C. The next morning, we recovered a 2 mL freezer stock of strain AMD290 in 50 mL LB supplemented with 50 µg/mL carbenicillin (Cb) and 300 µM diaminopimelic acid (DAP) at 37°C. When the OD600 of the E. coli donor strain reached 1, we harvested 20 OD600 units of the culture and washed the cells three times with fresh LB supplemented with DAP. Then, 20 OD600 units of wild-type OAE881 cells were harvested, mixed with the washed donor cells, centrifuged, and resuspended in a final volume of 0.5 mL with LB supplemented with DAP. The resuspension was spotted onto 0.45-µm membrane filters (Millipore, United States) and incubated overnight on LB agar plates supplemented with DAP at 30°C. After 16 h, the conjugation mixture was scraped from the membrane, resuspended in 10 mL LB with 50 µg/mL kanamycin (Km), and plated at different dilutions on LB plates supplemented with 50 µg/mL Km.
Plates were incubated at 30°C for 48 h to let visible colonies develop. We then pooled ∼400,000 colonies and grew the library in liquid LB supplemented with 50 µg/mL Km for 2 population doublings. We then added glycerol to a final concentration of 15%, made multiple 1 mL −80°C freezer stocks (∼10⁸ cells/mL) of the final library for subsequent experiments, and collected cell pellets to extract genomic DNA for TnSeq mapping. To map the genomic locations of the transposon insertions and link these insertions to their associated DNA barcodes, we used a variation of the previously described TnSeq protocol (Wetmore et al. 2015), in which we used a splinkerette adapter instead of a Y adapter and two rounds of PCR to selectively enrich for transposon junctions (Rubin et al. 2022). We mapped 118,245 barcodes to insertions in the genome, and these barcodes covered 97% of the mappable TnSeq reads. Insertions were neutral with respect to orientation within the gene (50.0% of insertions within genes are on the coding strand), and the variation in reads per gene was moderate (mean/median reads per gene was 1.70).
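The two library-quality statistics reported above (strand bias and the ratio of mean to median reads per gene) can be computed as in the following sketch. The input layout and function names are hypothetical and the toy numbers are invented for illustration; we also assume that an insertion "on the coding strand" means that the transposon has the same orientation as the gene.

```python
from statistics import mean, median

def coding_strand_fraction(insertions):
    """insertions: (gene_strand, insertion_strand) pairs for insertions within genes.

    Assumption: "on the coding strand" means the two strands match.
    """
    insertions = list(insertions)
    same_strand = sum(1 for gene_strand, ins_strand in insertions
                      if gene_strand == ins_strand)
    return same_strand / len(insertions)

def mean_median_ratio(reads_per_gene):
    """reads_per_gene: dict of gene id -> total mapped reads. Values near 1 indicate even coverage."""
    counts = list(reads_per_gene.values())
    return mean(counts) / median(counts)

if __name__ == "__main__":
    # Toy data, not from the L. soli library
    toy_insertions = [("+", "+"), ("+", "-"), ("-", "-"), ("-", "+")]
    toy_reads = {"geneA": 10, "geneB": 20, "geneC": 60}
    print(coding_strand_fraction(toy_insertions))  # 0.5
    print(mean_median_ratio(toy_reads))            # 1.5
```

A strand fraction near 0.5 suggests no orientation bias, and a mean/median ratio not far above 1 suggests that a few genes do not dominate the reads.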
RB-TnSeq assays
We performed aerobic fitness assays for Brevundimonas vesicularis_C GW460-12-10-14-LB2, Lysobacter soli OAE881, and Rhodanobacter denitrificans FW104-10B01, using the approach described previously (Wetmore et al. 2015). Briefly, the mutant library was recovered from the freezer, inoculated at OD600 = 0.02 into the condition of interest, and grown until saturation; genomic DNA was then extracted, and barcodes were amplified by PCR and sequenced using Illumina. Fitness values (log2 ratios) were calculated as described previously (Wetmore et al. 2015).
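To make "log2 ratios" concrete, here is a deliberately simplified sketch of a fitness calculation from barcode counts. It is not the procedure of Wetmore et al. (2015), which uses weighted averages and a more careful normalization; the function names, the pseudocount of 1, the unweighted average over strains, and the median centering are all assumptions made for illustration.

```python
import math
from statistics import median

def strain_fitness(count_start, count_end, pseudocount=1.0):
    """log2 ratio of barcode counts after growth versus at the start of the experiment."""
    return math.log2((count_end + pseudocount) / (count_start + pseudocount))

def gene_fitness(counts_by_gene, pseudocount=1.0):
    """counts_by_gene: gene id -> list of (count_start, count_end), one pair per mutant strain.

    Averages strain values per gene (unweighted, an assumption), then subtracts
    the median across genes so that a typical gene has a fitness of zero.
    """
    raw = {}
    for gene, strains in counts_by_gene.items():
        values = [strain_fitness(s, e, pseudocount) for s, e in strains]
        raw[gene] = sum(values) / len(values)
    center = median(raw.values())
    return {gene: value - center for gene, value in raw.items()}

if __name__ == "__main__":
    # Toy counts, invented for illustration
    toy = {
        "geneA": [(99, 99), (49, 49)],  # unchanged abundance
        "geneB": [(99, 24)],            # four-fold depletion
        "geneC": [(99, 99)],            # unchanged abundance
    }
    print(gene_fitness(toy))
```

In this toy example, mutants in geneB drop four-fold in relative abundance, giving a fitness of -2, while the other genes have a fitness of 0.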
For B. vesicularis_C, we performed experiments in a defined minimal medium at 20°C or 30°C with 20 mM glucose or 0.5% v/v Tween-20 as the carbon source. The basal medium (RCH2_defined_noCarbon) included 0.25 g/L ammonium chloride, 0.1 g/L potassium chloride, 0.6 g/L sodium phosphate monobasic monohydrate, 30 mM PIPES sesquisodium salt, Wolfe’s mineral mix (ATCC), and Wolfe’s vitamins (ATCC).
For L. soli, we performed fitness assays with 17 defined carbon sources (each at 20 mM) as well as casamino acids at 30°C and with 200 rpm shaking, again with RCH2_defined_noCarbon as the basal medium.
For R. denitrificans, we performed a variety of stress experiments in R2A medium, which will be described elsewhere (T. Owens et al., in preparation). For this study, we performed 30 fitness experiments in defined media. Two of these experiments were conducted in a minimal defined medium (RCH2_defined_noCarbon) with 20 mM glucose, at either 20°C or 30°C. For the remainder of the experiments, we used 13 different carbon sources, and we added 100 µM of each amino acid to the medium.
Availability of data and code
The code for GapMind, including the rule definitions, is available as part of the PaperBLAST code base (https://github.com/morgannprice/PaperBLAST). The code, the rules, and the compiled rule definitions are archived at figshare (http://doi.org/10.6084/m9.figshare.27229272). The figshare archive also includes a table of the 206 prototrophic bacteria and archaea, their remaining gaps, a table of the 28 bacteria with known auxotrophies, and the GapMind results for all of these genomes. The RB-TnSeq data is available from the Fitness Browser (http://fit.genomics.lbl.gov).
Acknowledgements
We thank Surya Tripathi for pre-publication access to RB-TnSeq data for Phocaeicola dorei CL03T12C01. We thank Hans Carlson for assisting with RB-TnSeq assays for Rhodanobacter denitrificans.