ABSTRACT
Pathogens pose a major risk to human health globally, causing 44% of deaths in low-resource countries. Currently, there are over 500 known bacterial pathobionts, covering a wide range of functional capabilities. Some well-known pathobionts are well characterized computationally and experimentally. However, to gain a deeper understanding of how pathobionts are evolutionarily related and of the principles that govern their different functions, and ultimately to identify possible targeted antimicrobials, we must consider both well-known and lesser-known pathobionts. Here, we developed a database of genome-scale metabolic network reconstructions (GENREs) called PATHGENN, which contains 914 models of pathobiont metabolism, to address these questions related to functional metabolic evolution and adaptation. We determined the metabolic phenotypes across all known pathobionts and the role of isolate environment in functional metabolic adaptation. We also predicted novel antimicrobial targets for bacteria specific to their physiological niche. Understanding the functional metabolic similarities between pathobionts is the first step to ultimately developing a precision medicine framework for addressing all infections.
INTRODUCTION
Bacterial pathogens pose a major risk to human health. Globally, pathogens are responsible for 16% of all deaths, and responsible for 44% of deaths in low-resource countries1. Financially, global economic losses from pathogenic disease outbreaks amount to tens of billions of dollars in the past 10 years2. In recent years, there has been an increase in infectious disease emergence attributed to urbanization, globalization, climate change, population growth, and human/animal interaction 3. Currently, there are over 500 known human bacterial pathobionts4. Pathobionts are microorganisms that have the capacity to be pathogenic5 and range across many taxonomic classes and genera. Therefore, there exists a wide range in metabolic function, phylogeny, and infection niches (e.g., stomach, wound, lung) across pathobionts.
Due to their imminent danger to human health, some pathobiont species have been well characterized experimentally and computationally 6–8. However, to gain a deeper understanding of how pathobionts are evolutionarily related and of the principles that govern their differential functions, and ultimately to identify novel targeted antimicrobial therapies, we need to consider both well-characterized and poorly characterized pathobionts. We can leverage ‘omics approaches to understand the relationship between pathobionts and their physiological environment and to shed light on functional metabolic differences between species. A better characterization of the governing principles of pathobiont function could enable the development of new approaches to target pathobionts through novel therapies or drug repurposing. Additionally, using antimicrobial therapies to target environment-specific essential genes rather than organism-specific essential genes could reduce the harmful effects of broad-spectrum antimicrobials9.
Genome-scale metabolic network reconstructions (GENREs) can be used to elucidate the functional metabolic mechanisms of individual pathobionts6,10. Once assembled, GENREs can probe an organism’s genotype-phenotype relationship through constraint-based modeling and analysis (COBRA)11. Computational modeling through GENREs has proven effective at defining functional metabolism in individual priority pathogens, allowing for interpretation of mechanisms of infection and antibiotic resistance10.
Here, we determined the evolutionary relatedness of metabolic phenotypes across pathobionts and the role of isolate environment in functional metabolic adaptation. We characterized the correlation of functional metabolism with the physiological niche of a pathobiont. We also predicted novel antimicrobial targets for pathobionts specific to a given physiological niche. To address the above questions, we generated the first database of GENREs of all known bacterial pathobionts (referred to as PATHGENN) with a current total of 914 in silico models of pathobiont metabolism, which can serve as a key resource for the community.
RESULTS
The PATHGENN Database
We created PATHGENN, a database of GENREs for all known human bacterial pathobionts, through an automated pipeline (Figure S1). PATHGENN utilizes publicly available genome sequences from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC)12 paired with open-source software including Python and COBRApy 11, and a recently developed GENRE reconstruction algorithm13. The PATHGENN database is the first to contain GENREs of all known human bacterial pathobionts and is among the largest publicly available databases of GENREs 14,15. PATHGENN consists of 914 GENREs, covering 345 species, 94 genera, 36 orders, 17 classes, and 9 phyla (Figure 1a, c) of pathobionts. PATHGENN GENREs account for the function of a total of 1.27 million reactions (6,304 unique reactions), 1.22 million genes, and 1.20 million metabolites. Each GENRE contains an average of 1,355 reactions (standard deviation: 344), 1,310 genes (standard deviation: 593), and 1,394 metabolites (standard deviation: 331) (Figure 1b). The relationship between the number of genes and reactions in the reconstructions is logarithmic, which is consistent with the expectation that there are limited evolutionary advantages for bacteria with increasingly large genomes16 (Figure 1d).
KEGG reaction annotations were utilized and reactions across all PATHGENN GENREs were separated into core (present in > 75% of GENREs), accessory (between 25% and 75%), and unique (present in < 25%) metabolism. There are 2,515 annotated unique reactions, 1,044 annotated accessory reactions, and 752 annotated core reactions (Figure 2a). The large number of unique reactions can be attributed to the size of the PATHGENN database and the taxonomic range PATHGENN GENREs represent. Furthermore, we determined notable differences in the unique and core metabolic subsystems through KEGG reaction subsystem annotation. More unique reactions were involved in xenobiotic metabolism (7% more), terpenoid/polyketide metabolism (11% more), and carbohydrate metabolism (4% more). Additionally, more core reactions were involved in nucleotide metabolism (7% more), and cofactor/vitamin metabolism (2% more) (Figure 2b).
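The core/accessory/unique partition described above can be illustrated with a minimal sketch. It assumes each GENRE is reduced to a set of reaction identifiers; the function name `classify_reactions`, the toy model IDs, and the reaction labels are hypothetical and are not taken from the PATHGENN code base.

```python
from collections import Counter


def classify_reactions(models, core_cutoff=0.75, unique_cutoff=0.25):
    """Label each reaction as core, accessory, or unique by its prevalence.

    `models` maps a model ID to the set of reaction IDs in that model.
    Thresholds follow the text: core > 75%, unique < 25%, accessory otherwise.
    """
    n_models = len(models)
    # Count in how many models each reaction appears.
    counts = Counter(rxn for rxns in models.values() for rxn in set(rxns))
    labels = {}
    for rxn, count in counts.items():
        prevalence = count / n_models
        if prevalence > core_cutoff:
            labels[rxn] = "core"
        elif prevalence < unique_cutoff:
            labels[rxn] = "unique"
        else:
            labels[rxn] = "accessory"
    return labels


# Toy data: five hypothetical models with placeholder reaction IDs.
toy_models = {
    "m1": {"rxnA", "rxnB", "rxnC"},
    "m2": {"rxnA", "rxnB"},
    "m3": {"rxnA", "rxnB"},
    "m4": {"rxnA", "rxnD"},
    "m5": {"rxnA"},
}
labels = classify_reactions(toy_models)
```

In this toy case `rxnA` appears in every model and is labeled core, `rxnB` appears in three of five models and is labeled accessory, and `rxnC` and `rxnD` each appear once and are labeled unique.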
Metabolic Phenotype Evolution
To understand the evolutionary relationship between pathobionts and their essential genes and network structure (two important attributes of functional metabolism), we predicted essential genes, calculated the genetic distance between all pairs of pathobionts, and delineated differences in the reactions present in each organism. For each strain, essential gene profiles were determined using an FBA single-gene-knockout method in COBRApy. Given that gene essentiality is a function of the organism’s physiological environment, all exchange reactions were open for this analysis, which results in the minimum number of essential genes for a given organism. Reaction presence profiles were created by probing the model in COBRApy (see Methods). These analyses produced binary profiles describing the presence of all essential genes and reactions in each model, which were subsequently used to calculate pairwise dissimilarity. The evolution of essential gene and reaction presence profiles is shown in Figures 3 and S2, respectively. Both relationships can be approximated with a three-parameter logarithmic growth function. Additionally, the logarithmic function reaches a saturation point (x | y = 1.0) for essential gene dissimilarity and reaction presence dissimilarity.
The saturation points observed in Figure 3 are indicative of conserved essential genes and reactions, respectively, across bacterial strains. That is, even at a genetic difference of 100%, a pair of pathobionts will be only 18% different with respect to their essential gene profiles, and 34% different with respect to their reaction presence profiles. A previous study17 determined a similar relationship between essential gene profiles and genetic distance across bacteria (not specifically pathobionts), but determined a saturation point of ∼53% essential gene difference. This discrepancy in essential gene saturation point could be attributed to possible inherent pathobiont similarities that are not shared across all genera of bacteria. With host infection as a shared functional process of pathobionts, this result could suggest a shared functional signature associated with infection, regardless of the specific niche, that is not shared with non-pathobiont bacteria.
Additionally, the logarithmic trends shown in Figure 3 suggest there is adaptive pressure for closely related pairs of organisms to evolve to occupy their own distinct metabolic niche. As pathobionts begin to occupy distinct metabolic niches, they continually adapt their metabolic capabilities to better take advantage of their new environments, suggesting metabolic composition of the environment as a major governing principle of the evolution of functional metabolism.
Essential Gene Metabolic Subsystem Analysis
We further explored the relationship between physiological environment and metabolic function by essential gene subsystem analysis. We pooled the essential genes for all isolates of a given environment, and determined the metabolic subsystem distribution through KEGG gene annotation. Figure 4 shows the metabolic subsystem distribution of essential genes in eight of the most represented isolate environments: throat, respiratory, lung, stool, ear, stomach, mouth, and blood. Subsystem representation differs significantly across physiological environments, as determined by an ANOVA test for each subsystem (p < 0.05 for all subsystems).
Some of the most notable differences in metabolic subsystem representation were amongst stomach isolates. There was an evident lack of nucleotide metabolism, energy metabolism, and glycan metabolism in the essential genes of stomach isolates. Additionally, there was a clear enrichment of amino acid and lipid metabolic subsystems compared to essential gene subsystems in other isolate environments. The clear differences in metabolic subsystem utilization by organisms in different environments provide strong evidence for differential metabolic functional adaptation according to environment.
Influence of Environment on Functional Metabolism
Previous studies have delineated a relationship between functional metabolism and taxonomic class 15,18,19. While it is clear that taxonomy is a driver for metabolic function, functional metabolism could also be attributed to physiological environment because an organism’s environment influences adaptation. To determine if there is a significant association between functional metabolism and physiological environment in addition to taxonomic class in pathobionts, we utilized flux balance analysis (FBA)20 for each strain (n = 10 samples per strain). t-SNE was used to reduce the dimensionality of the flux output across strains and for subsequent visualization (see Methods). We colored the t-SNE output by both taxonomic class (Figure 5a) and isolate environment (Figure 5b). Significant clustering was exhibited in Figure 5a and b (PERMANOVA: p < 0.01), suggesting functional metabolism is related to both taxonomic class and isolate environment.
Gammaproteobacteria is the class of bacteria with the largest number of models in PATHGENN (Figure 5a). However, Gammaproteobacteria isolates came from a variety of sources including stool, urine, lung, and blood among others (Figure 5b). Gammaproteobacteria is the most genera-rich taxon of prokaryotes, containing over 250 genera21. This diversity in bacterial genera within the Gammaproteobacteria suggests a broader range of functional capabilities than other taxa, providing reasoning for the diverse environments from which Gammaproteobacteria were isolated. Another notable cluster, Actinomycetia, contains isolates from lung, respiratory, sputum, and throat sites. Mycobacterium tuberculosis and Actinomyces species belong to this class and are known to infect the lungs and throat, respectively22,23. Clustering of M. tuberculosis and Actinomyces suggests organisms in similar environments across the respiratory tract exhibit similar functional capabilities.
A prominent cluster in Figure 5b is associated with bacteria isolated from the stomach. The stomach environment is highly acidic (pH 1.5 to 2.0)24, allowing for only a few key bacteria to take up residence, one of which is Helicobacter pylori. H. pylori has adapted to this extreme environment by utilizing differential metabolic pathways25. The evident separation of the stomach cluster from others and the uniqueness of the stomach environment25 suggest that these bacteria have highly distinctive functional metabolism. This result suggests genes essential to growth in stomach isolates are uniquely essential compared to pathobionts from other isolation sites. We can leverage these uniquely essential genes to identify novel antimicrobial targets that are specific to stomach pathobionts.
Identifying Site-Specific Antimicrobial Targets
To determine genes that are uniquely essential to stomach bacteria, essential genes were determined for all strains in PATHGENN using an FBA single-gene-knockout method in COBRApy (see Methods). A gene was considered essential in an isolate environment if >= 80% of strains from that environment require the gene to produce biomass. Two genes were identified as uniquely essential to stomach pathobionts (not essential in any other environment): fabF and tktA. fabF encodes the beta-ketoacyl-ACP synthase (KAS), implicated in the chain elongation step of fatty acid synthesis26, and tktA encodes transketolase (TK), the most critical enzyme in the non-oxidative pentose phosphate pathway27. While neither of these genes is currently a known antimicrobial target specific to stomach pathobionts, there already exist several antimicrobials that target these gene products. According to DrugBank28, fabF is a target of lauric acid. Lauric acid has been shown to have bactericidal effects against the stomach pathogen H. pylori and was cited to have a lower propensity to develop resistance compared to metronidazole or tetracycline29. Other drugs that target fabF and tktA are Cerulenin (fabF, currently used as an antifungal antibiotic), Platensimycin (fabF, currently in preclinical trials as a MRSA antibiotic), and Cocarboxylase (tktA, currently used to target tktA in E. coli), although there is no published literature regarding their use to treat stomach-specific infections. The ability to predict lauric acid as a possible stomach-targeted antimicrobial with indirect literature validation demonstrates the value of PATHGENN to enable clinical hypothesis generation.
Additionally, we visualized the pathway structure that the genes tktA and fabF are implicated in across three stomach isolates that were captured in the PATHGENN database: Helicobacter pylori, Arcobacter butzleri, and Campylobacter coli using fluxer30 and adapted the generated pathways in Figure 6. There are clear differences in pathway structure between the three different species of stomach isolates.
DISCUSSION
Here, we present a novel pipeline for generating GENREs of human bacterial pathobionts and apply it to create 914 GENREs representing all known bacterial pathobionts, a resource called PATHGENN. PATHGENN is among the largest databases of GENREs14,15, and the first specific to pathobionts. PATHGENN GENREs adhere to the community benchmarking standards (MEMOTE, see Methods) and utilize the ModelSEED namespace. These standards allow PATHGENN GENREs to be easily used in conjunction with existing models from other sources. All PATHGENN models are publicly available, and we encourage others to utilize the database to probe biological and clinically relevant questions not explored here. While the models in PATHGENN are not manually curated, they were all developed using the same pipeline with an automated gap-filling process, allowing the large number of GENREs in PATHGENN to be directly compared.
There are a total of 2,515 reactions that were unique to less than 25% of GENREs (unique reactions) in PATHGENN, while there were 752 reactions that were common in greater than or equal to 75% of GENREs (core reactions). There is an evident enrichment of nucleotide metabolic subsystems in core reactions (7% more). This result is consistent with the ubiquitous role of nucleotide metabolism across bacterial species31. Additionally, it has been shown that the nucleotide metabolism pathway plays a role in pathogenesis, further providing evidence that the GENREs in PATHGENN accurately capture and represent the biochemical processes in pathobionts32. Furthermore, there was an enrichment of xenobiotic metabolic subsystems in unique reactions (7% more). Bacterial species evolve to utilize differential xenobiotic pathways to best make use of ingested compounds through the utilization of different enzymes and hydrolytic/reduction reactions33. The evolution of unique xenobiotic metabolic reaction pathways allows bacteria to occupy their own metabolic niches and take advantage of their environment.
Understanding the evolution of metabolic phenotypes can provide important insight into fitness and adaptation of pathobionts. We used PATHGENN to better understand metabolic evolution in the context of adaptation through changes in functional metabolism over generational time. Results presented in Figure 3 (and Figure S2) suggest that there is adaptive pressure for closely related organisms to occupy their own distinct metabolic niche, which could occur through possible mechanisms of horizontal gene transfer, random mutation, or other methods. Closely related pathobionts experience pressure to adapt and quickly occupy a distinct metabolic niche to avoid competition and ensure the survival of the species. In more distantly related species, organisms have already adapted to occupy their own unique metabolic niches. It is evident that organisms continue to specialize after finding their niche, adapting further to gain fitness in their given environment. This observation suggests a two-phase evolutionary process: an initial diversification of both essential genes and reaction network due to adaptive pressure, followed by further diversification over generations. Additionally, by definition, pathobionts share host infection as a common function. Consequently, that shared activity could limit functional differences even if the genetic history of the pathobionts is quite distinct. This concept could explain the logarithmic nature of the relationship between essential gene/reaction similarity and genetic distance (Figure 3).
It is important to note that in Figure 3 there is one group of pathobiont pairs that are more genetically distant from each other. For every pair in this group, one bacterium in the pair is Mycolicibacterium fortuitum, an opportunistic pathogen of the Actinomycetia taxonomic class that is responsible for skin and bone infections34. In this group, the bacteria paired with Mycolicibacterium fortuitum are: seven different Bacillus species, two Vibrio species, two Acinetobacter species, two Burkholderia species, and one species each of Providencia, Enterobacter, and Stenotrophomonas. This result suggests that these species are genetically distant from Mycolicibacterium fortuitum, but have essential gene profiles more similar to that of Mycolicibacterium fortuitum than expected according to the log fit function. Additionally, there is a high density of pathobiont pairs with genetic distances between 0.2 and 0.3. This result suggests that the average genetic distance between pairs of pathobionts is between 0.2 and 0.3, which is consistent with what has been found in another study examining pairwise genetic distances (determined by 16S rRNA sequence alignment) across pairs of bacteria35.
The analysis of the evolution of metabolic phenotypes suggests that isolate environment could be a major evolutionary driver of metabolic function. This idea was further confirmed by metabolic subsystem annotation of essential genes via KEGG orthologs. There was a clear difference in metabolic subsystem representation of essential genes in different isolate environments (ANOVA with p < 0.05 for each subsystem). This difference in metabolic subsystem utilization could also suggest isolates from different isolate environments are functionally different, thereby occupying distinct metabolic niches.
Functional metabolic similarities have been tied to taxonomic class in many studies14,15,18,19, but the underlying importance of isolate environment and its role in driving adaptation is often underappreciated. We determined that functional metabolism is related to both taxonomic class and isolation source through FBA, dimensionality reduction and visualization (t-SNE), and subsequent PERMANOVA (p < 0.01). This result provides more support for the hypothesis that functional metabolism is related to metabolic niche, which has been suggested in previous work 15. Additionally, within taxonomic classes, there are distinct clusters of flux samples based on isolate environment. There are visibly distinct clusters of throat, respiratory, lung, ear, stomach, blood, and stool, which were also shown to have distinct metabolic subsystem utilization in the essential gene and metabolic subsystem analysis (Figure 4). The corroboration of results in these two different analyses provides further evidence that isolate environment is a strong factor in the evolution of metabolic phenotypes.
Additionally, within the class of Epsilonproteobacteria there are two distinct clusters: a stomach cluster and a stool cluster. This result further implies that closely related organisms develop distinct functional metabolic capabilities related to their specific environment to outcompete related organisms and ensure the survival of the distinct population or species. These results suggest that organisms occupying the same environment are functionally similar because of that shared environment, and not only because they are phylogenetically related. While phylogeny is undoubtedly related to metabolic phenotype, it is clear that environment is also a driving factor for the evolution of functional metabolic characteristics.
The most distinct cluster of metabolic flux samples is the stomach cluster, implying these isolates exhibit strong similarities in functional metabolism. Additionally, this suggests these isolates are functionally distinct from isolates of different environments. These functional metabolic differences could be driven by the extreme environment of the stomach, pressuring adaptation. Distinct metabolic phenotypes in the stomach environment were also shown in Figure 4, with a visible enrichment of amino acid and lipid metabolism subsystems and a lack of nucleotide, energy, and glycan metabolic subsystems in the essential genes of stomach isolates.
Stomach infection with H. pylori can cause a variety of adverse effects including chronic gastritis leading to complications (peptic ulcer, gastric cancer, lymphoma)36,37. Additionally, H. pylori infection is incredibly difficult to treat, requiring multi-antimicrobial regimens and acid suppressants36. Given that stomach isolates are functionally different from isolates in other environments, we identified two genes, fabF and tktA, that are uniquely essential to stomach isolates. Creating antimicrobial therapies specifically targeting these genes could eliminate the need for multi-antimicrobial regimens and broad-spectrum antibiotics, which are associated with adverse health effects9. Additionally, targeted antimicrobial therapies would allow for more rapid response to infection, since all organisms in an environment can be treated unilaterally with one antimicrobial so species identification is not necessary. We identified four drugs that target these genes: lauric acid (fabF), Cerulenin (fabF), Platensimycin (fabF), and Cocarboxylase (tktA). Lauric acid has been cited to have antimicrobial properties against H. pylori, and a lower propensity to cause the development of resistance than if H. pylori were treated with metronidazole or tetracycline29. Since the GENREs in PATHGENN were able to predict lauric acid as a candidate antimicrobial, in agreement with the literature, the other three identified drugs could be tested. Additionally, we visualized the pathways that fabF and tktA are a part of in three different stomach isolate species (H. pylori, A. butzleri, and C. coli) (Figure 6). There are clear differences in pathway structure between the three different species despite tktA and fabF being essential genes in stomach isolates. This finding further highlights the importance of investigating unique metabolic functional capabilities that develop due to adaptive pressures for antimicrobial discovery and drug repurposing.
The GENREs in PATHGENN were generated through an automated pipeline that first generates genome-informed draft network reconstructions and then curates the reconstructions through an automated gap-filling process based on parsimony principles. Generating all models through the same pipeline with the same level of automated curation allows for comparison across all GENREs for a high-level, cross-genome analysis of bacterial pathobionts. However, the strength of the models is dependent on the accuracy and detail of genome annotations. The analyses presented in this paper could be enhanced by further manual curation of poorly annotated species.
We successfully generated a database of 914 GENREs of all human bacterial pathobionts (PATHGENN) which we used to investigate the role of environment in adaptation and generation of unique functional metabolism. Additionally, we were able to use uniquely essential metabolic genes in pathobionts isolated from the stomach to predict possible targeted antimicrobial options for treating stomach-specific bacterial infection. We can continue to investigate questions related to functional metabolism by curating the isolate environment to simulate metabolism in more specific contexts. This effort will allow for better understanding of the functional metabolic differences in pathobionts in the context in which they grow as infections. Furthermore, we can begin to integrate environment-specific functional metabolism and other pertinent metadata to identify drug targets that are relevant to patient-specific infections. Identifying unique metabolic functions across pathobiont species is the first step to developing a framework for a personalized medicine approach to addressing infection in the clinic.
METHODS
GENRE Creation From Genome Sequences
We first filtered all genome sequences in the BV-BRC 3.6.12 database to only include those that were considered “good” quality and “complete”. BV-BRC guidelines define “good” as “a genome that is sufficiently complete (80%), with sufficiently low contamination (10%)”, and amino acid sequences that are at least 87% consistent with known protein sequence. “Complete” means that replicons were completely assembled.
There are 538 species of bacterial pathobionts4, some of which either do not have publicly available genome sequences via BV-BRC or do not have “good” and “complete” genome sequences in BV-BRC. There is at least one NCBI taxid for each pathobiont species, with some species having multiple unique NCBI taxids. Multiple genome sequences are available in BV-BRC for each NCBI taxid, so sequences were selected hierarchically based on the presence of metadata. Sequences with the most associated metadata were prioritized. If multiple sequences had the same amount of metadata, we selected the sequence that had isolate environment-associated metadata. If multiple sequences fulfilled the previous requirements, the strain that had host health-associated metadata was selected. This hierarchical selection was continued for metadata categories of isolation country, collection date, and host age, in that order of priority. The resulting list contained 914 unique genome sequences. This procedure was automated with a Python script.
All amino acid sequences were then automatically annotated with RAST 2.0 38,39, and GENREs were created for each strain using the Reconstructor13 algorithm. All models are publicly available (see Data Availability section). We benchmarked all GENREs using the community standard, MEMOTE40, and have included all scores in stable .html files on GitHub.
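The hierarchical selection can be sketched as a lexicographic sort key: total metadata count first, then presence of each prioritized category in order. This is an illustrative sketch only; the record layout and the field names in `PRIORITY_FIELDS` are hypothetical stand-ins and may not match BV-BRC column names or the authors' script.

```python
# Hypothetical metadata field names, listed in the priority order described above.
PRIORITY_FIELDS = [
    "isolation_source",
    "host_health",
    "isolation_country",
    "collection_date",
    "host_age",
]


def selection_key(record):
    """Build a tuple that sorts genomes by metadata richness, then by priority fields."""
    metadata = record["metadata"]
    # Number of metadata fields that are actually filled in.
    n_fields = sum(1 for value in metadata.values() if value)
    # True sorts above False, so earlier priority fields break ties first.
    has_field = tuple(bool(metadata.get(field)) for field in PRIORITY_FIELDS)
    return (n_fields, *has_field)


def select_genome(records):
    """Return the preferred genome record for one NCBI taxid."""
    return max(records, key=selection_key)


# Toy candidates for a single taxid; all IDs and values are invented.
candidates = [
    {"genome_id": "g1", "metadata": {"isolation_country": "USA", "collection_date": "2015"}},
    {"genome_id": "g2", "metadata": {"isolation_source": "stomach", "host_age": "40"}},
    {"genome_id": "g3", "metadata": {"isolation_source": "stool"}},
]
chosen = select_genome(candidates)
```

Here `g1` and `g2` both carry two metadata fields, and `g2` is chosen because it has isolation-source metadata, the highest-priority tie-breaker.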
Genetic Distance and Essential Gene Profile/Reaction Presence Profile Distance
All sequences used to create GENREs in PATHGENN were re-annotated to determine the rRNA genome features. All 16S rRNA sequences were extracted from the annotation output, for a total of 245 16S rRNA sequences, each from a unique PATHGENN strain (still representing the same 9 phyla represented in all 914 PATHGENN GENREs). The 16S rRNA sequences were then aligned using Clustal Omega and the resulting Percent Identity Matrix was downloaded. Identity percentages were converted to values between 0 and 1, 0 being the most similar and 1 being the most different. This value was then converted to a percentage. This metric was defined as the genetic distance for subsequent analyses.
Essential gene profiles were determined for each of the corresponding 245 GENREs (those with available 16S rRNA sequences) using an FBA-based, single-gene-knockout method in COBRApy (cobra.flux_analysis.variability.find_essential_genes()). Essential genes were then converted to KEGG Orthologs, and a binary matrix was created indicating essential gene presence in each strain (1 = presence, 0 = absence). The pairwise essential gene distance was defined as the Hamming distance41 between each pair of strains’ essential gene profiles.
Reaction presence was determined for each of the 245 GENREs via model probing in COBRApy. A binary matrix was created indicating reaction presence or absence in each strain (1 = presence, 0 = absence). The pairwise reaction presence distance was defined as the Hamming distance between each pair of strains’ reaction presence profiles.
Genetic distance vs essential gene distance, and genetic distance vs reaction presence distance, were plotted for each pair of pathobionts. Logarithmic functions were fit to both plots using the scipy.optimize.curve_fit function in the Python SciPy library.
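As a small worked example of the conversion from percent identity to genetic distance, the sketch below assumes the distance is the complement of identity expressed as a fraction and then rescaled to a percentage. The function name `identity_to_distance` and the toy matrix are hypothetical.

```python
def identity_to_distance(percent_identity):
    """Convert a percent identity (0-100) into a genetic distance in percent."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    # 0 = identical sequences, 1 = completely different sequences.
    fraction_different = 1.0 - percent_identity / 100.0
    # Express the distance as a percentage.
    return fraction_different * 100.0


# Toy 2 x 2 percent identity matrix (values are invented).
identity_matrix = [
    [100.0, 75.0],
    [75.0, 100.0],
]
distance_matrix = [
    [identity_to_distance(value) for value in row] for row in identity_matrix
]
```

For the toy matrix, two sequences with 75% identity are assigned a genetic distance of 25%.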
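The pairwise profile distance can be illustrated with a minimal pure-Python sketch. It assumes the normalized form of the Hamming distance (the fraction of positions that differ), so values fall between 0 and 1; whether the original analysis used the normalized or the raw count is not stated here. Strain names and profiles are invented.

```python
from itertools import combinations


def hamming_distance(profile_a, profile_b):
    """Fraction of positions at which two equal-length binary profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    mismatches = sum(a != b for a, b in zip(profile_a, profile_b))
    return mismatches / len(profile_a)


def pairwise_distances(profiles):
    """Compute the distance for every unordered pair of strains."""
    return {
        (s1, s2): hamming_distance(profiles[s1], profiles[s2])
        for s1, s2 in combinations(sorted(profiles), 2)
    }


# Toy binary profiles (1 = presence, 0 = absence); columns are hypothetical KEGG orthologs.
profiles = {
    "strainA": [1, 1, 0, 0],
    "strainB": [1, 0, 0, 1],
    "strainC": [1, 1, 0, 0],
}
distances = pairwise_distances(profiles)
```

The same function applies unchanged to reaction presence profiles, since both are binary vectors over a shared set of columns.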
FBA and t-SNE Dimensionality Reduction/Visualization
Flux Balance Analysis (FBA) was performed for each of the 914 models in PATHGENN using the COBRApy toolbox to capture metabolic flux through all model reactions. Ten flux samples were taken per model for a total of 9,140 flux samples.
t-distributed stochastic neighbor embedding (t-SNE)42 was used for dimensionality reduction and subsequent visualization of the FBA output. The perplexity parameter was optimized to preserve local and global relationships in the data using , where P = perplexity, and N = number of points. Points were colored based on taxonomic class, and subsequently colored on isolation source for visualization purposes. Significant clusters in both taxonomic class and isolation site t-SNE outputs were determined using a PERMANOVA43 test.
To ensure that 10 flux samples were sufficient to capture the flux solution space as well as 100 flux samples per model would, we ran pared-down t-SNE analyses. We randomly sampled 100 GENREs from the 914 total GENREs in PATHGENN. Then, for each of those 100 GENREs we used 100 flux samples to perform dimensionality reduction and subsequent visualization via t-SNE (Figure S3). We performed this analysis three times, to ensure that the results would hold true for multiple randomly selected subsets of GENREs.
Through this subsequent t-SNE analysis, we still see clustering by taxonomic class in Figure S3. Specifically, we still see large clusters of Gammaproteobacteria and Actinomycetia. Additionally, we still see the separation of Epsilonproteobacteria into distinct clusters, one of which is composed entirely of stomach isolates.
Determination of Novel Antibiotics to Target Stomach Isolates
Essential genes for all 914 models were determined using an FBA-based single-gene-knockout method in COBRApy (cobra.flux_analysis.variability.find_essential_genes()). All essential genes were translated to KEGG orthologs. Strains and their corresponding essential genes were grouped by isolation site. Essential genes present in >= 80% of strains in a given isolation source were defined as uniquely essential to that isolation source. Uniquely essential genes present in stomach isolates that are not uniquely essential to other isolation sites were selected. DrugBank28 was used to identify drugs that target uniquely essential genes of stomach isolates.
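The grouping and filtering steps can be sketched as follows. The sketch assumes essential genes have already been computed per strain and mapped to ortholog labels; the function names, strain IDs, and gene labels are hypothetical. It follows the definition in this section, where a gene is excluded if it also passes the 80% threshold in another isolation site.

```python
from collections import Counter, defaultdict


def environment_essential_genes(strain_genes, strain_env, cutoff=0.8):
    """Genes essential in at least `cutoff` of the strains from each environment."""
    by_env = defaultdict(list)
    for strain, genes in strain_genes.items():
        by_env[strain_env[strain]].append(set(genes))
    result = {}
    for env, gene_sets in by_env.items():
        counts = Counter(gene for genes in gene_sets for gene in genes)
        result[env] = {
            gene for gene, count in counts.items()
            if count / len(gene_sets) >= cutoff
        }
    return result


def uniquely_essential(env_genes, target_env):
    """Genes passing the cutoff in `target_env` but in no other environment."""
    others = set().union(
        *(genes for env, genes in env_genes.items() if env != target_env)
    )
    return env_genes[target_env] - others


# Toy input: four invented strains from two isolation sites.
strain_env = {"s1": "stomach", "s2": "stomach", "s3": "stool", "s4": "stool"}
strain_genes = {
    "s1": {"geneA", "geneB"},
    "s2": {"geneA", "geneB", "geneC"},
    "s3": {"geneB", "geneD"},
    "s4": {"geneB"},
}
env_genes = environment_essential_genes(strain_genes, strain_env)
stomach_specific = uniquely_essential(env_genes, "stomach")
```

In the toy data `geneA` and `geneB` pass the cutoff in stomach strains, but `geneB` also passes it in stool strains, so only `geneA` is reported as stomach-specific.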
Funding
This work was supported by the NSF GRFP award number 1842490, the University of Virginia NIH Systems and Biomolecular Data Sciences Training Grant (grant number 1 T32 GM 145443-1), and NIH R01s (R01-AI154242 and R01-AT010253).
Author Contributions
E.M.G and J.A.P conceived of the project. E.M.G generated the PATHGENN collection and performed subsequent analyses. E.M.G wrote the initial manuscript draft. L.R.D aided in data analysis. A.S.W assisted with model annotation. E.M.G, L.R.D, A.S.W, and J.A.P edited and approved the manuscript for final submission.
Data Availability
All PATHGENN GENRE models are publicly available on GitHub along with MEMOTE benchmarking scores and all pertinent code to this study: https://github.com/emmamglass/PATHGENN.