Transcriptome-guided parsimonious flux analysis improves predictions with metabolic networks in complex environments

The metabolic responses of bacteria to dynamic extracellular conditions drives not only the behavior of single species, but also entire communities of microbes. Over the last decade, genome-scale metabolic network reconstructions have assisted in our appreciation of important metabolic determinants of bacterial physiology. These network models have been a powerful force in understanding the metabolic capacity that species may utilize in order to succeed in an environment. Increasingly, an understanding of context-specific metabolism is critical for elucidating metabolic drivers of larger phenotypes and disease. However, previous approaches to use these platforms in concert with omics data to better characterize experimental systems have met challenges due to assumptions necessary by the various integration platforms or due to large input data requirements. With these challenges in mind, we developed RIPTiDe (Reaction Inclusion by Parsimony and Transcript Distribution) which uses both parsimony of overall flux and transcriptomic abundances to identify the most cost-effective usage of metabolism that also best reflects the cell’s investments into transcription. Additionally, in biological samples where it is difficult to quantify specific growth conditions, it becomes critical to develop methods that require lower amounts of user intervention in order to generate accurate metabolic predictions. Utilizing a metabolic network reconstruction for the model organism Escherichia coli str. K-12 substr. MG1655, we found that RIPTiDe correctly identifies context-specific metabolic pathway activity without supervision or knowledge of specific media conditions. We also assessed the application of RIPTiDe to in vivo metatranscriptomic data where E. coli was present at high abundances, and found that our approach also correctly predicts metabolic behaviors of host-associated bacteria with high accuracy. In the setting of human health, understanding metabolic changes within bacteria in environments where growth substrate availability is difficult to quantify can have large downstream impacts in our ability to elucidate molecular drivers of disease-associated dysbiosis across the microbiota. Toward this point, RIPTiDe has far reaching potential for understanding context-specific bacterial metabolism within complex communities. Author Summary Transcriptomic analyses of bacteria have become instrumental to our understanding of their responses to changes in their environment. While traditional analyses have been informative, leveraging these datasets within genome-scale metabolic network reconstructions (GENREs) can provide greatly improved context for shifts in pathway utilization and downstream/upstream ramifications for changes in metabolic regulation. Many previous techniques for GENRE transcript integration have focused on creating maximum consensus with input datasets, but these approaches were recently shown to generate less accurate metabolic predictions than a transcript-agnostic method of flux minimization (pFBA), which identifies the most efficient/economic patterns of metabolism given certain growth constraints. Despite this success, growth conditions are not always easily quantifiable and highlights the need for novel platforms that build from these findings. Our new method, RIPTiDe, combines these concepts and utilizes overall minimization of flux weighted by transcriptomic analysis to identify the most energy efficient pathways to achieve growth that include more highly transcribed enzymes, without previous insight into extracellular conditions. Utilizing a well-studied GENRE from Escherichia coli, we demonstrate that this new approach correctly predicts patterns of metabolism utilizing a variety of both in vitro and in vivo transcriptomes. This platform could be important for revealing context-specific bacterial phenotypes in line with governing principles of adaptive evolution, that drive disease manifestation or interactions between microbes.


Introduction
Metabolic plasticity is critical for bacterial survival in changing environments, and fundamental to stable microbial communities [1,2] .In the context of human health, it has been shown that certain pathogens adapt their metabolism to their current environment to most effectively colonize a new host [3,4] .By fully understanding these shifts and their implications it may provide opportunities for novel therapeutic strategies.However, describing these situational differences in overall metabolism of an organism have remained difficult to quantify.While genomic analyses of these species reveal catalogues of metabolic capabilities for a species, they do not provide information toward the inter-connected networks of metabolism that invariably occur within the cell.One approach that addresses these limitations is computationally combining genomic data with known biochemical constraints for predicting/modeling cellular physiology with Genome-Scale metabolic Network Reconstructions (GENREs).A GENRE is composed of the collection of genes and metabolic reactions associated with the species of interest, representing the totality of known metabolic function that organism is able to employ.This functionality can then be formalized with a mathematical framework and constrained by known biological parameters to allow for simulation of metabolic processes.These powerful discovery platforms have allowed for guided genetic engineering efforts, directed hypothesis generation for downstream laboratory testing, and deconvoluted metabolic responses of bacteria to antibiotic stress [5,6] .
GENREs have also become strong platforms for providing context to multiple forms of high-throughput omics data, allowing for the study of whole-network changes in metabolism due to upstream/downstream shifts in enzyme activity [7] .To this end, advances over the last decade in RNA-Seq technologies have indeed enabled whole bacterial community metatranscriptomic analysis [7] .Conversely, changes identified through other means occasionally can be difficult to attribute to a single organism from complex biological samples (i.e.untargeted metabolomics).However, shotgun transcriptomic analysis followed by strict read mapping protocols allows for the characterization of individual groups of species within these communities of organisms [8,9] .These large datasets provide informative possibilities for analyses; however holistic (top down) analyses are limited in their ability to make any mechanistic claims outside of correlative results.Against the backdrop of the reductionist (bottom up) approach of GENREs, their combined informative capacity allows for mechanistic insight into the metabolic status and ecological interactions within the microbiome [10] .
Numerous approaches to integrating transcriptomic data into GENREs primarily focus on identifying reactions associated with defined cutoffs of gene transcription or change, then maximizing flux through those reactions with many transcripts and minimizing flux through those reactions associated with genes with fewer transcripts [11][12][13][14][15][16][17][18][19][20][21] .While these techniques have produced insights into the influence of transcription on the metabolic state of cells, the largely arbitrary thresholds dividing genes and associated reactions into categories of activity or probability deviates from the continuous nature of mRNA abundance data and can skew results.Several alternative platforms collectively rely on identifying core sets of metabolic functionalities that define the cell-type or condition of interest utilizing omic data, then selecting the most likely pathways or network topology for including these tasks and pruning unused sections of the models [22,23] .While these approaches have been shown to create models that contain increased amounts of known tissue-specific metabolic pathways, they do not inherently generate fully functional models of metabolism nor were they reflective of momentary metabolic trends that would be active under specific conditions.Overall, very few platforms have integrated continuous weights to reactions based on transcriptomic data as well as applying these constraints to fully functional metabolism [24][25][26] , but these ultimately require high degrees of input data (i.e.fluxomics or differential expression analysis) and rely on placing new bounds on metabolic reactions in relation to the input transcriptome.Even fewer studies have incorporated the concept that transcription does not necessarily equate with functional enzyme, examining the implications in flux analysis [27] , highlighting an opportunity for novel approaches to analyze context-specific metabolic models in relation to continuously assigned reaction weights and stand-alone transcriptomic datasets.
Recently, parsimonious Flux Balance Analysis (pFBA) was developed, which seeks to minimize the flux associated with each reaction in the model while maintaining optimum flux through the objective [28] .In essence, this approach identifies the least biologically "expensive" usage of an organism's metabolism to achieve high growth rates.This method is in line with the concept that evolutionary pressures have selected for metabolic states in microbes with minimized cellular cost that maximize growth rate under a variety of environmental conditions [29] .Surprisingly, with a model of Saccharomyces metabolism, pFBA alone was found to be able to largely outperform several other methods in predicting metabolic network behavior under various growth conditions despite not utilizing any transcriptomic data for these predictions [30] .These studies demonstrate that contextualizing data in a truly biological meaningful way has proven difficult, highlighting large gaps in our understanding of metabolic regulatory mechanisms.Recently, our laboratory has demonstrated that parsimonious flux analysis, in conjunction with differential expression analysis, can produce useful metabolic output predictions from both rat and human metabolic network reconstructions [31] .Additionally, recent studies on E. coli grown in defined in vitro conditions have shown that levels of transcript were only predictive for roughly half of the concentrations of the corresponding enzymes in the cell [32,33] .In reality, reaction flux is determined by a complex combination of local substrate and enzyme concentrations as well as the kinetics of the reaction itself.Despite this shortcoming, transcript abundances remain an indication of cellular investment into a given metabolic strategy.Furthermore, as the expression of a functional enzyme is not as simple as an on-or-off scenario, we have also implemented an unsupervised procedure to assign weights to reactions based on the distribution inherent to RNA-Seq data [34] .As arbitrary thresholds have been demonstrated to be biased [35] , and absolute mRNA abundances were shown to improve GENRE flux predictions when their correlation with reaction activity is maximized [27] , we chose to calculate continuous values along the transcript abundance distribution for the coefficient associated with each gene.This approach is unbiased by depth of sequencing across datasets, and it restricts the utility of low-transcription reactions to an optimal pFBA solution yet does not entirely prohibit it.With both of these concepts in mind, we have leveraged the insights gained by pFBA, integrating transcript abundance as a weighting metric to direct the most parsimonious flux solutions toward a state of higher fidelity to in situ biology.
We call this combined method RIPTiDe, or Reaction Inclusion by Parsimony and Transcript Distribution.RIPTiDe was then tested on the most well curated GENRE of the highly studied model organism E. coli str.K-12 substr.MG1655.Transcriptomic data from multiple distinct in vitro growth conditions were compared against each other as well as the base implementation of pFBA.RIPTiDe was able to accurately discern context-specific phenotypes including growth rates that closely match experimentally measured values under the same conditions, as well as gene essentiality predictions in metabolic pathways relevant to the corresponding media.The platform was subsequently tested utilizing in vivo metatranscriptomic sequence data from clindamycin-treated mouse cecal content where E. coli is the dominant member of the bacterial community [8] .When contrasted against results from the previous analysis, the in vivo RIPTiDe-contextualized model grew significantly slower than in aerobic rich media and more central metabolites were produced intracellularly when in competition with other bacterial species.These combined analyses validated our methodology by correctly uncovering context-specific biological trends, and previously unappreciated host-associated behaviors of E. coli .Furthermore, this technique could present a powerful tool for the study of the microbiota through the simultaneous unsupervised contextualization of omics data from multiple organisms within the same community.

Reaction Inclusion by Parsimony and Transcript Distribution (RIPTiDe)
RIPTiDe relies on bi-level optimization guided by transcript abundances in order to identify the most efficient usage of metabolism that also reflects an organism's transcriptional state.The required input for RIPTiDe includes a metabolic network reconstruction with an assigned objective and gene-associated transcript abundances.Transcript mapping is reduced to only genes that appear in the genome-scale network reconstruction of interest, therefore not skewing the distribution with data that is uninformative to the current analysis.To generate reaction coefficients that closely reflect the normalized transcript abundance [36] for their associated genes, we calculated the ratio of transcription for each gene with the maximum abundance in the dataset (Fig. 1A).In line with the overall motivation for RIPTiDe, this calculation results in a weight for all genes that represents each as a fraction of the maximum cellular investment in any one metabolic function, which has been shown to be effective in at least one previous approach for transcriptome integration [25] .The process results in a distribution of linear coefficients that is >0.0 and ≦ 1.0 where higher transcript abundance corresponds with a larger probability of being utilized in the final metabolic model.Additionally, in the cases where multiple genes contribute to the same reaction (i.e.isozymes or multimers), then the lowest coefficient (highest transcript abundance) among the group is chosen as it reflects in what processes the cell has invested energy.Others have integrated overall GENRE flux state limited by whole-cell transcriptomic analyses to more directly predict metabolic capacities [25] , however this approach did not leverage additional constraints on total flux to maximize cellular economy as a reflection of omics datasets.Furthermore, previous methods have not included any exploration of distinct forms of metabolism that may all still occur given the constraints of each algorithm on models.By calculating a continuous scale of linear coefficients based on absolute transcript abundances, it removes the need for different cutoffs between datasets of variable sequencing depth and reduces the requirement for manual intervention in the generation of functional context-specific models.
As with pFBA, reversible reactions are split into pairs of irreversible reactions carrying non-negative flux.In this way, we were able to minimize overall flux subject to the transcript-guided linear coefficients added by RIPTiDe.Within the transcript abundance distribution, progressively smaller linear coefficients are assigned to genes with increased transcription at each cutoff (Fig. 1A).
Coefficients are then subsequently transferred to the corresponding metabolic reactions for each gene.
The magnitude of linear coefficient drives the amount of flux carried by each reaction.Larger coefficients correspond to the reaction receiving larger flux during optimal growth, which would therefore be favored less when minimizing the overall flux through the objective function.Reactions without genes are assigned the median score to not be incorrectly penalized, nor overly biased to their inclusion.Importantly, this approach is equally tractable between different sequencing depths as well as both transcriptomes sequenced from pure culture or extracted from a metatranscriptome.The scope of feasible solution states in a GENRE typically out-scales the number of realistic metabolic states of the cell by a large magnitude [37] .Placing further constraints on a model that reflects likely metabolism allows for improved predictions, but only partially mitigates the potential for biologically meaningless solutions to be reported.Sampling of feasible flux distributions reveals how the applied constraints limit the possible metabolic states and ensure we don not rely on any single state that may be physiologically less likely [38,39] .Each sampled flux distribution represents a hypothesis for how metabolism is truly functioning given the constraints placed on the system.Integrating a flux sampling technique into RIPTiDe, we were able to further leverage our transcript-based constraints to maximize the context-specificity of resultant models.Following assignment of linear coefficients to metabolic reactions, we proceeded through the linear optimization steps at the core of RIPTiDe as illustrated in Fig. 1B.In order to be quickly and uniformly sample this highly constrained solution volume, we utilized the sampler gapsplit which is tuned to maximize efficiency per sample under this type of condition [40] .
Transcript abundances were used to guide the bi-level optimization and flux sampling in order to identify the most economic, likely usage of metabolism with the given cellular investment in transcription.A more detailed description of the steps in RIPTiDe are as follows: 1. Near optimal flux through the objective function is set as a constraint (≧80%, Table S1).
2. Linear coefficients are assigned as constraints to each reaction based on transcriptomic abundance of the associated gene-enzyme assignments.In cases where multiple gene products contribute to the same reaction, the largest weight among them is selected.The resultant pruned model has a dramatically more focused, biologically feasible, metabolic solution space that has been adequately sampled for feasible metabolic states (Fig. 1C).Unlike many previous approaches, models produced by RIPTiDe are able to achieve flux through their given objectives by default without any reintroduction of deactivated reactions.This is critical as the principle of optimum cellular economy with respect to the transcript data is never broken throughout the contextualization process.Spearman correlation is then performed between median absolute flux values for all reactions found during sampling and their associated levels of transcription to identify the level of concordance of context-specific metabolic states with the measured transcriptome (Fig. S3).
Furthermore, we have also integrated functionality to instead utilize one or more metabolic tasks (single reactions) as constraints which must achieve positive flux in order to accommodate larger-scale models where a single defined cellular objective may be a biologically invalid assumption (i.e.human tissue).
For a more detailed description of the algorithm, refer to the Methods section.

Application of RIPTiDe to an Example Model of Bacterial Metabolism
To test RIPTiDe, we created a small, flux-consistent metabolic network reconstruction with two compartments and generation of ATP (in arbitrary units) as the cellular objective function.The model has two separate means of accomplishing this goal; either through glycolysis where glucose is the primary substrate, or Stickland fermentation which requires pairs of amino acids in coupled redox reactions to create energy [41] .In nature, fewer ATP are generated per unit of substrate in Stickland fermentation compared to glycolysis [42] .This phenomenon was replicated in the example model as fewer reactions are required to generate ATP from glucose, making this pathway more energetically favorable (Table S2).Utilizing the base implementation of pFBA available within COBRApy to quantify which pathway was the more parsimonious solution, we ensured that glycolysis was actually the most efficient (fewest enzymes used) for optimal ATP generation (Fig. 2A).Critically, this ATP generation was performed under the conditions where all possible extracellular substrates were provided in excess to the model as to not bias results only on the basis of resource availability.
As a means to determine the context specificity of RIPTiDe output models, we then generated two distinct simulated transcript abundance distributions that each reflected elevated transcription of a single pathway (Table S2).In Fig. 2B, RIPTiDe was provided with a transcriptome in which genes for glycolysis and related reactions were more highly transcribed.The contextualized model in this case also utilized glycolysis as the only means of generating ATP, directly matching those results obtained from pFBA.Alternatively, as illustrated in Fig. 2C, when genes associated with Stickland fermentation and peptide substrate transporters were more highly transcribed, RIPTiDe was able to discern that the less directly parsimonious pathway should be utilized instead when provided sufficient transcriptomic evidence.In each case, reactions related to the opposite pathway were correctly pruned while generating the final context-specific model.Furthermore, flux samples for the cellular objective from each contextualized example model revealed that significantly more ATP could be produced when employing glycolysis (Table S2), further supporting the energetic favorability of this pathway.These combined results confirmed that RIPTiDe was indeed able to use a less parsimonious pathway when necessary to better reflect the associated transcriptomic dataset.

RIPTiDe with a Curated Model of E. coli Metabolism
To perform a thorough evaluation of our approach, we utilized the genome-scale metabolic network reconstruction iJO1366 (2583 reactions, 1805 metabolites, 3 compartments), for the model organism E. coli strain K-12 substr.MG1655 [43] .This GENRE has been the subject of extensive manual curation and experimental validation, making it the best possible candidate for examining new approaches in constraint-based modeling [44] .Leveraging these years of research, we utilized this well described system to determine the biological validity of predicted context-specific differences returned by RIPTiDe.To accomplish this goal, we performed a comparative analysis using 3 separate transcriptomic sequencing datasets, each derived from well characterized in vitro media conditions including LB aerobic liquid rich media, M9+glucose aerobic liquid minimal media, and M9+glucose anaerobic liquid minimal media.Each dataset was collected at approximately the same mid-exponential phase of growth [45,46] .These data were chosen due to the large amount of characterization each has received over time, allowing our predictions to be more readily supported compared to assessing metabolic differences possibly seen across numerous less well explored conditions.Additionally, E. coli is a facultative anaerobe with known differences in metabolism between these lifestyles that we hoped to capture with RIPTiDe [47] .
We first set a performance baseline using implementations of the distinct methodologies for contextualizing omics data with GENREs mentioned previously; strict integration to corresponding reaction flux levels versus complete context-specific model generation.For equal comparison between algorithms, these analyses were all performed utilizing the transcriptome collected from E. coli K-12 growing in LB aerobic liquid rich media.First, we elected to assess two separate, commonly cited integration algorithms, GIMME and iMAT [11,12] , due to their combined prevalence across the literature.These algorithms require abundance cutoffs for selectively deactivating reactions associated with genes that were transcribed below the desired level and as such, we utilized the recommended values published alongside each method.The python implementations of both GIMME and iMAT initially returned models that were unable to achieve flux through the biomass synthesis reaction in any condition, and required reintroduction of metabolic reactions establish the ability of the models to grow (Table S3).The resultant models were no longer concordant with the input transcriptomes, reintroducing 138 and 104 reactions respectively associated with genes that fall below the defined transcription cutoff (Table S3).Next, using a more recent technique that has similar requirements for input data, CORDA [13] , we created a context-specific model with the same transcriptome as in the previous analysis (LB rich media).A defined core set of high confidence reactions was needed and was set as those reactions associated with genes that recruited >1000 transcripts during mapping.With these settings, CORDA was indeed able to produce a model that was capable of growth, however 1001 out of 2583 reactions were included in the resulting model and the growth rate was not significantly altered.This level of pruning did not substantially narrow the possible strategies for growth that can be studied easily, and potentially speaks to the design of this algorithm for yielding tissue-specific models and less to understanding situational patterns of metabolism most relevant to the context of interest.
The performance of these algorithms could be attributed to the lack of defined media conditions in each instance, which further supports the necessity for techniques that do not require such a high degree of information at the outset.

Exchange Reaction Flux Predictions Using RIPTiDe
To measure how closely RIPTiDe-based predictions reflect those when media conditions are known, we performed NMDS ordination analysis of Bray-Curtis dissimilarities between common exchange reaction flux distribution samples from iJO1366 with and without constraints on available substrates compared with RIPTiDe-generated flux samples with no defined media (Fig. 3A).For these comparisons, RIPTiDe was applied to JO1366 with no additional constraints applied utilizing the transcriptome from growth in M9 + glucose defined minimal media.As expected, flux samples from the model with only constraints for near maximal growth are significantly different ( p -value = 0.001) from each other group as it has access to all possible growth substrates which combined allow for faster growth and distinct patterns of metabolism.A striking relationship emerged when RIPTiDe flux distributions were compared with those from media-constrained models, revealing that these groups were highly correlated.Importantly, no difference ( p -value > 0.05) was found when RIPTiDe was applied to the model with M9 + glucose media constraints contrasted against results from RIPTiDe-generated flux samples without media constraints (Table S4).A similar trend was also observed with pFBA integrating two additional transcriptomic datasets from E. coli across distinct media conditions (Fig. S5).These data indicated that RIPTiDe-based predictions for an organism's interaction with its environment (exchange fluxes) were extremely analogous to those made when extracellular metabolic conditions were known.With these results in mind, for all subsequent work we took an unsupervised approach by allowing for unpenalized flow of all extracellular metabolites in and out of the system (exchange reaction bounds set to ±1000), otherwise referred to as "complete" media.This approach allowed for inferences about the most likely growth substrates for a contextualized GENRE since only those that carry flux following RIPTiDe remain, which is imperative to accurately describing context-specific metabolic patterns in complex extracellular environments.

Context-specific Escherichia coli Biology Uncovered by RIPTiDe
We next performed RIPTiDe-based analyses of iJO1366, utilizing all three transcriptomic datasets described previously.As a standard of comparison, we also created a simulated transcriptome where all transcript abundances were identical, forcing RIPTiDe to assign maximum importance to all reactions and establish the most "economic" overall flux distribution that achieves maximum growth (referred to as Maximum Parsimony).We first assessed the impact that each transcriptome had on the components of each model created by RIPTiDe.Different topologies of reaction inclusion were found among each of the contextualized models; however none were completely unique and a core set of metabolic reactions that were necessary for optimal growth across all conditions (347 reactions; Fig. 3B).The next largest fraction of overlap was between those models generated using transcriptomes from E. coli grown in M9 minimal media (aerobic & anaerobic), which supported that RIPTiDe was able to identify general biological trends that are conserved between similar media conditions.Interestingly, the next largest group was the intersection between Maximum Parsimony and LB rich media, which may reflect that growth in an environment with surplus growth nutrients is partially analogous to the artificial state of minimal cellular investment.These distinctions in network topology were further crystallized by conditional gene essentiality between the RIPTiDe-contextualized models (necessary for at least 1% of optimum growth).It is important to note that genes found to be essential during these analyses do not necessarily correspond with survival of E. coli , only that these genes may be critical to the optimal metabolic strategy under the associated conditions [48] .Future work may seek to elucidate the individual contribution of these genes to overall metabolic strategies across respective media conditions.Although most essential genes were shared by all four models (159 "core" genes, Table S4), those essential genes that were context-specific related to pathways of biological significance to their respective media conditions (Fig. 3C).As a corollary to creating a smaller metabolic network than those using transcriptomes contextualized by RIPTiDe (by an average of 9.12%), fewer genes were distinctly essential in the model resulting from Maximum Parsimony (5 genes) compared to any of the RIPTiDe-contextualized in vitro transcriptomes (8, 15, and 20 genes respectively; Table S4).
Interestingly, the contextualized models on average shared larger ratios of components (~7.07%reactions) with each other than with Maximum Parsimony, supporting that RIPTiDe selects more distinct patterns of metabolism compared to Maximum Parsimony alone (Fig. 3B).We also calculated the relative doubling time for each of the contextualized models from the objective values associated with the biomass objective function (Fig. S1) [49] .As expected, the base model with all open exchanges and no additional constraints displayed the fastest doubling time.However, the growth rates for models associated with in vitro transcriptomic data each decrease in the expected order reflecting experimentally measured results [50,51] ; LB (aerobic) → M9+glucose (aerobic) → M9+glucose (anaerobic).These results collectively suggested that the RIPTiDe-contextualized GENREs not only topologically reflected known biological trends, but also phenotypically mimicked experimentally measured behaviors.
We subsequently focused our analysis on assessing known differences in core metabolism across transcriptomes in RIPTiDe-contextualized models.Growth conditions are known to affect which forms of metabolism are possible or favored at a given time, driving the cell to utilize alternative pathways for replenishing it's cellular currency.One such anabolic circuit in Enterobacteriaceae is the small transhydrogenase pathway which consists of two distinct enzymes, contributing to NADPH homeostasis within the cell [52] .It has been previously demonstrated in E. coli str.K-12 that transhydrogenase functionality is critical for normal growth in M9 minimal media, but dispensable in LB rich media [53] .Strikingly, the RIPTiDe-contextualized model in aerobic M9 minimal media indeed obtained the majority of cellular NADPH from this mechanism (Fig. 4A).Alternatively, both gene-associated reactions were pruned when the LB media-associated transcriptome was integrated with RIPTiDe.Instead in rich media, the contextualized model generated the bulk of its NADPH via glycerol-3-phosphate dehydrogenase (Fig. 4B).This enzyme is a major contributor to the bacterial electron transport chain and known to be active in E. coli during fermentation and growth in aerobic LB broth [54,55] .Importantly we found that pntA and pntB, the genes for NADP transhydrogenase subunits, were significantly upregulated in M9 minimal media and downregulated in rich media (Table S4).Alternatively, glycerol-3-phosphate dehydrogenase is transcribed relatively evenly in both conditions but contributes to a parsimonious flux solution.This result reinforced that RIPTiDe has the ability to correctly deviate from the most parsimonious flux states when sufficient evidence is presented.
These combined findings signified that RIPTiDe is able to capture previously measured, context-specific behaviors with only transcriptomic characterization and no manipulation of in silico media conditions.

Contextualization of In vivo Metatranscriptomic Data
Finally, we sought to test the utility of RIPTiDe for capturing metabolic behavior of bacteria from within host-associated communities of microorganisms.Computational analysis of a discrete bacterial species within a complex environment as a member of a larger microbial community has been historically difficult [56] .To address this problem, we utilized in vivo metatranscriptomic abundance data collected from the cecum of wildtype C57Bl6 mice.These data were chosen because the specific antibiotic pretreatment (intraperitoneal injection of clindamycin) resulted in a bacterial community composed of >90% E. coli [8] .To begin the analysis, we first made certain that the in vivo transcript abundance distribution reflected the same negative binomial type as those derived from in vitro transcriptomes mapped to the same E. coli K-12 MG1655 genome (Fig. 5A).This result demonstrated that with sufficient sequencing depth, a species with a GENRE may be analyzed with RIPTiDe using data collected from among a diverse collection of microbes.After contextualization of this data, the context-specific flux values during sampling were correlated with assigned linear coefficients based on transcript abundances (Fig. 5A & Fig. S2).This result supported that the transcriptomic data was indeed informative, but not overly deterministic, of optimal metabolic states through RIPTiDe.We then assessed how distinctive the flux distributions were of those metabolic reactions which were shared between context-specific models.Cross-referencing each contextualized model to identify shared reactions, we then performed principle component analysis using Bray-Curtis dissimilarity for those flux samples corresponding with this core metabolic subnetwork (Fig. 5B).This analysis revealed not only significant differences between all of the metabolic profiles identified using the transcriptomic data from E. coli across the four conditions ( p -value < 0.001), but more specifically between those bacteria grown in M9 minimal media and those found using LB rich media or in vivo transcriptomes ( p -value << 0.001).
These results implied distinct metabolic strategies necessary for growth during resource scarcity or surplus, and these patterns were captured by RIPTiDe.Although similar, no group of flux distributions entirely overlapped another and indicated that although numerous reactions and metabolites are shared between all context-specific models, they are utilized significantly differently to manage growth across media conditions and supported differential metabolism is likely necessary to cope with distinct environments.
Next, we contrasted the RIPTiDe models generated using in vivo or LB rich media transcriptomes as they were more similar during unsupervised learning analysis.Aside from one environment being highly aerobic and the other largely anaerobic, previous research on E. coli physiology during growth in liquid LB concluded that the bacterium utilizes amino acids that are simple to metabolize for growth early on, then switches to other substrates in a diauxic-like growth pattern.LB broth has also been shown to have only modest levels of usable carbohydrate growth substrates, increasing the importance of amino acid catabolism for E. coli [50] .Additionally, in vivo amino acids and simple sugars are contested resources in the cecum as the majority are already absorbed by the host or used by other microbes upstream in the GI tract, making these growth conditions extremely different [57] .As an analogy to differential growth substrates, we looked to assess the presence of context-specific exchange reactions.We identified 4 exchange reactions that were present in the LB-associated model but not in vivo , and 5 of these were peptides or peptide conjugates (Fig. 5C).
Carbohydrate growth substrates have been demonstrated to be at very low concentrations in LB media.
Additionally, E. coli has also been shown to preferentially utilize amino acids to generate energy during early exponential phase [50] .Indeed, upon assessing the measured changes in the in vivo concentrations of the highlighted metabolites they were reduced in the context where E. coli dominated the environment, possibly due to consumption by the microbe (Fig. S2).Additionally, the exchange for glucose-6-phosphate was also only found in in vivo -contextualized models, reinforcing that carbohydrates may be less available for growth by E. coli compared to amino acid substrates in LB media [50] .Continuing this analysis now with shared intracellular reactions we utilized AUC-Random Forest [58] on the sampled flux distributions for each reaction to determine those metabolic functionalities that most distinguish the two models (Fig. S4).Among other findings, machine learning revealed consistent differences in flux ranges for both flavodoxin and thioredoxin reductases in aerobic LB culture, but not in vivo .These enzymes are important for dealing with oxidative stress which would be much more prevalent in the broth culture than in the mostly anoxic environment of the mammalian gut [59] .Putrescine transport also distinguishes in vitro and in vivo flux levels, which can be converted to succinate and may be an adaptation for survival in the gut where polyamines (like putrescine) are at elevated concentrations [60] .Altogether, our results support that RIPTiDe is also a strong platform for the analysis of bacterial GENRE behavior in the context of complex microbial communities through the use of metatranscriptomic data.

Conclusions
In order to survive in a given environment, an organism must express the appropriate forms of metabolism to cope with changing conditions.Production of transcript and protein is metabolically expensive, and over-production likely puts the cell at risk if further perturbation is placed on the system [61] .Although recent findings suggest that levels of transcript for a given enzyme product do not always directly correlate with the amount of functional protein present in the cell at a given time [32] , these values may be considered a signature of cellular "investment" and therefore may still be useful in interrogation of metabolism.Furthermore, understanding the quantity of even a small number of functional enzymes remains extremely challenging, which negatively impacts our ability to grasp which and why resources are used by the cell [62] .We should then consider transcriptomic abundance data as signatures of cellular energy expenditure where high transcript levels for a particular gene are an indication of large cellular investment related to that particular function, but are not particularly deterministic of the associated phenotype.Parsimonious Flux Balance Analysis approaches the idea of maximizing cellular economy, and we have built on this idea but employing RNA-Seq data to inform where the cell is most likely spending its energy.
We have demonstrated empirically that our algorithm correctly calculated both in vitro and in vivo metabolic phenotypes with reasonably high accuracy.In addition to these predictions, our method does not require prior knowledge of active metabolic subnetworks which can preclude work with understudied or uncultured species that cannot or have not been grown or characterized in isolation.
Furthermore, the ability to accurately capture the context-specific metabolism of bacteria when media conditions are not easily defined will be an important component for the discovery of novel host-microbe metabolic interactions.Some possible limitations of RIPTiDe do exist however.First, as RIPTiDe relies on the distribution associated with shotgun transcriptomic sequencing, it is less suited for other types of omic data in its current form [63] .RNA-Seq reads are easily attributable to single species within the microbiota due to strict genomic mapping techniques and are measured within a single sequencing assay, however with sufficient consideration to reaction weights given other data types RIPTiDe's functionality could be expanded.Second, by condensing to such a small metabolic network "solution space", the models generated by RIPTiDe are meant to represent the most probable context-specific "snapshot" of likely metabolism.To counteract potential overfitting to a single metabolic state for more generalizable downstream models, we created options for manually setting a maximum weight that coefficients are allowed as well as providing lists of core functionality that must remain in the context-specific model created by RIPTiDe.In spite of these concerns, we have shown our algorithm to be a strong approach for contextualization of genome-scale metabolic network reconstructions for both in vitro and in vivo conditions, producing models that closely mirror known biological phenotypes.This technique may also prove effective for generating tissue-specific models in GENREs from larger eukaryotes.The implementation of RIPTiDe rewards divergence from the most numerically parsimonious state to one that is more reflective of cellular investment, and in turn more illustrative of in situ biology.In summary, RIPTiDe approaches transcriptomic data integration uniquely from previous techniques and provides a more biologically relevant platform for considering the cellular behavior in complex extracellular metabolic environments.

Parsimonious Flux Balance Analysis Adaptation
RIPTiDe was implemented within COBRApy [64] and is freely available with installation instructions at https://github.com/mjenior/riptide .To save on computational time, which can be intensive during flux sampling object generation, flux variability analysis was utilized first to identify and remove all blocked reactions (those reactions which do not carry flux under any circumstances).At this point, optimal flux through the objective function was set as a constraint with a 20% range of allowed variation (based on 80% fraction of optimal objective function value).Following sensitivity testing, this level was found to result in models which best achieved high growth rates while reaching acceptable levels of correlation with input transcriptomes (Table S1).

Linear Coefficient Assignment
As previously described, remaining reactions were then assigned a linear coefficient based on the user provided transcriptomic abundance data [65] .An option has also been integrated to differentially penalize reactions where multiple gene products are required for a reaction to proceed and discordant transcript abundances are observed based on GPR rules.In order to illustrate this difference, we repeated this analysis with and without GPR consideration (Table S6).The base implementation of RIPTiDe utilizes the maximum transcription among genes encoding for a single reaction, however the ability to pool overall transcript for all genes contributing to a reaction has also been added.As in parsimonious-FBA, each reversible reaction is separated into individual forward and reverse reactions v irrev .Linear coefficients w r for each vector can now integrated and an objective function is then defined to minimize total of flux across the entire model.These steps are all performed subject to near optimal biomass flux ( max v objective ) and mass-balance constraints.Optimization (standard FBA) is performed and inactive reactions are then removed from the model.The model is then checked for orphaned genes and metabolites, which are subsequently removed as well.

Flux Sampling
In order to adequately explore intracellular flux through the remaining reactions, previous coefficients are reassigned to remaining reactions using the unsorted linear coefficient distribution (assigned directly corresponding to transcript abundance).All reactions with coefficients are then assembled into a single expression which is set as a constraint for the overall solution space of all reactions with symengine ( https://github.com/symengine/symengine): The new coefficients are then assigned to the corresponding irreversible reactions v irrev .Flux sampling was then performed with the gapsplit sampler using standard parameters and 500 samples ( https://github.com/jensenlab/gapsplit ) [40] .Flux variability analysis for the constrained models is also performed to establish flux ranges for each reaction [66] .An overall concordance is then calculated by finding the Spearman correlation coefficient and associated p -value between median absolute flux and their associated linear coefficient during flux sampling for all reactions (Table S1 and Fig. S3).

Example Model Construction
The example model used for initial RIPTiDe testing consists of 16 reactions (7 exchanges, 7 transporters, and 2 metabolic) and 14 metabolites across 2 compartments (extracellular and intracellular).The intracellular demand reaction for ATP generation was utilized as the objective.
Simulated transcriptomes are shown in Table S2 and an SBML file (example_GENRE.sbml)for the model is also available at github.com/mjenior/Jenior_RIPTiDe_2019 .

Data Sources and Processing
Fastq files associated with each study were accessed from the NCBI Sequence Read Archive (SRA: PRJNA313438, SRA: PRJNA212627, & SRA: PRJNA415307).Sequencing read processing was performed as described previously in [3,8] .In summary, raw cDNA reads were quality-trimmed with Sickle [67] at ≧Q30 and subsequently mapped to Escherichia coli K-12 MG1655 genes (EMBL-EBI on Jan 28, 2019) using Bowtie2 [68] .MarkDuplicates was then used to remove optical and PCR duplicates ( broadinstitute.github.io/picard/).The remaining mappings were converted to idxstats format using SAMtools [69] , and the read counts per gene were tabulated.Discordant pair mappings were discarded, and counts were then normalized to read length and gene length to give a per-base report of gene coverage.If replicate sequencing was performed, median transcription values for each gene were then calculated.The data were then converted to a python dictionary of gene keys and associated numeric transcript abundances for use with RIPTiDe.In vivo untargeted metabolomic data were generated for a previous study [8] , and accessed from github.com/SchlossLab/Jenior_Metatranscriptomics_mSphere_2018/ .The current version of iJO1366 (GENRE for E. coli K-12 MG1655) was accessed from the BiGG Model database [70] on 09-26-2018.

Statistical Methods
All statistical analysis was performed using R (v.3.2.0).Significant differences between flux distributions and metabolite concentrations were determined by Wilcoxon signed-rank test with Benjamini-Hochberg correction.Principle Component Analysis and Non-Metric Multidimensional Scaling of sampled flux distributions was performed using the vegan package [71] .Machine-learning was accomplished with the implementation of AUC-Random Forest in R [58] .

Data and Software Availability
As previously stated, RIPTiDe is freely available for download at https://github.com/mjenior/riptideand also as a full package (v2.5.4) from the Python Package Index ( https://pypi.org/project/riptide/).Complete analysis associated with this study is available in a Jupyter notebook hosted on a GitHub repository along with all other data and code associated with this study ( https://github.com/mjenior/Jenior_RIPTiDe_2019).balance is known to be essential for E. coli growth in M9 minimal media but dispensable in LB, which is correctly selected by the unsupervised network pruning from RIPTiDe.Significant difference in flux levels determined with Wilcoxon signed-rank test ( p -value << 0.001), pathway maps generated using Escher [72] .

Figures and Legends
Reactions included from the pentose phosphate pathway in each context-specific model, pruned reactions from this pathway in both contexts, the distribution of transcript abundances in each dataset, and the flux sampling results for NADPH producing reactions in each constrained model.

3 .
Total sum of fluxes in the model is minimized subject to previously outlined constraints.4. Standard flux balance analysis then identifies inactivated subnetworks (zero flux) which are subsequently removed from the model.Orphaned genes and metabolites are then pruned. 5. Inverse linear coefficients within the original provided range are calculated and assigned for each reaction, then flux sampling is performed on the constrained and reduced model to find most likely pathway utilization patterns.

For
reaction weight C g (linear coefficient) assignments, the reads-per-million normalized transcript abundance of gene T g within the sorted (smallest to largest) transcript abundance distribution is divided by the largest abundance T max this distribution.These C g values are assembled into a sorted (largest to smallest) distribution of linear coefficients.Each gene in the sorted transcript abundance distribution is then assigned the corresponding linear coefficient from this inverted coefficient distribution for each reaction, creating distribution of reaction weights w r .

1 |
Figure 1 | Visual representation of RIPTiDe workflow.(A) Example transcript and resultant reaction

Figure 2 |
Figure 2 | Example model of bacterial metabolism supports the utility of RIPTiDe for identifying

Figure 4 |
Figure 4 | Cellular sources of NADPH in RIPTiDe contextualized models reflect known biological

Figure S2 |
Figure S2 | Relative concentrations of metabolites highlighted during in vivo analysis of iJO1366

Figure S3 |
Figure S3 | Spearman correlation results between RIPTiDe linear coefficients from flux sampling

Figure S4 |
Figure S4 | Results from Random Forest machine learning to identify the most differentiating

Figure S5 |
Figure S5 | Flux samples from context-specific models with RIPTiDe vs exchange flux media