Reconstruction of a catalogue of genome-scale metabolic models with enzymatic constraints using GECKO 2.0

Genome-scale metabolic models (GEMs) have been widely used for quantitative exploration of the relation between genotype and phenotype. Streamlined integration of enzyme constraints and proteomics data into such models was first enabled by the GECKO toolbox, allowing the study of phenotypes constrained by protein limitations. Here, we upgrade the toolbox in order to enhance models with enzyme and proteomics constraints for any organism with a compatible GEM reconstruction. With this, enzyme-constrained models for the budding yeasts Saccharomyces cerevisiae, Yarrowia lipolytica and Kluyveromyces marxianus are generated to study their long-term adaptation to several stress factors by incorporation of proteomics data. Predictions reveal that upregulation and high saturation of enzymes in amino acid metabolism are common across organisms and conditions, suggesting the relevance of metabolic robustness in contrast to optimal protein utilization as a cellular objective for microbial growth under stress and nutrient-limited conditions. The functionality of GECKO is expanded with an automated framework for continuous and version-controlled update of enzyme-constrained GEMs, also producing such models for Escherichia coli and Homo sapiens. In this work, we facilitate the utilization of enzyme-constrained GEMs in basic science, metabolic engineering and synthetic biology purposes.


Introduction
Genome-scale metabolic models (GEMs) have become an established tool for systematic analyses of metabolism for a wide variety of organisms [1][2][3][4][5][6] . Their myriads of applications span from model-driven development of efficient cell factories 3,7-9 , to their utilization for understanding mechanisms underlying complex human diseases [10][11][12] . One of the most common simulation techniques for enabling phenotype predictions with these models is flux balance analysis (FBA), which assumes that there is balancing of fluxes around each metabolite in the metabolic network. This means that fluxes are constrained by stoichiometries of the biochemical reactions in the network, and that cells have evolved in order to operate their metabolism according to optimality principles 13,14 . Quantitative determination of biologically meaningful flux distribution profiles is a major challenge for constraint-based methods, as optimal phenotypes can be attained by alternate flux distribution profiles 15 , caused by the presence of network redundancies that provide organisms with robustness to environmental and genetic perturbations. This limitation is often addressed by incorporation of experimental measurements of exchange fluxes (secretion of byproducts and uptake of substrates) as numerical flux constraints for the FBA problem. However, such measurements are not readily available for a wide variety of conditions and organisms.
In order to overcome these limitations, the concept of enzymatic limitations on metabolic reactions has been explored and incorporated by several constraint-based methods. Some of these have modelled enzyme demands of metabolic reactions by constraining metabolic networks with kinetic parameters and physiological limitations of cells, such as a crowded intracellular volume [16][17][18] , a finite membrane surface area for expression of transporter proteins 19 and a bounded total protein mass available for metabolic enzymes [20][21][22][23][24][25] . All of these modelling frameworks have been successful at expanding the range of predictions of classical FBA, providing explanations for overflow metabolism and cellular growth on diverse environments for Escherichia coli [16][17][18][19]21,23,25 , Saccharomyces cerevisiae 22,25,26 , Lactococus lactis 27 and even human cells 20,24 . However, these modelling approaches were applied to metabolic networks of extensively studied model organisms, which are usually well represented in specialized resources for kinetic parameters such as the BRENDA 28 and SABIO RK 29 databases. Furthermore, collecting the necessary parameters for the aforementioned models was mostly done manually; therefore, no generalized model parameterization procedure was provided as an integral part of these methods.
Enzyme limitations have also been introduced into models of metabolism by other formalisms, for instance, Metabolic and gene Expression models (ME-models), implemented on reconstructions for E. coli [30][31][32][33] , Thermotoga maritima 34 and Lactococus lactis 35 ; and resource balance analysis models (RBA), on reconstructions for E. coli 36 and Bacillus subtilis 36,37 . These formalisms succeeded at merging genome-scale metabolic networks together with comprehensive representations of macromolecular expression processes, enabling detailed exploration of the constraints that govern cellular growth on diverse environments.
Despite the great advances for understanding cell physiology, provided by these modelling formalisms, accuracy on phenotype predictions is compromised by the large number of parameters that are required (rate constants for transcriptional, translational, protein folding and degradation processes), with most of these not being readily available in the literature. Moreover, these models encompass processes that differ radically in their temporal scales (e.g., protein synthesis vs. metabolic rates) and their mathematical representation (presence of non-linear expressions in ME-models), requiring the implementation of more elaborate techniques for numerical simulation.
GECKO, a method for enhancement of GEMs with Enzymatic Constraints using Kinetic and Omics data, was developed in 2017 and applied to the consensus GEM for S. cerevisiae, Yeast7 38 . This method extends the classical FBA approach by incorporating a detailed description of the enzyme demands for the metabolic reactions in a network, accounting for all types of enzyme-reaction relations, including isoenzymes, promiscuous enzymes and enzymatic complexes. Moreover, GECKO enables direct integration of proteomics abundance data, if available, as constraints for individual protein demands, represented as enzyme usage pseudo-reactions, whilst all of the unmeasured enzymes in the network are constrained by a pool of remaining protein mass. Additionally, this method incorporates a hierarchical and automated procedure for retrieval of kinetic parameters from the BRENDA database, which yielded a high coverage of kinetic constraints for the S. cerevisiae network. The resulting enzyme-constrained model, ecYeast7, was used for successful prediction of the Crabtree effect in wild-type and mutant strains of S. cerevisiae and cellular growth on diverse environments and genetic backgrounds, but also provided a simple framework for prediction of protein allocation profiles and study of proteomics data in a metabolic context. Furthermore, the model formed the basis for modeling yeast growth at different temperatures 39 .
Since the first implementation of the GECKO method 38 , its principles of enzyme constraints have been incorporated into GEMs for B. subtilis 40 , E. coli 41 , B. coagulans 42 , Streptomyces coelicolor 43 and even for diverse human cancer cell-lines 2 , showing the applicability of the method even for non-model organisms.
Despite the rapid adoption of the method by the constraint-based modelling community, there is still a need for automating the model generation and enabling identification of kinetic parameters for less studied organisms. Here we wanted to build GECKO models for several organisms, and we therefore updated the GECKO toolbox to its 2.0 version. Among other improvements, we generalized its structure to facilitate its applicability to a wide variety of GEMs, and we improved its parameterization procedure to ensure high coverage of kinetic constraints, even for poorly studied organisms. Additionally, we incorporated simulation utility functions, and developed an automated virtual pipeline for update of enzyme-constrained models (ecModels), named ecModels container. This container is directly connected to the original sources of version-controlled GEMs and the GECKO toolbox, offering a continuously updated catalogue of diverse ecModels.

Community development of GECKO
To ensure wide application and enable future development by the research community, we established the GECKO toolbox as open-source software, mostly encoded in MATLAB. It integrates modules for enhancement of GEMs with kinetic and proteomics constraints, automated retrieval of kinetic parameters from the BRENDA database (python module), as well as simulation utilities and export of ecModel files compatible with both the COBRA toolbox 44  Interaction with users of the GECKO toolbox and the ecYeastGEM model has also been facilitated through the use of the GECKO repository, allowing users to raise issues related with the programming of the toolbox or even about conceptual assumptions of the method, which has guided cumulative enhancements.
Additionally, technical support for installation and utilization of the toolbox and ecYeastGEM is now provided through an open community chat room (available at: https://gitter.im/SysBioChalmers/GECKO), reinforcing transparent and continuous communication between users and developers.

New additions to the GECKO toolbox
The first implementation of the GECKO method significantly improved phenotype predictions for S. cerevisiae's metabolism under a wide variety of genetic and environmental perturbations 38 . However, its development underscored some issues, in particular that quantitative prediction of the critical dilution rate and exchange fluxes at fermentative conditions are highly sensitive to the distribution of incorporated kinetic parameters. Although S. cerevisiae is one of the most studied eukaryal organisms, not all reactions included in its model have been kinetically characterized. Therefore, a large number of kcat numbers measured for other organisms (48.35%), or even non-specific to their reaction mechanism (56.03% of kcat values found by introduction of wild cards into E.C. numbers) were needed to be incorporated, in order to fill the gaps in the available data for the reconstruction of the first S. cerevisiae ecModel, ecYeast7.
Moreover, detailed manual curation of kcat numbers was needed for several key enzymes in order to achieve biologically meaningful predictions.
As the BRENDA database 47 is the main source of kinetic parameters for GECKO, all of the available kcat and specific activity entries for non-mutant enzymes were retrieved. In total, 38,280 entries for 4,130 unique E.C. numbers were obtained and classified according to biochemical mechanisms, phylogeny of host organisms and metabolic context (Supp. file 1), in order to assess significant differences in distributions of kinetic parameters. This analysis showed that not all organisms have been equally studied. Whilst entries for H. sapiens, E. coli, R. norvegicus and S. cerevisiae account for 24.02% of the total, very few kinetic parameters are available for most of the thousands of organisms present in the database, showing a median of 2 entries per organism (Fig. 1A, Supp. file 1). The analysis also showed that kinetic activity can differ drastically, spanning several orders of magnitude even for families of enzymes with closely related biochemical mechanisms (Fig. 1B). Finally, it was also observed that kcat distributions for enzymes in the central carbon and energy metabolism differ significantly from those in other metabolic contexts across phylogenetic groups of host organisms (life kingdoms, according to the KEGG phylogenetic tree 48 ), even without filtering the dataset for entries reported exclusively for natural substrates, as previously done by other studies 49 (Fig. 1C).
As kcat numbers depend on biochemical mechanisms, metabolic context and phylogeny of host organisms, a modified set of hierarchical kcat matching criteria was implemented as part of GECKO 2.0. The modified parameterization procedure enables the incorporation of kinetic parameters that have been reported as specific activities in BRENDA when no kcat is found for a given query (as the specific activity of an enzyme is defined as its kcat over its molecular weight), adding 8,118 new entries to the catalogue of kinetic parameters in the toolbox. A phylogenetic distance-based criterion, based on the phylogenetic tree available in the KEGG database 48 , was introduced for cases in which no organism-specific entries are available for a given query in the kinetic parameters dataset. A comparison of the new kcat matching criteria with their predecessor set is shown in Supp. file 2.
In order to assess the impact of the modified kcat assignment algorithm on an ecModel, ecYeast7 was reconstructed using both the first and the new version of the GECKO toolbox (GECKO 2.0). A classification of the matched kcat numbers according to the different levels of the new matching algorithm is provided in Fig. 1D. The incorporation of specific activity values in the parameter catalogue increased the number of kinetic parameters matched to complete E.C. numbers (no added wild cards) from 1432 to 2696 (Fig. 1E). Moreover, the implementation of the phylogenetic distance-based criterion yielded a distribution of kinetic parameters that showed no significant differences when compared to the values reported in BRENDA for all fungi species, in contrast to the kinetic profile matched by the previous algorithm (p-values <10 -10 and <10 -7 , when compared to the BRENDA fungi and S. cerevisiae distributions, respectively, under a Kolmogorov-Smirnov test) (Fig. 1F). The quality of phenotype predictions for the ecYeast7 model enhanced by GECKO2.0 was evaluated by simulation of batch growth in 19 different environments, with an average relative error of 29.22% when compared to experimental data (Fig. 1G).
The introduction of manually curated kcat numbers in a metabolic network has been proven to increase the quality of phenotype predictions for S. cerevisiae 22,25,38 ; nevertheless, this is an intensive and time consuming procedure that is hard to ensure for a large number of models subject to continuous modifications. In order to ensure applicability of the GECKO method to any standard GEM, a unified procedure for curation of kinetic parameters was developed based on parameter sensitivity analysis. For automatically generated ecModels that are not able to reach the provided experimental value for maximum batch growth rate, an automatic module performs a series of steps in which the top enzymatic limitation on growth rate is identified through the quantification of enzyme control coefficients. For such enzymes, the E.C. number is obtained and then its correspondent kcat value is substituted by the highest one available in BRENDA for the given enzyme class. This procedure iterates until the specific growth rate predicted by the model reaches the provided experimental value.
Finally, as the first version of the toolbox relied on the structure and nomenclature of the model Yeast7, its applicability to other reconstructions was not possible in a straightforward way. In order to provide compatibility with any other GEM, based on COBRA 44 or RAVEN 50 formats, all of the organism-specific parameters required by the method (experimental growth rate, total protein content, organism name, names and identifiers for some key reactions, etc.) can be provided in a single MATLAB initialization script, minimizing the modifications needed for the generation of a new ecModel. ecModels container: an automatically updated repository Several GEMs that have been published are still subject to continuous development and maintenance 1-3,5,6 , this renders GEMs to be dynamic structures that can change rapidly. In order to integrate such continuous updates into the enzyme constrained version of a model in an organized way, an automated pipeline named ecModels container was developed.
The ecModels container is a continuous integration implementation whose main functionality is to provide a catalogue of ecModels for several relevant organisms that are automatically updated every time a modification is detected either in the original GEM source repository or in the GECKO toolbox, i.e. new releases in their respective repositories. The pipeline generates ecModels in different formats, including the standard SBML and MATLAB files, and stores them in a container repository (https://github.com/SysBioChalmers/ecModels) in a version controlled way, requiring minimal human interaction and maintenance. The GECKO toolbox ensures the creation of functional and calibrated ecModels that are compatible with the provided experimental data (maximum batch growth rate, total protein content of cells and exchange fluxes at different dilution rates as an optional input). This whole computational pipeline is illustrated in Fig. 2. Further description of the ecModels container pipeline functioning is included in in the Materials and Methods section.

A catalogue of new ecModels
Following the aforementioned additions to the GECKO toolbox, that have allowed its generalization, we used the toolbox for the reconstruction of four new ecModels from previously existing high-quality metabolic network reconstructions: iYali4, for the oleaginous yeast Yarrowia lipolytica 5 ; iSM996, for the thermotolerant yeast Kluyveromyces marxianus 6 ; iML1515, for the widely studied bacterium E. coli 4 ; and Human1, being the latest and largest network reconstruction available for studying H. sapiens metabolism 2 .
For the microbial models, all model parameters were calibrated according to the provided experimental data, generated by independent studies 4,51-53 , yielding functional ecModels ready for simulations. Size metrics for these models can be seen in Table 1.
These ecModels, together with ecYeastGEM, are hosted in the ecModels container repository for their continuous and automated update every time that a version change is detected either in the original model source or in the GECKO repository. In the case of microbial species, two different model structures are provided: ecModel, which has unbounded individual enzyme usage reactions ready for incorporation of proteomics data; and ecModel_batch in which all enzyme usage reactions are connected to a shared protein pool. This pool is then constrained by experimental values of total protein content, and calibrated for batch simulations using experimental measurements of maximum batch growth rates on minimal glucose media, thus providing a functional ecModel structure ready for simulations.
For ecHumanGEM just the unbounded ecModel files are provided, as this is a general network of human metabolism, containing all reactions from any kind of human tissue or cell type for which evidence is available, and therefore not suitable for numerical simulation. As H. sapiens is the most represented organism in the BRENDA database, accounting for 11% of the total number of available kcat values (Supp. file 1), kinetic parameters from other organisms were not taken into account for its enhancement with enzyme constraints. ecHuman1 provides the research community with an extensive knowledge base that represents a complete and direct link between genes, proteins, kinetic parameters, reactions and metabolites for human cells in a single model structure, subject to automated continuous update by the ecModels container pipeline.

Visualization of GECKO simulations in the Caffeine platform
We implemented simulations with ecModels in Caffeine, an open-source software platform for cell factory design. Caffeine, publicly available at http://caffeine.dd-decaf.eu, allows user-friendly simulation and visualization of flux predictions made by genome-scale metabolic models. Several standard modelling methods are already included in the platform, such as 13 C fluxomics data integration, and simulation of gene deletion and/or overexpression, to interactively explore strain engineering strategies. In order to allow for GECKO simulations, we added a new feature to the platform for uploading enzyme-constrained models and absolute proteomics data. Additionally, we added a simulation algorithm that recognizes said models, and overlays the selected proteomics data on them, leaving out data that makes the model unable to grow at a pre-specified growth rate. After these inclusions to the platform, enzyme usage can now be computed on the fly and visualized on metabolic maps (Fig. 2B), to identify potential metabolic bottlenecks in a given condition. The original proteomics data can be visualized as well, to identify if the specific bottleneck is due to a lack of enzyme availability, or instead due to an inefficient kinetic property. This will suggest different metabolic engineering strategies to the user: if the problem lies in the intracellular enzyme levels, the user can interpret this as a recommendation for overexpressing the corresponding gene, whereas if the problem lies in the enzyme efficiency, the user could assess introducing a heterologous enzyme as an alternative.

GECKO simulation utilities
As ecModels are defined in an irreversible format and incorporate additional elements such as enzymes (as new pseudo-metabolites) and their usages (represented as pseudo-reactions), they might sometimes not be directly compatible with all of the functionalities offered by currently available constraint-based simulation software 44,45,50,54,55 . We therefore added several new features to the GECKO toolbox that allow the exploration and exploitation of ecModels. These include utilities for: 1) basic simulation and analysis purposes, 2) accessible retrieval of kinetic parameters, 3) automated generation of condition-dependent ecModels with proteomic abundance constraints, 4) comparative flux variability analysis between a GEM and its ecModel counterpart, and 5) prediction of metabolic engineering targets for enhanced production with an implementation of the FSEOF method 56 for ecModels. Detailed information about the inputs and outputs for each utility can be found on their respective documentation, available at: https://github.com/SysBioChalmers/GECKO/tree/master/geckomat/utilities. All of these utilities were developed in MATLAB due to their dependency on some RAVEN toolbox functions 50 .

Predicting microbial proteome allocation in multiple environments
In order to test the quality of the phenotype predictions of an ecModel automatically generated by the ecModels container pipeline, batch growth under 11 different carbon sources was simulated with eciML1515 for E. coli. Figure 3A shows that, for all carbon sources, growth rates were predicted at the same order of magnitude as their corresponding experimental measurements, with the most accurate predictions obtained for growth on D-glucose, mannose and D-glucosamine. Furthermore, batch growth rate and protein allocation predictions, using no exchange flux constraints, were compared between eciML1515 and the iJL1678 ME-model 32 , the latter accounting for both metabolism and macromolecular expression processes. The sum squared error (SSE) for batch growth rate predictions across the 11 carbon sources using eciML1515 was 0.27, a drastic improvement when compared to the 1.21 SSE of iJL1678 ME-model predictions 32 . Figure 3B shows the predicted total proteome needed by cells to sustain the provided experimental growth rates for the same 11 environments. Notably eciML1515 predicts values that lie within the range of predictions of the iJL1678 ME-model (from the optimal to the generalist case) for 10 out of the 11 carbon sources (see Materiales and Methods for simulation details). This shows that the new version of the GECKO toolbox ensures the generation of functional ecModels that can be readily used for simulation of metabolism, due to its systematic parameter flexibilization step which reduces the need of extensive manual curation for new ecModels. Furthermore, iML1515 is a model available as a static file at the BiGG models repository 57 ; therefore, its integration to the ecModels container for continuous update demonstrates the flexibility of our pipeline, regarding compatibility with original GEM sources, which can be provided as a link to their git-based repositories or even as static URLs.

Proteomics constraints refine phenotype predictions for multiple organisms and conditions
The previously mentioned module for integration of proteomics data generates a condition-dependent ecModel with proteomics constraints for each condition/replicate in a provided dataset of absolute protein abundances [mmol/gDw]. Even though absolute quantification of proteins is becoming more accessible and integrated into systems biology studies 58-62 , a major caveat of using proteomics data as constraints for quantitative models is their intrinsic high biological and technical variability 63 , therefore some of the incorporated data constraints need to be loosened in order to obtain functional ecModels. When needed, additional condition-dependent exchange fluxes of byproducts can also be used as constraints in order to limit the feasible solution space. A detailed description of the proteomics integration algorithm implemented in GECKO is given in Supp. file 2.
The new proteomics integration module was tested on the three ecModels for budding yeasts available in ecModels container (ecYeastGEM, eciYali, eciSM996). We measured absolute protein abundances for S. cerevisiae, Y. lipolytica and K. marxianus, grown in chemostats at 0.1 h -1 dilution rate and subject to several experimental conditions (high temperature, low pH and osmotic stress with KCl) 64  were higher, at least by 20%, for 3 out the 4 simulated conditions when compared with predictions of the ecYeastGEM without proteomics data (Fig. 4E).
Relative enzyme usages, estimated as predicted absolute enzyme usage over enzyme abundance for all of the measured enzymes in an ecModel & ', can be understood as the saturation level of enzymes in a given condition. In order to analyze the metabolic mechanisms underlying long-term adaptation to stress in budding yeasts, relative enzyme usage profiles were computed from all the previous simulations of ecModels with proteomics constraints. Enzymes that display fold-changes higher than 1 for both absolute abundance and their saturation level, when comparing predicted usage profiles between stress and reference conditions, suggest regulatory mechanisms on individual proteins that contribute to cell growth on the anlyzed stress condition. Figure 4F shows all of the enzymes that were identified as responsive to environmental stress in this study, displaying enrichment for enzymes involved in biosynthesis of diverse amino acids and folate metabolism.
A further mapping of all enzymes in these ecModels to a list of 2,959 single copy protein-coding gene orthologs across the three yeast species 64 found 310 core proteins across these ecModels. Principal component analysis revealed that variance on absolute enzyme usages and abundance profiles for these core proteins is mostly explained by differences in the metabolic networks of the different species rather than by environmental conditions (Fig. S3 B-C), reinforcing previous results suggesting that, despite being phylogenetically related, their long-term stress responses at the molecular level have evolved independently after their divergence in evolutionary history 64 . Further analysis of the FVA results revealed that a reduction of at least 95% of the variability range was achieved for more than 90% of all active fluxes at high growth rates in all ecModel. Interestingly, the aforementioned flux variability metrics were overall improved even for the chemostat conditions, despite a higher degree of constraining (fixed low growth rate and optimal uptake rate), which restrains these models to an energy efficient respiratory mode (Supp. file 4).

Discussion
Here we demonstrated how enzyme constrained models for diverse species significantly improve simulation performance compared to traditional GEMs. Furthermore, to enable the community to easily adapt this modelling approach, we upgraded the GECKO toolbox for enhancement of genome-scale models with enzyme and omics constraints to its version 2.0. Major improvements on the kcat matching algorithm were incorporated into the toolbox, based on phylogenetic distance between the modeled organism and the host organisms for data queries, and an automated curation of kcat numbers for over-constrained models were incorporated into the toolbox. Major refactoring of the GECKO toolbox enabled a generalization of the method, allowing the creation of high-quality ecModels for any provided functional GEM with minimal need for case-specific introduction of new code. Additionally, several utility functions were integrated into the toolbox in order to enable basic simulation purposes, accessible retrieval of enzyme parameters, integration of proteomics data as constraints, flux variability analysis and prediction of gene targets for enhanced production of metabolites. Overall, it was shown that these enhancements to the GECKO toolbox improve the incorporation of kinetic parameters into a metabolic model, yielding ecModels with biologically meaningful kinetic profiles without compromising accuracy on phenotype predictions.
Two major limitations of the first version of the GECKO toolbox were its specific customization to the S. found to be common across the studied organisms and stress conditions (Fig. S3 D and Supp. file 3). These results suggests that yeast cells display enzyme expression profiles that provide them with metabolic robustness for microbial growth under stress and nutrient-limited conditions, in contrast to an optimal protein allocation strategy that prioritizes expression of the most efficient and non-redundant enzymes.
Our results on drastic reduction of median flux variability ranges and the number of totally unbounded fluxes for eciYali, eciSM996 and eciML1515, together with previous studies 1,2,38 , suggest that a major reduction of the solution space of metabolic models to a more biologically meaningful subspace is a general property of ecModels. However, flux variability is an intrinsic characteristic of metabolism; therefore, metabolic models with highly constrained solution spaces may exclude some biological capabilities of organisms, which are not compatible with the set of constraints used for the analysis (exchange fluxes, growth rates and even profiles of kinetic parameters, considered as condition-independent in ecModels).
Here, the predictive capabilities of eciML1515 and iJL1678 ME-model (both for E. coli) for cellular growth and global protein demands on diverse environments were compared. The major improvement in predicted maximum growth rates, together with a comparable performance on quantification of protein demands, shown by eciML1515 suggest that, despite its mathematical and conceptual simplicity, the GECKO formalism is a suitable framework for quantitative probing of metabolic capabilities, compatible with the widely used FBA method and without the need of excessive complexity or computational power.
Nevertheless, ME-models provide a much wider range of predictions that explore additional processes in cell physiology with great detail. Direct comparison between the predictions of these modelling formalisms, suggest that ME-models performance can be improved by incorporation of either curated or systematically retrieved kinetic parameters that are suitable for the modelled organisms.
Simpler modelling frameworks that account for protein or enzyme constraints in metabolism, such as flux balance analysis with molecular crowding (FBAwMC) 16,17 , metabolic modelling with enzyme kinetics (MOMENT) 23 and constrained allocation flux balance analysis (CAFBA) 21 , have also been developed and used to explore microbial cellular growth 16,17,21 and overflow metabolism 16,23 . These methods have overcome the lack of reported parameters for some specific reactions either by incorporation of proteomics measurements and prior flux distributions 23 , manual curation and sampling procedures 16,17 or even by lumping protein demands by functionally related proteome groups. In contrast, the new version of the GECKO toolbox provides a systematic and robust parameterization procedure, leveraging the vastly accumulated knowledge of biochemistry research stored in public databases, ensuring the incorporation of biologically meaningful kinetic parameters even for poorly studied reactions and organisms.
The applicability of these other simple modelling formalisms to models for diverse species is limited as none of these methods has been provided as part of a generalized model-agnostic software implementation. high-level production of valuable chemicals.

Automation pipeline and version-controlled hosting of the ecModels container
The ecModels repository is used to version-control the pipeline code and the resulting models. The pipeline is restricted to 2 short Python files, whose role is to decide when models need to be updated based on a configuration file config.ini, and to consequently invoke the use of GECKO for each model. filtering was set to peptide E-value < 0.01 and protein log(E-value) < -3. Relative quantification of protein abundances was carried out using the Normalized Spectral Abundance Factor (NSAF) 73 and the NSAF values obtained from UPS2 proteins in bulk samples were used to determine the suitable regression curves that allowed the conversion from relative protein abundance into absolute terms. MS data is available online on public databases via the PRIDE repository 74 with the dataset identifier PXD012836.

Simulation of condition-dependent flux distributions
Simulation of cellular phenotypes for conditions of environmental stress at low dilution rates with GEMs were performed by first setting bounds on measured glucose uptake and byproduct secretion rates according to experimental data from previous studies on chemostats 64 . Then the biomass production rate was constrained (both upper and lower bounds) with the experimental dilution rate (0.1 h -1 ). Maximization of the non-growth associated maintenance pseudo-reaction was set as an objective function for the parsimonious FBA problem as a representation of the additional energy demands for regulation of cellular growth at non-optimal conditions. The same procedure was followed for simulations with ecModels constrained by a total protein pool. For the case of ecModels with proteomics constraints, the same set of constraints was used but the objective function was set as minimization of the total usage of unmeasured proteins, assuming that the regulatory machinery for stress tolerance is represented by the condition-specific protein expression profile.

Prediction of microbial batch growth rates
Batch cellular growth was simulated by allowing unconstrained uptake of all nutrients present in minimal mineral media, enabling a specific carbon source uptake reaction for each case while blocking the rest of the uptake reactions and allowing unconstrained secretion rates for all exchangeable metabolites.
Maximization of the biomass production rate was used as an objective function for the resulting FBA problem. For prediction of total protein demands on unlimited nutrient conditions, media constraints were set as expressed above and experimental batch growth rate values were fixed as both lower and upper bounds for the biomass production pseudo-reaction. The total protein pool exchange pseudo-reaction was then unconstrained and set as an objective function to minimize, assuming that when exposed to unlimited availability of nutrients the total mass of protein available for catalyzing metabolic reactions becomes the limiting resource for cells. The solveLP function, available in the RAVEN toolbox, was used for solving all FBA problems in this study.

Code availability
The    Prediction of total protein content in the cell by eciML1515 and ME-iJL1678 using the optimal and generalist approaches.

Improved kcat matching algorithm
The kcat matching algorithm in the GECKO toolbox queries kinetic parameters from BRENDA, the largest database available on enzymatic information 1  In order to address the aforementioned limitations, the GECKO kcat matching algorithm was modified aiming to provide a more accurate parameterization of models. A comparison between the introduced and previous hierarchical algorithms is shown in Table S1. Original kcat matching criteria New criteria 1. As a first option, it will try to match the E.C. number, the organism and the corresponding substrate to some kcat annotation in the BRENDA database.

2.
If no match is found, it will try to match the E.C. number and the substrate, but with any organism available.
2. If no match is found, it will try to match the E.C. number and the substrate, but for the phylogenetically closest organism with available values. 3. If no match is found, it will try to match the E.C. number and the organism, but with any substrate available.

4.
If still no match is found, it will try to match the E.C. number for any organism, and any substrate available.
4. If no match is found, it will try to match the E.C. number and the organism but looking in specific activity values instead of kcat (S.A.*Mweight = kcat). 5. If still no match is found, then it will introduce one wildcard to the E.C. number and attempt all previous 4 steps again.
5. If still no match is found, it will try to match a kcat value for the E.C. number, any substrate but for the phylogenetically closest organism with available values. 6. If still no match is found, it will try to match a specific activity value for the E.C. number, any substrate but for the phylogenetically closest organism available. 7. Finally, if still no match is found, then it will introduce one wildcard (WC) to the E.C. number and attempt all previous 6 steps again.

Estimation of phylogenetic distance between pairs of organisms
The phylogenetic distance between organisms is measured as the number of nodes of separation between two organisms in the KEGG taxonomical tree (incorporated as a MATLAB workspace file into the toolbox), this new feature follows from the assumption that kinetic parameters on enzymes have been finely tuned by evolution and are phylogenetically related 4 . The incorporation of specific activity values increases the parameter coverage and avoids the assignment of a high number of wild cards, making the assignments as close as possible to the original metabolic function of the specific enzyme class.

Iterative curation of limiting kcat numbers based on enzyme control coefficients
Once kinetic parameters and protein pool bounds have been incorporated into the ecModel it is very likely that overconstraining arises due to the intrinsic uncertainty of the incorporated kcat values. For such cases, the module kcat_sensitivity_analysis flexibilizes the coefficients with a higher effect on the simulated objective function value based on enzyme control coefficients given by in which *'-). represents the kcat parameter of the enzyme i in reaction j; /0. is the original value in the objective function; ∆ *'-). is an induced perturbation in the kcat equivalent to 10-fold increase of its initial value; ∆ /0. is the change in the objective function after perturbing 123 ). .
The ECCs are ranked in a decreasing way and the enzyme with the coefficient is then selected for a 10-fold kcat increase, based on the assumption that kcat parameters may span orders of magnitude across organisms and substrates even for the same enzyme class. This procedure iterates until the ecModel is able to reach to provided experimental growth rate in the getModelParameters.m function. Information regarding the flexibilized kcat values, their respective proteins and reactions, ECCs, flexibilized and original kcat s is saved as a text file in the GECKO/model folder of the toolbox under the name kcat_modifications.txt.

Incorporation of proteomics constraints
The integrate_proteomics module in GECKO enables the generation of condition dependent models with proteomics constraints for any given dataset of absolute protein abundances [mmol/ gDw] with m replicates for n conditions. For each experimental condition, the abundance values are filtered, excluding proteins that are not present in at least 2/3 of the total number of condition replicates and also noisy measurements where -/-'4 is the measured total protein content in the cell in gprot/gDw; 5 ! is the molecular weight of the measured protein i; represents an average saturation factor for the unmeasured enzymes, assumed as 0.5 5,6 ; accounts for the fraction that the unmeasured protein sector represents out of the total proteome in the cell, this value is calculated by using a paxDB proteome abundance file for the organism of interest as a reference, if no paxDB file is provided then a value of 0.5 is assumed.
Proteomics abundances are corrected for the oxidative phosphorylation complexes, trying to avoid overconstraining of potentially erroneously measured subunits that might limit the whole pathway. This correction is limited to just this pathway as it is desirable to modify the original dataset the least possible and abundance changes in OxPhos subunits are key to meet the phenotype energy requirements. Medium constraints are set by allowing free uptake of all compounds available in the culture medium and closing the rest of the uptake reactions. Additionally, all the upper bounds for production reactions (secretion of metabolites) are set to 1000 mmol/gDw h. Using an ecModel_batch with the same constraints setup, minimal enzyme requirements for the proteins present in the filtered dataset are retrieved from a parsimonious FBA solution vector. Enzyme abundances that are lower than the minimum requirements calculated by the FBA solution are corrected in the dataset. The total flexibilized mass of protein is drawn from the remaining protein pool (upper bound for protein_pool_exchange pseudoreaction) for consistency with mass conservation. Non-growth associated ATP maintenance is fitted according to condition specific experimental data if available (measurements on exchange fluxes of oxygen and CO2 from the same samples as the proteomics dataset).
In the case of chemostat samples such conditions are set by first fixing the growth rate to the experimental value, minimizing the carbon source uptake, fixing its optimal value and then setting the total unmeasured enzymes usage as a new objective to minimize. Each condition-specific model is saved in GECKO/models/prot_constrained.

Comparative flux variability analysis
The function comparativeFVA.m in the FVA utitilities module provides a fair comparison of flux variability range distributions between a given GEM and its ecModel pair for glucose limited conditions (low dilution rates) and protein limiting regime (batch growth). For the chemostat case, a dilution rate of 0.1 h -1 was set as both lower and upper bound for the biomass pseudoreaction (+/-a tolerance value of 0.01%). Then the glucose uptake rate was set as an objective to minimize and its optimal value was then also fixed, using the same tolerance. Additional culture medium constraints were imposed (upper bound for exchange reactions of medium components were set to 1000 mmol/gDw h). All applied constraints were also applied to the original GEM. For the protein-limiting case, biomass production is maximized with the ecModel and then the optimal value is used to set a lower bound on the same reaction, in order to compare with the original GEM, the same optimal growth rate is fixed as both lower and upper bounds for the biomass pseudoreaction. A parsimonious flux distribution in which the total protein usage is minimized in the ecModel subject to all of the previous constraints is then obtained. For every reaction that is able to