INTEGRATE: Model-based multi-omics data integration to characterize multi-level metabolic regulation

Metabolism is directly and indirectly fine-tuned by a complex web of interacting regulatory mechanisms that fall into two major classes. On the one hand, the expression level of the catalyzing enzyme sets the maximal theoretical flux level (i.e., the net rate of the reaction) for each enzyme-controlled reaction. On the other hand, metabolic regulation controls the metabolic flux through the interactions of metabolites (substrates, cofactors, allosteric modulators) with the responsible enzyme. High-throughput data, such as metabolomics and transcriptomics data, if analyzed separately, do not accurately characterize the hierarchical regulation of metabolism outlined above. They must be integrated to disassemble the interdependence between different regulatory layers controlling metabolism. To this aim, we propose INTEGRATE, a computational pipeline that integrates metabolomics and transcriptomics data, using constraint-based stoichiometric metabolic models as a scaffold. We compute differential reaction expression from transcriptomics data and use constraint-based modeling to predict if the differential expression of metabolic enzymes directly originates differences in metabolic fluxes. In parallel, we use metabolomics to predict how differences in substrate availability translate into differences in metabolic fluxes. We discriminate fluxes regulated at the metabolic and/or gene expression level by intersecting these two output datasets. We demonstrate the pipeline using a set of immortalized normal and cancer breast cell lines. In a clinical setting, knowing the regulatory level at which a given metabolic reaction is controlled will be valuable to inform targeted, truly personalized therapies in cancer patients.

In recent years, the biological role of metabolism has been reconsidered. Being closely 2 integrated with most -if not all -cellular processes, metabolism may act as a sensitive 3 integrative readout of the physio-pathological state of a cell or organism [1]. 4 Consistently, many physio-pathological states and multifactorial diseases, from cancer to 5 neurodegeneration and aging, show a specific metabolic component [2]. 6 While the general topology of metabolism is well established, the characterization 7 and understanding of system-level regulation of metabolism remain largely unresolved, 8 although some general rules emerged in recent years [3,4]. Each metabolic flux depends 9 on metabolic enzymes, whose levels and catalytic activities are the outcome of gene 10 expression and regulatory events that include epigenetic control of chromatin, 11 accessibility to transcription factors, the rate of transcription of the individual genes 12 encoding metabolic enzymes, as well as post-transcriptional and post-translational 13 events (from RNA splicing to enzyme phosphorylation). These complex, hierarchical 14 regulatory mechanisms are orchestrated by signal transduction pathways and set the 15 upper level for the flux of each enzyme-catalyzed metabolic reaction. An increasing 16 body of evidence [5][6][7] indicates that metabolism is not passively regulated by the 17 hierarchical control mechanisms outlined above: on the contrary, metabolism 18 auto-regulates metabolic fluxes (i.e., the rate of individual metabolic reactions) through 19 the interactions of metabolites (substrates, cofactors, allosteric modulators) with the level and depends only on changes in [S1]; 48 • metabolic and transcriptional control : the flux through E1 is co-regulated with 49 changes in [S1]. 50 Characterizing the landscape of metabolism and its regulation is of paramount 51 importance in various fields, including health, wellness, and biotransformations [12]. 52 The first requirement for this characterization is the knowledge of metabolic fluxes. 53 However, direct determination of metabolic fluxes through the use of labeled substrates 54 lags behind other omic technologies, such as metabolomics and transcriptomics, mainly 55 due to technical difficulties [13], especially at the sub-cellular level [14]. On the contrary, 56 transcriptomics (or proteomics) and metabolomics datasets are increasingly being 57 collected in large cohorts but do not allow for accurate characterization of the 58 regulatory mechanisms controlling metabolism unless opportunely integrated. More 59 recently, parallel transcriptomic and metabolomic datasets started to appear [15][16][17][18][19][20]. 60 Yet, the integration of these different omic data has so far been generally limited to 61 gene-metabolite correlation analysis or pathway enrichment analysis of genes and 62 metabolites [21,22]. Hence, there is a need for data science methods to integrate 63 transcriptomic and metabolomic data to capture all the facets of the interdependence 64 between metabolism and gene expression. 65 Constraint-based steady-state models represent a valuable framework to predict 66 metabolic fluxes from the other high-throughput omics data. In particular, a plethora of 67 methods have been conceived to integrate transcriptomic data into these kinds of models 68 by relying on Gene-Protein-Reaction associations (GPRs) encoded within them, as 69 reviewed in [23][24][25]. Intracellular metabolomics data have also been indirectly integrated 70 into constraint-based steady-state models in the form of constraints on fluxes [26][27][28], 71 aiming at identifying the metabolic flux distribution better fitting the given data.

72
Current model-based attempts to discern trascriptionally from metabolically 73 controlled fluxes present some limitations [29,30]. Katzir et al. [30] do not directly use 74 metabolomics data to infers reactions controlled at the metabolic level using 75 metabolomic data but determine them by elimination, as fluxes that are not regulated 76 at the transcriptional, translational, or post-translational level. On the contrary, the 77 approach by Cakir at al. [29] is based on the concept of neighborhood in a graph, it 78 does not distinguish reactions substrates from products, nor enzyme subunits from 79 isoforms, and does not predict whether a reaction is up or down-regulated, but simply 80 de-regulated. 81 We here present the INTEGRATE (Model-based multi-omics data INTEGRAtion to 82 characterize mulTi-level mEtabolic regulation) pipeline to accurately characterize the 83 landscape of metabolic regulation in different biological samples, starting form 84 metabolomics data (and optionally exometabolomic data to derive utilization and 85 July 24, 2021 3/30 elimination of selected nutrients and waste products) and transcriptomic data. that vary across cells consistently with both metabolic and transcriptional regulation 2) 110 fluxes that vary consistently with metabolic regulation only.

111
The core process of INTEGRATE methodology is depicted in Figure 1 and consists 112 of integrating the input experimental datasets, which are centered around heterogeneous 113 objects (i.e., genes, metabolites and fluxes) into the input metabolic network in order to 114 obtain the three following datasets, each of which is centered around the object reaction: 115 Reaction Activity Scores (RAS). This dataset includes for each input model reaction 116 and for each sample a RAS score. The score is based on the expression value 117 (RNA-seq read counts) of the genes encoding for catalyzing enzymes and on the 118 relationship among them, as previously done in [31].  ...  variations are consistent with metabolic regulation. Reaction displaying a low RAS-RPS 145 agreement but a high FFD-RPS correspond to metabolically controlled reactions. 146 We remark that we preferred not to give the same attention to flux variations that 147 are consistent with transcriptional regulation only, based on the concordance between 148 RAS and FFD, because the two datasets are not independent.

149
Scripts to reproduce the overall workflow are available at: To test our approach, we applied INTEGRATE to cell lines that we expected to be 154 metabolically heterogeneous. We selected four breast cancer cell lines deriving from 155 either primary of metastasis breast cancer tissues belonging to different molecular 156 classifications, and a non-tumorigenic breast cell line. The name, origin and molecular 157 subtyping of the five cell lines are summarized in Table 1.

158
The five cell lines were cultured in a similar growth medium. It can be observed in 159 Figure 2A that the cell lines present major differences in terms of proliferation rate. 160 We first identified a balanced growth phase suitable to obtain measurements to be 161 used as constraints in steady state modeling. To this aim, we analyzed both the number 162 of cells in time and the protein content (Figures 2A-B). It can be observed in Figure 2C 163 that, between 0 and 48 hours, the protein content linearly correlates with the number of 164 cells, indicating that the cell size is constant in this time window. We, therefore, 165 concentrated further analyses on the 0-48 hours time window. 166 We estimated the consumption and production rates of lactate, glutamine, glucose 167 and glutamate in the 0-48 hours time interval from YSI analysis of spent medium. We 168 focused on these metabolites because glucose and glutamine are the main carbon 169 sources of cancer cells and because the rate of lactate and glutamate production over 170 glucose and glutamine consumption is notoriously difficult to be properly predicted by 171 constraint-based models [32,33]. We quantified the abundance of intracellular metabolites and prepared libraries for RNA-sequencing at 48h.

173
It can be observed in Figure 2E that the cell lines present major differences in terms 174 of metabolic profile. The clusters of samples referring to each cell are indeed well 175 separated from one another. In particular, as it can be observed in the dot plot in 176 Figure 2D, the abundance of some metabolites well distinguish a cell from the others.   Measurements on spent medium in Figure 2F show that the ratio of lactate 188 produced over glucose consumed is quite similar across the five cell lines. In contrast, 189 the lactate and glutamate over glucose ratios are more heterogeneous.  interpretation of the outcomes. The most relevant simulation problem is the presence of 199 thermodynamically infeasible loops, which make the simulated growth rate insensitive to 200 variation in essential nutrient availability constraints, as reported in [35]. For this reason, 201 we preferred to rely on a core model, extracted from Recon3D, focusing on more limited 202 aspects of metabolism, but that underwent extensive manual curation and debugging.

203
To this aim, we reconstructed the ENGRO2 metabolic network, which is a

209
Starting from ENGRO1, we reconstructed a more extended and curated 210 constraint-based core model of human metabolism. The reconstruction of the ENGRO2 211 model was based on a step-wise manual procedure using ENGRO1 model as a scaffold 212 and progressively including specific pathways or reactions from Recon3D according to 213 their relevance in literature for cancer cells. In order to associate Gene-Protein-Reaction 214 associations, we relied, when possible, on the HMR core model [37], whose GPRs were 215 manually refined, and for remaining reactions on the RECON3D model [31]. 216 The most invasive change we implemented in the model was the 217 compartmentalization of reactions and metabolites within the intracellular space. Many 218 studies support the evidence of an altered expression of some mitochondrial carriers in 219 multiple cancer cells that, most probably, arise as an adaptation to their current 220 metabolic state and the consequent new requirements [38]. Therefore, in addition to the 221 extracellular compartment, we divided the internal model side into cytosol and The reconstructed model has been refined by checking: i) the capability to reproduce 230 ENGRO1 results [36] in terms of contribution of glucose and glutamine as carbon and 231 nitrogen sources for supporting proliferative wirings and of sensitivity of the model at 232 high and low levels of these nutrients; ii) the actual essentiality of essential amino acids, 233 that is, null growth rate if they are depleted from the medium; iii) the capability to 234 reproduce experiments in literature, including the dependence of cancer phenotypes 235 from the de novo synthesis of palmitate-derived lipids rather than on an external source 236 of fatty acids, as came out in [39], the in silico simulation of the effect of an inverse 237 agonist for the nuclear receptor liver-X-receptor, whose role is to regulate the expression 238 of some key genes in the glycolysis and lipogenesis, as a putative cancer treatment 239 approach [40], the role of the creatine kinase (CK) enzyme that due to the requested 240 high amount of ATP may act as potential anticancer agent [41], and the overexpression 241 of argininosuccinate synthase enzyme as a mechanism for impairing cancer cells  Cell-relative metabolic models 249 We customized the ENGRO2 core model to obtain five cell-specific core models of the 250 cell lines under study. Because the cell-specific models must be functional to highlight 251 the metabolic differences between the cell lines, we incorporated most constraints in the 252 form of relative constraints. For this reason, the models cannot be considered as 253 cell-specific stand-alone models. Hence we refer to them as cell-relative. 254 Specifically, we integrated the following three kind of relative constraints:  [43], if the concentration of metabolite X in the medium of cell A is, 259 for instance, 20% higher than in the medium of Cell B, the maximum uptake flux 260 allowed for metabolite X in cell A will be 20% greater than that of cell B. estimated the consumption and production rate, we set constraints on their ratios, 263 namely on the glutamate/glutamine, lactate/glucose and lactate/glutamine ratios. 264 The choice of constraining the relative ratios among these metabolites rather than 265 their absolute intake or secretion rates is motivated by the limited subset of  transcriptomics-derived constraints alone are integrated well discriminates the five cell 298 lines in terms of their growth rate, as shown in the relative correlation plot in Figure 3. 299 However, predictions of growth rates improve when constraints on extracellular fluxes 300 are also added. Note that YSI-derived constraints alone are sufficient to well predict the 301 growth rates, but they are not enough to discriminate the FFD of the five models.

302
The final five cell-relative metabolic models are included in Supplementary File  (first quadrant -gray shadow in Figure 4A). Variations in these reactions must be 332 imputed to transcriptional and metabolic regulation. The heatmap in Figure 4B reports the RPSvsRAS and the RPSvsFFD concordance 340 scores (Cohen's kappa) of ENGRO2 metabolic reactions, limited to the subset (of   The maps and histograms in Figure 5B and C report the FFD values distribution what was done in [32], we compared for each amino acid its uptake flux values against 374 its contribution to biomass. The plots in Figure 6 report the results of this comparison 375 for selected amino acids, namely isoleucine, valine, leucine and tyrosine are chosen.

376
Sampled fluxes located along the bisector correspond to cases where all the internalized 377 amino acid from the extracellular environment is wholly directed towards the 378 production of biomass. This is the case of essential amino acid isoleucine ( Figure 6A). 379 Sampled fluxes located in the grey area above the bisector, as in the case of tyrosine 380 ( Figure 6D), indicate that the uptake rate of the amino acid is lower than its 381 incorporation rate within biomass. This implies that it is synthesized by other 382 metabolic processes within the network to fulfill the biomass requirement. On the 383 contrary, when the sampled fluxes are below the bisector, as it is the case for valine and 384 leucine ( Figure 6B-C), the amino acid uptake rate is higher than its incorporation rate 385 within biomass. This implies a dual role of the metabolite because in addition to 386 contributing to biomass demand it is also metabolized within the network.

387
Results in Figure 6 indicate that the five cell lines under study do not markedly 388 differ in their preference to metabolize or synthesize a given amino-acid, but only in the 389 extent at which it is metabolized or synthesized.

391
Metabolism is controlled by several interacting regulatory layers, including mechanisms 392 that control the expression and activity of each metabolic enzyme and auto-regulatory 393 features that depend on the interaction of metabolites with the enzymes [8,49,50].

394
Understanding whether a given reaction is controlled at the gene expression or at the 395 metabolic level is required not only to understand regulatory features of metabolic 396 pathways fully, but also to design appropriate actions to control metabolism in various 397 fields, including health, wellness, and biotransformations [12], where metabolism plays a 398 crucial regulatory role.

399
To enable this understanding we presented the INTEGRATE computational gene expression data, with the information on the direction of the observed variation.

421
Application of our pipeline to one non-tumorigenic cell line and four different breast 422 cancer cell lines identified fluxes that markedly differ across the five cell lines, ascribing 423 each differentially regulated flux to either transcriptional or metabolic control or both. 424 Remarkably, we identified reactions for which there is a good agreement between flux 425 variations and variations in RPSs. We remind that these two data sets are fully 426 independent.

427
INTEGRATE also complements information on the differential propensity of  The cytosolic ACONT reaction is strictly coupled to the cytosolic isocytrate 438 dehydrogenase reaction, acting as an early participant in isocitrate dehydrogenase 439 (IDH1)-dependent NADPH biosynthesis required for lipid biosynthesis [51]. 440 Consistently, knockdown of the gene encoding cytosolic isociytrate dehydrogenase in 441 preadipocytes results in decreased isocitrate dehydrogenase 1 (NADP+) mRNA levels 442 and shifts the NADPH:NADP+ ratio towards the oxidized form, impairing 443 adipogenesis [52]. These observation reinforce the involvement of cell redox balance in 444 the metabolic rewiring of cancer metabolism, as already pointed out in [36,53,54].

445
Many analyses can be conceived and performed downstream of our pipeline. As an 446 example, we analyzed fluxes that better discriminate between cell lines. We also 447 investigated the metabolism of amino acids, to analyze whether they tend to be 448 preferentially synthesized or metabolized by each cell line.

449
The direct exploitation of metabolomics represents the main novelty of our approach 450 to determine whether a flux is regulated at the metabolic level. In a first instance, we 451 intercept this information with information on transcriptional regulation. Nevertheless, 452 the pipeline can be promptly extended to integrate proteomics and phosphoproteomics, 453 allowing to characterize other hierarchical levels of regulation.

454
Knowing whether a flux is controlled at the metabolic or gene expression level is 455 mandatory in designing therapeutic strategies. If a putative therapeutic flux is 456 controlled metabolically, direct targeting the corresponding metabolic enzyme will not 457 produce any effect. On the contrary, identifying the metabolic reaction(s) that 458 indirectly affect the target reaction allows for designing an effective therapeutic 459 intervention. Hence, by integrating high-throughput omics data through mathematical 460 models, INTEGRATE makes it possible to dissect the complex and intertwined 461 regulation of metabolic networks and inform targeted strategies to counteract metabolic 462 rewiring and/or dysfunction underlying different pathological disorders.

463
The pipeline, however, can be applied to any case study. By way of example, 464 metabolic engineering efforts are already inspired by constraint-based modeling [55]. 465 Expanding the toolbox of computational tools will surely increase the success rate of 466 efforts toward the application of engineering concepts to living organisms, contributing 467 to the development of predictable, scalable, and efficient biological devices, whose 468 performance is not hampered by inadequate knowledge of the underlying design 469 principles [56]. subunits of this enzymatic complex [60,61]. In the ENGRO2 network, we included five 487 separate reactions representing the reaction catalyzed by each OXPHOS complex.

488
Because of the ROS production from the Complex I, it was necessary to add the ROS 489 detoxification pathway by means of glutathione.

490
In view of the relevance of one-carbon metabolism in cancer cells [62], we enriched 491 the ENGRO2 model by including both folate and the methionine cycles. acid oxidation contributes to the total cell NADPH pool because of acetyl-CoA that, 504 once produced, enters the TCA cycle and is converted with the oxaloacetate to citrate. 505 The high relevance of NADPH is linked to its role in providing redox power for tumour 506 cells to counteract the oxidative stress. For these reasons, we included the mitochondrial 507 beta-oxidation pathway in ENGRO2 for degradating fatty acids that in the model are 508 represented under the form of palmitate.

509
Another recent finding that we considered during the reconstruction of ENGRO2 510 regards the disregulation of the polyamines metabolism and its requirement under the 511 neoplastic condition [64]. In view of these evidence, we added reactions belonging to the 512 polyamines metabolism in the ENGRO2 model.
We remark that we do not ask the network to support a percentage of maximal 534 possible biomass production rate as often done in other studies. We simply prevent 535 biomass to be completely null, by setting the biomass synthesis lower bound to 10 −4 .

536
This small level should be high enough to be distinguished form numeric instability. the reaction to take place. On the contrary OR operator is used when distinct genes 544 encode multiple isoforms of the same enzyme, entailing that either isoform is enough for 545 the reaction catalysis. These logical operators can be combined to describe more 546 complex scenarios involving both isoforms and subunits.

547
According to [31], we combined the RNA-seq datasets with the defined GPR rules glutamine, we proceeded as follows:

574
• Firstly, we took the concentration values of the two nutrients glucose, glutamine 575 and of the two byproducts glutamate and lactate previously collected from the 576 YSI analyzer of spent medium.

577
• Treating the two biological replicas separately, we computed for each of the three 578 technical replicas the mean concentration difference of the considered metabolites 579 between initial time and after 48 hours of growth to get their utilization or 580 formation rate. Focusing on the ratio between these metabolites rather than on 581 their absolute uptake or secretion rates, we limited to consider the mean 582 concentration difference values between time 0 h and 48 h without dividing them 583 by the integral of cells number.

584
• We computed the lactate produced over glucose consumed, lactate produced over 585 glutamine consumed and glutamate produced over glutamine consumed ratios for 586 the two biological replicas.

587
• We then added these experimentally determined flux ratios as further constraints 588 to the model to ensure that exchange reactions involved in the considered ratios 589 follow the experimental values. To give an example, the lactate to glucose ratio is 590 included within the model as constraint by considering the following expression: where the ratio of the flux value v ExLac of the lactate secretion reaction over the 592 flux value v ExGlc of the glucose consumption reaction ranges between minus one 593 and plus one standard deviation σ Lac/Glc of the mean lactate to glucose ratio 594 x Lac/Glc of the two biological replicas.

595
Trascriptomics-derived constraints 596 Assuming that it is possible to modulate the reactions activity by transcription 597 mechanisms, it is possible to exploit the RAS c r score to further constraint the extremal 598 fluxes, v L , v U obtained through FVA, to represent this process and better mimic the 599 cell lines behaviour: In this case, for the internal reactions when a GPR rule exists for a given reaction, 601 its lower and upper bounds are scaled according to its computed RAS.

602
The only exceptions to application of this constraints are the boundaries for reactions 603 CARPEPT1tc and HIStiDF for which we ignored the RAS contribution because the 604 constraints were limiting too severely the growth of some of the cellular lines.

605
Generation of FFD dataset, via random sampling 606 Once the cell-relative models were created, for following analyses, we converted them 607 into irreversible models, in which reversible reactions are represented with two distinct 608 and complementary forward reactions.

609
The ability to properly sample the constrained null space of S is of paramount 610 importance to obtain correct fluxes distributions.

611
In this work, we exploited the implementation of optGpSampler algorithm [66]   The following assumptions allow one to analytically estimate relative fluxes from 621 relative abundances:

622
• for each reaction r in the system, the mass action law is assumed: sr,q , where k r is the kinetic constant of reaction r and X q is the 624 q th substrate of the N total substrates of reaction r, and s r,q is the stoichiometric 625 coefficient of substrate X q in reaction r i.e., how many molecules of the substrate 626 partake to the reaction;

627
• the kinetic constant k r of a given reaction r is assumed to not vary between two 628 steady states i and j.

629
Given such assumptions, the variation between the flux of an irreversible reaction r 630 in two steady states i and j can thus be computed as the ratio v i r /v j r : which does not depend on k r .

632
It goes without saying that if the numerator is higher than the denominator, than 633 flux v r in steady state i is higher than flux v r in steady state j. Therefore, in order to 634 compare the susbtrate contribution to the reaction rate in different cell lines, we ([X q ]) sr,q .
Of course, the above assumptions not often hold. However, RPS r allow us to predict 638 whether the flux of a reaction in a cell line is expected to be different as compared to 639 another cell line, just considering substrate availability. Nothing prevents the observed 640 flux variation to be consistent with the RPS variation, even if the enzyme activity 641 actually differs between the two cell lines under study.

642
Concordance analysis 643 We wanted to test whether the variation in the value of a reaction is consistent between 644 different datasets. Statistical evaluation of the concordance between two techniques 645 used to measure a particular variable, under identical circumstances, has been largely 646 addressed in literature [68]. We decided to treat our measurements, i.e. the log ratio of 647 the flux, as up, down or no-change.

648
Concerning the FFD values, we performed the Mann-Whitney U test [69] 649 (p-value < 0.05) between the distributions of each pair of the five investigated cell lines 650 to determine if the FFD of every reaction differed significantly between the two cell 651 lines. In parallel, the log 2 of the absolute ratio of the median flux of each reaction in 652 the two cell lines is also computed.

653
Concerning the RAS and the RPS values, we performed a t-test (p-value < 0.05) for 654 the means of each pair of RAS and RPS samples taken for all the pairwise combinations 655 of the five investigated cell lines to determine if the values differed significantly between 656 the two cell lines. We also computed the log 2 of the absolute ratio of the mean RAS 657 and RPS values of each reaction in the two cell lines.

658
Hence, we registered the sign of the variation for each pair of the five cell lines under 659 study (for a total of 10 pairs) according to each of the three datasets. At first instance, 660 a positive sign is registered if the distribution of samples values of first member of the 661 comparison is statistically higher and if the average or median value is at least 20% 662 higher. A negative sign is registered if the distribution of samples values of the first 663 member of the comparison is statistically lower and if the average or median value is at 664 least 20% lower. A 0 is registered otherwise. We used a relaxed threshold for the 665 fold-change as in [31] because even a difference of 20% in genes encoding members of a 666 metabolic pathway may dramatically alter the flux through the pathway. Yet, this 667 parameter can be modified arbitrarily. 668 We quantified the level of concordance of the 10 variation signs (1 for each pair of 669 cell lines) for a given pair of datasets by means of the Cohen's kappa metric, which has 670 been commonly used to measure inter-rater reliability for qualitative (categorical) items. 671 .

Ranking of discriminative fluxes
To highlight the fingerprinting role of reaction r, we chose to keep into account only 679 the contribution to counts for which cell line c * is greater than the contribution of other 680 cell lines C \ c * : 681