Reconstruction of consensus tissue-specific metabolic models

Genome-Scale Metabolic Models have shown promising results in biomedical applications, such as understanding cancer metabolism and drug discovery. However, to take full advantage of these models there is the need to address the representation and simulation of the metabolic phenotypes of distinct cell types. With this aim, several algorithms have been recently proposed to reconstruct tissue-specific metabolic models based on available data. Here, the most promising were implemented and used to reconstruct models for two case studies, using omics data from distinct sources. The set of obtained models were compared and analyzed, being shown they are highly variable and that no combination of algorithm and data source can achieve models with acceptable phenotype predictions. We propose an algorithm to achieve a consensus model from the set of models available for a given tissue/cell line, and to improve it given functional data (e.g. known metabolic tasks). The results show that the resulting models are more accurate, both considering the prediction of known metabolic phenotypes and of experimental data not used in the model construction. Two case studies used for model validation consider healthy hepatocytes and a glioblastoma cell line. The open-source implementation of the algorithms is provided, together with the models built, in a software container, allowing full reproducibility, and representing by itself a contribution for the community.


Introduction
successes, the full usefulness of human GSMMs lies on the ability to create methods for 23 the phenotype simulation of distinct cell types. During the last decade, a few methods 24 have been proposed towards this aim, integrating generic models with omics data. Here, 25 data are used to generate constraints refining GSMMs, to predict the phenotype of cell 26 types, as it is the case with iMAT [14] and GIMME [16]. However, these only allow the 27 phenotype prediction for conditions for which data was measured, and have revealed 28 less convincing results when tested in prediction benchmarks [17]. 29 An alternative lies in methods that create tissue-specific GSMMs, restricting the 30 metabolic portfolio of the original model (called template) to reactions occurring in that 31 cell type, as identified from omics data or biochemical knowledge/ literature. A number 32 of methods have been proposed in the last years to address this task. MBA [18] uses an 33 heuristic approach to prune the original set of reactions, based on evidences from 34 literature and data, and was first used to create an hepatocyte model. Similarly, a 35 variant of GIMME was used to create models for brain cells towards an improved 36 understanding of Alzheimer disease [19]. More recently, the INIT algorithm was applied 37 for the reconstruction of GSMMs for 69 cell types [11], based on mixed integer-linear 38 programming, using proteomics and metabolomics data. The task-driven INIT (tINIT) 39 is an extension of the previous algorithm [20] that allows to define metabolic tasks that 40 the resulting models need to be able to achieve. mCADRE [21] and FASTCORE [22] 41 are based on the ideas of MBA, but reformulate the problem to make the algorithms 42 computationally more efficient, and thus have been used to reconstruct models in a 43 larger scale. A recent method, PRIME [23], uses cell growth measurements, and their 44 correlations with gene expression, to filter the relevant reactions. 45 These methods have been used in a number of applications, many of which are 46 related to drug discovery research, mainly to identify metabolic targets that can inhibit 47 cancer cell growth. A first study built a generic cancer model [24], using a variant of Still, in spite of these successful applications, no algorithm for tissue-specific 53 reconstruction has shown a consistent ability to generate accurate models, able to 54 provide phenotype predictions according to known metabolic behaviour, taken from 55 literature or available data. Also, the models provided by different methods and data 56 sources have shown a very high level of variability in some preliminary studies [26], 57 which may suggest that an ensemble based approach might lead to better models. 58 Here, the most promising algorithms (MBA, tINIT, mCADRE and FASTCORE) 59 were used to reconstruct tissue-specific metabolic models for two case studies, 60 hepatocytes and a glioblastoma cell line, resorting to different data sources. The models 61 generated in each case were analyzed and compared in different perspectives.

62
A new algorithm to combine the set of obtained models was proposed and applied to 63 achieve a consensus metabolic model for each case study. The resulting models were 64 validated using knowledge about known metabolic tasks in liver and omics data not 65 used in the model building process (e.g. metabolomics and growth rates). 66

67
Tissue-specific reconstruction methods

68
The reconstruction of specific metabolic models is a process with four main steps 69 (Fig 1): 70 1. Collect the data from the repositories and convert the gene identifiers to the 71 notation used in the template metabolic model.  3. Build the core reaction sets, based on the assumptions described in Table 1. In 77 this step, some additional configurations are required depending on the selected 78 algorithm. In this work, the metabolic tasks used in the tINIT algorithm were 79 retrieved from [20]. Additionally, a set of metabolites were set in mCADRE and 80 tINIT algorithms as described in the original publications [20,21]. The Model-Building Algorithm (MBA) [18] reconstructs a tissue-specific metabolic 88 model from a generic model by integrating a variety of tissue-specific molecular data 89 sources. The first step of this algorithm is to infer, from the tissue-specific data, two 90 sets of reactions denoted as the core reactions (C H ) and reactions that have a moderate 91 probability to be carried out in the specific tissue (C M ). This division is made 92 according to the accuracy level of the input data. In general, the C H set includes 93 human-curated tissue-specific pathways and the C M set includes reactions certified by 94 molecular data. The aim of this method is to find the most parsimonious tissue-specific 95 consistent model, which includes all the tissue-specific high-probability reactions (C H ), 96 The reconstruction of tissue-specific metabolic models has four main steps: collect data; convert scores from gene to reaction level; build the core reactions set and run the selected algorithm. applied to integrate different types of "omics" data through the core set compilation by 141 the user, and there is no need to define parameters except the flux threshold , which is 142 used to guarantee the required minimum flux.

143
Omics data sources 144 Transcriptomics are, certainly, the most widely available type of omics data. Using

145
DNA microarrays or RNA-sequencing allows the quantification of gene expression levels 146 in different conditions [27,28]. However, mRNA molecules are not always translated 147 into proteins [29], and therefore the amount of protein produced depends on the gene 148 expression and the current state of the cell. Thus, the knowledge about the amounts of 149 proteins in the cell, provided by proteomics data [30], is of foremost relevance. These 150 data can confirm the presence of proteins and quantify the amount of proteins within a 151 cell.

152
The input data used by the reconstruction algorithms present in our framework were 153 retrieved from the Human Protein Atlas (HPA) [31] and Gene Expression Barcode 154 (GEB) [32] databases. The information collected from HPA is scored as "supportive" or 155 "uncertain", depending on the similarity in immunostaining patterns and consistency 156 with protein/gene characterization data. For the present study, only the data classified 157 as "supportive" was used. Moreover, after the conversion of Ensemble gene identifiers to 158 gene symbols, duplicated genes with different evidence levels were removed.

159
Regarding the GEB data, the conversion to gene expression levels was done 160 considering the average level of probe sets for each gene. The mapping between probe 161 sets and gene symbols was performed using the library "hgu133plus2.db" from 162 Bioconductor. Besides these two data sources, the reaction sets described in [18] were 163 used in the reconstruction of the hepatocytes metabolic model, one of the case studies 164 in this work.

165
The context-specific reconstruction methods use different formats of input data.  This diversity in the input formats leads to the need for data transformation, to 173 allow their use in different methods. The continuous data from GEB was classified as 174 High, Moderate and Low, if the gene expression evidence on that tissue is greater than 175 0.9, between 0.5 and 0.9, and between 0.1 and 0.5, respectively. The genes with 176 expression evidence below 0.1 were considered not expressed. The core reaction sets 177 used in mCADRE and FASTCORE methods were built considering the union of "High" 178 and "Moderate" gene evidences from HPA, or gene expression evidence greater or equal 179 to 0.5 from GEB, through the GPR association present in the template model. 180 Table 1 summarizes the assumptions used to create the input data sets for each 181 algorithm. Applying these transformation rules is possible to adapt different input data 182 sources, such as HPA, GEB and C H and C M sets, for all methods.
183 Table 1. Core sets used as input data. Algorithm This table summarizes the assumptions and thresholds used to create the sets used as inputs by the different algorithms.

Consensus algorithm
184 A consensus metabolic model can be reconstructed based on models obtained by the 185 combination of different algorithms and data sources. The main idea is to build a model 186 starting with the reactions present in most of the models, and iteratively append a set 187 of reactions to the final model so that it will be able to perform all the metabolic tasks 188 given as input to the algorithm.   function buildFinalModel(pM odels, n, LM T , LRS)  The consensus algorithm here proposed, together with the set of previous algorithms 218 described above, were implemented in the Java language by the authors in an integrated 219 platform to enable comparison of their results. The code is available in GitHub 220 repository (https://github.com/saragcorreia/consensus tsmm). Moreover, a Docker 221 image are available in the repository Docker Hub (saracorreia/consensus tsmm/), with 222 all source code, scripts, data and models, for each of the case studies, allowing to fully 223 reproduce the results of this study. These resources represent, by themselves, a 224 contribution of this work, as they make available in a single platform a set of algorithms 225 to reconstruct tissue-specific models that were previously hard to compare since some  Steps to reconstruct the consensus metabolic model. A) Build the lost reaction set (LRS) and the lost metabolic task set (LMT) between each pair of partial models. B) For each pM odel i l the process finds the reactions from LRS i that can be removed from pM odel i without affecting the metabolic tasks present in (LM T i ). The set of reactions that can be removed will be added to the LRS set in the next iteration.
glioblastoma case study, this increased the possibilities to compare the results with 234 already published metabolic models [21,23].

235
About 78% of the liver tissue is formed by hepatocyte cells that are the principal site 236 of the metabolic conversions underlying the diverse physiological functions of the 237 liver [33]. A consensus hepatocyte metabolic model was here reconstructed using 238 different omics data sources and algorithms, also evaluating the effects that each of 239 those data types has in the resulting tissue-specific model.

240
The hepatocytes metabolic models were generated using Recon 1 and combining 241 different data sources (GEB, HPA and the C H and C M sets from [18]). The four 242 algorithms described above (MBA, tINIT, mCADRE and FASTCORE) were used, 243 leading to the creation of twelve distinct metabolic models were obtained.

244
Following the same approach, eight metabolic models were reconstructed for U-251 245 glioblastoma cell lines, considering HPA and GEB as data sources and the same four 246 algorithms previously mentioned. Glioblastoma (GBM), also known as astrocytoma 247 grade IV, is the most common and aggressive type of brain cancer in adults [34]. 248 Despite the advances in the study of this type of cancer, it remains largely incurable.

249
The consensus algorithm described in the previous section was applied to reconstruct 250 consensus metabolic models for both hepatocytes and U-251 glioblastoma cell lines. In 251 each case, this algorithm took as input the models created using the different algorithms 252 and data sources, as described above.  Analyzing the reaction evidence levels in this case, the overlap of reactions that support 272 the inclusion in the tissue-specific model is around of 97% and 99% for HPA and GEB, 273 respectively ( Fig 3A; lower row). However, if we take into account the expression 274 evidence levels, the number of reactions with the same evidence level is surprisingly low 275 (Fig 3 B and C; lower row). The lower overlap in the high and moderate sets of 276 reactions considering the cutoffs of "High"/ 0.9 and "Medium"/ 0.5 from data retrieved 277 from HPA/GEB can have a significant impact in the resulting models, independently of 278 the used algorithm.

279
Hepatocytes tissue-specific metabolic models 280 For the hepatocytes, the twelve metabolic models built in this work have between 1178 281 and 2139 reactions, as shown in Table 2. A more detailed comparison between the 282 models reconstructed using the same algorithm or the same data source is available in 283 Fig 4, A and B    i.e with more overlap, than using the same algorithm for different omics data sources. 293 Furthermore, the mean of reactions that belong to all models of the same algorithm is 294 around 49%, and around 60% when the models are grouped by data source. Again, the 295 variability of the final results seems to be dominated by the data source factor. This can 296 also be observed performing a hierarchical clustering of the models (Fig 5 A). A set of metabolic tasks known to occur in hepatocytes was previously presented by 298 Gille et al. [35]. Some of these tasks are impossible to satisfy using Recon 1 for the 299 phenotype simulations, because these include metabolites which are not present in the 300 model. Thus, these tasks will not be considered in the validation process. We have also 301 removed disease related tasks, as these do not configure the regular behaviour of 302 hepatocytes.

303
The Recon 1 template model is able to satisfy 281 of the remaining 363 metabolic 304 functions tested, and therefore this number represents the maximum that may be 305 achieved by any model derived from Recon 1 as the template. This set of 281 metabolic 306 tasks was used to validate each hepatocytes metabolic model, running simulations with 307 it to assess if each task could be satisfied.

308
The relationship between the model size and the number of satisfied tasks for all 309 reconstructed models is presented in Fig 5 B. FASTCORE is able to produce consistent 310 models independently of the input data. tINIT also has a significant percentage of valid 311 tasks when the data source is HPA. However, generically, the number of satisfied 312 metabolic tasks is very low compared with the performance of the template model 313 Recon 1.

314
Using the consensus algorithm proposed in this work, we reconstructed the final 315 hepatocytes consensus model based on the twelve partial models. The consensus model 316 has 1859 reactions satisfying the whole set of 281 metabolic tasks (Fig 5 B). So, this  Comparing the models reconstructed with the same data source, tINIT and MBA 326 algorithms produce models with a higher number of exclusive reactions, i.e., reactions 327 present in a single model. The number of reactions shared by all models for each data 328 source is similar, 913 and 804 for HPA and GEB, respectively. However, the intersection 329 of these two sets is only of 577 reactions.  One of the hallmarks of cancer is the ability that cancer cells have to proliferate. To 337 address this issue, FBA simulations were done to test the growth rate (through 338 maximization of the biomass flux) of each model. The biomass equation was collected 339 from the Recon 2 metabolic model and the RPMI-1640 medium [36] has been 340 considered in all simulations.

341
As a result, none of the models was able to produce biomass which indicates there 342 are gaps in the models towards the production of some biomass precursor metabolites 343 (essential for growth). We then tested how many biomass precursors could be produced 344 by each metabolic model, by adding artificial drain reactions to excrete each biomass 345 precursor and simulating the maximization of these reactions. Table 3 presents the 346 number of biomass precursors produced by each of the U-251 metabolic models.

347
The U-251 model generated by the tINIT algorithm using HPA has the highest 348 number of biomass precursors satisfied. This was expected since this model has 349 approximately 500 more reactions than the remaining models.  the reconstruction of the partial models (pM odel i i ∈ 1, ..8) as described in the 357 Methods section.

358
In this case, the tasks will be defined as the production of biomass precursors. So, 359 we checked how many biomass precursors each partial model was able to produce.  Other tissue-specific metabolic models 381 Glioblastoma GSMMs were already reconstructed in previous studies [21,23]. The 382 glioblastoma tumor cells and U-251 cell line GSMMs reconstructed by mCADRE and 383 PRIME algorithms were used to perform a comparison with our consensus model. The 384 overlap between these glioblastoma metabolic models is provided in Fig 8 B). 385 Analyzing the model obtained by PRIME, we verified that the Recon 1 template  To further validate our model we decided to perform additional phenotype 392 simulations, using the methods pFBA [5], iMAT [37], GIMME [38] and E-Flux [39], 393 which are able to integrate omics data to improve prediction results. Transcriptomics 394 data published by Gholami et al. [40], which were not used for model reconstruction, 395 were used as inputs. Experimental flux values published by Jain et al. [41] were used to 396 compare with the flux exchange rates predicted by the different methods and the 397 normalized prediction errors were calculated using the equation:

14/21
where v exp is the vector of measured flux values and v sim is the vector of predicted 399 values.

400
In this study, the glioblastoma metabolic model obtained by the mCADRE 401 reconstruction method was not considered, because this model is not able to grow when 402 simulated using the Recon2 biomass equation, even with the removal of the metabolites 403 present in the biomass equation and not in the model. The normalized prediction errors 404 are given in Table 4. The best combinations have a normalized error around 0.7-0.8. The one reaching a 406 lower prediction error was obtained with our consensus U-251 metabolic model using 407 the iMAT simulation method. Although no definitive conclusions may be taken from 408 these results, this contributes to provide a higher confidence in our consensus model.

409
Critical genes 410 The validation of metabolic models is a hard task when fluxomics data are not available. 411 So, further tests were done to check if the consensus metabolic model has a better 412 phenotype prediction capability than the global model Recon 1.

413
As a final test, we calculated the predicted critical genes of both models (our 414 consensus glioblastoma model vs the Recon 1). We considered to be critical (or 415 essential) the genes that inhibit growth when they are removed from the model. We signalling and invasive phenotype of brain tumor cells could be regulated by this 428 gene [42]. Moreover, silencing the G6PT gene in U-87 brain tumor-derived glioma 429 cells may induce necrosis and late apoptosis [43] involved in the production of phosphatidylserine. This gene is involved in a patent 456 related to the development of a molecular-based method of cancer diagnosis and 457 prognosis. Together with five others genes, the PTDSS1 has a higher expression in 458 tumor samples when compared with control samples [56].

459
• The CTPS gene encodes an enzyme responsible for the conversion of UTP 460 (uridine triphosphate) to CTP (cytidine triphospate). The development of 461 methods and pharmaceutical compositions to inhibit the lymphocyte proliferation 462 through the CPTS1 inhibitors has been protected by a patent [57].

463
• The NME2 / NME1 genes were identified as potential tumor suppressors, which 464 reduce the tumor progression and proliferation [58]. Thus, it was unexpected that 465 these genes were essential for the metabolic model. To understand this result, we 466 did a deep analysis of the reactions where these genes are involved. The two genes 467 regulate the activation of nucleoside-diphosphate kinase reactions in the nucleus. 468 These reactions are responsible to produce essential metabolites present in 469 biomass equation, namely Deoxyguanosine triphosphate (dGTP), Deoxycytidine 470 triphosphate (dCTP), Deoxyadenosine triphosphate (dATP) and Deoxythymidine 471 triphosphate (dTTP). These metabolites are used in cells for DNA synthesis.

473
In this work, a critical assessment of the most important methods for the reconstruction 474 of tissue-specific metabolic models was performed. Moreover, the consistency of 475 information across important omics data sources was analyzed and these data were used 476 to verify the impact of such differences in the final metabolic models generated by each 477 method. The results show that metabolic models obtained depend more on the data 478 sources used as inputs than on the algorithms used for the reconstruction. To validate 479 the performance of the obtained metabolic models regarding phenotype prediction, a set 480 of metabolic functions was tested for each metabolic model. Generically, the number of 481 satisfied metabolic functions was surprisingly low. This shows that existing methods for 482 the reconstruction of tissue-specific metabolic models, based on a single omics data 483 source, are not enough to generate high quality metabolic models.

484
Here, a strategy to build a final metabolic model using the combination of generated 485 models through different algorithms and data sources was presented. This process 486 shows that with a similar number of reactions, it is possible to achieve a final model 487 capable of satisfying all possible metabolic tasks.

488
A variant of the previous method was also used to reconstruct a consensus model of 489 glioblastoma cell lines. The method was able to find a consistent model, able to sustain 490 growth of cancer cells, with around half of the reactions from the template model. The 491 consensus model showed a good predictive ability of flux distributions, combined with 492 further omics data, competitive with previous approaches. Also, it allowed to uncover 493 several candidate essential genes of glioblastoma cell lines, a few of which have been 494 previously identified in literature as relevant targets or biomarkers. 495 Overall, the method here proposed provides an additional tool in modeling efforts in 496 cancer research and drug development, providing a way to build more robust metabolic 497 models given available omics data sources and phenotypic data.