Architect: a tool for producing high-quality metabolic models through improved enzyme annotation

Motivation: Constraints-based modeling is a powerful framework for understanding growth of organisms. Results from such simulation experiments can be affected at least in part by the quality of the metabolic models used. Reconstructing a metabolic network manually can produce a high-quality metabolic model but is a time-consuming task. At the same time, current methods for automating the process typically transfer metabolic function based on sequence similarity, a process known to produce many false positives.

Results: We created Architect, a pipeline for automatic metabolic model reconstruction from protein sequences. First, it performs enzyme annotation through an ensemble approach, whereby a likelihood score is computed for an EC prediction based on predictions from existing tools; for this step, our method shows both increased precision and recall compared to individual tools. Next, Architect uses these annotations to construct a high-quality metabolic network which is then gap-filled based on likelihood scores from the ensemble approach. The resulting metabolic model is output in SBML format, suitable for constraints-based analyses. Through comparisons of enzyme annotations and curated metabolic models, we demonstrate improved performance of Architect over other state-of-the-art tools.

Availability: Code for Architect is available at https://github.com/ParkinsonLab/Architect.

Contact: john.parkinson@utoronto.ca

Supplementary information: Supplementary data are available at Bioinformatics online.


INTRODUCTION
Metabolic modeling has been used for engineering strains of bacteria for bioremediation, for understanding what drives parasite growth, as well as for shedding light on how disruptions in metabolism.

Based on these scores, a genome-specific metabolic model is then reconstructed by removing reactions that are either not identified or poorly supported, and adding in reactions to fill gaps to construct functional pathways (Machado et al., 2018).

A key step in this process is the accurate identification of enzymes based on sequence data alone and can be formally defined as follows: given an amino acid sequence, what are its associated enzymatic function(s), if any? The problem is a multi-label classification problem; here we consider enzymatic functions as defined by the Enzyme Commission (EC), in which enzymes are assigned to EC numbers representing a top-down hierarchy of function (Bairoch, 2000). Enzyme

Architect not only designs the metabolic model of an organism, but it also coordinates the sequence of steps that go towards the SBML output given user specifications, such as the definition of an objective function for gap-filling. We evaluate the performance of Architect both in terms of its ability to perform accurate enzyme annotations, relying on UniProt/SwissProt sequences as a gold

and their corresponding annotations from the ENZYME database (Bairoch, 2000) (downloaded on February 9th, 2021). Only complete EC numbers were considered in building Architect's ensemble classifiers. Further, ECs associated with fewer than 10 protein sequences were removed to ensure sufficient training data. This filtering resulted in a final collection of 1,670 ECs represented by 207,121 sequences (Supplemental Figure 1). A further set of 294,067 protein sequences not associated with either complete or partial EC annotations (subsequently referred to as "non-

(Yu, Zavaljevski, Desai, & Reifman, 2009).
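The training-set filtering described above (complete EC numbers only, and at least 10 sequences per EC) can be illustrated with a minimal sketch. The function name, the input format (a dictionary from protein ID to EC strings) and the regular expression for a "complete" EC number are assumptions made for illustration; they are not taken from Architect's code.

```python
import re
from collections import defaultdict

# Assumption: a "complete" EC number has four numeric fields, e.g. "2.7.1.1".
# Partial numbers such as "2.7.1.-" do not match and are ignored.
COMPLETE_EC = re.compile(r"^\d+\.\d+\.\d+\.\d+$")


def filter_training_ecs(annotations, min_sequences=10):
    """Group proteins by complete EC number and drop sparsely populated ECs.

    annotations: dict mapping a protein ID to an iterable of EC strings
                 (hypothetical input format).
    Returns a dict mapping each retained EC to the set of its protein IDs.
    """
    proteins_by_ec = defaultdict(set)
    for protein_id, ecs in annotations.items():
        for ec in ecs:
            if COMPLETE_EC.match(ec):
                proteins_by_ec[ec].add(protein_id)
    # ECs with fewer than `min_sequences` proteins are removed.
    return {
        ec: proteins
        for ec, proteins in proteins_by_ec.items()
        if len(proteins) >= min_sequences
    }
```

A protein annotated with several complete ECs contributes to each of them, which is consistent with the multi-label framing of the problem.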
In addition to examining the performance of two relatively simple approaches, majority rule (in which we take the prediction supported by the most tools) and EC-specific best tool (in which we take the prediction from the tool which is found to perform best for a specific EC), we also investigated the performance of the
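The two simple approaches named above, majority rule and EC-specific best tool, can be sketched as follows. This is an illustrative sketch only: the function names, the input format and the handling of ties (all top-voted ECs are kept) are assumptions, and the lookup table of best tools per EC is hypothetical.

```python
from collections import Counter


def majority_rule(predictions):
    """Return the EC(s) supported by the most tools for one protein.

    predictions: dict mapping a tool name to the set of ECs it predicts.
    Ties are kept (an assumption; the text does not specify tie handling).
    """
    votes = Counter(ec for ecs in predictions.values() for ec in set(ecs))
    if not votes:
        return set()
    top = max(votes.values())
    return {ec for ec, count in votes.items() if count == top}


def best_tool_rule(predictions, best_tool_for_ec):
    """Keep an EC only when it is predicted by the tool judged best for that EC.

    best_tool_for_ec: dict mapping an EC to a tool name, e.g. derived from
    training-set performance (hypothetical lookup table).
    """
    return {
        ec
        for tool, ecs in predictions.items()
        for ec in ecs
        if best_tool_for_ec.get(ec) == tool
    }


preds = {
    "DETECT": {"1.1.1.1"},
    "EnzDP": {"1.1.1.1", "2.7.1.1"},
    "PRIAM": {"2.7.1.2"},
}
print(majority_rule(preds))  # {'1.1.1.1'}
print(best_tool_rule(preds, {"1.1.1.1": "PRIAM", "2.7.1.1": "EnzDP"}))  # {'2.7.1.1'}
```

In the example, 1.1.1.1 wins the majority vote with two tools, whereas the best-tool rule rejects it because the tool assumed to be best for that EC (PRIAM) did not predict it.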

Other ensemble methods explicitly consider the level of confidence by each tool (see

Having generated an initial network, Architect next attempts to fill gaps within the network, representing reactions required to complete pathways necessary for the production of essential

The penalty for adding the ith reaction is then inversely proportional to the normalized score:

of test sequences with lower sequence similarity to training sequences (Supplemental Figure 2). Additionally, macro-recall on multifunctional proteins is decreased for the naïve Bayes, logistic regression and random forest classifiers when applying a heuristic which filters out predicted ECs other than the top-scoring EC and frequently co-occurring enzymes as seen in the training data (Supplemental Figures 4 and 5 and Supplemental Text); therefore, henceforth, we evaluate performance of these classifiers by considering all their high-confidence EC predictions.
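To make the effect on multifunctional proteins concrete, the sketch below computes macro-precision and macro-recall for multi-label EC predictions. It assumes the usual macro-averaging convention (per-EC scores averaged over ECs, skipping ECs for which a score is undefined); the function name and input format are hypothetical, and the exact definition used in the evaluation may differ.

```python
def macro_precision_recall(true_labels, predicted_labels):
    """Macro-averaged precision and recall over EC classes.

    true_labels, predicted_labels: dicts mapping a protein ID to a set of ECs.
    An EC with no predictions is skipped for precision; an EC with no true
    annotations is skipped for recall (an assumed convention).
    """
    proteins = set(true_labels) | set(predicted_labels)
    all_ecs = set()
    for labels in list(true_labels.values()) + list(predicted_labels.values()):
        all_ecs |= set(labels)

    precisions, recalls = [], []
    for ec in all_ecs:
        tp = fp = fn = 0
        for protein in proteins:
            is_true = ec in true_labels.get(protein, ())
            is_pred = ec in predicted_labels.get(protein, ())
            tp += is_true and is_pred
            fp += is_pred and not is_true
            fn += is_true and not is_pred
        if tp + fp:
            precisions.append(tp / (tp + fp))
        if tp + fn:
            recalls.append(tp / (tp + fn))

    macro_p = sum(precisions) / len(precisions) if precisions else 0.0
    macro_r = sum(recalls) / len(recalls) if recalls else 0.0
    return macro_p, macro_r


# Protein "a" is bifunctional; only its top-scoring EC is predicted.
truth = {"a": {"1.1.1.1", "2.7.1.1"}, "b": {"1.1.1.1"}}
preds = {"a": {"1.1.1.1"}, "b": {"1.1.1.1", "3.1.1.1"}}
print(macro_precision_recall(truth, preds))  # (0.5, 0.5)
```

In this toy case, dropping the second EC of the bifunctional protein gives EC 2.7.1.1 a recall of 0, which halves macro-recall; this is the kind of loss that a top-scoring-only filter can introduce.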

Next, we consider the possibility that higher predictive range (defined as the number of ECs that a tool can predict) primarily drives the increased performance of the ensemble methods. Indeed, the ensemble approaches are superior when quantifying performance on ECs predictable by at least 2 tools (Supplemental Figure 3) but have similar precision and recall to DETECT on sequences annotated with ECs predictable by all tools (Supplemental Figure 4) Figure 6).

Given that the main application of Architect is to annotate enzymes in an organism's proteome, we were interested in assessing the ability of the ensemble approaches to minimize false positives.

Applied to a set of proteins without EC annotations in SwissProt, we found that only the naïve Bayes classifier gave specificity comparable to that of the individual tools (Supplemental Figure 8). Given the slightly elevated performance in terms of precision (for the enzymatic dataset) and

Architect yields both higher precision and recall than DETECT, and higher recall than EnzDP.

Bayes-based method using predictions from fewer tools, then calculated performance once again on the held-out test set (Supplemental Figure 7). We observe that this procedure has a greater impact on macro-recall than macro-precision. In particular, leaving out predictions from both tools

and PRIAM compared to CarveMe, a factor we account for by next using the BiGG database for Architect's model reconstructions.

Turning to models constructed with the BiGG database, as for the KEGG-based models, we find that Architect has higher precision for both C. elegans and N. meningitidis, and higher recall for the former (Supplemental Figure 12)

Here, we present Architect, an approach for automatic metabolic model reconstruction. The tool consists of two modules: first, enzyme predictions from multiple tools are combined through a user-specified ensemble approach, yielding likelihood scores which are then leveraged to produce a simulation-ready metabolic model. Through the use of various gold-standard datasets, we have shown that Architect's first module produces more accurate enzyme annotations, and that its second module can be used to produce organism-specific metabolic models with better annotations than similar state-of-the-art reconstruction tools, including CarveMe and PRIAM. Our expectation is that these models serve as near-final drafts, requiring users to perform only minimal curation to incorporate organism-specific data. For example, models for eukaryotic organisms may require the independent definition of cellular compartments. Interestingly, it is unclear whether improvements in enzyme annotation, other than in terms of predictive range, lead to the construction of models with either improved annotations or greater accuracy of simulations.

Instead, we propose three improvements to the input and the algorithm of the model reconstruction module that will likely yield better models. First, we find that most essential genes also incorporated into the final models were not predicted to be essential in silico (see Supplemental