Bactabolize: A tool for high-throughput generation of bacterial strain-specific metabolic models

Metabolic capacity can vary substantially within a bacterial species, leading to ecological niche separation, as well as differences in virulence and antimicrobial susceptibility. Genome-scale metabolic models are useful tools for studying the metabolic potential of individuals, and with the rapid expansion of genomic sequencing there is a wealth of data that can be leveraged for comparative analysis. However, there exist few tools to construct strain-specific metabolic models at scale. Here we describe Bactabolize (github.com/kelwyres/Bactabolize), a reference-based tool which rapidly produces strain-specific metabolic models and growth phenotype predictions. We describe a pan reference model for the priority antimicrobial-resistant pathogen, Klebsiella pneumoniae (github.com/kelwyres/KpSC-pan-metabolic-model), and a quality control framework for using draft genome assemblies as input for Bactabolize. The Bactabolize-derived model for K. pneumoniae reference strain KPPR1 performed comparatively or better than currently available automated approaches CarveMe and gapseq across 507 substrate and 2317 knockout mutant growth predictions. Novel draft genomes passing our systematically-defined quality control criteria resulted in models with a high degree of completeness (≥99% genes and reactions captured compared to models derived from matched complete genomes) and high accuracy (mean 0.97, n=10). We anticipate the tools and framework described herein will facilitate large-scale metabolic modelling analyses that broaden our understanding of diversity within bacterial species and inform novel control strategies for priority pathogens.

Metabolic capacity can vary substantially within a bacterial species, leading to ecological niche 18 separation, as well as differences in virulence and antimicrobial susceptibility. Genome-scale 19 metabolic models are useful tools for studying the metabolic potential of individuals, and with the 20 rapid expansion of genomic sequencing there is a wealth of data that can be leveraged for 21 comparative analysis. However, there exist few tools to construct strain-specific metabolic models 22 at scale. 23 Here we describe Bactabolize (github.com/kelwyres/Bactabolize), a reference-based tool which 24 rapidly produces strain-specific metabolic models and growth phenotype predictions. We describe 25 a pan reference model for the priority antimicrobial-resistant pathogen, Klebsiella pneumoniae 26 (github.com/kelwyres/KpSC-pan-metabolic-model), and a quality control framework for using draft 27 genome assemblies as input for Bactabolize. 28 The Bactabolize-derived model for K. pneumoniae reference strain KPPR1 performed 29 comparatively or better than currently available automated approaches CarveMe and gapseq 30 across 507 substrate and 2317 knockout mutant growth predictions. Novel draft genomes passing 31 our systematically-defined quality control criteria resulted in models with a high degree of 32 completeness (≥99% genes and reactions captured compared to models derived from matched 33 complete genomes) and high accuracy (mean 0.97, n=10). 34 Introduction 39 Bacteria exhibit metabolic diversity and can utilise a broad range of substrates for growth. It has 40 become clear amongst pathogens that there is an intertwined relationship between metabolism 41 and nutrient usage with virulence and antimicrobial resistance (1-7). Comparative analyses of 42 metabolic profiles (e.g. substrate usage) are key to fully understanding these relationships. 43 Traditionally, these profiles have been assessed via phenotypic growth on a limited number of 44 substrates, such as those used to delineate between species (8-10) which form the basis of a 45 number of commercial products for species identification. However, these methods are not 46 sufficiently discriminatory for in-depth comparisons within species, and alternative approaches 47 such as the Omnilog Phenotype MicroArray system (Biolog) are too expensive and/or labour 48 intensive for application to large numbers of isolates. Similarly, probing of essential metabolism-49 associated genes via transposon mutant libraries (e.g. to identify novel virulence factors and 50 therapeutic targets) (4, 11, 12) cannot be easily scaled across diverse bacterial populations. 51 Genome-scale metabolic models or metabolic reconstructions are a computational approach to 52 analysing the metabolic potential of an organism, within which the entire biochemical network is 53 represented as a stoichiometric matrix (13). Metabolic models are constructed programmatically, 54 but typically informed and at least partially validated using phenotypic growth data (14-16). Once 55 constructed, they can be run through simulations and analysed under various contexts, such as in 56 silico growth experiments (Flux Balance Analysis [FBA]) to predict substrate usage profiles (17), 57 evaluate the impact of single gene knockouts on growth (14,18), and identify metabolic 58 chokepoints for drug targets (19), among others. Traditionally, metabolic models are strain-specific 59 (i.e. each model represents a unique individual http://bigg.ucsd.edu/models) and may not be 60 applicable to other isolates due to unrepresented genetic diversity. 61 We recently described 37 curated strain-specific models for the Klebsiella pneumoniae Species 62 Complex (KpSC) (14) comprised of K. pneumoniae and its close relatives (20). These organisms 63 are a common cause of healthcare-associated infections world-wide, and among the World Health 64 Organization's priority antimicrobial resistant pathogens (21). KpSC are highly diverse and gene 65 content can differ substantially between strains (22, 23). Accordingly, our models varied in terms of 66 gene and reaction content, resulting in variable growth substrate usage profiles and metabolic 67 redundancy (14). Similar variation has also been described in other key bacterial pathogens e.g. 68 Escherichia coli (24), Salmonella enterica (25), Staphylococcus aureus (26) and Pseudomonas 69 aeruginosa (27). This is highly relevant to the use of metabolic models for the exploration of 70 virulence and antimicrobial resistance, and for the identification of novel drug targets. Therefore, 71 such works should seek to include multiple strain-specific models, and in some cases 100s-1000s 72 of models may be required to accurately represent population diversity (22,28,29). 73 There are several open source tools currently available that can rapidly produce strain-specific 74 metabolic models, including CarveMe (30), gapseq (31), ModelSEED (32) and KBase (33)  is dependent on commercial solvers such as CPLEX (free for academic use). However, its use of a 83 universal reference model may limit specificity of strain-specific models (36), and result in 84 overestimation of model genes. These limitations can be overcome by manual curation of the 85 output models, but such curation is highly labour intensive and not suitable for high-throughput 86 analyses. Furthermore, the CarveMe database (BiGG universal_model) appears to be no longer 87 actively maintained, meaning that there is no opportunity to integrate novel structural and/or 88 biochemical data as these become available in the literature (as discussed in COBRA community 89 forums). gapseq is a recently published command line tool which leverages an independent 90 universal database (31). In their comparative analyses, the gapseq authors demonstrate superior 91 accuracy to both CarveME and ModelSEED; however the concerns about the specificity of 92 universal models remain and it has been reported that model construction takes considerable time 93 (several hours in some cases (37)). 94 Here, we present Bactabolize (available at https://github.com/kelwyres/Bactabolize), an easy-to-95 use tool which allows scalable production of strain-specific draft metabolic models and prediction 96 of growth phenotypes in under three minutes per input genome. Bactabolize builds upon the 97 reference-based model reconstruction approach described by Norsigian et al. (36), leveraging the 98 COBRApy framework (38) and BiGG nomenclature (39). We present a pan-metabolic reference 99 model for the KpSC (derived from our 37 curated strain-specific models (14)), and describe an 100 exemplar quality control framework for the application of Bactabolize to KpSC draft genome 101 assemblies. We show that Bactabolize can rapidly produce strain-specific models from draft 102 genomes with a high degree of completeness (as compared to models generated from completed 103 genome assemblies), resulting in highly accurate growth predictions that match or exceed the 104 accuracy of models from CarveMe and manual curation efforts. 105

107
Bactabolize is written in Python 3 and utilises the metabolic modelling library COBRApy (38). 108 Bactabolize has four main commands: 109 i) Draft model generation (draft_model command), which generates a strain-specific draft 110 metabolic reconstruction ('model') using the approach outlined previously (36), and 111 uses gap-filling to identify any missing reactions required to simulate growth in the user-112 specified conditions 113 ii) Patching incomplete models (patch_model command) by the addition of missing 114 reactions e.g. those identified by the automated gap-filling process 115 iii) Substrate usage analysis via Flux Balance Analysis (FBA) (fba command) to predict 116 growth outcomes for a specified range of substrates supported by the model(s) 117 Figure 1). 118 Additional processing scripts are provided alongside Bactabolize to improve model metadata 119 annotation (improve_model_annotations.py), convert models generated using KBase and 120 ModelSEED to Bactabolize/BiGG-compatible format (SEED_to_BiGG_model_convert.sh), 121 generate network graph files from models (model_to_network_graph.py) and merging output FBA 122 profiles (merge_fba_profiles_longtable.sh). Full documentation, including example code and test 123 data are available at the Bactabolize code repository (https://github.com/kelwyres/Bactabolize). 124 For draft model construction, Bactabolize requires users to provide an input assembly (annotated 125 or unannotated FASTA or Genbank format respectively), a reference model (JSON format) and the 126 corresponding reference sequence data (gene and protein sequences in two separate multi-fasta 127 files or a single Genbank annotation in a .gbk file) ( Figure S1). For optimum results we suggest 128 using a pan-model that captures as much diversity as possible for the target species or group of 129 interest, because Bactabolize's reconstruction method is reductive i.e. each output strain-specific 130 model will include only genes, reactions and metabolites that are present in the reference or a 131 subset thereof (although novel genes, reactions and metabolites can be added via manual 132 curation). 133 If the input assembly is unannotated, Bactabolize will identify coding sequences using Prodigal 134 (40) but will otherwise honour the existing coding sequence (CDS) notations and optionally use 135 Prodigal to search for additional CDS. Draft genome-scale metabolic models are output in both 136 SMBL v3.1 (41) and JSON formats (one pair of files for each independent strain-specific model), 137 along with an optional MEMOTE quality report (42). Bactabolize will identify orthologs in the input 138 genome(s) compared to the reference sequence data using Bi-directional BLAST (43)  the patch_model command to correct the model ( Figure S2). Bactabolize uses a conservative 153 gap-filling approach that only adds the minimum number of reactions to enable growth under the 154 chosen conditions. We recommend testing the models in minimal media and atmosphere expected 155 to support growth for all isolates of the species of interest, unless the user has access to matched 156 phenotypic data demonstrating growth for individual isolates in specific conditions. Aggressive 157 gap-filling will effectively homogenise the models and should be avoided if the goal is to 158 understand the underlying strain diversity. 159 Substrate usage analysis (the fba command) is performed iteratively for each possible carbon, 160 nitrogen, sulfur and phosphor substrate supported by the model(s) (Figure S3), by replacing the 161 default substrate in the user specified growth medium (specified in the fba_spec JSON file). For 162 example, in M9 media the default substrates are glucose (carbon), ammonia (nitrogen), sulphate 163 (sulfur) and phosphate (phosphor). Each substrate can be tested in aerobic and/or anaerobic 164 conditions. Growth prediction output is recorded in a tab delimited file (one per strain). The 165 merge_fba_profiles_longtable.sh helper script will combine the outputs for multiple strains into a 166 single file for downstream analysis. 167 The growth impacts of single-gene knockout mutations can be simulated via the sgk command 168 ( Figure S4). Bactabolize will iterate through every gene in the model, temporarily removing it and 169 its associated reactions (unless they are also associated with another gene) and running FBA to 170 simulate growth in the user-specified conditions. The output is comparable to single-gene knockout 171 studies such as transposon mutagenesis and can be used to probe gene essentiality. 172

182
KpSC pan-metabolic reference model 183 We constructed a species complex-specific pan-metabolic reference model by combining a 184 collection of 37 manually curated models for which we have previously demonstrated high 185 accuracy (range 88.3%-96.8% for prediction of 94 distinct growth phenotypes (14)). These models 186 represent a diverse collection of KpSC (14)  and v) KBase (ModelSEED). Importantly, neither K. pneumoniae KPPR1 nor its genetic lineage (7  204 gene multi-locus sequence type, ST493), are represented in the KpSC pan-reference model, 205 meaning these benchmarking comparisons were on equal footing. Subsequently, each model was 206 used to predict growth phenotypes; i) in M9 minimal media with different sole sources of carbon, 207 nitrogen, phosphorus and sulfur; and ii) for all possible single gene knockouts in LB under aerobic 208 conditions. The predicted phenotypes were compared directly to the published phenotype data. 209 Among the high-throughput approaches, the Bactabolize draft model captured fewer genes and 210 reactions (n = 1233 and 2307, respectively) than the gapseq (n = 1489 and 3186, respectively) 211 and CarveMe universal models (n = 1960 and 2857, Figure 2A). The Bactabolize and CarveMe 212 universal models contained similar numbers of metabolites (1696 vs 1737, respectively), while the 213 gapseq model contained many more (n = 2519). Notably, the initial gapseq model was gap-filled 214 by simulation of growth in M9 minimal media with glucose (consistent with Bactabolize and 215 CarveMe media recipes), which resulted in the addition of 31 reactions to the draft model. This is 216 in contrast to CarveMe and Bactabolize-generated models which produced biomass in M9 minimal 217 media plus glucose without additional gap-filling. The CarveMe KpSC pan model captured 218 considerably more genes than any of the other models (n = 2407), but these were associated with 219 many fewer unique reactions and metabolites (1206 and 825, respectively). Upon further 220 investigation we determined that this method resulted in the over prescription of gene reaction 221 rules (GPRs)  (gapseq), which are difficult to harmonise in an automated manner even after the application of 236 MetaNetX. 237 In comparison to the low-throughput reconstruction approaches, the Bactabolize model contained 238 a similar number of genes, reactions and metabolites to the manually curated model (n = 1289, 239 2484 and 1827, respectively) and many more than the KBase model (72, 544 and 534, 240 respectively). This is likely due to the low number of metabolism-associated genes identified by 241 KBase, which has impacted the associated reaction and metabolite data. 242 MEMOTE scores, (produced by the MEMOTE report (42)) indicate the quality of the model 243 metadata annotations, with the scores ranging between 0 -100%. These provide a measure of 244 model portability and the level of connected databases available to support the metabolite, 245 reaction and genetic information represented in the model, but bear no reflection on model 246 accuracy. Bactabolize performs on the lower end, with CarveMe universal and gapseq performing 247 the best ( Figure 2B). The KBase model appears to perform well in this regard, however this is due 248 to the low number of genes, reactions and metabolites included in the model. Bactabolize using 249 the KpSC-pan model outperforms the model propagation mode of CarveMe using the same 250 reference model ( Figure 2B). Work is ongoing to improve the metadata annotations in the KpSC-251 pan reference model, to support large-scale model propagation. 252 We assessed the performance of each model for in silico prediction of growth phenotypes 253 compared to the previously published experimental data (15). Accuracy, sensitivity, specificity, 254 precision and F1 scores were calculated (60). Note that the specific set of growth substrates and 255 gene knockouts that can be simulated is determined by the sets of genes and metabolites 256 captured by each model and is therefore model-dependent (Data S1 and S2). Among those with 257 matched experimental phenotype data, the Bactabolize and CarveMe universal models were able 258 to predict growth for a greater number of carbon, nitrogen, phosphorous and sulfur substrates than 259 gapseq, CarveMe KpSC pan, KBase and iKp1289 models ( Figure 2C, Data S1). While the 260 CarveMe universal model had the highest number of true-positive growth predictions overall (n = 261 132 of 617 total predictions), it also had a comparably high number of false-positive predictions (n 262 = 39 of 617 total predictions, Figure 2D). Similarly, the gapseq and iKp1289 models resulted in 31 263 (262 total predictions) and 50 (513 total    is caused by repetitive sequences that cannot be resolved by the assembly algorithm and/or 312 sequence drop-out. The latter can result in the loss of genetic information such that some portion 313 of genes present in the underlying genome are lost from the genome assembly (either completely 314 or partially). This in turn, poses a limitation for the reconstruction of metabolic models using these 315 assemblies, since most published approaches use sequence searches to predict the 316 presence/absence of genes and their associated enzymatic reactions. Therefore, if we are to use 317 public genome data for high-throughput metabolic modelling studies, it is essential to evaluate the 318 expected model accuracies and understand the minimum input genome quality requirements. 319 Here we performed a systematic analysis leveraging our published curated KpSC models (n=37, 320 (14)), which were generated using completed genome sequences and were therefore considered 321 to represent 'complete' models for which the underlying genome sequence contains all genes that 322 are truly present in the genome (note the biological accuracy of these models was reported 323 previously (14) and is not the subject of the current study). We randomly subsampled the 324 corresponding Illumina read sets to various depths (10 -100x, increments of 10) in triplicate and 325 generated draft assemblies that were passed to Bactabolize for generation of draft metabolic 326 models (Data S3). Due to low read depth (≤30x), two isolates were removed from this analysis. 327 Additionally, ten 10x depth read samples failed to produce assemblies, leaving 1040 draft 328 genomes for analysis. The resulting draft metabolic models were compared to the complete 329 models to; i) determine the proportions of complete model genes and reactions captured in the 330 draft models; and ii) compare 846 in silico aerobic growth predictions in M9 minimal media, where 331 growth on 266 carbon, 153 nitrogen, 59 phosphorus and 25 sulfur sources were examined. 332 Substrates containing multiple elements were tested as sole sources of each element 333 independently and in combination, e.g. 1,5-Diaminopentane was tested as a sole carbon, sole 334 nitrogen and sole carbon plus nitrogen source. 335 As expected, assembly quality generally increased with increasing sequencing depth i.e. 336 assemblies generated from higher depth read sets were associated with higher N50 values, fewer 337 contigs and fewer assembly graph dead-ends, although the rate of improvement drastically 338 declined beyond 40-50x depth ( Figure S7, Data S3). We noted that it was rare for draft models to 339 capture 100% of model genes and reactions (just 420 of all 1040 draft assemblies were associated 340 with models that captured 100% model genes) (Data S3, Figure S8), even when using the highest 341 quality draft genomes. However, ≥ 99% of genes and reactions were commonly captured, which 342 plateaued from 40x depth onwards ( Figure S8)

364
We investigated the relationships between assembly quality metrics and model gene/reaction 365 capture in more detail. Variation in assembly graph dead-ends accounted for the greatest amount 366 of variation in model capture, closely followed by raw contig counts (cubic polynomial fit, R 2 of 367 ≥ 0.98 for graph dead-ends, R 2 of ≥ 0.9 for contig count). A segmented linear model was fitted to 368 N50 length (R 2 ≥ 0.83), producing a breakpoint at 25153 bp (Figure 3). 369 To further explore the optimum thresholds for assembly metrics, we tallied the number of draft 370 assemblies resulting in ≥ 99% and <99% gene and reaction capture at increasing graph dead-end and contig count cut-offs, and decreasing N50 cut-offs. Draft models that captured  (Figure 4). The optimum threshold for N50 was determined to be ≥ 65000, at which 94.97% of 378 'good' and 1.71% of 'bad' models were captured. The assembly graph dead-end threshold results 379 in comparatively higher sensitivity (i.e. a higher proportion of 'good' models pass the threshold) 380 than contig count and comparatively better specificity (i.e. lower proportion of 'bad' models pass 381 the threshold) than N50, but the underlying metric information is not universally available because 382 many isolate genomes are deposited in public databases only as assemblies without the 383 associated assembly graph. We therefore recommend a three-tier approach, whereby the 384 assembly graph dead-end criterion is preferenced if available, followed by N50 and then contig 385 count.  the patch_model command of Bactabolize. We then assessed the accuracy of the gap-filled 402 models for prediction of growth on the full range of substrates, as compared to the predictions from 403 the corresponding complete models. 404 Gap filling added 1 -3 missing reactions to each model, with a median of one, fully restoring 405 biomass production in M9 media with glucose in all but two of the 23 failed models. The missing 406 reactions appeared to be random genes across these 23 genomes, likely due to missing 407 information in these assemblies. 408 Substrate usage predictions from the 21 successfully gap-filled models were highly accurate, with 409 18/21 having a prediction concordance of ≥ 99% across all 846 growth conditions (12/21 had 100% 410 concordance) ( Figure S9). We therefore conclude that models generated for genome assemblies 411 passing our QC criteria, which have been gap-filled to successfully simulate growth on minimal 412 media plus glucose, are suitable for the prediction of growth across a range of substrates. 413 Predictive accuracy of draft models 414 We assessed the accuracy of Bactabolize for the construction of draft models for 10 novel KpSC 415 clinical isolates, representing five of the major taxa in the complex. We included five isolates for 416 which the associated STs were represented in the KpSC-pan v1 model and five isolates with STs 417 that were not represented. Whole genome sequence data were generated on the Illumina platform 418 and draft assemblies generated de novo. The resultant assemblies had 0-4 graph dead-ends, 419 N50s of 151958-388486 bp and 83-187 contigs (Data S4), within the tiered threshold values. 420 FBA was performed, and the predicted growth profiles compared to matched phenotypic growth 421 data for 16 carbon sources derived from Vitek GN ID cards. Though the number of tested carbon 422 sources was limited, all were associated with high accuracy metrics (Figure 5, Data S4). As 423 expected, models for isolates with STs represented in the KpSC-pan v1 reference performed 424 slightly better (mean accuracy = 0.98) than those for non-represented STs (mean accuracy = 425 0.95). 426

434
In this work we described Bactabolize, a pipeline for rapid and scalable production of accurate 435 bacterial strain-specific metabolic models and growth phenotype predictions. We describe a pan-436 reference model for the KpSC and demonstrate that a draft strain-specific model generated de 437 novo via Bactabolize using the KpSC-pan v1 reference was highly accurate for growth phenotype 438 prediction (85.79% accuracy for substrate usage across 190 substrates, and 80.57% for gene 439 essentiality across 1220 genes). Importantly, we also described a quality control framework for the 440 use of draft genome assemblies as input for metabolic reconstructions. We used a systematic 441 analysis to; i) evaluate the proportion of gene and reaction capture compared to the corresponding 442 'completed' models; ii) define quality control thresholds for input assemblies (three tier approach 443 for KpSC; ≤ 200 assembly graph dead ends, followed by ≥ 65000 N50, followed by ≤ 130 contigs); 444 and iii) estimate the accuracy of the resultant growth predictions. While the quality control 445 thresholds and accuracy estimates are specific to KpSC, the conceptual framework can be applied 446 to any organism and is essential to support the confident application of metabolic modelling for 447 large-scale genome datasets. We appreciate that assembly graphs may not be available for dead 448 end count, e.g. for draft genome assemblies accessed via public repositories, however we 449 encourage users to include this information in their quality control procedures wherever possible 450 (e.g. using the recently published counter tool available at https://github.com/rrwick/GFA-dead-451 end-counter), because these counts represent a direct reflection of the completeness of the 452 genome assembly. In contrast, contig counts and N50 are influenced by biological features such 453 as repeat copy numbers as well as the underlying sequence data quality e.g. a bacterial genome 454 harbouring many insertion sequence insertions will result in a draft assembly with a high number of 455 contigs regardless of the sequence data quality and completeness. 456 Traditionally, genome-scale metabolic reconstruction approaches have relied upon significant 458 manual curation efforts. While there will always remain a need for high quality curated models, 459 such resource intensive approaches preclude their application at scale, and have therefore limited 460 analyses to small numbers of individual strains (15, 16). However, automated reconstruction 461 approaches can support the generation and comparison of multiple strain-specific draft models 462 from which meaningful biological insights can be derived (61). Additionally, the quality of curated 463 models is likely to vary depending on their age, level and type of curation, as well as the approach 464 used for preliminary drafting. Indeed it is possible for automated approaches to outperform 465 manually curated models; a draft model for K. pneumoniae KPPR1 generated using Bactabolize 466 with the KpSC pan-v1 reference model outperformed the manually curated iKp1289 model 467 representing the same strain (15). iKp1289 was published in 2017 (6 years prior to this study) and 468 was initially drafted via the KBase pipeline (33), which uses RAST to annotate the sequences with 469 Enzyme Commission numbers. It has been demonstrated several times that the Enzyme 470 Commission scheme has systematic errors (62, 63), leading to a loss in accuracy when compared 471 to the ortholog identification methods used by automated approaches. Consistent with this 472 assertion, our draft KPPR1 model constructed with KBase (without manual curation) was an outlier 473 in terms of the very low number of genes, reactions and metabolites that were included. 474 CarveMe with universal model (30) and gapseq (31) are the current gold standard automated 475 approaches for model reconstruction, and we show that a draft KpSC model generated by 476 Bactabolize with the KpSC pan v1 reference resulted in similar or better accuracy for phenotype 477 prediction (Figure 2). Both the CarveMe universal and gapseq models resulted in high numbers of 478 true-positive and true-negative growth predictions. However, these were also accompanied by 479 comparatively higher numbers of false-positive predictions that resulted in a lower overall accuracy 480 for substrate usage analysis compared to Bactabolize with the KpSC-pan v1 reference (Figure 2), 481 and comparatively lower precision and specificity for the gene essentiality analysis. False-positive 482 predictions may indicate that the relevant metabolic machinery are present in the cell but were not 483 active during the growth experiments (e.g. due to lack of gene expression). In this regard, false-484 positives are not always a sign of model inaccuracy. However, false-positive predictions can also 485 occur from incorrect gene annotations e.g. due to reduced specificity of ortholog assignment 486 resulting from the use of the universal model without manual curation. Given a key objective here 487 is to facilitate high-throughput analysis for large numbers of genomes, it is not feasible to expect 488 that all models will be manually curated, and therefore we believe that identifying fewer genes with 489 lower overall error rates provides greater confidence in the resulting draft models. We also note 490 that the BiGG universal reference model which CarveMe leverages is no longer being actively 491 maintained. In contrast, user defined reference models can be iteratively curated and updated to 492 incorporate new knowledge and data as they become available. 493 Bactabolize's reference-based reconstruction approach is reductive, meaning the resultant draft 494 models will comprise only the genes, reactions and metabolites present in the reference, or a 495 subset thereof, and will not include novel reactions unless they are manually identified and curated 496 by the user. This is an important caveat that should be considered carefully for application of 497 Bactabolize to large genome data sets, particularly for genetically diverse organisms such as those 498 in the KpSC. For optimum results we suggest using a curated pan-model that captures as much 499 diversity as possible for the target species or group of interest. While we acknowledge that a 500 reasonable resource investment is required to generate a high-quality reference, we have shown 501 that a pan-model derived from just 37 representative strains can be sufficient to support the 502 generation of highly-accurate draft models (Figure 2 and 5). Additionally, we note that it is possible 503 to use a single strain reference model, which should ideally represent the same or closely related 504 species to that of the input genome assemblies, in order to facilitate accurate identification of gene 505 orthologs. It is technically possible to use an unrelated reference model, but this is expected to 506 result in inaccurate and/or incomplete outputs and has not been tested in this study. In 507 circumstances were no high quality closely-related reference model is available, we recommend 508 alternative reconstruction approaches that leverage universal databases e.g. CarveMe (30)  identified from the 36 additional strains by manual curation following comparison to the matched 533 phenotype data (as described in (14)). Additionally, orthologous sequence variants with <75% 534 nucleotide identity to gene sequences associated with these gene reaction rules (GPRs) were 535 added if there was phenotype data supporting the reaction. The biomass reaction was updated, 536 removing the metabolites udpgalur_c and udpgal_c as their production was strain-specific. 537 Metadata annotations were improved using the improve_model_annotations.py script (also 538 available in the Bactabolize code repository) resulting in the KpSC_pan v1 used in this study, 539 available at www.github.com/kelwyres/KpSC-pan-metabolic-model. 540

541
The annotated and unannotated genome of K. pneumoniae KPPR1 were obtained from Genbank 542 accession number: CP009208. 543 Bactabolize draft models were generated using the draft_model command in Bactabolize v1 with 544 the KpSC-pan v1 model as a reference, the annotated K. pneumoniae KPPR1 as input and the 545 following options: 546 --min_coverage 25 --min_pident 80 --media M9 --atmosphere aerobic 547 A draft model for K. pneumoniae KPPR1 was also generated via CarveMe version 1.5.1 using the 548 universal reference, the annotated K. pneumoniae KPPR1 as input, with the following commands: '-g M9 -i M9'. Subsequently, the --universe-file mode was also used, so the KpSC-pan model could 550 be used as a reference, with the previously described command. 551 A draft model was generated using gapseq version 1.2 with the 'doall' command using the 552 unannotated genome (as gapseq does not take annotated input files).

565
The annotated and unannotated genome of K. pneumoniae KPPR1 were obtained from Genbank 566 under the accession: CP009208, and draft metabolic models were generated using Bactabolize, 567 CarveMe, ModelSEED and gapseq as described above. The previously described, manually 568 curated model for KPPR1 (iKp1289) was also included for comparison (15) 35 sulfur substrates were successfully matched to BiGG and SEED IDs (Data S1). These growth 577 data were compared to in silico predictions generated via FBA using the fba command from 578 Bactabolize to optimise the biomass objective function with the following options: 579 --fba_spec_name m9 --fba_open_value -20 580 Gene essentiality was inferred from single gene knockout growth predictions using the sgk 581 command from Bactabolize with the following options to mirror the growth conditions of the TraDIS 582 library (LB media grown aerobically): 583 --media_type lb --atmosphere aerobic 584 In all cases, an objective value cut-off of ≥ 10 -4 was used to indicate binarised growth as per 585 previous studies (14,72). 586 In silico predictions were compared to matched phenotype data and the following accuracy metrics 587 were calculated: 588         ST represented by KpSC−pan model K P N 2 0 6 1 K P N 2 2 3 3 K P N 2 2 5 0 K P N 2 2 7 2 K P N 2 2 8 5 K P N 2 1 0 0 K P N 2 1 7 5 K P N 2 2 4 5 K P N 2 2 7 0 K P Metric Value B) Accuracy metrics for predicted growth phenotypes