Pan-genome-scale network reconstruction: a framework to increase the quantity and quality of metabolic network reconstructions throughout the tree of life

A genome-scale network reconstruction (GENRE) represents the knowledgebase of an organism and can be used in a variety of applications. The drop in genome sequencing costs has led to an increase in sequenced genomes, but the number of curated GENRE’ s has not kept pace. This gap hinders our ability to study physiology across the tree of life. Furthermore, our analysis of yeast GENRE’ s has found they contain significant commission and omission errors, especially in central metabolism. To address these quantity and quality issues for GENRE’ s, we propose open and transparent curation of the pan-genome, pan-reactome, pan-metabolome, and pan-phenome for taxons by research communities, rather than for a single species. We outline our approach with a Fungi pan-GENRE by integrating AYbRAH, our ortholog database, and AYbRAHAM, our new fungal reaction database. This pan-GENRE was used to compile 33 yeast/fungi GENRE’ s in the Dikarya subkingdom, spanning 600 million years. The fungal pan-GENRE contains 1547 orthologs, 2726 reactions, 2226 metabolites, and 10 compartments. The strain GENRE’ s have a wider genomic and metabolic than previous yeast and fungi GENRE’ s. Metabolic simulations show the amino acid yields from glucose differs between yeast lineages, indicating metabolic networks have evolved in yeasts. Curating ortholog and reaction databases for a taxon can be used to increase the quantity and quality of strain GENRE’ s. This pan-GENRE framework provides the ability to scale high-quality GENRE’ s to more branches in the tree of life.


INTRODUCTION
A genome-scale network reconstruction (GENRE) represents the knowledgebase of an organism and has a variety of applications. Studies have used GENRE's in metabolic simulations to uncover new enzyme-metabolite interactions (Hackett et al., 2016) and understand the pharmacokinetic response of drugs in the human body (Thiele et al., 2017), but they also can be used as a scaffold for omics integration (Österlund et al., 2013).
Current practices limit the quantity and quality of GENRE's, and their application to study metabolism throughout the tree of life. First, the drop in genome sequencing costs and the dominant paradigm of one research team for every GENRE has led to a growing gap between genome sequences and curated GENRE's. Second, most GENRE efforts are directed to a handful of organisms with existing GENRE's (Monk et al., 2014); the yeast/fungal GENRE community primarily focuses on industrially-relevant organisms, which has led to parallel reconstructions for Saccharomyces cerevisiae, Yarrowia lipolytica, Scheffersomyces stipitis, Komagataella phaffii, Aspergillus niger (Lopes and Rocha, 2017). Third, anchoring and availability biases cause non-conventional organisms' GENRE's to be skewed towards model organism's metabolism while neglecting their own unique metabolism ( Figure 1); examples of these commission and omission errors in yeast/fungi GENRE's can be found in the Supplementary Information. Finally, eukaryotic GENRE's have additional challenges not found in prokaryotes since they have larger genomes, few operons/gene clusters, numerous paralogs, alternative splicing, and compartments. These quantity and quality issues for GENRE hinders our ability to understand and control metabolism throughout the tree of life and highlight the need to build upon the bottom-up GENRE protocol outlined by Thiele and Palsson (2010).
GENRE curators must make hundreds or thousands of ad hoc decisions, often without traceability (Ravikrishnan and Raman, 2015), because existing databases for orthologs, reactions, metabolites, and phenotypes are not always accurate or complete. This approach prevents GENRE's from being scaled throughout the tree of life. To solve these problems, we outline our approach for pan-GENRE with the formalized curation of orthologs, reactions, metabolites, and phenotypes for a larger group of organisms by research communities (Figure 2A). Previous studies have used this pan-genome approach for different strains of the same species to reconstruct metabolic networks (Islam et al., 2010;Monk et al., 2013), but the pan-genome can be expanded to a broader range of organisms (Vernikos et al., 2015). The first step in our framework is the curation of orthologs, paralogs, ohnologs, and xenologs within a pan-genome; many ortholog databases are available but our analysis has shown their ortholog groups sometimes contain paralogs (Correia et al., 2017). The second step is compiling the canonical and non-canonical reactions within a taxon and annotating them with ortholog-protein-reaction associations (OPR's). This reaction database can be collected from various sources ( Figure 2B); existing reaction databases generally focus on canonical metabolism and underrepresent the metabolic potential of enzymes (Notebaart et al., 2014).
Third, curating the presence and dynamics of metabolites from a larger group of organisms can help lead to the discovery of new pathways and enzymes (Caudy et al., 2018), especially when the metabolites are not found within the S-matrix. One notable example is the discovery of riboneogenesis, which was detected by sedoheptulose-1,7-bisphosphate changes in a shb17 knockout strain (Clasquin et al., 2011).
Finally, curation of phenotypic data into machine-readable formats can help test, validate and improve GENRE's; Experiment Data Depot (EDD) has been recently described (Morrell et al., 2017) but this area is less developed than other databases. The primary purpose of these databases is for generating 2/19 high-quality GENRE's, but they can also benefit the broader scientific community.
We demonstrate our framework with a pan-GENRE for 33 yeasts/fungi spanning more than 600 million years of evolution. It was created by integrating AYbRAH, our yeast/fungal ortholog database, and Analyzing Yeasts by Reconstructing Ancestry of Homologs about Metabolism (AYbRAHAM), our new yeast/fungal reaction database. Our approach enables GENRE's for lower taxonomic levels, such as phylum, family or stains, to be compiled from the fungal pan-GENRE, and permits GENRE's to be synchronized with each other as the community characterizes more enzymes or research teams improve individual strain GENRE's ( Figure 2C). We discuss the benefits and drawbacks of our approach and compare it with Model SEED (Devoid et al., 2013), CoReCo (Pitkänen et al., 2014), and CarveMe (Machado et al., 2018), existing methods for automated metabolic network reconstruction.
stipitis (Balagurunathan et al., 2012;Caspeta et al., 2012;Li, 2012;Liu et al., 2012), Eremothecium gossypii (Ledesma-Amaro et al., 2014), and S. cerevisiae (Aung et al., 2013). Additional reactions were added to the pan-GENRE by reviewing enzyme assays, and reactions inferred from growth assays and Biolog data; these are referenced in the notes and reference section of the GENRE. Reactions were annotated with ortholog-protein-reaction associations (OPR's) using fungal ortholog groups (FOG's) from AYbRAH (Correia et al., 2017) if there was strong genomic evidence. OPR's for protein complexes were reviewed manually to determine the essential subunits. Protein complexes for S. cerevisiae and S. pombe were cross-referenced with the Complex Portal (Meldal et al., 2014). Reactions were also annotated with ORPHAN FOG if the reaction is catalyzed by an orphan enzyme and has a prerequisite FOG; ORPHAN RXN if the reaction has a prerequisite reaction; an organism code if there is evidence in a species or strain but the enzyme is unknown. Reactions annotated with BIOMASS, ESSENTIAL, ORPHAN, SPONTANEOUS, EQUILIBRIUM, DIFFUSION, EXCHANGE, or DEMAND were automatically included in all children GENRE's. A Python script was used to compile GENRE's for taxonomic ranks below Fungi in SBML3 and XLSX using the reaction annotations. We identified reactions requiring charge/mass balancing in the Fungi pan-GENRE with Memote (Lieven et al., 2018). PaperBLAST was used to find evidence of reactions for reactions in some ortholog groups (Price and Arkin, 2017).
Biomass equations. The biomass equation was composed of protein, RNA, DNA, phospholipids, sterols, chitin, β-D-glucan, glycogen, trehalose, mannose. Each macromolecule was converted to a new 3/19 metabolite species having a molar mass of 1000 g · mmol −1 (see Supplementary Information). These macromolecule species enable the coefficients in the biomass equation to be set to the weight fraction of each macromolecule as measured in experiments; a similar approach was previously employed with SpoMBEL1693 (Sohn et al., 2012). The DNA composition for each strain GENRE was adjusted according to its GC content. Two protein synthesis reactions were constructed in the pan-GENRE: condensation from amino acids without a GTP cost and condensation of amino acids from charged tRNAs and with an additional cost of 2 mol GTP per mol amino acid (Schimmel, 1993). FOG's for the ribosome, RNA polymerase, DNA polymerase were assigned to the protein, RNA, and DNA biosynthesis reactions, respectively.
FBA simulations Flux balance analysis (FBA) was carried out with COBRApy v0.9.1. Reactions enabling artificial proton motive force cycles with glucose as a carbon source were blocked in the pan-GENRE (Pereira et al., 2016). Glucose and ammonium were used as a carbon source and nitrogen source to determine the yields of biomass precursors and correct any auxotrophies in all the strain GENRE's.
Evolview v2 was used to map GENRE statistics and amino acid yields onto the species tree (He et al., 2016).

RESULTS
The Fungi pan-GENRE contains 1547 orthologs, 2726 reactions, 2226 metabolites, and 10 compartments. Expanded genomic coverage in the fungal pan-GENRE. Our strain GENRE's have more genes than GENRE's derived from manual curation or CoReCo (Table 1). Only yeastGEM and iMT1026, the most recent GENRE's for S. cerevisiae and K. phaffii, have more genes than our GENRE's when we compared the reconstructions without genes from the ribosome, RNA polymerase, DNA polymerase genes.
These genes are not usually included in GENRE's. A quick inspection of the genes in iMT1026 indicates that it also includes RNA polymerase genes and uncharacterized genes/enzymes. Curating ortholog groups and annotating reactions with OPR's minimizes the chance that distantly related homologous genes are not added to GENRE's, and orthologous genes with low sequence similarity are captured in GENRE's.
Expanded metabolic coverage in the fungal pan-GENRE. Our pan-GENRE has a wider metabolic coverage than prior yeast and fungal GENRE's. The pan-GENRE includes the degradation of alkanes and 4/19 aromatics, which have been studied in Fungi for decades (Middelhoven et al., 1991(Middelhoven et al., , 1992Van Beilen and Funhoff, 2007), and the degradation of aliphatic amines, which has been recently studied in S. stipitis by Linder (2014Linder ( , 2018. These pathways are mediated by hydroxylases, monooxygenases, and dioxygenases from the cytochrome P450's family, which do not have well described gene-protein-reaction associations in current reaction databases. Enzymes in the cytochrome P450 family are known to be promiscuous and likely have vast unexplored metabolic capabilities in yeasts and fungi. Non-canonical metabolic reactions were also added to the GENRE, such as reactions catalyzed by ene-reductase. These reactions have no obvious role in metabolism , but can be useful to know when engineering pathways in microbial hosts. The expanded metabolic coverage in our yeast and fungal GENRE's highlight the benefit for curated reaction and ortholog databases.
Amino acid yields vary across yeast strains. The amino acid yields from glucose and ammonium vary in yeasts. There are strikingly lower yields in S. pombe, Lipomyces starkeyi, Nadsonia fulvescens, and Hanseniaspora valbyensis for glutamate, glutamine, glycine, histidine, leucine, methionine, and valine.
L. starkeyi has many genomic and phenotypic features shared with Proto-Yeast, the first budding yeast (Correia et al., 2017), while the remaining yeasts independently evolved the Crabtree-effect (Correia et al., 2017). The overall protein yield, assuming the amino acid composition of S. cerevisiae from iMM904 (Mo et al., 2009), is lowest in yeasts that lost Complex I. The decrease in protein yield for these yeasts indicate that microbes are not always maximizing growth rate or biomass yield, and additional constraints (Bachmann et al., 2017) or environmental conditions can lead to loss of function. The differences in yield highlight the need to select microbial hosts with the highest potential for producing target biochemicals in metabolic engineering, or engineering higher yielding pathways in preferred hosts .
These hosts or pathways can be identified by running metabolic simulations with GENRE's from higher taxonomic ranks. Reconstructing GENRE's from curated databases can avoid commission errors that would lead to yields converging to the yields found in S. cerevisiae or other model organisms.

DISCUSSION
Most reaction databases reflect canonical metabolism, which understates the true metabolic capability of enzymes. AYbRAHAM includes non-canonical reactions identified in enzyme assays, which are relevant to metabolic engineering and enzyme evolution. Engineering the phosphoketolase pathway in S. cerevisiae required the deletion of GPP1 and GPP2, which encode glycerol 3-phosphatase phosphatase, but have promiscuous acetyl-phosphate phosphatase activity . Acetyl-phosphate phosphatase would be considered a blocked reaction in a S. cerevisiae genome-scale metabolic model, even though this enzyme activity is important to metabolic engineers. Another example can be seen with rescuing a

5/19
serine auxotrophy in E. coli from promiscuous phosphatase (Yip and Matsumura, 2013). Although these promiscuous activities are important to biologists and metabolic engineers, non-canonical reactions can increase the computational burden in genome-scale metabolic modelling. Reactions will be identified as core canonical to enable core metabolic network reconstructions to be compiled for every strain in future versions of the pan-GENRE. These examples highlight the need to capture the true metabolic network of microbes beyond canonical metabolism.
GENRE involves a review of the literature to capture metabolism unique to strains (Österlund et al., 2012). One limitation of this approach is that a literature review based on S. stipitis would ignore metabolism it shares with other non-conventional organisms but has not been characterized directly, such as alkane degradation. A high-quality ortholog and reaction database with OPR's with our framework enables discoveries in one organism to be pushed to the GENRE's of all organisms within a taxon ( Figure 2C). This synchronization can reduce the parallel GENRE efforts that have hampered the yeast community (Lopes and Rocha, 2017).
Our pan-GENRE framework relies on the ortholog conjecture, which states that orthologs are likely to have conserved function (Koonin, 2005); however, biological function is not always conserved amongst orthologs. For example, NDE1 encodes external alternative NADH dehydrogenase and is present in all yeasts 33 yeasts and fungi in this study (Correia et al., 2017). Nde1p has been found to oxidize NADH and NADPH in several species (Bruinenberg et al., 1985;Melo et al., 2001;Overkamp et al., 2002;Miranda, 2011), but this ability was lost in the ancestor of S. cerevisiae. Furthermore, S. pombe appears to encode internal and external NADH dehydrogenase from the same gene (Crichton et al., 2007;Correia et al., 2017). To resolve these issues of non-conserved function in ortholog groups, we created new "functional ortholog groups" to assign these exceptions. Future methods can address these issues by assigning biological function to internal nodes in phylogenetic trees, or annotating reactions with transcripts instead of genes (Pfau et al., 2015).

Comparison of pan-GENRE framework vs automated reconstruction.
Model SEED has is commonly used to automate or semi-automate metabolic network reconstruction (Magnúsdóttir et al., 2017). Our experience with automated reconstruction using Model SEED has found that these models require extensive curation to address commission and omission errors , especially with enzyme complexes which are not fully annotated by Model SEED. Stringent assignment of orthology from phylogenetic analysis within a defined taxonomic group ensures these errors are minimized.
CoReCo is the first method to consider the simultaneous reconstruction of GENRE's for a wide range of species (Pitkänen et al., 2014). It relies on a probabilistic model to identify orthologs using bidirectional 6/19 best hits with BLAST, which can complicate or misidentify orthology relationships for gene families with many duplications or high sequence similarity (Dalquen and Dessimoz, 2013;Correia et al., 2017).
CoReCo was better at capturing reactions that were omitted by manual curation in past reconstructions, such as ATP citrate lyase in S. pombe (Sohn et al., 2012). On average, GENRE's from our framework have 74% more genes than CoReCo GENRE's (Table 1). CoReCo is an important milestone in expanding GENRE to diverse species, but curated ortholog and reaction databases are needed to ensure they are complete and accurate.
CarveMe is a top-down approach recently outlined by Machado et al. (2018), which is similar to our framework. They introduce the concept of a universal model for bacteria which can build GENRE's by homology using DIAMOND (Buchfink et al., 2014), or orthology via eggNOG-mapper (Huerta-Cepas et al., 2017). Using the GENRE's from 23 species, they were able to scale their universal model to more than 5000 bacterial GENRE's. This method can lead to several complications with eukaryotes, which have more paralogs than prokaryotes (Makarova et al., 2005). For example, Aspergillus nidulans encodes mitochondrial, cytosolic, and peroxisomal isocitrate dehydrogenase from IdpA, but most budding yeasts have these enzymes encoded by two or three genes (Szewczyk et al., 2001;Correia et al., 2017). Assigning a metabolic function to IdpA via orthology from S. cerevisiae would annotate it is a mitochondrial NADPdependent isocitrate dehydrogenase, and omit its cytosolic and peroxisomal localizations. Therefore, homolog and ortholog-based assignments can both lead to incorrect gene-reaction-associations (GPR's) and enzyme localization. Furthermore, errors in genome annotations and existing GENRE's can be inherited in GENRE's derived from CarveMe. Our analysis of S. stipitis' electron transport chain indicates that its genome does not encode internal alternative NADH dehydrogenase (Correia et al., 2017); however, the genome annotation of S. stiptis and iBB814 includes internal NADH dehydrogenase, encoded by PICST 58800 (Jeffries et al., 2007;Balagurunathan et al., 2012). These cases support the need for community consensus on orthologs, reaction networks, and OPR's for a group of organisms, which can compile high-quality GENRE's for lower-taxonomic levels.

CONCLUSIONS
The heart of GENRE process involves integrating information about orthologs and reactions, and to a lesser extent metabolites and phenotypes. The lack of complete and accurate ortholog and reaction databases forces research teams to make ad hoc decisions about the presence or absence of reactions, especially in non-conventional organisms. This time-consuming process limits the quantity and quality of GENRE's. We introduce our pan-GENRE framework to resolve these issues and demonstrate it with the metabolic network reconstruction of 33 yeasts and fungi, spanning 600 million years of evolution.

7/19
These GENRE's have more genomic and metabolic coverage than previous yeast and fungi GENRE's.
The unified ortholog and reaction nomenclature in our GENRE's enables them to be synchronized as the pan-GENRE, or strain GENRE's are improved ( Figure 2C). This approach can be scaled to other groups of organisms throughout the tree of life to increase the quantity and quality of GENRE's.

DATA AVAILABILITY
The AYbRAH ortholog database is available at https://github.com/kcorreia/aybrah. The AYbRAHAM reaction database, the code used to compile the GENRE's, and all GENRE's for each taxonomic rank in Fungi are available at https://github.com/kcorreia/aybraham.

FUNDING
K.C. was supported by Bioconversion Network, BioCeB and NSERC CREATE M3.

AYbRAH
Analyzing Yeasts by Reconstructing Ancestry of Homologs AYbRAHAM Analyzing Yeasts by Reconstructing Ancestry of Homologs about Metabolism EDD Experimental Data Depot FOG fungal ortholog group GENRE genome-scale network reconstruction GPR gene-protein-reaction associations OPR ortholog-protein-reaction association pan-GENRE pan-genome-scale network reconstruction  Figure 1. Non-conventional organism genome-scale network reconstruction (GENRE) using a model organism GENRE as a template versus databases from community-driven curation of a pan-genome, pan-reactome, pan-metabolome, and pan-phenome. Saccharomyces Genome Database (SGD) mines the literature to describe the physiology of Saccharomyces cerevisiae (a model organism). This knowledgebase is leveraged to compile a S. cerevisiae GENRE, which is its state of the art metabolic network; the gap between its true metabolic network and the state of the art GENRE represents undiscovered and promiscuous enzymes. In the absence of a curated genome database for Scheffersomyces stipitis (a non-conventional organism), the S. cerevisiae GENRE is used as a template to guide the S. stipitis GENRE. Anchoring bias makes the S. stipitis GENRE skew towards the S. cereivisae GENRE with commission errors; enzymes that have been characterized in S. cerevisiae, such as cytosolic NADP-dependent acetaldehyde dehydrogenase, have been included in past S. stipitis reconstructions despite no evidence based on orthology or enzymology (Correia et al., 2017). Availability bias prevents the GENRE curators from describing metabolism that is unique to S. stipitis and not well studied or documented, leading to omission errors; S. stipitis shares alkane hydroxylase orthologs with Candida tropicalis, a known alkane degrader (Lebeault et al., 1971), and yet alkane degradation has never been described in any S. stipitis or yeast GENRE. Community-driven curation of a pan-genome, pan-reactome, pan-metabolome, and pan-phenome reduces anchoring and availability biases by removing S. cerevisiae as the focal point for comparative metabolic network reconstruction; orthology assignment from rigorous phylogenic analysis of gene families rather than error-prone error prone methods; capturing non-canonical reactions catalyzed by promiscuous enzymes. Figure 2. Pan-genome-scale network reconstruction framework for improving the quantity and quality of GENRE's. (A) A research community for a taxon, such as yeasts and fungi, curate its pan-genome, pan-reactome, pan-metabolome, and pan-phenome. Orthologs, paralogs, ohnologs, and xenologs are identified in the pan-genome by rigorous phylogenetic analysis. Reactions catalyzed by enzymes within a taxon are described in the pan-reactome and annotated with ortholog-protein-reaction associations (OPR). The presence and dynamics of metabolites are described in the pan-metabolome of a taxon; metabolites in the pan-metabolome not captured in the S-matrix can lead to new pathway discovery in the pan-reactome. Phenotypes within the taxon are transcribed into machine-readable formats to test, validate and improve genome-scale metabolic models. These databases are the basis for the pan-GENRE. (B) The pan-reactome is created by mining the literature, inferred reactions from growth assays and Biolog, pathway analysis and gap-filling, in vitro enzyme characterization and crude enzyme assays, comparative genomics and comparative physiology. (C) Our Fungi pan-GENRE spans 600 million years of evolution for 33 yeasts and fungi. It was used as a template to compile GENRE's for each taxonomic rank from sub-kingdom to strain taxonomic ranks. GENRE's for any taxonomic rank can by curated by updating the AYbRAH ortholog and AYbRAHAM reaction databases. These changes are pulled to the Fungi pan-GENRE, and pushed to the lower taxonomic rank GENRE's, enabling all the GENRE's to be synchronized. A GENRE can be used for genome-scale metabolic modelling (GSMM) to test new pathways and reconcile in silico and in vivo data. Further GENRE curation can bridge the gap between in silico and in vivo data.  . Schizosaccharomyces pombe, Lipomyces starkeyi, Nadsonia fulvescens, Hanseniaspora valbyensis all have reduced yields of glutamate, glutamine, glycine, histidine, leucine, methionine, and valine than the rest of the fungi and yeasts. L. starkeyi shares many genomic features as Proto-Yeast, the first budding yeast (Correia et al., 2017), while the remaining yeasts independently evolved the Crabtree effect. S. pombe and H. valbyensis have reduced genomes and primarily rely on glucose fermentation to ethanol.