standard-GEM : standardization of open-source genome-scale metabolic models

. The ﬁeld of metabolic modelling at the genome-scale continues to grow with more models being created and curated. This comes with an increasing demand for adopting common principles regarding transparency and versioning, in addition to standardisation efforts regarding ﬁle formats, annotation and testing. Here, we present a standardised template for git-based and GitHub-hosted genome-scale metabolic models (GEMs) supporting both new models and curated ones, following FAIR principles (ﬁndability, accessibility, interoperability, and reusability), and incorporating best-practices. standard-GEM facilitates the reuse of GEMs across web services and platforms in the metabolic modelling ﬁeld and enables automatic validation of GEMs. The use of this template for new models, and its adoption for existing ones, paves the way for increasing model quality, openness, and accessibility with minimal effort.


Introduction
Genome-scale metabolic models (GEMs) have been used to understand and guide modifications of metabolism with a wide range of applications, from elucidating disease mechanisms and identifying biomarkers to optimising strains for producing valuable chemicals (Chen et al., 2019).However, the size of the GEMs and the iterative curation work required to build and refine them, as detailed in a 96-step standard operating procedure by Thiele and Palsson (2010), naturally leads modellers to desire a framework that borrows concepts commonly found in software development.Together with the code associated with performing model simulations and analyses, modellers need an accessible ensemble of standards and tools that is easy to work with, yet carries much of the burden of ensuring quality, while allowing the researcher to focus on the scientific aspects of their work.Scientific results, including those from model analysis, are increasingly scrutinised for reproducibility (Baker, 2016, Malik-Sheriff et al., 2020) able, Accessible, Interoperable, Reusable (Wilkinson et al., 2016)) provide an easily accessible suite of recommendations for the sharing of DROs.These principles are conceived as broadly applicable not only to data but also to other classes of DROs including software tools, workflows and systems biology models.As highlighted by the Infrastructure for Systems Biology Europe (ISBE), "sharing [data and] models solely through supplementary material is still common practice" (Stanford et al., 2015), and they recommend that platforms should be used to disseminate assets.In line with this, EMBO Reports and other scientific journals have advocated using BioModels (Malik-Sheriff et al., 2020) as a default deposition database for sharing published models in standardised formats, such as the Systems Biology Markup Language (SBML) (Keating et al., 2020).ISBE in cooperation with ERASysAPP initiated a key step towards promoting FAIRification in systems biology with the development of FAIRDOM and FAIRDOMHub (Wolstencroft et al., 2016).FAIRDOMHub can be thought of as a publishing platform for systems modelling projects which links their various elements from data to model.These may be located at different physical locations and in different types of repository.By facilitating linking between resources such as SEEK (Wolstencroft et al., 2015) and OpenBIS (Barillari et al., 2016), FAIR-DOMHub thus, facilitates the reproducible reuse of models.Here, we propose to continue this advancement in FAIRification through the implementation of open-source standards both during model creation and in its subsequent curation, not only for making the final model available in a reproducible format, but also enabling reproducibility throughout the model development and curation process (Tiwari et al., 2021).
The importance of code versioning has been well-recognised for biological models and repositories, especially for models that constantly evolve (Beard et al., 2009, Miller et al., 2011).Code versioning can improve the reproducibility and transparency of the curation process (Aite et al., 2018), particularly for GEMs that are continuously updated over the years by many researchers (Ravikrishnan and Raman, 2015).For example, the consensus GEM of Saccharomyces cerevisiae has been developed since 2008 (Herrgård et al.), and constantly updated with new curations, reaching version 8 in 2019 (Lu et al., 2019).At the same time, models hosted by databases such as the BiGG database (Norsigian et al., 2020) have been updated by applying the curation tool ModelPolisher (King et al., 2016).However, the changes that have been applied to the BiGG models are not traceable by the research community.Such missing provenance information makes it difficult to evaluate changes between versions.Similarly, in the latest version of BioModels, the authors note that it "transparently tracks changes to the model and associated files behind the scenes" (Malik-Sheriff et al., 2020), but such information is not readily available to researchers working with the models.Standard-GEM, therefore, aims at, but is not limited to, facilitating and streamlining the constant manual curation process takes place after publication by making all changes equally structured, explicit, and trackable, pre-viously identified as desirable characteristics by Waltemath et al. (2013).As noted by Scharm et al. (2018), "reuse of models is still impeded by a lack of trust and documentation".Within this work, we developed standard-GEM, a standardised gitbased template for the development and curation of GEMs which addresses issues of reproducibility and availability of provenance information and applies FAIR principles to GEMs.With standard-GEM, we aim to address this problem by imposing standards for model reuse, such as fixed folder structure and the use of a git-based workflow for version control, and by relying on repository-hosting solutions that provide open issues tracking and discussion forums to further boost the documentation process.We thus attempt to solve the need for standardisation by facilitating the creation of a community standard focused on simplicity.To this end, the standard introduced by standard-GEM, defined through the .standard-GEM.md file, is explicit and transparent.

Results
Template repository.As part of the repository template available at github.com/MetabolicAtlas/standard-GEM, the central file .standard-GEM.md (Supplementary Note 1) contains all the expectations of the standard, including steps for repository creation, git workflow, and model file formats.Additionally, using the standard-GEM template repository initiates a well-defined folder structure, including a structured README file that facilitates the presentation of the model, and a default licence file (CC BY 4.0).As a template repository, there are no restrictions on making repositories public.This template can be used to create both private and public repositories identically in GitHub, or any git-hosting service.As some functionality of the template is leveraging the GitHub configuration via the .githubfolder, adopting it with another hosting service would entail additional work.

Automatic validation.
GEMs must be maintained to maximise their value.This continuous effort is essential, and it also applies to GEMs created from the standard-GEM template.One of the costs associated with maintaining an open-source GEM is the burden of running checks, ranging from basic sanity checks to more complex evaluations of model content.To reduce this burden, we have created an automatic validation pipeline, publicly available at github.com/MetabolicAtlas/standard-GEM-validation.First, it uses the GitHub application programming interface (API) to identify all standard-GEM models by finding all repositories labelled "standard-GEM."The pipeline then proceeds to perform a validation of the repository's content by looking at the file tree.Then, after the repository has been deemed compliant, the pipeline verifies the formatting of the YAML and SBML files and runs the model testing framework MEM-OTE (Lieven et al., 2020).Due to the inherent modularity, it is envisioned that more tests will be added.In addition, since previous releases of a model are all available in the same repository, tests can also be run retroactively on previous releases.All the results of the automatic pipeline are versioned in JSON format in the same repository as the pipeline to be easily reused or presented, e.g., on a website.A snapshot of the validation output is also provided as Supplementary Data.The pipeline regularly runs on its own, which means new results will come in as the models continue to be upgraded.This reduces the need for modellers to maintain identical validation pipelines separately, thus reducing the cost of maintenance, while keeping the benefit of validation checks.

Discussion
Generic versus specialised, and platform versus database.We advocate using standard-GEM through an independent platform that comes at no cost to the scientific community.While a git-based standard is a platformindependent solution, closer integration with GitHub specifically brings increased simplicity, for example, by instantiating a standard-compliant GEM with a single mouse click.Nonetheless, standard-GEM could alternatively be applied to other platforms of distributed version control and source code management such as GitLab or BitBucket.A fundamental debate that underpins standard-GEM is the use of a generic and financially free infrastructure versus a specialised and non-free platform.An example of the latter has been implemented by MEMOsys (Pabinger et al., 2011), which proposed a centralised version control system.On the one hand, it offered "access to the complete development history" of a model, not unlike standard-GEM does through git and GitHub.Similar to this, standard-GEM opens the model for community curation.On the other hand, MEMOsys placed the version control platform under the responsibility of researchers, which exerts financial strains.More importantly, however, a platform like MEMOsys dissociates the code from the model, as it "has been designed to map and store all properties of a metabolic model in a database" (Pabinger et al., 2011), thus inadvertently limiting traceability and reproducibility of the changes.Instead of storing all properties of a metabolic model in a database, standard-GEM provides a standardised format such that researchers can continue to work in an opensource way, directly combining model with code, whilst keeping the output file format easy to be reused by different model-centric websites such as BioModels, BiGG, FAIR-DOMHub, Metabolic Atlas (Wang et al., 2021), and JWS Online (Olivier and Snoep, 2004).A more recent approach, ModelBricks, defines an alternative infrastructure of small models that requires the development of a database, tools, and content (Cowan et al., 2019).To some extent, standard-GEM facilitates the reuse of a model by forking the source repository.However, how to combine multiple standard-GEMs as proposed through ModelBricks remains an open question.

Suitability of a git-based workflow. The use of git and
GitHub in computational biology has become increasingly popular.An introduction to these tools and how to use them in ten simple rules has been previously published (Perez-Riverol et al., 2016).A good standard for GEM versioning should facilitate answering the 7W questions of provenance (What, Who, When, Where, Which, Why, With/How) that underpin each action in the curation process (Ram and Liu, 2012).Previously, one approach to address provenance post-curation would have been through an algorithm for difference detection in models of biological systems (Scharm et al., 2016), even though it was intended to compare finished models.The authors note that "there is scope for further extensions to provide hypotheses for the Why."Standard-GEM answers all the 7W questions by employing the version control tool git, which is widely used for tracking changes in software programming in a distributed way.Hosting a git-based project on an online platform such as GitHub makes these answers transparent by showing What the change is (a git commit), Who made the change (author/committer), When it occurred (git commit timestamp), Where it occurred (in which file or part of the model), Which changes were made (visible through a git diff).The Why and With are addressable through the functionality of online platforms, e.g., GitHub's Issues and Pull requests and commit messages for which standard-GEM provides templates.Discussions also occur openly on the GitHub platform, with changes to the standard being connected to issues, which can be raised by GitHub users.Through this approach, standard-GEM aims to document all changes of the standard and to equally invite contributors.Therefore, if used adequately, git-based versioning for models can bring benefits in terms of traceability.For these reasons, git was chosen to stand at the foundation of standard-GEM.An advantage of using git versioned GEMs is that it makes comparisons between consecutive versions of the model straightforward.With git-flow, the standard development workflow on git-based platforms in the software industry, the differences across versions can be easily inspected via branch comparisons.In cases where git diff cannot work, as can be the case for SBML files with scrambled order, the tool sbml-diff can be used for comparisons between models or versions (Scott-Brown and Papachristodoulou, 2017).However, one limitation inherent to the use of git is that line order changes matter, even if the lines' content is otherwise identical.The recommendation is that tools maintain the order of model components to provide a consistent order and robustness to the model files and make them suitable for git.While promising, the use of code versioning principles when working with GEMs is not straightforward.One could look again at software programming, where best practices have been defined and widely spread during the past decade.However, many code conventions and configuration options can become overly complex for a metabolic modelling project.The GEM testing tool MEMOTE addresses this by providing memote-cookiecutter, a utility to instantiate a gitversioned project structure (Lieven et al., 2020).This tool brings significant benefits in terms of standardisation and time-savings.However, the rules and reasoning defining the internal file tree and git branch structure of a repository in- A list of GEMs that have adopted standard-GEM and their geographical location, as fetched from github.com/topics/standard-gem and gitlab.com/explore/projects/topics/standard-gem.
stantiated with memote-cookiecutter are complex and not fully transparent.
Adhering to standards.standard-GEM is a standard as per the definition that it is only mutual agreement, and that "introducing standards should not be a goal in itself, but should help [. . .] solve problems" (Brazma et al., 2006).As noted by Yurkovich et al. (2017), "one of the biggest hurdles for those attempting to link different pieces of software is standardisation", to which they formulate "Lesson 6: Adopt or Develop Standards."In line with this recommendation, standard-GEM limits the development of new standards to where they are strictly necessary, e.g., by promoting model file formats in the community.A prominent example is SBML as the de facto file format for GEMS.Other file formats exist, such as SBtab (Lubitz et al., 2016), which provides a spreadsheet-like interface to the model and its annotation.Different modelling formats have been reviewed by Dräger and Palsson (2014) and Schreiber et al. (2020), and usage of file formats has been polled by Carey et al. (2020), based on which of the file format recommendations in standard-GEM were made.Furthermore, standard-GEM is programming language agnostic, thus making it compatible with any systems biology software.It also leverages the GitHub application programming interface (API) for programmatic access to the entire repository.With standard-GEM, we build on existing standards and reduce complexity.Recently, a set of features that defines a "gold standard" for metabolic network reconstruction has been compiled regarding content, annotation, and simulation capabilities (Carey et al., 2020).Standard-GEM aims to implement the proposed standards, being aware of the warning that a "consistent use of standards and resources remains challenging" (Carey et al., 2020).The proposed standards for the de novo reconstruction phase are implemented through a templated README file that is part of the default standard-GEM configuration.For the proposed standards of the curation process, standard-GEM leverages git-flow and the transparency of git-based platforms -commit messages, branches, issues, pull requests -to achieve the desired transparency in the reconstruction and curation process (Heavner and Price, 2015).Regarding the proposed standards for simulations, standard-GEM encourages the separation of concerns by using a fixed folder structure and README files.Moreover, standard-GEM aims to work towards full compatibility with the COMBINE archive standard (Bergmann et al., 2014) and RO-crate (Soiland-Reyes et al., 2021).Compared to other standardisation efforts and initiatives, standard-GEM brings many, if not all, advantages at no financial cost to the scientific community.

FAIR.
Adopting standard-GEM, via the reliance on mature technologies and platforms, elevates FAIRification in ways that are new to the field, and that have not been addressed before.Findable.Using standard-GEM promotes findability.As mentioned previously, several established resources already exist for storing models that are considered finished, in the sense that they are ready for publication.Making models findable in this way, however, decreases the findability of other aspects.For example, it becomes much harder to find the rationale for specific choices modellers made.When a GEM is set up according to standard-GEM, it increases the findability of the information that accompanies an individual model file as a product.Another aspect of findability to be considered is that of the repositories themselves.To emulate repository categories, GitHub uses searchable topics, and the newly introduced "standard-GEM" topic makes it easy to find all such GEMs on GitHub (Table 1).Additionally, the recommended use of Zenodo ensures automatic DOI minting for each release in the GitHub repository (Supplementary Note 1).Accessible.The main contribution of standard-GEM to FAIRification is via accessibility.When depositing a model file in a deposition database, only a minor part that goes into the development of the model becomes accessible at that resource.For example, data or code cannot be submitted to any resource.Moreover, thanks to the git versioning system and the online hosting of the repository, even previous versions of the files in the repository remain accessible.Furthermore, in anticipation of the low risk that the repository suddenly disappears, as part of the .standard-GEM.md file, we recommend the permanent archiving service Zenodo which integrates seamlessly with GitHub (Supplementary Note 1).In addition, GitHub allows easy switching between the private/public state of a repository, independently of it being instantiated from a repository template such as standard-GEM.Interoperable.While SBML has become the de facto file format, others focus on different use-cases where SBML is less optimal.For example, due to instrinsic properties of the XML format that SBML is based on, even small model curations can lead to more line changes than in other file formats.Instead of promoting a single solution as is usually the case with deposition databases, standard-GEM pools together many standards.This is achieved by recommending different file formats, thus increasing the interoperability of the models, which is a key pillar of standard-GEM.The RAVEN toolbox, for example, provides the wrapper function export-ForGit that exports multiple formats at once, to reduce the chance of out-of-sync exports.Reusable.Standard-GEM adds a new dimension to reusability.Whereas the use of standard file formats primarily covers reusability per its official interpretation, especially regarding annotation, the provenance aspect is largely overlooked.With a code versioning system, however, changes in a repository tend to appear gradually.In a standard-GEM, incremental curations have even more context than generic curation databases, such as APICURON (Hatos et al., 2021), focusing more on curation as a trackable activity than on the curation rationale.The provenance of individual curations in a GEM makes them more reusable to, e.g., another GEM, it might even lead to annotation curations in other databases.A different aspect of reusability inherent to a standard-GEM is that it is expected to contain code.Large curations, even if implemented manually, benefit from being code-aided.Even if such code is meant for a one-time use, versioning it in the repository makes it reusable for future curations.Lastly, the reusability of a repository increases as soon as it becomes public on a git-hosting service, such as GitHub.Public repositories can be forked by other users, enabling them to duplicate the entire repository, including the change history, in their account, thus reusing it for their work.Later, their new changes can be conveniently incorporated in the original repository via a pull request, thus making even work outside the original repository reusable.Future perspective.At the time of publication, standard-GEM has reached version 0.5.As the field matures, it is expected that the standard will do so as well.The discussions that have already taken place in the repository indicate potential future directions, such as expanding standard-GEM to include support of Community-GEMs and validation improvements.Moreover, instead of becoming a static standard, standard-GEM has been set up as an "open system" to continue evolving and serving the community's future needs.Together with the standard-GEM template and resources, we will continue to support the conversion of models, as deposited in their respective publications, to the standard-GEM format, thus enabling community curation, expansion, versioning, and transparent annotation of these models.

Conclusion
By creating and disseminating standard-GEM, we are going much further than documenting the GEM reconstruction process -we are supporting open-source long-term model curation.Following standard-GEM is a way to make a model "reusable, extensible, and published opensource" (Medley et al., 2016).Upon this standard, further validation is achieved through an open-source automated pipeline.By adopting the workflow described by standard-GEM, the curation history of genome-scale and other metabolic models can be transparently reviewed in the git repository, thus facilitating continued community contributions and enabling GEMs to evolve from being research outputs to acting like an infrastructure.
standard-GEM 0.5 For details about the aims, scope, and use case of this standard see the wiki pages of the standard-GEM repository.

Terminology
To facilitate understanding, the definitions used throughout this guide are copied below from the wiki.For easier differentiation, we have associated colors to each of them.

Instructions
This document serves as a checklist for creating an open source genome-scale metabolic model (GEM) on GitHub.

Repository creation
Based on the ISO guidelines, tweaked for easy understanding.!Requirements: must, must not " Recommendations: should, should not # Possibility and capability: can !All GEMs that follow the standard-GEM must contain this file.
This serves as a traceable adherence to the standard, manually confirmed by the original authors.This file must be edited only with checkmarks, in order to support automatic parsing and validation of this file.Some of the checkmarks are pre-applied based on the contents of the standard-GEM template repository.GEM authors have the responsibility of checking that their model repository does follow the guidelines entirely.With further updates to standard-GEM , one should paste over the new version of this file, and see that the changes in the new guidelines are met.

# Navigate to standard-GEM and click on the button Use this template
The standard-GEM template can be used to initiate a repository.This will copy the contents of the main branch into the new repository, which can be either private or public.

! Pick a repository name
The name must be either a common name, KEGG organism, or taxonomy-derived short name, followed by the extension -GEM or -GSMM .The -GEM extension is preferred to ease pronunciation.The name can be prefixed by an abbreviation, eg ec (enzyme constrained), sec (with secretory pathways), mito (with mitochondrion pathways), pro (with protein structures).The topic standard-GEM must be added.Other topics like genome-scale-models , systems-biology can be added.Having this topic on your repository enables automatic finding using the GitHub API, and automatic validation of the standard.Topics are not copied from standard-GEM , so they need to be added manually.

# Add a repository URL
The URL can be the link to the publication/pre-print/website where the model is introduced, for example via an identifier system (doi/EuropePMC/PubMed).

File tree
/ signifies the root of the repository.
.keep files are used to indicate that the empty folder should not be ignored by git -without it git would simply not want to version empty directories.Once folders are not empty, it is okay to remove these files.
The GEM repository must have at least two branches: main and develop.

! Releases
Releases must use the tag format X.X.X where numbers, according to semantic versioning principles.The last field, also called "patch", can also be used to indicate changes to the repository that do not actually change the GEM itself.The use of a v before the version number ( v1.0 ) is discouraged.For more information about releases see the documentation at GitHub.

# Commits
Commit messages can follow the style of semantic commits.
! /.gitignore The repository must contain a /.gitignore file.This generic .gitignorewas prepared for multiple programming languages.While it does not require modification, it can be further adapted to the needs of the repository.

! /.github
The repository must contain a /.github folder, in which the contributing guidelines, code of conduct, issue templates and pull request templates must be placed.Defaults are provided and they do not require any modification.
! /.github/CONTRIBUTING.md This file is provided by the template, but it is empty.It must be filled in with the adequate contributing guideline instructions; a good example is https://github.com/SysBioChalmers/yeast-GEM/blob/main/.github/CONTRIBUTING.md.
! /code/README.md The repository must contain a /code folder.This folder must contain all the code used in generating the model.It must also include a README.mdfile that describes how the folder is organized.
! /data/README.md The repository must contain a /data folder.This folder contains the data used in generating the model.It must also include a README.mdfile that describes how the folder is organized.

! /model
The repository must contain /model folder.This folder must contain the model files, in multiple formats, according to the table below.As a general guideline, binary formats ( .mat , .xlsx ) must not exist on any other branches than main.The main reason for this is that binary files cannot be diff'ed, which means changes cannot be compared to previous versions, thus increasing the chance of errors.Moreover, with time, the size of the repository can create difficulties, and we cannot yet recommend storing these files with Git LFS, as it introducs complexity.For more information on the sbtab file format, see sbtab.net.All model files must be named the same as the repository, and with the appropriate extension.
Example: yeast-GEM.matThe repository must contain a license file.The default license is CC-BY 4.0 International.Unless a different license is desired, the file does not require modification.
! /README.md The repository must contain a README.mdfile.A default file is provided, and the adequate contents must be The /README.md file must include a version badge.A default is provided in the file.Additionally, the /README.mdfile should contain the Zenodo badge.As soon as the first public release is in made, the repository should be archived via Zenodo, and the corresponding badge be updated.A default is provided in the file.
The /README.md can contain a contact badge, for example Gitter.When setting up the Gitter chat room, the GitHub activity should be synced with Gitter in order to see the latest updates of the repository in the chat room.A default for this badge is provided in the file.
! /version.txt The repository must contain this file, which is required for the version badge in the /README.md .The value refers to the version of the GEM, not of the standard-GEM .The value must be updated with each release.

# Files for continuous integration testing
The repository can be set up for continuous integration testing using memote with eg.Travis CI ( .travis.yml ), Jenkins ( Jenkinsfile ), GitHub Actions (under .github/workflows).

# MEMOTE report
The repository could contain a MEMOTE report on the main branch, in .htmlformat.
. Reproducibility can be viewed as part of a wider need for open science data management aimed at the reuse of what can generically be termed Digital Research Objects (DROs).The FAIR Principles (Find-Anton et al. | bioR‰iv | March 21, 2023 | 3-12 Example: ecYeast-GEM !Pick a repository description The description must include the taxonomic classification in full.Example: The consensus GEM for Saccharomyces cerevisiae !Add repository topic