Abstract
Background Mapping biomedical data to functional knowledge is an essential task in bioinformatics and can be achieved by querying identifiers, e.g. gene sets, in pathway knowledgebases. However, the isoform and post-translational modification states of proteins are lost when converting input and pathways into gene-centric lists.
Findings Based on the Reactome knowledgebase, we built a network of protein-protein interactions accounting for the documented isoform and modification statuses of proteins. We then implemented a command line application called PathwayMatcher (github.com/PathwayAnalysisPlatform/PathwayMatcher) to query this network. PathwayMatcher supports multiple types of omics data as input, and outputs the possibly affected biochemical reactions, subnetworks, and pathways.
Conclusions PathwayMatcher enables refining the network-representation of pathways by including isoform and post-translational modifications. The specificity of pathway analyses is hence adapted to different levels of granularity and it becomes possible to distinguish interactions between different forms of the same protein.
Findings
In biomedicine, molecular pathways are used to infer the mechanisms underlying disease conditions and identify potential drug targets. Pathways are composed of series of biochemical reactions, of which the main participants are proteins, that together form a complex biological network [1]. Proteins can be found in various forms, referred to as proteoforms [2]. The different proteoforms that can be obtained from the same gene/protein depend on the individual genetic profiles, on sequence cleavage and folding, and on post-translational modification (PTM) states [3]. Proteoforms can carry PTMs at specific sites, conferring each proteoform unique structure and properties [4]. Notably, many pathway reactions can only occur if all or some of the proteins involved are in specific post-translational states.
However, when analyzing omics data, both input and pathways are summarized in a gene- or protein-centric manner, meaning that the different proteoforms and their reactions are grouped by gene name or protein accession number, and the fine-grained structure of the pathways is lost. One can therefore anticipate that proteoform-centric networks provide a rich new paradigm to study biological systems. But while gene networks have proven their ability to identify genes associated with diseases [5], networks of finer granularity remain largely unexplored.
Here, we present PathwayMatcher, an open-source standalone application that considers the isoform and PTM status when building protein networks and mapping omics data to pathways from the Reactome database. Reactome [6], is an open-source curated knowledgebase consolidating documented biochemical reactions categorized in hierarchical pathways, and notably includes isoform and PTM information for the proteins participating in reactions and pathways.
As an example of the complexity of hierarchical pathway information, we provide a graph representation of Signaling by NOTCH2 from Reactome (Figure 1). This pathway is a sub-pathway of the pathways Signaling by NOTCH and Signal Transduction. It is composed of two sub-pathways (NOTCH2 intracellular domain regulates transcription and NOTCH2 Activation and Transmission of Signal to the Nucleus), comprising 32 and 54 reactions, yielding 28 and 141 edges, respectively. The 31 participants of the Signaling by NOTCH2 pathway are also involved in reactions in other pathways, between themselves and with 2,055 other proteins, resulting in 6,525 external edges. Note that in this pathway, Cyclic AMP-responsive element-binding protein 1 (coded by CREB1) is phosphorylated at position 46 (labeled as CERB1_P in Figure 1) and Neurogenic locus notch homolog protein 2 (coded by NOTCH2), is found in three forms (unmodified and with two combinations of glycosylation, labeled as NOTCH2, NOTCH2_Gly1, and NOTCH2_Gly2, respectively).
The amount of information available on reactions involving modified proteins has dramatically increased during the past two decades (Figure 2), with 3,947 and 5,631 publications indexed in Reactome (version 64 at time of writing) describing at least one reaction between modified proteins or between a modified and an unmodified protein, respectively. To harness this vast amount of knowledge, we built a network representation of pathways that we refer to as proteoform-centric, where protein isoforms with different sets of PTMs are represented with different nodes, in contrast to gene-centric networks, where one node is used per gene name or protein accession. In this representation, two proteoforms are connected if they participate in the same reaction. Note that proteoforms can participate in reactions both individually and as part of a set or complex. Furthermore, they can have four different roles: input, output, catalyst, or regulator.
The fundamental difference between gene- and proteoform-centric networks is illustrated in Figure 3, showing the graph representation of interactions with the protein Cellular tumor antigen p53 (P04637) from the TP53 gene. In a gene-centric paradigm (Figure 3A), 221 nodes are connected to a single node, making 220 connections; while in a proteoform-centric network (Figure 3B), 227 proteoforms connect to 23 proteoforms coded by TP53 making 414 connections. Note that the proteoforms coded by TP53 are themselves involved in reactions, making 24 TP53-TP53 connections. In this example, the proteoform-centric network thus presents more nodes and connections than the gene-centric network, with visible structural differences in the network organization. We hypothesize that the proteoform-centric network paradigm depicted in Figure 3B provides a rich map that will enable navigating biomedical knowledge to a higher level of detail, to better assess the effect of perturbations, and identify drug targets more specifically.
PathwayMatcher allows the user to tune the granularity of the network representation of pathways by representing nodes as (i) gene names, (ii) protein accession numbers, or (iii) proteoforms, and supports the mapping of multiple types of omics data: (i) genetic variants, genes, (iii) proteins, (iv) peptides, and (v) proteoforms. Genetic variants are mapped to proteins using the Ensembl Variant Effect Predictor [7], gene names are mapped to proteins using the UniProt identifier mapping [8], and peptides are mapped to proteins using PeptideMapper [9]. If a peptide maps to different proteins, all possible proteins are considered for the search and protein inference must be conducted a posteriori [10]. If peptides are modified, they are mapped to the proteoforms presenting compatible PTM sets. Proteins are mapped to the pathway network using their accession, while proteoforms are mapped by comparing their protein accession, isoform number, and PTM set. A schematic representation of the PathwayMatcher matching procedure is shown in Figure 4. More details on the mapping procedure, formats, and settings can be found in the methods section and in the online documentation (github.com/PathwayAnalysisPlatform/PathwayMatcher/wiki).
PathwayMatcher produces three types of output: (i) the result of the matching, listing all possible reactions and pathways linked to the input; (ii) the results of an over-representation analysis; and (iii) networks in relationship with the input. The over-representation analysis is performed on the pathways matching and follows the first generation of pathway analysis methods [11], i.e. a p-value for each pathway in the reference database is calculated using a binomial distribution followed by Benjamini-Hochberg correction [12] (in a similar way as performed by the Reactome online analysis tool [13]). If the input can be mapped to proteoforms, the over-representation analysis is conducted using a proteoform-centric representation of pathways, using proteins otherwise. The exported networks represent the internal and external connections that can be drawn from the input, where internal connections connect two nodes from the input list, and external connections one node from the input list to any node not in the input. The user can select to export these networks using nodes defined as genes, proteins, or proteoforms. Connections between nodes in the network are annotated with information on whether they participate as complex or set, and their role in the reaction.
As displayed in Figure 5A, 68% of the pathways present at least one proteoform-specific participant, i.e. with isoform or PTM annotation. The number of pathways containing a given gene product or proteoform is displayed in Figure 5B, showing how using proteoforms allows distinguishing pathways more specifically than genes, with a median of four pathways matched per proteoform compared to eleven pathways per gene. When the input can be mapped to proteoforms, PathwayMatcher can restrict the search for reactions and pathways to those that specifically involve proteins in the desired form, hence reducing the number of possible connections for a given node in the resulting network. Conversely, the proteoform-centric network representation allows identifying interactions between multiple proteoforms originating from the same gene or protein, resulting in new connections compared to a gene-centric representation.
Figure 5C shows that the number of connections per proteoform is lower than the number of connections for the respective gene for most proteoforms, varying from 300-fold decrease to 10-fold increase. Interestingly, plotting the number of connections of a proteoform in gene-centric or proteoform-centric networks shows that the largest gene-centric hubs, corresponding to five genes, decompose into 127 proteoforms that do not outlie the distribution of the number of connections in the proteoform network (Figure 5D). Conversely, a group of 484 densely connected outliers emerges from 44 genes.
Through its paradigm-shift, PathwayMatcher hence provides a fine-grained representation of pathways for the analysis of omics data. However, this comes at the cost of increased complexity: gene-centric networks comprise a limited number of nodes, approximately 20,000 for humans, whereas in a proteoform-centric paradigm, the human network is expected to have several million nodes [14]. With the current version of Reactome, building the gene- and proteoform-centric networks results in 9,759 and 12,775 nodes with 443,229 and 672,047 connections, respectively. We classified the nodes into two categories, canonical or specific gene products, depending on whether or not they represent the unmodified canonical isoform of a protein according to UniProt. Within the proteoform network, 432,169 connections between 9,694 nodes link two canonical gene products, 95,539 connections between 7,734 nodes involved one canonical and one specific gene product, and 2,806 nodes with 144,339 connections involved two specific gene products.
In addition to the increased size of the underlying network, matching proteoforms requires comparing isoforms and sets of modifications, possibly with tolerance and wildcards for the modification definition and localization, which is computationally much more intensive than simply comparing identifiers. Figure 6 shows the performance of PathwayMatcher benchmarked against public data sets of (A) genetic variants, (B) proteins, (C) peptides, and (D) proteoforms.
For the proteins and proteoforms, the processing time increased linearly related to the query size with a small slope, making it possible to search all available proteins within a few seconds. As expected, protein identifiers provided the fastest response time, while proteoforms were the second fastest. Mapping peptides took approximately 30 seconds more, corresponding to the indexing time of the protein sequences database by PeptideMapper [9], after which the time increased linearly in a similar fashion as for proteins. For the genetic variants, an extra mapping step is required to map possibly affected proteins, adding additional computing time. The overall mapping time for a million single-nucleotide polymorphisms (SNPs) was less than a minute, which is acceptable compared to the other steps of a variant analysis pipeline. Note that the processing time was very reproducible across runs, where minor variation is only noticeable using genetic variants, resulting in very thin ribbons in Figure 6B-D.
In conclusion, PathwayMatcher is a versatile application enabling the mapping of several types of omics data to pathways in reasonable time and can readily be included in bioinformatic workflows. Thanks to the fine-grained information available in Reactome, PathwayMatcher supports refining the pathway representation to the level of proteoforms. To date, only a fraction of the several million expected proteoforms [14] have annotated interactions, but as the understanding of protein interactions continues to increase, and the ability to identify and characterize them in samples progresses, proteoform-centric networks will surely become of prime importance in biomedical studies. Notably, the effect of genetic variation on genes, transcripts, and proteins is currently only partially resolved for a fraction of the genome. The rapid development of this field will make it possible to identify biological functions affected by variants within the human network. Refining its representation to the level of proteoforms will allow pinpointing more precisely reactions and pathways, and hence increase our ability to understand biological mechanisms and potentially identify druggable targets.
Methods
Implementation
PathwayMatcher is implemented in Java 8.0.
Availability
PathwayMatcher is freely available at github.com/PathwayAnalysisPlatform/PathwayMatcher under the permissive Apache 2.0 license. It is also possible to use PathwayMatcher as a Docker image: hub.docker.com/r/lfhs/pathwaymatcher. PathwayMatcher can be obtained from the Bioconda channel of the Conda [15] package manager at bioconda.github.io/recipes/pathwaymatcher/README.html. Finally, PathwayMatcher is available as a Galaxy [16] tool in the Galaxy ToolShed [17] at toolshed.g2.bx.psu.edu/view/galaxyp/reactome_pathwaymatcher where it can be readily integrated into analysis workflows. PathwayMatcher has also been installed into the public European Galaxy instance, usegalaxy.eu, making it possible to use the application without requiring any local configuration and just providing valid input files and options. The complete URL for the online tool is: https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fgalaxyp%2Freactome_pathwaymatcher%2Freactome_pathwaymatcher
Upon installation, PathwayMatcher can be used from the command line to query Reactome using various types of omics data. Either the “.jar” file is run directly using Java or the Docker image is instantiated to a container. Detailed information on implementation, installation, usage and format specifications is available in the online documentation at github.com/PathwayAnalysisPlatform/PathwayMatcher/wiki.
Input and Output
Detailed and updated documentation of the input and output can be found in the online documentation at github.com/PathwayAnalysisPlatform/PathwayMatcher/wiki.
As schematized in Figure 7, a simple representation is used for proteoforms: (i) the UniProt protein accession and (ii) the set of PTMs separated by a semicolon ‘;’. The protein accession can include the isoform number specified with a dash ‘-‘. The PTM set contains each PTM separated by a comma ‘,’. Each PTM is specified using a modification identifier and a site, separated by a colon ‘:’.
Note that the order of PTMs do not affect the search. The PTM identifier is a five digit identifier from the PSI-MOD Protein Modification [18]. The site is an integer specifying the 1-based index of the modified amino acid on the sequence as defined by UniProt.
It is common to write the identifiers for the PTM types with the prefix ‘MOD:’ before the five digits of the ontology term. PathwayMatcher also allows the user to write the identifier without the prefix. PathwayMatcher also allows querying all proteoforms modified at a given site using the ‘00000’ wild card for modification type. Note that the modification site is mandatory, but a tolerance window can be set, as detailed in the “Proteoform matching” section.
Post-translational modifications in the Reactome data model
The Reactome object model specifies physical entities, e.g. complexes, proteins and small molecules, and proteins are annotated using unique identifiers. These entities participate in reactions in specific cellular compartments. They can also be connected to multiple instances of Translational Modification objects containing a specific coordinate on the protein sequence and an identifier following the PSI-MOD ontology [18]. The portion of physical entities referring to proteins are associated to other class of objects as reference entities, which contain protein annotations in external databases such as UniProt [19]. Therefore, a proteoform is represented as a physical entity associated to a set of modifications for specific processes at specific subcellular location. 127 different protein modifications are annotated in Reactome for humans, of which Figure 8 displays the most frequent.
Proteoform matching
Searching pathways using gene names or protein accessions solely requires mapping a string of characters between the input and the knowledgebase. In order to map the proteoforms to reactions and pathways, it is necessary to decide if the proteoforms in the input are equivalent to the proteoforms annotated in the Reactome database, taking into account the protein accession, isoform information, and the set of PTMs. Two proteoforms can have all, some or none of these elements in common.
We defined a set of criteria to compare two proteoforms, one from the input and one from the reference database and decide whether they are equivalent to each other. The matching types defined for PathwayMatcher are: a) Superset, b) Subset, c) One, and d) Strict.
a) Superset (with and without PTM types)
The set of input PTMs is a superset of the reference PTMs set. This includes the command line arguments: -m superset or -m superset_no_types.
The UniProt accession is the same
The isoform is the same; either:
Both have an isoform specified. Ex: P31749-3
Or both refer to the default one. Ex: P31749
The PTMs:
The input contains ALL the reference PTMs or more (input is superset or equal). Each reference PTM must have a matching input PTM. Some input PTMs may not have a matching reference PTM.
A PTM matches if these two requirements are true:
The types match:
If chosen superset then types should be equal.
If chosen superset_no_types the type is not considered.
The coordinates match if any of the following is true:
Both are known identical (positive integer) coordinates.
Both are known different (positive integer) coordinates, but the absolute difference between the two coordinates is less than or equal to a user-defined margin (‘range’ option in command line).
One of the coordinates is unknown (null, empty, ?, −1).
b) Subset (with and without PTM types)
The set of input PTMs is a subset of the reference PTMs set. This includes the command line arguments: -m subset or -m subset_no_types.
The UniProt accession is the same.
The isoform is the same; either:
Both have an isoform specified. Ex: P31749-3
Or both refer to the default one. Ex: P31749
The PTMs:
Each input PTM must have a matching reference PTM. Some reference PTMs may not have a matching input PTM.
A PTM matches if these two requirements are true:
Types match, i.e.:
If chosen subset (then types must be equal), or
If chosen subset_no_types (type is not considered)
The coordinates match if any of the following is true:
Both are known identical (positive integer) coordinates.
Both are known different (positive integer) coordinates but the absolute difference between the two coordinates is less than or equal to a user-defined margin (‘range’ option in command line).
One of the coordinates is unknown (null, empty, ?, −1).
c) One (with and without PTM types)
At least one input PTM matches. This includes the command line arguments: -m one or –m one_no_types.
The UniProt accession is the same.
The isoform is the same; either:
Both have an isoform specified. Ex: P31749-3
Or both refer to the default one. Ex: P31749
The PTMs:
At least one input PTM must have a matching reference PTM.
A PTM matches if these two requirements are true:
The types match:
If chosen one (then types should be equal), or
If chosen one_no_types (type is not considered)
The coordinates match if any of the following it true:
Both are known identical (positive integer) coordinates.
Both are known different (positive integer) coordinates, but the absolute difference between the two coordinates is less than or equal to a user-defined margin (‘range’ option in command line).
One of the coordinates is unknown (null, empty, ?, −1).
d) Strict
Proteoforms must match exactly in all the attributes.
The UniProt accession is the same.
The isoforms are the same; either:
Both have an isoform specified. Ex: P31749-3
Or both refer to the default one. Ex: P31749
The PTMs have the same elements:
The reference PTM set and the input PTM set have the same size.
Each reference PTM has a matching input PTM.
A PTM matches if:
Types are the same.
Coordinates are the same:
In case they are numbers, they should be equal
In case they are null, then both should be null.
Extra considerations:
Negative, zero or floating-point values are invalid as sequence coordinates in the input.
We accept only PSI-MOD ontology modification types.
The margin to compare the coordinates should be set as an unsigned integer.
Table 1 shows examples of PTM coordinates matching. The letter k represents any positive integer. It compares a PTM coordinate in an input PTM with a PTM coordinate in a reference PTM.
Mapping omics data to pathways
The input is mapped to proteins or proteoforms to find the reactions where the input entities are participants (Figure 9). The input is mapped to proteins when data types without PTMs or specific translation products are specified; otherwise a mapping to proteoforms is used. When one type of data yields multiple results due to ambiguity, e.g. a SNP or peptide mapping multiple proteins, all the possibilities are included in the search entities.
When a list of SNPs is provided, mapping from the Ensembl Variant Effect Predictor (VEP) [7] is used to find the possibly affected proteins. When peptides are provided, their sequence is mapped to UniProt protein identifiers [8] using PeptideMapper [9] and possible proteoforms are constructed. When proteins or proteoforms are available, PathwayMatcher maps them to reactions and pathways using data structures embedded in the PathwayMatcher jar file. These data structures are extracted from the Reactome Neo4j graph database (neo4j.com) and serialized. The code for extraction of the relationships from proteins to pathways is available at github.com/PathwayAnalysisPlatform/Extractor.
Proteins in Reactome are defined according to UniProt following a gene-centric paradigm. From all reactions in Reactome, 9,734 involve two proteins, participating in 2,208 human pathways [1] (version 64 at time of writing). Using additional information from Reactome on the post-translational state required for a protein to participate in a reaction, PathwayMatcher allows matching proteoforms to reactions and pathways. Tables 2 and 3 respectively list the proteins and proteoforms that are participating in the highest number of pathways.
Note that PathwayMatcher maps experimental data to pathways in a systematic and unbiased fashion. This means that it collects all pathways containing at least one of the participant proteins or proteoforms of the input data and does not perform any filtering or biological inference. Therefore, it attempts at minimizing the prevalence of false negatives by considering all the possible pathways annotated in the reference database. It can however not control for missing annotation, i.e. what is not annotated in the knowledgebase is not considered to be happening.
Over-representation analysis
The matching of each entity to a given pathway is modelled as a Bernoulli trial with two possible outcomes: success or failure, depending on whether the protein or proteoform is a participant of a reaction in the pathway. Trials are considered independent from each other, meaning that the outcome of previous trials does not affect the next. Finally, the probability of success is calculated by the proportion of choosing a protein in a pathway over the total number of possible proteins, therefore the probability is constant over all trials.
First, we search all the input entities (proteins or proteoforms) across all the pathways and count how many of them were found in each pathway. The number of entities found in a pathway is taken as the number of successful trials. Then, with the binomial probability distribution, we calculate how likely it would be to get a result equal to or more extreme than the current result (the same number or more proteins or proteoforms in the pathway), given that the input (proteins or proteoforms) were randomly selected [11].
This is done using the cumulative distribution function for the binomial distribution, which calculates the probability of getting at most k successes out of n trials, with a probability p ∈ [0,1], where X is a random variable following the binomial distribution, as detailed in Equation 1. For each pathway, p is set to the ratio between the number of total proteins or proteoforms in the pathway and the total possible entities in the database, n is the number of proteins or proteoforms in the input sample, k is the number of proteins successfully mapped in the pathway, X is the number of entities found in the current pathway after the search.
Finally, given that the p-value requires the calculation of the probability of an equal or more extreme result, we use the complement of Equation 1 to calculate the probability of getting at least k successful trials out of n as stated in Equation 2. The calculations for proteins or proteoforms are similar, but are performed separately depending on the input. If the input consists of protein accessions, the number of participants is calculated by only considering proteins. On the other hand, for the proteoform input, the number of entities in the pathways and the database are the participant proteoforms.
Please note that the over-representation analysis is included as a simple analysis to identify the most covered pathways. We recommend however that users rather interpret the results of the mapping and the networks using the systems biology method that best suits the experiment and biomedical context. PathwayMatcher is developed to be a hypothesis generation tool, helping to navigating large datasets and guide experiments. It is not a validation or mechanism inference tool.
Performance Benchmark
The performance of PathwayMatcher was evaluated using data sets of different sizes obtained from sampling publicly available resources:
Proteins: human complement of the UniProtKB/Swiss-Prot database (release 2017_10).
Peptides: ProteomeTools [20] as available in PRIDE [21], dataset PXD004732, release date 23/01/2017.
Genetic variants: variants from the human assembly GRCh37.p13.
Proteoforms: annotated proteoforms in Reactome Graph database version 62.
Performance testing was done using a standard desktop computer (Intel® Core™ i7-6600U CPU @ 2.60GHz with 2 cores using 64-bit Windows 10 with Java SE 1.8.0_144 on SSD). Details and code are available at github.com/PathwayAnalysisPlatform/PathwayMatcher/wiki/Test-datasets.
Metrics and Figures
The metrics presented in this manuscript were obtained by querying the Reactome graph database directly [22]. The queries used can be found in the online documentation at: github.com/PathwayAnalysisPlatform/PathwayMatcher/blob/master/docs/queriesForStatistics.md The figures in this manuscript were built in R version 3.4.1 (2017-06-30) - “Single Candle” (r-project.org) using the following packages: ggplot2, ggrepel, igraph, scico, grid, and gtable. The R scripts used to build the figures are available in the tool repository at: github.com/PathwayAnalysisPlatform/PathwayMatcher/tree/master/docs/figures/scripts
Availability of supporting source code and requirements
Project name: PathwayMatcher
Project home page: github.com/PathwayAnalysisPlatform/PathwayMatcher
Operating system(s): Platform independent
Programming language: Java Other requirements: License: Apache 2.0
RRID: SCR_016759
Declarations
- List of abbreviations
- PTM
- Post-translational modification
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests
Funding
LFHS, SJ, PRN, and MV are supported by the European Research Council and the Research Council of Norway. BB, CH, and HB are supported by the Bergen Research Foundation. HB is also supported by the Research Council of Norway. LFHS, SJ, PRN, and MV are supported by the European Research Council and by the Research Council of Norway. This work has been supported by National Institutes of Health BD2K grant (U54 GM114833) and National Human Genome Research Institute at the National Institutes of Health Reactome grant (U41 HG003751).
Authors’ contributions
LFHS did all of the programming, testing, documentation, and wrote the manuscript. BB and CH contributed with programming, testing, documentation, ideas, and manuscript writing. AF, SJ, and PRN contributed with ideas and manuscript writing. HB contributed with ideas, supervised the work, and wrote the manuscript. HH contributed with project design, manuscript writing, and supervised the work. MV contributed with project design, programming, documentation, testing, supervised the work, and wrote the manuscript. All authors participated in the preparation of the manuscript.
Acknowledgements
The authors thank the Reactome curators for their massive curation effort and the guidance in interpreting the annotations.
Footnotes
(luis.sanchez{at}uib.no), (bram.burger{at}uib.no), (carlos.horro{at}uib.no), (fabregat{at}ebi.ac.uk), (stefan.johansson{at}uib.no), (pal.njolstad{at}uib.no), (harald.barsnes{at}uib.no), (hhe{at}ebi.ac.uk), (marc.vaudel{at}uib.no)