Abstract
Word models (natural language descriptions of molecular mechanisms) are a common currency in spoken and written communication in biomedicine but are of limited use in predicting the behavior of complex biological networks. We present an approach to building computational models directly from natural language using automated assembly. Molecular mechanisms described in simple English are read by natural language processing algorithms, converted into an intermediate representation and assembled into executable or network models. We have implemented this approach in the Integrated Network and Dynamical Reasoning Assembler (INDRA), which draws on existing natural language processing systems as well as pathway information in Pathway Commons and other online resources. We demonstrate the use of INDRA and natural language to model three biological processes of increasing scope: (i) p53 dynamics in response to DNA damage; (ii) adaptive drug resistance in BRAF-V600E mutant melanomas; and (iii) the RAS signaling pathway. The use of natural language for modeling makes routine tasks more efficient for modeling practitioners and increases the accessibility and transparency of models for the broader biology community.
Standfirst text INDRA uses natural language processing systems to read descriptions of molecular mechanisms and assembles them into executable models.
Highlights
INDRA decouples the curation of knowledge as word models from model implementation
INDRA is connected to multiple natural language processing systems and can draw on information from curated databases
INDRA can assemble dynamical models in rule-based and reaction network formalisms, as well as Boolean networks and visualization formats
We used INDRA to build models of p53 dynamics, resistance to targeted inhibitors of BRAF in melanoma, and the Ras signaling pathway from natural language
INTRODUCTION
Biophysics and biochemistry are the foundations of quantitative reasoning about biological mechanisms (Gunawardena, 2014a). Historically, systems of biochemical mechanisms were described in reaction diagrams (familiar graphs involving forward and reverse arrows) and analyzed algebraically. As such systems became more complex and grew to include large networks in mammalian cells, word models (natural language descriptions) became the dominant way of describing biochemical processes; word models are frequently illustrated using pictograms and informal schematics. However, formal approaches are generally required to understand dynamics, multi-component switches, bistability etc. Dynamical models and systems theory have proven extremely effective in elucidating mechanisms of all-or-none response to apoptosis-inducing ligands (Albeck et al, 2008; Rehm et al, 2002), sequential execution of cell cycle phases (Chen et al, 2004), the interplay of stochastic and deterministic reactions in the control of cell fate following DNA damage (Purvis et al, 2012), drug sensitivity and disease progression (Lindner et al, 2013; Fey et al, 2015), bacterial cell physiology (Karr et al, 2012), the responses of ERK kinase (Chen et al, 2009) and the NFkB transcription factor (Hoffmann et al, 2002) to environmental stimuli, and similar biological processes. The challenge arises in linking a rich ecology of word models to computational representations of these models that can be simulated and analyzed. The technical environments used to create and explore dynamical models remain unfamiliar to many biologists and a substantial gap persists between the bulk of the literature and formal systems biology models.
A variety of methods have been developed to make mechanistic modeling more powerful and efficient. These include fully integrated software environments (Loew & Schaff, 2001; Hoops et al, 2006), graphical formalisms (Le Novère et al, 2009; Kolpakov et al, 2006), tabular formats (Tiger et al, 2012), high-level modular and rule-based languages (Smith et al, 2009; Mallavarapu et al, 2009; Danos et al, 2009), translation systems for generating Systems Biology Markup Language (SBML) models from pathway information (Büchel et al, 2013; Ruebenacker et al, 2009) and specialized programming environments such as PySB (Lopez et al, 2013). In addition, the BioModels database has provided a means to retrieve and reuse existing models (Juty et al, 2015). Such tools have increased transparency and reusability but not sufficiently to bridge the gap between verbal descriptions and computational models.
To date, most attempts to make modeling more accessible have focused on graphical interfaces in which users draw reaction diagrams that are then used to generate equations. This approach is attractive in principle, since informal diagrams are a mainstay of most scientific presentations, and schematic diagrams are essential in engineering, but it has proven difficult in practice to accommodate the simultaneous demands of accurately rendering individual biochemical reactions while also depicting large numbers of interacting components. It is particularly difficult to create graphical interfaces that model the combinatorially complex reactions encountered in animal cell signaling (Stefan et al, 2014).
In this paper we explore the idea that natural language can serve as a direct input for dynamical modeling. Natural language has many benefits as a means of expressing mechanistic information: in addition to being familiar, it can concisely capture experimental findings about mechanisms that are ambiguous and incomplete. Extensive work has been performed on the use of software to convert text into computable representations of natural language, and such natural language processing (NLP) tools are used extensively to mine the scientific literature (Fluck & Hofmann-Apitius, 2014; Krallinger et al, 2012). To our knowledge however, natural language has not been widely used as a direct input for mechanistic modeling of biological or chemical processes. A handful of studies have explored the use of formal languages resembling natural language for model creation (Wasik et al, 2013; Kahramanoğullari et al, 2009) but these systems focus on capturing low-level reaction mechanisms and require that descriptions conform to a precisely defined syntax.
Three technical challenges must be overcome to convert natural language into executable models. The first is reading text with a machine in a manner that reliably identifies mechanistic assertions in the face of variation in how they are expressed. The second is designing an intermediate knowledge representation that captures often-ambiguous and incomplete mechanisms without adding unsubstantiated assumptions (thereby implementing the rule: “don’t know, don’t write”). This intermediate representation must be compatible with existing machine-readable sources of network information such as pathway databases. The third challenge is translating mechanistic assertions from the intermediate representation into executable models involving different mathematical formalisms and levels of detail; this involves supplying necessary assumptions left out of the original text.
The method and software tool described in this paper, the Integrated Network and Dynamical Reasoning Assembler (INDRA), addresses these challenges and makes it possible to construct different types of executable models directly from natural language and fragmentary information in pathway databases. In contrast to previous approaches to incorporating natural language in models, INDRA can accommodate flexibility in style and syntax through the use of NLP algorithms that normalize variability in expression into logical forms that effectively represent the underlying meaning (Box 1). Mechanisms extracted from natural language and other sources are converted into Statements (the INDRA intermediate representation) and then translated into one of several types of models depending on the specific use case. We describe this process in some detail because it relates directly to how we understand and communicate biological mechanisms in papers and conversations. The essential challenge is converting the informality and ambiguity of language, which is frequently a benefit in the face of incomplete information, into a precise set of statements (or equations) needed for an executable mathematical model.
As a test case, we show that INDRA can be used to automatically construct a model of p53 dynamics in response to DNA damage from a few simple English statements; we show that the qualitative behavior of the INDRA model matches that of an existing mathematical model constructed by hand. In a second, more challenging test, we show that an ensemble of models of the MAP kinase pathway in cancer cells can be built using literature-derived text describing the interaction of BRAFV600E and drugs used to treat melanoma (Box 4). Finally, we use natural language and INDRA to assemble a large-scale model of the RAS pathway as defined by a community of RAS biology experts; we show how this model can be updated using sentences gathered from the RAS community.
Glossary
Application programming interface (API)
a standardized interface by which one software system can use services provided by other software, often remotely; in the current context, INDRA accesses NLP systems and pathway databases via APIs. INDRA exposes an API that other software can build upon. API is used here interchangeably with Interface (e.g. INDRA’s TRIPS Interface).
Molecular mechanism
used in this paper to refer to processes involved in changing the state of a molecular entity or in describing its interaction with another molecular entity as represented by a set of linked biochemical reactions. Descriptions of mechanisms are common in the biomedical literature and key assertions are captured in databases in formats such as BioPAX. The information we extract from such descriptions are interchangeably referred to as mechanistic information, mechanistic assertions, mechanistic facts and mechanistic findings.
Processor
a module in INDRA that constructs INDRA Statements from a specific input format.
Template extraction
the process by which INDRA Processors extract INDRA Statements from various input formats.
Assembler
a module in INDRA that constructs a model, network or other output from INDRA Statements.
Model assembly
the process of automatically generating a model in a given computational formalism from an intermediate knowledge representation; in our context from INDRA Statements.
Executable model
a computational model that can be simulated to reproduce the observable dynamical behavior of a system; often, but not always, a system of linked differential equations.
Policies
user-defined settings which affect the automated assembly process.
Knowledge representation
a formalism that allows aggregation of information, potentially from multiple sources, in a standardized computable format; in the current context, INDRA Statements serve as a common knowledge representation for mechanistic information.
Natural language (NL)
language that humans commonly use to communicate in speech and writing; in the current context, restricted to the English language.
Natural language processing (NLP)
the algorithmic process by which a computer interprets natural language text.
Named entity recognition (NER)
a sub-task of NLP concerned with the recognition of special words in a text that are not part of the general language; in the current context NER is used to identify proteins, metabolites, drugs, and other terms (which are generally referred to as named entities).
Grounding
a sub-task of NLP related to NER which assigns unique identifiers to named entities in text by linking them to ontologies and databases; in the current context this involves creating links to databases such as UniProt, HGNC, GO or ChEBI.
Logical form (LF)
a graph representing the meaning of a sentence; an intermediate output of natural language processing in the TRIPS system (Box 1).
Extraction knowledge base (EKB)
a collection of events and terms relevant to molecular biology that is the result of natural language processing with TRIPS (Box 1).
RESULTS
INDRA decouples the curation of mechanistic knowledge from model implementation
A core concept in INDRA is that the identification, extraction and regularization of mechanistic information (curation) is a distinct process from model assembly and implementation. Mechanistic models demand a concrete set of assumptions (about catalytic mechanisms, stoichiometry, rate constants, etc.) that are rarely expressed in a single paper or molecular interaction entry stored in a database. Models must therefore combine relatively general assertions about mechanisms extracted from available knowledge sources (e.g., that enzyme E “activates” substrate S) with information or assumptions about molecular details (e.g., that the enzyme acts on the substrate S in a three-step ATP-dependent mechanism involving an activating site on the substrate) derived from general knowledge about biochemistry and biophysics. Precisely how such details are constructed depends on the requirements of the mathematical formalism, the specific biological use case, and the nature of the hypothesis being tested. A similar concept was recently introduced for rule-based modeling in Basso-Blandin et al (2016) and in the context of graphical model diagrams in O’Hara et al (2016). In both works, the authors make a distinction between the curation and representation of mechanistic knowledge and its executable implementation.
Text-to-model conversion in INDRA involves three coupled steps. First, text is processed into a machine-interpretable form and the identities of proteins, genes and other biological entities are grounded in reference databases. Second, the information is mapped onto an intermediate knowledge representation (INDRA Statements) designed to correspond in both specificity and ambiguity to descriptions of biochemistry as found in text (e.g. “MEK1 phosphorylates ERK2”). Third, the translation of this intermediate representation into concrete reaction patterns and then into executable forms such as networks of ordinary differential equations (ODEs) is performed in an assembly step. In this process, Statements capture mechanistic information available from the knowledge source without additions or assumptions, deferring interpretations of specific reaction chemistry that are often unresolved by the knowledge source but must be made concrete to assemble a model.
Information flow from natural language input to a model
The three steps in text-to-model conversion are implemented in a three-layer software architecture. An input layer comprising Interface and Processor modules (Figure 1A, block 1) is responsible for communicating with language processing systems (e.g., the TRIPS NLP system, see Box 1) and pathway databases (e.g., the Pathway Commons database) to acquire information about mechanisms. An intermediate layer contains the library of Statement templates (Figure 1A, block 2), and an output layer contains Assembler modules that translate Statements into formats such as networks of differential equations or protein-protein interaction graphs (Figure 1A, block 3). INDRA is written in Python and available under the open-source BSD license. Source code and documentation are available at http://indra.bio; documentation is also included in the Appendix.
As an example of text being converted into an executable model, consider the sentence “MEK1 phosphorylates ERK2 at threonine 185 and tyrosine 187.” Figure 1B shows eight lines of Python code implementing this example; the numbers alongside each code block correspond to the three layers of the INDRA architecture in Figure 1A and implement the flow of information between the user, INDRA and external tools shown in Figure 1C. The user first enters the sentence to be processed and calls the process_text command in the INDRA TRIPS Interface. This function sends a request to the web service exposed by the TRIPS NLP system (Allen et al, 2015) (Figures 1B and 1C, block 1). INDRA can also call on the REACH NLP system, which has complementary capabilities (Valenzuela-Escarcega et al, 2015), but in this paper we focus exclusively on TRIPS. TRIPS parses the text into its logical form (Box 1, Appendix Figure S1A), and then extracts mechanisms relevant to molecular biology into an extraction knowledge base (EKB; Box 1, Appendix Figure S1B). Included in this process are entity recognition and grounding whereby MEK1 is recognized as a synonym of the HGNC gene name MAP2K1 and grounded to UniProt Q02750, and Erk2 is grounded to MAPK1 and UniProt P28482. These terms are explained in Box 1, in Appendix Section 2.1, and in (Allen et al, 2015). The TRIPS Processor in INDRA extracts Statements directly from the EKB output returned by TRIPS (Figures 1B and 1C, block 2). The translation of Statements into concrete models is performed by an INDRA Assembler. In this example, a PySB Assembler was used to build a rule-based model in PySB (Lopez et al, 2013) and generate an SBML-compatible reaction network (Figures 1B and 1C, block 3). Because the Phosphorylation Statements in this example are compatible with multiple concrete reaction patterns, the user specifies a policy for assembly: here we used the “two-step” policy, which implements phosphorylation with reversible enzyme-substrate binding (polices are described below). The resulting reaction network was instantiated as a set of ODEs and simulated using default parameter values to produce the temporal dynamics of all three phosphorylated forms of ERK2 (labeled MAPK1; Figure 1C, bottom right). The same rule-based model can also be analyzed stochastically using network-free simulators (Danos et al, 2007b; Sneddon et al, 2011).
Box 1: Natural language processing using TRIPS
To convert text into computable representations that capture syntax and semantics INDRA uses external NLP software systems exposed as web services. This paper focuses on DRUM (Deep Reader for Understanding Mechanisms; http://trips.ihmc.us/parser/cgi/drum) which is a version of the general-purpose TRIPS NLP system customized for extracting biological mechanisms from natural language text. TRIPS has been developed over a period of decades and used for natural language communication between humans and machines in medical advice systems, robotics, mission planning, etc. (see for example Ferguson & Allen, 1998; Chambers et al, 2005; Allen et al, 2006).
The first step in processing natural language with TRIPS is a “shallow” or syntactic analysis of text to identify grammatical relationships among words in a sentence, recognize named entities such as proteins, amino acids, small molecules, cell lines, etc. and link these entities to appropriate database identifiers (the process of grounding). TRIPS uses this information to perform a “deep” semantic analysis and try to determine the meaning of a sentence in terms of its logical structure. This process draws on a general purpose semantic lexicon and ontology that defines a range of word senses and semantic relations among words. The output of this process is represented as a logical form (LF) graph (Manshadi et al, 2008). The LF graph represents the sense of each word (e.g. “protein”) and captures the semantic roles of relevant arguments (e.g. “affected”) for each predicate (e.g. “activation”). The LF also represents tense, modality and aspect information — information that is crucial for determining whether a statement expresses a stated fact, a conjecture or a possibility.
The LF graph is then transformed into an extraction knowledge base (EKB) containing extractions relevant for the domain, in this case molecular biology. LF graphs compactly represent and normalize much of the variation and complexity in sentence structure; EKBs can therefore be extracted from the LF using a relatively small set of rules. The EKB is an XML file containing entries for terms (e.g., proteins, drugs), events (e.g., activation, modification) involving those terms, and higher-level causal relations between the events. The EKB also contains additional information such as the text from which a given term or event was constructed.
A more thorough technical description of TRIPS/DRUM is given in Appendix Section 2.1 and in (Allen et al, 2015); a broader overview of NLP systems can be found in (Allen, 2003).
INDRA Statements represent mechanisms from multiple sources
INDRA Statements serve as the bridge between knowledge sources and assembled models and we therefore describe them in detail. Statements are implemented as a class hierarchy that groups related mechanisms; a Unified Modeling Language (UML) diagram of existing Statement classes is shown in Appendix Figure S2. Each INDRA Statement describes a mechanism involving one or more molecular entities, along with information specific to the mechanism and any supporting evidence drawn from knowledge sources. For example, the phosphorylation Statement shown schematically in Figure 2A contains references to enzyme and substrate Agents (which in this case refers to MAP2K1 and MAPK1, respectively), the phosphorylated residue and position on the substrate, and one or more Evidence objects with supporting information. An Agent is an INDRA object that captures the features of the molecular state necessary for a participant to take part in a molecular process (Figure 2B). This includes necessary post-translational modifications, bound co-factors, mutations, cellular location and state of activity (Figure 2B and Appendix Figure S4). Agents also include annotations that ground molecular entities to unique identifiers in one or more databases or ontologies (e.g. HGNC, UniProt, ChEBI; Figure 2B). Evidence objects contain references to supporting text, citations and relevant experimental context (Figure 2C).
An important feature of both Statements and Agents is that they need not be fully specified. If there is no information in the source pertaining to a specific detail in a Statement or Agent then the corresponding entry is left blank; this is an example of the “don’t know, don’t write” principle. INDRA and the rule-based models it generates are designed to handle information that is incomplete in this way. For example, the phosphorylation Statement shown in Figure 2A indicates that the phosphorylation of substrate MAPK1 can occur when the enzyme MAP2K1 is phosphorylated at serine residues S218 and S222, but other aspects of the state of MAP2K1 are left unspecified (e.g., whether MAP2K1 is phosphorylated at S298, or bound to a scaffold protein such as KSR). Statements capture the ambiguity inherent in the vast majority of statements about biological processes thereby permitting multiple interpretations: for example, phosphorylation of MAP2K1 at S218 and S222 could be necessary and sufficient for activity against MAPK1, necessary but not sufficient, sufficient but not necessary, or neither sufficient nor necessary, depending on other molecular context outside the scope of the Statement. The ability of Statements to capture knowledge from input sources while making as few additional assumptions as possible is an essential feature of the text-to-model conversion process. It also conforms closely to the way individual experiments are described and interpreted since single experiments investigate only a subset of the facts pertaining to a biochemical mechanism and its implementation in a model. The ambiguity in Statements is resolved during the assembly step by explicitly declaring assumptions and generating a fully-defined executable model.
Users can inspect INDRA Statements in several complementary ways: (i) by inspecting Statements as Python objects; (ii) by rendering Statements visually as graphs (Appendix Figure S3A); and (iii) by serializing Statements into a platform-independent JSON exchange format (Appendix Figure S3B). The semantics of INDRA Statements as well as the semantics describing the role that Agents play in each INDRA Statement are grounded in the Systems Biology Ontology (SBO) (Courtot et al, 2011) facilitating integration and reuse in other applications. These capabilities are demonstrated in Appendix Notebook 1.
Normalized extraction of findings from diverse inputs using mechanistic templates
The principal technical challenge in extracting mechanisms from input sources is identifying and normalizing information contained in disparate formats (e.g., BEL, BioPAX, TRIPS EKB) into a common form that INDRA can use. INDRA queries input formats for patterns corresponding to existing Statement types (templates), matching individual pieces of information from the source format to fields in the Statement template. This procedure is implemented for each type of input, making it possible to extract knowledge in a consistent form. Template-matching does not guarantee that every mechanism found in a source can be captured by INDRA, but it does ensure that when a mechanism is recognized, the information is captured in a normalized way that enables downstream model assembly. The process is therefore configured for high precision at the cost of lower recall.
INDRA implements template-matching extraction for each input format using a set of Processor modules. In the case of natural language, the EKB (see Glossary and Box 1) output from TRIPS serves as an input for the TRIPS Processor in INDRA. For a statement such as “MAP2K1 that is phosphorylated on S218 and S222 phosphorylates MAPK1 at T185” the EKB extraction graph (Figure 3, top left) has a central node (red text) corresponding to a phosphorylation event that applies to three terms: MAP2K1 as the agent for this event, MAPK1 as the entity affected by this event, and “threonine-185” playing the specific role of being the site where the event occurs (green text depicts the grounding in UniProt and HGNC identifiers). A second phosphorylation event (yellow box) involving S218/S222 of MAP2K1 is recognized by TRIPS as a nested property of MAP2K1 phosphorylation. It is a precondition for the primary phosphorylation event on MAPK1.
INDRA establishes that this extraction graph corresponds to an INDRA Phosphorylation Statement and then exploits the fact that the template for such a Statement has entries for an enzyme, a substrate, a residue and a position (Figure 2A). The AGENT in the TRIPS EKB is identified as the enzyme which itself has a modification (phosphorylation) at specified positions (S218 and S222). The AFFECTED portion of the TRIPS EKB is identified as the substrate MAPK1. The extracted INDRA Statement collects this information along with target residue (“threonine”) and position (“185”) on the substrate. The end result is a biochemically plausible depiction of a specific type of reaction from a short fragment of text.
Extraction of a Phosphorylation Statement from databases using BioPAX or BEL follows the same general procedure. The INDRA BioPAX Processor uses graph patterns to search for reactions in which a substrate on the right-hand side gains a phosphorylation modification relative to the left-hand side (Figure 3, center left). The Processor identifies this as a phosphorylation reaction and constructs a Phosphorylation Statement for each such reaction that it finds.
In the case of BEL, statements consisting of subject-predicate-object expressions describe the relationships between molecular entities or biological processes (Box 2). INDRA’s BEL Processor queries a BEL corpus (formatted as an RDF graph) for expressions consistent with INDRA Statement templates. For example, Phosphorylation Statements are extracted by searching for expressions in which the subject represents the kinase activity of a protein that directly increases an object representing a modified protein (Figure 3, bottom left); directly increases is a predicate used when molecular entities interact physically. Triples that fit this pattern are extracted into an INDRA Phosphorylation Statement with the subject as the enzyme and the object as the substrate.
Box 2: BioPAX and BEL
BioPAX is a widely used format for describing biological interactions that facilitates exchange and integration of pathway information from multiple sources (Demir et al, 2010). BioPAX is the core exchange format underlying the Pathway Commons database, which aggregates information from over 20 existing sources including Reactome, NCI-PID, KEGG, PhosphoSitePlus, BioGRID and Panther (Cerami et al, 2011). Pathway Commons provides a web service with an interface for submitting queries about pathways and recovering the result as a BioPAX graph; a query could involve finding all proteins and interactions in the neighborhood of a specified protein or finding all paths between two sets of proteins.
BioPAX employs a Web Ontology Language (OWL) knowledge representation centered around biochemical processes and reactants and is applicable to metabolic, signaling and gene regulatory pathways. The representation of reactions in BioPAX is flexible: an arbitrary set of complexes and standalone molecules on the left-hand side of a reaction can produce complexes and molecules on the right hand side subject to one or more catalytic controllers.
The Biology Expression Language (BEL) facilitates the curation of knowledge from the literature in a machine-readable form. While BioPAX is designed to capture direct, molecular interactions, BEL can express indirect effects and higher level cellular-or organism-level processes; for example, BEL can represent results such as the abundance of BAD protein increases apoptosis. Each BEL Statement records a scientific finding, such as the effect of a drug or other perturbation on an experimental measurement, along with contextual annotations such as organism, disease, tissue and cell type. BEL Statements are structured as subject, predicate, object (RDF) triples: the subject and object are BEL Terms identifying molecular entities or biological processes, and the predicate is a relationship such as increases or decreases. BEL has been used to create both public and private knowledge bases for machine reasoning; the BEL Large Corpus (see www.openbel.org) is currently the largest openly-accessible BEL knowledge base and consists of about 80,000 statements curated from over 16,000 publications.
Assembly of alternative executable models from mechanistic findings
The role of INDRA Assemblers is to generate models from a set of Statements. This step is governed not only by the relevant biology, but also by the requirements of the target formalism (e.g. ODE systems, rule-based models or graphs) and decisions about model complexity (e.g., the number of variables, parameters, or agents). INDRA has multiple Assemblers for different model formats; here we focus on the PySB Assembler, which creates rule-based models that can either be simulated stochastically or as networks of differential equations (Danos et al, 2007a; Faeder et al, 2009). Models assembled by INDRA’s PySB Assembler can be exported into many widely used modeling formalisms such as SBML, MATLAB, BNGL and Kappa using existing PySB functions (Lopez et al, 2013).
Assembling an INDRA Phosphorylation Statement into executable form requires a concrete interpretation of information that is almost always unspecified or ambiguous in the source text or database object. We illustrate this process using four alternative ways to describe the phosphorylation of MAPK1 by MAP2K1 (Figure 4). As a first step, the assembly of this Statement requires a concrete interpretation of a partially specified state of the enzyme agent: MAP2K1 sites S218 and S222 are specified as being phosphorylated but no information is available about other sites or binding partners.
In assembling rules, the PySB Assembler omits any unspecified context, exploiting the “don’t care don’t write” convention (Box 3) so that the states of unspecified sites are treated as being irrelevant for rule activity. The default interpretation is therefore that phosphorylation of MAP2K1 at S218 and S222 is sufficient for kinase activity; whether or not it is also necessary is determined by other rules involving MAP2K1 that may be in the model.
Box 3: Rule-based modeling and PySB
Accurate simulation of biochemical systems requires that every species be explicitly tracked through time. The combinatorial nature of protein complex assembly, post-translational modification and related processes causes the number of possible molecular states in many signaling networks to explode and exceed the capacity for efficient simulation (Stefan et al, 2014). For example, full enumeration of complexes involved in EGF signaling would require more than 1019 molecular species differing in their states of oligomerization, phosphorylation and adapter protein binding (Feret et al, 2009). Rule-based modeling (RBM) languages such as Kappa and BioNetGen (BNGL) address this challenge by allowing interactions among macromolecules to be defined using “rules” specifying the local context required for a molecular event to occur (Faeder et al, 2009; Danos et al, 2007a). The molecular features that do not affect the event are omitted from the rule, a convention known as “don’t care, don’t write.” Specifying molecular interactions as rules has two chief benefits: (i) it makes the representation of a model much more compact and transparent than a set of equations; (ii) it enables the simulation of very complex systems using network-free methods (Danos et al, 2007b). RBMs can also be translated into conventional modeling formalisms such as networks of ODEs.
Executable model assembly in INDRA is built on PySB, a software system that embeds a rule-based modeling language within Python, thereby enabling the use of macros and modules to concisely express recurring patterns such as catalysis, complex assembly, sub-pathways, etc. (Lopez et al, 2013). Rule-based modeling languages are well-suited to building executable models from high-level information sources such as natural language because assertions about mechanisms typically specify little molecular context. INDRA converts such assertions into one or more model rules using policies that control the level of detail.
The second step in the assembly of a Phosphorylation Statement is generating a concrete set of biochemical reactions that constitute an executable model. The challenge here is that the concept of protein “phosphorylation” can be realized in a model in multiple different ways. For example, a “one-step” policy converts an INDRA Phosphorylation Statement into a single bimolecular reaction in which a product (a phospho-protein) is produced in a single irreversible reaction without explicit consideration of enzyme-substrate complex formation. One-step reactions can be modeled using a variety of rate laws depending on modeling assumptions, including a pseudo-first-order rate law (Figure 4, “one-step policy, pseudo-first-order” comprising one reaction rule and one free parameter) in which the rate of the reaction is proportional to the product of the enzyme and substrate concentrations. Such a representation is not biophysically realistic, since it does not reproduce behaviors such as enzyme saturation, but it has the advantage of requiring only one free parameter. Alternatively, a one-step reaction can be modeled with a Michaelis-Menten rate law (Figure 4, “one-step policy, Michaelis-Menten”) which generates one reaction rule and two free parameters; this policy makes a quasi-steady-state assumption about the enzyme-substrate complex (Chen et al, 2010). One-step mechanisms are convenient for modeling coarse-grained dynamics and causal flows in complex signaling networks (Salazar & Höfer, 2006). A “two-step policy” is more realistic and creates two rules: one for reversible enzyme-substrate binding and one for product release (Figure 4, “two-step policy”; two reaction rules and three free parameters). This is the most common interpretation of a phosphorylation reaction in existing dynamical models and correctly captures enzyme saturation, substrate depletion, and other important mass-action effects. However, the two-step policy does not explicitly consider ATP as a substrate, and cannot model the action of ATP-competitive kinase inhibitors at the enzyme active site. The “ATP-dependent” policy corrects for this and explicitly models the binding of ATP and substrate as separate reaction steps (Figure 4, “ATP-dependent policy”) generating three reaction rules and five free parameters. Other mechanistic interpretations of “phosphorylation” are also possible: for example, two-step or ATP-dependent policies in which the product inhibits the enzyme by staying bound (or re-binding) after the phospho-transfer reaction (Gunawardena, 2014b). Such rebinding can have a substantial impact on kinase activity.
It might appear at first glance that the most biophysically realistic policy is preferable in all cases. However, a fundamental tradeoff always exists between model complexity and faithfulness to underlying detail: as the biochemical representation becomes more detailed, the number of free parameters and intermediate species increases, reducing the identifiability of the model (Raue et al, 2009). Given such a tradeoff, the benefit of having multiple assembly policies becomes clear: alternative models can automatically be constructed from a single high-level biochemical assertion depending on their suitability for a particular modeling task. The transparency and repeatability of model generation using assembly policies is especially important for larger networks in which hundreds or thousands of distinct species are subject to adjustment as the biophysical interpretation changes.
To enable simulation of reaction networks as ODEs in the absence of data on specific rate parameters, INDRA uses a set of biophysically plausible default parameters; for example, association rates are diffusion limited (106 M-1 S-1), off-rates default to 10-1 s-1 (yielding a default KD of 100 nM) and catalytic rates default to 100 s-1. These parameter values can be adjusted manually or obtained by parameter estimation. An extensive literature and wide range of tools exist for parameter estimation using experimental data and they are directly applicable to models assembled by INDRA (Mendes & Kell, 1998; Moles et al, 2003; Eydgahi et al, 2013; Thomas et al, 2015). For simplicity, we do not discuss this important topic further and rely below either on INDRA default parameters or manually adjusted parameters (as listed in the Appendix) to facilitate dynamical simulations.
Modeling alternative dynamical patterns of p53 activation
As an initial test of using INDRA to convert a word model and accompanying schematic into an executable model, we turned to a widely cited review in Cell that describes the canonical reaction patterns controlling the responsiveness of mammalian signal transduction systems to stimulus (Purvis & Lahav, 2013). Figure 5 of (Purvis & Lahav, 2013) depicts the dynamics of p53 response to single stranded and double stranded DNA breaks (SSBs and DSBs). Using a schematic illustration, Purvis and Lahav explain that pulsatile p53 dynamics arises in response to DSBs but sustained dynamics are induced by SSBs. The difference is attributed to negative feedback from the Wip1 phosphatase to the DNA damage sensing kinase ATM, but not to the related kinase ATR. We wrote a set of simple declarative phrases (Figure 5B and C) corresponding to edges in the schematic diagram (Figure 5A) that represent activating or inhibitory interactions between Mdm2 (an E3 ubiquitin-protein ligase), p53, Wip1 and ATM (or ATR) (yellow numbers in Figure 5A, B and C). We then used INDRA to read the text (the “word models”) and assemble executable models in PySB. These models were instantiated as networks of ODEs and simulated numerically. For each model we plotted p53 activation over time using standard Python libraries (Oliphant, 2007).
We found that our initial word models (comprising sentences 1-5 in Figure 5B and sentences 1-6 in 5C) failed to reproduce the p53 dynamics expected for SSBs and DSBs: in our INDRA models SSBs induced steady, low-level activation of p53 and DSBs failed to induce oscillation (Appendix Figure S6). One feature not explicitly included in the Purvis and Lahav diagrams and hence missing from our initial word models is negative regulation of Mdm2 and Wip1. Visual representations of signaling pathways frequently omit such inhibitory mechanisms despite their impact on dynamics (Heinrich et al, 2002). (Purvis and Lahav were aware of these inhibitory reactions since they are found in ODE-based models of p53 dynamics from the same research group (Batchelor et al, 2011); because the diagram’s purpose was to illustrate the specific role of negative feedback these reactions were likely omitted for clarity.) The mechanisms that inactivate Mdm2 involve binding by the catalytic inhibitor p14ARF (Agrawal et al, 2006) and those for Wip1 involve HIPK2-mediated phosphorylation and subsequent ubiquitin-dependent degradation (Choi et al, 2013) (depicted by dotted arrows and pink numbers in Figure 5A). We added these reactions to the model as simple natural language phrases (denoted by pink numbers in 5B and C).
When the updated word models were assembled using INDRA and simulated as ODEs, p53 exhibited sustained activation in response to SSBs but did not oscillate in response to DSBs (Appendix Figure S6). We then realized that the DSB response model lacked a fundamental property of an oscillatory system, namely a time delay (Novák & Tyson, 2008). This delay had previously been modeled by Lahav and colleagues (Batchelor et al, 2011) by using delay differential equations but time delays can also be generated by positive feedback (Novák & Tyson, 2008). Both ATM and ATR are known to undergo activating auto-phosphorylation (Bakkenist & Kastan, 2003; Liu et al, 2011). We therefore added phrases describing auto-activation of ATM or ATR to the word models (denoted by dotted arrow and green numbers in Figure 5A, corresponding to green numbers in B and C). When assembled by INDRA, the extended word models successfully generated p53 oscillation in response to DSBs (Figure 5C). The presence of oscillations was robust to changes in kinetic parameters and initial conditions (Appendix Table 3 and Appendix Figure S6). Moreover, in the expanded model ATR-dependent p53 activation by SSBs still resulted in sustained p53 activation (Figure 5B, Appendix Table 2 and Appendix Figure S6). The key point in this exercise is that features essential for the operation of a dynamical system (e.g. degradation and auto-activation) were omitted from an informal diagram focusing on feedback for reasons of brevity and clarity, but this had the unintended consequence of decoupling the text from the pathway schematic and the schematic from the dynamics being described. By converting word models directly into executable computational models, we ensure that verbal descriptions and dynamical simulations are congruent.
The p53 model offers an opportunity to test how robust INDRA (and the TRIPS reading system) are to changes in the way input text is phrased. When we tested eight alternatives for the phrase “Wip1 inactivates ATM” ranging from “Wip1 has been shown to deactivate ATM” to “ATM is inactivated by Wip1” (Figure 5D, right, green sidebar) and found that all eight generated the same INDRA Statement and thus the same model as the original sentence. However, NLP is sensitive to spelling errors such as “deaactivates” [sic] and to grammatical errors such as “ Wip1 inactivate ATM”. In addition, some valid linguistic variants are not recognized, representing a limitation of extraction into INDRA Statements (Figure 5D, right, red sidebar). We also tested whether differences in the way biological entities are named affects recognition and grounding; we found that Wip1, WIP-1, WIP1, PPM1D and Protein phosphatase 1D as well as ATM, Atm and ataxia telangiectasia mutated all worked as expected (Figure 5D, bottom, green). However, the recognition of protein and gene names in text is challenging; for instance, “PP2C delta” was not recognized as a synonym for Wip1 (Figure 5D, bottom, red), though the more common variant “PP2Cδ” is.
We then used INDRA to assemble a more detailed and mechanistically realistic model of p53 activation following DSBs (Figure 5E; POMI1.0). While the model in Figure 5C contained only generic activating and inhibitory reactions, the goal of POMI1.0 was to test INDRA concepts such as phosphorylation, transcription, ubiquitination and degradation. We also used modifiers to describe the molecular state required for a protein to participate in a particular reaction (e.g. “ubiquitinated p53 is degraded”). The set of ten phrases shown in Figure 5E were assembled into 11 rules, 12 ODEs and 18 parameters (Appendix Table 4). When we simulated the resulting ODE model we observed the expected oscillation in p53 activity (Figure 5E and Appendix Figure S6). By adding and removing different aspects of the underlying mechanism using natural language we observed that including the mechanism “Active ATM phosphorylates another ATM molecule” was essential for oscillation; the phrase “ATM phosphorylates itself” generated a valid set of reactions but did not create oscillations for any of the parameter values we sampled. The difference is that “Active ATM phosphorylates another ATM molecule” corresponds to a trans-phosphorylation reaction (other phrasings also work, such as “Active ATMtrans-phosphorylates itself”)—i.e. one molecule of ATM phosphorylates another molecule of ATM—which produces the non-linearity necessary for a time delay. In contrast, “ATM phosphorylates itself” implies modification in cis, which is incapable of generating oscillations in the p53 network. It is well known that ATM and ATR auto-phosphorylation occur in trans (Bakkenist & Kastan, 2003; Liu et al, 2011), validating this aspect of the model. This result highlights a danger in the use of word models alone: differences in mechanism that profoundly impact network dynamics can be obscured by ambiguous and imprecise natural language. Such ambiguities are picked up by INDRA and can be studied at intermediate stages of the extraction and assembly process (see Appendix iPython Notebook 1). The phrase “Active ATM phosphorylates another ATM molecule” is not particularly elegant English, but it is unambiguous; understanding that “ATM phosphorylates itself” is insufficient for p53 oscillation highlights the essential difference between trans and cis phosphorylation.
The foregoing analysis of the Lahav and Purvis review illustrates several beneficial features of direct text to model conversion: (i) the possibility of identifying subtle gaps and deficiencies in word models with the potential to profoundly affect network dynamics and function; (ii) the ability to maintain precise congruence between verbal, pictorial and computational representations of a network; and (iii) a reminder to include neglected negative regulatory mechanisms when explaining network dynamics. We propose that future figures of this type include accompanying declarative text (precisely stated word models) on the basis of which graphs and dynamical models can be created. We have found that it is remarkably informative to experiment with language and then render it in computational form: it was this type of experimentation that led us to rediscover for ourselves the importance of negative regulation and nonlinear positive feedback in generating p53 oscillations.
Modeling resistance to targeted therapy by vemurafenib
The MAPK/ERK signaling pathway is a key regulator of cell proliferation, differentiation and motility and is frequently dysregulated in human cancer (Box 4). Multiple ATP competitive and noncompetitive (allosteric) inhibitors have been developed targeting kinases in this pathway. The most clinically significant drugs bind RAF and MEK kinases in BRAF-mutant melanomas. For patients whose tumors express an oncogenic BRAFV600E/k mutation, treatment with the BRAF inhibitor vemurafenib (or, in more recent practice, a combination of the BRAF inhibitor dabrafenib and MEK inhibitor trametinib) results in dramatic tumor regression. Unfortunately, this is often followed by drug-resistance and disease recurrence 6 to 18 months later (Larkin et al, 2014). The mechanisms of drug resistance are under intensive study and include an adaptive response whereby MAPK signaling is reactivated in tumor cells despite continuous exposure to BRAF inhibitors (Shi et al, 2012a; Lito et al, 2012, 2013). Re-activation of MAPK signaling in drug-treated BRAFV600E/K cells is thought to involve disruption of ERK-mediated negative feedback (Figure 6A). The biochemistry of this process has been investigated in some detail and is subtle. For example, differential affinity of BRAF kinase inhibitors to monomeric and dimeric forms of BRAF are partly responsible for the ERK rebound (Kholodenko, 2015; Yao et al, 2015). Many of these processes have not been subjected to detailed kinetic modeling within the scope of the MAPK signaling pathway, and several mechanistically distinct hypotheses have been advanced to describe the same drug adaptation phenomenon. Adaptation to BRAF inhibitors therefore represents a potentially valuable application of dynamical modeling to a rapidly moving field of cancer biology (Kholodenko, 2015).
We sought to use natural language to rapidly create models of MAPK signaling in melanoma cells using mechanisms drawn from the literature, with a particular focus on a series of influential papers from the Rosen lab (Joseph et al, 2010; Poulikakos et al, 2010; Lito et al, 2012; Yao et al, 2015). We also sought to establish whether different biochemical hypotheses could be easily tested by modifying models at the level of natural language.
Box 4: The MAPK pathway and vemurafenib resistance in cancer
In normal cells, signal transduction via MAPK is initiated when an extracellular growth factor such as EGF induces dimerization of receptor tyrosine kinases (the EGFR RTK for example) on the cell surface. Dimerization and subsequent activation of RTKs results in assembly of signaling complexes at the plasma membrane and conversion of RAS-family proteins (HRAS, KRAS, and NRAS) to an active, GTP-bound state. RAS-GTP activates members of the RAF family of serine/threonine kinases (ARAF, BRAF, and RAF1), which serve as the first tier in a three-tier MAP kinase signaling cascade: RAF proteins phosphorylate MAP2K/MEK family proteins, which in turn phosphorylate the MAPK/ERK family proteins that control transcription factor activity, cell motility and other aspects of cell physiology. MAPK signaling is subject to regulation by feedback mechanisms that include inhibitory phosphorylation of EGFR and SOS by ERK, inhibition of the GRB2-mediated scaffold by the SPRY family of proteins, and inhibition of ERK by DUSP proteins (Lito et al, 2012).
MAPK/ERK signaling is a key regulator of cell proliferation and is mutated in a variety of human cancers (Dhillon et al, 2007), with dramatic effects on cellular homeostasis. Overall, ~20% of all cancers carry driver mutations in one of the genes that encode MAPK pathway proteins (Stephen et al, 2014) and in the case of melanoma, 50% of cancers carry activating point mutations in BRAF (most commonly BRAF V600E). ATP-competitive inhibitors such as vemurafenib provide significant clinical benefit in treating BRAF-mutant melanoma. However, remission of disease is transient, as tumors and tumor-derived cell lines develop resistance to vemurafenib over time (Lito et al, 2012). Recent studies have identified feedback regulation, bypass mechanisms, and other context-dependent factors responsible for restoring ERK signaling to pre-treatment levels (Shi et al, 2012b; Lito et al, 2012, 2013). For example, in the BRAF-V600E cell line A375, vemurafenib has been shown to suppress EGF-induced ERK phosphorylation completely upon treatment (Lito et al, 2013) but ERK phosphorylation levels rebound within 48 hours, with a concurrent increase in the level of RAS-GTP, the active form of RAS (Lito et al, 2012). It is the biology of this adaptation that we aim to capture in an INDRA model.
The baseline MAPK model (Melanoma ERK Model in INDRA; MEMI1.0) consists of 14 sentences describing canonical reactions involved in ERK activation by growth factors (Figure 6B, MEMI1.0) and corresponds in scope to previously described models of MAPK signaling (Stites et al, 2007; Birtwistle et al, 2007). In the baseline model, BRAFV600E constitutively phosphorylates MEK as long as it is not bound to vemurafenib (sentence 9: “BRAF V600E that is not bound to Vemurafenib phosphorylates MEK”). A two-step policy involving reversible substrate binding was used to assemble all phosphorylation and dephosphorylation reactions. For simplicity, we did not specify residue numbers or capture multi-site phosphorylation, instead modeling each step in the MAPK cascades as a single, activating phosphorylation event. With these assumptions, 14 sentences were processed by TRIPS to yield 14 INDRA Statements that were assembled into 28 PySB rules and 99 differential equations; the network of coupled ODEs was then simulated.
A key property of vemurafenib-treated BRAFV600E cells as described by Lito et al. is that the drug initially reduces pERK below its steady state level but pERK then rebounds despite the continued presence of vemurafenib. Levels of RAS-GTP (the active form of RAS) also increase during the rebound phase (Lito et al, 2012). In MEMI1.0, addition of EGF causes activation of RAS and phosphorylation of ERK at steady state. Addition of vemurafenib rapidly reduces pERK levels (Figure 6B) but extended simulations under a range of EGF and vemurafenib concentrations show that the amount of active RAS depends only on the amount EGF and is insensitive to the amount of vemurafenib; moreover, no rebound in pERK is observed in the presence of vemurafenib (Figure 6B and Appendix Figure S7A). Thus, MEMI1.0 fails to capture drug adaptation.
In a series of siRNA-mediated knockdown experiments Lito et al. showed that pERK rebound involves an ERK-mediated negative feedback on one or more upstream pathway regulators such as Sprouty proteins (SPRY), SOS or EGFR. To identify a specific mechanism that might be involved we used the BioPAX and BEL search capabilities built into INDRA. We queried Pathway Commons (Cerami et al, 2011) for BioPAX reaction paths leading from ERK (MAPK1 or MAPK3) to SOS (SOS1 or SOS2) and obtained multiple INDRA Statements for a MAPK1 phosphorylation reaction that had one or more residues on SOS1 as a substrate (including SOS1 sites S1132, S1167, S1178, S1193 and S1197). However, Pathway Commons did not provide any information on the effects of these phosphorylation events on SOS activity. To search for this we used INDRA’s BEL Interface to query the BEL Large Corpus (Catlett et al, 2013, Box 2) for all curated mechanisms directly involving SOS1 and SOS2. We found evidence that ERK phosphorylates SOS and that ERK inactivates SOS (Corbalan-Garcia et al, 1996). We did not find a precise statement in either database stating that phosphorylation of SOS inactivates it, but the publication referred to in the BEL Large Corpus as evidence of this interaction (Corbalan-Garcia et al, 1996) describes a mechanism whereby SOS phosphorylation interferes with its association with the upstream adaptor protein GRB2. To include the inhibitory phosphorylation of SOS by ERK we therefore modified three sentences (Figure 6C, Model 2, Sentences, 5, and 14) in Model 1 and added two new sentences (Figure 6C, Model 2, sentences 15 and 16). Thus, although INDRA can assemble Statements derived from databases directly into models, in this case human curation (via changes to the natural language text) was required to identify gaps in the mechanisms available from existing sources.
The inclusion of SOS-mediated feedback produced 16 declarative sentences that were translated into a MEMI1.1 model having 34 rules and 275 ODEs. Assembly of MEMI1.1 involved imposing assumptions to limit combinatorial complexity. For instance, in sentence 15 (Figure 6C) we specified that ERK cannot be bound to DUSP6 for ERK to phosphorylate SOS. While it is not known whether or not ERK can bind both DUSP6 and SOS at the same time, allowing for this possibility introduces a “combinatorial explosion” (Faeder et al, 2005; Feret et al, 2009) in the number of reactions and makes mass-action simulation difficult. It is common to make simplifying assumptions of this type in dynamical models (see for instance (Chen et al, 2009)), and an advantage of using natural language is that the assumptions are clearly stated. When MEMI1.1 was simulated we observed that, given a sufficient level of basal activity by addition of EGF, addition of vemurafenib resulted in dose-dependent increases in active RAS over pre-treatment levels (Appendix Figure S7B). However, pERK levels remained low, suggesting that negative feedback alone (at least as modeled in MEMI1.1) is insufficient to explain the rebound phenomenon observed by Lito et al. (Figure 6C, Appendix Figure S7B).
It has been suggested that RAF dimerization plays an important role in cellular responsiveness to RAF inhibitors (Lavoie et al, 2013; Yao et al, 2015). Both wild-type and BRAFV600E dimers have a lower affinity for vemurafenib as compared to their monomeric forms (Yao et al, 2015). Moreover, Lito et al. observed that the reactivation of ERK following vemurafenib treatment was coincident with formation of RAF dimers, leading to the suggestion that vemurafenib-insensitive dimers in cells play a role in the re-activation of ERK signaling (Kholodenko, 2015). To model this possibility, we created MEMI1.2 in which binding of vemurafenib to monomeric or dimeric BRAF is explicitly specified by separate sentences, allowing the effects of different binding affinities to be explored (Figure 6D). Assembly of this model yielded 353 ODEs, many of which were required to represent the combinatorial complexity of BRAF dimerization and vemurafenib binding (Appendix Figure S8). Simulation showed that RAS activation increases and settles at a higher level following vemurafenib treatment, with the magnitude of the increase dependent on the amount of EGF and the concentration of drug (Figure 6D, Appendix Figure S7C). Following a period of pERK suppression, rebound in pERK levels to ~30% of their maximum is observed (Figure 6D) effectively recapturing the key findings of Lito et al.
Subsequent work has shown that resistance to vemurafenib can also involve proteins such as DUSP, SPRY2 (Lito et al, 2013) and CRAF (Montagut et al, 2008). These mechanisms do not feature in the models described here, but could be included in MEMI by adding a few phrases to the word model.
This example demonstrates that it is possible to use INDRA to model signaling systems of practical interest at a scope and level of detail at which interesting biological hypotheses can be explored and tested. Comparison of models MEMI1.0 to 1.2 suggests that both feedback and BRAF dimerization are necessary for vemurafenib adaption and pERK rebound, in line with experimental evidence. The number of free parameters in these models varies, and we have not performed formal model calibration or verification, so the conclusion that MEM1.2 is superior to 1.0 is not rigorously proven. However, INDRA-assembled rule sets represent a solid starting point for such downstream analysis.
One issue we encountered in assembling these models was controlling complexity arising from the formation of multiple protein complexes from a single set of reactants. This is a known challenge in dynamical modeling of biochemical networks with poorly understood implications for cellular biochemistry (Faeder et al, 2005; Harmer et al, 2010; Sneddon et al, 2011). From the perspective of an INDRA user, this is likely to manifest itself as a property that can be diagnosed at the level of PySB rules or ODE networks, which can be inspected interactively (see Appendix Notebook 2). Unwanted combinatorial complexity can be resolved in two ways: (i) by using natural language to make additional assumptions about molecular context, and (ii) by choosing assembly policies minimizing combinatorial complexity by reducing complex formation (i.e. Michaelis-Menten instead of two-step policy). Both strategies are illustrated in Appendix Notebook 2.
An extensible and executable map of the RAS signaling pathway
The BRAF pathway described above is part of a larger immediate-early signal transduction network with multiple receptors as inputs and transcription, cell motility and cell fate determination as outputs. RAS is a central node in this network and is an important oncogenic driver (Stephen et al, 2014). The ubiquity of RAS mutations in cancer has led to renewed efforts to target oncogenic RAS and RAS effectors. As a resource for the community of RAS researchers, the NCI RAS Initiative has created a curated pathway diagram that defines the scope of the RAS pathway as commonly understood by a community of experts (Stephen et al, 2014). Such pathway diagrams can serve as useful summaries, but unless they are backed by an underlying computable knowledge representation they are of limited use in quantitative data analysis.
We used INDRA to describe the RAS signaling network and automatically generated a diagram (Figure 7A, right) corresponding to the community-curated Ras Pathway v1.0 diagram (available at http://www.cancer.gov/research/key-initiatives/ras/ras-central/blog/what-do-we-mean-ras-pathway). We described the interactions in natural language (Figure 7A left, full text shown in Appendix Section 2.4) and used TRIPS to convert the description into INDRA Statements. A node-edge graph was generated using INDRA’s Graph Assembler and rendered using Graphviz (Figure 7A, right). Although different stylistically, the pathway map assembled using INDRA matches the original one drawn by hand in the following ways: it (i) includes the same set of proteins; (ii) represents the same set of interactions among these proteins; and (iii) recapitulates the semantics and level of mechanistic detail of the original diagram in that interactions are represented as directed positive and negative edges or undirected edges indicating complex formation. The pathway map is also visually comparable to one drawn by hand, and allows natural language-based editing and extension of the underlying set of mechanisms. For example, following distribution of v1.0 RAS diagram, the RAS Initiative solicited verbal feedback from a large number of RAS biologists both in person and via a discussion forum. Suggestions from the community consisted largely of corrections and pathway extensions. Using INDRA, these revisions of the network can be made directly, simply by editing the natural language source material. For example, one contributor noted that in the published pathway diagram (Figure 7A, right), P90RSK is activated by the mTORC2 complex, whereas in fact it is actually a substrate of MAPK1 and MAPK3 (https://www.cancer.gov/research/key-initiatives/ras/ras-central/blog/2014/what-do-we-mean-ras-pathway#comment-1693526648). To account for this correction, we modified the natural language description by replacing the sentence “mTORC2 activates P90RSK” with “MAPK1 and MAPK3 activate P90RSK.” The pathway map obtained following automated assembly of the revised text correctly reflects the change suggested by the contributor (Figure 7B).
Several readers also suggested expanding the pathway map to include other relevant proteins. Extensions of this type are easy to achieve using natural language: for example, we extended the v1.0 RAS diagram to include JNK, a MAP kinase that is activated in many cells by cytokines and stress (Anafi et al, 1997; Antonyak et al, 1998; Wagner & Nebreda, 2009). This was achieved by adding four sentences (Figure 7C, top) including “MAP3K7 activates MKK4 and MKK7” and “MKK4 and MKK7 activate JNK1 and JNK2”. The subnetwork appended to the diagram is shown in Figure 7C (bottom). Note that we used common names for the JNK pathway kinases in the word model but INDRA canonicalized these to their official gene names (e.g., “HPK1”, “MKK4”, and “JNK1” were converted to MAP4K1, MAP2K4, and MAPK8, respectively).
The set of mechanisms used to generate the diagrams in Figures 7A-C can also be translated into a qualitative predictive model. We used the Simple Interaction Format (SIF) Assembler in INDRA to generate a Boolean network corresponding to the natural language pathway description in Figure 7A (see Appendix Section 2.4 for the rules comprising the network). Such a Boolean network can be used to predict the effects of perturbations such as ligand or drug addition. For example, we simulated the effects of adding growth factors and MEK inhibitors on phosphorylated c-Jun. The Boolean network simulation correctly predicted that c-Jun would be phosphorylated in the presence and absence of MEK inhibitor (Figure 7D, blue). We then instantiated the extended network in Figure 7C (which identifies the JNK pathway as a possible contributor to c-Jun phosphorylation). In this case joint inhibition of JNK and MEK was required to fully inhibit c-Jun phosphorylation (Figure 7D, green). The biology in this example is relatively straightforward but it demonstrates that natural language descriptions of mechanisms, along with automated assembly into executable forms, can be used as an efficient and transparent way of creating extensible knowledge resources for data visualization and analysis.
DISCUSSION
In this paper, we described a software system, INDRA, for constructing executable models of signal transduction directly from text. The process uses natural language reading software (TRIPS, in this paper) to convert text into a computer-intelligible form, identifies biochemical mechanisms and then casts these mechanisms in an intermediate knowledge representation that is decoupled from both input and output formats. The intermediate representation, comprising a library of INDRA Statements, is then used to assemble computational models of different types including networks of ODEs, Boolean networks, and interaction graphs according to user-specified policies that determine the level of biophysical detail. We have applied INDRA to three successively more ambitious use cases: (i) translating a diagram and accompanying text describing p53 regulation by DNA damage; (ii) modeling adaptive drug resistance in BRAFV600E melanoma cells exposed to the BRAF inhibitor vemurafenib; and (iii) constructing a large-scale model of RAS-mediated immediate-early signaling based on a crowd-sourced schematic drawing. These examples demonstrate the surprising but encouraging ability of machines to exploit the flexibility and ambiguity of natural language and then add prior knowledge about reaction mechanisms needed to create well-defined executable models.
The p53 POMI model represents a use case corresponding in scope to the mechanistic hypotheses typically presented in the literature in verbal or graphical form. We based POMI on a word model found in a review and found it necessary to add several additional mechanisms to reproduce the described oscillations in p53 (these include negative regulatory reactions and a positive feedback step involving auto-phosphorylation of ATM in trans). Editing and updating the model to explore alternative hypotheses was accomplished strictly at the level of the natural language description. This example highlights the potential of natural language, assembled into executable model form, to expose important and frequently overlooked differences between a formal representation of a mechanism (in this case, a network for ODEs) and a diagram that purports to describe it. Direct conversion of text into models via INDRA helps to minimize such mismatches while keeping the description in an accessible and easily editable natural language form. We propose that pathway schematics found in the conclusions of molecular biology papers include a set of declarative statements that match the schematic and any depiction of dynamics arising from simulation. Ensuring congruence among these representations will improve general understanding of cellular biology and make schematics and their underlying assumptions accessible to machines.
The BRAFV600E MEMI model involved a much greater number of molecular species and reactions due to the combinatorics of complex formation among BRAF, Vemurafenib, MEK, and RAS. In INDRA, formation of unlikely polymers in the model assembled by INDRA was controlled by providing stricter molecular context on mechanisms in the form of natural language (e.g., “DUSP6 dephosphorylates ERK that is not bound to SOS”). While managing combinatorial complexity is a key challenge in building models of signaling, a benefit of using INDRA is that assumptions made regarding combinatorial complexity are made explicit either in the form of the natural language description or the policies chosen for model assembly (e.g., one-step Michaelis-Menten vs. two-step). The broader RAS pathway is the largest network tackled in this paper, but by restricting the mechanisms to positive and negative regulation and binding it remains manageable. Such a model could in principle be solicited directly from the community and we plan to release the INDRA RAS model to the same group of experts that helped Frank McCormick and colleagues build and improve the original RAS schematic (Stephen et al, 2014).
Challenges in generating executable models from text and databases
Automating the construction of detailed biochemical models from text involves overcoming three technical challenges. The first is turning text into a computable form that correctly captures the biochemical events described in a sentence (typically verbs or actions) and the precise biomolecules involved (typically the subjects and objects of a phrase or sentence). This is possible in our system because TRIPS can extract meaning from sentences describing complex causal relationships in the face of variations in syntax (Box 1). TRIPS performs an initial shallow syntactic search to identify and ground named entities (genes, proteins, drugs, etc.) and then uses generic ontologies to perform “deep” semantic analysis, determining the meaning of a sentence in terms of its logical structure.
The second challenge involves extracting and normalizing information about mechanisms contained in NLP output. INDRA extracts mechanistic information from graphs generated by TRIPS by searching for matches to a predefined set of templates corresponding to biochemical processes (e.g., phosphorylation, transcription, binding, activation, etc.; Figure 3). These templates regularize the description of biochemistry in text by capturing relevant information in pre-determined fields: for example, a template for phosphorylation is structured to have a protein kinase, a phosphorylated substrate, and a target site. Information extracted by this template matching procedure is stored in corresponding fields in Statements, INDRA’s intermediate representation; missing fields are left blank. INDRA Statements currently encompass terms and reactions commonly found in signal transduction pathways and gene regulation; however, the system is being extended to include a wider variety of biochemical processes.
The third challenge in text-to-model conversion is assembling an executable model from high-level mechanistic facts acquired from input sources. Knowledge of reaction type and reactant identity is insufficient to construct a biophysical model: additional information derived from an understanding of classes of biochemical mechanism is almost always required. For example, the conversion of a phosphorylation Statement into a reaction network can involve one-step kinetics, reversible two-step kinetics or two-step kinetics with explicit ATP binding. Conversion of Statements into explicit models is controlled by the imposition of assembly policies (Figure 4). Greater biophysical realism comes at the cost of increased model complexity and reduced parameter identifiability. Thus, there is no single optimal approach to model instantiation: the level of detail is determined by the purpose of the model and the way it will be formulated mathematically.
Constructing executable models of signaling networks from pathway databases using BioPAX or BEL presents several challenges, despite the fact that this information is structured and often manually curated by experts. BioPAX reactions and BEL statements often lack the uniqueness (i.e., many variants of the same mechanism are curated) and context (i.e., participants in curated mechanisms are missing necessary molecular state) required to build coherent executable models automatically. For instance, Pathway Commons contains a multitude of representations of the reaction whereby MAPK21 phosphorylates and activates MAPK1 (Appendix Figure S10). These reactions differ in their molecular details including which phosphorylation sites are involved and what the assumptions about the state (activity, modification, bound cofactors) of MAP2K1 are. These reactions cannot simultaneously be included in a single, coherent model as they would result in causal inconsistencies. We therefore require the user to determine which INDRA Statements extracted from a database should be included in a given model. INDRA then subjects this information to an analogous process as text, using templates and assembly policies to control the generation of specific reaction patterns. In the future, manual selection of relevant BioPAX or BEL statements could be replaced by, or supplemented with, automated tools ensuring the selection of coherent subsets of mechanisms to be included in a model.
Separating Model Content and Implementation
Most approaches to modeling biological networks directly couple the specification of scope and collection of relevant facts to the mathematical implementation. For example, in an ODE-based model, molecular species are directly instantiated as variables and related to each other using one or more differential equations containing terms determined by each mass action reaction (Figure 8A “Ordinary differential equations” and Figure 8B, left). Although conceptually straightforward, the lack of separation between content and implementation (an issue also discussed in (Basso-Blandin et al, 2016)) makes it difficult to update a model with new findings from the literature or new hypotheses, to change the level of biophysical detail or to switch mathematical formalisms. Programmatic modeling overcomes some of these problems by allowing the construction of models at a higher level of abstraction in which users implement reusable and composable macros and modules (Figure 8A “PySB Macro” and Figure 8B, center) (Lopez et al, 2013; Mallavarapu et al, 2009; Smith et al, 2009; Pedersen & Plotkin, 2008). The mathematical equations necessary for simulation are then generated automatically from the abstract representations.
INDRA introduces a further level of abstraction whereby a user describes a set of reactions in natural language or searches for related mechanisms in pathway databases and then uses a machine to turn these facts into executable models (Figure 8A “Natural language” and Figure 8B, right). In this process, a user has full control over the content of the model and the level of detail, as specified by policies, but model assembly happens automatically. Such decoupling simplifies the creation of dynamical models from natural language descriptions, enables the creation of closely related models differing in detail or mathematical formalism and makes sure that verbal and mathematical descriptions of the same process are in correspondence (Figure 8B, right).
The decoupling of biological knowledge from specific applications reflects the way in which biologists gather mechanistic information and apply it to specific research questions. We acquire informal knowledge through years of reading and experience, but this knowledge remains highly flexible; it allows for uncertainty about particular details and can be applied to a diverse set of problems in the lab. The ambiguity inherent in verbal descriptions of mechanisms conforms closely to the way in which individual experiments are designed and interpreted: it is extremely rare for one experiment to elucidate the status of all relevant sites of post translational modification, regulatory subunit binding or allosteric regulation of an enzyme. Natural language allows biologists to communicate provisional and changing knowledge without prematurely resolving ambiguities or presupposing the biological context or experimental format in which the knowledge might be applied.
Relationship to previous work
Several software tools have been developed to partially automate the construction of executable models from bioinformatics databases such as KEGG, Pathway Commons etc. (Ruebenacker et al, 2009; Wrzodek et al, 2013; Büchel et al, 2013; Turei et al, 2016a). Automating model translation in this way increases throughput and maintains links between model assumptions and curated findings in databases, eliminating the need for labor-intensive annotations of hand-built models (Le Novère et al, 2005). Such approaches have been particularly successful in the field of metabolism in which knowledge about enzyme-substrate reactions is well curated and closely corresponds in level of detail to what is required for mechanistic modeling (Büchel et al, 2013). In signal transduction, curation is less complete, the number of molecular states and interactions is far higher and networks vary dramatically from one cell type to the next. This complexity has been addressed for the most part by using strictly qualitative formalisms that describe positive and negative influences between nodes (Büchel et al, 2013; Turei et al, 2016b). In contrast INDRA uses an intermediate representation that encompasses both mechanistic processes (e.g., phosphorylation) and empirical causal influences (e.g., activation and inhibition). The model assembly procedure makes use of mechanistic information where available, but can incorporate qualitative influence relationships when mechanisms are not known. In its use of an intermediate representation to bridge the gap between elements of mechanistic knowledge and executable models, INDRA Statements are related to the graphical meta-model for rule-based modeling developed by Basso-Blandin et al., which represents binding and modification actions at the level of structural features within agents (e.g., domains and key residues) (Basso-Blandin et al, 2016). In this way, their framework is complementary to INDRA Statements, and the two approaches could be productively integrated.
INDRA’s ability to assemble information from knowledge sources into annotated, exchangeable and extensible models relies heavily on the existence of community standards such as SBO (Courtot et al, 2011) and MIRIAM (Le Novère et al, 2005), and on structured resources including identifiers.org (Juty et al, 2012), UniProt (The UniProt Consortium, 2015), CHEBI (Degtyarenko et al, 2008), etc. For an extensive overview of the role of these resources in building large, reusable models, we refer the reader to (Waltemath et al, 2016). Early instances of software systems for converting input and output formats allowed one-to-one conversion from BioPAX to SBML (Ruebenacker et al, 2009). Cell Designer (Funahashi et al, 2008) accepts input in formats such as BioPAX and makes plugins such as SBML Squeezer (Dräger et al, 2015) available for export into SBML. Similarly, Cytoscape (Cline et al, 2007) makes it possible to import protein interactions from multiple databases and output the results to SBML. More recent one-to-many tools translate information from a single knowledge source into multiple output formats (Wrzodek et al, 2013), while many-to-one tools aggregate pathway information from many sources but target a single output format (Turei et al, 2016a).
By uncoupling knowledge-level statements from a particular formal implementation, whether graphical or mathematical, natural language modeling is complementary to and compatible with a wide variety of input and output formats. In the case of INDRA, an intermediate representation enables a wide variety of many-to-many conversions involving text, BioPAX, BEL, PySB, BNGL, SBML, ODEs, logical models and graph-based formats such as SBGN (an INDRA-assembled SBGN graph of the model presented in Figure 5C is shown in Supplementary Figure S9). Further integration of natural language and graphical modeling, for example by coupling INDRA to SBGNViz graphical interface (Sari et al, 2015), will improve the quality of human-machine interaction and further facilitate model assembly and exploration.
Limitations and future extensions of INDRA
An appealing feature of using natural language to build models is that it is immediately accessible to all biologists. However, this does not necessarily imply that INDRA will allow modeling laypersons to directly build and use sophisticated models, as the use of natural language does not in and of itself address many of the other challenges in developing a meaningful dynamical model, including determination of parameter sensitivity, investigation of network dynamics, insight into combinatorial complexity, multistability, oscillations, etc. We therefore propose that natural language modeling would be useful in facilitating collaboration between biologists with domain-specific expertise and computational biologists. The advantage of natural language in this context is that it makes it easy for teams to communicate about biological hypotheses and mechanisms without becoming mired in details of model implementation. For experienced modelers, INDRA offers a means to efficiently build multiple model types from a single set of high-level assumptions, provided that the model can be described in terms of molecular mechanisms and assembled using available policies. By design, the software does not perform parameter estimation, simulation or model analysis, leaving these tasks to the many existing tools and methods.
Limitations in model construction using INDRA can be grouped into two categories: (i) issues relating to the reading natural language by external NLP systems, and (ii) limitations in the representation and assembly of mechanisms in INDRA. In this paper, we construct models using simple declarative sentences that lack much of the complexity and ambiguity of spoken language and the scientific literature. Declarative language can express a wide variety of biological mechanisms at different levels of detail and ambiguity and it reduces many of the difficulties associated with NLP-based extraction of biological mechanisms. Although TRIPS and INDRA are robust to variation in syntax and naming conventions, they cannot understand all possible ways a concept can be stated; for example, “Wip1 makes ATM inactive” is not recognized as a substitute for “Wip1 inactivates ATM” (Figure 5). In such cases rephrasing is usually successful.
The TRIPS system (as well as other NLP systems we tried, such as REACH) can be used to process the more complex and ambiguous language used in scientific publications and they are both state of the art systems with different strengths and weaknesses. Empirical results presented in (Allen et al, 2015) show that TRIPS compares favorably in precision and recall to ten other NLP systems on an event extraction task from biomedical publications, and reaches precision and recall levels close to those produced by human curators. While reading from the biomedical literature is less robust as compared to reading the declarative language used in this paper, the fundamental challenge in generating models directly from literature information is not reading but knowledge assembly. The assembly challenge involves multiple interconnected issues, including: (i) the large amount of full and partial redundancy of knowledge generated when mechanisms are read at scale (e.g. MEK phosphorylates ERK vs. MEK1 phosphorylates ERK); (ii) inconsistencies between knowledge collected from multiple sources which may or may not be resolvable based on context; (iii) the distinction between direct physical interactions and indirect effects; and (iv) technical errors such as erroneous entity disambiguation and normalization. In the approach described here, human experts simplify machine reading and assembly by paraphrasing statements about mechanisms into simplified, declarative language. As illustrated in the POMI models of p53 dynamics, the use of simplified language is not only useful for machines, it helps to clarify complex issues for humans as well. However, we are actively working to extend INDRA so it can assemble information from the primary scientific literature into coherent models.
The domains of knowledge covered by INDRA are limited by the scope of the natural language processing, intermediate representation and assembly procedures developed to date, which do not include all types of biological mechanisms (e.g., lipid biology, microRNA function, epigenetic regulation remain future extensions). However, INDRA can extract and represent comparable proportions of reactions in signaling, transcriptional regulation, and metabolism, which are widely curated in existing databases (Appendix Table 1). To further extend this coverage, we are (i) updating processors to retrieve a wider range of information; (ii) adding new Statement types; and (iii) creating new assembly procedures. Other areas of future development include automated retrieval of binding affinities and kinetic rates for parameter estimation. Encouragingly, the Path2Models software has shown that automated retrieval of kinetic parameters from databases is feasible for metabolic models (Büchel et al, 2013), and this approach may be adaptable to signaling pathways as well. Another planned extension involves capturing more abstract observations in addition to mechanistic information. For instance, the experimental finding “IRS-1 knockdown resulted in reduction of insulin stimulated Akt1 phosphorylation at Ser 473.” (Varma & Khandelwal, 2008) cannot be directly represented as a molecular mechanism. Literature and databases contain a wealth of such indirect, non-mechanistic information that could be used as biological constraints to infer or verify mechanistic models. However, we expect that INDRA will primarily remain a tool for investigating properties of linked biochemical reactions rather than as a general-purpose mathematical modeling tool. As illustrated in the p53 modeling example above (Figure 5) this emphasis requires the modeler to provide an explicit molecular basis for phenomenological properties such as oscillations, switches, delays, etc.
A system such as INDRA allowing biologists to “talk” to a machine about a biological pathway in natural language suggests the possibility that an improved machine could also “talk back” to the human user in a manner analogous to Apple’s Siri (Carvunis & Ideker, 2014). At its most basic level, such a system would allow humans and machines to jointly curate knowledge, thereby resolving ambiguities or errors in NLP or assembly. A more sophisticated machine would use its internal knowledge base to autonomously identify relevant reactions, inconsistencies in a user’s input, or novel hypotheses arising from model simulation. A computer agent could interact with many human experts simultaneously, facilitating curation and modeling efforts by communities of biologists. We anticipate that such human-machine collaborative systems will be increasingly valuable in making sense of the large and complex datasets and fragmentary mechanistic knowledge that characterize modern biomedicine.
MATERIALS AND METHODS
Software and model availability
INDRA is available under the open-source BSD license. Code and documentation are available via http://indra.bio; the documentation is also included as part of the Appendix. The TRIPS/DRUM system for extracting mechanisms from natural language is available at http://trips.ihmc.us/parser/cgi/drum. INDRA version 1.4.2 was used to obtain all results in the manuscript. INDRA can be imported in a Python environment and integrated with existing Python-based tools directly. To allow the integration of INDRA with non-Python tools, including graphical modeling environments, a REST API is available, through which all input processing and assembly functionalities of INDRA can be used (for more details on the REST API, see the INDRA documentation attached as part of the Appendix).
The POMI1.0 and MEMI1.0-1.2 models are provided as Appendix attachments in SBML, BNGL, Kappa and PySB formats, in addition to the natural language text files used to build them. The RAS pathway model and its extension are provided in SIF and Boolean network formats as Appendix attachments. Code used to generate these models is part of the INDRA repository and can be found in the models folder of https://github.com/sorgerlab/indra.
TRIPS Interface
The INDRA TRIPS Interface is invoked using the top-level function process_text. This function queries the TRIPS/DRUM web service via HTTP request, sending the natural language content as input and retrieving extracted events in the EKB-XML format. The Interface then creates an instance of the TripsProcessor class, which is then used to iteratively search the EKB-XML output, via XPath queries, for entries corresponding to INDRA Statements. Extracted Statements are stored in the statements property of the TripsProcessor, which is returned by the Interface to the calling function.
BioPAX/Pathway Commons Interface
INDRA’s BioPAX Interface either queries the Pathway Commons web service or reads an offline BioPAX OWL file (Box 2). The Interface contains three functions that can be used to query the Pathway Commons database via the web service: 1) process_pc_neighborhood, which returns the reactions containing one or more query genes, 2) process_pc_pathsbetween, which returns reaction paths connecting the query genes, subject to a path length limit, and 3) process_pc_pathsfromto, which returns reaction paths from a source gene set to a target gene set, subject to a path length limit. The BioPAX Interface processes the resulting OWL files using PaxTools (Demir et al, 2013), yielding a BioPAX model as a Java object accessible in Python via the pyjnius Python-Java bridge (https://github.com/kivy/pyjnius). INDRA’s BioPAX Processor then uses the BioPAX Patterns package (Babur et al, 2014) to query the BioPAX object model for reaction patterns corresponding to INDRA Statements.
BEL/NDEx Interface
INDRA’s BEL Interface either reads an offline BEL-RDF file or obtains BEL-RDF from the BEL Large Corpus via the Network Data Exchange (NDEx) web service (Pratt et al, 2015). Subnetworks of the BEL Large Corpus are obtained by calling the method process_ndex_neighborhood, which retrieves BEL Statements involving one or more query genes. The BEL Processor then uses the Python package rdflib to query the resulting RDF object for BEL Statements corresponding to INDRA Statements via the SPARQL Protocol and RDF Query Language (SPARQL; https://www.w3.org/TR/sparql11-overview).
Assembly of rule-based models
Assembly of rule-based models is performed by instances of the PySB Assembler class. Given a set of INDRA Statements and assembly policies as input, the make_model method of the PySB Assembler assembles models in two steps. First, information is collected about all molecular entities referenced by the set of Statements. This defines the activity types, post-translational modification sites, binding sites, and mutation sites for each Agent, which can then be used to generate the agent “signatures” for the rule-based model. In PySB, the molecular entities of the model are represented by a set of instances of the PySB Monomer class. Because assembly policies chosen by the user govern the nature of binding interactions (e.g., one-step vs. two-step modification), the binding sites and agent signatures must be generated in accordance with the chosen policies at this step. For policies involving explicit binding between proteins (e.g., the two-step policy for post-translational modifications), each PySB Monomer is given a unique binding site for each interacting partner. The second step is the generation of reaction rules corresponding to each of the input Statements. The PySB Assembler iteratively processes each Statement, calling the assembly function specific to the Statement type and chosen policy. Depending on the Statement type and policy, one or more PySB rules may be generated and added to the PySB model. The PySB model returned by the make_model function can then be converted into other formats (Kappa, BNG, SBML, Matlab, etc.) depending on the type of simulation or analysis to be performed (Lopez et al, 2013). Importantly, the PySB Assembler adds annotations to the generated PySB model that link molecular entities referenced in the model to their identities in reference ontologies (e.g., HGNC and UniProt). These annotations are in turn propagated into SBML and other model formats by existing PySB model export routines.
Models of p53 activation in response to single- and double strand break DNA damage
The text defining each model was submitted to the TRIPS web service for processing via INDRA’s TRIPS Interface. The TRIPS system returned Extraction Knowledge Base graphs (Box 1 and Appendix Section 2.1) from which INDRA’s TRIPS Processor extracted INDRA Statements. These Statements were then assembled using INDRA’s PySB Assembler into a rule-based model. The default “one-step” assembly policy was used, which generates rules in which the subject of an activation, inhibition, and modification changes the state of the object without binding.
The 8 sentences constituting the SSB damage response model (Figure 5B) resulted in 8 INDRA Statements (each of type Activation or Inhibition). For example, the sentence “Active p53 activates Mdm2” was represented as an Activation Statement with an additional condition on the Agent representing p53, requiring that it be active. During INDRA Statement construction, names of genes are standardized to their HGNC gene symbol (Eyre et al, 2006), thus, the Agent representing “Mdm2” is renamed “MDM2”, and the Agent representing “p53” is renamed “TP53”. Default initial conditions (10,000 molecules, based on a default concentration of 10-8 Molar in a typical HeLa cell volume of 1.6 x 10-12 L) generated by the PySB Assembler were used for each protein in its inactive state and simulations were started with an initial 1 active ATR molecule to initiate the activation pathway. The forward rates for activation and inhibition rules were set to 10-7 molec’V1 (using a conversion rate of 105 M-1 in a typical HeLa cell volume, as above). The forward rate of the rules corresponding to ATR auto-activation and p53 inactivation by Wip1 were modified to be 5 x 10-7 molec’V1, that is, faster than the forward rate of other rules (a summary of all rules and rates is given in the Appendix Section 2.3). PySB’s reaction network generation and simulation functions were then used to instantiate the model as a set of 8 ordinary differential equations. The model was simulated using the scipy package’s built-in vode solver for up to 20 hours of model time while tracking the amount of active p53, which was then plotted (Figure 5B). Natural language processing for this model took 10 seconds (here and in the following this includes network traffic time to and from the web service); the assembly and simulation of the model took less than 1 second.
The method for constructing the simple DSB response model (Figure 5C) with ATM was analogous to the SSB model. The same initial amounts and forward rate constants were used as in the previous model, except in this case an initial condition of 1 active ATM molecule was used, and the inactivation of ATM by Wip1 was given a forward rate of 10-5 molec’V1. For this model, the 9 natural language sentences were captured in 9 INDRA Statements and generated into a model of 9 rules and finally 9 ODEs. The model was again simulated up to 20 hours while observing the active form of p53. Similar to the SSB response model, natural language processing for this model took around 10 seconds, with assembly and simulation taking less than 1 second.
The POMI1.0 model (Figure 5E) extends the basic DSB response model by specifying the activation/inhibition processes in more mechanistic detail. The model is described in 10 sentences yielding 12 INDRA Statements and a model containing 11 PySB rules and 12 ODEs (via the PySB Assembler using the “one-step” policy). The same rate constants were used as in the simple DSB response model; additionally, the degradation rate of Mdm2 was set to 8 x 10-2 s-1 and the rate of synthesis of Mdm2 by p53 to 2 x 10-2 molec’V1 (a full list of rules and associated rate constants is given in the Appendix Section 2.3). Natural language processing for this model took 14 seconds; assembly and simulation took less than 1 second.
Models of response to BRAF inhibition
The sentences for the MEMI1.0, 1.1 and 1.2 models were processed with the TRIPS web service via INDRA’s TRIPS Interface. Natural language processing took 37 seconds for MEMI1.0, 60 seconds for MEMI1.1, and 75 seconds for MEMI1.2. The resulting INDRA Statements were then assembled using INDRA’s PySB Assembler module into a rule-based model using the “two-step” policy for assembling post-translational modifications. Kinetic rate constants were set manually and the initial amounts of each protein were set to correspond in their order of magnitude to typical absolute copy numbers measured across a panel of cancer cell lines in Table S5 of (Shi et al, 2016). A summary of the kinetic rates and initial amounts is given in Appendix Tables 5-7. Each model was instantiated as a system of ordinary differential equations and simulated using the scipy Python package’s built-in vode solver. Each model was started from an initial condition with all proteins in an inactive, unmodified and unbound state. The models were run to steady state and the values of GTP-bound RAS (active RAS) and phosphorylated ERK were saved. Another simulation was then started from the steady state values with vemurafenib added and the time courses of active RAS and phosphorylated ERK were normalized against their unperturbed steady state values and plotted.
Extensible and executable RAS pathway map
The pathway map was created by processing 47 sentences with TRIPS (see Appendix Section 2.5) to generate 141 INDRA Statements. Reading and extraction of Statements took a total of 160 seconds. The Statements were then assembled using INDRA’s Graph Assembler, which produced a network that was laid out using Graphviz (Ellson et al, 2002) as shown in Figure 7A. The same set of Statements was then assembled using the INDRA SIF Assembler which produced a list of positive and negative interactions between genes that can be interpreted by network visualization software (Shannon et al, 2003) and Boolean network simulation tools. The logical functions for each node were generated by combining the state of parent nodes such that the presence of any activating input in an on state and the absence of any inhibitory inputs in an on state resulted in the node’s value taking an on state at the next time step (logical rules are given in Appendix Section 2.5). Boolean network simulations were performed using the boolean2 package (Albert et al, 2008). First, 100 independent traces were simulated using asynchronous updates on the nodes (which results in stochastic behavior) and the average of the value of each node (with 0 corresponding to the low and 1 to the high state of each node) was taken across all simulations to produce the time course plots in Figure 7D.
AUTHOR CONTRIBUTIONS
BMG and JAB designed and implemented INDRA. BMG, JAB, KS, JM, LG and PKS conceived the overall approach. BMG, JAB, KS, and PKS wrote the paper.
CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.
ACKNOWLEDGEMENTS
This work was funded by ARO grants W911NF-14-1-0397 and W911NF-15-1-0544 to PKS and W911NF-14-1-0391 to LG under the DARPA Big Mechanism and Communicating with Computers programs, and by NIGMS grant P50GM107618 to PKS. We would like to acknowledge Russ Harmer, Walter Fontana and Dexter Pratt for useful discussions and their valuable suggestions, as well as James Allen, Choh Man Teng and Will de Beaumont for their contribution to the development of the TRIPS NLP system.
1 Appendix Figures
2 Appendix Methods
2.1 The TRIPS/DRUM natural language processing system
In this section we provide more technical details on the TRIPS/DRUM system [3] which INDRA uses to process natural language text (note that in the main text we refer to the system as TRIPS). The DRUM (Deep Reader for Understanding Mechanisms) system is an instance of the TRIPS general deep language understanding system [4], customized for reading and understanding scientific text in molecular biology.
To process natural language input, first a suite of shallow NLP tools performs annotations on the text, including: a) sentence splitting; b) named entity recognition; c) derivational analysis (e.g., prefixes, pertainyms); and d) statistical parsing. Critically, DRUM uses a number of domain-specific named entity taggers to identify genes, proteins, protein families, drugs and chemicals, cell lines, diseases, etc.; successful matches are annotated with IDs in the original resources, and are also mapped to internal ontology types. At this stage all matches are annotated; there is no attempt to disambiguate among alternatives. Other domain-specific annotators use regular expressions to identify, for example, molecular sites and mutations. Statistical parsers – CoreNLP [5] and Enju [6] – are used to find constituent boundaries; only those constituents on which both parsers agree are used.
The output of all these specialized preprocessors is sent to the TRIPS parser, which uses it as advice during its search for the optimal parse of each sentence. The TRIPS parser is at the core of our approach; it uses a hand-built lexicalized context-free grammar, augmented with feature structures and feature unification, and domain-general selectional restrictions (encoded in the lexicon and ontology) to eliminate semantically anomalous sense combinations. The parser constructs from the input a logical form (LF), which is a semantic representation that captures an unscoped modal logic [7,8]. The logical form includes the surface speech act, semantic types, semantic roles for predicate arguments, and dependency relations. Lexical entities in the LF represent word senses and ontology types, as well as tense, modality and aspect information – information that is crucial for determining, for example, whether a statement expresses a stated fact, a conjecture or a possibility. The parser draws on a general purpose semantic lexicon and ontology which define a range of word senses and lexical semantic relations. The core semantic lexicon was constructed by hand and contains approximately 7, 500 lemmas (generating approximately three times that many words) and 2, 000 concepts in the ontology. The ontology is organized hierarchically and each ontology concept has associated with it possible semantic roles and selectional preferences that further refine the concept. The core lexicon is extended to cover virtually all words in WordNet [9] by adding lexical entries with plausible semantic and syntactic structures through a dynamic mapping between the WordNet hypernym hierarchy and that of the TRIPS ontology. Ontology-based lexical expansion – using WordNet as well as all the terminology extracted from the biological resources – is the main tool by which we can customize our generic, relatively low coverage semantic parser to attain the broad coverage needed to process text in as lexically rich a domain as that of molecular biology.
Finally, the LFs produced by the parser are used to extract the content relevant for the domain, in this case concepts (e.g., molecular entities), events (e.g., activation, modification) involving those concepts, causal relations between events, as well as additional information about events, such as polarity, modality (e.g., possibility, necessity, various epistemic and evidential modals). Because much of the variation and complexity in sentence constructions is handled by the TRIPS parser, and the LF terms are grounded in the TRIPS ontology, we are able to use a relatively compact rule set for defining the events and relationships of interest. The extraction rules were developed from general principles rather than based on specific training samples; thus, even though most of the text used for development was extracted from papers on the Ras signaling pathways, we expect the system to have good performance on any input describing biomolecular mechanisms.
2.2 Querying databases to extract INDRA Statements
To benchmark the ability of INDRA to extract Statements from pathway databases, we used INDRA’s BioPAX and BEL APIs to search for the neighborhood of 20 genes or metabolites in Pathway Commons and the BEL Large Corpus, respectively. In the case of BioPAX we counted the total number of controlled BiochemicalReactions and TemplateReactions (used to represent gene transcription) as reference, and in the case of BEL we counted all direct statements (i.e. with predicate directlyIncreases/directlyDecreases). We then determined how many of these source entries in each case were extracted and represented as INDRA Statements, as constructed by INDRA’s BioPAX and BEL Processors. We selected 5 representative genes or metabolites from signaling, gene regulation, metabolism (protein controller) and metabolism (metabolite) to perform the analysis. The results of this analysis are shown in Table 1.
2.3 Modeling alternative dynamical patterns of p53 activation
Tables 2-4 list the PySB rules and the associated kinetic rates constituting the three models for the activation dynamics of p53 with Table 2 corresponding to the model in Figure 5(B), Table 3 to the model in Figure 5(C) and Table 4 to the model in Figure 5(E).
2.4 Modeling resistance to targeted therapy by vemurafenib
A Pathway Commons query was performed with INDRA to obtain paths from MAPK1 or MAPK3 to SOS1 or SOS2. This yielded the following list of INDRA Statements
Phosphorylation(MAPK1(), SOS1(), S, 1132),
Phosphorylation(MAPK1(), SOS1(), S, 1197),
Phosphorylation(MAPK1(), SOS1(), S, 1193),
Phosphorylation(MAPK1(), SOS1(), S, 1178)
This indicates that MAPK1 phosphorylates SOS1 on several serine residues.
We queried the BEL Large Corpus using a neighborhood search around SOS1 and SOS2. This returned the following relevant mechanisms:
Phosphorylation(MAPK1(), SOS1(), S, 1178),
Phosphorylation(MAPK1(), SOS1(), S, 1167),
Inhibition(MAPK1(activity=‘kinase’), SOS1(), ‘catalytic’)
All three mechanisms refer to PMID8816480, Identification of the mitogen-activated protein kinase phosphorylation sites on human Sos1 that regulate interaction with Grb2 [10] as their source evidence. This implies that the serine-phosphorylation of SOS1 by MAPK1 reduces its affinity to bind to GRB2 thereby resulting in its loss of activity.
2.5 An extensible and executable map of the RAS pathway
The following text was used to assemble the RAS pathway map.
Below is the set of logical functions that define the update steps of the Boolean network automatically assembled from the RAS pathway map.
Boolean network simulations were performed using the boolean2 package available from https://github.com/ialbert/booleannet [11]. We simulated 100 traces using asynchronous updates on the nodes (which results in stochastic behavior) and then took the average of the value of each node (with 0 corresponding to the low and 1 to the high state of each node).
Appendix Notebook 1 Inspecting INDRA Statements and assembled models
In this example we look at how intermediate results of the assembly process from word models to executable models can be inspected. We first import the necessary modules of INDRA.
In [1]:Populating the interactive namespace from numpy and matplotlib
Collecting Statements from reading
First, we use the TRIPS system via INDRA’s trips module to read two sentences which describe distinct mechanistic hypotheses about ATM phosphorylation.
In [2]:Here tp is a TripsProcessor object whose extracted Statements can be accessed in a list.
Printing Statements as objects
It is possible to look at the string representation of the extracted INDRA Statements as below.
In [3]:The first Statement, obtained by reading “Active ATM phosphorylates itself”, represents the Autophosphorylation of ATM with ATM being in an active state. Here activity stands for generic molecular activity and True indicates an active as opposed to an inactive state.
The second Statement, obtained from “Active ATM phosphorylates another ATM molecule” is a Phosphorylation with the enzyme ATM being in an active state phosphorylating another ATM as a substrate.
Drawing Statements as graphs
Next, we can use the draw_stmt_graph function to display the Statements produced by reading and INDRA input processing as a graph. The root of each tree is the type of the Statement, in this case Autophosphorylation. The arguments of the Statement branch off from the root. In this case the enzyme argument of Autophosphorylation is an Agent with name ATM. Its database references can be inspected under the db_refs property. In [4]:
Printing / exchanging Statements as JSON
INDRA Statements can be serialized into JSON format. This is a human-readable and editable form of INDRA Statements which is independent of Python and can therefore be used as a platform-independent data exchange format for Statements. The function stmts_to_json in the indra.statements module takes a list of Statements and returns a JSON as a dictionary. Below we pretty-print this JSON as a string with indentations.
In[5]:Inspecting assembled rule-based models
We now assemble two PySB models, one for each Statement.
In [6]:We can examine the properties of the PySB model object before exporting it. As seen below, the model has a single Monomer and Rule, and two Parameters.
In [7]:We can look at the ATM Monomer and its sites. ATM has an activity site which can be either active of inactive. It also has a phospho site with u and p states.
In [8]:The rule representing ATM autophosphorylation can be inspected below. The rule is parameterized by the forward rate kf_a_autophos_1.
In [9]:We now assemble a model for the second Statement.
In [10]:As we see, the rule assembled for this model contains two distinct ATMs on each side, one acting as the kinase and the other as the substrate.
Inspecting assembled model annotations
Finally, models assembled by INDRA carry automatically propagated annotations. Below, the grounding of ATM in the UniProt, HGNC and NCIT databases is annotated; the semantic role of monomers in each rule are also annotated, and finally, the unique ID of the INDRA Statement that a rule was derived from is annotated.
In [14]:Appendix Notebook 2: Model complexity, context specification and assembly policies
In this notebook we explore the effects of specified conditions on Agents (e.g. bound conditions, modification conditions) and assembly policies on the combinatorial complexity of dynamical models.
First, we import INDRA’s TRIPS input API and PySB model assembler.
In [1]:Model1: RAF to ERK without specifying agent context
In the first case, two binding events and a phosphorylation is described with no additional context specified on any of the proteins.
In [2]:This yields 3 INDRA Statements, as follows. Here empty parentheses after the Agent names indicate that there is no additional context specified on them.
In [3]:Assembly with one-step policy
We now assemble this model using the default one_step policy and store it in the model1_one variable.
In [4]:The model has 4 Monomers and 4 Rules.
In [6]:Let’s examine the last rule which corresponds to MEK phosphorylating ERK. Here, mek() appears without additional context specified. This means that the rule will apply to any form of mek, for instance, MEK that is unphosphorylated.
We now generate the rule-based model into a reaction network form using PySB’s interface to BioNetGen.
In [8]:We can now plot the reaction network to examine the model. Each colored node of the reaction network is a molecular species, reactions are represented by gray nodes, and arrows show species being consumed and produced by each reaction.
In [9]:We see from the reaction network that RAF is able to phosphorylate MEK whether or not it is bound to Vemurafenib, and MEK phosphorylates ERK whether or not it is phosphorylated.
Assembly with two-step policy
Let’s now assemble the same model with the two-step policy. This will result in a more detailed model in which MEK first binds ERK reversibly, and phosphorylated ERK is released from the MEK-ERK complex. We will store this model in the model1_two variable.
In [10]:We can now generate the reaction network for model1_two and inspect the reaction network that is created.
In [14]:The two-step policy produced a total of 13 molecular species and 19 reactions. ERK now appears in 6 possible forms:
ERK(phospho=‘u’, map2k=None)
ERK(phospho=‘p’, map2k=None)
ERK(phospho=‘u’, map2k=1) % MEK(phospho=‘u’, map3k=None, mapk=1)
ERK(phospho=‘u’, map2k=1) % MEK(phospho=‘p’, map3k=None, mapk=1)
ERK(phospho=‘u’, map2k=1) % MEK(phospho=‘u’, map3k=2, mapk=1) % RAF(vemurafenib=None, map2k=2)
ERK(phospho=‘u’, map2k=1) % MEK(phospho=‘u’, map3k=2, mapk=1) % RAF(vemurafenib=3, map2k=2) % VEMU RAFENI B(map3k=3)
This means that our initial description allowed for the possibility of ERK, MEK, RAF and Vemurafenib all simultaneously being in a complex. While the existence of such a complex is not impossible, we can introduce additional assumptions to refine the model.
Model2: RAF to ERK with specifying context
In this model we introduce additional assumptions (by explicitly making them part of the model definition) to the previous model. In particular, we add additional context on the agents to make causal structure explicit and simplify the model.
Assembly with two-step policy
In [17]:The INDRA Statements extracted by processing this text are shown below.
In [18]:We see that some agents are now subject to additional conditions, for instance, RAF(bound: [vemurafenib, False]) specifies that RAF should not be bound to Vemurafenib in order to phosphorylate MEK.
Let’s now assemble the model and generate the reaction network.
In [19]:The model is now significantly simpler with a total of 7 reactions down from 19 in the previous model.
Assembly with one-step policy and Michaelis-Menten rate law
As an alternative to the two-step policy, we can assemble the same model using a simpler, one-step policy.
In [22]:As we see, this model contains 7 individual species with 3 reactions in total.
One issue with the simple one-step policy is that enzymatic catalysis is modeled with pseudo-first order kinetics. Alternatively, we can use a Michaelis-Menten policy in with case both phosphorylation processes are still effectively modeled as one-step but their kinetic rates will account for enzyme saturation. As seen below, the model still retains the same structure as under the one-step policy, only kinetic rates change.
In [25]: