Abstract
Word models describing molecular mechanisms are a common currency in spoken and written communication in biomedicine but are of limited use in predicting the behavior of complex biological networks. We present an approach to building computational models directly from natural language using automated assembly. Molecular mechanisms described in simple English are read by natural language processing algorithms, converted into an intermediate representation and assembled into executable or network models. We have implemented this approach in the Integrated Network and Dynamical Reasoning Assembler (INDRA), which draws on existing natural language processing systems as well as pathway information in Pathway Commons and other online resources. We demonstrate the use of INDRA and natural language to model three biological processes of increasing scope: (i) p53 dynamics in response to DNA damage, (ii) adaptive drug resistance in BRAF-V600E mutant melanomas, and (iii) the RAS signaling pathway. The use of natural language for modeling makes routine tasks more efficient for modeling practitioners and increases the accessibility and transparency of models for the broader biology community.
Glossary
- Application programming interface (API)
- a standardized interface by which one software system can use services provided by other software, often remotely; in the current context, INDRA accesses NLP systems and pathway databases via their APIs and itself exposes an API that other software can build upon. "API" is used here interchangeably with "Interface" (e.g., INDRA’s TRIPS Interface).
- Molecular mechanism
- used in this paper to refer to processes involved in changing the state of a molecular entity or in describing its interaction with another molecular entity as represented by a set of linked biochemical reactions. Mechanisms are often described in the literature and are captured in databases in formats such as BioPAX. The information we extract from such descriptions is referred to interchangeably as mechanistic information, mechanistic assertions, mechanistic facts and mechanistic findings.
- Processor
- a module in INDRA that constructs INDRA Statements from a specific input format.
- Template extraction
- the process by which INDRA Processors extract INDRA Statements from various input formats.
- Assembler
- a module in INDRA that constructs a model, network or other output from INDRA Statements.
- Model assembly
- the process of automatically generating a model in a given computational formalism from an intermediate knowledge representation; in our context from INDRA Statements.
- Executable model
- a computational model that can be simulated to reproduce the observable dynamical behavior of a system; often, but not always, a system of linked differential equations.
- Policies
- user-defined settings which affect the automated assembly process.
- Knowledge representation
- a formalism that allows aggregation of information, potentially from multiple sources, in a standardized computable format; in the current context, INDRA Statements serve as a common knowledge representation for mechanistic information.
- Natural language (NL)
- language that humans commonly use to communicate in speech and writing; in the current context, restricted to the English language.
- Natural language processing (NLP)
- the algorithmic process by which a computer interprets natural language text.
- Named entity recognition (NER)
- a sub-task of NLP concerned with the recognition of special words in a text that are not part of the general language; in the current context NER is used to identify proteins, metabolites, drugs, and other terms (which are generally referred to as named entities).
- Grounding
- a sub-task of NLP related to NER which assigns unique identifiers to named entities in text by linking them to ontologies and databases; in the current context this involves creating links to databases such as UniProt, HGNC, GO or ChEBI.
- Logical form (LF)
- a graph representing the meaning of a sentence; an intermediate output of natural language processing in the TRIPS system (Box 1).
- Extraction knowledge base (EKB)
- a collection of events and terms relevant to molecular biology that is the result of natural language processing with TRIPS (Box 1).
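
To make the relationship between several glossary terms concrete (Processor, template extraction, Assembler, model assembly and Policies), the following is a minimal, self-contained sketch in Python. It is a toy illustration only and does not use INDRA's actual API: the `Phosphorylation` class, the `process_text` and `assemble` functions, the single regular-expression template, the reaction-string output and the policy names used here are all hypothetical stand-ins chosen for illustration.

```python
import re
from dataclasses import dataclass


# Hypothetical stand-in for an intermediate knowledge representation
# (one Statement type); this is NOT INDRA's actual Statement class.
@dataclass(frozen=True)
class Phosphorylation:
    enzyme: str
    substrate: str


# Toy "Processor": template extraction from a single fixed sentence pattern.
# Real NLP systems handle far more varied language; this is illustrative only.
PATTERN = re.compile(r"(\w+) phosphorylates (\w+)")


def process_text(text):
    """Extract Phosphorylation statements matching the toy template."""
    return [Phosphorylation(e, s) for e, s in PATTERN.findall(text)]


# Toy "Assembler": turns statements into reaction strings.
# The `policy` argument mimics a user-defined setting that changes how the
# same statement is rendered in the output model.
def assemble(statements, policy="one_step"):
    rules = []
    for st in statements:
        e, s = st.enzyme, st.substrate
        if policy == "one_step":
            # Enzyme modifies substrate in a single catalytic step.
            rules.append(f"{e} + {s} -> {e} + {s}_p")
        elif policy == "two_step":
            # Reversible binding followed by catalysis and release.
            rules.append(f"{e} + {s} <-> {e}:{s}")
            rules.append(f"{e}:{s} -> {e} + {s}_p")
        else:
            raise ValueError(f"unknown policy: {policy}")
    return rules


text = "MEK phosphorylates ERK. RAF phosphorylates MEK."
statements = process_text(text)
one_step_rules = assemble(statements, policy="one_step")
two_step_rules = assemble(statements, policy="two_step")
```

In this sketch the same two extracted statements yield two reaction rules under the one-step policy and four under the two-step policy, which illustrates how the intermediate representation separates what is asserted from how it is modeled.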