ABSTRACT
The analysis of ’omic data depends heavily on machine-readable information about protein interactions, modifications, and activities. Key resources include protein interaction networks, databases of post-translational modifications, and curated models of gene and protein function. Software systems that read primary literature can potentially extend and update such resources while reducing the burden on human curators, but machine-reading software systems have a high error rate. Here we describe an approach to precisely assemble molecular mechanisms at scale using natural language processing systems and the Integrated Network and Dynamical Reasoning Assembler (INDRA). INDRA identifies overlaps and redundancies in information extracted from published papers and pathway databases and uses probability models to reduce machine reading errors. INDRA enables the automated creation of high-quality, non-redundant corpora for use in data analysis and causal modeling. We demonstrate the use of INDRA in extending protein-protein interaction databases and explaining co-dependencies in the Cancer Dependency Map.
Competing Interest Statement
PKS is a co-founder and member of the BOD of Glencoe Software, a member of the BOD for Applied Biomath, and a member of the SAB for RareCyte, NanoString and Montai Health; he holds equity in Glencoe, Applied Biomath and RareCyte. PKS is a consultant for Merck and the Sorger lab has received research funding from Novartis and Merck in the past five years. PKS declares that none of these activities have influenced the content of this manuscript. JAB is currently an employee of Google, LLC. BMG declares no outside interests.