Abstract
The analysis of cancer biology data involves extremely heterogeneous datasets, including information from RNA sequencing, genome-wide copy number, DNA methylation data reporting on epigenomic regulation, somatic mutations from whole-exome or whole-genome analyses, pathology estimates from imaging sections or subtyping, drug response or other treatment outcomes, and various other clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrative analysis. We introduce a graph database and query engine for discovery and analysis of cancer biology, called the BioMedical Evidence Graph (BMEG). The BMEG is distinct from other biological data graphs in that sample-level molecular information is connected to reference knowledge bases. It combines gene expression and mutation data with drug response experiments, pathway databases and literature-derived associations. The construction of the BMEG has resulted in a graph containing over 36M vertices and 29M edges. The BMEG system provides a graph-query-based API to enable analysis, with client code available for Python, JavaScript and R, and a server online at bmeg.io. Using this system we have developed several forms of integrated analysis to demonstrate its utility. The BMEG is an evolving resource dedicated to enabling integrative analysis. We have demonstrated queries on the system that illustrate mutation significance analysis, drug response machine learning, patient-level knowledge base queries and pathway-level analysis. We have compared the resulting graph to other available integrated graph systems, and shown that it is unique in the scale of the graph and the type of data it makes available.
Highlights
Data resource connecting an extremely diverse set of cancer datasets
Graph query engine that can be easily deployed and used on new datasets
Easily installed Python client
Server online at bmeg.io
Summary
The analysis of cancer biology data involves extremely heterogeneous datasets, ranging from molecular profiles to clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrative analysis. We introduce a graph database and query engine for discovery and analysis of cancer biology, called the BioMedical Evidence Graph (BMEG). The construction of the BMEG has resulted in a graph containing over 36M vertices and 29M edges. The BMEG system provides a graph-query-based API to enable analysis, with client code available for Python, JavaScript and R, and a server online at bmeg.io. Using this system we have developed several forms of integrated analysis to demonstrate its utility.
Introduction
Biological data produced by large-scale projects now routinely reaches petabyte levels thanks to major advances in sequencing and imaging. This exponential growth in size is well documented and is being addressed by multiple big-data initiatives. However, the parallel increase in data heterogeneity remains largely unaddressed. With multiple profiling methods, platforms, versions, formats and pipelines, biological data is far from monolithic. This immense volume of heterogeneous data makes it difficult to normalize and integrate datasets and to perform integrative analysis across disparate experiments. Faced with these challenges, as well as the substantial labor and computation costs, researchers may use only a fraction of publicly available data for their analysis, and may not update their data or analysis as new data becomes available.
Graph databases are useful tools for systems biology analysis where integration of complex data is required1–3. In the commercial sector, several major data aggregators have been successfully using graph databases for integration of heterogeneous data. Facebook uses the ‘Social Graph’4 to represent the connections between people and their information, while Google’s search engine uses a ‘Knowledge Graph’ to connect various facts about different subjects. This approach is especially powerful when entities in the graph are connected via multiple types of complex, chained interactions. Based on these observations, we have built the BioMedical Evidence Graph (BMEG) to allow for complex integration and analysis of heterogeneous biological data.
The BMEG was created by importing several cancer-related resources and transforming them into a coherent graph representation. These resources include patient and sample information, mutations, gene expression, drug response data, genomic annotations and literature-based analysis (see Table 1). This graph contains 15K patients, 54K samples, 4M alleles, 640K drug response experiments and 50K literature-derived genotype-to-phenotype associations.
To enable analysis and machine learning, our team concentrated on applying high-quality feature extraction methods consistently to all samples. This included identifying the best methods for somatic variant calling and RNA-seq analysis. We utilized open challenges to create leaderboards of the best methods submitted by the community. We then participated in the development of open standards to enable the exchange of genomic associations from cancer knowledge bases.
Methods
Graph Schema
Gen3 is a data commons management system developed by the Center for Translational Data Science based on their work for the NCI's GDC. The BMEG graph schema is described using a JSON Schema derived from the Gen3 architecture. JSON Schema is a data definition language for describing rules about data structure, including required fields, data types and field value ranges. The Gen3 system extends JSON Schema with concepts for constructing graph data, including database ID alias mapping and edge creation.
At the core of Gen3's description of TCGA's metadata is a tree representing the organization of all the different data elements that make up the program. The tree starts at a top-level 'Program' node, representing the entire TCGA program; below that are separate projects for each of the different tumor types. Each tumor type is then populated by a number of Cases, which in turn have multiple Samples, which can then be subdivided into a number of Aliquots. A simplified schema sketch is shown below. The BMEG schema builds on this base structure to include data from a number of areas including: 1) Genome Reference, 2) Gene and Pathway Annotations, 3) Somatic Variants, 4) Gene Expression Data, 5) Knowledge Bases.
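As an illustrative sketch of this style of definition, the fragment below combines standard JSON Schema keywords with a Gen3-style link section for a Sample-like vertex; the property names, enum values and link fields are assumptions for illustration and do not reproduce the actual BMEG dictionary.

    # Illustrative, simplified schema entry in the spirit of JSON Schema plus
    # Gen3-style link definitions; names and values here are assumptions and
    # do not reproduce the actual BMEG dictionary files.
    sample_schema = {
        "id": "sample",
        "type": "object",
        "required": ["id", "sample_type"],
        "properties": {
            "id": {"type": "string"},
            "sample_type": {"enum": ["Primary Tumor", "Solid Tissue Normal"]},
        },
        # Gen3-style extension: declare the edge that connects this vertex
        # to its parent Case vertex in the graph.
        "links": [
            {"name": "cases", "target_type": "case", "label": "derived_from"}
        ],
    }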
Data Sources
Initial data sources (see Table 1) for the BMEG were centered on large cohorts of patient-derived samples, with DNA and RNA profiling, cell lines with drug response data and literature-derived drug-phenotype associations. The goal was to provide uniform input data for analysis and machine learning.
RNA Seq Data
To identify the best methods for RNA analysis, we launched the SMC-RNA challenge, which benchmarked isoform quantification methods to prioritize the methods used for processing data ingested into the BMEG. For RNA-Seq transcript abundances, we used Kallisto to process the TCGA and CCLE5 datasets. Additionally, the GTEx project6 provided gene-level transcripts-per-million (TPM) estimates for normal tissues that can be contrasted with tumors. Together these resources contribute 36K vertices to the BMEG graph.
TCGA Metadata
The Genomic Data Commons (GDC) created a data system to track the clinical and administrative meta-data of the TCGA samples and files. We utilized their web API to obtain TCGA patient and sample metadata for the evidence graph.
TCGA Genomic Data
To determine the best methods for somatic mutation calling, we partnered with the DREAM consortium, Sage BioNetworks and OICR to launch the ICGC-TCGA Somatic Mutation Calling challenge7. Many of the methods evaluated by this effort were incorporated into pipelines that were then deployed on TCGA's 10K exomes as part of the Multi-Center Mutation Calling in Multiple Cancers (MC3) project8. The MC3 data adds 10K vertices and connects to 3 million alleles (2.6 million distinct) in the graph. For the set of copy number alteration events, we utilized the Gistic29 data from the Broad Institute's Firehose system.
Cell Line Drug Response Data
Drug response data has been collated by the DepMap project10. This includes response curves, IC50 and EC50 scores from CCLE11, CTRP12, 13 and GDSC14. Additionally, the DepMap meta-data files provided cell line clinical attributes and cross project ID mapping.
Variant Drug Associations
The Genotype To Phenotype (G2P) schema15 was designed to enable a number of different cancer knowledge base resources to be aggregated into a coherent whole. Using this schema, the BMEG has aggregated associations from six prominent cancer knowledge bases, contributing 50K association vertices.
Pathway Data
Pathway Commons16 aggregates, normalizes and integrates data from 22 public pathway databases. At 1.5 million interactions and 400K detailed biochemical reactions, it is the largest curated pathway database available. It aggregates pathway relationships from Reactome17, NCI Pathway Interaction Database18, PhosphoSitePlus19, HumanCyc20, PANTHER Pathway21, MSigDB22, Recon X23, Comparative Toxicogenomics Database24, KEGG Pathway25, Integrating Network Objects with Hierarchies26, NetPath27, and WikiPathways28. All of these resources provided 1.9 million vertices to the graph.
Reference Data
Biological reference data and existing experimental results form the majority of the data stored in the BMEG. These concepts need to be modeled into the graph, with various transformers written to properly translate them. The import pipeline includes Gene, Transcript and Exon annotations, protein and PFAM29 assignments, as well as Gene Ontology30 functional annotations. To tie these resources together, the BMEG standardizes on Ensembl IDs31 as the global identifiers for genomic components.
Graph Databases and Query Languages
To enable various analytical queries, and to provide a framework for building new functionality, we developed the GRaph Integration Platform (GRIP) to power queries against the BMEG web resource. GRIP stores multiple forms of data, with the ability to hold thousands of data elements per vertex and per edge of the graph. As applications require, GRIP allows efficient conversion of graph data into data frames for downstream algorithms. This makes the system capable of storing not only sparse relationship data, such as pathways and ontologies, but also dense matrix-formatted data, such as gene expression levels for thousands of genes across hundreds of samples.
The query language implements most operations needed for subgraph selection, as well as aggregation features. A general-purpose endpoint places more emphasis on the client side building smart queries to obtain the data they need, rather than having custom server-side components provide specialized facets to the users. Because of this, clients can easily create new queries, unanticipated by the server developers, and have them still work. We have made the API available via Python, JavaScript and R clients. GRIP is written in the Go programming language and compiles to a single static binary, which means that it can be installed onto a system with few or no dependencies.
Results
A series of example use cases demonstrates the utility of the BMEG dataset and its query engine. GripQL is a traversal-based graph selection language based on Gremlin32. The user describes a series of steps that will be undertaken by a 'traveler'. An example traversal would start on a vertex with label 'project', go out along edges labeled 'samples', then go out along edges labeled 'aliquots'. The engine then scans the graph for all valid paths that can be completed given the instructions. Each of the traversal descriptions is based on the graph schema seen in Figure 1. The commands are written using the Python version of the client, but could be executed similarly in R or JavaScript. The API provides a getSchema method which describes the different types of vertices, their properties and the edges that connect them.
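As a minimal sketch of this workflow, the fragment below connects to the server with the gripql Python client, inspects the schema and runs a toy traversal mirroring the example above; the endpoint path and graph name are assumptions and should be adjusted to the deployment being queried.

    import gripql

    # Connect to the public BMEG server; the endpoint path and graph name
    # below are assumptions, adjust them for the deployment being queried.
    conn = gripql.Connection("https://bmeg.io/api")
    G = conn.graph("bmeg")

    # getSchema describes the vertex labels, their properties and the edges
    # that connect them (the labels relied upon in the traversals below).
    schema = G.getSchema()
    print(schema)

    # Toy traversal mirroring the example above: start on Project vertices,
    # walk out along 'samples' and then 'aliquots' edges, print a few results.
    q = G.query().V().hasLabel("Project").out("samples").out("aliquots").limit(5)
    for row in q:
        print(row)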
Our analysis begins with counting the mutations per gene in a cancer cohort. As seen in (1), this query starts on a project node, in this case TCGA-BRCA, and then follows the path to the cases that belong to the project, then to the samples and finally to the aliquots. As it passes the Sample node, it filters for tumor samples. Once on the Aliquot node, it continues to the SomaticCallset, which represents the set of variants produced by a single mutation calling analysis. The traversal then identifies the edges that connect the SomaticCallset to different alleles, this time using the outE command to land on the edge rather than the destination vertex. With the gene ID in hand, we then use the aggregate method to count the terms that occur in the ensembl_gene field.
Listing 1: the number of variant alleles found for each gene.
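A minimal GripQL sketch of this traversal follows; the project gid, edge labels and property names (for example gdc_attributes.sample_type) are assumptions based on the schema description and can be checked against getSchema.

    import gripql

    G = gripql.Connection("https://bmeg.io/api").graph("bmeg")  # endpoint and graph name assumed

    # Start at the TCGA-BRCA project and walk down to its aliquots,
    # keeping only tumor samples along the way (property name assumed).
    q = G.query().V("Project:TCGA-BRCA").out("cases").out("samples")
    q = q.has(gripql.eq("gdc_attributes.sample_type", "Primary Tumor"))
    q = q.out("aliquots").out("somatic_callsets")

    # Step onto the edges connecting each SomaticCallset to its Alleles;
    # these edges carry the Ensembl gene ID of the affected gene.
    q = q.outE("alleles")
    q = q.aggregate(gripql.term("geneCount", "ensembl_gene"))

    for row in q:
        print(row)  # one bucket per gene: Ensembl ID and its variant allele count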
We can then inspect the most frequently impacted pathways, first by identifying which pathways each of the mutated genes belongs to in (2), and then by normalizing by the number of genes per pathway in (3). To identify the pathways involved, we provide a list of all mutated genes, find their associated pathways and retrieve the tuples of every gene-pathway pair, using the as_ command (the underscore is added to avoid clashing with the Python reserved word as) to store the gene and then using the render function to display only the data we require.
Listing 2: all of the pathways of which each gene is a member.
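A sketch of this gene-to-pathway traversal is shown below; the gene IDs are illustrative placeholders for the mutated genes recovered above, and the 'pathways' edge label is an assumption.

    import gripql

    G = gripql.Connection("https://bmeg.io/api").graph("bmeg")  # endpoint and graph name assumed

    # Illustrative placeholders for the mutated gene IDs collected in (1);
    # the BMEG standardizes on Ensembl IDs as gene identifiers.
    genes = ["ENSG00000141510", "ENSG00000155657"]

    q = G.query().V(genes).hasLabel("Gene").as_("gene")
    q = q.out("pathways").as_("pathway")           # edge label assumed
    q = q.render(["$gene._gid", "$pathway._gid"])  # emit (gene, pathway) tuples

    gene_pathway_pairs = list(q)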
This information can be combined with the previous table to calculate the mutations per pathway. To normalize these values, we count the number of genes per pathway. The traversal starts on the Pathway vertex and marks it for later retrieval using the as_ command. Once the traveler has split and moved out to the multiple child Gene vertices, the select command recalls the stored Pathway vertex and moves the traveler back. At this point, an aggregation is called to count the number of travelers on each Pathway vertex.
Listing 3: the number of mutations for each pathway found.
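A sketch of the gene-per-pathway counting traversal described above follows; the 'genes' edge label is an assumption.

    import gripql

    G = gripql.Connection("https://bmeg.io/api").graph("bmeg")  # endpoint and graph name assumed

    # Mark each Pathway vertex, fan out to its member genes, then step back to
    # the stored pathway so each (pathway, gene) pair leaves one traveler there.
    q = G.query().V().hasLabel("Pathway").as_("pathway")
    q = q.out("genes").select("pathway")           # edge label assumed

    # Counting travelers per pathway gid gives the number of genes per pathway.
    q = q.aggregate(gripql.term("genesPerPathway", "_gid"))

    for row in q:
        print(row)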
We can also connect the set of mutations in the BRCA cohort to the most connected papers, as linked by the G2P associations, in (4). In this use case, the aggregate method is called on the special _gid variable, which represents the unique global ID of each vertex.
Listing 4: the number of mutations across all genes connected to each of the returned papers.
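A sketch of the traversal from BRCA mutations out to publications via the G2P associations follows; the 'g2p_associations' and 'publications' edge labels are assumptions.

    import gripql

    G = gripql.Connection("https://bmeg.io/api").graph("bmeg")  # endpoint and graph name assumed

    q = G.query().V("Project:TCGA-BRCA").out("cases").out("samples").out("aliquots")
    q = q.out("somatic_callsets").out("alleles")

    # Hop through the literature-derived associations to the papers they cite
    # (edge labels assumed), then count travelers per publication gid.
    q = q.out("g2p_associations").out("publications")
    q = q.aggregate(gripql.term("paperCount", "_gid"))

    for row in q:
        print(row)  # publications ranked by how many mutation paths reach them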
The G2P associations also connect to compounds that are linked to phenotypes based on specific alleles, in (5). This traversal is much like (4); however, it also includes a distinct operation to identify unique pairs of cases and compounds. If multiple association records from different publications link one allele to the same drug response phenotype, then only one relationship will be noted per patient.
Listing 5: the number of times compounds were associated with mutations in patients.
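A sketch of the compound variant of this traversal follows; the distinct step keeps one traveler per case/compound pair, and the edge labels remain assumptions.

    import gripql

    G = gripql.Connection("https://bmeg.io/api").graph("bmeg")  # endpoint and graph name assumed

    q = G.query().V("Project:TCGA-BRCA").out("cases").as_("case")
    q = q.out("samples").out("aliquots").out("somatic_callsets").out("alleles")
    q = q.out("g2p_associations").out("compounds").as_("compound")  # edge labels assumed

    # Multiple publications may link the same allele to the same drug response;
    # keep only one traveler per (case, compound) pair before counting.
    q = q.distinct(["$case._gid", "$compound._gid"])
    q = q.aggregate(gripql.term("compoundCount", "_gid"))

    for row in q:
        print(row)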
We can now take the most commonly linked drugs, in a list named compounds, that we found in (5) and identify the ones that have also been part of cell line drug response testing in (6). We limit our query to cell lines that were studied as part of the CTRP set of breast cancer cell lines.
Listing 6: those drugs that were profiled in the CTRP effort.
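A sketch of this intersection query follows; the CTRP breast cancer project gid and the edge labels are assumptions, and the compound gid list is an illustrative placeholder for the results of (5).

    import gripql

    G = gripql.Connection("https://bmeg.io/api").graph("bmeg")  # endpoint and graph name assumed

    # `compounds` holds the gids of the most frequently linked drugs from (5);
    # only the Fulvestrant gid mentioned in the text is shown here.
    compounds = ["Compound:CID104741"]

    # Walk from the CTRP breast cancer cell lines (project gid assumed) down to
    # their drug response experiments and the compounds those experiments tested.
    q = G.query().V("Project:CTRP_Breast_Cancer").out("cases").out("samples").out("aliquots")
    q = q.out("drug_response").out("compounds")

    # Keep only compounds that also appear on our literature-derived list.
    q = q.has(gripql.within("_gid", compounds)).distinct("_gid")

    for row in q:
        print(row)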
At this point, we find that Compound:CID104741 (Fulvestrant) was on our list of referenced drugs and was studied as part of CTRP. In (7) we look for the EC50 values for the samples in the breast cancer cell lines that were tested against Fulvestrant. This also includes a call to the render method, which shapes the output into a custom JSON structure; in this case, it forms a tuple with the stored sample ID and EC50 value. The list of tuples returned by the client can then be passed directly into a Pandas DataFrame33.
Listing 7: the EC50 values for each of the BRCA cell lines tested against the Fulvestrant agent.
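A sketch of the EC50 query follows; the project gid, edge labels and the ec50 property name are assumptions, and depending on the GRIP version the render path may need a _data prefix.

    import pandas as pd
    import gripql

    G = gripql.Connection("https://bmeg.io/api").graph("bmeg")  # endpoint and graph name assumed

    q = G.query().V("Project:CTRP_Breast_Cancer").out("cases").out("samples").as_("sample")
    q = q.out("aliquots").out("drug_response").as_("response")

    # Keep only experiments that tested Fulvestrant (gid from the text above).
    q = q.out("compounds").has(gripql.eq("_gid", "Compound:CID104741"))

    # Shape the output into (sample ID, EC50) tuples; the property path may need
    # to be "$response._data.ec50" depending on the GRIP version.
    q = q.render(["$sample._gid", "$response.ec50"])

    df = pd.DataFrame(list(q), columns=["sample", "ec50"])
    samples = df["sample"].tolist()  # the list of sample IDs reused in (8)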
With the drug response values in hand, we can then look for associated transcriptomic data. There is no RNA sequencing available directly from the CTRP project; however, many of the cell lines used in CTRP were also assayed as part of the CCLE project. To identify these samples, in (8) we follow the edges connecting the samples found in (7) (the list named samples) to their parent cases. We then follow the same_as edge to identify Case vertices in other projects that are the same as the ones we started on, then follow the tree down to the GeneExpression node to obtain the expression values and link them back to our sample IDs. Again, we can use the render function to return properly formatted data structures that can be passed directly into Pandas.
Listing 8: the expression values of each gene across cell lines with variants in CTRP and RNA in CCLE. The resulting matrix can be used to develop transcriptome-based drug response models.
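A sketch of the expression lookup follows; the edge labels, the CCLE project filter and the expression 'values' field are assumptions, and the sample gid shown is a placeholder for the list produced in (7).

    import pandas as pd
    import gripql

    G = gripql.Connection("https://bmeg.io/api").graph("bmeg")  # endpoint and graph name assumed

    # `samples` is the list of CTRP sample gids produced in (7); placeholder here.
    samples = ["Sample:CTRP-ACH-000019"]

    q = G.query().V(samples).as_("sample")

    # Walk up to the parent case, across the same_as edge to its counterpart in
    # other projects (e.g. CCLE), then down to its gene expression data.
    # Edge labels and the project_id filter are assumptions.
    q = q.in_("samples").out("same_as").has(gripql.eq("project_id", "CCLE"))
    q = q.out("samples").out("aliquots").out("gene_expressions").as_("expr")

    # Pair each starting sample with its expression vector (a gene-to-TPM mapping;
    # field name assumed, and the path may need a _data prefix on older servers).
    q = q.render(["$sample._gid", "$expr.values"])

    # Build a cell line by gene expression matrix for downstream modeling.
    expr = pd.DataFrame({sample_id: values for sample_id, values in q}).T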
Data Releases
The BMEG resource was designed with portability and openness in mind. The graph query engine that runs the system is open source and easy to install, while all of the compiled source files are made available for bulk download. This allows other researchers to build on our existing system and reuse the data we have collected. Because graph data can be represented by a number of different query engines, we also developed translations of the BMEG resource. Part of the BMEG toolkit is a set of scripts to translate the dataset and load it into other graph database systems, including Neo4j and Dgraph.
Discussion
Recently, a number of graph-based data integration projects have appeared, including biograkn34, Biograph35, Bio4j36, Bio2RDF37 and Hetionet38. Many of these systems were built to aggregate pathway and genotype/phenotype linkages. BMEG differs from these efforts in that its primary use case is to drive analysis and machine learning from actual samples. The BMEG holds genomic, transcriptomic and phenotypic data from cancer cases as well as cell line samples. This data is meant to provide a starting point for discovery and the generation of new models, rather than simply serving as a repository of existing models. The core idea of the BMEG is to define a coherent input dataset that enables a variety of downstream analyses.
With the primary layer of data in place, the next step is to enable online machine learning methods and do a comparative analysis of the patterns learned. The next steps, which are currently being developed, will see machine learning based predictions join the graph. These derived associations will connect samples to phenotypes similar to the way that the G2P edges connect samples to drug sensitivity. With these novel phenotypic annotations available, we will be able to observe predicted trends across cohorts and identify new patterns.
Author Contributions
Conception and design: Josh Stuart, Kyle Ellrott
Collection and assembly of data: Adam Struck, Brian Walsh, Alexander Buchanan, Ryan Spangler
Data analysis and interpretation: Adam Struck, Brian Walsh, Jordan Lee
Manuscript writing: All authors
Final approval of manuscript: All authors
Accountable for all aspects of the work: All authors