Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes

Daniel S. Himmelstein, Sergio E. Baranzini
doi: https://doi.org/10.1101/011569
Daniel S. Himmelstein
1Biological & Medical Informatics, UCSF, San Francisco, California, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Sergio E. Baranzini
1Biological & Medical Informatics, UCSF, San Francisco, California, USA
2Neurology, UCSF, San Francisco, California, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: sebaran@cgl.ucsf.edu
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Supplementary material
  • Preview PDF
Loading

Abstract

The first decade of Genome Wide Association Studies (GWAS) has uncovered a wealth of disease-associated variants. Two important derivations will be the translation of this information into a multiscale understanding of pathogenic variants, and leveraging existing data to increase the power of existing and future studies through prioritization. We explore edge prediction on heterogeneous networks—graphs with multiple node and edge types—for accomplishing both tasks. First we constructed a network with 18 node types—genes, diseases, tissues, pathophysiologies, and 14 MSigDB (molecular signatures database) collections—and 19 edge types from high-throughput publicly-available resources. From this network composed of 40,343 nodes and 1,608,168 edges, we extracted features that describe the topology between specific genes and diseases. Next, we trained a model from GWAS associations and predicted the probability of association between each protein-coding gene and each of 29 well-studied complex diseases. The model, which achieved 132-fold enrichment in precision at 10% recall, outperformed any individual domain, highlighting the benefit of integrative approaches. We identified pleiotropy, transcriptional signatures of perturbations, pathways, and protein interactions as fundamental mechanisms explaining pathogenesis. Our method successfully predicted the results (with AUROC = 0.79) from a withheld multiple sclerosis (MS) GWAS despite starting with only 13 previously associated genes. Finally, we combined our network predictions with statistical evidence of association to propose four novel MS genes, three of which (JAK2, REL, RUNX3) validated on the masked GWAS. Furthermore, our predictions provide biological support highlighting REL as the causal gene within its gene-rich locus. Users can browse all predictions online (http://het.io). Heterogeneous network edge prediction effectively prioritized genetic associations and provides a powerful new approach for data integration across multiple domains.

Author Summary For complex human diseases, identifying the genes harboring susceptibility variants has taken on medical importance. Disease-associated genes provide clues for elucidating disease etiology, predicting disease risk, and highlighting therapeutic targets. Here, we develop a method to predict whether a given gene and disease are associated. To capture the multitude of biological entities underlying pathogenesis, we constructed a heterogeneous network, containing multiple node and edge types. We built on a technique developed for social network analysis, which embraces disparate sources of data to make predictions from heterogeneous networks. Using the compendium of associations from genome-wide studies, we learned the influential mechanisms underlying pathogenesis. Our findings provide a novel perspective about the existence of pervasive pleiotropy across complex diseases. Furthermore, we suggest transcriptional signatures of perturbations are an underutilized resource amongst prioritization approaches. For multiple sclerosis, we demonstrated our ability to prioritize future studies and discover novel susceptibility genes. Researchers can use these predictions to increase the statistical power of their studies, to suggest the causal genes from a set of candidates, or to generate evidence-based experimental hypothesis.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Back to top
PreviousNext
Posted December 11, 2014.
Download PDF

Supplementary Material

Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes
Daniel S. Himmelstein, Sergio E. Baranzini
bioRxiv 011569; doi: https://doi.org/10.1101/011569
Digg logo Reddit logo Twitter logo Facebook logo Google logo LinkedIn logo Mendeley logo
Citation Tools
Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes
Daniel S. Himmelstein, Sergio E. Baranzini
bioRxiv 011569; doi: https://doi.org/10.1101/011569

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One
Subject Areas
All Articles
  • Animal Behavior and Cognition (4110)
  • Biochemistry (8813)
  • Bioengineering (6517)
  • Bioinformatics (23456)
  • Biophysics (11788)
  • Cancer Biology (9205)
  • Cell Biology (13318)
  • Clinical Trials (138)
  • Developmental Biology (7433)
  • Ecology (11407)
  • Epidemiology (2066)
  • Evolutionary Biology (15145)
  • Genetics (10433)
  • Genomics (14041)
  • Immunology (9169)
  • Microbiology (22152)
  • Molecular Biology (8808)
  • Neuroscience (47558)
  • Paleontology (350)
  • Pathology (1428)
  • Pharmacology and Toxicology (2491)
  • Physiology (3730)
  • Plant Biology (8079)
  • Scientific Communication and Education (1437)
  • Synthetic Biology (2220)
  • Systems Biology (6037)
  • Zoology (1252)