bioRxiv
New Results

Embeddings of genomic region sets capture rich biological associations in lower dimensions

Erfaneh Gharavi, Aaron Gu, Guangtao Zheng, Jason P. Smith, Aidong Zhang, Donald E. Brown, Nathan C. Sheffield
doi: https://doi.org/10.1101/2021.05.07.443166
Erfaneh Gharavi
1Center for Public Health Genomics, University of Virginia
6School of Data Science, University of Virginia
Aaron Gu
1Center for Public Health Genomics, University of Virginia
5Department of Computer Science, University of Virginia
Guangtao Zheng
5Department of Computer Science, University of Virginia
Jason P. Smith
1Center for Public Health Genomics, University of Virginia
4Department of Biochemistry and Molecular Genetics, University of Virginia
Aidong Zhang
5Department of Computer Science, University of Virginia
Donald E. Brown
6School of Data Science, University of Virginia
Nathan C. Sheffield
1Center for Public Health Genomics, University of Virginia
2Department of Public Health Sciences, University of Virginia
3Department of Biomedical Engineering, University of Virginia
4Department of Biochemistry and Molecular Genetics, University of Virginia
6School of Data Science, University of Virginia
  • For correspondence: nsheffield@virginia.edu

Abstract

Motivation Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis.

Results We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods, one based on interval unions and one on term frequency-inverse document frequency, and evaluated the methods in three ways: first, by classifying the cell line, antibody, or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody, and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling of the data derived from different peak-calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data.
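
To make the idea of a region set as a vector concrete, the following is a minimal sketch of the simpler term frequency-inverse document frequency baseline mentioned above, not the authors' implementation. It assumes that every region has already been mapped to a token in a shared vocabulary (here, hypothetical strings such as "chr1:100-200"), treats each region set as a document, and uses a smoothed IDF formula chosen for illustration. The function names `tfidf_vectors` and `cosine` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(region_sets):
    """Map each region set (a list of region tokens) to a TF-IDF vector.

    region_sets: dict of set name -> list of region tokens.
    Returns (vocab, vectors), where vectors[name] is aligned with vocab.
    """
    # Shared vocabulary: every distinct region token seen in any set.
    vocab = sorted({r for regions in region_sets.values() for r in regions})
    n_sets = len(region_sets)

    # Document frequency: in how many region sets each token appears.
    df = Counter(r for regions in region_sets.values() for r in set(regions))

    # Smoothed inverse document frequency (an illustrative choice).
    idf = {r: math.log((1 + n_sets) / (1 + df[r])) + 1.0 for r in vocab}

    vectors = {}
    for name, regions in region_sets.items():
        counts = Counter(regions)
        total = sum(counts.values()) or 1  # guard against empty sets
        vectors[name] = [counts[r] / total * idf[r] for r in vocab]
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy region sets (hypothetical data for illustration only).
toy = {
    "setA": ["chr1:100-200", "chr1:500-600", "chr2:50-150"],
    "setB": ["chr1:100-200", "chr1:500-600", "chr3:10-90"],
    "setC": ["chr4:1-100", "chr5:1-100"],
}

vocab, vecs = tfidf_vectors(toy)
print(cosine(vecs["setA"], vecs["setB"]))  # sets sharing regions
print(cosine(vecs["setA"], vecs["setC"]))  # sets sharing no regions
```

In this representation the vector length equals the vocabulary size, which is why such baselines can reach very high dimensionality on real data; the word2vec-based approach described above instead produces a fixed, much shorter vector per region set.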

Availability https://github.com/databio/regionset-embedding

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted May 09, 2021.
Subject Area

  • Bioinformatics