New Results

Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model

Seong Kyu Han, Yoshiharu Muto, Parker C. Wilson, Aravinda Chakravarti, Benjamin D. Humphreys, Matthew G. Sampson, Dongwon Lee
doi: https://doi.org/10.1101/2022.02.24.481844
Seong Kyu Han
1Department of Pediatrics, Division of Nephrology, Boston Children’s Hospital, Boston & Harvard Medical School, Boston, MA, USA
Yoshiharu Muto
2Division of Nephrology, Department of Medicine, Washington University in St. Louis, St. Louis, MO, USA
Parker C. Wilson
3Department of Pathology and Immunology, Washington University in St. Louis, St. Louis, MO, USA
Aravinda Chakravarti
4Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA
Benjamin D. Humphreys
2Division of Nephrology, Department of Medicine, Washington University in St. Louis, St. Louis, MO, USA
5Department of Developmental Biology, Washington University in St. Louis, St. Louis, MO, USA
Matthew G. Sampson
1Department of Pediatrics, Division of Nephrology, Boston Children’s Hospital, Boston & Harvard Medical School, Boston, MA, USA
6Broad Institute of MIT and Harvard, Cambridge, MA, USA
Dongwon Lee
1Department of Pediatrics, Division of Nephrology, Boston Children’s Hospital, Boston & Harvard Medical School, Boston, MA, USA
  • For correspondence: dongwon.lee@childrens.harvard.edu

Abstract

Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the quality of the resulting data varies widely owing to several biological and technical factors. To surmount this problem, we use the predictability of open-chromatin peaks from DNA sequence-based machine-learning models to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify high-quality samples underperforming owing to marginal read depths. Peaks identified in high-quality samples by gkmQC are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants from genome-wide association studies (GWAS), and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks in bulk data and, especially, in single-cell chromatin accessibility data. Here we provide a standalone open-source toolkit (https://github.com/Dongwon-Lee/gkmQC) for such analyses and share improved regulatory maps using gkmQC. These resources will contribute to the functional interpretation of disease-associated regulatory genetic variation.
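
The core idea in the abstract, scoring a peak set by how well a sequence-based model can tell its peaks apart from background, can be illustrated with a toy sketch. The code below is not the gkmQC implementation and does not use its API: it swaps the gapped k-mer SVM for a simple log-odds classifier over contiguous 3-mers, and every name in it (`cv_auc`, `plant`, the `GATAAG` motif, the sequence lengths and fold count) is a hypothetical choice made for illustration. It only shows that a peak set carrying sequence signal yields a higher cross-validated AUC than one made of random sequence.

```python
import math
import random
from collections import Counter
from itertools import product

K = 3  # k-mer length; an assumed value for this toy example


def kmer_counts(seq, k=K):
    """Count overlapping contiguous k-mers in one DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_log_odds(pos, neg, k=K):
    """Fit per-k-mer log-odds weights with add-one smoothing.

    This is a naive-Bayes-style stand-in for a gapped k-mer SVM.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    pc, nc = Counter(), Counter()
    for s in pos:
        pc.update(kmer_counts(s, k))
    for s in neg:
        nc.update(kmer_counts(s, k))
    pt = sum(pc.values()) + len(kmers)
    nt = sum(nc.values()) + len(kmers)
    return {m: math.log((pc[m] + 1) / pt) - math.log((nc[m] + 1) / nt)
            for m in kmers}


def score(seq, weights, k=K):
    """Sum the weights of all k-mers observed in the sequence."""
    return sum(weights.get(m, 0.0) * c for m, c in kmer_counts(seq, k).items())


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def cv_auc(pos, neg, folds=5, k=K):
    """Mean held-out AUC over folds; used here as the peak-set quality score."""
    aucs = []
    for f in range(folds):
        tr_p = [s for i, s in enumerate(pos) if i % folds != f]
        tr_n = [s for i, s in enumerate(neg) if i % folds != f]
        te_p = [s for i, s in enumerate(pos) if i % folds == f]
        te_n = [s for i, s in enumerate(neg) if i % folds == f]
        w = train_log_odds(tr_p, tr_n, k)
        aucs.append(auc([score(s, w, k) for s in te_p],
                        [score(s, w, k) for s in te_n]))
    return sum(aucs) / folds


def random_seq(rng, n=100):
    """Uniform random DNA sequence of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def plant(rng, seq, motif="GATAAG", copies=2):
    """Overwrite random positions with a motif to mimic regulatory signal."""
    for _ in range(copies):
        i = rng.randrange(len(seq) - len(motif) + 1)
        seq = seq[:i] + motif + seq[i + len(motif):]
    return seq


if __name__ == "__main__":
    rng = random.Random(0)
    background = [random_seq(rng) for _ in range(200)]
    good_peaks = [plant(rng, random_seq(rng)) for _ in range(200)]
    noisy_peaks = [random_seq(rng) for _ in range(200)]
    print("signal-bearing peak set AUC:", round(cv_auc(good_peaks, background), 3))
    print("random peak set AUC:", round(cv_auc(noisy_peaks, background), 3))
```

In this toy setting the random "peak" set should score near chance while the motif-bearing set scores well above it, which is the contrast a prediction-accuracy-based quality metric relies on. Real data would need genomic background sampling and a far more expressive model than this sketch.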

Competing Interest Statement

MGS is on the Scientific Advisory Board of Natera and a consultant for Maze.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted February 25, 2022.
Subject Area

  • Genomics