A Statistical Framework for the Analysis of ChIP-Seq Data

Pei Fen Kuan; Dongjun Chung; Guangjin Pan; James A Thomson; Ron Stewart; Sündüz Keleş

doi:10.1198/jasa.2011.ap09706

J Am Stat Assoc. 2011;106(495):891-903. doi: 10.1198/jasa.2011.ap09706. Epub 2012 Jan 24.

Authors

Pei Fen Kuan¹, Dongjun Chung¹, Guangjin Pan², James A Thomson³, Ron Stewart², Sündüz Keleş⁴

Affiliations

¹ Departments of Statistics and of Biostatistics and Medical Informatics.
² Genome Center of Wisconsin and Morgridge Institute for Research.
³ Department of Anatomy, Genome Center of Wisconsin, Wisconsin National Primate Research Center and Morgridge Institute for Research.
⁴ Departments of Statistics and of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53706.

Abstract

Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) has revolutionized experiments for genome-wide profiling of DNA-binding proteins, histone modifications, and nucleosome occupancy. As the cost of sequencing decreases, many researchers are switching from microarray-based technologies (ChIP-chip) to ChIP-Seq for genome-wide studies of transcriptional regulation. Despite its increasing and well-deserved popularity, little work has investigated or accounted for sources of bias in ChIP-Seq technology. These biases typically arise from both the standard pre-processing protocol and the underlying DNA sequence of the generated data. We study data from a naked DNA sequencing experiment, which sequences non-cross-linked DNA after deproteinizing and shearing, to understand the factors affecting the background distribution of data generated in a ChIP-Seq experiment. We introduce a background model that accounts for apparent sources of bias such as mappability and GC content, and we develop a flexible mixture model named MOSAiCS for detecting peaks in both one- and two-sample analyses of ChIP-Seq data. We illustrate that our model fits observed ChIP-Seq data well and further demonstrate the advantages of MOSAiCS over commonly used tools for ChIP-Seq data analysis in several case studies.
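The general idea of a covariate-adjusted background inside a mixture model can be illustrated with a small sketch. This is not the MOSAiCS implementation, and the abstract does not give the model's functional form. The log-linear background mean, every coefficient, the dispersion `size`, the mixing proportion `pi_bound`, and the additive `enrichment` term are hypothetical values chosen only for illustration. The sketch assumes bin-level read counts follow a negative binomial distribution whose mean depends on mappability and GC content, and it scores a bin by its posterior probability of belonging to a "bound" component.

```python
import math


def nb_logpmf(y, mean, size):
    """Log PMF of a negative binomial parameterized by mean and dispersion `size`."""
    p = size / (size + mean)
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(p) + y * math.log1p(-p))


def background_mean(mappability, gc, b0=0.5, b_map=2.0, b_gc=1.0):
    """Hypothetical log-linear background mean; the coefficients are made up."""
    return math.exp(b0 + b_map * mappability + b_gc * gc)


def posterior_bound(y, mappability, gc, pi_bound=0.05, size=2.0, enrichment=20.0):
    """Posterior probability that a bin with count y is bound.

    Simplifying assumption: the bound component is a negative binomial whose
    mean is the background mean plus a fixed enrichment.
    """
    mu_bg = background_mean(mappability, gc)
    log_bg = math.log(1.0 - pi_bound) + nb_logpmf(y, mu_bg, size)
    log_sig = math.log(pi_bound) + nb_logpmf(y, mu_bg + enrichment, size)
    m = max(log_bg, log_sig)  # subtract the max for numerical stability
    w_bg, w_sig = math.exp(log_bg - m), math.exp(log_sig - m)
    return w_sig / (w_bg + w_sig)


if __name__ == "__main__":
    # The same count of 30 reads is scored in a poorly and a highly mappable bin.
    print(posterior_bound(30, mappability=0.2, gc=0.4))
    print(posterior_bound(30, mappability=1.0, gc=0.4))
```

Under these assumed parameters, the same read count is stronger evidence of binding in a poorly mappable bin than in a highly mappable one, because the expected background differs. This is the kind of bias that a covariate-aware background is meant to absorb.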

Keywords: GC content; Mappability; Mixture model; Negative binomial regression; Next generation sequencing.

Grants and funding

R01 HG003747/HG/NHGRI NIH HHS/United States