PT - JOURNAL ARTICLE
AU - Carl G. de Boer
AU - Aviv Regev
TI - Deciphering variance in epigenomic regulators by k-mer factorization
AID - 10.1101/129247
DP - 2017 Jan 01
TA - bioRxiv
PG - 129247
4099 - http://biorxiv.org/content/early/2017/04/21/129247.short
4100 - http://biorxiv.org/content/early/2017/04/21/129247.full
AB - Variation in chromatin organization across single cells can help shed important light on the mechanisms controlling gene expression, but scale, noise, and sparsity pose significant analysis challenges. Here, we develop gkm-PCA, an approach to infer variation in transcription factor (TF) activity across samples through an unsupervised analysis of the variation in the DNA sequences associated with an epigenomic mark. gkm-PCA first represents each sample as a vector of DNA word frequencies for the DNA sequence surrounding an epigenomic mark of interest, and then decomposes the resulting matrix of k-mer frequencies per sample to find hidden structure in the data. This allows both unsupervised grouping of samples and identification of the TFs that distinguish groups. Applied to single cell ATAC-seq data, gkm-PCA readily distinguished cell types, treatments, batch effects, experimental artifacts, and cycling cells. The structure within the k-mer landscape can be further related to differentially active TFs, with each variable component reflecting a set of co-varying TFs, which are often known to physically interact. For example, in K562 cells, AP-1 TFs emerge as the central determinant of variability in chromatin accessibility through their diverse interactions with other TFs and variable mRNA expression levels. We provide a theoretical basis for why cooperative TF binding (and any associated epigenomic mark) is inherently more variable than non-cooperative binding. gkm-PCA and related approaches will be valuable for gaining a mechanistic understanding of the trans determinants of chromatin variability between cells, treatments, and individuals.
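
The two steps named in the abstract (a per-sample vector of DNA word frequencies, then a decomposition of the sample-by-k-mer matrix) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain contiguous k-mers rather than gapped k-mers, uses ordinary PCA computed by SVD on the centered matrix, and runs on made-up toy sequences. The function names `kmer_frequencies` and `kmer_pca` are hypothetical.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seqs, k=3):
    """Return the normalized k-mer frequency vector for one sample.

    Assumption: plain contiguous k-mers over A/C/G/T (not gapped k-mers).
    Windows containing any other character (e.g. N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            j = index.get(seq[i:i + k])
            if j is not None:
                counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def kmer_pca(samples, k=3, n_components=2):
    """Decompose the sample-by-k-mer frequency matrix with PCA via SVD.

    Returns (scores, loadings): scores place each sample on each component,
    loadings give the weight of each k-mer on each component.
    """
    X = np.array([kmer_frequencies(s, k) for s in samples])
    Xc = X - X.mean(axis=0)  # center each k-mer column across samples
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components]
    return scores, loadings


# Toy data (invented for illustration): each sample is a list of sequences
# standing in for the DNA around an epigenomic mark in one cell or sample.
samples = [
    ["TGACTCATGACTCA", "ACGTTGACTCA"],   # rich in a TGACTCA-like word
    ["TGACTCATTGACTCA", "TGACTCAACGT"],  # rich in a TGACTCA-like word
    ["AGATAAGGATAAG", "ACGTAGATAA"],     # rich in an AGATAA-like word
    ["AGATAAAGATAAG", "AGATAAACGT"],     # rich in an AGATAA-like word
]

scores, loadings = kmer_pca(samples, k=3, n_components=2)
print(scores[:, 0])  # first-component score of each sample
```

In this toy setup the first component separates the two groups of samples, and the k-mers with the largest absolute loadings on it are the words that distinguish them. Relating those k-mers to specific TFs, as the abstract describes, would need a further motif-matching step that is not shown here.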