TY  - JOUR
T1  - Deciphering variance in epigenomic regulators by k-mer factorization
JF  - bioRxiv
DO  - 10.1101/129247
SP  - 129247
AU  - Carl G. de Boer
AU  - Aviv Regev
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/04/21/129247.abstract
N2  - Variation in chromatin organization across single cells can help shed important light on the mechanisms controlling gene expression, but scale, noise, and sparsity pose significant analysis challenges. Here, we develop gkm-PCA, an approach to infer variation in transcription factor (TF) activity across samples through an unsupervised analysis of the variation in the DNA sequences associated with an epigenomic mark. gkm-PCA first represents each sample as a vector of DNA word frequencies for the DNA sequence surrounding an epigenomic mark of interest, and then decomposes the resulting matrix of k-mer frequencies per sample to find hidden structure in the data. This allows both unsupervised grouping of samples and identification of the TFs that distinguish groups. Applied to single cell ATAC-seq data, gkm-PCA readily distinguished cell types, treatments, batch effects, experimental artifacts, and cycling cells. The structure within the k-mer landscape can be further related to differentially active TFs, with each variable component reflecting a set of co-varying TFs, which are often known to physically interact. For example, in K562 cells, AP-1 TFs emerge as the central determinant of variability in chromatin accessibility through their diverse interactions with other TFs and variable mRNA expression levels. We provide a theoretical basis for why cooperative TF binding (and any associated epigenomic mark) is inherently more variable than non-cooperative binding. gkm-PCA and related approaches will be valuable for gaining a mechanistic understanding of the trans determinants of chromatin variability between cells, treatments, and individuals.
ER  - 