PT - JOURNAL ARTICLE
AU - Shaoke Lou
AU - Kellie A. Cotter
AU - Tianxiao Li
AU - Jin Liang
AU - Hussein Mohsen
AU - Jason Liu
AU - Jing Zhang
AU - Sandra Cohen
AU - Jinrui Xu
AU - Haiyuan Yu
AU - Mark Rubin
AU - Mark Gerstein
TI - GRAM: A generalized model to predict the molecular effect of a non-coding variant in a cell-type specific manner
AID - 10.1101/482992
DP - 2019 Jan 01
TA - bioRxiv
PG - 482992
4099 - http://biorxiv.org/content/early/2019/01/02/482992.short
4100 - http://biorxiv.org/content/early/2019/01/02/482992.full
AB - There has been much effort to prioritize genomic variants with respect to their impact on “function”. However, function is often not precisely defined: Sometimes, it is the disease association of a variant; other times, it reflects a molecular effect on transcription or epigenetics. Here we coupled multiple genomic predictors to build GRAM, a generalized model, to predict a well-defined experimental target: the expression-modulating effect of a non-coding variant in a cell-specific manner. As a first step, we performed feature engineering: using a LASSO regularized linear model, we found transcription factor (TF) binding most predictive, especially for TFs that are hubs in the regulatory network; in contrast, evolutionary conservation, a popular feature in many other functional-impact predictors, has almost no contribution. Moreover, TF binding inferred from in vitro SELEX is as effective as that from in vivo ChIP-Seq. Second, we implemented GRAM integrating SELEX features and expression profiles. The program combines a universal regulatory score for a variant in a non-coding element with a modifier score reflecting the particular cell type. We benchmarked GRAM on a large-scale MPRA dataset in the GM12878 cell line, achieving a ROC score of ∼0.73; performance on the K562 cell line was similar. Finally, we evaluated the performance of GRAM on targeted regions using luciferase assays in MCF7 and K562 cell lines.
We noted that changing the insertion position of the construct relative to the reporter gene gives very different results, highlighting the importance of carefully defining the functional target the model is predicting.
Author Summary
Non-coding variants lie outside protein-coding regions and have been found to have disease associations. However, knowledge of the molecular effects of these non-coding variants in a cell-specific context is very limited. In addition, differences in output between experimental platforms may introduce extra complexity when analyzing the molecular function of these variants. We developed GRAM, a generalized model to predict the molecular effect of non-coding variants in multiple cell types and for different experimental platforms. We first selected the most informative cell-independent SELEX transcription factor binding scores at the variant locus as features, and then combined them with cell-specific gene expression profiles to build a multi-step prediction model. GRAM has been tested successfully on both MPRA and luciferase assays and on three cell lines (GM12878, K562 and MCF7), showing high performance.
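
The feature-engineering step mentioned in the abstract relies on a LASSO regularized linear model, whose L1 penalty drives the coefficients of uninformative features to exactly zero. The following is a minimal, dependency-free sketch of that general idea on toy data, not the authors' implementation: the function names, the feature names (`tf_binding_score`, `conservation_score`), the penalty `alpha` and all numbers are hypothetical and chosen only for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; values inside [-threshold, threshold] become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, alpha, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent.

    rows    : list of feature vectors (one list of floats per variant)
    targets : list of measured effects (one float per variant)
    alpha   : L1 penalty strength (hypothetical value in the demo below)
    """
    n = len(rows)
    p = len(rows[0])
    weights = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for x, y in zip(rows, targets):
                pred_without_j = sum(weights[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * (y - pred_without_j)
                z += x[j] * x[j]
            rho /= n
            z /= n
            weights[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return weights


# Toy data (hypothetical): the target depends only on the first feature.
feature_names = ["tf_binding_score", "conservation_score"]
rows = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
targets = [2.0, -2.0, 2.0, -2.0]

weights = lasso_coordinate_descent(rows, targets, alpha=0.5)

# Features with a non-zero coefficient are the ones the model "selects".
selected = [name for name, w in zip(feature_names, weights) if w != 0.0]
print(dict(zip(feature_names, weights)))
print("selected:", selected)
```

In this toy setting the penalty shrinks the informative coefficient and sets the uninformative one to zero, which is the mechanism that lets a LASSO model rank feature groups by their contribution. In practice a library implementation with cross-validated penalty selection would be the usual choice.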