RT Journal Article
SR Electronic
T1 A pitfall for machine learning methods aiming to predict across cell types
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 512434
DO 10.1101/512434
A1 Schreiber, Jacob
A1 Singh, Ritambhara
A1 Bilmes, Jeffrey
A1 Noble, William Stafford
YR 2019
UL http://biorxiv.org/content/early/2019/04/26/512434.abstract
AB Machine learning models used to predict phenomena such as gene expression, enhancer activity, transcription factor binding, or chromatin conformation are most useful when they can generalize to make accurate predictions across cell types. In this situation, a natural strategy is to train the model on experimental data from some cell types and evaluate performance on one or more held-out cell types. In this work, we show that when the training set contains examples derived from the same genomic loci across multiple cell types, the resulting model can be susceptible to a particular form of bias related to memorizing the average activity associated with each genomic locus. Consequently, the trained model may appear to perform well when evaluated on the genomic loci that it was trained on but tends to perform poorly on loci that it was not trained on. We demonstrate this phenomenon by using epigenomic measurements and nucleotide sequence to predict gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data and computing resources become available, future projects will increasingly risk suffering from this issue.
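
The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or data: the synthetic activity matrix, the noise levels, the split into seen and unseen loci, and all variable names are assumptions made purely for illustration. It builds a "model" that only memorizes the average activity of each training locus, then scores it on a held-out cell type, separately for loci it was trained on and loci it never saw.

```python
import numpy as np

rng = np.random.default_rng(0)
n_loci, n_cell_types = 2000, 10

# Synthetic data (assumption): every locus has its own mean activity,
# plus smaller cell-type-specific variation around that mean.
locus_mean = rng.normal(0.0, 1.0, size=(n_loci, 1))
activity = locus_mean + rng.normal(0.0, 0.3, size=(n_loci, n_cell_types))

train_cells = activity[:, :-1]  # cell types available for training
held_out = activity[:, -1]      # held-out cell type used for evaluation

# Assumption: only the first half of the loci appear in the training set.
seen = np.zeros(n_loci, dtype=bool)
seen[: n_loci // 2] = True

# A pure "memorizer": for seen loci it returns the per-locus average over the
# training cell types; for unseen loci it can only return the global mean.
global_mean = train_cells[seen].mean()
prediction = np.full(n_loci, global_mean)
prediction[seen] = train_cells[seen].mean(axis=1)

def mse(pred, truth):
    """Mean squared error between two 1-D arrays."""
    return float(np.mean((pred - truth) ** 2))

mse_seen = mse(prediction[seen], held_out[seen])
mse_unseen = mse(prediction[~seen], held_out[~seen])
r_seen = float(np.corrcoef(prediction[seen], held_out[seen])[0, 1])

print(f"held-out cell type, seen loci:   MSE={mse_seen:.3f}, r={r_seen:.3f}")
print(f"held-out cell type, unseen loci: MSE={mse_unseen:.3f}")
```

Under these assumptions the memorizer scores well on the held-out cell type at seen loci even though it has learned nothing about cell-type-specific activity, and its error rises sharply at unseen loci. Comparing any trained model against such a per-locus average is one plausible way to check whether apparent cross-cell-type performance reflects more than memorized locus averages.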