PT - JOURNAL ARTICLE
AU - Olivera Grujic
AU - Tanya N. Phung
AU - Soo Bin Kwon
AU - Adriana Arneson
AU - Yuju Lee
AU - Kirk E. Lohmueller
AU - Jason Ernst
TI - Identification and characterization of constrained non-exonic bases lacking predictive epigenomic and transcription factor binding annotations
AID - 10.1101/722876
DP - 2019 Jan 01
TA - bioRxiv
PG - 722876
4099 - http://biorxiv.org/content/early/2019/08/02/722876.short
4100 - http://biorxiv.org/content/early/2019/08/02/722876.full
AB - Genome-wide maps of epigenomic marks and transcription factor binding provide cell-type- and condition-specific information for annotating genomes and interpreting genetic variation. Predictions of evolutionarily constrained bases provide an orthogonal genomic annotation of potentially important bases in the genome. Evolutionarily constrained non-exonic bases that are not effectively predicted from large-scale epigenomic and transcription factor binding data could suggest noteworthy gaps in the coverage of such data. To investigate this, we developed the Constrained Non-Exonic Predictor (CNEP) and applied it to the human genome, using over ten thousand features defined from large-scale epigenomic and transcription factor binding data to score the evidence of each base being in a constrained non-exonic element from such data. We find that a large subset of constrained non-exonic bases is well predicted by CNEP, but another large subset is not, and the predictive power for bases varies substantially with their ConsHMM conservation state annotations. Human genetic variation provided evidence to support that a set of called constrained non-exonic bases with low CNEP scores are under selection, but to a lesser extent than those with high scores. We analyzed the potential biological role of constrained non-exonic bases with low CNEP scores using regulatory sequence motifs, mouse epigenomic data, and additional prospectively considered human data.
These analyses highlight how a subset of these bases may have specialized regulatory roles related to embryonic development, the brain, or response to stimuli not well annotated by commonly used compendia of epigenomic and transcription factor binding data.