Abstract
Multi-zinc finger proteins are the largest class of human transcription factors, whose DNA-binding specificity is often encoded by a subset of their tandem Cys2His2 zinc finger (ZF) domains. However, the molecular code that underlies ZF-DNA interaction is incompletely understood, and in most cases the ZF subset that is responsible for in vivo DNA binding is unknown. We developed a context-aware machine-learning-based model of DNA recognition and combined it with molecular dynamics analyses to uncover new structural aspects of ZF-DNA interaction, including novel residues that contribute to sequence specificity. By combining this model with in vivo binding data, we identified the sequence preference and the ZF subset that is responsible for DNA binding in ∼30% of all human multi-ZF proteins, showing that in vivo DNA binding is primarily driven by ∼50% of the ZFs. Analysis of genetic variation within and across species showed that DNA-binding ZFs are under strong selective pressure, and a pan-cancer analysis across 18 tissues revealed hundreds of genes whose expression is affected by somatic coding mutations in DNA-binding ZFs. Together, these results suggest that the regulatory consequences of mutations in ZFs depend on their in vivo DNA-binding functionality, which in turn is determined by a combination of context and ZF-intrinsic features.