Abstract
Unraveling the sequence determinants that drive protein-RNA interactions is crucial for studying binding mechanisms and the impact of genomic variants. While CLIP-seq allows for transcriptome-wide profiling of in vivo protein-RNA interactions, it is limited to expressed transcripts, so missing binding information must be imputed computationally. Existing classification-based methods predict binding at low resolution and depend on prior labeling of transcriptome regions for training. We present RBPNet, a novel deep learning method that predicts the distribution of CLIP crosslink counts from RNA sequence at single-nucleotide resolution. By training on up to a million regions, RBPNet generalizes well across eCLIP, iCLIP and miCLIP assays, outperforming state-of-the-art classifiers. CLIP-seq suffers from various technical biases, which complicate downstream interpretation. RBPNet performs bias correction by modeling the raw signal as a mixture of a protein-specific signal and a background signal. Through model interrogation via Integrated Gradients, RBPNet identifies predictive sub-sequences corresponding to known binding motifs and enables variant-impact scoring via in silico mutagenesis. Taken together, RBPNet improves both the inference of protein-RNA interactions and the mechanistic interpretation of predictions.
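As an illustration of the in silico mutagenesis idea mentioned above, the following is a minimal sketch and not the RBPNet implementation. It assumes a generic scoring function that maps an RNA sequence to a single number; `toy_score` is a hypothetical stand-in for a trained model, and defining the variant effect as the difference between the mutant score and the reference score is an assumption made here for simplicity.

```python
from typing import Callable, Dict, Tuple

NUCLEOTIDES = "ACGU"


def in_silico_mutagenesis(
    seq: str, score_fn: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Score every single-nucleotide substitution of `seq`.

    Returns a dict mapping (position, alternative base) to
    score_fn(mutant) - score_fn(reference).
    """
    ref_score = score_fn(seq)
    effects: Dict[Tuple[int, str], float] = {}
    for i, ref_base in enumerate(seq):
        for alt in NUCLEOTIDES:
            if alt == ref_base:
                continue  # skip the reference base itself
            mutant = seq[:i] + alt + seq[i + 1:]
            effects[(i, alt)] = score_fn(mutant) - ref_score
    return effects


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: counts a GCAUG motif."""
    return float(seq.count("GCAUG"))


# Example: the motif occupies positions 3-7 (0-based) of this sequence.
seq = "AAUGCAUGAA"
effects = in_silico_mutagenesis(seq, toy_score)
```

In this toy setting, substitutions inside the motif lower the score, while substitutions in the flanking positions leave it unchanged. With a real model, `score_fn` would wrap the model's prediction for the sequence, reduced to a scalar in whatever way suits the analysis.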
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
wagnern{at}in.tum.de, lambert.moyon{at}helmholtz-muenchen.de, klara.kuret{at}crick.ac.uk, nicolas.goedert{at}helmholtz-muenchen.de, marco.salvatore{at}bio.ku.dk, jernej.ule{at}crick.ac.uk, gagneur{at}in.tum.de
Revised abstract and title page layout.