Abstract
Copy number variants (CNVs) have been shown to contribute to the etiology of several genetic disorders. Accurate detection of CNVs on whole exome sequencing (WES) data has been a long-sought-after goal for clinical use. Despite recent improvements in performance, this has not been possible because existing algorithms mostly suffer from low precision and even lower recall on expert-curated gold standard call sets. Here, we present ÉCOLE, a deep learning-based somatic and germline CNV caller for WES data. Based on a variant of the transformer architecture, the model learns to call CNVs per exon, using high-confidence calls made on matched whole genome sequencing (WGS) samples. We then fine-tune the model on a small set of expert calls via transfer learning. We show that ÉCOLE is able to mimic expert labeling with 68.7% precision and 49.6% recall. This corresponds to precision and recall improvements of 18.7% and 30.8%, respectively, over the next best performing methods. ÉCOLE is the first method to achieve high precision and recall in mimicking human expert CNV calling. We also show that the same fine-tuning strategy, applied to tumor samples, enables ÉCOLE to detect RT-qPCR-validated variations in bladder cancer samples without the need for a control sample. We believe these features of ÉCOLE make CNV calling on WES data feasible for clinical use. ÉCOLE is available at https://github.com/ciceklab/ECOLE.
Competing Interest Statement
The authors have declared no competing interest.