TY  - JOUR
T1  - Systematic analysis of dark and camouflaged genes: disease-relevant genes hiding in plain sight
JF  - bioRxiv
DO  - 10.1101/514497
SP  - 514497
AU  - Mark T. W. Ebbert
AU  - Tanner D. Jensen
AU  - Karen Jansen-West
AU  - Jonathon P. Sens
AU  - Joseph S. Reddy
AU  - Perry G. Ridge
AU  - John S. K. Kauwe
AU  - Veronique Belzil
AU  - Luc Pregent
AU  - Minerva M. Carrasquillo
AU  - Dirk Keene
AU  - Eric Larson
AU  - Paul Crane
AU  - Yan W. Asmann
AU  - Nilufer Ertekin-Taner
AU  - Steven G. Younkin
AU  - Owen A. Ross
AU  - Rosa Rademakers
AU  - Leonard Petrucelli
AU  - John D. Fryer
Y1  - 2019/01/01
UR  - http://biorxiv.org/content/early/2019/01/09/514497.abstract
N2  - Background: The human genome contains ‘dark’ gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are ‘dark by depth’ (few mappable reads) and others that are ‘camouflaged’ (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer’s Disease Sequencing Project (ADSP; 13142 samples), as a proof of principle. Results: Based on standard whole-genome Illumina sequencing data, we identified 37873 dark regions in 5857 gene bodies (3635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2% respectively. Applying our algorithm to the ADSP, we rescued 4622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in only five ADSP cases and zero controls. Conclusions: While we could not formally assess the CR1 frameshift mutation in Alzheimer’s disease (insufficient sample-size), we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
ER  - 