Abstract
Germline copy number variants (CNVs) affect a large portion of the human genome and have been implicated in many diseases. Although whole-genome sequencing can help identify CNVs, existing analytical methods suffer from limited sensitivity and specificity. Here we show that this is in large part due to the non-uniformity of read coverage, which persists even after intra-sample normalization and is exacerbated in regions of low mappability. To address this, we propose PopSV, an analytical method that uses multiple samples to control for technical variation and enables robust detection of CNVs. We show that PopSV detects up to 2.7 times more variants than previous methods, with an accuracy of about 90%. Applying PopSV to 640 normal and cancer whole-genome datasets, we demonstrate that CNVs affect on average 7.4 million DNA bases in each individual, a 23% increase over previous estimates. Notably, we find that regions of low mappability, which were often concealed in previous analyses, harbor approximately 10 times more CNVs than the rest of the genome, in contrast with somatic CNVs (sCNVs), which are nearly uniformly distributed. We also observe that CNVs are enriched near centromeres and telomeres, in segmental duplications, in specific types of satellite repeats and in some of the most recent families of transposable elements. Although CNVs are depleted in protein-coding genes, we identify 7206 genes with at least one exonic CNV, 324 of which harbor CNVs that would have been missed if low-mappability regions had been excluded. Similarly, 2253 trait- and disease-associated loci overlap at least one CNV. Our results provide the most comprehensive map of CNVs across the human genome to date and demonstrate the broad functional impact of this type of genetic variation, including in regions of low mappability.