DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly

  1. Bauke Ylstra1
  1. Department of Pathology, VU University Medical Center, 1007 MB Amsterdam, The Netherlands;
  2. Department of Pathology, Haartman Institute and HUSLAB, FIN-00014 University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland;
  3. Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, California 94158, USA;
  4. Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, California 94158, USA;
  5. Department of Epidemiology and Biostatistics, VU University Medical Center, 1007 MB Amsterdam, The Netherlands;
  6. Department of Mathematics, VU University, 1181 HV Amsterdam, The Netherlands;
  7. Department of Neurology, VU University Medical Center, 1007 MB Amsterdam, The Netherlands;
  8. Department of Neurology, Academic Medical Centre, 1105 AZ Amsterdam, The Netherlands;
  9. Department of Pathology, Radboud University Medical Centre, 6500 HB Nijmegen, The Netherlands;
  10. Department of Laboratory Medicine, University of California San Francisco, San Francisco, California 94153, USA;
  11. Bluestone Center for Clinical Research, New York University College of Dentistry, New York, New York 10010-4086, USA

  Corresponding author: B.Ylstra@vumc.nl

  (12) These authors contributed equally to this work.

Abstract

Detection of DNA copy number aberrations by shallow whole-genome sequencing (WGS) faces many challenges, including incompleteness of and errors in the human reference genome, repetitive sequences, polymorphisms, variable sample quality, and biases in the sequencing procedures. Formalin-fixed paraffin-embedded (FFPE) archival material, the analysis of which is important for studies of cancer, presents particular analytical difficulties due to degradation of the DNA and frequent lack of matched reference samples. We present a robust, cost-effective WGS method for DNA copy number analysis that addresses these challenges more successfully than currently available procedures. In practice, very useful profiles can be obtained with ∼0.1× genome coverage. We improve on previous methods first by implementing a combined correction for sequence mappability and GC content, and second by applying this procedure to sequence data from the 1000 Genomes Project in order to develop a blacklist of problematic genome regions. A small subset of these blacklisted regions was previously identified by ENCODE, but the vast majority are novel, previously unappreciated problematic regions. Our procedures are implemented in a pipeline called QDNAseq. We have analyzed over 1000 samples, most of which were obtained from the fixed tissue archives of more than 25 institutions. We demonstrate that for most samples our sequencing and analysis procedures yield genome profiles with noise levels near the statistical limit imposed by read counting. The described procedures also provide better correction of artifacts introduced by low DNA quality than prior approaches and better copy number data than high-resolution microarrays at a substantially lower cost.
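
The ideas of a combined GC/mappability correction, a blacklist, and a read-counting noise limit can be illustrated with a minimal Python sketch. This is not the QDNAseq implementation: the function names, the stratum width `step`, and the choice of a plain per-stratum median (rather than any smoothing the pipeline may apply) are assumptions made for illustration only. The sketch assumes reads have already been counted in fixed-size genomic bins, with a GC fraction and a mappability fraction known for each bin.

```python
import math
from collections import defaultdict
from statistics import median


def correct_counts(counts, gc, mappability, step=0.1):
    """Hypothetical combined correction: divide each bin's read count by the
    median count of all bins in the same rounded (GC, mappability) stratum."""
    def stratum(i):
        # Bins with similar GC content AND similar mappability share a key.
        return (round(gc[i] / step), round(mappability[i] / step))

    groups = defaultdict(list)
    for i, c in enumerate(counts):
        groups[stratum(i)].append(c)
    medians = {key: median(vals) for key, vals in groups.items()}

    corrected = []
    for i, c in enumerate(counts):
        m = medians[stratum(i)]
        # A zero median leaves the ratio undefined, so mark the bin as NaN.
        corrected.append(c / m if m > 0 else float("nan"))
    return corrected


def apply_blacklist(values, blacklisted):
    """Drop bins flagged as problematic (True in `blacklisted`)."""
    return [v for v, bad in zip(values, blacklisted) if not bad]


def poisson_relative_sd(mean_reads_per_bin):
    """Relative noise floor from read counting alone, assuming Poisson counts:
    sd / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(mean_reads_per_bin)


# Toy data: two bins per stratum; the second stratum has lower raw counts
# only because of its GC content and mappability, not because of copy number.
counts = [100, 110, 50, 55]
gc = [0.4, 0.4, 0.6, 0.6]
mappability = [1.0, 1.0, 0.5, 0.5]

corrected = correct_counts(counts, gc, mappability)
kept = apply_blacklist(corrected, [False, False, False, True])
```

After correction, both strata are centered near 1, so the systematic difference in raw counts no longer resembles a copy number change. Under the Poisson assumption, `poisson_relative_sd` shows why shallow coverage can still be informative: the noise floor depends only on the mean number of reads per bin, so larger bins trade resolution for lower noise.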

Footnotes

  • [Supplemental material is available for this article.]

  • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.175141.114.

    Freely available online through the Genome Research Open Access option.

  • Received March 17, 2014.
  • Accepted September 15, 2014.

This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0.
