Large-Scale Uniform Analysis of Cancer Whole Genomes in Multiple Computing Environments
Abstract
The Pan-Cancer Analysis of Whole Genomes (PCAWG) project of the International Cancer Genome Consortium (ICGC) aimed to categorize somatic and germline variations in both coding and non-coding regions in over 2,800 cancer patients. To provide this dataset to the research working groups for downstream analysis, the PCAWG Technical Working Group marshalled ~800 TB of sequencing data from geographically distributed locations; developed portable software for uniform alignment, variant calling, artifact filtering and variant merging; performed the analysis in a geographically and technologically disparate collection of compute environments; and disseminated high-quality validated consensus variants to the working groups. The PCAWG dataset has been mirrored to multiple repositories and can be located using the ICGC Data Portal. The PCAWG workflows are also available as Docker images through Dockstore, enabling researchers to replicate our analysis on their own data.