Abstract
Cancer is the result of mutagenic processes that can be inferred from genome sequences by analysis of mutational signatures. Here we present SparseSignatures, a novel framework to extract mutational signatures from somatic point mutation data. Our approach incorporates DNA replication error as a background, enforces sparsity of non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to very large datasets. We apply SparseSignatures to whole-genome sequences of 2827 tumors from 20 cancer types and show by standard metrics that our set of signatures is substantially more robust than previously reported ones, having eliminated redundancy and overfitting. Known mutagens (e.g., UV light, benzo(a)pyrene, APOBEC dysregulation) exhibit single signatures and occur in the expected tissues; a dominant signature with uncertain etiology is present in liver cancers; and other cancers exhibit a mixture of signatures or are dominated by background and CpG methylation signatures. Apart from cancers that are mostly due to environmental mutagens, there is virtually no correlation between cancer types and signatures, highlighting the idea that any of several mutagenic pathways can be active in any solid tissue.
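To make the ingredients named in the abstract concrete, the following is a minimal sketch of a non-negative factorization with a fixed background signature and an L1 penalty on the remaining signatures. It is an illustration under stated assumptions, not the SparseSignatures implementation: the function name `sparse_nmf_with_background`, the multiplicative-update scheme, the squared-error loss, and the parameters `lam` and `n_iter` are all hypothetical choices made here, and the cross-validation step used to choose the number of signatures is omitted.

```python
import numpy as np


def sparse_nmf_with_background(M, background, n_signatures, lam=0.1,
                               n_iter=500, seed=0):
    """Toy factorization M ~ A @ B with B[0] held fixed as the background.

    M            : (n_samples, n_categories) non-negative mutation counts
    background   : (n_categories,) fixed background signature (assumed given)
    n_signatures : total number of signatures, including the background
    lam          : L1 penalty weight on the signature matrix (hypothetical)
    """
    rng = np.random.default_rng(seed)
    n_samples, n_categories = M.shape
    eps = 1e-12  # guards against division by zero

    # Random non-negative initialization of exposures A and signatures B.
    A = rng.random((n_samples, n_signatures)) + eps
    B = rng.random((n_signatures, n_categories)) + eps
    B[0] = background

    for _ in range(n_iter):
        # Multiplicative update for exposures (unpenalized least squares).
        A *= (M @ B.T) / (A @ B @ B.T + eps)
        # Multiplicative update for signatures; adding lam to the
        # denominator is the usual way to apply an L1 penalty here.
        B = B * (A.T @ M) / (A.T @ A @ B + lam + eps)
        # The background row is never learned: reset it after every step,
        # so the penalty only affects the non-background signatures.
        B[0] = background

    return A, B
```

In this sketch, row 0 of `B` stays equal to the supplied background throughout, while larger values of `lam` shrink the entries of the other signatures toward zero. A fuller treatment would select `n_signatures` and `lam` by holding out entries of `M` and comparing prediction error, as the abstract describes.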
Footnotes
* The first two authors should be regarded as joint first authors.