Abstract
Cancer is a global health issue that places enormous demands on healthcare systems. Basic research, the development of targeted treatments, and the utility of DNA sequencing in clinical settings have all improved significantly with the introduction of whole-genome sequencing. However, the broad applications of this technology come with complications. To date there has been very little standardisation in how data quality is assessed, leading to inconsistencies in analyses and disparate conclusions. Manual checking and complex consensus-calling strategies often do not scale to large sample numbers, creating procedural bottlenecks. To address this issue, we present a quality control method that integrates somatic point mutations, allele-specific copy numbers, and tumour purity into a single quantitative score. We demonstrate its power via simulations, on n = 2778 whole-genomes from PCAWG, on n = 10 multi-region whole-genomes of two colorectal cancers and on n = 48 whole-exomes from TCGA. Our approach significantly improves the generation of cancer mutation data, providing visualisations for cross-referencing with other analyses. The method is fully automated, designed to be compatible with any bioinformatic pipeline, and can automate tool parameterisation, paving the way for fast computational assessment of data quality in the era of whole-genome sequencing.
Introduction
Cancer remains an unsolved problem, and a key factor is that tumours develop as heterogeneous cellular populations (Greaves and Maley 2012; McGranahan and Swanton 2017, 2015). Cancer genomes can harbour multiple types of mutations compared to healthy cells (Macintyre et al. 2018; Martincorena et al. 2018, 2015; Nik-Zainal et al. 2012), and many of these events contribute to the pathogenesis of the disease, and therapeutic resistance. A popular design of studies intending to understand tumour development involves collecting tumour and matched-normal biopsies, and generating so-called “bulk” DNA sequencing data to identify both germline and tumour somatic mutations (Barnell et al. 2019). Using bioinformatic tools to cross reference the normal genome against a paired aberrant one, the mutations and heterogeneity thereof found in the tumour sample can be derived and used in other analyses. These analyses include, but are not limited to, driver mutation identification (Bailey et al. 2018; Gonzalez-Perez et al. 2013), which aims to discern the key aberrations that cause a tumour to grow, patient clustering, which aims to identify treatment groups with similar biological characteristics, and evolutionary inference (Ding et al. 2012; Landau et al. 2013; Caravagna et al. 2016; Jamal-Hanjani et al. 2017; Turajlic et al. 2018; Caravagna et al. 2018; Roth et al. 2014; Miller et al. 2014; Cross et al. 2018; Gerstung et al. 2020; Deshwar et al. 2015; Strino et al. 2013), which unravels how a particular tumour developed from normal cells.
There are several types of mutations that we can retrieve from DNA sequencing (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium 2020). Broadly these can be categorized as single nucleotide variants (SNVs), copy number alterations (CNAs) and other more complex changes such as structural variants (Li et al. 2020; Zack et al. 2013). All types of mutations can drive tumour progression, and are therefore important entities to study (Kent and Green 2017; Levine, Jenkins, and Copeland 2019). Luckily, the steady drop in sequencing costs is fueling the creation of large datasets for researchers to access through public databases. Notably, we are entering the era of high-resolution whole-genome sequencing (WGS), a technology that can read out the majority of a tumour genome, providing significant improvements over whole-exome or targeted counterparts. Generating some of these data, however, poses challenges. While SNVs are the simplest type of mutations to detect using bioinformatic analysis and perhaps have the most well established supporting tools (Li et al. 2020), CNAs are particularly difficult to call since the baseline ploidy of the tumour (i.e., the number of chromosome copies) is usually unknown and has to be inferred (Van Loo et al. 2010; Favero et al. 2015; Boeva et al. 2011; Poell et al. 2019; Cun et al. 2018; Fischer et al. 2014). CNAs are important types of cancer mutations; large-scale gain and loss of chromosome arms or sections of arms can confer tumour cells with large-scale phenotypic changes, and are often important clinical targets (Gerstung et al. 2020; Watkins et al. 2020).
SNVs and CNAs are intertwined mutation groups. They can overlap within a tumour cell’s genome, meaning the number of copies of an SNV can be amplified or indeed reduced by CNAs. This depends on the ploidy of the genome regions overlapping with the variants. For instance, for a clonal - meaning present in every cell of the tumour sample - heterozygous SNV in a diploid tumour genome the expected variant allele frequency (VAF) is 50% (i.e., half of the reads from tumour cells will harbour the SNV). Alternatively, if each chromosome is present in three copies (triploid), the expected VAF is 33%, for SNVs occurring after amplification (or on the non-amplified chromosome), or 66%, for SNVs on the amplified chromosome. The theoretical frequencies are observed with a Binomial noise model that depends on sequencing depth and VAF (Nik-Zainal et al. 2012; Caravagna, Heide, et al. 2020; Roth et al. 2014; Miller et al. 2014; Strino et al. 2013; Tarabichi et al. 2021; Yuan et al. 2018). We note that these VAFs hold for pure bulk tumour samples (100% tumour cells). Realistically, most bulk samples contain normal cells, the percentage of which shifts these theoretical frequencies towards lower values. These ideas are leveraged by methods that seek to compute the Cancer Cell Fractions (CCFs) of the tumour, i.e., a normalisation of the observed tumour VAF for the CNA, the number of copies of a mutation (mutation multiplicity) and tumour purity (Dentro, Wedge, and Van Loo 2017).
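The bookkeeping above can be made concrete with a minimal Python sketch of the standard purity- and ploidy-adjusted VAF expectation (the function and variable names are ours, for illustration only, not part of any tool):

```python
def expected_vaf(m, n_a, n_b, purity):
    """Expected VAF of a clonal mutation present in m copies on a segment
    with n_a:n_b allele-specific copy number, given tumour purity (normal
    cells are assumed diploid). Hypothetical helper, not CNAqc's API."""
    tumour_reads = purity * (n_a + n_b)    # read fraction from tumour cells
    normal_reads = 2 * (1 - purity)        # read fraction from normal cells
    return m * purity / (tumour_reads + normal_reads)

# Pure diploid tumour: clonal heterozygous SNV peaks at 50% VAF
assert abs(expected_vaf(1, 1, 1, 1.0) - 0.50) < 1e-9
# Pure triploid (2:1): 1/3 for single-copy and 2/3 for double-copy mutations
assert abs(expected_vaf(1, 2, 1, 1.0) - 1/3) < 1e-9
assert abs(expected_vaf(2, 2, 1, 1.0) - 2/3) < 1e-9
# Normal contamination shifts peaks down: a 60% pure diploid sample -> 30%
assert abs(expected_vaf(1, 1, 1, 0.6) - 0.30) < 1e-9
```

The same expression underlies the CCF normalisation discussed next: once the expected clonal peak is known, an observed VAF can be rescaled for copy number, multiplicity and purity.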
Many bioinformatics pipelines are designed to start from a BAM formatted input file and, following variant calling, extract the VAF of mutations while calling CNAs (Boeva et al. 2011; Cmero et al. 2020; Zaccaria and Raphael 2020; Van Loo et al. 2010; Fischer et al. 2014; Carter et al. 2012). These analyses are nearly always decoupled, and can return inconsistent variant calls; i.e., CNAs and purity that mismatch the empirical VAF from the BAMs. Since CNAs and purity are inferred through various measurements that are subject to noise - i.e., tumour-normal depth ratios and B-allele frequencies are prime examples - they are the most likely cause of error. While in some cases these errors can be spotted and fixed by manual intervention, this process is also subject to inconsistencies in the absence of a proper statistical framework, and does not scale in studies seeking to generate very large datasets (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium 2020; Priestley et al. 2019; Turnbull et al. 2018). The intrinsic performance of a variant caller and sequencing noise therefore massively impacts CNA calling and purity inferences, propagating errors in downstream analysis that eventually lead to incorrect biological conclusions, becoming a crucial computational bottleneck in the era of high-resolution whole-genome sequencing.
To solve these problems we developed CNAqc, a computational framework with a de novo statistical model to assess the conformance of SNVs, CNAs, and purity estimates. We strived to make the tool as simple as possible, maximising compatibility across differing pipelines. CNAqc computes a quantitative quality control (QC) score for the overall agreement of the calls, which can be used to tune the parameters of callers (e.g., decrease or increase purity), or select among multiple profiles (e.g., tetraploid versus diploid tumours) until a good fit is achieved. In CNAqc we also integrate these measures to determine Cancer Cell Fractions (CCF) after phasing mutation multiplicity from VAFs (Dentro, Wedge, and Van Loo 2017).
CNAqc is implemented as a highly optimised R package which can be used between somatic calling and downstream analyses (Figure 1a). CNAqc has a small computational overhead compared to typical downstream analyses, e.g., subclonal deconvolution, which are much more complicated because they interpret the clonal and subclonal VAF spectrum (Gerstung et al. 2020; Nik-Zainal et al. 2012; Caravagna, Heide, et al. 2020; Roth et al. 2014; Miller et al. 2014; Jamal-Hanjani et al. 2017). The tool can process both WGS and WES data, and can automatically compute a QC score in a matter of seconds, making it extremely useful for large-scale genomics consortia or retrospective analyses of public datasets. To demonstrate the tool we analysed n = 2723 high-quality whole-genomes from the Pan Cancer Analysis of Whole Genomes (PCAWG) cohort (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium 2020), n = 10 high-resolution bulk whole-genome samples from two multi-region colorectal cancers, and n = 48 whole-exomes from The Cancer Genome Atlas (TCGA) cohort (Cancer Genome Atlas Research Network 2014).
Results
The CNAqc framework
CNAqc integrates clonal CNAs, tumour purity and somatic mutation calls obtained from bulk sequencing (Figure 1). The tool is intended to be used after variant calling, and before downstream analysis (Figure 1a), to compute a quality control score for allele-specific CNAs and purity based on mutation VAFs, determining a PASS or FAIL status for each segment type and the overall sample. CNAqc can also be used to select among alternative genome segmentations and purity/ploidy estimates available from a caller (e.g., a 100% pure diploid tumour versus a 50% pure tetraploid). The score also suggests corrections for tumour purity to fine-tune tools that use Bayesian priors or point parameters. Lastly, CNAqc can determine Cancer Cell Fractions (CCFs) for input mutations, together with PASS or FAIL status; mathematical details are available in the Online Methods.
In what follows, we will refer explicitly to SNVs as the main type of mutation used by CNAqc, but in principle other types of small variants, such as insertions or deletions, also apply. The method supports clonal heterozygous normal states (1:1 chromosome complement), loss of heterozygosity (LOH) in monosomy (1:0) and copy-neutral (2:0) form, and trisomy (2:1) or tetrasomy (2:2) gains. According to data (Figure 1b) available in n = 2778 PCAWG samples (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium 2020), the most common segments among human cancers are supported by CNAqc (>75% of ~600,000 total PCAWG segments, 93% of sequenced bases, much more prevalent than subclonal CNAs, Supplementary Figure S1). Therefore, barring rare exceptions, CNAqc can analyse any cancer sample.
Many output metrics are derived from the link between “copy states” (i.e., the copies of the major and minor alleles, which sum to the ploidy of a segment) and the allele frequencies found in mutation calls. Combinatorial equations inspired by ASCAT (Van Loo et al. 2010) are used to determine whether CNAs and purity are consistent with the observed VAFs (Online Methods). The key equation expresses the expected VAF of a clonal mutation (i.e., the clonal VAF peak), assuming the input CNA segment and tumour purity are correct (Figure 1c, 1d). We consider a tumour sample with purity π ∈ [0,1], and all CNA segments with allele-specific copy numbers nA and nB for the major and minor alleles. If we consider mutations with multiplicity m mapping to any of the segments with copy state nA : nB (e.g., all 1:1, diploid heterozygous segments), we expect a VAF peak (Figure 1e) at

vm = mπ / [2(1 − π) + π(nA + nB)]
The notation vm makes explicit that the expected VAF, which is observed with Binomial noise, depends on the multiplicity of the mutation (Dentro, Wedge, and Van Loo 2017). For copy states 2:0 (nA = 2, nB = 0), 2:1 and 2:2, m phases mutations acquired before or after the copy number event (Figure 1d). CNAqc supports simple copy states (1:0, 1:1, 2:0, 2:1, 2:2) and restricts m ∈ {1, 2}, assuming the CNAs are acquired directly from the diploid heterozygous normal state (1:1).
For each copy state, VAF peaks are detected via fast heuristics based on kernel density estimation and maximum-likelihood Binomial mixtures (Supplementary Figure S2). An optimal peak is then selected and matched to vm; the VAF distance between the two is converted into units of purity (Online Methods). The CNAqc sample score (e.g., +3%, −7%) is a linear combination of these distances, representing an error that approaches 0 for perfect calls and reflecting corrections to the input π. The cutoff to determine PASS or FAIL is the error ∊ > 0 we can tolerate in the purity estimate: e.g., for heterozygous diploid mutations with ∊ = 0.025 (2.5% maximum error) and real purity 60%, CNAqc will PASS a tumour purity estimate in [55%; 65%], corresponding to the VAF range [27.5%; 32.5%]. To normalise this error against aneuploidy and contamination, ∊ is adjusted for copy state, multiplicity and tumour purity (Online Methods, Figure 1f).
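The conversion between purity and VAF units can be reproduced by inverting the expected-VAF formula, as in this illustrative sketch (`purity_from_vaf` is our own hypothetical helper, not CNAqc's API):

```python
def expected_vaf(m, n_a, n_b, purity):
    """Expected clonal VAF peak given multiplicity, copy state and purity."""
    return m * purity / (2 * (1 - purity) + purity * (n_a + n_b))

def purity_from_vaf(v, m, n_a, n_b):
    """Invert the expected-VAF formula for purity:
    v = m*pi / (2(1-pi) + pi*(nA+nB))  =>  pi = 2v / (m - v*(nA+nB-2))."""
    return 2 * v / (m - v * (n_a + n_b - 2))

# Heterozygous diploid (1:1, m = 1) tumour with real purity 60%:
peak = expected_vaf(1, 1, 1, 0.60)        # clonal peak expected at VAF 30%
lo, hi = peak - 0.025, peak + 0.025       # tolerated VAF window for eps = 0.025
# The VAF window [27.5%, 32.5%] maps back to purity estimates in [55%, 65%]
assert abs(purity_from_vaf(lo, 1, 1, 1) - 0.55) < 1e-9
assert abs(purity_from_vaf(hi, 1, 1, 1) - 0.65) < 1e-9
```

For the diploid heterozygous case the mapping is simply π = 2v, which is why a 2.5% VAF window corresponds to a 5% purity window; for other copy states and multiplicities the adjustment differs, which is the reason ∊ is rescaled per copy state.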
CNAqc can also determine and score CCFs, which are used for downstream subclonal deconvolution (Van Loo et al. 2010; Nik-Zainal et al. 2012). Assuming input CNAs and purity π are validated by peak detection, CNAqc normalises the VAF v of a mutation that sits on segment nA : nB with the formula

c = v [π(nA + nB) + 2(1 − π)] / (mπ)
This equation applies to clonal and subclonal mutations, and the main difficulty in obtaining the correct value for c is determining if the mutation is in single or double copy (multiplicity m = 1 or m = 2); we term this phasing m from the VAF spectrum.
CNAqc uses a heuristic based on a two-component Binomial mixture to compute multiplicities by clustering. The default method identifies a VAF range at the crossing of the mixture components, where m cannot be unequivocally phased. The phasing uncertainty is estimated from the entropy H(z) of the mixture latent variables z. For every copy state, depending on the maximum proportion of unassigned mutations that we decide to tolerate (e.g., 10% of total), a CCFs PASS or FAIL status is determined. An alternative method is available, which can force a value to m through a hard split on the VAF, regardless of entropy values (Online Methods).
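A minimal sketch of this idea, assuming the two mixture components are Binomials centred at the expected clonal peaks for m = 1 and m = 2, with the entropy of the posterior assignment used to flag unphaseable mutations (illustrative Python; the 0.3 nats cutoff is an arbitrary choice for the example, not CNAqc's default):

```python
from math import comb, log

def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def phase_multiplicity(n_alt, depth, v1, v2, entropy_cutoff=0.3):
    """Assign multiplicity m = 1 or m = 2 to a mutation with n_alt variant
    reads out of depth, using two Binomial components centred at the expected
    clonal peaks v1 (m = 1) and v2 (m = 2). Mutations whose latent assignment
    has high entropy are left unphased (returned as None)."""
    l1 = binom_pmf(n_alt, depth, v1)
    l2 = binom_pmf(n_alt, depth, v2)
    z1 = l1 / (l1 + l2)       # posterior responsibility of the m = 1 component
    z2 = 1 - z1
    h = -sum(z * log(z) for z in (z1, z2) if z > 0)  # entropy in nats
    if h > entropy_cutoff:
        return None           # uncertain: m cannot be unequivocally phased
    return 1 if z1 > z2 else 2

# 2:2 copy state in a pure tumour: peaks at 25% (m = 1) and 50% (m = 2)
assert phase_multiplicity(25, 100, 0.25, 0.50) == 1
assert phase_multiplicity(50, 100, 0.25, 0.50) == 2
assert phase_multiplicity(37, 100, 0.25, 0.50) is None  # between the peaks
```

Mutations falling in the high-entropy region between the peaks are exactly those the text describes as unassignable; their proportion drives the CCF PASS or FAIL status.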
CNAqc provides several functions to visualise segments and read count data (Figure 2), peak detection and CCFs (Figure 3), and utilities to smooth segments and detect patterns of over-fragmentation (Online Methods). This information can be used to augment and prioritise downstream analysis that seeks to determine patterns of chromothripsis, kataegis or chromoplexy from mutation and copy number data (Zack et al. 2013; Gerstung et al. 2020).
Simulations
We tested CNAqc on ~20,000 synthetic VAF distributions obtained for different values of coverage (30x, 60x, 90x, 120x) and purity (0.4, 0.6, 0.8, 0.95). For each dataset, we ran CNAqc with the input purity corrupted by a variable error factor ∊err, and scanned multiple levels of the tolerance ∊ used to match peaks.
We observed that the proportion of rejected samples approaches 100% when the purity error exceeds the tolerance (∊err > ∊), suggesting that the model in CNAqc works as expected, i.e., it detects errors as large as the tolerance. The simulations also showed that VAF quality impacts performance: low coverage or purity makes peak detection harder (Supplementary Figure S3).
For the same batch of tumours we computed CCFs to measure their uncertainty, i.e., the number of mutations that CNAqc cannot phase from VAFs. Low coverage and low purity generate overlapping VAF peaks, for which exact multiplicity phasing becomes unachievable. The performance gradient highlights the importance of data quality for assessing reliable CCFs (Supplementary Figure S4).
Large-scale pan cancer PCAWG calls
We have run CNAqc on the full PCAWG cohort, for which we gathered consensus calls from SNVs, allele-specific CNAs and purity (n = 2778 samples, 40 tumour types). Excluding samples with unsuitable data, we ran n = 2723 cases on a single multi-core machine in <1 hour (Figure 4).
Median depth of sequencing and purity are 45x and ~65% (Caravagna, Heide, et al. 2020); the PCAWG resolution is therefore comparable to the mid and low range of parameters adopted in our simulations (Supplementary Figure S3). Peak detection passed 2425/2723 samples using a purity error tolerance of ∊ = 0.03, confirming that the PCAWG consensus calls are of high quality (Figure 4a). As in our simulations, the acceptance rate was determined by tumour purity and coverage (Figure 4b), with purity adjustments distributed around 0 for PASS samples and spreading to the left or right for FAIL cases (Figure 4c).
Manual inspection of selected samples revealed some interesting cases. For instance, tumours with low mutational burden but high-quality calls still yielded a useful report (Supplementary Figure S6). Tumours with 100% purity estimates at odds with VAF peaks suggest purity over-estimation (Supplementary Figure S7), while other cases genuinely possessed very high purity (>95%, Supplementary Figure S8).
CCFs were computed for the whole PCAWG cohort. Consistently with simulated data, the percentage of mutations for which CCF cannot be computed negatively correlated with sample purity (Figure 4d and Supplementary Figure S4). We found the CCFs produced by CNAqc (Supplementary Figure S9) are comparable to those computed by Ccube (Yuan et al. 2018) across the whole cohort, but also found cases where CNAqc helped to detect spurious subclonal clusters, which we could explain by miscalled mutation multiplicities (Supplementary Figure S10).
Summarising, while peaks could be determined for almost all PCAWG samples, mutation multiplicity assessment would have required higher coverage and purity. Our analyses reveal that every type of computation - peak detection or CCF - has different data quality requirements, and should therefore be quality controlled with specific methods like the ones available in CNAqc.
Multi-region colorectal cancer data
We have run CNAqc on previously published WGS multi-region data (Cross et al. 2018; Caravagna, Heide, et al. 2020), collected from multiple regions of primary colorectal adenocarcinomas (10 samples, 2 patients, median coverage ~80x, purity ~80%, Figure 5). We augmented somatic mutations called by Platypus (Cross et al. 2018) with allele-specific CNAs and purity from Sequenza (Favero et al. 2015), and used CNAqc to rank segments and purity obtained by multiple parameterisations of the tool, which also considered the alternative fitting solutions proposed during the fit.
Sequenza was first run with its default range proposals for purity and ploidy, which we then improved in a final run guided by CNAqc. From the default Sequenza runs, we collected the proposed alternative solution, which was tetraploid 2:2 with halved purity. We used these parameters to compute a de novo Sequenza fit with ploidy ranging from 3.8 to 4.2, together with a run constrained to low purity. Runs for sample Set7_57 (patient Set7) highlighted that both Sequenza (not shown) and CNAqc are strongly confident in the diploid solution with the correct purity (Figure 5a). The peak detection scores produced by CNAqc invariably fail both the tetraploid and low-purity solutions, passing the others; the small adjustment suggested for the default parameters slightly improves the purity estimate, but the overall quality is high even with default parameters (Figure 5b), and the final segments for Set7_57 show mild aneuploidy (Figure 5c).
This case is instructive of how CNAqc can be used to flag miscalled CNA segments from the VAF data, for both the tetraploid and low-purity solutions (Figure 5d, e). With CNAqc we obtained, in a completely automated manner, good mutations, copy numbers and purity for all samples in patient Set_7 (Figure 5f and Supplementary Figure S11), profiling a tumour consistent with a microsatellite-stable colorectal cancer (Cross et al. 2018). An equivalent result was also obtained for 6 WGS samples of patient Set_6 (Supplementary Figure S12).
Whole exome data
CNAqc is conceptualised and designed to exploit properties of the VAF distribution in high-resolution whole genomes. Lower-resolution whole-exomes can be analysed if the reduced mutational burden does not compromise VAF quality, peak detection or multiplicity estimation.
We tested CNAqc with WES data from n = 48 TCGA (Cancer Genome Atlas Research Network 2014) lung adenocarcinoma samples available in the LUAD cohort (Online Methods), selecting the lowest-purity and highest-purity cases to capture different levels of data quality; we could analyse these successfully in most cases (Supplementary Figure S13). Interestingly, and in line with the multi-region colorectal cohort (Figure 5), CNAqc could rank calls generated by multiple callers even with WES data. For instance, for sample TCGA-53-7624-01A (Supplementary Figure S14), the TCGA consensus purity estimate (CPE) obtained by running ESTIMATE (Yoshihara et al. 2013), IHC, LUMP (Aran, Sirota, and Butte 2016) and ABSOLUTE (Carter et al. 2012) is ~80%. CNAqc showed that the CPE consensus is likely wrong, and that the correct purity was estimated only by ABSOLUTE (69%).
Discussion
WGS is a powerful approach to detect extensive mutations that drive human cancers. Many large-scale initiatives such as PCAWG (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium 2020), the Hartwig Medical Foundation (Priestley et al. 2019) and Genomics England (Turnbull et al. 2018) have already generated WGS data for thousands of cancer patients, with many cancer institutes converging towards these efforts. Calling mutations from WGS data requires complex bioinformatics pipelines (Barnell et al. 2019; Cmero et al. 2020; Li et al. 2020) and any downstream analysis relies upon these calls, putting the quality of the generated data under the spotlight.
CNAqc leverages statistical properties of VAF distributions from WGS, offering the first principled framework to quality control tumour mutation calls, allele-specific clonal copy number segments and tumour purity. The tool can analyse SNVs and more general types of nucleotide substitutions; SNVs are more reliable and depend less on alignment quality, and should be checked first. CNAqc uses a peak detection analysis to validate segments and purity, exploiting a combinatorial model for somatic alleles applied to the most frequent CNAs found across human cancers. Within the same framework, CNAqc also computes CCF values, highlighting mutations whose multiplicity cannot be phased and is therefore uncertain. This can help to interpret subclonal clusters found by downstream deconvolution tools. We have also shown that CNAqc can process both whole-genome and whole-exome data, across data from different callers. CNAqc features can be used to clean up data, automating parameter choice for any caller, prioritising good calls and selecting information for downstream analyses.
The CNAqc framework leverages the relationship between tumour VAF and ploidy. The quality of the control process itself depends on the ability to process the VAF spectrum and detect peaks. Therefore, if the VAF quality is very low because, for example, the sample has low purity or coverage, the overall quality of the check decreases, making it more difficult to completely automate quality checking. However, for the large majority of samples, CNAqc provides a very effective and efficient way to integrate quality metrics in standard pipelines. We note that CNAqc is much faster than quality control by using standard deconvolution tools (Supplementary Figure S15).
Generating high-quality calls is a prerequisite for more complex analyses that interpret cancer genotypes and their history, with and without therapy (Ding et al. 2012; Landau et al. 2013; Caravagna et al. 2016; Jamal-Hanjani et al. 2017; Turajlic et al. 2018; Caravagna et al. 2018; Roth et al. 2014; Miller et al. 2014; Cross et al. 2018; Gerstung et al. 2020; Deshwar et al. 2015; Strino et al. 2013). CNAqc can pass a sample at an early stage, leaving the possibility of assessing, at a later stage, whether the quality of the data is high enough to approach specific research questions. With the ongoing implementation of large-scale WGS sequencing efforts, and the great amount of WES data already available, CNAqc provides a good solution for modular pipelines that self-tune parameters based on quality scores. To our knowledge, this is the first stand-alone tool that leverages the power of combining the most common types of cancer mutations - SNVs and CNAs - to automatically control the quality of cancer sequencing assays. We believe CNAqc can help reduce the burden of manual quality checking and parameter tuning. In the future, the tool could be extended to consider other types of CNAs (e.g., extrachromosomal DNA, ecDNA, or subclonal CNAs). ecDNA fragments usually span small genomic regions (Zeng, Wan, and Wu 2020; Verhaak, Bafna, and Mischel 2019) and involve copy states that are not yet supported by CNAqc. Nevertheless, their specific role in amplifying oncogenes and driving tumour evolution and drug resistance (Wu et al. 2019; Kim et al. 2020; Turner et al. 2017) is becoming increasingly important. Adjusting for subclonal CNAs could improve the QC, especially at the local level and for tumours characterised by strong karyotypic heterogeneity (Ha et al. 2014).
Data Availability
Multi-region colorectal cancer data is deposited in EGA under accession number EGAS00001003066. PCAWG calls are publicly available at the ICGC Data Portal (https://dcc.icgc.org/). TCGA calls are publicly available at the GDC Data Portal (https://portal.gdc.cancer.gov).
Software Availability
CNAqc is implemented as an open source R package that is hosted at https://caravagnalab.github.io/CNAqc/.
The tool webpage contains RMarkdown vignettes showing how to run analyses, visualise inputs and outputs, and parametrise the tool. All analyses presented in this paper can be replicated following those vignettes; the multi-region colorectal cancer data needed to replicate our analysis is hosted in the GitHub repository.
Author contributions
All authors conceived the method, which GC formalised and implemented. RB and SM carried out CNAqc simulations and comparisons against other methods. All authors analysed the data and wrote the manuscript.
Competing interests
The authors declare no competing interests.
Online methods
CNAqc supports the most frequent allele-specific clonal copy number profiles found in human cancers (Supplementary Figure S1):
heterozygous diploid states (1:1);
loss of heterozygosity (LOH) in monosomy (1:0) and copy-neutral (2:0) states;
triploid (AAB or 2:1) or tetraploid (AABB or 2:2) states.
Data supports this design choice. In the PCAWG cohort, 36% of allele-specific CNA segments are 1:1, 15% are 2:1, 11% are 1:0, 8% are 2:2 and 8% are 2:0. These are >75% of the whole set of calls (>600,000 segments) and span 93% of all CNA-covered bases in the whole PCAWG cohort. In the same cohort, clonal copy numbers are much more frequent, and span significantly larger portions of the tumour genome, than the subclonal counterparts obtained by the Battenberg caller: in ~95% of PCAWG samples, >50% of the genome is covered by clonal CNAs. Limiting CNAqc to simple clonal CNAs comes with the advantage that mutation multiplicities are easier to manage, at least from a computational perspective.
CNAqc supports two human reference genome assemblies (GRCh38 and hg19) and makes the simplifying assumption that CNAs are acquired in a single step from a heterozygous germline diploid state. For this reason, for tetraploid segments we only consider the copy state 2:2, and not 3:1 or 4:0.
CNAqc is conceptualised to work with high-resolution - i.e., high purity and coverage - whole-genome sequencing (WGS) data, but can also be applied to whole-exome (WES) data. The main challenge with WES or low-coverage/low-purity WGS data is the reduced mutational burden and the noise in the VAF, which decrease the signal strength. The key determinant for detecting VAF peaks is the number of mutations per copy state: thousands of mutations from a high-quality WGS assay are certainly better than hundreds from WES or from low-quality WGS. For tumours that are genomically unstable, exposed to exogenous mutagens such as smoking or UV light, or with a high mutation rate, as in microsatellite-unstable cancers, the observable mutational burden in exomes might be suitable for CNAqc analysis. The main general disadvantage of low-quality data is that the automation available via CNAqc might be less reliable, reporting false positives or negatives. In those cases we suggest manually inspecting the proposed scores to optimise the tool, checking the consistency of the calls against intuition.
Expected clonal VAF peaks given CNAs and purity
A bulk sample is a mixture of tumour and normal cells present in proportions π > 0 and (1 − π), respectively. We derive a simple equation describing our belief about the position of the clonal VAF peak in the data, assuming the input clonal segments and purity are correct. This equation is segment-specific, and links all segments with the same allele-specific copy number profile. In this manuscript, we denote by nA : nB allele-specific segments with nA copies of the major allele and nB copies of the minor allele. For instance, with 1:0 we denote nA = 1, nB = 0; with 1:1 we denote nA = nB = 1, etc.
We introduce the multiplicity m ≥ 1 (the number of copies) of a clonal mutation mapping on top of the considered segments. As in ASCAT (Van Loo et al. 2010), the expected proportion of reads that can be attributed to a mutation with multiplicity m is mπ. The difference between ASCAT and CNAqc is that the former considers germline single-nucleotide polymorphisms, while the latter considers somatic mutations (i.e., the germline is removed); besides this difference, the conceptualisation is similar. For segments nA : nB, the proportion of all reads coming from the tumour is π(nA + nB). Here we term nA + nB the ploidy of the nA : nB segments, remarking that this is not the overall tumour ploidy. Similarly, assuming a healthy diploid normal, the proportion of reads that come from normal cells is 2(1 − π). When we consider clonal mutations with multiplicity m sitting on nA : nB segments, we expect them to peak in the VAF distribution at value

vm = mπ / [2(1 − π) + π(nA + nB)]     (1)
This formula describes our belief about the position of the clonal VAF peak in the data, assuming the input segments (determined by segments with nA and nB alleles) and purity π is correct.
The formula is intuitive and gives the expected results. Consider nA = nB = 1 and π = 1, i.e., heterozygous diploid segments in a pure tumour: since clonal mutations have m = 1, the clonal VAF should be ~50%, and indeed v1 = 0.5. Instead, for tetraploid segments obtained after whole-genome duplication, where nA = nB = 2 and π = 1, under the simplifying assumption of CNAqc clonal mutations can be present in single (m = 1) or double (m = 2) copy. Evaluating equation (1), we obtain v1 = 0.25 for mutations in single copy (25% VAF, post-aneuploidy), and v2 = 0.5 for mutations in double copy (50% VAF, pre-aneuploidy).
Transforming VAFs to CCFs
There are several methods to compute CCFs from VAFs, allele-specific copy numbers and purity. The equation used by CNAqc is inspired by seminal works (Van Loo et al. 2010; Dentro, Wedge, and Van Loo 2017), and converts the observed VAF v > 0 of a mutation with multiplicity m into the CCF c as

c = v [π(nA + nB) + 2(1 − π)] / (mπ)     (2)
Note that all the parameters π, nA, nB are as in eq. (1). Given input VAFs, CNAs and purity, the only quantity to be estimated to compute c is m, the multiplicity.
CCFs are proportional to VAFs, as we expect. Consider a heterozygous clonal diploid mutation (nA = nB = m = π = 1, so the segment ploidy is p = 2). Its expected VAF is 50% and c = 1, correctly reporting that 100% of cells carry the mutation, which is clonal. The same formula works for subclonal mutations. As another example, if a single-copy clonal mutation (m = 1) sits on an amplified triploid segment (nA = 2, nB = 1) in a pure tumour (π = 1) and has a VAF of 33% - 1 out of 3 copies - we have c = 1. The other type of clonal mutation observable on such segments has a VAF of 66%, with 2 out of 3 copies, and its CCF is again c = 1 using equation (2).
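The worked examples above can be checked numerically with a direct implementation of the CCF conversion (illustrative Python, not the package's own R code):

```python
def ccf(vaf, m, n_a, n_b, purity):
    """Cancer Cell Fraction from an observed VAF, given multiplicity m,
    the n_a:n_b copy state and tumour purity. Hypothetical helper."""
    return vaf * (purity * (n_a + n_b) + 2 * (1 - purity)) / (m * purity)

# Clonal heterozygous diploid mutation in a pure tumour: VAF 50% -> CCF 1
assert abs(ccf(0.50, 1, 1, 1, 1.0) - 1.0) < 1e-9
# Pure triploid 2:1: both clonal peaks (1/3 with m = 1, 2/3 with m = 2) give CCF 1
assert abs(ccf(1/3, 1, 2, 1, 1.0) - 1.0) < 1e-9
assert abs(ccf(2/3, 2, 2, 1, 1.0) - 1.0) < 1e-9
# A diploid mutation at half the clonal VAF is subclonal, with CCF 0.5
assert abs(ccf(0.25, 1, 1, 1, 1.0) - 0.5) < 1e-9
```

As the last line shows, the same normalisation applies unchanged to subclonal mutations, whose CCF falls below 1.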
Peak detection quality control for allele-specific CNAs and purity
Data peaks can be used to quality control (QC) both tumour purity and CNA segments, following the intuition of equation (1). The procedure is summarised in pseudocode in Supplementary Figure S2, and works by partitioning mutations by the copy states of the segments (after mutation mapping on CNAs), and analysing them independently to determine a PASS or FAIL status per mutation multiplicity. A sample-level PASS or FAIL score is then computed by aggregating all statuses in a majority-based system, where each copy state weights proportionally to the number of mutations it harbours (i.e., the evidence from the data).
In CNAqc there are therefore three levels of PASS or FAIL status: i) for each VAF peak in a given copy state, ii) overall for a copy state and iii) for the whole sample. This allows subsetting of calls according to a fine-grained set of metrics, for instance passing only some variants even if the overall sample fails.
The peak detection strategy takes as input ∊ > 0, an upper bound on the error that we can tolerate on purity. For example, if ∊ = 0.05, we accept a 5% error on the purity: if the true purity were 60% and the caller reported a value in the range [55%, 65%], CNAqc would pass the sample. The range associated with ∊ is adjusted to account for ploidy and mutation multiplicity, providing a conversion between error measures in VAF and purity units. The formula that we introduce is presented below.
Peak matching strategies
For every copy state, CNAqc matches either 1 or 2 peaks, depending on the ploidy of the involved segments and multiplicity: one peak is matched for 1:0 and 1:1, two for all others. Here we discuss the strategy to detect a generic peak, assuming to work with copy state nA: nB as in equation (1).
The tool implements methods (described below) to detect n peaks d1,…,dn in the VAF distribution, and match them against vm, the expected clonal VAF for a peak with multiplicity m. The matching of vm determines the PASS or FAIL status for the associated multiplicity. To compute the match we select one d* among d1,…,dn that is close enough to vm; the choice of d* can be done in two ways:
by closest hit match, where d* is the data peak closest to vm, i.e., d* = arg mini |di – vm|, where i ranges over 1,…,n;
by rightmost hit match, where d* is taken from the subset of peaks D = {di > vm | i = 1,…,n} lying on the right of the expected peak, and is the farthest of these, d* = arg maxi∊D |di – vm| (the rightmost peak).
The CNAqc default strategy is the first, which selects d* as the peak closest to vm. The second strategy is more stringent, but can help identify miscalled segments. Consider for instance a diploid 1:1 copy state: if some mutations in the pool of putative diploid mutations should instead have been associated with LOH segments (i.e., the segments were miscalled), an extra VAF peak is expected on the right of the clonal cluster. The rightmost hit match strategy will associate the theoretical peak vm with the LOH peak, flagging the diploid segment as FAIL because the LOH peak lies far from the clonal peak.
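The two strategies amount to a one-line selection each. A sketch (our names; we also assume a fallback to closest hit when no peak lies to the right, which the text does not specify):

```python
def closest_hit(peaks, v_m):
    # data peak minimising |d_i - v_m|
    return min(peaks, key=lambda d: abs(d - v_m))

def rightmost_hit(peaks, v_m):
    # farthest peak among those strictly to the right of v_m;
    # assumed fallback to closest hit when none lies on the right
    right = [d for d in peaks if d > v_m]
    return max(right) if right else closest_hit(peaks, v_m)

# Expected clonal diploid peak at 0.5, with a spurious LOH peak at 0.55:
peaks = [0.48, 0.55]
print(closest_hit(peaks, 0.5))    # 0.48
print(rightmost_hit(peaks, 0.5))  # 0.55, far from 0.5: flags a FAIL
```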
When we match the peaks, the desired purity error ∊ gets rescaled depending on the copy state nA : nB, following this general equation

∊m = 2m∊ / (2(1 – π) + π(nA + nB))²    (3)
This means that, in order to match a purity error ∊, we create a range of acceptance based on ∊m, as we discuss below. In this equation we interpret the VAF as a function of tumour purity and, assuming ∊ to be small, we truncate the Taylor expansion of vm(π + ∊) at the first order.
Notice that the error on the VAF depends, in general, on purity and multiplicity. Consider, for instance, a segment with copy state 2:1 in a tumour with purity 90%: ∊ = 0.05 (5%) corresponds to an error in the VAF of approximately 1% and 2% for the VAF peaks with multiplicity m = 1 and m = 2, respectively. Inverting equation (3), one can express the purity as a function of the VAF, ploidy and multiplicity, and derive the error propagation formula from the VAF space to the purity space used in CNAqc. Using the same approach as in equation (5), we treat the purity as a function of the VAF, shift the VAF by a small error ∊m and truncate the Taylor expansion at the first order to get

∊ ≈ 2m∊m / (m – v(nA + nB – 2))²
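As a numerical check of the worked 2:1 example above, a sketch of the first-order propagation of a purity error into VAF units (our reading of the rescaling, not the CNAqc implementation):

```python
def vaf_error(m, purity, nA, nB, eps):
    """A purity error eps propagated to VAF units at first order:
    eps_m = 2 * m * eps / (2 * (1 - purity) + purity * (nA + nB))**2."""
    return 2 * m * eps / (2 * (1 - purity) + purity * (nA + nB)) ** 2

# 2:1 copy state at 90% purity, tolerating a 5% purity error:
print(round(vaf_error(1, 0.9, 2, 1, eps=0.05), 3))  # 0.012 (~1% in VAF, m = 1)
print(round(vaf_error(2, 0.9, 2, 1, eps=0.05), 3))  # 0.024 (~2% in VAF, m = 2)
```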
Peaks are matched by including a VAF tolerance ∊VAF > 0, which helps ameliorate the fact that we do not explicitly model noise affecting peak detection. An interval is created with centre at d*, the matched peak in the VAF, and size 2∊VAF, and tested for overlap with the interval Im (the acceptance range centred at the expected peak vm, with size determined by ∊m). If it overlaps with Im, the clonal peak for multiplicity m is matched by d* and receives a PASS status; otherwise it receives a FAIL status.
The PASS or FAIL status per copy state with two possible multiplicity values is defined by taking the status of the peak associated with the largest number of mutations nm. The value of nm is determined by binning the VAF distribution with 100 bins (from 0 to 1, with size 0.01), and counting the number of mutations that associate to the bin of the matched peak. In this way, we PASS a copy state if the tallest of its peaks is a PASS, and is associated with more mutations than any FAIL peak.
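The binning used to weight peaks can be sketched as follows (hypothetical helper; fixed 100 bins of size 0.01 as in the text):

```python
def peak_support(vafs, d_star, n_bins=100):
    """Number of mutations in the VAF bin (width 1/n_bins) that
    contains the matched peak d*."""
    target = min(int(d_star * n_bins), n_bins - 1)
    return sum(1 for v in vafs if min(int(v * n_bins), n_bins - 1) == target)

# Three mutations share the bin of a peak matched at VAF ~0.49:
print(peak_support([0.493, 0.495, 0.498, 0.31], d_star=0.493))  # 3
```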
The sample-level QC status is based on an error metric that uses the actual distance between the centres of the intervals, d* and vm, which is given by d* – vm.
CNAqc sample-level error metric
An error metric is assembled across copy states to determine a sample-level PASS or FAIL status. Let wk be the normalised number of mutations mapped to copy state k, further rescaled by 2 if the copy state is expected to have two peaks. For every copy state and every mutation multiplicity, we have a PASS or FAIL from peak detection.
We split PASS (Pk) from FAIL (Fk) peaks, and define two scores per copy state as linear combinations of the signed offsets d* – vm of the matched peaks, where the subscript k denotes the copy state (e.g., 1:0). We then define the overall sample score λ by combining these over K, the set of copy states 1:0, 1:1, 2:0, 2:1 and 2:2 supported by CNAqc. Equation (9) is a linear combination whose terms can be either positive or negative, depending on whether the matched peaks lie to the right or left of the expected peaks. The sample score λ is a weighted mean, since by construction the wk sum to one.
The overall status of the sample is determined by comparing the aggregated PASS and FAIL scores, and selecting the status corresponding to the larger of the two.
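A toy condensation of the sample-level aggregation, reduced to a mutation-weighted majority vote (our simplification; the actual CNAqc score also uses the signed peak offsets described above):

```python
def sample_qc(copy_states):
    """Mutation-weighted vote. copy_states lists, for each analysed
    copy state, a (number of mutations, 'PASS' or 'FAIL') pair; each
    state weights proportionally to the mutations it harbours."""
    total = sum(n for n, _ in copy_states)
    passed = sum(n for n, status in copy_states if status == 'PASS')
    return 'PASS' if passed >= total - passed else 'FAIL'

# Diploid calls pass with strong support, a small triploid pool fails:
print(sample_qc([(5000, 'PASS'), (1200, 'FAIL'), (300, 'PASS')]))  # PASS
```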
Computing peaks from VAF
CNAqc implements a joint strategy to detect n peaks d1,…,dn in the VAF distribution:
Kernel-based: a smoothed VAF profile is obtained via kernel density estimation with default adjustment 1 and fixed bandwidth. Peaks are then estimated from the discretised smoothed density, using specialised R packages for peak detection and removing peaks with density below 1/20 (an empirical cut-off) of the maximum peak.
Mixture-based: via Binomial mixture from the BMix (Caravagna, Heide, et al. 2020) package (https://caravagn.github.io/BMix/), a peak is associated with each Binomial probability, for all mixture components.
The latter strategy is inspired by subclonal deconvolution methods, and computes the model density for w clusters (default w < 5), with model selection to optimise w using the Integrated Classification Likelihood score (Caravagna, Heide, et al. 2020). The likelihood is the Binomial mixture

p(rx | nx) = Σi πi Bin(rx | nx, pi)    (10)

where πi are the mixing proportions of the mixture, not to be confused with the sample purity. Here we use a Binomial likelihood for rx successes, the number of reads carrying the mutant allele at mutation x; nx is the total number of trials, given by the sequencing depth at the locus; and pi is the Binomial success probability. Assuming that calls have passed the quality metrics of CNAqc, pi is defined as the expected theoretical VAF from equation (1), so it is known. A key advantage of BMix over other deconvolution tools is its fast maximum-likelihood implementation, with full access to the model parameters (e.g., latent variables).
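The kernel-based strategy can be sketched with a plain Gaussian KDE and a local-maxima scan (a simplification; CNAqc relies on R packages for density estimation and peak detection):

```python
from statistics import NormalDist

def kde_peaks(vafs, bandwidth=0.01, grid=200, min_frac=1/20):
    """Gaussian KDE over [0, 1] plus a local-maxima scan; peaks with
    density below min_frac of the tallest peak are discarded."""
    def density(x):
        return sum(NormalDist(v, bandwidth).pdf(x) for v in vafs) / len(vafs)
    xs = [i / grid for i in range(grid + 1)]
    ys = [density(x) for x in xs]
    candidates = [xs[i] for i in range(1, grid)
                  if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]]
    tallest = max(density(c) for c in candidates)
    return [c for c in candidates if density(c) >= min_frac * tallest]

# Two clonal peaks, e.g. a pure tetraploid tumour (expected at 0.25 and 0.5).
vafs = [0.24, 0.25, 0.25, 0.25, 0.26, 0.49, 0.5, 0.5, 0.5, 0.51]
print(kde_peaks(vafs))  # [0.25, 0.5]
```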
CCF estimation
Many tools for downstream subclonal deconvolution compute CCFs to normalise mutations, CNAs and purity, and then cluster mutations. Some popular tools - e.g., PyClone (Roth et al. 2014) - focus on cluster-level rather than per-mutation CCFs. For this reason, not all deconvolution tools offer the same information accessible from CNAqc, and Bayesian deconvolution algorithms such as PyClone or DPclust are computationally much more demanding than CNAqc (Nik-Zainal et al. 2012; Roth et al. 2014).
CNAqc offers a way to estimate CCFs and a PASS or FAIL status which can be used to assess the quality of the estimates.
CCF computation
CNAqc offers two distinct approaches to compute CCFs:
- Entropy-based computation: a Binomial mixture like that in equation (10) is peaked at the VAF values vm from equation (1), and input mutations are phased to their multiplicity only if the mixture’s latent variables are well separated.
- Rough computation: a Binomial mixture is used and mutations are phased regardless of the latent variables of the mixture.
The entropy-based method can fail to compute the multiplicity of a mutation, returning CCF values marked NA; this is how uncertainty is reported in CNAqc. The rough method, by design, always assigns a multiplicity m ∈ {1,2}.
The final PASS or FAIL status of a copy state is determined from the proportion of mutations with available CCF. Therefore, while the rough computation will always PASS a copy state, this is not the case for the entropy-based method. By default, if more than 10% of the mutations per copy state have no available CCF, a FAIL is raised; the percentage parameter can be set to arbitrary values.
We first detail the entropy-based approach. We describe the case of copy states 2:0, 2:1 and 2:2, the others being trivial. To initialise a mixture analogous to equation (10):
1. We build two Binomial densities from the theoretical expectations of the VAF peaks, i.e., v1 and v2, depending on the copy state, as defined in equation (1). For a pure (π = 1) tumour and a 2:1 copy state, for instance, this creates one Binomial with parameter p = 0.33 and one with p = 0.66.
2. We fix the number of Binomial trials in equation (10) to the median coverage of the considered mutations, and compute the 1% and 99% quantiles of the data distributions to obtain a VAF range around each peak.
3. Finally, we count the mutations that, according to their VAF, map to one or the other computed range. The numbers of mutations n1 and n2, associated to multiplicities m = 1 and m = 2, are then used to obtain the normalised mixing proportions π1 and π2 that complete the model in equation (10).
Densities are computed at steps 1 and 2, and mixing proportions at step 3; with these parameters we can compute the mixture likelihood. As with mixtures in general, we introduce latent variables z as a matrix of mutations by clusters, where zn,i is the probability of assigning the read-count data of mutation n to component i ∈ {1,2}. With these latents, every row of z is a categorical random variable reporting the probability of assigning m = 1 or m = 2 to a mutation, for which we can define the entropy in the standard way.
The entropy is maximal if zn,1 = zn,2, when the mutation is equally likely to be in single and double copy. It is minimal if zn,1 = 1 and zn,2 = 0, or vice versa. If the entropy is high, the mutation is difficult to phase to single- or double-copy status. The shape of the entropy profile resembles - by construction - a growing curve with a central spike, which we use to create a simple criterion to discriminate high from low entropy. The geometric intuition of this criterion is extremely simple: at the crossing of the Binomial densities peaked at v1 and v2 the entropy is high, and we cannot confidently phase mutations to multiplicities. The amount of Binomial overlap depends on coverage and purity - this is the technical reason why CCFs are more uncertain for low-resolution data.
CNAqc inspects the entropy profile to determine the peaks {h1, h2} around the spike, using the same peak detection tools used for quality control. Every mutation in the range [h1, h2] cannot be unequivocally assigned a multiplicity value, and is therefore left undetermined by the entropy-based method.
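The entropy criterion can be illustrated for a single mutation (a sketch under the stated Binomial model; function names are ours):

```python
from math import exp, lgamma, log

def binom_pmf(r, n, p):
    # Binomial pmf via log-gamma, to stay stable at high coverage
    logc = lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
    return exp(logc + r * log(p) + (n - r) * log(1 - p))

def phasing_entropy(r, n, v1, v2, pi1=0.5):
    """Entropy of the latent assignment of one mutation (r mutant reads
    out of n) to Binomial components peaked at v1 (m = 1) and v2 (m = 2)."""
    w1 = pi1 * binom_pmf(r, n, v1)
    w2 = (1 - pi1) * binom_pmf(r, n, v2)
    z1, z2 = w1 / (w1 + w2), w2 / (w1 + w2)
    return -sum(z * log(z) for z in (z1, z2) if z > 0)

# Pure tumour, 2:1 copy state: expected peaks at 1/3 (m = 1) and 2/3 (m = 2).
print(phasing_entropy(33, 100, 1/3, 2/3))  # ~0: confidently single copy
print(phasing_entropy(50, 100, 1/3, 2/3))  # ~log(2): cannot be phased
```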
The rough approach determines the midpoint o = v1 + (v2 – v1)π1 between the two expected theoretical VAF peaks v1 and v2, given the mixing proportion π1 of the first mixture component. The midpoint thus weights each of the two peaks proportionally to the number of mutations underneath it, computed as in the entropy-based method. The midpoint defines a hard cut: VAF values x < o are phased to a single copy, values above to two copies. This procedure requires data of good overall quality, because it assumes that all mutations can be phased correctly by a hard VAF split, an assumption that depends largely on coverage and purity.
When multiplicities have been determined, CCFs are computed with equation (2).
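The rough cut can be sketched as follows (hypothetical helper name):

```python
def rough_multiplicity(vaf, v1, v2, pi1):
    """Hard cut at the weighted midpoint o = v1 + (v2 - v1) * pi1:
    VAFs below o are phased to m = 1, the rest to m = 2."""
    o = v1 + (v2 - v1) * pi1
    return 1 if vaf < o else 2

# 2:1 state in a pure tumour (peaks at 1/3 and 2/3), with 60% of the
# mutations under the first peak: the cut sits at o ~0.53.
print(rough_multiplicity(0.35, 1/3, 2/3, pi1=0.6))  # 1
print(rough_multiplicity(0.60, 1/3, 2/3, pi1=0.6))  # 2
```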
Genome fragmentation
Some recently identified patterns of somatic CNAs can be attributed to the presence of highly fragmented tumour genomes, termed chromothripsis and chromoplexy, or localised hypermutation patterns, termed kataegis (Cortés-Ciriano et al. 2020).
While these can be identified using dedicated tools, CNAqc offers a simple statistical test to detect potential over-fragmentation of a chromosome arm, a prerequisite that could point to the presence of such patterns. The CNAqc analysis does not substitute dedicated tools, but provides preliminary information to determine which parts of the genome should be analysed with ad hoc methods.
The test works at the level of each chromosome arm (1p, 1q, 2p, 2q, etc.), and uses the length of each input CNA segment to assign a “long segment” or “short segment” status. This is determined by a cut parameter μ that is set, by default, to 20% (i.e., μ = 0.2). Recent evidence from large pan-cancer studies can be used to calibrate this parameter to cancer-specific values (Zack et al. 2013).
Then, a p-value is computed under the null hypothesis via a Binomial test with k trials, given by the total number of segments in the arm, and the observed number of short segments s. The Binomial distribution for H0 has success probability μ, and the p-value is the probability of observing at least s short segments; we therefore define a one-tailed test for whether the observations are biased towards short segments. The p-value is adjusted for family-wise error rate by Bonferroni correction, dividing the desired α-value by the number of tests.
This test is applied to the subset of chromosome arms with a minimum number of segments, and whose ploidy “jumps” by a minimum amount (empirical default values estimated from trial data). The arm-level jump is computed as the sum of the differences between the ploidies of consecutive DNA segments. These covariates are similar to those used to infer CNA signatures from single-cell low-pass WGS (Macintyre et al. 2018).
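The arm-level test can be sketched as follows (our helper; k trials, s observed short segments, and the Bonferroni division of α described above):

```python
from math import comb

def arm_overfragmentation(k, s, mu=0.2, alpha=0.01, n_tests=1):
    """One-tailed Binomial test: p-value of observing at least s short
    segments among k, under H0: short segments arise with probability mu.
    Bonferroni correction compares against alpha / n_tests."""
    pvalue = sum(comb(k, j) * mu**j * (1 - mu)**(k - j)
                 for j in range(s, k + 1))
    return pvalue, pvalue < alpha / n_tests

# An arm with 20 segments, 12 of them short, testing 39 arms in total:
pvalue, overfragmented = arm_overfragmentation(20, 12, n_tests=39)
print(pvalue, overfragmented)  # ~1e-4, flagged even after correction
```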
Other features
CNAqc contains multiple functions to subset the data (i.e., select mutations that map only to certain copy states, subset CNAs with a total ploidy, etc.), visualise the data (i.e., plot mutational burden by tumour genome) or smooth the input CNA segments.
Smoothing is an operation that can be carried out before testing for over-fragmentation. In CNAqc, by smoothing we merge two contiguous segments if they have exactly the same allele-specific profile (i.e. same numbers for the major and minor alleles), and if they are a maximum distance apart (e.g. 1 megabase by default). This operation does not affect the ploidy profile of the calls, but reduces the amount of breakpoints that would inflate the p-value of the Binomial over-fragmentation test.
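Smoothing reduces to a single pass over sorted segments; a sketch with hypothetical (start, end, nA, nB) tuples on one chromosome:

```python
def smooth_segments(segments, max_gap=1_000_000):
    """Merge consecutive segments with identical nA:nB profiles that
    are at most max_gap bases apart (1 megabase by default)."""
    merged = [segments[0]]
    for seg in segments[1:]:
        start, end, nA, nB = merged[-1]
        if (nA, nB) == seg[2:] and seg[0] - end <= max_gap:
            merged[-1] = (start, seg[1], nA, nB)  # extend the last segment
        else:
            merged.append(seg)
    return merged

segs = [(1, 5_000_000, 1, 1), (5_200_000, 9_000_000, 1, 1),
        (9_100_000, 12_000_000, 2, 1)]
print(smooth_segments(segs))
# [(1, 9000000, 1, 1), (9100000, 12000000, 2, 1)]
```

Note that only the first two segments merge: the third is close enough, but its allele-specific profile differs.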
Peak detection simulations
We tested CNAqc on a synthetic dataset of ~20,000 tumours, generated to mimic data observed in real patient tumours.
We first simulated synthetic VAFs from clonal CNA segments, with breakpoints generated following Poisson distributions (6 segments per chromosome on average) and copy states drawn from a Dirichlet with concentrations 1 for 1:0, 1 for 2:0, 6 for 1:1, 2 for 2:1 and 1 for 2:2. Then we simulated Poisson-distributed coverage with median depth 30x, 60x, 90x or 120x, and set purity to 0.4, 0.6, 0.8 or 0.95. The idea of this test was to simulate a tumour with purity π and run CNAqc with an input purity containing a positive or negative error ∊err, i.e., we input a purity of π + ∊err. Then, for different values of the input tolerance ∊, i.e., the maximum purity error we want to tolerate, we ran the tool with default peak-matching parameters and performed quality control. Ideally, when the input error is lower than the tolerance, ∊err < ∊, CNAqc should pass the sample.
We performed the quality check applying an error on the purity varying in the range [0, 0.2] with intervals of length 0.02, and setting a tolerance on the purity error ranging in [0.01, 0.05] with intervals of length 0.004. We tested CNAqc on 100 simulated tumours for every combination of the considered parameters. We consistently observed that, as the purity error ∊err exceeds the tolerance ∊, the proportion of failures approaches 100% (Supplementary Figure S4). For instance, setting a tolerance parameter of 2%, we can accept a purity error of 5% at most; over this threshold the proportion of FAIL samples increases, reaching its maximum at ~7%. One can check this behaviour for the samples with purity 0.95 and coverage 90x: for a tolerance of ~2%, the proportion of rejected samples is close to 0% when the purity error is smaller than 5%, increases to 70-75% for a purity error of ~5-6%, and reaches 100% for a purity error of ~10%. From this test we also observed that the ability of CNAqc to detect samples with incorrect purity improves consistently as coverage increases, an effect more evident for samples with high purity.
For the same tumours we also computed CCFs and the proportion of mutations for which CNAqc could not phase the multiplicity (only for copy states 2:0, 2:1 and 2:2, since 1:0 and 1:1 have a single multiplicity). We plot the percentage of unassignable mutations as a function of purity in Supplementary Figure S5. The proportion decreases as coverage and purity increase, meaning that the computation of reliable CCFs depends largely on data quality. The observed trend was expected, since at low coverage and purity the clonal clusters overlap, which makes it harder to phase multiplicities from VAFs.
Comparison to deconvolution methods
Some of the functioning of CNAqc is inspired by the design of subclonal deconvolution methods (Roth et al. 2014; Nik-Zainal et al. 2012; Dentro, Wedge, and Van Loo 2017; Jamal-Hanjani et al. 2017; Gerstung et al. 2020; Jiang et al. 2016; Caravagna, Heide, et al. 2020; Caravagna, Sanguinetti, et al. 2020). Therefore, we sought to compare the CCFs computed by CNAqc with those obtained by Ccube (default parameters), a CCF-computation method developed by the PCAWG Evolution and Heterogeneity Working Group (Yuan et al. 2018).
In Supplementary Figure S9 (panel a) we show the correlation between the CCF values computed by Ccube and CNAqc (entropy method) in PCAWG. In the plot we annotate the proportion of cases, split by copy state and mutation multiplicity, where the estimates differ after rounding to the second digit. The tools report the same CCF for ~99% of the analysed mutations, whenever CNAqc identifies a reliable CCF value. We remark that a distinctive feature of CNAqc is reporting the percentage of mutations whose CCF cannot be unequivocally determined. In the above statistics, the CCF values are therefore compared only for mutations without uncertainty in CNAqc. The information regarding uncertainty is nonetheless very helpful to integrate CNAqc with other tools for CCF computation, as we show with two examples from our test.
In Supplementary Figure S9 (panels b-g) we report an example PCAWG case where the CCFs are in perfect agreement (1 out of 307 mutations in 2:2 segments with a different CCF). In Supplementary Figure S10, instead, we show a case where CNAqc detects uncertainty in 14% of input triploid mutations, informing of potential challenges in using CCFs for those mutations. In that case the uncertainty is explained by the intermixing of the two clonal peaks in triploid 2:1 segments. Ccube assigns multiplicity 2 to a group of clonal SNVs at the right tail of the lower clonal peak. The consequent CCF distribution breaks the expected clonal peak around ~1, alluding to the presence of two close CCF clusters. This is due to Ccube assigning some single-copy mutations m = 2, and vice versa. The entropy-based method of CNAqc highlights 14% of 2:1 mutations as uncertain, including those mistaken by Ccube. In turn, CNAqc assigns a FAIL status to this copy state with default values (cutoff >10%). Notably, the CCF distribution returned by CNAqc, which uses 86% of total mutations once the 14% non-assignable are removed, is correctly peaked at ~1.
Errors in CCFs can affect downstream subclonal deconvolution, which in turn inflates evolutionary statistics (e.g., number of subclonal clusters, clonal complexity). In this example, miscalled multiplicities generate a spurious cluster in the CCF distribution fit by Ccube, which leads to subclonal cluster 2 (panel g, Supplementary Figure S10). Even after removal of 14% CCFs flagged as uncertain by CNAqc, Ccube still assigns the wrong mutation multiplicity to a significant number of variants and infers the spurious CCF cluster (panel h, Supplementary Figure S10). For this reason, reporting a FAIL status in CNAqc informs that multiplicity computation in this sample is highly confounded by intermixing of VAFs, cautioning the interpretation of downstream deconvolution analyses.
Whole-exome sequencing data
There is an obvious difference between the richness of information that is available in a whole-genome assay, compared to a whole-exome one. Similarly, there is a difference between samples with high purity and coverage for current standards (e.g., WGS >60x with 70% purity), and those with lower parameters.
We collected whole-exome data from n = 48 lung adenocarcinoma samples available in TCGA LUAD (Cancer Genome Atlas Research Network 2014), selecting the 24 with top and bottom purity values, as of the consensus purity estimated by TCGA (CPE score). We report example cases in Supplementary Figure S13, where PASS and FAIL values are obtained by using somatic SNVs, CPE purity estimates and default CNAqc parameters.
The case in panel (a), sample TCGA-53-7624-01A, is 84% pure and the inferred ploidy is correct, but purity is slightly overestimated. The case in panel (b) is 82% pure, with a similar error pattern. The case in panel (c) is PASS with 30% purity; in this case it is difficult to assess whether the small peak matched by CNAqc is a noise artifact. This is an example of a low-resolution VAF distribution. The case in panel (d) is an 83% pure tumour, with good calls. The case in panel (e) is 32% pure and passed because most of the tetraploid mutations seem genuine, but it contains a poorly-peaked VAF distribution in triploid states (2:1, 47% of the overall mutational burden). In this case CNAqc struggles to detect peaks from the VAF; this is another example of a low-resolution VAF distribution.
CNAqc can also be used to select among multiple purity estimates provided by different CNA callers, even with WES data. We focus on case (a) from Supplementary Figure S13. In TCGA, we obtain purity estimates from CPE, which is the consensus among ABSOLUTE, ESTIMATE, IHC and LUMP. We used CNAqc to assess the quality of the estimates for the LUAD sample TCGA-53-7624-01A. For this sample, ESTIMATE, IHC and LUMP agree and determine the value of CPE. We found that, according to CNAqc, only ABSOLUTE detected the true tumour purity (69%, Supplementary Figure S14). This shows that CNAqc can be used to select, among multiple purity estimates, the value that best integrates mutation and copy number data, even from WES assays, avoiding in principle the need for consensus calling.
From these tests we conclude that CNAqc can also be used on WES data like the data available in TCGA, possibly coupled with manual revision of critical cases.
Wall-time performance
The analysis of PCAWG showed that CNAqc is fast; in order to generalise that assessment and understand how performance scales with sample size, we compared the wall-clock time of CNAqc against common deconvolution tools.
We chose SciClone (Miller et al. 2014), Ccube (Yuan et al. 2018) and PyClone-VI (Gillis and Roth 2020) to represent a diverse set of popular deconvolution algorithms. To build the dataset we subsetted all the mutations in diploid regions from a melanoma sample of the PCAWG cohort (DO220877), leading to a total of 207,508 mutations; this is the PCAWG sample with the highest mutational burden in the cohort. From those 207,508 SNVs we then sampled N = {500, 1000, 2500, 5000, 10000, 25000, 50000} mutations; this process was repeated 10 times to obtain 10 replicates for each N. The CNAqc analysis for peak detection was run with default parameters. Default parameters were also used for SciClone (default one-dimensional deconvolution) and Ccube (but with numOfRepeat = 1); PyClone-VI was run with a Beta-Binomial likelihood, number of clusters from 1 to 10 and 30 repetitions (Supplementary Figure S15).
CNAqc turned out to be the fastest tool, processing up to 50,000 mutations in under one minute. The next fastest, the tools based on variational inference, were about an order of magnitude slower: Ccube and PyClone-VI range from 4 to 16 times slower than CNAqc across our tests (note the log scale on the plot y-axis), and the performance gap increased with larger N. Notably, SciClone took an average of two hours to process 50,000 mutations, which is 128 times slower than CNAqc, as suggested by a log-difference of 5. In all tests CNAqc, Ccube and PyClone-VI scaled approximately exponentially, while SciClone showed a jump from 25,000 to 50,000 mutations. All simulations were performed on a machine with 36 Intel(R) Xeon(R) Gold 6140 CPUs @ 2.30GHz and 220 GB of RAM (Ubuntu 20.04 LTS, Python 3.8.2 and R 4.1.0).
Main Figures
Supplementary Figures
Acknowledgments
The research leading to these results has received funding from AIRC under MFAG 2020 - ID. 24913 project – P.I. Caravagna Giulio. Some research was performed using the Cancer Research UK City of London Major Centre High performance computing facility (colcc.ac.uk) and was also funded by a Wellcome Trust grant (ID: 202778/Z/16/Z).