A new approach to the challenging problem of mutational signature attribution

Mutational signatures are characteristic patterns of mutations caused by endogenous mutational processes or by exogenous mutational exposures. Much research has focused on the problem of inferring mutational signatures as latent variables in somatic mutation data from multiple tumors. However, the problem of determining which signatures are present in a given sample and how many mutations each signature is responsible for has received negligible attention. In particular, there has been little systematic benchmarking of various approaches to this problem. This problem is referred to as “signature attribution” in a single sample. We show that this is a challenging problem, because there are often many combinations of signatures that can reconstruct the mutational spectrum of a given sample reasonably well. We benchmarked the accuracy of five approaches to signature attribution, including a new approach we call Presence Attribute Signature Activity (PASA), on large synthetic data sets. These data sets recapitulated the single-base, insertion-deletion, and doublet-base mutational signature repertoires of 9 cancer types. For single-base substitution mutations, PASA outperformed other approaches on all the cancer types combined. Interestingly, however, the ranking of approaches varied by cancer type. For doublet-base substitutions and small insertions and deletions, the ranking of approaches was more stable, with PASA outperforming other approaches in most, but not all of the nine cancer types. For all mutation types, the ranking of approaches varied by cancer type, and no approach achieved both high precision and recall. We believe these observations reflect the inherent challenges in signature attribution.

Much prior research has concentrated on discovering mutational signatures, which is not the main topic of this manuscript.One common approach to discovering signatures is to expose cultured cells to a suspected mutagen, and then sequence the exposed culture [14][15][16].In addition, machine learning methods can discover mutational signatures in large databases of somatic mutations from tumors [1,17].
This process is often referred to as signature extraction.The approach depends on the model that a mutational spectrum can be explained as a linear combination of mutations generated by mutational signatures (Figure 1).The number of mutations due to a particular signature is referred to as the signature's activity.Signature extraction discovers mutational signatures as latent variables that can parsimoniously explain sets of mutational spectra [18][19][20].Several benchmarking studies have systematically examined the accuracy of different approaches to signature extraction [20][21][22][23].To -date, experimental methods and in silico signature extraction together have identified > 100 reference mutational signatures [24].
In addition to the discovery of mutational signatures, another important task is to estimate the presence of existing mutational signatures and their activities in a mutational spectrum, a task that commonly called signature attribution.This is often a challenging task.Absent critical review of output, signature attribution software can generate results that are useless for understanding the underlying biology of mutagenesis and its consequences.For example, one study reported that nearly 50% of lung tumors in never smokers (mostly adenocarcinomas) have the SBS3 signature (Fig. 4 in reference [25]).SBS3 is the result of deficient homologousrecombination-based DNA damage repair, and the same study (contradictorily) reported that HRDetect [12] detected homologous recombination deficiency in only 16% of the tumors (Extended Data Fig. 8a in reference [25]).Furthermore, Alexandrov et al. [1] detected SBS3 in only 8% of lung adenocarcinomas.SBS3 is especially prone to this kind of error, which is also shown in Extended Data Fig. 3a in reference [26], in which high proportions of tumors of almost all cancer types have SBS3, an implausible result in light of the actual prevalences of homologous recombination deficiency across cancer types.A related issue is signature attribution software often includes small activities of signatures due to implausible exposures.
For example, one study reported the signature of UV exposure not only in cells from skin melanomas and in skin fibroblasts, but also in cells from every tissue, including kidney, liver and skeletal muscle (Fig. 3b in reference [27]).Furthermore, beyond these sorts of implausible results, challenges remain.We show below that it is often possible to reconstruct the mutational spectrum of a sample using dozens or more different combinations of signatures, all of which yield reasonably good reconstructions.
Despite the importance of mutational signature attribution, there has been little benchmarking of software for this task [21,28,29].In addition, previous studies of which we are aware studied only single-base substitution (SBS) mutational signatures and neglected doublet-base substitution (DBS) signatures and insertiondeletion (ID) signatures.
Here, we present PASA (Presence-based Attribution of Signature Activity), a new, statistically-grounded algorithm for mutational signature attribution for three types of mutational signatures -SBS, DBS, and ID -in a sample.We assessed the accuracy of PASA and four other prominent approaches: SigProfilerAssignment (SigPro, "signature assignment" is another term for signature attribution) [28], Signature Fit Multi-Step (FitMS) in SignatureToolsLib [17], MutationalPatterns (MP) [30] and Mutational Signature Attribution (MSA) [31] using synthetic SBS, DBS and ID data from 2,700 tumors representing 9 cancer types.

Preliminary definitions
The mutational spectrum of one sample (tumor) is a 1-column matrix,  ∈

Exhaustive enumeration of stomach cancer attributions and their accuracy
We selected 14 mutational signatures that were common in stomach adenocarcinoma (Stomach-AdenoCA in reference [1]).For each stomach adenocarcinoma in the Pan Cancer Analysis of Whole Genomes mutation data analyzed in [1], for the non-empty power set of these signatures, we used the function mSigTools::optimize_exposure_QP to select an attribution that minimizes the Euclidean distance to the given spectrum.We excluded attributions with a contribution of < 0.5 mutation from any signature as redundant with attributions lacking that signature altogether.Code and attributions with cosine similarity > 0.98 are available at https://github.com/Rozen-Lab/sig_attribution_paper_code/common_code/figure_2/.

PASA algorithm
The PASA algorithm for signature attribution takes as input a mutational spectrum, , to be reconstructed from a set of  possible signatures represented by a matrix , as above.Also as above, the algorithm returns a column matrix, , of signature activities that can reconstruct .For signature attribution in a given cancer,  normally consists of the set of signatures previously observed in that cancer type.
The PASA algorithm proceeds in two steps.

Step 1: Signature presence tests
In

Running other approaches to signature attribution
The code for running other approaches and raw output are available at https://github.com/Rozen-Lab/sig_attribution_paper_code/analysis/.Importantly, for every approach and every cancer type, we allowed attribution with only the set of signatures previously observed in that cancer type as reported in reference [1].Table S1 lists software versions and arguments.

Generation of synthetic mutation data
We used COSMIC [24] v3.2 reference mutational signatures and the signature activities estimated by Alexandrov et al. [1] to generate synthetic SBS, DBS and ID mutational spectra.Detailed methods are described in our previous publication [20].
Briefly, for each of the SBS, DBS, and ID mutation types, we generated 100 synthetic spectra for each of nine cancer types.To generate one synthetic spectrum of a particular cancer type, the code first selects the signatures present in the spectrum and the ground-truth activities of each signature as random draws from the distributions of these estimated from [1].Then, the numbers of each mutation type are selected from a negative binomial distribution that is centered on the overall number of mutations due to the signature times the proportion of mutations of that mutation type in the signature.Code to generate the synthetic mutational spectra and the spectra themselves are at https://github.com/Rozen-Lab/sig_attribution_paper_code/synthetic_data/.

Definition of evaluation measures
For a given synthetic spectrum with given ground truth activities, let  be the number of signatures which have activity > 0. Let  (true positive) be the number of signatures with activity > 0 that also have estimated activities > 0. Let  (false positive) be the number of signatures with 0 activity, but that have estimated activity > 0 The evaluation measures for attribution of signatures for a synthetic spectrum of a given cancer type are:

Hundreds of reasonable signature attribution solutions present a challenge
As noted in the Introduction, signature attribution appears to be inherently challenging, and we show here that one reason for this is that often there are many different reasonable attributions of a given original spectrum.We take as examples 49 stomach cancer spectra (from [1]) that, using 14 SBS signatures common in this type of tumor, have attributions providing cosine similarities to the original spectrum > 0.98.These spectra have a median of 257 distinct attributions with similarity above this similarity threshold (Figure 2A).For one stomach cancer, Figure 2B-F and Table S2 show 4 example attributions out of 189 attributions that exceed this similarity threshold.The relevant code and, for all 49 stomach cancers, the attributions that have cosine similarity > 0.98, are available at https://github.com/Rozen-Lab/sig_attribution_paper_code/common_code/figure_2/.
Perhaps one could simply take the attribution with the highest similarity to a given spectrum as the most likely true attribution for that spectrum.We explored this question by examining at all possible attributions for each of 95 synthetic stomachadenocarcinoma spectra.For 93 out of these 95 spectra, the ground-truth attribution was inferior to the best alternative attribution -inferior in the sense of having lower cosine similarity to the given spectrum (Table S3).Indeed, there was a median of 15 alternative attributions that were superior to the ground truth attribution.We emphasize that the signatures available for attribution were only those actually found in this type of cancer.
Alternative attributions superior to the ground-truth attribution can exist because, although the number of mutations due to a given signature was known for each synthetic spectrum, the distribution of counts due to that signature across different mutation types (e.g.ACA → AAA, …, CTG → CAG, …, TTT → TGT), was sampled randomly.This was designed to simulate a model in which a mutational process generates a certain number of mutations according to a fixed multinomial distribution across mutation types, but the count of mutations of each mutation type varies due to random sampling.
Of note, the rankings for the signature attribution approaches varied across cancer type (Figure 3E).Thus, while PASA's median Combined Score ranked 1st in 6 of the 9 cancer types, it ranked 5th for Liver-HCC and 2nd for Lung-AdenoCA and Ovary-AdenoCA.As another example, MSA ranked last in 5 cancer types but ranked 2nd in Eso-AdenoCA.
All approaches other than MSA had precision of 1 on the majority of spectra but had median recall in the range of 0.75 to 0.80.By contrast, MSA's median precision was < 0.74 while median recall was 1.We speculate that this reflects a tacit consensus that for most application of signature attribution precision is more useful than recall.Therefore, we also investigated whether MSA's precision could be improved by adjusting its parameters.Indeed, we were able to increase its precision and Combined Score, at the cost of decreased recall (Figure 3).Running MSA with these parameters is labelled MSA_opt in the figures.
We also observed that, across tumor types, most approaches had the lowest median Combined Scores for Skin-Melanoma, mainly due to very low recall (Figure 3, S1, Table S4).We originally hypothesized that presence of SBS7a might interfere with detection of SBS7b, since both are dominated by C → T mutations thought to be caused by exposure to ultraviolet radiation.In fact, however, SBS5 and SBS1 were the most common false negatives in Skin-Melanoma (Table S5).For all approaches other than PASA and MSA, SBS5 was a false negative in over half of the spectra in which it was present.Similarly, for all approaches other than PASA and SigPro, SBS1 was a false negative in over half of the spectra in which it was present.

Accuracy on DBS (doublet-base substitution) mutational spectra
On synthetic DBS spectra, PASA had the highest Combined Score, which was significantly higher than that of FitMS, which had next highest (median: 2.92 versus 2.88,  < 5.2 × 10 −6 , 2-sided Wilcoxon rank-sum test, Figure 4, Table S6).The Combined Scores of other approaches were much lower.Recall for DBS data tended to be better than for SBSs, with medians of 1 for all approaches other than SigPro, although the mean recall for the 3 approaches with the highest Combined Scores ranged from 0.87 to 0.92.As was the case for SBSs, MSA with default parameters had low precision, but unlike the case for SBS, attempts to adjust MSA's parameters were not able to improve its precision.While SigPro performed well on synthetic SBS data (Figure 3), on synthetic DBS data it had lower recall than the other approaches (Figures 4 and S2, Table S6).SigPro's interface offers no parameters that might be tuned to improve its recall on DBS spectra.
For DBS spectra, as was the case for SBS spectra, there was variation in the ranking of the approaches across cancer types.Based on median Combined Scores, PASA ranked 1st in 8 of the 9 cancer types (tying for 1st in Skin-Melanoma).As another example, FitMS ranked 2nd in 7 of the 9 cancer types, but ranked 1st in Eso-AdenoCA and tied for 1st in Skin-Melanoma.

Accuracy on ID (insertion and deletion) mutational spectra
On synthetic ID spectra, PASA had the largest Combined Score, which was significantly higher than that of FitMS, which had next highest (Figure 5, Table S7, median: 2.91 versus 2.84,  < 1.1 × 10 −10 , 2-sided Wilcoxon rank-sum test).The Combined Scores of the remaining tools were much lower.As was the case for SBS, MSA with default parameters had low precision.Attempts to adjust MSA's parameters resulted in a slight improvement in precision.
For ID spectra, as was the case for SBS and DBS spectra, there was variation in the ranking of the approaches across cancer types.PASA's median Combined Score ranked 1st in 8 of the 9 cancer types and ranked 2nd in ColoRect-AdenoCA.

CPU usage
We calculated the total CPU time used by the process and its children when running each approach to mutational signature attribution.On all three types of synthetic spectra (SBS, DBS, and ID), MSA required 4 orders of magnitude more CPU time than the least resource-intensive approaches, FitMS and MP.(Figure 6 and Table S8).This is mainly because the MSA algorithm creates simulations of the input data and then tests using each of 4 pre-specified thresholds (program parameter "weak_thresholds") to select the threshold for final signature attribution.For each proposed threshold, MSA evaluates results on at least 1,000 simulated spectra.After a threshold is selected, MSA calculates confidence intervals for signature attribution by bootstrapping for each input spectrum.All these factors contributed to the substantial CPU resources required by MSA.

DISCUSSION
We have presented the first (to our knowledge) systematic benchmarking of signature attribution on all three of SBS, DBS, and ID mutational signatures, and we have presented a new method for attribution that is based on finding an attribution that maximizes the likelihood of a target spectrum.We assessed the accuracy of this new method, PASA (Presence Attribute Signature Activity) and of other four approaches [17,28,30,31,33], on a total of 2,700 synthetic spectra encompassing SBS, DBS, and ID mutation types.
While some previous studies [21,28,29] benchmarked accuracy on the SBS mutational signatures, we are not aware of any that have benchmarked attribution accuracy for DBS or ID signatures.We also point out that two of these studies did not examine accuracy from the point of view of precision or recall, and instead used mean squared error [21] or a variation on scaled Manhattan distance [29], neither of which can distinguish false negatives from false positives.We propose that understanding the propensity of approaches to include false positive signatures or exclude false negatives is important for applications of signature attribution.These would include molecular cancer epidemiology, for which one might want to determine with certainty whether the signature of a particular mutagen is enriched in a particular group of cancers [5,[7][8][9].They would also include efforts to understand the mutational exposures or processes responsible for oncogenic mutations [8,11].In addition, the accuracy measures in [21] and [29] may have little power to distinguish the accuracy of different signature attribution methods.For example [21] stated that "[a]ll methods give almost identical results"; see also Figure 9 in [21].
We also demonstrated that attribution is a challenging task.First, we showed that, for a given spectrum, there are often over one hundred possible alternative attributions that yield reasonably good similarity to the spectrum (Figure 2, Table S2).
Second, we showed that, for a given synthetic spectrum, there can be many incorrect attributions that provide more similar reconstructions than the correct, ground-truth attribution (Table S3).
For SBS, DBS, and ID mutational signatures, across all cancer types together, the new algorithm presented here, PASA (Presence Attribute Signature Activity) was more accurate than the other four approaches [17,28,30,31,33] (Figures 3 to 5).
However, the ranking for different approaches varied by cancer type, and, for example, PASA was the most accurate approach in 6 of the 9 cancer types (Figure 3, Table S4).We speculate that this is a consequence of the inherent challenges of mutational signature attribution.Signature attribution remains an open area, and advances might depend partly on integrating data from all thee mutation types (SBS, DBS, ID) or on incorporating prior evidence on signature prevalence and activity in different cancer types.
to Figure 1.The task of mutational signature attribution is to find signatures that can reconstruct the mutational spectrum well.(A) Example of an SBS spectrum that can be reconstructed with a cosine similarity of 0.988 from four signatures.The bar at the left shows that mutational signature SBS4 contributed the most mutations to this spectrum.This signature is associated with tobacco smoking.(B) The DBS spectrum from the same tumor can be reconstructed with a cosine similarity of 0.989 from two signatures.The bar at the left shows that mutational signature DBS2, which is also associated with tobacco smoking, contributed the most mutations to this spectrum.(C) The ID spectrum from the same tumor can be reconstructed with a cosine similarity of 0.982 from three signatures.The bar at the left shows that mutational signature ID3 contributed the most mutations to this spectrum.Like the SBS signature SBS4, this ID signature is also associated with tobacco smoking.Spectra from Lung-AdenoCA::SP52667 [1].The x axes of all panels follow the conventions described at https://cancer.sanger.ac.uk/signatures/.S1).Abbreviations for cancer types are as in reference [1].
Precision = /( + ) Recall =  /  Scaled Manhattan distance = ∑ |  −   |/  , where the   are the ground truth activities of all signatures known to occur in the given cancer type, the   are the estimated activities, and  is the number of mutations in the sample Combined = (1 -Scaled Manhattan distance) + Precision + Recall

Figure .
Figure .Many mutational spectra have numerous distinct attributions with cosine similarity 0.98.Shown here is the example of stomach adenocarcinomas (Stomach-AdenoCA) from the Pan Cancer Analysis of Whole enomes mutation data analyzed in [1].(A) umbers of stomach adenocarcinoma spectra by number of distinct attributions with cosine similarity to the spectrum 0.98.(B) One example spectrum from those analyzed in panel A. The x axis of this and the following panels follow the conventions described at https://cancer.sanger.ac.uk/signatures/.(C-F) Several example attributions that have cosine similarity to the spectrum in panel B 0.98.Code for this figure and the attributions are available at https://github.com/Rozen-Lab/sig_attribution_paper_code/common_code/figure_2/

Figure . Figure .
Figure .Accuracy of signature attribution approaches on synthetic DBS spectra.Abbreviations are as in Figure3.

Figure .
Figure .Total CPU time for running approaches to signature attribution on synthetic (A) SBS, (B) DBS and (C) ID mutational spectra.The approaches are ordered by decreasing median Combined Score (Figures3 to 5).Abbreviations for attribution approaches are as in Figure3.
follows a chi-squared distribution with 1 degree of freedom from which a p-value can be determined.If  exceeds the significance level (default 0.05), the null hypothesis is not rejected and ℎ  ⃗⃗⃗ is removed from the signature set that will be used in Step 2. The output of Step 1 is the set of signatures,  = [  1 ⃗⃗⃗⃗  2 ⃗⃗⃗⃗ …   ⃗⃗⃗ ],  ≤ , that survived the signature presence test.Activities of  ≤  signatures from  that can plausibly reconstruct , // in the sense that the likelihood ratio test for reconstructing  from  // versus reconstructing  from  is not significant at significance level  ℎ max ,   = OptimizeActivity (, )   = OptimizeActivity (, ) // Recalculate the activity matrix for // reconstruction with the signatures in Step 1, PASA conducts a signature presence test for each signature ℎ  ⃗⃗⃗ ∈  to exclude signatures that are not statistically likely to be present in the tumor sample.This is a likelihood ratio test, in which the null hypothesis is that reconstruction of the spectrum without ℎ  ⃗⃗⃗ (a reduced model) is as good as the reconstruction with all input signatures (full model).The test first separately searches for activities that optimize the likelihood P( | ), where  is either [ 1 …  −1  +1 …   ]  (reduced model) or [ 1 …   ]  (full model).The search is implemented by the R nloptr package [32] (https://cran.r-project.org/package=nloptr,see function OptimizeExposure in the mSigAct package).The likelihood ratio test then depends on the test statistic  = −2(log   − log   ), where log   and log   are the log likelihoods of the reduced and full model respectively.Algorithm: STEP 2 of PASA Forward search for a minimal set of mutational signatures // We will be moving signatures from  to  in a greedy search // to accumulate signatures in  for reconstructing  FOR  in 1 to  − 1  = [ ] FOR  in 1 to ()  = [; [ , ]] // We will test a matrix of signatures, , // consisting of  and the th signature of  ℎ ,  = OptimizeActivity (,  ) [] = (ℎ max , ℎ,  =  − ) // Perform a likelihood ratio test // comparing reconstruction of  with  // versus reconstruction with  = [; [ , ]] ENDFOR  = ℎℎ() // Find the index of the signature that // improves the reconstruction the most  = [; [ , ]] // Add the signature with the largest p value to   = [ , −] // Remove the signature with the largest p value from  IF [] >  // Reconstruction with  is not significantly // better than reconstruction with  ℎ opt ,