A pan-cancer landscape of somatic substitutions in non-unique regions of the human genome

Around 13% of the human genome displays high sequence similarity with at least one other chromosomal position and thereby poses challenges for computational analyses such as detection of somatic events in cancer. We here extract features of sequencing data from across non-unique regions and employ a machine learning pipeline to describe a landscape of somatic substitutions in 2,658 cancers from the PCAWG cohort. We show mutations in non-unique regions are consistent with mutations in unique regions in terms of mutation load and substitution profiles, and can be validated with linked-read sequencing. This uncovers hidden mutations in ~1,700 coding sequences and thousands of regulatory elements, including known cancer genes, immunoglobulins, and highly mutated gene families.


Introduction
Catalogs of somatic mutations in cancer promise insights into disease-initiating pathways, underlying evolutionary processes, and the identification of potential therapeutic opportunities. undertaken the analysis of 2,658 whole cancer genomes from the ICGC and TCGA to characterize regions unobserved via exome sequencing studies 1 . This resource has led to studies of the processes underlying somatic events 2 , complex structural variation 3 , driver genes 4 , and timing 5 . However, these analyses rely on variants that can be positioned uniquely in the genome and thus the repertoire of variation in non-unique regions still remains unexplored.
Short-read sequencing -the technology used in cancer studies including PCAWG -identifies somatic mutations by comparing fragments of DNA from normal and tumor samples to a reference genome. At the scale of 100bp, however, 13% of the human genome consists of sequences that are present at more than one chromosomal location 6,7 . These regions range in multiplicity from two to several thousand copies and in identity of sequences from vague to perfect matches. This non-uniqueness complicates genetic analyses and creates recurrent blind spots to somatic mutation calling. Irrespective of their amenability to analysis, non-unique regions include genes and regulatory elements that participate in human diseases 8 , developmental processes 9 , as well as splicing factors and nuclear RNAs that are recurrently mutated in cancers 10,11 . Blind spots in these regions thus hinder a systematic understanding of relevant biological processes.
To alleviate the limitations of variant detection from short-read sequencing due to non-unique regions, a technique called thesaurus annotation characterizes mutations in terms of equivalence classes of genetic positions 12,13 . This approach does not pinpoint the precise location of mutations, but it enables calculation of summary statistics such as mutational load. It also provides sufficient information about somatic events to study mutational signatures 14 and to identify affected functional elements. In the present work, we employed this technique to study the PCAWG dataset with the aim to describe the landscape of somatic single-base substitutions.
This uncovers a vast set of somatic events across cancer genomes and cancer types.

Results
Thesaurus annotation uncovers a distinct class of mutations in a pan-cancer cohort To perform an analysis of somatic mutations that is inclusive to non-unique regions, we set up a pipeline for variant calling on the PCAWG dataset without filtering reads based on mapping quality (Methods). This provided a comprehensive set of candidate positions in all regions of the genome.
We then annotated the sites using a procedure that links called sites to possible alternative positions in the genome 12 (Figure 1a). To search for somatic mutations among the candidates, we utilized curated PCAWG data in two distinct ways. First, we assembled a panel of 237 genomes from normal tissues to filter out common germline polymorphisms and sequencing artifacts. Second, we trained a machine learning model to classify somatic events. Similar to other machine-driven approaches 15,16 , our pipeline provided an algorithm with 18 features about candidate sites collected from tumors and matched normal samples. The algorithm then learned a strategy to call mutations from the candidates to match the PCAWG consensus calls 1 . Crucially, training and testing were performed using only data from unique regions where the PCAWG call set is expected to be of high quality, and the features provided did not include information about mapping quality. The final classifier achieved a root-mean-square error of 8.9%, with false discovery and false negative rates of 7.6% and 4.3%, respectively ( Figure S1). The most important feature for classification was allelic frequency in the matched normal sample (increase in classification error to 37% when omitted), but many other features, such as allelic frequency in the tumor and the frequency in the panel of normals contributed as well ( Figure S2).
After training the machine-learning model, we processed the entire PCAWG dataset and thereby produced new sets of somatic mutation calls. We split mutations in the new call set into those placed uniquely in the genome, which we describe further as 'simple' or 'local', and those that can be linked to alternate sites, which we term 'thesaurus'. Compared to the PCAWG calls, our set of simple mutations showed median false discovery and false negative rates per sample at 7% and 9%, respectively, albeit with variability across the samples in the cohort (Figure 1b). Such discrepancies and variability are not unexpected, as differences among computational pipelines are well-documented 17,18 . Indeed, modeling revealed that mutation frequency, coverage, and mutation spectrum can explain the largest discrepancies, and that high false discovery and false negative rates are related to internal consistency within the consensus itself ( Figure S3).
Performance was stable across cancer types ( Figure S4).
Importantly, the set of thesaurus mutations showed little overlap with the PCAWG calls ( Figure   1b), indicating most of those sites were previously hidden. To investigate whether these sites are reasonable additions to the samples' landscape, we studied total mutation load across samples among the simple and thesaurus calls and found a high correlation (spearman rho 0.96) that was concordant with direct proportionality (Figure 1c). Other properties such as allele frequency and mutation coverage were also concordant ( Figure S5). Moreover, counts of thesaurus mutations correlated only weakly with sequencing coverage (spearman rho 0.16), suggesting the calls were not dominated by noise.
We then studied the position of mutations in relation to annotations of non-unique sequence. As expected, PCAWG calls showed under-representation in regions with low mapping quality ( Figure 1d). A simulation of mutation calls that might be obtained using a naive pipelineconsidering loci in non-unique regions but without using thesaurus annotation -showed severe over-representation (Figure 1d), providing a justification for common mutation callers to filter out such regions. In contrast, our pipeline produced an intermediate distribution, albeit with a systematic over-representation compared to the proportion of non-unique sequence. This can be due to a residual level of false positives, due to gaps in the sequence of the human reference genome, or due to a propensity for samples to accumulate or tolerate mutations in those regions.
As an orthogonal validation, we performed short-read and linked-read sequencing on one additional cancer sample (Methods). The linked-read protocol uses barcodes to help aligners position reads at their correct coordinates in the reference genome 19,20 and thus expands the regions where variation can be assessed by common mutation callers. In the short-read data, our pipeline called 3,074 simple and 189 thesaurus somatic mutations, which we sought to confirm in the linked-read data. Validation rates for simple variants were proportional to the variant allele frequency in the short reads and surpassed 90% at allele frequency of 0.5 (Figure 1e, Figure S6).
For calls with thesaurus annotation, the validation rate was just 11% lower. This suggests that while thesaurus calls may retain more false components than local calls, the large majority of hits from our pipeline represent real events.
Mutations in non-unique regions are consistent with known mutational processes Somatic mutations in cancers appear through several biochemical processes, many of which leave distinct patterns in samples' mutation profiles 2,14 . To a first approximation, these processes can be presumed to act similarly in unique and non-unique regions and therefore manifest among Several strategies are used to assess the importance of mutations and identify driver genes [21][22][23][24] .
Recurrence in a cohort is a key indicator, but this signal can be confounded by factors such as region size, chromosomal location, proximity between adjacent mutations, sequence composition, and, in the case of coding sequences, effects on protein structure 22,25 . However, these covariates can themselves be confounded by non-uniqueness in the genome. Here, in order to study mutation patterns across all region types and in a way compatible with thesaurus candidates, we performed modeling using only region length as a covariate (Methods). Starting with coding sequences, we fit quantile regression models to describe cohort frequency in genes with unique sequence (Figure 3b). The resultant model captured the expected upward trend, with established driver genes such as TP53 and KRAS as strong outliers. Applying the same model on genes that include non-unique sequences revealed the same trends. Interestingly, the cohort frequencies of many genes shifted across quantile boundaries depending on whether thesaurus mutations were excluded or included.
As the same modeling strategy is also suitable to study mutations outside of coding regions, we carried out a genome-wide analysis and summarized the deviation of individual regions from model trends using z-scores ( Figure 3c, Table S1). Distributions of z-scores for elements affected only by local mutations centered around zero by construction. For regions affected by thesaurus mutations, distributions were also centered near zero despite this property not being built into the models. There was a consistent shift toward positive z-scores, but distributions for genes harboring thesaurus mutations exclusively included heavy tails of negative scores, indicating the mutation set may still suffer from false negatives. Scores were consistent when the same modeling was repeated on sub-cohorts and, for coding regions, were correlated with a ranking produced by a specialized scheme accounting for additional covariates (Methods, Figure S8).
Overall, the z-scores therefore provide a reasonable, albeit rudimentary, prioritization of hits.
To further refine the prioritization, we computed an entropy-based measure of specificity across cancer types (Methods). We then used the pan-cancer z-scores and specificity measures together to visualize hits in coding sequence (Figure 3d), promoters (defined as regions of at most 2000bp upstream of genes, Figure 3e), and other regions ( Figure S9, S10, Table S1). This approach captured expected characteristics, in particular that most genes are neither recurrently mutated or specific to a cancer-type, and that both dimensions carry outliers. Coding regions in TP53 and for ROBO2) up to a large majority (e.g. 88% for RGPD3).
To gain more insight into where the mutations lay along the gene structure, we visualized variantlevel results along the gene sequences, splitting the results by mutation type and comparing with the PCAWG mutation calls ( Figure S12, S13). For PIK3CA and KMT2C, two of the genes with the highest mutation load, our pipeline detected 13% and 15% additional simple substitution events compared to the PCAWG set ( Figure 4b). This is broadly consistent with our previous comparisons with the consensus and can in part be attributed to technical differences in the pipelines 17 (Figure 4c). Given their high mutation load and interactions with cancer pathways relevant to patient stratification schemes 29 , thesaurus mutations that fill gaps in their mutation profiles offer direct opportunities to test their translational relevance.
While thesaurus mutations constitute a minority of hits for most genes, they represent the dominant class for others ( Figure 4d). Among genes highlighted by our z-score analysis and also by alternative prioritization methods, TRIM49 and TRIM64B, two members of the tripartite motif family of proteins were prominent with mutations along their entire gene body. This family is involved in innate immunity, autophagy and carcinogenesis 30 . AMY1B encodes an amylase isoenzyme typically expressed in the salivary gland and the pancreas and is embedded in a region of variable copy-number. It may influence metabolism, but its high mutation burden may also be a corollary of the genomic fragility of its surrounding genomic region 31  suggests that thesaurus annotation reliably detects these non-germline events and can thus inform translational approaches that use immune signatures.

Discussion
The structure of the human genome has been shaped by its evolution, including by duplications and rearrangements. This history leaves a substantial portion of the genome to appear nonunique at the scale of short reads used by high-throughput sequencing studies. As long-and linked-read sequencing protocols become established and widespread, these regions will become accessible for direct analysis 19 . However, existing datasets, including efforts to sequence whole genomes of pan-cancer primary tumors 1 and metastases 34 , already offer opportunities to evaluate somatic events in these non-unique regions.
When specific sequences are of interest a priori, targeted approaches can perform re-analysis of subsets of sequencing data. This analytic strategy is effective when there are complex rearrangements as in the case of the HLA locus 35 , or sequences are present in a large number of copies as in the case of transposons 36 . However, the human genome also carries areas that are almost exactly duplicated due to recent evolutionary events and are present in fewer than ten copies. Such regions can be studied in a systematic manner through a technique that links specific genomic positions and provides information about clusters based on multi-locus alignments 12  Our analyses of cohort mutation rates, regional recurrence and hotspots, cancer-type specificity, and co-occurrence are a first-pass summary of the patterns in these data. Indeed, mutational processes are modulated, directly or indirectly, by a myriad of factors that include nucleotide content, chromatin accessibility, gene expression 22 . While methods developed in these areas provide guidance for more refined analyses, they rely on auxiliary data as model covariates. In the context of non-unique regions, these covariates, as they are often acquired through shortread sequencing, are likely to suffer biases related to sequence uniqueness 37 . A careful examination of those covariates in the non-unique genome is a critical step toward better understanding of the statistical and functional importance of the uncovered mutational landscape.
Beyond somatic substitution events, cancer genomes also suffer other types of mutations, including small insertions and deletions and regional copy-number changes 4  for argument --minmapqual 0, which instructs the software to use all primary read alignments irrespective of mapping quality.
Variants detected in samples included in the panel of normals were aggregated into a single table.
The frequency of each variant in this panel was kept as a proxy for population frequency.
Variants from tumor samples were annotated using GeneticThesaurus (v0. Separately, PCAWG mutation calls for the same samples were assembled into a truth set.
Importantly, the PCAWG mutation calls were filtered using the same panel-of-normals frequency filter used on the candidate sites and then trimmed further to remove items that were not present in the variant candidates. These steps ensured that the candidates and the truth set are consistent, and that all items in the truth set could in principle be obtained from non-missing features in the candidate data. Both the candidate and the truth dataset were restricted to sites with a PASS filter code in the candidate data, i.e. to those sites not linked to any additional locations via thesaurus links. This ensured that the identification of somatic mutations among the candidates could be determined by technical features of the sequencing data and biological aspects of the tumor and was not confounded by aspects related to mappability.
Machine learning models were trained using xgboost 42  Overall, a total of 1,538,338 GEMs were generated, containing on average 589kb of DNA with an average size of 86kb, each producing a median of 47 linked reads for a final 38x depth of coverage and 32x median depth at mutated sites.
The reported variants were compared to the calls from the machine learning approach on a standard short-read WGS dataset of the same tumor sample. Simple variants called from the short reads were declared validated if they were part of the PASS mutations in the linked reads.
Thesaurus calls were declared validated if they were part of PASS mutations in the linked reads, or if their thesaurus-linked sites were part of the linked-read dataset. This approach allows for ambiguity in placing the mutation location based on short-read data, but does not inflate detection rates 12 .
Mutation trinucleotide profiles Trinucleotide contexts were extracted for all called mutations in the cohort. These neighborhoods were used to assign each substitution mutation to one of 96 categories as previously described 14 .
Counting the number of mutations of each type produced two profiles -one based on simple mutations in unique genomic regions and one based on mutations with thesaurus links.
For correlation analysis, the simple and thesaurus profiles were treated as 96-dimensional vectors and correlation was evaluated using the Spearman method. Because mutation profiles are degenerate when the overall number of mutations is small and constrained to non-negative counts, statistical significance was estimated by simulation. For a given profile of thesaurus mutations, 10,000 random profiles were generated with an equivalent number of mutations. The Spearman correlation value between the local and thesaurus profiles were compared to the distribution of correlations between the local and simulated profiles. The procedure provided approximations to p-values that were sufficiently precise to determine significance at nominal and multiple-testing adjusted levels.
For visualization of similarities of the mutation profiles, count-based mutation profiles were adjusted using allele-frequency data. This adjustment provided weighting toward well-measured mutation instances and avoided degenerate comparisons based on integer counts. Mutation profiles based on simple mutations were sum-normalized and embedded into a two-dimensional space using UMAP 43,44 , a dimensional reduction technique, using a euclidean distance metric. Modeling of the relation between region size and mutation frequency was performed using quantile regression. The model used logarithmically transformed frequency and region size, log(frequency) ~ a0 + a1 log(size) + a2 (log(size))^2, With a0, a1, and a2 as free parameters. The model quadratic term allows some nonlinearity in the relationship between size and frequency, which is required to allow the growth in frequency to taper for very large regions. Quantile regression with this model was performed at the 50% percentile to describe the primary trend, and at 5%, 25%, 75%, 95% levels to obtain intervals of variability. After fitting the parameters, each genomic region was associated with an expected mutation frequency and an interquartile interval.
Quantile regression based on a linear equation is guaranteed to produce fitted models that preserve ordering of percentiles, e.g. with a model at quantile of 75% always yielding larger values than at 50%. This property is not guaranteed for models with higher-order terms. Predictions from the fitted models were thus adjusted post-hoc. Furthermore, the interquartile interval was forced to correspond to at least 1/N, with N being the number of samples in the modeled cohort. All model predictions were restricted to the unit interval, [0, 1].

Identifying driver genes with dndscv
The dN/dS ratio was computed through maximum likelihood estimates across trinucleotide contexts using dndscv 22 . After removing hypermutator samples with more than 300 coding mutations, we ran dndscv on the pan-cancer cohort for de-novo discovery of candidate cancer genes. We used the pooled set of thesaurus and simple mutations, but we did not correct for epigenetic covariates, as these covariates are not annotated for non-uniquely mapping regions.
Cancer type specificity