Evaluating topography of mutational signatures with SigProfilerTopography

The mutations found in a cancer genome are shaped by diverse processes, each displaying a characteristic mutational signature that may be influenced by the genome's architecture. While prior analyses have evaluated the effect of topographical genomic features on mutational signatures, there has been no computational tool that can comprehensively examine this interplay. Here, we present SigProfilerTopography, a Python package that allows evaluating the effect of chromatin organization, histone modifications, transcription factor binding, DNA replication, and DNA transcription on the activities of different mutational processes. SigProfilerTopography elucidates the unique topographical characteristics of mutational signatures, unveiling their underlying biological and molecular mechanisms.


BACKGROUND
Somatic mutations are found across the genomic landscapes of all cancers and of all normally functioning somatic cells [1,2].These mutations are carved by the activities of endogenous and exogenous mutational processes with each process exhibiting a characteristic mutational pattern, termed, mutational signature [3][4][5].Prior studies have demonstrated that mutations are not uniformly distributed across the genome and that most mutational signatures are affected by the topographical features of the human genome [6,7].Specifically, mutational signatures can have distinct enrichments, depletions, or periodicities in the vicinity of early and late replicating regions [8,9], genic and intergenic regions [10,11], nucleosomes [12,13], dense chromatin regions [14], histone modifications [15], and transcription factor binding sites [16,17].Additionally, some mutational signatures also exhibit transcription strand asymmetries, replication strand asymmetries, and/or strand-coordinated mutagenesis [18,19].
MutationalPatterns allows comparing the mutational patterns between different regions of the human genome and it can be used for testing enrichments or depletions using Poisson tests [22].
However, the tool does not consider the structure of the genome, the patterns of different mutational signatures, and the activities of these signatures when performing statistical comparisons.In addition, a subset of topography features has also been considered in extracting .CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made The copyright holder for this preprint this version posted January 9, 2024.; https://doi.org/10.1101/2024.01.08.574683 doi: bioRxiv preprint

Implementation and computational workflow
As input, SigProfilerTopography requires a set of topographical features of interest and a compendium of somatic mutations from a set of samples (Fig. 1A).Topographical features can be derived from different genomic assays (e.g., ATAC-seq, Repli-seq, MNase-seq, ChIP-seq, etc.) and these features can be inputted in a number of standard file formats, including: wig, bigWig, bed, or bigBed.SigProfilerTopography's support for multiple input formats allows for topographical features to be directly downloaded from the Encyclopedia of DNA Elements (ENCODE) [35] or these features can be provided from user-generated experimental datasets.
Similarly, SigProfilerTopography can examine somatic mutations using commonly supported file formats, including, Variant Call Format (VCF) and Mutation Annotation Format (MAF).By default, SigProfilerTopography utilizes SigProfilerAssignment [36] to attribute the activities of known reference mutational signatures from the Catalogue Of Somatic Mutations In Cancer (COSMIC) database [37] to each examined sample.Alternatively, if another tool for assigning mutational signatures is preferred, users can provide two additional input matrices that include the patterns and activities of all operative mutational signatures in the examined samples.In either case, SigProfilerTopography will utilize the signatures' patterns and their activities to derive the probability for each mutational signatures to generate each type of somatic mutation [33].
After processing the input data, SigProfilerTopography simulates all somatic mutations in each sample n times (Fig. 1B; default of n=100) using SigProfilerSimulator [38] while maintaining the distribution of mutations across the genome at a preset resolution (Fig. 1B).By default, the preset resolution maintains the total number of mutations per chromosome and the trinucleotide pattern .CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made The copyright holder for this preprint this version posted January 9, 2024.; https://doi.org/10.1101/2024.01.08.574683 doi: bioRxiv preprint of each somatic mutation, which encompasses the mutated base and its immediate 5' and 3' basepairs.The performed background simulations can be extensively customized depending on the appropriate scientific question [38].Through simulating all somatic mutations, the tool generates a background model that accounts for at least a preset part of the reference genome's structure and allows assessing any statistical differences between real and simulated somatic mutations.Both real and simulated somatic mutations are categorized in their appropriate mutation types (Fig. 1C) and a mutational signature is probabilistically attributed to each somatic mutation (Fig. 1D).
SigProfilerTopography controls the false-discovery rate and, by default, only statistically compares mutations with an average of 90% probability of being caused by a specific mutational signature (Fig. 1E).Lastly, the tool outputs a variety of results allowing to distinguish differences in the topographical distribution of real somatic mutations when compared to the distribution of simulated mutations.Example analyses include evaluations of occupancy, strand asymmetries, replication timing, enrichments/depletions, and strand-coordinated mutagenesis (Fig. 1F).

Analysis of Feature Occupancy
For a given topographical feature of interest, the tool evaluates the signal for detecting this feature in the vicinity, default of ±1 kilobase (kb) flanking regions, of each examined somatic mutation (Fig. 2A).The signal is aggregated for each flaking genomic position across all somatic mutations and averaged based on all available data (Fig. 2A).In the rare case of no signal being found for a specific flanking location across all mutations, the average signal is reported as zero.Occupancy analysis is jointly performed for both real and simulated somatic mutations, thus, allowing statistical comparisons of the flanking patterns and any enrichments/depletions between real and synthetic mutations.Occupancy analysis is commonly performed to evaluate the effect of nucleosome occupancy, open chromatin, transcription factor binding sites, and histone modifications on the accumulation of somatic mutations from specific mutational signatures [6,18].
To illustrate SigProfilerTopography's capabilities for occupancy analysis, we examined the effect of nucleosome occupancy (measured by MNase-seq data) and binding of CTCF (based on ChIPseq data), a key regulator of chromatin architecture, on mutational signatures SBS17b and ID2 in the ESCC cohort.Signature SBS17b has a generally unknown etiology with prior studies reporting associations with damage from reactive oxygen species [39] and possible exposure to 5fluorouracil chemotherapy [40].Mutations due to SBS17b exhibited periodicity with a period of approximately 190 base-pairs reflecting the nucleosome positions (Fig. 2B).This periodicity has been previously attributed to high damage [41] and less repair at nucleosome positions [42].
Additionally, SBS17b substitutions were highly enriched at CTCF binding sites, which is strikingly different when compared to expected by chance from the simulated substitutions (Fig. 2C).Signature ID2 has been previously attributed to slippage during DNA replication of the DNA template strand and this signature can be highly enriched in cells that are mismatch repair deficient [5].Mutations due to ID2 were preferentially depleted at nucleosome-occupied regions (Fig. 2D) while significantly enriched at CTCF binding sites (Fig. 2E).
In addition to evaluating the patterns in the vicinity of a topographical feature, SigProfilerTopography allows summarizing the different enrichments and depletions of topographical features in the vicinity of somatic mutations when compared to synthetic mutations.Specifically, the tool performs a statistical test to evaluate whether the topographical signal is enriched, depleted, or as expected based on the simulated data.Applying SigProfilerTopography to 8 topographical features and 5 mutational signatures in the ESCC cohort reveals that mutational signatures can be distinctly affected by each topographical feature.For example, SBS17b is enriched in CTCF binding sites and depleted at histone marks (Fig. 2F).This depletion is especially profound at H3K4me1 and H3K27ac, both of which delineate enhancer regulatory regions [43,44].

Evaluating Replication Timing
Cells replicate their DNA following a predefined replication timing program [45][46][47].DNA replication begins simultaneously at multiple origins of replication and propagates bidirectionally on both strands.Chromosomal regions close to the origin of replication will replicate early, whereas regions that are far from the origin will replicate late.SigProfilerTopography can infer early and late replicating regions based on Repli-seq assay.Since higher signal in Repli-seq data reflects earlier replication [48,49], the tool performs a search for local minima and maxima of the provided signal (Fig. 3).Specifically, weighted average data are smoothed and transformed into wavelet-smoothed signal, which results in regions with high signal values indicating domains of early replication where initiation occurs earlier in S-phase or early in a higher proportion of cells.
Local maxima and local minima in the wavelet-smoothed signal data correspond to replication initiation zones (peaks) and replication termination zones (valleys), respectively (Fig. 3A).
SigProfilerTopography uses wavelet-smoothed signal data in replication timing analysis and, additionally, peaks and valleys data in replicational strand asymmetry analysis.After sorting the replication time signals into descending order from early to late, the tool splits the signal into deciles, with each decile containing 10% of the replication time signals.To demonstrate SigProfilerTopography's capabilities for replication timing analysis, we evaluated the effect of .CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made The copyright holder for this preprint this version posted January 9, 2024.; https://doi.org/10.1101/2024.01.08.574683 doi: bioRxiv preprint replication timing in the ESCC cohort on signature SBS2 (Fig. 3C), a mutational signature previously attributed to the activity of the APOBEC family of deaminases [34].Similar to prior reports [6,18], SBS2 exhibited an increasing normalized mutation density from early to late replicating regions (Fig. 3D).

Examining Replication Strand Asymmetries
In eukaryotic cells, DNA replication is initiated around multiple replication origins, from where it proceeds in both directions on both strands (Fig. 3B).The strand where the direction of DNA synthesis and growing replication fork are the same is replicated continuously and it is termed leading strand.Conversely, when the direction of DNA polymerase and the growing replication fork are opposite, then that strand (termed, lagging strand) is replicated discontinuously in short Okazaki fragments [50].Imbalance between DNA damage and DNA repair may lead to mutations from the same type to be enriched on the leading or lagging strands.
Using data from an Repli-seq assay, SigProfilerTopography can annotate mutations as ones occurring on the leading or lagging strand by orienting them by the pyrimidine base of the reference Watson-Crick base-pair.Applying SigProfilerTopography to the mutations attributed to the APOBEC-associated signature SBS2 in the ESCC cohort reveals an enrichment of mutations on the lagging strand when compared to simulated data (Fig. 3E).This result is consistent with prior reports of APOBEC deaminases targeting single-stranded DNA during replication [51].

Examining Transcription Strand Asymmetries
In addition to evaluating the effect of replication on the accumulation of mutational signatures (Fig. 3), SigProfilerTopography also allows examining the impact of transcription on somatic mutagenesis.Specifically, the tool annotates each mutation as either genic or intergenic, where genic mutations are within the genomic regions of well-annotated protein coding genes and intergenic mutations are outside these regions (Fig. 4A).Moreover, somatic mutations within wellannotated protein coding genes are further subclassified based on the pyrimidine base of the reference Watson-Crick base-pair resulting into two additional subclasses: un-transcribed mutations and transcribed mutations (Fig. 4A).This subclassification allows measuring transcription strand asymmetries due to either transcription-coupled DNA repair [52,53] or transcription-coupled DNA damage [19].Applying SigProfilerTopography to the somatic mutations due to SBS16 (Fig. 4B), a mutational signature previously associated with alcohol consumption [54], revealed both accumulation of higher number of T>C mutations on the transcribed strand (Fig. 4C) as well as an enrichment of mutations within genic regions (Fig. 4D).
This topographical behavior of signature SBS16 has been previously attributed to the role of transcription-coupled damage in actively transcribed genes [19,55].

Mapping Strand-coordinated Mutagenesis
Prior studies have shown that strand-coordinated mutations are commonly observed, for example, due to damage on single-stranded DNA, and can form hypermutable genomic regions [56,57].
SigProfilerTopography allows performing analysis of strand-coordinated mutagenesis by identifying groups of consecutive mutated single base substitutions, attributed to the same .CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made The copyright holder for this preprint this version posted January 9, 2024.; https://doi.org/10.1101/2024.01.08.574683 doi: bioRxiv preprint mutational signatures, with no more than 10kb distance between any two mutations.Mutations are oriented by the reference base of the Watson-Crick base-pair to ensure that they are occurring on the same strand, e.g., consecutive C>A mutations attributed to a single mutational signature.
Groups of varying lengths are pooled across all samples for each mutational signature.Same procedure is repeated for simulated mutations to assess the statistical significance of the observed number of strand-coordinated mutagenesis groups with expected list of number of strandcoordinated mutagenesis groups for each group length (Fig. 5A-C).
Applying SigProfilerTopography to all mutational signatures operative in the 552 whole-genome sequenced samples revealed statistically significant strand-coordinated mutagenesis for multiple signatures.The APOBEC-attributed signatures SBS2 and SBS13 exhibited groups of up to 11 consecutive mutations likely due to APOBEC-induced kataegis [58,59].Interestingly, the flat signatures SBS5 and SBS40 also manifested strand-coordinated mutagenesis of varying group length.Lastly, the mismatch repair deficiency signature SBS15 also exhibited strand-coordinated mutagenesis for as many as 5 consecutively mutated bases (Fig. 5D).

DISCUSSION
SigProfilerTopography is an open-source Python package that allows understanding the interplay between somatic mutagenesis and the structural and topographical features of a genome.The tool can reveal mutational signature-specific tendencies associated with chromatin organization, histone modifications, and transcription factor binding as well as ones affected by cellular processes such as DNA replication and transcription.As we illustrated by applying the tool to 552 whole-genome sequenced ESCCs, SigProfilerTopography simultaneously examines real somatic mutations and simulated mutations, compares their tendencies, and then elucidates the statistically significant differences for each structural and topographical feature of interest.The tool also seamlessly integrates with other SigProfiler tools and leverages them for parts of its computational workflow, including classification of somatic mutations using SigProfilerMatrixGenerator [60], simulating realistic background mutations with SigProfilerSimulator [38], and assigning mutational signatures to each somatic mutation using SigProfilerAssignment [36].
SigProfilerTopography has at least three known limitations.First, the tool can only be used to explore small mutational events including single base substitutions, doublet substitutions, and small insertions and deletions.Currently, the tool does not allow exploring large mutational events [61] such as copy-number changes and structural rearrangements.Second, SigProfilerTopography can be applied only to whole-genome sequenced cancers, and it will not work on whole-exome or targeted cancer gene panel sequencing data as the algorithm requires profiling the non-coding regions of the genome.Lastly, the tool necessitates sufficient numbers of somatic mutations for the statistical analyses to be meaningful and statistically significant.We have previously shown that topographical analyses will work and can yield biologically exciting results when examining .CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made The copyright holder for this preprint this version posted January 9, 2024.; https://doi.org/10.1101/2024.01.08.574683 doi: bioRxiv preprint adult cancers [6], however, it is currently unclear whether some pediatric cancer genomes will have sufficient numbers of somatic mutations for examining the topography of their mutational signatures.

CONCLUSIONS
SigProfilerTopography enables a thorough examination of how genome topography and genome architecture impact the accrual of somatic mutations.The tool offers a robust approach for evaluation of localized somatic mutation rates across various genomic features within a single comprehensive platform, offering a scalable solution for analyzing large datasets encompassing many thousands of cancer genomes and all types of small mutational event.Overall, SigProfilerTopography is a computational tool that provides an unprecedented opportunity for understanding the biological mechanisms and molecular processes influencing somatic mutational processes that have operated in a cancer genome.

Tool implementation
SigProfilerTopography is developed as a computationally efficient Python package, and it is available for installation through PyPI.The tool leverages SigProfilerAssignment for attributing mutational signatures to individual somatic mutations [36], SigProfilerSimulator for generating all simulated datasets [38], and SigProfilerMatrixGenerator for processing input data for somatic mutations [60].SigProfilerTopography allows processing all types of small mutational events, including: (i) single base substitutions, (ii) doublet base substitutions, and (iii) small insertions and deletions.The tool supports most commonly used data formats for somatic mutations: Variant Calling Format (VCF), Mutation Annotation Format (MAF), International Cancer Genome Consortium (ICGC) data format, and simple text file.SigProfilerTopography allows examining topography features in wiggle (wig), browser extensible data (bed), bigWig, and bigBed formats.
The tool has been extensively tested on data from transposase-accessible chromatin with sequencing (ATAC-Seq), replication sequencing (Repli-Seq), micrococcal nuclease sequencing (MNase-Seq), and immunoprecipitation sequencing (ChIP-Seq).By default, the tool performs statistical comparisons and Benjamini-Hochberg corrections for multiple hypothesis testing using the statsmodels Python package.SigProfilerTopography is freely available, distributed under the BSD-2-Clause license, and has been extensively documented.Statistically significant replication strand asymmetries are depicted with * (q-value ≤ 0.05).Circle plot displays the group lengths from 2 to 11 mutations on the x-axis and the SBS mutational signatures on the y-axis.Circle size represents the number of strand-coordinated mutagenesis groups for the corresponding group length, which is normalized for each mutational signature.
Circle color indicates the statistical significance of the finding with -log10 (q-value), with darker color corresponding to lower q-value.Simulating n times with same mutational patterns ABBREVIATIONS

Figure 2 .
Figure 2. Evaluating occupancy of topographical features.(A) Conceptual and simplified

Figure 3 .
Figure 3. Examining the effect of replication timing and replication strands.(A) DNA

Figure 4 .
Figure 4. Assessing the impact of the transcriptional machinery.(A) Somatic mutations within

Figure 3 .Figure 4 .
Figure 3. Examining the effect of replication timing and replication strands.
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made * q-value ≤ 0.05