Explainable deep learning on 7500 whole genomes elucidates cancer-specific patterns of chromosomal instability

Chromosomal instability (CIN) refers to an increased rate of chromosomal changes within cells. It is highly prevalent in cancer cells and leads to abnormalities in chromosome number (aneuploidy) and structure. CIN contributes to genetic diversity within a tumour, which facilitates tumour progression, drug resistance, and metastasis. Here, we present a deep learning method and an exploration of the chromosome copy aberrations (CNAs) resultant from CIN, across 7,500 high-depth, whole genome sequences, representing 13 cancer types. We found that the types of CNAs can act as a highly specific classifier for primary site. Using an explainable AI approach, we revealed both established and novel loci that contributed to cancer type, and focusing on highly significant chromosome loci within cancer types, we demonstrated prognostic relevance. We outline how the developed methodology can provide several applications for researchers, including drug target and biomarker discovery, as well as the identification of cancers of unknown primary site.


Introduction
Chromosomal instability (CIN) is a phenomenon commonly observed in cancer cells, characterized by an increased rate of changes in both the number and structure of chromosomes. This instability can result from a wide range of mechanisms, which include those directly associated with the mitotic apparatus, such as chromosome mis-segregation, but also less well understood events such as dysfunction of telomeres, impairments in cell cycle checkpoints, abnormalities in the mitotic spindle apparatus, double strand break repair and genome doubling. Chromosome copy aberrations (CNAs), the specific loci that reflect an aneuploid state, have been observed in around 90% of solid cancer genomes [1], making them one of the most prominent features of the cancer genome. The observed chromosome profile of a given cancer is a reflection of these mechanisms plus its unique evolutionary walk, which is a combination of the positive selection forces of its microenvironment (defined in large part by the site of origin) and the stochastic nature of mutagenesis.
For this reason, CNA patterns have been increasingly studied. Following the historical work of von Hansemann [2] and Theodor Boveri [3,4], and more recent in vitro [5], in vivo [6], and bioinformatic studies [7], a view of chromosome imbalances (aneuploidy) as a general potentiator of malignancy has been constructed [8]. More recent studies have sought to explore the genome-wide architecture that results from CIN [9] and to explain patterns of CNAs through generative signatures [10,11]. Although the signature approach is appealing, there are many hurdles associated with defining the underlying features to use, in contrast to the point mutation case [12].
One of the challenges associated with CIN is its complexity. Separating functional from non-functional variation is non-trivial. In a sense, many researchers are seeking to distinguish those sites that are important to cancer biology from those that are passengers resulting from stochastic forces. An ideal situation is that specific loci define each cancer subtype, thereby directing drug development and making molecular diagnosis tractable. To date, many cancer-specific CNAs have been identified [13,14] and in specific gene contexts, such as TERT amplifications [15], they can act as biomarkers for disease severity, drug resistance [16], or as novel drug targets [17]. That said, these are often discovered from frequency or mechanistic methodologies and it remains broadly challenging to assess whole cancer genomes for undiscovered loci that are critical to cancer biology.
In this work we sought to examine CIN patterns directly. We classified cancers based on patterns of copy number across the genome and then used explainable AI to dissect the loci that contributed information to the classification. These loci capture differences between cancers, which we can assume give rise to different biology in those cell types. Our method reveals the "unseen" differences in CNA patterns between cancer types, and reveals novel loci that will be of wide interest to cancer biologists, and drug and biomarker developers. We verified the prognostic information in these loci using patient data.

Results
Analysis of 13,000 rearranged cancer genomes in the 100,000 Genomes Project
The 100,000 Genomes Project was completed in 2019, and achieved its goal of sequencing 100,000 genomes from cancer, rare disease and infectious disease patients [18,19]. Managed by Genomics England, it served as a transformative project for integrating genomics into the National Health Service (NHS) and for evaluating the role of whole genome sequencing (WGS) at scale for NHS cancer patients. Participants provided consent for their genomic data to be associated with anonymised longitudinal health records. This information is then shared within a secure workspace with international clinicians and scientists from approximately 300 institutions in 24 countries within the Genomics England Clinical Interpretation Partnership (GeCIP). The aim is to advance our understanding of the causes and treatment approaches for various types of cancers [20].
The 13,000 WGS cancer samples are labelled according to their primary site and diagnosis. Primary site classes with fewer than 100 samples (testicular, endocrine, etc.) or those whose class labels are broad, or composed of multiple primary sites (hepatopancreatobiliary, sarcoma, cancer of unknown primary site, etc.) were omitted from the data. The remaining genomes were used in our analysis (Figure 1a).
We also focused on cancers with a high degree of chromosomal instability (CIN) by excluding those with less than 30% of the genome altered, as quantified by the Percentage Genome Altered (PGA). This threshold aligns with previous work suggesting that substantial genomic alteration is a reliable indicator of increased chromosomal instability [21,22]. The distribution of PGA across cancer samples from diverse primary sites is given in Figure 1b. Furthermore, we excluded genomes that had undergone significant rearrangements or were primarily influenced by mutations other than CNAs, since these are likely to involve distinct biological mechanisms such as microsatellite instability. This left 7522 genomes for the downstream analyses.
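The PGA filter described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the segment tuple layout, the definition of "altered" as any state other than normal diploid (total copy number 2, minor allele copy number 1), and the function names are all assumptions made for the example.

```python
def percentage_genome_altered(segments, genome_length):
    """Return the percentage of the genome covered by non-diploid segments.

    segments: iterable of (start, end, total_cn, minor_cn) with half-open
    coordinates. Hypothetical layout, chosen for this sketch only.
    """
    altered = 0
    for start, end, total_cn, minor_cn in segments:
        # Assumption: a segment is "altered" unless it is normal diploid (2, 1).
        if (total_cn, minor_cn) != (2, 1):
            altered += end - start
    return 100.0 * altered / genome_length


def passes_cin_filter(segments, genome_length, threshold=30.0):
    """Keep a sample only if at least `threshold` percent of its genome is altered."""
    return percentage_genome_altered(segments, genome_length) >= threshold
```

Samples below the threshold are excluded, matching the rule of dropping genomes with less than 30% of the genome altered.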
We explored the frequency of CNA mutations at known pan-cancer genes from Rosetta [23]. Figure 1c shows the number and the proportion of samples where each locus is implicated. In addition, each pan-cancer gene has an associated stacked bar of the copy state for each sample in the Genomics England (GEL) data, showing the proportion of samples with deletions, loss of heterozygosity (LOH), duplication, amplification or diploid.
The degree of genomic instability, which is highly sensitive to the cancer primary site, can be measured by the number of copy number segments, the proportion of LOH segments within the genome and the degree of whole genome doubling (WGD) [10]. This is illustrated in Figure 1d, which shows the median proportion of LOH against the median number of LOH segments, with the size of each point reflecting the degree of WGD. The ratio of mean proportion to number of LOH segments is lowest in haematological cancers and highest in upper-gastrointestinal samples, whereas WGD shows the inverse pattern, being highest in haematological and lowest in upper-gastrointestinal samples.

A novel deep learning framework for analysis of rearranged genomes
We developed a model for analysing rearranged genomes using a neural network classifier trained on the primary site labels in the GEL dataset. To prepare the data, we first split the autosomes into 100 kb bins and calculated the median total copy number and median minor allele CNA state within each bin. This reduced each sample to a 28749 × 2 tensor and provided a conserved genome coordinate across all samples. The data were then scaled across samples. The neural network architecture itself is composed of an ensemble of 22 chromosome-specific sub-networks, each acting separately on one chromosome in the input (Figure 2a). As such, each chromosome sub-network has an input layer with a number of nodes equal to the binned size of the respective chromosome, with chromosome 1 having the largest number of nodes and chromosome 21 the fewest. The outputs of these chromosome feature extractors then serve as inputs to a deep feed-forward network.
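The binning step can be illustrated with a short sketch for a single chromosome. It rests on assumptions not stated in the text: the median is taken, unweighted, over all segments overlapping a bin, bins with no overlapping segment default to normal diploid, and the segment layout and function name are hypothetical.

```python
from statistics import median

BIN_SIZE = 100_000  # 100 kb bins, as described in the text


def bin_chromosome(segments, chrom_length, bin_size=BIN_SIZE):
    """Return one (median_total_cn, median_minor_cn) pair per bin.

    segments: list of (start, end, total_cn, minor_cn), half-open coordinates.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    binned = []
    for b in range(n_bins):
        lo = b * bin_size
        hi = min(lo + bin_size, chrom_length)
        totals, minors = [], []
        for start, end, total_cn, minor_cn in segments:
            if start < hi and end > lo:  # segment overlaps this bin
                totals.append(total_cn)
                minors.append(minor_cn)
        if totals:
            binned.append((median(totals), median(minors)))
        else:
            # Assumption: bins without a call are treated as normal diploid.
            binned.append((2, 1))
    return binned
```

Concatenating the per-chromosome outputs for the 22 autosomes, in a fixed order, gives every sample the same genome coordinate system.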
Each sample in our dataset is composed of a large feature set, while the classification involves only 13 classes. This high-dimensional feature space posed a challenge for our model, requiring it to discern meaningful patterns efficiently. Navigating through such a large feature space demands a robust model capable of identifying relevant patterns while avoiding overfitting. The chromosome feature extraction helped to reduce overfitting, as did dropout layers with high probability (p = 0.5). In addition, we chose not to use convolutional layers in order to improve downstream interpretability. Our training strategy (see Methods) also addressed this challenge by striking a balance between model complexity and generalisation.

Many cancers can be distinguished based on their copy number profiles
The results in Figure 2b showcase the model's ability. Despite the abundance of features, the model demonstrates nuanced performance across classes, highlighting its capability to distil information from the diverse set of features. The average classification accuracy of the model on the test set is 72%, with endometrial cancer samples having the highest misclassification rate. As with other cancers, samples are misclassified when primary sites share similar biology, in this case with gynaecological cancers or colorectal cancer and upper gastrointestinal cancers. The performance of this model in classifying samples based on their CNA distribution shows that there are differences in chromosomal instability (CIN) patterns between cancers. However, this observation is not universally applicable to all cancers. For example, lung cancer emerges as one of the classes with the lowest classification accuracy, which is most likely due to the mix of lung cancer subtypes within the data. These subtypes were not supplied in training and thus the heterogeneity negatively influences the predictive accuracy of the model.

Using explainable AI to locate functional variation within genomes undergoing CIN
In order to infer the most important features that determine classification of primary sites by the trained network, we employed an approach based on Integrated Gradients (IG). IG is a technique used in explainable artificial intelligence (xAI) to understand the predictions of neural networks by attributing importance scores to input features [24]. Although there have been a number of improvements to the approach [25], it can still be challenging to interpret the information gained. We used the guided integrated gradients (GIG) method [26]. Unlike conventional IG, adaptive path methods such as GIG reduce the noise accumulated along the IG path by defining a path integral between the baseline and input sample, conditioning the path not just on the input sample but also on the model being explained (Figure 2c). This conditioning ensures a more accurate attribution of feature importance, taking into account the specific characteristics of the neural network's decision boundaries. By utilising GIG, our aim was to provide a more nuanced understanding of the features contributing to the classification of primary sites, enhancing the interpretability of the model's decision-making process. The resulting feature attributions are numerical scores, ranking the input features based on their influence on the model's prediction. This ranking gives the relative importance of each genomic locus in contributing to the classification of an individual cancer primary site versus the other primary sites. As such, it does not, by design, highlight the pan-cancer features such as those listed in Figure 1c, and instead serves as a powerful tool for interpreting the discriminative factors guiding the classification. This approach aids the identification of biologically relevant CNA factors driving CIN within each sample.
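To make the attribution idea concrete, the sketch below implements conventional IG with a straight-line path and a midpoint Riemann sum on a toy logistic model. It is not the guided (GIG) variant used in this work, which adapts the path to the model; the toy model, its analytic gradient and all names are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy stand-in for a trained classifier: a single logistic unit."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def gradient(x, weights, bias):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Conventional IG: average the gradient along the straight path from
    baseline to x, then scale by the input difference."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights, bias)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline.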
The GIG inference on the trained model is applied individually for each sample, generating a ranking of genomic bins. The model is trained exclusively on the CNA states of each bin, with none of the associated metadata such as gene or enhancer content fed into the training. By aligning the inferred feature importance with specific genomic annotations, and examining whether a feature corresponds to a specific copy number alteration (e.g. duplication, deletion, LOH), our approach provides a spatially and biologically contextualised understanding of the model's decision-making. This can be used to capture distinctions between cancers, interrogating the complex landscape of cancer biology within the context of CIN.
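As a sketch of how a per-sample ranking might be mapped back to genomic coordinates, the function below assumes the attribution vector is concatenated chromosome by chromosome in a known order and ranks bins by signed score, highest first. The input layout, the ranking criterion and the function name are assumptions rather than details taken from the study.

```python
def rank_bins(attributions, chrom_bin_counts, bin_size=100_000, top_k=10):
    """Return the top_k bins as (score, chromosome, start, end) tuples.

    attributions: flat list of scores for one sample and one channel.
    chrom_bin_counts: list of (chromosome_name, n_bins) in the same order
    as the bins appear in `attributions` (hypothetical layout).
    """
    labelled = []
    offset = 0
    for chrom, n_bins in chrom_bin_counts:
        for b in range(n_bins):
            score = attributions[offset + b]
            labelled.append((score, chrom, b * bin_size, (b + 1) * bin_size))
        offset += n_bins
    # Assumption: the largest positive attribution ranks first.
    labelled.sort(key=lambda row: row[0], reverse=True)
    return labelled[:top_k]
```

The resulting coordinates can then be intersected with cytoband, gene or enhancer annotations after inference, since none of that metadata is seen by the model during training.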

Sample level analysis reveals focal loci shaping tumour biology
The sample level explainability provides insight into the loci within the tumour genome that contribute to that particular classification. Figure 3a, b and c show examples from breast, lung, and colorectal cancer respectively. Each panel contains the CNV distribution across the genome with the highly attributed loci highlighted in colour for both the total CN and minor allele CN. The accompanying tables show the 10 highest loci for each sample and their corresponding copy state, cytoband, COSMIC oncogenes and enhancers (see Methods).
The sample in Figure 3a is correctly predicted to belong to the breast cancer cohort with a classification probability of 0.96. The attributions point to a range of loci that are mostly established in the literature. In the total CN channel, amplification at 8p11.22 is observed, although only limited literature provides evidence that this region has a role in breast cancer [27,28]. Similarly, amplification at 16p13.12, associated with ERCC4, appears to be novel in the context of breast cancer. The loci with the highest attribution in the minor allele are also duplicated or amplified in this patient. In a subgroup of estrogen receptor (ER)-negative grade III breast cancers, the 19q12 locus emerges as amplified. This amplicon, housing nine genes including cyclin E1 (CCNE1), is proposed as a potential 'driver'. The survival of cancer cells displaying this amplification requires the expression of multiple genes, emphasising the complexity of the 19q12 amplicon [29]. Located in the 17q24.1 region, AXIN2 is identified as a target and negative feedback regulator of Wnt signalling. This gene is enriched in the Wnt-responsive cell population with Axin2+, particularly in mammary stem cells in the adult mammary gland [30][31][32]. Chromosome arm 17q exhibits a high frequency of rearrangement mutations in breast cancer, with c-erbB-2 gene amplification at 17q11.2-q21.1 having been established as a common variant that has also been correlated with poor prognosis [33,34]. Regions on chromosome 17, including 17q21.31, 17q21.2, and 17q25.3, show amplification without definitive evidence in the current literature. These regions present potential areas for exploration, warranting further research into their role in breast cancer pathogenesis.
The lung cancer sample in Figure 3b is classified with a probability of 0.86 and points to the highest attributions in amplifications in chromosomes 5, 11 and 14, duplications in chromosome 17 and LOH in chromosome 19. In particular, the amplification at 5p15.33 encompasses the genes TERT and CLPTM1L, implicating their potential role in lung cancer aetiology. Previous studies have highlighted the importance of gene amplification in hTERT, establishing it as an independent poor prognostic marker for disease-free survival in non-small cell lung cancer (NSCLC) [35][36][37]. Within the region of 14q13.2-14q21.1, a minimal common region of gain containing MBIP, NKX2-1, NKX2-8, and PAX9 has been identified. The cooperative action of these genes is implicated in lung tumourigenesis, particularly in the context of lung adenocarcinoma in never smokers [38]. The amplification at 5p12.2 emerges as a potential early event in NSCLC, with gene amplification in this region strongly associated with development. This underscores the significance of 5p12.2 as a key player in the molecular landscape of NSCLC [39]. While specific details regarding the amplification at 11q21 are limited, translocation and gain events at this locus indicate its potential importance in the genomic alterations associated with cancer [40]. Identified in a comparative genomic hybridisation study of squamous cell carcinomas (SCC) of the lung, the amplification at 14q12 reveals chromosomal imbalances associated with the metastatic phenotype in lung SCC [41]. Patients harbouring gains of 11q23.3, 11q13.1, and 14q32.3, along with deletions of 3p21.3 and 9p21.3, tend to have poor survival outcomes. These findings emphasise the prognostic implications of genomic alterations in these specific loci [42]. Recurrent alterations, including B2M inactivation and inactivation of antigen presentation-related genes such as CALR, have been linked to increased immune evasion. The loss of BRG1 in NSCLC cells further demonstrates its impact on cellular morphology and tumourigenic potential [43,44]. Overexpression of CCR7 in non-small cell lung cancers is highly associated with lymph node metastasis, underscoring its pivotal role in tumour cell migration [45]. The association of 17q11.2 with HER2 is a critical prognostic and predictive marker in breast cancer, particularly indicating poor breast cancer-specific survival [46][47][48]. Duplications in the 17q21.2 region involve a number of keratins, which are otherwise used as diagnostic tumour markers due to epithelial malignancies retaining keratin patterns associated with their respective cells of origin [49]. Genes such as KRT19 and KRT16 play significant roles in regulating the reprogramming of tumour stem cells, impacting tumour stem cell markers and influencing the growth and metastasis of lung cancer [50,51].
Finally, the colorectal cancer (CRC) sample in Figure 3c is classified with a probability of 0.84 and largely separates its total allele CN and minor allele CN attributions between amplifications and LOH, respectively. The amplification at 13q12.2, specifically involving the CDX2 gene, corroborates a previous study that challenged conventional notions of CDX2 as a tumour suppressor in colorectal cancer. That study instead identified CDX2 as an amplified lineage-survival oncogene required for the proliferation and survival of colorectal cancer cells and implicated Wnt/β-catenin signalling, itself a key oncogenic pathway in colorectal cancer [52]. In the 13q34 cytoband, IRS2 has previously been identified as a likely driver oncogene, offering a potential mechanism for PI3 kinase pathway activation in CRC [53]. A genomic duplication spanning 13q12.2-q12.3, as identified using Genomic Identification of Significant Targets in Cancer (GISTIC), encompasses several genes, including PDX1, ATP5EP2, CDX2, PRHOXNB, FLT3, LOC100288730, PAN3, and FLT1 [54]. Limited literature exists regarding the specific details of 19q12 loss of heterozygosity (LOH) in CRC, with a mention of CCNE1 as a potential player in this context. Cyclin E (CCNE1), a regulator of the cell cycle, is overexpressed in many human tumours, including CRC, and has been proposed to lead to CIN and tumourigenesis by uncoupling cell cycle progression from the regulation of mitosis [55]. This gene was also found to be linked to CIN in a recent study modelling rearrangement formation [56]. In the 17q24.1 region, the association of AXIN2 with CRC is established, with variations in AXIN2 potentially serving as a risk marker for predisposition and prognosis [57]. The role of 17q24.2, housing PRKAR1A, remains inconclusive, with some reports suggesting its potential as a tumour suppressor gene [58]. Limited evidence is available regarding the specific implications of the 19p13.12 locus in CRC. In summary, these results showcase the ability of the xAI approach to find real biological signatures that can provide insight into tumour biology.

Cohort level analysis reveals novel loci implicated in tissue specific tumour biology
To explore the high attribute loci from individual samples across entire cohorts within a specific primary site, we first aggregated all ranked loci based on their frequency within each population (see Methods). Figure 4 displays some of these results for the breast, lung and colorectal cancer cohorts. Figure 4a-c shows the chromosome arm level analysis for these cancers and highlights prominent large-scale events that are already known, such as whole arm deletions.
In breast cancer, regions 16p, 17q, and 1q exhibit high-density signals for total copy numbers, while both arms on chromosomes 11, 16, and 17 show high-density signals for minor allele copy numbers (Figure 4a). In breast cancer, abnormalities of chromosome 17 have long been recognised as pivotal in tumourigenesis. Alterations at specific loci on chromosome 17, including ERBB2 amplification, P53 loss, BRCA1 loss, and TOP2A amplification or deletion, are known to play crucial roles in breast cancer pathophysiology. Numerical aberrations of chromosome 17 are intricately linked to breast cancer initiation, progression, and treatment response [59]. Moreover, whole arm gain of chromosome arm 1q and loss of chromosome arm 16q are among the most frequent genomic events observed in breast cancer. Chromosome arm 16p, found to be a translocation partner to 1q, is assumed to be an early event in breast cancer progression [60]. The gain of chromosome 1q has been mechanistically linked to increased expression of MDM4 and suppression of TP53 signalling. Notably, TP53 mutations are mutually exclusive with 1q aneuploidy in human cancers, suggesting specific aneuploidies play essential roles in breast cancer tumorigenesis [61]. Furthermore, deletion of chromosome arm 11q has been observed in breast tumours. While tumours with 11q deletion do not exhibit a more aggressive phenotype or genotype distinguishing them from those without this deletion, its association with relapse in patients with lymph node-negative breast cancer who did not receive anthracycline-based chemotherapy presents a novel avenue for cancer treatment [61].
For lung cancer, our results underscore the significance of regions on the 5p arm for total copy number and the 9p and 19p arms for the minor allele copy number (Figure 4b). Previous molecular cytogenetic studies have consistently demonstrated chromosomal aberrations on the short arm of chromosome 5 in all major lung tumor types. Notably, gains on 5p have been identified as among the most frequent alterations observed in small cell lung cancer (SCLC), underscoring the importance of this region in lung tumorigenesis [39,62,63]. Furthermore, chromosome arm 9p has emerged as a critical target in lung cancer, with studies highlighting it as the most frequent site of homozygous deletions. This observation suggests that chromosome 9p harbors multiple tumor suppressor genes and/or genomic features fragile during lung carcinogenesis, highlighting its significance in lung cancer pathogenesis [64]. In addition, frequent deletion of chromosome arm 19p has been reported in lung cancer. Studies have shown that simultaneous mutations at LKB1 and BRG1 are common in lung cancer cells, providing evidence of how a single event, such as loss of heterozygosity (LOH) of chromosome 19p, can target multiple tumor suppressors, contributing to the complex molecular landscape of lung cancer [65].
Finally, the colorectal cancer cohort demonstrates a large-scale region of high attributes at 20q and 13q for total copy number signals, and 20q, 18q, and 17p for minor allele copy number signals (Figure 4c). A frequent genomic event observed in CRC is the duplication of chromosome arm 20q. Studies have consistently reported that gain of 20q is observed in more than 65% of CRCs and is associated with poor clinical outcomes. This region harbours multiple oncogenes, such as BCL2L1, AURKA, and TPX2, which play crucial roles in CRC initiation and progression [66][67][68][69]. Additionally, genes such as PLAGL2 and POFUT1 have been identified as driven by copy number amplification on 20q, highlighting their cancer-causing potential in CRC [67,70,71]. These findings suggest that genes on chromosome arm 20q may serve as highly specific biomarkers for CRC with potential clinical applications [66]. Furthermore, LOH of chromosome arm 18q is a common CIN signature in CRC. SMAD4, present in 18q21.1 and encoding a downstream signal transducer of TGF-β, has been implicated as an important tumour suppressor lost by 18q deletion. 18q deletion has been associated with worse prognosis in patients with CRC, underscoring its clinical significance [72][73][74][75]. Additionally, chromosome arm 13q duplication has been identified as part of other arm-level events associated with colorectal adenoma to carcinoma progression. Combinations of genetic aberrations, including 17p loss, KRAS mutation, 8q and 13q gain, and 18q loss and 20q gain, are associated with progressed colorectal adenomas and CRC. These findings suggest the existence of multiple independent chromosomal instability pathways in CRC progression [76]. Moreover, deletion of chromosome arm 1p has been implicated in CRC progression, with primary carcinoma cells with metastatic ability often exhibiting 1p deletions. This suggests that 1p loss may be important both early and late in CRC carcinogenesis [77].
Circular Manhattan plots also show how these loci are distributed across the genome (Figure 4d-f). The identification of such focal events provides a targeted view into the genomic intricacies of the primary site, facilitating a more comprehensive understanding of the underlying biology and potential avenues for therapeutic exploration.
To visualise the attribution distributions across all samples within a cohort and depict the arrangement of consistently highly ranked features, we created heatmaps (Figure 4g-i). The heatmaps illustrate the high attribute features based on their copy state, with the top section representing total copy number and the bottom section representing minor allele copy number. Loci that do not rank high in importance are left uncoloured, appearing as white on the heatmap. To validate our results, we extracted the top ranked cytobands for breast, colorectal, and lung cancer cohorts by selecting all events present in more than 50% of samples. We used existing databases to find and verify these regions. Every region aligned with well known CNA features associated with each cancer type (Supplementary Tables 1-3).
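The cohort-level selection of recurrent events can be sketched as a simple frequency count over per-sample lists of highly attributed loci. The input format (one list of cytoband labels per sample) and the function name are assumptions; only the "more than 50% of samples" cut-off comes from the text.

```python
from collections import Counter


def recurrent_loci(per_sample_top_loci, min_fraction=0.5):
    """Return {locus: fraction of samples} for loci that are highly
    attributed in more than `min_fraction` of the cohort.

    per_sample_top_loci: list of lists of locus labels, one list per sample.
    """
    counts = Counter()
    for loci in per_sample_top_loci:
        counts.update(set(loci))  # count each locus once per sample
    n_samples = len(per_sample_top_loci)
    return {
        locus: count / n_samples
        for locus, count in counts.items()
        if count / n_samples > min_fraction
    }
```

Because the comparison is strict, a locus present in exactly half of the samples is not retained.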

Prognostic value of xAI approach
We considered whether the two primary outputs of our deep learning method, the high attribute loci and the cancer prediction scores, have prognostic value. A plausible scenario is that a locus was discovered within a particular cancer type because it induces phenotypes associated with aggressive disease. Although only limited numbers of CNAs currently have prognostic value, we wanted to examine any new potential in our loci sets, many of which are novel findings. We also considered that the prediction scores, which reflect the karyotypic closeness of a particular patient's cancer to an archetype, may also have prognostic significance. This could be due to better treatment of archetypal diseases, perhaps due to drug suitability. Alternatively, in patients with low prediction scores, diseases might involve some cross-over in pathogenesis, as observed in some gynaecological and gastro-intestinal diseases.
To test the prognostic significance of high attribute loci, we performed an outcome analysis using standard statistics, restricting each test set to a given cancer type and to higher stage diseases (see Methods, n = 5168). Following stage correction, we found that 42 loci were significant, though following multiple testing correction, this was reduced to 4 loci (6q23.1, 6q23.2, 10p14, 13q34), all in breast cancer (Figure 5a). We noted that the hazard ratio of these loci was only slightly lower than that of stage, demonstrating the value of CNAs in some diagnostic settings.
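The text does not state which multiple testing correction was applied. Purely as an illustration of how such a correction shrinks a list of nominally significant loci, the sketch below implements the Benjamini-Hochberg step-up procedure; the choice of this procedure, the significance level and the function name are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    while controlling the false discovery rate at `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected
```

Applied to one p-value per locus, the returned mask marks the loci that remain significant after correction.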
Next we explored the cancer prediction scores, again in higher stage disease, and separately in specific cancers. We partitioned each set of patients into high and low cancer prediction scores using the median within each cancer type, and then used these groups to assess differences in outcome. Surprisingly, we found that a high prediction score was associated with a worse prognosis in adult glioma but a better prognosis in melanoma (Figure 5b). We wondered whether this link between prediction score and outcome might reflect a rediscovery of some underlying genetic factor. In examining ploidy, tumour content, age, and the distribution of sex, we found that the high and low glioma groupings differed by age and the melanoma groupings by ploidy.

Discussion
We developed a methodology to explore rearranged genomes using deep learning and explainable AI approaches. We then applied this to 7500 genomes from the 100,000 Genomes Project. We showed that cancers can be classified with reasonable accuracy based purely on their copy number profiles, indicating that patterns of CNA have some cancer specificity. We demonstrated that guided integrated gradients can be used to pull out biological signals that have prognostic value.
The significance of this research lies in the use of explainable AI to explore biological function. Although deep learning approaches have been used to classify cancer genomes [78], and studies have looked at how CNAs perform in this context [79], no other studies have used such techniques to explore tissue-specific genome changes. This bespoke approach to genomic analysis, wherein the AI model independently learns from genomic data, ensures an unbiased exploration of the genomic landscape. The work has potential clinical implications, as it may contribute to the development of molecular diagnostics and the identification of cancer-specific drug targets. This research not only advances our understanding of the complex genomic landscape of cancer but also aligns with broader efforts to enhance personalised and effective treatment strategies in oncology, marking a significant contribution to the field of genomic medicine.
There are some limitations associated with explainable AI approaches, including the fact that the usefulness of the information depends on the predictive performance of the model on that sample [80]. However, in our current approach using classification, we can filter on classification probability, thereby retaining samples with confident predictions. Moreover, while we used classification, it should be acknowledged that class labels can be fluid, particularly in instances of close proximity between primary sites, as observed in the case of ovarian and endometrial cancers. The genomic landscape, shared mutational profiles, and inherent biological similarities between these proximate primary sites may contribute to challenges in precise classification. Future work will involve the use of unsupervised learning approaches that can alleviate these biases. Additionally, the incorporation of cancer subtype information could also be beneficial under some scenarios.
In summary, the effective classification of cancer primary sites based solely on CNA data is a testament to their significance in cancer biology. Our new approach serves as a valuable tool for further research into the mechanisms of CIN, potentially revealing novel aspects of cancer biology and progression.

Data and preprocessing
The Genomics England 100,000 Genomes Project was an NHS transformation project that took place between 2016 and 2019 [81]. During this time nearly 20,000 participants with cancer consented to the study and had DNA from paired blood and tumour biopsies sequenced to 50X and 100X respectively. The bioinformatic processing utilised an accredited bioinformatic pipeline, which included industry-standard tools and in-house quality control [82].
The Genomics England (GEL) CNA data used in this project were produced using the CANVAS tool [83]. We visualised the relationship between the median proportion of the genome impacted by LOH and the median number of LOH segments in each cancer primary site by mapping genomic states (such as deletion, diploid, LOH, duplication and amplification) to integers. This mapping facilitated the identification and quantification of continuous stretches of each copy state within the samples. LOH segments were specifically extracted and analysed to determine their distribution across the genome. The degree of whole genome doubling (WGD) was calculated using the PCAWG approach [84]. GEL primary sites with fewer than 100 samples, which are insufficient for training a neural network (testicular, endocrine, etc.), and those whose class labels are composed of multiple primary sites or do not follow the same convention (hepatopancreatobiliary, sarcoma, cancer of unknown primary site, etc.) were omitted from the analysis. Furthermore, we defined CNA-driven cancers as those with at least 30% of their genome altered (PGA).
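The identification of continuous stretches of each copy state can be illustrated with a run-length encoding. The following is a minimal sketch only: the integer codes in `STATE_CODES`, the function names and the toy sample are hypothetical, since the actual mapping and implementation are not given here.

```python
from itertools import groupby

# Hypothetical integer codes; the mapping actually used in the study is not stated.
STATE_CODES = {"deletion": 0, "LOH": 1, "diploid": 2, "duplication": 3, "amplification": 4}


def run_lengths(states):
    """Collapse a per-bin sequence of copy states into (code, start_bin, length) runs."""
    codes = [STATE_CODES[s] for s in states]
    runs, start = [], 0
    for code, group in groupby(codes):
        length = sum(1 for _ in group)
        runs.append((code, start, length))
        start += length
    return runs


def loh_summary(states):
    """Return (number of LOH segments, proportion of bins in LOH) for one sample."""
    loh_runs = [run for run in run_lengths(states) if run[0] == STATE_CODES["LOH"]]
    covered = sum(length for _, _, length in loh_runs)
    return len(loh_runs), covered / len(states)


# Toy sample of eight bins (illustrative values, not real data).
sample = ["diploid", "LOH", "LOH", "diploid", "duplication", "LOH", "diploid", "diploid"]
n_segments, proportion = loh_summary(sample)
```

Taking the median of `n_segments` and `proportion` over the samples of each primary site would give the two quantities compared in the text.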
The CNA calls for each patient were organised into tabulated data indicating the CNA state by genomic coordinate for both alleles, along with the major allele. To prepare these data for use by the neural network, we conducted a series of transformations. Each sample was anonymised and processed to generate an individual array, wherein the copy state of the 22 chromosomes was binned into 100 kb segments, each mapped to a single array element. Within each sample, two channels were encoded: one for the total allele CNA state and another for the minor allele CNA state. The resulting array has shape (2, 28749). This data structure has a conserved genome coordinate across all samples, where each element/bin denotes the copy state of a given genomic region of 100 kb. Prior to training the network, the dataset was feature-wise normalised by subtracting the mean and dividing by the standard deviation.
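The binning and feature-wise normalisation steps can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline itself: the rule for bins that span several segments (here, the state at the bin start), the diploid default, the use of the population standard deviation, and all function names are assumptions. Only one channel of a toy chromosome is shown; the second channel would be processed in the same way.

```python
from statistics import mean, pstdev

BIN_SIZE = 100_000  # 100 kb bins, as described in the text


def bin_segments(segments, chrom_length, bin_size=BIN_SIZE, default=2):
    """Give each fixed-width bin the copy state of the segment covering the bin start.

    segments: (start, end, state) tuples with half-open [start, end) coordinates.
    Bins not covered by any segment keep `default` (assumed diploid).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    bins = [default] * n_bins
    for start, end, state in segments:
        first = -(-start // bin_size)              # first bin starting at or after `start`
        last = min(-(-end // bin_size), n_bins)    # one past the last bin starting before `end`
        for i in range(first, last):
            bins[i] = state
    return bins


def zscore_features(matrix):
    """Normalise each bin (column) across samples (rows): subtract mean, divide by SD."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard against constant bins
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]


# Toy 500 kb chromosome with three segments (illustrative values only).
binned = bin_segments([(0, 200_000, 2), (200_000, 350_000, 3), (350_000, 500_000, 1)], 500_000)

# Two toy samples of three bins each, normalised bin by bin.
normalised = zscore_features([[2, 2, 3], [2, 4, 1]])
```

Because every sample is binned on the same coordinates, column `i` refers to the same genomic region in every sample, which is what makes the per-bin normalisation meaningful.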

Deep learning approach
The neural network was trained using PyTorch 1.9.1 [85]. The network architecture is composed of an initial ensemble of 22 chromosome-specific sub-networks, each acting separately on one chromosome in the input (Figure 2). As such, each chromosome sub-network has an input layer with a number of nodes equal to the binned size of the respective chromosome, with chromosome 1 having the largest number of nodes and chromosome 21 the fewest. The model was not trained on either the X or the Y chromosome. The outputs of the chromosome feature extractors then serve as inputs to a deep feed-forward network. The network's input layer comprises 1,100 nodes, and it consists of three layers with 1,000, 50 and 13 nodes, respectively. Each layer is composed of linear, dropout, and LeakyReLU layers, providing the network with a hierarchical structure. The output layer, with 13 nodes, aligns with the number of cancer primary sites in the GEL dataset and employs a SoftMax function to produce the final predictions.
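The routing of the data through this architecture can be sketched as a forward pass. The example below is framework-free and uses toy sizes, whereas the study used PyTorch; the weight initialisation, the single dense layer per sub-network, the flat per-chromosome input vector and all names are assumptions made for illustration. Dropout is omitted because it acts as the identity at inference time. A width of 50 features per chromosome would be consistent with the reported 1,100-node input (22 × 50), but this is an inference and is not stated in the text.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(chrom_inputs, sub_layers, head_layers):
    # Each chromosome is processed by its own sub-network.
    features = []
    for x, (w, b) in zip(chrom_inputs, sub_layers):
        features.extend(leaky_relu(linear(x, w, b)))
    # The concatenated chromosome features feed the shared feed-forward head.
    h = features
    for w, b in head_layers[:-1]:
        h = leaky_relu(linear(h, w, b))
    w, b = head_layers[-1]
    return softmax(linear(h, w, b))


rng = random.Random(0)
bins_per_chrom = [12, 10, 8]  # toy sizes; the real model has 22 chromosomes
features_per_chrom = 5        # toy width of each sub-network output
head_sizes = [20, 6, 4]       # toy stand-in for the reported 1,000, 50 and 13 nodes

sub_layers = [init_layer(n, features_per_chrom, rng) for n in bins_per_chrom]
widths = [features_per_chrom * len(bins_per_chrom)] + head_sizes
head_layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(widths, widths[1:])]

chrom_inputs = [[rng.gauss(0, 1) for _ in range(n)] for n in bins_per_chrom]
probs = forward(chrom_inputs, sub_layers, head_layers)
```

Keeping one sub-network per chromosome means that each input bin retains a fixed genomic meaning, which the explainability analysis later relies on.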
During training, a weighted sampler was used to address the class imbalance in the GEL data (see Figure 1a). The neural network was trained using the stochastic gradient descent optimiser with a learning rate of 0.001. Cross-entropy was used as the loss function. An architecture such as this tends to be less favourable for applications that necessitate handling geometric transformations, including but not limited to permutation invariance, particularly in domains like image data. In contrast, our proposed approach leverages the independence of each input feature. By encoding the data into a unified genomic coordinate space, this methodology facilitates the learning of mutational patterns. Consequently, it enables interpretability regarding each sample's class without the necessity for metadata associated with individual input features.
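The effect of a weighted sampler on an imbalanced cohort can be illustrated with inverse class-frequency weights. The weighting scheme is an assumption, as the weights used are not stated, and the cohort below is a toy example. In PyTorch, per-sample weights of this kind can be passed to `torch.utils.data.WeightedRandomSampler`; the sketch uses only the standard library.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-sample weight 1 / (class count), so every class has equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Toy imbalanced cohort: 90 samples of one site, 10 of another (illustrative only).
labels = ["breast"] * 90 + ["glioma"] * 10
weights = inverse_frequency_weights(labels)

# Draw with replacement, as a weighted sampler would over one long epoch.
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10_000)
drawn_counts = Counter(drawn)
```

With these weights the two classes are drawn at roughly equal rates, so minority primary sites contribute to as many gradient updates as the majority ones.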

AI explainability
To infer the most important features that determine the classification of primary sites by the trained network, we employed the guided integrated gradients (GIG) method [26]. Unlike conventional integrated gradients (IG), adaptive path methods such as GIG reduce the noise accumulated along the IG path by defining a path integral between the baseline and input sample, conditioning the path not just on the input sample but also on the model being explained (Figure 2). In our model, the input data are highly disparate across samples, so the explainability results can be sensitive to the choice of baseline. In particular, the conventional baseline for image data, a black image or an array of zero values, is counterproductive for data composed of (feature-normalised) CNA states. Instead, the median CNA state of a sample across a given channel is chosen as the baseline from which to start the GIG path integral, which ends at the sample CNA state values.
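The role of the median baseline can be illustrated with conventional integrated gradients on a toy model. Note that this sketch implements plain IG along a straight-line path, not GIG, which adapts the path to the model; the logistic toy model, its weights and the input values are all hypothetical.

```python
import math
from statistics import median


def model(x, w):
    """Toy differentiable classifier score: sigmoid(w . x)."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def gradient(x, w):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of IG along the straight-line path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [0.2, 1.5, -0.7, 0.2, 0.2]   # toy normalised CNA values for one channel
w = [0.5, 1.0, -2.0, 0.3, 0.1]   # toy model weights
baseline = [median(x)] * len(x)  # median CNA state of the sample as the baseline
attributions = integrated_gradients(x, baseline, w)
```

Bins whose value equals the sample median receive zero attribution, so the score concentrates on loci that deviate from the sample's typical copy state, and the attributions sum approximately to the difference in model output between the sample and the baseline.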

Cohort-level explainability analysis
To aggregate the explainability score across each primary site cohort, we selected the top 1% of loci within a sample and then selected those resulting loci that were present in more than 5% of the cohort. This conservative approach allowed us to visualise patterns across the primary sites and to verify known biological features (Figure 4).
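This two-stage selection can be sketched as follows. The ranking criterion (raw attribution score, highest first), the handling of ties and the function names are assumptions, and the cohort is a synthetic toy example.

```python
import math
from collections import Counter


def top_loci(scores, fraction=0.01):
    """Indices of the highest-scoring `fraction` of loci within one sample."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def recurrent_loci(cohort_scores, fraction=0.01, min_prevalence=0.05):
    """Loci that fall in the per-sample top set in more than `min_prevalence` of samples."""
    counts = Counter()
    for scores in cohort_scores:
        counts.update(top_loci(scores, fraction))
    n_samples = len(cohort_scores)
    return sorted(i for i, c in counts.items() if c / n_samples > min_prevalence)


# Toy cohort: 40 samples x 200 loci. Locus 7 scores highly in every sample,
# while each sample also has one private high-scoring locus.
cohort = []
for s in range(40):
    scores = [0.0] * 200
    scores[7] = 1.0        # shared high-attribution locus
    scores[50 + s] = 0.5   # sample-specific locus
    cohort.append(scores)

recurrent = recurrent_loci(cohort)
```

In the toy cohort only the shared locus survives, since each private locus appears in 2.5% of samples and so falls below the 5% prevalence threshold.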
To provide a more statistically robust assessment of the significant loci for the clinical outcome analysis, we performed an association study following previous work [25]. This comprised a locus-dependent t-test on the raw attribute signal, where the null distribution was a sample of attribute scores at that locus in other primary sites. The null distribution contained a number of points equal to the number of samples in the test cohort. To reduce variability in this testing procedure, we repeated the t-test ten times and calculated the average p-value. The p-values were corrected using the Bonferroni multiple testing correction. Loci were further processed to remove chromosome arm-level regions, leaving only focal events for the clinical outcome analysis.
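The structure of this testing procedure can be sketched with the standard library. The sketch computes only the t statistic for each of the ten repeats; converting each statistic to a p-value requires the t-distribution CDF (available, for example, through `scipy.stats.ttest_ind`) and is omitted. All scores and p-values below are toy values, and the function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample t statistic; with equal group sizes the pooled and Welch forms coincide."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def draw_null(other_site_scores, n, rng):
    """Sample n attribution scores at the same locus from the other primary sites."""
    return rng.sample(other_site_scores, n)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


rng = random.Random(0)
test_scores = [0.8, 0.9, 1.0, 1.1, 1.2]          # toy scores at one locus, test cohort
other_scores = [i / 100 for i in range(50)]      # toy scores at that locus, other sites

# Ten repeats, each with a fresh null sample the same size as the test cohort.
t_values = [
    t_statistic(test_scores, draw_null(other_scores, len(test_scores), rng))
    for _ in range(10)
]

# Toy averaged p-values for three loci, then corrected for three tests.
corrected = bonferroni([1e-6, 0.03, 0.5])
```

Averaging over repeated null draws reduces the dependence of the result on any single random sample from the other primary sites.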

Clinical outcome analysis
Outcome data were available for the majority of patients, including date of death and follow-up interval. We focused on higher-stage disease, which is associated with fatal conditions (stage > 1, n = 5168). Data from the high attribute loci and cancer prediction scores were subjected to Kaplan-Meier (log-rank) and Cox proportional hazards testing, the latter correcting for cancer stage in all types bar glioma. The R survival and survminer packages were used for this purpose, and multiple testing correction was performed using the Benjamini-Hochberg method on any reported p-values. Visualisation was handled by the standard plotting functions included in the packages: ggsurvplot and ggforest.

Fig. 1 Genomic landscape analysis across cancer types in the GEL dataset. (a) The proportion of samples in the GEL dataset categorised by their respective cancer primary sites. (b) A violin plot of the distributions of non-diploid bins across the genome according to each cancer primary site. (c) A heatmap representing the frequency of non-diploid statuses at various Rosetta gene loci across multiple cancer cohorts. Each column corresponds to a different cancer primary site, with the colour intensity indicating the proportion of samples exhibiting non-diploid status at each locus. Adjacent to the heatmap, stacked bar charts for each gene locus quantify the copy number variations. The number of samples per cohort is annotated within the heatmap cells, providing a direct correlation between sample size and observed genetic alterations. (d) The correlation between the median proportion of the genome affected by loss of heterozygosity (LOH) and the median frequency of LOH stretches in the samples for each cancer primary site. An LOH segment is defined as a continuous segment of the genome where heterozygosity is lost, indicating a uniform genetic state in that region.
The training process involved a batch size of 100 samples per iteration. The training dataset was randomly split into 70% training and 30% validation sets to monitor the model's performance during training. Training ran for 50 epochs, at which point it was terminated by early stopping to prevent overfitting. During training, dropout layers with a rate of 0.5 were utilised to enhance model generalisation.
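The split and the early stopping rule can be sketched as follows. The stopping criterion (no improvement in validation loss for a fixed number of epochs), the `patience` value, the random seed and the loss values are all hypothetical, since the exact rule is not described.

```python
import random


def train_val_split(indices, val_fraction=0.3, seed=0):
    """Randomly split sample indices into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience  # hypothetical value
        self.best = float("inf")
        self.stale = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
        return self.stale >= self.patience


train_idx, val_idx = train_val_split(range(1000))

stopper = EarlyStopping(patience=3)
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]  # toy validation losses per epoch
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In practice the model weights from the best validation epoch are usually restored after stopping, although whether this was done here is not stated.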

Fig. 2 Analysis of cancer classification using neural network and Guided Integrated Gradients. (a) Schematic of the neural network architecture, illustrating the input from whole genome sequencing (WGS) samples. Each chromosome's binned data feeds into a sub-network, culminating in a deep feed-forward network with hierarchical layers. (b) Confusion matrix for each cancer primary site, accompanied by precision and recall metrics, providing a detailed assessment of the model's classification performance. (c) Regression analyses: F1 score and classification accuracy against Median Proportion of Genome Altered (PGA) with a second-order polynomial regression line, and against sample size with a linear regression line, both with 95% confidence interval shading. (d) Cartoon depiction of the Guided Integrated Gradients (GIG) method, tracing the process from sample classification prediction to genome-wide explainability. A comparison is drawn between Integrated Gradients (IG) and GIG, highlighting the path selection mechanism in GIG for enhanced gradient accumulation analysis.

Fig. 3 AI explainability in single whole genome sequencing samples across different cancer types. (a) One breast cancer sample, (b) one lung cancer sample, and (c) one colorectal cancer sample, each showcasing CNAs across the genome. The top part of each panel displays the total copy number for both alleles, while the bottom shows the minor allele copy number. High-importance genomic loci, identified by the model, are colour-coded by chromosome, with unselected loci in grey. Accompanying tables detail the copy state (Diploid, Duplication, Amplification, Deletion, LOH), cytoband, presence of oncogenes, and enhancers for the high attribute loci in both total and minor allele signals. Table rows are colour-coded based on the magnitude of normalised attribution values from the GIG analysis, highlighting key loci contributing to the AI's classification decision.