Detecting oncogenic selection through biased allele retention in The Cancer Genome Atlas

Background The loss of genetic diversity in segments over a genome (loss-of-heterozygosity, LOH) is a common occurrence in many types of cancer. By analysing patterns of preferential allelic retention during LOH in approximately 10,000 cancer samples from The Cancer Genome Atlas (TCGA), we sought to systematically identify genetic polymorphisms currently segregating in the human population that are preferentially selected for, or against during cancer development. Results Experimental batch effects and cross-sample contamination were found to be substantial confounders in this widely used and well studied dataset. To mitigate these we developed a generally applicable classifier (GenomeArtiFinder) to quantify contamination and other abnormalities. We provide these results as a resource to aid further analysis of TCGA whole exome sequencing data. In total, 1,678 pairs of samples (14.7%) were found to be contaminated or affected by systematic experimental error. After filtering, our analysis of LOH revealed an overall trend for biased retention of cancer-associated risk alleles previously identified by genome wide association studies. Analysis of predicted damaging germline variants identified highly significant oncogenic selection for recessive tumour suppressor alleles. These are enriched for biological pathways involved in genome maintenance and stability. Conclusions Our results identified predicted damaging germline variants in genes responsible for the repair of DNA strand breaks and homologous repair as the most common targets of allele biased LOH. This suggests a ratchet-like process where heterozygous germline mutations in these genes reduce the efficacy of DNA double-strand break repair, increasing the likelihood of a second hit at the locus removing the wild-type allele and triggering an oncogenic mutator phenotype.


48
Loss-of-heterozygosity (LOH) describes the somatic loss of genetic material from one copy of 49 a heterozygous locus. It can occur as a consequence of whole or partial chromosome deletion, 50 or as a copy-neutral event, in which one copy is replaced by the other -for example through 51 homologous repair 1 or locus duplication followed by loss of the non-duplicated allele. LOH 52 often occurs as the 'second-hit' in tumour initiation, where somatic loss of the wild-type (WT) 53 copy opposite either a germline or somatic mutation drives cancer progression 2,3 . 54 55 Previous studies of LOH have sought to identify novel tumour suppressor genes by mapping 56 patterns of LOH in tumours, but were hampered by low-resolution data and inadequate sample 57 sizes 4 . A more recent study in ovarian cancer overlapped recurrent regions of LOH with somatic 58 mutation data, and whilst their results revealed strong selection of known cancer genes (deletion 59 of the WT allele in 94% of cases with deleterious somatic TP53 or BRCA1 mutations), it failed 60 to reveal novel drivers 5 . In contrast, studies in cutaneous squamous cell carcinoma, ovarian 61 cancer and colorectal cancer revealed evidence of preferential allelic imbalance of putative 62 germline risk variants 5-7 , indicating that LOH may also have a role in the selection of small-63 effect, inherited, cancer-predisposing variants. By systematically quantifying biased allele loss 64 or retention across a large cohort of whole exome sequencing (WXS) data, we sought to explore 65 genetic selection of common cancer-associated variants during cancer progression. 66

67
The Cancer Genome Atlas (TCGA) is a public resource of genomic, clinical and associated 68 data from over 10,000 patients across 36 types of cancer, including WXS from matched 69 tumour:normal sample pairs 8 . This wealth of data is extensively used and a valuable resource 70 in the field of cancer genomics, but is subject to the influence of batch effects and technical 71 artefacts 9-12 . To control for these effects we performed a systematic analysis of mapping and 72 sequencing artefacts, exome target capture kit biases and contamination in TCGA, and 73 developed a workflow to identify and remove these confounding influences. Strikingly we 74 found evidence of contamination and other issues in 1,678 pairs of samples (14.7%), a result 75 that may have had unforeseen impact on previous analyses performed using this dataset. 76 77 After rigorous filtering of the data, we did not identify preferential retention of common 78 variants via LOH, although we did observe an overall trend for biased retention of cancer-79 associated risk alleles identified by genome wide association study (GWAS). Subsequent

Initial analysis of biased allele retention during loss-of-heterozygosity in tumours 89
Analysis of biased LOH was performed using paired normal and primary tumour samples from 90 9,905 patients across 36 cancer subtypes. Germline variants were called using Strelka2 13 and 91 GATK HaplotypeCaller 14 , and subsequent LOH calling was performed using CloneCNA 15  2.5e-23). After cross-referencing our data with gnomAD 16 , a database of human genetic 104 variation that includes the normal samples within TCGA; we found that the proportion of 105 significantly LOH-biased variants failing gnomAD variant filters or missing from the gnomAD 106 database was higher than seen for non-significant variants (Figure 1c; OR = 25.33, Fisher's 107 exact test, p-value = 1.1e-37). These results indicate that many of the significant allele retention 108 biases we detected were the result of artefactual variants. Motivated by these results we 109 undertook a systematic interrogation of TCGA WXS data covering three broad areas: 1) 110 mapping and sequencing artefacts, 2) exome target capture kit biases and 3) sample specific 111 abnormalities including contamination. 112 113

Systematic detection of artefactual and unreliable germline heterozygous variants 114
We first removed all variants that failed the gnomAD filters or were missing from the gnomAD 115 database, therefore focusing our analysis on consensus, high-quality germline variants. We 116 found that the genomic regions with consistently low VAF giving rise to likely false biased 117 allele retention signals, typically fall into one of two categories. First, lower alternative allele 118 read-mapping rates in haplotype segments with clusters of non-reference alleles (example locus 119 shown in Supplementary Figure 1a Low VAF can often occur due to stochastic sampling of reads at low coverage and is not always 125 indicative of an underlying problem; as such, a standard VAF thresholding measure is not 126 enough to identify error-prone or hard-to-map loci. Consequently we developed a binomial-test 127 based reliability score to predict the likelihood of a variant being 'true' given its observed VAF 128 and read-depth across all normal heterozygous samples (Materials and Methods, Figure 1d). In 129 total, 13,088 variants (8.7% of common variants) failed our threshold, and were subsequently 130 removed (Figure 1e,f) -this included all 26 variants previously identified as significant in the 131 LOH bias analysis, which passed gnomAD filters. 132    (Figure 2a), demonstrating a kit specific effect on the raw sequence data, 154 consistent with prior observations 17,18 . Additionally, we found that on average, individuals 155 sequenced by the same kit had significantly more variants in common than those sequenced 156 using a different kit (mean correlation coefficient 1.2 fold higher, t-test, p-value < 2.2e-16; 157

Evidence of contamination in samples from The Cancer Genome Atlas 206
The final step of our interrogation of technical artefacts in TCGA WXS data was to investigate 207 sample specific abnormalities. The VAF distribution of germline heterozygous variants in 208 matched tumour:normal samples can be used to infer characteristics of the tumour -such as the 209 fraction of the genome subject to LOH (Figure 3a,b), and the cellularity of the sample (Figure  210 3c). By quantifying deviations from the expected distribution, we can identify other 211 abnormalities -such as contamination with genetic material from a different individual ( Figure  212 3d-g) and low quality sequencing data (Figure 3h were sub-classified based upon their reason for thresholding, as: low quality with few total 249 variants (LQn, n=79); low quality with highly dispersed VAFs (LQd, n=8); high normal sample 250 specific contamination (NC, n=12); tumour sample specific contamination (TC, n=23); high 251 contamination of both the tumour and normal sample (NTC, n=23) and other (O, n=11). LQn 252 samples had significantly fewer somatic SNPs than non-contaminated (median total somatic 253 SNPs 3.5 fold lower, Mann-Whitney U Test, p = 9.3e-26), whilst NC and TC had significantly 254 more (Mann-Whitney U Test, p < 0.001; Table 1; Supplementary Figure 6a). We then 255 calculated the proportion of each sample's somatic SNVs that appeared in gnomAD -allowing 256 us to infer the relative extent to which germline variants are being erroneously called as somatic 257 SNVs. Using a linear regression model, we hence estimated the enrichment of somatic 258 gnomAD variants within each group. All groups except C and NC had a significant increase in 259 the proportion of somatic gnomAD variants compared to non-contaminated samples (p < 0.05), 260 with the highest enrichment seen in TC samples (Supplementary Figure 6b; Table 1). For 261 samples with evidence of tumour-sample contamination (TC, NTC), the increase in somatic 262 gnomAD variants is likely due to mis-calling of contaminating germline variants in the tumour 263 sample as somatic SNVs. In the low-quality samples (LQn, LQd), it seems likely that true-264 germline variants missed in the normal sample are erroneously called as somatic due to the 265 lower stringency required for calling variants in tumour samples.  Table 2). These pathways were split between five biological 382 processes: DNA repair (n=14), gene expression (n=5), cell cycle (n=4), metabolism of proteins 383 (n=1) and reproduction (n=1). In total, the significant pathways included 232 unique genes, 12 384 of which, when removed from the analysis, cause at least one pathway to drop below the 385 threshold for significance, demonstrating a significant contribution towards the overall burden 386 of that pathway (Figure 6g Table 1). Furthermore, 475 our results and methodology provide a broadly applicable framework that can be used to profile 476 contamination, low quality data and other technical issues using germline variant data from 477 matched tumour:normal samples. To enable the general application of these quality control 478 metrics to both user-generated and public data we make the GenomeArtiFinder software 479 package available (https://git.ecdf.ed.ac.uk/taylor-lab/GenomeArtiFinder).

Availability of data and materials 514
Details of all software and packages used in the analysis are in Supplementary Table 6. 515 516 All analyses were performed using the Genomic Data Common (GDC) data harmonization and 517 generation pipeline GRCh38 reference sequence (GRCh38.d1.vd1.fa, available from: 518 https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-ference-files). 519 Gapfiller_7m (all patients sequenced using this kit failed filtering steps) and SureSelect_50Mb 538 (only used to sequence 5 patients); consequently BED files from eight exome target capture 539 kits were downloaded (Supplementary Table 7). Files were lifted over to hg38 then intersected 540 using BEDTools to produce a consensus file of genomic regions common to all eight kits. 541 Finally, LCR and SegDup regions were then subtracted from the exome target region file as 542 heterozygous)]). We expect that the ORAC of NHET samples should be equally distributed 590 around 1, therefore by using the NHET samples as a control group, we can control for overall 591 bias in the distribution of ORAC at a given loci. Exome-wide LOH bias tests were performed 592 for the pan-cancer dataset (all versus all), and on a cancer subtype specific basis (test group 593 versus all). 594 595

Cancer GWAS Variants 596
Using the EMBL-EBI GWAS Catalog, all previously reported cancer-associated variants were 597 downloaded (trait = cancer, EFO ID = EFO_0000311, 5,552 associations from 580 studies). 598 After processing, genome information was extracted for 4,723 unique SNPs. SNPs were 599 intersected with our dataset, leaving 217 cancer associated exome variants. For each cancer 600 subtype, associated traits were queried with related keywords to extract cancer subtype specific 601 SNP associations. LOH bias was calculated for each SNP as described above. referencing these variants, we found that although they were rare across the total dataset, they 670 were enriched within this subset of samples. We consequently compared their observed hetFreq 671 within these 25 patients to their expected hetFreq (estimated from the gnomAD population 672 allele frequency using Hardy-Weinberg equilibrium) and filtered any variant that was more 673 than fivefold enriched, removing 170 variants in total. All analyses were performed having pre-674 filtered these variants from all samples.