Comparative assessment of genes driving cancer and somatic evolution in noncancer tissues: an update of the NCG resource

Genetic alterations of somatic cells can drive nonmalignant clone formation and promote cancer initiation. However, the link between these processes remains unclear, hampering our understanding of tissue homeostasis and cancer development. Here we collect a literature-based repertoire of 3355 well-known or predicted drivers of cancer and noncancer somatic evolution in 122 cancer types and 12 noncancer tissues. Mapping the alterations of these genes in 7953 pancancer samples reveals that, despite its large size, the known compendium of drivers is still incomplete and biased towards frequently occurring coding mutations. High overlap exists between drivers of cancer and noncancer somatic evolution, although significant differences emerge in their recurrence. We confirm and expand the unique properties of drivers and identify a core of evolutionarily conserved and essential genes whose germline variation is strongly counter-selected. Somatic alteration in even one of these genes is sufficient to drive clonal expansion but not malignant transformation. Our study offers a comprehensive overview of our current understanding of the genetic events initiating clone expansion and cancer, revealing significant gaps and biases that still need to be addressed. The compendium of cancer and noncancer somatic drivers, their literature support and properties are accessible at http://www.network-cancer-genes.org/.


BACKGROUND
Genetic alterations conferring selective advantages to cancer cells are the main drivers of cancer evolution and hunting for them has been at the core of international cancer genomic efforts 1,2,3. Given the instability of the cancer genome, distinguishing driver alterations from the rest relies on analytical approaches that identify genes altered more frequently than expected or quantify the positive selection acting on them 4,5,6. The results of these analyses have greatly expanded our understanding of the mechanisms driving cancer evolution, revealing high heterogeneity across and within cancers 7,8,9. Recently, deep sequencing screens of noncancer tissues have started to map positively selected genetic mutations in somatic cells that drive in situ formation of phenotypically normal clones 10,11. Many of these mutations hit cancer drivers, sometimes at a frequency higher than in the corresponding cancer 12,13,14,15,16. Yet, they do not drive malignant transformation. This conundrum poses fundamental questions on how genetic drivers of normal somatic evolution are related to and differ from those of cancer evolution. Addressing these questions will clarify the genetic relationship between tissue homeostasis and cancer initiation, with profound implications for cancer early detection.
To assess the extent of the current knowledge on cancer and noncancer drivers, we undertook a systematic review of the literature and assembled a comprehensive repertoire of genes whose somatic alterations have been reported to drive cancer or noncancer evolution. This allowed us to compare the current driver repertoire across and within cancer and noncancer tissues and map their alterations in the large pancancer collection of samples from The Cancer Genome Atlas (TCGA).
This revealed significant gaps and biases in our current knowledge of the driver landscape. We also computed an array of systems-level properties across driver groups, confirming the unique evolutionary path of driver genes and their central role in the cell.
We collected all cancer and noncancer driver genes, together with a large set of their properties, in the Network of Cancer Genes and Healthy Drivers (NCG HD) open-access resource.

More than 3300 genes are canonical or candidate drivers of cancer and noncancer somatic evolution
We conducted a census of currently known drivers through a comprehensive literature review of 331 scientific articles published between 2008 and 2020 describing somatically altered genes with a proven or predicted role in cancer or noncancer somatic evolution (Figure 1A). These publications included three sources of experimentally validated (canonical) cancer drivers, 311 sequencing screens of cancer (293) and noncancer (18) tissues and 17 pancancer studies (Supplementary Table 1, Additional File 1). Each paper was assessed by at least two independent experts (Supplementary Figure 1A-C, Additional File 2) returning a total of 3355 drivers, 3347 in 122 cancer types and 95 in 12 noncancer tissues, respectively (Figure 1A). We further computed the systems-level properties of drivers and annotated their function, somatic variation and drug interactions (Figure 1A).
We reviewed the three sources of canonical cancer drivers 17,18,19 to exclude false positives (Supplementary Table 2, Additional File 1) and fusion genes whose properties could not be mapped. Only 11% of the resulting 591 canonical drivers (Supplementary Table 3, Additional File 1) were common to all three sources (Figure 1B), indicating poor consensus even in well-known cancer genes. We further annotated the genetic mode of action for >86% of canonical drivers, finding comparable proportions of oncogenes and tumour suppressors (Figure 1C). The rest had a dual role or could not be unambiguously classified.
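The consensus figure quoted above (the share of canonical drivers common to all three sources) can be illustrated with a short set computation. This is a minimal sketch, assuming that the percentage is taken over the union of the sources after filtering; the function name, source labels and gene lists are hypothetical toy values, not the curated data.

```python
def consensus_fraction(sources):
    """Return the fraction of the union of genes that appears in every source.

    `sources` maps a source label to an iterable of gene symbols.
    Assumption: consensus is measured against the union of all sources.
    """
    gene_sets = [set(genes) for genes in sources.values()]
    if not gene_sets:
        return 0.0
    union = set().union(*gene_sets)
    common = set.intersection(*gene_sets)
    return len(common) / len(union) if union else 0.0


# Toy gene lists for illustration only (hypothetical, not the curated sources).
toy_sources = {
    "source_a": ["TP53", "KRAS", "PTEN", "BRAF"],
    "source_b": ["TP53", "KRAS", "EGFR"],
    "source_c": ["TP53", "MYC", "KRAS", "RB1"],
}
fraction = consensus_fraction(toy_sources)
```

In this toy case two of the seven distinct genes are shared by all three lists, so a low consensus can coexist with each list being individually well supported.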
We extracted additional cancer drivers from the curation of 310 sequencing screens that applied a variety of statistical approaches (Supplementary Figure 1D Table 3, Additional File 1). Therefore, 170 canonical drivers have never been detected by any method, suggesting that they may exert their role through non-mutational mechanisms or may fall below the detection limits of current approaches. Given the prevalence of cancer coding screens (Figure 1A), only coding driver alterations have been reported for most genes (Figure 1E), while 16% of them (531) were identified as drivers uniquely in noncoding screens. Since the prediction of drivers with noncoding alterations remains challenging, we further investigated the type of support that these genes had for their driver activity. The overwhelming majority of them (467 genes, 87%) have been predicted as drivers in only one screen.
The remaining 64 genes are canonical drivers, have been predicted as drivers in multiple screens or have additional experimental support for their driver activity.
To compare cancer and healthy drivers across and within tissues, we grouped the 122 cancer types and 12 noncancer tissues into 12 and seven organ systems, respectively (Methods).
Despite the high numbers of sequenced samples (Supplementary Table 4 Our analysis also showed that the contribution of noncoding driver alterations remains largely unappreciated and noncoding drivers have not yet been reported in several tumours, including all paediatric cancers (Figure 2D). Owing to the re-analysis of large whole genome collections 21,22,23,24,25,26, almost 40% of adult pancancer drivers were instead modified by noncoding alterations (Figure 2D). Haematological and skin tumours also had a high proportion of noncoding driver variants thanks to screens focused on noncoding mutations 27,28. Therefore, the re-analysis of already available whole genome data and further sequencing screens of noncoding variants are needed to fully appreciate their driver contribution.  Figure 2G). Therefore, differences start to emerge at the tissue level between drivers of cancer and noncancer evolution.
Moreover, unlike for cancer drivers, no correlation existed between numbers of drivers and donors (Figure 2I). This is likely affected by the lower number of noncancer sequencing studies available so far. If additional studies confirm the absence of correlation, this may indicate that the healthy driver repertoire is easier to saturate since fewer drivers are needed to initiate and sustain noncancer clonal expansion 10,11.

Alteration pattern hints at driver mode of action and confirms the incompleteness of the driver repertoire
To gain further insights into their mode of action, we mapped the type of alterations acquired by cancer and healthy drivers in 34 cancer types from TCGA.
After predicting the damaging alterations in 7953 TCGA samples with matched mutation, copy number and gene expression data (Methods), we identified the drivers with loss-of-function (LoF) and gain-of-function (GoF) alterations in these samples (Figure 3A).
The comparison between canonical cancer drivers detected and undetected in sequencing screens (Figure 1D) revealed that the latter were damaged in a significantly lower number of samples, due to fewer LoF alterations ( Figure  comparable between the two groups, suggesting that current driver detection methods fail to identify drivers that undergo copy number gains but are rarely mutated. We confirmed that the driver alteration patterns reflected their mode of action, with canonical tumour suppressors and oncogenes showing a prevalence of LoF and GoF alterations, respectively (Figure 3C). Canonical drivers with a dual role resembled the alteration pattern of oncogenes while those still unclassified had a prevalence of LoF alterations, suggesting a putative tumour suppressor role (Figure 3C). While all frequently altered (>500 samples) oncogenes were overwhelmingly modified by GoF alterations (Supplementary Table 6 The number of damaged cancer drivers in individual TCGA samples confirmed that, despite all efforts, the current driver repertoire is still largely incomplete. The large majority of samples (71% and 87%, considering all drivers or only canonical drivers, respectively) had fewer than five damaged drivers and ~15% of them had no damaged driver (Figure 3G).
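The per-sample tally behind statements such as "fewer than five damaged drivers" can be sketched as follows. This is an illustrative outline under stated assumptions, not the pipeline used for the TCGA analysis: the function name, the input layout (a mapping from sample to damaged genes) and the toy data are all hypothetical.

```python
def damaged_driver_summary(sample_to_damaged_genes, drivers, threshold=5):
    """Count damaged drivers per sample and summarise the distribution.

    Returns a tuple: (per-sample counts,
                      fraction of samples with fewer than `threshold` damaged drivers,
                      fraction of samples with no damaged driver).
    """
    if not sample_to_damaged_genes:
        raise ValueError("at least one sample is required")
    driver_set = set(drivers)
    # A driver is counted once per sample, however many alterations it carries.
    counts = {
        sample: len(set(genes) & driver_set)
        for sample, genes in sample_to_damaged_genes.items()
    }
    n_samples = len(counts)
    below = sum(1 for c in counts.values() if c < threshold) / n_samples
    none = sum(1 for c in counts.values() if c == 0) / n_samples
    return counts, below, none


# Toy input for illustration only (hypothetical samples and genes).
toy_drivers = {"TP53", "KRAS", "PIK3CA"}
toy_samples = {
    "sample_1": ["TP53", "KRAS", "TTN"],
    "sample_2": ["TTN"],
    "sample_3": [],
    "sample_4": ["PIK3CA"],
}
counts, frac_below, frac_none = damaged_driver_summary(
    toy_samples, toy_drivers, threshold=2
)
```

Running the same function once with all drivers and once with canonical drivers only would give the two percentages of the kind reported in the text.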
Given their high overlap with cancer drivers, most healthy drivers were recurrently damaged in cancer samples with no prevalence of GoF or LoF alterations (Figure 3H, Supplementary Table 6, Additional File 1). Interestingly, all healthy drivers, even the eight with no cancer involvement, were damaged in significantly more cancer samples than the rest of human genes (Figure 3I). Moreover, 57% of TCGA samples had at least two altered drivers, one of which was a healthy driver, further supporting the hypothesis that more than one driver may be needed to promote transformation of nonmalignant clones into cancer 10,11.
Canonical and candidate healthy drivers correspond to genes with a known or predicted cancer driver role.
i. Number of TCGA samples with damaged canonical, candidate and remaining healthy drivers and the rest of human genes.
All distributions were compared using a two-sided Wilcoxon rank-sum test.

Properties of cancer and healthy drivers support their central role in the cell
A substantial body of work, including our own 44,45,46,47,48,49,50,51,52,53, has shown that cancer drivers differ from the rest of genes in an array of systems-level properties (Figure 1A) that are a consequence of their unique evolutionary path and role in the cell. Using our granular annotation of drivers, we set out to check for similarities and differences across driver groups.
We confirmed that cancer drivers, and in particular canonical drivers, were more We further expanded the systems-level properties of cancer drivers by exploring their tolerance towards germline variation, because this may indicate their essentiality.
Using germline data from healthy individuals 54, we compared the loss-of-function observed/expected upper bound fraction (LOEUF) score, which quantifies selection towards LoF variation 54, as well as the number of damaging mutations and structural variants (SVs) per coding base pairs (bp) between drivers and the rest of genes (Methods). Cancer drivers, and in particular canonical drivers, had a significantly lower LOEUF score and retained fewer damaging germline mutations and SVs than the rest of genes (Figure 4A). This indicates that they are indispensable for cell survival in the germline. Selection against harmful variation was stronger in tumour suppressors than oncogenes (Figure 4B). This was supported by a significantly higher proportion of cell lines where cancer drivers, and in particular tumour suppressors, were essential (Figure 4A-B), as gathered from the integration of nine genome-wide essentiality screens 55,56,57,58,59,60,61,62,63 (Methods).
Genes with noncoding driver alterations had weaker systems-level properties than those with coding alterations (Figure 4C, Supplementary Table 7 Systems-level properties of healthy drivers varied according to the overlap with cancer drivers (Figure 4G, Supplementary Table 7, Additional File 1). Intriguingly, canonical healthy drivers showed stronger systems-level properties than any other group of drivers. In particular, they were enriched in evolutionarily conserved and broadly expressed genes encoding highly inter-connected proteins and regulated by many miRNAs. Moreover, these genes showed a strong selection against germline variation and high enrichment in essential genes (Figure 4G). They therefore represent a core of genes with a very central role in the cell, whose modifications are not tolerated in the germline but are selected for in somatic cells because they confer selective growth advantages. Candidate healthy drivers and those not involved in cancer had a substantially different property profile (Figure 4G). Although numbers are too low for any robust conclusion, it is tempting to speculate that genes able to initiate noncancer clonal expansion but not tumourigenesis may follow a different evolutionary path.

Figure 4. Systems-level properties of cancer and healthy drivers
Comparisons of systems-level properties between (a) canonical or candidate cancer drivers and the rest of human genes; (b) tumour suppressors and oncogenes, (c) cancer genes with coding driver alterations and cancer genes with noncoding driver alterations. The normalised property score was calculated as the normalised difference between the median (continuous properties) or proportion (categorical properties) values in each driver group and the rest of human genes (Methods).
g. Comparisons of systems-level properties between canonical healthy, candidate healthy and remaining healthy drivers and the rest of human genes.
Proportions of old (pre-metazoan), duplicated and essential genes, and of proteins involved in complexes, were compared using a two-sided Fisher's exact test. Distributions of gene and protein expression, protein-protein and miRNA-gene interactions, and germline variation were compared using a two-sided Wilcoxon rank-sum test. P-values were corrected for false discovery rate (FDR) using the Benjamini-Hochberg procedure.
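The Benjamini-Hochberg adjustment named in the caption can be written in a few lines. The sketch below is a generic, dependency-free version of the standard step-up procedure, given only to make the correction concrete; it is not taken from the authors' code, and the function name and example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), then
    monotonicity is enforced by taking a running minimum from the largest
    rank downwards. Adjusted values are capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, one per property comparison.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
```

A comparison would then be called significant when its adjusted value falls below the chosen FDR level.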

The Network of Cancer Genes: an open-access repository of annotated drivers
We  Table 8, Additional File 1). Healthy drivers closely resembled the functional profile of cancer drivers, given the high overlap (Figure 5B). Because of the low number, it was not possible to assess the functional enrichment of healthy drivers not involved in cancer.
More than 9% of canonical cancer drivers are targets of anti-cancer drugs and cancer drivers constitute around 40% of their targets (Figure 5C). Moreover, most of the genes used as biomarkers of resistance or response to treatment in cell lines (Figure 5D) or clinical trials (Figure 5E) are cancer drivers, with an overwhelming prevalence of canonical cancer drivers.

DISCUSSION
The wealth of cancer genomic data and the availability of increasingly sophisticated analytical approaches for their interpretation have substantially improved the understanding of how cancer starts and develops. However, our in-depth analysis of the vast repertoire of drivers that have been collected so far shows clear limits in the current knowledge of the driver landscape.
The identification of drivers as genes under positive selection or with a higher than expected mutation frequency within a cohort of patients has biased the current cancer driver repertoire towards genes whose coding point mutations or small indels frequently recur across patients. This strongly impairs the ability to map the full extent of driver heterogeneity, leading to an underappreciation of the driver contribution of rarely altered genes and those modified through noncoding or gene copy number alterations, particularly amplifications. It also results in a sizeable fraction of samples with very few or no cancer drivers. This gap can be closed by complementing cohort-level approaches with methods that account for all types of alterations and predict drivers in individual samples, for example by identifying their network deregulations 64,65,66 or applying machine learning to identify driver alterations 67. Alternatively, we have shown that systems-level properties capture the main features of cancer drivers, justifying their use for patient-level driver detection 68,69.
Our comprehensive study has also shown that cancer sequencing screens have so far mostly focused on resequencing and analysing the protein-coding portion of cancer genomes, leaving the contribution of noncoding drivers largely unexplored. This bias may be addressed by performing additional cancer whole genome sequencing screens and improving analytical methods for the prediction of noncoding driver alterations.
Biases are also starting to emerge in the knowledge of healthy drivers. Many noncancer sequencing screens only targeted cancer genes and the healthy driver detection methods used so far were originally developed for cancer genomics. Both these factors may at least partly explain the high overlap between drivers of cancer and noncancer evolution. An unbiased investigation of altered genes able to promote clonal expansion but not tumourigenesis could confirm whether their properties are indeed different from those of cancer drivers, as suggested by our initial analysis of the few that have been identified so far. Additionally, the investigation of somatically mutated clones in noncancer tissues has just started and new screens are continuously published. The integrated analysis of these new studies will broaden our understanding of noncancer clonal expansion and further clarify its relationship with cancer transformation.
Our literature review did not cover driver genes deriving from chromosomal rearrangements or epigenetic changes because of their scattered annotations in the literature and the difficulty in mapping their properties. Adding these genes to the repertoire once knowledge of them matures will help close the gaps in the knowledge of the genetic drivers of tumourigenesis.

CONCLUSIONS
Our comprehensive analysis of cancer sequencing screens showed that the current repertoire of cancer driver genes is still incomplete and biased towards frequent mutations altering the gene coding sequence. This calls for additional screens and methods to identify further coding and noncoding cancer drivers at single-patient resolution. We confirmed the central role of cancer drivers within the cell, which is reflected in their evolutionary path and is shared by the majority of known healthy drivers. Further sequencing screens of healthy tissues are needed to clarify whether this is a feature of all genes whose mutations can drive noncancer clonal expansion or whether there is a group of healthy drivers that underwent a different evolutionary path.
Cancer types and noncancer tissues were mapped to organ systems using a previous classification 72. Cancer types not included in this classification were mapped based on their histopathology (retinoblastoma to central nervous system; vascular and peripheral nervous system cancers to soft tissue; penile tumours to urologic system). In total, 518115 genes were considered to acquire LoF alterations because they underwent homozygous deletion or had truncating, missense damaging, splicing mutations, or double hits (CN = 1 and LoF damaging mutation), while 1674717 genes were considered to acquire GoF alterations because they had a hotspot mutation or underwent gene gain with increased expression (Figure 3A).
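The LoF and GoF criteria listed above can be restated as a small decision rule. The following is a hedged sketch that encodes only the conditions named in the text; the class, field and function names are hypothetical, the upstream steps that decide whether a mutation is damaging, a hotspot, or accompanied by increased expression are assumed to have been run already, and the text does not say how a gene meeting both sets of criteria is handled, so both labels are returned here.

```python
from dataclasses import dataclass


@dataclass
class GeneAlteration:
    """Alteration status of one gene in one sample (hypothetical layout)."""
    copy_number: int = 2
    homozygous_deletion: bool = False
    truncating: bool = False
    damaging_missense: bool = False
    splicing: bool = False
    hotspot: bool = False
    gene_gain: bool = False
    increased_expression: bool = False


def classify_alteration(alt):
    """Return the set of labels ("LoF", "GoF") supported by the stated criteria."""
    labels = set()
    lof_mutation = alt.truncating or alt.damaging_missense or alt.splicing
    # The text lists double hits (CN = 1 plus a LoF damaging mutation) as a
    # separate criterion; in this simplified reading it is a special case of
    # carrying a LoF mutation, and is kept explicit for readability.
    double_hit = alt.copy_number == 1 and lof_mutation
    if alt.homozygous_deletion or lof_mutation or double_hit:
        labels.add("LoF")
    # Gene gain counts only when accompanied by increased expression.
    if alt.hotspot or (alt.gene_gain and alt.increased_expression):
        labels.add("GoF")
    return labels


# Toy cases for illustration only.
lof_case = classify_alteration(GeneAlteration(truncating=True))
gof_case = classify_alteration(
    GeneAlteration(copy_number=4, gene_gain=True, increased_expression=True)
)
gain_only = classify_alteration(GeneAlteration(copy_number=4, gene_gain=True))
```

Applying such a rule to every gene in every sample and summing the labelled gene-sample pairs would give totals of the kind quoted for LoF and GoF alterations.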

Pancancer TCGA data
Protein sequences from RefSeq 87 v.99 were aligned to hg38 using BLAT 88 . Unique genomic loci were identified for 19756 genes based on gene coverage, span, score and identity 89 . Genes sharing at least 60% of their protein sequence were considered as duplicates 46 .
Evolutionary conservation was assessed for 18922 human genes using their orthologs in EggNOG 90 v.5.0. Genes were considered to have a pre-metazoan origin

Ethics approval and consent to participate
Not applicable.

Consent for publication
The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

Availability of data and materials
data. FDC conceived and supervised the study. MB, AAS, GS, and FDC wrote the manuscript with contributions from LD and HM. All authors reviewed and approved the manuscript.

Acknowledgments
We thank Steve Hindmarsh and Stefan Boeing for their contribution to the development of the NCG database and website.

ADDITIONAL FILES
Additional file 1: Supplementary tables (XLSX).  Table S7. Systems-level properties of driver genes; Table S8. Proportion of enriched pathways across driver groups. Supplementary Figures (DOCX). Figure S1. Literature search, review and annotation workflow; Figure S2. Correlation between numbers of donors and cancer drivers in individual organ systems; Figure S3. Patterns of driver damaging alterations in TCGA samples.