Essentiality, Protein-Protein Interactions and Evolutionary Properties are Key Predictors for Identifying Cancer Genes Using Machine Learning

The identification of genes that may be linked to cancer is of great importance for the discovery of new drug targets. However, the rate at which cancer genes are found experimentally is slow because the identification and confirmation process is complex, which leaves a narrow range of therapeutic targets to investigate and develop. One solution to this problem is to use predictive analysis techniques that can accurately identify cancer gene candidates in a timely fashion. Furthermore, identifying the characteristics that are linked to cancer genes is crucial to furthering our understanding of this disease, and these characteristics can be employed in recognising therapeutic drug targets. Here, we investigated whether certain properties of a gene can indicate the likelihood of its involvement in the initiation or progression of cancer. We found that essentiality scores tend to be higher for cancer genes than for all protein-coding human genes. We developed a machine-learning model and found that essentiality-related properties, and properties arising from protein-protein interaction networks or evolution, are particularly effective in predicting cancer-associated genes. We were also able to identify potential drug targets that have not previously been linked with cancer but have the characteristics of cancer-related genes.

Author Summary

Mutations in numerous genes are known to be involved in cancer, yet there are undoubtedly many more to be discovered. We analysed a set of hundreds of cancer genes with the aim of finding out what makes them different from genes not known to be mutated in cancer. In particular, we found that genes that are essential for the survival of an organism are more likely to be involved in cancer. We used the gene properties that we examined to develop an artificial intelligence method that can accurately predict whether a gene is involved in cancer or not. Applying the method gives hundreds of non-cancer genes that resemble cancer genes. New discoveries of cancer genes are likely to be found within this set.

(represented by variation detection) in the gene. It uses the following methods: residual variation intolerance score (RVIS), LoFtool, Missense-Z, the probability of loss-of-function intolerance (pLI) and the probability of haplo-insufficiency (Phi) (S1 Table). The second group (Wang, EvoTol) studies the impact of variation on cell viability. For all methods above measuring essentiality, a higher score indicates a higher degree of essentiality, with the exception of Missense-Z21 where a lower score indicates higher degree of essentiality.

We find that, on average and across all metrics, cancer genes exhibit a higher degree of essentiality than the full set of protein-coding human genes. This holds in both categories (intolerance to variants and cell line viability; data can be found in S2 Table), with p-values consistently < 0.00001 (Table 1). We also investigated whether Tumour Suppressor (TS) genes, as a distinct group of genes, would show different degrees of essentiality. We found no significant difference in the average degree of essentiality for that group compared to the set of all cancer genes (Table 1).
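The kind of group comparison reported above can be sketched as follows. This is a minimal illustration only: the text does not state which statistical test produced the p-values in Table 1, so a one-sided permutation test on the difference in means is assumed here, and the function name `permutation_p_value` and all scores are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """One-sided permutation test: is mean(group_a) larger than mean(group_b)?

    Assumption: this is an illustrative test, not necessarily the one
    used to produce the p-values in the paper.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical essentiality scores (not data from the paper).
cancer_scores = [0.91, 0.85, 0.78, 0.88, 0.95, 0.81, 0.90, 0.87]
background_scores = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.35, 0.58, 0.49, 0.44]

p_value = permutation_p_value(cancer_scores, background_scores)
print(f"one-sided p-value: {p_value:.5f}")
```

With a fixed number of permutations the smallest reportable p-value is roughly 1 / (n_permutations + 1), so resolving values as small as those in Table 1 would need many more permutations or an analytical test.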

The results are of particular interest in the context of cancer, as essential genes have been shown to evolve more slowly than nonessential genes [6, 10, 11], although some conflicting findings have been reported [11]. A slower evolutionary rate implies a lower probability of evolving resistance to a cancer drug. This is particularly important in the case of anticancer drugs, as it has been reported that these drugs change the selection pressure when administered, leading to increased drug resistance [12].

This association between cancer-related genes and essentiality scores prompted us to develop methods to identify cancer-related genes using this information. We used a machine-learning approach on the DataRobot platform. This platform applies a range of open-source algorithms and optimizes the weights of terms to produce the most accurate classifier. We focused on properties related to protein-protein interaction networks, as essential genes are likely to encode hub proteins, i.e. those central to the

The model blueprint (Fig 1)

-9999). This is effective for tree-based models, as they can learn a split between the arbitrary value (-9999) and the rest of the data (which is far away from this value). A log showing any data transformation and imputation of values can be found in S4 Table.

A total of ten different modeling approaches (or blueprints) were run on the data to ensure the selection of the best performing approach (the list of these can be found in S3 Table along with their performance metrics). The performance metric used to rank the models was Logarithmic Loss (LogLoss).
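The arbitrary-value imputation described above can be sketched as follows. This is a minimal illustration and not the DataRobot implementation: the function `impute_with_sentinel`, the feature names and the example rows are all hypothetical. Only the sentinel value -9999 and the idea of keeping a log of imputed values come from the text.

```python
SENTINEL = -9999.0  # arbitrary value far outside the observed feature range

def impute_with_sentinel(rows, sentinel=SENTINEL):
    """Replace missing (None) feature values and record what was changed.

    Returns a new list of rows plus a log of (row index, feature name)
    pairs; the input rows are left unmodified.
    """
    imputed, log = [], []
    for row_index, row in enumerate(rows):
        new_row = {}
        for feature, value in row.items():
            if value is None:
                new_row[feature] = sentinel
                log.append((row_index, feature))
            else:
                new_row[feature] = value
        imputed.append(new_row)
    return imputed, log

# Hypothetical gene rows; feature names and values are illustrative only.
genes = [
    {"degree": 12, "essentiality": 0.91},
    {"degree": None, "essentiality": 0.40},
    {"degree": 3, "essentiality": None},
]

imputed_genes, imputation_log = impute_with_sentinel(genes)
print(imputed_genes)
print(imputation_log)
```

Because the sentinel lies far from every real value, a tree-based learner can isolate the imputed rows with a single split, whereas a linear model would treat -9999 as an extreme measurement.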

LogLoss is an appropriate and well-known performance measure for binary-classification models. It takes the confidence of each prediction into account and penalises incorrect classifications accordingly. The selection mechanism for the performance metric takes the type of model (binary classification in this case) and the distribution of values into consideration when recommending the performance metric. However, other performance metrics were also calculated and can be found in the S3

indicates where the model predicted a high score. The "Predicted" blue line displays the average prediction score for the rows in that bin. The red "Actual" line displays the actual percentage for the rows in that bin.
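The two diagnostics discussed above, LogLoss and the binned predicted-versus-actual comparison, can be sketched as follows. This is a minimal illustration under stated assumptions: LogLoss uses the standard binary cross-entropy definition, the binning uses equal-sized bins after sorting by predicted score, and the helper names `log_loss` and `lift_chart_bins` and all data are hypothetical rather than the platform's own code.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def lift_chart_bins(y_true, y_prob, n_bins):
    """Sort rows by predicted score, split them into equal-sized bins, and
    return (mean predicted score, actual positive fraction) per bin.

    Assumption: len(y_true) is divisible by n_bins; otherwise the
    highest-scoring leftover rows are dropped.
    """
    ordered = sorted(zip(y_prob, y_true))
    size = len(ordered) // n_bins
    bins = []
    for b in range(n_bins):
        chunk = ordered[b * size:(b + 1) * size]
        predicted = sum(p for p, _ in chunk) / len(chunk)
        actual = sum(y for _, y in chunk) / len(chunk)
        bins.append((predicted, actual))
    return bins

# Hypothetical labels (1 = cancer gene) and predicted probabilities.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.70, 0.80, 0.95]

print(round(log_loss(labels, scores), 4))
for predicted, actual in lift_chart_bins(labels, scores, n_bins=4):
    print(f"predicted={predicted:.3f} actual={actual:.2f}")
```

A well-calibrated model gives per-bin predicted and actual values that track each other closely, and a confident wrong prediction raises LogLoss far more than a hesitant one.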

The confusion matrix (Table 3) and the summary statistics (

Another way to visualise the model performance, and to determine the optimal score to use as a threshold between cancer and non-cancer genes, is the Prediction Distribution graph (Fig 3), which illustrates how well our model discriminates between prediction classes (cancer gene or non-cancer gene) and shows the selected score (threshold) that could be used to make a binary (true/false) prediction for a gene to be classified as a candidate cancer gene.

(Table 4). Every prediction to the left of the dividing line is classified as non-cancer and every prediction to the right of the dividing line is classified as cancer.

The Prediction Distribution graph can be interpreted as follows. Purple to the left of the threshold line represents genes that were correctly classified as non-cancer (true negatives). Green to the left of the threshold line represents genes that were incorrectly classified as non-cancer (false negatives). Purple to the right of the threshold line represents genes that were incorrectly (according to the current training/validation dataset) classified as cancer genes (false positives, and thus potential future cancer genes). Green to the right of the threshold line represents genes that were correctly classified as cancer genes (true positives). The graph again confirms that the model was able to distinguish accurately between cancer and non-cancer genes.
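The four regions of the graph described above correspond to the cells of a confusion matrix at the chosen threshold. The sketch below is a minimal illustration: the function `confusion_counts`, the threshold of 0.5 and the data are hypothetical, and treating a score exactly equal to the threshold as a cancer prediction is an assumption.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TN, FN, FP and TP for a given score threshold.

    Assumption: scores at or above the threshold are predicted as
    cancer (1); scores below it are predicted as non-cancer (0).
    """
    counts = {"TN": 0, "FN": 0, "FP": 0, "TP": 0}
    for actual, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 0:
            # Left of the threshold line.
            counts["TN" if actual == 0 else "FN"] += 1
        else:
            # Right of the threshold line.
            counts["FP" if actual == 0 else "TP"] += 1
    return counts

# Hypothetical labels (1 = known cancer gene) and model scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.60, 0.75, 0.90]

counts = confusion_counts(labels, scores, threshold=0.5)
sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
specificity = counts["TN"] / (counts["TN"] + counts["FP"])
print(counts, sensitivity, specificity)
```

Moving the threshold to the right trades sensitivity for specificity, which is why the graph is useful for choosing the operating point.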

Using the ROC curve produced for our model (Fig 4), we were able to evaluate the accuracy of

We also retrained our model using a data set that excludes general gene properties, as listed in the 'Data Sets' section, and found that the reduction in the model's performance was evident but very small. The model trained on this dataset achieved an AUC of 0.835 and a sensitivity of 55% at a specificity of 89%.
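For readers less familiar with the AUC figure quoted above, the sketch below computes it directly from its rank interpretation: the fraction of (cancer, non-cancer) gene pairs in which the cancer gene receives the higher score. It is a minimal illustration with hypothetical data and a hypothetical function name `roc_auc`, and is not the code used to obtain the reported value.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair. The pairwise loop is
    quadratic, so this form is only suitable for small examples.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical labels (1 = known cancer gene) and model scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.60, 0.75, 0.90]

print(round(roc_auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a small drop between two models indicates that their rankings of genes are largely similar.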

This small reduction in the predictability of the models confirms that essentiality and protein-protein

We have demonstrated that the essentiality of a gene is a strong indication that it is associated with cancer. This is the case for both oncogenes and tumour suppressors. Genes classed as essential are often involved in cell, embryo and organism growth. Similarly, proliferation is key for cancer cells. Therefore, the sets of genes that are essential and those that are involved in unregulated growth, as seen in cancer, tend to overlap.

We used a range of direct measures of essentiality: some, such as LoFtool and Missense Z-score, are based on human population sequence data, while others, such as Blomen KBM7 and Wang K562, are based on cell viability data. We also investigated a range of gene and protein properties, such as number of protein-protein interactions and position in the interaction network, expression levels, and various measures of evolutionary selection pressure. All of these gene and protein properties are strongly linked to, and closely correlate with, essentiality, even though they do not measure it directly.
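As an illustration of the simplest network property mentioned above, the number of protein-protein interactions, the sketch below derives it from an undirected edge list. The function `interaction_degrees`, the protein identifiers and the decision to ignore self-interactions are all assumptions made for this example; the properties used in the study were taken from existing resources rather than computed this way.

```python
from collections import defaultdict

def interaction_degrees(edges):
    """Number of distinct interaction partners per protein.

    Assumption: interactions are undirected, duplicates are counted
    once, and self-interactions are ignored.
    """
    partners = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(neighbours) for protein, neighbours in partners.items()}

# Hypothetical interaction pairs; ("P3", "P1") duplicates ("P1", "P3").
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P3", "P1")]

degrees = interaction_degrees(edges)
most_connected = max(degrees, key=degrees.get)
print(degrees, most_connected)
```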

Machine learning allows the identification of the most important features for classification of genes into cancer-related and non-cancer-related groups. Such an approach can provide a reliable framework that first helps in identifying properties predictive of cancer association and can then provide a reliable model that can be used to predict the most likely candidates to be cancer genes. Machine learning methods can produce better performing predictive models than traditional statistical regression methods because they are more flexible and rely on fewer statistical assumptions. A high degree of flexibility in defining the model structure typically results in better model performance. In our application, the only assumption being made is that the model training data is representative of the future scoring data. In our case, this means that current knowledge of cancer genes is applicable to those that will be found in the future. The resulting classifier is accurate (AUC > 0.85) in predicting whether or not a human protein-coding gene is cancer-related.

URLs and values are included in S1 Table.

We used gene properties provided and constructed in [3], including genomic location, protein network parameters and summary statistics of neutrality for human genes.

The Genomic location properties we used in our work were: Chr, Start, End and Strand and

The properties we used to construct our dataset are naturally not inclusive of all possible features. In particular, other studies carried out in mouse have investigated an extended list of essentiality properties, though the subset of features we selected here was shown to be of particular interest (14). Expanding the number of properties used would be an option to explore in the future.

The initial idea for the review article came from A.S. and was developed closely with A.S., S.C.L. and A.J.D. All authors wrote and approved the figures and final wording of the manuscript.

Amro Safadi is currently working as a Data Scientist at DataRobot Inc. All the other authors have no conflicts of interest to declare.