Benchmark of lasso-like penalties in the Cox model for TCGA datasets reveals improved performance with pre-filtering and wide differences between cancers

Motivation: Prediction of patient survival from tumor molecular 'omics' data is a key step toward personalized medicine. With this aim, the available databases are growing, with the collection of various 'omics' characterizations of patient tumors, together with the associated clinical outcomes for weeks to years of follow-up. Cox models with variable selection applied to RNA profiling datasets are popular for the identification of prognostic biomarkers and for clinical predictions. However, these models are confronted with the 'curse of dimensionality', as the number p of covariates (genes) can greatly exceed the number n of patients. To tackle this problem, variance-based pre-filtering and penalization methods are popular for dimension reduction. In the present paper, we study the impact of a pre-filtering step based on gene variability, and we evaluate the performance of the lasso penalization of the Cox model and four variants (i.e., elastic net, adaptive elastic net, ridge, univariate Cox) in terms of prediction, selection and stability.
Results: First, we show that the prediction capacity of the penalized Cox methods is cancer dependent. Second, we develop a methodology to fix a threshold to filter out genes with low variability without losing prediction capacity. Third, we show that it is best not to use the Cox model to select prognostic biomarkers, as its false discovery proportion is always ≥ 50%. Finally, to predict overall survival, we suggest the use of the ridge penalty, or the elastic net if a more parsimonious model is needed, after the pre-filtering step.
Availability: We provide the R scripts generated to reproduce all of the figures presented in this article.
Supplementary information: Supplementary figures and R scripts are available.


Introduction
The roots of the 'P4' model of cancer medicine lie in prediction combined with personalization, prevention, and participation [15]. Prediction of the best treatment for a given patient and prediction of various clinical outcomes, including overall survival, are both of growing interest. The costs of 'omics' technologies are decreasing, which has made possible the molecular characterization of tumor samples of various types, including quantification of mutations and gene expression [29,32]. As a result, there are growing numbers of knowledge databases that include molecular profiling of patient tumors, together with clinical information from patient follow-up. Survival analysis from transcriptome profiling of cancer patients in terms of messenger RNA (mRNA) expression is now emerging for clinical use [9].
The Cox proportional hazards model [6] is one of the most popular approaches in medicine to link covariates to survival data. [38] and [18] showed that Cox models are at least as good as, or even better than, neural networks and other machine learning models. Cox regression models with lasso penalty for variable selection [35] are often used to identify a few prognostic biomarkers from among the thousands of genes profiled, and to obtain a parsimonious model for simpler and cheaper clinical applications. When considering the number of covariates, p (which can typically be 20,000 gene products), in relation to the number of patients in the databases, n (which can typically be only a few hundred), various issues occur due to the high dimensionality, which include the lack of stability of the selected genes [16] and over-fitting [25]. This p ≫ n problem is referred to as "the curse of dimensionality" [1].
Lasso generalizations have been proposed for generalized linear models, such as Cox regression, to improve the performance and stability. In particular, the elastic net [40] and the adaptive elastic net [41] are regularization procedures that can overcome some stability issues of the lasso in the presence of highly correlated variables [39], as occurs in particular for gene signatures. The adaptive elastic net also provides additional theoretical guarantees for the recovery of the true biomarkers. Like these lasso-based methods, the ridge penalty allows control of the variance of the estimator. Although it performs no selection, ridge regression has been shown to be promising for reliable survival predictions with high-dimensional microarray data [36]. [24] showed that the lasso, ridge and elastic net penalties perform equally in low-dimensional settings with few events. [3] compared the lasso, elastic net, ridge and adaptive lasso penalties on simulated and real acute myeloid leukemia data, and on real breast cancer data. They used the integrated Brier score as the metric to judge the performance of the models, and the datasets they used comprised fewer than 300 patients.
Pre-filtering procedures are popular methods that considerably reduce dataset dimensionality to avoid the curse of dimensionality issues. For instance, the less variable genes are subject to measurement noise and provide poor contributions to distinguish between patients [23]. Moreover, pre-filtering the less variable genes has been shown to increase the identification power of differentially expressed genes [13,4]. Finally, this pre-filtering step is often done without justification of the threshold [11,21,22], while this value has been shown to be crucial to optimize the selection process [4].
Univariate selection procedures are also widely used methodologies to select features. [28] showed that univariate Cox regression is the best from among a set of eight univariate methods to select genes related to survival.
Finally, determination of the cohort size is a key issue to design any clinical trial [34]. If too few patients are included, inference procedures can provide non-significant results. On the other hand, including many patients is expensive, time-consuming, and requires more effort. Therefore, we investigated the impact of the number of patients on the predictive accuracy of the models.
To the best of our knowledge, no independent benchmark of univariate and multivariate (e.g., lasso, elastic net, ridge, adaptive elastic net) Cox regression has been defined using The Cancer Genome Atlas (TCGA) datasets, with pre-filtering of the genes based on their variability, and according to well-established metrics. In the context of survival analysis using high-dimension mRNA-seq datasets, the goals of this paper are to: (i) compare four multivariate Cox penalty methods (i.e., lasso, elastic net, adaptive elastic net, ridge) with univariate Cox regression and to each other; (ii) study the impact of pre-filtering the genes based on their interquartile range (IQR), and propose a rationale toward the fixing of a threshold; and (iii) study the impact of cohort size on the prediction performances. The methodologies are assessed in terms of the quality of the prediction of overall survival and prognostic biomarker selection, along with the stability.
To evaluate the Cox regression methods for prediction of overall survival and selection of genes correlated to survival, we use a panel of four cancers from TCGA (http://cancergenome.nih.gov/) with different characteristics. Real datasets are used to study the prediction performances, and we provide realistic simulated datasets that maintain the complex structure of the molecular profiles of the patients to assess selection accuracy.

The Cox model

The Cox proportional hazards model links the covariates to survival through the hazard function:

h(t | X) = h0(t) exp(β1 X1 + · · · + βp Xp),

where h0(t) is the baseline hazard function, Xj are the covariates (here, gene expression), and βj are the associated coefficients, for j = 1, . . . , p. For each patient i = 1, . . . , n, let ti be the follow-up time (as either patient survival or censoring time), and δi the associated status, as 1 for death and 0 for censoring. The vector ((t1, δ1), ..., (tn, δn)) is referred to as the 'survival data'. Let X^i = (X^i_1, . . . , X^i_p)^T be the vector of covariates for the i-th patient. The vector of coefficients β = (β1, ..., βp)^T can be estimated by maximizing the Cox pseudo-likelihood, as proposed by Breslow [5]:

L(β) = ∏_{i : δi = 1} [ exp(β^T X^i) / Σ_{j ∈ Ri} exp(β^T X^j) ],

where for each patient i = 1, . . . , n whose death is observed (i.e., δi = 1), Ri is the set of patients at risk at time ti; i.e., such that tj ≥ ti (including patient i). This function is called the 'pseudo-likelihood', because it is not a product of density functions, but a product of conditional probabilities. The estimator β̂ is computed by maximizing this pseudo-likelihood: β̂ = arg max_β l(β), with l(β) = log(L(β)) the log-pseudo-likelihood.
Note that the Cox model is not intuitive, in the sense that it links genetic data to patient survival in an indirect way, through the hazard function. However, the Cox pseudo-likelihood allows censored data to be dealt with efficiently. Moreover, this yields a robust inference procedure where the baseline function h0(t) does not need to be modeled or estimated in a parametric way. Finally, the estimation procedure leads to a convex optimization problem, for which efficient procedures and packages exist for computing β̂ [12].
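As an illustration, the Breslow log-pseudo-likelihood described above can be coded in a few lines. The sketch below (Python/numpy; the function name is ours, not from the paper's R scripts) loops over the observed deaths and their risk sets:

```python
import numpy as np

def cox_log_pseudo_likelihood(beta, X, time, status):
    """Breslow log pseudo-likelihood l(beta) for the Cox model.

    X is the (n, p) covariate matrix, time the follow-up times t_i, and
    status the indicators delta_i (1 = death, 0 = censoring).
    """
    eta = X @ beta                      # linear predictors beta^T X^i
    ll = 0.0
    for i in np.where(status == 1)[0]:  # sum over observed deaths only
        at_risk = time >= time[i]       # risk set R_i (includes patient i)
        ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return ll
```

With β = 0, each death at time ti contributes −log(#Ri), so three uncensored patients with distinct times give l(0) = −log(3!) regardless of X.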

Penalization methods and univariate Cox selection for a sparse model
The main goal of survival analysis is to predict survival times from a set of covariates X. Not all of the ~20,000 genes included necessarily contribute to the prediction of survival, and selection procedures are useful to reduce the dimension while retaining the most informative covariates.
For variable selection, we considered three alternative penalties in the likelihood of the Cox model: the lasso [35], the elastic net [40], and the adaptive elastic net [41] (implementation is provided in R script functions/learn models func.R). Briefly, these methods consist of the addition of a penalty term pen(β) to the log-pseudo-likelihood l before the maximization, i.e., β̂ = arg max_β {l(β) − pen(β)}:
• The lasso uses the ℓ1 norm: pen(β) = λ Σ_j |βj|.
• The elastic net combines the ℓ1 norm and the squared ℓ2 norm: pen(β) = λ (α Σ_j |βj| + (1 − α) Σ_j βj²/2).
• The adaptive elastic net is a two-step procedure that includes the ridge regression [36]. It 1. estimates the vector β̂⁰ by maximizing l with the ridge penalty λ Σ_j βj²/2, and 2. weights the ℓ1 part of the elastic net penalty with the coefficients β̂⁰_j, j ∈ {1, ..., p}, computed in step 1: pen(β) = λ (α Σ_j |βj|/|β̂⁰_j| + (1 − α) Σ_j βj²/2).
The ℓ1 norm forces some coefficient estimates β̂j, j = 1, ..., p, to be zero, and allows the selection to be made. For multivariate Cox selection models, the genes selected are defined as the genes with nonzero β̂j coefficients. It has been empirically observed that if there are high correlations between predictors, the ridge penalty provides better prediction performances than the lasso [35], and that the shrinkage effect of the lasso is too strong for large effects [39]. The elastic net and adaptive elastic net penalties were developed to tackle these two issues, respectively.
The weight of the penalty, λ, is computed by K-fold cross-validation (K = 5) using the package glmnet [12] in R version 3.6.0 [27]. The weight λ that minimizes the cross-validated deviance is denoted λmin. For more details on the mathematical concepts used in this article, we refer the reader to the book 'The Statistical Analysis of Failure Time Data' [17].
Univariate Cox selection consists of the learning of a Cox model individually for each gene. A p-value that corresponds to the statistical test with null hypothesis "βj = 0" is then assigned to each gene j with 1 ≤ j ≤ p.
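Univariate Cox selection can be sketched per gene as a one-dimensional Newton-Raphson maximization of the partial likelihood followed by a Wald test. The code below is an illustrative numpy version (not the paper's R pipeline; it assumes untied follow-up times, and the function name is ours):

```python
import numpy as np
from math import erfc, sqrt

def univariate_cox_pvalue(x, time, status, n_iter=25):
    """Wald p-value for H0: beta = 0 in a single-gene Cox model.

    Newton-Raphson on the Breslow partial likelihood; assumes untied times.
    """
    beta = 0.0
    deaths = np.where(status == 1)[0]
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for i in deaths:
            risk = time >= time[i]            # risk set at t_i
            w = np.exp(beta * x[risk])
            s0 = w.sum()
            s1 = (w * x[risk]).sum()
            s2 = (w * x[risk] ** 2).sum()
            score += x[i] - s1 / s0           # gradient contribution
            info += s2 / s0 - (s1 / s0) ** 2  # observed information
        if info <= 0:                         # constant gene: no evidence
            return 1.0
        beta += score / info                  # Newton step
    z = beta * sqrt(info)                     # Wald statistic beta_hat / se
    return erfc(abs(z) / sqrt(2))             # two-sided normal p-value
```

Genes would then be ranked by this p-value, smallest first.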

Filtering by interquartile range
The interquartile range (IQR) is a robust measure of statistical dispersion, defined as the difference between the 75th and 25th percentiles. We removed genes for which the IQR of expression across all the patients was below a given threshold, thus reducing the dimension. Bourgon et al. (2010) and Hackstadt et al. (2009) showed that pre-filtering by variance increases the selection power for differentially expressed genes in high-dimension analyses, and both of these studies pointed out the crucial issue of the choice of the threshold. Additionally, the most variable genes present a better signal-to-noise ratio with respect to measurement noise, which makes them easier to measure and more reliable, technically speaking.
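The filtering step itself is essentially one line; a minimal sketch (Python/numpy, function name ours):

```python
import numpy as np

def iqr_filter(expr, threshold):
    """Keep genes whose interquartile range across patients is >= threshold.

    expr: (n_patients, n_genes) expression matrix. Returns a boolean gene mask.
    """
    q75, q25 = np.percentile(expr, [75, 25], axis=0)  # per-gene quartiles
    return (q75 - q25) >= threshold
```

The mask can then be used to subset the expression matrix before fitting any of the Cox models.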

Assessing the concordance index for prediction quality
The concordance index (C-index) allows the predictive ability of a risk score to be assessed by quantifying the proportion of comparable patient pairs whose risk scores are in good agreement with their survival data. For two patients i and j with risk scores Ri and Rj, and with survival times Ti and Tj, the C-index is defined as C = P (Ti < Tj | Ri > Rj).
A C-index of 1 indicates perfect agreement, which means that the survival time can be perfectly described by a deterministic monotonically decreasing transformation of the risk score. Conversely, a C-index of 1/2 corresponds to random chance agreement, which means that the risk score does not provide any information on the survival time. The risk prediction score used for the multivariate Cox regressions is the Prognostic Index (PI) of the Cox model, which is classically defined as PI(i) = β̂^T X^i for patient i, 1 ≤ i ≤ n. These PIs are linear combinations of the gene signatures of each patient, computed in a testing dataset (a random selection of 20% of the patients), with the β̂ coefficients estimated with the Cox model in a training dataset (the remaining 80% of the patients). This procedure is repeated 100 times, and boxplots of the C-indices can be drawn for all of the methodologies and for different IQR thresholds. As the β̂ estimates are different for each penalization method, the PIs and the C-index are different for each method. For univariate Cox selection, the gene with the lowest p-value is selected from the training dataset, and we use the expression level of this gene as the risk prediction score to compute the C-index for the testing dataset (R script prediction quality.R).
We took the estimator of the C-index given by [14] and theorized by [26], which is implemented in the package survcomp [31]:

Ĉ = (1/#U) Σ_{(i,j) ∈ U} 1_{i,j},

where U is the set of all of the comparable pairs (the death of the patient with the lowest follow-up time has to be observed), #U is the size of U, and 1_{i,j} is equal to 1 if the risk scores and survival data of patients i and j are in good agreement, and 0 otherwise. As a comparison, we calculate the C-index estimates with the ridge regression under the null hypothesis "β = 0". This is done with the following permutation test procedure: the survival times are randomly shuffled, while the gene expression profiles are not modified, and the C-index is computed as explained in the previous paragraph. This procedure is repeated 100 times (i.e., 100 Monte-Carlo runs) to obtain the empirical null distribution of the C-index. This allows determination of whether the prediction performance, as computed in the previous paragraph, is significantly better than that of a purely random model.
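For illustration, the pair-counting estimator and the permutation null can be sketched as follows (a plain O(n²) Python version, not the survcomp implementation; counting ties in risk scores as half-agreements is one common convention, assumed here):

```python
import numpy as np

def c_index(risk, time, status):
    """Pair-counting C-index over the comparable pairs U.

    A pair (i, j) with t_i < t_j is comparable only if the earlier
    follow-up ends in a death (delta_i = 1); risk-score ties count 0.5.
    """
    agree, n_pairs = 0.0, 0
    for i in range(len(time)):
        for j in range(len(time)):
            if time[i] < time[j] and status[i] == 1:  # comparable pair
                n_pairs += 1
                if risk[i] > risk[j]:
                    agree += 1.0
                elif risk[i] == risk[j]:
                    agree += 0.5
    return agree / n_pairs if n_pairs else float("nan")

def c_index_null(risk, time, status, n_perm=100, seed=0):
    """Empirical null: shuffle survival data, keep expression-based risks."""
    rng = np.random.default_rng(seed)
    perms = []
    for _ in range(n_perm):
        idx = rng.permutation(len(time))
        perms.append(c_index(risk, time[idx], status[idx]))
    return np.array(perms)
```

A perfectly anti-ranked risk score gives C = 0 and a perfectly ranked one gives C = 1; the permuted values cluster around 1/2.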

The Cancer Genome Atlas dataset
First, in a preliminary study, we included all of the 11 cancers for which more than 500 patients with mRNA-seq data were available in TCGA (http://cancergenome.nih.gov/) (R script cancer characteristics.R). We studied the 'time-to-death', also called 'overall survival', which was defined as the time between diagnosis and death. We computed multiple characteristics for these 11 cancers, including the C-index estimates computed with the ridge regression method applied over all of the genes (Fig. 1). Then, we selected four cancers with different characteristics (Table 1) as a representative panel to evaluate the five methods (i.e., lasso, elastic net, adaptive elastic net, ridge, univariate Cox). We chose brain lower grade glioma (LGG) for its high C-index (0.81), breast invasive carcinoma (BRCA) because many samples were available (1040), and kidney renal clear cell carcinoma (KIRC) and lung squamous cell carcinoma (LUSC) as they had similar numbers of patients (524, 474, respectively) and censoring rates (0.69, 0.67, respectively), but very different C-indices (0.71, 0.55, respectively).
The clinical and mRNA-seq datasets were obtained using the Broad GDAC FIREHOSE utility (https://gdac.broadinstitute.org/). We used the RNA-seq data normalized according to expectation maximization (RSEM) [20], and for all of the estimations of β with the Cox regression models, the gene expression data were standardized.

Simulation pipeline and selection accuracy for different IQR thresholds
We set up a simulation pipeline to assess the selection accuracy against a so-called ground truth, which is graphically explained in Supplementary Figure S1 (R script functions/simulations func.R). First, we defined the ground truth as the genes selected in common by the lasso, elastic net, and adaptive elastic net for the real TCGA dataset, completed with genes selected by at least one of the three methods. We set the number N of genes of the ground truth as the median number of genes selected by the three methods, which corresponds to the number of genes selected by the elastic net. As an example, for KIRC and without pre-filtering, N = 55.
Table 1: Characteristics of the four cancers used. C-indices were computed with the ridge regression and all of the genes in the Cox model, and p3 is the number of genes with an IQR ≥ 3. (Column headers: Cancer, Censoring, n patients.)

To obtain realistic simulations, we calibrated a Weibull proportional hazards model on the real TCGA data using the genes of the ground truth. Then we simulated survival times Ti, for i = 1, . . . , n, from the TCGA mRNA-seq data using the method of [2]. Censoring times Ci, for i = 1, . . . , n, are assumed to be uniform, and the censoring rate of the real survival data is matched using the method proposed by [37]. We define the follow-up times ti as ti = min(Ti, Ci), with Ti the survival time and Ci the censoring time, and the status δi = 1_{Ti ≤ Ci}, for i = 1, . . . , n. In the following, we will indicate whether the results are obtained from real or simulated datasets.
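The simulation of survival and censoring times can be sketched as below (Python/numpy). The inversion formula for a Weibull proportional hazards baseline follows the approach the paragraph above cites as [2]; the parameter names lam (scale), nu (shape) and censor_max are ours, and in the paper the censoring bound is calibrated to match the real censoring rate rather than fixed:

```python
import numpy as np

def simulate_weibull_ph(X, beta, lam=0.01, nu=1.5, censor_max=10.0, seed=0):
    """Simulate follow-up data from a Weibull proportional hazards model.

    Survival times by inversion: T = (-log(U) / (lam * exp(X beta)))**(1/nu),
    with U ~ Uniform(0, 1). Censoring times are uniform on (0, censor_max).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    u = rng.uniform(size=n)
    T = (-np.log(u) / (lam * np.exp(X @ beta))) ** (1.0 / nu)
    C = rng.uniform(0.0, censor_max, size=n)
    time = np.minimum(T, C)           # follow-up t_i = min(T_i, C_i)
    status = (T <= C).astype(int)     # 1 = death observed, 0 = censored
    return time, status
```

Raising censor_max lowers the censoring rate, which is how the rate could be tuned to the real data.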
We estimate the selection accuracy of the four selection methods (i.e., lasso, elastic net, adaptive elastic net, univariate Cox) by comparing the set of genes selected on the simulated data with the set of genes of the ground truth, using the sensitivity and the false discovery proportion (FDP) (R script selection accuracy.R). We were careful to include in the ground truth only genes that satisfy the chosen IQR threshold. As a result, the procedure naturally favors high IQR thresholds, as the selection is performed on fewer genes.
The genes selected by univariate Cox selection are defined as the N genes with the lowest p-values, where N is the number of genes in the ground truth. This scenario is then the most favorable for selection accuracy.
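The two selection metrics reduce to simple set operations; a minimal sketch (function name ours):

```python
def selection_accuracy(selected, ground_truth):
    """Sensitivity and false discovery proportion (FDP) of a gene selection."""
    selected, ground_truth = set(selected), set(ground_truth)
    true_pos = len(selected & ground_truth)
    sensitivity = true_pos / len(ground_truth) if ground_truth else 0.0
    fdp = 1.0 - true_pos / len(selected) if selected else 0.0
    return sensitivity, fdp
```

For example, selecting {a, b, c} against a ground truth {a, b, d, e} gives a sensitivity of 2/4 and an FDP of 1/3.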

Stability of the prognostic index and reliability of the selected set of genes
We define a procedure as unstable if small changes in the input (patients) can cause large changes in the output (PIs, or set of genes with nonzero coefficients, also referred to as the 'selected genes').
To assess the robustness of the PIs, we compute 100 PIs per patient for the real TCGA datasets, using bootstrapping (i.e., random sampling with replacement). This makes it possible to compute an IQR and a median for each patient's PI from its empirical bootstrap distribution. We then fit loess curves to estimate the relation between the IQR and the median of the PI. We repeat this procedure for each penalization method (i.e., lasso, elastic net, adaptive elastic net, ridge) (R script stability PI.R).
For the stability of the selection, we select genes from two subdatasets of the real TCGA dataset that share x% of their patients, with x ∈ {100, 90, 80, 70, 60, 50}, while the number of patients remains constant. We then compare the proportions of common genes selected from these two subdatasets. This procedure is graphically explained in Supplementary Figure S2. To further assess the reproducibility of the set of selected genes, the dataset is then divided into two subdatasets with no common patients (x = 0). Each procedure is repeated 100 times (i.e., 100 Monte-Carlo runs), and the proportions of common genes are drawn as a function of the proportion of common patients for each selection method (i.e., lasso, elastic net, adaptive elastic net, univariate Cox) (R script stability genes.R).
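The overlap statistic can be sketched as below; the text does not spell out the exact normalization, so intersection over the average selection size is our assumption:

```python
def gene_overlap(genes_a, genes_b):
    """Proportion of common genes between two selections.

    Normalization: intersection size over the mean selection size
    (our choice; the exact definition is not specified in the text).
    """
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / (0.5 * (len(a) + len(b)))
```

Two selections of four genes sharing two of them thus overlap by 50%.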

Impact of the number of patients
To assess the impact of the number of patients in a cohort used to infer β̂, we use the real TCGA datasets and consider only a subset of nT patients for the training set (nT ∈ {25, 50, 100, 200, 400}), while keeping 100 patients in the test set to compute the C-index estimates (R script impact n patients.R).

Figure 1: Boxplots of the C-indices for the 11 cancers from TCGA with more than 500 patients with mRNA-seq data available, computed with the ridge penalty and all of the genes in real datasets. PRAD, prostate adenocarcinoma; STES, stomach and esophageal carcinoma; LUSC, lung squamous cell carcinoma; LUAD, lung adenocarcinoma; THCA, thyroid carcinoma; COADREAD, colorectal adenocarcinoma; HNSC, head and neck squamous cell carcinoma; BRCA, breast invasive carcinoma; UCEC, uterine corpus endometrial carcinoma; KIRC, kidney renal clear cell carcinoma; LGG, brain lower grade glioma.

Overall survival prediction quality is cancer dependent

Figure 1 shows the distribution of the C-index estimates for the 11 cancers using the estimation procedure described in section 2.4 for the real TCGA datasets. Prostate adenocarcinoma (PRAD) has the lowest prediction quality, with a median C-index of 0.50, which corresponds to purely random estimation. Different characteristics are observed among the four cancers we selected. LGG shows the highest median concordance (C = 0.81), KIRC ranks second (C = 0.70), BRCA has intermediate predictive accuracy (C = 0.66), and LUSC has low median concordance (C = 0.55). This indicates that the overall survival prediction quality using a Cox model is cancer dependent.

Pre-filtering to keep only the most variable genes preserves prediction performance while reducing the dimension
For all four cancers investigated, when keeping only the genes with IQR ≥ 4, the C-indices computed for the real TCGA datasets vary very slightly (Fig. 2; Supplementary Fig. S3). However, filtering out more of the variable genes (IQR ≥ 5) leads to reduced accuracy for three of the four cancers investigated. For KIRC, this pre-filtering step reduces the number of genes from 20,245 to 178, while hardly affecting the prediction accuracy. The median C-index even increases for BRCA: for the ridge regression, it goes from 0.66 without pre-filtering (20,248 genes) to 0.70 when restricted to the most variable genes across the patients (IQR ≥ 4, 271 genes).
We suggest using the C-index to determine the IQR threshold for the pre-filtering step as follows:
• Determine the different thresholds to test according to the distribution of the IQR of each gene (in this study, typically 0, 1, ..., 5).
• Compute β̂ with the chosen method from a random selection of 80% of the patients, and estimate the C-index with the remaining 20% of the patients, 100 times for each threshold.
• Choose a threshold for the pre-filtering step according to the observed relation C = f(IQR threshold), as in Figure 2.

Figure 2: Boxplots of C-indices as a function of the IQR threshold. The numbers of genes retained by the pre-filtering step are also shown. Blue, the lasso; orange, the elastic net; green, the adaptive elastic net; red, the ridge; dark red, the univariate Cox; light blue, the ridge under the null hypothesis.
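The steps above can be sketched as a single loop (Python/numpy). Here fit_and_score stands for any of the penalized Cox fits returning a test-set C-index; its interface is our assumption, not the paper's R code:

```python
import numpy as np

def c_vs_threshold(expr, time, status, thresholds, fit_and_score, n_rep=100, seed=0):
    """Median test-set C-index as a function of the IQR pre-filtering threshold.

    fit_and_score(X_train, t_train, s_train, X_test, t_test, s_test) must
    return a test-set C-index for the chosen penalized Cox method.
    """
    rng = np.random.default_rng(seed)
    n = expr.shape[0]
    q75, q25 = np.percentile(expr, [75, 25], axis=0)
    iqr = q75 - q25
    medians = {}
    for thr in thresholds:
        cols = np.where(iqr >= thr)[0]                 # genes kept by the pre-filter
        scores = []
        for _ in range(n_rep):
            idx = rng.permutation(n)
            test, train = idx[: n // 5], idx[n // 5:]  # 20% / 80% split
            scores.append(fit_and_score(expr[np.ix_(train, cols)],
                                        time[train], status[train],
                                        expr[np.ix_(test, cols)],
                                        time[test], status[test]))
        medians[thr] = float(np.median(scores))
    return medians
```

The resulting dictionary gives the C = f(IQR threshold) relation from which a threshold can be read off.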
Another important feature is that the elastic net is less stringent than the lasso, as it selects more genes, while it is more stringent than the adaptive elastic net (Supplementary Fig. S4). For the LUSC patients analyzed with the lasso and elastic net penalties, keeping all of the genes leads to no genes selected in many cases (Supplementary Fig. S4D). For the adaptive elastic net, due to the weight on each gene that is calculated in the first step of the procedure, a few tens of genes are always selected, while the overall performance remains very low, with a concordance of ∼0.50. For this cancer, the pre-filtering of the genes clearly improves the estimation, but the C-index remains typically below 0.55, with an optimal threshold of 4 or 5, depending on the penalization method (Supplementary Fig. S3C).
For all of the IQR thresholds, the median C-indices are significantly higher for all of the multivariate methods compared to the univariate Cox selection for KIRC (Fig. 2) and LGG (Supplementary Fig. S3B), and equivalent for BRCA and LUSC (Supplementary Figs. S3A, S3C). A linear combination of gene signatures with coefficients estimated in the Cox model provides better prediction of the overall survival than one gene alone. The different multivariate Cox penalizations (i.e., lasso, elastic net, adaptive elastic net, ridge) perform equally in terms of prediction, but the lasso, elastic net and adaptive elastic net penalties additionally allow the selection of models of moderate size.
The C-indices computed under the null hypothesis "β = 0" for each methodology and each threshold are shown in Supplementary Figure S5 for KIRC, and these allow comparisons of the results with the pure random model.

Gene selection leads to high rates of false positives and false negatives, and low stability
We assess the selection performance using simulations of overall survival and follow-up time.
The best results are obtained with the genes with an IQR ≥ 4. The ground truth, as the pool of genes used for this simulation, comprises 37 genes for KIRC, chosen from among the 178 genes with IQR ≥ 4, as detailed in the Materials and Methods section. The elastic net reaches the highest median sensitivity (0.70), but with a median FDP of 0.55 (Fig. 3). The lasso and adaptive elastic net penalties perform similarly, with sensitivities of 0.58 and 0.62, respectively, and FDPs of 0.48 and 0.49, respectively. Indeed, overall, half of the selected genes are false positives, and 75% of the ground truth is discovered in the best scenario. The univariate Cox models clearly perform the worst at retrieving the genes of the ground truth, with a sensitivity below 25% and an FDP above 75%. Similar results are obtained for BRCA, LGG, and LUSC (Supplementary Fig. S6). Supplementary Figure S7 shows that the elastic net and adaptive elastic net penalties select too many genes, while the lasso selects a number of genes comparable to the size of the ground truth.

Figure 3: Sensitivity and false discovery proportion computed on simulated datasets as a function of the IQR threshold. Blue, the lasso; orange, the elastic net; green, the adaptive elastic net; dark red, the univariate Cox.
A good compromise in terms of concordance and selection performance among the five methods (i.e., lasso, elastic net, adaptive elastic net, ridge, univariate Cox) and the four studied cancers is an IQR threshold of 3. This reduces the number of genes to 300 to 900, depending on the cancer. For the following analyses, we therefore use a pre-filtering step to keep only the genes with IQR ≥ 3.

The ridge regression shrinks the PIs more and is more robust than the lasso, elastic net and adaptive elastic net penalties

Figure 4 shows that the PIs computed with the ridge penalty for the real TCGA subdatasets are more robust and more shrunken than those of the three other methods (i.e., lasso, elastic net, adaptive elastic net). Among the three selection methods, the PIs computed with the elastic net have the best robustness, while the adaptive elastic net performs the worst. Similar results are obtained for BRCA, LGG and LUSC (Supplementary Fig. S8). The squared ℓ2 norm in the elastic net penalty brings stability to the estimation of β̂, whereas the two steps of the adaptive elastic net make the estimation more unstable.

Gene selection is unstable
To evaluate the stability of the pool of selected genes using the different variants of the lasso, we compare the pools obtained for two subsets of the real TCGA data. A decrease in the proportion of common patients quickly reduces the proportion of common genes selected, down to 25% for 50% of the patients in common (Fig. 5). When selecting genes within two subsets with independent patients, the proportion of overlapping genes drops to <10% for the penalized Cox models (6.7% for the lasso, 9.2% for the elastic net, 9.3% for the adaptive elastic net). Similar results are obtained for BRCA and LUSC. The best results remain <25%, and are obtained for LGG (13.8% for the lasso, 23.5% for the elastic net, 19.1% for the adaptive elastic net) (Supplementary Fig. S9). The univariate Cox model reaches a higher stability performance, but does not exceed 50% of common genes selected. Note that for identical datasets (100% of common patients), the pools of genes selected are not exactly the same, as the inferred λmin can differ due to the random partitions used in the cross-validation step.

Figure 5: Stability of the genes selected, computed with the genes with IQR ≥ 3 from the real TCGA dataset for kidney renal clear cell carcinoma. Boxplots of the proportions of common genes selected in the two datasets as a function of the proportion of common patients in the two datasets. Blue, the lasso; orange, the elastic net; green, the adaptive elastic net; dark red, the univariate Cox.
Prediction accuracy reaches a plateau at different numbers of patients, depending on the cancer

As expected, the concordance increases with the number of patients used to estimate β̂ for the real TCGA datasets (Supplementary Fig. S10). It reaches a plateau in different ways, depending on the cancer considered. For KIRC, the plateau is not clearly reached even with 400 patients, whereas for LGG the plateau is reached with 100-200 patients. Except for LGG, for which the ridge regression performs similarly to the other penalized Cox methods, the ridge method slightly outperforms the other penalization methods when the number of patients is low. As the coverage is different for each nT (i.e., between bootstrap runs, there are more common patients in the training sets for nT = 400 than for nT = 25), the variances are biased. Thus, we only show the median C-indices here.

Discussion
We first show here that the overall survival prediction quality is cancer dependent, and can be as low as purely random prediction, with a median concordance of 0.50 for PRAD. Also, for univariate selection procedures based on the correlation between covariates and a binary outcome ("good" or "bad"), the prognostic genes listed were shown to be non-robust [10]. Our results illustrate the same issue for multivariate Cox selection methods. Different hypotheses can explain both these differences across cancers and the poor selection performances. First, many studies show intra-tumoral heterogeneity (e.g., genetic and phenotypic variations across different geographic regions of one tumor), including for brain, breast, renal, and lung cancers [33]. The expression levels of the genes correlated to survival (the input of the Cox model) can then vary across the geographic regions of one tumor, although the survival time of the patient is a global outcome. Secondly, other explanatory variables can better predict overall survival for some cancers. For example, [30] recently showed that tumor microbiome diversity influences the tumor outcome for patients with pancreatic cancer. If the transcriptomic data are poorly correlated to such hidden variables, predicting patient outcome from them has a low probability of success. Finally, the existence of sub-types of cancers with different molecular characteristics [7,19] would lead to poor regression performance and prediction, as all of the patient data are treated similarly with the same model.
The main goal of penalty methods in Cox models is both to give significant weight to prognostic genes and to predict overall survival. Several comparisons of selection methods that include multivariate Cox models have been carried out, but these were applied to smaller datasets and with different metrics to assess the quality of the models. [3] concluded that there is a strong need for research on improved pre-filtering of covariates for high-dimensional Cox regression. We share these conclusions, and we propose a methodology to tune the pre-filtering of the genes based on a robust estimate of their variability among patient biopsies. Even if direct application of the lasso alternatives of the Cox model shows disappointing prediction for many cancer types, promising strategies do emerge. [8] suggested implementing 'knowledge-based' (as opposed to 'ignorance-based') approaches, as widely used in all branches of physics. Bioinformatics associated with strong clinical and biological assumptions can increase the robustness, simplicity, and accuracy of such results. Finally, the same author showed that combining 'omics' and clinical data gives promising results.

Conclusions
This article investigates Cox model penalty methods (i.e., lasso, elastic net, adaptive elastic net, ridge), the univariate Cox procedure, and pre-filtering of the genes based on their variability, for the prediction of overall survival and the selection of a set of prognostic biomarkers. We compare four cancers (i.e., KIRC, BRCA, LGG, LUSC) using data from TCGA. We first demonstrate that the prediction accuracy is highly cancer dependent, with concordance indices ranging from 0.81 for LGG to 0.55 for LUSC, and down to 0.50 for PRAD. Secondly, we propose a methodology to pre-filter the genes according to their variability, and show that the prediction accuracy varies very slightly when keeping only the genes with IQR ≥ 4. This methodology allows for an important dimension reduction. Thirdly, we show that even in the ideal situation, with the data simulated using the Cox model, the pool of selected genes is not stable, with many false positive and false negative genes. Fourth, the multivariate Cox penalties (i.e., lasso, elastic net, adaptive elastic net, ridge) perform better than the univariate Cox procedure for the prediction of patient outcome, and the lasso, elastic net and adaptive elastic net penalties are also more powerful than univariate Cox selection in terms of selection accuracy. Finally, we demonstrate that the prognostic indices computed with the ridge regression are more robust than those computed with the lasso, elastic net or adaptive elastic net penalties, and reach slightly better prediction performance. Thus, we can be more confident in the use of the ridge penalty for computing the risk score (the PIs).
As practical conclusions, we first advise pre-filtering the genes based on their variability across patients, with the C-index computed by cross-validation as the decision criterion. Second, we advise against using selection procedures based on the Cox model to identify prognostic biomarkers. Finally, we advise using the ridge penalty to predict overall survival after the pre-filtering step, or the elastic net if a more parsimonious model is needed, although its prognostic indices are less robust.