Stable breast cancer prognosis

Predicting breast cancer prognosis helps improve the treatment and management of the disease. In the last decades, many prediction models have been developed for breast cancer prognosis based on transcriptomic data. A common assumption made by these models is that the test and training data follow the same distribution. However, in practice, due to the heterogeneity of breast cancer and the different environments (e.g. hospitals) where data are collected, the distribution of the test data may shift from that of the training data. For example, new patients likely have different breast cancer stage distribution from those in the training dataset. Thus these existing methods may not provide stable prediction performance for breast cancer prognosis in situations with the shift of data distribution. In this paper, we present a novel stable prediction method for reliable breast cancer prognosis under data distribution shift. Our model, known as Deep Global Balancing Cox regression (DGBCox), is based on the causal inference theory. In DGBCox, firstly high-dimensional gene expression data is transferred to latent network-based representations by a deep auto-encoder neural network. Then after balancing the latent representations using a proposed causality-based approach, causal latent features are selected for breast cancer prognosis. Causal features have persistent relationships with survival outcomes even under distribution shift across different environments according to the causal inference theory. Therefore, the proposed DGBCox method is robust and stable for breast cancer prognosis. We apply DGBCox to 12 test datasets from different breast cancer studies. The results show that DGBCox outperforms benchmark methods in terms of both prediction accuracy and stability. We also propose a permutation importance algorithm to rank the genes in the DGBCox model. The top 50 ranked genes suggest that the cell cycle and the organelle organisation could be the most relevant biological processes for stable breast cancer prognosis. Author summary Various prediction models have been proposed for breast cancer prognosis. The prediction models usually train on a dataset and predict the survival outcomes of patients in new test datasets. The majority of these models share a common assumption that the test and training data follow the same distribution. However, as breast cancer is a heterogeneous disease, the assumption may be violated in practice. In this study, we propose a novel method for reliable breast cancer prognosis when the test data distribution shifts from that of the training data. The proposed model has been trained on one dataset and applied to twelve test datasets from different breast cancer studies. In comparison with the benchmark methods in breast cancer prognosis, our model shows better prediction accuracy and stability. The top 50 important genes in our model provide clues to the relationship between several biological mechanisms and clinical outcomes of breast cancer. Our proposed method in breast cancer can potentially be adapted to apply to other cancer types.

Introduction 1 take advantage of machine learning techniques, to selecting vital features in 23 high-dimension transcriptomic data and to model non-linear relationships between 24 features (genes) and survival outcomes. 25 These methods assume no distribution shift from the training data to the test data 26 (new unseen data). However, the assumption will be violated in real applications. For 27 example, gene expression data may be produced using different platforms, by different 28 labs and patients may come from different countries or have different breast cancer 29 stages. Users do not know what future unseen data will be like. Inaccurate prognosis 30 predictions will have resulted from the distribution shift even though the models 31 perform well in the training and hold out validation datasets. 32 The computational methods need to have stable prediction performance since their 33 predictions directly impact treatment decisions. Here the stability of a method means 34 that the method performs consistently with the training data and the test data (i.e. 35 unseen breast cancer patients) which may have different distributions. One strategy to 36 improve the stability of a prediction method is to utilize samples from multiple sources 37 to train the model. For example, several methods have been developed to select a set of 38 robust features from multiple datasets obtained from the studies of the same disease, 39 and use these features to build a more reliable prediction model [13,14]. Another From the above discussion, we can see that these existing stable prediction methods, 48 when used for breast cancer prognosis, require cancer patient samples with 49 transcriptomic data and/or survival outcomes coming from either different sources or 50 multiple domains in the training phases. Both are challenging in practice because it is 51 time-consuming to follow up cancer patients over long-term periods and collect a wide 52 variety of data. Therefore, it is useful to develop survival prediction models that train 53 on an available dataset and are robust to distribution shifts in test data from new 54 environments. 55 In this paper, we focus on designing a model for stable survival prediction under new 56 environments where the data distribution has been shifted from that in the training 57 dataset. The new environment is agnostic to us and the distribution in the test data is 58 unknown. In this case, we cannot use any information from the test data when we train 59 a model. To learn a robust prediction model across different environments, the 60 predictors (signatures) must be causal genes, i.e. genes causing the development of the 61 disease [17]. The pinpointed causal genes in the model can be translated into disease 62 mechanism and clinical treatment [18]. Therefore, we will model stable breast cancer 63 prognosis predictions as a causal problem. 64 In the research area of causal inference, the causal effects are used for measuring the 65 impact of a treatment on patients. Covariate balancing strategies are used for 66 estimating the causal effect of a treatment variable on the outcome in the observational 67 data. A recent global balancing strategy by Kuang et al. [19] is a covariate balancing 68 strategy used to estimate the causal effects of covariates on the outcome when one does 69 not know which covariates are causal and which ones are not. This can be used for our 70 design because we do not know which genes are causal. However, three limitations 71 prevent this method from being immediately applied to achieve a stable breast cancer stable breast cancer prognosis based on censored survival data and gene expression data. 80 With DGBCox, we have made two main contributions: (1)  The clinical outcomes included in the datasets used in this study include relapse free 109 survival time (UPP, GSE6532, GEO736, TCGA1093, TCGA753, TCGA500,     and learns weights to balance the dataset (details in the next section). Therefore we 140 apply a deep auto-encoder neural network [35] to reduce the dimensionality of the gene 141 expression data and feed the low dimensional latent features to SGB.

142
DAE consists of two parts, the encoder and the decoder, which can be defined as 143 transitions ϕ : X → ϕ(X) and φ : ϕ(X) → φ(X), respectively. Specifically, the latent 144 representations for each hidden layer of the encoder can be given as: where K is the number of hidden layers in the encoder. A k and b k are the weight 146 matrix and bias vector on the k-th hidden layer of the encoder, respectively. σ(.) is an 147 activation function.

148
The decoder transitions the resulting latent representations ϕ K (X) to a 149 reconstructedX with the same shape of X: whereÀ k andb k are the weight matrix and bias vector on the k-th layer of the encoder 151 respectively. The reconstructedX is the K-th layer output φ K (X). DAE is trained to 152 minimize reconstruction errors between the input gene expression data X and the 153 reconstructedX, such as X −X .

154
The output of the DAE component is the low-dimensional latent features (in ϕ K (X)) 155 which are the non-linear combinations of the input gene expression data. For the sake of 156 simplicity, the superscript K is dropped when the context is clear. ϕ(X) is used for 157 input data of SGB and Cox instead of X as described in the following sections.

158
It is known that causal relationships between covariates are persistent and stable across 160 different environments whereas noncausal relationships or purely correlations may 161 variant across different environments [36].
sample size of the training dataset. In causal inference, to estimate of average treatment 183 effect from observed data with non-random treatment assignment, covariate balancing is 184 a commonly used method to balance the distribution of the covariates between treated 185 and control groups [19]. Under the assumption of unconfoundedness , the general idea of the covariate balancing approach is to 187 re-weight the training samples, i.e. re-weigtht the i-th sample with the weight W i , and 188 arg min Through the sample re-weighting, the average treatment effect can be estimated using 190 the difference of the expected weighted survival time of patients in treated and control 191 groups [19].

192
When the causal features (regarded as treatments) are unknown, Kuang et al. 193 proposed a Global Balancing (GB) method to identify causal features from the 194 observational data [19]. GB successively takes each covariate as a treatment variable 195 and balance the distribution of all the other covariates between treatment and control. 196 As a result, GB learns global sample weights in W satisfied with whereP is the dimension of X, X .,p is the p-th covariate or the binary of the p-th 198 covariate (for non-binary variable) in X, and X .,−p = X \ X .,p holds all the remaining 199 covariates except the p-th covariate in X. ⊙ refers to element-wise multiplication. where z (T, Z) = P r(C > T |Z = z) denotes the conditional censoring distribution 217 which is estimated by the Kaplan-Meier method [39,40].

218
With the extended covariate balancing approach, we propose SGB to learn global 219 sample weight vector W from censored survival data: whereπ p = E ⊘ z (T, X .,p ), and ⊘ refers to element-wise division.

221
In gene expression data X, the dimensionP is large. There may be no sufficient 222 samples and computational resources to estimate W by optimizing Formula 6.

223
Therefore we successively regard each latent feature in a low-dimensional space from 224 DAE as a treatment variable and learn W by: where Φ = ϕ(X), P is the dimension of Φ, and Z .,p is the p-th treatment variable via 226 dichotomizing Φ .,p by its mean value. Φ .,p is the p-th covariate in Φ, and where h 0 (t) is the baseline hazard function which is not specified in Cox regression. The 237 prediction of Cox regression is the ratio h(t|X) h0(t) = exp(β T X) which represents the 238 relative hazard risk of a patient.β holds the coefficients of the predictors and can be 239 estimated from training data, and this is often done through minimizing the following 240 loss function (called average negative log partial likelihood): where ℜ( we use the following loss function of our modified Cox regression: where the L1 regularization (also known as lasso regularization) term (∥β∥ 1 ≤ λ 1 ) and 250 the L2 regularization term (∥β∥ The condition of N i=1 (W i − 1) 2 ≤ λ 7 helps to avoid all the weights to be 0.

258
Cancer prognosis with DGBCox

259
The relative hazard risk score for a new patient is calculated by r = exp(β T ϕ(x)). The 260 patient with a low (hazard) risk score is expected to have a better prognosis than the 261 one with a high risk score.

262
Evaluation metrics 263 We first introduce the traditional evaluation metrics for cancer prognosis methods, and 264 then define the metrics for evaluating the stability of a prognosis method.

265
Concordance index C-index [41] is a commonly used measure for evaluating the 266 accuracy of the risk score prediction of a cancer prognosis method. Let Y i and r i be the 267 potential survival time and the risk score for a patient i, respectively. C-index is defined 268 as the concordance probability P r(r j > r i |Y j < Y i ) for all randomly selected pairs of 269 patients (i, j). The concordance means that the patient with a high risk score survives 270 shorter than the one with a low risk score. In other words, the predicted risk score 277 Therefore, C-index can be computed with the following formula:

284
Hazard ratio A commonly used metric for evaluating risk group predictions is 285 hazard ratio [42]. We obtain the predicted groups G for patients via dichotomizing the 286 predicted hazard risk scores by their median value. If a patient's risk score is bigger 287 than the median predicted hazard risk score of the cohort, the patient is put into the 288 high-risk group, otherwise, the patient is put into the low-risk group. Then the risk 289 groups G is fitted in a Cox regression model: where h 0 (t) is the same as that in Eq 8. The quantity exp(µ) is defined as hazard ratio 291 which indicates the risk difference between the two groups of patients. The method with 292 the high hazard ratio performs better than the one with the low hazard ratio.

293
The Log-rank test We use the Log-rank test [43] to assess whether a method 294 successfully stratified patients into two risk groups with statistically significant different 295 survival distributions. In the Log-rank test, the null hypothesis is that there is no where M is the number of test datasets. To assess whether a cancer prognosis model is 303 a stable and accurate model, we define a set of stability evaluation metrics as below. 304 where mean(·) and sd(·) returns the mean and standard deviation of a given vector that 305 holds C-indices or hazard ratios of a method on all the M test datasets. The larger 306 Stability ci or Stability hr is, the more stable the cancer prognosis method will be. In this 307 study, a cancer prognosis model with high stability implies that the model has not only 308 a high average accuracy but also a low standard deviation of accuracy across different 309 environments. Bayesian hyper-parameters search [44]. More information about the hyper-parameters 334 and the training data partition is in Section 3 in S1 File. 335 We also compare our methods with the state-of-the-art methods, including 336 DeepSurv [11], Robust [13], AURKA [45], ESR1 [45], ERBB2 [45], GGI [22], 337 GENIUS [6], Endopredict [7], OncotypeDx [5], TAMR13 [4], PIK3CAGS [27], methods. Fig 2 also shows the stability results based on Stability ci and Stability hr .

350
Please note that the higher the better for all the evaluation metrics.

351
From the results shown in Fig 2,     prognosis in previous studies. 423 We then compare the top 50 ranked genes of DGBCox with cancer causal genes to 424 demonstrate the quality of DGBCox and its findings. It has been confirmed that cancer 425 development and progression are caused by gene mutations [47], and these genes are 426 called cancer causal genes or cancer drivers. We observe that three of the top 50 ranked 427 genes of DGBCox appear in the Cancer Gene Census (CGC) [47], which is a manually 428 curated list of likely cancer drivers. The full list of the top 50 ranked genes and their 429 overlapping with signatures and cancer drivers can be found in S1 Table. 430 To investigate which biological mechanisms are important for stable breast cancer 431 prognosis, we conduct Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis (detailed in Section 5 in S1 File). As that are commonly used for breast cancer subtype diagnosis [48]. STC2 is reported to 442 be associated with breast cancer cell development [49]. ANGPT2 is reported to be 443 capable of promoting breast cancer metastasis [50]. PTTG1 is known as an oncogene 444 that promotes cancer cell proliferation [51], migration, and invasion [52] as well as 445 associated with the epithelial-mesenchymal transition (EMT) in breast cancer [53]. The 446 top 50 ranked genes of DGBCox are significantly enriched (adjusted p-value less than studies have suggested that this pathway plays an important role in the development of 449 breast cancer [54,55], however, the underlying biological mechanism needs to be verified 450 by further laboratory and clinical researches.

452
In this study, we proposed DGBCox to improve the stability of cancer prognosis across 453 unknown environments. To the best of our knowledge, this is the first method designed 454 for the stable cancer prognosis task. Results on 12 independent test datasets 455 demonstrated that DGBCox is effective and stable for breast cancer prognosis across 456 different environments. 457 We further analyzed which genes are important for the stable breast cancer 458 prognosis. Genes were ranked by the proposed permutation importance algorithm. Log-rank tests. The x-axis stands for the dataset, and the y-axis stands for the 484 method. The steel blue tile means that the p-value of a method is less than 0.05, which 485 indicates that the method can successfully stratify patients in the dataset into two risk 486 groups with distinct survival patterns. The grey tile indicates the opposite.