A conditional survival distribution-based method for censored data imputation: overcoming the hurdle in machine learning-based survival analysis

Data analyses by machine learning (ML) algorithms are gaining popularity in biomedical research. When time-to-event data are of interest, censoring is common and needs to be properly addressed. Most ML methods cannot conveniently and appropriately take the censoring information into consideration, potentially leading to inaccurate or biased results. We aim to develop a general-purpose method for imputing censored survival data, facilitating downstream ML analysis. In this study, we propose a novel method of imputing the survival times for censored observations. The proposal is based on their conditional survival distributions (CondiS) derived from Kaplan-Meier estimators. CondiS can replace censored observations with their best approximations from the statistical model, allowing for direct application of ML methods. When covariates are available, we extend CondiS by incorporating the covariate information through ML modeling (CondiS-X), which further improves the accuracy of the imputed survival time. Compared with existing methods with similar purposes, the proposed methods achieved smaller prediction errors and higher concordance with the underlying true survival times in extensive simulation studies. We also demonstrated the usage and advantages of the proposed methods through two real-world cancer datasets. The major advantage of CondiS is that it allows for the direct application of standard ML techniques for analysis once the censored survival times are imputed. We present a user-friendly R package to implement our method, which is a useful tool for ML-based biomedical research in this era of big data.


INTRODUCTION
Analyzing the time to the event of interest for patients with personalized predictors is a main goal of survival analysis. It provides valuable information about survival related factors and may enlighten directions for prolonging patients' lives. Such analysis also assists clinicians in allocating resources more efficiently and developing suitable treatment plans for patients at different risk levels. More generally, survival analysis is a branch of statistics that involves the modelling of the time-to-event data. [1] Compared with traditional data analysis problems, partial observations of the outcome are more prevalent in survival analysis. This form of missing data issue is called censoring. [2] There are three types of censoring: left, right, and interval. Right-censoring, the most common, usually happens when patients drop out from the study or when the study ends earlier than the occurrence of the event of interest. Here, we focus on addressing right censoring in survival data, but the principal may also apply for other types of censoring events.
Given the importance of survival data, there is a long history of statistical method development to appropriately address censoring in survival analysis. These statistical approaches can be roughly divided into three categories: nonparametric methods (e.g., the Kaplan-Meier estimator [3] and the Nelson-Aalen estimator [4]), parametric methods (e.g., the Tobit model [5]), and semi-parametric methods (e.g., the Cox proportional hazards model [6]). The parametric and semi-parametric methods can be applied as regression-based models. However, most machine learning methods currently are developed for non-censored data. It takes huge amounts of efforts to re-develop current machine learning (ML) methods to handle censored data. In the literature, a lot of researchers chose to use ad-hoc techniques, which can lead to severely biased results. For example, some ignored the censoring information and directly applied ML methods for non-censored data to data with censoring, [7 8] while some simply discarded those censored observations. [9 10] Currently, the healthcare industry has generated a vast amount of clinical data stored in electronic health records (EHRs) and genomic data created from laboratory experiments. [11] ML methods are gaining popularity as they are an advance in taking in this massive amount of data. [12] Several ML methods have been adapted to handle survival data, such as the survival tree model [13], the support vector approach [14], the neural network and the convolutional network methods [15] [16], as well as the deep survival and deep learning methods [17] [18].
However, all of these methods are reliant upon altering specific ML algorithms to handle censored data. The number of ML methods with the specific ability to handle censored data is limited. Thus, this adapting approach is not generalizable to the wide variety of existing ML methods. This limits the application of ML in biomedical research. Alternatively, researchers may simply discard censored observations or treat them as observed event times. [2] These ad hoc approaches ignore the nature of censoring events and may lead to biased or even wrong results. There is a need for statistically rigorous methods that can provide a general solution for addressing censored data in ML research.
The objective of this study is to develop a general method for right-censored time-to-event data for use in all ML algorithms. A popular imputation method proposed by Klein et al. generates pseudo-survival time based on right-censored data. [19] These pseudo values are derived from the imputation formula by comparing the whole population-averaged and leave-one-out estimators of the survival time. [20] Their method can be applied to all ML algorithms since the pseudo-observations are computed for all patients. However, there are a few limitations. First, the method imputes survival times even for those subjects whose survival times have been observed, and the imputed survival times usually are different from the truly observed survival times. Second, the Pseudo method heavily relies on a tuning parameter controlling the maximum follow up time. Inappropriate selection of this parameter may lead to poor results.
In this study, we proposed a novel method of imputing the survival times for censored observations based on their conditional survival distributions (CondiS) derived from the Kaplan-Meier estimator. CondiS can replace the censored observations with the best approximations from the statistical model, allowing for direct application of the ML methods.
When covariates were available, we extended CondiS by incorporating the covariate information through ML modeling (CondiS-X), which further improved the survival time imputation. Using CondiS, only the observations of censored data were imputed.
Compared with the pseudo survival time method by Klein et al, our method is more natural.
We studied the use of (1) raw data by treating censoring time as observed survival time (termed OBS), (2) the Pseudo method and (3) the CondiS method, on both simulated data and realworld data. When compared with the true survival time, using metrics including the concordance index, the mean absolute error (MAE), and the median absolute error (medAE), the simulation results show that CondiS outperforms the other two methods both before and after applying ML models. Similar conclusions were reached for real-data examples.
This paper is organized as follows: in Section 2, we first provide the statistical definition of the CondiS method and introduce its extension, CondiS-X. A schematic flowchart of our method is given. We also discuss the setup of the simulation design and some basic information of two real-world datasets. In addition, we describe three metrics to evaluate the performance of our method. In Section 3, we present the results of the Monte Carlo simulations and two empirical real-world examples. In Section 4, we illustrate the advantages and potential limitations of the proposed method. In Section 5, we briefly summarize this study. Additional proofs of the performance of our method are collected in the Supplementary Materials. We performed all the experiments in R and a user-friendly software package for implementing our proposed methods is available from https://github.com/yizhuo-wang/CondiS.

MATERIALS AND METHODS
The conventional survival models, such as the widely-used Cox model, usually are developed from the hazard function, which is defined as the instantaneous probability of the event of interest occurring within a narrow time frame. [1] For this reason, in traditional clinical trials the treatment effect frequently is reported in terms of the hazard ratio (HR). [21] HR is an estimator of relative risk/hazard rate reduction in different groups and can be obtained directly from a Cox proportional hazards model. However, many studies have shown that when the proportional hazards (PH) assumption of the treatment effect is violated, the estimation of the hazard rate is affected by the follow-up time in the study. This compromises the use of HR and makes it inaccurate. [22] Moreover, HR does not directly convert to differences in survival times, which makes it difficult to explain for non-statisticians. As a result, more interpretable measures are being explored for assessing the treatment effect. [22] Recently, the restricted mean survival time (RMST) has become popular for quantifying the between-group difference in survival analysis. [23] RMST does not depend on the proportional hazard assumption. It is a robust criterion for comparing treatments, and it is intuitively appealing to patients and physicians. Our proposed method, CondiS, is based on the concept of RMST and the timeconditioned survival function.

CondiS
Suppose the Kaplan-Meier estimator of the survival function that the probability of life being longer than t is given by Ŝ(t). Then the mean survival time restricted to * (RMST) [24] is: If patient is censored at = < * , then his expected survival time restricted to * can be estimated by a t * -year conditional survival distribution for patients who have survived for ti years: Suppose 1 ≤ 2 ≤ ⋯ ≤ , * can be any self-defined time-of-interest or can be defaulted as the largest survival time, . If ⩾ * , then ̂ * = * . If patient is not censored, then ̂ * = min( , * ). We propose to use ̂ * as an imputed survival time if patient i's survival time is censored. After all censored survival times are imputed this way, they are combined with observed survival times and supplied to any ML method for analysis. Transformation may be applied to all the survival times (imputed or observed) before applying the ML analysis.

CondiS-X
Ideally, we want the imputed survival time to be as close to the true survival time as possible.
In the above imputation by CondiS, we did not take the covariate information into consideration.
It is likely that some covariates are associated with patients' survival and thus can help further improve our imputation. To use that covariate information, we extended CondiS by incorporating the covariates through ML modeling (CondiS-X). Specifically, we trained an ML model with the imputed or observed survival time as the dependent variable and covariates as independent variables. We then used this model to predict the censored survival times in the same dataset and used the predicted survival times to update the previously imputed values.
This step may be done similar to cross-validation. That is, we randomly divide the whole dataset into m (e.g., m=10) partitions of equal sample size, use m-1 partitions of the data to train the ML model, and use the resulting model to make survival time predictions for the remaining partition. We repeat this process for each of the m parts. A schematic diagram of the whole process of CondiS and CondiS-X is shown in Figure 1. A survival dataset including right-censored observations is our input data. We first apply CondiS to this dataset and obtain the imputed survival times for the censored observations. If the covariates information is available, we will add an extra step that uses CondiS-X to update the imputed survival times.
Now we have a survival dataset without censoring that we can use to apply any ML technique.

Simulation design
We use the inverse probability method presented by Bender et al. to generate true survival times T from the proportional hazards model. [25] Let (· | ) be the conditional survival function derived from the proportional hazards model: where 0 ( ) is the baseline cumulative hazard function, is the given covariate vector and is the regression coefficient. Generate random numbers from a uniform distribution on [0,1], then we can generate T through ) .
Here we assume that at the baseline, T follows a Weibull distribution with a shape parameter > 0 and a scale parameter > 0 , i.e., we have the cumulative hazard function and its inverse: Then we generate the Weibull survival time conditional on covariates by drawing In this example, the covariate matrix consists of ten biomarkers, one binary treatment, two treatment-biomarker interaction terms and one random noise: x ′ = 1 1 + 2 2 + 3 3 + 4 4 + 5 5 + 6 6 + 7 ( 7 ) + 8 ( 8 ) + 9 ( 9 ) + 10 ( 10 ) + α + 1 5 + 2 6 + , where 1~10 are biomarkers generated from a normal distribution with a mean of 0 and a standard deviation of 1, is a binary treatment with either 0 or 1, 1~10 are biomarkers' coefficients, α is the treatment main effect coefficient, 1~2 are the treatment-biomarker interaction coefficients, and is a tunable random noise. Among 10 biomarkers, 7~10 contribute to the Weibull survival time as step functions: To create right-censored observations, we drew random times from another Weibull distribution as the censoring time : where > 0 is the shape parameter, > 0 is the scale parameter of the distribution, and is a random number to tune the censoring time . Then our final observed survival time can be computed as = ( , ).
We applied the Pseudo and CondiS methods and obtained their survival times. The better method should be closer to the true survival time. For the extension, CondiS-X, we built 7 commonly used ML models, including the generalized linear model (GLM) [26], LASSO regression [27], Ridge regression [28], gradient boosting machine (GBM) [29], random forest (RF) [30], support vector machine (SVM) [31], k-nearest neighbors (KNN) [32], and artificial neural networks (ANN) [33]. Note that the same models were built using the raw observed time (OBS-X) and the pseudo time (Pseudo-X) for later comparison.

Real-world examples
We applied our method to two real-world datasets. The first one included 2982 primary breast cancers patients whose data were included in the Rotterdam tumor bank. [34] We took only the uncensored part of the data, which left us with 1214 observations. We used only the observed survival time so that we had a gold standard to evaluate our methods. For the dataset of 1214 observations, we artificially generated censoring times from a uniform distribution. The final observed survival time was computed as = ( , ) , where was the true survival time from the original Rotterdam dataset. In this way, we were able to evaluate the performance of our method after applying it to the dataset of 1214 observations, some of which were censored by the artificially generated censoring times C. The second dataset we used was a blood cancer dataset, which included a cohort of 1,001 diffuse large B cell lymphoma (DLBCL) patients. [35] We kept the 574 uncensored subjects and, similarly, we simulated censoring times for this DLBCL dataset with 574 observations to apply and test our method.

Evaluation metrics
For censored data, standard evaluation metrics for regression problems such as the root of the mean squared error and R 2 are not appropriate for measuring the performance in survival analysis. [36] In our study, we resorted to other evaluation metrics. One of the most frequently used metrics in survival analysis is the concordance index (CI), which is a measure of the rank correlation. [37] We used CI to assess the correlation between imputed survival time and the Mean absolute error (MAE) is another commonly-used evaluation metric, which measures the average magnitude of errors between paired observations. [38] The formula is where is the true survival time and ̂ is the simulated survival time.
Median absolute error (medAE) is also used here because it is robust to outliers. [38] It is calculated by taking the median of all absolute differences between the true survival time and the simulated survival time.
Using these three metrics, we thoroughly compared the CondiS imputed survival time to the Pseudo survival time. Meanwhile, since we extended our method by incorporating covariate information through ML models (CondiS-X), we should also do the same thing to other methods when comparing them. We extended the raw observation time (OBS), i.e., ignoring the censoring status, and the Pseudo survival time to OBS-X and Pseudo-X through the same ML models. Then we compared the performance of OBS-X, Pseudo-X, and CondiS-X using CI, MAE, and medAE.
We conducted 1000 Monte Carlo simulations for each scenario and compared the results of the raw observed time, the Pseudo method, and the CondiS method. Since MAE may be greatly distorted by extreme outliers, we also present medAE. CI describes the consistency between observed and predicted relative ranks of the survival time of the subjects. For MAE and medAE, smaller is better. For CI, larger is better.
In Figure 2 Figure 2(B), we can see that the medAE value remains at zero until the censoring percentage raises to 60%. This makes sense because, using the CondiS method, only the censored observations in the dataset are replaced. When the censoring percentage is less 50%, the medAE is always zero because more than 50% of survival times are observed, and they stay the same before and after applying the CondiS method, resulting in zero absolute errors.

CondiS-X
To improve the imputed survival time, we utilized the given covariate information through different ML models. Fixing the censoring rate at 40%, the results of MAE, medAE, and CI comparing OBS-X, Pseudo-X, and CondiS-X are shown in Figure 3 (A), (B), and (C). Again, in Figure 3(A) and Figure 3(C), CondiS X gives us the lowest MAE and the highest CI among these three methods. And as we would expect, the medAE values of CondiS-X remain at zero for each of the ML models because the censoring percentage is under 50%. Furthermore, we can see that, when X is a nonparametric model such as the GBM, the RF, and the SVM models, we obtain better results. For this specific simulation dataset, these models maximize the effect of covariates and improve the imputed time to a greater extent.
In Figure 3 CondiS method maintained its advantage of a zero median error. Regarding the CI in Figure   3(F), the performance of the OBS method was unstable, the performance of the Pseudo method was worse, and only the CondiS method improved correlation with the true survival time. In summary, we conclude that the CondiS method benefited from the covariate information and improved its imputed survival time.

Real-world examples
We tested CondiS/CondiS-X on two real-world datasets, the Rotterdam dataset [34] and the DLBCL dataset. [35] Similarly, we compared the performance of our method to the OBS and the Pseudo methods. We took only the uncensored subjects with true survival times and then simulated censoring times for these subjects. In this way, we have a gold standard to evaluate our methods.
For both real-world datasets, we observed a similar pattern as for the simulation.

Discussion
This study introduced a method (CondiS) to impute censored survival data, allowing for direct application to all ML models. Moreover, we extended our method by utilizing the covariate information through ML models. Regarding which ML algorithm to use, it depends on the data.
For example, if there are lots of non-linear relationships and interactions in the data, one of the nonparametric models might be the appropriate choice. No matter which algorithm is used, the imputed time will always get closer to the true survival time as long as your covariates are relatively significant in the model. increase the effective sample size for these real-world datasets.

Conclusion
Censoring complicates the simple regression analysis of survival time and also causes inconvenience when applying complex ML algorithms. In this study, we have proposed an alternative method to impute the censored data based on the conditional survival distribution (CondiS). Compared with existing methods with similar objectives, the imputed time of our method is able to be more like the underlying true survival time by achieving a smaller mean absolute error, a smaller median absolute error, and a higher concordance index. When covariates are available, we further improve our method by utilizing the covariate information through different machine learning models (CondiS-X), which helps to predict the survival time more accurately. The major advantage of our method is that it allows for direct use of standard ML techniques for analysis once the imputed survival time is obtained. Figure

AUTHOR CONTRIBUTIONS
XH, ZL and YW conceived the concept of the study and designed the method. YW implemented the method, performed the simulations, and drafted the initial manuscript. CF provided guidance on interpreting the real-world data. All authors edited and approved the final manuscript.