 Research
 Open access
 Published:
Improved nonparametric survival prediction using CoxPH, Random Survival Forest & DeepHit Neural Network
BMC Medical Informatics and Decision Making volume 24, Article number: 120 (2024)
Abstract
In recent times, timetoevent data such as time to failure or death is routinely collected alongside highthroughput covariates. These highdimensional bioinformatics data often challenge classical survival models, which are either infeasible to fit or produce low prediction accuracy due to overfitting. To address this issue, the focus has shifted towards introducing a novel approaches for feature selection and survival prediction. In this article, we propose a new hybrid feature selection approach that handles highdimensional bioinformatics datasets for improved survival prediction. This study explores the efficacy of four distinct variable selection techniques: LASSO, RSFvs, SCAD, and CoxBoost, in the context of nonparametric biomedical survival prediction. Leveraging these methods, we conducted comprehensive variable selection processes. Subsequently, survival analysis models—specifically CoxPH, RSF, and DeepHit NN—were employed to construct predictive models based on the selected variables. Furthermore, we introduce a novel approach wherein only variables consistently selected by a majority of the aforementioned feature selection techniques are considered. This innovative strategy, referred to as the proposed method, aims to enhance the reliability and robustness of variable selection, subsequently improving the predictive performance of the survival analysis models. To evaluate the effectiveness of the proposed method, we compare the performance of the proposed approach with the existing LASSO, RSFvs, SCAD, and CoxBoost techniques using various performance metrics including integrated brier score (IBS), concordance index (CIndex) and integrated absolute error (IAE) for numerous highdimensional survival datasets. The real data applications reveal that the proposed method outperforms the competing methods in terms of survival prediction accuracy.
Introduction
Survival data analysis is a statistical subfield focused on studying the duration until a specific event occurs, such as the time until death in living organisms or the time until failure in mechanical systems. It aims to answer questions like the proportion of a population expected to survive beyond a given time point, the rate at which those who survive will experience the event, and how the likelihood of survival varies with different conditions or characteristics. The objectives of survival analysis are to identify relationships between risk variables and occurrences, to explain the likelihood of an event occurring by a certain period, or to forecast survival times based on informative characteristics. Survival outcomes are generally referred to as events or results related to the duration of time that an individual or entity survives or remains in a particular state before a specific event occurs such as death, relapse, or a specific health condition developing.
In recent decades, there has been an increasing focus on devising methods for selecting relevant features in timetoevent data. This heightened interest is driven by the availability of extensive datasets and the recognition that data sparsity may exist. When we refer to sparsity, we mean that certain features in the dataset may not have any relevance to the outcome of interest, and variable selection becomes the preferred approach to address this. A number of feature selection (FS) techniques have been developed in literature, each working for the same objective function that is to reduce dimension of the data but using a different mechanism to reach the goal. FS algorithms available are numerous but it is important to mention the saying ‘onesizefitsall’ type algorithm does not exists in reality. In such cases, margin of error always exists which needs to be minimized. This motivates researchers to improve already existing algorithm/technique or come up with a fresh idea to do the same task with minimum error. With this in mind, we tried to introduce a hybrid type FS algorithm for survival analysis whose working mechanism is explained in the next section.
Related work
Numerous methods for selecting variables in survival data have been devised over time. To provide a concise overview of the existing body of work and to gain insight into the functionality of each algorithm in this context, a brief literature review is presented below.
Tibshirani [1] extended the Least Absolute Shrinkage and Selection Operator (LASSO) method to the Cox model that was initially introduced for linear regression. In this approach, an L1norm penalty term is added to the loss function. The coefficients (β) are then estimated via maximization of the partial likelihood function while observing the constraints \(\sum {\beta }_{j}\le s\) where ‘s’ is a userdefined nonnegative value. By effectively shrinking the coefficients of less significant and superfluous variables to zero, this constraint lowers the complexity of the model.
The Adaptive LASSO for the Cox proportional hazards model was developed by Zhang and Lu [2] to improve the estimator's characteristics and make it work with common techniques. In order to achieve equilibrium, this strategy gives larger weights to small coefficients and smaller weights to large coefficients. Global optimizers are certain to exist because of the convex shape of the penalty term. The adaptively weighted \(L1\) penalty of the form \(\lambda {\sum }_{k=1}^{p}\frac{{\beta }_{k}}{{\widetilde{\beta }}_{k}}\) where \({\widetilde{\beta }}_{k}=({\widetilde{\beta }}_{1}, {\widetilde{\beta }}_{2}, {\widetilde{\beta }}_{3},\dots , {\widetilde{\beta }}_{p})\) known as Adaptive Lasso penalty maximized partial likelihood.
In certain situations, LASSO has shown limitations. For instance, it tended to limit the number of features based on a systematic relationship with the number of samples in the dataset. Additionally, when dealing with highly correlated predictors, it often selected only one feature among them. To overcome these challenges, alternative methods have been introduced, one of which is Elastic Net. This method involves incorporating a combination of L1 and L2norm penalties into the regression coefficient estimation process. The constraint of the form \(\lambda \left(\left(1\alpha \right)\sum_{j=1}^{p}\left{\beta }_{j}\right+\alpha \sum_{j}^{p}{\beta }_{j}^{2}\right)\) is added to the partial loglikelihood function while maximizing it during the estimation process [3].
Crossvalidation was used to estimate the two parameters, regularization parameter \(\lambda\) and mixing parameter α. Later, it was suggested to just estimate \(\lambda\) by crossvalidation using a fix α = 0.5.
Du et al. [4] introduced a method for variable selection in the Cox Proportional Hazards model with semiparametric relative risk. They initially divided the model into two components: parametric and nonparametric. For the nonparametric component, they employed a smoothing spline ANOVA model to estimate risk, and variable selection was performed using the Kullback–Leibler geometry method. In contrast, for the parametric component, risk estimation was conducted using the Penalized Profile Partial Likelihood approach. Variable selection in this part was achieved by applying a concave penalty, either SCAD (Smoothly Clipped Absolute Deviation) or an Adaptive LASSO penalty. It is advisable to incorporate discrete covariates into the parametric component and continuous covariates into the nonparametric component of the model. Nevertheless, if the estimation process indicates that certain continuous covariate effects can be suitably represented by specific parametric forms, like linear relationships, these covariates can be transferred to the parametric section, and a fresh model can be developed accordingly.
Li and Luan [5] introduced a boosting algorithm utilizing smoothing splines, which constructs a series of smoothing spline models and amalgamates them into a final model. This iterative algorithm modifies the hazard function at each step to rectify errors from prior models. In a comparative study against traditional proportional hazards models, Li and Luan demonstrated the superiority of their boosting algorithm, particularly in analyzing highdimensional microarray data from a breast cancer study. Their research underscores the efficacy of the boosting algorithm with smoothing splines for survival analysis in highdimensional datasets and its capacity to handle complex covariate structures in survival analysis.
Morris et al. [6] devised a feature selection technique tailored for stratified Cox models using gradient boosting. This method is intended for situations where the assumption of proportional hazards is not met. To mitigate confounding effects, they introduced a stratification process, resulting in a stratified proportional hazards model. In their algorithm, variables are chosen based on their capacity to maximize the first partial derivative of the likelihood function. This iterative process updates β coefficients estimates while monitoring the risk of overfitting, with the number of iterations determined through various options like Bayesian Information Criterion (BIC), a predefined number of variables for selection, changes in likelihood, or kfold crossvalidation. Additionally, they developed an R package called 'SurvBoost' to implement this method and conducted a simulation study comparing its accuracy and runtime with the existing 'mBoost' R package.
He et al. [7] introduced an algorithm that applies gradient boosting to the data while considering only one component of β, aiming to improve computational feasibility. The fundamental concept behind this algorithm involves modifying certain aspects of existing variable selection techniques designed for highdimensional survival data. Although componentwise boosting algorithms were already present in the literature, the authors made specific adaptations to enhance its computational efficiency, drawing inspiration from the MinimizationMaximization (MM) algorithm by Hunter and Lange [8].
When conducting FS on randomly sampled observations, the objective is to determine the probability that features are included in the model. This approach is known as stability selection [9]. Additionally, the authors proposed a stability selection boosting procedure based on random permutations, which follows the concept introduced by Tusher et al. [10] in the context of the Significance Analysis of Microarrays (SAM). This modification resulted in a reduced false discovery rate for their variable selection algorithm.
Ishwaran et al. [11] introduced a novel approach in survival tree analysis by introducing a dimensionless ordered statistic known as the 'minimal depth of maximal subtree.' This statistic served as a measure of variable predictiveness within survival trees. Unlike the traditional method of calculating Variable Importance (VIMP) in random forests using permutation, they opted for the minimal depth statistic. This statistic assesses the importance of variables based on their proximity to the root node in a tree, specifically in relation to the root of the nearest maximal subtree. Their variable selection procedure comprises three main steps. Initially, they randomly select covariates for the model and identify the most crucial ones using minimal depth. An initial model is then constructed with these chosen covariates. This process is iterated a predefined number of times, incorporating additional variables into the initial model based on minimal depth criteria until the joint VIMP of nested models stabilizes. This entire process is repeated several times, and the covariates that consistently appear in models larger than the average size are ultimately selected.
Pang et al. [12] introduced a gene selection method that employs an iterative feature elimination procedure within the framework of random survival forest (RSF). Their approach begins by fitting an RSF model to the dataset containing the complete set of covariates and ranking all covariates based on their Variable Importance (VIMP) scores, which are calculated using a permutationbased method. The top variables, typically around 80%, are retained, and their outofbag errors are computed. This process is repeated iteratively until only two covariates remain in the model. The objective is to identify the set of covariates with the minimum number required to maintain an outofbag error rate within 1 standard error. This methodology effectively takes into account the multivariate correlations among variables. Experimental results on real highdimensional microarray datasets with survival outcomes demonstrated that this approach excels in identifying a compact set of genes while preserving predictive accuracy for survival.
Mbogning and Broet [13] introduced a variable selection approach tailored for survival data, which relies on a Topological Index derived from permutation methods. Their methodology begins with the construction of a bagging survival forest on the training data. The importance score, utilized as a criterion for node splitting or determining tree depth during forest construction, serves as the basis for calculating an importance score. These importance scores are denoted as \({HS}_{j}(j = \mathrm{1,2},3,\dots ,p)\) . Subsequently, another bagging survival forest is created, but this time, the importance of variables is computed based on permuted data. This process is iterated a specified number of times. Another list of scores are generated in \({HS}_{j}^{*} (j = \mathrm{1,2},3,\dots ,p)\). Pvalues are calculated for all competing variables \({X}_{j}\) using \({P}_{j}=\frac{1}{Q}\sum_{q=1}^{Q}I\{{HS}_{jq}^{0}>{HS}_{j}\}\). Given a global level α, variables which satisfies the conditions that \({p}_{j}<\frac{a}{m}\) are selected according to a Bonferroni procedure for multiple comparisons.
Indeed the above mentioned techniques are useful for variable selection, their application in this study is however limited for identifying discriminative features in highdimensional survival datasets. Therefore, we have proposed a novel tools for variable/feature selection due their ability in handling highdimensional survival data which is important for better predictions.
Methods and material
Datasets and data processing
In this section, we explore the datasets considered in this research study. A total of 11 survival datasets were used and their detail description along with their source is presented in Table 1. While selecting the datasets for the current research study related to feature selection in survival analysis, it was ensured that the benchmark datasets are highdimensional in nature.
The Cox PH model
The Cox Proportional Hazard Model (CPHM) is a semiparametric method used in survival analysis. It defines the hazard function \('h(t)'\) as the sum of two components: a baseline hazard \('h_0(t)'\) that depends solely on time 't,' and a component related to covariates. The mathematical form of the CPHM is as follows:
Here in this equation, \({h}_{0}(t)\) is a timedependent component that is not influenced by covariates, while \({e}^{{\beta }{\prime}{x}_{i}}\) represents the covariaterelated component, which does not depend on time’t’. It's important to note that there is no constant term “\({\beta }_{0}\)” in the regression coefficients. This absence of a constant term is because it can be absorbed or canceled out by the baseline hazard function, essentially being a part of the hazard function itself [24].
Feature selection methods
With the introduction of new technologies in past few decades, we are able to access such information about a subject under study which one had never before. Having more information about samples/subjects on the other hand can create a situation called the curse of dimensionality. Especially in the case when number of features are substantially greater than the number of observations. This create several problems during the analysis and processing of the data, such as increase in computational cost, noise and redundancy, the problem of overfitting and poor generalization of performance on unseen data [25]. There are three common types of feature selection methods namely filter, wrapper and embedded methods.
Filter methods
This consists of feature selection techniques that ranks the features in dataset ahead of running a learning algorithm and selects features in the model based on a prespecified criteria in connection with the statistical measure being used for ranking purpose. This set of methods are less timeconsuming and inexpensive in nature as they are done as a preprocessing step.
Wrapper methods
Wrapper methodology consists of techniques where subsets of features are made, model is trained on each subset and comparison is made for each subset in terms of performance metrics. Features are added or removed from the subsets until a predefined number of features and performance measure is obtained. The subset which yields a predefined model output is considered as a final set of features for modelling the data further.
Embedded methods
This method combines the core properties of both, the filter and wrapper methods. It is named as embedded because the feature selection technique is blended as a part of the actual learning algorithm. It is a less timeconsuming, inexpensive and more accurate than aforementioned methods. The methods employed in this study for comparison fall under this category. In the following section, we provide a brief overview of the variable selection methods utilized in this study.
Least Absolute Shrinkage and Selection Operator – LASSO
Least Absolute Shrinkage and Selection Operator, or LASSO, is a method that penalises variables. It introduces an L1type penalty term λβ to the coefficients of the Cox regression [1]. This penalty can effectively reduce some coefficients to zero, leading to a reduction in the model's size while maintaining its parsimony. The parameter λ plays a crucial role in determining the number of variables selected in the model. A larger λ value leads to more coefficients being reduced to zero, resulting in a model with fewer features. Conversely, reducing the λ value increases the number of features included in the model when compared to a higher λ value.
In a survival context, the triplet \(\{\left({Y}_{i}, {\delta }_{i}, {X}_{i}\right), i=\mathrm{1,2},3, \dots , n\}\) is used to represent the observed data. Where \({Y}_{i}={\text{min}}({T}_{i}, {C}_{i})\) is the observed survival time, taking minimum of either observed event time “\({T}_{i}\)” or censoring time “\({C}_{i}\)”. \({\delta }_{i}\) is the censoring indicator, \({\delta }_{i}=I({{T}_{i}\le C}_{i})=1\) when actual event of interest is observed and is 0 otherwise. And \({X}_{i}={({x}_{1}, {x}_{2}, {x}_{3},\dots ,{x}_{p})}^{t}\) is the matrix of ‘p’ predictor variables for each subject in the dataset. To model the survival data, we consider semiparametric cox proportional hazards model: \(h\left(tx\right)={h}_{0}(t){\text{exp}}(\sum_{j=1}^{p}{B}_{j}{X}_{j})\).
The partial likelihood for cox model is given as: \({L}_{n}\left(\beta \right)={\prod }_{i\in D}\frac{{\text{exp}}({X}_{i}^{t}\beta )}{\sum_{l\in {R}_{i}}{\text{exp}}({X}_{l}^{t}\beta )}\).
Where D is the set of indices for observed events, and \({R}_{i}\) are observations at risk at time \({Y}_{i}\). Now, a function known as log partial likelihood i.e., \({l}_{n}\left(\beta \right)={\text{log}}\left\{{L}_{n}\left(\beta \right)\right\}/n\) with a penalty term, specifically called lasso penalty i.e., \({p}_{\lambda }\left(\beta \right)=\lambda \sum_{j=1}^{p}{\beta }_{j}\) applied on the coefficients β, when minimized, results in the sparsity, hence variable selection.
Mathematically: \(g\left(\beta\right)=l_n\left(\beta\right)+\rho_\lambda\left(\beta\right)=,g\left(\beta\right)=\frac{\log\left\{\left(\beta\right)\right\}}n+\lambda{\textstyle\sum_{j=1}^p}\left\beta_j\right,\left(\beta\right)=l_n\left(\beta\right)+\lambda{\textstyle\sum_{j=1}^p}\left\beta_j\right.\)
Where λ is a nonnegative tuning parameter taking on any positive value and controls the amount of variables selected in the final model. The lasso penalty \(\lambda \sum_{j=1}^{p}{\beta }_{j}\) is singular at point \({\beta }_{j}=0\) and is therefore able to eliminate the redundant variables from the model and keep the relevant ones only.
Random Survival Forest’s Variable Selection – RSFvs
A useful tool for variable selection is the Random Survival Forest (RSF), which is an extension of the random forest approach for survival data [26]. RSF builds trees in a similar way as conventional random forests. Following the random selection of B bootstraps at random from the data, a tree is created on each bootstrap sample. A cumulative hazard function (CHF) is produced by averaging the predictions made by these trees. The RSF then offer two option for feature selection: the variable hunting algorithm (RSFVH) and minimal depth. The minimal depth approach for FS is advised when the ratio of the number of features (p) to the number of samples (n) is less than ten that is p/n < 10. However, when p/n > 10, the RSFVH approach for FS is preferred. When splitting of a node is carried out, the minimal depth is used to rank the features according to their distance from the root node in the tree. Shorter paths between variables and the root node are regarded as having greater predictive power in the model. More detailed information on minimal depth of the maximal subtree can be found in the work by Ishwaran et al. [11]. In RSFVH, an initial model is constructed using covariates according to a predetermined minimal depth threshold value. Additional covariates are gradually incorporated into the initial model based on their minimal depth rankings until the joint variable importance (VIMP) for the resulting nested models stabilizes. This process is typically repeated multiple times, often 50 repetitions, and the variables that are frequently selected in the models are included in the final model.
Smoothly Clipped Absolute Deviation – SCAD
SCAD was proposed by Fan and Li [27] as an improved alternative to LASSO for penalizing the regression coefficients. Some studies [27,28,29] showed that LASSO can come up with biased results for coefficients with larger values, while working fine for the coefficients with relatively smaller values. This led the researchers to introduce another penalty term known as nonconcave SCADpenalty.
The penalty function is rather defined primarily by its first derivative which is given as:
This penalty term contains two tuning parameters that are, \(\lambda\) and \(a\) whose values could be obtained by some criteria such as cross validation based on a grid search for the best pair of (\(\lambda , a\)) values. But this could be computationally very expensive too. Thus, based on Bayesian statistical point of view and practical simulation studies, Fan and Li [27] suggests using, \(a=3.7\).
Boosting Algorithm for Variable Selection – CoxBoost
Boosting, originally developed in machine learning for classification and regression problems [30,31,32]. It is basically an ensemble technique which runs iteratively to combine the predictions of many weak models into one strong model. With time, boosting algorithms started getting notable attention, and were later on extended to statistical field, operating in many statistical problems including regression and survival analysis.
We utilized the boosting algorithm known as ‘CoxBoost’ to perform feature selection making use of the cox regression model. For this, we used an R package ‘mboost’ [33] which performs a model based boosting using the builtin function ‘coxPH’ to be specified in argument ‘family’. Using this argument, we are about to use negative partial loglikelihood as a loss function L(y, F(X)) and OLS estimator as the base learner. The complete boosting algorithm for model fitting as well as feature selection context completes in these five steps.

1.
Initialize \(\widehat{\beta }=\left(0,\dots ,0\right);\)

2.
Compute the negative gradient vector: \(u^{(i)}=\delta^{(i)}\sum_{l\in R^{\left(i\right)}}\delta^{(i)}\frac{\text{exp}\{X^{(l)T}\widehat{\beta\}}}{\sum\{X^{\left(l\right)T}\widehat\beta\}}\)

3.
Compute the possible updates to the gradient vector by fitting least square estimator, \(\overset\frown{b_j}={(X_j^TX_j)}^{1}X_j^Tu;\)

4.
Select the best update, \(j^\ast={argmin}_j{\textstyle\sum_{i=1}^n}\left(u^{\left(i\right)}X_j^{\left(i\right)}{\widehat b}_j\right)^2;\)

5.
Update the estimate, \({\widehat{\beta }}_{j*}={\widehat{\beta }}_{j*}+v{\widehat{b}}_{j*}.\)
The steps 2–5 are repeated m_{stop} number of times, which plays a crucial role in both, feature selection and prediction models. In feature selection, increasing this will increase the number of features selected and vice versa. This can create problems of overfitting and irrelevant variable selection in the model in case of larger value while we may miss important predictor variables if this value is kept low [34]. To cope with this issue, a tenfold cross validation is used to select the optimal value for m_{stop} – which is the number of boosting steps – and do the feature selection using that optimal m_{stop} value.
Machine learning models
Timetoevent data can be analyzed for making predictions about survival time and estimate the survival probability at a specific estimated survival time, using both traditional statistical methods and machine learning models. Although, they both share this common goal of making predictions of the survival time and estimating the survival probabilities, yet the focus of both methods is on different objectives. Traditional methods mainly focuses on distributions of event times and statistical properties of estimation of parameters, whereas machine learning models combines the power of traditional methods along with machine learning techniques to make better predictions of occurrence of event at a given time [35].
There are three types of statistical methods commonly used in the context of survival analysis that are Parametric, Semiparametric and Nonparametric. In this study, the semiparametric approach, i.e., the Cox Proportional Hazards Model, is employed for predictions, along with two other machine learning models: Random Survival Forest and DeepHit Neural Network. These models are used for modeling timetoevent data and are discussed in the following section.
Random Survival Forest (RSF)
RSF is an ensemble method usually categorized under advanced machine learning techniques which is basically an extension of random forests approach to tackle survival information [36, 37].
The forest is grown in the same manner as in a usual random forest i.e.

i.
Randomly select B samples of the same size as the original dataset, allowing for replacement. Any samples not chosen are considered outofbag (OOB).

ii.
Construct a survival tree for each of the B samples selected in the first step.

a
At each tree node, randomly choose a subset of predictor variables and determine the best predictor and splitting value that yield two subsets (referred to as daughter nodes) with the maximum difference in the objective function.

b
Repeatedly apply the above step recursively to each daughter node until a specified stopping criterion is met.

a

iii.
Calculate the cumulative hazard function (CHF) for each tree and then compute the average CHF across all B trees to create the ensemble CHF.

iv.
Assess the prediction error of the ensemble CHF using solely the OOB data.
Since the CHF and survival function S(t) are related, the RSF also gives us estimates of survival function which we can further use for predictions using the test set of the data and calculate our performance evaluation metrics.
DeepHit neural network
DeepHit is a deep learning model designed for survival analysis, capable of simultaneously addressing singlecause and competing risks scenarios. It utilizes a network structure comprising a shared subnetwork and several causespecific subnetworks. DeepHit's training process involves a loss function that leverages both survival times and relative risks. This model is proficient in capturing nonlinear and nonproportional relationships between covariates and risk factors. Further, it is a discretetime survival model, means that survival times are discretized into either equidistant (equally spaced) or quantiles intervals [38]. The DeepHit method was the first to implement neural networks to the discretetime likelihood for survival data. A more detailed reading on how discretization process works in survival context and its further extensions can be obtained from a study conducted by Kvamme, H. and Borgan, Ø., 2021 [39].
In this study, we implemented the DeepHit NN using R package ‘survivalmodels’ [40] which makes it happen using R package ‘reticulate’ [41]. Where Reticulate is a popular R package creating a Python environment in R software so that one can use Python packages and functions inside R [42]. No hyper parameter tuning was applied, rather the default parameters were used for training the neural network.
Proposed model architecture
In this paper, Feature selection methods described some wellknown feature selection techniques for survival studies, briefly explaining their algorithms to separate and select only important features to the response variable. In practice, it is usually observed that employing many FS methods to the same dataset, it is not guaranteed that all techniques agree upon the same set of features. It is due to the mechanism a FS technique uses to give relevant importance to features. For instance, RSFvs uses a dimensionless order statistic called minimal depth of maximal subtree which shows the predictive power of a variable in a survival tree. RSFvs ranks features based on minimal depth criteria and selects the topmost from the list. In contrast, LASSO and SCAD implies their respective penalties on the regression coefficients in order to contract redundant variables’ coefficients to as low as zero. Although, the mechanism each method uses are way different than each other, yet the objective function is the same for all, so it is expected from each to agree upon other’s selection as much as possible, off course if not hundred percent.
In this work, we aimed at exploiting and capitalizing four different feature selection techniques that uses contrasting mechanism to obtain the objective function. Intuitively, the variables which are important and nonredundant in reality is believed to have higher chances of being chosen by most of the algorithms. This led to propose a novel feature selection method for survival data.
The proposed method is a threestep procedure for feature selection, constituting a hybrid method that selects the most informative or discriminative features on which the majority of the feature selection techniques agree. The method is explained as follows:

i.
Utilize each of the four feature selection techniques—LASSO, RSFvs, SCAD, and CoxBoost—individually to obtain four feature sets, one for each of the four FS techniques employed.

ii.
After comparison, create a new set that only includes those features chosen for at least three of the four sets in step 1.

iii.
The final set of features is determined by selecting the set of characteristics that satisfy the criteria in step 2. These features will then go through additional processes, like fitting a survival model utilising predictive machine learning models and assessing their predictive accuracy.
The suggested alg0rithm is described mathematically as follows:

Let FS 1, FS 2, FS 3, and FS 4 stand for the sets of feature that were 0btained by using the LASSO, RSFvs, SCAD, and C0xB00st methods, respectively.

Then, the common features that has agreement or the set of intersection of these feature, represented as FS intersection, is determined by:

$${FS}_{\,intersection}=\left\{f\in F/FS_1\cap FS_2\cap FS_3\cap FS_4\right\}$$

Next, the features that show up in at least three of the four feature sets are then filtered out as:
$$\begin{array}{c}\mathrm{FS}\;\mathrm{final}=\left\{\mathrm f\mid\;\mathrm f\;\in\;\mathrm{FS\,intersection},\mathrm{count}(\mathrm f)\;\geq\;3\right\}\\\mathrm{Here}\;\mathrm{count}\;\left(\mathrm f\right)=\mid\left\{{\mathrm{FS}}_{\mathrm i}\mid\mathrm f\mathit\;\mathit\in{\mathrm{FS}}_{\mathrm i}\right\}\mid\end{array}$$
where count(f) is the number of times feature f appears throughout the four feature sets.
The final collection of attributes, denoted by the set FS _{final}, will be used for additional analysis, including the fitting and performance evaluation of survival models.
The primary goal of this suggested approach is to create a feature selection pr0cess for survival predictions that is both simple to comprehend and apply. The 0bjective of this study is to develop an appr0ach that is both efficient and intuitive by combining wellknown feature selection techniques with a simple selection criterron based 0n their agreement across these techniques. Despite its simplicity, the prop0sed method has shown promising results in 0ur experiments, as evidenced by its performance in survival prediction tasks compared to existing techniques.
The algorithm of the proposed hybrid feature selection method is given below.
The following flowchart shows the basic outline of the proposed method in a graphical way in Fig. 1.
Performance evaluation metrics
To evaluate model performance, researchers have access to various evaluation metrics, allowing them to choose the most appropriate ones for their specific problem. For survival models, common metrics include the Concordance index [43], Cstatistic (a modified version of Cindex suitable for models with high censoring rates) [44], Brier score, integrated Brier score [45], integrated square error (ISE) [46], and others. In this study, we chose to assess the competitive models using Integrated Brier Scores (IBS), the CIndex, Integrated Absolute Error (IAE), and Integrated Square Error (ISE). These methods are briefly explained as follow:

i.
Integrated Brier Score (IBS)
The Brier score, initially introduced by Brier in 1950 [45] to assess the accuracy of weather forecasts, was later adapted to evaluate the performance of survival models that incorporate censored observations [47]. The Brier score varies with time. In the absence of censoring, the Brier score can be expressed as:
In scenarios involving censoring, a weighted version of the formula is used to accommodate censoring. Specifically, BS(t) is divided by \(1/\widehat{G}({t}_{i})\) when censoring occurs before time ‘t’ and it is divided by \(1/\widehat{G}(t)\) when censoring occurs after time 't.' Observations that are censored before time 't' are not included in the Brier score calculation. The formula for calculating Brier score in the presence of censoring is as follows:
A Brier score approaching 0 signifies superior predictive performance, while a score nearing 1 suggests poorer performance. The Integrated Brier Score (IBS) is derived by integrating the Brier score across all available time intervals, denoted as \({t}_{min}\le t\le {t}_{max}\). Mathematically, this can be expressed as follows:
The integration can be readily computed using the trapezoidal rule, which calculates the area under the prediction curve [48].

ii.
Concordance index (CIndex)
The Cindex is a discrimination measure that indicates the ability of a model to effectively distinguish between a pair of observations categorized as 'high' and 'low' risk. It is defined as the ratio of concordant pairs to the total comparable pairs [47, 49]. Where a comparable pair means a pair of individuals (say \(i\) and \(j\)) in a dataset such that \({t}_{i}\) and \({t}_{j}\) are its actual event times and \(S({t}_{i})\) and \(S({t}_{j})\) are their predicted survival times. Now, if a pair \((i, j)\) is such that \({t}_{i}>{t}_{j}\) for which \(\left({t}_{i}\right)>\) \(S\left({t}_{i}\right)\), this means the actual observed time for \({i}^{th}\) individual is higher than \({j}^{th}\), and the model predicted the same, it is considered as a concordant pair. Otherwise, it is a discordant pair. We can write:
To handle censoring when determining comparable pairs, certain rules are followed. For example, a censored instance can only be paired with uncensored instances that occur after it in the dataset. Additionally, a censored instance cannot be paired with either another censored instance or an uncensored instance that occurs after it. This concept is illustrated in Fig. 2, where we have five observations ordered from top to bottom. We have two possible scenarios: (a) All five observations are uncensored, resulting in a total of \({}^{5}{C}_{2}=10\) pairs. (b) The second and fourth observations are censored, reducing the number of pairs to 6, as per the rule that censored observations cannot be paired with uncensored observations occurring after them.
In survival models that predicts survival time as an output, the C index is calculated as:
Where ‘N’ is the total number of comparable pairs. ‘I[.]’ is the indicator function and S(.) are the estimated survival probabilities from the model. Some survival models do not directly estimate survival probabilities; instead they rather compute hazard ratios, such as Cox PH model. In such cases, the Cindex can be computed as:
Where ‘\(\widehat{\beta }\)’ represents the estimated parameters calculated from hazardratio based models such as Cox, Cindex values can range from 0 to 1. A higher value for the Cindex indicates better accuracy in terms of separating ‘low’ and ‘high’ risk observations. A Cindex equal to 0.5 suggests that the model made random guesses, while a value close to 1 indicates perfect separation, and a value below 0.5 suggests separation in the wrong direction.

iii.
Integrated Absolute and Integrated Square Errors (IAE and ISE):
Survival models based on simulated datasets can be evaluated using these two similar methods. This is because the mathematical expression of the survival function S(t) is typically unknown in practical experiments, but an approximate expression can be obtained using a nonparametric Kaplan–Meier estimation. Using this approximate ‘\(S(t)\)’, we can obtain the measures IAE and ISE for the real dataset as well. This approximation is available as a builtin function in R package ‘SurvMetrics’ [50]. Thus, if \(S(t)\) is the true survival function and \(\widehat{S}(t)\) is its estimate, then, the two measures are given as:
and
The resultant value of these two measures ranges from 0 to infinity, where lower value for a predictive model indicate better predictions. Since the results of both IAE and ISE were quite similar, we therefore reported only IAE in the results section.
Results and discussion
In this section, we review four tables containing the results of the analysis on 11 highdimensional survival datasets. The first three tables contain results obtained from the analysis of each of three survival prediction models i.e., Cox Proportional Hazards model, Random Survival Forest and DeepHit employed on five different features selection methods i.e., our proposed method, LASSO, RSFvs, SCAD and CoxBoost. The results in the table are obtained in similar fashion as explained in the above section. that is, we performed variable selection for each of the 11 datasets and obtained their evaluation metrics values for each dataset corresponding to each feature selection method employed.
Table 2 contains the results of the Cox Proportional Hazards Model when it is employed on different feature selection methods listed in the rightmost column of the table. For each of the methods, three performance metrics were computed i.e., Integrated Brier Score, Cindex and Integrated Absolute Error. The values of these metrics were recorded in rows corresponding to each variable selection method.
The values in bold indicate better performance among all variable selection methods for each dataset. Since three evaluation metrics were considered, we compared all these three metrics for each method corresponding to each dataset. In terms of individual datasets, such as the 'Breast dataset', when features were selected through LASSO and applied to the Cox PH survival model, we observed better predictive performance across all three metrics: IBS, Cindex, and IAE. Similar results were obtained from the next two datasets, namely "WPBC" and "VDV", where LASSO feature selection again demonstrated superior performance. However, there appears to be a deviation in the fourth row of the column. In the fourth row of Table 2, we observe the metrics for the 'Heart FD' dataset. Here, we find that the LASSO feature selection method outperformed the others in terms of IBS and Cindex. However, it is the CoxBoost FS method that outperformed the others in terms of IAE. This approach allows for a more thorough examination of each dataset, enabling us to assess the performance of each feature selection method relative to others. Each row corresponds to the performance metrics for a single dataset, arranged sequentially. The last row of the Table 2, is named as ‘Average’ which reflects evaluation metrics values averaged across all 11 datasets. This provide us with a comprehensive measure based on which it can be determined which feature selection method outperforms the rest. If we compare the averaged performance across all 11 datasets, it becomes evident that our proposed feature selection method performs exceptionally well across all three metrics.
The comparison of results can be linked to a voting process, where we tally the number of times a method outperforms the others. This involves counting the instances where a feature selection method performs better than the alternatives. By applying the voting criteria for comparison, it becomes evident that in 8 out of 11 datasets, the proposed method surpassed the other methods across all three metrics considered for comparison.
Examining Table 3, which follows a similar format to Table 2, we can compare the outcomes of different feature selection methods in two ways, as discussed earlier. One method involves examining the average value of each metric corresponding to each feature selection method to determine the winner. The other method entails counting the number of times a feature selection method outperforms others across the datasets, which we refer to as "voting". Analyzing the bottom row of the table, we observe that our proposed method surpasses all other four feature selection methods across all comparison metrics, including IBS, Cindex, and IAE. Furthermore, when assessing performance based on the number of datasets on which each feature selection method excels, our proposed method outperforms the competition on 9 out of 11 datasets. This indicates that our proposed method demonstrates superior performance according to our "voting" criteria as well. The findings from Table 3 provide substantial evidence to support the assertion that utilizing the proposed method for feature selection, followed by modeling the data using Random Survival Forest, can lead to improved predictive performance in survival analysis.
Table 4, displaying the performance evaluation metric values, follows the same format as the previous tables. It's evident from the results that when the proposed method of variable selection is employed for feature reduction followed by DeepHitNN as a survival prediction model, 8 out of 11 datasets exhibit lower IBS and IAE values. When examining the CIndex, it's apparent that the proposed method performed even better, with 9 out of 11 datasets showing higher Cindex values when the proposed FS method was employed. Additionally, in terms of the average value across all 11 datasets, the proposed method outperformed other FS methods in all three comparison metrics: IBS, IAE, and Cindex.
Table 5 presents the averaged performance evaluation values for five feature selection methods, aligned with each survival model, derived from the average values of each column in Tables 2, 3 and 4. For instance, the IBS value attributed to the 'Proposed' method under the 'Cox PH' model, which is "0.165298", represents the average IBS value obtained from Table 2 under the 'Proposed' method (averaged from the first column). All other values are recorded following the same methodology. This comprehensive approach provides a detailed overview of our entire analysis. The table distinctly illustrates that our proposed feature selection method outperforms other methods across all comparison metrics, including IBS, Cindex, and IAE. Referring back to the previous practice of comparison using the voting criteria, it is evident that our proposed method significantly surpasses the performance of other competing feature selection methods.
The tabulated results are further supported by graphical representations in the form of Figs. 3, 4, and 5, illustrating the performance evaluation metrics for the three different predictive models we employed with each of the five feature selection methods.
Conclusion
This study aimed to harness a range of existing feature selection techniques to develop a hybrid feature selection (FS) technique that could perform the same task with improved accuracy and reduced margin of error. To assert the accomplishment of the studies objective, we employed a total of five FS methods including the proposed one, along with three different survival models to compute three different performance evaluation metrics namely Integrated Brier Score (IBS), Concordance Index (CI) and Integrated Absolute Error (IAE). The results are presented in both tabular and graphical formats in the results section.
In conclusion, based on the results presented in the preceding section, we have two approaches to determine which method surpasses the others in reducing prediction error and enhancing prediction accuracy. One approach involves comparing how frequently a feature selection (FS) method outperforms others on individual datasets considered in our analysis.
As a rule of thumb, if a feature selection (FS) algorithm performs well on more than twothirds of the dataset counts, it would be considered ideal. Specifically, in this case, since we are using a total of 11 datasets and employing 5 different FS algorithms, it would be quite rare and ideal for a FS technique to achieve that level of performance consistently across datasets. If a variable selection method performs well on 6–8 datasets, it is considered a satisfactory outcome. We would ideally expect our proposed method to perform at least as well, if not better (on 9–11 datasets).
Another approach to comparing the different feature selection techniques is to assess their average performance across all datasets. Given the detailed discussion of results in the previous section, we now provide a concise conclusion.
In Table 2, the proposed method achieved results above the satisfactory level, outperforming other methods in terms of both the count (8 out of 11 datasets) and averaged results across all three metrics. Moving to Table 3, our proposed method yielded ideal results, outperforming the other methods in terms of both the voting (9 out of 11 datasets) and average comparison. These surprising results align with our expectations, primarily due to the added advantage of using Random Survival Forest as our predictive model. In the case of DeepHit as a prediction model, the proposed technique obtained somewhat similar results to those of Cox PH. In summary, focusing on Table 5, the most comprehensive table, it is evident that the proposed method outperformed the other feature selection methods in terms of all three metrics, whether it is counting/voting or average results. Based on the results provided therein, we conclude that the proposed method improved predictive performance of timetoevent data, especially when the proposed algorithm is employed for dimension reduction and utilized Random Survival Forest for survival prediction.
Availability of data and materials
No datasets were generated or analysed during the current study.
References
Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385–95. https://doi.org/10.1002/(sici)10970258(19970228)16:4%3c385::aidsim380%3e3.0.co;23.
Zhang HH, Lu W. Adaptive Lasso for Cox’s proportional hazards model. Biometrika. 2007;94(3):691–703.
Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. J R Stat Soc Ser B. 2005;67(2):301.
Du P, Ma S, Liang H. Penalized variable selection procedure for cox models with semiparametric relative risk. Ann Stat. 2010;38(4):2092. https://doi.org/10.1214/09AOS780.
Li H, Luan Y. (2005) Boosting proportional hazards models using smoothing splines, with applications to highdimensional microarray data. Bioinformatics. 2005;21(10):2403–9. https://doi.org/10.1093/bioinformatics/bti324.
Morris M, He K, Li Y, Kang J. (2020) SurvBoost: An R Package forHighDimensional Variable Selection inthe Stratified Proportional Hazards Modelvia Gradient Boosting. The R J. 2020;12:105.
He K, Li Y, Zhu J, Liu H, Lee JE, Amos CI, Li Y. Componentwise gradient boosting and false discovery control in survival analysis with highdimensional covariates. Bioinformatics. 2016;32(1):50.
Hunter DR, Lange K. A tutorial on MM algorithms. Am Stat. 2004;58(1):30–7.
Meinshausen N, Bühlmann P. Stability selection. J Royal Stat Soc Ser B. 2010;72(4):417–73.
Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001;98(9):5116–21.
Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS. Highdimensional variable selection for survival data. J Am Stat Assoc. 2010;105(489):205–17.
Pang H, George SL, Hui K, Tong T. Gene selection using iterative feature elimination random forests for survival outcomes. IEEE/ACM Trans Comput Biol Bioinf. 2012;9(5):1422–31.
Mbogning C, Broët P. Bagging survival tree procedure for variable selection and prediction in the presence of nonsusceptible patients. BMC Bioinformatics. 2016;17:1–21.
Ternes N, Rotolo F, Michiels S. (2018) Biospear. Biomarker selection in penalized regression models. https://github.com/Oncostat/biospear
Wolberg,William, Street,W., and Mangasarian,Olvi. (1995). Breast Cancer Wisconsin (Prognostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5GK50.
van’t Veer LJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;12:530–6.
Ahmad T, Munir A, Bhatti SH, Aftab M, Raza MA. Survival analysis of heart failure patients: A case study. PLoS ONE. 2017;12(7):e0181001. https://doi.org/10.1371/journal.pone.0181001.
Wang KY, Pupo GM, Tembe V, Patrick E, Strbenac D, Schramm SJ, Thompson JF, Scolyer RA, Mueller S, Tarr G, Mann GJ. 2020. CrossPlatform Omics Prediction procedure: a game changer for implementing precision medicine in patients with stageIII melanoma. bioRxiv, pp.202012.
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lønning PE, Brown PO, BørresenDale AL, Botstein D. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA. 2003;100(14):8418–23. https://doi.org/10.1073/pnas.0932692100.
Van De Vijver MJ, He YD, Van’t Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M. A geneexpression signature as a predictor of survival in breast cancer. N Engl J Med. 2002;347(25):1999–2009.
James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. 2nd ed. New York, NY, USA: Springer; 2017.
Kałwak K, Porwolik J, Mielcarek M, Gorczyńska E, OwocLempach J, Ussowicz M, Dyla A, Musiał J, Paździor D, Turkiewicz D, Chybicka A. Higher CD34+ and CD3+ cell doses in the graft promote longterm survival, and have no impact on the incidence of severe acute or chronic graftversushost disease after in vivo T celldepleted unrelated donor hematopoietic stem cell transplantation in children. Biol Blood Marrow Transplant. 2010;16(10):1388–401.
Ramanan, D. (2016). NKI Breast Cancer Data. Accessed 10 Sept 2023. https://data.world/deviramanan2016/nkibreastcancerdata
EmmertStreib F, Dehmer M. Introduction to survival analysis in practice. Mach Learn Knowl Extraction. 2019;1(3):1013–38.
Yang L, Pelckmans K. Machine learning approaches to survival analysis: Case studies in microarray for breast cancer. Int J Mach Learn Comput. 2014;4(6):483.
Breiman L. Random Forests. Mach Learn. 2001;45:5–32. https://doi.org/10.1023/A:1010933404324.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96(456):1348–60.
Breheny P, Huang J. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat Comput. 2015;25:173–87.
Algamal ZY, Lee MH. Penalized logistic regression with the adaptive LASSO for gene selection in highdimensional cancer classification. Expert Syst Appl. 2015;42(23):9326–32.
Schapire RE. The strength of weak learnability. Mach Learn. 1990;5:197–227.
Freund Y. Boosting a weak learning algorithm by majority. Inf Comput. 1995;121(2):256–85.
Freund Y, Schapire RE. Experiments with a new boosting algorithm. In icml. 1996;96:148–56.
T. Hothorn, P. Buehlmann, T. Kneib, M. Schmid, and B. Hofner (2022). mboost: ModelBased Boosting, R package version 2.9–7, https://CRAN.Rproject.org/package=mboost
Seibold H, Bernau C, Boulesteix AL, De Bin R. On the choice and influence of the number of boosting steps for highdimensional linear Coxmodels. Comput Statistics. 2018;33(3):1195–215.
Wang P, Li Y, Reddy CK. Machine learning for survival analysis: A survey. ACM Computing Surveys (CSUR). 2019;51(6):1–36.
Breiman L. Random forests. Mach learn. 2001;45:5–32.
Ishwaran H, Kogalur UB, Blackstone EH. and Lauer MS. 2008. Random survival forests.
Lee C, Zame W, Yoon J. and van der Schaar, M. (2018) “DeepHit: A Deep Learning Approach to Survival Analysis With Competing Risks”, Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11842.
Kvamme H, Borgan Ø. Continuous and discretetime survival prediction with neural networks. Lifetime Data Anal. 2021;27:710–36.
Sonabend R (2022). _survivalmodels: Models for Survival Analysis_. R package version 0.1.13, <https://CRAN.Rproject.org/package=survivalmodels>.
Ushey K, Allaire J, Tang Y (2023). _reticulate: Interface to 'Python'_. R package version 1.31, <https://CRAN.Rproject.org/package=reticulate>.
Van Rossum, G. & Drake, F.L., 2009. Python 3 Reference Manual, Scotts Valley, CA: CreateSpace.
Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543–6.
Uno H, Cai T, Pencina MJ, D’Agostino RB, Wei LJ. On the Cstatistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med. 2011;30(10):1105–17.
Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3.
Zou Y, Fan G, Zhang R. Integrated Square Error of Hazard Rate Estimation for Survival Data with Missing Censoring Indicators. J Syst Sci Complexity. 2021;34(2):735–58.
Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17–18):2529–45.
Shukla. A, (2021, June 24), Trapezoidal Rule, GeeksforGeeks, https://www.geeksforgeeks.org/trapezoidalrule/
Harrell FE Jr, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Stat Med. 1984;3(2):143–52.
Hanpu Zhou and Xuewei Cheng and Sizheng Wang and Yi Zou and Hong Wang (2022). SurvMetrics: Predictive Evaluation Metrics in Survival Analysis, R package version 0.5.0, https://github.com/skyee1/SurvMetrics
Acknowledgements
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R 299), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Funding
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R 299), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Author information
Authors and Affiliations
Contributions
N.A & U.K conceived the idea, contributed to the design, data collection, Analysis, writing of the manuscript, evaluation and supervision. H.M.A & D.M.K contributed to the conception of the study, methodology, review and editing of this manuscript, and project administration. B.A, M.H, contributed to resources, investigation, analysis, data curation and validation. All authors reviewed the manuscript and gave final approval for submission this work.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Asghar, N., Khalil, U., Ahmad, B. et al. Improved nonparametric survival prediction using CoxPH, Random Survival Forest & DeepHit Neural Network. BMC Med Inform Decis Mak 24, 120 (2024). https://doi.org/10.1186/s1291102402525z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1291102402525z