Improved nonparametric survival prediction using CoxPH, Random Survival Forest & DeepHit Neural Network

In recent times, time-to-event data such as time to failure or death are routinely collected alongside high-throughput covariates. These high-dimensional bioinformatics data often challenge classical survival models, which are either infeasible to fit or produce low prediction accuracy due to overfitting. To address this issue, the focus has shifted towards novel approaches for feature selection and survival prediction. In this article, we propose a new hybrid feature selection approach that handles high-dimensional bioinformatics datasets for improved survival prediction. This study explores the efficacy of four distinct variable selection techniques, LASSO, RSF-vs, SCAD, and CoxBoost, in the context of non-parametric biomedical survival prediction. Leveraging these methods, we conducted comprehensive variable selection processes. Subsequently, survival analysis models, specifically CoxPH, RSF, and the DeepHit NN, were employed to construct predictive models based on the selected variables. Furthermore, we introduce a novel approach wherein only variables consistently selected by a majority of the aforementioned feature selection techniques are considered. This strategy, referred to as the proposed method, aims to enhance the reliability and robustness of variable selection, thereby improving the predictive performance of the survival analysis models. To evaluate its effectiveness, we compare the performance of the proposed approach with the existing LASSO, RSF-vs, SCAD, and CoxBoost techniques using various performance metrics, including the integrated Brier score (IBS), concordance index (C-index), and integrated absolute error (IAE), on numerous high-dimensional survival datasets. The real data applications reveal that the proposed method outperforms the competing methods in terms of survival prediction accuracy.


Introduction
Survival data analysis is a statistical subfield focused on studying the duration until a specific event occurs, such as the time until death in living organisms or the time until failure in mechanical systems. It aims to answer questions such as the proportion of a population expected to survive beyond a given time point, the rate at which those who survive will experience the event, and how the likelihood of survival varies with different conditions or characteristics. The objectives of survival analysis are to identify relationships between risk variables and occurrences, to explain the likelihood of an event occurring by a certain period, or to forecast survival times based on informative characteristics. Survival outcomes generally refer to events or results related to the duration of time that an individual or entity survives or remains in a particular state before a specific event occurs, such as death, relapse, or the development of a specific health condition.
In recent decades, there has been an increasing focus on devising methods for selecting relevant features in time-to-event data. This heightened interest is driven by the availability of extensive datasets and the recognition that data sparsity may exist. By sparsity we mean that certain features in the dataset may have no relevance to the outcome of interest, and variable selection becomes the preferred approach to address this. A number of feature selection (FS) techniques have been developed in the literature, each working towards the same objective, namely reducing the dimension of the data, but using a different mechanism to reach that goal. Although numerous FS algorithms are available, a 'one-size-fits-all' algorithm does not exist in reality. A margin of error always remains, and it needs to be minimized. This motivates researchers to improve existing algorithms and techniques, or to devise fresh ideas that accomplish the same task with minimum error. With this in mind, we introduce a hybrid FS algorithm for survival analysis whose working mechanism is explained in the next section.

Related work
Numerous methods for selecting variables in survival data have been devised over time. To provide a concise overview of the existing body of work and to gain insight into the functionality of each algorithm in this context, a brief literature review is presented below.
Tibshirani [1] extended the Least Absolute Shrinkage and Selection Operator (LASSO) method, initially introduced for linear regression, to the Cox model. In this approach, an L1-norm penalty term is added to the loss function. The coefficients β are then estimated via maximization of the partial likelihood function subject to the constraint Σ_j |β_j| ≤ s, where 's' is a user-defined non-negative value. By effectively shrinking the coefficients of less significant and superfluous variables to zero, this constraint lowers the complexity of the model.
The Adaptive LASSO for the Cox proportional hazards model was developed by Zhang and Lu [2] to improve the estimator's characteristics and make it compatible with standard techniques. To achieve equilibrium, this strategy gives larger weights to small coefficients and smaller weights to large coefficients. Global optimizers are guaranteed to exist because of the convex shape of the penalty term. The partial likelihood is maximized under the adaptively weighted L1 penalty λ Σ_{j=1}^{p} w_j |β_j|, where β = (β_1, β_2, ..., β_p) and the w_j are data-driven weights; this is known as the Adaptive LASSO penalty.
In certain situations, LASSO has shown limitations. For instance, it tends to limit the number of selected features based on a systematic relationship with the number of samples in the dataset. Additionally, when dealing with highly correlated predictors, it often selects only one feature among them. To overcome these challenges, alternative methods have been introduced, one of which is the Elastic Net. This method incorporates a combination of L1- and L2-norm penalties into the estimation of the regression coefficients: a penalty of the form λ[α Σ_j |β_j| + (1 − α) Σ_j β_j²/2] is added to the partial log-likelihood function while maximizing it during the estimation process [3].
Cross-validation was used to estimate the two parameters, the regularization parameter λ and the mixing parameter α. Later, it was suggested to estimate only λ by cross-validation using a fixed α = 0.5.
Du et al. [4] introduced a method for variable selection in the Cox Proportional Hazards model with semiparametric relative risk. They initially divided the model into two components: parametric and non-parametric. For the non-parametric component, they employed a smoothing spline ANOVA model to estimate risk, and variable selection was performed using the Kullback-Leibler geometry method. In contrast, for the parametric component, risk estimation was conducted using the Penalized Profile Partial Likelihood approach. Variable selection in this part was achieved by applying a concave penalty, either SCAD (Smoothly Clipped Absolute Deviation) or an Adaptive LASSO penalty. It is advisable to incorporate discrete covariates into the parametric component and continuous covariates into the nonparametric component of the model. Nevertheless, if the estimation process indicates that certain continuous covariate effects can be suitably represented by specific parametric forms, like linear relationships, these covariates can be transferred to the parametric section, and a fresh model can be developed accordingly.
Li and Luan [5] introduced a boosting algorithm utilizing smoothing splines, which constructs a series of smoothing spline models and amalgamates them into a final model. This iterative algorithm modifies the hazard function at each step to rectify errors from prior models. In a comparative study against traditional proportional hazards models, Li and Luan demonstrated the superiority of their boosting algorithm, particularly in analyzing high-dimensional microarray data from a breast cancer study. Their research underscores the efficacy of the boosting algorithm with smoothing splines for survival analysis in high-dimensional datasets and its capacity to handle complex covariate structures in survival analysis.
Morris et al. [6] devised a feature selection technique tailored for stratified Cox models using gradient boosting. This method is intended for situations where the assumption of proportional hazards is not met. To mitigate confounding effects, they introduced a stratification process, resulting in a stratified proportional hazards model. In their algorithm, variables are chosen based on their capacity to maximize the first partial derivative of the likelihood function. This iterative process updates the β coefficient estimates while monitoring the risk of overfitting, with the number of iterations determined through various options such as the Bayesian Information Criterion (BIC), a predefined number of variables for selection, changes in likelihood, or k-fold cross-validation. Additionally, they developed an R package called 'SurvBoost' to implement this method and conducted a simulation study comparing its accuracy and runtime with the existing 'mboost' R package.
He et al. [7] introduced an algorithm that applies gradient boosting to the data while considering only one component of β at a time, aiming to improve computational feasibility. The fundamental concept behind this algorithm involves modifying certain aspects of existing variable selection techniques designed for high-dimensional survival data. Although component-wise boosting algorithms were already present in the literature, the authors made specific adaptations to enhance computational efficiency, drawing inspiration from the Minimization-Maximization (MM) algorithm by Hunter and Lange [8].
When conducting FS on randomly sampled observations, the objective is to determine the probability that features are included in the model. This approach is known as stability selection [9]. Additionally, the authors proposed a stability selection boosting procedure based on random permutations, which follows the concept introduced by Tusher et al. [10] in the context of the Significance Analysis of Microarrays (SAM). This modification resulted in a reduced false discovery rate for their variable selection algorithm.
Ishwaran et al. [11] introduced a novel approach in survival tree analysis based on a dimensionless ordered statistic known as the 'minimal depth of the maximal subtree.' This statistic serves as a measure of variable predictiveness within survival trees. Unlike the traditional method of calculating Variable Importance (VIMP) in random forests using permutation, they opted for the minimal depth statistic. This statistic assesses the importance of variables based on their proximity to the root node in a tree, specifically in relation to the root of the nearest maximal subtree. Their variable selection procedure comprises three main steps. Initially, they randomly select covariates for the model and identify the most crucial ones using minimal depth. An initial model is then constructed with these chosen covariates. This process is iterated a predefined number of times, incorporating additional variables into the initial model based on minimal depth criteria until the joint VIMP of nested models stabilizes. This entire process is repeated several times, and the covariates that consistently appear in models larger than the average size are ultimately selected.
Pang et al. [12] introduced a gene selection method that employs an iterative feature elimination procedure within the framework of the random survival forest (RSF). Their approach begins by fitting an RSF model to the dataset containing the complete set of covariates and ranking all covariates based on their Variable Importance (VIMP) scores, which are calculated using a permutation-based method. The top variables, typically around 80%, are retained, and their out-of-bag errors are computed. This process is repeated iteratively until only two covariates remain in the model. The objective is to identify the smallest set of covariates required to maintain an out-of-bag error rate within 1 standard error. This methodology effectively takes into account the multivariate correlations among variables. Experimental results on real high-dimensional microarray datasets with survival outcomes demonstrated that this approach excels in identifying a compact set of genes while preserving predictive accuracy for survival.
Mbogning and Broet [13] introduced a variable selection approach tailored for survival data, which relies on a topological index derived from permutation methods. Their methodology begins with the construction of a bagging survival forest on the training data. The importance score, utilized as a criterion for node splitting or determining tree depth during forest construction, serves as the basis for calculating an importance score for each variable, denoted HS_j (j = 1, 2, 3, ..., p). Subsequently, another bagging survival forest is created, but this time the importance of variables is computed based on permuted data. This process is iterated a specified number of times Q, generating another list of scores HS*_j (j = 1, 2, 3, ..., p). P-values are calculated for all competing variables X_j as P_j = (1/Q) Σ_{q=1}^{Q} I{HS*_{jq} > HS_j}. Given a global level α, variables satisfying p_j < α/m are selected according to a Bonferroni procedure for multiple comparisons.
While the above-mentioned techniques are useful for variable selection, their applicability for identifying discriminative features in the high-dimensional survival datasets of this study is limited. Therefore, we propose a novel tool for variable/feature selection, motivated by the need to handle high-dimensional survival data, which is important for better predictions.

Datasets and data processing
In this section, we describe the datasets considered in this research study. A total of 11 survival datasets were used, and their detailed description along with their sources is presented in Table 1. While selecting the datasets for the current research study on feature selection in survival analysis, it was ensured that the benchmark datasets are high-dimensional in nature.

The Cox PH model
The Cox Proportional Hazards Model (CPHM) is a semiparametric method used in survival analysis. It expresses the hazard function h(t) as the product of two components: a baseline hazard h_0(t) that depends solely on time 't', and a component related to covariates. The mathematical form of the CPHM is:

h(t | x_i) = h_0(t) e^{β′x_i}     (1)

Here, h_0(t) is a time-dependent component that is not influenced by covariates, while e^{β′x_i} represents the covariate-related component, which does not depend on time 't'. It is important to note that there is no constant term β_0 among the regression coefficients. This absence is because a constant term can be absorbed into the baseline hazard function, essentially becoming a part of the hazard function itself [24].
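A key consequence of Eq. (1) is that the hazard ratio between two subjects cancels the baseline hazard and is therefore constant in time. The short sketch below illustrates this; the coefficient vector and covariate values are arbitrary illustrative assumptions, not fitted quantities.

```python
import numpy as np

def cox_hazard(h0_t, beta, x):
    """Hazard under the Cox model: h(t | x) = h0(t) * exp(beta' x)."""
    return h0_t * np.exp(beta @ x)

# Illustrative (assumed) coefficients and covariates for two subjects.
beta = np.array([0.5, -0.3])
x_i = np.array([1.0, 2.0])
x_j = np.array([0.0, 0.0])

# The baseline hazard value cancels, so the ratio does not depend on t.
hr = cox_hazard(1.0, beta, x_i) / cox_hazard(1.0, beta, x_j)
```

Evaluating the same ratio at any other baseline hazard value gives the identical result, which is exactly the proportional hazards property.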

Feature selection methods
With the introduction of new technologies in the past few decades, we are able to access information about a subject under study that was never available before. Having more information about samples/subjects can, on the other hand, create a situation called the curse of dimensionality, especially when the number of features is substantially greater than the number of observations. This creates several problems during the analysis and processing of the data, such as increased computational cost, noise and redundancy, overfitting, and poor generalization of performance on unseen data [25]. There are three common types of feature selection methods, namely filter, wrapper and embedded methods.
Filter methods This group consists of feature selection techniques that rank the features in a dataset ahead of running a learning algorithm and select features for the model based on a pre-specified criterion connected to the statistical measure used for ranking. These methods are less time-consuming and inexpensive in nature, as they are applied as a pre-processing step.
Wrapper methods The wrapper methodology consists of techniques where subsets of features are formed, a model is trained on each subset, and the subsets are compared in terms of performance metrics. Features are added or removed from the subsets until a predefined number of features and performance measure is obtained. The subset which yields the pre-defined model output is considered the final set of features for further modelling of the data.
Embedded methods This approach combines the core properties of both the filter and wrapper methods. It is named embedded because the feature selection technique is blended into the actual learning algorithm as a part of it. It is less time-consuming, inexpensive and more accurate than the aforementioned methods. The methods employed in this study for comparison fall under this category. In the following section, we provide a brief overview of the variable selection methods utilized in this study.

Least Absolute Shrinkage and Selection Operator -LASSO
Least Absolute Shrinkage and Selection Operator, or LASSO, is a method that penalises variables. It introduces an L1-type penalty term λ||β||₁ on the coefficients of the Cox regression [1]. This penalty can effectively reduce some coefficients to zero, leading to a reduction in the model's size while maintaining its parsimony. The parameter λ plays a crucial role in determining the number of variables selected in the model: a larger λ value leads to more coefficients being reduced to zero, resulting in a model with fewer features, while a smaller λ value increases the number of features included in the model. In a survival context, the observed data are represented by the triplets {(Y_i, δ_i, X_i), i = 1, 2, 3, ..., n}, where Y_i = min(T_i, C_i) is the observed survival time, the minimum of the event time T_i and the censoring time C_i; δ_i = I(T_i ≤ C_i) is the censoring indicator, equal to 1 when the actual event of interest is observed and 0 otherwise; and X_i = (x_{i1}, x_{i2}, ..., x_{ip})ᵗ is the vector of 'p' predictor variables for each subject in the dataset. To model the survival data, we consider the semi-parametric Cox proportional hazards model, whose partial likelihood is given as:

L_n(β) = ∏_{i∈D} [ exp(β′X_i) / Σ_{j∈R_i} exp(β′X_j) ],

where D is the set of indices of observed events, and R_i is the set of observations at risk at time Y_i. Minimizing the negative log partial likelihood l_n(β) = −log{L_n(β)}/n with a penalty term applied to the coefficients β, specifically the LASSO penalty p_λ(β) = λ Σ_{j=1}^{p} |β_j|, that is,

β̂ = argmin_β { −l_n(β) + λ Σ_{j=1}^{p} |β_j| },

results in sparsity, and hence variable selection. Here λ is a non-negative tuning parameter that controls the number of variables selected in the final model. The LASSO penalty Σ_{j=1}^{p} |β_j| is singular at the point β_j = 0 and is therefore able to eliminate redundant variables from the model while keeping only the relevant ones.
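The penalized minimization above can be solved by proximal gradient (ISTA) steps, alternating a gradient step on the negative log partial likelihood with soft-thresholding, which is what produces exact zeros. The sketch below is a minimal numpy illustration on simulated data, not the optimizer used in the paper (which relied on established R packages); it assumes no tied event times.

```python
import numpy as np

rng = np.random.default_rng(0)

def cox_lasso(X, time, event, lam, lr=0.1, n_iter=2000):
    """Proximal-gradient (ISTA) sketch of the L1-penalised Cox partial likelihood.
    Assumes continuous (untied) event times; for illustration only."""
    n, p = X.shape
    order = np.argsort(-time)            # descending time: risk set = prefix
    Xo, evo = X[order], event[order]
    beta = np.zeros(p)
    for _ in range(n_iter):
        w = np.exp(Xo @ beta)
        cw = np.cumsum(w)                            # sum of exp(eta) over each risk set
        cwx = np.cumsum(w[:, None] * Xo, axis=0)     # weighted covariate sums per risk set
        grad = -(evo[:, None] * (Xo - cwx / cw[:, None])).sum(axis=0) / n
        beta = beta - lr * grad                      # gradient step
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)  # soft-threshold
    return beta

# Toy data: only the first two of six covariates influence the hazard.
n, p = 300, 6
X = rng.standard_normal((n, p))
true_beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
T = rng.exponential(1.0 / np.exp(X @ true_beta))     # event times
C = rng.exponential(2.0, size=n)                     # censoring times
time, event = np.minimum(T, C), (T <= C).astype(float)

beta_hat = cox_lasso(X, time, event, lam=0.2)
```

The two informative coefficients keep their signs (shrunk towards zero, as LASSO does), while the noise coefficients are driven to exactly zero by the soft-thresholding step.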

Random Survival Forest's Variable Selection -RSF-vs
A useful tool for variable selection is the Random Survival Forest (RSF), an extension of the random forest approach to survival data [26]. RSF builds trees in a similar way to conventional random forests. Following the random selection of B bootstrap samples from the data, a tree is grown on each bootstrap sample. A cumulative hazard function (CHF) is produced by averaging the predictions made by these trees. RSF then offers two options for feature selection: minimal depth and the variable hunting algorithm (RSF-VH). The minimal depth approach is advised when the ratio of the number of features (p) to the number of samples (n) is less than ten, that is, p/n < 10; when p/n > 10, the RSF-VH approach is preferred. When a node is split, minimal depth is used to rank the features according to their distance from the root node of the tree. Variables with shorter paths to the root node are regarded as having greater predictive power in the model. More detailed information on the minimal depth of the maximal subtree can be found in the work by Ishwaran et al. [11]. In RSF-VH, an initial model is constructed using covariates chosen according to a predetermined minimal depth threshold. Additional covariates are gradually incorporated into the initial model based on their minimal depth rankings until the joint variable importance (VIMP) for the resulting nested models stabilizes. This process is typically repeated multiple times, often 50 repetitions, and the variables that are frequently selected across the models are included in the final model.
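Minimal depth itself is simple to compute once a tree is grown: it is the depth of the shallowest node that splits on a given variable. The sketch below computes it for a hand-coded toy tree stored in sklearn-style parallel arrays (an assumption of this illustration, not the randomForestSRC representation used in practice), where `feature[i] == -2` marks a leaf.

```python
from collections import deque

def minimal_depth(children_left, children_right, feature):
    """Minimal depth of each split variable: depth of the shallowest node that
    splits on it (root = depth 0). BFS guarantees the first sighting of a
    variable is at its minimal depth."""
    depths = {}
    queue = deque([(0, 0)])                  # (node index, depth), start at root
    while queue:
        node, d = queue.popleft()
        if feature[node] == -2:              # leaf: no split variable here
            continue
        depths.setdefault(feature[node], d)  # record only the first (shallowest) depth
        queue.append((children_left[node], d + 1))
        queue.append((children_right[node], d + 1))
    return depths

# Toy tree: root splits on variable 0; its two children split on variables 1 and 2.
children_left  = [1, 3, 5, -1, -1, -1, -1]
children_right = [2, 4, 6, -1, -1, -1, -1]
feature        = [0, 1, 2, -2, -2, -2, -2]
md = minimal_depth(children_left, children_right, feature)
```

Lower minimal depth means the variable splits closer to the root, i.e., it is judged more predictive; in RSF these per-tree depths are averaged over the forest.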

Smoothly Clipped Absolute Deviation -SCAD
SCAD was proposed by Fan and Li [27] as an improved alternative to LASSO for penalizing the regression coefficients. Several studies [27][28][29] showed that LASSO can produce biased results for coefficients with larger values, while working well for coefficients with relatively smaller values. This led the researchers to introduce another penalty term, known as the non-concave SCAD penalty.
The penalty function is defined primarily through its first derivative, which for θ > 0 is given as:

p′_λ(θ) = λ { I(θ ≤ λ) + [(aλ − θ)₊ / ((a − 1)λ)] I(θ > λ) }     (2)

This penalty term contains two tuning parameters, λ and a, whose values can be obtained by criteria such as cross-validation based on a grid search for the best pair of (λ, a) values. However, this can be computationally very expensive. Thus, based on a Bayesian statistical argument and practical simulation studies, Fan and Li [27] suggest using a = 3.7.
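Integrating the derivative above gives the penalty itself: linear (LASSO-like) near zero, a quadratic transition, and then a constant, so large coefficients incur no additional shrinkage. The sketch below evaluates this closed form (a standard textbook expression, stated here as an illustration consistent with Eq. (2)).

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|): L1-like near zero, a quadratic blend,
    then constant, so large coefficients are not over-penalised."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.where(
        t <= lam,
        lam * t,                                               # LASSO-like region
        np.where(
            t <= a * lam,
            -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),  # quadratic transition
            (a + 1) * lam**2 / 2,                              # flat: no extra shrinkage
        ),
    )

lam = 0.5
grid = np.array([0.0, 0.5, 1.0, 3.0, 10.0])
vals = scad_penalty(grid, lam)
```

The flat tail (equal values at 3.0 and 10.0 here) is precisely what removes the bias LASSO induces on large coefficients.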

Boosting Algorithm for Variable Selection -CoxBoost
Boosting was originally developed in machine learning for classification and regression problems [30][31][32]. It is an ensemble technique which runs iteratively to combine the predictions of many weak models into one strong model. Over time, boosting algorithms gained notable attention and were later extended to the statistical field, where they operate in many statistical problems including regression and survival analysis.
We utilized the boosting algorithm known as 'CoxBoost' to perform feature selection making use of the Cox regression model. For this, we used the R package 'mboost' [33], which performs model-based boosting using the built-in family 'CoxPH' specified via the 'family' argument. With this setting, the negative partial log-likelihood is used as the loss function L(y, F(X)) and the OLS estimator as the base learner. The complete boosting algorithm, covering both model fitting and feature selection, completes in these five steps.
1. Initialize β = (0, ..., 0).
2. Compute the negative gradient vector u^(i) of the loss function, evaluated at the current fit.
3. Compute the possible updates b_j by fitting the least-squares base learner of each covariate X_j to the negative gradient.
4. Select the best update, j* = argmin_j Σ_{i=1}^{n} ( u^(i) − X_j^(i) b_j )².
5. Update the estimate of the j*-th coefficient by adding a damped version of b_{j*} to it.

Steps 2-5 are repeated m_stop times, a quantity that plays a crucial role in both feature selection and prediction. In feature selection, increasing m_stop increases the number of features selected, and vice versa. Too large a value can cause overfitting and the selection of irrelevant variables, while too low a value may miss important predictor variables [34]. To cope with this issue, tenfold cross-validation is used to select the optimal value for m_stop, the number of boosting steps, and feature selection is then carried out using that optimal value.
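The five steps above can be sketched compactly. The toy below is a component-wise L2 boosting illustration with a squared-error loss standing in for the Cox partial log-likelihood used by 'mboost' (that substitution, and all data, are assumptions of this sketch); the selection mechanism, fitting every single covariate to the current negative gradient and updating only the best one, is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

def componentwise_boost(X, y, n_steps=150, nu=0.1):
    """Component-wise boosting with OLS base learners: each step fits every
    single covariate to the current residual (the negative gradient of the
    squared loss) and updates only the best one, performing implicit selection."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_steps):
        u = y - X @ beta                                   # step 2: negative gradient
        b = (X * u[:, None]).sum(0) / (X**2).sum(0)        # step 3: per-column OLS slopes
        sse = ((u[:, None] - X * b) ** 2).sum(0)           # step 4: fit of each candidate
        j = np.argmin(sse)
        beta[j] += nu * b[j]                               # step 5: damped update
    return beta

X = rng.standard_normal((200, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(200)
beta = componentwise_boost(X, y)
selected = np.flatnonzero(np.abs(beta) > 0.05)
```

Because only one coefficient moves per step, variables never chosen keep a coefficient of exactly zero, which is how a finite m_stop doubles as a feature selector.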

Machine learning models
Time-to-event data can be analyzed to make predictions about survival time and to estimate the survival probability at a specific time, using both traditional statistical methods and machine learning models. Although both share the common goal of predicting survival times and estimating survival probabilities, they focus on different objectives. Traditional methods mainly focus on the distributions of event times and the statistical properties of parameter estimation, whereas machine learning models combine the strengths of traditional methods with machine learning techniques to make better predictions of the occurrence of an event at a given time [35].
Three types of statistical methods are commonly used in survival analysis: parametric, semi-parametric and non-parametric. In this study, the semi-parametric approach, i.e., the Cox Proportional Hazards Model, is employed for predictions, along with two machine learning models: the Random Survival Forest and the DeepHit Neural Network. These models are used for modeling time-to-event data and are discussed in the following section.

Random Survival Forest (RSF)
RSF is an ensemble method, usually categorized under advanced machine learning techniques, which extends the random forests approach to handle survival information [36,37].
The forest is grown in the same manner as a usual random forest:

i. Randomly select B bootstrap samples of the same size as the original dataset, with replacement. Any samples not chosen are considered out-of-bag (OOB).
ii. Construct a survival tree for each of the B samples selected in the first step.
  a. At each tree node, randomly choose a subset of predictor variables and determine the best predictor and splitting value that yield two subsets (referred to as daughter nodes) with the maximum difference in the objective function.
  b. Apply the above step recursively to each daughter node until a specified stopping criterion is met.
iii. Calculate the cumulative hazard function (CHF) for each tree and then compute the average CHF across all B trees to create the ensemble CHF.
iv. Assess the prediction error of the ensemble CHF using solely the OOB data.
Since the CHF and the survival function are related through S(t) = exp(−CHF(t)), the RSF also yields estimates of the survival function, which we can use for predictions on the test set and for computing our performance evaluation metrics.
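Within each terminal node, the per-tree CHF of step iii is the Nelson-Aalen estimator of the in-node data, and the survival curve is then recovered as S(t) = exp(−H(t)). A minimal sketch on a hand-made five-subject sample (illustrative data, not one of the study's datasets):

```python
import numpy as np

def nelson_aalen(time, event):
    """Nelson-Aalen cumulative hazard: at each distinct event time, add the
    number of events divided by the number still at risk."""
    event_times = np.unique(time[event == 1])
    H, chf = 0.0, []
    for u in event_times:
        at_risk = np.sum(time >= u)                   # subjects still under observation
        deaths = np.sum((time == u) & (event == 1))   # events exactly at time u
        H += deaths / at_risk
        chf.append(H)
    return event_times, np.array(chf)

time = np.array([2.0, 3.0, 3.0, 5.0, 8.0])
event = np.array([1, 1, 0, 1, 0])       # 1 = event observed, 0 = censored
t_grid, H = nelson_aalen(time, event)
S = np.exp(-H)                          # survival curve via S(t) = exp(-H(t))
```

Here the increments are 1/5, 1/4 and 1/2 at times 2, 3 and 5, and the resulting survival estimate is strictly decreasing, as expected.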

DeepHit neural network
DeepHit is a deep learning model designed for survival analysis, capable of simultaneously addressing single-cause and competing risks scenarios. It utilizes a network structure comprising a shared sub-network and several cause-specific sub-networks. DeepHit's training process involves a loss function that leverages both survival times and relative risks. This model is proficient in capturing non-linear and non-proportional relationships between covariates and risk factors. Further, it is a discrete-time survival model, meaning that survival times are discretized into either equidistant (equally spaced) or quantile-based intervals [38]. The DeepHit method was the first to apply neural networks to the discrete-time likelihood for survival data. A more detailed account of how the discretization process works in the survival context, and of its further extensions, can be found in the study by Kvamme and Borgan (2021) [39].
In this study, we implemented the DeepHit NN using the R package 'survivalmodels' [40], which relies on the R package 'reticulate' [41]. Reticulate is a popular R package that creates a Python environment within R so that Python packages and functions can be used inside R [42]. No hyperparameter tuning was applied; rather, the default parameters were used for training the neural network.

Proposed model architecture
The section on feature selection methods described some well-known feature selection techniques for survival studies, briefly explaining the algorithms they use to separate and select only the features important to the response variable. In practice, when several FS methods are applied to the same dataset, it is not guaranteed that all techniques will agree upon the same set of features. This is due to the mechanism each FS technique uses to assign relative importance to features. For instance, RSF-vs uses a dimensionless order statistic called the minimal depth of the maximal subtree, which reflects the predictive power of a variable in a survival tree; it ranks features by this criterion and selects the top-most from the list. In contrast, LASSO and SCAD apply their respective penalties to the regression coefficients in order to shrink redundant variables' coefficients towards zero. Although the mechanisms differ considerably, the objective is the same for all, so each method can be expected to agree with the others' selections to a large extent, if of course not completely.
In this work, we aimed to exploit and capitalize on four different feature selection techniques that use contrasting mechanisms to achieve the same objective. Intuitively, variables that are genuinely important and non-redundant are believed to have a higher chance of being chosen by most of the algorithms. This led us to propose a novel feature selection method for survival data.
The proposed method is a three-step procedure for feature selection, constituting a hybrid method that selects the most informative or discriminative features on which the majority of the feature selection techniques agree. The method proceeds as follows:

i. Apply each of the four feature selection techniques (LASSO, RSF-vs, SCAD, and CoxBoost) individually to obtain four feature sets, one for each FS technique employed.
ii. After comparison, create a new set that includes only those features chosen in at least three of the four sets from step i.
iii. The final set of features comprises those satisfying the criterion in step ii. These features then go through additional processing, such as fitting a survival model using predictive machine learning models and assessing their predictive accuracy.
The suggested algorithm is described mathematically as follows. Let FS_1, FS_2, FS_3, and FS_4 stand for the sets of features obtained by the LASSO, RSF-vs, SCAD, and CoxBoost methods, respectively. The set of features on which all methods agree, FS_intersection, is determined by:

FS_intersection = FS_1 ∩ FS_2 ∩ FS_3 ∩ FS_4

Next, the features that appear in at least three of the four feature sets are filtered out as:

FS_final = { f : count(f) ≥ 3 },

where count(f) is the number of times feature f appears across the four feature sets.
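The majority-vote filter above amounts to counting how many selectors picked each feature. A minimal sketch, with hypothetical gene names standing in for the outputs of the four selectors:

```python
from collections import Counter

def majority_vote_features(feature_sets, min_votes=3):
    """Keep features chosen by at least min_votes of the candidate selectors
    (3 of 4 in the proposed method)."""
    votes = Counter(f for s in feature_sets for f in set(s))
    return {f for f, c in votes.items() if c >= min_votes}

# Hypothetical selections; real sets come from LASSO, RSF-vs, SCAD, CoxBoost.
fs_lasso    = {"g1", "g2", "g5"}
fs_rsf_vs   = {"g1", "g2", "g7"}
fs_scad     = {"g1", "g2", "g5"}
fs_coxboost = {"g1", "g9"}
fs_final = majority_vote_features([fs_lasso, fs_rsf_vs, fs_scad, fs_coxboost])
```

Here "g1" receives four votes and "g2" three, so only these two survive; "g5" (two votes) and the singletons are dropped.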
The final collection of attributes, denoted by the set FS final , will be used for additional analysis, including the fitting and performance evaluation of survival models.
The primary goal of this approach is to create a feature selection process for survival prediction that is simple to comprehend and apply. The objective of this study is to develop an approach that is both efficient and intuitive by combining well-known feature selection techniques with a simple selection criterion based on their agreement across these techniques. Despite its simplicity, the proposed method has shown promising results in our experiments, as evidenced by its performance in survival prediction tasks compared to existing techniques.
The algorithm of the proposed hybrid feature selection method is given below.

Algorithm 1: Pseudo code of the proposed method
The flowchart in Fig. 1 outlines the proposed method graphically.

Performance evaluation metrics
To evaluate model performance, researchers have access to various evaluation metrics, allowing them to choose the most appropriate ones for their specific problem. For survival models, common metrics include the concordance index [43], the C-statistic (a modified version of the C-index suitable for models with high censoring rates) [44], the Brier score and integrated Brier score [45], the integrated square error (ISE) [46], and others. In this study, we chose to assess the competing models using the Integrated Brier Score (IBS), the C-index, the Integrated Absolute Error (IAE), and the Integrated Square Error (ISE). These metrics are briefly explained as follows.

i. Brier score and Integrated Brier Score (IBS)

The Brier score, initially introduced by Brier in 1950 [45] to assess the accuracy of weather forecasts, was later adapted to evaluate the performance of survival models that incorporate censored observations [47]. The Brier score varies with time. In the absence of censoring, it can be expressed as:

BS(t) = (1/N) Σ_{i=1}^{N} ( I(t_i > t) − Ŝ(t | x_i) )²     (3)

In scenarios involving censoring, a weighted version of the formula is used: contributions from subjects whose event occurs before time 't' are weighted by 1/G(t_i), and contributions from subjects still at risk after time 't' are weighted by 1/G(t), where G denotes the Kaplan-Meier estimate of the censoring survival function. Observations censored before time 't' are not included in the calculation. The Brier score in the presence of censoring is then:

BS(t) = (1/N) Σ_{i=1}^{N} [ Ŝ(t | x_i)² I(t_i ≤ t, δ_i = 1)/G(t_i) + (1 − Ŝ(t | x_i))² I(t_i > t)/G(t) ]     (4)

A Brier score approaching 0 signifies superior predictive performance, while a score nearing 1 suggests poorer performance. The Integrated Brier Score (IBS) is derived by integrating the Brier score across all available times t_min ≤ t ≤ t_max:

IBS = [1/(t_max − t_min)] ∫_{t_min}^{t_max} BS(t) dt

The integration can be readily computed using the trapezoidal rule, which calculates the area under the prediction curve [48].
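The censored Brier score and its time integral can be computed directly from these formulas. The sketch below is an illustration of Eq. (4) and the trapezoidal IBS on toy inputs (the data, the callable G, and the predicted probabilities are all assumptions of the example, not study outputs):

```python
import numpy as np

def brier_score(t, time, event, surv_prob, G):
    """IPCW Brier score at time t. surv_prob holds each subject's predicted
    S(t | x_i); G is the censoring survival function (a callable).
    Subjects censored before t contribute nothing."""
    time = np.asarray(time, dtype=float)
    died = (time <= t) & (event == 1)
    alive = time > t
    terms = (surv_prob ** 2) * died / np.maximum(G(time), 1e-12) \
          + ((1.0 - surv_prob) ** 2) * alive / np.maximum(G(t), 1e-12)
    return terms.sum() / len(time)

def integrated_brier(ts, bs_values):
    """IBS: trapezoidal integral of BS(t), normalised by the time span."""
    ts, bs_values = np.asarray(ts, float), np.asarray(bs_values, float)
    area = np.sum((bs_values[1:] + bs_values[:-1]) / 2.0 * np.diff(ts))
    return area / (ts[-1] - ts[0])

# Toy check without censoring: G is identically 1, so the weights drop out.
G_one = lambda u: np.ones_like(np.asarray(u, dtype=float))
time = np.array([1.0, 2.0, 3.0])
event = np.array([1, 1, 1])
perfect = np.array([0.0, 1.0, 1.0])   # predicted S(1.5 | x_i) matching the outcomes
bs = brier_score(1.5, time, event, perfect, G_one)
```

With perfect predictions the score is 0; inverting the predictions pushes it to its worst value of 1.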

ii. Concordance index (C-Index)
The C-index is a discrimination measure that indicates the ability of a model to effectively distinguish between a pair of observations categorized as 'high' and 'low' risk. It is defined as the ratio of concordant pairs to the total number of comparable pairs [47, 49], where a comparable pair is a pair of individuals (say i and j) in a dataset with actual event times t_i and t_j and predicted survival values \hat{S}(t_i) and \hat{S}(t_j). If a pair (i, j) is such that t_i > t_j and \hat{S}(t_i) > \hat{S}(t_j), meaning the actual observed time for the i-th individual is greater than that of the j-th and the model predicted the same ordering, it is considered a concordant pair; otherwise, it is a discordant pair. To handle censoring when determining comparable pairs, certain rules are followed. For example, a censored instance can only be paired with uncensored instances that occur before it in the dataset. Additionally, a censored instance cannot be paired with another censored instance, or with an uncensored instance that occurs after it. This concept is illustrated in Fig. 2, where five observations are ordered from top to bottom. Two scenarios are possible: (a) all five observations are uncensored, giving a total of \binom{5}{2} = 10 pairs; (b) the second and fourth observations are censored, reducing the number of pairs to 6, since censored observations cannot be paired with uncensored observations occurring after them.
For survival models that predict survival probabilities as output, the C-index is calculated as

C\text{-index} = \frac{1}{N} \sum_{t_i > t_j} I\left[ \hat{S}(t_i) > \hat{S}(t_j) \right]

where N is the total number of comparable pairs, I[\cdot] is the indicator function, and \hat{S}(\cdot) denotes the survival probabilities estimated by the model. Some survival models do not directly estimate survival probabilities but instead compute hazard ratios, such as the Cox PH model. In such cases, the C-index can be computed as

C\text{-index} = \frac{1}{N} \sum_{t_i > t_j} I\left[ x_j^\top \hat{\beta} > x_i^\top \hat{\beta} \right]

where \hat{\beta} represents the parameters estimated by a hazard-ratio-based model such as Cox. C-index values range from 0 to 1. A higher value indicates better accuracy in separating 'low'- and 'high'-risk observations. A C-index of 0.5 suggests that the model makes random guesses, a value close to 1 indicates perfect separation, and a value below 0.5 suggests separation in the wrong direction.
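To make the pair counting concrete, here is a small sketch that computes the C-index from hazard-type risk scores (higher score = higher risk), applying the comparability rule that the earlier time in a pair must be an observed event. The data values are illustrative only.

```python
def c_index(times, events, risk_scores):
    """Concordance index for risk scores (higher score = shorter expected survival).

    A pair (i, j) is comparable when i fails first (times[i] < times[j]) and
    i's failure is an observed event (events[i] == 1); it is concordant when
    i also has the higher risk score. Ties in risk count as half, which is
    a common convention."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfectly ordered risks (earliest failure has highest risk) give C-index = 1.0
print(c_index([1, 2, 3], [1, 1, 1], [3, 2, 1]))
```

With events = [1, 0, 1] the censored second observation contributes no comparable pairs of its own, mirroring the pairing rules illustrated in Fig. 2.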
iii. Integrated Absolute Error and Integrated Square Error (IAE and ISE)

Survival models fitted to simulated datasets can be evaluated using these two closely related measures. Because the mathematical form of the true survival function is typically unknown in practical experiments, an approximation can be obtained using the non-parametric Kaplan-Meier estimator; using this approximate S(t), the IAE and ISE can be computed for real datasets as well. This approximation is available as a built-in function in the R package 'SurvMetrics' [50]. Thus, if S(t) is the true survival function and \hat{S}(t) is its estimate, the two measures are given by

IAE = \int_t \left| S(t) - \hat{S}(t) \right| dt \quad \text{and} \quad ISE = \int_t \left( S(t) - \hat{S}(t) \right)^2 dt

Both measures range from 0 to infinity, with lower values for a predictive model indicating better predictions. Since the results of IAE and ISE were quite similar, we report only IAE in the results section.
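On a shared time grid, both integrals can be approximated with the trapezoidal rule. A minimal sketch, with made-up survival curves standing in for the Kaplan-Meier reference and a model's prediction:

```python
def iae_ise(times, s_true, s_pred):
    """Integrated absolute error and integrated square error between a
    reference survival curve (e.g. Kaplan-Meier) and a predicted curve,
    approximated on a common time grid via the trapezoidal rule."""
    iae = ise = 0.0
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        d0 = abs(s_true[k] - s_pred[k])
        d1 = abs(s_true[k + 1] - s_pred[k + 1])
        iae += 0.5 * (d0 + d1) * dt
        ise += 0.5 * (d0 ** 2 + d1 ** 2) * dt
    return iae, ise

# Toy curves on the grid t = 0, 1, 2
iae, ise = iae_ise([0, 1, 2], [1.0, 0.8, 0.6], [1.0, 0.7, 0.5])
print(iae, ise)
```

Because the squared differences shrink small errors, ISE tends to track IAE closely on well-calibrated curves, consistent with the similarity between the two measures reported above.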

Results and discussion
In this section, we review four tables containing the results of the analysis of 11 high-dimensional survival datasets. The first three tables contain results obtained from each of three survival prediction models, i.e., the Cox proportional hazards model, random survival forest, and DeepHit, applied to five different feature selection methods, i.e., our proposed method, LASSO, RSF-vs, SCAD, and CoxBoost. The results in the tables were obtained as explained in the preceding section: we performed variable selection for each of the 11 datasets and computed the evaluation metric values for each dataset and each feature selection method.
Table 2 contains the results of the Cox proportional hazards model applied to the different feature selection methods listed in the right-most column of the table. For each method, three performance metrics were computed: the integrated Brier score, the C-index, and the integrated absolute error. The values of these metrics are recorded in the rows corresponding to each variable selection method.
The values in bold indicate the best performance among all variable selection methods for each dataset. Since three evaluation metrics were considered, we compared all three for each method on each dataset. For individual datasets such as the 'Breast' dataset, when features were selected through LASSO and applied to the Cox PH survival model, we observed better predictive performance across all three metrics: IBS, C-index, and IAE. Similar results were obtained for the next two datasets, 'WPBC' and 'VDV', where LASSO feature selection again demonstrated superior performance. However, there is a deviation in the fourth row of Table 2, which reports the metrics for the 'Heart FD' dataset. Here, the LASSO feature selection method outperformed the others in terms of IBS and C-index, whereas the CoxBoost FS method performed best in terms of IAE. This approach allows for a thorough examination of each dataset, enabling us to assess the performance of each feature selection method relative to the others. Each row corresponds to the performance metrics for a single dataset, arranged sequentially. The last row of Table 2, labelled 'Average', reflects the evaluation metric values averaged across all 11 datasets, providing a comprehensive measure for determining which feature selection method outperforms the rest. Comparing the averaged performance across all 11 datasets, it becomes evident that our proposed feature selection method performs exceptionally well across all three metrics. The comparison of results can be viewed as a voting process, in which we tally the number of times a method outperforms the others. Applying this voting criterion, the proposed method surpassed the other methods across all three metrics on 8 out of 11 datasets.
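The voting comparison described above amounts to counting, over every (dataset, metric) cell, which FS method wins — lower values win for IBS and IAE, higher values win for the C-index. A sketch with hypothetical metric values (the numbers below are illustrative, not taken from the tables):

```python
from collections import Counter

def vote_winners(results, lower_is_better=("IBS", "IAE")):
    """Tally how many (dataset, metric) cells each FS method wins.

    `results[dataset][metric][method]` holds one metric value; the winner
    of a cell is the method with the lowest IBS/IAE or the highest C-index."""
    votes = Counter()
    for metrics in results.values():
        for metric, by_method in metrics.items():
            pick = min if metric in lower_is_better else max
            votes[pick(by_method, key=by_method.get)] += 1
    return votes

# Hypothetical entries for two datasets and two methods
results = {
    "Breast": {"IBS": {"Proposed": 0.16, "LASSO": 0.18},
               "C-index": {"Proposed": 0.70, "LASSO": 0.65}},
    "WPBC":   {"IBS": {"Proposed": 0.20, "LASSO": 0.19},
               "C-index": {"Proposed": 0.68, "LASSO": 0.66}},
}
print(vote_winners(results))  # 'Proposed' wins 3 cells, 'LASSO' wins 1
```

A per-dataset variant (declaring a method the winner of a dataset only when it wins all three metrics there) yields the "8 out of 11 datasets" style of count used in the text.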
Table 3, which follows the same format as Table 2, allows us to compare the outcomes of the different feature selection methods in the two ways discussed earlier. One way is to examine the average value of each metric for each feature selection method to determine the winner; the other is to count the number of times a feature selection method outperforms the others across the datasets, which we refer to as "voting". Analyzing the bottom row of the table, we observe that our proposed method surpasses all four other feature selection methods across all comparison metrics, including IBS, C-index, and IAE. Furthermore, when assessing performance based on the number of datasets on which each feature selection method excels, our proposed method outperforms the competition on 9 out of 11 datasets, demonstrating superior performance according to the "voting" criterion as well. The findings from Table 3 provide substantial evidence that using the proposed method for feature selection, followed by modeling the data with random survival forest, can lead to improved predictive performance in survival analysis.
Table 4, displaying the performance evaluation metric values, follows the same format as the previous tables. The results show that when the proposed method of variable selection is employed for feature reduction followed by DeepHit-NN as the survival prediction model, 8 out of 11 datasets exhibit lower IBS and IAE values. For the C-index, the proposed method performed even better, with 9 out of 11 datasets showing higher C-index values when the proposed FS method was employed. Additionally, in terms of the average value across all 11 datasets, the proposed method outperformed the other FS methods on all three comparison metrics: IBS, IAE, and C-index. Table 5 presents the averaged performance evaluation values for the five feature selection methods under each survival model, derived from the column averages of Tables 2, 3 and 4. For instance, the IBS value attributed to the 'Proposed' method under the 'Cox PH' model, 0.165298, is the average IBS value obtained from Table 2 under the 'Proposed' method (averaged over the first column). All other values are recorded in the same way. This comprehensive summary provides a detailed overview of the entire analysis. The table clearly illustrates that our proposed feature selection method outperforms the other methods across all comparison metrics, including IBS, C-index, and IAE. Referring back to the voting criterion used earlier, it is again evident that our proposed method outperforms the competing feature selection methods.

Conclusion
This study aimed to harness a range of existing feature selection techniques to develop a hybrid feature selection (FS) technique that performs the same task with improved accuracy and a reduced margin of error. To assess the achievement of the study's objective, we employed a total of five FS methods, including the proposed one, along with three different survival models, and computed three performance evaluation metrics: the integrated Brier score (IBS), the concordance index (C-index), and the integrated absolute error (IAE). The results are presented in both tabular and graphical formats in the results section.
In conclusion, based on the results presented in the preceding section, we have two approaches for determining which method surpasses the others in reducing prediction error and enhancing prediction accuracy. One approach involves comparing how frequently a feature selection (FS) method outperforms the others on the individual datasets considered in our analysis.
As a rule of thumb, if an FS algorithm performs best on more than two-thirds of the datasets, it can be considered ideal. In this case, with 11 datasets and 5 different FS algorithms, it would be quite rare for any single FS technique to achieve that level of performance consistently across datasets. If a variable selection method performs best on 6-8 datasets, we consider it a satisfactory outcome; ideally, we would expect our proposed method to perform at least as well, if not better (on 9-11 datasets).
Another approach to comparing the different feature selection techniques is to assess their average performance across all datasets. Given the detailed discussion of the results in the previous section, we now provide a concise conclusion.
In Table 2, the proposed method achieved results above the satisfactory level, outperforming the other methods in terms of both the count (8 out of 11 datasets) and the averaged results across all three metrics. Moving to Table 3, our proposed method yielded ideal results, outperforming the other methods in terms of both the voting (9 out of 11 datasets) and the average comparison. These strong results align with our expectations, primarily owing to the added advantage of using random survival forest as the predictive model. With DeepHit as the prediction model, the proposed technique obtained results broadly similar to those of Cox PH. In summary, focusing on Table 5, the most comprehensive table, it is evident that the proposed method outperformed the other feature selection methods on all three metrics, whether compared by counting/voting or by average results. Based on these results, we conclude that the proposed method improves the predictive performance for time-to-event data, especially when the proposed algorithm is employed for dimension reduction and random survival forest is used for survival prediction.

Fig. 1 A flowchart of the proposed feature selection method for survival analysis

Fig. 2 Illustration of making pairs of observations (a) with no censored observations and (b) with censored observations

Fig. 3 Fig. 4
Fig. 3 Boxplots of IBS, C-Index and IAE for Cox PH model for five feature selection methods

Table 1
Description of the Benchmark high-dimensional survival datasets

Serial No | Name of Dataset used in this study | No. of Samples | No. of Features | Source
| CL: Gene expression and survival data from a cohort of 240 patients diagnosed with diffuse large-B-cell lymphoma | 240 | 7399 | [21]

Table 2
Performance evaluation metrics for all datasets corresponding to feature selection method followed by Cox PH model

Table 3
Performance evaluation metrics for all datasets corresponding to variable selection method followed by random survival forest

Table 4
Performance evaluation metrics for all datasets corresponding to variable selection method followed by DeepHit neural network

Table 5
Averaged performance metrics across all datasets