Skip to main content

A novel dynamic Bayesian network approach for data mining and survival data analysis

Abstract

Background

Censorship is the primary challenge in survival modeling, especially in human health studies. The classical methods have been limited by applications like Kaplan–Meier or restricted assumptions like the Cox regression model. On the other hand, Machine learning algorithms commonly rely on the high dimensionality of data and ignore the censorship attribute. In addition, these algorithms are more sophisticated to understand and utilize. We propose a novel approach based on the Bayesian network to address these issues.

Methods

We proposed a two-slice temporal Bayesian network model for the survival data, introducing the survival and censorship status in each observed time as the dynamic states. A score-based algorithm learned the structure of the directed acyclic graph. The likelihood approach conducted parameter learning. We conducted a simulation study to assess the performance of our model in comparison with the Kaplan–Meier and Cox proportional hazard regression. We defined various scenarios according to the sample size, censoring rate, and shapes of survival and censoring distributions across time. Finally, we fit the model on a real-world dataset that includes 760 post gastrectomy surgery due to gastric cancer. The validation of the model was explored using the hold-out technique based on the posterior classification error. Our survival model performance results were compared using the Kaplan–Meier and Cox proportional hazard models.

Results

The simulation study shows the superiority of DBN in bias reduction for many scenarios compared with Cox regression and Kaplan–Meier, especially in the late survival times. In the real-world data, the structure of the dynamic Bayesian network model satisfied the finding from Kaplan–Meier and Cox regression classical approaches. The posterior classification error found from the validation technique did not exceed 0.04, representing that our network predicted the state variables with more than 96% accuracy.

Conclusions

Our proposed dynamic Bayesian network model could be used as a data mining technique in the context of survival data analysis. The advantages of this approach are feature selection ability, straightforward interpretation, handling of high-dimensional data, and few assumptions.

Peer Review reports

Background

Survival analysis is the formal name for an essential branch of statistics. It aims to study the time of the specific event and explore the particular circumstances or characteristics that influence it [1]. We know these techniques as reliability analysis in engineering and industries. The usual interest is the lifetime of devices, products, and machines in these contexts [2,3,4]. The intrinsic features of subjects in these domains enable the investigators to control all the processes and fully observe the data [3, 5].

Further, when the subjects are humans, the survival models confront the crucial challenge of incomplete data and the high dimension of confounders more than the other context. The incomplete data known as censoring is expected because sample withdrawal or loss of follow-up exceeds the event time of the study period [6, 7]. Despite all efforts to evolve the survival models, some challenges in classical methods show the necessity of developing new approaches in practice [8,9,10].

The Kaplan–Meier method, also known as product-limit, is one of the primary and still popular methods in analyzing survival data. This method estimated the survival probability function of data in a straightforward and easy-to-understand manner. In addition, the Kaplan–Meier is a nonparametric approach, and a few assumptions are required to apply it to data [1, 11]. However, the simplicity of Kaplan–Meier restricts its applications. For instance, it cannot account for multiple factors or control for confounding factors. Therefore, in the case of feature selection and group comparisons, we need to apply additional analyses such as the low power log-rank test. In this procedure, insufficient sample size and increasing the number of features lead to inaccurate inferences [12]. Another alternative is regression approaches like the Cox proportional hazard model. Using these models is limited by restrictive assumptions like the proportionality of hazards, independence of censoring and survival distribution, and exponential relation between hazard and covariates [13].

On the other hand, developing the application of machine learning algorithms in recent decades has reformed many classical approaches to data analysis. The survival analysis is one of the domains that changed significantly [2, 9, 10]. In this manner, the applications of random forests [14, 15], Bayesian methods [5, 16, 17], neural networks [18,19,20,21], support vector machines [22, 23], ensemble learning [24, 25], and active learning [26, 27] algorithms were introduced in survival analysis. These changes enable us to resolve issues with a new practice even though the general idea is similar to classical approaches.

As mentioned, incompletely observed data or censorship is the primary challenge in survival modeling, especially in health. Considerable studies that introduced the application of machine learning algorithms in survival analysis did not engage the censorship because of the study subject's type [5]. Some other studies relied on high sample size and ignored the censored observations or imputed them by modeling approaches [28, 29]. Finally, a few studies directly address censorship using the methods like weighting the censored observation [30]. However, there is a gap between classical approaches and novel machine learning techniques that use intelligent algorithms to extract data patterns. The classic methods were developed to handle small to medium-dimension data and find a general overview. In contrast, the machine learning algorithm and data mining techniques aim to handle high-dimensional data. These methods focus on the prediction with maximum accuracy.

This study proposes a novel Dynamic Bayesian Network (DBN) model for data mining in the context of survival data analysis. The Bayesian Network (BN) has a series of powerful tools that could facilitate survival analysis. Actually, the BN combines probability theory and graphical models [31]. Consequently, it enabled us to capture the uncertainty of stochastic survival events and represent a graphical structure of probability distribution. In addition, our model uses the Kaplan–Meier idea to consider the censoring phenomena and the various capability of BN models to add extra tools for more precise inferences simultaneously. The structure learning algorithm of the BN ables us to compare the groups and find the significant features. In addition, parameter learning algorithms lead to more precise inferences, estimations, and predictions. In this study, we present our DBN model for survival analysis, evaluate its performance using a simulation study, and finally use a real-world data set to show the way analysis could be performed using that.

Methods

Product limit estimators

The primary objective of survival analysis is to explore the time until a particular event. Hence, we describe the stochastic behavior of an outcome variable in time type. We usually use survival, density, hazard, and the mean or median residual life functions in this regard. As these functions are attainable from each other, there is no priority except for better interpretability in choosing one. The product limit estimators are the estimates of survival function, which is defined as the probability of an individual surviving after a given time point t:

$$S\left( t \right) = {\text{P}}(T > t)$$

T is a random variable that denotes when the event of interest occurs. Kaplan and Meier partitioned observed times into intervals according to unique event times and proposed the following estimator for all t values in the range of observed data when \(t_{1}\) represent the first event time [1]:

$$\hat{S}\left( t \right) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {if\;t < t_{1} } \hfill \\ {\mathop \prod \limits_{{t_{i} \le t}} \left[ {1 - \frac{{d_{i} }}{{Y_{i} }}} \right] } \hfill & {if\;t_{1} \le t} \hfill \\ \end{array} } \right.$$
(1)

where \(d_{i}\) and \(Y_{i}\) represent the number of failures and at-risk persons in each interval, respectively. Therefore, the product-limit estimator is a discrete approach that leads to a step function that only changes at event times. The Greenwood formula is the well-known approach to estimating the variance of the estimators [1]:

$$\hat{V}\left[ {\hat{S}\left( t \right)} \right] = \hat{S}\left( t \right)^{2} \mathop \sum \limits_{{t_{i} \le t}} \frac{{d_{i} }}{{Y_{i} \left( {Y_{i} - d_{i} } \right)}}$$
(2)

Bayesian network

Every Bayesian Networks (BNs) correspond to a Directed Acyclic Graph (DAG) and a joint distribution, which are the graphical and probabilistic aspects of the model. DAG consists of nodes corresponding to random variables and edges that present conditional probabilities. According to the domain of the random variables, the BN could be discrete, Gaussian, or hybrid. Our BN in this study is a discrete one. Therefore, for a set of discrete random variables \({\varvec{X}} = \left( {X_{1} ,X_{2} , \ldots ,X_{D} } \right){ }\) Taking their values in the discrete and finite D-dimensional domain. The BN is defined as pair \({\mathcal{M}} = \left( {{{\mathcal{g}}},\left( {P\left( {X_{d} |{\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d} } \right)} \right)} \right)_{1 \le d \le D} } \right)\) where \({{\mathcal{g}}} = \left( {{\varvec{X}},{\varvec{\varepsilon}}} \right)\) is a DAG presentation of random variables \({\varvec{X}}\) with edges set \({\varvec{\varepsilon}}\), \({\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d} } \right)\) is the set of parents \(X_{d}\) in \({\varvec{X}}\), and \(\left( {P\left( {X_{d} |{\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d} } \right)} \right)} \right)_{1 \le d \le D}\) is the conditional probability of node \(X_{d}\) given their parents in the set \({\varvec{X}}\).

The appealing feature of BN is to summarize the complex joint probability distribution \({\varvec{X}}\) in the following parsimonious way using the conditional independence and Markov chain properties:

$$P\left( {X_{1} ,X_{2} , \ldots ,X_{D} } \right) = \mathop \prod \limits_{d = 1}^{D} P\left( {X_{d} |{\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d} } \right)} \right)$$
(3)

Dynamic Bayesian network

The classical BN is not adopted to address time-dependent processes like survival analysis [32]. Therefore, Dynamic Bayesian Network (DBN) [33] was introduced to extend this process. In this context, time-dependent random variables \(\left( {{\varvec{X}}_{t} } \right)_{t \ge 1} = \left( {X_{1,t} , \ldots ,X_{D,t} } \right)_{t \ge 1}\) are defined where t is a discrete index time formally called slice. DBN uses Markov property which indicates the future of a stochastic process is independent of its past, given current status or several lags before it. The number of lags determines the order of the Markov process. This study only needs to use the Markov process of order 1, which leads to a 2-slice Temporal Bayesian Network (2-TBN). In this regard, we assume \(X_{t - 1} \bot X_{t + 1} |X_{t}\) for all \(t \ge 2\). A 2-TBN could be defined as a pair of 2 BNs \(\left( {{\mathcal{M}}_{1} ,{\mathcal{M}}_{ \to } } \right)\) where \({\mathcal{M}}_{1}\) is the joint distribution of the initial process \({\varvec{X}}_{1} = \left( {X_{1,1} , \ldots ,X_{D,1} } \right)\) and \({\mathcal{M}}_{ \to }\) represent the transition model. The joint probability distribution \({\mathcal{M}}_{1}\) easily derived from BN approach in Eq. 3:

$$P\left( {{\varvec{X}}_{1} } \right) = \mathop \prod \limits_{d = 1}^{D} P\left( {X_{d,1} |{\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d,1} } \right)} \right)$$
(4)

In the transition model, the joint distribution of \({\varvec{X}}_{t}\) only depends on random variables belonging to the set of parents \({\varvec{X}}_{t}\) at slice \(t - 1\) in the form:

$$P\left( {{\varvec{X}}_{t} {|}{\varvec{X}}_{t - 1} } \right) = P\left( {X_{1,t} , \ldots ,X_{D,t} {|}X_{1,t - 1} , \ldots ,X_{D,t - 1} } \right) = \mathop \prod \limits_{d = 1}^{D} P\left( {X_{d,t} |{\mathbf{\mathcal{P}}}_{{{\varvec{X}}_{t} }} \left( {X_{d,t} } \right)} \right)$$
(5)

Hence the probability distribution of 2-TBN is calculated by the combination of Eqs. 4,5:

$$P\left( {{\varvec{X}}_{1 \le t \le T} } \right) = \mathop \prod \limits_{d = 1}^{D} P(X_{d,1} |{\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d,1} } \right))\mathop \prod \limits_{t = 2}^{T} \mathop \prod \limits_{d = 1}^{D} P\left( {X_{d,t} |{\mathbf{\mathcal{P}}}_{{{\varvec{X}}_{t} }} \left( {X_{d,t} } \right)} \right)$$
(6)

In order to consider the time stationary covariates \({\varvec{Z}} = \left( {Z_{1} , \ldots ,Z_{q} } \right)\) in the model, we could extend the parent sets in both initial processes \({\mathcal{M}}_{1}\) and transition model \({\mathcal{M}}_{ \to }\). In this manner, the initial conditional probability could be presented as:

$$P\left( {{\varvec{X}}_{1} } \right) = \mathop \prod \limits_{d = 1}^{D} P\left( {X_{d,1} |{\mathbf{\mathcal{P}}}_{{{\varvec{X}}_{t} }} \left( {X_{d,t} } \right),{\mathbf{\mathcal{P}}}_{{\varvec{Z}}} \left( {X_{d,1} } \right)} \right)$$

where the \({\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d,1} } \right)\) and \({\mathbf{\mathcal{P}}}_{{\varvec{Z}}} \left( {X_{d,1} } \right)\) represent the sets of parents \(X_{d,1}\) in \({\varvec{X}}\) and \({\varvec{Z}},\) respectively. On the other hand, the modified transition probability distribution is reformed to:

$$P\left( {{\varvec{X}}_{t} {|}{\varvec{X}}_{t - 1} ,Z_{1} , \ldots ,Z_{q } } \right) = P\left( {X_{1,t} , \ldots ,X_{D,t} {|}X_{1,t - 1} , \ldots ,X_{D,t - 1} ,Z_{1} , \ldots ,Z_{q} } \right) = \mathop \prod \limits_{d = 1}^{D} P\left( {X_{d,t} |{\mathbf{\mathcal{P}}}_{{{\varvec{X}}_{t} }} \left( {X_{d,t} } \right),{\mathbf{\mathcal{P}}}_{{\varvec{Z}}} \left( {X_{d,t} } \right)} \right)$$

Dynamic Bayesian network interpretation of product limit estimators

A DBN of type 2-TBN could efficiently conduct the calculation process of product-limit estimators. The product-limit approach considers the time as discrete intervals between consecutive observed failure times and counts the individuals at risk and failures. Equivalently, we define discrete intervals of Kaplan–Meier as slices and two time-dependent binary status variables. The survival state variables \(N_{i,t}\) is equal to 1 if individual \(i\) survives at least to slice t and state variables \(Q_{i,t}\) is equal to 1 if the individual \(i\) censored before or at slice t. Defining these variables enables us to form the DBN in Fig. 1 to analyze survival data. In addition, we can enter other fixed effect covariates in \({\varvec{Z}}\) to the model and examine their importance by the structure learning algorithms.

Fig. 1
figure 1

Prior and transition Bayesian networks correspond to the extended dynamic Bayesian network representation of the Kaplan–Meier approach

The DBN estimators of survival probability at each time t according to Eq. 6 are equal to:

$$\hat{S}\left( t \right) = P\left( {N_{1} = 1|{\mathbf{\mathcal{P}}}_{{\varvec{Z}}} \left( {N_{t} } \right)} \right)\mathop \prod \limits_{t = 2}^{T} P\left( {N_{t} = 1|N_{t - 1} = 1,Q_{t - 1} = 0,{\mathbf{\mathcal{P}}}_{{\varvec{Z}}} \left( {N_{t} } \right)} \right)$$
(7)

Moreover, the discrete covariates set \({\mathbf{\mathcal{P}}}_{{\varvec{Z}}} \left( {N_{t} } \right)\) were found by structure learning algorithms.

For simplicity, we consider there are no covariates in the model, so Eq. (7) could be modified by counter function N:

$$\begin{aligned} \hat{S}\left( t \right) = P\left( {N_{1} = 1} \right)\mathop \prod \limits_{t = 2}^{T} P\left( {N_{t} = 1{|}N_{t - 1} = 1,Q_{t - 1} = 0} \right) = & \frac{{N\left( {N_{i1} = 1} \right)}}{N}\mathop \prod \limits_{t = 2}^{T} \frac{{N\left( {N_{it} = 1} \right)}}{{N\left( {N_{i,t - 1} = 1, Q_{i,t - 1} = 0} \right)}} \\ = & \hat{S}\left( 1 \right)\mathop \prod \limits_{t = 2}^{T} \left[ {1 - \frac{{d_{t} }}{{Y_{t} }}} \right] \\ \end{aligned}$$
(8)

That is the same as product-limit estimators in Eq. (1). Therefore, we are able to use the Greenwood formula in Eq. (2) to calculate the variance of survival estimations. In addition, the common bootstrapping approaches in the BN context, like likelihood weighting and logic sampling, are the alternative approach in this way[34].

Simulation study

We conducted a simulation study to assess the performance of our model in comparison with the Kaplan–Meier and Cox proportional hazard regression. We defined various scenarios according to the sample size (N = 800, 5000, 10,000), censoring rate (R = 25%, 40%, 60%), and shape parameters of survival (\(\alpha_{S}\)) and censoring (\(\alpha_{C}\)) distributions. We considered five covariates distributed as mutually independent binomial distributions with different probability parameters \(\left[ {X_{i} \sim B\left( {N,P_{i} } \right), i = 1, \ldots ,5 P_{i} = 0.1, 0.2, 0.5, 0.7, 0.9} \right]\).

The survival and censoring times were generated using Weibull distributions. The scale parameter of survival time distribution was reparametrized according to the summation of the covariates \(\left[ {\theta_{S} = \mathop \sum \limits_{i = 1}^{5} X_{i} } \right]\). Using the numerical methods and assuming the fixed value for censorship, we found the scale parameter of censoring time distribution. We set the shape parameter of survival and censoring distributions as the values 0.5 (decreasing event/censor rate), 1 (constant event/censor rate), and 2 (increasing event/censor rate). In this manner, we achieved nine different scenarios for the shape of survival/censoring times.

In brief, if the survival time \(S \ge 0\) follows the below Weibull distribution:

$$f\left( {s;\alpha_{S} ,\theta_{S} } \right) = \frac{{\alpha_{S} }}{{\theta_{S} }}\left( {\frac{s}{{\theta_{S} }}} \right)^{{\alpha_{S} - 1}} {\text{exp}}\left( { - \left( {\frac{s}{{\theta_{S} }}} \right)^{{\alpha_{S} }} } \right)$$

where scale parameter \(\theta_{S}\) and shape parameter \(\alpha_{S}\) were defined before, the cumulative distribution function of S is:

$$F\left( {s;\alpha_{S} ,\theta_{S} } \right) = 1 - {\text{exp}}\left( { - \left( {\frac{s}{{\theta_{S} }}} \right)^{{\alpha_{S} }} } \right)$$

As the Weibull is a continuous random distribution \(F\left( {s;\alpha_{S} ,\theta_{S} } \right)\sim U\left( {0,1} \right)\). Therefore we generated N samples \(u_{i} \sim U\left( {0,1} \right)\) for \(i = 1, \ldots ,N\) and then compute the \(s_{i} = \theta_{S} \sqrt[{{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 {\alpha_{S} }}}\right.\kern-\nulldelimiterspace} \!\lower0.7ex\hbox{${\alpha_{S} }$}}}]{{ - {\text{ln}}\left( {u_{i} } \right)}}\). A similar approach was used to generate the censored times \(c_{i} s\).

We fitted all three models to the simulated data and estimated survival probability at 20%, 50% (median), and 80% percentiles of actual survival times. The bias and its Root Mean Squared Errors (RMSE) were calculated in 1000 randomly generated samples in each scenario.

Real data analysis

We applied our proposed method to real-world survival data. The 760 patients diagnosed with gastric cancer at the Iran cancer Institute and who had undergone gastrectomy from 1995 to 2012 entered the study. This historical and artificial cohort of patients was followed until observing death events. The censorship was considered in case of loss to follow-up or being alive at the end of observation time. All the variables except follow-up duration were time stationery and were collected at the surgery time.

We conducted the ordinary Kaplan–Meier survival analysis, DBN, and Cox proportional hazard (PH) model. The Cox PH model was added to compare our findings with a regression model that could handle the covariate effect in survival analysis. In addition, the censor probability plots were generated using the Kaplan–Meier approach. For this purpose, we defined censorship as the primary event and death as the alternative status.

Structure learning

We used Hill-Climbing (HC) and Tabu search, two of the most popular score-based learning algorithms, to find the structure of DAG [35]. The last event occurred 15 years after surgery, and there was no event in year 13 post-surgery. Therefore 14 eligible slices \(t\) correspond to the unique event year were defined in the DBN structure. The prior and transition networks in Fig. 1 related to \(N_{i,t}\) and \(Q_{i,t}\) were considered as the white list, and the algorithms learned the structure of stationary variables Z. The validation of structure learning was conducted by the ten repeated hold-out technique using ten subsamples in size 30% of the original data [36]. The posterior classification error based on likelihood weighting was set as the loss function [37]. The validation process was repeated according to different score functions, including logarithm of likelihood, AIC, BIC, BDE, BDS, and \(K^{2}\) [38].

Results

Simulation study

Table 1 represents the bias and related RMSE of estimated survival probability by models in different scenarios. For instance, the first value of − 0.0033 (0.0046) is observed for bias (RMSE) for a sample size of 800 and censorship of 25%. It shows that the Kaplan–Meier approach estimated the survival probability by 0.0033 less than the actual value when the random samples come from the survival and censoring distributions with increasing rates over time. Cox and DBN estimations in this scenario are 0.004 and − 0.0008 differ from the actual probability, respectively. Therefore, the minimum absolute bias value is related to the DBN model.

Table 1 The models' bias and RMSE in survival probability estimation in percentile 20% of real survival time

The Cox regression model shows superiority according to the minimum bias for most scenarios in Table 1. On the other hand, the DBN estimates actual survival probability better than the Cox approach in 52% of simulation scenarios which assumes an increasing censor rate across time. Increasing the censoring rate causes higher observed bias (RMSE) for all the models. The constant censor rate across time (alpha c = 1), which corresponds to non-informative censoring, shows lower levels of bias (RMSE) in all the scenarios.

Similar results for exploring median survival time are presented in Table 2. The number of scenarios in which the DBN is superior to other models due to bias (RMSE) reduction is relatively higher than in Table 1. In addition, Cox is not significantly better than DBN in all situations, and in many cases, its absolute bias is less than 0.001 of DBN.

Table 2 The models' bias and RMSE in survival probability estimation in the median of real survival time

The results of exploring the percentile of 80% of survival time in Table 3 reveal that the DBN model is superior in bias (RMSE) reduction in all the scenarios except one. When the data comes from survival distribution with shape parameter 0.5 and the censoring rate increases across time (alpha c = 2), for the sample sizes of 5000 with 60% censoring, the bias (RMSE) of Cox regression is − 0.1373 (0.1345). DBN's bias (RMSE) for this scenario is − 0.1375 (0.138). We should consider the increasing RMSE according to heavy censoring in this scenario. It causes more variation for Cox results, leading to the unstable mean of bias values.

Table 3 The models' bias and RMSE in survival probability estimation in percentile 80% of real survival time

Real data analysis

The baseline characteristics of patients by their status in the last observation are summarized in Table 4. The pathology exam for the 672 (88%) patients resulted in adenocarcinoma; however, the prevalence did not differ significantly across groups. The total gastrectomy was the most prevalent procedure, with 403 (53%). However, the death event was distributed almost equally in all the surgery types (P-value = 0.3). As a predictable pattern, the metastasis cases were less frequent in survivors (17% of them survived compared to 83% of death cases). In addition, the survivors were almost categorized into the lower Stages.

Table 4 Descriptive statistics of patients by the last observation status

We used the Tabu search algorithm based on the BDE score function, considering the clinical justifications and validation results. The final DAG is presented in Fig. 2. For a better exploration, the survival and censor probability curves are depicted in Figs. 3 and 4. According to the DAG in Fig. 2, the baseline age is related to survival in the first- and second years post-operative. The corresponding Kaplan–Meier curve in Fig. 3 confirmed this finding, and the survival lines diverge in the initial years and continue to be parallel. On the other hand, there is an edge between metastatic status and \(N_{4}\) in DAG. The survival lines in Fig. 3 for metastasis and non-metastasis cases started to get away from each other at this time point.

Fig. 2
figure 2

Representation of DAG corresponds to survival analysis of real-world data (survival of patients after gastrectomy) surgery

Fig. 3
figure 3

Kaplan–Meier survival probability curves of patients according to their baseline characteristics and the results of the log-rank test to compare the curves

Fig. 4
figure 4

The censorship probability curve of patients according to their baseline characteristics and the results of the log-rank test to compare the curves

On the other hand, the directed edges from age to \(Q_{6}\) and \(Q_{7}\) in DAG correspond to flattening the censor probability after year 6 for more than 70 years old patients and a sharp decrease in censor probability less than 61 in year seven that leads to crossing another line (Fig. 4). The censor probability lines of adenocarcinoma and the other group separate from each other at years 3 and 5 in Fig. 4. It is coordinated to edges from the pathology node to \(Q_{3}\) and \(Q_{5}\). All the other edges from the covariates to \(Q_{t}\) nodes in DAG correspond to a specific pattern in Fig. 4 and are justifiable.

According to Table 5, the higher baseline age is the most critical factor for experiencing the death event early. The hazards for 61–70 and more than 70 years old patients are 1.77 (95% CI 1.40–2.24) and 3.99 (95% CI 3.09–5.14) times for those less than 61 years old. The edges from the age node to \(N_{1}\) and \(N_{2}\) in Fig. 2 assert that the effect of age is more notable in the initial times.

Table 5 The results of the univariable and multivariable Cox PH model

The stage and metastasis reflect two relatively same aspects of disease progression in surgery time. Therefore, there is some degree of correlation between these variables. That is why the stage was no longer significant in the multivariable model when we added metastasis. The metastasis cases had a 3.89 (95% CI 1.57–9.62) times higher hazard than the others. In contrast to the DBN, the Cox model does not ensure us about the relations between these variables. We present the conditional probability tables of the model covariates in the Additional file 1 for more clarification.

As the results of model validation, the mean posterior classification errors and their standard deviation for the whole learned network and \(N_{t}\), and \(Q_{t}\) nodes are represented in the Additional file 1. The expected loss for all the scenarios did not exceed the acceptable value of 0.04, which means all the networks predicted the state variables with more than 96% accuracy.

Discussion

We extended the classical idea of the Kaplan–Meier estimator and used the BN facilities to make a novel model for analyzing the survival data. The Bayesian network tools enable us to explore the different aspects of data in a previously impossible way. For instance, nonparametric survival methods like Kaplan–Meier were not adjusted to take covariates into account. On the other hand, the regression approaches only focus on the outcome variable and ignore the relations between covariates. The majority of the survival models were developed according to strict assumptions. In most applied cases, checking these assumptions is ignored or even hard to satisfy. Our model addressed the issues of the previous approaches and required the least possible assumptions.

Censorship which leads to incomplete observations, is an intrinsic property of survival data. Methods developed in this domain tried to manage this issue and incorporate the information of the censored observation as much as possible. Many researchers in the setting of machine learning ignore the censor observation and change the problem to explore the continuous-time outcome variable [5, 17, 28, 29]. In contrast, we consider a state for censorship that enables the model to examine how covariates affect this state. This property significantly increases the prediction power of the model. In real-world applications, administrators of data registries could manage situations to avoid preventable censorship.

The graphical aspect and conditional probability distributions of BN reflect much information in the simplest form. In comparison, other survival base algorithms in machine learning, like neural networks [18,19,20,21], support vector machines [22, 23], and ensemble models [24, 25], are as much sophisticated in outlining the patterns of data. On the other hand, alternative classical approaches like the Cox PH, frailty concept, and the other parametric regression models [1] involves the users in the intricate interpretations of their effect sizes.

The DAG of the BN model is the only mechanism for demonstrating the intra-relationship of covariates which is not available in the alternative approaches. For instance, interventions could be designed basis on the roots of the network or the parents of unchangeable nodes. In addition, this model feedback the correlation between variables, as we have seen in our example. The causal inference is one of the extensions of BNs, which we do not explore here [39]. Finally, the DBN forced the covariates to be discretized.

Using the semiparametric and parametric survival models should be cautionary. The Cox model's proportional hazard assumption violation leads to incorrect inferences or underestimating the hazard ratio [13, 40]. On the other hand, the parametric approach relies on the outcome distribution. These models assume a parametric distribution for the survival time, which is hard to satisfy, significantly when we have heavily censored data [41]. Finally, even the nonparametric approaches assume non-informative censorship [42]. Our BN has two primary assumptions. At first, the variables follow the Markov property, and then conditional multinomial or conditional binomial distribution is appropriate for discrete nodes. Both of these assumptions are logical in practice [43].

We conducted a wide range of scenarios in the simulation study. The DBN was superior to Kaplan–Meier in bias (RMSE) reduction in almost all of them. In addition, our results showed the comparability of Cox regression and DBN in this context. Our model was significantly superior to the Cox regression when the interest was exploring late survival times.

The Kaplan Meier biases were negative in all the scenarios. Hence, this method always estimates the survival probability as less than the actual value. Other simulation studies explain this issue [44, 45], and several suggest correction approaches. Stute and Wang proposed a Jackknife method to reduce the Kaplan Meier bias [46]. In another attempt, Jiang used the geometric mean of survival and censoring curves for bias correction [47].

In most scenarios that explore the lowest percentile with the lower sensor rate, the bias of Cox regression was estimated to be positive. In addition, the Cox biases were positive in the decreasing event rate and increasing censor rate for all the scenarios. Langner et al. showed that the maximum likelihood estimations of Cox are biased. They conducted a simulation study and concluded that there is a direct relationship between higher levels of event risk and seeing positive bias [48]. In concordance with their findings, we find that everywhere we expected to see the event, more than censoring the biases tend to the positive values.

The relation of covariates in our model is reasonably justifiable in the clinical aspect. Gender is the most known indicator of smoking across the population. According to the national representative survey, the age-standardized prevalence of current tobacco smoking among Iranian adults was 24.4% (95% CI 23.6–25.1) in males and 3.8% (95% CI 3.5–4.1) in females [49]. Therefore, it seems evident that the sex node is the parent of smoking.

We used the 7th version of the TNM Classification of Malignant Tumors (TNM) staging system for gastric cancer. The M parameter in TNM, representing distance metastases, is a critical prognostic for survival probability [50]. On the other hand, some studies described that the 7th TNM did not appropriately classify the biological behavior of cancer and the prognosis of patients [51]. In this manner, the 8th edition of the TNM staging with reforms to show relevant differences in stage III disease survival rates was released [52]. These arguments support our finding that TNM and stage nodes are affected by metastasis but are not the parents of any survival mechanism nodes.

Several studies on the Iranian population confirmed that a higher baseline age increases the hazard of death events in gastric patients who have undergone surgery. Interestingly, these studies did not mention a significant difference between males and females [53, 54].

Conclusion

Our proposed DBN could be used as a data mining technique in the context of survival data analysis. The feature selection ability of this model is comparable with the Cox PH model in both statistical and clinical aspects. In contrast to the Kaplan–Meier, our model can handle high-dimensional data and does not require the restrictive assumptions of regression approaches. The available machine learning algorithms are relatively sophisticated and rarely consider the censorship property of survival data. Whereas BN is a straightforward method, the DBN incorporates the information of censoring observations in inferences.

In this study, we introduced the simplest DBN model for survival analysis and compared its performance to the most used methods in the clinical field. This model could be adjusted for a specific situation like competing risk, time-variant covariates, and high dimensional data. In this manner, more specific simulation studies would be required.

Availability of data and materials

The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.

Abbreviations

TNM:

TNM Classification of Malignant Tumors

DAG:

Directed Acyclic Graph

BN:

Bayesian Network

DBN:

Dynamic Bayesian Network

2-TBN:

2-Slice Temporal Bayesian Network

PH:

Proportional Hazard

HC:

Hill-Climbing

References

  1. Klein JP, Moeschberger ML. Survival analysis: techniques for censored and truncated data, vol. 1230. Berlin: Springer; 2003.

    Book  Google Scholar 

  2. Gorjian N, Ma L, Mittinty M, Yarlagadda P, Sun Y. A review on degradation models in reliability analysis. In: Engineering asset lifecycle management. Berlin: Springer; 2010. p. 369–84.

    Chapter  Google Scholar 

  3. Zacks S. Introduction to reliability analysis: probability models and statistical methods. Berlin: Springer; 2012.

    Google Scholar 

  4. Lee D, Choi D. Analysis of the reliability of a starter-generator using a dynamic Bayesian network. Reliab Eng Syst Saf. 2020;195:106628.

    Article  Google Scholar 

  5. Boudali H, Dugan JB. A discrete-time Bayesian network reliability modeling and analysis framework. Reliab Eng Syst Saf. 2005;87(3):337–49.

    Article  Google Scholar 

  6. Štajduhar I, Dalbelo-Bašić B, Bogunović N. Impact of censoring on learning Bayesian networks in survival modelling. Artif Intell Med. 2009;47(3):199–217.

    Article  PubMed  Google Scholar 

  7. Campigotto F, Weller E. Impact of informative censoring on the Kaplan-Meier estimate of progression-free survival in phase II clinical trials. J Clin Oncol. 2014;32(27):3068.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Nemati M, Ansary J, Nemati N. Covid-19 machine learning based survival analysis and discharge time likelihood prediction using clinical data. Available at SSRN 3584518 2020.

  9. Wang P, Li Y, Reddy CK. Machine learning for survival analysis: a survey. ACM Comput Surv (CSUR). 2019;51(6):1–36.

    Article  Google Scholar 

  10. Díez-Sanmartín C, Sarasa Cabezuelo A. Application of artificial intelligence techniques to predict survival in kidney transplantation: a review. J Clin Med. 2020;9(2):572.

    Article  PubMed Central  Google Scholar 

  11. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53(282):457–81.

    Article  Google Scholar 

  12. Efron B. Logistic regression, survival analysis, and the Kaplan-Meier curve. J Am Stat Assoc. 1988;83(402):414–25.

    Article  Google Scholar 

  13. Hiller L, Marshall A, Dunn J. Assessing violations of the proportional hazards assumption in Cox regression: does the chosen method matter? Trials. 2015;16(2):1–1.

    Google Scholar 

  14. Gordon L, Olshen RA. Tree-structured survival analysis. Cancer Treat Rep. 1985;69(10):1065–9.

    CAS  PubMed  Google Scholar 

  15. Utkin LV, Konstantinov AV, Chukanov VS, Kots MV, Ryabinin MA, Meldo AA. A weighted random survival forest. Knowl-Based Syst. 2019;177:136–44.

    Article  Google Scholar 

  16. Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.

    Article  PubMed  Google Scholar 

  17. Fard MJ, Wang P, Chawla S, Reddy CK. A bayesian perspective on early stage event prediction in longitudinal data. IEEE Trans Knowl Data Eng. 2016;28(12):3126–39.

    Article  Google Scholar 

  18. Faraggi D, Simon R. A neural network model for survival data. Stat Med. 1995;14(1):73–82.

    Article  CAS  PubMed  Google Scholar 

  19. Lee C, Zame WR, Yoon J, van der Schaar M. Deephit: A deep learning approach to survival analysis with competing risks. In: Thirty-second AAAI conference on artificial intelligence; 2018.

  20. Ren K, Qin J, Zheng L, Yang Z, Zhang W, Qiu L, Yu Y. Deep recurrent survival analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2019. p. 4798–4805.

  21. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol. 2018;18(1):1–12.

    Article  Google Scholar 

  22. Shivaswamy PK, Chu W, Jansche M. A support vector approach to censored targets. In: Seventh IEEE international conference on data mining (ICDM 2007); 2007. IEEE. p. 655–660.

  23. Van Belle V, Pelckmans K, Van Huffel S, Suykens JA. Support vector methods for survival analysis: a comparison between ranking and regression approaches. Artif Intell Med. 2011;53(2):107–18.

    Article  PubMed  Google Scholar 

  24. Sarkar JP, Saha I, Sarkar A, Maulik U. Machine learning integrated ensemble of feature selection methods followed by survival analysis for predicting breast cancer subtype specific miRNA biomarkers. Comput Biol Med. 2021;131:104244.

    Article  CAS  PubMed  Google Scholar 

  25. Tanner KT, Sharples LD, Daniel RM, Keogh RH. Dynamic survival prediction combining landmarking with a machine learning ensemble: methodology and empirical comparison. J R Stat Soc A Stat Soc. 2021;184(1):3–30.

    Article  Google Scholar 

  26. Ren Y, Zhang K, Shi Y. A Survival Certification Model Based on Active Learning over Medical Insurance Data. In: Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data. Springer; 2019. p. 156–170.

  27. Vinzamuri B, Li Y, Reddy CK. Active learning based survival regression for censored data. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management; 2014. p. 241–250.

  28. Zhang W, Tang J, Wang N. Using the machine learning approach to predict patient survival from high-dimensional survival data. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2016. IEEE. p. 1234–1238.

  29. García-Laencina PJ, Abreu PH, Abreu MH, Afonoso N. Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values. Comput Biol Med. 2015;59:125–33.

    Article  PubMed  Google Scholar 

  30. Štajduhar I, Dalbelo-Bašić B. Learning Bayesian networks from survival data using weighting censored instances. J Biomed Inform. 2010;43(4):613–22.

    Article  PubMed  Google Scholar 

  31. Bishop CM. Pattern recognition. Mach Learn. 2006;128(9):5568.

    Google Scholar 

  32. Donat R, Leray P, Bouillaut L, Aknin P. A dynamic Bayesian network to represent discrete duration models. Neurocomputing. 2010;73(4–6):570–7.

    Article  Google Scholar 

  33. Murphy KP. Dynamic bayesian networks: representation, inference and learning. Berkeley: University of California; 2002.

    Google Scholar 

  34. Yuan C, Druzdzel MJ. Importance sampling algorithms for Bayesian networks: Principles and performance. Math Comput Model. 2006;43(9–10):1189–207.

    Article  Google Scholar 

  35. Russell S, Norvig P. Artificial intelligence: a modern approach. Science. 2002;5:889.

    Google Scholar 

  36. Kim J-H. Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap. Comput Stat Data Anal. 2009;53(11):3735–45.

    Article  Google Scholar 

  37. Marcot BG. Metrics for evaluating performance and uncertainty of Bayesian network models. Ecol Model. 2012;230:50–62.

    Article  Google Scholar 

  38. Koller D, Friedman N. Probabilistic graphical models: principles and techniques. London: MIT Press; 2009.

    Google Scholar 

  39. Pearl J. From Bayesian networks to causal networks. In: Mathematical models for handling partial knowledge in artificial intelligence. Berlin: Springer; 1995. p. 157–82.

    Chapter  Google Scholar 

  40. Ramezani Tehrani F, Sheidaei A, Firouzi F, Tohidi M, Azizi F, Behboudi-Gandevani S. Does anti mullerian hormone decline rate improve the predication of age at menopause? Front Endocrinol. 2021;5:1104.

    Google Scholar 

  41. Kleinbaum DG, Klein M. Parametric survival models. In: Survival analysis. Berlin: Springer; 2012. p. 289–361.

    Chapter  Google Scholar 

  42. Siannis F, Copas J, Lu G. Sensitivity analysis for informative censoring in parametric survival models. Biostatistics. 2005;6(1):77–91.

    Article  PubMed  Google Scholar 

  43. Korb KB, Nicholson AE. Bayesian artificial intelligence. London: CRC Press; 2010.

    Book  Google Scholar 

  44. Breslow N, Crowley J. A large sample study of the life table and product limit estimates under random censorship. Ann Stat. 1974;2:437–53.

    Article  Google Scholar 

  45. Wellner JA. A heavy censoring limit theorem for the product limit estimator. Ann Stat. 1985;6:150–62.

    Google Scholar 

  46. Stute W, Wang J-L. The jackknife estimate of a Kaplan—Meier integral. Biometrika. 1994;81(3):602–6.

    Google Scholar 

  47. Jiang R. A bias-corrected Kaplan-Meier estimator. In: 2020 Asia-Pacific International Symposium on Advanced Reliability and Maintenance Modeling (APARM): 2020. IEEE. p. 1–6.

  48. Langner I, Bender R, Lenz-Tönjes R, Küchenhoff H, Blettner M. Bias of maximum-likelihood estimates in logistic and Cox regression models: a comparative simulation study. In.: Discussion Paper; 2003.

  49. Varmaghani M, Sharifi F, Mehdipour P, Sheidaei A, Djalalinia S, Gohari K, Modirian M, Pazhuheian F, Peykari N, Haghshenas R. Prevalence of smoking among Iranian adults: findings of the national STEPs survey 2016. Arch Iran Med. 2020;23(6):369–77.

    Article  PubMed  Google Scholar 

  50. Kim J-P, Lee J-H, Kim S-J, Yu H-J, Yang H-K. Clinicopathologic characteristics and prognostic factors in 10 783 patients with gastric cancer. Gastric Cancer. 1998;1(2):125–33.

    Article  PubMed  Google Scholar 

  51. Kikuchi S, Futawatari N, Sakuramoto S, Katada N, Yamashita K, Shibata T, Nemoto M, Watanabe M. Comparison of staging between the old and new TNM classifications in advanced gastric cancer. Anticancer Res. 2011;31(6):2361–5.

    PubMed  Google Scholar 

  52. Marano L, D’Ignazio A, Cammillini F, Angotti R, Messina M, Marrelli D, Roviello F. Comparison between 7th and 8th edition of AJCC TNM staging system for gastric cancer: old problems and new perspectives. Transl Gastroenterol Hepatol. 2019;4:5558.

    Article  Google Scholar 

  53. Jafarzadeh Kohneloo A, Yaseri M, Rahimi Foroushani A, Zeraati H. Post-surgery survival in patients with adenocarcinoma of stomach using multistate model. J Gastrointestinal Cancer. 2021;2:1–7.

    Google Scholar 

  54. Ali Z, Mahmoodi M, Mohammad K, Zeraati H, Hosseini M, Naieni KH. Factors affecting the survival of patients with gastric cancer undergone surgery at iran cancer institute: univariate and multivariate analyses. Iran J Public Health. 2014;43(6):800.

    Google Scholar 

Download references

Acknowledgements

This study is a part of the research process supported by the Tehran University of Medical Science to achieve a Ph.D. degree.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the study's conception and design. HZ and AS performed analysis. AS wrote the first draft of the manuscript, and all authors commented on previous versions. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hojjat Zeraati.

Ethics declarations

Ethics approval and consent to participate

All methods were carried out following relevant guidelines and regulations. This study was found to be under the ethical principles and the national norms and standards for conducting medical research in Iran by the research ethics committee of the school of public health & allied medical sciences – Tehran university of medical sciences. The approval ID is IR.TUMS.SPH.REC.13993061. All participants provided written informed consent that their data collected as part of the study can be used in research. All the patients were followed until the event or when they preferred to stop participation.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary file 1.

The validation of structure learning. The posterior classification error of HC and Tabu algorithms for all nodes according to different score functions. Supplementary file 2. Conditional probability distribution of time stationary variables in the model. Supplementary figure 1. Conditional probability distribution of node stage given the different levels of its parents (Metastasis and TNM). Supplementary figure 2. Conditional probability distribution of node TNM given the different levels of its parent (Metastasis). Supplementary figure 3. Conditional probability distribution of node metastasis given the different levels of its parent (Smoking). Supplementary figure 4. Conditional probability distribution of node pathology given the different levels of its parent (Sex). Supplementary figure 5. Conditional probability distribution of node smoking given the different levels of its parent (Sex). Supplementary figure 6. Conditional probability distribution of node surgery given the different levels of its parent (Site).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sheidaei, A., Foroushani, A.R., Gohari, K. et al. A novel dynamic Bayesian network approach for data mining and survival data analysis. BMC Med Inform Decis Mak 22, 251 (2022). https://doi.org/10.1186/s12911-022-02000-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12911-022-02000-7

Keywords