 Research
 Open access
 Published:
A novel dynamic Bayesian network approach for data mining and survival data analysis
BMC Medical Informatics and Decision Making volumeÂ 22, ArticleÂ number:Â 251 (2022)
Abstract
Background
Censorship is the primary challenge in survival modeling, especially in human health studies. The classical methods have been limited by applications like Kaplanâ€“Meier or restricted assumptions like the Cox regression model. On the other hand, Machine learning algorithms commonly rely on the high dimensionality of data and ignore the censorship attribute. In addition, these algorithms are more sophisticated to understand and utilize. We propose a novel approach based on the Bayesian network to address these issues.
Methods
We proposed a twoslice temporal Bayesian network model for the survival data, introducing the survival and censorship status in each observed time as the dynamic states. A scorebased algorithm learned the structure of the directed acyclic graph. The likelihood approach conducted parameter learning. We conducted a simulation study to assess the performance of our model in comparison with the Kaplanâ€“Meier and Cox proportional hazard regression. We defined various scenarios according to the sample size, censoring rate, and shapes of survival and censoring distributions across time. Finally, we fit the model on a realworld dataset that includes 760 post gastrectomy surgery due to gastric cancer. The validation of the model was explored using the holdout technique based on the posterior classification error. Our survival model performance results were compared using the Kaplanâ€“Meier and Cox proportional hazard models.
Results
The simulation study shows the superiority of DBN in bias reduction for many scenarios compared with Cox regression and Kaplanâ€“Meier, especially in the late survival times. In the realworld data, the structure of the dynamic Bayesian network model satisfied the finding from Kaplanâ€“Meier and Cox regression classical approaches. The posterior classification error found from the validation technique did not exceed 0.04, representing that our network predicted the state variables with more than 96% accuracy.
Conclusions
Our proposed dynamic Bayesian network model could be used as a data mining technique in the context of survival data analysis. The advantages of this approach are feature selection ability, straightforward interpretation, handling of highdimensional data, and few assumptions.
Background
Survival analysis is the formal name for an essential branch of statistics. It aims to study the time of the specific event and explore the particular circumstances or characteristics that influence it [1]. We know these techniques as reliability analysis in engineering and industries. The usual interest is the lifetime of devices, products, and machines in these contexts [2,3,4]. The intrinsic features of subjects in these domains enable the investigators to control all the processes and fully observe the data [3, 5].
Further, when the subjects are humans, the survival models confront the crucial challenge of incomplete data and the high dimension of confounders more than the other context. The incomplete data known as censoring is expected because sample withdrawal or loss of followup exceeds the event time of the study period [6, 7]. Despite all efforts to evolve the survival models, some challenges in classical methods show the necessity of developing new approaches in practice [8,9,10].
The Kaplanâ€“Meier method, also known as productlimit, is one of the primary and still popular methods in analyzing survival data. This method estimated the survival probability function of data in a straightforward and easytounderstand manner. In addition, the Kaplanâ€“Meier is a nonparametric approach, and a few assumptions are required to apply it to data [1, 11]. However, the simplicity of Kaplanâ€“Meier restricts its applications. For instance, it cannot account for multiple factors or control for confounding factors. Therefore, in the case of feature selection and group comparisons, we need to apply additional analyses such as the low power logrank test. In this procedure, insufficient sample size and increasing the number of features lead to inaccurate inferences [12]. Another alternative is regression approaches like the Cox proportional hazard model. Using these models is limited by restrictive assumptions like the proportionality of hazards, independence of censoring and survival distribution, and exponential relation between hazard and covariates [13].
On the other hand, developing the application of machine learning algorithms in recent decades has reformed many classical approaches to data analysis. The survival analysis is one of the domains that changed significantly [2, 9, 10]. In this manner, the applications of random forests [14, 15], Bayesian methods [5, 16, 17], neural networks [18,19,20,21], support vector machines [22, 23], ensemble learning [24, 25], and active learning [26, 27] algorithms were introduced in survival analysis. These changes enable us to resolve issues with a new practice even though the general idea is similar to classical approaches.
As mentioned, incompletely observed data or censorship is the primary challenge in survival modeling, especially in health. Considerable studies that introduced the application of machine learning algorithms in survival analysis did not engage the censorship because of the study subject's type [5]. Some other studies relied on high sample size and ignored the censored observations or imputed them by modeling approaches [28, 29]. Finally, a few studies directly address censorship using the methods like weighting the censored observation [30]. However, there is a gap between classical approaches and novel machine learning techniques that use intelligent algorithms to extract data patterns. The classic methods were developed to handle small to mediumdimension data and find a general overview. In contrast, the machine learning algorithm and data mining techniques aim to handle highdimensional data. These methods focus on the prediction with maximum accuracy.
This study proposes a novel Dynamic Bayesian Network (DBN) model for data mining in the context of survival data analysis. The Bayesian Network (BN) has a series of powerful tools that could facilitate survival analysis. Actually, the BN combines probability theory and graphical models [31]. Consequently, it enabled us to capture the uncertainty of stochastic survival events and represent a graphical structure of probability distribution. In addition, our model uses the Kaplanâ€“Meier idea to consider the censoring phenomena and the various capability of BN models to add extra tools for more precise inferences simultaneously. The structure learning algorithm of the BN ables us to compare the groups and find the significant features. In addition, parameter learning algorithms lead to more precise inferences, estimations, and predictions. In this study, we present our DBN model for survival analysis, evaluate its performance using a simulation study, and finally use a realworld data set to show the way analysis could be performed using that.
Methods
Product limit estimators
The primary objective of survival analysis is to explore the time until a particular event. Hence, we describe the stochastic behavior of an outcome variable in time type. We usually use survival, density, hazard, and the mean or median residual life functions in this regard. As these functions are attainable from each other, there is no priority except for better interpretability in choosing one. The product limit estimators are the estimates of survival function, which is defined as the probability of an individual surviving after a given time point t:
T is a random variable that denotes when the event of interest occurs. Kaplan and Meier partitioned observed times into intervals according to unique event times and proposed the following estimator for all t values in the range of observed data when \(t_{1}\) represent the first event time [1]:
where \(d_{i}\) and \(Y_{i}\) represent the number of failures and atrisk persons in each interval, respectively. Therefore, the productlimit estimator is a discrete approach that leads to a step function that only changes at event times. The Greenwood formula is the wellknown approach to estimating the variance of the estimators [1]:
Bayesian network
Every Bayesian Networks (BNs) correspond to a Directed Acyclic Graph (DAG) and a joint distribution, which are the graphical and probabilistic aspects of the model. DAG consists of nodes corresponding to random variables and edges that present conditional probabilities. According to the domain of the random variables, the BN could be discrete, Gaussian, or hybrid. Our BN in this study is a discrete one. Therefore, for a set of discrete random variables \({\varvec{X}} = \left( {X_{1} ,X_{2} , \ldots ,X_{D} } \right){ }\) Taking their values in the discrete and finite Ddimensional domain. The BN is defined as pair \({\mathcal{M}} = \left( {{{\mathcal{g}}},\left( {P\left( {X_{d} {\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d} } \right)} \right)} \right)_{1 \le d \le D} } \right)\) where \({{\mathcal{g}}} = \left( {{\varvec{X}},{\varvec{\varepsilon}}} \right)\) is a DAG presentation of random variables \({\varvec{X}}\) with edges set \({\varvec{\varepsilon}}\), \({\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d} } \right)\) is the set of parents \(X_{d}\) in \({\varvec{X}}\), and \(\left( {P\left( {X_{d} {\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d} } \right)} \right)} \right)_{1 \le d \le D}\) is the conditional probability of node \(X_{d}\) given their parents in the set \({\varvec{X}}\).
The appealing feature of BN is to summarize the complex joint probability distribution \({\varvec{X}}\) in the following parsimonious way using the conditional independence and Markov chain properties:
Dynamic Bayesian network
The classical BN is not adopted to address timedependent processes like survival analysis [32]. Therefore, Dynamic Bayesian Network (DBN) [33] was introduced to extend this process. In this context, timedependent random variables \(\left( {{\varvec{X}}_{t} } \right)_{t \ge 1} = \left( {X_{1,t} , \ldots ,X_{D,t} } \right)_{t \ge 1}\) are defined where t is a discrete index time formally called slice. DBN uses Markov property which indicates the future of a stochastic process is independent of its past, given current status or several lags before it. The number of lags determines the order of the Markov process. This study only needs to use the Markov process of order 1, which leads to a 2slice Temporal Bayesian Network (2TBN). In this regard, we assume \(X_{t  1} \bot X_{t + 1} X_{t}\) for all \(t \ge 2\). A 2TBN could be defined as a pair of 2 BNs \(\left( {{\mathcal{M}}_{1} ,{\mathcal{M}}_{ \to } } \right)\) where \({\mathcal{M}}_{1}\) is the joint distribution of the initial process \({\varvec{X}}_{1} = \left( {X_{1,1} , \ldots ,X_{D,1} } \right)\) and \({\mathcal{M}}_{ \to }\) represent the transition model. The joint probability distribution \({\mathcal{M}}_{1}\) easily derived from BN approach in Eq.Â 3:
In the transition model, the joint distribution of \({\varvec{X}}_{t}\) only depends on random variables belonging to the set of parents \({\varvec{X}}_{t}\) at slice \(t  1\) in the form:
Hence the probability distribution of 2TBN is calculated by the combination of Eqs.Â 4,5:
In order to consider the time stationary covariates \({\varvec{Z}} = \left( {Z_{1} , \ldots ,Z_{q} } \right)\) in the model, we could extend the parent sets in both initial processes \({\mathcal{M}}_{1}\) and transition model \({\mathcal{M}}_{ \to }\). In this manner, the initial conditional probability could be presented as:
where the \({\mathbf{\mathcal{P}}}_{{\varvec{X}}} \left( {X_{d,1} } \right)\) and \({\mathbf{\mathcal{P}}}_{{\varvec{Z}}} \left( {X_{d,1} } \right)\) represent the sets of parents \(X_{d,1}\) in \({\varvec{X}}\) and \({\varvec{Z}},\) respectively. On the other hand, the modified transition probability distribution is reformed to:
Dynamic Bayesian network interpretation of product limit estimators
A DBN of type 2TBN could efficiently conduct the calculation process of productlimit estimators. The productlimit approach considers the time as discrete intervals between consecutive observed failure times and counts the individuals at risk and failures. Equivalently, we define discrete intervals of Kaplanâ€“Meier as slices and two timedependent binary status variables. The survival state variables \(N_{i,t}\) is equal to 1 if individual \(i\) survives at least to slice t and state variables \(Q_{i,t}\) is equal to 1 if the individual \(i\) censored before or at slice t. Defining these variables enables us to form the DBN in Fig.Â 1 to analyze survival data. In addition, we can enter other fixed effect covariates in \({\varvec{Z}}\) to the model and examine their importance by the structure learning algorithms.
The DBN estimators of survival probability at each time t according to Eq.Â 6 are equal to:
Moreover, the discrete covariates set \({\mathbf{\mathcal{P}}}_{{\varvec{Z}}} \left( {N_{t} } \right)\) were found by structure learning algorithms.
For simplicity, we consider there are no covariates in the model, so Eq.Â (7) could be modified by counter function N:
That is the same as productlimit estimators in Eq.Â (1). Therefore, we are able to use the Greenwood formula in Eq.Â (2) to calculate the variance of survival estimations. In addition, the common bootstrapping approaches in the BN context, like likelihood weighting and logic sampling, are the alternative approach in this way[34].
Simulation study
We conducted a simulation study to assess the performance of our model in comparison with the Kaplanâ€“Meier and Cox proportional hazard regression. We defined various scenarios according to the sample size (Nâ€‰=â€‰800, 5000, 10,000), censoring rate (Râ€‰=â€‰25%, 40%, 60%), and shape parameters of survival (\(\alpha_{S}\)) and censoring (\(\alpha_{C}\)) distributions. We considered five covariates distributed as mutually independent binomial distributions with different probability parameters \(\left[ {X_{i} \sim B\left( {N,P_{i} } \right), i = 1, \ldots ,5 P_{i} = 0.1, 0.2, 0.5, 0.7, 0.9} \right]\).
The survival and censoring times were generated using Weibull distributions. The scale parameter of survival time distribution was reparametrized according to the summation of the covariates \(\left[ {\theta_{S} = \mathop \sum \limits_{i = 1}^{5} X_{i} } \right]\). Using the numerical methods and assuming the fixed value for censorship, we found the scale parameter of censoring time distribution. We set the shape parameter of survival and censoring distributions as the values 0.5 (decreasing event/censor rate), 1 (constant event/censor rate), and 2 (increasing event/censor rate). In this manner, we achieved nine different scenarios for the shape of survival/censoring times.
In brief, if the survival time \(S \ge 0\) follows the below Weibull distribution:
where scale parameter \(\theta_{S}\) and shape parameter \(\alpha_{S}\) were defined before, the cumulative distribution function of S is:
As the Weibull is a continuous random distribution \(F\left( {s;\alpha_{S} ,\theta_{S} } \right)\sim U\left( {0,1} \right)\). Therefore we generated N samples \(u_{i} \sim U\left( {0,1} \right)\) for \(i = 1, \ldots ,N\) and then compute the \(s_{i} = \theta_{S} \sqrt[{{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 {\alpha_{S} }}}\right.\kern\nulldelimiterspace} \!\lower0.7ex\hbox{${\alpha_{S} }$}}}]{{  {\text{ln}}\left( {u_{i} } \right)}}\). A similar approach was used to generate the censored times \(c_{i} s\).
We fitted all three models to the simulated data and estimated survival probability at 20%, 50% (median), and 80% percentiles of actual survival times. The bias and its Root Mean Squared Errors (RMSE) were calculated in 1000 randomly generated samples in each scenario.
Real data analysis
We applied our proposed method to realworld survival data. The 760 patients diagnosed with gastric cancer at the Iran cancer Institute and who had undergone gastrectomy from 1995 to 2012 entered the study. This historical and artificial cohort of patients was followed until observing death events. The censorship was considered in case of loss to followup or being alive at the end of observation time. All the variables except followup duration were time stationery and were collected at the surgery time.
We conducted the ordinary Kaplanâ€“Meier survival analysis, DBN, and Cox proportional hazard (PH) model. The Cox PH model was added to compare our findings with a regression model that could handle the covariate effect in survival analysis. In addition, the censor probability plots were generated using the Kaplanâ€“Meier approach. For this purpose, we defined censorship as the primary event and death as the alternative status.
Structure learning
We used HillClimbing (HC) and Tabu search, two of the most popular scorebased learning algorithms, to find the structure of DAG [35]. The last event occurred 15Â years after surgery, and there was no event in year 13 postsurgery. Therefore 14 eligible slices \(t\) correspond to the unique event year were defined in the DBN structure. The prior and transition networks in Fig.Â 1 related to \(N_{i,t}\) and \(Q_{i,t}\) were considered as the white list, and the algorithms learned the structure of stationary variables Z. The validation of structure learning was conducted by the ten repeated holdout technique using ten subsamples in size 30% of the original data [36]. The posterior classification error based on likelihood weighting was set as the loss function [37]. The validation process was repeated according to different score functions, including logarithm of likelihood, AIC, BIC, BDE, BDS, and \(K^{2}\) [38].
Results
Simulation study
Table 1 represents the bias and related RMSE of estimated survival probability by models in different scenarios. For instance, the first value of âˆ’Â 0.0033 (0.0046) is observed for bias (RMSE) for a sample size of 800 and censorship of 25%. It shows that the Kaplanâ€“Meier approach estimated the survival probability by 0.0033 less than the actual value when the random samples come from the survival and censoring distributions with increasing rates over time. Cox and DBN estimations in this scenario are 0.004 and âˆ’Â 0.0008 differ from the actual probability, respectively. Therefore, the minimum absolute bias value is related to the DBN model.
The Cox regression model shows superiority according to the minimum bias for most scenarios in Table 1. On the other hand, the DBN estimates actual survival probability better than the Cox approach in 52% of simulation scenarios which assumes an increasing censor rate across time. Increasing the censoring rate causes higher observed bias (RMSE) for all the models. The constant censor rate across time (alpha câ€‰=â€‰1), which corresponds to noninformative censoring, shows lower levels of bias (RMSE) in all the scenarios.
Similar results for exploring median survival time are presented in Table 2. The number of scenarios in which the DBN is superior to other models due to bias (RMSE) reduction is relatively higher than in Table 1. In addition, Cox is not significantly better than DBN in all situations, and in many cases, its absolute bias is less than 0.001 of DBN.
The results of exploring the percentile of 80% of survival time in Table 3 reveal that the DBN model is superior in bias (RMSE) reduction in all the scenarios except one. When the data comes from survival distribution with shape parameter 0.5 and the censoring rate increases across time (alpha câ€‰=â€‰2), for the sample sizes of 5000 with 60% censoring, the bias (RMSE) of Cox regression is âˆ’Â 0.1373 (0.1345). DBN's bias (RMSE) for this scenario is âˆ’Â 0.1375 (0.138). We should consider the increasing RMSE according to heavy censoring in this scenario. It causes more variation for Cox results, leading to the unstable mean of bias values.
Real data analysis
The baseline characteristics of patients by their status in the last observation are summarized in Table 4. The pathology exam for the 672 (88%) patients resulted in adenocarcinoma; however, the prevalence did not differ significantly across groups. The total gastrectomy was the most prevalent procedure, with 403 (53%). However, the death event was distributed almost equally in all the surgery types (Pvalueâ€‰=â€‰0.3). As a predictable pattern, the metastasis cases were less frequent in survivors (17% of them survived compared to 83% of death cases). In addition, the survivors were almost categorized into the lower Stages.
We used the Tabu search algorithm based on the BDE score function, considering the clinical justifications and validation results. The final DAG is presented in Fig.Â 2. For a better exploration, the survival and censor probability curves are depicted in Figs.Â 3 and 4. According to the DAG in Fig.Â 2, the baseline age is related to survival in the first and second years postoperative. The corresponding Kaplanâ€“Meier curve in Fig.Â 3 confirmed this finding, and the survival lines diverge in the initial years and continue to be parallel. On the other hand, there is an edge between metastatic status and \(N_{4}\) in DAG. The survival lines in Fig.Â 3 for metastasis and nonmetastasis cases started to get away from each other at this time point.
On the other hand, the directed edges from age to \(Q_{6}\) and \(Q_{7}\) in DAG correspond to flattening the censor probability after year 6 for more than 70Â years old patients and a sharp decrease in censor probability less than 61 in year seven that leads to crossing another line (Fig.Â 4). The censor probability lines of adenocarcinoma and the other group separate from each other at years 3 and 5 in Fig.Â 4. It is coordinated to edges from the pathology node to \(Q_{3}\) and \(Q_{5}\). All the other edges from the covariates to \(Q_{t}\) nodes in DAG correspond to a specific pattern in Fig.Â 4 and are justifiable.
According to Table 5, the higher baseline age is the most critical factor for experiencing the death event early. The hazards for 61â€“70 and more than 70Â years old patients are 1.77 (95% CI 1.40â€“2.24) and 3.99 (95% CI 3.09â€“5.14) times for those less than 61Â years old. The edges from the age node to \(N_{1}\) and \(N_{2}\) in Fig.Â 2 assert that the effect of age is more notable in the initial times.
The stage and metastasis reflect two relatively same aspects of disease progression in surgery time. Therefore, there is some degree of correlation between these variables. That is why the stage was no longer significant in the multivariable model when we added metastasis. The metastasis cases had a 3.89 (95% CI 1.57â€“9.62) times higher hazard than the others. In contrast to the DBN, the Cox model does not ensure us about the relations between these variables. We present the conditional probability tables of the model covariates in the Additional file 1 for more clarification.
As the results of model validation, the mean posterior classification errors and their standard deviation for the whole learned network and \(N_{t}\), and \(Q_{t}\) nodes are represented in the Additional file 1. The expected loss for all the scenarios did not exceed the acceptable value of 0.04, which means all the networks predicted the state variables with more than 96% accuracy.
Discussion
We extended the classical idea of the Kaplanâ€“Meier estimator and used the BN facilities to make a novel model for analyzing the survival data. The Bayesian network tools enable us to explore the different aspects of data in a previously impossible way. For instance, nonparametric survival methods like Kaplanâ€“Meier were not adjusted to take covariates into account. On the other hand, the regression approaches only focus on the outcome variable and ignore the relations between covariates. The majority of the survival models were developed according to strict assumptions. In most applied cases, checking these assumptions is ignored or even hard to satisfy. Our model addressed the issues of the previous approaches and required the least possible assumptions.
Censorship which leads to incomplete observations, is an intrinsic property of survival data. Methods developed in this domain tried to manage this issue and incorporate the information of the censored observation as much as possible. Many researchers in the setting of machine learning ignore the censor observation and change the problem to explore the continuoustime outcome variable [5, 17, 28, 29]. In contrast, we consider a state for censorship that enables the model to examine how covariates affect this state. This property significantly increases the prediction power of the model. In realworld applications, administrators of data registries could manage situations to avoid preventable censorship.
The graphical aspect and conditional probability distributions of BN reflect much information in the simplest form. In comparison, other survival base algorithms in machine learning, like neural networks [18,19,20,21], support vector machines [22, 23], and ensemble models [24, 25], are as much sophisticated in outlining the patterns of data. On the other hand, alternative classical approaches like the Cox PH, frailty concept, and the other parametric regression models [1] involves the users in the intricate interpretations of their effect sizes.
The DAG of the BN model is the only mechanism for demonstrating the intrarelationship of covariates which is not available in the alternative approaches. For instance, interventions could be designed basis on the roots of the network or the parents of unchangeable nodes. In addition, this model feedback the correlation between variables, as we have seen in our example. The causal inference is one of the extensions of BNs, which we do not explore here [39]. Finally, the DBN forced the covariates to be discretized.
Using the semiparametric and parametric survival models should be cautionary. The Cox model's proportional hazard assumption violation leads to incorrect inferences or underestimating the hazard ratio [13, 40]. On the other hand, the parametric approach relies on the outcome distribution. These models assume a parametric distribution for the survival time, which is hard to satisfy, significantly when we have heavily censored data [41]. Finally, even the nonparametric approaches assume noninformative censorship [42]. Our BN has two primary assumptions. At first, the variables follow the Markov property, and then conditional multinomial or conditional binomial distribution is appropriate for discrete nodes. Both of these assumptions are logical in practice [43].
We conducted a wide range of scenarios in the simulation study. The DBN was superior to Kaplanâ€“Meier in bias (RMSE) reduction in almost all of them. In addition, our results showed the comparability of Cox regression and DBN in this context. Our model was significantly superior to the Cox regression when the interest was exploring late survival times.
The Kaplan Meier biases were negative in all the scenarios. Hence, this method always estimates the survival probability as less than the actual value. Other simulation studies explain this issue [44, 45], and several suggest correction approaches. Stute and Wang proposed a Jackknife method to reduce the Kaplan Meier bias [46]. In another attempt, Jiang used the geometric mean of survival and censoring curves for bias correction [47].
In most scenarios that explore the lowest percentile with the lower sensor rate, the bias of Cox regression was estimated to be positive. In addition, the Cox biases were positive in the decreasing event rate and increasing censor rate for all the scenarios. Langner et al. showed that the maximum likelihood estimations of Cox are biased. They conducted a simulation study and concluded that there is a direct relationship between higher levels of event risk and seeing positive bias [48]. In concordance with their findings, we find that everywhere we expected to see the event, more than censoring the biases tend to the positive values.
The relation of covariates in our model is reasonably justifiable in the clinical aspect. Gender is the most known indicator of smoking across the population. According to the national representative survey, the agestandardized prevalence of current tobacco smoking among Iranian adults was 24.4% (95% CI 23.6â€“25.1) in males and 3.8% (95% CI 3.5â€“4.1) in females [49]. Therefore, it seems evident that the sex node is the parent of smoking.
We used the 7th version of the TNM Classification of Malignant Tumors (TNM) staging system for gastric cancer. The M parameter in TNM, representing distance metastases, is a critical prognostic for survival probability [50]. On the other hand, some studies described that the 7th TNM did not appropriately classify the biological behavior of cancer and the prognosis of patients [51]. In this manner, the 8th edition of the TNM staging with reforms to show relevant differences in stage III disease survival rates was released [52]. These arguments support our finding that TNM and stage nodes are affected by metastasis but are not the parents of any survival mechanism nodes.
Several studies on the Iranian population confirmed that a higher baseline age increases the hazard of death events in gastric patients who have undergone surgery. Interestingly, these studies did not mention a significant difference between males and females [53, 54].
Conclusion
Our proposed DBN could be used as a data mining technique in the context of survival data analysis. The feature selection ability of this model is comparable with the Cox PH model in both statistical and clinical aspects. In contrast to the Kaplanâ€“Meier, our model can handle highdimensional data and does not require the restrictive assumptions of regression approaches. The available machine learning algorithms are relatively sophisticated and rarely consider the censorship property of survival data. Whereas BN is a straightforward method, the DBN incorporates the information of censoring observations in inferences.
In this study, we introduced the simplest DBN model for survival analysis and compared its performance to the most used methods in the clinical field. This model could be adjusted for a specific situation like competing risk, timevariant covariates, and high dimensional data. In this manner, more specific simulation studies would be required.
Availability of data and materials
The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.
Abbreviations
 TNM:

TNM Classification of Malignant Tumors
 DAG:

Directed Acyclic Graph
 BN:

Bayesian Network
 DBN:

Dynamic Bayesian Network
 2TBN:

2Slice Temporal Bayesian Network
 PH:

Proportional Hazard
 HC:

HillClimbing
References
Klein JP, Moeschberger ML. Survival analysis: techniques for censored and truncated data, vol. 1230. Berlin: Springer; 2003.
Gorjian N, Ma L, Mittinty M, Yarlagadda P, Sun Y. A review on degradation models in reliability analysis. In: Engineering asset lifecycle management. Berlin: Springer; 2010. p. 369â€“84.
Zacks S. Introduction to reliability analysis: probability models and statistical methods. Berlin: Springer; 2012.
Lee D, Choi D. Analysis of the reliability of a startergenerator using a dynamic Bayesian network. Reliab Eng Syst Saf. 2020;195:106628.
Boudali H, Dugan JB. A discretetime Bayesian network reliability modeling and analysis framework. Reliab Eng Syst Saf. 2005;87(3):337â€“49.
Å tajduhar I, DalbeloBaÅ¡iÄ‡ B, BogunoviÄ‡ N. Impact of censoring on learning Bayesian networks in survival modelling. Artif Intell Med. 2009;47(3):199â€“217.
Campigotto F, Weller E. Impact of informative censoring on the KaplanMeier estimate of progressionfree survival in phase II clinical trials. J Clin Oncol. 2014;32(27):3068.
Nemati M, Ansary J, Nemati N. Covid19 machine learning based survival analysis and discharge time likelihood prediction using clinical data. Available at SSRN 3584518 2020.
Wang P, Li Y, Reddy CK. Machine learning for survival analysis: a survey. ACM Comput Surv (CSUR). 2019;51(6):1â€“36.
DÃezSanmartÃn C, Sarasa Cabezuelo A. Application of artificial intelligence techniques to predict survival in kidney transplantation: a review. J Clin Med. 2020;9(2):572.
Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53(282):457â€“81.
Efron B. Logistic regression, survival analysis, and the KaplanMeier curve. J Am Stat Assoc. 1988;83(402):414â€“25.
Hiller L, Marshall A, Dunn J. Assessing violations of the proportional hazards assumption in Cox regression: does the chosen method matter? Trials. 2015;16(2):1â€“1.
Gordon L, Olshen RA. Treestructured survival analysis. Cancer Treat Rep. 1985;69(10):1065â€“9.
Utkin LV, Konstantinov AV, Chukanov VS, Kots MV, Ryabinin MA, Meldo AA. A weighted random survival forest. KnowlBased Syst. 2019;177:136â€“44.
Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81â€“97.
Fard MJ, Wang P, Chawla S, Reddy CK. A bayesian perspective on early stage event prediction in longitudinal data. IEEE Trans Knowl Data Eng. 2016;28(12):3126â€“39.
Faraggi D, Simon R. A neural network model for survival data. Stat Med. 1995;14(1):73â€“82.
Lee C, Zame WR, Yoon J, van der Schaar M. Deephit: A deep learning approach to survival analysis with competing risks. In: Thirtysecond AAAI conference on artificial intelligence; 2018.
Ren K, Qin J, Zheng L, Yang Z, Zhang W, Qiu L, Yu Y. Deep recurrent survival analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2019. p. 4798â€“4805.
Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol. 2018;18(1):1â€“12.
Shivaswamy PK, Chu W, Jansche M. A support vector approach to censored targets. In: Seventh IEEE international conference on data mining (ICDM 2007); 2007. IEEE. p. 655â€“660.
Van Belle V, Pelckmans K, Van Huffel S, Suykens JA. Support vector methods for survival analysis: a comparison between ranking and regression approaches. Artif Intell Med. 2011;53(2):107â€“18.
Sarkar JP, Saha I, Sarkar A, Maulik U. Machine learning integrated ensemble of feature selection methods followed by survival analysis for predicting breast cancer subtype specific miRNA biomarkers. Comput Biol Med. 2021;131:104244.
Tanner KT, Sharples LD, Daniel RM, Keogh RH. Dynamic survival prediction combining landmarking with a machine learning ensemble: methodology and empirical comparison. J R Stat Soc A Stat Soc. 2021;184(1):3â€“30.
Ren Y, Zhang K, Shi Y. A Survival Certification Model Based on Active Learning over Medical Insurance Data. In: AsiaPacific Web (APWeb) and WebAge Information Management (WAIM) Joint International Conference on Web and Big Data. Springer; 2019. p. 156â€“170.
Vinzamuri B, Li Y, Reddy CK. Active learning based survival regression for censored data. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management; 2014. p. 241â€“250.
Zhang W, Tang J, Wang N. Using the machine learning approach to predict patient survival from highdimensional survival data. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2016. IEEE. p. 1234â€“1238.
GarcÃaLaencina PJ, Abreu PH, Abreu MH, Afonoso N. Missing data imputation on the 5year survival prediction of breast cancer patients with unknown discrete values. Comput Biol Med. 2015;59:125â€“33.
Å tajduhar I, DalbeloBaÅ¡iÄ‡ B. Learning Bayesian networks from survival data using weighting censored instances. J Biomed Inform. 2010;43(4):613â€“22.
Bishop CM. Pattern recognition. Mach Learn. 2006;128(9):5568.
Donat R, Leray P, Bouillaut L, Aknin P. A dynamic Bayesian network to represent discrete duration models. Neurocomputing. 2010;73(4â€“6):570â€“7.
Murphy KP. Dynamic bayesian networks: representation, inference and learning. Berkeley: University of California; 2002.
Yuan C, Druzdzel MJ. Importance sampling algorithms for Bayesian networks: Principles and performance. Math Comput Model. 2006;43(9â€“10):1189â€“207.
Russell S, Norvig P. Artificial intelligence: a modern approach. Science. 2002;5:889.
Kim JH. Estimating classification error rate: repeated crossvalidation, repeated holdout and bootstrap. Comput Stat Data Anal. 2009;53(11):3735â€“45.
Marcot BG. Metrics for evaluating performance and uncertainty of Bayesian network models. Ecol Model. 2012;230:50â€“62.
Koller D, Friedman N. Probabilistic graphical models: principles and techniques. London: MIT Press; 2009.
Pearl J. From Bayesian networks to causal networks. In: Mathematical models for handling partial knowledge in artificial intelligence. Berlin: Springer; 1995. p. 157â€“82.
Ramezani Tehrani F, Sheidaei A, Firouzi F, Tohidi M, Azizi F, BehboudiGandevani S. Does anti mullerian hormone decline rate improve the predication of age at menopause? Front Endocrinol. 2021;5:1104.
Kleinbaum DG, Klein M. Parametric survival models. In: Survival analysis. Berlin: Springer; 2012. p. 289â€“361.
Siannis F, Copas J, Lu G. Sensitivity analysis for informative censoring in parametric survival models. Biostatistics. 2005;6(1):77â€“91.
Korb KB, Nicholson AE. Bayesian artificial intelligence. London: CRC Press; 2010.
Breslow N, Crowley J. A large sample study of the life table and product limit estimates under random censorship. Ann Stat. 1974;2:437â€“53.
Wellner JA. A heavy censoring limit theorem for the product limit estimator. Ann Stat. 1985;6:150â€“62.
Stute W, Wang JL. The jackknife estimate of a Kaplanâ€”Meier integral. Biometrika. 1994;81(3):602â€“6.
Jiang R. A biascorrected KaplanMeier estimator. In: 2020 AsiaPacific International Symposium on Advanced Reliability and Maintenance Modeling (APARM): 2020. IEEE. p. 1â€“6.
Langner I, Bender R, LenzTÃ¶njes R, KÃ¼chenhoff H, Blettner M. Bias of maximumlikelihood estimates in logistic and Cox regression models: a comparative simulation study. In.: Discussion Paper; 2003.
Varmaghani M, Sharifi F, Mehdipour P, Sheidaei A, Djalalinia S, Gohari K, Modirian M, Pazhuheian F, Peykari N, Haghshenas R. Prevalence of smoking among Iranian adults: findings of the national STEPs survey 2016. Arch Iran Med. 2020;23(6):369â€“77.
Kim JP, Lee JH, Kim SJ, Yu HJ, Yang HK. Clinicopathologic characteristics and prognostic factors in 10 783 patients with gastric cancer. Gastric Cancer. 1998;1(2):125â€“33.
Kikuchi S, Futawatari N, Sakuramoto S, Katada N, Yamashita K, Shibata T, Nemoto M, Watanabe M. Comparison of staging between the old and new TNM classifications in advanced gastric cancer. Anticancer Res. 2011;31(6):2361â€“5.
Marano L, Dâ€™Ignazio A, Cammillini F, Angotti R, Messina M, Marrelli D, Roviello F. Comparison between 7th and 8th edition of AJCC TNM staging system for gastric cancer: old problems and new perspectives. Transl Gastroenterol Hepatol. 2019;4:5558.
Jafarzadeh Kohneloo A, Yaseri M, Rahimi Foroushani A, Zeraati H. Postsurgery survival in patients with adenocarcinoma of stomach using multistate model. J Gastrointestinal Cancer. 2021;2:1â€“7.
Ali Z, Mahmoodi M, Mohammad K, Zeraati H, Hosseini M, Naieni KH. Factors affecting the survival of patients with gastric cancer undergone surgery at iran cancer institute: univariate and multivariate analyses. Iran J Public Health. 2014;43(6):800.
Acknowledgements
This study is a part of the research process supported by the Tehran University of Medical Science to achieve a Ph.D. degree.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study's conception and design. HZ and AS performed analysis. AS wrote the first draft of the manuscript, and all authors commented on previous versions. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
All methods were carried out following relevant guidelines and regulations. This study was found to be under the ethical principles and the national norms and standards for conducting medical research in Iran by the research ethics committee of the school of public health & allied medical sciences â€“ Tehran university of medical sciences. The approval ID is IR.TUMS.SPH.REC.13993061. All participants provided written informed consent that their data collected as part of the study can be used in research. All the patients were followed until the event or when they preferred to stop participation.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: Supplementary file 1.
The validation of structure learning. The posterior classification error of HC and Tabu algorithms for all nodes according to different score functions. Supplementary file 2. Conditional probability distribution of time stationary variables in the model. Supplementary figure 1. Conditional probability distribution of node stage given the different levels of its parents (Metastasis and TNM). Supplementary figure 2. Conditional probability distribution of node TNM given the different levels of its parent (Metastasis). Supplementary figure 3. Conditional probability distribution of node metastasis given the different levels of its parent (Smoking). Supplementary figure 4. Conditional probability distribution of node pathology given the different levels of its parent (Sex). Supplementary figure 5. Conditional probability distribution of node smoking given the different levels of its parent (Sex). Supplementary figure 6. Conditional probability distribution of node surgery given the different levels of its parent (Site).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sheidaei, A., Foroushani, A.R., Gohari, K. et al. A novel dynamic Bayesian network approach for data mining and survival data analysis. BMC Med Inform Decis Mak 22, 251 (2022). https://doi.org/10.1186/s12911022020007
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12911022020007