A comparative analysis of predictive models of morbidity in intensive care unit after cardiac surgery – Part II: an illustrative example

Cevenini, Gabriele; Barbini, Emanuela; Scolletta, Sabino; Biagioli, Bonizella; Giomarelli, Pierpaolo; Barbini, Paolo

doi:10.1186/1472-6947-7-36

Research article
Open access
Published: 22 November 2007

BMC Medical Informatics and Decision Making volume 7, Article number: 36 (2007)

Abstract

Background

Popular predictive models for estimating morbidity probability after heart surgery are compared critically in a unitary framework. The study is divided into two parts. In the first part modelling techniques and intrinsic strengths and weaknesses of different approaches were discussed from a theoretical point of view. In this second part the performances of the same models are evaluated in an illustrative example.

Methods

Eight models were developed: Bayes linear and quadratic models, k-nearest neighbour model, logistic regression model, Higgins and direct scoring systems and two feed-forward artificial neural networks with one and two layers. Cardiovascular, respiratory, neurological, renal, infectious and hemorrhagic complications were defined as morbidity. Training and testing sets each of 545 cases were used. The optimal set of predictors was chosen among a collection of 78 preoperative, intraoperative and postoperative variables by a stepwise procedure. Discrimination and calibration were evaluated by the area under the receiver operating characteristic curve and Hosmer-Lemeshow goodness-of-fit test, respectively.

Results

Scoring systems and the logistic regression model required the largest set of predictors, while Bayesian and k-nearest neighbour models were much more parsimonious. In testing data, all models showed acceptable discrimination capacities; however, the Bayes quadratic model, using only three predictors, provided the best performance. All models showed satisfactory generalization ability: again the Bayes quadratic model exhibited the best generalization, while artificial neural networks and scoring systems gave the worst results. Finally, poor calibration was obtained when using scoring systems, the k-nearest neighbour model and artificial neural networks, while Bayes (after recalibration) and logistic regression models gave adequate results.

Conclusion

Although all the predictive models showed acceptable discrimination performance in the example considered, the Bayes and logistic regression models seemed better than the others, because they also had good generalization and calibration. The Bayes quadratic model seemed to be a convincing alternative to the much more usual Bayes linear and logistic regression models. It showed its capacity to identify a minimum core of predictors generally recognized as essential to pragmatically evaluate the risk of developing morbidity after heart surgery.

Background

The increasing number of diagnostic and therapeutic choices and the demand for quality and cost control have contributed to a proliferation of techniques of pattern recognition and decision making in all biomedical fields. In recent years, many different models have been proposed for the prediction of adverse outcome in heart surgery patients [1–6]. This prompted us to critically analyse the features of a number of popular systems for predicting patient morbidity in the cardiac postoperative intensive care unit (ICU), in a unitary framework.

The study is divided into two parts. In the first part different methods for estimating morbidity probability were grouped into categories according to the underlying mathematical principles. Eight predictive models, based on the Bayes rule [6–9], k-nearest neighbour [7, 10], logistic regression [11], integer score systems [3, 6] and artificial neural networks [12, 13], were investigated from a theoretical point of view. Modelling techniques and intrinsic strengths and weaknesses of each predictive model were analysed and discussed in view of clinical applications.

Although knowledge of the theoretical features, strengths and weaknesses of different approaches is fundamental for developing a predictive model of morbidity in the ICU, the final choice of model also has to consider the context and clinical scenario where the model will be used. Actual performances of locally-developed competitive models have to be evaluated and compared using real experimental data in order to reconcile local needs and model response. In this second part of the study, the experimental performance of the previously analysed models in predicting the risk of morbidity is evaluated in a real clinical scenario. All models were developed and tested using preoperative, intraoperative and postoperative data acquired in heart surgery patients. Since the aim of this study was to experimentally test the performance of a number of popular predictive models when locally customized to a specific scenario, not to develop a generally applicable model (for example, for benchmarking purposes), both training and testing data was acquired in the same postoperative cardiac ICU and the models were not tested on other independent data collected in different ICUs. Discrimination, generalization, calibration, simplicity of use and updating were the criteria used to assess differences between them, taking the specialised ICU as an illustrative example.

Methods

Sample set and variable collection

We considered data acquired in the whole set of 1090 patients who underwent coronary artery bypass grafting and were admitted to the intensive care unit of the Department of Surgery and Bioengineering of Siena University between 1st January 2002 and 31st December 2004. Standard preoperative and postoperative management and cardiopulmonary bypass (CPB) were performed [14]. The study was approved by the Ethics Committee of our institution.

A collection of 78 preoperative, intraoperative and postoperative variables was considered as likely risk predictors that could be associated with morbidity development in the ICU. A dichotomous (binary) variable was chosen as ICU outcome (morbidity). Preoperative and intraoperative data was collected under the anaesthesiologist's supervision. Postoperative data was collected in the first three hours after admission to the ICU, except the binary outcome, which was retrieved from medical records after discharge from the ICU. In total, 48 preoperative, intraoperative and postoperative continuous variables (Tables 1, 2 and 3, respectively) and 31 dichotomous variables (Tables 4, 5, 6) were used.

Table 1 Preoperative continuous variables

Table 2 Intraoperative continuous variables

Table 3 Postoperative continuous variables

Table 4 Preoperative dichotomous variables

Table 5 Intraoperative dichotomous variables

Table 6 Postoperative dichotomous variables

Cardiopulmonary bypass time (Table 2) was the total of all bypass runs if a second or subsequent period of cardiopulmonary bypass was conducted. Re-operations (Table 4) were considered as separate variables in the analysis [3].

According to the definitions of Higgins and colleagues [3, 15], emergency cases (Table 4) were defined as unstable angina, unstable hemodynamics, or ischemic valve dysfunction that could not be controlled medically. Left ventricular ejection fractions less than 35% were considered severely impaired (Table 4). Diabetes or chronic obstructive pulmonary disease (Table 4) were diagnosed only if the patient was maintained on appropriate medication.

Data was ranked chronologically on the basis of patient hospital discharge and organized in a database. The database was divided into two sets of equal size (545 cases each): a training set consisting of patients in odd positions in the original ranked database and a testing set consisting of the other patients, that is, those in even positions in the original database. To ensure that alternate allocation of cases did not introduce systematic sampling errors, training and testing data was compared using the Fisher exact test for dichotomous variables and the z-test or Mann-Whitney test for continuous normally or non-normally distributed variables, respectively [16]. Normality was assessed by the Kolmogorov-Smirnov test [16]. No significant difference was found between training and testing data, setting statistical significance at a p-value less than 0.05. Tables 1, 2, 3, 4, 5, 6 summarize the descriptive statistics of the training and test sets: continuous variables were described by means and standard deviations and dichotomous variables by frequencies and percentages.
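The alternate allocation described above can be sketched in a few lines of Python. This is only an illustration: the function name `alternate_split` and the placeholder patient identifiers are assumptions, and the statistical comparison of the two sets is not reproduced here.

```python
def alternate_split(ranked_records):
    """Split chronologically ranked records into training and testing sets.

    Records in odd positions (1st, 3rd, ...) go to the training set,
    records in even positions (2nd, 4th, ...) go to the testing set.
    """
    training = ranked_records[0::2]  # positions 1, 3, 5, ... (1-based)
    testing = ranked_records[1::2]   # positions 2, 4, 6, ... (1-based)
    return training, testing


# Placeholder identifiers standing in for the 1090 ranked patients.
patients = list(range(1, 1091))
train_set, test_set = alternate_split(patients)
print(len(train_set), len(test_set))
```

With 1090 ranked cases this gives two sets of 545 cases each, as in the study.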

Tables 1, 2, 3 also show the cut-off values at which continuous variables were dichotomised to develop integer score systems. They were chosen by setting sensitivity (SE) and specificity (SP) equal and testing the confidence interval for the odds ratio (for details see Section "Model description" in Part I of the present study). In the tables, n.s. means that the odds ratio of the dichotomised variable was not significantly greater than 1.
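As a rough illustration of the SE = SP criterion, the sketch below scans candidate cut-offs and keeps the one where sensitivity and specificity are closest. It assumes that higher values indicate higher risk and it omits the odds-ratio confidence interval check; the function name and the toy data are hypothetical, and the exact procedure is the one given in Part I.

```python
def equal_se_sp_cutoff(values, outcomes):
    """Return the cut-off at which sensitivity and specificity are closest.

    A case is called positive when its value is >= the cut-off
    (assumption: higher values mean higher risk).
    """
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    best_cut, best_gap = None, float("inf")
    for cut in sorted(set(values)):
        true_pos = sum(1 for v, y in zip(values, outcomes) if y == 1 and v >= cut)
        true_neg = sum(1 for v, y in zip(values, outcomes) if y == 0 and v < cut)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        gap = abs(sensitivity - specificity)
        if gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


# Toy data (hypothetical): one continuous variable and a binary outcome.
toy_values = [1, 2, 3, 4, 5, 6, 7, 8]
toy_outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff = equal_se_sp_cutoff(toy_values, toy_outcomes)
print(cutoff)
```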

Morbidity outcome was defined for patients developing at least one of the following clinical complications.

Cardiovascular complications: myocardial infarction (documented by electrocardiography and enzyme criteria); low cardiac output (requiring inotropic support for more than 24 hours, intraaortic balloon pump or ventricular assist device); or severe arrhythmias (requiring treatment or cardiopulmonary resuscitation).

Respiratory complications: prolonged ventilatory support (mechanical ventilatory support for more than 24 hours); re-intubation; tracheostomy; or clinical evidence of pulmonary embolism, edema or adult respiratory distress syndrome.

Neurological (central nervous system) complications: focal brain lesion confirmed by clinical findings and/or computed tomography; diffuse encephalopathy with more than 24 hours of severely altered mental status; or unexplained failure to awaken within 24 hours after operation.

Renal complications: acute renal failure needing dialysis.

Infectious complications: culture-proven pneumonia; mediastinitis; wound infection; septicaemia with appropriate clinical findings; or septic shock.

Hemorrhagic complications: bleeding requiring re-operation.

Note that the above outcome definition implies a compound endpoint of morbidity. This extensive definition is widely used when models for predicting major adverse outcomes are employed in ICU [3], although it limits the power of any single model to predict who gets a specific complication. On the other hand, it allows the number of events to be increased (the morbidity percentage in the whole patient set considered here was 20.7%) and the contribution of patient management to outcome is more evident when the endpoint occurs more frequently.

Predictive model development

The following models were developed locally to predict morbidity probability: Bayesian linear (BL) model, Bayesian quadratic (BQ) model, k-nearest neighbour (k NN) model, logistic regression (LR) model, Higgins score (HS) model derived from the previous LR model, direct score (DS) model, and two feed-forward artificial neural networks (ANNs) with one and two layers (ANN1 and ANN2, respectively). The theoretical details of the models were described in Part I of the study.

The above training and testing sets of 545 cases were used to train and test all models. Briefly, model development included: feature selection; evaluation of discrimination performance by AUC, that is, the area under the receiver operating characteristic (ROC) curve; assessment of calibration by the Hosmer-Lemeshow (HL) goodness-of-fit test using Ĉ-statistics; evaluation of accuracy by mean squared error (MSE); and recalibration of model-predicted probabilities when necessary.
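Two of the indices named above, AUC and MSE, are simple enough to sketch directly. The snippet below is a minimal illustration, not the authors' Matlab code: AUC is computed through its rank interpretation (the probability that a randomly chosen morbid case receives a higher predicted probability than a randomly chosen normal case, ties counting one half), and the function names and toy data are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a concordant pair
    return wins / (len(positives) * len(negatives))


def mean_squared_error(probabilities, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Toy predictions (hypothetical) for four patients.
toy_probs = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = auc(toy_probs, toy_labels)
toy_mse = mean_squared_error(toy_probs, toy_labels)
print(toy_auc, toy_mse)
```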

Artificial neural networks were trained using a batch training method which updates neural weights and biases after all training patterns have been processed, that is, after each epoch. An iterative training algorithm using gradient descent with momentum and an adaptive learning rate was used to minimize MSE. The influence of initialization on the solution was reduced by always performing 99 training sessions starting from 99 different randomly-selected initial conditions; the 99 corresponding values obtained for AUC were sorted from lowest to highest and the results of the session giving the 50th value of AUC were taken.
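The rule for choosing among the 99 training sessions amounts to taking the session with the median AUC. A minimal sketch follows, with the network training itself left out; the function name is hypothetical and the AUC values are random placeholders rather than study results.

```python
import random
import statistics


def pick_median_session(auc_by_session):
    """Return the index of the session whose AUC is the middle value after sorting."""
    ranked = sorted(range(len(auc_by_session)), key=lambda i: auc_by_session[i])
    return ranked[len(ranked) // 2]  # for 99 sessions this is the 50th value


# Placeholder AUC values for 99 sessions (random numbers, not study results).
rng = random.Random(0)
session_aucs = [0.70 + 0.10 * rng.random() for _ in range(99)]
chosen = pick_median_session(session_aucs)
print(chosen, session_aucs[chosen])
```

Using an odd number of sessions means the median is always an actual session, so its weights can be used directly.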

After the stepwise feature selection was performed for each predictive model on the training data by means of proper techniques such as leave-one-out, the 95% confidence interval of AUC and its median value (AŨC) were estimated for every set of selected features using 1000 different random samples generated by the bootstrap resampling method in the training and testing sets. The same samples were used to compare AUC values of different models in test data, by performing a Wilcoxon matched-pairs signed-ranks test [16].
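The bootstrap estimate of the median AUC and its 95% confidence interval could be sketched as follows. The percentile method for the interval, the re-drawing of resamples that contain a single class, the function names and the toy data are all assumptions made for illustration; the text does not specify these details.

```python
import random
import statistics


def auc(scores, labels):
    """AUC as the fraction of concordant positive/negative pairs (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_auc(scores, labels, n_boot=1000, seed=0):
    """Return (median AUC, (lower, upper) 95% percentile interval, sorted samples)."""
    rng = random.Random(seed)
    n = len(scores)
    samples = []
    while len(samples) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUC needs both classes; otherwise draw again
            samples.append(auc([scores[i] for i in idx], ys))
    samples.sort()
    lower = samples[round(0.025 * n_boot)]
    upper = samples[round(0.975 * n_boot) - 1]
    return statistics.median(samples), (lower, upper), samples


# Toy data (hypothetical): 100 cases, 20% morbidity, noisy scores.
rng = random.Random(1)
toy_labels = [1 if i % 5 == 0 else 0 for i in range(100)]
toy_scores = [0.3 * y + 0.7 * rng.random() for y in toy_labels]
median_auc, (ci_low, ci_high), auc_samples = bootstrap_auc(toy_scores, toy_labels)
print(median_auc, ci_low, ci_high)
```

Because the same resamples are reused for every model, the resulting AUC values are paired, which is what makes a matched-pairs test applicable.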

When applying stepwise feature selection on training data to a model, techniques such as leave-one-out may not ensure satisfactory generalization. The final selection of the number of features used for predicting morbidity was therefore made by trading discrimination capacity off against model complexity on the bootstrap samples of testing data (that is, on data not employed in the training process). The behaviour of AUC was first analysed in these test samples in relation to the set of features selected step-by-step by the previous stepwise procedure, and the number of features (d_M) allowing the maximum value of AŨC was taken as reference point. Then AUC values obtained with a number of selected features less than d_M were compared to those of d_M by the Wilcoxon matched-pairs signed-ranks test. Finally, the optimal number of selected features was chosen as the minimum number ensuring no significant difference in AUC (0.05 probability) with respect to d_M. Of course, if all comparisons gave significant differences, d_M was chosen as the optimal number of selected features to be used for predicting morbidity.
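The selection rule just described can be sketched as below. The Wilcoxon test is implemented with a normal approximation and no tie correction, which is an assumption of this sketch (reasonable for about 1000 paired bootstrap samples); the function names and the small paired samples are hypothetical and chosen only so that the outcome is easy to follow.

```python
import math
import statistics


def wilcoxon_p(x, y):
    """Two-sided p-value of the Wilcoxon matched-pairs signed-ranks test
    (normal approximation, no tie correction)."""
    diffs = [round(a - b, 12) for a, b in zip(x, y)]  # rounding guards float noise in ties
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def optimal_feature_count(auc_samples, alpha=0.05):
    """auc_samples maps a number of features d to its paired bootstrap AUC values."""
    d_max = max(auc_samples, key=lambda d: statistics.median(auc_samples[d]))
    for d in sorted(auc_samples):
        if d >= d_max:
            break
        if wilcoxon_p(auc_samples[d], auc_samples[d_max]) >= alpha:
            return d  # smallest d not significantly different from d_max
    return d_max


# Hypothetical paired AUC samples for models with 1, 2 and 3 features.
base = [0.70 + 0.005 * i for i in range(20)]
pattern = [0.001, -0.004, -0.002, 0.003]
toy_samples = {
    1: [b - 0.05 for b in base],                              # clearly worse
    2: [b + pattern[i % 4] for i, b in enumerate(base)],      # slightly lower median
    3: base,                                                  # highest median AUC
}
best_d = optimal_feature_count(toy_samples)
print(best_d)
```

In this toy case the three-feature model has the highest median AUC, the one-feature model is significantly worse, and the two-feature model is not, so two features are retained.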

Once optimized to ensure suitable generalization with the best discrimination performance, models with inadequate calibration were recalibrated by applying a cubic monotonic transformation (see Part I of the study) to the ranked predicted probabilities, so as to reach a more reliable estimation of morbidity probability.

All computer calculations were performed by means of locally-developed specific codes written in the Matlab programming language using the statistics and optimization toolboxes [17].

Results

For all models, Table 7 shows the predictor variables entered step-by-step or removed during the stepwise feature selection process. Variables that were removed appear in square brackets.

Table 7 Variables entered and removed (in square brackets) at each step of the stepwise selection procedure

Figure 1 shows the median values of AUC, obtained for each model by the bootstrap resampling method, in training and testing data (continuous and dashed lines, respectively) in relation to the dimension of each best subset of features identified by the stepwise procedure on training data. Since AŨC was taken as a global index of discrimination capacity, the difference between training and testing AŨC values may be considered to evaluate model generalization as a function of the number of features in the model: the greater the difference, the greater the model overfitting. The asterisk on the curve indicates the point corresponding to the optimal set of features for predicting morbidity, that is, the minimum number of selected features ensuring AUC values not statistically different from those giving the highest AŨC in the testing bootstrap data.

Table 8 lists the above-defined optimal set of predictor variables model by model and Table 9 shows the corresponding model performance. Discrimination capacity is quantified by AŨC calculated on bootstrap data. For testing data, 95% confidence intervals (CI) of AUC and CI%, that is, the percentage ratio of CI width to AŨC, are also given. Generalization was evaluated as the percentage difference in AŨC between training and testing data. Calibration was assessed on testing data by the p-value of the Hosmer-Lemeshow goodness-of-fit test using Ĉ-statistics (HL-p), so that an HL-p much greater than 0.05 indicated very good model calibration, while HL-p < 0.05 revealed poor model calibration.
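For readers unfamiliar with the Ĉ-statistic, the following sketch computes it from ten equal-sized groups of ranked predicted probabilities and converts it to a p-value. The choice of ten groups, the use of g − 2 degrees of freedom and the closed-form chi-square tail (valid only for an even number of degrees of freedom) are assumptions of this sketch; the function name and toy data are hypothetical.

```python
import math


def hosmer_lemeshow(probabilities, labels, groups=10):
    """Return (C-hat statistic, p-value) using equal-sized groups of ranked risk."""
    pairs = sorted(zip(probabilities, labels))
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        mean_p = expected / size
        chi2 += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    df = groups - 2  # assumed; must be even for the closed form below
    half = chi2 / 2
    p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))
    return chi2, p_value


# Toy data (hypothetical): ten risk strata of 20 cases, observed events match predictions.
toy_probs, toy_labels = [], []
for g in range(10):
    for i in range(20):
        toy_probs.append(0.05 * (g + 1))
        toy_labels.append(1 if i < g + 1 else 0)

chi2_good, p_good = hosmer_lemeshow(toy_probs, toy_labels)
# Halving every predicted probability makes the model underestimate risk.
chi2_bad, p_bad = hosmer_lemeshow([p / 2 for p in toy_probs], toy_labels)
print(p_good, p_bad)
```

The well-calibrated predictions give a p-value close to 1, whereas the systematically underestimated ones fall below 0.05.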

Table 8 Optimal feature vectors selected by different models from bootstrap test data

Table 9 Number of selected features and corresponding model performance

Most models selected more than ten features to predict morbidity in the ICU after heart surgery (Table 8). The DS model used the largest set of features (sixteen predictor variables), while the number of features used in the HS model was set equal to that chosen by the corresponding LR model, as proposed by Higgins and colleagues [3]. The Bayesian and k NN models were much more parsimonious, using fewer than ten features. The Bayes quadratic model required the smallest set of predictor variables (only three).

Artificial neural networks gave the highest values of AŨC on training data, but their discrimination ability decreased sharply when estimated on testing data (Table 9). This result confirms that model overfitting may be a limitation of this approach.

The Bayes quadratic model, using only three predictor variables, provided the highest AŨC on test data (Table 9). Although the 95% confidence intervals of different models largely overlapped, the Wilcoxon matched-pairs signed-ranks test on testing bootstrap data showed significant AUC differences between various models. This means that, when the results obtained with the bootstrap data were compared pair by pair, one model generally gave better AUC values than another. However, despite this statistical outcome, all models showed broadly similar discrimination capacities, because the AŨC and CI were roughly equivalent from a practical point of view for the whole group of models. All models had acceptable discrimination capacities on test data, because their AŨC was always greater than 0.7 and less than 0.8 [11]. Furthermore, the width of the CI indicated appreciable sample variability in model discrimination performance.

All models showed satisfactory generalization when evaluated in our specialized ICU, because the percentage difference in AŨC between training and testing data was always less than 8% (Table 9). However, the Bayes quadratic and k NN models had very good generalization performance, while artificial neural networks and integer score models gave the worst results.

The Hosmer-Lemeshow goodness-of-fit test indicated very poor calibration for both integer score models, even after recalibration (Table 9). However, this may also depend on limitations of the HL test in assessing goodness of fit for predictive models with discrete output probabilities. The k NN model and artificial neural networks (especially ANN2) also showed poor calibration, while the Bayesian and logistic regression models gave satisfactory results. Nevertheless, the Bayesian models had to be recalibrated, whereas the logistic regression model did not.

Discussion

A pool of 78 variables was taken a priori as potential predictors of morbidity in the ICU after heart surgery, so that feature selection had to be made a posteriori, considering not only training but also testing data. Although some identical features were selected from all models, the number of predictor variables identified as optimal was rather different in the various models under study. As shown in Table 8, the Bayes quadratic model was the most parsimonious. Most other models (such as integer score systems) required many more predictors.

The DS model used the largest set of predictor variables. Table 7 shows that some features were entered in this model several times during the stepwise selection procedure: oxygen extraction (O₂ER) was the most selected and obtained the highest associated score. Despite a clear tendency to overfit training data (the differences between training and testing curves in Figure 1 increased remarkably with just a few features), AŨC significantly increased in test data with the number of selected features, reaching a value of 0.779 with sixteen predictors. Unfortunately this model showed very poor calibration performance, although this result may be partly due to the limitations of the HL test or the recalibration procedure for score models with discrete outputs. Furthermore, like other predictive score models, the DS system was difficult to update with new data. In fact, updating requires a complete periodic retraining. To do this, an automatic routine can be implemented on a computer, but this defeats the purpose of choosing a simple method that does not require a computer for everyday clinical application, which is the reason why such systems are very popular in medicine [3, 6].

About the same number of features were selected for the LR model and ANN1. Most of the predictor variables selected by both on our ICU experimental data were the same (see Table 8). From a theoretical point of view, these two models are characterized by the same input-output nonlinear mathematical relationships, although their parameters are estimated by different approaches (for details see Part I of this study). This may explain the similarity of their discrimination results. However, ANN1 performed better on training data and worse on test data, so that its generalization power was lower than that of the LR model, confirming the tendency of artificial neural networks to overfit training data. Much better results were also obtained by the LR model as regards calibration. Finally, difficulties can arise when designing and using artificial neural networks, and continuous updating is practically impossible. These considerations suggest that the LR model is preferable to ANN1 for the example considered here.

The results obtained using the two-layer artificial neural network ANN2 were similar to those of ANN1, although ANN2 used a smaller feature set. Despite increased model complexity, ANN2 showed only slightly better discrimination on test data and generalization power, but worse calibration. So, when comparing ANN2 and LR performance, the same conclusion as between ANN1 and LR was reached.

As described in detail in Part I of this study, the Higgins score system was derived from the logistic regression model with the same features, by transforming continuous predictors to binary variables and LR coefficients to integer scores. Of course, the HS system suffers from the weaknesses of all integer score systems, as discussed when considering the DS model. Furthermore, the comparison of the results obtained by the LR and corresponding HS models showed that LR had better performance than the corresponding scoring system. In fact, its discrimination ability was higher on testing data and its generalization power was superior. The HS model showed very poor calibration, whereas the LR model was well calibrated even without any recalibration procedure. All this confirms that, when transforming the LR model into a simpler-to-use score system, the gain in ease of computation has to be weighed carefully against the loss of performance.

The Bayes linear model selected only eight features versus fourteen of the logistic regression model. The LR model used all the predictors used in the BL model and six additional ones. However, the number of model parameters estimated by the LR model was much lower than that of BL: fifteen and fifty-two, respectively (see also Part I of this study). Despite this, the 95% confidence interval of AUC was practically the same for both models ([0.722–0.831] for BL versus [0.721–0.830] for LR) and the generalization power was similar. They both showed good calibration performances. This seems to confirm previous experimental findings indicating that in many practical situations the two approaches give generally similar results [6, 18]. Their application in clinical practice is not difficult. To recognize morbidity a hand calculator is sufficient, because LR uses a simple exponential relationship and the Bayesian linear decision rule can be expressed as a linear function of the observation vector [7]. Major differences can be observed for updating. The BL model can be updated with new training data simply by updating the mean vector and pooled within-sample covariance matrix estimates using simple recursive relationships, whereas the LR model is not so simple to update. We therefore judged the BL model as better than LR in the present illustrative example.
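To make the idea of recursive updating concrete, the sketch below keeps a running mean vector and scatter matrix per class and pools the two classes into a within-sample covariance estimate. It uses the well-known Welford-style update, which is one possible form of such recursive relationships and not necessarily the one used by the authors; the class name, function name and toy data are hypothetical.

```python
class RunningStats:
    """Running mean vector and scatter matrix, updated one case at a time."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        # Sum of outer products of deviations from the current mean.
        self.scatter = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        self.n += 1
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, delta_old)]
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for i, di in enumerate(delta_old):
            for j, dj in enumerate(delta_new):
                self.scatter[i][j] += di * dj


def pooled_covariance(stats_a, stats_b):
    """Pooled within-class covariance from two RunningStats objects."""
    dof = stats_a.n + stats_b.n - 2
    dim = len(stats_a.mean)
    return [[(stats_a.scatter[i][j] + stats_b.scatter[i][j]) / dof
             for j in range(dim)] for i in range(dim)]


# Toy cases (hypothetical) with two features per patient.
morbid = RunningStats(2)
for case in [(1.0, 2.0), (2.0, 4.0), (3.0, 9.0)]:
    morbid.update(case)
normal = RunningStats(2)
for case in [(0.0, 0.0), (2.0, 2.0)]:
    normal.update(case)

pooled = pooled_covariance(morbid, normal)
print(morbid.mean, normal.mean, pooled)
```

Each new correctly classified case only requires one call to `update` for its class, so no stored training data has to be revisited.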

The k NN model required only five features to predict morbidity in ICU patients after heart surgery. In general, this non-parametric approach did not overfit training data, so that good generalization could also be obtained using different dimensions of the feature set (see Figure 1). In fact, generalization power only decreased appreciably with six predictor variables, that is, the maximum number of features selected by the stepwise procedure on training data. However, its calibration was poor and AŨC computed on test data was the second lowest of all the models considered. Besides, its computational cost and need for large data storage made this model unpromising, unless comparison of test cases with their k neighbours is considered important for comparative diagnosis.

The Bayes quadratic model had the highest discrimination capacity on test data, using the minimum number of features (oxygen extraction, oxygen delivery and need for cardiac inotropic drugs after the operation). AŨC calculated by means of the bootstrap resampling method was almost the same for training and testing data (percentage difference less than 1%). The quality of the results in the scenario considered may also be due to the small number of parameter estimates required by the model. In fact, with three predictor variables and two classes, the BQ model required the estimation of eighteen parameters (mean vectors and covariance matrices of the two classes). This number of model parameters is about the same as that of the LR model (fifteen model parameters), but much lower than that of BL (fifty-two model parameters). Like the BL model, the BQ model can be recursively updated whenever a new case has to be included in the training set. Finally, after recalibration, the Hosmer-Lemeshow goodness-of-fit test using Ĉ-statistics indicated adequate model calibration. These considerations make the BQ model a convincing alternative to the BL and LR approaches for the present application.
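The parameter counts quoted above follow from simple counting, as the short sketch below shows. It assumes that class prior probabilities are not counted as model parameters, which is what reproduces the figures in the text; the function names are hypothetical.

```python
def lr_parameter_count(d):
    """Logistic regression: one coefficient per feature plus an intercept."""
    return d + 1


def bl_parameter_count(d, classes=2):
    """Bayes linear: one mean vector per class plus one pooled symmetric covariance."""
    return classes * d + d * (d + 1) // 2


def bq_parameter_count(d, classes=2):
    """Bayes quadratic: one mean vector and one symmetric covariance per class."""
    return classes * (d + d * (d + 1) // 2)


print(lr_parameter_count(14))  # LR with fourteen features
print(bl_parameter_count(8))   # BL with eight features
print(bq_parameter_count(3))   # BQ with three features
```

The three calls return 15, 52 and 18, matching the values reported for the LR, BL and BQ models.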

It can be noted that two of the three predictors selected by the Bayes quadratic model (oxygen extraction and need for cardiac inotropic drugs after the operation) were chosen by all models. This means that these two variables were essential features for predicting morbidity in the scenario considered. Of course, the need for inotropic drugs after the operation is strongly correlated with poor cardiac function, while the key role played by oxygen extraction confirms the results of a previous study, in which increased oxygen extraction immediately after heart surgery has been indicated as an independent predictor of prolonged ICU stay [19]. The third predictor used by the BQ model was oxygen delivery and inadequate oxygen delivery has also been associated with prolonged ICU stay after heart surgery [20]. Increased levels of oxygen delivery and consumption have also been associated with improved outcome [21] and this fact has been tested in various clinical situations [22–24]. Knowledge of oxygen extraction and oxygen delivery is fundamental for assessing the relationship between oxygen consumption and oxygen delivery [25], though in many cases, mixed venous oxygen saturation or even central venous oxygen saturation alone may suffice [26]. Previous studies have shown that when oxygen saturation in the superior vena cava is used as a guide, early goal-directed therapy may provide significant benefits for outcome in ICU patients with severe sepsis and septic shock [27] and that this approach may reduce the length of hospital stay and the degree of organ dysfunction of heart surgery patients at discharge [28].

Statistical predictive models and artificial neural networks are black-box systems allowing cases to be allocated to different classes, but they do not lend themselves to interpretation of the underlying causes. However, when the number of the selected predictor variables is sufficiently small it may be interesting to seek an explanatory interpretation of the predictive model results a posteriori. In everyday life we are accustomed to considering phenomena in three dimensions. It is therefore difficult to expound the meaning of systems (such as predictive models) working in more than three dimensions. However, when the predictive model uses two or three features, a rational interpretation of its results may be attempted. The BQ model developed on our ICU data used only three features to predict morbidity outcomes, so that an interpretation of the result obtained was sought. First of all it is useful to recall that oxygen extraction is the ratio of oxygen consumption to oxygen delivery. A recent paper showed that the relationship of oxygen consumption to oxygen delivery is an important concept, even if its practical application is not simple and decisions regarding the need for strategies to increase and maintain oxygen delivery require the interpretation of many measurements [26]. The BQ predictive model seems to confirm these findings, because its decision boundary is given by a quadratic form of the three selected features in the three-dimensional space. In the clinical example used to locally develop the predictive model, this means that the cut-off value of oxygen delivery separating morbid and normal course classes does not remain constant or vary in a linear fashion as a function of oxygen extraction. Furthermore, this boundary changes in patients requiring cardiac inotropic drugs after the operation. Figure 2 clarifies this finding. Continuous and broken lines represent the decision boundaries in the oxygen extraction/oxygen delivery plane for patients to whom cardiac inotropic drugs are and are not administered, respectively. Patients at risk of morbidity are located below the decision boundary. The decision boundary moves up for patients who require drug administration after the operation, indicating that for these patients, the risk of morbidity will be high even at higher values of oxygen delivery.
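For reference, the quadratic form mentioned above can be written in the standard textbook notation for a Gaussian Bayes classifier. The symbols below are assumptions introduced here for illustration and may differ from the notation of Part I: x is the three-component feature vector, μ_k and Σ_k are the mean vector and covariance matrix of class ω_k, and P(ω_k) is its prior probability.

```latex
% Oxygen extraction as defined in the text:
% ratio of oxygen consumption (VO2) to oxygen delivery (DO2)
\mathrm{O_2ER} = \frac{\dot{V}\mathrm{O}_2}{\dot{D}\mathrm{O}_2}

% Quadratic discriminant function of class \omega_k, k \in \{1, 2\}
g_k(\mathbf{x}) = -\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_k)^{\mathsf{T}}
                  \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)
                  -\tfrac{1}{2}\ln\lvert\boldsymbol{\Sigma}_k\rvert
                  +\ln P(\omega_k)

% Decision boundary between morbid and normal course classes
g_1(\mathbf{x}) = g_2(\mathbf{x})
```

Fixing the inotropic-drug indicator at 0 or 1 in the boundary equation leaves a second-degree curve in the oxygen extraction/oxygen delivery plane, which is why two distinct, non-linear boundaries appear in that plane.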

In conclusion, the Bayes quadratic model seemed to identify a minimum core of predictor variables generally recognized as essential for a pragmatic evaluation of the risk of morbidity after heart surgery. When this set of predictors was used on test data, it gave good discrimination, generalization and calibration, which were similar to or better than those obtained with the Bayes linear or logistic regression models. Because of the small number of predictors to be monitored, clinicians may also more easily track and rationally interpret time courses of patient status, and consequently make prompt decisions about optimal therapeutic strategies. Of course, this does not mean that the Bayes quadratic approach is always the best model for predicting morbidity in ICU patients. However, it provided a good compromise between system complexity and predictive performance in our example.

Conclusion

The purpose of the present study was to analyse and compare different predictive models for estimating patient morbidity in the ICU after heart surgery. In this second part of the study we developed and tested eight popular predictive models with preoperative, intraoperative and postoperative data acquired in adult patients who underwent coronary artery bypass grafting. This part of the study supplements Part I in which different approaches for developing predictive morbidity models were reviewed in a unitary framework from a theoretical point of view.

The experimental results indicated that all models provided acceptable discrimination in test data and satisfactory generalization in our illustrative example. In contrast, poor calibration was obtained with scoring systems, the k-nearest neighbour model and artificial neural networks, while Bayes and logistic regression models gave satisfactory results. Most of the models selected more than ten features to predict morbidity: scoring systems and the logistic regression model required the largest sets of predictors, while Bayesian and kNN models were much more parsimonious, requiring fewer than ten features.
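
The two performance measures referred to above can be sketched in a few lines. The code below is a minimal illustration, not the implementation used in the study: the function names `auc` and `hosmer_lemeshow` are hypothetical, the AUC is computed through the Mann-Whitney pairwise comparison, and the Hosmer-Lemeshow statistic is computed over equal-size groups ordered by predicted risk, without the associated p-value.

```python
from typing import Sequence


def auc(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise (Mann-Whitney) comparison.

    probs    -- predicted morbidity probabilities
    outcomes -- observed outcomes (1 = morbidity, 0 = normal course)
    """
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # morbid case ranked above normal case
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(probs: Sequence[float],
                    outcomes: Sequence[int],
                    groups: int = 10) -> float:
    """Hosmer-Lemeshow chi-square statistic (no p-value).

    Cases are sorted by predicted probability and split into `groups`
    chunks of (nearly) equal size; observed and expected event counts
    are then compared within each chunk.
    """
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        mean_p = expected / size
        denom = size * mean_p * (1.0 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat
```

An AUC close to 1 indicates good discrimination, whereas a small Hosmer-Lemeshow statistic indicates that predicted and observed event counts agree within risk groups, that is, good calibration. A model can rank patients well and still be poorly calibrated, which is why the two measures are reported separately.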

The Bayes quadratic model required the smallest set of predictor variables (only three: oxygen extraction, oxygen delivery and use of cardiac inotropic drugs after the operation) and provided very interesting results, which were similar to or better than those obtained with the Bayes linear or logistic regression models. An additional intrinsic strength of Bayesian models, not shared by logistic regression models, is that they can be updated in a straightforward manner by including new correctly classified cases in the training set, since this only involves updating the mean vector and covariance matrix estimates by means of simple recursive relationships.
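
As an illustration of such a recursive update, the sketch below folds one new case at a time into the class mean vector and covariance matrix without revisiting the stored training set. It uses a Welford-style recursion, which is one standard choice; the text does not specify which recursion was used, and the class name `RunningGaussian` is hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


class RunningGaussian:
    """Running mean vector and sample covariance matrix for one class."""

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.mean: Vector = [0.0] * dim
        # Accumulated sum of outer products of deviations from the mean.
        self.scatter: Matrix = [[0.0] * dim for _ in range(dim)]

    def update(self, x: Vector) -> None:
        """Fold one new correctly classified case into the estimates."""
        self.n += 1
        # Deviation from the mean before the update.
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + d / self.n for mi, d in zip(self.mean, delta_old)]
        # Deviation from the mean after the update.
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for i, di in enumerate(delta_old):
            for j, dj in enumerate(delta_new):
                self.scatter[i][j] += di * dj

    def covariance(self) -> Matrix:
        """Unbiased sample covariance (requires at least two cases)."""
        if self.n < 2:
            raise ValueError("need at least two cases")
        return [[s / (self.n - 1) for s in row] for row in self.scatter]
```

In use, one such object would be kept per class (morbid and normal course), and `update` would be called with the feature vector of each newly confirmed case. The discriminant parameters can then be recomputed from `mean` and `covariance()` at any time.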

Because of the small number of predictors needed, the Bayes quadratic model also enabled an explanatory interpretation of the results obtained in our example. In particular, the BQ model seemed to confirm previous experimental findings showing that the relationship between oxygen consumption and oxygen delivery is a key issue for guiding therapy.

In conclusion, both theoretical and experimental findings indicate that the Bayes quadratic model offers a good compromise between complexity and predictive performance and can therefore be a convincing alternative to other, much more extensively used predictive models (such as scoring systems or even Bayes linear and logistic regression models) in many clinical applications.

Note: This paper is accompanied by Part I, which gives a comprehensive review of several methods used to plan predictive models [29].

Abbreviations

ANN: artificial neural network
AUC: area under the ROC curve
BL: Bayes linear
BQ: Bayes quadratic
CI: confidence interval
DS: direct score
HL: Hosmer-Lemeshow
HS: Higgins score
ICU: intensive care unit
kNN: k-nearest neighbour
LR: logistic regression
MSE: mean squared error
ROC: receiver operating characteristic

References

1. Heijmans JH, Maessen JG, Roekaerts PMHJ: Risk stratification for adverse outcome in cardiac surgery. Eur J Anaesthesiol. 2003, 20: 515-527. 10.1017/S0265021503000838.
2. Fortescue EB, Kahn K, Bates DW: Development and validation of a clinical prediction rule for major adverse outcomes in coronary bypass grafting. Am J Cardiol. 2001, 88: 1251-1258. 10.1016/S0002-9149(01)02086-0.
3. Higgins TL, Estafanous FG, Loop FD, Beck GJ, Lee JC, Starr NJ, Knaus WA, Cosgrove III DM: ICU admission score for predicting morbidity and mortality risk after coronary artery bypass grafting. Ann Thorac Surg. 1997, 64: 1050-1058. 10.1016/S0003-4975(97)00553-5.
4. Edwards FH, Peterson RF, Bridges C, Ceithaml EL: Use of a Bayesian statistical model for risk assessment in coronary artery surgery. Ann Thorac Surg. 1995, 59: 1611-1612. 10.1016/0003-4975(95)00189-R.
5. Marshall G, Shroyer ALW, Grover FL, Hammermeister KE: Bayesian-logit model for risk assessment in coronary artery bypass grafting. Ann Thorac Surg. 1994, 57: 1492-1500.
6. Biagioli B, Scolletta S, Cevenini G, Barbini E, Giomarelli P, Barbini P: A multivariate Bayesian model for assessing morbidity after coronary artery surgery. Crit Care. 2006, 10: R94. 10.1186/cc4441.
7. Fukunaga K: Introduction to Statistical Pattern Recognition. 1990, Boston: Academic Press
8. Krzanowski WJ: Principles of Multivariate Analysis: A User's Perspective. 1988, Oxford: Clarendon Press
9. Sivia DS, Skilling J: Data Analysis: A Bayesian Tutorial. 2006, Oxford: Oxford University Press
10. Duda RO, Hart PE, Stork DG: Pattern Classification. 2001, New York: John Wiley and Sons
11. Hosmer DW, Lemeshow S: Applied Logistic Regression. 2000, New York: Wiley
12. Dreiseitl S, Ohno-Machado L: Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform. 2002, 35: 352-359. 10.1016/S1532-0464(03)00034-0.
13. Bishop CM: Neural Networks for Pattern Recognition. 1995, New York: Oxford University Press Inc
14. Giomarelli P, Scolletta S, Borrelli E, Biagioli B: Myocardial and lung injury after cardiopulmonary bypass: Role of interleukin (IL)-10. Ann Thorac Surg. 2003, 76: 117-123. 10.1016/S0003-4975(03)00194-2.
15. Higgins TL, Estafanous FG, Loop FD, Beck GJ, Blum JM, Paranandi L: Stratification of morbidity and mortality outcome by preoperative risk factors in coronary artery bypass patients. JAMA. 1992, 267: 2344-2348. 10.1001/jama.267.17.2344.
16. Armitage P, Berry G: Statistical methods in medical research. 1987, Oxford: Blackwell Scientific Publications
17. MATLAB: The Language of Technical Computing, Using MATLAB, Version 7. 2004, Natick, MA: The MathWorks Inc
18. Testi D, Cappello A, Chiari L, Viceconti M, Gnudi S: Comparison of logistic and Bayesian classifiers for evaluating the risk of femoral neck fracture in osteoporotic patients. Med Biol Eng Comput. 2001, 39: 633-637. 10.1007/BF02345434.
19. Pölönen P, Hippeläinen M, Takala R, Ruokonen E, Takala J: Relationship between intra- and postoperative oxygen transport and prolonged intensive care after cardiac surgery: a prospective study. Acta Anaesthesiol Scand. 1997, 41: 810-817.
20. Routsi C, Vincent JL, Bakker J, De Backer D, Lejeune P, d'Hollander A, Le Clerc JL, Kahn RJ: Relation between oxygen consumption and oxygen delivery in patients after cardiac surgery. Anesth Analg. 1993, 77: 1104-1110. 10.1213/00000539-199312000-00004.
21. Shoemaker WC, Montgomery ES, Kaplan E, Elwyn DH: Physiologic patterns in surviving and nonsurviving patients: use of sequential cardiorespiratory variables in defining criteria for therapeutic goals and early warning of death. Arch Surg. 1973, 106: 630-636.
22. Tuchschmidt J, Fried J, Astiz M, Rackow E: Elevation of cardiac output and oxygen delivery improves outcome in septic shock. Chest. 1992, 102: 216-220. 10.1378/chest.102.1.216.
23. Boyd O, Grounds RM, Bennett ED: A randomized clinical trial of the effect of deliberate perioperative increase of oxygen delivery on mortality in high-risk surgical patients. JAMA. 1993, 270: 2699-2707. 10.1001/jama.270.22.2699.
24. Gattinoni L, Brazzi L, Pelosi P, Latini R, Tognoni G, Pesenti A, Fumagalli R: A trial of goal-oriented hemodynamic therapy in critically ill patients. N Engl J Med. 1995, 333: 1025-1032. 10.1056/NEJM199510193331601.
25. Vincent JL: Determination of O₂ delivery and consumption vs cardiac index vs oxygen extraction ratio. Crit Care Clin. 1996, 12: 995-1006. 10.1016/S0749-0704(05)70288-8.
26. Vincent JL, De Backer D: Oxygen transport - the oxygen delivery controversy. Intensive Care Med. 2004, 30: 1990-1996. 10.1007/s00134-004-2384-4.
27. Rivers E, Nguyen B, Havstad S, Ressler J, Muzzin A, Knoblich B, Peterson E, Tomlanovich M: Early goal-directed therapy in the treatment of severe sepsis and septic shock. N Engl J Med. 2001, 345: 1368-1377. 10.1056/NEJMoa010307.
28. Pölönen P, Ruokonen E, Hippeläinen M, Pöyhönen M, Takala J: A prospective, randomized study of goal-oriented hemodynamic therapy in cardiac surgical patients. Anesth Analg. 2000, 90: 1052-1059. 10.1097/00000539-200005000-00010.
29. Barbini E, Cevenini G, Scolletta S, Biagioli B, Giomarelli P, Barbini P: A comparative analysis of predictive models of morbidity in intensive care unit after cardiac surgery – Part I: model planning. BMC Med Inform Decis Mak. 2007, 7: 35. 10.1186/1472-6947-7-35.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/7/36/prepub

Acknowledgements

This work was supported by the Italian Ministry of University and Research (MIUR – PRIN).

Author information

Authors and Affiliations

Department of Surgery and Bioengineering, University of Siena, Siena, Italy
Gabriele Cevenini, Sabino Scolletta, Bonizella Biagioli, Pierpaolo Giomarelli & Paolo Barbini
Department of Physiopathology, Experimental Medicine and Public Health, University of Siena, Siena, Italy
Emanuela Barbini

Corresponding author

Correspondence to Paolo Barbini.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

All authors participated in the study plan and coordination. GC and PB were concerned with medical informatics and biostatistical aspects of the study. EB was concerned with epidemiology and biostatistical aspects of the study. SS, BB and PG were involved in clinical aspects. SS collected clinical data. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Cevenini, G., Barbini, E., Scolletta, S. et al. A comparative analysis of predictive models of morbidity in intensive care unit after cardiac surgery – Part II: an illustrative example. BMC Med Inform Decis Mak 7, 36 (2007). https://doi.org/10.1186/1472-6947-7-36

Received: 26 April 2007
Accepted: 22 November 2007
Published: 22 November 2007
DOI: https://doi.org/10.1186/1472-6947-7-36