Skip to main content


We’d like to understand how you use our websites in order to improve them. Register your interest.

A comparative analysis of predictive models of morbidity in intensive care unit after cardiac surgery – Part II: an illustrative example



Popular predictive models for estimating morbidity probability after heart surgery are compared critically in a unitary framework. The study is divided into two parts. In the first part modelling techniques and intrinsic strengths and weaknesses of different approaches were discussed from a theoretical point of view. In this second part the performances of the same models are evaluated in an illustrative example.


Eight models were developed: Bayes linear and quadratic models, k-nearest neighbour model, logistic regression model, Higgins and direct scoring systems and two feed-forward artificial neural networks with one and two layers. Cardiovascular, respiratory, neurological, renal, infectious and hemorrhagic complications were defined as morbidity. Training and testing sets each of 545 cases were used. The optimal set of predictors was chosen among a collection of 78 preoperative, intraoperative and postoperative variables by a stepwise procedure. Discrimination and calibration were evaluated by the area under the receiver operating characteristic curve and Hosmer-Lemeshow goodness-of-fit test, respectively.


Scoring systems and the logistic regression model required the largest set of predictors, while Bayesian and k-nearest neighbour models were much more parsimonious. In testing data, all models showed acceptable discrimination capacities, however the Bayes quadratic model, using only three predictors, provided the best performance. All models showed satisfactory generalization ability: again the Bayes quadratic model exhibited the best generalization, while artificial neural networks and scoring systems gave the worst results. Finally, poor calibration was obtained when using scoring systems, k-nearest neighbour model and artificial neural networks, while Bayes (after recalibration) and logistic regression models gave adequate results.


Although all the predictive models showed acceptable discrimination performance in the example considered, the Bayes and logistic regression models seemed better than the others, because they also had good generalization and calibration. The Bayes quadratic model seemed to be a convincing alternative to the much more usual Bayes linear and logistic regression models. It showed its capacity to identify a minimum core of predictors generally recognized as essential to pragmatically evaluate the risk of developing morbidity after heart surgery.

Peer Review reports


The increasing number of diagnostic and therapeutic choices and the demand for quality and cost control have contributed to a proliferation of techniques of pattern recognition and decision making in all biomedical fields. In recent years, many different models have been proposed for the prediction of adverse outcome in heart surgery patients [16]. This prompted us to critically analyse the features of a number of popular systems for predicting patient morbidity in the cardiac postoperative intensive care unit (ICU), in a unitary framework.

The study is divided into two parts. In the first part different methods for estimating morbidity probability were grouped into categories according to the underlying mathematical principles. Eight predictive models, based on the Bayes rule [69], k-nearest neighbour [7, 10], logistic regression [11], integer score systems [3, 6] and artificial neural networks [12, 13], were investigated from a theoretical point of view. Modelling techniques and intrinsic strengths and weaknesses of each predictive model were analysed and discussed in view of clinical applications.

Although knowledge of theoretical features, strengths and weaknesses of different approaches are fundamental for developing a predictive model of morbidity in the ICU, the final choice of model also has to consider the context and clinical scenario where the model will be used. Actual performances of locally-developed competitive models have to be evaluated and compared using real experimental data in order to reconcile local needs and model response. In this second part of the study, the experimental performance of the previously analysed models in predicting the risk of morbidity is evaluated in a real clinical scenario. All models were developed and tested using preoperative, intraoperative and postoperative data acquired in heart surgery patients. Since the aim of this study was to experimentally test the performance of a number of popular predictive models when locally customized to a specific scenario, not to develop a generally applicable model (for example, for benchmarking purposes), both training and testing data was acquired in the same postoperative cardiac ICU and the models were not tested on other independent data collected in different ICUs. Discrimination, generalization, calibration, simplicity of use and updating were the criteria used to assess differences between them, taking the specialised ICU as an illustrative example.


Sample set and variable collection

We considered data acquired in the whole set of 1090 patients who underwent coronary artery bypass grafting and were admitted to the intensive care unit of the Department of Surgery and Bioengineering of Siena University between 1st January 2002 and 31st December 2004. Standard preoperative and postoperative management and cardiopulmonary bypass (CPB) were performed [14]. The study was approved by the Ethics Committee of our institution.

A collection of 78 preoperative, intraoperative and postoperative variables were considered as likely risk predictors, that could be associated with morbidity development in the ICU. A dichotomous (binary) variable was chosen as ICU outcome (morbidity). Preoperative and intraoperative data was collected under the anaesthesiologist's supervision. Postoperative data was collected in the first three hours after admission to the ICU, except the binary outcome that was retrieved from medical records after discharge from the ICU. In total, 48 preoperative, intraoperative and postoperative continuous variables (Tables 1, 2 and 3, respectively) and 31 dichotomous variables (Tables 4, 5, 6) were used.

Table 1 Preoperative continuous variables
Table 2 Intraoperative continuous variables
Table 3 Postoperative continuous variables
Table 4 Preoperative dichotomous variables
Table 5 Intraoperative dichotomous variables
Table 6 Postoperative dichotomous variables

Cardiopulmonary bypass time (Table 2) was the total of all bypass runs if a second or subsequent period of cardiopulmonary bypass was conducted. Re-operations (Table 4) were considered as separate variables in the analysis [3].

According to the definitions of Higgins and colleagues [3, 15], emergency cases (Table 4) were defined as unstable angina, unstable hemodynamics, or ischemic valve dysfunction that could not be controlled medically. Left ventricular ejection fractions less than 35% were considered severely impaired (Table 4). Diabetes or chronic obstructive pulmonary disease (Table 4) were diagnosed only if the patient was maintained on appropriate medication.

Data was ranked chronologically on the basis of patient hospital discharge and organized in a database. The database was divided into two sets of equal size (545 cases each): a training set consisting of patients in odd positions in the original ranked database and a testing set consisting of the other patients, that is, those in even positions in the original database. To ensure that alternate allocation of cases did not introduce systematic sampling errors, training and testing data was compared using the Fisher exact test for dichotomous variables and the z-test or Mann-Whitney test for continuous normally or non-normally distributed variables, respectively [16]. Normality was assessed by the Kolmogorov-Smirnov test [16]. No significant difference was found between training and testing data, setting statistical significance at a p-value less than 0.05. Tables 1, 2, 3, 4, 5, 6 summarize the descriptive statistics of the training and test sets: continuous variables were described by means and standard deviations and dichotomous variables by frequencies and percentages.

Tables 1, 2, 3 also show the cut-off values at which continuous variables were dichotomised to develop integer score systems. They were chosen by setting sensitivity (SE) and specificity (SP) equal and testing the confidence interval for the odds ratio (for details see Section "Model description" in PartI of the present study). In the table, n.s. means that the odds ratio of the dichotomised variable was not significantly greater than 1.

Morbidity outcome was defined for patients developing at least one of the following clinical complications.

Cardiovascular complications: myocardial infarction (documented by electrocardiography and enzyme criteria); low cardiac output (requiring inotropic support for more than 24 hours, intraaortic balloon pump or ventricular assist device); or severe arrhythmias (requiring treatment or cardiopulmonary resuscitation).

Respiratory complications: prolonged ventilatory support (mechanical ventilatory support for more than 24 hours); re-intubation; tracheostomy; or clinical evidence of pulmonary embolism, edema or adult respiratory distress syndrome.

Neurological (central nervous system) complications: focal brain lesion confirmed by clinical findings and/or computed tomography; diffuse encephalopathy with more than 24 hours of severely altered mental status; or unexplained failure to awaken within 24 hours after operation.

Renal complications: acute renal failure needing dialysis

Infectious complications: culture-proven pneumonia; mediastinitis; wound infection; septicaemia with appropriate clinical findings; or septic shock.

Hemorrhagic complications: bleeding requiring re-operation

Note that the above outcome definition implies a compound endpoint of morbidity. This extensive definition is widely used when models for predicting major adverse outcomes are employed in ICU [3], although it limits the power of any single model to predict who gets a specific complication. On the other hand, it allows the number of events to be increased (the morbidity percentage in the whole patient set considered here was 20.7%) and the contribution of patient management to outcome is more evident when the endpoint occurs more frequently.

Predictive model development

The following models were developed locally to predict morbidity probability: Bayesian linear (BL) model, Bayesian quadratic (BQ) model, k-nearest neighbour (k NN) model, logistic regression (LR) model, Higgins score (HS) model derived from the previous LR model, direct score (DS) model, and two feed-forward artificial neural networks (ANNs) with one and two layers (ANN1 and ANN2, respectively). The theoretical details of the models were described in PartI of the study.

The above training and testing sets of 545 cases were used to train and test all models. Briefly, model development included: feature selection; evaluation of discrimination performance by AUC, that is area under the receiver operating curve (ROC); assessment of calibration by Hosmer-Lemeshov (HL) goodness-of-fit test using Ĉ-statistics; evaluation of accuracy by mean squared error (MSE); recalibration of model-predicted probabilities when necessary.

Artificial neural networks were trained using a batch training method which updates neural weights and biases after all training patterns have been processed, that is, after each epoch. An iterative training algorithm with gradient descendent momentum and adaptive learning rate was used to minimize MSE. The influence of initialization on the solution was reduced by always performing 99 training sessions starting from 99 different randomly-selected initial conditions; the 99 corresponding values obtained for AUC were sorted from lowest to highest and the results of the session giving the 50th value of AUC were taken.

After the stepwise feature selection was performed for each predictive model on the training data by means of proper techniques such as leave-one-out, the 95% confidence interval of AUC and its median value (AŨC) were estimated for every set of selected features using 1000 different random samples generated by the bootstrap resampling method in the training and testing sets. The same samples were used to compare AUC values of different models in test data, by performing a Wilcoxon matched-pairs signed-ranks test [16].

When applying stepwise feature selection on training data to a model, techniques, such as leave-one-out, may not ensure satisfactory generalization. The final selection of the number of features used for predicting morbidity was therefore made trading discrimination capacity off against model complexity on the bootstrap samples of testing data (that is, on data not employed in the training process). The behaviour of AUC was first analysed in these test samples in relation to the set of features selected step-by-step by the previous stepwise procedure and the number of feature (d M ) allowing the maximum value of AŨC was taken as reference point. Then AUC values obtained with a number of selected features less than d M were compared to those of d M by the Wilcoxon matched-pairs signed-ranks test. Finally, the optimal number of selected features was chosen as the minimum number ensuring no significant difference in AUC (0.05 probability) with respect to d M . Of course, if all comparisons gave significant differences, d M was chosen as the optimal number of selected features to be used for predicting morbidity.

Once optimized to ensure suitable generalization with the best discrimination performance, models with inadequate calibration were recalibrated by applying a cubic monotonic transformation (see Part I of the study) to the ranked predicted probabilities, so as to reach a more reliable estimation of morbidity probability.

All computer calculations were performed by means of locally-developed specific codes written in the Matlab programming language using the statistics and optimization toolboxes [17].


For all models, Table 7 shows the predictor variables entered step-by-step or removed during the stepwise feature selection process. Variables that were removed appear in square brackets.

Table 7 Variables entered and removed (in square brackets) at each step of the stepwise selection procedure

Figure 1 shows the median values of AUC, obtained for each model by the bootstrap resampling method, in training and testing data (continuous and dashed lines, respectively) in relation to the dimension of each best subset of features identified by the stepwise procedure on training data. Since AŨC was taken as a global index of discrimination capacity, the difference between training and testing AŨC values may be considered to evaluate model generalization as a function of the number of features in the model: the greater the difference, the greater the model overfitting. The asterisk on the curve indicates the point corresponding to the optimal set of features for predicting morbidity, that is, the minimum number of selected features ensuring AUC values not statistically different from those giving the highest AŨC in the testing bootstrap data.

Figure 1

Median values of AUC (AŨC) obtained for the eight models by the bootstrap resampling method, in relation to the dimension of each best subset of features identified by the stepwise selection procedure. AŨC patterns in the training and test data are shown as continuous and dashed lines, respectively. The asterisk on the curve indicates the point of the optimal set of features for predicting morbidity. BL, Bayes linear; BQ, Bayes quadratic; k NN, k-nearest neighbour; LR, logistic regression; HS, Higgins score; DS, direct score; ANN1, one-layer artificial neural network; ANN2, two-layer artificial neural network.

Table 8 lists the above-defined optimal set of predictor variables model by model and Table 9 shows the corresponding model performance. Discrimination capacity is quantified by AŨC calculated on bootstrap data. For testing data, 95% confidence intervals (CI) of AUC and CI%, that is, the percentage ratio of CI width to AŨC, are also given. Generalization was evaluated as the percentage difference in AŨC between training and testing data. Calibration was assessed on testing data by p of the Hosmer-Lemeshov goodness-of-fit test using Ĉ-statistics (HL-p), so that an HL-p much greater than 0.05 indicated very good model calibration, while HL-p < 0.05 revealed poor model calibration.

Table 8 Optimal feature vectors selected by different models from bootstrap test data
Table 9 Number of selected features and corresponding model performance

Most models selected more than ten features to predict morbidity in the ICU after heart surgery (Table 8). The DS model used the largest set of features (sixteen predictor variables), while the number of features used in the HS model was set equal to that chosen by the corresponding LR model, as proposed by Higgins and colleagues [3]. The Bayesian and k NN models were much more parsimonious, using less than ten features. The Bayes quadratic model required the smallest set of predictor variables (only three).

Artificial neural networks gave the highest values of AŨC on training data, but their discrimination ability decreased sharply when estimated on testing data (Table 9). This result confirms that model overfitting may be a limitation of this approach.

The Bayes quadratic model, using only three predictor variables, provided the highest AŨC on test data (Table 9). Although the 95% confidence intervals of different models were largely superimposed, the Wilcoxon matched-pairs signed-ranks test on testing bootstrap data showed significant AUC differences between various models. This means that, when the results obtained with the bootstrap data were considered couple-by-couple, one model generally gave AUC values better than another. However, despite this statistical outcome, all models showed essentially not very dissimilar discrimination capacities, because the AŨC and CI were roughly equivalent from a practical point of view for the whole group of models. All models had acceptable discrimination capacities on test data, because their AŨC was always greater that 0.7 and less than 0.8 [11]. Furthermore, the width of the CI indicated appreciable sample variability in model discrimination performance.

All models showed satisfactory generalization when evaluated in our specialized ICU, because the percentage difference in AŨC between training and testing data was always less than 8% (Table 9). However the Bayes quadratic and k NN models had very good generalization performance, while artificial neural networks and integer score models gave the worst results.

The Hosmer-Lemeshov goodness-of-fit test indicated very poor calibration for both integer score models, even after recalibration (Table 9). However, this may also depend on limitations of the HL test in assessing goodness of fit for predictive models with discrete output probabilities. The k NN model and artificial neural networks (especially ANN2) also showed poor calibration, while the Bayesian and logistic regression models gave satisfactory results. Nevertheless, the Bayesian models had to be recalibrated, whereas the logistic regression model did not.


A pool of 78 variables was taken a priori as potential predictors of morbidity in the ICU after heart surgery, so that feature selection had to be made a posteriori, considering not only training but also testing data. Although some identical features were selected from all models, the number of predictor variables identified as optimal was rather different in the various models under study. As shown in Table 8, the Bayes quadratic model was the most parsimonious. Most other models (such as integer score systems) required many more predictors.

The DS model used the largest set of predictor variables. Table 7 shows that some features were entered in this model several times during the stepwise selection procedure: oxygen extraction (O2ER) was the most selected and obtainied the highest associated score. Despite a clear tendency to overfit training data (the differences between training and testing curves in Figure 1 increased remarkably with just a few features) AŨC significantly increased in test data with the number of selected features, reaching a value of 0.779 with sixteen predictors. Unfortunately this model showed very poor calibration performance, although this result may be partly due to the limitations of the HL test or the recalibration procedure for score models with discrete outputs. Furthermore, like other predictive score models, the DS system was difficult to update with new data. In fact, updating requires a complete periodic retraining. To do this, an automatic routine can be implemented on a computer, but this defeats the choice of this simple method which does not require a computer for everyday clinical application, the reason why such systems are very popular in medicine [3, 6].

About the same number of features were selected for the LR model and ANN1. Most of the predictor variables selected by both on our ICU experimental data were the same (see Table 8). From a theoretical point of view, these two models are characterized by the same input-output nonlinear mathematical relationships, although their parameters are estimated by different approaches (for details see PartI of this study). This may justify the likeness of their discrimination results. However ANN1 performed better on training data and worse on test data, so that its generalization power was lower than that of the LR model, confirming the tendency of artificial neural networks to overfit training data. Much better results were also obtained by the LR model as regards calibration. Finally, difficulties can arise when designing and using artificial neural networks and continuous updating is practically impossible. These considerations suggests that the LR model is preferable to ANN1 for the example considered here.

The results obtained using the two-layer artificial neural network ANN2 were similar to those of ANN1, although ANN2 used a smaller feature set. Despite increased model complexity, ANN2 showed only slightly better discrimination on test data and generalization power, but worse calibration. So, when comparing ANN2 and LR performance, the same conclusion as between ANN1 and LR was reached.

As described in detail in PartI of this study, the Higgins score system was derived from the logistic regression model with the same features, by transforming continuous predictors to binary variables and LR coefficients to integer scores. Of course, the HS system suffers from the weaknesses of all integer score systems, as discussed when considering the DS model. Furthermore, the comparison of the results obtained by the LR and corresponding HS models showed that LR had better performance than the corresponding scoring system. In fact, its discrimination ability was higher on testing data and its generalization power was superior. The HS model showed very poor calibration, whereas the LR model was well calibrated even without any recalibration procedure. All this confirms that, when transforming the LR model into a simpler-to-use score system, it is necessary to carefully consider the cost of increased computational facility.

The Bayes linear model selected only eight features versus fourteen of the logistic regression model. The LR model used all the predictors used in the BL model and six additional ones. However, the number of model parameters estimated by the LR model was much less than that of BL: fifteen and fifty-two, respectively (see also Part I of this study). Despite these, the 95% confidence interval of AUC was the same for both models ([0.722–0.831] for BL versus [0.721–0.830] for LR) and the generalization power was similar. They both showed good calibration performances. This seems to confirm previous experimental findings indicating that in many practical situations the two approaches give generally similar results [6, 18]. Their application in clinical practice is not difficult. To recognize morbidity a hand calculator is sufficient, because LR uses a simple exponential relationship and the Bayesian linear decision rule can be expressed as a linear function of the observation vector [7]. Major differences can be observed for updating. The BL model can be updated with new training data simply by updating the mean vector and pooled within-sample covariance matrix estimates using simple recursive relationships, whereas the LR model is not so simple to update. We therefore judged the BL model as better than LR in the present illustrative example.

The k NN model required only five features to predict morbidity in ICU patients after heart surgery. In general, this non parametric approach did not overfit training data, so that good generalization could also be obtained using different dimensions of the feature set (see Figure 1). In fact, generalization power only decreased appreciably with six predictor variables, that is, the maximum number of features selected by the stepwise procedure on training data. However its calibration was poor and AŨC computed on test data was the second last of all the models considered. Besides its computational cost and need for large data storage made this model unpromising, unless comparison of test cases with their k neighbours is considered important for comparative diagnosis.

The Bayes quadratic model had the highest discrimination capacity on test data, using the minimum number of features (oxygen extraction, oxygen delivery and need for cardiac inotropic drugs after the operation). AŨC calculated by means of the bootstrap resampling method was almost the same for training and testing data (percentage difference less than 1%). The quality of the results in the scenario considered may also be due to the small number of parameter estimates required by the model. In fact, with three predictor variables and two classes, the BQ model required the estimation of eighteen parameters (mean vectors and covariance matrices of the two classes). This model parameter number is about the same as that of the LR model (fifteen model parameters), but much less than that of BL (fifty-two model parameters). Like the BL model, the BQ model can be recursively updated whenever a new case has to be included in the training set. Finally, after recalibration, Hosmer-Lemeshov goodness-of-fit test using Ĉ-statistics indicated adequate model calibration. These considerations make the BQ model a convincing alternative to the BL and LR approaches for the present application.

It can be noted that two of the three predictors selected by the Bayes quadratic model (oxygen extraction and need for cardiac inotropic drugs after the operation) were chosen by all models. This means that these two variables were essential features for predicting morbidity in the scenario considered. Of course, the need for inotropic drugs after the operation is strongly correlated with poor cardiac function, while the key role played by oxygen extraction confirms the results of a previous study, in which increased oxygen extraction immediately after heart surgery has been indicated as an independent predictor of prolonged ICU stay [19]. The third predictor used by the BQ model was oxygen delivery and inadequate oxygen delivery has also been associated with prolonged ICU stay after heart surgery [20]. Increased levels of oxygen delivery and consumption have also been associated with improved outcome [21] and this fact has been tested in various clinical situations [2224]. Knowledge of oxygen extraction and oxygen delivery is fundamental for assessing the relationship between oxygen consumption and oxygen delivery [25], though in many cases, mixed venous oxygen saturation or even central venous oxygen saturation alone may suffice [26]. Previous studies have shown that when oxygen saturation in the superior vena cava is used as a guide, early goal-directed therapy may provide significant benefits for outcome in ICU patients with severe sepsis and septic shock [27] and that this approach may reduce the length of hospital stay and the degree of organ dysfunction of heart surgery patients at discharge [28].

Statistical predictive models and artificial neural networks are black-box systems allowing cases to be allocated to different classes, but they do not lend themselves to interpretation of the underlying causes. However, when the number of the selected predictor variables is sufficiently small it may be interesting to seek an explanatory interpretation of the predictive model results a posteriori. In everyday life we are accustomed to considering phenomena in three dimensions. It is therefore difficult to expound the meaning of systems (such as predictive models) working in more than three dimensions. However, when the predictive model uses two or three features, a rational interpretation of its results may be attempted. The BQ model developed on our ICU data used only three features to predict morbidity outcomes, so that an interpretation of the result obtained was sought. First of all it is useful to recall that oxygen extraction is the ratio of oxygen consumption to oxygen delivery. A recent paper showed that the relationship of oxygen consumption to oxygen delivery is an important concept, even if its practical application is not simple and decisions regarding the need for strategies to increase and maintain oxygen delivery require the interpretation of many measurements [26]. The BQ predictive model seems to confirm these findings, because its decision boundary is given by a quadratic form of the three selected features in the three-dimensional space. In the clinical example used to locally develop the predictive model, this means that the cut-off value of oxygen delivery separating morbid and normal course classes does not remain constant or vary in a linear fashion as a function of oxygen extraction. Furthermore, this boundary changes in patients requiring cardiac inotropic drugs after the operation. Figure 2 clarifies this finding. Continuous and broken lines represent the decision boundaries in the oxygen extraction/oxygen delivery plane for patients to whom cardiac inotropic drugs are and are not administered, respectively. Patients at risk of morbidity are located below the decision boundary. The decision boundary moves up for patients who require drug administration after the operation, indicating that for these patients, the risk of morbidity will be high even at higher values of oxygen delivery.

Figure 2

Decision boundaries separating morbid and normal course classes in the oxygen extraction/oxygen delivery plane for patients to whom cardiac inotropic drugs were (continuous line) and were not (broken line) administered. Patients at risk for morbidity are located below the decision boundary.

As a conclusion, the Bayes quadratic model seemed to identify a minimum core of predictor variables generally recognized as essential for a pragmatic evaluation of the risk of morbidity after heart surgery. When this set of predictors was used on test data, it gave good discrimination, generalization and calibration, which were similar or better than those obtained with the Bayes linear or logistic regression models. Because of the small number of predictors to be monitored, clinicians may also more easily track and rationally interpret time courses of patient status, and consequently make prompt decisions about optimal therapeutic strategies. Of course, this does not mean that the Bayes quadratic approach is always the best model for predicting morbidity in ICU patients. However it provided a good compromise between system complexity and predictive performance in our example.


The purpose of the present study was to analyse and compare different predictive models for estimating patient morbidity in the ICU after heart surgery. In this second part of the study we developed and tested eight popular predictive models with preoperative, intraoperative and postoperative data acquired in adult patients who underwent coronary artery bypass grafting. This part of the study supplements Part I in which different approaches for developing predictive morbidity models were reviewed in a unitary framework from a theoretical point of view.

The experimental results indicated that all models provided acceptable discrimination in test data and satisfactory generalization in our illustrative example. On the contrary poor calibration was obtained with scoring systems, the k-nearest neighbour model and artificial neural networks, while Bayes and logistic regression models gave satisfactory results. Most of models selected more than ten features to predict morbidity. Scoring systems and logistic regression model required the largest set of predictors, while Bayesian and k NN models were much more parsimonious, requiring less than ten features.

The Bayes quadratic model required the smallest set of predictor variables (only three: oxygen extraction, oxygen delivery and use of cardiac inotropic drugs after the operation) and provided very interesting results, which were similar or better than those obtained with the Bayes linear or logistic regression models. Unlike logistic regression models, an additional intrinsic strength of Bayesian models is that they can be updated in a straightforward manner, including new correctly classified cases into the training set, since this just involves the updating of mean vector and covariance matrix estimates by means of simple recursive relationships.

Because of the small number of predictors needed, the Bayes quadratic linear model also enabled an explanatory interpretation of the results obtained in our example. In particular, the BQ model seemed to confirm previous experimental findings proving that the relationship between oxygen consumption and oxygen delivery is a key issue for guiding therapy.

In conclusion, both theoretical and experimental findings indicate that the Bayes quadratic model offers a good compromise between complexity and predictive performances and can therefore be a convincing alternative to other much more extensively used predictive models (such as scoring systems or even Bayes linear and logistic regression models) in many clinical applications.

Note: This paper is accompanied by Part I, which gives a comprehensive review of several methods used to plan predictive models [29].



artificial neural network


area under the ROC curve


Bayes linear


Bayes quadratic


confidence interval


direct score




Higgins score


intensive care unit

k NN:

k-nearest neighbour


logistic regression


mean squared error


receiver operating characteristic.


  1. 1.

    Heijmans JH, Maessen JG, Roekaerts PMHJ: Risk stratification for adverse outcome in cardiac surgery. Eur J Anaesthesiol. 2003, 20: 515-527. 10.1017/S0265021503000838.

  2. 2.

    Fortescue EB, Kahn K, Bates DW: Development and validation of a clinical prediction rule for major adverse outcomes in coronary bypass grafting. Am J Cardiol. 2001, 88: 1251-1258. 10.1016/S0002-9149(01)02086-0.

  3. 3.

    Higgins TL, Estafanous FG, Loop FD, Beck GJ, Lee JC, Starr NJ, Knaus WA, Cosgrove III DM: ICU admission score for predicting morbidity and mortality risk after coronary artery bypass grafting. Ann Thorac Surg. 1997, 64: 1050-1058. 10.1016/S0003-4975(97)00553-5.

  4. 4.

    Edwards FH, Peterson RF, Bridges C, Ceithaml EL: Use of a Bayesian statistical model for risk assessment in coronary artery surgery. Ann Thorac Surg. 1995, 59: 1611-1612. 10.1016/0003-4975(95)00189-R.

  5. 5.

    Marshall G, Shroyer ALW, Grover FL, Hammermeister KE: Bayesian-logit model for risk assessment in coronary artery bypass grafting. Ann Thorac Surg. 1994, 57: 1492-1500.

  6. 6.

    Biagioli B, Scolletta S, Cevenini G, Barbini E, Giomarelli P, Barbini P: A multivariate Bayesian model for assessing morbidity after coronary artery surgery. Crit Care. 2006, 10: R94-10.1186/cc4441.

  7. 7.

    Fukunaga K: Introduction to Statistical Pattern Recognition. 1990, Boston: Academic Press

  8. 8.

    Krzanowski WJ: Principles of Multivariate Analysis: A User's Perspective. 1988, Oxford: Clarendon Press

  9. 9.

    Sivia DS, Skilling J: Data Analysis: A Bayesian Tutorial. 2006, Oxford: Oxford University Press

  10. 10.

    Duda RO, Hart PE, Stork DG: Pattern Classification. 2001, New York: John Wiley and Sons

  11. 11.

    Hosmer DW, Lemeshow S: Applied Logistic Regression. 2000, New York: Wiley

  12. 12.

    Dreiseitl S, Ohno-Machado L: Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform. 2002, 35: 352-359. 10.1016/S1532-0464(03)00034-0.

  13. 13.

    Bishop CM: Neural Networks for Pattern Recognition. 1995, New York: Oxford University Press Inc

  14. 14.

    Giomarelli P, Scolletta S, Borrelli E, Biagioli B: Myocardial and lung injury after cardiopulmonary bypass: Role of interleukin (IL)-10. Ann Thorac Surg. 2003, 76: 117-123. 10.1016/S0003-4975(03)00194-2.

  15. 15.

    Higgins TL, Estafanous FG, Loop FD, Beck GJ, Blum JM, Paranandi L: Stratification of morbidity and mortality outcome by preoperative risk factors in coronary artery bypass patients. Jama. 1992, 267: 2344-2348. 10.1001/jama.267.17.2344.

  16. 16.

    Armitage P, Berry G: Statistical methods in medical research. 1987, Oxford: Blackwell Scientific Publications

  17. 17.

    MATLAB: The Language of Technical Computing, Using MATLAB, Version 7. 2004, Natick, MA: The MathWorks Inc

  18. 18.

    Testi D, Cappello A, Chiari L, Viceconti M, Gnudi S: Comparison of logistic and Bayesian classifiers for evaluating the risk of femoral neck fracture in osteoporotic patients. Med Biol Eng Comput. 2001, 39: 633-637. 10.1007/BF02345434.

  19. 19.

    Pölönen P, Hippeläinen M, Takala R, Ruokonen E, Takala J: Relationship between intra- and postoperative oxygen transport and prolonged intensive care after cardiac surgery: a prospective study. Acta Anaesthesiol Scand. 1997, 41: 810-817.

  20. 20.

    Routsi C, Vincent JL, Bakker J, De Backer D, Lejeune P, d'Hollander A, Le Clerc JL, Kahn RJ: Relation between oxygen consumption and oxygen delivery in patients after cardiac surgery. Anesth Analg. 1993, 77: 1104-1110. 10.1213/00000539-199312000-00004.

  21. 21.

    Shoemaker WC, Montgomery ES, Kaplan E, Elwyn DH: Physiologic patterns in surviving and nonsurviving patients: use of sequential cardiorespiratory variables in defining criteria for therapeutic goals and early warning of death. Arch Surg. 1973, 106: 630-636.

  22. 22.

    Tuchschmidt J, Fried J, Astiz M, Rackow E: Elevation of cardiac output and oxygen delivery improves outcome in septic shock. Chest. 1992, 102: 216-220. 10.1378/chest.102.1.216.

  23. 23.

    Boyd O, Grounds RM, Bennett ED: A randomized clinical trial of the effect of deliberate perioperative increase of oxygen delivery on mortality in high-risk surgical patients. JAMA. 1993, 270: 2699-2707. 10.1001/jama.270.22.2699.

  24. 24.

    Gattinoni L, Brazzi L, Pelosi P, Latini R, Tognoni G, Pesenti A, Fumagalli R: A trial of goal-oriented hemodynamic therapy in critically ill patients. N Eng J Med. 1995, 333: 1025-1032. 10.1056/NEJM199510193331601.

  25. 25.

    Vincent JL: Determination of O2 delivery and consumption vs cardiac index vs oxygen extraction ratio. Crit CareClin. 1996, 12: 995-1006. 10.1016/S0749-0704(05)70288-8.

  26. 26.

    Vincent JL, De Backer D: Oxygen transport-the oxygen delivery controversy. Intensive Care Med. 2004, 30: 1990-1996. 10.1007/s00134-004-2384-4.

  27. 27.

    Rivers E, Nguyen B, Havstad S, Ressler J, Muzzin A, Knoblich B, Peterson E, Tomlanovich M: Early goal-directed therapy in the treatment of severe sepsis and septic shock. N Engl J Med. 2001, 345: 1368-1377. 10.1056/NEJMoa010307.

  28. 28.

    Pölönen P, Ruokonen E, Hippeläinen M, Pöyhönen M, Takala J: A prospective, randomized study of goal-oriented hemodynamic therapy in cardiac surgical patients. Anesth Analg. 2000, 90: 1052-1059. 10.1097/00000539-200005000-00010.

  29. 29.

    Barbini E, Cevenini G, Scolletta S, Biagioli B, Giomarelli P, Barbini P: A comparative analysis of predictive models of morbidity in intensive care unit after cardiac surgery – Part I: model planning. BMC Med Inf Dec Mak. 7: 35-10.1186/1472-6947-7-35.

Pre-publication history

  1. The pre-publication history for this paper can be accessed here:

Download references


This work was supported by the Italian Ministry of University and Research (MIUR – PRIN).

Author information



Corresponding author

Correspondence to Paolo Barbini.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

All authors participated in the study plan and coordination. GC and PB were concerned with medical informatics and biostatistical aspects of the study. EB was concerned with epidemiology and biostatistical aspects of the study. SS, BB and PG were involved in clinical aspects. SS collected clinical data. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Cevenini, G., Barbini, E., Scolletta, S. et al. A comparative analysis of predictive models of morbidity in intensive care unit after cardiac surgery – Part II: an illustrative example. BMC Med Inform Decis Mak 7, 36 (2007).

Download citation


  • Intensive Care Unit
  • Artificial Neural Network
  • Oxygen Delivery
  • Oxygen Extraction
  • Discrimination Capacity