Models for the prediction of patient risk are increasingly used in critical care because they allow diagnostic and prognostic information to be derived precisely and evaluated quantitatively. Scoring systems are a particularly attractive class of model in clinical practice because of their simplicity of application, especially where a rapid and effective decision must be taken to evaluate patient status correctly.

Despite their simplicity, when carefully designed, scoring systems have proven close enough in accuracy to more complex models, such as logistic regression, the Bayesian classification rule or artificial neural networks, that their clinical application is not excluded [9, 18]. A major limitation is calibration, i.e. the identification of a proper quantitative association between score values and prognostic risk probabilities. Reliable individual prediction of this probability is very important in clinical practice, as it is a useful tool for medical decision making, patient risk reduction, optimal planning of clinical resources and welfare cost saving.

One idea would be to estimate the risk probability directly by dividing the score of the test patient by the maximum possible score. However, this method may lead to unreliable results, and the Hosmer-Lemeshow test (developed for logistic-regression models) may not be appropriate for models with discrete outputs such as scoring systems [13]. A more straightforward approach is of course to focus on the statistics of the score classes determined by the model. In the original paper by Higgins and colleagues [18], the risk levels of test patients were categorized on the basis of similar outcomes in the training set. However, an accurate estimate of the uncertainty associated with parameter estimates is important to avoid misleading inference. This uncertainty is usually summarized by a confidence interval, which is claimed to have a specified probability of including the true parameter value. In particular, confidence intervals combine point estimation and hypothesis testing in a single inferential statement of great intuitive appeal. For predictive scoring systems, a crucial point is therefore the assessment of the confidence intervals of the estimated risk probabilities.
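As a concrete illustration of why the naive score-to-probability mapping can mislead, the following sketch compares the score/max-score estimate with the empirical event rate of each score class. The outcome counts and the maximum score used here are invented for illustration, not the study's data:

```python
# Compare two ways of turning a score into a risk estimate.
# All counts below are hypothetical.
max_score = 7

# Hypothetical training outcomes grouped by score class:
# score -> outcomes (1 = morbidity, 0 = no morbidity)
outcomes_by_score = {
    0: [0] * 40 + [1] * 1,
    2: [0] * 30 + [1] * 3,
    4: [0] * 20 + [1] * 6,
    6: [0] * 10 + [1] * 7,
}

for score, outcomes in outcomes_by_score.items():
    naive = score / max_score                  # score divided by maximum score
    empirical = sum(outcomes) / len(outcomes)  # observed event rate in the class
    print(f"score {score}: naive {naive:.2f}  empirical {empirical:.2f}")
```

Even in this toy example the two estimates diverge markedly (for a score of 4 the naive estimate is 0.57, while the class event rate is about 0.23), which is why the class-based approach, together with interval estimates, is preferred.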

The bootstrap technique is a resampling method for statistical inference commonly used to estimate confidence intervals [14, 15, 21]. Although all bootstrap confidence intervals fail to perform well in some situations, this should not overly discourage their use. Some intervals work very well in many situations, and even when they do not work so well, they may still be better than most alternatives. The bootstrap method is more transparent, simpler and more general than conventional approaches. Understanding the rationale behind it does not require any deep knowledge of mathematics or probability theory. The assumptions on which it depends are less restrictive and more easily checked than those underlying conventional methods, and it can be applied in situations where conventional intervals are difficult or impossible to derive.
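As a sketch of how such an interval can be obtained in practice, the bias-corrected and accelerated (BCa) bootstrap interval for the event rate of a single score class can be computed with SciPy. The outcome data below (8 events in 50 patients) are invented for illustration:

```python
import numpy as np
from scipy.stats import bootstrap

# Hypothetical score class: 50 patients, 8 with the adverse outcome.
outcomes = np.array([1] * 8 + [0] * 42)

rng = np.random.default_rng(0)
res = bootstrap(
    (outcomes,),            # data passed as a one-sample tuple
    np.mean,                # statistic: the observed event rate
    confidence_level=0.95,
    n_resamples=9999,
    method="BCa",           # bias-corrected and accelerated interval
    random_state=rng,
)
lo, hi = res.confidence_interval
print(f"morbidity rate {outcomes.mean():.3f}, 95% BCa CI ({lo:.3f}, {hi:.3f})")
```

The BCa method adjusts the plain percentile interval for bias and skewness of the bootstrap distribution, which matters for small classes with low event rates such as those encountered here.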

On the basis of the above considerations, we proposed a more informative approach for developing and selecting competing scoring systems to predict adverse outcomes in medical applications. Model selection accounts not only for the discriminating power and generalizability of the predictive model, but also for the trustworthiness of the estimated prognostic probability associated with each score class. In particular, for each scoring system the 95% confidence intervals of the prognostic probabilities are estimated by the bias-corrected and accelerated (BCa) bootstrap technique. As an example, the procedure was applied to data collected in heart-surgery patients who underwent coronary artery bypass grafting. Since heart surgery has advanced considerably in recent years, mortality is now low, and morbidity was therefore considered a valid end point and a more attractive target for developing the risk model. The low prevalence of adverse events negatively affects the estimation of confidence intervals for prognostic probabilities, so end points corresponding to quite rare events, such as death after heart surgery, should be avoided in designing a risk model.

The illustrative example uses a sample of 1090 patients, which was divided into a training set and a testing set of equal size. Cross-validation would naturally be a more efficient approach, though more demanding computationally. However, our choice can be considered satisfactory, because the sample size was large enough. It also allowed us to easily define training and testing sets with equal percentage morbidity and to verify that the random allocation of patients to training and testing did not introduce systematic sampling errors.
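The stratified half-split described above can be sketched as follows. The 12% morbidity rate used to generate the hypothetical labels is an assumption for illustration, not the study's figure:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical cohort: 1090 patients with an assumed ~12% morbidity rate.
n = 1090
y = np.zeros(n, dtype=int)
y[: int(round(0.12 * n))] = 1

# Split each outcome group in half at random, so that the training and
# testing sets have (near-)equal size and equal percentage morbidity.
train_idx, test_idx = [], []
for label in (0, 1):
    idx = rng.permutation(np.flatnonzero(y == label))
    half = len(idx) // 2
    train_idx.extend(idx[:half])
    test_idx.extend(idx[half:])

train_rate = y[train_idx].mean()
test_rate = y[test_idx].mean()
print(f"train morbidity {train_rate:.3f}, test morbidity {test_rate:.3f}")
```

Stratifying by outcome in this way guards against a chance imbalance in event prevalence between the two halves, which the text notes was verified explicitly.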

The procedure developed enabled us to evaluate and compare several different models in the example considered here. In our opinion, the model with six score classes (Figure 4) has many advantages. First, the scoring system is based on only six variables, two of which give information about preoperative status, one is related to surgery, and the other three are postoperative variables. Despite the small number of predictors, the model shows good discriminating power, also achieving a satisfactory compromise between discrimination and generalization. Finally, it allows patients to be divided into a reasonable number of classes, most of which are characterized by well-separated confidence intervals of prognostic probabilities. The only limitation of the model is one score class whose morbidity-probability confidence interval partially overlaps those of the two adjacent classes, so that patients with a score of 4 should be cautiously assimilated to those with a score greater than 4.

Two intrinsically dichotomous preoperative variables (emergency status and peripheral vascular disease) are used in the scoring system. Emergency status is a known significant preoperative predictor of poor outcome. Emergency patients are more likely to have other risk factors on admission to the ICU, such as low cardiac index, decreased serum albumin, higher alveolar-arterial oxygen gradient, elevated central venous pressure, and tachycardia [18]. Peripheral vascular disease is another important morbidity predictor after coronary artery bypass surgery, especially in predicting severe or mild neurological complications [24].

The postoperative variables are the oxygen extraction ratio (O₂ER ≥ 40%), carbon dioxide production (< 180 ml/min) and the need for cardiac inotropic drugs after the operation. In particular, the weight of O₂ER is twice that of the other predictors, indicating a key role for this variable. This important role confirms the results of a previous study, in which increased O₂ER immediately after heart surgery was indicated as an independent predictor of prolonged ICU stay [25]. O₂ER reflects the balance between oxygen consumption and oxygen delivery, providing information about compensatory increased extraction in hypovolemia and heart failure.
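Taken together, the six dichotomized predictors and the doubled weight of the oxygen extraction ratio suggest a simple additive rule. The following sketch is illustrative only: the unit weights assigned to the other five predictors, and the resulting maximum score, are assumptions and not the paper's published coefficients:

```python
# Hypothetical additive scoring rule over the six predictors named in
# the text. Only the doubling of the O2ER weight comes from the text;
# the remaining unit weights are assumptions for this sketch.
WEIGHTS = {
    "emergency": 1,
    "peripheral_vascular_disease": 1,
    "long_cpb_time": 1,          # cardio-pulmonary bypass time above cutoff
    "o2er_high": 2,              # oxygen extraction ratio >= 40%
    "low_co2_production": 1,     # carbon dioxide production < 180 ml/min
    "inotropes_needed": 1,
}

def risk_score(patient: dict) -> int:
    """Sum the weights of the risk factors present for this patient."""
    return sum(w for name, w in WEIGHTS.items() if patient.get(name, False))

example = {"emergency": True, "o2er_high": True, "inotropes_needed": True}
print(risk_score(example))  # 1 + 2 + 1 = 4
```

Under these assumed weights, a patient's score is simply the count of risk factors present, with the high-O₂ER factor counted twice; the score classes discussed above would then partition this integer range.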

The only intraoperative variable in the model is cardio-pulmonary bypass time (CPBt). This variable has been identified as a risk factor in similar studies [3, 18]. In particular, the role of CPBt in the development of hyperlactatemia during cardio-pulmonary bypass has been highlighted by other authors [26, 27]. Hyperlactatemia is a well-recognized marker of circulatory failure, and its severity has been associated with mortality in different clinical conditions [28, 29]. In particular, high blood lactate levels during cardiopulmonary bypass are associated with tissue hypoperfusion and may contribute to severe postoperative complications. Patients with high blood lactate levels during cardiopulmonary bypass generally need greater and longer hospital care because postoperative morbidity is significantly more frequent [30]. Despite this evidence, the association between hyperlactatemia and postoperative mortality remains a much debated question, because different authors have come to different conclusions [30, 31]. Finally, the data used to develop the scoring system did not account for hyperlactatemia directly but only for CPBt, suggesting that the probability of morbidity may not be properly estimated in off-pump patients.