A straightforward approach to designing a scoring system for predicting length-of-stay of cardiac surgery patients

Background: Length-of-stay prediction for cardiac surgery patients is a key point for medical management issues, such as optimization of resources in intensive care units and operating room scheduling. Scoring systems are a very attractive family of predictive models, but their retraining and updating are generally critical. The present approach to designing a scoring system for predicting length of stay in intensive care aims to overcome these difficulties, so that a model designed in a given scenario can easily be adjusted over time or for internal purposes.
Methods: A naïve Bayes approach was used to develop a simple scoring system. A set of 36 preoperative, intraoperative and postoperative variables collected in a sample of 3256 consecutive adult patients undergoing heart surgery were considered as likely risk predictors. The number of variables was reduced by selecting an optimal subset of features. Scoring system performance was assessed by cross-validation.
Results: After the selection process, seven variables were entered in the prediction model, which showed excellent discrimination, good generalization power and suitable sensitivity and specificity. No significant difference was found between AUC of the training and testing sets. The 95% confidence interval for AUC estimated by the BCa bootstrap method was [0.841, 0.883] and [0.837, 0.880] in the training and testing sets, respectively. Chronic dialysis, low postoperative cardiac output and acute myocardial infarction proved to be the major risk factors.
Conclusions: The proposed approach produced a simple and trustworthy scoring system, which is easy to update regularly and to customize for other centers. This is a crucial point when scoring systems are used as predictive models in clinical practice.


Background
Prediction models are increasingly important in clinical practice, as indicated by the number of recent publications describing their development. One of their purposes is to aid clinical decision-making by combining patient characteristics in order to estimate the probability of a certain disorder or problem (diagnosis and prognosis). In particular, prognostic models are widely accepted in intensive care units (ICUs) for predicting outcome of critical patients [1][2][3][4][5][6]. In many cases, these models are scoring systems in which the predictor variables are usually selected and scored subjectively by expert consensus or objectively using statistical methods [7,8].
While mortality can be considered the primary outcome, over the years technological advances have led to a significant decrease in mortality for certain patient populations, for example cardiac surgery patients. In these cases, morbidity or prolonged stay in intensive care have been suggested as valid end points and more attractive targets for developing operative risk models. In particular, models that estimate the length-of-stay (LOS) in ICU of cardiac surgery patients can be very useful for internal purposes. Reliable prediction of LOS is the starting point for good internal management of operating rooms. When a prediction system is developed primarily for internal purposes, such as operating-room scheduling, the model should not only be simple, reliable and characterized by high sensitivity/specificity, but also easy to modify, so that clinicians can customize it to their specific patient subpopulation and update it with new data sets. Unfortunately, retraining and updating are critical points of scoring systems, because the design and development of scoring systems generally imply theoretical modeling ability and statistical procedures seldom available in a clinical environment, and make it complicated to modify a given model. Thus, in clinical practice, scoring systems are usually used in their original standard form, as developed with training data from different countries and/or centers. If data from the specific scenario is not considered during model training, there can be significant loss in model performance [9].
Model customization is essential when it is difficult to standardize local practices and patient populations differ [9][10][11][12][13]. Easy updating is another crucial feature. In fact, acquisition of new, correctly classified patients enables the training set to be increased day by day, improving model performance in a corresponding way. Progress in medical techniques also makes it necessary to be able to update the model continuously. It is therefore fundamental to use approaches allowing the decision rule to be derived in a straightforward manner so that it is easily modified, locally customized, updated and validated.
In the present study, a scoring system was designed to predict prolonged stay in intensive care after heart surgery, using a straightforward approach recently proposed [14]. It is based on the naïve Bayes rule [15], which generally shows good classification accuracy, even when the assumption of independence does not hold [16][17][18]. Although the prediction model was trained using a sample of patients who underwent heart surgery in a specific institution, it can be modified directly to customize it for other centers.

Scoring-system development and validation
The scoring system was developed using a simple approach recently proposed [14], which uses the well-known Bayes rule assuming that features are all conditionally independent of each other given the class. This strong (naïve) assumption drastically simplifies the problem of estimation from training data.
Given an N-dimensional observation vector x = (x_1, x_2, …, x_N) and two patient classes ω_1 and ω_2 (adverse and normal outcome, respectively), the decision rule was written as

$$\sum_{j=1}^{N} w_{x_j} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \ln\frac{P(\omega_2)}{P(\omega_1)} \qquad (1)$$

where P(ω_i) is the a priori probability of class ω_i (i = 1, 2) and w_{x_j} = ln[P(x_j | ω_1) / P(x_j | ω_2)] (j = 1, 2, …, N) are log-likelihood ratios, which can be calculated directly from data acquired in any specific institution.
We chose this type of scoring system because it is easily customized to any specific scenario and it also can be easily updated by entering new and removing older data (for more details see Ref. [14]).
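The way such weights can be obtained from local data may be easier to see in code. The following is a minimal sketch, not the authors' implementation: it assumes discrete features, class label 1 for adverse outcome (ω_1) and 2 for normal outcome (ω_2), and a Laplace pseudo-count (`smoothing`) to avoid taking the logarithm of zero. The function names, the smoothing choice and the toy records are all hypothetical.

```python
import math
from collections import Counter


def fit_weights(records, labels, smoothing=1.0):
    """Estimate one log-likelihood-ratio weight per feature value.

    records:   list of tuples of discrete feature values
    labels:    list with 1 (adverse outcome) or 2 (normal outcome)
    smoothing: Laplace pseudo-count (an assumption, not taken from the paper)
    """
    n1 = sum(1 for y in labels if y == 1)
    n2 = len(labels) - n1
    weights = []
    for j in range(len(records[0])):
        values = sorted({r[j] for r in records})
        c1 = Counter(r[j] for r, y in zip(records, labels) if y == 1)
        c2 = Counter(r[j] for r, y in zip(records, labels) if y == 2)
        k = len(values)
        w_j = {}
        for v in values:
            p1 = (c1[v] + smoothing) / (n1 + smoothing * k)  # P(x_j = v | class 1)
            p2 = (c2[v] + smoothing) / (n2 + smoothing * k)  # P(x_j = v | class 2)
            w_j[v] = math.log(p1 / p2)
        weights.append(w_j)
    return weights


def score(x, weights):
    """Sum the weights of the observed feature values."""
    return sum(w_j[v] for v, w_j in zip(x, weights))


def classify(x, weights, prior1=0.5, prior2=0.5):
    """Return 1 (adverse) if the score exceeds ln(P2 / P1), else 2 (normal)."""
    threshold = math.log(prior2 / prior1)
    return 1 if score(x, weights) > threshold else 2


# Toy data: two dichotomous features, four patients per class (hypothetical).
records = [(1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (0, 0), (0, 1), (1, 0)]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
weights = fit_weights(records, labels)
print(weights)
print(classify((1, 1), weights), classify((0, 0), weights))
```

Under these assumptions, updating or customizing the system amounts to calling `fit_weights` again on the revised set of records, since the weights depend only on per-class counts.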
After selecting the subset of features to include in the predictive model, scoring system performance was assessed by five-fold cross-validation, randomly dividing the sample into five roughly equal non-overlapping subsamples. The whole validation process required five rounds, with each of the five subsamples used exactly once as testing data. In particular, in each round, a single subsample was retained as the validation data for testing the model, and the remaining four subsamples were used as training data to estimate the weight of each feature in the scoring system. This allowed us to assess the performance of the scoring system when its parameters (weights) were estimated on datasets different from testing data.
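The partitioning step described above can be sketched as follows. This is an illustration only: the shuffling seed, the function name and the way the sample is dealt into folds are assumptions, not details reported in the paper.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)                    # random assignment
    folds = [idx[k::n_folds] for k in range(n_folds)]   # roughly equal sizes
    for k in range(n_folds):
        test = folds[k]
        train = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield train, test


# 3256 is the sample size reported in the study.
splits = list(five_fold_indices(3256))
for train, test in splits:
    print(len(train), len(test))
```

In each round the weights would be estimated on `train` only and the resulting scores evaluated on `test`.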
AUC and its 95% confidence interval were calculated in the training and testing sets. In particular, the bias-corrected and accelerated (BCa) bootstrap method was used to estimate the 95% confidence intervals of AUC, using one thousand bootstrapped samples generated from original data [19].
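For readers unfamiliar with the BCa interval, the sketch below shows one common way of computing it for the AUC using only the Python standard library. It is not the SPSS/MATLAB code used in the study: the pairwise (Mann-Whitney) AUC estimate, the jackknife estimate of the acceleration, the index convention for the percentiles and the synthetic scores are all assumptions made for illustration.

```python
import random
from statistics import NormalDist


def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bca_interval(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated bootstrap interval for the AUC."""
    scores, labels = list(scores), list(labels)
    rng, nd, n = random.Random(seed), NormalDist(), len(scores)
    theta_hat = auc(scores, labels)

    boot = []
    while len(boot) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:  # a resample needs both classes
            boot.append(auc([scores[i] for i in idx], lab))
    boot.sort()

    # Bias correction: share of bootstrap estimates below the point estimate.
    prop = sum(b < theta_hat for b in boot) / n_boot
    prop = min(max(prop, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = nd.inv_cdf(prop)

    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [auc(scores[:i] + scores[i + 1:], labels[:i] + labels[i + 1:])
            for i in range(n)]
    jm = sum(jack) / n
    den = 6 * sum((jm - j) ** 2 for j in jack) ** 1.5
    a = sum((jm - j) ** 3 for j in jack) / den if den > 0 else 0.0

    def adjusted(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(q):
        return boot[min(n_boot - 1, max(0, int(q * n_boot)))]

    return pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))


# Synthetic scores (hypothetical): 40 normal and 20 adverse cases.
rng = random.Random(1)
labels = [0] * 40 + [1] * 20
scores = ([rng.gauss(0.0, 1.0) for _ in range(40)]
          + [rng.gauss(1.5, 1.0) for _ in range(20)])
lo, hi = bca_interval(scores, labels)
print(auc(scores, labels), lo, hi)
```

Where SciPy is available, `scipy.stats.bootstrap` with `method='BCa'` offers a maintained alternative to this hand-written version.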
The prior probabilities P(ω_1) and P(ω_2) were both assumed to be 0.5, so that the threshold value in equation 1 was set at zero. All computations were done using IBM SPSS Statistics (IBM Corp., Armonk, New York, USA) and MATLAB (The MathWorks, Inc., Natick, Massachusetts, USA) code.

Study population and feature selection
The data set for developing the locally customized scoring system was retrieved from the computerized database of the Department of Medical Biotechnologies of Siena University. Due to the retrospective nature of the study, the need for informed consent was waived. The authors did not have direct access to this institutional database. Aggregate patient data was provided anonymously. Use of anonymous aggregate data is permitted because it does not implicate the privacy concerns that apply to patient-identifiable information. The study was undertaken after approval of the Ethics Committee (Comitato etico locale e comitato etico per la sperimentazione clinica dei medicinali) of Siena University Hospital and was conducted in compliance with the Helsinki declaration.
A sample of consecutive adult patients who underwent heart surgery between 2000 and 2007 was used. Exclusion criteria included operation without cardiopulmonary bypass, heart or heart-lung transplant, aortic dissection, age less than 18 years and death. Some records (about 1.5%) were excluded from the analysis because they contained insufficient data to design the classifier. The final size of the sample used in the study was 3256 patients. These patients underwent isolated coronary artery bypass grafting (CABG) and isolated valve or combined procedures (CABG plus valve) at the Cardiac Surgery Unit of Siena University Hospital, Italy.
Length of stay in the ICU was chosen as outcome. Adverse outcome was defined as LOS greater than or equal to 5 days (i.e. 120 hours), and normal LOS was defined as less than 5 days. The mean and standard deviation of LOS were 68 and 112 hours, respectively.
A collection of 36 preoperative, intraoperative and postoperative dichotomous variables and 16 non dichotomous (continuous or discrete) variables was considered a priori as a wide set of features for predicting patient outcome on the basis of clinical judgment and past experience [13]. Preoperative and intraoperative data was collected under the anaesthesiologist's supervision. Postoperative data was collected in the first three hours after admission to the ICU.
To lower the bias of the naïve independence assumption, the above number of variables was reduced by a procedure aimed at selecting an optimal subset of features to include in the predictive model. Firstly, the discrimination power of each variable was evaluated individually to eliminate less important features once and for all. For this purpose we calculated the 99% confidence intervals of the odds ratio [20] for each dichotomous variable and for other variables, dichotomized on the basis of their medians (cut-off point). Only variables with an odds ratio significantly different from 1 (p < 0.01) were chosen as potential competing features to be taken into consideration for the final stepwise selection, using the receiver operating characteristic (ROC) curve [21]. The direction of search proceeded in a forward manner [22] and, at each step of the algorithm, the variable giving the best increase in area under the ROC curve (AUC) was entered in the model [23]. To decrease the chance of entering redundant features that might introduce dependencies, the criterion for halting the search process was slightly less conservative than the one suggested in previous papers [22,23]: the procedure was stopped when the cumulative increment in AUC obtained in three consecutive steps was less than 1%. Table 1 shows the 99% confidence interval of the odds ratio for the whole set of 36 preoperative, intraoperative and postoperative dichotomous variables chosen a priori on the basis of clinical judgment and past experience. Table 2 shows the cut-off (median on original sample) and 99% confidence interval of the odds ratio for the 16 non dichotomous variables. Seventeen dichotomous and five non dichotomous variables (in italics in Tables 1 and  2) were eliminated from the subsequent stepwise selection since their corresponding 99% confidence interval of the odds ratio included 1. 
The remaining variables, whose odds ratios were significantly different from 1 (p < 0.01), were considered in the stepwise selection process (nineteen dichotomous and eleven continuous variables, which were discretized into 4 categories, according to their value falling into the 1st, 2nd, 3rd or 4th quartile interval).
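The univariate screening step can be illustrated with a short sketch. The paper cites [20] for the confidence interval of the odds ratio without naming the formula, so the log-based (Woolf) interval used here is an assumption, as are the function names and the 2 × 2 counts, which are hypothetical and only chosen to add up to the sample size of 3256.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.99):
    """Odds ratio and its confidence interval from a 2 x 2 table.

    a: factor present, adverse outcome   b: factor present, normal outcome
    c: factor absent,  adverse outcome   d: factor absent,  normal outcome
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR), Woolf
    z = NormalDist().inv_cdf(0.5 + level / 2)          # about 2.576 for 99%
    return (odds_ratio,
            odds_ratio * math.exp(-z * se_log),
            odds_ratio * math.exp(z * se_log))


def keep_feature(a, b, c, d, level=0.99):
    """Keep a variable only if its interval excludes 1."""
    _, low, high = odds_ratio_ci(a, b, c, d, level)
    return not (low <= 1.0 <= high)


# Hypothetical counts for two candidate variables.
print(odds_ratio_ci(40, 60, 322, 2834), keep_feature(40, 60, 322, 2834))
print(odds_ratio_ci(36, 290, 326, 2604), keep_feature(36, 290, 326, 2604))
```

A non-dichotomous variable would first be split at its median, as described above, before its counts are entered in the same table.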

Results
The stepwise process selected seven variables, three of which were dichotomous (low postoperative cardiac output, preoperative chronic dialysis and acute myocardial infarction). The detailed results obtained step-by-step are summarized in Table 3.
The 95% confidence interval for AUC estimated by the BCa bootstrap method in the training set was [0.841, 0.883] and the median AUC was 0.863. No significant difference was found when estimating AUC in the testing set, where the median and 95% confidence interval were 0.859 and [0.837, 0.880], respectively. On the basis of the Hosmer-Lemeshow rule [24] an AUC greater than 0.8 indicated excellent discrimination, while the absence of significant differences between the results obtained with the training and testing sets denoted good generalization power. Table 4 shows the confusion matrix obtained with the testing set. The first row of the matrix refers to patients with LOS less than 5 days (normal outcome), and the second to patients with LOS greater than or equal to 5 days (adverse outcome). 2403 patients with normal outcome and 268 patients with adverse outcome were correctly classified, giving an overall correct-classification of 82%. Of course, the values in the table can be interpreted as true negatives, false positives, false negatives and true positives, so that the correct classification percentage of patients with normal outcome corresponds to the specificity (SP), while the correct classification percentage of patients with adverse outcome represents the sensitivity (SE) of the model, i.e. SP = 83%, SE = 74%. Tables 5 and 6 show the weights of the selected dichotomous and non dichotomous features in the scoring system, respectively. Since the highest positive weight in the study sample was assigned to chronic dialysis, patients in chronic dialysis showed a considerable risk of prolonged length of stay in the intensive care unit after heart surgery in the scenario considered. Important risk factors were also a low postoperative cardiac output and acute myocardial infarction. Table 6 also shows that low blood concentrations of postoperative creatinine and bilirubin were significant protective factors.
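The relation between the confusion matrix and the reported percentages can be checked with a few lines of code. The counts of correctly classified patients (2403 and 268) are taken from the text; the two off-diagonal counts (491 false positives and 94 false negatives) are not stated here and are assumed values, chosen only because they are consistent with the sample size of 3256 and with the reported SP and SE.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Specificity, sensitivity and overall accuracy from a 2 x 2 matrix."""
    return {
        "specificity": tn / (tn + fp),   # normal outcome correctly classified
        "sensitivity": tp / (tp + fn),   # adverse outcome correctly classified
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
    }


# tn and tp are from the text; fp and fn are assumed (see the note above).
metrics = confusion_metrics(tn=2403, fp=491, fn=94, tp=268)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.0f}%")
```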
An analysis of Table 6 shows that the weights corresponding to each non dichotomous feature monotonically increase from the first to the fourth quartile interval. However, this increase is generally quite nonlinear. For example, the weights corresponding to postoperative creatinine values in the first and second quartile intervals differ little from each other (−0.88 vs. −0.70), but change drastically in the third (−0.07) and fourth quartile intervals (1.03). This result shows that postoperative creatinine values below the median can be considered a protective factor, whereas values in the fourth quartile represent a risk factor. Finally, the weight of creatinine values in the third quartile interval is close to zero.
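How a non dichotomous variable contributes to the score can be sketched as a simple table lookup. The four creatinine weights are those quoted above; the quartile cut points, the convention for values falling exactly on a boundary and the function name are hypothetical, since they are not given in this section.

```python
import bisect

# Weights for postoperative creatinine by quartile interval (from the text).
CREATININE_WEIGHTS = [-0.88, -0.70, -0.07, 1.03]

# Hypothetical quartile boundaries in mg/dL, for illustration only.
CREATININE_CUTS = [0.8, 1.0, 1.2]


def quartile_weight(value, cut_points, weights):
    """Return the weight of the quartile interval that contains `value`."""
    return weights[bisect.bisect_right(cut_points, value)]


for creatinine in (0.7, 0.9, 1.1, 1.5):
    print(creatinine, quartile_weight(creatinine, CREATININE_CUTS, CREATININE_WEIGHTS))
```

The patient's total score would be the sum of such lookups over all selected variables, compared with the threshold of zero used in the study.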

Discussion
Outcome prediction is a key point in ICUs, not only for prognosis assessment, but also for cost-benefit analysis, health-care management, comparisons between centers, monitoring/assessment of new therapies and population sample comparison studies. A distinction must be made between predictive models for mortality and predictive models for LOS. For the former task, stable benchmarks are needed to conclude whether high-quality care is being delivered across institutions. On the contrary, for LOS, a customized model can be useful for internal healthcare management purposes.
In many cases the predictive models are scoring systems, in which the predictor variables are usually selected and scored subjectively by expert consensus or objectively using statistical methods. These systems are generally preferred by clinicians and health operators because they are so simple that individual scores can be assessed immediately, without using any data processing system. However, a common weakness of scoring systems is that their updating or customization to new populations is often not an easy task [25]. Scoring systems are therefore generally used in their original formulation also for internal management purposes, which implies a significant loss in performance, because model performance deteriorates over time or when applied to populations different from the ones on which they were developed [26].
The approach we used in the present study to get around this critical point was to derive the scoring system directly from a naïve Bayes classifier, using discrete predictors. This approach was not only straightforward but also successful, because naïve Bayes classifiers identify the parameters required for accurate classification using less training data than many other classifiers. This makes them particularly effective for datasets containing many features. Previous papers have also demonstrated that naïve Bayes classifiers may outperform more complex classification methods and show good average performance in terms of classification accuracy, especially over data sets having features that are not strongly correlated [22]. Of course this does not mean that the naïve Bayes technique is the best approach for supervised classification problems. More sophisticated models (which do not rely on the conditional independence assumption and incorporate interaction terms) may perform better. Unfortunately, sophisticated models are rarely used in clinical practice because they may be difficult to fine-tune. For example, the actual interaction terms are not easy to imagine and their choice is often heuristic. Thus, the naïve approach seems to be a satisfactory compromise between good performance and simplicity. Problems may arise if there are several redundant predictive features, in which case a naïve Bayes classifier may show low asymptotic accuracy. Under such conditions, Langley and Sage showed that a selective naïve Bayes classifier, using an optimal subset of selected features for making predictions, sharply improved classifier performance [22]. Unfortunately, if the number N of acquired features is high, an exhaustive search of the best subset of features may be impractical. In fact, to consider all possible subsets of h features (h = 1, 2, …, N), it is necessary to analyze 2^N − 1 subsets. In the present case (52 acquired variables), this is about 4.5 × 10^15 possible subsets of variables. To solve this problem, we used a heuristic approach consisting of two steps. First we reduced the number of variables, keeping only features giving an odds ratio significantly different from 1 (p < 0.01), after dichotomisation. This allowed us to eliminate a range of variables a priori, thus decreasing the number of possible subsets of likely predictors.
The final selection was performed by a forward search, entering at each step the variable giving the best increase in area under the ROC curve, and stopping the search when the increment in AUC became negligible. This procedure follows a methodology specifically designed to develop a selective naïve Bayes classifier [22,23]. Although the approach used in the present paper does not ensure an exhaustive search of the best subset of independent predictors, it considers all local changes to the current set of features and makes the best selection at each step.
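The greedy search and its stopping rule can be sketched as follows. Several details are assumptions rather than facts from the paper: the 1% criterion is read as an absolute AUC gain of 0.01 over three consecutive steps, the features entered during those three steps are discarded when the search stops, an empty model is given a baseline AUC of 0.5, and `evaluate` is a hypothetical callback standing in for training and scoring the naïve Bayes model on a given subset.

```python
def forward_select(candidates, evaluate, window=3, min_gain=0.01):
    """Greedy forward selection driven by AUC.

    evaluate(subset) must return the AUC obtained with that list of features.
    The search stops when the cumulative AUC gain over the last `window`
    steps is below `min_gain`; those last features are then dropped.
    """
    selected, remaining = [], list(candidates)
    history = [0.5]  # assumed AUC of a model with no features
    while remaining:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append(evaluate(selected))
        if len(history) > window and history[-1] - history[-1 - window] < min_gain:
            return selected[:-window], history[:-window]
    return selected, history


# Toy stand-in for model evaluation: each feature adds a fixed AUC gain.
GAINS = {"a": 0.20, "b": 0.10, "c": 0.05, "d": 0.004,
         "e": 0.003, "f": 0.001, "g": 0.0005}


def toy_evaluate(subset):
    return 0.5 + sum(GAINS[f] for f in subset)


chosen, aucs = forward_select(GAINS, toy_evaluate)
print(chosen, aucs)

# Size of the exhaustive alternative for the 52 acquired variables.
n_exhaustive = 2 ** 52 - 1
print(n_exhaustive)
```

With N candidates the greedy search needs at most N(N + 1)/2 model evaluations to rank the additions, against 2^N − 1 subsets for an exhaustive search.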
Although alternative methods of variable selection could be used, we chose a simple approach that exploits the naïve Bayes model. In particular, we judged it inappropriate to reduce the number of predictor variables by procedures based on different models (e.g. stepwise logistic regression analysis) and then use the selected set of variables in the naïve Bayes model. The type of model may influence the optimal subset of predictor variables.
Prior probabilities were assumed identical for the two classes, i.e. P(ω_1) = P(ω_2) = 0.5. Such a choice is often made when it is impossible or inappropriate to make use of a priori knowledge, even if information from available data and/or expert beliefs could be used to make these probabilities more distinctive. Actually, each change in prior probabilities is equivalent to modifying the cost of a wrong decision. Objective criteria could be used to optimize economic and social costs related to correct and false classifications. Unfortunately, despite the importance of this goal, this type of criterion is rarely used in problems of clinical decision-making, because
The results showed that the scoring system derived from the naïve Bayes classifier had excellent discrimination and good generalization power. In particular the 95% confidence interval for AUC estimated by the BCa bootstrap method was [0.841, 0.883] and [0.837, 0.880] in the training and testing sets, respectively. To assess the actual performance of the scoring system, the results were compared with those obtained by a logistic regression (LR) model and a quadratic Bayesian (QB) classifier. Both models were designed on the same data set using IBM SPSS Statistics and MATLAB code.
For the LR model, the stepwise procedure of variable selection again chose seven predictors, four of which (low postoperative cardiac output, postoperative creatinine, age and postoperative bilirubin) were identical to those chosen for the scoring system. The 95% confidence intervals for AUC estimated by the BCa bootstrap method were [0.840, 0.883] and [0.833, 0.876] in the training and testing sets, respectively. No evident difference was observed between the results obtained with the scoring system and the LR model.
The QB classifier selected four predictor variables, three of which (low postoperative cardiac output, postoperative creatinine and age) were identical to those chosen for the scoring system and LR model. The last predictor variable entered in the QB classifier was aortic clamping time. It may be interesting to note that the latter variable was not present in the scoring system, which instead included extracorporeal circulation time. The number of predictor variables of the QB classifier was smaller than that of the other two classifiers (4 vs. 7). This confirms what we pointed out in previous papers, namely that the quadratic Bayesian classifier generally requires fewer predictor variables than other models [6,13]. For the QB classifier, the 95% confidence intervals for AUC were [0.834, 0.877] and [0.829, 0.873] in the training and testing sets, respectively. As with the LR model, the confidence intervals obtained with the QB classifier largely overlap those of the scoring system.
The finding that more complex classification systems (LR model and QB classifier) did not give better performance than the naïve Bayes classifier suggests that the assumption of conditional independence largely held for the selected variables, and that small deviations from it did not cause significant deterioration of model performance.
Length of stay in the ICU was chosen as endpoint for this study because it is a limiting factor for operating theatre utilization for heart surgery and consequently a major parameter of cost-effectiveness. While 120 hours is a high value of LOS for cardiac surgery patients, in the example under consideration we chose this cut-off because it identifies the group of patients (about 10%) that mostly influences internal management decisions in the scenario considered.
A recent study sought to identify and validate existing prediction models for prolonged intensive care after heart surgery [27] through systematic review of the literature. It also tested several models on a large registry database comprising 11,395 heart operations. The study proved that several models showed acceptable discrimination, but no model achieved excellent discrimination. The best performance was obtained by the Parsonnet model (AUC = 0.75), followed by the European system for cardiac operative risk evaluation (AUC = 0.71). A similar AUC value was obtained by us when we used the Cleveland scoring system [3] to predict morbidity risk after cardiac surgery in our specific scenario [12]. These AUC values are somewhat distant from those obtained by

Conclusion
Scoring systems are often used in ICUs to predict outcomes of critical patients. Despite their simple application, they are generally difficult to update with new sets of data and to tune to clinical institutions different from those in which they were designed. This weakness may have a negative effect on the reliability of these attractive predictive models, since the performance of models that were originally efficient may deteriorate significantly with changes in clinical scenario. The naïve Bayes approach used in the present paper seems to overcome this difficulty, because the scoring system is completely defined by descriptive tables that are easily calculated and/or updated using data acquired in any specific institution. Although the model described in the present paper is a working example, the results obtained indicate performance very similar to a logistic regression model and a quadratic Bayesian classifier, as well as greater ease of handling.
In conclusion, although the proposed scoring system can be regarded as an objective ICU discharge model or predictive tool in the particular scenario analysed, the results demonstrate that the present approach produces a very simple and trustworthy scoring system that is easily updated and customized for other centers. This is a key message, because simple, precise customization and updating not only ensure better model performance, but also better acceptance by surgeons and anaesthesiologists.