
Development and validation of predictive models for unplanned hospitalization in the Basque Country: analyzing the variability of non-deterministic algorithms



The progressive ageing in developed countries entails an increase in multimorbidity. Population-wide predictive models for adverse health outcomes are crucial to address these growing healthcare needs. The main objective of this study is to develop and validate a population-based prognostic model to predict the probability of unplanned hospitalization in the Basque Country, by comparing the performance of a logistic regression model and three families of machine learning models.


Using age, sex, diagnoses and drug prescriptions previously transformed by the Johns Hopkins Adjusted Clinical Groups (ACG) System, we predict the probability of unplanned hospitalization in the Basque Country (2.2 million inhabitants) with several techniques. When dealing with non-deterministic algorithms, comparing a single model per technique is not enough to choose the best approach. Thus, we conduct 40 experiments per family of models – Random Forest, Gradient Boosting Decision Trees and Multilayer Perceptrons – and compare them to Logistic Regression. Models’ performance is compared both population-wide and for the 20,000 patients with the highest predicted probabilities, as a hypothetical high-risk group to intervene on.


The best-performing technique is Multilayer Perceptron, followed by Gradient Boosting Decision Trees, Logistic Regression and Random Forest. Multilayer Perceptrons also have the lowest variability, around an order of magnitude less than Random Forests. Median area under the ROC curve, average precision and positive predictive value range from 0.789 to 0.802, from 0.237 to 0.257 and from 0.485 to 0.511, respectively. For the Brier Score, the median value is 0.048 for all techniques. There is some overlap between the algorithms: for instance, Gradient Boosting Decision Trees perform better than Logistic Regression more than 75% of the time, but not always.


All models have good global performance. The only family that is consistently superior to Logistic Regression is Multilayer Perceptron, showing a very reliable performance with the lowest variability.

Background


The progressive ageing and longer life expectancy in developed countries [1] are leading to an increase in the number of people with chronic pathologies, fragility, functional deterioration and, in short, high care needs. This situation forces the current model of healthcare – initially focused on acute intervention – to adapt towards a proactive system capable of anticipating people’s health problems and offering person-centered care [2]. Hence, there has been a great effort aimed at identifying populations with high future health needs in order to implement individualized preventive measures [3]. In this setting, predictive and segmentation tools facilitate population-wide stratification to achieve both improved quality of life and more efficient resource management.

The Basque Country is an autonomous region of Spain with around 2.2 million inhabitants, 23% of whom are over 65 years old [4]. In anticipation of this ageing problem, the Autonomous Government incorporated population risk prediction tools as a fundamental element within its strategy to tackle the challenge of chronicity [5]. The current approach is a predictive model of healthcare costs based on variables transformed by the Adjusted Clinical Groups (ACG) system [6] and validated for the Basque population [7]. The model allows patients to be ranked and segmented according to the predicted healthcare expenditures in the following 12 months.

Although the model’s discriminatory capacity to detect the Basque population with the highest healthcare costs is high [7], the difficulty of its clinical interpretation – healthcare costs act as a proxy for healthcare needs – makes it necessary to consider other tools for predicting adverse events that are (a) relevant for patients, their families, clinicians and healthcare managers, (b) preventable in a significant percentage of cases if care pathways are applied in time, or (c), when not preventable, amenable to making the experience of the transition through the process/event the best possible. In the case of unplanned hospitalization, hospital discharges are critical moments for continuity of care, and they can have negative impacts both emotionally and on quality of care and outcomes of patients, providers’ satisfaction and health system efficiency. Indeed, many health systems are already using population-based predictive models of events such as the risk of hospital admission or death [8,9,10,11,12].

For the development of predictive models, traditional models such as logistic regression are the norm, yet new mathematical approaches, especially machine learning methods, are being explored [12]. There is extensive literature comparing the performance of different statistical and machine learning predictive models for adverse health outcomes on specific data sets [13,14,15]. Often, the intent behind these examples is to choose the most appropriate modelling technique for the task at hand. However, when dealing with non-deterministic algorithms, the comparison of a single model per technique is not enough to conclude that one algorithm is superior to another.

The main objectives of this study are (a) to develop and externally validate a population-based prognostic model to predict the probability of unplanned hospitalization in the Basque Country, and (b) to compare the performance of a logistic regression model and three families of machine learning models. For the machine learning techniques, we evaluate the variability of models’ performance to be able to make an informed decision on the best approach for our specific problem.

Materials and Methods

Data and overall methodology

Setting: Cross-sectional study conducted in the Basque Country, an autonomous region of Spain with around 2.2 million inhabitants. The Basque Health Service (Osakidetza) provides universal and free health care services – except for a co-payment to purchase prescribed drugs – to the Basque population. Osakidetza is divided into 13 Integrated Healthcare Organisations (IHO). Each IHO manages both primary and hospital care in its catchment population.

Fig. 1

Flow diagram of our methodology, divided into three stages and repeated 40 times per non-deterministic algorithm. Necessary input data is inside green cylinders, predictions are in blue and red denotes data that will not be available at the time of application

Study population: Predictive models were developed based on all patients enrolled in Osakidetza in 2016 (N = 2,240,526). The external validation sample comprised all patients enrolled in Osakidetza in 2017 (N = 2,241,276).

Study variables: The models included the variables suggested by The Johns Hopkins ACG System Version 11.1 Technical Reference Guide: age and gender; Adjusted Clinical Groups (ACGs, mutually exclusive isomorbidity clusters), which represent the most resource-intensive morbidity groups; Expanded Diagnosis Clusters (EDCs, e.g. Glaucoma, Endometriosis), for which the evidence linking health care to outcomes is strong; and pharmacy-based markers (Rx-MGs, e.g. Infections/Acute Minor, General Signs and Symptoms/Pain). ACGs, EDCs and Rx-MGs were computed by the Johns Hopkins ACG software, a case-mix system that processes all the codified diagnoses and dispensed drugs in the electronic health record (EHR) to characterize the population. All the variables are binary except age (categorized) and ACG (categorical). See Appendix for a brief description and Additional File for the full list of variables.

Although individuals in the same ACG category share clinical characteristics and similar expected needs for healthcare resources, the ACGs’ capacity to predict the risk of hospitalization can be further improved by adding other variables such as EDCs or Rx-MGs (Appendix, Table 3).

The response variable was whether the patient had any unplanned admission in the following year (2017 during development, 2018 during validation). We excluded all unplanned admissions related to traumatic injuries, birth and delivery. We considered them to be either unpredictable or non-avoidable. To define unplanned admissions we followed the algorithm developed by the Centers for Medicare and Medicaid Services [16, 17]. In this way, we avoided the administrative labelling of admissions to focus on the clinical perspective. Regarding the construction process of the ACGs, EDCs and Rx-MGs, we only considered diagnoses with any activity during the corresponding year. This includes illnesses with diagnoses recorded in the same year, and previous diagnoses with follow-up activity such as treatment or medical visits.

Statistical analysis: We predicted the probability of an unplanned admission using Logistic Regression (LR), Random Forest (RF), Gradient Boosting Decision Trees (GBDT) and Multilayer Perceptrons (MLP). The overall methodology is described in Fig. 1. For all techniques, we trained the models with predictors from 2016 and hospitalizations from 2017 (Fig. 1, left block). We then fed the best model with predictors from 2016 and from 2017 to obtain predictions for 2017 and for 2018 respectively (Fig. 1, central block). We built a separate calibration model, and trained it with the predictions for 2017 and the corresponding observed events. This model then received the predicted probabilities for 2018 as input, and returned the final calibrated probabilities. We externally validated the models using the calibrated probabilities and the observed events in 2018 (Fig. 1, right block). For MLP, RF and GBDT we repeated the whole process 40 times due to their non-deterministic nature. We expand on the statistical analysis in the following sections, where we also describe the metrics used to evaluate models’ performance.

Variability assessment

Except for LR, the methods considered in this study are non-deterministic, in the sense that they can lead to different models for the same data-set. This is due to several sources of randomness:

  • The tree ensembles (RF and GBDT) require the classes to be balanced before training, whereas the LR and the MLPs are able to use the whole data-set. Given the dimensionality of our data, we achieved balanced classes using random undersampling.

  • RF, GBDT and MLP have multiple hyperparameters that need to be selected. For computational reasons, hyperparameter tuning usually explores only a subspace of the grid of possible values, choosing the best model within that sample. Commonly, these grid points are selected at random. Different samples of grid points may give rise to diverse selected models.

  • RF is intrinsically non-deterministic.

Considering all of this, different realizations of the entire process may lead to different rankings of the techniques. We studied the variability of the three non-deterministic algorithms using 40 different random seeds. This controls all sources of randomness, namely class balancing, hyperparameter tuning and intrinsic randomness.
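As an illustration of how a single seed can control the class-balancing step, here is a minimal pure-Python sketch of seeded random undersampling; it is not the study's actual implementation, only a demonstration of the idea:

```python
# Seeded random undersampling of the majority (non-hospitalized) class:
# keep all positives and a same-sized random sample of negatives.
import random

def undersample(records, labels, seed):
    """Return a class-balanced subsample, reproducible for a given seed."""
    rng = random.Random(seed)  # one seed controls this source of randomness
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = rng.sample(neg, k=len(pos))  # undersample the majority class
    idx = sorted(pos + kept_neg)
    return [records[i] for i in idx], [labels[i] for i in idx]

# Toy example: 2 positives among 10 records -> balanced set of 4
X = list(range(10))
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Xb, yb = undersample(X, y, seed=42)
assert sum(yb) == len(yb) - sum(yb) == 2  # equal class counts
```

Re-running with the same seed reproduces the same subsample, which is what makes the 40-seed variability study well defined.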

For the tree ensembles, we tested 30 random hyperparameter combinations per seed, using 3-fold Cross-Validation (CV) and optimizing the validation accuracy score, which is the default evaluation metric in scikit-learn. We note that, since we are using balanced classes, accuracy is a valid measure. For the MLP family, we tested 100 hyperparameter combinations per seed with a training horizon of at most 30 epochs, once again with 3-fold CV, but this time optimizing the validation loss (binary cross-entropy). Once the tuning was complete, we retrained the MLP models with early stopping to prevent overfitting, interrupting the training process as soon as the validation loss stopped decreasing for 5 consecutive epochs, and keeping the best weights for the final models. We built these models using Keras (TensorFlow).
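The seeded sampling of hyperparameter combinations can be sketched as follows; the parameter names and values are illustrative only, not the grids used in the study:

```python
# Seeded random sampling of points from a hyperparameter grid. Each sampled
# candidate would then be scored with 3-fold CV and the best one kept.
import random
from itertools import product

grid = {  # illustrative grid, not the study's actual search space
    "n_estimators": [100, 200, 500],
    "max_depth": [4, 8, 16, None],
    "max_features": ["sqrt", "log2"],
}

def sample_grid(grid, n_iter, seed):
    """Draw up to n_iter distinct grid points, reproducibly for a seed."""
    points = [dict(zip(grid, vals)) for vals in product(*grid.values())]
    rng = random.Random(seed)
    return rng.sample(points, k=min(n_iter, len(points)))

candidates = sample_grid(grid, n_iter=30, seed=0)
```

Different seeds yield different candidate sets, which is one of the reasons repeated experiments can select different final models.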

A limit of 80 hours was set for each experiment, aborting those that did not finish within that time frame. This condition may introduce a selection bias when assessing the performance of each technique. However, if the interest is a practical application, excessive computational demand may also be a reason to discard a model.

Calibration


A probabilistic classifier is referred to as calibrated when its predictions are reliable estimates of the true probability for each class. By contrast, an uncalibrated classifier can only be trusted in its ability to order individuals according to risk. When probability estimates are used for clinical decision-making, assessing model calibration becomes essential. This allows population-wide stratification based on probability ranges, which could be displayed in the EHR.

It is well known that tree ensembles produce distorted class probability distributions [18] and need posterior calibration, commonly achieved by means of isotonic regression. This approach yields a non-decreasing step function that gives calibrated predictions. Its non-decreasing property avoids permuting the predicted probabilities, leaving the discriminatory metrics unaffected. However, the step function homogenizes the predictions into large groups, hindering stratification. To prevent that, we used Piecewise Cubic Hermite Interpolating Polynomials (PCHIP) after isotonic regression to obtain smooth calibrated probabilities [19]. This combination of techniques is non-parametric, preserves discrimination, and produces a smooth distribution of predicted probabilities, allowing flexible stratification.
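The isotonic-regression step can be illustrated with a minimal pool-adjacent-violators (PAV) fit. This is a pure-Python sketch, not the study's implementation, and the subsequent PCHIP smoothing over the resulting step function (available, e.g., in scipy) is omitted:

```python
# Pool-adjacent-violators: given 0/1 outcomes ordered by increasing predicted
# score, return the non-decreasing least-squares fit, i.e. isotonic
# calibrated values.
def pav(y):
    """Isotonic (non-decreasing) fit to y by pooling adjacent violators."""
    blocks = []  # each block stored as [sum, count]
    for v in y:
        blocks.append([v, 1])
        # Merge backwards while block means decrease (a violation).
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)  # expand each block to its mean
    return out

# Outcomes ordered by increasing predicted score:
calibrated = pav([0, 1, 0, 0, 1, 1])
assert calibrated == sorted(calibrated)  # non-decreasing by construction
```

Because the fitted function is non-decreasing in the score, the ranking of patients (and hence the discrimination metrics) is preserved, exactly as the text states.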

Performance metrics

The intended use of the models developed in this work is both population-wide prediction and high-risk patient identification subject to resource-management considerations. Therefore, we used both threshold-independent and threshold-dependent metrics to evaluate the models.

The first type of metric does not depend on a pre-specified decision criterion (i.e., a pre-specified value of the predicted probability above which we consider that the patient will likely be hospitalized the following year). We chose the area under the ROC curve (AUC) and the Average Precision (AP) to evaluate this aspect of performance, as well as the corresponding ROC and Precision-Recall (PR) curves. In a coin-toss model, the value of AP equals the fraction of individuals in the positive class (in our case, the fraction of patients who had an unplanned hospitalization), while the AUC would be 0.5. For a perfect model, both AP and AUC would be 1. In unbalanced data-sets, small differences between models in the ROC space translate to bigger differences in the precision-recall space [20]; thus, AP is more informative than AUC in such data-sets [21].
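For concreteness, the AUC can be computed as the probability that a randomly chosen positive outranks a randomly chosen negative, which makes the 0.5 coin-toss baseline explicit. A minimal sketch (toy data, not the study's):

```python
# AUC as the Mann-Whitney rank statistic: the fraction of positive-negative
# pairs in which the positive receives the higher score (ties count 1/2).
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1]
assert auc([0.1, 0.2, 0.8, 0.3, 0.9], labels) == 1.0  # perfect ranking
assert auc([0.5] * 5, labels) == 0.5                  # uninformative model
```

Note that the AUC depends only on the ordering of the scores, which is why the post-calibration step described later leaves it unchanged.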

To emulate a clinical decision-making scenario, the ability to identify high-risk patients is best measured with metrics that depend on the classification of patients (likely to be hospitalized versus not likely). We report R@K and PPV@K, which represent the recall (sensitivity) and positive predictive value among the K patients with the highest predicted probabilities (high-risk group). In terms of sets and in terms of true versus predicted classes, these metrics are described in Eq. (1):

$$\begin{aligned} R@K &= \frac{\vert L_K\cap H\vert }{\vert H\vert }\times 100 = \frac{TP}{TP+FN}\times 100 \\ PPV@K &= \frac{\vert L_K\cap H\vert }{K}\times 100 = \frac{TP}{TP+FP}\times 100 \end{aligned}$$

where \(L_K\) is the list of K patients with the highest predicted probability, hence considered as positive, H is the list of patients who had unplanned hospitalizations, and the abbreviations TP, FN and FP stand for True Positives, False Negatives and False Positives.

Note that rather than choosing an “optimal” probability threshold for the classification, we fixed the number of positive class predictions (individuals likely to be hospitalized) at K to reflect resource-management constraints. After the identification of high-risk individuals, such a list of patients will be delivered to the healthcare providers to find preventive interventions. In particular, we chose a list of \(K=20000\) individuals, corresponding to about \(10 - 15\) patients per primary care team in the Basque Country, because it is a manageable quantity to review. We also report the results for the top 5% and 10% of patients (Appendix, Table 4).
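Eq. (1) translates directly into code; a minimal sketch with toy data (not the study's):

```python
# R@K and PPV@K from Eq. (1): classify the K highest-risk patients as
# positive and compare against the observed hospitalizations.
def topk_metrics(probs, outcomes, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    tp = sum(outcomes[i] for i in order[:k])  # |L_K ∩ H|
    recall = 100 * tp / sum(outcomes)         # R@K: share of all admissions caught
    ppv = 100 * tp / k                        # PPV@K: share of the list admitted
    return recall, ppv

probs = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2]
outcomes = [1, 0, 0, 1, 1, 0]  # hospitalized the following year
r, p = topk_metrics(probs, outcomes, k=2)  # top-2: patients 0 and 2
assert (r, p) == (100 / 3, 50.0)
```

With K fixed well below the yearly number of admissions, recall is necessarily low even for a good model, which mirrors the trade-off discussed in the text.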

Regarding calibration, we report the Brier Score B. This metric quantifies the average squared distance between the predicted probabilities, \(\hat{p}_j\), and the observed binary outcomes, \(o_j\), as shown in Eq. (2), where N is the size of the population at the prediction/validation stage. It ranges from 0 to 1, with lower values indicating better models:

$$B = \frac{1}{N}\sum_{j=1}^{N}\left(\hat{p}_j-o_j\right)^2$$

The Brier Score is the only metric affected by our post-calibration procedure. It combines the effects of calibration and discrimination, but the latter remains unchanged after isotonic regression and PCHIP smoothing. Hence, the difference in Brier Score before and after this process shows the improvement in calibration.
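A minimal sketch of Eq. (2) on toy data, showing the perfect-prediction lower bound of 0:

```python
# Brier Score from Eq. (2): mean squared distance between predicted
# probabilities and observed 0/1 outcomes.
def brier(p_hat, outcomes):
    return sum((p - o) ** 2 for p, o in zip(p_hat, outcomes)) / len(outcomes)

outcomes = [0, 0, 0, 1]
assert brier([0.0, 0.0, 0.0, 1.0], outcomes) == 0.0  # perfect predictions
assert brier([0.25] * 4, outcomes) == 0.1875         # base-rate predictor
```

The base-rate predictor illustrates why the score mixes calibration and discrimination: it is perfectly calibrated on average yet far from 0 because it cannot separate the classes.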


Description of the study population

Among the patients enrolled in the public health system in 2017 (i.e., the population used for the validation of the models), 5.67% had at least one unplanned hospitalization in 2018 (5.39% of women and 5.97% of men). 22.98% of women and 17.68% of men were older than 65 (Table 1).

Table 1 Population at the prediction/validation stage (COPD: Chronic Obstructive Pulmonary Disease). The comorbidity data indicates the number of patients with active diagnoses in 2017, i.e. those that had any related treatment, diagnostic code or medical appointment in 2017. The true prevalences of diseases may be higher

The most prevalent diseases were hypertension (9.09% for women and 9.13% for men), low back pain (8.57% and 6.65%) and diabetes (4.13% and 5.36%). We also inspect other relevant comorbidities (Table 1).

The population in 2016 (the one used in the development/training of the models) had a similar demographic distribution; the percentage of patients present in only one of the two years is 3.98% (Appendix, Table 1).

Model performance

After 80 hours of runtime, we obtained 30 RF models, 26 GBDT models and 40 MLPs, besides the logistic regression model. The results shown below are based on these models.

Overall performance of techniques: Median AUC ranges from 0.789 to 0.802. The median AP is between 0.237 and 0.257 (coin-toss model: AP = 0.0567). Median R@20k ranges from 0.076 to 0.080. For PPV@20k, the median values are between 0.485 and 0.511, and for the Brier Score the median value is 0.048 for all techniques (Table 2). Figure 2 shows the ROC and PR curves for one model per family. The ROC curves almost overlap and, as expected under class imbalance, the differences become visible in the PR curves.

Table 2 For the external validation set: Summary of the performance metrics in each family of models (quartiles and standard deviation). The techniques are ordered from the best to the worst performance. (MLP: Multi-Layer Perceptron; GBDT: Gradient-Boosting Decision Trees, LR: Logistic Regression, RF: Random Forest, AP: Average Precision, R@20k and PPV@20k: Recall and Positive Predictive Value for the 20000 highest-risk patients)
Fig. 2

For the external validation set: ROC and PR curves for the models with median AUC and median AP, respectively. (MLP: Multi-Layer Perceptron; GBDT: Gradient-Boosting Decision Trees, LR: Logistic Regression, RF: Random Forest, AP: Average Precision)

The best performing technique: For all measures, the best-performing technique is MLP (AUC = 0.802; AP = 0.257 and Brier Score = 0.048), followed by GBDT, LR and RF. MLPs also have the lowest variability (Table 2). For more than 50% of the MLP models, the proposed list of 20000 patients includes more than 8.0% of all patients who were admitted in 2018; 51.1% of the patients shortlisted by MLPs were indeed admitted the following year. This percentage drops to 48.3% with RF. Given the population size, this implies 1098 additional correct class predictions with the median MLP model instead of the median RF model.

Variability/overlapping of the algorithms: There is some overlap between the algorithms. For instance, GBDT has better R@20k and PPV@20k than LR more than 75% of the time, but there are several experiments where GBDT performs worse than LR, and rare cases where it does worse than RF in terms of the threshold-specific metrics, although not in terms of global discrimination (Fig. 3).

Post-calibration: The values of the Brier Score improve after post-calibration, especially for the tree ensembles (Fig. 4). In Fig. 5 we report the reliability diagrams of the models with the median Brier Score before and after post-calibration. Again, we see a remarkable improvement in the calibration of the tree ensembles.

Overfitting: In the training set, the median AUCs range from 0.789 to 0.805. The technique ranking in terms of R@20k and PPV@20k differs from that in the external validation, placing RF above LR due to overfitting (Appendix, Table 2 and Supplementary Fig. 1).

Fig. 3

For the external validation set: Performance of the RF, GBDT and MLP families according to different discrimination metrics. The violins show the variability of each family due to stochastic effects. Red line: LR (MLP: Multi-Layer Perceptron; GBDT: Gradient-Boosting Decision Trees, LR: Logistic Regression, RF: Random Forest, AP: Average Precision, R@20k and PPV@20k: Recall and Positive Predictive Value for the 20000 highest-risk patients)

Fig. 4

For the external validation set: Variability of the Brier Score. Purple and red lines: LR before and after post-calibration. Top-right: Comparison of all the algorithms before and after post-calibration. Bottom: Violin plots of the tree ensembles before post-calibration (high Brier Scores). Top-left: RF, GBDT and MLP after calibration, and MLP before calibration. (MLP: Multi-Layer Perceptron; GBDT: Gradient-Boosting Decision Trees, LR: Logistic Regression, RF: Random Forest)

Fig. 5

For the external validation set: Reliability diagrams (a) before, and (b) after post-calibration. The numbers indicate the amount of patients in the last two bins. The bottom part is a histogram of the predicted probabilities. (MLP: Multi-Layer Perceptron; GBDT: Gradient-Boosting Decision Trees, LR: Logistic Regression, RF: Random Forest)

Characteristics of the risk groups

Of the patients in the high-risk groups (20000 individuals) selected by LR and by the MLP with median PPV@20k, 89% and 85%, respectively, are older than 65, compared to 20% in the general population (GP). The prevalence of what are considered to be complex chronic conditions is also high: 39%-44% of identified high-risk patients have diabetes (5% in the GP), 35%-39% heart failure, 31%-33% COPD, 18%-19% ischemic heart disease and 30%-34% chronic renal failure, among others. The proportion of patients with multimorbidity in this high-risk population is also very high, 62%-66% (Table 3).

Table 3 For the external validation set: Characteristics of the risk groups (20000 individuals) identified by the MLP with median performance and by LR. The comorbidity data indicates the number of patients with active diagnoses in 2017, i.e., those that had any related treatment, diagnostic code or medical appointment in 2017. The true prevalences of diseases may be higher. Pluripathologic: We follow the modified definition of pluripathology by Ollero et al. [22]: the presence of chronic diseases included in two or more of the following clinical categories: (a) ischemic heart disease (excluding acute myocardial infarction) OR congestive heart failure; (b) chronic renal failure; (c) COPD; (d) inflammatory bowel disease OR chronic liver disease; (e) cerebrovascular disease OR dementia OR delirium; and (f) peripheral vascular disease OR diabetic retinopathy OR diabetes with complications. (COPD: Chronic Obstructive Pulmonary Disease)

Discussion


This study has achieved a double objective. First, we have obtained calibrated predictions of the risk of unplanned hospitalization for an entire population to allow stratification and planning of preventive interventions. Second, we have compared several prediction approaches, assessing the variability of non-deterministic techniques to identify those that consistently outperform others.

The importance of the application lies in its population-based nature. The use of this type of predictive models in the Basque Country would help, as a population screening tool (2.2 million inhabitants), to identify and prioritize individuals at high risk of hospitalization who could benefit from the integrated care pathways currently in use in the Basque Health Service. These integrated care pathways have been shown to reduce the risk of hospitalization in patients with complex chronic diseases and multimorbidity [23, 24], who, in fact, represent the majority of high-risk patients selected by our predictive models (see Table 3). In this way, the identification of individuals with a high risk of hospitalization would complement other preventive strategies already in place in the Basque Country for identifying high need-high risk patients. Specifically, in 2010, the Department of Health of the Basque Government launched the “Strategy to tackle the challenge of chronicity in the Basque Country” [5], aimed to re-orient the health system toward an integrated care model and a person-centred approach.

All our models have good global discrimination, with AUCs around 0.80 and APs around 0.25. Identifying the 20000 highest-risk patients is much more difficult, resulting in lists with low recall, because we impose a number of positives that is much lower than the number of admissions per year. Despite this low recall, the decision to identify only the 20000 most at-risk patients is based on the resource intensity that the deployment of, and adherence to, these integrated pathways requires. According to the experience in the Basque Country, such resource-intensive integrated pathways [23, 24] – focused on (a) care coordination and communication between health providers, and (b) patient empowerment and home-based care – should not be offered to more than 10-15 patients per primary care team (there are about 1500 primary care teams in the Basque Health Service). However, the positive predictive value (around 50%) indicates that, in the screening phase by primary care physicians/teams, half of the screened patients would be hospitalized during the following year if no preventive intervention were applied. Thus, as mentioned above, the suggested lists could be very useful for clinical practice, as they would help to identify patients who could benefit from adjusted patient-centered care pathways.

Regarding the evaluation of different techniques, we do not see big differences in the metrics. The only family that is consistently superior to LR is MLP, showing a very reliable performance with the lowest variability. GBDT overall outperforms LR, but has some overlap. RF is consistently worse than LR, and also has the highest variability. In absolute terms, the MLP with median R@20k produces 704 additional correct predictions compared to LR and 1098 compared to the corresponding RF. The ACG system provides highly processed binary variables, leaving little room for machine learning techniques to excel with this specific kind of data-set.

Several factors influence whether the performance gain justifies the increased complexity of MLP versus LR. In terms of computational effort, these simple neural networks do not require considerable additional work or any special equipment. Both models could be pre-trained and implemented into a ready-to-use application. Regarding interpretability, there is no doubt that LR is ahead of MLP. However, there is extensive recent work on explaining neural networks, as well as some open-source software [25,26,27]. Nevertheless, our predictor variables have limited interpretability by themselves due to the opacity of the ACG system. Despite that, we show the 20 most important variables for selected models of each kind (Appendix, Fig. 2).

Algorithmic fairness is another concern regarding the use of machine learning in clinical practice. In this article we conclude that MLP performs better with our data in the general population, but an additional fairness analysis should be conducted before making the final decision, and the fairness of LR must be assessed as thoroughly as that of MLP. To conduct this analysis, we must gather expert knowledge on which sensitive attributes to choose and which definition of fairness to strive for. Typical sensitive attributes are gender and socioeconomic status, but there may be many other hidden biases, some of which may not even be measured in our variables. To begin with, both LR and MLP select a low percentage of women for the risk groups (34% and 35%; Table 3), while the proportion of observed hospitalizations is only 0.6% lower for women. The possible causes of this will be researched in an upcoming work.

In terms of AUC, our results are similar to those found in the literature. Previous predictive models for hospitalization using the ACG system reported AUCs between 0.73 [28] and 0.80 [29], increasing to 0.82 with the inclusion of resource usage variables such as the number of outpatient and emergency visits, dialysis or nursing services.

Including additional variables may improve model performance, and even impact the comparison among techniques. Apart from resource usage variables, thorough information about the patients and their living conditions could be considered, accounting for social determinants of health.

The extent of overfitting is rather small for all techniques, which have similar performance in the training and external validation sets (Appendix, Table 2 and Supplementary Fig. 1). Interestingly, GBDT sometimes has a higher performance in the training set than MLP, although the median is higher for MLP. This is due to early stopping on the validation loss, which prevents MLPs from overfitting. In terms of the threshold-specific metrics, for the given case of \(K=20000\), RF sometimes achieves the highest performance in the training set, with a median higher than LR, but its large variability makes it unreliable compared to the other families. These results mean that, without performing a variability analysis and a proper validation, there is a high chance of erroneously choosing RF over LR based on risk-patient identification on the training set.

Regarding limitations of this study, given our data we have no way to account for the different healthcare use patterns of subpopulations based on gender, socioeconomic status or other sensitive attributes, which may contribute to unfairness. Additionally, we have focused only on the performance variability without analyzing any trends in the model structure. To gain insight into model complexity, it could be interesting to assess tendencies in the selected hyperparameters, such as the number of layers and neurons most commonly chosen for MLPs. We are aware that not all unplanned hospitalizations are avoidable. In this regard, developing predictive models of the risk of a potentially avoidable hospitalization might have helped to select a more impactful population [30, 31]. However, there is a lack of consensus on which hospitalizations should be considered avoidable, and there is a high risk of leaving out a proportion of avoidable hospitalizations (sensitivity concerns) [32, 33]. Also, even when hospitalization is not avoidable, we should always provide a high-quality hospitalization process, including a safe and satisfying discharge and transition back to the community for the patient, their caregivers and the healthcare professionals caring for them. One way to achieve this is by putting in place the elements of communication, empowerment and coordination that are also developed in these integrated care pathways.

For further research, we propose constructing transparent and less processed predictors to test them against the models built on top of the ACG system. Previous studies have identified ACG as the recommended case-mix system among similar tools [8]. However, to our knowledge, there is no research comparing ACG predictors against a comprehensive set of variables built on the complete list of diagnoses and prescriptions per patient.


In this study, we built population-wide predictive models for unplanned hospitalization in the Basque Country, and compared several machine learning approaches considering the performance variability due to their non-deterministic nature. The best performance and lowest variability were achieved with multilayer perceptrons. The observed overlap between some of the techniques highlights the importance of a variability analysis.

Availability of data and materials

The data that support the findings of this study are available from Osakidetza but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Osakidetza. The source code is available in Zenodo, with DOI 6957516. Link:

Data are obtained from administrative health databases and the Osakidetza electronic health records (EHR), which integrate both primary care and hospital care, including electronic pharmacological prescriptions. All this information is extracted through a business intelligence platform that links all structured data relating to the same patient via pseudonymised identifiers.



ACG: Adjusted Clinical Groups
AP: Average Precision
AUC: Area Under the (Receiver Operating Characteristic) Curve
COPD: Chronic Obstructive Pulmonary Disease
EDC: Expanded Diagnostic Clusters
EHR: Electronic Health Record
GBDT: Gradient Boosting Decision Trees
LR: Logistic Regression
MLP: Multilayer Perceptron
PCHIP: Piecewise Cubic Hermite Interpolating Polynomials
RF: Random Forest
Rx-MG: Pharmacy-based markers (Rx-Defined Morbidity Groups)


  1. Kinsella KG, Phillips DR. Global aging: The challenge of success, vol 60. DC: Population Reference Bureau Washington; 2005.


  2. Nalin M, Bedbrook A, Baroni I, Romano M, Alhambra Borrás T, Avolio F, et al. White paper on deployment of stratification methods. Activation of Stratification Strategies and Results of the interventions on frail patients of Healthcare Services (ASSEHS) EU project. 2016. Available at Accessed Aug 2023.

  3. Chong JL, Lim KK, Matchar DB. Population segmentation based on healthcare needs: a systematic review. Syst Rev. 2019;8(1):1–11.


  4. EUSTAT- Euskal Estatistika Erakundea - Instituto Vasco de Estadística. Municipal Inhabitant Statistics. Accessed Aug 2023.

  5. Basque Government. Estrategia para afrontar el reto de la cronicidad en Euskadi (2016). Accessed Aug 2023.

  6. Johns Hopkins University, School of Public Health. The Johns Hopkins University ACG Case-Mix System. Accessed Oct 2022.

  7. Orueta JF, García-Álvarez A, Alonso-Morán E, Nuño-Solinis R. Desarrollo de un modelo de predicción de riesgo de hospitalizaciones no programadas en el País Vasco. Rev Esp Salud Pública. 2014;88:251–60.


  8. Girwar SAM, Jabroer R, Fiocco M, Sutch SP, Numans ME, Bruijnzeels MA. A systematic review of risk stratification tools internationally used in primary care settings. Health Sci Rep. 2021;4(3):329.


  9. Henderson M, Han F, Perman C, Haft H, Stockwell I. Predicting avoidable hospital events in Maryland. Health Serv Res. 2022;57(1):192–9.


  10. Kansagara D, Englander H, Salanitro A, Kagen D, Theobald C, Freeman M, et al. Risk Prediction Models for Hospital Readmission: A Systematic Review. JAMA. 2011;306(15):1688–98.

  11. Wang L, Porter B, Maynard C, Evans G, Bryson C, Sun H, et al. Predicting risk of hospitalization or death among patients receiving primary care in the Veterans Health Administration. Med Care. 2013;51(4):368–73.

  12. Data Study Group team. Data Study Group Final Report: NHS Scotland. Zenodo. 2019.


  13. Blankers M, van der Post LF, Dekker JJ. Predicting hospitalization following psychiatric crisis care using machine learning. BMC Med Inf Decis Making. 2020;20(1):1–11.


  14. Mortazavi BJ, Downing NS, Bucholz EM, Dharmarajan K, Manhapra A, Li SX, et al. Analysis of machine learning techniques for heart failure readmissions. Circ Cardiovasc Qual Outcome. 2016;9(6):629–40.


  15. Shang Y, Jiang K, Wang L, Zhang Z, Zhou S, Liu Y, et al. The 30-days hospital readmission risk in diabetic patients: predictive modeling with machine learning classifiers. BMC Med Inform Decis Mak. 2021;21(2):1–11.


  16. Horwitz L, Grady J, Cohen D, Lin Z, Volpe M, Ngo K, et al. Development and Validation of an Algorithm to Identify Planned Readmissions From Claims Data. J Hosp Med. 2015;07:10.


  17. Horwitz LI, Partovian C, Lin Z, Herrin J, Grady JN, Keenan M, et al. Centers for Medicare & Medicaid Services Planned Readmission Algorithm – Version 2.1. 2013. Accessed Aug 2023.

  18. Niculescu-Mizil A, Caruana R. Predicting Good Probabilities with Supervised Learning. In: Proceedings of the 22nd International Conference on Machine Learning. ICML ’05. New York, NY, USA: Association for Computing Machinery; 2005. p. 625–632.

  19. Jiang X, Osl M, Kim J, Ohno-Machado L. Smooth Isotonic Regression: a new method to calibrate predictive models. AMIA Summits Transl Sci Proc. 2011;2011:16–20.


  20. Davis J, Goadrich M. The Relationship between Precision-Recall and ROC Curves. In: Proceedings of the 23rd International Conference on Machine Learning. ICML ’06. New York, NY, USA: Association for Computing Machinery; 2006. p. 233–240.

  21. Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE. 2015;10(3):1–21.


  22. Ollero Baturone M, et al. Atención al paciente pluripatológico: Proceso asistencial integrado. 2002. Updated 2018. Consejería de Salud, Junta de Andalucia. Accessed Aug 2023.

  23. Soto-Gordoa M, de Manuel E, Fullaondo A, Merino M, Arrospide A, Igartua JI, et al. Impact of stratification on the effectiveness of a comprehensive patient-centered strategy for multimorbid patients. Health Services Research. 2019;54(2):466–73.

  24. Mateo M, González N, Fullaondo A, Merino M, Azkargorta L, Giné-March A, et al. Impact of the CareWell integrated care model for older patients with multimorbidity: a quasi-experimental controlled study in the Basque Country. BMC Health Serv Res. 2020;07:20.


  25. Ribeiro MT, Singh S, Guestrin C. Why should i trust you? Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 1135–44.

  26. Lundberg SM, Lee SI. A Unified Approach to Interpreting Model Predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Red Hook, NY, USA: Curran Associates Inc.; 2017. p. 4768–4777.

  27. Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng. 2018;2(10):749.


  28. Haas LR, Takahashi PY, Shah ND, Stroebel RJ, Bernard ME, Finnie D, et al. Risk-stratification methods for identifying patients for care coordination. Am J Manage Care. 2013;19(9):725–32.


  29. Maltenfort MG, Chen Y, Forrest CB. Prediction of 30-day pediatric unplanned hospitalizations using the Johns Hopkins Adjusted Clinical Groups risk adjustment system. PLoS ONE. 2019;14:1–12.


  30. Donzé JD, Williams MV, Robinson EJ, Zimlichman E, Aujesky D, Vasilevskis EE, et al. International Validity of the HOSPITAL Score to Predict 30-Day Potentially Avoidable Hospital Readmissions. JAMA Intern Med. 2016;176(4):496–502.

  31. Riis A, Kristensen P, Lauritsen S, Thiesson B, Jørgensen M. Using Explainable Artificial Intelligence to Predict Potentially Preventable Hospitalizations: A Population-Based Cohort Study in Denmark. 2023.


  32. Pinto A, Santos JV, Souza J, Viana J, Costa Santos C, Lobo M, et al. Comparison and Impact of Four Different Methodologies for Identification of Ambulatory Care Sensitive Conditions. Int J Environ Res Public Health. 2020;17(21).

  33. Johnston J, Longman J, Ewald D, King J, Das S, Passey M. Study of potentially preventable hospitalisations (PPH) for chronic conditions: what proportion are preventable and what factors are associated with preventable PPH? BMJ Open. 2020;10(11).



We thank the Informatics and Health Information Services of the Directorate of the Basque Health Service (Osakidetza), especially Eider Pocheville, for their help identifying and extracting the data for the study; and the members of the Basque Group for the Improvement of population Risk-Stratification for their continued support. We are also grateful to the peer referees for their constructive comments on the paper.


Preliminary results of this work were presented at the “VI Jornadas Científicas de Estudiantes de la SEB” (abstract book at


This research was funded by the Basque Government through the “Mathematical Modeling Applied to Health” Project and the BERC 2022-2025 program, by the Spanish Ministry of Science and Innovation and AEI (BCAM Severo Ochoa accreditation CEX2021-001142-S), and by the Ramon y Cajal Grant RYC2019-027534-I. The funding body played no role in the design of the study and collection, analysis, interpretation of data, and in writing the manuscript.

Author information

Authors and Affiliations



Alexander Olza: conception and design of the study; analysis and interpretation of data; drafting of the manuscript. Eduardo Millán: conception and design of the study; acquisition and interpretation of data; revision of the manuscript. María Xosé Rodríguez: conception and design of the study; interpretation of data; revision of the manuscript. All the authors approved the final version to be published and agreed to be accountable for all aspects of the work.

Corresponding author

Correspondence to Alexander Olza.

Ethics declarations

Ethics approval and consent to participate

The Basque Research Ethics Committee on Medicines (CEIm-E) approved this study and waived the need for obtaining informed consent. All methods in this study were carried out in accordance with relevant guidelines and regulations.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.


Additional file 2.

Full list of variables included in the models and TRIPOD checklist: Good practices for prediction model development and validation.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Olza, A., Millán, E. & Rodríguez-Álvarez, M.X. Development and validation of predictive models for unplanned hospitalization in the Basque Country: analyzing the variability of non-deterministic algorithms. BMC Med Inform Decis Mak 23, 152 (2023).



  • Hospitalization
  • Non-deterministic algorithms
  • Predictive models