Prediction of long-term hospitalisation and all-cause mortality in patients with chronic heart failure on Dutch claims data: a machine learning approach

Background Accurately predicting which patients with chronic heart failure (CHF) are particularly vulnerable to adverse outcomes is of crucial importance to support clinical decision making. The goal of the current study was to examine the predictive value of a Dutch health insurance claims database for long-term heart failure (HF) hospitalisation and all-cause mortality in CHF patients, by exploring and exploiting machine learning (ML) and traditional statistical techniques. Methods Our study population consisted of 25,776 patients with a CHF diagnosis code between 2012 and 2014. One-year and three-year follow-up HF hospitalisation (1446 and 3220 patients, respectively) and all-cause mortality (2434 and 7882 patients, respectively) were measured from 2015 to 2018. The area under the receiver operating characteristic (ROC) curve (AUC) was calculated after modelling the data using Logistic Regression, Random Forest, Elastic Net regression and Neural Networks. Results AUC rates ranged from 0.710 to 0.732 for 1-year HF hospitalisation, 0.705–0.733 for 3-years HF hospitalisation, 0.765–0.787 for 1-year mortality and 0.764–0.791 for 3-years mortality. Elastic Net performed best for all endpoints. Differences between techniques were small and only statistically significant between Elastic Net and Logistic Regression compared with Random Forest for 3-years HF hospitalisation. Conclusion In this study based on a health insurance claims database we found clear predictive value for long-term HF hospitalisation and mortality of CHF patients, using both ML techniques and traditional statistics. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-021-01657-w.

infarction [6]. Accurately predicting which CHF patients are particularly vulnerable to adverse outcomes, such as renewed HF hospitalisation or even death, is of crucial importance to support clinical decision-making. Advances in statistical approaches and computational power, including fully utilizing machine learning techniques on Big Data, potentially provide better knowledge extraction and evidence-based clinical decision support [7][8][9][10][11]. In addition to traditional statistical analysis, novel machine learning (ML) algorithms can identify patterns in large datasets and build both linear and non-linear models in order to make effective data-driven predictions [12]. All residents of the Netherlands are entitled to a comprehensive basic health insurance package, which includes the bulk of essential medical care, medications and medical aids. Health insurance claims (HIC) databases are attractive for research because of their large size, their longitudinal perspective, and their practice-based information. As they are based on financial reimbursement, the information is generally reliable; moreover, the databases are audited every year to ensure that they meet the quality level required for the Dutch risk equalization model [13,14]. ML techniques could potentially better utilize the richness of these databases [7,8,15,16]. The goal of the current study was to examine the predictive value of Dutch HIC data for long-term HF hospitalisation and all-cause mortality in CHF patients, by exploring and exploiting ML and traditional statistical techniques.

Patients
A HIC database from Zilveren Kruis, the largest insurance company in the Netherlands, containing anonymised data on what can be considered a representative sample of ~ 30% of the Dutch population, was analysed retrospectively. Patients aged 18-85 years with a diagnosis code for CHF between 2012 and 2014 were included, and follow-up HF hospitalisation and all-cause mortality in 2015-2018 were measured [17]. Patients had to have a CHF-related claim according to the national diagnosis-treatment classification system called 'Diagnose Behandeling Combinatie' (DBC), which is based on a combination of the International Classification of Diseases, 10th revision (ICD-10) and the applied treatment [18]. Additionally, they had to have used at least one medication within the cardiovascular system ("C") of the World Health Organization Anatomical Therapeutic Chemical Classification index and Defined Daily Dose (WHO ATC/DDD) in the same period [19]. According to the European Society of Cardiology (ESC) heart failure guidelines [20], CHF patients should visit their treating physician at least once per year. Therefore, patients who lacked any HF insurance claim after January 2015 were excluded, because they were most likely wrongly diagnosed or mislabelled as HF patients. Patients who switched insurance company between 2012 and 2017 were also excluded. A total of 25,776 patients were included in the final analysis (Fig. 1).

Endpoints
The study endpoints were HF hospitalisation and all-cause mortality. The risks of HF hospitalisation and all-cause mortality were predicted over a one-year (2015) and a three-year (2015-2017) horizon. This resulted in the following study endpoints (i.e. dependent features): (1) 1-year HF hospitalisation (1446 patients), (2) 3-years HF hospitalisation (3220 patients), (3) 1-year all-cause mortality (2434 patients), (4) 3-years all-cause mortality (7882 patients). HF hospitalisation was defined as at least one night of stay in inpatient care for acute or chronic HF based on the DBC system. All-cause mortality was defined as death due to any cause. No clinical adjudication committee reviewed the HF hospitalisation endpoint.

Data
The process of feature selection is graphically displayed in Fig. 2. Claim-based input features between 2012 and 2014 were divided into three categories: hospital claims, pharmaceutical claims and claims of other caregivers. Hospital claims are all DBCs, i.e. combinations of diagnosis and treatment, for instance a DBC for hospital admission for CHF with more than five nursing days. Within the hospital claims we also included the diagnosis related groups (DRG) based on the ICD-10 code of each DBC. The pharmaceutical claims were divided into seven categories:
1. Use of an individual prescription at the full anatomical therapeutic chemical (ATC) level;
2. Use at the ATC3 therapeutic subgroup level [21];
3. Medication adherence, defined by the medication possession ratio (MPR) [22,23];
4. Use of an automatic pill dispenser;
5. Sum of prescribed daily doses (PDD), which takes into account the dosage schemes as prescribed by the treating physician;
6. Number of times medicines were collected;
7. Number of different medications within the same ATC3 subgroup.
Examples of claims of other caregivers are the number of visits to a GP or physiotherapist and the use of a medical device. Most features are categorical and are binary coded to represent whether or not the corresponding medical service was provided to the patient. Due to the large number of claim-related features in the dataset (> 6000 features), feature selection plays an important role in reducing noise and computational costs, while simultaneously improving accuracy [24]. Feature selection was done in two stages: first, prevalence prioritization, and second, a Lasso regression [25]. Input features between 2012 and 2014 were divided into two time episodes: (1) the year 2014 and (2) the combined years 2012-2013. Prevalence prioritization was performed in each of the three categories and their relevant subcategories for the two time episodes, by selecting all features with a threshold of > 250 patients in each category. In this way we included, for the two time episodes respectively, 96 and 160 out of 2290 hospital claim features, 73 and 85 out of 141 DRG features, 192 and 232 out of 901 pharmaceutical claim features, 55 and 60 out of 299 pharmaceutical claims at the ATC3 level, and 377 and 508 out of 3232 other claim features. Patient characteristics at the postal code level, such as income (high, medium, low) and distance to the nearest facilities such as GP and hospital, were also included as input features. Two time-related features were included in the model: the number of days between the last hospital visit in 2012-2014 and January 1, 2015, and the duration since the first hospital visit, determined as the period between the first occurrence of a DBC for CHF in this baseline period and January 1, 2015. The total number of input features in this first stage was > 2000.
In the second stage of feature selection we ran a Lasso regression to obtain the (at most) 150 most significant features, on the partition of the dataset that was subsequently used for model training (49%) and validation (21%), using all input features from stage 1, for each of the four endpoints separately. The Lasso method constrains the sum of the absolute values of the model parameters to be less than a fixed value (upper bound). To do so, it applies L1 regularization, under which some of the coefficients become exactly zero; the variables corresponding to the non-zero coefficients remain in the dataset. The goal of this process was to reduce computational cost while minimizing the prediction error [25]. We used SAS Enterprise Guide 7.1 for the Lasso regression (proc GLMSELECT for binary outcomes, with the Schwarz Bayesian Criterion for selection) [26]. The baseline characteristics age, sex and marital status were added to the final set. The features and the total number of features in the final set are described in Additional file 1: Table I. The definitions of demographics, socio-economic status, selection of medication and all other input features are described in Additional file 1: Table II.
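The L1 mechanism behind this selection step can be illustrated with a minimal coordinate-descent Lasso in Python (an illustrative sketch on synthetic data, not the SAS GLMSELECT procedure the study used): coefficients whose partial correlation with the outcome falls below the penalty are soft-thresholded to exactly zero, so the corresponding features drop out of the set.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize 0.5*||y - X@beta||^2 + lam*||beta||_1 by cyclic
    coordinate descent with soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)              # per-feature squared norms
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove feature j's current contribution
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            # soft-threshold: weak correlations shrink to exactly zero
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 10))             # 10 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)
beta = lasso_coordinate_descent(X, y, lam=20.0)
selected = np.flatnonzero(beta)                # features surviving selection
```

With this penalty only the two informative features survive; in the study the same principle, applied per endpoint, trimmed the > 2000 stage-1 candidates down to at most 150 features.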

Statistical analysis
We compared four computational techniques to determine which method yields the best prediction for the study endpoints: backward logistic regression (LR), regularized logistic regression (Elastic Net, EN), random forests (RF) and neural networks (NN). We used the area under the receiver operating characteristic (ROC) curve (AUC) as the primary performance metric and sensitivity and specificity as secondary performance metrics for comparing the models. Sensitivity and specificity have the advantage that they assign equal importance to the correct prediction of hospitalisation/mortality and to the correct prediction of no hospitalisation or mortality [27]. Additional performance metrics, such as true negatives and precision, are calculated and shown in a confusion matrix in Additional file 1: Table III. The dataset was split randomly into two partitions, to learn (training and validation) and to evaluate (test) the models. The first partition, comprising 70% of the complete dataset (patients), is used for model training (49%) and validation (21%). Various combinations of hyperparameter values (Table 1) were explored to optimize the AUC between the training set and the validation set and obtain the best trained model. For the hyperparameters not mentioned, the respective default values of the software packages R Statistical Software and SAS Enterprise Guide were used. The second partition, the remaining 30% of the complete dataset, is then used to evaluate the final prediction performance. The same learning and evaluation partitions were used for all techniques. Because the dichotomous endpoints (hospitalisation and mortality) of the models are imbalanced (> 90% of patients have the class value 'no mortality' or 'no hospitalisation'), the predictions could by default be biased towards the class with the highest prevalence [28]. Therefore, the loss is computed per class and each class loss is weighted proportionally to the inverse of the proportion of the corresponding class (formula described in Additional file 1: Table II).
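A minimal sketch of this partitioning and inverse-prevalence class weighting in Python (illustrative only; the random seed and the ~9.4% prevalence used for the synthetic labels are assumptions, not values taken from the study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25_776                                  # cohort size from the paper
idx = rng.permutation(n)

n_train = int(0.49 * n)                     # 49% training
n_val = int(0.21 * n)                       # 21% validation
train, val, test = np.split(idx, [n_train, n_train + n_val])
# the remaining ~30% is the held-out test partition

# inverse-prevalence class weights for an imbalanced endpoint
y = rng.random(n) < 0.094                   # synthetic labels, ~9.4% positives
prevalence = y.mean()
w_pos = 1.0 / prevalence                    # minority class weighted up
w_neg = 1.0 / (1.0 - prevalence)
sample_weights = np.where(y, w_pos, w_neg)
```

With these weights each class contributes equally to the total loss, counteracting the > 90% dominance of the negative class.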
The backward logistic regression (LR) starts with all coefficients in the model and deletes them consecutively: in each step, the coefficient that does not (significantly) improve the prediction of the dependent variable is removed, until all remaining features are significant at the 0.10 level.
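The elimination loop can be sketched as follows (an illustrative Python implementation on synthetic data; ordinary least squares with a normal approximation to the p-values stands in for the logistic Wald tests a SAS procedure would use):

```python
import math
import numpy as np

def backward_eliminate(X, y, threshold=0.10):
    """Repeatedly drop the least significant predictor until every
    remaining p-value is below the threshold; returns surviving columns."""
    cols = list(range(X.shape[1]))
    while cols:
        Xc = X[:, cols]
        xtx_inv = np.linalg.inv(Xc.T @ Xc)
        beta = xtx_inv @ Xc.T @ y
        resid = y - Xc @ beta
        sigma2 = resid @ resid / (len(y) - len(cols))
        se = np.sqrt(sigma2 * np.diag(xtx_inv))
        # two-sided p-values from a normal approximation to the t-statistic
        pvals = [2 * (1 - 0.5 * (1 + math.erf(abs(b / s) / math.sqrt(2))))
                 for b, s in zip(beta, se)]
        worst = max(range(len(cols)), key=lambda i: pvals[i])
        if pvals[worst] < threshold:
            break                       # all remaining predictors significant
        cols.pop(worst)
    return cols

rng = np.random.default_rng(7)
X = rng.standard_normal((300, 6))       # features 0 and 1 are informative
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.standard_normal(300)
kept = backward_eliminate(X, y)
```

The informative predictors are retained because their p-values are effectively zero; pure-noise predictors are candidates for removal at each step.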
The regularized logistic regression was estimated with elastic net regularization (EN). This is a combination of Lasso regularization (L1 penalty) and ridge regularization (L2 penalty). Consequently, there are two hyperparameters to tune: alpha (the L1/L2 mixing parameter) and lambda (the overall penalty strength). The optimal combination of alpha and lambda was searched for with tenfold cross-validation on the validation set. For alpha, all values between 0 and 1 with an interval of 0.1 were used. For lambda, a range between 0.001 and 10,000 was used in 80 exponentially increasing steps. This resulted in 880 combinations of alpha and lambda values.

The Random Forest model (RF) is an ensemble of multiple decision trees. Each step in the construction of a decision tree uses a random selection of the input features (√(total number of variables)) and each tree uses a subset of the training data. The splitting criterion is the Gini index. We let the algorithm infer the optimal number of trees itself, such that the misclassification rate on the out-of-bag samples is minimized.
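The 880-point hyperparameter grid can be reproduced as follows (an illustrative Python sketch; the study used caret/glmnet in R, and the upper bound is read here as 10,000, interpreting the paper's "10.000" in Dutch thousands notation):

```python
import numpy as np

alphas = np.round(np.arange(0.0, 1.0 + 1e-9, 0.1), 1)        # 11 mixing values
lambdas = np.logspace(np.log10(0.001), np.log10(10_000), 80)  # 80 exp. steps
grid = [(a, lam) for a in alphas for lam in lambdas]          # 11 * 80 = 880

def en_penalty(b, alpha, lam):
    """Elastic net penalty at one grid point, glmnet parameterization:
    lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2)."""
    b = np.asarray(b, dtype=float)
    return lam * (alpha * np.abs(b).sum() + (1 - alpha) / 2 * (b ** 2).sum())
```

Each grid point would be scored by tenfold cross-validation, keeping the (alpha, lambda) pair with the best validation AUC.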
For the neural network (NN) a fully connected feed-forward network was used. We explored multiple architectures with a varying number of hidden layers and nodes per layer: networks with 1-3 hidden layers were optimized, with each layer having 10 up to 100 hidden nodes. Only the results of the best architecture, optimized on the validation set, are presented in the results section.
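Enumerating such a search space might look as follows (an illustrative Python sketch; the step size of 10 nodes per layer is an assumption, since the paper states only "10 up to 100"):

```python
from itertools import product

layer_counts = range(1, 4)             # 1 to 3 hidden layers
node_options = range(10, 101, 10)      # 10..100 nodes per layer (assumed step 10)

# every candidate architecture as a tuple of hidden-layer widths
architectures = [arch
                 for depth in layer_counts
                 for arch in product(node_options, repeat=depth)]
# e.g. (50,), (20, 80), (100, 100, 100)
```

Under these assumptions the search space contains 10 + 10² + 10³ = 1110 candidate architectures; each would be trained and scored on the validation set, keeping the one with the best AUC.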
Cut-off values for all four computational techniques, needed for the additional performance metrics such as sensitivity and misclassification, were derived from the Youden index, which is the sum of sensitivity and specificity minus one [29]. The total number of input features was used for all statistical techniques.
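The Youden-index cut-off can be computed directly from predicted scores (a minimal Python sketch on toy data):

```python
import numpy as np

def youden_cutoff(y_true, scores):
    """Return the score threshold maximizing J = sensitivity + specificity - 1."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = y_true == 1
    neg = ~pos
    best_t, best_j = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t                       # classify as positive at t
        sensitivity = (pred & pos).sum() / pos.sum()
        specificity = (~pred & neg).sum() / neg.sum()
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# toy example: a perfectly separating threshold exists at 0.8
t, j = youden_cutoff([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

Here t is 0.8 and J is 1.0; on real validation scores, the threshold with the maximum J serves as the cut-off for the confusion-matrix metrics.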
For NN, variable importance (VI) is hard to establish because of its "black box" nature [30,31]. We therefore computed VI only for RF, EN and LR. VI in the RF was calculated with Random Branch Assignment variable importance (RBA), which is much less influenced by correlations [32]. For EN, we used the absolute values of the coefficients to rank the features in order of variable importance. For LR, we used a multivariate coefficient score: we calculated the magnitude of the marginal effect of each (non-standardized) predictor by fitting multivariate LR models and filtering out insignificant coefficients according to the p-value "stay threshold". We report the top-10 features, their numbers of patients and univariate odds ratios, and discussed the outcome of the VI analysis with clinicians. Output data were analysed using R Statistical Software version 3.4.2 (Vienna, Austria); the caret and glmnet packages were used to conduct the EN analysis, and SAS Enterprise Guide 7.1 was used for LR, RF and NN (see Table 2 for software and model information).
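The EN ranking step reduces to sorting by coefficient magnitude (an illustrative Python sketch with made-up feature names and coefficients; as in the study, this is only meaningful when predictors share a scale, e.g. binary claims flags, or are standardized first):

```python
import numpy as np

def rank_by_abs_coef(names, coefs, top=10):
    """Rank features by |coefficient|, the EN-style importance used here."""
    order = np.argsort(-np.abs(np.asarray(coefs, dtype=float)))
    return [names[i] for i in order[:top]]

# hypothetical fitted coefficients for three claims features
names = ["prior_HF_admission", "COPD_DBC", "GP_visits"]
ranked = rank_by_abs_coef(names, [0.4, -1.3, 0.7])
```

Here ranked is ['COPD_DBC', 'GP_visits', 'prior_HF_admission']: the sign is discarded, so protective and harmful features are ranked purely by effect size.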

Baseline characteristics
Our study population consists of 25,776 CHF patients. Baseline characteristics of the overall study sample are described in Table 3.

Performance metrics and relevant features
AUC rates ranged from 0.710 to 0.732 for 1-year HF hospitalisation, 0.705-0.733 for 3-years HF hospitalisation, 0.765-0.787 for 1-year mortality and 0.764-0.791 for 3-years mortality. Elastic Net performed best for all endpoints. On the secondary performance metrics, the best-performing techniques were … for HF hospitalisation, RF for 1-year mortality and EN for 3-years mortality. The top-10 features of importance in our study are shown in Tables 5, 6, 7 and 8. For HF hospitalisation, previous hospitalisation for CHF or acute HF, comorbidities such as COPD, diabetes or oncology, and visits to the GP are the most common features among the trained models. For mortality, age, sex and marital status are also often used in the models. Features from 2014 were more common than features from 2012-2013.

Discussion
In this analysis, based on a HIC database of > 25,000 patients with CHF, we have shown that the use of both traditional and novel techniques has clear predictive value for predicting long-term HF hospitalisation and all-cause mortality in CHF patients, with AUCs between 0.7 and 0.8 [33,34].
For our main performance metric, the AUC, EN outperformed other statistical methods in predicting 1 and 3 years HF hospitalisation and 1 and 3 years mortality, although with only minor differences compared to traditional LR and only statistically significant between EN and LR compared with RF for 3-years HF hospitalisation.
Our results are comparable with earlier reported findings. Angraal et al. reported, in a recent study of 3-years mortality and HF hospitalisation, that RF was the best performing model, with a mean AUC of 0.72 (95% confidence interval [CI] 0.69-0.75) for predicting 3-years mortality and 0.76 (95% CI 0.71-0.81) for 3-years HF hospitalisation [35]. That study was based on a cohort of 1,767 patients with HF with preserved ejection fraction. Chicco et al. [36] analysed 9-month mortality in 299 patients with HF; RF outperformed all other methods by obtaining the top ROC AUC (0.800), while the artificial neural network perceptron obtained the top precision-recall AUC (0.750). The AUC outcomes on our HIC database are in line with these two recent studies, both based on clinical data, although RF did not outperform the other methods in our study. Mortazavi et al. found more pronounced differences (10-25%) between ML and traditional LR for the 30-days and 180-days HF hospitalisation outcomes, but they used only the 5 most important features as identified previously (blood urea nitrogen, glomerular filtration rate, sex, waist-to-hip ratio, and history of ischemic cardiomyopathy) in LR, whereas the remaining techniques were built on the full raw data of 472 inputs [37]. It should be noted that these studies had small datasets, which may limit the generalizability of their conclusions. A meta-analysis and meta-regression study of 117 prognostic models revealed only moderate accuracy for models predicting mortality, whereas models designed to predict the combined endpoint of death or HF hospitalisation, or HF hospitalisation only, had even poorer discriminative ability. The highest AUC-statistic values were achieved in a clinical setting, predicting short-term mortality with models derived from prospective cohort/registry studies with many predictor variables. The mean AUC-statistic was 0.66 ± 0.0005 for the models reporting a standard error.
Models using data from medical records had significantly better AUC-statistic values than models using claims data. Also, models using more predictor variables had better predictive values; AUC-statistic increased 0.0036 (SE = 0.0005) with each added predictor variable.
There was no significant difference in AUC-statistic values between patients diagnosed with CHF and those diagnosed with acute HF [38]. Most currently existing prognostic models in patients with CHF are based on data from randomized controlled trials or extracted from administrative datasets, such as medical insurance claims [39]. To our knowledge, this was the first study that applied machine learning techniques to a (Dutch) HIC database for CHF outcome prediction. A great advantage of a HIC database in the Netherlands is that it covers the entire healthcare utilization, since over 99% of the population has basic health insurance as mandated by law [40]. Most studies are based on data from one or a limited number of hospitals, but a HIC database holds data on all hospitals and patient visits. Moreover, a HIC database also includes general practitioner (GP), pharmaceutical and other healthcare-related data. The relevance of covering a patient's full healthcare usage was demonstrated by our feature importance analysis (Tables 5, 6, 7, 8), which shows that GP and pharmaceutical data are related to the endpoints in our study. In many countries, such as the Netherlands and the UK, primary care plays a central role in the diagnosis, long-term management, and end-of-life care of these patients. While specialist support from nurses and cardiologists is available, especially after admission and later at regular intervals of once or twice a year, GPs remain responsible for overseeing most patient care once a diagnosis is made, including management to delay progression, recognition of HF decompensations, and patient follow-up in the vulnerable period following an HF hospitalisation [41,42]. Most machine learning techniques for adverse outcomes in CHF focus on a short time period, mainly 30-days HF hospitalisation [6].
Although the risk for HF hospitalisation declines over time, patients with a CHF hospitalisation have a significantly elevated risk of HF hospitalisation for at least 1 year [4]. From a public health and patient perspective long-term adverse outcomes are as important and a HIC database has adequate information to examine this long-term perspective.
The most important features in our study are well known and reported in earlier studies, for instance age, sex, comorbidities such as diabetes or COPD, and living in buildings built between 1965 and 1974 as a proxy for socio-economic status. HF hospitalisation in the baseline period was, unsurprisingly, an important predictor of future HF hospitalisation, as was acute HF. Visits to the GP were an important predictor, but we do not know the reason for these visits; most likely they were related to comorbidities. Importantly, most data-driven machine learning techniques, including the ones we used, are correlational in nature and not causal, so caution is advised when interpreting these results [43].
As the current study was only utilizing HIC data, potentially important clinical features are not included in the current models. For instance, Parenica et al. [44] found higher age, LV dysfunction, comorbidities and high levels of natriuretic peptides as the most powerful predictors of worse prognosis in long-term survival. Reduced ejection fraction is a powerful predictor of long-term mortality, especially after the 6th year [45]. Ouwerkerk et al. [38] found 3 variables with a high predictive value: sodium; blood urea nitrogen; and systolic blood pressure. Clinical features such as ejection fraction and natriuretic peptides are not available in a HIC database, but age and comorbidities based on pharmacy-based cost groups or DRGs are.
We have given an overview of the model performance of several machine learning algorithms and traditional statistics in predicting the risk of HF hospitalisation and all-cause mortality in a representative sample of the Dutch population from a HIC database. Our findings are therefore generalisable to all CHF patients in the Netherlands. However, several limitations should also be acknowledged. First, models based on administrative claims data lack certain clinically relevant features, such as New York Heart Association class [46], left ventricular ejection fraction, intoxications such as smoking, blood pressure and physical activity [47]. Enriching HIC data through coupling with clinical data would be preferred, but strict general data protection regulation (GDPR) rules limit these possibilities. Using a trusted third party or novel techniques like Multi Party Computation could provide a good solution to overcome the legal burdens to coupling clinical and HIC data [16]. Second, due to GDPR regulations, patient data history is limited to 7 years in HIC databases; hence, relevant historical data may be missing. Third, the ability to explain and interpret ML models is limited, especially for NN. Hence, it is difficult to embrace these models and apply them in a clinically relevant way. More research is needed to explore the causal relationships of features that could be of importance in medical practice. In general, we found poor to moderate overlap between methods in their assessment of feature importance for the top-10 features, even when their performance was comparable and relatively good. The overlap was greatest between EN and LR and smallest with RF, because RF is heavily non-linear. The ability to explain and interpret RF is the most elaborate, because RF has an integrated procedure for producing variable importances [48]. However, for LR we used the multivariate method.
This is the simplest feature importance measure tested and, unsurprisingly, it rests on strong assumptions, namely that a predictor's importance is independent of all other factors. It is also important to note that significant predictors in LR may not make useful predictions [49]. Finally, we reduced the input claims features using prevalence prioritization and Lasso regression for performance reasons; by doing so we may have excluded features that could have been relevant. In this study based on a health insurance claims database we have shown clear predictive value for predicting long-term HF hospitalisation and mortality in CHF patients. Novel machine learning techniques such as RF and NN may help avert more preventable HF hospitalisations or deaths, because they allow for non-linear relations, while EN can reduce the number of irrelevant features. In the long run, we hope that applying state-of-the-art machine learning to clinical data combined with HIC data can improve risk stratification and prognosis by offering high-risk patients timely interventions, for example cardiac rehabilitation, and by optimizing medical therapy and stimulating medication adherence.