Deep learning for prediction of population health costs

Background Accurate prediction of healthcare costs is important for optimally managing health costs. However, methods leveraging the medical richness from data such as health insurance claims or electronic health records are missing. Methods Here, we developed a deep neural network to predict future cost from health insurance claims records. We applied the deep network and a ridge regression model to a sample of 1.4 million German insurants to predict total one-year health care costs. Both methods were compared to existing models with various performance measures and were also used to predict patients with a change in costs and to identify relevant codes for this prediction. Results We showed that the neural network outperformed the ridge regression as well as all considered models for cost prediction. Further, the neural network was superior to ridge regression in predicting patients with cost change and identified more specific codes. Conclusion In summary, we showed that our deep neural network can leverage the full complexity of the patient records and outperforms standard approaches. We suggest that the better performance is due to the ability to incorporate complex interactions in the model and that the model might also be used for predicting other health phenotypes. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-021-01743-z.


Introduction
Health care expenditures are one of the biggest expenses in Germany, and optimally managing these costs has great economic importance. Therefore, methods for accurate patient-level prediction of future health care costs are needed to provide the basis for decision making. As medical costs reflect the development of health over time, and health in turn is influenced by many factors such as social demographics, previous medical history, environmental influences and genetics, but also by random events such as accidents, predicting future health is inherently challenging. Consequently, accurately predicting health costs is a challenging problem. Existing work on prediction of health costs can be divided into two categories [18]: (1) Rule-based prediction methods, in which the decision rules of an algorithm to predict future costs are manually defined. The disadvantage of this approach is that it requires deep domain knowledge and that the capability of the resulting models to reflect complex relations in the data is limited.
(2) Supervised-learning-based methods (e.g. linear regression models, random forests or support vector methods) that learn to predict future costs from the data [2,5,7,9,12,18]. These methods have the advantage that they are not limited in their expressiveness as rule-based methods are. However, they typically require large datasets for training. For training of these methods, health insurance claims records are an appealing data source. They cover most of the health care expenditures of the patients and have sample sizes that allow fitting rich models. Additionally, they contain detailed information on patients, such as the medical history and social demographic information. The challenges of these data are that they are high dimensional, that there are many hidden interactions between variables, and that the data are often not normally distributed [4]. The aforementioned supervised learning methods are believed to typically not leverage the potential of population-scale data to detect complex patterns [19].
Recent developments in deep learning techniques, such as novel deep neural network architectures and numerical approaches to fit the networks, promise to address some of these challenges. Deep learning has been successfully applied in the medical domain to tasks such as dermatologist-level detection of skin cancer [8], prediction of various clinical outcomes from electronic health records [15], or the detection of diabetic retinopathy from retinal fundus photographs [10], showing the potential of this technology.
We present a novel deep neural network architecture to predict future health care costs from health insurance claims records (see Figure 1). This network architecture allows the richness of the medical data in health insurance claims to be fully captured and can be fitted on a standard workstation with 64 GB of RAM. We compare it against various standard methods on health insurance claims records of ∼1.4 million patients from German statutory health insurances and show that it outperforms existing approaches. It is also better at identifying patients at risk than standard linear regression approaches. Finally, we show how the parameters of the network can be interpreted and that the network uses medically relevant features for its prediction.

Data
This study is based on data from the Institute for Applied Health Research Berlin (InGef) database, which contains anonymised longitudinal claims data of more than 60 German statutory health insurances. Claims data of the years 2010 to 2017 for a sample of 1 403 346 insurants were used; the sample is representative of the German population with respect to age, sex and state of residence. Besides sociodemographic information, the database contains information on hospital stays, outpatient physician visits, drug prescriptions, and remedies and aids, including costs in each of the sectors. Further details of the database can be found elsewhere [1]. Approval by an ethics committee or informed consent of the patients was not required for the conduct of this study, since all patient- and provider-level information is anonymised to comply with German data protection regulations and German federal law. In the remainder of this manuscript we will refer to the period ranging from Q1 2010 to Q4 2015 as the observation period and to the period from Q3 2016 to Q2 2017 as the evaluation period.

Data Representations
The input for the machine learning algorithms was formatted in the following manner: Each numerical value was kept as a feature. Dates were coded in quarters since Q1 of 2010. Categorical values, such as International Statistical Classification of Diseases and Related Health Problems, 10th revision, German Modification (ICD-10-GM) codes, Anatomical Therapeutic Chemical (ATC) codes, Diagnosis Related Group (DRG) codes, German procedure classification (OPS) codes, physician subject group key (FG) codes, schedule of fees for physician outpatient services (GOP) codes or sex, were coded using a one-hot encoding (i.e. given n possible categories k_1, ..., k_n, the observation of category k_j was coded by an n-dimensional vector that was 1 at index j and 0 everywhere else). If multiple categories were observed in a quarter, the representing vectors were added. This coding was performed for each quarter and the resulting vectors were concatenated into a single vector of dimension 24*91'470 = 2'195'280 representing the patient data in the observation period. In order to accelerate model fitting, we only considered variables that had more than 1 000 entries, leading to a vector of dimension 24*13'876 = 333'024.
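The encoding described above can be illustrated with a minimal sketch. It assumes each patient is given as a list of (quarter, code) pairs; the function name `encode_patient`, the toy vocabulary and the codes are hypothetical and not taken from the study data. Numerical features and the frequency filter (more than 1 000 entries) are not shown.

```python
def encode_patient(events, vocabulary, n_quarters):
    """Return one count vector per quarter, concatenated into a single list.

    events: iterable of (quarter_index, code) pairs, quarter_index counted from 0
    vocabulary: dict mapping each code to its position within a quarter block
    """
    block = len(vocabulary)
    vector = [0] * (n_quarters * block)
    for quarter, code in events:
        if code in vocabulary and 0 <= quarter < n_quarters:
            # Adding one-hot vectors is the same as counting occurrences.
            vector[quarter * block + vocabulary[code]] += 1
    return vector


# Toy vocabulary (illustrative codes only, an assumption of this sketch).
vocabulary = {"ICD:E11": 0, "ATC:A10BA02": 1, "SEX:F": 2}
events = [(0, "ICD:E11"), (0, "ICD:E11"), (1, "ATC:A10BA02"), (1, "SEX:F")]
encoded = encode_patient(events, vocabulary, n_quarters=2)
```

With 24 quarters and a vocabulary of 13'876 retained variables, the same layout would give the 333'024-dimensional vector mentioned above.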

Model definition
We used a model with four hidden layers (see Figure 2). Each of the four hidden layers had 50 neurons. The original input was concatenated to the output of the fourth hidden layer and fed to the last layer, which had seven neurons to predict seven cost categories (medications, practice, hospital, medical sundries, therapeutic appliances, compensation for incapacity to work and dentistry). All layers used the ReLU activation function [14] and a dropout [16] rate of 0.25 during training. We compared the deep learning model to three standard models: (1) the average cost per year in the previous 6 years; (2) the costs in the last year of the observation period; (3) a ridge regression with parameter λ = 0.1. For model assessment, we predicted the seven cost domains listed above separately. We then summed all predicted costs except the cost to compensate for incapacity to work, in order to make the costs comparable to those reported for the Morbi-RSA [5,6]. Furthermore, we performed model ensembling for the ridge regression and the neural network (i.e. each model was trained five times and the predictions of all five models were averaged for the final prediction).
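The data flow of this architecture can be sketched as a plain forward pass. This is an illustration under stated assumptions, not the authors' Keras implementation: the helper names, the random initialisation and the toy input size of 12 features are hypothetical, applying ReLU to the output layer is one reading of "all layers", and dropout is omitted because it only acts during training.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: weights has one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_features, n_hidden=50, n_hidden_layers=4, n_outputs=7, seed=0):
    rng = random.Random(seed)
    hidden = []
    n_in = n_features
    for _ in range(n_hidden_layers):
        hidden.append(init_layer(n_in, n_hidden, rng))
        n_in = n_hidden
    # The output layer sees the last hidden vector concatenated with the raw input.
    output = init_layer(n_hidden + n_features, n_outputs, rng)
    return hidden, output


def predict(network, x):
    hidden, (w_out, b_out) = network
    h = list(x)
    for weights, biases in hidden:
        h = relu(dense(h, weights, biases))
    # Skip connection: concatenate hidden representation and original input.
    return relu(dense(h + list(x), w_out, b_out))


network = build_network(n_features=12)
costs = predict(network, [1.0] * 12)  # seven non-negative cost predictions
```

The concatenation gives the output layer direct linear access to every input feature, so the hidden layers only need to model what a linear term cannot.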

Model fitting
For model fitting we used the first 903 346 patients (training set). During model fitting, we minimized the L2 loss between the future and the predicted costs using ADAM [11], which is an extension of stochastic gradient descent. Both the ridge regression and the deep learning model were trained for 25 epochs. A batch size of 128 was used for training the ridge regression and a batch size of 32 for training the neural network.

Implementation
All models were implemented in Python using Keras [3].

Evaluation Criteria
To assess the model quality, we determined the following quality criteria: Pearson's correlation coefficient, Spearman's correlation coefficient, the mean absolute error and Cumming's Prediction Measure. The performance was evaluated on the subset of 357 239 of the 500 000 held-out patients (test set) that were alive in the observation period and either died or were still insured on at least one day in the evaluation period.
We further assessed how well the methods could be used to identify patients with changing costs. As such a change indicates a change of health status or treatment, these patients could benefit from preventive interventions. To this end, we divided our test set patients into three groups: those for which the cost decreased more than 100-fold between the last year of the observation period and the evaluation period; those for which the cost increased more than 100-fold; and the remaining patients. In order not to include patients with overall low costs in the two groups with strong cost changes (e.g. patients whose costs change from 0.01 to 10.0 Euro), we added 10 Euros to the overall cost before computing the fold change. We then computed the area under the precision-recall curve (auPRC) for identifying patients with increasing and decreasing costs, respectively, among all patients. To understand for which cost range the respective methods performed best, we computed the prediction error as a function of the cost.
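For reference, the four quality criteria can be written down in a few lines. The sketch below is a plain-Python illustration with hypothetical function names and toy numbers; it assumes the commonly cited form of Cumming's Prediction Measure, CPM = 1 - Σ|y - ŷ| / Σ|y - mean(y)|, which may differ in detail from the variant used in the study.

```python
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    # Spearman's coefficient is Pearson's coefficient computed on ranks.
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(true, predicted):
    return mean(abs(t - p) for t, p in zip(true, predicted))


def cpm(true, predicted):
    """Assumed form: 1 - sum|y - y_hat| / sum|y - mean(y)|."""
    reference = mean(true)
    absolute_error = sum(abs(t - p) for t, p in zip(true, predicted))
    return 1 - absolute_error / sum(abs(t - reference) for t in true)


# Toy costs in Euro (made up for illustration).
true_costs = [0.0, 100.0, 250.0, 4000.0]
predicted_costs = [50.0, 80.0, 300.0, 3000.0]
```

Because health costs are heavily right-skewed, the rank-based and absolute-error measures complement Pearson's correlation, which is dominated by a few very expensive patients.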
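The grouping rule with the 10 Euro offset can be sketched as follows. The function names and group labels are hypothetical; the threshold and offset are the values stated above.

```python
def fold_change(current_cost, future_cost, offset=10.0):
    """Ratio of future to current cost after adding the offset to both."""
    return (future_cost + offset) / (current_cost + offset)


def change_group(current_cost, future_cost, threshold=100.0, offset=10.0):
    ratio = fold_change(current_cost, future_cost, offset)
    if ratio > threshold:
        return "increase"
    if ratio < 1.0 / threshold:
        return "decrease"
    return "stable"
```

For the example from the text, a change from 0.01 to 10.0 Euro yields a ratio of about 2 after the offset and therefore falls into the remaining group, whereas without the offset it would count as a 1000-fold increase.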

Sensitivity analyses
We investigated how the performance of the neural network depends on the amount of available training data. To this end, we trained the model on only 100 000, 200 000, 300 000, 400 000, 500 000, 600 000, 700 000, 800 000 and 900 000 patients. Furthermore, we investigated how the length of the observation time affects the predictive performance. Therefore, we also trained the model for each of the patient sets using the data from one to six years up to the end of the observation period.

Feature identification
An important application of predictive models is to identify relevant features in the data and to understand their effect on the prediction. This allows, for example, risk factors to be identified and quantified. A common approach in linear models is to identify the weights that have a large absolute value, as they correspond to the features that have a strong impact on the prediction. For deep neural networks it has been shown that this strategy is suboptimal [17], as it does not capture the interactions between features that the neural network uses. Here, we therefore used a strategy called integrated gradients [17] that is more robust. We determined the average integrated gradients of all patients in the evaluation set. Furthermore, we divided the mean integrated gradient by the number of times the respective feature was nonzero, to account for the fact that not all features are equally abundant. In the results, we did not show codes that would allow identification of the health insurance companies that contributed to the study database.
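Integrated gradients attribute a prediction F(x) to feature i as (x_i - x'_i) times the integral over α from 0 to 1 of ∂F/∂x_i evaluated at x' + α(x - x'), where x' is a baseline input. The sketch below approximates the integral with a midpoint Riemann sum for a toy model with an analytic gradient; the model, the all-zero baseline, the function names and the reading of the abundance normalisation (summed attribution divided by the nonzero count) are assumptions made for illustration.

```python
def integrated_gradients(gradient, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Toy model with an interaction term: F(x) = x0 * x1 + 2 * x2.
def model(x):
    return x[0] * x[1] + 2.0 * x[2]


def model_gradient(x):
    return [x[1], x[0], 2.0]


def attribution_per_nonzero(attributions, inputs):
    """Sum attributions over patients, divided by how often each feature was nonzero."""
    result = []
    for i in range(len(inputs[0])):
        count = sum(1 for row in inputs if row[i] != 0)
        total = sum(a[i] for a in attributions)
        result.append(total / count if count else 0.0)
    return result


patients = [[1.0, 2.0, 0.0], [3.0, 0.0, 1.0]]
baseline = [0.0, 0.0, 0.0]
attributions = [integrated_gradients(model_gradient, p, baseline) for p in patients]
importance = attribution_per_nonzero(attributions, patients)
```

For each patient the attributions sum to F(x) - F(x'), and the interaction term x0 * x1 is split between the two interacting features, which a weight-magnitude ranking cannot express.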

Results
To establish a baseline, we first compared the performance of all methods in predicting costs. We found that the neural network was able to better predict future costs than ridge regression or the other two standard models in all considered measures, as shown in Table 1. Furthermore, we found that ensembling several training runs provides an additional small improvement.
To better understand in which cost regimes the neural network and the ridge regression performed better, we studied the average absolute error in Euros depending on the true costs. The neural network performed better for patients with total costs lower than ∼10 000 Euro, whereas the ridge regression performed better for patients who were more expensive (see Figure 3a and Figure 3b).
As sensitivity analyses, we studied how the number of samples in the training set and the length of the observation period affect the performance of the prediction for the neural network. Our analyses showed that the predictive performance, as measured by R², increased as the number of patients increased. The same was true when the observation time increased (see Figure 4a). A similar picture can also be seen for the Spearman and Pearson correlation (see Supplementary Figure S1). We compared this to the performance of the ridge regression (see Figure 4b). We found that at 100 000 patients the R² of the neural network was lower than that of the ridge regression, but that for larger sample sizes the neural network had in general a higher R².
Next, we analysed the ability to identify patients with changing costs (Figure 5a and Figure 5b). In this analysis, we did not consider the model that used the last year's costs as prediction for the future costs, as costs are predicted to stay constant under this model. The results for predicting patients with increasing and decreasing costs are shown in Figure 5c and Figure 5d, respectively. We found that, overall, prediction of decreasing costs was easier than prediction of increasing costs. Furthermore, we found that for both directions of the cost change the neural network outperformed the ridge regression. For increasing costs the neural network had an auPRC of 0.08, while the ridge regression only had an auPRC of 0.04. For decreasing costs the neural network had an auPRC of 0.24, while the ridge regression had an auPRC of 0.21. A similar picture also emerged for the area under the ROC curve (auROC), where the neural network had an auROC of 0.93 and 0.90 for decreasing and increasing costs, respectively. Here the ridge regression had an auROC of 0.93 and 0.86 for decreasing and increasing costs, respectively. For both measures the ridge regression and the neural network were substantially better than the baseline methods that did not model the costs.
Finally, we used integrated gradients to study on which features of the data the neural network based its prediction and how this differed from the features used by the ridge regression. We first determined the importance of features from different quarters in the observation period by summing the integrated gradients of all features in a quarter. We found that both methods have a similar temporal distribution of the importance and that the most recent features were the most important for prediction (see Figure 6).
To evaluate whether the features showed a qualitative difference between the neural network and the regression, we identified the features with the highest integrated gradient (associated with higher cost) in the set of patients that had a more than 100-fold increase in costs. To this end, we summed the integrated gradients of each code over all quarters in the observation period. The top-20 codes are shown in Table 2 for the neural network and in Table 3 for the ridge regression. We found that the neural network relied more on ICD-10 diagnosis and ATC medication codes than the ridge regression (8 of 20 vs. 3 of 20).
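The two aggregations used here, per quarter and per code, can be sketched on a flat attribution vector. The sketch assumes a quarter-major layout (all codes of the first quarter, then all codes of the second quarter, and so on); the function names, code names and attribution values are hypothetical.

```python
def sum_by_quarter(attributions, n_quarters, block):
    """Total attribution of all features within each quarter."""
    return [sum(attributions[q * block:(q + 1) * block]) for q in range(n_quarters)]


def sum_by_code(attributions, n_quarters, block):
    """Total attribution of each code over all quarters."""
    return [sum(attributions[q * block + c] for q in range(n_quarters))
            for c in range(block)]


def top_codes(code_scores, code_names, k):
    """The k codes with the highest summed attribution, highest first."""
    ranked = sorted(zip(code_names, code_scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Two quarters with three codes each (made-up values).
code_names = ["code_a", "code_b", "code_c"]
attributions = [0.1, 0.0, 0.2, 0.3, 0.5, 0.1]
per_quarter = sum_by_quarter(attributions, n_quarters=2, block=3)
per_code = sum_by_code(attributions, n_quarters=2, block=3)
best = top_codes(per_code, code_names, k=2)
```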

Discussion
Accurate prediction of future health care costs provides the basis to optimally manage healthcare costs. Furthermore, identification of patients whose costs will change allows optimization of interventions given a limited budget in order to improve population health. To achieve this, it is important to have accurate predictions of future health costs. In this work we presented a deep-learning-based approach to predict future costs. Our approach can leverage the full complexity of the patient records and does not require prior feature selection.
We showed that our approach can outperform standard approaches, including the Morbi-RSA, for all measured performance metrics (see Table 1). We suggest that the performance gain is due to two reasons. First, our approach learns important features from the data and does not require manual feature selection. It has been shown that learnt features allow better predictions in computer vision and speech processing given enough training data [13]. The value of learning predictive features from the data is suggested by the better (state-of-the-art) performance of our implementation of ridge regression compared to the existing implementation of the Morbi-RSA, which is only based on 80 diseases. Second, our deep learning approach allows modelling of complex interactions between all variables, which is not possible for ridge regression. This enables better modelling of medical phenotypes, such as interactions between age, sex and diagnosis. This is supported by the differences between the terms that the ridge regression and the deep neural network associate with increasing costs: the ridge regression uses mainly the GOP codes, whereas the deep network puts a higher emphasis on medical diagnoses and prescribed drugs. It is also worth noting that, in contrast to the Morbi-RSA, which is mainly based on ICD-10 codes, both the ridge regression and the neural network rely on GOP codes.
Since we placed no strong assumption on the phenotype that we modelled, we believe that the neural network may also be easily adapted to predict other medical phenotypes. However, we also acknowledge that further research is necessary to better understand the merits and limits of deep learning in identifying medical phenotypes from insurance claims. This includes the optimal architecture of the networks, but also strategies to interpret deep networks, to provide uncertainty estimates for the models and to model distribution shifts caused by changes in billing and treatment guidelines.

Conflicts of interest/Competing interests
None declared.

Availability of data and material
The datasets used in the current study are not publicly available due to privacy and security concerns.

Code availability
Code will be available upon publication.

Figure 2: Network architecture: Shown is the architecture of the proposed deep neural network. Shown in light grey are the input features. Shown in dark grey are the target variables of the network. The white nodes are the internal nodes of the network.

Figure 3: Error analysis: Shown is the histogram of total cost (a), the log10 absolute error based on the true cost of the ridge regression and the neural network (b) as well as the difference between the neural network error and the ridge regression error (c)

Figure 5: Cost change prediction: Shown are the raw costs (a) in the last year of the observation period (current costs) and the evaluation period (future costs), as well as the log10 fold-change between them (b). Shown in (c)-(d) are the precision-recall curves for predicting increasing and decreasing costs.

Figure 6: Importance of features per quarter: Shown is the summed normalized integrated gradient (Importance) per year for the ridge regression and the neural network.

Table 2: Codes with the highest feature importance as determined by the integrated gradients (IG) for the neural network, as well as a description of the codes

Table 3: Codes with the highest feature importance as determined by the integrated gradients (IG) for the ridge regression, as well as a description of the codes