Improving palliative care with deep learning

Background: Access to palliative care is a key quality metric which most healthcare organizations strive to improve. The primary challenges to increasing palliative care access are a combination of physicians over-estimating patient prognoses and a general shortage of palliative care staff. This, in combination with treatment inertia, can result in a mismatch between patients' wishes and their actual care towards the end of life.
Methods: In this work, we address this problem, with Institutional Review Board approval, using machine learning and Electronic Health Record (EHR) data of patients. We train a Deep Neural Network model on the EHR data of patients from previous years to predict mortality of patients within the next 3-12 month period. This prediction is used as a proxy decision for identifying patients who could benefit from palliative care.
Results: The EHR data of all admitted patients are evaluated every night by this algorithm, and the palliative care team is automatically notified of the list of patients with a positive prediction. In addition, we present a novel technique for decision interpretation, with which we provide explanations for the model's predictions.
Conclusion: The automatic screening and notification saves the palliative care team the burden of time-consuming chart reviews of all patients, and allows them to take a proactive approach in reaching out to such patients rather than relying on referrals from the treating physicians.


I. INTRODUCTION
Studies have shown that approximately 80% of Americans would like to spend their final days at home if possible, but only 20% do [1]. In fact, up to 60% of deaths happen in an acute care hospital, with patients receiving aggressive care in their final days. Access to palliative care services in the United States has been on the rise over the past decade. In 2008, 53% of all hospitals with fifty or more beds reported having palliative care teams, rising to 67% in 2015 [2]. However, despite increasing access, data from the National Palliative Care Registry estimate that fewer than half of the 7-8% of all hospital admissions that need palliative care actually receive it [3]. Though a significant reason for this gap comes from the palliative care workforce shortage [4], and the incentives for health systems to employ such staff, technology can still play a crucial role by efficiently identifying patients who may benefit most from palliative care but might otherwise be overlooked under current care models.
We focus on two aspects of this problem. First, physicians may not refer patients likely to benefit from palliative care for multiple reasons such as overoptimism, time pressures, or treatment inertia [5]. This may lead to patients failing to have their wishes carried out at end of life [6] and overuse of aggressive care. Second, a shortage of palliative care professionals makes proactive identification of candidate patients via manual chart review an expensive and time-consuming process.
The criteria for deciding which patients benefit from palliative care can be hard to state explicitly. Our approach uses deep learning to screen patients admitted to the hospital to identify those who are most likely to have palliative care needs. The algorithm addresses a proxy problem (predicting the mortality of a given patient within the next 12 months) and uses that prediction to make recommendations for palliative care referral. This frees the palliative care team from manual chart review of every admission and helps counter the potential biases of treating physicians by providing an objective recommendation based on the patient's EHR. Existing tools for identifying such patients have limitations, which are discussed in the next section.

II. RELATED WORK
Accurate prognostic information is valuable to patients, caregivers, and clinicians [7] [8]. Several studies have shown that clinicians are generally over-optimistic in their estimates of the prognoses of terminally ill patients [9] [5] [10] [11]. It has also been shown that no subset of clinicians is better at late-stage prognostication than others [12] [13]. However, clinician judgment remains the most common method of predicting survival in practice [12]. Several solutions exist that attempt to make patient prognosis more objective and automated. Many of these solutions are models that produce a score based on the patient's clinical and biological parameters, which can be mapped to an expected survival rate.

Prognostic tools in Palliative Care
The Palliative Performance Scale [14] was developed as a modification of the Karnofsky Performance Status Scale (KPS) [15] to the palliative care setting, and is calculated based on observable factors such as degree of ambulation, ability to do activities, ability to do self-care, food and fluid intake, and state of consciousness. The Palliative Prognostic Score (PPS) was constructed for the palliative care setting as well, focusing on terminally ill cancer patients [16]. The PPS is calculated with multiple regression analysis based on the following variables: Clinical Prediction of Survival (CPS), Karnofsky Performance Status (KPS), anorexia, dyspnea, total white blood count (WBC) and lymphocyte percentage. The Palliative Prognostic Index (PPI), developed around the same time as PPS, also calculates a multiple regression analysis based score using Performance Status, oral intake, edema, dyspnea at rest, and delirium. These scores are difficult to implement at scale since they involve face-to-face clinical assessment and prediction of survival by the clinician. Furthermore, these scores were designed to be used within the palliative care setting, where the patient is already in an advanced stage of the disease, as opposed to identifying them earlier.

Prognostic tools in the Intensive Care Unit
There are also prognosis scoring models that are commonly used in the Intensive Care Unit. The APACHE-II (Acute Physiology, Age, Chronic Health Evaluation) Score predicts hospital mortality risk for critically ill hospitalized adults in the ICU [17]. This model has been more recently refined with the APACHE-III Score, which uses factors such as major medical and surgical disease categories, acute physiologic abnormalities, age, preexisting functional limitations, major comorbidities, and treatment location immediately prior to ICU admission [18]. Another commonly used scoring system in the ICU is the Simplified Acute Physiological Score, or SAPS II [19], which is calculated based on the patient's physiological and underlying disease variables. While these scores are useful for the treatment team when the patient is already in the ICU, they are of limited use for identifying patients at risk of longer-term mortality while those patients are still capable of having a meaningful discussion of their goals and values, so that they can be set on an alternative path of care.

Prognostic tools for Early Identification
There have been a number of studies and tools developed that aim to identify terminally ill patients early enough for an end-of-life plan and care to be meaningful.
CriSTAL (Criteria for Screening and Triaging to Appropriate aLternative care) was developed to identify elderly patients nearing end of life, and quantifies the risk of death in the hospital or soon after discharge [20]. CriSTAL provides a checklist using eighteen predictors with the goal of identifying the dying patient.
CARING is a tool that was developed to identify patients who could benefit from palliative care [21]. The goal was to use six simple criteria in order to identify patients who were at risk of death within 1 year. PREDICT [22] is a screening tool also based on six prognostic indicators, which were refined from CARING. The model was derived from 976 patients.
The Intermountain Mortality Risk score is an all-cause mortality prediction based on common laboratory tests [23]. The model provides scores for 30-day, 1-year and 5-year mortality risk. It was trained on a population of 71,921 and tested on 47,458.
Cowen, M et al [24] proposed using a twenty-four factor based prediction rule at the time of hospital admission to identify patients with high risk of 30-day mortality, and to organize care activities using this prediction as a context. One of their motivations was to have a rule based on a single set of factors that is not disease specific. The model was derived from 56,003 patients.
Meffert, C et al [25] proposed a scoring method based on logistic regression on six factors to identify hospitalized patients in need of palliative care. In this prospective study, they asked the treating physician at the time of discharge whether the patient had palliative care needs. The trained model was then used to identify such patients at the time of admission. The model was derived from 39,849 patients.
Ramachandran, K et al [26] developed a 30-day mortality prediction tool for hospitalized cancer patients. Their model used eight variables that were based on information from the first 24 hours of admission, and laboratory results and vitals. A logistic regression model was developed from these eight variables and used as a scoring function. The model was derived from 3,062 patients.
Amarasingham, R et al [27] built a tool to screen patients who were admitted with heart failure, and identify those who are at risk of 30-day readmission or death. Their regression model uses a combination of the Tabak Mortality Score [28], markers of social, behavioral, and utilization activity that could be obtained electronically, ICD-9 CM codes specific to depression and anxiety, and billing and administrative data. Though this study was not specifically focused on palliative care, the methodology of using EHR system data is relevant to our work. The model was derived from 1,372 patients.
Makar, M et al [29] used only Medicare claims data on an older population (≥ 65 years) to predict mortality within six months. By limiting their model to use only administrative data, they hypothesized an easier deployment scenario, thereby making automated prognostic models more prevalent. The model was derived separately on four cohorts (one per disease type) with 20,000 patients per cohort.

Prognosis in the age of Big-Data
The proliferation of EHR systems in healthcare combined with advances in Machine Learning techniques on high dimensional data provides a unique opportunity to make contributions, especially in disease prognosis [30] [31]. All the tools described above, and those we reviewed [32] [33] [34] [35] [36], have at least one of the following limitations. They were either derived from small data sets (limited to specific studies or cohorts), or used too few variables (intentionally, to make the model portable or to avoid overfitting), or the model was too simple to capture the complexities and subtleties of human health, or was limited to certain sub-populations (based on disease type, age, etc.). We address these limitations in our work.

III. METHODS
We approach the problem of predicting mortality from the point of view of the palliative care team by being largely agnostic to disease type, disease stage, severity of admission (ICU vs non-ICU), age, etc. We take a data driven approach and build a deep learning model that considers every patient in the EHR (with a sufficiently long history), without limiting our analysis to any specific sub-population or cohort. In order to make the problem of identifying patients with palliative care needs tractable, we use the following proxy problem statement instead: Given a patient and a date, predict the mortality of that patient within 12 months from that date, using EHR data of that patient from the prior year.
We treat this as a binary classification problem and build a supervised deep learning model to solve it. Other than building a model that performs well on the above problem, we are also separately interested in the model performance on a subproblem: the ability to predict mortality of patients who are currently admitted. This is because it is much easier for the palliative care staff to intervene with admitted patients.

Constructing a Dataset for Supervised Learning
Patients who have a recorded date of death are considered positive cases; other patients are considered negative cases. Further, we define the prediction date of a patient to be the point in time that divides their health record timeline into virtual future and past events. We use data from each patient's virtual past to make predictions about their survival 3-12 months in the future. Note that we must take care when defining the prediction date to not violate common sense constraints (described below) that could invalidate the labels. We only include patients for whom it is possible to find a prediction date that satisfies these constraints.
Positive Cases: The constraints for positive cases were decided based on the rationale that palliative care is most beneficial if the referral occurs 3-12 months prior to death. Predicting mortality within 3 months is considered too late due to the preparatory time required to start palliative care in general. On the other hand, a lead time longer than 12 months is problematic because making accurate predictions over such a long time horizon is difficult and, more importantly, palliative care interventions are a limited resource that is best focused on more immediate needs. The prediction date for positive cases must meet all the following constraints:
• The prediction date must be at least 12 months after the date of first encounter (otherwise the patient lacks sufficient history on which to base a prediction).
• In-patient admissions are preferred over other admission types for the prediction date, as long as they meet the previous constraints (since it is easier to start the palliative care conversation with them).
• The prediction date must be the earliest among the possible candidate dates subject to previous constraints.
Negative Cases: For negative cases (patients without a date of death), we require that the patient was alive for at least 12 months from the prediction date. We choose the prediction date such that it satisfies all the following constraints:
• The prediction date must be a recorded date of encounter.
• The prediction date must be at least 12 months prior to date of last encounter (to avoid ambiguity of death after date of EHR snapshot).
• The prediction date must be at least 12 months after the date of first encounter (otherwise insufficient history).
• In-patient admissions are preferred over other encounter types for the prediction date, as long as they meet the previous constraints (to serve as controls for the admitted positive cases).
• The prediction date must be the latest among the possible candidate dates subject to previous constraints.
Admitted patients: Those patients whose prediction date corresponds to an in-patient admission are considered admitted patients. Remaining patients are considered non-admitted (note that non-admitted patients could still have other recorded admissions in their history). Further, for admitted patients, the prediction date is re-adjusted by incrementing it to be the second day of admission. The rationale for doing this is that patient records are generally updated with the latest data (preliminary tests, diagnostics, etc.) within 24 hours of admission, and the second day is better suited for making a more informed prediction. Note that the admitted patients are
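As a rough illustration, the negative-case constraints listed in this section can be expressed as a small selection routine. This is a sketch under stated assumptions, not the code used in this work: the function name `negative_prediction_date`, the `(date, is_inpatient)` encounter representation, and the approximation of 12 months as 365 days are all hypothetical choices made for the example.

```python
from datetime import date, timedelta

YEAR = timedelta(days=365)  # assumption: "12 months" approximated as 365 days


def negative_prediction_date(encounters):
    """Pick a prediction date for a patient with no recorded date of death.

    encounters: list of (encounter_date, is_inpatient) tuples (hypothetical format).
    Returns (prediction_date, admitted), or None if no date satisfies the constraints.
    """
    if not encounters:
        return None
    first = min(d for d, _ in encounters)
    last = max(d for d, _ in encounters)
    # Candidates are recorded encounter dates with at least 12 months of
    # history before them and at least 12 months of follow-up after them.
    candidates = [(d, inpatient) for d, inpatient in encounters
                  if d - first >= YEAR and last - d >= YEAR]
    if not candidates:
        return None
    inpatient_dates = [d for d, inpatient in candidates if inpatient]
    if inpatient_dates:
        # In-patient admissions are preferred; take the latest one and move
        # the prediction date to the second day of the admission.
        return max(inpatient_dates) + timedelta(days=1), True
    return max(d for d, _ in candidates), False


example = [
    (date(2010, 1, 1), False),
    (date(2011, 6, 1), True),
    (date(2011, 9, 1), False),
    (date(2013, 1, 1), False),
]
chosen = negative_prediction_date(example)
```

In this example the in-patient admission is chosen even though a later out-patient encounter also qualifies, which mirrors the stated preference for admissions.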

Data Description
The inclusion criteria selected a total of 221,284 patients. Table I shows the breakdown of these patients based on inclusion and admission.
We observe that, unsurprisingly, the distribution of age at prediction time is not equal between the classes, and that the positive class (of deceased patients) is skewed towards older age (Fig 2).
The included patients are randomly split in approximate ratio 8:1:1 into training, validation and test sets, as shown in Table II.
The prevalence of death among the included patients is approximately 7%. Approximately 5% were admitted patients (i.e., the prediction date was the second day of an admission). Among the admitted patients, the prevalence of death is about 11%.

Feature Extraction
For each patient, we consider the 12 months leading up to their prediction date as their observation window.Within the observation window of each patient, we use ICD9 (International Classification of Diseases 9th rev) diagnostic and billing codes, CPT (Current Procedural Terminology) procedure codes, RxNorm prescription codes, and encounters found in that period to create features.
We create features as follows. In order to capture the longitudinal nature of the data, we split the observation window of each patient into four observation slices, specified relative to the prediction date (PD) as shown in Table III. Thus, observation slice 1 is the most recent, and 4 is the oldest. The slice widths are intentionally uneven in order to give more emphasis to recent data. Within each observation slice, we count the number of occurrences of each code in each code category (prescription, billing, etc.) per patient. The count of every such code within the slice is considered a separate feature.
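The per-slice counting can be sketched as follows. The slice boundaries in `SLICES` are placeholders chosen only for illustration (the actual widths are those given in Table III), and the `(event_date, category, code)` event format and the function name are likewise assumptions of this example.

```python
from collections import Counter
from datetime import date

# Hypothetical slice boundaries, in days before the prediction date,
# as half-open intervals [start, end). Slice 1 is the most recent.
SLICES = [(0, 30), (30, 90), (90, 180), (180, 365)]


def slice_code_counts(events, prediction_date, slices=SLICES):
    """Count code occurrences per observation slice.

    events: iterable of (event_date, category, code) tuples.
    Returns a Counter keyed by (slice_number, category, code).
    """
    counts = Counter()
    for event_date, category, code in events:
        days_before = (prediction_date - event_date).days
        for number, (start, end) in enumerate(slices, start=1):
            if start <= days_before < end:
                counts[(number, category, code)] += 1
                break  # events outside every slice are simply ignored
    return counts


pd_example = date(2014, 1, 1)
events_example = [
    (date(2013, 12, 25), "icd9", "162.9"),  # 7 days before PD
    (date(2013, 12, 20), "icd9", "162.9"),  # 12 days before PD
    (date(2013, 6, 1), "icd9", "162.9"),    # 214 days before PD
    (date(2012, 1, 1), "icd9", "162.9"),    # outside the observation window
    (date(2014, 2, 1), "icd9", "162.9"),    # after PD (virtual future)
]
counts_example = slice_code_counts(events_example, pd_example)
```

Each key of the returned counter corresponds to one candidate feature, so the same code contributes up to four separate features, one per slice.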
We also include the patient demographics (age, gender, race and ethnicity), and the following per-patient summary statistics in the observation window for each code category:
• Count of unique codes in the category.
• Count of total number of codes in the category.
• Maximum number of codes assigned in any day.
• Minimum number of codes (non-zero) assigned in any day.
• Range of number of codes assigned in a day.
• Mean of number of codes assigned in a day.
• Variance in number of codes assigned in a day.
All these features (i.e., code counts in each of the four observation slices, per-category summary statistics over the observation window, and demographics) were concatenated to form the candidate feature set. From this set, we pruned away those features which occur in 100 or fewer patients. This resulted in the final set of 13,654 features. Of the 13,654 features, each patient on average has 74 non-zero values (with a standard deviation of 62), and up to a maximum of 892 values. The overall feature matrix is approximately 99.5% sparse.
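A minimal sketch of the per-category summary statistics and the rare-feature pruning is given below. It assumes that the per-day statistics are taken over days with at least one code (consistent with the non-zero minimum) and uses the population variance; neither detail is specified in the text, and the function names and data layout are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def category_summary(events):
    """Summary statistics for one patient and one code category.

    events: list of (day, code) pairs within the observation window.
    Assumption: per-day statistics cover only days with at least one code.
    """
    daily = list(Counter(day for day, _ in events).values())
    if not daily:
        return None
    return {
        "unique_codes": len({code for _, code in events}),
        "total_codes": len(events),
        "max_per_day": max(daily),
        "min_per_day": min(daily),
        "range_per_day": max(daily) - min(daily),
        "mean_per_day": mean(daily),
        "var_per_day": pvariance(daily),  # assumption: population variance
    }


def prune_rare_features(rows, min_patients=101):
    """Drop features that are non-zero in fewer than min_patients patients.

    rows: list of {feature_name: value} dicts, one per patient.
    The default keeps features occurring in more than 100 patients.
    """
    support = Counter(name for row in rows for name, value in row.items() if value)
    keep = {name for name, n in support.items() if n >= min_patients}
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


summary_example = category_summary([("d1", "a"), ("d1", "b"), ("d1", "a"), ("d2", "c")])
pruned_example = prune_rare_features(
    [{"a": 1, "b": 2}, {"a": 3}, {"a": 1, "b": 0}], min_patients=2)
```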

Algorithm and Training
Our model is a Deep Neural Network (DNN) [38] comprising an input layer (of 13,654 dimensions), 18 hidden layers (each of 512 dimensions) and a scalar output layer. We employ the logistic loss function at the output layer and use the Scaled Exponential Linear Unit (SeLU) activation function [39] at each layer. The model is optimized using the Adam optimizer [40], with a mini-batch size of 128 examples. Intermediate model snapshots were taken every 250 mini-batch iterations, and the snapshot that performed best on the validation set was selected as the final model. Explicit regularization was not found necessary. The network configuration was reached by extensive hyperparameter search over various network depths (ranging from 2 to 32) and activation functions (tanh, ReLU and SeLU).
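To make the architecture concrete, the following framework-free sketch builds a network of the stated shape and evaluates its forward pass and logistic loss. It is illustrative only: the weight initialization (zero-mean Gaussian with variance 1/fan-in, a common pairing with SeLU) is an assumption not stated in the text, all function names are hypothetical, and training with Adam on mini-batches of 128 would in practice be delegated to a deep learning library.

```python
import math
import random

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled Exponential Linear Unit."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def init_network(n_inputs=13654, n_hidden_layers=18, width=512, seed=0):
    """Return a list of (weights, biases) layers ending in a scalar output."""
    rng = random.Random(seed)
    sizes = [n_inputs] + [width] * n_hidden_layers + [1]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        std = math.sqrt(1.0 / fan_in)  # assumption: initialization is not given in the text
        weights = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def predict_proba(layers, x):
    """Forward pass: SELU hidden layers, then a sigmoid on the scalar output."""
    activations = list(x)
    for weights, biases in layers[:-1]:
        activations = [selu(sum(w * a for w, a in zip(row, activations)) + b)
                       for row, b in zip(weights, biases)]
    weights, biases = layers[-1]
    logit = sum(w * a for w, a in zip(weights[0], activations)) + biases[0]
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)  # numerically stable branch for negative logits
    return e / (1.0 + e)


def logistic_loss(p, y, eps=1e-12):
    """Binary cross-entropy for a single example with label y in {0, 1}."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))


# A tiny instance for demonstration; the defaults give the full-size shape.
small_net = init_network(n_inputs=4, n_hidden_layers=3, width=5, seed=1)
p_example = predict_proba(small_net, [1.0, 0.0, 2.0, 0.0])
```

The pure-Python loops are far too slow for the full 13,654-dimensional input; they are meant only to show the layer structure and where the activation and loss enter.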

Evaluation
Since the data is imbalanced (with 7% prevalence), accuracy can be a poor evaluation metric [43]. The ROC curve can also sometimes be misleading on imbalanced problems [44] [45]. Therefore, we use the Average Precision (AP) score, also known as Area Under Precision-Recall Curve (AUPRC), for model selection [46].
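To see why accuracy is uninformative here, note that a classifier predicting that no patient dies would be about 93% accurate at 7% prevalence while identifying nobody. Average Precision instead rewards ranking positive cases ahead of negative ones. The sketch below computes AP as the mean of the precision values at the rank of each positive example; it assumes untied scores, and the function name is hypothetical rather than taken from any particular library.

```python
def average_precision(y_true, y_score):
    """Average Precision: mean precision at the rank of each positive example.

    y_true: list of 0/1 labels; y_score: list of model scores.
    Assumption: scores are untied (ties are broken by input order).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive examples")
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            true_positives += 1
            total += true_positives / rank  # precision at this recall level
    return total / n_pos


ap_example = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In the example, the two positives sit at ranks 1 and 3, giving precisions of 1 and 2/3 and an AP of 5/6.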

IV. RESULTS
In this section we report technical evaluation results obtained on the test set using the model selected based on the best AP score on the validation set.
We observe that the model is reasonably calibrated (Fig 3) with a Brier score of 0.042. In the high threshold regime, which is of interest to us, the model is a little conservative

Qualitative Analysis
It is worth recalling that predicting mortality was a proxy problem for identifying patients who could benefit from palliative care. In order to evaluate our performance on the original problem, we inspected false positives with high output probability. Although such patients did not die within 12 months from their prediction dates, we noted that they were often diagnosed with terminal illness and/or were high utilizers of healthcare services. This can be seen in the positive and false positive examples shown in Section V.
Upon conducting a chart review of 50 randomly chosen patients in the top 0.9 precision bracket of the test set, the palliative care team found all were appropriate for a referral on their prediction date, even if they survived more than a year. This suggests that mortality prediction was a reasonable (and tractable) choice of a proxy problem to solve.

V. EXPLAINING PREDICTIONS
Supervised machine learning techniques, and in particular Deep Learning techniques, have recently demonstrated tremendous success in predictive ability. However, better performance often requires larger, more complex models and thus sacrifices interpretability. It is worth drawing a distinction between interpreting a model and interpreting its decision [47] [48]. While interpreting complex models (e.g., very deep neural networks) may sometimes be infeasible, it is often the case that users only want an explanation for the prediction made by the model for a given example. It is important to establish the practitioner's trust in the model's decisions for them to feel comfortable taking actions based on them. Providing explanations along with decisions helps establish that trust.
We make the following observations to motivate our explanation technique.
• We can view the EHR data as a strictly growing log of events, in which new data is only added (nothing is modified or removed in general). This results in all our features being non-negative (counts, means and variances of counts, etc.).
• We are most interested in explaining why a model assigns high probability to a patient. We are less interested in getting an explanation for why a healthy person was given a low probability (the reasons are also much less clear: the patient did not have brain cancer, did not have pneumonia, and so on).
• Directly perturbing feature vectors (e.g., sensitivity analysis or the techniques described in [47]) does not work well in our case. For example, perturbing the feature representing the ICD count for brain cancer from zero to non-zero can increase the probability of death significantly, implying that it is an important factor in general. However, that is not a very useful observation for a specific patient who does not have brain cancer.
These observations motivate the following technique. For each ICD-9, CPT, RxNorm and Encounter code, we ablate all occurrences of that code from the patient's EHR, create a new feature vector, and measure the drop in probability compared to the original probability. This corresponds to asking: all else being equal, how would the probability change if this patient had not been diagnosed with a given condition, had not been prescribed drug ABC, and so on? This drop in probability is considered the influence the code has on the model's decision for that patient. Demographic features are handled as follows. We zero out the age and swap the gender to the opposite sex, and measure the respective drops in probability. Finally, we sort the codes in descending order by influence, and pick the top 5 in each code category. Randomly chosen examples of a positive case and a false positive case are shown in Tables IV and V.
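The ablation procedure for codes can be sketched as below. The `explain` function follows the steps just described; the `toy_featurize` and `toy_predict` functions, the weights, and the codes in the example are illustrative stand-ins for the real feature pipeline and trained network, and the handling of demographic features is omitted.

```python
import math
from collections import Counter


def explain(events, featurize, predict, top_k=5):
    """Rank codes by the drop in predicted probability when they are ablated.

    events: list of (category, code) pairs from the observation window.
    featurize: maps a list of events to a feature representation.
    predict: maps a feature representation to a probability.
    Returns {category: [(code, influence), ...]} with the top_k codes each.
    """
    baseline = predict(featurize(events))
    influences = {}
    for key in set(events):
        # Remove every occurrence of this code, then re-featurize and re-score.
        ablated = [event for event in events if event != key]
        drop = baseline - predict(featurize(ablated))
        influences.setdefault(key[0], []).append((key[1], drop))
    return {category: sorted(items, key=lambda item: item[1], reverse=True)[:top_k]
            for category, items in influences.items()}


# Toy stand-ins (hypothetical): a count featurizer and a logistic scorer.
TOY_WEIGHTS = {("icd9", "191.9"): 2.0, ("icd9", "401.9"): 0.1, ("cpt", "99223"): 0.5}


def toy_featurize(events):
    return Counter(events)


def toy_predict(features):
    logit = -3.0 + sum(TOY_WEIGHTS.get(key, 0.0) * n for key, n in features.items())
    return 1.0 / (1.0 + math.exp(-logit))


toy_events = [("icd9", "191.9"), ("icd9", "191.9"), ("icd9", "401.9"), ("cpt", "99223")]
explanation = explain(toy_events, toy_featurize, toy_predict)
```

Because only codes actually present in the patient's record are ablated, the explanation never cites a condition the patient does not have, which is the shortcoming of direct feature perturbation noted above.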

VI. CONCLUSION
We demonstrate that routinely collected EHR data can be used to create a system that prioritizes patients for follow-up for palliative care. In our preliminary analysis we find that it is possible to create a model for all-cause mortality prediction and use that outcome as a proxy for the need of a palliative care consultation. The resulting model is currently being piloted for daily, proactive outreach to newly admitted patients. We will collect objective outcome data (such as rates of palliative care consults, and rates of goals of care documentation) resulting from the use of our model. We also demonstrate a novel method of generating explanations from complex deep learning models that helps build the confidence of practitioners to act on the recommendations of the system.

Data Source
STRIDE (Stanford Translational Research Integrated Database Environment) [37] is a clinical data warehouse supporting clinical and translational research at Stanford University. The snapshot of STRIDE used in our work comprises the EHR data of approximately 2 million adult and pediatric patients cared for at either the Stanford Hospital or the Lucile Packard Children's Hospital between 1995 and 2014.

Fig. 3. Reliability curve (calibration plot) of the model output probabilities on the test set data.

Fig. 4. Interpolated Precision-Recall curve. The horizontal dotted line represents precision level of 0.9. The vertical dotted lines indicate the recall at which the curves achieve 0.9 precision.

Fig. 5. Receiver Operating Characteristic (ROC) of the model performance on the test set.

(under-confident) in its probability estimates, which should not hurt. The interpolated Precision-Recall curve is shown in Fig 4. The model achieves an AP score of 0.69 (0.65 on admitted patients). Early recall is desirable, and therefore recall at precision 0.9 is a metric of interest. The model achieves recall of 0.34 at 0.9 precision (0.32 on admitted patients). The Receiver Operating Characteristic curve is shown in Fig 5. The model achieves an AUROC of 0.93 (0.87 for admitted patients). Both the ROC and Precision-Recall plots suggest that the model demonstrates strong early recall behavior.
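Two of the reported quantities, recall at a fixed precision and the Brier score, are simple to compute from labels and model outputs. The sketch below shows one way to do so; the function names are hypothetical, scores are assumed to be untied, and recall at precision is taken as the largest recall at which some threshold reaches the required precision, in line with the interpolated curve of Fig 4.

```python
def recall_at_precision(y_true, y_score, min_precision=0.9):
    """Largest recall achievable at any threshold with precision >= min_precision.

    Assumption: scores are untied (ties are broken by input order).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("recall is undefined without positive examples")
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    true_positives = 0
    best_recall = 0.0
    for rank, i in enumerate(order, start=1):
        true_positives += y_true[i]
        if true_positives / rank >= min_precision:
            best_recall = max(best_recall, true_positives / n_pos)
    return best_recall


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
recall_example = recall_at_precision(labels, scores, min_precision=0.9)
brier_example = brier_score([1, 0], [0.8, 0.3])
```

In the example, the top two ranked patients are both positive, so two of the three positives are recovered before precision falls below 0.9.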

TABLE I. BREAKDOWN OF PATIENT COUNTS.

TABLE II. DATA SPLIT FOR MODELING.