Development, implementation, and prospective validation of a model to predict 60-day end-of-life in hospitalized adults upon admission at three sites

Background
Automated systems that use machine learning to estimate a patient's risk of death are being developed to influence care, yet transparent reporting of model generalizability across subpopulations remains sparse, especially for implemented systems.

Methods
A prognostic study included adult admissions at a multi-site academic medical center between 2015 and 2017. A predictive model for all-cause mortality (including initiation of hospice care) within 60 days of admission was developed. Model generalizability was assessed in temporal validation in the context of potential demographic bias. A subsequent prospective cohort study was conducted at the same sites between October 2018 and June 2019. Model performance during prospective validation was quantified with areas under the receiver operating characteristic and precision-recall curves stratified by site. Prospective results include timeliness, positive predictive value, and the number of actionable predictions.

Results
Three years of development data included 128,941 inpatient admissions (94,733 unique patients) across sites where patients are mostly white (61%) and female (60%), and 4.2% of admissions led to death within 60 days. A random forest model incorporating 9,614 predictors produced areas under the receiver operating characteristic and precision-recall curves of 87.2 (95% CI, 86.1-88.2) and 28.0 (95% CI, 25.0-31.0) in temporal validation. Performance marginally diverges within sites as the patient mix shifts from development to validation (the proportion of patients from one site increases from 10% to 38%). Applied prospectively for nine months, 41,728 predictions were generated in real time (median [IQR], 1.3 [0.9, 32] minutes). An operating criterion of 75% positive predictive value identified 104 predictions at very high risk (0.25%), of which 65% (50 of 77 well-timed predictions) led to death within 60 days.

Conclusion
Temporal validation demonstrates good model discrimination for 60-day mortality. Slight performance variations are observed across demographic subpopulations. The model was implemented prospectively and successfully produced meaningful estimates of risk within minutes of admission.


Model Development
Three alternative models were considered: 1. logistic regression with lasso regularization, implemented with the glmnet package in R (1); 2. XGBoost with a logistic objective, implemented with the xgboost package in R (2); and 3. a random forest, implemented in R using the fest program (3). Empirical testing of model parameters was conducted within 5-fold cross-validation in the training cohort, where patients (4) are partitioned into five groups and five models are learned, each leaving out a different fifth for validation. Parameter settings are compared by computing the areas under the receiver operating characteristic (AUROC) and precision-recall (AUPRC) curves within each cross-validation fold and the mean AUROC and AUPRC across folds.
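The patient-level cross-validation described above can be sketched in a few lines. The following is a minimal, illustrative Python example (the study itself used R); `auroc` is a rank-based AUROC helper, and `make_patient_folds` is a hypothetical partitioner ensuring admissions from the same patient never span both training and validation:

```python
import random

def auroc(y_true, y_score):
    # Probability a random positive outranks a random negative (ties count half).
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_patient_folds(patient_ids, k=5, seed=0):
    # Assign each admission to a fold by patient, so that all of one
    # patient's admissions land in the same fold.
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    fold_of = {pid: i % k for i, pid in enumerate(unique)}
    return [fold_of[pid] for pid in patient_ids]
```

With folds in hand, each candidate parameter setting would be fit five times, scored with `auroc` (and an analogous AUPRC helper) on the held-out fifth, and settings ranked by the mean across folds.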

Operating Threshold
Given the predicted probabilities and known truths, a criterion is imposed to draw a single threshold that will separate predicted positives from predicted negatives. The metric and value used are application specific and depend on the 'cost' of both types of errors (5). Low-cost interventions such as further diagnostic testing will greatly differ from a decision to perform costly treatment, for example. In this application, conservative identification of individuals at very high risk of near-term death was the key objective, as action will be taken only for those predicted at risk. Therefore, an operating criterion of 75% positive predictive value (PPV; otherwise known as precision) was selected: one false positive for every three true positives.
To improve threshold robustness, 1000 bootstrap iterations are used to compute a median threshold. In each iteration, 80% of the test set is sampled (with replacement), a precision-recall curve is created, and a threshold is selected at 75% PPV. The median threshold is then computed from the 1000 different values. This process adds robustness, which is especially important at the very high PPV range, as each false positive estimated at very high risk can greatly affect the path of the precision-recall curve when cumulative samples are small.
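The bootstrap procedure above can be sketched as follows, in illustrative Python (the study used R; the function names here are hypothetical): sweep thresholds from the highest score down, keep the lowest threshold whose precision still meets the target, and take the median over resamples.

```python
import random
from statistics import median

def threshold_at_ppv(y_true, y_score, target_ppv=0.75):
    # Sweep candidate thresholds from highest score down; return the lowest
    # threshold whose PPV (precision) still meets the target, else None.
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    tp = fp = 0
    best = None
    for i in order:
        tp += y_true[i]
        fp += 1 - y_true[i]
        if tp / (tp + fp) >= target_ppv:
            best = y_score[i]
    return best

def bootstrap_threshold(y_true, y_score, n_boot=1000, frac=0.8, seed=0):
    # Median of thresholds selected at the target PPV across bootstrap
    # resamples of frac of the test set, drawn with replacement.
    rng = random.Random(seed)
    n = max(1, int(frac * len(y_true)))
    thresholds = []
    for _ in range(n_boot):
        idx = [rng.randrange(len(y_true)) for _ in range(n)]
        t = threshold_at_ppv([y_true[i] for i in idx], [y_score[i] for i in idx])
        if t is not None:  # skip degenerate resamples with no qualifying threshold
            thresholds.append(t)
    return median(thresholds)
```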

Evaluation in the Context of Potential Demographic Bias
Given the demographic differences between development cohorts, driven in large part by structural differences across sites (observed in Table 1 and eTable 1), two experiments were conducted. First, as recommended by Mitchell et al. (6), model performance is investigated in intersectional sub-cohorts of increasing complexity. Because AUROC and AUPRC are measures of global model performance and do not describe performance at a particular threshold, the procedure is repeated for the false positive rate, false negative rate, false discovery rate, and false omission rate to investigate differences in local performance across subpopulations.
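The four threshold-level error rates can be computed per intersectional sub-cohort with straightforward counting; a minimal illustrative sketch (Python here, though the analysis itself was done in R):

```python
def error_rates(y_true, y_pred):
    # Confusion-matrix counts for binary labels/predictions.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return {
        "FPR": fp / (fp + tn) if fp + tn else None,  # false positive rate
        "FNR": fn / (fn + tp) if fn + tp else None,  # false negative rate
        "FDR": fp / (fp + tp) if fp + tp else None,  # false discovery rate
        "FOR": fn / (fn + tn) if fn + tn else None,  # false omission rate
    }

def rates_by_subgroup(y_true, y_pred, groups):
    # Recompute the four rates within each subgroup (e.g. sex x race x site).
    return {g: error_rates(
                [t for t, gg in zip(y_true, groups) if gg == g],
                [p for p, gg in zip(y_pred, groups) if gg == g])
            for g in sorted(set(groups))}
```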
Second, an otherwise identical model is developed while strictly removing the 'sensitive' demographics of race and ethnicity, along with two likely proxies, religion and preferred language. Sex and age can also be considered sensitive demographics in applications outside healthcare (e.g., recidivism or lending). In this context of mortality risk, however, it would be impractical to require equal, fair treatment across these groups, and as such sex and age were retained. To aid direct comparison, identical model parameters are used to replicate an identical training procedure.
These sensitive fields potentially help the classifier separate clusters of patients at different depths of its trees, which may subsequently improve learning. Accordingly, omitting these fields is expected to marginally reduce performance, at least in demographic sub-populations. To investigate any changes, model performance in intersectional sub-cohorts is similarly assessed for this 'masked' model. To investigate how the masked model compensates without proxies of race or ethnicity, the feature importance of demographic predictors before and after masking is computed using selection frequency (7).
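Selection frequency can be computed by walking each tree and counting where predictors are chosen at splits. One common variant, sketched here in illustrative Python with each tree reduced to the list of predictor indices used at its split nodes (a simplification of the real tree structures, not the study's R implementation), is the fraction of trees that use a predictor at least once:

```python
from collections import Counter

def selection_frequency(forest):
    # forest: list of trees, each given as the list of predictor indices
    # appearing at that tree's internal split nodes.
    counts = Counter()
    for tree_splits in forest:
        counts.update(set(tree_splits))  # count each predictor once per tree
    n_trees = len(forest)
    return {feat: c / n_trees for feat, c in counts.items()}
```

Comparing these frequencies before and after masking shows whether the remaining demographics absorb the role of the removed ones.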

eResults
Composite End-of-Life Outcome
By combining the three available sources of patient outcomes (internal deaths, purchased deaths, and hospice discharges), 10,229 patient outcomes are discovered, of which 67% are affirmed by two or more sources (eFigure 1). The largest group of single-source outcomes is the hospice group, where 45% of all patients discharged to hospice were subsequently lost to follow-up with no confirmed death or date of death (2,504 of 5,598). In the 3,094 admissions with both hospice and death outcomes, the median [IQR] time between discharge to hospice and death was 9 [4, 18] days. The addition of hospice adds some 'fuzziness' to the outcome, but only for the 30% of end-of-life cases where the patient is discharged to hospice before death.

Model Development
Cross-validation across different parameter combinations found that the random forest is relatively insensitive to its parameters compared with the lasso regression and XGBoost alternatives. The random forest parameters with the highest and most robust performance were 100 trees limited to a maximum depth of 1000. A final model was retrained on the entire training set with these parameters and applied to the temporally separated testing cohort. The most frequently selected predictors of the final model are reported in eTable 2.

Testing Set Performance and Calibration
Within the entire testing set, the learned classifier has good performance (Table 2) and successfully separates patients by mortality risk (eFigure 2B) while being sufficiently well calibrated throughout the risk spectrum (eFigure 3A and B). The classifier also appears sufficiently well calibrated across locations (eFigure 3C and D), particularly Brooklyn and Tisch Hospitals, despite the demographic and outcome differences observed between sites (eTable 1). Of note, the classifier tends to underestimate mortality risk for patients within the top percentiles (observed mortality > estimated risk; intervals above the dashed diagonal line of eFigure 3), suggesting that any selected threshold should conservatively maintain the desired PPV.
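Calibration assessments of this kind compare mean predicted risk with observed mortality within probability bins; a minimal illustrative sketch (a hypothetical Python helper, not the study's own code):

```python
def calibration_bins(y_true, y_score, n_bins=10):
    # Group predictions into equal-width probability bins and compare
    # mean predicted risk with observed event rate in each bin.
    bins = [[] for _ in range(n_bins)]
    for t, s in zip(y_true, y_score):
        i = min(int(s * n_bins), n_bins - 1)
        bins[i].append((t, s))
    out = []
    for members in bins:
        if members:
            obs = sum(t for t, _ in members) / len(members)
            pred = sum(s for _, s in members) / len(members)
            out.append((pred, obs, len(members)))
    return out  # (mean predicted, observed rate, n); near-diagonal points = good calibration
```

Bins whose observed rate sits above their mean predicted risk correspond to the underestimation described above for the top percentiles.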
Relatedly, the distributions of predicted probability within the testing set for the two general hospitals are remarkably similar, differing only at the very highest percentiles. These observations suggest a potential problem of infra-marginality that may challenge model fairness at any threshold (8).

Intersectional Subcohort Performance
To assess model fairness across sensitive demographics, global model performance (measured by AUROC and AUPRC) is compared across strata of sex, ethnicity, and race, depicted as black intervals in eFigure 5A and B. Furthermore, each stratum is further separated by location (Brooklyn or Non-Brooklyn, combining Tisch and Orthopedic hospitals) to assess the divide caused by population differences and underrepresentation during training.
The reduced AUROC and AUPRC reported in Table 2 for the Brooklyn population are visible in almost all subpopulations of eFigure 5A and B, with the marginal exceptions of higher AUROC in Asian and Black patients and higher AUPRC in Black patients and men at Brooklyn. Of note, the Hispanic population at Brooklyn is likely under-labeled in the demographic data, leading to smaller-than-expected sample sizes and the wide confidence intervals observed.
A similar analysis performed at a specific threshold corresponding to 50% PPV (as sample sizes were too small for subpopulation analysis at our preferred 75% PPV) revealed similar patterns of performance differences (eFigure 6A-D). False positive and false discovery rates are lower for all Brooklyn patients and in subpopulations of men and White patients. The false negative rate is higher for all Brooklyn patients and in subpopulations of women, White, and Other Race patients. The false omission rate is higher for all Brooklyn patients, for both men and women, as well as for White patients. Unfortunately, the intersectional sample sizes limit the precision of these estimates, especially for ethnicity and race subpopulations. Together, these results suggest that site-specific differences and underrepresentation during training cause the model to under-identify Brooklyn patients.

Explicit Removal of Demographics
The 'masked' model (trained on data with race, ethnicity, and their proxies explicitly removed) results in the model performance described in eTable 4. Interestingly, the removal of race and ethnicity results in marginally improved testing-set AUROC and AUPRC (comparing Table 2 and eTable 4), with little observable change across subpopulations (eFigure 5) except for less variability in Hispanic patients at Brooklyn. When comparing error rates of the masked model to the unmasked model (eFigure 6), discrepancies between Brooklyn and Non-Brooklyn patients appear to be worsened across the board of false positive, false negative, and false omission rates. The explicit removal of sensitive predictors worsens the site-based disparity observed.
Exactly how the masking of race and ethnicity affects the classifier is difficult to determine. One may expect a shift in reliance from these predictors to other proxies of race or ethnicity. The selection frequency of demographic predictors used in the unmasked model (eFigure 7) describes the frequent use of each demographic, including smoking status, sex, and age. When ethnicity, race, preferred language, and religion are removed, the selection frequency shifts randomly for the remaining demographics ('X' marks in eFigure 7), suggesting that none of these predictors is latched onto by the masked model. Comparing the top predictors of the unmasked and masked models in eTable 2 suggests little impact of masking race and ethnicity on these proxies of utilization, with only 12 of 50 shifting by more than ten places.
When similarly thresholded to a prespecified PPV of 75%, the two models identify a similar order of magnitude of patients: 72 unmasked vs. 48 masked. However, only 31 patients are identified by both. The ethnicity, race, sex, and location demographics of these patients are described in eTable 5. The masked model identifies fewer patients in total, but the proportions of identified men and Asian patients increase (although the absolute numbers of men and Asian patients identified remain lower). Of note, the masked model does not improve the underrepresentation of Brooklyn patients.

Figure caption (panels A-F): A) model development cohort as well as subgroups of the testing cohort by: B) estimated risk group, C) location, D) sex, E) ethnicity, and F) race. Note: Risk groups are mutually exclusive such that the Moderate group consists of patients who did not exceed the threshold corresponding to 75% PPV but did exceed the one for 50% PPV. The unknown, other, or patient-refused options for sex, ethnicity, and race were omitted for D) sex and E) ethnicity but collapsed into Other for F) race.