Predicting clinical outcomes among hospitalized COVID-19 patients using both local and published models

Background Many models have been published that predict outcomes in hospitalized COVID-19 patients. The generalizability of many is unknown. We evaluated the performance of selected models from the literature and our own models to predict outcomes in patients at our institution. Methods We searched the literature for models predicting outcomes in inpatients with COVID-19. We produced models of mortality or criticality (mortality or ICU admission) in a development cohort. We tested external models which provided sufficient information and our models using a test cohort of our most recent patients. The performance of models was compared using the area under the receiver operating characteristic curve (AUC). Results Our literature review yielded 41 papers. Of those, 8 were found to have sufficient documentation and concordance with features available in our cohort to implement in our test cohort. All models were from Chinese patients. One model predicted criticality and seven mortality. Tested against the test cohort, internal models had an AUC of 0.84 (0.74–0.94) for mortality and 0.83 (0.76–0.90) for criticality. The best external model had an AUC of 0.89 (0.82–0.96) using three variables, another an AUC of 0.84 (0.78–0.91) using ten variables. AUCs ranged from 0.68 to 0.89. On average, models tested were unable to produce predictions in 27% of patients due to missing lab data. Conclusion Despite differences in pandemic timeline, race, and socio-cultural healthcare context, some models derived in China performed well. For healthcare organizations considering implementation of an external model, concordance between the features used in the model and features available in their own patients may be important. Analysis of both local and external models should be done to help decide which prediction method is used to provide clinical decision support to clinicians treating COVID-19 patients, as well as which lab tests should be included in order sets.

US economy despite assistance from the US Federal government, via the CARES Act [2] and other funding programs.
The COVID-19 pandemic occurred quickly and was rapidly followed by a massive production of academic output, including prediction models for a variety of clinical outcomes; the initial models for hospital outcomes came from the city of Wuhan in the Hubei province of China, where the initial cases were discovered. From there, models from around the globe proliferated and were likely integrated into many hospital guidelines. However, it is unclear whether those models could be applied to local cohorts. Having a rapidly available and accurate prediction model for COVID-19 patients being admitted from the emergency department (ED) would be useful for making accurate triage and prognostic assessments to inform decisions regarding treatment and resource allocation. While knowledge of the likelihood of death in those sent home from the ED would also be of interest, this requires longitudinal data, which are often not as readily available. Appropriate triage decisions are valuable, especially at times when resources are stretched.
The growth in the volume of readily available healthcare data has facilitated the development of artificial intelligence-based models; however, a significant factor limiting the usefulness of disseminating such models is the issue of generalizability. For example, the earliest computer-aided decision models evaluating abdominal pain could not be replicated in different institutions [3]. A mortality prediction tool in acute alcoholic pancreatitis (Ranson's criteria) [4], developed in a small cohort, has gained wide acceptance despite the availability of superior scoring tools [5].
One of the most popular prediction tools in clinical use today is the 2013 ACC/AHA Guideline on the Assessment of Cardiovascular Risk [6]. This risk tool uniformly overestimated risk in non-diabetic patients in a large, multi-ethnic, socioeconomically diverse group of patients in California [7].
We performed an analysis of how well published and self-developed models would predict clinical outcomes after admission in a cohort of diverse urban patients in Chicago. Our self-developed models were trained using data from our local patient cohort. Published, external models were not re-trained with our cohort's data. We aimed to close the gap in understanding whether COVID-19 prediction models of mortality and criticality could be used in local cohorts despite ethnic, geographic and timeline differences. We postulated that, due to our incomplete understanding of the pathophysiology, ethnic, racial and socioeconomic differences by location, and improving treatment over time, models may not predict well in a cohort different from their development and validation cohorts.

University of Illinois Hospital (UIH) Cohort
UIH is a tertiary, academic teaching hospital in Chicago. The UIC Institutional Review Board approved this study. All admissions to UIH for COVID-19 positive patients were reviewed for the time of the first COVID-19 positive test and the date of admission. If the first positive COVID-19 test was performed greater than 14 days prior to admission or greater than 48 h after admission, the patient was excluded. Patients transferred from another institution were reviewed for prior COVID-19 testing. If the COVID-19 test was greater than 14 days before transfer, the patient was excluded. If the transfer was not related to any possible COVID-19 symptoms, the patient was excluded. If the patient was discharged and then readmitted less than 14 days after the first positive COVID-19 test, the encounter was included. Patients were discharged or expired prior to 8/18/20. Pregnant patients were included.
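The test-timing window described above can be written as a simple predicate. The following is a minimal sketch under stated assumptions, not the study's actual code: the function name `is_eligible` and its arguments are hypothetical, and it covers only the timing of the first positive test relative to admission, not the transfer or readmission rules.

```python
from datetime import datetime, timedelta

def is_eligible(first_positive_test: datetime, admission: datetime) -> bool:
    """Return True if the first positive test falls inside the inclusion window.

    A patient is excluded when the test was more than 14 days before
    admission or more than 48 hours after admission (hypothetical helper).
    """
    earliest = admission - timedelta(days=14)
    latest = admission + timedelta(hours=48)
    return earliest <= first_positive_test <= latest
```

Both window boundaries are treated as inclusive here because the text excludes only tests "greater than" each limit.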
Since our goal was to assess the predictive power of our own prediction model as well as some of those in the literature, we partitioned our data into a training cohort consisting of the first 60% of patients, all admitted prior to 5/9/20, and a test cohort consisting of patients admitted and discharged from 5/9/20 through 8/18/20.
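A date-based partition of this kind can be sketched as follows. This is an illustration only: the record layout (`(patient_id, admission_date)` pairs) and the function name `temporal_split` are assumptions, and only the 5/9/20 cutoff comes from the text.

```python
from datetime import date

CUTOFF = date(2020, 5, 9)  # split date reported in the text

def temporal_split(admissions):
    """Split (patient_id, admission_date) pairs into training and test lists.

    Patients admitted before CUTOFF go to training; those admitted on or
    after CUTOFF go to test. Both lists are returned in admission order.
    """
    ordered = sorted(admissions, key=lambda rec: rec[1])
    train = [rec for rec in ordered if rec[1] < CUTOFF]
    test = [rec for rec in ordered if rec[1] >= CUTOFF]
    return train, test
```

Splitting by time rather than at random means the test cohort plays the role of future patients, which is the situation a deployed model would face.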
Variable selection was based on a review of the extant literature and expert opinion. The variables selected are shown in Table 1. Admission vital signs, laboratory values and clinical and radiological features were assessed. The results were the first available up to 24 h after admission. Two outcomes were evaluated, mortality (death during hospitalization), and "criticality", defined as mortality or admission to an ICU.

Literature search
We searched PubMed, Embase, arXiv and medRxiv for articles published before 8/27/2020 using the search string:

[Prediction] AND [Human] AND [COVID-19] OR [SARS-COV2] AND [Clinical Trial] OR [Observational Trial]

Articles were reviewed to determine whether the models described predicted our outcomes of interest and whether there was sufficient concordance and detail provided to implement the model using our cohort's data.

Model development
The objective of our model development was to accurately predict patient outcomes using a reduced number of key input features. A variety of popular machine learning algorithms were evaluated to classify mortality and criticality. These algorithms include Linear Regression [8], Decision Tree [9], Random Forest [10], XGBoost [11], LightGBM [12], and CatBoost [13]. The training process uses a combination of step forward feature selection and parametric grid search.
Step forward feature selection is the process of starting with a single feature and iteratively adding one additional feature until there is no increase in model performance. For each step in the feature selection, a parametric grid search is performed to determine the optimal parameter set for each model. We use the area under the receiver operating characteristic curve (AUC) as the evaluation metric.
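The procedure described above can be sketched generically. The code below is a minimal illustration under stated assumptions, not the authors' implementation: `forward_select` and the `score` callback are hypothetical names, and `score` stands in for whatever routine fits a model on the chosen features and parameters and returns a validation AUC.

```python
from itertools import product

def forward_select(features, param_grid, score):
    """Greedy step-forward selection with a grid search at every step.

    features   : list of candidate feature names
    param_grid : dict mapping parameter name -> list of values to try
    score      : callable(feature_tuple, params_dict) -> validation AUC
                 (hypothetical; it would wrap model fitting and evaluation)
    """
    names = list(param_grid)
    grid = [dict(zip(names, combo)) for combo in product(*param_grid.values())]
    selected, best_auc, best_params = [], float("-inf"), None
    remaining = list(features)
    while remaining:
        # Evaluate every (candidate feature, parameter set) pair for this step.
        step_auc, step_feat, step_params = max(
            ((score(tuple(selected + [f]), p), f, p)
             for f in remaining for p in grid),
            key=lambda item: item[0],
        )
        if step_auc <= best_auc:
            break  # no candidate improves the AUC, so stop adding features
        selected.append(step_feat)
        remaining.remove(step_feat)
        best_auc, best_params = step_auc, step_params
    return selected, best_params, best_auc
```

Because the search is greedy, it evaluates far fewer feature subsets than an exhaustive search but is not guaranteed to find the globally best subset.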

Statistical analysis of models
No missing data were imputed in our test cohort. External models were included in our analyses if, given this missingness, predictions could be generated for more than 60% of the patients. If odds or a point scale was available, a receiver operating characteristic (ROC) curve was constructed and the area under the curve (AUC) was calculated. Confidence intervals and comparisons of ROC curves were computed using DeLong's method [14]. The training and test cohorts were compared using Chi-Square tests for categorical variables and two-sided t-tests for continuous variables using a significance level of P < 0.05. The fraction of missingness for each variable was compared between the cohorts using the Bonferroni correction to control the family-wise error rate.
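For readers unfamiliar with DeLong's method, the sketch below shows how the AUC and its variance can be obtained from the "placement values" of each positive and negative case. It is a minimal illustration for a single ROC curve, not the software used in the study: the function name `delong_auc_ci`, the normal-approximation 95% interval (z = 1.96), and the clipping of the interval to [0, 1] are assumptions of this example. The paired comparison of two correlated ROC curves additionally requires their covariance and is not shown.

```python
from math import sqrt

def _psi(x, y):
    """Kernel: 1 if the positive scores higher, 0.5 on a tie, else 0."""
    return 1.0 if x > y else 0.5 if x == y else 0.0

def delong_auc_ci(pos_scores, neg_scores, z=1.96):
    """Return (auc, variance, (lower, upper)) using DeLong's variance estimate.

    pos_scores : model scores of patients who had the outcome (needs >= 2)
    neg_scores : model scores of patients who did not (needs >= 2)
    """
    m, n = len(pos_scores), len(neg_scores)
    # Placement values: how each case ranks against the opposite class.
    v10 = [sum(_psi(x, y) for y in neg_scores) / n for x in pos_scores]
    v01 = [sum(_psi(x, y) for x in pos_scores) / m for y in neg_scores]
    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    var = s10 / m + s01 / n
    half = z * sqrt(var)
    return auc, var, (max(0.0, auc - half), min(1.0, auc + half))
```

The variance shrinks as either class grows, which is why a small number of deaths in a test cohort produces wide intervals.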

UIH cohort characteristics
A description of the UIH cohorts is shown in Table 1. There was a total of 516 patients. The training cohort included the first 309 patients (60%), and the test cohort included the remaining 207 patients (40%). Though the overall racial distribution was not significantly different between the cohorts, the proportion of self-declared black patients was 49% in the training cohort and 42% in the test cohort. The lymphocyte, white blood cell and neutrophil counts were significantly higher in the test cohort. Though some lab tests were performed on almost all patients, many tests were performed in a more discretionary fashion. The missingness of some of the more discretionary tests was higher in the test cohort than in the training cohort (training vs. test): ferritin 7.1% vs. 17.4%, lactate dehydrogenase (LDH) 16.5% vs. 27.5%, procalcitonin 16.8% vs. 32.9%, and interleukin 6 (IL-6) 75.1% vs. 87.9%. D-dimer was missing less frequently in the test cohort, 40.8% vs. 27.5%.

Model compilation summary
Ninety-one abstracts were reviewed. After applying our inclusion criteria, 41 articles remained. The models and references are shown in Table 2.
Over 60% of the models (n = 26) were derived in China, 11 in Europe, 3 in the US and 2 were multinational. The most common methods were logistic regression (n = 25) and Cox regression (n = 12). A small number of models used neural networks and decision trees. Among models which published an AUC, the AUCs ranged from 0.74 to 0.98.

UI Health internal model development
Multiple machine learning methods were assessed to develop the best prediction model on the training (60%) cohort. The best models for both mortality and criticality were random forest models, based on the AUC values. Table 3 lists the key modeling parameters and covariates for the mortality and criticality models. The covariates are listed in the order of importance generated by the step forward feature selection. The key parameters for the random forest models were determined during the grid search on the development data set. The AUC for the mortality model in the training cohort was 0.98, and for criticality it was 0.97.
If model coefficients in the papers in Table 2 were sufficiently described and the model variables were available for more than 60% of admissions, the model was used to predict outcomes in the UIH test cohort. Results are shown in Table 4.
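The 60% availability criterion amounts to a complete-case check per model. A minimal sketch, assuming each patient is a dictionary with `None` marking a missing value; the function names and the example feature names are hypothetical:

```python
def prediction_coverage(patients, required_features):
    """Fraction of patients with every required feature present (not None)."""
    if not patients:
        return 0.0
    complete = sum(
        all(p.get(f) is not None for f in required_features) for p in patients
    )
    return complete / len(patients)

def is_testable(patients, required_features, threshold=0.60):
    """Apply the inclusion rule: coverage must exceed the threshold."""
    return prediction_coverage(patients, required_features) > threshold
```

Because a model needs all of its inputs at once, its coverage can be well below the availability of any single variable it uses.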
A total of 10 models were assessed using the test cohort, 8 from the literature and 2 internal. Seven of the external models used logistic regression and one used a decision tree. One external model predicted criticality; the remainder predicted mortality. The most common variables used in the models were age (7 models), lymphocyte count or lymphocytes/WBC ratio (6 models), C-reactive protein (CRP) and LDH (4 models each), D-dimer (3 models) and BUN (2 models). The number of features used in each model ranged from 2 to 11, with a median of 3.5. These models assessed clinical features and laboratory testing upon admission. In addition, 1 model explicitly included pregnant patients [19], 2 excluded pregnant patients [28,42], and 5 were undetermined [20,26,44,46,51].
Three of the models, B, G and H [19,46,51], had open access web-based calculators to predict outcomes for individual patients. One model used a decision tree of only three variables which is easy for a clinician to use (A) [42]. Two models used a nomogram to try to simplify use (D and F) [26,44].
All external models were trained using cohorts of Chinese patients. Though there were non-Chinese cohort models in Table 2, none of them provided sufficient description of their models to be implemented on our test cohort without retraining.
Common reasons why models were not used were the lack of availability of the coefficients needed to calculate a prediction score, lack of concordance between the features used in the model and features available in our test cohort, and outcome data not available in our test cohort (e.g., mortality). Figure 1 shows the confidence intervals of the AUCs obtained on the test cohort. Table 4 and Fig. 1 show that the point estimates of the AUC range from 0.68 for model G to 0.89 for model C. The internal models have an AUC of 0.84 for mortality and 0.83 for criticality. The mortality model with the highest AUC, model C, was not statistically different from the UIH mortality model, AUC 0.89 (0.82-0.96) vs. AUC 0.84 (0.74-0.94) [P > 0.5].
The widths of the confidence intervals range from 0.13 to 0.30. The difference between each model's published AUC and its AUC on our test set varied considerably. For model B this difference in AUC was only 0.04, whereas for model E it was 0.26. The UI Health models were in the middle, with a 0.14 AUC difference.
For all 8 models, the mean values for lab results and those of the UI Health test cohort are shown in Table 5. The variables shown were used in at least one model and were available in five or more of the model cohorts. Age and CRP were reported in all papers. The creatinine was reported in seven papers. Though rigorous statistical testing cannot be performed due to the inability to obtain the raw data, some of the variables are clinically significantly different between the cohorts from China and UIH. The mean CRP at UIH is more than three times higher than in the external model average, the creatinine is two-fold higher and the LDH is roughly 1/3 higher.

Discussion
Not all of the models in Table 2 could be used to make predictions on our test cohort, for multiple reasons. Without chart review, symptomology and its duration are difficult to obtain, which excluded some models. Unusual imaging grading schemes or mandatory CT scans were not available in our cohort. Some studies used labs that were not ordered frequently in our hospital. Lack of longitudinal follow-up limited the use of timed mortality outcomes (e.g., 30-day mortality). These issues, along with the lack of well-described model coefficients, left only the 8 models in Table 4 usable.
The features used in the models were surprisingly diverse. The number of variables in each model ranged from 2 through 11, with 19 different variables across the studies. The most common variables used were age, lymphocyte count, CRP and LDH. It is surprising that only 7 of the 10 models used age as a predicting variable, and the 3 models that did not use it did not perform well. In large multi-site cohorts examined in Britain [56], the US [57] and internationally [58], age was a strong predictor of mortality.
Three of the external models performed very well, with AUCs of 0.84-0.89. This demonstrates that although the patients were geographically distant, ethnically different, in different health systems and cultures, and at different times during the pandemic, reasonable prediction was possible. Our initial hypothesis was that these models would not work well, but this was not the case for all the models.
It is likely that some of the models may have had better performance if retrained using our local cohort, but this was not done as the purpose was to see how they worked "out of the box". This appeared to be the intent of many of the authors of the published models as evidenced by the publishing of web calculators, nomograms and decision trees. One of the issues which may cause worse or better performance in a model is that the outcomes have been found to be a function of time during the pandemic, not just patient factors, with improving outcomes more recently [59,60].
Models A, F, G and H were also evaluated in a review and cohort prediction comparison by Gupta et al. [61] using their cohort of 440 patients from London with a mortality rate of around 28%. For models F, G and H, the AUCs in our cohort were slightly different from those in the London cohort [61]. Review of the characteristics of the cohorts in Table 5 is instructive in understanding why some of the models did not perform well. Model A is a decision tree based on only 3 features: CRP, LDH and the percentage of lymphocytes. The first decision node suggests mortality if the LDH is greater than 365 U/L. In the derivation cohort, the average LDH was 274 U/L. The average LDH in our test cohort, however, was 386 U/L; thus a large portion of patients were predicted to die at the first node, causing a poor positive predictive value (PPV). In the London cohort the average LDH was about the same as ours, 395 U/L, and this model also performed poorly in that cohort [61].
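The effect of a fixed threshold on a cohort with higher baseline LDH can be illustrated with a short sketch. Only the first node (LDH > 365 U/L) is taken from the text; the later nodes of the tree are not reproduced, the helper names are hypothetical, and any values passed in are toy inputs rather than study data.

```python
LDH_THRESHOLD = 365  # U/L, first decision node as described in the text

def first_node_predicts_death(ldh):
    """Apply only the first split of the tree (later nodes are omitted)."""
    return ldh > LDH_THRESHOLD

def positive_predictive_value(predictions, outcomes):
    """Share of patients flagged as high risk who actually died."""
    true_pos = sum(1 for p, o in zip(predictions, outcomes) if p and o)
    flagged = sum(1 for p in predictions if p)
    return true_pos / flagged if flagged else float("nan")
```

When a cohort's typical LDH sits above the threshold, most patients are flagged at the first node regardless of outcome, so the PPV falls toward the cohort's overall mortality rate.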
The average LDH was roughly 1/3 higher in our test cohort than in the average of the cohorts from China. It is not clear what the reason for this is. In a healthy multiethnic cohort from Hawaii [62], there were at most minor differences between black, Hispanic, White and Asian patients in their LDH, suggesting that the differences in LDH are not likely due to racial factors. It is possible that a difference in the time of infection to presentation might explain the difference. The other models which used LDH predicted well, but this might be in part related to use of a logistic regression instead of a decision tree.
The average CRP in our cohort is roughly 350% of the average in the external models, 99 mg/L vs. 27 mg/L. Four models used the CRP and only one of them, model C, performed well. The creatinine was significantly higher in our cohort than in any of the derivation cohorts, as well as the average of the studies, 0.84 mg/dL. Only one model, model H, used the creatinine. Its derivation cohort average creatinine was 0.72 mg/dL. Thus, model [...] [63] with one systematic review showing the prevalence of chronic kidney disease in China was less than a fourth of the rate in the US [64]. The higher creatinine in the test cohort may not be related only to differences in illness at presentation, but rather to differences in the prevalence of CKD.
Table 3 Internal model fit on the first 60% of admissions for mortality and criticality. ALT, alanine aminotransferase; AST, aspartate aminotransferase; BMI, body mass index; CRP, C-reactive protein; O2 Sat, oxygen saturation; RDW, red blood cell distribution width; WBC, white blood cell count
It is not fully clear why the models produced at UIH using our training cohort did not perform better on our test cohort, though there are some likely factors. The AUC for mortality decreased from 0.98 to 0.84 and for criticality, from 0.97 to 0.83. In an analysis of the entire cohort, we were able to determine that the mortality and criticality were associated with the admission date. This is consistent with publications showing an improved mortality rate over time [59,60]. The WBC, lymphocytes and neutrophils were not all used in each model and all went up in the test cohort. Thus, it is possible that some variable not in the models changed over time, producing a worse fit compared to the first 60% of patients.
Table 4 models
Table 5 Values of the most common variables in the 8 external models and the test cohort
The number of cases for which a model is unable to generate a prediction due to missing data is an important practical consideration for model implementation.
The fraction of the test cohort for which predictions could not be generated due to missingness ranged between 17 and 31% for external models. The UI Health models could not generate predictions in 27% of the patients. Though missing data can be imputed retrospectively, this is not easy for clinicians to do in real time during patient care, so imputation was not performed. This demonstrates non-standardized test ordering, which is not surprising as our understanding of what is useful and necessary for testing in suspected COVID-19 patients has evolved.
It is interesting to note that many of the tests which have been used commonly in these and other models were missing more frequently in the test cohort than in the earlier development cohort: ferritin was missing in 17.4%, up from 7.1%, and LDH in 27.5%, up from 16.5%. It is not clear why these tests were ordered less over time, particularly LDH, given that many publications have demonstrated its prognostic power [15, 16, 21, 22, 25, 27, 42-45, 50, 52, 53]. It is possible that the ordering of these inflammatory prognostic markers [65] decreased as clinicians' confidence in clinical prognosis improved.
D-dimer, on the other hand, was missing less frequently in the test cohort, 27.5%, down from 40.8%. This difference may be due to an increased concern for venous thromboembolism in COVID-19 infections [66], which developed over time.
An important question is what model to use to provide prognostic information to clinicians. Using your own data to inform future care is consistent with a learning health system [67]. Ideally, clinical decision support (CDS) could supply the best prediction for a patient based on the most recent trends at the time. Another reason to use your own data, especially with COVID-19, is that the disease, treatment and outcomes are likely to change over time [59,60], while the models in the literature are static. An additional benefit of using your own data and predictive models is the ability to see which diagnostic tests are most useful prognostically but are not ordered enough, leading to more evidence-based order sets.
Our literature search has limitations due to the inability to ensure that all possible synonyms were used along with other reasons that the search strategy may have missed articles. As related to COVID-19, the rate of discovery and publication is so rapid that many models were likely published between the time of study completion and study publication.
Limitations related to our cohort and analysis are, first, that this is a single-site study, and these models may have performed differently at other sites. The size of the test cohort contributed to the relatively wide confidence intervals of the AUCs, making statistical significance difficult to demonstrate. We were unable to follow patients consistently after discharge, and thus could not measure timed outcomes like 30-day mortality. Lastly, we could not control for changes in treatment which have occurred over time.

Conclusions
Both internal and some external models were found to work well at predicting mortality in our test cohort. The 3 best external models used at least age, LDH and lymphocytes. Inconsistent ordering of lab tests led to the inability to generate predictions for 27-31% of our cohort using the 3 best external models and the 2 UIH models.
As not all the external models worked well, it would be difficult to know which model to use for future admissions at a particular time during the pandemic as treatment and patient mix can change. As an institution's own prior patients are most similar to their next group of patients, using models from local data should be considered.