Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare

Background: Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. The primary objective was to describe lab- and diagnosis-based labels for seven selected outcomes at three institutions. Secondary objectives were to describe agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels.
Methods: This study included three cohorts: SickKids from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions: acute kidney injury, hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia and thrombocytopenia. For each outcome, we created four lab-based labels (abnormal, mild, moderate and severe) based on the test result and one diagnosis-based label. Proportions of admissions with a positive label were presented for each outcome stratified by cohort. Using lab-based labels as the gold standard, agreement (Cohen's Kappa), sensitivity, and specificity were calculated for each lab-based severity level.
Results: The numbers of admissions included were: SickKids (n = 59,298), StanfordPeds (n = 24,639) and StanfordAdults (n = 159,985). The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds compared to SickKids across all outcomes, with odds ratio (99.9% confidence interval) for abnormal diagnosis-based label ranging from 2.2 (1.7–2.7) for neutropenia to 18.4 (10.1–33.4) for hyperkalemia. Lab-based labels were more similar by institution. When using lab-based labels as the gold standard, Cohen's Kappa and sensitivity were lower at SickKids for all severity levels compared to StanfordPeds.
Conclusions: Across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12911-024-02449-8.


Background
Machine learning models based on electronic health records (EHRs) are increasingly being developed and implemented into routine care. Such models have, for example, reduced acute care visits among ambulatory cancer patients [1], decreased in-hospital clinical deterioration [2], increased serious illness conversations [3], improved platelet utilization [4], and refined antibiotic choice [5].
To develop models, inputs or features are extracted from EHRs; these reflect different aspects of care such as diagnostic codes, laboratory tests, microbiology results, medication administrations, blood product administration, and procedures. Diagnostic codes are also frequently used to define the outcome of interest or label. How well each institution generates accurate diagnostic codes may vary depending on the coding process specific to the institution [6] and clinical diagnostic practice specific to the hospital unit or physician [6-8]. This variability might influence the performance and generalizability of machine learning models developed at institutions with different diagnostic coverage rates. In pediatric populations, the coverage rates of diagnostic codes and their variability across institutions are underexplored [9,10].
A challenge to studying the question of diagnostic code coverage is the creation of gold standard labels, as the diagnostic codes themselves are often used to develop these labels. One type of clinical data in which the label is inherent within the result itself is laboratory-based outcomes. Abnormal lab tests can be defined using institution-specific reference ranges. In addition, levels of severity (mild, moderate, and severe) for each abnormal lab test can be defined based upon widely accepted thresholds. Thus, evaluating diagnostic code coverage against lab-based definitions provides a pragmatic setting in which to evaluate this question. Consequently, the primary objective was to describe lab- and diagnosis-based labels for selected outcomes at three institutions. Secondary objectives were to describe agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels.

Design
This study used data derived from EHRs at three institutions, namely The Hospital for Sick Children (SickKids) in Toronto, Ontario; Lucile Packard Children's Hospital (primarily pediatric-directed care) in Palo Alto, California; and Stanford Health Care (primarily adult-directed care) in Palo Alto, California. The overall goal was to compare lab- and diagnosis-based labels for pediatric patients at SickKids vs. Stanford. We included a Stanford adult cohort for descriptive purposes.

SEDAR
The data source at SickKids was the SickKids Enterprise-wide Data in Azure Repository (SEDAR) [11,12]. SEDAR contains a curated version of Epic Clarity data that is being used for operational, quality improvement and research purposes. This study was approved as a quality improvement project at The Hospital for Sick Children and consequently, the requirements for Research Ethics Board approval and informed consent were waived by The Hospital for Sick Children.

STARR
The Stanford Medicine Research Data Repository (STARR) [13] is the clinical data warehouse that contains records routinely collected in the EHR of Stanford Medicine, which comprises Lucile Packard Children's Hospital and Stanford Health Care. The data have been mapped to the standard concept identifiers and structure of the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) [14], resulting in a dataset named STARR-OMOP. This study used a de-identified version of STARR-OMOP [13] in which protected health information has been redacted. Because of de-identification, the requirements for Institutional Review Board approval and informed consent were waived by Stanford Medicine.

Cohorts
We defined three cohorts. SickKids was obtained using SEDAR while StanfordPeds and StanfordAdults were obtained using STARR-OMOP and applying age-specific restrictions. Table 1 summarizes the inclusion criteria for each cohort. Across all three cohorts, inpatient admissions were included if they occurred between 2018-06-02 and 2022-08-01. The pediatric cohorts (SickKids and StanfordPeds) included patients who were 28 days or older and younger than 18 years on the day of admission. We excluded neonates 1 to 27 days of age because Lucile Packard Children's Hospital has an obstetrical unit and consequently includes healthy newborns while SickKids does not have an obstetrical unit and does not routinely see healthy newborns. StanfordAdults included adult patients aged 18 or above on the day of admission. Multiple admissions per patient were permitted as long as eligibility criteria were met.

Outcome definitions
We included seven clinical outcomes that have lab-based definitions, namely acute kidney injury (AKI), hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia, and thrombocytopenia. We appreciate there are a large number of potential lab-based outcomes; these seven were chosen based on our current research interests and because they are clinically meaningful. The outcomes were chosen a priori, before conducting any of the analyses. We purposely did not include abnormal high and low for the same lab test (for example, hyperglycemia and hypoglycemia) as they may be correlated. For each outcome, we created four lab-based labels based on the test result and one diagnosis-based label; these five labels were evaluated in each patient admission. Appendix 1 in Additional File 1 shows the thresholds for each severity level (mild, moderate, and severe) of the lab-based labels; these thresholds were based upon research studies or guidelines [15-21]. We also labeled the result as abnormal if it was above or below (not both) the institution-specific reference range. For lab-based labels, units for lab results were normalized, and severity levels were nested. For example, a patient admission with severe hypoglycemia would also be included in the analyses for mild and moderate hypoglycemia. For the diagnosis-based label, we considered an outcome to be present if at least one outcome-related diagnosis code was assigned to the admission. Appendices 2 and 3 in Additional File 1 list the diagnosis codes used to define diagnosis-based labels.
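The nesting of severity levels and the "at least one code" rule can be illustrated with a minimal Python sketch. The thresholds, the reference limit, the diagnosis code set, and the function names below are hypothetical placeholders chosen for illustration; the study's actual thresholds and code lists are those in Additional File 1 and are not reproduced here.

```python
from typing import Dict, Iterable, Set

# Hypothetical glucose cut-offs (mmol/L), for illustration only.
# The study's actual thresholds are listed in Appendix 1 of Additional File 1.
HYPOTHETICAL_GLUCOSE_THRESHOLDS: Dict[str, float] = {
    "mild": 3.9,
    "moderate": 3.0,
    "severe": 2.2,
}


def lab_labels(
    results: Iterable[float],
    lower_reference_limit: float,
    thresholds: Dict[str, float],
) -> Dict[str, bool]:
    """Return the four lab-based labels for one admission ('low' outcome).

    Because every label is a comparison of the lowest result against a
    cut-off, a severe result is automatically also moderate and mild,
    which reproduces the nesting described in the text.
    """
    values = list(results)
    if not values:
        # No test performed during the admission: no positive lab-based label.
        return {name: False for name in ["abnormal", *thresholds]}
    lowest = min(values)
    labels = {"abnormal": lowest < lower_reference_limit}
    for severity, cutoff in thresholds.items():
        labels[severity] = lowest < cutoff
    return labels


def diagnosis_label(admission_codes: Iterable[str], outcome_codes: Set[str]) -> bool:
    """Positive if at least one outcome-related code is assigned to the admission."""
    return any(code in outcome_codes for code in admission_codes)


if __name__ == "__main__":
    # Hypothetical admission: three glucose results and two diagnosis codes.
    print(lab_labels([5.1, 2.0, 4.4], 3.3, HYPOTHETICAL_GLUCOSE_THRESHOLDS))
    print(diagnosis_label(["J18.9", "E16.2"], {"E16.2"}))
```

An outcome defined by high values (for example, hyperkalemia) would use the maximum result and the opposite comparison.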

Concept selection for lab-based and diagnosis-based labels
We adopted different search strategies for concepts in STARR-OMOP and SEDAR due to differences in structure and vocabularies for clinical codes. Diagnosis codes were derived from the "condition_occurrence" table for STARR-OMOP and from the "diagnosis" table for SEDAR. Lab test results were obtained from the "measurement" table for STARR-OMOP and from the "lab" table for SEDAR. For face validation, diagnosis codes and lab result distributions obtained from STARR-OMOP were reviewed by three clinicians (KEM, CA and LS) to identify errors related to normalization or concept selection. At SickKids, this same review was only conducted by one clinician (LS) due to access restrictions.

Baseline characteristics by cohort
To explore whether there were differences in the cohorts with respect to patients, we described the demographic characteristics and raw lab results of patients between centers. Demographic characteristics included age, sex, length of stay, and the prevalence of in-hospital mortality. For the evaluation of raw lab results, we determined the minimum or maximum result for each lab test per admission and stratified by cohort.
To gain insight into whether there were differences between pediatric institutions with respect to laboratory procedures or clinical practice, we described the institution- and age group-specific reference ranges for abnormal lab results at SickKids and StanfordPeds. The pediatric age groups were defined by the National Institute of Child Health and Human Development [22] as infancy (28 days-12 months), toddler (13 months-2 years), early childhood (2-5 years), middle childhood (6-11 years) and early adolescence (12-17 years). In addition, we evaluated lab testing frequency calculated as the number of tests per inpatient day for each admission.

Statistical analysis
The primary objective was to describe lab- and diagnosis-based labels at the three institutions. These were presented as the proportion of admissions with at least one positive label. Describing the odds of a lab- or diagnosis-based label by whether the pediatric admission occurred at StanfordPeds vs. SickKids was complicated by the large number of admissions and multiple testing (35 separate evaluations for this analysis alone). In addition, there were multiple admissions per patient, resulting in correlation within individuals. To address these concerns, we took several steps. First, we focused on describing the odds ratio (OR) and 99.9% confidence interval (CI) for a lab- or diagnosis-based label by pediatric institution. Second, we described the 99.9% CI rather than the 95% CI to help address multiple testing. Third, we did not calculate P values but rather focused on describing CIs, with the exception of comparing lab testing frequency by institution. Finally, to address multiple admissions per patient, ORs and 99.9% CIs were calculated using mixed-effects logistic regression. Models included each binary label as the outcome, institution and pediatric age group as fixed effects, and subject as a random intercept. Analysis was performed using the glmer function from the lme4 package in R.

Table 1 (excerpt) Median length of stay [IQR]: 2 [1-5]; 3 [1-6]; 3 [2-6]. Abbreviations: SickKids, The Hospital for Sick Children; Peds, pediatrics; IQR, interquartile range

The secondary objectives were to describe agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. Agreement in each cohort was described using Cohen's Kappa coefficient. Sensitivity and specificity of the diagnosis-based labels were determined using each of the lab-based labels as the gold standard. For each metric, we presented the median and ranges stratified by cohort and lab-based severity (abnormal, mild, moderate and severe).
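For readers less familiar with these metrics, the following Python sketch shows how Cohen's Kappa, sensitivity, and specificity follow from the 2x2 table of diagnosis-based versus lab-based labels. It is an illustration rather than the study's analysis code; the function name and the example labels are hypothetical.

```python
from typing import Dict, Sequence


def agreement_metrics(diagnosis: Sequence[bool], lab: Sequence[bool]) -> Dict[str, float]:
    """Cohen's Kappa, sensitivity and specificity of diagnosis-based labels,
    treating the lab-based labels as the gold standard."""
    if len(diagnosis) != len(lab):
        raise ValueError("label sequences must have the same length")

    # Cells of the 2x2 table.
    tp = sum(1 for d, l in zip(diagnosis, lab) if d and l)
    fp = sum(1 for d, l in zip(diagnosis, lab) if d and not l)
    fn = sum(1 for d, l in zip(diagnosis, lab) if not d and l)
    tn = sum(1 for d, l in zip(diagnosis, lab) if not d and not l)
    n = tp + fp + fn + tn
    nan = float("nan")
    if n == 0:
        return {"kappa": nan, "sensitivity": nan, "specificity": nan}

    # Observed agreement and agreement expected by chance from the marginals.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else nan

    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    specificity = tn / (tn + fp) if (tn + fp) else nan
    return {"kappa": kappa, "sensitivity": sensitivity, "specificity": specificity}


if __name__ == "__main__":
    # Hypothetical admissions: 4 lab-positive, of which only 1 carries a code.
    lab = [True] * 4 + [False] * 6
    diagnosis = [True] + [False] * 9
    print(agreement_metrics(diagnosis, lab))
```

In this hypothetical example, only one of four lab-positive admissions carries a diagnosis code and no lab-negative admission does, so specificity is perfect while sensitivity and Kappa are low.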
As an exploratory analysis, we separately evaluated each visited unit during admissions at each pediatric institution. We examined the weighted proportion of positive lab-based labels and positive diagnosis-based labels for each hospital unit and calculated Spearman's rho (r) based on the average across lab-based severity.
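As an illustration of the unit-level correlation, the sketch below computes Spearman's rho as the Pearson correlation of average ranks. The per-unit proportions are hypothetical, and the weighting and averaging across severity levels used in the study are assumed to have been applied before this step.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-unit proportions of positive labels.
    lab_proportion = [0.10, 0.20, 0.30, 0.40]
    diagnosis_proportion = [0.05, 0.04, 0.20, 0.30]
    print(spearman_rho(lab_proportion, diagnosis_proportion))
```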
To describe lab-based reference ranges for pediatric patients, we described the threshold for an abnormal lab test by pediatric age group stratified by institution. Where the threshold varied within an age group, the range was visually depicted using a bar rather than a line. To compare testing frequency between pediatric institutions, mixed-effects linear regression was performed with number of lab tests per admission as the outcome, institution and pediatric age group as fixed effects, and subject as a random intercept. Analysis was performed using the lmer function from the lme4 package in R.
All analyses were conducted using Python (version 3.7) and R (version 4.1.2).

Baseline characteristics
The numbers of admissions included were: SickKids (n = 59,298), StanfordPeds (n = 24,639) and StanfordAdults (n = 159,985). Characteristics of the three cohorts are listed in Table 1. The distributions of age, sex, in-hospital mortality, and median length of stay were similar between SickKids and StanfordPeds, while the distributions of sex and in-hospital mortality differed in StanfordAdults. Table 2 shows the distribution of minimum or maximum results for each lab test per admission by cohort. Distributions appeared similar between StanfordPeds and SickKids with the exception of minimum absolute neutrophil count, which was lower at SickKids vs. StanfordPeds. Appendix 2 in Additional File 1 shows that the reference ranges varied between SickKids and StanfordPeds. Reference ranges for glucose and sodium were the same for all age groups except infants. Reference ranges for potassium and platelets were notably different by institution across age groups. Appendix 3 in Additional File 1 shows the average number of lab tests performed per inpatient day across all admissions stratified by institution. SickKids performed significantly fewer tests compared to StanfordPeds for all tests.

Prevalence of lab-based and diagnosis-based labels
Table 3 provides the percentage of admissions with a positive lab- and diagnosis-based label. Table 3 and Fig. 1 show the ORs and 99.9% CIs. The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds compared to SickKids across all outcomes, with OR (99.9% CI) for abnormal diagnosis-based label ranging from 2.2 (1.7-2.7) for neutropenia to 18.4 (10.1-33.4) for hyperkalemia. Lab-based labels were more similar by institution, although several were significantly different, as demonstrated by CIs that did not cross 1.
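To make the interpretation of an OR with a 99.9% CI concrete, the sketch below computes an unadjusted OR and Wald interval from a 2x2 table. This is a deliberate simplification and not the study's method: the reported estimates come from mixed-effects logistic regression with age group as a fixed effect and subject as a random intercept, so the values here would not reproduce Table 3. The counts and the function name are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist
from typing import Tuple


def odds_ratio_wald_ci(
    pos_a: int, neg_a: int, pos_b: int, neg_b: int, level: float = 0.999
) -> Tuple[float, float, float]:
    """Unadjusted odds ratio of a positive label at institution A vs. B,
    with a Wald confidence interval computed on the log-odds scale."""
    if min(pos_a, neg_a, pos_b, neg_b) <= 0:
        raise ValueError("all four cell counts must be positive")
    # Two-sided critical value, about 3.29 for a 99.9% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    odds_ratio = (pos_a * neg_b) / (neg_a * pos_b)
    standard_error = sqrt(1 / pos_a + 1 / neg_a + 1 / pos_b + 1 / neg_b)
    lower = exp(log(odds_ratio) - z * standard_error)
    upper = exp(log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 20/100 positive at institution A, 10/100 at B.
    estimate, lower, upper = odds_ratio_wald_ci(20, 80, 10, 90)
    print(f"OR = {estimate:.2f}, 99.9% CI {lower:.2f}-{upper:.2f}")
    # The difference is called significant at this level only if the
    # interval excludes 1.
    print("CI excludes 1:", lower > 1 or upper < 1)
```

Widening the interval from 95% to 99.9% raises the critical value from about 1.96 to about 3.29, which is how the wider interval offers some protection against multiple testing.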

Agreement between outcome definitions
Figure 2 shows the evaluations of diagnosis-based labels against each of the lab-based labels using Cohen's Kappa, sensitivity, and specificity.
Figure 3 plots the weighted proportions of positive diagnosis-based labels against the weighted proportions of positive lab-based labels for each hospital unit at SickKids and StanfordPeds. At StanfordPeds, units with a higher incidence of patients with lab-based labels also had a higher incidence of patients with a positive diagnosis-based label for a clinical outcome, evidenced by Spearman r ranging from 0.513 (hyponatremia) to 0.871 (neutropenia). In contrast, the Spearman r values were generally lower at SickKids across all outcomes, ranging from 0.010 (hypoglycemia) to 0.356 (anemia).

An odds ratio of > 1 corresponds to higher odds of assigning a positive label for StanfordPeds compared to SickKids. Odds ratios were obtained using mixed-effects logistic regression with each binary label as the outcome, institution and pediatric age group as fixed effects, and subject as a random intercept.

Discussion
Our results showed that despite similar demographic characteristics, there were large differences between the two pediatric institutions in the proportion of admissions with diagnosis codes for the evaluated clinical outcomes. In addition, diagnosis-based labels generally had low agreement with lab-based labels and displayed low sensitivity but high specificity when considering lab-based labels as the gold standard, with differences observed between the two institutions. We also found differences between the two institutions in test ordering frequency and even in laboratory test reference ranges.
These results have two implications. First, if machine learning models are intended for deployment at multiple institutions, reliance on diagnostic codes, either as features or labels, could be problematic if institutions have different coding practices. Differences in diagnostic coding practices between countries, such as between Canada and the U.S., might also contribute to the observed variations. Second, using institutional reference ranges to categorize laboratory test results may contribute to geographic dataset shift. This study contributes to the body of evidence that demonstrates the limitations of using diagnosis codes for outcome identification. Studies have reported low sensitivity when using diagnosis codes to identify, for example, acute kidney injury [23], obesity [24], and symptoms of coronavirus disease 2019 [25]. In addition, this study showed differences between and within institutions in diagnostic practice that may have contributed to the differences in the performance of diagnosis codes for outcome identification.
Diagnosis codes from the EHR are commonly queried during feature extraction [26-30], label creation [31], and cohort identification [32]. Heterogeneity in diagnostic practice across hospital units within the same institution (e.g., SickKids) can impact a model's performance within sub-populations or spuriously associate certain units with the outcome of interest during model development.
In addition, the cross-institution difference in diagnostic coding practice has implications for network studies as it violates the assumption that coding practice is comparable across institutions and creates heterogeneity in outcome prevalence as an artifact of code availability.
While we found that the proportions of positive lab-based labels were more similar between pediatric institutions, there were significant differences, although these were smaller than those observed for diagnosis-based labels. One possible contributor was the observed difference in lab testing frequency between the two pediatric institutions. In addition, the reference ranges themselves differed for tests whose results have the same absolute interpretation regardless of where the test was conducted. For example, two hypothetical children with the same platelet count could be considered to have a normal test at one institution and an abnormal test at the other. Some SickKids reference ranges were based upon those established by the Canadian Laboratory Initiative on Pediatric Reference Intervals (CALIPER) [33], which contributed to the disparity. Nonetheless, this has implications for machine learning models. First, it is common during feature processing to categorize lab test results as normal, high, and low based upon the reference range [26,34]. Having different reference ranges would thus produce different features despite having the same numerical value. Second, different reference ranges may impact downstream clinical decision making and variability of resultant clinical actions, for example procedures and […]
The strengths of this study include the ability to evaluate multiple institutions in different countries and the involvement of clinician co-investigators who contributed to the identification of concepts to include in the various label definitions. However, this study is limited for several reasons. First, we only evaluated seven outcomes. In addition, the outcomes were restricted to those that have lab-based definitions in order to use lab tests to develop gold standard labels. Outcomes that are more complex might require chart review to establish gold standards and more sophisticated electronic phenotyping approaches to reach reasonable performance [35,36]. Next, our analyses were restricted to admissions within a relatively narrow time period (2018-2022). It might be useful to characterize practice differences over time, as temporal distribution shift can negatively impact model performance over time [28,37]. Finally, it is possible that differences in EHR implementation [38] and data transformation processes between the institutions may have contributed to the observed variations.

Conclusion
In conclusion, across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.

Fig. 2
Fig. 2 Cohen's Kappa, sensitivity, and specificity for diagnosis-based labels against lab-based labels. The figure shows median, interquartile range (shaded box) and range (whiskers).

Fig. 3
Fig. 3 Agreement between diagnosis-based labels and lab-based labels across hospital units. The numbers on the x- and y-axes represent the weighted proportions of positive lab-based labels and positive diagnosis-based labels, respectively, for each hospital unit visited during the admission. Spearman's rho (r) was calculated based on the average across lab-based severity.

Table 1
Inclusion criteria and cohort characteristics

Table 2
Distribution of minimum or maximum results for each lab test per admission stratified by cohort

Table 3
Proportion of admissions with positive lab- and diagnosis-based labels by cohort