Use of natural language processing to improve predictive models for imaging utilization in children presenting to the emergency department

Objective To examine the association between the medical imaging utilization and information related to patients’ socioeconomic, demographic and clinical factors during the patients’ ED visits; and to develop predictive models using these associated factors including natural language elements to predict the medical imaging utilization at pediatric ED. Methods Pediatric patients’ data from the 2012–2016 United States National Hospital Ambulatory Medical Care Survey was included to build the models to predict the use of imaging in children presenting to the ED. Multivariable logistic regression models were built with structured variables such as temperature, heart rate, age, and unstructured variables such as reason for visit, free text nursing notes and combined data available at triage. NLP techniques were used to extract information from the unstructured data. Results Of the 27,665 pediatric ED visits included in the study, 8394 (30.3%) received medical imaging in the ED, including 6922 (25.0%) who had an X-ray and 1367 (4.9%) who had a computed tomography (CT) scan. In the predictive model including only structured variables, the c-statistic was 0.71 (95% CI: 0.70–0.71) for any imaging use, 0.69 (95% CI: 0.68–0.70) for X-ray, and 0.77 (95% CI: 0.76–0.78) for CT. Models including only unstructured information had c-statistics of 0.81 (95% CI: 0.81–0.82) for any imaging use, 0.82 (95% CI: 0.82–0.83) for X-ray, and 0.85 (95% CI: 0.83–0.86) for CT scans. When both structured variables and free text variables were included, the c-statistics reached 0.82 (95% CI: 0.82–0.83) for any imaging use, 0.83 (95% CI: 0.83–0.84) for X-ray, and 0.87 (95% CI: 0.86–0.88) for CT. Conclusions Both CT and X-rays are commonly used in the pediatric ED with one third of the visits receiving at least one. Patients’ socioeconomic, demographic and clinical factors presented at ED triage period were associated with the medical imaging utilization. 
Predictive models combining structured and unstructured variables available at triage performed better than models using structured or unstructured variables alone, suggesting the potential for use of NLP in determining resource utilization.


Introduction
More than 25 million pediatric patients seek medical care in the Emergency Department (ED) each year in the United States, and the pediatric ED utilization continues to increase [1]. Emergency providers usually need to make quick and complex clinical decisions with limited information [2]. Clinical decision making in pediatric patients is complicated and time-consuming because of their unique physiologic and developmental differences [3,4]. Consequently, ED health outcomes of children differ as their pattern of illness and presenting symptoms vary with age [5,6].
In many instances, clinical decision making in the ED involves ordering laboratory tests (blood tests, urine tests, etc.) and/or performing imaging procedures (x-rays, ultrasound, computed tomography (CT) scans) in order to arrive at a working diagnosis and initiate therapies or other interventions, which in some instances are lifesaving [7]. However, the use of imaging has a significant impact on emergency care delivery, both in terms of appropriateness and in terms of the effect of such studies on patient throughput, which in turn affects access to emergency care and overcrowding [8,9].
Previous studies have focused on improving the efficiency and accuracy of pediatric medical decisions during ED visits [10,11]. The use of predictive analytical techniques to more rapidly determine patient health outcomes among adult ED patients has proved useful [12]. However, few studies have focused on predicting resource utilization (e.g., medical imaging use) of pediatric ED patients [13]. In addition, unstructured data such as patient chief complaints are often available at the time of the patient's visit and contain valuable information that can potentially enhance prediction performance [14,15]. However, these data are not immediately useful and require extraction, cleaning, and aggregation [16]. Our previous work has shown that incorporating unstructured clinical notes through natural language processing can increase predictive accuracy for adult hospital admission [12].
Early and accurate prediction of the need for medical imaging in pediatric patients visiting the ED may assist in the planning and optimization of resources in the ED healthcare service. In the current study, we examined the association between medical imaging utilization and information related to patients' socioeconomic, demographic and clinical factors during pediatric patients' ED visits, and developed predictive models using these associated factors, including natural language elements, to predict medical imaging utilization in the pediatric ED.

Data and methods
We followed standardized guidelines for conducting and reporting this study, including the Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research [17]. We performed a secondary data analysis of the 2012-2016 National Hospital Ambulatory Medical Care Survey ED Subfile (NHAMCS-ED) [18][19][20]. NHAMCS, administered by the National Center for Health Statistics, is a multistage, stratified probability sample of ED visits from 300 hospital-based EDs each year, randomly selected from about 1900 geographically defined areas across the United States. Details of the survey methodology are available from the National Center for Health Statistics [19,20]. We included a total of 27,665 visits by pediatric patients (≤18 years old) from the 2012 to 2016 survey datasets. After applying the patient visit weights, this represents 161,340,000 ED visits at the national level.
The primary outcome variables for this study were performance of any diagnostic imaging (X-ray and/or CT), any X-Ray use, and CT scan during an ED visit. Ultrasound use was not included in the study because its frequency was low in the NHAMCS-ED. Structured covariates included information routinely collected at the time of ED triage: sex, age category, race/ethnicity, type of residence, source of payment, arrival mode, arrival day and time, initial vital signs (body temperature, heart rate, respiratory rate, blood pressure, pulse oximetry), 5-point triage level (1 Immediate; 2 Emergent; 3 Urgent; 4 Semi Urgent; 5 Nonurgent), pain scale, 72 h revisit, comorbidities (cancer, cerebrovascular disease, chronic obstructive pulmonary disease (COPD), congestive heart failure, and HIV), and whether the visit was related to an injury, poisoning, or adverse effect of medical treatment. Descriptive analysis of these structured variables was performed within each medical imaging group, and the odds ratios of using any imaging, X-Ray and CT scan were estimated using logistic regression.
Unstructured data included up to three reasons for visiting the ED and up to three causes of injury recorded by the providers for each patient in the triage notes [21]. Natural Language Processing (NLP) techniques were used to extract the information from the unstructured data. First, we conducted a text preprocessing step which included lemmatization (grouping word capitalization and derivations together), removal of numbers, punctuation, and stop words (e.g., 'and', 'are', 'the'), and tokenization (breaking the text into single words and word pairs). We extracted all the unigrams (single words) and bigrams (word pairs) from the free text data after preprocessing. Subsequently, a table of the frequency of each tokenized word or word pair for each person or visit was formed [22]. Finally, we removed sparse terms, i.e., those with a frequency lower than 99.9% of the overall population; words or word pairs with a frequency of less than 277 (1% of the total sample size of 27,665) were removed. A total of 1209 words and word pairs were identified after preprocessing the unstructured data.
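The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study (the text analyses were run in R): the stop-word list, the function names, and the example notes are hypothetical, lemmatization is omitted, and the frequency threshold is assumed to count the number of visits in which a term appears.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the study's list is not given.
STOP_WORDS = {"and", "are", "the", "of", "to", "in", "a", "on"}

def tokenize(text):
    """Lowercase, drop digits and punctuation, and remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def ngrams(text):
    """Return unigrams plus bigrams (adjacent word pairs) for one text field."""
    words = tokenize(text)
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

def term_frequency_table(texts, min_visits):
    """Per-visit term counts, keeping terms seen in at least min_visits visits."""
    per_visit = [Counter(ngrams(t)) for t in texts]
    # Iterating a Counter yields each term once, so this counts visits per term.
    visit_freq = Counter(term for counts in per_visit for term in counts)
    vocab = sorted(t for t, n in visit_freq.items() if n >= min_visits)
    rows = [[counts[t] for t in vocab] for counts in per_visit]
    return vocab, rows

# Invented example notes, for illustration only.
notes = [
    "Fall from bike, pain in the left arm",
    "Fever and vomiting for 2 days",
    "Arm pain after fall",
]
vocab, rows = term_frequency_table(notes, min_visits=2)
```

With the study's data, `min_visits` would correspond to the threshold of 277 visits; the resulting table has one row per visit and one column per retained word or word pair.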
For both the structured data and the word (or word pair) frequencies, principal component analysis (PCA) was used to reduce the dimension (or select the features) of the structured data and of the frequency table of tokenized words or word pairs. As described in previous studies [12,23], the goal of PCA is to obtain a smaller number of new variables, or principal components, that represent the large number of words or word pairs through linear combinations. These principal components account for the maximum of the original variance, or information, in those words or word pairs. The first component captures as much of the variance in the word or word pair frequencies as possible, with each succeeding principal component accounting for the largest possible share of the remaining variance. Because the components are orthogonal linear combinations, these principal components have no information overlap with each other.
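The properties described above can be written compactly in standard PCA notation. The symbols are introduced here for illustration and do not appear in the original text: X is the column-centered n × p matrix of structured variables or term frequencies, S its sample covariance matrix, v_k the k-th eigenvector of S, and z_k the vector of scores on the k-th principal component.

```latex
\begin{aligned}
S &= \frac{1}{n-1}\, X^{\top} X, \\
S\, v_k &= \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, \\
z_k &= X v_k, \qquad \operatorname{Var}(z_k) = \lambda_k, \qquad
\operatorname{Cov}(z_j, z_k) = 0 \quad (j \ne k).
\end{aligned}
```

The ordering of the eigenvalues expresses that each successive component explains the largest possible share of the remaining variance, and the zero covariances express the absence of information overlap between components.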
Logistic regression models were used to predict pediatric medical imaging utilization. We established three models to determine the predictive performance in identifying patients with any medical imaging use, X-Ray, or CT scan: (1) models with structured variables only; (2) models with unstructured data only; (3) models with both structured and unstructured variables. Missing values were imputed with the median of the corresponding variable. Ten-fold cross-validation was used to validate the performance of each model. Patients were randomly divided into 10 sets; 9 of the 10 sets were used to train the models, while the remaining set was used as the testing set. For each round of training, t-tests compared the principal components' scores between outcome groups. Principal components with p < 0.05 were used as the input variables of the logistic regression models.
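As a rough illustration of the validation scheme described above, the following Python sketch shows median imputation, a random 10-fold split, and t-test screening of principal component scores on the training folds. It is a simplified stand-in rather than the study's MATLAB code: all function names are hypothetical, the data are synthetic, and the p-value uses a normal approximation to the Welch t statistic, which is an assumption that is only reasonable for large groups.

```python
import random
from statistics import NormalDist, mean, median, variance

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and yield (train, test) index lists for each of k folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def welch_p_value(a, b):
    """Two-sided p-value for a Welch t statistic (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    t = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(t)))

def select_components(scores, labels, train, alpha=0.05):
    """Indices of components whose training-fold scores differ by outcome."""
    keep = []
    for c in range(len(scores[0])):
        pos = [scores[i][c] for i in train if labels[i] == 1]
        neg = [scores[i][c] for i in train if labels[i] == 0]
        if welch_p_value(pos, neg) < alpha:
            keep.append(c)
    return keep

# Synthetic example: component 0 shifts with the outcome, component 1 is noise.
rng = random.Random(1)
labels = [i % 2 for i in range(200)]
scores = [[rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0)] for y in labels]
for train, test in k_fold_indices(len(labels)):
    kept = select_components(scores, labels, train)
    # A logistic regression would be fitted on `train` using only the `kept`
    # components and then evaluated on the held-out `test` indices.
```

Screening the components inside each training fold, rather than once on the full data, keeps the held-out fold from influencing which inputs the model sees.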
The area under the receiver operating characteristic (ROC) curve (AUC), or c-statistic, was recorded for each testing set. The c-statistic summarizes in a single numerical value the overall diagnostic accuracy of the index test. The c-statistic ranges from 0.50 to 1.00, with higher values indicating better predictive models. Values above 0.80 indicate very good models, between 0.70 and 0.80 good models, and between 0.50 and 0.70 weak models. The average ROC curve was derived by comparing the prediction values from all 10 cross-validation testing sets. The AUCs from different models were compared using a t-test. The probabilities of medical imaging use for each patient were calculated with this model. The best cutoff of the probabilities was determined by using the point on the ROC curve with the shortest distance to the upper left corner (where sensitivity = 1 and specificity = 1). The best cutoff of the probabilities for prediction and the corresponding sensitivity, specificity, and overall accuracy were recorded [24].
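The c-statistic and the cutoff rule described above can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions, not the study's code: the function names and the toy labels and probabilities are hypothetical, the AUC is computed from its rank interpretation (the probability that a randomly chosen imaged visit receives a higher predicted probability than a randomly chosen non-imaged visit), and a visit is classified as positive when its probability is at or above the threshold.

```python
def roc_points(y_true, y_prob):
    """(threshold, sensitivity, specificity) at each distinct probability."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for thr in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= thr)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < thr)
        points.append((thr, tp / pos, tn / neg))
    return points

def auc(y_true, y_prob):
    """c-statistic: share of positive/negative pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties between a positive and a negative count as half a correct ranking.
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def best_cutoff(y_true, y_prob):
    """ROC point closest to the upper left corner (sensitivity = specificity = 1)."""
    def distance(point):
        _, sens, spec = point
        return ((1 - sens) ** 2 + (1 - spec) ** 2) ** 0.5
    return min(roc_points(y_true, y_prob), key=distance)

# Toy data, for illustration only: 1 = imaging performed, 0 = no imaging.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
c_statistic = auc(y_true, y_prob)
threshold, sensitivity, specificity = best_cutoff(y_true, y_prob)
```

The pairwise AUC computation is quadratic in the number of visits, which is fine for a demonstration but would be replaced by a rank-based formula for a sample of this study's size.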
We performed a sensitivity analysis to predict the two major subtypes of the CT scan (abdomen/pelvis and head CT) using the same modelling strategies described above. The best cut-off of the probabilities, sensitivity, specificity, overall accuracy, and AUC were recorded. Basic data organization was done in SAS 9.4. The text analyses were performed in R 3.3.2. The modeling of logistic regression was performed in MATLAB R2016b.

Results
The crude and adjusted odds ratios of ED visits resulting in different types of medical imaging (vs. no medical imaging) for each variable, estimated using binary logistic regression, are presented in Table 2. Adjusted analyses showed that patients between 1 and 6 years and between 6 and 12 years were 42 and 24% less likely to require any medical imaging than patients less than 1 year old, respectively (aOR: 0.58, 95% CI 0.47-0.72 and aOR: 0.76, 95% CI 0.60-0.96). Black patients were 12% less likely to receive any imaging than white patients (aOR: 0.88, 95% CI 0.77-1.00). Patients with Medicaid were 18% less likely to receive any imaging than patients with private insurance (aOR: 0.82, 95% CI 0.73-0.92). Compared to those with a mild pain level, patients with moderate and very severe pain levels were 2.15 and 2.70 times as likely to receive any imaging, respectively (aOR: 2.15, 95% CI 1.90-2.44 and aOR: 2.70, 95% CI 2.32-3.13). Compared to those with injury/trauma, patients with overdose/poisoning were 82% less likely to receive any imaging (aOR: 0.18, 95% CI 0.09-0.34). Patients with adverse effects of medical treatment and patients with other diagnoses were 80 and 68% less likely to receive any imaging than patients with injury, respectively (aOR: 0.20, 95% CI 0.11-0.37 and aOR: 0.32, 95% CI 0.28-0.36). The odds ratios of those characteristics for X-Ray use were similar to those for any imaging use, as X-Ray is the most frequent medical imaging type. The distribution and the odds ratios of the top 25 most frequent words or word pairs are reported in Fig. 1 and Additional file 1: Table S1. The odds of having imaging were higher for patients whose complaints contained words such as pain, soreness, injury, and spasm than for patients whose complaints did not contain those words. Patients reporting fever, vomiting, or skin issues had lower odds of having imaging done.
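The percentages quoted above follow from the adjusted odds ratios by a simple conversion, shown here as a worked example. The coefficient symbol β is introduced for illustration; strictly, the percentages describe a change in the odds of imaging rather than in its probability.

```latex
\begin{aligned}
\mathrm{aOR} &= e^{\beta}, \qquad
\text{percent change in odds} = (\mathrm{aOR} - 1) \times 100\%, \\
\mathrm{aOR} = 0.58 &\;\Rightarrow\; (0.58 - 1) \times 100\% = -42\%, \\
\mathrm{aOR} = 0.76 &\;\Rightarrow\; (0.76 - 1) \times 100\% = -24\%, \\
\mathrm{aOR} = 0.18 &\;\Rightarrow\; (0.18 - 1) \times 100\% = -82\%.
\end{aligned}
```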
Around 200 principal components remained after feature selection as input for each logistic regression model. Applying the three logistic regression models (Table 3; model 1: structured variables only, model 2: unstructured variables only, and model 3: both unstructured and structured variables), we found that the predictive accuracy for any medical imaging use was higher for models with text-based reason for visit variables only than for models with structured variables only. The AUCs are shown in Fig. 2. The results of the sensitivity analysis are reported in Additional file 1: Table S2 and Additional file 1: Figure S1. A total of 420 (1.52% of total) patients had an abdomen/pelvis CT scan and 785 (2.84%) had a head CT scan. In the model for abdomen/pelvis CT scan, the AUC was 0.856 (95% CI: 0.833-0.879) for unstructured data, 0.826 (95% CI: 0.814-0.838) for structured data, and 0.892 (95% CI: 0.875-0.909) for both. In the model for head CT scan, the AUC was 0.891 (95% CI: 0.877-0.905) for unstructured data, 0.797 (95% CI: 0.786-0.808) for structured data, and 0.906 (95% CI: 0.893-0.920) for both. The AUCs were significantly different between the models based on the unstructured data, the structured data, and the combined data (p < 0.01).

Discussion
In the current study, we described the rates of X-Ray and CT use in pediatric visits to the emergency department in the United States. The rate of medical imaging use ranged from 28.4 to 31.8% each year from 2012 to 2016; the rate of X-Ray use ranged from 23.8 to 26.2%, and the rate of CT use from 4.2 to 5.9%. We found that patients' socioeconomic, demographic and clinical factors presented at ED triage were associated with medical imaging use. Similar to previous studies, we detected racial/ethnic and socioeconomic differences in the use of medical imaging [25,26]. We found that Black and Hispanic patients were less likely to undergo CT scans than white patients, which could be related to differences in the distribution of injury severity, or in access to insurance coverage, across racial/ethnic groups [25,27]. Compared to patients with private insurance, patients with Medicaid coverage were less likely to receive a CT scan. Reasons for these disparities should be further explored in future research to determine the appropriateness of including or excluding these variables in prediction models [27] based on the clinical context. We also found that younger age, higher triage level, ambulance arrival, abnormal vital signs, injury diagnosis and certain comorbidities were predictive of medical imaging use. As expected, patients with urgent and immediate triage levels had the highest likelihood of medical imaging use. Patients with abnormal vital signs generally had a higher likelihood of medical imaging use than patients with normal vitals. Clinical practice in the adult ED and the pediatric ED differs considerably; in particular, triaging pediatric patients is more complicated and time-consuming than triaging adults because of their unique physiologic and developmental differences. Compared to our previous study on adult patients, we found even larger racial/ethnic disparities between Black and white patients in the pediatric ED than in the adult ED.
CT use was positively associated with Medicare coverage among pediatric patients, whereas the opposite was observed among adult ED patients. CT use was also positively associated with triage urgency among pediatric patients.
Since the prediction models are based on the imaging utilization assigned by clinicians, the associated factors can not only predict the imaging utilization outcomes but can also indicate bias in clinicians' imaging decisions. These biases should be considered in any real implementation of the prediction models in healthcare management. One approach to evaluating these biases would be a medical chart review of the electronic health records for each patient to analyze how much bias exists in pediatric ED imaging decisions. Because EDs are the critical staging area for very ill patients, higher ED utilization and ED overcrowding lead to reduced access to time-critical healthcare, thus negatively affecting patient care quality and patient safety [28][29][30]. As the crisis of emergency care grows, hospitals have taken initiatives to improve patient care quality in many ways [31,32]. One of these is to establish better decision-making systems in emergency care that could mitigate these challenges and facilitate the transition to a value-based healthcare industry [33]. Such systems can build on the large volume of data collected in ED electronic health records and on technological innovations that employ predictive analytics to more rapidly identify resource utilization, such as medical imaging. Prediction models for advanced medical imaging utilization (CT, MRI, and ultrasound) in the adult ED have been examined and proposed in a previous study [34]. The main difference between the prediction models in the adult paper and the current study is that the adult study used single-word frequencies for topic modelling and kept only the first few topics in the prediction models. Topic modelling is a commonly used technique for NLP.
Although the method was reported to group patterns hidden in unstructured data into different themes, we did not find many clinically meaningful topics when we applied it to the reasons-for-visit data from adult patients. In the current paper, using a bag-of-words representation that included both single words and word pairs, we used principal component analysis combined with a t-test for feature extraction. We found that the AUC for pediatric patients (Any imaging use: 0.824; CT scan: 0.868) was improved compared to adults (Any imaging use: 0.780; CT scan: 0.790). The main contributors to the improvement are the bigrams and the inclusion of all features from all bags of words, instead of only keeping the first few. A novel part of this study was the development of a predictive model for medical imaging use among a cohort of pediatric ED patients using both structured and unstructured data available at ED triage. The predictive model showed "good" prediction performance for all three outcomes: medical imaging overall, X-Ray, and CT scan [35]. Although the difference was statistically significant, we found that the structured data did not add much predictive power beyond the unstructured data in predicting medical imaging utilization for both adult and pediatric patients, indicating that the main factors for imaging utilization at the ED were contained in the reasons for visit and cause of injury data. A prediction tool built on the information obtained from patient visits, including the unstructured information written by the triage nurse, may benefit triage personnel and ED physicians, suggesting that the Emergency Severity Index [36][37][38], a common triage standard in the US, may be underusing the wealth of information available in a typical triage note. Unstructured data from the hospital EHR system have remained largely unexplored, as extraction and analysis of these data are complicated [37].
However, information hidden in those unstructured health records is potentially important for better predicting resource utilization at the ED. The prediction improved significantly for all three outcomes when natural language processing elements were added. The present study adds to similar previous studies [39,40] by including natural language processing in the ED triage prediction model. Earlier prediction of resource use through tools like those developed here may improve throughput, and improved ED throughput may help reduce ED crowding [32,34,41,42].
The models generally use variables measured at one time point to estimate the probability of an outcome occurring within a given time in the future [43]. Research on prediction models for the ED health service at this stage aims to assist clinical decisions (i.e., to help identify patients' imaging needs early in the triage period) rather than to completely replace the role of clinicians. Prediction models with good accuracy can efficiently assist the clinical management workflow if there is a good implementation strategy. The medical and economic risk of deploying these models in real clinical settings is, at this stage, high given the inaccuracies. However, it is still of value to study how to improve prediction performance, how to better implement these types of prediction tools, and how to test the value of those models in real implementation, in order to advance the field. This study introduced a new approach to improving the prediction models and set a base model for imaging prediction in the pediatric ED using a national sample. We examined the factors associated with imaging utilization at the ED and developed prediction models with good prediction results (AUCs greater than 0.80). Further studies should examine how to improve the models' accuracy, how to implement models with good accuracy, and how to assess the medical-economic risk.

Limitations and strengths
This study is limited in several ways. Limitations of the data source (NHAMCS) include that (1) the outcomes of medical imaging use are based on clinical decisions made with awareness of the predictors used in the model, with resulting incorporation bias [44]; (2) the survey did not collect information on the subtype of X-rays, or (3) on the appropriateness of imaging utilization and pediatric-specific comorbidities; (4) the survey relies on clinician diagnoses, which cannot be validated; and (5) NHAMCS counts visits rather than individual patients, so it is possible that some children, particularly those who are more medically complex, had multiple visits or received multiple imaging studies. The NLP approach simplified the feature extraction by using the frequency of words and word pairs in the text data. The approach ignored other information, such as word combinations longer than bigrams or the order of the words, which could exclude specific predictors. However, the number of words within each text field was small, so we would expect to capture clinically relevant information by simply extracting the frequency of the words and word pairs. The c-statistic also has limitations: it is a single number that summarizes the discrimination of a model, but it does not communicate all the information that ROC plots contain and lacks direct clinical application [43].
Strengths of the NHAMCS include national representativeness, which increases the generalizability of the data. This study was based on retrospective national survey samples and should be viewed as preliminary in the hierarchy of diagnostic test validity. Future prospective studies should be performed to test the effectiveness of the predictive models.

Conclusions
Using nationally representative data on pediatric patients presenting to the ED, we examined information relating to the patients' socioeconomic, demographic and clinical factors during the patients' ED visits, including unstructured free-text fields such as the reason for visiting, and developed predictive models for medical imaging use. Both CT and X-rays are commonly used in the pediatric ED, with one third of the visits receiving at least one. We present several predictive models for the use of medical imaging in pediatric patients visiting the ED. The inclusion of unstructured data (i.e., triage notes) provided a significant improvement in accuracy.
Additional file 1: Table S1. Characteristics of the top 25 most frequent words in the patient complaint and cause of injury by imaging use. Table S2. Predictive performance of logistic regression models with 10-fold cross-validation in identifying patients with abdomen/pelvis and head CT scan during emergency department triage, NHAMCS 2012-2016. Figure S1. ROC curves for the logistic regression models for abdomen/pelvis and head CT scan (the red point on each ROC curve minimizes the Euclidean distance between the ROC curve and the upper left corner of the coordinate, which is defined as the best cutoff in the study).