Stroke severity is an important predictor of patient outcomes and is commonly measured with the National Institutes of Health Stroke Scale (NIHSS). Because NIHSS scores are often recorded as free text in physician reports, structured real-world evidence databases seldom include stroke severity. The aim of this study was to use machine learning models to impute NIHSS scores for all patients with newly diagnosed stroke from multi-institution electronic health record (EHR) data.
Leveraging machine learning, we identified the main factors in electronic health record data for assessing stroke severity, including death within the same month as stroke occurrence, length of hospital stay following stroke occurrence, aphagia/dysphagia diagnosis, hemiplegia diagnosis, and whether a patient was discharged to home or self-care. Comparing the imputed NIHSS scores to the NLP-extracted NIHSS scores on the holdout data set yielded an R2 (coefficient of determination) of 0.57, an R (Pearson correlation coefficient) of 0.76, and a root-mean-squared error of 4.5.
Machine learning models built on EHR data can be used to determine proxies for stroke severity. This enables severity to be incorporated in studies of stroke patient outcomes using administrative and EHR databases.
Stroke is the fifth leading cause of death in the US and a primary focus for improving patient outcomes and healthcare quality [1, 2]. The National Institutes of Health Stroke Scale (NIHSS) is a widely accepted, clinically validated measurement of stroke severity. The NIHSS score helps clinicians offer guidance about the prognosis and disability associated with acute stroke [1, 3,4,5].
The NIHSS score is defined as the sum of 15 individually evaluated elements and ranges from 0 to 42. Stroke severity may be categorized as follows: no stroke symptoms, 0; minor stroke, 1–4; moderate stroke, 5–15; moderate to severe stroke, 16–20; and severe stroke, 21–42 [6, 7]. NIHSS scores are not part of structured data in electronic health records (EHR); rather, stroke severity is recorded as free text in physician notes. The lack of a formal stroke severity assessment in large EHR databases is a limitation of real-world evidence patient outcome studies related to stroke [1, 8]. A machine learning approach may therefore be useful for quantifying stroke severity given the limited availability of clinically assessed NIHSS scores in real-world evidence databases, aiding such practices as payer modeling for case mix risk adjustment or assessing quality outcomes. Using billing codes from administrative claims data of the single-payer, compulsory enrollment healthcare program in Taiwan, Sung and colleagues developed several models to derive a stroke severity index and validate its performance against the NIHSS [9,10,11]. Developing a machine learning model for stroke severity based on claims or EHR data from the United States presents unique challenges because the US healthcare system is a multi-payer, multi-provider system financed and delivered through a combination of private and public resources.
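The severity categories listed above can be expressed as a small lookup. The following is a minimal sketch; the function name `nihss_category` and the label strings are illustrative choices, not part of the study.

```python
def nihss_category(score: int) -> str:
    """Map an integer NIHSS score (0-42) to the severity category given in the text."""
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("NIHSS score must be an integer")
    if not 0 <= score <= 42:
        raise ValueError("NIHSS score must be between 0 and 42")
    if score == 0:
        return "no stroke symptoms"
    if score <= 4:
        return "minor"
    if score <= 15:
        return "moderate"
    if score <= 20:
        return "moderate to severe"
    return "severe"
```

For example, `nihss_category(17)` falls in the 16–20 band and returns `"moderate to severe"`.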
The objective of this study was to retroactively impute NIHSS scores for all patients with newly diagnosed stroke in a multi-institution EHR database by leveraging machine learning techniques. Imputed NIHSS scores will enable large-scale real-world observational studies to incorporate a measure of stroke severity in research studies of disease burden in these patients.
The NIHSS scores are part of the information derived from the physician notes, and these scores were used as the outcome variable when training and evaluating model performance. Because some invalid values were originally extracted from the physician notes (e.g. values that are not integers within the NIHSS range), rigorous pre-processing was applied to exclude as many invalid NIHSS scores as possible. This exclusion criterion was defined in collaboration with Optum, and the remaining extracted NIHSS values were evaluated to ensure accuracy. When a patient had multiple NIHSS scores during their inpatient stay following stroke, the maximum score was used to capture the overall severity of the stroke. This study incorporated EHR data from January 2007 through September 2016.
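The two pre-processing rules described above (discard values that are not integers in the NIHSS range, then keep the maximum score per stay) can be sketched as follows. The exact exclusion criteria used in the study are not given in the text, so the validity check here is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, List, Optional


def clean_nihss(raw_values: Iterable[object]) -> List[int]:
    """Keep only values that are whole numbers within the NIHSS range 0-42."""
    valid = []
    for value in raw_values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # not numeric, e.g. "n/a" or None
        if number.is_integer() and 0 <= number <= 42:
            valid.append(int(number))
    return valid


def stay_level_score(raw_values: Iterable[object]) -> Optional[int]:
    """Return the maximum valid NIHSS score for one inpatient stay, or None."""
    valid = clean_nihss(raw_values)
    return max(valid) if valid else None
```

With the made-up input `["4", 7, "12.5", 57, "n/a", None]`, only 4 and 7 pass the validity check, so the stay-level score is 7.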
Patients were included in the study if they had a primary diagnosis of stroke (hemorrhagic [International Classification of Diseases (ICD)-9: 431; ICD-10: I61.XX], ischemic [ICD-9: 433.XX-434.XX, 436; ICD-10: I63.XXX], or transient ischemic attack (TIA) [ICD-9: 435.X; ICD-10: G45.9]) in an inpatient or emergency room setting, which was defined as the stroke event (Fig. 1). Additionally, patients were required to have a real NIHSS score (extracted from physician notes) during the stroke event. Patients were also required to have been in the database for at least 6 months prior to the stroke diagnosis.
Relevant patient demographics (e.g. age, gender) and billing codes related to procedures, diagnoses, prescriptions/medications, hospital visit information, and comorbidities were collected to form the initial set of 8023 potential features. All features were created during the inpatient hospitalization following stroke occurrence (Fig. 1), except the Charlson Comorbidity Index, which was estimated based on data prior to the stroke. Diagnosis codes were from the ninth and tenth revisions of the ICD (ICD-9 and ICD-10). To create features for machine learning models that are agnostic to coding version, an equivalency mapping provided by the Centers for Medicare & Medicaid Services (CMS) was leveraged. Because the granularity of diagnosis codes differs between ICD-9 and ICD-10, this mapping includes many-to-many relationships. By starting with all diagnosis codes within the stroke events for the patient cohort, and recursively incorporating any diagnosis codes that are equivalent according to the CMS mapping, disjoint diagnosis code groups were created. Binary features were then formed by checking each patient for the presence or absence of any diagnosis within a given ‘diagnosis code group’ during the patient’s stroke event.
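The recursive grouping described above amounts to finding connected components in a graph whose edges are the ICD-9/ICD-10 equivalences. The sketch below shows one way to do this under that reading; it is not the authors' implementation. The function names are hypothetical, and the codes (`"9:A"`, `"10:X"`, and so on) are placeholders rather than real ICD codes or real CMS mapping entries.

```python
from collections import defaultdict


def build_code_groups(cohort_codes, equivalences):
    """Group diagnosis codes into disjoint sets linked by equivalence pairs.

    cohort_codes: codes observed in the cohort's stroke events (seed set).
    equivalences: iterable of (code_a, code_b) pairs; may be many-to-many.
    """
    neighbours = defaultdict(set)
    for a, b in equivalences:
        neighbours[a].add(b)
        neighbours[b].add(a)

    groups, seen = [], set()
    for seed in sorted(cohort_codes):  # sorted for a reproducible group order
        if seed in seen:
            continue
        group, stack = set(), [seed]
        while stack:  # iterative traversal of everything reachable from seed
            code = stack.pop()
            if code in group:
                continue
            group.add(code)
            stack.extend(neighbours[code] - group)
        seen |= group
        groups.append(frozenset(group))
    return groups


def group_features(patient_codes, groups):
    """Return one 0/1 flag per group: does the patient have any code in it?"""
    patient_codes = set(patient_codes)
    return [int(bool(group & patient_codes)) for group in groups]


# Placeholder mapping: "9:A" maps to two ICD-10 codes, one of which is shared with "9:B".
example_equivalences = [("9:A", "10:X"), ("9:A", "10:Y"), ("9:B", "10:Y"), ("9:C", "10:Z")]
example_groups = build_code_groups(["9:A", "10:Z"], example_equivalences)
```

Here `"9:B"` ends up in the same group as `"9:A"` even though it was not a seed, because both map to `"10:Y"`; a patient coded only with `"9:B"` therefore receives the same binary feature as a patient coded with `"10:X"`.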
Additionally, simple presence/absence features were created for procedures coded with the Current Procedural Terminology (CPT4), Healthcare Common Procedure Coding System (HCPCS) procedure codes, Berenson-Eggers Type of Service (BETOS) codes, patient discharge status, diagnosis-related group assigned to the inpatient stay, drug class of prescriptions written, drug class of medications administered, and routes by which medications were administered. Counts of procedures (e.g. CPT4) and BETOS code groups within a patient’s stroke event were also included in the initial feature set. Other features included patient’s age at the time of stroke, gender, and length of hospital stay following stroke occurrence.
During the feature selection process, features with near zero variance or with high correlation (> 0.9) to another feature were removed. In the latter case, only the feature more highly correlated with the response variable was retained. A response-balanced subset of the training cohort was created for this step by randomly selecting an equal number of patients from each of 5 stroke severity categories (n = 183 per category, n = 915 total). This step was necessary so that the feature selection process was not affected by the skewed distribution of stroke severity categories (i.e., more patients in less severe categories than in more severe categories, Fig. 2). After initial feature selection, the remaining 619 features were used for the subsequent modeling step.
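A minimal sketch of the two filters described above follows. It assumes a plain variance threshold for "near zero variance" (the study does not state its definition) and uses a greedy pass over features ranked by their absolute correlation with the response, so that within any highly correlated pair the feature more correlated with the response is the one retained. The function names, threshold defaults, and the greedy ordering are all assumptions.

```python
from math import sqrt
from statistics import fmean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def filter_features(features, response, min_variance=1e-3, max_corr=0.9):
    """features: dict of name -> list of values; response: list of NIHSS scores."""
    # Filter 1: drop features whose variance is (near) zero.
    kept = {name: vals for name, vals in features.items() if pvariance(vals) > min_variance}
    # Filter 2: visit features from most to least correlated with the response and
    # skip any feature that correlates above max_corr with one already selected.
    ranked = sorted(kept, key=lambda n: abs(pearson(kept[n], response)), reverse=True)
    selected = []
    for name in ranked:
        if all(abs(pearson(kept[name], kept[s])) <= max_corr for s in selected):
            selected.append(name)
    return selected
```

In the study this step was run on the response-balanced subset (183 patients per severity category), which would correspond to passing only those patients' rows to `filter_features`.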
Machine learning model development
The imputed NIHSS scores were ultimately compared to the real NIHSS scores (extracted from physician notes) in the hold-out test dataset to assess model performance using the coefficient of determination (R2), the Pearson correlation coefficient (R), and root-mean-squared error (RMSE). During our initial model development, performance was compared across a set of models developed by several machine learning approaches including a random forest model, gradient boosting model, neural network, and linear regression. The random forest model, which is a meta estimator that fits several decision trees on various subsamples of the dataset, had the best performance. Model hyperparameters were optimized using a grid search, and performance was evaluated using three-fold cross validation within the training data (Additional file 1: Table S1). Recursive feature elimination performed on the training data reduced the 619 features further, with only the top 100 features included in the final model. The top 100 features were selected because only minor improvements in performance would have been gained for a substantial increase in model complexity with the inclusion of additional features (Fig. 3).
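The grid search with three-fold cross validation described above can be sketched in a library-agnostic way. This is not the authors' code: `fit` and `score` are hypothetical user-supplied callables (for instance, wrappers around a random forest implementation and a negated RMSE), the fold assignment is a simple interleaving chosen for brevity, and recursive feature elimination is not covered.

```python
from itertools import product
from statistics import fmean


def k_fold_splits(n_samples, n_folds=3):
    """Yield (train_indices, validation_indices) for each fold.

    Fold i validates on every n_folds-th sample starting at i; a real
    pipeline would usually shuffle the samples first.
    """
    for fold in range(n_folds):
        train = [i for i in range(n_samples) if i % n_folds != fold]
        valid = [i for i in range(n_samples) if i % n_folds == fold]
        yield train, valid


def grid_search_cv(param_grid, fit, score, X, y, n_folds=3):
    """Return (best_params, best_mean_score) over all parameter combinations.

    param_grid: dict of parameter name -> list of candidate values.
    fit(X_train, y_train, **params) -> model; score(model, X_valid, y_valid) -> float,
    where a higher score is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, valid in k_fold_splits(len(y), n_folds):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(score(model, [X[i] for i in valid], [y[i] for i in valid]))
        mean_score = fmean(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Because every candidate is scored only on folds drawn from the training data, the hold-out test set stays untouched until the final evaluation.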
Table 1 provides demographic data for patients at the time of stroke for each of the two cohorts (training and hold-out test set). Patients in the hold-out test set (n = 1033) were only used once to assess the performance of the final optimized random forest model with 100 selected features. This ensured that performance metrics were not biased by over-fitting and that the model for predicting stroke severity scores is generalizable to data not previously included in model development.
The random forest model achieved an R2 of 0.57, an R of 0.76, and an RMSE of 4.5. Figure 4 presents imputed versus actual NIHSS scores. The median (interquartile range, IQR) NIHSS score in the hold-out test cohort was 2 (6) for both the real and imputed NIHSS scores. The distributions of the real and imputed NIHSS scores for the hold-out test cohort are shown in Fig. 2.
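For readers who want to reproduce these three metrics on their own predictions, the following sketch computes them from paired lists of actual and imputed scores using their standard definitions. The function name `regression_metrics` and the dictionary keys are illustrative.

```python
from math import sqrt
from statistics import fmean


def regression_metrics(actual, predicted):
    """Return R2, Pearson R, and RMSE for paired actual/predicted values."""
    mean_actual, mean_pred = fmean(actual), fmean(predicted)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    cov = sum((a - mean_actual) * (p - mean_pred) for a, p in zip(actual, predicted))
    return {
        "r2": 1 - ss_res / ss_tot,
        "pearson_r": cov / sqrt(ss_tot * ss_pred),
        "rmse": sqrt(ss_res / len(actual)),
    }
```

R2 measures agreement with the actual values, whereas R measures only linear association, so the two need not satisfy R2 = R² on held-out data. In this study they are close: 0.76² ≈ 0.58 against a reported R2 of 0.57.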
A detailed list of the 100 features included in the final model is shown in Additional file 1: Table S2, ranked in order of relative importance. Top features included death within the same month as stroke occurrence, length of hospital stay following stroke occurrence, aphagia/dysphagia diagnosis, hemiplegia diagnosis, and whether a patient was discharged to home or self-care.
Although other methods have been used to determine proxy measures of stroke severity using clinical features available in EHR and claims databases, these methods (such as linear regression) often assume there are linear relationships between the features and stroke severity, which is not always the case. For example, the length of a hospital stay varies based on stroke severity: patients who suffer a moderately severe stroke tend to have a longer length of hospital stay compared to those with low severity strokes who recover more quickly. However, patients with severe stroke have high mortality rates within the first few days of admittance for their strokes and therefore tend to have shorter hospital stays compared to patients with moderately severe stroke. Machine learning algorithms can identify non-linear relationships between features and stroke severity and can incorporate the complex relationships and interactions between features such as the clinical diagnoses relevant to stroke outcomes, treatments including medications and procedures administered to patients at stroke onset, as well as the medical history of patients prior to stroke diagnosis. Machine learning methods are thus well suited to the task of assessing stroke severity based on clinically available information from EHRs.
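The length-of-stay example can be made concrete with a toy calculation. The values below are made up purely for illustration and are not study data: length of stay rises with severity among survivors but is short for the most severe patients who die early. Across everyone, the linear correlation between length of stay and NIHSS is essentially zero, yet within each group the relationship is perfectly linear, which is the kind of interaction a tree-based model can pick up by splitting on the death indicator first.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up toy rows, NOT study data: (length of stay in days, NIHSS, died early)
toy = [
    (2, 0, False), (5, 5, False), (8, 10, False), (11, 15, False), (14, 20, False),
    (11, 25, True), (8, 30, True), (5, 35, True), (2, 40, True),
]


def los_nihss_correlation(rows):
    return pearson([r[0] for r in rows], [r[1] for r in rows])


overall = los_nihss_correlation(toy)                             # approximately 0
survivors = los_nihss_correlation([r for r in toy if not r[2]])  # approximately +1
early_deaths = los_nihss_correlation([r for r in toy if r[2]])   # approximately -1
```

A single linear coefficient on length of stay would therefore carry almost no information in this toy setting, even though length of stay is highly informative once combined with the death indicator.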
The distribution of imputed NIHSS scores calculated using this machine learning model was consistent with observations from the population-based Greater Cincinnati/Northern Kentucky Stroke Study. The Cincinnati study determined NIHSS scores from a retrospective chart abstraction of 2233 ischemic stroke cases identified during a 12-month period, with median (IQR) NIHSS values of 3 (6). This is similar to the results obtained using the machine learning model, where the median (IQR) was 2 (6), and both studies showed a skewed distribution toward less severe strokes. The slightly lower median in the machine learning study may be caused by the inclusion of all types of stroke, including TIAs, which tend to be much less severe than ischemic strokes. These results contrast with randomized clinical trials in which the enrolled patient population is often selected to include patients with more severe stroke. For example, the Albumin in Acute Stroke study required patients to have an ischemic stroke and a baseline NIHSS score of 6 or higher, and the Desmoteplase in Acute Ischemic Stroke 3 (DIAS-3) study required patients to have an ischemic stroke and an NIHSS score of 4–24.
Documentation of NIHSS scores was evaluated as part of the Get With The Guidelines-Stroke program. Over the 10-year study period from 2003 to 2012, the documentation rate was 56.1%, with a median NIHSS score of 4 (IQR, 2–9) and mean of 6.7 (SD, 7.4). Characteristics associated with NIHSS documentation were those related to eligibility for thrombolysis (e.g. arrival by ambulance and within 3 h of symptom onset). A modest selection bias was observed, reflecting the tendency of hospitals with lower documentation rates to selectively report higher NIHSS scores. The ability to impute NIHSS scores with machine learning algorithms may mitigate the problem of incomplete documentation.
The machine learning model described in this study was developed on a US-based data source and achieved similar performance to models developed from the more uniform, single-payer, compulsory healthcare data from Taiwan [9,10,11]. The R was 0.76 for this US-based study and ranged between 0.68 and 0.73 for different models developed in the Taiwan-based study. In the Taiwanese study, NIHSS scores were assessed on admission and recorded directly in the national stroke registry, patients were primarily managed by neurologists, and the features were based on medical billing codes rather than diagnosis and procedure codes, because billing codes are considered more accurate in Taiwanese health databases due to Taiwan’s universal coverage for hospitalizations and reimbursement system. In contrast, for this study using the US-based data source, real NIHSS scores were extracted using NLP from free-text physician notes, attending physician specialty varied, and diagnostic and procedural coding can vary in the multi-provider US healthcare system. Using a more complex machine learning algorithm with significantly more features enabled the US-based model to achieve similar performance to the Taiwan-based model and demonstrates the ability of machine learning methods to handle the systematic differences among diverse EHR systems across various US providers.
This study is subject to several limitations. Although the EHR database captured comprehensive information on diagnoses, treatments administered, and procedures performed during stroke occurrence, other information that could be critical for model performance, such as brain imaging, was not available. As with all studies based on real-world data, there is the potential for missing records. In addition, healthcare information in the database was not available until January 2007, which precluded the study from capturing information on patients who might have had a stroke-related diagnosis before 2007. As such, the first observed stroke occurrence in the data could be either a first or a later stroke diagnosis. Moreover, generalizability of the model to another database remains unclear, as the current model was trained and validated only within a single database. Future work is therefore planned to validate the current model in a different EHR database.
Applying this machine learning method to assess patients’ stroke severity in real-world databases where NIHSS scores are not available enables large-scale health-economics and long-term patient outcome studies to incorporate stroke severity. However, in any such endeavor, it would be important to remove all features related to any outcomes being studied (and to reassess model performance after removal) to avoid artificially elevated associations. These enhanced studies can potentially accelerate the development of better clinical management and improve patient quality of care [3, 8]. This study represents a novel advanced analytics application to real-world data that could significantly impact drug development and patient outcomes.
HIPAA: Health Insurance Portability and Accountability Act
ICD: International Classification of Diseases
NIHSS: National Institutes of Health Stroke Scale
NLP: Natural Language Processing
R: Pearson Correlation Coefficient
R2: Coefficient of Determination
TIA: Transient Ischemic Attack
Katzan IL, Spertus J, Bettger JP, Bravata DM, Reeves MJ, Smith EE, et al. Risk adjustment of ischemic stroke outcomes for comparing hospital performance: a statement for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2014;45(3):918–44.
National Center for Health Statistics. Health, United States, 2016: with Chartbook on long-term trends in health. Hyattsville: National Center for Health Statistics; 2017.
Rost NS, Bottle A, Lee JM, Randall M, Middleton S, Shaw L, et al. Stroke severity is a crucial predictor of outcome: an international prospective validation study. J Am Heart Assoc. 2016;5(1). https://doi.org/10.1161/JAHA.115.002433.
Fonarow GC, Saver JL, Smith EE, Broderick JP, Kleindorfer DO, Sacco RL, et al. Relationship of National Institutes of Health stroke scale to 30-day mortality in Medicare beneficiaries with acute ischemic stroke. J Am Heart Assoc. 2012;1(1):42–50.
Phan TG, Clissold BB, Ma H, Ly JV, Srikanth V. Predicting disability after ischemic stroke based on comorbidity index and stroke severity-from the virtual international stroke trials archive-acute collaboration. Front Neurol. 2017;8:192.
Samuel OW, Fang P, Chen S, Geng Y, Li G. Activity recognition based on pattern recognition of myoelectric signals for rehabilitation. In: Khan SU, Zomaya AY, Abbas A, editors. Handbook of large-scale distributed computing in smart healthcare. Basel: Springer International Publishing AG; 2017. https://doi.org/10.1007/978-3-319-58280-1_16.
Fonarow GC, Alberts MJ, Broderick JP, Jauch EC, Kleindorfer DO, Saver JL, et al. Stroke outcomes measures must be appropriately risk adjusted to ensure quality care of patients: a presidential advisory from the American Heart Association/American Stroke Association. Stroke. 2014;45(5):1589–601.
Sung SF, Hsieh CY, Kao Yang YH, Lin HJ, Chen CH, Chen YW, et al. Developing a stroke severity index based on administrative data was feasible using data mining techniques. J Clin Epidemiol. 2015;68(11):1292–300.
Sung SF, Hsieh CY, Lin HJ, Chen YW, Chen CH, Kao Yang YH, et al. Validity of a stroke severity index for administrative claims data research: a retrospective cohort study. BMC Health Serv Res. 2016;16(1):509.
Sung SF, Chen SC, Hsieh CY, Li CY, Lai EC, Hu YH. A comparison of stroke severity proxy measures for claims data research: a population-based cohort study. Pharmacoepidemiol Drug Saf. 2016;25(4):438–43.
Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule (Dated as September 4, 2012, as first released on November 26, 2012).
Nunes AP, Yang J, Radican L, Engel SS, Kurtyka K, Tunceli K, et al. Assessing occurrence of hypoglycemia and its severity from electronic health records of patients with type 2 diabetes mellitus. Diabetes Res Clin Pract. 2016;121:192–203.
Reeves M, Khoury J, Alwell K, Moomaw C, Flaherty M, Woo D, et al. Distribution of National Institutes of Health stroke scale in the Cincinnati/Northern Kentucky stroke study. Stroke. 2013;44(11):3211–3.
Ginsberg MD, Palesch YY, Hill MD, Martin RH, Moy CS, Barsan WG, et al. High-dose albumin treatment for acute ischaemic stroke (ALIAS) part 2: a randomised, double-blind, phase 3, placebo-controlled trial. Lancet Neurol. 2013;12(11):1049–58.
Albers GW, von Kummer R, Truelsen T, Jensen J-KS, Ravn GM, Grønning BA, et al. Safety and efficacy of desmoteplase given 3–9 h after ischaemic stroke in patients with occlusion or high-grade stenosis in major cerebral arteries (DIAS-3): a double-blind, randomised, placebo-controlled phase 3 trial. Lancet Neurol. 2015;14(6):575–84.
Reeves MJ, Smith EE, Fonarow GC, Zhao X, Thompson M, Peterson ED, et al. Variation and trends in the documentation of National Institutes of Health stroke scale in GWTG-stroke hospitals. Circ Cardiovasc Qual Outcomes. 2015;8(6 Suppl 3):S90–8.
The authors wish to thank Erik Sjoeland for his valuable guidance and input in drafting the manuscript. Editorial assistance was provided by Matthew Michelson, PhD, of Evid Science (El Segundo, CA) and Ashley O’Dunne, PhD, of MedErgy (Yardley, PA).
This research and preparation of this manuscript were supported by Janssen Scientific Affairs, LLC. The funder did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Authors and Affiliations
Janssen Research & Development, LLC, Raritan, NJ, USA
Emily Kogan, Kathryn Twyman & Jesse Heap
Janssen Scientific Affairs, LLC, Titusville, NJ, USA
EK, KT, DM contributed to conceptualization of the study. EK and KT contributed to data curation and formal analysis and validation and visualization. JH contributed to project administration work. EK, KT, DM, JHL, MA contributed to investigation and methodology of the study. DM contributed to funding acquisition. EK, KT, DM, JH, JHL and MA contributed to writing, review, and editing of the work. All authors reviewed and approved the manuscript for submission.
No patient’s personal identifying information is included in this manuscript. All study data were accessed with protocols compliant with US patient confidentiality requirements, including the Health Insurance Portability and Accountability Act of 1996 regulations. Because this study was non-interventional and used only statistically deidentified patient records, it was exempt from institutional review board review.
Consent for publication
EK, JH and DM are employees of Janssen; KT and JHL were employees of Janssen during the study and manuscript development; and MA is an employee of Hartford HealthCare.
Additional file 1: Table S1. Random Forest Hyperparameters. The parameters of the final model, which were obtained through hyperparameter optimization, are presented here. Table S2. List of Final Model Features. Here, features are ranked by the expected fraction of the samples they contribute to as a measure of feature importance.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Kogan, E., Twyman, K., Heap, J. et al. Assessing stroke severity using electronic health record data: a machine learning approach.
BMC Med Inform Decis Mak 20, 8 (2020). https://doi.org/10.1186/s12911-019-1010-x