Data source
This was a retrospective study that used the Lumbar Imaging with Reporting of Epidemiology (LIRE) study dataset, which consists of approximately 250,000 patients from four healthcare systems (Group Health, Kaiser Permanente Northern California, Henry Ford, and Mayo Clinic) who received a thoracic or lumbar spine plain X-ray, magnetic resonance imaging (MRI), or computed tomography (CT) between October 1, 2013 and September 30, 2016 [34]. The LIRE study was a multicenter intervention study that investigated whether inserting text about the prevalence of common imaging findings into lumbar spine imaging reports reduced subsequent spine-related interventions [34]. For each enrolled patient, EHR data were collected from one year prior to two years following their first (i.e. index) imaging.
Patient selection
From the LIRE dataset, we selected patients who had at least two occurrences of International Classification of Diseases (ICD)-9 or ICD-10 codes related to LSS or LDH (Additional file 1: Table S1). This criterion was agreed upon by our clinical experts (PS, JF, and JGJ) to increase confidence in identifying patients with these syndromes [35, 36]. We based our ICD codes on two previous studies [37, 38]. Martin et al. selected ICD-9 codes that were commonly used to describe spine-related problems. These codes were identified by searching the annual updates published by the World Health Organization and referencing the Conversion Tables of new ICD-9 codes published by the National Center for Health Statistics to identify newly added or modified codes [37]. They then validated this code-based grouping of patients against clinician judgment using sensitivity and specificity analyses. Deyo et al. further grouped their patients with back pain into back and leg pain or herniated disc and lumbar stenosis groups based on ICD-9 codes [38]. We updated the code lists of Martin et al. and Deyo et al. to also include the corresponding ICD-10 codes [39].
Outcome
We defined two prediction tasks for patients with LDH/LSS: early and late surgery (Fig. 1). We chose these outcomes based on the clinical rationale that early surgery for LDH/LSS is more likely driven by severe or progressive neurologic deficits, whereas late surgery is more likely driven by chronic pain [9]. For early surgery, we limited patients to those who had at least two LDH/LSS diagnosis codes within the year prior to LIRE enrollment and then searched two months ahead for the presence (positive) or absence (negative) of their first decompression surgery code. This yielded 198 (2.4%) LDH/LSS patients in the positive group and 8189 (97.6%) in the negative group. For late surgery, we limited patients to those who had at least two LDH/LSS diagnosis codes within the year prior to LIRE enrollment and then searched, after a two-month gap, one year ahead for the presence or absence of their first decompression surgery code. This yielded 431 (5.0%) LDH/LSS patients in the positive group and 8189 (95.0%) in the negative group. There was no overlap between patients with early and late decompression surgery. The decompression surgery phenotype was developed by having at least one non-clinician reviewer manually review lists of Current Procedural Terminology (CPT) and ICD-9 Procedure Coding System codes that were potentially associated with surgery (Additional file 1: Table S1) [34, 40, 41]. Any uncertain codes were also reviewed by two clinician reviewers (PS and JF) and discussed until both reviewers reached consensus.
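As a concrete illustration of these outcome definitions, the sketch below labels a single patient from hypothetical per-patient event lists expressed in days relative to LIRE enrollment. The exact window boundaries (60 days for early surgery, a further 365 days for late surgery) reflect our reading of the description above and are not the authors' code.

```python
# Hypothetical labeling of a single patient; days are relative to LIRE enrollment.
def label_patient(dx_days, surgery_days):
    """dx_days: days of LDH/LSS diagnosis codes; surgery_days: days of decompression codes."""
    # eligibility: at least two LDH/LSS diagnosis codes in the year prior to enrollment
    if sum(-365 <= d <= 0 for d in dx_days) < 2:
        return None
    # first decompression surgery code after enrollment, if any
    first_surgery = min((d for d in surgery_days if d > 0), default=None)
    early = first_surgery is not None and first_surgery <= 60
    late = first_surgery is not None and 60 < first_surgery <= 60 + 365
    return {"early_surgery": int(early), "late_surgery": int(late)}

print(label_patient(dx_days=[-200, -30], surgery_days=[45]))
# -> {'early_surgery': 1, 'late_surgery': 0}
```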
Features
We considered patient demographics, diagnoses, procedures, prescription information, and radiology reports as predictors for the model (Fig. 1). For demographics, we considered patients’ race, age, healthcare system, and ethnicity. For each patient’s primary care provider, we considered gender, clinician type, and specialty. For diagnoses, we considered patients’ ICD-9 and ICD-10 codes and the day each diagnosis was recorded. For procedures, we considered patients’ CPT and Healthcare Common Procedure Coding System Level II codes (i.e. procedure codes) and the day each procedure code was recorded. For prescriptions, we considered the drug name and prescription day. For radiology reports, we considered the findings and impression sections from the index imaging report in the LIRE study along with the type of image (i.e. X-ray, CT, or MRI).
Preprocessing/featurization
Demographics
This information is composed of patient and provider demographics along with the type of index image. To convert the data into a format for ML, we created dummy variables for the categorical features and normalized the discrete numerical feature (i.e. age) at the patient level (Fig. 2A). For early surgery, there are 23 features, while for late surgery there are 22 features.
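A minimal sketch of this featurization, assuming a pandas DataFrame with hypothetical column names (not the study's actual variables):

```python
# Sketch only: dummy-code categorical demographics and normalize age at the patient level.
import pandas as pd
from sklearn.preprocessing import StandardScaler

demo = pd.DataFrame({
    "age": [67, 54, 71],
    "race": ["white", "asian", "black"],
    "ethnicity": ["non-hispanic", "hispanic", "non-hispanic"],
    "site": ["KPNC", "HF", "GH"],
    "pcp_gender": ["F", "M", "F"],
})

# dummy variables for the categorical features
X_demo = pd.get_dummies(demo, columns=["race", "ethnicity", "site", "pcp_gender"])

# normalize the single discrete numerical feature (age)
X_demo["age"] = StandardScaler().fit_transform(X_demo[["age"]])
```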
Diagnosis, procedures, and prescriptions
We limited temporal data (diagnoses, prescriptions, and procedures) to the last three months prior to the index image for both prediction tasks, so that across patients we (1) ensured a consistent time period and (2) minimized variability in the amount of available data. The purpose was to minimize any influence from the heterogeneity of these factors on the prediction tasks. For diagnosis codes, we mapped ICD-10 codes to equivalent ICD-9 codes to minimize redundancy and then assigned all ICD-9 codes to depth level three of the ICD hierarchy using crosswalk files from cms.gov. We chose depth level three (i.e. the first three digits of ICD codes) to reduce the feature space while maintaining an informative level of granularity [42]. ICD codes are organized into a hierarchy based on shared clinical characteristics; the further down the hierarchy, the more specific the code becomes with respect to anatomic site, etiology, and manifestations.
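A hedged sketch of this mapping step, assuming the crosswalk has already been loaded as a plain dictionary (the example entries below are illustrative, not taken from the cms.gov files):

```python
# Illustrative only: map ICD-10 codes to ICD-9 via a crosswalk dictionary,
# then collapse to depth level three (the first three digits).
icd10_to_icd9 = {"M51.26": "722.10", "M48.06": "724.02"}  # hypothetical entries

def to_depth_three(code, crosswalk):
    """Return the ICD-9 equivalent of `code` (unchanged if already ICD-9),
    truncated to depth level three."""
    icd9 = crosswalk.get(code, code)
    return icd9.replace(".", "")[:3]

print(to_depth_three("M51.26", icd10_to_icd9))  # -> '722'
```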
Featurization for classical machine learning
We created dummy variables for the features (i.e. diagnosis codes, procedure codes, and drug names) at the patient level. Further, we excluded extremely rare (≤ 0.1%) or common (≥ 99%) features to reduce the feature space. For early surgery, there are 25 features for diagnoses, 103 features for prescriptions, and 71 features for procedures. For late surgery, there are 25 features for diagnoses, 106 features for prescriptions, and 72 features for procedures.
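A minimal sketch of this patient-level dummy coding and prevalence filtering, using pandas with a hypothetical long-format event table:

```python
# Sketch only: one row per (patient, code) event in the three-month window.
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "code": ["722", "724", "722", "E11"],
})

# one-hot at the patient level: 1 if the patient had the code in the window
X = pd.crosstab(events["patient_id"], events["code"]).clip(upper=1)

# drop extremely rare (<= 0.1%) or common (>= 99%) features
prevalence = X.mean()
X = X.loc[:, (prevalence > 0.001) & (prevalence < 0.99)]
```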
Featurization for deep learning
We binned the data into one-month intervals to reduce the sparsity of the eventual temporal feature matrix. We then created dummy variables for the features (i.e. diagnosis codes, procedure codes, and drug names) at the bin level for each patient. To maintain the same number of bins (i.e. three), we added empty bins for patients with fewer than three bins. Finally, we converted the dataframe into a 3D tensor whose depth corresponds to the number of patients, height to the number of bins, and width to the number of unique features (Fig. 2B). For early surgery, there are 41 features for diagnoses, 245 features for prescriptions, and 160 features for procedures. For late surgery, there are 43 features for diagnoses, 245 features for prescriptions, and 161 features for procedures.
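A sketch of this binning and stacking under assumed column names (days relative to the index image, three 30-day bins); a real pipeline would be vectorized, but the explicit loop keeps the tensor structure clear:

```python
# Sketch only: assign events to month bins, then stack into patients x bins x features.
import numpy as np
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "code": ["722", "G4731", "722"],
    "days_before_index": [12, 70, 45],
})
events["bin"] = (events["days_before_index"] // 30).clip(upper=2)  # bins 0, 1, 2

features = sorted(events["code"].unique())
patients = sorted(events["patient_id"].unique())

tensor = np.zeros((len(patients), 3, len(features)))  # depth x height x width
for i, pid in enumerate(patients):
    for _, row in events[events["patient_id"] == pid].iterrows():
        tensor[i, int(row["bin"]), features.index(row["code"])] = 1
# patients with no events in a given month simply keep an all-zero (empty) bin
```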
Index imaging reports
We developed regular expressions to identify the headers of the findings and impression sections by reviewing a subset of these reports. For all reports, we applied the regular expressions, then isolated and concatenated the accompanying text in these sections. The purpose was to limit the text to information that pertained to the diagnostic image itself. We then cleaned the text by converting it to lowercase, removing punctuation, removing extra whitespace, removing stopwords, and stemming each word using the PorterStemmer from the python package nltk [43].
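An illustrative version of this preprocessing is sketched below; the section-header pattern is a simplified stand-in for the regular expressions developed against the LIRE reports.

```python
# Sketch only: extract findings/impression text and clean it with nltk.
# Assumes the nltk stopword list has been downloaded (nltk.download('stopwords')).
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

SECTION_RE = re.compile(
    r"(?:FINDINGS?|IMPRESSION)\s*:\s*(.*?)(?=\n[A-Z ]+:|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def clean_report(text):
    # isolate and concatenate the findings and impression sections
    text = " ".join(SECTION_RE.findall(text))
    # lowercase, strip punctuation, collapse whitespace
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in stopwords.words("english")]
    # stem each remaining word
    stemmer = PorterStemmer()
    return " ".join(stemmer.stem(t) for t in tokens)
```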
Featurization for classical machine learning
We extracted uni-, bi-, and trigrams from the cleaned text using the python package scikit-learn [44]. Further, we excluded extremely rare (≤ 0.1%) or common (≥ 99%) n-grams to reduce the feature space. For early surgery, there are 26,245 features, while for late surgery there are 26,983 features.
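A minimal sketch with scikit-learn's CountVectorizer; `min_df` and `max_df` approximate the prevalence cutoffs described above, and `cleaned_reports` is assumed to be the list of preprocessed report strings:

```python
# Sketch only: uni-, bi-, and trigram featurization of the cleaned index reports.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=0.001, max_df=0.99)
X_text = vectorizer.fit_transform(cleaned_reports)  # sparse document-term matrix
```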
Featurization for deep learning
To convert the index reports into a format for the DL architecture, we used the python package gensim [45]. We first collected reports (n = 123,461) post LIRE enrollment and preprocessed them the same way as the index reports. We then pre-trained a skip-gram model with a vector length of 300 on these reports. Parameter values and architecture were based on a recent study that evaluated different word2vec architectures and observed that this architecture and these values led to optimal performance when converting radiology reports into embedding representations [46, 47]. We extracted the vocabulary and the associated embeddings from this pre-trained skip-gram model (Fig. 2C). To maintain the same length for each document (a requirement for efficient batch-based deep learning implementations), we padded reports to the maximum length across index reports: 559 for early surgery and 573 for late surgery. We chose padding to the maximum length, rather than truncating, to ensure the impression section, which summarizes the key findings from the image, was always included [48].
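A hedged sketch of the pre-training and padding steps with gensim (parameter names follow the gensim 4.x API; values other than the vector length of 300 are illustrative, and `cleaned_post_enrollment_reports` is an assumed list of preprocessed report strings):

```python
# Sketch only: pre-train a skip-gram word2vec model and pad the index reports.
from gensim.models import Word2Vec

tokenized_reports = [r.split() for r in cleaned_post_enrollment_reports]
w2v = Word2Vec(sentences=tokenized_reports, vector_size=300, sg=1,  # sg=1 -> skip-gram
               window=5, min_count=5, workers=4)

vocab = w2v.wv.key_to_index      # word -> integer index
embeddings = w2v.wv.vectors      # |vocab| x 300 embedding matrix

def pad(tokens, max_len, pad_token="<pad>"):
    """Pad a tokenized report to a fixed length (e.g. 559 for early surgery)."""
    return tokens[:max_len] + [pad_token] * (max_len - len(tokens))
```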
Machine learning
Benchmark model
We used LASSO [49] logistic regression, built using the python package scikit-learn, and weighted the positive and negative groups inversely proportional to their prevalence to address the imbalance in our dataset. Because the data naturally have multicollinearity among features (i.e. diagnosis codes, procedure codes, and prescriptions), which can lead to over- and underestimating relationships between the features and the outcome, we chose LASSO since it performs feature selection through penalization, minimizing the influence of these redundant features. To identify the optimal regularization parameter (lambda), we performed fivefold cross validation within the training set. We chose the lambda value that led to the highest average F1-score across the folds to shrink the coefficients of the features. We chose the F1-score since it is a popular performance metric for imbalanced datasets that considers both how well the model captures the positive (i.e. minority) group and the reliability of those positive predictions. Because LASSO’s lambda value and its subsequent performance can be affected by how the data are split, we repeated the fivefold cross validation 50 times, each time with a different split of the data into folds, and then chose the most prevalent lambda value across repeats [50]. Additionally, to assess the value of each modality, we repeated this process for each data type by itself (i.e. codes, demographics, and text).
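A simplified sketch of this benchmark with scikit-learn is shown below. Note that scikit-learn parameterizes the penalty as C = 1/lambda; the grid of values, the solver, and selecting the C with the highest mean F1 across all repeats (rather than the most prevalent lambda) are illustrative simplifications, and `X_train`/`y_train` are assumed to hold the featurized training data.

```python
# Sketch only: L1-penalized logistic regression with inverse-prevalence class
# weights and repeated fivefold cross validation scored by F1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

Cs = np.logspace(-3, 2, 20)  # candidate values of C = 1 / lambda
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=50, random_state=0)

mean_f1 = []
for C in Cs:
    lasso = LogisticRegression(penalty="l1", solver="liblinear",
                               class_weight="balanced", C=C, max_iter=1000)
    mean_f1.append(cross_val_score(lasso, X_train, y_train,
                                   scoring="f1", cv=cv).mean())

best_C = Cs[int(np.argmax(mean_f1))]
final_model = LogisticRegression(penalty="l1", solver="liblinear",
                                 class_weight="balanced", C=best_C,
                                 max_iter=1000).fit(X_train, y_train)
```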
Multimodal deep learning model
The MDL architecture was built using the python package PyTorch and is composed of a 1-layer Convolutional Neural Network (CNN), a 1-layer Gated Recurrent Unit (GRU), and two 1-layer Fully Connected (FC) networks (Fig. 1) [51]. This architecture is based on the work by Zhang et al., which compared two MDL architectures that differed in the use of either a CNN or Long Short-Term Memory (LSTM) for both sequences of clinical notes and structured data [30]. Since our approach does not use sequences of clinical notes, this comparison is out of scope. Additionally, we used a GRU instead of an LSTM since the former is a simpler architecture that can achieve similar performance [52, 53]. We passed the featurized index reports, together with the pre-trained skip-gram embeddings and vocabulary, into the CNN and the featurized temporal data into the GRU, concatenated the outputs of these individual networks with the featurized demographics, and then passed the resulting concatenated vector to the FC layer to make predictions. We included an FC layer to convert the temporal input into embeddings before passing it into the GRU, as previous studies of this approach showed improvements in prediction performance [54,55,56]. We used a CNN because we wanted to model the spatial relationships of the words in our reports in relation to the prediction task. The MDL model was trained using the Adam optimizer with weight decay and ReLU as the activation function. We used cross-entropy loss as the loss function, weighting the positive and negative groups inversely proportional to their prevalence to address the imbalance in our dataset [57]. We prevented subsets of weights from co-adapting (i.e. overfitting to the noise in the training data) by adding dropout to the hidden layer of the FC, allowing all weights to participate in the prediction task [58]. To optimize the hyperparameters (i.e. number of filters, learning rate, dropout rate, GRU hidden size, and weight decay), we (1) split the training data into 80% for training and 20% for validation, (2) used previous works as a starting point for values [30, 59], and then (3) performed a grid search to identify the combination of values associated with the lowest validation loss (Additional file 2: Table S2). We trained our model for 30 epochs using a learning rate scheduler that decreased the learning rate when the validation loss increased, to avoid overfitting. During training, the model was allowed to fine-tune the pre-trained skip-gram embedding values. Unlike the LASSO optimization, we did not perform fivefold cross validation as it would have been prohibitively computationally expensive. Additionally, we repeated this entire process for each individual network (i.e. 1-layer FC, 1-layer GRU with 1-layer FC, and 1-layer CNN with 1-layer FC) in the MDL architecture by itself with its associated data: demographics, temporal, and textual, respectively.
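A simplified PyTorch sketch of this architecture is given below, with hypothetical layer sizes and module names; it is meant to show how the CNN, GRU, and FC pieces fit together, not to reproduce the authors' implementation.

```python
# Sketch only: multimodal network combining report text, temporal codes, and demographics.
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    def __init__(self, embedding_matrix, n_temporal, n_demo,
                 n_filters=100, kernel_size=3, gru_hidden=64,
                 temporal_embed=64, dropout=0.5):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        # pre-trained skip-gram embeddings, fine-tuned during training (freeze=False)
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float), freeze=False)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size)
        self.temporal_fc = nn.Linear(n_temporal, temporal_embed)  # FC before the GRU
        self.gru = nn.GRU(temporal_embed, gru_hidden, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.hidden = nn.Linear(n_filters + gru_hidden + n_demo, 64)
        self.out = nn.Linear(64, 2)

    def forward(self, report_ids, temporal, demo):
        # report_ids: (batch, seq_len); temporal: (batch, 3, n_temporal); demo: (batch, n_demo)
        emb = self.embedding(report_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        text = torch.relu(self.conv(emb)).max(dim=2).values   # global max pooling over words
        _, h = self.gru(torch.relu(self.temporal_fc(temporal)))
        x = torch.cat([text, h[-1], demo], dim=1)             # concatenate all modalities
        x = self.dropout(torch.relu(self.hidden(x)))
        return self.out(x)
```

Training would then pair this module with `torch.optim.Adam(model.parameters(), lr=..., weight_decay=...)`, `nn.CrossEntropyLoss(weight=class_weights)`, and a scheduler such as `torch.optim.lr_scheduler.ReduceLROnPlateau` monitoring the validation loss, in line with the description above.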
Evaluation
Classical
We split each prediction task’s dataset into a training set (80%) and a test set (20%). After hyperparameter tuning, the LASSO models were retrained on the full training set using the optimized lambda values, while the DL models were retrained on the same training and validation split using the optimized hyperparameter values; the reason is that the learning rate scheduler for the DL models must monitor the validation loss to properly update the training process. We then evaluated the models’ performance on the test set using the following metrics: recall, specificity, balanced accuracy, precision, F1-score, area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPRC). While we calculated these different performance metrics, we prioritized AUC in the analysis and interpretation since it is (1) a global metric that assesses overall performance across different thresholds and (2) a more popular metric in the biomedical ML field. We estimated the significance of differences in performance between models by performing a t-test on 1000 bootstrapped test samples [27, 31]. We used a Bonferroni correction for multiple hypothesis testing when comparing the MDL model to the three individual networks by multiplying each p value by three.
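An illustrative sketch of the bootstrap comparison for AUC is shown below (using numpy, scipy, and scikit-learn); `y_test`, `scores_mdl`, and `scores_single` are hypothetical arrays of test labels and predicted probabilities from two models on the same test set.

```python
# Sketch only: compare two models' AUC distributions over 1000 bootstrapped test samples.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc(y_true, y_score, n_boot=1000):
    aucs = []
    n = len(y_true)
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)                # resample test set with replacement
        if len(np.unique(y_true[idx])) < 2:        # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.array(aucs)

auc_mdl = bootstrap_auc(y_test, scores_mdl)
auc_single = bootstrap_auc(y_test, scores_single)
t, p = stats.ttest_ind(auc_mdl, auc_single)
p_corrected = min(p * 3, 1.0)                      # Bonferroni over the three comparisons
```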
Generalizability
To assess generalizability, we divided the data by healthcare system. We trained the models on Kaiser Permanente Northern California and tested on the remaining systems. We chose Kaiser Permanente Northern California as the training set since it made up roughly 80% of our entire dataset. For the test set, we excluded the Mayo Clinic since it contained a substantially smaller number of patients compared to Henry Ford and Group Health (Table 2). For each test system, we then evaluated the models’ performance using the same metrics: recall, specificity, balanced accuracy, precision, F1-score, AUC, and AUPRC. As before, while we calculated these different performance metrics, we prioritized AUC when interpreting results. We estimated the significance of performance differences between models by bootstrapping 1000 samples for each healthcare system in the test set and then calculating the performance metrics. For each metric and healthcare system, we performed a t-test comparing the distributions between models. We used a Bonferroni correction for multiple hypothesis testing when comparing the MDL model to the three individual networks by multiplying each p value by three.