Towards stroke prediction using electronic health records

Teoh, Douglas

doi:10.1186/s12911-018-0702-y

Technical advance
Open access
Published: 04 December 2018

Towards stroke prediction using electronic health records

Douglas Teoh ORCID: orcid.org/0000-0001-5207-3859¹

BMC Medical Informatics and Decision Making volume 18, Article number: 127 (2018) Cite this article

3562 Accesses
25 Citations
4 Altmetric
Metrics details

Abstract

Background

As of 2014, stroke is the fourth leading cause of death in Japan. Predicting a future diagnosis of stroke would better enable proactive forms of healthcare measures to be taken. We aim to predict a diagnosis of stroke within one year of the patient’s last set of exam results or medical diagnoses.

Methods

Around 8000 electronic health records were provided by Tsuyama Jifukai Tsuyama Chuo Hospital in Japan. These records contained non-homogeneous temporal data which were first transformed into a form usable by an algorithm. The transformed data were used as input into several neural network architectures designed to evaluate efficacy of the supplied data and also the networks’ capability at exploiting relationships that could underlie the data. The prevalence of stroke cases resulted in imbalanced class outputs which resulted in trained neural network models being biased towards negative predictions. To address this issue, we designed and incorporated regularization terms into the standard cross-entropy loss function. These terms penalized false positive and false negative predictions. We evaluated the performance of our trained models using Receiver Operating Characteristic.

Results

The best neural network incorporated and combined the different sources of temporal data through a dual-input topology. This network attained area under the Receiver Operating Characteristic curve of 0.669. The custom regularization terms had a positive effect on the training process when compared against the standard cross-entropy loss function.

Conclusions

The techniques we describe in this paper are viable and the developed models form part of the foundation of a national clinical decision support system.

Peer Review reports

Background

As of 2014, data published by the Statistics Bureau of Japan showed stroke to be the fourth leading cause of death in Japan, with cerebral infarction being the leading cause of stroke [1]. A study on the lifetime risk of stroke in Japan reported observed probabilities of around 1 in 5 middle aged men and women suffering from stroke [2]. When broken down into cause of stroke subtypes, the observed probabilities of being at risk of cerebral infarction are around 1 in 7 for men and 1 in 6 for women. The other causes of stroke are cerebral hemorrhage and subarachnoid hemorrhage. The reported probabilities for cerebral hemorrhage were 1 in 40 for men and 1 in 60 for women, while for subarachnoid hemorrhage, the reported probabilities were 1 in 200 for men and 1 in 50 for women. The lifetime risks were reported to be similar across the studied age groups.

In this paper, we present our attempt at predicting future diagnosis of stroke for patients who have not and may or may not yet suffer from stroke. Applying computational techniques on Electronic Health Records (EHRs) is a common approach for predicting patient outcomes, which is a strategy we have also chosen. Our contributions are as follows. We obtain historical patient EHRs supplied by a hospital. These EHRs contain historical medical diagnoses and exam results. To make use of this data, we perform empirical testing of several pre-processing strategies to determine the best approach for addressing the issues of missing data and use of non-homogeneous and non-numerical data. We then design prediction models using neural network architectures in order to both evaluate the efficacy of the supplied EHR data and the network architecture in our stroke prediction task. Finally, we also present our modifications to the standard cross-entropy loss to include regularization terms, which we show to be useful when training neural networks with imbalanced output classes.

Related work

Various techniques have been employed to predict the risk of stroke. Cox proportional hazards regression is one technique used to develop a statistical model, where the use of the Framingham Study cohort forms one such application [3]. This application, developed on top of data collected from a sample population over a span of 36 years, described a formula that estimated the probabilities of stroke based on pre-determined risk factors such as age, systolic blood pressure, and presence of diabetes mellitus. Cox proportional hazards regression is commonly used to develop stroke risk models tailored towards a cohort of interest [4, 5]. Globorisk [6] was another model developed with Cox proportional hazards regression, with the aim of producing a formula that can be recalibrated and updated for use in different countries.

Bayesian Rule Lists (BRL) were used to develop an interpretable model to predict risk of stroke within a year for patients diagnosed with atrial fibrillation [7]. BRL produces a hierarchy of decision lists (chains of if …then …rules) ordered by posterior consequent distributions. The model used drug prescriptions, medical conditions, age, and gender to form these decision lists.

Support Vector Machines have been used in a study to predict the occurrence of stroke within five years after a set of baseline measurements [8]. The study also identified issues pertaining to missing data and large number of input features. Missing data was addressed through median imputation after a comparison with other methods. A novel feature selection algorithm was used to reduce the number of input features.

For the diagnosis of other kinds of disorders, or the development of decision support systems, different techniques have also been used. One of the first diagnosis systems was developed in 1961 [9]. This system focused on the diagnosis of congenital heart disease from clinical data, using a diagnostic model derived using Bayes’ theorem. In 1988, the earliest known disease diagnostic system using a multi-layer neural network was developed [10]. This system used over 200 questionnaire responses as inputs and supported 23 diseases as the diagnosis output.

More recently other types of neural network architectures have been investigated for creating predictions from EHRs. Doctor AI [11] employed a Recurrent Neural Network (RNN) architecture to process patient temporal medical events and prescriptions, which were both represented as categorical features. The model was used to predict future diagnoses, medication orders and visit time.

ICD-9 label assignments can be used to classify medical notes using a bag-of-words model combined with RNN [12]. The methods used to represent medical notes has a demonstrable effect on performance when evaluated on event prediction tasks such as patient mortality and emergency room visits [13].

On the topic of EHR representation, Deepr [14] drew on natural language processing techniques to represent diagnoses and treatments as a sequence of tokens. Temporal features were also discretized and represented as tokens. These tokens were then transformed into a continuous vector space through an embedding process. This representation was fed into a Convolutional Neural Network (CNN) architecture and evaluated on its ability to predict future hospital readmission risk.

Autoencoders can be used to derive compressed representations of physiological features as an input pre-processing step [15]. The word2vec algorithm [16] can be applied to generate an embedding over patient diagnoses and medications [17]. The embedded representation was fed into a CNN, whereby event temporality was handled with 1-dimensional convolutions over the temporal dimension of the input matrix.

Methods

In this section, we describe our data source, formally identify and formulate the problem we want to solve, and describe the machine learning models that we implemented in an attempt to solve the stroke prediction task.

Data source

Our data was supplied by Tsuyama Jifukai Tsuyama Chuo Hospital in Okayama Prefecture, Japan.

Patient cohort

The patient cohort was divided into two groups: case patients and control patients. Case patients are patients who will be diagnosed with stroke. The criteria for determining case patients were as follows. The initial set of case patients had a first diagnosis of stroke between January 1, 2001 and January 1, 2015. The specific date range was chosen arbitrarily, with the intention of accounting for latent variables which could include access to newer drugs, changes in data recording policies, and even socioeconomic changes. The patient cohort was further filtered to include only those who received the first diagnosis of stroke between the ages of 45 and 95. This range was chosen because there were at least 10 incidents of stroke per age. Case patients were also selected to have at least one medical diagnosis and at least one exam result at least one day before the stroke diagnosis.

To maximize both the number of case and control patients, only a maximum of two control patients were able to be selected per case patient. To be selected as a control patient, the candidate must have the same birth year and gender as the case patient, at least one medical diagnosis and one exam result before the date the case patient was diagnosed with stroke, and at least one medical diagnosis and one exam result after the date the case patient was diagnosed with stroke. Control patients also did not have any diagnosis of stroke in their EHR. Case patients without two control patients were excluded from the final set of case patients.

After applying these criteria, 2725 case patients and 5450 control patients were identified. These patients formed the dataset used for training and evaluating our models.

Patient diagnoses

Each diagnosis was recorded in the EHR as a tuple comprising of the diagnosis date and the ICD-10 code [18] representation of the diagnosis. For this study, codes longer than four letters were truncated to be four letters long. The use of codes longer than four letters comprised around 5% of diagnoses across all patients. In fact, five letter codes were the longest codes in the hospital’s EHR system. The truncation posed no problem because ICD-10 codes form a prefix hierarchy. We must also note that ICD-10 codes were derived through a mapping from Japanese insurance codes to ICD-10 codes of the year 2013 edition. Our study supported a total of 6905 ICD-10 codes. Stroke cases were identified by the range of codes spanning I60 to I69.

Patient exam results

We were supplied with blood and biochemical exam results for each patient. Each exam was comprised of one or more measured metrics, but an exam did not necessarily contain the same set of measured metrics. For example, a blood exam could contain results for red blood cell (RBC) count but not every blood exam was guaranteed to contain results for RBC count. There can also be two measurements per metric. The blood exam metrics were: (a) ALP (b) creatinine (c) Hb (d) hematocrit (e) MCH (f) MCHC (g) MCV (h) platelet count (i) RBC and (j) WBC. The biochemical exam metrics were: (a) GGT (b) GOT (c) GPT (d) HbA1c JDS (e) HbA1c NGSP (f) HDL-C (g) LDL-C (h) triglyceride (i) uric acid and (j) blood glucose.

We experimented with a few strategies to account for missing or optional data. We tried adding a categorical value per metric to indicate if the measurement was absent, as a means of differentiating zero values from unrecorded values. Our models did not have any predictive ability when this strategy was used. We tried a roll-forward strategy, where each missing value was replaced with the last known recorded value. Our models also failed to learn with this strategy. We then tried augmenting the roll-forward strategy by replacing missing values with the mean measured value for the patient’s gender using data from the training set, and our models were able to start learning with this strategy. Finally, to address the differences in the number of exam results between patients, exam metrics were summarised by age, i.e. keep only the newest recorded value for each metric at a given age. We found that our models attained the best results with this strategy.

Problem description

The purpose of this study was to investigate the possibility of predicting whether a patient will suffer from stroke within a year of the last set of exam results or the last set of medical diagnoses. A period of a year was chosen because annual health exams are mandatory for full time employees in Japan [19]. Being able to predict future illness from a regular exam would enable proactive forms of healthcare measures to be taken.

We formulated the prediction task as a binary classification task. The inputs available to this task were a patient’s historical exam results X_exam and medical diagnoses X_diagnosis. The output $\hat {\boldsymbol {y}}$ was a yes/no decision to indicate if the patient would suffer from stroke in the next 365 days. The following models were implemented and evaluated:

$$\begin{array}{*{20}l} \hat{\boldsymbol{y}} &= f_{d}\left(\boldsymbol{X}_{diagnosis}^{(i)}\right) \end{array} $$

(1)

$$\begin{array}{*{20}l} \hat{\boldsymbol{y}} &= f_{e}\left(\boldsymbol{X}_{exam}^{(j)}\right) \end{array} $$

(2)

$$\begin{array}{*{20}l} \hat{\boldsymbol{y}} &= f_{ed}\left(\boldsymbol{X}_{diagnosis}^{(i)}, \boldsymbol{X}_{exam}^{(j)}\right) \end{array} $$

(3)

We wanted our models to be able to parameterize by itself, the input features which are useful as predictors for the stroke prediction task. To do this we used all features that were available in the dataset, and encoded them as follows.

The input matrix $\boldsymbol {X}_{diagnosis}^{(i)}$ represented a patient’s h^th up to the i^th historical diagnoses:

$$ \boldsymbol{X}_{diagnosis}^{(i)} = \left(\boldsymbol{x}_{diagnosis}^{(h)}, \dots, \boldsymbol{x}_{diagnosis}^{(i)}\right) \mid h = \text{max}(1, i - 34), h \ne i $$

(4)

The maximum number of diagnoses in X_diagnosis was limited to 35 because this covered 99% of patients in the dataset. At least half of the patients had 5 or more diagnoses. There was a long tail of patients having more than 35 diagnoses.

A patient’s i^th diagnosis was represented by the vector $\boldsymbol {X}_{diagnosis}^{(i)}$. The patient’s gender was encoded as g∈{0,1}. To account for the patient’s age at time of diagnosis, this value was included and denoted as $a_{d}^{(i)} \in \mathbb {R}$. Each diagnosis ICD-10 code was encoded into a dense vector $\boldsymbol {d}^{(i)} \in \mathbb {R}^{32}$; details of this encoding process will be described in the next section. The number of days n since the (i−1)^th diagnosis was converted into a categorical value n_c and then encoded into a one-hot label vector $\boldsymbol {c}_{d}^{(i)} \in \{0,1\}^{|n_{c}|}$. The categories were:

$$ n_{c} =\left\{ \begin{array}{ll} 0 & \text{if } n \text{ is unknown or the first record}\\ 0 & 0 \leq n < 7\\ 1 & 7 \leq n < 30\\ 2 & 30 \leq n < 90\\ 3 & 90 \leq n < 180\\ 4 & 180 \leq n \leq 365\\ 5 & 365 < n\\ \end{array}\right. $$

(5)

The complete vector representation of the i^th diagnosis was then formulated as $\boldsymbol {x}_{diagnosis}^{(i)} = \left ((g), \left (a_{d}^{(i)}\right), \boldsymbol {c}_{d}^{(i)}, \boldsymbol {d}^{(i)}\right)$.

The input matrix $\boldsymbol {X}_{exam}^{(j)}$ represented the patient’s first to j^th summarized exam results:

$$ \boldsymbol{X}_{exam}^{(j)} = \left(\boldsymbol{x}_{exam}^{(1)}, \boldsymbol{x}_{exam}^{(2)}, \dots, \boldsymbol{x}_{exam}^{(j)}\right) \mid 0 < j \leq 20 $$

(6)

No patient in the dataset had more than 20 years of exam results, so $\boldsymbol {X}_{exam}^{(j)}$ will contain at most 20 results. At least half of the patients had 5 years or more of exam results.

The patient’s j^th exam results was represented by the vector $\boldsymbol {X}_{exam}^{(j)}$. The patient’s gender was encoded as g∈{0,1}. To account for the patient’s age when the exam was conducted, this value was included and denoted as $a_{e}^{(j)} \in \mathbb {R}$. The number of days n since the j−1^th exam was conducted was converted into a categorical value n_c as described in Eq. 5 and then encoded into a one-hot label vector $\boldsymbol {c}_{e}^{(j)} \in \left \{0,1\right \}^{|n_{c}|}$. Each exam metric’s measurement m was represented as $e_{m}^{(j)} \in \mathbb {R}$. All exam metrics were combined to form the vector $\boldsymbol {e}_{results}^{(j)} = \left (e_{1}^{(j)}, \dots, e_{m}^{(j)}\right)$. The complete vector representation of the j^th exam was formulated as $\boldsymbol {x}_{exam}^{(j)} = \left ((g), \left (a_{e}^{(j)}\right), \boldsymbol {c}_{e}^{(j)}, \boldsymbol {e}_{results}^{(j)}\right)$.

ICD-10 encoding

We encoded the ICD-10 codes into the dense vector d. Generally, N categorical values can be encoded as a one-hot label vector {0,1}^1×N, where a value of 1 in the i^th column and zero in every other column is a vector representing the i^th category. This encoding scheme works when the value of N does not cause memory usage issues during an algorithm’s runtime. For large N, it may be necessary to apply dimensionality reduction techniques to project the input space into a lower dimensional space. We used a single hidden layer autoencoder network to project from the sparse input vector x_ICD∈{0,1}⁶⁹⁰⁵ into a dense vector $\boldsymbol {d} \in \mathbb {R}^{32}$. Our autoencoder network was defined as

$$\begin{array}{*{20}l} \boldsymbol{d} &= \text{tanh}\left(\boldsymbol{x}_{ICD}^{T}\boldsymbol{W}_{input} + \boldsymbol{b}_{input}\right) \end{array} $$

(7)

$$\begin{array}{*{20}l} \boldsymbol{y}_{decoding} &= \text{softmax}\left(\boldsymbol{d}^{T}\boldsymbol{W}_{output} + \boldsymbol{b}_{output}\right) \end{array} $$

(8)

The autoencoder network was then trained to learn W_input, b_input, W_output, and b_output such that the categorical cross-entropy loss between y_decoding and x_ICD was minimized. Equation 7 can then be applied to yield our dense representation d. Our autoencoder network learned a perfect reconstruction.

Model implementations

All neural network models presented in this paper were selected through experimentation. These models were selected by evaluating the F₁ score achieved on the validation set. The experimentation (which included architecture searches and meta-parameter selections) was non-exhaustive; the models presented here make no claims of being globally optimal with respect to predictive power. Our experiments were conducted on a desktop-class machine with a single Nvidia GTX 1080 video card. We spent around 3 months conducting experiments, with the runtime for each experiment ranging from a day up to a week. We present our model descriptions as Keras 2.0.9 [20] code in the interest of being precise.

Model 1 was implemented as a RNN using a Gated Recurrent Unit (GRU) module for recurrent connections. Text translation tasks, such as those described in [21], have demonstrated the capability of RNNs to learn from variable-length sequences and model dependencies within the input sequence. GRU modules were chosen over Long Short Term Memory modules because empirical evidence have demonstrated it to be faster to train while achieving comparable performance [22]. Listing 1 shows the model’s code definition.

Model 2 was implemented using a fully connected architecture. We experimented with CNN and RNN architectures but did not achieve better performance. We handled temporal input using Keras’ TimeDistributed wrapper. The code for this model is in Listing 2.

Finally, Model 3 was implemented using a dual-input architecture to handle both diagnoses and exam results as input data. This model was a fusion of Models 1 and 2. Each type of input was processed by a separate branch in the network and the outputs were then concatenated together and processed by the rest of the network. To speed up the training of this model, we transferred weight values from the original neural network models of layers sharing the same configuration. The code for this model is in Listing 3. For clarity, we have omitted the code for overriding the initial weights with transferred weights.

Our dataset was split into training, validation, and test sets at a ratio of 70/15/15. The split was performed by first dividing case patients into the aforementioned subsets. To ensure that each split has the same ratio of case to control patients, control patients who were associated with a case patient became members of the associated case patient’s subset split. Keeping the ratios consistent across each split was crucial when evaluating model performance.

All exam result values were scaled by subtracting the mean and then dividing by the standard deviation. The mean and standard deviation values were calculated from the training set. The prevalence of stroke diagnosis was 1 out of 8 X_diagnosis inputs and 1 out of 9 X_exam inputs.

All models were trained to minimize a customised categorical cross-entropy loss. Adam optimizer [23] with default parameters supplied by Keras was used. The random seed was fixed for all models. We used mini-batches of size 32 and trained for 500 epochs. Model 3 was trained for 100 epochs.

Our loss function enhanced the standard categorical cross-entropy loss by incorporating penalties for false positive and false negative predictions. Let Y⁺ be positive labels and Y⁻ be negative labels. The total number of false positives fp and false negatives fn can be stated in the terms of the zero-one loss

$$ fp(f) = \sum_{i \in Y^{-}} \ell_{01}(f, x_{i}) \qquad \qquad fn(f) = \sum_{i \in Y^{+}} \ell_{01}(f, x_{i}) $$

(9)

Then, a suitable surrogate for the zero-one loss can be used to place an upper bound on these quantities and also to make these quantities optimizable through stochastic gradient descent methods [24]. We used log loss as the surrogate since we wanted to incorporate these quantities into the standard categorical cross-entropy loss. These quantities satisfied the following inequalities

$$ \begin{aligned} fp^{u}(f) &\triangleq \sum_{i \in Y^{-}} \ell_{ll}(f, x_{i}) \geq fp(f) \\ fn^{u}(f) &\triangleq \sum_{i \in Y^{+}} \ell_{ll}(f, x_{i}) \geq fn(f) \end{aligned} $$

(10)

Where ℓ_ll(f,x)=−p(x) logf(x). These quantities were incorporated into the standard cross-entropy loss as

$$ \ell(f) = \alpha\left(\sum_{i \in Y} \ell_{ll}(f, x_{i}) \right) + \beta fp^{u}(f) + \gamma fn^{u}(f) $$

(11)

The terms α, β and γ were used to change the influence of each loss term. Our training hyperparameters used α=0.2,β=5 and γ=5.

Results

Model performance

We illustrate the performance of our models using the Receiver Operating Characteristic (ROC) curve which displays the trade-off between sensitivity and specificity at each threshold. ROC curves are commonly used in medical decision making [25]. Sensitivity is defined as the sum of true positive predictions divided by the total population of positive conditions. Specificity is defined as the sum of true negative predictions divided by the total population of negative conditions.

To assist in the comparison of our models, we present the area under the ROC curve (AUC) value, which is the probability of ranking a randomly chosen positive instance higher than a randomly chosen negative instance. A better-than-random model will have an AUC value greater than 0.5, while a perfect model will have an AUC value of 1.0. [25]

Finally, we also include the 95% confidence interval of the ROC curves calculated through bootstrapping and stratified sampling. All ROC curves presented in this paper were generated using pROC [26].

The ROC curves were computed for each model on the validation and test sets. Model 1’s ROC curves are displayed in Fig. 1. Model 2’s ROC curves are displayed in Fig. 2. Model 3’s ROC curves are displayed in Fig. 3. We summarize the performance of each model as being better-than-random, with Model 3 attaining the best overall predictive power when measured by AUC on both the validation and test sets.

Custom loss function

We conducted an ablation study to determine the efficacy of the custom loss function described in Eq. 11. For all models, we kept the same neural network architecture and training methodology but changed the loss function to the standard cross-entropy loss. Model 1’s ROC curves are displayed in Fig. 4. Model 2’s ROC curves are displayed in Fig. 5.

To evaluate Model 3, we tested two weight transfer configurations. The first configuration tested weights transferred from models trained with the custom loss function; the ROC curves are displayed in Fig. 6. The second configuration tested weights transferred from models trained with the standard loss function; the ROC curves are displayed in Fig. 7.

Exam importance

To determine which exams were considered to be important by our trained model, we conducted ablation studies using Model 3. For these studies, we combined the validation and test sets and calculated the AUC attained when an exam (comprised of one or more exam metrics) was excluded from the input to the model. We excluded the exam by setting the value of associated metrics to zero. The results are displayed in Table 1.

Table 1 Impact on predictive performance of exams on Model 3

Full size table