EHR phenotyping via jointly embedding medical concepts and words into a unified vector space
BMC Medical Informatics and Decision Making volume 18, Article number: 123 (2018)
Abstract
Background
There has been an increasing interest in learning low-dimensional vector representations of medical concepts from Electronic Health Records (EHRs). Vector representations of medical concepts facilitate exploratory analysis and predictive modeling of EHR data to gain insights about patterns of care and health outcomes. EHRs contain structured data, such as diagnostic codes and laboratory tests, as well as unstructured free-text data in the form of clinical notes, which provide more detail about the condition and treatment of patients.
Methods
In this work, we propose a method that jointly learns vector representations of medical concepts and words. This is achieved by a novel learning scheme based on the word2vec model. Our model learns these representations by integrating clinical notes with the sets of accompanying medical codes and by defining joint contexts for each observed word and medical code.
Results
In our experiments, we learned joint representations using MIMIC-III data. Using the learned representations of words and medical codes, we evaluated phenotypes for 6 diseases discovered by our method and a baseline method. The experimental results show that for each of the 6 diseases our method finds highly relevant words. We also show that our representations can be very useful when predicting the reason for the next visit.
Conclusions
The jointly learned representations of medical concepts and words capture not only similarity between codes or words themselves, but also similarity between codes and words. They can be used to extract phenotypes of different diseases. The representations learned by the joint model are also useful for construction of patient features.
Background
Electronic health record (EHR) systems are used by medical providers to streamline the workflow and enable sharing of patient data among providers. Beyond that primary purpose, EHR data have been used in healthcare research for exploratory and predictive analytics in problems such as risk prediction [1–3] and retrospective epidemiologic studies [4–6]. Important challenges in those studies include cohort identification [7, 8], which refers to finding a set of patients receiving a specific treatment or having a specific diagnosis, and patient phenotyping [9, 10], which refers to identifying conditions and treatments for given diseases from patients’ longitudinal records.
EHR data are heterogeneous collections of both structured and unstructured information. In order to store data in a structured way, several ontologies have been developed to describe diagnoses and treatments, among which the most popular coding system is the International Classification of Diseases (e.g., ICD-9, ICD-10). ICD codes provide alphanumeric encodings of patient conditions and treatments. On the other hand, unstructured clinical notes contain more nuanced information (e.g., the history of a patient's illness and medication), which creates challenges for designing effective algorithms that transform the data into meaningful representations that can be efficiently interpreted and used in health care applications. Various studies have sought to discover knowledge from free-text clinical notes. Wang et al. proposed a token-matching algorithm to map medical expressions in clinical notes into a structured medical terminology [11]. Pivovarov et al. developed a probabilistic graphical model to infer phenotypes described by medical codes, words, and other clinical observations [12]. Joshi et al. proposed a non-negative matrix factorization method to generate latent factors defined by clinical words [13].
The success of extracting knowledge from clinical notes often requires the application of Natural Language Processing (NLP) techniques. Learning distributed representations of words using models based on neural networks has been shown to be very useful in many NLP tasks. These models represent words as vectors and place the vectors of words that occur in similar contexts in a neighborhood of each other. The word2vec model of Mikolov et al. [14] is among the most popular of these models due to its simplicity and effectiveness in learning word representations from large amounts of data. Several studies applied word2vec to clinical notes to produce effective clinical word representations for various applications [15–21].
While word2vec was initially designed for handling text, recent studies demonstrate that it can also learn representations of other types of data, including medical codes from EHR data [21–25]. Choi et al. used word2vec to learn vector representations of medical codes from longitudinal medical records and showed that related codes indeed have similar vector representations [22]. Choi et al. designed a multilayer perceptron to learn representations of medical codes for predicting future clinical events and clinical risk groups [23]. Gligorijevic et al. used word2vec to phenotype sepsis patients [25], and Choi et al. fed code representations learned by word2vec into a recurrent neural network to predict heart failure [24]. The limitation of those studies is that they focused only on representations of medical codes and did not utilize other sources of information in EHR data. Henriksson et al. applied word2vec to learn vector representations of medical codes and words in clinical notes separately, and used both of them to predict adverse drug events [26, 27]. As they embed medical codes and words into two different spaces, their learned representations are not able to capture the relationship between words and codes, which is exploited in our proposed method.
In this paper, we propose the JointSkip-gram model: a novel joint learning scheme for the word2vec model which embeds both medical diagnosis codes and words from clinical notes in the same continuous vector space. The resulting representations capture not only the similarity between codes or between words themselves, but also the similarity between codes and words. We believe many clinical tasks can be viewed as measuring similarity between codes and words. For example, text-based phenotyping [12, 13] is the process of discovering the most representative words for diagnostic medical concepts. On the other hand, given a collection of words, such as clinical notes, the automatic code assignment task [11] aims to automatically assign diagnosis and procedure medical codes and thus reduce human coding effort. In this paper we illustrate that it is possible to obtain representations of words and codes in the same vector space and that the resulting representations are very informative. To achieve this objective, directly applying word2vec and related algorithms may not be appropriate, since codes and words are located in different parts of the EHR and have different forms and properties. Our proposed model is designed to tackle the heterogeneous nature of EHR data and build a connection between medical codes and words in clinical notes.
In our experiments, we examined whether our representations are able to discover meaningful text-based phenotypes for different medical concepts. We compared our proposed model with Labeled LDA (L-LDA) [28], a supervised counterpart of Latent Dirichlet Allocation (LDA) [29], which has been applied previously to clinical data analysis [30–32]. The results show that our representations indeed capture the relationship between words and codes. In comparison to our previous study [21], we also show that our method is able to identify common medicines and treatments for different diseases. We also constructed patient representations and tested their predictive power on the task of predicting a patient's diagnoses at the next visit given information from the current visit. The results show that representations learned by our approach outperform several baseline methods.
Methods
After formulating the problem setup, we review Skip-gram [14], the architecture from the word2vec toolkit designed for learning representations of natural-language words, which is also the basis of our method. We then explain the proposed JointSkip-gram model.
Basic problem setup
Let us assume we are given a collection of patient visits. Each visit S is a pair (D, N), where D is an unordered set of medical diagnosis codes {c_{1}, c_{2}, ..., c_{n}} summarizing the health condition of a patient and N is an ordered sequence of words (w_{1}, w_{2}, ..., w_{m}) from the clinical notes recorded during the visit. We denote the size of the code vocabulary C as |C| and the size of the word vocabulary W as |W|.
Preliminary: Skip-gram
Figure 1 summarizes the Skip-gram framework. Given a sequence of words (w_{1}, w_{2}, ..., w_{m}), Skip-gram scans it sequentially. For every scanned word w_{i}, called the target word, the log-likelihood of the words within its neighborhood (e.g., a window of a predefined size q) is calculated as

\(\sum_{-q \le j \le q,\, j \ne 0} \log p\left(w_{i+j} \mid w_{i}\right),\)     (1)

where p(w_{j} | w_{i}) is the conditional probability of seeing word w_{j} as context of target word w_{i}. It is defined as the softmax function

\(p\left(w_{j} \mid w_{i}\right) = \frac{e^{V_{w_{i}} \cdot U_{w_{j}}}}{\sum_{w_{k} \in W} e^{V_{w_{i}} \cdot U_{w_{k}}}},\)     (2)
where \(V_{w_{i}}\) is a T-dimensional vector providing the input representation of target word w_{i} and \(U_{w_{j}}\) is a T-dimensional vector providing the context representation of context word w_{j}. Skip-gram results in two matrices: the input word matrix \(V\in \mathbb {R}^{|W|\times T}\) and the context word matrix \(U\in \mathbb {R}^{|W|\times T}\). The obtained input word representation \(V_{w_{i}}\) is typically used as the word representation in downstream predictive or descriptive tasks.
To learn vector representations of words from the vocabulary, a stochastic gradient algorithm is used to maximize the objective function (1).
Maximizing (1) is computationally expensive since the denominator \(\sum _{w_{k} \in W} e^{{V_{w_{i}}}\cdot {U_{w_{k}}}}\) in (2) sums over all words w_{k}∈W. As a computationally efficient alternative to (1), Mikolov et al. proposed Skip-gram with negative sampling (SGNS) [14], which replaces log p(w_{j} | w_{i}) in (1) with the sum of two logarithmic probabilities as follows. For scanned word w_{i}, the objective function becomes

\(\sum_{-q \le j \le q,\, j \ne 0} \left[ \log p\left(w_{i}, w_{i+j}\right) + \sum_{w^{k} \in W_{neg}} \log\left(1 - p\left(w_{i}, w^{k}\right)\right) \right],\)     (3)

where the probability p(w_{i}, w_{j}) is defined as the sigmoid function \(\sigma \left (V_{w_{i}}\cdot U_{w_{j}}\right)\):

\(p\left(w_{i}, w_{j}\right) = \sigma\left(V_{w_{i}} \cdot U_{w_{j}}\right) = \frac{1}{1 + e^{-V_{w_{i}} \cdot U_{w_{j}}}},\)     (4)

and \(W_{neg} = \{w^{k} \sim P_{w} \mid k = 1, \ldots, K\}\) is the set of so-called “negative words” that are sampled from the marginal distribution P_{w} of words. K is a hyperparameter determining the number of negative words generated with each context word. The assumption is that words sampled from the marginal distribution are less likely to co-occur as context of the target word. The first term of (3) is the probability that two words occur as target and context in the data set, while the second term of (3) is the probability that a target word and the “negative words” in W_{neg} do not co-occur in the dataset. By maximizing (3), the dot product between frequently co-occurring words becomes large while the dot product between rarely co-occurring words becomes small. In other words, in the resulting T-dimensional vector space, related words are placed in the vicinity of each other, such that their cosine similarity is high.
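To make the SGNS updates concrete, the following NumPy sketch performs the gradient steps implied by objective (3) for one (target, context) pair with sampled negatives. The function name, learning rate, and initialization are illustrative, not the settings used in this paper.

```python
import numpy as np

def sgns_step(V, U, i, j, neg_ids, lr=0.025):
    """One SGNS gradient step for target word i and context word j.
    V, U: input and context embedding matrices (vocab_size x T).
    neg_ids: indices of K sampled "negative" words."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    v = V[i].copy()                       # target vector before this step
    # Positive pair: push sigma(v . U[j]) toward 1.
    g = 1.0 - sigmoid(v @ U[j])           # gradient coefficient of log sigma
    V[i] += lr * g * U[j]
    U[j] += lr * g * v
    # Negative samples: push sigma(v . U[k]) toward 0.
    for k in neg_ids:
        g = -sigmoid(v @ U[k])            # gradient of log(1 - sigma)
        V[i] += lr * g * U[k]
        U[k] += lr * g * v

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(100, 20))
U = rng.normal(scale=0.1, size=(100, 20))
before = V[3] @ U[7]
for _ in range(50):
    sgns_step(V, U, 3, 7, neg_ids=[11, 42, 90])
assert V[3] @ U[7] > before  # co-occurring pair's dot product grows
```

Repeating this step over all (target, context) pairs in a corpus is exactly the stochastic maximization of (3) described above.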
Proposed model: JointSkip-gram
In the Skip-gram model, each scanned word is used to predict the probability of its neighboring words in the sequence. However, in electronic health records each visit consists of clinical notes, which are ordered sequences of words, and medical codes, which are sets. We are interested in jointly learning vector representations of words and codes in the same vector space. Both medical codes and clinical notes describe the condition and treatment of a patient, and they are closely related. For example, if a patient is assigned ICD-9 code “174” (female breast neoplasm), the corresponding clinical notes are likely to mention surgery (e.g., mastectomy or lumpectomy). To derive JointSkip-gram, we first need to define the context of each word and each code.
Since the codes are unordered, we define the context of target code c_{i} as all other codes in the same visit, as well as all words in the clinical note. Thus, as shown in Fig. 2a, in JointSkip-gram every scanned code c_{i} is used to predict the other codes in D and all words in N. The log-likelihood of code c_{i} can be expressed as

\(\sum_{c_{j} \in D,\, j \ne i} \log p\left(c_{j} \mid c_{i}\right) + \sum_{w_{j} \in N} \log p\left(w_{j} \mid c_{i}\right).\)     (5)

Similarly to Skip-gram, the probabilities p(c_{j} | c_{i}) and p(w_{j} | c_{i}) are defined as softmax functions

\(p\left(c_{j} \mid c_{i}\right) = \frac{e^{V_{c_{i}} \cdot U_{c_{j}}}}{\sum_{c_{k} \in C} e^{V_{c_{i}} \cdot U_{c_{k}}}}\)     (6)

and

\(p\left(w_{j} \mid c_{i}\right) = \frac{e^{V_{c_{i}} \cdot U_{w_{j}}}}{\sum_{w_{k} \in W} e^{V_{c_{i}} \cdot U_{w_{k}}}}.\)     (7)
For words in clinical notes we define two types of contexts. One consists of the neighboring words in the note; the other consists of all codes in the medical code set. Thus, as shown in Fig. 2b, for scanned word w_{i} in N, JointSkip-gram uses the words within a window of a predefined size q as its context words, as well as all codes in D as its context codes. The resulting log-likelihood of word w_{i} can be expressed as

\(\sum_{-q \le j \le q,\, j \ne 0} \log p\left(w_{i+j} \mid w_{i}\right) + \sum_{c_{j} \in D} \log p\left(c_{j} \mid w_{i}\right),\)     (8)

in which

\(p\left(w_{j} \mid w_{i}\right) = \frac{e^{V_{w_{i}} \cdot U_{w_{j}}}}{\sum_{w_{k} \in W} e^{V_{w_{i}} \cdot U_{w_{k}}}}\)     (9)

and

\(p\left(c_{j} \mid w_{i}\right) = \frac{e^{V_{w_{i}} \cdot U_{c_{j}}}}{\sum_{c_{k} \in C} e^{V_{w_{i}} \cdot U_{c_{k}}}}.\)     (10)
Maximizing the sum of the objective functions (5) and (8) over the whole data set of visits is computationally expensive, since the denominators in (6), (7), (9) and (10) sum over all words in W and all codes in C. Similarly to SGNS [14], we use a computationally cheaper algorithm that relies on negative sampling. Instead of calculating the softmax function, the negative sampling approach uses the computationally inexpensive sigmoid function to represent the probability that a word or a code is within the context of a target word or code. For each scanned code c_{i}, the negative sampling objective function becomes

\(\sum_{c_{j} \in D,\, j \ne i} \left[ \log p\left(c_{i}, c_{j}\right) + \sum_{c^{k} \in C_{neg}} \log\left(1 - p\left(c_{i}, c^{k}\right)\right) \right] + \sum_{w_{j} \in N} \left[ \log p\left(c_{i}, w_{j}\right) + \sum_{w^{k} \in W_{neg}} \log\left(1 - p\left(c_{i}, w^{k}\right)\right) \right],\)     (11)

where

\(p\left(c_{i}, c_{j}\right) = \sigma\left(V_{c_{i}} \cdot U_{c_{j}}\right)\)     (12)

and

\(p\left(c_{i}, w_{j}\right) = \sigma\left(V_{c_{i}} \cdot U_{w_{j}}\right).\)     (13)

Here, \(C_{neg} = \{c^{k} \sim P_{c} \mid k = 1, \ldots, K\}\) is the set of “negative codes” that are sampled from the marginal distribution P_{c} of codes and \(W_{neg} = \{w^{k} \sim P_{w} \mid k = 1, \ldots, K\}\) is the set of negative words that are sampled from the marginal distribution P_{w} of words, where K is the number of negative samples.
Similarly, for each scanned word w_{i}, the negative sampling objective function becomes

\(\sum_{-q \le j \le q,\, j \ne 0} \left[ \log p\left(w_{i}, w_{i+j}\right) + \sum_{w^{k} \in W_{neg}} \log\left(1 - p\left(w_{i}, w^{k}\right)\right) \right] + \sum_{c_{j} \in D} \left[ \log p\left(w_{i}, c_{j}\right) + \sum_{c^{k} \in C_{neg}} \log\left(1 - p\left(w_{i}, c^{k}\right)\right) \right],\)     (14)

where

\(p\left(w_{i}, w_{j}\right) = \sigma\left(V_{w_{i}} \cdot U_{w_{j}}\right)\)     (15)

and

\(p\left(w_{i}, c_{j}\right) = \sigma\left(V_{w_{i}} \cdot U_{c_{j}}\right).\)     (16)

C_{neg} and W_{neg} are the same as in (11). By maximizing (14), the probabilities p(w_{i}, w_{j}) and p(w_{i}, c_{j}) of related words and codes become large.
Similarly to Skip-gram, a stochastic gradient descent algorithm is applied in JointSkip-gram to learn vector representations of codes and words that maximize (11) and (14). The input vector representation matrix V is used as the resulting representation of words and codes. Since we jointly learn vector representations of codes and words, the matrices \(V\in \mathbb {R}^{\left (|W|+|C|\right)\times T}\) and \(U\in \mathbb {R}^{\left (|W|+|C|\right)\times T}\) include representations of both words and codes. In the resulting vector space, the similarity of two vectors is measured using cosine similarity. The vectors of similar codes or words should be close to each other. Since JointSkip-gram represents codes and words in the same vector space, the words related to a given medical code should be placed in its vicinity.
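To make the two context definitions concrete, the sketch below enumerates the (target, context) training pairs for one visit (D, N): each code predicts all other codes and every note word, and each word predicts its windowed neighbors and all codes. The function name and toy visit are hypothetical.

```python
def joint_pairs(codes, words, q=2):
    """Enumerate (target, context) training pairs for one visit,
    following the JointSkip-gram context definitions.
    codes: unordered set of diagnosis codes; words: ordered note tokens."""
    pairs = []
    # Each scanned code predicts all other codes and every word in the note.
    for c in codes:
        pairs += [(c, other) for other in codes if other != c]
        pairs += [(c, w) for w in words]
    # Each scanned word predicts words within a window of size q and all codes.
    for i, w in enumerate(words):
        lo, hi = max(0, i - q), min(len(words), i + q + 1)
        pairs += [(w, words[j]) for j in range(lo, hi) if j != i]
        pairs += [(w, c) for c in codes]
    return pairs

visit_codes = {"570", "174"}              # hypothetical diagnosis codes
visit_words = ["liver", "failure", "noted"]
pairs = joint_pairs(visit_codes, visit_words, q=1)
assert ("570", "liver") in pairs and ("liver", "570") in pairs
```

Each emitted pair would then be trained with the negative-sampling step of (11) or (14), depending on whether the target is a code or a word.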
Results
Dataset description
MIMIC-III Dataset: The MIMIC-III Critical Care Database [33] is a publicly available database which contains de-identified health records of 46,518 patients who stayed in the Beth Israel Deaconess Medical Center's intensive care units between 2001 and 2012. Each visit in the dataset contains both structured health record data and free-text clinical notes.
We used EHR data from all patients in the dataset. The total number of patient visits in MIMIC-III is 58,597. On average, each patient had 1.26 visits; 38,991 patients had a single visit, 5151 had two visits, and 2376 patients had 3 or more visits. The average number of recorded ICD-9 diagnosis codes per visit is 11 and the average number of words in the clinical notes per visit is 7898. For each patient visit, we extracted all diagnosis codes and all clinical notes.
Preprocessing: For each EHR in the dataset, we focus only on the clinical notes and ICD-9 diagnosis codes. Each clinical note was preprocessed in the following way. All digits and stop words were removed. Typos were filtered out using a standard English vocabulary in PyEnchant, a Python library for spell checking. For representation learning, rare words were filtered out, since they do not appear often enough to obtain good-quality representations. Therefore, all words whose frequency is less than 50 were removed. The resulting number of unique words was 14,302. Furthermore, the total number of unique ICD-9 diagnosis codes in MIMIC-III is 6984. Codes whose frequency is less than 5 were removed, which reduced the number of codes to 3874. Since some codes were still too rare for learning meaningful representations, we exploited the hierarchical tree structure of ICD-9 codes and grouped them by their first three digits. For example, ICD-9 codes “2901” (presenile dementia), “2902” (senile dementia with delusional or depressive features) and “2903” (senile dementia with delirium) were grouped into a single code “290” (dementias). The size of the final code vocabulary was 752.
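A minimal sketch of this preprocessing is shown below. PyEnchant spell checking is omitted, the stop-word set is an illustrative subset rather than a real stop-word list, and the small frequency threshold in the demo call exists only so the toy example keeps some tokens.

```python
from collections import Counter
import re

def preprocess_notes(notes, min_word_freq=50):
    """Tokenize notes, drop digits and stop words, and remove rare words,
    mirroring the preprocessing described above (spell checking omitted)."""
    stop_words = {"the", "a", "an", "of", "and", "to", "in", "was"}  # illustrative subset
    tokenized = [[t for t in re.findall(r"[a-z]+", note.lower()) if t not in stop_words]
                 for note in notes]
    freq = Counter(t for note in tokenized for t in note)
    return [[t for t in note if freq[t] >= min_word_freq] for note in tokenized]

def group_icd9(code):
    """Group an ICD-9 code by its first three digits, e.g. '2901' -> '290'."""
    return code[:3]

notes = ["Pt with acute liver failure, s/p transplant in 2012."]
cleaned = preprocess_notes(notes, min_word_freq=1)
assert "liver" in cleaned[0] and "2012" not in cleaned[0]
assert group_icd9("2901") == "290"
```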
Training and Test Patients: We randomly split the patients into training and test sets. All 38,991 patients with a single visit were placed in the training set. Of the 7527 patients with 2 or more visits, we randomly assigned 80% (6015 patients) to the training set and 20% (1512 patients) to the test set. The whole training set was used for learning vector representations. Patients with only a single visit were excluded from the next-visit prediction task because that task requires patients to have at least two visits.
Training the JointSkip-gram model
The EHRs of patients from the training set were used to learn our JointSkip-gram model. For each visit we created a (D, N) pair; there were 54,965 such pairs in the training data. The size T of the vectors representing codes and words was set to 200. The stochastic gradient algorithm with negative sampling maximizing (11) and (14) was set to loop through all the training data 40 times, because we empirically observed that this was sufficient for the algorithm to converge. The number of negative samples was set to 5 and the size of the window for word context in the clinical notes was set to 5. As a result, each of the 14,302 words and 752 ICD-9 codes was represented as a 200-dimensional vector in a joint vector space. Before applying the JointSkip-gram model, we used a small fraction (∼10%) of the clinical notes to pretrain the vector representations of words only, as we observed that this improves the final representations.
To evaluate the quality of the vector representations, we performed two types of experiments: (1) phenotype and treatment discovery, by evaluating associations between codes and words in the vector space, and (2) testing the predictive power of the vector representations on the task of predicting the medical codes of the next visit.
Phenotype discovery
Text-based phenotype discovery can be viewed as finding words representative of medical codes. For a given ICD-9 diagnosis code, we retrieved its 15 nearest words in the vector space. If successful, the neighboring words should be clinically relevant to the ICD-9 code.
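The nearest-word retrieval described above can be sketched as a cosine-similarity lookup in the joint space. The dictionary layout and toy vectors below are hypothetical; the point is only that codes and words share one space, so a code can be queried against word vectors directly.

```python
import numpy as np

def nearest_words(code, vectors, word_vocab, k=15):
    """Return the k words closest (by cosine similarity) to a code's
    vector in the joint embedding space. `vectors` maps every word and
    code to its learned vector."""
    c = vectors[code]
    c = c / np.linalg.norm(c)
    sims = {w: float(vectors[w] @ c / np.linalg.norm(vectors[w]))
            for w in word_vocab}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Toy joint space: "liver" is aligned with code "570", "tumor" is not.
vecs = {"570":   np.array([1.0, 0.0]),
        "liver": np.array([0.9, 0.1]),
        "tumor": np.array([0.0, 1.0])}
assert nearest_words("570", vecs, ["liver", "tumor"], k=1) == ["liver"]
```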
As an alternative to JointSkip-gram, we used labeled latent Dirichlet allocation (L-LDA) [28], a supervised version of LDA [29]. In L-LDA, there is a one-to-one correspondence between topics and labels. L-LDA assumes there are multiple labels associated with each document and assigns each word a probability that it corresponds to each label. L-LDA can be naturally adapted to our case by treating medical codes as labels and clinical notes as documents. For a given ICD-9 diagnosis code, we retrieved the 15 words with the highest probabilities and compared them with the 15 words obtained by JointSkip-gram.
We consulted domain experts about the quality of the extracted phenotypes. First, we selected 6 diverse ICD-9 codes from MIMIC-III that cover both acute and chronic diseases and both common and less common conditions. The 6 ICD-9 codes are listed in Table 1, together with their descriptions and frequencies in the training set. Table 1 also shows the lists of the 15 closest words to the 6 ICD-9 codes produced by both methods. For each ICD-9 diagnosis code, we presented the two lists in a random order to a medical expert and asked two questions: (1) which list better represents the diagnosis code, and (2) which words in each list are not highly related to the given diagnosis code. We recruited four physicians from the Fox Chase Cancer Center as medical experts for the evaluation.
The evaluation results are summarized in Table 2. As can be seen, all 4 experts agreed that the JointSkip-gram words better represent ICD-9 codes 570, 348, and 311. For the remaining 3 codes (174, 295, 042), the experts were split, but in no case did the majority prefer the L-LDA words. Judging by the average number of words deemed unrelated by the experts, JointSkip-gram was superior to L-LDA for all 6 ICD-9 diagnosis codes.
For ICD-9 code “570” (acute liver failure), JointSkip-gram finds “liver”, “hepatic”, and “cirrhosis”, which are directly related to acute liver failure. The remaining words in the JointSkip-gram list are mostly indirectly related to liver failure, such as “alcoholic”, which points to one of the primary causes of liver damage. On the other hand, L-LDA captured only a few related words, as evidenced by an average of 9.25 words that experts found unrelated. Among those unrelated words we find “cooling”, “sun”, “arctic”, “rewarmed”, “cooled”, “rewarming”, “coded”, “continue”, and “prognosis”.
For ICD-9 codes “174” (female breast cancer), “295” (schizophrenic disorders) and “042” (HIV), both JointSkip-gram and L-LDA find highly related words. One of our experts commented that several words found by JointSkip-gram are diseases which are likely to co-occur with the given disease. For example, JointSkip-gram finds “melanoma” for female breast cancer and “herpes”, “chlamydia”, and “syphilis” for HIV. This suggests that JointSkip-gram captures hidden relationships between diseases, which could make it suitable for understanding comorbidities.
For code “311” (depressive disorder), both JointSkip-gram and L-LDA had difficulty finding related words. According to feedback from one of our experts, “abuse”, “hallucinations”, “alcohol”, “overdose”, “depression” and “thiamine” (note: depression is a common symptom of thiamine deficiency) found by JointSkip-gram are related to the disease, while only “depression”, “tablet”, and “capsule” found by L-LDA are recognizably related to depression. We hypothesize that for common diseases (e.g., depression and hypertension), which are rarely the primary diagnosis or a major factor in deciding the appropriate treatment of the main condition, physicians rarely discuss them in clinical notes. Thus, it is difficult for any algorithm to discover words related to such diagnoses from clinical notes.
Treatment discovery
In our preliminary study [21], we used the PyEnchant standard English vocabulary to filter out typos in clinical notes. However, there are many non-standard English terms used in medical notes to describe medical treatments, medicines, and diagnoses. These non-standard words are not part of the PyEnchant standard English vocabulary we used for preprocessing, but they can carry important meaning. Hence, we repeated our experiments including all words occurring more than 50 times. The resulting vocabulary increased to 33,336 unique words.
After running our JointSkip-gram model on the new dataset, we looked at the representative words for each diagnosis code. Tables 3 and 4 show the 15 nearest clinical-note words in the vector space to ICD-9 codes “570” and “174”, respectively. We can observe that many retrieved words differ from those in Table 1 for codes “570” and “174”. The words that also appear in Table 1 are marked in italic font in Tables 3 and 4.
A close look into Tables 3 and 4 reveals that most neighbors are specific medical terminology words describing drugs or treatments related to the diagnosis. For example, the words “crrt”, “levophed”, “rifaximin”, and “transplant” in Table 3 are related to the treatment of acute liver failure. Similarly, the words “xeloda”, “tamoxifen”, “carboplatin”, “taxol”, and “compazine” in Table 4 are related to cancer treatment. Therefore, including non-standard words in our vocabulary enabled us to connect specialized medical terms with particular ICD-9 diagnosis codes.
Predictive evaluation
In another group of experiments, we constructed patient representations and evaluated the quality of the vector representations of words and medical codes through predictive modeling. We adopted the evaluation approach used in [34], which predicts the medical codes of the next visit given information from the current visit. Specifically, given two consecutive visits of a patient, we used information from the first visit (i.e., medical codes and clinical notes) to predict the medical codes assigned during the second visit. In previous work on this topic, the authors of [23, 34, 35] used medical codes as features for prediction. In our evaluation, we used both medical codes and clinical notes to create predictive features. To generate a feature vector for the first visit, we computed the average JointSkip-gram vector representation of the diagnosis codes and the average JointSkip-gram vector representation of the words used in the clinical notes, and concatenated those two averaged vectors. We call this method Concatenation-JointSG and compare it with the following five baselines:
Concatenation-One: The one-hot vector of medical codes and the one-hot vector of clinical notes for a given visit were concatenated. In the one-hot vectors of each visit, words and codes which occur in the visit were encoded as 1; otherwise they were encoded as 0.
SVD: Singular value decomposition (SVD) was applied to the Concatenation-One representations to generate dense representations of visits.
LDA: Using latent Dirichlet allocation (LDA) [29], each document was represented as a topic-probability vector, which was used as the visit representation. To apply LDA, for each visit we created a document consisting of the concatenation of the list of medical diagnosis codes and the clinical notes. We note that L-LDA is not suitable for this task since its topics only contain words.
Codes-JointSG: To evaluate the predictive power of medical codes, we created features for a visit as the average JointSkip-gram vector representation of the diagnosis codes.
Words-JointSG: To evaluate the predictive power of clinical notes, we created features for a visit as the average JointSkip-gram vector representation of the words in clinical notes.
To compare the vector representations obtained by JointSkip-gram and Skip-gram, we also trained Skip-gram on clinical notes and on medical codes separately. The resulting vector representations are not in the same vector space. We used the Skip-gram representations to construct 3 more groups of features:
Codes-SG: The features for a visit were the average Skip-gram vector representation of the diagnosis codes.
Words-SG: The features for a visit were the average Skip-gram vector representation of the words in clinical notes.
Concatenation-SG: We concatenated the features from Codes-SG and Words-SG.
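The visit-feature construction used by the averaging baselines can be sketched as follows: average the code vectors, average the word vectors, and concatenate. The function name, the skip-missing-tokens behavior, and the toy vectors are assumptions of this sketch, not details stated in the paper.

```python
import numpy as np

def visit_features(codes, words, vectors, dim=200):
    """Build a concatenated feature vector for one visit: the average
    code vector followed by the average word vector. Tokens missing from
    `vectors` are skipped (an assumption of this sketch)."""
    def avg(tokens):
        vs = [vectors[t] for t in tokens if t in vectors]
        return np.mean(vs, axis=0) if vs else np.zeros(dim)
    return np.concatenate([avg(codes), avg(words)])

vecs = {"570": np.ones(200), "liver": np.full(200, 2.0)}
x = visit_features(["570"], ["liver", "unknownword"], vecs)
assert x.shape == (400,)           # 200-d code part + 200-d word part
assert x[0] == 1.0 and x[200] == 2.0
```

Dropping either half of the concatenation yields the codes-only and words-only feature variants.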
Given a set of features describing the first visit, we used a softmax model to predict the medical codes of the second visit. Let us assume the feature vector of the first visit is x_{t} of dimensionality d, the size of the code vocabulary is |C|, and \(Z\in \mathbb {R}^{|C| \times d}\) is the weight matrix of the softmax function. The probability that the next visit y_{t+1} contains medical code c_{i} is calculated as

\(p\left(c_{i} \in y_{t+1} \mid x_{t}\right) = \frac{e^{Z_{c_{i}} \cdot x_{t}}}{\sum_{c_{k} \in C} e^{Z_{c_{k}} \cdot x_{t}}}.\)     (17)
We use Top-k recall [34] to measure the predictive performance, because it mimics the behavior of doctors who list the most probable diagnoses upon observing a patient. For each visit, the softmax model recommends the k codes with the highest probabilities, and Top-k recall is calculated as

\(\text{Top-}k\ \text{recall} = \frac{\left|\{k\ \text{recommended codes}\} \cap \{\text{codes of the next visit}\}\right|}{\left|\{\text{codes of the next visit}\}\right|}.\)
In the experiments, we tested Top-k recall for k=20, k=30, and k=40.
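For a single test visit, Top-k recall can be computed as in the sketch below; the code vocabulary and predicted scores are toy values, not MIMIC-III outputs.

```python
import numpy as np

def top_k_recall(scores, true_codes, code_vocab, k=20):
    """Top-k recall for one visit: the fraction of the true next-visit
    codes that appear among the k codes with the highest predicted scores."""
    order = np.argsort(scores)[::-1][:k]          # indices of top-k scores
    predicted = {code_vocab[i] for i in order}
    return len(predicted & set(true_codes)) / len(true_codes)

vocab = ["038", "174", "290", "570"]
scores = np.array([0.1, 0.5, 0.05, 0.35])         # hypothetical softmax outputs
# Top-2 predictions are {"174", "570"}; one of the two true codes is hit.
assert top_k_recall(scores, ["174", "290"], vocab, k=2) == 0.5
```

Averaging this quantity over all test visits gives the reported Top-k recall.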
Training details: To create features for all proposed models (Skip-gram, JointSkip-gram, LDA, SVD), we used the training set. To train the Skip-gram model, we used 40 iterations, 5 negative samples, and a window size of 5 (the same as for JointSkip-gram). For SVD and LDA, we set the maximum number of iterations to 1000 to guarantee convergence. For JointSkip-gram, Skip-gram, SVD and LDA, we set the dimensionality of the feature vectors to 200.
To train the softmax model, we created the labeled set using only patients with 2 or more visits. We sorted all visits of each such patient by admission time. Given two consecutive visits, we used the former to create the features and the latter to create the labels. As a result, the labeled set used to train the softmax model had 9955 labeled examples and the test set had 2489 labeled examples. The softmax model was trained for 100 epochs using a stochastic gradient algorithm to minimize the categorical cross-entropy loss.
Table 5 shows the performance of the softmax models that use different sets of features. The model using Concatenation-JointSG features outperformed all baselines on all three Top-k measures.
Discussion
Predictive evaluation analysis
The results in Table 5 not only show the advantage of our model, but also demonstrate that both medical codes and clinical notes contributed to the prediction of the future visit, since the concatenation of word and code representations (Concatenation-JointSG) outperformed both Codes-JointSG and Words-JointSG. While Codes-JointSG achieved considerably high recall, Words-JointSG performed relatively worse. The lower accuracy of Words-JointSG likely indicates that averaging word vectors might not be the best strategy for exploiting clinical-note information. A future direction could be to use a neural network (NN), such as a convolutional NN or a recurrent NN, to better capture the information contained in clinical notes.
Figure 3 shows a comparison between JointSkip-gram and Skip-gram features. From the figure, we can observe that features generated by JointSkip-gram outperformed those generated by Skip-gram. While the difference between Words-JointSG and Words-SG was not large, Codes-JointSG and Concatenation-JointSG significantly outperformed Codes-SG and Concatenation-SG, respectively. This strongly indicates that JointSkip-gram not only captures the relationship between medical codes and words, but also learns improved word and code representations.
Limitations and future work
One limitation of our work is that in the preprocessing step we removed words whose frequency is less than 50 and codes whose frequency is less than 5. We also grouped all codes by their first three digits, because rare codes do not occur often enough to learn meaningful representations. One way to use rare tokens is to exploit domain knowledge, such as subword information or the hierarchical tree structure of medical codes.
Future work should consider applying the joint representations to a broader range of tasks, such as cohort identification and automatic code assignment. It would also be interesting to explore more advanced prediction models, such as deep neural networks.
Conclusions
In this paper, we proposed the JointSkip-gram algorithm to jointly learn representations of words from clinical notes and diagnosis codes in EHRs. JointSkip-gram exploits the relationship between diagnosis codes and clinical notes from the same visit and represents them in the same vector space. The experimental results demonstrate that the resulting code and word representations can be used to discover meaningful disease phenotypes. They also indicate that the representations learned by the joint model are useful for constructing patient features.
Abbreviations
EHR: Electronic health record
ICD: International Classification of Diseases
LDA: Latent Dirichlet allocation
L-LDA: Labeled latent Dirichlet allocation
NLP: Natural language processing
NN: Neural network
SVD: Singular value decomposition
References
1. Yan Y, Birman-Deych E, Radford MJ, Nilasena DS, Gage BF. Comorbidity indices to predict mortality from Medicare data: results from the national registry of atrial fibrillation. Med Care. 2005; 43:1073–7.
2. Krumholz HM, Wang Y, Mattera JA, Wang Y, Han LF, Ingber MJ, Roman S, Normand SLT. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with heart failure. Circulation. 2006; 113(13):1693–701.
3. Klabunde CN, Potosky AL, Legler JM, Warren JL. Development of a comorbidity index using physician claims data. J Clin Epidemiol. 2000; 53(12):1258–67.
4. Levitan N, Dowlati A, Remick S, Tahsildar H, Sivinski L, Beyth R, Rimm A. Rates of initial and recurrent thromboembolic disease among patients with malignancy versus those without malignancy: risk analysis using Medicare claims data. Medicine (Baltimore). 1999; 78(5):285–91.
5. Taylor Jr DH, Østbye T, Langa KM, Weir D, Plassman BL. The accuracy of Medicare claims as an epidemiological tool: the case of dementia revisited. J Alzheimers Dis. 2009; 17(4):807–15.
6. Schneeweiss S, Seeger JD, Maclure M, Wang PS, Avorn J, Glynn RJ. Performance of comorbidity scores to control for confounding in epidemiologic studies using claims data. Am J Epidemiol. 2001; 154(9):854–64.
7. Nattinger AB, Laud PW, Bajorunaite R, Sparapani RA, Freeman JL. An algorithm for the use of Medicare claims data to identify women with incident breast cancer. Health Serv Res. 2004; 39(6p1):1733–50.
8. Winkelmayer WC, Schneeweiss S, Mogun H, Patrick AR, Avorn J, Solomon DH. Identification of individuals with CKD from Medicare claims data: a validation study. Am J Kidney Dis. 2005; 46(2):225–32.
9. Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population. Med Care. 2002; 40:3–18.
10. Halpern Y, Horng S, Choi Y, Sontag D. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Inform Assoc. 2016; 23(4):731–40.
11. Wang Y, Patrick J. Mapping clinical notes to medical terminology at point of care. In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing. Stroudsburg: Association for Computational Linguistics: 2008. p. 102–3.
12. Pivovarov R, Perotte AJ, Grave E, Angiolillo J, Wiggins CH, Elhadad N. Learning probabilistic phenotypes from heterogeneous EHR data. J Biomed Inform. 2015; 58:156–65.
13. Joshi S, Gunasekar S, Sontag D, Ghosh J. Identifiable phenotyping using constrained non-negative matrix factorization; 2016, pp. 17–41. arXiv preprint arXiv:1608.00704.
14. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. 2013. p. 3111–9.
15. Moen H, Ginter F, Marsi E, Peltonen LM, Salakoski T, Salanterä S. Care episode retrieval: distributional semantic models for information retrieval in the clinical domain. In: BMC Medical Informatics and Decision Making, vol. 15. BioMed Central: 2015. p. 2. https://doi.org/10.1186/1472-6947-15-S2-S2.
16. Wu Y, Xu J, Jiang M, Zhang Y, Xu H. A study of neural word embeddings for named entity recognition in clinical text. In: AMIA Annual Symposium Proceedings, vol. 2015. American Medical Informatics Association: 2015. p. 1326.
17. De Vine L, Zuccon G, Koopman B, Sitbon L, Bruza P. Medical semantic similarity with a neural language model. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. New York: ACM: 2014. p. 1819–22.
18. Amunategui M, Markwell T, Rozenfeld Y. Prediction using note text: synthetic feature creation with word2vec; 2015. arXiv preprint arXiv:1503.05123.
19. Ghassemi MM, Mark RG, Nemati S. A visualization of evolving clinical sentiment using vector representations of clinical notes. In: Computing in Cardiology Conference (CinC), 2015. IEEE: 2015. p. 629–32. http://doi.org/10.1109/CIC.2015.7410989.
20. Henriksson A. Representing clinical notes for adverse drug event detection. In: Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis. 2015. p. 152–8.
21. Bai T, Chanda AK, Egleston BL, Vucetic S. Joint learning of representations of medical concepts and words from EHR data. In: Bioinformatics and Biomedicine (BIBM), 2017 IEEE International Conference On. IEEE: 2017. p. 764–9. http://doi.org/10.1109/BIBM.2017.8217752.
22. Choi Y, Chiu CYI, Sontag D. Learning low-dimensional representations of medical concepts. AMIA Summits Transl Sci Proc. 2016; 2016:41.
23. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J. Multi-layer representation learning for medical concepts. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2016. p. 1495–504.
24. Choi E, Schuetz A, Stewart WF, Sun J. Using recurrent neural network models for early detection of heart failure onset. J Am Med Inform Assoc. 2016; 24(2):361–70.
25. Stojanovic J, Gligorijevic D, Radosavljevic V, Djuric N, Grbovic M, Obradovic Z. Modeling healthcare quality via compact representations of electronic health records. IEEE/ACM Trans Comput Biol Bioinform (TCBB). 2017; 14(3):545–54.
26. Henriksson A, Zhao J, Boström H, Dalianis H. Modeling electronic health records in ensembles of semantic spaces for adverse drug event detection. In: Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference On. IEEE: 2015. p. 343–50. https://doi.org/10.1109/BIBM.2015.7359705.
27. Henriksson A, Zhao J, Dalianis H, Boström H. Ensembles of randomized trees using diverse distributed representations of clinical events. BMC Med Inform Decis Mak. 2016; 16(2):69.
28. Ramage D, Hall D, Nallapati R, Manning CD. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Stroudsburg: Association for Computational Linguistics: 2009. p. 248–56.
29. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003; 3(Jan):993–1022.
30. Chan KR, Lou X, Karaletsos T, Crosbie C, Gardos S, Artz D, Ratsch G. An empirical analysis of topic modeling for mining cancer clinical notes. In: Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference On. IEEE: 2013. p. 56–63. https://doi.org/10.1109/ICDMW.2013.91.
31. Arnold CW, El-Saden SM, Bui AA, Taira R. Clinical case-based retrieval using latent topic analysis. In: AMIA Annual Symposium Proceedings, vol. 2010. American Medical Informatics Association: 2010. p. 26.
32. Ghassemi M, Naumann T, Doshi-Velez F, Brimmer N, Joshi R, Rumshisky A, Szolovits P. Unfolding physiological state: mortality modelling in intensive care units. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2014. p. 75–84.
33. Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Sci Data. 2016; 3:160035.
34. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor AI: predicting clinical events via recurrent neural networks. In: Machine Learning for Healthcare Conference: 2016. p. 301–18.
35. Esteban C, Staeck O, Baier S, Yang Y, Tresp V. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In: Healthcare Informatics (ICHI), 2016 IEEE International Conference On. IEEE: 2016. p. 93–101.
Acknowledgements
The authors would like to thank the National Institutes of Health for funding our research.
Funding
This work was supported by the National Institutes of Health grants R21CA202130 and P30CA006927. Publication costs were also funded by the National Institutes of Health grants R21CA202130 and P30CA006927.
Availability of data and materials
The data (the MIMIC-III dataset) used in our experiments can be obtained at https://mimic.physionet.org/. Researchers seeking to use the database must formally request access following the steps on the website.
About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 18 Supplement 4, 2018: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2017: medical informatics and decision making. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume18supplement4.
Author information
Contributions
TB and SV conceived the study and developed the algorithm. TB wrote the first draft of the manuscript. AKC conducted experiments for discovering treatment procedures. All authors participated in the preparation of the manuscript and approved the final version.
Corresponding author
Correspondence to Slobodan Vucetic.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Bai, T., Chanda, A., Egleston, B. et al. EHR phenotyping via jointly embedding medical concepts and words into a unified vector space. BMC Med Inform Decis Mak 18, 123 (2018). doi:10.1186/s12911-018-0672-0
Keywords
 Electronic health records
 Distributed representation
 Natural language processing
 Healthcare