We extracted abstracts related to “infarction” from a database of Japanese medical documents and used them as a corpus to obtain distributed word representations with Word2Vec, FastText, and GloVe. The distributed representations thus obtained allowed us to measure inter-disease distances, which indicate the degree of similarity among diseases. Our examination of multiple metrics and methods revealed that, when using Word2Vec, the combination of the Euclidean metric and the centroid method was optimal for assessing internal validity, while the combination of cosine distance and the complete linkage method was optimal for assessing external validity against ICD-10 in terms of NMI and AMI. The inter-disease distances between word embedding vectors are therefore expected to be a valid quantitative representation of similar disease groups.
Word2Vec, FastText, and GloVe learn a word embedding vector from the co-occurrence of words within a context; thus, words that appear in similar contexts have high similarity. In academic abstracts, the description of infarction in an organ also includes clinical symptoms and characteristic information derived from that organ. Differences in the characteristic information co-occurring with different organs may be a factor in the distances between diseases.
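For intuition, the training signal these co-occurrence-based models share is the set of (center word, context word) pairs drawn from a sliding window over the text. A minimal sketch (the example sentence and window size are illustrative, not from the study's corpus):

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate the (center word, context word) pairs that
    co-occurrence-based embedding models train on."""
    pairs = []
    for i, center in enumerate(tokens):
        # collect every token within `window` positions of the center
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sent = "cerebral infarction causes hemiplegia".split()
print(skipgram_pairs(sent, window=1))
# → [('cerebral', 'infarction'), ('infarction', 'cerebral'),
#    ('infarction', 'causes'), ('causes', 'infarction'),
#    ('causes', 'hemiplegia'), ('hemiplegia', 'causes')]
```

Words surrounded by similar context words thus receive similar updates, which is why they end up close in the embedding space.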
In the medical field, there have been few attempts at classification tasks using embedded representations. In a study that used a distributed representation obtained from medical records as a visit embedding, the k-means method was used to classify the characteristics of patients by specialty [31]. A German Word2Vec model trained on a corpus of 352 MB of medical reports attained an accuracy of 90% in assigning medical reports written by physicians to ICD-10 [32]. That study identified rare diseases, unusual designations, and ICD code degeneracy as sources of assignment or “missing” errors. ICD-10 has a hierarchical structure with more than 68,000 codes. However, not all codes have the same level of granularity. Furthermore, ICD codes that are ontologically distant are less likely to be grouped [29]. Although we bound our embedding vectors to ICD-10 based on the corpus of academic literature, the maximum NMI when using Word2Vec was 0.85, while AMI and ARI reached only 0.41 and 0.36, respectively; therefore, our embedding vectors cannot be interpreted as a mapping to the continuous space of ICD-10.
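As an illustration of the external validity measures, NMI can be computed from scratch. This is a minimal sketch, not the study's implementation; the arithmetic-mean normalization used here is one common convention:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Normalized mutual information between two labelings of the same
    items, normalized by the arithmetic mean of the two entropies."""
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    mi = sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in joint.items())
    ha, hb = entropy(a), entropy(b)
    return mi / ((ha + hb) / 2) if (ha + hb) > 0 else 1.0

# Identical clusterings (up to relabeling) score 1; independent ones near 0.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # ≈ 1.0
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # ≈ 0.0
```

NMI is invariant to permutations of cluster labels, which is why it suits comparing a dendrogram cut against ICD-10 groups whose labels carry no shared numbering.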
Note that the metric and method that maximized the internal validity measure differed from those that maximized the external validity measure. The Euclidean metric maximized the internal validity measure but did not maximize the external validity rating. The dendrogram with the parameters that maximize internal validity (Fig. 3) gives the impression that, unlike the ICD-10 classification, less clinically relevant diseases are adjacent. Conversely, the dendrogram with the parameters that maximize the NMI and ARI with ICD-10 (Fig. 4) gives the impression of a classification based on anatomical and temporal differences such as “old” and “sequelae.” In other words, the latter could be described as “ICD-10-like.”
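To make the role of the linkage method concrete, the following is a minimal pure-Python sketch of agglomerative clustering with complete linkage over cosine distances; the toy 2-D vectors stand in for disease embedding vectors and are not data from the study:

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity (Eq. 3)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def complete_linkage(vectors, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose *farthest* members are closest (complete linkage), until
    n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: inter-cluster distance is the maximum
                # pairwise distance between members of the two clusters
                d = max(cosine_distance(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two tight groups pointing in different directions.
vecs = [(1, 0), (2, 0.1), (0, 1), (0.1, 2)]
print(complete_linkage(vecs, 2))  # → [[0, 1], [2, 3]]
```

Swapping the `max` for a centroid computation (and cosine for Euclidean) yields the other metric/method combinations compared in the study, which is exactly where the dendrograms in Figs. 3 and 4 diverge.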
The Euclidean distance is calculated between two points as follows:
$$ d_{ij} = \left[ \sum\nolimits_{k = 1}^{K} \left( x_{ik} - x_{jk} \right)^{2} \right]^{1/2}, \quad {\mathbf{x}}_{i} = \left[ x_{i1}, x_{i2}, \ldots, x_{iK} \right]^{T} $$
(2)
$$ d_{ij}: \text{distance between } {\mathbf{x}}_{i} \text{ and } {\mathbf{x}}_{j},\ {\mathbf{x}} \in {\mathbb{R}}^{K}, $$
$$ x_{ik}: k\text{-th component of } {\mathbf{x}}_{i}, \quad {\mathbf{x}}_{i}: i\text{-th vector of } {\mathbf{x}} $$
Conversely, the cosine distance is calculated from the angles between the vectors as
$$ d_{ij} = 1 - \frac{{\mathbf{x}}_{i} \cdot {\mathbf{x}}_{j}}{\left\| {\mathbf{x}}_{i} \right\| \left\| {\mathbf{x}}_{j} \right\|}, \quad {\mathbf{x}}_{i} = \left[ x_{i1}, x_{i2}, \ldots, x_{iK} \right]^{T} $$
(3)
When performing calculations with cosine distance, the length information of the vectors is lost because each vector is divided by its L2 norm. In other words, when the angle θ between two vectors is 0°, the cosine distance is 0 (similarity is 1) even if the L2 norms differ; therefore, diseases at different distances from the origin (different L2 norms) in a high-dimensional space are expressed as being similar. The L2 norm of xi is calculated using the formula
$$ L_{i}^{2} = \left\| {\mathbf{x}}_{i} \right\| = \left[ \sum\nolimits_{k = 1}^{K} x_{ik}^{2} \right]^{1/2} $$
(4)
$$ L_{i}^{2}: \text{L2 norm of the } i\text{-th vector } {\mathbf{x}}_{i} $$
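Equations (2)–(4) can be checked with a small pure-Python sketch, which also illustrates the point above: cosine distance discards the L2 norm, while Euclidean distance does not (the vectors are toy values, not embeddings from the study):

```python
import math

def euclidean(xi, xj):
    """Eq. (2): straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def l2_norm(x):
    """Eq. (4): length of a vector."""
    return math.sqrt(sum(a * a for a in x))

def cosine_distance(xi, xj):
    """Eq. (3): 1 minus the cosine of the angle between the vectors."""
    dot = sum(a * b for a, b in zip(xi, xj))
    return 1.0 - dot / (l2_norm(xi) * l2_norm(xj))

u = (3.0, 4.0)
v = (6.0, 8.0)                 # same direction as u, twice the L2 norm
print(l2_norm(u), l2_norm(v))  # 5.0 10.0
print(cosine_distance(u, v))   # 0.0 (angle is 0°, so norms are ignored)
print(euclidean(u, v))         # 5.0 (Euclidean still separates them)
```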
When clustering by cosine distance, all disease embedding vectors are normalized and plotted on the n-dimensional hypersphere, losing the L2 norm. In this study, the pivot and cluster were converted into a two-dimensional representation for presentation, but they could equally be represented on the hypersphere. A cluster is then the group of diseases that make an angle θ < θ0 with the pivot, and the inter-disease distance can be defined as the distance between two points on the hypersphere.
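This angular definition of a cluster can be sketched as follows, assuming toy 2-D stand-ins for the embedding vectors and a hypothetical threshold θ0 = 30°:

```python
import math

def normalize(x):
    """Project a vector onto the unit hypersphere (discard the L2 norm)."""
    n = math.sqrt(sum(a * a for a in x))
    return tuple(a / n for a in x)

def pivot_cluster(pivot, vectors, theta0_deg):
    """Return indices of vectors whose angle to the pivot is below theta0."""
    p = normalize(pivot)
    members = []
    for i, v in enumerate(vectors):
        u = normalize(v)
        # clamp to [-1, 1] to guard acos against rounding error
        cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, u))))
        if math.degrees(math.acos(cos)) < theta0_deg:
            members.append(i)
    return members

pivot = (1.0, 0.0)
vecs = [(5.0, 0.5), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
print(pivot_cluster(pivot, vecs, 30))  # → [0, 1]
```

Note that the first vector joins the cluster despite its much larger norm, which is precisely the behavior the next paragraph questions.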
Conversely, the L2 norm is not negligible. Words that are consistently used in similar contexts are represented by vectors with larger L2 norms than words of the same frequency used in diverse contexts [33]. Applied to diseases and symptoms, medical words that cover various causes, are highly abstract, and are used in various contexts (e.g., “infarction” and “headache”) have smaller L2 norms. By contrast, medical words that are less diverse in terms of cause, more specific, and used in limited contexts (e.g., “omental infarction” and “placental infarction”) have larger L2 norms. In fact, the L2 norms of the embedding vectors in this study are 3.107 for “infarction,” 3.955 for “headache,” 7.703 for “omental infarction,” and 7.583 for “placental infarction.” The L2 norms of the standard disease list are shown in Fig. 5. The frequency of occurrence and the L2 norm tend to be inversely related.
The similarity with the dendrogram shown in Fig. 3 is also clear. Because the L2 norm is the Euclidean distance from the origin, this dendrogram reflects the frequency of occurrence and the contexts of words in the corpus. Thus, using unnormalized vectors for classification may be preferable when considering word frequency and polysemy and when applying PCS in actual clinical practice.
The word frequencies should be interpreted strictly within the corpus used in this study. However, a broad corpus of academic medical literature may, in principle, be consistent with disease frequencies in the real world. Prior probabilities are essential information in clinical reasoning: when Bayes’ theorem is used to update the posterior probability of a diagnosis as new information becomes available, the prior probability is the starting point. In many cases, the prior probability depends on the function and location of the medical facility. By retaining the L2 norm when obtaining clusters, a differential diagnosis list that reflects prior probabilities in the corpus or facility may be obtained. By using different clusters at different stages of clinical reasoning (varying the distance and update method), such computation may provide a differential diagnosis list that is more efficient and more in line with the physician’s thought process.
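The Bayesian update described above can be sketched with hypothetical numbers; the prior, sensitivity, and specificity below are illustrative, not values from the study:

```python
def posterior(prior, sensitivity, specificity):
    """Bayes' theorem for a positive finding:
    P(D|+) = P(+|D) P(D) / [P(+|D) P(D) + P(+|not D) P(not D)]"""
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_pos

# Hypothetical: a disease with a 5% prior at this facility, and a
# finding with 90% sensitivity and 80% specificity.
print(round(posterior(0.05, 0.90, 0.80), 3))  # 0.191
```

Even a strongly positive finding leaves the posterior below 20% here, which is why a cluster that encodes facility-specific priors could shape the differential diagnosis list.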
Although this study was limited to the infarction domain, depending on the dictionary used in the morphological analysis, we simultaneously obtained the embedding vectors of medical words other than infarction-related diseases. In other words, we efficiently computed vectors representing symptoms, such as hemiplegia and headache, and histories, such as smoking and hypertension. In the future, we will examine validity measures for domains other than infarction and calculate inter-symptom or symptom-disease distances to visualize the many keywords used in clinical reasoning.
Bidirectional encoder representations from transformers (BERT) is a pre-trained, transformer-based machine learning technique developed by Google [34]. BERT has exhibited good performance on several natural language understanding tasks. However, the corpus used in this study was too small to pre-train a BERT model and was not in a form suitable for fine-tuning; consequently, we could not use BERT. Its use will be considered in future studies with larger corpora.
Limitations
There are several limitations to this study. First, we cannot assert that the corpus size used for the study was sufficiently large. Previous studies using academic literature corpora have acquired over 18 million abstracts to obtain a vocabulary of approximately 7.8 million words [35]. Ichushi Web is the most extensive Japanese-language academic corpus currently available, and this study used all the searchable medical journals available therein. Therefore, other resources should be considered to increase the corpus size.
Second, morphological analysis presents a problem. Many medical terms consist of multiple words, and this is also true in Japanese; for example, “acute inferior myocardial infarction” contains four words in English and eight Chinese characters but is a single medical term in both languages. Word2Vec and GloVe are vulnerable to unknown words, and if a term is not registered as a multi-word term in the dictionary, it is segmented by longest match into shorter registered entries. For example, if the medical term “acute right renal infarction” is present in the corpus but not in the dictionary, it will be divided into “acute,” “right,” and “renal infarction.” For morphological analysis, this study used the ComeJisyo medical dictionary, which has 75,861 registered words as of November 2018. Pre-trained Japanese models are available for Word2Vec and FastText, but not for GloVe. However, some Japanese pre-trained models do not specify the dictionary used for word segmentation. Moreover, depending on the domain and task, the dictionary registration of multi-word terms may be inadequate, as found in this study. For the PCS task in the medical domain, whether a model pre-trained on a general-domain corpus or a model trained on a medical-domain corpus is better is beyond the scope of this study.
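The longest-match behavior described above can be sketched as a greedy segmenter over word sequences; this is a simplification of what a morphological analyzer does, and the dictionary below is hypothetical:

```python
def longest_match_segment(words, dictionary):
    """Greedy longest match: at each position, take the longest run of
    words that forms a registered term; otherwise emit a single word."""
    tokens, i = [], 0
    while i < len(words):
        n = 1  # fall back to a single word if nothing longer matches
        for j in range(len(words), i, -1):
            if " ".join(words[i:j]) in dictionary:
                n = j - i
                break
        tokens.append(" ".join(words[i:i + n]))
        i += n
    return tokens

# "acute right renal infarction" is absent from the dictionary, so it
# splits into the longest registered fragments.
dic = {"acute", "right", "renal infarction", "myocardial infarction"}
print(longest_match_segment("acute right renal infarction".split(), dic))
# → ['acute', 'right', 'renal infarction']
```

The single multi-word term is thus scattered across three embedding vectors, which is exactly the failure mode described for unregistered terms.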
Third, these technologies themselves present a problem. Because they use random numbers during training, minute differences may occur each time training is conducted, so the reproducibility of the study cannot be fully guaranteed. In addition, because we did not examine differences in results due to different parameter settings, we cannot assert that the parameters used in this study are optimal.
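One common mitigation is fixing the random seed, which can be sketched as follows; `train_with_seed` is a hypothetical stand-in for embedding initialization, not part of the study's code:

```python
import random

def train_with_seed(seed, dim=4):
    """Stand-in for random embedding initialization: with a fixed seed,
    the 'trained' vector is identical across runs."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(dim)]

run1 = train_with_seed(42)
run2 = train_with_seed(42)
run3 = train_with_seed(7)
print(run1 == run2)  # True: the same seed reproduces the vector
print(run1 == run3)  # False: a different seed gives different values
```

Note that seeding alone does not guarantee reproducibility when training is multi-threaded, since the order of updates can still vary between runs.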
Fourth, the ICD-10 classification prepared as an external validity measure is sometimes inappropriate for creating a clinical differential diagnosis list or cluster. Conversely, the true, pre-prepared clusters mentioned in previous studies are not always found in textbooks or international classifications. If a list of physicians’ definitive differential diagnoses were available, it might be worth examining external validity measures against that list.