Evaluation of standard and semantically-augmented distance metrics for neurology patients
BMC Medical Informatics and Decision Making volume 20, Article number: 203 (2020)
Patient distances can be calculated based on signs and symptoms derived from an ontological hierarchy. There is controversy as to whether patient distance metrics that consider the semantic similarity between concepts can outperform standard patient distance metrics that are agnostic to concept similarity. The choice of distance metric can dominate the performance of classification or clustering algorithms. Our objective was to determine if semantically augmented distance metrics would outperform standard metrics on machine learning tasks.
We converted the neurological findings from 382 published neurology cases into sets of concepts with corresponding machine-readable codes. We calculated patient distances by four different metrics (cosine distance, a semantically augmented cosine distance, Jaccard distance, and a semantically augmented bipartite distance). Semantic augmentation for two of the metrics depended on concept similarities from a hierarchical neuro-ontology. For machine learning algorithms, we used the patient diagnosis as the ground truth label and patient findings as machine learning features. We assessed classification accuracy for four classifiers and cluster quality for two clustering algorithms for each of the distance metrics.
Inter-patient distances were smaller when the distance metric was semantically augmented. Classification accuracy and cluster quality were not significantly different by distance metric.
Although semantic augmentation reduced inter-patient distances, we did not find improved classification accuracy or improved cluster quality with semantically augmented patient distance metrics when applied to a dataset of neurology patients. Further work is needed to assess the utility of semantically augmented patient distances.
Background and related work
Patients present with signs (what the physician finds on examination) and symptoms (patient complaints). We group signs and symptoms under the more general term findings. Distance metrics play an important role in advancing precision medicine, machine learning, and patient phenotyping [2,3,4,5,6,7,8,9,10,11,12]. Patient distances can be calculated based on findings that have been converted to machine codes based on concepts from a hierarchical ontology.
In this study, we examine whether the semantic augmentation of distance metrics with concept similarities improves the classification and clustering of neurology patients.
A variety of similarity and distance metrics are available. These have been used to calculate distances between patients [13,14,15,16], documents [17,18,19], and phenotypes [4, 5, 9, 10, 12]. If similarity and distance metrics are normalized to a scale of 0.0 to 1.0, the distance between A and B is the complement of the similarity.
The distance between two patients differs from the distance between two medical concepts. Patients are complex and can be represented as a collection of many concepts. Inter-patient distances are many-to-many comparisons; inter-concept distances are one-to-one comparisons. Metrics that work for concept distances are generally different from the metrics used to calculate distances between patients. Melton et al. comment that “semantic distance measures the relative closeness between two concepts …. Inter-patient distance compares the relative closeness between two cases (sets of patient data).”
The implementation of distance metrics for neurological patients based on findings is challenging. First, neurological findings are recorded as unstructured free text. Second, examiners use a variety of equivalent terms to represent the same meaning: hyperreflexia is equivalent to increased reflexes; Babinski sign is equivalent to extensor plantar response; and so on. Third, the number of findings may vary from patient to patient. Fourth, converting unstructured text into machine-readable codes is difficult [20, 21].
The SNOMED CT ontology and the UMLS Metathesaurus allow the consolidation of multiple synonymous terms under the same concept [22, 23]. Both terminologies assign unique machine-readable codes to a concept. We have identified 1204 core concepts from the UMLS Metathesaurus as a neuro-ontology for capturing findings of the neurological examination. This curated neuro-ontology has three characteristics that make it well-suited for patient distance calculations: 1) it is monohierarchic, 2) its hierarchy is organized by the neurologic similarity of concepts, and 3) it contains neurologic concepts absent from SNOMED CT.
When findings are converted to concepts and represented as machine-readable codes, patients can be instantiated mathematically as a set (an unordered collection of findings) or as a vector (ordered array of elements of fixed length). If a patient is represented as a set, each finding is added to the set as a unique element. The cardinality of the set (number of set elements) is equal to the number of findings. If a patient is represented as a vector, each finding is represented as an element of the vector. The number of elements is equal to the number of potential findings. A variety of distance metrics can be used with vectors, including Manhattan, Euclidean, cosine, Pearson correlation, Hamming, Minkowski, and others. Commonly used distance metrics in patient similarity studies are Jaccard, Mahalanobis, Euclidean, and cosine [15, 26]. Haase et al. have suggested a bipartite matching algorithm for set similarity (eq. 2) where |A| is the number of elements in set A and sim(a, b) is the similarity between a concept a from set A and a concept b from set B.
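The two representations described above can be sketched in a few lines of Python. This is a minimal illustration only: the five-concept list and the helper names `as_set` and `as_vector` are hypothetical stand-ins, whereas the actual neuro-ontology has 1204 concepts identified by UMLS codes.

```python
# Hypothetical mini-ontology of five concepts (the real neuro-ontology has 1204).
CONCEPTS = ["tremor", "resting tremor", "postural tremor", "rigidity", "ataxia"]
INDEX = {name: i for i, name in enumerate(CONCEPTS)}


def as_set(findings):
    """Represent a patient as an unordered set; cardinality = number of findings."""
    return set(findings)


def as_vector(findings):
    """Represent a patient as a binarized vector with one slot per potential finding."""
    vec = [0] * len(CONCEPTS)
    for finding in findings:
        vec[INDEX[finding]] = 1  # 1 = finding present, 0 = finding absent
    return vec


patient = ["resting tremor", "rigidity"]
patient_set = as_set(patient)
patient_vec = as_vector(patient)
```

The set grows with the number of findings recorded for the patient, while the vector always has the same length as the list of potential findings.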
Hierarchical ontologies such as SNOMED CT and the UMLS Metathesaurus allow the calculation of distances between concepts [28,29,30,31,32,33,34,35,36]. Concept distances derived from hierarchical ontologies show modest correlations with the distance judgments of human experts [35, 37, 38]. The distance metrics for both sets and vectors can be augmented by considering the similarity between concepts [13, 14, 19]. Melton et al. compared computed patient distances with an expert opinion on patient distance based on chart review. They did not find that semantic augmentation of the distance metric enhanced correlation with expert opinion, and they found that the correlation between experts and computed patient distances was low regardless of semantic augmentation. Mabotuwana et al. examined document similarity using a cosine distance metric after converting document concepts to a binarized vector. In a classification task that involved determining whether a radiological report was a head CT scan or an abdomen CT scan, they found the accuracy of a k-nearest neighbor classifier increased from 86.7 to 93.1% with semantic augmentation of the document vector based on the SNOMED CT concept hierarchy. Mabotuwana et al. found that semantic augmentation of inter-document distances increased the separation between the centroid of the head CT scan reports and the centroid of the abdomen CT reports. Jia et al. examined the ability of patient distances generated by ICD-10 diagnoses to predict hospital length of stay. Although they explored a variety of distance metrics, including cosine, Jaccard, and bipartite matching, they came to no definite conclusion as to whether semantic augmentation (based on a concept hierarchy) improved classification accuracy. In the Human Phenotype Ontology (HPO), Kohler et al. have implemented a semantically augmented distance metric to assist in matching unknown patients to archetypical patients in the Online Mendelian Inheritance in Man (OMIM) database.
Girardi et al. calculated distances between patients with hernias or diseases of the gall bladder, thyroid, or appendix based on ICD-10 diagnosis codes. They found that a semantically augmented patient distance metric outperformed a Jaccard distance on a clustering task and that a semantically augmented patient distance increased the distance between within-diagnosis centroids and between-diagnosis centroids.
Machine learning is increasingly used in the analysis of patient data. Machine learning is divided into supervised and unsupervised learning. The prototypical tasks for supervised learning are classification and regression. Although there are many machine learning classifiers, some commonly used classifiers include naïve Bayes, logistic regression, k-nearest neighbor, and random forest. Naïve Bayes utilizes probabilities derived from predictor variables to select class membership. Logistic regression is a statistical method that fits parameters to a logistic equation to predict class membership. k-nearest neighbor classifiers utilize distances between cases to predict class membership. Random forest classifiers use an ensemble of decision trees to predict class membership. The most common use of unsupervised learning algorithms is for the clustering of cases into homogeneous groups. Although many clustering algorithms are available, two of the most commonly used clustering algorithms are k-means clustering and agglomerative clustering. Both of these algorithms utilize inter-case distances to form homogeneous clusters of cases. Indices of machine learning classification quality include precision, recall, F1, and accuracy. Indices of machine learning clustering quality include homogeneity, completeness, Rand index, V-score, and silhouette score [43,44,45]. Distance metrics are frequently used to generate patient distance matrices that drive the clustering or classification of patients. Since the performance of machine learning clustering and classification algorithms can be assessed objectively, we have hypothesized that the semantic augmentation of distance metrics with inter-concept distances would improve the performance of these algorithms.
To test this hypothesis, we created four test groups of patients abstracted from textbooks. We investigated four classifiers (naïve Bayes, logistic regression, random forests, and k-nearest neighbor) and two clustering algorithms (agglomerative and k-means) across four distance metrics. We tested whether semantic augmentation of the distance metrics improved clustering or classification quality.
We created a dataset of 382 neurological patients selected from a convenience sample of 1028 published teaching cases [47,48,49,50,51,52,53,54,55,56,57,58]. We abstracted 2616 findings from the case studies (mean 6.7 ± 3.4 findings per patient). Findings were transcribed verbatim from source materials. An abstractor manually selected one of the 1204 available terms in the neuro-ontology that best represented the finding and added the UMLS CUI code. Table 1 illustrates the case abstraction method for a patient with Parkinson disease.
We implemented four inter-patient distance metrics in Python. The Jaccard distance is the complement of the Jaccard similarity. If A and B are the sets of findings from patient A and patient B, Jaccarddist(A, B) is given by eq. (3), where Jsim is the Jaccard similarity.
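As a minimal sketch of the Jaccard distance as the complement of the Jaccard similarity (intersection size over union size), the function below may help. The function name, the example findings, and the convention that two empty sets have distance 0 are assumptions made for illustration, not details taken from the study.

```python
def jaccard_distance(a, b):
    """Jaccard distance = 1 - |A ∩ B| / |A ∪ B| for two sets of findings."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two patients with no findings are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical patients: one shared finding out of four distinct findings.
patient_a = {"resting tremor", "rigidity", "bradykinesia"}
patient_b = {"postural tremor", "rigidity"}
d_ab = jaccard_distance(patient_a, patient_b)  # 1 - 1/4
```

Note that resting tremor and postural tremor contribute nothing to the intersection here, because the Jaccard distance only counts exact concept matches.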
The augmented bipartite distance is based on the metric of Melton et al. after augmenting it with the inter-concept distance proposed by Wu and Palmer. If patients A and B are represented as a set of findings such that a ∈ A and b ∈ B, the augmented bipartite distance is shown by eq. (4) and is supported by eqs. (5), (6), and (7).
For eq. (7), we used the hierarchical structure of the neuro-ontology and the method of Wu and Palmer to calculate dist(a, b) as the semantic distance between concept a and concept b. LCS is the lowest common subsumer in the hierarchical ontology for concepts a and b; depth(a) is the number of levels from the root concept to concept a; depth(b) is the number of levels from the root concept to concept b; and depth(LCS) is the number of levels from the root concept to the LCS. Based on eq. (7), dist(a, b) for each inter-concept distance was stored as an n × n lookup table where the number of possible concepts was n = 1204. Values from this lookup table were used in eqs. (5) and (6) to iteratively find the minimum inter-concept distance for each concept from patient A compared to the concepts in patient B. Cosine distances between patients (1 – cosine similarity) were calculated by standard methods (eq. 8). If patient A and patient B are represented as vectors of findings from a1 to an and from b1 to bn, the vector is binarized, so that ai or bi is 1 if the finding is present and 0 if the finding is absent. Patient vectors were represented as a one-dimensional array of length n = 1204, where n is the potential number of findings.
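The following Python sketch illustrates one way the Wu and Palmer inter-concept distance and a bipartite patient distance could be computed. Several details are assumptions rather than facts from the study: the toy hierarchy is hypothetical, the root is given depth 0, the distance is taken as 1 minus the usual Wu and Palmer similarity 2·depth(LCS) / (depth(a) + depth(b)), and the two directed averages of minimum distances are combined by a simple mean to obtain a symmetric value. The exact form of eqs. (4) to (7) may differ.

```python
# Hypothetical mono-hierarchy: child -> parent ("root" has no parent).
PARENT = {
    "motor finding": "root",
    "abnormal movement": "motor finding",
    "tremor": "abnormal movement",
    "resting tremor": "tremor",
    "postural tremor": "tremor",
    "rigidity": "motor finding",
}


def path_to_root(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def depth(concept):
    """Number of levels from the root to the concept (assumption: root depth is 0)."""
    return len(path_to_root(concept)) - 1


def lcs(a, b):
    """Lowest common subsumer: the deepest concept that is an ancestor of both."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b, so the first hit is the lowest
        if node in ancestors_a:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer_distance(a, b):
    """1 - Wu-Palmer similarity; identical concepts have distance 0."""
    if a == b:
        return 0.0
    return 1.0 - 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


def directed_distance(A, B):
    """Mean over a in A of the minimum inter-concept distance to any b in B."""
    return sum(min(wu_palmer_distance(a, b) for b in B) for a in A) / len(A)


def augmented_bipartite_distance(A, B):
    """Assumed symmetric form: average of the two directed distances."""
    return 0.5 * (directed_distance(A, B) + directed_distance(B, A))


patient_a = {"resting tremor", "rigidity"}
patient_b = {"postural tremor", "rigidity"}
d_concepts = wu_palmer_distance("resting tremor", "postural tremor")  # siblings at depth 4
d_patients = augmented_bipartite_distance(patient_a, patient_b)
```

In this toy hierarchy the two tremor concepts are siblings at depth 4, so their distance is 1 - 6/8 = 0.25, and the two patients end up at a distance of 0.125. Both patient sets must be non-empty for the directed averages to be defined.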
We calculated an augmented cosine distance between patients according to the method of Mabotuwana et al. Patients were represented as one-dimensional arrays as in the cosine distance above. We used the hierarchical structure of the neuro-ontology to find an ordered list of ancestors for each concept. For each of the 1204 concepts in the neuro-ontology, we created a semantically augmented vector. The formula for augmentation was 1/(1 + n) where n = 0 for the index concept, n = 1 for the parent concepts, n = 2 for the grandparent concepts, etc. Descendent concepts (children) in the neuro-ontology were not augmented. Ancestor hierarchy was determined by the neuro-ontology, which is mono-hierarchical. Augmentation vectors were stored in an n × n lookup table (n = 1204). Semantically augmented patient vectors were created for each patient by traversing a list of concepts for each patient and adding the augmented concept vector to the patient vector to obtain a summary patient vector. After semantic augmentation of the vectors, inter-patient distances were calculated by eq. 8.
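The augmentation rule 1/(1 + n) and the summation of concept vectors into a patient vector can be sketched as follows. The four-concept hierarchy and all function names are hypothetical, and the sketch assumes that every ancestor up to the top of the hierarchy receives a weight; whether the study weighted the root concept is not stated here.

```python
import math

# Hypothetical mono-hierarchy: concept -> parent (None marks a top-level concept).
PARENT = {
    "tremor": None,
    "resting tremor": "tremor",
    "postural tremor": "tremor",
    "rigidity": None,
}
CONCEPTS = list(PARENT)
INDEX = {name: i for i, name in enumerate(CONCEPTS)}


def augmented_concept_vector(concept):
    """Weight 1/(1 + n): n = 0 for the concept, 1 for its parent, 2 for its grandparent."""
    vec = [0.0] * len(CONCEPTS)
    node, n = concept, 0
    while node is not None:
        vec[INDEX[node]] = 1.0 / (1 + n)
        node, n = PARENT[node], n + 1
    return vec


def patient_vector(findings, augment=True):
    """Sum the (augmented or plain binarized) concept vectors of a patient's findings."""
    vec = [0.0] * len(CONCEPTS)
    for finding in findings:
        if augment:
            concept_vec = augmented_concept_vector(finding)
        else:
            concept_vec = [1.0 if i == INDEX[finding] else 0.0 for i in range(len(CONCEPTS))]
        vec = [x + y for x, y in zip(vec, concept_vec)]
    return vec


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


a, b = ["resting tremor"], ["postural tremor"]
plain = cosine_distance(patient_vector(a, augment=False), patient_vector(b, augment=False))
augmented = cosine_distance(patient_vector(a), patient_vector(b))
```

With binarized vectors the two patients share no non-zero element, so their cosine distance is 1.0. After augmentation both vectors carry 0.5 in the tremor slot, giving a cosine similarity of 0.25/1.25 = 0.2 and a distance of 0.8.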
For all metrics, distances were positive, symmetric, and normalized between 0.0 and 1.0. Distances for each distance metric were stored in a square nxn matrix (n = 382 patients) before input to classification or clustering algorithms.
We divided the dataset of 382 patients into four test groups by diagnosis (Table 2). Each test group consisted of patients with eight related diagnoses. Each diagnosis occurred at least four times (mean 11.9 ± 5.9) in the test group. Test groups were composed of competing diagnoses for a common presenting neurological complaint (a patient with weakness, a patient with abnormal movements, a patient with altered mental status, and a patient with cranial neuropathy). Diagnoses were selected to emulate the differential diagnosis a neurologist might consider when evaluating a patient complaint.
Classification and clustering
For the classification tasks, we assessed the ability to correctly assign diagnoses based on findings. The ground truth labels were the diagnoses from the abstracted patient histories, and the features were the abstracted findings. Naïve Bayes, logistic regression, random forest, and k-nearest neighbor classifiers were compared. We used the Orange 3.25 default hyperparameters for naïve Bayes. For logistic regression, we set regularization = L2, and for random forest, we set the number of trees = 10. For the k-nearest neighbor classifier, we used uniform distance weighting and k = 5 after the empirical evaluation of all k values between 2 and 15. We used classification accuracy and a balanced F1 score to assess classification performance based on 10-fold cross-validation. In a separate analysis, we found mean F1 scores and mean accuracy scores did not differ statistically (df = 1, p > .05) between the 10-fold cross-validation method and the random sampling validation method.
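To illustrate how a k-nearest neighbor classifier can work directly from a precomputed patient distance matrix, here is a dependency-free Python sketch. It is not the Orange workflow used in the study: it swaps 10-fold cross-validation for leave-one-out validation to stay short, uses k = 3 rather than k = 5 because the toy dataset has only six patients, and the distance values and diagnosis labels are invented.

```python
from collections import Counter


def knn_predict(dist_row, labels, exclude, k):
    """Predict a label by uniform majority vote among the k nearest other patients."""
    neighbors = sorted((d, j) for j, d in enumerate(dist_row) if j != exclude)[:k]
    votes = Counter(labels[j] for _, j in neighbors)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(D, labels, k):
    """Classify each patient from all the others and return the fraction correct."""
    hits = sum(knn_predict(D[i], labels, i, k) == labels[i] for i in range(len(labels)))
    return hits / len(labels)


# Invented toy data: 0.2 within a diagnosis, 0.8 between diagnoses.
labels = ["PD"] * 3 + ["ET"] * 3
D = [
    [0.0 if i == j else (0.2 if labels[i] == labels[j] else 0.8) for j in range(6)]
    for i in range(6)
]
accuracy = leave_one_out_accuracy(D, labels, k=3)
```

Because the classifier only ever sees distances, any of the four metrics could supply the matrix `D` without changing the rest of the code, which is what makes the comparison across metrics straightforward.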
For both the agglomerative clustering algorithm (Ward linkage) and the k-means clustering algorithm, we chose a hyperparameter of number of clusters = 8 based on the known number of diagnoses in the test groups (Table 2). We used the silhouette score, homogeneity score, completeness score, V-score, adjusted Rand index, and mutual information index to assess cluster quality [42,43,44,45, 59].
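Three of these cluster quality measures (homogeneity, completeness, and V-score) can be written down compactly from entropies. The sketch below is a plain-Python illustration using the standard entropy-based definitions with equal weighting of homogeneity and completeness; it is not the scikit-learn code used in the study, and the labels in the example are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(X) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(x, given):
    """Conditional entropy H(X | G) computed from joint and marginal counts."""
    n = len(x)
    joint = Counter(zip(x, given))
    marginal = Counter(given)
    return -sum((c / n) * math.log(c / marginal[g]) for (_, g), c in joint.items())


def homogeneity_completeness_v(truth, clusters):
    """Return (homogeneity, completeness, V-score) for a clustering against ground truth."""
    h_truth, h_clusters = entropy(truth), entropy(clusters)
    # Homogeneity: each cluster contains members of a single diagnosis.
    hom = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, clusters) / h_truth
    # Completeness: all members of a diagnosis fall in the same cluster.
    com = 1.0 if h_clusters == 0 else 1.0 - conditional_entropy(clusters, truth) / h_clusters
    v = 0.0 if hom + com == 0 else 2.0 * hom * com / (hom + com)
    return hom, com, v


truth = ["TN", "TN", "BEL", "BEL"]
perfect = homogeneity_completeness_v(truth, [0, 0, 1, 1])
lumped = homogeneity_completeness_v(truth, [0, 0, 0, 0])
```

A clustering that separates the two diagnoses scores 1.0 on all three measures, whereas lumping every patient into one cluster is complete but not homogeneous, so its V-score falls to 0.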
We used SPSS 26 (IBM Corporation) for analysis of variance, line plots, and box plots. We used Orange 3.25.0 for the k-nearest neighbor, logistic regression, naïve Bayes, and random forest classifications. We used scikit-learn 0.23.1 for agglomerative clustering and k-means clustering. All performance measures for clustering and classification were normalized to a 0 to 100 scale.
We examined inter-patient distances for 382 patients divided into 4 test groups of eight diagnoses (Table 2). Inter-patient means differed by distance metric (Fig. 1, one-way ANOVA, df = 3, F = 5820, p < .001). Post hoc means testing (Bonferroni p < .05) showed all means differed (p < .05) with the augmented bipartite distance metric having the lowest mean inter-patient distance and the Jaccard distance metric having the highest mean inter-patient distance.
The mean within-diagnosis patient distance was less than the mean between-diagnosis patient distance for all four distance metrics (Fig. 2, two-way ANOVA, means differ by group, df = 1, F = 3050, p < .001 and means differ by distance metric, df = 3, F = 2936, p < .001). All pairwise mean comparisons by group and by distance metric were significant (post hoc Bonferroni test, p < .05).
We found a significant difference in mean patient distances by diagnosis (Fig. 3, two-way ANOVA, means differ by diagnosis, df = 31, F = 107, p < .001, and means differ by distance metric, df = 3, F = 1351, p < .001). Post hoc Bonferroni testing showed that 60% of the pairwise patient distance means differed by diagnosis (p < .05). For the 32 diagnoses shown in Fig. 3, trigeminal neuralgia had the lowest mean within-diagnosis patient distance (less than all other 31 diagnoses, pairwise comparisons, p < .05) and multiple sclerosis had the highest mean within-diagnosis patient distance (greater than all other diagnoses, pairwise comparisons, p < .05).
We performed 64 classification analyses (4 distance metrics × 4 test groups × 4 classifiers). The four test groups were altered mental status, abnormal movement, cranial neuropathy, and weakness (Table 2). The four distance metrics were cosine, augmented cosine, augmented bipartite, and Jaccard (see Methods). The four classifiers were naïve Bayes, logistic regression, random forest, and k-nearest neighbor (k = 5). Classes were unbalanced in the test groups (Table 2). Each classification task involved selecting the correct diagnosis from one of eight competing diagnoses for each of the patients in the test group. The performance was measured by classification accuracy and F1. Classification performance varied by classifier for both classification accuracy (two-way ANOVA, main effect, df = 3, F = 7.8, p < .001) and F1 (two-way ANOVA, main effect, df = 3, F = 10.1, p < .001). Bonferroni post hoc testing showed that the naïve Bayes classifier underperformed the logistic regression and k-nearest neighbor classifiers on both performance measures (p < .05).
Classification performance of the distance metrics was comparable regardless of classifier (Figs. 4-5, two-way ANOVA, df = 3, p > .05) or diagnosis group (two-way ANOVA, Figs. 6-7, df = 3, p > .05). Classifier performance was comparable when performance was measured by classification accuracy (Fig. 4) or by F1 (Fig. 5). Performance differed by diagnosis group (Figs. 6 and 7) for both classification accuracy (two-way ANOVA, df = 3, F = 10.2, p < .001) and the F1 score (two-way ANOVA, df = 3, F = 7.4, p < .001). Post hoc Bonferroni testing showed that the classification accuracy and the F1 score were higher for the cranial nerve group than for the other three diagnosis groups (p < .05).
We performed 32 clustering analyses (4 distance metrics × 4 test groups × 2 clustering algorithms). The two clustering algorithms were agglomerative clustering with Ward linkage and k-means clustering. Distances were inputted as pre-computed nxn matrices. For both clustering algorithms, the number of clusters was set at eight based on the known number of different diagnoses in each diagnosis group. Cluster quality was assessed by silhouette score, adjusted Rand Index (ARI), adjusted mutual information (AMI), completeness, homogeneity, and V-measure. Cluster quality did not differ by cluster algorithm (agglomerative versus k-means) on any of the cluster quality measures (Fig. 8, two-way ANOVA, df = 1, p > .05).
For both k-means clustering and agglomerative clustering, the distance metric did not significantly affect cluster quality (Figs. 9 and 10, two-way ANOVA, df = 3, p > .05). Cluster quality differed by diagnosis group (two-way ANOVA, df = 3, F = 20.3, p < .001): it was better for the cranial nerve group (Fig. 11) than for the other three groups, and better for the movement group than for the weakness group (Bonferroni post hoc test, p < .05). The higher quality of the cranial nerve clustering, with greater within-cluster homogeneity than the weakness group clustering, is illustrated in the stacked bar charts of Figs. 12 and 13.
We examined four distance metrics for calculation of the distances between neurology patients based on findings: Jaccard distance, cosine distance, augmented cosine distance and augmented bipartite distance. To calculate the Jaccard and augmented bipartite distances, we represented patients as unordered lists of elements of variable length (sets). To calculate the cosine and augmented cosine distances, we represented patients as ordered arrays of fixed length (vectors).
For the Jaccard and cosine distances, the matching of concepts between patients was binary (“all or none”). Semantic similarity between concepts was not considered. Consider a patient A who has the finding resting tremor and a patient B who has the finding postural tremor. When calculating the Jaccard distance or the cosine distance, the semantic similarity between resting tremor and postural tremor would not contribute to the proximity between these two patients (each metric would value the similarity between resting tremor and postural tremor as ‘0’). The semantically augmented distance metrics behave differently. These augmented distance metrics move patients closer together when patients manifest semantically similar findings, even if they are not exact matches. The augmented cosine distance considers that postural tremor and resting tremor have a common immediate ancestor, tremor. Hence, the tremor element of the vectors for patient A and patient B is augmented with a value of 0.5 (see Methods). This semantic augmentation of the vectors for patients A and B increases their similarity and moves the patients closer together when the cosine distance is calculated (eq. 8). The augmented bipartite distance considers that resting tremor and postural tremor are siblings in the neuro-ontology hierarchy and have a Wu-Palmer distance of 0.25 (eq. 7), which moves patients A and B closer (eqs. 5 and 6). The augmented cosine distance metric moves the patients closer because postural tremor and resting tremor have tremor as a common ancestor in the neuro-ontology. The augmented bipartite distance metric moves the patients closer because resting tremor and postural tremor are siblings in the neuro-ontology.
For each of the 382 patients in the dataset (n = 382), we calculated the mean patient distance to patients with the same diagnosis and the mean distance to patients with different diagnoses (Fig. 2). Within-diagnosis patient distances were lower than between-diagnosis patient distances for all of the metrics (Fig. 2). Patients of the same diagnosis should be closer to each other than those with a different diagnosis. Semantic augmentation of the distance metrics makes patients more similar, moves them closer together, and reduces mean patient distances. Augmented cosine and augmented bipartite patient distances were lower than cosine and Jaccard patient distances (Fig. 1, Bonferroni post hoc test, p < .05). For each patient, the difference between its mean distance to other patients with the same diagnosis and its mean distance to patients with a different diagnosis (Fig. 2) is important because this difference contributes to the ability of clustering and classification algorithms to cluster or classify patients successfully from patient distances [63, 64]. The difference between mean within-diagnosis distance and mean between-diagnosis distance differed by metric (df = 3, F = 49, p < .001) with the largest differences found with the cosine and augmented cosine metrics and the smaller differences found with the augmented bipartite and Jaccard metrics (Bonferroni post hoc test, p < .05).
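The per-patient gap between within-diagnosis and between-diagnosis distance can be sketched as below. The function names are hypothetical and the two distance matrices are invented solely to show that uniformly smaller distances do not imply a larger gap; they are not values from the study. The sketch assumes every diagnosis has at least two patients and that at least two diagnoses are present.

```python
def mean(values):
    return sum(values) / len(values)


def diagnosis_gaps(D, labels):
    """For each patient: mean between-diagnosis distance minus mean within-diagnosis distance."""
    gaps = []
    for i, label_i in enumerate(labels):
        within = [D[i][j] for j, lab in enumerate(labels) if j != i and lab == label_i]
        between = [D[i][j] for j, lab in enumerate(labels) if lab != label_i]
        gaps.append(mean(between) - mean(within))
    return gaps


def toy_matrix(labels, within, between):
    """Build an invented symmetric distance matrix with fixed within/between values."""
    n = len(labels)
    return [
        [0.0 if i == j else (within if labels[i] == labels[j] else between) for j in range(n)]
        for i in range(n)
    ]


labels = ["PD"] * 3 + ["ET"] * 3
standard_gaps = diagnosis_gaps(toy_matrix(labels, 0.2, 0.8), labels)
augmented_gaps = diagnosis_gaps(toy_matrix(labels, 0.1, 0.5), labels)
```

In this invented example the second matrix has smaller distances throughout, yet its gap (0.4) is narrower than that of the first matrix (0.6), so a distance-based algorithm would have less separation to work with.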
Classification and clustering
We evaluated four different classifiers on four different test groups of patients. We used F1 and classification accuracy (Figs. 4 and 5) as measures of classification performance. There were differences in classifier performance, with the logistic regression classifier and the k-nearest neighbor classifier outperforming the naïve Bayes classifier (Figs. 4 and 5). In retrospect, the selection of the naïve Bayes classifier was ill-suited for this study since this classifier assumes feature independence (not likely to hold among neurological patients) and is oriented towards using probabilities rather than distances for classification. Importantly, we found no effect on classification performance related to the distance metric. Classification performance did vary by test group (Figs. 6 and 7). Post hoc testing showed that the classification performance was better for the cranial nerve test group. A likely explanation for the better classification performance with the cranial nerve group is that members of this group (Table 2) had tighter within diagnosis inter-patient distances (i.e., less variability in presentation). As illustrated in Fig. 3, the diagnoses of the cranial nerve test group (TN, MNR, RH, ON, BEL, BPV, THD, and AN) are primarily on the left-hand side of the x-axis, and they have lower mean intra-diagnosis variability in their clinical presentations.
We evaluated two different clustering algorithms (agglomerative clustering and k-means clustering) on the four test groups of patients (Table 2). Except for the silhouette score, the clustering performance measures depend on the ground truth diagnosis label derived from the patient case studies. The silhouette score measures cluster quality independent of ground truth. Cluster quality did not differ by cluster algorithm (Fig. 8). Cluster quality did not vary by distance metric for either the k-means algorithm or the agglomerative algorithm (Figs. 9 and 10). Cluster quality did differ by patient test group with post hoc testing showing that the cranial nerve test group had higher cluster quality than the other test groups (Fig. 11). Visual inspection of Fig. 12 (cranial nerve test group) and Fig. 13 (weakness test group) shows that, with an 8-cluster solution, cluster homogeneity is higher in the cranial nerve group than in the weakness test group. In Figs. 12 and 13, each color represents a different ground truth diagnosis label, and each column represents a computed cluster. The better performance on clustering of the cranial nerve group likely reflects the same factors intrinsic to this group of patients that led to better classification performance (see above). There is less variability in clinical presentation from patient to patient in this test group, within-diagnosis patient distances are lower (Fig. 3), and there is likely less sign and symptom overlap with other diagnoses.
The failure to find an improvement in clustering or classification performance with semantically augmented distance measures was somewhat surprising. Others have found improvements in the clustering of patients or classification of documents with semantically augmented distance metrics. However, Melton et al. did not find improved concordance with domain experts when inter-patient distance calculations were augmented with concept semantic similarity information. Although semantically augmented distance metrics move patients closer (Fig. 1), these smaller inter-patient distances may not translate into improvements in clustering or classification performance unless they create a greater gap between mean within-diagnosis distance and mean between-diagnosis distance. From Fig. 2, it seems likely that for patients with a given diagnosis, a semantically augmented distance places them closer to other patients with the same diagnosis. The problem is that semantically augmented distances also push these patients closer to other patients with a different diagnosis. If the net effect of semantic augmentation is to make each patient closer both to patients with the same diagnosis and to patients with a different diagnosis, there will be no net gain in the ability to cluster or classify patients by diagnosis. The non-intuitive failure of semantic augmentation to improve classification and clustering performance can be illustrated by returning to the hypothetical patient A with resting tremor and the hypothetical patient B with postural tremor. If the diagnosis of patient A is Parkinson disease and the diagnosis of patient B is essential tremor (as is likely), then semantically augmented distance metrics will move patient A closer to B. However, since the diagnoses of patient A and patient B are different, moving patient A closer to patient B will degrade classification and clustering performance in this case.
Implications for neurological diagnosis
The accuracy of diagnosis for the 32 neurological diagnoses in this study ranged from 76 to 86% with the k-nearest neighbor classifier (Fig. 4). In one study, human experts made neurologic diagnoses at the bedside with an accuracy of 77%. Liu et al. observe “machine learning methods can only be as good as the information in the training set … machine-learning methods should not be able to exceed the performance of extremely careful and experienced clinicians …. ” Machine learning can offer insights into which diseases are more variable in presentation than others (Fig. 3) and which diagnostic problems are more challenging to solve than others (Fig. 6). Furthermore, machine learning may offer improvements in patient matching strategies for large repositories of archetypal disease profiles such as the Online Mendelian Inheritance in Man [4, 5, 12].
One limitation of this study is that we did not consider the severity of deficits, such as weakness or ataxia. When deficits were present, they were binarized as either present or absent and not graded in severity. Another limitation is that some of the diagnosis classes were narrower than others. Although some of the diagnosis classes were specific (Huntington disease, Alzheimer disease, and Parkinson disease), others were more general, such as polyneuropathy, myopathy, and meningitis. This decision to use more general categories for some diagnosis classes reflects the reality that signs and symptoms alone are unlikely to distinguish specific causes of meningitis, polyneuropathy, or myopathy without additional ancillary testing. Another limitation is that we did not compare the computed patient distances to expert opinion for any of the distance metrics. The validity of the results would be improved by a larger dataset of patients, preferably in the thousands rather than in the hundreds. A further limitation of the study is that we utilized published cases from the textbooks of neurology rather than de-identified patient records from electronic medical records. We used manual abstraction of concepts from case histories instead of natural language processing (NLP) [67,68,69,70]. We chose manual abstraction rather than NLP because we wanted to carefully curate a database of test patients with minimal coding errors, and our initial experience with MetaMap indicated that extensive post-processing was needed to ensure accuracy. Future advances in NLP could make the conversion of signs and symptoms in electronic health records to machine-readable codes more accurate and efficient. Inter-rater reliability for abstracting clinical cases into UMLS codes or SNOMED CT codes is another concern [20, 21].
Neurological signs and symptoms from case histories can be represented as UMLS concepts from a neuro-ontology. We examined four different distance metrics for the calculation of inter-patient distances. All of the distance metrics provided useful patient distances that could be utilized by machine learning classification and clustering algorithms. Semantically augmented metrics that used the semantic similarity between neurological concepts to calculate patient distances yielded lower patient distances than more traditional distance metrics without semantic augmentation. When each of the four distance metrics was tested on four classifiers and two clustering algorithms, all distance metrics performed similarly without a discernible improvement due to semantic augmentation. Further work is needed to determine the utility of semantically augmenting patient distance metrics with inter-concept distances.
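The observation that semantic augmentation lowers patient distances can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the toy is-a hierarchy, the path-based concept distance, and the averaged best-match set distance are simplified stand-ins and are not the neuro-ontology or the exact augmented metrics used in the study.

```python
# Hypothetical toy is-a hierarchy (child -> parent); not the study's neuro-ontology.
PARENT = {
    "ataxia": "coordination finding",
    "dysmetria": "coordination finding",
    "coordination finding": "finding",
    "weakness": "motor finding",
    "motor finding": "finding",
}
MAX_STEPS = 4  # longest leaf-to-leaf path in this toy hierarchy

def ancestors(concept: str) -> list:
    """Return the path from a concept up to the root, including the concept."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def concept_distance(c1: str, c2: str) -> float:
    """Path length through the lowest common ancestor, scaled to [0, 1]."""
    p1, p2 = ancestors(c1), ancestors(c2)
    common = next(c for c in p1 if c in p2)
    return (p1.index(common) + p2.index(common)) / MAX_STEPS

def augmented_distance(a: set, b: set) -> float:
    """Average best-match concept distance, symmetrized (non-empty sets only)."""
    def directed(x, y):
        return sum(min(concept_distance(c, d) for d in y) for c in x) / len(x)
    return (directed(a, b) + directed(b, a)) / 2

def jaccard_distance(a: set, b: set) -> float:
    """Standard Jaccard distance, which ignores concept similarity."""
    return 1.0 - len(a & b) / len(a | b)

patient_a = {"ataxia", "weakness"}
patient_b = {"dysmetria", "weakness"}

d_standard = jaccard_distance(patient_a, patient_b)
d_augmented = augmented_distance(patient_a, patient_b)
print(f"Jaccard distance:   {d_standard:.3f}")
print(f"Augmented distance: {d_augmented:.3f}")
```

In this toy case the augmented distance is smaller because ataxia and dysmetria share a parent concept and therefore count as a partial match, whereas the Jaccard distance treats them as entirely different findings.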
- CUI: UMLS concept unique identifier
- UMLS: Unified Medical Language System
- SNOMED CT: A registered name of SNOMED International
- NLP: Natural language processing
- HPO: Human Phenotype Ontology
- OMIM: Online Mendelian Inheritance in Man
Campbell WW. Diagnosis and localization of neurologic disease, Chapter 53. In Dejong's The neurologic examination. 7th edition. Lippincott Williams and Wilkins, Philadelphia, 2013, pp. 769–795.
Beaulieu-Jones B, Finlayson SG, Chivers C, Chen I, McDermott M, Kandola J, Dalca AV. Trends and focus of machine learning applications for health research. JAMA Netw Open. 2019;2:1–12. https://doi.org/10.1001/jamanetworkopen.2019.14051.
Parimbelli E, Marini S, Sacchi L, Bellazzi R. Patient similarity for precision medicine: a systematic review. J Biomed Inform. 2018;83:87–96. https://doi.org/10.1016/j.jbi.2018.06.001.
Xue H, Peng J, Shang X. Predicting disease-related phenotypes using an integrated phenotype similarity measurement based on HPO. BMC Syst Biol. 2019;13:1–12. https://doi.org/10.1186/s12918-019-0697-8.
Peng J, Xue H, Shao Y, Shang X, Wang Y, Chen J. Measuring phenotype semantic similarity using Human Phenotype Ontology. Proc 2016 IEEE Int Conf Bioinforma Biomed (BIBM 2016). 2017:763–6. https://doi.org/10.1109/BIBM.2016.7822617.
Pai S, Bader GD. Patient similarity networks for precision medicine. J Mol Biol. 2018;430:2924–38. https://doi.org/10.1016/j.jmb.2018.05.037.
Yang S, Stansbury LG, Rock P, Scalea T, Hu PF. Linking big data and prediction strategies: tools, pitfalls, and lessons learned. Crit Care Med. 2019;47:840–8. https://doi.org/10.1097/CCM.0000000000003739.
Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Heal Inf Sci Syst. 2014;2:1–10. https://doi.org/10.1186/2047-2501-2-3.
Deng Y, Gao L, Wang B, Guo X. HPOSim: an r package for phenotypic similarity measure and enrichment analysis based on the human phenotype ontology. PLoS One. 2015;10:1–12. https://doi.org/10.1371/journal.pone.0115692.
Su S, Zhang L, Liu J. An effective method to measure disease similarity using gene and phenotype associations. Front Genet. 2019;10:1–8. https://doi.org/10.3389/fgene.2019.00466.
Alanazi HO, Abdullah AH, Qureshi KN. A critical review for developing accurate and dynamic predictive models using machine learning methods in medicine and health care. J Med Syst. 2017;41. https://doi.org/10.1007/s10916-017-0715-6.
Köhler S, Schulz MH, Krawitz P, Bauer S, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet. 2009;85:457–64. https://doi.org/10.1016/j.ajhg.2009.09.003.
Girardi D, Wartner S, Halmerbauer G, Ehrenmüller M, Kosorus H, Dreiseitl S. Using concept hierarchies to improve calculation of patient similarity. J Biomed Inform. 2016;63:66–73. https://doi.org/10.1016/j.jbi.2016.07.021.
Jia Z, Lu X, Duan H, Li H. Using the distance between sets of hierarchical taxonomic clinical concepts to measure patient similarity. BMC Med. Inform. Decis. Mak. 2019;19:1–11. https://doi.org/10.1186/s12911-019-0807-y.
Sharafoddini A, Dubin JA, Lee J. Patient similarity in prediction models based on health data: a scoping review. JMIR Med Inform. 2017;5(1):e7. https://doi.org/10.2196/medinform.6730.
Melton GB, Parsons S, Morrison FP, Rothschild AS, Markatou M, Hripcsak G. Inter-patient distance metrics using SNOMED CT defining relationships. J Biomed Inform. 2006;39:697–705. https://doi.org/10.1016/j.jbi.2006.01.004.
Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, Biberstine JR, Schijvenaars B, Skupin A, Ma N, Börner K. Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches. PLoS One. 2011;6. https://doi.org/10.1371/journal.pone.0018029.
Garcia Castro LJ, Berlanga R, Garcia A. In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. J Biomed Inform. 2015;57:204–18. https://doi.org/10.1016/j.jbi.2015.07.015.
Mabotuwana T, Lee MC, Cohen-Solal EV. An ontology-based similarity measure for biomedical data-application to radiology reports. J Biomed Inform. 2013;46(5):857–68. https://doi.org/10.1016/j.jbi.2013.06.013.
Andrews JE, Richesson RL, Krischer J. Variation of SNOMED CT coding of clinical research concepts among coding experts. J Am Med Inform Assoc. 2007;14(4):497–506.
Chiang MF, Hwang JC, Yu AC, Casper DS, Cimino JJ, Starren J. Reliability of SNOMED-CT coding by three physicians using two terminology browsers. AMIA Annu Symp Proc. 2006:131–5.
Bhattacharyya SB. Introduction to SNOMED CT. Singapore: Springer; 2016.
Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl_1):D267–70. https://doi.org/10.1093/nar/gkh061.
Hier DB, Brint SU. A Neuro-ontology for the neurological examination. BMC Med Inform Decis Mak. 2020;20:47. https://doi.org/10.1186/s12911-020-1066-7.
Choi SS, Cha SH, Tappert CC. A survey of binary similarity and distance measures, WMSCI 2009 - 13th world multi-conference Syst. Cybern. Informatics, jointly with 15th Int. Conf. Inf. Syst. Anal. Synth. ISAS 2009 - Proc 3 (2009) 80–85.
Tashkandi A, Wiese I, Wiese L. Efficient in-database patient similarity analysis for personalized medical decision support systems. Big Data Res. 2018;13:52–64. https://doi.org/10.1016/j.bdr.2018.05.001.
Haase P, Siebes R, van Harmelen F. Peer selection in peer-to-peer networks with semantic topologies. In: Bouzeghoub M, Goble C, Kashyap V, Spaccapietra S, editors. Semantics of a networked world. Semantics for grid databases. ICSNW 2004. Lecture notes in computer science, vol 3226. Berlin, Heidelberg: Springer; 2004. https://doi.org/10.1007/978-3-540-30145-5_7.
Rada R, Mili H, Bicknell E, Blettner M. Development and application of a metric on semantic nets. IEEE Trans Syst Man Cybern. 1989;19(1):17–30.
Wu Z, Palmer M. Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics; 1994. p. 133–8.
Leacock C, Chodorow M. Combining local context and WordNet similarity for word sense identification. WordNet. 1998. https://doi.org/10.7551/mitpress/7287.003.0018.
Resnik P. Using information content to evaluate semantic similarity in a taxonomy. (1995) http://arxiv.org/abs/cmp-lg/9511007.
Jiang JJ, Conrath DW. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In Proceedings of International Conference Research on Computational Linguistics (ROCLING X). (1997) Taiwan, pp 19–33, https://www.aclweb.org/anthology/O97-1002.
Lin D. An information-theoretic definition of similarity. In: ICML 1998: Proceedings of the Fifteenth International Conference on Machine Learning; 1998. p. 296–304.
Lee W, Shah N, Sundlass K, Musen M. Comparison of ontology-based semantic-similarity measures. AMIA Annu Symp Proc. 2008:384–8.
McInnes BT, Pedersen T. Evaluating semantic similarity and relatedness over the semantic grouping of clinical term pairs. J Biomed Inform. 2015;54:329–36. https://doi.org/10.1016/j.jbi.2014.11.014.
Caviedes JE, Cimino JJ. Towards the development of a conceptual distance metric for the UMLS. J Biomed Inform. 2004;37:77–85. https://doi.org/10.1016/j.jbi.2004.02.001.
Al-Mubaid H, Nguyen HA, A cluster-based approach for semantic similarity in the biomedical domain, Annu. Int. Conf. IEEE Eng. Med. Biol. Proc. (2006) 2713–2717.
Pedersen T, Pakhomov SVS, Patwardhan S, Chute CG. Measures of semantic similarity and relatedness in the biomedical domain. J Biomed Inform. 2007;40:288–99. https://doi.org/10.1016/j.jbi.2006.06.004.
The MathWorks Inc. What is machine learning?, Retrieved at https://www.mathworks.com/discovery/machine-learning.html.
The Mathworks Inc. Supervised learning workflows and algorithms. Retrieved at https://www.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html.
The Mathworks Inc. Unsupervised learning. Retrieved at https://www.mathworks.com/discovery/unsupervised-learning.html.
Al-Jabery KK, Obafemi-Ajayi T, Olbricht GR, Wunsch DC II, editors. Computational learning approaches to data analytics in biomedical applications. Academic Press; 2020. https://doi.org/10.1016/B978-0-12-814482-4.05001-4.
Rosenberg A, Hirschberg J. V-Measure: A conditional entropy-based external cluster evaluation measure, EMNLP-CoNLL 2007 - Proc. 2007 Jt. Conf. Empir. Methods Nat. Lang. Process. Comput. Nat. Lang. Learn. (2007) 410–420.
Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–50. https://doi.org/10.1080/01621459.1971.10482356.
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7.
Kellar SP, Kelvin EA. Munro's statistical methods for healthcare research. 6th ed. Philadelphia: Wolters Kluwer; 2013.
Blumenfeld H. Neuroanatomy through clinical cases. 2nd ed. Sunderland, MA: Sinauer Associates; 2010.
Macleod M, Simpson M, Pal S. Neurology: clinical cases uncovered. West Sussex UK: Wiley-Blackwell; 2011.
Noseworthy JH. Fifty neurologic Cases from Mayo Clinic. Oxford UK: Oxford University Press; 2004.
Pendlebury ST, Anslow P, Rothwell PM. Neurological case histories. Oxford UK: Oxford University Press; 2007.
Toy EC, Simpson E, Mancias P, Furr-Stimming EE. Case files neurology. 3rd ed. New York: McGraw-Hill; 2018.
Waxman SG. Clinical Neuroanatomy. 28th ed. New York: McGraw Hill; 2017.
Hauser SL, Levitt LP, Weiner HL. Case studies in neurology for the house officer. Baltimore: Williams and Wilkins; 1986.
Liveson JA, Spielholz N. Peripheral neurology: case studies in electrodiagnosis. Philadelphia: FA Davis Company; 1979.
Gauthier SG, Rosa-Neto P. Case studies in dementia. Cambridge UK: Cambridge University Press; 2011.
Erro R, Stamelou M, Bhatia K. Case studies in movement disorders. Cambridge UK: Cambridge University Press; 2017.
Solomon T, Michael BD, Miller A, Kneen R. Case studies in neurological infections of adults and children. Cambridge UK: Cambridge University Press; 2019.
Howard J, Singh A. Neurology image-based clinical review. New York: Demos Publishing; 2017.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30 http://jmlr.org/papers/v12/pedregosa11a.html.
Jaccard P. The distribution of the flora in the alpine zone. New Phytol. 1912;11:37–50. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x.
Jana N, Barik S, Arora N. Current use of medical eponyms: a need for global uniformity in scientific publications. BMC Med Res Methodol. 2009;9:18. https://doi.org/10.1186/1471-2288-9-18.
Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58:236–44. https://doi.org/10.1080/01621459.1963.10500845.
Xu R, Wunsch DC II. Clustering. Wiley-IEEE Press; 2008.
Xu R, Wunsch DC II. Clustering algorithms in biomedical research: a review. IEEE Rev Biomed Eng. 2010;3:120–54.
Chimowitz MI, Logigian EL, Caplan LR. The accuracy of bedside neurological diagnoses. Ann Neurol. 1990;28:78–85. https://doi.org/10.1002/ana.410280114.
Liu Y, Chen PHC, Krause J, Peng L. How to read articles that use machine learning: users' guides to the medical literature. JAMA. 2019;322:1806–16. https://doi.org/10.1001/jama.2019.16489.
Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J. Am. Med. Informatics Assoc. 2010;17:229–36. https://doi.org/10.1136/jamia.2009.002733.
Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, Chute CG. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J Am Med Informatics Assoc. 2010;17:507–13. https://doi.org/10.1136/jamia.2009.001560.
Kreimeyer K, Foster M, Pandey A, Arya N, Halford G, Jones SF, Forshee R, Walderhaug M, Botsis T. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J Biomed Inform. 2017;73:14–29. https://doi.org/10.1016/j.jbi.2017.07.012.
Reátegui R, Ratté S. Comparison of MetaMap and cTAKES for entity extraction in clinical notes. BMC Med Inform Decis Mak. 2018;18:74. https://doi.org/10.1186/s12911-018-0654-2.
Partial support for this research was received from the Missouri University of Science and Technology Intelligent Systems Center, the Mary K. Finley Missouri Endowment, the National Science Foundation, the Lifelong Learning Machines program from DARPA/Microsystems Technology Office, and the Army Research Laboratory (ARL); and it was accomplished under Cooperative Agreement Number W911NF-18-2-0260. The research was also sponsored by the Leonard Wood Institute in cooperation with the ARL and was accomplished under Cooperative Agreement Number W911NF-14-2-0034. The views and conclusions contained in this document are those of the authors. They should not be interpreted as representing the official policies, either expressed or implied, of the Leonard Wood Institute, the ARL, or the United States Government. The United States Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
Ethics approval and consent to participate
The Institutional Review Board of the University of Illinois at Chicago approved this work. No consent to participate was required for this work.
Consent for publication
None to report.
Cite this article
Hier, D.B., Kopel, J., Brint, S.U. et al. Evaluation of standard and semantically-augmented distance metrics for neurology patients. BMC Med Inform Decis Mak 20, 203 (2020). https://doi.org/10.1186/s12911-020-01217-8