Study on the semi-supervised learning-based patient similarity from heterogeneous electronic medical records

Background: A new learning-based patient similarity measurement was proposed to measure the similarity between patients using heterogeneous electronic medical record (EMR) data.

Methods: We first calculated feature-level similarities according to the features' attributes. A domain expert provided patient similarity scores for 30 randomly selected patients. These similarity scores and the feature-level similarities of the 30 patients comprised the labeled sample set, which was used by the semi-supervised learning algorithm to learn the patient-level similarities for all patients. We then used the k-nearest neighbor (kNN) classifier to predict four liver conditions. The predictive performances were compared in four different situations. We also compared the performances of personalized kNN models with those of other machine learning models. We assessed the predictive performances by the area under the receiver operating characteristic curve (AUC), F1-score, and cross-entropy (CE) loss.

Results: As the size of the random training samples increased, the kNN models using the learned patient similarity to select near neighbors consistently outperformed those using the Euclidean distance (all P values < 0.001). The kNN models using the learned patient similarity to identify the top k nearest neighbors from the random training samples also achieved a higher best performance (AUC: 0.95 vs. 0.89, F1-score: 0.84 vs. 0.67, and CE loss: 1.22 vs. 1.82) than those using the Euclidean distance. As the size of the similar training samples (the most similar samples determined by the learned patient similarity) increased, the performance of kNN models using the simple Euclidean distance to select near neighbors degraded gradually. When the roles of the Euclidean distance and the learned patient similarity in selecting near neighbors and similar training samples were exchanged, the performance of the kNN models gradually increased. These two kinds of kNN models reached the same best performance of AUC 0.95, F1-score 0.84, and CE loss 1.22. Among the four reference models, the highest AUC and F1-score were 0.94 and 0.80, respectively, both lower than those for the simple and similarity-based kNN models.

Conclusions: This learning-based method opens an opportunity for similarity measurement based on heterogeneous EMR data and supports the secondary use of EMR data.


Background
In recent years, the sharp increase in electronic medical record (EMR) adoption has facilitated the move toward personalized medicine [1][2][3]. The wide use of EMRs has opened opportunities for making personalized decisions for a given patient [4]. In particular, EMR data for similar patients, in conjunction with machine learning algorithms, have been applied to build individualized models to predict the risk of a potential disease [5][6][7] or to identify risk factors [8,9] and disease subgroups [10,11], showing improved performance in these tasks.
EMR data are usually heterogeneous, making it more difficult than with common homogeneous data to measure patient similarity for further use. Many non-numeric features, such as diseases and procedures, are important and informative in assisting clinicians in making a diagnostic decision. Symptoms and signs, often recorded in free-text form, are also useful. Generally, these features cannot be input together with numeric features into similarity- or distance-based machine learning models, owing to the difficulty of computing a uniform similarity or distance for features with different data types. For example, the widely used Euclidean distance is effective in measuring the distance between continuous variables but not between discrete variables. Moreover, traditional distance measurements do not consider the ontological distance within hierarchical code systems. To address these problems, some researchers calculated feature similarities separately and then integrated them into a single value with a weighted sum [5] or geometric mean [3,12]. However, the parameters used in integrating feature similarities into a single patient similarity, and their settings, may depend on subjective judgment and lack conviction.
In this study, we proposed a new learning-based method to obtain patient similarity after the direct computation of feature similarities. We first calculated feature-level similarities using different similarity measurements for different types of features. They were then input into a semi-supervised learning (SSL) algorithm to learn the patient-level similarities. The proposed patient similarity was validated by selecting similar samples to predict the status of liver diseases. Liver diseases are usually complex and multifactorial, with a high global incidence [11], and have affected over one-fifth of the population in China [13]. Non-alcoholic fatty liver disease (NAFLD) has been identified as an emerging health problem worldwide [14], and liver hemangiomas are the most common benign liver tumors, occurring in people of all ages [15,16]. Also, China has the 9th highest rate and the largest number of liver cancer patients in the world [17]. Thus, there is an urgent need to help clinicians make proper diagnostic decisions for patients with these liver diseases. Consequently, the proposed patient similarity was applied in predictive models for three common liver diseases (liver cancer, hemangioma, and NAFLD) to assist clinicians in diagnosis.
To the best of our knowledge, this is the first time that patient similarity has been learned through an SSL algorithm using calculated feature similarities. The method can be easily implemented in other tasks and is likely to be an important step in facilitating personalized medicine based on heterogeneous EMR data.

Related work
Patient similarity has become a hot topic in recent years, with many researchers using it as a tool to enable precision medicine. For heterogeneous EMR data, two strategies were usually adopted when evaluating patient similarity. In the first strategy, all features were standardized in advance for a static distance metric [18]. Lee et al. [19] used the cosine similarity metric to identify similar patients for downstream 30-day mortality prediction based on the MIMIC-II database; all predictor variables were represented as a numeric vector to yield the cosine-based similarity metric. David et al. [20] utilized a Euclidean distance-based metric to select similar patients for anomaly detection and characterization on the basis of numeric laboratory data. Li et al. [10] assembled different types of variables into a numeric data matrix, with a one-hot representation for non-numeric ICD-9-CM codes; the cosine distance metric was then used to construct a patient-patient network for further disease subtype identification. Gu et al. [21] adopted a weighted Euclidean distance to evaluate similarity for both continuous and discrete variables. These studies either considered only numeric variables [20,21] or standardized all variables in advance for further patient similarity measurement [10,19]. This strategy might result in feature information loss during standardization [18] and did not utilize all EMR data features.
Therefore, many researchers chose the second strategy to evaluate patient similarity for heterogeneous EMR data, in which a list of feature similarities was calculated for each data type separately. Wang et al. [5] built predictive models for diabetes risk prediction based on EMR data. Feature similarities were first evaluated according to the variables' attributes, and their weighted sum formed a single patient similarity score, in which the weights were allocated by trial and error. Gottlieb et al. [3,12] assessed feature similarities and integrated them into a single similarity score with a weighted geometric mean. Huang et al. [22] also calculated feature similarities separately and compared all the possible combinations of three types of disease code, three laboratory test sets, and three weight allocation schemes. Zheng et al. [23] used the supervised XGBoost algorithm to learn patient similarities. Pairwise similarities for symptoms, lab tests, and preliminary diagnoses served as inputs of the XGBoost model, and those for discharge diagnoses as the output. For a new patient, the calculated feature similarities with all patients were given to the trained XGBoost model to predict the patient similarities for downstream discharge diagnosis prediction. In these studies, weights were allocated subjectively, leading to a lack of conviction to some degree. Besides, the supervised label and learning effect of the XGBoost-based method might vary with the similarity assessment method adopted for discharge diagnoses.
We proposed a semi-supervised learning method to cover all available EMR data features, weight each feature similarity automatically, and save the time and labor costs of labeling.

Overview
In this study, the proposed learning-based method included two successive steps: calculating the feature similarities directly and learning the patient similarity in a semi-supervised way. The labeled sample set, a small set of computed feature similarities with corresponding patient similarity scores obtained from a clinical expert, was first used to learn the optimal distance measurement. We then dynamically expanded the labeled sample set with a snowballing process. Figure 1 shows an overview of the whole study. Similar patients determined by the resulting similarity were finally used for identifying the presence of three liver diseases.

Feature similarity calculation
In our study, EMR data consisted of four parts: demographic information, laboratory tests, comorbidity conditions, and free-text radiology reports. Similarities for structured and unstructured, continuous and categorical features were estimated separately according to their attributes.

Feature similarity for comorbidity condition
A patient's comorbidity conditions referred to a group of simultaneous diseases, identified by International Classification of Diseases, tenth revision (ICD-10) codes [24]. An ICD-10 code begins with a letter followed by up to five digits, arranged in a tree-like hierarchical manner [5]; in the present study, each code was simplified into its leading letter and the three following digits [5,22]. Taking the hierarchical ontology of the ICD-10 code system into account, we measured the similarity between two ICD-10 codes by their information content (IC), which reflects the degree to which the two codes share information: two ICD-10 codes are more similar if they have a higher IC. The IC index for two ICD-10 codes ICD_1 and ICD_2 was defined as follows [25]:

IC(ICD_1, ICD_2) = −log p(NCA(ICD_1, ICD_2))

where NCA(ICD_1, ICD_2) represents the nearest common ancestor of the two codes, and p(NCA) is the probability of encountering the NCA in the study corpus. The corpus consisted of all the possible ICD-10 code fragments derived from the ICD-10 codes appearing in any patient's record in this study. An ICD-10 code fragment might be the leading letter alone or the leading letter plus the sequential one, two, or three digits of an ICD-10 code (Fig. 2a). Obviously, in the study corpus, the probability of encountering an ICD-10 code fragment that contained the leading letter K alone was not less than that of encountering K2. The larger the probability, the smaller the IC value. This coincides with the intuitive insight that two patients are more similar if they both have the same rare disease than if they both have a common flu. The IC value of two ICD-10 codes without an NCA (i.e., with different leading letters) was defined as 0, since they share no information.
Let X = {ICD_1, ICD_2, ..., ICD_m} denote the comorbidity conditions for patient i with m comorbidities and Y = {ICD_1, ICD_2, ..., ICD_n} those for patient j with n comorbidities. The feature similarity for the comorbidity condition between the two patients was defined as the average IC over all code pairs:

Sim_comorbidity(i, j) = (1 / (m n)) Σ_{ICD_a ∈ X} Σ_{ICD_b ∈ Y} IC(ICD_a, ICD_b)

where ICD_a ∈ X and ICD_b ∈ Y. Figure 2 shows an example of calculating the feature similarity for the comorbidity condition.
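To make the IC computation concrete, the following Python sketch builds the fragment corpus and computes the IC-based code similarity. The aggregation across code sets (a mean over all cross pairs) is one common choice and an assumption here, not necessarily the article's exact formula.

```python
import math
from collections import Counter

def fragments(code):
    # Qualified fragments of a simplified ICD-10 code, e.g. "K269" -> ["K", "K2", "K26", "K269"]
    return [code[:i] for i in range(1, len(code) + 1)]

def build_corpus(all_codes):
    # Corpus of all fragment occurrences derived from every patient's codes
    counts = Counter(f for code in all_codes for f in fragments(code))
    total = sum(counts.values())
    return counts, total

def nca(c1, c2):
    # Nearest common ancestor: the longest shared prefix fragment
    common = ""
    for a, b in zip(c1, c2):
        if a != b:
            break
        common += a
    return common or None

def ic(c1, c2, counts, total):
    a = nca(c1, c2)
    if a is None:          # different leading letters: no shared information
        return 0.0
    p = counts[a] / total  # probability of encountering the NCA in the corpus
    return -math.log(p)

def comorbidity_similarity(X, Y, counts, total):
    # Hypothetical aggregation: mean IC over all cross pairs
    # (the original article's exact aggregation may differ)
    return sum(ic(a, b, counts, total) for a in X for b in Y) / (len(X) * len(Y))
```

Note how a rare shared ancestor (low corpus probability) yields a high IC, matching the intuition that sharing a rare disease makes two patients more similar than sharing a common one.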

Feature similarity for text feature
Free-text radiological reports on abdominal computed tomography (CT) stored in the EMR system were initially unstructured, with text features recorded as words or phrases. Therefore, the radiological reports were first structured by a natural language processing (NLP) method using the term frequency-inverse document frequency (TF-IDF) score [26] to extract the text features important for disease diagnosis. We selected the r phrases with the highest TF-IDF scores as the most important and representative ones, yielding r diagnostically useful text features for the subsequent study. Each text feature was organized as a binary variable: if a specific phrase existed within a patient's radiological report, the patient was considered to have the corresponding text feature and the value of this feature variable was set to 1, and to 0 otherwise (Fig. 3). After the structuration of the radiological reports, each patient had an r-dimension 0-1 vector of text features.
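As an illustration of this selection step, a minimal TF-IDF sketch assuming the reports have already been segmented into phrases; the scoring variant and the helper names (`top_r_features`, `binarize`) are illustrative, not the authors' implementation.

```python
import math
from collections import Counter

def tfidf_scores(documents):
    # documents: list of token lists (phrases segmented from the reports)
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))            # document frequency per phrase
    scores = Counter()
    for doc in documents:
        tf = Counter(doc)
        for term, count in tf.items():
            # classic tf-idf; keep the best score observed for each phrase
            score = (count / len(doc)) * math.log(n / df[term])
            scores[term] = max(scores[term], score)
    return scores

def top_r_features(documents, r):
    # the r phrases with the highest TF-IDF scores
    return [t for t, _ in tfidf_scores(documents).most_common(r)]

def binarize(doc, features):
    # r-dimensional 0-1 vector: 1 if the phrase occurs in this report
    return [1 if f in doc else 0 for f in features]
```

Phrases that occur in every report score zero under this idf variant, so only discriminative phrases survive the top-r cut.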
Among all the extracted text features, retroperitoneal lymph node enlargement, lower density than the spleen's, absence of abnormal density, uniform density of liver parenchyma, and arterial phase were considered the most important features according to the radiologists' advice. They were treated as independent binary features when calculating the feature similarity, while the rest were organized into a binary set of text features from the radiological reports. The feature similarity for each independent binary text feature between patients i and j was defined as in Eq. (3), i.e., 1 if the two patients had the same feature value and 0 otherwise. The feature similarity for the text feature sets of patients i and j was defined as the Jaccard similarity between the two sets.

Fig. 2 Calculation of the comorbidity condition similarity using information content. a The pre-built study corpus: K, K2, K26, and K269 and E, E1, E11, and E116 are the qualified ICD-10 code fragments from K269 and E116, respectively; b identification of the nearest common ancestor (NCA): K29 is the NCA for K295 and K293, K2 for K295 and K269, and K for the remaining six pairs of ICD-10 codes; c calculation of the comorbidity condition similarity using information content.

Fig. 3 The structuration process of radiology reports.

Feature similarity for laboratory tests
We used the classification and regression trees (CART) technique, which can handle both discrete and continuous input, to impute the missing values of laboratory tests (using the rpart function in the rpart package of R 3.5.1; https://cran.r-project.org/). All laboratory test items were binarized with respect to their normal ranges, 0 standing for normal and 1 for abnormal. The feature similarity for two patients' laboratory tests was then defined as the Jaccard distance-based similarity between the two laboratory test sets.
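The Jaccard-style similarity on binarized abnormal-flag vectors can be sketched as follows; treating two all-normal patients as fully similar is an edge-case convention assumed here, not stated in the article.

```python
def jaccard_similarity(u, v):
    # u, v: equal-length 0/1 vectors of abnormal-result indicators
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)     # abnormal in both
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)    # abnormal in either
    return both / either if either else 1.0  # convention: two all-normal patients match
```

The same function applies to the binary set of common text features extracted from the radiological reports.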

Feature similarity for demographic information
Demographic characteristics included the patient's age, sex, drug allergy history, and source of hospital admission. We defined the feature similarity for ages as the relative ratio of the two ages [5]:

Sim_age(i, j) = min(age_i, age_j) / max(age_i, age_j)

The similarity for binary features, including sex, drug allergy history, and source of hospital admission, between patients i and j was defined as 1 if the two patients had the same information and 0 otherwise [Eq. (3)].
In total, four demographic features (age, sex, drug allergy history, and the source of hospital admission), five text features specific to the predictive task, and three feature sets (comorbidity conditions, laboratory tests, and common text features from the radiological reports) were involved in the calculation of the feature-level similarity, where

Similarity for binary features(i, j) = 1, if patients i and j had the same information; 0, otherwise. (3)

A resulting 12-dimension similarity vector was generated for each pair of patients.
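A sketch of how such a 12-dimension similarity vector might be assembled; the dictionary keys and feature names are hypothetical placeholders, and the age similarity follows the relative-ratio description in the Discussion.

```python
def age_similarity(age_i, age_j):
    # relative ratio of the two ages (as described in the Discussion)
    return min(age_i, age_j) / max(age_i, age_j)

def binary_similarity(x_i, x_j):
    # Eq. (3): 1 if the two patients share the same information, else 0
    return 1 if x_i == x_j else 0

def feature_similarity_vector(p_i, p_j, set_sims):
    # p_i, p_j: dicts of the 4 demographic and 5 independent text features
    # (key names here are illustrative placeholders);
    # set_sims: precomputed similarities for the 3 feature sets
    # (comorbidities, lab tests, common text features)
    v = [age_similarity(p_i["age"], p_j["age"])]
    for key in ["sex", "allergy", "admission_source",
                "lymph_node", "low_density", "no_abnormal_density",
                "uniform_parenchyma", "arterial_phase"]:
        v.append(binary_similarity(p_i[key], p_j[key]))
    v.extend(set_sims)      # 3 set-level similarities -> 12 dimensions total
    return v
```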

Semi-supervised patient similarity learning
In this study, we randomly selected a tiny fraction of the overall patients. An experienced medical domain expert provided a similarity score ranging from 0 to 1 for each of the possible pairs among these patients. We also calculated the weighted sum of the feature similarities as a reference patient similarity for each patient pair, with the weights determined by trial and error; the expert could adjust his score with reference to this calculated patient similarity. A patient pair with both the feature similarity vector (calculated in section Feature similarity calculation) and the patient similarity score (assigned by the expert) was considered a labeled sample. The labeled sample set thus consisted of all labeled samples, and the patient pairs without any known similarity score comprised the unlabeled sample set. Further, the continuous label was transformed into a new categorical label, "constraint." Constraint values of must-link, general-link, and cannot-link were given to patient pairs with higher, median, and lower similarity scores, respectively, with the cutoff points set to the upper and lower tertiles of the similarity scores. Patient pairs with the same constraint belonged to the same class. We used both the continuous and the categorical labels in the semi-supervised similarity learning.
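The tertile-based constraint assignment can be sketched as follows; the handling of scores that fall exactly on a cutoff is an assumption.

```python
def assign_constraints(scored_pairs):
    # scored_pairs: list of (pair_id, similarity_score);
    # cutoffs at the lower and upper tertiles of the observed scores
    scores = sorted(s for _, s in scored_pairs)
    n = len(scores)
    lower, upper = scores[n // 3], scores[(2 * n) // 3]
    constraints = {}
    for pid, s in scored_pairs:
        if s >= upper:
            constraints[pid] = "must-link"     # highest-scoring tertile
        elif s >= lower:
            constraints[pid] = "general-link"  # middle tertile
        else:
            constraints[pid] = "cannot-link"   # lowest-scoring tertile
    return constraints
```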
Based on the labeled sample set, our goal was to learn a Mahalanobis distance between a patient pair. For a patient pair x_i and x_j, the Mahalanobis distance d_M(x_i, x_j) was defined as

d_M(x_i, x_j) = sqrt((x_i − x_j)^T C (x_i − x_j)) (6)

where C ∈ R^(d×d) (d = 12, as described in section Feature similarity for demographic information) was a positive semi-definite covariance matrix. The learning aimed to find the optimal C that simultaneously minimized the within-class squared distances and maximized the between-class squared distances among all patient pairs in the labeled sample set. We first constructed a goal function as the ratio of the within-class squared distances to the between-class squared distances, then transformed the learning problem into a trace quotient minimization problem with matrix decomposition. Finally, we solved this trace quotient minimization problem with the decomposed Newton's method [27]. The optimization of matrix C on the labeled sample set was a supervised learning process. See the supplementary materials for more details about the learning process.
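Once C is learned, evaluating Eq. (6) is straightforward; a dependency-free sketch (with C equal to the identity, the distance reduces to the Euclidean distance).

```python
import math

def mahalanobis(x_i, x_j, C):
    # d_M(x_i, x_j) = sqrt((x_i - x_j)^T C (x_i - x_j))
    # C: learned positive semi-definite d x d matrix (list of rows)
    d = [a - b for a, b in zip(x_i, x_j)]
    Cd = [sum(C[r][c] * d[c] for c in range(len(d))) for r in range(len(d))]
    return math.sqrt(sum(d[r] * Cd[r] for r in range(len(d))))
```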
Once the matrix C was fixed, it was used to calculate the Mahalanobis distances [Eq. (6)] between the labeled and unlabeled samples. A small batch of unlabeled samples obtained similarity labels from their nearest labeled neighbors as measured by the Mahalanobis distance and became members of the labeled sample set. The dynamically expanded labeled sample set then provided near neighbors for the next batch of unlabeled samples. This process was repeated until all unlabeled samples had a similarity label. Based on trial and error, we set the batch size to 1153 (about 0.15% of all the unlabeled samples) in each round.
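The snowballing expansion described above can be sketched as follows; samples are represented as hashable tuples, and `distance` stands in for the learned Mahalanobis metric.

```python
def propagate_labels(labeled, unlabeled, distance, batch_size):
    # labeled: dict sample -> similarity label;
    # unlabeled: iterable of samples awaiting labels;
    # distance: callable metric (e.g. the learned Mahalanobis distance)
    unlabeled = list(unlabeled)
    while unlabeled:
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        new_labels = {}
        for u in batch:
            # label from the nearest already-labeled neighbor under the metric
            nearest = min(labeled, key=lambda s: distance(u, s))
            new_labels[u] = labeled[nearest]
        labeled.update(new_labels)   # expand the labeled set for the next round
    return labeled
```

Labels assigned within one batch only take effect for the next batch, matching the round-by-round expansion in the text.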

Building predictive models
Building and assessing a similarity- or distance-based classifier was considered a feasible and suitable way to validate the effectiveness of the proposed similarity. In this study, a kNN model was employed for a multi-class classification task. By trial and error, the optimal k was set to 10, with which the corresponding kNN model built on the whole training samples showed the best predictive performance in the application scenario. When applying this model, we had many choices concerning the composition and size of the training samples and the similarity measurement used in determining near neighbors, all of which might cause differences in the predictive performance. We used a leave-one-out method to evaluate the performances of the predictive models. In each validation round, one patient was used as the test sample and the remaining patients comprised the training sample pool. The training samples for a kNN classifier were M% (M ranging from 2 to 100) of the samples selected from this pool. The training samples might be randomly selected from the pool or selected with the learned similarity or the Euclidean distance to help rule out irrelevant samples. The proposed similarity and the Euclidean distance were also used to determine near neighbors. We evaluated and compared the proposed patient similarity with respect to (i) the selection of the training samples from the training sample pool, (ii) the determination of the k nearest neighbors, and (iii) the usage of patient similarity. For the first two, we had the following four combinations.
R + L combination: the training samples were randomly selected from the training sample pool, and the top k nearest neighbors were identified with the learned patient similarity.
R + E combination: the training samples were randomly selected from the training sample pool, and the top k nearest neighbors were identified with the Euclidean distance.
L + E combination: the training samples were the most similar samples determined by the learned similarity, and the top k nearest neighbors were identified with the Euclidean distance.
E + L combination: the training samples were the most similar samples determined by the Euclidean distance, and the top k nearest neighbors were identified with the learned similarity.
As a typical type of non-numeric variable, comorbidity conditions were not suitable for calculating the Euclidean distance-based similarity directly. Thus, we selected 26 popular comorbidities with an occurrence greater than 5% among all patients, each treated as a binary variable. Finally, in the Euclidean distance-related predictive models (combinations R + E, L + E, and E + L), an input sample had 131 features: 44 features were the most important and representative phrases extracted from the radiological reports as text features; 57 were regular laboratory test items such as blood tests and urine tests; 26 were comorbidity variables such as diabetes mellitus with and without complications and essential hypertension; and the remaining four were demographic variables. The predicted label for an index patient was assigned as the label of the majority of his or her k nearest neighbors.
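A minimal sketch of the two pluggable stages above, training-sample selection and majority vote among the k nearest neighbors; the function names are illustrative, and `score` can be either the learned similarity or a negated Euclidean distance.

```python
import random
from collections import Counter

def select_training(pool, query, m_percent, score=None):
    # choose M% of the pool: at random if no score is given,
    # otherwise the samples most similar to the query
    n = max(1, round(len(pool) * m_percent / 100))
    if score is None:
        return random.sample(pool, n)
    return sorted(pool, key=lambda t: score(query, t[0]), reverse=True)[:n]

def knn_predict(query, training, k, score):
    # training: list of (sample, label); score: higher = more similar
    # (for a distance d, pass score=lambda q, s: -d(q, s))
    neighbors = sorted(training, key=lambda t: score(query, t[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # majority label of the k nearest neighbors
```

The four combinations differ only in which `score` is passed to each stage (random/learned/Euclidean for selection, learned/Euclidean for the neighbor vote).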
In addition, we compared the performances of the personalized kNN models with those of other state-of-the-art machine learning models that did not use patient similarity, serving as references: decision trees, the Naïve Bayes classifier, random forest, and an AdaBoost classifier integrating weak decision trees. Leave-one-out validation on the whole training sample pool was also used.
The performance metrics included the micro-averaged area under the receiver operating characteristic curve (AUC), F1-score, and cross-entropy (CE) loss [28]. We fitted cubic polynomials to the changing trends of the three metrics.
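For reference, the cross-entropy loss on predicted class probabilities can be computed as below; for a kNN model, the vote fractions among the k neighbors can serve as the class probabilities (an assumption about how CE is applied here).

```python
import math

def cross_entropy(y_true, probs, eps=1e-15):
    # y_true: index of the true class; probs: predicted class probabilities
    p = min(max(probs[y_true], eps), 1.0)  # clip to avoid log(0)
    return -math.log(p)

def mean_ce_loss(labels, prob_rows):
    # average cross-entropy over a set of predictions
    return sum(cross_entropy(y, p) for y, p in zip(labels, prob_rows)) / len(labels)
```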

Data set
EMR data used in this study were derived from inpatients discharged from a tertiary hospital in Beijing, China between 2014 and 2016. Individual hospitalizations were de-identified and maintained as unique records, including age at admission, sex, drug allergy history, source of hospital admissions, up to 11 comorbidities at discharge, laboratory tests, and radiology reports during hospitalization.
Because only radiology reports of CT examinations were available at present, 6749 patients who underwent liver CT scans were enrolled in this study. Liver cancer, hemangioma, and NAFLD were determined according to the radiological impression section of a radiological report and were further confirmed by ICD-10 codes in the comorbidity fields of a patient's record. The ICD-10 codes for liver cancer, hemangioma, and NAFLD were C22 (malignant neoplasm of liver and intrahepatic bile ducts) and C78.7 (secondary malignant neoplasm of liver and intrahepatic bile duct), D18.0 (hemangioma, any site), and K76.0 (nonalcoholic fatty liver disease), respectively. Following the above criteria, we identified 153, 178, and 403 patients diagnosed with liver cancer, hemangioma, and NAFLD, respectively. Together with 449 patients with a normal liver diagnosis, the records of these 1183 patients comprised the study dataset. Finally, the ICD-10 codes used to identify the target diseases (C22, C78.7, D18.0, and K76.0) were excluded from the comorbidity conditions when computing the feature similarity. Table 1 gives details of the study population.

Patient similarity
We randomly selected 30 patients out of the 1183 study patients to construct the labeled set with 435 samples (patient pairs). The remaining 698,718 (= 1183 × 1182 / 2 − 435) samples comprised the unlabeled sample set. Among the 30 patients, four had liver cancer, five hemangioma, ten NAFLD, and 11 a normal liver diagnosis, retaining the same disease constituent ratio in the labeled set as in the whole study population. There were statistically significant differences in similarity scores (mean ± standard deviation: 0.77 ± 0.082, 0.54 ± 0.043, and 0.37 ± 0.074, respectively; one-way analysis of variance, all P values < 0.001 after the Bonferroni adjustment) among the patient pairs with the constraints of must-link, general-link, and cannot-link. See the supplementary materials for examples of patient pairs with different similarity scores (Additional file 1: Table S1).
We obtained patient similarities for all the patient pairs among the overall 1183 patients after the semi-supervised learning. Figure 4 is a visualized example of the learned patient similarities between four randomly selected patients (each with one of the four liver conditions) and the rest.

Disease prediction
As the size of training samples used for building kNN models increased from 2% (about 23 patients) to 100%, the models based on the learned patient similarity (i.e., the L + E, E + L, and R + L combinations) performed significantly better than those based on the directly calculated patient similarity (i.e., the R + E combination) (Kruskal-Wallis tests, P values < 0.001 for all performance indexes). When using no more than 7% (about 82 patients) of the whole training samples, the L + E combination outperformed the E + L combination. Based on all training samples, the L + E and R + E combinations showed the same performance of AUC 0.89, F1-score 0.67, and CE loss 1.82, while the E + L and R + L combinations showed the same performance of AUC 0.95, F1-score 0.84, and CE loss 1.22.

Fig. 4 Illustration of the learned patient similarity. Four index patients (the central dots) are surrounded by all other patients; the more similar a patient is to the index patient, the closer the respective dot is to the central dot. Dots in blue, orange, green, and purple represent patients with liver cancer, hemangioma, NAFLD, and a normal condition, respectively. The figure was generated using Gephi 0.9.2 (https://gephi.org/). NAFLD, non-alcoholic fatty liver disease.
Among the four reference models, the highest AUC and F1-score were 0.94 and 0.80, respectively, both lower than those for the simple and similarity-based kNN models (Table 2).

Discussion
In personalized medicine, machine learning algorithms and patient similarity based on real-world EMR data have significantly facilitated various scenarios such as disease prediction [5][6][7] and risk factor identification [8,9]. However, EMR data usually have mixed data types, structured and unstructured, continuous and categorical, making it a challenge to measure the similarity among patients. In this situation, the learning-based patient similarity measurement was proposed and evaluated in this study. When computing the feature-level similarity for two patients, we employed different similarity measurements according to the features' attributes (data types). Besides the relative ratio for the continuous feature (age) and the Jaccard similarity index for binary features (binarized laboratory test results, text features, and other demographic characteristics), the IC measure was applied to compute the feature similarity for the ICD-10 code-based comorbidity conditions. As a coded and hierarchical feature, an ICD-10 code can be considered either as a general multi-categorical variable, using the Jaccard distance to measure similarity [11,22,29], or as a tree-like variable, using a path-based measurement to estimate similarity with full use of the hierarchical information [25]. Among the path-based measurements, the edge-counting approach has been used in many previous studies [3,5,30]. Under edge counting, two patients diagnosed with liver cancer of different ICD-10 codes (C22.0 and C22.9, for example) have the same similarity as two diagnosed with influenza of different ICD-10 codes (J11.1 and J11.2, for example), both pairs having the same edge count of 3. However, the former pair should be considered more similar than the latter because liver cancer is less common in the general population than the flu. As an alternative path-based measurement, IC has the advantage of dealing with this problem.
IC is a probabilistic model [25] and has been applied to gene ontology terms [31,32] and semantic context [33][34][35]. To the best of our knowledge, this was the first time the corpus-based IC measurement was used to measure the similarity of comorbidity conditions based on ICD-10 codes.

Performance of k-nearest neighbor models in terms of a area under the receiver operating characteristic curve, b F1-score, and c cross-entropy loss. The horizontal axes represent the sizes of the training samples in percentage. R + L and R + E combinations: kNN models in which the nearest neighbors from the randomly selected training samples were determined by the learned similarity and the Euclidean distance, respectively; L + E combination: the kNN model in which the nearest neighbors from the similar training samples based on the learned similarity were determined by the Euclidean distance; E + L combination: the kNN model in which the nearest neighbors from the similar training samples based on the Euclidean distance were determined by the learned similarity.
In addition to the readily available patient features, text features extracted from the free-text radiological reports were also involved in the calculation of the feature-level similarity in the current study. We first used an NLP method to extract the features. All features would have contributed equally to a Jaccard distance-based similarity, no matter how important they were to the target disease; however, this is not the case in radiological practice. Thus, the extracted features were structured into one binary feature vector and five independent binary features at the radiologists' suggestion to highlight the roles they played in the similarity calculation. This was a simple but successful attempt to add structured text features into the similarity calculation.
Another critical factor in obtaining a high-quality patient-level similarity was the way the feature-level similarities were consolidated. Instead of a linear or logarithmic combination of the individual feature similarities [3,5,12], we proposed a semi-supervised learning process. For learning tasks, sufficient supervised information is crucial, whereas the annotation of EMRs requires sophisticated medical professionals and is very time-consuming and expensive [36]. Thus, SSL algorithms based on a small amount of supervised information have been used by numerous researchers in problems such as ticket classification [37], image segmentation [38], and phenotype stratification [39]. In this study, the SSL algorithm was adopted with a tiny labeled fraction of the overall patients to balance the learning performance against the labeling difficulty.
Previous studies have shown that similarity-based predictive models outperform those without sample selection according to patient similarity [5,8,22]. In this study, even though we labeled only 2.5% of the study population for the semi-supervised learning, the learned patient similarity still played a significant role in improving the predictive performances of the similarity-based kNN models. The personalized kNN models outperformed other state-of-the-art machine learning models that did not use patient similarity, demonstrating the effectiveness and superiority of the proposed patient similarity measurement. Because the kNN classification method uses the k nearest neighbors to vote for the output label, the way the nearest neighbors are determined matters. The kNN models using the learned patient similarity to determine near neighbors consistently outperformed those using the Euclidean distance, indicating that the proposed similarity could find more similar patients than the widely used Euclidean distance. Furthermore, even when the simple Euclidean distance was used to determine near neighbors, the learned patient similarity could still improve the predictive performance by helping select the similar training samples from which the k nearest neighbors were drawn. However, the performance of this kind of kNN model degraded gradually as the size of the training samples increased, coinciding with previous studies' conclusion that adding more dissimilar samples disturbs the prediction [5,8]. The decreasing trend also indicated that as the size of the training samples increased, more dissimilar and irrelevant samples were added, weakening the Euclidean distance's ability to retrieve the most similar samples.
On the contrary, predictive performances increased gradually when near neighbors determined with the learned patient similarity were selected from similar training samples identified with the Euclidean distance, showing the strength of the learned patient similarity in selecting near neighbors even from more dissimilar samples. In short, the proposed learning-based method successfully dealt with the problem of measuring the similarity of patients stored in heterogeneous EMRs.
This study had a few limitations. First, the determination of patient features for computing similarities depended partially on domain knowledge and was aimed at a specific predictive task. Effort is still needed to generalize the proposed patient similarity to other application scenarios and to design an adaptive feature selection. Second, only one expert was asked to score the patient similarities. Although the expert could give scores with the assistance of the calculated patient similarity, this scoring mechanism might introduce labeling bias. Finally, considering the labor cost, we manually labeled only 30 randomly selected patients (generating 435 patient pairs). To obtain similarity labels for hundreds or thousands of patient pairs efficiently and effectively, we will further integrate active learning into the proposed semi-supervised learning method.

Conclusions
In this study, we proposed a semi-supervised learning method to obtain patient similarity from heterogeneous EMR data. We have shown that similar samples identified with the proposed similarity measurement help improve the performance of similarity-based predictive models. The proposed method is expected to be applicable to other heterogeneous EMR data for machine learning tasks originating from clinical practice.