Lab indicators standardization method for the regional healthcare platform: a case study on heart failure

Background Laboratory indicator test results in electronic health records have been applied to many clinical big data analysis. However, it is quite common that the same laboratory examination item (i.e., lab indicator) is presented using different names in Chinese due to the translation problem and the habit problem of various hospitals, which results in distortion of analysis results. Methods A framework with a recall model and a binary classification model is proposed, which could reduce the alignment scale and improve the accuracy of lab indicator normalization. To reduce alignment scale, tf-idf is used for candidate selection. To assure the accuracy of output, we utilize enhanced sequential inference model for binary classification. And active learning is applied with a selection strategy which is proposed for reducing annotation cost. Results Since our indicator standardization method mainly focuses on Chinese indicator inconsistency, we perform our experiment on Shanghai Hospital Development Center and select clinical data from 8 hospitals. The method achieves a F1-score 92.08\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\%$$\end{document}% in our final binary classification. As for active learning, the new strategy proposed performs better than random baseline and could outperform the result trained on full data with only 43\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\%$$\end{document}% training data. A case study on heart failure clinic analysis conducted on the sub-dataset collected from SHDC shows that our proposed method is practical in the application with good performance. Conclusion This work demonstrates that the structure we proposed can be effectively applied to lab indicator normalization. And active learning is also suitable for this task for cost reduction. Such a method is also valuable in data cleaning, data mining, text extracting and entity alignment.

and prognostic indicator of myocardial infarction [1]. However, many same indicators are presented using different names in Chinese. There may be two main reasons: The first one is a translation problem. Aspartate aminotransferase can be translated as " ", " ", " ", " ", etc, in Chinese. The second one is a habit problem. Different hospitals or different doctors prefer different names or have their own statements, such as " ", " " and " ". Since there exists no such a synonym knowledge base (KB), EHRs associated with these synonymous names will be lost, resulting in distortion of the analysis results. Therefore, it is necessary to design an automatic method for lab indicators standardization.
An effective way to solve the standardization of lab indicators problem is entity alignment. There are two main tasks related to the entity alignment: instance matching and entity linking. Instance matching [2] aligns same entities in different KBs. A lot of traditional methods are based on feature engineering [3,4], and utilize attribute values [5,6], structural information [7], external lexicons [8][9][10] and so on. Recent methods are based on representation learning [11][12][13][14][15][16], and embed entities into vectors for similarity calculation. Entity linking aligns a mention in a text and an entity in a KB. It has both unsupervised methods [17][18][19] and supervised methods [20][21][22][23]. However, there is no attribute or context in our data, these two tasks cannot apply to our standardization directly. Zhang et al. [24] proposed a n-gram and stacking enhanced method based on indicator name, abbreviation and reference value, which has a lot of space to improve because they only use handcraft feature. Besides, all the methods mentioned above ignore the cost of annotations. Unlike the tasks in universal corpus, indicator normalization requires experienced domain experts for annotating. Active learning aims to select few samples for training without degrading much performance [25,26]. And it has been successfully applied in many tasks [27][28][29]. So the annotation cost is also considered in our work.
In this paper, a framework is used for indicator normalization. Including a recall model and a binary classification model, and then is evaluated by a case study on heart failure, which extends our previous work [24]. The main contributions of our work are as follows.
• An effective framework is proposed to normalize the lab indicators, which combines a recall model and a binary classification model. The purpose of the recall model is to reduce alignment scale, a candidate set contains standard indicators is selected for each non-standard indicators. After this step, candidatestandard indicator pairs can be generated by a binary classification model through an enhanced sequential inference model(ESIM) based on the name and abbreviation of indicators. Experimental results of the proposed structure show that it achieves an F1-score of 92.08% in the final binary classification. • Active learning is utilized for reducing annotation cost. A new selection strategy is proposed and is compared with shannon entropy, least confidence and a random baseline. Experiments show that the our strategy performs better than random baseline and could outperform the same result which is trained on full data with only 43% training data. • A detailed case study on heart failure clinic analysis is conducted on the sub-dataset from the dataset of a regional healthcare platform called Shanghai Hospital Development Center (SHDC). The result shows that our proposed method is practical in data cleaning, data mining, text extracting and entity alignment. Figure 1 shows the framework of the whole process, which include four parts. The first part is data preprocessing. With the help of experienced experts, a list of standard indicators is selected and a set of nonstandard indicators is remained. A recall model is then presented in second part. For each non-standard indicators, a standard indicator candidate set is generated to reduce the alignment scale. The third part is a binary classification model to determine whether a non-standard indicator is synonymous with a standard indicator. Finally, active learning is applied to decline the cost of annotation.

Data pre-processing
The original data we collected from hospital is a set of indicators with name and abbreviation. The aim of data pre-processing is to generate a standard indicator set. Though terminology standard like LOINC has been published early, it has not been widely used in clinical practice. In Zhang's work [24], lab indicators are firstly clustered and then the most frequent term in each cluster is picked as the standard terminology. But we think that the set of standard indicator matters in indicator normalization. So we invite experts to select a list of standard indicators S = (s 1 , s 2 , . . . , s m ) , in which s i = (s i n , s i a ) . The remaining set L = (l 1 , l 2 , . . . , l n ) are regarded as a non-standard indicator collection, in which l i = (l 1 n , l 1 a ) , subscript n and a represent name and abbreviations, respectively.

Candidate selection
We collect a standard indicator set containing 1970 items through the first step. For a single non-standard indicator, it would be costly if it is compared with each item in standard indicator set due to the huge amount of standard indicator list. Similar challenges occur in recommending system and query engine, where a light recall model is often used to shrink the scale of candidate set, then a complicated model is applied for further processing. We use the similar idea in our task. To improve the efficiency of the whole system, a recall model is used to compute the similarity between nonstandard and standard indicators. Each indicator has two attributes: name and abbreviation value. So two candidate set C i n = (l i n , c i n ), (l i n , c i n ), . . . , (l i n , c k 1 n ) and C i a = (l i a , c i a ), (l i a , c i a ), . . . , (l i a , c k 2 a ) is then generated for each non-standard indicator, in which c i n , c i a ∈ S . Finally, we choose the union of C i n and C i a as the candidate set of l i , notated as C i .
The recall model selects a certain amount of candidates in a short time with a high recall rate. In this task, the textual similarity between the names and abbreviations of non-standard indicators and standard indicators is very high. So we have tried four classical text matching model: edit-distance, bow(bag of words), bm-25 and tfidf. Experiments show that tf-idf works best on the test dataset. Tf-idf is described as follow. For a document set D = {d 1 , d 2 , . . . d n } , each document is composed by a list of words {w 1 , w 2 , . . . , w l } . Formula 1-3 shows the tfidf value of w j in d j , in which f (w j , d i ) is the frequency of w j in d i and N w j is the number of documents where w j appears. The name or abbreviation of each indicator can be represented by a vector v whose length is equal to the length of the vocabulary. The corresponding value of each word in v is the tf-idf value of the word. Then the similarity between two indicators is represented by the doc product of their corresponding vector. The abbreviation of some indicator is missing during practice, so the length of candidate set for each non-standard indicator is unequal, range from k 1 to k 1 + k 2 .

Binary classification
Considering that the recall model has sharply decreased the amount of candidates for each non-standard indicator, we apply a binary classification model to judge the relationship between a non-standard indicator and its candidate standard indicators. Before training the model, we combine the name and abbreviation of each indicator. For example, an indicator with name " " and abbreviation "ph" will be transferred to " ", which avoiding training two model targeted to name and abbreviation separately. We note the non-standard indicator and the standard indicator as query and hypothesis, represented as a and b respectively.
We define the binary classification as a natural language inference task. Though pretrained language model such as Bert [30] achieves state of the art in many domains and tasks, models based on lstm or cnn require less resource without degrading much performance. Chen et al. [31] ( The overall process of the indicator standardization algorithm proposed ESIM for text matching, which extract features of query and hypothesis by lstm and utilizing attention mechanism to interact query with its counterparts. Based on the reason mentioned above, we choose ESIM for binary classification. ESIM consists of four parts. Input encoding utilize BiL-STM to represent each word with its context. Local inference modeling apply attention mechanism to interact query with hypothesis. Inference composition is proposed to compose the output of previous layer into a fixed length vector. Finally, a prediction layer put the vector into a multilayer perceptron classifier for classification.

Input encoding
For input a and b, BiLSTM is used to learn a representation of a word and its context, notated as â and b in Eqs. 4-5. â i represent the hidden state generated by the BiLSTM at time i on the input sequence, which is same with b i .

Local inference modeling
Local inference modelling apply attention mechanism to capture the local reference between query and hypothesis, which is formulated in Eqs. 6-10. The attention score e ij is the doc product of â i and b j . The alignment representation ã i is then composed by b , which taken e ij as weight. To get an enhanced representation, the elementwise product and the difference of â and ã is calculated, and concatenate with â and ã to get m a and m b .

Inference composition
Before the final prediction, a composition layer is proposed to compose the previous enhanced representation. Then a pooling layer is applied to convert the resulting vector obtained above to a fixed-length vector. Finally, the fixed-length vector is put into a multilayer perceptron classifier taken softmax as activation function. Crossentropy is used for training.

Active learning
The data trained in binary classification should be annotated by experienced experts with high cost. Active learning aims to use fewer samples to train without degrading much performance. which select a certain amount of samples by some strategies and sent them to professional experts for annotating. The whole process is iterative and the pseudo code is given in Algorithm 1.
Besides least confidence and shannon entropy, we also propose geni index, a new uncertainty metric for sample selection. In decision tree, geni index is used to replace entropy because it is more efficient in computation and more suitable for binary classification. Gini index describes the probability of samples not belonging to the same category in two consecutive samples, which has the same property with shannon entropy that normal distribution would lead to a high value. Formula 17 shows the calculation of gini index of a candidate set c i .

Dataset
The dataset is collected from Shanghai Hospital Development Center. Among 38 hospitals, we choose 8 hospitals whose data are diverse and rich enough in synonymous (15) The strategy used in active learning has a great importance to the final result. The main idea to select samples is based on the uncertainty about the prediction of the model. A variety of strategies have been proposed to calculate model prediction uncertainty, such as least confidence, margin sampling and shannon sampling, where margin sampling is same as least confidence in binary classification. Least confidence considers uncertainty as the difference of 1 and the maximum probability of the prediction. For example, in binary classification, the probability of the model prediction being 0 and 1 is p 1 and p 2 , respectively, where p 1 + p 2 = 1 . Then the uncertainty can be formulated as 1 − max(p 1 , p 2 ) . If the probability on one label(p 1 or p 2 ) is close to 1, then the sample with low uncertainty will not be selected. Shannon entropy use information entropy to evaluate uncertainty, which is high if the distribution of the model prediction is uniform and the sample should be selected. In our case, a set of candidate sets are selected in each iteration, and the uncertainty based on least confidence and entropy are described in formula 15-16. In which p ij refers the probability of ith non-standard indicator paired with jth candidate in c i predicted by the model. Because each candidate set has different length, we add a regularization term to represent the uncertainty of each candidate set.
indicators and types for our experiment. We extract two attributes of indicator from the dataset, namely name and abbreviation. Then a list of standard indicator names are selected and each standard indicator was annotated with some synonymous indicators by experts. The number of extracted indicators is 12,298, and the standard indicator list has 1970 items. As for the settings in our experiments, the recall number on indicator name k 1 is set to 15, and the number on indicator abbreviation k 2 is set to 5. In active learning, we select 100 candidate sets in each iteration. And we use SGD as optimizer for binary classification. For the evaluation of candidate selection task, Recall and MRR is used as metrics. We also use Precision, Recall and F1-score to measure the binary classification task. Finally, we split the dataset into train and test set randomly at the rate of 7:3.

Candidate selection
We compare tf-idf with three other baselines: • edit distance: a model measures the number of operations to transform one string to another. • bow(bag of words): a basic model to represent text into vector for similarity calculation. • bm-25: a baseline model in information retrieval.
The result is shown in Tables 1 and 2. In the result of Table 1, we only use the name information of indicator, Table 2 shows the result trained on both name and abbreviation. The Recall and MRR improves when abbreviation is added to each model, which shows abbreviation is an important feature in indicator term normalization. Beside, the tf-idf model outperforms all other models thus we choose tf-idf to select candidates of non-standard indicators.

Binary classifications
For binary classification, we compare ESIM with Zhang's methods [24] and another natural language inference method BIMPM [9]. Zhang utilizes DBSCAN for clustering and then uses stacking mechanism to learn a binary classification model on n_gram feature, it should be noticed that in addition to name and abbreviation of indicator, they also use the reference value as an important feature. BIMPM uses an advanced method of multiperspective matching on LSTM for classification.
The results are shown in Table 3. As seen from the table, ESIM outperforms the other methods, with a Precision of 92.39% , a Recall of 91.78% and a F1-score of 92.08% , respectively. In particular, ESIM is almost 10% higher on F1-score when compared with Zhang's methods, which is because that ESIM can extract semantic information instead of handcraft feature used in Zhang's methods. As for BIMPM, we suppose that its complicated structure is not suitable in the scenario of indicator normalization with shallow semantic information.

Active learning
We empirically compare three active learning strategy mentioned above with a random baseline. The result is shown in Fig. 2. At the start of training, four strategies behaves similar because random selection is used in first iteration. All selection strategies performs better than random baseline after several rounds, where least confidence and gini index achieves a better F1-score in most rounds. Impressively, least confidence achieve the performance trained with full data using only 38% data, and gini index even performs better with 43% training data. After using 60% data for training, shannon entropy performs best in the four strategy. Note that gini index computes more efficient than shannon entropy.

Discussion
We have experimentally evaluated our proposed standardization method in the section of Results. In this section, we further give a detailed case study on a real application, namely heart failure clinic analysis, to show the necessity of lab indicator standardization.

Qualitative analysis
In order to better understand the performance of the proposed method, we manually analyze an indicator related to heart failure called " (Serum sodium)" . Since " " is a standard indicator selected by experts. We take its non-standard synonym " " as example. The first step is using recall model to select a candidate set. Then the binary classification model is used to get the standardization result. Tables 4 and 5 The proposed method is able to produce presentable standardization results. It can be seen that the recall model successfully find a candidate set related to " ", in which contains many semantically similar indicators. The classification model is able to find the correct standard indicator and exclude all non-synonyms: " ", " ", " ", etc. Which is similar to the name of sodium ion in Chinese.
Nevertheless, our analysis also points out that the proposed standardization method has limitations. For candidate selection step, in Table 4, we observe that some irrelevant indicators such as " " and " " appear in the candidate set. This shouldn't be a surprise since the recall model is based on character level and the indicators with overlapping characters have a certain chance of being clustered. Besides, such irrelevant negative samples will make the distribution of label in binary classification uneven, which brings challenges for further training process. It is remarkable that the recall method has included the groundtruth, regrettably at the expense of misjudging some irrelevant non-synonyms.

Role of lab indicator standardization
The role of lab indicator standardization is then introduced. In academic research about association between use of statins and outcomes in heart failure, the lab indicator "serum sodium" is considered as a relevant factor. Before indicator standardization, only 50,094 records can be found out by searching name " " (serum sodium) in the SHDC dataset, which is obviously missing since there are more than 100 thousand inpatients in the research group and "serum sodium" is a common indicator in Shanghai hospitals. Observing the records in the dataset, we find that "serum sodium" is also described as " " (sodium), " " (ionized sodium). We then search the records using names supplemented by our algorithm and 150,000 records are associated.
So far, we can see that lab indicator standardization is helpful in the following procedures: data cleaning, data mining, text extracting and entity alignment. For data cleaning, standardization is its main step. For data mining and text extracting, more synonymous indicators standardized by our method, more records can be Fig. 3 The process of the case study on heart failure  searched. For entity alignment, lab indicator standardization can be considered as an entity alignment problem.

Application
The heart failure research is conducted on the subdataset from SHDC. The SHDC dataset contains EHRs from 2012 to 2016. Data specifications in different hospitals are all different so that expressions of the same item may be distinct. For example, one hospital uses lab indicators' English names for abbreviations while another hospital uses inner numeric codes of the hospital. With the dataset, we use our algorithm to clean the data and preprocess the data according to the research goal and the guidance of specialist doctor for varies diseases. Figure 3 shows the process of the case study on heart failure. Generally, we can clean and preprocess lab indicator data in three steps. The first step is to construct the SHDC lab indicator KB. The second step is to pick out lab indicators related to special disease and construct the special disease lab indicator KB. The last step is to named entity recognize [32,33] using the special disease indicator KB and turn text descriptions into structured data and map synonymous names into standard names.
The first step is mainly about our proposed method. A generic SHDC lab indicator KB is independent to special diseases or academic subjects and can be applied to all tasks based on the same dataset. As the generic KB can be used in lots of tasks, it can prevent duplication of data preprocessing and data cleaning while offering an easyto-use standard for follow-up special disease researches. Specifically, we input lab indicator reports of the SHDC dataset into our algorithm and output candidate-standard indicator pairs into an easy-to-use KB. Each synonymous indicator is linked to its standard indicator and standard indicators are organized in the tree structure which is easy to view and mange.
The second step is the procedure of data selection. According to the requirement of a research, several related indicators are picked out into specific indicator KB. In this case, commonly used indicators in heart failure are put into a heart failure indicator KB. Figure 4 is a screenshot of specific indicator KB visual tool, lab indicators related to heart failure are listed on the left tree view, and they are all standard names. When we click on an indicator, all its synonyms will be displayed on the right part of the webpage. For instance, synonymous indicators of "serum sodium" are shown on the right part, which are Fig. 4 The screenshot of the heart failure indicator KB. a Lab indicators related to heart failure: Related indicators are picked out, and listed as standard names. The "serum sodium" is clicked this time. b Standard indicators and its synonyms: After clicked, the "serum sodium" and its synonymous names, which are all from the SHDC dataset and standardized by our proposed method, are displayed standardized by our proposed method and preliminarily checked by medical students. Figure 5 is the detail results of serum sodium standardization.
The last step is the post-structure step. Each synonymous indicator is recognized using the heart failure indicator KB constructed above [32], and it is mapped to the standard indicator. Structured records are extracted effortlessly from the mapped entity. As shown in Fig. 6, all indicators related to heart failure are recognized, and the recognition is based on the KB constructed in the  Fig. 6 A demo of named entity recognition and mapping. All relevant indicators of heart failure are recognized. Since some indicators are synonymous names, they should be mapped to their standard names. That is, "sodium" is mapped to "serum sodium" and "serum glutamic-oxaloacetic transaminase" is mapped to its standard indicator named "glutamate oxaloacetate transaminase"

Table 3 Performance comparisons of binary classification
Zhang is the methods proposed in [24], a n-gram and stacking enhanced method  Table 4 Recall results of "sodium ion"

Non standard indicator
Standard indicators second step. The recognized indicators may already be standard names, or they may only be synonymous names.
For synonymous names, we need to map them to their standard names. For example, " " (sodium) is mapped to " " (serum sodium) and " " (serum glutamic-oxaloacetic transaminase) is mapped to its standard indicator " " (aspartate aminotransferase).

Conclusions and future work
In this paper, we propose an effective recall-and-classification structure based on active learning to standardize the lab indicators in SHDC. To decline the alignment scale, we test several classic text matching methods and finally utilize tf-idf to recall a candidate set for non-standard indicators. And to assure the accuracy of output, an enhanced sequential inference model(ESIM) is performed for binary classification. Experimental results show that ESIM can achieve an F1-score of 92.08% in the final binary classification. To reduce the cost of annotation, we add active learning on binary classification task. A new sample selection is proposed, which performs better than random baseline and could outperform the best score trained with full data with only 43% training data. Finally, a case study on heart failure clinic analysis shows that our proposed method is practical in an application with good performance.
As for future work, our proposed method includes synonyms at the expense of adding some irrelevant items in the case study, which suggests the possible research avenues that remain open. In the future, we also plan to optimize our proposed method to construct a lab indicator KB covering the whole data of Shanghai Hospital Development Center, and apply it to other clinic analyses.