Attention-based deep residual learning network for entity relation extraction in Chinese EMRs

Background: Electronic medical records (EMRs) contain a variety of valuable medical concepts and relations. The ability to recognize relations between medical concepts described in EMRs enables the automatic processing of clinical texts, resulting in an improved quality of health-related data analysis. Driven by the 2010 i2b2/VA Challenge Evaluation, the relation recognition problem in EMRs has been studied by many researchers to address this important aspect of EMR information extraction.
Methods: This paper proposes an Attention-Based Deep Residual Network (ResNet) model to recognize medical concept relations in Chinese EMRs.
Results: Our model achieves an F1-score of 77.80% on a manually annotated Chinese EMR corpus and outperforms the state-of-the-art approaches.
Conclusion: The residual network-based model can reduce the negative impact of corpus noise on parameter learning, and combining it with a character position attention mechanism enhances the identifying features of different types of entities.

In the following, we use EMR to refer to unstructured EMR text.
Identifying the semantic relations among medical concepts in EMRs is of great importance to various health-related applications. These relations hold between medical problems, tests, and treatments. Table 1 presents two examples of semantic relations, one between the medical concepts $e_1$ = "cold" and $e_2$ = "fever" in sentence $S_1$, and the other between $e_1$ = "Head MRI" and $e_2$ = "lacunar infarction" in sentence $S_2$.
Because of the importance of this subject, the 2010 i2b2/VA NLP challenge for clinical records presented a relation classification task focused on assigning relation types between medical concepts in EMRs. Since then, medical concept relation classification has attracted attention from a growing number of researchers.

Table 1 Examples of the relations between medical entities
Sentence | Relation
$S_1$: The patient has a cold, feels a fever and headache. | [...]
$S_2$: [...] | Test reveals the disease (TeRD)

In traditional natural language processing (NLP) research, semantic relations between named entities are used in many applications, including knowledge graph construction, sentiment analysis and question answering [1]. Relation extraction or classification has therefore always been an important issue [2]. In previous open-domain entity relation extraction studies, researchers applied many traditional machine learning models, including logistic regression, SVM and CRF, to recognize relations [3][4][5][6][7]. Li et al. used a CRF model to reduce the space of possible label sequences and to introduce long-range features for relation recognition [8]. Mintz et al. put forward a distant supervision method for relation classification, which generates adequate training data by aligning text with a knowledge base and thus addresses the lack of training data [9]. Socher et al. were the first to employ a recurrent neural network (RNN) for relation extraction, while utilizing the syntactic structure of sentences [10]. Miwa et al. proposed a neural relation extraction architecture based on bidirectional LSTM and tree LSTM to encode entities and sentences simultaneously [11].
Drawing on these studies of open-domain relation extraction, a similar task on EMRs was formally defined in the 2010 i2b2/VA Challenge Evaluation [12], and researchers have since proposed various models for relation classification in EMRs. Bruijn et al. used SVM to train multiple classifiers for different relation categories and improved classification performance [13]. Rink et al. used external dictionaries to improve entity relation recognition [14]. Fang et al. extracted relations from articles on Chinese herbal medicine using manually designed rules and created a relation database [15]. Zhou et al. utilized a bootstrapping framework to extract relations from medical articles and created a knowledge base [16]. Li et al. proposed a relation classification model for electronic health records based on CNN-LSTM [17]. Overall, existing models mainly focus on English EMR texts, and their recognition performance is still not satisfactory. Given the increasing availability of digitalized Chinese EMRs, this paper addresses the problem of identifying semantic relations among medical concepts in Chinese EMRs. We propose an attention-based deep residual network model to classify medical entity relations in Chinese EMRs. Experimental results on a manually labeled Chinese EMR corpus show that our model achieves better performance than other methods, with an F1-score of 77.80%.

Methods
Our model is based on a CNN architecture, as shown in Fig. 1. The model consists of five parts: a vector representation layer, a convolution layer, a residual network layer, a position attention layer and an output layer.

Fig. 2 An example of the relative distance between an entity and a character. The relative distances of a character to the medical entities "cold" and "fever" are 2 and -2, respectively

Character embedding
Given a Chinese sentence $S = (c_1, c_2, \ldots, c_n)$ that contains two entities $e_1$ and $e_2$, each character $c_i$ is mapped to a low-dimensional dense vector

$$V_i = [V_i^w ; V_i^p]$$

where $V_i^w$ represents the character vector and $V_i^p$ is the vector of the character's position in the sentence. The character embedding is initialized with vectors pre-trained by word2vec, and $d_w$ is the dimension of the character vector.

Position embedding
The position embedding $V_i^p$ is also a low-dimensional vector, which encodes the position of a character in the sentence. It combines the relative positions (see Fig. 2) of the current character with respect to the first entity $e_1$ and the second entity $e_2$. Each relative position corresponds to a position embedding vector.
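As an illustration of the two embeddings described above, the following Python sketch computes the relative positions of each character to the two entities and concatenates a character vector with the two corresponding position vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `relative_positions` and `build_inputs`, the clipping distance `max_dist`, the use of a single index per entity, and the toy lookup tables are all hypothetical.

```python
def relative_positions(n_chars, e1_index, e2_index, max_dist=30):
    """Offsets of every character to the two entity positions.

    Assumption: each entity is represented by one index, and offsets are
    clipped to the range [-max_dist, max_dist].
    """
    def clip(d):
        return max(-max_dist, min(max_dist, d))
    return [(clip(i - e1_index), clip(i - e2_index)) for i in range(n_chars)]


def build_inputs(char_ids, e1_index, e2_index, char_table, pos_table, max_dist=30):
    """Concatenate the character vector with two position vectors per character."""
    offsets = relative_positions(len(char_ids), e1_index, e2_index, max_dist)
    inputs = []
    for cid, (d1, d2) in zip(char_ids, offsets):
        # Shift offsets by max_dist so they index rows 0 .. 2 * max_dist of pos_table.
        inputs.append(char_table[cid] + pos_table[d1 + max_dist] + pos_table[d2 + max_dist])
    return inputs


# Toy example (hypothetical sizes): 7 characters, entities at indices 1 and 5.
offsets = relative_positions(7, e1_index=1, e2_index=5)
char_table = [[0.1 * k, 0.2 * k] for k in range(10)]   # character vectors, d_w = 2
pos_table = [[float(k)] for k in range(2 * 30 + 1)]    # position vectors, dimension 1
vectors = build_inputs([3, 1, 4, 1, 5, 9, 2], 1, 5, char_table, pos_table)
```

In this toy setting the character at index 3 has offsets 2 and -2 to the two entities, the same pair of values as in the Fig. 2 example, and every input vector has dimension 2 + 1 + 1 = 4.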

Convolution
Convolution extracts effective local feature information from characters and their corresponding contexts.
Let $V_j$ be the vector that corresponds to the $j$-th character in the sentence $S = (V_1, V_2, \ldots, V_n)$, where $n$ is the sentence length. We use a filter $W \in \mathbb{R}^{h \times d_v}$, with $d_v$ the dimension of each $V_j$, to extract local features from the sentence $S$. A feature $c_j$ is generated from a window of characters $V_{j:j+h-1}$ by

$$c_j = f(W \cdot V_{j:j+h-1} + b)$$

where $b$ is a bias term and $f$ is a non-linear function. We apply a dropout layer in the convolution layer to prevent overfitting.
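The windowed feature computation described above can be sketched in plain Python as follows. This is only an illustration for a single filter: the function names, the choice of ReLU for the non-linear function, and the toy values are assumptions, and dropout is omitted.

```python
def relu(x):
    """Non-linear function f (assumed here to be ReLU)."""
    return x if x > 0.0 else 0.0


def conv_features(V, W, b, f=relu):
    """Slide one h x d_v filter W over the character vectors V.

    Returns the features, one per window of h consecutive character vectors
    (n - h + 1 features for a sentence of n characters).
    """
    h, d_v = len(W), len(W[0])
    features = []
    for j in range(len(V) - h + 1):
        total = b
        for r in range(h):
            for k in range(d_v):
                total += W[r][k] * V[j + r][k]
        features.append(f(total))
    return features


# Toy example: 3 characters with d_v = 2 and a filter of height h = 2.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 1.0], [1.0, 1.0]]
features = conv_features(V, W, b=-1.0)
```

A real model would apply many such filters in parallel, which is usually done with a tensor library rather than explicit loops.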

Residual networks
Residual learning connects low-level representations directly to high-level ones and addresses the vanishing gradient problem by superimposing an identity mapping function on the network. In our model, each residual convolution block (see Fig. 3) has two convolutional layers, each followed by a ReLU activation, and we use a shortcut connection around each residual convolution block. $W_1, W_2 \in \mathbb{R}^{h \times 1}$ are two convolution filters, where $h$ is the convolution kernel size. The first convolutional layer is

$$\tilde{c}_j = f(W_1 \cdot c_{j:j+h-1} + b_1)$$

and the second is

$$\hat{c}_j = f(W_2 \cdot \tilde{c}_{j:j+h-1} + b_2)$$

where $b_1, b_2$ are bias terms. The residual convolution block output is the vector $\hat{c}_j$. This block is stacked multiple times in our architecture, with the blocks linked by shortcut connections.
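A residual convolution block of the kind described above can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes zero padding so that the output keeps the input length, and it assumes the shortcut adds the block input to the output of the second convolution, as in the standard residual formulation. All function names are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def conv_same(c, w, b):
    """1-D convolution with filter w (length h), zero padding and ReLU.

    Assumption: padding keeps the output the same length as the input,
    so that the shortcut addition below is well defined.
    """
    h = len(w)
    left = (h - 1) // 2
    padded = [0.0] * left + list(c) + [0.0] * (h - 1 - left)
    return [relu(sum(w[r] * padded[j + r] for r in range(h)) + b)
            for j in range(len(c))]


def residual_block(c, w1, b1, w2, b2):
    """Two convolution layers plus an identity shortcut."""
    c_tilde = conv_same(c, w1, b1)        # first convolutional layer
    c_hat = conv_same(c_tilde, w2, b2)    # second convolutional layer
    return [x + y for x, y in zip(c, c_hat)]  # shortcut: input + block output


# With all-zero filters the block reduces to the identity mapping.
identity_out = residual_block([1.0, -2.0, 3.0], [0.0, 0.0, 0.0], 0.0, [0.0, 0.0, 0.0], 0.0)
```

The all-zero case shows why the shortcut helps optimization: a block that has learned nothing still passes its input through unchanged, so stacking more blocks does not by itself degrade the representation.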

Position attention
Attention mechanisms have recently been widely used in machine learning and have achieved strong results on various NLP problems. In this paper, we use position attention to enhance relation extraction. First, we apply a max-pooling operation to the residual learning result. Second, as shown in Fig. 1, we concatenate the max-pooling results with the position embedding of the entity. Finally, we use the attention mechanism to balance the weights over the sentence.
Here, $\alpha_i$ represents the attention weight, and $P_i$ is the result of concatenating the max-pooling results with the position embedding of the entity. In the output layer, we use the softmax function to normalize the scores and output the entity relation probabilities.
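To make the weighting step concrete, the sketch below shows one generic way of turning feature vectors into attention weights that sum to one and then into a single weighted vector. The scoring function is not specified in the text above, so the dot product with a query vector `u` used here is purely an assumption, as are the names `softmax` and `attend`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(P, u):
    """Weight the feature vectors P_i by attention weights alpha_i.

    Assumption: the score of P_i is its dot product with a learned
    query vector u (hypothetical; other scoring functions are possible).
    """
    scores = [sum(p_k * u_k for p_k, u_k in zip(p, u)) for p in P]
    alpha = softmax(scores)
    dim = len(P[0])
    pooled = [sum(a * p[k] for a, p in zip(alpha, P)) for k in range(dim)]
    return alpha, pooled


# Toy example: two feature vectors; a zero query gives uniform weights.
P = [[1.0, 0.0], [0.0, 1.0]]
alpha, pooled = attend(P, [0.0, 0.0])
```

With a non-zero query such as `[5.0, 0.0]`, the first feature vector receives the larger weight, which is the behavior the attention layer relies on to emphasize informative features.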

Dataset and evaluation metrics
With reference to the medical semantic relation annotation specification of the 2010 i2b2/VA Challenge, we established our own relation annotation specification for Chinese EMRs, in which semantic relations between medical concepts fall into five coarse-grained categories and fifteen fine-grained categories. All relation categories are detailed as follows.
Coarse-grained category 1: Treatment-Disease Relation. This category contains five fine-grained categories, including TrID (Treatment improves the disease), TrWD (Treatment worsens the disease), TrCD (Treatment causes the disease), TrAD (Treatment is administered for the disease), and TrNAD (Treatment is not administered because of the disease).
Coarse-grained category 2: Treatment-Symptoms Relation. This category also contains five fine-grained categories, including TrIS (Treatment improves the symptoms), TrWS (Treatment worsens the symptoms), TrCS (Treatment causes the symptoms), TrAS (Treatment is administered for the symptoms), and TrNAS (Treatment is not administered because of the symptoms).
Coarse-grained category 3: Test-Disease Relation. This category contains two fine-grained categories, including TeRD (Test reveals the disease) and TeCD (Test conducted to investigate the disease).
Coarse-grained category 4: Test-Symptoms Relation. This category also contains two fine-grained categories, including TeRS (Test reveals the symptoms) and TeBS (Test based on symptoms).

Coarse-grained category 5: Disease-Symptoms Relation. This category contains only one fine-grained category, DCS (Disease causes symptoms).
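For readers who want to work with this annotation scheme programmatically, the five coarse-grained and fifteen fine-grained categories listed above can be written as a small lookup structure. The variable and function names below are illustrative assumptions; only the category labels come from the specification.

```python
# Coarse-grained category -> fine-grained relation labels, as listed above.
RELATION_TAXONOMY = {
    "Treatment-Disease": ["TrID", "TrWD", "TrCD", "TrAD", "TrNAD"],
    "Treatment-Symptoms": ["TrIS", "TrWS", "TrCS", "TrAS", "TrNAS"],
    "Test-Disease": ["TeRD", "TeCD"],
    "Test-Symptoms": ["TeRS", "TeBS"],
    "Disease-Symptoms": ["DCS"],
}

# Reverse index: fine-grained label -> coarse-grained category.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in RELATION_TAXONOMY.items()
    for fine in fines
}


def coarse_category(fine_label):
    """Return the coarse-grained category of a fine-grained relation label."""
    return FINE_TO_COARSE[fine_label]
```

The reverse index makes it easy to report results at either granularity from the same set of fine-grained predictions.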
According to our specification, we manually annotated 3000 de-identified Chinese EMR texts from different clinical departments of a second-class grade-A hospital in Gansu Province, China. To evaluate our method on this dataset, 2000 medical texts are used as training data, 500 as development data, and 500 as test data. The number of relations in each fine-grained category of this dataset is given in Table 2. Precision, recall and F1-score are used as evaluation metrics.
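The three metrics can be computed per relation category as in the following sketch. How the per-category values are aggregated into the overall scores is not stated in the text, so the code reports per-category values only; the function name and the toy labels are assumptions.

```python
from collections import Counter


def prf_per_class(gold, pred):
    """Precision, recall and F1-score for each relation label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label p was wrong
            fn[g] += 1   # gold label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy example with two of the fine-grained labels.
gold = ["TeRD", "TeRD", "DCS", "DCS"]
pred = ["TeRD", "DCS", "DCS", "DCS"]
scores = prf_per_class(gold, pred)
```

In the toy example, TeRD has precision 1.0 and recall 0.5, while DCS has precision 2/3 and recall 1.0, which shows how a single misclassified instance affects the two categories differently.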

Models and parameters
We carry out experiments to compare the performance of our model with the models described below.
CNN-Max: This model was used by Sahu et al. [18]. It encodes the sentence vectors with a CNN and outputs the results after max-pooling and a softmax function.
BLSTM-Attention: This model was proposed by Li et al. [19]. It mainly consists of a bidirectional LSTM and an attention mechanism.
ResNet-Max: This model was proposed by Huang et al. [20]. Unlike our model, it does not incorporate an attention mechanism.
ResNet-BLSTM: The basic framework of this method is close to that of our model. The difference is that this model combines the residual network with a Bi-LSTM.
ResNet-PAtt: This is the model presented in this paper.
Table 3 gives the chosen hyper-parameters for all experiments. We tune the hyper-parameters on the development set by random search, and we share as many hyper-parameters as possible across experiments.

Experimental results
Table 4 shows the overall classification performance of the different models on our evaluation corpus. Our method, ResNet-PAtt, outperforms the other methods in F1-score, with precision, recall and F1-score reaching 79.16 [...] and 77.80%, respectively. Among the other methods, ResNet-BLSTM achieves the best F1-score, and our model improves on it by 2.97% in F1-score, which indicates that our method is more effective. In addition, the residual network-based methods are overall better than the other relation extraction methods.

Discussion
The reasons our model achieves the best performance may be that the residual network-based model reduces the negative impact of corpus noise on parameter learning, and that the character position attention mechanism enhances the identifying information of different types of entities. Table 5 gives the classification performance of our model on every fine-grained relation category. As these data show, our model performs best on relation category TeRS and worst on category TrNAS, which shows that TrNAS is more difficult to recognize correctly. We also evaluate the training time of the different models. Figure 4 shows the time consumed by these models when the number of epochs is set to 5, 10 and 20, respectively. Overall, our model takes the shortest time to complete parameter training, and the traditional machine learning method SVM takes the longest. Table 6 compares the F1-score of each model on every fine-grained relation category. Taken together, our model offers better classification performance and a shorter training time.

Conclusions
In this paper, we propose a deep residual network model based on the attention mechanism to classify the relations of entity pairs in Chinese EMRs. The method reduces the influence of data noise on model training and enhances entity discrimination features with a position attention mechanism, so that entity information can be combined effectively in relation extraction. Experimental results show that the model reaches an F1-score of 77.80% and significantly improves the classification performance on categories with few instances. At present, most relation classification approaches build on entity recognition and require the entities in the sentence to be specified. In the future, we will study the joint extraction of entities and entity relations to further improve the efficiency of recognizing both.

[...] entity relation data regarding the Chinese electronic medical records, and was a major contributor in establishing the annotation specification. All authors read and reviewed the final manuscript.