CapsTM: capsule network for Chinese medical text matching

Background: Text matching (TM) is a fundamental task in natural language processing and is widely used in application systems such as information retrieval, automatic question answering, machine translation, dialogue systems and reading comprehension. In recent years, a large number of deep neural networks have been applied to TM and have repeatedly improved the state of the art on TM benchmarks. Among them, the convolutional neural network (CNN) is one of the most popular, but it has difficulty dealing with small samples and preserving the relative structure of features. In this paper, we propose a novel deep learning architecture for TM based on the capsule network, called CapsTM. The capsule network is a new type of neural network architecture proposed to address some of the shortcomings of CNNs, and it shows great potential in many tasks.
Methods: CapsTM is a five-layer neural network consisting of an input layer, a representation layer, an aggregation layer, a capsule layer and a prediction layer. In CapsTM, two pieces of text are first individually converted into sequences of embeddings and are further transformed by a highway network in the input layer. Then, in the representation layer, Bidirectional Long Short-Term Memory (BiLSTM) is used to represent each piece of text, and an attention-based interaction matrix is used to represent the interactive information of the two pieces of text. Subsequently, the two kinds of representations are fused by BiLSTM in the aggregation layer and are further represented with capsules (vectors) in the capsule layer. Finally, the prediction layer is a fully connected network used for classification. CapsTM extends ESIM by adding a capsule layer before the prediction layer.
Results: We construct a corpus of Chinese medical question matching that contains 36,360 question pairs. The corpus is randomly split into three parts: a training set of 32,360 question pairs, a development set of 2000 question pairs and a test set of 2000 question pairs. On this corpus, we conduct a series of experiments to evaluate the proposed CapsTM and compare it with other state-of-the-art methods. CapsTM achieves the highest F-score of 0.8666.
Conclusion: The experimental results demonstrate that CapsTM is effective for Chinese medical question matching and outperforms the other state-of-the-art methods used for comparison.

Text matching (TM) is a fundamental task in natural language processing and is widely used in many application systems such as information retrieval, automatic question answering, machine translation, dialogue systems and reading comprehension. It is usually treated as a classification problem in which the input is a pair of pieces of text and the output is a label indicating whether the two pieces of text match (denoted by 1) or not (denoted by 0).
In recent years, a large number of deep neural networks, such as the Enhanced Sequential Inference Model (ESIM) [1], Attention-based Convolutional Neural Network (ABCNN) [2], Bilateral Multi-Perspective Matching (BIMPM) [3], Directional Self-Attention Network (DISAN) [4], Densely-connected co-attentive Recurrent Neural Network (DRCN) [5], Decomposable Attention Model (DECOMP) [6] and Bidirectional Encoder Representations from Transformers (BERT) [7], have been proposed for TM and have achieved state-of-the-art performance on many benchmark datasets. Deep neural networks have therefore become the mainstream machine learning methods for TM. Among them, the convolutional neural network (CNN) is one of the most popular basic networks for TM. However, it has difficulty dealing with small samples and preserving the relative structure of features. In this paper, we propose a novel deep learning architecture for TM based on the capsule network, called CapsTM. The capsule network [8] is a new type of neural network architecture proposed to address some of the shortcomings of CNNs. CapsTM is a five-layer neural network composed of an input layer, a representation layer, an aggregation layer, a capsule layer and a prediction layer. In this network, two pieces of text are first individually converted into embedding sequences and are further transformed by a highway network in the input layer. Then, in the representation layer, Bidirectional Long Short-Term Memory (BiLSTM) is used to represent each piece of text, and an attention-based interaction matrix is used to represent the interactive information of the two pieces of text. Subsequently, the two kinds of representations are fused by BiLSTM in the aggregation layer and are further represented with capsules (vectors) in the capsule layer. Finally, the prediction layer is a fully connected network used for classification.
CapsTM extends ESIM by adding a capsule layer before the prediction layer. We apply CapsTM to Chinese medical question matching and achieve strong performance. Experiments conducted on a manually annotated corpus of Chinese medical question matching show that CapsTM outperforms seven state-of-the-art neural networks, that is, ESIM [1], ABCNN [2], BIMPM [3], DISAN [4], DRCN [5], DECOMP [6] and BERT [7].
The contributions of this work are: (1) investigating Chinese medical question matching comprehensively from corpus construction to methods; (2) proposing a novel method based on capsule network for Chinese medical question matching, which outperforms other state-of-the-art methods for text matching.

Related work
In recent years, deep learning methods have become mainstream for text matching, and many deep neural networks have been proposed. Most of these networks are based on the Siamese network [9], which represents the two pieces of text with the same structure. Representative examples are DSSM (deep structured semantic models) proposed by Huang et al. [10] and ARC-I/ARC-II proposed by Hu et al. [11]. DSSM first uses a multi-layer fully connected neural network to represent the two pieces of text, then computes their cosine similarity, and finally makes a prediction. ARC-I/ARC-II uses a CNN-based architecture to model both the semantic information of each text and the interactive information between the two pieces of text.
By introducing new neural networks and techniques, many variants of DSSM and ARC-I/ARC-II have been proposed. Shen et al. presented CDSSM by replacing the multi-layer fully connected neural network with a CNN [12]. Palangi et al. developed LSTM-DSSM, which uses LSTM instead of the multi-layer fully connected neural network [13]. Yin et al. introduced the attention mechanism into ARC-I/ARC-II and proposed the attention-based CNN (ABCNN) [2]. Chen et al. proposed an enhanced LSTM for text inference, ESIM [1], which first uses BiLSTM to represent text semantic information and an attention matrix to represent interactive information between the two pieces of text, and then fuses the two kinds of information via BiLSTM and pooling. Wang et al. adopted the same architecture as ESIM with four kinds of attention matrices, called BIMPM [3]. DISAN is a light-weight neural network proposed to learn sentence embeddings. It is based solely on directional self-attention with temporal order encoded, followed by multi-dimensional attention, without any recurrent neural network or CNN structure [4]. DRCN is a densely-connected co-attentive recurrent neural network proposed by Kim et al. [5]. Each of its layers uses the concatenation of the attentive features and the hidden features of all preceding recurrent layers, so that the original and the co-attentive feature information is preserved from the bottommost word embedding layer to the uppermost recurrent layer. Parikh et al. proposed a simple neural architecture for natural language inference, called DECOMP [6], which uses attention to decompose the problem into subproblems that can be solved separately. BERT is a language representation model proposed by Devlin et al. [7], which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
The capsule network [8], a new type of neural network architecture proposed to address some of the shortcomings of CNNs, has shown great potential in image classification.

Methods
Formally, the task of TM is to find the most probable label y (0 for no match, 1 for match) for a given pair of pieces of text (s_1, s_2), where s_1 = w_11 w_12 … w_1n and s_2 = w_21 w_22 … w_2n, and w_ij is the j-th word of s_i for i = 1, 2 and j = 1, 2, …, n. The two pieces of text have the same length n after preprocessing, which extends all pieces of text to the same length by appending dummy tokens. Figure 1 shows the overall architecture of CapsTM, which consists of an input layer, a representation layer, an aggregation layer, a capsule layer and a prediction layer. These layers are presented in detail in the following sections.
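As a small illustration of the preprocessing step described above, the sketch below pads every tokenized text to a common length n with a dummy token. The function name `pad_to_length`, the token string `<PAD>` and the toy corpus are hypothetical choices made for this example, not taken from the original implementation.

```python
def pad_to_length(tokens, n, pad_token="<PAD>"):
    """Append dummy tokens so that the returned list has exactly n items.

    The name and the "<PAD>" token are illustrative assumptions.
    """
    if len(tokens) > n:
        raise ValueError("n must be at least the length of the longest text")
    return list(tokens) + [pad_token] * (n - len(tokens))


# Toy corpus: each text is already split into tokens w_ij.
corpus = [["w11", "w12", "w13"], ["w21", "w22"]]
n = max(len(tokens) for tokens in corpus)  # common length n
padded = [pad_to_length(tokens, n) for tokens in corpus]
```

After this step every text in `padded` has length n, so a pair (s_1, s_2) can be stacked into fixed-size arrays.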

Input layer
For the given pair of pieces of text (s_1, s_2), the input layer first converts each piece of text into embeddings learnt from large-scale unlabeled data by word2vec [14] or BERT [7], denoted by e_1 for s_1 and e_2 for s_2. It then transforms each e_i (i = 1, 2) with a highway network whose parameters are the weight vectors w_f and w_g and the bias vectors b_f and b_g.
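To make the highway transformation concrete, the following NumPy sketch applies one highway layer to a sequence of embeddings. It assumes the common formulation in which a sigmoid gate mixes a transformed input with the original input. The ReLU nonlinearity, the use of weight matrices `W_f` and `W_g` of shape (d, d), and the random toy inputs are assumptions of this example and may differ from the configuration actually used in CapsTM.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def highway(e, W_f, b_f, W_g, b_g):
    """One highway layer over embeddings e of shape (n, d).

    f is the candidate transform (ReLU is an assumption),
    g is the gate, and the output mixes f with the raw input e.
    """
    f = np.maximum(0.0, e @ W_f + b_f)
    g = sigmoid(e @ W_g + b_g)
    return g * f + (1.0 - g) * e


# Toy example: n = 6 tokens with d = 8 dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 6, 8
e1 = rng.normal(size=(n, d))
W_f, W_g = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_f, b_g = np.zeros(d), np.zeros(d)
e1_transformed = highway(e1, W_f, b_f, W_g, b_g)
```

When the gate is close to 0 the layer passes the embeddings through unchanged, and when it is close to 1 the layer behaves like an ordinary nonlinear layer.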

Representation layer
In the representation layer, two types of information are extracted: (1) information about each piece of text; (2) interactive information between the two pieces of text. We use BiLSTM to extract the first type of information (Eq. 4) and an attention-based interaction matrix to extract the second type of information (Eqs. 5-8).
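The interaction step can be sketched as follows. Because CapsTM is described as an extension of ESIM, the sketch assumes ESIM-style soft alignment: a dot-product interaction matrix, softmax-normalized attention in both directions, and an enhanced representation built from the original vectors, the aligned vectors, their difference and their element-wise product. The exact form of the equations in CapsTM may differ, the BiLSTM encoder is replaced by random toy inputs `h1` and `h2`, and all function names are illustrative.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)


def soft_align(h1, h2):
    """ESIM-style soft alignment (an assumption for this sketch).

    h1: (n1, d) encoded tokens of text 1, h2: (n2, d) encoded tokens of text 2.
    """
    E = h1 @ h2.T                  # (n1, n2) interaction matrix
    att_12 = softmax(E, axis=1)    # each token of text 1 attends over text 2
    att_21 = softmax(E, axis=0)    # each token of text 2 attends over text 1
    h1_aligned = att_12 @ h2       # (n1, d)
    h2_aligned = att_21.T @ h1     # (n2, d)
    return E, h1_aligned, h2_aligned


def enhance(h, h_aligned):
    """Concatenate original, aligned, difference and product features."""
    return np.concatenate([h, h_aligned, h - h_aligned, h * h_aligned], axis=-1)


rng = np.random.default_rng(0)
h1 = rng.normal(size=(5, 4))  # stand-in for the BiLSTM output of s_1
h2 = rng.normal(size=(7, 4))  # stand-in for the BiLSTM output of s_2
E, h1_aligned, h2_aligned = soft_align(h1, h2)
m1, m2 = enhance(h1, h1_aligned), enhance(h2, h2_aligned)
```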

Aggregation layer
This layer aggregates the representations of the two pieces of text using BiLSTM.

Capsule layer
The capsule network (as shown in Fig. 2) adopts the dynamic routing algorithm (as shown in Table 1) to iteratively process the text representations from the aggregation layer.
First, J d-dimensional capsule networks are initialized. For each capsule, a convolution operation is applied to S^c_1 and S^c_2, which yields F_ij, the feature vector obtained from the j-th convolution kernel T_j for s_i, with b_j the bias vector for T_j. With I convolution kernels, we obtain I-channel feature vectors U_i for s_i. The generated feature vectors are then input into the capsule layer, which uses a vector instead of a scalar to store the instantiation parameters of each feature. A capsule can therefore represent not only the intensity of an activation but also some details of the instantiated part of the input. For each channel feature vector u_i in U_1 and U_2 (i.e., u_i = F_1i for U_1 and F_2i for U_2), a convolution kernel K_j (j = 1, …, k) is used to generate u_{i|j} ∈ R^d for the j-th capsule, where g is a nonlinear activation function and b is a bias vector. The k channels are then reassembled into u_i. Next, the dynamic routing algorithm (as shown in Table 1) is applied to generate the capsules of the next layer. This process replaces the pooling operation, which discards location information. At the beginning of the dynamic routing algorithm, the same weight c^r_{i|j} is assigned to each location, as in average pooling. After the first iteration, the weight of each location is updated according to the similarity between c^r_{i|j} and u_{i|j}. The weights of all locations become stable after T iterations.
Finally, each piece of text s_i is represented by the outputs of all capsule networks.

Prediction layer
The prediction layer is a fully connected network that uses the sigmoid activation function for prediction. Following previous work on TM [1][2][3], we construct the input vector of the prediction layer from the representations of the two pieces of text and use the cross-entropy loss as the classification loss.
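The prediction step and its loss can be sketched as below. The sketch uses a single linear layer followed by a sigmoid and the binary cross-entropy loss. The feature matrix `x` stands in for the input vector of the prediction layer, whose exact construction is not reproduced here, and the single-layer form, the names and the toy data are assumptions of this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict_proba(x, w, b):
    """Probability that each pair matches (label 1); x has shape (batch, dim)."""
    return sigmoid(x @ w + b)


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean cross-entropy between gold labels in {0, 1} and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))         # hypothetical pair features
y = np.array([1.0, 0.0, 1.0, 0.0])   # gold labels
w, b = np.zeros(16), 0.0             # untrained parameters
p = predict_proba(x, w, b)
loss = binary_cross_entropy(y, p)
```

With untrained zero weights every pair receives probability 0.5, so the loss equals ln 2, a useful sanity check before training.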

Dataset
We asked two medical experts to annotate a corpus of Chinese medical question matching, which contains 36,360 question pairs. This corpus is randomly split into three parts: a training set of 32,360 question pairs, a development set of 2000 question pairs and a test set of 2000 question pairs, as shown in Table 2 in detail. Here, positive samples are medical question pairs with the same meaning or intent, while negative samples are medical question pairs with different meanings or intents.
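A random split with these sizes can be sketched as follows. The function name `split_corpus`, the fixed seed and the use of integer IDs in place of real question pairs are assumptions of this example, since the actual splitting script is not described.

```python
import random


def split_corpus(pairs, n_dev=2000, n_test=2000, seed=42):
    """Shuffle the pairs and cut them into train/dev/test parts."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # seeded for reproducibility
    dev = pairs[:n_dev]
    test = pairs[n_dev:n_dev + n_test]
    train = pairs[n_dev + n_test:]
    return train, dev, test


# Integer IDs stand in for the 36,360 annotated question pairs.
train, dev, test = split_corpus(range(36360))
```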

Experiment settings
We compare CapsTM with the following state-of-the-art deep neural networks: ESIM [1], ABCNN [2], BIMPM [3], DISAN [4], DRCN [5], DECOMP [6] and BERT [7]. All hyperparameters used in our experiments are shown in Table 3. All Chinese character embeddings are pretrained by word2vec (https://code.google.com/p/word2vec/) and BERT (https://github.com/google-research/bert) on a large-scale Chinese medical corpus. All model parameters are optimized on the corresponding development sets. All methods are implemented with TensorFlow 1.10.0, and all models are trained on machines with an NVIDIA GeForce GTX 1080 Ti GPU. Model performance is measured by precision (P), recall (R) and F-score.
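For reference, the three metrics can be computed as in the sketch below, which treats label 1 (match) as the positive class and takes the F-score to be the balanced F1, the harmonic mean of P and R. Both choices are assumptions of this example, and the toy labels are illustrative only.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 0, 1, 1, 0, 1]
P, R, F = precision_recall_f1(gold, pred)
```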

Results and discussion
As shown in Table 4, where the highest value of each type is highlighted in bold, CapsTM(word2vec), which uses Chinese character embeddings initialized by word2vec, achieves an F-score of 0.8432 and outperforms all the other state-of-the-art neural networks except BERT. The margin ranges from 0.65% to 3.85% in F-score. When the Chinese character embeddings are initialized by BERT, the F-score of CapsTM(BERT) increases to 0.8666, which is 0.2% higher than that of BERT. Compared with ESIM, CapsTM(word2vec) improves the F-score by 1.49%, indicating that the added capsule layer is effective.
In addition, to investigate the effect of the attention mechanism used in the representation layer and the dynamic routing algorithm used in the capsule layer, we conduct an ablation study on CapsTM. The results are shown in Table 5, where the highest value of each type is highlighted in bold and w/o denotes "without". When attention is removed, or when routing is replaced by max pooling or mean pooling, the F-score of CapsTM drops. In the case of CapsTM(word2vec), the F-score decreases by at least 0.71% when routing is replaced by pooling and by 2.16% when attention is removed.
Furthermore, we check the attention matrices of some samples and find that the attention mechanism can capture semantic similarities between words in question pairs. Figure 3 gives examples of a matched question pair and an unmatched question pair, where the darker the color is, the more semantically similar the corresponding words are.
CapsTM also makes some errors, which mainly fall into the following categories. (1) Question pairs of the same type but with different topics are often wrongly classified as 1 (match). For example, "乙肝疫苗有效期为多久 (How long is the validity period of the hepatitis B vaccine)" and "乙肝表面抗体能持续多久 (How long does the hepatitis B surface antibody last)" are wrongly classified as 1. (2) The answer to one question covers the answer to the other, but the two questions are not the same. For example, the answer to "乙肝高血压如何用药 (How should patients with hepatitis B and hypertension take medication)" should be included in the answer to "有乙肝病要如何控制高血压 (How to control hypertension in patients with hepatitis B)", but we cannot answer the former question directly using the answer to the latter. This is because medication is only one type of treatment for hypertension in patients with hepatitis B. A complete clinical knowledge graph might solve this problem. Therefore, for further improvement, we will investigate how to integrate a clinical knowledge graph into existing state-of-the-art deep neural networks in the future.

Conclusion
In this paper, we propose a novel five-layer neural network based on the capsule network for Chinese medical TM, called CapsTM. Experiments on a manually annotated corpus show that CapsTM outperforms other compared