CapsTM: capsule network for Chinese medical text matching

Abstract

Background

Text matching (TM) is a fundamental task in natural language processing that is widely used in application systems such as information retrieval, automatic question answering, machine translation, dialogue systems and reading comprehension. In recent years, a large number of deep neural networks have been applied to TM and have repeatedly advanced its benchmarks. Among them, the convolutional neural network (CNN) is one of the most popular, but it has difficulty dealing with small samples and preserving the relative structure of features. In this paper, we propose a novel deep learning architecture for TM based on the capsule network, called CapsTM. The capsule network is a new type of neural network architecture proposed to address some of the shortcomings of CNNs, and it shows great potential in many tasks.

Methods

CapsTM is a five-layer neural network consisting of an input layer, a representation layer, an aggregation layer, a capsule layer and a prediction layer. In CapsTM, two pieces of text are first individually converted into sequences of embeddings and are further transformed by a highway network in the input layer. Then, in the representation layer, Bidirectional Long Short-Term Memory (BiLSTM) is used to represent each piece of text, and an attention-based interaction matrix is used to represent the interactive information between the two pieces of text. Subsequently, the two kinds of representations are fused by BiLSTM in the aggregation layer and are further represented with capsules (vectors) in the capsule layer. Finally, the prediction layer is a fully connected network used for classification. CapsTM extends ESIM by adding a capsule layer before the prediction layer.

Results

We construct a corpus of Chinese medical question matching, which contains 36,360 question pairs. This corpus is randomly split into three parts: a training set of 32,360 question pairs, a development set of 2000 question pairs and a test set of 2000 question pairs. On this corpus, we conduct a series of experiments to evaluate the proposed CapsTM and compare it with other state-of-the-art methods. CapsTM achieves the highest F-score of 0.8666.

Conclusion

The experimental results demonstrate that CapsTM is effective for Chinese medical question matching and outperforms the state-of-the-art methods it is compared with.

Background

Text matching (TM), which aims to judge whether two pieces of text (sentences, questions, etc.) are equal or match in semantic space, is a key component of many application systems such as information retrieval, automatic question answering, machine translation, dialogue systems and reading comprehension. It is usually treated as a classification problem in which the input is a pair of pieces of text and the output is a label indicating whether the two pieces of text match (denoted by 1) or not (denoted by 0).

In recent years, a large number of deep neural networks, such as Enhanced Sequential Inference Model (ESIM) [1], Attention-based Convolutional Neural Network (ABCNN) [2], Bilateral Multi-Perspective Matching (BIMPM) [3], Directional Self-Attention Network (DISAN) [4], Densely-connected co-attentive Recurrent Neural Network (DRCN) [5], Decomposable Attention Model (DECOMP) [6] and Bidirectional Encoder Representations from Transformers (BERT) [7], have been proposed for TM and have achieved state-of-the-art performance on many benchmark datasets. Deep neural networks have therefore become the mainstream machine learning methods for TM. Among them, the convolutional neural network (CNN) is one of the most popular basic networks for TM. However, it has difficulty dealing with small samples and preserving the relative structure of features. In this paper, we propose a novel deep learning architecture for TM based on the capsule network, called CapsTM, where the capsule network [8] is a new type of neural network architecture proposed to address some of the shortcomings of CNNs. CapsTM is a five-layer neural network composed of an input layer, a representation layer, an aggregation layer, a capsule layer and a prediction layer. In this network, two pieces of text are first individually converted into embedding sequences and are further transformed by a highway network in the input layer. Then, in the representation layer, Bidirectional Long Short-Term Memory (BiLSTM) is used to represent each piece of text, and an attention-based interaction matrix is used to represent the interactive information between the two pieces of text. Subsequently, the two kinds of representations are fused by BiLSTM in the aggregation layer and are further represented with capsules (vectors) in the capsule layer. Finally, the prediction layer is a fully connected network used for classification. 
CapsTM extends ESIM by adding a capsule layer before the prediction layer. We apply CapsTM to Chinese medical question matching and achieve strong performance. Experiments conducted on a manually annotated corpus for Chinese medical question matching show that CapsTM outperforms seven state-of-the-art neural networks, that is, ESIM [1], ABCNN [2], BIMPM [3], DISAN [4], DRCN [5], DECOMP [6] and BERT [7].

The contributions of this work are: (1) investigating Chinese medical question matching comprehensively from corpus construction to methods; (2) proposing a novel method based on capsule network for Chinese medical question matching, which outperforms other state-of-the-art methods for text matching.

Related work

In recent years, deep learning methods have become mainstream for text matching, and many deep neural networks have been proposed. Most of these networks are based on the Siamese network [9], which represents two pieces of text with the same structure. Representative neural networks are DSSM (deep structured semantic models) proposed by Huang et al. [10] and ARC-I/ARC-II proposed by Hu et al. [11]. DSSM first uses a multi-layer fully connected neural network to represent two pieces of text, then computes their cosine similarity, and finally makes a prediction. ARC-I/ARC-II uses a CNN-based architecture to model both the semantic information of the text and the interactive information between two pieces of text.

By introducing new neural networks and techniques, many variants of DSSM and ARC-I/ARC-II have been proposed. Shen et al. presented CDSSM by replacing the multi-layer fully connected neural network with a CNN [12]. Palangi et al. developed LSTM-DSSM using LSTM instead of the multi-layer fully connected neural network [13]. Yin et al. introduced an attention mechanism into ARC-I/ARC-II and proposed the attention-based CNN (ABCNN) [2]. Chen et al. proposed an enhanced LSTM for text inference, ESIM [1], which first uses BiLSTM to represent text semantic information and an attention matrix to represent interactive information between two pieces of text, and then fuses the two kinds of information via BiLSTM and pooling. Wang et al. adopted the same architecture as ESIM with four kinds of attention matrices, called BIMPM [3]. DISAN is a light-weight neural network proposed to learn sentence embeddings, based solely on directional self-attention with temporal order encoded, followed by multi-dimensional attention, without any recurrent neural network or CNN structure [4]. DRCN is a densely-connected co-attentive recurrent neural network proposed by Kim et al. [5], each layer of which uses the concatenated information of attentive features and the hidden features of all preceding recurrent layers to preserve the original and co-attentive feature information from the bottommost word embedding layer to the uppermost recurrent layer. Parikh et al. proposed a simple neural architecture for natural language inference, called DECOMP [6], which uses attention to decompose the problem into subproblems that can be solved separately. BERT is a language representation model proposed by Devlin et al. [7], which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. 
The capsule network [8], a new type of neural network architecture proposed to address some of the shortcomings of CNNs, has shown great potential in image classification.

Methods

Formally, the task of TM is to find the most probable label y (0 for no match, 1 for match) of a given pair of pieces of text (s1, s2), where s1 = w11w12…w1n and s2 = w21w22…w2n (wij is the j-th word of si for i = 1, 2 and j = 1, 2, …, n) are two pieces of text of the same length. The common length is obtained by a preprocessing step that extends all pieces of text to the same length by appending dummy tokens. Figure 1 shows the overall architecture of CapsTM, which consists of an input layer, a representation layer, an aggregation layer, a capsule layer and a prediction layer. These layers are presented in detail in the following sections.

Fig. 1 Architecture of CapsTM
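The preprocessing step mentioned in the task definition, which extends both pieces of text to a common length n with dummy tokens, can be sketched as follows. This is only an illustration: the token name `<pad>`, the helper `pad_pair` and the truncation of over-long inputs are assumptions, since the paper does not describe these details.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical dummy token; the paper does not name it


def pad_pair(s1: List[str], s2: List[str], n: int) -> Tuple[List[str], List[str]]:
    """Return both token lists with exactly n tokens each.

    Short inputs are extended with PAD; longer inputs are truncated
    (truncation is an assumption, not stated in the paper).
    """
    def fit(tokens: List[str]) -> List[str]:
        return tokens[:n] + [PAD] * max(0, n - len(tokens))

    return fit(s1), fit(s2)


# Character-level tokens, matching the character embeddings used in the experiments.
a, b = pad_pair(list("乙肝疫苗"), list("乙肝"), n=6)
print(a)
print(b)
```

After this step both sequences have length n, so the later layers can index words as j = 1, 2, …, n for either piece of text.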

Input layer

For the given pair of pieces of text (s1, s2), the input layer first converts each piece of text into embeddings learnt from large-scale unlabeled data by word2vec [14] or BERT [7], denoted by e1 for s1 and e2 for s2, and then transforms the embeddings with a highway network as follows:

$$\widehat{{e_{i} }} = \tanh (w_{f} e_{i} + b_{f} ),$$
(1)
$$g = \mathrm{sigmoid}\left( {w_{g} \widehat{{e_{i} }} + b_{g} } \right),$$
(2)
$$e_{i}^{{\prime }} = g\widehat{{e_{i} }} + (1 - g)e_{i} ,$$
(3)

where i = 1,2, wf and wg are weight vectors, and bf and bg are bias vectors.
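As a rough illustration of Eqs. 1 to 3, the highway transformation can be sketched in NumPy. The sketch makes several assumptions that are not stated in the paper: the weights are treated as d × d matrices applied to each position of an (n, d) embedding sequence, the gate g is applied element-wise, and the function name `highway` and the random test values are purely illustrative.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def highway(e, w_f, b_f, w_g, b_g):
    """Apply Eqs. 1-3 to an (n, d) embedding sequence e."""
    e_hat = np.tanh(e @ w_f + b_f)      # Eq. 1: candidate transformation
    g = sigmoid(e_hat @ w_g + b_g)      # Eq. 2: gate computed from e_hat, as written
    return g * e_hat + (1.0 - g) * e    # Eq. 3: element-wise mix of e_hat and e


rng = np.random.default_rng(0)
n, d = 5, 8                              # illustrative sizes only
e = rng.normal(size=(n, d))
w_f, w_g = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_f, b_g = np.zeros(d), np.zeros(d)
e_prime = highway(e, w_f, b_f, w_g, b_g)
print(e_prime.shape)
```

When the gate is close to 0 the layer passes the original embedding through unchanged, and when it is close to 1 the output is the transformed embedding, which is the usual motivation for a highway connection.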

Representation layer

In the representation layer, two types of information are extracted: (1) information of each piece of text; (2) interactive information of the two pieces of text. We utilize BiLSTM to extract the first type of information (Eq. 4) and an attention-based interaction matrix to extract the second type of information (Eqs. 5–8) as follows:

$$h_{i} = BiLSTM\left( {e_{i}^{{\prime }} } \right),$$
(4)
$$sim_{ks} = e_{1k}^{{\prime }} \cdot e_{2s}^{{\prime }} ,$$
(5)
$$a_{ks} = \frac{{sim_{ks} }}{{\mathop \sum \nolimits_{s} sim_{ks} }},$$
(6)
$$\widehat{{a_{1k} }} = \mathop \sum \limits_{s} a_{ks} e_{s} ,$$
(7)
$$\widehat{{a_{2k} }} = \mathop \sum \limits_{s} a_{sk} e_{s} ,$$
(8)

where \(h_{i}\) is the concatenation of the last hidden states from forward and backward directions of BiLSTM, \(e_{ij}^{^{\prime}}\) is the j-th vector of \(e_{i}^{^{\prime}}\) (i = 1,2) corresponding to wij. For the detailed information about BiLSTM, please refer to reference [1].

Finally, \(h_{i}\) and \(\widehat{{a_{i} }}\) are concatenated to form the representation of si for i = 1, 2, that is \(c_{i} = \left[ {h_{i} :\widehat{{a_{i} }}} \right]\).
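The attention-based interaction of Eqs. 5 to 8 can be sketched in NumPy as below. Two points are assumptions rather than statements from the paper: the vector written as e_s is read as the s-th transformed embedding of the other piece of text (the second text in Eq. 7, the first in Eq. 8), and the name `interact` is hypothetical. The normalization follows Eq. 6 literally, a plain division by the row sum, so the demonstration uses positive embeddings to keep the row sums away from zero; ESIM-style models commonly apply a softmax at this step instead.

```python
import numpy as np


def interact(e1: np.ndarray, e2: np.ndarray):
    """e1, e2: (n, d) transformed embeddings of the two pieces of text."""
    sim = e1 @ e2.T                            # Eq. 5: sim[k, s] = e1[k] . e2[s]
    a = sim / sim.sum(axis=1, keepdims=True)   # Eq. 6: normalize each row over s
    a1_hat = a @ e2                            # Eq. 7: sum_s a[k, s] * e2[s]
    a2_hat = a.T @ e1                          # Eq. 8: sum_s a[s, k] * e1[s]
    return a, a1_hat, a2_hat


rng = np.random.default_rng(0)
n, d = 4, 6                                    # illustrative sizes only
e1 = rng.uniform(0.1, 1.0, size=(n, d))        # positive values keep row sums > 0
e2 = rng.uniform(0.1, 1.0, size=(n, d))
a, a1_hat, a2_hat = interact(e1, e2)
print(a.shape, a1_hat.shape, a2_hat.shape)
```

Each row of `a` sums to 1, so `a1_hat[k]` is a weighted average of the other text's embeddings, with weights proportional to their similarity to word k.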

Aggregation layer

This layer is employed to aggregate the representations of the two pieces of text using BiLSTM as follows:

$$S_{1}^{c} = BiLSTM\left( {c_{11} ,c_{12} , \ldots ,c_{1n} } \right),\quad S_{2}^{c} = BiLSTM\left( {c_{21} ,c_{22} , \ldots ,c_{2n} } \right),$$
(9)

where \(S_{1}^{c}\) and \(S_{2}^{c}\) are the concatenations of the last hidden states from forward and backward directions of BiLSTM for the two pieces of text.

Capsule layer

The capsule network (as shown in Fig. 2) adopts the dynamic routing algorithm (as shown in Table 1) to process the text representations from the aggregation layer iteratively. First, J d-dimensional capsule networks are initialized. For each capsule, a convolution operation is applied to \(S_{1}^{c}\) and \(S_{2}^{c}\):

$$F_{ij} = S_{i}^{c} \cdot T_{j} + b_{j} \quad (i = 1,2),$$
(10)

where \(F_{ij}\) is the feature vector obtained from the j-th convolution kernel \(T_{j}\) for si, and \(b_{j}\) is the bias vector for \(T_{j}\).

Fig. 2 Architecture of capsule network

Table 1 Dynamic routing algorithm

Supposing that there are I convolution kernels, we obtain I-channel feature vectors for si:

$$U_{i} = \left[ {F_{i1} ,F_{i2} , \ldots ,F_{iI} } \right]$$
(11)

The generated feature vectors are then fed into the capsule layer, which uses vectors instead of scalars to store the instantiation parameters of each feature. A vector can represent not only the intensity of activation but also some details of the instantiated part of the input. For each channel feature vector ui in \(U_{1}\) and \(U_{2}\) (i.e., ui = F1i for U1 and F2i for U2), convolution kernel \(K_{j}\) (j = 1, …, k) is used to generate \(u_{i|j} \in R^{d}\) for the j-th capsule using the following operation:

$$u_{i|j} = g(K_{j} \cdot u_{i} + b),$$
(12)

where \(g\) is a nonlinear activation function and \(b\) is a bias vector. The k channels can be reconstituted to \(\widehat{{u_{i} }}\):

$$\widehat{{u_{i} }} = \left[ {u_{i|1} ,u_{i|2} , \ldots ,u_{i|k} } \right]$$
(13)

Then, the dynamic routing algorithm (as shown in Table 1) is applied to generate the capsules of the next layer. This process replaces the pooling operation, which discards location information. At the beginning of the dynamic routing algorithm, the same weight is assigned to each location \(c_{i|j}^{r}\), as in the average pooling operation. After the first iteration, the weight of each location is updated according to the similarity between \(c_{i|j}^{r}\) and \(\widehat{{u_{i|j} }}\). The weights stabilize after T iterations.
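Since the contents of Table 1 are not reproduced here, the following NumPy sketch follows the general routing-by-agreement procedure of Sabour et al. [8] rather than the exact variant used in CapsTM. The tensor layout (I input positions, J output capsules, d dimensions), the default of three iterations and the function names are assumptions made for illustration.

```python
import numpy as np


def squash(s: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Shrink each vector so its length lies in [0, 1) while keeping its direction."""
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)


def dynamic_routing(u_hat: np.ndarray, num_iters: int = 3) -> np.ndarray:
    """u_hat: (I, J, d) prediction vectors from I inputs to J output capsules."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                  # logits start equal (uniform weights)
    for _ in range(num_iters):
        c = softmax(b, axis=1)                       # coupling coefficients per input
        s = np.einsum("ij,ijd->jd", c, u_hat)        # weighted sum for each capsule
        v = squash(s)                                # (J, d) capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)    # raise weights where input agrees
    return v


rng = np.random.default_rng(0)
u_hat = rng.normal(size=(10, 4, 16))                 # illustrative sizes only
v = dynamic_routing(u_hat, num_iters=3)
print(v.shape)
```

In the first iteration all logits are zero, so every input contributes with the same weight, which corresponds to the average-pooling-like start described above; later iterations shift weight toward inputs whose predictions agree with the capsule output.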

Finally, each piece of text si is represented by the outputs of all capsule networks:

$$C_{i} = \left[ {d_{1}^{T} ,d_{2}^{T} , \ldots ,d_{J}^{T} } \right]$$
(14)

Prediction layer

The prediction layer is a fully connected network using the sigmoid activation function for prediction. Following previous work on TM [1,2,3], we use the following vector as the input of the prediction layer and the cross-entropy loss as the classification loss:

$$C = \left[ {C_{1} ,C_{2} ,C_{1} - C_{2} ,cos\left( {C_{1} ,C_{2} } \right)} \right]$$
(15)
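A minimal sketch of how the input vector of Eq. 15 and the loss could be computed is given below. It assumes that C1 and C2 have already been flattened into one-dimensional vectors, and it replaces the fully connected network with a single sigmoid unit; the function names, the weights and the example vectors are hypothetical.

```python
import numpy as np


def match_features(c1: np.ndarray, c2: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Build the Eq. 15 vector [C1, C2, C1 - C2, cos(C1, C2)]."""
    cos = float(c1 @ c2) / (np.linalg.norm(c1) * np.linalg.norm(c2) + eps)
    return np.concatenate([c1, c2, c1 - c2, np.array([cos])])


def predict(features: np.ndarray, w: np.ndarray, b: float) -> float:
    """Single sigmoid unit standing in for the fully connected network (assumption)."""
    return float(1.0 / (1.0 + np.exp(-(features @ w + b))))


def cross_entropy(y: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one pair with gold label y in {0, 1}."""
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1.0 - p + eps)))


c1 = np.array([1.0, 0.0, 2.0])
c2 = np.array([1.0, 0.0, 2.0])
feats = match_features(c1, c2)
p = predict(feats, w=np.zeros(feats.size), b=0.0)
print(feats.size, p, cross_entropy(1, p))
```

For two d-dimensional inputs the feature vector has 3d + 1 entries; the difference and cosine terms give the classifier explicit signals about how close the two capsule representations are.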

Experiments

Dataset

We asked two medical experts to annotate a corpus for Chinese medical question matching, which contains 36,360 question pairs. This corpus is randomly split into three parts: a training set of 32,360 question pairs, a development set of 2000 question pairs and a test set of 2000 question pairs. The distributions of positive and negative samples in each part are listed in detail in Table 2. Here, positive samples are medical question pairs with the same meaning or intent, while negative samples are medical question pairs with different meanings or intents.

Table 2 Distributions of positive samples and negative samples in the corpus used in this study

Experiment settings

We compare CapsTM with the following state-of-the-art deep learning neural networks: ESIM [1], ABCNN [2], BIMPM [3], DISAN [4], DRCN [5], DECOMP [6] and BERT [7]. All hyperparameters used in our experiments are shown in Table 3.

Table 3 Hyperparameters used in our experiments

All Chinese character embeddings are pretrained by word2vec (https://code.google.com/p/word2vec/) and BERT (https://github.com/google-research/bert) on a large-scale Chinese medical corpus. All model parameters are optimized on the corresponding development sets. All methods are implemented with TensorFlow 1.10.0, and all models are trained on machines with an NVIDIA GeForce GTX 1080 Ti GPU. The performance of the models is measured by precision (P), recall (R) and F-score.
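For reference, the three metrics can be computed from gold and predicted labels as in the sketch below. It assumes that the positive class is label 1 (matched pairs) and that the F-score is the balanced F1 measure; the paper does not state either choice explicitly, and the example labels are made up.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [1, 1, 1, 0, 0, 1]   # made-up gold labels
pred = [1, 0, 1, 1, 0, 1]   # made-up predictions
print(precision_recall_f1(gold, pred))  # (0.75, 0.75, 0.75)
```

In this toy example there are three true positives, one false positive and one false negative, so precision, recall and F1 all equal 0.75.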

Results and discussion

As shown in Table 4 where all the highest values in each type are highlighted in bold, when using Chinese character embeddings initialized by word2vec, CapsTM(word2vec) achieves an F-score of 0.8432, and outperforms other state-of-the-art neural networks except BERT. The difference ranges from 0.65 to 3.85% in F-score. When using Chinese character embeddings initialized by BERT, the F-score of CapsTM(BERT) increases to 0.8666, which is higher than that of BERT by 0.2%. Compared to ESIM, CapsTM(word2vec) is significantly better with an improvement of 1.49% in F-score, indicating that the capsule layer added is effective.

Table 4 Comparison of CapsTM and other state-of-the-art methods

In addition, to investigate the effect of the attention mechanism used in the representation layer and the dynamic routing algorithm used in the capsule layer, we conduct an ablation study on CapsTM. The results are shown in Table 5, where the highest values of each type are highlighted in bold and w/o denotes “without”. When attention is removed, or when routing is replaced by max pooling or mean pooling, the F-score of CapsTM drops. In the case of CapsTM(word2vec), the F-score decreases by at least 0.71% when routing is replaced by pooling and by 2.16% when attention is removed.

Table 5 Ablation study on CapsTM

Furthermore, we check the attention matrices of some samples and find that the attention mechanism can depict semantic similarities between words in question pairs. Figure 3 gives examples of a matched question pair and an unmatched question pair, where the darker the color is, the more semantically similar the question pair is.

Fig. 3 Visualization samples of the attention mechanism in the representation layer

CapsTM still makes some errors, which mainly fall into two categories. (1) Question pairs of the same type but with different topics are often wrongly classified as 1. For example, “乙肝疫苗有效期为多久 (How long is the validity period of the hepatitis B vaccine)” and “乙肝表面抗体能持续多久 (How long does the hepatitis B surface antibody last)” are wrongly classified as 1. (2) The answer to one question covers the answer to the other, but the two questions are not the same. For example, the answer to “乙肝高血压如何用药 (How to use medication for hypertension in patients with hepatitis B)” should be included in the answer to “有乙肝病要如何控制高血压 (How to control hypertension in patients with hepatitis B)”, but the answer to the latter cannot be used directly to answer the former, because medication is only one type of treatment for hypertension in patients with hepatitis B. A complete clinical knowledge graph might solve this problem. Therefore, for further improvement, we will investigate how to integrate a clinical knowledge graph into existing state-of-the-art deep neural networks in the future.

Conclusion

In this paper, we propose a novel five-layer neural network based on the capsule network for Chinese medical TM, called CapsTM. Experiments on a manually annotated corpus show that CapsTM outperforms the other state-of-the-art neural networks it is compared with. CapsTM also has the potential to be applied to TM in other domains.

Availability of data and materials

Not applicable.

Abbreviations

TM: Text Matching

CNN: Convolutional Neural Network

CapsTM: Capsule Network for Chinese Medical Text Matching

BiLSTM: Bidirectional Long Short-Term Memory

ESIM: Enhanced Sequential Inference Model

ABCNN: Attention-Based Convolutional Neural Network

BIMPM: Bilateral Multi-Perspective Matching

DRCN: Densely-connected Co-attentive Recurrent Neural Network

DECOMP: Decomposable Attention Model

BERT: Bidirectional Encoder Representations from Transformers

DSSM: Deep Structured Semantic Model

References

  1. Chen Q, Zhang X, Ling Z, et al. Enhanced LSTM for natural language inference. In: Proceedings of the 55th annual meeting of the association for computational linguistics; 2017. p. 1657–8.

  2. Yin W, Hinrich S, Bing X, et al. ABCNN: attention-based convolutional neural network for modeling sentence pairs. In: Proceedings of the 54th association for computational linguistics; 2016. p. 259–72.

  3. Wang Z, Wang H, Radu F. Bilateral multi-perspective matching for natural language sentences. In: Proceedings of the conference division and the AI journal division; 2017. p. 4144–50.

  4. Shen T, Zhou T, Long G, et al. DiSAN: directional self-attention network for RNN/CNN-free language understanding. In: Proceedings of the thirty-second AAAI conference on artificial intelligence; 2018. p. 5446–55.

  5. Kim S, Kang I, Kwak N, et al. Semantic sentence matching with densely-connected recurrent and co-attentive information. In: Proceedings of the thirty-third AAAI conference on artificial intelligence; 2019. p. 6586–93.

  6. Parikh AP, et al. A decomposable attention model for natural language inference. In: Proceedings of the 2016 empirical methods in natural language processing; 2016. p. 2249–55.

  7. Devlin J, Chang M, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics; 2019. p. 4171–86.

  8. Sabour S, Frosst N, Hinton GE, et al. Dynamic routing between capsules. In: Proceedings of the 2017 neural information processing systems; 2017. p. 3856–66.

  9. Neculoiu P, Versteegh M, Rotaru M, et al. Learning text similarity with siamese recurrent networks. In: Proceedings of the 2013 conference of the association for computational linguistics; 2016. p. 148–57.

  10. Huang P, He X, Gao J, et al. Learning deep structured semantic models for web search using click through data. In: Proceedings of the 2013 conference on information and knowledge management; 2013. p. 2333–8.

  11. Hu B, Lu Z, Li H, Chen Q. Convolutional neural network architectures for matching natural language sentences. In: Proceedings of the 27th international conference on neural information processing systems—volume 2 (NIPS’14). MIT Press, Cambridge, p. 2042–50.

  12. Shen Y, He X, Gao J, et al. A latent semantic model with convolutional-pooling structure for information retrieval. In: Proceedings of the 2014 conference on information and knowledge management; 2014. p. 101–10.

  13. Palangi H, et al. Semantic modelling with long-short-term memory for information retrieval. arXiv preprint arXiv:1412.6629 (2014).

  14. Mikolov T, Chen K, Corrado GS, et al. Efficient estimation of word representations in vector space. In: Proceedings of the 2013 international conference on learning representations; 2013.

Acknowledgements

Not applicable.

About this supplement

This article has been published as part of BMC Medical Informatics and Decision Making Volume 21, Supplement 2 2021: Health Big Data and Artificial Intelligence. The full contents of the supplement are available at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-21-supplement-2.

Funding

This paper is supported in part by Grants: NSFCs (National Natural Science Foundations of China) (U1813215, 61876052 and 61573118), National Key Research and Development Program of China (2017YFB0802204), Special Foundation for Technology Research Program of Guangdong Province (2015B010131010), Natural Science Foundation of Guangdong Province (2019A1515011158), Strategic Emerging Industry Development Special Funds of Shenzhen (JCYJ20170307150528934 and JCYJ20180306172232154), Innovation Fund of Harbin Institute of Technology (HIT.NSRIF.2017052). We also thank PingAn Health Technology Ltd. for supporting this study.

Author information

Contributions

XY, YS and BT designed the experiments, XY and YS wrote the manuscript, and YN, XH, XW, QC and BT revised the manuscript. All authors checked this revised version.

Corresponding author

Correspondence to Buzhou Tang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Yu, X., Shen, Y., Ni, Y. et al. CapsTM: capsule network for Chinese medical text matching. BMC Med Inform Decis Mak 21 (Suppl 2), 94 (2021). https://doi.org/10.1186/s12911-021-01442-9

  • DOI: https://doi.org/10.1186/s12911-021-01442-9
