Named entity recognition in Chinese EMRs is a sequence labeling task in natural language processing. Deep learning-based methods effectively extract textual feature information and address the problem of named entity recognition in EMRs. Some researchers currently use BERT pretrained models for named entity recognition, for example by combining a CNN model with BERT [25]. The pretrained model can represent the text's word embeddings more accurately, resulting in better named entity recognition performance.
These methods have, to some extent, solved the problem of named entity recognition in complex EMR texts in the medical field. In practice, however, two issues remain: pretrained models from the general domain cannot sufficiently represent Chinese EMR texts, and a single deep neural network cannot fully extract the feature information contained in the word vectors of the text.
To address these issues, this paper constructs a hybrid neural network model based on the medical MC-BERT model. The model includes an MC-BERT layer for word embedding and, in the downstream model, a BiLSTM layer, a CNN layer, a multihead self-attention (MHA) mechanism, and a CRF layer. MC-BERT is used for medical text word embedding: Chinese characters are converted into word vectors that carry textual information, achieving a better embedding. The resulting word vectors are then fed simultaneously into the BiLSTM and multilayer CNN models for feature extraction. The outputs of these two parts are fused, and the multihead self-attention mechanism is applied to extract global feature correlation information from multiple angles and levels. Finally, the CRF layer fully considers the intercharacter tag dependencies and constraints during decoding to ensure that the final predicted tags are reasonable. The architecture of the hybrid neural network model based on MC-BERT is shown in Fig. 1.
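To make the data flow concrete, the following is a minimal PyTorch sketch of the architecture described above. The pretrained-model path, hidden sizes, kernel sizes, and tag count are illustrative assumptions; the BiLSTM and CNN outputs are fused here by concatenation (one possible reading of the fusion step); and the CRF layer uses the third-party pytorch-crf package rather than the authors' own implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # third-party pytorch-crf package (assumed to be installed)


class HybridNERModel(nn.Module):
    def __init__(self, bert_name="path/to/mc-bert", num_tags=9,
                 lstm_hidden=256, cnn_channels=256, kernel_sizes=(3, 5, 7),
                 num_heads=8, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)      # MC-BERT word embedding layer
        d = self.bert.config.hidden_size                      # usually 768
        self.bilstm = nn.LSTM(d, lstm_hidden, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList([                          # multi-kernel CNN branch
            nn.Conv1d(d, cnn_channels, k, padding=k // 2) for k in kernel_sizes
        ])
        fused_dim = 2 * lstm_hidden + cnn_channels * len(kernel_sizes)
        self.mha = nn.MultiheadAttention(fused_dim, num_heads,
                                         dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.emit = nn.Linear(fused_dim, num_tags)            # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)            # label-dependency decoder

    def _features(self, input_ids, attention_mask):
        v = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        v_b, _ = self.bilstm(v)                               # BiLSTM features
        c = v.transpose(1, 2)                                 # Conv1d expects (batch, dim, seq)
        v_c = torch.cat([torch.relu(conv(c)) for conv in self.convs], dim=1).transpose(1, 2)
        fused = torch.cat([v_b, v_c], dim=-1)                 # fuse BiLSTM and CNN outputs
        attn, _ = self.mha(fused, fused, fused,
                           key_padding_mask=~attention_mask.bool())
        return self.emit(self.dropout(attn))

    def loss(self, input_ids, attention_mask, tags):
        emissions = self._features(input_ids, attention_mask)
        return -self.crf(emissions, tags, mask=attention_mask.bool(), reduction="mean")

    def decode(self, input_ids, attention_mask):
        emissions = self._features(input_ids, attention_mask)
        return self.crf.decode(emissions, mask=attention_mask.bool())
```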
BERT models
BERT is an excellent pretraining model for text word vector representation. It is composed of a multilayer bidirectional Transformer encoder that can take the words before and after a target word into account to determine its meaning in context. The BERT model structure is shown in Fig. 2, and its composition is similar to those of GPT and ELMo. The Chinese BERT model is typically obtained through unsupervised training on a large general-purpose corpus; it learns good feature representations of words and can be used directly in downstream tasks. Texts in fields such as biomedicine have a very different structure and word distribution than ordinary texts in general domains, and they contain many long-tailed terms. Therefore, a BERT model based on the general domain is unsuitable for medical texts. To better learn the content of medical texts, this paper uses the MC-BERT model from the Chinese medical field to perform word embedding on the training data. The structure of the MC-BERT model is the same as that of the BERT model, but different pretraining methods and pretraining corpora are used. One MC-BERT pretraining approach is mask prediction for medical entities, which masks only medical-related words. This approach selects 15% of the medical-related words in the Chinese pretraining corpus; of these selected medical words, 80% are replaced with [Mask], 10% are replaced with another word, and the remaining 10% are kept unchanged, and the model is trained to predict the masked words. The second pretraining method is "next-sentence prediction," which selects two sentences in the correct order from the same Chinese medical corpus document as positive samples and then randomly selects sentences from different documents to be appended after the first sentence as negative samples. The former task focuses on the information between words, and the latter captures the information between sentences. Integrating these two kinds of information during pretraining gives the word embeddings better expressive power.
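The entity-level masking strategy can be illustrated with the short sketch below. This is not the authors' released pretraining code; the token list, the medical-entity spans, and the vocabulary are hypothetical inputs used only to show the 80%/10%/10% replacement rule.

```python
import random

def mask_medical_entities(tokens, entity_spans, vocab, mask_token="[MASK]",
                          select_prob=0.15):
    """Select ~15% of the medical-entity spans; within a selected span, replace
    ~80% of positions with [MASK], ~10% with a random token, and keep ~10% unchanged."""
    tokens = list(tokens)
    labels = [None] * len(tokens)            # None: position is ignored by the MLM loss
    for start, end in entity_spans:          # character spans of medical terms
        if random.random() >= select_prob:
            continue                          # entity not selected for masking
        for i in range(start, end):
            labels[i] = tokens[i]             # the model must predict the original token
            r = random.random()
            if r < 0.8:
                tokens[i] = mask_token        # 80%: replace with [MASK]
            elif r < 0.9:
                tokens[i] = random.choice(vocab)   # 10%: replace with a random token
            # remaining 10%: token left unchanged
    return tokens, labels
```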
BiLSTM models
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) model. In comparison with the traditional RNN structure, LSTM adds three gate structures, namely, an input gate, a forget gate, and an output gate, which allow the cell to retain more useful information. In addition, the LSTM model can effectively alleviate the vanishing and exploding gradient problems that arise when training on long text sequences. The calculation process of an LSTM neuron is shown in Formulas (1)–(6).
$${f}_{t}=\sigma \left({w}_{f}\cdot {h}_{t-1}+{u}_{f}\cdot {x}_{t}+{b}_{f}\right)$$
(1)
$${i}_{t}=\sigma \left({w}_{i}\cdot {h}_{t-1}+{u}_{i}\cdot {x}_{t}+{b}_{i}\right)$$
(2)
$${\widetilde{c}}_{t}=\mathit{tan}h\left({w}_{c}\cdot {h}_{t-1}+{u}_{c}\cdot {x}_{t}+{b}_{c}\right)$$
(3)
$${c}_{t}={f}_{t}\odot {c}_{t-1}+{i}_{t}\odot {\widetilde{c}}_{t}$$
(4)
$${o}_{t}=\sigma \left({w}_{o}\cdot {h}_{t-1}+{u}_{o}\cdot {x}_{t}+{b}_{o}\right)$$
(5)
$${h}_{t}={o}_{t}\odot \mathrm{tan}h\left({c}_{t}\right)$$
(6)
where \({f}_{t}\), \({i}_{t}\), \({h}_{t}\), \({o}_{t}\), and \({c}_{t}\) represent the forget gate, input gate, hidden layer, output gate, and current cell state, respectively. \({w}_{i},{w}_{f},{w}_{c}\) and \({w}_{o}\) each indicate the weight corresponding to the previous hidden layer \({h}_{t-1}\); \({u}_{i}\), \({u}_{f}\), \({u}_{c}\), and \({u}_{o}\) represent the weights corresponding to the current input vector \({x}_{t}\); and \({b}_{i}\), \({b}_{f}\), \({b}_{c}\), and \({b}_{o}\) indicate the relevant bias vectors. \({\widetilde{c}}_{t}\) is the new candidate state vector. \(\sigma\) is the sigmoid activation function, and ⊙ is the elementwise (Hadamard) product.
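As a concrete reference, the single time step defined by Formulas (1)–(6) can be transcribed directly into code as follows; the parameter layout and tensor shapes are illustrative rather than the internals of any particular library.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Formulas (1)-(6).

    `params` holds the weight matrices w_* (for h_{t-1}), u_* (for x_t) and biases b_*.
    """
    w, u, b = params["w"], params["u"], params["b"]
    f_t = torch.sigmoid(h_prev @ w["f"] + x_t @ u["f"] + b["f"])   # forget gate, Eq. (1)
    i_t = torch.sigmoid(h_prev @ w["i"] + x_t @ u["i"] + b["i"])   # input gate, Eq. (2)
    c_tilde = torch.tanh(h_prev @ w["c"] + x_t @ u["c"] + b["c"])  # candidate state, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde                             # new cell state, Eq. (4)
    o_t = torch.sigmoid(h_prev @ w["o"] + x_t @ u["o"] + b["o"])   # output gate, Eq. (5)
    h_t = o_t * torch.tanh(c_t)                                    # hidden state, Eq. (6)
    return h_t, c_t
```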
Bidirectional LSTM involves applying a forward and reverse LSTM network to each training text sequence separately, with the two LSTM networks connected to the same output layer. As a result, information in the text can be obtained from both the forward and backward directions, and semantic dependencies of longer distances can be better captured at the sentence level [26].
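In practice, this bidirectional structure is usually obtained by running a forward and a backward LSTM over the same sequence and concatenating their hidden states at every position, for example (dimensions are illustrative):

```python
import torch
import torch.nn as nn

# A forward and a backward LSTM over the same sequence; their hidden states are
# concatenated at each position, doubling the output dimension.
bilstm = nn.LSTM(input_size=768, hidden_size=256, batch_first=True, bidirectional=True)
x = torch.randn(2, 50, 768)   # (batch, sequence length, embedding dimension)
out, _ = bilstm(x)            # out.shape == (2, 50, 512): 256 forward + 256 backward
```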
CNN models
The convolutional neural network (CNN) model has a convolution layer and a pooling layer, which gives the CNN a good ability to select local features. It can also capture the local semantic relationship between words in a sentence and reduce the dimension of features. Although the CNN was designed to extract image features, it has increasingly been used in natural language processing tasks such as named entity recognition in recent years [27]. In the convolution layer, the text features are subjected to convolution operations through multiple convolution kernels of different sizes, and multiple convolution kernels can be efficiently calculated in parallel, which can further improve the calculation efficiency of feature vectors. The pooling layer extracts the representation of the most important features in the convolutional layer using the max pooling operation, resulting in the text feature vector based on the CNN layer.
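A small sketch of such a convolution and max-pooling branch is given below; the kernel sizes and channel counts are assumptions for illustration, and the max pooling shown here follows the description in the paragraph above.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, embed_dim=768, out_channels=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        # One Conv1d per kernel size; the kernels run in parallel over the sequence.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, out_channels, k) for k in kernel_sizes]
        )

    def forward(self, x):                      # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                  # Conv1d expects (batch, embed_dim, seq_len)
        # Max pooling keeps the strongest response of each filter over the sequence.
        pooled = [torch.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)       # (batch, out_channels * len(kernel_sizes))

features = TextCNN()(torch.randn(2, 50, 768))  # -> shape (2, 384)
```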
Multihead self-attention
Attention mechanisms are widely used in deep learning-based natural language processing tasks. Previous work has combined a self-attention mechanism (Self-Attention) with BiLSTM and applied it to the task of named entity recognition. When extracting text feature information, recurrent neural network models such as the RNN and LSTM cannot fully account for the importance of the relevant characters within an entire sentence, and even BiLSTM captures little of the important information in long-distance text. Introducing a self-attention mechanism effectively addresses the modeling of temporal correlations in text data. To further extract the interactive representation of a text sequence, the multihead self-attention mechanism obtains semantic feature information from multiple levels and perspectives.
In all attention mechanisms, there is a task-related query vector \(Q\). In addition to the query vector \(Q\), the self-attention mechanism adds key and value matrices \(K\) and \(V\). These three matrices are obtained by linear transformations of the input sequence with the corresponding weight matrices, and their dimensions are all \({d}_{k}\), so the three matrices \(Q\), \(K\) and \(V\) contain the relevant information of the input features.
In the multihead self-attention mechanism, each self-attention head is also called a parallel computing head. Through multiple independent attention calculations, these heads capture the unique feature information of each character in the text sequence in different representation subspaces, with each head focusing on a different part of the input. The three matrices \(Q\), \(K\) and \(V\) therefore undergo multiple mutually independent linear transformations; that is, they are multiplied by multiple different weight matrices W. Repeating this process for each head, the self-attention mechanism that uses the scaled dot product as its scoring function is given by Formula (7).
$$Attention\left(Q{W}^{Q},K{W}^{K},V{W}^{V}\right)=softmax\left(\frac{{Q{W}^{Q}\left(K{W}^{K}\right)}^{T}}{\sqrt{{d}_{k}}}\right)V{W}^{V}$$
(7)
where \({Q\in R}^{n\times {d}_{k}}\), \({K\in R}^{m\times {d}_{k}}\) and \({V\in R}^{m\times {d}_{k}}\) are the vectorized sequences obtained after the linear transformation of the input sequence. \({W}^{Q}\)∈\({R}^{{d}_{k}\times {d}_{k/h}}\), \({W}^{K}\in {R}^{{d}_{k}\times {d}_{k/h}}\) and \({W}^{V}\in {R}^{{d}_{k}\times {d}_{k/h}}\) denote the corresponding parameter weight matrices, and \(softmax\) is a column normalization function. The multihead self-attention mechanism combines these h self-attention mechanisms, and its calculation process \(\mathrm{MultiHeadAttention}\) is shown in Formulas (8)–(9).
$$MultiHeadAttention=Concat\left({head}_{1},\ldots ,{head}_{h}\right){W}^{O}$$
(8)
$${head}_{i}=Attention\left({QW}_{i}^{Q},{KW}_{i}^{K},{VW}_{i}^{V}\right)$$
(9)
where \({head}_{i}\) represents the ith head in the multihead self-attention mechanism, and \(\mathrm{Concat}\) represents the concatenation operation. \({W}^{O}\in {R}^{{d}_{k}\times {d}_{k}}\) is the weight matrix that applies a linear transformation after the multiple heads are combined.
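Formulas (7)–(9) can be transcribed directly into a short module such as the one below. The dimensions are illustrative, the per-head matrices \({W}_{i}^{Q}\), \({W}_{i}^{K}\) and \({W}_{i}^{V}\) are implemented as one fused linear layer per matrix, and the dot product is scaled by the per-head dimension \({d}_{k}/h\), a common convention.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_k=512, h=8):
        super().__init__()
        assert d_k % h == 0
        self.h, self.d_head = h, d_k // h
        # Fused projections: equivalent to h separate d_k x (d_k/h) matrices per head.
        self.w_q = nn.Linear(d_k, d_k, bias=False)
        self.w_k = nn.Linear(d_k, d_k, bias=False)
        self.w_v = nn.Linear(d_k, d_k, bias=False)
        self.w_o = nn.Linear(d_k, d_k, bias=False)   # W^O in Formula (8)

    def forward(self, x):                            # x: (batch, n, d_k)
        b, n, _ = x.shape

        def split(t):                                # -> (batch, h, n, d_k/h)
            return t.view(b, n, self.h, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # scaled dot product, Eq. (7)
        heads = torch.softmax(scores, dim=-1) @ v                  # Attention(QW^Q, KW^K, VW^V)
        concat = heads.transpose(1, 2).reshape(b, n, -1)           # Concat(head_1, ..., head_h)
        return self.w_o(concat)                                    # Eq. (8)
```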
CRF models
A conditional random field (CRF) uses the constraint relations between labels to predict the output label sequence in a valid order, ensuring the soundness of the entity label output results. Because the models in this paper all produce their output through the CRF layer, the scoring function can be defined as in Formula (10).
$$score\left(X,y\right)={\sum }_{i=1}^{n}{A}_{{y}_{i},{y}_{i+1}}+{\sum }_{i=1}^{n}{P}_{i,{y}_{i}}$$
(10)
where \(X\) is the input text sequence \(\left({x}_{1},{x}_{2},{x}_{3},\ldots ,{x}_{n}\right)\), \({A}_{i,j}\) and \({P}_{i,{y}_{i}}\) are elements of the transition matrix and the observation (emission) matrix, respectively, and the scoring function is the sum of these two terms. \(y\) is the label sequence of the predicted output. As shown in Formula (11), the conditional probability \(P\left(y|X\right)\) of \(y\) under a given \(X\) can be calculated using the scoring function.
$$P\left(y|X\right)=\frac{exp\left(score\left(X,y\right)\right)}{{\sum }_{\widetilde{y}\in {Y}_{X}}exp\left(score\left(X,\widetilde{y}\right)\right)}$$
(11)
where \({Y}_{X}\) represents all possible label sequences for a given sentence and the loss function is defined by Formula (12).
$$L=-\sum_{i=0}^{N}logP\left({Y}_{i}|{X}_{i}\right)$$
(12)
Following the completion of training, the label sequence \({y}^{*}\) obtained when the scoring function reaches its maximum value can be calculated using Formula (13).
$${y}^{*}={argmax}_{\widetilde{y}\in {Y}_{X}}score\left(X,\widetilde{y}\right)$$
(13)
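The following is an illustrative transcription of Formulas (10)–(13); production CRF layers compute the normalizer in Formula (11) with the forward algorithm and the argmax in Formula (13) with Viterbi decoding, rather than the brute-force enumeration used here for clarity.

```python
import itertools
import torch

def crf_score(emissions, transitions, tags):
    """Formula (10): sum of emission (observation) scores P and transition scores A."""
    tags = torch.as_tensor(tags)
    s = emissions[torch.arange(len(tags)), tags].sum()        # sum_i P_{i, y_i}
    for i in range(len(tags) - 1):
        s = s + transitions[tags[i], tags[i + 1]]              # sum_i A_{y_i, y_{i+1}}
    return s

def crf_neg_log_likelihood(emissions, transitions, tags, num_tags):
    """Formulas (11)-(12): -log P(y|X), normalized over every possible tag sequence."""
    gold = crf_score(emissions, transitions, tags)
    all_scores = torch.stack([
        crf_score(emissions, transitions, y)
        for y in itertools.product(range(num_tags), repeat=emissions.size(0))
    ])
    return torch.logsumexp(all_scores, dim=0) - gold

def crf_decode(emissions, transitions, num_tags):
    """Formula (13): return the label sequence with the highest score."""
    return max(itertools.product(range(num_tags), repeat=emissions.size(0)),
               key=lambda y: crf_score(emissions, transitions, y).item())
```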
Algorithm description
Inputs: A batch of \(k\) text character sequences, each of the form \(W=[{w}_{1},{w}_{2},\ldots ,{w}_{n}]\), is entered at a time (\({w}_{i}\) represents the corresponding character in the sentence, and \(n\) denotes the maximum length of the input sentence).
Outputs: The hybrid neural network model produces the output label sequence \(Y=[{y}_{1},{y}_{2},\ldots ,{y}_{n}]\) from the input text character sequence \(W\) (\({y}_{i}\) is the label that corresponds to the character \({w}_{i}\)).
Step 1: Word embedding
Following the word embedding process of MC-BERT, the word vector representation \({\mathrm{V}\in R}^{k\times n\times t}\) is obtained for the input character sequences \(\mathrm{W}\) (\(t\) is the hidden dimension of the BERT representation, usually 768).
Step 2: Downstream models for feature extraction
(1) The feature vector matrices \({\mathrm{V}}_{\mathrm{B}}\in {\mathrm{R}}^{\mathrm{h}\times \mathrm{z}}\) and \({\mathrm{V}}_{\mathrm{C}}\in {\mathrm{R}}^{\mathrm{r}\times \mathrm{j}\times \mathrm{z}}\) are obtained by feeding the word vector representation \(\mathrm{V}\) into the BiLSTM and CNN models, respectively (\(\mathrm{h},\mathrm{z},\mathrm{r},\mathrm{j}\) represent the corresponding vector dimension values of BiLSTM and CNN);

(2) The obtained \({\mathrm{V}}_{\mathrm{B}}\) and \({\mathrm{V}}_{\mathrm{C}}\) are summed along a certain dimension to obtain the fused vector matrix \({\mathrm{V}}_{\mathrm{F}}\in {\mathrm{R}}^{\mathrm{m}\times \mathrm{z}}\), and \({\mathrm{V}}_{\mathrm{F}}\) is put into the multihead self-attention mechanism (MHA) to further obtain the vector matrix \({\mathrm{V}}_{\mathrm{M}}\in {\mathrm{R}}^{\mathrm{v}\times \mathrm{z}}\) (\(\mathrm{m},\mathrm{v}\) represent the vector matrix dimension values of the corresponding MHA).
Step 3: Decoding features into output labels
The CRF layer receives the MHA-mapped vector matrix \({\mathrm{V}}_{\mathrm{M}}\in {\mathrm{R}}^{\mathrm{v}\times \mathrm{z}}\) and decodes it to produce the NER output label sequence \(Y=[{y}_{1},{y}_{2},\ldots ,{y}_{n}]\) (\({y}_{i}\) indicates the label corresponding to the character \({w}_{i}\)).
Step 4: Hyperparameter adjustment
The learning rate \(\mathrm{\alpha }\), the dropout rate, and other hyperparameters of the downstream model are updated independently, and Steps 2 and 3 are then repeated to retrain the hybrid neural network model in this paper and return the results. Based on the returned results, the relatively optimal values of \(\mathrm{\alpha }\), the dropout rate, and the other parameters are selected.
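A hedged sketch of this step is shown below as a simple grid search over the learning rate and dropout rate; `train_and_evaluate` and its F1 return value are hypothetical stand-ins for the training procedure of Steps 2 and 3.

```python
from itertools import product

def tune_hyperparameters(train_and_evaluate,
                         learning_rates=(1e-5, 3e-5, 5e-5),
                         dropouts=(0.1, 0.3, 0.5)):
    """Grid search: re-run training (Steps 2-3) for each setting and keep the best F1."""
    best_f1, best_cfg = -1.0, None
    for lr, dropout in product(learning_rates, dropouts):
        f1 = train_and_evaluate(lr=lr, dropout=dropout)   # hypothetical training routine
        if f1 > best_f1:
            best_f1, best_cfg = f1, {"lr": lr, "dropout": dropout}
    return best_cfg, best_f1
```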