Figure 1 shows the overall architecture of the proposed model, HyM. The first step is to extract features from each symptom text. We want all of the important information, such as context, semantic, and syntactic information, to be captured for subsequent processing. Hence, different methods are applied to extract different levels of features. In particular, there are two major channels for feature extraction.
(1)
In the first channel, 10 features are generated using the TF–IDF technique without word vectorization. TF–IDF effectively captures the shallow-level semantic information in the symptom texts.
(2)
In the second channel, the symptom texts are first converted into vectors. The vectors are then fed into three modules: BERT, LSTM, and TEXT-CNN, respectively. Each of the three modules generates 10 features. The second channel is good at capturing the deep-level semantic and context information in the symptom texts.
The features from the two channels are concatenated to represent the symptom text. A fully connected neural network is then used as the classifier to assign the symptom text to a medical specialty.
Text preprocessing and word vectorization
In the preprocessing step, the Jieba tool is used to segment the symptom description text into words. Thereafter, the punctuation marks, stop words, and special characters are removed.
To facilitate subsequent processing, the number of words in each symptom description text is fixed to \(L\). If a symptom description text has more than \(L\) words, the words after the \(L\)th are truncated. If a symptom description text has fewer than \(L\) words, it is padded with blank words so that the total number of words is \(L\). \(L\) is empirically set to 50 in the experiments. After the preprocessing step, each symptom description text \(s\) is represented as \(L\) words \({w}_{1}, {w}_{2},\dots , {w}_{L}\).
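As a rough illustration, the sketch below shows one way to implement this step in Python with the Jieba tokenizer. The stop-word set and the padding token are placeholders, since the exact resources used in this work are not specified.

```python
import jieba

L = 50  # fixed text length used in the experiments

# Hypothetical stop-word set; the actual list is not specified in the paper.
STOP_WORDS = {"的", "了", "是"}
PAD_TOKEN = "<pad>"  # placeholder for the "blank word" padding


def preprocess(text: str) -> list[str]:
    """Segment a symptom description with Jieba, drop stop words and
    non-word tokens, then truncate or pad the result to exactly L words."""
    words = [w for w in jieba.lcut(text)
             if w.strip() and w not in STOP_WORDS and any(c.isalnum() for c in w)]
    words = words[:L]                         # truncate texts longer than L
    words += [PAD_TOKEN] * (L - len(words))   # pad texts shorter than L
    return words
```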
The Word2Vec module creates a numerical vector for each word \({w}_{i}\). The advantage of the vector representation is that words with similar characteristics lie close to one another in the vector space. This step is essential for applying deep learning techniques to symptom description text. In the experiments, a dictionary is used, and a word is converted to its corresponding vector by looking it up in the dictionary. That is, each word \({w}_{i}\) is converted into an \(N\)-dimensional numerical vector \({V}_{i}\), and \(s\) is represented as a matrix of size \(L\times N\).
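A minimal sketch of this lookup, assuming the gensim implementation of Word2Vec and an embedding dimension of \(N=128\) (the paper does not fix these choices), is given below. Padding and out-of-vocabulary words are mapped to zero vectors as an illustrative convention.

```python
import numpy as np
from gensim.models import Word2Vec

N = 128  # assumed embedding dimension

# Toy corpus of already-segmented symptom texts
corpus = [["头痛", "发热"], ["咳嗽", "乏力"]]
w2v = Word2Vec(sentences=corpus, vector_size=N, window=5, min_count=1)


def text_to_matrix(words: list[str]) -> np.ndarray:
    """Look up each of the L words in the Word2Vec dictionary, yielding an
    L x N matrix; unknown and padding words map to zero vectors."""
    rows = [w2v.wv[w] if w in w2v.wv else np.zeros(N, dtype=np.float32)
            for w in words]
    return np.stack(rows)  # shape (L, N)
```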
Feature extraction based on TF–IDF
The TF–IDF weighting technique has been widely used in information retrieval and text mining. It is a statistical method for determining the importance of a word in a document: the importance increases proportionally to the frequency of the word in the document but decreases proportionally to the number of documents in the corpus in which the word occurs.
$$ {\text{TF-IDF}} = {\text{TF}} \times {\text{IDF}} $$
(1)
After word segmentation, TF–IDF is used to extract the feature vector in the first channel (see Fig. 1). It produces a vector of length \(L\), which is 50 in the experiments. The details are illustrated in Additional file 1: Fig. S1. In this channel, the output vector characterizes how distinctive and significant each word is within the document.
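A minimal sketch of this channel, assuming the scikit-learn TfidfVectorizer with `max_features=50` to mirror the fixed 50-dimensional output described above (the exact implementation is not specified in the paper):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-segmented symptom texts, joined by spaces so the
# default whitespace-style tokenizer can handle Chinese tokens.
docs = ["头痛 发热 两 天", "咳嗽 乏力 发热", "腹痛 腹泻"]

# max_features=50 caps the vocabulary at 50 terms; the relaxed token_pattern
# keeps single-character Chinese words that the default pattern would drop.
vectorizer = TfidfVectorizer(max_features=50, token_pattern=r"(?u)\S+")
tfidf_features = vectorizer.fit_transform(docs).toarray()  # shape: (n_docs, <=50)
```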
Feature extraction based on TEXT-CNN
TEXT–CNN is a one-dimensional convolutional neural network designed for text processing. It has the same structure as an ordinary CNN: a convolutional layer, a pooling layer, and a fully connected layer in sequence. The segmented text is used as the input for TEXT-CNN. The filter width equals the width of the word matrix created through embedding, i.e., the word-vector dimension, so the filter can only slide in the height (word) direction. Finally, a softmax layer is attached to perform the multi-class classification. The model structure is displayed in Additional file 1: Fig. S2.
In this study, sentence vectors are trained using TEXT-CNN as the content expression layer, and convolutional features are selected using max pooling with a stride of 1. Max pooling with stride 1 retains the more significant elements of the text while preserving some contextual information.
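A minimal PyTorch sketch of this channel is given below. The kernel sizes, number of filters, and the combination of stride-1 pooling with a final global max are illustrative assumptions; only the stride-1 max pooling and the 10-dimensional output follow the description above.

```python
import torch
import torch.nn as nn


class TextCNN(nn.Module):
    """Sketch of the TEXT-CNN channel: 1-D convolutions whose kernel width
    equals the embedding dimension, followed by max pooling with stride 1."""

    def __init__(self, embed_dim: int = 128, num_filters: int = 64,
                 kernel_sizes=(2, 3, 4), out_features: int = 10):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1)  # stride-1 max pooling
        self.fc = nn.Linear(num_filters * len(kernel_sizes), out_features)

    def forward(self, x):                       # x: (batch, L, embed_dim)
        x = x.transpose(1, 2)                   # -> (batch, embed_dim, L)
        pooled = []
        for conv in self.convs:
            c = torch.relu(conv(x))             # (batch, num_filters, L-k+1)
            p = self.pool(c)                    # stride-1 pooling keeps context
            pooled.append(p.max(dim=2).values)  # flatten over positions
        return self.fc(torch.cat(pooled, dim=1))  # 10 TEXT-CNN features
```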
Feature extraction based on LSTM
RNNs are good at processing sequence data. However, as the sequence length increases, they suffer from vanishing gradients, exploding gradients, and difficulty in capturing long-term dependencies. As a variant of the RNN (Fig. 2), LSTM addresses these problems by adding an input gate \({i}_{t}\), a forget gate \({f}_{t}\), an output gate \({o}_{t}\), and a memory cell state \({c}_{t}\), and it uses this gate mechanism to control information retention, forgetting, and state updates. The calculation formulas are listed as follows:
$$ {\text{f}}_{{\text{t}}} = \sigma \left( {{\text{W}}_{{\text{f}}} \cdot \left[ {{\text{h}}_{{{\text{t}} - 1}} ,{\text{x}}_{{\text{t}}} } \right] + {\text{b}}_{{\text{f}}} } \right) $$
(2)
$$ {\text{i}}_{{\text{t}}} = \sigma \left( {{\text{W}}_{{\text{i}}} \cdot \left[ {{\text{h}}_{{{\text{t}} - 1}} ,{\text{x}}_{{\text{t}}} } \right] + {\text{b}}_{{\text{i}}} } \right) $$
(3)
$$ {\text{o}}_{{\text{t}}} = \sigma \left( {{\text{W}}_{{\text{o}}} \cdot \left[ {{\text{h}}_{{{\text{t}} - 1}} ,{\text{x}}_{{\text{t}}} } \right] + {\text{b}}_{{\text{o}}} } \right) $$
(4)
$$ {\text{c}}_{{\text{t}}} = {\text{f}}_{{\text{t}}} \odot {\text{c}}_{{{\text{t}} - 1}} + {\text{i}}_{{\text{t}}} \odot \tanh \left( {{\text{W}}_{{\text{c}}} \cdot \left[ {{\text{h}}_{{{\text{t}} - 1}} ,{\text{x}}_{{\text{t}}} } \right] + {\text{b}}_{{\text{c}}} } \right) $$
(5)
$$ {\text{h}}_{{\text{t}}} = {\text{o}}_{{\text{t}}} \odot \tanh \left( {{\text{c}}_{{\text{t}}} } \right) $$
(6)
where \(\sigma\) is the sigmoid activation function, \(W\) denotes the weight matrices, \(b\) the biases, \({x}_{t}\) the input vector at time \(t\), \({h}_{t-1}\) the output at the previous time step, \({c}_{t-1}\) the cell state at the previous time step, and \({c}_{t}\) and \({h}_{t}\) the current cell state and output, respectively.
With the gates described above, LSTM effectively solves the long-term dependency problem of RNNs and alleviates the vanishing gradient problem that arises during backpropagation in RNN training.
In this study, LSTM is used to represent the deep-level features of the disease symptom text as vectors (see Additional file 1: Fig. S3). The LSTM encodes the sentence embedding input at each time step to obtain the corresponding hidden-layer vector (the output dimension is (128, 10)), producing a new feature vector.
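A minimal PyTorch sketch of this channel is shown below. The hidden size of 128 follows the (128, 10) dimensions mentioned above; the use of the last hidden state and the single-layer configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LSTMFeatures(nn.Module):
    """Sketch of the LSTM channel: encode the word-vector sequence and map
    the final hidden state to a 10-dimensional feature vector."""

    def __init__(self, embed_dim: int = 128, hidden_size: int = 128,
                 out_features: int = 10):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, out_features)

    def forward(self, x):                    # x: (batch, L, embed_dim)
        outputs, (h_n, c_n) = self.lstm(x)   # h_n: (1, batch, hidden_size)
        return self.fc(h_n[-1])              # (batch, 10) LSTM features
```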
Feature weighting based on BERT
The attention mechanism is designed to capture the degree to which each feature word contributes to correctly categorizing the entire text into the intended category. The attention process gives priority to content that is vital while ignoring content that is less important. Compared with taking a direct, unweighted average of the output vectors, adding an attention mechanism to the network avoids retaining redundant and noisy content from the original text and thus improves classification accuracy. The attention mechanism is used extensively within the BERT model, and in this paper the BERT model is used to compute the attention weights.
As illustrated in Additional file 1: Fig. S4, the BERT model is used to obtain attention-weighted vectors. This mechanism assigns different weights to the outputs of the network model, which enhances the attention to keywords by further incorporating word-level features. The attention mechanism improves both the model's focus on local features and the expression of sentence-level semantic information. To improve the accuracy of the final categorization, a word's weight is increased if it contributes more meaning to the sentence; otherwise, its weight is decreased.
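A minimal sketch of obtaining BERT-based, attention-weighted sentence vectors with the Hugging Face `transformers` library is given below. The `bert-base-chinese` checkpoint and the mean pooling over the last hidden states are assumptions, since the paper does not specify them; a further projection to 10 features would follow in the second channel.

```python
import torch
from transformers import BertTokenizer, BertModel

# Checkpoint choice is an assumption; the paper does not name one.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")


def bert_features(text: str) -> torch.Tensor:
    """Encode a symptom text with BERT; the self-attention layers weight each
    token, and mean pooling over the last hidden states yields one vector."""
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=50, padding="max_length")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # (768,)
```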
A weighted semantic feature vector X, encompassing local and global features, is produced after the preceding model extracts the local and global feature representation of the text.
$$ {\text{X}} = {\text{Add}}\left\langle {\text{e}}_{2} ,\;{\text{e}}_{3} ,\;{\text{e}}_{4} \right\rangle $$
(7)
where \({e}_{2}\) stands for the feature vector produced by BERT, \({e}_{3}\) for the feature vector produced by the LSTM, and \({e}_{4}\) for the feature vector produced by TEXT–CNN.
Model fusion and output
The two-channel models described above extract the local and global feature representations of the symptom description text, which are then concatenated to produce the combined feature vector \(X\).
After the two channels are fused, a fully connected layer (FC) is added to translate the weighted vector of symptom-text features into the label space (Fig. 1). A dropout mechanism is included after the fully connected layer so that the weight updates do not rely solely on a subset of the features, which mitigates model overfitting. A classifier then computes the probability distribution over the medical specialties to which the patient's symptoms belong, and the model outputs the prediction directly.
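A minimal PyTorch sketch of this fusion and output stage is given below. The fused dimension of 40 follows from the four 10-dimensional feature sets described earlier; the dropout rate and the number of specialties are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    """Sketch of the fusion and output stage: concatenate the features from
    the two channels, apply dropout, and map the fused vector to the
    medical-specialty label space with a fully connected layer."""

    def __init__(self, fused_dim: int = 40, num_specialties: int = 12,
                 dropout: float = 0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)      # regularizes the FC weights
        self.fc = nn.Linear(fused_dim, num_specialties)

    def forward(self, tfidf_f, bert_f, lstm_f, cnn_f):
        x = torch.cat([tfidf_f, bert_f, lstm_f, cnn_f], dim=1)  # fused vector X
        logits = self.fc(self.dropout(x))
        return torch.softmax(logits, dim=1)     # probabilities over specialties
```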