Effective attention-based network for syndrome differentiation of AIDS

Background Syndrome differentiation aims at dividing patients into several types according to their clinical symptoms and signs, which is essential for traditional Chinese medicine (TCM). Several previous works were devoted to employing the classical algorithms to classify the syndrome and achieved delightful results. However, the presence of ambiguous symptoms substantially disturbed the performance of syndrome differentiation, This disturbance is always due to the diversity and complexity of the patients’ symptoms. Methods To alleviate this issue, we proposed an algorithm based on the multilayer perceptron model with an attention mechanism (ATT-MLP). In particular, we first introduced an attention mechanism to assign different weights for different symptoms among the symptomatic features. In this manner, the symptoms of major significance were highlighted and ambiguous symptoms were restrained. Subsequently, those weighted features were further fed into an MLP to predict the syndrome type of AIDS. Results Experimental results for a real-world AIDS dataset show that our framework achieves significant and consistent improvements compared to other methods. Besides, our model can also capture the key symptoms corresponding to each type of syndrome. Conclusion In conclusion, our proposed method can learn these intrinsic correlations between symptoms and types of syndromes. Our model is able to learn the core cluster of symptoms for each type of syndrome from limited data, while assisting medical doctors to diagnose patients efficiently.


Background
Syndrome diffetentiation in Traditional Chinese Medicine (TCM) syndromes is a method of classifying the whole functional status summarized by clinical symptoms of different individuals during a period of illness. In TCM, this is one of the crucial aspects to study syndromes and plays a guiding role in clinically individualized diagnosis and dialectical treatment of TCM. In recent years, researchers have intensively studied the efficacy of TCM in the treatment of AIDS [1][2][3]. Many clinical practices and data proved that TCM had made surprising progress in reducing the HIV viral load in the blood and relieving patients' clinical symptoms and improving quality of life [4][5][6][7][8][9][10][11]. These advances are mainly attributed to the fact that TCM practitioners classified AIDS patients with syndromes and cured them with Chinese medicine treatments. It can be seen that syndrome classification is of great significance in the field of TCM.
Differentiation is at the core of TCM and sets the precondition that ensures efficacy. Usually, the approaches for classifying the syndromes in TCM, which include multivariate statistical methods, machine learning, neural networks, and other methods introduced into the study, have resulted in an extensive set of scenarios. Among the group of multivariate statistical methods, cluster analy-sis is one of the fundamental statistical methods. It is widely used in syndrome differentiation due to the trait that it avoids the negative impacts of individual subjectivity. Researchers such as Martis and Chakraborty tried to classify and explore the principles for the cause of arrhythmia disease [12]. As the representative of machine learning algorithms, support vector machine (SVM) is one of the most commonly used diagnostic classification models, and has been used by researchers such as Ekiz et al. to diagnose heart disease through SVM [13]; by Chen et al. to diagnose hepatitis disease [14] and by Zeng et al. to diagnose Alzheimer's disease [15]. Pang and Zhang tried to use a naive Bayesian network to reveal the connection between abnormal tongue appearances and diseases in a particular population [16]. In recent researches, deep learning models have been widely used to diagnose diseases. Models such as noisy deep dictionary learning [17], deep belief networks (DBN) [18], and longshort term memory network (LSTM) [19] have achieved better results.
Although these methods have achieved significant improvement in syndrome classification, they remain far from being satisfactory. First, when all symptoms are used equally for classification, uncorrelated symptoms can have excessive disturbing influences. In such cases, most of these algorithms cannot quite figure out what are the representative symptoms in one type of syndromes for diverse diseases. Moreover, because of the obvious differences between the diseases, there is no unified classification model for all illnesses. Due to the particularities of AIDS, most patients suffer from multiple diseases at the same time, and the clinical symptoms are various and complex [7,9]. These causes make it relatively difficult to judge patient's syndrome type and to define suitable treatment protocols.
To tackle the above-mentioned limitations, here we propose an attention-based MLP framewor (ATT-MLP). Our model relies on an attention mechanism that directly draws the global dependencies of the inputs [20]. We build feature-level attention with a hidden layer over multiple symptoms, which is expected to dynamically reduce the weights of noisy symptoms. Finally, the sequence where all symptom features are weighted by feature-level attention is fed into the MLP to discriminate types of syndrome. In contrast to other algorithms, the attention mechanism is trained to capture the dependencies that make significant contributions to the work, regardless of the distance between the elements in the sequence. Another major advantage of attention is that computing the attention only requires matrix multiplication, which is highly parallelizable and easily realized [21]. With the proposed simple yet effective ATT-MLP, we evaluated our model on a real-world AIDS dataset that integrates data from multiple clinical units to provide a comprehensive view of syndrome differentiation and medication patterns of AIDS [22]. In term of accuracy-score, our experimental results outperformed traditional algorithms by 13.4%. At the same time, our model also selected symptoms associated with the type of syndrome that is in line with the actual clinical situation.

Related work
The rapid adoption and availability of disease treatment records using TCM have enabled new investigations into data-driven clinical support. The broad goal of these studies is to learn from the datasets of patients' records and provide personalized treatment to them. Here, we provide a brief overview of work specifically in diagnosis and treatment of diseases as well as applications of attention-base models in different fields.

Diagnosis and treatment of diseases
Currently, some innovative classification techniques apply to quantitative syndrome analysis. Li and Yan et al. used the k-Nearest Neighbor (k-NN) model for classification of hypertension [23]. Bayesian networks have been used to make quantitative TCM diagnosis of diseases [24]. Some studies [25][26][27] have also used SVM to classify disease data with certain low feature dimensions. These algorithms have achieved good results in both disease diagnosis and re-classification tasks. However, the degree of syndrome correlation with symptoms in the data set of AIDS is different. The traditional k-NN algorithm [22] that assigns the same weight to each dimension or symptom does not fit the task well. The dataset has a high data dimension and the symptoms of patients are sparse. An SVM based on distance metric classification also fails to achieve ideal results. The Bayes classification algorithm based on prior probability can reveal its inadequacies when handling high-dimensional features. Artificial neural networks (ANN) boasts large-scale distributed parallel processing, nonlinear, self-organizing, self-learning, associative memory, and other excellent features. ANN has been used to achieve many gratifying results, and some researchers [28,29] have begun to use neural networks to conduct an exploratory study on the classification of diseases and syndromes according to TCM. Thanks to these efforts, the effectiveness of the diagnosis and classification of diseases has been significantly improved.

Attention applications
Attention-based models have attracted much interest from researchers recently. The selectivity of attentionbased models allows them to learn alignments between different modalities. Such models have been used successfully in several tasks. Minh et al. [30] used recurrent models and attention to facilitate the task of image classification. Ling et al. [31] proposed an attention-based model for word embedding, which calculates an attention weight for each word at each possible position in the context window. Parikh et al. [32] utilized attention for the task of natural language inference. Lin et al. [33] proposed sentence-level attention for sentence embedding and applied them to author profiling, sentiment analysis, and textual entailment. Paulus, Xiong, and Socher [34] combined reinforcement learning and self-attention to capture the long-distance-dependencies nature of abstractive summarization. Vaswani et al. [35] carried out the first study to construct an end-to-end model with attention alone and reported state-of-the-art performance in machine translation tasks. Our work follows this line and applies attention to learning long-distance dependencies. To the best of our knowledge, this is the first effort to adopt an attention-based model for TCM syndrome differentiation.

Frame structure
Three main functional modules are included in the proposed method. They are setting the initial symptom feature vectors and syndrome label vectors, calculating attention weights for different symptom characteristics under different labels, and iterative training of the classification model. The framework of the proposed method is illustrated in Fig. 1.

Attention-based framework
The definition of the symbols is given here to explain the attention-based model. The patient set is P = p nm , n = 1, ..., N, m = 1, ..., M ,where N is the total number of patients and M is the total number of the patients' symptoms. The syndrome type of AIDS is S = {c i , i = 1, ..., K} ,where K is the total number of syndrome types. The purpose of the attention-based model is to obtain different symptom weights based on the patient's own symptoms and syndrome characteristics. Then, the attention weight is realized by optimizing the function, which is shown in Eq. (1).
Where p n is the query vector of a patient's symptoms, A is a weighted matrix. E ns is the attentional vector that scores how well all symptoms and the syndrome type match. We selected the non-linear form tanh, which achieves the best performance when considering different alternatives.
Then, we chose the softmax function to normalize the attentional vector, which is shown in Eqs. (2)and (3). Where W ns is the normalized weight vector, e i is the ith weight probability corresponding to the i-th symptom, and w i is the normalized weight value.
The final output is the patient's feature vector based on attention weight as in Eq. (4):

Multilayer perceptron
We used a multilayer perceptron(MLP) as the last syndrome classifier. The MLP is a type of ANN composed of multiple hidden layers, where every neuron in layer i is fully connected to every other neuron in layer i + 1 . Typically, these networks are limited to a few hidden layers, and the data flows only in one direction, unlike recurrent or undirected models. The input is a one-dimensional vector that is consistent with the format of our data set. In our proposed method, the structure of MLP consists of three layers, which are input, hidden and output. Each hidden unit computes a weighted sum of the outputs from the input layer, followed by a nonlinear activation σ of the calculated sum as in Eq. (5).
Here, d is the number of units in the input layer, which is also the number of symptom features. p rj is the weighted value of the j-th symptom of the n-th patient. And w ij and b ij are the weight and bias terms associated with each p rj . Traditionally, sigmoid or tanh are chosen as the nonlinear activation functions, but we utilized rectified linear units (ReLU) to get good results in our model.
It is worth noting that the attention-based model and MLP were trained at the same time. We used the backpropagation algorithm to update the parameters in these models. To verify the performance of the model, the desired significance level is set to 0.05.

Dataset of AIDS in TCM
In this study, we used data from the TCM pilot project for treating AIDS, which contains over 12,000 patients' records from over 17 provinces covering the years from 2004 to 2013. The ethics committees of the Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, granted exempt status for this study and also waived the need for informed consent. The dataset recorded the personal and disease history, symptoms, syndromes, and treatment strategies provided by certified senior medical doctors. The symptom group contains a total of 93 symptoms. There are seven types of syndrome for these patients in the dataset according to the AIDS syndrome diagnostic criteria, which includes (S1) phlegm-heat obstructing the lung and accumulation of heat toxin; (S2) deficiency of both qi and yin in the lung and kidney; (S3) stasis of blood and toxin due to qi deficiency; (S4) hot in the liver with accumulation of damp toxin; (S5) stagnation of qi, phlegm and blood; (S6) deficiency of spleen and stomach with retention of damp; and (S7) qi deficiency with kidney yin deficiency. From the entire AIDS dataset, we selected 10,910 cases based on the inclusion criteria. The criteria for inclusion of patients in this study were (1)

Experimental and evaluation metric
In practice, we find that certain symptoms are different in their relevance and importance for these syndromes. Hence, we build a model for each type of syndrome. In the model, we used all positive samples and some randomly selected samples with other syndromes as the negative samples in the whole dataset to ensure that the ratio of positive to negative was about 1:2. All models were trained and tested through a 10-fold cross-validation method. Meanwhile, and the average of multiple test scores was recorded in these tables. To check the robustness of our model, we provided the prediction results with the five indicators: Accuracy, Sensitivity(Recall), Specificity, Matthews correlation coefficient(MCC), and Area Under the ROC curve(AUC).

Selected symptoms based on attention
We used the proposed model to select representative symptoms to characterize the features of each AIDS syndrome. The selected representative symptoms for seven AIDS syndromes are shown in Table 1.
According to Table 1, we could find that ATT-MLP performs well in the task of selecting representative symptoms to characterize each syndrome and all accuracy of syndromes are more than 80%. On closer inspection, the accuracy of S4 and S7 syndromes exceed 87.0%. And the S7 syndrome got the highest accuracy, reaching 87.6%. In addition, this worst performance (S3) is still 83.5%. Overall, our model automatically diagnosed the syndrome type of AIDS patients by selecting symptoms, and its performance was close to the level of most clinicians. Secondly, through observing Table 1, we could also find out that some symptoms appeared in the symptom group of multiple syndromes at the same time. For example, fever occurring in S1, S3, and S6; cough appearing in both S1 and S2, and slippery pulse appearing in both S4 and S6. White tongue coating was included in S3 and S6. More detailed reasons are given for above-mentioned conditions. On  Table 1 and compared with S3, symptom fever plays a more primary role in syndrome S1. Diagnosing syndrome accurately needs to integrate several symptoms' information. In order to explore the contribution of each symptom to syndrome type and disease diagnosis, more details and further discussions about symptoms' importance are discussed in the next section-Symptom Weights for Different Syndrome types.

Symptom weights for different syndromes
For each syndrome type, we collected the attention weight vectors for all patients within them and subjected them to normalization to obtain Fig. 2. The rows of the matrix in the figure represent seven syndrome types of the AIDS dataset and each column represents the attention weight values for one symptom. We can see that for different syndromes, the type and number of symptoms on which the model focused were significantly different from Fig. 2. For instance, the number of symptoms of concern for S3 and S5 is more than for S7. Focusing on the local details of the matrix, some symptoms occupy a higher weight in certain syndrome types, We can find that the importance of red tongue (2) is different for seven syndrome type: the higher importance for S2 and S6, compared with S1, S4, and S5. The representative symptom white tongue coating(29) also is closely related to S2 and S5, and selected as their key symptom in Table 1. Nevertheless, only the syndrome S1 has a greater dependence on the symptom-scanty coating(36), and the weight between S1 and scanty coating is over 0.9. there another noteworthy phenomenon is that though the weights of two symptoms-cough(72) and fatigue(73) are high in all of the syndromes, they just are regarded as generalized features, not typical symptoms. These cases are in line with the actual situation.
On other hand, the current model can focus on different dimensional symptoms, and the distances of these symptoms are relatively distant from the global perspective of the map. This implies that our model has prominent advantages in dealing with high-dimensional information in medical diagnosis. The heat map in Fig. 2 also shows that our model not only makes use of local symptom information for diagnosis and prediction but also combines global information to diagnose the patient's syndrome category. It is similar to the disease diagnosis method used by clinical experts, indicating that the internal mechanism of ATT-MLP is to realize the goal of predicting syndrome by studying and applying some real human experience.

Comparison
Tables 2 and 3 recorded the performance scores of 5 algorithms, measured with indicators including: Accuracy, Sensitivity(Recall), Specificity, MCC and AUC. The best result for each column was highlighted. Overall, our framework based on the attention mechanism significantly improved performance over the other four models in most indicators on each of the AIDS syndromes. The average of the Accuracy of syndrome differentiation based on ATT-MLP was 85.7% and the average of sensitivity and specificity reaches 74.6% and 91.3%, which were more than 10% higher than other models. However, the SVM had worse performance in the predicted task, the main reason was that the SVM classifier used all the symptoms of patients, but without considering and selecting some key symptoms and without using a fusion of different symptom information to classify syndrome type. Through analyzing all the experimental results of syndrome type, we could find that the classification effect of the five models for S4 and S7 were significantly better than for other syndromes, which illustrates that the symptomatic composition of both types of syndromes was relatively fixed and less affected by other symptoms. Obviously, the proposed model also achieved the best performance, the accuracy and sensitivity scoring 87.6%, and 83.7% in S7.
These results further indicated that the key symptoms of S7 were obvious and easy to find and capture. Because the number of positive and negative samples are unbalanced, we employed MCC and AUC indicators to measure the robustness of our model. Meanwhile, and we performed ATT-MLP over the dependent test to measure the reliability of cross-validation results. Judging from Table 3, Random Forest(RF) and ATT-MLP model have compared results. The MCC's score was above 0.5 and the average of AUC's result reached 0.77 and 0.79 for RF and ATT-MLP. This showed that facing the problem of unbalanced samples, our model could deal with it by learning the internal mapping between symptoms and syndrome. The P-values of the independence test of our model were all less than 0.001, which verified the truthfulness of the experimental results of our model and indicated that the model could extract effectively the representative symptom group of syndromes. It is noteworthy to point out that the MLP model is our baseline model. Viewing Table 3, we see that the MLP model based on the attention mechanism shows a significant improvement over the performance of the original model. There is enough evidence to show that the attention mechanism framework plays an important role in the task of capturing key symptoms. Whihout changing the structure of the data itself, the attention mechanism can help MLP to optimize and classify in the right direction by scoring the symptoms.
All models had poor classification performance for S3 and S5, especially in the indicator-Sensitivity score, compared to the other five syndromes. It means that by only relying on the dataset, these models can't find and learn the key symptom groups of S3 and S5. For our model(ATT-MLP), the difficulties for learning and mining the pivotal symptom groups of S3 and S5 syndromes are objective subsistent. The initial conjecture about the problem is that the symptoms of these patients labelled as the two syndromes (S3 and S5) are complex and diverse and that the interaction between symptoms is complicated. In order to test this hypothesis, we conducted the following experiment: the evaluation and measurement of the number of symptoms related to the syndrome.

Number of syndromes & metric
In order to verify whether the symptoms selected by the proposed model based on the attention mechanism are the key symptoms of the certain syndrome, we verified the results. Firstly, we used the attention framework to score all symptoms and to remove some symptoms according to different score thresholds. Then the remaining symptoms were re-scored to classify syndromes. Recording model classification performance indicators with different thresholds. The changes in these indicators are shown in Fig. 3.
By observing these figures, we can clearly see that as the number of selected symptoms decreases, there is no significant decline in the classification performance of our model. As to syndromes-S4 and S7, with the number of symptoms decreasing, our model keeps high-level score in the accuracy rate of diagnosis, which suggests that their main symptom groups remain stable and our model can efficiently extract their core symptoms from some limited samples. However, through observation of Fig. 3, the classification effects of S1 and S3 syndromes were greatly affected by the change in the number of symptoms, and the change of the index was more than 15%. One convictive explanation for these phenomena is that the combinations of primary symptom for different syndromes are diverse and the association between the symptoms is complex. We further randomly selected 100 samples labelled S1 and S3 separately, shown in Fig. 4. For syndrome-S1, main symptoms such as the red tongue(2), yellow coating(30), string-taut pulse(70), and fever(71) are relatively obvious to be seen, but other symptoms are difficult to summarize. As for syndrome-S3, white coating(29) and thready pulse(63) is easily detected. In this case, a single framework does not fit well into these clinical diagnostic  The sample cases hot map of syndromes S1-(a) and S3-(b). the symptoms and sample index severally are shown on the horizontal and the vertical axis methods. We need to combine other methods to do more comprehensive researches.
Secondly, we can also find another fact from these figures. As the selected symptoms decrease, all the indicator values decrease slightly. This illustrates that while our model eliminates the effects of irrelevant symptoms, it is also possible to delete certain key symptoms unintentionally or to sever the relationships between certain features. Since these results to have a small decrease in model performance, we will conduct further exploration and research of these matters in follow-up work.

Discussion
In this paper, we proposed an ATT-MLP-based syndrome classification model. Our model can diagnose the syndrome of AIDS patients effectively, and select representative symptoms of each syndrome. This provides new ideas and methods for the establishment of a TCM syndrome differentiation model for patients with AIDS. Our proposed model shows good performance in the classification of the seven syndromes in the dataset and gives an average accuracy of 85.7%, an average sensitivity (recall) rate of 74.6%, and an average specificity of 91.3%. The results of these classification experiments are superior to what has been obtained using more conventional classifiers in previous studies.
There are two advantages to the proposed model. First, compared with other syndrome differentiation models, our model can accurately select representative features to characterize explicitly the characteristics of the AIDS syndrome. This provides an objective basis for TCM syndrome differentiation, which often relied on empirical medicine. Second, our model can also assign reasonable weights to the selected symptoms. For the same symp-tom, the weight is different for different syndromes, which means that only the more heavily weighted symptoms play a key role in the diagnosis of a given syndrome. This can help doctors to develop treatments that are appropriate for their patients.
Due to the complexity of AIDS, some patients may have several AIDS syndromes. However, current attentionbased models are not good at handling the multi-label task classification. In the future, we will consider abandoning this model, which scores the symptoms individually. Instead, we will explore the effect of the degree of association between the symptoms on the diagnosis of the syndrome. Then we could use the efficient deep learning framework to build a more complete syndrome differentiation model.

Conclusions
In conclusion, our proposed method can learn these intrinsic correlations between symptoms and syndromes. As a matter of fact, the relevant information was summarized by experts with rich clinical experience. These experiments demonstrate that our model can use the attention mechanism to select representative symptoms for each syndrome and improve the diagnosis accuracy of patients' syndromes.