Clinical data mining on network of symptom and index and correlation of tongue-pulse data in fatigue population

Background Fatigue is a kind of non-specific symptom, which occurs widely in sub-health and various diseases. It is closely related to people's physical and mental health. Due to the lack of objective diagnostic criteria, it is often neglected in clinical diagnosis, especially in the early stage of disease. Many clinical practices and researches have shown that tongue and pulse conditions reflect the body's overall state. Establishing an objective evaluation method for diagnosing disease fatigue and non-disease fatigue by combining clinical symptom, index, and tongue and pulse data is of great significance for clinical treatment timely and effectively. Methods In this study, 2632 physical examination population were divided into healthy controls, sub-health fatigue group, and disease fatigue group. Complex network technology was used to screen out core symptoms and Western medicine indexes of sub-health fatigue and disease fatigue population. Pajek software was used to construct core symptom/index network and core symptom-index combined network. Simultaneously, canonical correlation analysis was used to analyze the objective tongue and pulse data between the two groups of fatigue population and analyze the distribution of tongue and pulse data. Results Some similarities were found in the core symptoms of sub-health fatigue and disease fatigue population, but with different node importance. The node-importance difference indicated that the diagnostic contribution rate of the same symptom to the two groups was different. The canonical correlation coefficient of tongue and pulse data in the disease fatigue group was 0.42 (P < 0.05), on the contrast, correlation analysis of tongue and pulse in the sub-health fatigue group showed no statistical significance. Conclusions The complex network technology was suitable for correlation analysis of symptoms and indexes in fatigue population, and tongue and pulse data had a certain diagnostic contribution to the classification of fatigue population.

of certain diseases [1]. Fatigue is a non-specific symptom with certain heritability [2]. It is ubiquitous in sub-health and various diseases, such as Parkinson's [3], major depressive disorder [4], schizophrenia [5], and cancer [6]. It has become one of the main factors harmful to human's physical and mental health with a serious decline in work efficiency and life quality.
Sub-health status is defined as the decline in vitality, physiological function, and adaptation capacity, but in general not meeting the diagnostic criteria for clinical or sub-clinical diseases [7,8]. It has attracted extensive attention with the increasing incidence of sub-health in recent years. Sub-health state of fatigue is one of the most common sub-health types with that fatigue as the chief complaint. The etiology and pathogenesis of fatigue are largely unknown. There is still a lack of objective, effective, and comprehensive evaluation method for diagnosing fatigue, leading to the inability to carry out targeted interventions when fatigue occurs in the early stage of disease. Hence it is imperative to establish a comprehensive, objective evaluation method for fatigue.
In the diagnosis methods of Traditional Chinese Medicine (TCM), tongue and pulse diagnoses always play an important role in clinical diagnosis and treatment. Tongue and pulse diagnoses are comprehensive diagnostic methods based on body's overall state, suitable for comprehensive evaluation of body's functional state, and have become an important objective basis for health status evaluation and syndrome diagnosis [9]. However, traditional tongue and pulse diagnoses lack accurate classification. On the one hand, artificial intelligence and machine learning methods can quickly and accurately complete basic data cleaning and large-scale data sorting, and on the other hand, do data mining for massive tongue and pulse data. Different recognition algorithms and machine learning methods have been widely used in image recognition, target detection, natural language processing, and other fields [10][11][12]. Nowadays, breakthroughs have been achieved in fatigue quantification and standardization. Artificial Neural Network [13], Support Vector Machine [14], K Nearest Neighbor [15], and other machine learning methods have helped to achieve the digitalization of TCM tongue and pulse diagnoses and establish corresponding disease diagnostic models [16,17]. The diagnostic relationship between tongue and pulse and healthy state can be better established through accurate detection, identification, and multi-dimensional quantitative analysis of tongue and pulse data to save medical resources and improve diagnosis efficiency and treatment efficacy [18][19][20]. Data-driven researches on fatigue diagnosis technology using tongue and pulse data have been increasing day by day. Researches based on tongue [21,22] and pulse [23][24][25] of fatigue population have shown that their tongue and pulse had their own unique characteristics.
Complex network is a basic framework with high topological abstraction. Further analysis of the network through classification, screening, and other analytical methods can mine the potential rules of many clinical data. Complex network is used in the analysis of basic rules of TCM [26], network pharmacology [27][28][29], combined analysis of TCM syndromes and network pharmacology [30,31], and symptom evolution [32]. Complex network is mostly used to construct qualitative network relations, which are more suitable for analyzing the symptoms involved in identifying certain syndromes and the relationship between symptoms. Fatigue is a complex concept, and it is one of the symptoms that can most reflect the interaction between psychology and physiology. As a non-specific symptom, different network relationships of fatigue symptom can be found in depression network [33], qi deficiency syndrome of coronary heart disease network [34,35], and qi deficiency syndrome of breast cancer network [36].
Despite the progress made in current research on fatigue, there are still many problems that need to be solved. For example, studies on fatigue are relatively simple. It is essential to study the correlation of fatigue symptoms and do the combined analysis of fatigue symptoms and indexes. Besides, analysis on objective tongue and pulse data of fatigue is mostly independent. To address the above problems, in this study, 2362 physical examination population were divided into healthy controls, sub-health fatigue group, and disease fatigue group. Complex network technology was used to screen out core symptoms and Western medicine indexes. Through constructing the core symptom and index network and the core symptom-index combined network, analyzing the network structure to establish the distribution of fatigue symptoms and indexes. Simultaneously, canonical correlation analysis method was used to get the associated relationship between tongue and pulse data of disease fatigue and sub-health fatigue population. Based on symptom, Western medicine index, and tongue and pulse data, this study tried to explore the characteristics of different fatigue population from different dimensions.

Study design
All the 7025 people were selected in the Medical Examination Center of Shuguang Hospital affiliated to Shanghai University of Traditional Chinese Medicine from Jul. 2015 to Dec.2018. The most common diseases in fatigue population mainly include hypertension, diabetes, hyperlipidemia, and fatty liver, other diseases, such as coronary heart disease and cancer due to the lack of patients, they were not included. A total of 361 sub-health fatigue and 1529 disease fatigue population were further selected.
Four well-trained clinicians completed the diagnosis according to the diagnostic criteria of the disease. The Health Status Assessment Questionnaire (shortly, H20) and Information Record Form of Four Diagnosis of Traditional Chinese Medicine (Copyright No.: 2016Z11L025702) designed by the sub-health research group of the "863 Plan" was used to investigate fatigue syndrome. Patients with fatigue symptom were defined as disease fatigue population, those who had no obvious positive indexes in Western medicine, H20 score was between 60 and 79, and with fatigue symptom were defined as sub-health fatigue population, and those who had no obvious positive indexes in Western medicine, H20 score was between 80 and 100, and without fatigue symptom were defined as healthy controls.
The overall flow diagram of this study was shown in Fig. 1.

Information on the four diagnostic methods of TCM and Physical and Chemical Indexes
The four-diagnostic scale of TCM included 25 categories and 256 subitems. The symptoms and signs were classified as none, mild, and severe with 0, 1, and 2 points. We used Tongue and Face Diagnosis Analysis-1 Fig.1 Overall flow diagram instrument (TFDA-1) (software copyright registration No.: 2018SR033451) and Pulse Diagnosis Analysis-1 instrument (PDA-1) (Patent No.: ZL201620157027.6), which independent developed by the National Key Research and Development Program to collect the clinical tongue and pulse data. Besides, clinical Western medicine indexes mainly include 191 items such as blood routine, urine routine, liver and kidney function, tumor markers, electrocardiogram, and imaging examination. Considering fatigue had no specific indexes, the existing Western medicine indexes were auxiliary. For example, anemia fatigue patients were often associated with decreased hemoglobin, cancer fatigue patients were often associated with abnormal tumor markers. It was necessary to combine specific symptoms in specific diseases for comprehensive analysis and judgment. The tongue and face diagnosis equipment and the analysis interface were shown in Figs. 2 and 3. Figure 2a Figure 4b was the sphygmogram of PDA-1 pulse diagnosis instrument.
The color parameters of tongue image in Fig. 3 came from four color spaces: [37][38][39] RGB, HSI, Lab, and YCrCb. They were R (red), G (green) and B (blue), H (hue), S (saturation), I (brightness), L (lightness), a (redgreen axis), b (yellow-blue axis), Y (brightness), Cr (the difference between the red part of the RGB input signal and the brightness value of the RGB signal), Cb (the difference between the blue part of RGB input signal and the brightness value of RGB signal), perAll (the ratio of the coating area to the total tongue area) and perPart (the ratio of the coating area to the non-coating tongue area).
The parameters of sphygmogram in Fig. 4b, h represented amplitude height, h 1 was the main amplitude, h 3 was the heavy wave front wave amplitude, h 4 was the dicrotic notch amplitude, h 5 was the gravity wave amplitude, t 1 was the time value from the start point to the crest point of the main wave, t 4 was the time value from the start point to the dicrotic notch, t 5 was the time value from the dicrotic notch to the endpoint, t was one pulsating period, and w was divided into w 1 and w 2 , w 1 was 1/3 height of the main wave, w 2 was 1/5 height of the main wave.

Normalized data entry and extraction
Using Python3.7 to make data arrangement, and established the symptom and Western medicine index data sets, respectively. A total of 494 symptoms and indexes were collected, including 254 symptoms of TCM and 240 Western medicine indexes. The data sets were binarized. The positive TCM syndromes were recorded as "1", and negative TCM syndromes were recorded as "0". Negative qualitative data of Western medicine index were recorded as "0", positive data including weak positive  (+) and strong positive (++ or +++) were recorded as "1", quantitative data in the normal range were recorded as "0", higher than or lower than normal range were recorded as "1".

Screening of Symptoms and Indexes
According to the data characteristics in this study, the improved node contraction method was used to analyze the network nodes quantitatively. Node contraction was the integration of k nodes connected to the node with this node, replace the k + 1 nodes with a new node, and the edges that were previously associated with k + 1 nodes were now associated with the new node. After nodes with high importance were shrunk and fused, the whole network's connection would be closer, and the aggregation degree would increase. This method's basic idea was to shrink the nodes in the network one by one and then compare the network aggregation degree changes to rank the importance of nodes. The improved node contraction method [40] comprehensively considered the weight of edges in the weighted network. In this study, nodes represented symptoms or indexes, when two abnormal symptoms or indexes appeared in the same individual simultaneously, a connection was established between the two nodes. The weight of the edge represented the number of simultaneous occurrences of the two connected nodes. The larger the weight was, the closer the relationship between the two indexes. So here, we choose the sum of the edge weight as the point weight. In the weighted network, the corresponding concept of node degree was node strength. With the specific definition of node strength, extend the definition of network cohesion to weighted networks. Network cohesion refers to the reciprocal of the product of the number of nodes and the average shortest distance. Quantitatively describe the degree of network cohesion, weighted network cohesion [41] was defined as in formula (1).
In the above formula, s was the sum of the network's average node strength, which was the strength of each node divided by the number of neighboring nodes of the node. l was the average shortest distance of the unweighted network corresponding to the weighted network after thresholding. d ij was the shortest distance between node V i and node V j in the network. W ij was the weight between node V i and node V j. The node importance was expressed by IMC(V i ), which was defined as in formula (2), ∂(WG * Vi) was the aggregation degree of the weighted network after contraction of node V i .
In this study, the improved node contraction method took the degree, betweenness, and edge weight of nodes, which was basically consistent with the purpose and requirement of each node's importance. In this paper, the sum of edge weight was used as the point weight to obtain an undirected weighted network. Triple was an improved form of adjacency list, and many network data were represented in the form of triples. A triple could be thought of as a three-column matrix, with each row in the format of [V i , V j , W ij ], indicating the existence of an edge with a weight of W from node V i to node V j . To better demonstrate the interaction between nodes, the symptom and index pairs were divided by the triad's maximum weight to obtain the normalized weight value.

Construction and analysis of complex network
MATLAB (R2016a) software was used to process the binary data, and the core symptom and index data were selected according to the importance of node. Selected the top 10 data to construct the network and do analysis by Pajek (Pajek 64 5.08) software. The networks were output after editing each node's color and label and adjusting the node position manually.

Statistical analysis
SPSS (Version 25.0) software was used for statistical analysis. Continuous data with normal distribution were presented as the mean and standard deviation, and those with abnormal distribution were presented as median and upper and lower quartiles. The categorical variables were expressed as counts and percentages. Analysis of Variance (ANOVA) was performed for data that normal distribution and homogeneity of variance among groups, Kruskal-Wallis H test was performed for non-normal distribution data. All the tests were two-tailed, and a P value < 0.05 was considered statistically significant.

Quality control
In this study, researchers from Shanghai University of Traditional Chinese Medicine completed all scales and tongue and pulse collection. All researchers were medical professionals in TCM or integrated TCM and Western medicine. Moreover, all researchers have been trained in standard operating procedures to ensure consistency and accuracy in interpreting data collection results. Each study participant was interviewed by at least two professional researchers and supervised by at least two senior physicians to ensure data collection consistency and authenticity and reduce the measurement bias.

Data set for fatigue group
There were 742 people in the healthy controls, 361 people in the sub-health fatigue group, and 1529 people in the disease fatigue group. The patients in the disease fatigue group were mainly hypertension, diabetes, hyperlipidemia, and fatty liver. The statistical analysis of baseline characteristics of the healthy controls, sub-health fatigue, and disease fatigue groups were shown in Table 1. "N" represented the number of categorical variables, "X ± SD" represented the mean and standard deviation of the continuous data Age and Body Mass Index (BMI).
From the result, we could see that there were many more males than females in the three groups, actually the total number of people who participated in medical examinations, males were higher than females, the reason might be that there were more males than females in routine physical examinations in the area where the hospital was located, and males might pay more attention to routine physical examinations. Age and BMI were statistically significant in the sub-health fatigue group and disease fatigue group subjects compared with healthy controls (P < 0.01), and age was statistically significant in the disease fatigue group compared with the sub-health fatigue group (P < 0.01).

Construct and analyze of symptom network of the sub-health fatigue group
Using MATLAB for data processing of sub-health fatigue group with the whole symptoms. Binary data of TCM symptoms were converted into ". NET" format and then using Pajek software to draw the networks, and the symptom network was shown in Fig. 5.
As the total network had many nodes and complex network relationships, its core nodes' relationships could not be well described. Therefore, it was necessary to select core nodes according to the importance of nodes IMC(V i ), and then used Pajek software to draw the core symptoms network. The network was shown in Fig. 6. In the network, the node's size represented the strength, the thickness of the edge represented the weight between nodes, and the core symptoms were shown in Table 2.
To analyze the network, took the symptom associated pairs whose normalized weight was greater than 0.5, and the associated result was shown in Table 3.

Construct and analyze of symptom and index networks of the disease fatigue group
Using the same method to draw a network of the whole symptom and index of disease fatigue population, as shown in Fig. 7. To Select core symptoms and indexes and draw networks of disease fatigue group. Core symptom and index networks were shown in Figs. 8 and 9. The core symptoms and indexes and node importance rank were shown in Tables 4 and 5.
Relationships between symptoms and indexes were very complicated in the actual clinical diagnosis of disease. The diagnosis could not rely on symptoms or indexes solely. It was necessary to combine them to analyze together. Its core symptom-index interaction edges were shown as cyan lines. The core symptom-index network was shown in Fig. 10.
To analyze the network, took the top 10 pairs of core symptom-symptom pairs and index-index pairs, respectively, and the associated analysis results were shown in Tables 6 and 7.
To select the core symptom-index associated results, the top 10 symptom-index pairs were shown in Table 8.
In conclusion, the research results showed that white coating, yellow coating, sour, dreaminess, irritability, thick coating, and insomnia were the common symptoms of the two groups of fatigue population. The main difference was that the node importance of the same symptom was different in different fatigue population networks, indicating that the diagnostic contribution rate of the same symptom to different fatigue population was different. Headache, chest distress, and xerophthalmia were more significant in sub-health fatigue group, while dizziness, wiry pulse, and greasy coating were more significant in disease fatigue group. The most common abnormal indexes in the disease fatigue group were basophil, platelet distribution width, systolic blood pressure, percentage of monocyte, diastolic blood pressure, PH of urine, hemoglobin, hematocrit, uric acid, and body mass indicator. The symptom-index associated analysis showed that systolic blood pressure, basophil, platelet distribution width, percentage of monocyte, and diastolic blood pressure were closely related to white coating, and it was also related to yellow coating to some extent.

Canonical correlation analysis of tongue and pulse parameters
Using SPSS (Version 25.0) software to detect outliers or extreme values. Individuals with outliers or extreme values in tongue or pulse data of the three groups were excluded, although this reduced the sample size to some extent, it ensured the accuracy of the results. Finally, 551 people were included in the healthy controls, 252 in the sub-health fatigue group, and 1160 in the disease fatigue group. Canonical Correlation Analysis verified the overall correlation between one set of variables and another. The results showed a certain correlation between the tongue and pulse data in healthy controls and the disease fatigue group. The correlation coefficient of tongue and pulse data in healthy controls was 0.475 (P < 0.05), tongue characteristic parameters were mainly affected by TB-Cb, TB-b, TB-H, and TC-Cb, the canonical correlation coefficients were − 0.435, 0.431, 0.429, and − 0.374, respectively, P < 0.05. Pulse characteristic parameters were mainly affected by h 1 and h 1 /t 1 , the canonical correlation coefficients were 0.388 and 0.378, respectively, P < 0.05, as shown in Fig. 11a. The correlation coefficient of tongue and pulse data in disease fatigue group was 0.420 (P < 0.05), tongue characteristic parameters were mainly affected by perAll, TC-Cr, TB-Cr, TB-Cb, TB-b, and the canonical correlation coefficients were − 0.723, 0.697, 0.649, − 0.603 and 0.590, respectively, P < 0.05. Pulse characteristic parameters were mainly affected by h 4 , h 4 /    Fig. 11b. There was no statistically significant correlation between the tongue and pulse data in sub-health fatigue group.

Discussion
Fatigue is an important early warning signal of abnormal health status. It should be treated in time to prevent it from further developing into more serious diseases. At present, there are two main reasons for the difficulties in fatigue research. Firstly, the mechanism of fatigue is complex, and there is still a lack of diagnostic criteria for fatigue. Secondly, there is a lack of effective fatigue evaluation models [42]. Modern tongue and pulse diagnosis can provide good data support, coupled with the assistance of artificial intelligence and technical machine learning technology, providing new methods and ideas for accurate diagnosis of fatigue. In this study, complex network technology was used to screen out the main symptoms and indexes of fatigue patients and the interaction between symptoms and indexes. Studying the interrelationship between fatiguerelated symptoms contributed to further determine the direction of diagnosis of fatigue-related diseases. Headache, chest distress, and xerophthalmia were more significant in sub-health fatigue population. Headache and chest distress were generally manifested as qi stagnation syndrome, and xerophthalmia was the common clinical manifestation of jinye deficiency syndrome. Wiry pulse, greasy coating, and dizziness were more significant in disease fatigue group. The clinical significance of wiry pulse is mainly about liver and gallbladder disease, pain, phlegm and retained fluid, consumptive disease, and stomach gas decline. The clinical significance of greasy coating was phlegm-damp, phlegm and retained fluid, and dyspepsia. These two symptoms were consistent with common pathological manifestations of the disease. Furthermore, dizziness was the concomitant symptom of hypertension, hypoglycemia, anemia, and cancer. Basophil, platelet distribution width, percentage of monocyte, hemoglobin, hematocrit were blood routine item, the value of PH of urine and uric acid are routine items of urine examination, abnormal of these two indexes mostly indicate the abnormal renal function, and BMI mostly reflects human metabolism, so it can be seen that abnormal blood routine, renal function, blood pressure, and basic metabolism are more common in patients with fatigue.  Associated analysis of symptoms and indexes can better explore the nature of diseases [43,44]. In this study, the symptom-index associated analysis showed that systolic blood pressure, basophil, platelet distribution width, percentage of monocyte, and diastolic blood pressure were closely related to white coating and yellow coating to some extent. The clinical significance of white coating in disease state was mainly surface syndrome, cold syndrome, and dampness syndrome. Thus, the fatigue population were mostly seen in surface syndrome, cold syndrome, and dampness syndrome. This analysis was helpful for better understanding the core symptoms, the interaction between symptoms, and the distribution of syndromes of fatigue population to provide a theoretical basis for the rapid and accurate diagnosis.
Fatigue as a comprehensive performance of the whole body, it was necessary to analyze the relationship of indexes. The canonical correlation analysis was used for the combined analysis of tongue and pulse data. Canonical correlation analysis of tongue and pulse data showed that the healthy controls' correlation was stronger than that in the disease fatigue group. In contrast, the correlation coefficient between canonical variables and all tongue and pulse variables in the disease fatigue group was greater than that in the healthy controls. The reason might be that there are many kinds of diseases in disease fatigue group while the healthy controls were relatively single. The tongue and pulse in the healthy controls tended to be more stable, and its characteristics were relatively stable. For example, the healthy controls' tongue was generally reddish tongue, thin white tongue coating, and the pulse was usually normal pulse. The tongue and pulse in the disease fatigue group might be diversified due to different diseases. Patients' tongue could present as purple-red tongue, bluish-purple tongue, yellow greasy tongue coating, white greasy tongue coating, etc. Their pulse could be different forms of wiry pulse, tight pulse, slippery pulse, uneven pulse, etc. The tongue and pulse abnormalities of disease fatigue population destroyed a certain stable correlation of the health state and tended to a certain partial correlation.
This study still has some limitations. Firstly, there are many kinds of diseases in this study, which did not conducive to interpreting the results. In the future, specific diseases will can be refined to analyze the relationship of symptom-index and the tongue-pulse data. Secondly, complexion spectral data can be added based on tongue and pulse data. Integrating more objective indexes that can objectively evaluate fatigue will be more productive to analyze its phenomenon and mechanism. Besides, this study still lacks treatment guidance and intervention for fatigue population, which will be improved in the future.

Conclusion
In summary, this study constructed the fatigue-related symptom networks and symptom-index networks, analyzed the data relationship of tongue and pulse in fatigue population, and revealed the distribution rule of symptom and index, the tongue and pulse data in different fatigue population. It provided an objective basis   for establishing the data evaluation of fatigue state, and we are looking forward to establishing a fatigue evaluation method based on objective data of tongue and pulse in the future.