Heterogeneous information network based clustering for precision traditional Chinese medicine

Background Traditional Chinese medicine (TCM) is an important complement to modern medicine and is widely practiced in China and many other countries. Its practice faces two constraints: the clinical experience of renowned practitioners is hard to inherit and develop, and the service capacity of basic-level practitioners is hard to improve. Heterogeneous information networks (HINs) are graph models for integrating and modeling real-world information. With HINs, we can integrate large-scale heterogeneous TCM data into structured graph data and use it as a basis for analysis. Methods Mining categorizations from TCM data is an important task for precision medicine. In this paper, we propose a novel structured learning model to solve the problem of formula regularity, a pivotal task in prescription optimization. We integrate clustering with ranking in a heterogeneous information network. Results Experiments on the Pharmacopoeia of the People's Republic of China (ChP) demonstrate the effectiveness and accuracy of the proposed model for discovering useful categorizations of formulas. Conclusions We use heterogeneous information networks to model TCM data and propose the TCM-HIN. Combining the heterogeneous graph with the probability graph, we propose the TCM-Clus algorithm, which integrates clustering with ranking to categorize traditional Chinese medicine prescriptions. The resulting categorizations can help Chinese medicine practitioners make clinical decisions.

The same disease may have different symptoms in different patients. Due to the differences between patients, accurate diagnosis and treatment are even more important [2]. In the field of Western medicine, doctors explore the cause of the disease and focus on treating specific parts of the body. However, Chinese medicine works differently. Traditional Chinese medicine and Western medicine have fundamental differences in diagnosis and treatment. The Chinese medicine practitioner explores the internal and external causes of the patient and combines them accordingly.
In TCM, herbal remedies are usually based on traditional Chinese medicine formulas, and a single herb is rarely used alone. Each herb has its advantages and disadvantages, and herbs are combined in reasonable proportions. Figure 1 is a schematic diagram of the composition of a formula [20]. The formula is the foundation of traditional Chinese medicine and the core of TCM research. TCM sources such as ancient Chinese literature and clinical prescriptions contain prescription data, and how to study these data with scientific and technical analysis has become the main topic of TCM informatization. TCM data generally appear as free text in strongly natural language and are typically unstructured, massive and heterogeneous. These characteristics pose a huge challenge for the informatization of Chinese medicine. To meet this challenge, integrating heterogeneous TCM data, representing them in a model, and analyzing them with data processing methods have become important work in TCM informatization [3].
The traditional model can no longer support the inheritance and development of TCM knowledge. The main bottlenecks are the way the knowledge is organized and the limitation of human resources. TCM knowledge mainly exists in the form of prescriptions, medical records, and similar documents. Unlike ordinary texts, TCM texts use irregular natural language, which creates great difficulties for digitization. At the same time, because the degree of informatization of TCM is not high, TCM knowledge is generally passed on through apprenticeship, in which an experienced practitioner trains apprentices. As a result, some knowledge cannot be passed down in time. How to pass on the vast knowledge of TCM efficiently has become a hot topic in TCM research.
The development of machine learning and artificial intelligence can help drive the inheritance of traditional Chinese medicine. First, experience can be regarded as a kind of knowledge in artificial intelligence, which can be used as input data for machine learning. Second, "dialectical treatment" is the basic principle of TCM diagnosis of diseases. This is in line with the basic principles of machine learning: the model is trained on the training set, and the model gives the target value based on the input values. Xu et al. used data mining methods to explore drug combinations for nonalcoholic fatty liver disease [4]. Chen et al. used a three-part map to explore the symptom-disease pattern in cases [5]. Liu et al. used CRF to learn the characteristic patterns in TCM cases to identify symptoms and cases [6]. Wang et al. proposed a probabilistic model for the analysis of symptom, disease, and drug relationships in TCM cases [7]. Although some of these approaches can also learn low-dimensional representations of nodes, they focus on data with rich semantics. The majority of TCM medical data, particularly formula-based prescriptions, lack good semantic information.
In this paper, we propose a clustering algorithm based on a probabilistic model to solve the clustering problem in a heterogeneous information network of traditional Chinese medicine. For a given target type, we aim to generate the clustering of the target objects and the ranking information of the objects in each cluster. We propose a heterogeneous information network of traditional Chinese medicine with a star network schema. The algorithm obtains stable clustering results after a number of iterations. Our main contributions are as follows: We propose a clustering algorithm based on a probabilistic model, which integrates clustering with ranking information to categorize Chinese medicine formulas and discover potential knowledge. The algorithm can help doctors optimize diagnosis and prescription. From the ranking information of each object in a cluster, doctors can easily assess its importance.
We conducted experiments on real data sets of traditional Chinese medicine. The experimental results show that the algorithm is effective and accurate. The algorithm can provide reasonable clustering results for optimizing prescriptions and is confirmed by Chinese medicine experts.
Fig. 1 The composition of a formula
Social networks, the Internet, medical information networks and many other real-world networks contain a large number of interconnected nodes. These networks are called information networks [8]. The ubiquitous information network is an important part of modern information infrastructure. The nodes in an information network are connected by an intricate structure that carries rich information. Information network analysis currently attracts researchers in many fields and is a hot topic in data mining and information retrieval. However, most research on information networks rests on a basic assumption: the network has a single type of node and a single type of link. That is to say, the researcher does not distinguish node types and regards the network as a homogeneous information network, for example, the author collaboration network. In fact, these networks contain different kinds of nodes, and it is more reasonable to treat them as heterogeneous information networks (HINs) with different types of nodes and links. Heterogeneous information networks contain richer semantic information in nodes and links. For example, in a bibliographic information network, papers are connected to each other through different types of nodes, such as authors, conferences, and topics. If a paper is connected to two authors at the same time, the two authors are related through this paper [9].
Ranking is an important task on heterogeneous information networks, and it faces several challenges. First, a HIN contains different types of objects and relationships. Second, different types of objects and relationships carry different semantic information. In addition, the rankings of different objects affect each other. Taking the bibliographic heterogeneous network as an example, ranking authors may give different results under different meta paths [10], since these meta paths construct different link structures among authors. Moreover, the rankings of objects of different types have mutual effects. For example, reputable authors generally publish papers in top journals [11].
Clustering is the process of grouping similar objects. Objects in the same cluster are similar, and objects in different clusters are dissimilar. Traditional clustering, such as the K-means algorithm, is generally based on the features of objects. At present, network-based clustering (community discovery) is receiving widespread attention. Related models usually treat the network as a homogeneous information network and divide it into a series of subgraphs according to a given criterion (e.g., normalized cuts and modularity).
Many algorithms have been proposed to solve this NP-hard problem, such as the spectral method [12], the greedy method and the sampling technique [13]. Some studies consider both the link information and the attribute information of objects to improve clustering accuracy [14]. More recently, clustering on heterogeneous information networks has received attention. Unlike in homogeneous networks, the different types of objects in heterogeneous information networks present a huge challenge to the task.
On the one hand, different types of objects in the network bring new forms of clustering. For example, a cluster may contain different types of objects with the same topic: a cluster for the database domain contains the authors, conferences, and papers in this field. In this case, clustering on heterogeneous information networks has richer semantics, but it also faces more challenges. On the other hand, the rich information contained in the network helps to improve the accuracy of the task. Li et al. put forward the SCHAN algorithm to solve the clustering problem in attributed HINs [15]. Zhou et al. designed a dynamic learning algorithm, SI-Cluster, for social influence based graph clustering [16]. Luo et al. introduced the concept of relation-path to measure the similarity between two objects and proposed a framework for semi-supervised learning in HINs [17]. These approaches improved clustering performance, but they are confined to entities with rich attributes or labeled data.

Problem formulation
In this section, we introduce several important concepts and define the problem of clustering in the TCM-HIN.
Definition 1 (Heterogeneous Information Network). An information network is defined as an undirected graph $G = \langle V, E \rangle$ with an object type mapping function $\tau : V \rightarrow T$ and a link type mapping function $\psi : E \rightarrow R$, where $T$ is a set of object types and $R$ is a set of link types on $T$. Specifically, we call such an information network a HIN when $|T| \geq 2$ and a homogeneous information network when $|T| = 1$.
, or vice versa. $G$ is then called a star network. $T_0$ is called the target type, and $T_k$ ($k \neq 0$) are called attribute types [11]. The schema for TCM-HIN is shown in Fig. 2.
In this paper, we use $T$ to represent the set of types of TCM entities. We have $T = \{F_m, F_c, H, S\}$, where $F_m$, $F_c$, $H$, and $S$ denote the entity types "formula", "function", "herb", and "symptom", respectively. For convenience, we use $F_m$ to denote both the set of objects belonging to the "formula" type and the type name; the other types are treated in the same way. We use $R = \{F_m F_c, F_m H, F_m S\}$ to represent the set of types of TCM relations on $T$, where $F_m F_c$, $F_m H$, and $F_m S$ denote the relation types "formula-function", "formula-herb", and "formula-symptom", respectively [18]. An example of TCM-HIN is shown in Fig. 3.

Definition 4 (TCM-HIN). TCM-HIN is a HIN
Based on these definitions, we can formulate our key problem as follows: given a TCM-HIN $G = \langle V, E \rangle$, the target type $T_0$, and a specified cluster number $K$, we aim to generate $K$ clusters $\{C_k\}_{k=1}^{K}$ for the target objects of the target type on $G$, as well as the within-cluster ranking information for all the objects in the network based on these clusters.
We propose a ranking-based clustering algorithm for mining formula categorization. In this section, we first introduce the overall clustering framework. Then, we explain four important parts of the algorithm in detail.

Framework of algorithm
To integrate ranking with clustering in a HIN, a model is required to flexibly support these two tasks. Therefore, we propose a probabilistic generative model to estimate the probability of target and attribute objects in the network. We can use the rankings of objects to infer the probability of objects and clustering information. The major difficulty in clustering in a HIN is the definition and calculation of pairwise similarity between objects. We map each target object into a low-dimensional space defined by the current clustering result to avoid defining and calculating similarity between each pair of objects.
TCM clustering is mainly composed of the following steps:
• Step 1: For each bipartite network, build a ranking-based probabilistic generative model for the target type and attribute type, i.e., $\{P(x \mid C_k^t)\}_{k=1}^{K}$.
• Step 2: For each bipartite network, estimate the posterior probabilities of each cluster for each target object, i.e., $\{P(C_k^t \mid x)\}_{k=1}^{K}$.
• Step 3: Calculate the distance from each target object to each cluster center based on the posterior probabilities and then assign each target object to the nearest cluster.
• Step 4: Repeat Steps 1, 2 and 3 until the clusters do not change significantly or the iteration number is larger than a predefined number.

Algorithm 1 TCM-Clus
Require: Cluster number K and relation matrix W
Ensure: K clusters
generate initial clusters from random partitions of target objects
decompose star schema network into three bipartite networks
while nonconvergence do
    for each bipartite network do
        build ranking-based probabilistic generative model
        estimate posterior probabilities for each target object
    end for
    calculate the distance and then assign target object to the nearest cluster
end while

The core framework of TCM-Clus is shown in Algorithm 1. In TCM-HIN, a formula may connect to more than one herb, function, and symptom. For example, in Fig. 4, a formula contains two herbs, Cinnamomum cassia and Arisaema consanguineum, and has two functions, "dispelling pathogenic wind and eliminating phlegm" and "boosting source of fire for eliminating abundance of yin". However, this does not mean that Cinnamomum cassia has both functions. Therefore, we should decompose the TCM-HIN into several bipartite networks as above, instead of simply making estimations in the original TCM-HIN [11]. In this paper, we decompose the TCM-HIN into three bipartite networks ($G_S$, $G_H$, $G_{F_c}$), which are induced graphs of the original graph $G$. Because the ranking function and posterior probability estimation for each bipartite network are
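The decomposition step can be illustrated with a minimal sketch. The record layout, the object names, and the helper `decompose_star_network` are hypothetical and chosen only for illustration; the sketch assumes each formula simply lists the herbs, functions, and symptoms it links to, and it counts links per formula and attribute object.

```python
from collections import defaultdict

# Hypothetical toy records: each formula lists its linked attribute objects.
records = {
    "formula_A": {"herb": ["herb_1", "herb_2"],
                  "function": ["func_1"],
                  "symptom": ["sym_1", "sym_2"]},
    "formula_B": {"herb": ["herb_2"],
                  "function": ["func_1", "func_2"],
                  "symptom": ["sym_2"]},
}


def decompose_star_network(records):
    """Split a star-schema network into one bipartite network per attribute type.

    Returns {attribute_type: {formula: {attribute_object: link_count}}}.
    """
    bipartite = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for formula, attributes in records.items():
        for attr_type, objects in attributes.items():
            for obj in objects:
                # Every link joins the target type (formula) to one attribute type.
                bipartite[attr_type][formula][obj] += 1
    return bipartite


networks = decompose_star_network(records)
```

Each of the three resulting dictionaries plays the role of one relation matrix W between formulas and a single attribute type, so ranking and posterior estimation can be run on each independently.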

Ranking function
In information network analysis, the two most important ranking algorithms are PageRank [19] and HITS [20], both of which have been applied successfully to Internet search. PageRank is a link analysis algorithm that assigns a numerical weight to each object of the information network, with the purpose of "measuring" its relative importance within the object set. HITS, by contrast, ranks objects based on two scores: authority and hub. Authority estimates the value of the content of the object, whereas hub measures the value of its links to other objects. Both PageRank and HITS evaluate the static quality of objects in the information network, which is similar to the intrinsic meaning of our ranking methods. However, both PageRank and HITS are designed for a network of webpages, which is a directed homogeneous network in which the weight of an edge is binary.
Definition 5 (Ranking Distribution and Ranking Function). A ranking distribution $P(T)$ on a type of object $T$ is a discrete probability distribution that satisfies $P(T = t) \geq 0$ ($\forall t \in T$) and $\sum_{t \in T} P(T = t) = 1$. A function $f : G \rightarrow P(T)$ defined on an information network $G$ is called a ranking function on type $T$ if, given an information network $G$, it can output a ranking distribution $P(T)$ on $T$.
Ranking is beneficial for people to grasp the importance of objects in a collection. For example, PageRank and authority of HITS represent the static importance of webpages, while the rank of a document to a given query in text retrieval reflects the relevance of the document to that query.
We use $W$ to represent the adjacency matrix, which we call the relation matrix, between the target type and the attribute type. We can define the matrix as
$W_{F_m S}(i, j) = p_{ij}$,
where $i$ and $j$ are two objects from type $F_m$ and type $S$, and $p_{ij}$ is the frequency with which $i$ links to $j$.
We have two simple empirical rules:
• Rule 1: Highly ranked formulas can cure highly ranked symptoms.
• Rule 2: One highly ranked symptom can enhance the rank of another symptom if they are cured by the same formula.
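According to Rule 1, we generate the ranks of types $F_m$ and $S$ as follows:
$P(f_{mi} \mid F_m, G) = \sum_{j} W_{F_m S}(i, j)\, P(s_j \mid S, G)$, $\quad P(s_j \mid S, G) = \sum_{i} W_{S F_m}(j, i)\, P(f_{mi} \mid F_m, G)$,
where $G$ is a network, $f_{mi}$ is an object from type $F_m$, and $s_j$ is an object from type $S$. Notice that the normalization will not change the ranking position of an object, but it provides a relative importance score for each object. After normalization, we have
$P(f_{mi} \mid F_m, G) = \frac{\sum_{j} W_{F_m S}(i, j)\, P(s_j \mid S, G)}{\sum_{i'} \sum_{j} W_{F_m S}(i', j)\, P(s_j \mid S, G)}$ (3)
$P(s_j \mid S, G) = \frac{\sum_{i} W_{S F_m}(j, i)\, P(f_{mi} \mid F_m, G)}{\sum_{j'} \sum_{i} W_{S F_m}(j', i)\, P(f_{mi} \mid F_m, G)}$ (4)
We can prove that $P(F_m \mid F_m, G)$ is the eigenvector of $W_{F_m S} W_{S F_m}$ and $P(S \mid S, G)$ is the eigenvector of $W_{S F_m} W_{F_m S}$.
Proof Combining (3) and (4), we can obtain
$P(F_m \mid F_m, G) \propto W_{F_m S}\, P(S \mid S, G) \propto W_{F_m S} W_{S F_m}\, P(F_m \mid F_m, G)$.
Similarly, $P(S \mid S, G)$ is the eigenvector of $W_{S F_m} W_{F_m S}$. We can use the power method to calculate the primary eigenvector.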
When considering Rule 2, we can revise the equation as
$P(s_j \mid S, G) = \alpha \sum_{i} W_{S F_m}(j, i)\, P(f_{mi} \mid F_m, G) + (1 - \alpha) \sum_{j'} W_{SS}(j, j')\, P(s_{j'} \mid S, G)$,
where $W_{SS} = W_{S F_m} W_{F_m S}$ and the parameter $\alpha \in [0, 1]$ determines the weight of "symptom-formula" and "symptom-symptom". Similarly, we can prove that $P(S \mid S, G)$ should be the primary eigenvector of $\alpha W_{S F_m} W_{F_m S} + (1 - \alpha) W_{SS}$, and $P(F_m \mid F_m, G)$ should be the primary eigenvector of $\alpha W_{F_m S} (I - (1 - \alpha) W_{SS})^{-1} W_{S F_m}$. In fact, if we consider the problem from the perspective of the meta path, these two rules reflect the meta path based relationship between objects. Rule 1 corresponds to meta path $S - F_m$, while Rule 2 corresponds to meta path $S - F_m - S$.
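A minimal sketch of how the power method could compute these rankings is given below. It builds the matrix αW_SFm W_FmS + (1 − α)W_SS named in the text and iterates on it; deriving the formula ranking from the symptom ranking through Rule 1 is an assumption made here for brevity, and the function names and the toy matrix are hypothetical.

```python
def transpose(mat):
    return [list(col) for col in zip(*mat)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def normalize(vec):
    total = sum(vec)
    return [v / total for v in vec] if total > 0 else list(vec)


def rank_formulas_and_symptoms(w_fs, alpha=0.8, iterations=100):
    """w_fs[i][j] is the link frequency between formula i and symptom j."""
    w_sf = transpose(w_fs)
    w_ss = matmul(w_sf, w_fs)  # symptom-symptom relation via shared formulas
    sf_fs = matmul(w_sf, w_fs)
    # Matrix whose primary eigenvector is the symptom ranking (Rules 1 and 2).
    mixed = [[alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(sf_fs, w_ss)]
    rank_s = [1.0 / len(w_sf)] * len(w_sf)
    for _ in range(iterations):
        rank_s = normalize(matvec(mixed, rank_s))  # power-method step
    # Assumption: formulas are ranked from symptoms through Rule 1.
    rank_f = normalize(matvec(w_fs, rank_s))
    return rank_f, rank_s


# Hypothetical toy relation matrix: 2 formulas x 2 symptoms.
rank_f, rank_s = rank_formulas_and_symptoms([[3, 1], [1, 0]])
```

Both returned vectors are normalized to sum to 1, so they can be read as ranking distributions in the sense of Definition 5.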

Ranking-based probabilistic generative model
We assume that the probabilities that objects of different types will be visited in the given network are independent of each other. The probability of visiting an object $x$ in $G$ can be decomposed into two parts:
$p(x \mid G) = p(T_x \mid G) \times p(x \mid T_x, G)$,
where the first part $p(T_x \mid G)$ is the general probability that the type of $x$ will be visited in the network $G$ and the second part $p(x \mid T_x, G)$ is the probability that an object $x$ will be visited among all the objects of type $T_x$ in the network $G$. Here, we consider the ranking distribution as the probability of objects being visited within their own type in a given information network $G$. We will show later that the value of $p(T_x \mid G)$ is not important and can be set to 1. In a subnetwork $G_k = G(C_k)$, we can calculate the probability of visiting an object:
$p(x \mid G_k) = p(T_x \mid G_k) \times p(x \mid T_x, G_k)$.
However, we will encounter problems if we use the above equation directly. In a given cluster, a target object may link to objects whose ranking is zero in that cluster. In addition, a target object may not belong to the current cluster. If we simply set the probability of visiting the target object to zero in that cluster, we will lose some important information. To solve this problem, we can use smoothing, a well-known technique in information retrieval for coping with the zero probability problem for missing terms in a document [21]. We add the global ranking to smooth the conditional ranking before calculating the visibility of the target object:
$p_S(x \mid T_x, G_k) = (1 - \lambda)\, p(x \mid T_x, G_k) + \lambda\, p(x \mid T_x, G)$,
where the smoothing parameter $\lambda$ denotes the portion of global ranking. To evaluate the model, we make another independence assumption: the probabilities that objects of the same type will be visited are also independent of each other.
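The smoothing step can be sketched in a few lines. This is an illustrative sketch that assumes rankings are stored as dictionaries from object name to probability; the function name `smooth_ranking` and the toy herbs are hypothetical, and the default λ = 0.2 follows the value used in the experiments.

```python
def smooth_ranking(conditional, global_rank, lam=0.2):
    """Mix the within-cluster ranking with the global ranking.

    lam is the portion of global ranking; objects missing from the
    cluster get conditional probability 0 before smoothing.
    """
    return {obj: (1 - lam) * conditional.get(obj, 0.0) + lam * p_global
            for obj, p_global in global_rank.items()}


# Hypothetical rankings: herb_b never appears in the current cluster.
global_rank = {"herb_a": 0.5, "herb_b": 0.5}
conditional = {"herb_a": 1.0}
smoothed = smooth_ranking(conditional, global_rank)
```

After smoothing, `herb_b` has a small nonzero probability in the cluster, so a formula linked to it is no longer assigned zero visiting probability.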

Posterior probability estimation using EM algorithm
To determine which cluster each target object belongs to, we estimate the posterior probability for each target object. For convenience, we use $X$ and $Y$ to represent types $F_m$ and $S$, where $|X| = m$ and $|Y| = n$. Given a clustering on the input network $G$, we can calculate the posterior probability for each target object using the Bayesian rule:
$P(G_k \mid x_i) = \frac{p(x_i \mid G_k)\, p(k)}{\sum_{l=1}^{K} p(x_i \mid G_l)\, p(l)}$,
where $p(x_i \mid G_k)$ is the probability that target object $x_i$ will be visited in cluster $k$ and $p(k)$ denotes the relative size of cluster $k$. From this formula, we can see that the type probability $p(T \mid G)$ is just a constant when calculating the posterior probabilities for target objects and can be neglected.
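As a small worked example of the Bayesian rule described above, the sketch below normalizes the product of visiting probability and cluster size over all clusters. The function name and the numbers are hypothetical, and the uniform fallback for an all-zero input is an assumption added to keep the sketch total.

```python
def posterior(visit_probs, cluster_sizes):
    """Posterior over clusters for one target object.

    visit_probs[k]  : probability of visiting the object in cluster k
    cluster_sizes[k]: relative size of cluster k
    """
    joint = [p * s for p, s in zip(visit_probs, cluster_sizes)]
    total = sum(joint)
    if total == 0:
        # Assumption: fall back to a uniform posterior when nothing matches.
        return [1.0 / len(joint)] * len(joint)
    return [j / total for j in joint]


# Hypothetical values for a formula and two equally sized clusters.
post = posterior([0.02, 0.01], [0.5, 0.5])
```

Here the object is twice as likely to be visited in the first cluster, so its posterior is 2/3 for the first cluster and 1/3 for the second.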
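Let $\Theta$ be the parameter matrix, which is an $m \times K$ matrix: $\Theta_{m \times K} = \{P(G_k \mid x_i)\}$ ($i = 1, 2, \cdots, m$; $k = 1, 2, \cdots, K$). To obtain the best $\Theta$ that maximizes the likelihood of generating the whole bipartite network, we have the following likelihood function:
$L(\Theta \mid W_{XY}) = \prod_{i=1}^{m} \prod_{j=1}^{n} P(x_i, y_j \mid \Theta)^{W_{XY}(i, j)}$,
where $P(x_i, y_j \mid \Theta)$ is the probability of generating link $\langle x_i, y_j \rangle$ given the current parameter. Because it is difficult to maximize $L$ directly, we apply the EM algorithm to solve the problem. In the E-Step, we introduce a hidden variable $z \in \{1, 2, \cdots, K\}$ to represent the cluster label that a link $\langle x, y \rangle$ is from. The complete log likelihood can be written as
$\log L(\Theta \mid W_{XY}, Z) = \sum_{i=1}^{m} \sum_{j=1}^{n} W_{XY}(i, j) \log\big(P(x_i, y_j \mid z)\, P(z \mid \Theta)\big)$.
Initially, we can set the parameters in $\Theta^{(0)}$ to uniform values. The expectation of the log likelihood under the current distribution of $Z$ is
$Q(\Theta, \Theta^{(t)}) = \sum_{k=1}^{K} \sum_{i=1}^{m} \sum_{j=1}^{n} \big[ W_{XY}(i, j) \log(P(z = k \mid \Theta))\, P(z = k \mid x_i, y_j, \Theta^{(t)}) \big] + \sum_{k=1}^{K} \sum_{i=1}^{m} \sum_{j=1}^{n} \big[ W_{XY}(i, j) \log(P(x_i, y_j \mid z = k))\, P(z = k \mid x_i, y_j, \Theta^{(t)}) \big]$ (8)
where $\Theta^{(t)}$ is the parameter matrix after $t$ iterations.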
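We can use the Bayesian rule to calculate the conditional distribution $P(z = k \mid x_i, y_j, \Theta^{(t)})$ as follows:
$P(z = k \mid x_i, y_j, \Theta^{(t)}) = \frac{P(x_i, y_j \mid z = k)\, P(z = k \mid \Theta^{(t)})}{\sum_{l=1}^{K} P(x_i, y_j \mid z = l)\, P(z = l \mid \Theta^{(t)})}$ (9)
In the M-Step, to obtain the $P^{(t+1)}(z = k)$ that maximizes $Q(\Theta, \Theta^{(t)})$, we introduce the Lagrange multiplier $\lambda$. For each $P(z = k)$, where $k = 1, 2, \cdots, K$, we have
$\frac{\partial}{\partial P(z = k)} \Big[ Q(\Theta, \Theta^{(t)}) + \lambda \Big( \sum_{l=1}^{K} P(z = l) - 1 \Big) \Big] = 0 \;\Rightarrow\; \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{W_{XY}(i, j)\, P(z = k \mid x_i, y_j, \Theta^{(t)})}{P(z = k)} + \lambda = 0$.
Now, integrating with (9), we can obtain the new estimation for $P(z = k)$:
$P^{(t+1)}(z = k) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} W_{XY}(i, j)\, P(z = k \mid x_i, y_j, \Theta^{(t)})}{\sum_{i=1}^{m} \sum_{j=1}^{n} W_{XY}(i, j)}$ (10)
Finally, each parameter in $\Theta$ is calculated as
$\Theta(i, k) = P(G_k \mid x_i) = \frac{p(x_i \mid G_k)\, P(z = k)}{\sum_{l=1}^{K} p(x_i \mid G_l)\, P(z = l)}$ (11)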
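The EM loop can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes that the probability of generating a link in cluster k factorizes as P(x_i | k) · P(y_j | k), which the text does not state, and the function name and toy inputs are hypothetical.

```python
def em_cluster_posteriors(w, px, py, iterations=20):
    """Estimate cluster priors and the parameter matrix for one bipartite network.

    w[i][j] : link weight between target x_i and attribute y_j
    px[k][i]: probability of visiting x_i in cluster k
    py[k][j]: probability of visiting y_j in cluster k
    Assumption: P(x_i, y_j | z = k) = px[k][i] * py[k][j].
    """
    n_clusters, m, n = len(px), len(w), len(w[0])
    prior = [1.0 / n_clusters] * n_clusters  # uniform initial values
    for _ in range(iterations):
        new_prior = [0.0] * n_clusters
        for i in range(m):
            for j in range(n):
                if w[i][j] == 0:
                    continue
                # E-step: responsibility of each cluster for link (i, j).
                joint = [px[k][i] * py[k][j] * prior[k] for k in range(n_clusters)]
                norm = sum(joint)
                if norm == 0:
                    continue
                for k in range(n_clusters):
                    new_prior[k] += w[i][j] * joint[k] / norm
        total = sum(new_prior)
        if total > 0:
            # M-step: weighted share of links explained by each cluster.
            prior = [p / total for p in new_prior]
    theta = []
    for i in range(m):
        joint = [px[k][i] * prior[k] for k in range(n_clusters)]
        norm = sum(joint)
        theta.append([v / norm for v in joint] if norm > 0
                     else [1.0 / n_clusters] * n_clusters)
    return prior, theta


# Hypothetical toy network: two formulas, two symptoms, two clusters.
prior, theta = em_cluster_posteriors(
    w=[[1, 0], [0, 1]],
    px=[[0.9, 0.1], [0.1, 0.9]],
    py=[[0.9, 0.1], [0.1, 0.9]],
)
```

Each row of `theta` is the posterior of one target object over the K clusters, which is the quantity later used to position the object in the low-dimensional space.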

Cluster assignment
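After we obtain the estimations for each target object in each bipartite network, we can represent the target object $x_i$ as a $3K$-dimensional vector $\vec{s}_{x_i}$, formed by concatenating the $K$ posterior probabilities $P(G_k \mid x_i)$ estimated in each of the three bipartite networks. The center of each cluster can thus be calculated as the arithmetic mean of $\vec{s}_{x_i}$ for all $x_i$ in the cluster:
$\vec{s}_{X_k} = \frac{\sum_{x_i \in X_k} \vec{s}_{x_i}}{|X_k|}$,
where $x_i$ is an object from type $F_m$ and $|X_k|$ is the size of cluster $k$. The distance between an object and a cluster is defined as 1 minus the cosine similarity:
$D(x_i, X_k) = 1 - \frac{\vec{s}_{x_i} \cdot \vec{s}_{X_k}}{\lVert \vec{s}_{x_i} \rVert\, \lVert \vec{s}_{X_k} \rVert}$.
Then, we can assign each object to the cluster with the smallest distance.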
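The assignment step described above can be sketched as follows. The function names and the toy vectors are hypothetical, and treating a zero-length vector or an empty cluster as maximally distant is an assumption added to avoid division by zero.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def assign_clusters(vectors, labels, n_clusters):
    """Recompute cluster centers as arithmetic means, then reassign each object."""
    dim = len(vectors[0])
    centers = []
    for c in range(n_clusters):
        members = [v for v, label in zip(vectors, labels) if label == c]
        if members:
            centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centers.append([0.0] * dim)  # empty cluster: never the nearest
    return [min(range(n_clusters), key=lambda c: cosine_distance(v, centers[c]))
            for v in vectors]


# Hypothetical posterior vectors for four formulas and their current labels.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.6, 0.4]]
new_labels = assign_clusters(vectors, labels=[0, 0, 1, 1], n_clusters=2)
```

In this toy run the fourth object moves from the second cluster to the first, because its vector points closer to the first cluster's center.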

User-guided clustering
User guidance is critical for clustering objects in the network [22]. Using different types of link information in a network, different reasonable clustering results can be generated. We take user guidance in the form of object seeds for some clusters as prior knowledge for the clustering result, modeling the prior as a Dirichlet distribution rather than treating the seeds as hard labels. For each target object $x_i$, its clustering probability vector $P(G \mid x_i)$ is a multinomial distribution, which is generated from some Dirichlet distribution. If $x_i$ is labeled as a seed in cluster $k^*$, $P(G \mid x_i)$ is modeled as being sampled from a Dirichlet distribution with parameter vector $\lambda_d e_{k^*} + \mathbf{1}$, where $e_{k^*}$ is a $K$-dimensional basis vector with the $k^*$th element equal to 1 and 0 elsewhere. If $x_i$ is not a seed, $P(G \mid x_i)$ is assumed to be sampled from a uniform distribution, which can also be viewed as a Dirichlet distribution with parameter vector $\mathbf{1}$. The density of $P(G \mid x_i)$ given such priors is
$p(P(G \mid x_i) \mid \lambda_d) \propto \prod_{k=1}^{K} P(G_k \mid x_i)^{\mathbb{1}\{x_i \in G_k\} \lambda_d}$,
where $\mathbb{1}\{x_i \in G_k\}$ is an indicator function, which is 1 if $x_i \in G_k$ holds and 0 otherwise. The hyperparameter $\lambda_d$ is a nonnegative value and controls the strength of users' confidence in the object seeds in each cluster.

Time complexity analysis
The time complexity of TCM-Clus is composed of the following parts. First, the time complexity for ranking is $O(t_1 |E|)$, where $t_1$ is the iteration number and $|E|$ is the number of links. Notice that $|E| \ll |V|^2$ in a sparse network, where $|V|$ is the total number of objects in the network. Second, for the posterior probability estimation, we need to calculate $O(K|E| + K + mK)$ parameters at each iteration, where the time complexity for (9) is $O(K|E|)$, the time complexity for (10) is $O(K)$, and the time complexity for (11) is $O(mK)$. Third, the cluster adjustment has complexity $O(mK^2)$, because we need to compute the distance between each object and each cluster and the dimension of an object is proportional to $K$. In total, the time complexity for TCM-Clus is $O(t_1 |E| + t_2 (K|E| + K + mK) + mK^2)$, where $t_2$ is the iteration number of the estimation. If the network is sparse, which is typical in most applications, the time complexity is almost linear in the number of objects in the network.

Results
In this section, we conduct several experiments to show the effectiveness of TCM-Clus. First, we introduce the datasets used in this paper. Then, we discuss the evaluation of TCM-Clus.

Datasets
In this paper, we use two real datasets: ChP, the Pharmacopoeia of the People's Republic of China 2015 Edition (http://wp.chp.org.cn/en/index.html), and more than 3,000 TCM clinical cases, mainly concerning the stomach. We use the herb information in Volume I, which contains 2598 types of medicinal materials without classifications, to set up our experiments. ChP is an unstructured corpus and contains various kinds of information. We extract only formulas, functions, herbs, and symptoms to build the TCM-HIN.

Quantitative evaluation
We use FVIC (fraction of vertices identified correctly) to evaluate the accuracy of the clustering results. It has been used in many research projects and is defined as follows:
$\mathrm{FVIC} = \frac{1}{N} \sum_{c \in C_F} \max_{c^* \in C_K} |c \cap c^*|$,
where $C_F$ and $C_K$ represent the found clusters and known clusters, respectively, $c$ and $c^*$ are clusters in $C_F$ and $C_K$, respectively, and $N$ is the number of objects in the network. FVIC evaluates the average matching degree by comparing each predicted cluster with the best-matching real cluster. A higher score indicates a better clustering with respect to the ground truth. We compare TCM-Clus with spectral clustering, which is the k-way Ncut algorithm and has been used to cluster Western medical records [23]; PaReCat, which has been used to cluster Chinese medical records for the task of patient record categorization [24]; and K-Means, a common clustering technique. In this experiment, we fix the smoothing parameter $\lambda$ at 0.2 and the weight parameter $\alpha$ at 0.8. The accuracy results are shown in Table 1.
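A minimal sketch of the FVIC computation is shown below. It follows the description in the text (each found cluster is matched to the known cluster it overlaps most) and assumes the commonly used form of the measure, with N taken as the total number of objects in the known clusters; the function name and the toy clusters are hypothetical.

```python
def fvic(found, known):
    """Fraction of vertices identified correctly.

    found, known: lists of sets of object identifiers.
    Assumption: every object appears in exactly one known cluster.
    """
    n_objects = sum(len(c) for c in known)
    # Match each found cluster to its best-overlapping known cluster.
    matched = sum(max(len(set(c) & set(t)) for t in known) for c in found)
    return matched / n_objects


# Hypothetical example with six objects.
found_clusters = [{1, 2, 3}, {4, 5, 6}]
known_clusters = [{1, 2}, {3, 4, 5, 6}]
score = fvic(found_clusters, known_clusters)
```

In this example five of the six objects fall in the best-matching known cluster, so the score is 5/6; a perfect clustering scores 1.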
We can observe that TCM-Clus achieves the best clustering accuracy on the two datasets. K-Means performs poorly because our medical data lack semantic information. Spectral clustering gives good results, but because it omits the structure of the graph, it performs worse than TCM-Clus. The performance of PaReCat is closest to that of our algorithm, but it is more suitable for patient record categorization with disease, symptom and herb. These results show that TCM-Clus can indeed improve clustering accuracy by integrating ranking with clustering.

Parameter study
We use clustering accuracy to analyze the effect of different smoothing parameters $\lambda$ on the ChP dataset. We denote the three $\lambda$ values for symptom, herb, and function as $\lambda_s$, $\lambda_h$ and $\lambda_f$, respectively. We vary one of them and fix the other two at 0.2. We run TCM-Clus on the ChP dataset, and the results are shown in Fig. 5. The results are based on ten different initial partitions. We can observe that TCM-Clus achieves better accuracy when $\lambda$ is between 0.1 and 0.8. A smoothing parameter that is too small ($\lambda \rightarrow 0$) or too large ($\lambda \rightarrow 1$) means that we consider only the conditional ranking or only the global ranking, respectively, which decreases the performance of TCM-Clus. We also examine the impact of the iteration number on the clustering accuracy. As shown in Fig. 6, the clustering accuracy is poor when the iteration number is too small. As the iteration number becomes larger, the accuracy improves and then stabilizes.
Lastly, we examine the impact of the weight parameter $\alpha$; the result is shown in Fig. 7. If the weight parameter $\alpha$ is too small or too large, we consider only one kind of meta path based relationship. Shorter meta paths carry more information than longer ones: if $\alpha = 1$, the clustering accuracy equals 0.791, which is larger than the 0.765 obtained with $\alpha = 0$.

Qualitative evaluation
We apply our methods to investigate whether TCM-Clus can effectively cluster formulas into informative categories. The results were verified by TCM experts, and many of them are widely used in clinical diagnosis. We show the top-10 herbs and formulas in a cluster identified by our method in Table 2 and the top-5 functions and symptoms in a cluster in Table 3.

Case evaluation
As mentioned above, TCM-Clus can achieve high quality categorizations. Furthermore, we can obtain new knowledge from clusters, such as "different formulas with similar herbs", "different formulas with similar functions", "different symptoms with similar herbs" and so on. We show an example of "different symptoms with similar herbs" discovered by TCM-Clus in Table 4.
In addition, given a symptom as input, our system can output suitable herbs or formulas for that symptom. We list the herbs used for two symptoms in Table 5. The results were verified by TCM experts, and many of the herbs are widely used for these symptoms.

Discussion
Based on our algorithm, we can learn potential knowledge in TCM, such as discovering similar prescriptions and recommending Chinese medicine based on symptoms. There are still some entities that we have not considered, such as the amount of each herb and patient information. In future work, more research is needed to address general HINs with more kinds of entities. In addition, the ranking function is highly domain dependent, and how to automatically extract rules from small partial ranking results given by experts could be another interesting problem.

Conclusions
TCM is one of the most important complementary and alternative medicines. However, the complexity and elusiveness of diagnostic methods limit its development and generalization. Formulas are an essential part of TCM. Mining categorizations from TCM medical records is an important task for precision medicine. We present a novel algorithm, TCM-Clus, for mining formula categorization. We use a generative probabilistic model based on ranking to generate the reachable probability of target objects. Meanwhile, Bayesian rules and the EM algorithm are utilized to estimate the posterior probability. The experiments show that TCM-Clus achieves better clustering results than other representative algorithms and is beneficial for enhancing the predictive accuracy of medicine.
Abbreviations
HIN: Heterogeneous information network
TCM: Traditional Chinese medicine