Privacy protection of medical data in social network

Background: Protecting the privacy of data published in the health care field is an important research topic. The Health Insurance Portability and Accountability Act (HIPAA) in the USA is the current legislation for privacy protection. However, the Institute of Medicine Committee on Health Research and the Privacy of Health Information recently concluded that HIPAA cannot adequately safeguard privacy, while at the same time researchers cannot use the medical data for effective research. Therefore, more effective privacy protection methods are urgently needed to ensure the security of released medical data.

Methods: Clustering-based privacy protection methods aim to ensure that the published data remains both useful and protected. In this paper, we first analyzed the importance of the key attributes of medical data in the social network. According to the attribute function and the main objective of privacy protection, the attribute information was divided into three categories. We then proposed an algorithm based on greedy clustering to group the data points according to the attributes and the connective information of the nodes in the published social network. Finally, we analyzed the loss of information during clustering, and evaluated the proposed approach with respect to classification accuracy and information loss rates on a medical dataset.

Results: The associated social network of a medical dataset was analyzed for privacy preservation. We evaluated the generalization loss and the structure loss for different values of k and a, i.e. k ∈ {3, 6, 9, 12, 15, 18, 21, 24, 27, 30} and a ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}. The experimental results showed that the generalization loss approached optimal when a = 1 and k = 21, and the structure loss approached optimal when a = 0.4 and k = 3.

Conclusion: We showed the importance of the attributes and the structure of the released health data in privacy preservation. Our method achieved better privacy preservation in the social network by optimizing generalization loss and structure loss. The proposed loss evaluation method obtained a balance between data availability and the risk of privacy leakage.


Correspondence: ise_suj@ujn.edu.cn

Background
The wide deployment of electronic health record systems has brought convenience to our lives. The need for sharing health data among multiple parties has become evident in several applications, such as decision support, policy development, and data mining [1]. The widespread use of social networks and the integration and fusion of data based on linkage have posed privacy threats to the release of health data and the research of bioinformatics data [2][3][4][5][6]. With the rapid increase in data volume and the development of cloud storage platforms, the security of medical data is facing increasing challenges. These challenges arise from the rise of the mobile medical industry and from the information that must be shared between commercial health insurance information systems, basic medical insurance information systems, and medical institution information systems. All of these make privacy protection more difficult. Patients' privacy may well be disclosed when they use social network tools in daily life or when seeking medical treatment. The disclosure of private information might result in serious consequences for the patients or for society. Therefore, privacy protection is a very important consideration in the field of medical data sharing and distribution.
According to a survey from a security software company, users of social networks are more likely to encounter loss of financial information, theft of identity information, and security threats through software and hardware. In addition, the integration and fusion of data based on linkage also results in privacy disclosure, as demonstrated in Fig. 1. Data source 1 is online shopping data. Data source 2 is anonymously published medical data, in which the attributes ID, name and marriage have been anonymized. Data source 3 is social network data, which also has the attributes gender, age, phone number and marriage status. Attackers can uncover private information (such as the diagnosis) by integrating data sources 1, 2, and 3.
At present, the measures to protect medical data and the privacy of patients mainly include:
• Safely store medical data. File block storage and encryption technology are applied when patients' files, medical records, and pictures of test results are stored using cloud platforms.
• Enhance awareness of the protection of patient information. Storage of cards, documents, pictures, or test reports with patients' information is prohibited. Mention of patient information in public places or unsecured places is not allowed.
• De-identify patient information whenever possible. Before sharing the necessary medical information, de-identification should be performed, especially for the patient's name, date of birth, telephone number, address, ID card number, medical record number, photos, etc.

Fig. 1
Privacy disclosure caused by social network and by integrating the data sources. Data source 1 is online shopping data. Data source 2 is anonymously published medical data, in which the attributes ID, name and marriage have been anonymized. Data source 3 is social network data, which also has the attributes gender, age, phone number and marriage status

Among the above measures, de-identification is very important for privacy protection. Technical efforts are highly encouraged to make published health data both privacy-preserving and useful. The limited release technique selectively publishes data according to specific circumstances by using data generalization and anonymity techniques. For sensitive data, it publishes data with low accuracy or does not publish the data at all. The aim is to find a balance between data availability and privacy protection: the technique tries to release data of reasonable value while limiting disclosure risk to a reasonable range. These kinds of algorithms have high versatility and wide adaptability. However, the published data usually suffers a certain degree of information loss.
The existing privacy assurance algorithms are based on either interactive or non-interactive approaches. In a non-interactive framework, the owner of the database first anonymizes the raw data and then releases the anonymized version for public usage [1,7,8]. Anonymity is the technique of hiding or obscuring the data or the data sources. It generally anonymizes data by suppression, generalization, analysis, slicing, and separation. Data privacy protection technology in social networks is divided into two categories: clustering-based methods and graph structure modification methods. In a clustering-based method, the nodes and edges of the graph are grouped into super nodes and super edges, and the sensitive information of nodes and edges is hidden in their super classes. Graph structure modification is similar to k-anonymity, and prevents attackers from using the network structure as background knowledge [9].
The data in a social network contains a large amount of sensitive information, such as link node attributes, node tags, and graph structure features. Attackers can use either active or passive attack models to dissect and uncover sensitive information. A social network is usually released in the form of a graph, in which each node is described by an entity attribute set and has a unique identifier. Because of the advantages of this representation, some researchers use graphs as the tool to study the problem of privacy protection. Some authors [10] categorized anonymization methods and reviewed anonymization methods on rich graphs. Other authors [11][12][13][14][15] presented methods for anonymizing graph data based on grouping and classing. A clustering approach for data and structural anonymity in social networks was also given [16]. Several reports [17][18][19] described how to preserve the privacy of sensitive relationships in graph data. Other reports [20,21] examined the problem of vertex re-identification from anonymized graphs. One study [22] proposed methods to release and analyze synthetic graphs in order to protect the privacy of individual relationships in the social network. Another [23] sought a solution to share meaningful graph datasets while preserving privacy. Two studies [24,25] addressed the problem of anonymizing graphs in evolving social networks. Others [26,27] showed that the true anonymity level of graphs was lower than that obtained by measures such as k-anonymity.
Recent research has indicated that the present models are still vulnerable to various attacks and provide insufficient privacy protection. In this paper, we presented a privacy protection method to release medical data by adopting non-interactive framework [28].
To prevent attacks on network structure, we provided a k-anonymous greedy clustering algorithm based on the entity attributes of the released social network. The privacy protection in this algorithm is based on a generalization technique, and a method to evaluate the resulting loss is described. The algorithm significantly reduces the risk of privacy exposure and at the same time ensures data availability. Moreover, the algorithm is computationally efficient.

The key attributes of medical data in social network
When medical data is released, each dataset contains a plurality of tuples, and each tuple corresponds to a specific individual member of society. According to the attribute function and the main objective of privacy protection, the attribute information is divided into three categories. The first category is unique identifier attributes, which can uniquely identify a specific individual member of the community. These include the driver license number, the social security number (SSN), etc. Attributes of this kind are usually hidden before release to the social network. The second category is approximate identity attributes (quasi-identifiers), which appear both in the published data tables and in external data sources. These include postal codes, home address, etc. The third category is sensitive attributes, which are secret attributes such as family income or medical history. In a social network, privacy protection is more difficult because the three kinds of attributes described above are often interrelated and mutually influenced. In a published shared data table, the unique identifiers are usually removed directly, because they clearly identify the individual members of society together with their private information. However, open shared data tables are still released with zip codes, gender, birthday and other similar identity attributes. An attacker can often link such data together using the approximate identity attributes obtained through other channels, and can then easily identify all the data of an individual member of the community. According to statistics, about 87% of the citizens in the United States can be recognized by means of approximate identity attributes such as zip code, gender, and date of birth.
Because of the need for statistics, research, or some other applications, hospitals have to frequently release the patient's data. Table 1 is the patient's medical information table, in which the sensitive attribute is {disease} and the approximate attributes are {Zip code, age}. Table 2 is the publicly available individual information data table.
The current practice of preventing the leakage of patients' private information primarily relies on policies and guidelines, such as HIPAA in the USA [29]. However, the reality is that patients' health records are not perfectly protected, while researchers cannot effectively use them for discoveries. Hospitals typically delete or de-identify the unique identity attributes of the individual records. Although this protects individual privacy to a certain extent, attackers can still obtain private information by connecting the approximate identity attributes in Table 1 with the released relevant information in Table 2. For example, if an attacker wants to learn Sam's disease and knows his ZIP code and age, it may be inferred that Sam suffers from "cancer". This is a simple link attack. To solve this problem, our method uses a clustering algorithm based on attribute information.
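The link attack described above can be illustrated with a minimal sketch. The two tables below are hypothetical stand-ins for Table 1 and Table 2: the names, ZIP codes, ages, and diseases are invented for illustration, and `link_attack` is an illustrative helper, not part of the proposed method.

```python
# Hypothetical released medical table: unique identifiers removed,
# quasi-identifiers (zip, age) kept, sensitive attribute is "disease".
medical = [
    {"zip": "47677", "age": 34, "disease": "cancer"},
    {"zip": "47602", "age": 51, "disease": "flu"},
    {"zip": "47905", "age": 27, "disease": "asthma"},
]

# Hypothetical publicly available table that still carries names.
public = [
    {"name": "Sam", "zip": "47677", "age": 34},
    {"name": "Ann", "zip": "47602", "age": 51},
]


def link_attack(released, external, quasi_identifiers):
    """Join both tables on the quasi-identifiers and return the
    sensitive value of every person who matches exactly one record."""
    index = {}
    for row in released:
        key = tuple(row[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(row)

    matches = {}
    for person in external:
        key = tuple(person[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match discloses the disease
            matches[person["name"]] = candidates[0]["disease"]
    return matches


reidentified = link_attack(medical, public, ["zip", "age"])
print(reidentified)
```

When several released records share the same quasi-identifier values, the match is no longer unique, which is the effect that k-anonymity aims for.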
During the release of a social network, the basic methods to protect privacy are changing the identification information of nodes or changing the structure information by adding or deleting edges. Because a large number of historically released data can be collected easily, and information about the nodes can be collected over a certain time period, attackers can sometimes recognize a target node in the published network after it has been inserted into the network. Anonymization methods against such attacks include the k-degree anonymity method, the k-neighborhood anonymity method, and the k-subgraph isomorphism anonymity method [30][31][32]. However, these three kinds of methods usually result in information loss when reconstructing a social network graph.

K-anonymity based on generalization
K-anonymity is realized by using generalization technology and hiding technology [33]. These two techniques are different from distortion, disturbance, and randomization because they can maintain the authenticity of the data. Attribute-based generalization method can reduce the damage to the original structure and reduce loss.
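As a minimal sketch of generalization, the example below coarsens two quasi-identifiers and checks whether every group of identical generalized values holds at least k records. The records, the interval width, and the helper names `generalize` and `is_k_anonymous` are assumptions made for illustration only.

```python
from collections import Counter


def generalize(record, age_width=10, zip_digits=3):
    """Coarsen quasi-identifiers: age -> interval, ZIP code -> masked prefix."""
    low = (record["age"] // age_width) * age_width
    zip_code = record["zip"]
    masked = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (f"{low}-{low + age_width - 1}", masked)


def is_k_anonymous(records, k, age_width=10, zip_digits=3):
    """True if every generalized quasi-identifier combination occurs >= k times."""
    groups = Counter(generalize(r, age_width, zip_digits) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical records, invented for illustration.
records = [
    {"age": 31, "zip": "47677"},
    {"age": 36, "zip": "47602"},
    {"age": 38, "zip": "47678"},
    {"age": 52, "zip": "47905"},
    {"age": 55, "zip": "47909"},
    {"age": 58, "zip": "47906"},
]

print(is_k_anonymous(records, k=3))                # generalized: two groups of 3
print(is_k_anonymous(records, k=2, zip_digits=5))  # full ZIP codes are all distinct
```

The generalized values remain truthful (each age lies in its interval), which is the property that distinguishes generalization from distortion or randomization.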
To construct a k-anonymous graph, we need to apply generalization techniques not only to the information of the nodes, but also to the internal structure of each subgraph and the relationships between subgraphs. The edges between subgraphs describe the structural characteristics of the network. We construct the k-anonymous graph after estimating the loss, the internal relations of each subgraph, and the relationships between subgraphs.
For the graph $G = (V, E)$ with $|V| = N$, $N$ is the number of nodes, $V$ is the set of nodes, and $E$ is the set of edges. The nodes are given an initial partition. The clustering process must satisfy two criteria: first, each cluster contains at least $k$ nodes; second, the loss is reduced as far as possible. Therefore, it is necessary to define a method to estimate the loss.
This algorithm clusters $k$ nodes into a set with similar attributes and minimal loss. We record $V$ as an ordered sequence $\{v_1, v_2, \ldots, v_N\}$. The adjacency relationship between nodes is represented by an adjacency matrix $A = \{a_{i,j}\}$, where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, N$. When there is a direct connection between $v_i$ and $v_j$, $a_{i,j} = 1$; otherwise $a_{i,j} = 0$. The neighborhood of each node can be retrieved from this matrix. A symmetric binary distance measure is used for this matrix. The node distance and the structure distance are represented by $D(v_i, v_j)$ and $D(v_i, s_k)$, respectively.

Definition 1. Node distance
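$\forall i, j \in \{1, 2, \ldots, N\}$, the distance between $v_i$ and $v_j$ is described as:

$$D(v_i, v_j) = d, \qquad d = \frac{\min\{a_{i,k} + a_{k,q} + \cdots + a_{p,j}\}}{mn} \tag{1}$$

where $i \neq k \neq q \neq \cdots \neq p \neq j$, $a_{i,k} = a_{k,q} = \cdots = a_{p,j} = 1$, and $mn$ is the number of nodes in the shortest path.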

Definition 2. Structure distance
$\forall i, k \in \{1, 2, \ldots, N\}$, the structure distance $D(v_i, s_k)$ between $v_i$ ($v_i \notin s_k$) and the cluster $s_k$ is defined in terms of $|s_k|$, the number of nodes in cluster $s_k$.
The distance between nodes and the distance between a node and a cluster both lie in the interval [0, 1]. For graph $G$, the node with the maximum degree is selected as the center of a new cluster. The unallocated nodes with the minimum distance to the structure are then selected to form the new cluster.
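The two distances can be sketched as follows. This is a hedged illustration under stated assumptions rather than the exact definitions: the node distance is taken as the number of edges on the shortest path divided by the number of nodes on that path, unreachable nodes are assumed to be at distance 1, and the structure distance is assumed to be the average node distance to the members of the cluster. Indices are 0-based, and the function names are illustrative.

```python
from collections import deque


def shortest_path_edges(adj, i, j):
    """Breadth-first search on an adjacency matrix.
    Returns the number of edges on a shortest path, or None if unreachable."""
    if i == j:
        return 0
    seen = {i}
    queue = deque([(i, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt, connected in enumerate(adj[node]):
            if connected and nxt not in seen:
                if nxt == j:
                    return dist + 1
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def node_distance(adj, i, j):
    """Assumed form: edges on the shortest path / nodes on that path."""
    edges = shortest_path_edges(adj, i, j)
    if edges is None:
        return 1.0  # assumption: unreachable nodes are maximally distant
    return edges / (edges + 1)  # a path with e edges visits e + 1 nodes


def structure_distance(adj, i, cluster):
    """Assumed form: average node distance from node i to the cluster members."""
    return sum(node_distance(adj, i, j) for j in cluster) / len(cluster)


# Path graph 0 - 1 - 2 - 3 as a symmetric binary adjacency matrix.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(node_distance(adj, 0, 1), node_distance(adj, 0, 3))
print(structure_distance(adj, 0, [1, 2]))
```

Both quantities stay within [0, 1] under these assumptions, which matches the interval stated in the text.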

Loss evaluation
According to the attributes of the nodes, the loss of a clustering includes generalization loss and structure loss [29]. Generalization loss is used to calculate the loss of the descriptive information of the nodes [32], and is defined as:

$$GLoss(G, PS) = \frac{\sum_{j=1}^{m} |s_j| \cdot \left( Attr(s_j, N) + Cate(s_j, C) \right)}{n \cdot (p + q)}$$

where $PS = \{s_1, s_2, \ldots, s_m\}$ is the partition, $|s_j|$ is the cardinality of cluster $s_j$, $n$ is the number of nodes, $N = \{N_1, N_2, \ldots, N_p\}$ is the set of numerical attributes, and $C = \{C_1, C_2, \ldots, C_q\}$ is the set of categorical attributes. $Attr(s_j, N)$ and $Cate(s_j, C)$ are the generalization loss factors caused by generalizing the numerical and the categorical attributes, respectively. Here $gen(s_j)$ is the generalization information of cluster $s_j$: for each attribute, numerical or categorical, it holds the most specific common generalized value for all the values of that attribute in $s_j$. $gen(s_j)[N_k]$ is the interval spanned by the values of $N_k$ in $s_j$. The hierarchy associated with the categorical attribute $C_k$ is denoted $H_{C_k}$, and $gen(s_j)[C_k]$ is the nearest common ancestor in $H_{C_k}$ of the values of $C_k$ in $s_j$. $M(gen(s_j)[C_k])$ is the sub-hierarchy of $H_{C_k}$ rooted at $gen(s_j)[C_k]$, and $height(H_{C_k})$ is the height of the hierarchy.
Parameters α and β are set by the user and are used to control the relative importance of the node information and the structure information.
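A generalization loss of this kind can be sketched as below. The per-attribute factors are assumptions, since their exact definitions are not reproduced in this text: the numerical factor is taken as the cluster's value range divided by the range over the whole dataset, and the categorical factor as the height of the sub-hierarchy rooted at the nearest common ancestor divided by the height of the full hierarchy. Hierarchies are encoded as hypothetical child-to-parent dictionaries, and the records are invented.

```python
def ancestors(parent, value):
    """Chain from a value up to the hierarchy root, starting with the value."""
    path = [value]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(parent, values):
    """Most specific hierarchy node that generalizes all given values."""
    common = ancestors(parent, values[0])
    for v in values[1:]:
        chain = set(ancestors(parent, v))
        common = [a for a in common if a in chain]
    return common[0]


def height(parent, root):
    """Height of the sub-hierarchy rooted at `root` (a leaf has height 0)."""
    children = [c for c, p in parent.items() if p == root]
    return 0 if not children else 1 + max(height(parent, c) for c in children)


def generalization_loss(records, partition, numeric, hierarchies):
    """Cluster-size-weighted loss, normalized by n * (p + q)."""
    n, p, q = len(records), len(numeric), len(hierarchies)
    total = 0.0
    for cluster in partition:
        rows = [records[i] for i in cluster]
        attr = 0.0
        for name in numeric:  # assumed numerical factor: cluster range / full range
            full = max(r[name] for r in records) - min(r[name] for r in records)
            span = max(r[name] for r in rows) - min(r[name] for r in rows)
            attr += span / full if full else 0.0
        cate = 0.0
        for name, (parent, root) in hierarchies.items():  # assumed categorical factor
            lca = lowest_common_ancestor(parent, [r[name] for r in rows])
            cate += height(parent, lca) / height(parent, root)
        total += len(cluster) * (attr + cate)
    return total / (n * (p + q))


# Hypothetical records and a two-level gender hierarchy.
records = [
    {"age": 30, "gender": "F"},
    {"age": 34, "gender": "F"},
    {"age": 50, "gender": "M"},
    {"age": 60, "gender": "F"},
]
hierarchies = {"gender": ({"F": "any", "M": "any"}, "any")}
loss = generalization_loss(records, [[0, 1], [2, 3]], ["age"], hierarchies)
print(round(loss, 4))
```

Under these assumptions the loss is 0 when every node forms its own cluster and 1 when all nodes collapse into a single cluster.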
The other loss is structure loss, which occurs when masking the graph $G$ based on the partition $PS = \{s_1, s_2, \ldots, s_m\}$. The structural information includes all inter-cluster and intra-cluster structural information. $SLoss(G, PS)$ is defined in [34]; it combines the intra-cluster structure loss $\sum_{j=1}^{m} intraSL(s_j)$ and the inter-cluster structure loss $\sum_{i=1}^{m} \sum_{j=i+1}^{m} interSL(s_i, s_j)$. The inter-cluster structure loss achieves its maximum value when $|E_{s_i,s_j}| = \frac{|s_i| \cdot |s_j|}{2}$. Normalizing by the maximum loss makes $SLoss(G, PS)$ a value in the interval [0, 1].
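The structure loss can be sketched as follows. The concrete formulas are assumptions, because they are not reproduced in this text: the sketch uses a form common in the clustering-based anonymization literature, in which a group with e actual edges out of t possible edges loses 2·e·(1 − e/t), and the total is normalized by N(N − 1)/4. Function names are illustrative.

```python
from itertools import combinations


def count_edges(adj, a, b=None):
    """Edges inside group `a`, or between groups `a` and `b` if `b` is given."""
    if b is None:
        return sum(adj[u][v] for u, v in combinations(a, 2))
    return sum(adj[u][v] for u in a for v in b)


def intra_sl(adj, s):
    """Assumed intra-cluster loss: 2 * e * (1 - e / C(|s|, 2))."""
    pairs = len(s) * (len(s) - 1) // 2
    e = count_edges(adj, s)
    return 2 * e * (1 - e / pairs) if pairs else 0.0


def inter_sl(adj, si, sj):
    """Assumed inter-cluster loss: 2 * e * (1 - e / (|si| * |sj|))."""
    e = count_edges(adj, si, sj)
    return 2 * e * (1 - e / (len(si) * len(sj)))


def structure_loss(adj, partition):
    """Sum of intra- and inter-cluster losses, normalized by N * (N - 1) / 4."""
    n = len(adj)
    total = sum(intra_sl(adj, s) for s in partition)
    total += sum(inter_sl(adj, a, b) for a, b in combinations(partition, 2))
    return total / (n * (n - 1) / 4)


# Path graph 0 - 1 - 2 - 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(structure_loss(adj, [[0, 1], [2, 3]]))
```

In this assumed form, the inter-cluster term is largest when exactly half of the possible edges between two clusters are present, which agrees with the maximum condition stated above.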
For an initial social network $G$, we can obtain a partition $PS = \{s_1, s_2, \ldots, s_m\}$ using the graph anonymous clustering algorithm. $\{SC_1, SC_2, \ldots, SC_m\}$ is the focus node set corresponding to the cluster set $\{s_1, s_2, \ldots, s_m\}$.
The masked social network is built on this node set, where $(|s_i|, |E_{s_i}|)$ is the intra-cluster generalization pair, $s_i \cap s_j = \emptyset$, $i, j = 1, \ldots, m$, and $i \neq j$. The anonymized graph is created using the generalization information of each cluster, the intra-cluster edge generalization within a cluster, and the inter-cluster edge generalization between any two clusters. All nodes from a cluster $s_i$ collapse into the generalized node $SC_i$ and are indistinguishable from each other. If the condition $|s_i| \geq k$ is met for every cluster, a k-anonymous social network can be constructed. When the social network is evolving, we first evaluate the change of structure in the published social network.
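The k-anonymous greedy clustering algorithm based on the entity attributes of the released social network follows the selection rule given above: the unallocated node with the maximum degree seeds a new cluster, and the unallocated nodes with the minimum distance to that cluster are added until it contains at least k nodes.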
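A minimal sketch of such a greedy procedure is given below. It is an illustration under stated assumptions, not the algorithm listing itself: the cost of adding a node is assumed to be `alpha` times a normalized attribute distance plus `1 - alpha` times the average hop-based distance to the cluster, a single numerical attribute (age) stands in for the full attribute set, and leftover nodes are assumed to join their closest cluster.

```python
from collections import deque


def hop_distances(adj, src):
    """BFS hop counts from src; unreachable nodes are absent from the result."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v, connected in enumerate(adj[u]):
            if connected and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def greedy_k_clusters(adj, ages, k, alpha):
    n = len(adj)
    if k > n:
        raise ValueError("k must not exceed the number of nodes")
    hops = [hop_distances(adj, i) for i in range(n)]
    age_range = (max(ages) - min(ages)) or 1

    def node_dist(i, j):
        e = hops[i].get(j)
        return 1.0 if e is None else e / (e + 1)  # assumed normalization

    def cost(i, cluster):
        structure = sum(node_dist(i, j) for j in cluster) / len(cluster)
        attribute = sum(abs(ages[i] - ages[j]) for j in cluster) / (len(cluster) * age_range)
        return alpha * attribute + (1 - alpha) * structure  # assumed weighting

    unallocated = set(range(n))
    clusters = []
    while len(unallocated) >= k:
        # Seed: unallocated node with the maximum degree.
        seed = max(sorted(unallocated), key=lambda i: sum(adj[i]))
        cluster = [seed]
        unallocated.remove(seed)
        # Grow: repeatedly add the cheapest unallocated node until size k.
        while len(cluster) < k:
            best = min(sorted(unallocated), key=lambda i: cost(i, cluster))
            cluster.append(best)
            unallocated.remove(best)
        clusters.append(cluster)
    for i in sorted(unallocated):  # fewer than k nodes remain
        min(clusters, key=lambda c: cost(i, c)).append(i)
    return clusters


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2 - 3; ages are invented.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
ages = [30, 32, 31, 60, 62, 61]
clusters = greedy_k_clusters(adj, ages, k=3, alpha=0.5)
print(clusters)
```

On this toy graph each triangle ends up as one cluster, so every cluster holds at least k = 3 nodes with similar ages and short mutual distances.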

Simulation experiments
Our method was tested on a social network associated with a medical dataset. Table 3 shows the basic medical records of 60 patients. Unique identifiers such as the driver license number or SSN have been removed. Some quasi-identifiers remain, such as age, gender, zip code, and marriage status. The relation network corresponding to the entities in Table 3 is shown in Fig. 2.
There are 60 nodes (entities) in this social network. When two entities have some relationship, we link them with one edge. We used our anonymous method to protect the privacy of patients. Attribute set of each node can be denoted as Attr = N ∪ C . The set of numerical attributes is defined as N = {Age} . The set of categorical attributes is defined as C = {Gender, Marriage, Smoke, Zipcode} . The hierarchical structures of the categorical attributes are shown in Fig. 3.
We tested the generalization losses and the structure losses during anonymity clustering for different values of the parameters k and a, i.e. k ∈ {3, 6, 9, 12, 15, 18, 21, 24, 27, 30} and a ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}. Figure 4 shows the generalization losses. Figure 5 shows the structure losses for the anonymous clusters. When k is fixed, the generalization loss tends to decrease as a increases. When a is fixed, the structure loss tends to increase as k increases. Tables 4 and 5 show the generalization losses and the structure losses, respectively, for the different values of k and a. The generalization loss approached optimal when k = 21 and a = 1, and the structure loss approached optimal when k = 3 and a = 0.2, 0.4, 0.6, 0.8. It can be seen that the value of k mostly affects the structure losses and the value of a mostly affects the generalization losses.
The final losses of the clusters, which include structure losses and generalization losses, are shown in Fig. 6. Figure 7 shows the clustering results based on loss estimation: Fig. 7a is the result of clustering when k = 15 and a = 0.8, and Fig. 7b is the result of clustering when k = 24 and a = 0.8. We can see that the clustering results depend on the value of k. The entities with similar attributes and the shortest distance in the network tend to be placed in the same cluster through anonymous clustering. This method helps to control the scope of information dissemination. Figure 8 shows a clustering procedure when k = 15 and a = 0.8, which corresponds to Fig. 7a. This visual display makes it easier to locate the center of each cluster and to distinguish the entities of each cluster.

Discussion
Medical research requires the collection of a large amount of medical data for experiments and analysis. However, medical data is highly sensitive, and patients' privacy needs to be protected. Leakage of sensitive information is becoming an increasingly serious problem due to increased information exchange in social networks. In order to protect the privacy of medical data to the greatest extent, this paper proposed a privacy protection method based on the social network structure and the key attributes of network entities. This method uses clustering to help control the exposure of sensitive information in the social network.
Although unique identifiers might have been removed from medical data, some quasi-identifiers, such as age, gender, zip code, and marriage status, which are often used in medical research, can still be queried to identify the patients to some extent. In this paper, we divided the key attributes into two categories, numerical attributes and categorical attributes. Categorical attributes are assigned to hierarchical structures, which are shown in Fig. 3. The distance between entities is also used in the clustering algorithm. This distance is associated not only with the hierarchical distance of entities in the structure, but also with the numerical distance between entity attributes. We used a structure loss and a generalization loss to evaluate the clustering algorithm, and the results are shown in Tables 4 and 5. In our experiments on a medical social data network with 60 entities, the minimum clustering loss is 0.302819, which is shown in Fig. 6. A cluster visualization (Fig. 8) displays the center of each cluster and the entities in each cluster.

Conclusion
In this paper, we studied the privacy protection of medical data in social networks. We used medical data sharing as an example to discuss the importance of the attributes in the privacy protection of health data. Nodes (entities) were clustered according to the features of their attribute values and the distance between nodes in the network. The entities with similar attributes and the shortest distance in the network tend to be placed in the same cluster through anonymous clustering. This method helps to control the scope of information dissemination. In the sub-networks controlled by clusters, the sensitive data will be published with low accuracy or will not be published. The method can be used for real-time analysis.
Since the anonymous clustering in the network usually results in loss, this paper also paid special attention to the estimation algorithm for loss. A K-anonymous method based on attributes and distance clustering was proposed to estimate the loss during clustering. It tries to release data with reasonable value, while controlling disclosure risk within a reasonable range. The aim is to find a balance between data availability and privacy protection. The experiments on a social network associated with a medical dataset demonstrated our clustering procedure and the clustering results, and the usefulness of our method to protect privacy by controlling information release.
Abbreviations SSN: Social Security Number; HIPAA: Health Insurance Portability and Accountability Act.