Deidentifying a public use microdata file from the Canadian national discharge abstract database
 Khaled El Emam^{1, 2},
 David Paton^{3},
 Fida Dankar^{1} and
 Gunes Koru^{4}
DOI: 10.1186/1472-6947-11-53
© Emam et al; licensee BioMed Central Ltd. 2011
Received: 26 December 2010
Accepted: 23 August 2011
Published: 23 August 2011
Abstract
Background
The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of reidentification for records in a PUMF, and to deidentify a national DAD PUMF consisting of 10% of records.
Methods
Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was deidentified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct reidentification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy.
Results
Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm results in less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression.
Conclusions
The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
Background
There are increasing pressures to make raw individual-level data more readily available for research and policy making purposes [1–5]. This should be pursued as there are many benefits to doing so [1–15]. National statistical agencies have responded to such demands by creating public use microdata files (PUMFs) [16]. For instance, PUMFs from census data and population surveys are often created [17–22], and recently the Centers for Medicare and Medicaid Services has started work on creating 5% PUMFs from claims databases (e.g., inpatient and outpatient claims) [23, 24].
Making data available as a PUMF entails disclosing individual-level data with minimal restrictions or conditions on access. The ideal PUMF provides as much detail as possible short of disclosing raw files where the patients are readily identifiable [15].
The focus of this paper is on the creation of a PUMF for the Canadian national discharge abstract database (DAD).
In the US, 48 states collect data on inpatients [25], and 26 states make their DADs available through the Agency for Healthcare Research and Quality (AHRQ) [26]. This data can be purchased for the purposes of research or other approved uses. All purchasers must also sign a data use agreement prohibiting the reidentification of individuals in the data set. The equivalent agency in Canada is the Canadian Institute for Health Information (CIHI), which collects DAD data from hospitals across the country [27] and makes it available through an application process, as well as performing its own analyses on the data. Data disclosed by CIHI are partially deidentified (they contain no direct identifiers) and are only provided when there is a data sharing agreement [28].
Discharge abstract data has been used for a diverse set of research and analysis purposes, including public safety and injury surveillance and prevention, public health, disease surveillance, public reporting for informed purchasing and comparative reports, quality assessment and performance improvement, and health services research [29].
CIHI was considering the creation of a public use microdata file (PUMF) for the national DAD. The availability of a DAD PUMF would make it easier for the research community to confirm some published results, provide broader feedback to CIHI to improve data quality, serve as a tool for training students and fellows, give researchers an easily accessible data set with which to prepare for analyses on the full DAD data set, and serve as a large data set for computer scientists and statisticians to evaluate analysis and data mining techniques. A PUMF would be readily available with no application process or waiting time, at no cost to the requestor, and it would not impose an ongoing burden on the organization in terms of processing requests for data. It would complement the existing method of providing data through the regular data application process.
Two primary concerns with the creation of a PUMF were to ensure that the probability of reidentification of the patients is acceptably low and that the disclosed PUMF had sufficient utility to end-users.
The purpose of our study was to create a prototype PUMF. The outcome of this analysis was intended to inform the decision on whether to proceed with an actual PUMF. A critical criterion for making that decision was whether the resulting PUMF still had utility for end-users.
The contributions of this study are: (a) an analysis of plausible reidentification attacks on a Canadian DAD PUMF, (b) a set of new reidentification metrics for evaluating these attacks, (c) a new set of strategies for maximizing data utility when deidentifying data, (d) a new efficient algorithm for the suppression of large data sets, and (e) the results of evaluating the probability of reidentification for, and deidentifying, a Canadian national DAD PUMF.
Definitions
Here we provide some basic definitions that we use throughout the paper, review related work on the deidentification of individual-level data (also known as microdata), and present metrics for evaluating information loss due to deidentification.
Categories of Variables
Table 1 Example data
IDENTIFYING VARIABLES  QUASI-IDENTIFIERS  SENSITIVE VARIABLES  OTHER VARIABLES

ID  Name  Telephone Number  Sex  Year of Birth  Lab Test  Lab Result  PayDelay 
1  John Smith  (412) 688-5468  Male  1959  Albumin, Serum  4.8  37
2  Alan Smith  (413) 822-5074  Male  1969  Creatine kinase  86  36
3  Alice Brown  (416) 886-5314  Female  1955  Alkaline Phosphatase  66  52
4  Hercules Green  (613) 763-5254  Male  1959  Bilirubin  Negative  36
5  Alicia Freds  (613) 586-6222  Female  1942  BUN/Creatinine Ratio  17  82
6  Gill Stringer  (954) 699-5423  Female  1975  Calcium, Serum  9.2  34
7  Marie Kirkpatrick  (416) 786-6212  Female  1966  Free Thyroxine Index  2.7  23
8  Leslie Hall  (905) 668-6581  Female  1987  Globulin, Total  3.5  9
9  Douglas Henry  (416) 423-5965  Male  1959  B-type natriuretic peptide  134.1  38
10  Fred Thompson  (416) 421-7719  Male  1967  Creatine kinase  80  21
11  Joe Doe  (705) 727-7808  Male  1968  Alanine aminotransferase  24  33
12  Lillian Barley  (416) 695-4669  Female  1955  Cancer antigen 125  86  28
13  Deitmar Plank  (416) 603-5526  Male  1967  Creatine kinase  327  37
14  Anderson Hoyt  (905) 388-2851  Male  1967  Creatine kinase  82  16
15  Alexandra Knight  (416) 539-4200  Female  1966  Creatinine  0.78  44
16  Helene Arnold  (519) 631-0587  Female  1955  Triglycerides  147  59
17  Almond Zipf  (519) 515-8500  Male  1967  Creatine kinase  73  20
18  Britney Goldman  (613) 737-7870  Female  1956  Monocytes  12  34
19  Lisa Marie  (902) 473-2383  Female  1956  HDL Cholesterol  68  141
20  William Cooper  (905) 763-6852  Male  1978  Neutrophils  83  21
21  Kathy Last  (705) 424-1266  Female  1966  Prothrombin Time  16.9  23
22  Deitmar Plank  (519) 831-2330  Male  1967  Creatine kinase  68  16
23  Anderson Hoyt  (705) 652-6215  Male  1971  White Blood Cell Count  13.0  151
24  Alexandra Knight  (416) 813-5873  Female  1954  Hemoglobin  14.8  34
25  Helene Arnold  (705) 663-1801  Female  1977  Lipase, Serum  37  27
26  Anderson Heft  (416) 813-6498  Male  1944  Cholesterol, Total  147  18
27  Almond Zipf  (617) 667-9540  Male  1965  Hematocrit  45.3  53
Directly identifying variables
One or more direct identifiers can be used to uniquely identify an individual, either by themselves or in combination with other readily available information. For example, there are more than 200 people named "John Smith" in Ontario; therefore, the name by itself would not be directly identifying, but in combination with the address it would be. A telephone number is not directly identifying by itself, but in combination with the readily available White Pages it becomes so. Other examples of directly identifying variables include email address, health insurance card number, credit card number, and social insurance number. These numbers are identifying because there exist public and/or private databases, to which an adversary can plausibly get access, where these numbers lead directly, and uniquely, to an identity. For example, Table 1 shows the names and telephone numbers of individuals; in that case the name and number would be considered identifying variables.
Indirectly identifying variables (quasi-identifiers)
The quasi-identifiers are the background knowledge variables about individuals in the DAD that an adversary can use, individually or in combination, to probabilistically reidentify a record. If an adversary does not have background knowledge of a variable then it cannot be a quasi-identifier. The manner in which an adversary can obtain such background knowledge will determine which attacks on a data set are plausible. For example, the background knowledge may be available because the adversary knows a particular target individual in the disclosed data set, an individual in the data set has a visible characteristic that is also described in the data set, or the background knowledge exists in a public or semi-public registry. Examples of quasi-identifiers include sex, date of birth or age, locations (such as postal codes, census geography, and information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, and visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality.
For example, Table 1 shows the patient sex and year of birth (from which an age can be derived) as quasi-identifiers.
Sensitive variables
These are the variables that are not really useful for determining an individual's identity but contain sensitive health information about the individuals. Examples of sensitive variables are laboratory test results and drug dosage information. In Table 1 the lab test that was ordered and the test results are the sensitive variables.
Other variables
Any variable in the data set which does not fall into one of the above categories falls into this 'catch all' category. For example, in Table 1 we see the variable PayDelay, which indicates how long (in days) it took the insurer to pay the provider. In general, this information is not considered sensitive and would be quite difficult for an adversary to use for reidentification attack purposes.
Sometimes a variable classified as a quasi-identifier in one context is classified as a sensitive variable in another context. This will depend on the assumptions made about the adversary's background knowledge, and these assumptions will be situation dependent.
Individuals can be reidentified because of the directly identifying variables and the quasi-identifiers. There are no directly identifying variables in the proposed DAD PUMF. Therefore, the critical question is which variables are quasi-identifiers. The specific quasi-identifiers in the DAD are discussed below.
Equivalence Classes
All the records that share the same values on a set of quasi-identifiers are called an equivalence class. For example, consider the quasi-identifiers in Table 1, sex and age. All the records in Table 1 about males born in 1967 are an equivalence class (these have an ID of 10, 13, 14, 17, and 22). Equivalence class sizes for a data concept (such as age) potentially change during deidentification. For example, there may be 5 records for males born in 1967, but when the precision of age is reduced to a five-year interval, there are 8 records for males born between 1965 and 1969 (these have an ID of 2, 10, 11, 13, 14, 17, 22, and 27). In general there is a trade-off between the level of detail provided for a data concept and the size of the corresponding equivalence classes, with more detail associated with smaller equivalence classes.
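The computation of equivalence class sizes can be sketched as follows (a minimal Python illustration, not code from this study; the records are a hypothetical subset of Table 1):

```python
from collections import Counter

def equivalence_classes(records, quasi_identifiers):
    """Group records by their values on the quasi-identifiers and
    return the size of each resulting equivalence class."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

# A hypothetical subset of Table 1: sex and year of birth.
records = [
    {"id": 10, "sex": "Male", "yob": 1967},
    {"id": 13, "sex": "Male", "yob": 1967},
    {"id": 14, "sex": "Male", "yob": 1967},
    {"id": 2,  "sex": "Male", "yob": 1969},
]
sizes = equivalence_classes(records, ["sex", "yob"])
# ("Male", 1967) forms an equivalence class of size 3 in this subset.
```

Generalizing a quasi-identifier (e.g., year of birth to a five-year interval) before grouping would merge these classes into larger ones, which is the trade-off described above.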
Types of Disclosure
There are two kinds of disclosure that are of concern: identity disclosure and attribute disclosure [32, 33]. The first is when an adversary can assign an identity to a record in the PUMF. For example, if the adversary were able to determine that record number 3 in Table 1 belongs to patient Alice Brown using only the quasi-identifiers, then this is identity disclosure. The second type of disclosure is when an adversary learns a sensitive attribute about a patient in the database with a sufficiently high probability without knowing which specific record belongs to that patient [32, 34]. For example, in Table 1 all males born in 1967 had a creatine kinase lab test. The adversary does not need to know which record belongs to Almond Zipf (record ID 17): since Almond is male and was born in 1967, the adversary will discover something new about him (that he had a test often given to individuals showing symptoms of a heart attack). This is attribute disclosure.
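The attribute disclosure in the example amounts to an equivalence class in which every record shares the same sensitive value. A minimal sketch of that check (the records are hypothetical and this is not the analysis performed in this study):

```python
from collections import defaultdict

def attribute_disclosure_classes(records, quasi_identifiers, sensitive):
    """Return the quasi-identifier value combinations whose records all
    share the same sensitive value (as in the creatine kinase example)."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return [key for key, vals in groups.items() if len(vals) == 1]

records = [
    {"sex": "Male",   "yob": 1967, "lab": "Creatine kinase"},
    {"sex": "Male",   "yob": 1967, "lab": "Creatine kinase"},
    {"sex": "Female", "yob": 1966, "lab": "Prothrombin Time"},
    {"sex": "Female", "yob": 1966, "lab": "Creatinine"},
]
at_risk = attribute_disclosure_classes(records, ["sex", "yob"], "lab")
# Only ("Male", 1967) qualifies: every male born in 1967 had the same test.
```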
Known reidentifications of personal information that have actually occurred are identity disclosures, for example: (a) reporters reidentified an individual's records from web search queries publicly posted by AOL [35–37], (b) students reidentified individuals in the Chicago homicide database by linking it with the social security death index [38], (c) at least one individual was believed to be reidentified by linking their movie ratings in a publicly disclosed Netflix file to another public movie ratings database [39], (d) the insurance claims records of the governor of Massachusetts were reidentified by linking a claims database sold by the state employees' insurer with the voter registration list [31], (e) an expert witness reidentified most of the records in a neuroblastoma registry [40, 41], (f) a national broadcaster matched the adverse drug event database with public obituaries to reidentify a 26 year old girl who died while taking a drug and did a documentary on the drug afterwards [42], (g) an individual in a prescriptions record database was reidentified by a neighbour [43], and (h) the Department of Health and Human Services in the US linked a large medical database with a commercial database and reidentified a number of individuals [44].
Therefore, in the current analysis we focused only on identity disclosure. This does not mean that attribute disclosure is not important to consider (i.e., absence of evidence does not mean evidence of absence); it is simply outside the scope of the current study.
Disclosure Control for Microdata
Statistical and computational disclosure control methods can be applied by a data custodian to protect against identity disclosure [45, 46]. Disclosure control methods are concerned with both microdata [30, 47–53] and tabular data [49, 54–58]. The term "microdata" means individuallevel data. Tabular data may contain frequencies or aggregate statistics (e.g., means and standard deviations) in the table's cells. Given that the DAD PUMF is a microdata disclosure, we are only concerned with methods for the deidentification of microdata.
Deidentification methods for microdata can be broadly categorized as perturbative and non-perturbative [59]. Perturbative methods distort the truthfulness of the records in a data set; however, perturbation can be performed while preserving the aggregate properties of the data. Many perturbation methods have been explored, and some of these are summarized below.
Random noise can be added to continuous data without changing the correlation structure of the original data [60–63]. Gaussian noise is frequently used for this purpose. However, there are circumstances where noise can be filtered out of the data to recover the original information [64]. Another approach for both continuous and categorical data is to apply perturbations by considering the underlying probability distribution [65]: first, the underlying distribution is discovered, then distorted series are produced and the original series are replaced with the distorted series. Microaggregation [66–71] is another alternative for perturbation, which creates small clusters of data and uses an aggregated value (e.g., average or median) instead of the actual values. Some guidance exists on the appropriate size of these clusters [69]. Resampling [72] is a random perturbation method which resamples individual values in the data set [59]. Another approach is lossy compression, which treats the continuous data as an image and applies a compression such as JPEG (Joint Photographic Experts Group 2001) [73, 74]. Multiple imputation utilizes techniques developed to deal with missing data [75]. For example, for each continuous variable, a randomized regression can be used to estimate a replacement value [76]. Vector camouflage is another perturbative method, which provides unlimited and correct interval answers to database queries without revealing original values [77]. The Post-Randomization Method (PRAM) is a probabilistic method used for categorical data [78]. A Markov matrix is used to change the scores assigned to categorical variables based on previously determined probabilities. PRAM can be thought of as a general approach comprising other techniques such as adding noise, suppressing data, and recoding data. Rank swapping is a perturbation method which swaps the values of a variable within a specified range of ranks. It can be applied to ordinal or numerical variables.
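As an illustration of one of these methods, microaggregation of a single variable can be sketched as follows (a simplified univariate version; the cited methods [66–71] are multivariate and handle a final undersized cluster more carefully):

```python
def microaggregate(values, cluster_size=3):
    """Replace each value with the mean of its cluster of `cluster_size`
    neighbouring values in sorted order (univariate microaggregation).
    The last cluster may be smaller if len(values) is not a multiple
    of cluster_size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for start in range(0, len(order), cluster_size):
        cluster = order[start:start + cluster_size]
        mean = sum(values[i] for i in cluster) / len(cluster)
        for i in cluster:
            out[i] = mean
    return out

ages = [23, 25, 24, 61, 60, 59]
# Each cluster of three similar ages is replaced by its average,
# so individual values are perturbed while cluster means are preserved.
print(microaggregate(ages, cluster_size=3))
```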
Rounding methods perturb by rounding values to a value in a determined rounding set [49].
The non-perturbative techniques preserve the truthfulness of the data. Subsampling techniques only publish a subset of the records in a data set [79]. This can be more useful for categorical data than for numerical data because, if an adversary performs a matching attack using an external data set, continuous values are more likely to be unique. Global recoding is a non-perturbative technique that applies to all records in a data set. Continuous variables can be grouped into discretized values representing bins; for categorical variables, the variable values can be grouped into larger categories. In certain cases, the top or bottom values of a variable can be grouped and published as one value, known as top or bottom recoding, respectively. The statistical disclosure control tool μ-ARGUS can apply global recoding [80]. Suppression can also be used, either to eliminate some values that cannot be published (e.g., stigmatized diseases) or to remove outlier values that increase the reidentification probability.
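Global recoding with top-coding can be sketched as follows (a hypothetical example assuming five-year age intervals top-coded at 90; not the recoding implemented in any particular disclosure control tool):

```python
def recode_age(age, interval=5, top=90):
    """Globally recode an age into fixed intervals, with all ages at or
    above `top` collapsed into a single top-coded category."""
    if age >= top:
        return f"{top}+"
    low = (age // interval) * interval
    return f"{low}-{low + interval - 1}"

print(recode_age(42))  # "40-44"
print(recode_age(93))  # "90+"
```

Because the same recoding rule is applied to every record, the result is a global recode rather than a local (record-specific) one.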
The k-Anonymity Criterion
A popular deidentification criterion is k-anonymity [81–86]. With k-anonymity, an original data set containing personal health information can be transformed to protect against identity disclosure. A k-anonymized data set has the property that each record is similar to at least k-1 other records on the quasi-identifiers. For example, if k = 5 and the quasi-identifiers are age and sex, then a k-anonymized data set has at least 5 records for each value combination of age and sex.
Various methods can be used to satisfy the k-anonymity criterion. For example, some authors have used microaggregation [52]. However, when dealing with health data, non-perturbative methods are favoured because they preserve the truthfulness of the data [87]. Furthermore, some of the most common implementations of k-anonymity use non-perturbative techniques such as global recoding and suppression [81–86]. The data analysts we consulted were more comfortable with global recoding and suppression because their impact on data analysis was clearer to them. This is an important factor because we wanted to ensure the acceptability of the PUMF among data analysts. For this reason, our study uses global recoding, a form of generalization, and suppression.
After the quasi-identifiers are generalized, the equivalence class sizes are computed and suppression is applied to any equivalence class that is smaller than k. Generalization by itself will therefore not achieve k-anonymity, and suppression is applied only to the small equivalence classes.
Measuring Information Loss
The other side of the deidentification coin is data utility: a trade-off exists between privacy and utility. If an optimal trade-off can be found, patient privacy can be preserved while data users are satisfied with the data utility. Data utility can be measured by information loss metrics. Typically, lower levels of information loss are associated with higher data utility, and vice versa.
The extent of suppression performed on the data is an important indicator of information loss. Although the extent of suppression has known disadvantages as an information loss metric [87], it provides an intuitive way for an expert data analyst to gauge data quality. The more records and/or individual values suppressed in a data set, the greater the potential biases introduced in an analysis of the data.
A reasonable quantitative assessment of information loss could be based on comparing the analysis results obtained from the original and disclosed (deidentified) data [88]. However, this is difficult to achieve because the potential uses of data can vary and it is difficult to predict all of them in advance. In the case of the DAD PUMF it is not possible to know with precision a priori all the ways that data recipients can analyze that data, which can include statistical as well as machine learning methods. In fact, one purpose of creating a PUMF is to encourage the development of novel data modeling and data mining techniques.
Despite the lack of universally acceptable information loss criteria or metrics, it has been argued that there is little information loss if a data set is valid and interesting [89]. A deidentified data set is considered valid if it preserves (i) means and covariances in a small subset of records, (ii) marginal values in a few tabulations of the data, and (iii) at least one distributional characteristic. A data set is called interesting if six variables on important subsets of records can be validly analyzed. While a useful starting point, this definition can only be meaningfully operationalized if there is some knowledge of the analysis that will be performed on the deidentified data. Another suggested approach is to examine the function that maps original records to the protected records [88]. As this function gets closer to the identity function, the information loss decreases, and vice versa.
Information loss metrics for continuous data include comparing the original and deidentified data sets on the mean square error, mean absolute error, and mean variation [59, 90]. Such metrics cannot be easily computed for categorical variables; therefore, three methods are suggested [59, 88]: (i) direct comparisons based on a distance definition using category ranges, (ii) comparison of contingency tables, and (iii) entropybased measures.
An entropy [91, 92] metric was used in a number of studies to measure information loss [49, 93, 94] where suppression, global recoding, and PRAM were used. Recently, the entropy metrics described in [49, 93] were extended to deal with non-uniform distributions; the resulting measure has been called non-uniform entropy [95] and has been used specifically in k-anonymity algorithms [96].
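As a simple illustration of an entropy-based measure, the reduction in Shannon entropy caused by recoding a single variable can be computed as follows (a sketch only; the non-uniform entropy measure of [95] is more involved):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a categorical variable's
    empirical distribution."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

original = ["1965", "1966", "1967", "1967", "1968", "1969"]
recoded = ["1965-1969"] * 6
loss = entropy(original) - entropy(recoded)
# Collapsing the distinct years into one interval leaves a constant
# variable with zero entropy, so the loss equals entropy(original).
```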
Samarati used height as an information loss metric [81]. Height indicates the generalization level in a quasiidentifier generalization hierarchy. Greater height means more information loss. Height is considered a weaker metric compared to nonuniform entropy because it does not take into account the information loss contributed by individual variables [87].
Another information loss metric based on the generalization hierarchy is Precision, or Prec [83, 97]. For every variable, the ratio of the number of generalization steps applied to the total number of possible generalization steps (the total height of the generalization hierarchy) gives the amount of information loss for that particular variable. The overall Prec information loss is the average of the Prec values across all quasi-identifiers in the data set.
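The Prec computation can be sketched as follows (the step counts and hierarchy heights are hypothetical):

```python
def prec_information_loss(steps_applied, hierarchy_heights):
    """Prec: for each quasi-identifier, the fraction of its generalization
    hierarchy's height actually applied; overall loss is the average."""
    ratios = [s / h for s, h in zip(steps_applied, hierarchy_heights)]
    return sum(ratios) / len(ratios)

# Hypothetical example: age generalized 1 of 3 possible steps,
# diagnosis code 2 of 4, province 0 of 2.
loss = prec_information_loss([1, 2, 0], [3, 4, 2])
print(loss)  # (1/3 + 1/2 + 0) / 3 ≈ 0.278
```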
A frequently used information loss metric in the computational disclosure control literature is the discernability metric (DM) [98–105], which assigns to every record a penalty proportional to the number of records that are indistinguishable from it.
The minimal distortion (MD) metric measures the dissimilarities between the original and deidentified records [30, 31, 106]. This charges a unit of penalty to each generalized or suppressed instance of a value. While both DM and MD can assess the level of distortion, DM has an advantage over MD in the sense that DM can differentiate how much indistinguishability increased by going from the original to deidentified data set [107].
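The DM computation can be sketched as follows (a minimal version; some formulations of DM also charge an additional penalty, proportional to the data set size, for suppressed records):

```python
from collections import Counter

def discernability_metric(records, quasi_identifiers):
    """DM: each record is penalized by the size of its equivalence
    class, so DM equals the sum of the squared class sizes."""
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return sum(s * s for s in sizes.values())

records = [
    {"sex": "Male", "age": "40-44"}, {"sex": "Male", "age": "40-44"},
    {"sex": "Female", "age": "40-44"},
]
print(discernability_metric(records, ["sex", "age"]))  # 2*2 + 1*1 = 5
```

Larger equivalence classes (more indistinguishability) therefore increase DM, which is how it differentiates levels of generalization.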
Information loss caused by generalizations can also be measured using the ILoss metric [108]. This metric captures the fraction of domain values generalized to a certain value [45]. ILoss for a record is calculated by summing the ILoss values over all variable values; different weights can be applied to different variables when computing this sum. Similarly, the overall ILoss for a data set is obtained by adding up the ILoss values of its records.
Iyengar [109] used a classification metric, CM, which assigns a penalty to a record if suppressions or generalizations assign the record to a different majority class. This metric is applied to the training set and requires a classification method to be used. The associated problem is that the exact classification approach needed may not be known at the time of data publishing. Fung et al. [110, 111] used a metric called IGPL to measure the trade-off between information gain (IG) and privacy loss (PL). IGPL is obtained by dividing IG by PL incremented by 1. The formulas for IG and PL can be found in [45].
Methods
In this section we explain how the Canadian DAD PUMF was deidentified and describe the evaluations that were performed on it.
PUMF Specifications
Our analysis used the 2008-2009 DAD. Only acute care inpatient cases were included. The DAD excluded hospitalizations in Quebec. Cases with diagnoses or interventions that indicated abortion or HIV were excluded. These were removed because they represent highly stigmatized conditions where reidentification would cause significant harm to the affected individuals; therefore, the only acceptable reidentification probability for them was zero. The resulting file had 2,375,331 records. In this section we provide the parameters for the PUMF.
Quasi-identifiers
Table 2 The quasi-identifiers
Quasi-identifiers  Coding  # Categories

PROV_XXX  Province/region. Quebec data is not included in the DAD.  
PROV_ALL  The territories are grouped into one category + 9 provinces  10 
PROV_REGION  The country is divided into three regions (West, Central, and East), where Central consists of Ontario.  4 
TOTAL_LOS_XXX  Total Length of stay  
TOTAL_LOS_DAYS  Days up to 1 week, then in weeks up to 6 months, and top-coded at 6 months+  31
TOTAL_LOS_WEEKS  Weeks up to 6 months; everything longer than that is top-coded into a single category  25
AGE_GROUP  Five-year intervals and top-coded at 90 years  20
GENDER_CODE  unchanged  5^{1} 
MRDx DIAG3 DIAG_BLOCK DIAG_CHAPTER  Different levels of coding detail of the most responsible diagnosis code.  8967 1435 195 23 
CMG_CODE  These identify Case Mix Groups (CMGs), which are groups of patients with similar clinical and cost characteristics. They are based on most responsible diagnosis (MRDx) and other diagnosis and intervention information.  545 
CCI_CODE SHORT_CCI  Different levels of coding detail of the principal intervention. Approximately 46% of the records had no interventions.  8780 569
The generalization hierarchies defined in Table 2 were created in consultation with experts at the data custodian who regularly perform analyses on discharge abstract data. These experts considered the utility of the data from the perspective of multiple data users. For example, a PUMF that is useful for educational and training purposes in a college or university course may not be useful for a complex health services research study. Examples of the types of end-uses considered were: education and training of students and fellows, preparation of analysis plans before making data access requests for the full DAD, and evaluation data sets for computer scientists and statisticians when developing new algorithms and modeling techniques.
The generalizations in Table 2 represent meaningful generalizations of the quasiidentifiers that would still retain some utility for analysis. Any further generalizations were deemed not to be acceptable from the perspective of the data users. Intervals for the continuous variables were chosen to be analytically meaningful.
The generalization hierarchies also represent policy decisions made by the organization around the granularity of information that they are willing to disclose. For example, the disclosure of postal codes or patient location information more specific than province was not acceptable to CIHI.
Focus of PUMF
Table 3 The two PUMF data sets that would be created
ID  PROV_XXX  TOTAL_LOS_XXX  AGE_GROUP  GENDER_CODE  MRDx  CCI_CODE 

1: Geographic PUMF  X  X  X  X  X  
2: Clinical PUMF  X  X  X 
Characteristics of Adversaries
It is important to understand the characteristics of the potential adversaries who may attempt to reidentify individuals in the PUMF. This will help us determine the plausibility of different attacks and direct us in choosing appropriate metrics for measuring the probability of reidentification. When making data available as a PUMF, one needs to make worst-case assumptions about the adversary because it is difficult to predict who would make a reidentification attempt. Consequently, we assumed that the adversary has unlimited financial resources and time, and is by definition not bound by any data sharing agreements that would constrain his/her behaviour.
Measuring the Probability of Reidentification
The primary metric we used to evaluate how likely it is for an adversary to reidentify one or more individuals in the PUMF is the probability of correct reidentification. This is measured by 1/k within the context of the k-anonymity criterion. We do not consider the probability of incorrect reidentification in our evaluations. This is common practice, thus far, in the disclosure control community.
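Under this model, the maximum probability of correct reidentification for a data set is the reciprocal of the smallest equivalence class size, which can be computed as follows (a hypothetical illustration):

```python
from collections import Counter

def max_reid_probability(records, quasi_identifiers):
    """Maximum probability of correct reidentification under the
    k-anonymity model: 1 over the smallest equivalence class size."""
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1 / min(sizes.values())

# Hypothetical data: two equivalence classes of sizes 20 and 25.
records = ([{"sex": "Male", "age": "40-44"}] * 20
           + [{"sex": "Female", "age": "85-89"}] * 25)
print(max_reid_probability(records, ["sex", "age"]))  # 1/20 = 0.05
```

In this example the smallest class has 20 records, so the maximum probability is 0.05, which is exactly at the upper end of the 0.04-0.05 threshold range used in this study.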
If the adversary matches, say, 1,000 records to a population registry and is able to correctly reidentify only 50 records, a data custodian may consider that to be acceptable. But there will be 950 records that were incorrectly reidentified. Would this be acceptable?
If the adversary will be sending email spam to these 1,000 reidentified patients then one can argue that there is no substantive injury to the 950 patients. If the adversary was going to try to fraudulently use the health insurance policies of the 1,000 patients or open financial accounts in their names, there will still be no injury to the 950 patients (because the reidentified identities will not match the actual identity information during the verification process, and therefore these attempts will very likely fail). Under such conditions we would not consider the incorrect reidentifications as a problem.
On the other hand, if a reporter was going to expose a person in the media based on incorrect information then this could cause reputational harm to the patient. Or if employment, insurance, or financial decisions were going to be made based on the incorrectly reidentified data then there could be social, psychological, or economic harm to the individuals.
If we make the probability of incorrect reidentification sufficiently high for the adversary, then we argue that this would act as a deterrent to launching a reidentification attack. This argument is based on two assumptions: (a) an incorrect reidentification is detrimental to the adversary's objectives, and (b) information about the probability of reidentification of individuals in the PUMF is made public so that it is known to the adversary to influence the adversary's behaviour.
Consequently in this paper we will only talk about the probability of reidentification to mean the probability of correct reidentification.
Probability of Reidentification Threshold
We need to define the probability of reidentification threshold. This represents the maximum probability that the custodian is willing to accept when disclosing the PUMF [112]. Some broad guidelines have been developed specifically for the creation of PUMFs within a Canadian context [113]; however, these did not recommend any specific thresholds.
There is considerable precedent for a probability threshold of 0.2, which is often expressed in terms of a "cell size of five" rule (or k = 5) [22, 114–122]. However, this threshold is more commonly used when data is disclosed to a trusted entity, such as a researcher, rather than for disclosing data to the public. It is also used when the data recipient will sign a data use/sharing agreement with the custodian. In some cases a threshold as high as 0.33 is used (a cell size of three) [123–126], although in practice this would be under quite restrictive conditions.
In the US, the Safe Harbor standard in the HIPAA Privacy Rule is sometimes used as a basis for the creation of PUMFs since such data is no longer considered protected health information. It has been estimated that Safe Harbor implies that 0.04% of the population is unique (has k = 1) [127, 128]. Another empirical reidentification attack study evaluated the proportion of Safe Harbor records that can be reidentified and found that only 0.01% can be correctly reidentified with certainty [44]. However, it can be argued that using a k value that is so low would not be acceptable for a PUMF, especially with the adversary profile that we described earlier.
In the current analysis we considered two probability thresholds, 0.04 (i.e., k = 25) and 0.05 (i.e., k = 20). The data custodian deemed these thresholds to be acceptable to the organization.
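The two thresholds map directly onto minimum equivalence class sizes. A minimal sketch of this conversion (the function name is ours):

```python
import math

def min_class_size(tau: float) -> int:
    """Smallest equivalence class size k such that the probability of
    correct reidentification, 1/k, does not exceed the threshold tau."""
    return math.ceil(1 / tau)

print(min_class_size(0.04))  # 25
print(min_class_size(0.05))  # 20
```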
Possible Reidentification Attacks
In order to reduce the reidentification probability of the data set properly we had to understand the potential attacks that an adversary may attempt. Assuming that the adversary does not verify matches, there are three possible attacks which will be described below.
An important assumption is that the adversary would not know which individuals are in the PUMF. This is reasonable since the PUMF would be a random sample from the population DAD. Therefore, even if an adversary knew that Alice was hospitalized in 2008, it would not be possible to know if she was selected into the PUMF. If the adversary knew who was in the PUMF then the probability of reidentifying a single individual would be 1/f_{j}, which will typically be quite high for, say, a 10% sampling fraction (since E(f_{j}) = 0.1 × F_{j}). However, given that the adversary would not know who is in the PUMF, the actual probability of reidentification would be smaller than 1/f_{j}.
Attack 1: Adversary Matches a Single Patient to a Registry
In this attack, the adversary has access to a public population registry containing identifying information and can use that to match against the PUMF. For an equivalence class that exists in both the PUMF and the population registry, the adversary then selects a record at random from either data set and matches it with a record in the other data set.
Previous research has suggested that it is relatively easy to construct population databases useful for reidentification about certain subpopulations in Canada because of the information available about them in public and semipublic registries [129, 130]. As illustrated in Additional file 2, it is possible to create population registries for these specific subpopulations: professionals such as doctors and lawyers because their professional associations publish complete membership lists, homeowners because the Land Registry publicly maintains their names, and civil servants because the government publishes lists of government employees. These subpopulations are considered the at-risk members of the Canadian population from a reidentification perspective.
Reidentification under this attack could only occur if the population registry overlaps with the PUMF (i.e., there are individuals in both data sets). This is different from the commonly cited US examples where voter registration lists are available in many states [131]. In the US case a disclosed data set is often represented as a sample from the voter registration list and consequently the probability of reidentification would be measured differently.
Assuming there is an overlap between the individuals in the PUMF and the individuals in the population registry, then the probability of reidentification for any individual in equivalence class j in the PUMF is derived in Additional file 3 to be 1/C_{ j } where C_{ j } is the size of the equivalence class in the provincial population (measured through the census, for example). We consider provincial populations because that is the smallest geographic granularity in the PUMF. In Additional file 3 we also demonstrate the accuracy of this derivation using a series of matching experiment simulations (for example, see [132, 133]).
The proportion of a province's population that is at risk under this attack can then be computed as:

$$\frac{1}{N_p} \sum_{j} C_j \times I\left(\frac{1}{C_j} > \tau\right) \qquad (1)$$

where N_{p} is the population of the province, I( ) is the indicator function returning one if the logical parameter is true and zero otherwise, and τ is some maximum acceptable probability threshold (as noted earlier, this would be 0.04 or 0.05). The proportion in equation (1) was computed for every province and territory separately.
To evaluate equation (1), we used the data from the 2001 census PUMF from Statistics Canada. Four population subsets in the census matched our three at-risk groups: homeowners, federal government employees, healthcare professionals, and business professionals. For each subset within each province we computed the number of (weighted) individuals falling in age and sex equivalence classes smaller than the threshold. This amounted to 11 provinces and territories (all territories were grouped into a single category), four data subsets, and two thresholds, which was equal to 88 different analyses. In none of these analyses was the proportion of a province's population with a reidentification probability higher than 0.04 greater than 0.001.
Our results indicate that the proportion of the provincial populations with a high probability of reidentification on the quasiidentifiers that can be used for matching with public registries is small, and therefore the chances of a successful reidentification using this attack are negligible.
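As a sketch of how equation (1) can be evaluated from census equivalence class counts, assuming we hold the class sizes C_j for one at-risk group in one province (the counts below are invented for illustration):

```python
def at_risk_proportion(class_sizes, tau):
    """Proportion of a provincial population falling in equivalence
    classes C_j whose reidentification probability 1/C_j exceeds tau,
    following equation (1); class_sizes holds the (weighted) census
    counts for one at-risk group in one province."""
    n_p = sum(class_sizes)  # provincial population N_p
    exposed = sum(c for c in class_sizes if 1 / c > tau)
    return exposed / n_p

# Invented counts: two rare age-sex classes among many large ones.
sizes = [10, 15] + [5000] * 40
print(at_risk_proportion(sizes, 0.04))
```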
Attack 2: Adversary Matches All Patients Against a Registry
In this attack the adversary also obtains a population registry as in Figure 2, however he then matches all individuals in the PUMF against the individuals in the registry. This is different from the attack above which focuses on the probability of reidentification for a single randomly selected individual.
Also, note that to launch this attack an adversary needs to obtain or create a population registry large enough to overlap with the individuals in the DAD PUMF. Constructing a sufficiently detailed registry of this size can be quite expensive in practice, which creates a potential economic deterrent. For instance, to create a complete population registry with records containing names, home addresses (including postal codes), gender, and date of birth for the 23,506 registered practicing physicians in Ontario (a professional group) at the time of the study would cost at least C$188,048. Similarly, for the 18,728 registered lawyers the cost is estimated to be at least C$149,824 [129]. Nevertheless, under our assumptions an unconstrained adversary could still spend these amounts on a reidentification attempt.
With reference to Figure 2, let the adversary have the PUMF (data set D) and a registry (data set I). Let D and I be two sets representing simple random samples of individuals from a population (say, represented by the census). Note that I is not necessarily a subset of D or vice versa, but the two must have individuals in common for any matching to be successful: D ∩ I ≠ ∅. Assume that the set of population equivalence classes is denoted by Q, and j ∈ Q.
As noted above, population registries in Canada only include age, gender, and location information. Therefore, the quasiidentifiers of relevance for this kind of matching are: GENDER_CODE, AGE_GROUP, and PROV_ALL.
In our data set we can compute the maximum value of |Q| as: 2 × 20 × 10 = 400. We only consider two values for GENDER_CODE because population registers do not account for the other three gender categories in the DAD. We can also compute α. According to Statistics Canada, the population of Canada in 2008 (excluding Quebec) was 25,573,800. Therefore, a 10% sample for the PUMF would mean that α = 0.009, and approximately 4 individuals in the PUMF would be expected to be correctly reidentified according to equation (3). Furthermore, the adversary would not know which 4 individuals out of the 0.24 million records in the file were correctly reidentified.
Given the low number of individuals that can be correctly reidentified and the adversary not knowing which individuals were correctly reidentified, we consider the chances that an adversary would try this attack, and succeed if attempted, to be negligible.
Attack 3: Adversary Targets a Specific Patient
Under this attack the adversary has background information about a specific individual, say a 35-year-old male who is the adversary's neighbour. The adversary does not know if that target individual is in the PUMF. If the adversary does find an equivalence class j that matches the targeted individual, the adversary would then choose one of the f_{j} records in that equivalence class at random as a match. The probability that the match is a correct one is 1/F_{j} [134]. Therefore, to maintain the probability threshold at or below 0.04, we must ensure that F_{j} ≥ 25 for all j through generalization and suppression, and to maintain a probability of 0.05 we must ensure that F_{j} ≥ 20 for all j.
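The F_{j} ≥ k condition can be checked with a simple group-by over the quasiidentifiers. A minimal sketch, with records as dictionaries and invented field values:

```python
from collections import Counter

def small_classes(records, quasi_identifiers, k):
    """Equivalence classes whose size F_j is below k; these would let a
    targeted match succeed with probability above 1/k and must be
    handled by generalization or suppression."""
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {ec: f for ec, f in sizes.items() if f < k}

# Invented records on the Attack-3 quasi-identifiers.
records = ([{"GENDER_CODE": "M", "AGE_GROUP": "35-39", "PROV_ALL": "ON"}] * 30
           + [{"GENDER_CODE": "F", "AGE_GROUP": "90+", "PROV_ALL": "PE"}] * 4)
print(small_classes(records, ["GENDER_CODE", "AGE_GROUP", "PROV_ALL"], 20))
# only the class of size 4 is flagged
```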
This attack is plausible and represents the one that we needed to focus on. Therefore, the remainder of the paper describes how we created versions of a DAD PUMF that protect against this attack through generalization and suppression. Since we have the population DAD, generalization and suppression were performed on the full population DAD and the PUMF samples drawn from that.
Strategies for Improving Data Utility
Level of Detail of Quasiidentifiers
Under attack number 3, our archetype adversary who would attempt to reidentify records in the PUMF is a neighbour. For any patient Alice, a neighbour would know the basic demographics of Alice (GENDER_CODE, AGE_GROUP, and PROV_ALL). It is also very plausible that Alice would tell her neighbour why she went to the hospital and what the primary interventions were. Therefore, it is plausible that this kind of information can be obtained by such an adversary.
Sometimes the adversary will know the full detail for a quasiidentifier that can be hierarchically generalized. Length of stay is such an example: a patient's neighbour can notice how many days the patient was in the hospital.
However, it is unlikely that an adversary would be able to determine the full ICD10 (International Statistical Classification of Diseases and Related Health Problems, 10^{th} Revision) [135] code for the most responsible diagnosis as these tend to be very specific and include some clinical details that Alice herself may not know with such specificity. Similarly, Alice will likely only provide a high level description of the intervention. The level of detail at which an adversary would know the quasiidentifiers, such as the diagnosis, can help us set limits on the amount of generalization that needs to be performed.
If we assume that the adversary is not likely to know the full ICD10 code for MRDx of the patient, then it need not be considered a quasiidentifier (at least for this adversary) except insofar as it could be used to derive a more general version of diagnosis that is known to the adversary and therefore useful as a quasiidentifier. For instance, if the adversary is likely to know the CMG_CODE for a patient then we need to apply deidentification at that level in the generalization hierarchy, and to all generalizations of CMG_CODE as well. In addition, it would be necessary to ensure that suppressed CMG_CODE values could not be derived from the MRDx codes; in general if a CMG_CODE value was suppressed, the MRDx values that generalize to the suppressed CMG_CODE value would have to also be suppressed. Similarly, it would be necessary to ensure that suppressed DIAG_BLOCK values could not be derived from the MRDx codes and therefore the MRDx values that generalize to a suppressed DIAG_BLOCK value would have to be suppressed as well. We will refer to this as a propagation of suppression from one quasiidentifier to another.
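The propagation rule can be sketched as follows, assuming suppressed cells are marked with None and records are dictionaries (a simplification for illustration, not the actual implementation; the diagnosis codes are invented):

```python
def propagate_suppression(rows, general, detailed):
    """If the more general code (e.g., CMG_CODE) is suppressed in a
    record, suppress the detailed code (e.g., MRDx) as well; otherwise
    the suppressed value could be re-derived through the hierarchy."""
    for r in rows:
        if r[general] is None:
            r[detailed] = None
    return rows

rows = [{"MRDx": "J18.9", "CMG_CODE": None},   # CMG_CODE suppressed
        {"MRDx": "I21.4", "CMG_CODE": "202"}]  # hypothetical codes
propagate_suppression(rows, "CMG_CODE", "MRDx")
print(rows[0]["MRDx"])
```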
Rules to use for constraining the search for the deidentification solution
| Relationship between maximum acceptable generalization (M) and adversary's background knowledge (Q) | Generalization level to use for suppression (S) | Generalization level to release data at |
| --- | --- | --- |
| M ≥ Q (analysts could accept a generalization equal to or less detailed than the adversary's knowledge) | S = M (suppress at level M) | Only include data generalized to level S = M |
| M < Q (analysts need a version that is more detailed than the adversary's knowledge) | S = Q | The PUMF can have data at level M in the generalization hierarchy, except when it generalizes to a suppressed value at level S = Q |
With this approach it may be possible to disclose more detailed information and minimize suppression. However, the assumptions about the level of detail of the adversary's knowledge, and so which levels in the hierarchy need to be considered as quasiidentifiers, must be defensible.
The general principle described here can be applied in the case of simple domain generalization hierarchies. For example, the generalization hierarchy in panel (b) of Figure 3 shows a diagnosis code hierarchy whereby each MRDx value generalizes to a single DIAG3 value, where DIAG3 consists of the first three characters of the diagnosis code. Previous work has shown that this more common generalization hierarchy for diagnosis codes (as depicted in panel b) can result in excessive suppression [136]. Therefore in the current study we only consider the generalization hierarchy in panel (a) of Figure 3.
Disclosing All Levels of the Hierarchy
To maximize utility to the user of the data, we can include all levels of the quasiidentifier hierarchy in the PUMF. In this manner the data user can select the level of precision most suitable for their analysis and there is no need to find a single optimal generalization for all types of users.
The four sets of evaluations that would be performed at each level of the most responsible diagnosis code
| Combination ID | PROV_XXX | AGE_GROUP | GENDER_CODE | MRDx | CMG_CODE | DIAG_BLOCK | DIAG_CHAPTER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1a | X | X | X | X | | | |
| 1b | X | X | X | | X | | |
| 1c | X | X | X | | | X | |
| 1d | X | X | X | | | | X |
Using the scheme above, any combination of demographics and a diagnosis code will have a probability of reidentification below the threshold. Therefore, we protect against all possible levels of background knowledge that the adversary may have. However, we make available the very detailed diagnosis code information as well.
A heuristic algorithm for suppressing combinations is described below. This algorithm iterates through all of the combinations and applies a heuristic suppression method on each. In our example it would iterate through <PROV_ALL, AGE_GROUP, GENDER_CODE, MRDx> and <PROV_ALL, AGE_GROUP, GENDER_CODE, CMG_CODE> in turn and apply suppressions.
Example showing the eight combinations when we have two hierarchical quasiidentifiers where we disclose all of their levels
| Combination ID | PROV_XXX | AGE_GROUP | GENDER_CODE | MRDx | CMG_CODE | DIAG_BLOCK | DIAG_CHAPTER | CCI_CODE | SHORT_CCI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2a | X | X | X | X | | | | X | |
| 2b | X | X | X | | X | | | X | |
| 2c | X | X | X | | | X | | X | |
| 2d | X | X | X | | | | X | X | |
| 2e | X | X | X | X | | | | | X |
| 2f | X | X | X | | X | | | | X |
| 2g | X | X | X | | | X | | | X |
| 2h | X | X | X | | | | X | | X |
For hierarchical quasiidentifiers where the hierarchy is totally ordered, further combinations can be skipped. For example in Table 5, the suppressions in DIAG_BLOCK performed on combination 1c can just be propagated to DIAG_CHAPTER without having to evaluate combination 1d.
Furthermore, this approach can be combined with the earlier one (having all levels of detail for a quasiidentifier) to minimize the amount of suppression. For example, if we assume that the adversary would only have background knowledge at the CMG_CODE level, then we would skip the suppressions in 1a, and apply the same suppressions on CMG_CODE from cycle 1b to the MRDx variable. This would reduce the amount of suppression in MRDx significantly.
Example showing the two combinations when we have two hierarchical quasiidentifiers where we disclose all of their levels but assume that the adversary would only have background knowledge at the CMG_CODE and SHORT_CCI levels, and where one of the hierarchies is totally ordered
| Combination ID | PROV_XXX | AGE_GROUP | GENDER_CODE | MRDx | CMG_CODE | DIAG_BLOCK | DIAG_CHAPTER | CCI_CODE | SHORT_CCI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2f | X | X | X | | X | | | | X |
| 2g | X | X | X | | | X | | | X |
Suppression Algorithm for Combinations
Current Approaches
We needed an algorithm that was suitable for suppressing combinations. There are three general approaches to suppression: casewise deletion, quasiidentifier removal, and local cell suppression [137].
Casewise deletion removes the whole record from the data set. This results in the most distortion to the data because the sensitive variables are also removed even though those do not contribute to an increase in the probability of identity disclosure. Casewise deletion would not apply to combinations of quasiidentifiers as it would remove all of the record.
Quasiidentifier removal only removes the values on the quasiidentifiers in the data set. This has the advantage that all of the sensitive information is retained. However, it would not apply to combinations because all of the quasiidentifiers would be removed.
Local cell suppression is an improvement over quasiidentifier removal in that fewer values are suppressed. Local cell suppression applies an optimization algorithm to find the least number of values ("cells") on the quasiidentifiers to suppress. All of the sensitive variables are retained, and in practice considerably fewer of the quasiidentifier values are suppressed compared to casewise deletion and quasiidentifier removal. We therefore consider algorithms for local cell suppression that will ensure k-anonymity.
One existing suppression algorithm has a 3k(1 + log 2k) approximation to k-anonymity with a O(n^{2k}) running time [138], where n is the number of records in the data set. Another had a 6k(1 + log m) approximation and O(mn^{2} + n^{3}) running time [138], where m is the number of variables. An improved algorithm with a 3(k − 3) approximation has been proposed [139] with polynomial running time. Because of its good approximation, the latter algorithm has been used in the past on health data sets [87, 140].
However, all of the above suppression algorithms require the computation of a pairwise distance among all pairs of records in a data set. This means n(n − 1)/2 distances need to be computed because they require the construction of a complete graph with n vertices and the edges are weighted by a distance metric (the number of cells that differ between the two nodes). Such algorithms can realistically only be used on relatively small data sets.
Furthermore, these algorithms do not account for combinations and can only be used on all of the quasiidentifiers in the data set at once. For example, if we apply suppression on all of the five quasiidentifiers in Table 6 we would have to suppress either all of the values in MRDx or all of the values in CMG_CODE. By considering the combinations separately, less suppression would be necessary. Therefore, we need to use a suppression algorithm which would determine the suppressions required for each combination of quasiidentifiers separately. Below we describe such an algorithm.
Combinations Suppression Algorithm
In this section we describe a heuristic algorithm for local cell suppression when we have overlapping combinations of quasiidentifiers. We assume that the data set has n records and i∈{1,...,n}. Let B be the set of quasiidentifier combinations, and b ∈ B. Each combination has a fixed number of quasiidentifiers denoted by the set Z _{b}, and let z_{b} ∈ Z_{ b } (which means that z_{ b } is a quasiidentifier in that particular combination). A single cell in the data set can be indexed by V[i, z_{ b } ] which indicates the record, the specific combination and quasiidentifiers in the combination.
The possible values for each quasiidentifier are represented by the set L[z_{b}]. For example, if the quasiidentifier is AGE_GROUP then it would have 20 possible 5-year age intervals. Let l[z_{b}] ∈ L[z_{b}] be one of the quasiidentifier values. The support of l[z_{b}] is the number of records that have that particular value: sup(l[z_{b}]).
Individual quasiidentifiers may be assigned a weight. Weights have values between zero and one. A quasiidentifier with a higher weight would be affected less during suppression because it would be considered a more important variable. A weight is denoted by w[z_{ b } ]. We then define the weighted support function as: wsup(l[z_{ b } ]) = sup(l[z_{ b } ]) × w[z_{ b } ].
Let E(l[z_{b}]) be a function which returns the set of all equivalence classes that have a value of l[z_{b}]. If x ∈ E(l[z_{b}]) then |x| is the size of the equivalence class.
The data custodian has set a risk threshold, τ, and k = ⌈1/τ⌉ represents the minimum equivalence class size needed to ensure that the risk is at or below τ. Let M_{b} be the number of equivalence classes smaller than k for a particular combination. These equivalence classes are called moles.
The algorithm has two general phases. In the first phase it removes all values in the quasiidentifiers that have infrequent support (i.e., support less than k). The reasoning is that these values by definition will always be problematic and therefore have to be removed. In the second phase the algorithm iterates through each quasiidentifier to remove the values with minimal support.
We then identify the combinations with moles (step 2.1). The quasiidentifier combinations with moles will be ordered with combinations having the largest number of moles first. The algorithm then iterates through these combinations with moles in order (step 2.3). Within each combination we find the value on any of the quasiidentifiers that has the smallest weighted support (step 2.5). This value is then suppressed from all equivalence classes that are smaller than k (steps 2.6 and 2.7). This is repeated until there are no more moles in the combination, and then we proceed to the next combination.
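The two phases can be sketched as follows. This is a simplified reading of the algorithm, not the production implementation: records are dictionaries, suppression is marked with None, a record with an already-suppressed cell in a combination is treated as resolved, and the toy data is invented.

```python
from collections import Counter

SUPPRESSED = None  # marker for a suppressed cell

def phase1_rare_values(rows, quasi_identifiers, k):
    """Phase 1: suppress every value whose overall support is below k;
    such values can never sit in an equivalence class of size >= k."""
    for qi in quasi_identifiers:
        support = Counter(r[qi] for r in rows)
        for r in rows:
            if r[qi] is not SUPPRESSED and support[r[qi]] < k:
                r[qi] = SUPPRESSED

def moles(rows, combo, k):
    """Equivalence classes on this combination smaller than k (only
    records with no suppressed cell in the combination are counted)."""
    keys = [tuple(r[q] for q in combo) for r in rows
            if all(r[q] is not SUPPRESSED for q in combo)]
    return {ec for ec, n in Counter(keys).items() if n < k}

def suppress_combination(rows, combo, k, weights):
    """Phase 2 for one combination: repeatedly suppress the value with
    the smallest weighted support from every mole it appears in."""
    while True:
        small = moles(rows, combo, k)
        if not small:
            return
        candidates = []
        for qi in combo:
            w = weights.get(qi, 1.0)
            support = Counter(r[qi] for r in rows if r[qi] is not SUPPRESSED)
            for r in rows:
                if tuple(r[q] for q in combo) in small:
                    candidates.append((support[r[qi]] * w, qi, r[qi]))
        _, qi, value = min(candidates)  # smallest weighted support wins
        for r in rows:
            if r[qi] == value and tuple(r[q] for q in combo) in small:
                r[qi] = SUPPRESSED

# Invented toy data: k = 3, one combination of two quasi-identifiers.
rows = [{"YOB": "1970-1979", "DIAG": "hypertension"},
        {"YOB": "1950-1959", "DIAG": "hypertension"},
        {"YOB": "1950-1959", "DIAG": "hypertension"},
        {"YOB": "1950-1959", "DIAG": "hypertension"},
        {"YOB": "1960-1969", "DIAG": "metabolic disorder"}]
phase1_rare_values(rows, ["YOB", "DIAG"], 3)
suppress_combination(rows, ["YOB", "DIAG"], 3, weights={})
print(rows)
```

With a real PUMF the outer driver would call `suppress_combination` once per combination in Table 5 or Table 7, ordered by number of moles.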
Suppression Algorithm Walkthrough
The next smallest support is 4. A birth year in 1970-1979 has a support of 4. All equivalence classes with this year of birth are smaller than 3, and therefore this value is suppressed from all the small equivalence classes within which it appears. These are labeled by a (1) in Figure 7. Next, the diagnosis of "acute respiratory problem" has a support of 4, and it appears in an equivalence class of size one (ID 8). That diagnosis value is therefore suppressed within that equivalence class and is labeled with a (2) in Figure 7. The other equivalence class where this diagnosis appears has a size of 3 (ID 16, 19, and 24), and therefore will not undergo any suppression. A similar process is followed for the other value with a support of 4: "external injury".
We next consider a support of 6, which is a diagnosis of "metabolic disorder". There are two equivalence classes of size 1 with this diagnosis, and therefore the diagnosis is suppressed in these (ID 23 and 25).
The next support value is 9, for a year of birth in 1950-1959. All equivalence classes with this value are of size 3, and therefore will not undergo any suppression.
The next support is 11, for the year of birth 1960-1969. Here we have an equivalence class of size 2 (ID 10 and 14), in which case the year of birth will be suppressed.
At this point there are no longer any values on the quasiidentifiers that are smaller than 3, and the algorithm stops for this combination.
Computational Complexity of Suppression Algorithm
Our complexity analysis makes worst-case assumptions, and we assume throughout that k > 1.
For phase 1 of the algorithm we assume that all values on all quasiidentifiers have low support. Computation of this support can be done once for all combinations. The computation of support requires a scan of all of the records. Therefore, the initial computation of support requires $\left|\bigcup_{b \in B} Z_b\right| \times n$ operations. During that step we would also keep index information about the records with each support value. Under a worst-case scenario $|\tilde{L}_b| = n \times |Z_b|$ for all b ∈ B. This means that every value in the data set is unique. Therefore it would be necessary to iterate that many times to perform suppressions. The suppressions would use the stored index values to avoid performing additional scans, but would still need to perform n suppressions. The total number of computations for phase 1 would then be $3 \times \left|\bigcup_{b \in B} Z_b\right| \times n$. This will be dominated by the n term given that $n \gg \left|\bigcup_{b \in B} Z_b\right|$ in practice.
For the second phase we would need to compute the support for each value, but for this we would use the support values computed and updated during phase 1 of the algorithm. However, we will still need to compute the equivalence class sizes for each combination, which requires $|Z_b| \times n$ operations. In the worst-case scenario we would need to examine every value because it would have low support, and suppress it in every equivalence class because they would all be small. This would give a total number of iterations equal to $\sum_{z_b \in Z_b} |L[z_b]|$. If each quasiidentifier has n possible values, then we would need $n \times |Z_b|$ iterations. At each iteration we need to update the affected equivalence class counts, requiring at most a total of n computations. Also, at most we would perform $|Z_b| \times n$ suppressions in total. Therefore the total amount of computation in phase 2 would be $\sum_{b \in B} \left(3 \times |Z_b| \times n + n\right)$, assuming all combinations had moles.
In general, we would expect the amount of computation to increase at most linearly with the data set size.
Evaluation
Our empirical evaluation of the generated versions of the PUMF and the methods used to generate them consisted of three components.
Measuring Information Loss
We assessed information loss using two parameters. The first was the extent of suppression. As noted earlier, suppression is an intuitive metric that data analysts can easily comprehend while assessing data quality. Suppression can be measured as: (a) the percentage of records that have some suppression in them (on the quasiidentifiers) due to the deidentification, or (b) as the percentage of cells in the file that are suppressed.
The second information loss metric we used was nonuniform entropy. Entropy was chosen because it has many desirable properties compared to other proposed information loss metrics in the literature. For example, nonuniform entropy will always increase when a data set is generalized and before suppression is applied (monotonicity property) and will behave in expected ways with data having an unbalanced distribution (see the review in [87]). Because entropy has no easily interpretable unit, we used one of the data sets as a baseline and compared the other data sets to that as a percentage increase in entropy.
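Non-uniform entropy can be sketched per variable as follows, estimating the conditional probabilities from the data itself. The codes are invented, and treating a suppressed cell as generalized to the whole column is one common convention, not necessarily the paper's exact implementation:

```python
import math
from collections import Counter

def nonuniform_entropy(original, generalized):
    """Non-uniform entropy information loss for one variable: each cell
    contributes -log2 Pr(original value | generalized value), with the
    probabilities estimated from the data. None marks suppression."""
    orig_counts = Counter(original)
    pair_counts = Counter(zip(original, generalized))
    gen_counts = Counter(generalized)
    loss = 0.0
    for o, g in zip(original, generalized):
        if g is None:  # suppressed cell: condition on the full column
            p = orig_counts[o] / len(original)
        else:
            p = pair_counts[(o, g)] / gen_counts[g]
        loss += -math.log2(p)
    return loss

orig = ["J18.9", "J18.9", "I21.4", "I21.4"]
gen = ["J18", "J18", "I21", None]  # last cell suppressed
print(nonuniform_entropy(orig, gen))
```

Summing this loss over all quasi-identifier columns gives a file-level figure that can then be expressed relative to a baseline data set, as done in the results tables.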
Distribution of Missingness
The distribution of missingness due to suppression was also evaluated. For the selected PUMF we evaluated whether diagnosis codes are more likely to be suppressed based on age, region, and length of stay. For example, if we take the province variable, equivalence classes in Ontario will tend to be larger than in Prince Edward Island (PEI). Therefore, the level of suppression for the diagnosis codes in Ontario would likely be less. A PUMF user may restrict his/her analysis to Ontario only if there is too much diagnosis code suppression in PEI. In such a case, knowledge of the distributions of suppressions would be useful.
Performance of Suppression Algorithm
We evaluated the performance of our combinations suppression algorithm, which will be termed the "combinations" algorithm. To help interpret its performance, we compared it to an algorithm that does not take combinations into account, which will be termed the "complete" algorithm. The latter is the same algorithm as the former except that there is only one combination consisting of all of the quasiidentifiers.
The performance evaluation was conducted on a Windows machine with a dual core Intel processor running at 2.6 GHz and 2GB of RAM. Both suppression algorithms were implemented in the C# programming language.
Three sets of performance evaluations were performed. The first compared the overall algorithm performance in seconds on each of the four data sets. The second compared the overall time in seconds for each of the two phases of the algorithms. We varied the value of k from 3 to 40. This allows us to determine whether the performance of the algorithm is affected as the threshold for acceptable reidentification probability changes. Both algorithms are expected to perform more iterations as the value of k increases.
The third evaluation was to assess how the algorithm scales with the number of records in the data set. For each of our data sets we created random subsamples with sampling fractions varying from 0.1 to 0.9 in increments of 0.1. For each subsample we computed the time to perform the suppression. The value of k was fixed at 5 for these evaluations.
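The subsampling evaluation can be sketched as a small harness; the suppression routine is passed in as a callable, k is assumed to be fixed inside it, and the names are ours:

```python
import random
import time

def scaling_times(rows, run_suppression,
                  fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Time a suppression routine on one random subsample of the data
    set per sampling fraction, mirroring the scaling evaluation."""
    times = {}
    for f in fractions:
        sample = random.sample(rows, int(len(rows) * f))
        start = time.perf_counter()
        run_suppression(sample)
        times[f] = time.perf_counter() - start
    return times

# Usage with a stand-in routine on dummy records:
timings = scaling_times(list(range(1000)), lambda sample: sorted(sample))
print(sorted(timings))
```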
Results
Information Loss
In our analysis we assumed that the adversary would only know the diagnosis code at the CMG_CODE level and not at the MRDx level. Similarly, we assumed that the adversary would only know the short version of the intervention code rather than the detailed intervention code.
Main results for geographic file (Table 9). Entries are the percentage of values suppressed within each quasi-identifier at re-identification thresholds 0.2, 0.05 and 0.04, for the combinations (Comb.) and complete (Compl.) algorithms; entropy is expressed as a percentage of a baseline (100%).

PROV_ALL
                          0.2              0.05             0.04
                          Comb.   Compl.   Comb.   Compl.   Comb.   Compl.
GENDER_CODE               0.002   0.002    0.005   0.005    0.005   0.005
AGE_GROUP                 1.6     5.2      5.4     14.2     6.5     16.4
PROV_XXX                  0.07    3.11     0.2     7.4      0.25    8.3
MRDx                      12.6    14.4     26.8    27       29.4    29.4
CMG_CODE                  12.6    14.4     26.8    27       29.4    29.4
DIAG_BLOCK                8.5     4.8      19.8    9.5      22.2    10.5
DIAG_CHAPTER              1.5     0.11     3.8     0.3      4.3     0.4
CCI_CODE                  5.9     8.13     10.7    13.2     11.5    14.14
SHORT_CCI                 5.9     8.13     10.7    13.2     11.5    14.14
Total % Cells Suppressed  5.4     6.5      11.6    12.4     12.8    13.6
Entropy (%)               100     137      246     302      274     334

PROV_REGION
                          0.2              0.05             0.04
                          Comb.   Compl.   Comb.   Compl.   Comb.   Compl.
GENDER_CODE               0.002   0.003    0.003   0.009    0.002   0.012
AGE_GROUP                 0.81    3.8      2.8     10.6     3.4     12.2
PROV_XXX                  0.06    0.3      0.16    0.5      0.19    0.6
MRDx                      8.6     10.7     19.4    21.4     21.6    23.5
CMG_CODE                  8.6     10.7     19.4    21.4     21.6    23.5
DIAG_BLOCK                5.5     3.6      13.5    7.5      15.2    8.3
DIAG_CHAPTER              0.87    0.07     2.5     0.3      2.8     0.31
CCI_CODE                  4.4     6.8      8.8     11.9     9.7     12.7
SHORT_CCI                 4.4     6.8      8.8     11.9     9.7     12.7
Total % Cells Suppressed  3.7     4.75     8.4     9.5      9.4     10.4
Entropy (%)               70      104      181     236      204     262
Main results for clinical file (Table 10), in the same layout as Table 9.

TOTAL_LOS_DAYS
                          0.2              0.05             0.04
                          Comb.   Compl.   Comb.   Compl.   Comb.   Compl.
GENDER_CODE               0.002   0.003    0.005   0.005    0.006   0.006
AGE_GROUP                 2.5     6.2      7.6     16.7     8.86    19.2
TOTAL_LOS_XXX             1.2     6.8      1.8     12.3     1.9     13.3
MRDx                      15.8    16.4     30.4    29       33      31.1
CMG_CODE                  15.8    16.4     30.4    29       33      31.1
DIAG_BLOCK                11.4    5.4      24.2    10.2     27      11.1
DIAG_CHAPTER              2.22    0.14     5       0.28     5.6     0.34
CCI_CODE                  7.4     9.16     12.3    14       13.2    14.8
SHORT_CCI                 7.4     9.16     12.3    14       13.2    14.8
Total % Cells Suppressed  7.08    7.74     13.77   13.94    15.1    15.1
Entropy (%)               100     123      201     237      219     259

TOTAL_LOS_WEEKS
                          0.2              0.05             0.04
                          Comb.   Compl.   Comb.   Compl.   Comb.   Compl.
GENDER_CODE               0.002   0.004    0.004   0.01     0.004   0.012
AGE_GROUP                 1.32    3.7      4.2     9.8      4.8     11.25
TOTAL_LOS_XXX             1.2     4.7      1.8     7.08     1.9     7.4
MRDx                      10.2    10.8     19.8    20       21.6    21.7
CMG_CODE                  10.2    10.8     19.8    20       21.6    21.7
DIAG_BLOCK                6.96    3.5      15.1    6.8      16.5    7.5
DIAG_CHAPTER              1.4     0.066    3.3     0.16     3.7     0.19
CCI_CODE                  4.9     6.7      8.9     11.22    9.6     12
SHORT_CCI                 4.9     6.7      8.9     11.22    9.6     12
Total % Cells Suppressed  4.6     5.2      9.1     9.6      9.92    10.4
Entropy (%)               50      68       114     143      126     158
Two versions of the geographic PUMF file were created, one at each of the two levels of geography (PROV_ALL and PROV_REGION). PROV_REGION treats the country as consisting of three regions plus the territories. Also, two versions of the clinical PUMF file were created, one that shows the individual days of LOS for the first week and one that treats 1 to 7 total days of stay as a single category (TOTAL_LOS_DAYS and TOTAL_LOS_WEEKS). This distinction is important because the vast majority of stays are less than one week. The multiple versions of the PUMF file allow us to consider the impact of these alternative generalizations on information loss.
The results for both suppression algorithms, combinations and complete, are also shown in Table 9 and Table 10. For each data set and threshold we show the percentage of values within each quasiidentifier that are suppressed, and the percentage of all cells across all quasiidentifiers that are suppressed. The entropy as a percentage change from a baseline (indicated by 100%) is given for each column.
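An entropy-based information-loss measure can be computed along the following lines. This is a sketch only; the exact entropy formulation used for the tables, reported as a percentage of a baseline, may differ:

```python
import math
from collections import Counter

def suppression_entropy(column, suppressed_idx):
    """Information lost by suppressing cells in one column: each
    suppressed cell contributes -log2 of its value's relative
    frequency, so suppressing rare values costs more bits."""
    freq = Counter(column)
    n = len(column)
    return sum(-math.log2(freq[column[i]] / n) for i in suppressed_idx)

# Suppressing a value seen in half the records costs 1 bit;
# suppressing a 1-in-4 value costs 2 bits.
col = ["A", "A", "B", "C"]
loss = suppression_entropy(col, [0, 3])   # 1.0 + 2.0 = 3.0
```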
The missingness and entropy results indicate that: (a) there is less missingness and less information loss at the 0.05 threshold than at the 0.04 threshold, (b) the combinations suppression algorithm produces less information loss overall than the complete suppression algorithm, and (c) the PROV_REGION and TOTAL_LOS_WEEKS data sets have less information loss overall. The latter finding reflects the fact that the extra generalization in the PROV_REGION and TOTAL_LOS_WEEKS data sets was offset by a significant reduction in suppression, giving them higher utility.
Based on these results, and to maintain data utility, the authors and CIHI decided to use the 0.05 threshold for creating a PUMF, and to use the PROV_REGION and TOTAL_LOS_WEEKS data sets with the combinations suppression algorithm.
Distribution of Missingness
The percentage of missingness on CMG_CODE by region
Region (share of records)  CMG_CODE (% Missing)
East (10.8%)               30.3
Central (45.5%)            19.1
West (43.2%)               16.6
Territories (0.3%)         53.0
The percentage of missingness on CMG_CODE by age group for each PUMF
AGE_GROUP  CMG_CODE (% Missing)
           PUMF 1   PUMF 2
0-4        5.2      5.3
5-9        23       13.8
10-14      28.6     19
15-19      20.4     15.6
20-24      14.8     11
25-29      11.2     8.7
30-34      12       9.3
35-39      17.9     13.6
40-44      26       20.3
45-49      25.7     22.1
50-54      25.9     23
55-59      25.8     23.7
60-64      23.9     23.6
65-69      23.4     24.4
70-74      22.8     24.4
75-79      20.4     24.6
80-84      19.4     24.9
85-89      19.2     23.3
90+        18       21.1
The percentage of missingness on CMG_CODE by total LOS in weeks for the second PUMF
TOTAL_LOS_WEEKS (days)  CMG_CODE (% Missing)
1-7                     11
8-14                    44
15-21                   60
22-28                   68
29-35                   75
36-42                   81
43-49                   83
50-56                   79
57-63                   78
64-70                   100
71-77                   100
78-84                   100
176+                    80
Performance of Suppression Algorithms

Because the PROV_ALL and TOTAL_LOS_DAYS data sets have variables with more response categories, they take longer to suppress. This is expected: more response categories mean a larger number of equivalence classes for the algorithm to iterate through, and hence more suppression to apply.

The differences between the two algorithms tend to be more marked at lower values of k, with the combinations algorithm taking longer. This is because the complete algorithm reaches a solution quickly at low values of k.

One would expect the combinations algorithm to take much more time than the complete algorithm. Any such difference would appear in phase 2, as both algorithms share the same phase 1. While there are differences, they are not dramatic. This is explained by the observation that earlier combinations result in suppressions that reduce the number of iterations needed for subsequent combinations.
The graphs in Figure 9 and Figure 10 show that phase 2 of the algorithm takes considerably more time than phase 1. This is driven by the need to recompute equivalence classes and their sizes at each iteration, whereas phase 1 only considers individual quasi-identifiers.
For such large data sets the overall performance of the combinations algorithm was deemed acceptable, especially since the increase in time cost compared to the complete suppression algorithm was not significant.
Discussion
Summary
A substantial difference was found between the PROV_ALL and PROV_REGION data sets, and between the TOTAL_LOS_DAYS and TOTAL_LOS_WEEKS data sets. The more generalized data sets tended to have considerably less information loss. Also, the clinical PUMF tended to have more suppression than the geographic PUMF because the LOS variable has more possible values (see Table 2). Therefore, we recommended using the PROV_REGION and TOTAL_LOS_WEEKS data sets as the basis for a PUMF.
There is an advantage to using the combinations suppression algorithm compared to the complete suppression algorithm. This is evident when we look at the percentage of total cells suppressed for the two algorithms and the difference in entropy.
Therefore, by using the 0.05 threshold, the more generalized versions of the two PUMF files, and the combinations algorithm we are able to produce a data set with significantly less information loss than would otherwise be the case. For example, for the geographic PUMF using PROV_ALL at the 0.04 threshold with the complete suppression algorithm results in 13.6% of the cells being suppressed, whereas our recommended PUMF has only 8.4% of its cells suppressed. For the second PUMF the differences are 15.1% versus 9.1%. In both cases there was a marked improvement in data utility as measured by the percentage of suppressed cells. Similarly, the complete suppression PROV_ALL data set at the 0.04 threshold had a 334% entropy value, compared to the combinations suppression PROV_REGION file with a 0.05 threshold which had less than two thirds of the entropy at 181%.
The variable with the most suppression was the diagnosis code, with just under 20% of records having that value suppressed. The least generalized data set with the 0.04 threshold and the complete suppression algorithm results in suppression closer to 30% on the diagnosis code variable.
The smallest regions tend to have the most suppression: the East and the territories. Similarly, the age groups with the most suppression are those with the fewest hospital visits (e.g., 5-14 years), and the age groups with the least suppression have the most hospital visits (e.g., newborns and toddlers in the 0-4 years age range). By far the majority of stays are less than 7 days, and therefore this category has the least suppression. Longer stays are considerably less common, resulting in high levels of suppression for them.
Our combinations suppression algorithm results in less information loss than an algorithm that does not take combinations into account. It has acceptable performance on this large data set for k values up to 40, and scales linearly with the data set size.
Practical Implications
One important reason for sampling was to ensure that the two PUMF files would not contain the same individuals, precluding the possibility of an adversary linking the two files. For example, using the same reasoning as in Additional file 3, any two independent samples drawn with replacement from the population will have an expected number of correctly matchable records equal to Kα, where K is the number of equivalence classes in the population and α is the sampling fraction. If we have 100,000 equivalence classes in the population and a 10% sampling fraction, then 10,000 records could be correctly matched on average. A higher sampling fraction would mean more records could be matched. In practice, however, an adversary would not know which 10,000 records were matched correctly, which may limit the meaningfulness of the matching exercise. Nevertheless, to err on the conservative side we wanted to ensure that the samples cannot overlap.
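The expected-match argument reduces to a one-line computation of Kα; a small sketch (the function name is ours):

```python
def expected_correct_matches(num_population_classes, sampling_fraction):
    """Expected number of records correctly linked between two
    independent samples drawn with the same sampling fraction:
    K * alpha, per the reasoning in Additional file 3."""
    return num_population_classes * sampling_fraction

# 100,000 equivalence classes and a 10% sampling fraction:
matches = expected_correct_matches(100_000, 0.1)   # about 10,000 records
```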
To contextualize our results with other disclosures of health information in Canada, the Statistics Canada census disclosure control process uses record level suppression, therefore the census PUMF does not have suppressed cells. It should also be noted that the census PUMF does not include any variables that have as many response categories as the DAD (e.g., the census only includes generalized occupation or industry codes with at most 25 possible categories, unlike diagnosis and intervention codes which are much more detailed). Therefore, in one sense it is not surprising that the amount of suppression for the DAD PUMF under the current specifications would be higher than we see in the census file. Furthermore, the Canadian census PUMF is still not generally available to the public, but has some restrictions on its disclosure. Another study which performed a reidentification risk assessment for a nonpublic disclosure of a pharmacy file noted that suppression on some variables of as much as 15% still provided utility to the end users [140]. This number is comparable to ours when one considers that the pharmacy file disclosure required a data sharing agreement and regular audits of the data recipients.
Rather than drawing the PUMF data from the suppressed population file, it may be better to remove the records with suppression on many quasi-identifiers, or where key variables are suppressed, and then draw the PUMF files from this new population. That way there will be little or no cell suppression in the PUMF files. However, such a PUMF will be missing more records from patients living in small areas, with longer stays, and with rare diagnoses and interventions. It may be possible to compensate for such missingness by providing weights for the non-suppressed values.
Another approach to reduce missingness is to further generalize diagnosis and intervention codes. It is quite likely that further reductions in the number of categories on these variables would reduce missingness. However, it remains an empirical question whether additions of other levels to the current domain generalization hierarchies for these two variables will be meaningful.
The PUMF that we have described here may be of utility for many specific purposes, such as making it easier for the research community to confirm some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large data set for computer scientists and statisticians to evaluate analysis and data mining techniques. However, for complex research analysis, more detailed information would typically be needed than can be provided in a PUMF. The most appropriate action may be to produce an Analytical File that would be readily available to researchers and other users who satisfy some significant conditions regarding access and use of the data, including signing and being legally bound by data sharing agreements. Such a file would still need to have records with a low probability of reidentification, although the threshold used would likely be set higher, say 0.2, leading to reduced suppression and therefore greater utility of the released data. As noted earlier, higher thresholds are suitable when the data is being disclosed to trusted parties. Data sharing agreements establish the trust and so mitigate the risks from the higher threshold. The use of data sharing agreements is the approach used by the AHRQ in the US when disclosing state discharge abstract data for secondary purposes.
Limitations
The analysis presented here does not address protection against attribute disclosure. In a final PUMF that would be released to the public, protections against attribute disclosure would normally need to be implemented as an added precaution.
Conclusions
The public availability of detailed hospital discharge abstract data could benefit many communities, including health and data analysis researchers, educators, students, as well as the data custodians themselves. However, the public disclosure of health information can only be done if it is credibly deidentified. In this paper we have described how Canadian discharge abstract data can be deidentified, and clarified the costs in terms of data quality of doing so. Our deidentification utilized new metrics, methods and algorithms to maximize the utility of the data and to provide strong privacy guarantees. Furthermore, the process we followed highlighted the tradeoffs that need to be considered by a data custodian when making data available. Challenges remain for the disclosure of detailed health information; for example, generalizing diagnosis codes to reduce the number of unique codes while retaining sufficient detail is an area for future work.
Abbreviations
DAD: Discharge abstract database
PUMF: Public use microdata file
AHRQ: Agency for Healthcare Research and Quality
PRAM: Post randomization method
Prec: Precision
DM: Discernability metric
MD: Minimal distortion
CM: Classification metric
IG: Information gain
PL: Privacy loss
Declarations
Acknowledgements
The work reported here was funded by the Canadian Institute for Health Information, the Ontario Institute for Cancer Research, and the Canadian Institutes of Health Research. We wish to thank JeanMarie Berthelot (Canadian Institute for Health Information) and David Morgan (Newfoundland and Labrador Centre for Health Information) for their comments and suggestions on an earlier version of this article. We also would like to thank Sadrul Chowdhury (Google Inc.) for his contributions to the suppression algorithm (these contributions were made while Sadrul was at the CHEO Research Institute).
Authors’ Affiliations
References
 Fienberg S, Martin M, Straf M: Sharing Research Data. 1985, Committee on National Statistics, National Research Council
 Hutchon D: Publishing raw data and real time statistical analysis on ejournals. British Medical Journal. 2001, 322 (3): 530PubMed CentralPubMed
 Are journals doing enough to prevent fraudulent publication?. Canadian Medical Association Journal. 2006, 174 (4): 431
 Abraham K: Microdata access and labor market research: The US experience. Allegmeines Statistisches Archiv. 2005, 89: 121139. 10.1007/s1018200501976.
 Vickers A: Whose data set is it anyway?. Sharing raw data from randomized trials. Trials. 2006, 7 (15):
 Altman D, Cates C: Authors should make their data available. BMJ. 2001, 323: 1069PubMed CentralPubMed
 Delamothe T: Whose data are they anyway?. BMJ. 1996, 312: 12411242.PubMed CentralPubMed
 Smith GD: Increasing the accessibility of data. BMJ. 1994, 308: 15191520.PubMed CentralPubMed
 Commission of the European Communities. On scientific information in the digital age: Access, dissemination and preservation. 2007
 Lowrance W: Access to collections of data and materials for health research: A report to the Medical Research Council and the Wellcome Trust. Medical Research Council and the Wellcome Trust. 2006
 Yolles B, Connors J, Grufferman S: Obtaining access to data from governmentsponsored medical research. NEJM. 1986, 315 (26): 16691672. 10.1056/NEJM198612253152608.PubMed
 Hogue C: Ethical issues in sharing epidemiologic data. Journal of Clinical Epidemiology. 1991, 44 (Suppl I): 103S107S.PubMed
 Hedrick T: Justifications for the sharing of social science data. Law and Human Behavior. 1988, 12 (2): 163171. 10.1007/BF01073124.
 Mackie C, Bradburn N: Improving access to and confidentiality of research data: Report of a workshop. 2000, Washington: The National Academies Press
 Boyko E: Why disseminate microdata?. United Nations Economic and social Commission for Asia and the Pacific Workshop on Census and Survey Microdata. 2008
 Winkler W: Producing PublicUse Microdata That Are Analytically Valid And Confidential. US Census Bureau. 1997
 Statistics Canada: 2001 Census Public Use Microdata File: Individuals file user documentation. 2001
 Dale A, Elliot M: Proposals for the 2001 samples of anonymized records: An assessment of disclosure risk. Journal of the Royal Statistical Society. 2001, 164 (3): 427447. 10.1111/1467985X.00212.
 Marsh C, Skinner C, Arber S, Penhale B, Openshaw S, Hobcraft J, Lievesley D, Walford N: The case for samples of anonymized records from the 1991 census. Journal of the Royal Statistical Society, Series A (Statistics in Society). 1991, 154 (2): 305340. 10.2307/2983043.
 Statistics Canada: Canadian Community Health Survey: Public Use Microdata File. 2009, [http://www.statcan.gc.ca/bsolc/olccel/olccel?catno=82M0013X&lang=eng]
 Marsh C, Dale A, Skinner C: Safe data versus safe settings: Access to microdata from the British census. International Statistical Review. 1994, 62 (1): 3553. 10.2307/1403544.
 Alexander L, Jabine T: Access to social security microdata files for research and statistical purposes. Social Security Bulletin. 1978, 41 (8): 317.PubMed
 Department of Health and Human Services: Comparative Effectiveness Research Public Use Data Pilot Project. 2010
 Department of Health and Human Services: CERPublic Use Data Pilot Project. 2010
 ConsumerPurchaser Disclosure Project. The state experience in health quality data collection. 2004
 Agency for Healthcare Research and Quality: Healthcare cost and utilization project: SID/SASD/SEDD Application Kit. 2010
 Canadian Institute for Health Information: Data Quality Documentation, Discharge Abstract Database. 20092010: Executive Summary. 2010
 Canadian Institute for Health Information: Privacy and security framework. 2010
 Schoenman J, Sutton J, Kintala S, Love D, Maw R: The value of hospital discharge databases. Agency for Healthcare Research and Quality. 2005
 Samarati P: Protecting respondents identities in microdata release. Knowledge and Data Engineering, IEEE Transactions on. 2001, 13 (6): 10101027. 10.1109/69.971193. [10.1109/69.971193]
 Sweeney L: kanonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledgebased Systems. 2002, 10 (5): 557570. 10.1142/S0218488502001648.
 Skinner C: On identification disclosure and prediction disclosure for microdata. Statistics Neerlandica. 1992, 46 (1): 2132. 10.1111/j.14679574.1992.tb01324.x.
 Subcommittee on Disclosure Limitation Methodology  Federal Committee on Statistical Methodology. Statistical Policy Working paper 22: Report on statistical disclosure control. Statistical Policy Office, Office of Information and Regulatory Affairs, Office of Management and Budget. 1994
 Machanavajjhala A, Gehrke J, Kifer D: lDiversity: Privacy Beyond kAnonymity. Transactions on Knowledge Discovery from Data. 2007, 1 (1): 147. 10.1145/1217299.1217300.
 Hansell S: AOL Removes Search Data on Group of Web Users. 2006, New York Times
 Barbaro M, Zeller T: A Face Is Exposed for AOL Searcher No. 4417749. New York Times. 2006
 Zeller T: AOL Moves to Increase Privacy on Search Queries New York Times. 2006
 Ochoa S, Rasmussen J, Robson C, Salib M: Reidentification of individuals in Chicago's homicide database: A technical and legal study. Massachusetts Institute of Technology. 2001
 Narayanan A, Shmatikov V: Robust deanonymization of large datasets (how to break anonymity of the Netflix prize dataset). 2008, University of Texas at Austin
 Appellate Court of Illinois  Fifth District. The Southern Illinoisan v. Department of Public Health. 2004
 The Supreme Court of the State of Illinois. Southern Illinoisan vs. The Illinois Department of Public Health. 2006, Accessed on: February 1, 2011, [http://www.state.il.us/court/opinions/supremecourt/2006/february/opinions/html/98712.htm]
 Federal Court (Canada): Mike Gordon vs. The Minister of Health: Affidavit of Bill Wilson. 2006
 El Emam K, Kosseim P: Privacy Interests in Prescription Records, Part 2: Patient Privacy. IEEE Security and Privacy. 2009, 7 (2): 7578.
 Lafky D: The Safe Harbor method of deidentification: An empirical test. Fourth National HIPAA Summit West. 2010
 Fung BCM, Wang K, Chen R, Yu PS: PrivacyPreserving Data Publishing: A Survey of Recent Developments. ACM Computing Surveys. 2010, 42 (4):
 Chen BC, Kifer D, LeFevre K, Machanavajjhala A: Privacy preserving data publishing. Foundations and Trends in Databases. 2009, 2 (12): 1167.
 Alexander LA, Jabine TB: Access to social security microdata files for research and statistical purposes. Social Security Bulletin. 1978, 41 (8): 317.PubMed
 de Waal T, Willenborg L: A view on statistical disclosure control for microdata. Survey Methodology. 1996, 22 (1): 95103.
 Willenborg L, de Waal T: Elements of Statistical Disclosure Control. Lecture Notes in Statistics. 2001, 155: Springer
 DomingoFerrer J, Torra Vc: Theory and Practical Applications for Statistical Agencies. NorthHolland: Amsterdam. 2002, 113134.
 Skinner C, Elliot M: A measure of disclosure risk for microdata. Journal of the Royal Statistical Society Series BStatistical Methodology. 2002, 64: 855867. 10.1111/14679868.00365.
 DomingoFerrer J, Torra Vc: Ordinal, Continuous and Heterogeneous kAnonymity Through Microaggregation. Data Min Knowl Discov. 2005, 11: 195212. 10.1007/s1061800500075. [10.1007/s1061800500075]
 Templ M: Statistical Disclosure Control for Microdata Using the RPackage sdcMicro. Trans Data Privacy. 2008, 1: 6785.
 Dandekar RA: Cost effective implementation of synthetic tabulation (a.k.a. controlled tabular adjustments) in legacy and new statistical data publication systems. Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxembourg. 2003
 Dandekar RA: Maximum UtilityMinimum Information Loss Table Server Design for Statistical Disclosure Control of Tabular Data. Privacy in statistical databases. Edited by: Josep DomingoFerrer, Vicen\cc Torra. 2004
 Castro J: Minimumdistance controlled perturbation methods for largescale tabular data protection. European Journal of Operational Research. 2004, 171
 Cox LH, Kelly JP, Patil R: Balancing quality and confidentiality for multivariate tabular data. Privacy in statistical databases. Edited by: Josep DomingoFerrer, Torra V. 2004
 Doyle P, Lane JI, Theeuwes JJM, Zayatz LM: Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Elsevier Science. 2001, 1
 DomingoFerrer J, Torra Vc: Disclosure Control Methods and Information Loss for Microdata. Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Edited by: P. Doyle, et al. 2001, Elsevier Science
 Kim JJ: A method for limiting disclosure in microdata based on random noise and transformation. Proceedings of the ASA Section on Survey Research Methodology. 1986, American Statistical Association: Alexandria, Virginia, 370375.
 Little RJA: Statistical Analysis of Masked Data. Journal of Official Statistics. 1993, 9 (2): 407426.
 Sullivan G, Fuller WA: The use of measurement error to avoid disclosure. Proceedings of the American Statistical Association, Survey Methods Section. 1989, American Statistical Association: Alexandria, Virginia, 802808.
 Sullivan G, Fuller WA: The construction of masking error for categorical variables. Proceedings of the American Statistical Association, Survey Methods Section. 1990, American Statistical Association: Alexandria, Virginia, 435440.
 Kargupta H, Datta S, Wang Q, Sivakumar K: Random data perturbation techniques and privacy preserving data mining. Knowledge and Information Systems. 2005, 7: 387414. 10.1007/s1011500401736.
 Liew CK, Choi UJ, Liew CJ: A Data Distortion by Probability Distribution. ACM Transactions on Database Systems. 1985, 10 (3): 395411. 10.1145/3979.4017.
 Defays D, Nanopoulos P: Panels of enterprises and confidentiality: The small aggregates method. Proceedings of the 1992 Symposium on Design and Analysis of Longitudinal Surveys. 1993, Statistics Canada: Ottawa, Canada
 Nanopoulos P: Confidentiality In The European Statistical System. Qüestiió. 1997, 21 (1 i 2): 219220.
 Defays D, Anwar M: Masking data using microaggregation. Journal of Official Statistics. 1998, 14: 449461.
 DomingoFerrer J, MateoSanz J: Practical dataoriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering. 2002, 14 (1): 189201. 10.1109/69.979982. [10.1109/69.979982]
 Sande G: Exact and approximate methods for data directed microaggregation in one or more dimensions. International Journal of Uncertainty, Fuziness, and KnowledgeBased Systems. 2002, 10 (5): 459476. 10.1142/S0218488502001582.
 Torra V: Microaggregation for categorical variables: A median based approach. Privacy in Statistical Databases. 2004, Springer LNCS
 DomingoFerrer J, MateoSanz JM: Resampling for statistical confidentiality in contingency tables. Computers and Mathematics with Applications. 1999, 38 (1112): 1332. 10.1016/S08981221(99)002813.
 DomingoFerrer J, Torra V: Disclosure Control Methods and Information Loss for Microdata. Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Edited by: P. Doyle, et al. 2001
 Jimenez J, Torra V: Utility and Risk of JPEGbased Continuous Microdata Protection Methods. International Conference on Availability, Reliability and Security. 2009
 Rubin DB: Discussion Statistical Disclosure Limitation (also cited as: Satisfying confidentiality constraints through the use of synthetic multiplyimputed microdata). Journal of Official Statistics. 1996, 9 (2): 461468.
 Heer GR: A Bootstrap Procedure to Preserve Statistical Confidentiality in Contingency Tables. International Seminar on Statistical Confidentiality. Edited by: D. Lievesley. 1993, Luxembourg, 261271.
 Gopal R, Garfinkel R, Goes P: Confidentiality via Camouflage: The CVC Approach to Disclosure Limitation When Answering Queries to Databases. OPERATIONS RESEARCH. 2002, 50 (3): 501516. 10.1287/opre.50.3.501.7745.
 Gouweleeuw JP, Kooiman P, Willenborg LCRJ, de Waal PP: Post Randomisation for Statistical Disclosure Control: Theory and Implementation. Journal of Official Statistics. 1998, 14 (4): 463478.
 Willenborg L, de Waal T: Statistical Disclosure Control in Practice. 1996, Springer, 1
 Hundepool A, Willenborg L: μ- and τ-ARGUS: Software for Statistical Disclosure Control. 3rd International Seminar on Statistical Confidentiality. 1996, Bled
 Samarati P, Sweeney L: Protecting privacy when disclosing information: kanonymity and its enforcement through generalisation and suppression. 1998, SRI International
 Samarati P: Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering. 2001, 13 (6): 10101027. 10.1109/69.971193.
 Sweeney L: Achieving kanonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and KnowledgeBased Systems. 2002, 10 (5): 571588. 10.1142/S021848850200165X.
 Ciriani V, De Capitani di Vimercati SSF, Samarati P: kAnonymity. Secure Data Management in Decentralized Systems. 2007, Springer
 Bayardo R, Agrawal R: Data Privacy through Optimal kAnonymization Proceedings of the 21st International Conference on Data Engineering. 2005
 Iyengar V: Transforming data to satisfy privacy constraints. Proceedings of the ACM SIGKDD International Conference on Data Mining and Knowledge Discovery. 2002
 El Emam K, Dankar F, Issa R, Jonker E, Amyot D, Cogo E, Corriveau JP, Walker M, Chowdhury S, Vaillancourt R, Roffey T, Bottomley J: A Globally Optimal kAnonymity Method for the Deidentification of Health Data Journal of the American Medical Informatics Association. 2009
 Hundepool A, DomingoFerrer J, Franconi L, Giessing S, Lenz R, Naylor J, Nordholt ES, Seri G, Wolf PPD: Handbook on Statistical Disclosure Control (Version 1.2). A Network of Excellence in the European Statistical System in the field of Statistical Disclosure Control. 2010
 Winkler WE: Reidentification methods for evaluating the confidentiality of analytically valid microdata. Research in Official Statistics. 1998, 1 (2): 5069.
 DomingoFerrer J, Mateosanz JM, Torra Vc: Comparing SDC Methods for Microdata on the Basis of Information Loss and Disclosure. Proceedings of ETKNTTS 2001. 2001, Luxemburg: Eurostat, 807826. Eurostat
 Shannon CE: A Mathematical Theory of Communication. The Bell System Technical Journal. 1948, 27
 Shannon CE: Communication Theory of Secrecy Systems. The Bell System Technical Journal. 1949
 de Waal T, Willenborg L: Information loss through global recoding and local suppression. Netherlands Official Statistics. 1999, 1720. Spring Special Issue
 Kooiman P, Willenborg L, Gouweleeuw J: PRAM: a method for disclosure limitation of microdata. Statistics Netherlands, Division Research and Development, Department of Statistical Methods. 1997
 Gionis A, Tassa T: kAnonymization with Minimal Loss of Information. Knowledge and Data Engineering, IEEE Transactions on. 2009, 21 (2): 206219.
 El Emam K, Dankar FK, Issa R, Jonker E, Amyot D, Cogo E, Corriveau JP, Walker M, Chowdhury S, Vaillancourt R, Roffey T, Bottomley J: A globally optimal kanonymity method for the deidentification of health data. Journal of the American Medical Informatics Association. 2009, 16 (5): 67082. 10.1197/jamia.M3144. [10.1197/jamia.M3144]PubMed CentralPubMed
 Sweeney L: Computational disclosure control: A primer on data privacy protection. Massachusetts Institute of Technology. 2001
 Bayardo RJ, Agrawal R: Data privacy through optimal k-anonymization. Proceedings of the 21st International Conference on Data Engineering. 2005, 217-228.
 El Emam K, Dankar FK: Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association. 2008, 15 (5): 627-37. 10.1197/jamia.M2716.
 LeFevre K, DeWitt DJ, Ramakrishnan R: Mondrian Multidimensional K-Anonymity. Proceedings of the 22nd International Conference on Data Engineering. 2006, IEEE Computer Society: Washington, DC, USA, 25
 Hore B, Jammalamadaka RC, Mehrotra S: Flexible Anonymization For Privacy Preserving Data Publishing: A Systematic Search Based Approach. SIAM International Conference on Data Mining (SDM). 2007
 Xu J, Wang W, Pei J, Wang X, Shi B, Fu AWC: Utility-based anonymization for privacy preservation with less information loss. SIGKDD Explorations Newsletter. 2006, 8: 21-30. 10.1145/1233321.1233324. [http://doi.acm.org/10.1145/1233321.1233324]
 Nergiz ME, Clifton C: Thoughts on k-anonymization. Data Knowl Eng. 2007, 63: 622-645. 10.1016/j.datak.2007.03.009. [http://dx.doi.org/10.1016/j.datak.2007.03.009]
 Polettini S: A note on the individual risk of disclosure. Istituto Nazionale di Statistica. 2003
 Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M: L-diversity: Privacy beyond k-anonymity. ACM Trans Knowl Discov Data. 2007, 1: [http://doi.acm.org/10.1145/1217299.1217302]
 Sweeney L: Datafly: A system for providing anonymity to medical data. Proceedings of the IFIP TC11 WG11.3 Eleventh International Conference on Database Security XI: Status and Prospects. 1997
 Fung BCM, Wang K, Fu AWC, Yu PS: Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. 2010, Chapman and Hall/CRC, 1
 Xiao X, Tao Y: Personalized privacy preservation. Proceedings of the 2006 ACM SIGMOD international conference on Management of data. 2006, ACM: New York, NY, USA, 229-240.
 Iyengar VS: Transforming data to satisfy privacy constraints. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002, ACM: New York, NY, USA, 279-288.
 Fung BCM, Wang K, Yu PS: Top-down specialization for information and privacy preservation. ICDE 2005. Proceedings. 21st International Conference on Data Engineering. 2005, 205-216.
 Fung BCM, Wang K, Yu PS: Anonymizing Classification Data for Privacy Preservation. IEEE Transactions on Knowledge and Data Engineering (TKDE). 2007, 19 (5): 711-725.
 El Emam K: Risk-based de-identification of health data. IEEE Security and Privacy. 2010, 8 (3): 64-67.
 Statistics Canada: Handbook for creating public use microdata files. 2006
 Cancer Care Ontario Data Use and Disclosure Policy. Cancer Care Ontario. 2005
 Security and confidentiality policies and procedures. Health Quality Council. 2004
 Privacy code. Health Quality Council. 2004
 Privacy code. Manitoba Center for Health Policy. 2002
 Subcommittee on Disclosure Limitation Methodology - Federal Committee on Statistical Methodology. Working paper 22: Report on statistical disclosure control. 1994, Office of Management and Budget
 Statistics Canada: Therapeutic abortion survey. 2007
 Office of the Information and Privacy Commissioner of British Columbia. 1998, Order No. 261-1998
 Office of the Information and Privacy Commissioner of Ontario.
 Ministry of Health and Long Term care (Ontario). Corporate Policy. 1984, 3121
 Duncan G, Jabine T, de Wolf S: Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. 1993, National Academies Press
 de Waal A, Willenborg L: A view on statistical disclosure control for microdata. Survey Methodology. 1996, 22 (1): 95-103.
 Office of the Privacy Commissioner of Quebec (CAI). Chenard v. Ministere de l'agriculture, des pecheries et de l'alimentation (141). 1997
 National Center for Education Statistics. NCES Statistical Standards. 2003, US Department of Education
 National Committee on Vital and Health Statistics: Report to the Secretary of the US Department of Health and Human Services on Enhanced Protections for Uses of Health Data: A Stewardship Framework for "Secondary Uses" of Electronically Collected and Transmitted Health Data. 2007
 Sweeney L: Data sharing under HIPAA: 12 years later. Workshop on the HIPAA Privacy Rule's De-Identification Standard. 2010, Department of Health and Human Services
 El Emam K, Jabbouri S, Sams S, Drouet Y, Power M: Evaluating common de-identification heuristics for personal health information. Journal of Medical Internet Research. 2006, 8 (4): e28. 10.2196/jmir.8.4.e28. [PMID: 17213047]
 El Emam K, Jonker E, Sams S, Neri E, Neisa A, Gao T, Chowdhury S: De-Identification Guidelines for Personal Health Information. 2007, Report produced for the Office of the Privacy Commissioner of Canada: Ottawa
 Benitez K, Malin B: Evaluating re-identification risks with respect to the HIPAA privacy rule. Journal of the American Medical Informatics Association. 2010, 17 (2): 169-177. 10.1136/jamia.2009.000026.
 Torra V, Domingo-Ferrer J: Record linkage methods for multi-database data mining. Information Fusion in Data Mining. 2003, 101-132.
 Torra V, Abowd J, Domingo-Ferrer J: Using Mahalanobis distance-based record linkage for disclosure risk assessment. 2006, 233-242. Springer LNCS
 El Emam K, Dankar F: Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association. 2008, 15: 627-637. 10.1197/jamia.M2716.
 World Health Organization: International Classification of Diseases. [http://www.who.int/classifications/icd/en/]
 Loukides G, Denny J, Malin B: The disclosure of diagnosis codes can breach research participants' privacy. Journal of the American Medical Informatics Association. 2010, 17: 322-327.
 El Emam K: Methods for the de-identification of electronic health records for genomic research. Genome Medicine. 2011, 3: 25.
 Meyerson A, Williams R: On the complexity of optimal k-anonymity. Proceedings of the 23rd Conference on the Principles of Database Systems. 2004
 Aggarwal G, Feder T, Kenthapadi K, Motwani R, Panigrahy R, Thomas D, Zhu A: Anonymizing Tables. Proceedings of the 10th International Conference on Database Theory (ICDT'05). 2005
 El Emam K, Dankar F, Vaillancourt R, Roffey T, Lysyk M: Evaluating patient re-identification risk from hospital prescription records. Canadian Journal of Hospital Pharmacy. 2009, 62 (4): 307-319.
Prepublication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/11/53/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.