A community assessment of privacy preserving techniques for human genomes
- Xiaoqian Jiang†1,
- Yongan Zhao†2,
- Xiaofeng Wang2,
- Bradley Malin3,
- Shuang Wang1,
- Lucila Ohno-Machado1 and
- Haixu Tang2Email author
© Jiang et al.; licensee BioMed Central Ltd. 2014
Published: 8 December 2014
To answer the need for the rigorous protection of biomedical data, we organized the Critical Assessment of Data Privacy and Protection initiative as a community effort to evaluate privacy-preserving dissemination techniques for biomedical data. We focused on the challenge of sharing aggregate human genomic data (e.g., allele frequencies) in a way that preserves the privacy of the data donors, without undermining the utility of genome-wide association studies (GWAS) or impeding their dissemination. Specifically, we designed two problems for disseminating the raw data and the analysis outcome, respectively, based on publicly available data from HapMap and from the Personal Genome Project. A total of six teams participated in the challenges. The final results were presented at a workshop of the iDASH (integrating Data for Analysis, 'anonymization,' and SHaring) National Center for Biomedical Computing. We report the results of the challenge and our findings about the current genome privacy protection techniques.
The biomedical community is evolving to benefit from and to contribute to big data science. This began with advances in high throughput and computing technologies and is expanding to include expected, as well as unexpected, sources of health-related data, such as electronic health records and social media. A large amount of biomedical data (e.g., human DNA sequences) are being rapidly generated in research and clinical laboratories and healthcare settings . Human genome sequencing will soon become a routine procedure in biomedical research and healthcare, thanks to the rapidly increasing throughput and decreasing cost of DNA sequencing techniques.
To transform such data into knowledge that is applicable to biomedicine, new technologies for large-scale investigations and meta-analysis analysis (e.g., ) on genomic data continue to be developed, increasing the probability that human genomes will be used for clinical diagnosis and therapy, a trend dubbed "base pairs to bedside" .
However, further progress in this area has been impeded by the constraints in accessing human genome data, due in part to privacy concerns related to the disclosure of these data [4, 5]. As human genome is a type of biometric identifier, special provisions to protect privacy of individuals need to be taken into account.
It is well known that sharing raw DNA data (e.g., genotypes) poses risks even after the removal of explicit identifiers (e.g., name and Social Security number.), as the donor can be possibly re-identified through the genetic markers (e.g., single nucleotide polymorphisms, or SNPs) in the data [6–8]. Even aggregated DNA data, such as allele frequencies, have been found to leak out identity information . Specifically, Homer et al.  discovered that the presence of an individual in a case group can be reliably determined from allele frequencies using the person's DNA profile. As a result, NIH and Wellcome Trust removed most aggregated DNA data from the public domain to protect the privacy of study participants . Computational approaches have been developed to characterize which SNPs could be shared under the assumption that they are common variants [12, 13]. For example, it was shown that re-identification is unlikely if the set contains more than 1000 people . Nonetheless, data owners are becoming more cautious about sharing human genomic data, often implementing additional processes to review proposals for data access.
In the past several years, there is growing interest in developing effective methodologies to analyze and disseminate sensitive data. Cryptographic protocols (such as homomorphic computation [14–17]) were developed for analyzing sensitive data through computation over the ciphertext. However, these methods provide no protection to the output of computation, and thus cannot be directly used for disseminating sensitive data. Data perturbation techniques (e.g., [18–20]) achieve privacy protection on input and/or output data and can be used to disseminate sensitive data or to publish computing results that contain sensitive information. A recent review on privacy preserving technology to support data sharing for comparative effectiveness research presents an overview of some of these techniques .
A critical challenge of applying privacy technology on human genomic data is to balance data sharing (and data utility in real-world genomic analyses in particular) and privacy protection. To investigate the extent to which data perturbation technologies are practical, we organized the Critical Assessment of Data Privacy and Protection (CADPP) Workshop as a community effort to evaluate the effectiveness of these methodologies (in particular differential privacy approaches ) for genomic data. This workshop was organized around two specific problems. For the first problem, teams focused on the challenge of sharing aggregate human genomic data (e.g., allele frequencies) to preserve the privacy of the data donors, without undermining the utility of the data in genome-wide association studies (GWAS). This means that, in practice, the most significant genomic regions identified by a GWAS were preserved after data perturbation. In the second problem, teams were challenged to publish GWAS results (i.e., the most significant genomic regions) that meet the differentially privacy criteria under a specific privacy budget.
We devised a task for each of these two problems, based on publicly available data from the International HapMap Project  and the Personal Genome Project (PGP) . A total of six teams from two countries participated in the challenges and submitted their results to one or both tasks. In this article, we describe the design of the challenge tasks and our findings. Two companion papers in this issue of the journal [22, 23] written by participating teams provide details of the data perturbation techniques that were used in the challenge tasks.
For both tasks, we challenged teams to apply data perturbation techniques based on the differential privacy standard. These techniques attempt to add noise to the allele frequencies of the case group so that the resulting data are indistinguishable when any individual is present or absent in the case group. Below, we provide background on differential privacy, a framework developed by Dwork , for readers who are not familiar with this model.
Note that can be a dataset or a numerical value (a differentially private query result such as a statistic) depending on the application of interest. The Laplacian mechanism  is commonly applied in data perturbation methods to achieve differential privacy. It adds noise drawn from a Laplacian distribution, , to the output of a computation on the dataset. The degree of noise depends on the sensitivity of the computation. In this model, sensitivity characterizes the maximum change in the output of a function when elements change in a dataset.
Another popular method for achieving differential privacy is called the exponential mechanism, which was proposed by McSherry and Talwar . This mechanism chooses an output that is close to the optimum with respect to a utility function while preserving the differential privacy definition. The exponential mechanism takes as input a data set , an output range , a privacy parameter , and a utility function that assigns a real-valued score to every possible output , for which a higher score stands for better utility. The mechanism induces a probability distribution over the range of and sample an output in proportion to , where stands for the sensitivity of the utility function.
As mentioned, we organized the challenge around two specific problems: i) sharing allele frequencies and ii) sharing the most significant single nucleotide variants (SNVs) in GWAS.
We devised one task for each problem. For each task, we constructed two datasets based on case and control groups of individuals. The case group was composed of the subjects from the Personal Genome Project (http://www.personalgenomes.org/), which consisted of 411 participants whose genotypes were profiled on 29,757,319 SNVs across the whole genome. The control data consisted of 174 participants from the CEU population in HapMap (http://hapmap.ncbi.nlm.nih.gov/index.html.en).
Task 1: privacy preserving data sharing
The goal of this task is to achieve the highest utility under the perturbed case-control test statistics (i.e., allele frequencies). Each team was provided with the original allele frequencies for the case group, and was instructed to submit perturbed allele frequencies for this group. Note that, in the challenge, we assume the genotypes of the control group are publicly shared, and thus there is no need to protect them. In order for participating teams to evaluate the effectiveness of their data perturbation method, the control group data were made available.
Quantification of privacy risks
where is the allele type (i.e., 0 for the major allele, or 1 for the minor allele) at the SNV site j, m is the total number of SNVs, is the minor allele frequency of SNV j in the case group, and is the corresponding number for the reference group.
A statistic is then computed over all SNV sites for a subject i, representing the likelihood of that subject's presence in the case group. The probability that a subject in the case group will be re-identified is considered to be high if the LR test statistic from her SNVs is significantly greater than those of subjects who are not in the case group.
We used the LR test to evaluate the privacy risks in the submitted data after perturbation from participating teams. The submitted data were considered sufficiently protected if there were approximately the same number of individuals in the case group with LR statistics above the same threshold as those in a test group (i.e., individuals not in the case or control groups). The utility of the perturbed data were measured using the number of significant SNPs identified by the χ2-association test , which is commonly used in GWAS for case and control groups. A data perturbation method was considered to have high utility if a majority of the significant (below a p-value threshold, e.g., 10−5) SNVs reported by the χ2-association test on perturbed data were true. In other words, these SNVs were identified as truly significant on the original data using the same statistical test and a majority of truly significant SNVs could be identified on the perturbed data.
We implemented the likelihood ratio test as an online tool (http://www.humangenomeprivacy.org/service.php) for participating teams to evaluate the privacy risks in their perturbed data. A team could upload perturbed allele frequencies in the case group, and the LR statistics were calculated for each case individual using his/her actual genotype, the uploaded allele frequencies in the case group, and the actual allele frequencies in the control group (from the HapMap project). For comparison, the LR statistics were also computed for a group of test individuals (also from the HapMap project) who were not in either the case or control group. If the number of case individuals was not significantly larger than the number of test individuals who had a test statistic above a certain threshold, these case individuals were considered indistinguishable from any individuals outside the case group and the perturbed case group data were considered to be privacy-preserving.
We released two datasets for this task. The first dataset consisted of 311 SNV sites spanning of 5M bps genomic segment on human chromosome 2, which was a challenging dataset for which the re-identification risk could be alleviated (but not fully mitigated) using our baseline method (see below). The second dataset consisted of 600 SNV sites spanning a 1M bps genomic segment on human chromosome 10, which is a moderately challenging dataset whose re-identification risk can be fully mitigated by our baseline method.
To ensure the feasibility of the competition, we implemented two baseline algorithms for adding noise to data used in task 1, which we refer to as the naïve and advanced algorithms. In the naïve algorithm, we treated the allele counts across multiple SNP sites as a histogram, and added Laplacian noise to the allele counts. Note that the sensitivity of the count function over N SNVs is 2N, which can be used in to determine the amount of noise that needs to be added.
As the name implies, the naïve algorithm is simple and straightforward. However, the sensitivity in this case is high because the dimensionality of the dataset, which corresponds to the number of SNVs, is large (e.g., several hundred in the case of the two datasets used in challenge 1). The high sensitivity can resulting in adding a large amount of noise to the data, which may significantly damage the utility.
By contrast, the advanced data perturbation algorithm is designed based on the haplotypes , an intrinsic feature of the human genome. Each copy of the DNA sequence in a single human genome can be partitioned into haplotype blocks (or haploblocks); within each block, a specific combination of alleles across multiple adjacent SNV sites is called a haplotype. Due to linkage disequilibrium (LD) in human, SNVs within each block are likely inherited together. As a result, inter-haploblock SNVs are more highly correlated than intra-haploblock SNVs. Therefore, the number of potential SNV sequences in each haplotype block (across n SNV sites) is much smaller than the theoretical bound (i.e., ). Notably, the haplotype block structure of the human genome can be inferred from public data, which are independent from the sensitive case data to be protected .
Figure 1 shows the first three haploblocks of dataset 1 that were used in task 1. The SNV sites and the haplotypes are shown below the haploblock ID. The numbers next to each haplotype represent their observed frequencies in the CEU population of the HapMap project. The solid lines between haplotypes from adjacent haploblocks represent the most common combinations between the haplotypes. The thickness of the line represents the frequencies of the corresponding combinations. From the figure, it can be seen that about 50% of all SNV sequences contain the combination of haplotypes ACCGTGA in the first two haplotype blocks, and the third haplotype block consists of 125 SNVs, but has only 17 common haplotypes (instead of the theoretical upper bound of 2125). Rather than adding noise to allele counts on individual SNV sites, the advanced data perturbation approach adds Laplacian noise to the haplotype counts, which are normalized within each haplotype block.
Challenge datasets and utility evaluation
Task 2: secure release of analysis results
The goal of the second task was to evaluate the utility of competing data perturbation techniques in releasing privacy protected results of GWAS. We consider a typical GWAS, where the users are interested in knowing the identity of the top-K (e.g., K = 1, 5 or 10) most significant SNVs among all SNVs (for the statistical test) across the genome. Again, we used the χ2 test statistic for utility evaluation in the case-control analysis. We expected the released GWAS results to achieve a given privacy standard: they should be differentially private with a privacy budget of 1. The utility of the anonymized GWAS results submitted by the participating groups was evaluated based on the proportion of SNVs in the released result that were among actual top-K most significant SNVs that would have been released if preserving privacy was not a concern.
We first preprocessed the case data containing 29,755,199 SNV sites to make them compatible with the control data. Many SNV sites were genotyped for only a small portion of PGP individuals, and some other SNV sites genotyped in PGP were not genotyped in HapMap. We removed all unmatched SNV data between these two datasets and retained only the valid 2,500,781 SNV sites genotyped in at least 174 PGP participants so that we could estimate the allele frequencies on these sites accurately. Next, we eliminated PGP participants from our constructed dataset who had fewer than 90,000 genotyped valid SNV sites. After these preprocessing steps, there were still genotypes missing for some PGP participants, which were imputed based on allele frequencies from the CEU population in HapMap using fastPHASE, a commonly utilized phasing tool . Finally, we constructed the case group consisting of 201 PGP participants, and a control group consisting of 174 HapMap CEU participants, both genotyped on the same set of 106,129 SNV sites.
WIDGET: a web interface for dynamic genome-privacy evaluation
Results and discussions
A total of six teams participated in the competition. Three teams, from the University of Oklahoma, University of Texas (UT) Dallas, and McGill University, participated in the first task. The team from Indiana University collaborated with the UCSD organizers to design and implement the baseline methods, but they did not participate in the competition. Two teams, from UT Austin and Carnegie Mellon University, participated in the second task. The results of the challenge were presented at an iDASH (a National Center for Biomedical Computing based at UCSD) workshop on March 24, 2014 in La Jolla, California. All teams presented their methods and results at the workshop. The details of the methods used in the challenge are described in separate articles in this issue [29, 30]. Below, we summarize the results of the challenge and our own findings.
Results of Task 1.
# of sig SNVs
The results of the SNV-based algorithm (the naïve baseline method) showed the least utility. It missed a substantial number of truly significant SNVs, while it generated a large number false positive SNVs. The haplotype-based algorithm exhibited better performance. Almost all truly significant SNVs were detected in both datasets, and the number of false positives was reduced. The results of UT Dallas and McGill University teams showed similar results on utility of perturbed data, and nearly all significant SNVs were detected, although both had a high false positive rate. Overall, the McGill University team performed the best in the task and won the task 1 challenge.
Results of Task 2.
Small (5000 SNVs)
Large (100K SNVs)
Overall, the UT Austin team generated the best results. One reason is that their Hamming distance based utility function had relatively smaller sensitivity (therefore reduced perturbation noise) when compared to that of the χ2 test statistics used by the CMU team. From the results of task 1, we observed that it remains a challenge to share aggregate human genomic data in a privacy-preserving manner while maintaining truthful associations between the SNV data and the case-control groups. Even for a single genomic locus consisting of a few hundreds of SNVs the association was largely broken after data perturbation.
This problem will become more challenging as larger volumes of human genomic data are planned for dissemination. The performance of differential privacy-based data perturbation techniques heavily relies on the dimensionality of the data. Even though we can reduce the number of dimensions by utilizing unique semantics of genomic data, such as haplotype blocks, it is unlikely that current perturbation techniques will scale well for sharing whole human genomic data.
On the other hand, the results in task 2 showed that privacy-preserving techniques work well for sharing analytical results (by using statistics commonly used in GWAS). High utility can be preserved when only a small number (e.g., 5-10) of the most significant SNVs are of interest to the end users. Notably, this task is well aligned with the centralized computing model, in which a computing center hosts genomic data, as well as the service of customized analyses on these data, and will only release the results of these analyses .
It is essential to develop new privacy-preserving techniques that enable the sharing of human genomic data in a way that preserves the privacy of the data donors, without undermining the utility of the data or impeding its convenient dissemination. This workshop only focused on technical aspects of privacy, but there are many other components, such as social and legal controls, that need to be accounted for when erecting privacy preserving infrastructures. As an initial step to assess existing techniques, we organized the first challenge with two tasks based on real-world human genomic data in GWAS. We discovered that differential privacy-based data perturbation techniques have limitations in sharing a large volume of human genomic data, but can be used to disseminate analytical results (by using GWAS-like statistics), through services like those offered by NCBI. We plan to organize this challenge annually, and hereby welcome suggestions and comments to improve the contest in upcoming years.
Publication of this article has been funded by iDASH (U54HL108460), NHGRI (K99HG008175), NLM (R00LM011392, R21LM012060), CTSA (UL1TR000100), and NCBC-linked grant (R01HG007078)
This article has been published as part of BMC Medical Informatics and Decision Making Volume 14 Supplement 1, 2014: Critical Assessment of Data Privacy and Protection (CADPP). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedinformdecismak/supplements/14/S1.
- Ohno-Machado L: Sharing data for the public good and protecting individual privacy: informatics solutions to combine different goals. J Am Med Inform Assoc. 2013, 20: 1-10.1136/amiajnl-2012-001513.PubMed CentralView ArticlePubMedGoogle Scholar
- Willer CJ, Li Y, Abecasis GR: METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010, 26: 2190-1. 10.1093/bioinformatics/btq340.PubMed CentralView ArticlePubMedGoogle Scholar
- Green ED, Guyer MS, Institute NHGR, et al: Charting a course for genomic medicine from base pairs to bedside. Nature. 2011, 470: 204-13. 10.1038/nature09764.View ArticlePubMedGoogle Scholar
- McGuire AL, Fisher R, Cusenza P, et al: Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genet Med. 2008, 10: 495-9. 10.1097/GIM.0b013e31817a8aaa.View ArticlePubMedGoogle Scholar
- Shoenbill K, Fost N, Tachinardi U, et al: Genetic data and electronic health records: a discussion of ethical, logistical and technological considerations. J Am Med Inform Assoc. 2013, 21: 171-80.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin Z, Owen AB, Altman RB: Genomic research and human subject privacy. Science (80-). 2004, 305: 183-10.1126/science.1095019.View ArticleGoogle Scholar
- Erlich Y, Narayanan A: Routes for breaching and protecting genetic privacy. Nat Rev Genet. 2014, 15: 409-21. 10.1038/nrg3723.PubMed CentralView ArticlePubMedGoogle Scholar
- Naveed M, Ayday E, Clayton EW, et al: Privacy and Security in the Genomic Era. arXiv. 2014, 1405.1891v: 1-47.Google Scholar
- Gymrek M, McGuire AL, Golan D, et al: Identifying personal genomes by surname inference. Science (80-). 2013, 339: 321-4. 10.1126/science.1229566.View ArticleGoogle Scholar
- Homer N, Szelinger S, Redman M, et al: Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008, 4: e1000167-10.1371/journal.pgen.1000167.PubMed CentralView ArticlePubMedGoogle Scholar
- Kaye J, Heeney C, Hawkins N, et al: Data sharing in genomics--re-shaping scientific practice. Nat Rev Genet. 2009, 10: 331-5. 10.1038/nrg2573.PubMed CentralView ArticlePubMedGoogle Scholar
- Craig DW, Goor RM, Wang Z, et al: Assessing and managing risk when sharing aggregate genetic variant data. Nat Rev Genet. 2011, 12: 730-6. 10.1038/nrg3067. (accessed 11 Aug2014), [http://www.nature.com/nrg/journal/v12/n10/abs/nrg3067.html]PubMed CentralView ArticlePubMedGoogle Scholar
- Sankararaman S, Obozinski G, Jordan MI, et al: Genomic privacy and limits of individual detection in a pool. Nat Genet. 2009, 41: 965-7. 10.1038/ng.436. (accessed 18 Apr2014), [http://dx.doi.org/10.1038/ng.436]View ArticlePubMedGoogle Scholar
- Kantarcioglu M, Jiang W, Liu Y, et al: A cryptographic approach to securely share and query genomic sequences. IEEE Trans Inf Technol Biomed. 2008, 12: 606-17.View ArticlePubMedGoogle Scholar
- Kamm L, Bogdanov D, Laur S, et al: A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics. 2013, 29: 886-93. 10.1093/bioinformatics/btt066.PubMed CentralView ArticlePubMedGoogle Scholar
- Baldi P, Baronio R, Cristofaro E De: Countering gattaca: efficient and secure testing of fully-sequenced human genomes. CCS '11 Proceedings of the 18th ACM conference on Computer and communications security. 2011, 691-702.Google Scholar
- B EA, Raisaro JL, Hengartner U, et al: Data Privacy Management and Autonomous Spontaneous Security. 2014, Berlin, Heidelberg: : Springer Berlin HeidelbergGoogle Scholar
- Agrawal R, Kiernan J, Srikant R, et al: Order preserving encryption for numeric data. Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04. 2004, New York, USA: ACM Press, 563-View ArticleGoogle Scholar
- Agrawal R, Srikant R: Privacy-preserving data mining. Proceedings of the 2000 ACM SIGMOD international conference on Management of data - SIGMOD '00. 2000, New York, USA: ACM Press, 439-50.View ArticleGoogle Scholar
- Dwork C: Differential privacy. Int Colloq Autom Lang Program. 2006, 4052: 1-12.Google Scholar
- Jiang X, Sarwate AD, Ohno-Machado L: Privacy technology to support data sharing for comparative effectiveness research: a systematic review. Med Care. 2013, 51: S58-65.PubMed CentralView ArticlePubMedGoogle Scholar
- The International HapMap Project. [http://hapmap.ncbi.nlm.nih.gov/index.html.en]
- Church GM: The personal genome project. Mol Syst Biol. 2005, 1: 2005.0030-PubMed CentralPubMedGoogle Scholar
- Dwork C, McSherry F, Nissim K, et al: Calibrating noise to sensitivity in private data analysis. Theory Cryptogr. 2006, 3876: 265-84. 10.1007/11681878_14.View ArticleGoogle Scholar
- McSherry F, Talwar K: Mechanism Design via Differential Privacy. 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07). 2007, Providence, RI: : IEEE, 94-103.View ArticleGoogle Scholar
- Chernoff H, Lehmann EL, et al: The use of maximum likelihood estimates in χ^2 tests for goodness of fit. Ann Math Stat. 1954, 25: 579-86. 10.1214/aoms/1177728726.View ArticleGoogle Scholar
- Barrett JC, Fry B, Maller J, et al: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-5. 10.1093/bioinformatics/bth457.View ArticlePubMedGoogle Scholar
- Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006, 78: 629-44. 10.1086/502802.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang S, Mohammed N, Chen R: Differentially Private Genome Data Dissemination through Top-Down Specialization. BMC Med informatics Decis Mak. 2014, 14 (S1): S2-View ArticleGoogle Scholar
- Yu F, Ji Z: Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies: An Application to iDASH Healthcare Privacy Protection Challenge. BMC Med Informatics Decis Mak. 2014, 14 (S1): S3-View ArticleGoogle Scholar
- Ohno-Machado L, Bafna V, Boxwala Aa, et al: iDASH. Integrating data for analysis, anonymization, and sharing. J Am Med Informatics Assoc. 2012, 19: 196-201. 10.1136/amiajnl-2011-000538.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.