Privacy-preserving genome-wide association studies on cloud environment using fully homomorphic encryption
- Wen-Jie Lu^{1},
- Yoshiji Yamada^{3} and
- Jun Sakuma^{1, 2}
https://doi.org/10.1186/1472-6947-15-S5-S1
© Lu et al. 2015
Published: 21 December 2015
Abstract
Objective
Advanced sequencing techniques are yielding large-scale genomic data at low cost. A genome-wide association study (GWAS) targeting genetic variations that are significantly associated with a particular disease offers great potential for medical improvement. However, subjects who volunteer their genomic data expose themselves to the risk of privacy invasion; these privacy concerns prevent efficient genomic data sharing. Our goal is to present a cryptographic solution to this problem.
Methods
To maintain the privacy of subjects, we propose encryption of all genotype and phenotype data. To allow the cloud to perform meaningful computation in relation to the encrypted data, we use a fully homomorphic encryption scheme. Noting that we can evaluate typical statistics for GWAS from a frequency table, our solution evaluates frequency tables with encrypted genomic and clinical data as input. We propose to use a packing technique for efficient evaluation of these frequency tables.
Results
Our solution supports evaluation of the D′ measure of linkage disequilibrium, the Hardy-Weinberg Equilibrium, the χ^{2} test, etc. In this paper, we take the χ^{2} test and linkage disequilibrium as examples and demonstrate how these algorithms can be conducted securely and efficiently in an outsourcing setting. We demonstrate experimentally that secure outsourced computation of one χ^{2} test with 10,000 subjects requires about 35 ms and evaluation of one linkage disequilibrium with 10,000 subjects requires about 80 ms.
Conclusions
With appropriate encoding and packing techniques, cryptographic solutions based on fully homomorphic encryption for secure computations of GWAS can be practical.
Introduction
Because of recent advances in DNA sequencing technologies, the cost of DNA sequencers is dropping rapidly. As a result, the scale of genomic data used by researchers is becoming larger and larger. To conduct computations on a large-scale genomic dataset, a cloud server that provides computational resources at low cost is regarded as a promising option.
Genomic and clinical data are undeniably highly sensitive. Outsourcing these data to an external server raises concerns about the privacy of sensitive data. Consequently, when computation with genomic data is outsourced, privacy should be rigorously preserved.
The fully homomorphic encryption (FHE) scheme is attracting attention as a tool for secure outsourcing of data analysis. FHE allows data to be encrypted and arbitrary computation to be carried out on the encrypted data without decrypting it. The first FHE scheme was proposed by Gentry [1]; subsequent improvements [2, 3] provided more practical FHE schemes.
FHE has already been applied to secure outsourcing of computation that involves genomic and clinical data. Bos et al. [4] proposed a working implementation of a cloud service for private computation on encrypted health data using FHE. Lauter et al. [5] demonstrated an approach to conducting private computation on encrypted genomic data with FHE. Unfortunately, these cryptographic solutions are not sufficiently time and space efficient to conduct a GWAS-scale computation, which can involve 300k SNPs for thousands or more subjects.
In this manuscript, we present a protocol for secure outsourced analysis of large-scale genomic data using FHE. Precisely, our proposed protocol evaluates a frequency table with encrypted genomic/clinical data as input. This enables us to outsource computation of typical statistics related to GWAS securely, such as the Hardy-Weinberg Equilibrium (HWE), the χ^{2} test for independence and Linkage Disequilibrium (LD). Our method relies on the fact that integer vectors can be packed into a single ciphertext of a certain type of FHE. This packing technique enables us to evaluate a scalar product of integer vectors through a single homomorphic multiplication; such batch-style computation helps to process GWAS-scale data efficiently.
Our basic strategy is to compute allelic frequency tables and genotype frequency tables privately from encrypted genetic data. With these tables, GWAS-related statistics including the D′ measure of LD, the Pearson Goodness-of-Fit, HWE, and the χ^{2} test can be evaluated. In this work in particular, we apply our method to the χ^{2} test and LD to demonstrate the effectiveness of our protocol.
Raw genome data D^{ g }
ID | Genomic Data |
---|---|
1 | CC CG CT GG AA |
2 | AG CT CT AG CT |
3 | CT GG CC AG AA |
4 | AA GG GG AG CC |
Raw phenotype data D^{ p }
ID' | Disease Status |
---|---|
1 | Case |
2 | Control |
3 | Control |
4 | Case |
Observed allele frequency in a case-control study of M subjects.
Group | Allele A | Allele a | Total |
---|---|---|---|
case | o_{1} | o_{2} | N_{1} |
control | o_{3} | o_{4} | N_{2} |
total | ${N}_{1}^{\prime}$ | ${N}_{2}^{\prime}$ | 2M |
where ${N}_{\mathsf{\text{AA}}}^{\mathsf{\text{case}}}$ and ${N}_{\mathsf{\text{Aa}}}^{\mathsf{\text{case}}}$ are the observed population counts for genotypes AA and Aa in the case group, and ${N}_{\mathsf{\text{AA}}}^{\mathsf{\text{control}}}$ and ${N}_{\mathsf{\text{Aa}}}^{\mathsf{\text{control}}}$ are the corresponding observed counts for the control group.
In addition to the χ^{2} test, the Hardy-Weinberg Equilibrium can similarly be evaluated directly from an allelic frequency table.
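As an illustration of how an allelic frequency table feeds the χ^{2} test, the following minimal Python sketch builds the four cells from genotype counts and applies the standard Pearson statistic for a 2×2 table. The defining equations are not reproduced in the text above, so the relation "allele count = 2 × homozygote count + heterozygote count" and the closed-form 2×2 statistic are assumptions based on the usual textbook definitions; the function names and the counts are hypothetical.

```python
def allelic_table(n_AA_case, n_Aa_case, n_aa_case,
                  n_AA_ctrl, n_Aa_ctrl, n_aa_ctrl):
    """Return (o1, o2, o3, o4): counts of alleles A and a in cases and controls."""
    o1 = 2 * n_AA_case + n_Aa_case  # allele A, case group
    o2 = 2 * n_aa_case + n_Aa_case  # allele a, case group
    o3 = 2 * n_AA_ctrl + n_Aa_ctrl  # allele A, control group
    o4 = 2 * n_aa_ctrl + n_Aa_ctrl  # allele a, control group
    return o1, o2, o3, o4


def chi_square_2x2(o1, o2, o3, o4):
    """Pearson chi-square statistic of a 2x2 table (standard closed form)."""
    n1, n2 = o1 + o2, o3 + o4  # row totals N_1, N_2
    c1, c2 = o1 + o3, o2 + o4  # column totals N'_1, N'_2
    total = n1 + n2            # equals 2M for M subjects
    return total * (o1 * o4 - o2 * o3) ** 2 / (n1 * n2 * c1 * c2)


# Hypothetical counts: 50 cases and 50 controls.
table = allelic_table(30, 15, 5, 10, 20, 20)
print(table)  # (75, 25, 40, 60)
print(chi_square_2x2(*table))
```

In the outsourcing setting only the table cells need to be recovered by the researcher; the statistic itself is then computed in the clear.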
Genotype frequencies at markers M_{1} and M_{2} of M subjects.
Marker M_{2} \ Marker M_{1} | AA | Aa | aa | Total |
---|---|---|---|---|
BB | o_{11} | o_{12} | o_{13} | N_{1} |
Bb | o_{21} | o_{22} | o_{23} | N_{2} |
bb | o_{31} | o_{32} | o_{33} | N_{3} |
Total | ${N}_{1}^{\prime}$ | ${N}_{2}^{\prime}$ | ${N}_{3}^{\prime}$ | 2M |
The value ${N}_{i{i}^{\prime}j{j}^{\prime}}$ denotes the observed population counts for genotype ii^{ ' } and jj^{ ' } where $i,{i}^{\prime}\in \left\{A,a\right\}$, and $j,{j}^{\prime}\in \left\{B,b\right\}$.
We evaluate LD from Table 4. The linkage disequilibrium is calculated as D = p_{AB} - p_{A}p_{B}, where probabilities p_{AB}, p_{A} and p_{B} are computed, respectively, as (2o_{11} + o_{12} + o_{21})/2M, $\left(2{N}_{1}^{\prime}+{N}_{2}^{\prime}-{o}_{22}\right)/2M$ and (2N_{1} + N_{2} − o_{22})/2M. We omit the frequency o_{22} to avoid the problem of haplotype ambiguity, especially when only genotypes are measured. See [6] for more details.
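The computation of D from the genotype frequency table can be sketched in a few lines of Python. The formulas for p_{AB}, p_{A} and p_{B} are taken directly from the paragraph above; the function name, the table layout (rows for marker M_{2}, columns for marker M_{1}) and the example counts are hypothetical, and the number of subjects M is passed explicitly.

```python
def linkage_disequilibrium(table, num_subjects):
    """Compute D = p_AB - p_A * p_B from a 3x3 genotype frequency table.

    table[r][c]: r indexes the genotype at M_2 (BB, Bb, bb),
                 c indexes the genotype at M_1 (AA, Aa, aa).
    """
    two_m = 2 * num_subjects
    row_totals = [sum(row) for row in table]        # N_1, N_2, N_3
    col_totals = [sum(col) for col in zip(*table)]  # N'_1, N'_2, N'_3
    o11, o12 = table[0][0], table[0][1]
    o21, o22 = table[1][0], table[1][1]
    p_ab = (2 * o11 + o12 + o21) / two_m
    p_a = (2 * col_totals[0] + col_totals[1] - o22) / two_m
    p_b = (2 * row_totals[0] + row_totals[1] - o22) / two_m
    return p_ab - p_a * p_b


# Hypothetical table for M = 10 subjects.
example = [[4, 2, 0],
           [2, 0, 0],
           [0, 0, 2]]
print(linkage_disequilibrium(example, 10))
```

For this example p_{AB} = 0.6 and p_{A} = p_{B} = 0.7, so D is approximately 0.11.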
We remark that several measures of linkage disequilibrium have been proposed, including Pearson's correlation, Lewontin's D′, frequency difference and Yule's Q. Our proposal works for all these measures. However, because of space limitations, we applied our method only to Lewontin's D′ measure in the experiments. Additional details related to these measures are given in [6].
Problem settings and threat model
Problem settings
For our secure outsourcing of GWAS, we consider three stakeholders: data contributors, researchers, and the cloud. The data contributors (e.g. hospitals, research institutes or subjects) contribute private genomic or clinical data to the cloud. A researcher is an entity that wishes to conduct a GWAS. The cloud is an untrusted entity that provides researchers and data contributors with computational resources.
We assume that genotype/phenotype data of one subject can be contributed from different contributors. In other words, datasets D^{ g } and D^{ p } can be horizontally or vertically partitioned and can receive contributions from different contributors. Additionally, we assume that all subjects are identified with obfuscated IDs so that the cloud can correctly merge contributed data from two or more sources.
Given the contributed datasets D^{ g } and D^{ p }, the protocol proceeds as follows. 1) The cloud computes sufficient statistics from D^{ g } and D^{ p } without learning anything about the contributed data, and sends the resulting sufficient statistics to the researcher. 2) The researcher first reconstructs a frequency table from the sufficient statistics and then conducts GWAS.
Threat model
The goal of our system is to ensure that 1) the cloud server cannot learn anything about the private data contributed by data contributors beyond the public information, such as the total number of subjects; 2) the researcher cannot learn anything beyond what is revealed by the frequency table. Even in the case in which the cloud server colludes with some contributors, they still have no means to learn anything about the data contributed by other contributors except the final results.
We make the following assumptions. 1) The cloud server is not in collusion with the researcher to disclose private data contributed by data contributors. 2) A secure channel, e.g. SSH, exists between data contributors and the cloud.
Methods
Before describing our protocol, we first introduce the homomorphic encryption scheme and the packing technique used as its building blocks.
Building block I: homomorphic encryption
Homomorphic encryption is a cryptosystem that allows arithmetic operations to be performed on ciphertexts without decryption.
We detail a homomorphic encryption scheme based on the ring-Learning with Errors (RLWE) assumption [7]. Let n be the lattice dimension of the scheme, where n is a power of 2. Then, the message space of the scheme is the polynomial ring ${\mathbb{A}}_{t}:={\mathbb{Z}}_{t}\left[x\right]/\left({x}^{n}+1\right)$, where t is a prime number. Simply put, we identify ${\mathbb{A}}_{t}$ with the set of integer polynomials of degree up to n − 1 whose coefficients are reduced modulo t. Moreover, we represent residues modulo t in the interval (−t/2, t/2].
where c ∈ ${\mathbb{A}}_{t}$ and D_{sk}(·) is the decryption function using the corresponding decryption key sk. It is noteworthy that homomorphic multiplication is orders of magnitude slower than homomorphic addition.
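To make the plaintext space concrete, here is a minimal Python sketch of arithmetic in ${\mathbb{Z}}_{t}\left[x\right]/\left({x}^{n}+1\right)$ with coefficients kept in (−t/2, t/2]. It involves no encryption at all; it only illustrates the ring in which decrypted results live. The function names and the toy parameters n = 4, t = 17 are hypothetical.

```python
def centered_mod(a, t):
    """Representative of a modulo t in the interval (-t/2, t/2]."""
    r = a % t
    return r - t if r > t // 2 else r


def ring_add(f, g, t):
    """Add two elements of Z_t[x]/(x^n + 1) given as coefficient lists."""
    return [centered_mod(a + b, t) for a, b in zip(f, g)]


def ring_mul(f, g, n, t):
    """Multiply two elements of Z_t[x]/(x^n + 1) (schoolbook, O(n^2))."""
    out = [0] * n
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            k = i + j
            if k < n:
                out[k] += fi * gj
            else:
                out[k - n] -= fi * gj  # wrap around using x^n = -1
    return [centered_mod(c, t) for c in out]


# (1 + 2x) * 3x^3 = 3x^3 + 6x^4 = -6 + 3x^3 in Z_17[x]/(x^4 + 1)
print(ring_mul([1, 2, 0, 0], [0, 0, 0, 3], 4, 17))  # [-6, 0, 0, 3]
```

Real schemes use much larger parameters and faster multiplication; the schoolbook loop is chosen here only for readability.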
Building block II: packing technique
The BGV encryption scheme takes polynomials as plaintexts. An integer vector is transformed into a polynomial form. Then the encryption function takes as input the polynomial and outputs a ciphertext, which also forms a polynomial [10, 11]. These techniques are called packing techniques.
In the equations above, u_{ i } is the i-th element of $\overrightarrow{u}$; v_{ j } is the j-th element of $\overrightarrow{v}$. It is readily apparent that if v_{ i }, u_{ i } ∈ (−t/2, t/2] for 0 ≤ i < ℓ and ℓ ≤ n, then ρ_{fw} and ρ_{bw} respectively transform vectors $\overrightarrow{u}$ and $\overrightarrow{v}$ into elements of the ring ${\mathbb{A}}_{t}$.
The scalar product between vectors $\overrightarrow{u}$ and $\overrightarrow{v}$ is obtained from the constant term of Equation 2. The remaining 2ℓ − 2 terms are not needed.
Equation 2 allows evaluation of a scalar product between two length-ℓ encrypted vectors only by a single homomorphic multiplication. The correctness of this evaluation is presented in Theorem 1.
Theorem 1 Let n be the lattice dimension and t be a prime modulus. Let $\overrightarrow{u}$ and $\overrightarrow{v}$ denote length-ℓ vectors. Then, the constant term of the decryption ${D}_{sk}\left({\mathfrak{e}}_{u}\otimes {\widehat{\mathfrak{e}}}_{v}\right)$, where ${\mathfrak{e}}_{u}:={E}_{pk}\left({\rho}_{fw}\left(\overrightarrow{u}\right)\right)$ and ${\widehat{\mathfrak{e}}}_{v}:={E}_{pk}\left({\rho}_{bw}\left(\overrightarrow{v}\right)\right)$, gives the scalar product $\langle \overrightarrow{u},\overrightarrow{v}\rangle$ if (1) u_{ i }, v_{ j } ∈ (−t/2, t/2] for 0 ≤ i, j < ℓ; (2) ℓ ≤ n; (3) $\langle \overrightarrow{u},\overrightarrow{v}\rangle \in \left(-t/2,\; t/2\right]$.
The proof follows immediately from the derivation of Equation 2 and is therefore omitted here.
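The statement of Theorem 1 can be checked on plaintext polynomials with the following minimal Python sketch. The definitions of ρ_{fw} and ρ_{bw} are not reproduced in the text above, so the sketch assumes the commonly used forms ρ_{fw}(u) = Σ_i u_i x^i and ρ_{bw}(v) = −Σ_j v_j x^{n−j} (so that the j = 0 term becomes +v_0 because x^n = −1). No encryption is performed; the homomorphic multiplication is replaced by an ordinary multiplication in the ring, and all function names and toy parameters are hypothetical.

```python
def pack_forward(u, n):
    """Assumed rho_fw: u_0 + u_1 x + ... + u_{l-1} x^{l-1}, padded to length n."""
    return list(u) + [0] * (n - len(u))


def pack_backward(v, n):
    """Assumed rho_bw: -sum_j v_j x^{n-j}; the j = 0 term equals +v_0 since x^n = -1."""
    poly = [0] * n
    poly[0] = v[0]
    for j in range(1, len(v)):
        poly[n - j] = -v[j]
    return poly


def ring_mul(f, g, n, t):
    """Multiply in Z_t[x]/(x^n + 1), coefficients reduced into (-t/2, t/2]."""
    out = [0] * n
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            if i + j < n:
                out[i + j] += fi * gj
            else:
                out[i + j - n] -= fi * gj  # x^n = -1
    reduced = []
    for c in out:
        r = c % t
        reduced.append(r - t if r > t // 2 else r)
    return reduced


def scalar_product_via_packing(u, v, n, t):
    """Constant term of rho_fw(u) * rho_bw(v), which should equal <u, v>."""
    product = ring_mul(pack_forward(u, n), pack_backward(v, n), n, t)
    return product[0]


print(scalar_product_via_packing([1, 2, 3], [4, 5, 6], 8, 97))  # 32
```

The three conditions of the theorem are visible here: the entries and the result must fit in (−t/2, t/2], and the vectors must not be longer than n, otherwise the wrapped terms would collide with the constant term.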
Proposed secure outsourcing of GWAS
Recall that our goal is to outsource the evaluation of frequency tables efficiently while keeping the genotype/phenotype data private from the cloud servers. We first present an encoding scheme for genotype/phenotype data. With this encoding, we can securely evaluate a frequency table through scalar products using the technique introduced in the previous section. We present a protocol for secure outsourcing of GWAS in the last part of this section. The details of the protocol are described in Figure 1.
Data encoding
Let A and a be the alleles of the biallelic locus. Consequently, the genomic data at the locus is either AA, Aa, or aa. We represent each row of the genomic dataset D^{ g } as two integer vectors ${\overrightarrow{x}}^{AA}$, ${\overrightarrow{x}}^{Aa}$. Here, ${x}_{i}^{AA}$, the i-th element of ${\overrightarrow{x}}^{AA}$, represents the frequency of genotype AA at the marker locus: ${x}_{i}^{AA}=2$ for AA and ${x}_{i}^{AA}=0$ for other genotypes. ${x}_{i}^{Aa}$ is similar to ${x}_{i}^{AA}$ except that ${x}_{i}^{Aa}=1$ for Aa.
We presume that the disease status of each subject is represented by a binary variable: "disease" is represented by 1 (case) and "non-disease" is represented by 0 (control). The phenotype dataset D^{ p } for all subjects is therefore represented by a binary vector ${\overrightarrow{y}}^{case}$.
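The encoding just described can be sketched as follows. The sketch reads the two genotype vectors as being indexed by subject for a single locus, which is an interpretation of the text; the function names, the genotype strings and the "Case"/"Control" labels (borrowed from the phenotype table) are illustrative assumptions.

```python
def encode_genotypes(genotypes, allele="A"):
    """Encode one locus across subjects into the vectors x^AA and x^Aa.

    x^AA_i is 2 for genotype AA and 0 otherwise;
    x^Aa_i is 1 for the heterozygous genotype and 0 otherwise.
    """
    x_AA, x_Aa = [], []
    for g in genotypes:
        copies = g.count(allele)  # number of copies of allele A
        x_AA.append(2 if copies == 2 else 0)
        x_Aa.append(1 if copies == 1 else 0)
    return x_AA, x_Aa


def encode_phenotypes(statuses):
    """Encode disease status into the binary vector y^case."""
    return [1 if s == "Case" else 0 for s in statuses]


x_AA, x_Aa = encode_genotypes(["AA", "Aa", "aa", "aA"])
print(x_AA)  # [2, 0, 0, 0]
print(x_Aa)  # [0, 1, 0, 1]
print(encode_phenotypes(["Case", "Control", "Control", "Case"]))  # [1, 0, 0, 1]
```

With this choice, the element-wise sum of the two vectors gives the number of copies of allele A carried by each subject.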
We assume that each element of vectors is contributed from only one data contributor, i.e. ${\sum}_{q}\phantom{\rule{2.36043pt}{0ex}}\pi {\left(\overrightarrow{x},q\right)}_{j}={x}_{j}$ holds for every j. For simplicity, we view $\pi \left(\overrightarrow{x},q\right)$ as a polynomial whose j-th coefficient has value $\pi {\left(\overrightarrow{x},q\right)}_{j}$.
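The partitioning assumption can be illustrated with a short Python sketch. It assumes that π(x, q) keeps the elements held by contributor q and is zero elsewhere, which is consistent with the summation property stated above but is not spelled out in the text; the function names and the ownership list are hypothetical.

```python
def partition(x, owner, num_contributors):
    """Split x so that parts[q][j] = x[j] if contributor q holds element j, else 0."""
    parts = [[0] * len(x) for _ in range(num_contributors)]
    for j, (value, q) in enumerate(zip(x, owner)):
        parts[q][j] = value
    return parts


def aggregate(parts):
    """Element-wise sum of all contributions; only additions are needed."""
    return [sum(column) for column in zip(*parts)]


x = [2, 0, 1, 2, 0]
owner = [0, 1, 1, 0, 2]  # owner[j]: contributor that holds element j
parts = partition(x, owner, 3)
print(parts[0])          # [2, 0, 0, 2, 0]
print(aggregate(parts))  # [2, 0, 1, 2, 0]
```

Because the aggregation is a plain sum of coefficient vectors, the same step can be carried out on ciphertexts with homomorphic additions only.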
We use this data encoding in Step 1.1 and Step 1.2 in Figure 1.
Evaluate the allelic frequency table
where $\overrightarrow{1}$ is the vector whose elements are all 1. Because Table 3 has one degree of freedom and the number of subjects M is assumed to be known, the whole of Table 3 can be reconstructed from the values o_{1}, ${N}_{1}^{\prime}$ and N_{1}. Therefore, three homomorphic multiplications are needed here. Step 3.1 of Figure 1 shows that the three scalar products can be evaluated with homomorphic multiplication.
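The reconstruction of the allelic frequency table from three scalar products can be sketched on plaintext vectors as follows. The exact scalar-product expressions are not reproduced in the text above, so the choices made here (o_{1} from the allele-A vector and ${\overrightarrow{y}}^{case}$, ${N}_{1}^{\prime}$ and N_{1} from products with the all-ones vector) are assumptions derived from the encoding; the function names and the example data are hypothetical.

```python
def dot(u, v):
    """Plain scalar product; in the protocol this is one homomorphic multiplication."""
    return sum(a * b for a, b in zip(u, v))


def allelic_table_from_vectors(x_AA, x_Aa, y_case):
    """Rebuild the 2x2 allelic table from o_1, N'_1, N_1 and the known M."""
    m = len(y_case)
    ones = [1] * m
    x_A = [a + b for a, b in zip(x_AA, x_Aa)]  # copies of allele A per subject
    o1 = dot(x_A, y_case)        # allele A in the case group
    n1_prime = dot(x_A, ones)    # allele A over all subjects (N'_1)
    n1 = 2 * dot(y_case, ones)   # alleles carried by the case group (N_1)
    # Remaining cells follow from the margins and the total 2M.
    o2 = n1 - o1
    o3 = n1_prime - o1
    n2 = 2 * m - n1
    o4 = n2 - o3
    return {"o1": o1, "o2": o2, "o3": o3, "o4": o4,
            "N1": n1, "N2": n2, "N1'": n1_prime, "N2'": 2 * m - n1_prime}


# Subjects: AA (case), Aa (control), aa (control), Aa (case).
table = allelic_table_from_vectors([2, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1])
print(table["o1"], table["o2"], table["o3"], table["o4"])  # 3 1 1 3
```

The cases carry three copies of A and one of a, and the controls carry one of A and three of a, which matches a direct count.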
Evaluate the genotype frequency table
Step 3.2.1 of Figure 1 shows that the six scalar products can be computed with homomorphic multiplication as well.
Secure outsourcing GWAS protocol
The procedure for secure outsourcing of GWAS is shown in Figure 1. Recall that the evaluation of a scalar product in Equation 2 requires a forward-packed vector and a backward-packed vector. Consequently, at Step 1.2, data contributors upload four copies of each genotype data item in the form of forward-packed and backward-packed vectors. The cloud aggregates the collected ciphertexts at Step 2, which only involves homomorphic additions. Then the cloud computes the allelic frequency table and the genotype frequency table at Steps 3.1 and 3.2, respectively.
Results
We benchmarked the computational costs of our method and compared it with the method proposed by Lauter et al. [5], in which a genetic data point and a clinical data point are encoded into three bits and two bits, respectively. All experiments were conducted on computers with a 2.60 GHz CPU (Xeon; Intel Corp.) and 32 GB RAM. We measured the computation time separately for Steps 1.1 and 1.2 as the preparation time and for Steps 3.1 and 3.2 as the evaluation time. The experimental settings were as follows. 1) An artificial dataset includes 1.0 × 10^{4} subjects. 2) Q = 5 data contributors each hold the same number of data points. 3) We used 8 threads for parallel computation. 4) The parameters of the encryption scheme were set to n = 8192, t = 640007, and L = 6.
Performance of homomorphic encryption and implementation hints
Timing of fully homomorphic scheme with parameters n = 8192, t = 640007, L = 6.
Operation | Encrypt | Mult | Add | Add with Plaintext |
---|---|---|---|---|
Time (ms) | 3.08 | 7.57 | 0.032 | 0.789 |
Artificial genotype & phenotype dataset
Conclusions
From Figure 2 and 2 we can see that Lauter et al.'s cryptographic solution [5] might take about 2000 days to evaluate the χ^{2} test for one million SNPs and about 2600 days to evaluate half a million linkage disequilibrium computations. In contrast, our approach took about 12 hours and 11 hours, respectively, to conduct the same computations. We conclude that with the appropriate encoding and packing technique, secure outsourcing of GWAS using FHE can be practical.
Related work
Studies of privacy-preserving data processing in GWAS involve different techniques. Kamm et al. [12] proposed a secret sharing-based method, by which private information is divided into several parts and transferred to at least three collusion-free servers. All servers share the workload equally. The final result is aggregated from the output of each server. Computation based on secret sharing requires multiple rounds of communication between servers; the computation is secret as long as no two servers collude. Because our outsourcing approach executes the whole computation on a single cloud server, the assumed computational environment is different.
A cryptographic solution was proposed recently by Lauter et al. [5]. They constructed a method for computation on encrypted genomic data using a cryptosystem that is similar to the BGV scheme. Each genetic datum is encoded into three ciphertexts, which can cause inefficiency in both time and space. Our previous work [13] proposed a specialized approach for secure outsourcing of the χ^{2} test. In this manuscript we propose a more general approach for secure outsourcing of the χ^{2} test, HWE, LD, etc.
A method orthogonal to ours is differential privacy [14]. With perturbation noise, differential privacy ensures that the distribution of the output is insensitive to any data contributor's record, making it impossible to infer data from the obfuscated output. In our case, we can incorporate the perturbation noise in the query phase. Therefore, differential privacy can reinforce the privacy properties of our protocol.
Declarations
Acknowledgements
This work is supported by JST CREST program "Advanced Core Technologies for Big Data Integration" and is partly supported by JSPS KAKENHI 24680015.
This article has been published as part of BMC Medical Informatics and Decision Making Volume 15 Supplement 5, 2015: Proceedings of the 4th iDASH Privacy Workshop: Critical Assessment of Data Privacy and Protection (CADPP) challenge. The full contents of the supplement are available online at http://www.biomedcentral.com/1472-6947/15/S5.
Publication funding for this supplement was supported by iDASH U54HL108460, iDASH linked R01HG007078 (Indiana University), NHGRI K99HG008175 and NLM R00LM011392.
References
- Gentry C: A fully homomorphic encryption scheme. 2009, PhD thesis, Stanford University
- Brakerski Z, Gentry C, Vaikuntanathan V: (Leveled) fully homomorphic encryption without bootstrapping. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ACM. 2012, 309-325.
- Brakerski Z: Fully homomorphic encryption without modulus switching from classical gapsvp. Advances in Cryptology-CRYPTO. 2012, 868-886.
- Bos JW, Lauter K, Naehrig M: Private predictive analysis on encrypted medical data. Journal of biomedical informatics. 2014, 50: 234-243.
- Lauter K, López-Alt A, Naehrig M: Private computation on encrypted genomic data. Progress in Cryptology-LATINCRYPT. 2014, 3-27.
- Ziegler A, König IR: A Statistical Approach to Genetic Epidemiology: Concepts and Applications. 2010, John Wiley & Sons, Berlin, 247-254. 2nd edition.
- Lyubashevsky V, Peikert C, Regev O: On ideal lattices and learning with errors over rings. Proceedings of the 29th Annual International Conference on Theory and Applications of Cryptographic Techniques, Springer-Verlag. 2010, 1-23.
- HELib. Accessed: 2014-12-10, [http://shaih.github.io/HElib/index.html]
- Gentry C, Halevi S, Smart N: Homomorphic evaluation of the AES circuit. Advances in Cryptology-CRYPTO. 2012, 850-867.
- Yasuda M, Shimoyama T, Kogure J, Yokoyama K, Koshiba T: Secure pattern matching using somewhat homomorphic encryption. Proceedings of the 2013 ACM CCSW ACM. 2013, 65-76.
- Smart NP, Vercauteren F: Fully homomorphic SIMD operations. Designs, codes and cryptography. 2014, 71 (1): 57-81.
- Kamm L, Bogdanov D, Laur S, Vilo J: A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics. 2013
- Lu W, Yamada Y, Sakuma J: Efficient secure outsourcing of genome-wide association studies. IEEE Symposium on Security and Privacy Workshops, SPW 2015, San Jose, CA, USA, May 21-22, 2015. 2015, 3-6.
- Johnson A, Shmatikov V: Privacy-preserving Data Exploration in Genome-wide Association Studies. KDD '13, ACM, New York, NY, USA. 2013, 1079-1087.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.