Secure distributed GWAS computation
In this section we describe our approach to the task of secure distributed GWAS computation, namely, computing minor allele frequencies and chi-squared statistics.
According to the task specification, the size of the input at each site, i.e., the number of SNPs and the numbers of individuals in the case and control groups, is treated as public and is not protected. This means that the parameters P, N, N_c, and N_t are known to all computation participants. All remaining data (i.e., the genotypes themselves) are private.
In what follows, we first describe a basic version of our solution and then provide optimization techniques that improve the runtime of program execution.
Basic solution
For each SNP in the input, the computation is identical (and independent of other SNPs) and thus it suffices to describe the computation for a single SNP.
We divide the overall computation into three phases: input preparation, computation execution, and output reconstruction, which proceed as follows. Observe that each input site i can locally compute its local counts n_cA^(i), n_cB^(i), n_tA^(i), and n_tB^(i) for each SNP, i.e., the numbers of occurrences of the two alleles A and B in its case and control groups, respectively. This is what is done as part of input preparation, after which each input site secret shares each of its computed values and distributes the shares among all three computational parties. We use the notation [a] to denote that the value of a is secret shared among the computational parties (i.e., each party holds a different share of a).
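To make the input preparation phase concrete, the following Python sketch splits each locally computed count into three shares under a simple additive secret sharing over a prime field. This is only an illustration of the interface: the actual framework may use a different linear secret sharing scheme, and the modulus, helper names, and genotype encoding (0/1/2 copies of allele B per individual) are our own assumptions.

import secrets

P_FIELD = 2**61 - 1  # illustrative prime modulus; the deployed modulus depends on the framework

def share(value, n_parties=3, p=P_FIELD):
    # Split an integer into n additive shares that sum to value mod p.
    shares = [secrets.randbelow(p) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % p)
    return shares

def reconstruct(shares, p=P_FIELD):
    return sum(shares) % p

def prepare_snp(genotypes_case, genotypes_control):
    # Local counting at input site i for one SNP; genotypes are assumed to be
    # encoded as 0/1/2 copies of allele B per (diploid) individual.
    n_cB = sum(genotypes_case)
    n_cA = 2 * len(genotypes_case) - n_cB
    n_tB = sum(genotypes_control)
    n_tA = 2 * len(genotypes_control) - n_tB
    return {name: share(v) for name, v in
            [("n_cA", n_cA), ("n_cB", n_cB), ("n_tA", n_tA), ("n_tB", n_tB)]}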
During computation execution, the computation proceeds on the shares to compute MAF and chi-squared statistics using equations 1 and 2 and secure building blocks from the previous section. We choose to perform only the private portion of the computation on secret shares, while postponing the computation with public constants to the output reconstruction phase. This is done for performance reasons to reduce the size of values used in the computation.
To calculate the MAF for each SNP in parallel, the computation follows equation 1 with provisions to make the computation data-oblivious. That is, each computational party performs the following steps:
1 [n_A] = Σ_i ([n_cA^(i)] + [n_tA^(i)]);
2 [n_B] = Σ_i ([n_cB^(i)] + [n_tB^(i)]);
3 [c] = LT([n_A], [n_B], ℓ1);
4 [res1] = [c]([n_A] − [n_B]) + [n_B];
The first two steps, which aggregate the input values, are local to each computational party, but steps 3 and 4, which produce the minimum of n_A and n_B, involve joint computation by all of them. We subsequently discuss the choice of the parameter ℓ1.
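As a sanity check of steps 1-4, the following cleartext Python sketch mirrors the same arithmetic with LT replaced by a plain comparison; in the protocol the identical expressions are evaluated on shares, with LT and the product by [c] being the only interactive operations.

def maf_numerator(site_counts):
    # site_counts: per-site tuples (n_cA, n_cB, n_tA, n_tB) for one SNP
    n_A = sum(cA + tA for (cA, cB, tA, tB) in site_counts)  # step 1
    n_B = sum(cB + tB for (cA, cB, tA, tB) in site_counts)  # step 2
    c = 1 if n_A < n_B else 0                               # step 3: LT
    res1 = c * (n_A - n_B) + n_B                            # step 4: selects min(n_A, n_B)
    return res1  # the output party later divides by the public total allele count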
To compute the chi-squared statistics for each SNP in parallel, we similarly follow the computation in equation 2 using the following steps:
1 [n_cA] = Σ_i [n_cA^(i)];
2 [n_cB] = Σ_i [n_cB^(i)];
3 [n_tA] = Σ_i [n_tA^(i)];
4 [n_tB] = Σ_i [n_tB^(i)];
5 [a] = Mul([n_cA], [n_tB]);
6 [b] = Mul([n_cB], [n_tA]);
7 [c] = Mul([n_cA] + [n_tA], [n_cB] + [n_tB]);
8 [d] = Mul([a] − [b], [a] − [b]);
9 [res2] = Div(k · [d], [c], ℓ2);
Lines 5, 6, and 8 compute the numerator in equation 2 and line 7 its denominator (multiplication by the public N, N_c, and N_t is omitted). The numerator is then scaled up by a factor of k to ensure that using integer division will provide sufficient precision of the result. The bitlength of k will be on the order of the precision of the answer in bits. We defer discussion of the choice of ℓ2 to the next section.
At the end of the computation, all computational parties send their shares of the results res1 and res2 for each SNP to one of the input sites, which reconstructs the values. The output party then sets the result of the MAF computation to res1/N' and the result of the chi-squared computation to N · res2/(k · N_c · N_t).
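For concreteness, the following cleartext Python sketch traces what is computed for one SNP, assuming (consistently with the public constants omitted above) that equation 2 is the standard allelic 2×2 chi-squared statistic N(n_cA·n_tB − n_cB·n_tA)²/(N_c·N_t·(n_cA + n_tA)(n_cB + n_tB)). Only the scaled quotient k·d/c is computed on shares, and the scaling factor k used here is illustrative.

def chi_squared_reference(n_cA, n_cB, n_tA, n_tB, k=10**6):
    # Private part, mirroring lines 5-9 (integer arithmetic with scaling factor k).
    a = n_cA * n_tB
    b = n_cB * n_tA
    c = (n_cA + n_tA) * (n_cB + n_tB)
    d = (a - b) ** 2
    res2 = (k * d) // c  # Div with truncation
    # Public post-processing by the output party (assumed form of equation 2);
    # N_c, N_t, and N are public and are recomputed here only for the example.
    N_c = n_cA + n_cB
    N_t = n_tA + n_tB
    N = N_c + N_t
    return N * res2 / (k * N_c * N_t)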
Optimizations
We applied several optimizations to the computation to improve its runtime.
1. The nature of the computation in this task allows all interactive operations to run in parallel in a single batch for all SNPs. That is, all P comparisons corresponding to line 3 of the MAF computation are executed simultaneously. The same applies to line 4 of the MAF computation and lines 5-9 of the chi-squared computation.
We can further reduce the number of rounds in the chi-squared computation by running independent interactive operations at the same time. In particular, this means that lines 5-7 of the computation can be executed in a single round.
2. We modify the chi-squared computation to use floating-point instead of integer division after converting both operands d and c to a floating-point representation. This is primarily driven by the fact that the performance of the division protocol we rely on (described in [15, 9]) depends on the maximum of the bitlengths of its arguments, and we can use substantially shorter values with floating-point division than with integer division (i.e., the bitlength can be comparable to that of k instead of the sum of the bitlengths of d and k). The savings noticeably outweigh the cost of integer-to-floating-point conversion, or normalization (to use floating-point division we need to normalize two values, while integer division needs to compute the normalization of only one of its arguments). We additionally slightly optimize the integer-to-floating-point conversion and the floating-point division compared to those given in [9] using information known about d and c (e.g., the fact that they are positive).
3. For performance reasons, we want to set the parameters ℓ1 and ℓ2 (as well as the bitlength of secret-shared values) to the minimum values that guarantee correctness. When the bitlengths of the arguments to a comparison or a division differ, the larger value is to be used. In particular, for ℓ1, the largest value of n_A or n_B in the LT protocol appears when only one nucleotide is present in all genotypes in both the case and control groups (i.e., max(n_A, n_B) = N' and min(n_A, n_B) = 0), and we set ℓ1 = ⌈log N'⌉ + 1 (where the extra 1 is due to the specifics of the LT operation). For ℓ2, the largest value of d or c appears when n_cB = n_tA = 0, which leads to b = 0, c = a = n_cA · n_tB, and d = a², and we set ℓ2 to the bitlength of the maximum value of d. For integer division, this value of ℓ2 needs to be additionally incremented by the bitlength of the precision factor k, but fortunately, after we switch to the floating-point representation, we can reduce the bitlength to the desired precision of the result because the values are represented in a normalized form.
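As an illustration of this parameter selection, the sketch below derives ℓ1 and ℓ2 from the public sizes under the assumptions stated above (N' is the total allele count, and the maxima of n_cA and n_tB are taken to be the case and control allele counts, respectively); the exact constants used in the deployed implementation may differ.

import math

def choose_bitlengths(case_alleles, ctrl_alleles, k_bits):
    N_prime = case_alleles + ctrl_alleles            # total allele count N'
    l1 = math.ceil(math.log2(N_prime)) + 1           # comparison bitlength for LT
    d_max = (case_alleles * ctrl_alleles) ** 2       # worst case: d = a^2 with b = 0
    l2 = d_max.bit_length()                          # covers max(d, c)
    l2_integer_division = l2 + k_bits                # only needed before the floating-point switch
    return l1, l2, l2_integer_division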
Secure distributed genomic Hamming distance computation
We next concentrate on the second task of securely computing the Hamming distance between a pair of genomic datasets in a distributed setting.
According to the task specification, the numbers of records in the two datasets are known to all parties, and we denote them by N1 and N2, respectively. The content of the records, however, is private (in particular, the values that the fields CHROM, POS, REF, ALT, and SVTYPE take). Because only records with SVTYPE equal to SUB or SNP are relevant for the computation, for ease of notation we refer to them as SUB and SNP records, respectively.
The high-level idea behind our solution is as follows: we first let each input site extract the SUB and SNP records from its dataset and pad the resulting set with dummy records to hide its size. After each input site secret shares its records across all computational parties, the parties run a set operation to identify all records that appear in both datasets (conceptually similar to set intersection) using ⟨CHROM, POS⟩ as the key, as well as all records that appear in only one dataset (conceptually similar to symmetric difference). We accomplish this by obliviously sorting all records from both datasets using Batcher's mergesort [2] and then scanning the sorted set, examining every two adjacent elements to determine whether the Hamming distance needs to be incremented by one for that pair.
At the time of competition preparation, Batcher's mergesort was available to us as one of the best options for oblivious sorting (based on the overall amount of work as well as its round complexity). It is particularly well suited to this task because it is a recursive algorithm that works by first sorting the first and second halves of its input set and then merging the sorted halves. In our setup this means that the input datasets can be pre-sorted by each input site locally and only the merge step needs to be run jointly. Unfortunately, Batcher's mergesort (including the merge step) has the drawback that the number of elements in the input set has to be a power of 2, which may unnecessarily increase the runtime.
In what follows, we start by describing a basic solution in detail in the first subsection and then discuss two optimizations in the two subsequent subsections.
Basic solution
As before, we divide the overall computation into three phases: input preparation, joint computation execution, and output reconstruction.
Input preparation. Each input site i extracts all SUB and SNP records from its dataset and pads them with dummy records to size N_i + 1 (we require at least one dummy record). (If the combined fraction of SUB and SNP records is guaranteed to be within a certain fraction α < 1 of the total size for typical genomic datasets, then the datasets can be padded to αN_i + 1 records. For this competition, α could not be lower than 1.) Furthermore, to meet Batcher's mergesort requirements, the input parties additionally pad the sets with dummy records so that the combined size of the two datasets is 2^q, where q = ⌈log_2(N1 + N2 + 2)⌉. We use this newly formed set as the input into the computation and refer to it as a "dataset".
Next, the values in each record need to be converted to integers, which we accomplish as follows:
1. The location ⟨CHROM, POS⟩ is represented as V1 = CHROM · L + POS, where L is the maximum length of any existing human chromosome. CHROM ranges from 1 to 24 (22 autosomes, plus X and Y), and for dummy records we set V1 = 25L + 1 to avoid overlap with real records.
2. The REF and ALT fields are represented as strings of nucleotides in the input. To produce their numeric counterparts, we map each nucleotide value to a two-bit integer (e.g., A = 0, C = 1, G = 2, and T = 3) and concatenate the two-bit integers from a string to form a single number. To hide information about the size of the fields, the values need to be represented using the same bitlength for all records, based on the maximum string length M. Because shorter strings need to be padded to the maximum size, we need to ensure that strings of different sizes will always compare as different (i.e., the padding character cannot be one of 0-3).
Instead of introducing a separate padding character, which would increase the bitlength of one character from 2 to 3 bits, we append the string length at the end of the string and use 0 for padding. Thus, all strings are represented using 2M + log M bits. Let V2 and V3 denote the numeric values of the REF and ALT fields in a record. V2 and V3 are set to 0 for dummy records.
In our implementation with M = 100, we partition representation of V2 and V3 into three blocks of size (2M + log M )/3 each. This still requires comparing all 2M + log M bits when two such values need to be compared, but reduces the size of secret shared values and thus the cost of the corresponding arithmetic. When M is large, V2 and V3 can instead be set to the hash of REF and ALT strings. This would guarantee constant size representation regardless of the value of M.
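The following Python sketch illustrates the encoding of a REF or ALT string (two bits per nucleotide, zero padding to M characters, and the string length appended in the low-order bits); the exact bit layout is our assumption and is shown only to make the 2M + log M bound concrete.

import math

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_allele(s, M=100):
    # Encode a string of length <= M into an integer of 2M + ceil(log2(M+1)) bits.
    assert 1 <= len(s) <= M
    v = 0
    for ch in s:                      # two bits per nucleotide
        v = (v << 2) | NUC[ch]
    v <<= 2 * (M - len(s))            # zero padding up to M characters
    len_bits = math.ceil(math.log2(M + 1))
    return (v << len_bits) | len(s)   # append the length so that padded strings remain distinct

# encode_allele("AC") != encode_allele("A"), even though "A" is a padded prefix of "AC"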
After computing a 3-tuple (V1, V2, V3) for each record in its dataset, input site i sorts the records by the V1 field to form the set S_i, generates shares of all records in S_i, and distributes them to the computational parties (we slightly abuse notation and use [S_i] to denote shares of all values in S_i). It also distributes shares of the number of dummy records d_i in S_i.
Computation execution. After receiving the two sorted sets of ([V1], [V2], [V3]) triples from both input sites, the computational parties run an oblivious merge using [V1] as the key. The algorithm is built from an input-independent sequence of compare-and-exchange operations. Each operation takes two integers and either swaps them or leaves them unchanged, so that the first output (min) is never larger than the second (max). In our framework, it is implemented as follows:
1 [c] = LT([a], [b], ℓ);
2 [min] = [c]([a] − [b]) + [b];
3 [max] = [c]([b] − [a]) + [a];
Note that lines 2 and 3 involve only a single multiplication (i.e., first compute [c]([a] − [b]) and then set [min] and [max] with no additional interaction). When applying this operation to our setting, comparisons on line 1 are performed using [V1]'s, but the entire records ([V1], [V2], [V3]) are swapped (or left unchanged) using comparison results [c].
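The cleartext Python sketch below mirrors this compare-and-exchange and extends it to whole (V1, V2, V3) records, reusing the single comparison bit for every field as described above. In the protocol, LT and the products with [c] are the only interactive operations; the sketch is meant only to check the selection arithmetic.

def compare_exchange(rec_a, rec_b):
    # Returns (min_rec, max_rec) ordered by the key V1 (field 0) using the bit c.
    c = 1 if rec_a[0] < rec_b[0] else 0          # line 1: LT on the keys
    out_min, out_max = [], []
    for a, b in zip(rec_a, rec_b):               # apply the same bit to every field
        t = c * (a - b)                          # one product per field
        out_min.append(t + b)                    # line 2: c(a - b) + b
        out_max.append(a - t)                    # line 3: c(b - a) + a
    return tuple(out_min), tuple(out_max)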
The computational parties next compute the Hamming distance as specified in Algorithm 2. Sets S1 and S2 represent sorted input triples of the input parties and parameters ℓ1 and ℓ2 correspond to the bitlengths of fields V1 and V2 (or V3), as discussed previously.
Because a specific location V1 appears at most once in each of the input datasets (except for dummy records), there will be at most two records with the same V1 in the combined set. The algorithm works by looking at each pair of consecutive elements in the combined sorted set and adds 1 to dist if this is the first time the location appears in the list (i.e., a_i = 0 on line 4). The distance is incremented automatically for the first record (dist = 1 on line 2). Then, if a location appears for the second time (a_i = 1 on line 4), the algorithm examines the values of the V2 and V3 fields (lines 5-6) to determine whether the condition for incrementing the distance is satisfied (i.e., b_i = 1 and c_i = 0). If not (b_i = 0 or c_i = 1), dist is decremented by 1 to compensate for the fact that it was increased during the previous loop iteration. All dummy records collectively contribute distance −d1 − d2 + 2 (i.e., 0 for the first two records and −1 for each additional record), which is why we adjust the computed distance at the end (line 9). We note that all loop iterations and all comparisons within a loop iteration can be carried out in parallel.
Algorithm 2 SecureHD([S1], [S2], [d1], [d2])
1: (⟨[V1_i], [V2_i], [V3_i]⟩)_{i=1,...,2^q} = Merge([S1], [S2])
2: dist = 1
3: for i = 2 to 2^q do
4: [a_i] = EQ([V1_i], [V1_{i−1}], ℓ1)
5: [b_i] = EQ([V2_i], [V2_{i−1}], ℓ2)
6: [c_i] = EQ([V3_i], [V3_{i−1}], ℓ2)
7: dist = dist + (1 − [a_i]) + [a_i]([b_i](1 − [c_i]) − 1)
8: end for
9: dist = dist + [d1] + [d2] − 2
10: return dist
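The counting argument above can be checked on small examples with the following cleartext Python simulation of Algorithm 2, in which plain equality tests stand in for the secure comparisons and an ordinary sort stands in for the oblivious merge; it assumes the inputs already contain d1 and d2 dummy records that share one reserved location, as produced during input preparation.

def secure_hd_cleartext(S1, S2, d1, d2):
    # S1, S2: lists of (V1, V2, V3) triples, dummy records included.
    merged = sorted(S1 + S2, key=lambda r: r[0])
    dist = 1                                          # line 2
    for i in range(1, len(merged)):
        a = int(merged[i][0] == merged[i - 1][0])     # line 4: same location?
        b = int(merged[i][1] == merged[i - 1][1])     # line 5: same REF encoding?
        c = int(merged[i][2] == merged[i - 1][2])     # line 6: same ALT encoding?
        dist += (1 - a) + a * (b * (1 - c) - 1)       # line 7
    return dist + d1 + d2 - 2                         # line 9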
Output reconstruction is straightforward and consists only of receiving and combining shares of the computed Hamming distance.
Separating SUB and SNP records
As the first significant optimization, we separate computation of the distance for SNP and SUB records and consequently reconstruct the overall distance from the two values. The main reason for this is to reduce the time comparisons of V2 and V3 take. Recall that SNP records contain a single character in REF and ALT fields, while SUB records can contain longer strings. In the genomic datasets we worked with, a great majority of all records were SNP records that can be processed using 2-bit comparisons for V2 and V3 (i.e., ℓ2 = 2). In the basic solution, however, the bitlength had to be unnecessarily increased by two orders of magnitude for most records to meet privacy requirements. Thus, the idea consists of extracting two sets from each input dataset: one consisting of SNP records and another consisting of SUB records. Then the distance for SUB records is computed separately from the distance for SNP records and the sum is returned as the result.
This strategy works well if all records with the same ⟨CHROM, POS⟩ pair are always marked with the same type across all datasets. It is, however, possible for two datasets to contain a SUB and a SNP record corresponding to the same location. Because of the existence of such records, the Hamming distance will not be computed correctly if we simply add the two distances together. That is, if one record appears in the SUB set and another with the same location appears in the SNP set, they collectively contribute 2 to the overall distance instead of the correct 0 or 1 (depending on other attributes). To address this, we need to find all such pairs and compensate for the difference they introduce, which is the most subtle part of our solution. We next provide more detail about the solution and highlight the differences from the basic scheme.
Input preparation. Given a dataset, an input entity produces two subsets: one composed of the SUB records and one composed of both the SUB and SNP records from the dataset. As before, both sets need to be padded with dummy records to hide their number and to make the combined sizes powers of 2, namely 2^{q_s} and 2^q, where q_s = ⌈log(α_s(N1 + N2) + 2)⌉, q = ⌈log(α(N1 + N2) + 2)⌉, and α_s (resp., α) denotes the maximum fraction of SUB (resp., SNP and SUB) records in genomic datasets (we were given α = 1 and α_s = 0.3). All records in the SUB set are converted to (V1, V2, V3) triples as before. For the SNP&SUB set, one-character REF and ALT fields in SNP or SUB records are represented using the integers 0-3, while these fields of longer length in SUB records are represented using the integer 4 (i.e., the V2 and V3 fields are 3 bits long). This guarantees that comparing a one-character REF or ALT field in a SNP record with a longer REF or ALT field in a SUB record will result in inequality. We also add another binary attribute V4 to each record of the SNP&SUB set that indicates whether the record is of SUB type (V4 = 0) or SNP type (V4 = 1). We set V4 = 0 in dummy records.
Each input entity now produces shares of (V1, V2, V3) for its SUB set and of (V1, V2, V3, V4) for its SNP&SUB set (together with the number of dummy records in each set) and distributes them to the computational parties. We note that computation with the SNP&SUB sets can be performed using shorter bitlengths, which results in faster arithmetic, and thus we set up two different instances of the secret sharing scheme and process the SUB sets separately from the SNP&SUB sets.
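A small Python sketch of how a single record could be encoded for the SNP&SUB set under these rules (the field names, the chromosome-length parameter L, and the tuple layout are illustrative assumptions):

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_for_snp_sub_set(chrom, pos, ref, alt, svtype, L):
    # 3-bit V2/V3 values: single characters map to 0-3, longer SUB strings collapse to 4.
    V1 = chrom * L + pos
    V2 = NUC[ref] if len(ref) == 1 else 4
    V3 = NUC[alt] if len(alt) == 1 else 4
    V4 = 1 if svtype == "SNP" else 0   # type bit: SNP = 1, SUB (and dummies) = 0
    return (V1, V2, V3, V4)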
Computation execution. To compute the Hamming distance correctly, we now distinguish between different cases: (i) SUB records that don't have a SNP record with identical location in the other dataset, (ii) SNP records that don't have a SUB record with identical location in the other dataset, and (iii) records that have another record with identical location but different type present in the other dataset. Let N0 denote the number of records in the third category.
Algorithm 3 SecureHD2([S1], [S2])
1:
2:
3: for i = 2 to 2^q do
4:
5:
6:
7:
8:
9:
10: end for
11: return dist
The computational parties execute Algorithm 2 on the two SUB sets. This computes the distance corresponding to the records in the first category, but also introduces the offset N0. The parties then execute Algorithm 3 on the two SNP&SUB sets, which computes the distance corresponding to the second and third categories and additionally compensates for the offset. The output is then the sum of the distances computed by the two algorithms.
In Algorithm 3, when examining each pair of consecutive records, we only consider those pairs that contain at least one SNP record (d_i = 1 on line 9). Furthermore, similar to Algorithm 2, when observing a location for the first time, we add 1 to the Hamming distance, but only if the record is of SNP type (lines 2 and 9). If a location appears for the second time, we undo the previous increment if b_i = 0 or c_i = 1 as before, but only if the record preceding the current one is of SNP type (line 9). By doing that, we are able to compute the distance corresponding to records of the second and third types without introducing errors. The offset N0 is compensated for by the last term a_i·e_i on line 9, which counts the number of pairs of consecutive records that have the same location (a_i = 1) but different types (e_i = 1). OR([x], [y]) and XOR([x], [y]) are implemented as [x] + [y] − Mult([x], [y]) and [x] + [y] − 2Mult([x], [y]), respectively (the computation of d_i and e_i reuses the same multiplication result).
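These Boolean operations on shared bits can be illustrated with the following cleartext sketch, which mirrors the formulas above and reuses the single product for both results (in the protocol, Mult is the only interactive step):

def or_xor(x, y):
    # For bits x, y in {0, 1}: returns (x OR y, x XOR y) using one multiplication.
    prod = x * y
    return x + y - prod, x + y - 2 * prod

# or_xor(1, 1) == (1, 0); or_xor(1, 0) == (1, 1); or_xor(0, 0) == (0, 0)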
Note that dummy records do not introduce any error in Algorithm 3. That is, d_i = 0 and e_i = 0 when both records i and i − 1 are dummy, and the expression on line 9 evaluates to 0. Similarly, when record i − 1 is real while record i is a dummy, that expression also evaluates to 0 because a_i = 0 and record i is not of SNP type (recall that V4 = 0 for dummy records).
After computing the distances corresponding to the SUB and SNP&SUB sets, the parties need to convert shares of one of them into shares of the same value in the secret sharing setup used by the other algorithm. Then the distances can be locally added to compute the overall result. Output reconstruction is performed as before by exchanging the shares and recovering the result.
The performance gain achieved by this optimization highly depends on the values of the public parameters α_s, α, and M. While the gain stems from using shorter values for V2 and V3 in the SNP&SUB sets, the total number of records processed by this solution is greater than the 2^q records processed in the basic scheme. Therefore, this optimization is recommended for relatively small α_s and large M. In our experiments with α_s = 0.3, α = 1, and M = 100, we observed approximately a 30% performance improvement compared to the basic scheme.
Reducing set size
Our second optimization concerns the oblivious sort: it removes the requirement that the input size be a power of 2 for the merge step of Batcher's mergesort. To explain how our optimization works, we need to provide additional details about Batcher's mergesort algorithm.
Recall that the merge step takes two sorted sets L1 = (a_1, a_2, ..., a_m) and L2 = (b_1, b_2, ..., b_n), where m + n is a power of 2. It first combines them into a single sequence that first monotonically increases and then decreases, L = (a_1, a_2, ..., a_m, b_n, ..., b_2, b_1), after which a sequence of compare-and-exchange operations is executed as specified by the following pseudo-code:
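A cleartext Python sketch of this compare-and-exchange schedule is given below; it assumes the textbook bitonic merge network (the original pseudo-code is not reproduced here) and matches the iteration-by-iteration behavior described next.

def bitonic_merge(L):
    # Sort a bitonic sequence L whose length is a power of 2 using compare-and-exchange only.
    n = len(L)
    gap = n // 2
    while gap >= 1:                                  # log(m + n) iterations of the outer loop
        for start in range(0, n, 2 * gap):           # each current sublist of size 2 * gap
            for i in range(start, start + gap):
                if L[i] > L[i + gap]:                # compare-and-exchange
                    L[i], L[i + gap] = L[i + gap], L[i]
        gap //= 2
    return L

# bitonic_merge([3, 4, 6, 9, 12, 10, 5, 2]) == [2, 3, 4, 5, 6, 9, 10, 12]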
After executing the first iteration of the outer loop, the first (second) half of L will contain (m + n)/2 smallest (resp., largest) elements of L although they are not necessarily sorted. After its second iteration, the ith quarter of L will contain the ith quarter of elements in the final sorted list for i = 1, ..., 4. This process continues until each sublist contains one element and L becomes sorted. Notice that the algorithm uses log(m + n) iterations of the outer loop, and in each iteration every element in the list is used in a compare-and-exchange operation, which is the reason for requiring the size of the list to be a power of 2.
Consider an example with input L1 = (3, 4, 6, 9, 12) and L2 = (2, 5, 10), which is combined into L = (3, 4, 6, 9, 12, 10, 5, 2). In the first iteration of the outer loop, compare-and-exchange operations are performed on the pairs (3, 12), (4, 10), (6, 5), and (9, 2), and the resulting list is (3, 4, 5, 2, 12, 10, 6, 9). In the second iteration, comparisons are performed on (3, 5), (4, 2), (12, 6), (10, 9) and produce (3, 2, 5, 4, 6, 9, 12, 10). In the last iteration, we compare every pair of consecutive elements (3, 2), (5, 4), (6, 9), (12, 10), which results in the final sorted list (2, 3, 4, 5, 6, 9, 10, 12). If instead L2 = (2, 5), so that m + n = 7 is not a power of 2, then after all iterations the element 2 will remain at the end of the list, leaving it unsorted, because that element never has a pair to use in a comparison.
We next proceed with describing our strategy for generalizing the merge operation to work with inputs of arbitrary sizes, which might be of independent interest. There is no need to pad the input in the beginning to make the overall input size a power of 2; instead, dummy records are added throughout algorithm execution as needed. This means that earlier loop iterations use a smaller number of elements and are therefore faster than in the original algorithm. In particular, at each loop iteration, if the size of a sublist is odd, we append a copy of its last element to the end. This ensures that comparisons can be performed at the current level while still preserving the necessary properties of the (partially) sorted list. For example, suppose we want to merge (3, 6, 8) and (5, 7). Before the first iteration, 5 is appended to the list (3, 6, 8, 7, 5) because the number of elements in it is odd, and we obtain (3, 5, 5, 7, 6, 8) at the end of that iteration. At the time of the second iteration, the sizes of the sublists (3, 5, 5) and (7, 6, 8) are also odd, and they are modified to (3, 5, 5, 5) and (7, 6, 8, 8). In the next iteration no additional padding is used, and we obtain (3, 5, 5, 5, 6, 7, 8, 8) at the end of the algorithm.
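The sketch below shows this generalized merge in cleartext Python, under the assumption that a copy of a sublist's last element is appended immediately before that sublist is processed in a given iteration; it reproduces the (3, 6, 8) and (5, 7) example above, and in the Hamming distance setting each element would be a whole record compared by its key.

def generalized_bitonic_merge(L1, L2):
    # Merge two sorted lists of arbitrary sizes; odd-sized sublists are padded
    # with a copy of their last element right before they are processed.
    sublists = [L1 + L2[::-1]]                 # bitonic: increasing then decreasing
    while max(len(s) for s in sublists) > 1:
        next_level = []
        for s in sublists:
            if len(s) == 1:
                next_level.append(s)
                continue
            if len(s) % 2 == 1:
                s = s + [s[-1]]                # duplicate the last element
            half = len(s) // 2
            for i in range(half):              # compare-and-exchange i and i + half
                if s[i] > s[i + half]:
                    s[i], s[i + half] = s[i + half], s[i]
            next_level.extend([s[:half], s[half:]])
        sublists = next_level
    return [s[0] for s in sublists]

# generalized_bitonic_merge([3, 6, 8], [5, 7]) == [3, 5, 5, 5, 6, 7, 8, 8]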
In the context of the Hamming distance computation, we similarly make a copy of the entire last record of a subset as needed during the merge step. More importantly, after the list is sorted, we need to ensure that the Hamming distance is computed correctly, because the introduction of repeated records creates inaccuracies in Algorithm 2. Now two consecutive records with the same location in the sorted set may correspond to (i) two records from the original datasets, (ii) two copied records, or (iii) one original and one copied record. Let a_i and a_j be two different records with the same location in the original datasets. If they get copied during the merge as a'_i and a'_j, the relative order of these records in the sorted list can be arbitrary (e.g., a_i, a'_i, a_j, a'_j or a_i, a_j, a'_i, a'_j, etc.), and they may contribute more than 1 to the computed distance.
We address the problem by modifying the locations of records in the datasets so that (i) two records originally with the same location are assigned locations that differ by 1 and (ii) two records originally with different locations are assigned locations that differ by more than 1. By doing that, a pair of consecutive records with the same (modified) location in the sorted set is guaranteed to correspond to either two copied records or one original record and its copy. In either case, the Hamming distance should not be affected. We implement this change by setting the location to 4V1, where V1 is the original location, for records from the first input site and to 4V1 + 1 for records from the second input site.
Algorithm 4 SecureHD3([S1], [S2])
1:
2: dist = 1
3: for i = 2 to 2^q do
4:
5:
6:
7:
8:
9: end for
10: return dist
With this solution, the input sites prepare their input datasets as in the basic scheme, but pad each set to size N_i + 1 instead of requiring the combined size to be a power of 2. The computational parties can locally modify the V1's in the input records, run the improved merge, and compute the Hamming distance as specified in Algorithm 4. The algorithm has two major differences compared to Algorithm 2: (i) when examining each pair of consecutive records, only pairs with different locations can contribute to the distance (a_i = 0 on line 8), and (ii) when the locations of two consecutive records differ by 1 (d_i = 1 on line 8), they are treated as having the same location in Algorithm 2, while when the locations differ by neither 1 nor 0 (d_i = 0 and a_i = 0 on line 8), they are treated as having different locations in Algorithm 2.
Dummy records inserted by each input site into its input dataset do not affect the correctness of the Hamming distance computation that uses this optimization (including the combined solution in Algorithm 5). This is because the first dummy record from the first input set will result in the distance being incremented by 1, while the first dummy record from the second input set will result in the distance being decremented by 1. All subsequent dummy records from the first or the second input dataset do not modify the distance (because all records with the same V1 are ignored).
We recently became aware of a sorting algorithm [16] that generalizes Batcher's bitonic sort to input sizes which are not a power of 2 without adding extra records during the sorting procedure. The algorithm results in the same asymptotic complexity as our solution, but performs fewer compare-and-exchange operations in each iteration (because dummy records are not added), which is expected to lead to better performance than our algorithm. We plan to provide both theoretical and empirical comparison of this algorithm with our solution as future work.
Our final solution consists of using both optimizations from the previous and current subsections, and we summarize it in Algorithm 5. We omit its explanation due to space considerations.
Algorithm 5 SecureHD4
1: dist1 = SecureHD3
2:
3:
4: for i = 2 to 2^q do
5:
6:
7:
8:
9:
10:
11:
12: end for
13: return dist1 + Convert(dist2)