Privacy-preserving dataset combination and Lasso regression for healthcare predictions
BMC Medical Informatics and Decision Making volume 21, Article number: 266 (2021)
Abstract
Background
Recent developments in machine learning have shown its potential impact for clinical use such as risk prediction, prognosis, and treatment selection. However, relevant data are often scattered across different stakeholders and their use is regulated, e.g. by GDPR or HIPAA.
As a concrete use case, hospital Erasmus MC and health insurance company Achmea have data on individuals in the city of Rotterdam, which would in theory enable them to train a regression model in order to identify high-impact lifestyle factors for heart failure. However, privacy and confidentiality concerns make it infeasible to exchange these data.
Methods
This article describes a solution where vertically-partitioned synthetic data of Achmea and of Erasmus MC are combined using Secure Multi-Party Computation. First, a secure inner join protocol takes place to securely determine the identifiers of the patients that are represented in both datasets. Then, a secure Lasso Regression model is trained on the securely combined data. The involved parties thus obtain the prediction model but no further information on the input data of the other parties.
Results
We implement our secure solution and describe its performance and scalability: we can train a prediction model on two datasets with 5000 records each and a total of 30 features in less than one hour, with a minimal difference from the results of standard (non-secure) methods.
Conclusions
This article shows that it is possible to combine datasets and train a Lasso regression model on this combination in a secure way. Such a solution thus further expands the potential of privacy-preserving data analysis in the medical domain.
Background
Modern machine-learning techniques require large-scale and well-characterized datasets to achieve their full potential. In the medical domain, this requirement translates to a need to store medical patient data, and to combine information from different institutions, the Covid-19 outbreak being an example of a situation where this is deemed crucial [1, 2].
However, the collection, processing and exchange of personal data is a sensitive matter, and the risks coming from privacy violations are especially high for medical data. This has led to legal frameworks that regulate and restrict usage of personal (medical) data, the General Data Protection Regulation^{Footnote 1} (GDPR), and the Health Insurance Portability and Accountability Act^{Footnote 2} (HIPAA) being two prominent examples. These regulations mandate informed consent from patients in order to use the corresponding medical data; however, asking for consent for machine-learning purposes is often impractical, since it is a time-consuming process, and since contact with patients may have been lost since the moment of data collection.
This conflict between, on the one hand, the need to gather, combine and process large amounts of data for better machine-learning techniques, and on the other hand the need to minimize personal data usage for privacy protection, has led to the development of several solutions for privacy-preserving data analysis. In particular, a collection of cryptographic techniques known as Secure Multi-Party Computation, or MPC for short, is being applied more and more in the medical domain. Intuitively, the goal of MPC is to allow several parties to compute the output of a certain function or computation, depending on private inputs of each party, without actually disclosing information on their inputs to each other.
In 2018, the Netherlands Organization for Applied Scientific Research (TNO), together with academic medical center Erasmus MC and health insurance company Achmea, started a project within the Horizon 2020 Programme called BigMedilytics^{Footnote 3} to develop a secure algorithm to predict the number of hospitalization days for heart failure patients. Although the project does not use real patient data in its current phase, the MPC solution presented in this article is based on the following real-life use case, which serves as a motivating example. In Rotterdam, a group of individuals took part in the “Rotterdam study” [3], a program by the Epidemiology department of Erasmus MC. Erasmus MC has collected data on the lifestyle of these patients, for example their exercising, smoking, and drinking behavior. Achmea, on the other hand, has claims data of its customers (including several participants of the Rotterdam study), which encompass different aspects, such as hospitalization days and healthcare usage outside of the hospital. Recent work has shown that using machine-learning models on medical data has the potential to predict survival of heart-failure patients [4]. The datasets of Achmea and Erasmus MC, once intersected and combined, could be used to train a prediction model that identifies high-impact lifestyle factors for heart failure, and thus, in turn, to recognize high-risk heart-failure patients.
However, privacy concerns mean that Erasmus MC and Achmea cannot simply share their data with each other to allow for a straightforward analysis. TNO has therefore developed and implemented the MPC-based Proof of Concept described in this article, which allows Erasmus MC and Achmea to securely train a prediction model without disclosing any personal medical information.
Before we present the details of our solution, we give an overview of the current landscape of privacy-preserving data analysis techniques, focusing on the medical domain, and on solutions which bear resemblance to ours. We will then discuss how our solution compares to these existing techniques.
Previous and related work
Secure analysis of healthcare data
A straightforward approach for privacy-preserving data analytics consists of data anonymization and pseudonymization. More precisely, in the case of horizontally-partitioned data (i.e. when organizations hold the same type of data on different individuals), organizations may simply remove identifiers such as name, date of birth, or social security numbers, and share the data features with each other; in the case of vertically-partitioned data (i.e. when parties hold different data on the same individuals), a similar result can be achieved by resorting to an external third party that gets access to all identifiers, replaces them with pseudonyms, and then ensures that the data features from all involved organizations are linked to each other. These methods thus ensure that only feature data are revealed, instead of identifiers. However, feature data can often uniquely identify an individual, especially if other, related data is acquired through public sources, as shown in several studies [5, 6]. Thus in practice, data anonymization and pseudonymization offer little guarantee on the protection of the identity of individuals involved in collaborative data analysis.
A more sophisticated and popular approach consists of federated learning, where algorithms are trained on decentralized devices or servers, each possessing its own data, by only exchanging intermediate model coefficients with each other. Federated learning promises great potential to facilitate big data for medical application, in particular for international consortia [7].
An example of a federated-learning architecture is provided by the Personal Health Train (PHT) [8], whose core idea is to keep data at the original sources and to let the analytical tasks “visit” these data sources and perform data analysis tasks locally.
Both federated learning and the PHT work fairly straightforwardly for horizontally-partitioned data (where institutions hold the same type of data on different individuals), while vertically-partitioned data remains a challenge to be tackled.
Cryptographic solutions such as MPC typically overcome these limitations, but with an inherent overhead in terms of computation time and communication volume compared to non-cryptographic solutions, and typically have a lower technology readiness level. Specific applications in the medical domain cover a wide range, including, for instance, disclosure of case counts, while preserving confidentiality of healthcare providers [9]; sharing insights on the effectiveness of HIV treatments, while preserving both privacy of involved patients and confidentiality of practitioners’ treatment choices [10]; privacy-preserving analysis of hospital workflows [11]; secure genome study [12]; and secure distributed logistic regression for medical data [13]. Compliance of MPC techniques with the GDPR has been discussed in [14].
In the following subsection, we present in more detail MPC solutions with a scope similar to ours.
Cryptographic techniques for dataset combination and secure regression
A first challenge in secure distributed data analysis lies in the combination of different datasets: different institutions hold, in general, data on different individuals, so one must first determine which individuals lie in both datasets and retrieve their relevant features. Various work has been done on “secure set intersection” (also referred to as “private set intersection”) [15,16,17,18], where the different involved parties learn which individuals lie in all datasets, but it is guaranteed that no information on individuals outside the intersection will be revealed. To the best of our knowledge, however, no previous work has been published that describes a secure inner join solution, where individuals in the intersection are determined, but not revealed, and where the corresponding feature values are associated to each individual. Notice that a secure inner join is a fundamental step for realistic deployment of a secure data analysis solution, since the identity of individuals in the intersection of datasets is, in general, personal (and thus protected) data.
Concerning securely training a linear regression model on distributed data, a lot of work has been done on a variant of linear regression known as Ridge regression. In linear regression, one aims to find a “coefficient” vector, such that the linear combination of feature values with the coefficients yields roughly the value of another, “target” feature. Such a linear combination cannot, in general, be exactly equal to the target feature: Ridge regression is a relatively straightforward method that aims to find a coefficient vector that minimizes this gap, while also preventing the obtained coefficients from being biased towards the training feature values (and thus poorly predicting the target feature for new data). Ridge regression is typically solved in either of two ways: by solving the normal equations, or by minimizing the objective function in a more general fashion, e.g. by application of the Gradient Descent algorithm. The privacy-preserving implementations [19,20,21,22,23,24,25,26] all train a Ridge regression model by solving the normal equations, which is in turn performed through matrix inversion. Privacy is often preserved by using homomorphic encryption techniques [20,21,22,23], yet there are also implementations that make use of secret sharing [19], or of garbled circuits [24].
In contrast to the normal-equations approach, we chose the secure Gradient Descent approach. The works of [27, 28] both present privacy-preserving Gradient Descent solutions to train Ridge regression models. In [28] and [27] the authors focus on vertically-partitioned data. Finally, the authors of [27] train a linear regression model via a variation of the standard Gradient Descent method, namely conjugate Gradient Descent.
The solution that we present in this article focuses on another linear regression method called Lasso; to the best of our knowledge, no previous work has been published on secure Lasso regression. Lasso is similar to Ridge in that it tries to minimize the gap between the target feature and the linear combination of the other features with the coefficient vector, but it also discards features of little impact on the target feature by pushing the corresponding coefficient to zero. This means that once the model has been (securely) trained, less data are needed to evaluate the model. This is a very desirable property for a healthcare-prediction scenario, and in particular for the identification of high-impact factors for heart failure, as described at the beginning of this section: gathering and using only the data that is strictly necessary to apply the model is important to comply with privacy regulations and their data-minimization requirements. In [4] it is even shown that for the prediction of the survival of heart-failure patients, training a model on two features alone can yield more accurate predictions than those made using all available features.
Our contributions
We present a solution for (1) computing a secure inner join of two datasets, and (2) securely training a Lasso regression model on the obtained (encrypted) data. To the best of our knowledge, both contributions are novel.
Both components of our solution are essential: the secure inner join ensures that individuals in the overlap of the two datasets can be determined (but not revealed) together with their feature values, and the Lasso regression allows for minimizing the number of features that have an impact on the model, thus meeting the proportionality and data-minimization requirements for subsequent non-encrypted application of the model to identify high-risk patients.
In both components, we assume that a third, “helper” party joins the computation. The helper party does not supply any input and does not learn the input data of the other parties nor the outcome of the model, but its inclusion allows for a very efficient design. Namely, we are able to use efficient techniques such as hashing for the secure inner join, as opposed to the expensive polynomial-evaluation techniques typically required in a two-party setting; as for the secure Lasso regression, we can make use of the MPyC framework, which requires at least three parties to guarantee security.
Our solution is tailored to the heart-failure use case described above, and involves Achmea and Erasmus MC as data parties and healthcare information intermediation company ZorgTTP as helper party. We installed our solution on a test infrastructure of the three involved parties, generated artificial data, and tested the performance in terms of quality of the obtained model and efficiency. Both aspects are fully satisfactory: the secure solution shows a difference in objectives of 0.004 compared with a standard, non-secure solution (in scikit-learn), and requires less than one hour to compute the inner join and Lasso coefficients of two datasets, consisting of 5000 records each, and 30 features in total.
Outline
The rest of the article is organized as follows. The "Methods" section is divided into two parts: the first one (section "Description of the desired functionality") illustrates the functionality that we aim to achieve (inner join and Lasso regression), but without taking security and privacy considerations into account. These are discussed in the following section "Description of the secure solution", which shows how our solution securely implements the functionality of section "Description of the desired functionality". The "Results" section discusses what our solution achieves in terms of security (section "Security results"), efficiency (section "Running time"), and quality of the obtained regression model (section "Performance and accuracy results"). We discuss the impact and possible improvements of our work in section "Discussion", and we end with the conclusions in section "Conclusions".
Methods
Description of the desired functionality
We first discuss the details of the functionality that we aim to realize. Privacy and security aspects are not considered here, and will instead be discussed in section "Description of the secure solution", following the same structure as the current section.
Description of the setting and data formatting
We begin with the general setup and a description of the format of the input data. In our setting, two data-providing parties are involved: a healthcare insurance company, Achmea (often shortened to AC), and a university hospital, Erasmus MC (often shortened to EMC). We assume that each party owns a dataset where several features of various customers/patients are contained. Each row in the dataset corresponds to a customer or patient, and we refer to it as a record. Specifically, we denote the dataset of Achmea, and its elements, as in Table 1, and we denote by \(A\) its set of identifiers \(\{a_1, a_2, \ldots \}\).
The dataset of Erasmus MC, on the other hand, is depicted as in Table 2, and we denote by \(B\) the set of identifiers \(\{b_1,b_2, \ldots \}\).
Before discussing the properties of identifiers and features, we stress the fact that the research described in this article did not use any actual identifiers or features corresponding to existing individuals. For the running time, accuracy and performance experiments, synthetic data was created or existing public data sets were used. More details can be found in the "Results" section.
It is assumed that identifiers in \(A\) and in \(B\) are of the same type; for simplicity, one may think of them as the social security number of a customer/patient. In particular, if \(a_i\) and \(b_j\) refer to the same person, then \(a_i=b_j\). Notice that we are actually interested in the intersection of \(A\) and \(B\), as we want to train a regression algorithm on all features.
For what concerns the features, both \(\alpha ^{\left( i \right) }\) and \(\beta ^{\left( j \right) }\) are assumed to be numerical or Boolean. One of the features serves as a target: intuitively, we aim to predict its value as a function of the other feature values. We formalize this intuitive goal in the following subsections.
Inner join of the data
In order to find a correlation among different features, a first necessary step is to identify which features belong to the same customer/patient. Namely, not every person of Achmea is necessarily present in the database of Erasmus MC (as not all customers of AC took part in the social and behavioral study of EMC), and vice versa. In mathematical terms, and in the notation defined above, \(A\ne B\) (in general).
Therefore, the two parties need to a) compute \(A\cap B\) (i.e. identify which persons are represented in both databases), b) ensure that \(\alpha _i^{\left( \cdot \right) }\) and \(\beta _j^{\left( \cdot \right) }\) are identified for all i and j such that \(a_i=b_j\in A\cap B\) (i.e. assign to each identifier in the intersection the corresponding features). In Tables 3 and 4, an example of the aimed result of this intersection is shown, inspired by the heart-failure use case presented in the background section.
More abstractly, Table 5 would therefore be obtained, using the notation of Tables 1 and 2.
This type of operation is commonly referred to as Inner Join in the field of database management [29].
The next step is to train a regression algorithm on the data contained in Table 5. We remark that, at this point, the identifier column is no longer necessary, and indeed will play no role in the regression step.
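As a minimal, non-secure illustration of the inner join described in this subsection, the sketch below joins two small tables on their identifiers and then drops the identifier column. The function name `inner_join`, the identifiers, and all feature values are hypothetical and chosen only for illustration; they do not come from the datasets discussed in this article.

```python
# Plain (non-secure) inner join on a shared identifier.
# All identifiers and feature values below are made up for illustration.

def inner_join(table_a, table_b):
    """Join two {identifier: [features]} tables on identifiers present in both."""
    common = sorted(set(table_a) & set(table_b))
    # Concatenate the features of both parties for every shared identifier.
    return {ident: table_a[ident] + table_b[ident] for ident in common}

# Hypothetical Achmea table: identifier -> [hospitalization days]
achmea = {"id1": [3.0], "id2": [0.0], "id4": [7.0]}
# Hypothetical Erasmus MC table: identifier -> [smoker (0/1), exercise hours per week]
erasmus = {"id2": [0.0, 4.5], "id3": [1.0, 0.0], "id4": [1.0, 1.0]}

joined = inner_join(achmea, erasmus)
# The identifier column plays no role in the regression step, so only rows remain.
rows = list(joined.values())
print(rows)  # [[0.0, 0.0, 4.5], [7.0, 1.0, 1.0]]
```

Only the records with identifiers `id2` and `id4` appear in both tables, so only these two rows survive the join.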
Lasso regression algorithm
Given Table 5, we are now interested in finding a way of expressing a given feature (the number of hospitalization days) as a linear combination of the other features, or as an approximation of such a linear combination. This is accomplished by training a linear regression model on Table 5. In this subsection, we discuss the details of this process.
A linear regression problem can be informally expressed by the following question: for a known matrix \(\mathbf {X}\in \mathbb {R}^{n \times m}\), where n is the number of records and m is the number of features, and target vector \(\mathbf {y}\in \mathbb {R}^{n \times 1}\), can we find a weight vector \(\mathbf {w}\) such that the equality \(\mathbf {X}\mathbf {w}= \mathbf {y}\) is satisfied? In general the system is overdetermined and there exists no solution. Instead, one aims to find \(\mathbf {w}\), such that some function of the approximation error vector \(\mathbf {X}\mathbf {w} - \mathbf {y}\) (and possibly some other arguments) is minimized.
The straightforward form of this problem focuses on minimizing the \(\ell ^2\)-norm \(\Vert \mathbf {X}\mathbf {w} - \mathbf {y}\Vert _2^2\), where \(\Vert \mathbf {x}\Vert _2^2:=\sum _i x_i^2\); this is known as (ordinary) least squares linear regression (OLS) [30]. Typically, a so-called regularization term is added to this target value; for instance, Ridge regression [31, 32] uses the \(\ell ^2\)-norm of the weight vector, and therefore tries to minimize the value \(\Vert \mathbf {X}\mathbf {w} - \mathbf {y}\Vert _2^2 + \lambda \Vert \mathbf {w}\Vert _2^2\), for a fixed constant \(\lambda >0\). The goal of such a regularization term is to ensure that the weight vector has small values, thereby making the overall model more manageable and reducing the risk of overfitting (i.e. reducing the risk that the model is too tailored to the data \(\mathbf {X},\mathbf {y}\) and poorly predicts values based on new data).
We instead choose Lasso (Least Absolute Shrinkage and Selection Operator) [33, 34], which automatically discards features of little impact on the target vector. This is a desirable feature for the use case described in the Background section, as it is important to focus on factors that have the greatest impact on hospitalization days, in order to minimize data collection for subsequent usage of the obtained model. Furthermore, reducing the number of used features results in a more easily explainable model, thereby increasing the acceptance by end users. Lasso tries to minimize the following objective function:
where \(\Vert \mathbf {w}\Vert _1 = \sum _{i=1}^m |w_i|\), and where \(\lambda >0\) is a user-chosen parameter known as regularization parameter. Note that this method reduces to OLS if \(\lambda\) is set to zero. In our setup we use a proximal gradient descent algorithm to minimize the objective.
Gradient descent approach
Gradient Descent (GD) is a general optimization algorithm that finds a local minimum of an objective function. The algorithm takes repeated steps in the opposite direction of the (approximate) gradient of the objective function at the current point. In that way it moves in the direction of steepest descent. GD is a building block for many different models, including Ridge Regression and Support Vector Machines. The GD algorithm is described in Algorithm 1. We will now describe the parameters and functions in this algorithm.
Stopping criteria. The algorithm can stop for two reasons: either because it has reached the limit of iterations set by \(\text {maxIter}\), or because the model has been trained “sufficiently”. The latter can be quantified by measuring the relative or absolute change in the objective function and comparing this change with a preset threshold. Since the secure evaluation of an objective function is computationally intensive, we instead compare the following value, known as update difference, with a threshold:
The model is then said to be sufficiently trained when this value is smaller than a preset value known as tolerance.
\(\textsc {CalcGradient}\) and \(\textsc {Proxy}\). In the GD algorithm, we repeatedly calculate the gradient \(\textsc {CalcGradient}\). Because of the \(\ell ^1\)-norm in (1), the objective function for Lasso is non-differentiable. We therefore use the technique of proximal gradient descent, which can optimize a function that is not entirely differentiable: we first compute a gradient function over the first part of the objective function in (1),
then approximate the gradient of the second part of (1) by applying a proximal function \(\textsc {Proxy}\) on \(\mathbf {w}\). The i-th component of \(\textsc {Proxy} (\mathbf {w}, \lambda )\) is given by the following expression:
Step size \(\eta\). An important parameter when using Gradient Descent is the size of the steps. If the step size is too small, the algorithm will need too many iterations, while if it is too large, the algorithm will never converge. In Algorithm 1, the step size decreases in every iteration, such that the weight vector converges. The initial step size is typically a user-chosen parameter; however, if it depends on the input data, then it needs to be calculated securely. For example, one could choose \(\eta _0 = \frac{0.1}{\max (\mathbf {X}^t \mathbf {X})}\), of which we explain the secure calculation in section "Secure lasso regression".
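The interplay of gradient step, proximal step and stopping criterion can be sketched in plain (non-secure) Python as follows. This is an illustrative sketch under several assumptions that are not confirmed by the text above: the objective is taken in the scikit-learn convention \(\frac{1}{2n}\Vert \mathbf {X}\mathbf {w} - \mathbf {y}\Vert _2^2 + \lambda \Vert \mathbf {w}\Vert _1\), the proximal function is the standard soft-thresholding operator with threshold \(\eta \lambda\), the step size is kept constant rather than decreasing, and the update difference is taken to be the squared \(\ell ^2\)-norm of the change in the weight vector. The function names `calc_gradient`, `proxy` and `lasso_gd` and the toy data are hypothetical.

```python
# Non-secure sketch of proximal gradient descent for Lasso (pure Python).
# Assumed objective: (1/(2n)) * ||Xw - y||_2^2 + lam * ||w||_1.

def calc_gradient(X, y, w):
    """Gradient of the smooth part of the objective: (1/n) * X^T (Xw - y)."""
    n, m = len(X), len(w)
    residual = [sum(X[i][j] * w[j] for j in range(m)) - y[i] for i in range(n)]
    return [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(m)]

def proxy(w, threshold):
    """Soft-thresholding: shrink each weight towards zero by `threshold`."""
    out = []
    for wi in w:
        if wi > threshold:
            out.append(wi - threshold)
        elif wi < -threshold:
            out.append(wi + threshold)
        else:
            out.append(0.0)
    return out

def lasso_gd(X, y, lam, eta, max_iter=10000, tolerance=1e-10):
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        grad = calc_gradient(X, y, w)
        stepped = [wj - eta * gj for wj, gj in zip(w, grad)]
        w_new = proxy(stepped, eta * lam)
        # Assumed stopping rule: squared l2-norm of the weight update.
        update_difference = sum((a - b) ** 2 for a, b in zip(w_new, w))
        w = w_new
        if update_difference < tolerance:
            break
    return w

# Toy data: the target equals twice the first feature; the second feature
# carries no additional information.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.0, 4.0, 6.0, 8.0]
w = lasso_gd(X, y, lam=0.1, eta=0.1)
print(w)  # approximately [1.9867, 0.0]
```

On this toy input the second weight is pushed to exactly zero, which illustrates the feature-selection behaviour of Lasso, while the first weight is slightly shrunk below 2 by the regularization term.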
Goodness of fit. To test the performance of Lasso Regression we can use different goodness of fit measures, such as the mean squared error, the mean absolute error or the coefficient of determination \(R^2\). As an example for the secure implementation, we focused on the last measure. \(R^2\) provides a measure of how well observed outcomes are replicated by our prediction model, based on the proportion of total variation of outcomes explained by the model. The range of \(R^2\) is \((-\infty ,1]\), where 1 indicates the best fit. We calculate \(R^2\) as follows. First, we define the value \(\bar{y}\) as the mean of the observed target data \(y_i\), i.e. \(\bar{y} = \frac{1}{n} \sum _{i=1}^n y_i\). We then denote by \(\mathbf {y}^{\text {pred}}\) the vector of the values predicted by the model, and define
\[ R^2 = 1 - \frac{\sum _{i=1}^n \left( y_i - y_i^{\text {pred}}\right) ^2}{\sum _{i=1}^n \left( y_i - \bar{y}\right) ^2}. \]
This coefficient is important for determining the right regularization parameter \(\lambda\). In practice, one will have to run the Lasso regression multiple times with different \(\lambda\), to find the optimal model with the highest \(R^2\).
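The following plain (non-secure) sketch computes the coefficient of determination in its standard form, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the toy values are hypothetical.

```python
def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

y = [2.0, 4.0, 6.0, 8.0]
perfect = r_squared(y, y)                       # 1.0: predictions match exactly
mean_only = r_squared(y, [5.0, 5.0, 5.0, 5.0])  # 0.0: no better than the mean
poor = r_squared(y, [8.0, 6.0, 4.0, 2.0])       # -3.0: worse than the mean
```

The three cases show why the range is unbounded from below: a model that predicts worse than the constant mean yields a negative value.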
Description of the secure solution
Aim and assumptions
The goal of this section is to show how the functionality described in "Description of the desired functionality" section can be realized in a secure way: this means that while both parties will learn the output of the Lasso regression (i.e. the model coefficients) trained on the inner join of their datasets,^{Footnote 4} no other information on the datasets of each party will be disclosed to any other party.
Our secure solution involves a third party, which does not supply any input, and does not receive any output (except for the size of the intersection of the two datasets). For our Proof of Concept, this third-party role is taken by ZorgTTP, a company that offers consultancy and practical solutions on the topic of privacy-preserving data analysis in the healthcare sector. The addition of such a party has two benefits, relating to the two steps of our solution: secure inner join and secure Lasso regression. For the first step, the presence of a third party allows us to design a custom, highly efficient protocol; for the second step, we are able to use the MPyC library [35], which provides useful building blocks but requires at least three parties to guarantee security.
Before discussing the details of our solution, we give a brief introduction to Secure Multi-Party Computation. Notice that we chose to present cryptographic concepts with a focus on intuition, so as not to burden the reader with an unnecessary level of formalism. The reader can refer to [36, 37] for a more formal discussion of general cryptographic concepts (including cryptographic hash functions, homomorphic encryption, and secret sharing), and to [38, 39] for an in-depth discussion of MPC and Secret Sharing.
Introduction to secure multi-party computation
Assume n parties \(P_1, \ldots ,P_n\) hold private inputs \(x_1, \ldots ,x_n\). An MPC protocol is an interactive protocol that allows the parties to compute the value \(f(x_1, \ldots ,x_n)\) of a function f on their inputs, without revealing any other information to each other on their inputs. Notice that the private inputs are not necessarily just a single element, but could actually consist of an entire dataset. Moreover, not all parties need to supply an input, and not all parties should receive an output; the addition of these “dataless” parties (such as ZorgTTP in our case) allows protocols to achieve better efficiency, or to use techniques which would be insecure with a smaller number of parties.
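To make the idea concrete, the toy sketch below lets three parties compute the sum \(f(x_1, x_2, x_3) = x_1 + x_2 + x_3\) of their private inputs using additive secret sharing. It is not the protocol used in this article (MPyC relies on its own secret-sharing machinery); the modulus, the input values and the function name `share` are assumptions made for illustration only.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus for this toy example

def share(value, n_parties):
    """Split `value` into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Hypothetical private inputs x_1, x_2, x_3 of the three parties.
inputs = [12, 30, 7]
# Party i splits its input and sends the j-th share to party j.
all_shares = [share(x, 3) for x in inputs]
# Each party locally adds the shares it received (one per input).
partial_sums = [sum(column) % MODULUS for column in zip(*all_shares)]
# Only the partial sums are published; together they reveal the total.
result = sum(partial_sums) % MODULUS
print(result)  # 49
```

Each individual share is a uniformly random number, so no party learns another party's input from the shares it receives; only the agreed output is reconstructed.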
Secure inner join
As outlined in section "Inner join of the data", in order to realize a protocol that securely implements our desired functionality, the first step to be performed is to compute the so-called inner join of the datasets of Achmea and Erasmus MC. Namely, we need to obtain a database with the identifiers that are present in both the datasets of Achmea (AC) and Erasmus MC (EMC), and with the corresponding features coming from both datasets. Notice that we do not wish to reveal the dataset obtained in this way to any party, as it would still contain highly sensitive personal data (in case of application involving real data). The inner-join database will thus remain secret, yet computing the coefficients of a Lasso regression model on this secret dataset will be possible.
We first give a brief overview of the cryptographic building blocks that are used for this phase, and then present our solution.
Cryptographic Building Blocks. Our solution makes use of three core components: (keyed) cryptographic hash functions, (additively) homomorphic encryption, and 2-out-of-2 secret sharing.

Hash functions. A cryptographic hash function is a deterministic function \(H:\mathcal {D}\rightarrow \mathcal {C}\), that maps any alphanumeric string \(\mathbf {s}\in \mathcal {D}\), to another alphanumeric string \(H(\mathbf {s})=\mathbf {z}\in \mathcal {C}\), called digest, of fixed length.^{Footnote 5} Such a function enjoys the property that, given a digest \(\mathbf {z}\in \mathcal {C}\), it is infeasible to compute a string \(\mathbf {s}\) such that \(H(\mathbf {s})=\mathbf {z}\). In our protocol, we compute the hash of values \(\mathbf {s}\) that are concatenated with a random bit string \(\mathbf {b}\), thus obtaining \(H(\mathbf {b}\Vert \mathbf {s})\). This ensures that a party with no knowledge of \(\mathbf {b}\) is unable to recover \(\mathbf {s}\) from its hash with a brute-force attack; in cryptographic terms, it is a simple form of keyed hashing.

Homomorphic encryption. An (additively) homomorphic encryption scheme is a public-key encryption scheme, thus consisting of a key-generation algorithm \(\texttt {KeyGen}\), an encryption algorithm \(\texttt {Enc}\) and a decryption algorithm \(\texttt {Dec}\). For a key pair of public and secret key \((\text {pk},\text {sk})\) generated by \(\texttt {KeyGen}\), we have that \(\texttt {Enc}_{\text {pk}}\) takes as input a message m and some randomness r, and produces as output a ciphertext \(c=\texttt {Enc}_{\text {pk}}(m,r)\), with the property that \(\texttt {Dec}_{\text {sk}}(c)=m\), and that no information whatsoever can be extracted on m or \(\text {sk}\) from c and \(\text {pk}\); in formal terms, the encryption scheme is IND-CCA1 secure. In order to simplify notation, we will often omit the key and randomness when discussing encryption, and write \(\left[ m \right] :=\texttt {Enc}_{\text {pk}}(m,r)\); moreover, we implicitly assume messages to be numeric values, so that addition and subtraction of messages are well-defined. The scheme is supposed to be additively homomorphic, which means that there exist special operations on ciphertexts \(\boxplus\) and \(\boxminus\), such that \(\left[ m_1\right] \boxplus \left[ m_2\right] = \left[ m_1 + m_2\right]\), and \(\left[ m_1\right] \boxminus \left[ m_2\right] = \left[ m_1 - m_2\right]\) for all messages \(m_1, m_2\).

2-out-of-2 secret sharing. This building block can be seen as a form of keyless encryption, distributed among two parties, and works as follows: given a secret (numerical) value s, two elements \(s_1\) and \(s_2\) called shares are randomly sampled, but subject to the condition that \(s_1 + s_2 =s\). Then \(s_1\) is assigned to a party, and \(s_2\) to another party; in this way, each party has individually no knowledge of s (since the share \(s_i\) that they have is a random number), but the original secret value s can be reconstructed, when the two parties cooperate and communicate their shares to each other.
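The three building blocks above can be illustrated with the following toy sketch. It is not the implementation used for the Proof of Concept: the keyed hash simply prepends a random byte string to the identifier before applying SHA-256, the Paillier-style scheme uses the textbook construction with primes that are far too small to be secure, and the secret sharing works modulo the Paillier modulus. All function names (`keyed_hash`, `keygen`, `enc`, `dec`, `add`, `sub`, `share2`), identifiers and parameters are hypothetical.

```python
import hashlib
import math
import secrets

# Keyed hashing: H(b || s) with a random byte string b shared by AC and EMC.
def keyed_hash(key: bytes, identifier: str) -> str:
    return hashlib.sha256(key + identifier.encode("utf-8")).hexdigest()

# Toy Paillier (additively homomorphic). The primes are insecurely small.
def keygen(p=1009, q=1013):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                               # requires Python 3.8+
    return n, (lam, mu)                                # public key, secret key

def enc(n, m):
    while True:  # sample randomness r that is invertible modulo n
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n * n) * pow(r, n, n * n)) % (n * n)

def dec(n, sk, c):
    lam, mu = sk
    return (((pow(c, lam, n * n) - 1) // n) * mu) % n

def add(n, c1, c2):  # homomorphic addition of the underlying messages
    return (c1 * c2) % (n * n)

def sub(n, c1, c2):  # homomorphic subtraction of the underlying messages
    return (c1 * pow(c2, -1, n * n)) % (n * n)

# 2-out-of-2 additive secret sharing modulo `modulus`.
def share2(s, modulus):
    s1 = secrets.randbelow(modulus)
    return s1, (s - s1) % modulus

key = secrets.token_bytes(16)
ac_hashes = {keyed_hash(key, i) for i in ["id1", "id2", "id4"]}
emc_hashes = {keyed_hash(key, i) for i in ["id2", "id3", "id4"]}
k = len(ac_hashes & emc_hashes)                       # 2 shared identifiers

n, sk = keygen()
total = dec(n, sk, add(n, enc(n, 20), enc(n, 22)))    # 42
diff = dec(n, sk, sub(n, enc(n, 50), enc(n, 8)))      # 42
s1, s2 = share2(42, n)
reconstructed = (s1 + s2) % n                         # 42
```

Equal identifiers hashed with the same key yield equal digests, which is what allows overlapping records to be counted without seeing the identifiers themselves; the homomorphic operations act on ciphertexts only, and decryption requires the secret key.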
The Secure Inner Join Solution. The presence of a third party (ZorgTTP) allows us to design a novel, highly efficient protocol for secure inner join, which we believe to be of independent interest. ZorgTTP takes care of the communication between the two data holders, but is not allowed to learn any data other than the cardinality of the intersection. The goal is for AC and EMC to obtain a secret-shared version of the features from Table 5. Our secure inner join protocol between AC, EMC, and ZorgTTP uses cryptographic hash functions, and both AC and EMC have an (additively) homomorphic encryption key pair; we used SHA-256 [40] as hash function and the Paillier homomorphic encryption scheme [41] in our implementation. Both public keys are communicated to the other involved parties, so that everyone can encrypt and perform homomorphic operations on ciphertexts; denote by \(\left[ x \right] _{\text {AC}}\) a value encrypted under the public key of AC, and by \(\left[ x \right] _{\text {EMC}}\) a value encrypted under the public key of EMC. The main idea is as follows:

1
AC and EMC randomly permute the rows of their datasets.

2
AC and EMC jointly generate a random bit string, not known to ZorgTTP, for the cryptographic hash function.

3
Using the hash function and the random string, both AC and EMC hide the identifiers from their own databases, and send the obtained hashes to ZorgTTP. AC and EMC then use their own public key to homomorphically encrypt the feature values of their records, and send the resulting ciphertexts to ZorgTTP. For simplicity, we assume here that AC and EMC have only one feature each. More features can be processed by a simple repetition of the steps below; the reader can refer to Table 6 below for a visual representation of the data sent to ZorgTTP, and to the “Appendix” section for the details.

4
ZorgTTP computes how many hashed identifiers from AC also appear among the hashed identifiers of EMC; denote this value by \(k\). Due to the properties of cryptographic hash functions, \(k\) is equal to the number of records in the intersection of the datasets of AC and EMC, and ZorgTTP learns no other information on the identifiers. For simplicity, we assume \(k=1\). In case \(k\) is larger, the steps below can be repeated, with ZorgTTP properly linking the attributes with overlapping hashed identifiers; once again, a more detailed description can be found in the “Appendix” section. Note that, at this point, ZorgTTP holds the encrypted features of this record, which we denote by \(\left[ \alpha \right] _{\text {AC}}\) and \(\left[ \beta \right] _{\text {EMC}}\).

5
Both AC and EMC generate a random value, denoted by s and z respectively.

6
Both AC and EMC use the public key of the other party to homomorphically encrypt their shares, thus obtaining \(\left[ s \right] _{\text {EMC}}\) and \(\left[ z \right] _{\text {AC}}\), and they then send these ciphertexts to ZorgTTP.

7
ZorgTTP computes \(\left[ \alpha \right] _{\text {AC}} \boxminus \left[ z \right] _{\text {AC}} = \left[ \alpha - z \right] _{\text {AC}}\) and sends this to AC. Similarly, ZorgTTP computes and sends \(\left[ \beta - s \right] _{\text {EMC}}\) to EMC. Table 7 below visualizes the data obtained and computed by ZorgTTP.

8
AC and EMC decrypt the received values and obtain \(\alpha - z\) and \(\beta - s\), respectively. Note that we have thus obtained a 2-out-of-2 sharing of \(\alpha\) and \(\beta\) among EMC and AC, since AC still holds s and EMC still holds z. This outcome can be seen in Table 8.
To ensure that the decrypted differences (in step 8) reveal no information on the feature values, the randomly generated values (in step 5) need to be sufficiently large.
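
The data flow of the steps above can be sketched as follows. This is a simplified, single-process illustration under several assumptions: the row permutation of step 1 is omitted, the homomorphic encryption of steps 3, 6 and 7 is replaced by plain arithmetic, each party has one integer feature, and all names (`hash_id`, `blind`, `inner_join_shares`) as well as the 40-bit mask size are hypothetical. It only shows which values end up at which party; it offers no protection by itself.

```python
import hashlib
import secrets


def hash_id(identifier, salt):
    """Salted SHA-256 hash of an identifier (step 3)."""
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


def blind(records, salt):
    """Map {identifier: feature} to {hashed identifier: feature}."""
    return {hash_id(ident, salt): value for ident, value in records.items()}


def inner_join_shares(ac_records, emc_records):
    """Return (k, AC's share table, EMC's share table) for the overlapping records."""
    salt = secrets.token_bytes(32)  # step 2: random string unknown to ZorgTTP
    ac_hashed = blind(ac_records, salt)
    emc_hashed = blind(emc_records, salt)
    common = sorted(set(ac_hashed) & set(emc_hashed))  # step 4: ZorgTTP matches hashes
    ac_table, emc_table = [], []
    for h in common:
        alpha, beta = ac_hashed[h], emc_hashed[h]
        s = secrets.randbelow(2**40)  # step 5: random value of AC
        z = secrets.randbelow(2**40)  # step 5: random value of EMC
        # Steps 6-8: in the protocol these differences are computed on ciphertexts.
        ac_table.append((alpha - z, s))   # AC's shares of (alpha, beta)
        emc_table.append((z, beta - s))   # EMC's shares of (alpha, beta)
    return len(common), ac_table, emc_table
```

Adding the two share tables entry by entry yields the joined table; for example, for AC records `{"id2": 20, "id3": 30}` and EMC records `{"id2": 7, "id3": 9}`, the sums are the rows `(20, 7)` and `(30, 9)` in some order.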
Secure lasso regression
Once the steps of the "Secure inner join" section have been performed, we obtain a “2-out-of-2 secret-shared” version of Table 5: namely, Achmea and Erasmus MC each have a table filled with apparently random numbers, but if they were to add up the corresponding numbers, they would obtain exactly Table 5.
Recall that our purpose is to train a linear regression model — specifically, Lasso — on this table. Now letting Achmea and Erasmus MC communicate their datasets to each other in order to reconstruct Table 5, and then train the regression model, is clearly not an option: the information that they would obtain consists of personal data, the exchange of which has to be prevented.
Instead, we present a solution that is able to compute the regression coefficients from the two datasets, without leaking information on their content.
The fundamental building block that allows us to design and implement this solution is Shamir Secret Sharing. We make use of the software platform MPyC [35], which implements this form of secret sharing, as well as other useful communication and computation tools. We present the relevant properties and features of MPyC and Shamir Secret Sharing in the next section, and then discuss how these are used in our solution.
Shamir Secret Sharing. As mentioned above, the core component of MPyC is Secret Sharing due to Shamir [42]. Shamir Secret Sharing can still be seen as a form of keyless distributed encryption, but the number of involved parties and the privacy and reconstruction guarantees are different. Instead of discussing Shamir Secret Sharing in its full generality, we focus here on the regime of parameters which is relevant for our purposes, called (1, 3)-Secret Sharing.
Given three parties \(P_1, P_2, P_3\), a (1, 3)-secret-sharing scheme (denoted by SSS for short) consists of two algorithms, namely a sharing algorithm \(\texttt {Share}\) and a reconstruction algorithm \(\texttt {Rec}\). \(\texttt {Share}\) is, in general, randomized, and on input a given (secret) value s, it outputs three elements \(\mathbf {s}_1, \mathbf {s}_2, \mathbf {s}_3\) called shares. Typically, any party can use the sharing algorithm to obtain shares of a secret value of their knowledge, and they will then distribute these shares to the parties, with party \(P_i\) receiving share \(\mathbf {s}_i\). The \(\texttt {Rec}\) algorithm tries to invert the process: on input three elements \(\mathbf {s}_1, \mathbf {s}_2, \mathbf {s}_3\), it outputs a value s or an error message \(\bot\), indicating that the reconstruction failed.
A (1, 3)-SSS enjoys 1-privacy: no information on the secret value s can be extracted from an individual share \(s_i\). On the other hand, any two or more shares allow one to unequivocally reconstruct s (2-reconstruction), i.e. \(\texttt {Rec}(s_{i_1},s_{i_2})=s\).^{Footnote 6} The \(``1''\) in (1, 3)-SSS thus refers to the privacy threshold, while the \(``3''\) refers to the total number of parties.
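
A (1, 3)-Shamir sharing can be sketched as the evaluation of a random line through the secret. The code below is a minimal illustration, assuming a prime field with the hypothetical modulus `PRIME` and evaluation points 1, 2, 3; shares are passed to `rec` together with their index, as noted in Footnote 6. MPyC handles these details internally, so this is not its API.

```python
import secrets

PRIME = 2**61 - 1  # hypothetical field modulus (a Mersenne prime)


def share(s, prime=PRIME):
    """Evaluate the random line f(x) = s + a*x at x = 1, 2, 3."""
    a = secrets.randbelow(prime)
    return [(i, (s + a * i) % prime) for i in (1, 2, 3)]


def rec(share_a, share_b, prime=PRIME):
    """Recover f(0) = s from any two indexed shares by Lagrange interpolation."""
    (i, yi), (j, yj) = share_a, share_b
    li = j * pow((j - i) % prime, -1, prime)  # Lagrange coefficient of share i at x = 0
    lj = i * pow((i - j) % prime, -1, prime)  # Lagrange coefficient of share j at x = 0
    return (yi * li + yj * lj) % prime
```

Since a single point on a random line is uniformly distributed, one share gives no information on s (1-privacy), while any two points determine the line and hence s (2-reconstruction).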
Such a secret-sharing scheme can be used to construct MPC protocols: assume that the three involved parties (Achmea, Erasmus MC, and ZorgTTP) have access to a (1, 3)-SSS. Let us assume that the parties wish to perform some computation on a value \(\alpha\) (held by Achmea) and a value \(\beta\) (held by Erasmus MC). The three parties can then proceed as follows: first, Achmea secret-shares \(\alpha\), i.e. computes \((\alpha _1,\alpha _2,\alpha _3)=\texttt {Share}(\alpha )\), such that Achmea, Erasmus MC and ZorgTTP will receive \(\alpha _1, \alpha _2, \alpha _3\), respectively. Notice that by 1-privacy, no information on \(\alpha\) is leaked at this point. Erasmus MC then similarly secret-shares \(\beta\), i.e. computes and distributes \((\beta _1,\beta _2,\beta _3)=\texttt {Share}(\beta )\).
The key property now is that for any operation that the parties wish to perform on the values \(\alpha\) and \(\beta\), there exists a corresponding operation that can be performed on the shares \(\alpha _i,\beta _i\), resulting in some other sharing \(s_1,s_2,s_3\), in such a way that no information at all is leaked on \(\alpha\) or \(\beta\). It is important to remark that these operations typically involve all shares and may also require some form of communication among the three parties. While operations such as sums can be evaluated straightforwardly, multiplications are typically more involved; MPyC makes use of a relatively standard protocol where players locally multiply shares, then reshare the obtained values and apply a Lagrange interpolation function on the received shares [43, 44].
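
The following sketch illustrates how sums and products can be evaluated on (1, 3)-Shamir shares. It simulates all three parties in one process and follows the general multiply, reshare and interpolate pattern described above; it is an illustration under assumptions (field modulus `PRIME`, evaluation points 1, 2, 3, hypothetical function names) and not the code of MPyC.

```python
import secrets

PRIME = 2**61 - 1  # hypothetical field modulus
# Lagrange coefficients that interpolate a degree-2 polynomial at x = 0
# from its values at x = 1, 2, 3.
LAGRANGE_AT_ZERO = {1: 3, 2: -3, 3: 1}


def share(s):
    """(1, 3)-share s: party i receives f(i) for a random line f with f(0) = s."""
    a = secrets.randbelow(PRIME)
    return {i: (s + a * i) % PRIME for i in (1, 2, 3)}


def reconstruct(shares):
    """Recover f(0) from the shares of parties 1 and 2: f(0) = 2*f(1) - f(2)."""
    return (2 * shares[1] - shares[2]) % PRIME


def add(x, y):
    """Addition is local: every party adds its two shares."""
    return {i: (x[i] + y[i]) % PRIME for i in x}


def mul(x, y):
    """Multiplication: local product, resharing, then Lagrange recombination."""
    local = {i: x[i] * y[i] % PRIME for i in x}       # lies on a degree-2 polynomial
    reshared = {i: share(local[i]) for i in local}    # party i reshares its product
    return {
        j: sum(LAGRANGE_AT_ZERO[i] * reshared[i][j] for i in reshared) % PRIME
        for j in (1, 2, 3)
    }
```

The sum requires no communication, whereas the product requires every party to send a fresh share to every other party, which is why multiplications dominate the cost of such protocols.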
It then becomes possible to evaluate a complex algorithm such as Lasso regression on several features of Achmea and Erasmus MC: parties can secret-share their features, then decompose the Lasso regression into basic operations, and perform the corresponding operations on the shares. Eventually, they will obtain shares of the regression coefficients; due to the 2-reconstruction property, Achmea and Erasmus MC at this point simply need to exchange their shares with each other and to evaluate \(\texttt {Rec}\) in order to obtain the coefficients.
A final remark of notable importance is that while sums and multiplications are, per se, sufficient to evaluate any algorithm, MPyC also supports a number of custom subprotocols to evaluate special operations in a much more efficient way. Notably, efficient subprotocols are implemented to compute the maximum of two values and to evaluate the inner product of two vectors, and there is full support for fixed-point arithmetic operations; we refer to the protocol specifications [35] for the details.
Casting from 2-out-of-2 to Shamir Secret Sharing. Recall that once the steps in section "Secure inner join" have been executed, parties obtain a 2-out-of-2 secret sharing of the table which serves as input for the secure Lasso solution, and not the (1,3)-secret sharing that is required for MPyC. The first step to be performed is thus to “cast” this 2-out-of-2 secret sharing to a (1,3)-Shamir sharing.
This is actually a fairly simple step, where only sum operations are required. Indeed, denote by x a 2-out-of-2 share of Achmea and by y the corresponding 2-out-of-2 share of Erasmus MC, which means that \(x+y=z\), where z is some feature value of an individual record occurring in both datasets. Now Achmea can (1,3)-share x and Erasmus MC can (1,3)-share y, so that Achmea obtains \(x_1\) and \(y_1\), Erasmus MC obtains \(x_2\) and \(y_2\), and ZorgTTP obtains \(x_3\) and \(y_3\). All parties now have to locally add their shares, resulting in \(x_1+y_1\), \(x_2+y_2\), and \(x_3+y_3\): these are valid (1,3)-shares of \(z=x+y\) that can be used in MPyC.
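
The casting step can be sketched as follows, again simulating the three parties in one process. The field modulus `PRIME` and the function names are assumptions for illustration; negative 2-out-of-2 shares are simply reduced modulo the prime.

```python
import secrets

PRIME = 2**61 - 1  # hypothetical field modulus


def shamir_share(v):
    """(1, 3)-share v; the list holds the shares of Achmea, Erasmus MC, ZorgTTP."""
    a = secrets.randbelow(PRIME)
    return [(v + a * i) % PRIME for i in (1, 2, 3)]


def cast_to_shamir(x, y):
    """Turn the 2-out-of-2 shares x (Achmea) and y (Erasmus MC) into Shamir shares of x + y."""
    xs = shamir_share(x)  # computed and distributed by Achmea
    ys = shamir_share(y)  # computed and distributed by Erasmus MC
    return [(xi + yi) % PRIME for xi, yi in zip(xs, ys)]  # local additions


def reconstruct(shares):
    """Recover the secret from the first two shares: f(0) = 2*f(1) - f(2)."""
    return (2 * shares[0] - shares[1]) % PRIME
```

For instance, with the 2-out-of-2 shares x = -5 and y = 1005 of the feature value z = 1000, `reconstruct(cast_to_shamir(-5, 1005))` returns 1000.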
The Secure Lasso Regression solution. In order to explain our secure Lasso solution, we follow the blueprint of section "Lasso regression algorithm" and show how each step can be securely performed on secret shared data, using the techniques of section "Secure lasso regression".

Secure Gradient Descent. Apart from the stopping criterion, \(\textsc {CalcGradient}\), \(\textsc {Proxy}\), step size \(\eta\) and goodness of fit \(R^2\), all computations in Algorithm 1 are linear operations, and can thus be calculated on secret-shared data as explained in section "Secure lasso regression". We will now elaborate on these secure calculations.

Secure stopping criteria. As explained in the corresponding paragraph in section "Gradient descent approach", there are two possible stopping criteria. The first one is reaching the maximal number of iterations; since \(\text {maxIter}\) is a public value, this criterion does not need to be implemented securely. The second criterion demands computing the update difference \(\textsc {UpdateDifference}\) and comparing it with the tolerance (which is a public constant). For efficiency purposes, we chose not to implement all these steps securely. Instead, for every iteration we reveal the value of the update difference, and compare it with the tolerance in plaintext. To be more precise, recall that the update difference is given by the ratio between \(\Vert \mathbf {w}_{\text {new}}-\mathbf {w}_{\text {old}}\Vert _2^2\) and \(\Vert \mathbf {w}_{\text {old}}\Vert _2^2\): in order to calculate the update difference, we securely compute both numerator and denominator and then reveal their values. We believe the information leak of this step to be acceptable, especially given the performance gain that is derived from it by avoiding the expensive secure division step.
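
In plaintext terms, the check performed on the revealed values can be sketched as below. The function names are hypothetical, and the sketch assumes that the criterion is met when the update difference falls strictly below the tolerance; comparing the numerator with the tolerance times the denominator is equivalent to comparing the ratio for a positive denominator and avoids a division.

```python
def update_difference_parts(w_new, w_old):
    """Numerator and denominator of the update difference (computed securely in the protocol)."""
    numerator = sum((a - b) ** 2 for a, b in zip(w_new, w_old))
    denominator = sum(b ** 2 for b in w_old)
    return numerator, denominator


def should_stop(numerator, denominator, tolerance, iteration, max_iter):
    """Public check on the revealed values: iteration cap or small relative update."""
    return iteration >= max_iter or numerator < tolerance * denominator
```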

Secure \(\textsc {CalcGradient}\) and \(\textsc {Proxy}\). In order to securely calculate the gradient of \(\mathbf {w}\), linear operations are used. We also make use of the custom subprotocol for vector multiplication, as described in section "Secure lasso regression". In order to compute \(\textsc {Proxy}\), we calculate two secret-shared bits, namely \(a = \left( w_i > \lambda \right)\), and \(b = \left( w_i < - \lambda \right)\), where \((x<y)\) denotes the bit that is equal to 1 if \(x<y\), and to 0 otherwise. We can then securely compute the following linear operation over the shares of \(w_i\):
$$\begin{aligned} \textsc {Proxy} (w_i) = a\cdot (w_i - \lambda ) + b \cdot ( w_i + \lambda ). \end{aligned} \tag{5}$$
It is easy to see that this gives the same result as Eq. (4).
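
The equivalence can be checked on plaintext values with the short sketch below. It assumes that Eq. (4) is the usual soft-thresholding operator (returning \(w_i-\lambda\) above \(\lambda\), \(w_i+\lambda\) below \(-\lambda\), and 0 in between); the function names are hypothetical.

```python
def prox_bits(w, lam):
    """Soft-thresholding written with two comparison bits, as in Eq. (5)."""
    a = int(w > lam)
    b = int(w < -lam)
    return a * (w - lam) + b * (w + lam)


def prox_reference(w, lam):
    """Case-by-case soft-thresholding, assumed here to match Eq. (4)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0
```

The first form contains no branching on secret data: once the two bits are available as secret-shared values, only multiplications and additions remain.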

Secure initial step size. Although the operations that we use for computing our choice of the initial step size \(\eta _0\) (inner product and maximum) are computationally expensive, we only need to perform them once. Once again, we make use of the subprotocols for vector multiplications and maximum from MPyC.

Secure goodness of fit. Once we have computed the weight vector of the prediction model, we aim to securely compute goodness-of-fit measures. As an example we implemented \(R^2\). Recall that the definition of \(R^2\) is given by
$$\begin{aligned} R^2 = 1 - \frac{\sum _{i=1}^n (y_i - y_i^{\text {pred}})^2}{\sum _{i=1}^n(y_i - \bar{y})^2}. \end{aligned}$$
With the shares of \(\mathbf {X}\), and the publicly known coefficients of \(\mathbf {w}\), we can calculate the shares of \(\mathbf {y}^{\text {pred}}\). At this point, by using the secret-shared vector \(\mathbf {y}\), we can compute the numerator and denominator of \(1-R^2\), reveal these values, and thus obtain \(R^2\).
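
On plaintext data, the two revealed quantities and the final value can be computed as in the following sketch. The function names are hypothetical, and the model is assumed to have no intercept term, in line with the centered data used in this article.

```python
def r_squared_parts(X, y, w):
    """Numerator and denominator of 1 - R^2 (computed on shares in the protocol)."""
    y_pred = [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]
    y_mean = sum(y) / len(y)
    numerator = sum((y_i - p_i) ** 2 for y_i, p_i in zip(y, y_pred))
    denominator = sum((y_i - y_mean) ** 2 for y_i in y)
    return numerator, denominator


def r_squared(numerator, denominator):
    """Public division on the two revealed values."""
    return 1 - numerator / denominator
```

Only the two sums are revealed; the individual residuals \(y_i - y_i^{\text {pred}}\) remain secret-shared.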
Results
In this section we first present the security results of our solution. We then discuss the scalability results of our Proof of Concept, which was not performed on real data but did run on the actual infrastructure between Achmea, Erasmus MC and ZorgTTP. Finally, we describe the performance of our implementation of the Lasso regression and the accuracy of the secure model.
Security results
The security of our solution is guaranteed under the following assumptions. First of all, we assume that any two parties are connected by secure channels; in practice this is done by means of SSL/TLS connections. We assume that parties do follow the instructions of the protocol; in cryptographic lexicon, they are thus assumed to be semi-honest. Privacy is guaranteed even if parties try to infer extra information from the data they sent and received as part of the protocol, though we assume that no party will collude with any other party and exchange information with them. Finally, we adopt the standard assumption that the involved parties are bounded by polynomial-time computations, and that factoring large integers is infeasible under this constraint.
Under the above conditions, the solution we present is provably secure, in the sense that we can mathematically argue that the only information that will be revealed are regression coefficients, and the size of the intersection between the datasets of Achmea and Erasmus MC.
Running time
We implemented our solution in Python. In order to test the efficiency of our implementation, we ran several experiments on three machines, under the control of Achmea, Erasmus MC and ZorgTTP, respectively, and geographically separated.
The experiments include the secure inner-join computation and the protocol to securely train a Lasso regression algorithm as described in section "Description of the secure solution". Notice that we have not evaluated the efficiency of applying the Lasso model to new data, as it would be out of scope for this article.^{Footnote 7} For the same reason, no test data is extracted from these artificial datasets.
All three (virtual) machines run a Linux-based operating system, and are equipped with a commercial-grade virtual processor (up to four cores at 2.4 GHz) and with 8 to 16 GB of RAM.
The solution was installed as a Docker image on all three machines. Connections between the machines were realized via HTTPS over TCP (for the secure inner join) and via the custom TCP protocol of MPyC. The connections were secured with TLS; certificates were created and installed on the machines to this end.
In order to test the efficiency of our solution, we sampled artificial datasets using scikit-learn (with the datasets.sample_generator.make_regression functionality, which creates a dataset of real numbers with a roughly linear dependency of the target on the features). We sampled datasets with an increasing number of records and features, and ran several instances of our solution. The number of records (per dataset) was equal to 5, 100, 500, 1000, 5000 and 10000, while the (total) number of features was equal to 1, 2, 5, 10, 30 and 40. We vertically split each dataset into two datasets, with an equal number of features (up to a difference of one) and with a complete overlap in record IDs, i.e. the identifiers in the Achmea dataset were identical to those of the Erasmus MC dataset for each instance.
For datasets with five records, we chose not to run instances with more than two features, as this regime of parameters would be highly unsuitable for a linear regression algorithm. Furthermore, the instance with 10000 records and 40 features could not be run due to the RAM limitations of the involved machines; we nevertheless believe that the instances we considered are sufficient to analyze the scalability of our solution.
Each instance was run 10 times; all figures presented in this article refer to the median time over these 10 executions.
The total running time (thus encompassing both secure inner join and secure Lasso regression) is shown in Figs. 1 and 2. Our solution thus takes roughly 3500 s, slightly less than one hour, to process two datasets with 5000 records each and a total of 30 features. Moreover, the running time of our solution scales linearly in the number of records and features.
The running time of our solution is dominated by the Secure Lasso regression, the scalability of which is shown in Figs. 3 and 4. Just as for the total time, the running time of this phase also has a linear dependency on the number of records and of features.
Performance and accuracy results
Data
To test the performance and accuracy of our secure model, we use the “Medical Costs” dataset by Brett Lantz [45]. This public dataset contains 1338 records of patients with 12 features each (including, among others, age, bmi, children, gender, medical costs), of which four are numerical and eight are Boolean. We centered and scaled the data in advance, such that the feature values lie between 0 and 1. We also split the data into a train set and a test set (the latter consisting of 10% of the data, randomly selected).
Performance of lasso regression
To test the performance of our solution we compare the results of our secure model with the non-secure scikit-learn Lasso model [46]. Note that the secure inner join has no influence on the performance of the Lasso regression. Therefore, as input of our secure model, the data is secret-shared between the three parties. The influence of the calculation on secret-shared values will be discussed in the next paragraph.
We train our secure model on 11 features of the train set for predicting the (numerical) target feature of medical costs, while varying \(\lambda\) and the tolerance. We found the optimal choice, leading to a good fit (\(R^2\), mean squared error) and enough coefficients set to zero, to be \(\lambda = 0.001\) and tolerance \(=0.0001\). For this choice of parameters, training our secure model requires 26 iterations. Applying the trained model on our test set, we achieve an \(R^2\) of 0.70, a mean squared error of 0.0086, a mean absolute error of 0.062 and an objective of 0.013. As a validation of the solving method that we used, we compare these results with the (highly optimized) Lasso model of scikit-learn [46], using the same parameters. After the model was trained on the train set, on the test set we find an \(R^2\) value of 0.66, a mean squared error of 0.012, a mean absolute error of 0.082 and an objective of 0.0090. Although the goodness-of-fit measures of our secure model are better than those of the scikit-learn model, it has a larger objective value. In Tables 9 and 10 one can see that in the scikit-learn model, two more coefficients are set to zero, which is one of the aims of Lasso. Therefore, we can conclude that our secure model has a good performance, although the (highly optimized) scikit-learn model performs slightly better.
Accuracy of the secure implementation
To assess the accuracy of our secure model, we also implemented the Lasso algorithm described in section "Gradient descent approach" in a non-secure way. Although the steps in training both models are the same, a slight difference in outcome is to be expected, due to possible rounding errors of non-integer, secret-shared values. This difference in objective values is less than \(10^{-7}\); we consider this to be negligible for our research purposes.
Discussion
In light of the results shown in section "Results", we conclude that our solution does provide a viable way of training a Lasso regression model on distributed patient data in a privacy-preserving way. In particular, the good quality of the obtained model, together with its satisfying efficiency in a fairly realistic setup, make our solution a promising tool for privacy-preserving analysis of distributed patient data.
As future work, we have identified two main directions, namely, improvements to the solution and working towards a pilot on real heart-failure risk data.
Improvement to the secure solution
We identify several ways to further improve our solution. First of all, our solution was relatively efficient: it took less than one hour for the setting with 5000 records and 30 features. While we deem this to be fast enough for research purposes, its efficiency might need to be improved when working with very large datasets. Several approaches are possible in order to reduce running time, for instance implementing the solution in another programming language such as C, or making optimal use of parallelisation. Moreover, RAM usage could be reduced by supporting access to advanced database-management systems.
Also, we identified some opportunities to improve the quality of the model. Within this article, we assumed the data to be preprocessed, i.e. scaled and centered; a solution with a higher technology readiness level would need to securely implement this step. Moreover, next to \(R^2\), more goodness-of-fit measures such as mean squared error and mean absolute error could be securely implemented. This would enable parties to perform more quality checks on the model, and to choose a good regularization parameter \(\lambda\).
Finally, we focused on a situation where exactly two parties supply input data; it would be interesting to extend our solution to more than two data parties. The secure Lasso regression training poses no issue for such an extension, since MPyC supports a virtually unlimited number of parties, but the secure inner join would need to be redesigned, since it is tailored to the two-party-with-helper setting. A step-by-step approach for this part could probably be realized, i.e. by first performing an inner join of the datasets of two parties and then using the outcome as input for another inner join with the third data party, and so on, but a thorough analysis is required to validate this approach and measure its performance.
Towards prediction of heart-failure risk factors
Given the promising results obtained by our Proof of Concept, a future pilot with real patient data should be started, in order to establish the effectiveness of our solution for prediction of heart-failure risk on combined datasets from Erasmus MC and Achmea. The data needed for such an experiment is already stored at both parties. At Achmea, features express and quantify, notably, the number of days a given customer was admitted to a hospital, and various other aspects such as comorbidities, marital status, and socioeconomic status. This information is stored as part of the standard procedures of health insurance companies. At Erasmus MC, on the other hand, features express and quantify social and behavioral aspects such as age, smoking, exercising, and alcohol consumption. This type of data has been collected by the Epidemiology department of Erasmus MC as part of a previous study performed on volunteers in the city of Rotterdam [3], of which a significant part are also insured at Achmea. In a future pilot, we would aim to predict the number of hospitalization days as a function of the other feature values. Such a pilot would need to address the technical challenges highlighted above (for instance, Achmea has data on more than five million individuals), but it should also focus on non-technical challenges, such as compliance and legal aspects, and ensure that employees and management are properly involved in the process and get acquainted with the used techniques, which constitutes a time-consuming process.
Conclusions
In this paper, we presented a secure and scalable solution for Lasso regression as a part of the European BigMedilytics project. The solution allows two parties, in this case Erasmus MC and Achmea, to securely compute the inner join of their respective datasets and to train a Lasso regression algorithm on the obtained dataset in a privacy-preserving way, assisted by healthcare information intermediation company ZorgTTP. No party learns any patient data, other than the number of overlapping patients from both datasets, the result of the regression, and some intermediate values of the regression algorithm, which we believe to be fully acceptable.
We implemented our solution on three computing nodes, running on separate machines, and located at different sites, under control of Achmea, Erasmus MC, and ZorgTTP, respectively. The experimental results show that our implementation is reliable, accurate, and fast enough for research purposes. We conclude that our solution is a promising tool for privacy-preserving machine learning tasks on distributed patient data, potentially leading to an improvement of the quality of healthcare, while respecting the privacy of involved patients.
Availability of data and materials
The (artificial) datasets used for the scalability experiments are included in the supplementary material, in the \(\texttt {artificial\_datasets\_scalability.zip}\) file. The log files measuring the running time of the solution for these input datasets are also included, to be found in \(\texttt {log\_files\_scalability.zip}\). Finally, the dataset used to test the accuracy of our solution is a publicly available dataset [45].
Notes
 1.
 2.
 3.
 4.
To be completely precise, we also reveal the size of the intersection of the two datasets to the involved parties.
 5.
Domain and codomain do not need, strictly speaking, to consist of alphanumeric strings, but we restrict to this situation for simplicity.
 6.
Formally, each share \(s_i\) should be supplied to \(\texttt {Rec}\) together with its index i, but we omit this to simplify notation.
 7.
In the healthcare scenario that motivates this article, applying the model could arguably be performed without advanced cryptographic techniques, since the smaller amount of data needed would probably make direct data usage feasible and proportional (under appropriate informed consent). For this reason, we have decided not to focus on the application of the trained Lasso model.
References
 1.
Dwivedi YK, Hughes DL, Coombs C, Constantiou I, Duan Y, Edwards JS, Gupta B, Lal B, Misra S, Prashant P, et al. Impact of COVID-19 pandemic on information management research and practice: transforming education, work and life. Int J Inf Manag. 2020;102211.
 2.
...Raisaro JL, Marino F, Troncoso-Pastoriza J, Beau-Lejdstrom R, Bellazzi R, Murphy R, Bernstam EV, Wang H, Bucalo M, Chen Y, Gottlieb A, Harmanci A, Kim M, Kim Y, Klann J, Klersy C, Malin BA, Méan M, Prasser F, Scudeller L, Torkamani A, Vaucher J, Puppala M, Wong STC, Frenkel-Morgenstern M, Xu H, Musa BM, Habib AG, Cohen T, Wilcox A, Salihu HM, Sofia H, Jiang X, Hubaux JP. SCOR: a secure international informatics infrastructure to investigate COVID-19. J Am Med Inf Assoc. 2020;27(11):1721–6. https://doi.org/10.1093/jamia/ocaa172.
 3.
Ikram MA, Brusselle GG, Murad SD, van Duijn CM, Franco OH, Goedegebure A, Klaver CC, Nijsten TE, Peeters RP, Stricker BH, et al. The Rotterdam study: 2018 update on objectives, design and main results. Eur J Epidemiol. 2017;32(9):807–50.
 4.
Chicco D, Jurman G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak. 2020;20(1):16.
 5.
Sweeney L. Weaving technology and policy together to maintain confidentiality. J Law Med Ethics. 1997;25(2–3):98–110.
 6.
Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: 2008 IEEE symposium on security and privacy (sp 2008), 2008;111–125. IEEE.
 7.
Zerka F, Barakat S, Walsh S, Bogowicz M, Leijenaar RT, Jochems A, Miraglio B, Townend D, Lambin P. Systematic review of privacy-preserving distributed machine learning from federated databases in health care. JCO Clin Cancer Inf. 2020;4:184–200.
 8.
Beyan O, Choudhury A, van Soest J, Kohlbacher O, Zimmermann L, Stenzhorn H, Karim MR, Dumontier M, Decker S, da Silva Santos LOB, Dekker A. Distributed analytics on sensitive medical data: the personal health train. Data Intell. 2020 2(1–2):96–107.
 9.
Emam KE, Hu J, Mercer J, Peyton L, Kantarcioglu M, Malin BA, Buckeridge DL, Samet S, Earle C. A secure protocol for protecting the identity of providers when disclosing data for disease surveillance. J Am Med Inf Assoc. 2011;18(3):212–7. https://doi.org/10.1136/amiajnl2011000100.
 10.
De Optimale Hiv Behandeling Vinden Met MPC. https://www.tno.nl/nl/tnoinsights/artikelen/deoptimalehivbehandelingvindenmetmpc/. Accessed: 2020-10-26.
 11.
Spini G, van Heesch M, Veugen T, Chatterjea S. Private hospital workflow optimization via secure k-means clustering. J Med Syst. 2020;44(1):8–1812. https://doi.org/10.1007/s1091601914734.
 12.
Zhang Y, Dai W, Jiang X, Xiong H, Wang S. Foresee: fully outsourced secure genome study based on homomorphic encryption. In: BMC medical informatics and decision making. 2015;15, 5 . Springer.
 13.
Shi H, Jiang C, Dai W, Jiang X, Tang Y, Ohno-Machado L, Wang S. Secure multi-party computation grid logistic regression (SMAC-GLORE). BMC Med Inf Decis Mak. 2016;16(S–3):89. https://doi.org/10.1186/s1291101603161.
 14.
van Haaften W, Sangers A, van Engers T, Djafari S. Coping with the general data protection regulation; anonymization through multiparty computation technology. 2020.
 15.
Freedman MJ, Nissim K, Pinkas B. Efficient private matching and set intersection. Eurocrypt Lect Notes Comput Sci. 2004;3027:1–19.
 16.
Cristofaro ED, Tsudik G. Practical private set intersection protocols with linear complexity. In: Sion R (ed) Financial cryptography and data security, 14th international conference, FC 2010, Tenerife, Canary Islands, Spain, January 25–28, 2010, Revised selected papers. lecture notes in computer science, 2010; 6052, 143–159. Springer. https://doi.org/10.1007/9783642145773_13.
 17.
Pinkas B, Rosulek M, Trieu N, Yanai A. SpOTlight: lightweight private set intersection from sparse OT extension. Cryptology ePrint Archive 2019.
 18.
Pinkas B, Schneider T, Tkachenko O, Yanai A. Efficient circuit-based PSI with linear communication. In: Ishai Y, Rijmen V (eds) Advances in cryptology—EUROCRYPT 2019—38th annual international conference on the theory and applications of cryptographic techniques, Darmstadt, Germany, May 19–23, 2019, proceedings, part III. Lecture notes in computer science. 2019; 11478, 122–153. Springer. https://doi.org/10.1007/9783030176594_5.
 19.
Bogdanov D, Kamm L, Laur S, Sokk V. Rmind: a tool for cryptographically secure statistical analysis. IEEE Trans Dependable Secure Comput. 2018;15(3):481–95.
 20.
Dankar FK, Brien R, Adams C, Matwin S. Secure multi-party linear regression. In: EDBT/ICDT workshops. 2014; 406–414. Citeseer.
 21.
de Cock M, Dowsley R, Nascimento AC, Newman SC. Fast, privacy-preserving linear regression over distributed datasets based on pre-distributed data. In: Proceedings of the 8th ACM workshop on artificial intelligence and security. 2015;3–14. ACM.
 22.
Hall R, Fienberg SE, Nardi Y. Secure multiple linear regression based on homomorphic encryption. J Off Stat. 2011;27(4):669.
 23.
Hu S, Wang Q, Wang J, Chow SSM, Zou Q. Securing fast learning! Ridge regression over encrypted big data. In: 2016 IEEE Trustcom/BigDataSE/ISPA. 2016; 19–26. https://doi.org/10.1109/TrustCom.2016.0041.
 24.
Nikolaenko V, Weinsberg U, Ioannidis S, Joye M, Boneh D, Taft N. Privacy-preserving ridge regression on hundreds of millions of records. In: 2013 IEEE symposium on security and privacy. 2013;334–348. IEEE.
 25.
Chen YR, Rezapour A, Tzeng WG. Privacy-preserving ridge regression on distributed data. Inf Sci. 2018;451–452:34–49. https://doi.org/10.1016/j.ins.2018.03.061.
 26.
Blom F, Bouman NJ, Schoenmakers B, de Vreede N. Efficient secure ridge regression from randomized Gaussian elimination. IACR Cryptol ePrint Arch. 2019;2019:773.
 27.
Gascón A, Schoppmann P, Balle B, Raykova M, Doerner J, Zahur S, Evans D. Privacy-preserving distributed linear regression on high-dimensional data. Proc Priv Enhanc Technol. 2017;2017(4):345–64.
 28.
Giacomelli I, Jha S, Page CD, Yoon K. Privacy-preserving ridge regression on distributed data. IACR Cryptol ePrint Arch. 2017;2017:707.
 29.
Join Clause (SQL). https://en.wikipedia.org/wiki/Join_(SQL)#Inner_join. Accessed: 2020-10-06.
 30.
Schmidt M. Least squares optimization with L1-norm regularization. CS542B Project Report. 2005;504, 195–221.
 31.
Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67.
 32.
McDonald GC. Ridge regression. Wiley Interdiscip Rev Comput Stat. 2009;1(1):93–100.
 33.
Santosa F, Symes WW. Linear inversion of bandlimited reflection seismograms. SIAM J Sci Stat Comput. 1986;7(4):1307–30.
 34.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol). 1996;58(1):267–88.
 35.
Schoenmakers B. MPyC—secure multiparty computation in Python. https://github.com/lschoe/mpyc.
 36.
Katz J, Lindell Y. Introduction to modern cryptography, 2nd edn. CRC Press, 2014. https://www.crcpress.com/IntroductiontoModernCryptographySecondEdition/KatzLindell/p/book/9781466570269.
 37.
Menezes A, van Oorschot PC, Vanstone SA. Handbook of applied cryptography. CRC Press; 1996. https://doi.org/10.1201/9781439821916. http://cacr.uwaterloo.ca/hac/.
 38.
Cramer R, Damgård I, Nielsen JB. Secure multiparty computation and secret sharing. Cambridge University Press; 2015. http://www.cambridge.org/de/academic/subjects/computerscience/cryptographycryptologyandcoding/securemultipartycomputationandsecretsharing?format=HB&isbn=9781107043053.
 39.
Lindell Y. Secure multiparty computation. Commun ACM. 2021;64(1):86–96. https://doi.org/10.1145/3387108.
 40.
FIPS P. 180-4. Secure hash standard. National Institute of Standards and Technology, 36, 2005.
 41.
Paillier P. Public-key cryptosystems based on composite degree residuosity classes. In: Stern J, editor. Advances in cryptology - EUROCRYPT ’99, international conference on the theory and application of cryptographic techniques, Prague, Czech Republic, May 2–6, 1999, proceedings. Lecture notes in computer science, 1999;1592, 223–238. Springer. https://doi.org/10.1007/354048910X_16.
 42.
Shamir A. How to share a secret. Commun ACM. 1979;22(11):612–3. https://doi.org/10.1145/359168.359176.
 43.
Ben-Or M, Goldwasser S, Wigderson A. Completeness theorems for non-cryptographic fault-tolerant distributed computation (extended abstract). In: Simon J, editor. Proceedings of the 20th annual ACM symposium on theory of computing, May 2–4, 1988, Chicago, Illinois, USA, 1988;1–10. ACM. https://doi.org/10.1145/62212.62213.
 44.
Gennaro R, Rabin MO, Rabin T. Simplified VSS and fast-track multiparty computations with applications to threshold cryptography. In: Coan BA, Afek Y, editors. Proceedings of the seventeenth annual ACM symposium on principles of distributed computing, PODC ’98, Puerto Vallarta, Mexico, June 28–July 2, 1998, pp. 101–111. ACM, 1998. https://doi.org/10.1145/277697.277716.
 45.
Medical Costs Dataset, Brett Lantz. https://www.kaggle.com/mirichoi0218/insurance. Accessed: 2020-10-26.
 46.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Acknowledgements
The authors wish to thank Tim van der Broek for his useful comments and suggestions to improve this article.
Funding
The BigMedilytics project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 780495.
Author information
Contributions
MBvE contributed to the privacy-preserving Lasso regression solution and to the writing of this manuscript. GS contributed to the cryptographic design and implementation of the solution and to the writing of this manuscript. OvdG formulated requirements and drew conclusions from the perspective of Achmea, and contributed to the writing of this manuscript. AIJ formulated requirements and drew conclusions from the perspective of Erasmus MC, and contributed to the writing of this manuscript. TV contributed to the design of the secure inner join protocol and of the secure gradient descent technique, and to the writing of this manuscript. WK contributed to the use-case definition and to the writing of this manuscript. AS contributed to the design of the secure inner join protocol, to the implementation of the solution, and provided project coordination. TR contributed to the implementation and testing of the solution. PL contributed to the implementation and testing of the solution. BK contributed to the design and implementation of the secure Lasso regression technique. NvdL contributed to the design and implementation of the secure Lasso regression technique. MKJ contributed to the use-case definition and provided project coordination. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Because no personal or medical data were used for the research described in this article (artificial data were sampled instead), approval from an ethics committee was deemed unnecessary. For the same reason, consent to participate is not applicable.
Consent for publication
Because no personal or medical data were used for the research described in this article (artificial data were sampled instead), consent for publication is not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1:
Zip archive containing the log files of the scalability experiments. The archive can be opened with any archive manager, and the included log files (with .log extension) can be read with any text editor.
Appendix: Secure inner join protocol
This appendix gives a more detailed version of the secure inner join protocol described in the “Aim and assumptions” section. In step 1, AC and EMC each generate a key pair of an additively homomorphic cryptosystem. In step 2, AC and EMC jointly generate a random key r, without ZorgTTP learning it. In step 3, this key is used to scramble the identifiers and encrypt the private data. In step 4, the scrambled identifiers are sent to ZorgTTP, together with the encrypted attribute values. In step 5, ZorgTTP looks for the matching indices of the scrambled identifiers and thereby obtains the intersection cardinality. In steps 6 and 7, AC and EMC generate their shares of the inner join table entries and send them, encrypted, to ZorgTTP. In step 8, ZorgTTP computes the encrypted remaining shares and returns them, so that AC and EMC can decrypt them.
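The eight steps above can be illustrated with a minimal single-process Python sketch. It is not the authors' implementation, and it rests on several assumptions that are not taken from the article: identifiers are scrambled with HMAC-SHA256 keyed by r, the additively homomorphic cryptosystem is a toy Paillier instance with insecurely small primes, each attribute value is encrypted under its owner's public key, each party holds a single integer attribute per identifier, and the function names (`keygen`, `encrypt`, `decrypt`, `scramble`, `inner_join_sketch`, `reconstruct`) are hypothetical. All three roles run in one process, so the sketch shows the data flow only and offers no security.

```python
import hashlib
import hmac
import math
import secrets


def keygen(p, q):
    """Toy Paillier key pair with generator g = n + 1 (primes far too small)."""
    n, phi = p * q, (p - 1) * (q - 1)
    return n, (phi, pow(phi, -1, n))


def encrypt(n, m):
    """Encrypt integer m (reduced mod n) under public key n."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n * n) % (n * n)


def decrypt(n, sk, c):
    phi, mu = sk
    return (pow(c, phi, n * n) - 1) // n * mu % n


def scramble(r, identifier):
    """Assumed scrambling function: keyed hash of the identifier."""
    return hmac.new(r, identifier.encode(), hashlib.sha256).hexdigest()


def reconstruct(share):
    """Recombine two additive shares modulo the owner's n."""
    a, b, n = share
    return (a + b) % n


def inner_join_sketch(ac_rows, emc_rows):
    # Step 1: each data party generates its own key pair (Mersenne primes).
    n_ac, sk_ac = keygen(2**31 - 1, 2**61 - 1)
    n_emc, sk_emc = keygen(2**61 - 1, 2**89 - 1)
    # Step 2: AC and EMC agree on a key r that ZorgTTP never sees
    # (sampled once here instead of being generated jointly).
    r = secrets.token_bytes(32)
    # Steps 3-4: scrambled identifiers and encrypted values go to ZorgTTP.
    at_ttp_ac = {scramble(r, i): encrypt(n_ac, v) for i, v in ac_rows.items()}
    at_ttp_emc = {scramble(r, i): encrypt(n_emc, v) for i, v in emc_rows.items()}
    # Step 5: ZorgTTP matches scrambled identifiers and learns the cardinality.
    matches = sorted(at_ttp_ac.keys() & at_ttp_emc.keys())
    joined = []
    for h in matches:
        # Steps 6-7: each party draws a random share of the OTHER party's
        # value and sends its negation, encrypted under that party's key.
        s_emc = secrets.randbelow(n_ac)   # EMC's share of AC's value
        s_ac = secrets.randbelow(n_emc)   # AC's share of EMC's value
        c_emc = encrypt(n_ac, -s_emc)
        c_ac = encrypt(n_emc, -s_ac)
        # Step 8: ZorgTTP multiplies ciphertexts (adds plaintexts); each
        # owner decrypts the remaining share of its own value.
        rest_ac = decrypt(n_ac, sk_ac, at_ttp_ac[h] * c_emc % (n_ac * n_ac))
        rest_emc = decrypt(n_emc, sk_emc, at_ttp_emc[h] * c_ac % (n_emc * n_emc))
        joined.append({"ac_value": (rest_ac, s_emc, n_ac),
                       "emc_value": (rest_emc, s_ac, n_emc)})
    return len(matches), joined
```

Calling `inner_join_sketch({"id1": 120, "id2": 95, "id3": 130}, {"id2": 1, "id3": 0, "id4": 1})` yields an intersection cardinality of 2 and two joined rows. Each row holds additive shares that only recombine to the original value when both parties' shares are added modulo the owner's modulus (`reconstruct`). In this sketch the two attributes are shared modulo different moduli; a real deployment would need shares in a format suitable for the subsequent secure regression.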
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
van Egmond, M.B., Spini, G., van der Galien, O. et al. Privacy-preserving dataset combination and Lasso regression for healthcare predictions. BMC Med Inform Decis Mak 21, 266 (2021). https://doi.org/10.1186/s1291102101582y
DOI: https://doi.org/10.1186/s1291102101582y
Keywords
 Secure multi-party computation
 Privacy
 Machine learning
 Lasso regression