- Technical advance
- Open Access
A generic solution for web-based management of pseudonymized data
BMC Medical Informatics and Decision Making volume 15, Article number: 100 (2015)
Collaborative collection and sharing of data have become a core element of biomedical research. Typical applications are multi-site registries which collect sensitive person-related data prospectively, often together with biospecimens. To secure these sensitive data, national and international data protection laws and regulations demand the separation of identifying data from biomedical data and the introduction of pseudonyms. Neither the formulation in laws and regulations nor existing pseudonymization concepts, however, are precise enough to directly provide an implementation guideline. We therefore describe core requirements as well as implementation options for registries and study databases with sensitive biomedical data.
We first analyze existing concepts and compile a set of fundamental requirements for pseudonymized data management. Then we derive a system architecture that fulfills these requirements. Next, we provide a comprehensive overview and a comparison of different technical options for an implementation. Finally, we develop a generic software solution for managing pseudonymized data and show its feasibility by describing how we have used it to realize two research networks.
We have found that pseudonymization models are highly heterogeneous, already on a conceptual level. We have compiled a set of requirements from different pseudonymization schemes. We propose an architecture and present an overview of technical options. Based on a selection of technical elements, we suggest a generic solution. It supports the multi-site collection and management of biomedical data. Security measures are multi-tier pseudonymity and physical separation of data over independent backend servers. Integrated views are provided by a web-based user interface. Our approach has been successfully used to implement a national and an international rare disease network.
We were able to identify a set of core requirements out of several pseudonymization models. Considering various implementation options, we realized a generic solution which was implemented and deployed in research networks. Still, further conceptual work on pseudonymity is needed. Specifically, it remains unclear how exactly data is to be separated into distributed subsets. Moreover, a thorough risk and threat analysis is needed.
While collaborative research is developing rapidly (e.g. [1–4]), a series of publications has shown relevant privacy threats, especially when genomic data are involved [6, 7]. On the other hand, security of research data and biosamples is being addressed by regulations. The most important are the European Directive on Data Protection (which is currently undergoing a reform process), the European Recommendation on Research on Biological Materials of Human Origin and the HIPAA Privacy Rule.
On the technical and organizational level, state-of-the-art security measures are needed to protect sensitive research data from unauthorized access. Important techniques include the use of secure network communication, strong authentication mechanisms, role-based access and different access tiers.
Definitions and scope
Pseudonymization adds an important layer of protection for person-related data. It has been implemented in many projects (e.g. by the UK Biobank, the Icelandic biobank run by deCode Genetics and the German National Cohort) and it has become an important security measure required by laws and regulations. The term “separation” plays a central role in various definitions and regulations. The formulation in the Proposal for a General Data Protection Regulation of the Council of the European Union is: “personal data may be processed for […] scientific research purposes only if […] data enabling the attribution of information to an identified or identifiable data subject is kept separately from the other information”; in the German Federal Data Protection Act: “characteristics enabling information concerning personal or material circumstances to be attributed to an identified or identifiable individual shall be stored separately”; and in the Italian Personal data protection code: “identification data shall be stored separately from all other data”. There are, however, different definitions of pseudonymity and even synonyms for the term itself (including “coding” and “aliasing” [15, 17]). For the purpose of this work, we will use the term pseudonymity according to the description by Kalra et al. (who in turn cite a definition by Lowrance): “Pseudonymization (reversible anonymization, or key coding) involves separating personally identifying data from substantive data but maintaining a link between them through an arbitrary code (the key).”
The ISO Technical Specification 25237 on “Health informatics - Pseudonymization” also addresses separation: “identifying and payload data shall be separated”. While separation is a common element in the cited sources, there is no explicit specification of what exactly has to be separated. It is clear that separation will require at least two data pools. Kalra uses the term “identifying” data to characterize the first one, while the above regulations describe this first part as “data enabling the attribution […] to an identified or identifiable data subject”. In slight difference to Kalra, ISO 25237 uses the terms “identifying”, “quasi-identifying” or “indirectly identifying”. We will refer to identifying data as master data. For the content of the second pool, Kalra uses “substantive data”, the regulations call it “other data”, and ISO uses the term “payload”. It is particularly unclear which attributes should (or can) remain in this second pool. Both ISO 25237 and Pommerening et al. have addressed but not completely clarified this. Pommerening et al. have introduced additional types of data: 1) identifying data, 2) medical or clinical phenotype data, 3) data associated with the management of biospecimens, and 4) data resulting from the analysis of biospecimens. We will not further address the specifications of different types of data [19, 20], and consider data pools to be pre-defined. We recommend, however, clarification by further work. Our focus will be on separation and on the management of pseudonyms. We will address pseudonymous identifiers for biosamples, but we will not go into any detail of biosample management itself.
Some further clarifications are necessary. While anonymous data are not considered personal data in a regulatory sense, pseudonymous data remain personal data. There is a distinction between irreversible and reversible pseudonymity; within this article, we will focus on the latter case. As separation of data is a core characteristic of pseudonymization, we illustrate it in Fig. 1, which shows two different options.
Option 1, which we call “one-tier pseudonymized”, has been used in trials for decades; the “two-tier pseudonymized” approach is more recent and is recommended by ISO 25237 and by Pommerening et al. Using the terminology from ISO 25237, we will consider A, B, C as “identifying” and D, E, F as “payload”. Two-tier pseudonymity means that the datasets are interlinked via a set of cascading identifiers. When two-tier pseudonymization is used, each component maintains its own namespace for identifiers. The identifiers from different namespaces are linked by a dedicated mapping service. The figure also shows the “integrated dataset” that can be constructed by de-pseudonymizing the dataset. We will refer to this simplified data-centric view of Fig. 1 throughout this paper.
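The cascading-identifier scheme can be sketched as follows. This is a minimal illustration under our own assumptions, not the implemented system; all class names and identifier formats are invented:

```python
import secrets

class Component:
    """One data pool (e.g. identifying or payload data) with its own ID namespace."""
    def __init__(self, prefix):
        self.prefix = prefix
        self.records = {}

    def create(self, content):
        # Each component generates identifiers only within its own namespace
        record_id = f"{self.prefix}-{secrets.token_hex(4)}"
        self.records[record_id] = content
        return record_id

class MappingService:
    """The only place where identifiers from different namespaces are linked."""
    def __init__(self):
        self._links = {}

    def link(self, id_a, id_b):
        self._links[id_a] = id_b

    def resolve(self, id_a):
        return self._links[id_a]

identifying = Component("ID")   # pool holding A, B, C
payload = Component("PL")       # pool holding D, E, F
mapper = MappingService()

id_rec = identifying.create({"name": "Jane Doe"})
pl_rec = payload.create({"diagnosis": "G71.0"})
mapper.link(id_rec, pl_rec)

# Neither pool stores the other's identifier; de-pseudonymization
# is only possible via the mapping service.
assert mapper.resolve(id_rec) == pl_rec
```

The point of the sketch is that compromising either pool alone reveals only one namespace; the link between them exists solely at the mapping service.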
It is important to distinguish between concepts and their implementation. A pseudonymization concept describes a more or less abstract separation of data into different pools, potentially combined with specifications of valid data flows and implementation constraints. Already on the concept level, there are different approaches to pseudonymization. In this article, we will focus on two models: the ISO Technical Specification 25237 on “Health informatics - Pseudonymization”, which describes concepts fundamental to pseudonymity in biomedical research environments, and the German model by Pommerening et al., which is closely related to ISO 25237. Overviews of the concept by Pommerening et al. can be found in [21–23]. The solutions described by Brinkmann et al., Spitzer et al., and Lablans et al. are also based on the German model, and therefore these references contain short descriptions of the model.
Motivated by the need to design and implement IT solutions for several research projects [27–29], we had (1) to collect and systematize core requirements for secure solutions, (2) to compare implementation options, and (3) to implement a generic solution. The focus of our work is on multi-site research registries with sensitive biomedical data; the management of biosamples had to be addressed by the solution, but will only be touched on briefly in this article.
In order to get a conceptual basis, we started our work with an analysis of two comprehensive concepts [19, 20]. While the ISO specification is international by definition, the concept by Pommerening et al. is considered a set of quasi-standard requirements in Germany. From these two sources, we compiled a set of fundamental requirements for pseudonymized data management. Functional requirements were taken from related work on electronic data capturing systems and complemented with results from our own analyses, which we cannot describe in detail here. The next step was to design a system architecture fulfilling these requirements. For this purpose, we analyzed several architectural options. Then, we created an overview of technical options for an implementation and performed a comparison. The final step was the implementation of a generic solution. Its feasibility has been demonstrated in research networks, of which we will shortly describe two [27, 29]. Institutional Review Boards (IRBs) and data protection officers of the participating sites have approved the concept. For a comprehensive list we refer to Additional file 1.
The methodical approach described above has led to results which we will present in the same order: (1) core requirements, (2) an analysis of architectural options, (3) a high-level system architecture, (4) an analysis of technical options, (5) a technical design and (6) several implementations.
When implementing an electronic data collection system, the permission model for access to data has to be designed carefully, e.g. following the need-to-know principle as well as the principle of least privilege. Audit trails are essential in any case. Context-dependent rights and roles of users are an important factor: who (in which role) has the right and the need to know which data in which context. In general, a health care professional treating a patient may need more permissions than a researcher. We will not follow up on these aspects, because they are not directly related to the problem of pseudonymization.
Instead, we will focus on functional requirements between the system and its users that are affected by introducing pseudonymity, and on non-functional requirements that are implied by pseudonymization concepts. We note that this classification is in line with the corresponding concepts in software engineering: non-functional requirements may be defined as “not directly concerned with the specific services delivered by the system to its users”.
Our solution focusses on a clearly defined use case: collaborative prospective electronic collection of longitudinal person-related data. The key stakeholders are users and patients. Users are health care professionals, study nurses, registry monitors and researchers. Functional requirements have been compiled from related work by Demiroglu et al., Bialke et al., Meyer et al. and Spitzer et al. and from a comprehensive overview by Ohmann et al. We complemented them with results from our own software engineering process, which we cannot describe in full detail here (see Kalman et al. and Kohlmayer et al. for a short overview). While links to biosamples play an important role in many systems (e.g. [27, 31, 35–37]), we will not cover this aspect in detail.
For our overview of functional requirements, we introduce a systematic order and focus on the subset of specific relevance to our topic. Where appropriate, we motivate functional system requirements with usage scenarios. The following set has resulted from our approach:
R-C2 - Data Structuring: Typically, different types of data are collected in different eCRFs that belong to the same context (e.g. patient or visit). The system shall provide means to maintain links between associated entities and documents. 
We note that requirements R-C1, R-C2, R-C3 and R-C4 need a legal basis and must be covered by informed consent. We further note that R-C3 may include an integrated view of master data and other types of data. This view should adhere to the need-to-know principle and must be compliant with legal and regulatory requirements. A usage scenario for R-C3 is the process of re-contacting a patient or proband in cases specified by patient information and informed consent. R-C3 is also of relevance for follow-up data collection, where (additional) information about a patient or proband needs to be entered during multiple visits. For this purpose, documents also need to be integrated with master data [27, 33, 35]. Finally, some processes of data management may also require an integrated view on several documents, for example for cross-validation.
Patients have an inherent interest in the security and confidentiality of the data they have consented to share. At the same time, data management solutions for collaborative biomedical research have to be compliant with national and international laws. As outlined above, two pseudonymization concepts [19, 20] have been our basis for formulating non-functional requirements. We will present a set of non-functional requirements, which define a system that is able to fulfill the functional requirements while ensuring compliance with pseudonymization concepts. Where appropriate, we will motivate non-functional requirements with references to functional requirements and requirements implied by these concepts. To describe the central aspect of “separation”, we start by focusing on the data layer. Typically, information systems are described by further layers, comprising an application and a presentation layer, which support (and to some degree model) real-world processes. We will structure non-functional requirements along these layers.
Requirements on the data layer
On the data layer, the concepts [19, 20] define pseudonymization of a dataset as a separation into subsets containing different types of data. The records within these subsets are stored in different locations and they are interlinked with identifiers. Data collection and management can be modeled as a set of CRUD operations on documents: (1) Create: creates a new document; (2) Read: provides a view of the data contained in one document or a list of other documents related to one document; (3) Update: provides a view of the data contained in a document while allowing updating its content; (4) Delete: deletes a document.
R-D1 - Distributed CRUD: The system shall implement data collection on top of a set of distributed databases.
R-D2 - Physical separation: The system shall support the hosting of different backends on different physical machines with different host names.
R-D3 - Two-tier pseudonymization: The system shall provide support for two-tier pseudonymization, implemented with an additional mapping service.
As a result of R-D1, operations on documents must be performed across different data pools. Requirement R-D2 is motivated by the fact that [19, 20] require the installation of separate governance, duties and responsibilities for the individual data pools. Requirement R-D3 is motivated by our aim to provide a generic solution for both pseudonymization concepts [19, 20] which require two-tier pseudonymity in several cases (e.g. when biosamples are involved and/or in multi-site research networks).
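To make the data-layer model concrete, the following sketch models a single data pool as a document store exposing the four CRUD operations. It is an illustration under our own assumptions, not the production code; in the actual architecture (R-D1), several such stores would run on separate backends:

```python
class DocumentStore:
    """Illustrative: one data pool exposing the four CRUD operations."""

    def __init__(self):
        self._docs = {}
        self._counter = 0

    def create(self, content):
        """Create a new document and return its identifier."""
        self._counter += 1
        doc_id = f"doc-{self._counter}"
        self._docs[doc_id] = content
        return doc_id

    def read(self, doc_id):
        """Provide a view of the data contained in one document."""
        return self._docs[doc_id]

    def update(self, doc_id, content):
        """Replace the content of an existing document."""
        self._docs[doc_id] = content

    def delete(self, doc_id):
        """Remove the document."""
        del self._docs[doc_id]

store = DocumentStore()
doc = store.create({"eCRF": "baseline visit"})
store.update(doc, {"eCRF": "baseline visit", "status": "validated"})
assert store.read(doc)["status"] == "validated"
```

Under R-D1, every operation that spans an entity distributed over two pools must be executed against both stores, with the link maintained via pseudonymous identifiers rather than within either store.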
Requirements on the application layer
As the application layer supports workflows that are provided to users, requirements on this layer are strongly influenced by the above functional requirements:
R-A1 – De-Pseudonymization: The system shall support the de-pseudonymization of data.
On the documentation level, and thus on the system level, re-identification requires reversing the separation between identifying data and payload data [20, 31, 35]. De-pseudonymization, which equals a re-identification of data subjects, is a core element of (reversible) pseudonymity. On the real-world level, re-identification means revealing the hidden identity of a subject. The non-functional requirement R-A1 is implied by the functional requirement R-C3. The latter is motivated by several usage scenarios for which a legal basis exists. We have summarized them above and they are described in detail in ISO 25237.
Pommerening et al. have added the following requirements on the application layer:
R-A2 - Client-side re-combination: The reconstruction of the logical global dataset shall only be performed at the client side to reduce the number of attack vectors.
R-A3 - Confidentiality of internal identifiers: Clients shall be unable to learn the pseudonymous identifiers used in the distributed databases.
Requirements on the presentation layer
This requirement (R-P1) is motivated by the fact that there are reports on systems in which the linkage of different data subsets must be performed manually, i.e., by copying and pasting an identifier displayed by the interface of one application into the interface of another application. This process is time-consuming and error-prone. Furthermore, implementing a consistent user interface for several distributed systems while ensuring a continuous workflow means that there should be no need for users to separately authenticate on the multiple systems involved.
Requirements on all layers
Data separation will inevitably lead to complex architectures which may negatively affect maintainability. We have therefore added the following requirement:
This requirement is quite typical for multi-site scenarios: the integration of a system module into the security architectures of (distant) clinical or research sites implies challenges such as managing institutional firewalls and software installation policies. Frequently, the system needs to support a large set of users that are distributed geographically.
Analysis of architectural options
Next, we have analyzed options to build an integrated interface to distributed databases. The architectural design space is shown in Fig. 2. In the remainder of this section we focus on applications built with web technologies. We note, however, that the presented system architectures are applicable to other development techniques as well.
The concept of loose coupling illustrates a thin layer implementing presentation-layer integration. In this case, users need to sequentially access different clients for separate systems, which might be displayed next to each other or be embedded into each other. Moreover, methods for context management, such as HL7 CCOW, may be used. Loose coupling supports only limited exchange of data between interfaces (interface-to-interface communication). Operations like creating, updating and deleting documents have to be performed manually, potentially repeatedly, on the interfaces of the multiple systems over which the data of an entity is distributed. To maintain consistency, identifiers must often be transferred manually from one system to another. Obviously, this represents an error-prone and inefficient workflow, which may lead to data quality issues. Furthermore, using the system is complicated, as different modules may utilize different user interface designs and different interaction patterns, which may negatively affect user acceptance. Most papers describing implementations of pseudonymization concepts are based on the principle of loose coupling [26, 31, 32, 36, 37, 44, 45].
A design with tight coupling, which is also shown in Fig. 2, allows for integrated access to several endpoints. Here, each endpoint provides its own graphical user interface but a dedicated component (called the primary service) delivers the main application and provides presentation-layer integration of user interfaces. Moreover, endpoints may provide additional programming interfaces for access to data. These access points may be used by the central component to enable interface-to-interface communication, resulting in a seamless user experience. In contrast to loosely coupled designs, the central component needs to process and display data from different domains. Moreover, business logic is more complex because access and interaction between the separated services must be orchestrated.
Based on the requirements described above and the architectural options identified, we designed a high-level system architecture. To reduce installation efforts and ensure compatibility with enterprise security architectures, we decided to implement a web-based system that adheres to established web standards and is thus accessible from a broad spectrum of web browsers. Moreover, distributing updated versions of our software becomes easy (R-M1).
On an architectural level, we decided against a loosely coupled design due to the problems described above. For a Single-Page Application (SPA), the technologies supported by legacy web browsers are insufficient, and frameworks for SPAs are partially immature and not in widespread use. Hence, we decided on a tightly coupled architecture that guarantees seamless integration (R-A1) and good usability (R-P1) based on reliable and widely supported technologies.
The requirement that it must only be possible to re-construct the dataset at the client side (R-A2) is fulfilled by employing client-side mashup-techniques. In short, a client-side mashup displays data from different servers in an integrated manner within a user’s local browser. To support multi-tier pseudonymity (R-D3), we maintain a mapping service, which translates pseudonymous identifiers from the namespace of one system into the namespace of another. It is ensured that the distributed datasets can only be joined at the client systems by exclusively delivering data to clients, meaning that no data is (directly) exchanged between backend services. In this process it is further ensured that clients cannot learn pseudonymous identifiers (R-A3) by substituting identifiers within the distributed datasets with temporary identifiers before delivering any data. To allow for a re-combination of separated data subsets these temporary identifiers must be synchronized between the backend services. To this end, a secure server-to-server communication channel is needed that is not accessible by clients. To ensure consistency while supporting common types of database operations (R-D1), data is managed in a set of distributed relational database management systems (RDBMSs).
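As an illustration of this substitution step, the following sketch shows a backend replacing its internal pseudonymous identifier with a temporary one before delivering a record. All names and identifier formats are invented for illustration; this is not the authors' implementation:

```python
import secrets

class Backend:
    """Illustrative backend that never exposes its internal pseudonyms (R-A3)."""

    def __init__(self):
        self.records = {"PSN-001": {"field": "value"}}
        self.temp_map = {}  # temporary ID -> internal ID, kept server-side

    def deliver(self, internal_id):
        # Substitute the internal identifier with a freshly generated
        # temporary one before anything leaves the server
        temp_id = secrets.token_urlsafe(8)
        self.temp_map[temp_id] = internal_id
        record = dict(self.records[internal_id])
        record["id"] = temp_id  # the client only ever sees the temporary ID
        return record

backend = Backend()
delivered = backend.deliver("PSN-001")
assert delivered["id"] != "PSN-001"
# The backend can later resolve the temporary ID when synchronizing
# with the other backends over the server-to-server channel.
assert backend.temp_map[delivered["id"]] == "PSN-001"
```

In the full architecture, the `temp_map` association (or an encrypted token carrying it) is what the backends synchronize so that the client can join the subsets without ever learning an internal identifier.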
Two problem-specific challenges arise from the design decisions described above. Firstly, to ensure a continuous workflow, a Single-Sign-On (SSO) mechanism has to be implemented (R-P1). Secondly, in all modern browsers, the implementation of client-side mashups of data retrieved from different domains is complicated by the Same-Origin-Policy (SOP). The basic principle of the SOP is that “only the site that stores information in the browser may later read or modify that information”. This security feature prohibits cross-domain communication, which, on the other hand, is required to re-integrate distributed data subsets that must be hosted on different physical machines in our setup (R-D2).
Analysis of technical options
In order to proceed from a high-level architecture towards an implementation, a variety of implementation options for different aspects of the architecture exists and has to be discussed. In this section, we will present and compare several options for implementing the most important modules of the system: (1) client-side web mashups, (2) single-sign-on mechanisms, and, (3) methods for providing a secure server-to-server communication channel.
A mashup has been defined as “a website […] that seamlessly combines content from more than one source into an integrated experience”. In the context of our work, implementing a web mashup is challenging, because data is stored in different physical locations and is thus accessed via different interfaces provided by servers with different fully qualified domain names. Integrating such distributed interfaces and data conflicts with the Same-Origin Policy (SOP), a security feature which prohibits cross-domain communication. It was designed to protect a user’s privacy by preventing sites from tracking a user’s behavior, e.g., by reading stored cookies or data from the cache. The SOP also prevents a user’s actions from being corrupted by other websites and it prevents websites from performing transactions on behalf of the user. The SOP is implemented by only allowing scripts to modify a web page of the same origin (i.e., loaded by the browser from the same domain). The work by De Ryck et al. presents an overview of state-of-the-art mashup techniques. Well-known techniques that can be used to realize mashups include HTML Frames, postMessage, XMLHttpRequest (XHR) and JSON with Padding (JSONP). However, not all of these techniques provide means to circumvent the restrictions implied by the SOP.
An HTML frameset is a group of HTML frames. The content of a frame is dynamically loaded and independent of the other frames in a frameset. IFrames (inline frames) were introduced in HTML 4.0. In contrast to standard HTML frames, IFrames allow for embedding HTML documents in the body of other HTML documents. HTML framesets and IFrames can be used to display contents from different domains in a browser, but without supporting any kind of interaction. Enforcing the SOP, the contents of each origin will be loaded separately and isolated from the contents of other frames.
The HTML postMessage mechanism enables cross-domain communication by allowing scripts to send messages to HTML frames or windows of arbitrary origin. This feature, available since HTML 5, relies on the recipient to verify that the message comes from a valid or authorized sender. It is not supported by legacy browsers.
JSON with Padding (JSONP)
A server-side mashup can be implemented by employing a proxy that integrates data from different sites into a common context and delivers it to the clients. A proxy can also be used to mask the different origins of data and thus circumvent the SOP. In our context, however, server-side mashups cannot be used, because a proxy must be able to see all the information that it has to integrate. Sensitive personal data managed by research systems must only be transported via encrypted channels (typically using Transport Layer Security (TLS/SSL)). Additionally, this encrypted channel must be established between the client and the data stores, because of the requirement to restrict the context of data linkage to the local machines of users.
A web mashup must be combined with a Single-Sign-On mechanism that ensures a continuous workflow by making it unnecessary for users to separately authenticate on the multiple systems involved. In addition, a complex system for collaborative research also requires means for authorization. The same design decisions that must be made for authentication must also be made for authorization: (1) should the corresponding mechanism be implemented by a dedicated component within the distributed system (e.g. using Shibboleth for authentication), or (2) should each component handle the corresponding aspect by itself? In this section we provide an overview of these design dimensions and present several techniques that can be used to implement the various aspects involved.
In this setup, each component handles authentication by itself. The most straightforward implementation simply includes the user’s credentials (user name and password) in all requests to an endpoint. When, as in our case, servers are stateful, a server-side session must additionally be associated with the client. Sessions are usually identified by a randomly generated token, and these IDs can (and will) thus be different for different sessions at different servers. As a result, the client must either actively manage a set of session IDs, one for each server, within its business logic or use a passive approach, such as cookies.
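The active variant can be sketched as follows: the client keeps one session ID per server within its business logic. This is a simplified illustration with invented names; credential checking is omitted:

```python
import secrets

class Server:
    """Illustrative stateful endpoint with its own session namespace."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()

    def authenticate(self, user, password):
        # Credential verification omitted for brevity; on success,
        # a random session token is issued
        session_id = secrets.token_hex(8)
        self.sessions.add(session_id)
        return session_id

    def handle(self, session_id):
        if session_id not in self.sessions:
            raise PermissionError("no valid session")
        return "ok"

class Client:
    """Client actively managing one session ID per server."""

    def __init__(self):
        self.session_ids = {}

    def login(self, server, user, password):
        self.session_ids[server.name] = server.authenticate(user, password)

    def request(self, server):
        return server.handle(self.session_ids[server.name])

client = Client()
master, payload = Server("master"), Server("payload")
client.login(master, "alice", "secret")
client.login(payload, "alice", "secret")
assert client.request(master) == "ok" and client.request(payload) == "ok"
```

Note that the two session IDs are independent random tokens, which is exactly the bookkeeping burden that motivates either cookies or a proper SSO mechanism.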
While not directly related to authentication and authorization, cookies are a widespread technique to make user sessions persistent across several requests to an endpoint. Here, the unique session ID is stored in a local file (called cookie), which is transparently transferred to the host on every request. Because the Same-Origin-Policy also applies to cookies (a single cookie cannot be sent to multiple endpoints hosted on different domains), this mechanism cannot be used to implement cross-domain Single-Sign-On. However, cookies can complement SSO solutions, because they can be used to persist individual sessions at different endpoints.
Single-Sign-On can also be implemented with server-to-server communication. Here, opening a session at one endpoint transparently creates sessions on the other endpoints as well, by means of a communication mechanism between servers. As a result, (a) it can be ensured that a single user session is identified by the same token on different endpoints, and (b) there is no need to send the user’s credentials to the endpoints with every request. This technique can be implemented, e.g., with a multicast protocol such as JGroups, with sockets or with a shared file system. Such an approach is, however, difficult to integrate into enterprise security architectures.
Typically, SSO solutions are implemented with cryptographic access tokens. Basically, a token is an object that encapsulates the identity and potentially the roles of a user as well as a session ID. Tokens generated by one system can be used to perform operations on another system. A token can (and must) be validated by the target system. From a conceptual perspective, using access tokens is not different from the server-to-server communication approach; the only difference is that with access tokens server-to-server communication is indirect, i.e., performed via the client. This makes the approach feasible for implementing SSO between several isolated services on the World Wide Web. Consequently, the approach is implemented, e.g., by Kerberos and Shibboleth. Access tokens provide a secure communication channel between servers, meaning that the client cannot read or modify the content of a token. This is especially useful in our scenario, because it can be used to fulfill an additional non-functional requirement. When implementing access tokens, the main challenges are (1) transferring tokens from the clients to the servers, and (2) key management.
Rights and roles
The handling of authorization of a user’s actions is typically coupled with authentication. As a consequence, the design space is closely related to the design space for Single-Sign-On solutions. Role-based access control (RBAC) is an authorization mechanism in which rights are granted to users depending on their associated roles. A role encapsulates a set of permissions. Analogously to SSO, RBAC can be realized with a) a centralized component that authorizes users or b) a decentralized solution in which every system implements an RBAC component and manages authorization by itself. Important standards for authorization in distributed environments include SAML and XACML.
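A minimal RBAC check can be sketched as follows; the role and permission names are invented for illustration and do not reflect the deployed permission model:

```python
# Each role encapsulates a set of permissions
ROLE_PERMISSIONS = {
    "study_nurse": {"document:create", "document:read"},
    "registry_monitor": {"document:read", "document:validate"},
}

def is_authorized(user_roles, permission):
    """Grant access if any of the user's roles carries the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in user_roles)

assert is_authorized(["study_nurse"], "document:create")
assert not is_authorized(["registry_monitor"], "document:create")
```

In the centralized variant, such a lookup lives in one dedicated service; in the decentralized variant, each backend evaluates it against its own role store.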
Secure server-to-server communication
For synchronizing temporary pseudonyms between backend services, secure communication channels are needed. In this context, secure means that the contents of messages are hidden from the clients. This can be implemented with two different mechanisms. Firstly, backend servers can manage exclusive communication channels between them and use these to synchronize information about temporary pseudonyms. Secondly, a secure channel between servers can be built that is routed through the client by using cryptographic tokens.
Figure 3a shows how the reconstruction of a pseudonymized dataset using temporary identifiers can be performed with direct server-to-server communication. In step 1, the client requests a data item (A) from backend B1. The backend creates a temporary pseudonym for the data entry and persists its association to the actual identifier from its namespace (step 2). The data entry with substituted identifier is then delivered to the client (step 3). Next, the client requests the data item associated with the temporary identifier (step 4) from backend B2. In step 5, the backend requests a mapping of the temporary identifier from backend B1. B1 resolves this request by looking into its set of persisted temporary mappings (step 6). The answer to B2 must be routed through the mapping service (steps 7 and 8). Finally, in step 9, B2 delivers the data entry to the client. Problems with this approach include that (a) it is unclear when exactly the persisted substitution of an identifier may be deleted without implementing complex protocols for transactional guarantees, and (b) at least seven messages must be exchanged to recombine data distributed amongst two databases via the mapping service.
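The message flow of Fig. 3a can be simulated as follows. This is a hypothetical Python sketch; the class names, identifiers ("ID1", "PSN1") and record contents are illustrative only:

```python
import secrets

class BackendB1:
    """Holds master data; creates and persists temporary pseudonyms."""
    def __init__(self):
        self.data = {"ID1": {"master": "subject master data"}}
        self.temp_map = {}  # temporary pseudonym -> internal identifier

    def fetch(self, internal_id):
        temp = secrets.token_hex(8)        # step 2: create temp pseudonym
        self.temp_map[temp] = internal_id  # ...and persist the association
        record = dict(self.data[internal_id])
        record["id"] = temp                # step 3: substitute the identifier
        return record

    def resolve(self, temp):               # step 6: look up the substitution
        return self.temp_map[temp]

class MappingService:
    """Translates B1's namespace into B2's namespace (steps 5, 7 and 8)."""
    def __init__(self, b1, id_map):
        self.b1, self.id_map = b1, id_map

    def translate(self, temp):
        return self.id_map[self.b1.resolve(temp)]

class BackendB2:
    """Holds clinical data under its own pseudonym namespace."""
    def __init__(self, mapper):
        self.data = {"PSN1": {"ecrf": "clinical payload"}}
        self.mapper = mapper

    def fetch_by_temp(self, temp):         # steps 4-9
        return self.data[self.mapper.translate(temp)]

# Wiring: the mapping service links identifiers of B1 to those of B2.
b1 = BackendB1()
b2 = BackendB2(MappingService(b1, {"ID1": "PSN1"}))
record = b1.fetch("ID1")                   # steps 1-3
clinical = b2.fetch_by_temp(record["id"])  # steps 4-9
# Note: b1.temp_map still holds the substitution after the request has been
# served -- illustrating problem (a), the unclear deletion point.
```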
Figure 3b shows the reconstruction of a pseudonymized dataset with indirect server-to-server communication. Analogously to the previous example, the client requests a data item from backend B1 (step 1). In step 2, the backend creates an association with a temporary identifier, replaces the actual identifier for the data item and sends it back to the client. In contrast to the previous scenario, where the mapping from the actual identifier to the temporary pseudonym is persisted, B1 also sends an encrypted token containing the association. The client forwards the token to the mapping service (step 3), where it is decrypted and the ID from backend B1 is translated into the associated ID at backend B2. Next, the mapping service generates a second token for B2, containing the mapping from the temporary pseudonym to the original identifier. This token is sent to the client (step 4), where it is forwarded to backend B2 (step 5). Finally, in step 6, B2 decrypts the token, performs a lookup for the data item and sends the result back to the client, along with the temporary pseudonym. At the client side, the data from both backends can be joined using the temporary identifier. We note that in this simple example, it would be sufficient to keep track of the relationships between requests and responses to perform a mapping of the contained data. In more complex real-world scenarios, however, tokens may contain multiple data entities. As a consequence, tokens must contain identifiers that allow combining individual data items from different backends. The use of temporary identifiers for this purpose is motivated by R-A3, which requires internal identifiers to be kept confidential. Compared to direct server-to-server communication, the number of exchanged messages is reduced. Fewer communication channels must be managed, because the same communication channels are used for client-to-server and server-to-server communication.
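The indirect variant of Fig. 3b can be sketched as follows. This is a hypothetical simulation: the `seal`/`unseal` helpers stand in for real token encryption (here only base64 encoding plus an HMAC; a production system would encrypt so that the client cannot read the mapping), and all keys and identifiers are illustrative:

```python
import base64
import hashlib
import hmac
import json
import secrets

def seal(key: bytes, payload: dict) -> str:
    """Placeholder for token encryption: base64 body plus HMAC tag."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    tag = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{tag}"

def unseal(key: bytes, token: str) -> dict:
    body, tag = token.rsplit(".", 1)
    expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    assert hmac.compare_digest(tag, expected), "token rejected"
    return json.loads(base64.urlsafe_b64decode(body))

K_MAPPER, K_B2 = b"key-of-mapper", b"key-of-b2"  # pairwise keys (assumed)

# Steps 1-2: B1 returns the record with a temporary pseudonym plus a token
# addressed to the mapping service; nothing is persisted at B1.
def b1_fetch(internal_id):
    temp = secrets.token_hex(8)
    record = {"id": temp, "master": "subject master data"}
    token = seal(K_MAPPER, {"temp": temp, "b1_id": internal_id})
    return record, token

ID_MAP = {"ID1": "PSN1"}  # held by the mapping service

# Steps 3-4: the mapper opens the token, translates B1's ID into B2's ID
# and re-seals the result for B2; the client only relays opaque tokens.
def mapper_translate(token):
    t = unseal(K_MAPPER, token)
    return seal(K_B2, {"temp": t["temp"], "b2_id": ID_MAP[t["b1_id"]]})

B2_DATA = {"PSN1": "clinical payload"}

# Steps 5-6: B2 opens the token, looks up the data item and returns it
# together with the temporary pseudonym, so the client can join the data.
def b2_fetch(token):
    t = unseal(K_B2, token)
    return {"id": t["temp"], "clinical": B2_DATA[t["b2_id"]]}

record, token = b1_fetch("ID1")
result = b2_fetch(mapper_translate(token))
```

The client joins `record` and `result` on the shared temporary identifier, without ever seeing an internal pseudonym.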
Moreover, as already noted above, indirect server-to-server communication can also be implemented relatively easily, if access tokens are already used for implementing Single-Sign-On. In the remainder of this section, we will elaborate on ways to implement cryptographic (access) tokens.
To maintain confidentiality of the contents of a token, symmetric or asymmetric (or hybrid) cryptography can be employed. Depending on the topology of the infrastructure (hierarchical or peer-to-peer), the choice of encryption scheme also affects key management. In a hierarchical infrastructure, a single component can be employed to manage all keys needed for the encryption of tokens, whereas in a peer-to-peer infrastructure each component needs to manage key pairs for every other component. In web-based applications, tokens can be implemented with JSON Web Tokens (JWT) utilizing related technologies such as JSON Web Encryption (JWE), JSON Web Signatures (JWS) and JSON Web Keys (JWK).
Based on requirements, we selected a set of technical options for an implementation. In this section, we will describe the resulting generic solution.
The maintainability requirement (R-M1) was given a high weight in our implementation. Our aim was to develop a solution that is robust while relying only on technologies supported by common web browsers (including widespread legacy browsers). We therefore decided to build a client-side Web Mashup with HTML Frames for the parts of the application that do not require any interface-to-interface communication, and with JSONP for all other cases. In our implementation, data is modeled as a tree-like structure where the root represents a subject’s master data and further nodes represent documents containing payload data. To support data collection, as defined by R-C1, we realized interfaces for CRUD operations on this tree with two functional views: “create, list & delete” and “view & update”. The former provides a list of documents and allows creating new or deleting existing documents. The latter shows the content of a document and allows updating it. Several instances of these two types of views may be displayed next to each other, thus providing an integrated interface as required by R-A1 to fulfill our functional requirements R-C2, R-C3 and R-C4. JSONP is a good solution for interfaces in which data of many entities has to be displayed, i.e. the “create, list & delete” view. In the other cases, i.e. the “view & update” interface, we leverage HTML Frames because of their ease of implementation and the resulting increased productivity when developing the software.
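JSONP loads data across origins by having the server wrap a JSON payload in a client-supplied callback function, which the browser then fetches via a script tag. A minimal, hypothetical sketch of the server-side part (function and callback names are illustrative, not taken from our implementation):

```python
import json

def jsonp_response(callback: str, data) -> str:
    """Build a JSONP response: the JSON payload wrapped in the callback
    name supplied by the client, so that it executes on load."""
    if not callback.isidentifier():  # reject anything but a plain identifier
        raise ValueError("invalid callback name")
    return f"{callback}({json.dumps(data)});"

# A client would request e.g. /subjects?callback=showList and define a
# function showList(data) { ... } to render the delivered entities.
```

Restricting the callback to a plain identifier is essential, since the callback string is reflected into executable script.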
There are multiple frameworks for implementing token infrastructures, but we decided to develop our own solution tailored to our requirements for the following reasons. JSON Web Tokens are still in a draft phase and currently immature. XACML and SAML come with a significant overhead regarding the size of the exchanged messages because they use an XML syntax. This is problematic when transmitting data via URLs. Furthermore, XACML and SAML are complex, resulting in a rather high implementation effort. In our system, tokens are encrypted with a hybrid method combining AES and RSA. The payload is encrypted symmetrically and integrity protected, and the key for decryption, K1, is encrypted asymmetrically with the public key K2 of the receiver. Tokens contain the key K1, username, password, counter and payload data P (e.g. encoded in JSON syntax). In the following, E_x(y) denotes the encryption and integrity protection of y using the key x. The token is built of two components. The first component contains the key K1, encrypted and integrity protected with the public key of the server (K2), i.e. E_K2(K1). The second component contains the username, password, counter and the payload, encrypted and integrity protected with the key K1 from the first component, i.e. E_K1(username, password, counter, P). Replay protection is implemented with a counter that is continuously incremented and prevents repeated acceptance of tokens by any receiver.
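The two-component token layout and the counter-based replay protection can be illustrated as follows. This is a hypothetical sketch: the actual AES/RSA encryption of the components is elided, and only the field layout and the counter check are shown.

```python
def build_token(k1, username, password, counter, payload):
    """Field layout of the two-component token; "E_K2"/"E_K1" labels stand
    in for the actual hybrid encryption of each component."""
    component_1 = ("E_K2", k1)                               # E_K2(K1)
    component_2 = ("E_K1", (username, password, counter, payload))
    return (component_1, component_2)

class ReplayGuard:
    """Each receiver tracks the highest counter accepted so far and rejects
    any token whose counter is not strictly greater."""
    def __init__(self):
        self.last_counter = -1

    def accept(self, counter: int) -> bool:
        if counter <= self.last_counter:
            return False  # token replayed or out of date
        self.last_counter = counter
        return True
```

Because the counter is strictly increasing, a captured token cannot be presented a second time, even if the attacker replays it unmodified.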
The design of our solution supports two or more physically distributed data stores (R-D2) and one or more mapping services (R-D3). All endpoints have to provide API access, and all services but the mapping service must be able to provide HTML-formatted data to clients as well. However, the mapping service must provide HTML-Frames that embed HTML-formatted data from other services, as will be explained below. A basic design fulfilling all requirements of the model by Pommerening et al. must implement separation of master data and clinical data. A minimal solution is shown in Fig. 4. The central component is implemented by the backend managing master data (primary service), because it stores the root nodes of the tree and is thus the starting point for user interactions.
Our final solution combines the above techniques into a Web-Mashup that integrates pseudonymized data (R-A2). The first variant, which uses HTML-Framesets, is sketched in Fig. 5. Here, a static frame at the top displays selected data of a single entity from the primary service. The content of the second frame, which is located at the bottom, is provided by the mapping service and contains an additional nested frame, which shows the corresponding clinical data. Please note that in Fig. 5 pseudonyms are represented as clear text instead of being encoded into tokens for the sake of readability. In our implementation pseudonyms are encoded into encrypted tokens and therefore never visible to the client (R-A3).
A typical workflow in which the above method is utilized is the creation of a new eCRF. Firstly, the user logs into the primary service and selects a specific subject. The primary service returns an HTML-Frameset as response, where the top frame contains an HTML document with the master data of the selected subject. A new instance of a predefined eCRF is generated and the resulting document is displayed using the previously described method. In this process, a chain of HTTP-Requests is generated, in which the user’s credentials are encoded into tokens and distributed to all endpoints to implement SSO.
A basic version of this process is shown in Fig. 6. To simplify our illustration, we assume that the first request, which also logs the user into the system, already contains the ID of the subject for which a new document is to be created. In a real-world scenario, the login process would already have been performed earlier. It can be seen that the user’s credentials and the ID of the data element that is to be displayed are sent to the primary service with the first request. From there on, the operation to be performed, on which data it is to be performed and for which user, is encoded into tokens. These tokens are generated at the backends. This also provides a transparent SSO mechanism. As an alternative to embedding nested frames, this process can also be implemented by using an HTTP-Redirect to route the request from the mapping service to the secondary service.
The second variant of our Web-Mashup uses JSONP requests to display distributed pseudonymized data. It is especially suitable for scenarios in which a larger set of distributed but related entities, e.g., a list of all subjects and an overview of associated clinical data, is to be displayed. The method is sketched in Fig. 7.
We have used the described generic solution as a basis for implementing the data management software for several research projects [27–29]. We will focus on two of them [27, 29] which are research networks for rare diseases. Here, the primary actors are health care professionals in an observational study. No specific intervention takes place, and data used for research are collected during health care activities. The associated biobanks use prepared “kits” (tubes with identifiers sent to sites and returned to a central biobank) with pseudonymous labels, which are registered in the system. Internal second level pseudonyms are provided as required by [19, 20]. We will not address the management of biosamples here.
Our first system instance is “mitoRegister”, a multi-site registry which is part of the mitoNET project. This research network for mitochondrial disorders was started in 2009 under funding by the German Federal Ministry of Education and Research (BMBF). It serves as a platform for over 18 centers in Germany, and by August 2015 about 1165 patients had been recruited. Data is managed in 35 eCRFs, which comprise over 900 attributes.
Our second system instance also supports a research network for neurodegenerative diseases, TIRCON. This project was started in 2012 and is funded by the European Commission’s FP7-Health Work Programme. Our software supports TIRCON’s registry for 13 partners from 8 countries (including the US, UK and Germany). By August 2015, about 265 patients had been recruited. Data is collected in 34 eCRFs consisting of almost 1000 attributes. TIRCON comprises further system parts.
In both projects three separated and two-tier pseudonymized data pools are managed by our solution: a) master data, b) clinical phenotype data and c) biospecimen registration data.
Our solution was implemented with JavaServer Faces as the driving technology for the backends, jQuery for client-side functionality, MySQL as the database system, and Tomcat application servers and Apache web servers as runtime environments. Both systems use two-factor authentication with One-Time-Passwords (OTP) following the OATH standard for user accounts with high privileges. Users are provided with time-based dongles that generate short-lived passwords, each of which can only be used to access the system exactly once. Communication between the endpoints and the clients is secured with Transport Layer Security (TLS/SSL). Automated penetration tests have been performed and did not detect any weaknesses. Master data is stored encrypted in the according backend. Accountability and integrity are ensured by an audit trail that records every data modification on each backend. We use virtual servers to provide fail-over mechanisms. All endpoints are secured by firewalls. Encrypted backups are created daily and transferred to one dedicated location per backend. Both systems were designed and implemented at our institution in close collaboration with the involved physicians and researchers, using an agile development process with short feedback cycles.
Both systems provide web-based data entry, support for cross-validation and plausibility checks, a (logical) central database, and an elaborate security concept with multi-tier pseudonymity for patient, specimen and image identifiers. A web browser is the only software needed to access the system. The informed consent serves as the basic agreement for the patient’s research participation. The systems use controlled vocabularies as well as standardized questionnaires [61–63]. Access roles comprise application administrators, monitors, physicians and lab personnel. Each role has different permissions in terms of create, read, update and delete (CRUD) operations for certain types of documents and system objects. Application administrators are able to perform all CRUD operations on user accounts but do not have access to any type of research data. Monitors may perform read-only operations on clinical data to perform quality assurance. Physicians may perform all CRUD operations on master data and clinical data. Each physician and patient is associated with his or her home institution. Physicians are only able to access data from patients of the same institution.
An example screenshot of the EDC system implemented for the TIRCON project is shown in Fig. 8. Here, a seamless integration of data from different pools is implemented, providing the “create, list & delete” functionality defined previously. The view shows an overview of and summary data about all subjects that can be managed by the current user. For each subject, the list is substructured into master data used for re-identification and an overview of the documents used to track biosamples and to collect clinical data. The view is realized with JSONP.
A second screenshot from the TIRCON application is presented in Fig. 9. It shows an integrated view of master data and clinical data from an eCRF, realized with an HTML-Frameset that is provided by the primary service. The view implements the previously defined “view & update” functionality. A top frame displays the master data of a selected subject, whereas the bottom frame shows the associated documents with clinical data, which are stored at the secondary service. The bottom frame is organized into two interlinked regions. Firstly, a document tree provides an overview of the different documents available for the subject. Secondly, the currently selected document from the tree is displayed.
In this article, we have presented an overview of challenges and solutions for implementing software for the management of pseudonymized data with web technologies. We have described a generic solution that can be tailored to different pseudonymization schemes by using a well-defined subset of the presented techniques. Our approach is independent of the actual distribution of data and it is able to manage associations between patients or visits and further external entities. The aim of our implementation is to build integrated applications in which the actual distribution of data is transparent to users, providing a virtual central database. Our solution features single-sign-on, supports multi-tier pseudonymity and does not require direct server-to-server communication. By providing various features, our generic solution can be used for the collection of a broad spectrum of different types of data in compliance with national and international laws. Moreover, as a basis, we chose a set of techniques that are supported by modern state-of-the-art browsers as well as legacy browsers. We have shown the practical applicability of our approach by using it as a basis for implementing two large, geographically distributed research networks. Both systems have been in productive use for several years. Several national and international Institutional Review Boards (IRBs) and Data Protection Commissioners of the participating sites have approved the concept.
Current access statistics (i.e. from August 2015) for our applications show that about 25 % of our users still access the systems with legacy browsers, such as Internet Explorer 8. As a consequence, we decided to implement our approach with technologies that are supported in older versions of widespread web browsers and did not utilize modern HTML 5 features, such as CORS, or client-side frameworks for building Single-Page Applications, such as AngularJS. Compared to the technologies currently utilized in our implementations, these methods have a great potential to reduce system complexity. The main reason is that instead of distributing business logic over several backend servers, more functionality can be bundled into the client application, reducing the need for logic that orchestrates distributed operations. Moreover, application development and system maintenance are simplified, because the complexity of the backend services can be reduced to a minimum. As support for modern HTML features increases, we plan to upgrade our solution from a tightly coupled application with server-side rendering to a single-page application.
From a security and privacy perspective, current pseudonymization concepts are limited by not being based on risk and threat analyses. This may be the reason why multiple schemes have been proposed but international consensus is missing. Overviews have been provided by [64, 65]; the schemes described differ in their requirements on the application level as well as on the data level. Some of these differences can be explained with the fact that the schemes have been developed for different use cases (e.g. for data warehouses as compared to research networks). But still, many of the inherent design decisions seem to be ad-hoc and lack thorough justification, which could have been provided by a risk and threat analysis. Some requirements can be well justified with general principles in IT security, e.g., the need-to-know principle and the principle of least privilege. Other methods specified by pseudonymization concepts, however, have a strong impact on system design but lack such justification. Among the important open questions are motivations for the application-level requirements R-A2 (client-side re-combination only) and R-A3 (confidentiality of internal identifiers) as well as the data-layer requirement R-D3 (two-tier pseudonymization).
The general problem is that it remains unclear how exactly data is to be separated into subsets. The ad-hoc classification into “identifying data” and “other types of data” is insufficient. For example, it is well understood that data falling into the second category can also be used to re-identify individuals (see, for example, discussions regarding diagnosis codes). To the best of our knowledge, ISO 25237 is the only work in the context of pseudonymization that lists a set of common identifiers with a high risk of re-identification. But still, no countermeasures against this inherent problem of pseudonymity have been proposed. This situation makes it difficult to find an adequate balance between privacy concerns and support for workflows that require re-identification of data and subjects. For example, the pseudonymization and de-pseudonymization processes may be designed differently. The work by Aamot et al. suggests an efficient routine process that requires contacting multiple ombudsmen, each of which controls a horizontal subset of the data (i.e. data about a certain set of patients), to de-pseudonymize datasets. In contrast, the concept of Pommerening et al. involves two additional parties in the process of de-pseudonymizing research data, each of which controls a vertical subset of the data (i.e. a certain set of attributes for all patients).
Threats and countermeasures
As already noted, a thorough risk and threat analysis is needed to determine to which extent pseudonymity and related methods, such as multi-tier pseudonymity or client-side re-combination of data, offer protection against common security threats, and at which costs. This in turn requires an analysis of potential attack vectors, risks associated with common types of data, methods for quantifying re-identification risks, and a consideration of results from related research areas, such as privacy-preserving data publishing or privacy-preserving data outsourcing. An analysis of this kind would exceed the scope of this article; here, we focus not on the methodological basis of pseudonymity, but on its implementation. Analogously to related work [31, 64], we will therefore simply assume that implementing pseudonymity as currently conceptualized offers protection against information disclosure. In the remainder of this section, we will focus on the specific aspects of our implementation and the deployed systems.
The STRIDE methodology provides an appropriate means to analyze threats and countermeasures systematically. STRIDE is an acronym for the security threat types addressed by the methodology, which are (1) spoofing, (2) tampering, (3) repudiation, (4) information disclosure, (5) denial-of-service, and (6) elevation-of-privilege. We will relate these threats to the basic security principles of ISO 27000 and RFC-4949:
“Authenticity – property that an entity is what it claims to be” 
“Integrity – property of protecting the accuracy and completeness of assets” 
“Accountability – responsibility of an entity for its actions and decisions” 
“Confidentiality – property that information is not made available or disclosed to unauthorized individuals, entities, or processes” 
“Availability – property of being accessible and usable upon demand by an authorized entity” 
“Authorization – approval that is granted to a system entity to access a system resource” 
The relation of security principles, threats and implemented countermeasures can be seen in Table 1. Many of the countermeasures deployed and implemented in our systems are well-known and in widespread use. First, we apply hardware-level protection, including restricted access to hardware, secure server rooms with a UPS, and redundant server hardware. Second, we implement network-level measures, such as communication based on TLS with certificates and IP-based filtering of requests. On the host level, we perform backups and maintain disaster recovery plans, deploy intrusion detection systems, firewalls and virus scanners, perform penetration testing and server hardening, and use virtualization as well as automated server updates. On the application level, our software uses common methods, such as limits for login attempts, automated logout after a certain time period, two-factor authentication, role-based access control, input sanitization (e.g. against SQL injection) and input validation. Additionally, our software implements various pseudonymization methods, as described previously. On the client level, we employ account management policies and perform user training.
Additionally, there are some more specific security measures implemented by our system. We have covered many of them in the previous sections: prevention of replay attacks on the token infrastructure (one-time access tokens), distributed non-delegated authentication where each component handles authentication and authorization autonomously (distributed authorization), an audit trail that records every data modification on each backend (audit trail) and the encryption of master data in the according backend (database encryption). Moreover, users from a specific participating site are only allowed to access data of patients recruited at their site (site-based view). This is implemented with the role-based access control mechanism.
Comparison with related work
The work presented in this article is not the first solution that has been proposed for pseudonymized data management. It is, however, one of the very few contributions that ask fundamental questions. We have presented a systematic solution for a typical use case, but we strongly suggest further work. Moreover, we have put emphasis on detailed descriptions of the alternatives that are available for implementing the methods described in this paper.
There are many articles that focus on application-level aspects of pseudonymization and do not describe technical details about the information systems that manage these data and implement the described processes [71–76]. Some articles on pseudonymization focus on other use cases than our work, leading to different functional requirements. An important group consists of approaches in which re-identification is only supported as an exceptional procedure [66, 77, 78]. In any case, we consider risk and threat analyses a must for the future. Furthermore, we did not consider work in which access to pseudonymized data is controlled by patients, e.g. via smart cards [79, 80].
Several articles have described systems that implement loose coupling. For an in-depth comparison of loosely and tightly coupled architectures we refer to Section “Architectural Options”, but we feel that the most important drawback is that users may need to manually transfer pseudonyms between component systems. The work by Eggert et al. uses a paper-based core process in which pseudonyms are printed on documents. Physicians use the pseudonym from the paper-based documents for remote entry of clinical data. Moreover, a trusted third party is involved in the re-identification process. Demiroglu et al. have published two articles describing loosely coupled systems that implement one-tier pseudonymity [31, 44]. Both systems manage links to two external systems: Starlims, which is used for managing biospecimens, and secuTrial, which acts as a clinical phenotype database.
The most elaborate approach for loose coupling has been presented by Lablans et al. Their work describes a reference implementation of a REST-based interface for the realization of clinical research networks. Its main functionality is to support identity management, i.e., to store master data together with an associated pseudonymized link (e.g. an identifier) to an external data pool. In the article, the EDC system secuTrial is used as an example. Analogously to our approach, the authors utilize tokens for the communication between the clients and the RESTful backend, but these tokens are not cryptographically protected. In contrast to our solution, their system only supports one-tier pseudonymization and makes internal pseudonyms (used for storage) visible to users.
The work by Brinkmann et al. is an implementation of the model by Pommerening et al. The system provides an integrated view on data from two separated pools in a web browser. It employs IFrames for a tight coupling of one-tier pseudonymized master data and DICOM images. Temporary identifiers are utilized to integrate these data without making pseudonyms visible to clients. While these design decisions and implementation methods are similar to our solution, the system is much narrower in its scope. It focusses on associating a collection of images with patient master data, a rather simple setup with a data model of limited complexity. The system only uses IFrames for providing an integrated view on distributed data. As we have described in Section 2.4.1, IFrames are well suited for simple data structures but have technological limitations when dealing with more complex data. We have also verified this with experiments. It can therefore be assumed that the system is not able to efficiently provide comprehensive views on complex structures consisting of multiple different entities that are interlinked with high multiplicities. Moreover, the system provides a smaller set of features than ours. Important examples include not supporting multi-tier pseudonymity and not providing alternatives to direct server-to-server communication for the synchronization of temporary identifiers.
The articles by Bialke et al. [32, 45] describe a loosely coupled approach in which generic software modules perform tasks such as pseudonymization and record linkage for research data. The aim is to reduce implementation efforts by providing components that implement standard functionalities necessary in disease registries. The individual modules are hosted together by a Trusted Third Party, which provides the according services to external parties. The proposed architecture supports two-tier pseudonymization between master data and clinical phenotype data. The focus lies mainly on workflow aspects. Neither article addresses presentation-layer integration or integrated user interfaces, and technical options for implementation are not discussed.
Pseudonymization models are very heterogeneous, already on a conceptual level. Most importantly, it remains unclear how exactly data is to be separated into distributed subsets. What is lacking is a thorough risk and threat analysis for pseudonymization schemes, covering at least the data and the application level. Different architectural solutions exist for managing a set of pseudonymized data subsets, each of which has different properties in terms of usability, support for functional requirements and software complexity. Additionally, these architectures can be implemented with different technologies. In this article, we have analyzed this broad spectrum of architectural options and implementation techniques, and we have presented a solution that is generic because it is independent of the actual distribution of data and supports a large set of features. In the future, we will investigate how using more modern HTML features can help to reduce system complexity and thus simplify application development as well as system maintenance.
CCOW: Clinical context object workgroup
CORS: Cross-Origin resource sharing
CRUD: Create, read, update and delete
eCRF: Electronic case report form
EDC: Electronic data capture
FP7: Seventh framework programme
HIPAA: Health insurance portability and accountability act
HL7: Health level seven
HTTP: Hypertext transfer protocol
IRB: Institutional review board
ISO: International organization for standardization
JSONP: JSON with padding
JWE: JSON web encryption
JWK: JSON web keys
JWS: JSON web signatures
JWT: JSON web tokens
OATH: Initiative for open authentication
OSSE: Open source registry system for rare diseases in the EU
RBAC: Role-based access control
RDBMS: Relational database management system
REST: Representational state transfer
SAML: Security assertion markup language
SSL: Secure sockets layer
TLS: Transport layer security
TTP: Trusted third party
UPS: Uninterruptible power supply
URL: Uniform resource locator
W3C: World Wide Web Consortium
Web API: Web application programming interface
XML: Extensible markup language
XACML: eXtensible access control markup language
HBGRDs. Guidelines for human biobanks and genetic research databases. 2008. http://www.oecd.org/science/biotech/guidelinesforhumanbiobanksandgeneticresearchdatabaseshbgrds.htm. Accessed 28 September 2015.
P3G. Public population project in genomics and society. 2015. http://p3g.org. Accessed 28 September 2015.
Wichmann HE, Kuhn KA, Waldenberger M, Schmelcher D, Schuffenhauer S, Meitinger T, et al. Comprehensive catalog of European biobanks. Nat Biotechnol. 2011;29:795–7. doi:10.1038/nbt.1958.
BioMedBridges. Building data bridges from biology to medicine in Europe. 2015. http://www.biomedbridges.eu. Accessed 28 September 2015.
Appari A, Johnson ME. Information security and privacy in healthcare: current state of research. Int J Internet Enterp Manag. 2010;6(4):279–314. doi:10.1504/IJIEM.2010.035624.
Malin B. An evaluation of the current state of genomic data privacy protection technology and a roadmap for the future. J Am Med Inform Assoc. 2005;12:28–34.
Ayday E, De Cristofaro E, Hubaux J-P, Tsudik G. The chills and thrills of whole genome sequencing. IEEE Computer. 2013;99:1. doi:10.1109/MC.2013.333.
European Parliament and Council of the European Union: European Parliament and council directive 95/46/EC of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal L 1995;281:31–50.
European Commission. Proposal for a regulation of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General data protection regulation). Outcome of the European Parliament’s first reading (Strasbourg, 10 to 13 March 2014). Brussels. 2014.
Council of Europe: Recommendation Rec(2006) 4 of the Committee of Ministers to member states on research on biological materials of human origin. 958th meeting. 15 March 2006.
U.S. Department of Health and Human Services Office for Civil Rights. HIPAA administrative simplification regulation, 45 CFR Parts 160, 162, and 164. 2013.
UK Biobank. UK biobank ethics and governance framework (version 3.0). 2007. https://www.ukbiobank.ac.uk/wp-content/uploads/2011/05/EGF20082.pdf. Accessed 28 September 2015.
Hakonarson H, Gulcher JR, Stefansson K. deCODE genetics, Inc. Pharmacogenomics. 2003;4:209–15.
The German National Cohort: http://nationale-kohorte.de/informationen-auf-englisch/ (2015). Accessed 30 November 2015.
Federal Data Protection Act in the version promulgated on 14 January 2003 (Federal Law Gazette I p. 66), as most recently amended by Article 1 of the Act of 14 August 2009 (Federal Law Gazette I p. 2814). 2009.
Republic of Italy. Personal data protection code. legislative decree No. 196, (196), 1–186. 2003.
Lowrance W. Learning from experience: privacy and the secondary use of data in health research. J Health Serv Res Policy. 2003;8 Suppl 1:2–7. doi:10.1258/135581903766468800.
Kalra D, Gertz R, Singleton P, Inskip HM. Confidentiality of personal health information used for research. BMJ. 2006;333(7650):196–8. doi:10.1136/bmj.333.7560.196.
International Organization for Standardization (ISO). Health informatics - pseudonymization. ISO/TS 25237:2008(E). 2008.
Pommerening K, Drepper J, Helbing K, Ganslandt T. Leitfaden zum Datenschutz in medizinischen Forschungsprojekten. 1st ed. Berlin: MWV; 2014. ISBN-10: 3954661233.
Winter A, Funkat G, Haeber A, Mauz-Koerholz C, Pommerening K, Smers S, et al. Integrated information systems for translational medicine. Methods Inf Med. 2007;46:601–7.
Pommerening K, Sax U, Müller T, Speer R, Ganslandt T, Drepper J, et al. Integrating eHealth and medical research: The TMF data protection scheme. In: Blobel B, Pharow P, Zvarova J, Lopez D, editors. eHealth: Combining health telematics, telemedicine, biomedical engineering and bioinformatics to the edge. Berlin: Akademische Verlagsgesellschaft Aka GmbH; 2008. p. 5–10.
Helbing K, Demiroglu SY, Rakebrandt F, Pommerening K, Rienhoff O, Sax U. A data protection scheme for medical research networks. Methods Inf Med. 2010;49(6):601–7. doi:10.3414/ME09-02-0058.
Brinkmann L, Klein A, Ganslandt T, Ückert F. Implementing a data safety and protection concept for a web-based exchange of variable medical image data. Int Congr Ser. 2005;1281:191–5. doi:10.1016/j.ics.2005.03.185.
Spitzer M, Ullrich T, Ueckert F. Securing a web-based teleradiology platform according to German law and “Best Practices”. Stud Health Technol Inform. 2009;150:730–4.
Lablans M, Borg A, Ückert F. A RESTful interface to pseudonymization services in modern web applications. BMC Med Inform Decis Mak. 2015;15(1):2. doi:10.1186/s12911-014-0123-5.
Kalman B, Lautenschlaeger R, Kohlmayer F, Büchner B, Kmiec T, Klopstock T, et al. An international registry for neurodegeneration with brain iron accumulation. Orphanet J Rare Dis. 2012;7:66. doi:10.1186/1750-1172-7-66.
m4 Leading-Edge Cluster: m4 data integration systems. http://www.m4.de/en/leading-edge-cluster/m4-data-integrationsystem.html (2015). Accessed 30 November 2015.
Büchner B, Gallenmüller C, Lautenschläger R, Kuhn KA, Wittig I, Schöls L, et al. Das deutsche Netzwerk für mitochondriale Erkrankungen (mitoNET). Med Genet. 2012;24(3):193–9. doi:10.1007/s11825-012-0338-8.
Sommerville I: Software engineering. 9th ed. Addison-Wesley; 2010:792. ISBN-10: 0137035152.
Demiroglu SY, Skrowny D, Quade M, Schwanke J, Budde M, Gullatz V, et al. Managing sensitive phenotypic data and biomaterial in large-scale collaborative psychiatric genetic research projects: practical considerations. Mol Psychiatry. 2012;17(12):1180–5. doi:10.1038/mp.2012.11.
Bialke M, Bahls T, Havemann C, Piegsa J, Weitmann K, Wegner T, et al. MOSAIC – a modular approach to data management in epidemiological studies. Methods Inf Med. 2015;54:364–71. doi:10.3414/ME14-01-0133.
Meyer J, Ostrzinski S, Fredrich D, Havemann C, Krafczyk J, Hoffmann W. Efficient data management in a large-scale epidemiology research project. Comput Methods Programs Biomed. 2012;107(3):425–35. doi:10.1016/j.cmpb.2010.12.016.
Ohmann C, Kuchinke W, Canham S, Lauritsen J, Salas N, Schade-Brittinger C, et al. Standard requirements for GCP-compliant data management in multinational clinical trials. Trials. 2011;12:85. doi:10.1186/1745-6215-12-85.
Kohlmayer F, Lautenschläger R, Wurst SHR, Klopstock T, Prokisch H, Meitinger T, et al. Konzept für ein deutschlandweites Krankheitsnetz am Beispiel von mitoREGISTER. GI Jahrestagung. 2010:746–751.
Eggert K, Wüllner U, Antony G, Gasser T, Janetzky B, Klein C, et al. Data protection in biomaterial banks for Parkinson's disease research: the model of GEPARD (gene bank Parkinson's disease Germany). Mov Disord. 2007;22(5):611–8. doi:10.1002/mds.21331.
Dangl A, Demiroglu SY, Gaedcke J, Helbing K, Jo P, Rakebrandt F, et al. The IT-infrastructure of a biobank for an academic medical center. Stud Health Technol Inform. 2010;160(Pt 2):1334–8. doi:10.3233/978-1-60750-588-4-1334.
Jin J, Ahn G-J, Hu H, Covington MJ, Zhang X. Patient-centric authorization framework for sharing electronic health records. In Proc 14th ACM Symp Access Control Model Technol. 2009; 125–134; doi 10.1145/1542207.1542228.
Alonso G, Casati F, Kuno H, Machiraju V. Web services: Concepts, architectures and applications (Data-centric systems and applications). Berlin Heidelberg: Springer; 2004. p. 123–49. ISBN 3642078885.
Dadam P, Reichert M, Kuhn KA. Clinical workflows-the killer application for process-oriented information systems? In: Abramowicz W, Orlowska ME, editors. BIS 2000, 4th Int Conf on Bus Inf Syst. London: Springer; 2000. p. 36–59.
Bevan N. Usability issues in web site design. 1999. http://experiencelab.typepad.com/files/usability-issues-in-website-design-1.pdf. Accessed 28 September 2015.
Goldberg SI, Niemierko A, Turchin A. Analysis of data errors in clinical research databases. AMIA Annu Symp Proc. 2008:242–246.
The HL7 CCOW Standard: http://www.hl7.com.au/CCOW.htm (2006). Accessed 28 September 2015.
Demiroglu SY, Skrowny D, Schulze TG. Adaption of the identity management regarding new requirements of a long-term psychosis biobank. In: Moen A, Andersen SK, Aarts J, Hurlen P, editors. In Proc 23rd Int Conf European Federation Med Inform. Oslo. MIE 2011. 2011:1–3.
Bialke M, Penndorf P, Wegner T, Bahls T, Havemann C, Piegsa J, et al. A workflow-driven approach to integrate generic software modules in a trusted third party. J Transl Med. 2015;13:176. doi:10.1186/s12967-015-0545-6.
AngularJS: https://angularjs.org (2015). Accessed 28 September 2015.
BACKBONE.JS: http://backbonejs.org (2015). Accessed 28 September 2015.
Jackson C, Bortz A, Boneh D, Mitchell JC. Protecting browser state from web privacy attacks. Proc Int Conf World Wide Web. 2006:737–744; doi:10.1145/1135777.1135884.
Jackson C, Wang HJ. Subspace: Secure cross-domain communication for web mashups. Proc Int Conf World Wide Web. 2007:611–620; doi:10.1145/1242572.1242655.
De Ryck P, Decat M, Desmet L, Piessens F, Joosen W. Security of web mashups: a survey. Proc Nord Conf Sec IT Syst. 2012:223–238; doi:10.1007/978-3-642-27937-9_16.
Son S, Shmatikov V. The postman always rings twice: attacking and defending postMessage in HTML5 websites. In: ISOC Network and Distributed System Security Symposium, NDSS 2013. 2013.
JGroups: http://www.jgroups.org (2015). Accessed 28 September 2015.
Neuman C, Kohl J. RFC 4120: the Kerberos network authentication service (V5). 2005. http://www.ietf.org/rfc/rfc4120.txt. Accessed 28 September 2015.
Shibboleth 3: a new identity platform. 2013. https://shibboleth.net/documents/business-case.pdf. Accessed 28 September 2015.
Anderson A, Lockhart H. SAML 2.0 profile of XACML. OASIS Open 2004. http://docs.oasis-open.org/xacml/access_control-xacml-2.0-saml_profile-spec-cd-01.pdf. Accessed 28 September 2015.
University of Southern California. RFC 791: Darpa internet program protocol specification. 1981. https://tools.ietf.org/html/rfc791. Accessed 28 September 2015.
Jones MB. The emerging JSON-based identity protocol suite. W3C workshop on identity in the browser. 2011:1–3.
European Commission. FP7-HEALTH - FP7 specific programme ‘cooperation’ - research theme: ‘health’. 2007. http://cordis.europa.eu/programme/rcn/852_en.html. Accessed 28 September 2015.
OATH Standard: http://www.openauthentication.org (2015). Accessed 28 September 2015.
Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S. The human phenotype ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet. 2008;83(5):610–5. doi:10.1016/j.ajhg.2008.09.017.
Schaefer AM, Phoenix C, Elson JL. Mitochondrial disease in adults: a scale to monitor progression and treatment mitochondrial disease in adults. Neurology. 2006;66(12):1932–4.
Barry MJ, VanSwearingen JM, Albright AL. Reliability and responsiveness of the barry-albright dystonia scale. Dev Med Child Neurol. 1999;41(6):404–11.
Schmitz-Hübsch T, Du Montcel ST, Baliko L, Berciano J, Boesch S, Depondt C, et al. Scale for the assessment and rating of ataxia: development of a new clinical scale. Neurology. 2006;66(11):1717–20.
Aamot H, Kohl CD, Richter D, Knaup-Gregori P. Pseudonymization of patient identifiers for translational research. BMC Med Inform Decis Mak. 2013;13(1):75. doi:10.1186/1472-6947-13-75.
Neubauer T, Kolb M. An evaluation of technologies for the pseudonymization of medical data. In: Computer and Information Science. Berlin: Springer; 2009. p. 47–60. doi:10.1007/978-3-642-01209-9_5.
Kalra D, Singleton P, Milan J, MacKay J, Detmer D, Rector A, et al. Security and confidentiality approach for the clinical e-science framework (CLEF). Methods Inf Med. 2005. doi:10.1267/METH05020193.
Loukides G, Denny JC, Malin B. The disclosure of diagnosis codes can breach research participants’ privacy. J Am Med Inform Assoc. 2010;17(3):322–7. doi:10.1136/jamia.2009.002725.
Howard M, Lipner S. The security development lifecycle: SDL, a process for developing demonstrably more secure software. Microsoft Press; 2006. ISBN-10: 0735622140.
International Organization for Standardization (ISO): Information technology - security techniques - information security management systems - overview and vocabulary. ISO/IEC 27000:2009(E). 2009.
Shirey R. RFC 4949: Internet security glossary (V2). 2007. https://tools.ietf.org/html/rfc4949. Accessed 28 September 2015.
Majchrzak T, Schmitt O. Improving epidemiology research with patient registries based on advanced web technology. In: Proc Int Conf Info Sys Crisis Response Management. 2012:1–5.
De Moor GJE, Claerhout B, De Meyer F. Privacy enhancing techniques: the key to secure communication and management of clinical and genomic data. Methods Inf Med. 2003;42(2):148–53. doi:10.1267/METH03020148.
Claerhout B, De Moor GJE, De Meyer F. Secure communication and management of clinical and genomic data: the use of pseudonymisation as privacy enhancing technique. Stud Health Technol Inform. 2002;95:170–5. doi:10.3233/978-1-60750-939-4-170.
Iversen K, Grøtan T. Socio-technical aspects of the use of health related personal information for management and research. Int J Biomed Comput. 1996;43(1):83–91.
Wylie JE, Mineau GP. Biomedical databases: protecting privacy and promoting research. Trends Biotechnol. 2003;21(3):113–6. doi:10.1016/S0167-7799(02)00039-2.
Noumeir R, Lemay A, Lina JM. Pseudonymization of radiology data for research purposes. J Digit Imaging. 2007;20(3):284–95. doi:10.1007/s10278-006-1051-4.
Lo Iacono L. Multi-centric universal pseudonymisation for secondary use of the EHR. Stud Health Technol Inform. 2007;126:239–47.
Heurix J, Karlinger M, Neubauer T. Pseudonymization with metadata encryption for privacy-preserving searchable documents. In: Proc Annu Hawaii Int Conf Syst Sci. HICSS 2012. 2012:3011–3020; doi:10.1109/HICSS.2012.491.
Neubauer T, Heurix J. A methodology for the pseudonymization of medical data. Int J Med Inform. 2011;80(3):190–204. doi:10.1016/j.ijmedinf.2010.10.016.
Riedl B, Grascher V, Neubauer T. A secure e-health architecture based on the appliance of pseudonymization. J Software. 2008;3(2):23–32. doi:10.4304/jsw.3.2.23-32.
Starlims: http://www.starlims.com/de-de/home (2015). Accessed 28 September 2015.
secuTrial: http://www.secutrial.com (2015). Accessed 28 September 2015.
DSLib: http://www.unimedizin-mainz.de/imbei/informatik/opensource/dslib.html (2015). Accessed 28 September 2015.
Muscholl M, Lablans M, Ückert F. OSSE: open source registry system for rare diseases in the EU (executive summary). 2014. http://download.osse-register.de/OSSE_Executive_Summary.pdf. Accessed 30 November 2015.
Muscholl M, Lablans M, Wagner TO, Ückert F. OSSE: open source registry software solution. Orphanet J Rare Dis. 2014; 9 Suppl 1; doi:10.1186/1750-1172-9-S1-O9.
Parts of this work have been carried out in the context of mitoNET and TIRCON. mitoNET is a project funded by the German Ministry of Education and Research (BMBF) under grant agreement no. 01GM0862. TIRCON is funded by the European Community’s Seventh Framework Programme (FP7/2007-2013, HEALTH-F2-2011) under grant agreement no. 277984. This work was supported by the German Research Foundation (DFG) and the Technical University of Munich (TUM) within the funding programme Open Access Publishing.
The authors declare that they have no competing interests.
FK, RL and FP created the overview of architectural and implementation options and designed and developed the generic solution. RL implemented the software for the disease networks. FK, RL, KK and FP wrote the manuscript and discussed the methods and results at all stages. All authors have read and approved the final manuscript.
Cite this article
Lautenschläger, R., Kohlmayer, F., Prasser, F. et al. A generic solution for web-based management of pseudonymized data. BMC Med Inform Decis Mak 15, 100 (2015). https://doi.org/10.1186/s12911-015-0222-y
- Electronic data capture
- Web-based application
- Seamless integration
- Cross-domain communication