Security and privacy requirements for a multi-institutional cancer research data grid: an interview-based study

Background
Data protection is important for all information systems that deal with human-subjects data. Grid-based systems – such as the cancer Biomedical Informatics Grid (caBIG) – seek to develop new mechanisms to facilitate real-time federation of cancer-relevant data sources, including sources protected under a variety of regulations, such as HIPAA and 21CFR11. These systems embody new models for data sharing, and hence pose new challenges to the regulatory community and to those who would develop or adopt them. These challenges must be understood by both system developers and system adopters. In this paper, we describe our work collecting policy statements, expectations, and requirements from regulatory decision makers at academic cancer centers in the United States. We use these statements to examine fundamental assumptions regarding data sharing using data federations and grid computing.

Methods
An interview-based study of key stakeholders from a sample of US cancer centers. Interviews were structured and used an instrument developed for the purpose of this study. The instrument included a set of problem scenarios – difficult policy situations derived during a full-day discussion of potentially problematic issues by a set of project participants with diverse expertise. Each problem scenario included a set of open-ended questions designed to elicit stakeholder opinions and concerns. Interviews were transcribed verbatim and used for both qualitative and quantitative analysis. For quantitative analysis, data were aggregated at the individual or institutional unit of analysis, depending on the specific interview question.

Results
Thirty-one (31) individuals at six cancer centers were contacted to participate. Twenty-four out of thirty-one (24/31) individuals responded to our request, yielding a total response rate of 77%. Respondents included IRB directors and policy-makers, privacy and security officers, directors of offices of research, information security officers, and university legal counsel. Nineteen total interviews were conducted over a period of 16 weeks. Respondents provided answers for all four scenarios (a total of 87 questions). Results were grouped by broad themes, including among others: governance, legal and financial issues, partnership agreements, de-identification, institutional technical infrastructure for security and privacy protection, training, risk management, auditing, IRB issues, and patient/subject consent.

Conclusion
The findings suggest that, with additional work, large-scale federated sharing of data within a regulated environment is possible. A key challenge is developing suitable models for authentication and authorization practices within a federated environment. Authentication – the recognition and validation of a person's identity – is in fact a global property of such systems, while authorization – the permission to access data or resources – mimics data sharing agreements in being best served at a local level. Nine specific recommendations result from the work and are discussed in detail. These include: (1) the necessity to construct separate legal or corporate entities for governance of federated sharing initiatives on this scale; (2) consensus on the treatment of foreign and commercial partnerships; (3) the development of risk models and risk management processes; (4) development of technical infrastructure to support the credentialing process associated with research involving human subjects; (5) exploring the feasibility of developing large-scale, federated honest broker approaches; (6) the development of suitable, federated identity provisioning processes to support federated authentication and authorization; (7) community development of requisite HIPAA and research ethics training modules by federation members; (8) the recognition of the need for central auditing requirements and authority; and (9) use of two-protocol data exchange models where possible in the federation.

agreements for thousands of participants across hundreds of organizations will scale as the square of the number of participants, which is prohibitive in scope and scale. Consequently, adoption models must allow regulatory needs to be met while supporting flexibility and growth of the underlying organization. Many existing organizations have evolved in part to address this scaling issue. The cancer Cooperative Groups, BIRN, and many other groups have developed reciprocal business agreements that enable linear scaling of agreements, although for clinical datasets additional agreements are typically put in place, facilitated by the umbrella business agreement.
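The scaling concern above can be made concrete with a back-of-the-envelope calculation (an illustrative sketch, not part of the original study): under a bilateral model every pair of organizations needs its own agreement, so the count grows quadratically, while an umbrella agreement grows linearly with the number of participants.

```python
def pairwise_agreements(n: int) -> int:
    """Bilateral model: every pair of participating organizations
    signs its own agreement, so the count grows quadratically."""
    return n * (n - 1) // 2


def umbrella_agreements(n: int) -> int:
    """Umbrella model: each organization signs one reciprocal
    agreement with the federation, so the count grows linearly."""
    return n


for n in (10, 100, 1000):
    print(f"{n:>5} participants: "
          f"{pairwise_agreements(n):>7} bilateral vs "
          f"{umbrella_agreements(n):>5} umbrella agreements")
```

Even at a hundred participants the bilateral model already requires thousands of agreements, which is the motivation for the reciprocal, umbrella-style arrangements mentioned above.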
Adoption also requires trust between the data providers and consumers who use the infrastructure and the regulators who oversee the process. Trust relies on an understanding of the needs of all stakeholder groups, and on the development of suitable technology to meet those needs. As used in a technical context, the term "trust" describes the degree of assurance a relying party may place in a digital assertion (usually termed a "certificate") given by some entity (usually termed a Certifying Authority). These assertions may be concerned with either Authentication, i.e., who or what a given entity is, or Authorization, which deals with the rights or privileges an entity may possess. A full description of the formal concepts and foundations of trust is beyond the scope of this paper; however, the interested reader is referred to the paper by Chapin [19]. An effective security system in a federated environment is well served by a mechanism for expressing and maintaining differing degrees of this digital "trustworthiness" between multiple parties. For a description of the novel technical mechanisms developed for caBIG, see the description of the GAARDS security system in Oster [9]. From a legal or governance perspective, existing federations often employ "trust agreements" of some degree to reify expectations between parties. An example of such an agreement may be seen in the InCommon Participation Agreement [20].
Regulatory personnel require that data sharing agreements and technical mechanisms used between investigators adhere to HIPAA [21], the Common Rule [22], 21CFR11 [23], and other regulations. Investigators require that the systems protect their intellectual capital. Tech-transfer officers want the system to protect intellectual property. These requirements lead to technical implications for the design, implementation, and operation of caBIG systems including how potential users at multiple sites are identified, made known to, and ultimately authorized to access those systems.
From its inception, the caBIG project has been committed to a federated, as opposed to a centralized, model. In this federated model, data are stored and managed locally in systems that can communicate with other geographically distributed systems using the capabilities of the caGrid middleware. In principle, each individual research group or institution can retain ultimate control over who has access to its data at all times. However, accurate access-control (i.e., authorization) decisions cannot occur without knowledge of who is requesting access, for what purpose, and with what authority. Consequently, caBIG includes identity management processes in its federation model to provide the needed authentication on which authorization decisions ultimately rely.
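The division of labor described above – federation-wide authentication combined with locally retained authorization – can be sketched in a few lines. This is a hypothetical illustration only: the names (FEDERATION_IDENTITY_PROVIDERS, LOCAL_ACL, and the example identities) are invented here and do not reflect the actual caGrid or GAARDS interfaces.

```python
# Hypothetical sketch, NOT the caGrid/GAARDS API: the federation vouches
# for who a caller is; each data provider still decides access locally.

# Global step: identity providers the federation has agreed to trust.
FEDERATION_IDENTITY_PROVIDERS = {
    "cancer-center-a.example",
    "cancer-center-b.example",
}

# Local step: each data owner maintains its own access-control list.
LOCAL_ACL = {
    "deidentified_pathology_reports": {"alice@cancer-center-a.example"},
}


def authenticate(user_id: str) -> bool:
    """Federation-wide: was this identity issued by a trusted member?"""
    domain = user_id.split("@")[-1]
    return domain in FEDERATION_IDENTITY_PROVIDERS


def authorize(user_id: str, resource: str) -> bool:
    """Local: the resource owner alone decides who may access its data."""
    return user_id in LOCAL_ACL.get(resource, set())


def request_access(user_id: str, resource: str) -> bool:
    """Access requires both a federated identity and local permission."""
    return authenticate(user_id) and authorize(user_id, resource)
```

Note that a valid federated identity is necessary but not sufficient: a user from a trusted institution is still denied unless the local owner has granted access, mirroring the paper's point that authorization is best served locally.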
If caBIG or any federated biomedical data grid is to meet the needs of all relevant parties, those needs must be known – especially those of the often non-technical staff charged with overseeing data integrity and privacy.

Existing Regulatory Constraints
There are several regulations that must be recognized and addressed for federated biomedical grids such as caBIG to function effectively. The list below is not intended to be exhaustive, as numerous federal and state regulations will affect operations. Rather, we briefly introduce the key regulations governing federated biomedical data sharing consortia.

HIPAA
The Health Insurance Portability and Accountability Act (HIPAA) [21] Privacy Rule, found in 45 CFR 164, regulates the use and disclosure of Protected Health Information (PHI), including PHI's electronic transmission. HIPAA imposes important requirements for research performed using caGrid, including strict requirements for informed consent and data de-identification.

Institutional Review Boards
Institutional Review Boards (IRBs) have the authority to approve, require modifications to, or disapprove and disallow research on human subjects under Food and Drug Administration (FDA) and Health and Human Services (HHS) regulations [15]. IRBs may require institutions to implement specific IRB and HIPAA training programs and other policies and procedures before institutions and researchers may perform human subjects research. For institutions to obtain IRB approval to participate in caGrid, it appears IRBs may seek reassurance of the ability of caBIG to ensure safe practices for human subjects research by all caGrid participants, including compliance with honest broker and informed consent requirements.

IACUC
The Institutional Animal Care and Use Committees provide regulatory oversight of research involving laboratory animals. Every institution that uses animals for federally funded laboratory research must have an IACUC, which reviews research protocols and evaluates an institution's animal care and use.

21 CFR Part 11: Electronic Records and Signatures
21 CFR 11 consists of FDA regulations establishing the criteria under which electronic records and electronic signatures are considered trustworthy and equivalent to paper records and handwritten signatures. Part 11 requires various controls, including audits and validation systems, to be implemented as part of a regulated entity's operations.

Federal Employee Regulations and Standards
There are various federal regulations and standards governing federal employees' and contractors' use of electronic equipment, such as the Federal Information Processing Standards 201-1 (Personal Identity Verification requirements), that will have some impact on caBIG.

State Privacy Laws
Each state may establish its own privacy laws, governing the use and disclosure of personal information. These laws vary by state, and may be more stringent than federal laws, such as HIPAA, requiring additional regulatory compliance by institutions in those states.

The Structural Basis of Federations
Federations by definition consist of multiple entities which must be bound together by a shared framework of governance. The Liberty Alliance [24], a consortium working to define interoperable federated computing environments, defines three major governance models for federations [25]. Each model has specific strengths and weaknesses, and these constraints must be understood when selecting a governance model and developing policy. To operate, federations typically must have agreements in place that describe the structure of the federation, how it will be governed, and the requirements and rules expected between the various parties. Consequently, establishing a federation requires higher-level governing structures, guidelines, and policies, in addition to the security, privacy, and data sharing policies of the individual organizations. Since trust relies on adherence to agreed-upon policies in these areas by all participants, some degree of policy reconciliation between the members of the federation is usually necessary. Three pertinent examples of moderately mature federated environments are presented below.

Liberty Alliance
The Liberty Alliance [24] is a group of over 30 commercial and other organizations formed to establish open standards, guidelines, and best practices for federated identity management. The group has been a leader in the specification, certification, and development of various protocols, guideline documents, and policies related to developing successful wide-scale identity federations.

Safe-BioPharma Association
The Safe-BioPharma Association [26] is a group that has developed and promoted specific digital identity and digital signature standards to promote interoperability of systems across corporate boundaries. As such, they function as a federation. The federation focuses on the specific business requirements and interchange of information between the BioPharma industry, various regulatory bodies, such as the FDA, and the healthcare industry.
InCommon
InCommon [27] is an identity federation run by a large consortium of institutions of higher education in the United States. The goal of the federation is to promote interoperability of systems across institutional boundaries for faculty, researchers, staff, and students in the US research and education sphere. As of October 2008, the consortium lists over 2.2 million users in over 108 academic and research organizations, including major academic publishers, libraries, 72 higher education participants (among them a number of large state university systems), and several major government and government-sponsored programs. Of particular relevance for this paper are the NIH and TeraGrid.

Methods
Our approach was to develop structured elicitation interviews of key regulatory personnel at a subset of cancer centers involved in exchange of data using the caBIG system. Interview instruments were developed using a team-based approach. Regulatory participants were recruited, and telephone or in-person interviews were conducted. Results were tabulated according to job description, type of institution, and other relevant classifications. These were used by the investigators to determine the stated fundamental security and privacy drivers involved in multi-center use of the grid for de-identified data exchange.

Development of the interview instrument
The interviews utilized problem scenarios developed collaboratively during a one-and-a-half-day intensive face-to-face meeting in Pittsburgh, June 12-13, 2006. Thirty-eight individuals representing a wide spectrum of experts and stakeholders from US Cancer Centers and the NIH spent approximately four hours discussing and brainstorming about potential barriers to the multi-institutional sharing of data through caBIG. Individuals who participated in the development of the instruments included representatives of the security project (7), members of the Data Sharing and Intellectual Capital Working Group (3) and Architecture Working Group (1), Institutional Review Board directors (3), external advisors (3), grid technologists (3), NCICB representatives (5), patient advocates (3), caTIES adopters (4), the caTIES development team (2), and other stakeholders (4).
Meeting participants were asked to think broadly about issues that might pose problems, particularly those where we expected significant variation among cancer centers. Issues were collected into a master list and sorted into four general categories. The categories which emerged from this process were: (1) Locus of control/decision making, (2) De-identification and IRB Policy, (3) Authentication and Authorization, and (4) Consenting.
Participants then divided into four breakout groups, one for each of these major themes, and constructed scenarios and draft interview questions designed to elicit information during the interviews. All scenarios used caTIES as the example system. Participants met at the end of the day to critique the resulting scenarios.
Following the face-to-face meeting, the authors edited the interview scenarios to ensure adequate coverage of the issues, improve the understandability and simplicity of the interview questions, and match interview questions to organizational roles of interviewees. The resulting draft instruments (see additional file 1) were reviewed by all meeting participants, and modified in three subsequent rounds of editing and draft revisions. Together, the four interview instruments contained a total of 87 questions. The topic of each scenario along with the organizational roles of intended respondents and the number of questions are shown in Table 1.

Participants
We contacted individuals across six United States (US) cancer centers involved in the caBIG project. Participating cancer centers included all four current adopters of the caTIES System, the test-bed system described in the Interview Instrument. All four are university-affiliated. Two other institutions represented stand-alone cancer centers involved in the caBIG project, and were included to broaden the sample because of a concern that data obtained from the four university-affiliated cancer centers might not generalize to stand-alone cancer centers. These two centers represented a convenience sample of centers affiliated with the authors. The total percentage of stand-alone centers in this sample (2/6) is similar to the percentage of stand-alone cancer centers across the nation (13/63).
For each institution, we asked a collaborator at that institution to identify key individuals with decision-making authority who, we anticipated, would need to be involved in the development of a federated grid for data sharing across institutions. The roles of these individuals thus varied somewhat based on the organizational structure and culture of the participating institution.

Data Collection
Interviews were conducted either on-site (N = 5) or by telephone (N = 14), based upon conditions of approval of the participating institution. For all interviews, we provided participants with the interview scenarios in advance. Interviews were recorded as digital files, and transcribed verbatim. The interviewer maintained a key indicating the organization and role of the participant. Identifying information regarding participant and institution was scrubbed from the resulting documents to generate the final de-identified transcripts.

Data Analysis
The interviewer manually coded the interviews, using principles of both quantitative and qualitative data analysis.

Quantitative Analysis
The interview scenarios were structured such that individual participants were asked a subset of the 87 questions across four scenarios, based on organizational role and expertise. Responses to the 87 interview questions were aggregated in Excel. For some objective questions regarding organizational policy or processes, only a single answer was sought from an individual with sufficient authority to respond. Consequently, during the analysis phase, we chose to alternate the unit of analysis depending on the interview question.
For questions related primarily to the institution, we aggregate all information from multiple individuals across a single institution and present statistics with institution as the unit of analysis. For questions where each participant provided a single response, we show counts with interview as the unit of analysis. When two individuals were interviewed together, and we found no instances of disagreements, we recorded only one response per interview. For questions where participants enumerated multiple items in response to a question, we use interview statements as the unit of analysis.
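As a minimal illustration of this alternating unit of analysis (using invented responses, not the study's actual data), the same set of answers can be tallied with either the institution or the interview as the unit:

```python
# Hypothetical example data: two interviews at institution A, one at B.
from collections import Counter

responses = [
    {"institution": "A", "interview": 1, "answer": "yes"},
    {"institution": "A", "interview": 2, "answer": "yes"},
    {"institution": "B", "interview": 3, "answer": "no"},
]


def tally(unit: str) -> Counter:
    """Count answers using 'institution' or 'interview' as the unit of
    analysis, keeping one answer per unit (mirroring the single-answer
    rule described above; the first response for a unit is retained)."""
    seen = {}
    for r in responses:
        seen.setdefault(r[unit], r["answer"])
    return Counter(seen.values())


print(tally("institution"))  # one vote per institution
print(tally("interview"))    # one vote per interview
```

The choice of unit changes the denominator and the counts, which is why the tables in the Results section state the unit of analysis in their captions.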

Qualitative Analysis
Many issues were discussed during these semi-structured interviews that provide guidance for developing security processes and policies. Key issues and opinions from all interviews were highlighted in the files, and used to distill a set of themes and issues for qualitative data analysis.
Areas where there appears to be consensus and areas that show strongly divergent views are discussed using quotations from the primary data. Commonly accepted editing and proofing standards were used to clarify quotes when necessary without changing the contextual meaning. For example, any added words or phrases appear in square brackets [ ]. Every precaution was taken to maintain the integrity of the original quotes. To ensure that quotations were representative of the entire sample and not a small set of participants, we examined the distribution of quotations after the manuscript was completed.

Characteristics of the Interview Participant Sample
We contacted thirty-one (31) individuals at six cancer centers with requests to participate. Twenty-four out of thirty-one (24/31) individuals responded to our request, yielding a total response rate of 77%. The distribution of organizational affiliation of participants is shown in Table 2. Nineteen total interviews were conducted over a period of 16 weeks.
At one institution (Institution D), we were only able to recruit a single participant. Therefore, for questions in which the unit of analysis is the institution, we include only five of the six institutions. Data obtained from the single individual from cancer center D is included only in quantitative analyses where the unit of analysis is the individual and in qualitative analyses.
Fourteen interviews were conducted with one participant only and five interviews were conducted with two participants together. In all interviews where two participants were interviewed together, the pairs consisted of supervisor-supervisee dyads that worked at the same institution. In all cases, one of the two individuals originally contacted specifically requested that their supervisor or supervisee participate jointly in the interview.
The roles of participants within their organizations are shown in Table 3. In some cases, individuals served in multiple capacities within their organizations (for example information security officer and privacy officer); therefore, the total number of roles recorded in Table 3 exceeds the number of respondents.

Analysis of interview responses grouped by theme
The following sections contain responses to the interview questions grouped by theme, and include both quantitative and qualitative analyses of the pattern of responses. Tables indicating quantitative results include captions which describe the total number of respondents and their organizational roles. Questions posed in each interview were specific to organizational role, and hence the denominator varies with each question. In addition to aggregating and quantifying the responses, we also looked for issues or requirements that could have technical, as well as policy or procedural, implications for the operation of caGrid. Figure 1 depicts the distribution of quotations across interviews, and shows that all participants are represented in the analysis.

Project structure and governance
Necessity of a governance structure
Over 85% of individuals expressed the opinion that multi-institutional data sharing through caGrid requires a governing body (Table 4).
The need for a governing body was expressed across the entire spectrum of organizational roles, from IRB directors to information technology (IT) security managers to privacy officers and Office of Research representatives. Others felt that a governing body would be useful, but that it was critical to achieve the right balance between guidance and standards at the multi-institutional level and the flexibility to interpret and adapt them at the local level.

Potential functions of a governing body
Participants suggested a large number of potential functions for the governing body in overseeing the sharing of data. All responses collected are enumerated in Table 5.
The resulting functions cover a broad range of categories including common guidelines for data use, community-wide IRB functions, risk assessment, general security policies and procedures, audit and oversight, reporting and enforcement, and selection of external standards for operation. In addition to the operational functions, participants also suggested several more abstract responsibilities.
Some participants indicated that the governing body was necessary in order to build trust among the participant organizations. Participants also suggested that the governing body must provide a strategic role, for example by monitoring the Office of Human Research Protections (OHRP) regulations or new laws that might affect the use of the federated grid.

Requirement that the collaboration be a legal entity
One university legal counsel articulated the need for the collaboration to be a legal entity. The benefit of a legal entity is that the entity carries insurance and provides a single point of authority for enforcement should the terms of the contract be breached. The legal entity reduces risk to individual participating organizations.

Figure 1. Distribution of quotes across the participants.

Data Use
Establish principles of operation of the community 3
Make project-wide decisions regarding appropriate use of data and tissue (rules of engagement) 5
Establish uniform position on data ownership and intellectual property 1
Set standards for assuring data integrity 1
Establish common guidelines on professional credentials needed to access specific types of data 2
Oversee the "joining" of organizations 4
Review privacy laws and research ethics guidelines for potential foreign partners before entry 2

Community-Wide IRB Functions
Provide community-wide assurance that all repositories have appropriate IRB review 1
Establish common Data Safety Monitoring Plans agreeable to constituent IRBs 1
Act as a community-wide Data Safety Monitoring Board 1
Establish standards for Human Subjects Research (HSR) and HIPAA training; require institutions to assess own training modules; publish results to community 1
Provide guidance on common consent form language across caBIG 2
Random checks of user publications to determine whether data use appropriate to protocol 1

Risk Assessment
Establish common levels of data risk and identify security mechanisms appropriate for risk level 1
Provide centralized statistical assurance of minimal risk of re-identification for systems 2

Establish Security Policies and Processes
Prevent and police abuse 4
Establish common guidelines for provisioning and de-provisioning users 2
Establish requirements for monitoring credentialing process and assess incoming progress reports 2
Establish standards for authorization 2
Set minimum standards for physical security 2
Set standards for what users will have to agree to do and not do 1

Audit and Oversight
Aggregate audit information and provide reports back to member institutions 2
Monitor compliance with established and agreed-upon processes 2
Periodic checks of whether data that are supposed to be de-identified are REALLY de-identified 1
Investigation of security incidents 1

-University and IRB Legal Counsel
The need for a legal entity was posited regardless of whether data were identified or de-identified. This participant suggested that caBIG consider forming its own nonprofit incorporated entity. The formation of such an entity would greatly simplify the legal requirements for joining caBIG for this institution. In fact, the institution has previous experience with data sharing under these conditions.

Reporting and Enforcement
Establish enforcement policy for sanctioning of organizations or individuals who misuse resource 1
Report misuse to OHRP, ORI and funding agency when necessary 1
Issue federation-wide reports of security incidents 1
Maintain federation "No Fly" list of researchers not permitted access anymore from any institution 2

Mediation
Mediate disputes between organizations 2
Accept requests to appeal decisions at local institutions (for example termination of access) 1

Build Trust within the Community
Build trust among institutions that data will be used appropriately 3
Build trust in veracity of user identities 1

External Standards and Best Practices
Set external standards participating institutions must meet (e.g. CLIA approval of tissue-banks) 1
Seek out and publicize community-wide best practices 1

Strategic Role
Establish goals for the entire project and ensure that operation is in keeping with those goals 1
Monitor new regulations coming from the federal government and address relevance to sites 1
Assess and address weaknesses of the collaborative research environment 1

-University and IRB Legal Counsel
In the absence of an incorporated entity, this participant suggested that it would be necessary for the institution to sign separate Data Use and Confidentiality Agreements with each participating organization. To streamline the process, the participant suggested using common forms for Data Use and Confidentiality Agreements between institutions and Authorized User Agreements among users. The Data Use agreement may need to specify that the receiving organization is responsible for policing compliance. Institutions may need to understand exactly what resources are necessary for meeting these compliance requirements.

Trust agreements
Participants recognized the importance of agreements between institutions and were largely in agreement with what such documents should contain.

Important areas to be covered under trust agreements
The majority of participants agreed that documents should contain language related to all of the elements described in Table 6.
We also asked participants to suggest other potential areas that should be covered in the trust agreements. Areas suggested included language on intellectual property, agreements to participate in a compliance program including audits, agreements to be bound by the local IRB, and statements that data is not provided with a warranty of compliance (Table 7).

Indemnification and liability allocation
One participant indicated that their institution typically included a statement that the institution providing data made no warranty of its compliance. This requirement that institutions be able to submit data with no warranty as to their compliance status is completely contradictory to another requirement that "local caBIG repository owners and stewards need to be able to define and attest to the risk level specific to their context and state law. Sharing of data must operate under these constraints." Additional work is needed to determine how best to reconcile these opposing positions.
Assuming, for the moment, that caBIG does try to support warranty-free data sharing, it may be difficult to get all institutions to agree to a blanket, use-at-your-own-risk policy. However, one interviewee noted that a more general statement about responsibility for acts of negligence might meet with less resistance: "Typically what we would do is we would state that the data is not provided with any warranty with respect to its suitability or with respect to its compliance. The receiving entity is going to want to take responsibility if there was a mistake in the de-identification process, and data gets out. We are going to want them to assume liability for anything that happens to the data once they get it. You're right. State institutions will not agree to this indemnity provision."

Additional Suggested Elements of Trust Agreements Count
Agreement to participate in compliance program including audits 4
Intellectual Property 2
Statement that repositories will be IRB approved, and that users will abide by IRB practices 2
Statement that data is not provided with warranty of compliance 1
Scenario 2, Question 17. Data were aggregated with interview statement as the unit of analysis.
"...employed. If, however, the ownership is not sort of given up into this collection, that's going to be much more problematic." -Director, Office of Research Administration
Another concern raised about intellectual property was the potential for data owned by some third party, potentially a for-profit entity, to inadvertently become available through the grid: "I'll tell you the thing that worries me more than that is valuable information or samples that have been obtained from commercial parties under NDAs put into the tissue repositories without any markings on them whatsoever..."

-Vice President for Planning and Business Development
Further discussion of Intellectual Property considerations is provided in an associated white paper produced by the authors for caBIG [12].

Authorized user agreement
One university has an existing project with some parallels to caBIG. The project aggregates public health data and makes it available to institutions, including public health departments, throughout the country. The project has developed an authorized user agreement that users at external institutions must sign as part of the process of establishing access.
"What we do is we sign the agreements that get the data in from both the commercial organization and hospitals. We aggregate it here, then we sign agreements with each health system that wants to access the data in which they agree to use the data only for certain purposes. They acknowledge in writing that we get the data under confidentiality restrictions, and they agree that anybody who is going to access it from the public health system has to sign what we call an authorized user agreement... a one-page agreement that states... that they are only accessing it for their job... they are not going to do anything else with it."

-University and IRB Legal Counsel
The need for users to agree to attest to their agreement to abide by particular safeguards was echoed by a number of participants: "And a legal agreement that each individual agrees to abide by when they ask for access . . .not just the institution, it's the individual." -Health System Privacy Officer

Data Ownership vs Stewardship
Most participants did not perceive a difference between stewards and owners. Others had definitions that did not agree with the interviewer's definition. In retrospect, the answers to this question would have been more informative had the interviewer provided definitions of these terms.

Scope of the collaboration
Effect of joining of new organizations on IRB processes
An important finding of these interviews is that many participants were willing to accept that individual organizations may join the community without explicit approval from every other institution (Table 8). As long as new organizations agree to abide by the same principles, the addition of a new institution appears to pose few specific barriers.
IRBs would find it useful to have an online registry displaying all organizations that have signed agreements: "If there is a new institution coming in, we would like some kind of registry process... maybe it could be just something that is done online, and you can look up and say 'okay, M.D. Anderson just signed on.' I guess that's okay."
-IRB Director
However, some participants were concerned that joining organizations must be able to demonstrate that they have sufficient resources and sophistication to implement both the security technology and the security processes.
The problem, as one participant put it, is that the community is only as strong as its weakest link. These concerns highlight the need for a process of credentialing institutions that will participate in the data-sharing community.

Foreign partnerships
There were significant concerns about the inclusion of foreign partners (Table 9), for a variety of reasons.
Although few participants would preclude foreign partnerships (Table 10), many wanted additional assurances and controls.
Some participants were pessimistic about the inclusion of foreign partners given the wide gap in policies: "We would not deal with Europe. They have too hard of a standard... caBIG has to involve international partners, but they also have to make sure that it is realistic to do so, given that each culture or country or union (like EU) has their own unique regulations about electronic data transfer issues in research."

-IRB Director
Commercial entities as partners
Several participants considered the use of caBIG data by commercial entities problematic. The concern was that commercial entities could exploit data for purposes other than the advancement of science. In particular, there was a concern that data might be passed on to commercial entities without the knowledge of the providing institution: "I want to be sure they are not marketing... that once they get this data, that there are restrictions on them passing it on."

-University Compliance Officer and IRB Legal Counsel
The potential for commercial entities to gain access to data was considered problematic by one participant, a Health System Privacy Officer, because of issues related to private inurement. Private inurement - the benefit of a private interest at the expense of the nonprofit - is prohibited under law.

Specific Concerns (Count)
  Privacy requirements are different than US: 4
  Contracts are difficult to enforce overseas: 2
  Concerns about potential national security threat: 2
  Quality of foreign partner IRB review varies greatly: 2
  Cannot ensure that foreign partner will not be violating their own laws: 1
  Increased security may be necessary: 1
  Research ethics guidelines vary greatly: 1
Scenario 2 - Question 22. A total of 9 interviews provided responses. Respondents included university and IRB legal counsel, IRB directors, Office of Research representatives, and Privacy and Compliance officers. Data was aggregated with interview statement as the unit of analysis.

Existing organizational infrastructure for data sharing
The development of the envisioned research grid will need to rely on local institutions to implement the processes in addition to the software. We found significant variation in the infrastructure existing at these organizations that could support federated data sharing.

Existing honest broker systems
Honest broker systems (also referred to as trusted third parties) have been developed at some institutions to provide de-identified data, compliant with the requirements of HIPAA "safe harbor" [28]. The "honest broker" acts as a trusted, neutral third-party, often regulated by the IRB, and may maintain the key which links the de-identified record and the original identifier. Although this method has been used locally, there have been no previous attempts to deploy such a system across a federated grid.
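The linkage-file arrangement described above can be sketched in a few lines. This is a minimal illustration, not caBIG's or any institution's implementation; the class, field names, and random study-ID scheme are our own assumptions:

```python
import secrets

class HonestBroker:
    """Trusted third party: de-identifies records and keeps the only
    copy of the linkage between study IDs and source identifiers."""

    def __init__(self):
        self._linkage = {}  # study_id -> original identifier (broker-only)

    def deidentify(self, record):
        """Return a copy of the record with direct identifiers removed
        and a random study ID substituted."""
        study_id = secrets.token_hex(8)
        self._linkage[study_id] = record["mrn"]
        clean = {k: v for k, v in record.items()
                 if k not in ("mrn", "name", "dob")}
        clean["study_id"] = study_id
        return clean

    def reidentify(self, study_id):
        """Only the broker (e.g. under IRB oversight) can map a study ID
        back to the source identifier, for example to append new data."""
        return self._linkage[study_id]

broker = HonestBroker()
clean = broker.deidentify({"mrn": "12345", "name": "Jane Doe",
                           "dob": "1960-01-01", "diagnosis": "C50.9"})
assert "mrn" not in clean and "name" not in clean
assert broker.reidentify(clean["study_id"]) == "12345"
```

Investigators see only the cleaned record; re-identification remains possible, but only through the broker.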

Description of existing honest broker systems
Only one institution indicated that they had a formal human honest brokering system in place, which was established and monitored by the Institutional Review Board (Table 11).

"The way we have the honest broker system set up is that the healthcare organization certifies honest brokers, and those honest brokers are typically at the department level. It's not only on a particular projection or research project by project basis, and so once those honest brokers have been certification [sic] by a certification process involving ultimate sign off by the IRB as well as the privacy officer, once certified, then those honest brokers would work either at the department level or project level to take data from health organizations and to de-identify it for the use by an individual research project."
-Privacy Officer
Other participants described less formalized systems that had developed over time, where specific individuals had the capability to de-identify data and this mechanism began to be used by outside investigators: "I don't think we have a true honest broker system. What we have is an individual or a group of individuals who will consult and will actually provide the mechanisms for deidentifying data when asked."

-IRB Director
Other institutions had no existing mechanism to provide such a disinterested party, nor the opportunity for maintaining a linkage file that would permit re-identification by the disinterested party but not by the investigators.
"I don't even think there is probably a disinterested party that ever has done this either... if they de-identify, it would be... one of the people in the research team that would do it."

-University Compliance Officer and IRB Legal Counsel
Participants who did not have any kind of honest broker system nevertheless recognized the potential of such a system to enhance the functioning of a data-sharing grid: "I like the idea of this disinterested person being able to reidentify, but again, under very controlled circumstances." -University Privacy Officer

Existing Honest Broker Human Systems (Count)
  Institutions with formal process: 1
  Institutions with informal process: 2
  Institutions without any identifiable process: 2
Scenario 1 - Question 1. A total of 16 interviews provided responses, from 5 institutions. Respondents included individuals from all organizational roles. Data was aggregated with institution as the unit of analysis. Where individuals at an institution disagreed, we accepted any description of an existing informal process by any individual at that institution as evidence of an existing informal process.
The main benefit of such an arrangement appears to be the potential to keep data identified only at the source institution. The additional IRB requirements that may be necessary for federated sharing of re-identifiable information suggest that the community should study whether honest broker systems could reduce the number of cases where identifiable information is necessary.
Existing approved process for automated de-identification
Two of the five institutions had experience with using an automated method for text de-identification. One of these two institutions has a formal policy regarding text de-identification, which states that data scrubbed by a specific system may be considered "de-identified".

Re-identification
All participants indicated that when using a disinterested party (honest broker), it was an acceptable practice for the disinterested party to maintain a linkage file in order to allow for re-identification of the patient or participant by the disinterested party for the purpose of including additional data, as long as data remained de-identified to the investigator (Table 12). The use of a disinterested party and maintenance of a linkage file are described in the HIPAA regulations.

Existing organizational decision-making structure related to privacy
We also found marked variation in the organizational infrastructure underlying decision-making in the area of privacy (Table 13).
Participants identified a wide range of organizational structures for decision-making about privacy policy and the interpretation of HIPAA. The determining factor appears to be the relationship of the medical school or university to the health system or hospital, which produces a wide variety of configurations. It appears, however, that in the institutions represented in this sample, the health-system privacy officer typically handles disclosures of PHI, even when the disclosure is related to research data.
Of note, in some cases we found that individuals at the same institution did not always agree about which individual or organization has the responsibility to interpret the HIPAA legislation.
In most institutions, it was either the privacy or the compliance officer, with or without collaborative input, who investigated a PHI disclosure. Frequently, disclosures of PHI made in the course of university research were still investigated by the officer on the health system side (Table 14).
The responses suggest that policies regarding notification in the event of security incidents may need to follow very different routes depending on the organization. Consensus of multiple offices or organizations within the institution may be necessary. For example, it may be advantageous to ask the IRB, Office of Research, and University Compliance and Privacy Office to weigh in on who should be responsible for the local response.

Existing identity provisioning infrastructure
Several institutions were on the verge of adopting some kind of automated, organization-wide identity management infrastructure and processes suitable for the research enterprise. Such infrastructure, sometimes called an Identity Management System, is used to construct automatic systems for creating and managing user accounts and access controls across many disparate computer systems within a single management domain. The process (manual or automatic) of creating and managing user identities in these systems is termed provisioning, a term we use frequently throughout this document. These institutions were interested in using this local infrastructure for eventual automated provisioning of users into caBIG.
Many participants had difficulty conceiving of the envisioned platform and offered their insights with the caveat that additional study would be needed. Additionally, many participants had difficulty distinguishing between authentication and authorization requirements; therefore, we have grouped these together in our analysis. Further work is needed to separate the constituent requirements more carefully.
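As a rough illustration of what provisioning from a local Identity Management System might look like, consider the following sketch. The record fields and function are hypothetical, not part of any caBIG specification:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GridIdentity:
    """Minimal provisioning record a local Identity Management System
    might push to the grid (attribute names are illustrative only)."""
    user_id: str
    home_institution: str
    role: str             # e.g. "investigator", "honest_broker"
    hipaa_trained: bool

def provision(directory, identity: GridIdentity):
    """Create (or update) the user's grid account from local identity data."""
    directory[identity.user_id] = identity
    return identity

directory = {}
provision(directory, GridIdentity("dr_baggins", "Center A", "investigator", True))
assert directory["dr_baggins"].home_institution == "Center A"
```

The point of the sketch is that the local institution remains the source of truth for the attributes the grid later consults.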

Parties responsible for provisioning
Regarding the provisioning of users, there was a preference for local authority over these decisions, with some caveats. In general, IRB directors were willing to consider either central or local provisioning provided that data was de-identified, but were less willing to accept central provisioning if there was any risk of re-identification. However, security officers, privacy officers, and compliance officers generally preferred local provisioning (Table 15).
Most participants preferred to have local institutions manage the provisioning process using existing infrastructure, because they felt local institutions were best positioned to make these decisions, especially given the centrality of the IRB to this process (a view voiced by, among others, a Health System Privacy Officer).
Another argument for at least some centralized provisioning was articulated by one participant, who recognized the importance of having a separate credentialing body for investigators who are not affiliated with a caBIG institution. The development of Unaffiliated Investigator Agreements parallels processes that exist at the cancer centers, in which unaffiliated investigators may gain access to data after attesting to the use of a particular IRB and agreeing to be bound by the regulations of that IRB. Unaffiliated investigators would need to be credentialed by a third party.
Another participant, an Information Security Officer, noted that the motivation to properly credential users may in fact be related to whether one's own data is "in the game". In effect, investigators being provisioned at institutions that are not providing data to caBIG may need to be treated in some ways as unaffiliated investigators, because there may be little motivation to carefully adhere to the requisite policies and processes.
What organizational unit could credential users
Some institutions had difficulty identifying an appropriate group that could manage the provisioning process within their institution. The IT infrastructure supporting research is often meager compared with the IT infrastructure supporting clinical systems. In general, IRBs may not be well positioned to perform this task, and developing adequate control structures may be a significant task for local institutions.

Monitoring of credentialing process
There were a variety of responses as to the appropriate process for monitoring credentialing (Table 16).

Potential for federated credentials
Very few participants were willing to answer Scenario 1, Question 8, regarding what kind of federated credentials might be acceptable. Reasons provided for the lack of response included: (1) participants had little or no experience with federated credentials, (2) it was too early to make such a decision, or (3) that such a decision would require extensive consultation with the technical security team.
Information needed about users to make provisioning and authorization decisions
Across all interviews, we were able to derive a set of requirements for information needed about users (Table 17). We make no attempt to distinguish those that are required at the time of identity provisioning from those that could be deferred to authorization.
Several participants felt that the HIPAA training (although not technically required for de-identified information) would be of significant benefit if there were any chance that the information could somehow be re-identified.
An important finding from this question is the importance of establishing a relationship among the user, institution, and IRB protocol. Drawing from established practices, many participants felt that this relationship should be captured within the envisioned system and that, in many cases, use of the system should occur within the context of an approved IRB protocol: "There should be an additional piece of information from an IRB-type committee that would say... that would at least get permission for that researcher to access the data. When the protocol is created, it can list the appropriate members and each institution will have a role configured for that per..."
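The requirement that access occur in the context of an approved IRB protocol listing the user could be modeled roughly as follows. The registry shape, protocol ID, and user names are invented for illustration:

```python
from datetime import date

# Hypothetical registry: protocol id -> approved members and expiry date
irb_protocols = {
    "IRB-2006-114": {"members": {"dr_baggins"}, "expires": date(2099, 1, 1)},
}

def may_access(user, protocol_id, today=None):
    """Grant access only if the user is listed on a current, approved
    IRB protocol (one possible reading of the requirement above)."""
    today = today or date.today()
    p = irb_protocols.get(protocol_id)
    return bool(p and user in p["members"] and today <= p["expires"])

assert may_access("dr_baggins", "IRB-2006-114")
assert not may_access("outsider", "IRB-2006-114")
```

A real system would also need to record the protocol's risk level and IRB type, as discussed later in this paper.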

Difficulties with anonymous users
Anonymous users were considered problematic by all participants, and most would simply not allow them under any circumstances (a position taken by, among others, a University Chief Compliance Officer). Some felt that standardization was the best option, and that such a standard could eventually replace local... Another difficult issue with identified information is that its passage will likely require an IRB authorization agreement. Such agreements are not always accepted between institutions, as discussed further under the topic of patient consent, below.

Control over authorization decisions
Aspects of authorization that must be controlled locally
Although there were generally few responses to this question, the responses we did collect suggested that local institutions need to control characteristics such as the roles of their own users, and that they need to control the characteristics that govern entry into their own data repositories:

-IT Security Manager
"I would claim that I need to have control over the ability to reset their passwords or something, because they are going to come to me to ask for that. And I certainly would need to be able to control their roles that are defined in my institution, but can I envision a scenario in which Dr. Baggins is working here at (this institution) and has one role here, but he also has a joint appointment at another institution, and that's a minor thing. So his identity is verified here at (this institution) but the other institution vouches for his role as a researcher at the other institution's protocol, based upon our identity. Yeah... that's reasonable in a federated environment." -Information Security Manager
Most participants who answered this question were willing to allow entry into repositories that contained only de-identified data on the basis of users meeting certain predefined characteristics, but preferred much tighter control over access to identified data.

Third party vs. local verification of user attributes
In general, there were few responses regarding the benefits or shortcomings of third-party verification of user attributes, possibly because participants had difficulty envisioning such remote verification of attributes. As a legal matter, at least one participant indicated that there were no specific barriers to third-party verification:

"I think as a legal matter, the institution could agree to that, but the decision about what is the adequate level wouldn't be mine. It would be the IT people." -University and IRB Legal counsel
Turning off access to data across the grid
Participants generally agreed that the termination of access was an important capability that they wished to retain as a right of participation in the project (Table 19).
Participants expected the ability to immediately cease transfer of data to a specific user or an entire organization, even though they felt they would use this capability rarely (Table 20).
Participants offered a range of answers to the question of who should make the decision to turn off access (Table 21). A number of participants indicated that the data owner should be responsible for turning off access, but that some decisions regarding termination might come from the IRB or a local caBIG officer.
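A sketch of the "turn off access" capability participants described - immediate revocation at either the user or institution level - might look like this. The class and method names are ours, purely for illustration:

```python
class GridAccessControl:
    """Data owner's kill switch: revoke a single user or a whole
    institution, effective immediately on the next access check."""

    def __init__(self):
        self.blocked_users = set()
        self.blocked_institutions = set()

    def revoke_user(self, user):
        self.blocked_users.add(user)

    def revoke_institution(self, institution):
        self.blocked_institutions.add(institution)

    def is_allowed(self, user, institution):
        """Checked on every data transfer, so revocation takes effect at once."""
        return (user not in self.blocked_users
                and institution not in self.blocked_institutions)

acl = GridAccessControl()
assert acl.is_allowed("alice", "center_a")
acl.revoke_institution("center_a")
assert not acl.is_allowed("alice", "center_a")
```

Checking block lists at access time, rather than pushing revocations out to cached credentials, is what makes the cutoff immediate.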
Step-up requests
Requests to step up the level of authorization might require additional processes, agreements, or alterations to the IRB protocol (Table 22).

Auditing
Questions related to technical auditing requirements were posed to three enterprise information security experts and one hospital department information systems manager. A subset of questions was posed to selected compliance and privacy officers, and representatives of the Office of Research. Due to the small number of responses, we provide summary statements as opposed to tables.

Level of audit trail required
Most participants indicated that audit information was needed at the data set or record level for de-identified data. Additionally, audits needed to address logging of authorization decisions as well as access to data:

"I want to have auditing and controls, and I want to be able to say when (a person is) given authorization to use this system... who vouched for him? Who vouched for his identity to say that this is who he is. Who gave him access to this specific data? And how do I know that when (this person) ceases to have access to this data by some obscure criteria, that is going to be revoked in a timely fashion and then logged and tracked, and that motivation is purely for safeguarding the data that I have."
-Information Security Officer
In some cases, participants wanted to know who could access data at any point in time, in order to perform compliance checking on their own authorization decision-making processes, or to identify individuals who are not using the system so as to reconsider whether they should continue to have access.
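An audit entry meeting these expectations - record-level, and capturing who vouched for the authorization - might look roughly like this. The field names are illustrative, not a caBIG schema:

```python
import json
import time

def audit_event(actor, action, record_id, authorized_by):
    """Record-level audit entry capturing not just the access but who
    vouched for the authorization, per the quotes above."""
    return json.dumps({
        "ts": time.time(),
        "actor": actor,
        "action": action,          # e.g. "read", "grant", "revoke"
        "record_id": record_id,    # record-level, even for de-identified data
        "authorized_by": authorized_by,
    })

entry = json.loads(audit_event("dr_baggins", "read", "rec-0042", "privacy_officer"))
assert entry["record_id"] == "rec-0042"
```

Logging grants and revocations as first-class events (not just reads) is what lets an institution later answer "who vouched for him, and when was access revoked?"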

Generation and Management of Audit Data
There was a consensus among the participants representing information security that local institutions should generate the audit data and that some central authority should aggregate, analyze, and distribute the aggregated and analyzed data back to the local institutions. Participants wanted local institutions to retain the ability to inspect all relevant auditing data, both to evaluate the sufficiency of any central investigation process and because in some cases they must conduct their own investigation as the responsible party. If we need to be able to contact all investigators who have used potentially tainted data, we may need to preserve audit logs for a much longer interval. This could be true even for de-identified data. It should be noted that it will likely be necessary to "quarantine" the affected data - that is, turn off access for a broad group of people, potentially everyone using the system - and contain the affected data indefinitely during an investigation and remediation period.

Effect of level of identification
Interestingly, although many participants felt that other controls would differ markedly among the different types...

Impact of workflow tools on Auditing Requirements
Participants voiced concerns about workflow tools and other processes that result in derived data. Here we use the term workflow tools generically to refer to mechanisms that allow a series of operations, and the data flows between them, to be modeled and carried out in an automated fashion. A well-known example of a tool for scientific workflows is the Taverna [29] workflow system. This question is relevant because caBIG™ is developing this capability as part of the caGrid tool suite. In particular, the director of information services we spoke to, who directly controls a clinical database, was very sensitive to the idea that this class of tools might alter the initial data and falsely represent it to the user: "Another thing that will concern me too, is how do we know the integrity of the data has not been altered.... I would want to... verify that the data is still the same." -Director of Information Services
The passage of identified data through third-party workflow and analytic tools posed a particular concern, and greatly increased the requirements for auditing.
There are two potential approaches to the problem of "fishing". Both approaches require that investigators present their IRB credentials at the time they are provisioned. First, users can be required to attest to the fact that they are using data for a purpose sanctioned by their IRB.
"I think that -I know that the people here would be much more comfortable if they were assured that users were investigators who had a legitimate purpose for wanting to use the data... I mean actually, what we would like is to say -all right, this person has a given project that has been approved and that the PI for that project -the user for that project, and those he designates to work on the project -give assurances they are going to use the data only for that project."

-IRB Director
Attestations might be made in the Authorized User Agreement, but could also be reinforced by reminding users each time they log into the system.
A second possible approach that arose from the interviews is that we could audit users to be certain that they are in fact viewing data that relates to their identified purpose.
"In the terms of whatever in signing on to use the grid, that the expectation is that they will be randomly checked, and if any uses are outside of IRB approval or lack of acknowledgement of the resource, they are out."

-IRB Director
The need to audit compliance with an IRB protocol further supports the need to capture sufficient information about the approved protocol to judge the intent of the research. It also suggests that the data record may be the minimal level of auditing required, even for de-identified information.
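One conventional way to address the data-integrity concern raised earlier in this section - that workflow tools might silently alter data - is for providers to publish content hashes that downstream users can verify. A minimal sketch (the function name is ours):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash a data provider could publish so downstream users
    can verify that workflow tools have not altered the original data."""
    return hashlib.sha256(data).hexdigest()

original = b"patient_cohort_v1"
digest = fingerprint(original)

# After the data has passed through a third-party workflow tool:
assert fingerprint(b"patient_cohort_v1") == digest   # unchanged
assert fingerprint(b"patient_cohort_v2") != digest   # altered
```

Derived data would, of course, legitimately differ from the source; hashes only establish whether the *original* inputs were passed through intact.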

Response to security incidents
Information Needed by Local Institutions in the Event of a Security Breach
We aggregated all information that would be requested in the event of a security breach in Table 23.

Reporting requirements
Reporting requirements will vary depending on the incident, the type of IRB approval, and the state where data was collected. Examples of entities that will require notification in some cases include: (1) the IRB of the providing institution, (2) the IRB of the receiving institution, (3) ...

IRB issues
An important finding of this research was that, in general, participants accepted the idea of a two-protocol model for data exchange. In this model, both the repository owner and the investigator (who may be at different institutions) may have IRB protocols from their respective institutions. This is acceptable as long as all parties understand and agree to this approach upfront: "As long as the institution or group or investigator realizes he is putting information samples, whatever, into a repository that is going to be available to other investigators, and they are going to be reviewed by their board... so that's fine."

-IRB Director
Protocol required for setting up a de-identified repository
There was marked variation in the class of protocol that institutions were likely to require to establish a data repository for caTIES, ranging from no HSR waiver to expedited review (Table 24).

Protocol and agreements for searching a de-identified repository
We did not specifically ask about the kind of protocol expected for the investigator at the receiving institution, but several IRB directors offered that the protocol would likely fall under a "Not Human Subjects Research" or "Exempt" designation if only data was exchanged. For tissue exchange, there was a wider variety of responses.
For data that is potentially re-identifiable, existing IRB processes generally also utilize a Data Use and Confidentiality Agreement between the data provider and the user (University and IRB Legal Counsel). As articulated by one privacy officer, a potential problem with Data Use and Confidentiality Agreements in a federated data-sharing environment is that they are typically specific to a project and executed between the researchers and the other institution.

Information Needed in the Event of a Security Breach

Investigator
  Name(s) of individual(s) responsible
  Who funded the project?
  Description of the project for which the data was accessed

Data Accessed
  Description of data accessed
  Risk level of data
  How many patients/participants/subjects were affected?
  Were any identifiers present in the data?
  Was any data modified - is the integrity of the data still intact?

Dates of access
  What period of time did the data cover?

Incident
  Was data re-identified?
  Where (physically) did the breach take place?
  How many times was the data accessed?
  Was the data accessed by more than one individual?
  Was data made publicly available (for example, on a public website)?
  In what state did the security breach occur?
  Were SSNs or other financial information released?

Management
  What discipline was provided at the home institution?
  Who was responsible for maintaining security of the data?
  How was the incident discovered?
  Who discovered the incident?
  What was the chain of reporting once the incident was discovered?
  Was there a failure on the part of the local institution?
  What oversight did caBIG governance have over the matter?
  Was there an unaffiliated investigator agreement in place?

Scenario 3 - Question 17. Respondents included individuals from all organizational roles. Data was aggregated with interview statement as the unit of analysis.
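The elements above could be captured as a structured incident report that institutions exchange after a breach. The following sketch uses invented field names purely for illustration; it is not a caBIG data standard:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BreachReport:
    """Structured incident report covering the elements participants said
    they would need (field names are our own, chosen for illustration)."""
    responsible_individuals: List[str]
    project_description: str
    funding_source: str
    data_description: str
    risk_level: str              # e.g. "de-identified", "limited", "identified"
    subjects_affected: int
    identifiers_present: bool
    data_modified: bool
    access_period: str
    reidentified: bool
    breach_location: str
    access_count: int
    made_public: bool
    state: str                   # state law drives some reporting duties
    discovered_by: Optional[str] = None

report = BreachReport(
    responsible_individuals=["J. Doe"],
    project_description="caTIES query study",
    funding_source="NIH",
    data_description="de-identified pathology reports",
    risk_level="de-identified",
    subjects_affected=12,
    identifiers_present=False,
    data_modified=False,
    access_period="2005-01 to 2005-03",
    reidentified=False,
    breach_location="receiving institution",
    access_count=3,
    made_public=False,
    state="PA",
)
assert not report.identifiers_present
```

A shared schema of this kind would let the central authority and the responsible local institution work from the same facts during an investigation.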
The Problem of data outside the scope of consent
An important problem we detected with the current assumptions regarding IRB protocols for grid use is that data obtained under many protocols is bound by dates relative to the IRB protocol: for exempt studies, data must have been collected prior to the granting of approval. This constraint does not apply when the research has been designated "Not HSR" (IRB Director). It appears that this requirement applies only to the investigator-initiated IRB protocol, and not to the development of the repository. Assuming data will flow continuously into caBIG repositories from other sources, it may be necessary for the grid system to regulate the release of data in accordance with this constraint. In order to provide data to a particular user, the system must know the IRB type and the date of approval of the IRB protocol.
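The date constraint described above could be enforced mechanically at release time. A minimal sketch under the stated assumptions (the function and type labels are ours):

```python
from datetime import date

def releasable(record_collected: date, irb_type: str, approved: date) -> bool:
    """For exempt studies, only data collected before IRB approval may be
    released; a 'Not HSR' designation carries no date constraint
    (per the finding above)."""
    if irb_type == "exempt":
        return record_collected < approved
    if irb_type == "not_hsr":
        return True
    raise ValueError("unhandled IRB type: " + irb_type)

approval = date(2006, 6, 1)
assert releasable(date(2006, 1, 15), "exempt", approval)
assert not releasable(date(2006, 7, 1), "exempt", approval)
assert releasable(date(2006, 7, 1), "not_hsr", approval)
```

This is exactly why the grid must carry the IRB type and approval date as machine-readable attributes of each protocol.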

Use of aggregate data
Most IRBs felt that the use of aggregate data (for example, as histograms) would not be considered human subjects research and would therefore be suitable for preparatory research (Table 25). However, at least one IRB director felt that data needed to be physically separate and not simply an aggregate view of more complete data sources (Director, Office of Regulatory Affairs).

Importance of defining a level of risk for IRB approval
The importance of risk level for making authorization decisions has previously been discussed. Assurance regarding the risk level at the providing institution is also important for securing IRB protocols at the receiving institution. Thus, an aspect of the approval process for caBIG repositories that needs to be addressed through agreements and/or auditing is the assurance that information in the repository meets the definitions of the appropriate risk level -for example de-identified data under the HIPAA safe harbor. Individual IRBs must have this assurance in order to approve the protocol on the investigator side: "The other IRB would have to be assured . . .and know that the data that the person was getting is in fact de-identified, which it would be in the repository." -Director, Division of Human Subjects Protection For this reason, local caBIG repository owners and stewards need to be able to define and attest to the risk level specific to their context and state law. Sharing of data must operate under these constraints.

De-identification
Assessing the risk of imperfect de-identification is an expected function of the local IRB when evaluating an IRB protocol to establish a caBIG repository such as caTIES.
Not all institutions limit their definition to that which is provided under HIPAA (Table 26). Additionally, it appears that in some cases state law may supersede the HIPAA definition of de-identified, further complicating the matter of establishing uniform policies across a federated grid: "The thing... I am worried about is because you are setting this up in such a way that you are in fact creating a highway for data... the rules of which each supplier (of) data has to comply with are going to differ, and... that includes whether or not something is de-identified. So in Washington..." -Legal Counsel to IRB
In the case of Washington State, some have suggested that state law may be interpreted to forbid transmission of sequences from patient material, or even to prohibit the sharing of tissue from which DNA might be extracted.
The responsibility for assessing the adequacy of de-identification for patient-related data appears to rest very clearly with the health system or hospital. However, the use of an honest broker to act as an intermediary between the identified clinical side and the de-identified research side benefits both sides. The honest broker can thus take on some roles of a data steward, assuring that data in a particular system does not exceed the level of risk upon which later IRB determinations are based.

Reducing risk of partial de-identification
Respondents were asked how they would reduce the potential for incomplete de-identification if automated processes are employed, as envisioned in the caBIG project. Automated de-identification of free text poses a number of challenges, including recognition and preservation of contextual information. For example, although proper names in a text document must be removed, the subject of an action in the text (i.e., physician, nurse, patient) must be preserved. Consequently, de-identification algorithms occasionally leave information in a document that allows a human reader to infer identifying information. The risk of this information varies from full disclosure, as in the case of a proper name, social security number, or other identifier, to limited, as in the case of a missed birth date or other personal attribute (Table 27).

Risks that go beyond accidental or intentional re-identification
Although de-identified data does minimize some risks, many respondents were quick to note that even truly de-identified data did not mean risk-free data: "The reality is that even if it's de-identified data, I still have some measure of responsibility over the data that my institution provides, and so there has to be some understanding that the researcher...that the data is still some institution's data, and it is a privilege for them to have access to it."
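The pattern-matching character of automated free-text de-identification can be sketched minimally. The patterns and placeholders below are illustrative assumptions (real systems are far more sophisticated), and the sketch deliberately shows why residual identifiers slip through: anything the patterns do not anticipate leaks into the output.

```python
import re

# Illustrative patterns only; a real de-identifier needs much broader coverage.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "name": re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+"),
}

def scrub(text: str) -> str:
    """Replace matched identifiers with role-preserving placeholders.
    An unprefixed proper name or an unusual date format would pass
    through untouched -- the partial de-identification risk above."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Replacing matches with labeled placeholders (rather than deleting them) is one way to preserve the contextual role of the removed token, as the text above requires.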

Re-consenting of human research subjects
One participant noted that a significant security breach might have the effect of requiring re-consent of patients, because the risks of participation would be altered (Table 29).
"We have lots of reasons for re-consenting or reauthorization, depending on whether or not we believe the risks of their participation change, so if there is a major problem with a security breach or something, we may require the investigators [to] go back and at least make an attempt to re-consent or reauthorize the use of a particular data set." -Director, Office of Regulatory Affairs

Waivers of consent as an alternative to re-consenting
An alternative to re-consenting in some cases may be to obtain a waiver of consent. As one participant pointed out, many important existing databases were obtained without explicit consent for sharing of data, principally because technology for such sharing was not yet envisioned. Further, the keys that would allow re-consenting have been destroyed according to the original protocols: "At the time that many of these huge databases that currently exist were created, there was never any expectation necessarily that the technology would reach a point where data sharing of the type you are trying to design would take place. So, people were promised that any information about you will be kept confidential. It would only be shared with those on the study staff, and any use of it will not have any identifiers about you unless it has been approved by an institutional review board in accordance with law, and when we are done with this study, we will destroy the data." -Legal Counsel to IRB
Waivers of authorization are a problem, because individual IRBs may not accept each other's waivers:

"Frankly, I don't think individual institutions are will [ing] to accept other IRB's waivers and authorizations."
-Health System Privacy Officer
Another participant suggested that it would be very advantageous to have uniform language regarding security safeguards that could be used by local institutions when applying for a waiver of authorization from their IRB.

The problem of undefined future research
Undefined future research is a significant problem with prospective studies, and one that IRBs approach differently. Some IRB directors we spoke to indicated that they encouraged investigators to use the broadest possible language still acceptable to the IRB. Others preferred to let protocols remain rather specific, to discourage undefined future research. Two participants noted that frequent changes in consent form language could be a significant impediment to using the grid for as yet undefined research, and that it was therefore critical to deal with the consent issue as a community.
In the case of identifiable data, the problem of undefined future research is made even more complex by the privacy regulations. As one participant notes: "The fact is that HIPAA seems to require a sort of a project-specific authorization." -Legal Counsel to IRB
One respondent considered the provision for undefined future research to be especially problematic given the multi-institutional nature of this project and the existing IRB processes for handling waivers based on adequacy of security measures.

Discussion
Building effective security systems for a project of the size and scope of caBIG remains a complex and challenging, although manageable, task. The legal and regulatory landscape is difficult and evolving, with the current rules and regulations being interpreted inconsistently by various institutional review boards and regulatory bodies. The grid concept, and indeed the concept of caBIG, is predicated on the ability to share data freely in a federated fashion. This implies supporting technology, as well as supporting business and legal agreements between parties. The current practice of using various point-to-point agreements to facilitate data sharing will not scale to the size envisioned. Reducing the complexity of a system from one that grows as the square of the number of interconnections to one that is linear in the number of connections is a well-known and well-accepted principle of systems theory. Here, the system we speak of is not technical machinery, but rather the set of documents, agreements, policies and processes required to create and sustain an effective federation. Realizing this system is as much an exercise in social engineering as in software engineering.
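The scaling argument can be made concrete with a quick count; the helper names below are ours, used only to illustrate the pairwise-versus-master-agreement arithmetic.

```python
def point_to_point_agreements(n: int) -> int:
    # Every pair of institutions negotiates its own agreement: n(n-1)/2.
    return n * (n - 1) // 2

def federated_agreements(n: int) -> int:
    # Each institution signs one master agreement with a central governing body.
    return n

# For the six centers in this study: 15 pairwise agreements versus 6.
# For fifty NCI-designated centers: 1225 versus 50.
```

The quadratic column is what the interview data suggest cannot be sustained; the linear column is what a master-document federation buys.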
Below we discuss the important issues and high-level recommendations resulting from this work (Table 30).
Where the authors believe an important project assumption has been verified by the interviews, the conclusion stands by itself with no further action suggested. In most cases, additional and ongoing effort -consisting of further study, consensus building, and organizational leadership -will be required. In some cases, clear steps can be taken to build supporting structures, both of governance and of infrastructure, that would greatly facilitate approval of federated systems by local IRBs and other institutional officials charged with compliance oversight.

Construct a separate legal entity for governance
The major recurrent theme that emerged throughout the interviews, either directly or as a logical consequence of verbal statements made by the interviewees, was the need for a clear, cohesive, and empowered governing entity. This is similar to the conclusions reached by the European Advanced Clinical Genomic Trials on Cancer (ACGT) project, which concluded that a separate data management board was needed [30].
Interviewees stated that a governing body was a necessity for effective operation in areas where exchange of regulated data (de-identified or identifiable data) takes place. It was suggested that this body must be a separate operating entity, possibly a non-profit entity. The governing body must have oversight of, and accountability to, the user community in a variety of areas. These responsibilities include accreditation of participating entities, as well as policy and enforcement authority in the areas of data use, risk assessment, security policy and procedure, auditing, compliance, dispute resolution, indemnification, and liability allocation, among others (for a full description, see Table 5). While small-scale pilot operations can be built and sustained initially with a small number of institutions, large-scale federated efforts must include a legally separate governance structure for areas involving regulated information exchange. This will have direct bearing on various business arrangements between participants, on security policy, and potentially on the technical implementation of the underlying security system. Failure to recognize these areas and take supporting action will likely limit usability, slow the broad adoption of key components, and ultimately threaten the sustainability of data sharing federations. An important incentive to develop such a governing body is the indication that, with sufficient governance structure, point-to-point agreements would not be obligatory. New organizations could join the federation if they agreed to adhere to master document sets and to audits demonstrating compliance with them.
It also appears a key task of such a structure will be to provide a mechanism to support trust brokering among institutions that have different quantities of data exposed, or among those of substantially different size and sophistication. Without this mechanism, some large data providers may have reservations about releasing even de-identified data unless mechanisms exist to reliably verify compliance to a minimum set of standards by the consumer of data.

Develop consensus on foreign and commercial partnerships
Regulatory groups have serious concerns about data sharing projects aimed at including foreign and commercial partners. At least some of these concerns stem from the perception that foreign partners may have higher, not lower, privacy standards.
Similarly, data sharing with commercial entities is viewed as a problematic issue, but for reasons involving improper use of data. An interesting topic that emerged from these interviews is the issue of private inurement -specifically, can non-profit participants provide data free of charge via a federated system without receiving value in return from the commercial partner? Once again, establishing a governing membership body funded by membership fees would probably eliminate the impact of this issue, because derived commercial value could subsidize operating costs and therefore reduce membership fees for the non-profit members.

Risk models and risk management processes for data within the Federation should be defined
Appropriate decision-making on security and privacy issues derives directly from the characteristics of data and the processes involved in the handling of data. In addition to being verified by these interviews, this constitutes well-codified security principles spelled out in standards documents such as ISO 17799:2005 [31]. An appropriate risk model should take into account state and local law and contextual issues, as well as more global aspects such as IP value, clinical vs. de-identified vs. exempt/non-human data, and re-identification risk. At a minimum, such models should include the risks to data, repositories, and institutions. Those dealing with de-identified data must include some assessment of the likelihood of re-identification. Existing best-practice frameworks for IT governance describe risk management methodologies in detail. At a minimum, the standards indicated in Federal Information Processing Standards (FIPS) 199 [32] should be used to categorize the elements of the risk model. Indeed, depending on the precise governance model selected, if elements of a federation are ultimately classified as falling under the Federal Information Security Management Act of 2002 (FISMA) [33], this may be a legal requirement. Those seeking to develop large-scale data sharing federations should not try to develop their own methods ad hoc, but should rely on established and mature IT and risk-assessment literature and practices such as the CobiT 4.0 framework [34].

Table 30: Summary of guidelines and topics for further study

Guidelines
A separate legal entity for governance is desired.
Consensus on foreign and commercial partnerships should be developed.
Risk models and risk management processes for data within the Federation should be defined.
Specific technical infrastructure to support the credentialing process in the regulated environment should be developed.
The feasibility of creating a federated honest broker system should be studied.
Local control of identity provisioning and authorization of users is desired.
The identity credentialing process should be strong.
A special credentialing structure for institutionally unaffiliated investigators will be needed.
Existing institutional infrastructure should be leveraged.
Develop or acquire acceptable HIPAA and research ethics training modules for the entire federated community.
A central auditing authority is a necessity.
All data sets dealing with human data, whether de-identified, limited, or fully identified, should be subject to the same auditing requirements.
Specific tooling to support the auditing functions is needed.
A Two-protocol Mode for Data Exchange is accepted by interview participants.

Further Study
Potential for federated human honest broker systems to reduce the number of cases where identifiable information is necessary.
Manner in which undefined prospective research involving data and tissue repositories will be consented and handled.
Establishment of data use and confidentiality agreements between participant organizations and individual investigators in a scalable fashion.
Development of common consent forms acceptable to all IRBs participating in a federation.
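As a concrete illustration of the FIPS 199 categorization mentioned above: each security objective (confidentiality, integrity, availability) is assigned a potential-impact level of low, moderate, or high, and the related federal guidance takes the maximum across objectives (the "high water mark") as the overall category. The sketch below is illustrative only and not an official implementation of the standard.

```python
# Impact levels defined by FIPS 199, ordered for comparison.
LEVELS = {"low": 0, "moderate": 1, "high": 2}

def overall_impact(confidentiality: str, integrity: str, availability: str) -> str:
    """High-water-mark categorization: the overall impact is the highest
    level assigned to any of the three security objectives."""
    return max((confidentiality, integrity, availability),
               key=lambda level: LEVELS[level])
```

A repository of de-identified research data and one holding identifiable clinical data would typically land in different categories under such a scheme, which is what ties the categorization back to the risk model above.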
Develop specific technical infrastructure to support the credentialing process in the regulated environment
A specific area identified during the interviews that would facilitate data sharing is an online registry of "accredited" participating signing organizations. The concept of an online support infrastructure for protocols, trust and security levels, IRB federal certification, and other metadata to support regulatory decision-making emerged in several interviews. It is a requirement that regulatory and compliance personnel be able to determine -possibly ahead of time -who can access what data under what circumstances.
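Such a registry could be modeled as a queryable directory of accreditation metadata. The record fields and lookup below are hypothetical, intended only to show the kind of "who can access what" question compliance staff might answer ahead of time.

```python
from dataclasses import dataclass

@dataclass
class RegistryEntry:
    organization: str
    accredited: bool
    federalwide_assurance: str        # e.g. an FWA number (illustrative)
    approved_data_classes: frozenset  # e.g. {"de-identified", "limited"}

def may_access(registry: dict, org: str, data_class: str) -> bool:
    """Answer an access question from registry metadata alone:
    the organization must exist, be accredited, and be approved
    for the requested class of data."""
    entry = registry.get(org)
    return (entry is not None
            and entry.accredited
            and data_class in entry.approved_data_classes)
```

The point of the sketch is that the decision is driven entirely by registered metadata, so it can be made in advance of any data request.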

Study the feasibility of creating a federated honest broker system
The interview process suggested that honest broker systems are of interest to the community as a means of enhancing data sharing. Importantly for a federation, structured use of such systems could reduce the number of cases where it is necessary for identifiable information to leave local control.
From a systems architecture perspective, honest broker systems can be thought of as a design pattern containing a requestor and a publisher. Consequently, data sharing projects that develop software would be well served to consider this a high-level architectural model for constructing software.
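Read as a design pattern, the broker sits between a publisher holding identified records and a requestor that should see only de-identified ones. The class names and the naive field-dropping step below are illustrative assumptions, not a caBIG design; the essential property is that identified data never crosses the broker boundary.

```python
class Publisher:
    """Holds identified clinical records (illustrative stand-in)."""
    def __init__(self, records):
        self._records = records

    def fetch(self):
        return list(self._records)

class HonestBroker:
    """Mediates between publisher and requestor, stripping identifiers
    so identifiable information stays under local control."""
    IDENTIFIERS = {"name", "mrn", "birth_date"}

    def __init__(self, publisher):
        self._publisher = publisher

    def request(self):
        # Only de-identified views ever leave the broker.
        return [{k: v for k, v in rec.items() if k not in self.IDENTIFIERS}
                for rec in self._publisher.fetch()]
```

A requestor holds a reference only to the broker, never to the publisher, which is what makes this a useful high-level architectural constraint.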

Identity provisioning and authorization of users
The interviews surfaced substantial information on requirements of identity provisioning and authorization of access to data. Key points and recommendations are presented below.
Our data suggest that local control of identity authentication and authorization is preferred by the majority, although a significant proportion of the respondents believed that central identity provisioning was a viable possibility. Most of the participants in the interviews had trouble understanding the concepts and implications involved with a federated environment, even though federation of identity and data sharing practices represents a model that retains and enhances local control. Further work on user education, together with actual experience, will be needed for groups to achieve comfort with the concept of federation. Respondents felt that responsibility and accountability require local control in such a system; however, it was noted that because of the differences in practice between institutions, a centralized legal entity is required to coordinate and enforce policy and practices.

Create a strong identity credentialing process
This study highlights significant concern about the strength and robustness of the user credentialing process (identity vetting) available within local institutions. A number of reasons were given for this, including institution size, the amount of data a given institution serves to the grid, differential financial support for clinical computing and research related functions, and organizational structure and mission of the groups performing the credentialing process. There was a feeling that even though IRBs need strong auditing and credentialing safeguards, they may not be well positioned or staffed to actually perform credentialing functions. There was also strong desire to have a single identified individual at each institution be accountable for and in control of the entire credentialing process. This includes having processes in place to verify identity of users (what we would term "providing authentication functions"), and to perform authorization functions such as the association of a person with particular research roles and allowing access to information restricted by specific IRB approved protocols.
Local control of both authentication and authorization can be facilitated by the use of a common certification authority (such as Verisign™ [35] or SAFE-BioPharma™), using a common certificate policy framework consisting of a certificate practice statement and certificate policy, and a registration agent certification process available to each participating institution. Designated staff at each institution could be certified as registration agents (RAs) by the managing body of the certificate authority. The registration agents would then issue credentials to end-users. This use of common practices and certification creates a common and uniform chain of trust between all parties involved in the federation. Development of such practice frameworks should alleviate concerns expressed by IRB members about institutions with insufficient internal policies. Such frameworks are used in a number of successful federation efforts, such as SAFE™ [26], the University of Texas Health Science System, and the federal E-Authentication [36] effort.
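The resulting chain of trust (the common CA certifies institutional registration agents, who in turn issue end-user credentials) can be sketched as a two-step verification. All names and types here are hypothetical, and real deployments would of course use X.509 certificates and signature verification rather than dictionary lookups.

```python
from dataclasses import dataclass

@dataclass
class Credential:
    subject: str     # end user at a member institution
    issued_by: str   # the registration agent (RA) who issued it

@dataclass
class RACertification:
    agent: str
    certified_by: str  # the federation's common certificate authority

def chain_of_trust(cred: Credential, ra_certs: dict, federation_ca: str) -> bool:
    """Trust an end-user credential only if its issuing RA holds a
    certification from the federation's common CA."""
    cert = ra_certs.get(cred.issued_by)
    return cert is not None and cert.certified_by == federation_ca
```

Because every institution's RAs chain back to the same CA under the same certificate policy, any relying party can evaluate a remote user's credential without a bilateral agreement.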

Create a special credentialing structure for institutionally unaffiliated investigators
In a federation where the basic membership level is expressed at the institutional level, it may be beneficial to develop mechanisms that allow individual investigators who are not affiliated with member institutions to access or even contribute data to a federation in a secure fashion. As with foreign and commercial entities, unaffiliated investigators pose significant challenges to the establishment of a trusted federation. Most notably, they are not employed by a participating entity and therefore may have less incentive to avoid breaching their agreement to participate. Consequently, they may require more monitoring and control by the centralized governing body. Federations must develop mechanisms to deal with this issue.
Take steps to leverage existing institutional infrastructure
Several institutions interviewed have adopted, or are on the verge of adopting, centralized identity management systems. Not surprisingly, participants expressed a desire to leverage this infrastructure for any federations that they may join. This parallels the situation among InCommon members, where institutional identity management and a central university authentication authority are used for all systems within a security domain. Indeed, many universities require all information systems to use these institutional identity services for authentication control. Developers of data-sharing federations should consider the preferential use of centralized identity management systems.

Develop or acquire acceptable HIPAA and research ethics training modules for the entire federated community
The interviews revealed clear difficulties with the acceptance of external HIPAA and IRB research ethics training certification. This implies that it might be fruitful to resolve this issue in a federation-wide fashion, through a community-wide effort to develop training and certification components as part of the caBIG™ program.

A central auditing authority is a necessity
From the perspective of the policy and compliance personnel interviewed, the data support the requirement for a specific body to oversee auditing. This group should be developed and empowered to define standards for compliance with policies, and to enforce these standards via an accreditation process. Specific auditing functions this central group would be charged with overseeing include both technical and non-technical components, and consist of policy review, adherence to agreements, adherence to technical procedure and technical security architecture, adherence to data release only through protocol, incident aggregation, incident analysis, and communication of audit data back to the member institution. The audit group should provide a statement of compliance or non-compliance with key policies and procedures for each member institution.

All data sets dealing with human data, whether de-identified, limited, or fully identified, should be subject to the same auditing requirements
Auditing should be performed in the same manner and at the same level for all data sets dealing with human data. This includes de-identified data, limited data sets, and identified data. The same auditing data should be captured regardless of the risk level of the data itself.

Specific tooling to support the auditing functions is needed
Given the need for some form of centralized auditing support, the technical considerations are not trivial. Every institution involved in a federation must have technology support for the relevant security and privacy logs. However, coherent global auditing requires efforts to standardize security data elements and communication protocols. Otherwise, the power of the auditing capability is reduced to simply a set of local audits that may not appropriately address systematic and end-to-end security and privacy issues. Consequently, specific tooling is needed to support both the centralized auditing functions proposed for the governing body and the specific data required for trust development at the individual institutions. Interviews suggest a preference for auditing at the individual record level, even when dealing with de-identified data. Local groups need assurance that the remote audit data is being properly maintained, that it has an acceptable retention period, and that it is available to them for inspection on demand.
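Standardizing the security data elements might start from a shared record-level schema that every institution emits in a common wire format. The fields below are an illustrative guess at a minimal set, not a caBIG specification.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    timestamp: str    # ISO 8601, e.g. "2008-05-01T14:30:00Z"
    institution: str  # emitting member institution
    user_id: str      # federated identity of the accessor
    protocol_id: str  # IRB protocol under which the access occurred
    record_id: str    # individual data record touched (record-level audit)
    action: str       # e.g. "query", "release", "deny"

def serialize(rec: AuditRecord) -> str:
    """Emit a record in a common format so a central auditing body can
    aggregate and compare logs from every member institution."""
    return json.dumps(asdict(rec), sort_keys=True)
```

With a shared schema, the central body can aggregate incidents end to end, rather than being limited to a set of incomparable local audits.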

A Two-protocol Mode for Data Exchange is accepted by interview participants
An important finding of this research was that, in general, participants accepted the idea of a two-protocol mode of data exchange for de-identified data. In this model, both the repository owner and the investigator (who may be at different institutions) may have IRB protocols from their respective institutions. The relationship between parties can be emergent and does not need to be specified in advance, as long as all understand and agree to this approach. This approach mirrors (at the data exchange level) the governance structure of successful identity federations such as InCommon, SAFE, and the Liberty Alliance.
Critical IRB issues remain that must be resolved
This study raised several important IRB issues that should be clarified through the development of community consensus, including:
• The manner in which undefined prospective research involving data and tissue repositories will be consented and handled.
• How data use and confidentiality agreements can be established between participant organizations and individual investigators in a scalable fashion.
• The development of common consent forms acceptable to all IRBs participating in a federation.
In particular, the development of a common consent form would greatly facilitate multi-institutional prospective research projects, but would require strong leadership and involvement of the individual IRBs, potentially including face-to-face meetings of IRB representatives. Participants suggested that this is important work where NIH leadership is needed.

Limitations
Limitations of this study include potential selection bias and the difficulty participants had in describing the risks of a novel and unfamiliar technology.
Participants in this study were limited to stakeholders at cancer centers who had already agreed to participate in caBIG -a federated biomedical data grid. Thus, it is likely that the institutions from which our stakeholders were selected have already bought in to the concept of data federation. Participants from these institutions may be more accepting of the basic premises of federation than participants drawn from centers that are not participating in caBIG.
Throughout the interviews, we found that participants had difficulty with some questions related to assessing risks for a technology as novel as a federated grid system. Privacy and security requirements are typically considered only on the scale of an individual institution or known business partners. Envisioning a world where security must be managed across multiple, unknown partners was a daunting task for many participants. Thus, further efforts to gather security and privacy requirements should be undertaken as federated systems emerge.
Although the number of participants in this study was relatively small, we note that the sample includes most stakeholders at five of the fifty existing NCI-designated cancer centers, representing a 10% sample of the institutions. Given the detailed nature of the interview instrument, the sample was deemed sufficient for the purposes of requirements gathering. However, it is entirely possible that other security and privacy concerns and requirements exist that were not uncovered in these interviews.
Finally, this work represents a survey study prior to actual design and implementation of data exchange systems. The course suggested holds substantial sociologic challenges in the reorganization of regulatory practices across multiple institutions. This work will need to be further validated by comparison to functioning systems at a future date.

Further Observations
The recently created NIH Genome-Wide Association Studies (GWAS) program [37] and the data-sharing policies emerging from it represent an interesting development. The GWAS data sharing mechanisms for the dbGaP database are an example of a new program that has been constructed along the general lines of the framework discussed in this paper. Data submission to dbGaP requires pre-certification by institutional officials in advance of data submission, and the pre-certification must be part of the data sharing plan submitted with grant applications. Data quality, security, and privacy are maintained to certain standards, and there are guidelines in place both for the data repository at the National Library of Medicine and for the research groups submitting the data. Access to the database and use of data extracted from it require review by a data-access committee and a formally submitted data-use certification agreement that must be signed by institutional officials. Data distribution is bound by additional constraints, including publication embargo of developed results. Finally, there are mechanisms in place to audit and review appropriate data access and use.
The GWAS agreement uses a strong central governing structure and places responsibility for adherence to the terms of the data sharing and acceptable use agreements on the institutions. This model is similar to existing animal and human subject protection mechanisms in that: (a) an institutional-level commitment to protection and compliance is required, and (b) the agreements are specific to the resource. The GWAS policy effectively lays out resource-specific risk-assessment and risk-mitigation plans to be carried out by all parties. In essence, all parties have agreed to meet uniform minimum requirements and standards, overseen by a common governance framework to protect each other and third parties from risk. This framework has many of the elements discussed in this paper, with the minor variation that the NIH has chosen to maintain dbGAP as a centralized resource, rather than deploying a distributed or federated technology model. Table 1 of Piwowar [38] summarizes possible classifications and the tradeoffs involved in selecting a centralized or competing data sharing model.
The present study suggested, and the structure of the GWAS repository confirms, that formal, advance risk-mitigation and institutional sign-off mechanisms should be in place before sensitive data are collected and shared through collaborative resources. There is high scientific value in collecting and sharing very large and highly distributed data sets, but there is also high risk associated with maintaining sensitive information in a distributed, collaboratively operated information resource. These joint pressures are driving the need to develop more detailed, and more specific social mechanisms to permit collaborative research while maintaining individual accountability.
These trends are likely to continue, and it would not be surprising if institutions soon begin to deploy information security oversight committees, designed and operated in a manner similar to institutional review boards and animal care and use committees, in which the security details of individual proposed research projects are reviewed and approved in the context of general guidelines and applicable laws and regulations, all supported by appropriate local infrastructure.

Conclusion
This study identified and explored security and privacy requirements for large-scale federated biomedical data sharing initiatives. Study data suggest many areas of consensus among a sample population of stakeholders, and also indicate areas where there is greater variability. The study also elucidates stakeholder opinions in the areas of governance, identity provisioning, auditing, honest brokering, and research training certification. Based on these opinions, the authors propose an initial set of security and privacy requirements for an emerging federation. These requirements are currently being used as a model for the development of the caBIG data sharing and security framework (DSSF). They also represent a general framework that can be used to inform the development of other large multi-institutional data sharing consortia. The findings, as well as the experiences of other large-scale data sharing initiatives, suggest that data sharing mechanisms will increasingly require strong central governance and institutional commitment to the security procedures and policies of these organizations. It may well be that we are seeing the development of risk mitigation strategies and institutional sign-off requirements on a resource-specific basis. The implication for applied technology and biomedical informatics practitioners will be the need to develop new applications and knowledge infrastructure to support the processes of security, privacy, and trust management in the regulated environment.