Accessing the public MIMIC-II intensive care relational database for clinical research

Background: The Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database is a free, public resource for intensive care research. The database was officially released in 2006, and has attracted a growing number of researchers in academia and industry. We present the two major software tools that facilitate accessing the relational database: the web-based QueryBuilder and a downloadable virtual machine (VM) image.
Results: QueryBuilder and the MIMIC-II VM have been developed successfully and are freely available to MIMIC-II users. Simple example SQL queries and the resulting data are presented. Clinical studies pertaining to acute kidney injury and prediction of fluid requirements in the intensive care unit are shown as typical examples of research performed with MIMIC-II. MIMIC-II has also provided data for annual PhysioNet/Computing in Cardiology Challenges, including the 2012 Challenge "Predicting Mortality of ICU Patients".
Conclusions: QueryBuilder is a web-based tool that provides easy access to MIMIC-II. For more computationally intensive queries, one can locally install a complete copy of MIMIC-II in a VM. Both publicly available tools provide the MIMIC-II research community with convenient querying interfaces and complement the value of the MIMIC-II relational database.


Background
The Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database [1] (http://physionet.org/mimic2) is a public research archive of data collected from patients in intensive care units (ICUs). Although other clinical research databases exist [2,3], such databases are often privately owned, have highly restricted access or require fees for access. MIMIC-II has been fully deidentified in a Health Insurance Portability and Accountability Act (HIPAA) compliant manner and is available free of charge for public use, subject to completion of an appropriate online human-subjects training course and signing of a data use agreement. The database is available via PhysioNet [4,5], a web-based resource for the study of physiologic data.
The data comprising MIMIC-II was collected at the Beth Israel Deaconess Medical Center in Boston, MA, USA from patients who were admitted from 2001 to 2008. The available clinical information includes: patient demographics, laboratory test results, vital sign recordings, fluid and medication records, charted parameters and free-text reports such as nursing notes, imaging reports and discharge summaries. There is a second component of MIMIC-II consisting of high resolution waveform recordings of electrocardiograms, blood pressures, pulse plethysmograms and other monitored signals that were archived from bedside monitors for a subset of the patients. The waveforms, and derived trends and alarms are the subject of much research interest [6][7][8]. Here, however, we focus primarily on the "clinical data", stored in a relational database. For a detailed description of the MIMIC-II database, please see [1].
The MIMIC-II project was approved by the Institutional Review Boards of the Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology (Cambridge, MA, USA). The requirement for individual patient consent was waived because clinical care was not affected and all protected health information (PHI) was deidentified.
http://www.biomedcentral.com/1472-6947/13/9
As of the end of 2012, over 500 users had been approved for access to the MIMIC-II relational database, which reflects researchers' interest in the clinical data of MIMIC-II. Numerous innovative and significant studies on a broad range of topics are based on MIMIC-II and establish its importance. The software tools that make it feasible for a large worldwide community of investigators to draw on MIMIC-II are essential contributors to its value and utility for intensive care research. Providing public access to a relational database for users who are geographically separated and from a wide range of backgrounds is a challenging task.
While there are tools available for web-based administration such as phpPgAdmin [9] and even searching of clinical data [10], they are not always appropriate in any given situation. For MIMIC-II, we have developed an easy-to-use, read-only interface capable of performing exploratory searches and a more powerful tool for complex data processing. These two access tools currently serve as the main gateways to the MIMIC-II relational database. In the present article, we describe their implementations and vital roles in conducting clinical research using MIMIC-II.

Implementation
The MIMIC-II relational database (version 2.6) contains records from over 32,000 subjects, including over 7,000 neonatal patients. The raw data is stored in various base tables, generally organized by subject, hospital and ICU stay IDs. Several database views, which summarize and collate information, have been generated to allow users to become familiar with the available data and to find records of interest. Users can access the database via a web-based online tool (QueryBuilder) and a downloadable virtual machine (VM) image, which are discussed in the ensuing sections. Flat file exports of the database tables and a PostgreSQL compatible dump file are also available, but are not discussed here. To the best of our knowledge, there are no other publicly available software tools that allow users to query a clinical database in SQL (Structured Query Language), either in a web-browser or in a virtual machine environment.

QueryBuilder web-based tool
QueryBuilder is a web-based database query tool developed using the Google Web Toolkit (GWT) [11] and ExtGWT widget library [12]. Figure 1 shows the system infrastructure. The QueryBuilder application is hosted on a Tomcat 7 application server and connects to an Oracle 11g database containing clinical data from MIMIC-II. Queries are submitted to the application server using a GWT Remote Procedure Call (RPC) and are executed in the database using Java DataBase Connectivity (JDBC) [13]. The results are passed back through the application server to the user's web browser.
The GWT framework allows developers to write code in Java, which is then compiled into a highly optimized browser-independent JavaScript web application. GWT provides a "development mode" in which the Java code is dynamically translated to JavaScript and displayed using a browser plugin. For production systems, GWT builds a Web application ARchive (WAR) file containing optimized JavaScript that works across a wide range of browsers and platforms. The WAR contains Java servlets for server side processing and can be deployed on any standard application server.
Figure 1 QueryBuilder system infrastructure. The user connects to QueryBuilder using a web browser. When the user submits an SQL query, it is transmitted to the application server and executed on the database server via JDBC. The query results are returned to the user via the application server and displayed.
QueryBuilder, which is accessible through desktop or mobile web browsers, allows users to explore the structure of the various tables and views in the database and to examine the relationships among them. SQL queries allow users to examine and process the data as desired; the resulting datasets can be exported in CSV (Comma-Separated Values) format for further processing. In order to prevent a given user from excessively consuming shared resources on QueryBuilder (e.g., exporting all tables in MIMIC-II), we limited the maximum number of exportable rows to 1,000.
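As an illustration of the row cap on CSV export, the following minimal Python sketch runs a query against a toy in-memory SQLite table and exports at most 1,000 rows. It is not the QueryBuilder implementation (which is a Java/GWT application backed by Oracle); the function name `export_csv`, the table `demo` and its contents are hypothetical and chosen only for this example.

```python
import csv
import io
import sqlite3

MAX_EXPORT_ROWS = 1000  # the cap stated in the text; how QueryBuilder enforces it is an assumption


def export_csv(conn, sql, max_rows=MAX_EXPORT_ROWS):
    """Run a query and return (csv_text, truncated), exporting at most max_rows rows."""
    cur = conn.execute(sql)
    header = [col[0] for col in cur.description]
    rows = cur.fetchmany(max_rows + 1)  # one extra row tells us whether the result was cut off
    truncated = len(rows) > max_rows
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows[:max_rows])
    return buf.getvalue(), truncated


# Toy table standing in for a MIMIC-II table (hypothetical data, not real records).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (subject_id INTEGER, value REAL)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(i, i * 0.5) for i in range(1500)])

csv_text, truncated = export_csv(conn, "SELECT subject_id, value FROM demo ORDER BY subject_id")
print(len(csv_text.splitlines()) - 1, "rows exported; truncated:", truncated)
```

Fetching one row more than the limit is a simple way to tell the user that the result was cut off, which is the situation in which the local VM becomes the better choice.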

Virtual machine
The increasing number of users, and their desire to run more complex queries, has begun to overload the computer systems hosting QueryBuilder. To mitigate this problem, we have developed a system allowing users to run a copy of the relational database on their own computers, providing much faster, uncongested access using a VM. A VM is a completely isolated operating system installation that can be run within a host environment. The MIMIC-II VM employs Oracle's VirtualBox virtualization environment, providing an Ubuntu 10.04 Linux operating system distribution and a pre-configured PostgreSQL 8.4 database server.
To use the MIMIC-II VM, users must first install the VirtualBox host software, and download the MIMIC-II VM image for import into VirtualBox. Once the VM has been started, a simple script will download and import the MIMIC-II database into the local PostgreSQL server. The resulting system contains a complete clone of the MIMIC-II relational database that can be queried using a command line client, a GUI (Graphical User Interface) desktop application (pgAdmin III), and JDBC interfaces. The VM also includes an SQL cookbook which is a compilation of example SQL queries that users can use as a starting point for their research studies.
We also created a demo VM (and, to suit users' preferences, a bootable ISO image) containing data from 4,000 patients who have been deceased for two years or more. Since the demo VM and the ISO image contain neither PHI, nor free text, nor any data from recently living individuals, they are exempt from HIPAA restrictions, and interested researchers may download them freely.

Results
Both QueryBuilder and the MIMIC-II VM are currently available to MIMIC-II users free of charge. A link to QueryBuilder, the downloadable VM image, as well as related documentation and instructions for gaining access are available on PhysioNet (http://physionet.org/mimic2). Although QueryBuilder and the VM provide immediate access to the MIMIC-II relational database, the user needs a working knowledge of both SQL and the MIMIC-II database schema. SQL is a rare skill among clinicians, and becoming familiar with the structure of the MIMIC-II clinical data requires substantial time and effort. In order to guide new MIMIC-II users, we present a few example queries and research studies in the subsequent sections. We recommend using QueryBuilder for the simple example queries in Section "Example usage" and using the VM for the computationally expensive studies in Section "Applications".

Example usage
The ICUSTAY_DETAIL view summarizes ICU stays for all patients and can be used to obtain general statistics for the entire population of MIMIC-II. The following example query obtains ICU mortality statistics broken down by gender.
    SELECT gender, icustay_expire_flg, COUNT(*)
    FROM mimic2v26.icustay_detail
    WHERE subject_icustay_total_num = 1
    GROUP BY gender, icustay_expire_flg
The results from this query are shown in Table 1, which indicates that among patients with only one ICU stay in the database, there are more males than females, and that males have a lower ICU mortality rate (6.2%) than females (7.2%). One can obtain hospital mortality by querying the hospital_expire_flg column instead.
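To show how mortality rates such as those quoted above follow from the grouped counts, here is a minimal Python sketch using an in-memory SQLite table. The table is a simplified stand-in for the ICUSTAY_DETAIL view with a handful of invented rows, so the resulting percentages are illustrative only and do not reproduce Table 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE icustay_detail (
           icustay_id INTEGER, subject_id INTEGER, gender TEXT,
           icustay_expire_flg TEXT, subject_icustay_total_num INTEGER)"""
)
# Hypothetical toy rows: (icustay_id, subject_id, gender, died in ICU?, total ICU stays)
toy_rows = [
    (1, 1, "F", "N", 1), (2, 2, "F", "Y", 1), (3, 3, "F", "N", 1), (4, 4, "F", "N", 1),
    (5, 5, "M", "N", 1), (6, 6, "M", "Y", 1), (7, 7, "M", "N", 1), (8, 8, "M", "N", 1),
    (9, 9, "M", "N", 1),
    (10, 10, "M", "Y", 2), (11, 10, "M", "N", 2),  # two stays: excluded by the WHERE clause
]
conn.executemany("INSERT INTO icustay_detail VALUES (?, ?, ?, ?, ?)", toy_rows)

# Same restriction as the example query, with deaths counted per gender in one pass.
counts = conn.execute(
    """SELECT gender,
              COUNT(*) AS n_stays,
              SUM(CASE WHEN icustay_expire_flg = 'Y' THEN 1 ELSE 0 END) AS n_deaths
       FROM icustay_detail
       WHERE subject_icustay_total_num = 1
       GROUP BY gender
       ORDER BY gender"""
).fetchall()

# ICU mortality rate in percent for each gender.
rates = {gender: round(100.0 * n_deaths / n_stays, 1) for gender, n_stays, n_deaths in counts}
print(rates)
```

The conditional `SUM` is one way to obtain the numerator and denominator in a single row per gender; the query in the text returns them as separate rows per flag value instead.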
The ICUSTAY_DETAIL view can also be used to obtain patient cohorts by using the WHERE clause to restrict the query to ICU stays of interest. The second example query obtains all ICU stays for patients who have a SAPS (Simplified Acute Physiology Score) I [14] score between 15 and 20, are between 20 and 30 years old, had 2 ICU admissions in total and died in the hospital.
    SELECT icustay_id, subject_id, gender, dob, dod
    FROM mimic2v26.icustay_detail
    WHERE icustay_admit_age BETWEEN 20 AND 30
      AND sapsi_first BETWEEN 15 AND 20
      AND subject_icustay_total_num = 2
      AND hospital_expire_flg = 'Y'
Table 2 Results of the example query in the text. ICUSTAY_ID and SUBJECT_ID identify the ICU stay and the patient; DOB and DOD are surrogate dates of birth and death; see text for discussion.
The query returns three rows as shown in Table 2 (for patient privacy, an offset, randomly chosen for each patient individually, has been added to all dates in the original data to obtain surrogate dates). Despite its apparently simple constraints, the query is actually quite specific, and there are only three subjects, all male, from over 26,000 adults who meet the criteria. Furthermore, although the query sought patients with two ICU stays, the results show only one stay for each patient. There are two possible reasons for this:
1. The patient's SAPS I score during his other ICU admission was not between 15 and 20.
2. The patient's age was not between 20 and 30 for one of his two ICU admissions.
We can obtain all of the ICU stays for the subjects who were returned in Table 2 by querying on their subject IDs. The results in Table 3 show that each patient did have two ICU stays, but his SAPS I score was available for only one of them. The records not listed in Table 2 failed to meet the criteria of the previous query, most likely due to missing data for one or more parameters needed to calculate the SAPS I score.
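The behaviour described above, where a stay with a missing SAPS I score silently drops out of a BETWEEN filter, can be reproduced on toy data. The following Python sketch uses an in-memory SQLite table with invented rows; the follow-up query on subject IDs is a hypothetical illustration, not the query used to produce Table 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE icustay_detail (
           icustay_id INTEGER, subject_id INTEGER,
           icustay_admit_age REAL, sapsi_first INTEGER,
           subject_icustay_total_num INTEGER, hospital_expire_flg TEXT)"""
)
# Hypothetical toy stays; None is stored as SQL NULL (score could not be computed).
toy_stays = [
    (1, 101, 25.1, 17, 2, "Y"), (2, 101, 25.3, None, 2, "Y"),
    (3, 102, 28.0, None, 2, "Y"), (4, 102, 28.4, 16, 2, "Y"),
    (5, 103, 22.7, 19, 2, "Y"), (6, 103, 22.9, None, 2, "Y"),
    (7, 104, 64.0, 18, 1, "N"),
]
conn.executemany("INSERT INTO icustay_detail VALUES (?, ?, ?, ?, ?, ?)", toy_stays)

# Cohort query with the same WHERE clause as the second example query.
cohort = conn.execute(
    """SELECT icustay_id, subject_id FROM icustay_detail
       WHERE icustay_admit_age BETWEEN 20 AND 30
         AND sapsi_first BETWEEN 15 AND 20
         AND subject_icustay_total_num = 2
         AND hospital_expire_flg = 'Y'
       ORDER BY icustay_id"""
).fetchall()

# Follow-up: every stay of the returned subjects, regardless of score.
subject_ids = sorted({row[1] for row in cohort})
placeholders = ", ".join("?" for _ in subject_ids)
all_stays = conn.execute(
    f"""SELECT icustay_id, subject_id, sapsi_first FROM icustay_detail
        WHERE subject_id IN ({placeholders})
        ORDER BY subject_id, icustay_id""",
    subject_ids,
).fetchall()

# A comparison against NULL evaluates to NULL, which WHERE treats as "not true".
null_check = conn.execute("SELECT NULL BETWEEN 15 AND 20").fetchone()[0]
print(len(cohort), "stays in cohort;", len(all_stays), "stays in total; NULL check:", null_check)
```

When missing scores should be reported rather than dropped, adding an explicit `OR sapsi_first IS NULL` condition (or counting NULLs separately) makes the exclusion visible.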
The MIMIC-II database contains complex, detailed data and apparently simple queries can return unexpected results. The rich, detailed information it contains has stimulated a variety of research interests.

Applications
MIMIC-II has attracted research in data mining, pattern recognition and signal processing. There have been a wide variety of publications based on the data contained within MIMIC-II and its public availability encourages reproducible research and permits comparison of results. We now discuss two of the recent research problems that have been investigated using data from MIMIC-II. Subsequently, the PhysioNet/Computing in Cardiology (CinC) Challenges that utilized MIMIC-II are also described. The examples below illustrate what kinds of clinical research are possible with the MIMIC-II relational database.

Acute kidney injury
Acute kidney injury (AKI) is a serious and frequent condition in critically ill patients [15]. There are established criteria [16] defining three severities of AKI based on patient urine output over 6, 12 or 24 hour periods and increases in serum creatinine levels over a two-day window. MIMIC-II contains hourly urine output measurements and daily serum creatinine laboratory test results that permit a thorough investigation into the AKI classifications. Using the data in MIMIC-II, we were able to determine AKI stages for all patients and build multivariate logistic regression models to determine whether AKI stages can be used as biomarkers of increased hospital mortality [17]. Owing to the high temporal resolution of the data, we were able to build models for a large range of urine output thresholds and durations to determine that the existing Acute Kidney Injury Network (AKIN) definitions employ clinically meaningful criteria [18].
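To make the idea of staging AKI from hourly urine output concrete, the following Python sketch assigns a stage from a series of hourly measurements. The thresholds (below 0.5 mL/kg/h for 6 or 12 hours, below 0.3 mL/kg/h for 24 hours, or no output for 12 hours) are our reading of the commonly cited AKIN urine output criteria and are stated here as an assumption; the sketch ignores the serum creatinine criteria entirely and is not the code used in [17,18]. The function names are hypothetical.

```python
def longest_run(values, predicate):
    """Length of the longest run of consecutive values that satisfy predicate."""
    best = current = 0
    for value in values:
        current = current + 1 if predicate(value) else 0
        best = max(best, current)
    return best


def aki_stage_from_urine(hourly_ml_per_kg):
    """Return 0-3 from hourly urine output in mL/kg/h (urine output criteria only).

    Assumed thresholds: stage 3 if <0.3 for 24 h or zero output for 12 h,
    stage 2 if <0.5 for 12 h, stage 1 if <0.5 for 6 h, otherwise 0.
    """
    if (longest_run(hourly_ml_per_kg, lambda v: v < 0.3) >= 24
            or longest_run(hourly_ml_per_kg, lambda v: v == 0) >= 12):
        return 3
    low_hours = longest_run(hourly_ml_per_kg, lambda v: v < 0.5)
    if low_hours >= 12:
        return 2
    if low_hours >= 6:
        return 1
    return 0


# Hypothetical series: 10 normal hours, 6 low hours, then recovery.
example = [1.0] * 10 + [0.4] * 6 + [1.0] * 8
print(aki_stage_from_urine(example))
```

Because the thresholds and durations are plain parameters of the run-length test, the same structure could be reused to explore other urine output thresholds and durations, which is the kind of sweep the high temporal resolution of the data permits.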

Prediction of fluid requirement in the ICU
The first 72 hours after admission are critical for ICU patients. Suboptimal fluid management during this period can result in episodes of hypotension, leading to reduced organ perfusion. In practice, clinicians perform the difficult task of estimating maintenance fluid requirement by estimating fluid loss. Providing an accurate prediction of a patient's fluid requirements would assist clinicians in making their decision. MIMIC-II contains detailed fluid input/output measurements as well as vasopressor administration, demographics and physiologic variables. Using data from the first day of a patient's ICU admission in a linear regression model combined with a Bayesian network, Celi et al. were able to accurately estimate patient fluid requirements for day two [19] (see Endnote a).
Table 3 The data shows that the SAPSI_FIRST column is 'null' in three rows, explaining why these rows were not returned in Table 2. The missing SAPS I scores are most likely due to missing data for the parameters used to calculate the score.

PhysioNet/Computing in cardiology challenges
The annual PhysioNet/CinC Challenges (http://www.physionet.org/challenge/) invite participants to tackle clinically interesting problems. The challenges in 2009 [20], to predict hypotensive episodes in the ICU, and 2010 [21], to reconstruct missing or corrupted signals, both used data from the MIMIC-II database. The 2012 PhysioNet challenge entitled "Predicting Mortality of ICU Patients" also used MIMIC-II data and asked participants to develop a patient-specific prediction of in-hospital mortality. The dataset consisted of MIMIC-II records from 12,000 ICU stays, each at least 48 hours in duration, providing up to 41 different variables. Five of the 41 were "general descriptors" (recordID, age, gender, height and weight), recorded once, on admission. The remainder were "time series" variables such as vital signs and laboratory test results and were recorded multiple times throughout the 48 hour period. The aim of the challenge was to predict, for each patient, whether they died in the hospital. Participants discussed their approaches to the challenge problem during the CinC 2012 conference (http://physionet.org/challenge/2012/).

Discussion
The MIMIC-II database is a valuable research tool that is gaining popularity as it is expanded and improved over time. Its clinical data can be accessed using a variety of methods, including the web-based QueryBuilder and standalone virtual machine technology. These publicly available software tools play a vital role in connecting a broad community of researchers to MIMIC-II, providing them with immediate access to a one-of-a-kind ICU database and making it feasible for them to perform a wide variety of innovative studies with it.
Typically, a new MIMIC-II user would utilize QueryBuilder and the VM to conduct a clinical study in the following steps:
1. Explore the clinical data in MIMIC-II using the demo VM and conduct a feasibility test for an envisioned research study.
2. Use QueryBuilder to conduct a further feasibility test by looking in the tables that are not part of the demo VM and by checking cohort size.
3. Write and debug an appropriate SQL query in QueryBuilder to extract desired patient data.
4. If the final SQL query requires substantial computing time or the results contain more than 1,000 rows, run the query in the VM with complete data.
In our experience, clinical research such as that presented in this article is best approached using an interdisciplinary team combining clinicians who provide the research direction and interpretation of results with engineers who provide data extraction and statistical modeling [22]. Being a web-based tool, QueryBuilder ensures minimal setup time and effort. MIMIC-II users only need a web browser and Internet connection to be able to launch QueryBuilder. Installing a complete MIMIC-II VM on a local computer involves more steps and requires a longer time, but is an effective method when the shared resources for QueryBuilder become the bottleneck in conducting a research study.
We are working to introduce and improve tools for searching and visualizing the data available in MIMIC-II. Our existing QueryBuilder and VM require users to know or to learn SQL; our next generation of tools will provide an intuitive graphical interface that will be immediately accessible to a wider user community that includes many more clinicians. Additionally, we are expanding the database by adding patient records and enlarging the records of existing patients. Improved tools and expansion of the database will further support retrospective clinical research.
In the present article, we have discussed simple example SQL queries as well as representative clinical studies that have been performed using MIMIC-II. We have also described the PhysioNet/CinC 2012 Challenge "Predicting Mortality of ICU Patients". These examples hint at the range of problems that can be studied using MIMIC-II. They illustrate how investigators can formulate and answer research questions using open-source tools to explore the rich contents of the first (and so far the only) large and publicly available database for intensive care research.

Conclusions
MIMIC-II is an invaluable public database for intensive care research, and we have successfully developed two freely available tools that facilitate accessing MIMIC-II. QueryBuilder is a web-based tool that allows a user to query MIMIC-II in SQL. For more computationally intensive queries, one can locally install a complete copy of MIMIC-II in a VM. A demo VM is also available for interested users who wish to explore MIMIC-II with minimal setup time. We believe that QueryBuilder and the MIMIC-II VM are integral parts of the MIMIC-II research community, which is corroborated by extensive utilization of both tools by MIMIC-II users.

Availability and requirements
• Project name: QueryBuilder and MIMIC-II virtual machine
• Project home page: http://physionet.org/mimic2
• Operating system(s): Platform independent
• Programming language: Java, SQL
• Other requirements: Any web browser, Oracle VirtualBox
• License: Open source
• Any restrictions to use by non-academics: None

Endnote
a The provided accuracy was 77.8%, which is the percentage of correctly estimated fluid requirements when the actual fluid requirements in the test dataset were divided into quartiles.
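One plausible reading of the quartile-based accuracy in Endnote a is that an estimate counts as correct when it falls into the same quartile bin as the actual value, with bin edges taken from the actual values in the test set. The Python sketch below implements that reading on invented numbers; it is an assumption about the metric, not the evaluation code of [19], and the function name and data are hypothetical.

```python
import bisect
import statistics


def quartile_accuracy(actual, predicted):
    """Percentage of predictions that land in the same quartile bin as the actual value.

    Bin edges are the three quartile cut points of the actual values
    (assumed interpretation of the metric described in Endnote a).
    """
    cuts = statistics.quantiles(actual, n=4)  # three cut points between four bins
    hits = sum(
        bisect.bisect_right(cuts, a) == bisect.bisect_right(cuts, p)
        for a, p in zip(actual, predicted)
    )
    return 100.0 * hits / len(actual)


# Hypothetical day-two fluid requirements in mL (actual) and model estimates (predicted).
actual = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
predicted = [600, 1200, 1400, 2100, 2000, 3100, 3600, 3000]
accuracy = quartile_accuracy(actual, predicted)
print(f"{accuracy:.1f}% of estimates fall in the correct quartile")
```

Note that a metric of this kind rewards estimates that are in the right range rather than numerically close, so it should be read alongside an error measure in the original units.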