XNAT and HID provide a very sound basis for addressing these issues, and it is far from our intention to compare our collaborative environment to such powerful and widely adopted platforms. Nevertheless, there is still room for improvement as regards extensibility and customization, and, for some specific scenarios, more customizable tools are needed to meet specific user requirements. Our work has been mostly aimed at addressing the needs of small laboratories with little or no technical expertise. To this end, non-technical users should be able to create, extend and modify the whole data scheme, if possible through the same user-friendly web interface used to fill in data. Also, especially for multidisciplinary and multi-centre experiments, specific security and privacy policies have to be enforced for access to proprietary data and sensitive clinical data. Moreover, the management of genetic data often requires integrated access to public databases such as NCBI and Molgen. Finally, especially in experiments including genetic screenings, tools must be provided for managing specimens and samples stored in local freezers. Our platform tries to address these needs in order to put non-technical researchers from small laboratories in control of data and samples during collaborative experiments.
The need for extensibility has been considered from two different points of view. The first concerns the possibility of easily customizing and extending the experimental procedures in order to log each step of acquisition or analysis. This is achieved through a process-event model, a multipurpose taxonomic schema composed of two generic main objects: events and processes. The second concerns data flexibility. This aspect has been addressed through the development of a methodology for the dynamic creation and use of data types and related metadata, based on the definition of a “meta” data model. This issue is critical in order not to constrain the repository to a set of predefined data types, but to make it easily extensible and applicable to different contexts, while keeping data immediately usable and integrated.
Finally, data integration has been addressed by efficiently storing distributed samples, data and metadata, and by providing the repository application with a dynamic interface that enables users both to query the data according to the defined data types and to view all the data of each patient in an integrated and simple way.
The process-event structure
An event is defined as any “atomic” operation that can be performed on patients, any processing of data, or any other action related to the administration and management of the repository. If needed, it can contain correlations between data, metadata and, in addition, algorithms. Each event is associated with a process. A process is defined as a group of sequential events and/or sub-processes related to an activity, allowing the creation of a hierarchical structure. Custom process and event types and their relationships can be defined, thus describing the taxonomy that best fits the needs of the application. An example of the process-event structure for a possible clinical scenario is shown in Figure 1. It is worth noting the relationships between processes (blue boxes) and events (yellow boxes) and the association of data and metadata with each specific event. A pre-surgical analysis sequence is considered. This is divided into different phases, each consisting of different steps, ranging from the acquisition of data to their analysis. According to the described structure, a Pre-surgical Process (P) can be considered as a top-level process composed of different sequential sub-processes: (SP1) Data Acquisition, (SP2) Image Post-Processing, (SP3) Trajectories Study and (SP4) Surgery Area Estimation. Each of these sub-processes represents a specific part of the main process and is composed of a number of events, each of them connected to the related data and metadata. It is also worth noting that the whole process can easily be modified to fit either changes in the analysis sequence or different requirements for another case study, simply by changing or adding events, creating new processes composed of different events, or combining existing processes and events. The defined process-event taxonomy can be used to store the information about each step in the process and the related data and metadata, thus allowing the definition of a detailed timeline of performed operations. This can also improve the repeatability of experiments by providing both a detailed record of the analysis process and a complete description of the relationships between data and actions.
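To make the hierarchy concrete, the following minimal Java sketch models the process-event structure described above; class and field names are ours, chosen for illustration, and do not reflect the platform's actual implementation.

import java.util.ArrayList;
import java.util.List;

// Illustrative model of the process-event taxonomy (names are hypothetical).
class Event {
    final String type;                                     // e.g. "MRI acquisition"
    final List<String> dataUris = new ArrayList<>();       // links to stored data files
    final List<String> metadataXml = new ArrayList<>();    // XML metadata descriptions

    Event(String type) { this.type = type; }
}

class Process {
    final String name;                                     // e.g. "Pre-surgical Process"
    final List<Process> subProcesses = new ArrayList<>();  // hierarchical structure
    final List<Event> events = new ArrayList<>();          // sequential events

    Process(String name) { this.name = name; }
}

public class PreSurgicalExample {
    public static void main(String[] args) {
        Process preSurgical = new Process("Pre-surgical Process");
        Process dataAcquisition = new Process("SP1 - Data Acquisition");
        dataAcquisition.events.add(new Event("MRI acquisition"));
        preSurgical.subProcesses.add(dataAcquisition);
        preSurgical.subProcesses.add(new Process("SP2 - Image Post-Processing"));
        preSurgical.subProcesses.add(new Process("SP3 - Trajectories Study"));
        preSurgical.subProcesses.add(new Process("SP4 - Surgery Area Estimation"));
        System.out.println(preSurgical.name + " has " + preSurgical.subProcesses.size() + " sub-processes");
    }
}

Each Event keeps references to its data and metadata, while a Process aggregates sub-processes and events, which is enough to reproduce the Figure 1 scenario in a few lines.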
The XCEDE model has been tailored to fit this process-event model in a simple and efficient way. As shown in Figure 1, the visit and study elements have been associated with process entities in a parent-child relationship, while episodes and acquisitions have been collapsed into events.
The data model
The analysis of the state of the art pointed out that only a few data models/repositories offer the ability to add new data types. Moreover, in most cases, since their databases are built around already defined data types, they require rebuilding the database each time a new type of data is added.
For example, the procedure to add a new data type in XNAT requires the intervention of a specialist in order to:
- Create an XML Schema
- Add the schema to the project
- Run the update script (bin/update.sh or bin/update.bat)
- Update the database
- Re-deploy the webapp
- Set up XNAT security to allow access to the new data types
The proposed approach addresses these issues by providing a new methodology for managing data and related metadata, making the repository easier to use and more dynamic, and therefore improving its overall extensibility and customization. It is based on the definition of a “meta” data model enabling users to build their own data types independently of the application context. With this approach, in order to add a new data type, a user (even a clinical one) has just to:
- Open a web page in the browser and build the data type structure using drop-down lists and other simple controls
- Save the data type, optionally assigning it only to a given set of users/groups
- Use the data type directly (without the need to restart the application or perform any other setup)
The presented approach, however, does not preclude the use of existing and standard data models. Indeed, through the development of suitable wrappers, it is possible to convert data into a standard-compatible format as well as to create data types from an existing data model. Furthermore, the presented model is general enough to be extended to describe other entities such as experiment descriptions.
A “data type” is defined as a minimum set of information describing a data instance (e.g., the set of records associated with a clinical study, or the parameters of a particular biomedical image) that may or may not be associated with physical files. As an example, clinical data can be defined as a data type but are not associated with a file, unlike MRI data.
Each data type is described by an XML metadata schema associated with an XSD file and an XSL file defining, respectively, its structure and its display. The XSD and the XSL adopted for transformations are the same for all XML files. From the XML files, DHTML web forms are built using XSL transformations. The XML can be stored in MySQL or in other SQL databases; in this case, an XML file URL may result not in a physical file but in a query to the DB (this is transparent to the component requesting the XML), which can be regarded as a caching mechanism.
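As an illustration of the display step, the following sketch uses the standard javax.xml.transform API to turn a data type's XML metadata into an HTML form through the shared XSL stylesheet; the file names are placeholders, not the platform's actual resources.

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class DataTypeRenderer {
    public static void main(String[] args) throws Exception {
        // The same stylesheet is applied to every data type description (file name is a placeholder).
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("datatype-form.xsl")));
        // Transform the XML metadata (possibly served from the DB via a URL) into an HTML form.
        transformer.transform(new StreamSource(new File("mri-datatype.xml")),
                              new StreamResult(new File("mri-form.html")));
    }
}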
The XML representation of a data type's metadata is divided into two main sections: a header, containing general information about the schema, and the metadata description, representing the detailed description of the information. The metadata description is composed of one or more groups of information, each composed of attributes, loops and their combinations. An attribute defines a single metadata item and is composed of parameters and subelements. The former describe the typology of the attribute (type, whether it is required, etc.). Subelements, instead, describe what the attribute represents: they include the name, the value or possible values, and references to existing ontology definitions. Through the interface used to build data types, it is possible to choose an ontology among those defined within the platform. Once an ontology has been chosen, it is temporarily loaded within the system and its terms are made available to support users in the definition of attribute names. This is done through a “suggest” mechanism: whenever a user types in the name field, ontology-based suggestions appear according to the written text. This approach is very important in order to build data types using standard annotations and to make them easily integrable with external data sources.
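The following sketch shows how such a description could be inspected with the standard DOM API; the element names and attributes used here (attribute, name, required) are purely illustrative and are not the platform's exact schema.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.File;

public class DataTypeInspector {
    public static void main(String[] args) throws Exception {
        // Parse a data type description (file name and element names are hypothetical).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("clinical-datatype.xml"));
        NodeList attributes = doc.getElementsByTagName("attribute");
        for (int i = 0; i < attributes.getLength(); i++) {
            Element attr = (Element) attributes.item(i);
            // Subelement: what the attribute represents.
            String name = attr.getElementsByTagName("name").item(0).getTextContent();
            // Parameter: typology of the attribute.
            String required = attr.getAttribute("required");
            System.out.println(name + " (required=" + required + ")");
        }
    }
}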
The loop is probably the most relevant improvement with respect to existing data models. Usually, when modelling clinical information, the information model (in terms of number of attributes) is clear, but the amount of information (the number of occurrences) that will characterize the specific data to be recorded is not. This is the case, for example, of reagents in microarray studies: “reagent” is the information to be recorded, but different microarray studies may use different types of reagents. Therefore, the information to be recorded under the “reagent” metadata is in fact characterized by description, concentration and amount for every reagent used. A similar issue arises in clinical studies when describing risk factors: the presence of a risk factor is the information, and it can be modelled with details on the presence and relevance of every sub-factor (smoking, alcohol, sedentary job).
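As a rough illustration of a loop, the snippet below represents a “reagent” loop as a list of repeated attribute sets; the values are made up for illustration and the representation is only a sketch of the concept, not the repository's internal format.

import java.util.List;
import java.util.Map;

public class LoopExample {
    public static void main(String[] args) {
        // A "reagent" loop: the attributes (description, concentration, amount) are fixed,
        // but the number of occurrences is only known when the data are actually recorded.
        List<Map<String, String>> reagentLoop = List.of(
                Map.of("description", "Cy3 dye", "concentration", "10 mM", "amount", "5 uL"),
                Map.of("description", "Cy5 dye", "concentration", "10 mM", "amount", "5 uL"));
        reagentLoop.forEach(System.out::println);
    }
}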
Overall system architecture
The main components of the Repository are (Figure 2):
- the Repository portal: it provides the web interface through which users access the system and manage database requests; it is hosted on a Linux/Unix server environment;
- the Database: it hosts all the information about projects, subjects, metadata, etc.;
- the Grid Storage: it contains all the data files.
Two important aspects concern authentication and the software-as-a-service approach. As regards authentication, users and system administrators authenticate to the system using an existing LDAP or database account available on the server infrastructure. Access is via web browser, without any client installation, and is secured through the HTTPS (secure HTTP) protocol. When users need to access the repository resources, they have to authenticate through the web portal interface using their username and password. Each user is associated with Access Control Lists in order to guarantee security and auditing. System administrators are able to define different groups of users, associated with different access permissions to different pages and functions of the repository. This way, users can see only a subset of pages and perform a limited number of actions (depending on their role in the project). The web-based SaaS (Software as a Service) approach has been preferred because the back-end hardware can be scaled up enough to satisfy user needs without requiring users to implement their own infrastructure. Moreover, this makes it possible, at any moment, to migrate from a standard hardware infrastructure to a service-based one, e.g. using the Grid or other kinds of on-demand IT services.
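A minimal sketch of the LDAP authentication step, using the standard JNDI API, is shown below; the server URL and DN pattern are placeholders and the actual portal logic may differ.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.InitialDirContext;

public class LdapLogin {
    // Returns true if the LDAP bind succeeds; URL and DN pattern are placeholders.
    static boolean authenticate(String username, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldaps://ldap.example.org:636");
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, "uid=" + username + ",ou=people,dc=example,dc=org");
        env.put(Context.SECURITY_CREDENTIALS, password);
        try {
            new InitialDirContext(env).close();  // a successful bind means valid credentials
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}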
The repository portal
The repository portal is designed to make the storage and navigation of data and information easy, through a simple and transparent web interface. It is a Java 2 Enterprise Edition (J2EE) web application based on several existing open source tools for the development of web applications. The basis of the portal is a framework that relies on an Apache Tomcat web application container[22]. It incorporates a database interface layer built with iBATIS, a persistence framework which automates the mapping between SQL databases and objects in Java[23]. To provide users with highly interactive interfaces, some components are designed using the AJAX (Asynchronous JavaScript and XML) programming technique. Messages are exchanged in XML or JSON (JavaScript Object Notation)[24] format wherever possible. Also, wherever possible, XSL transformations (to transform XML data into human-readable HTML pages) are performed[25].
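As a hint of how such an iBATIS layer is typically used, the following sketch builds a client from an XML configuration and runs a mapped statement; the configuration file name and statement id are placeholders, not the portal's actual mappings.

import java.io.Reader;
import com.ibatis.common.resources.Resources;
import com.ibatis.sqlmap.client.SqlMapClient;
import com.ibatis.sqlmap.client.SqlMapClientBuilder;

public class PatientDao {
    public static void main(String[] args) throws Exception {
        // Build the iBATIS client from its XML configuration (file name is a placeholder).
        Reader reader = Resources.getResourceAsReader("SqlMapConfig.xml");
        SqlMapClient sqlMap = SqlMapClientBuilder.buildSqlMapClient(reader);
        // "getPatientById" refers to a mapped SQL statement declared in an sqlMap XML file.
        Object patient = sqlMap.queryForObject("getPatientById", 42);
        System.out.println(patient);
    }
}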
This component represents the main access point to all the functionalities available through the overall integration platform, and exposes both user and administrator interfaces. Administrators are able to control users’ access by creating groups and associating them with pages and functions, define processes (visits and studies), events and all their relationships, define new data types and related metadata, associate them with the related events, and manage available ontologies. Normal users, according to their assigned permissions, are able to insert new data, retrieve patients’ information and view all the related data, download stored data, and explore visits, studies and their interconnections, together with all the related events, data and metadata, to get a global picture.
As an additional feature, in order to make the insertion of metadata easier, an automated approach is available for some predefined data types such as MRI, fMRI, PET and SPECT. Using libraries like the Java dcm4che toolkit[26], the portal can automatically extract the metadata contained within uploaded data files and incorporate them correctly in the database, creating the associated events depending on the image modality. Such an automatic procedure avoids human errors and provides an additional file type check before upload. If needed, a visualization tool can be made available within the portal interface in order to allow users to interact with neuroimages through a remote visualization service. This is made possible by a client-side application that uses the VNC protocol to connect to a sharable work session running server-side, with a significant speed-up of diagnostic processes.
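A minimal sketch of such an extraction with the dcm4che2 API is shown below; the file path is a placeholder and the selected tags are only examples of the header fields that can be mapped onto metadata.

import java.io.File;
import org.dcm4che2.data.DicomObject;
import org.dcm4che2.data.Tag;
import org.dcm4che2.io.DicomInputStream;

public class DicomMetadataExtractor {
    public static void main(String[] args) throws Exception {
        // Read the DICOM header of an uploaded file (path is a placeholder).
        try (DicomInputStream din = new DicomInputStream(new File("upload/scan0001.dcm"))) {
            DicomObject dcm = din.readDicomObject();
            // A few of the tags that could be mapped onto the data type's metadata.
            System.out.println("Modality:   " + dcm.getString(Tag.Modality));   // e.g. "MR"
            System.out.println("Study date: " + dcm.getString(Tag.StudyDate));
            System.out.println("Series:     " + dcm.getString(Tag.SeriesDescription));
        }
    }
}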
The database
The Repository is based on a MySQL database. The database design has been a crucial part of the repository development, since this component is fundamental in making the repository highly flexible and easily extensible. The core of the database is formed by the two previously described entities, processes and events, and their relationships to data and metadata. The information inside the data table represents the data inserted in the repository. These data can be associated with one or more files, thus keeping the association with one or more file entities according to their data type. The File table contains the URIs of all the stored files. The repository can be configured to store the metadata totally or partially within the database. In the latter case, the metadata are stored both as XML descriptions inside the data table (so that data can be displayed rapidly and dynamically using XSLT transformations) and as records in specific metadata tables (so that complex queries can be performed more easily).
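The following JDBC sketch gives a simplified picture of the core tables and their relationships; table and column names are reduced to the essentials and the connection parameters are placeholders, so this is not the actual production schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaSketch {
    public static void main(String[] args) throws Exception {
        // Connection parameters are placeholders; columns are a simplified illustration.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/repository", "repo_user", "secret");
             Statement st = con.createStatement()) {
            // Processes form a hierarchy via parent_id; events belong to a process.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS `process` (id INT PRIMARY KEY AUTO_INCREMENT, "
                    + "name VARCHAR(255), parent_id INT)");
            st.executeUpdate("CREATE TABLE IF NOT EXISTS `event` (id INT PRIMARY KEY AUTO_INCREMENT, "
                    + "process_id INT, type VARCHAR(255))");
            // Data entries carry their XML metadata and link to zero or more files.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS `data` (id INT PRIMARY KEY AUTO_INCREMENT, "
                    + "event_id INT, datatype VARCHAR(255), metadata_xml TEXT)");
            st.executeUpdate("CREATE TABLE IF NOT EXISTS `file` (id INT PRIMARY KEY AUTO_INCREMENT, "
                    + "data_id INT, uri VARCHAR(1024))");
        }
    }
}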
The Grid middleware
A crucial aspect of the repository design is the choice of the Grid middleware used to build the underlying infrastructure. The storage subsystem has been built around iRODS, the successor of SRB (Storage Resource Broker, by the San Diego Supercomputing Center)[27]. iRODS has been chosen, among others (e.g. the gLite Storage Element subcomponent), because it allows the building of a federated and distributed data storage system without the need for central components. The gLite middleware does provide a storage component with many features (command line and Java APIs, integration with x.509 certificates, integration with key-stores for encrypted data storage) and, in a separate component, a metadata catalogue (AMGA). However, it requires almost dedicated servers, with a specific Linux flavour on top and some additional infrastructure (resource brokers) to deal with, thus requiring, at least in part, dedicated technical staff. Therefore, as our platform is mostly aimed at small laboratories focused on biomedical skills and equipped with a basic hardware infrastructure, iRODS has been preferred as the Grid storage middleware despite the advanced features of gLite. Moreover, in the planned experiment the required amount of storage was quite modest and the storage infrastructure did not have to be shared on a public Grid virtual organization, which also suggested a simpler solution. Finally, we decided to use iRODS as the basis of the storage architecture of the repository because it permits the use of heterogeneous storage resources and allows the creation of microservices and rules to easily perform operations on the stored data and metadata. It is also worth mentioning that iRODS can interoperate with the gLite Storage Resource Manager (SRM)[28] interface thanks to the work carried out at the Academia Sinica Grid Computing group, Taiwan[29].
The samples manager
The system described in this paper has been realized with the main goal of providing a tool for the everyday activity of data collection and search, with a pragmatic and practice-oriented perspective. One of the peculiarities of our multi-centre study is that blood samples taken in Genoa, Italy, have to be sent to the Health Science Center, Lubbock, Texas, for the genetic analysis to be carried out. Samples have to be stored in freezers, as different analyses are carried out over time. To streamline the process of storing and retrieving samples, an ad hoc web application has been developed, fully integrated in the platform. The application allows biologists to register the actual location of each sample in the freezer benches (-80°C and -45°C), and allows technicians to graphically configure the available racks and slots in their freezers. When they put a blood sample (or a processed DNA/RNA sample) in the freezer, they can input the coordinates of the sample, from the freezer identification number down to the x-y coordinates in the sample box, simply by clicking on the application interface. They can also save the courier mailing number and a reference to the patient. This way, a sample can be retrieved at any time and is unequivocally associated with other patient data.
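As an illustration, a sample's storage coordinates could be modelled as in the following sketch; class and field names are ours and the values in the example are placeholders, meant only to show the information recorded by the application.

// Illustrative model of a sample's storage coordinates (names and values are hypothetical).
public class SampleLocation {
    final String freezerId;     // freezer identification number
    final int rack;             // rack within the freezer
    final int box;              // box within the rack
    final int x, y;             // position inside the sample box
    final String patientRef;    // link to the patient record in the repository
    final String courierNumber; // courier mailing number for shipped samples

    SampleLocation(String freezerId, int rack, int box, int x, int y,
                   String patientRef, String courierNumber) {
        this.freezerId = freezerId;
        this.rack = rack;
        this.box = box;
        this.x = x;
        this.y = y;
        this.patientRef = patientRef;
        this.courierNumber = courierNumber;
    }

    public static void main(String[] args) {
        SampleLocation s = new SampleLocation("FRZ-80-01", 3, 12, 5, 7, "patient-0042", "UPS-1Z999");
        System.out.println("Sample stored in freezer " + s.freezerId + ", rack " + s.rack
                + ", box " + s.box + ", position (" + s.x + "," + s.y + ")");
    }
}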