Architecture and technologies
The EHR-based literature retrieval system proposed here, illustrated in Figure 1, requires two sources of information: (i) an EHR repository, and (ii) a biomedical literature search engine. The operation starts when the user selects a disease and the EHR of a given patient from a local repository or HIS. The selected EHR is displayed to the user with the keywords relevant for literature retrieval highlighted. The user launches search queries by selecting the suggested terms, and the system retrieves a number of citations related to the given EHR.
Besides the above functional requirements, the system architecture is also determined by security issues. Although security and anonymization protocols are available for sensitive data transfer, our approach is based on a client-side model to prevent patient data transmission to an external server. Note that the feasibility of this architecture depends on limiting the system's computational requirements to the capabilities of any device with a web browser.
CDAPubMed was developed as a Mozilla Firefox 3.6.x extension (or plug-in) resting on an architecture where all the information is processed on the client side. Firefox 3.6.x is the current version of the popular web browser maintained by Mozilla, which supports Java-based plugins. Because CDAPubMed is an open source project, Mozilla Firefox was selected as a widely accepted open source web browser that also provides a hardware- and operating system-independent platform.
Regarding access to patient information, we implemented a wrapper for EHRs based on the Health Level 7-Clinical Document Architecture Standard (HL7-CDA). HL7-CDA is an eXtensible Markup Language (XML)-based format aiming to provide a structure for clinical information, but still including free text. A modular implementation has been carried out to facilitate the development of other XML-based interfaces within CDAPubMed's term identification module. In the current version, the query generation module generates PubMed queries, facilitating access to biomedical research publications. In addition, PubMed offers an application programming interface (API), i.e., Entrez Utilities [27], which is able to improve the performance of the system by retrieving specific information from PubMed, such as the number of citations instead of the full list of citations. Finally, the biomedical controlled vocabulary selected for CDAPubMed was the Medical Subject Headings (MeSH) thesaurus [28], developed by the National Library of Medicine (NLM) and used for article indexing in MEDLINE.
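The benefit of retrieving only the number of citations can be illustrated with a minimal Python sketch of an Entrez ESearch request that sets rettype=count. The function names and the sample response below are illustrative assumptions, not part of CDAPubMed, and the sketch only builds the URL and parses a response instead of contacting PubMed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public ESearch endpoint of the Entrez Utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(query: str) -> str:
    """Build an ESearch URL that asks PubMed only for the hit count."""
    params = {"db": "pubmed", "term": query, "rettype": "count"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(xml_text: str) -> int:
    """Extract the <Count> value from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    count = root.findtext("Count")
    if count is None:
        raise ValueError("no <Count> element in ESearch response")
    return int(count)


# Illustrative response body (assumed shape, not a recorded PubMed answer).
SAMPLE_RESPONSE = "<eSearchResult><Count>6</Count></eSearchResult>"

url = build_count_url("(Asthma[mh]) AND (Diabetes Mellitus, Type 1[mh])")
hits = parse_count(SAMPLE_RESPONSE)
```

Requesting only the count keeps the response small, which matters when a count has to be obtained for many candidate keywords.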
Keyword identification in HL7-CDA-based EHRs
The MeSH thesaurus is organized according to a tree-based structure, from general to specific concepts, and in 2011 contained 177,000 Entry Terms and 26,142 Main Headings [28]. CDAPubMed's keyword identification algorithm requires the following information from the MeSH thesaurus: (i) the list of MeSH terms, (ii) a set of synonyms for each term, (iii) the MeSH tree branch or branches that include the MeSH term, and, to narrow down the MeSH term-related subject area, (iv) the corresponding set of allowable qualifiers [29], e.g., Diabetes/epidemiology will retrieve only epidemiological publications about Diabetes. Instead of using the original format provided by the NLM to store the required MeSH information, CDAPubMed employs a self-generated hash table [30] (labeled mesh.tbl in Figure 2) to improve the efficiency of keyword searching within the free text of HL7-CDA-based EHRs.
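The idea of a hash table keyed by terms and synonyms can be sketched in Python as follows. The actual layout of mesh.tbl is not described here, so the MeshEntry class, the normalize helper, and the two toy records are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MeshEntry:
    heading: str                      # MeSH main heading
    tree_numbers: list                # branch positions, e.g. "C14.907.489"
    qualifiers: set = field(default_factory=set)  # allowable qualifiers


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so lookups ignore formatting."""
    return " ".join(term.lower().split())


def build_index(records):
    """Map every heading and synonym to one shared MeshEntry.

    records: iterable of (heading, synonyms, tree_numbers, qualifiers).
    """
    index = {}
    for heading, synonyms, tree_numbers, qualifiers in records:
        entry = MeshEntry(heading, list(tree_numbers), set(qualifiers))
        for name in [heading, *synonyms]:
            index[normalize(name)] = entry
    return index


# Toy records for illustration only (not loaded from the NLM files).
records = [
    ("Hypertension", ["High Blood Pressure"], ["C14.907.489"],
     {"epidemiology", "drug therapy"}),
    ("Asthma", ["Bronchial Asthma"], ["C08.127.108"], {"epidemiology"}),
]
index = build_index(records)
entry = index.get(normalize("high  blood pressure"))
```

With this structure, checking whether a token sequence is a MeSH term or synonym is a single dictionary lookup, regardless of the size of the thesaurus.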
HL7-CDAs gather clinical data from different sources and are divided into sections, usually identified by LOINC or SNOMED codes [31]. CDAPubMed automatically identifies relevant MeSH terms, or synonyms, within each of these sections. However, not every MeSH term present in an EHR is necessarily relevant for citation retrieval. To meet the performance requirements of the PubMed API (a default limit of 1,000 queries per minute), CDAPubMed includes a configuration file (labeled conf.xml in Figure 2) with two mechanisms to avoid unnecessary keyword identifications. The first mechanism is the option of declaring an EHR section as relevant for literature retrieval purposes. Each section can be identified by a title, a set of synonym titles and/or LOINC/SNOMED codes. The identification process is confined to those sections declared as relevant. EHR sections such as Appointment or Chief complaint could be discarded for literature retrieval purposes. The second mechanism is to confine the identification process of an EHR section to a subset of MeSH terms, i.e., MeSH tree branches. For instance, the term Penicillin does not have the same implications when it is located in the Allergies section as when it is located in the Family Diseases section. Complete MeSH branches such as Publication Characteristics (V), Humanities (K) or Technology, Industry, Agriculture (J) could also be discarded within the identification process if users consider that they are not relevant. CDAPubMed contains a default set of relevant sections and branch associations, namely those that obtained the best performance in the Results section. Advanced users nevertheless have the option of modifying this configuration through a graphical interface. The identification process, shown in Figure 2, is driven by this configuration.
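The two filtering mechanisms can be sketched in Python as follows. The schema of conf.xml is not given here, so the rules are written as plain Python objects; the SectionRule class, the function names, the section codes, and the drug tree number are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionRule:
    titles: frozenset    # lowercased title and synonym titles
    codes: frozenset     # LOINC/SNOMED codes identifying the section
    branches: frozenset  # top-level MeSH branch letters to search


# Illustrative rules only; the default CDAPubMed configuration is not shown.
RULES = [
    SectionRule(frozenset({"allergies"}), frozenset({"48765-2"}),
                frozenset({"D"})),
    SectionRule(frozenset({"family diseases", "family history"}),
                frozenset({"10157-6"}), frozenset({"C"})),
]


def match_rule(title, code=None, rules=RULES):
    """Mechanism 1: return the rule for a relevant section, else None."""
    key = title.strip().lower()
    for rule in rules:
        if key in rule.titles or (code is not None and code in rule.codes):
            return rule
    return None  # section not declared relevant, so it is skipped


def term_allowed(tree_numbers, rule):
    """Mechanism 2: keep a term only if one of its branches is allowed."""
    return any(tn[:1] in rule.branches for tn in tree_numbers)


allergies = match_rule("Allergies")
drug_ok = term_allowed(["D02.065"], allergies)          # assumed drug branch
disease_ok = term_allowed(["C14.907.489"], allergies)   # disease branch
```

Under these example rules, a drug term found in the Allergies section is kept, a disease term in the same section is ignored, and a section such as Appointment is never scanned.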
For each relevant section of the EHR, CDAPubMed analyzes the content using an NLP package, i.e., OpenNLP [32], extracting sentences and words (or tokens) from the free text content. OpenNLP is an open source, machine learning-based toolkit that includes a variety of Java-based NLP tools for processing natural language text. In CDAPubMed, if a token, or a sequence of consecutive tokens, is a MeSH term or synonym belonging to the set of MeSH branches configured for the section, this term is considered relevant for publication retrieval purposes, unless it is preceded by a negative expression, e.g., no Pain. Keywords identified by this algorithm are highlighted within the CDAPubMed interface (described in the CDAPubMed functionality section) and may trigger the PubMed query generation.
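The matching step described above can be sketched in Python as follows. This is not the OpenNLP pipeline: a regular expression stands in for the tokenizer, the longest-match strategy and the max_len limit are assumptions, and the list of negative expressions is a hypothetical example, since only the expression no is mentioned here.

```python
import re

# Hypothetical negation cues; only "no" is mentioned in the text.
NEGATIONS = {"no", "not", "without", "denies"}


def tokenize(sentence):
    """Very rough stand-in for an NLP tokenizer."""
    return re.findall(r"[A-Za-z0-9]+", sentence.lower())


def find_keywords(sentence, vocabulary, max_len=4):
    """Return MeSH headings found in a sentence, skipping negated ones.

    vocabulary maps a lowercased term or synonym to its MeSH heading.
    """
    tokens = tokenize(sentence)
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest run of consecutive tokens first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                match = (vocabulary[candidate], n)
                break
        if match is None:
            i += 1
            continue
        heading, n = match
        negated = i > 0 and tokens[i - 1] in NEGATIONS
        if not negated:
            found.append(heading)
        i += n
    return found


vocabulary = {"pain": "Pain", "asthma": "Asthma",
              "high blood pressure": "Hypertension"}
keywords = find_keywords(
    "Patient reports asthma and high blood pressure, no pain.", vocabulary)
# keywords == ["Asthma", "Hypertension"]
```

In this example Pain is discarded because the token immediately before it is a negation cue, while the three-token synonym is resolved to its main heading.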
Query generation
To generate a query through the CDAPubMed interface, the user may select one of the MeSH terms or synonyms highlighted in the EHR. For each selection, the matching keyword is added to the query, restricting the results to those related to the EHR. The relationship degree between a citation and an EHR implemented in CDAPubMed is calculated as follows:
A citation is related to an EHR by the intersection of the MeSH terms indexing the publication and the union of the disease and the relevant MeSH terms identified within the EHR.
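This definition can be written as a short set computation. The Python sketch below takes the relationship degree to be the size of the intersection, which is an interpretation of the definition above; the function name and the sample term sets are illustrative.

```python
def relationship_degree(citation_mesh, disease, ehr_keywords):
    """Count the MeSH terms shared by a citation and an EHR.

    The EHR side is the union of the disease and the relevant MeSH
    terms identified in the record.
    """
    ehr_terms = set(ehr_keywords) | {disease}
    return len(set(citation_mesh) & ehr_terms)


degree = relationship_degree(
    citation_mesh={"Diabetes Mellitus, Type 1", "Asthma",
                   "Hypertension", "Adult"},
    disease="Diabetes Mellitus, Type 1",
    ehr_keywords={"Asthma", "Hypertension", "Pain"},
)
# degree == 3
```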
The first iteration of CDAPubMed, i.e., the first time a MeSH term is selected after loading an EHR, generates the following query: (term_1[mh]) AND (disease[mh]). The [mh] suffix specifies that the retrieved documents will be indexed by term_1 in this case. Additional n keywords from the EHR can be integrated into the query in successive iterations: (term_1[mh]) AND (term_2[mh]) AND ... AND (term_n[mh]) AND (disease[mh]). Note that keywords are linked with a conjunction connective (AND), instead of a disjunction connective (OR), to restrict the retrieved citations to the maximum relationship degree in each iteration. The nth query in CDAPubMed retrieves citations related by n + 1 keywords to the EHR, or above, instead of citations related only by the disease.
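The iterative construction can be sketched in Python as follows. The function name build_query is an assumption; the query syntax follows the form given above.

```python
def build_query(disease, terms):
    """AND-concatenate the selected MeSH terms with the disease."""
    clauses = [f"({term}[mh])" for term in terms]
    clauses.append(f"({disease}[mh])")
    return " AND ".join(clauses)


disease = "Diabetes Mellitus, Type 1"
first = build_query(disease, ["Asthma"])
# first == "(Asthma[mh]) AND (Diabetes Mellitus, Type 1[mh])"
second = build_query(disease, ["Asthma", "Hypertension"])
```

Each additional selection appends one more conjunct, so the result set of every iteration is a subset of the previous one.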
To provide further restrictions within the same query level, qualifiers associated with MeSH terms and the respective EHR section can be selected. A qualifier is highlighted for selection within the interface, and the MeSH term is displayed in a different color, if the qualifier is admissible for both the EHR section and the MeSH term. Allowable qualifiers for MeSH terms are taken from official MeSH information, whereas allowable qualifiers for sections are provided by the CDAPubMed configuration file. If a set of m qualifiers is selected for a set of n terms, the following query is generated: (term_1/qualifier_1) AND (term_1/qualifier_2) AND ... AND (term_n/qualifier_m) AND (disease). Citations for these queries are retrieved from PubMed or from a cache included within CDAPubMed. This cache stores a history of queries and the corresponding citations to improve system performance.
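The qualified query and the query cache can be sketched in Python as follows. The structure of the CDAPubMed cache is not described here, so the QueryCache class, its misses counter, and the stand-in fetch function are assumptions; the query form follows the one given above.

```python
from typing import Callable


def build_qualified_query(disease, selections):
    """Build a query from (term, qualifier) pairs plus the disease."""
    clauses = [f"({term}/{qualifier})" for term, qualifier in selections]
    clauses.append(f"({disease})")
    return " AND ".join(clauses)


class QueryCache:
    """Remember the citations returned for each query string."""

    def __init__(self, fetch: Callable[[str], list]):
        self._fetch = fetch   # called only when the query is not cached
        self._store = {}
        self.misses = 0

    def get(self, query):
        if query not in self._store:
            self.misses += 1
            self._store[query] = self._fetch(query)
        return self._store[query]


def fake_fetch(query):
    # Stand-in for a PubMed request; returns a placeholder result.
    return [f"citation for {query}"]


cache = QueryCache(fake_fetch)
query = build_qualified_query(
    "Diabetes",
    [("Hypertension", "epidemiology"), ("Hypertension", "drug therapy")])
first_result = cache.get(query)
second_result = cache.get(query)   # served from the cache
```

Repeating a query therefore costs no additional request, which also helps to stay within the query limit of the PubMed API.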
CDAPubMed interface and functionality
After a one-click installation of the CDAPubMed extension in Mozilla Firefox, a new window section at the bottom of the web browser can be activated, as shown in Figure 3. The rest of the browser window can be used for regular web browsing or PubMed result visualization. In the new area, the sequence of actions open to the user is: (i) load and validate an HL7-CDA file that will be displayed in the new window section, (ii) modify the default configuration of the MeSH identification process, (iii) enter a default disease from the C MeSH branch (Diseases), (iv) select one of the keywords within specific EHR sections to generate a PubMed query, and (v) restrict the resulting citations by selecting qualifiers for MeSH terms (if allowed). To iteratively retrieve more specific citations, the user may repeat action (iv) and select additional keywords from the EHR.
The screenshot in Figure 3 shows the result of a query generated by CDAPubMed including the following keywords: Diabetes Mellitus, Type 1 (disease), Asthma and Hypertension (keywords selected from the EHR). In this case, CDAPubMed retrieved 6 results related to the EHR by at least 3 relevant keywords, instead of the 53,137 results retrieved by the general Diabetes Mellitus, Type 1 query.
The CDAPubMed interface highlights keywords that are considered relevant by the identification algorithm, and displays the potential number of retrieved publications as a superscript next to each keyword. If there are qualifiers that can be applied to a MeSH term, this keyword is also highlighted (in a different color) and a menu listing the allowable qualifiers is displayed next to the respective term. When one or more qualifiers are selected for a MeSH term, the number of potential results is updated for this keyword. This number is refreshed for every keyword highlighted in the EHR every time a query is made, and indicates the number of citations that would be retrieved if the keyword were AND-concatenated to the query.
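The refresh of the superscript numbers can be sketched in Python as follows. The function name refresh_counts and the count_for callback are assumptions; in practice the callback would ask PubMed for a citation count, whereas here a stand-in function records the candidate queries.

```python
def refresh_counts(current_query, highlighted, count_for):
    """Compute, per keyword, the hits if it were AND-ed to the query.

    count_for is any callable that maps a query string to a number.
    """
    counts = {}
    for keyword in highlighted:
        candidate = f"({keyword}[mh]) AND {current_query}"
        counts[keyword] = count_for(candidate)
    return counts


seen = []


def fake_count(query):
    # Stand-in for a PubMed count request; numbers are placeholders.
    seen.append(query)
    return len(seen)


counts = refresh_counts("(Diabetes Mellitus, Type 1[mh])",
                        ["Asthma", "Hypertension"], fake_count)
```

One count request is issued per highlighted keyword, which is why limiting the identified keywords to relevant sections and branches reduces the load on the PubMed API.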
As mentioned in the previous section, the process of identifying and highlighting terms within the EHR can be optimized by setting some parameters of CDAPubMed. A graphical user interface (GUI) has been developed to modify the system configuration, as shown in Figure 4.
Users can use the CDAPubMed configuration tool to perform four main actions: (i) enter titles, synonyms or LOINC/SNOMED codes to identify new relevant sections for publication retrieval purposes, (ii) declare a set of MeSH tree branches to search within a certain section, (iii) add a set of allowable qualifiers to a section, and, finally, (iv) save or load a configuration file to export/import a personal system configuration. A selection box available next to each EHR section in the general-purpose CDAPubMed interface is used to enter the title of a relevant section in the configuration. If no branches are associated with the MeSH term declaration, nine of the sixteen MeSH branches, A-G (Anatomy-Biological Sciences), N (Health Care) and Z (Geographic Locations), are associated by default, since they contain the most popular keywords for PubMed citation indexing. Finally, all qualifiers are used if no qualifiers are explicitly associated with a section. Together, these actions cover all the parameters that influence CDAPubMed's keyword identification algorithm.
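The default behavior described above can be sketched in Python as follows. The function names are assumptions, and the qualifier list in the example is a small illustrative subset.

```python
# Default branches when a section declares none: A-G, N and Z.
DEFAULT_BRANCHES = frozenset("ABCDEFG") | {"N", "Z"}


def effective_branches(declared):
    """Use the declared branches, or fall back to the nine defaults."""
    return frozenset(declared) if declared else DEFAULT_BRANCHES


def effective_qualifiers(declared, all_qualifiers):
    """Use the declared qualifiers, or all of them if none are given."""
    return set(declared) if declared else set(all_qualifiers)


allergy_branches = effective_branches(["D"])
fallback_branches = effective_branches([])
qualifiers = effective_qualifiers(
    [], ["epidemiology", "drug therapy", "diagnosis"])
```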