BMC Medical Informatics and Decision Making BioMed Central

Background: Large biomedical data sets have become increasingly important resources for medical researchers. Modern biomedical data sets are annotated with standard terms to describe the data and to support data linking between databases. The largest curated listing of biomedical terms is the National Library of Medicine's Unified Medical Language System (UMLS). The UMLS contains more than 2 million biomedical terms collected from nearly 100 medical vocabularies. Many of the vocabularies contained in the UMLS carry restrictions on their use, making it impossible to share or distribute UMLS-annotated research data. However, a subset of the UMLS vocabularies, designated Category 0 by UMLS, can be used to annotate and share data sets without violating the UMLS License Agreement.

biomedical data sets. The software tools to extract the Category 0 vocabularies are freely available Perl scripts entered into the public domain and distributed with this article.

Background
Scientific progress requires the free distribution of research findings. The National Institutes of Health (NIH) has recently stated, in a public notice, the importance of sharing research: "The NIH expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers." [1]. The proposed data sharing policy has arrived at a time when scientists are challenged by confidentiality issues, intellectual property considerations and a variety of federal regulations that constrain the free distribution of research data. In reply to the NIH draft statement, the Association of American Medical Colleges (AAMC) responded, "AAMC believes that effective policies to promote data sharing will require creative, discipline-specific solutions to these complicated problems" [2].
Scientists need innovative methods to overcome the many barriers to data sharing. One of the most important efforts in bioinformatics is data annotation. Data annotation involves appending descriptive information to experimental data with the intention of creating an encapsulated data object that can be intelligently linked or integrated with other data. In genomic databases, annotation of a gene sequence might involve including the physical location in the genome from which the sequence was derived. This might involve using a standard way of describing a location inside a chromosome (e.g. 1p21.3). Annotation might also involve adding information about the translated proteins or the disease that might be associated with a mutation found in the sequence. These annotations are a critical part of the sequence data, because the annotations assist in the discovery of the biological relevance of the sequence. When experimental data are distributed, it is important to distribute the annotation along with the raw data. Unfortunately, many medical vocabularies used for annotation are encumbered under proprietary license agreements. This often means that scientists cannot freely share their annotated biomedical data sets.
The purpose of this article is to provide researchers with a free and legal strategy for sharing research data that has been annotated with a standardized biomedical terminology [3]. Standardized vocabularies are used to annotate text and data with medical terms identified by unique identifier codes that can be parsed by computers and linked to equivalent terms contained in other data sets. For instance, renal cell carcinomas are known by a variety of different terms, including: hypernephroma, clear cell carcinoma of kidney, renal cell adenocarcinoma, rcc (abbreviated form), Grawitz tumor (eponymous term), etc. Medical nomenclatures typically assign a unique number to "renal cell carcinoma" that is shared by all the synonymous terms. In the case of UMLS, this identifier is C0007134. Some medical nomenclatures include relationships linking a concept to more general (parent) terms and more specific (descendant) terms. A data set that annotates each occurrence of "renal cell carcinoma" or "rcc" or "renal cell adenocarcinoma" with a unique concept identifier ensures that different representations of the medical concept can be retrieved and integrated.
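As a minimal sketch of this idea, the Python snippet below maps several synonymous terms to one concept identifier and uses that identifier to link two data sets. The identifier C0007134 and the synonyms come from the text; the lookup table, the function name `annotate`, and the two sample data sets are hypothetical and stand in for a real nomenclature.

```python
# Hypothetical synonym table: every synonymous term maps to one concept identifier.
# C0007134 is the UMLS identifier cited in the text; the table itself is illustrative.
SYNONYM_TO_CONCEPT = {
    "renal cell carcinoma": "C0007134",
    "hypernephroma": "C0007134",
    "clear cell carcinoma of kidney": "C0007134",
    "renal cell adenocarcinoma": "C0007134",
    "rcc": "C0007134",
    "grawitz tumor": "C0007134",
}


def annotate(term):
    """Return the concept identifier for a term, or None if the term is unknown."""
    return SYNONYM_TO_CONCEPT.get(term.strip().lower())


# Two hypothetical data sets that use different wording for the same concept.
dataset_a = ["Hypernephroma", "fracture"]
dataset_b = ["RCC", "Grawitz tumor"]

codes_a = {annotate(t) for t in dataset_a} - {None}
codes_b = {annotate(t) for t in dataset_b} - {None}

# Records can be linked through the shared identifier despite different wording.
shared = codes_a & codes_b
print(shared)
```

Because the comparison is made on identifiers rather than on strings, "Hypernephroma" in one data set and "RCC" in the other resolve to the same concept.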

The UMLS Metathesaurus
The UMLS is the largest curated medical nomenclature in existence. It is composed of more than 90 different biomedical vocabularies and contains 2,146,899 medical terms mapping to 875,255 medical concepts. The UMLS files are available at no cost from the National Library of Medicine's Web site. Anyone wishing to obtain the UMLS metathesaurus must obtain and sign the UMLS License Agreement and register as a UMLS user. The UMLS metathesaurus, the UMLS License Agreement, and detailed instructional documents are available from the National Library of Medicine. Several terms of the UMLS License Agreement are relevant here:
1. The NLM hereby grants a nonexclusive, non-transferable right to LICENSEE to use the UMLS products and incorporate them in any computer applications or systems designed to improve access to biomedical information of any type subject to the restrictions in other provisions of this Agreement. The list of licensees authorized to use the UMLS products is public information.
3. LICENSEE is prohibited from distributing the UMLS products or subsets of these products, including individual vocabulary sources within the Metathesaurus, except as an integral part of computer applications developed by LICENSEE for a purpose other than redistribution of data contained in the UMLS products.
10. LICENSEE shall acknowledge NLM as its source of the UMLS data, citing the year of the UMLS data, in a suitable and customary manner but may not in any way indicate or imply that NLM or any of the organizations whose vocabulary data are included in the UMLS has endorsed LICENSEE or its products.
11. Some of the Material in the UMLS Metathesaurus is from copyrighted sources. If LICENSEE uses any data from the UMLS Metathesaurus: a) the LICENSEE is required to display in full, prior to providing user access to any Metathesaurus data, the following wording in order that its users be made aware of these copyright constraints: "Some material in the UMLS Metathesaurus is from copyrighted sources of the respective copyright claimants. Users of the UMLS Metathesaurus are solely responsible for compliance with any copyright restrictions and are referred to the copyright notices appearing in the original sources, all of which are hereby incorporated by reference." and to display a list of all of the vocabularies obtained from the UMLS Metathesaurus that are used in the LICENSEE's application. b) the LICENSEE is prohibited from altering data obtained from the UMLS Metathesaurus, but may include data from other sources in applications that also contain UMLS data. The LICENSEE may not imply in any way that data from other sources is part of the UMLS Metathesaurus or of any of its vocabulary sources. c) the LICENSEE is required to include in its applications identifiers from the UMLS Metathesaurus such that the original source vocabularies for any data obtained from the UMLS Metathesaurus can be determined by reference to a complete version of the UMLS Metathesaurus.
The majority of vocabularies included in the UMLS carry restrictions on their use. An example of these restrictions is transcribed here from the UMLS License Agreement.

Category 3 Restrictions:
LICENSEE's right to use material from the source vocabulary is restricted to internal use at the LICENSEE's site(s) for research, product development, and statistical analysis only. Internal use includes use by employees, faculty, and students of a single institution at multiple sites. Notwithstanding the foregoing, use by students is limited to doing research under the direct supervision of faculty. Internal research, product development, and statistical analysis use expressly excludes: use of material from these copyrighted sources in routine patient data creation; incorporation of material from these copyrighted sources in any publicly accessible computer-based information system or public electronic bulletin board including the Internet; publishing or translating or creating derivative works from material from these copyrighted sources; selling, leasing, licensing, or otherwise making available material from these copyrighted works to any unauthorized party; and copying for any purpose except for back up or archival purposes.
LICENSEE may be required to display special copyright notices before displaying data from the vocabulary source. Applicable notices are included in the list of UMLS Metathesaurus Vocabulary sources, that is part of this Agreement.
Category 3 restrictions prohibit using UMLS-annotated data sets for any of the following purposes: 1) distribution to colleagues, 2) posting on a publicly available site (such as an internet web site), 3) submission to shared data set repositories or as supplemental data in support of research articles, 4) submission to scientific journals.
The inclusion of Category 3 vocabularies in the UMLS may seem like an egregious oversight. However, the original purpose of the UMLS was to provide a way of mapping between the different vocabularies in the UMLS metathesaurus. Using UMLS, an institution that annotated one database with one proprietary nomenclature and another database with a different proprietary nomenclature could map equivalent terms in the two databases. In the 1980s, when the UMLS was first released, it was common for large institutions to have site licenses for several proprietary nomenclatures.
In the 1980s, it was not obvious that individual scientists in 2003 would have the technologic facility to create immense data sets annotated with concepts parsed from nomenclatures containing millions of terms. It was not obvious that scientists would seek to share and merge these data sets into immense collections of publicly available data.
The purpose of this article is to provide a simple Perl script specifically designed to extract the Category 0 UMLS vocabularies, and to describe the value of the Category 0 vocabularies as a data sharing tool.

Methods
The January 1, 2003 release of the UMLS was used to extract the Category 0 vocabularies. The Category 0 vocabularies contained in the UMLS are listed in Table 1.
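The extraction step can be pictured as a filter on the restriction category assigned to each source vocabulary. The Python sketch below is not the author's Perl script, and it does not reproduce the real UMLS file layout: it assumes a simplified, hypothetical pipe-delimited record of the form `CUI|SOURCE|CATEGORY|TERM`. The source names and the second concept identifier in the sample are invented for illustration.

```python
def extract_category0(lines):
    """Keep only records whose restriction category (third field) is 0.

    Assumes a hypothetical four-field record: CUI|SOURCE|CATEGORY|TERM.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) != 4:
            continue  # skip malformed records
        cui, source, category, term = fields
        if category == "0":
            kept.append((cui, source, term))
    return kept


# Illustrative records only; source names and C0000001 are placeholders.
sample = [
    "C0007134|OPEN_VOCAB|0|renal cell carcinoma",
    "C0007134|PROPRIETARY_VOCAB|3|hypernephroma",
    "C0000001|OPEN_VOCAB|0|example term",
]

category0 = extract_category0(sample)
for cui, source, term in category0:
    print(cui, source, term)
```

Only the two records marked with category 0 survive the filter; the Category 3 record is dropped even though it shares a concept identifier with a retained term.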

Software
All of the software scripts used are written in Perl, an open source, freely available cross-platform programming language with interpreters available for virtually every type of computer operating system. Perl can be downloaded from http://www.activestate.com or http://www.cpan.org.
Detailed information on Perl is available at http://www.cpan.org or http://www.perldoc.com. The Perl scripts distributed with this article are entered into the public domain by the author and included as a supplemental file with this article.
JHARCOLL, a corpus of medical text, was used to obtain an indication of the utility of the Category 0 vocabularies to capture text concepts. JHARCOLL is a public domain collection of 568,035 different phrases extracted from …
GOODLST.PL expands and normalizes the GOODNEW2.TXT file, producing a file that consists of terms in all lowercase letters, with the "s" truncated from words that end in "s", and with noun and adjectival forms of terms added where they are absent. For example:
Colonic Adenocarcinoma -> colonic adenocarcinoma
colonic adenocarcinomas -> colonic adenocarcinoma
adenocarcinoma of colon -> colonic adenocarcinoma
These transformations, applied to the entire vocabulary, may facilitate implementations of algorithms designed to match free text terms against terms found in the vocabulary.
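The three example transformations can be reproduced with a short Python sketch. This is an illustration of the described rules, not a port of GOODLST.PL: the noun-to-adjective table, the function names, and the restriction to three-word "X of Y" phrases are all assumptions made for brevity.

```python
# Hypothetical noun-to-adjective table; a real vocabulary would need a far larger one.
NOUN_TO_ADJECTIVE = {"colon": "colonic"}


def strip_plural(word):
    """Truncate a trailing "s" as described in the text.

    This rule is deliberately naive and would also shorten words such as "pancreas".
    """
    return word[:-1] if len(word) > 1 and word.endswith("s") else word


def normalize(term):
    """Lowercase a term, strip trailing "s", and rewrite "X of Y" as "<Y adjective> X"."""
    words = [strip_plural(w) for w in term.lower().split()]
    if len(words) == 3 and words[1] == "of" and words[2] in NOUN_TO_ADJECTIVE:
        words = [NOUN_TO_ADJECTIVE[words[2]], words[0]]
    return " ".join(words)


for raw in ["Colonic Adenocarcinoma", "colonic adenocarcinomas", "adenocarcinoma of colon"]:
    print(raw, "->", normalize(raw))
```

All three inputs collapse to the same string, which is what makes a later exact-match lookup against the vocabulary practical.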
Perl scripts using the Parse module (GOODCNT.PL and MRCONCNT.PL) examine each of the half-million phrases in JHARCOLL, looking for identical matches from the Category 0 vocabularies or the entire UMLS metathesaurus, respectively. These Perl scripts require the author's freely available Parse module, described in detail elsewhere [4] and distributed as supplemental files with this article. The number of phrase matches for both the Category 0 vocabularies and for the complete UMLS exceeded the number of different phrases in JHARCOLL. Therefore, on average, JHARCOLL phrases had more than one match against either nomenclature. It was further determined that 545,321 JHARCOLL phrases had at least one match against UMLS terms, while 468,785 JHARCOLL phrases matched against Category 0 terms. This indicates that when the two vocabularies are evaluated by their fitness to find at least one term for a phrase, the Category 0 vocabularies are about 86% as useful as the complete UMLS metathesaurus (468,785 / 545,321).
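The comparison can be sketched in Python as follows. The author's Parse module is not reproduced here; a plain set lookup stands in for it, and the toy phrases and vocabularies are hypothetical. Only the two counts used in the final ratio are taken from the text.

```python
def count_matched(phrases, vocabulary):
    """Count phrases that have at least one identical match in the vocabulary."""
    vocab = set(vocabulary)
    return sum(1 for phrase in phrases if phrase in vocab)


# Toy data (hypothetical) showing the counting step on a small scale.
phrases = ["renal cell carcinoma", "hypernephroma", "unlisted phrase"]
full_vocabulary = {"renal cell carcinoma", "hypernephroma"}
category0_vocabulary = {"renal cell carcinoma"}

full_hits = count_matched(phrases, full_vocabulary)
category0_hits = count_matched(phrases, category0_vocabulary)
print(full_hits, category0_hits)

# Relative coverage computed from the counts reported in the text.
UMLS_MATCHED = 545321
CATEGORY0_MATCHED = 468785
ratio = CATEGORY0_MATCHED / UMLS_MATCHED
print(f"Category 0 coverage relative to full UMLS: {ratio:.1%}")
```

Note that this measure counts a phrase once no matter how many terms it matches, so it reflects whether any annotation is possible rather than how many candidate terms exist.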

Discussion
Term 3 of the UMLS License Agreement [see Background] prohibits users "from distributing the UMLS products or subsets of these products." This is a reasonable condition, enforcing the National Library of Medicine's role as the sole curator and distributor of UMLS. However, Term 3 stipulates an exception when the subsets are "an integral part of computer applications developed by LICENSEE for a purpose other than redistribution of data contained in the UMLS products." When a researcher includes a UMLS term as a data set annotation, and distributes the data set to a colleague, her purpose is to disseminate her research. In this case, annotation terms are integrated into the data and do not appear as complete source vocabularies or as subsets of the UMLS that would be suitable as a medical vocabulary. Therefore, distributing data sets annotated with Category 0 terms would not violate the UMLS License Agreement.
For the most part, the Category 0 terms have been contributed directly by U.S. Federal Agencies or by organizations that receive U.S. Federal funds for the purpose of creating publicly available vocabularies. The two largest contributors to Category 0 terms in UMLS are the National Library of Medicine's MESH (Medical Subject Headings) and the NCBI's Taxonomy.
Both MESH and the NCBI Taxonomy vocabularies can be obtained individually from the National Library of Medicine website http://www.nlm.nih.gov. MESH is used by the National Library of Medicine to index all biomedical abstracts included in MEDLINE, and has been used to index medical terms found throughout the internet [5]. Moore et al. have shown that MESH is a useful vocabulary for capturing clinical concepts in surgical pathology reports [6]. MESH alone is a sufficient medical vocabulary for many indexing purposes. The National Center for Biotechnology Information (NCBI) Taxonomy contains over 130 thousand terms used to assign standard names for organisms, biological properties, and molecules. Combined, MESH and NCBI Taxonomy would serve to capture most of the concepts included in any biomedical text or data set.
Given that the Category 0 vocabularies are available individually, and at no cost, what is the advantage of using an aggregate nomenclature by combining the Category 0 vocabularies? When data sets are annotated with Category 0 terms, they can be freely shared, and the annotated data can be re-integrated with the full set of UMLS knowledge sources, including the Category 1, 2 and 3 vocabularies. This is possible because UMLS concept relationships (particularly ancestral and descendant concepts) are always available to UMLS license holders. Because UMLS-coded terms can be related to terms from any UMLS vocabulary, researchers benefit from using the subset of Category 0 UMLS-encoded vocabularies rather than directly employing natively encoded vocabularies (such as MESH or NCBI).
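A small Python sketch may make the re-integration step concrete. It assumes that a license holder has access to child-to-parent concept links; the links, the placeholder identifiers beginning with `HYPOTHETICAL_`, and the shared records shown here are invented for illustration and are not taken from the UMLS. Only C0007134 comes from the text.

```python
# Hypothetical child -> parent links, standing in for the concept relationships
# that a UMLS license holder could look up locally.
PARENT_OF = {
    "C0007134": "HYPOTHETICAL_KIDNEY_NEOPLASM",
    "HYPOTHETICAL_KIDNEY_NEOPLASM": "HYPOTHETICAL_NEOPLASM",
}


def ancestors(concept):
    """Follow child -> parent links and return every ancestor of a concept."""
    found = []
    seen = {concept}
    while concept in PARENT_OF:
        concept = PARENT_OF[concept]
        if concept in seen:
            break  # guard against cycles in the link table
        seen.add(concept)
        found.append(concept)
    return found


# A shared data set carrying only concept identifiers as annotations.
shared_records = [
    {"id": 1, "concept": "C0007134"},
    {"id": 2, "concept": "HYPOTHETICAL_UNRELATED"},
]


def records_under(ancestor, records):
    """Return ids of records annotated with the ancestor or any of its descendants."""
    return [
        r["id"]
        for r in records
        if r["concept"] == ancestor or ancestor in ancestors(r["concept"])
    ]


print(records_under("HYPOTHETICAL_NEOPLASM", shared_records))
```

The shared file itself carries no hierarchy; the recipient supplies it from a licensed copy of the UMLS, which is why the identifier alone is enough to support broader queries.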

Conclusions
The Category 0 vocabularies of UMLS constitute a large nomenclature that can be used by biomedical researchers to annotate biomedical data. These annotated data sets can be distributed for research purposes without violating the UMLS License Agreement. These vocabularies may be of particular importance for sharing heterogeneous data from diverse biomedical data sets. The software tools to extract the Category 0 vocabularies are freely available Perl scripts entered into the public domain and distributed with this article.