BMC Medical Informatics and Decision Making

Background: New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The scholarly curator adds new terms as they are encountered. Present-day scholars are severely challenged by the enormous volume of biomedical literature. Curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate. The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. Candidate new terms can be reviewed by a curator to determine if they should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and using "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature.


Background
Samuel Johnson defines a lexicographer as a "harmless drudge" [1]. The drudgery of the lexicographer's tasks is beyond dispute. In the domain of medical nomenclatures, however, the harmlessness of the lexicographer is far from certain. The misuse of medical terminology can lead to medical errors, as indicated by the U.S. Joint Commission on Accreditation of Healthcare Organizations' recent ban on certain common medical abbreviations [2]. This action was taken to reduce the occurrence of medication errors that result when non-standard abbreviations are misinterpreted. The U.S. Institute of Medicine has advocated standardized methods for collecting codified diagnostic data as a strategy for reducing medical errors [3].
Medical terminologies are replete with examples of minor term modification that can result in treatment error. One example is the "carcinoid tumor of appendix." Carcinoid tumors of the appendix are typically indolent. According to the U.S. National Cancer Institute, "Surgical resection is the standard curative modality. If the primary tumor is localized and resectable, 5-year survival rates are excellent" [4]. There is a variant of appendiceal carcinoid known as the goblet cell carcinoid of appendix. Someone uninitiated in tumor biology may infer that the goblet cell carcinoid is a morphologic variant of an indolent neoplasm undeserving of any special designation. Actually, the goblet cell carcinoid of appendix is a highly malignant tumor that has a completely different clinical course and a different recommended treatment than the carcinoid of appendix [5]. If a neoplasm nomenclature were to omit the term "goblet cell carcinoid of appendix", someone encountering the term in a pathology report may mistakenly code the subsumed term, "carcinoid of appendix." Alternatively, if the curator of a nomenclature is unaware of the distinction between the two tumors, he or she may mistakenly assign the same code to both terms. In either case, mistaking a carcinoid of the appendix for a goblet cell carcinoid of appendix could result in harm to the patient.
Because modern nomenclatures are used to annotate medical data so that clinical information can be merged with heterogeneous data sources (e.g. tissue bank records, research datasets, epidemiologic databases), the duties of lexicographers have broadened to include a range of informatics activities. For this reason, the modern curator is involved in codifying terms (providing a unique identifier to a term and all its synonyms), and mapping terms between different nomenclatures. In the past, nomenclatures were recorded on paper documents. Brevity was appreciated, and rare lesions may have been neglected. Modern nomenclatures are stored electronically. With no barriers to the size of nomenclatures, there is no reason to exclude any used terms [6].
The purpose of this manuscript is to describe a simple method for extracting candidate new terms from any large corpus of text. The method uses the previously published doublet algorithm to compare connected word doublets in a medical text against a list of word doublets found in a nomenclature [7]. Text phrases composed of sequences of word doublets found in an existing nomenclature are candidate new nomenclature terms. This general method can be used with any text and any existing nomenclature. This method permits curators to continually enhance their nomenclatures with new terms, an essential activity needed to ensure the proper coding and annotation of biomedical data.

Nomenclature
As a sample implementation, the developmental lineage classification and taxonomy of neoplasms, hereinafter called "the neoplasm taxonomy," was used. The neoplasm taxonomy has several properties that make it particularly suitable:
1. It is a free, open-access medical nomenclature.
2. It has been described in two prior open access publications [6,8].
3. New versions of the nomenclature are made available for public download by the Association for Pathology Informatics [9].
4. It is an easily parsed XML document, with every term appearing as a lower-case alphanumeric phrase.
5. It is intended to be a comprehensive listing of all items in a knowledge domain (i.e., names of neoplasms).
The XML version of the neoplasm taxonomy file is neocl.xml [see Additional file 1]. In this manuscript, the purpose of the taxonomy is to provide a listing of all names of neoplasms, with synonyms grouped under a common code number. The current version of the neoplasm taxonomy contains 149,192 unique names of neoplasms. In constructing the taxonomy, enormous effort was made to list every variant name for every known neoplasm of man. Variant names included different terms for the same concept and different ways of expressing an individual term (e.g. variations in word order).

Input file
The input file was created by a PubMed query on "am j surg pathol [jo]", at the U.S. government website [10]. The query gathered all abstracts from the PubMed database that were published in the journal "American Journal of Surgical Pathology." The American Journal of Surgical Pathology is a popular repository for articles written to describe newly encountered neoplasms or distinctive variants of known neoplasms. The PubMed search produced 4,289 articles. These were downloaded to an external file using PubMed XML (i.e. by selecting "XML" from PubMed's "Display" list). The downloaded file has a length of over 31 Megabytes. Each of the 4,289 entries in the downloaded file contains metadata-tagged fields that include the PubMed identifier, the title of the article, the names of the authors, the text of the abstract, and other citation-related information. Titles of medical articles are terse summary statements that often contain the names of disease entities discussed within the article. Although this file is not included with this manuscript, anyone with internet access can obtain a near-identical file by repeating the same PubMed query.

Algorithms
The doublet method was described in a recent publication [7]. Its utility is derived in part from the observation that most medical terms are multi-word terms. In the neoplasm taxonomy, all but about 250 terms are multi-word terms. Unlike single words, which often have several different meanings, multi-word medical terms, with very rare exceptions, have a single, specific meaning.
Any multiword term can be constructed by a concatenation of overlapping doublets.

For example:
Serous borderline ovarian tumor -> ("serous borderline," "borderline ovarian," "ovarian tumor")
The doublets composing the multiword terms from a nomenclature can be combined into a list. The list of nomenclature doublets can be used to determine whether a fragment of text is composed from doublets included in the list.
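The decomposition of a term into overlapping doublets can be illustrated with a short sketch. It is written in Python for illustration only (the scripts supplied with this manuscript are in Perl), and the function name `doublets` is an assumption introduced here, not a name taken from those scripts.

```python
def doublets(term):
    """Return the ordered list of overlapping word pairs in a term."""
    words = term.lower().split()
    # Pair each word with its successor: (w0, w1), (w1, w2), ...
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


print(doublets("Serous borderline ovarian tumor"))
# ['serous borderline', 'borderline ovarian', 'ovarian tumor']
```

A single-word term yields an empty list, which is consistent with restricting the method to multi-word terms.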

Algorithm 1
The following algorithm determines those terms present in a nomenclature that are composed of doublets found in other terms within the nomenclature.
1. Collect all the doublets that occur in the entire nomenclature (i.e., accumulate a list of the doublets from every term in the nomenclature).
2. Number the occurrences of these doublets (i.e., count the number of times each doublet occurs anywhere in the nomenclature).
3. Advance through the nomenclature, term by term, decomposing the term into its doublets.
4. For each of the doublets that compose a term, retrieve the total number of occurrences of the doublet in the nomenclature.
5. Each doublet occurring in a term that has an occurrence number of 1 is unique to that term and indicates that the term is not composed entirely of doublets found in other terms in the nomenclature.
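The steps of Algorithm 1 might be sketched as follows. This is a minimal Python illustration under stated assumptions, not the doubuniq.pl script itself: the function names and the four-term toy nomenclature are hypothetical, and the nomenclature is assumed to be already available as a list of lower-case term strings.

```python
from collections import Counter


def doublets(term):
    """Return the ordered list of overlapping word pairs in a term."""
    words = term.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def terms_with_unique_doublets(terms):
    """Return the terms that contain at least one doublet occurring only once."""
    # Steps 1-2: collect every doublet and count its occurrences
    # anywhere in the nomenclature.
    counts = Counter(d for term in terms for d in doublets(term))
    # Steps 3-5: a term holding any doublet with a count of 1 is not
    # composed entirely of doublets found in other terms.
    return [t for t in terms if any(counts[d] == 1 for d in doublets(t))]


# Hypothetical toy nomenclature, for illustration only.
toy_terms = [
    "serous ovarian tumor",
    "borderline ovarian tumor",
    "serous borderline tumor",
    "serous borderline ovarian tumor",
]
print(terms_with_unique_doublets(toy_terms))
# ['serous ovarian tumor', 'serous borderline tumor']
```

In the toy list, "serous ovarian" and "borderline tumor" each occur once, so the two terms containing them are reported; the other two terms are built entirely from doublets that also appear elsewhere.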

Algorithm 2
The following algorithm parses through text, extracting candidate term phrases:
1. Collect all the doublets that occur in the entire nomenclature (i.e., accumulate a list of the doublets from every term in the nomenclature).
2. Parse text (in this case individual abstract titles) into an ordered array of overlapping doublets (as per the example shown for the text string, "serous borderline ovarian tumor").
3. Compare each consecutive text doublet against the array of doublets from the nomenclature to determine whether the doublet exists somewhere in the nomenclature.
4. If the doublet from the text does not exist in the nomenclature, it can be deleted. If it exists in the nomenclature, it is concatenated with the following doublet if the following doublet exists in the nomenclature. Otherwise, it is deleted. This process continues, concatenating doublets that exist somewhere in the nomenclature. Extraneous leading words (the, in, of, with, and) and trailing words (the, and, with, from, a) are automatically deleted from the final concatenated sequence. Final concatenated sequences of two or more consecutive doublets that match doublets from the nomenclature are saved as candidate terms.
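Algorithm 2 might be sketched as follows. This is a Python illustration under stated assumptions, not the getdoub.pl script: all function and variable names are hypothetical, the text is assumed to be already normalized to lower-case words, and the minimum-length test is assumed to be applied after the leading and trailing words are trimmed (the description above does not fix that order). Phrases that already occur as whole terms in the nomenclature are rejected.

```python
LEADING = {"the", "in", "of", "with", "and"}    # leading words to strip
TRAILING = {"the", "and", "with", "from", "a"}  # trailing words to strip


def doublets(term):
    words = term.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def trim(words):
    """Strip extraneous leading and trailing words from a word list."""
    words = list(words)
    while words and words[0] in LEADING:
        words.pop(0)
    while words and words[-1] in TRAILING:
        words.pop()
    return words


def candidate_phrases(text, nomenclature_terms, min_doublets=2):
    known_terms = {t.lower() for t in nomenclature_terms}
    known_doublets = {d for t in known_terms for d in doublets(t)}  # step 1
    candidates, run = [], []

    def flush():
        words = trim(run)
        phrase = " ".join(words)
        # Keep runs of enough doublets that are not already whole terms.
        if len(words) - 1 >= min_doublets and phrase not in known_terms:
            candidates.append(phrase)

    words = text.lower().split()
    for a, b in zip(words, words[1:]):          # steps 2-3
        if f"{a} {b}" in known_doublets:        # step 4: extend the run
            if run:
                run.append(b)
            else:
                run.extend([a, b])
        elif run:                               # run broken: save and reset
            flush()
            run.clear()
    if run:
        flush()
    return candidates


# Hypothetical two-term nomenclature, for illustration only.
toy_nomenclature = ["synovial sarcoma of the knee", "adenocarcinoma of the prostate"]
print(candidate_phrases("a case of synovial sarcoma of the prostate with review",
                        toy_nomenclature))
# ['synovial sarcoma of the prostate']
```

Holding the nomenclature doublets in a set makes each lookup a constant-time operation, so a single pass over the text suffices.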

Programming Language
All code was written in Perl. Perl is an open source language that is bundled with the Unix and Linux operating systems. In the past decade, Perl has become very popular in the bioinformatics community. Perl interpreters are available at no cost and in versions suitable for virtually all operating systems. Perl can be obtained from the Comprehensive Perl Archive Network [11] or from ActiveState [12].
2. The Perl script doubuniq.pl implements algorithm 1 to locate terms in the nomenclature that contain any uniquely occurring doublets (i.e., terms that are not exclusively composed of doublets found in other terms from the same nomenclature) [see Additional file 2]. Only the classified terms of the nomenclature (a total of 126,756 terms) were used in this exercise to avoid formatting inconsistencies present in terms that are pending classification. The output of the Perl script is included as a supplemental file [see Additional file 3].
3. The Perl script getdoub.pl implements algorithm 2 to extract phrases from the text corpora composed of sequences of doublets included in the original nomenclature and exceeding 4 words in length [see Additional file 4]. The output file is tumoram.out [see Additional file 5].

Analysis of doublet occurrences within terms included in the nomenclature
The current version of the neoplasm nomenclature contains 149,192 unique terms. Of these, 126,756 are classified terms composed of at least two words (i.e., are doublets or greater in length). Of these 126,756 terms, all but 6,308 (4.97%) are composed entirely of doublets extracted from other terms in the reference nomenclature. This means that 95% of the classified terms from the nomenclature are formed entirely of doublets found in other terms from the same nomenclature. The Perl script and the output text are included as supplemental files [see Additional file 2] [see Additional file 3].

Implementation of the candidate term extractor
The Perl script getdoub.pl was run on a CPU with a processor speed of 2.79 GHz. The parsed corpus was a PubMed file in XML format that exceeded 31 Megabytes in length and contained 4,289 PubMed records. Each record contained an abstract title enclosed by a tag, <ArticleTitle>. The parser extracted the abstract titles demarcated by the XML tag, converted all characters to lower case, deleted all characters other than alphanumerics, spaces and hyphens, deleted the possessive suffix characterized by "'s" and converted plural words to singular words if the word ended in the suffix "oma" or consisted of the word "tumors" or "tumours". The total number of words contained in the 4,289 abstract titles was 50,547, and the total number of characters contained in the 4,289 abstract titles was 395,396. The total execution time to extract a list of candidate terms from the corpus was 2 seconds.
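The title extraction and normalization steps described above might look like the following Python sketch. It is not the parser in getdoub.pl, and it has not been checked against that script's output. The regular-expression extraction, the function names, and the reading of the plural rule (a trailing "s" is dropped from words ending in "omas" and from "tumors" and "tumours") are all assumptions made for illustration.

```python
import re

# Assumption: titles are pulled out with a simple pattern match
# rather than a full XML parser.
TITLE_RE = re.compile(r"<ArticleTitle>(.*?)</ArticleTitle>", re.DOTALL)


def singularize(word):
    """Drop the plural 's' for the few word forms named in the text (assumed reading)."""
    if word in ("tumors", "tumours") or word.endswith("omas"):
        return word[:-1]
    return word


def normalize_title(title):
    text = title.lower()
    text = re.sub(r"'s\b", "", text)         # delete the possessive suffix
    text = re.sub(r"[^a-z0-9 -]", "", text)  # keep alphanumerics, spaces, hyphens
    return " ".join(singularize(w) for w in text.split())


def extract_titles(xml_text):
    """Return the normalized text of every <ArticleTitle> element."""
    return [normalize_title(t) for t in TITLE_RE.findall(xml_text)]


sample = "<PubmedArticle><ArticleTitle>Hodgkin's Lymphomas and Tumors.</ArticleTitle></PubmedArticle>"
print(extract_titles(sample))
# ['hodgkin lymphoma and tumor']
```

The possessive suffix is removed before the general character filter, because the filter would otherwise delete the apostrophe and leave a stray "s" attached to the word.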

Output file of candidate phrases
The output file produced by getdoub.pl is tumoram.out. It contains 313 candidate phrases, each phrase followed by a copy of the original text from which the candidate term was extracted. The output is included as a supplemental file [see Additional file 5]. Each extracted candidate term consisted entirely of concatenated doublets found within the reference nomenclature.
In some cases, terms consisting of concatenated doublets were obviously not names of neoplasms, and a human expert in the knowledge domain would be needed to distinguish a real name of a neoplasm from a series of words that have no particular significance.
The curator read through all 313 phrases, looking for phrases that suggest a new concept. In this case, the curator looked for names of neoplasms. This step can only be performed by a domain expert. Because the exact candidate phrases are found in PubMed abstract titles, curators can obtain additional information on any candidate phrase by inserting the phrase into the PubMed query box [10]. In some cases, in-depth review of the literature containing the candidate phrase may suggest a slightly modified preferred term as a new entry to the nomenclature, but for the purposes of this manuscript, candidate phrases were not expanded or embellished with alternate words.
As an example, consider line 100 from the output file [see Additional file 5]. The candidate term is: "synovial sarcoma of the prostate with t." This term is composed of the doublets, "synovial sarcoma", "sarcoma of", "of the", "the prostate", "prostate with", "with t". The full abstract title from which the term was extracted was: "SYNOVIAL SARCOMA OF THE PROSTATE WITH T(X;18)(P11.2;Q11.2)." The chromosomal marker designated in the abstract title is not picked up by the extracting Perl script because (X;18) is not part of a word doublet. For the purposes of this manuscript, the curator is only permitted to remove words from the beginning and end of the candidate phrase (not embellish the phrase). The curator shortens the term to "synovial sarcoma of the prostate". If that term is not included in the existing nomenclature, it can be added to the final list of new terms. In practice, a curator might wish to add two new terms to the nomenclature, "synovial sarcoma of the prostate" and "synovial sarcoma of the prostate with T(X;18)(P11.2;Q11.2)."

Final list of new terms
The output file, tumoram.out, contained 313 candidate terms [see Additional file 5]. Of these 313 terms, a human curator found 285 terms that represented names of neoplasms. These 285 terms were automatically evaluated to eliminate terms that were duplicated in the reference nomenclature or on duplicate lines from the output file. Examination of the terms indicates that the majority of terms are either very rare and obscure tumors, unusual variants of common tumors, or just unusual terms for common and uncommon tumors. Pathologists reading this list may agree that the tumors from this list include clinically distinct and significant subclasses of tumors and would be welcome additions to a truly comprehensive nomenclature of neoplastic entities.
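The automatic elimination of duplicates could be done along the following lines. This Python sketch is an assumption about how such a filter might work, not the procedure actually used; the function name and the sample terms are hypothetical.

```python
def new_unique_terms(accepted_terms, nomenclature_terms):
    """Drop terms already in the nomenclature and repeated terms, keeping order."""
    known = {t.lower().strip() for t in nomenclature_terms}
    seen = set()
    result = []
    for term in accepted_terms:
        key = term.lower().strip()
        if key in known or key in seen:
            continue  # already in the nomenclature, or a duplicate output line
        seen.add(key)
        result.append(key)
    return result


# Hypothetical inputs, for illustration only.
accepted = [
    "synovial sarcoma of the prostate",
    "adenocarcinoma of the prostate",
    "synovial sarcoma of the prostate",
]
print(new_unique_terms(accepted, ["adenocarcinoma of the prostate"]))
# ['synovial sarcoma of the prostate']
```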

Discussion
Adding terms to an existing vocabulary is best done by reading the current literature in the knowledge domain of the nomenclature, and transcribing new terms when they are encountered. It is difficult to imagine any automatic process that can replace this scholarly pursuit. Terms encountered while reading a scientific text appear in a structured context that often defines the term, clarifies the relationship of the new term to related terms, and sometimes provides sufficient information to classify the new term within a structured taxonomic hierarchy.
Sadly, it is impossible for curators to read all of the biomedical literature pertaining to a nomenclature's domain. The purpose of the doublet phrase extractor is to parse through any corpus of text, extracting phrases that may contain new nomenclature terms. The phrases are chosen to meet two criteria: 1) they are composed of word doublets that are contained in an existing nomenclature, and 2) the matched phrases do not already occur in the nomenclature. The doublet extractor quickly produces a concise list of candidate phrases that can be conveniently reviewed by a curator. In the case described herein, a 31+ MByte corpus was parsed in 2 seconds, to produce 313 candidate phrases. From the candidate phrases, the curator found 222 phrases that could be added to the reference nomenclature (30 minutes of human effort).
Because the final term list was extracted from PubMed abstract titles, the contained terms have two important properties:
1. The terms have whatever legitimacy publication confers. Nobody can say that the extracted terms are confabulated or never actually used.
2. The terms can be searched and found through a PubMed search. The articles that contain the terms are likely to describe or define the terms.

Limitations of the output
New terms that are not composed of doublets found in pre-existing nomenclature terms will be missed by the doublet extraction method. This would certainly apply to most new eponymous terms, new terms extracted from a foreign language, new terms using variant orthography, or newly invented words (e.g. theragnostics, glycomics, nanoscope). In the realm of neoplasms, it would be difficult to create a new term that is not partially composed of older terms.
If the extraction software used a vocabulary that already contained every term in the knowledge domain, then there would be very few candidate terms extracted (because parsed terms already included in the nomenclature are immediately rejected). In addition, any candidate terms extracted from the corpus would always be rejected by the human curator (0% precision). This is because the nomenclature would be complete, and any candidate terms must be false terms.
If the extraction software used a text corpus that consisted exclusively of valid, new nomenclature terms that should be added to an incomplete vocabulary, then all of the extracted candidate terms would be valid new terms (100% precision). If the extraction used a text that contained no valid terms from the knowledge domain, then every extracted candidate term would be an invalid term (0% precision). This is a problem: the effectiveness of the software is dependent on the completeness of the source vocabulary and on the presence of valid new terms existing in the text corpus. In addition, the suitability of candidate terms for inclusion in the vocabulary will ultimately depend on subjective decisions reached by the human curator.
For this manuscript, the author chose to include the extractor's output as a supplemental file [see Additional file 5], so that interested readers can form their own judgments regarding the value of the software. In addition, the availability of a public set of files (software implementation, nomenclature and text corpus) will facilitate future efforts to compare improved versions of the extraction algorithm with this first version.

Characteristics of the approach
Recently, Krauthammer and Nenadic reviewed methods for identifying nomenclature terms within text [14]. In all cases, the reviewed methods were evaluated based on published measurements of precision and recall. There was no common procedure for determining how the text would be manually coded. Manually coded text is used to measure the precision and recall, and if the rules for manual coding differ from study to study, so would the published performance measurements [15]. The speed of the different methods was not compared, and there was no common text corpus or nomenclature for the different studies. Most of the discussed methods were published in the Proceedings of various data mining workshops, and the source code, source vocabularies, and output files were not made available to the public. Because the source code for most of these methods is not accessible, it is difficult to determine the degree to which the methodology is vocabulary-dependent and whether the methods have generalized utility. For these reasons, it is impossible to state whether the doublet method is superior to previously published methods [16][17][18][19].
It is the perception of the author that the field of medical informatics suffers from a lack of simple and generalized methods that can be easily tested in various use cases or compared with alternate or new methods using shared corpora. In their review of methods for automatically identifying medical terms contained in text, Krauthammer and Nenadic commented that few terminologically tagged biomedical corpora are available [14], and this makes it impossible to compare different informatics methods. This manuscript describes a method for extracting candidate terms from a corpus of text. It uses the doublet method previously described [7]. The doublet method has similarities to data retrieval techniques using bi-grams and word-pairs [20]. The method is fast, convenient, and suits the purposes of the author, who curates the neoplasm taxonomy. The method requires substantial human involvement to winnow down the list of candidate terms into a final collection of well-formed new nomenclature terms. The methodology has the following features:
1. The method will work with any nomenclature and with any plain-text file.
2. The implementation is provided with the details of operation, allowing others to repeat the study.
3. The source code is platform-independent and is made freely available to the public. This permits other laboratories to modify the program and publish an improved version of this documented prototype.
4. The output files are provided, permitting others to review the results from this study or to compare the results with the results obtained with other methods.

Conclusion
Almost all terms in medical nomenclatures can be formed by combining word doublets derived from other terms in the same nomenclature. By collecting all the word doublets found in a nomenclature, it is possible to quickly parse through enormous tracts of text, extracting phrases composed of sequences of these doublets. From these extracted candidate terms, a curator can select new terms. The doublet method is a novel approach to term extraction that curators can use to augment medical nomenclatures.

Availability and requirements
The provided supplementary scripts are short programs written in Perl. Perl is a freely available open source programming language. Perl interpreters for virtually any operating system are available from several sites on the web [11,12]. These sites have links to rich sources of online information on the Perl language. The scripts require an external file of PubMed XML text. The scripts can be easily modified to accept any plain-text external file. The scripts require the neoplasm taxonomy [see Additional file 1], but can be modified to accept any parsable nomenclature that contains listed plain-text terms.
The Perl scripts provided with this manuscript will work without modification using the current version of the neoplasm taxonomy [see Additional file 1]. New versions of the neoplasm taxonomy are posted at the Association for Pathology Informatics website [9].