Developing a hybrid dictionary-based bio-entity recognition technique
© Song et al.; licensee BioMed Central Ltd. 2015
Published: 20 May 2015
Bio-entity extraction is a pivotal component for information extraction from biomedical literature. The dictionary-based bio-entity extraction is the first generation of Named Entity Recognition (NER) techniques.
This paper presents a hybrid dictionary-based bio-entity extraction technique. The approach expands the bio-entity dictionary by combining different data sources and improves the recall rate through the shortest path edit distance algorithm. In addition, the proposed technique adopts text mining techniques in the merging stage of similar entities such as Part of Speech (POS) expansion, stemming, and the exploitation of the contextual cues to further improve the performance.
The experimental results show that the proposed technique achieves the best or at least equivalent F-measure among the compared techniques, which use dictionaries built from GENIA, MeSH, UMLS, and combinations of these three resources.
The results imply that the performance of dictionary-based extraction techniques is largely influenced by the information resources used to build the dictionary. In addition, the edit distance algorithm shows steady performance in precision with three different dictionaries, whereas the context-only technique achieves high-end performance in recall with three different dictionaries.
The extraction of biomedical entities from scientific literature is a challenging task encountered in many applications such as systems biology, molecular biology, and bioinformatics. One of the early, continuously adopted approaches is dictionary-based entity extraction, which extracts from a given text all strings that match entities defined in a dictionary. Based on the lemma for a given term, it recognizes a term by searching for the most similar (or identical) one in the dictionary. This makes dictionary-based approaches particularly useful as the first step in practical information extraction from biomedical documents. In addition, dictionary-based approaches are very useful when little or no context is available to detect named entities, as in a query.
However, dictionary-based approaches have two major performance bottlenecks. First, false positives, inherent in the use of short names, significantly degrade the overall accuracy. Excluding short names from the dictionary may resolve this issue, but it is not the ultimate solution because it prevents short protein or gene names from being recognized. Second, spelling variation makes dictionary-based approaches less usable. For example, the gene name "DC2-dopamine receptor" has spelling variants such as "dopamine DC2 receptor." Exact matching techniques, which are mainly employed by dictionary-based approaches, treat these terms as distinct ones.
We alleviate this problem by using an approximate string matching method in which surface-level similarities between terms are considered. In order to mitigate the low recall problem associated with dictionary-based approaches, we combine entity extraction with a soft-matching scheme that is capable of handling variant entity names. To this end, we propose a new entity extraction technique composed of several different techniques: 1) the approximate string distance algorithm to retrieve candidate entries, 2) the shortest-path edit distance algorithm (SPED), and 3) text mining techniques such as Part-Of-Speech (POS) tagging and utilization of syntactical properties of terms. The experimental results show that in most cases, the performance of the proposed technique is superior to other approaches.
The rest of the paper is organized as follows: Section 2 describes the studies related to the present paper. Section 3 explains the proposed technique in depth. Section 4 reports on the data collection and the experimental results. Section 5 concludes the paper with a discussion of future research.
Dictionary-based entity extraction is still a widely used method for biomedical literature annotation and indexing. The major advantages of the dictionary-based technique over the pattern-based approach are twofold: it allows for recognizing names and identifying unique concept identities. The exact match approach is the simplest one; however, it suffers from low recall due to the ingrained variants (morphological, syntactic, and semantic) characteristic of biological terms (Chiang and Yu, 2005). In addition, it is nearly impossible for a dictionary to collect them all. One approach to entity type extraction combines dictionary-based and supervised learning techniques: in dictionary Hidden Markov Models (HMMs), a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases.
Stemming from the development of the GENIA corpus, many studies have explored extraction tasks including "protein," "DNA," "RNA," "cell line," and "cell type" (e.g., [10, 11]). In addition, some studies have targeted "protein" recognition only. Other tasks include "drug" and "chemical" (Narayanaswamy et al.) names. Another research area related to entity mapping is semantic category assignment. Most of the work on semantic category assignment is done in the context of named entity tagging, where terms in the text are assigned categories from a list of predefined categories. Features for semantic category assignment include both words within a phrase and contextual features derived from neighbouring words. In the general English domain, Frantzi et al. used term similarity measures based on phrase-inner and contextual information (C/NC-values), where the similarity measure for phrase-inner clues was used to distinguish headwords from modifiers. For language-independent named entity recognition (first name, last name, and location), Cucerzan and Yarowsky proposed a trie-based approach to combine name affix information and contextual information, where affix information is informative in detecting names, while names can be ambiguous among the name classes. In the biomedical domain, words within a phrase tend to be more effective since most biomedical terms are descriptive noun phrases. Many systems depend on a set of manually collected headwords or suffixes for semantic category assignment [16–18].
Besides hand-crafted methods, machine learning methods have also been explored for extracting bio-entities. Nenadic et al. used a method similar to the C/NC-values to identify similar terms. Hatzivassiloglou and colleagues used a method similar to our two-step corpus-based learning for the disambiguation of protein, gene, or RNA class labels for biological terms in free text. Nobata et al. created a system that assigns four semantic categories (i.e., protein, DNA, RNA, and source) based on supervised machine learning systems trained on 100 MEDLINE abstracts. Similarly, Lee et al. developed a two-phase name recognition system, where separate detection and classification modules were developed using support vector machines. They used different and specialized sets of features for each task, and obtained better results than those of the one-phase model over the GENIA corpus annotated with 23 categories. NER techniques have also been applied to extract chemical components from text. ChemSpot is an NER algorithm that extracts mentions of chemicals such as trivial names, drugs, abbreviations, and molecular formulas, whereas CheNER focuses on the identification of International Union of Pure and Applied Chemistry (IUPAC) chemical compounds. Drug-drug interaction is another application area of NER. DDIExtraction 2013 is an extraction task that concerns the recognition of drugs and the extraction of drug-drug interactions that appear in biomedical literature.
Algorithm of the proposed technique.
Given a dictionary
Input: short passage
1: Apply the approximate string matching technique
2: Generate candidate matched entries
For each generated entry list
3: Apply the shortest path edit distance (SPED) technique
4: If there is a perfect match between the input and the matched entry, exit the loop
6: Merge the candidate list by the context-enabled text mining techniques
7: Select the best merged entry
8: Return the matched entity type
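The control flow of the algorithm above can be sketched in Python. This is only a rough illustration under several assumptions: the toy `DICTIONARY`, the function names, and the cutoff value are hypothetical; `difflib` from the standard library stands in for both the approximate string matching technique and SPED; and the merging stage is reduced to picking the closest candidate.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical toy dictionary mapping an entry to its entity type.
DICTIONARY = {
    "Escherichia coli Proteins": "protein",
    "EGR-1": "protein",
    "T lymphocyte": "cell_type",
}


def candidate_entries(passage, dictionary, n=5, cutoff=0.3):
    """Steps 1-2: approximate matching yields a list of candidate entries."""
    lowered = {entry.lower(): entry for entry in dictionary}
    hits = get_close_matches(passage.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


def distance(a, b):
    """Step 3 stand-in for SPED (assumption): 1 - similarity ratio, in [0, 1]."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def recognize(passage, dictionary):
    """Return the entity type of the best matching entry, or None."""
    scored = []
    for entry in candidate_entries(passage, dictionary):
        d = distance(passage, entry)
        if d == 0.0:
            # Step 4: a perfect match ends the loop immediately.
            return dictionary[entry]
        scored.append((d, entry))
    if not scored:
        return None
    # Steps 6-7, simplified: no merging, just keep the closest candidate.
    best_entry = min(scored)[1]
    # Step 8: return the matched entity type.
    return dictionary[best_entry]


print(recognize("egr-1", DICTIONARY))
print(recognize("E coli", DICTIONARY))
```

In a fuller version, `distance` would be replaced by an actual SPED implementation and the final selection by the context-enabled merging described later in the paper.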
Descriptors provided in MeSH are grouped into 16 categories. For example, categories A, B, C, and D are related to anatomic terms, organisms, diseases, and drugs and chemicals, respectively. Each category consists of an array of subcategories, and in each subcategory, descriptors are structured in the form of an upside-down tree. Although MeSH categories do not officially represent an authoritative subject classification scheme, they help guide indexers in assigning subject headings to documents and researchers in searching for literature. In this paper, the MeSH tree enables us to build a dictionary where each MeSH term in the sub-trees is mapped to the top label term. To expand the idea of the dictionary construction, we make use of GENIA as well as UMLS to examine the influence of the source on mapping accuracy.
GENIA is a gene corpus consisting of 1,999 MEDLINE records to help develop and evaluate information extraction and text mining systems for the medical domain. The corpus is annotated with various levels of linguistic and semantic information related to genes. UMLS stands for Unified Medical Language System developed by the National Library of Medicine. UMLS consists of three knowledge sources: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon. The Metathesaurus is a vocabulary database that contains information about biomedical related concepts and the relationships among them. The semantic network categorizes all concepts represented in the UMLS Metathesaurus and provides a set of useful relationships between these concepts. The SPECIALIST Lexicon provides the lexical information needed for the SPECIALIST NLP tool.
Approximate string matching technique
The exact match technique is the simplest way to utilize a dictionary to spot candidate terms. Several fast exact match algorithms, such as the Boyer-Moore algorithm, have been proposed. However, spelling variations make exact matching less attractive. For example, the protein name "EGR-1" has the following variations: Egr-1, Egr 1, egr-1, egr 1, EGR-1, and EGR 1.
Unlike exact matching, the approximate matching technique is based on the weighted edit distance between strings and dictionary entries. In other words, it is a fuzzy dictionary matching strategy. The data structure for the underlying dictionary is a trie in order to support efficient search for matches. The approximate string matching technique is implemented on the basis of an existing algorithm that consists of two phases: 1) finding approximate substring matches inside a given string and 2) finding dictionary strings that match the pattern approximately. The closeness of a match is measured by the set of edit operations needed to convert the string into an exact match. For example, the term "E coli" is compared against the dictionary by the approximate string matching technique, and a matched entry, "Escherichia coli Proteins," is found in the dictionary once the threshold is set.
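To make the idea of fuzzy dictionary matching over a trie concrete, the following Python sketch walks a trie while maintaining one row of a weighted edit distance table per trie node, pruning branches that already exceed the threshold. It is an illustration only and not the algorithm used in the paper: the class and function names, the uniform substitution and gap costs, and the sample entries are assumptions, and the sketch covers only whole-string matching against dictionary entries (not the substring-spotting phase).

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.entry = None  # original dictionary entry that ends at this node


def build_trie(entries):
    """Insert lower-cased entries character by character."""
    root = TrieNode()
    for entry in entries:
        node = root
        for ch in entry.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.entry = entry
    return root


def fuzzy_search(root, query, max_cost, sub_cost=1.0, gap_cost=1.0):
    """Return (cost, entry) pairs whose weighted edit distance <= max_cost."""
    query = query.lower()
    first_row = [j * gap_cost for j in range(len(query) + 1)]
    results = []

    def walk(node, ch, prev_row):
        # Extend the edit distance table by one row for trie character ch.
        row = [prev_row[0] + gap_cost]
        for j in range(1, len(query) + 1):
            substitute = prev_row[j - 1] + (0.0 if query[j - 1] == ch else sub_cost)
            row.append(min(row[j - 1] + gap_cost, prev_row[j] + gap_cost, substitute))
        if node.entry is not None and row[-1] <= max_cost:
            results.append((row[-1], node.entry))
        # Prune: descend only if some cell can still stay within the threshold.
        if min(row) <= max_cost:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results)


trie = build_trie(["EGR-1", "EGF", "Escherichia coli Proteins"])
print(fuzzy_search(trie, "Egr 1", max_cost=1.0))
```

Because entries sharing a prefix share table rows, the trie avoids recomputing the distance for each dictionary entry separately, which is the efficiency argument made above.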
Shortest Path Edit Distance (SPED)
SPED, an extension of the algorithm by Rudniy et al., calculates the edit distance between two strings at the character level.
Where SPED is the final value of the string distance, |Prefix| is the length of the common prefix, and SPED' is the normalized weight of the shortest path.
Lattice of neighbourhood of string
The SPED algorithm has two tasks: first, identifying the shortest path in a directed, weighted, acyclic graph; second, computing the edit distance between strings. We construct a lattice of nodes, which represents substring interactions and where ; the lengths of the strings s and t are represented by n and m, respectively. Let the Neighborhood of String (NS) be a set, C, of consecutive substrings each having length k, where k = 1,..., n, for elements, and the length of -th element is . Furthermore, a numeric value in the range, 1,...,, is assigned to each neighborhood.
The NS interaction between two strings defines a lattice element. An NS edit operation results in a weight value, which is assigned to a lattice node; this value corresponds to a pair of NSs and is the transformation cost of converting a substring of the first string into a substring of the second string. Five different methods of weight assignment are considered, all of which are used and tested in the experiments.
Lattice-based graph composition
In this paper, the aforementioned idea is applied by converting each lattice cell to a graph vertex. A source vertex is added at the top left and connected to the vertex (1; 1) of the graph by a diagonal edge. A gap cost, which is selected during a learning phase, is assigned to the non-diagonal edges. As discussed earlier, a weight value is assigned to each diagonal edge; this weight value is the result of an NS edit operation, which is stored in the lattice cell. The source vertex is a placeholder that the algorithm uses as its starting point. In the SPED algorithm, the calculation of a string distance value between two strings becomes the task of computing the shortest path from the source vertex to the destination vertex. The destination vertex corresponds to the pair of last NSs of strings S and T. The graph is directed, weighted, and acyclic.
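The reduction of string distance to a shortest path in a directed acyclic lattice graph can be illustrated with a short Python sketch. It makes strong simplifying assumptions that are not taken from the paper: neighborhoods are single characters, a diagonal edge costs 0 for equal characters and `mismatch_cost` otherwise (the paper tests five weighting methods), horizontal and vertical edges carry a fixed `gap_cost` (the paper learns it), and the result is the raw path weight without the normalization or prefix adjustment of SPED.

```python
def lattice_shortest_path(s, t, gap_cost=1.0, mismatch_cost=1.0):
    """Weight of the cheapest source-to-destination path in the lattice graph.

    Vertex (i, j) means that i characters of s and j characters of t have
    been consumed; (0, 0) plays the role of the source vertex and
    (len(s), len(t)) that of the destination vertex.
    """
    n, m = len(s), len(t)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0

    # Row-major order is a topological order here, because every edge
    # leads to a vertex with a larger i, a larger j, or both.
    for i in range(n + 1):
        for j in range(m + 1):
            d = dist[i][j]
            if d == inf:
                continue
            if i < n:  # vertical edge: skip a character of s
                dist[i + 1][j] = min(dist[i + 1][j], d + gap_cost)
            if j < m:  # horizontal edge: skip a character of t
                dist[i][j + 1] = min(dist[i][j + 1], d + gap_cost)
            if i < n and j < m:  # diagonal edge: align s[i] with t[j]
                weight = 0.0 if s[i] == t[j] else mismatch_cost
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], d + weight)
    return dist[n][m]


print(lattice_shortest_path("EGR-1", "EGR 1"))
```

Since the graph is acyclic, a single pass in topological order is enough, and no general-purpose shortest path algorithm such as Dijkstra's is required.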
To handle the issue of similar concepts, we combine them into a representative one by the following rule: The shorter term is merged into the longer term when 1) the starting position of both terms is identical, 2) they have the same top category - contextual cues, 3) they are either noun term or phrases - POS tagging, and 4) they share similar lexical properties.
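The merging rule can be sketched as follows. The `Candidate` fields, the tag set, the `naive_stem` helper, and in particular the reading of "similar lexical properties" as "the stemmed tokens of the shorter term are a prefix of those of the longer term" are assumptions made for illustration; the paper does not specify these details.

```python
from dataclasses import dataclass

# Assumed tags for noun terms and noun phrases.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "NP"}


@dataclass
class Candidate:
    text: str          # matched term
    start: int         # starting position in the passage
    top_category: str  # top-level category from the dictionary
    pos: str           # POS tag of the term or phrase


def naive_stem(token):
    """Very rough plural stripping, a stand-in for a real stemmer."""
    token = token.lower()
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def should_merge(shorter, longer):
    """Apply the four conditions of the merging rule."""
    short_stems = [naive_stem(tok) for tok in shorter.text.split()]
    long_stems = [naive_stem(tok) for tok in longer.text.split()]
    return (
        shorter.start == longer.start                         # 1) same start
        and shorter.top_category == longer.top_category       # 2) same top category
        and shorter.pos in NOUN_TAGS and longer.pos in NOUN_TAGS  # 3) nouns
        and long_stems[: len(short_stems)] == short_stems     # 4) lexical overlap
    )


def merge_candidates(candidates):
    """Keep longer terms and drop shorter ones that merge into them."""
    kept = []
    for cand in sorted(candidates, key=lambda c: len(c.text), reverse=True):
        if not any(should_merge(cand, longer) for longer in kept):
            kept.append(cand)
    return kept


found = [
    Candidate("dopamine receptor", 10, "protein", "NP"),
    Candidate("dopamine receptors D2", 10, "protein", "NP"),
    Candidate("dopamine", 10, "chemical", "NN"),
]
print([c.text for c in merge_candidates(found)])
```

In this example the shorter protein mention is absorbed by the longer one, while "dopamine" survives because its top category differs.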
To measure the performance of the proposed technique in a comprehensive manner, we used three different data sources: 1) GENIA, 2) the MeSH Tree, and 3) UMLS. The GENIA ontology is a taxonomy of 47 biologically relevant nominal categories. The GENIA corpus consists of 96,582 annotations. Among them, 89,682 are for surface-level terms and 1,583 are for higher-level terms. As described earlier, we used the MeSH Tree Structure, which organizes 25,588 MeSH concepts under the 16 top categories. UMLS is the most well-received ontology in the biomedical domain. It consists of 2,918,970 concepts.
Basic Statistics of the Test Data.
As evaluation measures, we used precision, recall, and F1. We also adopted 10-fold cross validation and reported the average value of the 10 trials. Precision is defined as the percentage of true positives over the total number of positives predicted by the system (Precision=TP/(TP+FP), where TP denotes the number of true positives and FP denotes the number of false positives). Recall is defined as the percentage of true positives over the total number of positives in the evaluation entries (Recall=TP/(TP+FN), where FN is the number of false negatives). The F1 score combines precision and recall and is defined as their harmonic mean, that is, the inverse of the arithmetic mean of the reciprocal values of precision and recall.
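As a small worked example of these definitions, the measures can be computed as below. The function names and the counts are illustrative only and do not come from the experiments.

```python
def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts: 8 true positives, 2 false positives, 8 false negatives.
p = precision(8, 2)
r = recall(8, 8)
print(p, r, round(f1_score(p, r), 4))
```

The expression 2PR/(P+R) is algebraically the same as 1 / ((1/P + 1/R) / 2), the inverse of the arithmetic mean of the reciprocals.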
Experimental results of three different combinations of the proposed technique.
The second evaluation examines whether and how the subcomponents of the proposed technique (context only, SPED only, and the combination of the two) affect the performance. In particular, in terms of precision, the performance of entity extraction based on context-enabled text mining (described in the Merging Strategy section) gets significantly worse, whereas the performance of entity extraction based on SPED does not change much. In the case of the proposed technique (which combines context and SPED), the performance drop is moderate. In terms of recall, the more data sources are combined, the better the observed performance. This is an expected outcome in that the dictionary size correlates with the performance of dictionary-based entity extraction. One interesting observation is that the performance of the proposed technique on test set A does not increase when the dictionary is based on the combination of three sources compared to one source (GENIA). We are currently undertaking a close investigation of possible causes for this outcome.
As shown in Figure 6, SPED outperforms the proposed technique on test dataset A and has equivalent performance on test datasets B and C. Although the proposed technique performs weakly with the combination of the three sources, the overall experimental results show that the proposed technique is superior to the other two approaches in almost all cases.
In addition, we compare the proposed technique with other techniques reported in the literature. We choose the proposed technique with the dictionary constructed from the combination of the three data sources. At the JNLPBA conference, the best performance was achieved by Zhou and Su's technique.
Performance comparison between the proposed technique and Zhou and Su's Technique (P, R, and F denote precision, recall, and F-measure respectively).
A series of experiments shows that the performance of dictionary-based extraction techniques is largely influenced by the information resources used to build the dictionary. In addition, the edit distance algorithm shows steady performance in precision with the three different dictionaries, whereas the context-only technique achieves high-end performance in recall with those dictionaries.
This paper proposed a hybrid dictionary-based entity extraction technique. The proposed technique consists of 1) an approximate string matching technique, 2) a shortest path edit distance technique, and 3) context-enabled text mining techniques.
The novel feature of our method lies in the two-level string matching technique, where SPED is applied to candidate sets of matched entries from a dictionary. We conducted a comprehensive evaluation of the proposed technique on the JNLPBA 2004 test data. We examined the impact of the dictionary on the performance by combining three different data sources: GENIA, MeSH, and UMLS. The experimental results show that, by F-measure, the proposed technique outperforms the approaches that use text mining techniques only or SPED only in most cases. In addition, the experimental results show that the proposed technique performs better than the state-of-the-art technique that achieved the best performance at JNLPBA 2004.
As a follow-up study, we plan to improve the text mining technique, since the context-only option performs the worst. In several instances, we observe that it degrades the performance. Another research direction is to exploit various data sources, such as Gene Ontology (GO) and PharmGKB, to study how an entity-specific dictionary could affect the performance of entity extraction.
This research was supported by 1) the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2014R1A2A2A01004454), 2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the "IT Consilience Creative Program" (NIPA-2013-H0203-13-1001) supervised by the NIPA (National IT Industry Promotion Agency), and 3) the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Ministry of Science, ICT & Future Planning (Grant No. 2013M3A9C4078138). The funding for publication of this article came from the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2014R1A2A2A01004454).
This article has been published as part of BMC Medical Informatics and Decision Making Volume 15 Supplement 1, 2015: Proceedings of the ACM Eighth International Workshop on Data and Text Mining in Biomedical Informatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedinformdecismak/supplements/15/S1.
- Kou Z, Cohen W, Murphy R: High-recall protein entity recognition using a dictionary. Bioinformatics. 2005, 21 (Suppl 1): i266-i273. 10.1093/bioinformatics/bti1006.
- Chiang JH, Yu HC: Literature extraction of protein functions using sentence pattern mining. IEEE Transactions on Knowledge and Data Engineering. 2005, 17 (8): 1088-1098.
- Narayanaswamy M, et al: A biological named entity recognizer. Pac Symp Biocomput. 2003, 427-438.
- Nenadic G, Spasic I, Ananiadou S: Automatic Discovery of Term Similarities Using Pattern Mining. Intl J of Terminology. 2004, 10 (1): 55-80. 10.1075/term.10.1.04nen.
- Boyer RS: A Fast String Searching Algorithm. Communications of the Association for Computing Machinery. 1977, 20 (10): 762-772. 10.1145/359842.359859.
- Ono T, et al: Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics. 2001, 17: 155-161. 10.1093/bioinformatics/17.2.155.
- Rudniy A, Geller J, Song M: Histogram Difference String Distance for Enhancing Ontology Integration in Bioinformatics. International Conference on Bioinformatics and Computational Biology. 2012, Las Vegas, NV.
- Kim JD, Ohta T, Tsuruoka Y, Tateisi Y, Collier N: Introduction to the bio-entity recognition task at JNLPBA. Proceedings of the International Workshop on Natural Language Processing in Biomedicine and its Application. 2004.
- Ohta T, Tateisi Y, Kim J, Mima H, Tsujii J: The GENIA Corpus: An Annotated Research Abstract Corpus in Molecular Biology Domain. Proc Human Language Technology Conference. 2002.
- Settles B: Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. Proc Conference on Computational Linguistics, Joint Workshop on Natural Language Processing in Biomedicine and its Applications. 2004.
- Shen D, Zhang J, Zhou G, Su J, Tan CL: Effective Adaptation of a Hidden Markov Model-based Named Entity Recognizer for Biomedical Domain. Proc Conference of Association for Computational Linguistics Natural Language Processing in Biomedicine. 2003.
- Tsuruoka Y, Tsujii J: Boosting Precision and Recall of Dictionary-Based Protein Name Recognition. Proc Conference of Association for Computational Linguistics Natural Language Processing in Biomedicine. 2003.
- Rindfleisch TC, Tanabe L, Weinstein JN: EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature. Proc Pacific Symposium on Biocomputing. 2003.
- Frantzi K, Ananiadou S, et al: Automatic classification of technical terms using the NC-value method for term recognition. International Conference on Computational Lexicography (COMPLEX '99). 1999, Budapest, Hungary.
- Cucerzan S, Yarowsky D: Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC). 1999, University of Maryland, MD.
- Fukuda K, Tamura A, et al: Toward information extraction: identifying protein names from biological papers. Pac Symp Biocomput. 1998, 707-18.
- Yu H, Agichtein E: Extracting synonymous gene and protein terms from biological literature. Bioinformatics. 2003, 19 (Suppl 1): i340-9. 10.1093/bioinformatics/btg1047.
- Chang JT, Schutze H, et al: GAPSCORE: finding gene and protein names one word at a time. Bioinformatics. 2004, 20 (2): 216-25. 10.1093/bioinformatics/btg393.
- Hatzivassiloglou V, Duboue PA, et al: Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics. 2001, 17 (Suppl 1): S97-106. 10.1093/bioinformatics/17.suppl_1.S97.
- Nobata C, Collier N, et al: Automatic term identification and classification in biology texts. Proc of the 5th NLPRS. 1999, 369-374.
- Lee K, Hwang Y, et al: Two-phase biomedical NE recognition based on SVMs. Proceeding of ACL workshop on NLP in biomedicine. 2003.
- Zhou G, Su J: Exploring Deep Knowledge Resources in Biomedical Name Recognition. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications. 2004.
- Rocktäschel T, Weidlich M, Leser U: ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics. 2012, 28 (12): 1633-1640. 10.1093/bioinformatics/bts183.
- Usié A, Alves R, Solsona F, Vázquez M, Valencia A: CheNER: chemical named entity recognizer. Bioinformatics. 2013, btt639v2-btt639.
- Segura Bedmar I, Martínez P, Herrero Zazo M: SemEval 2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). Proceedings of Semeval. 2013.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.