Automatic medical encoding with SNOMED categories
© Ruch et al; licensee BioMed Central Ltd. 2008
Published: 27 October 2008
In this paper, we describe the design and preliminary evaluation of a new type of tool to speed up the encoding of episodes of care using the SNOMED CT terminology.
The proposed system can be used either as a search tool to browse the terminology or as a categorization tool to support automatic annotation of textual contents with SNOMED concepts. The general strategy is similar for both tools and is based on the fusion of two complementary retrieval strategies with thesaural resources. The first classification module uses a traditional vector-space retrieval engine which has been fine-tuned for the task, while the second classifier is based on regular variations of the term list. For evaluating the system, we use a sample of MEDLINE. SNOMED CT categories have been restricted to Medical Subject Headings (MeSH) using the SNOMED-MeSH mapping provided by the UMLS (version 2006).
Consistent with previous investigations of biomedical terminologies, our results show that the performance of the hybrid system is significantly better than that of each single module. For top-returned concepts, a precision at high ranks (P0) of more than 80% is observed. In addition, a manual, qualitative evaluation of a dozen MEDLINE abstracts suggests that SNOMED CT could represent an improvement over existing medical terminologies such as MeSH.
Although the precision of the SNOMED categorizer seems sufficient to help professional encoders, it is concluded that clinical benchmarks as well as usability studies are needed to assess the impact of our SNOMED encoding method in real settings.
The system is available for research purposes on: http://eagl.unige.ch/SNOCat.
From a functional perspective, we can compare our tool with the well-known CLUE Browser. First, we observe that hierarchical visualization is not available in our tool, while it is an important functionality of the CLUE system, which can therefore be seen as complementary to ours. Conversely, the CLUE interface, as a strict browsing system, cannot accept a full document as input. In contrast, the string-matching power of our system, which uses a retrieval-inspired conflation and normalization strategy called stemming, clearly outperforms the CLUE browser with respect to string approximation. The system thus handles most plural forms, morphological inflections and derivations, as in expressions, expresses, expressed, expressive.... But the main advance of our categorizer clearly lies in its ranking power, whose ranking strategies provide a model of relevance optimized through large-scale user evaluation campaigns (e.g. ).
The remainder of this paper is organized as follows: the next section presents the data and metrics used in our experiments. Then, we present the methods used to perform the categorization task. Further, we propose a preliminary evaluation of the categorizer based on MEDLINE records, together with a qualitative evaluation, based on a few examples, which tries to exhibit SNOMED-specific features. Finally, we conclude on our experiments and suggest some future work to deliver an integrated and user-friendly system.
Data and metrics
Because MEDLINE is indexed with Medical Subject Headings (MeSH) rather than with SNOMED codes, we need to translate the MeSH terms used to index MEDLINE records into SNOMED terms. This translation relies on the MeSH-SNOMED mapping table provided by the Unified Medical Language System (UMLS). However, the mapping is not complete: several MeSH terms cannot be mapped appropriately to SNOMED codes and vice versa. In particular, MeSH terms that are not medically specific usually have no equivalent in SNOMED: technical categories such as Storage and Data, as well as biological and pharmacological entities such as chaperonin – which is mapped to protein – often have no appropriate counterpart in SNOMED. In all our experiments, we assume that this information loss is minor from a biomedical perspective and that the UMLS mapping between MeSH and SNOMED is sufficient to provide a basic evaluation of our SNOMED categorization tool. To assess the SNOMED categorizer, we apply the system to the Cystic Fibrosis (CF) collection [4, 5], a collection of 1239 MEDLINE citations. From this collection, 239 records were used for tuning our system and 1000 were used to evaluate it. In the collection, MeSH items listed in the MeSH fields are replaced by SNOMED codes. We require SNOMED codes to be unique, so that when two or more MeSH terms map to the same SNOMED code, only one category is kept. For each citation, the content of the abstract field is used as input to the categorizer. The average number of concepts per abstract in the collection is 12.3 and the average number of major terms is 2.8; only terms marked as major (with a star) are considered in our experiments. Following , and as is usual with retrieval systems, the core evaluation measure is mean average precision.
The top precision (interpolated precision at recall = 0), which is of major importance for a fully automatic system, is also reported.
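These two measures can be made concrete with a small sketch (a minimal illustration; the function names and implementation are ours, not the paper's evaluation code):

```python
def average_precision(ranked, relevant):
    """Average precision for one abstract: mean over the expected (relevant)
    SNOMED codes of the precision at the rank where each one is returned."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, code in enumerate(ranked, start=1):
        if code in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def top_precision(ranked, relevant):
    """Interpolated precision at recall = 0: the best precision observed
    at any rank containing at least one expected code."""
    relevant = set(relevant)
    best, hits = 0.0, 0
    for rank, code in enumerate(ranked, start=1):
        if code in relevant:
            hits += 1
            best = max(best, hits / rank)
    return best
```

Mean average precision is then the mean of `average_precision` over the 1000 evaluation abstracts.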
Two main modules constitute the skeleton of our system: the regular expression component and the vector space component. Each of these basic classifiers uses known approaches to document retrieval. The former uses tokens as indexing units and can take advantage of the thesaurus, while the latter uses stems (i.e. strings such as expression and expressed are replaced by express). The first tool is based on a regular expression pattern matcher. Although such an approach is less used in modern information retrieval systems , it is expected to perform well when applied to relatively short documents such as SNOMED terms. It should be noted that some SNOMED terms are particularly lengthy, at least compared to MeSH terms: while most MeSH terms are shorter than six words, several SNOMED terms contain more than a dozen words. The second classifier is based on a vector-space engine . This second tool is expected to provide high recall, in contrast with the regular expression-based tool, which should favor precision.
For a short introduction to automatic text categorization in MEDLINE, the reader is referred to the NLM's indexing initiative ; for a detailed presentation of our vector space engine and a comparison with state-of-the-art systems, including NLM's tools, see  (in this joint evaluation of four retrieval systems, our engine showed competitive performance). For a complete overview and evaluation of our categorization system applied to Medical Subject Headings and to the Gene Ontology, see .
To better associate SNOMED terms with textual entities, some pre-processing normalization steps are necessary. This includes removing meta-abbreviations, which are common in most terminological systems, such as NOS (not otherwise specified), NEC (not elsewhere classified), NOC (not otherwise classifiable), or NFQ (not further qualified). We also need to handle and expand more than fifty SNOMED-specific abbreviations, often inherited from Read codes, such as ACOF, ADVA, AR, CFIO, CFSO, FB, FH, FHM, HFQ, LOC, MVNTA, or MVTA.
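Such a normalization pass can be sketched as follows. The two expansions shown (FB, FH) are illustrative assumptions: the paper lists the abbreviations but not their full expansion table.

```python
import re

# Meta-abbreviations common to most terminological systems (listed above).
META_ABBREVS = re.compile(r"\b(NOS|NEC|NOC|NFQ)\b")

# Illustrative subset of the 50+ SNOMED/Read-code abbreviation expansions;
# these two entries are assumptions, not taken from the paper.
EXPANSIONS = {
    "FB": "foreign body",
    "FH": "family history",
}

def normalize_term(term: str) -> str:
    """Strip meta-abbreviations, expand known abbreviations,
    and collapse the resulting whitespace."""
    term = META_ABBREVS.sub("", term)
    tokens = [EXPANSIONS.get(t, t) for t in term.split()]
    return " ".join(" ".join(tokens).split())
```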
Vector space system
The vector space (VS) module is based on a general information retrieval engine ; the engine, easyIR, is available on the first author's homepage and uses a tf.idf (term frequency – inverse document frequency) weighting schema. In this study, it uses stems (Porter-like, with minor modifications) as indexing terms, together with an English stop-word list. While stemming can be an important parameter in a text classification task, the impact of which is sometimes debated , we did not notice any significant difference between using tokens and using stems. However, we noticed that a significant set of semantically related stems were not conflated to a unique normalized string by the stemming procedure, although they could have been: the morpheme immun, for instance, is found in 48 different stems. This suggests that more powerful conflation methods, based on morphemes , could enhance the recall of the current method, especially in multilingual contexts . Altogether, we counted 72 402 unique stems in the SNOMED vocabulary.
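The ranking scheme of such a module can be sketched as follows. This is a toy illustration, not easyIR: the suffix-stripper stands in for the actual Porter-like stemmer, and a generic logarithmic tf.idf with cosine normalization stands in for the exact weighting triplets evaluated in Table 1.

```python
import math
from collections import Counter

def stem(token: str) -> str:
    # Crude stand-in for the Porter-like stemmer used by easyIR.
    for suffix in ("ation", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

class VectorSpace:
    def __init__(self, concepts):
        # concepts: {snomed_code: term string}; each term is one "document".
        self.docs = {code: Counter(stem(t) for t in text.lower().split())
                     for code, text in concepts.items()}
        n = len(self.docs)
        df = Counter(t for d in self.docs.values() for t in d)
        self.idf = {t: math.log(n / df[t]) for t in df}

    def _vec(self, tf):
        # Logarithmic tf.idf weights plus the vector's cosine norm.
        v = {t: (1 + math.log(f)) * self.idf.get(t, 0.0) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return v, norm

    def rank(self, abstract):
        # Rank all SNOMED codes by cosine similarity with the abstract.
        q, qn = self._vec(Counter(stem(t) for t in abstract.lower().split()))
        scores = {}
        for code, d in self.docs.items():
            dv, dn = self._vec(d)
            scores[code] = sum(w * dv.get(t, 0.0) for t, w in q.items()) / (qn * dn)
        return sorted(scores, key=scores.get, reverse=True)
```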
Table 1. Results for the RegEx and vector space (tf.idf) classifiers with different weighting schemas. For the VS engine, the tf.idf parameters are provided: the first triplet indicates the weighting applied to the "document collection", i.e. the concepts, while the second applies to the "query collection", i.e. the abstracts.
System or parameters
Mean average precision
Regular expressions and synonyms
The regular expression (RegEx) pattern matcher is applied to the SNOMED concepts (376 212) augmented with their synonyms (787 091 terms in total). In this module, text normalization is mainly performed by removing punctuation (e.g. hyphens, parentheses...). The manually crafted transition network of the pattern matcher is very simple: it allows one insertion or one deletion within a SNOMED term and ranks the candidate terms according to these basic edit operations, following a completion principle: the more tokens of a term are recognized, the more relevant the term. The system splits the abstract into 6-token-long phrases and slides this window through the abstract. The same type of operation is allowed at the token level, so that the system is able to handle minor string variations, for instance between diarrhea and diarrhoea. Interestingly, several morphological variations are directly provided by the SNOMED descriptions.
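The matching and completion scoring described above can be sketched as follows. This is a simplified approximation, not the actual transition network: it models the token-level tolerance as an edit distance of one and the completion principle as the share of a term's tokens recognized in a window.

```python
def edit1(a: str, b: str) -> bool:
    """True if two tokens are equal up to one character edit
    (covers variations such as diarrhea / diarrhoea)."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):                      # one substitution
        return sum(x != y for x, y in zip(a, b)) <= 1
    if len(a) > len(b):
        a, b = b, a                           # now len(b) == len(a) + 1
    return any(a == b[:i] + b[i + 1:] for i in range(len(b)))

def match_terms(abstract, terms, window=6):
    """Slide a 6-token window over the abstract and score each SNOMED term
    by the share of its tokens recognized (completion principle)."""
    tokens = abstract.lower().split()
    best = {}
    for start in range(max(1, len(tokens) - window + 1)):
        span = tokens[start:start + window]
        for term in terms:
            term_tokens = term.lower().split()
            hits = sum(any(edit1(t, w) for w in span) for t in term_tokens)
            score = hits / len(term_tokens)
            if score > best.get(term, 0.0):
                best[term] = score
    return best
```

Candidate terms are then ranked by decreasing score, so fully recognized terms outrank partially completed ones.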
Table 1 shows that the single RegEx system (mean average precision = 0.4, top precision = 0.64) performs better than the various settings tested for the vector space module with tf.idf (term frequency – inverse document frequency) and length normalization (cosine...) factors: the thesaurus-powered pattern matcher thus provides better results than the basic VS engine for SNOMED mapping. We use the de facto SMART standard notation to express these parameters; see  for a detailed presentation. For each triplet provided in Table 1, the first letter refers to the term frequency, the second to the inverse document frequency, and the third to the normalization factor.
Table 2. Results of the system when combining the vector space and the regular expression modules.
Weighting function (concepts.abstracts)
Mean average precision
Hybrids: tf.idf (VS) + RegEx
The top precision (82.3%) is in the range of what has been reported elsewhere [17, 18], although the search space of our tool (800 000 terms and 1000 documents) is much larger than in those experiments, which work with a few hundred categories and use sentences rather than abstracts for categorization. Such a precision means that the top-returned category is one of the expected categories in 8 cases out of 10. The measured mean average precision of almost 50% (0.45) means that, on average, half of the expected categories are proposed by the system.
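The paper does not give the exact fusion formula for the hybrid system; a common late-fusion sketch consistent with the description would be a normalized linear combination of the two module scores. The weight `alpha` is an assumption for illustration, not a parameter reported in the paper.

```python
def fuse(regex_scores, vs_scores, alpha=0.5):
    """Hypothetical late fusion of the RegEx and vector space modules:
    max-normalize each module's scores, then combine them linearly."""
    def norm(scores):
        m = max(scores.values(), default=1.0) or 1.0
        return {code: s / m for code, s in scores.items()}
    r, v = norm(regex_scores), norm(vs_scores)
    fused = {code: alpha * r.get(code, 0.0) + (1 - alpha) * v.get(code, 0.0)
             for code in set(r) | set(v)}
    return sorted(fused, key=fused.get, reverse=True)
```

A category returned by both modules is thus boosted above categories found by only one of them, which matches the observed improvement of the hybrid over each single module.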
Qualitative evaluation and discussion
To conduct the qualitative evaluation, we looked at a sample of twelve abstracts. A single judge manually reviewed the top three categories proposed by the SNOMED categorizer to determine whether any non-MeSH categories could have been relevant for indexing the abstract. For eight abstracts, such a relevant category was found in the SNOMED ranking. A typical example is given in Figures 1 and 2: the SNOMED terminology contains the concept glucuronic acid, which could have been chosen to index the content of the article based on its abstract, but the concept does not exist in MeSH. We also observe that the current setting of the system, which favors exact-match concepts (via the regular expression module) and content-bearing features (via the document frequency factor), seems able to discard some very lengthy SNOMED concepts such as nontraffic accident involving collision of motor-driven snow vehicle, not on public highway, driver of motor vehicle injured. This qualitative observation suggests that the conceptual coverage of SNOMED can be larger than that of MeSH and that automatic indexing could be improved with respect to recall by using at least some SNOMED codes. This observation must be balanced by our preliminary remark that several MeSH categories cannot be appropriately mapped into SNOMED CT.
More generally, we believe that the size and complexity of advanced ontological systems such as SNOMED demand the development of specific computer tools to make them usable by real users, including professional encoders. Indeed, given the low rate of coding agreement reported for large terminologies, e.g. , which at best represent only a fraction of SNOMED CT in size and complexity, it is expected that large-scale ontology-driven coding tasks will require computer tools to assist users in maintaining and operating them. Unfortunately, we observe that research in the field seems to concentrate most of its efforts on clinical and/or formal issues, whereas access, search and navigation capabilities tend to receive fairly limited attention.
Conclusion and future work
We have reported on the development and preliminary evaluation of a new type of categorization and browsing tool for SNOMED encoding. The system combines a pattern matcher, based on regular expressions over terms, and a vector space retrieval engine that uses stems as indexing terms, a traditional tf.idf weighting schema, and cosine normalization. For top-returned concepts, a precision of more than 80% is observed, which seems sufficient to help encode textual contents with SNOMED categories. A manual, qualitative evaluation of a dozen MEDLINE abstracts suggests that SNOMED CT could represent an improvement over existing broad medical terminologies such as MeSH. Clearly, further studies will be needed using clinical cases directly annotated with SNOMED categories, as described in  and . Usability studies  are also needed to assess the relevance of the text-to-SNOMED associations provided by the tool from a coder's perspective. Furthermore, from encoding and billing perspectives, SNOMED codes should be mapped to the International Classification of Diseases and/or to Diagnosis Related Groups  using an unambiguous model [23, 24] to evaluate the appropriateness of SNOMED CT encoding for monitoring health systems.
The study has been partially sponsored by the Swiss National Science Foundation (SNF Grant no 3252B0-105755 – EAGL: an Engine for question Answering in Genomics Literature). The online version of the tool has been made available thanks to the support of the EU DebugIT project (http://www.debugit.eu/); it uses dynamically generated links to the SNOMED Browser provided by the Virginia-Maryland Regional College of Veterinary Medicine.
This article has been published as part of BMC Medical Informatics and Decision Making Volume 8 Supplement 1, 2008: Selected contributions to the First European Conference on SNOMED CT. The full contents of the supplement are available online at http://www.biomedcentral.com/1472-6947/8?issue=S1.
- Ehrler F, Jimeno A, Geissbühler A, Ruch P: Data-poor Categorization and Passage Retrieval for Gene Ontology Annotation in Swiss-Prot. BMC Bioinformatics. 2005, 6 (Suppl 1): S23. 10.1186/1471-2105-6-S1-S23.
- Gobeill J, Tbahriti I, Ehrler F, Mottaz A, Veuthey A, Ruch P: Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction. BMC Bioinformatics. 2008, 9.
- Aronson A, Demner-Fushman D, Humphrey S, Lin J, Liu H, Ruch P, Ruiz M, Smith L, Tanabe L, Wilbur J: Fusion of Knowledge-intensive and Statistical Approaches for Retrieving and Annotating Textual Genomics Documents. TREC 2005. 2006.
- Shaw W, Wood J, Wood R, Tibbo H: The Cystic Fibrosis Database: Content and Research Opportunities. LSIR. 1991, 13: 347-366.
- Marti Hearst's pages. [http://www.sims.berkeley.edu/~hearst/irbook/]
- Larkey L, Croft W: Combining classifiers in text categorization. SIGIR. 1996, ACM Press, New York, US, 289-297.
- Manber U, Wu S: GLIMPSE: A Tool to Search Through Entire File Systems. Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, CA, USA. 1994, 23-32.
- Ruch P: Using contextual spelling correction to improve retrieval effectiveness in degraded text collections. COLING 2002. 2002.
- Aronson A, Bodenreider O, Chang H, Humphrey S, Mork J, Nelson S, Rindflesch T, Wilbur W: The Indexing Initiative. A Report to the Board of Scientific Counselors of the Lister Hill National Center for Biomedical Communications. Tech. rep., NLM. 1999.
- Ruch P, Baud R, Geissbühler A: Learning-Free Text Categorization. LNAI 2780. 2003, 199-208.
- Ruch P: Automatic Assignment of Biomedical Categories: Toward a Generic Approach. Bioinformatics. 2006, 6.
- Hull D: Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science. 1996, 47: 70-84. 10.1002/(SICI)1097-4571(199601)47:1<70::AID-ASI7>3.0.CO;2-#.
- Baud R, Nystrom M, Borin L, Evans R, Schulz S, Zweigenbaum P: Interchanging Lexical Information for a Multilingual Dictionary. AMIA Symposium Proceedings. 2005.
- Ruch P: Query Translation by Text Categorization. COLING 2004. 2004.
- Singhal A, Buckley C, Mitra M: Pivoted document length normalization. ACM-SIGIR. 1996, 21-29.
- Ruch P, Ehrler F, Abdou S, Savoy J: Report on the TREC2005 Experiment: Genomics Track. TREC 2005. 2006.
- Friedman C, Shagina L, Lussier Y, Hripcsak G: Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004, 11 (5): 392-402. 10.1197/jamia.M1552.
- Lussier Y, Shagina L, Friedman C: Automating SNOMED coding using medical language understanding: a feasibility study. J Am Med Inform Assoc (Symposium Suppl). 2001, 418-22.
- Funk M, Reid C: Indexing consistency in MEDLINE. Bull Med Libr Assoc. 1983, 71: 176-83.
- de Bruijn L, Hasman A, Arends J: Automatic SNOMED classification – a corpus based method. Yearbook of Medical Informatics. 1999.
- Despont-Gros C, Mueller H, Lovis C: Evaluating user interactions with clinical information systems: a model based on human-computer interaction models. J Biomed Inform. 2005, 38 (3): 244-255. 10.1016/j.jbi.2004.12.004.
- Bowman S: Coordinating SNOMED-CT and ICD-10. J AHIMA. 2005, 60-1.
- Rassinoux A, Baud R, Ruch P, Trombert-Paviot B, Rodrigues J: Model-based Semantic Dictionaries for Medical Language Understanding. J Am Med Inform Assoc (Symp Suppl). 1999, 122-6.
- Rodrigues J, Rector A, Zanstra P, R RB, Innes K, Rogers J, Rassinoux A, Schulz S, Paviot BT, ten Napel H, Clavel L, van der Haring E, Mateus C: An Ontology driven collaborative development for biomedical terminologies: from the French CCAM to the Australian ICHI coding system. Stud Health Technol Inform. 2006, 124: 863-8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.