Autocoding is a specialized form of machine translation. The general idea behind machine translation is that computers have the patience, stamina, and speed to parse quickly through gigabytes of text, matching terms in the text with equivalent terms from an external vocabulary. Human translators often scoff at the output of machine translators, noting the high rate of comical errors. An often-cited, perhaps apocryphal, example of poor machine translation is the English-to-Russian transformation of "out of sight, out of mind" into the Russian equivalent of "invisible idiot."
Despite limitations, machine translation is the only way to transform gigabytes and terabytes of text. As long as clinicians, pathologists, radiologists, nurses, and scientists continue to type messages, reports, manuscripts and notes into electronic documents, we will need computers to parse and organize the resulting text.
One of the many problems in the field of machine translation is that expressions (multi-word terms) convey ideas that transcend the meanings of the individual words in the expression. Consider the following sentence:
"The ciliary body produces aqueous humor."
The example sentence has an unambiguous meaning to anatomists, but each word in the sentence can have many different meanings. "Ciliary" is a common medical word, and usually refers to the action of cilia. Cilia are found throughout the respiratory and gastrointestinal tracts and play an important role in moving particulate matter. The word "body" almost always refers to the human body. The term "ciliary body" should (but does not) refer to the action of cilia that move human bodies from place to place. The word "aqueous" always refers to water. The word "humor" relates to something being funny. The term "aqueous humor" should (but does not) relate to something that is funny by virtue of its use of water (as in squirting someone in the face with a trick flower). In fact, "ciliary body" and "aqueous humor" are each examples of medical doublets whose meanings are specific and contextually constant (i.e., they always mean one thing). Furthermore, the meanings of the doublets cannot be reliably determined from the individual words that constitute the doublet, because the individual words have several different meanings. Basically, you either know the correct meaning of the doublet, or you don't.
Any sentence can be examined by parsing it into an array of intercalated doublets:
"The ciliary, ciliary body, body produces, produces aqueous, aqueous humor."
The important concepts in the sentence are contained in two doublets (ciliary body and aqueous humor). A nomenclature containing these doublets would allow us to extract and index these two medical concepts. A nomenclature consisting of single words might miss the contextual meaning of the doublets.
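The parsing step described above can be illustrated with a minimal Python sketch. It is offered only as an illustration and is not the implementation used in this study. The function name `doublets`, the tokenizing rule (lowercase runs of letters and digits), and the two-entry nomenclature are all assumptions made for the example.

```python
import re

def doublets(text):
    """Return the overlapping word pairs (doublets) of a text, in order."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Pair each word with the word that follows it.
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

sentence = "The ciliary body produces aqueous humor."
pairs = doublets(sentence)
# pairs == ["the ciliary", "ciliary body", "body produces",
#           "produces aqueous", "aqueous humor"]

# Hypothetical two-entry nomenclature, for illustration only.
nomenclature = {"ciliary body", "aqueous humor"}
found = [d for d in pairs if d in nomenclature]
# found == ["ciliary body", "aqueous humor"]
```

Only the two doublets that carry the medical concepts survive the lookup; the connecting doublets ("the ciliary", "body produces", "produces aqueous") match nothing in the nomenclature and are discarded.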
What if the term were larger than a doublet? Consider the tumor "orbital alveolar rhabdomyosarcoma." The individual words can be misleading. This orbital tumor is not from outer space, and the alveolar tumor is not from the lung. The three-word term describes a sarcoma arising from the orbit of the eye that has a morphology characterized by tiny spaces of a size and shape as may occur in glands (alveoli). The term "orbital alveolar rhabdomyosarcoma" can be parsed as "orbital alveolar, alveolar rhabdomyosarcoma." Why is this any better than parsing the term into individual words, as in "orbital, alveolar, rhabdomyosarcoma"? The doublets, unlike the single words, are highly specific terms that are unlikely to occur in association with more than a few specific concepts.
Very few medical terms are single words. In "The developmental lineage classification and taxonomy of neoplasms" there are 102,271 unique terms for neoplasms. All but 252 of these terms are multi-word terms [see Additional file 1] [see Additional file 2][1]. Of the 252 singletons, all but 34 are names of specific tumors ending in the suffix "oma." The suffix "oma" denotes a tumor. Single-word names of tumors ending in "oma" can be thought of as doublets with the first and second words fused together (e.g., osteoblastoma is "osteoblast" + "oma"). Some examples of "oma" terms are: acanthoma, adamantinoma, adenofibroma, adenomyoepithelioma, adenomyoma, adenosarcoma, and ameloblastoma. In the entire taxonomy, there are only 34 singletons that do not end in "oma." These are: "acrochordon, carcinoid, cyst, dermoid, dip-nech, dipnech, erythroleukemia, fibroid, histiocytosis, leucaemia, leukaemia, leukemia, macroglobulinemia, mastocytosis, milia, milium, myelodysplasia, naevus, neuronevus, nevus, parapsoriasis, pre-leukaemia, pre-leukemia, precancer, preleukemia, premalignancy, preneoplasia, tylosis, verruca, verrucae, and wart". These singletons represent a mere 34/102,271, or approximately 0.0003 (0.03%), of the neoplasm terminology.
Medical autocoding can be considered a specialized form of machine translation. Medical autocoders transform text into an index of coded nomenclature terms (sometimes called a "concept index" or "concept signature"). Several innovative approaches to autocoding have used the higher information content of multi-word terms (also called word n-grams) to match terms in text with terms in vocabularies, or to enhance the content of vocabularies by identifying n-grams occurring in text that qualify as new nomenclature terms [2–4]. Unlike prior studies with n-grams, the method developed for this study does not use statistical inference or the information content of bi-grams to infer semantic meaning from natural language. Instead, the author has used the higher term specificity of doublets (bi-grams) to construct a simple and fast lexical parser. Lexical parsers are types of string-matching algorithms. In general, the overall speed of a lexical parser is determined by two factors: the speed with which the parser can prepare an array of all possible words and phrases contained in a block of text, and the speed with which each of these phrases can be compared against all the terms in the nomenclature.
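The two steps that govern the speed of a generic lexical parser can be sketched as follows. This is a deliberately naive Python illustration, not any published parser. The names `all_phrases` and `naive_autocode`, the four-word phrase limit, and the term codes ("C0001" and so on) are hypothetical. The sketch also ignores sentence boundaries, which a practical parser would probably respect.

```python
import re

def all_phrases(text, max_words=4):
    """Yield every run of 1..max_words consecutive words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for start in range(len(words)):
        last = min(start + max_words, len(words))
        for end in range(start + 1, last + 1):
            yield " ".join(words[start:end])

def naive_autocode(text, nomenclature, max_words=4):
    """Return (term, code) for every phrase found in the nomenclature."""
    # Step 1 builds the phrases; step 2 is one dictionary lookup per phrase.
    return [(p, nomenclature[p])
            for p in all_phrases(text, max_words)
            if p in nomenclature]

# Hypothetical nomenclature with made-up codes, for illustration only.
nomenclature = {"ciliary body": "C0001",
                "aqueous humor": "C0002",
                "body": "C0003"}
sentence = "The ciliary body produces aqueous humor."
phrase_count = len(list(all_phrases(sentence)))   # 18 phrases from 6 words
hits = naive_autocode(sentence, nomenclature)
# hits == [("ciliary body", "C0001"), ("body", "C0003"),
#          ("aqueous humor", "C0002")]
```

Even this six-word sentence yields 18 candidate phrases, each of which must be looked up. The spurious hit on the single word "body" also shows how a nomenclature that contains single words can match out of context.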
The purpose of this paper is to describe a novel algorithm for autocoding based on finding runs of word-doublets that match a list of doublets extracted from a medical nomenclature. Using the same hardware and the same nomenclature, the speed and accuracy of the doublet method can be compared with another fast lexical parser.
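As a rough orientation, the idea of matching runs of word-doublets might be sketched in Python as shown here. This is one plausible reading of the one-sentence description above, not the author's implementation. All function names, the tokenizer, and the term codes ("T001" and so on) are assumptions. The sketch makes two simplifications: single-word terms are never matched, and a run is accepted only if the whole run is a nomenclature term.

```python
import re

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_doublet_set(terms):
    """Collect every adjacent word pair that occurs in any term."""
    doublet_set = set()
    for term in terms:
        words = tokenize(term)
        doublet_set.update(zip(words, words[1:]))
    return doublet_set

def doublet_autocode(text, nomenclature):
    """Find maximal runs of known doublets and look each run up."""
    doublet_set = build_doublet_set(nomenclature)
    words = tokenize(text)
    matches, run = [], []
    # The trailing None is a sentinel that closes a run at end of text.
    for pair in list(zip(words, words[1:])) + [None]:
        if pair is not None and pair in doublet_set:
            if not run:
                run = [pair[0]]      # start a new run with the first word
            run.append(pair[1])      # extend the run by one word
        elif run:
            phrase = " ".join(run)   # the run has ended; test it as a term
            if phrase in nomenclature:
                matches.append((phrase, nomenclature[phrase]))
            run = []
    return matches

# Hypothetical nomenclature with made-up codes, for illustration only.
nomenclature = {"orbital alveolar rhabdomyosarcoma": "T001",
                "ciliary body": "T002",
                "aqueous humor": "T003"}
text = ("The ciliary body produces aqueous humor. "
        "Biopsy showed orbital alveolar rhabdomyosarcoma.")
coded = doublet_autocode(text, nomenclature)
# coded == [("ciliary body", "T002"), ("aqueous humor", "T003"),
#           ("orbital alveolar rhabdomyosarcoma", "T001")]
```

In this sketch, most word pairs in the text fail the doublet lookup immediately, so only a few candidate phrases are ever assembled and compared against the full nomenclature.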