The rapid growth in the number of academic publications makes it increasingly difficult for researchers to keep up to date with advances in their areas of interest. In some fields, such as those related to medicine, the volume of published research is now so great that no individual can read all of the publications that are potentially relevant to their research. Consequently, researchers focus on key publications within their own domain, but this can lead to connections between sub-fields being missed. Literature-based discovery (LBD) aims to (semi-)automate the process of identifying inferable connections. Swanson [1] proposed the A-B-C model for finding links between two unconnected terms, A and C. The approach operates by identifying a publication containing both A and B and another containing B and C (for some linking term B). The approach’s efficacy was demonstrated by finding a link between Raynaud disease and fish oil via blood viscosity.
Two types of LBD have been discussed in the literature: open and closed [2]. Open discovery starts from term A, follows connections to any B terms which are further followed to any C terms. Removing directly related A−C pairs from the list leaves hypothesized new knowledge. In closed discovery, both the A and C terms are specified at the start and only the B terms are sought. The B terms can provide justification for any hypothesized link between A and C.
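The two modes can be illustrated with a minimal sketch. It assumes that two terms are linked whenever they occur in the same document and that each document has already been reduced to a set of terms; the function names and the three-document toy corpus are hypothetical and serve only as illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_links(documents):
    """Map each term to the set of terms it co-occurs with in any document."""
    links = defaultdict(set)
    for terms in documents:
        for a, b in combinations(set(terms), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def open_discovery(links, a):
    """Return {C: linking B terms} for every C reachable from A via some B,
    leaving out C terms that are already directly linked to A."""
    direct = links.get(a, set())
    candidates = defaultdict(set)
    for b in direct:
        for c in links.get(b, set()):
            if c != a and c not in direct:
                candidates[c].add(b)
    return dict(candidates)


def closed_discovery(links, a, c):
    """Return the B terms that are linked to both the given A and C terms."""
    return links.get(a, set()) & links.get(c, set())


# Toy corpus (illustrative only): each document is a set of terms.
documents = [
    {"raynaud disease", "blood viscosity"},
    {"blood viscosity", "fish oil"},
    {"raynaud disease", "vasoconstriction"},
]
links = build_links(documents)
hidden = open_discovery(links, "raynaud disease")
bridge = closed_discovery(links, "raynaud disease", "fish oil")
```

On this toy corpus, open discovery from "raynaud disease" proposes "fish oil" through the linking term "blood viscosity", and closed discovery between the two terms returns that same linking term.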
For both open and closed discovery, identifying relations between pairs of terms is obviously critical to the success of an LBD system. A simple approach is to assume that terms which appear together (e.g. occur in the same document title, sentence or document) are related. However, this assumption causes a large amount of over-generation since connections are hypothesised through linking terms which are unrelated or too general.
We note the following types of linking terms which tend to over-generate connections:
1. Non-content words (words such as and and or).
2. Uninformative or very general words (such as patient or week).
3. Ambiguous terms (words with multiple meanings such as cold, which can mean common cold, cold sensation or Chronic Obstructive Lung Disease (COLD)).
Non-content words (point 1) can be addressed using a stoplist. Uninformative words (point 2) are more difficult to identify than non-content words: they often appear in inventories, but do not provide much information for the task. For example, patient appears in the UMLS Metathesaurus [3], but it is rarely an informative term for LBD. A list of uninformative words can be built either automatically (with varying degrees of human intervention) or fully manually; e.g. Swanson, Smalheiser, and Torvik [4] built (semi-automatically) a 9,500-word stoplist for their LBD system. Such lists often suffer from errors of omission, and this particular list has been criticized for being too finely tuned to a fundamental LBD discovery [5]. Another approach removes uninformative words at the system level, either by using an LBD system to indicate commonly occurring (and thus likely uninformative) linking terms (building a stoplist), or by removing links where these are likely to be unhelpful [6].
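As a rough illustration of combining a manual stoplist with an automatically derived one, the sketch below treats any term that occurs in more than a given fraction of documents as uninformative. The document-frequency criterion, the threshold value, the function names and the toy corpus are all assumptions made for this example; they are not the procedures used in [4] or [6].

```python
from collections import Counter


def frequent_terms(documents, max_doc_fraction=0.5):
    """Return the terms that occur in more than max_doc_fraction of documents."""
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))  # count each term once per document
    limit = max_doc_fraction * len(documents)
    return {term for term, count in doc_freq.items() if count > limit}


def filter_terms(terms, stoplist):
    """Drop any term that appears in the stoplist."""
    return {term for term in terms if term not in stoplist}


# Toy corpus (illustrative only): each document is a set of terms.
documents = [
    {"patient", "raynaud disease", "blood viscosity"},
    {"patient", "fish oil", "blood viscosity"},
    {"patient", "migraine", "magnesium"},
    {"patient", "week", "magnesium"},
]
manual_stoplist = {"and", "or"}  # non-content words (point 1)
stoplist = manual_stoplist | frequent_terms(documents, max_doc_fraction=0.5)
kept = filter_terms(documents[0], stoplist)
```

Here patient is added to the stoplist because it occurs in every document, whereas week occurs only once and slips through, which is a small instance of the errors of omission noted above.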
Point 3 is the central focus of this work. Ambiguous terms can lead to spurious hidden knowledge being identified: if a publication contains a connection between A and B1 and another supports a connection between B2 and C (where B1 and B2 are different senses of the term B) the A-B-C model will suggest a hidden connection between A and C, despite there being no link.
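The effect can be reproduced on a two-document toy example. The sketch below assumes same-document co-occurrence as the linking criterion; the placeholder terms "A" and "C" and the sense labels "cold/common_cold" and "cold/COLD" are invented for illustration and are not real UMLS identifiers.

```python
def linking_terms(documents, a, c):
    """Return the B terms that co-occur with A in some document and with C
    in some (possibly different) document."""
    with_a = set().union(*(doc for doc in documents if a in doc)) - {a}
    with_c = set().union(*(doc for doc in documents if c in doc)) - {c}
    return with_a & with_c


# Word-level representation: both senses of "cold" share one surface form.
word_level = [
    {"A", "cold"},  # here "cold" is used in the common cold sense
    {"cold", "C"},  # here "cold" is used in the COLD (lung disease) sense
]

# Sense-level representation: each occurrence carries a hypothetical sense label.
sense_level = [
    {"A", "cold/common_cold"},
    {"cold/COLD", "C"},
]

spurious = linking_terms(word_level, "A", "C")
disambiguated = linking_terms(sense_level, "A", "C")
```

With word-level terms, "cold" is returned as a linking term and an A-C connection is hypothesised; with sense-labelled terms, no linking term is found. The second result depends entirely on the sense labels being correct, which is why the accuracy of the disambiguation step matters.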
The problem is exacerbated by the prevalence of ambiguous terms in the biomedical literature. A range of different types can be found [7] including:
1. ambiguous words, e.g. depression can refer to psychological condition or hollow on surface [8].
2. abbreviations with multiple possible expansions [9], e.g. CAT can mean chloramphenicol acetyl transferase, computer-aided testing, computer-automated tomography, choline acetyltransferase or computed axial tomography [10].
3. gene names are often not used uniquely and the same description can be used to refer to different genes [11], e.g. NAP1 relates to at least five genes.
The standard approach to LBD of identifying connections between words fails to account for word ambiguity, and consequently some researchers have explored the use of alternative representations for words. Weeber et al. [5] discuss the disadvantages of generating hidden knowledge from words, or n-grams of words, as opposed to generating it from Concept Unique Identifiers (CUIs) from the UMLS Metathesaurus, although they only indirectly point out the sense-disambiguating advantage by highlighting that any stoplists used for filtering terms no longer need to be domain specific. They employ the publicly available tool MetaMap [12], which assigns a CUI to each term, and thus avoid ambiguous linking terms. However, they do not discuss the extent to which LBD is sensitive to the accuracy of the word sense disambiguation (WSD) system employed or whether performance gains are due to the filtering of irrelevant terms.
It seems plausible that the information provided by WSD will improve performance of language processing tasks such as LBD. However, WSD has proved to be a challenging problem and the errors made by WSD systems often mean that integrating them with language processing systems has not led to performance increases in practice [13]. For example, it is unclear whether WSD benefits Machine Translation (MT), a key NLP application, with researchers making opposing claims about the effect on performance [14–16]. A similar situation is observed for Information Retrieval, where it was thought that applying WSD did not improve performance [17], although more recent work has suggested that it can [18]. Consequently, it is not possible to predict a priori whether WSD will be useful for any given application, including LBD; its effect needs to be evaluated directly.
We explore the connection between WSD accuracy and the effectiveness of LBD. We examine the effect that a number of WSD approaches with significantly different performance have on the hidden knowledge generated by an LBD system. LBD performance is evaluated using a time-slicing approach [19].