Mining biomarker information in biomedical literature
© Younesi et al.; licensee BioMed Central Ltd. 2012
Received: 3 February 2012
Accepted: 10 December 2012
Published: 18 December 2012
For selection and evaluation of potential biomarkers, inclusion of already published information is of utmost importance. In spite of significant advancements in text- and data-mining techniques, the vast knowledge space of biomarkers in biomedical text has remained unexplored. Existing named entity recognition approaches are not sufficiently selective for the retrieval of biomarker information from the literature. The purpose of this study was to identify textual features that enhance the effectiveness of biomarker information retrieval for different indication areas and diverse end user perspectives.
A biomarker terminology was created and further organized into six concept classes. Performance of this terminology was optimized towards balanced selectivity and specificity. The information retrieval performance using the biomarker terminology was evaluated based on various combinations of the terminology's six classes. Further validation of these results was performed on two independent corpora representing two different neurodegenerative diseases.
The current state of the biomarker terminology contains 119 entity classes supported by 1890 different synonyms. The result of information retrieval shows improved retrieval rate of informative abstracts, which is achieved by including clinical management terms and evidence of gene/protein alterations (e.g. gene/protein expression status or certain polymorphisms) in combination with disease and gene name recognition. When additional filtering through other classes (e.g. diagnostic or prognostic methods) is applied, the typical high number of unspecific search results is significantly reduced. The evaluation results suggest that this approach enables the automated identification of biomarker information in the literature. A demo version of the search engine SCAIView, including the biomarker retrieval, is made available to the public through http://www.scaiview.com/scaiview–academia.html.
The approach presented in this paper demonstrates that using a dedicated biomarker terminology for automated analysis of the scientific literature may be helpful as an aid to finding biomarker information in text. Successful extraction of candidate biomarker information from published resources can be considered the first step towards developing novel hypotheses. These hypotheses will be valuable for early decision-making in the drug discovery and development process.
During the past years, high-throughput technologies have been extensively employed for the study of molecular mechanisms underlying different diseases; this has led to the discovery of a large number of molecular biomarkers. The US National Institutes of Health defines a biomarker as “a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes or pharmacological responses to a therapeutic intervention”.
Depending on whether they represent molecular, physiological, or structural features, biomarkers show significant diversity, spanning from genes, proteins, DNA, RNA, and SNPs to blood cholesterol levels and patterns of brain abnormality. Due to such diverse coverage, ambiguity in defining biomarker types and classes continues to exist [3–5].
Application of biomarkers, however, goes beyond disease prediction and monitoring; in fact, they have also been utilized throughout various stages of drug discovery and development. For example, biomarkers play an important role in drug target discovery and validation (e.g. as quantitative readouts for candidate drugs), in the monitoring of toxicity mechanisms (e.g. assessment of indications of unwanted side-effects), and in non-invasive imaging of diseased organs. In the process of drug development, biomarkers are considered to be pivotal for informed go/no-go decision-making in the early stages. For example, mechanistic biomarkers can be used to pre-clinically measure a drug's pharmacological activity in terms of its distribution and interaction with a defined protein target. Such measurements help to decide whether to move forward to the next phase of clinical development.
A first step to finding supportive evidence for clinically important potential biomarkers is to search the accumulated data and knowledge generated from basic research. For efficient exploration of the suspected large amount of biomarker information contained in the biomedical literature, semantic search and information retrieval systems are of utmost importance. The largest publicly available biomedical literature resource, PubMed, uses MeSH (Medical Subject Headings) concepts to annotate abstracts and maps these concepts to user queries, thereby allowing semantic search. However, to search for reported biomarkers (e.g. genes, proteins, or genetic variations) in a text corpus, additional annotation of these entities and their normalization to relevant databases is required. Such a search is currently not provided by PubMed. Manual annotation is also time consuming and lags behind the ever-growing number of new publications. For example, a compendium of potential biomarkers for pancreatic cancer was compiled by systematic manual curation of the literature, which took over 7,000 person hours.
The automated identification of relevant terminology, a process known as Named Entity Recognition (NER), can support semantic annotation and information retrieval. In the past years, several NER methods for the extraction of different biological entities have been developed, primarily focusing on the recognition of gene and protein names. In this regard, the BioCreAtIvE assessments present an overview of the state of the various technologies and approaches in use [14–16]. NER approaches have already been used to support the identification of biomarker genes. In one study, gene and protein name recognition techniques were applied to Medline and OMIM (Online Mendelian Inheritance in Man) to identify potential serum biomarkers for Down syndrome. Similarly, a named entity tagging approach was employed to search for prostate cancer biomarker candidate genes in OMIM records. Studies on biomarker relation extraction from text have considered the extraction of semantic relations between diseases and genes or proteins [19, 20]. In a recent study, an information extraction framework was developed for the classification of biomarker sentences; it showed favourable results but relied on a small training set. Moreover, biomarker information is often dispersed over the entire abstract, making it difficult to reach high recall with sentence classification systems.
The aim of this work is to analyze how information about potential biomarkers is expressed in the complete body of a scientific text, and to identify additional textual features that are relevant for the retrieval of potential biomarker information from the scientific literature.
where p1 is the number of abstracts containing the entity in the query-selected corpus and p2 denotes the total number of documents in which the entity occurs within an unspecific reference corpus (i.e. the entire Medline). The Kullback–Leibler divergence ranks highly those entities that occur with especially high frequency in the selected corpus in comparison to the unspecific reference corpus. This means that entities occurring frequently throughout the reference corpus do not receive high ranks. For example, using the query “ ‘Alzheimer’s Disease’ AND ‘Evidence marker’ AND ‘Human Genes/Proteins’ ”, we retrieved 331 abstracts containing IL1B, which gives it a frequency rank of 10. By contrast, under the relative entropy formula IL1B receives a rank of only 34, because it also occurs very frequently in Medline (40685 abstracts).
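The difference between the two rankings can be illustrated with a minimal sketch. It assumes a common pointwise Kullback–Leibler form, score = P1 · log(P1 / P2), in which P1 and P2 are the counts p1 and p2 divided by the sizes of the selected and reference corpora. This normalization, the function name, the corpus sizes, and the gene names and counts are all hypothetical and chosen for illustration only; the exact equation used in the study may differ.

```python
import math

def relative_entropy_score(p1, n_selected, p2, n_reference):
    """Pointwise KL contribution of one entity (assumed form, for illustration)."""
    q1 = p1 / n_selected      # relative frequency in the query-selected corpus
    q2 = p2 / n_reference     # relative frequency in the reference corpus
    return q1 * math.log(q1 / q2)

# Hypothetical entities: (abstracts in selected corpus, abstracts in reference corpus)
counts = {
    "GENE_A": (300, 40000),   # frequent in the selected corpus, but also everywhere else
    "GENE_B": (200, 500),     # less frequent, but specific to the selected corpus
    "GENE_C": (50, 60),
}
N_SELECTED = 5_000            # assumed size of the selected corpus
N_REFERENCE = 20_000_000      # assumed size of the reference corpus

# Frequency-based ranking: sort by the raw count in the selected corpus.
by_frequency = sorted(counts, key=lambda g: counts[g][0], reverse=True)

# Relative entropy-based ranking: sort by the pointwise KL score.
by_entropy = sorted(
    counts,
    key=lambda g: relative_entropy_score(counts[g][0], N_SELECTED, counts[g][1], N_REFERENCE),
    reverse=True,
)

print(by_frequency)
print(by_entropy)
```

In this toy setting the entity that is common throughout the reference corpus leads the frequency-based list but is overtaken in the entropy-based list by the more corpus-specific entity, which mirrors the behaviour described for IL1B.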
Five end users were asked to provide 50 relevant abstracts containing cancer or drug-related biomarker information. The abstracts were chosen from the following areas due to data availability and their current interest in treating cancer.
NSCLC (Non-Small Cell Lung Cancer) related, Breast cancer and predictive/prognostic, Met signaling and cancer, EGFR related, Gefitinib related, and Erbitux related.
Clinical Management: annotate all terms indicating clinical investigations in patients, which includes the initial mentioning, the clinical study, and finally the treatment
Diagnostics: annotate all diagnostics that are used, which includes the initial disease stage, the molecular identification, and blood diagnostics
Prognosis: annotate all terms indicating any prognosis, outcome, or marker (e.g. clinical or biomarkers, adverse effects, resistance, response, disease progression or outcome)
Evidence marker: annotate all changes in gene and protein abundance, spanning from expression to mutation, SNP variations to phosphorylation status
Antecedent: annotate all risk factors mentioned for the relevant disease
where Pr(a) is the relative observed agreement among the raters and Pr(e) is the probability of random agreement. The remaining abstracts were then annotated by one of the annotators.
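The quantities Pr(a) and Pr(e) defined here match the usual formulation of Cohen's kappa, kappa = (Pr(a) − Pr(e)) / (1 − Pr(e)). The following is a minimal sketch under the assumption that this standard two-rater form was used; the function name and the example labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (standard form, assumed)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Pr(a): relative observed agreement among the raters
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pr(e): probability of random agreement, from each rater's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pr_e = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    if pr_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (pr_a - pr_e) / (1 - pr_e)

# Hypothetical annotations: 1 = term annotated, 0 = not annotated
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 0, 0, 0]
print(cohens_kappa(rater_1, rater_2))
```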
where Precision is defined as TP/(TP+FP) and Recall is defined as TP/(TP+FN), with TP = true positives, FP = false positives, TN = true negatives, and FN = false negatives.
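These definitions translate directly into code. The sketch below additionally assumes that the two measures are combined as the balanced F-score, F = 2 · Precision · Recall / (Precision + Recall); this combination, the function name, and the example counts are assumptions for illustration.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, balanced F-score) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts: 8 correct hits, 2 spurious hits, 2 missed items
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
print(p, r, f)
```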
From the list of retrieved entities, the eleven top-ranking genes/proteins were selected for an analysis of how much biomarker information is contained in their corresponding abstracts. Since more than one abstract is retrieved per gene or protein, and because a typical user might base a decision on the first retrieved abstracts (the most recent publications), only the first ten abstracts were taken into account. After manual checking of these abstracts, the numbers of true positive and false positive biomarker abstracts were determined.
Table 1. Biomarker retrieval terminology classes and coverage of the terminology in the annotated corpora
Class | Description | Example terms | Annotation found in abstracts
Clinical Management | Terms indicating clinical investigations on patients | Patient; Cohort study |
Diagnostics | Terms representing clinical as well as molecular diagnostics | Immunohistochemistry; Emission Tomography; Microarray |
Prognosis | Terms indicating the prediction for a patient / kind of biomarker / outcome of therapies | Surrogate end point; clinical response; biomarker; predictor |
Statistics | Statistical methods indicating the strength of the biomarker relationship | Chi(2) test; mean +/− SD; univariate analysis; Kaplan-Meier Analysis |
Evidence marker | Terms describing genetic/molecular evidence for activity of a gene | Mutation; gene amplification; polymorphism; expression |
Antecedent | Terms expressing exposure to hazardous agents and risk factors | Smoker; susceptibility; exposure |
All annotated terms in the training corpus were automatically extracted and integrated into a seed biomarker retrieval dictionary by manually assigning them to the six main classes. In the next step, the seed set was structured, forming a seed terminology containing 76 entity classes with 1132 different synonyms. After extension of the terminology with similar classes and synonyms from MeSH and UMLS as well as expert knowledge for diagnostic tests (after the analysis of larger disease- and biomarker-related text corpora), the biomarker retrieval terminology contained 119 entity classes with 1890 different synonyms.
Distribution analysis of different biomarker classes in relevant text corpora (annotations in abstracts) showed that every training abstract contains at least one biomarker retrieval term. Moreover, Clinical Management and Prognosis terms were found most frequently in the corpora, followed by Diagnostic, Evidence and Statistics terms (see Table 1, last column). The Antecedent class plays a minor role in the selected training data (recall of 20%), which might be quite different for diseases in which antecedents play a role in biomarker search. The six terminology classes are integrated into the SCAIView Demo server and can be accessed through http://www.scaiview.com/scaiview–academia.html.
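How class-based filtering narrows a result set can be sketched with a toy dictionary lookup. The example uses a handful of the example terms listed for each class and a simple case-insensitive pattern match; the function names and the sample sentence are hypothetical, and this is not the entity recognition used in SCAIView, only an illustration of combining terminology classes in a query.

```python
import re

# A tiny excerpt of example terms per class (the full terminology is far larger).
TERMINOLOGY = {
    "Clinical Management": ["patient", "cohort study"],
    "Diagnostics": ["immunohistochemistry", "microarray"],
    "Prognosis": ["clinical response", "biomarker", "predictor"],
    "Statistics": ["univariate analysis", "kaplan-meier analysis"],
    "Evidence Marker": ["mutation", "polymorphism", "expression"],
    "Antecedent": ["smoker", "susceptibility", "exposure"],
}

def classes_in(text):
    """Return the set of terminology classes with at least one term in the text."""
    found = set()
    for cls, synonyms in TERMINOLOGY.items():
        for synonym in synonyms:
            # Whole-word match, tolerating a simple plural "s"
            pattern = r"\b" + re.escape(synonym) + r"s?\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(cls)
                break
    return found

def matches_query(text, required_classes):
    """True if the text contains terms from every required class."""
    return set(required_classes) <= classes_in(text)

# Hypothetical abstract sentence
abstract = "EGFR mutation was detected by immunohistochemistry in 120 patients."
print(sorted(classes_in(abstract)))
print(matches_query(abstract, ["Clinical Management", "Evidence Marker"]))
```

Requiring additional classes in `matches_query` can only shrink the set of matching abstracts, which is the effect exploited when classes are added to a baseline gene/disease query.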
Spot check of NSCLC genes/proteins for their relevance to biomarker applicability and evidence
Gene or protein | No. of retrieved documents | Relative entropy rank | Frequency-based rank
One of the biomarker candidates is Mucin 1, for which 8 abstracts were retrieved under the above-mentioned criteria. Mucin 1 is involved in invasiveness, metastasis, and angiogenesis of NSCLC, and its expression and localisation in lung adenocarcinoma patients are altered compared to normal epithelial cells. A further example of using the biomarker retrieval terminology is the literature search for suitable Mucin-1 antibodies, which can be conducted directly with the terminology. Selecting the Diagnostic subclass Immunohistochemistry and the Evidence subclass Expression together with the NSCLC MeSH variants and the Mucin-1 gene concept retrieves 16 abstracts. Immunostaining of Muc1 is directly stated in 11 abstracts, but only 4 abstracts mention the specific antibodies (CA15-3, clone DF3 in PMID 17409826, HMFG2 in PMIDs 8980247, 2456176 and 1284790). Further analysis of the corresponding full-text articles is necessary to retrieve, in a next step, all information about the antibodies used. Higher recall for different antibodies in this example could be reached by searching a broader disease area such as Lung Neoplasms (31 abstracts), by searching with the Diagnostic subclass Immunohistochemistry in combination with the text search ‘anti-Muc1’ (105 abstracts), or in combination with the gene concept Muc1 and the Evidence marker Expression (1055 abstracts).
These examples demonstrate the ability of the biomarker search engine - depending on the application area - to substantially increase recall or precision of the search results, maximizing the efficiency of public domain literature searches.
To demonstrate the applicability of the biomarker retrieval terminology in other disease indications independently of the oncology-focused training corpus, the performance of retrieval for biomarker-related abstracts from Medline was tested in the field of Alzheimer’s disease and multiple sclerosis.
In a first test it became evident that the class Diagnostics shows a high diversity between cancer and neurodegenerative diseases and that this class has to be extended for the new disease areas. For instance, in the area of neurodegenerative diseases, we included the diagnostic imaging subtree of MeSH (after omitting microscopy and molecular imaging) and a number of very specific medical classification systems provided by experts (e.g. MS Functional Composite score). The final BioMarker terminology contains 164 classes and 2506 synonyms.
After this optimization for neurodegenerative diseases, we examined in three evaluation steps: i) to what extent biomarker class selection influences the biomarker gene enrichment in comparison to the content of independent gold standards (entity retrieval); ii) how many biomarker-enriched abstracts could be retrieved with the selection of our terminology (abstract retrieval); and iii) to what extent information on new biomarker genes/proteins that are not covered by the gold standard could be retrieved (potential biomarker identification).
The system was able to successfully extract candidate biomarker genes/proteins relevant to the queried diseases, and both ranking methods performed well for high-ranking genes. Nevertheless, the baseline search returned over 3600 genes from more than 33000 abstracts for Alzheimer’s disease and over 2300 genes from more than 12900 abstracts for multiple sclerosis. This number of abstracts is overwhelming for manual inspection. By additive selection of the Clinical Management or Evidence Marker classes, it can be shown that the relevant biomarker candidate genes are retrieved at higher ranks in both Alzheimer’s and multiple sclerosis contexts (Figure 4 and Figure 5). For all other classes, the slope of gene enrichment decreases at lower ranks in comparison to selection of Human Genes/Proteins alone. Overall, frequency-based ranking (Figure 4A and Figure 5A) seems to perform better than relative entropy-based ranking (Figure 4B and Figure 5B). Although frequency-based ranking leads to a higher gene enrichment slope between ranks 300 and 1500, we observe a stronger enrichment at low ranking positions using relative entropy-based ranking. Analysis of those low-ranking but frequent genes in Alzheimer’s disease and multiple sclerosis shows that the many cytokines and inflammatory proteins among them are mere indicators of the disease. The frequency of those genes in the whole of Medline is very high, and for this reason they are underrepresented in the specific corpus.
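The notion of gene enrichment over ranks can be made concrete with a short sketch: walking down a ranked gene list and counting how many gold-standard genes have been seen so far. The function name, the gene identifiers, and the gold-standard set below are hypothetical, and the exact way the enrichment curves in the figures were computed is an assumption.

```python
from itertools import accumulate

def enrichment_curve(ranked_genes, gold_standard):
    """Cumulative number of gold-standard genes found at each rank position."""
    return list(accumulate(1 if gene in gold_standard else 0 for gene in ranked_genes))

# Hypothetical ranked output of a query and a hypothetical gold standard
ranked = ["G1", "G2", "G3", "G4", "G5"]
gold = {"G1", "G3", "G5"}

curve = enrichment_curve(ranked, gold)
print(curve)
```

A steeper curve at the top of the list means that gold-standard genes are concentrated at high ranks, which is the property compared across class combinations and ranking methods.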
Performance evaluation for Alzheimer’s disease
Baseline: Genes / Proteins
Baseline + Clinical Management
Baseline + Evidence Marker
Baseline + Diagnostics
Baseline + Statistics
Baseline + Clinical Management + Evidence Marker
Baseline + Clinical Management + Evidence Marker + Prognosis
Examples of articles accepted to contain biomarker information
In a European screening sample of 115 sporadic AD patients and 191 healthy control subjects, we analyzed single nucleotide polymorphisms in 28 cholesterol-related genes for association with AD. The genes HMGCS2, FDPS, RAFTLIN, ACAD8, NPC2, and ABCG1 were associated with AD at a significance level of P < or = 0.05 in this sample.
Protein expression decrease
Loss of VGLUT1 and VGLUT2 in the prefrontal cortex is correlated with cognitive decline in Alzheimer disease…We quantified VGLUT1 and VGLUT2 in the prefrontal dorsolateral cortex (Brodmann area 9) of controls and AD patients using specific antiserums. A dramatic decrease in VGLUT1 and VGLUT2 was observed in AD using Western blot
Cerebrospinal fluid concentration
Five differentially-expressed proteins with potential roles in amyloid-beta metabolism and vascular and brain physiology [apolipoprotein A-1 (Apo A-1), cathepsin D (CatD), hemopexin (HPX), transthyretin (TTR), and two pigment epithelium-derived factor (PEDF) isoforms] were identified. Apo A-1, CatD and TTR were significantly reduced in the AD pool sample, while HPX and the PEDF isoforms were increased in AD CSF
Evaluation of genes not found in AD gold standard but retrieved using the biomarker terminology
No. of genes/ proteins retrieved by SCAIView but not contained in gold standard
No. of genes with at least one biomarker evidence in Medline
No. of abstracts with lack of relevance either to the disease or to being a biomarker
Protein biomarkers are required for informed decision-making in drug discovery and development. In order to evaluate and prioritize potential biomarkers, a systematic literature search is the first step. Abstracts containing information about potential gene biomarkers most often comprise evidence of gene or protein alteration as well as indication words for the clinical investigations. Moreover, diagnostic methods and prognostic terms are found frequently in these abstracts. Automated recognition of gene names, co-occurrence search with disease names and their statistical ranking serve as a baseline for biomarker search.
Using our baseline approach, almost all genes/proteins contained in the gold standard were covered by the results of our retrieval system. The system was able to successfully extract biomarker genes/proteins relevant to the queried diseases, and both ranking methods performed well for high-ranking genes. Nevertheless, the baseline search returned over 3600 genes from over 33000 abstracts for Alzheimer’s disease and over 2300 genes from more than 12900 abstracts for multiple sclerosis. This number of abstracts is overwhelming for manual inspection. Making use of the biomarker retrieval terminology developed here significantly reduces the number of extracted genes and documents with almost no loss in gene enrichment, especially for high-ranking genes. This indicates that inclusion of more biomarker classes in the search query helps to narrow down the retrieval results by adding more restrictive context to the search.
Concerning the statistical ranking of the result set, the frequency-based ranking method performs better than the relative entropy-based ranking for biomarker gene enrichment in the diseases investigated in this study. One explanation is that the relative entropy penalizes genes/proteins common to many investigations/publications, so that these entities rank at the end of the relevancy list although they show a high frequency for the disease in question. Another explanation might lie in the selective annotation of entities by the annotators of the external AD and MS gold standards; if annotators selected the most frequent genes for annotation, the gold standard is biased towards frequency ranking.
Evaluation of retrieved genes that are not contained in the gold standard for AD showed that almost half of these genes could plausibly be considered potential biomarkers. This indicates that automated text mining using combinations of the biomarker terminology classes increases recall for biomarker information retrieval and makes it possible to explore the biomarker space efficiently.
Comparing the number of retrieved entities between AD and MS shows that the biomarker information for each specific disease is distributed differently over the literature corpora. Thus, to obtain optimum results, the search strategy should be adapted to each disease of interest; however, combining more general queries with the Statistics and/or Prognosis classes results in more stringent outputs in terms of biomarker specificity, sensitivity, and predictive parameters. For the purpose of new biomarker identification, it is more suitable to use less stringent combinations, which results in higher recall. In this case, the relative entropy-based ranking method can increase the likelihood of finding proper evidence in support of a novel biomarker. For new disease areas, it is necessary to update the terminology, especially for the class Diagnostics, which shows the highest diversity of terminology across disease areas. One possibility to improve the generalisation of the terminology would be to include all diagnostic classes from publicly available terminology resources. However, this might lead to a decline in retrieval precision. Additionally, some medical diagnosis scores are very specific and often do not occur in the large publicly available resources such as MeSH and UMLS.
The flexibility of the biomarker terminology in relation to different query formulations and different biomarker classes should be tested in future investigations.
The work presented in this paper is a first step towards developing a search engine for literature-based retrieval and in-silico validation of biomarker candidates. It was shown that the development and application of a dedicated biomarker terminology could enhance the retrieval performance significantly through combined search for genes and selected classes of the biomarker retrieval terminology. The experimental results, obtained in the course of this study, demonstrate the effectiveness of the proposed approach that provides a foundation in semantic indexing and retrieval.
MeSH: Medical Subject Headings
NER: Named Entity Recognition
OMIM: Online Mendelian Inheritance in Man
UMLS: Unified Medical Language System
NSCLC: Non-Small-Cell Lung Carcinoma
We wish to thank Caroline Kant from Merck Serono for her support and discussion in the generation of the terminology and Merck Serono experts for delivering the annotations. We would like to acknowledge Heinz Theodor Mevissen from Fraunhofer SCAI for ProMiner support, BIOBASE KL for disclosure of details in the article, and Dr. Karl Kirschner for proofreading the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.