Semantic text mining support for lignocellulose research
© Meurs et al.; licensee BioMed Central Ltd. 2012
Published: 30 April 2012
Biofuels produced from biomass are considered to be promising sustainable alternatives to fossil fuels. The conversion of lignocellulose into fermentable sugars for biofuels production requires the use of enzyme cocktails that can efficiently and economically hydrolyze lignocellulosic biomass. As many fungi naturally break down lignocellulose, the identification and characterization of the enzymes involved is a key challenge in the research and development of biomass-derived products and fuels. One approach to meeting this challenge is to mine the rapidly-expanding repertoire of microbial genomes for enzymes with the appropriate catalytic properties.
Semantic technologies, including natural language processing, ontologies, semantic Web services and Web-based collaboration tools, promise to support users in handling complex data, thereby facilitating knowledge-intensive tasks. An ongoing challenge is to select the appropriate technologies and combine them in a coherent system that brings measurable improvements to the users. We present our ongoing development of a semantic infrastructure in support of genomics-based lignocellulose research. Part of this effort is the automated curation of knowledge from information on fungal enzymes that is available in the literature and genome resources.
Working closely with fungal biology researchers who manually curate the existing literature, we developed ontological natural language processing pipelines integrated in a Web-based interface to assist them in two main tasks: mining the literature for relevant knowledge, and at the same time providing rich and semantically linked information.
Since the early decades of the 20th century, when the internal combustion engine rapidly replaced the steam engine, transport has been almost totally dependent on fossil fuels. As the petroleum reserves decrease, producing sustainable liquid fuels with low environmental impact is one of the major technological challenges the world is facing today. Industrialized and developing countries consider biofuels, fuels produced from biomass, as a promising alternative to fossil fuels.
There are many advantages of using biofuels in terms of economic, environmental and energy security impacts: when produced from biomass sources, biofuels can be sustainable and contribute to reducing carbon dioxide emissions. In the United States, biofuel is produced mainly from the fermentation of hydrolyzed corn starch, a process that requires substantial input of water, fertilizer and energy, and that consumes a food resource. According to the United Nations Environment Programme, the global use of biofuels will nearly double during the next ten years. Hence, improving the efficiency and sustainability of biofuels production from non-food sources is of great interest. Underutilized agricultural and forestry residues, such as agricultural straws, residues from pulp and paper production and other "green" garbage, are composed of lignocellulose, which is the most abundant organic material on earth.
The conversion of lignocellulose into fermentable sugars for biofuels production requires the use of cocktails of biological catalysts, called enzymes. A key challenge lies in the development of enzyme cocktails that can efficiently and economically hydrolyze lignocellulosic biomass. One approach to meeting this challenge is to mine the rapidly-expanding repertoire of microbial genomes for enzymes with the appropriate catalytic properties.
Researchers who aim to identify, analyze and develop these enzymes need to extract and interpret valuable and relevant knowledge from the huge number of documents that are available in multiple, ever-growing repositories.
The largest knowledge source available to biological researchers is the PubMed bibliographic database, provided by the US National Center for Biotechnology Information (NCBI), which contains more than 19 million citations from more than 21,000 life science journals. PubMed is linked to other databases, like Entrez Genome, which provides access to genomic sequences, and BRENDA, The Comprehensive Enzyme Information System, which is the main collection of enzyme functional data available to the scientific community. A biology researcher querying PubMed using keywords typically collects a long list of potentially relevant papers. Reading all the abstracts and full texts of these papers to extract relevant information is a time-consuming task.
The work we present in this paper focuses on the automatic extraction of knowledge from the massive amount of information on fungal biomass-degrading enzymes available from the literature. In our approach, Natural Language Processing (NLP) pipelines brokered through Web services support the extraction of relevant mentions. Detected entities are further enriched with additional information and, where possible, linked to external data sources.
To address the challenges of extracting relevant data from large collections of published papers, NLP and Semantic Web approaches are increasingly adopted in biomedical research [6–8]. During the last decade, several systems combining text mining and semantic processing have been developed to help life sciences researchers in extracting knowledge from the literature. Textpresso enables the user to search for categories of biological concepts and classes relating two objects and/or keywords within an entire literature set. GoPubMed supports the arrangement of the abstracts returned from a PubMed query. iHOP converts the information in PubMed into one navigable resource by using genes and proteins as hyperlinks between sentences and abstracts. BioRAT extracts biological information from full-length papers. Bio-Jigsaw is a visual analytics system highlighting connections between biological entities or concepts grounded in the biomedical literature. MutationMiner automates the extraction of mutations and textual annotations describing the impacts of mutations on protein properties from full-text scientific literature. Finally, Reflect is a Firefox plugin which tags gene, protein and small molecule names in any Web page.
Before we describe our overall architecture and the text mining pipelines, we briefly introduce the user groups involved, the semantic entities we analyze and the resources we use.
The identification and the development of effective fungal enzyme cocktails are key elements of the biorefinery industry. In this context, the manual curation of fungal genes encoding lignocellulose-active enzymes provides the thorough knowledge necessary to facilitate research and experiments. Researchers involved in this curation are building sharable resources, usually by populating dedicated databases containing the extracted knowledge from the curated literature.
The users of our system are populating and using the mycoCLAP database (http://cubique.fungalgenomics.ca/mycoCLAP/), which is a searchable database of fungal genes encoding lignocellulose-active proteins that have been biochemically characterized. The curators are therefore the first user group of our system. The biology researchers who make decisions about the experiments to conduct and the experimenters executing them represent two additional user groups. They are mainly interested in the ability to combine multiple semantic queries over the curated data, thereby semantically integrating the various knowledge resources.
The system we are developing has to support the manual curation process; therefore, the semantic entities have been defined by the curators according to the information they need to store in the mycoCLAP database.
Table 1: Semantic entities, applicable level (sentence, S, or word(s), W), definitions and examples
Conditions at which the activity assay is carried out Ex.: disodium hydrogen phosphate, citric acid, pH 4.0, 37°C
Name of the activity assay Ex.: Dinitrosalicylic Acid Method (Somogyi-Nelson)
Enzyme name Ex.: alpha-galactosidase
Gene name Ex.: mel36F
Presence of glycosylation on protein Ex.: N-glycosylated
Organism used to produce the recombinant protein Ex.: Escherichia coli
Buffer, pH, temp. for the kinetic parameters determination Ex.: 0.1 M (disodium hydrogen phosphate, citric acid), pH 4.0, 37°C
Organism name Ex.: Gibberella sp.
pH mentions Ex.: The enzyme retained greater than 90% of its original activity between pH 2.0 and 7.0 at room temperature for 3 h.
Products formed from enzyme reaction and identification method Ex.: HPLC, glucose, galactose
Specific activity of the enzyme Ex.: 11.9 U/mg
Strain name Ex.: F75
Substrate name Ex.: stachyose
Substrate specificity mentions Ex.: The Endoglucanase from Pyrococcus furiosus had highest activity on cellopentaose
Temperature mentions Ex.: The enzyme stability at different pH values was measured by the residual activity after the enzyme was incubated at 25°C for 3 h.
About half of these entities are detected at the word level (e.g., enzyme or organism names) and the other half consists of contextual properties captured at the sentence level (e.g., pH and temperature contexts). The entity set was built with a view to providing instances for the ontological representation of the domain knowledge. The enzyme names are sought, as well as the names of their source organisms and strain designations. The enzymes have specific biochemical properties, such as optimal temperature and pH, temperature and pH stability, specific activity, substrate specificities and kinetic parameters. These experimentally determined properties describe each enzyme's catalytic ability and capacity, and are a basis for comparison between enzymes. Their mentions are captured from the literature along with the laboratory methods (assay) used and the experimental conditions (activity and kinetic assay conditions). In addition to these properties, mentions describing an enzymatic property (glycosylation state) and the products formed (product analysis) are extracted to complete the knowledge of the reaction.
In terms of knowledge sources, the system relies on external and internal resources and ontologies. The Taxonomy database http://www.ncbi.nlm.nih.gov/Taxonomy/ from NCBI is used for initializing the NLP resources supporting organism recognition. BRENDA http://www.brenda-enzymes.org provides the enzyme knowledge along with SwissProt/UniProtKB http://www.uniprot.org/. References to the original sources are integrated into the curated data, which allows us to automatically create links using standard Web techniques: e.g., links from an organism mention in a research paper to its corresponding entry in the NCBI Taxonomy database or from an enzyme name to its EC number in BRENDA.
In this section, we provide an overview of our system architecture, the semantic resources we deployed, and the text mining pipelines we developed.
With the different user groups and their diverging requirements, as well as the existing and continuously updated project infrastructure, we needed to find solutions for incrementally adding semantic support without disrupting day-to-day work. Our solution deploys a loosely-coupled, service-oriented architecture that provides semantic services through existing and new clients.
The processing resources (PRs) composing the first part of the system pipeline are generic and independent of the domain. Some of these resources are based on standard components shipped with the GATE distribution. In particular, the JAPE language allows the generation of finite-state language transducers that process annotation graphs over documents. After initializing the document, the LigatureFinder PR finds and replaces all ligatures, such as the single glyphs ﬁ, ﬀ or ﬂ, with their individual characters (fi, ff, fl), thereby facilitating gazetteer-based analysis. The next PR is the ANNIE English Tokenizer, which splits the text into very simple tokens, such as numbers, punctuation characters and words of different types. Finally, the ANNIE Sentence Splitter segments the text into sentences by means of a cascade of finite-state transducers, and the ANNIE part-of-speech (POS) tagger that is included with GATE adds POS tags to each token.
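The generic preprocessing steps described above can be illustrated with a small Python sketch. It is not the GATE implementation: the function names, the ligature table, the token pattern and the sentence-boundary heuristic are all assumptions chosen for brevity, and the actual PRs are considerably more elaborate.

```python
import re

# Assumed table of common Latin typographic ligatures and their letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Very simple tokens: numbers (with optional decimals), words, punctuation.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

# Naive boundary: sentence-final punctuation, whitespace, then a capital.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def replace_ligatures(text: str) -> str:
    """Replace each ligature glyph with its individual characters."""
    for glyph, letters in LIGATURES.items():
        text = text.replace(glyph, letters)
    return text


def tokenize(text: str) -> list[str]:
    """Split text into number, word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(text: str) -> list[str]:
    """Split text into sentences with a simple punctuation heuristic."""
    return [s.strip() for s in SENTENCE_BOUNDARY_RE.split(text) if s.strip()]


if __name__ == "__main__":
    raw = "The enzyme was puri\ufb01ed. It was stable at pH 5.0. Activity was high."
    clean = replace_ligatures(raw)
    for sentence in split_sentences(clean):
        print(tokenize(sentence))
```

Replacing ligatures first matters because a gazetteer entry such as "purified" would otherwise fail to match a string that contains the single ﬁ glyph.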
Organism tagging and extraction rely on the open-source OrganismTagger system http://www.semanticsoftware.info/organism-tagger. The OrganismTagger is a hybrid rule-based/machine-learning system that extracts organism mentions from the biomedical literature, normalizes them to their scientific name, and provides grounding to the NCBI Taxonomy database.
The OrganismTagger also comes in the form of a GATE pipeline, which can be easily integrated into our system. It reuses the NCBI Taxonomy database, which is automatically transformed into NLP resources, thereby ensuring the system stays up-to-date with the NCBI database. The OrganismTagger pipeline provides the flexibility of annotating the species of particular interest to bio-researchers on different corpora, by optionally including detection of common names, acronyms, and strains.
Despite the standards published by the Enzyme Commission, authors often describe enzymes in various forms, ranging from their 'Recommended Name' to different synonyms or abbreviations. Our enzyme recognition process is rule-based: gazetteer and mapping lists are automatically extracted from the BRENDA database, in addition to a mapping list of SwissProt identifiers extracted from the SwissProt database.
An enzyme-specific text tokenization, along with grammar rules written in the JAPE language, analyses tokens with the -ase and -ases enzyme suffixes. The gazetteers make it possible to find the enzyme mentions in the documents by applying a pattern-matching approach.
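A minimal Python sketch of this suffix-plus-gazetteer idea follows. The pipeline itself uses JAPE grammars and lists extracted from BRENDA; here the three-entry gazetteer, the regular expression and the function name are hypothetical stand-ins.

```python
import re

# Hypothetical mini-gazetteer; the real lists are extracted from BRENDA.
ENZYME_GAZETTEER = {"alpha-galactosidase", "endoglucanase", "xylanase"}

# Candidate tokens: words (possibly hyphenated) ending in -ase or -ases.
SUFFIX_RE = re.compile(r"\b[\w-]*ases?\b", re.IGNORECASE)


def find_enzyme_candidates(text: str) -> list[dict]:
    """Return suffix-based candidates, flagged by gazetteer membership."""
    candidates = []
    for match in SUFFIX_RE.finditer(text):
        surface = match.group(0)
        key = surface.lower()
        # Reduce a plural such as "xylanases" to its singular form.
        singular = key[:-1] if key.endswith("ases") else key
        candidates.append(
            {
                "text": surface,
                "start": match.start(),
                "end": match.end(),
                "in_gazetteer": singular in ENZYME_GAZETTEER,
            }
        )
    return candidates


if __name__ == "__main__":
    sentence = "The alpha-galactosidase and two xylanases were purified in this phase."
    for candidate in find_enzyme_candidates(sentence):
        print(candidate)
```

The suffix pattern alone also fires on ordinary words such as "phase", which is why the sketch keeps the gazetteer check as a separate flag rather than accepting every candidate.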
The extracellular endoglucanase (EG) was purified to homogeneity from the culture supernatant by ethanol precipitation (75%, v/v), CM Bio-Gel A column chromatography, and Bio-Gel A-0.5 m gel filtration. The purified EG (specific activity 43.33 U/mg protein) was a monomeric protein with a molecular weight of 27 000.
Here, EG stands for 'endoglucanase', but this abbreviation is not reported in BRENDA. Such abbreviations are meaningful only within the context of a single document. Therefore, our pipeline contains grammar rules identifying these author-specific abbreviations and performing coreference resolution on each document.
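The following Python sketch shows one simple way to detect such author-specific abbreviations and resolve their later uses within a single document. It assumes the abbreviation is introduced in parentheses directly after a one-word long form ending in -ase; the pattern and function names are illustrative and are not the grammar rules used in the pipeline.

```python
import re

# Assumed definition pattern: long form ending in -ase, then "(ABBR)".
DEFINITION_RE = re.compile(r"\b([A-Za-z][\w-]*ase)\s*\(([A-Z][A-Za-z0-9]{1,9})\)")


def find_abbreviations(text: str) -> dict[str, str]:
    """Map each author-defined abbreviation to its long form."""
    return {m.group(2): m.group(1) for m in DEFINITION_RE.finditer(text)}


def resolve_abbreviations(text: str) -> list[tuple[str, str, int]]:
    """Return (abbreviation, long form, offset) for every occurrence."""
    resolved = []
    for short, long_form in find_abbreviations(text).items():
        for m in re.finditer(r"\b" + re.escape(short) + r"\b", text):
            resolved.append((short, long_form, m.start()))
    return sorted(resolved, key=lambda item: item[2])


if __name__ == "__main__":
    document = (
        "The extracellular endoglucanase (EG) was purified to homogeneity. "
        "The purified EG (specific activity 43.33 U/mg protein) was a "
        "monomeric protein."
    )
    for short, long_form, offset in resolve_abbreviations(document):
        print(f"{short} at offset {offset} refers to {long_form}")
```

Because the mapping is rebuilt for each document, an abbreviation defined in one paper never leaks into another, which matches the document-level scope described above.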
The mapping lists link up the enzyme mentions found in the document and the external resources. Through this grounding step, the system provides the user with the enzymes' Recommended Names, Systematic Names, EC Numbers, SwissProt Identifiers and the URL of the related Web pages on the BRENDA website.
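As an illustration of this grounding step, the sketch below looks up a detected mention in a small mapping list and builds a link to an external page. The two-record mapping is a hypothetical excerpt, the BRENDA URL pattern is an assumption rather than something stated in this paper, and Systematic Names and SwissProt identifiers are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnzymeRecord:
    recommended_name: str
    ec_number: str


# Hypothetical excerpt of a synonym-to-record mapping list.
CELLULASE = EnzymeRecord("cellulase", "3.2.1.4")
ALPHA_GAL = EnzymeRecord("alpha-galactosidase", "3.2.1.22")
MAPPING = {
    "cellulase": CELLULASE,
    "endoglucanase": CELLULASE,
    "alpha-galactosidase": ALPHA_GAL,
}

# Assumed URL pattern for an enzyme page on the BRENDA website.
BRENDA_URL = "https://www.brenda-enzymes.org/enzyme.php?ecno={ec}"


def ground(mention: str) -> Optional[dict]:
    """Return grounding information for a mention, or None if unknown."""
    record = MAPPING.get(mention.strip().lower())
    if record is None:
        return None
    return {
        "mention": mention,
        "recommended_name": record.recommended_name,
        "ec_number": record.ec_number,
        "url": BRENDA_URL.format(ec=record.ec_number),
    }


if __name__ == "__main__":
    print(ground("Endoglucanase"))
    print(ground("unknownase"))
```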
Temperature: The purified enzyme exhibited maximum activity at 55°C, with 84% relative activity at 60°C and 29% activity at 70°C under the assay conditions used.
pH: The enzyme displayed an optimum activity at pH 5.0 and retained 80% activity at pH 3.0 and also at pH 8.0.
Our GATE pipeline contains PRs based on JAPE rules and gazetteer lists of specific vocabulary that enable the detection of these key mentions at the sentence level.
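A rough Python approximation of this sentence-level detection is sketched below. The trigger vocabulary stands in for the gazetteer lists and the two regular expressions stand in for the JAPE rules; all of them are assumptions made for illustration.

```python
import re

# Assumed value patterns for temperature and pH mentions.
TEMPERATURE_RE = re.compile(r"\b\d+(?:\.\d+)?\s*°\s*C\b")
PH_RE = re.compile(r"\bpH\s*\d+(?:\.\d+)?")

# Hypothetical trigger vocabulary standing in for the gazetteer lists.
TRIGGERS = ("activity", "stability", "stable", "optimum", "optimal", "maximum", "retained")


def classify_sentence(sentence: str) -> set[str]:
    """Label a sentence as a Temperature and/or pH mention."""
    lowered = sentence.lower()
    if not any(trigger in lowered for trigger in TRIGGERS):
        return set()
    labels = set()
    if TEMPERATURE_RE.search(sentence):
        labels.add("Temperature")
    if PH_RE.search(sentence):
        labels.add("pH")
    return labels


if __name__ == "__main__":
    examples = [
        "The purified enzyme exhibited maximum activity at 55°C, with 84% "
        "relative activity at 60°C and 29% activity at 70°C under the assay "
        "conditions used.",
        "The enzyme displayed an optimum activity at pH 5.0 and retained 80% "
        "activity at pH 3.0 and also at pH 8.0.",
    ]
    for example in examples:
        print(sorted(classify_sentence(example)))
```

Requiring both a trigger word and a value pattern keeps sentences that merely report a buffer temperature, without any statement about enzyme behaviour, from being labelled.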
The detection of the other entities mentioned in Table 1 is currently implemented through gazetteer lists and JAPE grammar rules, with the exception of the strain mentions, which are detected by the strain feature provided by the OrganismTagger pipeline.
External resources can be accessed from the user interfaces; the system output provides direct links to the relevant Web pages, e.g., URLs of the Web pages related to the detected enzymes on the BRENDA website or the detected organisms on the NCBI Taxonomy website.
In this section, we first discuss the development of the gold standard corpus and then present preliminary results of our system.
For the intrinsic evaluation of our NLP pipelines, we are building a gold standard corpus of freely accessible full-text articles. These are manually annotated through GATE Teamware, a Web-based management platform for collaborative annotation and curation.
The annotation team consists of four biology researchers. The researcher in charge of the curation task and an annotator with a strong background in fungal enzyme literature curation are considered expert annotators. The inter-annotator agreement between them is over 80% (F-measure); hence, their annotation sets are always treated as the most reliable sets during the adjudication process.
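For readers unfamiliar with agreement expressed as an F-measure, the sketch below computes it for two annotation sets. The paper does not state how annotations were matched, so exact agreement on span offsets and entity type is an assumption here, as are the toy annotations and the function name.

```python
Annotation = tuple[int, int, str]  # (start offset, end offset, entity type)


def agreement_f_measure(first: set[Annotation], second: set[Annotation]) -> float:
    """F-measure between two annotation sets under exact matching.

    Treating either set as the reference gives the same value:
    F = 2 * |matches| / (|first| + |second|).
    """
    if not first and not second:
        return 1.0
    matches = len(first & second)
    return 2 * matches / (len(first) + len(second))


if __name__ == "__main__":
    # Toy annotations, invented for illustration only.
    annotator_a = {(0, 13, "Enzyme"), (20, 30, "Organism"), (40, 45, "Strain")}
    annotator_b = {(0, 13, "Enzyme"), (20, 30, "Organism"), (50, 55, "Gene")}
    print(round(agreement_f_measure(annotator_a, annotator_b), 3))
```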
The corpus is composed of freely accessible full-text articles containing critical knowledge and technical details that the biology researchers aim to store in the mycoCLAP database, which is specifically designed for their needs. The papers are related to classes of enzymes, among them the glycoside hydrolases, the lipases and the peroxidases. Glycoside hydrolase papers represent 69% of the articles, lipase papers account for 12%, and the remaining 19% are related to peroxidases. The current gold standard corpus is composed of ten full-text papers, each of which has been manually annotated by four biologists.
Entities and their counts in the current gold standard corpus
Text Mining pipelines results on the gold standard corpus in terms of recall (R), precision (P) and F-measure (Fm)
The OrganismTagger performance has previously been evaluated on two corpora, where it showed a precision of 95%-99%, a recall of 94%-97%, and a grounding accuracy of 97.4%-97.5%. Since its results here are lower, we examined the error cases in more detail.
The manual annotation of organisms highlights all the textual mentions referring to an organism, including indirect references, non-standard names (e.g., non-binomial names) and generic mentions. In some cases, correct results from the OrganismTagger were not manually annotated, leading to false positives. The following common sentence:
Soluble protein was determined according to the method of Lowry et al. (1951) using bovine serum albumin as standard.
shows an example of such a case where the OrganismTagger correctly annotates bovine as an organism, whereas the expert annotators considered bovine serum albumin as a stand-alone expression.
In other cases, human annotations are not detected by the OrganismTagger. For example, Trichoderma viridie, M. incrasata and cellulolytic fungi were manually annotated as organisms by the experts, but none of these mentions is detected by the OrganismTagger. In the first two cases, the cause is a spelling difference between the names of the organisms reported in the NCBI Taxonomy database and their mention in the article. In the last case, the annotation of a generic organism mention that is relevant within the context of our project is not an objective of the OrganismTagger system, which is designed to provide normalization with scientific names and grounding to the NCBI Taxonomy database. Consequently, the results obtained by our pipeline for organism recognition are lower than the published results of the OrganismTagger system. The text mining pipeline supporting our system needs to be enhanced in its ability to capture generic organism mentions and to discard stand-alone expressions containing organism names.
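The spelling-difference case can be illustrated with approximate string matching. The sketch below is one possible way to suggest a known name for a misspelled mention; it is not part of the OrganismTagger or of the presented pipeline, and the three-name lexicon and the similarity cutoff are assumptions.

```python
import difflib
from typing import Optional

# Hypothetical excerpt of scientific names; a real lexicon would be
# derived from the NCBI Taxonomy database.
KNOWN_NAMES = ["Trichoderma viride", "Trichoderma reesei", "Aspergillus niger"]


def suggest_name(mention: str, cutoff: float = 0.9) -> Optional[str]:
    """Return the closest known name above the cutoff, or None."""
    matches = difflib.get_close_matches(mention, KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(suggest_name("Trichoderma viridie"))
    print(suggest_name("cellulolytic fungi"))
```

A high cutoff is deliberate: closely related species names differ by only a few characters, so a permissive threshold would risk grounding a mention to the wrong taxon. Generic mentions such as "cellulolytic fungi" are not addressed by this kind of matching at all.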
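The results obtained on Temperature and pH sentence detection are much better in the lenient evaluation than in the strict one because of sentence splitter mistakes: a wrongly placed sentence boundary yields an annotation that only partially overlaps the manually annotated sentence, which counts as a match under lenient but not under strict evaluation.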
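The difference between the two evaluation modes can be made concrete with a short Python sketch. It assumes that strict matching requires identical span offsets while lenient matching accepts any overlap; the spans are invented for illustration and the function names are not taken from the evaluation tooling.

```python
Span = tuple[int, int]  # (start offset, end offset), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def evaluate(gold: list[Span], predicted: list[Span], lenient: bool = False):
    """Return (precision, recall, F-measure) for predicted spans."""

    def is_match(g: Span, p: Span) -> bool:
        return overlaps(g, p) if lenient else g == p

    matched_pred = sum(1 for p in predicted if any(is_match(g, p) for g in gold))
    matched_gold = sum(1 for g in gold if any(is_match(g, p) for p in predicted))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # The first gold sentence is wrongly split in two by the splitter.
    gold = [(0, 100), (120, 180)]
    predicted = [(0, 60), (61, 100), (120, 180)]
    print("strict :", evaluate(gold, predicted))
    print("lenient:", evaluate(gold, predicted, lenient=True))
```

In this toy case a single splitting error lowers the strict F-measure to 0.4 while the lenient score stays at 1.0, which is the kind of gap described above.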
The enzyme recognition pipeline provides state-of-the-art performance. However, the incorrect detection of abbreviations and acronyms accounts for 92% of the false negatives produced by our pipeline. Further work is needed to reduce this proportion by improving the co-reference resolution, drawing on approaches described in the literature and on external resources such as Allie.
We presented our ongoing development of a semantic infrastructure for enzyme data management. As the first system specifically designed for lignocellulolytic enzyme research, it targets the automatic extraction of knowledge on fungal enzymes from the research literature. The proposed approach is based on text mining pipelines combined with ontological resources. Preliminary experiments show state-of-the-art results. Improving the consistency of the extracted knowledge by increasing the use of ontologies is one of the next goals for our system. A key objective is therefore the population of the overall ontology of the domain knowledge and its publication in Linked Data format.
The gold standard corpus of manually annotated papers, as well as the presented system, will be available under http://www.semanticsoftware.info/genozymes.
The accessibility of the services through the Semantic Assistants framework allows the users to mine the semantically annotated literature from their desktop. Future work is needed to enable the interaction between selected users (e.g., curators) and the presented system in terms of data validation and knowledge acquisition.
In future work, we will further deploy our text mining pipelines to assess the quality of existing manually curated data in the databases. Measuring the overall impact of the semantic system on the scientific discovery workflow will be the target of an extrinsic study.
ANNIE: a Nearly-New Information Extraction System
BRENDA: BRaunschweig ENzyme DAtabase
GATE: General Architecture for Text Engineering
GUI: Graphical User Interface
JAPE: Java Annotation Patterns Engine
mycoCLAP: (database of) Characterized Lignocellulose-Active Proteins of fungal origin
NCBI: National Center for Biotechnology Information
NLP: Natural Language Processing
OWL: Web Ontology Language
POS: Part Of Speech
RDF: Resource Description Framework
SOAP: Simple Object Access Protocol
URL: Uniform Resource Locator.
Funding for this work was provided by Genome Canada and Génome Québec. Nona Naderi is acknowledged for her work on the OrganismTagger and the LigatureFinder. Bahar Sateli is acknowledged for help on the Semantic Assistants resources. We also thank Carolina Cantu, Semarjit Shary and Sherry Wu who helped on the annotation task.
This article has been published as part of BMC Medical Informatics and Decision Making Volume 12 Supplement 1, 2012: Proceedings of the ACM Fifth International Workshop on Data and Text Mining in Biomedical Informatics (DTMBio 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedinformdecismak/supplements/12/S1.