Combining entity co-occurrence with specialized word embeddings to measure entity relation in Alzheimer’s disease

Heo, Go Eun; Xie, Qing; Song, Min; Lee, Jeong-Hoon

doi:10.1186/s12911-019-0934-5

Volume 19 Supplement 5

Selected articles from the second International Workshop on Health Natural Language Processing (HealthNLP 2019)

Research
Open access
Published: 05 December 2019

BMC Medical Informatics and Decision Making volume 19, Article number: 240 (2019)

Abstract

Background

Extracting useful information from biomedical literature plays an important role in the development of modern medicine. In natural language processing, there have been rigorous attempts to find meaningful relationships between entities automatically by co-occurrence-based methods. It has become increasingly important to understand whether a relationship exists, and if so how strong it is, between any two entities extracted from a large number of texts. One of the defining methods is to measure semantic similarity and relatedness between two entities.

Methods

We propose a hybrid ranking method that combines a co-occurrence approach considering both direct and indirect entity pair relationship with specialized word embeddings for measuring the relatedness of two entities.

Results

We evaluate the proposed ranking method comparatively with other well-known methods such as co-occurrence, Word2Vec, COALS (Correlated Occurrence Analog to Lexical Semantics), and random indexing by calculating top-ranked entities related to Alzheimer’s disease. In addition, we analyze gene, pathway, and gene–phenotype relationships. Overall, the proposed method tends to find more hidden relationships than the other methods.

Conclusion

Our proposed method is able to select more useful related entities that not only co-occur frequently with the target entity but also have more indirect relations to it. In pathway analysis, our proposed method shows superior performance at identifying (functional) cross clustering and higher-level pathways. In phenotype analysis, our proposed method has an advantage in identifying the common genotypes related to phenotypes from biological literature.

Background

With the recent exponential growth of biomedical literature, extracting useful information from it has come to play an important role in the development of modern medicine. In the biomedical domain, information extraction (IE) focuses mainly on automatically identifying entities and their relationships in biomedical literature as an aspect of natural language processing (NLP). Traditionally, detecting biomedical relationships between entities has involved co-occurrence methods, which assume that if two entities appear in the same sentence, paragraph, or abstract, they are relevant to each other and helpful for biomedical knowledge discovery such as gene–gene interaction and gene–drug association. However, co-occurrence methods generate many false positive relations, since they do not consider contextual information in a specific text [1].

In addition to simple co-occurrence-based approaches to measuring the relationship between entities, rule-based methods using syntactic patterns [2,3,4,5] and machine learning methods [6, 7] have been proposed to tackle this false positive issue. Measures of semantic similarity and relatedness have been developed to identify ontological relationships between two entities, using resources such as WordNet [8] and UMLS (Unified Medical Language System) [9]. Recently, models of semantic word representations, or word embeddings, have been developed that construct semantic spaces from large-scale corpora. This line of research adopts deep learning approaches [10,11,12,13,14,15,16] such as Word2Vec [17] for automatically learning optimal feature representations. However, these studies focus only on learning word embeddings by maximizing raw-text probability, which does not perfectly capture both similarity and relatedness [18].

As indicated by previous studies [18,19,20,21], incorporating two or more knowledge sources (e.g. thesaurus, ontology, and corpus) into word embedding approaches can produce better results when ranking relationships between two entities. The present paper was motivated by the concept of utilizing knowledge sources for enriching word embeddings. To the best of our knowledge, no attempt has previously been made to combine word embeddings based on multiple knowledge resources with co-occurrence of entity pairs, while classifying the type of relation by reflecting contextual information in biomedical literature. Moreover, no previous study considers both direct and indirect relationships of entity pairs when calculating co-occurrence of entity pairs.

Therefore, in this study, we propose a hybrid semantic relatedness algorithm for biological knowledge discovery. Our proposed method combines co-occurrence between entities with specialized word embeddings [18] to calculate the semantic similarity of two entities by capturing both similarity and relatedness for semantic words, learning from both a corpus and a thesaurus. In the proposed method, we also consider both direct and indirect scores for each entity pair so as to find more complex relationships, covering not only explicit but also hidden relationships. We select Alzheimer’s disease (AD) as a case study for analysis and evaluation. Alzheimer’s disease is a degenerative brain disorder, whose cause is hard to diagnose accurately. As the number of AD patients has increased, researchers have striven by means of medical experiments and literature analysis to understand the disease’s pathophysiology so as to improve its diagnosis and treatment. For entity extraction, we used two approaches, PKDE4J [22] and SemRep [23]. PKDE4J is an integrated system designed to extract entities and relations from unstructured biomedical text corpora, whereas SemRep, a UMLS-based entity and relation extraction application, can identify semantic relationships in biomedical literature. To evaluate the performance of the proposed method, we compared it with several well-accepted techniques, namely co-occurrence, Word2Vec [17], COALS (Correlated Occurrence Analog to Lexical Semantics) [24], and random indexing (RI) [25]. In addition, to evaluate the usefulness of the proposed method for other types of knowledge discovery, we conducted the following analyses: 1) pathway analysis on the Reactome Pathway database [26] and 2) gene–phenotype relationship analysis on OMIM (Online Mendelian Inheritance in Man) [27]. Overall, the proposed method is able to identify more related genes for pathways than the other methods by differentiating rankings for each gene. The proposed method also finds genes like APOE, which is strongly associated with familial early-onset AD and coronary heart disease [28], through analyses of AD-related genes and the gene–phenotype relationship.
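As a point of reference for how relatedness is commonly read off word embeddings, the following is a minimal sketch of cosine similarity between entity vectors. It is an illustration only: the passage above does not state which similarity function is used, and the three-dimensional vectors and the `embeddings` dictionary are made-up toy values, not output of any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration (assumption: real embeddings have
# hundreds of dimensions and come from a trained model).
embeddings = {
    "APOE": [0.9, 0.1, 0.3],
    "Alzheimer": [0.8, 0.2, 0.4],
    "aspirin": [0.1, 0.9, 0.2],
}

sim_apoe = cosine_similarity(embeddings["APOE"], embeddings["Alzheimer"])
sim_aspirin = cosine_similarity(embeddings["aspirin"], embeddings["Alzheimer"])
print(sim_apoe > sim_aspirin)  # ranking by similarity to the target entity
```

Ranking candidate entities by such a score against a target entity is one simple way an embedding-based component could contribute to a hybrid ranking.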

Methods

The present study comprises four steps: data collection, entity relation extraction, semantic relatedness scoring calculation, and evaluation. For semantic relatedness scoring, we consider both direct and indirect connections; in terms of evaluation, we employ four kinds of analyses, namely algorithm comparison, AD-related gene analysis, pathway analysis, and gene–phenotype relation analysis. Figure 1 illustrates the overall design of this study. A detailed description of the proposed approach is provided in subsequent sections.

Data collection

Using ‘Alzheimer disease’ or ‘Alzheimer’s disease’ as search terms, we retrieved 118,167 abstracts from PubMed, a search engine indexing more than 29 million citations for biomedical literature from MEDLINE. The exact query formulation is “Alzheimer disease [Title/Abstract] OR Alzheimer’s disease [Title/Abstract]”.
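For readers who want to reproduce a similar search programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch service using the query quoted above. The text does not say how the search was actually run, so the use of ESearch, the helper name `build_esearch_url`, and the paging values are assumptions; the sketch only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Query as quoted in the text (with an ASCII apostrophe).
QUERY = "Alzheimer disease [Title/Abstract] OR Alzheimer's disease [Title/Abstract]"

# NCBI E-utilities ESearch endpoint (assumption: one possible way to run the search).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retstart=0, retmax=100):
    """Return an ESearch URL requesting one page of PubMed IDs that match `term`."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # the query string
        "retstart": retstart,  # offset of the first ID to return
        "retmax": retmax,      # number of IDs per page (illustrative value)
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)

url = build_esearch_url(QUERY)
print(url)
```

Fetching every matching abstract would require paging through the results by increasing `retstart`, and the number of hits returned today will differ from the count reported here because PubMed keeps growing.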

We did not limit publication by year, so as to get as much data as possible for our analysis. Figure 2 shows the distribution of the number of papers by publication year from 1990 to January 2019.

Entity relation extraction

For PKDE4J [22], the algorithm used for entity relation extraction can identify the verb located between the two entities in a sentence and capture relational characteristics. To reduce unnecessary indirect connections, we selected entities by type. Since we focus on Alzheimer’s disease, we limited the entity types to gene, drug, and disease. Thus, for entity extraction, we used the following dictionaries: a gene dictionary collected from UniProt [29], MeSH (Medical Subject Headings) for disease [30], KEGG (Kyoto Encyclopedia of Genes and Genomes) for genetics [31], and DrugBank for medications [32]. We used the same data collection as the input for SemRep. As output, we extracted 969,341 entity relations using PKDE4J and 630,054 entity relations using SemRep [23].

Semantic relatedness scoring calculation

We considered both direct and indirect scoring for each entity pair. For the direct score, after we extracted the relations of an entity pair, we looked at the same entity pairs with different relation types appearing in one abstract. An example is shown below: the first column is the PMID (PubMed unique identifier), the second column is sentence location in that abstract, and the last column is entity relations:

19395124 | 8 | MCI | DISEASE | depression | DISEASE | CO-OCCUR |

19395124 | 17 | MCI | DISEASE | depression | DISEASE | RESULT_OF |
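The pipe-delimited records above can be read into a small structure for later counting. The following is a minimal sketch, assuming seven fields per record in the order shown (PMID, sentence location, first entity and its type, second entity and its type, relation type); the `Relation` type and `parse_relation` function are hypothetical names, not part of PKDE4J or SemRep.

```python
from typing import NamedTuple

class Relation(NamedTuple):
    pmid: str       # PubMed unique identifier
    sentence: int   # sentence location within the abstract
    entity1: str
    type1: str
    entity2: str
    type2: str
    relation: str   # e.g. CO-OCCUR, RESULT_OF

def parse_relation(line):
    """Parse one pipe-delimited record; tolerates a trailing '|' or '|.'."""
    fields = [f.strip() for f in line.strip().rstrip(".").split("|")]
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field after a trailing delimiter
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    pmid, sentence, e1, t1, e2, t2, rel = fields
    # Remove any thousands separators so the PMID is a plain digit string.
    return Relation(pmid.replace(",", ""), int(sentence), e1, t1, e2, t2, rel)

rows = [
    "19395124 | 8 | MCI | DISEASE | depression | DISEASE | CO-OCCUR |",
    "19395124 | 17 | MCI | DISEASE | depression | DISEASE | RESULT_OF |",
]
for rec in map(parse_relation, rows):
    print(rec.pmid, rec.sentence, rec.entity1, rec.entity2, rec.relation)
```

Both records share the same PMID and entity pair but differ in sentence location and relation type, which is the situation the direct score is meant to capture.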

Next, we considered only the co-occurrence frequency of entity pairs. There are two kinds of direct relations: 1) co-occurrence of an entity pair more than once within one abstract, noted as ‘sum_same’ in Table 1, and 2) one-time co-occurrence of an entity pair in one abstract, noted as ‘sum_different’ in Table 1. If an entity pair co-occurs only once per abstract, its co-occurrence count equals the number of abstracts in which it appears. Biomedical literature, like any other literature, has a skewed distribution: much research tends to follow popular diseases, drugs, and genes. Because of this tendency, it is hard to identify a new relation by the co-occurrence method. Thus, we aim to find less visible information in biological texts. If an entity pair co-occurs across several abstracts, the relation is more popular and we can infer that it is a well-known entity pair. We give such pairs a low weight, while assigning entity pairs that co-occur repeatedly within the same abstract a higher weight. Table 1 presents pseudocode for our algorithm.
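The two direct-relation counts described above can be illustrated with a short sketch. This is not the pseudocode of Table 1: the function names, the linear combination, and the weight values `w_same` and `w_different` are assumptions chosen only to show a higher weight for repeated co-occurrence within one abstract, and the second and third PMIDs in the sample data are made up.

```python
from collections import Counter, defaultdict

def direct_counts(records):
    """records: iterable of (pmid, entity1, entity2), one per extracted relation."""
    per_abstract = Counter()
    for pmid, e1, e2 in records:
        pair = tuple(sorted((e1, e2)))  # treat the pair as unordered (assumption)
        per_abstract[(pair, pmid)] += 1

    counts = defaultdict(lambda: {"sum_same": 0, "sum_different": 0})
    for (pair, _pmid), n in per_abstract.items():
        if n > 1:
            counts[pair]["sum_same"] += n       # repeated within one abstract
        else:
            counts[pair]["sum_different"] += 1  # appears once in this abstract
    return dict(counts)

def direct_score(c, w_same=2.0, w_different=1.0):
    """Hypothetical weighting: repeated in-abstract co-occurrence counts for more."""
    return w_same * c["sum_same"] + w_different * c["sum_different"]

records = [
    ("19395124", "MCI", "depression"),
    ("19395124", "MCI", "depression"),
    ("20000001", "depression", "MCI"),  # made-up PMID for illustration
    ("20000002", "MCI", "APOE"),        # made-up PMID for illustration
]
counts = direct_counts(records)
for pair, c in counts.items():
    print(pair, c, direct_score(c))
```

Keeping the two counts separate until the final step makes it easy to try other weightings without recounting.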

Table 1 Pseudocode for our algorithm.
