In this section we present our EL approach to the biomedical domain. A more detailed description of this system, with applications to other domains, can be found in [26].
Overview
In the discussion that follows, we first define some basic concepts, notation, and preliminary background, and then give an overview of the EL system. The entity mentions m ∈ M are the prominent phrases in the full text of a scientific paper. We consider all classes, properties, and individuals described in the ontologies e ∈ E to be the reference entities, which are used to ground the entity mentions. Each entity is described by a surface form dictionary that contains all phrases matching its string. For example, the entity "IKK" is an entry in E, whereas an occurrence of "IKK" in a scientific paper is an entity mention. Furthermore, an occurrence of "IκB kinase" is one surface form of "IKK" because it is a synonym of "IKK".
The overall approach is depicted in Figure 1. We first construct a knowledge base (described in the following section). Next, given a textual document d, we extract the entity mentions M = {m1, m2, ..., mn} as described in section 3.2. We then construct a graph representation Gd = 〈V, R〉 for d, where V = {v1, v2, ..., vn} is the set of vertices, each vertex v representing an entity mention in d, and R = {r1, r2, ..., rn} is the set of edges. (Note: Gd refers to the graph of document d, whereas Gk refers to the graph of the knowledge base.) Two vertices v1 and v2 are connected by an edge, denoted ε(v1, v2, r), if and only if the entity mentions for v1 and v2 are related to each other. Here, such a relation is obtained by analyzing the document d; in this work, we extract relations based on sentence-level or paragraph-level co-occurrence. Then, for each entity mention m in Gd, we use the surface form dictionary to locate a list of candidate entities c ∈ C and compute an importance score using the non-collective approach detailed in section 3.5. Finally, we compute a similarity score for each entity mention/candidate entity pair 〈m, c〉 and select the candidate with the highest score as the appropriate entity for linking.
Knowledge Base graph construction
We utilize a very broad definition of a Knowledge Base (KB). A Knowledge Base is a data set that contains some, potentially limited, structured content along with unstructured content.
Using this broad definition, Wikipedia is a popular knowledge base that is often used for entity linking because it contains structured information, such as titles, hyperlinks, and infoboxes, as well as unstructured text. However, in order to take advantage of richer structure and domain knowledge not offered by Wikipedia, we constructed a knowledge base from 300 biology-related ontologies from BioPortal [6]. Based on the rich structure contained in these ontologies, we created a web of data (WOD). In the WOD, each entity e is described by a set of triples t ∈ T. For example, the triple (_:Nucleus, _:PartOf, _:CellComponent) indicates that the entity "nucleus" is "part of" the entity "cell component".
Our expanded knowledge base E was constructed using a graph-based approach. E consists of the classes, individuals, and properties in the aggregated ontologies. Each entity e is regarded as a vertex in the knowledge graph Gk. Using our WOD, each entity is connected to other entities via a set of triples T; these connections are regarded as the edges of Gk. For example, the entities "phosphorylating", "IKK", and "IκB kinase activity" contained in the Gene Ontology [27] are treated as vertices of our graph. The triples (_:IκB kinase activity, _:subClassOf, _:phosphorylating) and (_:IκB kinase activity, _:relatedTo, _:IKK) are treated as edges between the vertex "IκB kinase activity" and other vertices in our graph.
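As a concrete illustration, the construction of Gk from a set of triples can be sketched as follows. The triples and identifiers below are illustrative stand-ins, not drawn from the actual BioPortal ontologies:

```python
from collections import defaultdict

# Hypothetical (subject, property, object) triples from the WOD.
triples = [
    ("IkB kinase activity", "subClassOf", "phosphorylating"),
    ("IkB kinase activity", "relatedTo", "IKK"),
    ("Nucleus", "PartOf", "CellComponent"),
]

def build_knowledge_graph(triples):
    """Treat each entity as a vertex of Gk and each triple as a labeled edge."""
    graph = defaultdict(list)       # vertex -> list of (property, neighbor)
    for s, p, o in triples:
        graph[s].append((p, o))
        graph[o].append((p, s))     # store the reverse direction as well
    return graph

gk = build_knowledge_graph(triples)
print(gk["IkB kinase activity"])
# [('subClassOf', 'phosphorylating'), ('relatedTo', 'IKK')]
```

Here "IκB kinase activity" ends up connected to both "phosphorylating" and "IKK", mirroring the edges described above.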
Mention extraction
The focus of this paper is to link identified mentions to concepts in the knowledge base. Therefore, to identify prominent mentions in unstructured text, we apply various publicly available natural language processing tools. First, a name tagger [28] is used to extract entity mentions. Regular expressions are then used to join named entities that may have been tagged separately, by looking for intervening prepositions, articles, and punctuation marks. Next, a shallow parser [29] is used to add noun phrase chunks to the list of mentions. A parameter controls the minimum and maximum number of chunks per mention (one and five by default) and whether overlapping mentions are allowed. Although domain-specific named entity recognition could improve the overall performance of the system, we did not investigate it here, since our focus is on the entity linking problem.
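The mention-joining step can be illustrated with a small regex-based sketch. The sentence, tagger output, and glue-word list below are hypothetical stand-ins for the cited tools:

```python
import re

# Toy output of a name tagger on one sentence (hypothetical spans).
sentence = "Inhibitor of kappa B is phosphorylated by the IkB kinase complex."
entities = ["Inhibitor", "kappa B", "IkB kinase complex"]

def join_mentions(sentence, entities):
    """Merge tagged entities separated only by prepositions, articles,
    or punctuation into a single mention."""
    glue = r"(?:\s+(?:of|in|the|a|an|to|for|[,\-]))*\s+"
    merged, i = [], 0
    while i < len(entities):
        mention = entities[i]
        while i + 1 < len(entities):
            hit = re.search(
                re.escape(mention) + glue + re.escape(entities[i + 1]), sentence
            )
            if not hit:
                break
            mention = hit.group(0)   # extend the mention over the glue words
            i += 1
        merged.append(mention)
        i += 1
    return merged

print(join_mentions(sentence, entities))
# ['Inhibitor of kappa B', 'IkB kinase complex']
```

"Inhibitor" and "kappa B" are joined because only the preposition "of" intervenes, while "IkB kinase complex" stays separate.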
Entity candidate retrieval
By analyzing the triples describing the entities, we also construct a surface form dictionary (f, {e1, e2, ..., ek}), where {e1, e2, ..., ek} is the set of entities with surface form f. We analyzed the following main properties: labels and names (e.g. rdfs:label), synonyms (e.g. the exact synonym property from the Gene Ontology), and aliases and symbols (e.g. from the Orphanet ontology), giving us more than 150 properties for constructing the surface form dictionary. During the candidate retrieval process, we retrieve all entities with surface forms similar to a mention's surface form and consider them as candidates for that mention.
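A minimal sketch of this dictionary construction follows; the entity identifiers, property names, and triples are illustrative examples rather than the actual 150-plus properties used:

```python
from collections import defaultdict

# Hypothetical (entity, property, value) triples carrying naming information.
naming_triples = [
    ("GO:0008384", "rdfs:label", "IkB kinase activity"),
    ("GO:0008384", "oboInOwl:hasExactSynonym", "IKK"),
    ("GO:0005634", "rdfs:label", "nucleus"),
    ("UBERON:0002308", "rdfs:label", "neural nucleus"),
    ("UBERON:0002308", "synonym", "nucleus"),
]

# Illustrative subset of the naming properties analyzed.
NAMING_PROPERTIES = {"rdfs:label", "oboInOwl:hasExactSynonym", "synonym"}

def build_surface_form_dictionary(triples):
    """Map each surface form f to the set of entities {e1, ..., ek} it names."""
    dictionary = defaultdict(set)
    for entity, prop, value in triples:
        if prop in NAMING_PROPERTIES:
            dictionary[value.lower()].add(entity)
    return dictionary

sfd = build_surface_form_dictionary(naming_triples)
print(sorted(sfd["nucleus"]))
# ['GO:0005634', 'UBERON:0002308']
```

Note that one surface form ("nucleus") can map to several entities; disambiguating among them is exactly the job of the ranking steps below.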
Non-collective entropy rank
The candidate entities retrieved from the knowledge base are pre-ranked using an entropy-based non-collective approach. The main idea of the algorithm is to assign higher scores to more popular entities. While entities in Wikipedia are universally connected by the same type of link, entities in the ontologies are potentially connected by many kinds of links that may have semantically rich definitions. We can leverage this greater degree of specificity and assign different weights to edges described by different properties. For example, consider the triples (_:IKK, _:isCapableOf, _:phosphorylation) and (_:IKK, _:locatedIn, _:cytoplasm). Since "phosphorylation" and "cytoplasm" are connected to "IKK" by different relations, we consider their influence on the importance of "IKK" to be different.
To capture such differences in influence, we compute the entropy of a relation H(p) [30] as

H(p) = −Σop∈Op Ψ(op) log Ψ(op)   (1)

where p ∈ P is a property or relation that has a value op ∈ Op or links to an object op ∈ Op, and Ψ(op) is the probability of obtaining op given the property p. The entropy measure has been used in many ranking algorithms to capture the salience of information [31, 32]; therefore, in our task, we use it to capture the salience of a property. In the previous example, p indicates "is capable of" and "located in", while op indicates "phosphorylation" and "cytoplasm", respectively. Then H("is capable of") and H("located in") are the influence factors between "IKK" and "phosphorylation", and between "IKK" and "cytoplasm", respectively.
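Equation (1) can be computed directly from the triples. In the sketch below, Ψ(op) is estimated as the relative frequency of each object value op among the triples using property p (an assumption; the text does not specify how Ψ is estimated), and the triples are illustrative:

```python
import math
from collections import Counter

# Hypothetical triples over two properties.
triples = [
    ("IKK", "locatedIn", "cytoplasm"),
    ("STAT3", "locatedIn", "cytoplasm"),
    ("STAT3", "locatedIn", "nucleus"),
    ("IKK", "isCapableOf", "phosphorylation"),
]

def entropy(prop, triples):
    """H(p) = -sum over op in Op of Psi(op) * log Psi(op), Equation (1)."""
    objects = [o for _, p, o in triples if p == prop]
    counts = Counter(objects)
    total = len(objects)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(entropy("locatedIn", triples))    # mixed object values -> positive entropy
print(entropy("isCapableOf", triples))  # a single object value -> zero entropy
```

A property whose objects are highly varied (high entropy) is treated differently from one that always points to the same object, which is what lets the ranker weight edges by their relation type.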
We then compute the salience score of candidate entities using the following non-collective EntropyRank:
(2)
where Pc is the set of properties describing a candidate entity c, and the normalizing term is the number of entities linked to it. The EntropyRank for each entity starts at 1 and is updated recursively until convergence. This equation is similar to PageRank [33], which gives higher ranks to popular entities, but we also take the differing influence of neighboring nodes into consideration.
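Since Equation (2) itself is not reproduced in this excerpt, the sketch below only illustrates the stated intuition: a PageRank-style recursive update in which each incoming edge is weighted by the entropy H(p) of its property. All names, weights, and the exact functional form are assumptions for illustration:

```python
from collections import Counter

def entropy_rank(edges, entropy_of, iterations=50, damping=0.85):
    """edges: (source, property, target) links in the knowledge graph.
    entropy_of: precomputed H(p) for each property (Equation (1))."""
    out_degree = Counter(s for s, _, _ in edges)
    nodes = {n for s, _, t in edges for n in (s, t)}
    rank = {n: 1.0 for n in nodes}          # ER starts at 1 for every entity
    for _ in range(iterations):
        new_rank = {n: 1.0 - damping for n in nodes}
        for s, p, t in edges:
            # A neighbor contributes its rank, scaled by the entropy of the
            # connecting property and split over its outgoing links.
            new_rank[t] += damping * entropy_of[p] * rank[s] / out_degree[s]
        rank = new_rank
    return rank

edges = [("A", "rel", "B"), ("C", "rel", "B"), ("B", "rel", "C")]
ranks = entropy_rank(edges, {"rel": 1.0})
# "B" has two in-links, so it ends up with the highest rank.
```

With uniform entropy this reduces to plain PageRank; differing H(p) values are what distinguish, say, an "is capable of" edge from a "located in" edge.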
As described previously, the candidate entities are retrieved from the surface form dictionary and ranked by the above salience measure. Often, an exact surface form match between an entity mention and a candidate entity cannot be found. However, our ranking model allows partial surface form matches with a penalty. Currently we use Jaccard similarity to compute partial match scores; for example, Jaccard similarity will be computed for the mention "nucleus" and the entity "neural nucleus". In the equation below, JS(m, c) is the Jaccard similarity score between the surface form of entity mention m and the surface form of candidate entity c.
(3)
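The partial-match penalty can be sketched as follows. The text does not specify whether Jaccard similarity is computed over tokens or characters; this sketch assumes token sets:

```python
def jaccard_similarity(mention, entity_surface_form):
    """Token-level Jaccard similarity: JS = |A ∩ B| / |A ∪ B|."""
    a = set(mention.lower().split())
    b = set(entity_surface_form.lower().split())
    return len(a & b) / len(a | b)

print(jaccard_similarity("nucleus", "nucleus"))         # exact match  -> 1.0
print(jaccard_similarity("nucleus", "neural nucleus"))  # partial match -> 0.5
```

An exact match scores 1.0, while the partial match "nucleus" vs. "neural nucleus" is penalized to 0.5, which then scales the candidate's salience score.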
Collective inference
In the non-collective inference approach, each entity mention is analyzed, retrieved, and ranked individually. Although this approach performs well in many cases, incorrect entity mention/entity links are sometimes formed due to a lack of context information. Therefore, we adopt a collective inference approach, which analyzes relations among multiple entity mentions and ranks their candidates simultaneously. For example, given a sentence that contains the entity mentions "phosphorylating" and "IKK", the collective approach will analyze the two mentions simultaneously to determine the best reference entities.
In Section 3.1, we presented how we construct the document graph Gd. Using the connected Gd and the candidate entities retrieved by the non-collective approach, we can compute the similarity between each entity mention m from Gd and a candidate entity c from Gk. Both m and c are connected to sets of neighbor nodes, which provide important contextual descriptions of the mention m and the candidate entity c, respectively. We then use the following equation to compute the similarity score:
(4)
Here, the shared set is the set of neighbors with equivalent surface forms between the Gk subgraph for candidate c and the Gd subgraph for mention m. The parameters α and β are used to adjust the effects of the candidate pre-ranking score and the context information score on the overall similarity score. Based on the optimization results reported by Zheng et al. [26], we empirically set α = 15 and β = 8 for all experiments. The equation captures two important ranking intuitions: 1. the more popular a candidate c is, the higher it will be ranked, as captured by ER; 2. the more similar the Gk subgraph for c and the Gd subgraph for m are, the higher the rank given to c, as captured by the latter part of the equation.
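Because Equation (4) is not reproduced in this excerpt, the sketch below only illustrates the two intuitions it encodes: a popularity term weighted by α and a shared-neighbor context term weighted by β. The linear combination, the ER values, and the context sets are all illustrative assumptions:

```python
# α and β follow the values reported above; the functional form is assumed.
ALPHA, BETA = 15.0, 8.0

def similarity(er_score, mention_neighbors, candidate_neighbors):
    """mention_neighbors / candidate_neighbors: sets of neighbor surface forms
    from the Gd subgraph of m and the Gk subgraph of c, respectively."""
    shared = mention_neighbors & candidate_neighbors
    return ALPHA * er_score + BETA * len(shared)

# Hypothetical context for a mention of "STAT3" in a sentence.
m_ctx = {"phosphorylation", "ikk", "nucleus"}
print(similarity(0.4, m_ctx, {"phosphorylation", "ikk"}))  # "Activated STAT3"
print(similarity(0.5, m_ctx, {"dna binding"}))             # plain "STAT3"
```

Even with a lower pre-ranking score, the candidate sharing more context with the mention's neighbors wins, which is the behavior the collective approach is designed to produce.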
To better describe the use of this system in the life science domain, we provide an illustrative example in Figure 2. For the example sentence provided, the document graph Gd has vertices V that correspond to the entity mentions M. For this sentence-level collective inference approach, there exist edges between all vertices, since these mentions co-occur in the sentence. We then retrieve our knowledge graph Gk from our knowledge base. Focusing our attention on the reference entity "STAT3", a term-level search returns the candidate "STAT3". However, because "Activated STAT3" is connected to more vertices of Gk, it is intuitive that this candidate's rank increases under collective inference. Furthermore, although the candidate "Neural Nucleus" is indirectly linked to "Nerve Impulse", which is in turn linked to the candidate "Nervous Tissue", the isolation of "Neural Nucleus" from the candidates of other entities enables the candidate entity "Cell Nucleus" to obtain the highest rank.