The Metathesaurus of the Unified Medical Language System (UMLS) [1] is a large biomedical thesaurus of concepts from 211 source terminologies (2019 AB release) in 25 different languages. It is organized by linking all names for the same concept under a Concept Unique Identifier (CUI). The Metathesaurus identifies the different relationships between the concepts and also preserves the concept names, concept IDs and the relationships between the concepts in each source terminology. The terminologies in the UMLS differ widely in their domains and application areas. For example, the Logical Observation Identifiers Names and Codes terminology (LOINC®) [2] is a terminology for the standardized exchange of laboratory data, while the Gene Ontology (GO) [3] describes gene products in terms of their associated biological processes, cellular components, and molecular functions. However, there are many terminologies that cover multiple domains. For example, the SNOMED CT [4] provides the core general terminology for Electronic Health Records (EHRs) by organizing concepts into hierarchies (Body structure, Clinical finding, Specimen, etc.) and has over 350,000 unique, active concepts. As a result, there is substantial overlap in the conceptual content between the SNOMED CT and several other terminologies.
Previously, we have observed that when pairs of terminologies in the UMLS have overlap in their conceptual contents, they nevertheless may have notable differences with respect to their vertical and horizontal densities [5,6,7,8]. A vertical density difference occurs when “IS-A”/concept paths of different lengths exist in two terminologies that are constrained by begin/end concepts that are identical in both the terminologies (Fig. 1a). We use the term “density” following Rector et al. [9]. The resulting topological pattern was referred to as a diamond [10]. A horizontal density difference arises out of the fact that the same concept in two different terminologies may have different sets of children in each terminology (Fig. 1b) [8]. These differences raised several questions: (a) are some concepts missing from one terminology, and if so, could these missing concepts be imported into that terminology; (b) are these differences the result of some error in one or both of the terminologies; or (c) are these differences due to concepts in one terminology being synonyms of concepts in the other terminology? Detailed investigations of all such cases were performed in prior research and the results were analyzed by domain experts [5], who confirmed many possible cases of concept import, which in turn results in terminology enrichment.
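To make the notion of a vertical density difference concrete, the following is a minimal sketch and not the method used in the cited studies. It covers only the simplest case, in which one terminology links two concepts directly while the other places further concepts between them. It assumes that each terminology is given as a set of (parent CUI, child CUI) IS-A pairs; the function names `reachable` and `vertical_density_differences` are hypothetical.

```python
from collections import defaultdict


def reachable(graph, start):
    """Return every node that can be reached from start by following graph edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def vertical_density_differences(isa_x, isa_y):
    """Map each direct IS-A link (parent, child) of terminology X to the
    concepts that terminology Y places strictly between parent and child.

    isa_x, isa_y: sets of (parent_cui, child_cui) pairs (an assumed input format).
    """
    down, up = defaultdict(set), defaultdict(set)
    for parent, child in isa_y:
        down[parent].add(child)
        up[child].add(parent)

    result = {}
    for parent, child in sorted(isa_x):
        # Concepts below `parent` and above `child` in Y lie on a longer path.
        between = (reachable(down, parent) & reachable(up, child)) - {parent, child}
        if between:
            result[(parent, child)] = between
    return result
```

Each concept returned for a link is only a candidate for import into terminology X; as in the prior work, the decision would rest with a domain expert.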
This paper explores whether topological patterns analogous to diamonds (Fig. 1a) exist when considering more than two terminologies at a time and whether the resulting patterns suggest possible import of concepts from one terminology into another. While such suggestions should be derived algorithmically, the final decision on an import is always made by a human expert.
One of the possible extensions of the study on vertical density differences involves the concepts in three terminologies as shown in Fig. 2. Consider three terminologies A, B, and C. The concept A1 in terminology A has a child concept A3, the concept B1 in terminology B has a child B2, and the concept C2 in terminology C has a child C3. The concepts A1 and B1 are identical by means of having the same UMLS CUI. Similarly, the concepts B2 and C2 are identical, and so are A3 and C3. It should also be noted that the concept C3 (= A3) does not exist anywhere in terminology B, the concept B2 (= C2) does not exist anywhere in terminology A, and the concept A1 (= B1) does not exist anywhere in terminology C. Looking only at A1, B1, B2, C2, and ignoring that the connections between them are of two different kinds (IS-A versus identity) this identifies a kind of transitivity (Fig. 2) [11].
Because we are chaining together two vertical patterns to jointly achieve a “higher reach,” we are reminded of the extensible ladders carried by fire trucks. Thus, we will refer to the pattern in Fig. 2 as the fire ladder pattern, in contrast to the diamond patterns that we have investigated previously for vertical density (Fig. 1a). We refer to A as the target terminology, to B as the “upper source terminology,” and to C as the “lower source terminology.” The primary questions that arise from Fig. 2 are whether B2 (= C2) should be proposed for import into terminology A, and whether C3 should be recommended for import into terminology B.
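The conditions that define the pattern in Fig. 2 can be written down as a short search procedure. The sketch below is illustrative only and is not the algorithm developed in this paper. It assumes that each terminology is supplied as a set of (parent CUI, child CUI) IS-A pairs, and that a concept "exists" in a terminology if it appears in at least one such pair; the function names are hypothetical.

```python
from collections import defaultdict


def concepts_of(isa_pairs):
    """All CUIs that take part in at least one IS-A pair (assumed notion of existence)."""
    return {cui for pair in isa_pairs for cui in pair}


def find_fire_ladders(isa_a, isa_b, isa_c):
    """Return (top, middle, bottom) CUI triples matching the pattern of Fig. 2.

    top    = A1 = B1: parent in target A and in upper source B, absent from C
    middle = B2 = C2: child of top in B, parent of bottom in C, absent from A
    bottom = A3 = C3: child of top in A, child of middle in C, absent from B
    """
    in_a, in_b, in_c = (concepts_of(t) for t in (isa_a, isa_b, isa_c))

    children_b, children_c = defaultdict(set), defaultdict(set)
    for parent, child in isa_b:
        children_b[parent].add(child)
    for parent, child in isa_c:
        children_c[parent].add(child)

    ladders = []
    for top, bottom in sorted(isa_a):              # A1 -> A3
        if top in in_c or bottom in in_b:
            continue
        for middle in sorted(children_b.get(top, ())):   # B1 -> B2
            if middle in in_a:
                continue
            if bottom in children_c.get(middle, ()):     # C2 -> C3
                ladders.append((top, middle, bottom))
    return ladders
```

For every triple found, `middle` would be a candidate for import into the target terminology A and `bottom` a candidate for import into the upper source terminology B, subject to expert review.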
Thus, in this paper, we quantitatively explore the fire ladder patterns formed by the concepts from 10 different terminologies in the UMLS Metathesaurus. We developed an algorithm that suggests concepts that could potentially be imported into another terminology. We also had two domain experts review the suggestions made by the algorithm for deciding whether the concepts should be imported or not. We note that one other import is suggested by Fig. 2, which we will elaborate on in the Discussion Section.
UMLS
The UMLS Metathesaurus is a large, multi-purpose, and multi-lingual repository of biomedical and health-related terminologies. The Metathesaurus maintains information about concepts, their synonyms and the relationships among them. Similar terms from different source terminologies are organized into a concept that is identified by a Concept Unique Identifier (CUI), e.g. C0018799 stands for Heart diseases. The concepts are linked to each other by means of different relationships identified by a Relationship Unique Identifier (RUI) [12]. All relationships in the Metathesaurus are given a general label (REL), describing the nature of the relationship like Child of, Broader, Qualifier of, etc. Furthermore, about one quarter of the relationships carry an additional label (RELA—Relationship Attribute). Labels are obtained from each source terminology and include, e.g., IS-A, component_of, part_of, etc. For the experiments described in this paper, we used the 2018 AB release of the UMLS with a focus on PAR (Parent of) relationships with an additional inverse_isa Relationship Attribute, together corresponding to what is commonly known as an IS-A link.
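As an illustration of how such IS-A links might be extracted, the following sketch filters pipe-delimited relationship rows. It is not the code used in our experiments. It assumes the column layout of the MRREL.RRF file (CUI1 in field 0, REL in field 3, CUI2 in field 4, RELA in field 7, source abbreviation SAB in field 10) and assumes that REL expresses the relationship of the second concept to the first, so that a PAR row makes CUI2 the parent of CUI1. Both assumptions should be checked against the UMLS documentation for the release in use. The function name `isa_links` is hypothetical.

```python
def isa_links(lines, source):
    """Collect (parent_cui, child_cui) pairs for one source terminology.

    lines:  iterable of pipe-delimited rows in the assumed MRREL.RRF layout
    source: source abbreviation to keep, e.g. "SNOMEDCT_US"
    """
    links = set()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 11:
            continue  # skip rows that are too short to hold the SAB field
        cui1, rel, cui2, rela, sab = (
            fields[0], fields[3], fields[4], fields[7], fields[10]
        )
        # Assumption: REL is read from CUI2 to CUI1, so CUI2 is the parent.
        if sab == source and rel == "PAR" and rela == "inverse_isa":
            links.add((cui2, cui1))
    return links
```

Called once per terminology, for example with an open file handle as `lines`, this would yield one set of parent and child pairs per source, which is a convenient input for comparing hierarchies across terminologies.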
Related work
Density differences
In prior work, we utilized the structure of the UMLS to identify the vertical and horizontal density differences for concepts from pairs of terminologies to find potential concepts for import that could help in achieving semantic harmonization among terminologies. He et al. [5] defined “structurally congruent concepts” and interpreted them in different ways including alternative classifications, synonyms, and errors in a terminology. A definition of alternative classifications is beyond the scope of this paper. This idea was later extended to identify topological patterns called trapezoids or diamonds arising from the vertical density differences, to import missing concepts into the SNOMED CT and National Cancer Institute Thesaurus (NCIt) [6, 7, 13]. A quantitative analysis of the difficulty in importing the pattern-based concepts was also performed [10, 14]. We subsequently proposed a metric for identifying likely cases of alternative classifications using horizontal density differences [15].
Sun and Zhang’s method for identifying granularity differences and similarities between biomedical ontologies uses a rule-based approach, where a rule inference engine constructs rules to explore structural incompatibilities [16, 17]. Luo et al. [18] proposed “parallel concept sets (PCS)” to identify the granularity balance of IS-A and part_of relationships within one biomedical ontology, while we always worked with pairs of ontologies.
Ontology matching/alignment
Ontology alignment is the process of finding semantic correspondences between different ontologies [19,20,21]. The mappings are usually based on concept names, definitions, and relationships between concepts in the ontologies. Most research in this field focuses on identifying 1:1 correspondences between concepts in different ontologies [22, 23]. For example, Bodenreider et al. [24] reported alignment of mouse and human anatomies by investigating the NCIt (for the human anatomy) and the Adult Mouse Anatomical Dictionary. Certain complex correspondences (1:n and m:n) [25] and ternary compound alignments [26] were also reported in targeted studies.
For applications involving pairs of (or, less often, multiple) ontologies, alignment/matching techniques help ensure interoperability by establishing semantic mappings between the ontologies. In contrast, our techniques, which rely on density differences, help identify concepts that are potentially missing in one ontology. Those concepts could be imported from one ontology into another whenever a human expert agrees.
Ontology quality assurance and semantic enrichment
Quality assurance is an important part of the ontology life cycle and has been widely studied [27,28,29,30,31]. Different studies have focused on different aspects such as structural relationships (e.g. IS-A, part-of), semantic type assignments, and different methodologies (e.g. lattice-based [32], abstraction-network-based [33] etc.). Several studies have focused on lattice-based structural auditing, as the hierarchical structure of an ontology is expected to be a lattice, as a criterion for its well-formedness [32, 34]. Zhu et al. [35] compared the subsumption relationship between FMA and SNOMED CT’s Body Structure hierarchy, to understand structural disparities and analyze the non-lattice fragments in SNOMED CT. Zhang and Bodenreider [32] proposed a lattice-based approach for exhaustive auditing of SNOMED CT, while Zhu et al. [36] used concept lattices for evaluating the semantic completeness of SNOMED CT.
While most studies focused on auditing a single ontology, Cui [37] proposed a cross-ontology method for identifying inconsistencies and errors across multiple ontologies in the UMLS. Even though the direct goal of our methods [7, 8, 15], based on density differences, was not quality assurance, as a by-product these methods have identified inconsistencies and errors in different ontologies. Conversely, Zhang and Bodenreider [32] reported that lattice-based studies for auditing ontologies are in turn effective in identifying potentially missing precoordinated concepts in SNOMED CT for semantic enrichment. While our methods identify already existing concepts in other ontologies that are missing in the target ontology, the lattice-based approaches identify precoordinated concepts which, when introduced, will turn non-lattice fragments into lattice-conforming structures that are ontologically well-formed.