OntoKeeper
OntoKeeper is a Java-based web application that analyzes ontology files (.owl or .rdf) using the semiotic metrics from Burton-Jones and colleagues. OntoKeeper is the latest evolution of the authors' previous tool, SEMS [2]. The current version refines the metric calculations, improves the interface and functionality, and incorporates a natural language generation feature harnessed from the Hootation API library [31].
Application architecture
OntoKeeper was developed with the Vaadin Java web framework (v7.7), along with various third-party API components to provide specific functionality. OntoKeeper also utilized a PostgreSQL database (v9.5.8) to store basic application data and natural language statements. The test version of OntoKeeper was deployed on a Jetty web server (v9), hosted on an Ubuntu v16.04.3 LTS machine (4GB RAM and dual CPU cores). OntoKeeper was primarily developed by one of us (MA) and evolved from the previous iteration of the tool mentioned in [2]. Figure 1 summarizes the main components and their interactions.
In the figure (Fig. 1), the SEMS Service was a port of the code from the original SEMS web application. It was partitioned into components (Syntactic, Semantic, and Pragmatic) that were responsible for calculating each of the metrics and sub-metrics, except for the Social component module, which is inactive.
Each of the metrics modules heavily relied on the Ontology Service component to parse meta-data and label information from the ontology. The Ontology Service interfaced with either an ontology artifact that had been uploaded (Ontology file) or an ontology that had previously been uploaded and stored in the database through the Database Service. The Ontology Service also required the OWL-API to access functionality for parsing an ontology artifact.
In addition, the WordNet Service relied on the Ontology Service to access the label-related information of the ontology. The WordNet Service utilized the MIT JWI, a Java WordNet interface, to query a WordNet database [32]. The WordNet Service primarily provided the word sense information for each token from the labels.
The NLG Service was primarily responsible for the natural language generation of the ontology. It accessed the ontology either through the database (Database Service) or the uploaded ontology (Ontology Service). It also saved the natural language sentences for each of the triples through the Database Service.
Aside from providing services to other components, the Database Service was also leveraged by the application (the OntoKeeper UI component), which stored and retrieved the application data needed to function.
Application navigation
Regarding interface design, we aimed to keep the tool simple and easy to navigate across the various scores, and to minimize the amount of information displayed to avoid cognitive overload. The tool was also designed to be responsive to various devices, so in the later part of this section we present screenshots of the mobile version. The following section describes the interface, starting from the login screen and ending with the screen that shows the final quality score. We also introduce the interface that external domain experts use to judge the accuracy of the knowledge embedded in the ontology.
After navigating to the URL of the application, the ontologist user will encounter the login screen and will be prompted for their username and password (Fig. 2).
Figure 3 shows the next screen the user views after successfully logging in. The entire application has a visible sidebar menu that allows the user to navigate between the different sections of the application. The Introduction screen, which greets the user after login, has three tabs. The first tab contains a short video demonstrating a quick use-case of the tool. The second tab permits the user to change their username or password, and the third tab shows saved snapshots of scores from previous sessions.
The Configuration screen (Fig. 4) is where the user starts the process of attaining scores for their ontology, or imports an ontology they have previously uploaded. Any uploaded ontology file is saved into the database automatically for later retrieval. The first panel has two tabs, Upload Ontology and Select an Ontology. The former is where the user chooses the ontology from their machine and uploads it to the server. The other tab presents the user with a list of ontologies they have previously uploaded; the user can select an ontology and click Import to load it. Currently, we advise users to merge their ontology (via the Protégé editor) if it imports external ontologies, because the system calculates the scores based only on what is local to the file and does not follow OWL imports. With a merged ontology, the entities and properties from the imports are considered in the scoring. In the future, we plan to add support for automatically importing the external ontologies.
The other panels on the Configuration screen include the Ontology Status panel, which indicates whether the ontology has been loaded and offers the option to remove the ontology from the session. The Excluded Aspects panel allows users to exclude scores from the four aspects of syntactic, semantic, pragmatic, and social. The Parsing Options panel gives users control over how non-alphanumeric characters are parsed. By default all the options – fixing camel cased labels, removing determiners, brackets, underscores, and dashes – are selected.
After the ontology has been loaded and the session configured, the next screen is the Processing screen (Fig. 5). Once the Process button is clicked, the labels of the ontology are output and displayed in a grid showing the original label and the post-processed label based on the configuration. The grid also shows the number of word senses each label has, based on the WordNet database. For labels with multiple tokens, the word senses of each token are added to form an accumulated word sense total.
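To illustrate the default parsing options, below is a minimal sketch of the kind of label normalization described above; the method name, the determiner list, and the example label are illustrative assumptions, not OntoKeeper's actual implementation.

```java
import java.util.Arrays;
import java.util.List;

public class LabelNormalizer {

    // Illustrative determiner list; OntoKeeper's actual list may differ.
    private static final List<String> DETERMINERS = Arrays.asList("a", "an", "the");

    // Splits camel-cased labels, strips brackets/underscores/dashes,
    // and drops determiners, mirroring the default parsing options.
    public static String normalize(String label) {
        String s = label
                .replaceAll("([a-z])([A-Z])", "$1 $2")  // fix camel case: "HeartRate" -> "Heart Rate"
                .replaceAll("[\\[\\](){}]", " ")         // remove brackets
                .replaceAll("[_-]", " ")                 // underscores and dashes to spaces
                .trim()
                .replaceAll("\\s+", " ");                // collapse repeated whitespace
        StringBuilder out = new StringBuilder();
        for (String token : s.split(" ")) {
            if (!DETERMINERS.contains(token.toLowerCase())) {
                out.append(token).append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("the_HeartRate(measurement)")); // "Heart Rate measurement"
    }
}
```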
Syntactic calculation
The syntactic score (Eq. 1) is composed of the lawfulness (SL) and richness (SR) scores. Lawfulness is calculated by attaining the total number of axioms (logical and non-logical), derived from the OWL-API [17]. By instantiating the OWL2DLProfile class of the OWL-API, we also collected the number of violations. That count is divided by the total number of axioms, resulting in the lawfulness score.
$$ \begin{aligned} S &= w_{s_{1}}*SL + w_{s_{2}}*SR \\ SL &= sl_{v}/AX \\ SR &= sr_{features}/sr_{total\_features} \\ \text{let } AX &\text{ represent all logical and non-logical axioms} \end{aligned} $$
(1)
For richness, we used the OWL-API to determine the number of features of the ontology language used in the ontology being evaluated. This was then divided by the number of possible features, which for OWL is 39. The quotient is the richness score.
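As a sketch of this calculation using the OWL-API calls named above, the snippet below computes lawfulness from OWL2DLProfile violations as in Eq. 1. Counting distinct axiom types as a proxy for "features used", the file name, and the equal default weights are our assumptions, not necessarily OntoKeeper's internals.

```java
import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.profiles.OWL2DLProfile;
import org.semanticweb.owlapi.profiles.OWLProfileReport;

public class SyntacticScore {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(new File("example.owl"));

        // Lawfulness (SL): OWL 2 DL profile violations over all axioms (Eq. 1).
        OWLProfileReport report = new OWL2DLProfile().checkOntology(ontology);
        double violations = report.getViolations().size();
        double totalAxioms = ontology.getAxiomCount(); // logical and non-logical axioms
        double lawfulness = violations / totalAxioms;

        // Richness (SR): features used divided by the 39 possible OWL features.
        // Using distinct axiom types as the "features used" count is an assumption.
        double featuresUsed = ontology.getAxioms().stream()
                .map(ax -> ax.getAxiomType())
                .distinct()
                .count();
        double richness = featuresUsed / 39.0;

        double w1 = 0.5, w2 = 0.5; // assumed equal default weights
        System.out.printf("SL=%.3f SR=%.3f S=%.3f%n",
                lawfulness, richness, w1 * lawfulness + w2 * richness);
    }
}
```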
Figure 6 shows the Syntactic screen that displays the scores related to the syntactic measures. The two tabs relate to the syntactic measures of lawfulness and richness, and each panel displays the score for its measure along with a simple explanation. The other panel contains slider widgets that allow the user to diminish or strengthen the weight of each score.
Semantic calculation
The semantic score (Eq. 2) relies on the OWL-API and WordNet [33] to derive the number of word senses that each word has. For the interpretability score (EI), we took the unique words from all of the labels parsed from the ontology. For each unique word, we used WordNet to determine whether the word has at least one word sense, and recorded the total number of such words. That total was divided by the total number of unique words in the ontology, and the resulting value subtracted from 1 to provide the interpretability score.
$$ \begin{aligned} E &= w_{e_{1}}*EI + w_{e_{2}}*EC + w_{e_{3}}*EA \\ EI &= 1-(t_{sense}/t)\\ EA &= 1-(t_{avg\_sense}/t)\\ EC &= 1-(d/t)\\ \text{let } t &= \text{unique tokens} \subset \text{ontology labels},\\ t_{sense} &= \text{total tokens with at least one word sense},\\ t_{avg\_sense} &= \text{average word senses per token},\\ d &= \text{non-unique (duplicate) tokens} \subset \text{ontology labels} \end{aligned} $$
(2)
For clarity (EA), we took the average number of word senses per unique word and divided that value by the total number of unique words. That result was subtracted from 1 to obtain the clarity score.
The consistency score (EC) is calculated by counting the number of duplicate words and dividing that figure by the total number of unique words. That value is subtracted from 1 to attain the consistency score.
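The following sketch approximates the three semantic sub-scores using the MIT JWI library mentioned earlier; the WordNet path and token list are hypothetical, and summing senses across all parts of speech is an assumption about how senses are tallied.

```java
import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.POS;
import java.io.File;
import java.util.*;

public class SemanticScores {

    // Counts WordNet senses for a token across all parts of speech.
    static int senseCount(IDictionary dict, String token) {
        int senses = 0;
        for (POS pos : POS.values()) {
            IIndexWord idx = dict.getIndexWord(token, pos);
            if (idx != null) senses += idx.getWordIDs().size();
        }
        return senses;
    }

    public static void main(String[] args) throws Exception {
        // Path to a local WordNet "dict" directory is an assumption.
        IDictionary dict = new Dictionary(new File("/usr/share/wordnet/dict"));
        dict.open();

        // Hypothetical token list parsed from ontology labels.
        List<String> tokens = Arrays.asList("heart", "heart", "rate", "measurement");
        Set<String> unique = new LinkedHashSet<>(tokens);

        int withSense = 0, totalSenses = 0;
        for (String token : unique) {
            int n = senseCount(dict, token);
            if (n >= 1) withSense++;
            totalSenses += n;
        }
        double t = unique.size();
        double ei = 1.0 - (withSense / t);                       // interpretability (Eq. 2)
        double ea = 1.0 - ((totalSenses / t) / t);               // clarity: avg senses / unique tokens
        double ec = 1.0 - ((tokens.size() - unique.size()) / t); // consistency: duplicates / unique tokens
        System.out.printf("EI=%.3f EA=%.3f EC=%.3f%n", ei, ea, ec);
        dict.close();
    }
}
```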
Similarly, the Semantic screen (Fig. 7) has the same slider widgets to modulate the three semantic scores of interpretability, consistency, and clarity. There are also three tabs, one for each of those scores, with an explanation of each score.
Pragmatic calculation
We used the OWL-API to collect the number of classes, instances, data properties, and object properties. The total of these four element counts is the number of elements used to calculate the comprehensiveness score (PO) for the pragmatic score (P). Also needed is the average number of elements (classes, instances, data properties, and object properties) from a group or library of representative ontologies. The total number of elements from the ontology being assessed is divided by this average. For example, if we have a food-related ontology, we would require the average number of classes, instances, and properties from similar food-related ontologies that are available; if similar ontologies are scarce, we could attain the average from a general ontology repository/library, like NCBO BioPortal. The resulting quotient is the comprehensiveness score of the ontology.
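A minimal sketch of the comprehensiveness calculation with OWL-API signature counts follows; the library average is supplied by the user, as described above.

```java
import org.semanticweb.owlapi.model.OWLOntology;

public class ComprehensivenessScore {

    // CIDO_n: classes, instances, data properties, and object properties (Eq. 3).
    public static int elementCount(OWLOntology ontology) {
        return ontology.getClassesInSignature().size()
                + ontology.getIndividualsInSignature().size()
                + ontology.getDataPropertiesInSignature().size()
                + ontology.getObjectPropertiesInSignature().size();
    }

    // libraryAverage is the user-supplied average element count from a set
    // or library of representative ontologies.
    public static double comprehensiveness(OWLOntology ontology, double libraryAverage) {
        return elementCount(ontology) / libraryAverage;
    }
}
```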
The accuracy score (PU) relies on the Hootation API (see “Hootation” section) and external experts. All of the logical axioms from the ontology are translated into natural language, and the external domain experts assess whether each statement is true or false. The number of true statements, divided by the number of logical axioms, produces the accuracy score.
$$ {\begin{aligned} P &= w_{p_{1}}*PO + w_{p_{2}}*PU + w_{p_{3}}*PR\\ PO &= CIDO_{n}/CIDO_{avg}\\ PU &= AX_{true}/AX_{logical}\\ \text{let } CIDO_{n} &= \text{count of classes, instances, data properties,}\\ &\quad\text{and object properties in the ontology},\\ CIDO_{avg} &= \text{average } CIDO_{n} \text{ over a set or library of ontologies},\\ AX_{logical} &\subset AX \text{, the logical axioms among all axioms},\\ AX_{true} &= \text{number of } AX_{logical} \text{ whose natural language}\\ &\quad\text{translations were judged true by the experts} \end{aligned}} $$
(3)
The relevancy score (PR) is not supported in OntoKeeper, as it is a score specific to a use-case defined by the evaluator. For example, an evaluator may create a set of competency questions and calculate the percentage of adherence to those questions to determine the relevancy score. Relevancy is understood as a score that measures performance of a task, specifically a user-defined task.
Most of the calculations are automated, but the pragmatic scoring is a bit more involved. Figure 8 shows the Pragmatic screen, which, like the previous screens, has slider widgets to control the influence of the pragmatic scores, and three tabs, one for each pragmatic score. The first tab, for comprehensiveness, displays its score and has a text field for the user to input the average number of ontology elements (classes, properties, and instances). This average value may vary depending on the ontologies being compared. In our previous study [34], we noted that this number can vary widely (e.g., 1,277,993 for NCBO BioPortal, 169,862 for a set of drug ontologies). In [13], Burton-Jones et al. used 500, but over the last decade the size of ontologies has greatly increased, and the comprehensiveness score may yield a value greater than 1. Our practice, which is also recommended by [13], is to collect a set of ontologies of a similar domain and record the total number of elements to input.
The second tab of the Pragmatic screen is more involved. Like all the other tabs, it displays information about the sub-score, but it also has functionality to enlist volunteer domain experts to assess the truthfulness of the ontology. The Preview Statements button allows the user to view the list of natural language statements generated from the ontology's axioms (see Fig. 9). This Review screen has the same UI that the enlisted domain experts will experience (see Fig. 10). As shown in Fig. 11, there is also a panel labeled Subject Matter Volunteers, in which the user adds the domain experts to be sent an invitation to examine the user's ontology. From this panel, the user can remind the volunteers to participate and also view each volunteer's private link to their unique grid for reviewing the ontology (Fig. 10). In the review, the volunteer can indicate whether each statement is true or false and add any notes.
Hootation
The Hootation API is a Java library based on natural language generation (NLG) components from the Agile Knowledge Engineering and Semantic Web Group's semantic web application for generating quiz questions [35]. At the time of our past study [31], only 14 logical axiom types were supported, but Hootation currently supports 25 logical axiom types.
A few of the metrics provided through OntoKeeper require external participants and resources. One such metric (accuracy) needs domain experts to assess the veracity of the triples in the ontology, yet most domain experts are not familiar with ontology languages or tools. Because the knowledge triples are expressed in description logic, exporting the logical axioms to human-readable language makes them accessible to domain experts with little ontology experience.
Social calculation
Due to technical limitations, the current iteration of OntoKeeper does not calculate the social score (Eq. 4). The score is composed of the authority score (OT) and the history score (OH). The authority score is based on the number of ontologies in a certain library that link to the ontology, and the history score is based on the average number of times the ontology has been accessed within a library of ontologies.
$$ O=w_{o_{1}} * OT + w_{o_{2}} * OH $$
(4)
Overall quality calculation
The overall quality (Eq. 5) is a composite score of the syntactic (S), semantic (E), pragmatic (P), and social (O) scores. Each score is modulated with weights (\( w_{q_{n}} \)) to balance their degree of strength. In a previous publication, we noted how the weights can be leveraged to provide a more accurate composite score among similar ontologies [34].
$$ Q=w_{q_{1}}*S + w_{q_{2}}*E + w_{q_{3}}*P + w_{q_{4}}*O $$
(5)
The final screen of importance is the Summary section (Fig. 12). This screen displays the overall quality score along with visualizations indicating the score for each quality aspect. As noted earlier, the social score is not supported and is thus grayed out on the UI. As on the sub-score screens, the user has the option to adjust the strength of each score. In Fig. 12, for demonstration purposes, the syntactic score is weighted at 0.15, the semantic score at 0.51, and the pragmatic score at 0.33. The final scoring of the session can be saved for archiving using the Save Snapshot panel.
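For illustration, the composite calculation of Eq. 5 with the demonstration weights from Fig. 12 might look as follows; setting the social weight to 0 for the unsupported score, and the sub-score values themselves, are our assumptions.

```java
public class QualityScore {

    // Composite quality score (Eq. 5) using the demonstration weights from
    // Fig. 12; the social weight is 0 since the social score is unsupported.
    public static double quality(double s, double e, double p, double o) {
        return 0.15 * s + 0.51 * e + 0.33 * p + 0.0 * o;
    }

    public static void main(String[] args) {
        // Hypothetical sub-scores for illustration.
        System.out.printf("Q=%.3f%n", quality(0.95, 0.80, 0.70, 0.0));
    }
}
```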
The application is also usable on a mobile device by way of a responsive design. Figures 13 and 14 show the login screen and the Pragmatic screen rendered on an Android smartphone. With a streamlined interface and adaptation to various screen sizes, we foresee that this application could serve mobile users in the future, although further refinement of the interface is still needed and the possibility of an ontology artifact residing on someone's smartphone is remote. For the usability testing, as introduced in the next section, our evaluators used their desktops instead of their portable devices.
Usability evaluation
Five of the co-authors (CT, YH, CL, DW, and FM), who have published research and development experience with ontologies, participated in assessing OntoKeeper independently. None of the five were involved in the development of OntoKeeper. Each participant was furnished with a username and password to log in, and received no guidance; each was left on their own to upload an ontology of their choice and explore the tool without any intervention. After reviewing and testing the tool, each participant completed a survey using the System Usability Scale (SUS) [36, 37] to appraise the tool. The SUS instrument is a simple 10-item survey using a Likert scale for each item (1 = strongly disagree, 5 = strongly agree), and is known for its reliability with small samples [38]. The scores were compiled and are discussed in the next section. Lastly, the survey provided free-text space for further comments not covered by the survey items.
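For reference, standard SUS scoring [36] converts the ten Likert responses to a 0-100 scale: odd items contribute (response - 1), even items contribute (5 - response), and the sum is multiplied by 2.5. A minimal sketch with hypothetical responses:

```java
public class SusScore {

    // Standard SUS scoring: odd items contribute (response - 1), even items
    // contribute (5 - response); the sum is scaled by 2.5 to a 0-100 range.
    public static double score(int[] responses) { // ten Likert responses, 1..5
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            sum += (i % 2 == 0) ? responses[i] - 1 : 5 - responses[i];
        }
        return sum * 2.5;
    }

    public static void main(String[] args) {
        // Hypothetical responses from one evaluator.
        System.out.println(score(new int[]{4, 2, 5, 1, 4, 2, 4, 2, 5, 1})); // 85.0
    }
}
```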