An overview of the methodology is shown in Fig. 2. It has three main components: (1) data collection and pre-processing, (2) semantic feature generation, and (3) readmission prediction. The data collection step gathers from the MIMIC-III database all the ICU information needed for the ML models and for semantic enrichment. The semantic feature generation step builds the KG and generates vector representations (i.e., embeddings) that the ML models can process for prediction. The process takes as input the MIMIC-III dataset and a repository of biomedical ontologies (via BioPortal), and outputs a prediction of whether a patient will be readmitted to the ICU in the 30 days following discharge.
To better elucidate the impact of using semantic annotations and KG embeddings for ICU readmission prediction, we build on the work of Lin et al. [16], the only related work with open source code (see Availability of data and materials). Our methodology differs from theirs both by considering additional relevant information from the MIMIC-III dataset and by enriching this information with semantic representations of features based on ontology embeddings. While Lin et al. built a readmission prediction model based on three categories of features in the MIMIC-III dataset, namely chart events, ICD9 final diagnoses and demographic information for each patient, this work also includes prescriptions, initial diagnoses, procedures and laboratory events. Lin et al. employed pre-trained ICD9 embeddings based on medical texts to represent the final diagnosis; we do not use them, and instead use KG embeddings to represent the semantic annotations of all features. Moreover, their approach is limited to predictions at the end of an ICU stay, since the final diagnosis is only available then, whereas this work supports predictions at different moments of the ICU stay as more data becomes available. Finally, while Lin et al. employ sophisticated combinations of LSTM and CNN, we focus on more traditional ML algorithms to better discriminate the impact of semantic annotations from that of the choice of ML algorithm.
Data collection and pre-processing
MIMIC-III [31] is an extensive, freely available database containing records of 53,423 distinct hospital admissions of adult patients (aged 16 years or above) who stayed in the Beth Israel Deaconess Medical Center intensive care units between 2001 and 2012 [32]. The MIMIC-III database contains de-identified and comprehensive health-related intensive care data, including demographics, vital sign measurements, diagnoses, caregiver notes, procedures performed during the stay, laboratory tests and findings, prescriptions, and mortality.
Data acquisition and filtering
The following groups of features were extracted from the MIMIC-III dataset:
- Patients' demographics: age, gender, ethnicity and insurance type, following [16].
- Chart events: 17 features (including notes, laboratory tests, fluid balance, etc.) extracted for each patient, together with their normal median values, following [16].
- Prescriptions: the drug name and National Drug Code (NDC) (not considered in [16]).
- Diagnoses: two features, the initial diagnosis, recorded at admission in free text (not considered in [16]), and the final diagnosis, coded in ICD9 at discharge with a matching label.
- Procedures: coded as ICD9 procedures (not considered in [16]).
- Laboratory tests: coded using Logical Observation Identifier Names and Codes (LOINC); we extract the label and code as features (not considered in [16]) and do not use other data such as value or unit.
Following Lin et al. [16], we remove patients under the age of 18, resulting in a total of 35,334 patients with 48,393 ICU stays. Figure 3 represents the classification of patients into negative and positive instances. According to the patient selection criteria, the following cases are considered ICU readmissions [16]: (1) patients who were transferred from the ICU to low-level wards but returned to the ICU; (2) patients who were transferred from the ICU to low-level wards and died later; (3) patients who were discharged but returned to the ICU within the next 30 days; and (4) patients who were discharged and died within the next 30 days. This results in a 3:1 balance between records without readmission (negative) and records with readmission (positive), for a total of 37,102 negative records and 11,290 positive records.
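As an illustration, a minimal sketch of this labelling logic; the per-stay fields are hypothetical and would have to be pre-computed from the MIMIC-III ADMISSIONS, ICUSTAYS and TRANSFERS tables:

```python
def is_readmission(stay: dict, window_days: int = 30) -> bool:
    """Label an ICU stay as positive (readmission/death) or negative.

    `stay` is assumed to be a dict with hypothetical, pre-computed fields
    derived from the MIMIC-III ADMISSIONS, ICUSTAYS and TRANSFERS tables.
    """
    # (1) transferred to a low-level ward but returned to the ICU
    if stay["transferred_to_ward"] and stay["returned_to_icu"]:
        return True
    # (2) transferred to a low-level ward and died later in hospital
    if stay["transferred_to_ward"] and stay["died_in_hospital"]:
        return True
    # (3) discharged but readmitted to the ICU within the window
    days_to_next = stay["days_to_next_icu_admission"]
    if days_to_next is not None and days_to_next <= window_days:
        return True
    # (4) discharged and died within the window
    days_to_death = stay["days_to_death"]
    if days_to_death is not None and days_to_death <= window_days:
        return True
    return False
```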
Data cleaning and pre-processing
For initial and final diagnoses, we exclude entries that have missing labels or labels containing unreadable characters. For prescriptions, lab events and procedures, the features correspond to controlled vocabulary codes. However, there are still issues of missing data or codes in the wrong format. Entries with missing labels are excluded. Codes in the wrong format occur only in the International Classification of Diseases, Version 9, Clinical Modification (ICD9CM) controlled vocabulary: MIMIC-III does not include the period character that is part of ICD9CM codes, which creates ambiguity for the annotators between codes from different branches. Since MIMIC-III distinguishes between codes used for procedures and codes used for diagnoses, we were able to reconstruct correctly formatted codes: procedure codes take the period after the second character and diagnosis codes after the third.
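As an illustration, a sketch of this period-reconstruction rule (it follows the text above and ignores special ICD9 prefixes such as E-codes):

```python
def format_icd9(code: str, is_procedure: bool) -> str:
    """Re-insert the period that MIMIC-III drops from ICD9CM codes.

    Procedure codes take the period after the second character,
    diagnosis codes after the third (as described in the text).
    """
    split = 2 if is_procedure else 3
    if len(code) <= split:
        return code  # short codes have no decimal part
    return code[:split] + "." + code[split:]

# e.g. format_icd9("0331", is_procedure=True)   -> "03.31"
#      format_icd9("41401", is_procedure=False) -> "414.01"
```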
ICU timeline snapshot split
An ICU stay has multiple stages from the moment a patient enters the unit: the patient undergoes diagnostic exams and procedures and receives care, ideally leading to a successful discharge and recovery. This means that new information is generated throughout a patient's ICU stay, as drugs are prescribed, tests are ordered and performed, or procedures are carried out.
To capture the evolution of an ICU stay we consider three moments (snapshots) at which to make predictions: Pre-ICU, In-ICU and Post-ICU [33]. Figure 4 represents this timeline and the information available at each moment. Pre-ICU corresponds to the data available when the patient enters the ICU: demographic information and initial diagnosis. In-ICU includes the Pre-ICU data as well as laboratory tests, prescribed drugs and chart events. Post-ICU includes all previous information as well as the information recorded at discharge: final diagnosis and procedures. Although procedures correspond to the In-ICU moment, they are only recorded in the MIMIC-III EHR for billing purposes at the end of the stay, so we include them only in the final moment.
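The snapshot split can be summarised as a mapping from each prediction moment to the feature groups available at that point (a sketch; the group names are ours):

```python
# Feature groups available at each prediction moment (cumulative).
SNAPSHOT_FEATURES = {
    "pre_icu":  ["demographics", "initial_diagnosis"],
    "in_icu":   ["demographics", "initial_diagnosis",
                 "lab_events", "prescriptions", "chart_events"],
    "post_icu": ["demographics", "initial_diagnosis",
                 "lab_events", "prescriptions", "chart_events",
                 "final_diagnosis", "procedures"],
}
```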
Semantic feature generation
An ontology provides a specification of the meaning of the concepts in a domain and an associated vocabulary [34]. This specification captures the context and the semantic rules that apply to concepts, allowing their interpretation through logical axioms. Using ontologies to represent domains reduces ambiguity and facilitates machine understanding. Clinical text is rich in synonyms, contributing to a high degree of ambiguity in analysis. Ontologies define multiple synonyms for the same concept and provide precise semantics for each concept, allowing the identification of synonyms in the text they annotate.
Ontologies serve as the semantic layer of smart information systems or, more recently, as the schema layer of KGs [35]. When a substantial number of data instances is structured in a graph, where nodes represent instances and edges the relations between them, and that graph follows a schema provided by an ontology to define the classes and relations of the instances, we can call it a KG.
The knowledge that ontologies (and controlled vocabularies) provide may be used in predictive models without prior data analysis or mining, to enrich or expand features, increasing the information available to the ML methods beyond what would otherwise be accessible [36]. This is especially true in the life sciences, where more than nine hundred ontologies are available, spanning several fields of research in the biological and biomedical domains [36]. Biomedical ontologies provide controlled vocabularies for characterizing most biological phenomena with formalized domain descriptions and provide interaction by linking them to other related domains [37].
We define a KG as a graph-representation of knowledge that describes entities and their relations defined according to classes and relations in an ontology. The approach to build a KG by linking the data extracted from the EHR to ontologies and using it to generate features includes three steps (see Fig. 2 step 2): (1) Ontology Selection, where ontologies that provide adequate coverage of the feature’s domains are selected; (2) Semantic Annotation, where textual features are mapped to ontology classes that describe them; and (3) Annotation Embedding, where each feature’s annotation is processed using a KG embedding approach that represents it in a numerical vector that reflects the meaning of the particular class within the ontology.
Ontology selection
The BioPortal Recommender platform [38] was used to support ontology selection. This service receives a biomedical text corpus or a list of keywords, for instance a set of EHR terms, and suggests appropriate ontologies to reference [38]. Although few studies describe the accuracy of the BioPortal Recommender, we know it relies on the NCBO Annotator to score candidate ontologies and provide a recommendation [39]. The annotator uses Mgrep for concept recognition [40], a highly accurate system that achieves 95% or higher accuracy for disease name recognition [40], which indicates the recommender performs well for diagnosis annotation. We used a pre-selected group of ontologies of interest: National Cancer Institute Thesaurus (NCIT), Systematized Nomenclature of Medicine-Clinical Terms (SNOMEDCT), Medical Subject Headings Thesaurus (MeSH) and RxNORM were selected based on the relevance attributed to them in a previous work [41]; LOINC, The Drug Ontology (DRON) and ICD9CM were selected due to their presence in the MIMIC-III dataset; the Medical Dictionary for Regulatory Activities Terminology (MedDRA) and the Experimental Factor Ontology (EFO) were selected as additional relevant biomedical ontologies.
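Ontology recommendations can be obtained programmatically from the BioPortal REST API; a minimal sketch (the endpoint follows the public BioPortal documentation, MY_API_KEY is a placeholder, and the response fields used here are assumptions based on that documentation):

```python
import requests

BIOPORTAL_RECOMMENDER = "https://data.bioontology.org/recommender"
API_KEY = "MY_API_KEY"  # placeholder: a personal BioPortal API key is required


def recommend_ontologies(terms: list[str]) -> list[str]:
    """Ask the BioPortal Recommender which ontologies best cover the input terms."""
    params = {
        "input": ", ".join(terms),
        "input_type": "2",   # assumption: 2 = keyword list, per BioPortal docs
        "apikey": API_KEY,
    }
    response = requests.get(BIOPORTAL_RECOMMENDER, params=params, timeout=60)
    response.raise_for_status()
    # assumption: each recommendation carries an "ontologies" list with acronyms
    return [item["ontologies"][0]["acronym"] for item in response.json()]


# e.g. recommend_ontologies(["myocardial infarction", "creatinine", "heparin"])
```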
Semantic annotation
Semantic annotation is the process of describing an object by associating it with concepts that have well-defined semantics in an ontology [42]. Given the results obtained for the Ontology selection, we considered two annotation strategies: one using the NCIT and one using four different ontologies (NCIT, LOINC, ICD9CM and DRON).
For the single-ontology scenario, all textual labels (diagnoses, lab events, procedures and prescriptions) were mapped to a single ontology, NCIT.
For the multi-ontology scenario, the semantic annotation procedure is simpler because MIMIC-III already includes the codes (i.e., class identifiers) for the laboratory events (LOINC) and final diagnosis and procedures (ICD9CM). To cover the drug prescriptions, since NDC is not openly available, we mapped its classes to DRON using the BioPortal Annotator. Initial diagnoses were mapped via their textual labels to NCIT.
The text-based annotations to NCIT were performed with ElasticSearch [43]. For each term in our dataset, a list of the six best-scoring matched ontology classes is retrieved, and the class whose label has the smallest Levenshtein distance to the input term is selected.
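A minimal sketch of this matching step, assuming an ElasticSearch index named ncit whose documents carry a label field (index and field names are ours, and the query syntax assumes the 8.x Python client):

```python
from elasticsearch import Elasticsearch


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def annotate(term: str, es: Elasticsearch, index: str = "ncit") -> dict:
    """Return the class whose label is closest to `term` among the top 6 hits."""
    hits = es.search(index=index,
                     query={"match": {"label": term}},
                     size=6)["hits"]["hits"]
    return min(hits,
               key=lambda h: levenshtein(term.lower(), h["_source"]["label"].lower()))


# e.g. annotate("acute kidney failure", Elasticsearch("http://localhost:9200"))
```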
A patient is thus initially represented by a vector of all their annotations, i.e. the ontology classes that describe their features.
$$\begin{aligned} P_{a}=\{c_1,\ldots , c_n\} \end{aligned}$$
(1)
The annotations are then used to build a KG that defines patients as instances that are related to the ontology classes that annotate them:
$$\begin{aligned} KG=\{V_c,V_i,E_c,E_a\} \end{aligned}$$
(2)
where \(V_c\) are the vertices that represent ontology classes, \(V_i\) the vertices that represent instances, \(E_c\) the edges between classes, and \(E_a\) the edges between an instance and the class that annotates it.
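As an illustration, a minimal sketch of how such a graph can be assembled with rdflib; the instance namespace, the annotation property (annotatedWith) and the ontology file path are placeholders of our own:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, OWL

EX = Namespace("http://example.org/icu#")  # hypothetical namespace for instances

kg = Graph()
kg.parse("ncit.owl")  # V_c and E_c: ontology classes and class-to-class edges


def add_patient(kg: Graph, stay_id: str, annotation_iris: list[str]) -> None:
    """Add one ICU stay (instance vertex in V_i) and its annotation edges (E_a)."""
    patient = EX[f"stay_{stay_id}"]
    kg.add((patient, RDF.type, OWL.NamedIndividual))
    for iri in annotation_iris:
        kg.add((patient, EX.annotatedWith, URIRef(iri)))  # instance-to-class edge
```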
KG embeddings
To represent patients through their semantic annotations for each feature in a way that machine learning algorithms can process, we employed KG embeddings. An embedding is a technique that transforms a higher-dimensional space into a lower-dimensional one [44]. KG embeddings represent the KG components in continuous vector spaces, so that their manipulation is simplified while the inherent structure of the KG is preserved [45]. A typical KG embedding technique has three steps [45]: the first specifies how entities and relations are represented in the vector space, with entities usually represented as vectors and relations taken as operations represented as vectors or matrices; the second defines a scoring function that measures the plausibility of a triple; and the final step learns useful entity and relation representations by optimizing the scoring function to maximize plausibility [45].
KG embeddings are learned for each annotated class of each ontology used, resulting in five sets of embedding vectors (one for the single-ontology scenario and four for the multi-ontology scenario). The full graph is given as input to the KG embedding methods, with all types of relationships being considered. However, the majority of these are hierarchical relations, which is a consequence of the nature of the ontologies employed in our strategy.
We hypothesize that random-walk-based embedding techniques such as RDF2Vec are better suited to embedding instances based on their ontology annotations, because they are better at capturing long-distance hierarchical relations than translational strategies like TransE [46]. We also wanted to investigate whether methods that take advantage of the ontology axioms and the lexical component of the ontologies, such as OPA2Vec [22], would represent an improvement.
RDF2Vec [44] is a random-walk-based strategy designed to handle the specific semantics of RDF graphs (a language used to encode KGs and ontologies) [44]. For a given graph \(G = (V, E)\) and every vertex \(v \in V\), RDF2Vec generates all graph walks \(P_v\) of depth d rooted in v. These sequences are the input to word2vec [47], a two-layer neural network model that learns word embeddings from raw text (or, in this case, from sequences of graph entities).
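For illustration, a minimal sketch of the random-walk-plus-word2vec idea behind RDF2Vec (not the actual RDF2Vec implementation), using a toy adjacency-list graph with placeholder identifiers and gensim; the walk count and depth mirror the parameters reported below:

```python
import random
from gensim.models import Word2Vec


def random_walks(graph: dict, num_walks: int = 500, depth: int = 4) -> list:
    """Generate random walks over an adjacency-list graph {node: [(edge, node), ...]}."""
    walks = []
    for start in graph:
        for _ in range(num_walks):
            walk, node = [start], start
            for _ in range(depth):
                if not graph.get(node):
                    break
                edge, node = random.choice(graph[node])
                walk.extend([edge, node])  # walks contain both edge labels and vertices
            walks.append(walk)
    return walks


# A toy adjacency list standing in for the annotation KG (identifiers are placeholders).
toy_graph = {
    "stay_1":      [("annotatedWith", "NCIT:ClassA")],
    "NCIT:ClassA": [("subClassOf", "NCIT:ClassB")],
    "NCIT:ClassB": [("subClassOf", "NCIT:ClassC")],
    "NCIT:ClassC": [],
}

# Skip-gram word2vec over the walk "sentences", 300-dimensional vectors.
walks = random_walks(toy_graph)
model = Word2Vec(walks, vector_size=300, sg=1, min_count=1)
embedding = model.wv["NCIT:ClassA"]  # vector for one annotated class
```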
OPA2Vec [22] produces a triple representation of the ontology based on formal axioms, both materialized and inferred by reasoning, and on annotation axioms that capture the lexical component. It then applies a PubMed pre-trained Word2Vec model [48] to produce the embedding vectors.
TransE [46] uses translations to represent relations in the embedding space: for each triple of subject, predicate and object (s, p, o) that holds, the embedding of the object must be close to the embedding of the subject plus a vector representing the predicate (relation) [46]. This can then be generalized to every triple in the KG.
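To make the intuition concrete, a minimal numpy sketch of the TransE scoring function (illustrative only; actual training additionally involves negative sampling and a margin-based loss):

```python
import numpy as np


def transe_score(s: np.ndarray, p: np.ndarray, o: np.ndarray, norm: int = 1) -> float:
    """Negative distance ||s + p - o||; scores closer to zero mean more plausible triples."""
    return -float(np.linalg.norm(s + p - o, ord=norm))


rng = np.random.default_rng(0)
s, p = rng.normal(size=300), rng.normal(size=300)
o_good = s + p + 0.01 * rng.normal(size=300)  # object near subject + predicate
o_bad = rng.normal(size=300)                  # unrelated object
assert transe_score(s, p, o_good) > transe_score(s, p, o_bad)  # true triple scores higher
```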
All embedding vectors have 300 dimensions, following the baseline embedding parameters used by Lin et al. [16] and after an empirical evaluation of 200 and 400 dimensions showed no performance gain. Other parameters are set to their defaults in both TransE and OPA2Vec. RDF2Vec employed the Skip-Gram algorithm, 500 walks and a maximum depth of 4. Each ontology class is thus represented as a vector with 300 dimensions.
If an ICU stay of a patient is annotated by more than one class (vector) within an ontology, then the vectors for each annotated class are summed. This aggregation approach follows the one used by [16] for the ICD9 embeddings. More formally, the embedding vector that represents a patient p under a given ontology o is given by the sum of each embedding vector \(v_c\) that represents each annotation of the patient in o to a class c.
$$\begin{aligned} v_{p_o} = \sum _{c=1}^{n} v_c \end{aligned}$$
(3)
Since in the multi-ontology scenario each ICU stay feature is annotated with a different ontology, this results in four different embedding vectors, each corresponding to the sum of the individual vectors for each annotation. These four vectors are then concatenated (i.e., appended) rather than summed, to preserve the distinct dimensions.
This results in a single vector describing an ICU stay of a patient with 300 dimensions for the single ontology scenario and 1200 dimensions for the multi-ontology scenario.
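A minimal sketch of this aggregation with numpy, assuming hypothetical dictionaries that map each ontology to the patient's annotated classes and to the learned class embeddings:

```python
import numpy as np

DIM = 300  # dimensionality of every class embedding


def patient_vector(annotations_per_ontology: dict, embeddings_per_ontology: dict) -> np.ndarray:
    """Sum annotation vectors within each ontology, then concatenate across ontologies.

    annotations_per_ontology: {"NCIT": [class_ids], "LOINC": [...], ...}
    embeddings_per_ontology:  {"NCIT": {class_id: np.ndarray of shape (DIM,)}, ...}
    """
    parts = []
    for onto, classes in annotations_per_ontology.items():
        vectors = [embeddings_per_ontology[onto][c] for c in classes]
        parts.append(np.sum(vectors, axis=0) if vectors else np.zeros(DIM))
    # 300 dimensions for the single-ontology scenario, 1200 for the multi-ontology one
    return np.concatenate(parts)
```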
Readmission prediction
The prediction task is to correctly predict whether a patient will be readmitted to an ICU (or die) within 30 days after release. Each instance corresponds to a patient and their ICU stay, represented by a concatenated vector that includes demographic data (\(v_d\)) and chart events (\(v_c\)) (similarly to [16]) as well as the KG embeddings vector (\(v_{p_o}\)):
$$\begin{aligned} P =\{v_{d} + v_{c} + v_{p_o}\} \end{aligned}$$
(4)
Predictions at different points of the ICU timeline stay include only the embeddings for the data available at that time.
Four classical machine learning methods are used: Logistic Regression (LR), Random Forest (RF), Naive Bayes (NB), and Support Vector Machine (SVM). These are the same methods used by [16] as baseline models. The LSTM and CNN models were not reproducible, possibly due to a library incompatibility. No hyperparameter optimization is applied, to ensure a more direct comparison to [16]. The choice of classical methods allowed us to focus our analysis on the impact of the KG embeddings.
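For illustration, a minimal sketch of this prediction step with scikit-learn, using library defaults; X and y below are random placeholders standing in for the concatenated feature vectors and the readmission labels, and AUC is shown only as an example metric:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Placeholder data: demographics + chart events + KG embeddings per ICU stay, 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1221))
y = rng.integers(0, 2, size=500)

models = {
    "LR": LogisticRegression(max_iter=1000),  # raised from the default only for convergence
    "RF": RandomForestClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(probability=True),             # probabilities needed to compute AUC
}

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    print(name, roc_auc_score(y_test, scores))
```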