Ontology-driven and weakly supervised rare disease identification from clinical notes

Background: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and because data annotation requires domain experts.
Methods: We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in the Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach learns a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three annotated clinical datasets, MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports, from two institutions in the US and the UK.
Results: The improvements in precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with those on the discharge summaries. The overall pipeline processing clinical notes can extract rare disease cases, most of which are not captured in structured data (manually assigned ICD codes).
Conclusion: The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline on clinical notes. The proposed weakly supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12911-023-02181-9.


Introduction
Text phenotyping is the task of extracting diseases or traits of patients from clinical notes, which can benefit a wide range of tasks like cohort selection, epidemiological research, and decision making for better clinical care. A particular set of human phenotypes are rare diseases: a rare disease is very uncommon, affecting 5 or fewer people in 10,000, but there are between 6,000 and 8,000 rare diseases and they collectively affect approximately 3.5-5.9% of the population (or 263-446 million persons) globally [1] (and over 1 in 17 people in the UK [2] and 8% of the population in Scotland [3]) at some point in their lifetime. Compared to common diseases, rare diseases are usually not coded precisely, partly because they are underrepresented in the current ICD-10 (International Classification of Diseases, version 10) terminologies [4,5]. Detailed information about a patient is usually hidden in unstructured clinical narratives. It is thus necessary to use clinical notes with Natural Language Processing (NLP) techniques to complement coded data to identify rare diseases in patients.
The main challenge for rare disease identification with NLP is the lack of annotated data for machine learning, especially deep learning. Deep learning models for clinical note classification tend to perform worse for infrequent diseases due to the lack of cases for training [6]. On the other hand, annotating a variety of rare diseases in clinical notes from scratch needs specific domain expertise. It also requires the manual annotation of a very large number of clinical notes to ensure enough cases for each rare disease, thus taking time and incurring considerable costs from a group of clinical experts.
We propose an ontology-driven and weakly supervised framework for rare disease identification from clinical notes, extending our previous work in [7] with further, detailed empirical analyses and external validation. Ontologies are essential for text phenotyping as they provide a curated list of terms of diseases and traits. Previous studies have used ontologies to estimate the frequency of rare diseases [8]. Our main ontology-driven framework is illustrated in Figure 1.
We use the Orphanet Rare Disease Ontology [9] as the vocabulary of rare diseases (footnote 1). We then leverage the concepts and synonyms in the Unified Medical Language System (UMLS) as an intermediary dictionary to extend matching terms and address the issue of name variation [12] in linking texts to rare diseases, e.g. "tracheobronchomalacia" for Williams-Campbell syndrome. The framework thus contains two integrated parts, entity linking (Text-to-UMLS) and ontology matching (UMLS-to-ORDO). Entity linking from mentions (or text fragments) to UMLS concepts is challenging due to ambiguous mentions [12,8], especially abbreviations, e.g. "HD", which could mean Huntington Disease, Hemodialysis, or Hospital Day. String matching usually does not consider the complex contexts of a mention and can therefore result in many false positives.
Footnote 1: We focus on the identification of diseases instead of the associated phenotypic abnormalities; therefore we chose ORDO instead of the Human Phenotype Ontology (HPO) [10]. The overall ontology-based and weak supervision framework can potentially be applied to HPO, given that it is aligned to ORDO [11]. We leave the phenotypic abnormalities (in HPO) for future studies.
Machine learning can be applied for the disambiguation of terms, but it needs abundant annotated training data, which are currently not available in the context of rare diseases.
We therefore propose a weakly supervised approach to filter out the false positives in entity linking. Weak supervision [13,14] is a strategy to automatically create labelled training data using heuristics, knowledge bases, crowdsourcing, and other sources, to alleviate the burden and cost of annotation. We first use a string matching based named entity linking tool, SemEHR [15] (widely applied for text phenotyping in the UK [15,16,17], based on Bio-YODIE [18]), to generate candidate entity linking results, i.e. mentions and their UMLS concepts, from clinical notes. We then efficiently create weak training data of candidate mention-UMLS pairs of sufficient quality with two rules: mention character length, targeting ambiguous abbreviations, and "prevalence", targeting rare diseases. A phenotype confirmation model can thus be learned through contextual mention representations with domain-specific BERT models (e.g. BlueBERT [19]) that capture the context underlying the text, disambiguating the mention and improving entity linking. For UMLS-to-ORDO matching, we used the mappings in ORDO and corrected the wrong links by filtering ORDO concepts with a phenome type as an upper class in the ontology [9].
For our main experiments, we trained a weakly supervised phenotype confirmation model using the discharge summaries in the MIMIC-III dataset [20]. A large, weak entity linking dataset (of 127,150 candidate mention-UMLS pairs) was created for training. For evaluation, we annotated 1,073 mention-UMLS pairs as a gold-standard dataset. By filtering out the false positives, the proposed approach dramatically improved the precision and F1 of the entity linking tool, SemEHR, with almost no loss of recall.
We further evaluated the phenotype confirmation models from discharge summaries to radiology reports in US MIMIC-III and UK NHS Tayside through either a direct transfer of the model or a weakly supervised re-training from new clinical notes. Almost perfect (100%) recall was achieved with a dramatic absolute increase of precision by over 30% to 50% with re-training and parameter tuning. This demonstrates that the approach can be efficiently adapted to identify rare disease phenotypes in another type of clinical notes and from another institution. Our annotated datasets on discharge summaries and radiology reports in MIMIC-III and our implementation of the overall approach are publicly available (footnote 2).
As far as we know, this is the first study on text phenotyping of rare diseases using weak supervision, with application to clinical notes of different types and institutions. Our findings will shed light on using weakly supervised approaches and contextual representations for text phenotyping from clinical notes. The overall approach to identifying rare disease cohorts has the potential to support epidemiology and clinical decision making for better care.
Footnote 2: https://github.com/acadTags/Rare-disease-identification

Background and Related Work
Text phenotyping with ontologies. Compared to the efficient and increasingly economical genotyping (i.e. sequencing genomic information), phenotyping usually needs high-throughput computational approaches for the extraction of diseases and traits from electronic health records (EHRs) [21,22]. Clinical codes (e.g. the International Classification of Diseases, ICD) are a common source for phenotyping because they are easy to retrieve. However, ICD codes are usually not specific enough to define nuanced diseases or traits (e.g. rare diseases [4]) and are likely to be incomplete or under-coded [23], which may cause erroneous and missing cases in phenotyping. An alternative source for phenotyping is free-text clinical notes in the EHRs. A previous systematic review of cohort identification from EHRs [24] showed that text phenotyping (or case detection) achieves on average higher precision (or positive predictive value) and recall (or sensitivity) than code-based phenotyping, and that combining both sources (texts and codes) greatly improved phenotyping results. Text phenotyping also requires understanding the wider contextual features of the matched concepts, including negation (i.e. whether negated or hypothetical), experiencer (i.e. whether experienced by the patient or someone else), and temporality (i.e. whether historical) [25,16]. These contextual features have been reasonably well detected with rule-based approaches, e.g. [25], as applied in Bio-YODIE and SemEHR, and more recently with neural network methods, e.g. in MedCAT [26].
Ontologies are essential for text phenotyping as they define the concepts and terms of diseases and traits. These concepts and terms are widely used to annotate clinical notes, i.e. to match text fragments or mentions [27], and to estimate rare diseases from texts [8]. The task of matching ontology concepts (and their terms) to mentions is formally referred to as entity linking. One main issue of entity linking is entity ambiguity, where a mention could denote different concepts or terms in an ontology [12]. Our work aims to improve entity linking with better disambiguation using weak supervision and contextual mention representation.
Weak supervision. Weak supervision [13,14] is a strategy to efficiently and programmatically create a large set of noisily labelled training data using various sources containing heuristics and knowledge bases. The success of applying weak supervision in clinical NLP studies depends on two aspects, data programming and data representation, as suggested in [13]. Efficient data programming ensures that reliable weak data can be programmatically created for supervised learning. In clinical NLP, studies use lexical or concept filtering rules to create labelled data to extract nuanced categories (e.g. suicidal ideation [28] or lifestyle factors for Alzheimer's Disease [29]) from clinical texts. We extend this line of research by using ontologies and a medical concept labelling tool with two specific rules to create reliable weak data to extract rare diseases. The second aspect is data representation: representing the contexts and semantics in the data as vectors in a high-dimensional space for subsequent steps in machine learning. For deep learning methods, previous studies [13,29] proposed to use neural word embeddings and, more recently, BERT [30] to represent the contexts of the textual data. We follow this direction and apply weak supervision with contextual representations for rare disease phenotyping.
Contextual Representation. The most significant recent progress in NLP is contextual representations pre-trained using Transformers [31] on a very large corpus [30]. The most representative contextual representation is BERT [30]. The pre-training task for BERT learns a masked language model with next sentence prediction, trained on a vast amount of curated texts on the Web (e.g. BookCorpus and English Wikipedia) using a 12- or 24-layer deep neural network mainly composed of multi-head self-attention blocks. The learned parameters in the large neural network can then be applied to a wide range of downstream tasks, e.g. text classification, Named Entity Recognition, and question answering, with performance superior to that of the previous, task-specific models [30]. Contextual representations have been adapted to the clinical domain by pre-training using biomedical publications, clinical notes, and clinical ontologies. Notable models include, but are not limited to, BlueBERT [19] (BERT further pre-trained with PubMed abstracts and MIMIC-III clinical notes), PubMedBERT [32] (pre-trained from scratch with PubMed abstracts and full texts), and SapBERT [33] (PubMedBERT further pre-trained with UMLS concepts). We adapt the contextual representation methods for the mentions or text fragments to improve entity linking.

Method
In this section, we describe the ontology-driven method, the weak supervision for entity linking, contextual mention representation, and model training and inference.

Entity Linking and Ontology Matching
Entity Linking. Given a set of entities $E$ in an ontology and a collection of documents (e.g. clinical notes), entity linking aims to match a mention (or text fragment) $m$ to its corresponding entity $e \in E$ in the ontology [12]. The mention $m$ is a sequence of tokens in a document which potentially refers to one or more named entities and is usually identified in advance during the named entity recognition stage [12]. For Named Entity Recognition and Linking (NER+L) tools with a very large number of entities, e.g. Bio-YODIE [18], SemEHR [15], and MedCAT [26], a mention $m$ is recognised at the same time as it is linked to a concept in an ontology; this is usually realised through string matching [18,26].
We applied SemEHR, a medical NER+L tool widely deployed in Trusted Research Environments (or Data Safe Havens) and servers in the UK. Previously, high recall and F1 (around 90%) were reported on sub-phenotyping with stroke from texts with SemEHR [17]. The output is a set of mention-UMLS pairs, where each mention is in a context window and comes with the name of the document structure (or the template section of the clinical note) if available. SemEHR adapts Bio-YODIE as its main NLP module, enhanced with a search interface and continuous learning functionalities based on users' feedback labels and rule-based and machine learning methods. Bio-YODIE can efficiently extract UMLS concepts from texts using a string matching based approach. When there is an ambiguous mention, time-efficient NER+L systems like Bio-YODIE mainly assume a corpus-based prior to assign the same, most frequent UMLS concept to the mention regardless of its context or surrounding texts [18]. This can result in many false positive phenotypes, mostly regarding the abbreviations in the clinical notes. For example, in Table 1, none of the identified "HD" mentions indicates a type of disease, according to the context. While SemEHR has a continuous learning functionality to classify and correct the errors, the approach relies on users' feedback labels and requires time from clinical experts.
Ontology Matching. Another issue in entity linking is the variations of terms that may be missed in the process [12]. This can be addressed by using the rich term variations in the UMLS Metathesaurus as an intermediary dictionary, with ontology matching to match concepts in UMLS to ORDO. Ontology matching (or mapping) is the task of finding the correspondence between two ontologies [34]. Each correspondence is represented as a triple $\langle e, e', r \rangle$, where $e$ and $e'$ denote an entity in the ontologies $O$ and $O'$, respectively, and $r$ denotes a relation that holds between the two entities [35, p. 43]. The main form of an entity in an ontology is a concept or a class, denoted as $c \in C$ [35, p. 34]. In ORDO, the matching of an ORDO concept to UMLS and ICD-10 concepts is available as cross references [9]. For example, for Orphanet 3325 (Heparin-induced thrombocytopenia), there exists the correspondence $\langle$Orphanet 3325, UMLS:C0272285, E$\rangle$, where the relation E denotes "Exact matching". We use E (Exact matching) or BTNT (ORDO's Broader Term maps to a Narrower Term) to ensure the matched term is a rare disease, and remove NTBT relations. We further added a rule ("isNotGroupOfDisorders") to filter out the Group of Disorders, e.g. Orphanet 181422 (Rare hyperlipidemia), which were mostly matched to a common disease in the UMLS, e.g. to C0020473 (hyperlipidemia). More details and examples of ontology matching are presented in Table S2-2 in Supplementary material 2.
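The filtering of ORDO cross references described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Xref` record, its field names, the identifier strings, and the relation assigned to the "Rare hyperlipidemia" example are assumptions made for the sketch, and the third entry is entirely hypothetical.

```python
from typing import NamedTuple


class Xref(NamedTuple):
    """One ORDO cross reference (hypothetical record layout)."""
    ordo_id: str
    umls_cui: str
    relation: str                # e.g. "E", "BTNT", "NTBT"
    is_group_of_disorders: bool


# Relations kept by the rule in the text: exact match, or ORDO broader term.
KEPT_RELATIONS = {"E", "BTNT"}


def umls_to_ordo(xrefs):
    """Build a UMLS -> ORDO mapping from the cross references that pass both filters."""
    mapping = {}
    for x in xrefs:
        if x.relation in KEPT_RELATIONS and not x.is_group_of_disorders:
            mapping.setdefault(x.umls_cui, set()).add(x.ordo_id)
    return mapping


xrefs = [
    Xref("Orphanet_3325", "C0272285", "E", False),     # example from the text
    Xref("Orphanet_181422", "C0020473", "E", True),    # group of disorders; relation assumed
    Xref("Orphanet_X", "C_X", "NTBT", False),          # hypothetical NTBT entry
]
mapping = umls_to_ordo(xrefs)
print(mapping)
```

Only the first cross reference survives in this toy input: the second is dropped as a group of disorders and the third because of its NTBT relation.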

Weak Supervision for Phenotype Confirmation Model
To address the issue of ambiguous mentions, we propose weak supervision based on rules for labelled data creation, with contextual mention embeddings for representation. Once both data and representations are created, a classifier can be learned to decide whether a mention linked to UMLS in its context indicates a correct phenotype of the patient.
Weakly Supervised Data Creation. The idea in the weak data creation is to create rules that can complement the existing tool (e.g. SemEHR) to create reliable mention-UMLS pairs for training. The whole data creation process for weak supervision is described in Algorithm 1. The candidate mention-UMLS pairs from an NER+L tool are denoted as a list of 5-element tuples $L$ (i.e. links), where each tuple includes a mention start position $m_{\mathrm{start}}$, a mention end position $m_{\mathrm{end}}$, a rare disease UMLS concept $c^{\mathrm{rare}}_{\mathrm{UMLS}}$, the context window of the mention $t$, and the name $s$ of the document structure where the mention is located. We propose two rules as functions on mention-UMLS pairs, the mention character length rule, $\lambda_1$, and the "prevalence" rule, $\lambda_2$, as shown in the blue blocks in Figure 2. Given that abbreviations (like "HD" in Table 1) are usually ambiguous and falsely linked by the NER+L tools, the mention character length rule $\lambda_1$ is satisfied (True) when the mention has more than $l$ characters (default 3), i.e. $m_{\mathrm{end}} - m_{\mathrm{start}} > l$, and is False otherwise. Given that rare diseases usually have a very low prevalence [3,36] and rare disease mentions usually have a low frequency in a consecutive sample of clinical notes, the "prevalence" rule $\lambda_2$ is satisfied (True) when the UMLS concept accounts for less than a very small percentage $p$ (default 0.5%) of the total number of candidate links $|L|$, i.e. $\mathrm{Freq}(c)/|L| < p$, and is False otherwise. This is an attempt to integrate an estimated epidemiological rule into weak supervision for text phenotyping.
The final rule-based weak labelling function $\lambda$ is defined as True (i.e. the mention-UMLS pair indicates a correct phenotype of the patient) when both rules $\lambda_1$ and $\lambda_2$ are satisfied, and as False when neither rule is satisfied. The data selection is equivalent to an XNOR logic operator (selected if and only if both rules are True or both are False) and the data labelling is equivalent to an AND operator of the rules. This ensures that only data on which both rules agree are weakly labelled. The binary weak label, $y_{\mathrm{weak}} \in \{0, 1\}$, is then appended to each mention-UMLS pair to create the weakly labelled data $D_{\mathrm{weak}}$.
The mention length threshold $l$ and the "prevalence" threshold $p$ are selected to ensure that a sufficient amount of reliable weak data is generated. We empirically determine the best values of $l$ (as 3 or 4) and $p$ (as 0.005 or 0.01) based on the validation set, or a small amount of annotated data used solely for evaluation (results on MIMIC-III discharge summaries in Table S1-1 in Supplementary material 1).
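The two rules and their combination can be sketched in a few lines. This is a minimal illustration under assumptions, not a reproduction of Algorithm 1: the dictionary keys (`m_start`, `m_end`, `cui`) and the function name are hypothetical, and the toy example uses a much larger `p` than the default so that the rules are visible on ten links.

```python
from collections import Counter


def weak_label(links, l=3, p=0.005):
    """Return (link, y_weak) pairs for the links on which both rules agree.

    Selection is XNOR of the two rules; the label is their AND.
    """
    freq = Counter(link["cui"] for link in links)
    total = len(links)
    labelled = []
    for link in links:
        # Rule 1: the mention has more than l characters.
        rule_length = (link["m_end"] - link["m_start"]) > l
        # Rule 2: the concept accounts for less than a fraction p of all links.
        rule_prevalence = freq[link["cui"]] / total < p
        if rule_length == rule_prevalence:            # XNOR: keep only agreements
            labelled.append((link, int(rule_length and rule_prevalence)))
    return labelled


# Toy data: 8 short, frequent mentions; 1 long, infrequent; 1 short, infrequent.
links = [{"m_start": 0, "m_end": 2, "cui": "C_common"} for _ in range(8)]
links.append({"m_start": 10, "m_end": 30, "cui": "C_rare_a"})
links.append({"m_start": 5, "m_end": 7, "cui": "C_rare_b"})

weak = weak_label(links, l=3, p=0.2)
labels = [y for _, y in weak]
print(len(weak), sum(labels))
```

In the toy data, the short and infrequent mention is left unlabelled because the two rules disagree on it, which mirrors the non-labelled portion of the candidate pairs.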
Contextual Mention Representation. We use a clinically pre-trained BERT model (e.g. BlueBERT, as described in the related work) to represent the mention in its context window $t$ in the weakly labelled data $D_{\mathrm{weak}}$. A BERT model can be succinctly described as in Equation 1. We excluded layer normalisation, dropout, and other functions and parameters in the equations for simplicity. The output $H_n \in \mathbb{R}^{|tokens| \times d}$ is a matrix that can be used as the layer for the subsequent task, where $|tokens|$ is the length of the sequence after tokenisation and $d$ denotes the dimensionality (usually 768 for BERT-base and 1024 for BERT-large). FFNN() is a feed-forward neural network of two linear transformations with a ReLU activation function in between, and MultiHead() is a multi-head self-attention layer that models multiple forms of alignment from the tokens to themselves; its three inputs represent matrices of queries ($Q$), keys ($K$), and values ($V$), respectively, linearly transformed from $H_i$. We refer readers to [31,30] for the details of the Transformer and BERT architectures.
The contextual understanding mainly comes from self-attention (as $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})$, where $d_k$ is a scaling factor), which captures the importance of every other token to each token. These parameters have been pre-trained on massive corpora from general and medical domains. The hidden layers in BERT, $H$, can be used as static embeddings to represent a sequence. We extract the second-to-last layer $H_{n-1}$ in BERT as the static embedding (or features) for the subsequent task, following the finding that $H_{n-1}$ gives the best feature-based results among all single layers in $H$ for an NER task [30]. A plausible explanation is that the last layer is more biased towards the training loss (e.g. masked language model and next sentence prediction), while the second-to-last layer better represents the contextual information of the sentence.
The selection of the specific BERT model generally favours models pre-trained with in-domain (i.e. clinical) corpora [37] and is empirically based on results (e.g. F1 scores) on the validation set. We will compare and analyse different BERT models in the experiments.
The overall weak supervision data representation and model training process is described in Algorithm 2. We use $H_{n-1} \leftarrow \mathrm{BERT}(t)$ to denote the whole process above. Mean pooling, as empirically suggested in [38], is applied to create a final vector $v$. We define a contextual mention representation where only the tokens within the mention are included, i.e. $v \leftarrow \mathrm{mean}(H_{n-1}[m^{\mathrm{token}}_{\mathrm{start}}, m^{\mathrm{token}}_{\mathrm{end}}])$. The start and end token positions of the mention, $m^{\mathrm{token}}_{\mathrm{start}}$ and $m^{\mathrm{token}}_{\mathrm{end}}$, are derived from the WordPiece tokenizer of the BERT model and the original position of the mention.
We also experimented with two encoding strategies, mention masking and using the document structure name $s$ (see line 3 in Algorithm 2), that allow a more flexible representation of the contexts. Non-masked encoding with document structures provided better results on the validation set (see Table S1-2 in Supplementary material 1).
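The mention-level mean pooling can be illustrated without loading a model. In the sketch below, `hidden` is a small hand-written matrix standing in for the second-to-last BERT layer (one row per token); in practice it would come from a pre-trained model. The function name and the choice of an inclusive end index are assumptions made for the illustration.

```python
def mention_vector(hidden, tok_start, tok_end):
    """Mean-pool the rows tok_start..tok_end (inclusive) of a |tokens| x d matrix."""
    rows = hidden[tok_start:tok_end + 1]
    if not rows:
        raise ValueError("mention span selects no tokens")
    d = len(rows[0])
    # Average each of the d dimensions over the tokens inside the mention.
    return [sum(row[j] for row in rows) / len(rows) for j in range(d)]


# Stand-in for a 4-token context window with d = 2.
hidden = [
    [0.0, 0.0],   # context token
    [1.0, 3.0],   # first token of the mention
    [3.0, 5.0],   # last token of the mention
    [9.0, 9.0],   # context token
]
v = mention_vector(hidden, tok_start=1, tok_end=2)
print(v)  # [2.0, 4.0]
```

Only the mention's own rows are averaged; the surrounding tokens influence the vector only through the contextual encoding that produced those rows.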
Model Training and Inference. Finally, a phenotype confirmation model can be trained from the weakly labelled data. The contextual mention representation $v$, as a static embedding, is fed into a binary classification model. We use logistic regression as the training model (in Train_and_validate() in Algorithm 2), which is similar to adding a feed-forward layer with sigmoid activation on top of the static pre-trained layer in BERT. We also compared this static embedding approach to fine-tuning the whole BERT model in the experiments.
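To make the classifier concrete, the following is a minimal sketch of logistic regression over fixed vectors, written with plain batch gradient descent. It is not the authors' training code: the learning rate, number of epochs, function names, and the two-dimensional toy vectors (standing in for mention representations) are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - target
            for j in range(d):
                grad_w[j] += error * x[j] / n
            grad_b += error / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def confirm(v, w, b, threshold=0.5):
    """Return True if the vector v is predicted to be a correct phenotype."""
    return sigmoid(sum(wj * vj for wj, vj in zip(w, v)) + b) >= threshold


# Toy static embeddings with weak labels (1 = correct phenotype).
X = [[2.0, 1.0], [3.0, 1.5], [-2.0, -1.0], [-3.0, -0.5]]
y = [1, 1, 0, 0]
w, b = train_logistic_regression(X, y)
predictions = [confirm(x, w, b) for x in X]
print(predictions)
```

Because the embeddings are kept fixed, only the weights and bias of this single sigmoid layer are learned, which is what distinguishes the static embedding setting from fine-tuning the whole BERT model.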
The inference stage is succinctly defined in Equation 2. We use SemEHR to extract candidate mention-UMLS pairs from a clinical note $d$. We then transform each instance into a contextual mention representation (see lines 3-6 in Algorithm 2), denoted as the function $V_{\mathrm{BERT}}()$. After selecting the patients' phenotype in $O^{\mathrm{rare}}_{\mathrm{UMLS}}$ with $M_{\mathrm{weak}}$, we can then use the correspondence between UMLS and ORDO, denoted as $OM_{U \to O}$, to obtain the final set of rare disease phenotypes $C^{d}_{\mathrm{ORDO}}$ as concepts in ORDO.
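The inference steps described above compose into a short pipeline, sketched below. The function signature, the dictionary keys, and the stub `represent` and `confirm` functions are assumptions for illustration; in the described system they would correspond to the BERT-based representation and the trained phenotype confirmation model. The identifiers `C_HD` and `Orphanet_HD` are hypothetical placeholders.

```python
def identify_rare_diseases(candidates, represent, confirm, umls_to_ordo):
    """Return the set of ORDO concepts confirmed for one clinical note."""
    ordo_concepts = set()
    for cand in candidates:
        vector = represent(cand)                  # contextual mention representation
        if confirm(vector):                       # phenotype confirmation model
            ordo_concepts.update(umls_to_ordo.get(cand["cui"], ()))
    return ordo_concepts


# Stubs standing in for the real components (illustration only).
def represent(cand):
    return [float(len(cand["mention"]))]


def confirm(vector):
    return vector[0] > 3


candidates = [
    {"mention": "HD", "cui": "C_HD"},                                    # hypothetical CUI
    {"mention": "heparin-induced thrombocytopenia", "cui": "C0272285"},
]
umls_to_ordo = {"C0272285": {"Orphanet_3325"}, "C_HD": {"Orphanet_HD"}}

result = identify_rare_diseases(candidates, represent, confirm, umls_to_ordo)
print(result)
```

Passing the components in as functions keeps the sketch independent of any particular NER+L tool or BERT model.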

Experiments
We evaluated the above ontology-driven and weakly supervised algorithms on MIMIC-III discharge summaries and further validated the approach with MIMIC-III radiology reports and NHS Tayside brain imaging reports. For validation and testing, we manually annotated a small number of mention-to-UMLS pairs from each of the datasets. We present results on each part of the system, Text-to-UMLS and UMLS-to-ORDO. For Text-to-UMLS, we carried out extensive experiments to study the best combination of parameters in the weak labelling rules and the encoding strategies, with a comparison between weak and strong supervision.
We then show the whole pipeline can support rare disease phenotyping by enriching the traditional method using ICD codes.Finally, we show that the proposed approach can easily generalise or be adapted to a new type of clinical note, radiology reports, in the same or another institution.

Data Processing and Annotation
We evaluated the proposed NLP pipeline with three datasets in two healthcare institutions in the US and the UK.
The main dataset we used was the discharge summaries (n=59,652) in the MIMIC-III ("Medical Information Mart for Intensive Care") dataset [20], which contains clinical data from adult patients admitted to the ICU in the Beth Israel Deaconess Medical Center in Boston, Massachusetts between 2001 and 2012. We were granted access to MIMIC-III through PhysioNet after completing the ethical training by the Collaborative Institutional Training Initiative program. MIMIC-III data are expected to contain rich rare disease mentions, as a large number of rare diseases (especially genetic disorders) can lead to an ICU (intensive care unit) admission [36].
The manual ICD-9 codes (i.e. ICD-9-CM) of the MIMIC-III admissions allow us to compare code-based phenotyping with text phenotyping for rare diseases. We linked ICD-9 codes to ICD-10 codes using the matching from the Ministry of Health, New Zealand [39] and linked ICD-9 to UMLS codes based on the ICD-9 ontology in BioPortal [40], as shown in Figure 1. We used ORDO version 3.0 (released 07/03/2020), which contained 14,501 concepts or classes related to rare diseases. We selected the ORDO concepts which have linkage to UMLS and ICD-10 in this study as this supports the interoperability (e.g. linking and traversing) among the clinical terminologies; this resulted in a set of 4,064 rare disease concepts (footnote 3). We focus on this essential set of overlapped rare diseases and the coverage is improving as the mappings are being updated; we leave the ORDO concepts without both ICD-10 and UMLS linkage for future research.
Footnote 3: The five most frequent UMLS (version 2020AB) semantic types of the 4,064 linked ORDO concepts: T047 (Disease or Syndrome, 3,245 concepts, 79.8%), T019 (Congenital Abnormality, 465 concepts, 11.4%), T191 (Neoplastic Process, 374 concepts, 9.2%), T049 (Cell or Molecular Dysfunction, 35 concepts, 0.9%), and T046 (Pathologic Function, 19 concepts, 0.5%); note that 160 concepts (3.9%) are associated with two semantic types.
After processing the discharge summaries with a SemEHR database instance (footnote 4) [15] with rule-based contextual filtering on negation and experiencer based on [25], we obtained 127,150 candidate mention-UMLS pairs for the UMLS concepts linked to ORDO. After applying the weak labelling function with the two rules, we finally obtained 15,598 positive and 74,217 negative data, and 37,335 non-labelled data or mention-UMLS pairs.
We further applied the same preprocessing steps to the MIMIC-III radiology reports (n=522,279) and NHS Tayside brain imaging reports (n=156,618). MIMIC-III radiology reports are from the same institution and within the same time span as the MIMIC-III discharge summaries [20]. The Tayside data contain the routine brain MRI and CT scans from the National Health Service (NHS) Tayside Health Board, which have been used in previous NLP research [17,41]. We have received NHS Tayside Caldicott Guardian approval to use the anonymised brain imaging reports for this work.
The statistics of the three datasets, MIMIC-III discharge summaries ("Disch"), MIMIC-III radiology reports ("Rad"), and NHS Tayside brain imaging reports ("Tayside Brain Img"), with the Natural Language Processing pipeline and manual annotations, are presented in Table 2. MIMIC-III discharge summaries have proportionally more documents associated with at least one candidate rare disease (identified by SemEHR), quantified by $|T_{RD}|/|D|$: 3.4 times more than MIMIC-III radiology reports and 13.3 times more than brain imaging reports in Tayside.
Footnote 4: https://github.com/CogStack/CogStack-SemEHR
Data Annotation. For evaluation, we created a gold standard dataset of 1,073 candidate mention-UMLS-ORDO triplets (with each mention in a context window) generated by SemEHR and ontology matching in ORDO, from a set of 500 randomly sampled discharge summaries from MIMIC-III, of which 312 (or 62.5%) discharge summaries have at least one candidate or potential "rare disease" mention. There were in total 95 types of rare disease associated with the mentions. Annotators were asked to label whether a mention-UMLS pair truly indicates a phenotype of the patient, following an annotation guideline with detailed examples on hypothetical mentions. The mention-UMLS pairs were annotated by 3 domain experts, including two research fellows and one PhD student in Medical Informatics (MI). Based on 200 random mention-UMLS pairs annotated by all 3 domain experts, the multi-rater Kappa value was 0.76. ORDO-to-UMLS concept matching was annotated by 2 domain experts (a research fellow and a PhD student in MI) and obtained a Kappa of 0.72. All contradictory and unsure annotations were resolved by a research fellow in biomedical science and MI. We used the first 400 data instances for model validation and the remaining 673 for final testing.
To study how the model performs when it is directly transferred to or re-trained on other clinical notes, we further annotated 198 candidate mention-UMLS pairs in a sample of 1,000 radiology reports in MIMIC-III [20] and 279 candidate mention-UMLS pairs (with 4 new manually identified mentions) in a sample of 5,000 brain imaging reports in NHS Tayside [17]. Each dataset was annotated by two researchers in clinical science or MI, with contradictions addressed by another researcher. The Kappa values for MIMIC-III radiology reports and NHS Tayside reports were 0.88 and 0.86, respectively.
Note that the evaluation set is independent of the rules used for weak supervision; thus abbreviations and "popular" disease mentions were present in the validation and testing data. This helps to test whether the phenotype confirmation model trained on the rule-based weakly labelled data can generalise to the full scenario, which also contains the unseen mentions that were filtered out during weak supervision.
As baselines, we compared the proposed approach ("Se-mEHR+WS") with SemEHR with the two rules only using an OR operation for the interest of higher recall ("Se-mEHR+rules").We evaluated the baselines using precision, recall, and F 1 scores.Note that SemEHR had a reference recall of 100% as all candidate "rare disease" mentions were identified by SemEHR, which was the starting source for the annotations. [9]  We tuned the two parameters l and p (to 3 and 0.5%, respectively, if not specified) in the weak labelling rules (in Algorithm 1) by grid search based on the performance of validation data in MIMIC-III discharge summaries.The detailed parameter tuning results of l and p are in Table S1-1 in Supplementary material 1, Weak Rule Parameter Tuning.We also tuned the size of context windows (default as 5), which however, did not affect the performance, probably because our final representation was based on the position of the mention in the BERT layer (see line 6 in Algorithm 2).Also, we tuned the optimal number of random training mention-UMLS pairs needed (n=9k) based on the validation set, which had little impact on the results (<1% F 1 score).
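The evaluation metrics can be computed directly from gold and predicted binary labels, as in the sketch below. The function name and the toy labels are assumptions for illustration; the second call shows why a tool that keeps every candidate has a reference recall of 100% while its precision equals the proportion of true positives among the candidates.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = correct phenotype)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 0, 0, 1]            # toy gold annotations
y_filtered = [1, 0, 1, 0, 1]        # toy output of a filtering model
y_keep_all = [1] * len(y_true)      # every candidate kept, as with the NER+L tool alone

filtered_scores = precision_recall_f1(y_true, y_filtered)
keep_all_scores = precision_recall_f1(y_true, y_keep_all)
print(filtered_scores, keep_all_scores)
```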
In contrast to weak supervision (WS), we also provide results on strong supervision (SS), the traditional approach that trains a model from fully manually labelled data. For MIMIC-III discharge summaries, we used the first 400 mentions (the validation set) of the full 1,073 mentions to train a model, M_strong, and tested on the remaining 673 mentions with the same inferencing step in Equation 2 but using M_strong instead of M_weak. As manually labelled data are usually more reliable than weakly labelled data, the performance of strong supervision is considered as an upper bound in studies on weak supervision [45,46].
We provide the results for each step in the pipeline (in Figure 1), Text-to-UMLS linking and UMLS-to-ORDO matching, followed by the overall results on rare disease identification, Text-to-ORDO linking and admission-level ORDO concept prediction.
The column statistics (n=N+/N) show the number of positive data N+ and all samples N in the dataset. SemEHR has a perfect reference recall, because all candidate mention-UMLS pairs were created using the tool. WS, weak supervision; SS, strong supervision. BlueBERT-base (PubMed+MIMIC-III) was used as the BERT model. The best scores, with or without considering strong supervision (SS), are bolded.

Main Results: Text-to-UMLS linking
Table 3 shows the validation and testing results of Text-to-UMLS linking. With weak supervision (WS), the precision and F1 of SemEHR were greatly improved, by around 55% and 40% absolute value, respectively, for both validation and testing data. Adding the two customised rules alone already improved the testing performance of SemEHR greatly, by over 30% F1 (as shown in SemEHR+rules), which validates the effectiveness of the two proposed rules with the NER+L tool for creating reliable weak annotations. Adding WS further outperformed the SemEHR+rules setting by around 10% absolute precision (and 5% F1), showing the usefulness of the contextual mention representation for filtering out false positives. The recall dropped slightly after introducing the two rules, which indicates bias or noise in the rules with the current thresholds (p as 0.5% and l as 3). Results with weak supervision are within a small gap of 5% F1 of strong supervision with hand-labelled data. Overall, this demonstrates the potential of WS to improve text phenotype entity linking.
As a solid evaluation needs to assess the system with differently biased test sets, we further split the testing data into those weakly labelled and those unlabelled during weak supervision. This helps analyse the impact of the rule-based weak supervision on the testing performance. "Seen" data mean that the mention-UMLS pairs were weakly labelled with λ, i.e. with both rules satisfied or both not satisfied (see lines 7-11 in Algorithm 1); "unseen" data mean that only one of the rules was satisfied, so that the data were not labelled in the process. WS improved the performance of SemEHR in both settings: while the weakly "seen" data were dramatically boosted by rules (by nearly 50% F1), the "unseen" data were greatly improved (by 10% F1) through the model generalised with contextual representations.
[8] https://github.com/huggingface/transformers
[9] We also benchmarked the performance on MIMIC-III discharge summaries with recent NER+L tools, MedCAT [26] and Google Healthcare Natural Language API [44]. However, given that the results (especially recall) may favour SemEHR-based methods, we do not formally report the results of the two NER+L tools but make them available at https://github.com/acadTags/Rare-disease-identification/blob/main/supp-results.
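The "seen" and "unseen" split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Algorithm 1 itself: it assumes that the length rule is satisfied when the mention has more than l characters and that the prevalence rule is satisfied when the corpus-based prevalence is below p. The function names, the example mentions, and the prevalence values are hypothetical.

```python
from typing import Optional

def weak_label(mention: str, prevalence: float, l: int = 3, p: float = 0.005) -> Optional[bool]:
    """Weak label for a candidate mention: True, False, or None (unlabelled)."""
    long_enough = len(mention) > l   # assumed direction of the length rule
    rare_enough = prevalence < p     # assumed direction of the prevalence rule
    if long_enough and rare_enough:
        return True                  # both rules satisfied: weak positive
    if not long_enough and not rare_enough:
        return False                 # neither rule satisfied: weak negative
    return None                      # only one rule satisfied: "unseen" in training

def rules_or(mention: str, prevalence: float, l: int = 3, p: float = 0.005) -> bool:
    """Rules-only prediction using an OR operation, favouring recall."""
    return len(mention) > l or prevalence < p

# Invented examples: (mention, corpus-based prevalence).
examples = [("retinitis pigmentosa", 0.0003), ("abc", 0.02), ("RP", 0.001)]
labels = [weak_label(m, prev) for m, prev in examples]
```

A short abbreviation with a low prevalence satisfies only one rule, so it receives no weak label and is left to the trained model at test time.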
Embedding and Encoding Strategies. We compared different embedding methods, including word embeddings and several BERT models pre-trained from different sources. Table 4 shows that methods based on contextual mention embeddings (e.g. with BERT, described in lines 4-6 in Algorithm 2) greatly outperformed word embeddings, although increasing the dimensionality of word2vec embeddings improved their recall and F1. For the contextual mention embeddings, we compared the vanilla BERT and representative pre-trained BERT models in the biomedical domain. We observed that BlueBERT, pre-trained using the in-domain (or same-data) MIMIC-III clinical notes, outperformed the BERT models pre-trained only from general domains (e.g. BERT), biomedical publications (e.g. PubMedBERT), or clinical ontologies (e.g. SapBERT). This supports the use of in-domain pre-trained models, e.g. BlueBERT, for the task, corroborating the conclusion from [37]. We also see that neither fine-tuning (cf. feature-based) nor the large version of BlueBERT improved the performance, which is probably because they introduce more learnable parameters (and a larger model size for BlueBERT-large), thus likely overfitting the weakly labelled data and underperforming on the real testing data. We further compared the encoding strategies and found that non-masked encoding (with document structures) achieved the best F1 scores on the validation data (see Table S1-2 in Supplementary material 1).
The column statistics (n=N+/N) show the number of positive data N+ and all samples N in the dataset. All word2vec-k embeddings were pre-trained from MIMIC-III discharge summaries, representing the mention as the averaged k-dimensional embedding of tokens in the context window. BERT models were used as static features (in the second-last layer) if not specified with "fine-tuning". The best scores, with or without considering strong supervision (SS), are bolded. We did not tune the optimal number of random weakly supervised training data for the BlueBERT-base model (and all other models), thus its results were slightly below those in Table 3.

UMLS-to-ORDO Matching Results
For UMLS-to-ORDO ontology matching, the original accuracy of the matching provided by the ORDO ontology was 87.4% (=83/95) over unique mentions; when considering the repeated mentions in the whole 1,073 evaluation data, the linking accuracy was 81.6% (=876/1,073). The three most frequent false UMLS-to-ORDO mappings in ORDO were Hyperlipidemia (C0020473) to Rare hyperlipidemia (Orphanet 181422), Epilepsy (C0014544) to Rare epilepsy (Orphanet 101998), and Dyslipidemias (C0242339) to Rare dyslipidemia (Orphanet 101953), all linking a broader, common disease concept to its specific types in rare diseases under the phenome type or the upper class [9] of group of disorders (Orphanet 557492). By filtering with ORDO's phenome type using "isNotGroupOfDisorders" (i.e. not under group of disorders), the UMLS-to-ORDO concept linking accuracy for the unique and repeated mentions was improved to 88.4% (from 87.4%) and 94.4% (from 81.6%), respectively, on the whole validation and testing data in the MIMIC-III discharge summaries.
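The "isNotGroupOfDisorders" filter can be sketched as follows. This is a minimal illustration, assuming each UMLS-to-ORDO match already carries a flag saying whether the ORDO concept falls under group of disorders; the class, the function name, and the last mapping entry are hypothetical, while the first two entries reuse the false mappings reported above.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class OrdoMatch:
    ordo_id: str
    label: str
    is_group_of_disorders: bool  # assumed flag derived from ORDO's phenome type

def keep_specific_matches(umls_to_ordo: Dict[str, List[OrdoMatch]]) -> Dict[str, List[OrdoMatch]]:
    """Drop matches under group of disorders, and CUIs left with no match."""
    filtered = {}
    for cui, matches in umls_to_ordo.items():
        kept = [m for m in matches if not m.is_group_of_disorders]
        if kept:
            filtered[cui] = kept
    return filtered

mappings = {
    "C0020473": [OrdoMatch("Orphanet 181422", "Rare hyperlipidemia", True)],
    "C0014544": [OrdoMatch("Orphanet 101998", "Rare epilepsy", True)],
    # Hypothetical placeholder for a specific (non-group) rare disease.
    "C_HYPOTHETICAL": [OrdoMatch("Orphanet HYPOTHETICAL", "A specific rare disease", False)],
}
filtered = keep_specific_matches(mappings)
```

As the error analysis notes, a heuristic of this kind can also remove correct mappings, so it trades some recall for precision.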

Overall Mention-level and Admission-level Results
We finally obtained the mention-level results (Text-to-ORDO) based on the two parts of the system. The results, shown in Table 5, are consistent with the Text-to-UMLS results. The overall metrics are lower than the Text-to-UMLS results (71.7% vs 86.1% testing F1 score for WS) due to the imperfect matching between UMLS and ORDO. With a perfect UMLS-to-ORDO matching, the results of Text-to-UMLS and Text-to-ORDO would be the same.
In the interest of detecting rare disease cases in admissions, we aggregated the mention-level results to admission-level results, where one admission may be associated with several unique rare diseases (each as a concept in ORDO). Thus, we report the standard micro-level label-based metrics for multi-label classification [47]. Micro-level metrics count each pairing of an admission with a single ORDO concept as an instance and create a confusion matrix to calculate the precision, recall, and F1 scores. We were also able to obtain ICD-based results purely based on ontology matching (from ICD-9 codes to ICD-10 or UMLS concepts, then finally to ORDO concepts, as shown in Figure 1). Admission-level results were generally consistent with mention-level (Text-to-UMLS and Text-to-ORDO) results. In terms of precision and F1 score, weak supervision greatly improved the performance of SemEHR and outperformed other third-party tools, slightly below strong supervision, while the recall was the same for both WS and SS. We also obtained the admission-level results of ICD codes.
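The aggregation and the micro-level metrics described above can be sketched as follows. This is a minimal illustration with invented identifiers; the function names are hypothetical. An admission is paired with an ORDO concept as soon as one of its mentions is predicted positive.

```python
def aggregate_to_admissions(mention_predictions):
    """Collect the (admission, ORDO concept) pairs with at least one positive mention."""
    pairs = set()
    for admission_id, ordo_id, is_positive in mention_predictions:
        if is_positive:
            pairs.add((admission_id, ordo_id))
    return pairs

def micro_prf(gold_pairs, pred_pairs):
    """Micro-level precision, recall, and F1 over admission-ORDO pairs."""
    tp = len(gold_pairs & pred_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented mention-level predictions: (admission, ORDO concept, predicted positive).
mentions = [
    ("adm1", "ordoA", True),
    ("adm1", "ordoA", True),   # repeated mention collapses into one pair
    ("adm1", "ordoB", False),
    ("adm2", "ordoA", True),
]
pred_pairs = aggregate_to_admissions(mentions)
gold_pairs = {("adm1", "ordoA"), ("adm2", "ordoB")}
scores = micro_prf(gold_pairs, pred_pairs)
```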
Admission-level results are presented in Table S1-3 in Supplementary Material 1. We found that our NLP-based approach (SemEHR+WS) achieved better precision and F1 scores than the code-based approach (ICD). In terms of recall, ICD codes could only identify a few more rare disease cases than SemEHR with weak supervision (e.g. 21 vs 20 out of 30 in the validation set and 36 vs 33 out of 42 in the test set, between ICD ∪ SemEHR+WS and SemEHR+WS). Note that this result may not be accurate, as our annotation is based on the string matching based NER+L results from SemEHR, so the false positives from ICD-based cohorts may actually be true cases. Also, the number of positive data is much lower in admission-level results than in mention-level results (e.g. for testing data, 42 admissions vs. 187 mention-UMLS pairs). Nevertheless, our results show the essential role of free texts and NLP methods for rare disease phenotyping; the results are consistent with the conclusion in [24] regarding general diseases.

Error Analysis
We break down the errors of the proposed approach ("SemEHR+WS") regarding Text-to-ORDO in MIMIC-III discharge summaries (see results in Table 5) in Figure 3. There were altogether 91 errors (59 false positives and 32 false negatives), representing 8.5% of the 1,073 candidate mention-UMLS-ORDO triplets, of which 61 (or 5.7%) were from the Text-to-UMLS stage and 30 (or 2.8%) only from the UMLS-to-ORDO stage (and 4 in both stages).
While rules are effective for WS, they may also introduce some bias. Over half (57.4%, or 35 of 61) of the errors from the Text-to-UMLS side were likely due to the bias introduced by the weak rules, where the prediction was wrong when using the weak rules only. The other two main error types were (i) semantic type errors (26.2%, or 16 out of 61), where the mention was a (negative) laboratory test (e.g. "legionella") or another unrelated type (e.g. "ENDO" as a department name) instead of a disease, and (ii) diseases in hypothetical or negative contexts (6.6%, or 4 out of 61), which were not filtered out by the NER+L tool, SemEHR, and were also challenging for the annotators. The remaining errors (9.8%, or 6 out of 61) were due to insufficient information for a human to decide, or no exact reason was found for the error. The issues above may be addressed by combining WS with human-in-the-loop machine learning [48] with adaptive rules to improve the performance. The wrong UMLS-to-ORDO ontology mappings were due to the simple heuristic ("isNotGroupOfDisorders"), which also filtered out correct mappings; this may be addressed when the official ontology matching is updated or by using a machine learning based system to correct the matching.

NLP vs. ICD for Rare Disease Phenotyping
We applied the trained model and the whole pipeline to process all MIMIC-III discharge summaries (n=59,652) and compared the rare disease admissions identified from NLP and ICD. The NLP approach is the proposed ontology-driven and weakly supervised pipeline. For the ICD-based results, we combined the ICD-9 codes matched to either the UMLS or ICD-10 codes linked to ORDO (see Figure 1). Using our NLP-based pipeline, it is possible to greatly enrich the rare disease cases identified solely from ICD codes. For most (97.2% = 453/466) types of the rare diseases, our approach mining free texts could add at least one (and usually many) potential rare disease case compared to the ICD-based approach. The results can be useful to identify potential cases for an alerting system for clinical care or as a base for further refinement. Figure 4 shows the selected 10 rare diseases which were best predicted in the annotated 312 discharge summaries; however, since the support was small (between 1 and 5) for each of the diseases in the admission-level evaluation, the results did not represent the predictions of the full 59k admission cases in MIMIC-III.
We thus further performed an extra manual evaluation to verify whether the rare disease cases identified by NLP were true phenotypes (or represented a current or past rare disease of the patient), as there was no gold reference standard.Five researchers (one in clinical science, one in biomedical science, and the remaining three in MI) screened the 1,428 cases or patient stays identified by NLP (WS or SS) regarding the 10 selected diseases, according to the definitions of the rare diseases in ORDO.The accuracy scores (the fraction of correct rare disease cases in all identified cases) of the weakly-supervised NLP-identified rare diseases are displayed after each horizontal bar in Figure 4. We can see that NLP identified most rare diseases (6/10) with an accuracy score from around 70% to over 90%.For rheumatic fever, over 90% of the cases were true positives, except for a few hypothetical mentions or the subject being the patient's relative.Some examples are provided in Table S2-1 in Supplementary material 2. As rheumatic fever is usually a historical disease when the patient was a child, the disease was commonly not coded with ICD.
For certain rare diseases, the accuracy score from the manual evaluation was very low, e.g. 0.0% for IRIDA syndrome due to "microcytic anaemia" wrongly assigned as a synonym or an atom of C0085576 ("Iron-Refractory Iron Deficiency Anemia" or IRIDA) in the previous UMLS version (2019AA) in the Text-to-UMLS process, and 8.2% and 43.8% for Retinitis Pigmentosa and Progressive Multifocal Leukoencephalopathy, respectively, due to the ambiguous meanings of their abbreviations ("RP" and "PML"), which were unseen in WS (with a low corpus-based prevalence below 0.5%). For Multifocal Atrial Tachycardia, the definition in ORDO is a neonatal disease, while its matched UMLS concept of the same name may also mean an adult disease. We also found difficulty in reaching a consensus in the annotation due to the vague definition of Acute Liver Failure in ORDO [10], for which we derived two distinct interpretations which were then reconciled by a senior clinician [11]. This analysis suggests that we should take the definitions into consideration in entity linking and ontology matching. We should also ensure that the definitions used are appropriate for the clinical research question for people using the tools.
[10] https://www.ebi.ac.uk/ols/ontologies/ordo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet90062
[11] Our two interpretations of acute liver failure differ most in the factors of drug use, alcohol abuse, virus infection, etc., that could contribute to the rarity of the disease but are not specified in the definition from ORDO. We finally considered hepatitis virus or drugs as causes of acute liver failure as a rare disease, but removed cases of alcohol abuse.
Admissions are split into those only identified through links from ICD-9 codes (in black), those only identified from free texts with weak supervision (NLP, in white), and the intersection of cases from both ICD-9 and NLP (in grey). The percentage after each horizontal bar shows the accuracy of NLP based on the manual assessment of the identified cases.
Although the accuracy scores were not perfect, for all diseases except IRIDA syndrome, NLP could still enrich the cases identified from ICD-9 after the manual check by the experts. We also find that with ICD codes, it is possible to find cases not identified by NLP as well, as shown in asbestos intoxication, necrotizing enterocolitis, etc., which may be related to the imperfect recall of the NLP model or to the rare diseases not being (explicitly) mentioned in the clinical note. In general, the results above on rare diseases extend the conclusion of the previous survey on case detection [24] that NLP with free texts can greatly enrich the information from ICD codes and that the two sources complement each other. We further present the results of NLP with strong supervision in Figure S1-1 in Supplementary material 1, which overall predicted fewer cases and resulted in better accuracy scores, but reflected the same picture as with weak supervision.
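The comparison between the two sources can be sketched with plain set operations. This is a minimal illustration with invented admission identifiers; the function names are hypothetical. The accuracy helper mirrors the definition used in the manual evaluation (the fraction of correct rare disease cases among all identified cases).

```python
def split_cohort(icd_admissions, nlp_admissions):
    """Split admissions for one rare disease by the source that identified them."""
    icd, nlp = set(icd_admissions), set(nlp_admissions)
    return {
        "icd_only": icd - nlp,   # found only through ICD-9 code links
        "nlp_only": nlp - icd,   # found only from free text
        "both": icd & nlp,       # found by both sources
    }

def manual_accuracy(n_correct, n_identified):
    """Fraction of identified cases confirmed as true by manual screening."""
    return n_correct / n_identified if n_identified else 0.0

# Invented admission identifiers for a single rare disease.
cohort = split_cohort(icd_admissions=[101, 102, 103], nlp_admissions=[102, 103, 104, 105])
enriched_by_nlp = len(cohort["nlp_only"])
```

For instance, 15 confirmed cases out of 183 identified ones correspond to an accuracy of about 8.2%.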

Transfer and Re-training with Radiology Reports
For external validation, we applied the proposed weak supervision pipeline and models to extract rare disease phenotypes from two datasets of radiology reports, US MIMIC-III radiology reports (n=520k) [20] and UK NHS Tayside brain imaging reports (n=156k) [17]. For each of the datasets, we selected a subset of clinical notes (1,000 for MIMIC-III and 5,000 for Tayside), and obtained the candidate mention-UMLS pairs with SemEHR to be labelled for evaluation. The detailed data statistics are in Table 2. Based on the real-world practice of NLP, we consider two ways to apply the pipeline in Figure 2: (i) model transfer and (ii) in-domain re-training. For model transfer, we directly applied our phenotype confirmation models, M_weak (and M_strong), trained from MIMIC-III discharge summaries, to the two new datasets; for in-domain re-training, we created weakly labelled training data from each new dataset and trained a data-specific phenotype confirmation model with Algorithms 1-2; we further tuned the parameters p and l in the weak labelling rules during re-training.
Table 6 shows the external validation results of the NLP pipeline with model transfer or in-domain re-training. We mainly present the Text-to-UMLS results, which are consistent with the Text-to-ORDO results in Table S1-4 and the admission-level results in Table S1-5 in Supplementary material 1. It is observed that directly applying a weak supervision model trained from another type of report (e.g. discharge summaries) could largely improve the precision and F1 score of SemEHR, with a slight drop of recall from nearly 100% to over 90%. This transferability of models suggests that there are common linguistic patterns used in all types of clinical notes, even from different sources. The strong supervision model obtained a higher precision, but with a much lower recall (a drop of 20% to over 30% compared to SemEHR only) and thus may bear the risk of missing true positive mentions. Results from the in-domain re-training of models were much better than model transfer, as the former could bridge the linguistic gap between discharge summaries and radiology reports even for the same cohort or institution in MIMIC-III. We further tuned the weak labelling parameters to optimise the recall or F1 score. A perfect recall or no loss of recall (100% or near 95%) was achieved, on par with SemEHR, and the precision was further improved compared to using the original parameters. Although the parameter tuning process was based on the full annotated data, this can be substituted by the inspection of a small number of data at the rule designing stage. Finally, we noticed that simply using rules (SemEHR+rules) with the best tuned parameters was highly effective, achieving better results than most evaluation settings, but still surpassed by the best tuned WS model, especially for the Tayside reports. The results between rules only and weak supervision were consistent with those of the discharge summaries in Table 3.
The column statistics (n=N+/N) show the number of positive data N+ and all samples N in the dataset. WS, weak supervision; SS, strong supervision. The original parameters for WS were p = 0.005 and l = 3. The new parameters for best recall (R) were p = 0.01 and l = 4 and for best F1 were p = 0.0005 and l = 4, for both datasets. For SemEHR+rules, we present the results of rules, where p = 0.0005 and l = 4, with an OR operation. The best scores for the metrics are bolded.

Conclusion, Discussion, and Future Studies
In this study, we proposed an ontology-driven and weakly supervised approach for rare disease phenotyping from clinical notes. Unlike the use of ontologies, weak supervision has not been well established in the clinical NLP domain. Our proposed weakly supervised deep learning approach requires no human annotation and extends the paradigm from [13] on weak supervision for clinical texts, by introducing ontologies, named entity linking tools, and contextual representations. We designed two simple but effective rules (mention character length and corpus-based "prevalence") to create weakly labelled data regarding ambiguous abbreviations and rare entities. The trained phenotype confirmation model effectively filtered out the false positives in the data with no (or a minimal) side effect on the true positives.
Traditional clinical NLP relies heavily on strong supervision with manually labelled data. However, with recent data-demanding methods like deep learning, it is time to consider automatically creating labelled data to train models, with the support of rules and resources like ontologies and NER+L tools. Our work on rare diseases provides empirical evidence for the task by applying a weakly supervised NLP pipeline on three clinical note datasets (one of discharge summaries and two of radiology reports) from two institutions in the US and the UK. The improvements in precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Our study also demonstrates that NLP can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes (see Figure 4).
While our rule-based weak supervision does not require annotated data, it can bring bias or noise, as no simple rule can perfectly predict the labels for a complex task. This bias, although not affecting most predictions for the testing data, was manifested in the slight drop of recall in Text-to-UMLS linking (Table 3). This loss of recall may be minimised through tuning the parameters in the weak labelling rule (e.g. relaxing the "prevalence" or mention length threshold, shown in Table 6), but this needs a small set of annotated data or some manual inspection of the predictions. The mention character length rule may also be enhanced with accurate abbreviation expansion and disambiguation to retain abbreviations that are rare diseases. Besides, recent studies in the general NLP domain have begun tackling the bias of rules (with a rule-level attention mechanism [49]) or the noise of weakly labelled data (with the estimation of data-level confidence [50]). Also, we used a heuristic-based logic operation (as XNOR) to aggregate the two rules; future studies can explore more advanced aggregation methods (e.g., learning a label model [45,46]).
As suggested in our results and other studies [45,46], the current performance of the best weakly supervised methods is still below strong supervision. But the gap between weak and strong supervision is small (within 5% F1 score) and there is no difference in terms of recall. This shows that the expensive and time-consuming annotations for text phenotyping may be greatly reduced, substituted by an alerting system or manual screening based on the predictions of a weakly supervised NLP system. With a small number of annotated data for parameter tuning, both the precision and recall of our weak NLP model were further improved (see Table 6). This may suggest a future study to better use a small sample of annotated data with the weakly annotated data for semi-supervised learning to improve the performance.
There are still, however, some false positive mentions detected by the proposed NLP pipeline, as shown in our analyses of the prediction errors and the identified cohorts (in Figures 3-4). Disambiguating entity types (especially for abbreviations) remains a challenge for text phenotyping. This suggests potentially integrating word sense disambiguation to enhance the weak supervision approach, e.g., through more reliable weak data creation. Also, errors in identifying hypothetical and negation ("Hypo/neg") mentions suggest separately modelling "Hypo/neg" in the classification, which can be learned with mentions beyond the scope of rare diseases. Furthermore, the complexities of linguistic patterns of a (rare) disease may still require better representations beyond the current context window and may need to be enhanced with ontology concepts. Our evaluation of the NLP-identified cases suggests modelling the semantics of the lexical definitions in ontologies (e.g. ORDO) to improve entity linking and ontology matching. Also, we note that our work is highly dependent on existing ontologies and their available matchings to each other. We leveraged and validated the matching among ORDO, UMLS, ICD-10, and ICD-9. The current matchings are generally correct, but not perfect (e.g. 88.4% accuracy of matching between UMLS and ORDO). A more accurate matching among ontologies, potentially corrected with machine learning [51], will improve the performance of our pipeline. It is also possible to directly match texts to ORDO, which can include rare diseases not contained in UMLS, but this does not leverage the synonyms in UMLS that represent the name variation of rare disease entities. Also, our approach cannot identify emerging rare disease entities, which are not contained in the ontologies and thus not easily captured by SemEHR; this is the next, challenging direction for our study.
While we only enhanced SemEHR with the weakly supervised phenotype confirmation model, the approach can be adapted to improve other NER+L tools and models to support more accurate rare disease cohort selection and coding. Recently, more packages and environments (e.g. Snorkel [46], skweak [52]) have been created to apply weak supervision in general domain NLP practice. Thus, a promising future study is to adapt the current weak supervision infrastructures or the ideas behind them to the clinical NLP domain and establish best practices in the field; a recent work adapting Snorkel [46] is Trove [45], which has not yet been applied to the domain of rare diseases, which involves additional ontologies and their mappings.
Our work mainly focused on identifying rare disease concepts in the clinical notes, while other physical, behavioural, and physiological characteristics need to be identified so as to establish a clinical diagnosis of a rare disease. We also mainly focused on rare diseases as a whole, and the approach can be applied to identify specific rare diseases. Future work needs to extract a wider set of information to enhance rare disease phenotyping, and to facilitate the development of risk prediction tools for rare diseases to support decision making during the COVID-19 pandemic and beyond [53,54].
For fine-tuning BERT models, we used the average pooling of the mention's sub-token representations in the second-last layer (same as the Contextual Mention Representation, with fine-tuning instead of static embedding), followed by a linear layer with softmax activation and cross-entropy loss. The learning rate, warmup steps, and weight decay were 5e-05, 500, and 0.01, respectively, set up using Huggingface Trainer [2], and the model was trained for 3 epochs. We fine-tuned the BlueBERT-based model (see Table 4 in the paper).

Results on Different Strategies
The first encoding strategy is mention masking, i.e. whether or not to mask the mention in the full context window. The intuition behind this is to explore the potential of a language model to confirm a phenotype solely based on the surrounding context but not the mention itself.
The second encoding strategy is using document structure names (or template section names) to enhance local context. If the document structure name s is available in the dataset, we add s before the context window t with a separation token [SEP] in between.
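The two encoding strategies can be sketched as follows. This is a minimal illustration under stated assumptions: the context window is taken as a number of whitespace tokens on each side of the mention, and masking replaces each mention token with a [MASK] token. The function name and the example sentence are hypothetical.

```python
def encode_mention(left, mention, right, window=5, mask_mention=False, section_name=None):
    """Build the text sequence for one mention from tokenised context."""
    # Keep at most `window` tokens on each side of the mention (assumed definition).
    left_ctx = left[max(0, len(left) - window):]
    right_ctx = right[:window]
    # Mention masking: hide the mention so only the context can inform the model.
    centre = ["[MASK]"] * len(mention) if mask_mention else list(mention)
    tokens = left_ctx + centre + right_ctx
    # Document structure name s is placed before the window t, separated by [SEP].
    if section_name:
        tokens = [section_name, "[SEP]"] + tokens
    return " ".join(tokens)

left = "the patient has a history of".split()
mention = ["rheumatic", "fever"]
right = "as a child .".split()

non_masked_with_section = encode_mention(
    left, mention, right, window=3, section_name="Past Medical History"
)
masked = encode_mention(left, mention, right, window=3, mask_mention=True)
```

The resulting string would then be tokenised by the BERT tokenizer, which may split each word into several sub-tokens.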
Results on the different encoding strategies for Text-to-UMLS linking in MIMIC-III discharge summaries are displayed in Table S1-2. Non-masked encoding achieved better results than masked encoding. Using document structure names further boosted recall scores on the validation and the test set. We used non-masked encoding (with document structure names for MIMIC-III discharge summaries only) for data representation.
M denotes mention masking and non-M denotes no mention masking applied. DS denotes using document structure names. The non-M+DS model was trained on the full set of weakly labelled data, without tuning the optimal number of data, thus slightly below results in Table 2. BlueBERT-base (PubMed+MIMIC-III) was used to encode the text sequences.

NLP with Strong Supervision vs. ICD for Admission-level Rare Disease Identification
Figure S1-1 shows the results of the NLP pipeline with strong supervision compared to ICD codes for admission-level rare disease phenotyping. The results were generally consistent with the weak supervision approach (in Figure 4 in the paper), in that NLP-based results greatly complement the code-based rare disease cohort. Generally, strong supervision predicted fewer admissions with a higher accuracy than weak supervision (e.g. for "Retinitis Pigmentosa", the accuracy was 25.5% or 14/55 with strong supervision, compared to 8.2% or 15/183 with weak supervision).

Overall Admission-level and Mention-level Results
Table S1-3 shows the admission-level rare disease phenotyping results for MIMIC-III discharge summaries.
[2] https://huggingface.co/docs/transformers/main_classes/trainer
4 in the paper. Admissions are split into those only identified through links from ICD-9 codes (in black), those only identified from free texts with strong supervision (NLP, in white), and the intersection of cases from both ICD-9 and NLP (in grey). The percentage after each horizontal bar shows the accuracy of NLP based on the manual assessment of the identified cases.
Tables S1-4 and S1-5 show the overall mention-level (Text-to-ORDO) and admission-level results of two radiology report datasets in the US (MIMIC-III) and the UK (NHS Tayside). For Tayside data, the recall was lower as we manually identified new rare disease mentions that were not included in the candidate mentions from SemEHR. Weak supervision (WS) achieved better recall than transferring the SS model in the results from both tables. The code-based approach (ICD) also did not show an advantage in identifying more rare disease admissions (see recall, R) or in overall performance (see F1), comparing ICD or "ICD ∪ SemEHR+WS" with the (best) SemEHR+WS setting in Table S1-5 and Table S1-3, but the results may be biased towards methods adapting SemEHR, as it was used as a starting source to create candidate mentions for the manual annotation.

Examples of Rare Disease Text Phenotyping
Table S2-6 (on page 5) shows selected prediction errors and a few correct predictions. The first four examples are false positives selected from the evaluation data for the weak supervision model, caused by semantic type errors, hypothetical contexts, or other issues. The last five examples are selected from the identified rare disease cohort for Retinitis Pigmentosa and Rheumatic Fever. Synonyms in UMLS could help identify some name variations, e.g. "tracheobronchomalacia" for Williams-Campbell syndrome and "acute rheumatic fever" for Rheumatic fever, but they also introduce false positives, especially for abbreviations, e.g. "EMA" and "RP". Complex contexts in the clinical notes, including mentions of a relative's diseases or hypothetical mentions, represent only a small part of the cases but were still challenging for the NLP pipeline (SemEHR+WS), as they were not explicitly considered in the weakly supervised training process. We also note that there were errors in parsing the document structure names through regular expressions in SemEHR, which might affect the predictions.

The micro-level metric counts each admission and an associated ORDO concept (or an admission-ORDO pair) as a single instance. The column statistics (n = N+/N_d * N_l) show the number of positive data N+, admissions (or discharge summaries) N_d, and unique candidate rare diseases (or ORDO concepts) N_l in the dataset. WS, weak supervision; SS, strong supervision; anns, annotations. BlueBERT-base (PubMed+MIMIC-III) was used as the BERT model. ICD denotes the approach of matching ICD-9 codes to ORDO concepts. The union sign (∪) denotes merging and de-duplicating the cases identified from the two methods. Precision (P) and F1 for ICD-based methods may be lower than actual values, as all candidate mentions were from SemEHR.

The column statistics (n = N+/N) show the number of positive data N+ and the overall number of samples N in the dataset. WS, weak supervision; SS, strong supervision. The original parameters for WS were p = 0.005 and l = 3. The new parameters for best recall (R) were p = 0.01 and l = 4 and for best F1 were p = 0.0005 and l = 4, for both datasets. For SemEHR+rules, rules were aggregated with an OR operation and p = 0.0005 and l = 4.
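As a minimal sketch of the micro-level counting and the union (∪) described in these notes, the Python below treats each (admission, ORDO concept) pair as one instance and merges ICD-based and NLP-based cases with a set union, which de-duplicates them. The function name `micro_prf`, the admission IDs, and the ORDO labels are hypothetical and used only for illustration; this is not the authors' evaluation code.

```python
def micro_prf(predicted, gold):
    """Micro-level precision, recall and F1 over (admission, ORDO) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: each tuple is one admission-ORDO pair.
icd_pairs = {(101, "ORDO:A"), (102, "ORDO:B")}
nlp_pairs = {(101, "ORDO:A"), (103, "ORDO:A")}
gold_pairs = {(101, "ORDO:A"), (103, "ORDO:A"), (104, "ORDO:B")}

# ICD ∪ NLP: merge and de-duplicate the cases from the two methods.
union_pairs = icd_pairs | nlp_pairs

for name, pairs in [("ICD", icd_pairs), ("NLP", nlp_pairs), ("ICD ∪ NLP", union_pairs)]:
    p, r, f1 = micro_prf(pairs, gold_pairs)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Because the pair (101, "ORDO:A") is found by both methods, it is counted once in the union, so adding ICD-based cases can raise recall only through pairs that the text-based method missed.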

Ontology Matching from ORDO to ICD-9
Table S2-7 (on page 6) shows 10 examples of rare disease concepts and their ontology matching from ORDO to UMLS, ICD-10, and ICD-9. The rare diseases are the same as those presented in Figure 4 in the paper and Figure S1-1 in Supplementary material 1.

The micro-level metric counts each admission and an associated ORDO concept (or an admission-ORDO pair) as a single instance. The column statistics (n = N+/N_d * N_l) show the number of positive data N+, the number of admissions (or discharge summaries) N_d, and the number of candidate rare diseases (or ORDO concepts) N_l in the dataset. WS, weak supervision; SS, strong supervision. The original parameters for WS were p = 0.005 and l = 3. The new parameters for best recall (R) were p = 0.01 and l = 4 and for best F1 were p = 0.0005 and l = 4, for both datasets. For SemEHR+rules, rules were aggregated with an OR operation and p = 0.0005 and l = 4. The union sign (∪) denotes merging and de-duplicating the cases identified from the two methods. For ICD ∪ SemEHR+WS, the WS model was "in-domain + tuning R", the one re-trained with in-domain data and optimised for recall. Precision (P) and F1 for ICD-based methods may be lower than actual values, as all candidate mentions were from SemEHR.

Prediction errors are coloured in red in the "Pred" (third-last) column. For columns "Pred" and "Label", "T" means that the prediction or gold label is True and "F" indicates False. The wrongly parsed document structure names in the second column are marked with corrected ones in the form of "(should be XXX)".

Figure 1
A pipeline for rare disease identification from clinical notes with ontologies and weak supervision. The upper horizontal lines (in red) show the proposed pipeline based on clinical notes (e.g. discharge summaries and radiology reports in US MIMIC-III and UK NHS Tayside) and ontologies, including two steps (Text-to-UMLS and UMLS-to-ORDO). No annotation data are needed, through a UMLS extraction tool, SemEHR, and weak supervision (WS) based on customised rules and BERT-based contextual representations (see details on WS in Figure 2). The admission ID and ICD-9 codes (linked with dotted lines) are only available for the MIMIC-III data. The lower, dotted lines show a baseline approach purely based on manual ICD codes, also enhanced with ontology matching. (Figure adapted from [7].)

Figure 4
Number of rare disease patient stays from MIMIC-III (n=59,652): ICD (code-based) vs. NLP (text-based, with weak supervision), for 10 selected diseases. Admissions are split into those only identified through links from ICD-9 codes (in black), those only identified from free texts with weak supervision (NLP, in white), and the intersection of cases from both ICD-9 and NLP (in grey). The percentage after each horizontal bar shows the accuracy of NLP based on the manual assessment of the identified cases.
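The three-way split of admissions in this figure can be sketched with plain set operations. The snippet below is only an illustration under assumed inputs: the function name `split_by_source` and the admission IDs are hypothetical, and the real figure is computed per rare disease over MIMIC-III admissions.

```python
def split_by_source(icd_admissions, nlp_admissions):
    """Split admissions for one disease into ICD-only, NLP-only and both."""
    return {
        "icd_only": icd_admissions - nlp_admissions,   # black bar segment
        "nlp_only": nlp_admissions - icd_admissions,   # white bar segment
        "both": icd_admissions & nlp_admissions,       # grey bar segment
    }


# Hypothetical admission IDs for a single rare disease.
icd = {1, 2, 3}
nlp = {3, 4, 5, 6}

groups = split_by_source(icd, nlp)
counts = {name: len(ids) for name, ids in groups.items()}
print(counts)
```

The three groups are disjoint, so their sizes sum to the size of the union of the two sources.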

Figure 2
Weak supervision process for Text-to-UMLS linking. The left four white text boxes display the metadata (with examples) of a candidate mention-UMLS pair, identified by a Named Entity Recognition and Linking (NER+L) tool, SemEHR; the coloured text boxes in the middle show the contextual representation block (in green) and the rule-based weak data labelling (in blue). A binary label is then generated, which weakly estimates whether the candidate pair indicates a correct phenotype of the patient. A phenotype confirmation model (in grey) is then learned to select correct phenotypes from the pairs. (Figure adapted from [7].)

Table 1
Examples of false positive mention-UMLS pairs in entity linking identified from SemEHR and Bio-YODIE. Each mention is bolded in its context window.

Algorithm 1
Weakly supervised data creation
Require: T, document set; c_UMLS ∈ O_UMLS, UMLS concepts and ontology; c_ORDO ∈ O_ORDO, ORDO concepts and ontology.
Ensure: D_weak, weakly labelled data
1 Initialise D_weak ← ∅;

Table 2
Statistics of Clinical Note Datasets with the Natural Language Processing Pipeline and Manual Annotations
|D_weak^+|, |D_weak^-|, number of weakly labelled positive and negative mention-UMLS pairs, respectively; |T_RD|, |T_RD^weak|, number of documents associated with one or more rare diseases detected by SemEHR and SemEHR+WS (i.e. further with weak supervision), respectively; |T_ann|, |D_ann|, |T_RD^ann|, number of documents sampled, number of mention-UMLS pairs sampled, and number of the sampled documents with one or more rare diseases identified by SemEHR, respectively. For Tayside data, 4 new positive mention-UMLS pairs in |D_ann| were identified from the reports during the manual annotation.

Table 3
Evaluation results of Text-to-UMLS linking on validation and testing data from MIMIC-III discharge summaries

Table 4
Comparison among embeddings for weakly supervised Text-to-UMLS linking from MIMIC-III discharge summaries

Table 5
Results on rare disease identification (Text-to-ORDO) from MIMIC-III discharge summaries

Table 6
External Validation Results on Radiology Reports from MIMIC-III and NHS Tayside

Table S1 - 2
Comparison among encoding strategies for weakly supervised Text-to-UMLS linking on MIMIC-III discharge summaries

Number of rare disease patient stays from MIMIC-III (n=59,652): ICD (code-based) vs. NLP (text-based, with strong supervision). The 10 rare diseases are the same as those presented for weak supervision in Figure

Table S1 - 3
Micro-level results of admission-level rare disease identification for MIMIC-III discharge summaries

Table S1 - 4
Results on rare disease identification (Text-to-ORDO) for MIMIC-III and Tayside radiology reports

Table S1 - 5
Micro-level results of admission-level rare disease identification for MIMIC-III and Tayside Radiology Reports

Table S2 - 6
Examples of wrong and correct rare diseases identified by SemEHR with the weakly supervised phenotype confirmation model from MIMIC-III discharge summaries