Ontology-driven and weakly supervised rare disease identification from clinical notes

Table 2 Statistics of Clinical Note Datasets with the Natural Language Processing Pipeline and Manual Annotations

\(|T|\), number of documents; \(|D|\), number of mention-UMLS pairs; \(|D_{weak^+}|\), \(|D_{weak^-}|\), number of weakly labelled positive and negative mention-UMLS pairs, respectively; \(|T_{RD}|\), \(|T^{weak}_{RD}|\), number of documents associated with one or more rare diseases detected by SemEHR and by SemEHR+WS (i.e. SemEHR further filtered with weak supervision), respectively; \(|T^{ann}|\), \(|D^{ann}|\), \(|T^{ann}_{RD}|\), number of documents sampled for manual annotation, number of mention-UMLS pairs sampled, and number of sampled documents with one or more rare diseases identified by SemEHR, respectively. For the Tayside data, 4 new positive mention-UMLS pairs, included in \(|D^{ann}|\), were identified from the reports during manual annotation.
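As a rough illustration of how the counts defined in this legend relate to one another, the sketch below derives them from a list of mention-UMLS pairs. It is not the authors' pipeline: the `MentionPair` record, its fields `weak_label` and `ws_positive`, the function `dataset_statistics`, and the placeholder concept identifiers are all hypothetical, and it assumes that every pair in \(D\) is a SemEHR rare disease candidate and that a document counts towards \(|T^{weak}_{RD}|\) when at least one of its pairs is retained after weak supervision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MentionPair:
    """Hypothetical record for one mention-UMLS pair found by the NLP pipeline."""
    doc_id: str
    mention: str
    concept_id: str             # placeholder for a UMLS concept identifier
    weak_label: Optional[bool]  # True/False = weakly labelled +/-, None = no weak label
    ws_positive: bool           # assumed flag: pair retained after weak supervision


def dataset_statistics(doc_ids, pairs):
    """Compute the legend's counts from document ids and mention-UMLS pairs."""
    return {
        "|T|": len(set(doc_ids)),
        "|D|": len(pairs),
        "|D_weak+|": sum(1 for p in pairs if p.weak_label is True),
        "|D_weak-|": sum(1 for p in pairs if p.weak_label is False),
        # Documents with at least one candidate pair (assumed SemEHR detection).
        "|T_RD|": len({p.doc_id for p in pairs}),
        # Documents with at least one pair kept by weak supervision (SemEHR+WS).
        "|T_RD^weak|": len({p.doc_id for p in pairs if p.ws_positive}),
    }


# Toy data, invented purely for illustration.
docs = ["d1", "d2", "d3"]
pairs = [
    MentionPair("d1", "mention a", "CUI_A", weak_label=True, ws_positive=True),
    MentionPair("d1", "mention b", "CUI_B", weak_label=False, ws_positive=False),
    MentionPair("d2", "mention b", "CUI_B", weak_label=None, ws_positive=False),
]
stats = dataset_statistics(docs, pairs)
print(stats)
```

The annotated counts \(|T^{ann}|\), \(|D^{ann}|\) and \(|T^{ann}_{RD}|\) would follow from applying the same function to the sampled subset of documents and pairs.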

ISSN: 1472-6947