Table 2 Statistics of Clinical Note Datasets with the Natural Language Processing Pipeline and Manual Annotations

From: Ontology-driven and weakly supervised rare disease identification from clinical notes

 

                      MIMIC-III Disch   MIMIC-III Rad   Tayside Brain Img
|T|                   59,652            522,279         156,618
|D|                   127,150           109,096         7,761
\(|D_{weak^+}|\)      15,598            13,907          1,137
\(|D_{weak^-}|\)      74,217            65,171          2,898
\(|T_{RD}|\)          37,110            73,589          7,321
\(|T^{weak}_{RD}|\)   10,568            21,102          2,855
\(|T^{ann}|\)         500               1,000           5,000
\(|D^{ann}|\)         1,073             198             279+4
\(|T^{ann}_{RD}|\)    312               145             273

  1. |T|, number of documents; |D|, number of mention-UMLS pairs; \(|D_{weak^+}|\), \(|D_{weak^-}|\), number of weakly labelled positive and negative mention-UMLS pairs, respectively; \(|T_{RD}|\), \(|T^{weak}_{RD}|\), number of documents associated with one or more rare diseases detected by SemEHR and by SemEHR+WS (i.e. SemEHR further filtered with weak supervision), respectively; \(|T^{ann}|\), \(|D^{ann}|\), \(|T^{ann}_{RD}|\), number of documents sampled, number of mention-UMLS pairs sampled, and number of sampled documents with one or more rare diseases identified by SemEHR, respectively. For the Tayside data, 4 new positive mention-UMLS pairs in \(|D^{ann}|\) were identified from the reports during manual annotation (shown as 279+4).
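
The counts above can be turned into a few derived ratios, which may make the datasets easier to compare. The following is a minimal sketch rather than part of the original study: the dictionary layout, the key names, and the helper `derived_ratios` are illustrative assumptions, and the "coverage" ratio assumes that the weakly labelled pairs are a subset of the |D| pairs.

```python
# Counts copied from Table 2. The dictionary layout and key names are
# illustrative assumptions, not taken from the original study.
TABLE2 = {
    "MIMIC-III Disch": {
        "T": 59_652, "D": 127_150,
        "D_weak_pos": 15_598, "D_weak_neg": 74_217,
        "T_RD": 37_110, "T_RD_weak": 10_568,
    },
    "MIMIC-III Rad": {
        "T": 522_279, "D": 109_096,
        "D_weak_pos": 13_907, "D_weak_neg": 65_171,
        "T_RD": 73_589, "T_RD_weak": 21_102,
    },
    "Tayside Brain Img": {
        "T": 156_618, "D": 7_761,
        "D_weak_pos": 1_137, "D_weak_neg": 2_898,
        "T_RD": 7_321, "T_RD_weak": 2_855,
    },
}


def derived_ratios(row):
    """Return simple ratios computed from one dataset's counts (hypothetical helper)."""
    weak_total = row["D_weak_pos"] + row["D_weak_neg"]
    return {
        # Share of mention-UMLS pairs that received a weak label
        # (assumes weakly labelled pairs are a subset of |D|).
        "weak_label_coverage": weak_total / row["D"],
        # Share of weakly labelled pairs that are positive.
        "weak_positive_share": row["D_weak_pos"] / weak_total,
        # Share of documents with at least one rare disease per SemEHR.
        "docs_flagged_semehr": row["T_RD"] / row["T"],
        # Same share after adding weak supervision (SemEHR+WS).
        "docs_flagged_semehr_ws": row["T_RD_weak"] / row["T"],
    }


if __name__ == "__main__":
    for name, row in TABLE2.items():
        ratios = derived_ratios(row)
        print(name, {key: round(value, 3) for key, value in ratios.items()})
```

Running the script prints one line of rounded ratios per dataset. Comparing `docs_flagged_semehr` with `docs_flagged_semehr_ws` shows how many fewer documents remain flagged once weak supervision is applied.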