Ontology-driven and weakly supervised rare disease identification from clinical notes

Table 4 Comparison among embeddings for weakly supervised Text-to-UMLS linking from MIMIC-III discharge summaries

The statistics column (n = \(N_+\)/N) shows the number of positive samples \(N_+\) and the total number of samples N in each dataset. All word2vec-k embeddings were pre-trained on MIMIC-III discharge summaries, and each mention is represented as the average of the k-dimensional embeddings of the tokens in its context window. Unless marked “fine-tuning”, BERT models were used as static feature extractors (taking the second-to-last layer). The best scores, both with and without strong supervision (SS), are in bold. We did not tune the number of random weakly supervised training samples for the BlueBERT-base model (or for any other model), so its results are slightly below those in Table 3.
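The word2vec-k mention representation described in the caption can be illustrated with a minimal sketch. The function name `mention_embedding`, the window size, the choice to include the mention tokens themselves, and the zero-vector fallback for out-of-vocabulary contexts are all assumptions made for illustration; the caption does not specify them.

```python
from typing import Dict, List, Sequence


def mention_embedding(
    tokens: Sequence[str],
    start: int,
    end: int,
    vectors: Dict[str, List[float]],
    window: int = 5,  # assumed window size, not given in the caption
) -> List[float]:
    """Average the k-dimensional vectors of the tokens in a context window.

    The window covers the mention tokens[start:end] plus up to `window`
    tokens on each side. Tokens without a vector are skipped (assumption).
    """
    dim = len(next(iter(vectors.values())))
    lo = max(0, start - window)
    hi = min(len(tokens), end + window)
    found = [vectors[t] for t in tokens[lo:hi] if t in vectors]
    if not found:
        # Assumed fallback: no known token in the window.
        return [0.0] * dim
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


# Toy 2-dimensional vectors standing in for pre-trained word2vec embeddings.
toy_vectors = {
    "patient": [1.0, 0.0],
    "has": [0.0, 1.0],
    "cystic": [1.0, 1.0],
    "fibrosis": [3.0, 1.0],
}
toy_tokens = ["the", "patient", "has", "cystic", "fibrosis", "today"]
# Mention "cystic fibrosis" spans tokens 3..4; use one token of context per side.
toy_result = mention_embedding(toy_tokens, 3, 5, toy_vectors, window=1)
```

In practice the vectors would come from a word2vec model trained on the discharge summaries, and the resulting fixed-length vector would be passed to a downstream classifier.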

ISSN: 1472-6947