Table 3 Top five best-performing interpretable shallow classifiers in iDASH and MGH datasets

From: Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach

| Data | Feature | Vector | Algorithm | F1 | AUC | p-value |
|---|---|---|---|---|---|---|
| iDASH | Bag-of-words + UMLS (5SG) | Tf-idf | SVM-Lin | 0.932 | 0.957 | <0.01 |
| iDASH | Bag-of-words + UMLS (All) | Tf-idf | SVM-Lin | 0.931 | 0.957 | <0.01 |
| iDASH | Bag-of-words + UMLS (15ST) | Tf-idf | SVM-Lin | 0.930 | 0.957 | <0.01 |
| iDASH | Bag-of-words + UMLS (All) | Tf-idf | SVM-Lin-SGD | 0.928 | 0.955 | <0.01 |
| iDASH | Bag-of-words | Tf-idf | SVM-Lin | 0.927 | 0.955 | <0.01 |
| iDASH | Bag-of-words | Tf | NB | 0.893 | 0.935 | Baseline |
| MGH | Bag-of-words + UMLS (5SG) | Tf-idf | SVM-Lin | 0.934 | 0.964 | <0.01 |
| MGH | Bag-of-words + UMLS (15ST) | Tf-idf | SVM-Lin | 0.931 | 0.962 | <0.01 |
| MGH | Bag-of-words + UMLS (All) | Tf-idf | SVM-Lin | 0.930 | 0.962 | <0.01 |
| MGH | Bag-of-words | Tf-idf | SVM-Lin | 0.924 | 0.958 | <0.01 |
| MGH | Bag-of-words + UMLS (5SG) | Tf | LR-L1 | 0.915 | 0.953 | <0.01 |
| MGH | Bag-of-words | Tf | NB | 0.755 | 0.867 | Baseline |

Abbreviations: SG, semantic groups; ST, semantic types; Tf, term frequency; Tf-idf, term frequency-inverse document frequency weighting; SVM-Lin, linear support vector machine; SVM-Lin-SGD, linear support vector machine with stochastic gradient descent training; LR-L1, L1-regularized multinomial logistic regression; NB, multinomial naïve Bayes. Baseline combinations (bold in the original table) are the rows marked "Baseline" in the p-value column.