Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach

Table 3 Top five best-performing interpretable shallow classifiers on the iDASH and MGH datasets

Data	Feature	Vector	Algorithm	F1	AUC	p-value
iDASH	Bag-of-words + UMLS (5SG)	Tf-idf	SVM-Lin	0.932	0.957	<0.01
	Bag-of-words + UMLS (All)	Tf-idf	SVM-Lin	0.931	0.957	<0.01
	Bag-of-words + UMLS (15ST)	Tf-idf	SVM-Lin	0.930	0.957	<0.01
	Bag-of-words + UMLS (All)	Tf-idf	SVM-Lin-SGD	0.928	0.955	<0.01
	Bag-of-words	Tf-idf	SVM-Lin	0.927	0.955	<0.01
	Bag-of-words	Tf	NB	0.893	0.935	Baseline
MGH	Bag-of-words + UMLS (5SG)	Tf-idf	SVM-Lin	0.934	0.964	<0.01
	Bag-of-words + UMLS (15ST)	Tf-idf	SVM-Lin	0.931	0.962	<0.01
	Bag-of-words + UMLS (All)	Tf-idf	SVM-Lin	0.930	0.962	<0.01
	Bag-of-words	Tf-idf	SVM-Lin	0.924	0.958	<0.01
	Bag-of-words + UMLS (5SG)	Tf	LR-L1	0.915	0.953	<0.01
	Bag-of-words	Tf	NB	0.755	0.867	Baseline

Abbreviations: SG, semantic groups; ST, semantic types; Tf, term frequency; Tf-idf, term frequency-inverse document frequency weighting; SVM-Lin, linear support vector machine; SVM-Lin-SGD, linear support vector machine with stochastic gradient descent training; LR-L1, L1-regularized multinomial logistic regression; NB, multinomial naïve Bayes. Baseline combinations are marked "Baseline" in the p-value column
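
The difference between the Tf and Tf-idf vector representations listed in Table 3 can be illustrated with a minimal sketch. The smoothed idf formula and the L2 normalization below are one common variant, assumed here for illustration; the table does not state which variant the authors used. The function names and the toy notes are hypothetical and are not drawn from the iDASH or MGH data.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Raw term counts for one tokenized note (the "Tf" representation)."""
    return dict(Counter(tokens))


def inverse_document_frequency(corpus):
    """Smoothed idf, log((1 + N) / (1 + df)) + 1 (an assumed, common variant)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per note
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tf_idf(tokens, idf):
    """L2-normalized Tf-idf weights for one note; unseen terms are dropped."""
    weights = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


# Toy, made-up notes used only to exercise the functions.
corpus = [
    "patient reports chest pain".split(),
    "patient denies chest pain".split(),
    "patient reports seizure".split(),
]
idf = inverse_document_frequency(corpus)
vec = tf_idf(corpus[2], idf)
```

Under Tf, every word in the third note has the same count of 1. Under Tf-idf, "seizure" (which appears in one note) receives a larger weight than "patient" (which appears in all three), so terms that are distinctive for a note contribute more to the feature vector passed to a classifier.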

ISSN: 1472-6947