Comparison of different feature extraction methods for applicable automated ICD coding

Table 3 Coding results for Fuwai dataset with \(f_s = 200\)

Feature extraction & classifiers	Macro-F1 (%)	Micro-F1 (%)	Macro-AUC (%)	Micro-AUC (%)
BoW
LR_uni	84.44	91.54	88.58	93.75
SVM_uni	84.69	91.78	89.27	94.10
LR_uni_bi	84.83	92.27	89.08	94.41
SVM_uni_bi	83.02	91.57	88.23	93.93
LR_uni_bi_tri	83.01	91.50	88.00	93.88
SVM_uni_bi_tri	78.21	89.45	85.20	92.19
W2V
LR_word	53.14	75.07	71.60	82.05
SVM_word	35.73	64.92	64.09	75.10
LR_char	48.03	70.54	68.77	79.04
SVM_char	26.30	58.86	60.37	71.64
LR_comb	61.73	80.27	75.75	85.47
SVM_comb	46.26	73.68	69.17	80.51
RoBERTa_embeddings
LR_char	64.56	78.59	77.90	85.51
SVM_char	51.30	75.24	71.86	82.45
LR_comb	72.41	84.20	82.23	89.07
SVM_comb	64.25	81.41	77.57	86.44
RoBERTa_finetune
top_layer	4.31	40.59	69.56	80.32
whole	83.39	93.87	98.65	99.55

For BoW, _uni, _uni_bi, and _uni_bi_tri denote unigram, unigram+bigram, and unigram+bigram+trigram features, respectively. For W2V, _comb denotes the concatenation of character and word embeddings, while _char (_word) denotes character (word) embeddings only. For RoBERTa_embeddings, _char denotes the RoBERTa-Mini embeddings only, and _comb denotes the concatenation of the RoBERTa-Mini embeddings and W2V word embeddings. For RoBERTa_finetune, whole and top_layer denote fine-tuning the whole network and only the top fully connected layer, respectively.
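The naming scheme in the note can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `ngram_counts` and `concat_embeddings`, the example tokens, and the vector values are all hypothetical, and the sketch assumes that each embedding model has already produced one fixed-length vector per document (the source does not say how).

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token sequence.

    max_n=1 corresponds to _uni, max_n=2 to _uni_bi, max_n=3 to _uni_bi_tri.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def concat_embeddings(first_vec, second_vec):
    """_comb: join two fixed-length document vectors end to end."""
    return list(first_vec) + list(second_vec)


# Hypothetical tokenized diagnosis text (not taken from the Fuwai dataset).
tokens = ["coronary", "artery", "disease"]

uni = ngram_counts(tokens, 1)         # unigrams only
uni_bi = ngram_counts(tokens, 2)      # unigrams + bigrams
uni_bi_tri = ngram_counts(tokens, 3)  # unigrams + bigrams + trigrams

# Hypothetical 2-d character vector and 3-d word vector for one document.
comb = concat_embeddings([0.1, 0.2], [0.3, 0.4, 0.5])

print(len(uni), len(uni_bi), len(uni_bi_tri), len(comb))
```

Each step up in n-gram order adds features to the same bag-of-words vocabulary, so the _uni_bi_tri feature space is a superset of the _uni space. The _comb variants likewise widen the input to the classifier by concatenation and leave the individual embeddings unchanged.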

ISSN: 1472-6947