Comparison of different feature extraction methods for applicable automated ICD coding

Background Automated ICD coding of medical texts via machine learning has been a hot topic. Related studies from the medical field rely heavily on the conventional bag-of-words (BoW) representation for feature extraction, and rarely use more sophisticated methods such as word2vec (W2V) and large pretrained models like BERT. This study aimed to identify the most effective feature extraction methods for coding models by comparing BoW, W2V and BERT variants. Methods We experimented with a Chinese dataset from Fuwai Hospital, which contains 6947 records and 1532 unique ICD codes, and a public Spanish dataset, which contains 1000 records and 2557 unique ICD codes. We designed coding tasks with different code frequency thresholds (denoted f_s), with a lower threshold indicating a more complex task. Using traditional classifiers, we compared BoW, W2V and BERT variants on these coding tasks. 
Results When f_s was equal to or greater than 140 for the Fuwai dataset and 60 for the Spanish dataset, the BERT variants with the whole network fine-tuned performed best, leading to a Micro-F1 of 93.9% for the Fuwai data when f_s = 200, and a Micro-F1 of 85.41% for the Spanish dataset when f_s = 180. 
When f_s fell below 140 for the Fuwai dataset and 60 for the Spanish dataset, BoW turned out to be the best, leading to a Micro-F1 of 83% for the Fuwai dataset when f_s = 20, and a Micro-F1 of 39.1% for the Spanish dataset when f_s = 20. Our experiments also showed that both the BERT variants and BoW possessed good interpretability, which is important for medical applications of coding models. Conclusions This study sheds light on building promising machine learning models for automated ICD coding by revealing the most effective feature extraction methods. Concretely, our results indicated that fine-tuning the whole network of the BERT variants was the optimal method for tasks covering only frequent codes, especially codes that represented unspecified diseases, while BoW was the best for tasks involving both frequent and infrequent codes. The frequency threshold at which the best-performing method changed differed between datasets due to factors such as language and codeset. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-022-01753-5.


Background
During patients' visits at hospitals, rich text data is generated, such as diagnoses from health professionals. An important task is to assign codes from the International Classification of Diseases (ICD) system to the text, with each code representing a disease or procedure. The coding task serves as the basis for a wide range of applications, including reimbursement, epidemiological studies and health services research. At present, the task is mainly accomplished by clinical coders who are trained to grasp coding rules, yet manual coding is time-consuming and prone to errors, making automated ICD coding via machine learning a hot research topic.
Using machine learning to fulfill automated ICD coding basically comprises two phases: feature extraction and classifier building. Feature extraction is crucially important, as it plays the role of a bridge between raw text and classifiers, and should extract as many useful features from the raw text as possible. At present, there are three typical feature extraction methods, namely bag-of-words (BoW), word2vec (W2V) and large pretrained natural language processing (NLP) models. BoW is widely used in traditional machine learning. It treats text as a collection of words without strict order, and ignores complicated semantic and syntactic information. W2V was introduced by Mikolov et al. [1], and has been adopted in a great many studies. Splitting a training corpus into windows of text, W2V uses context words to predict central words (or vice versa), through which word embeddings for the corresponding vocabulary can be learned. Compared with BoW, W2V is capable of guiding word embeddings to embody semantic and syntactic information in dense, real-valued, low-dimensional vectors. Large pretrained NLP models have gained much attention in recent years; the key idea underlying them is first mining knowledge from large corpora with complicated neural networks, and then transferring the knowledge to downstream tasks to improve their performance. The most representative large pretrained model is BERT, which was trained on large corpora from various fields and has proven quite useful on many NLP tasks [2]. In comparison to W2V, models like BERT can learn far more language semantics and domain knowledge.
Existing studies relating to automated ICD coding can be categorized into two streams. The first is from the medical field [3][4][5][6][7][8][9][10][11][12][13][14]. These studies focus on developing applicable models for a subset of ICD codes based on private datasets. BoW is mostly adopted as the feature extraction method, and conventional classifiers or similarity-based methods are commonly used to accomplish automated coding. Specifically, using BoW and support vector machine (SVM), Karimi et al. (2017) auto-assigned 16 codes to radiology reports and obtained a Micro-F1 of over 80% [8], Koopman et al. (2015) predicted 85 cancer-related codes from death certificates and reached an F1-score of 70% [6], and Kaur and Ginige (2018) automatically allocated two codes relating to the respiratory and gastrointestinal systems and achieved an F1-score of 91.4% [7]. Applying BoW to unstructured clinical notes, Elyne et al. (2016) concluded that unstructured and structured data were complementary in predicting codes covering several medical specialties [14]. To the best of our knowledge, W2V and large pretrained NLP models, which hold advantages over BoW in analyzing syntax and semantics, have not been commonly used yet.
The second is from computer science [15][16][17][18][19][20][21][22][23][24], where most studies use W2V and deep neural networks for feature extraction, and mainly target developing models on large public datasets such as MIMIC-III [25][26][27]. For instance, Mullenbach et al. (2018) proposed a network named CAML, which consists of a convolutional layer and a label attention layer [18]. Lately, some studies have been keen on pretraining BERT-like architectures on large medical corpora, with the purpose of making the models better fitted for medical tasks. One example is BioBERT, which was pretrained on PubMed corpora and confirmed effective on tasks such as ICD coding [28]. Although these studies adopted more advanced feature extraction methods, the metrics they report are generally not high. For instance, the state-of-the-art F1-score for the full code set in MIMIC-III is currently below 60% [20].
For the purpose of application, this study focused on comparing BoW, W2V and BERT variants for auto-assigning ICD codes to medical records. As in some related studies [7,8], the scale of the datasets in this study is limited; we therefore used logistic regression (LR) and SVM as classifiers instead of deep learning models. We designed coding tasks with different code frequency thresholds. In general, a higher threshold means more frequent codes to predict and thus a less complex task. Our goal was to uncover which feature extraction method is most effective, and whether the most effective one varies across tasks of different complexity. Achieving this goal can help in building promising coding models and assisting coding practice.

Methods
This section first briefly describes the feature extraction methods, classifiers and evaluation metrics used in this study, and then gives details of our methodology.

Bag-of-words
BoW is widely used in traditional machine learning. The model treats text as a collection of words (or n-grams) without strict order, and ignores complicated semantic and syntactic information. To calculate the weights of words in a document, the term frequency-inverse document frequency (tf-idf) method is mostly adopted. According to tf-idf, given a corpus D containing N documents, the weight of word w_i in document d_j is:

w_{i,j} = tf_{i,j} × log(N / df_i),

where tf_{i,j} is the term frequency of w_i in d_j and df_i is the number of documents that mention w_i. As the equation indicates, words occurring more often in d_j and less often in D are considered more representative of d_j and given higher weights. Features from BoW are generally high-dimensional and sparse.
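As an illustration, the weighting above can be sketched in a few lines of Python. This is our own minimal sketch of the tf-idf formula as stated here (one common variant); the function and variable names are ours, not from the paper:

```python
import math
from collections import Counter

def tfidf_weights(corpus):
    """Compute w_ij = tf_ij * log(N / df_i) for each word in each document.

    `corpus` is a list of tokenized documents (lists of words).
    Returns one {word: weight} dict per document.
    """
    n_docs = len(corpus)
    # df_i: number of documents that mention word w_i
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    weighted = []
    for doc in corpus:
        tf = Counter(doc)  # tf_ij: term frequency of w_i in d_j
        weighted.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weighted
```

A word appearing in every document (df_i = N) gets weight zero, reflecting that it carries no discriminative information.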

Word2vec
Since being proposed by Mikolov et al. [1], W2V has been widely used in both traditional machine learning and deep learning studies. The method involves two alternative models, continuous bag-of-words (CBOW) and skip-gram, both of which are simple multilayer perceptron structures, as shown in Fig. 1. As preprocessing, W2V transforms a training corpus into text windows of a pre-defined size, and randomly initializes word embeddings for the corresponding vocabulary. Given a text window during training, CBOW uses the context words to predict the central word, while skip-gram uses the central word to predict the context words. The training loss is assessed by cross entropy, and the word embeddings are gradually adjusted during backpropagation. After enough training, the embeddings tend to converge and are ready for downstream tasks.
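The windowing step above can be made concrete with a small sketch that enumerates skip-gram training pairs. This is our own illustration (names are ours); CBOW would group the same pairs the other way round, with all context words jointly predicting the center:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by skip-gram.

    For each position i, every token within `window` positions of i
    (excluding i itself) is paired with the center token.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```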
Compared with BoW, which only utilizes frequency information, W2V is capable of extracting abstract semantic and syntactic features that are dense, real-valued and low-dimensional.

Large pretrained NLP models
The basic idea underlying large pretrained NLP models is to employ complex neural networks to mine knowledge from large corpora first, and then transfer the knowledge to downstream tasks to improve their performance. BERT is a typical such model [2], which has received much attention lately [29][30][31][32][33][34]. The neural network of BERT is a 12-layer Transformer encoder [35], where each layer consists of a residual multi-head self-attention sublayer and a residual feed-forward sublayer, both of which are followed by layer normalization, as shown in Fig. 2. Next sentence prediction (NSP) and masked language modeling (MLM) are used as learning tasks. After being trained on a large corpus from various fields, BERT has proven very useful for a wide range of NLP tasks [36,37]. Several variants of BERT have been proposed, one of which is RoBERTa [30]; it uses the same network as BERT, but was trained with improved procedures such as larger batches and more steps. Note that W2V provides word embeddings that are static and context-agnostic once trained. In contrast, BERT and its variants play the role of sentence encoders: they encode sequences of tokens and provide embeddings for tokens by taking contextual information in the sequences into account. This helps in handling ambiguity in text and extracting more informative features for downstream tasks.

Logistic regression
LR is widely used as a baseline classifier because of its simplicity and high efficiency [38]. Given a binary dependent variable y and m predictors x = {x_1, x_2, ..., x_m}, the model can be expressed as:

P(y = 1 | x) = 1 / (1 + e^{-(α + β·x)}),

where α and β are the intercept and coefficients respectively, and can be estimated via maximum likelihood estimation. Provided with n observations (x_i, y_i) (1 ≤ i ≤ n), the log likelihood of LR regularized with the L2 norm is:

l(α, β) = Σ_{i=1}^{n} [y_i log p_i + (1 − y_i) log(1 − p_i)] − λ‖β‖², with p_i = P(y = 1 | x_i),

where λ is a positive penalty factor. Maximizing l(α, β) using gradient descent methods leads to the optimal parameters α* and β* that best fit the training data. As we focused on comparing different feature extraction methods, we did not tune λ comprehensively. Choosing from {1, 1/2, 1/3, 1/4, 1/5}, we ran training and tests on some randomly sampled data, and found that λ = 1/5 generally led to better results. Hence λ = 1/5 was used throughout the study.
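For concreteness, the regularized objective can be evaluated directly. The sketch below is our own illustration of the log likelihood with an L2 penalty, using the notation of this section (function names are ours):

```python
import math

def l2_log_likelihood(x_rows, y, alpha, beta, lam):
    """L2-regularized log likelihood l(alpha, beta) of logistic regression.

    p_i = sigmoid(alpha + beta . x_i); the penalty subtracts lam * ||beta||^2.
    """
    total = 0.0
    for x_i, y_i in zip(x_rows, y):
        z = alpha + sum(b * v for b, v in zip(beta, x_i))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total - lam * sum(b * b for b in beta)
```

A gradient-based optimizer would adjust alpha and beta to maximize this quantity on the training data.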

Support vector machine
SVM is one of the most successful conventional classifiers, owing to its capability of handling a large number of features while remaining memory efficient [39], and has been adopted in many studies on automated ICD coding [5][6][7][8]. Given a data set {(x_i, y_i): 1 ≤ i ≤ n}, where x_i represents feature values, y_i is a class label and n indicates the data volume, SVM aims at searching for the hyperplane with the largest margin:

P: w·φ(x) + b = 0.

w and b are usually calculated by solving the following optimization problem:

min_{w, b, ξ} (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i

subject to:

y_i (w·φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ n.

ξ = {ξ_i: 1 ≤ i ≤ n} are called slack variables, standing for the tolerance of SVM for misclassifications. C is a positive penalty factor on the slack variables. A larger C generally leads to higher training accuracy, yet puts the model at risk of overfitting. φ(x_i) is a transformation function that maps data x into a new space, with the purpose of increasing the chance of separating data belonging to different classes that cannot be separated in the original space. Kernel functions K(x, y), which satisfy K(x, y) = φ(x)·φ(y), are introduced to simplify the transformation calculations. Commonly used kernel functions include the linear function and the radial basis function. In this study, the linear kernel was used. Using the same selection method as for λ in LR, we chose C = 1/5 for SVM in all of the experiments.
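At the optimum, each slack variable equals ξ_i = max(0, 1 − y_i(w·x_i + b)), so the constrained problem has an equivalent unconstrained hinge-loss form. The sketch below is our own illustration of that objective for the linear kernel (names are ours):

```python
def svm_primal_objective(w, b, x_rows, y, C):
    """Soft-margin linear-kernel SVM objective:

        0.5 * ||w||^2 + C * sum_i xi_i,
    where the optimal slack is xi_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    margin_term = 0.5 * sum(v * v for v in w)
    slack = sum(
        max(0.0, 1.0 - y_i * (sum(wv * xv for wv, xv in zip(w, x_i)) + b))
        for x_i, y_i in zip(x_rows, y)
    )
    return margin_term + C * slack
```

A larger C makes slack (misclassified or within-margin points) more costly, matching the overfitting trade-off described above.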

Fuwai dataset
Fuwai Hospital is a Chinese hospital specializing in treating cardiovascular diseases. With approval from the Ethics Committee at Fuwai Hospital, we obtained a dataset spanning January 2019 to February 2019, which includes no identifiable personal information. The dataset contains 6947 records, in which each record consists of a textual diagnosis summary for a patient and a list of codes, which are the Chinese version of standard ICD-10 diagnosis codes. In total, the dataset involves 1532 unique codes. As preprocessing, we cleaned the diagnosis summaries by removing all numbers and symbols except for '[' and ']', and used the Jieba package to cut the cleaned text into words. To identify medical terms more precisely, we loaded into Jieba a Chinese medical vocabulary from Sogou, which contains 90,047 terms relating to diagnoses, medicine and so on.

CodiEsp dataset
The CodiEsp dataset [40] is a public dataset released by the CLEF eHealth 2020 conference. It contains 1000 Spanish clinical records, and provides both Spanish and English textual diagnosis summaries. In this study, the English version was used. A list of gold-standard CIE-10 diagnosis codes, the Spanish version of ICD-10 diagnosis codes, was assigned to each record. In total, 2557 unique codes appear in the dataset. We deleted all symbols, numbers and stop words from the records at the preprocessing stage.

Descriptive analysis
Descriptive statistics of the datasets are listed in Table 1. For Fuwai data, we additionally summarized the character-level statistics.
For each dataset, we ranked the codes by their frequencies in descending order, and plotted the frequencies against the rankings (Fig. 3). Apparently, the code frequencies in both datasets follow a long-tailed distribution. Figure 4 shows the 10 most frequent codes and their frequencies for each of the datasets.
Codes with few records would cause overfitting when training classifiers. Therefore, we selected a subset of codes to predict by setting a code frequency threshold f_s. Intuitively, a smaller f_s means more infrequent codes to predict and thus a more complex coding task. To find out whether the best feature extraction method varies across tasks of different complexity, we experimented with different thresholds on each of the datasets. Under each threshold, we only used records relating to at least one of the qualified codes. 80% of the selected data was used for training and the rest for testing.
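The task-construction step above can be sketched as follows. This is our own minimal illustration of thresholding codes by frequency and filtering records (names are ours, not from the paper):

```python
from collections import Counter

def build_task(records, f_s):
    """Select codes occurring at least f_s times, then keep the records
    that carry at least one qualified code.

    `records` is a list of (text, codes) pairs; the returned records
    keep only their qualified codes as prediction targets.
    """
    freq = Counter(code for _, codes in records for code in codes)
    qualified = {c for c, n in freq.items() if n >= f_s}
    selected = [
        (text, [c for c in codes if c in qualified])
        for text, codes in records
        if any(c in qualified for c in codes)
    ]
    return qualified, selected
```

Lowering f_s admits rarer codes and more records, which is exactly what makes the task more complex.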

Evaluation metrics
In accordance with related studies [6,7,16,18], we mainly used F1-score and AUC to assess coding performance. Micro-F1, Macro-F1, Micro-AUC and Macro-AUC were used as specific metrics. A micro metric corresponds to the hypothetical single code that integrates all individual codes, while a macro metric is the mean of the metrics over individual codes. Micro metrics place more weight on codes with more records; by comparison, macro metrics treat each code equally. In automated ICD coding, where codes are most likely distributed disproportionately, micro metrics, especially Micro-F1, are generally given more attention.
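The micro/macro distinction can be illustrated with a small sketch. This is our own toy example (names are ours): micro pools the per-code counts first, while macro averages the per-code F1 scores:

```python
def f1(tp, fp, fn):
    """F1-score from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(per_code_counts):
    """`per_code_counts` maps each code to its (tp, fp, fn) counts.

    Micro-F1 sums the counts over all codes before computing F1;
    Macro-F1 averages the per-code F1 scores, treating codes equally.
    """
    tps = sum(tp for tp, _, _ in per_code_counts.values())
    fps = sum(fp for _, fp, _ in per_code_counts.values())
    fns = sum(fn for _, _, fn in per_code_counts.values())
    micro = f1(tps, fps, fns)
    macro = sum(f1(*c) for c in per_code_counts.values()) / len(per_code_counts)
    return micro, macro
```

Because the frequent code dominates the pooled counts, Micro-F1 sits closer to the frequent code's score than Macro-F1 does.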
F1-score and AUC regarding a single code are described as follows.

F1
Assume there is a set of records among which t_1 are in class 1, indicating that a code is assigned. After feeding the records into a trained model, p_1 are tagged with 1, within which tp_1 are correctly tagged. Then the F1-score of the coding performance is:

F1 = 2 × Precision × Recall / (Precision + Recall),

where Precision = tp_1/p_1 and Recall = tp_1/t_1.

Figure 5 depicts the framework of our methodology. We designed coding tasks with different code frequency thresholds. For each coding task, we accomplished code prediction with both feature-based methods and fine-tuned BERT variants, and evaluated coding performance on the test data. Details of the feature extraction methods are given below.

BoW
We extracted features via unigram, unigram+bigram, and unigram+bigram+trigram separately. As each combination generally yields tens of thousands of features, we applied the filter method with the Chi-square test to select the 1,000 most significant features during training. For each feature, we calculated its significance with respect to each individual code, and took the maximum significance as the final metric for ranking the feature. Related experiments were implemented using the Sklearn package [41].
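The selection criterion above can be sketched without any library. This is our own illustration (names are ours) of the Chi-square statistic for a binary feature against one code, with the feature ranked by its maximum score across codes, as described in the text:

```python
def chi2_statistic(feature, labels):
    """Chi-square statistic for a binary feature against a binary label,
    computed from the 2x2 contingency table of (feature present, label assigned)."""
    n = len(feature)
    table = [[0, 0], [0, 0]]
    for f, y in zip(feature, labels):
        table[f][y] += 1
    stat = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / n
            if expected:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def rank_score(feature, label_matrix):
    """Rank a feature by its maximum Chi-square score across all codes."""
    return max(chi2_statistic(feature, labels) for labels in label_matrix)
```

In practice this corresponds to Sklearn's chi2 scorer combined with a top-k filter; we show the statistic explicitly here for transparency.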

W2V
Using gensim package [42], we trained both word and character embeddings based on Fuwai dataset, and word embeddings based on CodiEsp dataset. Parameters used during training and their descriptions are listed in Table 2.
We used average pooling to obtain the embedding of a textual record. Specifically, we looked up the embeddings of all words (characters) in a record and took the mean over all dimensions as the word-level (character-level) record embedding. For the Fuwai dataset, we adopted record embeddings at the character level and the word level respectively, and also used the concatenation of both kinds of embeddings in the experiments.
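The pooling and concatenation steps can be sketched directly. This is our own minimal illustration (names are ours); out-of-vocabulary tokens are simply skipped, which is one reasonable convention:

```python
def record_embedding(tokens, emb, dim):
    """Average-pool token embeddings into a single record embedding.

    `emb` maps tokens to vectors of length `dim`; tokens missing from
    the vocabulary are skipped.
    """
    vectors = [emb[t] for t in tokens if t in emb]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def concat_embeddings(word_vec, char_vec):
    """Concatenate word-level and character-level record embeddings."""
    return word_vec + char_vec
```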

BERT variants
For the Fuwai dataset, we used a Chinese pretrained model named RoBERTa-Mini [2,30,43,44], which has 4 Transformer encoder layers and represents characters with 256-dimensional embeddings in each layer. Note that in the Chinese context, pretrained NLP models are generally character-based rather than word-based, due to the tremendous size of the Chinese word vocabulary. Although there have recently been some attempts at word-based models such as WoBert [45], they only cover a vocabulary of limited size, and would encounter severe out-of-vocabulary problems when used in medical studies.
For the CodiEsp dataset, we used an English pretrained model named BERT-mini [46,47], which has the same network structure as RoBERTa-Mini.
As both models can handle input sequences of up to 512 tokens, medical records longer than 512 tokens were truncated in our experiments, while those shorter than 512 tokens were padded with meaningless tokens. For the Fuwai dataset, across all the coding tasks, at most 4.6% of the more than 6,000 records were truncated. The truncated content was mostly detailed symptom descriptions; therefore, the truncation had little influence on the coding performance of RoBERTa-Mini. As for the CodiEsp dataset, no clinical records needed to be truncated.
Regarding both BERT variants, the outputs from the last layer were used, which mainly consist of two parts. One is a feature vector for the [CLS] token, which is automatically added at the beginning of each input sequence for the NSP task. The other is an embedding matrix whose columns correspond to the input tokens. Accordingly, we adopted two methods to achieve automated coding. The first is adding a fully connected layer on top of the feature vector, containing the same number of neurons as target codes, with sigmoid as the activation function. During training, we fine-tuned the whole network and only the top fully connected layer, respectively. Adam was employed as the optimization algorithm, with the batch size set to 32 and binary cross entropy as the loss function. Considering the randomness induced by operations such as parameter initialization and data splitting, we ran 5 rounds of training and testing when using the fine-tuning method. Within each round, a different random seed was used to split the data. The means of the metrics over all rounds were reported. The second method is using the embedding matrix as token features to train LR and SVM, with columns corresponding to padded tokens excluded. Average pooling was used to generate record embeddings. Besides using the embedding matrix directly, we also experimented with concatenating the embeddings from the BERT variants and W2V.
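The second method's pooling step, excluding padded positions, can be sketched as follows. This is our own illustration of masked average pooling over a BERT-style token embedding matrix (names are ours, not the authors' code):

```python
def pooled_record_embedding(token_matrix, pad_mask):
    """Average-pool a token embedding matrix into one record vector,
    excluding positions flagged as padding.

    `token_matrix` is a list of per-token embedding vectors (the columns
    of the last-layer output); `pad_mask[i]` is True for padded positions.
    """
    kept = [vec for vec, is_pad in zip(token_matrix, pad_mask) if not is_pad]
    dim = len(token_matrix[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]
```

The resulting vector (optionally concatenated with a W2V record embedding) is then fed to LR or SVM as in the feature-based pipeline.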

Frequent codes only
We started by fixing the code frequency threshold f_s at relatively high values, so that each chosen code occurs frequently in the datasets. Specifically, for the Fuwai data, we set f_s to 200; 37 codes met this standard, whose occurrences account for 61.6% of the total code occurrences, and 6,398 records were selected. For the CodiEsp dataset, we set f_s to 180; 3 codes met the standard, whose occurrences account for 5.4% of the total code occurrences, and 473 clinical records were selected.
Coding results for the datasets are listed in Tables 3 and 4, respectively. The largest Micro-F1 for each feature extraction method is shown in bold, and the largest overall is underlined.
Regarding the classifiers, for the Fuwai dataset, LR mostly performed better than SVM over all the metrics given the same features, whereas for the CodiEsp dataset, SVM mostly outperformed LR over all the metrics given the same features. This pattern held across experiments with different f_s for both datasets.
Focusing on the feature extraction methods, the suffixes in the tables are interpreted as follows. For BoW, _uni, _uni_bi and _uni_bi_tri mean unigram, unigram+bigram and unigram+bigram+trigram respectively. For W2V, _comb means concatenating character and word embeddings, while _char (_word) means character (word) embeddings only. For RoBERTa_embeddings, _char means the RoBERTa-Mini embeddings only, and _comb means concatenating the RoBERTa-Mini embeddings and the W2V word embeddings. For RoBERTa_finetune, whole and top_layer mean fine-tuning the whole network and only the top fully connected layer, respectively. For the CodiEsp dataset, the word embeddings from BERT-mini performed better than the other embedding options, leading to a Micro-F1 of 63.9%.

Considering both frequent and infrequent codes
We chose relatively low values of f_s in this section, in order to find out the most effective feature extraction methods when analyzing both frequent and infrequent codes. Specifically, f_s = 20 was used for both datasets. For the Fuwai dataset, 248 codes and 6,906 records were chosen; the occurrences of the code subset account for 90.1% of the total code occurrences. For the CodiEsp dataset, 106 codes and 931 records were selected, covering 41.4% of the total code occurrences. Coding results for the datasets are listed in Tables 5 and 6, respectively.
The results for both datasets differed a lot from those in the last subsection. The methods of fine-tuning the BERT variants no longer performed best; instead, BoW led to the highest Micro-F1 on both datasets. Regarding the embedding methods, using the embeddings from the BERT variants and W2V together was the best choice for both datasets.

Results with multiple code frequency thresholds
We intended to uncover more details about how the best feature extraction method varied with respect to the code frequency threshold. Accordingly, we let f_s range from 20 to 200 for the Fuwai dataset, increasing it by 20 each time; for the CodiEsp dataset, we let f_s take the values 40, 60, 80, 100 and 140. For each threshold, the number of qualified codes and selected records, and the proportion of the total code occurrences covered by the qualified codes, are given in Additional file 1. (Table 4 lists the coding results for the CodiEsp dataset with f_s = 180; aside from BERT_embeddings, its suffixes have the same meanings as those in Table 3, while for BERT_embeddings, _word means the BERT-mini embeddings only, and _comb means concatenating the BERT-mini embeddings and the W2V word embeddings. Tables 5 and 6 list the coding results with f_s = 20 for the Fuwai and CodiEsp datasets, respectively.)

Interestingly, fine-tuning the whole network of the BERT variants consistently led to the highest Micro-AUC for both datasets. As its definition indicates, a higher AUC generally means a higher TPR and 1-FPR, in other words, higher Recall for both positive and negative cases over multiple calibration thresholds, and it does not directly relate to Precision for positive cases. However, in the ICD coding task, both Precision and Recall for positive cases are quite important in coding practice, and both are captured by the F1-score. Hence, we placed more weight on Micro-F1 when evaluating the feature extraction methods.

Focusing on the Fuwai dataset, the point where f_s = 140 can be observed as a turning point. When f_s was equal to or greater than 140, RoBERTa-Mini with the whole network fine-tuned resulted in the highest Micro-F1. When f_s was lower than 140, BoW, mostly used together with LR, performed best.

As for the CodiEsp dataset, there also exists a turning point, at f_s = 60. When f_s equaled or exceeded 60, fine-tuning the whole network of BERT-mini was most effective. Once f_s fell below 60, BoW in conjunction with SVM consistently led to the highest Micro-F1.
Regarding the embedding methods on both datasets, using the embeddings from the BERT variants and W2V together generally achieved a higher Micro-F1 than using the embeddings from either the BERT variants or W2V alone. Note that the advantage of the BERT variants over BoW in terms of Micro-F1 was more obvious, and persisted over a wider range, on the CodiEsp dataset than on the Fuwai dataset. Besides factors like the difference in data volume and the pretraining details of the BERT variants, the main reason could be as follows. In the Fuwai dataset, most codes stand for specified diseases, such as E78.501 (hyperlipemia) and I25.105 (coronary atherosclerotic heart disease). There are some unique n-gram features quite indicative of these codes, such as 'atherosclerotic' for I25.105, and these features could be captured effectively by BoW through feature selection, leading to competitive coding performance of the BoW methods on both frequent and infrequent codes. In contrast, in the CodiEsp dataset, many codes stand for unspecified diseases, such as r52 (pain, not elsewhere classified) and r69 (illness unspecified); among the 20 most frequent codes, 13 are of this kind. The uncertainty behind these codes might make BoW struggle to capture informative n-gram features for predicting them. However, owing to their capability of handling ambiguity in text, the BERT variants could perform promisingly in extracting useful features for code assignment, as long as enough records for the target codes are provided.

Interpretability
The experiments above indicated that fine-tuning the whole network of the BERT variants was the optimal feature extraction method when assigning only frequent codes, while BoW became most effective when predicting both frequent and infrequent codes. This section shows that both the BERT variants and BoW possess good interpretability in automated coding, which is important for medical applications of coding models.
As for the BERT variants, the attention weights from the top layer give hints on how the models allocate importance to input tokens. Specifically, the feature vector for code prediction is the weighted average of all token embeddings from the top layer; higher attention weights mean the corresponding tokens play more important roles in computing the feature vector. By observing the distribution of the weights, we can gain a straightforward view of which tokens are key, and whether those tokens are useful with respect to the target codes.
Both BERT variants in this study adopt a 4-head self-attention mechanism, meaning there are four groups of attention weights for the input tokens. We used the largest weight for each token as its importance metric and ranked all tokens in descending order accordingly.
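The max-over-heads ranking just described can be sketched in a few lines. This is our own illustration (names are ours, not the authors' code):

```python
def rank_tokens_by_attention(tokens, head_weights):
    """Rank tokens by their largest attention weight across heads.

    `head_weights[h][i]` is the weight that head h assigns to token i;
    each token's importance is its maximum weight over all heads.
    """
    scores = [max(head[i] for head in head_weights) for i in range(len(tokens))]
    return sorted(zip(tokens, scores), key=lambda pair: -pair[1])
```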
In terms of BoW, key features selected by the filter method are shared by all inputs, hence the interpretability can be achieved by analysing whether the key features are informative or not in relation to target codes.
For Fuwai dataset, we defined Pro_K and Pro_N to quantify the explainability of RoBERTa-Mini and BoW, respectively.
Given a diagnosis summary, we computed the number of characters that appear both among the top K key characters from RoBERTa-Mini and in the corresponding code labels, and divided this number by K. The result can be seen as an explainability metric at the single-record level. Pro_K is the mean of this metric over all test records.
We calculated the number of words that appear both among the top N key words from the Chi-square test and in the corresponding code labels, and divided this number by N to obtain Pro_N.
Fixing f_s at 200, we report Pro_K for several values of K, based on a randomly selected round of experiments, in Table 7, and Pro_N for several values of N in Table 8.
The tables show that both RoBERTa-Mini and BoW precisely located useful information for assigning the target codes.
In terms of the CodiEsp dataset, Pro_K and Pro_N are not suitable for BERT-mini and BoW. As mentioned above, many codes in this dataset represent unspecified diseases; as a result, many key words selected by the feature extraction methods might not appear in such code labels, even if they are predictive of the codes. Hence, we give two examples below to show the interpretability of BERT-mini and BoW.
Fixing f_s at 180, we list the top 10 key words selected through BoW and the 3 target codes in Table 10. For BERT-mini, we display a randomly chosen clinical record and its ICD codes in Table 9, with the top 5 key words shown in bold.
The examples intuitively demonstrate that both BERT-mini and BoW identified valuable information for predicting the target codes.

Discussion
As expected, the lower the code frequency threshold, the more complex the corresponding task, and the lower the performance metrics for all the feature extraction methods. Experiments on both datasets suggested that the performance of the BERT variants changed dramatically across tasks of different complexity. When handling frequent codes, the BERT variants reached the most promising results, and their advantage over the other feature extraction methods was more obvious when handling codes representing unspecified diseases. When handling infrequent codes, the BERT variants performed poorly, probably because input tokens relating to the infrequent codes were rare and not seen often enough by the BERT variants, so useful semantic representations for such tokens could not be learned.
BoW and the embedding methods were more stable than fine-tuning the BERT variants, and among them BoW performed better, suggesting that BoW was more suited to coding tasks covering both frequent and infrequent codes. The probable reason why BoW was relatively effective for infrequent codes is as follows: combined with tf-idf weighting and feature selection via the Chi-square test, BoW could capture rare words or phrases that were closely associated with infrequent codes.
The frequency threshold at which the best-performing feature extraction method changed varied between datasets. This could be attributable to many factors, such as language, data volume and the number of predicted codes, and we cannot pinpoint a single one as the major cause.
Focusing on the embedding methods, using embeddings from both the BERT variants and W2V was the optimal choice in most cases.
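One simple way to use both embedding sources together, which we sketch here as an assumption since the text does not specify the combination mechanism, is to concatenate each document's BERT embedding with its averaged word2vec embedding before feeding the result to a traditional classifier. The dimensions below are illustrative.

```python
import numpy as np

def combine_embeddings(bert_vec, w2v_vec):
    """Concatenate a document-level BERT embedding with an averaged
    word2vec embedding into one feature vector for a classifier."""
    return np.concatenate([bert_vec, w2v_vec])

rng = np.random.default_rng(0)
bert_vec = rng.random(256)   # e.g., [CLS] embedding from a small BERT
w2v_vec = rng.random(100)    # e.g., mean of word2vec token vectors
features = combine_embeddings(bert_vec, w2v_vec)
print(features.shape)  # (356,)
```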
This study has some limitations. First, we only used text data, following a number of related studies [4,5,8,9]. However, as some studies reported [14,23], both unstructured and structured data, such as various lab results, can help predict ICD codes. Whether using unstructured and structured data simultaneously would affect our conclusions needs further verification.
Second, limited by the scale of the available datasets, we only employed traditional classifiers, as many studies did [5-8]. These classifiers might not fully exploit the information in the embeddings from W2V and the BERT variants, which is probably why the embedding methods did not perform well in our experiments. In the future, when larger datasets are available, we will build sophisticated deep learning classifiers to check whether the embeddings lead to more promising coding performance.
Third, we only experimented with a private Chinese dataset and a public Spanish dataset. According to related studies [10,13], the portability of machine learning models for automated ICD coding might not be guaranteed. In the future, we will test the robustness of our conclusions by experimenting on more public datasets.

Conclusion
This study compared different feature extraction methods, namely BoW, W2V and BERT variants, for building applicable models for automated ICD coding. Our experiments demonstrated that the BERT variants with the whole network fine-tuned were optimal for coding tasks covering only frequent codes, especially codes representing unspecified diseases, while BoW became the best choice when coding tasks involved both frequent and infrequent codes. The frequency threshold at which the best feature extraction method changed varied between datasets.

Table 9 A clinical record for interpreting the code assignment by BERT-mini

Case description: year male patient evaluated pain grade iii obliterating arteriopathy involvement limbs received analgesic treatment durogesic matrix months acceptable pain control vas rescue medication paracetamol maximum daily pain unit emergency visit days increased pain threshold agitation nervousness picture occurs result bedside doctor medication prescribed transdermal fentanyl generic requiring rescue paracetamol increasing vas pain relief anamnesis patient prescribed durogesic matrix patient reviewed weeks presents pain relief vas disappearing nervousness presented months visit patient continues durogesic matrix occasionally paracetamol

ICD code: r52 (pain, not elsewhere classified)

Table 10 Top 10 key words selected through BoW and the target ICD codes

Key words: fever, disease, pain, antibiotic, drainage, crp, painful, leukocytosis, vas, pleural

Target codes: r52 (pain, not elsewhere classified), r69 (illness, unspecified), r50.9 (fever, unspecified)