Constructing fine-grained entity recognition corpora based on clinical records of traditional Chinese medicine

Background In this study, we focus on building a fine-grained entity annotation corpus, with a corresponding annotation guideline, from traditional Chinese medicine (TCM) clinical records. Our aim is to provide a basis for the fine-grained corpus construction of TCM clinical records in the future. Methods We developed a four-step approach suited to constructing a corpus from TCM medical records. First, we determined the entity types included in this study through sample annotation. Then, we drafted a fine-grained annotation guideline by summarizing the characteristics of the dataset and referring to some existing guidelines. Next, we iteratively updated the guideline until the inter-annotator agreement (IAA) exceeded a Cohen’s kappa value of 0.9. Finally, comprehensive annotation was performed while keeping the IAA value above 0.9. Results We annotated the 10,197 clinical records in five rounds. Four entity categories involving 13 entity types were employed. The final fine-grained annotated entity corpus consists of 1104 entities and 67,799 tokens. The final IAA is 0.936 on average (for three annotators), indicating that the fine-grained entity recognition corpus is of high quality. Conclusions These results will provide a foundation for future research on corpus construction and named entity recognition tasks in the TCM clinical domain.

data-driven medical studies, clinical decision making, and health management. Natural language processing (NLP) techniques, which assist the automatic processing and analysis of EMRs, have been increasingly used in the field of TCM analysis in recent years [3]. Named entity recognition (NER) [4,5] is a high-level task in NLP, and a human-annotated entity corpus is an indispensable resource for training automated NER systems and testing their performance. In English, some medical knowledge bases, such as terminology systems like the Unified Medical Language System [6], clinical ontology systems like the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) [7], and medical databases like DrugBank [8], contribute to NER in clinical records. In China, some resources have been developed for NER tasks in the Chinese clinical domain; for example, the Traditional Chinese Medicine Language System (TCMLS) standardizes the terminology of TCM. Currently, some entity types in Chinese clinical records, such as medications, anatomy, treatments, tests, symptoms, body parts, temporal words, drugs, and operations [5, 9-13], have already been annotated. However, to the best of our knowledge, open Chinese annotated corpora rarely include TCM clinical records. The lack of TCM clinical datasets is partly due to concerns regarding patients' privacy as well as concerns about revealing unfavorable institutional practices [14], so these records are very private and scarce; another reason is the high complexity of Chinese clinical text analysis. This type of text has sublanguage features [15], so the characteristics of raw TCM free-text clinical records are very different from the characteristics of common texts in the Chinese language. For instance, the text has a narrative form, uses a concise style similar to classical Chinese, and employs nonstandard descriptions [16].
Hence, constructing a corpus of TCM clinical records remains difficult, and the electronic capture or retrieval of TCM clinical text data has been a challenge; thus, research into NLP tasks on TCM clinical free text is still at a preliminary stage.
Currently, most of the relevant studies, such as [5,10,11], do not present a standardized process-based approach to the construction of a corpus, especially in the steps of data selection, guideline drafting, and annotation. To date, there is no existing fine-grained annotation schema applicable to the TCM clinical domain. Hence, this study focuses on a fine-grained corpus construction method that is suitable for the clinical free text of TCM. On the basis of existing approaches, we propose a four-step method to make the entire process clear, replicable, and consistent. Fine-grained annotation guidelines for TCM clinical text were also developed. The statistical analysis indicates that the method and guidelines are appropriate and effective. The results of this study will provide a foundation for future research into corpus construction and effective NER tasks in the TCM clinical domain.

Related work
In recent years, research on clinical EMRs has become a popular topic [17]. Studies on English EMR entity corpora began early, and text mining and NLP applications, algorithms, and corpora in the English language are relatively mature. There are some well-known publicly available annotated corpora, such as GENIA (Genome Information Acquisition) [18] for data mining and information extraction in the molecular biology domain, NCBI (National Center for Biotechnology Information) Disease [19] for disease names and adverse effects, and drug-drug interactions (DDI) [20] for pharmacological substances and drug interactions. Moreover, the Informatics for Integrating Biology and the Bedside (i2b2) challenges have contributed to clinical NLP studies. For instance, i2b2 organized challenges on the extraction of medical information from English discharge summaries in 2009 and 2010; the extracted concepts include drugs, doses, duration, medical problems, treatments, and tests. Since 2006, i2b2 has released nine corpora for evaluating EMR information extraction. Based on these corpora, great strides have been made in NER research on English discharge summaries. The annotation schemes and evaluation methods of corpus construction in English have high reference value for Chinese clinical notes. Despite this, the development of corpus construction in Chinese medicine has fallen behind that of English Western medicine, and the availability of large corpora in Chinese is currently limited. Influenced by the rapidly developing English medical corpora, the drive to construct Chinese medical corpora has gradually begun to move forward. The major annotated corpora of Chinese medical notes are summarized in Table 1 and described in detail in the following section.

Chinese clinical entity recognition corpus construction
Based on the concept annotation guidelines from the 2010 i2b2 challenge, Xu et al. [9] labeled a standard corpus of 336 Chinese discharge summaries (medications, anatomy, medical problems, treatments, and tests) in 2014. The annotation work consisted of two rounds; the first round was completed by three annotators with the relevant domain background, and the second round was conducted by three annotators with backgrounds in computational linguistics. The results were refined and the final gold standard was obtained by combining the results of the two rounds. Lei et al. [5] constructed an annotated entity corpus of 400 discharge summaries and 400 admission notes. The guidelines were similar to those used in the 2010 i2b2 NLP challenge, but the "treatments" were divided into "procedures" and "medications." Moreover, Wang et al. [22] annotated the text with 12 elements required by doctors from free-text operation notes. In that study, the guidelines are not mentioned and the annotation process is only briefly described. Miao et al. [30] annotated Breast Imaging Reporting and Data System categories manually in a preliminary study on information extraction from Chinese breast ultrasound reports. These two studies [21,29] are good examples of information extraction for specific information. Liu et al. [13] annotated temporal expressions in clinical notes and built guidelines that refer to the temporal expression annotation guidelines of TimeML for English newswire text and the 2012 i2b2 NLP challenge for English clinical text. Furthermore, in 2019, Gao et al. [11] described a more detailed method of constructing a corpus of nine entity types based on resident admit notes. The guideline was also developed using the i2b2 annotation guidelines, but they added the "body part" and "temporal word" entities in their annotation work and distinguished the "inspection" and "laboratory test" entities. An iterative annotation method was employed to form the manual annotation scheme. Furthermore, He et al. [10] used an annotation method for English clinical text to build a syntactic corpus about entity diseases, symptoms, and treatments. They created the draft guideline first, then trained the annotators and updated the guideline. The inter-annotator agreement (IAA) was then […]. […] [33] manually annotated a corpus with Chinese word segmentation and part-of-speech tags for Chinese clinical text at a fine granularity. This work is an excellent reference, but it does not elaborate on the methods and steps used in the annotation. In summary, there has been some excellent initial progress toward the construction of Chinese clinical record corpora. However, there is still no standardized methodology for Chinese clinical text.

TCM corpus construction
In contrast to the progress in the corpora of Western medicine in Chinese medical records, the progress on corpora of TCM clinical notes is still in its infancy. Fang et al. [34] annotated a large biomedical literature corpus obtained from PubMed and developed an open database, TCMGeneDIT, to provide information about TCM, genes, diseases, TCM effects, and TCM ingredients. However, this research approach cannot be used for clinical records. With respect to clinical records, Wang et al. [21] constructed an annotated corpus for the symptoms of the chief complaint in TCM free text. It is an empirical study, but the number of text types and entity categories in the corpus was relatively small, and the authors did not describe in detail how their guidelines and annotation were developed. Ruan et al. [28] focused on symptoms and symptom-related entity extraction. Symptoms were divided into TCM symptoms and Western symptoms, and medicine was divided into TCM medicine and Western medicine. The dataset was divided into two parts; two experts annotated the symptom entities in the EMRs to train and test a conditional random field (CRF) model. Li et al. [24] built a dataset of herbs and symptom records and annotated the relationships between them. This study is useful for the extraction of relations from TCM health records. Fig. 1 presents an overview of the concepts found in TCM clinical notes, e.g., meridians and collaterals, viscera, acupoints, the etiology of TCM, syndrome elements, and diagnosis methods. These concepts were not all addressed in previous studies.
These studies demonstrate that the research on TCM clinical text has some defects: 1) there are no large corpora available in the TCM domain; 2) only a small part of the overall TCM concepts have been annotated systematically, while other types of entities have been ignored; 3) the existing methods of TCM corpus construction are too coarse grained; and 4) most of the previous studies do not describe how data selection, guideline drafting, and annotation were implemented. Hence, a practical and effective method is needed to develop a standard annotation scheme and build a comprehensive entity corpus of TCM records.

Dataset
The dataset contains 10,197 records, which are fragments extracted from a modern Chinese TCM case records database. These records are transcripts of raw TCM clinical records collected by TCM doctors during their routine diagnosis and treatment work. Our dataset does not contain basic information about patients, such as name, age, or gender. We chose this type of record for two reasons: 1) The transcripts of TCM case records are an important resource for studying TCM. A complete case record of TCM contains abundant TCM knowledge, such as the main complaint, syndrome differentiation, diagnosis, treatment or prescription, medicines, and doses. Therefore, it is an important resource of medical information for the study of unstructured documents, and is the best type of document for obtaining the analysis and clinical experience of well-known TCM experts [35-37]. In contrast to resident admit notes, the TCM case records are more refined, logical, and enlightening [38]. One example is the text "咳嗽,黄稠痰,痰不易吐出,咽喉疼痛,咳引两太阳穴痛,怕冷,无汗,口气秽臭,苔黄腻,舌略红,脉不浮" (cough, the yellow thick phlegm is not easy to spit out, throat pain, traction pain in the position of EX-HN5 when coughing, fear of the cold, no sweat, fetid breath, yellow and greasy coating, slightly red tongue, pulse is not floating), in which the key symptoms for the TCM diagnosis have been listed. These were obtained by the four basic diagnosis procedures (inspection, listening and smelling, inquiry, and palpation) [39].
2) The extraction of knowledge hidden in a large number of TCM clinical texts and distillation of this knowledge into a concise form is clinically significant. A good example is the discovery of artemisinin, which was spotted in TCM records. The use of artemisinin is a medical advance that has saved millions of lives globally [40]. More recently, a growing number of studies have found that the diagnostic methods of TCM can help the diagnosis of disease in modern medicine. For example, it was found that tongue features can be used to predict early-stage breast cancer [41,42]. Moreover, a geographical tongue¹ is associated with the severity of diseases such as psoriasis [43]. With regard to pulse diagnosis, Wang et al. [44] found that there is a significant difference between the pulse signals of healthy volunteers and patients with fatty liver disease and cirrhosis. In TCM, a "string-like" pulse² in the left hand is closely related to liver disease [45].
For these reasons, we employed parts of TCM transcript data to design a feasible and reusable method for establishing a fine-grained entity corpus of TCM clinical records. We note that because personal patient information is not included in the dataset, the study requires no ethics committee approval.

Entity selection
The method used to select entities is rarely mentioned in previous studies. In our work, we combine sample annotation with repeated discussions. First, we analyzed the characteristics of our dataset. Then, 100 randomly selected records were given to each annotator to establish the entity labels and annotate the records. After this step, the three annotators had marked 26, 10, and 46 concepts, respectively. The three annotators discussed the inconsistent labels to reach a consensus about which entity types should be included in our study. The annotators' understanding of four entity categories ("body parts," "tongue diagnosis,"³ "pulse diagnosis,"⁴ and "direction and position") was more consistent than their understanding of the others. To improve the work efficiency and quality, we chose these four entity categories rather than all the categories of TCM entities that occur in the dataset. There are some important concepts not involved in our study, for example, "symptoms," "temporal words," and "herbal medicine," that we plan to address in future research.
The four categories in our experiment, which consist of 13 entities, are highly important to pathogenesis analysis, syndrome differentiation, diagnosis, and treatment. For example, in the phrase "疏肝利胆" (dispersing stagnated liver qi for promoting bile flow), "肝" (the liver, a "Zang organ" entity) and "胆" (the gallbladder, a "Fu organ" entity) reflect the key Zang-Fu organs⁵ in the treatment procedure. In the Chinese word "肩髃痛" (pain at LI15), "肩髃" (LI15, an "acupoint" entity) indicates that the pathogenesis is an abnormality of the meridian qi of the large intestine meridian (LI). Moreover, as an ashi acupoint,⁶ "肩髃" (LI15) has a good curative effect for shoulder pain. In addition, pulse diagnosis and tongue diagnosis are indispensable in TCM. For instance, when a particular pulse appears at the wrong place or in the wrong season, a serious disequilibrium of the system is indicated [46]. Furthermore, the tongue body mainly reflects a deficiency or excess of qi and blood in the Zang-Fu organs, whereas a change in the tongue coating is mainly used to judge the depth and severity of pathogenic qi [47]. For example, in the transcript "失眠,口苦,思饮,咯痰略黄,大便偏干,鼻息热,眼干,苔黄干,脉细弱" (insomnia, bitter taste, fond of drink, slightly yellow sputum, dry stool, hot breath, dryness of eyes, yellow and dry coating, thready and weak pulse), the "yellow and dry coating" reflects the internal disturbance of pathogenic heat, and the "thready and weak pulse" indicates the deficiency of healthy qi. Finally, position and direction have significant clinical diagnostic value. For example, according to TCM theory, different positions of the tongue correspond to the five different Zang organs: the tip of the tongue corresponds to the heart, so "舌尖红" (red tip of the tongue) is probably a manifestation of heart fire.
² A "string-like pulse" is a straight, long, and taut pulse, like a musical string to the touch.
³ Tongue diagnosis is an inspection of the size, shape, color, and moisture of the tongue proper and its coating. It is very helpful for TCM doctors in disease diagnosis. For example, a yellow coating is always a manifestation of inner heat.
⁴ Pulse diagnosis is an examination of the pulse for making a diagnosis. TCM doctors examine the pulsation of blood vessels by feeling with the fingertips. This examination involves the pulse position, pulse shape, pulse rate, and other features. A certain type of pulse indicates a particular disease.

Entity definition
In this study, we summarized four data categories, and 13 entity types are derived from these four categories.
Referring to the concept definitions of TCM in WHO's international standard terminologies on traditional medicine in the Western Pacific region [48] and a textbook on the diagnostics of TCM [47], the definitions of the 13 entities are listed in Table 2. More details and examples are shown in the guidelines in Additional file 1.

Annotation tools
To make the fine-grained marking process easier and more efficient, we developed an entity annotation tool. As shown in Fig. 2, the Chinese characters were labeled with predefined tags with a specific color. By specifying the color of the label, we can distinguish the content of continuous annotations and make inconsistencies more visible. This will facilitate the modification of the annotations and the recording of the problems. With this annotation tool, annotators are able to add and remove labels in the labels column, remove incorrect annotations and re-annotate them in the function column, and annotate entities in the annotation column. The location information of the selected content is displayed in the position column.

Fine-grained annotation
Fine-grained annotation further divides the coarse-grained entities into finer subcategories until no further divisions can be made, and the Chinese words are then further divided into the smallest semantic units. Consequently, most of the words are shorter than two Chinese characters. In this way, more context information can be captured. Therefore, a fine-grained annotated corpus will better support the automatic processing and analysis of EMRs in NER. For instance, Roberts et al. [49] determined that high-quality fine-grained natural language annotations substantially affect a system's ability to recognize heart disease risk factors. However, most of the studies on Chinese clinical entity tagging in recent years employ a coarser-grained annotation; for example, "右下肢" (the right lower limb) was annotated as a single "body part." In contrast, in our fine-grained annotation guidelines, as shown in Fig. 3, "肢" (a limb) should be annotated as an "ordinary body part," and "右" (right) and "下" (lower) should be separately annotated as "direction and position."
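The contrast between the two granularities can be made concrete with a small sketch. The `Span` class and the `surface` helper below are hypothetical names introduced only for illustration (they are not part of the annotation tool described in this paper), and character offsets are assumed to be zero-based with an exclusive end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One annotated entity: character offsets [start, end) plus a label."""
    start: int
    end: int
    label: str


text = "右下肢"  # "the right lower limb"

# Coarse-grained: the whole word receives one label.
coarse = [Span(0, 3, "body part")]

# Fine-grained: each smallest semantic unit receives its own label.
fine = [
    Span(0, 1, "direction and position"),  # 右 (right)
    Span(1, 2, "direction and position"),  # 下 (lower)
    Span(2, 3, "ordinary body part"),      # 肢 (limb)
]


def surface(text, spans):
    """Return (covered text, label) pairs for a list of spans."""
    return [(text[s.start:s.end], s.label) for s in spans]


print(surface(text, coarse))
print(surface(text, fine))
```

Storing offsets rather than copied strings keeps each annotation tied to its position in the record, which is what the position column of an annotation tool would typically display.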

Annotation method
Using previous research methods as references, we designed a replicable method to develop a fine-grained annotation guideline and construct a fine-grained entity corpus. The approach consisted of the following four steps (Fig. 4).
1) Determination of the entities to be marked (as described in detail in the "Entity selection" section).
2) Guideline drafting: After referring to some existing well-developed guidelines [26,50,51], a team of three annotators randomly selected 300 records (100 each) from the dataset and independently annotated the samples. At the same time, they summarized the characteristics of the included entities and continually discussed them. Finally, a fine-grained annotation guideline was drafted in which examples of different cases were included for easier understanding.
3) Guideline updating and consistency assessment: In each round, 100 unannotated records were randomly selected from the dataset. The guideline was updated after each round until the IAA met the required standard (κ > 0.9), which meant that the labels of the three annotators were highly consistent; otherwise, the iterative fine-grained annotation of sample records continued. During this step, we added more examples and supplemented the draft guidelines with detailed explanations. A more comprehensive set of fine-grained annotation guidelines was hence developed (see Additional file 1).
4) Corpus construction: Using the guidelines developed in steps 2 and 3, three annotators performed the annotation work independently. The dataset was divided into three parts, and the three annotators marked different parts separately to reduce the time required and improve annotation efficiency. During this period, we kept the annotation work as independent as possible, and the following principles were strictly followed:
i) Although there are practical standards for medical record writing, sometimes errors exist in these texts. Incorrectly written characters were not annotated in any situation. For example, in the word "脚指" (foot finger), "指" (finger) is miswritten and should be "趾" (toe). Hence, "指" (finger) was not annotated.
ii) Punctuation should be excluded from annotations wherever possible, to minimize its interference with the annotated entities.
iii) Entity annotations can be nested but must not partially overlap. For example, "指掌连接处" (the body part where the fingers and palms are connected) should be annotated as an "ordinary body part," but "指" (finger) and "掌" (palm) should also be annotated as "ordinary body parts" individually.
iv) For some complex or ambiguous situations, the annotators discussed how to unify the decisions. For example, there was some controversy as to whether "心" (heart) in the word "心悸" (palpitation) should be annotated as an "ordinary body part" or a "Zang organ." Here, "心悸" (palpitation) is a subjective sensation of the rapid and forceful beating of the heart. It seems logical to annotate "心" (heart) as either an "ordinary body part" or a "Zang organ." After discussions, and considering that "心悸" has been a symptom name in TCM for more than a thousand years, the annotators formed a consistent view; that is, "心" (heart) in such situations should be consistently annotated as a "Zang organ."
In addition, during the comprehensive fine-grained annotation process, some measures were taken to ensure the quality: 1) Annotators were required to record uncertain annotations, and they discussed them regularly until all the ambiguities were resolved. 2) The three annotators had similar TCM backgrounds (with doctor qualifications in TCM and in the same research area), which improved the marking accuracy and reduced the occurrence of uncertain cases. 3) Duplicate documents were assigned to the three groups in step 4 for an IAA evaluation in order to ensure the quality of the annotated data.

Table 2 Definitions of the 13 entity types
Entity | Definition | Example
Ordinary body part | […] | 眼痒 (itchiness in the eyes)
Tongue body | This is the musculature and vascular tissue of the tongue, also the tongue substance. It is annotated only when followed by a specific description of the tongue's physical manifestation. | 舌红, 苔黄, 脉滑 (red tongue, yellow coating, slippery pulse)
Tongue coating | A layer of moss-like material covering the tongue, also called tongue fur. It is annotated only when followed by the description of tongue coating manifestation. | 舌红, 苔黄, 脉滑 (red tongue, yellow coating, slippery pulse)
Pulse | A radial artery of the wrist, which includes three sections: cun, guan, and chi. The pulse entity is annotated only when it is followed by a description of the pulse condition. | 舌红, 苔黄, 脉滑 (red tongue, yellow coating, slippery pulse)
Acupoint | A point where a needle is inserted and manipulated in acupuncture therapy. | […]
Meridian and collateral | A system of conduits through which qi and blood circulate, connecting the bowels, viscera, extremities, superficial organs, and tissues, and making the body an organic whole. These are the same as channels and networks and are also called meridians or channels. | 左大腿阳明经固定痛 (fixed pain in the stomach channel of the foot-yangming of the left leg)
Zang organ | An internal organ in which the essence and qi are formed and stored. These organs include heart, liver, spleen, lungs, and kidneys, and are also called the five viscera. | 一直服调脾化湿药 (always take the medicine for regulating the spleen and removing dampness)
Fu organ | An internal organ in which food is received, transported, and digested, including the gallbladder, stomach, large intestine, small intestine, urinary bladder, and triple energizers.ᵍ They are also called the six bowels. | […]
Both the tongue body and tongue coating | Words referring to the tongue body and tongue coating. | 舌可 (normal tongue)
Tongue body manifestation | Specific description of the tongue body manifestation, including tongue color, shape, and sublingual vein. | […]
Tongue coating manifestation | Specific tongue coating manifestation, including color, thickness, and texture. | […]
Pulse condition | Specific description of arterial pulsation in TCM when the pulse is felt during examination. | […]
Direction and position | Description of the direction and position, which enables us to know the specific location of the body part. | 左膝关节疼痛 (pain in the left knee joint)
ᵍ In TCM, the Fu organ "triple energizers" is a collective term for the three portions of the body cavity through which the visceral qi is transformed. This organ is also widely known as the "triple burners." It contains the upper energizer, middle energizer, and lower energizer. It is also called the "solitary hollow organ," because there is no paired relationship between the viscera and the "triple energizers."
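Principle iii of the corpus-construction step (annotations may nest but not overlap) lends itself to a mechanical check. The sketch below is only one possible reading of that rule: it assumes spans are `(start, end)` character offsets with an exclusive end, and it treats "overlap" as partial overlap, where two spans share characters but neither contains the other. The function names are hypothetical.

```python
def crosses(a, b):
    """True if spans a and b share characters but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    overlap = s1 < e2 and s2 < e1
    nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    return overlap and not nested


def find_crossing(spans):
    """Return every pair of spans that violates the nest-but-not-overlap rule."""
    return [
        (a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
        if crosses(a, b)
    ]


# "指掌连接处": the whole phrase plus the nested "指" and "掌" are allowed.
allowed = [(0, 5), (0, 1), (1, 2)]
# Hypothetical bad case: "指掌" and "掌连接处" share "掌" without nesting.
bad = [(0, 2), (1, 5)]

print(find_crossing(allowed))
print(find_crossing(bad))
```

A check of this kind could be run over each annotator's output before the IAA is computed, so that structural errors are not counted as disagreements.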

Key and difficult points in the entity annotation task
Our study is the first to use fine-grained annotation methods on TCM clinical records. Entities such as "acupoints," "Zang-Fu organs," "tongue manifestations," and "pulse conditions" have not been annotated in previous studies. Consequently, the annotation work is challenging.
The key points and difficulties are as follows. Clinical narratives are often written in a medical sublanguage with semantic categorization of words, domain-specific terminology, incomplete phrases, and omission of information [52]. TCM case records are written by practicing doctors, and their brief forms appear very similar to ancient Chinese texts. Moreover, they can only be understood by a professional doctor with a background in TCM. For example, in the transcript "痛点, 左右耳门" (pain point, at left and right TE21), "耳门" (TE21) is an acupoint rather than an ordinary body part. In another example, "颈项" (neck), "颈" (front of the neck), and "项" (back of the neck) are all entities of ordinary body parts. Fine-grained annotation is the most important and difficult part of our work. One major difference between Chinese and English text is that words in Chinese are formed by continuous Chinese characters without any spaces, and the boundary between fine-grained entities is not clear (as shown in Fig. 3); as a result, fine-grained annotation of TCM clinical notes is time-consuming, and annotators must have an in-depth understanding of the document.

IAA
The calculation of IAA (often known outside of corpus linguistics as the inter-rater agreement) is motivated by the need to address the problem of subjectivity in judgments about things that are not observable with the senses [53]. In our study, we choose Cohen's kappa to measure the consistency of the three annotators' work. Cohen's kappa is a coefficient of internal consistency and is a widely used index for assessing IAA. It is appropriate for nominal and ordinal data when there are two or more raters per subject and is calculated as follows [54,55].
$$\kappa = \frac{P_0 - P_e}{1 - P_e}$$
Here, $P_0$ is the observed agreement between two annotators, and $P_e$ is the probability of agreement between the annotators if each annotator were to randomly pick a category for each annotation. Both quantities are computed from a contingency matrix representing agreements and disagreements. The annotation is considered to be sufficiently consistent when all three κ values are greater than 0.9.
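As an illustration of the kappa calculation, the following minimal sketch computes κ for every pair of annotators. It rests on assumptions the paper does not state: the unit of comparison is taken to be one label per character (with "O" for unannotated characters), and the chance agreement is estimated in the usual Cohen manner from each annotator's own label frequencies. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p0 = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if pe == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (p0 - pe) / (1 - pe)


def pairwise_kappa(labels_by_annotator):
    """Kappa for each pair of annotators, keyed by the pair of names."""
    return {
        (i, j): cohen_kappa(labels_by_annotator[i], labels_by_annotator[j])
        for i, j in combinations(sorted(labels_by_annotator), 2)
    }


# Toy per-character labels, invented for illustration only.
labels = {
    "A": ["Zang organ", "O", "Pulse", "O"],
    "B": ["Zang organ", "O", "O", "O"],
    "C": ["Zang organ", "O", "Pulse", "O"],
}
scores = pairwise_kappa(labels)
consistent = all(k > 0.9 for k in scores.values())
print(scores, consistent)
```

With three annotators this yields three κ values, one per pair, and the consistency criterion is applied to all of them.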

Annotation consistency
We added duplicates (600 records) to each annotator's tagging task to calculate the IAA. The result shows that the IAA value during corpus construction remained at a relatively high level (0.93, 0.94, and 0.94; Fig. 5). The IAA evaluation shows that this fine-grained entity corpus is of good quality.
As shown in Fig. 5, our marking task was repetitive and time-consuming work: the whole marking process took five rounds to complete. In the fourth round, the IAA values exceeded 0.9, indicating that the three annotators had a high degree of consistency in their understanding of the labels and TCM records, and that they had the ability to accomplish these annotation tasks with satisfactory consistency. Moreover, the IAA values in each annotation round are higher than those of the previous round, showing that our method of iterative annotations and discussions is effective.
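The assignment of records to annotators, with a shared set of duplicates for the IAA evaluation, could be sketched as follows. This is an assumption-laden illustration rather than the procedure actually used: the paper does not say how the 600 duplicate records were chosen, so here they are drawn at random from the dataset and the remaining records are split evenly. The function name, parameters, and seed are hypothetical.

```python
import random


def assign_records(record_ids, n_annotators=3, n_shared=600, seed=0):
    """Split records among annotators and give every annotator the shared set."""
    ids = list(record_ids)
    if n_shared > len(ids):
        raise ValueError("n_shared exceeds the number of records")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    shared = ids[:n_shared]    # annotated by everyone, used for IAA
    rest = ids[n_shared:]      # annotated by exactly one annotator
    tasks = [shared + rest[i::n_annotators] for i in range(n_annotators)]
    return shared, tasks


shared, tasks = assign_records(range(10197))
print(len(shared), [len(t) for t in tasks])
```

Under these assumptions each annotator receives 3,799 records, of which 600 are common to all three and provide the labels from which the pairwise κ values are computed.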

Data analysis of annotations
The fine-grained annotated corpus has 1104 entities and 67,799 tokens in total. An analysis of the corpus reveals some interesting points, especially in terms of the language expressions used in clinical TCM. The data analysis is helpful for identifying the rules of TCM clinical expressions and leads to questions that will contribute to future research about the corpus construction of TCM clinical records.

Distribution and analysis of entities and tokens
The distribution of entities and tokens is shown in Table 3. The proportions of the entities "ordinary body part," "tongue body," "tongue coating," "tongue body manifestation," "tongue coating manifestation," "pulse," "pulse condition," and "direction and position" are much higher than those of the other entities. In the "body part" category, the entity "ordinary body part" (21,093) occurs the most, followed by the entities "pulse" (6148), "tongue coating" (4978), and "tongue body" (3789). Among the "ordinary body part" entities, we noticed that many annotated entities are concepts from Western medicine. For instance, "毛细血管" (capillary vessel) and "椎间盘" (intervertebral disk) are body part concepts in Western medical anatomy. Clearly, the modern case records of TCM contain both TCM and Western medicine knowledge. In addition, entities related to "tongue body manifestation" (4088), "tongue coating manifestation" (10,911), and "pulse condition" (9573) are relatively common. After reading the original text, we observed that almost every TCM case record documents the pulse or tongue diagnosis information. It can be seen that tongue diagnosis and pulse diagnosis are among the most common diagnostic methods in TCM, and the "tongue coating manifestation" (10,911) has high diagnostic value in practice.
As for the distribution of tokens, examples of the top-10 entities in each entity type are shown in Table 4. Combined with the original text, we analyzed the distribution of tag content and revealed that the expressions of many concepts of TCM are not uniform, and there are many entities that are similar in semantics but different in name, e.g., "腹" and "腹部" both mean abdomen, "足阳明经" and "胃经" both refer to the stomach meridian, "内" and "内侧" both mean "inside," and "中心" and "中间" both refer to the center position. Such synonyms with different expressions on the one hand reduce the reliability of statistical analysis results of the corpus, but on the other hand, are expressions found in real and raw Chinese language data, and identifying them will increase the adaptability of machine learning models. The normalization of these entities will be a part of future research work.
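The effect of such synonyms on the token statistics can be illustrated with a small sketch. The synonym pairs are taken from the examples above, but the choice of which form is canonical, the `top_tokens` helper, and the toy mention list are assumptions made purely for illustration; the normalization itself is left to future work in this study.

```python
from collections import Counter

# Variant form -> canonical form (the direction of mapping is an assumption).
SYNONYMS = {
    "腹部": "腹",        # abdomen
    "胃经": "足阳明经",  # stomach meridian
    "内侧": "内",        # inside
    "中间": "中心",      # center
}


def top_tokens(mentions, k=10, normalize=False):
    """Count surface forms per entity type and return the k most frequent."""
    counts = {}
    for entity_type, form in mentions:
        if normalize:
            form = SYNONYMS.get(form, form)
        counts.setdefault(entity_type, Counter())[form] += 1
    return {etype: c.most_common(k) for etype, c in counts.items()}


# Toy mentions, invented for illustration only.
mentions = [
    ("ordinary body part", "腹"),
    ("ordinary body part", "腹部"),
    ("ordinary body part", "腹部"),
    ("direction and position", "内"),
    ("direction and position", "内侧"),
]

print(top_tokens(mentions))
print(top_tokens(mentions, normalize=True))
```

Without normalization the two spellings of "abdomen" are counted separately; with it they collapse into a single entry, which is the kind of change that would alter a top-10 list such as Table 4.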

Top-10 syndromes and their relationships with the entities of pulse and tongue body (coating) manifestations
The top-10 syndromes in our preprocessed database are listed in Figs. 6 and 7. As an important part of TCM diagnosis, syndrome differentiation, which is a comprehensive analysis of symptoms and signs, has implications for determining the cause, nature, and location of the illness and the patient's physical condition [48]. Solid lines are used to connect entities that are likely to be related to syndromes according to the textbook Diagnostics of Traditional Chinese Medicine [47]. Figure 6 shows that there are many-to-one and one-to-many relationships between syndromes and pulse conditions. For example, a string-like pulse is probably caused by qi stagnation and liver depression, and blood deficiency manifests as a thready or faint pulse. In Fig. 7, the blood stasis syndrome appears as multiple clinical tongue body manifestations (dark, dark and red, or red and dark), blood deficiency manifests as a pale tongue, and a yellow coating may be the result of inner heat or dampness heat. As can be seen, common pulse or tongue [...] There are also some exceptions. For example, blood stasis is the most frequent syndrome in our dataset. In TCM basic theory, blood stasis syndrome is likely to manifest as a rough pulse, slow pulse, or tight pulse [47]. However, these three are not among the top-10 pulse conditions. To determine why, we looked up the original text and noticed that, in the TCM clinical free text, patients with blood stasis syndrome may not present with the above-mentioned pulse conditions. For instance, in the text "脑血管动脉瘤术后,神识不清,唇干,紫暗,痰多稠黏,大便干燥,小便清,苔黄腻,舌暗红,脉缓" (postoperative cerebral vascular aneurysm, clouded in mind, dry lips, dark purple, sticky sputum, dry stool, clear urination, yellow and greasy coating, dark red tongue, moderate pulse), the syndrome of this case should be summarized as blood stasis accompanied by phlegm-heat, and the postoperative cerebral vascular aneurysm and dark red tongue body reflect the stagnated blood inside the body. However, a moderate pulse is not a typical symptom of blood stasis syndrome.
It can be seen that the main content of the corpus mostly corresponds to the annotation results. Moreover, constructing a corpus helps us to obtain and analyze the content of a dataset. However, there are some cases that do not conform to this trend. This occurs because TCM is an experience-based clinical medicine, and its clinical cases are detailed and variable. Although tongue and pulse diagnoses have a certain diagnostic function, only

Examples of special entities and analysis
In our fine-grained entity corpus, some special annotations need to be explained. In most cases, the general rule is that direction and position words that occur in front of a body part modify it and are annotated separately, as in "右下肢" (right lower limb) or "左膝关节" (left knee joint). However, there are some expressions particular to TCM, for example, "少腹" (lower abdomen) and "小腹" (lower abdomen), which are two of the "ordinary body part" entities in TCM. In our annotation guidelines, "少" and "小" should not be annotated separately as "direction and position." To preserve the particular expressions of TCM, entities similar to these two cases are not split.
There are some entities with combinable attributes in TCM. For example, in the record "背心怕冷" (the center of the back is sensitive to cold), "心" means the center position on the back rather than the heart viscera. In the record "心虚胆怯" (timidity due to insufficiency of qi and deficiency of blood of the heart), "心" should be annotated as a Zang organ. Moreover, the word "心肌" (cardiac muscle) is an anatomical concept of Western medicine, so the entity "心" (cardiac) should be annotated as an "ordinary body part." In the above three cases, the word "心" should be annotated as a different entity type depending on the context.
Some special entities are annotated as the "direction and position" entity type, such as those in the records "下两寸处" (two cun downward) and "外侧4寸处" (4 cun sideward). Here, "cun" is a common ancient unit of length (about 3.33 cm) used especially for locating acupoints or meridians. This usage is quite similar to that found in ancient Chinese medical texts.
Hence, the annotation of TCM clinical records is complicated. It is quite different from the annotation work of Western medical records performed in previous studies, and abundant TCM knowledge is necessary for the annotators to analyze the meaning of the context.
First, it is easy to fall into habitual thinking when annotating "body part" entities, which results in the entity "Fu organ" rarely being used as an annotation result. For example, "胃" (stomach) is annotated as "Fu organ" twice but as "ordinary body part" 1267 times. The three annotators agreed that the entity "胃" (stomach) is more likely to express an anatomical part than a Fu organ. It is thus clear that TCM practitioners are highly influenced by Western medicine knowledge.
In addition, the dataset in our study consists of Chinese medicine physician case records instead of acupuncture case records. Thus, the entities "acupoint" (1.3%) and "meridians and collaterals" (0.1%) account for a very small proportion of our corpus. Table 5 lists examples of the top-10 annotated acupoints and the corresponding meridians. Interestingly, the table shows that acupoints are mostly used to describe symptoms, especially symptoms of pain. We can reasonably infer that the different focus of knowledge and clinical habits of TCM physicians may also lead to this result. Furthermore, from Table 3, we can see that there are not many distinct entities related to "tongue body" (7), "tongue coating" (10), "pulse" (22), and "both tongue body and tongue coating" (2); however, they have a large number of annotations (3789, 4978, 6148, and 793, respectively). Hence, one can see that the expressions of "tongue body," "tongue coating," "pulse," and "both tongue body and tongue coating" are relatively consistent and frequently used in TCM clinical records.
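The consistency argument above can be checked with simple arithmetic. The following minimal sketch divides the number of annotations by the number of distinct entities for each of the four types, using only the counts quoted above from Table 3; the names `COUNTS` and `annotations_per_entity` are hypothetical and are introduced here for illustration only.

```python
# (number of distinct entities, number of annotations), as quoted in the
# text from Table 3. The dictionary and function names are illustrative.
COUNTS = {
    "tongue body": (7, 3789),
    "tongue coating": (10, 4978),
    "pulse": (22, 6148),
    "both tongue body and tongue coating": (2, 793),
}


def annotations_per_entity(counts):
    """Mean number of annotations per distinct entity for each entity type."""
    return {name: total / distinct for name, (distinct, total) in counts.items()}


if __name__ == "__main__":
    for name, ratio in annotations_per_entity(COUNTS).items():
        print(f"{name}: {ratio:.1f} annotations per distinct entity")
```

With these counts, each distinct expression is annotated several hundred times on average (for example, 4978 / 10 = 497.8 for "tongue coating"), which is consistent with the observation that these expressions are few in number but used very frequently.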

Conclusions and future work
Corpus construction is a fundamental and indispensable task for the development of NLP techniques aimed at discovering valuable knowledge in TCM. In this paper, we presented a method of building a fine-grained annotated entity corpus based on case records of TCM. We described the detailed steps as well as the implementation, which involves data selection, draft guideline development, iterative annotations for guideline updating, consistency assessment, and corpus construction. High IAA values were achieved in our final annotation work, indicating that our approach is effective and the corpus is of high quality. The analysis of the annotated data revealed some interesting points and problems, indicating that modern TCM has integrated a lot of knowledge from Western medicine; at the same time, the construction of a corpus of TCM records is still very dependent on professional knowledge of TCM. This work lays a solid foundation for future TCM corpus construction and NER research.
There are still some shortcomings in our work; for instance, the entity types are not comprehensive enough. Because of time limitations, we could not complete the annotation of all existing entities in our dataset. In the future, we will annotate more entity types, such as symptoms and prescriptions, to enrich the guidelines and corpus using the methods introduced in this paper. More types of TCM clinical records from different sources will also be annotated to improve the applicability of the corpus. Furthermore, based on the corpus, we will develop algorithms to support NLP techniques. Finally, in-depth research on polysemy, abbreviations, relationships among entities, and the normalization of entities is the next task in our future work.