
Developing a cardiovascular disease risk factor annotated corpus of Chinese electronic medical records



Cardiovascular disease (CVD) has become the leading cause of death in China, and most of the cases can be prevented by controlling risk factors. The goal of this study was to build a corpus of CVD risk factor annotations based on Chinese electronic medical records (CEMRs). This corpus is intended to be used to develop a risk factor information extraction system that, in turn, can be applied as a foundation for the further study of the progress of risk factors and CVD.


We designed a light annotation task to capture CVD risk factors with indicators, temporal attributes and assertions that were explicitly or implicitly displayed in the records. The task included: 1) preparing data; 2) creating guidelines for capturing annotations (these were created with the help of clinicians); 3) proposing an annotation method that covers drafting the guidelines, training the annotators and updating the guidelines, and constructing the corpus. We also introduced several new annotation guidelines: (1) under-threshold medical examination values were annotated, because our purpose is to study the progress of risk factors and CVD; (2) possible and negative risk factors were annotated for the same reason, and we created assertions to distinguish them; (3) we added four temporal attributes to CVD risk factors in CEMRs so that long-term variations can be reconstructed. Then, a risk factor annotated corpus based on de-identified discharge summaries and progress notes from 600 patients was developed. Built with the help of clinicians, this corpus has an inter-annotator agreement (IAA) F1-measure of 0.968, indicating high reliability.


To the best of our knowledge, this is the first annotated corpus concerning CVD risk factors in CEMRs. We also proposed guidelines for capturing CVD risk factor annotations from CEMRs. The obtained document-level annotations can be applied in future studies to monitor risk factors and CVD over the long term.




Cardiovascular disease (CVD) has become the primary cause of death throughout the world; there were approximately 17.5 million deaths from CVD in 2012, most of which occurred in low- and middle-income countries [1]. In China, CVD occupies the leading position among causes of death and is responsible for 2 out of every 5 deaths [2]. This situation deeply affects the health of the Chinese people and is a heavy burden on society. Fortunately, most CVD can be prevented by controlling the malleable risk factors such as specific medical conditions and the adoption of unhealthy life-styles at early stages [3]. A risk factor is a pattern of behavior or physical characteristic of a group of individuals that increases the probability of the future occurrence of one or more diseases in that group relative to comparable groups without or with different levels of the behavior or characteristic [4]. Risk factors, including specific medical conditions such as hypertension and hyperglycemia/diabetes, unhealthy life-style choices such as smoking and alcohol abuse, and other factors such as age and family history, can have prominent effects on the progress of CVD [3, 5]. Therefore, monitoring these risk factors constitutes an important approach in avoiding CVD.

A Chinese electronic medical record (CEMR) is a storage medium suitable for extracting and monitoring CVD risk factors. An electronic medical record (EMR) stores all health care data and information in electronic formats, along with the associated information processing and knowledge support tools necessary for managing the health enterprise system [6]. The availability of large amounts of individual health narratives in CEMRs makes this resource suitable for study by natural language processing (NLP) techniques, especially information extraction (IE) techniques [7]. In 2010, the Ministry of Health in China published the basic norms of medical records writing [8] and the basic norms of electronic medical records [9], making these data more standardized. Together, these characteristics make CEMRs an effective medium for studies that involve extracting CVD risk factors. Some related works [7, 10,11,12,13,14,15,16,17,18,19,20,21] have been performed, but no studies have been conducted on CVD risk factors based on CEMRs. To fill this gap, we designed a task to extract CVD risk factors from CEMRs.

To perform the extraction, we developed a CVD risk factor annotated corpus based on CEMRs, because far fewer corpora are available in the biomedical field than in open domains and a domain-specific corpus is critically important in building an IE system. The purpose of this corpus is to act as the basis for developing an automatic risk factor extraction system. Subsequently, a monitoring platform could be established based on this extraction system that can help supervise CVD risk factors over time. Furthermore, based on the risk factors (along with other health information) that are comprehensively stored over long durations, a method that could predict the trend of each risk factor, help manage chronic diseases (such as hypertension and diabetes) and estimate the progress of CVD could also be included in the platform. To build the corpus, we proposed a light annotation task [22]. As the first step, we annotated 600 patients’ de-identified discharge summaries and progress notes from CEMRs.

Our work is similar to the 2014 i2b2/UTHealth risk factor annotation shared task [23]. We adopted some techniques from that task, but also made some adjustments: (1) on the advice of medical experts, we proposed annotation guidelines for 12 CVD risk factors in the free text of CEMRs; (2) positive, possible and negative information and under-threshold examination values all form part of a patient’s health condition and can be used to develop a long-term supervision system, so this information was included in our annotations; (3) time information is critical for long-term monitoring, so we created temporal attributes to mark when risk factors occurred in CEMRs.

Related work

Related works based on English EMRs

The 2006 Informatics for Integrating Biology and the Bedside (i2b2) shared task focused on identifying patients’ smoking statuses from medical discharge records. In this task, 928 records covering five categories were annotated by two pulmonologists [24]. The 2008 i2b2 Obesity Challenge was an organized competition intended to find ways to recognize obesity and comorbidities from discharge summaries and classify them into four classes: Present, Absent, Questionable, and Unmentioned. An annotated data set was provided [25]. In 2009, the challenge focused on extracting medication information from medical records, including the names of medications, their dosages, modes and frequencies of administration, treatment durations, and reasons for administration [26, 27]. The challenge included a set of annotated discharge summaries. In 2010, the challenge involved a concepts, assertions, and relations identification task in which participants were given an annotated gold-standard corpus for system training [28]. The Sixth i2b2 Natural Language Processing Challenge concerned the issues involved in recognizing temporal relations in clinical records; it provided a corpus of annotated discharge summaries with temporal information [29, 30]. Subsequently, the 2014 i2b2/UTHealth NLP project focused on identifying risk factors for Coronary Artery Disease in the narrative texts of EMRs, providing a set of 1304 annotated medical records [31, 32]. In 2016, the challenge was to classify psychiatric patients into four severity levels based on their neuropsychiatric clinical records; 433 annotated records were provided for training [33].

Projects such as the ShARe/CLEF eHealth Evaluation Lab 2013 were devoted to solving the difficulties involved in understanding the professional expressions (such as non-standard abbreviations and ward-specific idioms) that clinicians use when describing their patients [34, 35]. This project provided annotated corpora for system building. Another important project was undertaken during SemEval 2015. Its clinical TempEval sub-task was similar to the i2b2 2012 NLP shared task in that participants were asked to find ways to recognize temporal information, clinical events, and their relations in clinical narratives. This project used a manually annotated corpus based on 600 clinical notes and pathology reports [36]. Another SemEval 2015 subtask involved analyzing clinical text through named entity recognition and template field completion. This subtask used the ShARe corpus of annotated clinical text [35, 37]. Meystre et al. [38] proposed a new IE system for a congestive heart failure performance measure based on clinical notes from 1083 Veterans Health Administration patients. Notes annotated by domain experts served as the gold standard. Ford et al. [39] reported that information extracted from text in EMRs does improve case detection when combined with proper coding.

Related works based on CEMRs

Wang et al. [16] focused on recognizing and normalizing the names of symptoms in traditional Chinese medicine EMRs. To perform judgements, this system used a set of manually annotated clinical symptom names. Jiang et al. [14] proposed a complete annotation scheme for building a corpus of word segmentation and part-of-speech (POS) tags from CEMRs. Yang et al. [11] focused on designing an annotation scheme and constructing a corpus of named entities and entity relationships from CEMRs; they formulated an annotation specification and built a corpus based on 992 medical discharge summaries and progress notes. Lei [17] and Lei et al. [18] focused on recognizing named entities in Chinese medical discharge summaries. They classified the entities into four categories: clinical problems, procedures, labs, and medications. Finally, they annotated an entity corpus based on CEMRs. Xu et al. [19] studied a joint model that performed segmentation and named entity recognition in Chinese discharge summaries and built a set of 336 annotated Chinese discharge summaries. Wang et al. [20] researched the extraction of tumor-related information from Chinese-language operation notes of patients with hepatic carcinomas, and annotated a corpus containing 961 entities. He et al. [21] proposed a comprehensive corpus of syntactic and semantic annotations from Chinese clinical texts.

Despite the similar intent of these works, the extraction of CVD risk factors from CEMRs has not yet been studied. Moreover, far fewer corpora are accessible for IE tasks in the biomedical field than for more general extraction tasks, even though corpora are essential for building an IE system. Thus, constructing a CVD risk factor annotated corpus is both a necessary and fundamental task. In addition, unlike annotation tasks for texts that require less specialized knowledge, annotation in the biomedical field requires linguists to work with the assistance of medical experts.


A light annotation task

Compared with traditional NLP tasks such as segmentation, POS tagging, parsing, and semantic analysis, annotating CVD risk factors from CEMRs is a task that is both distinctive and light. As Stubbs notes [22], we need only create a light annotation task for risk factor annotation rather than implementing all the NLP tasks. Therefore, based on the annotation trials conducted by Stubbs and Uzuner [32], we built a light annotation task that focuses solely on annotating CVD risk factors with indicators, temporal attributes, and assertions, and that does not require other NLP tasks. We also applied an exhaustive annotation strategy: every occurrence of a CVD risk factor in the CEMR narratives was annotated, no matter how many times it appeared. Notably, during the annotation trials, the extra time required by the exhaustive annotation strategy was offset by a higher level of inter-annotator agreement (IAA) and reduced difficulty for the annotators.


We obtained a snapshot of medical records from the Second Affiliated Hospital of Harbin Medical University (a large general hospital that offers clinical services, medical education, scientific research, disease prevention, healthcare and rehabilitation) for all of 2012. The data included images of the medical records for approximately 140,000 patients from 35 departments and 87 sub-departments, ranging from pediatrics to the Intensive Care Unit (ICU). For the CVD risk factor annotation task, we selected the CEMRs of 600 patients: 344 randomly selected cardiovascular patients, 190 cardiovascular surgery patients, and 66 patients from other departments. Each patient’s medical records contained a series of documents consisting of their discharge summary, progress notes, medical examination reports, electrocardiograms, hospitalization daily records, hospitalization operation records, nursing records, hospital expense invoice, order sheets, consultation sheets, and consent letters. The discharge summaries and progress notes were regarded as the most important free text in these records [7]. A discharge summary summarizes the entire therapeutic process and treatment outcome, while progress notes periodically record the clinical manifestations, medical examinations and treatment. Therefore, we regarded the discharge summaries and progress notes of the 600 patients described above as suitable for annotation.

Next, the records were preprocessed as follows: (1) we used an optical character recognition (OCR) tool, Tesseract [40], to convert the original record images into text; (2) we manually fixed errors after the OCR process was complete and removed identifying information such as patient names, addresses, hospital IDs and doctor names; (3) we encoded the text into Extensible Markup Language (XML) format and added a title section using an XML node. Figure 1a and b show an example of a progress note in XML format after preprocessing and ready for annotation.

Fig. 1

a A sample progress note after preprocessing (original). b A sample progress note after preprocessing (English version)
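Step (3) of the preprocessing can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: the tag and attribute names (`record`, `section`, `title`, `text`, `type`) and the sample section are hypothetical, because the actual schema is visible only in Fig. 1.

```python
import xml.etree.ElementTree as ET


def encode_record(doc_type, sections):
    """Wrap de-identified, OCR-corrected text in a simple XML tree.

    `sections` is a list of (title, body) pairs. All tag and attribute
    names are illustrative assumptions, not the schema used in the paper.
    """
    root = ET.Element("record", attrib={"type": doc_type})
    for title, body in sections:
        section = ET.SubElement(root, "section")
        # The title is stored in its own node so that annotators and
        # downstream tools can tell sections apart.
        ET.SubElement(section, "title").text = title
        ET.SubElement(section, "text").text = body
    return root


# Hypothetical one-section progress note.
note = encode_record("progress_note", [("查体", "Bp 130/80 mmHg")])
xml_string = ET.tostring(note, encoding="unicode")
```

Keeping each section title in a separate node mirrors the "title section" mentioned in step (3) and makes it possible to report later in which section an annotation was found.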

Annotation guidelines

A light annotation task involves annotating the CVD risk factors with indicators, temporal attributes and assertions from the narratives in the CEMRs. Based on the CEMR characteristics and clinician suggestions, the guidelines for annotating this information are presented as follows.

CVD risk factors and indicators

An indicator signals the existence of a risk factor that may not be explicitly recorded in the narratives of CEMRs but is present in implicit form (e.g., the highest blood pressure (Bp) is 150/100 mmHg (“最高血压达150/100 mmHg”) indicates a hypertensive patient). Explicitly mentioned risk factors and indirect expressions such as tests or treatments that can indicate the existence of risk factors are given equal status. Even indirect information (e.g., quantitative values from medical examinations) can be meaningful because it captures additional details about a patient’s condition. With the assistance of medical experts, we proposed to annotate a set of CVD risk factors that includes Overweight/Obesity (O2), Hypertension, Diabetes, Dyslipidemia, Chronic Kidney Disease (CKD), Atherosis, Obstructive Sleep Apnea Syndrome (OSAS), Smoking, Alcohol Abuse (A2), Family History of CVD (FHCVD), Age and Gender, and we identified the indicators of these risk factors. Table 1 lists all 12 types of risk factors with their indicators. The risk factors are in the left column and the indicators are on the right.

Table 1 CVD risk factors and their indicators

Notably, with the goal of being able to construct a timeline of CVD risk factors, we annotated all the quantitative values from medical examinations regardless of whether they exceeded the threshold (e.g., a patient whose Bp is 120/80 mmHg is also annotated for hypertension, even though the measurement is below the 140/90 mmHg criterion [5]). This was done so that all the test values would be extracted and we could later build a clear picture of changes in risk factors over time.
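The following minimal sketch illustrates the idea of keeping every Bp reading and only flagging, never discarding, the ones at or above the criterion. The regular expression, the function name `extract_bp` and the rule "systolic ≥ 140 or diastolic ≥ 90" are assumptions made for illustration; the corpus itself was annotated manually.

```python
import re

# Criterion quoted in the text above (140/90 mmHg). Treating a reading as
# above threshold when either value reaches its limit is an assumption.
SYSTOLIC_LIMIT, DIASTOLIC_LIMIT = 140, 90

# Matches readings such as "150/100 mmHg" or "120/80 mmHg".
BP_PATTERN = re.compile(r"(\d{2,3})\s*/\s*(\d{2,3})\s*mmHg")


def extract_bp(text):
    """Return every Bp reading found in `text`, flagged but never filtered."""
    readings = []
    for match in BP_PATTERN.finditer(text):
        systolic, diastolic = int(match.group(1)), int(match.group(2))
        readings.append({
            "span": match.span(),
            "systolic": systolic,
            "diastolic": diastolic,
            "above_threshold": (systolic >= SYSTOLIC_LIMIT
                                or diastolic >= DIASTOLIC_LIMIT),
        })
    return readings


readings = extract_bp("最高血压达150/100 mmHg。查体:Bp 120/80 mmHg")
```

Both readings are returned; only the first is flagged as above the threshold, so the under-threshold value remains available for a later timeline.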

Temporal attributes

To construct a health condition timeline, collecting temporal annotations is essential. Considering the time at which the indicators occurred and the characteristics of CEMRs, we proposed to divide the risk factors into four time-dependent categories: before the duration of hospital stay (DHS) (the risk factor occurred before the DHS), during the DHS, after the DHS, and continuing (the risk factor is continuous). For instance, a patient whose ordinary Bp is 130/90 mmHg (“平时血压130/90 mmHg”) is annotated with the “High Bp” indicator of hypertension with time before the DHS; physical examination: Bp 130/80 mmHg (“查体:Bp 130/80 mmHg”) indicates that the “High Bp” indicator of hypertension occurred during the DHS; doctor advice to a patient after discharge: to regulate blood glucose (“出院医嘱:调节血糖”) indicates that the diabetes indicator “Regulate Blood Glucose” occurred after the DHS; and obesity (“身材肥胖”), which usually does not change over the short term, is annotated as a “Mention” of O2 with time continuing. In this way, changes in risk factors can be clearly presented. For example, if no indicators of diabetes were present during the previous DHS, but evidence shows that the patient exhibited diabetic indicators before the next DHS, then the diabetes occurred between the two DHSs. Notably, age and gender were not included in the temporal annotations.


In contrast to the work of Stubbs and Uzuner [32], we proposed assertions for risk factors. For example, patient does not have a history of diabetes (“无糖尿病病史”) needs to be considered, because such text shows that the patient did not previously have diabetes. Based on whether the risk factor actually concerns the patient, we created two modifiers: associated or not associated with the patient. Further, risk factors associated with the patient are divided into three categories: present, absent and possible. Overall, assertions can be summarized as follows:

  • Present: the risk factor definitely occurred in the patient, e.g., the patient’s ordinary Bp is 130/90 mmHg (“平时血压130/90 mmHg”)

  • Absent: the risk factor was considered for the patient, but was negative, e.g., the patient does not have a history of diabetes (“无糖尿病病史”)

  • Possible: the risk factor may have occurred in the patient, e.g., Primary diagnosis: diabetes (“临床初步诊断:糖尿病”)

  • Not associated with the patient: the risk factor occurred in someone else, e.g., the patient’s brother has diabetes (“弟患糖尿病”)
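Taken together, the guidelines suggest that each annotation carries a risk factor, an indicator, a temporal attribute and an assertion. One possible in-memory representation is sketched below; the class and field names are hypothetical and do not describe the format produced by the annotation tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Time(Enum):
    """The four temporal attributes defined in the guidelines."""
    BEFORE_DHS = "before DHS"
    DURING_DHS = "during DHS"
    AFTER_DHS = "after DHS"
    CONTINUING = "continuing"


class Assertion(Enum):
    """The four assertion values defined in the guidelines."""
    PRESENT = "present"
    ABSENT = "absent"
    POSSIBLE = "possible"
    NOT_ASSOCIATED = "not associated with the patient"


@dataclass(frozen=True)
class RiskFactorAnnotation:
    risk_factor: str       # one of the 12 risk factors, e.g. "Hypertension"
    indicator: str         # e.g. "Mention" or "High Bp"
    text: str              # annotated span from the CEMR narrative
    assertion: Assertion
    time: Optional[Time]   # None for Age and Gender, which carry no time


examples = [
    # Time and assertion as given in the guideline examples above.
    RiskFactorAnnotation("Hypertension", "High Bp", "平时血压130/90 mmHg",
                         Assertion.PRESENT, Time.BEFORE_DHS),
    # Time as given in the guidelines; the "present" assertion is assumed.
    RiskFactorAnnotation("O2", "Mention", "身材肥胖",
                         Assertion.PRESENT, Time.CONTINUING),
]
```

Making the record immutable (`frozen=True`) also makes it hashable, which is convenient when two annotators' outputs are compared as sets.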

Annotation method

The annotation method involves: drafting the guidelines, training the annotators and updating the guidelines, and corpus construction. These tasks can be seen in Fig. 2.

Fig. 2

The flowchart for CVD risk factor annotation method

Drafting the guidelines

Based on the annotation guidelines discussed above, the linguists created a preliminary draft of the annotation guidelines that included all 12 CVD risk factors, indicators, temporal attributes, and assertions along with their definitions, as well as some positive annotations (expressions which should be marked) and negative annotations (expressions which should not be marked). Some sample annotation attempts were conducted under this draft using an annotation tool developed specifically for this task. Figure 3a and b show a sample annotation.

Fig. 3

a A sample annotation for CVD risk factors (original). b A sample annotation for CVD risk factors (English version)

Using the sample annotations, errors and inappropriate rules in the preliminary draft were corrected, and additional positive and negative examples were added to the draft. This process continued until no further modifications were needed; at that point, the specifications were considered to be suitable for the next workflow step.

Training the annotators and updating the guidelines

For domain annotation, annotators with specific knowledge backgrounds are desirable. Consequently, two Master’s students in medicine were employed and trained as annotators. The training process was iterative; each iteration can be summed up as follows:

Phase 1: A set of discharge summaries and progress notes for 15 randomly selected patients were provided to both annotators for labeling.

Phase 2: After completing the annotation, the IAA of the two annotated corpora was calculated to evaluate the degree to which the annotators were in agreement. For the IAA calculation, one annotated set is used as the gold standard, and the other is compared to the standard to compute the precision, recall, and F1-measure. Here, standard precision, recall, and F1-measure equations were adopted; their calculations are as follows:

$$ precision=\frac{Agreement\left(A1,A2\right)}{Annotation(A2)}, $$
$$ recall=\frac{Agreement\left(A1,A2\right)}{Annotation(A1)}, $$
$$ {F}_1=\frac{2\times precision\times recall}{precision+ recall}. $$

Here, we regarded the annotations of annotator A1 as the gold standard and evaluated the quality of annotator A2's annotations. Agreement(A1, A2) refers to the number of identical annotations made by the two annotators, and Annotation(A1) and Annotation(A2) refer to the number of annotations made by each annotator. More calculation details can be found in Hripcsak and Rothschild [41].
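The three equations can be sketched in a few lines of Python. The function name `iaa_scores`, the tuple layout of an annotation and the sample data are hypothetical; the sketch also assumes that two annotations agree only when they are identical in every field, which the text does not spell out.

```python
def iaa_scores(gold, other):
    """Precision, recall and F1 of `other` (A2) measured against `gold` (A1).

    Both arguments are sets of hashable annotations.
    """
    agreement = len(gold & other)                      # Agreement(A1, A2)
    precision = agreement / len(other) if other else 0.0
    recall = agreement / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations:
# (document, start, end, risk factor, indicator, time, assertion)
a1 = {
    ("note_01", 12, 30, "Hypertension", "High Bp", "before DHS", "present"),
    ("note_01", 45, 51, "Diabetes", "Mention", "continuing", "absent"),
    ("note_01", 60, 64, "O2", "Mention", "continuing", "present"),
    ("note_02", 5, 20, "Hypertension", "High Bp", "during DHS", "present"),
}
# A2 found three of the four annotations and nothing else.
a2 = {annotation for annotation in a1 if annotation[0] == "note_01"}

precision, recall, f1 = iaa_scores(a1, a2)
```

In this toy case every annotation of A2 is also in A1, so precision is 1.0, while recall is 3/4 because A2 missed one annotation.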

Phase 3: The two sets of annotations were compared and any uncertainties were discussed by both the linguists and the annotators. A voting method was used to reach a final agreement: the two annotators and three linguists discussed each uncertainty again, and the annotation with the most votes was adopted.
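The voting step amounts to a plurality vote among the five participants. A minimal sketch follows; the function name is hypothetical, and returning `None` on a tie (to signal that further discussion is needed) is an assumption, since the text does not say how ties were handled.

```python
from collections import Counter


def resolve_by_vote(votes):
    """Return the label chosen by the most voters, or None on a tie."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    # With five voters and more than two options, a 2-2-1 split is possible.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Two annotators and three linguists vote on one uncertain assertion.
decision = resolve_by_vote(["present", "present", "possible", "present", "absent"])
tie = resolve_by_vote(["present", "present", "absent", "absent", "possible"])
```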

Phase 4: The annotation guidelines were updated. In particular, errors found during phase 3 were added to the positive or negative examples and, when necessary, the guidelines were modified.

This procedure was repeated until the IAA calculated in Phase 2 remained consistently high. In total, five iterations were carried out; the resulting IAA values are listed in Table 2. Notably, a different set of 15 patients was selected in each iteration.

Table 2 IAA values achieved during the iterative training process

As Table 2 shows, the iterations obtained very high IAAs. All the F1-measure values were above 0.964 except for Iteration 1, in which the lower score was probably caused by the annotators’ initial unfamiliarity with the annotation guidelines and tools. Subsequent iterations obtained consistently high scores, indicating that the annotators and guidelines were ready for the corpus annotation.

Corpus construction

The annotators were asked to capture annotations from CEMRs for the 600 patients using the updated annotation guidelines. Moreover, to create a high quality annotated corpus, three measures were taken. First, the annotators could press a button on the annotation tool to indicate that they were unsure of the accuracy of the current annotation. These uncertainties were collected and discussed later. Second, a set of overlapping documents (the discharge summaries and progress notes of 25 patients) was distributed to both annotators. These twice-annotated records were used to calculate the IAA and to monitor the quality of the entire annotation process. Third, the linguists performed a random sampling check on the annotations (at least one third were selected). When problems were found, discussions were held and the guidelines were updated.


In total, the CVD risk factor annotated corpus, which comprises the discharge summaries and progress notes of all 600 patients, contains 9678 annotations associated with the 12 CVD risk factors. Of these, the “mention” indicator type accounts for 63.5%, while the “drug” indicator type is rare (due to our restriction in the annotation guidelines that medication must be confirmed to be treated as a risk factor). Among the risk factors, hypertension is prominent in CEMRs, with 3729 annotations. Age and gender annotations occur at the same rate as Bp because they are basic patient attributes that are routinely recorded before diagnosis. The distributions of the four assertions (present, absent, possible, and not associated with patient) were 69.7, 19.2, 11.0, and 0.1%, respectively. The “present” assertion type occurs most often, probably because positive descriptions carry more significance when medical records are written.

Annotation quality and analysis

Reasonably, the IAA values for the final corpus should be as high as the IAA values obtained during the training process due to the work performed before and during the formal annotation to guarantee sufficient quality. The final IAA calculations resulted in 0.971 for precision, 0.965 for recall and 0.968 for the F1-measure; these values demonstrate the high quality of the corpus.
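As a consistency check, substituting the reported precision and recall into the F1 equation used in Phase 2 reproduces the reported F1-measure:

$$ {F}_1=\frac{2\times 0.971\times 0.965}{0.971+0.965}=\frac{1.87403}{1.936}\approx 0.968. $$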

Table 3 shows the distribution of risk factors, indicators, temporal attributes and assertions. Each row in the table shows the distribution of a single indicator over the entire corpus in different time and assertion partitions.

Table 3 Distribution of CVD risk factors, indicators, their occurrence times, and assertions

For hypertension, “high Bp” is the most common indicator and “mention” is second. Approximately 2 “mention” and 3 “high Bp” indicators appear in each patient’s records. The “mention” annotations are mostly continuing and usually fixed over short durations, while “high Bp” annotations tend to occur during DHSs because Bp measurements are taken during physical examinations. Meanwhile, 78.7% of the “regulate Bp” annotations occur during DHSs because controlling Bp is a standard treatment for hypertensive individuals.

Compared with “high Bp”, annotations identifying “high blood glucose” are far less common because of the complicated testing technique for blood glucose. In Table 3, the “mention” annotations comprise 86.7% of all the diabetes annotations, and 64.1% of these “mention” annotations are “continuing, absent.” This is consistent with our earlier observation that denying a history of hypertension and diabetes (“否认高血压、糖尿病病史”) occurs frequently in the records.

The most prominent indicator among the dyslipidemia annotations is “regulate blood lipids”, because to regulate blood lipids (“调节血脂”) is a typical expression in the assessment and plan section of the records. For the same reason, this indicator is clustered around “during DHS, present”.

When considering CKD, OSAS and FHCVD, the only indicator “mention” occurs infrequently. It occurs the most with CKD but still fewer than 26 times. Meanwhile, there is no “absent” assertion for any of these annotations.

For Atherosis, the expression stabilizing coronary atheromatous plaque (“稳定冠脉粥样斑块”) accounts for a relatively high number of annotations in our corpus and occurs repeatedly in the assessment and plan sections of CEMRs.

The number of smoking annotations is relatively high (508 annotations among the records of 600 patients). “Mention” and “smoking amount” account for almost all the occurrences. The 380 “mention” and 119 “smoking amount” annotations carry the “continuing” temporal attribute because tobacco use is generally a habit, and quitting rarely occurs over a short period.

There are numerous references to alcohol in the narratives of CEMRs, such as denying the history of smoking and drinking (“否认吸烟、饮酒史”) and a history of intermittent small amounts of alcohol (“间断少量饮酒史”). However, these were not tagged as alcohol abuse because the patient’s intake is none or slight. In contrast, serious usage has only 76 “mention” and 19 “drinking amount” annotations in our corpus, and nearly all of those are continuing.

Age and gender are basic information that appears frequently in CEMRs. In the narratives we examined, most “mention” annotations for age and gender occur in the hospitalization information section of discharge summaries and in the complaint, case characteristics and diagnosis basis sections of progress notes. Occasionally, “age group” occurs in the case characteristics and diagnosis basis sections.


With the aim of building an IE system for automatically monitoring risk factors to avoid CVD, we developed a corpus of CVD risk factor annotations including indicators, temporal attributes and assertions based on CEMRs. Linguists and clinicians cooperated throughout the entire corpus construction process, from drafting annotation guidelines to discussing disagreements. The final IAA values achieved for this corpus reflect its high quality.

In our corpus, a test value was annotated whether a test outcome was above or below a standard threshold. This was intentional and was designed to build a complete record of risk factor variation over the long term. For hypertension, we annotated all the Bp values regardless of whether they were above the standard 140/90 mmHg. Based on these annotations, when supervising an individual’s long-term health condition, the trained IE system can extract all the Bp conditions with no omissions and can then provide a visualization of the Bp variations over time. An appropriate warning or intervention treatment could then be applied at critical variation points.

Annotations of therapeutic methods such as blood glucose regulation and hypoglycemic drug administration support treatment recommendations. The trained IE system, after extracting the information from large numbers of CEMRs, can provide clinicians with reference treatments when it finds a similar condition in a current patient. Along a time dimension, these extractions can provide a clear assessment of treatment effects. For example, a record in which a need to regulate glucose was recognized but glucose treatments were later stopped could show that the regulating treatment had a positive effect. This is significant because it helps clinicians who treat similar patients to decide which treatment would be better.

The annotations of O2, hypertension, diabetes, dyslipidemia, CKD, and Atherosis can play a role in managing these chronic diseases. An IE system can extract an individual’s characteristics, such as examination values and medications. Then, a long-term monitoring platform can monitor variations in these characteristics, provide feedback on treatment effects, and predict disease tendencies.


This paper describes the construction of an annotated corpus of CVD risk factors in CEMRs. To the best of our knowledge, this is the first Chinese corpus that concerns risk factors for CVD. We engaged both clinicians and annotators to draft guidelines and annotate the medical records. We proposed an annotation method that results in high quality annotations; the presented IAA values indicate the high quality of the resulting corpus. These document-level risk factor annotations, along with the included temporal attributes and assertions, can be utilized in future studies of risk factor progression and their relationships with CVD over time. This corpus can play a significant role in developing a future IE system that can extract CVD risk factors from CEMRs to build a clear picture of individuals’ CVD risk factors and conditions, and it makes it possible to develop a monitoring platform to supervise the progression of risk factors and CVD. Globally, nearly 1 billion people have high blood pressure; in 2008, diabetes was responsible for 1.3 million deaths; high cholesterol was estimated to have caused 2.6 million deaths; and at least 2.8 million people die each year as a result of being overweight or obese [5]. Consequently, a monitoring platform that aids in managing these chronic diseases can significantly reduce the suffering they cause to many patients and, thus, reduce CVD. The related annotation resources are publicly available at



Alcohol Abuse


Blood pressure


Chinese electronic medical records


Chronic Kidney Disease


Cardiovascular disease


Duration of hospital stay


Electronic medical records


Family History of CVD


Integrating Biology and the Bedside


Inter-annotator agreement


Intensive Care Unit


Information extraction


Natural language processing




Optical character recognition


Obstructive Sleep Apnea Syndrome




Random Blood Glucose




Extensible Markup Language.


  1. World Health Organization. Cardiovascular diseases (CVDs). 2016. Accessed 25 Aug 2016.
  2. Hu S, Gao R, Liu L, et al. Report on cardiovascular disease in China 2014. Beijing: Encyclopedia of China Publishing House; 2015. p. 1–184.
  3. Armen YG. Cardiovascular risk factors. In: Melek ZU, editor. Cardiovascular risk factors in the elderly. Rijeka: InTech; 2012. p. 81–102.
  4. Rothstein WG. Public health and the risk factor: a history of an uneven medical revolution. Rochester: University of Rochester Press; 2003.
  5. World Heart Federation. Cardiovascular disease risk factors. 2016. Accessed 7 Aug 2017.
  6. Hannan TJ. Electronic medical records. In: Hovenga EJS, Kidd MR, Garde S, Cossio CHL, editors. Health informatics: an overview. Amsterdam: IOS Press; 1996. p. 133–48.
  7. Yang J, Qiubin Y, Guan Y, Jiang Z. An overview of research on electronic medical record oriented named entity recognition and entity relation extraction. Acta Automat Sin. 2014;40:1537–62.
  8. The basic norms of medical records writing. Accessed 17 Oct 2016.
  9. The basic norms of electronic medical records. Accessed 17 Oct 2016.
  10. Feng Y, Ying-Ying C, Gen-Gui Z, Wen LH, Ying L. Intelligent recognition of named entity in electronic medical records. Chinese Journal of Biomedical Engineering. 2011;30:256–62.
  11. Yang J, Guan Y, He B, Qu C, Yu Q, Liu Y, et al. Annotation scheme and corpus construction for named entities and entity relations on Chinese electronic medical records. J Softw. 2016;27:1–22.
  12. Qu C, Guan Y, Yang J, Liu Y. The construction of annotated corpora of named entities for Chinese electronic medical records. Chinese High Technol Lett. 2015;25:143–50.
  13. Jiang Z, Zhao F, Guan Y. Developing a linguistically annotated corpus of Chinese electronic medical record. In: 2014 IEEE international conference on bioinformatics and biomedicine (BIBM). Belfast; 2014. p. 307–10.
  14. Jiang Z, Zhao F, Guan Y, Yang J. Research on Chinese electronic medical record oriented lexical corpus annotation. High Technol Lett. 2014;24:609–15.
  15. Wang Y, Yu Z, Chen L, Chen Y, Liu Y, Hu X, et al. Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study. J Biomed Inform. 2014;47:91–104.
  16. Wang Y, Yu Z, Jiang Y, Xu K, Chen X. Automatic symptom name normalization in clinical records of traditional Chinese medicine. BMC Bioinformatics. 2010;11:40.
  17. Lei J. Named entity recognition in Chinese clinical text (doctoral dissertation). Houston: University of Texas School of Biomedical Informatics at Houston; 2014. Accessed 7 Aug 2017.
  18. Lei J, Tang B, Lu X, Gao K, Jiang M, Xu H. A comprehensive study of named entity recognition in Chinese clinical text. J Am Med Inform Assoc. 2014;21:808–14.
  19. Xu Y, Wang Y, Liu T, Liu J, Fan Y, Qian Y, et al. Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries. J Am Med Inform Assoc. 2014;21:e84–92.
  20. Wang H, Zhang W, Zeng Q, Li Z, Feng K, Liu L. Extracting important information from Chinese operation notes with natural language processing methods. J Biomed Inform. 2014;48:130–6.
  21. He B, Dong B, Guan Y, Yang J, Jiang Z, Yu Q, et al. Building a comprehensive syntactic and semantic corpus of Chinese clinical texts. arXiv preprint arXiv:1611.02091. 2016.
  22. Stubbs A. A methodology for using professional knowledge in corpus annotation. Dissertation. Brandeis University; 2013.
  23. Stubbs A, Uzuner O, Kumar V, Shaw S. Annotation guidelines: risk factors for heart disease in diabetic patients. i2b2/UTHealth NLP Challenge. 2014:1–9.
  24. Uzuner O, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15:14–24.
  25. Uzuner O. Recognizing obesity and comorbidities in sparse data. J Am Med Inform Assoc. 2009;16:561–70.
  26. Uzuner O, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc. 2010;17:514–8.
  27. Uzuner O, Solti I, Xia F, Cadag E. Community annotation experiment for ground truth generation for the i2b2 medication challenge. J Am Med Inform Assoc. 2010;17:519–23.
  28. Uzuner O, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc. 2011;18:552–6.
  29. Sun W, Rumshisky A, Uzuner O. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. J Am Med Inform Assoc. 2013;20:806–13.
  30. Sun W, Rumshisky A, Uzuner O. Annotating temporal information in clinical narratives. J Biomed Inform. 2013;46:S5–S12.
  31. Stubbs A, Kotfila C, Xu H, Uzuner O. Identifying risk factors for heart disease over time: overview of 2014 i2b2/UTHealth shared task track 2. J Biomed Inform. 2015;58:S67–77.
  32. Stubbs A, Uzuner O. Annotating risk factors for heart disease in clinical narratives for diabetic patients. J Biomed Inform. 2015;58:S78–91.
  33. i2b2 2016 CEGS N-GRID shared tasks and workshop on challenges in natural language processing for clinical data. Accessed 28 Oct 2015.
  34. Suominen H, Salanterä S, Velupillai S, Chapman WW, Savova G, Elhadad N, et al. Overview of the ShARe/CLEF eHealth evaluation lab 2013. In: Forner P, Müller H, Paredes R, Rosso P, Stein B, editors. 4th international conference of the CLEF initiative, CLEF 2013. Valencia; 2013. p. 212–31.
  35. Pradhan S, Elhadad N, South BR, Martinez D, Christensen L, Vogel A, et al. Evaluating the state of the art in disorder recognition and normalization of the clinical narrative. J Am Med Inform Assoc. 2015;22:143–54.
  36. Styler WF IV, Bethard S, Finan S, Palmer M, Pradhan S, de Groen PC, et al. Temporal annotation in the clinical domain. Transactions of the Association for Computational Linguistics. 2014;2:143–54.
  37. Elhadad N, Pradhan S, Chapman W, Manandhar S, Savova G. SemEval-2015 task 14: analysis of clinical text. In: Proc of Workshop on Semantic Evaluation Association for Computational Linguistics. Denver; 2015. p. 303–10.
  38. Meystre SM, Kim Y, Gobbel GT, Matheny ME, Redd A, Bray BE, et al. Congestive heart failure information extraction framework for automated treatment performance measures assessment. J Am Med Inform Assoc. 2016; doi:10.1093/jamia/ocw097.
  39. Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc. 2016; doi:10.1093/jamia/ocv180.
  40. Tesseract. 2016. Accessed 28 Oct 2015.
  41. Hripcsak G, Rothschild AS. Agreement, the f-measure, and reliability in information retrieval. J Am Med Inform Assoc. 2005;12:296–8.



We would like to thank the medical records department of the Second Affiliated Hospital of Harbin Medical University for providing the electronic medical records. We would also like to thank the clinicians, Qiubin YU and Yongjie Zhao, and the annotators, Hao Wu and Na Feng, for their excellent work.


This study was supported by the National Natural Science Foundation of China (Grant No. 71531007).

Availability of data and materials

This study is a secondary analysis of existing data from the Medical Records Room of the Second Affiliated Hospital of Harbin Medical University, used with the permission of Qiubin YU, a deputy chief physician at the department. The datasets used and analysed during the current study are available from the corresponding author on reasonable request. Preprocessed CEMR samples are available at

Author information




JS, BH and JY designed the tasks and conducted the entire project. JS, BH and JJ developed the annotation guidelines and collaborated in constructing the corpus. JS and BH distributed the data, collected the annotations, analyzed the corpus quality and communicated with the doctors. All the authors contributed to the final manuscript.

Corresponding author

Correspondence to Yi Guan.

Ethics declarations

Ethics approval and consent to participate

We obtained ethics approval and consent from the Medical Ethics Committee of the Second Affiliated Hospital of Harbin Medical University. The approval is available at

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Su, J., He, B., Guan, Y. et al. Developing a cardiovascular disease risk factor annotated corpus of Chinese electronic medical records. BMC Med Inform Decis Mak 17, 117 (2017).



  • Cardiovascular disease risk factors
  • Chinese electronic medical records
  • Annotation
  • Corpus construction
  • Information extraction