Qcorp: an annotated classification corpus of Chinese health questions
© The Author(s). 2018
Published: 22 March 2018
Health question-answering (QA) systems have become a typical application scenario of artificial intelligence (AI). An annotated question corpus is a prerequisite for training machines to understand the health information needs of users. We therefore aimed to develop an annotated classification corpus of Chinese health questions (Qcorp) and make it openly accessible.
We developed a two-layered classification schema and corresponding annotation rules on the basis of our previous work. Using the schema, we annotated 5000 questions that were randomly selected from 5 Chinese health websites within 6 broad sections. Eight annotators participated in the annotation task, and inter-annotator agreement was evaluated to ensure the quality of the corpus. Furthermore, the distribution and relationships of the annotated tags were measured using descriptive statistics and a social network map.
The questions were annotated with 7101 tags that cover 29 topic categories in the two-layered schema. In the released corpus, the share of questions tagged with each top-layer category was 64.22% for treatment, 37.14% for diagnosis, 14.96% for epidemiology, 10.38% for healthy lifestyle, and 4.54% for health provider choice; because a question can carry more than one tag, these shares sum to more than 100%. Both the annotated health questions and the annotation schema are openly accessible on the Qcorp website. Users can download the annotated Chinese questions in CSV, XML, and HTML formats.
We developed a Chinese health question corpus of 5000 manually annotated questions. It is openly accessible and should contribute to the development of intelligent health QA systems.
Seeking health-related information is one of the top activities of today’s online users on both personal computers and mobile devices. In 2012, 59% of U.S. adults looked online for health information. China had 194.76 million Internet health users in 2016, an increase of 28.0% over 2015, and this number will be further stimulated by the development of Internet and communication technologies, as well as by China’s “Internet Plus” and “Health Big Data” policies [4, 5]. Despite this widespread need, search engines often fail to return relevant and trustworthy health information [6, 7]. Automatic question answering (QA) systems, which comprehend questions asked by users in natural language and respond with concise and correct answers using natural language processing techniques, are a promising way to solve this problem. Accordingly, several efforts have explored automatic QA systems in the health and medical domain in recent years [9–14]. However, building such systems is challenging [15–17]. One of the main challenges is the lack of large-scale corpora of annotated questions from which machines can learn to extract and understand the main information needs expressed in questions, a step known as question processing; this lack directly affects the performance of a QA system.
Because annotated questions play a significant role in QA system research and development, several studies have focused on this task and collected useful corpora. For example, the National Library of Medicine of the United States has collected a total of 4,654 annotated clinical questions through a series of studies [20–25]. This corpus has been used to train machines to automatically classify question types, distinguish answerable from unanswerable questions, recognize question entailment, extract keywords from questions, and separate consumer questions from clinical questions. Other groups have annotated several small-scale corpora of health care-related questions in English in order to automatically identify questions that can be answered by specific EMR notes, analyze the demographic, cognitive, affective, situational, and social environmental information about the user that is implied in the questions, classify the types of consumer health questions [32–35], and extract structured information from EHR-related ICU questions. These previous studies on English question corpora provide useful references for the development of Chinese health question corpora.
Several studies on the development of Chinese health question corpora have been conducted. Yin JW annotated 1,600 Chinese questions related to maternal and infant health care with 8 topics in order to conduct automatic question classification. Zhang N annotated 4,465 Chinese questions related to skin diseases with a self-developed two-layered classification schema in order to automatically classify question topics and help compute their semantic similarity. Tang GY manually classified 1,688 questions related to hyperlipidemia into 241 categories in order to compute their semantic similarity. Compared with these studies, our corpus has three distinguishing features: (1) its annotation schema covers a wide range of health topics; (2) the annotated questions cover a broad diversity of diseases; and (3) the corpus is openly accessible and easily reusable. Our work should help the development of intelligent systems for Chinese health QA.
In this paper, we present the Qcorp database, which collects annotated health care-related questions in Chinese, building on our previous work [40, 41]. In the current release, Qcorp contains 5000 consumer health questions in Chinese that were annotated with 7101 tags by 8 annotators using a two-layered classification schema consisting of 29 topic categories. An empirical study that we conducted showed that the corpus is useful for training machines to automatically assign topics to consumer health questions. We have made the current Qcorp publicly available and will enrich it in future work and collaborations so that the corpus becomes more useful and applicable in various scenarios.
Sources of the 3000 questions in data set 2 (surviving table rows: Obstetrics & Gynecology; Traditional Chinese Medicine)
Here, a “question” is defined as a request on a certain subject posted by a consumer via the Internet to elicit answers from physicians or a patient support group; questions were identified based on meaning, not form. We manually discarded uncomplicated data, repeated data, and irrelevant data, such as advertisements, health education content, patients’ experiences, and other non-health content. When a question was excluded, we randomly selected another question from the same website and the same section to keep the sample balanced.
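The replacement rule described above can be sketched as follows. This is a minimal illustration, not the procedure the authors actually ran: the function name `draw_sample`, the `is_valid` predicate, and the toy pool are all hypothetical. The idea is to visit the candidates of one website and section in random order and to skip excluded ones, so that every excluded question is replaced by another random question from the same pool.

```python
import random


def draw_sample(pool, is_valid, k, seed=0):
    """Draw up to k valid questions from one website/section pool.

    Candidates are visited in random order; an excluded candidate is
    simply replaced by the next random candidate from the same pool.
    """
    rng = random.Random(seed)
    candidates = list(pool)
    rng.shuffle(candidates)
    kept = []
    for question in candidates:
        if len(kept) == k:
            break
        if is_valid(question):
            kept.append(question)
    return kept


# Hypothetical pool for a single section; "AD:" marks an advertisement.
pool = ["q1", "AD: buy now", "q2", "q3", "AD: discount", "q4"]
sample = draw_sample(pool, lambda q: not q.startswith("AD:"), k=3)
```

Because replacements always come from the same pool, the number of questions kept per website and section stays fixed as long as the pool holds enough valid questions.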
Because the wide variety of consumer health questions can be represented by a limited set of topics and keywords, question classification plays an important role in an automatic QA system: it identifies the information needs of consumers and thereby improves the accuracy of the returned answers. We therefore manually annotated the general topics of 5000 Chinese health care-related questions posted by consumers via the Internet, with the aim of building a high-quality annotated corpus for question classification and further promoting the research and development of intelligent Chinese health QA systems.
We recruited eight annotators, half of whom have a medical education background and half of whom specialize in medical informatics. The 2000 hypertension-related questions were annotated by five annotators in our previous work. We translated their tags into two-layered tags according to the annotation guidelines of this study. For the remaining 3000 questions (covering internal medicine, surgery, obstetrics & gynecology, pediatrics, infectious diseases, and traditional Chinese medicine), annotation was performed in 3 rounds. In round 1, a training set of 300 randomly selected questions was annotated by four annotators independently in order to refine the annotation guidelines; ambiguous questions were settled by specifying the annotation rules and the question patterns. The four annotators were then divided into two groups. In round 2, a testing set of 600 questions randomly selected from the sample was split between the two groups, 300 questions each, and each annotator annotated independently so that inter-annotator agreement could be measured. In round 3, each question in a development set of the remaining 2100 questions was annotated independently by two of the four annotators. Disagreements were discussed until agreement was reached.
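The three-round design can be sketched in code. The split sizes (300, 600, 2100) and the two groups of two annotators come from the text; everything else is an assumption made for illustration, including the function name `plan_rounds`, the annotator labels, and in particular the round-robin rule used to pick the two annotators for each round 3 question, which the text does not specify.

```python
import random
from itertools import combinations


def plan_rounds(question_ids, annotators, seed=0):
    """Split 3000 questions into the three annotation rounds."""
    rng = random.Random(seed)
    ids = list(question_ids)
    rng.shuffle(ids)

    # Round 1: 300 questions, annotated by all four annotators.
    training = ids[:300]

    # Round 2: 600 questions, 300 per group of two annotators.
    testing = ids[300:900]
    group_a, group_b = tuple(annotators[:2]), tuple(annotators[2:])
    round2 = {group_a: testing[:300], group_b: testing[300:]}

    # Round 3: remaining 2100 questions, two annotators each.
    # Assumption: pairs are assigned round-robin over all six pairs.
    development = ids[900:]
    pairs = list(combinations(annotators, 2))
    round3 = {q: pairs[i % len(pairs)] for i, q in enumerate(development)}
    return training, round2, round3


training, round2, round3 = plan_rounds(range(3000), ["A1", "A2", "A3", "A4"])
```

Any other pairing rule would fit the description equally well, as long as each round 3 question is seen by exactly two different annotators.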
Inter-annotator agreement analysis
Where M is the number of questions with matching tags, and A is the total number of annotated questions.
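Given these definitions, the agreement score can be read as the ratio M / A. The sketch below assumes exactly that, and it further assumes that two annotations "match" when both annotators assigned an identical set of tags to a question; neither the function name `percent_agreement` nor this matching rule is taken from the original text.

```python
def percent_agreement(tags_a, tags_b):
    """Return M / A for two annotators' tags over the same questions.

    M counts questions whose tag sets are identical (an assumption),
    A is the number of annotated questions.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must cover the same questions")
    if not tags_a:
        raise ValueError("no annotated questions")
    matched = sum(1 for a, b in zip(tags_a, tags_b) if set(a) == set(b))
    return matched / len(tags_a)


# Toy example: the annotators disagree on the second of four questions.
annotator_1 = [{"treatment"}, {"diagnosis", "treatment"}, {"epidemiology"}, {"treatment"}]
annotator_2 = [{"treatment"}, {"treatment"}, {"epidemiology"}, {"treatment"}]
agreement = percent_agreement(annotator_1, annotator_2)  # 3 / 4 = 0.75
```

Unlike chance-corrected measures such as kappa, this ratio does not discount agreement that would occur by chance.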
Database framework and web interface
Data stored in the Qcorp database are managed using MySQL. A social network map and descriptive statistics were used to calculate and visualize the distribution and relationships of the annotated tags. The Qcorp web server was developed in Java. The Qcorp database is freely available at http://www.phoc.org.cn/healthqa/qcorp/.
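To make the two kinds of statistics concrete, the sketch below computes the share of questions carrying each tag and the number of questions on which each pair of tags co-occurs; such pair counts are one plausible way to weight the edges of a tag network map. The function name `tag_statistics` and the toy annotations are hypothetical, and the code is independent of the MySQL and Java stack described above.

```python
from collections import Counter
from itertools import combinations


def tag_statistics(annotations):
    """Return per-tag question shares and tag co-occurrence counts."""
    total = len(annotations)
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in annotations:
        unique = sorted(set(tags))
        tag_counts.update(unique)
        # Each unordered pair of tags on one question is one co-occurrence.
        pair_counts.update(combinations(unique, 2))
    shares = {tag: count / total for tag, count in tag_counts.items()}
    return shares, pair_counts


# Toy multi-label annotations, one list of tags per question.
annotations = [
    ["treatment"],
    ["diagnosis", "treatment"],
    ["epidemiology"],
    ["treatment", "diagnosis"],
]
shares, pair_counts = tag_statistics(annotations)
```

In this toy example the shares add up to 1.5 rather than 1.0, which mirrors the multi-label situation in the corpus: a question with two tags is counted once under each of them.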
Annotated tag distribution on the first layer
Annotated tag distribution on the second layer
The inter-annotator agreement among the four annotators on the training set (300 questions) of data set 2 was 0.67 in round 1. After the disparities were discussed and the annotation rules and question patterns for each category were further specified, the inter-annotator agreement of the two groups on the testing set (300 questions per group) in round 2 increased to 0.88 and 0.92. After further discussion to resolve the disparities, the average inter-annotator agreement among the four annotators on the development set (2100 questions in total, each annotated independently by two annotators) in round 3 increased to 0.96. The average inter-annotator agreement among the five annotators on data set 1 (each question was annotated by at least two annotators) on the second layer of the classification schema was 0.95.
Corpus access and usage
Case application of Qcorp corpus
Using the 2000 annotated questions in data set 1 as the corpus, we applied a machine-learning method to automatically classify each question into one of the five topics on the first layer of the classification schema. The Chinese questions were represented as a set of lexical, grammatical, and semantic features, which were weighted and selected following a previously published method. Lexical features include bag-of-words and part-of-speech; grammatical features include interrogative words and their corresponding chunks; semantic features include Chinese Medical Subject Headings concepts and semantic types, among others. The classification achieved F1-scores of 99.13%, 98.55%, 96.35%, 76.02%, and 71.77% for the topics Healthy Lifestyle, Diagnosis, Health Provider Choice, Treatment, and Epidemiology, respectively (more details are given in our earlier empirical study). This demonstrates that these annotated Chinese questions are suitable for training machines to automatically classify the topics of questions posted by health consumers, which in turn facilitates answer generation.
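For readers less familiar with the metric, the per-topic F1-score is the harmonic mean of precision and recall computed separately for each topic. The sketch below shows the calculation on toy labels; the function name `f1_per_class` and the data are hypothetical and do not reproduce the experiment or the scores reported above.

```python
def f1_per_class(gold, predicted):
    """Return the F1-score of each label, given gold and predicted labels."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


# Toy example with three first-layer topics.
gold = ["Treatment", "Treatment", "Diagnosis", "Diagnosis", "Epidemiology"]
predicted = ["Treatment", "Diagnosis", "Diagnosis", "Diagnosis", "Epidemiology"]
scores = f1_per_class(gold, predicted)
```

Reporting F1 per topic, as the text does, makes it visible when frequent and rare topics are classified with different reliability, which a single overall accuracy figure would hide.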
The Internet is increasingly becoming one of the main resources from which consumers acquire health information. Automatic QA systems that can correctly answer users’ questions in natural language are a promising way to fulfill this need. A shared corpus of annotated consumer health questions in Chinese is a prerequisite for training a health QA system to understand the information needs of Chinese consumers. We therefore developed the Qcorp database, which collects annotated health care-related questions in Chinese. Qcorp currently contains 5000 consumer health questions in Chinese that are annotated with 7101 tags by 8 annotators using a two-layered classification schema consisting of 29 topic categories. An empirical study showed that the corpus is suitable for training machines to automatically assign topics to Chinese consumer health questions.
Comparison with other related works
A comparison of works on the corpus building of health and medical questions
Corpus or Author name
NLM collected clinical questions 
Patrick J 
Zhang Y 
Roberts K 
Genetic and rare diseases
McRoy S
Yin JW 
1 health APP
Maternal and infant health
Zhang N 
1 website, books, self-composed
Tang GY 
6 broad sections
Limitations and future studies
The Chinese health question corpus introduced here is annotated only with general topics and is still far from precisely representing the health information needs of the askers that are contained in the questions. Much work remains to reveal more detailed information about Chinese consumer health questions in a structured manner. Our next step is to annotate the named entities and their relationships expressed in the Chinese consumer health questions. We hope that this database, developed mainly for Chinese consumer health questions, can serve as an important resource for the research and development of intelligent Chinese health QA systems.
We developed a corpus of 5000 Chinese consumer health questions manually annotated using a two-layered classification schema. The corpus, named Qcorp, is openly accessible, with the annotated questions available in CSV, XML, and HTML formats, which can be easily used to train machines to understand consumers’ health questions in Chinese. To our knowledge, among annotated classification corpora of Chinese health questions, Qcorp currently covers a relatively broad diversity of diseases and draws on multiple sources. Our study should help the development of Chinese health QA systems.
This study was supported by the Key Laboratory of Medical Information Intelligent Technology, Chinese Academy of Medical Sciences, the Basal Research Fund from Chinese Academy of Medical Sciences (Grant No. 2016ZX330011), the National Key Research and Development Program of China (Grant No. 2016YFC0901901), the National Population and Health Scientific Data Sharing Program of China, and the Key Laboratory of Knowledge Technology for Medical Integrative Publishing.
The publishing costs for this manuscript were provided by the Key Laboratory of Medical Information Intelligent Technology, Chinese Academy of Medical Sciences.
Availability of data and materials
The datasets built in this study are freely available at the Qcorp website: http://www.phoc.org.cn/healthqa/qcorp/.
About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 18 Supplement 1, 2018: Proceedings from the 3rd China Health Information Processing Conference (CHIP 2017). The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-18-supplement-1.
Jiao Li conducted the Chinese health question classification studies. Haihong Guo and Jiao Li designed the experiment and analyzed the results. Haihong Guo and Xu Na built the classification schema, collected the Chinese consumer health questions, organized the annotation, and did the corpus quality control. Haihong Guo designed the Qcorp website, and Xu Na developed it with the help of an engineer. All the authors wrote and revised the manuscript, and all the authors have read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Fox S, Duggan M.: Health Online 2013. Pew Research Center Internet & Technology, January 2013, http://www.pewinternet.org/2013/01/15/health-online-2013/, last accessed 21 July 2017.
- China Internet Network Information Center.: China statistical report on internet development, http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/hlwtjbg/201701/P020170123364672657408.pdf, last accessed 2017/08/20.
- China Internet Network Information Center.: China statistical report on internet development, http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/201601/P020160122469130059846.pdf, last accessed 2017/08/20.
- State Council of the People’s Republic of China.: Guidance of promoting the “Internet Plus” action, http://www.gov.cn/zhengce/content/2015-07/04/content_10002.htm, last accessed 2017/08/20.
- Office of State Council of the People’s Republic of China.: Guidance of promoting and regulating the application and development of health and medical big data, http://www.gov.cn/zhengce/content/2016-06/24/content_5085091.htm, last accessed 2017/08/20.
- Pletneva N, Vargas A, Kalogianni K, Boyer C. Online health information search: what struggles and empowers the users? Results of an online survey. Stud Health Technol Inform. 2012;180:843–7.
- Scantlebury A, Booth A, Hanley B. Experiences, practices and barriers to accessing health information: a qualitative study. International Journal of Medical Informatics. 2017;103:103–8.
- Wren J D.: Question Answering Systems in biology and medicine – the time is now. Bioinformatics, 27(14): 2025-2026 (2011).
- Cao Y, Liu F, Simpson P, et al. AskHERMES: An online question answering system for complex clinical questions. Journal of Biomedical Informatics. 2011;44(2):277–88.
- Cairns BL, Nielsen RD, Masanz JJ, et al. The MiPACQ clinical question answering system. AMIA Annu Symp Proc. 2011:171–80.
- Ni Y, Zhu H, Cai P, et al. CliniQA: highly reliable clinical question answering system. Stud Health Technol Inform. 2012;180:215–9.
- Hristovski D, Dinevski D, Kastrin A, Rindflesch TC. Biomedical question answering using semantic relations. BMC Bioinformatics. 2015;16:6.
- Asiaee AH, Minning T, Doshi P, Tarleton RL. A framework for ontology-based question answering with application to parasite immunology. J Biomed Semantics. 2015;6:31.
- Liu F, Tur G, Hakkani-Tür D, Yu H. Towards spoken clinical-question answering: evaluating and adapting automatic speech-recognition systems for spoken clinical questions. J Am Med Inform Assoc. 2011;18:625–30.
- Bauer M, Berleant D. Usability survey of biomedical question answering systems. Human Genomics. 2012;6(1):17–20.
- Olivera-Lobo MD, Gutiérrez AJ. Evaluation of open- vs. restricted-domain question answering systems in the biomedical field. Journal of Inf Science. 2011;37(2):152–62.
- Cruchet S, Boyer C, van der Plas L. Trustworthiness and relevance in web-based clinical question answering. Stud Health Technol Inform. 2012;180:863–7.
- Zhang N, Zhu L.: A review of Chinese Q & A system questions. Technology Intelligence Engineering, 01-42 (2016).
- Clinical question collection. https://clinques.nlm.nih.gov/, last accessed 2017/08/23.
- Gorman P, Ash J, Wykoff L. Can primary care physician's questions be answered using the medical journal literature? Bull Med Libr Assoc. 1994;82:140–6.
- Ely JW, Osheroff JA, Ebell MH, Bergus GR, Levy BT, Chambliss ML, et al. Analysis of questions asked by family doctors regarding patient care. Br Med J. 1999;319(7206):358–61.
- Ely JW, Osheroff JA, Gorman PN, Ebell MH, Chambliss ML, Pifer EA, et al. A taxonomy of generic clinical questions: classification study. British Medical Journal. 2000;321:429–32.
- Alper B, Stevermer J, White D, Ewigman B. Answering family physicians' clinical questions using electronic medical databases. J Fam Pract. 2001;50:960–5.
- Niu Y, Hirst G, Mcarthur G, et al.: Answering clinical questions with role identification. Meeting of the Association for Computational Linguistics, 73-80 (2003). http://www.aclweb.org/anthology/W/W03/W03-1310.pdf, last accessed 23 Aug 2017.
- Alessandro DM, Kreiter CD, Peterson MW. An evaluation of information seeking behaviors of general pediatricians. Pediatrics. 2004;113:64–9.
- Cao Y, Cimino JJ, Ely J, Yu H. Automatically extracting information needs from complex clinical questions. J Biomed Inform. 2010;43(6):962–71.
- Yu H, Sable C, Zhu HR. Classifying medical questions based on an evidence taxonomy. In Workshop of AAAI, (2005). https://www.aaai.org/Papers/Workshops/2005/WS-05-10/WS05-10-005.pdf, last accessed 25 Aug 2017.
- Abacha AB, Dina DF. Recognizing question entailment for medical question answering. AMIA Annu Symp Proc, 310-318 (2016).
- Liu F, Antieau LD, Yu H. Toward automated consumer question answering: Automatically separating consumer questions from professional questions in the healthcare domain. Journal of biomedical informatics. 2011;44(6):1032–8.
- Patrick J, Li M. An ontology for clinical questions about the contents of patient notes. Journal of Biomedical Informatics. 2012;45:292–306.
- Zhang Y. Toward a Layered Model of Context for Health Information Searching: An Analysis of Consumer-Generated Questions. Journal of the American Society for Information Science and Technology. 2013;64(6):1158–72.
- Roberts K, Masterton K, Fiszman M, Kilicoglu H, Demner-Fushman D. Annotating Question Types for Consumer Health Questions. In: Proceedings of the Fourth LREC Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing; 2014.
- Roberts K, Kilicoglu H, Fiszman M, Demner-Fushman D. Automatically classifying question types for consumer health questions. AMIA Annu Symp Proc. 2014;2014:1018–27.
- McRoy S, Jones S, Kurmally A. Toward automated classification of consumers’ cancer-related questions with a new taxonomy of expected answer types. Health Informatics Journal. 2015;22(3):523–35.
- Cronin RM, Fabbri D, Denny JC, et al. A comparison of rule-based and machine learning approaches for classifying patient portal messages. International Journal of Medical Informatics. 2017:110–20.
- Roberts K, Demner-Fushman D.: Toward a natural language interface for EHR questions. AMIA Jt Summits Transl Sci Proc. 2015 Mar 25:157-161 (2015).
- Yin JW.: The method of classification and similarity calculation for mobile health questions based on the field dictionary, pp. P42-47. Shenzhen University, Shenzhen (2016).
- Zhang N. Research on Natural Language Question Analysis Based on Knowledge Organization System. Beijing: Institute of Science and Technology of China; 2016.
- Tang G, Ni Y, Xie G, et al.: A deep learning based method for similar patient question retrieval in Chinese. In Proceedings of the 16th World Congress on Health and Biomedical Informatics, Hangzhou, China, 23-29 August 2017 (2017).
- Guo H, Li J, Dai T.: Consumer health information needs and question classification: analysis of hypertension related questions asked by consumers on a Chinese health website, In Proceedings of the 15th World Congress on Health and Biomedical Informatics, São Paulo, Brazil, 19-23 August 2015, Studies in Health Technology and Informatics, IOS Press, 216: 810-814 (2015).
- Guo H, Na X, Hou L, Li J. Classifying Chinese Questions Related to Health Care Posted by Consumers Via the Internet. J Med Internet Res. 2017;19(6):e220.
- Qcorp: An Annotated Chinese Health Question Corpus, http://www.phoc.org.cn/healthqa/qcorp/, last accessed 2017/09/29.
- McHugh ML. Interrater reliability: the kappa statistic. Biochemia Medica. 2012;22(3):276–82.
- Roberts K, Demner-Fushman D. Interactive use of online health resources: a comparison of consumer and professional questions. J Am Med Inform Assoc. 2016;23(4):802–11.
- Tsoumakas G, Katakis I.: Multi-label classification: an overview, http://lpis.csd.auth.gr/publications/tsoumakas-ijdwm.pdf, last accessed 26 Aug 2017.
- Chen YW, Lin CJ. Combining SVMs with various feature selection strategies. In: Studies in Fuzziness and Soft Computing. Berlin, Heidelberg: Springer; 2006. p. 315–24.