Selected articles from the BioCreative/OHNLP challenge 2018

Introduction The wide adoption of electronic health records (EHRs) has led to an improvement in healthcare quality by electronically documenting a patient’s medical conditions, thoughts and actions among the care providers [1]. Those EHR data, with the vast majority being free-texts (e.g., clinical notes, discharge summaries, radiology reports, and pathology reports), have been utilized for primary and secondary purposes, such as documentation need in care process, clinical decision support, outcome improvement, biomedical research and epidemiologic monitoring of the nation’s health. The application of natural language processing (NLP) methods and resources to clinical and biomedical text has received growing attention over the past years, but progress has been limited by difficulties to access shared tools and resources, partially caused by patient privacy and data confidentiality constraints. Efforts to increase sharing and interoperability of the few existing resources are needed to facilitate the progress observed in the general NLP domain. Towards this goal, we organized the BioCreative/OHNLP Challenge 2018 workshop (https://sites. google.com/view/ohnlp2018/home) to promote community efforts on methodological advancements and data curation mechanisms in clinical NLP. The challenge consists of two independent clinical NLP tasks: 1) Family History Extraction; and 2) Clinical Semantic Textual Similarity. The top performing teams were invited to present their solutions during the BioCreative/OHNLP Challenge 2018 workshop in conjunction with the 9th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB) (http:// acm-bcb.org/2018/) on August 30th, 2018. This supplement collects the system descriptions of top-performing solutions of the tasks. Task 1: family history extraction As a risk factor of many diseases, family history information (FHI) captures shared genetic variations among family members [2]. Information such as age, gender, and degree of relatives are also considered when taking into the account of risk assignment of a large number of common diseases. The fact that many care process models use FHI highlights the importance of FHI in the decision-making process of diagnosis and treatment. However, extracting accurate and complete FHI from clinical texts remains challenging as a clinical NLP problem due to the lack of standardized evaluation mechanisms and publicly available language resources. To curate a corpus that can be made publicly available without losing semantic power for potential information extraction systems, we first collected the clinical narrative from family history sections of clinical notes at Mayo Clinic Rochester, the content of which are highly relevant to FHI. A team of annotators annotated the original corpus with clinical observations, family member mentions and protected health information. Afterwards, the protected health information is replaced with synthetic yet meaningful strings, and the clinical observations, family member mentions are shuffled among the corpus to further protect patient privacy. Leveraging the synthetic corpus with FHI, we organized this shared task to encourage the community to propose and develop family history extraction (FHE) systems [3]. The task composes two subtasks. The Subtask 1 focuses on identifying family member entities and clinical observations (diseases), and the Subtask 2 expects the association the living status, side of the family and clinical observations to family members to be extracted. The Subtask 2 is an end-to-end task which is based on the result of the Subtask 1. A total of 5 teams submitted overall 14 submissions for the official evaluation, and the descriptions of 2 teams are included in this supplement.


Introduction
The wide adoption of electronic health records (EHRs) has led to an improvement in healthcare quality by electronically documenting a patient's medical conditions, thoughts and actions among the care providers [1]. Those EHR data, with the vast majority being free-texts (e.g., clinical notes, discharge summaries, radiology reports, and pathology reports), have been utilized for primary and secondary purposes, such as documentation need in care process, clinical decision support, outcome improvement, biomedical research and epidemiologic monitoring of the nation's health. The application of natural language processing (NLP) methods and resources to clinical and biomedical text has received growing attention over the past years, but progress has been limited by difficulties to access shared tools and resources, partially caused by patient privacy and data confidentiality constraints. Efforts to increase sharing and interoperability of the few existing resources are needed to facilitate the progress observed in the general NLP domain. Towards this goal, we organized the BioCreative/OHNLP Challenge 2018 workshop (https://sites. google.com/view/ohnlp2018/home) to promote community efforts on methodological advancements and data curation mechanisms in clinical NLP. The challenge consists of two independent clinical NLP tasks: 1) Family History Extraction; and 2) Clinical Semantic Textual Similarity. The top performing teams were invited to present their solutions during the BioCreative/OHNLP Challenge 2018 workshop in conjunction with the 9th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB) (http:// acm-bcb.org/2018/) on August 30th, 2018. This supplement collects the system descriptions of top-performing solutions of the tasks.

Task 1: family history extraction
As a risk factor of many diseases, family history information (FHI) captures shared genetic variations among family members [2]. Information such as age, gender, and degree of relatives are also considered when taking into the account of risk assignment of a large number of common diseases. The fact that many care process models use FHI highlights the importance of FHI in the decision-making process of diagnosis and treatment. However, extracting accurate and complete FHI from clinical texts remains challenging as a clinical NLP problem due to the lack of standardized evaluation mechanisms and publicly available language resources.
To curate a corpus that can be made publicly available without losing semantic power for potential information extraction systems, we first collected the clinical narrative from family history sections of clinical notes at Mayo Clinic Rochester, the content of which are highly relevant to FHI. A team of annotators annotated the original corpus with clinical observations, family member mentions and protected health information. Afterwards, the protected health information is replaced with synthetic yet meaningful strings, and the clinical observations, family member mentions are shuffled among the corpus to further protect patient privacy.
Leveraging the synthetic corpus with FHI, we organized this shared task to encourage the community to propose and develop family history extraction (FHE) systems [3]. The task composes two subtasks. The Subtask 1 focuses on identifying family member entities and clinical observations (diseases), and the Subtask 2 expects the association the living status, side of the family and clinical observations to family members to be extracted. The Subtask 2 is an end-to-end task which is based on the result of the Subtask 1. A total of 5 teams submitted overall 14 submissions for the official evaluation, and the descriptions of 2 teams are included in this supplement.
The solution proposed by Dai focused on the extraction step and formulates it as a sequence labeling task. A neural sequence labeling model along with different tag schemes to distinguish family members and FHI-related observations was developed. Corresponding to different tag schemes, the identified entities were aggregated and processed by different algorithms to determine the required properties. The effectiveness of encoding required properties in the tag schemes was evaluated by the task corpus. The developed neural network-based models performed significantly better than the conditional random fields models.
Shi et al. explored two joint learned models for the two subtasks. For the entity extraction subtask, the Bidirectional Long Short Term Memory (Bi-LSTM) and Conditional Random Field (CRF) models are used to recognize FHI related entities using word embeddings and part-of-speech (POS) embedding as inputs. For the relation extraction subtask, they trained a Bi-LSTM to classify the relations. The two models are joint trained towards a customized loss function to combine the loss from the two subtasks. On top of the results from machine learning models, they used heuristic rules and post-processing to handle entity properties such as side of family and living status.

Task 2: clinical semantic textual similarity
The frequent use of copy-and-paste, templates, and smart phrases have resulted in redundant texts in clinical notes, which may reduce the EHR data quality and add cognitive burden of tracking complex records in clinical practice. Therefore, there is a growing need for tools that can aggregate data from diverse sources and minimize data redundancy, and organize and present the EHR data in a user-friendly way to reduce physicians' cognitive burden. One technique for automatically reducing redundancy in free text EHRs is to compute semantic similarity between clinical text snippets and remove highly similar snippets. Semantic textual similarity (STS) is a common task in the general English domain to assess the degree to which the underlying semantics of two segments of text are equivalent to each other. The assessment is usually performed using ordinal scaled output ranging from complete semantic equivalence to complete semantic dissimilarity. The STS task has been held annually since 2012 to encourage and support research in this area. However, these series of STS tasks used texts in the general English domain and no STS shared task focuses on the text data in the clinical domain. To motivate the biomedical informatics and NLP communities to study STS in the clinical domain, we initiated the ClinicalSTS task to provide a venue for evaluation of the state-of-the-art algorithms and models.
ClinicalSTS provides paired clinical text snippets for each participant. The corpus, named MedSTS, consists of deidentified clinical sentences from narrative clinical notes [4]. The participating systems were asked to return a numerical score indicating the degree of semantic similarity between the pair of two sentences. Performance is measured by the Pearson correlation coefficient between the predicted similarity scores and human judgments. The scores fall on an ordinal scale, ranging from 0 to 5 where 0 means that the two clinical text snippets are completely dissimilar (i.e., no overlap in their meanings) and 5 means that the two snippets have complete semantic equivalence.
Xiong et al. proposed a novel framework based on a gated network to fuse distributed representation and one-hot representation of sentence pairs. Some current state-of-theart distributed representation models, including Convolutional Neural Network (CNN), Bi-LSTM and Bidirectional Encoder Representations from Transformers (BERT), were used in their system. Compared with the systems only using distributed representation or one-hot representation, their proposed method achieved higher performance. Among all distributed representations, BERT performed best. Further analysis indicates that the distributed representation and one-hot representation are complementary to each other and can be fused by gated network.
Chen et al. demonstrated both their participating systems and improvements after the challenge. They applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network. During the challenge task, no end-to-end deep learning models had better performance than machine learning models that take manually-crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves higher performance than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further improves the performance. Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Error analytics indicates that end-to-end deep learning models and traditional machine learning models with manually-crafted features can complement each other, which suggests that a combination of these models can better find similar sentences in practice.