Chemical-induced disease extraction via recurrent piecewise convolutional neural networks

Background Extracting relationships between chemicals and diseases from unstructured literature has attracted considerable attention, since these relationships are useful for many biomedical applications such as drug repositioning and pharmacovigilance. A number of machine learning methods have been proposed for chemical-induced disease (CID) extraction, made possible by publicly available annotated corpora. Except for deep learning methods, most of them suffer from time-consuming feature engineering. In this paper, we propose a novel document-level deep learning method, called recurrent piecewise convolutional neural networks (RPCNN), for CID extraction. Results Experimental results on a benchmark dataset, the CDR (Chemical-induced Disease Relation) dataset of the BioCreative V challenge for CID extraction, show that the highest precision, recall and F-score of our RPCNN-based CID extraction system are 65.24, 77.21 and 70.77% respectively, which is competitive with other state-of-the-art systems. Conclusions A novel deep learning method is proposed for document-level CID extraction, in which domain knowledge, piecewise strategy, attention mechanism, and multi-instance learning are combined. The effectiveness of the method is demonstrated by experiments conducted on a benchmark dataset.


Background
Nowadays, a growing body of published literature contains rich domain knowledge. The first step in reusing this literature is to extract biomedical information from it. Chemical-induced disease (CID) relations, which refer to adverse drug reactions, are an important type of such information: they can be used for drug safety monitoring and medicine development [1], and they have attracted increasing attention.
During the last decade, a large number of methods have been proposed for CID extraction [2]. They can be classified into three categories: 1) statistics-based methods, 2) rule-based methods, and 3) machine learning-based methods. The statistics-based methods determine CIDs according to the distributions of chemicals and diseases. For example, Chen et al. [3] discovered drug side effects by analyzing co-occurrences of drugs and adverse reactions in biomedical literature. Mao et al. [4] used a similar method to mine drug side effects from social media. The limitation of statistics-based methods lies in their low precision, although they usually achieve high recall. Among rule-based methods, Khoo et al. [5] used manually constructed graphical patterns derived from syntactic parse trees to extract causal relations between drugs and adverse events in MEDLINE abstracts. Rule-based methods usually need domain experts, constructing rules is time-consuming, and manually crafted rules are not easily applicable to other corpora. To increase the generalizability of rules, Xu and Wang [6] provided a method to learn syntactic patterns from sentences containing known drug side effect pairs for drug side effect extraction from biomedical literature. Machine learning-based methods have been deployed for CID extraction because some manually annotated corpora, such as the corpus of the BioCreative V chemical-induced disease relation (CDR) challenge [7], are publicly available. Support vector machine (SVM) is the most commonly used machine learning method. Xu et al. [8] won the BioCreative V CDR challenge using an SVM-based system. However, the feature engineering of the SVM-based system is laborious. To avoid tedious feature engineering, deep learning methods were applied to CID extraction [9], including convolutional neural networks (CNN) [10] and long short-term memory neural networks (LSTM) [11].
These systems do not consider domain knowledge about adverse drug reactions, nor some newer techniques that are widely used in other domains, such as the piecewise strategy [12] and the attention mechanism [13]. Subsequently, Li et al. [14] adopted piecewise CNN to extract both intra-sentence and inter-sentence chemical-disease relations using a uniform model. Gu et al. [15] improved the CNN model by adding cross-sentence syntactic information, which further improved performance. However, all these methods extract chemical-disease relations from single sentences or adjacent sentences. None of them considers document-level information. In a document, two entities usually appear more than once, and it is difficult to determine which sentence or paragraph describes a relation between them. To facilitate efficient document-level relation extraction from biological text, Patrick et al. [16] proposed Bi-affine Relation Attention Networks (BRAN), a combination of network architecture, multi-instance learning and multi-task learning. In this paper, we propose a novel document-level deep learning method for CID extraction, called recurrent piecewise convolutional neural networks (RPCNN). It should be noted that this paper is an extension of our previous paper [14].

Overview
There are usually two steps in chemical-induced disease extraction: 1) candidate generation, which generates all possible related pairs of chemicals and diseases, denoted by <chemical, disease>; and 2) candidate classification, which determines whether each <chemical, disease> pair generated in the previous step is related.

Candidate generation
Given a biomedical record with m chemical mentions and n disease mentions, all m × n <chemical, disease> pairs can be recognized as candidates. In this study, we combine the <chemical, disease> pairs that have the same chemical and disease identifiers to form a candidate, denoted by <chemical identifier, disease identifier>. An example of candidate generation is shown in Table 1. The record contains 2 chemical mentions (i.e., "terbutaline"×2) and 4 disease mentions (i.e., "Cardiovascular complications", "cardiovascular complications", and "preterm labor"×2). As the two chemical mentions have the same MeSH (Medical Subject Headings) [17] identifier (i.e., D013726) and the 4 disease mentions correspond to 2 MeSH identifiers (i.e., cardiovascular complications - D002318 and preterm labor - D007752), two candidates, that is, <D013726, D002318> and <D013726, D007752>, are generated. Each candidate is a document-level candidate corresponding to multiple <chemical, disease> pairs, and each <chemical, disease> pair is an instance. Therefore, there are eight instances corresponding to two candidates in Table 1.
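The grouping of mention pairs into document-level candidates can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `generate_candidates` and the `(text, identifier)` tuple layout are assumptions, while the mentions and MeSH identifiers are those of the Table 1 example.

```python
from collections import defaultdict
from itertools import product

# Mentions from the Table 1 example as (surface text, MeSH identifier).
# The tuple layout is an assumption made for this sketch.
chemical_mentions = [
    ("terbutaline", "D013726"),
    ("terbutaline", "D013726"),
]
disease_mentions = [
    ("Cardiovascular complications", "D002318"),
    ("cardiovascular complications", "D002318"),
    ("preterm labor", "D007752"),
    ("preterm labor", "D007752"),
]


def generate_candidates(chemicals, diseases):
    """Group all m x n mention pairs (instances) by their identifier pair."""
    candidates = defaultdict(list)
    for (c_text, c_id), (d_text, d_id) in product(chemicals, diseases):
        # Each mention pair is one instance of the candidate <c_id, d_id>.
        candidates[(c_id, d_id)].append((c_text, d_text))
    return dict(candidates)


candidates = generate_candidates(chemical_mentions, disease_mentions)
for identifier_pair, instances in candidates.items():
    print(identifier_pair, len(instances))
```

With 2 chemical mentions and 4 disease mentions, the sketch yields the two candidates of the example, each with four instances, eight instances in total.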

Candidate classification
A four-layer recurrent piecewise convolutional neural network (RPCNN) is proposed for CID extraction, as shown in Fig. 1. Piecewise CNN (the same as in Li et al. [14]) is used to represent each instance of a candidate, and an RNN is used to combine the representations of all of a candidate's instances in a record into the document-level representation of the candidate.

Input layer
Given a candidate, the corresponding multiple instances $I_0, I_1, \ldots, I_m$ are arranged in descending order according to the length of the context between the two entity mentions, which is measured by the number of words within the context. For each instance, we select the two entity mentions, the context between them, and the context before and after them in the same sentence as the instance's input. To distinguish chemical mentions from disease mentions, "<ENTC> ... </ENTC>" and "<ENTD> ... </ENTD>" are used to enclose them respectively. Then, an instance's input is divided into three parts: 1) $S_{-1}$: context before the first entity mention (e.g., "Severe ... with" before "<ENTC> terbutaline </ENTC>" in Table 2); 2) $S_0$: context between the two entity mentions (e.g., "for" in Table 2); and 3) $S_1$: context after the second entity mention (e.g., "." after "<ENTD> preterm labor </ENTD>" in Table 2). Each word of an instance's input is represented by a word embedding and by embeddings of its positions relative to the chemical and disease mentions (see Table 2). For convenience, the lengths of all instances' inputs (i.e., numbers of words within inputs) are set to the maximum (denoted by l). For instances with shorter inputs, paddings are appended to make up the difference. Given an instance <c, a> with input $S = w_1 w_2 \ldots w_l$, suppose that the positions of c and a in S are $p_c$ and $p_a$ respectively. Word $w_i$ can then be represented by the concatenation $x_i = [e(w_i)^T, e(d_{ic})^T, e(d_{ia})^T]^T$, where $e(w_i)$ is the $d_w$-dimensional word embedding of $w_i$, $e(d_{ic})$ and $e(d_{ia})$ are the $d_{p_c}$-dimensional and $d_{p_a}$-dimensional position embeddings, and $d_{ic} = i - p_c$ and $d_{ia} = i - p_a$ are the relative distances from $w_i$ to c and a respectively.
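The splitting of an instance's input into three parts and the computation of relative distances can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the tags are treated as tokens, the position of a mention is taken to be the index of its opening tag, the padding token is named `<PAD>`, and the placeholder tokens `w1` and `w2` stand in for the elided left context of the Table 2 example.

```python
def split_instance(tokens):
    """Split a tagged token list into (S_-1, S_0, S_1) and compute
    the relative distances of every token to the two mentions."""
    c_start, c_end = tokens.index("<ENTC>"), tokens.index("</ENTC>")
    a_start, a_end = tokens.index("<ENTD>"), tokens.index("</ENTD>")

    # The first mention is whichever of the two appears earlier.
    first_start, first_end = min((c_start, c_end), (a_start, a_end))
    second_start, second_end = max((c_start, c_end), (a_start, a_end))

    before = tokens[:first_start]                  # S_-1
    between = tokens[first_end + 1:second_start]   # S_0
    after = tokens[second_end + 1:]                # S_1

    # Assumption: a mention's position is the index of its opening tag.
    d_c = [i - c_start for i in range(len(tokens))]
    d_a = [i - a_start for i in range(len(tokens))]
    return before, between, after, d_c, d_a


def pad(tokens, length, pad_token="<PAD>"):
    """Append padding tokens so that the input has the given length."""
    return tokens + [pad_token] * (length - len(tokens))


tokens = ["w1", "w2", "<ENTC>", "terbutaline", "</ENTC>", "for",
          "<ENTD>", "preterm", "labor", "</ENTD>", "."]
before, between, after, d_c, d_a = split_instance(tokens)
print(before, between, after)
print(d_c)
print(d_a)
print(pad(tokens, 15))
```

In a full system each token and each relative distance would then be looked up in an embedding table and the three embeddings concatenated per word; that lookup is omitted here.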

Piecewise convolutional layer
The convolutional layer takes the input matrix x of each instance and generates high-level feature vectors by convolving filters at multiple scales across x, where the filters need to be learnt. Given a filter of size k, $t \in \mathbb{R}^{(d_w + d_{p_c} + d_{p_a}) \times k}$, for example, the feature vector $f = [f_1, f_2, \ldots, f_{l-k+1}]^T \in \mathbb{R}^{l-k+1}$ is generated by sliding filter t across the input x of S with a convolution operator followed by an activation function (take the rectified linear unit (ReLU) for example): $f_i = \mathrm{ReLU}(t \cdot x_{i:i+k-1})$ for $1 \le i \le l-k+1$, where $x_{i:i+k-1}$ is the sub-matrix of x covering words $w_i, \ldots, w_{i+k-1}$ and $\cdot$ denotes the sum of element-wise products. Each filter corresponds to one high-level feature vector, so the number of filters determines the number of feature vectors obtained.
To reduce the spatial size of the representation of each instance, and with it the number of parameters and the amount of computation, max pooling is adopted to select the most important features from all the features generated in the convolutional layer: $f_t = \max(f_{t,1}, f_{t,2}, \ldots, f_{t,l-k+1})$, where $(f_{t,1}, f_{t,2}, \ldots, f_{t,l-k+1})$ is the feature vector corresponding to filter t, and $f_t$ is the maximum feature. If there are q filters, a new q-dimensional vector is generated to represent S, denoted by $z = [f_1, f_2, \ldots, f_q]^T$.
In addition, the piecewise strategy, which applies pooling to the individual parts (i.e., $S_{-1}$, $S_0$ and $S_1$) and concatenates the outputs of all pooling layers, is also adopted in our study. Before pooling, an attention mechanism is used to measure the importance of each feature for each class. It is defined through two matrices: G, a correlation matrix between the features f of each filter t and the relation class embedding $W_{classes}$, and A, an attention matrix; $A_{i,j}$ and $G_{i,j}$ are the (i, j)-th entries of A and G, respectively. M and $W_{classes}$ are weight matrices that need to be learnt. We use a uniform distribution to initialize M, and an identity matrix to initialize $W_{classes}$.
When the attention mechanism is adopted, the output of the pooling layer becomes $f^{A}_{t,i} = \max_{j} (f_t A)_{i,j}$, where $f^{A}_{t,i}$ and $(f_t A)_{i,j}$ are the i-th item of $f^{A}_{t}$ and the (i, j)-th item of $f_t A$, respectively.
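The convolution and pooling steps can be illustrated with a small pure-Python sketch. It is an illustration under assumptions, not the authors' implementation: the attention mechanism is left out, the piece boundaries are passed in directly as index ranges over the feature vector (how features are assigned to $S_{-1}$, $S_0$ and $S_1$ is not specified in the text), an empty piece is pooled to 0.0, and the toy input uses one-dimensional word vectors.

```python
def relu(value):
    """Rectified linear unit."""
    return max(0.0, value)


def convolve(x, t):
    """Slide a filter t (k word-sized columns) over the input x
    (l word vectors) and return the l - k + 1 ReLU features."""
    k = len(t)
    features = []
    for i in range(len(x) - k + 1):
        # Sum of element-wise products between the filter and the window.
        total = sum(t[j][d] * x[i + j][d]
                    for j in range(k)
                    for d in range(len(t[j])))
        features.append(relu(total))
    return features


def max_pool(features):
    """Plain max pooling: one value per filter."""
    return max(features)


def piecewise_max_pool(features, boundaries):
    """Max-pool each piece separately and concatenate the results.
    Assumption: an empty piece contributes 0.0."""
    pooled = []
    for start, end in boundaries:
        segment = features[start:end]
        pooled.append(max(segment) if segment else 0.0)
    return pooled


# Toy input: l = 6 words, each a 1-dimensional vector; one filter with k = 2.
x = [[1.0], [-2.0], [3.0], [0.5], [-1.0], [2.0]]
t = [[1.0], [1.0]]

features = convolve(x, t)
pooled = piecewise_max_pool(features, [(0, 2), (2, 4), (4, 5)])
print(features)            # [0.0, 1.0, 3.5, 0.0, 1.0]
print(max_pool(features))  # 3.5
print(pooled)              # [1.0, 3.5, 1.0]
```

Plain max pooling keeps a single value per filter, whereas the piecewise variant keeps one value per part, so a strong feature in one part no longer hides the features of the other two.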

RNN layer
In this layer, an RNN is used to model the multiple instances of a candidate. For each instance $I_i$, the corresponding RNN cell takes the output of the piecewise convolutional layer (i.e., $z_i$) and the previous hidden vector $h_{i-1}$ as input, and outputs the hidden vector $h_i$ using a non-linear transformation function ρ, that is, $h_i = \rho(z_i, h_{i-1})$. The last hidden vector $h_m$ is used as the representation of the multiple instances of a candidate, which is a document-level representation.
Table 2 Example of chemical position and disease position
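The recurrence over a candidate's instances can be sketched as below. The sketch swaps in a plain tanh cell for ρ to stay short, whereas the experiments use an LSTM cell; the weight matrices `W` and `U`, the zero initial hidden vector, and the dictionary layout of an instance are all assumptions made for illustration.

```python
import math


def rnn_step(z, h_prev, W, U):
    """One step h_i = rho(z_i, h_{i-1}), here with rho = tanh(W z + U h)."""
    return [
        math.tanh(
            sum(W[r][c] * z[c] for c in range(len(z)))
            + sum(U[r][c] * h_prev[c] for c in range(len(h_prev)))
        )
        for r in range(len(W))
    ]


def document_representation(instance_vectors, W, U):
    """Fold the instance vectors z_0..z_m into the last hidden vector h_m."""
    h = [0.0] * len(W)  # assumption: zero initial hidden vector
    for z in instance_vectors:
        h = rnn_step(z, h, W, U)
    return h


# Hypothetical instances: pooled vector z and length of the context
# between the two mentions.
instances = [
    {"z": [0.2, 0.9], "context_len": 3},
    {"z": [1.0, 0.0], "context_len": 12},
    {"z": [0.5, 0.5], "context_len": 7},
]
# Instances are fed in descending order of context length.
ordered = sorted(instances, key=lambda inst: inst["context_len"], reverse=True)

W = [[0.5, -0.5], [0.25, 0.75]]  # hypothetical input weights
U = [[0.1, 0.0], [0.0, 0.1]]     # hypothetical recurrent weights
h_m = document_representation([inst["z"] for inst in ordered], W, U)
print(h_m)
```

Because every instance passes through the recurrence, the final vector depends on all mentions of the pair in the record rather than on a single selected instance.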

Softmax layer
In this layer, a fully connected neural network is used for classification. The neural network takes the following two parts as input: 1) $h_m$ from the RNN layer presented above; 2) features extracted from four domain knowledge bases, the same as in Xu et al.'s system [8], as follows: (1) The CTD repository [18], which contains relationships between drugs and diseases, such as inferred-association, therapeutic, marker/mechanism, etc., manually summarized by experts. (2) The Drugs and Indications Database (MEDI) [19], which records common drugs with common indications. (3) SIDER (Drug Side Effects Database) [20], which records common drugs with common side effects. (4) Medical Subject Headings (MeSH), which records hierarchical (superordinate and subordinate) relationships between drugs and diseases.
The one-hot features extracted from domain knowledge are first converted into dense features (denoted by v) by a 1-layer neural network. For candidate classification, we use the sigmoid function as follows: $y = \mathrm{sigmoid}(u^T v') = 1 / (1 + e^{-u^T v'})$, where $y$ is the predicted probability that the candidate is a CID relation, $v' = [h_m^T, v^T]^T$, and u is a weight vector.
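The final classification step, which concatenates the document-level representation with the dense knowledge features and applies a sigmoid, can be sketched as follows. The function name `classify`, the toy vectors and the 0.5 decision threshold are assumptions for illustration; the text does not state a threshold.

```python
import math


def classify(h_m, v, u):
    """Return sigmoid(u . v') with v' the concatenation of h_m and v."""
    v_prime = h_m + v  # list concatenation
    score = sum(weight * value for weight, value in zip(u, v_prime))
    return 1.0 / (1.0 + math.exp(-score))


h_m = [0.4, -0.2]     # hypothetical document-level representation
v = [1.0, 0.0, 0.5]   # hypothetical dense knowledge features
u = [0.3, 0.1, 0.8, -0.4, 0.2]  # hypothetical weight vector

probability = classify(h_m, v, u)
label = 1 if probability >= 0.5 else 0  # assumed threshold
print(probability, label)
```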

Dataset
Our method is evaluated on the CDR corpus of the BioCreative V challenge. This corpus contains 1500 manually annotated PubMed records; 1000 of the 1500 records are used as training and development sets, and the remaining 500 records as the test set. In the training and development sets, there are 10,550 chemical mentions and 8426 disease mentions, corresponding to 3829 and 2973 MeSH identifiers respectively, and 2050 relations. In the test set, there are 5385 chemical mentions and 4424 disease mentions, corresponding to 1988 and 1435 MeSH identifiers respectively, and 1066 relations.

Experimental settings
As the baseline, we start with a simple CNN-based system that selects only the last instance of every candidate in the input layer and uses none of domain knowledge, piecewise strategy or attention mechanism. We then compare it with CNN-based systems that gradually add these components, and with RPCNN. In addition, our best CNN-based and RPCNN-based systems are also compared with other state-of-the-art systems using a single machine learning method. Precision (P), recall (R) and F-score (F) are used to measure the performance of all systems, and are calculated by the official evaluation tool of the BioCreative V organizers. 10-fold cross-validation on the training and development sets is used to optimize all hyperparameters of our system. Finally, $d_w$, $d_{p_c}$ and $d_{p_a}$ are set to 30, 5 and 5 respectively. CBOW is deployed to initialize word embeddings on a large-scale unannotated corpus from Medline, and position embeddings are initialized by a uniform distribution. Filters at scales of 3 and 4 are selected and the number of filters at each scale is set to 150. In the RNN layer, we use an LSTM cell with 150 hidden states as the RNN cell. In the softmax layer, we follow the work of Srivastava et al. [21] to randomly drop out units from the network to prevent overfitting during training, and set the dropout probability to 0.25. The number of units of the neural network for knowledge feature conversion is set to 120.
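For reference, the three measures follow the standard definitions sketched below. This is not the official BioCreative V evaluation tool; the function names and the toy identifier pairs are assumptions, and the last lines only check that the baseline F-score reported in the Results agrees with its reported precision and recall.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(predicted, gold):
    """Compute P, R and F from sets of <chemical id, disease id> pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical predictions and gold relations for one small example.
predicted = {("C1", "D1"), ("C1", "D3"), ("C2", "D2")}
gold = {("C1", "D1"), ("C2", "D2"), ("C3", "D3"), ("C4", "D4")}
p, r, f = precision_recall_f(predicted, gold)
print(p, r, f)

# Consistency check on the reported baseline: P = 50.47, R = 55.61.
baseline_f = f_score(50.47, 55.61)
print(round(baseline_f, 2))  # 52.92
```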

Results
The precision, recall and F-score of the baseline system (CNN in Table 3, where the best performance in each column is in bold) are 50.47, 55.61 and 52.92%. Similar to [8], the CNN-based systems are significantly improved by the domain knowledge. Taking the baseline system as an example, when the domain knowledge is added, the system's F-score improves by 15.72 percentage points (52.92% vs 68.64%). Both the piecewise strategy and the attention mechanism are beneficial to the CNN-based systems, and they are complementary to each other. For example, when the piecewise strategy is added to the baseline system (CNN + piecewise in Table 3), the system's F-score increases from 52.92 to 54.20%, while when the attention mechanism is added to the baseline system before pooling (CNN + attention), the F-score increases slightly from 52.92 to 52.99%. When the piecewise strategy and the attention mechanism are added to the baseline system together (CNN + attention + piecewise), the system's F-score is further improved to 55.94%. When the domain knowledge is added, the effects of the piecewise strategy and the attention mechanism decrease. For example, the F-score difference between CNN using domain knowledge and CNN + piecewise using domain knowledge is 0.39 percentage points, while the F-score difference between the corresponding systems without domain knowledge is 1.28 percentage points. Among all CNN-based systems, the system that uses domain knowledge, piecewise strategy and attention mechanism achieves the highest F-score, which is 69.09%. The RPCNN-based system (RPCNN) outperforms CNN + attention + piecewise. RPCNN without domain knowledge achieves an F-score of 59.10%, higher than that of CNN + attention + piecewise by 3.16 percentage points, while RPCNN using domain knowledge achieves an F-score of 70.77%, higher than that of CNN + attention + piecewise by 1.68 percentage points.
Moreover, our best CNN-based and RPCNN-based systems are also compared with other state-of-the-art systems using a single machine learning method, including those of Xu et al., Zhou et al., Gu et al. and Patrick et al. Table 4 lists the results of the comparison, where "/" denotes that no result was reported, and the best performance in each column is in bold. Compared with Xu et al.'s system, our RPCNN-based system achieves a much higher F-score whether or not the domain knowledge is used. The difference between the systems without domain knowledge is 5.21 percentage points (55.94% vs 50.73%), while that between the systems using domain knowledge is 3.61 percentage points (70.77% vs 67.16%). Compared with Zhou et al.'s systems, our RPCNN-based system also achieves a much higher F-score. The F-score difference between our RPCNN-based system and Zhou et al.'s systems ranges from 2.84 to 8.78 percentage points. Compared with Gu et al.'s system, though our CNN-based system does not perform better, our RPCNN-based system performs better by 1.90 percentage points in F-score. Patrick et al.'s BRAN-based system achieves a higher F-score than our system by 3.00 percentage points when it takes entity recognition into account, which significantly improves the performance of relation extraction. Without the entity recognition multi-task objective, the F-score of the BRAN-based system is only 55.50%.

Discussion
In this paper, we propose RPCNN for CID extraction, where domain knowledge, piecewise strategy, attention mechanism and multi-instance learning are naturally combined. On a benchmark corpus, the RPCNN-based system shows performance competitive with the state of the art.
Similar to previous studies on CNN-based relation extraction in other domains, the piecewise strategy and the attention mechanism are effective in our CNN-based system. In our system, the attention mechanism helps handle cases where the chemical mention is far away from the disease mention, especially when they are not in the same sentence. For example, the candidate <"AKI", "cisplatin"> with the context "The primary outcome was acute kidney injury (<ENTD> AKI </ENTD>). RESULTS: We evaluated 143 patients who received single-agent <ENTC> cisplatin </ENTC>", where $S_1$ is much longer and more complex than $S_{-1}$ and $S_0$, is wrongly labeled as 0 without the piecewise strategy, but correctly labeled as 1 with the piecewise strategy. However, tackling the two types of cases mentioned above is still challenging. We evaluated the performance of our system (CNN + attention + piecewise in Table 3) on cases where the chemical mention and the disease mention are not in the same sentence. The precision, recall, and F-score are only 53.15, 26.07 and 34.99% respectively.
Compared with CNN-based systems, our RPCNN-based system performs better. The main reason is that RPCNN provides a document-level representation for every candidate as all corresponding instances are considered, while CNN only selects one instance to represent a candidate by removing other instances where there may be different descriptions about relations.
There may be two limitations of our study: 1) Chemical mentions and disease mentions themselves are ignored in the input layer, although they may be helpful for CID extraction. In future work, we will try to integrate chemical and disease mentions into the input layer for further improvement. 2) The effectiveness of our method is validated on an independent test set from the same resource (the BioCreative V challenge), but not on the latest papers. We will manually label a corpus from PubMed that includes the latest papers as another separate test set for further validation.

Conclusion
In this paper, we propose a novel document-level deep learning method for CID extraction. The proposed method naturally combines domain knowledge, piecewise strategy, attention mechanism and multi-instance learning. The effectiveness of the method is validated on a benchmark corpus, and the system based on the proposed method shows competitive performance with other state-of-the-art systems.

Funding
The publication fee of this paper is supported by JCYJ20160531192358466. The funding agency was not involved in the design of this study, the analysis and interpretation of data, or the writing of the manuscript.

Availability of data and materials
The code used in the experiments is available at https://github.com/wglassly/CID_ATTCNN.

About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 18 Supplement 2, 2018: Selected extended articles from the 2nd International Workshop on Semantics-Powered Data Analytics. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-18-supplement-2.
Authors' contributions HL, MY, QC and BT designed the study together. HL and QC performed the experiments. HL, MY and BT analyzed the results. HL and BT wrote the manuscript. XW and JY reviewed and edited the manuscript. All authors read and approved the manuscript.
Ethics approval and consent to participate Not applicable.