Towards early detection of adverse drug reactions: combining pre-clinical drug structures and post-market safety reports

Background: Adverse drug reactions (ADRs) are a major burden for patients and the healthcare industry. Early and accurate detection of potential ADRs can help to improve drug safety and reduce financial costs. Post-market spontaneous reports of ADRs remain a cornerstone of pharmacovigilance, and a series of drug safety signal detection methods play an important role in providing drug safety insights. However, existing methods require sufficient case reports to generate signals, which limits their use for newly approved drugs with few (or even no) reports.
Methods: In this study, we propose a label propagation framework to enhance drug safety signals by combining drug chemical structures with the FDA Adverse Event Reporting System (FAERS). First, we compute original drug safety signals via common signal detection algorithms. Then, we construct a drug similarity network based on chemical structures. Finally, we generate enhanced drug safety signals by propagating the original signals on the drug similarity network. Our proposed framework enriches post-market safety reports with a pre-clinical drug similarity network, effectively alleviating the problem of insufficient cases for newly approved drugs.
Results: We apply the label propagation framework to four popular signal detection algorithms (PRR, ROR, MGPS, BCPNN) and find that our proposed framework generates more accurate drug safety signals than the corresponding baselines. In addition, our framework identifies potential ADRs for newly approved drugs, thus paving the way for early detection of ADRs.
Conclusions: The proposed label propagation framework combines pre-clinical drug structures with post-market safety reports, generates enhanced drug safety signals, and can potentially help to detect ADRs accurately ahead of time.
Availability: The source code for this paper is available at: https://github.com/ruoqi-liu/LP-SDA.

larger population and extended follow up. Real-world evidence, such as Spontaneous Reporting Systems (SRS) [4], Electronic Health Records (EHRs) [5], medical claims [6], and social media and web search [7,8], has become important for detecting ADRs. Among those data sources, SRS remains a cornerstone of pharmacovigilance; its reports are collected from a variety of sources, including healthcare providers, national authorities, pharmaceutical companies, medical literature and, more recently, directly from patients. SRS collects case reports such that each sample contains ADR status (Yes/No) and drug status (Yes/No). Such a structure allows SRS to be mined without an epidemiological study design.
Due to the rich and valuable information offered by SRS data, a series of signal detection algorithms have been developed to detect drug safety signals from SRS. Proportional Reporting Ratio (PRR) [9] and Reporting Odds Ratio (ROR) [10,11] are the most commonly used methods and are based on frequentist statistical analysis. Multi-item Gamma Poisson Shrinker (MGPS) [12] and Bayesian Confidence Propagation Neural Network (BCPNN) [13] are two Bayesian approaches that are widely used for signal detection. Recently, another approach has emerged that combines pre-clinical drug structures with SRS to improve the original safety signals. Vilar et al. [14,15] improve the original signals generated from healthcare databases by incorporating biological and chemical information of drugs. Their methods first improved performance in the analysis of two representative ADRs: rhabdomyolysis and pancreatitis. Vilar et al. [16] further demonstrate that other types of cheminformatic similarity (e.g., 2D drug chemical structural similarity, adverse event profile similarity and target profile similarity) can also perform well in the detection of drug safety signals. Moreover, Vilar et al. [17] present a 3D drug-ADR predictor, which incorporates 3D molecular structure similarity and a drug-ADR standard reference, to improve ADR identification and generate enriched drug-ADR signals. They apply the 3D drug-ADR predictor to SRS resources and find that the proposed predictor identifies more accurate signals than baseline methods. The underlying principle behind these approaches is that drugs with similar chemical structures are more likely to exhibit similar ADRs [18]. In general, existing methods are developed to generate signals and/or re-rank original signals for drugs with enough reports in SRS, but few methods can be used to generate signals for newly approved drugs with few or even no safety reports in SRS.
Some approaches use machine learning techniques and pre-clinical information from large public drug databases to predict ADRs [19-24]. Most of these methods use chemical, biological and phenotypic properties of drugs to build predictive models. In [19], for example, a computational approach is presented to predict the side effects of a given drug by incorporating information on other drugs and their side effects. That approach uses drug-ADR pairs obtained from public drug databases both in the training process and in the performance evaluation. In contrast, we use these drug-ADR pairs only as external evaluation resources, and they do not take part in the training process (a comparison of [19] and our framework can be found in Fig. S1 of Additional file 1). To the best of our knowledge, ours is the first signal detection framework that combines pre-clinical drug structures and post-market safety reports.
In this paper, we propose a label propagation framework to enhance drug safety signals by combining drug chemical structures with the FDA Adverse Event Reporting System (FAERS) [25]. First, we compute original drug safety signals from FAERS via common signal detection algorithms. Then, we construct a drug-drug similarity network based on chemical structures. Finally, we generate enhanced drug safety signals by propagating the original signals on the drug-drug similarity network. We apply the label propagation framework to four popular signal detection algorithms (PRR, ROR, MGPS, BCPNN) and find that our proposed framework can generate more accurate drug safety signals than the corresponding baseline methods. In addition, the proposed framework can identify potential ADRs for newly approved drugs, thus showing promise for early detection of ADRs.
In general, the contributions of the paper are threefold:
• We propose a label propagation framework to generate enhanced drug safety signals, which combines pre-clinical drug structures with post-market safety reports.
• We compare the proposed framework with four different state-of-the-art signal detection algorithms and evaluate the performance in detecting ADRs.
• We also apply our framework to newly approved drugs (with few cases in SRS) and assess whether pre-clinical drug structures can help to detect safety signals early, prior to an FDA safety label change.

FAERS database
The SRS data used in this work is FAERS. We adopt a curated and standardized version of FAERS data from 2004 to 2014 [26]. After removing duplicate case records, mapping drug names to RxNorm concepts and ADR outcomes to Medical Dictionary for Regulatory Activities (MedDRA) codes [27], we obtain 4245 unique drugs, 17,671 ADRs and a total of 4,928,413 reports. We plot the frequencies of ADRs and drugs in the FAERS data in Fig. 1 to show the data distribution of this dataset. The number of drugs associated with ADRs varies a lot with

PubChem database
PubChem Compound database [28] provides unique chemical structure information of drugs. We map the concept IDs of drugs in FAERS into PubChem IDs using the exact drug names and then extract the drug chemical substructures from PubChem. Among 4245 unique drugs in FAERS, 2708 drugs are mapped and their chemical features are extracted from PubChem.

SIDER ground truth data
The Side Effect Resource (SIDER) database [29] contains approved drugs and their recorded ADRs, which are collected from package inserts (i.e., drug labels). SIDER version 4.1 contains a total of 1430 drugs, 5868 ADRs and 139,756 drug-ADR pairs. We use drug-ADR pairs extracted from SIDER version 4.1 as positive controls for evaluation. Of the 2708 drugs with chemical features, 843 drugs are mapped to SIDER by converting PubChem IDs to STITCH IDs in SIDER. ADRs in SIDER are recorded in both the Lowest Level Term (LLT) and Preferred Term (PT) forms of MedDRA. We select PT for ADRs in our evaluation dataset. Thus, we end up with 843 drugs, 842 ADRs and 65,636 drug-ADR pairs as the ground truth data in the experiment. As further validation of the approach, we also use OFFSIDES [30], a post-marketing dataset, to test the performance (see Table S4 in Additional file 1).

Overall framework
The overall framework of this paper is outlined in Fig. 2. It consists of three main steps: computing original drug safety signals from FAERS reports, constructing a drug-drug similarity network from pre-clinical drug structures, and generating enhanced drug safety signals through a label propagation process.

Computing drug safety signals
Our study covers four commonly used signal detection algorithms. Table 1 lists the main properties of each algorithm. The proportional reporting ratio (PRR) [9] and the reporting odds ratio (ROR) [10,11] are two popular measures from frequentist statistical methods. For each drug-ADR pair, we construct a 2×2 contingency table (Table 2) and compute the signal scores from its cell counts. In this paper, we use PRR05 (referred to as PRR) and ROR05 (referred to as ROR) as baseline methods in the experiments. The multi-item Gamma Poisson shrinker (MGPS) [12,31] and the Bayesian confidence propagation neural network (BCPNN) [13] are widely used Bayesian approaches for signal detection. We adopt EB05 of MGPS and BCPNN25 of BCPNN as our baseline methods.
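The frequentist scores can be illustrated with a minimal sketch. The cell names a, b, c, d, the function name prr_ror, and the choice of 1.645 standard errors on the log scale for the lower bound are assumptions made for illustration; they are not taken from Table 2 or from the paper's own equations.

```python
import math

def prr_ror(a, b, c, d, z=1.645):
    """Return (PRR, PRR lower bound, ROR, ROR lower bound).

    Assumed cell layout (hypothetical; Table 2 is not reproduced here):
      a: reports with the drug and the ADR
      b: reports with the drug, without the ADR
      c: reports without the drug, with the ADR
      d: reports with neither the drug nor the ADR
    """
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a * d) / (b * c)
    # Standard errors of the log-transformed estimates.
    se_prr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    se_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Lower bounds; z=1.645 is an assumed one-sided 5% quantile.
    prr_low = math.exp(math.log(prr) - z * se_prr)
    ror_low = math.exp(math.log(ror) - z * se_ror)
    return prr, prr_low, ror, ror_low

# Toy counts, not FAERS data.
prr, prr_low, ror, ror_low = prr_ror(a=20, b=80, c=100, d=9800)
```

Using a lower bound rather than the point estimate penalizes pairs with very few reports, which is the behavior that makes these baselines weak for newly approved drugs.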

Constructing drug similarity network
We construct a drug similarity network based on chemical structures. Specifically, we treat drugs as nodes of the network and compute the edge weights from drug chemical structure similarities. The similarity is based on a chemical structure fingerprint corresponding to the 881 chemical substructures [32] defined in PubChem. Each drug is represented by an 881-dimensional binary profile whose elements indicate the presence or absence of the corresponding PubChem substructures with value 1 or 0. The Jaccard similarity between two drugs is calculated by

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{3}$$

where $A$ and $B$ denote the profiles of two drugs, $|A \cap B|$ is the number of substructures present in both profiles, and $|A \cup B|$ is the number present in at least one of them.
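The similarity computation can be sketched as follows. The 8-bit toy profiles stand in for the 881-bit PubChem fingerprints, and the drug names and the jaccard helper are hypothetical.

```python
def jaccard(fp_a, fp_b):
    """Jaccard similarity of two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    either = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    # Convention assumed here: two empty profiles have similarity 0.
    return both / either if either else 0.0

# Toy 8-bit profiles (hypothetical drugs, not PubChem data).
fingerprints = {
    "drug_a": [1, 1, 0, 0, 1, 0, 1, 0],
    "drug_b": [1, 0, 0, 0, 1, 0, 1, 1],
    "drug_c": [0, 0, 1, 1, 0, 1, 0, 0],
}
names = sorted(fingerprints)
# Symmetric drug-by-drug similarity matrix.
similarity = [
    [jaccard(fingerprints[p], fingerprints[q]) for q in names]
    for p in names
]
```

In this toy example drug_a and drug_b share three of five distinct substructures, while drug_a and drug_c share none, so only the first pair would be connected by a weighted edge.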

Generating enhanced drug safety signals
Label propagation algorithms are widely used to discover latent information in weighted graphs [33] and have been applied to biomedical problems [34]. At the beginning of the algorithm, a small portion of the nodes have labels, and these labels are propagated to the previously unlabeled nodes. In our method, we generate enhanced drug safety signals by propagating the original signals on the drug similarity network. The weighted graph with $N$ nodes is constructed from the $N \times N$ drug similarity matrix $A$, where $A_{i,j} \ge 0$ represents the similarity between drug $i$ and drug $j$. Drugs are treated as nodes in the graph and the edge weights are given by the drug similarities. The signal score matrix $S$ of drug-ADR pairs, where $S_{i,j}$ denotes the signal score of the combination of drug $i$ and ADR $j$, is used as the initial labels of the nodes. For drug $D_i$, the initial labels are the $i$th row of the signal score matrix $S$, denoted $S_i$. The label information of the initial drug nodes is propagated to the other nodes through the weighted edges of the graph by an iterative approach. To guarantee the convergence of the updates, the original drug similarity matrix $A$ needs to be normalized so that each row sums to one. We denote the normalized matrix as $W$.
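The row normalization step can be sketched as below. The zero diagonal (no self-similarity edge) and the handling of a drug with no neighbors are assumptions for illustration; the source does not specify either.

```python
def row_normalize(A):
    """Scale each row of a non-negative matrix so that it sums to one."""
    W = []
    for row in A:
        total = sum(row)
        # Assumption: a drug with no similar neighbors keeps an all-zero row.
        W.append([x / total if total > 0 else 0.0 for x in row])
    return W

# Toy symmetric similarity matrix with an assumed zero diagonal.
A = [
    [0.0, 0.6, 0.2],
    [0.6, 0.0, 0.4],
    [0.2, 0.4, 0.0],
]
W = row_normalize(A)
```

Note that W is generally no longer symmetric after normalization: each row describes how one drug distributes its attention over its own neighbors.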
Using $W$, we propagate labels from the labeled drug nodes to the unlabeled nodes. In every iteration, each node updates its label information by absorbing labels from its neighbors with probability $\gamma$ and retaining its initial labels with probability $(1 - \gamma)$. The update for drug node $i$ from step $t - 1$ to step $t$ is

$$Y_i^{t} = \gamma \sum_{j=1}^{N} W_{i,j} Y_j^{t-1} + (1 - \gamma) S_i \tag{4}$$

where $Y_i^{t}$ represents the updated label information of drug node $i$ in the $t$th iteration, and $0 < \gamma < 1$ is the absorbing probability that determines how much label information is absorbed from neighbors. Considering all drug nodes at the same time, we can write the updating formula (4) in matrix form,

$$Y^{t} = \gamma W Y^{t-1} + (1 - \gamma) S \tag{5}$$

After $t$ iterations, with $Y^{0} = S$, (5) can be written as

$$Y^{t} = (\gamma W)^{t} S + (1 - \gamma) \sum_{k=0}^{t-1} (\gamma W)^{k} S \tag{6}$$

Since $\sum_{j=1}^{N} W_{i,j} = 1$, the spectral radius $\rho(W) \le 1$. Because $0 < \gamma < 1$, we have $\lim_{t \to \infty} (\gamma W)^{t} = 0$ and $\lim_{t \to \infty} \sum_{k=0}^{t-1} (\gamma W)^{k} = (I - \gamma W)^{-1}$, where $I$ is the identity matrix of order $N$. Therefore, the iteration of the updating formula converges to (the proof of convergence can be found in [33])

$$Y = (1 - \gamma)(I - \gamma W)^{-1} S \tag{7}$$

where $Y$ is the final label information for the $N$ drug nodes and $S$ is the matrix of initial label information.
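A minimal sketch of the iterative propagation follows. It assumes the update absorbs neighbor labels with weight γ and retains the initial signals with weight 1 − γ, which is the form consistent with the convergence argument; the function name, the toy matrices and the fixed iteration count are hypothetical.

```python
def propagate(W, S, gamma, n_iter=200):
    """Iterate Y <- gamma * W @ Y + (1 - gamma) * S, starting from Y = S."""
    n, m = len(S), len(S[0])
    Y = [row[:] for row in S]
    for _ in range(n_iter):
        Y = [
            [
                gamma * sum(W[i][k] * Y[k][j] for k in range(n))
                + (1 - gamma) * S[i][j]
                for j in range(m)
            ]
            for i in range(n)
        ]
    return Y

# Toy row-normalized similarity matrix (3 drugs) and signal scores (2 ADRs).
W = [[0.0, 0.75, 0.25], [0.6, 0.0, 0.4], [1 / 3, 2 / 3, 0.0]]
S = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
gamma = 0.5
Y = propagate(W, S, gamma)

# At convergence Y is a fixed point of the update, so one more step changes nothing.
residual = max(
    abs(Y[i][j] - (gamma * sum(W[i][k] * Y[k][j] for k in range(3))
                   + (1 - gamma) * S[i][j]))
    for i in range(3)
    for j in range(2)
)
```

Because γ < 1, the contribution of the starting point shrinks geometrically, so a fixed number of iterations is enough for a small graph; for large drug networks the closed-form solution with a linear solver would be the natural alternative.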
To generate signals for a new drug, we set its initial signals for all ADRs to 0 and then calculate its enhanced signals by label propagation, so that its scores are derived from structurally similar drugs. In general, the original signal scores computed by common signal detection algorithms are further improved through label propagation on the drug similarity network. The final labels (scores) can be regarded as the improved signals for drug-ADR pairs.
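For intuition, a drug whose initial signals are all 0 ends up, at the fixed point of the propagation, with scores equal to γ times the similarity-weighted average of its neighbors' propagated scores. The sketch below uses that identity under a simplifying assumption: the propagated scores of the existing drugs are treated as already computed and unaffected by the new drug. The function name and inputs are hypothetical.

```python
def new_drug_scores(similarities, Y_existing, gamma):
    """Score a drug with no reports from its neighbors' propagated scores.

    similarities: similarity of the new drug to each existing drug.
    Y_existing:   propagated scores of existing drugs (rows) per ADR (columns).
    """
    n_adrs = len(Y_existing[0])
    total = sum(similarities)
    if total == 0:
        # No similar drug in the network: nothing can be propagated.
        return [0.0] * n_adrs
    weights = [s / total for s in similarities]
    return [
        gamma * sum(w * row[j] for w, row in zip(weights, Y_existing))
        for j in range(n_adrs)
    ]

# Toy values: the new drug closely resembles the first existing drug.
Y_existing = [[1.0, 0.0], [0.0, 2.0]]
scores = new_drug_scores([0.9, 0.1], Y_existing, gamma=0.5)
```

In this toy case the new drug inherits most of its score for the first ADR from its closest structural neighbor, even though it has no case reports of its own.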

Experiment setup
The known drug-ADR pairs extracted from SIDER are treated as positive controls, and the unknown drug-ADR pairs are referred to as negative controls. Since the number of positive samples is much smaller than the number of negative ones, we randomly sample a subset of negative controls from all unknown pairs. The number of negative samples is twice the number of positive controls. To fully demonstrate the performance of our methods, we also compile an evaluation dataset with all drug-ADR pairs from SIDER as reference positives and the complement set of SIDER drug-ADR pairs as reference negatives (i.e., without any sub-sampling of negatives). We conduct the experiments on this alternative dataset and report the results in Table S2. Accuracy measures the proportion of drug-ADR pairs whose ground-truth labels are estimated correctly:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

F1 is defined as the harmonic mean of precision and recall:

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$

The proposed method has one parameter: the absorbing probability $\gamma$ of label propagation. We consider $\gamma$ in {0.1, 0.2, 0.3, ..., 0.9} and build the model with the $\gamma$ that yields the maximum AUC score. We evaluate the performance of the models for different parameter values and show the results in Fig. S2 of Additional file 1. The optimal values of $\gamma$ for each signal detection algorithm are shown in Table S3 of Supplementary Materials.
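The threshold-based metrics can be sketched as follows. The toy labels mirror the two-negatives-per-positive sampling described above; the function name and the binary predictions are hypothetical, and how scores are thresholded into predictions is not specified in the source.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Two positive controls and four sampled negative controls (toy data).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

With imbalanced controls, accuracy alone can look favorable even when half of the positives are missed, which is one reason to report F1, AUC and AUPR alongside it.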

Performance evaluation on all ADRs
We compare the proposed methods with four baselines (PRR, ROR, MGPS, BCPNN) using data from all years and report the six metrics in Table 3. "LP-Method name" denotes the proposed method together with the signal detection algorithm used to generate the original signals. From Table 3, we observe that among the four signal detection algorithms, MGPS outperforms the other baseline methods, with the best AUC and AUPR scores. Our methods are better than all the corresponding baseline methods in terms of AUC scores, AUPR scores and precision. The results indicate that drug-drug similarities can help to enhance the safety signals, since similar drugs may induce the same ADRs. In this way, the original drug safety signals are improved by incorporating information from similar drugs.
We also plot the yearly change curves of AUC and AUPR scores for LP-MGPS and MGPS in Fig. 3. Here, each year 2004, 2005, ..., 2014 on the horizontal axis indicates that the reports used to generate signals are accumulated from 2004 to that year (i.e., 2008 denotes that reports from 2004 to 2008 are used to generate signals). According to Fig. 3, our method LP-MGPS outperforms its corresponding baseline MGPS in every cumulative year. In addition, the advantage of the proposed method is most pronounced when only reports from the early years are available.

Performance evaluation on representative ADRs
To further characterize the performance of the proposed method, we select ADRs from the Designated Medical Event (DME) list [35] for additional comparisons. DME is a list of inherently serious ADRs, given as standardized medical concept terms, released by the European Medicines Agency (EMA). We map the ADRs of DME to our datasets and remove the ADRs associated with fewer than 10 drugs. The remaining 31 ADRs are considered for performance evaluation, and Table 4 shows the comparison of the proposed LP-MGPS and the original MGPS algorithm on the top 15 ADRs ranked by AUPR scores. "Number of positive drugs" denotes the number of drugs that are associated with each ADR. Here, we use MGPS as our base signal detection algorithm since it yields the highest AUC and AUPR scores for this task. According to the results, the proposed method is better than the corresponding baseline method on all 15 ADRs in terms of AUPR scores, and it outperforms the baseline in most cases in terms of AUC scores. (More experiments on these representative ADRs can be found in Table S5 and Table S6 of Additional file 1.)

Discussion
A label propagation framework is built in this study, which enriches post-market safety reports with pre-clinical drug similarity network to generate enhanced safety signals.
The proposed method outperforms the corresponding baselines overall and on the serious ADRs selected from the DME list, and the MGPS-based variant achieves the best performance.
We further demonstrate the performance of the proposed method on newly approved drugs that have few (or even no) reports in SRS. FDA releases safety-related labels for a drug from its approval onward, and ADRs are recorded in the drug's labeling information. The labeling information may be revised quarterly based on post-marketing surveillance. Here, we report the performance of ADR detection for two recently approved drugs, "liraglutide" and "pazopanib", in Fig. 4. We use the MGPS-based method to generate the original signals since we obtain the best performance with MGPS. We compute the yearly ranking of the drug for the ADR and the number of drug-ADR cases in SRS. The horizontal axis represents the cumulative years from 2004 to the current year. The rank on the vertical axis denotes the percentile of the drug ranking, which is calculated as $\frac{\text{rank of the drug}}{\#\,\text{all drugs}} \times 100$ after sorting the entire drug list in descending order.
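The percentile ranking can be sketched as below. The 1-based rank, the handling of ties by sort order, and the toy scores are assumptions for illustration.

```python
def percentile_rank(scores, drug):
    """Percentile of a drug's rank after sorting all drugs by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    rank = ordered.index(drug) + 1  # assumed 1-based rank; best drug has rank 1
    return rank / len(ordered) * 100

# Hypothetical signal scores of four drugs for one ADR.
scores = {"drug_a": 0.9, "drug_b": 0.4, "drug_c": 0.7, "drug_d": 0.1}
pct = percentile_rank(scores, "drug_c")
```

Under this convention a smaller percentile means the drug sits closer to the top of the list for that ADR.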
Liraglutide is a medication used to treat diabetes or obesity [36]. It was approved for medical use in the United States in 2010 [37] and in Europe in 2009 [38]. In 2011, renal failure was added to the labeling information of liraglutide [39]. According to Fig. 4a, the pair Liraglutide-Renal failure first appeared in SRS in 2010 and accumulated to 11 cases by 2014. Thus, the baseline, which relies entirely on sufficient cases, can only generate signals for this pair after 2010. The ranking of liraglutide gradually increases as more years of data accumulate. The proposed method performs better than the baseline after 2010. More importantly, the proposed method is able to generate signals before 2010 and can predict that liraglutide causes renal failure as early as 2005 by taking the case reports of drugs similar to liraglutide into consideration. Therefore, the proposed method can detect this safety signal early.
Pazopanib is a medicine used for the treatment of advanced renal cell carcinoma (RCC) and advanced soft tissue sarcoma (STS) [40]. It was approved for medical use in the United States in 2009 [41] and in Europe in 2010 [42]. Impaired wound healing was included as one of the syndromes in the labeling information of pazopanib in 2014 [43]. The pair Pazopanib-Impaired wound healing, shown in Fig. 4b, was first reported in SRS in 2009 and accumulated to 77 cases by 2014. The baseline cannot generate signals for Pazopanib-Impaired wound healing without any cases. However, the proposed method is able to identify potential safety signals before 2009, and the yearly rankings of pazopanib confirm that our method can detect the safety signals prior to the FDA safety label change.
The above instances suggest that the algorithm is able to detect drug safety signals before approval and before the drug label change, earlier than the baseline, which can only generate signals once cases have been reported.

Conclusions
In this paper, we present a label propagation framework, which integrates drug chemical information with post-market safety reports, to generate enhanced drug safety signals. The drug safety signals are enhanced through label propagation on the drug similarity network computed from the chemical information. We compare the performance of our methods with four different state-of-the-art signal detection algorithms (PRR, ROR, MGPS, BCPNN) using safety reports from SRS. The results demonstrate that the proposed methods outperform their corresponding baselines in generating accurate drug safety signals. Our experiments show that our methods are able to detect potential ADRs for newly approved drugs with few safety reports, which paves the way for early detection of ADRs.
This study can be extended in multiple directions in the future in terms of both drug features and post-market real-world evidence. Other types of available data sources of drugs such as chemical-protein binding and therapeutic indication data can be leveraged for the construction of drug similarity networks. Furthermore, the label propagation framework can be applied to enhance drug safety signals generated by other real-world evidence such as EHRs and medical claims.