Hybrid phenotype mining method for investigating off-target protein and underlying side effects of anti-tumor immunotherapy

Background It is of utmost importance to investigate novel therapies for cancer, as it is a major cause of death. In recent years, immunotherapies, especially those against immune checkpoints, have been developed and brought significant improvement in cancer management. However, on the other hand, immune checkpoints blockade (ICB) by monoclonal antiboties may cause common and severe adverse reactions (ADRs), the cause of which remains largely undetermined. We hypothesize that ICB-agents may induce adverse reactions through off-target protein interactions, similar to the ADR-causing off-target effects of small molecules. In this study, we propose a hybrid phenotype mining approach which integrates molecular level information and provides new mechanistic insights for ICB-associated ADRs. Methods We trained a conditional random fields model on the TAC 2017 benchmark training data, then used it to extract all drug-centric phenotypes for the five anti-PD-1/PD-L1 drugs from the drug labels of the DailyMed database. Proteins with structure similar to the drugs were obtained by using BlastP, and the gene targets of drugs were obtained from the STRING database. The target-centric phenotypes were extracted from the human phenotype ontology database. Finally, a screening module was designed to investigate off-target proteins, by making use of gene ontology analysis and pathway analysis. Results Eventually, through the cross-analysis of the drug and target gene phenotypes, the off-target effect caused by the mutation of gene BTK was found, and the candidate side-effect off-target site was analyzed. Conclusions This research provided a hybrid method of biomedical natural language processing and bioinformatics to investigate the off-target-based mechanism of ICB treatment. The method can also be applied for the investigation of ADRs related to other large molecule drugs.


Background
Immunotherapy has been very popular in cancer treatment recently, owing to the breakthroughs made in the past few decades, including investigation of mechanisms and related immune response [1]. Mechanisms of action vary among the different types of cancer immunotherapies including oncolytic viruses therapy, cancer vaccines and monoclonal antibodies therapy. Unlike the traditional immunotherapies, the monoclonal antibodies therapy accurately targets an inhibitory-signaling pathway, which is related to Programmed Death-1/Programmed cell Death-Ligand 1 (PD-1/PD-L1) [2]. Monoclonal antibodies are proteins produced by immune cells that specifically recognize cellular targets in the context of cancer treatment. Monoclonal antibodies can inhibit the activity of specific proteins in cancer cells to kill cells or prevent them from growing [3].
Immunotherapy has cured many people's immunerelated diseases, brought new medical tools to humans, and promoted new developments in medicine. However, while curing the disease, it also brings many side-effect diseases to people. Therefore, studying the side-effect mechanism of immunomonoclonal drugs can better provide more powerful evidence for drug modification and production.

PD-1/PD-L1 mechanism of action
The binding of PD-1 to its ligand PD-L1 is critical for the homeostatic regulation of the immune system. The primary role of the PD-1/PD-L1 signaling process is to suppress autonomic T cells and accelerate programmed cell death of T cells for the prevention of human immune diseases. After antigen receptor and cytokine signaling, PD-1 is expressed on mature CD4+ and CD8+ T cells, killer T cells, and some cell subsets of B cells [4]. PD-L1 is ubiquitously expressed in a wide variety of immune and tumor cells, such as T cells, B cells, macrophages, regulatory T cells, dendritic cells, and some non-immune cell types, such as vascular endothelial cells and pancreatic cells [5]. However, in the tumor microenvironment, PD-L1 levels in bone marrow cells (eg, macrophages, MDSCs, and DCs) are much higher than in tumors and stromal cells, compared to much lower PD-L1 expression in lymphocytes [6].
In a tumor-free environment, PD-L1 expression is restricted in normal tissues [7]. Binding of B7.1 or B7.2 on APC to CD28 on T cells upregulates T cell survival protein (IFN-γ , Bcl-xL and IL-2) expression by PI3K activation of T cell intracellular AKT pathway. The T cell proliferation initiates an anti-tumor response, as shown in Fig. 1. In contrast, in the tumor environment, PD-L1 expressed on tumor cells binds to PD-1 located on T cells and is enriched by phosphatase SHP-2. The AKT pathway causes T cell survival protein expression to decrease, resulting in T cell immune failure, as shown in Fig. 1 [8]. However, in recent studies, it has been found that most of the "contribution" in the process of tumor escape is derived from the host's own cells. In this case, PD-L1 is not expressed in most of the tumor cells, but highly expressed in the adjacent cells around the tumor cells and inflammatory cells. It is reported that PD-L1, which is expressed at high levels on bone marrow cells, plays a important role in suppressing T cell responses [6]. When a greater inhibitory effect is activated in tumor cell, the expression of PD-L1 on tumor cells shows transient and diminishes rapidly. If the tumor is in an early stage and has just begun to evade immune surveillance, its effect is stronger [9]. For example, the relative importance of tumor PD-L1 may be affected by the inherent immunogenicity of the tumor. If the immune prototype is strong, the tumor's PD-L1 is sufficient to trigger immune escape [10]. In addition, the degree of expression of PD-L1 is related to other genetic variations. After the epidermal growth factor receptor is mutated into NSCLC cells, PD-L1 is biased to be expressed on the cell surface [5].
The PD-1/PD-L1 inhibitor targets the epitope on the PD-1/PD-L1 molecule with high affinity and specificity. By blocking their interaction pathways, T cells can "recognize" cancer cells and initiate immune responses [11].

Side effect and phenotype mining
The PD-1/PD-L1 inhibitor drugs that have been approved by the FDA are mainly: 1. Anti-PD-1 antibodies: Nivolumab developed by BMS, Pembrolizumab of Merck, 2. Anti-PD-L1 antibody: Atezolizumab, Pfizer of Roche /Avelumab from Merck, Germany, Durvalumab from AstraZeneca, etc [12,13] (See Table 1). Side effects of drugs are related to adverse reactions in human medicine, and it is regarded as a typical phenotype. The common side effects of PD-1 drugs include constipation, fever, cough, dyspnea, headache, hair loss, chills, rapid heartbeat, sore throat, abnormal fatigue or muscle weakness, sputum and stiffness, joint pain, muscle cramps and stiffness, hoarseness, abnormal weight [14]. Common side effects of PD-L1 drugs include: bladder pain, fever, cough, urinary blood stasis or turbidity, difficulty urinating, burning or pain, numbness of hands and feet, watery or bloody diarrhea, hoarseness, abnormal weight, chills, stomach cramps, extremely tired or weak, waist pain, rapid or slow heartbeat, swelling of the face or limbs [15].
Campillos et al. [16] suggested that if there is an identical side-effect phenotype between different drugs, it reveals that these drugs share a target. Their results show that analyzing the side-effects of drugs can reveal new targets for drug action. Actually, novel targets offer new clinical treatment possibilities. Keiser et al. [17] used a statistical-based chemical informatics approach to predict new off-targets for drug compounds. The similarity ensemble approach (SEA)was used to calculate the structural similarity between drugs and targets, and to infer drug-target associations.
Furthermore, using the adverse effects of drug targets to establish a large-scale drug-target-adverse reaction network is a novel way to explain the ADR coincidence between target genes and drug target genes [18,19]. Owing to the rapid growth of biomedical literature in recent years, it is feasible to automatically extract knowledge from the published papers, such as the drugs and diseases they mention and their relations in a large scale [20,21]. The knowledge found can be used in many biomedical related fields such as drug discovery, safety monitoring, and drug side-effect detection [22,23]. Manually identifying such entities can be very accurate, but is also time consuming and laborious. Developing automatic annotation systems based on Natural Language Processing (NLP) technology is an alternate method to identify drug and disease entities in text [24]. It may be applied in large scale and provide a way to discover relations among phenotypes and off target effects. In this paper, a pharmacological knowledge mining strategy iss proposed in the form of hybrid phenotypic mining by combining Biomedical Natural Language Processing (BioNLP) with medical informatics. First, we targeted at five PD-1/PD-L1 inhibitor drugs, and extract all the adverse drug reaction (ADR) from DailyMed drug labes by using conditional random fields (CRF) [19]. Second, we targeted at all the targeted proteins of the drugs, and extracted all the phenotype terms of the proteins from the Human Phenotype Ontology (HPO). To integrate the phenotype from ADR and HPO terms, we cross matched the phenotype terms with embedding similarity and find intersected terms, which was eventually used to investigate the target phenotype. As a result, the mutation of gene BTK was investigated and found to be relevant to the off-target effect of anti-tumor immunology therapy.

Drugs and drug labels
Five drugs against PD-1/PD-L1 are listed at Table 1, and their amino acid sequences were collected from Drugbank [12,13].
The N-terminal of an lgG antibody, approximately 120 amino acids, is a complementarity determining region (CDR). It was recently found that the heavy chain CDR2 of Avelumab binds to a specific part of PD-L1 and prevents PD-1 from being linked to PD-L1, thereby blocking the channel effect of PD-1/PD-L1 [25]. Both Atezolizumab and Durvalumab use all three CDRs from the heavy chain and two complementarity determining regions from the light chain to form contact with PD-L1 [26]. All three CDR loops in the Nivolumab VH region and the CDR1/CDR2 loops in VL provide partial interaction, without contact of LCDR3. Nivolumab and Pembrolizumab bind to different parts of PD-1, and these parts overlap, preventing Pembrolizumab antibodies from continuing to bind to PD-1 [27]. Therefore, the CDR region sequence of the light and heavy chain of the drug protein sequence was used in this research for BlastP alignment, which consisted of ∼240 amino acids.
Taking Pembrolizumab as an example, the amino acid sequences of the heavy and light chains are shown in the Fig. 2. The antibody consists of 4 strands, and the Fab segment is the part that can specifically recognize the antigen.
Heavy Chain Sequence As training data for phenotype extraction, we downloaded the gold dataset of the TAC 2017 shared task on adverse reaction extraction [28], which consists of 100 drug labels and annotated adverse reactions. Meanwhile, the drug labels of 5 targeted drugs were downloaded from DailyMed [29] (https://dailymed.nlm.nih.gov/dailymed/) in XML format.

Omics data sets
Several multi-Omics data sets were used in this research.
i) HPO. Human phenotype ontology (HPO) contains standardized vocabulary for phenotypic abnormalities in human diseases [30]. ii) UniproKB. We used the local BlastP [31] and protein database UniprotKB [32] to obtain homologous sequences of five drugs. iii) STRING. Protein-protein interaction (PPI) analysis helps to study the molecular mechanisms of disease and discover new drug targets. In this paper, antibody-producing proteins may be obtained from the STRING database.

Proposed method: hybrid phenotype mining method
Though the exact physiological process of immunerelated side effects is unclear, it is thought to be closely related to maintaining immune homeostasis. Since most immune-related adverse reactions resolve within weeks to months after the onset of immunosuppressive therapy, one of the most important clinical practice is to restore the safety of immune checkpoint blockade after the adverse events have subsided. Because the study protocol often requires that if there is a serious adverse reaction related to the immune system, the immune checkpoint blockade treatment must be permanently stopped, so the prospective data from clinical trials is limited [33]. Therefore, a deeper understanding of these mechanisms will help us find new ways to effectively deal with PD-1/PD-L1 tumor escape and solve the above safety problems. Figure 3 shows the general flow chart of the proposed hybrid phenotype extraction method. The pipeline consists of one pre-processing step and three computation modules, i.e., extraction module of drug-centric phenotypes, extraction module of target-centric phenotypes, screening module for off-target proteins.
Pre-processing module. This module downloads the anti-PD-1/PD-L1 drug from the Drugbank database, and the XML description file from the DailyMed database. In addition, it downloads the protein sequence of the antibody drug from the Drugbank database.
Extraction module of drug-centric phenotypes. As shown in the left module section in Fig. 3, drug-centric phenotype terms were extracted from DailyMed texts of drug labels with Conditional Random Fields. Wapiti [34] is used as the CRF implementation.
Step-1: Formatting drug label texts. The BIEO scheme is used as the entity tagging format. ADRs in the results are marked as B-AdverseReaction, I-AdverseReaction, E-AdverseReaction, where B-type means begin, I-type means Inside, E-type means End, and O indicates Outside. For instance, if the word "pruritis" is an ADR entity, it is marked as "B-AdverseReaction"; if a phrase "injection site hemorrhage" is an ADR entity, the phrase is marked as "B-AdverseReaction I-AdverseReaction E-AdverseReaction".
Step-2: Setting up feature function. DISO (Disorders) is a semantic group in the UMLS [35] through which we built a dictionary of disorder terms from the UMLS Metathesaurus [36]. This semantic group consists of the following 12 semantic types: acquired abnormality, anatomical abnormality, cell or molecular dysfunction, congenital abnormality, disease or syndrome, experimental model of disease, finding, injury or poisoning, mental or behavioral dysfunction, neoplastic process, pathologic function, and sign or symptom.
Step-3: NER of phenotype terms with CRF. Gold standard annotated text from the TAC shared task is first used to learn a CRF model. That model is then applied to the unannotated DailyMed drug label texts to detect drug side effects. The training data is randomly divided according to 7: 3. During the training model, 7 copies use the actual training model and 3 copies use the development set to adjust the parameters. After the model is optimized, all data is applied to the DailyMed drug label texts to detect side effect mentions.
Extraction module of target-centric phenotypes. As shown in the right module section in Fig. 3, target-centric phenotype terms were extracted from a series of database queries, including BlastP, STRING, and HPO.
Step-1: Sequence similarity computation. We used the native BlastP and protein database UniprotKB to obtain homologous sequences of the five drugs. The parameters were set to: Indentities> 30% and E-value > 10 − e4.
Step-2: PPI extension for searching target with function relevance. We entered the protein name in the BlastP result into the STRING database, setting the confidence parameter to 0.7 and the maximum number of display nodes to 500.
Step-3: Target-centric phenotypes extraction with HPO query. We downloaded the HPO database and compared the previous protein interaction results with the HPO database. We thus obtained the side effect phenotype of each drug, and predicted the most likely drug off-target site.
Screening module for off-target proteins. A Gene Ontology (GO) analysis is performed using the obtained gene list information. We analyzed the signal pathways involved in each gene to screen for genes related to T cell proliferation activation, immune regulation, such as AKT and PI3K pathways, and matched the genes by crosschecking. Eventually, we performed literature search to find evidence supporting their relevance.

Results
In this section, the results of the proposed hybrid phenotype mining method are given. Specifically, drug-and target-centric phenotype mining results were obtained and comprehensively integrated to unveil a cluster of possible target proteins of five drugs.

Results for drug-centric phenotypes extraction
The precision (proportion of correctly predicted entities), recall (proportion of gold entities actually predicted) and F-score (harmonic mean of precision and recall) obtained on the training data for each entity type are shown in Table 2.
After training, the model was used to extract all adverse reactions from the drug labels of the five targeted drugs. The extraction results are shown in Fig. 4. As can be seen from the figure, many side effects are extracted for each drug. An examination of the results shows many synonyms, which need to be further screened. A total of 905 drug side effects were extracted from the five drug labels.

Results for target-centric phenotypes extraction
The results for the target-centric phenotypes are shown in Fig. 5. i) BlastP results. We run BlastP to obtain proteins with similar structures of five target drugs. In this process, a total of 2,540 protein sequences were found before screening, and 848 sequences remained after screening. In the pre-screening results, some proteins will have different positions to match the drug protein sequence, so the number of proteins before screening will be repeated. For instance, P0DOX2, P0DOX3, P0DOX4, P0DOX5, P0DOX6, P0DOX7, P0DOX8 are immune one strand of globulin that do not have its corresponding gene name, so the number of proteins after screening is reduced. Although the five drugs were reduced before and after screening, the reduction trend was about the same, with Pembrolizumab decreasing the most, i.e., 70%. The statistic summary of the protein after BlastP is shown in Fig. 5a.  ii) STRING results. We use STRING to obtain proteins that have similar structure with BlastP screened proteins. Thus an expanded candidate off-target protein set is obtained. The statistic summary is shown in Fig. 5b. As can be seen from the figure, a total of 134 possible target proteins were found, of which Atezolizumab found 31%, while Pembrolizumab found fewer proteins, i.e., only 14%. iii) HPO results. HPO terms were extracted from the drug labels of the five targeted drugs. As can be seen from Fig. 5c, a total of 2,071 HPO terms were found, of which Atezolizumab accounted for the most, at 36%, and Pembrolizumab had the least proportion, at 11%.

Results for off-target proteins screening
i) Side effect phenotype cross matching results. We use embedding similarity strategy [19] to match intersected terms in adverse reaction terms from drug labels and HPO terms from candidate off target proteins. As can be seen from Fig. 6, the total match has 110 side effects, of which Atezolizumab has the largest proportion of 38%, and Pembrolizumab has the least proportion of 10%. ii) GO analysis results. Gene ontology analysis aims to investigate the function of the filtered proteins. By filtering the genes with function in the signal pathways related to T cell proliferation activation, immune regulation, only three genes remains, which are AKT1, ACTG2, and BTK. The results of drug/gene/phenotype link are shown in Table 3. In the result of GO analysis, ACTG2 is closely related to T cell activation and immune process, and may be associated with drug off-target. AKT1 is associated with inflammatory factor-mediated diseases, in which IFN-γ is also involved in immunosuppressive signaling, helping tumors to escape. Bruton's tyrosine kinase (BTK) is involved in the activation process of B cells and is involved in immune regulation. However, the other two genes did not find relevant side effects, and there may be few articles reporting the corresponding side effects of the gene, so no relevant literature was found.

Discussion and conclusion
In this section, the process of phenotype mining is discussed, and a candidate side-effect off-target site is proposed and analyzed accordingly.

Phenotypic mining strategy on finding BTK
The screening process of BTK discovery was represented in Fig. 3 with respect to the numbers of hit counts for drugs, phenotypes and candidate core proteins. First, the module of drug-centric phenotype extraction started with five PD-1/PD-L1 inhibitors, searched the DailyMed database for related side-effect phenotypes, and used a CRF for sequence labeling to mine 905 drug-centric side effects. The precision of that process was 88% on the training data, so a moderate amount of noise is expected. This noise should however be reduced by the intersection with target-centric phenotype extraction. Conversely, recall was 68%, meaning that about one third of the drug side effects are likely to be missed.
Second, in the module of target-centric phenotype extraction, a total of 848 amino acid sequences were found to be similar with sequences of the five PD-1/PD-L1 inhibitors by using blastP. Subsequently, a total of 134 target proteins of drug were found in the STRING database. Finally, the 134 proteins were compared with the HPO database, and a total of 2,071 target-centric side-effect phenotypes were found.
Third, in the screening module for off-target proteins, the phenotypes of the first two steps were cross-matched, and a total of 110 target-centered target proteins were found. For more accurate screening, we performed GO analysis and found genes related to AKT and PI3K pathways. A total of three were found: ACTG2, AKT1 and BTK. Since only BTK related literature reports were found during the literature search, only one gene related to PD-1/PD-L1 off-target was found.

Candidate side-effect off-target site
As introduced in the result section, the proposed hybrid phenotype extraction method found a related side-effect off-target site, BTK, and we hypothesized that the offtarget site caused a cellulitis side effect due to mutation of BTK gene.
While tracing back the side effect of Avelumab and Pembrolizumab in Table 3, both of which being relevant with BTK and suffered cellulitis, it is possibly BTK that played a role of side-eefect off-target site in terms with PD-1/PD-L1 blockage.
Literature research unveiled the fact that BTK is associated with B cell activation. If B cells are not activated, the efficiency of immunization is greatly reduced, and it also affects the PI3K pathway, which affects autoimmunity and causes adverse immune reactions. As shown in Fig. 7, the left part is the domain structure of the BTK enzyme, in which the PH-domain is mainly involved in B cell signal transduction. In the previous report [37], the G−→A mutation in the PH-domain causes BTK structural variation and make it fail to recognize the substrate to complete the signal transduction, hinders the development and maturation of B cells. It ends the progenitor B cell phase of the differentiation stage, resulting in a large decrease in plasma cells and affecting the immune response.
In detail, the gene encoding BTK is mutated to form XLA. The defect of BTK leads to the development of B cells in the development of differentiation. Therefore, patients are basically lack of antibody-producing plasma cells, and are susceptible to each The effects of infectious diseases such as Helicobacter. H. cinaedi, formerly known as Campylobacter, in patients with weak immune function, diseases caused by H. cinaedi include repeated fever, bacteremia, arthritis, osteomyelitis, cellulitis, abdominal abscess and diarrhea [38].
In conclusioin, the literature evidence of cellulitis is consistent with the above hypothesis.

Future of PD-1/PD-L1 targeted therapy
Although PD-1/PD-L1 targeted therapy has proven the enormous power of cancer immunotherapy, in recent studies, anti-PD-1 drugs have no therapeutic effect on non-Hodgkin's lymphoma, but have a tendency to deteriorate. PD-1 signaling in a mouse model prevents the proliferation of cancerous T cells, and in these mice, anti-PD-1 treatment can aggravate the disease by reactivating cancer cells for continued proliferation [39]. Although this result is individual specific, it also shows that PD-1 inhibitors are not very safe. Activation of the immune system by PD-1/L1 inhibitors also attacks islet cells, resulting in reduced insulin secretion and type 1 diabetes [40]. Ludin and Zon also found that anti-PD-1 therapy may only ameliorate the proliferation of specific T cell subsets that induce cancer, and that anti-PD-1 treatment reduces the effects of phagocytic cells, thereby accelerating tumors. proliferation [41]. In addition, during the treatment of immune checkpoints, not all patients responded to PD-1 monoclonal antibodies, probably because their tumor immunity was limited. Therefore, a combination method for different targets can improve the effect of tumor treatment. The clinical combination of Ipilimumab (an anti-CTLA-4 monoclonal antibody) and Nivoluma has a better therapeutic effect than the use of one antibody alone, but since many side effects have been reported during the single or combined administration, Includes whole body, skin disease, gastrointestinal tract, lung and endocrine [42]. Immunological checkpoint molecules involved in major immunosuppressive effects will vary in different cancers and individual cases, so study the different immunomodulatory mechanisms of the PD-1/PD-L1 pathway in each cell to determine which ones It is important to use a combination of cloned antibody drugs to better enhance the therapeutic effect. It is important to determine the precise mechanism of the combination drug targeting.
PD-1/PD-L1 is not classified according to tumor type but based on biomarkers [14], PD-L1 expression may be a potential predictor of anti-PD-1/PD-L1 antibody therapy, in order to obtain better predictive therapeutic effects as early as possible, especially for predicting PD-1/PD-L1 targeted therapy. A better predictive system of response effects is capable of finding more accurate biomarkers [5]. Immunization-related adverse events against PD-1 therapeutic antibodies are usually reversible and can be well controlled by immunosuppressive therapy, to develop diagnostic tools to determine the most appropriate treatment goals, and to identify with immunological Fig. 7 The domain strucuture of BTK, and the fact of B lymphocyte inhibition due to BTK mutation checkpoints predictive biomarkers that are accurately correlated with clinical response or adverse events, so initial diagnosis and close clinical monitoring are critical to successful response to immune-related adverse events.
However, due to the complexity of cancer immunization, the mechanism is still unknown by which PD-1/PD-L1 targeted therapy improves the survival time of cancer patients, so the molecular and cellular mechanisms that clarify the immunosuppressive function of each immunological checkpoint molecule are helpful.

Abbreviations
Abbreviations used in this paper are listed below. ADR: Adverse reaction; BioNLP: Biomedical natural language processing; CDR: Complementarity determining region; CRF: Conditional random fields; DISO: Disorder; GO: Gene ontology; HPO: Human phenotype ontology; ICD: Immune checkpoints blockade; PD: Programmed cell death