PASCAL: a pseudo cascade learning framework for breast cancer treatment entity normalization in Chinese clinical text

Background Knowledge discovery from breast cancer treatment records has promoted downstream clinical studies such as careflow mining and therapy analysis. However, the clinical treatment text in electronic health data might be recorded by different doctors under their hospital guidelines, making the final data rich in author- and domain-specific idiosyncrasies. Therefore, breast cancer treatment entity normalization becomes an essential task for the above downstream clinical studies. The latest studies have demonstrated the superiority of deep learning methods in named entity normalization tasks. However, most existing approaches adopt pipeline implementations that treat normalization as an independent process after named entity recognition, which can propagate errors to later tasks. In addition, despite its importance in clinical and translational research, few studies directly deal with the normalization task in Chinese clinical text due to the complexity of its composition forms. Methods To address these issues, we propose PASCAL, an end-to-end and accurate framework for breast cancer treatment entity normalization (TEN). PASCAL leverages a gated convolutional neural network to obtain a representation vector that captures contextual features and long-term dependencies. Additionally, it treats treatment entity recognition (TER) as an auxiliary task that provides meaningful information to the primary TEN task and acts as a particular regularization to further optimize the shared parameters. Finally, by concatenating the context-aware vector and the probabilistic distribution vector from TER, we utilize a conditional random field (CRF) layer to model the normalization sequence and predict the sequential TEN results. Results To evaluate the effectiveness of the proposed framework, we employ the three latest sequential models as baselines and build the model in single- and multitask settings on a real-world database.
Experimental results show that our method achieves better accuracy and efficiency than state-of-the-art approaches. Conclusions The effectiveness and efficiency of the presented pseudo cascade learning framework were validated for breast cancer treatment entity normalization in clinical text. We believe its predominant performance lies in its ability to extract valuable information from unstructured text data, which will significantly contribute to downstream tasks such as treatment recommendation, breast cancer staging and careflow mining.


Background
Breast cancer is one of the leading cancers with a high mortality rate. WHO reported that it is the second most common cause of cancer death in women [1]. In particular, developing countries are suffering from an increasing breast cancer epidemic with a growing number of younger women who are susceptible to cancer. Fortunately, the mortality rate caused by breast cancer has significantly decreased in recent years due to the increased emphasis on early detection and the development of more effective treatment [2]. Additionally, the widespread application of modern medical devices has accumulated large-scale electronic health record (EHR) data, especially historical breast cancer treatment records, which create a foundation for drug therapy analysis, regimen adjustment, and careflow mining [3]. Consequently, breast cancer patients can receive better healthcare and more accurate treatment.
Additionally, traditional machine learning methods and more advanced deep learning methods have greatly accelerated the process of discovering underlying patterns or structures in EHR data. For instance, in the treatment prediction field, Yadav et al. [4] proposed a framework that uses decision tree and support vector machine algorithms to identify patients who need urgent chemotherapy. For breast cancer diagnosis, Wang et al. [5] developed a comprehensive diagnosis tool by mining heterogeneous EHR data, such as physical examination results, patient clinical backgrounds, histories and features of mammography images. For prognosis, the authors of [6] employed three different machine learning methods to predict breast cancer survivability, which can assist in providing reasonable treatment for patients. In summary, the application of machine learning methods has largely improved the quality of patient care and reduced the misdiagnosis rate for breast cancer.
Currently, most existing work on breast cancer treatment mining mainly relies on structural features or manually designed features based on EHRs in the English language. However, the widespread use of electronic medical devices in China has generated a considerable number of EHRs ranging from structured information to unstructured clinical text. As shown in Fig. 1a, the EHR data might come from various hospitals and be recorded by different doctors under their own guidelines, thus making the final data rich in author- and domain-specific idiosyncrasies, acronyms and abbreviations. For instance, clinical physicians use "EC×4-TH×4" and "EC TH" to denote the same treatment "EC-TH" (as shown in Fig. 1c). The complex character composition represents the specific treatment process in real clinical texts, which is helpful for future reference. In general, physicians use the fewest characters with the most expressive power in the treatment texts. Taking the treatment "EC×4-TH×4" as an example, "×4" indicates that the patient adopts EC for the first four chemotherapy cycles and the TH regimen for the subsequent four cycles.
However, such data hamper the development of advanced applications for breast cancer, such as treatment recommendation, treatment effect prediction, prognosis prediction and smart visualization in the era of big data. At present, uniform features have been utilized to avoid repetitive features and reduce noisy data, which contributes to higher algorithm accuracy. For instance, standardized data have been used to solve the isolated data islands problem with the help of federated learning [7,8]. Therefore, as shown in Fig. 1c, we need to normalize the medical terms in the real-world data on the left (Fig. 1b) to the normalized terms on the right. Namely, despite the various denotations for each treatment in the clinical text, according to practical necessity, they must be mapped to a corresponding unified expression that generally comes from an authoritative reference such as GUIDELINES [9]. In our work, we call this nontrivial problem (i.e., mapping the treatment entities to codes in a relevant controlled vocabulary) the treatment entity normalization (TEN) task. Note that if the treatment entity is in the clinical text, we should first recognize the entity's boundaries, which is called the treatment entity recognition (TER) task. As shown in Fig. 1b, for the treatment "EC TH" from the clinical text, we first recognize its position (the TER task) and then map it to the unified term "EC-TH".
At present, this is a challenging task for three reasons. First, the normalization process is tedious and time-consuming when handled manually, thus requiring specifically designed data-driven approaches. Second, the medical entities are closely related to the contexts of the clinical text, which provide further description and should be taken into account when designing the algorithms, as shown in Fig. 1b. Finally, the inputs are mixed Chinese and English sentences (Fig. 1c), which makes it more difficult to identify the entity boundaries. As a result, the development of computational methods concerning TEN has been hindered. In addition, researchers primarily focus on the named entity recognition task that determines the boundaries of medical entities, such as [10][11][12][13], while few studies directly deal with medical named entity normalization (MEN), especially for Chinese, due to the complexity of Chinese characters.
Nevertheless, researchers have proposed several methods, such as machine learning-based methods and joint learning-based methods, to address the named entity normalization problems. For example, Leaman et al. [14] were the first to introduce machine learning approaches to address the problem by pairwise learning. Leaman et al. [15] and Lou et al. [16] addressed these problems by jointly modeling recognition and normalization. Zhao et al. [17] proposed a deep neural multitask learning method with explicit feedback strategies to obtain optimal performance. However, all of the above methods are specifically designed for English-based entity normalization and recognition, such as from "CEF×3-P×3" to "FEC-P" in Fig. 1c. Chinese MEN is much more difficult than English owing to the complexity of Chinese composition forms and lack of word boundaries [18]. Moreover, the real-world public datasets in Chinese related to health informatics are almost nonexistent, which has been a bottleneck to the development of text mining algorithms in the Chinese domain. Additionally, in the Chinese medical named entity normalization domain, some researchers have developed algorithms by cooperating with hospitals. For instance, Luo et al. [19] introduced a multiview convolutional neural network to address the normalization of diagnostic and procedure names simultaneously. Likewise, Zhang et al. [20] presented an unsupervised framework to normalize the Chinese medical concept by combining disease text with comorbidity. However, the inputs of the networks are just Chinese medical terms, such as various name expressions for the same disease, not informative clinical sentences.
Furthermore, with the increasing quantity of training data, some researchers have begun to seek efficient learning algorithms, especially in the industrial field, such as [21]. In language modeling, many researchers [22,23] attempt to leverage convolutional neural networks to replace traditional recurrent neural networks, which enable parallelization over the elements of sequences. Such approaches significantly promote computational efficiency compared with BiLSTM [24], which requires sequential modeling. In addition, to further improve the language model performance, Shen et al. [25] integrated a novel recurrent architecture with an explicit bias towards modeling a hierarchy of constituents, which can better extract the hidden hierarchical information in the sentence. In addition, with the advancement of health informatics research, the practical significance is becoming much more important, and it has brought about the necessity for computational efficiency. Therefore, we should maintain a balance between the computational precision and efficiency when developing such a framework.
To address the aforementioned challenges, we propose a pseudo cascade learning framework (PASCAL) with a gated convolutional neural network (GCNN) [23] and a conditional random field (CRF) [26] for breast cancer treatment entity normalization in Chinese clinical text, which fully takes advantage of the contextual information, mainly in Chinese, and the sequential interactive information. Specifically, the main contributions of our work can be summarized as follows:
• We propose PASCAL, an end-to-end, accurate and efficient framework with GCNN and CRF to normalize breast cancer treatments, which fully makes use of the sequential interactive information and implicit context information in Chinese clinical text. To the best of our knowledge, this is the first work to introduce GCNN and CRF specialized for TEN. Moreover, experiments on a large real-world breast cancer EHR dataset illustrate the effectiveness and efficiency of the framework.
• In the pseudo cascade structure, we incorporate TER into the framework as an auxiliary task to propagate useful implicit information and assist in optimizing the shared parameters. The final experimental results prove the necessity of the auxiliary recognition task.
• We present a biased loss function with an adjustable parameter γ to strategically optimize the parameters and seek an optimal balance between the contributions of assisting optimization and providing information.

Materials and problem definition
Treatment entity normalization (TEN) aims to map different medical terms from Chinese clinical text, as shown in Table 1, onto a controlled vocabulary, which can be regarded as a multiclass learning task. Nevertheless, the ambiguity in the boundaries of Chinese words can cause segmentation errors, which could introduce noise into the downstream task. Considering this, we label the sequence at the character level to mitigate error transmission. In addition, we incorporate an auxiliary task, TER, to further assist in regulating the parameters of the shared layers. Next, we introduce the input and output of the TEN task and describe the primary definitions of the problem.

Input and output data
Owing to the complexity of the real-world database, we extract the clinical notes from EHRs. Let D = {p_1, p_2, ..., p_n} denote the patients from the EHR. Each patient is represented as a sequence of visits p_i = {v_1, v_2, ..., v_k}, where v_k denotes a visit encounter and k is the number of visits for the patient. A visit v_k might generate multiple treatment records {X_1, X_2, ..., X_l} for the therapy of breast cancer, where l represents the number of treatments in a visit. We treat the records as different input sequences. As shown in Table 1, the input clinical text X_l contains multiple characters {x_1, x_2, ..., x_N}, where N denotes the number of characters in a sequence. The labels, namely, standard entities, come from the standard treatment regimens database C = {r_1, r_2, ..., r_j}, where r_j is an entity and j is the number of entities.
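The nested structure described above can be sketched with plain Python containers. This is a hypothetical illustration only; the field names and example records are not from the original system:

```python
# Hypothetical illustration of the nested EHR structure described above:
# a database D of patients, each with visits v_k, and each visit holding
# one or more free-text treatment records X_l.
ehr_database = [                       # D = {p_1, ..., p_n}
    {"visits": [                       # patient p_1 with visits {v_1, ..., v_k}
        {"records": ["EC×4-TH×4", "EC TH"]},   # {X_1, ..., X_l}
    ]},
]

# Each record X_l is processed at the character level.
record = ehr_database[0]["visits"][0]["records"][0]
characters = list(record)              # {x_1, ..., x_N}
assert characters[0] == "E" and len(characters) == len(record)
```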

Problem definition
The Chinese EHRs contain various mentions of the same entity because the data can come from various hospitals and be recorded by different doctors under their own guidelines. Therefore, the aim of TEN is to map a mention with a nonstandard name to a specified controlled vocabulary from the treatment regimens database R:

(y_1, y_2, ..., y_N) = f(x_1, x_2, ..., x_N),

where y_1, y_2, ..., y_N ∈ R are the normalized entity labels from the treatment regimens database, x_1, x_2, ..., x_N are the input characters from a clinical sentence, and N is the number of characters in one clinical sentence X_l. In this one-vs-one method (character-vs-label), we can not only ensure the correctness of normalization but also identify the location of the treatment entity.
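As a concrete illustration of the one-vs-one (character-vs-label) scheme, each character receives the normalized regimen label of the entity it belongs to. The sentence, the "O" tag for non-entity characters, and the label layout below are our illustrative assumptions, not taken from the original dataset:

```python
# Hypothetical character-level TEN labeling: the mention "EC TH" inside a
# mixed Chinese/English sentence normalizes to the controlled-vocabulary
# entry "EC-TH"; characters outside the entity get the placeholder "O".
sentence = list("行EC TH方案")         # x_1 ... x_N, N = 8 characters
labels   = ["O", "EC-TH", "EC-TH", "EC-TH", "EC-TH", "EC-TH", "O", "O"]

assert len(sentence) == len(labels)    # exactly one label per character
```

Because the labels are per character, the entity position falls out of the labeling for free: the span of "EC-TH" labels marks the entity boundaries recognized by the TER task.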

Methods
In this section, we present a pseudo cascade learning framework with gated convolutional networks and a conditional random field to address the TEN task. As shown in Fig. 2, the model is composed of four key modules: embedding layer, GCNN encoder module, pseudo cascade structure, and the CRF layer. First, the embedding layer projects the Chinese characters into dense vector representations. Then, the representations are fed into the encoder GCNN to capture the contextual relationships and long-term dependencies by the convolutional network and gating mechanism. After obtaining the contextual features, a pseudo cascade structure, which includes a softmax layer, an auxiliary TER layer and an information fusion layer, is utilized to obtain the fused information vector representation. Finally, to obtain more accurate normalization outcomes, we deploy a CRF layer due to its superiority in capturing the internal and contextual relationships within labels. Subsequent sections detail the components of the pseudo cascade learning framework (PASCAL).

Fig. 2
Main architecture of the PASCAL model. PASCAL consists of four modules: the character embedding module, the encoder module (containing a gated convolutional neural network to learn the shared representation with temporal relationships), the pseudo cascade structure module (including the enhanced primary task TEN and the auxiliary task TER), and the CRF layer

Embedding layer
As discussed in the "Materials and problem definition" section, Chinese sentences naturally lack separators between words, and word segmentation is usually treated as the first step in clinical text mining. However, word segmentation can cause ambiguity in the boundaries of Chinese words. To address this problem, our proposed PASCAL is based on character-level input to avoid introducing noise caused by segmentation errors. Formally, as shown in Table 1, given a clinical treatment sentence X_l = {x_1, x_2, ..., x_N}, the model first maps the characters to dense embedding representations. Specifically, the character embedding e_i ∈ R^{d_e} is extracted from a learnable embedding matrix W_e ∈ R^{|N|×d_e} for every character x_i, where i ∈ {1, 2, ..., N} and d_e is a hyperparameter denoting the embedding size. Then, the character embedding vectors are treated as a sequence that is fed into the encoder to mine more complex relations.
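The embedding lookup can be sketched as a table indexed by character id. The vocabulary, dimensions and random initialization below are illustrative assumptions; in practice W_e is learned jointly with the rest of the network:

```python
import random

random.seed(0)
d_e = 4                                    # embedding size (hyperparameter)
vocab = {ch: i for i, ch in enumerate("EC TH行方案")}
# Embedding matrix W_e: one d_e-dimensional row per character in the vocabulary.
W_e = [[random.uniform(-0.1, 0.1) for _ in range(d_e)] for _ in vocab]

def embed(sentence):
    """Map each character x_i of a sentence to its dense vector e_i."""
    return [W_e[vocab[ch]] for ch in sentence]

E = embed("EC TH")
assert len(E) == 5 and all(len(e) == d_e for e in E)  # N vectors of size d_e
```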

Gated convolutional neural network module
As shown in Fig. 2, the gated convolutional neural network (GCNN) is selected as the encoder of PASCAL, and its detailed substructures are shown in Fig. 3. GCNN consists of three blocks, a convolutional block, a gating block and a residual connection, which together enable the GCNN to capture contextual relationships and long-term dependencies in an efficient manner. As shown in Fig. 3a, the input to the convolutional block is a sequence of character embeddings C = {e_1, e_2, ..., e_N}, where C ∈ R^{|N|×d_e}, |N| is the number of characters, and d_e is the embedding size. The matrix C is sent to the one-dimensional convolutional neural network, yielding the outputs B = C * W + b and G = C * M + g, where W, M ∈ R^{k×d_e×d_h}, b ∈ R^{d_h} and g ∈ R^{d_h} are the parameters to be learned. Furthermore, d_h denotes the output dimension, and k denotes the patch size of the convolution.
Following the convolutional operation is the gating block, as shown in Fig. 3b, in which a gated linear unit (GLU) [23] controls the information flow by selecting features through a sigmoid activation function:

h_l(C) = B ⊗ σ(G) = (C * W + b) ⊗ σ(C * M + g),

where h_l is the output of one hidden layer, ⊗ is the elementwise product between matrices, and σ is the sigmoid activation function. Finally, considering the computational efficiency, a residual connection [27] is further added to the block, which means that the final output consists of two parts: the output of the GLU and the input of the block. Thus, C + h_l(C) is the final output of the l-th layer.
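The gating computation can be sketched elementwise for a single sequence position. A minimal sketch, assuming B and G are already the two convolution outputs at that position (the convolution itself is omitted for brevity):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu(B_row, G_row):
    """Gated linear unit: elementwise product of the linear path B
    with the sigmoid-activated gate sigma(G)."""
    return [b * sigmoid(g) for b, g in zip(B_row, G_row)]

def glu_block_with_residual(C_row, B_row, G_row):
    """Output of one GCNN layer: residual input C plus gated output h_l(C)."""
    return [c + h for c, h in zip(C_row, glu(B_row, G_row))]

# A strongly negative gate suppresses the linear path; a strongly
# positive gate passes it through almost unchanged.
out = glu([2.0, 2.0], [-100.0, 100.0])
assert out[0] < 1e-6 and abs(out[1] - 2.0) < 1e-6

res = glu_block_with_residual([1.0, 1.0], [2.0, 2.0], [-100.0, 100.0])
assert abs(res[1] - 3.0) < 1e-6   # residual input + gated output
```

The design choice mirrors the text: the sigmoid gate decides, per feature, how much of the convolutional output flows onward, while the residual path keeps the input reachable across stacked layers.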

Pseudo cascade structure
One limitation of pipeline approaches is that errors from TER propagate to the subsequent TEN task. Therefore, we present the pseudo cascade learning structure, which can mitigate this adverse impact and enhance the positive effect. As shown in [28], auxiliary tasks can be regarded as a kind of regularization that boosts the performance of the main tasks. In addition, [29] adds unsupervised auxiliary tasks to improve the prediction of emotional attributes. Likewise, we leverage the auxiliary task as an additional regularization to assist the primary task; together they constitute the pseudo cascade learning structure. The detailed architecture is described as follows.
First, the encoder GCNN generates informative feature vectors with contextual relationships and long-term dependencies. Then, as shown in Fig. 2, they are further fed into the pseudo cascade structure to fulfill two tasks: treatment entity recognition (TER, the auxiliary task) and treatment entity normalization (TEN, the primary task). Although the TER task is auxiliary, it is indispensable for the regularization of shared parameters and the transmission of useful information. In addition, the pseudo cascade structure also includes the softmax activation layers and the critical CRF layer.

Auxiliary task: TER
In the auxiliary task TER, to recognize the medical entities y_1^r, y_2^r, ..., y_N^r, we take the informative feature vectors H = {h_1, h_2, ..., h_N} from the encoder GCNN as the input. With the help of a linear layer and a softmax layer, we obtain the recognized entity:

ŷ_i^r = softmax(W_r h_i + b_r),

where ŷ_i^r is the recognized entity, W_r ∈ R^{d_r×d_h} and b_r ∈ R^{d_r} are learned parameters, and h_i ∈ R^{d_h} is the i-th input vector. ŷ_i^r is regarded as additional information to be transmitted to the primary task.

Primary task: enhanced TEN
As mentioned above, in the primary task, we not only leverage the information from the encoder GCNN, H = {h_1, h_2, ..., h_N}, but also utilize the information from the auxiliary TER task. To be more specific, we directly integrate them by concatenation:

h_i^c = [h_i ; ŷ_i^r],

where h_i^c denotes the input of the subsequent CRF layer, h_i is the output of the encoder GCNN, and ŷ_i^r is the predicted outcome of the auxiliary TER task. Therefore, the input of the CRF layer can be defined as H^c = {h_1^c, h_2^c, ..., h_N^c}.
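The auxiliary softmax head and the information fusion step can be sketched together. Dimensions, weights and the tiny example values below are illustrative assumptions:

```python
import math

def softmax(z):
    m = max(z)                         # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ter_head(h_i, W_r, b_r):
    """Auxiliary TER prediction: softmax of a linear map of h_i."""
    logits = [sum(w * h for w, h in zip(row, h_i)) + b
              for row, b in zip(W_r, b_r)]
    return softmax(logits)

def fuse(h_i, y_r_i):
    """Information fusion: concatenate the encoder features h_i with the
    auxiliary TER distribution to form the CRF input h_i^c."""
    return h_i + y_r_i

h_i = [0.5, -0.2, 0.1]                       # d_h = 3 (illustrative)
W_r = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # d_r = 2 TER tags (illustrative)
b_r = [0.0, 0.0]
y_r = ter_head(h_i, W_r, b_r)
h_c = fuse(h_i, y_r)
assert len(h_c) == 3 + 2                     # d_h + d_r
assert abs(sum(y_r) - 1.0) < 1e-9            # a valid probability distribution
```

Passing the full TER distribution, rather than a hard label, lets the primary task see how confident the auxiliary task was at each character.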

CRF layer
To better utilize the contextual information and obtain the globally optimal path, we leverage a CRF [26] to model the normalization sequence and predict the sequential TEN results. The label sequence of characters is denoted as Y = {y_1, y_2, ..., y_N}, where y_i ∈ R^{|C|} is the i-th character's label in one-hot representation and |C| is the number of treatment regimens in the database. The input of the CRF layer is the integrated representation H^c = {h_1^c, h_2^c, ..., h_N^c}. The CRF is a probabilistic model, and the conditional probability of Y given the input H^c is calculated as follows:

P(Y | H^c; θ) = ∏_{i=1}^{N} ψ(h_i^c, y_i, y_{i−1}) / Σ_{Y′ ∈ Y(s)} ∏_{i=1}^{N} ψ(h_i^c, y′_i, y′_{i−1}),   (5)

where Y(s) denotes the set of all possible label sequences for a given sentence, θ denotes the learned parameters, and ψ(h_i^c, y_i, y_{i−1}) denotes the potential function:

ψ(h_i^c, y_i, y_{i−1}) = exp(y_i^T W^T h_i^c + y_{i−1}^T T y_i),

where W ∈ R^{(d_r+d_h)×|C|} and T ∈ R^{|C|×|C|} are learned parameters, both of which constitute θ in Eq. (5).
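For a toy label set, the conditional probability in Eq. (5) can be checked by brute-force enumeration of all label sequences. The emission and transition values below are arbitrary illustrative numbers, and the exhaustive sum stands in for the forward algorithm used in practice:

```python
import math
from itertools import product

# Toy linear-chain CRF: 2 labels, 3 positions.
emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # (W^T h_i^c) per position
T = [[0.5, -0.5], [-0.5, 0.5]]                    # transition scores T[y_prev][y]

def score(seq):
    """Unnormalized log-score: sum of emission and transition potentials."""
    s = emissions[0][seq[0]]
    for i in range(1, len(seq)):
        s += emissions[i][seq[i]] + T[seq[i - 1]][seq[i]]
    return s

# Partition function: sum over all |C|^N label sequences in Y(s).
Z = sum(math.exp(score(seq)) for seq in product(range(2), repeat=3))

def prob(seq):
    """Conditional probability of one label sequence, as in Eq. (5)."""
    return math.exp(score(seq)) / Z

# All conditional probabilities sum to one, as a proper CRF requires.
total = sum(prob(seq) for seq in product(range(2), repeat=3))
assert abs(total - 1.0) < 1e-9
```

Decoding then amounts to picking the sequence with the highest score, which the Viterbi algorithm does without enumerating all |C|^N candidates.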

Biased loss function
To enhance the performance of TEN, we present a biased loss function for the pseudo cascade learning framework, which can partially influence the optimization process by adjusting the proportion of TEN loss and TER loss.

TER loss
For the auxiliary task TER, we employ the binary cross-entropy between the ground-truth label y_i^r and the prediction ŷ_i^r as the objective loss function:

L_TER = − Σ_{i=1}^{N} [ y_i^r log ŷ_i^r + (1 − y_i^r) log(1 − ŷ_i^r) ].

TEN loss
For the enhanced TEN task, we adopt the negative log-likelihood over all training samples as the loss function of the CRF, which can be computed as follows:

L_TEN = − Σ_{s ∈ D} log P(Y_s | H_s^c; θ),

where D is the set of medical sentences in the training data, s denotes one sequential sentence in D, Y_s is its label sequence and H_s^c is the integrated input representation.

Biased loss function
To strategically optimize the model parameters, we incorporate a static parameter γ, which we call the bias parameter, into the biased loss function to indirectly tune the optimization process. The biased loss function is:

L_BL = γ · L_TEN + (1 − γ) · L_TER,

where 0 < γ < 1 and L_BL is the combined loss function. Furthermore, to obtain the best model, we should find a balance between L_TEN and L_TER by fine-tuning the bias parameter γ. The detailed information is discussed in the "Bias parameter analysis" section.
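The combination of the two losses is a one-line weighted sum; a minimal sketch with placeholder loss values:

```python
def biased_loss(loss_ten, loss_ter, gamma):
    """Biased loss L_BL: gamma weights the primary TEN loss and
    (1 - gamma) the auxiliary TER loss, with 0 < gamma < 1."""
    assert 0.0 < gamma < 1.0
    return gamma * loss_ten + (1.0 - gamma) * loss_ter

# Raising gamma shifts optimization pressure towards the TEN task:
# with equal TER loss, a higher-TEN-loss model is penalized more.
assert biased_loss(1.0, 0.0, 0.9) > biased_loss(1.0, 0.0, 0.5)
```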

Data
To show the effectiveness of PASCAL, we evaluated it on a real-world EHR dataset containing 12,700 clinical records from Chinese Grade-III Class-A hospitals. As introduced in Fig. 1, the treatment regimens in the clinical text, along with their detailed descriptions, might be recorded by different doctors following their own guidelines, which can generate nonstandardized terms in the clinical records. Hence, our objective is to map the treatment regimens onto the controlled vocabulary from the latest GUIDELINES [9] (the authoritative reference for breast cancer physicians in China). For each patient, we extracted the clinical treatment regimens from their electronic health records and integrated them. As nearly 99% of the clinical texts in the dataset are shorter than 256 characters, we employ only clinical texts shorter than 256 characters in the following experiments. To maintain relative independence, we partition the records into training data and test data at a ratio of 8:2 based on the patients, which yields 209,677 sentences for training and 52,420 sentences for testing. In the experiments, 10% of the training data are randomly sampled for validation, and the remaining data are used for training.
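Splitting by patient (rather than by sentence) keeps all records of one patient on the same side of the split, which is what "based on the patients" requires. A sketch under that assumption, with illustrative function and parameter names:

```python
import random

def split_by_patient(patient_ids, ratio=0.8, seed=42):
    """Shuffle patients and split 8:2 so that no patient's records
    appear in both the training set and the test set."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * ratio)
    return set(ids[:cut]), set(ids[cut:])

train_patients, test_patients = split_by_patient(range(100))
assert len(train_patients) == 80 and len(test_patients) == 20
assert train_patients.isdisjoint(test_patients)   # relative independence
```

A sentence-level split would leak near-duplicate records of the same patient into the test set and inflate the reported scores; the patient-level split avoids that.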

Settings and hyperparameters
To evaluate the effectiveness of framework PASCAL and the influence of each key component, we design various experiments on a real-world database. First, we choose the three latest sequential models as baselines, including Bi-LSTM [24], bidirectional OnLSTM [30] and TCN [22], to obtain an accuracy comparison with GCNN. We also conduct experiments for the single task to compare CRF with softmax in a sequential multiclass classification task.
In addition, to further evaluate the performance of our model, one state-of-the-art multitask learning model, which we call Feedback [17], is used as another baseline in the experiments. Finally, we dynamically adjust the value of γ to obtain the best model performance and to validate the impact of the bias parameter on model performance. Moreover, it is worth noting that most experiments are conducted based on univariate analysis.
To achieve the optimal normalization results, the hyperparameters are set as follows: the dimension of the character embedding is 200; the number of filters is 128 in the first convolutional layer and 256 in the following three connected layers; the size of the convolutional kernels in the CNN layers is 3; the number of convolutional layers is 4; the number of residual blocks is 3; the dropout probability is 0.5; the learning rate is 0.001; and the batch size is 256. We select the hyperparameters via cross-validation on the training data and report the average result of 10 experiments. In addition, the parameters are initialized with Xavier initialization, and we use the LazyAdam [31] optimizer for all neural networks. Finally, we employ the Keras library [32] with the TensorFlow [33] backend, and all models are run on a single NVIDIA Tesla P40.

Evaluation metrics
To fully evaluate the proposed approaches, we use three prevalent evaluation metrics [34] to compare the different approaches: precision, recall, and the F1-measure:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 · Precision · Recall / (Precision + Recall),

where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively.

Table 2 illustrates the performance comparison between the baselines and our proposed approach with respect to the three evaluation metrics on a real-world breast cancer dataset for treatment entity normalization (TEN) in Chinese clinical text. Softmax and CRF denote the softmax layer and the CRF layer for the single task of normalization, respectively. Moreover, PASCAL (Softmax + CRF) denotes our proposed cascade learning framework with a softmax layer for the auxiliary task and a CRF layer for the primary task.
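The three metrics can be computed directly from entity-level counts; a short self-contained sketch (the example counts are arbitrary):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    F1 = 2*P*R/(P+R); guard against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
assert p == 0.8 and r == 0.8 and abs(f1 - 0.8) < 1e-12
```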

Performance comparison
As seen in Table 2, our proposed framework outperforms all the baselines on precision, recall and F1. Specifically, for our proposed framework with the TCN encoder, the F1 score exceeds that of softmax and CRF by approximately 13.9% and 2.62%, the recall score by approximately 11.2% and 2.66%, and the precision score by approximately 16.1% and 2.6%, respectively. This means that our proposed pseudo cascade learning framework can fully take advantage of the auxiliary TER task to optimize the shared parameters and propagate implicit information to the primary TEN task. Moreover, for PASCAL with the GCNN encoder, the F1 and recall scores outperform all others, except for precision. This phenomenon shows that PASCAL leans towards the correctness of the normalized regimens but misses part of the ground-truth regimens. However, the recall and F1 metrics are more meaningful than the precision metric in health informatics.
Concerning the critical encoder, as shown in Table 2, GCNN performs better than the other encoders on all evaluation metrics under the same framework. This partly indicates that GCNN has a stronger ability to capture long-range dependencies and mine contextual relationships via its convolutional and gating blocks. In addition, comparing CRF with softmax, we observe that the model with the CRF layer obtains higher performance than that with the softmax layer. The reason is that neighboring TEN labels have strong dependencies that can be captured by the CRF.
Another meaningful finding is that the models with GCNN perform much better than the model with Bi-OnLSTM. Both models can utilize hierarchical information to obtain better performance. However, the difference is that the latter integrates the intrinsic tree structures into RNN to obtain ordered neurons, while the former builds the hierarchical structure via stacked CNN layers to capture local and long-range dependencies and introduces a gating block to avoid gradient vanishing problems.
Furthermore, as shown in Fig. 4, the performance of PASCAL obviously outperforms Feedback [17] with respect to three evaluation metrics. We think there are three main reasons for this. First, the explicit feedback approach is designed for medical entity recognition and normalization in English clinical text, while the PASCAL model is developed for the TEN task in Chinese clinical text. Second, the constituent characters in Chinese clinical text are complicated and not only contain Chinese characters but also mix English characters. The relations between them are intricate and varied. The powerful blocks of encoder GCNN enable PASCAL to better capture the contextual relationships and long-term dependencies in clinical sentences. Third, the pseudo cascade structure in PASCAL can further improve the model performance by retaining useful information and mitigating error propagation. In addition, the incorporation of CRF can better utilize contextual information to normalize the treatment entity. Therefore, based on the above analysis, our model with GCNN and CRF is the most suitable approach for the TEN task for breast cancer.

Computational efficiency
The aforementioned analyses mainly concentrate on normalization accuracy. However, it is well known that computational efficiency is a critical factor in industrial applications. The main reason is that, under some circumstances, computational efficiency within a finite computational budget is much more important than a slight improvement in accuracy. For instance, in mobile health monitoring, the response time of the device has a great influence on the adoption rate. From the perspective of clinical doctors, what they need is a tool that saves time for decision-making, not one that wastes it. Thus, we must maintain a balance between efficiency and accuracy when choosing an approach. As shown in Fig. 5, the PASCAL framework with different encoders requires different training times per epoch: Bi-OnLSTM spends 193 s on one training epoch and Bi-LSTM needs 117 s, while TCN and GCNN need 33 s and 39 s, respectively. The reason lies in the different operating mechanisms of recurrent and convolutional networks. Recurrent network-based models, such as Bi-LSTM, cannot be parallelized over the characters of a sentence because each output relies on the previous state. In contrast, convolutional networks are very amenable to parallel computing because the computation over all input characters in a sentence can be performed simultaneously. Moreover, the training efficiency of the TCN is slightly higher than that of the GCNN because it directly imposes temporal information on the convolutional process and does not rely on the gating block. However, the performance of GCNN on precision, recall and F1 is 6.7%, 4.3% and 5.6% higher than that of TCN, respectively. Therefore, after comprehensively considering accuracy and efficiency, we choose GCNN as the encoder of the pseudo cascade learning framework.

Bias parameter analysis
The main task of PASCAL is to normalize treatment entities into a standard vocabulary with the help of the auxiliary TER task. γ represents the proportion of the TEN loss in the training process, and (1 − γ) denotes the proportion of the TER loss. Taking TEN as the primary task, we manually vary γ in the biased loss function L_BL from 0.5 to 0.9 to explore its influence on normalization performance. Table 3 shows that as γ increases, the normalization accuracy also increases, indicating that the optimization process gradually tilts toward the TEN task. However, the improvement becomes unstable as γ grows: for instance, the recall score at γ = 0.7 is lower than at γ = 0.6. We hypothesize that the main reason is that increasing γ decreases (1 − γ), which weakens the optimization of the auxiliary TER task; the weakened TER task in turn affects the optimization of the shared parameters. Therefore, an appropriate value of γ should be selected carefully in practical applications.

Table 4 exhibits four typical errors of different categories drawn from the test results. The displayed breast cancer treatments are extracted from complicated clinical text (Fig. 1b) and concatenated with the entity positions; the table lists the normalization result and the corresponding label for each error case. For instance, in ['AC', 17, 19], 'AC' denotes a treatment regimen of breast cancer, 17 denotes the starting index of the entity in the sentence, and 19 denotes the ending index. A normalized result is counted as correct only when the entity and both the starting and ending indexes are all accurate. In error case 1, there is an extra normalized entity ['AC-T', 11, 14] that is in fact a correct normalization.
This occurs because the entity label is missing from the sentence, an inevitable situation in a manually labeled dataset; in fact, this case confirms the normalization effectiveness of our method. Error case 2 is an ordinary normalization mistake made by our method. Error case 3 is difficult to normalize, especially because the treatment regimen rarely appears in the training set; in that situation, the algorithm maps the regimen onto the most similar normalization entity. Likewise, in error case 4, the normalized indexes deviate from the standard position, which produces an unnecessary entity 'EC-T', an error caused by its high similarity to 'FEC-T'. All of the above error cases will be further addressed in our future work and practical applications.
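The strict matching rule above (the entity name and both indexes must all agree) can be sketched as follows; the scorer and the example spans for error case 4 are illustrative, not the paper's actual evaluation code:

```python
from typing import List, Tuple

# A normalized entity is a (name, start_index, end_index) triple,
# e.g. ('AC', 17, 19) from the text above.
Span = Tuple[str, int, int]

def strict_prf(pred: List[Span], gold: List[Span]):
    """Precision/recall/F1 under exact (name, start, end) matching:
    a prediction counts as correct only if all three fields match."""
    tp = len(set(pred) & set(gold))
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Error-case-4 style example (indexes are hypothetical): 'EC-T' is
# predicted at a shifted position instead of the gold 'FEC-T' span,
# so it matches on neither name nor indexes.
gold = [("AC", 17, 19), ("FEC-T", 10, 15)]
pred = [("AC", 17, 19), ("EC-T", 11, 15)]
p, r, f1 = strict_prf(pred, gold)  # only ('AC', 17, 19) is a true positive
```

Under this rule a position shift of even one character turns an otherwise correct normalization into both a false positive and a false negative, which is exactly what happens in error case 4.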

(Notes to Table 4: The treatments are extracted from the clinical context that describes the treatment process of the patient. Treatments in red indicate error cases in both the name and the position of the treatment.)

Conclusion and outlook
In this paper, we present a novel pseudo cascade learning framework with a gated convolutional neural network and a conditional random field, named PASCAL, for breast cancer treatment entity normalization. Unlike traditional LSTM-based models, our approach improves the ability to capture local and long-range dependencies in a sentence through a gated convolutional network (GCNN) and enhances training efficiency. We design a pseudo cascade structure with an auxiliary TER task, which assists in optimizing the shared parameters and propagating useful information, and with a biased loss function that further optimizes the TEN process. Moreover, we employ a conditional random field (CRF) to obtain optimized normalization results by considering the previous labels and contextual information. Finally, we conduct extensive experiments on a real-world dataset of treatment regimens for breast cancer, and the results validate the effectiveness and efficiency of our proposed approach. In general, the presented method can also be applied to Chinese named entity normalization in other fields. In future work, we plan to improve performance in the following directions. First, we will use a public corpus to pretrain the character embeddings. Second, we will integrate domain knowledge about breast cancer into the model to make it more targeted. Third, we will consider dynamically adjusting the optimization process by replacing the static γ with a parameter learned by the neural network. Finally, we will leverage the normalized treatments and clinical laboratory measurements to recommend breast cancer treatments for patients.