Exploratory study on classification of chronic obstructive pulmonary disease combining multi-stage feature fusion and machine learning

Background Due to the complexity and high heterogeneity of acute exacerbation of chronic obstructive pulmonary disease (AECOPD), the Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines are unable to fully guide the treatment of AECOPD. Objectives To provide rapid treatment in line with the development of AECOPD after admission, we propose a multi-stage feature fusion (MSFF) framework combined with machine learning to track the deterioration risk of AECOPD. Methods First, we identify 408 AECOPD patients as the study population. Then, feature segmentation and fusion methods are applied to generate the phased data set. Finally, human studies are designed to evaluate the performance of the MSFF framework. Results The experimental results show that the proposed framework has the potential to track the deterioration risk of AECOPD patients over the full course of treatment. The proposed MSFF framework achieves a higher average overall accuracy and F1 score than the four physician groups, i.e., IM, Surgery, Emergency, and ICU. Conclusions The proposed MSFF model may serve as a useful disease tracking tool to estimate the deterioration risk at each stage, and ultimately to support disease monitoring and management for AECOPD patients.


Background
Chronic obstructive pulmonary disease (COPD), a common disease characterized by irreversible, persistent airflow limitation, reduces patients' quality of life [1,2]. As a major cause of chronic morbidity and mortality, COPD is a global health threat and is projected to become the third leading cause of death in the world by 2030 [3]. Acute exacerbation of COPD (AECOPD) is a key event in the disease course, causing a sharp decline in lung function and a significant increase in mortality [4,5]. The increasing frequency of exacerbations and hospitalizations is the daily manifestation of the deterioration of AECOPD [6]. In addition, the prevalence of depressive symptoms is significantly higher among individuals with AECOPD than among controls [7]. Therefore, timely and reliable risk identification of AECOPD has important clinical significance for prevention and early interventional therapy [8].

*Correspondence: pengjunf@mail2.sysu.edu.cn. School of Computer Science, Guangdong University of Education, Guangzhou 510006, China. Peng et al. BMC Medical Informatics and Decision Making (2021) 21:348

To identify the deterioration risk for any given AECOPD patient, physicians often use hypothetical reasoning. From an initial feature set (basic information, past history, current medical history), a physician forms a basic diagnostic set and then updates it based on further clinical data (complications, tests, and examinations) [9]. Thus, the physician, acting like a classifier of sorts, reaches a final diagnostic set by repeating this process. However, due to the complexity of AECOPD and the lack of consensus, it is difficult for clinicians to identify the start of deterioration in AECOPD [10][11][12].
In addition, the limitations of the human brain and the lack of adequate statistical and analytical techniques make the identification of acute deterioration events a huge challenge for physicians. In recent years, Artificial Intelligence (AI) techniques have become potentially powerful tools for the diagnosis and management of diseases, mimicking and even aiding the clinical decision-making of human physicians [13][14][15]. Among AI-enabled medical applications, the number of cancer-related publications is the highest, followed by heart disease and stroke, vision impairment, Alzheimer's disease, and depression [16]. The intelligent diagnosis of AECOPD requires further research.
To improve the therapeutic effect for AECOPD, many scholars have studied the disease using artificial intelligence. Swaminatha et al. proposed a supervised Gradient Enhanced Random Forest (GERF) model after collecting clinical data from AECOPD patients; the prediction accuracy of GERF reaches 88% of that of clinicians [17]. Wang et al. developed a transfer learning algorithm based on balanced probability distribution to predict the risk of exacerbation in patients with AECOPD, and the model achieved good prediction results on small data sets [18]. Altan et al. used a deep learning method to extract features from the lung sounds of AECOPD patients and then used the extracted features to build a model for predicting AECOPD; the method achieved 93.67% prediction accuracy [19]. Wang et al. employed a variety of machine learning methods to predict the risk of exacerbation in patients with AECOPD, and the experimental results showed that the support vector machine (SVM) performed better than the other models [20]. Ganguly et al. used forward feature selection to select features and then employed a logistic regression model to identify AECOPD; the area under the curve (AUC) of the developed model was 0.710 [21]. Taken together, common AI-based methods tend to ignore the timeliness of clinical data.
However, the timeliness of clinical data is key to real-world data (RWD) research [22]. Neglecting it poses at least three challenges for the intelligent diagnosis of AECOPD [23]. First, AECOPD is a highly heterogeneous and complex disease that needs to be monitored in a timely manner. Second, a large amount of treatment data is produced over time: as treatment progresses, important clinical data are generated step by step, and mining the generation time of these data supports clinical decision-making. Third, clinical data are strictly time-sensitive, which means that expired data have little diagnostic significance. Thus, a model that considers the timeliness of clinical data has practical clinical significance for tracking the deterioration risk of the disease. We therefore propose a multi-stage feature fusion (MSFF) framework to achieve this goal (Fig. 1).
The experimental results show that the proposed framework has the potential to track the deterioration risk of AECOPD patients over the full course of treatment. The framework can also be extended to deterioration risk monitoring for other chronic diseases. The main contributions of this paper are summarized as follows: (1) A machine learning based method is proposed to track the deterioration risk of AECOPD using the generation time of clinical data. (2) We evaluate the proposed method on a real data set from the Third Affiliated Hospital, Sun Yat-sen University.
The remainder of this paper is organized as follows. Section 2 presents the multi-stage feature fusion for tracking deterioration risk. Section 3 evaluates the proposed framework. Section 4 discusses the proposed MSFF framework. Section 5 concludes our work.

Study participants
We conduct the research under the guidance of the Third Affiliated Hospital, Sun Yat-sen University (TAHSYU) Institutional Review Board (IRB), protocol [2019]-02-334-01. A total of 408 AECOPD patients are identified retrospectively by International Classification of Diseases, Tenth Revision, Clinical Modification codes (J44.100, J44.101) from the respiratory unit database of TAHSYU, a major large-scale medical center in China. Data masking techniques are applied to the patient data before analysis.
The distribution of AECOPD patients is shown in Table 1. AECOPD inpatients who need the intensive care unit are marked as the serious group, while those who do not are marked as the mild group. The serious group accounts for 46.1% of the AECOPD patients and the mild group for 53.9%.

Feature fusion
We define the data set of AECOPD patients as $P$. A patient in the collection is denoted $p_i$. Assuming that $p_i$ has $n$ clinical features, $p_i$ can be expressed as:

$$p_i = \{p_i^{1}, p_i^{2}, \ldots, p_i^{n}\} \tag{1}$$

where $p_i \in P$ and $n \in \mathbb{N}$. Meanwhile, we assume that the clinical features $p_i^{n}$ are generated over $k$ periods (phases or stages), so that the clinical features of $p_i$ can be divided into $k$ parts by period. The clinical data of $p_i$ over multiple time periods can then be denoted as:

$$p_i = \{p_{i_1}^{n}, p_{i_2}^{n}, \ldots, p_{i_k}^{n}\} \tag{2}$$

where $p_{i_k}^{n}$ represents the clinical data of patient $i$ produced at phase $k$. We use $y_{i_k}$ to indicate the severity level of the AECOPD patient $p_i$ at phase $k$. Then, we can define the phased deterioration risk as:

$$y_i = \{y_{i_1}, y_{i_2}, \ldots, y_{i_k}\} \tag{3}$$

Let

$$y_{i_1} = y_{i_2} = \cdots = y_{i_k} \tag{4}$$

so that we can predict the final deterioration risk based on the phased clinical data. We define $seg_{i_k}^{w}$ as in Eq. (5) to represent all the clinical features ($w$ in number) generated at phase $k$ for the patient:

$$seg_{i_k}^{w} = \{p_{i_k}^{1}, p_{i_k}^{2}, \ldots, p_{i_k}^{w}\} \tag{5}$$

where the number of clinical features collected at phase $k$ is $w = c(k)$. Thus, the time-based phased accumulating clinical data of a single patient over the first $k$ stages, $acc_{i_k}$, is given by Eq. (6):

$$acc_{i_k} = \{seg_{i_1}, seg_{i_2}, \ldots, seg_{i_k}\} \tag{6}$$

The time-based phased accumulating clinical data of all patients can be expressed by:

$$D_k = \{(acc_{i_k},\, y_i) : p_i \in P\} \tag{7}$$

where $D_k$ is the $k$-th input data set of the model and $y_i$ is the corresponding output. Equivalently, the data set can be built up stage by stage:

$$D_k = D_{k-1} \cup seg_k \tag{8}$$

where $D_{k-1}$ indicates the input of stage $k-1$ and $seg_k$ denotes the features newly added at stage $k$.
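The segmentation and accumulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the stage layout `STAGES`, the feature names, and the patient record are hypothetical, and features are assumed to be already grouped by the stage in which they are generated.

```python
from typing import Dict, List, Sequence

def segment(record: Dict[str, float],
            stages: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Split one patient's features into K segments, one per stage."""
    return [{name: record[name] for name in stage} for stage in stages]

def accumulate(segments: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Build the accumulated data: the union of segments 1..k, for k = 1..K."""
    acc: List[Dict[str, float]] = []
    current: Dict[str, float] = {}
    for seg in segments:
        current = {**current, **seg}  # stage k = stage k-1 plus the new segment
        acc.append(current)
    return acc

# Hypothetical stage layout and patient record (names are illustrative only).
STAGES = [["age", "smoker"], ["heart_rate"], ["pco2", "ph"]]
patient = {"age": 71, "smoker": 1, "heart_rate": 104, "pco2": 58.0, "ph": 7.31}

acc = accumulate(segment(patient, STAGES))
for k, features in enumerate(acc, start=1):
    print(k, sorted(features))
```

Each later stage contains every feature of the earlier ones, so a model trained at stage k never sees less information than the model at stage k - 1.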

Modeling
Data modeling begins after the feature fusion. Assume that a hypothesis $f$ on the AECOPD patient data set $D$ is learned by a machine learning method. Then, the prediction of the staged deterioration risk for the AECOPD patients is calculated by:

$$\hat{Y}_k = f(D_k) \tag{9}$$

where $\hat{Y}_k$ denotes the predicted deterioration risk for the AECOPD patient data set, $f$ denotes the machine learning model to be solved, and $D_k$ represents the AECOPD patient data of stage $k$. Let $Y_k$ denote the true deterioration risk of the AECOPD patient data set. The difference $\epsilon$ between $\hat{Y}_k$ and $Y_k$ can be expressed by:

$$\epsilon = \lvert \hat{Y}_k - Y_k \rvert \tag{10}$$

$f$ can be a simple or a complex model. The smaller the difference $\epsilon$ is, the better the model performs. To find the best model $f$ for the data set $D_k$, the difference $\epsilon$ should be minimized. In real-world studies, considering the complexity of clinical data, it may be more appropriate to build simple and complex models together.
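As a rough illustration of choosing the model with the smallest difference between predicted and true risk, the sketch below measures that difference as the misclassification rate, which is an assumption made here for a binary serious/mild label. The two candidate models and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Sequence[float]
Model = Callable[[Sample], int]

def error(model: Model, X: List[Sample], y: List[int]) -> float:
    """Fraction of patients whose predicted label differs from the true one."""
    wrong = sum(1 for xi, yi in zip(X, y) if model(xi) != yi)
    return wrong / len(y)

def select_model(candidates: Dict[str, Model],
                 X: List[Sample], y: List[int]) -> Tuple[str, float]:
    """Return the name and error of the candidate with the smallest error."""
    scores = {name: error(m, X, y) for name, m in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Toy stage-k data: one feature per patient, label 1 = serious, 0 = mild.
X = [[0.2], [0.4], [0.6], [0.9], [0.8], [0.1]]
y = [0, 0, 1, 1, 1, 0]

candidates: Dict[str, Model] = {
    "always_mild": lambda x: 0,                  # trivial baseline
    "threshold_0.5": lambda x: int(x[0] > 0.5),  # simple rule
}
best_name, best_eps = select_model(candidates, X, y)
print(best_name, best_eps)
```

In practice the error would be estimated on held-out data rather than on the data used to fit the model, so that the selected model is not simply the one that memorizes best.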

Framework of multi-stage feature fusion
In this section, we propose a novel framework that combines the machine learning model with the phased accumulating clinical data (Fig. 1). The proposed MSFF framework is presented in Framework 1. Its input is the phased accumulating clinical data sets given by Eq. (6), and its output is the queue of the phased deterioration risk of the patients. The goal of the MSFF framework is to obtain the classification results on the phased data set defined by Eq. (6). The framework proceeds as follows.
Step 1: set the prediction results empty.
Step 2 to step 4: sort the input D, segment the input D into K pieces, and define the cumulative data successively.
Step 5 to step 11: for each phased data, construct the machine learning model successively and calculate the prediction results.
Step 12: return the phased prediction results.
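The steps above can be sketched as a single loop. This is a minimal sketch under stated assumptions: feature columns are assumed to be already ordered by generation time, `stage_sizes` (the number of new features per stage) is a hypothetical parameter, and a nearest-centroid classifier stands in for the learner, which in the paper is a random forest.

```python
from typing import Callable, Dict, List

Row = List[float]

def fit_centroids(X: List[Row], y: List[int]) -> Callable[[Row], int]:
    """Stand-in learner (nearest centroid); the paper uses a random forest."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for row, label in zip(X, y):
        total = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            total[j] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row: Row) -> int:
        def dist(c: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
        return min(centroids, key=dist)

    return predict

def msff(train_X, train_y, test_X, stage_sizes, fit=fit_centroids):
    """Return one list of test predictions per stage."""
    results = []                       # Step 1: empty prediction queue
    upto = 0
    for size in stage_sizes:           # Steps 2-4: segment and accumulate
        upto += size
        d_train = [row[:upto] for row in train_X]
        d_test = [row[:upto] for row in test_X]
        model = fit(d_train, train_y)  # Steps 5-11: fit and predict per stage
        results.append([model(row) for row in d_test])
    return results                     # Step 12: phased prediction results

train_X = [[1.0, 5.0, 0.0], [1.2, 5.0, 0.1], [1.1, 5.0, 1.0], [0.9, 5.0, 0.9]]
train_y = [0, 0, 1, 1]
test_X = [[1.0, 5.0, 0.05], [1.0, 5.0, 0.95]]
results = msff(train_X, train_y, test_X, stage_sizes=[1, 1, 1])
print(results)
```

In this toy data only the third feature separates the two classes, so the prediction for the first test patient changes once the final stage's data becomes available.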

Human studies
We carry out a study to compare the performance of the MSFF framework with that of human physicians on the validation set of data set D4. We select 16 junior physicians from the departments of Internal Medicine (IM), Surgery, Intensive Care Unit (ICU), and Emergency. Each group consists of 4 clinicians (four at the intermediate level and four at the primary level). Each physician in each group reads a random subset of 326 AECOPD patients' clinical records from the independent training data set and assigns a severity rating (serious or mild) to each of the 82 patients in the random validation data set. Then, we evaluate the diagnostic classification performance of each physician group using the F1 score and overall accuracy.

Framework initialization
In real-world medical practice, the number of clinical phases k grows as treatment progresses. For the convenience of experimental demonstration, we identify 4 segments and 40 features as the input of the MSFF framework (shown in Table 2). Then, considering the high accuracy, robustness, and speed of ensembles of decision trees, the random forest model is employed as the model f to predict the deterioration risk for AECOPD [24,25]. Finally, we compare our MSFF framework with the clinicians on the phased data set D4 = (Seg1, Seg2, Seg3, Seg4). We implement the MSFF framework in R 3.5.1. 90% of the whole data set is randomly assigned to the training data set, while the remaining 10% is set aside as an independent test data set. To verify the generalization of the model, 10-fold cross-validation (CV) is employed. We apply a cross-validated grid search to optimize the machine learning models and tune the hyperparameters.
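The 90/10 hold-out split, 10-fold cross-validation, and grid search can be illustrated with the sketch below. It is written in Python rather than the R environment reported in the paper, and everything in it is assumed for illustration: the data are synthetic, a k-nearest-neighbor vote stands in for the tuned model, and the hyperparameter grid is hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple

def holdout_split(n: int, test_frac: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n-1 and reserve a fraction as the independent test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    return idx[n_test:], idx[:n_test]

def kfold(indices: Sequence[int], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [list(indices[i::k]) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val

def knn_predict(X, y, train, row, n_neighbors: int) -> int:
    """Majority vote among the nearest training rows (stand-in model)."""
    by_distance = sorted(
        train, key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], row)))
    votes = sum(y[j] for j in by_distance[:n_neighbors])
    return int(2 * votes > n_neighbors)

def cv_accuracy(X, y, indices, n_neighbors: int, k: int = 10) -> float:
    """Mean validation accuracy over k folds."""
    scores = []
    for train, val in kfold(indices, k):
        hits = sum(knn_predict(X, y, train, X[j], n_neighbors) == y[j] for j in val)
        scores.append(hits / len(val))
    return sum(scores) / len(scores)

# Synthetic stand-in data: 408 rows, two features, label 1 = serious.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(408)]
y = [int(a + b > 1.0) for a, b in X]

train_idx, test_idx = holdout_split(len(X), test_frac=0.10, seed=1)
grid = [1, 5, 9]  # hypothetical hyperparameter grid (odd values avoid ties)
best_k = max(grid, key=lambda n: cv_accuracy(X, y, train_idx, n))
test_hits = sum(knn_predict(X, y, train_idx, X[j], best_k) == y[j] for j in test_idx)
print(best_k, len(train_idx), len(test_idx), test_hits / len(test_idx))
```

The test indices are never passed to `cv_accuracy`, so the hyperparameter is chosen on the training portion only and the hold-out accuracy remains an independent estimate.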

Metrics
To evaluate the classifiers (partial least squares (PLS), support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF)) of the MSFF model, the overall accuracy and the F1 score are used.
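For reference, overall accuracy and the F1 score for a binary serious/mild label can be computed from the confusion counts as sketched below. The standard definitions are used; the label vectors are made-up illustrative values, not results from the study.

```python
from typing import Sequence, Tuple

def confusion(y_true: Sequence[int], y_pred: Sequence[int],
              positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) with `positive` as the serious class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Overall accuracy: correct predictions over all predictions."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1: harmonic mean of precision and recall, 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Illustrative labels only: 1 = serious, 0 = mild.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(confusion(y_true, y_pred), accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

Accuracy treats both classes equally, while F1 ignores true negatives, so the two can diverge when the serious and mild groups are imbalanced.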

Evaluations
To guarantee the robustness of the MSFF framework, both cross-validation and independent test results are provided. Figure 2 and Tables 3 and 4 show the evaluation of the classifiers in the MSFF framework on the phased data sets (D1-D4).
Table 5 shows that, on the phased data D4, the MSFF with the RF classifier achieves a higher average overall accuracy and F1 score than the four junior physician groups, i.e., IM, Surgery, Emergency, and ICU. This result suggests that the proposed MSFF with the RF classifier may support clinical physicians in diagnosing the deterioration and death risk in patients with AECOPD.

Discussion
MSFF is conceived to comprehensively integrate multi-stage clinical data to forecast the exacerbation risks of AECOPD before they occur. In a comparison with junior physicians from Internal Medicine (IM), Surgery, Intensive Care Unit (ICU), and Emergency, the proposed MSFF framework obtains superior performance when forecasting the AECOPD exacerbation risk from real-world data. Several issues require further discussion.
For the convenience of experimental demonstration, the phased data set is defined by the type of clinical indicators, and then the number of observation stages K is determined (Table 2). In general, the number of stages K is directly proportional to the severity of the patient's condition. A large number of stages K allows seamless monitoring of the patient's condition. However, it consumes a lot of computing resources and generates frequent prompts to the clinicians. Thus, it is necessary to balance computing resource consumption against seamless tracking. In practice, medical institutions should determine the number of stages K according to their own circumstances.
Further, we need to discuss the model selection problem. First, models used for clinical analysis require overfitting prevention, noise resistance, and outstanding prognostic performance. Traditional statistical and linear models are highly explanatory. However, the inherent complexity and interactivity of the pathogenic factors of AECOPD make it difficult to apply traditional statistical and linear models, e.g., for linear causal interpretation. Second, the large amount of clinical data generated in stages is a huge challenge for clinicians. Thus, a model with good accuracy and tracking ability is helpful in revealing the degree of influence of, and correlation among, multiple clinical indicators.
The shortcomings of this study include the lack of tracking data (anxiety and depression) for AECOPD patients after discharge from hospital, owing to the poor follow-up compliance of the study subjects. Medical imaging was not used during data acquisition, as we used lung function to determine the severity level. In the future, we will establish multicenter contracts to obtain more AECOPD patient data. We hope that the model's predictive power will be improved by more abundant and reliable real-world training data.

Conclusions
To achieve real-time monitoring of acute exacerbation diseases such as AECOPD, we mine the generation time of the clinical features, dynamically divide the clinical data into multiple stages, and use machine learning methods to issue deterioration risk warnings stage by stage. The proposed MSFF framework is able to track the phased deterioration risk of AECOPD patients with real-world data. Our model achieves a higher classification performance than the four junior physician groups. The data segmentation proposed in this paper conforms to the process of clinical diagnostic reasoning. As more clinical information becomes available, the predictive performance of the proposed MSFF framework may gradually improve. In future work, we will investigate natural language processing technology to extract potentially valuable information from electronic medical records, such as the patient's past history, present medical history, course of illness, discharge summary, and other text records [26].