Solving the class imbalance problem using ensemble algorithm: application of screening for aortic dissection

Abstract

Background

Imbalance between positive and negative outcomes, a so-called class imbalance, is a problem commonly found in medical data. Despite various studies, class imbalance has remained a difficult issue. The main objective of this study was to find an effective integrated approach to address the problems posed by class imbalance and to validate the method in an early screening model for a rare cardiovascular disease, aortic dissection (AD).

Methods

Different data-level methods, cost-sensitive learning, and the bagging method were combined to solve the problem of low sensitivity caused by the imbalance between the two classes of data. First, feature selection was applied to select the most relevant features using statistical analysis, including a significance test and logistic regression. Then, we assigned two different misclassification cost values to the two classes, constructed weak classifiers based on the support vector machine (SVM) model, and integrated the weak classifiers with undersampling and bagging methods to build the final strong classifier. Because AD is rare, the data imbalance is particularly prominent, so we applied our method to the construction of an early screening model for AD. Clinical data of 53,213 patients from the Institute of Hypertension, Xiangya Hospital, Central South University were used to verify the validity of this method. In these data, the ratio of AD patients to non-AD patients was 1:65, and each sample contained 71 features.

Results

The proposed ensemble model achieved the highest sensitivity of 82.8%, with a training time of 56.4 s and a specificity of 71.9%. It also obtained a small sensitivity variance of 19.58 × 10⁻³ in the seven-fold cross-validation experiment. These results outperformed the common ensemble algorithms AdaBoost, EasyEnsemble, and Random Forest (RF) as well as the single machine learning (ML) methods of logistic regression, decision tree, k nearest neighbors (KNN), back propagation neural network (BP) and SVM. Among the five single ML algorithms, the SVM model with the cost-sensitive learning method performed best, with a sensitivity of 79.5% and a specificity of 73.4%.

Conclusions

In this study, we demonstrate that the integration of feature selection, undersampling, cost-sensitive learning and bagging methods can overcome the challenge of class imbalance in a medical dataset and yield a practical screening model for AD, which could provide decision support for screening for AD at an early stage.

Background

With the development of technology and digital medical data, computer techniques have been widely applied in the medical field. However, medical datasets are often imbalanced [1]: the non-patient/negative class set typically has far more samples than the patient/positive class set. This class imbalance is a typical problem in classification tasks [2]. When the dataset is imbalanced, many classifiers tend to misclassify minority samples as majority samples in order to improve accuracy; indeed, a classifier that assigns every sample to the majority class can reach an accuracy of up to 98% on such data. Obviously, such a classifier is invalid because it cannot identify patients effectively. Therefore, accuracy is not an appropriate evaluation metric, and sensitivity and specificity are often used for evaluation in medicine instead. In particular, sensitivity, which measures the ability of a classifier to find all positive samples, attracts the most attention, since misclassifying patients leads to more serious consequences than misclassifying non-patients.

There are three categories of strategies for solving the class imbalance problem: data-level approaches, algorithm-level approaches and ensemble learning techniques [3, 4]. Data-level approaches include oversampling, undersampling and feature selection. Oversampling generates additional minority samples; its disadvantage is that it causes overfitting and increases time complexity accordingly. Undersampling selects a part of the majority set and combines it with the minority set into a new dataset, which causes loss of information. Zhou et al. [5] and Feng et al. [6] revealed that combining sampling techniques with ensemble methods can solve the problem of information loss effectively. Feature selection identifies the factors most relevant to classification based on their importance, thereby compressing the dimensionality of the feature space. Because class imbalance problems are usually accompanied by high-dimensional data, adopting feature selection techniques is important; researchers have shown that it can alleviate the class imbalance problem to a certain extent [7].

The algorithm-level approach mainly applies cost-sensitive learning [8], an extension of the weight adjustment method that assigns higher weights to minority class samples to correct the classifier's preference for the majority class. Many studies have demonstrated that ensemble learning techniques can achieve better performance than a single classifier when the dataset is imbalanced [9, 10]. Ensemble learning combines multiple weak classifiers to obtain a better and more comprehensive strong model. There are two ways to integrate base classifiers into a strong classifier: bagging and boosting. Bagging is a parallel ensemble technique in which the base classifiers are generated in parallel, while boosting is a sequential method in which later classifiers are influenced by earlier ones; boosting runs slowly and is sensitive to abnormal data and noise. In many real-world applications, a single strategy cannot solve the class imbalance problem effectively, so several strategies are usually combined. Feng et al. [11] improved the performance of the general vector machine (GVM) with feature selection and cost-sensitive learning. Tao et al. [12] adopted a cost-sensitive SVM and the boosting ensemble method for imbalanced dataset classification. Mustafa et al. [13] solved the class imbalance problem by combining undersampling techniques with the MultiBoost ensemble method. Seiffert et al. [14] showed that both sampling and ensemble techniques can effectively improve accuracy on skewed data streams. Sainin et al. [15] applied feature selection and sampling methods to improve an ensemble model for the class imbalance problem.

Aortic dissection (AD) is a cardiovascular disease caused by rupture of the aortic intima, in which blood breaks through the intima and separates the aortic wall into a true and a false lumen. It is a very rare clinical emergency with low morbidity, a high rate of misdiagnosis and a high mortality rate [16], so the number of non-patients is much larger than the number of patients. It has been reported that the first 90 min after onset is the prime window for treatment. In one study [17], the mortality rate of untreated AD patients was 21% within the first 24 h, 37% within 48 h and 74% within one week. Most untreated patients die within a year [18]. Current studies offer only a limited understanding of the causes of AD. Although there are many known risk factors, including a family history of AD, pre-existing AD or aortic valve disease, hypertension, and cigarette smoking [19], there is no highly sensitive and specific indicator [20]. At present, the gold standard for AD diagnosis is CTA (computed tomography angiography) [21], which uses imaging to show the location, scope, entrance, exit and involvement of the aortic branches and aortic valve. Because AD has an insidious onset, primary medical institutions often face difficulties in its diagnosis and prognosis. When facing a patient, the doctor first inquires about the patient's medical history and physical examination results. Once the doctor considers the patient high-risk based on medical history and the presence of typical symptoms, a CTA is arranged to confirm the diagnosis. The typical symptoms of AD are sudden severe pain in the chest, back or between the shoulder blades. However, some patients lack typical symptoms; they may experience chest tightness, syncope, nausea and other diverse, atypical symptoms. Many doctors lack the ability to distinguish and diagnose atypical AD patients and therefore do not arrange a CTA. Thus, some patients with AD fail to receive an accurate diagnosis and effective treatment in time.

Therefore, earlier screening for and prediction of AD is essential. A screening tool can help doctors identify patients with suspected AD: doctors can treat the screening results as advice and further examine high-risk patients to make an accurate diagnosis. Some researchers have used machine learning (ML) techniques to diagnose AD. Huo et al. [22] applied data mining methods, including SVM, Naïve Bayes, Bayesian network and J48, to classify AD patients; the Bayesian network performed best, with an accuracy of 84.55%. However, the purpose of their study was to identify false positives among 492 emergency cases sent to the emergency room as suspected AD patients, so their approach is not suitable for early screening. Liu et al. [23] used multiple ensemble learning methods to screen for AD patients; however, they only explored the performance of existing ensemble methods.

In recent years, many ML approaches have been proposed for classification and medical applications. Saadatfar et al. [24] proposed a new KNN algorithm that improved the pruning process of LC-KNN; the results showed that their method performed better than recent related work. Nusinovici et al. [25] evaluated logistic regression and other ML algorithms for predicting the risk of cardiovascular and other diseases; logistic regression performed as well as the other ML models. A review [26] investigated state-of-the-art research on deep learning techniques in healthcare systems between 2015 and 2019 and concluded that ensembles of deep learning techniques performed better than single methods. Ashish [27] applied SVM and the extreme gradient boosting method to detect ischemic heart disease using the Z-Alizadeh Sani dataset. Among the various ML algorithms, SVM has proven to be one of the most outstanding [28]. The main idea of SVM [29] is to establish an optimal decision hyperplane that maximizes the distance to the samples of the two classes closest to the plane, thereby providing good generalization for classification problems. However, SVM does not take the class distribution or the class imbalance problem into consideration. To handle this problem, Veropoulos et al. [30] adjusted the loss function of SVM by introducing two different misclassification cost values. Kang et al. [31] proposed a weighted undersampling method for SVM; the improved algorithm performed well on imbalanced datasets. Hazarika [32] proposed an SVM that weights the training points based on their class distributions. Recently, applying ensemble learning to SVM has proven useful and has attracted much attention [33]. Pouriyeh et al. [34] investigated different ML methods for heart disease prediction and then applied ensemble learning techniques, including stacking, bagging and boosting, to optimize performance; the SVM method with the boosting approach performed best. Huang et al. [35] applied different ML methods to classify supraventricular ectopic and ventricular ectopic beats, and the SVM ensemble method outperformed the other methods. Shorewala et al. [36] compared base ML classifiers and their ensembles for detecting coronary heart disease; the stacking model involving SVM, RF and KNN performed best. Alsafi et al. [37] proposed an ML system to diagnose coronary heart disease, integrating RF, SVM and XGBoost to build a diagnosis model after feature selection and optimized oversampling on an unbalanced dataset.

In our work, we have explored the binary class imbalance problem in medical research, and tested our method in an early screening model for AD. The significant contributions are as follows:

  1. An effective ensemble model, which integrates bagging, data-level and algorithm-level methods, is proposed to overcome the class imbalance problem; it outperforms standard competitive base and ensemble classifiers.

  2. Different data-level methods are used to deal with the class imbalance problem. First, feature selection techniques, including a significance test and logistic regression, are used to select relevant features. Then we integrate the weak classifiers with undersampling and bagging to build the final strong classifier.

  3. The cost-sensitive learning method is applied to SVM models to construct weak classifiers by assigning a higher misclassification cost to the minority class examples; this differs from the decision tree used by typical ensemble models.

  4. The proposed ensemble model is able to effectively identify patients with AD and also yields better results than the clinical screening results of some hospitals, indicating it can be used to provide decision support for screening for AD at an early stage.

Methods

Our method consists of three parts: feature selection, cost-sensitive learning and the proposed ensemble algorithm; the three parts are introduced in the following sections. The data flow diagram of the proposed method is shown in Fig. 1. The data-level method based on feature selection is applied to select the most relevant features using significance test and logistic regression methods. Then the algorithm-level method based on cost-sensitive learning is implemented on SVM by assigning different misclassification cost values to the two classes, yielding the optimal weight settings \(w\) of SVM. The seven-fold cross-validation technique is used to evaluate the predictive performance of the model: the dataset is partitioned evenly into seven subsets, and each subset is taken in turn as the testing dataset while the remaining six subsets are used as the training dataset. In this way, seven models are obtained, and the average performance of these models on the testing sets is reported as the model's final result.

Fig. 1 Data flow diagram of the proposed method

During each training phase, the proposed ensemble algorithm is applied to obtain a better and more comprehensive ensemble model, using the data-level method based on undersampling and the ensemble learning technique based on bagging. First, the weight settings \(w\) of SVM are initialized according to the results of cost-sensitive learning to construct weak classifiers. Then multiple weak classifiers are trained on the balanced datasets obtained by undersampling. Finally, an ensemble model is constructed from the weak classifiers by bagging.

During each testing phase, the ensemble model makes predictions on the testing dataset.

We compare the ensemble model to single classifiers, including logistic regression, KNN, decision tree, BP and SVM, as well as standard ensemble models including EasyEnsemble, AdaBoost and RF.

Data collection

Since screening for AD patients is a typical imbalance problem, this study used an AD dataset. Clinical data of more than 60,000 cardiovascular in-patients were collected from the Institute of Hypertension, Xiangya Hospital, Central South University between 2008 and 2016. We referred to the indicators recommended in the 2014 ESC Guidelines and initially selected 71 features, covering blood routine, biochemical examination and clotting routine examination indicators as well as other easily accessible information, such as clinical presentation and medical history. The imbalance ratio of AD patients to non-AD patients is 1:65; since any imbalance ratio beyond 1:50 is considered severe, predicting AD is a severe imbalance problem. Details of these features are shown in Table 2. The use of all data was authorized by the Institute of Hypertension, Xiangya Hospital, Central South University.

To gain a comprehensive view of the data, box plots and scatter diagrams were drawn for every feature. We hoped to find specific indicators helpful for classification, but none were found, which means it is difficult to distinguish an AD patient from non-patients using only one or a few indicators. Figure 2 is a box plot of some randomly selected features of our dataset. In a box plot, the horizontal line inside the box is the median of the distribution, the upper and lower ends of the box are the approximate upper and lower quartiles, and the whiskers extend 1.5 times the interquartile range (IQR) from the box edges; points beyond the whiskers are outliers. The positive samples are drawn in red and the negative samples in blue, which clearly shows that the distributions of the two classes are similar; thus, it is difficult to separate positive and negative samples by a single feature. Figure 3 shows a set of scatter diagrams, each drawn using two different features of our dataset. Each diagram shows serious overlap between the positive and negative classes, so it is also hard to separate positive samples from negative samples with two features.

Fig. 2 A box plot of randomly selected dataset features

Fig. 3 Scatter diagrams of dataset features

Feature selection

Investigating the features that affect a model helps to analyze their importance. Feature selection techniques based on feature importance play a crucial role in medical diagnosis and have been widely applied: they reduce the dimensionality of the data and improve the performance of classifiers, whereas redundant or poor features can make classifiers inaccurate. Aghaei et al. [38] analyzed factors associated with HIV-related stigma and proposed strategies for diminishing it. Joloudari et al. [39] applied feature selection technology to improve the accuracy of coronary artery disease diagnosis; four ML models were used to build predictive models and select features, among which RF performed best. Liu et al. [40] proposed an embedded feature selection technique using a weighted Gini index on a decision tree for classification of imbalanced data. Singh et al. [41] determined relevant features for breast cancer prediction by significance analysis and feature selection methods. Ma et al. [42] studied eight feature selection techniques; recursive feature elimination (RFE) based on SVM performed well. Huo et al. [22] applied the correlation-based feature selection (CFS) method to select attributes for building ML models for AD classification. Wang et al. [43] investigated six filter-based feature selection techniques, such as information gain and chi-square [44], applying different ML classifiers and performance metrics to build and evaluate models. Abdar [45] applied four ML classifiers, including decision tree, KNN, SVM and neural network, to predict heart disease, with logistic regression used to select significant variables.

In order to select relevant features, statistical analyses, including a significance test and logistic regression, were applied to analyze the influence of the features.

A significance test determines whether the difference between groups is statistically significant. In our significance test, categorical variables were presented as frequencies with percentages and analyzed by the chi-square test (\({\upchi }^{2}\)); continuous variables were expressed as means with standard deviations (SD) and analyzed by independent t-tests. A P value less than 0.05 was considered statistically significant. Logistic regression is a type of regression analysis commonly used in the analysis of diseases; it can quantify the relative importance of factors in disease prediction. Therefore, we also pinpointed the most relevant factors using logistic regression.

Finally, the feature set \(\mathrm{Fset}\) was constructed according to the following formula as the union of the features selected by the two analyses, i.e., all features whose P value was no greater than 0.05 in at least one of \({\mathrm{F}}_{\mathrm{s}}\) and \({\mathrm{F}}_{\mathrm{l}}\):

$$\mathrm{Fset}={\mathrm{F}}_{\mathrm{s}}\cup {\mathrm{F}}_{\mathrm{l}},$$

where \({\mathrm{F}}_{\mathrm{s}}\) is the feature set selected by significance test; \({\mathrm{F}}_{\mathrm{l}}\) is the feature set selected by logistic regression.
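As an illustration, this selection step could be sketched as follows, assuming a pandas DataFrame `df` with a binary `label` column and lists of categorical and continuous feature names (these names, and the use of scipy/statsmodels in place of SPSS, are our assumptions):

```python
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def significance_features(df, categorical, continuous, label="label", alpha=0.05):
    """F_s: chi-square test for categorical features, independent
    t-test for continuous features (keep those with P < alpha)."""
    pos, neg = df[df[label] == 1], df[df[label] == 0]
    kept = []
    for f in categorical:
        _, p, _, _ = stats.chi2_contingency(pd.crosstab(df[f], df[label]))
        if p < alpha:
            kept.append(f)
    for f in continuous:
        _, p = stats.ttest_ind(pos[f], neg[f])
        if p < alpha:
            kept.append(f)
    return set(kept)

def logistic_features(df, features, label="label", alpha=0.05):
    """F_l: features whose logistic regression coefficients are
    statistically significant (P < alpha)."""
    X = sm.add_constant(df[features].astype(float))
    fit = sm.Logit(df[label], X).fit(disp=0)
    p = fit.pvalues.drop("const")
    return set(p[p < alpha].index)

# Fset = F_s ∪ F_l (cat_cols and num_cols are the feature name lists)
fset = significance_features(df, cat_cols, num_cols) | logistic_features(df, cat_cols + num_cols)
```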

In addition, feature selection based on RF and recursive feature elimination (RFE) was used to verify the effectiveness of the features selected in our study. RF is an ensemble learning method that uses multiple decision trees and has high accuracy and good robustness; it can quantify feature importance through the decrease of the Gini coefficient in the decision trees. The main idea of RFE is to iteratively build a model, remove the least important features, and repeat the process on the remaining features until all features have been traversed; the order in which features are eliminated gives the ranking of feature importance. RFE is a greedy algorithm for finding the optimal feature subset. An SVM model was used as the estimator for RFE in our study.
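For this verification step, a minimal scikit-learn sketch might look as follows (LinearSVC stands in for the SVM estimator, since RFE needs a model exposing coefficients; `X`, `y` and `feature_names` are assumed to be prepared beforehand):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Gini-based importance ranking from a random forest.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_rank = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda t: t[1], reverse=True)

# RFE around a linear SVM; ranking_ == 1 marks the most important features.
rfe = RFE(estimator=LinearSVC(dual=False), n_features_to_select=1).fit(X, y)
rfe_rank = sorted(zip(feature_names, rfe.ranking_), key=lambda t: t[1])
```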

Cost-sensitive learning

SVM performs well on high-dimensional data, which makes it popular among ML practitioners. Furthermore, in the SVM model, changing the weights of positive and negative samples in the loss function sets different penalty coefficients for the two classes, which means two different misclassification cost values are assigned. For instance, the greater the weight of the positive samples, the greater the penalty for misclassifying them, and the greater the penalty, the smaller the error the model can tolerate on that class. The cost-sensitive loss function of SVM is the sum of a weighted hinge loss and a regularization term, computed as follows:

$$\sum_{i=1}^{N}{c}_{i}{[1-{y}_{i}(w\bullet {x}_{i}+b)]}_{+}+\lambda {||w||}^{2},$$

where \({x}_{i}\) is the \(i\)th sample; \({y}_{i}\) is the class label of \({x}_{i}\); \(w\) and \(b\) are the parameters of the hyperplane; \(||\cdot ||\) is the L2 norm; and \({c}_{i}\) is the per-sample misclassification cost weight, with \({c}_{i}={w}_{1}\) if \({x}_{i}\in P\) and \({c}_{i}={w}_{2}\) if \({x}_{i}\in N\).

Based on these advantages, SVM was selected as the base classifier for the ensemble model in this study, unlike standard ensemble learning methods such as AdaBoost and EasyEnsemble, which use a decision tree as the base classifier. With cost-sensitive weights, the SVM models pay more attention to positive samples and alleviate the impact of class imbalance.
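In scikit-learn terms, this per-class misclassification cost corresponds to the `class_weight` argument of `SVC`, which scales the penalty C for each class; a minimal sketch with the weight setting selected later in this paper, SVM (1.3, 1):

```python
from sklearn.svm import SVC

# Errors on the positive (minority) class cost 1.3 times as much as
# errors on the negative class: C effectively becomes C * class_weight[y_i].
svm_cs = SVC(kernel="rbf", C=1.0, class_weight={1: 1.3, 0: 1.0})
svm_cs.fit(X_train, y_train)   # X_train, y_train assumed prepared
```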

Proposed ensemble algorithm

In our study, we focus on the binary class imbalance problem. The labels of the positive and negative samples were set to 1 and 0 respectively. The pseudo code of the proposed algorithm is shown in Algorithm 1, and the corresponding flowchart is shown in Fig. 4. The input of Algorithm 1 includes a dataset composed of a set of majority class samples \(N\) and a set of minority class samples \(P\), the K most relevant features obtained from feature selection, and the weight settings \(w\) of SVM obtained from cost-sensitive learning. First, the number of weak classifiers \(T\) is calculated from the imbalance ratio of the majority class set to the minority class set. Then a loop builds and trains the \(T\) weak classifiers. In each iteration, the weak classifier \({H}_{i}(i=\mathrm{1,2},\dots ,T)\) is constructed by initializing the weight settings \(w\) on SVM; then a subset \({N}_{i}(i=\mathrm{1,2},\dots ,T)\) is randomly undersampled from \(N\), and a new balanced dataset \({D}_{i}\) is constructed by combining \({N}_{i}\) with all instances of the minority class \(P\):

$$N={\bigcup }_{i=1}^{T}{N}_{i},$$
$${D}_{i}={N}_{i}\cup P,$$

where \({N}_{i}\subset N; {N}_{i}\cap {N}_{j}=\varnothing \left(i\ne j\right); |{N}_{i}| = |P|\).

Fig. 4 Flowchart of Algorithm 1

Then a weak classifier \({H}_{i}\) is trained on \({D}_{i}\). This process is repeated \(T\) times until all \(T\) weak classifiers are trained. Finally, an ensemble model \(H\left(x\right)\) is built by integrating the weak classifiers with the bagging method.
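A condensed sketch of Algorithm 1 under the same assumptions as above (scikit-learn `SVC` as the weak learner; majority voting as the bagging combiner, a detail the pseudo code leaves unspecified):

```python
import numpy as np
from sklearn.svm import SVC

def train_ensemble(X, y, w=(1.3, 1.0), seed=0):
    """Split N into T disjoint subsets of size |P|, pair each with all
    of P (D_i = N_i ∪ P), and train one cost-sensitive SVM per D_i."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = rng.permutation(np.where(y == 0)[0])
    T = len(neg) // len(pos)                    # number of weak classifiers
    models = []
    for i in range(T):
        n_i = neg[i * len(pos):(i + 1) * len(pos)]   # undersampled N_i
        idx = np.concatenate([n_i, pos])             # balanced D_i
        models.append(SVC(class_weight={1: w[0], 0: w[1]}).fit(X[idx], y[idx]))
    return models

def predict_ensemble(models, X):
    """Bagging: majority vote over the T weak classifiers."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```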

Performance evaluation

Usually, the performance of a classification algorithm is measured in terms of accuracy. However, relying only on classification accuracy, especially on an imbalanced medical dataset, can be misleading: a classifier that assigns all samples to the majority class achieves high accuracy yet is meaningless. In this study, sensitivity and specificity were used as the two evaluation metrics, as is common in the medical field, and training time was used as an additional metric to evaluate the complexity of the model. Sensitivity is the proportion of positive samples correctly detected among all positive samples; the higher the sensitivity, the lower the missed diagnosis rate. Specificity is the proportion of negative samples correctly detected among all negative samples; the higher the specificity, the lower the misdiagnosis rate. In disease screening, improving sensitivity is more important, as it reduces the missed diagnosis rate; specificity does not need to be particularly high and is acceptable within a reasonable range. The metrics are computed as follows:

$$\mathrm{specificity}= \frac{TN}{TN+FP},$$
$$\mathrm{sensitivity}= \frac{TP}{TP+FN},$$

where TP means the number of true positive samples; FN means the number of false negative samples; TN means the number of true negative samples, and FP means the number of false positive samples.
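Both quantities follow directly from the confusion matrix; a short sketch with scikit-learn:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)   # share of AD patients correctly flagged
specificity = tn / (tn + fp)   # share of non-patients correctly cleared
```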

Each metric was evaluated under seven-fold cross validation, which randomly selected six-sevenths of the dataset as the training set and the remaining one-seventh as the test set; the undersampling method was employed to balance the training set.
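The evaluation loop might look as follows; StratifiedKFold is our assumption (it keeps the 1:65 ratio in every fold, while the paper only states that the split is random), and `train_ensemble`/`predict_ensemble` refer to the Algorithm 1 sketch above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

sens, spec = [], []
for tr, te in StratifiedKFold(n_splits=7, shuffle=True, random_state=0).split(X, y):
    models = train_ensemble(X[tr], y[tr])   # undersampling balances each fold
    tn, fp, fn, tp = confusion_matrix(y[te], predict_ensemble(models, X[te]),
                                      labels=[0, 1]).ravel()
    sens.append(tp / (tp + fn))
    spec.append(tn / (tn + fp))
print(f"Se {np.mean(sens):.3f} (var {np.var(sens):.5f}), Sp {np.mean(spec):.3f}")
```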

Results

Data collection

After removing samples with missing data, the dataset contains 53,213 samples. According to the hospital's discharge diagnosis records, 802 of these cases are AD patients and 52,411 are non-patients, giving an imbalance ratio of positive to negative samples of 1:65. Among the 802 AD patients, there are 574 males (71.6%) and 228 females (28.4%); the patients' ages range from 18 to 89 years, with an average of 55.57 ± 12.90, and 411 cases (51.2%) are between 50 and 70 years old. There are 618 (77.1%) drinkers and 506 (63.1%) smokers, and 596 (74.3%) suffered from chest pain.

Experimental setup

Experiments were performed on a computer with a 2.6 GHz CPU and 4 GB of RAM running Windows 7. The feature selection methods based on logistic regression and the significance test were implemented in SPSS 25; the other feature selection and ML methods were implemented in a Python 3.8 environment.

To obtain good parameters for the models in our study, a cross-validation grid search was used. The "n_estimators" parameter of AdaBoost and EasyEnsemble was set to 67, according to the imbalance ratio of the majority class set to the minority class set. Other unspecified parameters used the default settings. The model parameter settings used in our study are shown in Table 1.

Table 1 Experimental parameters of models
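A sketch of how such a search could be set up with scikit-learn (the parameter grid here is illustrative, not the one actually searched; Table 1 lists the final settings):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

search = GridSearchCV(
    SVC(class_weight={1: 1.3, 0: 1.0}),   # cost-sensitive weights from above
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    scoring="recall",                      # recall == sensitivity, our priority
    cv=7,
)
search.fit(X_train, y_train)
print(search.best_params_)
```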

Feature selection

The significance test results are shown in Table 2. The serial numbers beginning with 1, 2, 3, and 4 indicate blood routine, biochemical examination, clotting routine examination and other indicators, respectively. Features with significant differences are shown in bold. There were 49 features in \({\mathrm{F}}_{\mathrm{s}}\) with a statistically significant difference (P < 0.05), including four indicators in blood routines, 17 in biochemical examination, seven in clotting routine examination and 21 in other.

Table 2 Significance test analysis of the indicators used to predict AD

The logistic regression results are shown in Table 3. Variables significantly correlated with the target variable (P < 0.05) are in bold. There were 35 features in \({\mathrm{F}}_{\mathrm{l}}\), including three indicators in blood routines, 12 in biochemical examination, four in clotting routine examination and 16 in other.

Table 3 Logistic regression analysis of the indicators used to predict AD

In summary, 26 features are in both \({\mathrm{F}}_{\mathrm{s}}\) and \({\mathrm{F}}_{\mathrm{l}}\); 23 features are only in \({\mathrm{F}}_{\mathrm{s}}\); nine features are only in \({\mathrm{F}}_{\mathrm{l}}\). Finally, the union of \({\mathrm{F}}_{\mathrm{l}}\) and \({\mathrm{F}}_{\mathrm{s}}\) was selected as the feature set, called Fset. There are 58 features in Fset to build prediction models, as listed in Table 2. The bold items are features in \({\mathrm{F}}_{\mathrm{s}}\). The underlined items are features in \({\mathrm{F}}_{\mathrm{l}}\) and not in \({\mathrm{F}}_{\mathrm{s}}\). Among the features in Fset, four indicators came from blood routines, 23 from biochemical examination, eight from clotting routine examination and 23 from other.

RF and RFE methods were used to rank the features from most to least important. More than 90% of the top 58 features from the two methods are in the feature set selected in our study, indicating that the selected features are meaningful. Table 4 lists the top 10 features common to the two feature selection methods, with their importance values based on RF.

Table 4 Feature importance ranking

Cost-sensitive learning

In this study, the patients set is called the positive/minority class set, and the non-patients set is called the negative/majority class set. By adjusting the weight parameters in the loss function of the SVM model, we can mitigate class imbalance: assigning higher weights to the minority class examples makes the model pay more attention to minority samples. To find the best combination of weights, we performed a cost-sensitivity analysis on the weights and compared SVM models with different weight settings.

The results are shown in Table 5. To obtain more reliable test results, seven-fold cross-validation was used; the average row shows the average of the seven-fold cross-validation results. In Table 5, SVM (1.3, 1) means that \({\mathrm{w}}_{1}=1.3\) and \({\mathrm{w}}_{2}=1\), where \({\mathrm{w}}_{1}\) is the weight of the positive samples and \({\mathrm{w}}_{2}\) is the weight of the negative samples. When the weight of the positive samples reaches 2, the specificity becomes too low; therefore, SVM models with a weight greater than 2 on the positive samples were not considered.

Table 5 Sensitivity (Se) and specificity (Sp) of SVM models with different weights on positive and negative samples
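The weight sweep behind Table 5 can be reproduced with a loop of the following shape (the list of candidate weights is illustrative; Table 5 shows the settings actually evaluated):

```python
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Specificity is simply the recall of the negative class.
specificity = make_scorer(recall_score, pos_label=0)

for w1 in (1.0, 1.1, 1.2, 1.3, 1.5, 2.0):   # candidate positive-class weights
    clf = SVC(class_weight={1: w1, 0: 1.0})
    se = cross_val_score(clf, X, y, cv=7, scoring="recall").mean()
    sp = cross_val_score(clf, X, y, cv=7, scoring=specificity).mean()
    print(f"SVM({w1}, 1): Se={se:.3f} Sp={sp:.3f}")
```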

By changing the weights and sacrificing specificity slightly, an SVM model with higher sensitivity can be generated. The larger the weight of the positive samples, the higher the cost the model pays when it mistakenly assigns a positive sample to the negative class; thus, the model focuses more on positive samples. Such models are of great significance because higher sensitivity makes the model less likely to miss a patient. The sacrifice of specificity is worthwhile to some extent: as an early warning system, our purpose is only to have patients who receive an alert undergo further examination to confirm the diagnosis. However, the specificity should not be too low, because a model with very low specificity wastes resources, as healthy people then pay for unnecessary further examinations. In this regard, SVM (1.3, 1) is considered the best base model, since it achieves higher sensitivity without an excessively low specificity.

Performance of proposed ensemble model

According to the results of the cost-sensitivity analysis, SVM (1.3, 1) performs best, so the weak classifiers were constructed based on SVM (1.3, 1), and the ensemble model was built from multiple such weak classifiers. Table 6 compares the training times of the ensemble learning models, Table 7 compares their sensitivities and specificities, and Table 8 compares the sensitivities and specificities of the base ML algorithms. To minimize errors, the average values of the seven-fold cross validation were used as the results. The variance of the cross-validation results is also reported to characterize the generalization ability of each model: a smaller variance means more stable performance on different training sets, in other words, a stronger generalization ability.

Table 6 Training time of different models (unit: s)
Table 7 Sensitivity (Se) and specificity (Sp) of ensemble learning models
Table 8 Sensitivity (Se) and specificity (Sp) of logistic regression, decision tree, KNN and BP

The model proposed in this article is named the Ensemble model in the results. As can be seen from Table 6, RF achieved the lowest training time, followed by AdaBoost; the training time of EasyEnsemble is generally about 50 times that of AdaBoost, and the training time of the Ensemble model is much shorter than that of EasyEnsemble. As can be seen from Table 7, the Ensemble model obtained a higher sensitivity (82.8%) than SVM (1.3, 1) (79.5%), although it had a lower specificity (71.9%); this is acceptable, because we pay more attention to improving sensitivity. Among the four ensemble learning models, AdaBoost performed poorly, while the Ensemble model performed best: it achieved the highest sensitivity, and its specificity remained above 70%, a level many routine diagnoses cannot reach. Moreover, the variance of the Ensemble model is markedly smaller than that of the other models, meaning its performance remains relatively stable across different datasets; in other words, it has a stronger generalization ability. This is demonstrated vividly by the fourth and seventh folds, where AdaBoost, EasyEnsemble and RF perform badly on sensitivity but the Ensemble model does not. It can be seen from Fig. 5 that the sensitivity of the Ensemble model is the highest and the most stable. Compared with the results of the base ML methods in Table 8, the ensemble methods demonstrated superior results; among the base methods, logistic regression achieved a good performance with a sensitivity of 76.9% and a specificity of 77.4%, followed by BP.

Fig. 5 Seven-fold cross validation results of sensitivity of AdaBoost, EasyEnsemble (Easy), RF, Ensemble model and SVM (1.3, 1)

Discussion

Nowadays, with the rapid growth of electronic medical data, the issue of class imbalance presents ever greater challenges. A recent review [46] showed that the problem of class imbalance in data mining is still common. Solutions to the class imbalance problem fall into data-level, algorithm-level and ensemble learning techniques, and many researchers have explored them. Undersampling, which divides the negative class set and chooses only part of it to participate in training a model, is commonly used to solve imbalance problems [47]; however, its deficiency is that it ignores many potentially useful majority class examples. One previous study [48] indicated that the combination of ensemble methods and undersampling techniques could solve this problem effectively, and Zhou et al. [5] revealed that integrating ensemble methods with undersampling preserves the efficiency of undersampling. Gu et al. [49] proposed a fuzzy SVM for the class imbalance problem, a modified SVM classifier with cost-sensitive methods that adjusts the misclassification costs of the two classes. Sainin et al. [15] applied feature selection and sampling methods to improve an ensemble model for the class imbalance problem, combining the data-level method with ensemble methods. Velusamy et al. [50] combined three base classifiers into an ensemble method with a reduced feature subset on datasets balanced using the Synthetic Minority Over-sampling Technique (SMOTE). Most researchers have combined only the data-level method or the algorithm-level method with ensemble learning techniques; few studies combine all three.

Due to the rarity of AD, the dataset we used is highly imbalanced, with an imbalance ratio of 1:65, so screening for AD is a significant imbalance problem. Hence, we applied the proposed ensemble method to the construction of an early screening model for AD to validate our approach. AD is a cardiovascular emergency with low morbidity and high mortality; due to its acute onset and complex clinical presentations, the rates of missed diagnosis and misdiagnosis are high [51]. Early screening of AD can therefore prevent later health loss and provide doctors with decision support. In recent years, some studies have applied ML techniques to AD diagnosis. Harris et al. [52] applied a convolutional neural network to classify AD and rupture on post-contrast CT images. Wu et al. [53] established a RF model to predict in-hospital rupture of type A AD using imaging examinations, clinical manifestations and other attributes of 1,133 patients. These researchers focused on diagnosis, not screening. In the literature [23], four ML models were used to screen for AD cases from imbalanced data, and SmoteBagging achieved the highest sensitivity of 78.1%; however, the complexity of that method was very high, requiring substantial computing resources, and the training time exceeded 1000 s.

In the current study, an integrated learning approach combining data-level methods, algorithm-level methods and the bagging ensemble technique was used to address the problems posed by class imbalance. Class imbalance issues always lead to low sensitivity, which reflects the ability of the classifier to find all patients. Since identifying patients is more important than identifying healthy people, the main objective in medical research with imbalanced datasets is to improve sensitivity. The experimental results show that the sensitivity and specificity of three of the ensemble models are over 70%, an obvious advantage over routine diagnostics [51, 54, 55], whose missed diagnosis rate is between 35 and 45%. In other words, routine diagnostics, including CT and MR angiography, fail to identify many people who do suffer from AD, while others who are not sick receive unnecessary intervention. The ensemble model established in this study performed significantly better on sensitivity than the other models. At the same time, our model has low complexity, with a training time of 56.4 s. Additionally, the variance over the seven-fold cross validation was small, indicating that the model has stronger stability and generalization ability. In future work, we will investigate our method on datasets with different class imbalance ratios.

Conclusion

We have presented a study on class imbalance classification using an AD dataset. We have demonstrated that the proposed ensemble model, which combines feature selection, undersampling and cost-sensitive learning on SVM with bagging, achieves strong performance. The ensemble model performed better than the base classifiers and common ensemble learning algorithms, reaching the highest sensitivity of 82.8%, which means it can find more positive cases. In healthcare research, class imbalance is a common phenomenon, as the population of sick people is obviously smaller than that of non-sick people; research in this area helps to provide an effective method for overcoming the class imbalance problem.

Availability of data and materials

The datasets generated and/or analysed during the current study are not publicly available due to limitations of ethical approval involving the patient data and anonymity but are available from the corresponding author on reasonable request.

Abbreviations

AD: Aortic dissection
CTA: Computed tomography angiography
ML: Machine learning
SVM: Support vector machine
KNN: K nearest neighbors
RF: Random forest
RFE: Recursive feature elimination
BP: Back propagation neural network

References

  1. Belarouci S, Chikh MA. Medical imbalanced data classification. Adv Sci Technol Eng Syst J. 2017;2(3):116–24.

  2. Bi J, Zhang C. An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme. Knowl Based Syst. 2018;158(15):81–93.

  3. Wu J, Zhao Z, Sun C, Yan R, Chen X. Learning from class-imbalanced data with a model-agnostic framework for machine intelligent diagnosis. Reliab Eng Syst Saf. 2021:107934.

  4. Liu X-Y. An empirical study of boosting methods on severely imbalanced data. In: International conference on advances in materials science and information technologies in industry (AMSITI); 2014; Xian, Peoples R China.

  5. Liu XY, Wu J, Zhou ZH. Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern. 2009;39(2):539–50.

  6. Feng W, Huang W, Ren J. Class imbalance ensemble learning based on the margin theory. Appl Sci. 2018;8(5).

7. Longadge R, Dongre S. Class imbalance problem in data mining: review. Int J Comput Sci Netw. 2013;2(1).

  8. Zhou ZH, Liu XY. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng. 2006;18(1):63–77.

  9. Hosni M, Abnane I, Idri A, Carrillo de Gea JM, Fernandez Aleman JL. Reviewing ensemble classification methods in breast cancer. Comput Meth Programs Biomed. 2019;177:89–112.

  10. Khoshgoftaar TM, Van Hulse J, Napolitano A. Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Trans Syst Man Cybern A Syst Hum. 2011;41(3):552–68.

  11. Feng F, Li KC, Shen J, Zhou Q, Yang X. Using cost-sensitive learning and feature selection algorithms to improve the performance of imbalanced classification. IEEE Access. 2020;8:69979–96.

  12. Tao X, Li Q, Guo W, Ren C, Li C, Liu R, et al. Self-adaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification. Inf Sci. 2019;487:31–56.

  13. Mustafa G, Niu Z, Yousif A, Tarus J. Solving the class imbalance problems using RUSMultiBoost ensemble. In: 2015 10th Iberian conference on information systems and technologies (CISTI); 2015 17–20 June 2015.

  14. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern A Syst Humans. 2010;40(1):185–97.

  15. Sainin MS, Alfred R, Ahmad F. Ensemble meta classifier with sampling and feature selection for data with imbalance multiclass problem. J Inf Commun Technol. 2021;20(Number 2):103–33.

  16. Canaud L, Patterson BO, Peach G, Hinchliffe R, Loftus I, Thompson MM. Systematic review of outcomes of combined proximal stent grafting with distal bare stenting for management of aortic dissection. J Thorac Cardiov Surg. 2013;145(6):1431–8.

  17. Group JJW. Guidelines for diagnosis and treatment of aortic aneurysm and aortic dissection (JCS 2011): digest version. Circ J. 2013;77(3):789–828.

  18. Crawford ES. The diagnosis and management of aortic dissection. JAMA. 1990;264(19):2537–41.

  19. Erbel R, Aboyans V, Boileau C, Bossone E, Di Bartolomeo R, Eggebrecht H. 2014 ESC Guidelines on the diagnosis and treatment of aortic diseases. Eur Heart J. 2014;35(41):2873-U93.

  20. Erbel R, Alfonso F, Boileau C, Dirsch O, Eber B, Haverich A, et al. Diagnosis and management of aortic dissection - recommendations of the task force on aortic dissection, European Society of Cardiology. Eur Heart J. 2001;22(18):1642–81.

  21. Vardhanabhuti V, Nicol E, Morgan-Hughes G, Roobottom CA, Roditi G, Hamilton MCK, et al. Recommendations for accurate CT diagnosis of suspected acute aortic syndrome (AAS)–on behalf of the British Society of Cardiovascular Imaging (BSCI)/British Society of Cardiovascular CT (BSCCT). Br J Radiol. 2016;89(1061):20150705.

  22. Huo D, Kou B, Zhou Z, Lv M. A machine learning model to classify aortic dissection patients in the early diagnosis phase. Sci Rep. 2019;9(1):2701.

  23. Liu LJ, Zhang CW, Zhang GG, Gao Y, Luo JM, Zhang W, et al. A study of aortic dissection screening method based on multiple machine learning models. J Thorac Dis. 2020;12(3):605–14.

  24. Saadatfar H, Khosravi S, Joloudari JH, Mosavi A, Shamshirband S. A new K-nearest neighbors classifier for big data based on efficient data pruning. Mathematics. 2020;8(2):286.

  25. Nusinovici S, Tham YC, Chak Yan MY, Wei Ting DS, Li J, Sabanayagam C, et al. Logistic regression was as good as machine learning for predicting major chronic diseases. J Clin Epidemiol. 2020;122:56–69.

  26. Shamshirband S, Fathi M, Dehzangi A, Chronopoulos AT, Alinejad-Rokny H. A Review on deep learning approaches in healthcare systems: taxonomies, challenges, and open issues. J Biomed Informat. 2020;113:103627.

27. Ashish L, Sravan KV, Yeligeti S. Ischemic heart disease detection using support vector machine and extreme gradient boosting method. Mater Today Proc. 2021(6).

  28. Kumar B, Gupta D. Universum based Lagrangian twin bounded support vector machine to classify EEG signals. Comput Meth Programs Biomed. 2021;208:106244.

29. Vapnik V. The nature of statistical learning theory. Technometrics. 1995;38(4):409.

  30. Veropoulos K, Campbell C, Cristianini N. Controlling the sensitivity of support vector machines. In: Proceedings of the international joint conferences on artificial intelligence. 1999.

  31. Kang Q, Shi L, Zhou M, Wang X, Wu Q, Wei Z. A distance-based weighted undersampling scheme for support vector machines and its application to imbalanced classification. IEEE Trans Neural Netw Learn Syst. 2018;29(9):4152–65.

32. Hazarika BB, Gupta D. Density-weighted support vector machines for binary class imbalance learning. Neural Comput Appl. 2020(2).

  33. Anaissi A, Goyal M, Catchpoole DR, Braytee A, Kennedy PJ. Ensemble feature learning of genomic data using support vector machine. PLoS ONE. 2016;11(6):e0157330.

  34. Pouriyeh S, Vahid S, Sannino G, Pietro GD, Gutierrez JB. A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease. In: 22nd IEEE symposium on computers and communication (ISCC 2017): workshops—ICTS4eHealth; 2017.

  35. Huang HF, Liu J, Zhu Q, Wang RP, Hu GS. A new hierarchical method for inter-patient heartbeat classification using random projections and RR intervals. Biomed Eng Online. 2014;13:90.

36. Shorewala V. Early detection of coronary heart disease using ensemble techniques. Inform Med Unlocked. 2021;26.

  37. Alsafi HES, Ocan ON. A novel intelligent machine learning system for coronary heart disease diagnosis. Appl Nanosci. 2021.

  38. Aghaei A, Mohraz M, Shamshirband S. Effects of media, interpersonal communication and religious attitudes on HIV-related stigma in Tehran, Iran. Inform Med Unlocked. 2020;18.

  39. Joloudari JH, Joloudari EH, Saadatfar H, Ghasemigol M, Razavi SM, Mosavi A, et al. Coronary artery disease diagnosis; ranking the significant features using a random trees model. Int J Environ Res Public Health. 2020;17(3):731.

  40. Liu H, Zhou M, Liu Q. An embedded feature selection method for imbalanced data classification. IEEE/CAA J Autom Sin. 2019;6(3):703–15.

41. Singh BK. Determining relevant biomarkers for prediction of breast cancer using anthropometric and clinical features: a comparative investigation in machine learning paradigm. Biocybern Biomed Eng. 2019;39(2):393–409.

  42. Ma L, Fu T, Blaschke T, Li M, Tiede D, Zhou Z, et al. Evaluation of feature selection methods for object-based land cover mapping of unmanned aerial vehicle imagery using random forest and support vector machine classifiers. Isprs Int J Geo-Inf. 2017;6(2):51.

  43. Wang H, Khoshgoftaar TM, Gao K. A comparative study of filter-based feature ranking techniques. In: 2010 IEEE international conference on information reuse & integration; 2010 4–6 Aug 2010.

  44. Plackett RL. Karl Pearson and the chi-squared test. Int Stat Rev. 1983;51(1):59–72.

  45. Abdar M, Kalhori SRN, Sutikno T, Subroto IMI, Arji G. Comparing performance of data mining algorithms in prediction heart diseases. Int J Electr Comput Eng. 2015;5(6):1569–76.

  46. Ali H, Mohd Salleh MNB, Saedudin R, Hussain K, Mushtaq MF. Imbalance class problems in data mining: a review. Indon J Electr Eng Comput Sci. 2019;14(3).

  47. Weiss GM. Mining with rarity—problems and solutions: a unifying framework. Acm Sigkdd Explor Newsl. 2004;6(1):7–19.

  48. Sun B, Chen HY, Wang JD, Xie H. Evolutionary under-sampling based bagging ensemble method for imbalanced data classification. Front Comput Sci. 2018;12(2):331–50.

  49. Gu X, Ni T, Wang H. New fuzzy support vector machine for the class imbalance problem in medical datasets classification. TheScientificWorldJOURNAL. 2014;2014:536434.

  50. Velusamy D, Ramasamy K. Ensemble of heterogeneous classifiers for diagnosis and prediction of coronary artery disease with reduced feature subset. Comput Meth Programs Biomed. 2021;198:105770.

  51. Chen XF, Li XM, Chen XB, Huang XM. Analysis of emergency misdiagnosis of 22 cases of aortic dissection. Clin Misdiagn Misther. 2016;29(1).

  52. Harris RJ, Kim S, Lohr J, Towey S, Velichkovich Z, Kabachenko T, et al. Classification of aortic dissection and rupture on post-contrast CT images using a convolutional neural network. J Digit Imaging. 2019;32(6):939–46.

  53. Wu J, Qiu J, Xie E, Jiang W, Zhao R, Qiu J, et al. Predicting in-hospital rupture of type A aortic dissection using random forest. J Thorac Dis. 2019;11(11):4634–46.

  54. Teng Y, Gao Y, Feng SX. Diagnosis and misdiagnosis analysis of 131 cases of aortic dissection. Chin J Misdiagn. 2012;12(8):1873.

  55. Wang HY, Zhu ZY. Analysis on clinical features and misdiagnosis of 58 patients with acute aortic dissection. Hainan Med J. 2016;27(5):800–2.

Acknowledgements

We acknowledge the data set supported by the Xiangya Hospital Central South University in China and medical support from Yongping Bai and Wei Zhang. The authors are especially thankful for data processing support from Jinghui Chen, Huai Xu, Haoyuan Lan and Hao Li.

Funding

This study was supported by the Strategic Emerging Industry Technology Research and Major Technology Achievement Transformation Project (No. 2019GK4013).

Author information

Authors and Affiliations

Authors

Contributions

LJL designed the study, XYW and SHL performed the study and wrote the paper, and YL, SYT and YPB contributed to the data collection. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yi Li.

Ethics declarations

Ethics approval and consent to participate

The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013) and was approved by the ethics board of Xiangya Hospital, Central South University (201502042). This is a retrospective study; all data were desensitized data from the hospital's electronic medical records, which did not contain patient identification information, and consent was therefore waived.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Liu, L., Wu, X., Li, S. et al. Solving the class imbalance problem using ensemble algorithm: application of screening for aortic dissection. BMC Med Inform Decis Mak 22, 82 (2022). https://doi.org/10.1186/s12911-022-01821-w
