A hybrid cost-sensitive ensemble for heart disease prediction

Zhenya, Qi; Zhang, Zuoru

doi:10.1186/s12911-021-01436-7

Research article
Open access
Published: 25 February 2021

BMC Medical Informatics and Decision Making volume 21, Article number: 73 (2021)

Abstract

Background

Heart disease is the primary cause of morbidity and mortality in the world. It includes numerous problems and symptoms. The diagnosis of heart disease is difficult because there are many factors to analyze, and the cost of misclassification can be very high.

Methods

A cost-sensitive ensemble method was proposed to improve the efficiency of diagnosis and reduce the misclassification cost. The proposed method contains five heterogeneous classifiers: random forest, logistic regression, support vector machine, extreme learning machine and k-nearest neighbor. A t-test was used to investigate whether the ensemble performed better than the individual classifiers and to assess the contribution of the Relief algorithm.

Results

The best performance was achieved by the proposed method according to ten-fold cross validation. The statistical tests demonstrated that the performance of the proposed ensemble was significantly superior to that of the individual classifiers, and that the efficiency of classification was distinctly improved by the Relief algorithm.

Conclusions

The proposed ensemble gained significantly better results compared with individual classifiers and previous studies, which implies that it can be used as a promising alternative tool in medical decision making for heart disease diagnosis.

Background

Heart disease is any disorder that influences the heart’s ability to function normally [1]. As the leading cause of death, heart disease is responsible for nearly $30\%$ of global deaths annually [2]. In China, it is estimated that 290 million people are suffering from heart disease, and the rate of death caused by heart disease is more than $40\%$ [3]. According to the European Society of Cardiology (ESC), nearly half of heart disease patients die within the initial 2 years [4]. Therefore, accurate diagnosis of heart disease in its early stages is of great importance for protecting heart health [5].

However, because it is associated with numerous symptoms and various pathologic features such as diabetes, smoking and high blood pressure, the diagnosis of heart disease remains a huge problem for less experienced physicians [6]. Several diagnostic methods have been developed to detect heart disease. Coronary angiography (CA) and electrocardiography (ECG) are the most widely used among them, but both have serious defects. ECG may fail to detect the symptoms of heart disease in its record [7], while CA is invasive, costly and needs highly trained operators [8].

Computer-aided diagnostic methods based on machine learning predictive models can be noninvasive if they rely on data that can be gathered noninvasively. They can also help physicians make proper and objective diagnoses, and hence reduce the suffering of patients [9]. Various machine learning predictive models [10,11,12,13,14] have been developed and widely used for decision support in diagnosing heart disease. Dogan et al. [15] built a random forest (RF) classification model for coronary heart disease. The clinical characteristics of 1545 and 142 subjects were used for training and testing respectively, and the classification accuracy for symptomatic coronary heart disease was $78\%$. Detrano et al. [16] proposed a logistic regression (LR) classifier for heart disease classification and obtained an accuracy of $77\%$ in 3 patient test groups. Gokulnath and Shantharajah [17] proposed a classification model based on a genetic algorithm (GA) and a support vector machine (SVM), obtaining an accuracy of $88.34\%$ on the Cleveland heart disease dataset. Subbulakshmi et al. [18] performed a detailed analysis of different activation functions of the extreme learning machine (ELM) using the Statlog heart disease dataset. The results indicated that ELM achieved an accuracy of $87.5\%$, higher than other methods. Duch et al. [19] used a K-nearest neighbor (KNN) classifier to predict heart disease on the Cleveland heart disease dataset and achieved an accuracy of $85.6\%$, superior to other machine learning techniques.

As the No Free Lunch Theorem implies, no single model or algorithm can solve all classification problems [20]. One way to overcome the limitations of a single classifier is to use an ensemble model. An ensemble model is a combination of multiple classifiers; it can outperform the individual classifiers because the variance of error estimation is reduced [21,22,23,24]. In recent years, many ensemble approaches have been proposed to improve the performance of heart disease diagnosis systems. For instance, Das et al. [25] proposed a neural network ensemble and obtained $89.01\%$ classification accuracy in experiments on data taken from the Cleveland heart disease dataset. Bashir et al. [26] employed an ensemble of five heterogeneous classifiers on five heart disease datasets. The proposed ensemble classifier achieved a high diagnosis accuracy of $87.37\%$. Khened et al. [27] presented an ensemble system based on a deep fully convolutional neural network (FCNN) and achieved a maximum classification accuracy of $100\%$ on the Automated Cardiac Diagnosis Challenge (ACDC-2017) dataset. Therefore, we use an ensemble classifier to predict the presence or absence of heart disease in the present study.

Previous studies show that traditional medical decision support systems usually focused only on maximizing classification accuracy, without taking the unequal misclassification costs between different categories into consideration. However, in the field of medical decision making, it is often the minority class that is of higher importance [28]. Further, the cost associated with missing a patient (false negative) is much higher than that of mislabeling a healthy instance (false positive) [29]. Therefore, traditional classifiers inevitably result in a defective decision support system. In order to overcome this limitation, in this paper we combine the classification results of individual classifiers in a cost-sensitive way so that classifiers that help reduce the costs gain more weight in the final decision.

The rest of the paper is organized as follows. Section "Data-mining algorithms" offers brief background information concerning Relief algorithm and each individual classifier. Section "Methods" presents the framework of the proposed cost-sensitive ensemble. Section "Experimental setup" illustrates the research design of this paper in detail. Section "Results" describes the experimental results and compares the ensemble method with individual classifiers and previous methods. In section "Discussion", experimental results are discussed in detail. Finally, the conclusions and directions for future works are summarized in section "Conclusions".

Data-mining algorithms

Relief feature selection algorithm

Relief is a well-known filter feature selection algorithm that uses a relevance statistic to measure the importance of each feature. This statistic can be seen as the weight of the feature, and the k features with the largest weights are selected. Therefore, the key is to determine the relevance statistic [30].

Assume $D = \{(x_1, y_1), (x_2, y_2), \ldots (x_m, y_m)\}$ is a dataset, where $x_i$ is an input feature vector and $y_i$ is the class label corresponding to $x_i$. First, select a sample $x_i$ randomly. Then, using the same techniques as in KNN, Relief finds its nearest sample $x_{i,nh}$ among the samples of the same class and its nearest sample $x_{i,nm}$ among the samples of a different class. $x_{i,nh}$ is called the “near-hit” and $x_{i,nm}$ is called the “near-miss”. Next, update the weight of a feature A in W as described in Algorithm 1 [31, 32]. Repeat the random sampling steps m times and take the average value of W[A], which is the weight of feature A.

In Algorithm 1, $diff(x_{a}^j, x_{b}^j)$ depends on the type of feature j. For discrete feature j:

$$\begin{aligned} diff(x_{a}^j, x_{b}^j) = \left\{ \begin{aligned} 0,&x_{a}^j = x_{b}^j\\ 1,&otherwise, \end{aligned} \right. \end{aligned}$$

for continuous feature j:

$$\begin{aligned} diff(x_{a}^j, x_{b}^j) = | x_{a}^j - x_{b}^j |. \end{aligned}$$

The procedure is repeated n times, and the weights of each feature are then averaged. Finally, the top k features are chosen for classification.
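The weight update can be illustrated with a short Python sketch. Since Algorithm 1 is not reproduced in the text, the update rule below follows the commonly stated form of Relief (subtract the difference to the near-hit, add the difference to the near-miss, averaged over the iterations) and should be read as an assumption, not as the authors' exact procedure. The function names `relief` and `top_k`, the sum of per-feature differences used as the distance, and the expectation that continuous features are scaled to [0, 1] are illustrative choices.

```python
import random

def diff(a, b, discrete):
    """Per-feature difference: 0/1 for discrete features, |a - b| for continuous ones."""
    if discrete:
        return 0.0 if a == b else 1.0
    return abs(a - b)

def distance(x, z, is_discrete):
    """Sum of per-feature differences (an assumed distance for the neighbor search)."""
    return sum(diff(a, b, d) for a, b, d in zip(x, z, is_discrete))

def relief(X, y, is_discrete, n_iter, seed=0):
    """Return one weight per feature; assumes every class has at least two samples."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    W = [0.0] * n_features
    for _ in range(n_iter):
        i = rng.randrange(n)  # randomly selected sample
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        near_hit = min(hits, key=lambda j: distance(X[i], X[j], is_discrete))
        near_miss = min(misses, key=lambda j: distance(X[i], X[j], is_discrete))
        for a in range(n_features):
            d = is_discrete[a]
            # Reward features that differ on the near-miss, penalize those that differ on the near-hit.
            W[a] += (diff(X[i][a], X[near_miss][a], d)
                     - diff(X[i][a], X[near_hit][a], d)) / n_iter
    return W

def top_k(W, k):
    """Indices of the k features with the largest weights."""
    return sorted(range(len(W)), key=lambda a: W[a], reverse=True)[:k]
```

In this sketch a feature that separates the classes well receives a large positive weight, while a feature that varies mostly within a class receives a negative one.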

Machine learning classifiers

Machine learning classification algorithms are used to distinguish heart disease patients from healthy people. Five popular classifiers and their theoretical backgrounds are discussed briefly in this paper.

Random forest

RF is a machine learning algorithm based on an ensemble of decision trees [33]. In traditional decision tree methods such as C4.5 and C5.0, all the features are used for generating the decision tree. In contrast, RF builds multiple decision trees and chooses a random subspace of the features for each of them. Then, the votes of the trees are aggregated and the class with the most votes is the prediction result [34]. RF can reduce overfitting and capture the nonlinear and interactive effects of variables. Besides, each tree is trained separately, so the training can be done in parallel, which reduces the training time needed. Finally, combining the prediction results of the trees can reduce the variance and improve the accuracy of the predictions. Many studies have shown the performance superiority of RF over other machine learning methods [35,36,37].

Logistic regression

LR is a generalized linear regression model [38]. Therefore, it is similar to multiple linear regression in many aspects. Usually, LR is used for binary classification problems where the predicted variable $y \in \{0,1\}$, with 0 denoting the negative class and 1 the positive class. It can also be used for multi-class classification.

In order to distinguish heart disease patients from healthy people, a hypothesis $h_{\theta }(x)$ based on the linear combination $\theta ^TX$ is proposed. The threshold of the classifier output is $h_{\theta }(x) = 0.5$; that is, if $h_{\theta }(x) \ge 0.5$, the model predicts $y = 1$, which means that the person is a heart disease patient, and otherwise the person is predicted to be healthy.

The sigmoid function of LR can be written as:

$$\begin{aligned} h_{\theta }(x) = \frac{1}{1+e^{-z}}, \end{aligned}$$

where $z = \theta ^TX$.

The cost function of LR can be written as:

$$\begin{aligned} J(\theta ) = \frac{1}{m}\sum _{i=1}^mcost ( y_i, y_i' ), \end{aligned}$$

where m is the number of instances to be predicted, $y_i$ is the real class label of the ith instance, and $y_i'$ is the predicted class label of the ith instance.

$$\begin{aligned} cost ( y_i, y_i' ) = \left\{ \begin{aligned} 0,&\quad y_i = y_i'\\ 1,&\quad otherwise. \end{aligned} \right. \end{aligned}$$
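The sigmoid hypothesis, the 0.5 threshold and the 0-1 cost above can be sketched in a few lines of Python. This is a minimal illustration that assumes the parameter vector `theta` has already been fitted; the function names are hypothetical.

```python
import math

def sigmoid(z):
    """h_theta(x) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x, threshold=0.5):
    """Predict 1 (patient) if h_theta(x) >= threshold, otherwise 0 (healthy)."""
    z = sum(t * xi for t, xi in zip(theta, x))  # z = theta^T x
    return 1 if sigmoid(z) >= threshold else 0

def zero_one_cost(y_true, y_pred):
    """J = (1/m) * sum of cost(y_i, y_i'), with cost 0 for a match and 1 otherwise."""
    return sum(1 for a, b in zip(y_true, y_pred) if a != b) / len(y_true)
```

With this cost, J is simply the fraction of misclassified instances.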

Support vector machine

Invented by Cortes and Vapnik [39], SVM is a supervised machine learning algorithm which has been widely used for classification problems [29, 40, 41]. The output of SVM is in the form of two classes in a binary classification problem, making it a non-probabilistic binary classifier [42]. SVM tries to find a linear maximum margin hyperplane that separates the instances.

Assume the hyperplane is $w^Tx+b=0$, where w is the coefficient vector normal to the hyperplane, b is the offset from the origin, and x is a data point. The hyperplane is thus determined by w and b. The data points nearest to the hyperplane are called support vectors. In the linear case, w can be solved by introducing Lagrangian multipliers $\alpha _i$. The solution for w can be written as:

$$\begin{aligned} w = \sum _{i=1}^m\alpha _iy_ix_i, \end{aligned}$$

where m is the number of support vectors and $y_i$ is the target label of $x_i$. The linear discriminant function can be written as:

$$\begin{aligned} g(x)=sgn\left(\sum _{i=1}^m\alpha _iy_ix_i^Tx+b\right), \end{aligned}$$

where sgn is the sign function: $sgn(x)=-1$ if $x< 0$, $sgn(x)=0$ if $x=0$, and $sgn(x)=1$ if $x> 0$. Nonlinear separation of the data set is performed by using a kernel function. In that case the discriminant function can be written as:

$$g(x)=sgn\left(\sum_{i=1}^m\alpha_iy_iK(x_i,x)+b\right),$$

where $K(x_i,x)$ is the kernel function.
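As an illustration, the discriminant function can be evaluated as follows. The sketch assumes that the support vectors, multipliers $\alpha_i$, labels $y_i \in \{-1, +1\}$ and offset b are already available from training; the RBF kernel and its `gamma` parameter are illustrative choices, since the text does not fix a particular kernel here.

```python
import math

def sgn(v):
    """Sign function: -1, 0 or 1."""
    return (v > 0) - (v < 0)

def linear_kernel(u, v):
    """K(u, v) = u^T v, which recovers the linear discriminant."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2); an assumed example of a nonlinear kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def discriminant(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """g(x) = sgn(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return sgn(s + b)
```

Passing `linear_kernel` gives the linear case, and any other kernel function with the same two-argument signature can be substituted.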

Extreme learning machine

ELM was first proposed by Huang et al. [43]. Like a single-layer feed-forward neural network (SLFNN), ELM is a simple neural network with a single hidden layer. However, unlike a traditional SLFNN, the hidden layer weights and biases of ELM are randomized and need not be tuned, and the output layer weights of ELM are analytically determined through simple generalized inverse operations [43, 44].
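The two defining steps (random, untuned hidden layer and output weights obtained from a generalized inverse) can be sketched with NumPy. The sigmoid activation, the Gaussian initialization, the 0.5 decision threshold and the function names are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def _hidden(X, W, b):
    """Hidden layer output with a sigmoid activation (assumed)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_train(X, T, n_hidden, seed=0):
    """Draw random hidden weights and solve the output weights analytically."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never tuned
    b = rng.normal(size=n_hidden)                # random hidden biases, never tuned
    H = _hidden(X, W, b)
    beta = np.linalg.pinv(H) @ T                 # Moore-Penrose generalized inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Binary prediction by thresholding the network output at 0.5."""
    return (_hidden(X, W, b) @ beta >= 0.5).astype(int)
```

No iterative training is involved: the only fitted quantity is `beta`, which is the least-squares solution for the output layer.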

K-nearest neighbor

KNN is a supervised classification algorithm. Its procedure is as follows: when a new case is given, the database is first searched to find the k historical cases that are closest to the new case, namely the k nearest neighbors, and these neighbors then vote on the class label of the new case. The new case is assigned to the class that has the most of its nearest neighbors [45]. The following formula is used to calculate the distance between two cases [46]:

$$\begin{aligned} d(x_i,x_j)=\sum _{q\in Q}w_q(x_{iq}-x_{jq})^2+\sum _{c\in C}w_cL_c(x_{ic},x_{jc}), \end{aligned}$$

where Q is the set of quantitative features and C is the set of categorical features, $L_c$ is an $M \times M$ symmetric matrix, $w_q$ is the weight of feature q and $w_c$ is the weight of feature c.
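A minimal sketch of this mixed distance and of the majority vote is given below. It assumes that categorical values are coded as integer indices into $L_c$, that the weights are supplied in a dictionary keyed by feature index, and that ties in the vote are broken arbitrarily; the function names are hypothetical.

```python
from collections import Counter

def mixed_distance(xi, xj, quant_idx, cat_idx, w, L):
    """Weighted squared differences on quantitative features plus weighted
    lookups in the dissimilarity matrix L[c] for categorical features."""
    d = sum(w[q] * (xi[q] - xj[q]) ** 2 for q in quant_idx)
    d += sum(w[c] * L[c][xi[c]][xj[c]] for c in cat_idx)
    return d

def knn_predict(x_new, X, y, k, quant_idx, cat_idx, w, L):
    """Label of the majority class among the k nearest historical cases."""
    order = sorted(range(len(X)),
                   key=lambda i: mixed_distance(x_new, X[i], quant_idx, cat_idx, w, L))
    votes = Counter(y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

For example, a simple overlap matrix `[[0, 1], [1, 0]]` for a binary categorical feature adds its weight to the distance whenever the two cases disagree on that feature.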

Methods

The proposed classification system consists of four main components: (1) preprocessing of data, (2) feature selection using Relief algorithm, (3) training of individual classifiers, and (4) prediction result generation of the ensemble classifier. A flow chart of the proposed system is shown in Fig. 1. The main components of the system are described in the following subsections.

Data preprocessing

The aim of data preprocessing is to obtain data from different heart disease data repositories and then process them in the appropriate format for the subsequent analysis [47]. The preprocessing phase involves missing-value imputation and data normalization.

Missing-value imputation

Missing data in medical data sets must be handled carefully because they have a serious effect on the experimental results. Usually, researchers choose to replace the missing values with the mean or mode of the attribute, depending on its type [26]. Mokeddem [47] used weighted KNN to calculate the missing values. In the present study, features whose values are missing in more than $50\%$ of all instances are removed, and the remaining missing values are then replaced with the group mean instead of the simple mean, as Bashir et al. did in their study [41]. For example, if the case with a missing value is a patient, the mean value for patients is calculated and inserted in place of the missing value. In this way the class label is taken into consideration, so the information offered by the dataset can be fully utilized.
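The two imputation steps can be sketched as follows, assuming numeric features stored as lists of rows and missing values encoded as `None`. The function names and the encoding are illustrative assumptions.

```python
def drop_sparse_features(rows, max_missing=0.5):
    """Remove features whose share of missing values exceeds max_missing."""
    n = len(rows)
    keep = [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / n <= max_missing]
    return [[r[j] for j in keep] for r in rows], keep

def impute_group_mean(rows, labels):
    """Replace each missing value with the mean of that feature within the same class."""
    n_features = len(rows[0])
    means = {}
    for label in set(labels):
        for j in range(n_features):
            observed = [r[j] for r, l in zip(rows, labels)
                        if l == label and r[j] is not None]
            # If a class has no observed value for a feature, the gap is left as None.
            means[(label, j)] = sum(observed) / len(observed) if observed else None
    return [[means[(l, j)] if r[j] is None else r[j] for j in range(n_features)]
            for r, l in zip(rows, labels)]
```

Because the group mean depends on the class label, this step can only be applied to instances whose label is known.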

Data normalization

Before feature selection, the continuous features are normalized to have mean 0 and variance 1, which eliminates the effects of different quantitative units.
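This normalization corresponds to the usual z-score transform, sketched below for a single feature column. The use of the population variance (division by the number of values) and the handling of constant columns are assumptions, as the text does not specify them.

```python
import math

def standardize(column):
    """Rescale a list of numbers to mean 0 and variance 1 (z-score)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0.0:
        return [0.0] * n  # a constant column carries no information
    return [(v - mean) / std for v in column]
```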

Feature selection and training of individual classifiers

In this phase, the dataset is randomly split into a training set, a validation set and a test set: $80\%$ of the dataset is used for training, $10\%$ for validation and $10\%$ for testing. The features are ranked by the Relief algorithm on the training set. A higher ranking means that the feature has a stronger distinguishing quality and a higher weight [48]. Afterwards, features are added to the ensemble model one by one, from the most important to the least. This yields several models trained on the training set with different numbers of features; the number of models equals the number of features. These models are tested on the validation set, and the ensemble classifier with the best performance is taken to have the best feature subset. This classifier is then applied to the test set, and its performance is recorded in Sect. 5. This procedure is repeated 10 times.
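The split and the incremental search over the ranked features can be sketched as follows. The callback `score_on_validation` is a hypothetical placeholder that is assumed to train the ensemble on the training set with the given features and return its validation performance; how that performance is measured is left open here.

```python
import random

def split_80_10_10(n, seed=0):
    """Randomly partition indices 0..n-1 into training, validation and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = n * 8 // 10, n // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def best_feature_subset(ranked_features, score_on_validation):
    """Add features from most to least important and keep the best-scoring prefix."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = score_on_validation(subset)  # one model per subset size
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Only prefixes of the ranking are evaluated, so the number of candidate models grows linearly with the number of features.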

Prediction result generation

The classification accuracy and misclassification cost (MC) of each classifier are taken into account when generating the final prediction result. In the present study, in order to compare the misclassification costs of the different classifiers conveniently, the cost of a correct classification is set to 0, and the MC is split into two scenarios. In the first scenario, healthy people are diagnosed with heart disease, resulting in unnecessary and costly treatment. In the second scenario, heart disease patients are told that they are healthy; as a result, they may miss the best time for treatment, which may cause the disease to deteriorate or even lead to death. The cost matrix is presented in Table 1. Considering the different costs people have to pay for misclassification, we set $cost_1=10$ and $cost_2=1$ [49, 50]. Afterwards, an index E is constructed to evaluate the performance of each classifier:

$$\begin{aligned} E_i= \frac{Accuracy_i+1-\frac{MC_i}{cost_1+cost_2}}{2}, \end{aligned}$$

where $Accuracy_i$ represents the accuracy and $MC_i$ represents the MC of the ith classifier during the training phase (the formula to calculate the MC is presented in Sect. 4.2). $E_i$ stands for the efficiency of the ith classifier in improving the accuracy and reducing the MC simultaneously. The weights of the individual classifiers are based on $E_i$ and are calculated as:

$$\begin{aligned} w_i=\frac{E_i}{\sum \limits _{j=1}^nE_j}, \end{aligned}$$

where n is the number of classifiers. Finally, the instances of the test set are fed into each classifier, and the output of the ensemble classifier is the label with the highest weighted vote [51].
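The index E, the normalized weights and the weighted vote can be sketched as follows. The accuracy and MC of each classifier are taken as given inputs, since the MC formula is defined in a later section; the function names are hypothetical, and the default costs are the values $cost_1=10$ and $cost_2=1$ stated above.

```python
def efficiency(accuracy, mc, cost_1=10.0, cost_2=1.0):
    """E_i = (Accuracy_i + 1 - MC_i / (cost_1 + cost_2)) / 2."""
    return (accuracy + 1.0 - mc / (cost_1 + cost_2)) / 2.0

def classifier_weights(E):
    """w_i = E_i / sum_j E_j, so the weights sum to 1."""
    total = sum(E)
    return [e / total for e in E]

def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```

For instance, with weights 0.45, 0.30 and 0.25, a single vote for label 1 from the first classifier is outweighed by two votes for label 0 from the others (0.45 against 0.55).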

Table 1 The cost matrix used by the classifiers
