BMC Medical Informatics and Decision Making BioMed

Background: Digital mammography is one of the most promising options for diagnosing breast cancer, the most common cancer in women. However, its effectiveness is limited by the difficulty of distinguishing actual cancer lesions from benign abnormalities, which results in unnecessary biopsy referrals. To overcome this issue, computer aided diagnosis (CADx) using machine learning techniques has been studied worldwide. Since this is a classification problem and the number of features obtainable from a mammogram image is effectively unlimited, a feature selection method tailored for use in CADx systems is needed.


Background
Applications of artificial intelligence and machine learning techniques in medicine are now common, and computer aided diagnosis (CADx) systems are one of the successful applications. Breast cancer, the most common cancer in women and second largest cause of death [1], is the disease for which CADx systems are expected to be employed most successfully. To apply CADx systems, various imaging methods are available to reflect the inner tissue structure of the breast. Digital mammography using low-dose x-ray is one of those methods and is the most popular one worldwide. It has advantages over other methods such as sonar or magnetic resonance imaging (MRI) due to its low cost and wide availability [2]. With digital mammography devices, doctors are able to find abnormal lesions that cannot be recognized by clinical palpation of the breast. CADx systems are applied to those images to detect and diagnose abnormalities. Since early detection of breast cancer is important for successful treatment of the disease, recent research has concentrated on improving the performance of CADx systems. Improvements in CADx systems can be obtained by addressing two classification tasks: (1) detecting more abnormalities and (2) distinguishing actual malignant cancers from benign ones. Detecting abnormalities in a digitized mammogram is a relatively easy task and many improvements have been achieved, while the latter task is still a major area of research [3]. To achieve better performance, both classic and modern machine learning approaches such as Bayesian networks [4], artificial neural networks [5,6] and support vector machines (SVMs) [5,7] have been applied. However, the performance of CADx systems is still not as high as required for practical use. This problem can be partially solved by using a feature selection method that is better suited to the mammogram classification problem [3].
We propose a new feature selection method for SVMs in this paper. Our method is based on SVM-Recursive Feature Elimination (SVM-RFE) [8] and its ensemble variant Multiple SVM-RFE [9]. We compared its classification performance with baseline methods and with two other SVM-RFE based feature selection methods, JOIN and ENSEMBLE, proposed by another group [10]. To compare the methods, we prepared a dataset consisting of mass and calcification lesions extracted from the Digital Database for Screening Mammography (DDSM) [11], the largest publicly available mammogram database.

Notations
Let us suppose that a data set consists of N examples $x_1, \ldots, x_N$, each of which has P features {1, ..., P}. Let $x_n = (x_{1,n}, \ldots, x_{P,n})$ be the n-th example where n ∈ {1, ..., N}, and the i-th feature value, i ∈ {1, ..., P}, of the n-th example is denoted by $x_{i,n}$. Class labels of the N examples will be denoted by $y = (y_1, \ldots, y_N)$.
In this paper, we only consider a binary classification problem because we are interested in distinguishing benign and malignant examples. Overall, the labeled data set is expressed as $\{(x_1, y_1), \ldots, (x_N, y_N)\}$.

SVM
SVM is one of the most popular modern classification methods. Based on the structural risk minimization principle, SVM defines an optimal hyperplane between samples of different class labels. The position of the hyperplane is adjusted so that the distance from the hyperplane to the nearest sample, or margin, is maximized.
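As a reference point, the maximum-margin idea described above is commonly written as the optimization problem below. This is the standard textbook hard-margin formulation rather than an equation taken from this paper; the weight vector $w$, the bias $b$ and the label encoding $y_n \in \{-1, +1\}$ are assumptions introduced here for illustration.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_n \left( w^{\top} x_n + b \right) \ge 1, \qquad n = 1, \ldots, N
```

Under these constraints the distance from the hyperplane to the nearest sample is $1/\lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.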

SVM-RFE
SVM is a powerful classification method, but it has no built-in feature selection mechanism. Therefore, a wrapper-type feature selection method, SVM-RFE, was introduced [8]. SVM-RFE generates a ranking of features by computing the information gain during iterative backward feature elimination. The idea of the information gain computation is based on Optimal Brain Damage (OBD) [12]. In every iteration, SVM-RFE sorts the features in the working set by the difference in the objective function caused by their removal, and removes the feature with the minimum difference. Defining IG(k) as the information gain when the k-th feature is removed, the overall iterative algorithm of SVM-RFE is shown in Algorithm 1.
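The backward elimination loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: it assumes a linear kernel, for which the change in the objective function reduces to $w_k^2 / 2$; it removes one feature per iteration; and `train_linear_svm` is a hypothetical stand-in trainer (plain subgradient descent on the hinge loss) rather than a production SVM solver. The toy data are invented for the example.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Stand-in trainer: full-batch subgradient descent on the
    L2-regularized hinge loss. Returns weights w and bias b."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1  # examples inside the margin or misclassified
        grad_w = lam * w - X[viol].T @ y[viol] / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y):
    """Return feature indices ordered from most to least important."""
    working = list(range(X.shape[1]))  # features still in the working set
    eliminated = []                    # filled from least important upward
    while working:
        w, _ = train_linear_svm(X[:, working], y)
        ig = 0.5 * w ** 2              # linear-kernel form of IG(k) (assumed)
        k = int(np.argmin(ig))         # feature whose removal matters least
        eliminated.append(working.pop(k))
    return eliminated[::-1]


# Toy data (hypothetical): only feature 0 carries class information.
rng = np.random.default_rng(0)
y_toy = np.where(rng.random(200) < 0.5, -1.0, 1.0)
X_toy = rng.normal(size=(200, 5))
X_toy[:, 0] = 2.0 * y_toy + 0.5 * rng.normal(size=200)
ranking = svm_rfe(X_toy, y_toy)
```

Removing several features per iteration, as the second SVM-RFE parameter allows, would only change how many indices are popped in each pass of the loop.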

ENSEMBLE and JOIN
SVM-RFE [8] has two parameters that need to be determined. The first decides how many features should be used to obtain the best performance. The second specifies what portion of the features should be eliminated in each iteration. A simple approach can be used to choose them. First, we separate the given training set into a partial training set and a hold-out set. Then, we apply Algorithm 2 with a parameter 'threshold'.

The score of each feature subset $R_o$ is computed as
where $\mathrm{err}(R_o)$ is the error of an SVM trained using $R_o$ and tested on the hold-out set. Using this method, we can obtain a feature subset R that yields a reasonably small error. Using this algorithm as a base, Jong et al. [10] proposed two methods, ENSEMBLE and JOIN, that combine multiple rankings generated by SVM-RFE, as in Algorithms 3 and 4.
In this paper, we used 25% of the training set as the hold-out set and used the same sets of thresholds and cutoffs as in [

3: end for
4: return a majority vote classifier using SVMs trained by .

Multiple SVM-RFE with bootstrap
Multiple SVM-RFE (MSVM-RFE) [9] is a recently introduced SVM-RFE-based feature selection algorithm. It exploits an ensemble of SVM classifiers and a cross-validation scheme to rank features. First, we make T subsamples from the original training set. Then, supposing that we have T SVMs trained on different subsamples, we calculate the discriminant information gain associated with each feature for each SVM. To compute this information gain, we use the same method as in SVM-RFE [8]. Exploiting the objective function of the SVM and its Lagrangian solution λ, we can derive a cost function

$$J = \frac{1}{2} \lambda^{T} H \lambda - \lambda^{T} \mathbf{1}$$

where H is a matrix with elements $y_q y_r K(x_q, x_r)$, $\mathbf{1}$ is an N-dimensional vector of ones, K(·) is a kernel function, and 1 ≤ q, r ≤ N. Since we are looking for the subset of features that has the best discriminating power between classes, we compute the difference in the cost function caused by eliminating the i-th input feature, leaving the Lagrangian multipliers unchanged. Therefore, the ranking criterion for the i-th feature of the j-th SVM can be defined as

$$DJ_{j,i} = \frac{1}{2} \lambda_j^{T} H \lambda_j - \frac{1}{2} \lambda_j^{T} H(-i) \lambda_j$$

where $\lambda_j$ is the Lagrangian solution of the j-th SVM and H(-i) denotes that the i-th feature was removed from all elements in H.

Then, considering $DJ_j$ as a weight vector of features for the j-th SVM, we normalize all T weight vectors as $DJ_j = DJ_j / \lVert DJ_j \rVert$. This gives us T weight vectors, each with P elements. Here, each element of a vector stands for the information gain achieved by eliminating the corresponding feature. After normalizing the weight vector of each SVM, we compute each feature's ranking score from $\mu_i$ and $\sigma_i$, the mean and standard deviation of the i-th feature's weight over the T SVMs.

The algorithm then applies this method to the training set with a k-fold cross-validation scheme. If we perform 5-fold cross validation and generate 20 subsamples in each fold, we eventually have T = 100 SVMs to combine. The overall MSVM-RFE algorithm is described in Algorithm 5.

Algorithm 5 MSVM-RFE
12: $e = \arg\min_{l} c(l)$ where $l \in S'$
15: end while
16: return R

One should note that the original MSVM-RFE proposed in [9] uses a cross-validation scheme when generating subsamples. However, we omitted this step because combining boosting with the original MSVM-RFE algorithm and its cross-validation scheme is very complex and may obscure the purpose of this study.
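To make the ranking step of MSVM-RFE concrete, the sketch below turns a T × P matrix of per-SVM feature criteria into one score per feature and picks the feature to eliminate. The exact score is not recoverable from the text here, so this sketch assumes the ratio $c_i = \mu_i / \sigma_i$ with the sample standard deviation, which is our reading of the original MSVM-RFE paper [9]; the function name `msvm_rfe_scores` and the toy numbers are hypothetical.

```python
import numpy as np


def msvm_rfe_scores(dj):
    """dj: array of shape (T, P), one row of per-feature criteria per SVM."""
    dj = np.asarray(dj, dtype=float)
    # Normalize each SVM's weight vector to unit Euclidean length.
    dj_norm = dj / np.linalg.norm(dj, axis=1, keepdims=True)
    mu = dj_norm.mean(axis=0)            # mean weight of each feature
    sigma = dj_norm.std(axis=0, ddof=1)  # sample standard deviation (assumed)
    return mu / sigma                    # assumed score c_i = mu_i / sigma_i


# Hypothetical criteria from T = 3 SVMs over P = 3 features.
dj_toy = [[3.0, 4.0, 12.0],
          [4.0, 3.0, 12.0],
          [12.0, 4.0, 3.0]]
scores = msvm_rfe_scores(dj_toy)
e = int(np.argmin(scores))  # feature to eliminate, as in step 12 of Algorithm 5
```

A feature whose weight is identical across all SVMs has zero standard deviation, so a practical implementation would need to guard that division.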

Multiple SVM-RFE with boosting
When making subsamples, the original MSVM-RFE uses the bootstrap approach [13]. This ensemble approach builds replicates of the original data set S by randomly re-sampling from S with replacement N times, where N is the number of examples. Therefore, each example $(x_n, y_n)$ may appear more than once or not at all in a particular replicate subsample. Statistically, it is desirable to make the replicates differ as much as possible to gain a larger improvement from the ensemble. The concept is both intuitively reasonable and theoretically sound. However, since the architecture of MSVM-RFE relies on simple bootstrapping, it is natural to consider replacing it with another popular ensemble method, boosting [14], for two reasons. First, boosting outperforms bootstrapping on average [15,16]; second, boosting of SVMs generally yields better classification accuracy than its bootstrap counterpart [17]. Therefore, to make effective use of an ensemble of SVMs, it may be worthwhile to use boosting instead of bootstrapping. For this reason, we applied AdaBoost [14], a classic boosting algorithm, to the MSVM-RFE algorithm in place of bootstrapping in this work.
Unlike the simple bootstrap approach, AdaBoost maintains a weight for each example in S. Initially, we assign the same weight to every example, $D_1(n) = 1/N$ where 1 ≤ n ≤ N. Each iteration consists of four steps. First, the algorithm generates a bootstrap subsample according to the weight distribution $D_t$ at the t-th iteration. Next, it trains an SVM using the subsample. Third, it calculates the error using the original example set S. Finally, it updates the weights so that the probability of correctly classified examples is decreased while that of incorrectly classified ones is increased. This update makes the next bootstrap pick more of the incorrectly classified, i.e. difficult-to-classify, examples than the easy-to-classify ones. The iterative re-sampling procedure MAKE_SUBSAMPLES() using the AdaBoost algorithm is described in Algorithm 6.
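The four steps above can be sketched as a short re-sampling routine. This is an illustration under assumptions rather than a transcription of Algorithm 6: the hypothesis weight uses the standard AdaBoost choice $\alpha = \frac{1}{2}\ln((1-\varepsilon)/\varepsilon)$, the error is clipped to avoid division by zero, and `train_centroid` is a hypothetical nearest-centroid stand-in for the SVM trainer. Labels are assumed to be encoded as +1 and -1.

```python
import numpy as np


def make_subsamples(X, y, T, train, rng):
    """Return T index arrays B_j and the hypothesis weights alpha_j."""
    n = len(y)
    D = np.full(n, 1.0 / n)  # D_1(n) = 1/N
    subsamples, alphas = [], []
    for _ in range(T):
        # Step 1: bootstrap subsample drawn according to the current weights.
        idx = rng.choice(n, size=n, replace=True, p=D)
        # Step 2: train a classifier on the subsample.
        h = train(X[idx], y[idx])
        # Step 3: weighted error on the original example set S.
        pred = h(X)
        eps = float(np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - eps) / eps)  # standard AdaBoost weight (assumed)
        # Step 4: down-weight correct examples, up-weight incorrect ones.
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()  # division by the normalization factor Z_j
        subsamples.append(idx)
        alphas.append(alpha)
    return subsamples, np.array(alphas)


def train_centroid(X, y):
    """Hypothetical stand-in for an SVM: nearest class-centroid classifier."""
    c_pos = X[y == 1].mean(axis=0)
    c_neg = X[y == -1].mean(axis=0)

    def h(Z):
        d_pos = np.linalg.norm(Z - c_pos, axis=1)
        d_neg = np.linalg.norm(Z - c_neg, axis=1)
        return np.where(d_pos < d_neg, 1.0, -1.0)

    return h


# Toy data (hypothetical): two overlapping Gaussian classes.
rng = np.random.default_rng(0)
y_toy = np.repeat([1.0, -1.0], 20)
X_toy = rng.normal(size=(40, 2)) + y_toy[:, None]
B, alpha = make_subsamples(X_toy, y_toy, T=5, train=train_centroid, rng=rng)
```

Only the index arrays and the weights are returned, since the method keeps the subsamples and the hypothesis weights but not the classifiers trained during re-sampling.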
Algorithm 6 MAKE_SUBSAMPLES() using AdaBoost
$D_{j+1}(n) = \frac{D_j(n)}{Z_j} \exp(-\alpha_j y_n h_j(x_n))$, where $Z_j$ is a normalization factor chosen so that $D_{j+1}$ is also a probability distribution
10: end for
11: return $B_j$, $\alpha_j$ where 1 ≤ j ≤ T

In addition to modifying the re-sampling method, we changed the ranking criterion of the original MSVM-RFE. In MSVM-RFE with boosting, the weight vector $DJ_j$ of the j-th SVM undergoes one more step between normalization and the calculation of the feature ranking score. Since each SVM in the ensemble contributes differently to the overall classification accuracy, we multiply the normalized feature weight vector $DJ_j$ by another weight factor. The new weight factor is the weight of the hypothesis classifier, $\alpha_j$, calculated during the re-sampling process of AdaBoost. By multiplying $DJ_j$ by $\alpha_j$, we can grade the overall feature weights more coherently. The overall iterative algorithm of MSVM-RFE with AdaBoost is described in Algorithm 7.

Algorithm 7 MSVM-RFE with AdaBoost

Lastly, unlike conventional applications of boosting, we only exploit the bootstrap subsamples generated by the algorithm and discard the trained SVMs, for the following reasons:
• We are primarily interested in feature ranking and not the aggregation of weak hypotheses.
• Since we use SVM-RFE as the eventual classification method, aggregating the boosted models would require a criterion for picking an appropriate number of features from the different boosted models.
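The modified ranking criterion, in which each SVM's normalized weight vector is multiplied by its AdaBoost hypothesis weight before the scores are computed, can be sketched as below. As in the unboosted case, the final score is assumed to be the mean divided by the sample standard deviation across SVMs; the function name `boosted_scores` and the toy inputs are hypothetical.

```python
import numpy as np


def boosted_scores(dj, alphas):
    """dj: (T, P) per-SVM feature criteria; alphas: (T,) hypothesis weights."""
    dj = np.asarray(dj, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Normalize each SVM's weight vector, then scale it by alpha_j.
    dj_norm = dj / np.linalg.norm(dj, axis=1, keepdims=True)
    weighted = alphas[:, None] * dj_norm
    mu = weighted.mean(axis=0)
    sigma = weighted.std(axis=0, ddof=1)  # sample standard deviation (assumed)
    return mu / sigma                     # assumed score: mean over deviation


# Hypothetical criteria from T = 3 SVMs and their hypothesis weights.
dj_toy = [[3.0, 4.0, 12.0],
          [4.0, 3.0, 12.0],
          [12.0, 4.0, 3.0]]
boosted = boosted_scores(dj_toy, [1.0, 0.5, 0.1])
uniform = boosted_scores(dj_toy, [2.0, 2.0, 2.0])
baseline = boosted_scores(dj_toy, [1.0, 1.0, 1.0])
```

With equal hypothesis weights the scores are unchanged by the scaling, so under this assumed score the boosting weights only matter when the SVMs differ in quality.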
In preliminary experiments using the same number of features and simple majority-voting aggregation, SVM-RFE using boosted models did not show a significant improvement in accuracy. However, we found some evidence that an ensemble of SVMs can be useful in mammogram classification.

Results
In this section, we first describe the dataset, features and experimental framework we used. Then we present the results of the experiments and analyze them.

Dataset
The DDSM database provides about 2500 mammogram cases that were gathered from 1988 to 1999. Four U.S. medical institutions contributed the data used to construct DDSM: Massachusetts General Hospital (MGH), Wake Forest University School of Medicine (WFUSM), Sacred Heart Hospital (SHH) and Washington University in St. Louis (WU). All mammogram cases we used in this paper contain one or more abnormalities, each of which can be classified as benign or malignant according to its biopsy result. Table 1 summarizes the statistics of abnormalities from each digitizer type and institution.

Performance comparison
In sum, we prepared a total of 16 datasets, with 8 and 22 features respectively, for the mass and calcification lesions of each institution. All SVM-RFE based methods were tested using 5-fold cross validation on each dataset. We computed the area under the Receiver Operating Characteristic (ROC) curve ($A_z$) using the output of the SVMs and the feature ranking produced by each method.
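For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen malignant example receives a higher classifier output than a randomly chosen benign one. The sketch below computes this empirical value by pairwise comparison. It is an assumption that this matches how $A_z$ was obtained here, since $A_z$ is sometimes computed from a fitted curve instead; the function name `auc` and the label encoding (+1 malignant, -1 benign) are hypothetical.

```python
def auc(labels, scores):
    """Empirical area under the ROC curve; ties count as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # tie
    return wins / (len(pos) * len(neg))


# Hypothetical SVM outputs for two malignant and two benign examples.
perfect = auc([1, 1, -1, -1], [0.9, 0.8, 0.3, 0.1])
partial = auc([1, 1, -1, -1], [0.9, 0.2, 0.3, 0.1])
```

In the second call one malignant example scores below one benign example, so three of the four malignant/benign pairs are ordered correctly.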
Before comparing the methods explained in the previous section, we ran preliminary experiments with different kernels and parameters to find the best-performing combination. The results of these experiments are summarized in Table 3 and Table 4. We used the best-performing kernel (radial basis function, or RBF) and parameters from these experiments in the rest of this study.
The overall performance comparison result is summarized from Table 5 through [...] improved, as we already mentioned in the previous section. Any method that can effectively exploit the trained SVMs during the feature selection process may be the key future improvement for MSVM-RFE with boosting.

Conclusion
In this paper, a new SVM-RFE based feature selection method was proposed. We conducted experiments on real-world clinical data and compared our method with baseline and other SVM-RFE based feature selection methods. The results show that our method outperforms the others in some cases and is at least competitive with them in the remaining cases. Therefore, it can be a possible alternative to SVM-RFE or the original MSVM-RFE. Future work includes investigating specific methods to effectively combine the models trained during the feature selection process and ways to combine the feature subsets generated by individual SVM-RFE instances.