Notations
Let us suppose that a data set consists of N examples x_{1},..., x_{N}, each of which has P features {1,..., P}.
Let x_{n} = (x_{1, n},..., x_{P, n}) be the nth example, where n ∈ {1,..., N}, and let the ith feature value, i ∈ {1,..., P}, of the nth example be denoted by x_{i, n}. The class labels of the N examples are denoted by y = (y_{1},..., y_{N}).
In this paper, we only consider a binary classification problem because we are interested in distinguishing benign and malignant examples. Overall, the labeled data set is expressed as {(x_{1}, y_{1}),..., (x_{N}, y_{N})}.
SVM
SVM is one of the most popular modern classification methods. Based on the structural risk minimization principle, SVM defines an optimal hyperplane between samples of different class labels. The position of the hyperplane is adjusted so that the distance from the hyperplane to the nearest sample, i.e., the margin, is maximized.
Moreover, if the SVM cannot find a hyperplane that separates the examples in the input space, it can use a kernel function to map the examples into a kernel space where a hyperplane can separate them. Although any kernel function satisfying Mercer's theorem can be used with SVM, in this research we consider only the widely used linear and Gaussian radial basis function (RBF) kernels.
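As a minimal illustration of the two kernels considered here, the following sketch evaluates them on plain Python lists. The function names and the width parameter `gamma` (with its default value) are assumptions made for this example only and are not taken from the original experiments.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2).

    gamma is a hypothetical width parameter chosen for illustration.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])
k_rbf = rbf_kernel([0.0, 0.0], [1.0, 1.0])
```

The RBF kernel equals 1 for identical examples and decays toward 0 as the examples move apart, which is what allows a nonlinear separation in the input space.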
SVM-RFE
SVM is a powerful classification method, but it has no built-in feature selection method. Therefore, a wrapper-type feature selection method, SVM-RFE, was introduced [8]. SVM-RFE generates a ranking of features by computing the information gain during iterative backward feature elimination. The idea of information gain computation is based on Optimal Brain Damage (OBD) [12]. In every iteration, SVM-RFE sorts the features in the working set by the difference in the objective function caused by their removal and removes the feature with the minimum difference. Defining IG(k) as the information gain when the kth feature is removed, the overall iterative algorithm of SVM-RFE is shown in Algorithm 1.
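The backward elimination loop described above can be sketched as follows. This is only an outline under stated assumptions: `info_gain` is a hypothetical callback standing in for the SVM-based computation of IG(k), and the toy importance values replace a trained SVM.

```python
def rfe_ranking(num_features, info_gain):
    """Rank features by iterative backward elimination.

    info_gain(k, surviving) is a hypothetical callback returning IG(k) for
    feature k given the surviving feature list; in SVM-RFE it would be
    derived from an SVM trained on `surviving`.
    """
    surviving = list(range(num_features))  # working set S
    ranking = []                           # ranked list R, most important first
    while surviving:
        # Pick the feature whose removal changes the objective the least.
        e = min(surviving, key=lambda k: info_gain(k, surviving))
        ranking.insert(0, e)               # R = [e, R]
        surviving.remove(e)                # S = S - [e]
    return ranking

# Toy criterion: a fixed importance per feature (stands in for a trained SVM).
toy_importance = [0.3, 0.9, 0.1, 0.5]
ranking = rfe_ranking(4, lambda k, surviving: toy_importance[k])
```

Because each eliminated feature is prepended to R, the feature removed last ends up first in the returned ranking.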
Algorithm 1 SVM-RFE
Require: Feature lists R = [] and S = [1,..., P]
1: while S ≠ [] do
2:   Train an SVM with features in S
3:   for all kth feature in S do
4:     Compute IG(k)
5:   end for
6:   e = arg min_{k}(IG(k))
7:   R = [e, R]
8:   S = S − [e]
9: end while
10: return R
ENSEMBLE and JOIN
SVM-RFE [8] has two parameters that need to be determined. The first parameter decides how many features should be used to obtain the best performance. The second parameter specifies what portion of the features should be eliminated in each iteration. To resolve this issue, a simple approach can be easily implemented. First, we separate the given training set into a partial training set and a holdout set. Then, we apply Algorithm 2 with some parameter 'threshold'.
The score of each feature subset R_{o} is computed as
where err(R_{o}) is the error of an SVM trained using R_{o} and tested on the holdout set. Using this method, we can obtain a feature subset R that yields a reasonably small error on the training data set. Using this algorithm as a base, Jong et al. [10] proposed two methods, ENSEMBLE and JOIN, which combine multiple rankings generated by SVM-RFE, as in Algorithms 3 and 4.
In this paper, we used 25% of the training set as the holdout set and used the same sets of thresholds and cutoffs as in [10], i.e., {0.2, 0.3, 0.4, 0.5, 0.6, 0.7} and {1, 2, 3, 4, 5}, respectively.
Algorithm 2 SVM-RFE(threshold)
Require: Ranked feature lists R = [], R_{i} = [] where i = 1,..., P and S' = [1,..., P]
1: i = 1
2: while S' ≠ [] do
3:   Train an SVM using the partial training set with features in S'
4:   for all features in S' do
5:     Compute ranking of features as in SVM-RFE
6:   end for
7:   R_{i} = S'
8:   Eliminate threshold percent of the less important features from S'
9:   i = i + 1
10: end while
11: R = R_{o} where R_{o} yields the minimum score on the holdout set
12: return R
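A rough sketch of the thresholded elimination may help to fix ideas. Both callbacks are hypothetical: `rank_features` stands in for the SVM-based ranking on the partial training set, and `score` for the holdout score of a subset. The rounding rule (floor, but at least one feature per iteration) is also an assumption of this sketch, since it is not specified in the text.

```python
import math

def rfe_with_threshold(num_features, threshold, rank_features, score):
    """Sketch of thresholded backward elimination with hypothetical callbacks.

    rank_features(subset) -> subset ordered from most to least important.
    score(subset) -> holdout score of the subset, lower is better.
    """
    surviving = list(range(num_features))    # S'
    candidates = []                          # R_1, R_2, ...
    while surviving:
        ordered = rank_features(surviving)
        candidates.append(list(ordered))     # R_i = S'
        # Drop the given fraction of the least important features (at least one,
        # so that the loop always terminates).
        n_drop = max(1, math.floor(threshold * len(ordered)))
        surviving = ordered[:len(ordered) - n_drop]
    return min(candidates, key=score)        # R_o with the minimum score

# Toy stand-ins: fixed importances and a score that prefers three features.
importance = [0.3, 0.9, 0.1, 0.5, 0.7, 0.2]
best = rfe_with_threshold(
    6, 0.5,
    rank_features=lambda subset: sorted(subset, key=lambda k: -importance[k]),
    score=lambda subset: abs(len(subset) - 3),
)
```

With a threshold of 0.5 the candidate subsets in this toy run have 6, 3, 2 and 1 features, so larger thresholds explore fewer subset sizes but need fewer training rounds.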
Algorithm 3 ENSEMBLE(v_{1}, v_{2},..., v_{k})
1: for threshold v ∈ {v_{1}, v_{2},..., v_{k}} do
2:   R_{v} = SVM-RFE(v)
3: end for
4: return a majority vote classifier using SVMs trained with R_{v_{1}},..., R_{v_{k}}
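The majority vote in the last step can be illustrated with a few lines of code. The three toy classifiers below are placeholders for SVMs trained on different feature subsets, and the tie-breaking rule is an assumption of this sketch.

```python
def majority_vote(classifiers):
    """Combine binary classifiers (each returning +1 or -1) by majority vote."""
    def predict(x):
        total = sum(clf(x) for clf in classifiers)
        # Ties are broken toward +1 here; the tie rule is an assumption.
        return 1 if total >= 0 else -1
    return predict

# Toy stand-ins for SVMs trained on different feature subsets R_v.
clf_a = lambda x: 1 if x[0] > 0 else -1
clf_b = lambda x: 1 if x[1] > 0 else -1
clf_c = lambda x: 1 if x[0] + x[1] > 1 else -1
ensemble = majority_vote([clf_a, clf_b, clf_c])
```

Using an odd number of thresholds avoids ties altogether in the binary case.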
Algorithm 4 JOIN(cutoff, v_{1}, v_{2},..., v_{k})
1: for threshold v ∈ {v_{1}, v_{2},..., v_{k}} do
2:   R_{v} = SVM-RFE(v)
3: end for
4: R = features selected at least cutoff times in {R_{v_{1}},..., R_{v_{k}}}
5: return an SVM trained with R
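The feature-joining step can be sketched with a simple counter. The function name and the toy subsets are illustrative only.

```python
from collections import Counter

def join_features(subsets, cutoff):
    """Return the features that occur in at least `cutoff` of the given subsets."""
    # set(subset) ensures a feature is counted once per subset.
    counts = Counter(f for subset in subsets for f in set(subset))
    return sorted(f for f, c in counts.items() if c >= cutoff)

# Toy feature subsets, as if selected with three different thresholds.
subsets = [[0, 2, 5], [2, 5, 7], [1, 2]]
selected = join_features(subsets, cutoff=2)
```

A cutoff of 1 yields the union of all subsets, whereas a cutoff equal to the number of thresholds yields their intersection.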
Multiple SVM-RFE with bootstrap
Multiple SVM-RFE (MSVM-RFE) [9] is a recently introduced SVM-RFE-based feature selection algorithm. It exploits an ensemble of SVM classifiers and a cross-validation scheme to rank features. First, we make T subsamples from the original training set. Then, supposing that we have T SVMs trained using different subsamples, we calculate the discriminant information gain associated with each feature for each SVM. To compute this information gain, we use the same method as in SVM-RFE [8]. Using the objective function of the SVM and its Lagrangian solution λ, we can derive the cost function
J = (1/2) λ^{T} H λ − λ^{T} 1
where H is the matrix with elements y_{q} y_{r} K(x_{q}, x_{r}), 1 is an N-dimensional vector of ones, K(·) is a kernel function, and 1 ≤ q, r ≤ N. Since we are looking for the subset of features that has the best discriminating power between classes, we compute the difference in the cost function caused by eliminating the ith input feature, leaving the Lagrangian multipliers unchanged. Therefore, the ranking criterion for the ith feature of the jth SVM can be defined as
DJ_{ji} = (1/2) λ_{j}^{T} H λ_{j} − (1/2) λ_{j}^{T} H(−i) λ_{j}
where H(−i) denotes H with the ith feature removed from all of its elements. Then, considering DJ_{j} as the weight vector of features for the jth SVM, we normalize all T weight vectors as DJ_{j} = DJ_{j}/‖DJ_{j}‖. This gives us T weight vectors, each with P elements. Here, each element of a vector stands for the information gain achieved by eliminating the corresponding feature. After normalizing the weight vectors of each SVM, we can compute each feature's ranking score
c_{i} = μ_{i}/σ_{i}    (1)
with μ_{i} and σ_{i} defined as:
μ_{i} = (1/T) Σ_{j=1}^{T} DJ_{ji},  σ_{i} = sqrt( Σ_{j=1}^{T} (DJ_{ji} − μ_{i})^{2} / (T − 1) )
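The ranking computation can be sketched in plain Python as follows. This is an illustration under assumptions rather than the implementation used in the experiments: the Lagrange multipliers are given as a list instead of being obtained from a trained SVM, the ranking score is taken to be the ratio of the mean to the sample standard deviation of the normalized criteria, and all function names are hypothetical.

```python
import math
import statistics

def cost_difference(lam, y, X, kernel, i):
    """DJ_i = (1/2) lam^T H lam - (1/2) lam^T H(-i) lam for one trained SVM.

    lam: Lagrange multipliers, y: labels (+1/-1), X: list of feature lists.
    H(-i) is built from the examples with feature i dropped.
    """
    def quad_form(data):
        n = len(data)
        return sum(lam[q] * lam[r] * y[q] * y[r] * kernel(data[q], data[r])
                   for q in range(n) for r in range(n))
    reduced = [x[:i] + x[i + 1:] for x in X]
    return 0.5 * quad_form(X) - 0.5 * quad_form(reduced)

def normalize(vec):
    """Scale a weight vector DJ_j to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def ranking_scores(weight_vectors):
    """c_i = mu_i / sigma_i over T weight vectors (assumes sigma_i > 0)."""
    per_feature = list(zip(*weight_vectors))
    return [statistics.mean(col) / statistics.stdev(col) for col in per_feature]

# Toy check with a linear kernel and hand-picked multipliers.
dot = lambda a, b: sum(p * q for p, q in zip(a, b))
X = [[2.0, 1.0], [0.0, 0.0]]
y = [1, -1]
lam = [1.0, 1.0]
dj = [cost_difference(lam, y, X, dot, i) for i in range(2)]
scores = ranking_scores([[1.0, 2.0], [3.0, 2.5], [2.0, 3.0]])
```

For a linear kernel the criterion reduces to half the squared weight of the feature, w_i^2/2, which is why the toy example gives 2.0 and 0.5 for the weight vector w = (2, 1).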
The algorithm then applies this method to the training set with a k-fold cross-validation scheme. If we perform 5-fold cross-validation and generate 20 subsamples in each fold, we will eventually have T = 100 SVMs to combine. The overall MSVM-RFE algorithm is described in Algorithm 5.
Algorithm 5 MSVM-RFE
Require: Ranked feature lists R = [] and S' = [1,..., P]
1: while S' ≠ [] do
2:   Train T SVMs using T subsamples with features in S'
3:   for all jth SVM, 1 ≤ j ≤ T do
4:     for all ith feature, 1 ≤ i ≤ P do
5:       Compute DJ_{ji}
6:     end for
7:     Compute DJ_{j} = DJ_{j}/‖DJ_{j}‖
8:   end for
9:   for all features l ∈ S' do
10:     Compute c_{l} using Equation (1)
11:   end for
12:   e = arg min_{l}(c_{l}) where l ∈ S'
13:   R = [e, R]
14:   S' = S' − [e]
15: end while
16: return R
Note that the original MSVM-RFE proposed in [9] uses a cross-validation scheme when generating subsamples. However, we omitted this step because combining boosting with the cross-validation scheme of the original MSVM-RFE algorithm is very complex and may obscure the purpose of this study.
Multiple SVM-RFE with boosting
When making subsamples, the original MSVM-RFE uses the bootstrap approach [13]. This ensemble approach builds replicates of the original data set S by randomly resampling from S with replacement N times, where N is the number of examples. Therefore, each example (x_{n}, y_{n}) may appear more than once or not at all in a particular replicate subsample. Statistically, it is desirable to make the replicates differ as much as possible to gain a larger improvement from the ensemble. The concept is both intuitively reasonable and theoretically sound. However, since the architecture of MSVM-RFE uses simple bootstrapping, it is natural to consider another popular ensemble method, boosting [14], in place of bootstrapping, for two reasons. First, boosting outperforms bootstrapping on average [15, 16]. Second, boosting of SVMs generally yields better classification accuracy than its bootstrap counterpart [17]. Therefore, to make effective use of an ensemble of SVMs, it may be worthwhile to use boosting instead of bootstrapping. For this reason, we applied AdaBoost [14], a classic boosting algorithm, to the MSVM-RFE algorithm instead of bootstrapping in this work.
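A small sketch of the simple bootstrap makes the resampling behaviour concrete. The data set size of 10 and the fixed seed are arbitrary choices for illustration.

```python
import random
from collections import Counter

def bootstrap_indices(n_examples, rng):
    """Draw n_examples indices uniformly at random with replacement."""
    return [rng.randrange(n_examples) for _ in range(n_examples)]

rng = random.Random(0)           # fixed seed only to make the sketch repeatable
replicate = bootstrap_indices(10, rng)
counts = Counter(replicate)
repeated = [n for n, c in counts.items() if c > 1]   # drawn more than once
missing = [n for n in range(10) if n not in counts]  # never drawn
```

Inspecting `repeated` and `missing` for a few seeds shows how each replicate duplicates some examples and omits others.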
Unlike the simple bootstrap approach, AdaBoost maintains a weight for each example in S. Initially, we assign the same weight to every example, D_{1}(n) = 1/N where 1 ≤ n ≤ N. Each iteration consists of four steps. First, the algorithm generates a bootstrap subsample according to the weight distribution D_{t} at the tth iteration. Next, it trains an SVM using the subsample. Third, it calculates the error using the original example set S. Finally, it updates the weights so that the probability of correctly classified examples is decreased while that of incorrectly classified ones is increased. This update makes the next bootstrap pick more incorrectly classified, i.e., difficult-to-classify, examples than easy-to-classify ones. The iterative resampling procedure MAKE_SUBSAMPLES() using the AdaBoost algorithm is described in Algorithm 6.
Algorithm 6 MAKE_SUBSAMPLES
Require: S = {(x_{n}, y_{n})}, D_{1}(n) = 1/N, n = 1,..., N
1: for j = 1 to T do
2:   Build a bootstrap B_{j} = {(x_{n}, y_{n}) | n = 1,..., N} based on weight distribution D_{j}
3:   Train an SVM hypothesis h_{j} using B_{j}
4:   Compute the weighted error ϵ_{j} = Σ_{n: h_{j}(x_{n}) ≠ y_{n}} D_{j}(n) of h_{j} on S
5:   if ϵ_{j} ≥ 0.5 then
6:     Goto line 2
7:   end if
8:   α_{j} = (1/2) ln((1 − ϵ_{j})/ϵ_{j}), α_{j} ∈ R
9:   D_{j+1}(n) = (D_{j}(n)/Z_{j}) × exp(−α_{j} y_{n} h_{j}(x_{n})) where Z_{j} is a normalization factor chosen so that D_{j+1} is also a probability distribution
10: end for
11: return B_{j}, α_{j} where 1 ≤ j ≤ T
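The resampling procedure can be sketched as follows. Several points are assumptions of this sketch and not part of the original method: a toy one-dimensional threshold classifier replaces the SVM, the "Goto" step is bounded by a hypothetical `max_retries` parameter, the error is clipped away from zero to keep the hypothesis weight finite, and subsamples are returned as index lists.

```python
import math
import random

def train_stump(xs, ys):
    """Toy 1-D threshold classifier standing in for the SVM (an assumption)."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    if not pos or not neg:                    # degenerate resample: constant prediction
        label = 1 if pos else -1
        return lambda x: label
    threshold = (min(pos) + max(neg)) / 2.0
    return lambda x: 1 if x > threshold else -1

def make_subsamples(examples, labels, n_rounds, train, rng, max_retries=100):
    """AdaBoost-style resampling; returns index lists B_j and weights alpha_j."""
    n = len(examples)
    dist = [1.0 / n] * n                      # D_1(n) = 1/N
    subsamples, alphas = [], []
    for _ in range(n_rounds):
        for _attempt in range(max_retries):   # bounded version of the retry step
            idx = rng.choices(range(n), weights=dist, k=n)
            h = train([examples[i] for i in idx], [labels[i] for i in idx])
            preds = [h(x) for x in examples]
            # Weighted error on the original set S.
            eps = sum(d for d, p, y in zip(dist, preds, labels) if p != y)
            if eps < 0.5:
                break
        else:
            raise RuntimeError("no hypothesis with error below 0.5 was found")
        eps = max(eps, 1e-10)                 # keeps alpha finite when eps == 0
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        # Lower the weight of correct examples, raise that of misclassified ones.
        dist = [d * math.exp(-alpha * y * p) for d, p, y in zip(dist, preds, labels)]
        z = sum(dist)                         # normalization factor Z_j
        dist = [d / z for d in dist]
        subsamples.append(idx)
        alphas.append(alpha)
    return subsamples, alphas

examples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
labels = [-1, -1, -1, 1, 1, 1]
subsamples, alphas = make_subsamples(examples, labels, 3, train_stump, random.Random(0))
```

Because a hypothesis is accepted only when its weighted error is below 0.5, every returned hypothesis weight is positive.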
In addition to modifying the resampling method, we changed the ranking criterion of the original MSVM-RFE. In MSVM-RFE with boosting, the weight vector DJ_{j} of the jth SVM undergoes one more step between normalization and the calculation of the feature ranking score. Since each SVM in the ensemble contributes differently to the overall classification accuracy, we multiply the normalized feature weight vector DJ_{j} by an additional weight factor. The new weight factor is obtained from the weight of the hypothesis classifier calculated during the resampling process of AdaBoost. By multiplying DJ_{j} by this weight α_{j}, we can grade the overall feature weights more coherently. The overall iterative algorithm of MSVM-RFE with AdaBoost is described in Algorithm 7.
Algorithm 7 MSVM-RFE with AdaBoost
Require: Ranked feature lists R = [] and S' = [1,..., P]
1: MAKE_SUBSAMPLES(B_{t}, α_{t}); t = 1,..., T
2: while S' ≠ [] do
3:   Train T SVMs using B_{t}, with features in set S'
4:   Compute and normalize T weight vectors DJ_{j} as in MSVM-RFE where 1 ≤ j ≤ T
5:   for j = 1 to T do
6:     DJ_{j} = DJ_{j} × ln(α_{j})
7:   end for
8:   for all features l ∈ S' do
9:     Compute the ranking score c_{l} using Eq. (1)
10:   end for
11:   e = arg min_{l}(c_{l}) where l ∈ S'
12:   R = [e, R]
13:   S' = S' − [e]
14: end while
15: return R
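The additional weighting step (step 6 of Algorithm 7) amounts to a per-classifier scaling, sketched below with toy values. The function name and the hypothesis weights are chosen purely for illustration.

```python
import math

def weight_by_hypothesis(weight_vectors, alphas):
    """Scale each normalized vector DJ_j by ln(alpha_j).

    Requires alpha_j > 0; note that ln(alpha_j) is negative whenever alpha_j < 1.
    """
    return [[v * math.log(a) for v in vec]
            for vec, a in zip(weight_vectors, alphas)]

vectors = [[0.6, 0.8], [0.8, 0.6]]          # two normalized weight vectors
alphas = [math.e, math.e ** 2]              # toy hypothesis weights
weighted = weight_by_hypothesis(vectors, alphas)
```

Here the second vector is scaled twice as strongly as the first, although its raw hypothesis weight is about 2.7 times larger, which illustrates the damping effect of the logarithm.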
Note that we took the logarithm of the hypothesis weights instead of the raw values in order to avoid radical changes in the ranking criterion. Since the boosting algorithm tends to overfit by nature and the base classifier, SVM, is a relatively strong classifier, the error rate of the hypotheses increases drastically as the iterations in MAKE_SUBSAMPLES() progress. We observed this overfitting problem in preliminary experiments and addressed it by taking the logarithm of the hypothesis weights. The computation time of MSVM-RFE with boosting should also be mentioned here. In our experiments, we found no significant difference in computation time between the original MSVM-RFE and MSVM-RFE with boosting as the number of subsamples generated by MAKE_SUBSAMPLES() decreases.
Lastly, unlike conventional applications of the boosting algorithm, we only exploit the bootstrap subsamples generated by the algorithm and discard the trained SVMs, for the following reasons:
• We are primarily interested in feature ranking and not in the aggregation of weak hypotheses.
• Since we use SVM-RFE as the eventual classification method, this would require a certain criterion for picking an appropriate number of features from the different boosted models.
In preliminary experiments using the same number of features and simple majority-voting aggregation, SVM-RFE using boosted models did not show a significant improvement in accuracy. However, we found some evidence that an ensemble of SVMs can be useful in mammogram classification.