
Accurate breast cancer diagnosis using a stable feature ranking algorithm



Background

Breast cancer (BC) is one of the most common cancers among women. Since diverse features can be collected, stably selecting the most informative ones for accurate BC diagnosis remains challenging.


Methods

A hybrid framework is designed for successively investigating both feature ranking (FR) stability and cancer diagnosis effectiveness. Specifically, on 4 BC datasets (BCDR-F03, WDBC, GSE10810 and GSE15852), the stability of 23 FR algorithms is evaluated via an advanced estimator (S), and the predictive power of the stable feature ranks is further tested by using different machine learning classifiers.


Results

Experimental results identify 3 algorithms achieving good stability (\(S \ge 0.55\)) on the four datasets and generalized Fisher score (GFS) leading to state-of-the-art performance. Moreover, GFS ranks suggest that shape features are crucial in BC image analysis (BCDR-F03 and WDBC) and that using a few genes can well differentiate benign and malignant tumor cases (GSE10810 and GSE15852).


Conclusions

The proposed framework identifies a stable FR algorithm for accurate BC diagnosis. Stable and effective features could deepen the understanding of BC diagnosis and related decision-making applications.



Background

Breast cancer (BC) is one of the most frequently diagnosed cancers among women worldwide. In 2020, it caused 2.26 million new cases and 0.68 million deaths [1]. As a transitioning country, China is facing a growing burden, with nearly 0.42 million new cases [2]. Worse still, transitioning countries have lower incidence rates but much higher death rates than transitioned countries [1]. The substantial BC burden in developing and low-resource countries calls for cost-effective screening and diagnostic services to improve survival rates and quality of life [3].

Many techniques have been developed for BC screening and diagnosis [4, 5]. Mammography (MAM) is the gold standard for BC screening. Due to its high-resolution imaging of internal anatomy, it facilitates the observation of suspicious lesions. To make a cancer diagnosis, a fine needle aspiration (FNA) biopsy is needed. It obtains a sample of breast lump cells, and a pathologist checks whether the sample contains any cancer cells [4]. Gene expression profiling tests analyze genes within cancer cells and can help decide whether a patient is expected to benefit from additional treatment after surgery [5]. Some other modalities, such as ultrasound tomography [6], are in pre-clinical trials for investigating BC diagnosis and prognosis.

Computer-aided diagnosis (CAD) models have also been built to facilitate BC diagnosis [7, 8]. A CAD model consists of feature extraction, feature selection, and malignancy prediction. Feature extraction designs or collects variables or predictors for breast tumor representation. The features can be computed from intensity analysis, shape description and texture quantification [7]. As feature dimensionality grows dramatically, feature selection becomes increasingly important; its purpose is to find a subset of features by removing redundant and irrelevant ones [9]. According to the output type, feature selection methods can be categorized into feature ranking (FR) and subset feature selection (SFS) groups [9]. To differentiate benign and malignant cases, popular classifiers including, but not limited to, artificial neural network (ANN), K-nearest neighbors (KNN), linear discriminant analysis (LDA), naive Bayes (NB), random forest (RF) and support vector machine (SVM) are used [10]. Recently, deep learning has advanced CAD performance [11]. It fuses feature extraction, feature selection and cancer prediction into a seamless optimization procedure [12]. Novel architectures have been designed, and technical strategies have also been suggested [8].
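As an illustration of this three-stage pipeline, the sketch below chains feature selection and a classifier with scikit-learn; the dataset (WDBC, which ships with scikit-learn), the univariate F-test ranker and the linear SVM are illustrative stand-ins, not the exact components evaluated in this study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features are assumed already extracted: X holds precomputed descriptors.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)

cad = Pipeline([
    ("scale", StandardScaler()),              # normalize extracted features
    ("select", SelectKBest(f_classif, k=5)),  # keep the 5 most discriminative features
    ("clf", SVC(kernel="linear")),            # malignancy prediction
])
cad.fit(X_tr, y_tr)
print(round(cad.score(X_te, y_te), 3))        # test-set accuracy
```

Swapping the ranker or the classifier only changes one pipeline stage, which is exactly why FR and classification can be studied separately, as done in this work.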

However, two shortcomings are observed in the understanding of FR/SFS algorithms for decision-making applications. Firstly, stability has rarely been studied: few FR/SFS algorithms have been evaluated [13,14,15,16], and stability estimators are not yet comprehensive [17]. Secondly, the superiority of an FR/SFS algorithm is overwhelmingly determined by its predictive power and is thus performance-oriented. The underestimation of stability decreases user confidence and hampers the deployment of FR/SFS algorithms in real-world applications.

To address the above-mentioned shortcomings, a hybrid framework is proposed for investigating both FR/SFS stability and diagnosis effectiveness. To the best of our knowledge, this is the first work devoted to evaluating the stability and effectiveness of more than twenty FR algorithms on BC data analysis (BCDR-F03, WDBC, GSE10810 and GSE15852). The contributions of this work can be summarized as follows:

  1. A hybrid framework is designed in which both the FR/SFS stability and the diagnosis effectiveness can be evaluated successively.

  2. The stability of 23 FR algorithms is assessed on 4 BC datasets via an advanced estimator, and 3 FR algorithms are identified as stable.

  3. The predictive power of stable ranks is tested, and generalized Fisher score (GFS) leads to superior performance regardless of classifiers.

  4. GFS ranks suggest that shape features are vital in image analysis (BCDR-F03 and WDBC) and that a few genes can well differentiate malignant cases from benign ones (GSE10810 and GSE15852).

Related work

In most studies, the superiority of FR/SFS algorithms is defined by the predictive power as shown with solid-line arrows in Fig. 1. For instance, performance comparison of SFS methods and classifiers on glioma grading is quantified by using the balanced accuracy and the area under the curve [10], and FR outcomes followed by classifiers are evaluated using precision, sensitivity and F-measure for finding the most significant features [18].

Fig. 1 The performance-oriented (solid-line arrows) and the proposed stability-first (dashed-line arrows) FR/SFS frameworks for building a CAD model

Few studies have explored FR/SFS stability and predictive power at the same time. For BC risk forecasting, 6 methods are assessed via correlation coefficient and Jaccard index [13]. For colorectal cancer prediction, 6 methods are evaluated using 3 similarity-based measures [14]. On gene datasets, 6 methods are analyzed with 3 estimators [15]. On small-sample data analysis, relative weighted consistency, partially adjusted average Tanimoto index and correlation-based similarity measures are used [16].

However, these studies [13,14,15,16] are not comprehensive, because the estimators used lack one or more properties a good estimator should possess [17], and consequently, the conclusions might be untenable. Therefore, building a hybrid framework to investigate both the stability and the effectiveness of FR/SFS algorithms is meaningful. Table 1 compares the related works in terms of the number of FR/SFS algorithms and estimators involved and whether the estimators satisfy the full properties of a good estimator [17].

Table 1 Comparison of related works

This study differs from the previous studies [13,14,15,16]. Overall, 23 algorithms are evaluated, more than in each previous study. Meanwhile, an advanced estimator [17] is used, and the dynamic change of FR stability is quantified with respect to the number of selected features. Moreover, on 4 BC datasets, stable algorithms are identified, and their effectiveness is assessed on malignancy prediction of breast tumor cases. In addition, selected features are analyzed as potential BC signatures by literature screening, and the findings may pave the way for understanding the disease occurrence and diagnosis.

Materials and methods

In this section, data collection, FR algorithms, stability estimator, machine learning classifiers and experimental design are described. To preserve the readability, major notations are summarized in Table 2.

Table 2 Main notations

Data collection

Four datasets are analyzed. BCDR-F03 [19] includes 406 breast lesions (230 benign and 176 malignant) and 736 MAM images. For lesion representation, 17 features are derived from intensity analysis (mean, median, standard error, maximum, minimum, kurtosis, and skewness), shape description (area, perimeter, x-center, y-center, circularity, elongation, and form) and texture quantification (contrast, correlation, and entropy). To avoid one lesion being represented by multiple images, the first feature record of each lesion is used.

Wisconsin Diagnostic Breast Cancer (WDBC) [20] contains 357 benign and 212 malignant instances. For an FNA image, 10 features (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension) are computed. Besides mean values, the standard error and the “worst” (or largest) values of the features are collected.
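Incidentally, WDBC ships with scikit-learn, which allows a quick sanity check of the counts quoted above (note that scikit-learn encodes malignant as 0, the opposite of the labeling convention used later in this paper):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()          # scikit-learn's copy of WDBC
counts = dict(zip(data.target_names, np.bincount(data.target)))
print(data.data.shape)               # (569, 30): mean, standard-error and "worst" values of 10 features
print(counts)                        # scikit-learn: 0 = malignant, 1 = benign
```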

The other two datasets are from the Gene Expression Omnibus (GEO) [21]. GSE10810 comprises 31 tumor samples and 27 control samples of breast specimens, with 18,382 gene profiles per sample [22]. GSE15852 provides 43 tumor samples and 43 control samples of Malaysian women, with 22,283 gene expression data points per sample [23].

Table 3 shows the dataset information. The goal is to recognize malignant samples from benign ones by using medical images (BCDR-F03 and WDBC), or gene profiles (GSE10810 and GSE15852).

Table 3 Summary of the datasets used in this study

FR algorithms

Twenty-three methods in the matFR toolbox [24] are evaluated; methods exceeding the time budget (\(\ge 0.5\) hour per iteration) on GSE15852 are discarded. In general, the core ideas of the algorithms used are based on absolute values of t-test [25], relative entropy [26], Bhattacharyya distance [27], the area between the empirical receiver operating characteristic curve and random classifier slope [28], absolute values of Mann-Whitney test [29], ReliefF [30], the least absolute shrinkage and selection operator [31], correlation analysis [32], generalized Fisher score (GFS) [33], Gini score [34], Kruskal-Wallis test [35], pairwise feature proximity (PWFP) [36], min-max local structure information [37], local learning-based clustering [38], eigenvector centrality [39], probabilistic latent graph-based measure space [40], concave minimization and SVM [41], the convergence properties of the power series of matrices [42], Laplacian score [43], L\(_{2,0}\)-norm equality constraints (LNEC) [44], adaptive structure learning [45], robust spectral learning of the spectrum information of the graph Laplacian [46], and L\(_{2,1}\)-norm minimization on processes of both label learning and feature learning [47]. Full details of the algorithms can be found in the original publications.
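For intuition, below is a minimal sketch of the classical per-feature Fisher score, the criterion that GFS [33] generalizes: the ratio of between-class to within-class variance, computed independently per feature. The joint subset optimization of the generalized variant is not reproduced here, and the toy data are made up.

```python
import numpy as np

def fisher_score(X, y):
    """Classical Fisher score per feature: between-class / within-class variance."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                       # overall mean per feature
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / within

# Feature 0 separates the classes cleanly; feature 1 is noise.
X = np.array([[1.0, 5.0], [1.1, 3.0], [0.9, 4.0],
              [3.0, 4.5], [3.1, 3.5], [2.9, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])
scores = fisher_score(X, y)
print(np.argsort(-scores))  # feature indices, most discriminative first: [0 1]
```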

Stability estimator

The stability estimator proposed in [17] is used; it possesses all the properties a good estimator should hold. It recasts the FR/SFS stability measure as the estimation of a random variable, with the corresponding population parameters explicitly embedded. After the sampling distribution is identified, tools are provided to estimate confidence intervals and to perform hypothesis tests. Importantly, the estimator allows for reliable comparison across different FR/SFS procedures.

In addition, stability values (S) above 0.75 represent excellent agreement of feature sets beyond chance, values below 0.40 reveal poor agreement between sampled feature sets, and values between 0.40 and 0.75 indicate intermediate to good agreement.
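A compact implementation in the variance-over-chance style of the estimator in [17] might look as follows; this is a sketch that omits the confidence-interval and hypothesis-testing machinery, and the two selection matrices are made up to illustrate the thresholds above.

```python
import numpy as np

def stability(Z):
    """Stability of a binary selection matrix Z (runs x features), in [17]'s style:
    1 minus the mean per-feature selection variance, normalized by its value
    under chance selection of subsets of the observed average size."""
    Z = np.asarray(Z, dtype=float)
    M, p = Z.shape
    phat = Z.mean(axis=0)                    # selection frequency per feature
    s2 = M / (M - 1) * phat * (1 - phat)     # unbiased per-feature variance
    kbar = Z.sum(axis=1).mean()              # average subset size
    return 1 - s2.mean() / ((kbar / p) * (1 - kbar / p))

# 30 runs that always pick the same 2 of 5 features -> perfect agreement
Z_stable = np.tile([1, 1, 0, 0, 0], (30, 1))
# 30 runs split between two disjoint subsets -> poor agreement
Z_unstable = np.vstack([np.tile([1, 1, 0, 0, 0], (15, 1)),
                        np.tile([0, 0, 1, 1, 0], (15, 1))])
print(stability(Z_stable))      # 1.0
print(stability(Z_unstable))    # well below the 0.40 "poor agreement" line
```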

Machine learning classifiers

To avoid potential over-fitting, simple classifiers are used. ANN is configured with one hidden layer (ANN01) or two hidden layers (ANN02), with 10 neurons per hidden layer. KNN is a nonparametric classifier in which a new instance is assigned the class label held by the majority of its K nearest neighbors. LDA finds a linear combination of features for separating new instances. NB is a probabilistic classifier based on Bayes' theorem with an assumption of feature independence given the class label. Linear SVM is a supervised learning classifier that groups new instances into classes using optimized hyper-planes.
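For concreteness, the six classifiers could be instantiated as follows with scikit-learn (the study itself uses MATLAB built-ins; the hyperparameters below mirror the text: 10 neurons per hidden layer, \(K=5\), linear kernel):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

classifiers = {
    "ANN01": MLPClassifier(hidden_layer_sizes=(10,)),     # one hidden layer
    "ANN02": MLPClassifier(hidden_layer_sizes=(10, 10)),  # two hidden layers
    "KNN": KNeighborsClassifier(n_neighbors=5),           # K = 5
    "LDA": LinearDiscriminantAnalysis(),
    "NB": GaussianNB(),
    "SVM": SVC(kernel="linear"),
}
print(sorted(classifiers))
```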

Evaluation metrics

To quantify the prediction performance, area under the receiver operating characteristic curve (AUC), accuracy (ACC), sensitivity (SEN), specificity (SPE), negative predictive value (NPV), F-measure and Matthews correlation coefficient (MCC) are used.

The metrics have been widely used in binary classification problems (\(y \in \{0, 1\}\)), and higher values indicate better prediction results. In this study, the label of a benign case is \(y=0\), and the label of a malignant case is \(y=1\).
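All of these metrics follow from the 2×2 confusion matrix with \(y=1\) as the positive (malignant) class; the computation below uses made-up predictions purely for illustration, not study data.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives  = 3
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives = 1
tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives  = 5
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives = 1

acc = (tp + tn) / (tp + tn + fp + fn)         # accuracy
sen = tp / (tp + fn)                          # sensitivity (recall)
spe = tn / (tn + fp)                          # specificity
npv = tn / (tn + fn)                          # negative predictive value
f1 = 2 * tp / (2 * tp + fp + fn)              # F-measure
mcc = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # Matthews correlation
print(acc, sen, spe)   # 0.8 0.75 0.833...
```

AUC is the only listed metric that needs continuous scores rather than hard labels, which is why it is computed from the classifier's decision values instead of the confusion matrix.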

Experimental design

Estimation of FR stability

Figure 2 shows how FR stability is estimated. In each iteration, a dataset \(\{(X, y)\}\) is divided into a training set \(\{(X^{train}, y^{train})\}\) and a testing set \(\{(X^{test}, y^{test})\}\), and each FR method yields a feature rank on the training set. Here, \(<f_{i,1}, ..., f_{i,k}, ..., f_{i,p}>\) is the output of the \(i^{th}\) run over the p features, and \(f_{i,k}\) is the ranking index of the \(k^{th}\) feature. Finally, the feature ranks are averaged as an output.

Fig. 2 The procedure of FR stability estimation. On a given dataset, N iterations of feature ranking are conducted, yielding averaged feature ranks and stability values S

The other output is the stability value (S) when the top-m features are selected. Notably, an FR algorithm generates a feature rank in descending order of feature importance; once the number (m) of features is defined, it yields a subset of features. In this study, \(N=30\), and m ranges from 3 to 10 in steps of 1. Specifically, when \(m = 3\) and \(S \ge 0.55\), an FR algorithm is assumed to be stable on the dataset.

As shown in Table 3, when a dataset is divided into two subsets, the number of benign and malignant cases is set equal in the training set (\(\approx\) 80% of the group with fewer cases).
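The resampling loop of Fig. 2 can be sketched as follows; the dataset, the univariate F-test ranker, and stratified (rather than class-balanced) splitting are illustrative substitutions for the study's own components.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
N, m, p = 30, 3, X.shape[1]
Z = np.zeros((N, p))       # run i selects feature k -> Z[i, k] = 1
ranks = np.zeros((N, p))   # per-run rank index of each feature
for i in range(N):
    X_tr, _, y_tr, _ = train_test_split(X, y, train_size=0.8,
                                        random_state=i, stratify=y)
    order = np.argsort(-f_classif(X_tr, y_tr)[0])  # features, best first
    ranks[i] = np.argsort(order)                   # rank index per feature
    Z[i, order[:m]] = 1                            # mark the top-m subset

avg_rank = ranks.mean(axis=0)                      # averaged feature ranks
phat = Z.mean(axis=0)
s2 = N / (N - 1) * phat * (1 - phat)               # per-feature variance
S = 1 - s2.mean() / ((m / p) * (1 - m / p))        # stability when top-m are kept
print(round(S, 2))
```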

Effectiveness of feature ranks on BC diagnosis

For a stable FR algorithm, its N feature ranks are averaged, and the predictive power of the top-m features is then explored on BC diagnosis. Figure 3 shows the procedure; the value of m increases incrementally, and linear SVM is shown as an example classifier.

Fig. 3 The procedure of estimating the effectiveness of feature ranks on BC diagnosis. On a given dataset, top-m features are incrementally added, and M iterations of data splitting are conducted for machine learning based malignancy prediction

After a model is trained, the optimal number (n) of features is determined by balancing model complexity and prediction performance, i.e. feature number vs. AUC value, on the testing set. Notably, \(M=50\), the feature number progresses from 1 to 10, and the prediction results and selected features are analyzed.
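The incremental evaluation of Fig. 3 can be sketched as below; the feature order is a hypothetical averaged rank (best first), the split count is reduced for brevity, and linear SVM stands in for the full set of classifiers.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
order = [27, 22, 7]   # hypothetical averaged feature rank, best first
M = 10                # iterations of data splitting (the study uses M = 50)
mean_auc = {}
for m in range(1, len(order) + 1):   # incrementally add top-m features
    cols = order[:m]
    aucs = []
    for i in range(M):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[:, cols], y, test_size=0.2, random_state=i, stratify=y)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, clf.decision_function(X_te)))
    mean_auc[m] = float(np.mean(aucs))
    print(m, round(mean_auc[m], 3))  # feature number vs. mean AUC
```

Plotting `mean_auc` against m is precisely the complexity-vs-performance curve used to pick the optimal number of features n.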

Statistical analysis

The Wilcoxon rank sum test is used to analyze the values of the evaluation metrics. It is a non-parametric test for two populations with independent samples. A p-value less than 0.05 is taken as significant for rejecting the null hypothesis of equal medians between two distributions.
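The same test is available in SciPy; the two samples below are synthetic stand-ins for, e.g., AUC values collected from two feature ranks.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
auc_a = rng.normal(0.95, 0.01, 50)   # hypothetical metric values, method A
auc_b = rng.normal(0.80, 0.02, 50)   # hypothetical metric values, method B
stat, p = ranksums(auc_a, auc_b)     # two-sided Wilcoxon rank-sum test
print(p < 0.05)                      # True: reject equal medians at 0.05
```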

Implementation and platform

The proposed framework runs on a Win10 system (one Intel (R) Core (TM) i9-11980HK CPU (3.20 GHz) and 64.0 GB RAM). Algorithms are implemented with MATLAB R2018a (MathWorks, Natick, MA, United States). FR algorithms are from the matFR toolbox, the stability estimator is from its GitHub repository, and classifiers and statistical analysis are implemented with built-in functions. In addition, except for \(K = 5\) (KNN), the parameters of the FR methods, the estimator, the classifiers and the Wilcoxon rank sum test are set to default values.


Results

Estimated FR stability

Stability values are shown in Tables 4 and 5, and values with \(S \ge 0.55\) when \(m=3\) are marked in red. Table 4 shows that most algorithms achieve good stability on the medical image datasets (BCDR-F03 and WDBC). The S values of [25, 35, 36, 39, 43, 46, 47] are near or above 0.80 on both datasets. In contrast, [32, 37] on BCDR-F03 and [32, 41] on WDBC are highly sensitive to data perturbation.

Table 4 Estimated FR stability values on medical image datasets (BCDR-F03 and WDBC)

Table 5 shows several algorithms with good stability on the gene datasets. On GSE10810, [27, 33, 36, 41, 44] are stable with \(0.58 \le S \le 0.78\). On GSE15852, [33, 35, 36, 39, 42,43,44] have S values within [0.56, 0.85]. Notably, S values of some algorithms are close to zero, such as [32, 38, 40, 45] on GSE10810 and [32, 40, 45] on GSE15852.

Table 5 Estimated FR stability values on gene expression datasets (GSE10810 and GSE15852)

According to the rule of \(S \ge 0.55\) when \(m=3\), there are 19, 20, 5, and 7 algorithms showing good stability on BCDR-F03, WDBC, GSE10810 and GSE15852, respectively. It is also found that PWFP is strongly stable (\(\ge 0.70\)) on all the datasets, LNEC obtains strong stability on BCDR-F03 and WDBC and good stability (\(\ge 0.60\)) on GSE10810, while the S values of GFS are relatively lower. Moreover, compared to the S values on the gene datasets (GSE10810 and GSE15852), FR algorithms obtain much higher values on the medical image datasets. Further observation reveals 3 FR algorithms (GFS [33], PWFP [36] and LNEC [44]) obtaining good stability on all the datasets, and these algorithms are the focus of the follow-up analysis.

Effectiveness of feature ranks on BC diagnosis

The predictive power of feature ranks on BC diagnosis is shown in Figs. 4 (BCDR-F03), 5 (WDBC), 6 (GSE10810) and 7 (GSE15852). In the figures, AUC values marked as solid lines with blue crosses, dashed lines with brown triangles and dot-dashed lines with yellow stars stand for the results from the ranks of GFS, PWFP and LNEC, respectively. In each plot, the horizontal axis denotes the number (m) of the features, and the vertical axis shows the AUC values using a specific machine learning classifier.

On the BCDR-F03

Figure 4 shows the prediction results on BCDR-F03 using different feature subsets. On average, AUC values from GFS ranks are larger than those from the other feature ranks. Based on GFS ranks, NB and SVM achieve better performance using 3 and 2 features, respectively. Meanwhile, with LDA as the classifier, LNEC yields performance close to GFS when 3 features are used. Out of the 17 image features, GFS prefers “contrast”, “circularity” and “perimeter”.

Fig. 4 Predictive power of feature ranks on BCDR-F03. Each plot shows the results of one classifier combined with different FR feature ranks

On the WDBC

Figure 5 shows AUC values when different feature subsets are used on WDBC. Comparatively, PWFP ranks yield worse results, while GFS and LNEC ranks lead to superior results. From the perspective of model simplicity, KNN with LNEC (2 features), LDA with GFS (2 features), NB with GFS (2 features), and SVM with GFS (2 features) or LNEC (2 features) achieve good results (AUC \(\ge 0.90\)). Out of the 30 image features, GFS ranks “the largest concave points”, “the largest perimeter” and “concave points” highest.

Fig. 5 Predictive power of feature ranks on WDBC. Each plot shows the results of one classifier combined with different FR feature ranks

On the GSE10810

The change of AUC values with the number of selected features on GSE10810 is shown in Fig. 6. It suggests that GFS can identify a subset of discriminative features, since the AUC values reach \(\approx\) 0.95 when few features are used. In addition, the prediction performance remains good across classifiers. In contrast, feature ranks from PWFP and LNEC yield poor AUC values (\(\le\) 0.80). Out of 18,382 genes, GFS prefers “206930_at”, “243311_at” and “222083_at” as the most important ones.

Fig. 6 Predictive power of feature ranks on GSE10810. Each plot shows the results of one classifier combined with different FR feature ranks

On the GSE15852

Figure 7 shows the AUC values on GSE15852. Again, AUC values from GFS ranks are much better than those from the other two feature ranks, and using 2 to 4 features leads to AUC \(\ge\) 0.80. Out of the 22,283 gene profiles, GFS ranks the genes “204997_at”, “210298_x_at” and “222317_at” highest.

Fig. 7 Predictive power of feature ranks on GSE15852. Each plot shows the results of one classifier combined with different FR feature ranks

Summary of the BC diagnostic performance

Table 6 presents the prediction results on the datasets when using different feature ranks, classifiers and selected features; * denotes a p-value < 0.05 of a metric within a feature rank. Generally, GFS ranks lead to higher metric values than the ranks from PWFP and LNEC under the same classifiers. Notably, the superiority of GFS ranks is observed in the gene data analysis, with significantly better results (p-values \(< 10^{-8}\)) regardless of classifiers. In summary, for malignancy prediction, GFS ranks induce superior results when using 4 features as the NB input on BCDR-F03, 2 features as the NB input on WDBC, 2 genes as the SVM input on GSE10810 and 4 genes as the NB input on GSE15852.

Table 6 Summary of BC diagnosis performance (* denotes p-value < 0.05 of one metric within a feature rank)

Representative achievement on the BC datasets

Table 7 shows representative achievements using FR/SFS and classifiers. On BCDR-F03, using 17 features [19] achieves 0.06 higher AUC than the present study. On WDBC, using 2 features in the present study achieves slightly lower AUC, ACC and SPE but higher SEN than using 6 features with a genetic algorithm [48]. On GSE10810, using 2 features from GFS leads to much better ACC than using 80 features from the t-test in [49]. On GSE15852, using 4 features in the present study results in lower ACC than using 235 features [50] or 10 features [51], while it achieves ACC close to that of 33 features [23] and 50 features [52]. In general, GFS ranks lead to performance competitive with or better than other FR/SFS methods while using far fewer features.

Table 7 Representative achievement on the BC datasets

Computational complexity analysis

The computational complexity of the proposed framework is from the FR/SFS algorithms (\(\mathcal {O}_{fr}\)), the stability estimator (\(\mathcal {O}_{es}\)) and the classifier (\(\mathcal {O}_{class}\)), which can be generally formulated as \(\mathcal {O}_{fr} + \mathcal {O}_{es} + \mathcal {O}_{class}\).

In the proposed model, GFS is the FR algorithm, and \(\mathcal {O}_{fr} = \mathcal {O}_{GFS} = \mathcal {O}(T(cns+s\log {m}))\), in which T is the number of iterations, s is the number of nonzero features among the training samples, c is the number of classes, n is the number of data samples, and m is the number of selected features [33]. The complexity of the estimator is \(\mathcal {O}_{es} = \mathcal {O}(Mp)\), in which M is the number of feature sets and p is the feature dimensionality [17]. As to classifiers, the testing complexity of NB is \(\mathcal {O}(cp)\) and that of linear SVM is \(\mathcal {O}(p)\).

Thus, the time cost of the proposed model is mainly attributable to the GFS algorithm. Figure 8 shows the time consumption for ranking features. It reveals that GFS is the fastest; on GSE15852, its average time cost is \(\approx\) 0.12 seconds per iteration.

Fig. 8 Average time cost per iteration for ranking features (GFS, red bar; PWFP, green bar; and LNEC, blue bar)


Discussion

A hybrid framework is proposed to identify stable FR algorithms for accurate BC diagnosis. Twenty-three algorithms have been evaluated on four datasets. The evaluation reveals that three algorithms show consistent stability and that GFS leads to superior prediction results.

Three algorithms show consistently good stability on the four datasets. First, 19 out of 42 algorithms fail to handle GSE15852 [23] within the time budget owing to its tens of thousands of gene features. Secondly, nearly all the remaining algorithms obtain stable feature ranks on BCDR-F03 and WDBC (Table 4), while substantially fewer algorithms show good stability on GSE10810 and GSE15852 (Table 5). The reason may lie in data sufficiency. There are more than eighteen samples per feature on the medical image datasets (BCDR-F03, 406 samples vs. 17 features; WDBC, 569 samples vs. 30 features), while on the gene datasets, samples are far from sufficient (GSE10810, 58 samples vs. 18,382 genes; GSE15852, 86 samples vs. 22,283 genes) (Table 3). This finding might suggest that sufficient samples are required for the construction of measure spaces before the estimation of feature importance [16]. Among the stable algorithms, GFS is the most efficient; it takes around 0.12 seconds to complete an iteration on the GSE15852 dataset (Fig. 8).

GFS ranks lead to superior diagnosis performance over the other two feature ranks (Table 6). On the medical image datasets, the evaluation metrics of GFS ranks are mostly higher than those of the PWFP or LNEC ranks, though without significant differences (p-values > 0.05); for instance, the GFS rank followed by the KNN classifier (0.59±0.05) even yields an inferior SPE value in comparison to the LNEC ranks with the KNN classifier (0.61±0.06). On the gene expression datasets, GFS ranks result in significantly better performance than the other two ranks (p-values \(< 10^{-8}\)) regardless of classifiers. From another perspective, using the same classifier, such as NB, on the datasets with sufficient samples (BCDR-F03 and WDBC), LNEC and PWFP ranks yield only slightly inferior metric values compared to GFS ranks, while on the gene expression datasets (GSE10810 and GSE15852), GFS ranks lead to much better results than the other two feature ranks. This comparison might reveal that GFS is able to discover signatures from high-dimensional small-sample gene datasets for improved BC diagnosis.

The proposed framework yields state-of-the-art performance (Table 7). On the gene expression datasets, using fewer gene features in this study surpasses some other methods on BC diagnosis. In [49], 80 genes yield inferior results on GSE10810. In [52], using 50 genes leads to worse performance on GSE15852. On the medical image datasets, using fewer features achieves comparable performance. On WDBC, 6 features lead to a better result [48] than the present study using 2 features. On BCDR-F03, compared to the baseline work using 17 features [19], the present study using 4 features incurs a 0.06 AUC decrease. In general, using fewer features and simpler classifiers in this study outperforms other complex CAD models, such as hierarchical clustering RF [54] and particle swarm optimization [55], and the main contribution might come from the effective GFS feature ranks.

The selected features for accurate BC diagnosis have already been reported in previous studies or guidelines. On BCDR-F03, “circularity”, “perimeter” and “contrast” are found beneficial to breast image analysis. This finding is consistent with the breast imaging-reporting and data system descriptor guideline [58], and clinical studies identify that malignant lesions in MAM images tend to show irregular shapes (“circularity” and “perimeter”) and inhomogeneous intensity (“contrast”). On WDBC, “concave points”, “the largest concave points” and “the largest perimeter” are vital in malignancy prediction. In an FNA image, “concave points” are the concave portions of the contour [20], and the presence of more concave points indicates a more irregular nucleus shape. The findings on BCDR-F03 and WDBC suggest that more attention should be paid to shape features in MAM and FNA image analysis. On GSE10810, BC occurrence is highly related to the genes “206930_at” and “222083_at” (both glycine-N-acyltransferase, GLYAT) and “243311_at” (defensin beta 132, DEFB132). Notably, the expression levels of “206930_at”, “222083_at” and “243311_at” decrease from the normal group (7.19±1.36, 7.59±1.07, and 7.12±1.81) to the tumor group (3.88±0.42, 4.89±0.43, and 4.22±0.40), with significant differences (p-values \(< 10^{-11}\)). Existing studies have revealed that GLYAT-encoded proteins catalyze the transfer of acyl groups from acyl-CoA to glycine to produce acylglycine and coenzyme A. Acyl-CoA is an important resource for oxidative phosphorylation and lipogenesis that is necessary for normal cell metabolism. In particular, downregulation of GLYAT expression is associated with a variety of malignant tumors, including BC tumors [59]. DEFB132 is a member of the alarm element family; it is mainly involved in the transmission of danger signals and may play a role in tumorigenesis [60].
On GSE15852, BC development is found in relation to genes “204997_at” (glycerol-3-phosphate dehydrogenase 1, GPD1), “210298_x_at” (four and a half LIM domains 1, FHL1), and “222317_at” (phosphodiesterase 3B, PDE3B). The expression level of the genes is significantly reduced from control cases to malignant cases (p-values \(< 10^{-6}\)). GPD1 encodes cytoplasmic NAD-dependent glycerol 3-phosphate dehydrogenase 1, a key element connecting carbohydrate and lipid metabolism. Existing studies have shown that GPD1 may inhibit the proliferation, migration, and invasion of breast cancer cells [61]. FHL1 has been identified as a suppressor gene for a variety of malignant tumors and exerts antitumor effects by inhibiting tumor differentiation, proliferation, invasion, and metastasis, and low FHL1 expression is closely related to the invasion and metastasis of breast cancer [62]. In addition, PDE3B-mediated cAMP hydrolysis limits the antiangiogenic potential of PKA in endothelial cells, suggesting PDE3B regulates angiogenesis and inhibits the occurrence and metastasis of breast cancer by controlling the invasion ability of endothelial cells [63].

In addition, selecting an appropriate classifier seems helpful when the feature ranks are less effective. Taking LNEC ranks as an example, the LDA classifier generally obtains fairly good results on the datasets (Table 6). In other words, using the same feature subset from LNEC ranks, LDA generally outperforms the other classifiers in mapping features to labels. It is also found that ANN with two hidden layers (ANN02) shows little improvement over ANN with one hidden layer (ANN01). The reason might come from the limited representation of quantitative features [19, 20] or the limited numbers of data samples [22, 23]. When feature ranks are fixed, which classifier suits a specific task is a performance-oriented problem, which may require basic analysis, systematic experiments and empirical experience. Meanwhile, it is feasible to merge feature ranks into an optimization procedure for ensemble feature selection and malignancy prediction [64].

FR/SFS stability is crucial in cancer diagnosis, signature discovery and many other related applications. In the era of deep learning, FR/SFS stability provides a novel way to improve user confidence when deep networks are applied for high-risk decision-making tasks. It is known that deep networks can perform as feature extractors to generate massive hierarchical features [8, 12]. However, these features are so abstract that the decision-making procedures become uninterpretable. Alternatively, the stability or preference of deep features can be quantified as the frequency of features activated in the training stage [65], as the reproducibility of features when error rate is controlled via paired-input nonlinear knockoffs [66], or as the difference of propagating activation when decomposing the output prediction of a deep network based on a specific input of learned features [67]. Most importantly, FR/SFS stability should be considered before translating lab research findings to clinical practice, since only the features that have been stably identified as potential signatures deserve labor and time for further clinical investigation.

Several limitations exist in the current study. Firstly, on datasets with hundreds of samples, the impact of training-set size on stability estimation is an interesting topic; however, owing to insufficient samples in the gene datasets, the training size is fixed here, and its impact will be explored in future work. Secondly, assessing stability with a single estimator may seem unconvincing, although the estimator possesses all the properties of a good stability measure [17]; a desirable approach is to develop more estimators and conduct a comprehensive evaluation. Meanwhile, lowering the stability threshold would admit more FR algorithms but complicate the follow-up data analysis, so \(S \ge 0.55\) is a trade-off. Thirdly, more advanced classifiers could be employed, such as deep learning networks [8]; to maintain good interpretability, six simple yet effective classifiers are applied here, and more classifiers will be considered in future work. Alternatively, instead of using classifiers directly, a promising direction is to embed feature ranks into an optimization procedure for signature discovery and cancer diagnosis [64]. Fourthly, retrieval and meta-analysis of the discovered genes would help in understanding cancer occurrence, development and prognosis, but these topics fall outside the scope of this study. In addition, different data-splitting strategies, such as k-fold cross-validation and percentage splits, might change the prediction results, whereas retaining the numbers of benign and malignant cases in the training set avoids data imbalance and prediction bias. Last but not least, more effort can be made to finely stratify patient cases using clinical data and cancer staging for personalized medicine.
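For reference, the stability estimator of Nogueira et al. [17] can be reproduced in a few lines. In this sketch (variable names are ours, not the paper's notation), the M selection runs are a binary matrix over d features, and S is one minus the averaged per-feature sample variance, normalized by the variance of a random selector of the same average subset size:

```python
# Minimal sketch of the stability estimator of Nogueira et al. [17].
# Z: M runs x d features; Z[i][f] = 1 if run i selected feature f.

def stability(Z):
    M, d = len(Z), len(Z[0])
    k_bar = sum(sum(row) for row in Z) / M          # mean subset size
    num = 0.0
    for f in range(d):
        p_f = sum(row[f] for row in Z) / M          # selection frequency
        num += (M / (M - 1)) * p_f * (1 - p_f)      # unbiased variance
    num /= d
    denom = (k_bar / d) * (1 - k_bar / d)           # random-selector variance
    return 1 - num / denom

# Perfectly stable: the same 2 of 4 features selected in every run
Z_stable = [[1, 1, 0, 0]] * 4
print(stability(Z_stable))   # 1.0

# As variable as a random selector of the same average size
Z_mixed = [[1, 0], [1, 0], [0, 1], [1, 0]]
print(stability(Z_mixed))    # 0.0
```

The estimator equals 1 when every run selects the same subset and is 0 in expectation for random selection, which is what makes thresholds such as \(S \ge 0.55\) interpretable.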


This study proposes a hybrid framework to investigate both the stability and the effectiveness of FR algorithms in BC data analysis. Three algorithms exhibit consistently good stability on the datasets, and the GFS feature ranks lead to superior classification performance. The GFS ranks suggest that shape features are vital in medical image analysis (BCDR-F03 and WDBC) and that a few genes can help differentiate benign and malignant cases (GSE10810 and GSE15852).

FR/SFS stability is important in real-world decision-making applications. This study indicates that few FR algorithms show stable feature preference in high-dimensional, small-sample data analysis. To address this challenge, developing stable FR/SFS algorithms is preferred. Meanwhile, effective reduction of feature dimensionality also helps in estimating feature importance accurately. In addition, collecting sufficient samples is a primary consideration for determining the data distribution and facilitating stability estimation.

The proposed model can recognize stable FR/SFS algorithms and effective feature subsets. However, it is restricted to quantitative features as input. Its scope could be broadened into the deep learning field by concatenating low-, middle- and high-level features of interest as the input. In the future, experiments will involve more FR/SFS algorithms, machine learning classifiers, stability estimators and medical datasets to find stable and discriminative features for cancer diagnosis and signature discovery.

Availability of data and materials

The datasets BCDR-F03, WDBC, GSE10810 and GSE15852 are all available online.






BC: Breast cancer


FR: Feature ranking


GFS: Generalized Fisher score




FNA: Fine needle aspiration


CAD: Computer-aided diagnosis


FS: Feature selection


ANN: Artificial neural network


KNN: K-nearest neighbors


LDA: Linear discriminant analysis


NB: Naive Bayes


RF: Random forest


SVM: Support vector machine


WDBC: Wisconsin diagnostic breast cancer


GEO: Gene Expression Omnibus


PFP: Pairwise feature proximity


LNEC: L\(_{2,0}\)-norm equality constraints


AUC: Area under the receiver operating characteristic curve








MCC: Matthews correlation coefficient


NPV: Negative predictive value


  1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):266–7.


  2. Cao W, Chen HD, Yu YW, Li N, Chen WQ. Changing profiles of cancer burden worldwide and in China: a secondary analysis of the global cancer statistics 2020. Chin Med J. 2021;134(7):783–7917.


  3. Sharma R. Global, regional, national burden of breast cancer in 185 countries: Evidence from GLOBOCAN 2018. Breast Cancer Res Treat. 2021;187(2):557–67.


  4. Barco I, Chabrera C, García-Fernández A, Fraile M, González S, Canales L, et al. Role of axillary ultrasound, magnetic resonance imaging, and ultrasound-guided fine-needle aspiration biopsy in the preoperative triage of breast cancer patients. Clin Transl Oncol. 2017;19(6):704–10.


  5. Reis-Filho JS, Pusztai L. Gene expression profiling in breast cancer: classification, prognostication, and prediction. Lancet. 2011;378(9805):1812–23.


  6. Yu S, Wu S, Zhuang L, Wei X, Sak M, Neb D, et al. Efficient segmentation of a breast in B-mode ultrasound tomography using three-dimensional GrabCut (GC3D). Sensors. 2017;17(8):1827.


  7. Houssein EH, Emam MM, Ali AA, Suganthan PN. Deep and machine learning techniques for medical imaging-based breast cancer: A comprehensive review. Expert Syst Appl. 2021;167:114161.


  8. Zou L, Yu S, Meng T, Zhang Z, Liang X, Xie Y. A technical review of convolutional neural network-based mammographic breast cancer diagnosis. Comput Math Methods Med. 2019;2019. Article ID 6509357.

  9. Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: A new perspective. Neurocomputing. 2018;300:70–9.


  10. Sun P, Wang D, Mok VC, Shi L. Comparison of feature selection methods and machine learning classifiers for radiomics analysis in glioma grading. IEEE Access. 2019;7:102010–20.


  11. Yu S, Liu L, Wang Z, Dai G, Xie Y. Transferring deep neural networks for the differentiation of mammographic breast lesions. Sci China Technol Sci. 2019;62(3):441–7.


  12. Debelee TG, Schwenker F, Ibenthal A, Yohannes D. Survey of deep learning in breast cancer image analysis. Evolving Syst. 2020;11(1):143–63.


  13. López NC, García-Ordás MT, Vitelli-Storelli F, Fernández-Navarro P, Palazuelos C, Alaiz-Rodríguez R. Evaluation of feature selection techniques for breast cancer risk prediction. Int J Environ Res Public Health. 2021;18(20):10670.


  14. Cueto-López N, García-Ordás MT, Dávila-Batista V, Moreno V, Aragonés N, Alaiz-Rodríguez R. A comparative study on feature selection for a risk prediction model for colorectal cancer. Comput Methods Programs Biomed. 2019;177:219–29.


  15. Kalousis A, Prados J, Hilario M. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst. 2007;12(1):95–116.


  16. Dernoncourt D, Hanczar B, Zucker JD. Analysis of feature selection stability on high dimension and small sample data. Comput Stat Data Anal. 2014;71:681–93.


  17. Nogueira S, Sechidis K, Brown G. On the stability of feature selection algorithms. J Mach Learn Res. 2017;18(1):6345–98.


  18. Vakharia V, Gupta VK, Kankar PK. A comparison of feature ranking techniques for fault diagnosis of ball bearing. Soft Comput. 2016;20(4):1601–19.


  19. Arevalo J, González FA, Ramos-Pollán R, Oliveira JL, Lopez MAG. Representation learning for mammography mass lesion classification with convolutional neural networks. Comput Methods Prog Biomed. 2016;127:248–57.


  20. Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. In: Biomedical image processing and biomedical visualization. San Jose: SPIE; 1993; vol. 1905. p. 861–70.

  21. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30(1):207–10.


  22. Pedraza V, Gomez-Capilla JA, Escaramis G, Gomez C, Torné P, Rivera JM, et al. Gene expression signatures in breast cancer distinguish phenotype characteristics, histologic subtypes, and tumor invasiveness. Cancer Interdisc Int J Am Cancer Soc. 2010;116(2):486–96.


  23. Ni IBP, Zakaria Z, Muhammad R, Abdullah N, Ibrahim N, Emran NA, et al. Gene expression patterns distinguish breast carcinomas from normal breast tissues: the Malaysian context. Pathol-Res Pract. 2010;206(4):223–8.


  24. Zhang Z, Liang X, Qin W, Yu S, Xie Y. matFR: a MATLAB toolbox for feature ranking. Bioinformatics. 2020;36(19):4968–9.


  25. Cressie N, Whitford H. How to use the two sample t-test. Biom J. 1986;28(2):131–48.


  26. Cover TM, Thomas JA, et al. Entropy, relative entropy and mutual information. Elem Inf Theory. 1991;2(1):12–3.


  27. Kailath T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans Commun Technol. 1967;15(1):52–60.


  28. Hsieh F, Turnbull BW. Nonparametric and semiparametric estimation of the receiver operating characteristic curve. Ann Stat. 1996;24(1):25–40.


  29. Nachar N, et al. The Mann-Whitney U: A test for assessing whether two independent samples come from the same distribution. Tutor Quant Methods Psychol. 2008;4(1):13–20.


  30. Robnik-Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn. 2003;53(1):23–69.


  31. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol). 1996;58(1):267–88.


  32. Roffo G. Feature selection library (MATLAB toolbox). 2016. arXiv preprint arXiv:1607.01327.

  33. Gu Q, Li Z, Han J. Generalized fisher score for feature selection. 2012. arXiv preprint arXiv:1202.3725.

  34. Uitdehaag J, Zaman GJ. A theoretical entropy score as a single value to express inhibitor selectivity. BMC Bioinformatics. 2011;12(1):1–11.


  35. McKight PE, Najab J. Kruskal-Wallis test. Corsini Encycl Psychol. 2010;1.

  36. Happy S, Mohanty R, Routray A. An effective feature selection method based on pair-wise feature proximity for high dimensional low sample size data. In: 2017 25th European signal processing conference (EUSIPCO). Kos Island: IEEE; 2017. p. 1574–8.

  37. Hu W, Choi KS, Gu Y, Wang S. Minimum-maximum local structure information for feature selection. Pattern Recogn Lett. 2013;34(5):527–35.


  38. Zeng H, Cheung Y-M. Feature selection and kernel learning for local learning-based clustering. IEEE Trans Pattern Anal Mach Intell. 2010;33(8):1532–47.


  39. Roffo G, Melzi S. Features selection via eigenvector centrality. In: Proceedings of new frontiers in mining complex patterns (NFMCP 2016) (Oct 2016). Riva del Garda: Springer International Publishing; 2016.

  40. Roffo G, Melzi S, Castellani U, Vinciarelli A. Infinite latent feature selection: A probabilistic latent graph-based ranking approach. In: Proceedings of the IEEE international conference on computer vision. Santiago: IEEE; 2017. p. 1398–1406.

  41. Bradley PS, Mangasarian OL. Feature selection via concave minimization and support vector machines. In: ICML. 1998;98:82–90.

  42. Roffo G, Melzi S, Cristani M. Infinite feature selection. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. p. 4202–10.

  43. He X, Cai D, Niyogi P. Laplacian score for feature selection. Adv Neural Inf Process Syst. 2005;18.

  44. Guo J, Zhu W. Dependence guided unsupervised feature selection. In: Proceedings of the AAAI Conference on Artificial Intelligence. Louisiana: AAAI; 2018. vol. 32.

  45. Du L, Shen YD. Unsupervised feature selection with adaptive structure learning. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. Sydney: ACM; 2015. p. 209–18.

  46. Shi L, Du L, Shen YD. Robust spectral learning for unsupervised feature selection. In: 2014 IEEE International Conference on Data Mining. Shenzhen: IEEE; 2014. p. 977–982.

  47. Qian M, Zhai C. Robust unsupervised feature selection. In: Twenty-third international joint conference on artificial intelligence. Beijing: Morgan Kaufmann; 2013.

  48. Wuniri Q, Huangfu W, Liu Y, Lin X, Liu L, Yu Z. A generic-driven wrapper embedded with feature-type-aware hybrid Bayesian classifier for breast cancer classification. IEEE Access. 2019;7:119931–42.


  49. Zheng F, Wei L, Zhao L, Ni F. Pathway network analysis of complex diseases based on multiple biological networks. BioMed Res Int. 2018;2018. Article ID 5670210.

  50. Jia D, Chen C, Chen C, Chen F, Zhang N, Yan Z, et al. Breast cancer case identification based on deep learning and bioinformatics analysis. Front Genet. 2021;12:628136.


  51. Sun M, Ding T, Tang XQ, Yu K. An efficient mixed-model for screening differentially expressed genes of breast cancer based on LR-RF. IEEE/ACM Trans Comput Biol Bioinform. 2018;16(1):124–30.


  52. Zhang Y, Li W, Zhang Y, Hu E, Rong Z, Ge L, et al. Network-based integration method for potential breast cancer gene identification. J Cell Physiol. 2020;235(11):7960–9.


  53. Yu S, Chen H, Yu H, Zhang Z, Liang X, Qin W, et al. Elastic Net based Feature Ranking and Selection. 2020. arXiv preprint arXiv:2012.14982.

  54. Huang Z, Chen D. A breast cancer diagnosis method based on VIM feature selection and hierarchical clustering random forest algorithm. IEEE Access. 2021;10:3284–93.


  55. Aalaei S, Shahraki H, Rowhanimanesh A, Eslami S. Feature selection using genetic algorithm for breast cancer diagnosis: experiment on three different datasets. Iran J Basic Med Sci. 2016;19(5):476.


  56. Aličković E, Subasi A. Breast cancer diagnosis using GA feature selection and Rotation Forest. Neural Comput Appl. 2017;28(4):753–63.


  57. Zeng Z, Heng X. Feature selection and visualization based on interaction dominance. In: 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC). Hangzhou: IEEE; 2019. p. 668–73.

  58. Spak DA, Plaxco J, Santiago L, Dryden M, Dogan B. BI-RADS® fifth edition: A summary of changes. Diagn Interv Imaging. 2017;98(3):179–90.


  59. Tian X, Wu L, Jiang M, Zhang Z, Wu R, Miao J, et al. Downregulation of GLYAT Facilitates Tumor Growth and Metastasis and Poor Clinical Outcomes Through the PI3K/AKT/Snail Pathway in Human Breast Cancer. Front Oncol. 2021;11:641399.


  60. Coffelt SB, Scandurro AB. Tumors sound the alarmin (s). Cancer Res. 2008;68(16):6482–5.


  61. Zhou C, Yu J, Wang M, Yang J, Xiong H, Huang H, et al. Identification of glycerol-3-phosphate dehydrogenase 1 as a tumour suppressor in human breast cancer. Oncotarget. 2017;8(60):101309.


  62. Li Y, Qiu J, Pang T, Ye F, Huang L, Zhang X, et al. MiR-183-5p promotes proliferation, metastasis and angiogenesis in breast cancer cells through negatively regulating four and a half LIM protein 1. J Breast Cancer. 2020;23(4):355.


  63. MacKeil JL, Brzezinska P, Burke-Kleinman J, Theilmann AL, Nicol CJ, Ormiston ML, et al. Phosphodiesterase 3B (PDE3B) antagonizes the anti-angiogenic actions of PKA in human and murine endothelial cells. Cell Signal. 2019;62:109342.


  64. Bolón-Canedo V, Alonso-Betanzos A. Ensembles for feature selection: A review and future trends. Inf Fusion. 2019;52:1–12.


  65. Antropova N, Huynh BQ, Giger ML. A deep feature fusion methodology for breast cancer diagnosis demonstrated on three imaging modality datasets. Med Phys. 2017;44(10):5162–71.


  66. Lu Y, Fan Y, Lv J, Stafford Noble W. DeepPINK: reproducible feature selection in deep neural networks. Adv Neural Inf Process Syst. 2018;31.

  67. Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. In: International conference on machine learning. PMLR. 2017;70:3145–53.



The authors would like to thank the organizers who provided breast cancer datasets for public analysis, the researchers who designed and distributed algorithms for comparison, and the editor and anonymous reviewers for their invaluable comments that helped improve the paper quality.


This work is supported in part by grants from the National Key Research and Development Program of China (2020YFB1406800), the Young S&T Talent Training Program of Guangdong Provincial Association for S&T (SKXRC202224), and the 2019 National Pre-Research Project of NSMC (7500820086). The funding body played no role in the design of the study and collection, analyses, and interpretation of data and in writing the manuscript.

Author information




SY collected the datasets, designed the experiments and drafted the manuscript. MJ implemented most of the algorithms and conducted the experiments. TW, ZL and XZ pre-processed the datasets and participated in data analysis and discussion. XL participated in algorithm implementation and discussion. YX proofread the manuscript. WP investigated the potential use of identified genes in breast cancer diagnosis and proofread the manuscript. CP proposed the original notion, supervised the research and proofread the manuscript. All authors contributed to the article and approved the submitted version.

Corresponding author

Correspondence to Chenghao Piao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


Cite this article

Yu, S., Jin, M., Wen, T. et al. Accurate breast cancer diagnosis using a stable feature ranking algorithm. BMC Med Inform Decis Mak 23, 64 (2023).
