
Table 1 Variable selection methods to be used

From: Quad-phased data mining modeling for dementia diagnosis

Variable Selection Method

Definition

Selection Criterion

Chi-square Test (univariate)

\( \chi^2 = \sum_j \frac{(O_j - E_j)^2}{E_j} \)

\( O_j \) is the observed frequency and \( E_j \) is the expected frequency of class \( j \)

p value < 0.05

Decision Tree

CHAID

(based on the chi-square test)

Importance > 0.001

CART

\( \mathrm{GINI}(t) = 1 - \sum_j {\left[ p(j \mid t) \right]}^2 \), where \( p(j \mid t) \) is the relative frequency of class \( j \) at node \( t \); \( \mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{GINI}(i) \)

Importance > 0.001

C4.5

\( \mathrm{Entropy}(t) = -\sum_j p(j \mid t) \log p(j \mid t) \); \( \mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i) \)

Importance > 0.001

Logistic Regression

LR (1)

\( F(x) = \frac{1}{1 + \exp\left( -\left( \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \right) \right)} \)

p value < 0.05

LR (2)

p value < 0.01

* Note that the importance used as the selection criterion for the decision trees differs from the aforementioned ‘importance’. The former is simply the weight assigned to a variable that contributes substantially to classifying the samples as the tree grows.
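The chi-square statistic and the two impurity measures in the Definition column can be checked by hand on a small example. The following is a minimal Python sketch, not the authors' implementation: the function names, the toy labels ("D" for dementia, "N" for normal), and the choice of a base-2 logarithm are all assumptions made for illustration, since the table does not state a log base.

```python
import math
from collections import Counter


def chi_square(observed, expected):
    """Pearson chi-square statistic: sum_j (O_j - E_j)^2 / E_j."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def class_probs(labels):
    """Relative frequency p(j | t) of each class j among the labels at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """Entropy(t) = -sum_j p(j|t) log p(j|t); base 2 is an assumption here."""
    return -sum(p * math.log2(p) for p in class_probs(labels))


def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2."""
    return 1.0 - sum(p ** 2 for p in class_probs(labels))


def gain_split(parent, children):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)


def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(child_i)."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * gini(c) for c in children)


# Hypothetical toy data: 8 samples split into two child nodes by one variable.
parent = ["D"] * 4 + ["N"] * 4
children = [["D", "D", "D", "N"], ["D", "N", "N", "N"]]

# Observed cell counts of the same split versus the counts expected
# under independence (2 per cell for this balanced toy example).
observed = [3, 1, 1, 3]
expected = [2, 2, 2, 2]

print("chi-square:", chi_square(observed, expected))
print("entropy gain:", gain_split(parent, children))
print("weighted Gini after split:", gini_split(children))
```

On this toy split the chi-square statistic is 2.0, the entropy gain is about 0.19, and the weighted Gini index falls from 0.5 at the parent node to 0.375. A variable that produced no change in these quantities would carry no information for the split.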