 Research article
 Open Access
 Open Peer Review
Effective diagnosis of Alzheimer’s disease by means of large margin-based methodology
 Rosa Chaves^{1},
 Javier Ramírez^{1},
 Juan M Górriz^{1},
 Ignacio A Illán^{1},
 Manuel Gómez-Río^{2},
 Cristobal Carnero^{3} and
 the Alzheimer’s Disease Neuroimaging Initiative
https://doi.org/10.1186/1472-6947-12-79
© Chaves et al.; licensee BioMed Central Ltd. 2012
 Received: 13 May 2011
 Accepted: 27 June 2012
 Published: 31 July 2012
Abstract
Background
Functional brain images such as Single-Photon Emission Computed Tomography (SPECT) and Positron Emission Tomography (PET) have been widely used to guide clinicians in Alzheimer’s Disease (AD) diagnosis. However, the subjectivity involved in their evaluation has favoured the development of Computer Aided Diagnosis (CAD) systems.
Methods
A novel combination of feature extraction techniques is proposed to improve the diagnosis of AD. Firstly, Regions of Interest (ROIs) are selected by means of a t-test carried out on 3D Normalised Mean Square Error (NMSE) features restricted to be located within a predefined brain activation mask. In order to address the small sample-size problem, the dimension of the feature space was further reduced by: Large Margin Nearest Neighbours using a rectangular matrix (LMNN-RECT), Principal Component Analysis (PCA) or Partial Least Squares (PLS) (the two latter also analysed with a LMNN transformation). Regarding the classifiers, kernel Support Vector Machines (SVMs) and LMNN using Euclidean, Mahalanobis and Energy-based metrics were compared.
Results
Several experiments were conducted in order to evaluate the proposed LMNN-based feature extraction algorithms and their benefits as: i) linear transformation of the PLS or PCA reduced data, ii) feature reduction technique, and iii) classifier (with Euclidean, Mahalanobis or Energy-based methodology). The system was evaluated by means of k-fold cross-validation, yielding accuracy, sensitivity and specificity values of 92.78%, 91.07% and 95.12% (for SPECT) and 90.67%, 88% and 93.33% (for PET), respectively, when the NMSE-PLS-LMNN feature extraction method was used in combination with a SVM classifier, thus outperforming recently reported baseline methods.
Conclusions
All the proposed methods turned out to be valid solutions for the presented problem. One advantage is the robustness of the LMNN algorithm, which not only provides a higher separation rate between the classes but also (in combination with NMSE and PLS) makes the variation of this rate more stable. In addition, generalization ability is another advantage, since the experiments were performed on two image modalities (SPECT and PET).
Keywords
 Partial Least Squares
 Normalized Mean Square Error
 Kernel Principal Component Analysis
 Small Sample Size Problem
 Target Neighbour
Background
Alzheimer’s Disease (AD)
Alzheimer’s Disease (AD) is the most common cause of dementia in the elderly and affects approximately 30 million individuals worldwide[1]. Its prevalence is expected to triple over the next 50 years due to the growth of the older population. To date there is no single test that can predict whether a particular person will develop the disease. With the advent of several effective treatments of AD symptoms, current consensus statements have emphasized the need for early recognition[2].
Functional brain imaging
Single-Photon Emission Computed Tomography (SPECT) is a widely used technique to study the functional properties of the brain[3]. After reconstruction and proper normalization of the SPECT raw data, acquired with Tc-99m ethyl cysteinate dimer (ECD) as a tracer, one obtains an activation map displaying the local intensity of the regional cerebral blood flow (rCBF). This technique is therefore particularly applicable to the diagnosis of neurodegenerative diseases such as AD[4, 5]. On the other hand, Positron Emission Tomography (PET) measures the rate of glucose metabolism with the tracer [^{18}F] Fluorodeoxyglucose. In AD, characteristic brain regions show decreased glucose metabolism, specifically bilateral regions in the temporal and parietal lobes, posterior cingulate gyri and precunei, as well as the frontal cortex and whole brain in more severely affected patients[6]. The SPECT modality has lower resolution and higher variability than PET, but SPECT tracers[7] are relatively cheap, and their longer half-lives compared to PET tracers make SPECT well suited, if not required, when biologically active radiopharmaceuticals have slow kinetics.
Computer Aided Diagnosis (CAD)
In order to improve the prediction accuracy especially in the early stage of the disease, when the patient could benefit most from drugs and treatments, computer aided diagnosis (CAD) tools are desirable[8].
Several approaches for designing CAD systems for AD can be found in the literature[9]. Univariate methodology is based on the analysis of regions of interest (ROIs) by means of some discriminant functions, whereas the second (multivariate) approach is related to statistical analysis techniques. Regarding the first, the most widely used approach is the Statistical Parametric Mapping (SPM)[10] software tool and its numerous variants. It was not developed specifically to study a single image, but to compare groups of images. Among multivariate techniques, MANCOVA is noteworthy: it considers all the voxels in a single scan as one observation and therefore requires more available samples than features. This requirement leads to the well-known small sample size problem, which is very common in nuclear medicine studies since the number of images is limited. In this work, several feature space reduction techniques were used and combined with the clear goal of solving this dimensionality issue.
Firstly, a 3D binary mask is obtained from the average of the control subjects; it contains the set of activated voxels in certain brain regions, characterized by an intensity level above half of the maximum intensity of the mean image. The use of activation masks and the automatic selection of spatial image components yields improved discrimination ability and reduces the complexity of the direct voxel-as-feature (VAF) approach[6]. The system was developed by exploring the masked brain volume in order to identify discriminant ROIs using differently shaped subsets of voxels or components.
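The mask construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the array names are invented and the synthetic random volumes stand in for real, spatially normalised scans.

```python
import numpy as np

# Illustrative sketch: build a binary activation mask from control scans.
# `controls` stacks the (already normalised) control images: shape (n, X, Y, Z).
rng = np.random.default_rng(0)
controls = rng.random((10, 8, 8, 8))  # synthetic stand-in for real SPECT/PET data

mean_image = controls.mean(axis=0)    # average control subject
threshold = 0.5 * mean_image.max()    # half of the maximum mean intensity
mask = mean_image > threshold         # voxels considered "activated"

# Only voxels inside the mask are kept as candidate features (baseline VAF).
vaf_features = controls[:, mask]      # shape: (n_subjects, n_masked_voxels)
```

Voxels outside the mask (outside the brain or poorly activated) never enter the feature vector, which is the behaviour the text describes for the VAF baseline.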
ROIs are defined as blocks of voxels represented by the so-called Normalized Mean Square Error (NMSE) (further explanation in section Feature extraction) and are selected by means of a t-test[11]. These ROIs act as inputs for kernel Principal Component Analysis (KPCA), Partial Least Squares (PLS) or Large Margin Nearest Neighbours using a rectangular matrix (LMNN-RECT), in order to reduce the dimension of the feature vector and address the small sample size problem. In addition, the PLS or PCA space can be transformed using a linear transformation matrix (denoted by L), built through the Euclidean-distance-based LMNN method, which learns a linear transformation that attempts to make input neighbours share the same labels. This is achieved by minimizing a loss function (see section Loss function).
Finally, the classification task of the supervised learner is to predict, using several paradigms, the class of an unknown pattern after a training procedure based on a subset of samples.
On the one hand, Support Vector Machines (SVMs) have achieved general success in the learning-from-examples paradigm over the last decade[12–14], and they can be considered a special kind of large margin classifier. Recent developments in the definition and training of statistical classifiers make it possible to build reliable classifiers for very small sample size problems, since SVM circumvents the curse of dimensionality and may even find nonlinear decision boundaries for small training sets. On the other hand, the LMNN classifier[15, 16] aims to improve on the Euclidean distance metric (which learns a linear transformation L, see section Large Margin Nearest Neighbors (LMNN)) with a new Mahalanobis metric (described by the matrix M = L · L^{T}, see also section Large Margin Nearest Neighbors (LMNN)) through linear transformations. In addition, an Energy-based method is also analysed for LMNN, leading to further improvements in test error rates over those obtained with Euclidean or Mahalanobis distances, as shown in the Results and discussion section. These transformations can significantly improve[17] k-Nearest Neighbors (kNN)[15] classification, in which the nearest neighbours of each input are intended to belong to the same class, while examples from different classes are separated by a large margin[18, 19].
Methods
Subjects and preprocessing
SPECT database
Demographic details of the SPECT dataset and PET dataset
(a)Demographic details of the SPECT dataset  

Num. of Samples  Sex (M/F) (%)  Age μ [range/σ]  
CTRL  41  32.95/12.19  71.51 [46-85/7.99] 
AD 1  30  10.97/18.29  65.20 [23-81/13.36] 
AD 2  22  13.41/9.76  65.73 [46-86/8.25] 
AD 3  4  0/2.43  76 [69-83/9.90] 
(b)Demographic details of the PET dataset  
Num. of Samples  Sex (M/F) (%)  Age μ [range/σ]  
CTRL  75  29.33/20.67  75.97 [62-86/4.91] 
AD  75  31.33/18.67  75.72 [55-88/7.40] 
PET database
PET data was obtained from the ADNI^{a} Laboratory on NeuroImaging (LONI, University of California, Los Angeles) website (http://www.loni.ucla.edu/ADNI/). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials. The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California – San Francisco. ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 adults, ages 55 to 90, to participate in the research: approximately 200 cognitively normal older individuals to be followed for 3 years, 400 people with MCI to be followed for 3 years and 200 people with early AD to be followed for 2 years. For up-to-date information, see www.adni-info.org. FDG PET scans were acquired according to a standardized protocol.
A 30-min dynamic emission scan, consisting of six 5-min frames, was acquired starting 30 min after the intravenous injection of 5.0 ± 0.5 mCi of ^{18}F-FDG, as the subjects, who were instructed to fast for at least 4 h prior to the scan, lay quietly in a dimly lit room with their eyes open and minimal sensory stimulation. Data were corrected for radiation attenuation and scatter using transmission scans from Ge-68 rotating rod sources and reconstructed using measured-attenuation correction and image reconstruction algorithms specified for each scanner. Following the scan, each image was reviewed for possible artifacts at the University of Michigan, and all raw and processed study data were archived. Subsequently, the images were normalized through a general affine model with 12 parameters[27] using the SPM5 software. After the affine normalization, the resulting image was registered using a more complex non-rigid spatial transformation model. The nonlinear deformations to the Montreal Neurological Imaging (MNI) template were parameterized by a linear combination of the lowest-frequency components of the three-dimensional cosine transform bases[28]. A small-deformation approach was used, with regularization by the bending energy of the displacement field, ensuring that voxels in different FDG-PET images refer to the same anatomical positions in the brain. After spatial normalization, an intensity normalization was required in order to perform direct image comparisons between different subjects. The intensity of the images was normalized to a value I _{ max }, obtained by averaging the 0.1% of the highest voxel intensities exceeding a threshold. The threshold was fixed to the 10th bin intensity value of a 50-bin intensity histogram, in order to discard most low intensity records from outside-brain regions and prevent image saturation. Participants’ enrolment was subject to eligibility criteria. General inclusion/exclusion criteria were as follows:

Normal control subjects: Mini Mental State Examination (MMSE) scores between 24−30 (inclusive), a Clinical Dementia Rating (CDR) of 0, non-depressed, non-MCI, and non-demented. The age range of normal subjects will be roughly matched to that of MCI and AD subjects. Therefore, there should be minimal enrolment of normals under the age of 70.

Mild AD: MMSE scores between 20−26 (inclusive), CDR of 0.5 or 1.0, and meeting NINCDS/ADRDA criteria for probable AD.
The PET database collected from ADNI consists of 150 labeled PET images: 75 control subjects and 75 AD patients (see Table 1(b) for demographic details). ADNI patient diagnoses are not pathologically confirmed, introducing some uncertainty in the subjects’ labels. Using these labels allows testing the robustness of the classifier. This should also be considered when comparing with other methods tested on autopsy-confirmed AD patients, on which every classifier is expected to improve its performance[6].
Written informed consent was obtained from all ADNI participants before protocol-specific procedures were performed. The informed consent not only covers consent for the trial itself, but for the genetic research, biomarker studies, biological sample storage and imaging scans as well. The consent for storage includes consent to access stored data, biological samples, and imaging data for secondary analyses. By signing the consent, ADNI participants authorize the use of the data for large scale, multicenter studies that combine data from similar populations.
Feature extraction
In this article, we propose to apply a combination of different extraction methods in order to obtain the most important features for the early diagnosis of AD. In this way, we can save memory space and reduce the system complexity by removing useless and harmful noisy components. We are also able to deal with data sets of few samples and high dimensionality, thus weakening the disadvantages caused by the so-called curse-of-dimensionality problem[16].
Secondly, the block division is done as shown in Figure 1. Baseline VAF consists of including in vaf(x,y,z) all the voxels inside the obtained mask(x,y,z) and considering them as features. Therefore, voxels outside the brain and poorly activated regions are excluded from this analysis. The main problem faced by these techniques is the well-known small sample size problem, that is, the number of available samples is much lower than the number of features used in the training step. In this work, however, the combination of feature reduction techniques not only solves this problem but also helps to achieve better classification results.
The NMSE of each block is defined as

$\text{NMSE}_{p} = \frac{\sum_{x,y,z}{\left(f(x,y,z)-{g}_{p}(x,y,z)\right)}^{2}}{\sum_{x,y,z}f{(x,y,z)}^{2}}$

and is obtained for each subject and block (see Figure 1), where f(x,y,z) is the mean voxel intensity of all the control subjects and g _{ p }(x,y,z) is the voxel intensity of the p-th subject at the (x,y,z) coordinates. The most discriminant ROIs are obtained by means of an absolute value two-sample t-test with pooled covariance estimate on the NMSE features, as in[14].
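As a rough sketch of this step (with synthetic volumes standing in for real scans, and `scipy.stats.ttest_ind` used for the pooled-variance two-sample t-test), the block-wise NMSE features and their t-test ranking might be computed as follows; all names and sizes are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
B = 5  # 5x5x5 voxel blocks, as in the text
ctrl = rng.random((20, 10, 10, 10))        # synthetic control volumes
ad = rng.random((18, 10, 10, 10)) + 0.05   # synthetic AD volumes

f = ctrl.mean(axis=0)  # mean control image

def nmse_blocks(img, f, B):
    """NMSE of each BxBxB block: sum (f - g)^2 / sum f^2 over the block."""
    feats = []
    X, Y, Z = f.shape
    for x in range(0, X - B + 1, B):
        for y in range(0, Y - B + 1, B):
            for z in range(0, Z - B + 1, B):
                fb = f[x:x+B, y:y+B, z:z+B]
                gb = img[x:x+B, y:y+B, z:z+B]
                feats.append(((fb - gb) ** 2).sum() / (fb ** 2).sum())
    return np.array(feats)

F_ctrl = np.array([nmse_blocks(g, f, B) for g in ctrl])
F_ad = np.array([nmse_blocks(g, f, B) for g in ad])

# Two-sample t-test (pooled variance by default); rank blocks by |t|.
t, _ = stats.ttest_ind(F_ctrl, F_ad, axis=0)
order = np.argsort(-np.abs(t))
selected = order[:4]  # e.g. keep the most discriminant blocks
```

In the real system the selection keeps 200 features; here the number is scaled down to match the toy volume.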
Widely used methods for the analysis of data sets are PCA[29, 30] and projections to latent structures (PLS)[31, 32], which work computationally well for many variables and observations. By contrast, the LMNN algorithm aims to organize the k nearest neighbours of each input into the same class, while examples from different classes are separated by a large margin[15, 17, 33, 34].
In this work we propose and compare several feature extraction methods (shown in Figure 1) that include, on the one hand, the combination of NMSE with PCA (see section Principal Component Analysis: PCA) or PLS (see section Partial Least Squares (PLS)) plus the LMNN transformation. On the other hand, NMSE is directly combined with a LMNN-RECT reduction (see section LMNN-RECT as feature reduction technique).
Principal Component Analysis: PCA
PCA is a multivariate approach often used in neuroimaging to significantly reduce the original high-dimensional space of the brain images to a lower dimensional subspace[35]. PCA generates an orthonormal basis that maximizes the scatter of all the projected samples, which is equivalent to finding the eigenvectors of the covariance matrix. PCA can be used in combination with the so-called kernel methods[36]. The basic idea of the kernel PCA[37] method (further details in Appendix 1: Kernel PCA) is to first preprocess the data by some nonlinear mapping and then apply linear PCA.
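A minimal scikit-learn sketch of both variants follows; the dimensions, the RBF kernel and the `gamma` value are illustrative, not the settings used in the experiments:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

# Illustrative reduction of high-dimensional NMSE features with (kernel) PCA.
rng = np.random.default_rng(2)
X = rng.random((38, 200))  # e.g. 38 subjects x 200 t-test-selected features

pca = PCA(n_components=10).fit(X)
X_pca = pca.transform(X)   # linear principal components

# Kernel PCA: implicit nonlinear mapping first, then linear PCA in feature space.
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=1e-2).fit(X)
X_kpca = kpca.transform(X)
```

Both reductions turn a 200-dimensional feature vector into a 10-dimensional one, which is the kind of step needed to mitigate the small sample size problem discussed above.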
Partial Least Squares (PLS)
PLS decomposes the data matrices X and Y as

$\mathbf{X} = \mathbf{T}{\mathbf{P}}^{T} + {\mathbf{E}}_{x}, \quad \mathbf{Y} = \mathbf{U}{\mathbf{Q}}^{T} + {\mathbf{E}}_{y}$

where T, U are the score matrices; E _{ x }, E _{ y } are the error matrices and P, Q are the loading matrices, whose number of columns is the number of PLS components. The score matrices result from the projection of the data matrices X and Y on the loading matrices. The fundamental goal of PLS is to maximize the covariance between the scores of X and Y. PLS can be used as a regression tool or as a dimension reduction technique similar to PCA. The main difference between PLS and PCA is that the former creates orthogonal weight vectors by maximizing the covariance between the variables X and Y; thus, PLS does not only consider the variance of the samples but also the class labels[40]. Partial least squares modeling[40] is an effective method for feature extraction that has shown improved results over other conventional feature extraction methods, such as PCA, in classification problems. In this work, PLS is implemented by means of the SIMPLS algorithm (further details in Appendix 2: Partial Least Squares SIMPLS algorithm).
Large Margin Nearest Neighbors (LMNN)
The distance metric[41] is a key issue in many machine learning algorithms. LMNN is used in this work in different ways: i) as a transformation of the feature space obtained by means of PLS or PCA in order to better separate the control subject and AD patient classes, ii) as a feature reduction technique by performing the transformation with a rectangular matrix (LMNN-RECT), and iii) as a classifier, as reported in section Large margin nearest classifier.
A Mahalanobis distance can be parameterized in terms of the matrix L or the matrix M[15]. The first is unconstrained, whereas the second must be positive semidefinite.
The main idea of LMNN consists of minimizing the loss function (see the following section Loss function) that is able to learn a distance metric under which inputs and their target neighbours are closer together.
Loss function
The loss function is given by

$\epsilon (\mathbf{L})=\sum_{j\rightsquigarrow i}{\parallel \mathbf{L}({\mathbf{x}}_{i}-{\mathbf{x}}_{j})\parallel}^{2}+c\sum_{j\rightsquigarrow i}\sum_{l}(1-{y}_{il}){\left[1+{\parallel \mathbf{L}({\mathbf{x}}_{i}-{\mathbf{x}}_{j})\parallel}^{2}-{\parallel \mathbf{L}({\mathbf{x}}_{i}-{\mathbf{x}}_{l})\parallel}^{2}\right]}_{+}$

where [z]_{+} = max(z,0) denotes the standard hinge loss[15], the notation j⇝i indicates that x _{ j } is a target neighbour of x _{ i }, and y _{ il } = 1 if and only if x _{ i } and x _{ l } share the same label.
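The standard LMNN loss from[15] can be transcribed directly into NumPy. This is an unoptimised sketch: the toy data, the target-neighbour lists and the weight `c` are illustrative.

```python
import numpy as np

def lmnn_loss(L, X, y, targets, c=0.5):
    """Standard LMNN loss (Weinberger & Saul): pull target neighbours close,
    push differently-labelled impostors out by a unit margin.
    `targets[i]` lists the indices of the target neighbours of X[i]."""
    pull, push = 0.0, 0.0
    for i, neigh in enumerate(targets):
        for j in neigh:
            d_ij = np.sum((L @ (X[i] - X[j])) ** 2)
            pull += d_ij
            for l in range(len(X)):
                if y[l] != y[i]:  # the (1 - y_il) factor: impostor candidates
                    d_il = np.sum((L @ (X[i] - X[l])) ** 2)
                    push += max(0.0, 1.0 + d_ij - d_il)  # hinge [z]_+
    return pull + c * push

# Tiny illustrative example with one target neighbour per point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
targets = [[1], [0], [3], [2]]
loss_identity = lmnn_loss(np.eye(2), X, y, targets)
```

With the identity transformation and well-separated classes, the push term vanishes and only the four small pull distances contribute to the loss.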
LMNNRECT as feature reduction technique
The loss function needs to be optimized in order to obtain the distance metric in terms of an explicitly low-rank, rectangular linear transformation matrix L. The optimization over L is not convex, unlike the original optimization over M, but a (possibly local) minimum can be computed by standard gradient-based methods. We call this approach LMNN-RECT[42], in which L is a rectangular matrix whose number of columns equals the number of features selected by the t-test. In particular, in this work the matrix L multiplies the matrix of NMSE features selected by the t-test, as defined above, in order to obtain a new feature space that better separates control subjects from AD patients. This is experimentally demonstrated in the Results and discussion section.
Kernel LMNN
for some kernel k[19]. When we use the kernel PCA trick framework (Appendix 1), the original LMNN can be immediately used as Kernel LMNN (KLMNN), as explained in[43]. The KPCA trick framework offers several practical advantages over the classical kernel trick framework: no mathematical derivation or reprogramming is required for a kernel implementation, a way to speed up an algorithm is provided with no extra work, and the framework avoids troublesome problems such as singularity.
Feature/model selection
Classification
LMNN and SVM classifiers were used in this work to build the AD CAD system. They present many similarities, for example their potential to work in nonlinear feature spaces by using the kernel trick. Likewise, features can be extracted by means of the kernel trick and PCA (kernel PCA, KPCA) or LMNN (kernel LMNN, KLMNN)[43]. LMNN can be viewed as the logical counterpart to SVMs in which kNN classification replaces linear classification. However, LMNN contrasts with classification by SVMs in that it requires no modification for multiclass problems, which for SVMs involve combining the results of many binary classifiers; that is, there is no explicit dependence on the number of classes.
Large margin nearest classifier
Some techniques were developed to learn feature weights in order to manage the change of distance structure of samples in nearest neighbour classification. The Euclidean distance, the most commonly used, assumes that each feature is equally important and independent of the others. By contrast, a distance metric of good quality, such as the Mahalanobis distance, should identify relevant features by assigning different weights or importance factors to the extracted ROIs[44]. Only when the features are uncorrelated is the distance under a Mahalanobis metric identical to that under the Euclidean metric. On the other hand, our work has been inspired by energy-based classification (EBC) metric learning, which obtains the best results in terms of accuracy, specificity and sensitivity[33, 45]. EBC consists of computing the loss function for every possible label y _{ i }. We minimize the sum of three terms. The first term is the sum of squared distances to the k target neighbours of x _{ i }. The second term accumulates the hinge loss over all impostors (that is, differently labeled examples) that invade the perimeter around x _{ i } determined by its target neighbours. The third term accumulates the hinge loss for differently labelled examples whose perimeters are invaded by x _{ i }.
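A much-simplified sketch of energy-based prediction follows: it keeps only the first two terms (dropping the reverse-invasion term for brevity), uses invented synthetic data, and is not the exact formulation of[33, 45]. The idea is simply to hypothesise each label for a test point, evaluate the energy, and keep the cheapest label.

```python
import numpy as np

def energy_predict(x, X, y, M, k=3, c=0.5):
    """Simplified energy-based classification (EBC) sketch.
    M is the learned Mahalanobis matrix (identity = Euclidean)."""
    def d2(a, b):
        diff = a - b
        return float(diff @ M @ diff)

    energies = {}
    for label in np.unique(y):
        same = [i for i in range(len(X)) if y[i] == label]
        other = [i for i in range(len(X)) if y[i] != label]
        # Term 1: squared distances to the k nearest same-labelled points.
        d_same = sorted(d2(x, X[i]) for i in same)[:k]
        pull = sum(d_same)
        # Term 2: hinge over impostors invading the perimeter set by term 1.
        margin = max(d_same)
        push = sum(max(0.0, 1.0 + margin - d2(x, X[i])) for i in other)
        energies[label] = pull + c * push
    return min(energies, key=energies.get)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
pred = energy_predict(np.array([0.1, -0.1]), X, y, np.eye(2))
```

For a point near the first cluster, hypothesising the wrong label inflates both the pull term and the impostor hinges, so the correct label wins the energy comparison.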
Support vector machines classifier
SVMs[46, 47] make it possible to build reliable classifiers in very small sample size problems[48] and may even find nonlinear decision boundaries for small training sets. An SVM[13] separates a set of binary-labeled training data by means of a maximal margin hyperplane. The objective is to build a decision function f: ${\mathbb{R}}^{N}\to \left\{\pm 1\right\}$ using training data, that is, l N-dimensional patterns x _{ i } and class labels y _{ i }: (x _{1}, y _{1}), (x _{2}, y _{2}), …, (x _{ l }, y _{ l }), so that f will correctly classify new unseen examples (x, y). Linear discriminant functions define decision hyperplanes in a multidimensional feature space: g(x) = w ^{ T }· x + w _{0}, where w is the weight vector to be optimized, orthogonal to the decision hyperplane, and w _{0} is the threshold. The optimization task consists of finding the unknown parameters w _{ i }, i = 1, …, N and w _{0} that define the decision hyperplane. When no linear separation of the training data is possible, SVM can work effectively in combination with kernel techniques such as quadratic, polynomial or radial basis function (RBF) kernels, so that the hyperplane defining the SVM corresponds to a nonlinear decision boundary in the input space[14].
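The kernel choices compared in this work can be sketched with scikit-learn's `SVC`. The toy data and hyperparameters below are illustrative placeholders, not the tuned values from the experiments.

```python
import numpy as np
from sklearn.svm import SVC

# Maximal-margin classification of reduced feature vectors (toy data).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 0.4, (30, 5)), rng.normal(1, 0.4, (30, 5))])
y = np.array([-1] * 30 + [+1] * 30)  # f: R^N -> {+-1}

# Kernel shapes compared in the text: linear, quadratic, polynomial, RBF.
clfs = {
    "linear": SVC(kernel="linear"),
    "quadratic": SVC(kernel="poly", degree=2, coef0=1),
    "polynomial": SVC(kernel="poly", degree=3, coef0=1),
    "rbf": SVC(kernel="rbf"),
}
scores = {name: clf.fit(X, y).score(X, y) for name, clf in clfs.items()}
```

On these well-separated synthetic clusters, all four kernels fit the training data essentially perfectly; on real NMSE features the kernel choice is a model-selection parameter, as discussed in the Results.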
Results and discussion
Several experiments were conducted in order to evaluate the proposed LMNN-based feature extraction algorithms and their benefits as: i) linear transformation of the PLS or PCA reduced data, ii) feature reduction technique, and iii) classifier (with Euclidean, Mahalanobis or Energy-based methodology). SVM classification, including transformation of the input space by means of linear, polynomial, quadratic or RBF kernels, which define nonlinear decision surfaces, was adopted for the first two approaches. The classification performance of our approach was tested by means of k-fold cross-validation (instead of Leave-One-Out), which is widely used to compare the performance of different predictive modelling procedures, as in[49].
Although there are studies that consider k independent training and test splits (for instance[50, 51]), we focus on the standard k-fold cross-validation that is widely used[6, 51, 52]. In the k-fold procedure, there is no overlap between test sets: each example of the original data set is used once and only once as a test example. In k-fold cross-validation, sometimes called rotation estimation, the dataset D is randomly split into k mutually exclusive subsets (the folds) D _{1}, D _{2},…,D _{ k } of approximately equal size. The inducer is trained and tested k times; each time t ∈ {1, 2,…,k}, it is trained on D∖D _{ t } and tested on D _{ t }[53]. 10 folds were used in each experiment, which yielded accurate estimates of the error rates. For each iteration (t = 1,…,10), the algorithm returns randomly generated indices for a k-fold cross-validation of the D observations. The test fraction is approximately 100/k %, that is 10% in our experiments, but it can vary by one or two samples in each iteration when the number of observations is not a multiple of the number of folds. These indices are used for testing and the rest (approximately 90%) for training. The statistical results obtained in each iteration are averaged.
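The evaluation protocol described above can be sketched as follows. The data are illustrative, and `StratifiedKFold` is used here as one common way to obtain non-overlapping, class-balanced folds; the original work does not specify stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# 10-fold cross-validation: every sample is tested exactly once.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 0.5, (50, 4)), rng.normal(1, 0.5, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_acc = []
tested = np.zeros(len(y), dtype=int)
for train_idx, test_idx in skf.split(X, y):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    fold_acc.append(clf.score(X[test_idx], y[test_idx]))
    tested[test_idx] += 1

mean_acc = float(np.mean(fold_acc))  # statistics averaged over the 10 folds
```

The `tested` counter makes the "once and only once" property of k-fold testing explicit: every index is hit by exactly one test fold.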
The performance is measured in terms of accuracy, sensitivity and specificity,

$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \quad \text{Sensitivity} = \frac{TP}{TP+FN}, \quad \text{Specificity} = \frac{TN}{TN+FP},$

respectively, where TP is the number of true positives (AD patients correctly classified); TN is the number of true negatives (control subjects correctly classified); FP is the number of false positives (control subjects classified as AD patients); and FN is the number of false negatives (AD patients classified as control subjects).
For the subsequent analysis, the data was arranged into two groups: AD subjects were labeled as positive and controls as negative. The motivation for this is to test our method with all the available stages of the disease, keeping the database as balanced as possible (41 CTRL versus 56 AD for SPECT and 75 CTRL versus 75 AD for PET), and to include several types of patterns in the classification task (training and test).
In the feature reduction process, certain parameters must be tuned, such as the number of NMSE blocks, the number of PCA, PLS or LMNN reduced features, and the kernel shape (linear, polynomial, quadratic or RBF) defining the decision surfaces in SVM classification. The NMSE features were computed using 5 × 5 × 5 voxel blocks, since reduced-size cubic NMSE features yield better results, as shown in[14]. Furthermore, 200 discriminant features were selected by means of the t-test reduction (using a higher number of NMSE blocks decreases the effectiveness of the classification method). The posterior reduction of the size of the feature vector is achieved by means of PCA, PLS or LMNN-RECT.
Experiments with SPECT database
Remarkably, using the combination of 3D NMSE blocks as input features and afterwards transforming them with the LMNN algorithm in its multiple roles (reduction technique, linear transformation or classifier) adds valuable robustness to the system. This can be seen in the experiments shown in Figures 3(a), 3(b), 3(c), 4(a), 4(b) and 5(a). In Figure 5(b), PCA was used directly over the voxels (downsampled by half because of the high computational cost) and treated with the same type of mask as explained in this work. The results in Figure 5(b) show that the variation of accuracy increases when voxels are used as features. By contrast, the advantage of the combination of methods proposed in this work is that accuracy remains stable at around 90%. We can conclude that obtaining the ROIs by combining NMSE blocks with the LMNN algorithm favours stability over the whole range of reduced features, thus promoting the robustness of the algorithm.
Statistical measures of performance of LMNNbased techniques in comparison with other reported methods for SPECT database
SVM-linear classifier  Accuracy (%)  Sensitivity (%)  Specificity (%) 

VAF  83.51  83.93  82.93 
PCA  86.56  91.07  80.49 
GMM  89.69  90.24  89.29 
Gaussian kernel PCA+LMNN Transformation  91.75  91.07  92.68 
Gaussian kernel PLS+LMNN Transformation  90.72  91.07  90.24 
PLS+LMNN Transformation  92.78  91.07  95.12 
LMNN-RECT  80.28  70  87.80 
LMNN classifier accuracy (%)  Euclidean  Mahalanobis  Energy 
PCA  80.54  81.63  87.65 
PLS  84.33  89.56  88.67 
To sum up, LMNN was presented as a valid solution to broaden the margin between the classes. An effective CAD system was developed in which it is not necessary to incorporate a priori knowledge about the pathology, since in its first feature extraction step all the voxels with a considerable activation (that is, those voxels located inside the calculated mask) are considered. The analysis shown in this paper reports clear advantages in the subsequent ROI-selection steps as well, because they were computed automatically for the early diagnosis of Alzheimer’s disease. The best combination of feature reduction techniques yielded an accuracy value of 92.78%, thus outperforming other recently reported and consolidated methods such as VAF, PCA and GMM (Table 2). Finally, in order to study AD classification with LMNN-based techniques in depth, we have also included additional information about the classification of AD1 subjects versus CTRL. This setup is more difficult to classify, since the AD1 pattern is still a diagnostic challenge. If we only consider the case CTRL versus AD1, the precision rates of the method are, for PCA plus LMNN: Acc = 84.51%, Sen = 73.33%, Spe = 92.68%; for PLS plus LMNN transformation: Acc = 83.10%, Sen = 70%, Spe = 92.68%; and for LMNN-RECT: Acc = 84.51%, Sen = 76.67%, Spe = 90.24%. These results still represent a great advance in the field in comparison with the baseline VAF: Acc = 77.46%, Sen = 70%, Spe = 82.93%.
Experiments with PET database
Figure 6(b) shows LMNN classification using energy-based models, Mahalanobis and Euclidean distances for PCA and PLS features. Maximum accuracy rates were obtained for the Energy-based classifier (90.11% for PCA and 89.99% for PLS).
ROC analysis
Conclusions
Kernel distance metric learning methods were investigated for SVM-based classification of SPECT brain images in order to improve the early diagnosis of AD. Several experiments were conducted in order to evaluate the proposed LMNN-based feature extraction algorithms and their benefits as: i) linear transformation of the PLS or PCA reduced data, ii) feature reduction technique, and iii) classifier (with Euclidean, Mahalanobis or Energy-based methodology). LMNN classification using energy-based models and Mahalanobis distances performs better than when the Euclidean distance is considered, which suffers a decrease in accuracy as the number of features increases. Aiming to further improve the classification accuracy, SVM was also compared to LMNN-based classification, yielding improved results. Thus, the proposed methods yielded Acc rates of 92.78% for SPECT and 90.11% for PET when an advanced feature extraction technique, consisting of NMSE feature selection, PLS feature reduction and LMNN transformation in combination with linear SVM classification, was considered, thus outperforming other recently reported and consolidated methods such as VAF, PCA or GMM. One of the principal advantages of our techniques is the robustness and stability of the proposed methods, as stated in the Results. Another property is their generalization ability, in the light of the results obtained with a PET database.
Endnotes
^{a} Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
^{b} Clinical information is unfortunately not available for privacy reasons; only demographic information is provided.
Appendix
Appendix 1: Kernel PCA
We found two advantages of nonlinear kernel PCA: first, the nonlinear principal components afforded better recognition rates; and second, performance can be further improved by using more components than is possible in the linear case [59].
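As a concrete illustration, kernel PCA amounts to an eigendecomposition of the centred kernel matrix. A minimal NumPy sketch with a Gaussian kernel follows; parameter names and the choice of kernel width are our own assumptions, not taken from [59]:

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """Gaussian-kernel PCA: eigendecomposition of the centred kernel matrix.
    Returns the projections of the training points onto the top components."""
    # pairwise squared distances -> RBF kernel matrix
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    # centre the kernel matrix in feature space
    n = len(X)
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]   # keep the largest ones
    # scale eigenvectors so projections have the usual kernel-PCA normalisation
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    return Kc @ alphas                            # projected coordinates
```

Linear PCA is recovered by replacing the RBF kernel with the plain inner product K = X @ X.T.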
Appendix 2: Partial Least Squares SIMPLS algorithm
 1. Initialize ${\mathbf{S}}_{0}={\mathbf{X}}^{T}\mathbf{Y}$ and iterate steps 2 to 8 for j = 1,…,n
 2. If j = 1, ${\mathbf{S}}_{j}={\mathbf{S}}_{0}$; else ${\mathbf{S}}_{j}={\mathbf{S}}_{j-1}-{\mathbf{P}}_{j-1}{\left({\mathbf{P}}_{j-1}^{T}{\mathbf{P}}_{j-1}\right)}^{-1}{\mathbf{P}}_{j-1}^{T}{\mathbf{S}}_{j-1}$
 3. Compute ${\mathbf{w}}_{j}$ as the first singular vector of ${\mathbf{S}}_{j}$
 4. ${\mathbf{w}}_{j}=\frac{{\mathbf{w}}_{j}}{\parallel {\mathbf{w}}_{j}\parallel}$
 5. ${\mathbf{t}}_{j}=\mathbf{X}{\mathbf{w}}_{j}$
 6. ${\mathbf{t}}_{j}=\frac{{\mathbf{t}}_{j}}{\parallel {\mathbf{t}}_{j}\parallel}$
 7. ${\mathbf{p}}_{j}={\mathbf{X}}^{T}{\mathbf{t}}_{j}$
 8. ${\mathbf{P}}_{j}=\left[{\mathbf{p}}_{1},{\mathbf{p}}_{2},\ldots,{\mathbf{p}}_{j}\right]$
The resulting weights w _{ j } and scores t _{ j } are stored as columns in the matrices W and T, respectively.
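The iteration above can be sketched in NumPy as follows. This is a minimal transcription of the listed steps (variable names are our own), not the implementation used in the paper, and it assumes column-centred X and Y:

```python
import numpy as np

def simpls(X, Y, n_components):
    """Sketch of the SIMPLS iteration (de Jong, 1993): weights w_j from the
    dominant left singular vector of the deflated cross-product matrix S_j,
    scores t_j = X w_j, loadings p_j = X^T t_j."""
    S = X.T @ Y                                   # step 1: S_0 = X^T Y
    W, T, P = [], [], []
    for j in range(n_components):
        if P:                                     # step 2: deflate S against P
            Pm = np.column_stack(P)
            S = S - Pm @ np.linalg.solve(Pm.T @ Pm, Pm.T @ S)
        w = np.linalg.svd(S)[0][:, 0]             # step 3: first singular vector
        w /= np.linalg.norm(w)                    # step 4: normalise weights
        t = X @ w                                 # step 5: scores
        t /= np.linalg.norm(t)                    # step 6: normalise scores
        p = X.T @ t                               # step 7: loadings
        W.append(w); T.append(t); P.append(p)     # step 8: accumulate P_j
    # weights and scores stored as columns of W and T, as described above
    return np.column_stack(W), np.column_stack(T)
```

The deflation in step 2 makes each new score vector orthogonal to the previous ones, which is the defining property of SIMPLS.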
The nonlinear kernel PLS method is based on mapping the original input data into a high-dimensional feature space [62]. SIMPLS needs to be reformulated into its kernel variant (in this work, the accuracy result for Gaussian kernel PLS plus LMNN transformation is shown in Table 2), assuming a zero-mean nonlinear kernel PLS.
Declarations
Acknowledgements
This work was partly supported by the MICINN of Spain under the TEC2008-02113 and TEC2012-34306 projects and by the Consejería de Innovación, Ciencia y Empresa (Junta de Andalucía, Spain) under the Excellence Projects P07-TIC-02566, P09-TIC-4530 and P11-TIC-7103.
The PET data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott, Alzheimer’s Association, Alzheimer’s Drug Discovery Foundation, Amorfix Life Sciences Ltd., AstraZeneca, Bayer HealthCare, BioClinica, Inc., Biogen Idec Inc., Bristol-Myers Squibb Company, Eisai Inc., Elan Pharmaceuticals Inc., Eli Lilly and Company, F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc., GE Healthcare, Innogenetics, N.V., IXICO Ltd., Janssen Alzheimer Immunotherapy Research and Development, LLC., Johnson and Johnson Pharmaceutical Research and Development LLC., Medpace, Inc., Merck and Co., Inc., Meso Scale Diagnostics, LLC., Novartis Pharmaceuticals Corporation, Pfizer Inc., Servier, Synarc Inc., and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (http://www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514.
Authors’ Affiliations
References
 Petrella JR, Coleman RE, Doraiswamy PM: Neuroimaging and Early Diagnosis of Alzheimer’s Disease: A Look to the Future. Radiology. 2003, 226: 315-336. 10.1148/radiol.2262011600.
 Ramírez J, Górriz JM, Salas-Gonzalez D, Romero A, López M, Illán IA, Gómez-Río M: Computer-aided diagnosis of Alzheimer’s type dementia combining support vector machines and discriminant set of features. Inf Sci. 2009, doi:10.1016/j.ins.2009.05.012
 English RJ, Childs J (Eds): SPECT: Single-Photon Emission Computed Tomography: A Primer. 1996, Society of Nuclear Medicine
 Hellman RS, Tikofsky RS, Collier BD, Hoffmann RG, Palmer DW, Glatt S, Antuono PG, Isitman AT, Papke RA: Alzheimer disease: quantitative analysis of I-123-iodoamphetamine SPECT brain imaging. Radiology. 1989, 172: 183-188.
 Holman BL, Johnson KA, Gerada B, Carvalho PA, Satlin A: The Scintigraphic Appearance of Alzheimer’s Disease: A Prospective Study Using Technetium-99m-HMPAO SPECT. J Nucl Med. 1992, 33 (2): 181-185.
 Illán IA, Górriz JM, López MM, Ramírez J, Gonzalez DS, Segovia F, Chaves R, Puntonet CG: Computer aided diagnosis of Alzheimer’s disease using component based SVM. Appl Soft Comput. 2011, 11: 2376-2382. 10.1016/j.asoc.2010.08.019.
 Ramírez J, Górriz JM, Chaves R, López M, Salas-González D, Alvarez I, Segovia F: SPECT image classification using random forests. Electron Lett. 2009, 45 (12): 604-605. 10.1049/el.2009.1111.
 Górriz JM, Ramírez J, Lassl A, Salas-González D, Lang EW, Puntonet CG, Alvarez I, Río MG: Automatic computer aided diagnosis tool using component-based SVM. IEEE Nucl Sci Symp Conference Record. 2008, 4774255: 4392-4395.
 Fung G, Stoeckel J: SVM feature selection for classification of SPECT images of Alzheimer’s disease using spatial information. Knowledge Inf Syst. 2007, 11 (2): 243-258. 10.1007/s1011500600435.
 Friston KJ, Ashburner J, Kiebel SJ, Nichols TE, Penny WD (Eds): Statistical Parametric Mapping: The Analysis of Functional Brain Images. 2007, San Diego: Academic Press
 Schechtman E, Sherman M: The two-sample t-test with a known ratio of variances. Stat Methodology. 2007, 4: 508-514. 10.1016/j.stamet.2007.03.001.
 Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. 2000, Cambridge University Press
 Burges C: A tutorial on support vector machines for pattern recognition. Data Min Knowledge Discovery. 1998, 2 (2): 121-167. 10.1023/A:1009715923555.
 Chaves R, Ramírez J, Górriz J, López M, Salas-Gonzalez D, Alvarez I, Segovia F: SVM-based computer-aided diagnosis of the Alzheimer’s disease using t-test NMSE feature selection with feature correlation weighting. Neurosci Lett. 2009, 461: 293-297. 10.1016/j.neulet.2009.06.052.
 Weinberger KQ, Blitzer J, Saul LK: Distance Metric Learning for Large Margin Nearest Neighbor Classification. J Machine Learning Res. 2009, 10: 207-244.
 Chai J, Liu H, Chen B, Bao Z: Large margin nearest local mean classifier. Signal Process. 2010, 90: 236-248. 10.1016/j.sigpro.2009.06.015.
 Goldberger J, Roweis S, Hinton G, Salakhutdinov R: Neighbourhood components analysis. Adv Neural Inf Process Syst, Cambridge, MA. 2005, 17: 513-520.
 Xing EP, Ng AY, Jordan MI, Russell S: Distance metric learning, with application to clustering with side-information. In Dietterich TG, Becker S, Ghahramani Z (Eds), Adv Neural Inf Process Syst, Cambridge, MA. 2002, 15: 505-512.
 Globerson A, Roweis ST: Metric learning by collapsing classes. Adv Neural Inf Process Syst. 2005, 18: 451-458.
 Hill D, Batchelor PG, Holden M, Hawkes DJ: Medical image registration. Phys Med Biol. 2001, 46: R1-R45. 10.1088/00319155/46/3/201.
 Salas-Gonzalez D, Górriz JM, Ramírez J, López M, Alvarez I, Segovia F, Chaves R, Puntonet CG: Computer-aided diagnosis of Alzheimer’s disease using support vector machines and classification trees. Phys Med Biol. 2010, 55: 2807-2817. 10.1088/00319155/55/10/002.
 Woods RP, Grafton ST, Holmes CJ, Cherry SR, Mazziotta JC: Automated image registration: I. General methods and intrasubject, intramodality validation. J Comput Assist Tomogr. 1998, 22: 139-152. 10.1097/0000472819980100000027.
 Ashburner J, Friston KJ: Nonlinear spatial normalization using basis functions. Hum Brain Mapp. 1999, 7: 254-266. 10.1002/(SICI)10970193(1999)7:4<254::AIDHBM4>3.0.CO;2G.
 Saxena P, Pavel FG, Quintana JC, Horwitz B: An automatic threshold-based scaling method for enhancing the usefulness of Tc-HMPAO SPECT in the diagnosis of Alzheimer’s disease. Med Image Comput Comput-Assisted Intervention - MICCAI. 1998, 1496: 623-630.
 Jobst KA, Barnetson LP, Shepstone BJ: Accurate prediction of histologically confirmed Alzheimer’s disease and the differential diagnosis of dementia: the use of NINCDS-ADRDA and DSM-III-R criteria, SPECT, X-ray CT, and apo E4 in medial temporal lobe dementias. Oxford Project to Investigate Memory and Aging. Int Psychogeriatrics. 1998, 10 (3): 271-302.
 Dubois B, Feldman HH, Jacova C, DeKosky ST, Barberger-Gateau P, Cummings J, Delacourte A, Galasko D, Gauthier S, Jicha G, Meguro K, O’Brien J, Pasquier F, Robert P, Rossor M, Salloway S, Stern Y, Visser PJ, Scheltens P: Research criteria for the diagnosis of Alzheimer’s disease: revising the NINCDS-ADRDA criteria. Lancet Neurology. 2007, 6 (8): 734-746. 10.1016/S14744422(07)701783.
 Salas-Gonzalez D, Górriz JM, Ramírez J, Lassl A, Puntonet CG: Improved Gauss-Newton optimization methods in affine registration of SPECT brain images. IET Electron Lett. 2008, 44 (22): 1291-1292. 10.1049/el:20081838.
 Ashburner J, Friston KJ: Nonlinear spatial normalization using basis functions. Human Brain Mapping. 1999, 7 (4): 254-266. 10.1002/(SICI)10970193(1999)7:4<254::AIDHBM4>3.0.CO;2G.
 Jackson JE (Ed): A User’s Guide to Principal Components. 1991, New York: Wiley
 Wold S, Esbensen K, Geladi P: Principal components analysis. Chemometr Intell Lab Syst. 1987, 2: 37-52. 10.1016/01697439(87)800849.
 Wold H: Soft modeling. The basic design and some extensions. In Jöreskog KG, Wold H (Eds), Syst Under Indirect Observation. 1982, 2: 589-591.
 Tenenhaus M (Ed): La Régression PLS: Théorie et Pratique. 1998, Paris: Technip
 Chopra S, Hadsell R, LeCun Y: Learning a similarity metric discriminatively, with application to face verification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA. 2005, 349-356.
 Tsochantaridis I, Joachims T, Hofmann T, Altun Y: Large margin methods for structured and interdependent output variables. J Machine Learning Res. 2005, 6: 1453-1484.
 Andersen A, Gash DM, Avison MJ: Principal component analysis of the dynamic response measured by fMRI: a generalized linear systems framework. J Magn Reson Imaging. 1999, 17: 795-815. 10.1016/S0730725X(99)000284.
 López M, Ramírez J, Górriz JM, Salas-Gonzalez D, Alvarez I, Segovia F, Puntonet CG: Automatic tool for the Alzheimer’s disease diagnosis using PCA and Bayesian classification rules. IET Electron Lett. 2009, 45 (8): 389-391. 10.1049/el.2009.0176.
 Jade AM, Srikanth B, Jayaraman VK, Kulkarni BD: Feature extraction and denoising using kernel PCA. Chem Eng Sci. 2003, 58: 4441-4448. 10.1016/S00092509(03)003403.
 Ramírez J, Górriz J, Segovia F, Chaves R, Salas-Gonzalez D, López M, Illán I, Padilla P: Computer aided diagnosis system for the Alzheimer’s disease based on partial least squares and random forest SPECT image classification. Neurosci Lett. 2010, 472: 99-103. 10.1016/j.neulet.2010.01.056.
 Bastien P, Vinzi VE, Tenenhaus M: PLS generalised linear regression. Comput Stat Data Anal. 2005, 48: 17-46. 10.1016/j.csda.2004.02.005.
 Wold S, Ruhe A, Wold H, Dunn W: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J Sci Stat Comput. 1984, 5: 735-743. 10.1137/0905052.
 Yang L, Jin R: Distance metric learning: a comprehensive survey. 2006, Michigan State University
 Weinberger K, Saul LK: Fast solvers and efficient implementations for distance metric learning. Proc Int Conference Machine Learning (ICML), Helsinki, Finland. 2008, 1160-1167.
 Chatpatanasiri R, Korsrilabutr T, Tangchanachaianan P, Kijsirikul B: A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing. 2010, 73: 1570-1579. 10.1016/j.neucom.2009.11.037.
 Xiang S, Nie F, Zhang C: Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognit. 2008, 41: 3600-3612. 10.1016/j.patcog.2008.05.018.
 Pérez P, Chardin A, Laferté J: Noniterative manipulation of discrete energy-based models for image analysis. Pattern Recognit. 2000, 33: 573-586. 10.1016/S00313203(99)000734.
 Vapnik V (Ed): Estimation of Dependences Based on Empirical Data. 1982, New York: Springer-Verlag
 Salas-González D, Górriz JM, Ramírez J, López M, Illán IA, Segovia F, Puntonet CG, Gómez-Río M: Analysis of SPECT brain images for the diagnosis of Alzheimer’s disease using moments and support vector machines. Neurosci Lett. 2009, 461: 60-64. 10.1016/j.neulet.2009.05.056.
 Duin RPW: Classifiers in almost empty spaces. Int Conference Pattern Recognit (ICPR). 2000, 2 (2): 4392-4395.
 Wiens TS, Dale BC, Boyce MS, Kershaw GP: Three way k-fold cross-validation of resource selection functions. Ecol Modell. 2008, 212 (3-4): 244-255.
 Nadeau C, Bengio Y: Inference for the generalization error. Machine Learning. 2003, 52 (3): 239-281. 10.1023/A:1024068626366.
 Bengio Y, Grandvalet Y: No Unbiased Estimator of the Variance of K-Fold Cross-Validation. J Machine Learning Res. 2004, 5: 1089-1105.
 Westman E, Simmons A, Zhang Y, Muehlboeck JS, Tunnard C, Liu Y, Collins L, Evans A, Mecocci P, Vellas B, Tsolaki M, Kloszewska I, Soininen H, Lovestone S, Spenger C, Wahlund L: Multivariate analysis of MRI data for Alzheimer’s disease, mild cognitive impairment and healthy controls. Neuroimage. 2011, 54 (2): 1178-1187. 10.1016/j.neuroimage.2010.08.044.
 Kohavi R: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Proc Int Joint Conference Artif Intelligence (IJCAI), Montreal, Quebec, Canada. 1995, 1137-1143.
 Górriz JM, Segovia F, Ramírez J, Lassl A, Salas-Gonzalez D: Automatic selection of ROIs in functional imaging using Gaussian mixture models. Appl Soft Comput. 2011, 11 (2): 2376-2382. 10.1016/j.asoc.2010.08.019.
 Stoeckel J, Ayache N, Malandain G, Koulibaly PM, Ebmeier KP, Darcourt J: Automatic Classification of SPECT Images of Alzheimer’s Disease Patients and Control Subjects. Med Image Comput Comput-Assisted Intervention - MICCAI. 2004, 3217: 654-663. Lecture Notes in Computer Science, Springer
 López M, Ramírez J, Górriz JM, Alvarez I, Salas-Gonzalez D, Segovia F, Chaves R: SVM-based CAD System for Early Detection of the Alzheimer’s Disease using Kernel PCA and LDA. Neurosci Lett. 2009, 464 (3): 233-238.
 Metz C: Basic Principles of ROC Analysis. Seminars Nucl Med. 1978, 8 (4): 283-298.
 Schölkopf B, Müller KR, Smola A: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Comput. 1998, 10 (5): 1299-1319. 10.1162/089976698300017467.
 Schölkopf B, Müller KR, Smola A: Kernel Principal Component Analysis. Artif Neural Networks - ICANN’97. Lecture Notes in Comput Sci. 1997, 1327: 583-588. 10.1007/BFb0020217.
 de Jong S: SIMPLS: An alternative approach to partial least squares regression. Chemom Intell Lab Syst. 1993, 18 (3): 251-263. 10.1016/01697439(93)85002X.
 Varmuza K, Filzmoser P (Eds): Introduction to Multivariate Statistical Analysis in Chemometrics. 2009, Boca Raton, FL: Taylor and Francis - CRC Press
 Rosipal R: Kernel Partial Least Squares for Nonlinear Regression and Discrimination. Neural Network World. 2003, 13 (3): 291-300.
Prepublication history
The prepublication history for this paper can be accessed here: http://www.biomedcentral.com/14726947/12/79/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.