Cognitive biomarker prioritization in Alzheimer’s Disease using brain morphometric data

Background Cognitive assessments represent the most common clinical routine for the diagnosis of Alzheimer’s Disease (AD). Given a large number of cognitive assessment tools and time-limited office visits, it is important to determine a proper set of cognitive tests for different subjects. Most current studies create guidelines of cognitive test selection for a targeted population, but they are not customized for each individual subject. In this manuscript, we develop a machine learning paradigm enabling personalized prioritization of cognitive assessments.
Method We adapt a newly developed learning-to-rank approach, PLTR, to implement our paradigm. This method learns the latent scoring function that pushes the most effective cognitive assessments onto the top of the prioritization list. We also extend PLTR to better separate the most effective cognitive assessments from the less effective ones.
Results Our empirical study on the ADNI data shows that the proposed paradigm outperforms the state-of-the-art baselines on identifying and prioritizing individual-specific cognitive biomarkers. We conduct experiments in cross validation and leave-out validation settings. In the two settings, our paradigm significantly outperforms the best baselines, with improvements of up to 22.1% and 19.7%, respectively, on prioritizing cognitive features.
Conclusions The proposed paradigm achieves superior performance on prioritizing cognitive biomarkers. The cognitive biomarkers prioritized on top have great potential to facilitate personalized diagnosis, disease subtyping, and ultimately precision medicine in AD.


Background
Identifying structural brain changes related to cognitive impairments is an important research topic in Alzheimer's Disease (AD) study. Regression models have been extensively studied to predict cognitive outcomes using morphometric measures that are extracted from structural magnetic resonance imaging (MRI) scans [1,2]. These studies advance our understanding of the neuroanatomical basis of cognitive impairments. However, they are not designed to have a direct impact on clinical practice. To bridge this gap, in this manuscript we develop a novel learning paradigm to rank cognitive assessments based on their relevance to AD using brain MRI data.
Cognitive assessments represent the most common clinical routine for AD diagnosis. Given a large number of cognitive assessment tools and time-limited office visits, it is important to determine a proper set of cognitive tests for the subjects. Most current studies create guidelines of cognitive test selection for a targeted population [3,4], but they are not customized for each individual subject. In this work, we develop a novel learning paradigm that incorporates the ideas of precision medicine and customizes the cognitive test selection process to the characteristics of each individual patient. Specifically, we conduct a novel application of a newly developed learning-to-rank approach, denoted as PLTR [5], to the structural MRI and cognitive assessment data of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort [6]. Using structural MRI measures as the individual characteristics, we are able to not only identify individual-specific cognitive biomarkers but also prioritize them and their corresponding assessment tasks according to AD-specific abnormality. We also extend PLTR to PLTR h using hinge loss [7] to more effectively prioritize individual-specific cognitive biomarkers. The study presented in this manuscript is a substantial extension of our preliminary study [8].
*Correspondence: ning.104@osu.edu (The Ohio State University, Columbus, USA)
Our study is unique and innovative from the following two perspectives. First, conventional regression-based studies for cognitive performance prediction using MRI data focus on identifying relevant imaging biomarkers at the population level. In contrast, our proposed model aims to identify AD-relevant cognitive biomarkers customized to each individual patient. Second, the identified cognitive biomarkers and assessments are prioritized based on the individual's brain characteristics. Therefore, they can be used to guide the selection of cognitive assessments in a personalized manner in clinical practice, and they have the potential to enable personalized diagnosis and disease subtyping.

Literature review
Learning to rank
Learning-to-Rank (LETOR) [9] is a popular technique used in information retrieval [10], web search [11] and recommender systems [12]. Existing LETOR methods can be classified into three categories [9]. The first category is point-wise methods [13], in which a function is learned to score individual instances, and then instances are sorted/ranked based on their scores. The second category is pair-wise methods [14], which maximize the number of correctly ordered pairs in order to learn the optimal ranking structure among instances. The last category is list-wise methods [15], in which a ranking function is learned to explicitly model the entire ranking. Generally, pair-wise and list-wise methods have superior performance over point-wise methods due to their ability to leverage order structure among instances in learning [9]. Recently, LETOR has also been applied in drug discovery and drug selection [16][17][18][19]. For example, Agarwal et al. [20] developed a bipartite ranking method to prioritize drug-like compounds. He et al. [5] developed a joint push and learning-to-rank method to select cancer drugs for each individual patient. These studies demonstrate the great potential of LETOR in computational biology and computational medicine, particularly for biomarker prioritization.

Machine learning for AD biomarker discovery
The importance of using big data to enhance AD biomarker study has been widely recognized [6]. As a result, numerous data-driven machine learning models have been developed for early AD detection and AD-relevant biomarker identification including cognitive measures. These models are often designed to accomplish tasks such as classification (e.g., [21]), regression (e.g., [1,2,22]) or both (e.g., [23,24]), where imaging and other biomarker data are used to predict diagnostic, cognitive and/or other outcome(s) of interest. A drawback of these methods is that, although outcome-relevant biomarkers can be identified, they are identified at the population level and not specific to any individual subject. To bridge this gap, we adapt the PLTR method for biomarker prioritization at the individual level, which has greater potential to directly impact personalized diagnosis.

Materials
The imaging and cognitive data used in our study were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database [6]. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI, a prodromal stage of AD) and early AD. For more detailed, up-to-date information, please refer to [25].
Participants include 819 ADNI-1 subjects with 229 healthy control (HC), 397 MCI and 193 AD participants. We consider both MCI and AD subjects as patients, and thus we have 590 cases and 229 controls. We downloaded the 1.5T baseline MRI scans and cognitive assessment data from the ADNI website [25]. We processed the MRI scans using FreeSurfer version 5.1 [26], where volumetric and cortical thickness measures of 101 regions relevant to AD were extracted to characterize brain morphometry.
We focus our analysis on 151 scores assessed in 15 neuropsychological tests. For convenience, we denote these measures as cognitive features and these tests as cognitive tasks. The

Joint push and learning-to-rank using scores-PLTR
We use the joint push and learning-to-rank method that we developed in He et al. [5], denoted as PLTR , for personalized cognitive feature prioritization. PLTR has also been successfully applied in our preliminary study [8]. We aim to prioritize cognitive features for each individual patient that are most relevant to his/her disease diagnosis. We will use patients' brain morphometric measures that are extracted from their MRI scans for the cognitive feature prioritization. The cognitive features are in the form of scores or answers in the cognitive tasks that the patients take. The prioritization outcomes can potentially be used in clinical practice to suggest the most relevant cognitive features or tasks that can most effectively facilitate diagnosis of an individual subject.
In order to prioritize MCI/AD cognitive features, PLTR learns and uses patient latent vector representations and their imaging features to score each cognitive feature for each individual patient. Then, PLTR ranks the cognitive features based on their scores. Patients with similar imaging feature profiles will have similar latent vectors and thus similar rankings of cognitive features [27,28]. During the learning, PLTR explicitly pushes the most relevant cognitive features on top of the less relevant features for each patient, and therefore optimizes the latent patient vectors and cognitive feature vectors in a way that they will reproduce the feature ranking structures [9]. In PLTR, these latent vectors are learned via solving the following optimization problem:

$$\min_{U, V} \; L_s = (1 - \alpha) P^{\uparrow}_s + \alpha O^{+}_s + \frac{\beta}{2} R_{uv} + \frac{\gamma}{2} R_{csim}, \tag{1}$$

where $\alpha$, $\beta$ and $\gamma \in [0, 1]$ are coefficients of the $O^{+}_s$, $R_{uv}$ and $R_{csim}$ terms, respectively; $U = [u_1, u_2, \cdots, u_m]$ and $V = [v_1, v_2, \cdots, v_n]$ are the latent matrices for patients and features, respectively ($u$ and $v$ are column latent patient vector and feature vector, respectively); $L_s$ is the overall loss function. In Problem (1), $P^{\uparrow}_s$ measures the average number of relevant features that are ranked below an irrelevant feature, defined as follows,

$$P^{\uparrow}_s = \sum_{p=1}^{m} \frac{1}{n^{+}_p n^{-}_p} \sum_{f_i \in P^{+}_p} \sum_{f_j \in P^{-}_p} \mathbb{I}\big(s_p(f_i) \le s_p(f_j)\big), \tag{2}$$

where $m$ is the number of patients, $f_i \in P^{+}_p$ and $f_j \in P^{-}_p$ are relevant and irrelevant features of patient $P_p$, $n^{+}_p$ and $n^{-}_p$ are their respective numbers, and $\mathbb{I}(\cdot)$ is the indicator function ($\mathbb{I}(x) = 1$ if $x$ is true, otherwise 0). In Eq. (2), $s_p(f_i)$ is a scoring function defined as follows,

$$s_p(f_i) = u_p^{\mathsf{T}} v_i, \tag{3}$$

that is, it calculates the score of feature $f_i$ on patient $P_p$ using their respective latent vectors $u_p$ and $v_i$ [29]. By minimizing $P^{\uparrow}_s$, PLTR learns to assign higher scores to relevant features than to irrelevant features so as to rank the relevant features at the top of the final ranking list. Note that PLTR learns different latent vectors and ranking lists for different subjects, and therefore enables personalized feature prioritization. In Problem (1), $O^{+}_s$ measures the ratio of mis-ordered feature pairs over the relevant features among all the subjects, defined as follows,

$$O^{+}_s = \sum_{p=1}^{m} \frac{1}{|\{f_i \succ_{P_p} f_j\}|} \sum_{f_i \succ_{P_p} f_j} \mathbb{I}\big(s_p(f_i) < s_p(f_j)\big), \tag{4}$$

where $f_i \succ_{P_p} f_j$ represents that $f_i$ is ranked higher than $f_j$ for patient $P_p$. By minimizing $O^{+}_s$, PLTR learns to push the most relevant features on top of the less relevant features. Thus, the most relevant features are pushed to the very top of the ranking list. In Problem (1), $R_{uv}$ is a regularizer on $U$ and $V$ to prevent overfitting, defined as

$$R_{uv} = \frac{1}{m} \|U\|_F^2 + \frac{1}{n} \|V\|_F^2, \tag{5}$$

where $\|X\|_F$ is the Frobenius norm of matrix $X$. $R_{csim}$ is a regularizer on patients to constrain patient latent vectors, defined as

$$R_{csim} = \frac{1}{m^2} \sum_{p=1}^{m} \sum_{q=1}^{m} w_{pq} \|u_p - u_q\|_2^2, \tag{6}$$

where $w_{pq}$ is the similarity between subjects $P_p$ and $P_q$ that is calculated using the imaging features of the subjects. The assumption here is that patients who are similar in terms of imaging features could also be similar in terms of cognitive features.
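As a rough illustration of the per-patient ranking terms in Problem (1), the sketch below computes the score of each feature as a dot product of latent vectors and then the two ranking terms for a single patient. It is a minimal sketch rather than the authors' implementation: the function names, the toy latent vectors, the value of α and the (1 − α, α) weighting of the two terms are assumptions made for illustration, and the regularizers on the latent vectors are left out.

```python
from itertools import combinations


def score(u_p, v_i):
    """s_p(f_i): dot product of a patient vector and a feature vector."""
    return sum(a * b for a, b in zip(u_p, v_i))


def push_loss(scores, relevant, irrelevant):
    """Fraction of (relevant, irrelevant) pairs in which the relevant
    feature is NOT scored strictly higher than the irrelevant one."""
    bad = sum(1 for i in relevant for j in irrelevant if scores[i] <= scores[j])
    return bad / (len(relevant) * len(irrelevant))


def order_loss(scores, ranked_relevant):
    """Fraction of mis-ordered pairs among the relevant features.
    `ranked_relevant` lists feature indices from most to least relevant."""
    pairs = list(combinations(ranked_relevant, 2))  # (i, j): i ranked above j
    if not pairs:
        return 0.0
    bad = sum(1 for i, j in pairs if scores[i] < scores[j])
    return bad / len(pairs)


# Toy example (all numbers are made up): one patient, four features.
u_p = [1.0, 0.5]
V = [[2.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.0]]
scores = [score(u_p, v_i) for v_i in V]

ranked_relevant = [1, 0]   # feature 1 is most relevant, then feature 0
irrelevant = [2, 3]

alpha = 0.5                # assumed trade-off weight
p_term = push_loss(scores, ranked_relevant, irrelevant)
o_term = order_loss(scores, ranked_relevant)
# Assumed weighting: (1 - alpha) on the push term, alpha on the ordering term.
loss = (1 - alpha) * p_term + alpha * o_term
print(scores, p_term, o_term, loss)
```

In this toy case both relevant features already score above both irrelevant ones, so the push term is zero, while the ordering term is non-zero because the two relevant features are scored in the wrong order relative to each other.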

Joint push and learning-to-rank with marginalization-PLTR h
The objective of PLTR is to score relevant features higher than less relevant features as shown in Eqs. 2 and 4. However, in some cases, the score of relevant features is expected to be higher than that of less relevant features by a large margin. For example, patients can be very sensitive to a few cognitive tasks but less sensitive to many others. In order to incorporate such information, we propose a new hinge loss [7] based PLTR , denoted as PLTR h .
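In PLTR h, the overall loss function is very similar to Eq. 1, defined as follows,

$$\min_{U, V} \; L_h = (1 - \alpha) P^{\uparrow}_h + \alpha O^{+}_h + \frac{\beta}{2} R_{uv} + \frac{\gamma}{2} R_{csim}, \tag{7}$$

where $L_h$ is the overall loss function; $U$, $V$, $R_{uv}$ and $R_{csim}$ are identical to those in Eq. 1. In PLTR h, $P^{\uparrow}_h$ measures the average loss between the relevant features and irrelevant features using hinge loss as follows,

$$P^{\uparrow}_h = \sum_{p=1}^{m} \frac{1}{n^{+}_p n^{-}_p} \sum_{f_i \in P^{+}_p} \sum_{f_j \in P^{-}_p} \max\big(0, \; t_p - (s_p(f_i) - s_p(f_j))\big), \tag{8}$$

where $f_i \in P^{+}_p$ and $f_j \in P^{-}_p$ are relevant and irrelevant features of patient $P_p$, $n^{+}_p$ and $n^{-}_p$ are their respective numbers, $\max(0, x)$ is the hinge loss, and $t_p$ is the pre-defined margin. Specifically, a pair with $s_p(f_i) - s_p(f_j) \ge t_p$ will not induce any loss during optimization. Otherwise, the hinge loss will be positive and increase as $s_p(f_i) - s_p(f_j)$ gets smaller than $t_p$. Thus, the hinge loss forces the scores of relevant features to be higher than those of irrelevant features by at least $t_p$. By doing this, the relevant features are ranked higher than irrelevant features in the ranking list. Similarly, $O^{+}_h$ measures the average loss among the relevant features, also using hinge loss, as follows,

$$O^{+}_h = \sum_{p=1}^{m} \frac{1}{|\{f_i \succ_{P_p} f_j\}|} \sum_{f_i \succ_{P_p} f_j} \max\big(0, \; t_o - (s_p(f_i) - s_p(f_j))\big), \tag{9}$$

where $t_o$ is also the pre-defined margin.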
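To make the effect of the margins concrete, the following sketch evaluates hinge-based push and ordering terms for one patient. It is an illustrative sketch only: the function names and the toy scores are hypothetical, the per-patient averaging is one plausible reading of the description above, and the margins 0.3 and 0.1 are the values of t p and t o reported in the "Parameters" section.

```python
from itertools import combinations


def hinge(margin, diff):
    """max(0, margin - diff): zero once the score gap reaches the margin."""
    return max(0.0, margin - diff)


def hinge_push_loss(scores, relevant, irrelevant, t_p=0.3):
    """Average hinge loss over (relevant, irrelevant) feature pairs."""
    total = sum(hinge(t_p, scores[i] - scores[j])
                for i in relevant for j in irrelevant)
    return total / (len(relevant) * len(irrelevant))


def hinge_order_loss(scores, ranked_relevant, t_o=0.1):
    """Average hinge loss over ordered pairs of relevant features.
    `ranked_relevant` lists feature indices from most to least relevant."""
    pairs = list(combinations(ranked_relevant, 2))  # (i, j): i ranked above j
    if not pairs:
        return 0.0
    total = sum(hinge(t_o, scores[i] - scores[j]) for i, j in pairs)
    return total / len(pairs)


# Toy scores (made up) for one patient and four features.
scores = [0.9, 0.7, 0.5, 0.1]
ranked_relevant = [0, 1]
irrelevant = [2, 3]

p_h = hinge_push_loss(scores, ranked_relevant, irrelevant)
o_h = hinge_order_loss(scores, ranked_relevant)
print(p_h, o_h)
```

Here every relevant feature is already scored above every irrelevant one, so an indicator-based push term would be zero. The hinge version still penalizes the pair of features 1 and 2, whose gap of about 0.2 is below the margin of 0.3, which is the behavior the margin is meant to introduce.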

Data processing
Data normalization
Following the protocol in our preliminary study [8], we selected all the MCI and AD patients from ADNI and conducted the following data normalization for these patients. We first performed a t test on each cognitive feature between patients and controls, and selected the features that differ significantly between the two groups. Then, we converted the selected features into [0, 1] by shifting and scaling the feature values. We also converted all the normalized feature values according to the Cohen's d of the features between patients and controls, so that smaller values always indicate higher AD possibility. After that, we filtered out features with values 0, 1 or 0.5 for more than 95% of the patients. This is to discard features that are either not discriminative, or extremely dominated by patients or controls. After the filtering step, 112 cognitive features remain and are used in the experiments. Additional file 1: Table S1 presents these 112 cognitive features. We applied the same process to the imaging features. Additional file 1: Table S2 presents the imaging features used in the experiments.
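The scaling, direction and filtering steps can be sketched as follows for a single feature. This is a hedged illustration, not the authors' pipeline: the function names are hypothetical, the t test selection step is left out, and three details are assumptions because the text does not specify them (scaling uses the minimum and maximum over patients, the direction is flipped when Cohen's d of patients versus controls is positive, and the 95% rule is checked separately for each of the values 0, 0.5 and 1).

```python
from statistics import mean, stdev


def cohens_d(patients, controls):
    """Cohen's d (patients minus controls) using the pooled standard deviation."""
    n1, n2 = len(patients), len(controls)
    pooled_var = ((n1 - 1) * stdev(patients) ** 2 +
                  (n2 - 1) * stdev(controls) ** 2) / (n1 + n2 - 2)
    return (mean(patients) - mean(controls)) / pooled_var ** 0.5


def normalize(patients, controls):
    """Scale patient values to [0, 1]; flip so that smaller = more AD-like.
    Assumes the feature is not constant over patients or within groups."""
    lo, hi = min(patients), max(patients)
    scaled = [(x - lo) / (hi - lo) for x in patients]
    if cohens_d(patients, controls) > 0:   # patients score higher than controls
        scaled = [1.0 - x for x in scaled]
    return scaled


def keep_feature(values, max_fraction=0.95):
    """Keep a feature unless more than 95% of patients share 0, 0.5 or 1."""
    n = len(values)
    return all(sum(1 for x in values if x == c) / n <= max_fraction
               for c in (0.0, 0.5, 1.0))


# Made-up raw scores: in the first case patients score lower than controls,
# in the second case higher, so only the second is flipped.
unflipped = normalize([1, 2, 3, 4], [5, 6, 7, 8])
flipped = normalize([5, 6, 7, 8], [1, 2, 3, 4])
print(unflipped, flipped)
print(keep_feature([0.0] * 99 + [0.3]), keep_feature([0.0, 0.2, 0.5, 0.9, 1.0]))
```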

Patient similarities from imaging features
Through the normalization and filtering steps described in the "Data normalization" section, 86 normalized imaging features remain. We represent each patient using a vector of these features, denoted as $r_p = [r_{p1}, r_{p2}, \cdots, r_{p86}]$, in which $r_{pi}$ ($i = 1, \cdots, 86$) is an imaging feature for patient $p$. We calculate the patient similarity from imaging features using the radial basis function (RBF) kernel, that is, $w_{pq} = \exp\left(-\frac{\|r_p - r_q\|^2}{2\sigma^2}\right)$, where $\sigma$ is the kernel bandwidth and $w_{pq}$ is the patient similarity used in $R_{csim}$.
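A minimal sketch of the RBF similarity between patients is given below. The function names are hypothetical, the bandwidth σ = 1.0 is a placeholder because the text does not report the value used, and the three short vectors stand in for the 86-dimensional imaging feature vectors.

```python
import math


def rbf_similarity(r_p, r_q, sigma=1.0):
    """RBF kernel: exp(-||r_p - r_q||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(r_p, r_q))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def similarity_matrix(R, sigma=1.0):
    """Pairwise similarities w_pq for a list of patient feature vectors."""
    return [[rbf_similarity(r_p, r_q, sigma) for r_q in R] for r_p in R]


# Toy imaging vectors (made up); real vectors would have 86 entries.
R = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]]
W = similarity_matrix(R)
print(W[0][1], W[0][2])
```

Identical patients get similarity 1, and the similarity decays towards 0 as the imaging profiles move apart, so patients 0 and 2 in the toy data are treated as much more similar than patients 0 and 1.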

Baseline methods
We compare PLTR and PLTR h with two baseline methods: the Bayesian Multi-Task Multi-Kernel Learning (BMTMKL) method [30] and the Kernelized Rank Learning (KRL) method [31].

Bayesian multi-task multi-kernel learning (BMTMKL)
BMTMKL is a state-of-the-art baseline for biomarker prioritization. It was originally proposed to rank cell lines for drugs and won the DREAM 7 challenge [32]. In our study, BMTMKL uses the multi-task and multi-kernel learning within kernelized regression to predict cognitive feature values and learns parameters by conducting Bayesian inference. We use the patient similarity matrix calculated from FreeSurfer features as the kernels in BMTMKL.

Kernelized rank learning (KRL)
KRL represents another state-of-the-art baseline for biomarker prioritization. In our study, KRL uses kernelized regression with a ranking loss to learn the ranking structure of patients and to predict the cognitive feature values. The objective of KRL is to maximize the hits among the top k of the ranking list. We use the patient similarity matrix calculated from FreeSurfer features as the kernels in KRL.

Training-testing data splits
Following the protocol in our preliminary study [8], we test our methods in two different settings: cross validation (CV) and leave-out validation (LOV). In CV, we randomly split each patient's cognitive tasks into 5 folds, such that all the features of a cognitive task fall either into the training set or into the testing set. We use 4 folds for training and the remaining fold for testing, and repeat the experiment 5 times, each time with a different fold as the testing set. The overall performance of the methods is averaged over the 5 testing sets. This setting corresponds to the goal of prioritizing additional cognitive tasks that a patient should complete. In LOV, we split patients (not patient tasks) into training and testing sets, so that a given patient and all his/her cognitive features are either in the training set or in the testing set. This corresponds to the use scenario of identifying the most relevant cognitive tasks that a new patient needs to take, based on the existing imaging information of the patient, when the patient has not completed any cognitive tasks. Figures 1 and 2 illustrate the CV and LOV data split processes, respectively.
Please note that, as presented in the "Data normalization" section, smaller values of the normalized cognitive features always indicate higher AD possibility. Thus, in both settings, we use the ranking list of normalized cognitive features of each patient as ground truth for training and testing.
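The two settings described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the feature names (such as "RAVLT_a") and the patient identifiers are hypothetical, and only the rules stated in the text are encoded, namely that in CV a task's features always move together and in LOV a patient is entirely in training or entirely in testing.

```python
import random


def cv_task_folds(tasks, n_folds=5, seed=0):
    """Randomly assign one patient's cognitive tasks to folds."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cv_splits(task_features, n_folds=5, seed=0):
    """Yield (train_features, test_features) per fold for one patient.
    `task_features` maps a task name to the list of its feature names."""
    folds = cv_task_folds(sorted(task_features), n_folds, seed)
    for k, test_tasks in enumerate(folds):
        test = [f for t in test_tasks for f in task_features[t]]
        train = [f for j, fold in enumerate(folds) if j != k
                 for t in fold for f in task_features[t]]
        yield train, test


def lov_split(patients, test_patients):
    """Split patients (not tasks) into training and testing lists."""
    held_out = set(test_patients)
    train = [p for p in patients if p not in held_out]
    test = [p for p in patients if p in held_out]
    return train, test


# Hypothetical feature names for five of the tasks named in this paper.
task_features = {
    "RAVLT": ["RAVLT_a", "RAVLT_b"],
    "FAQ": ["FAQ_a"],
    "CDT": ["CDT_a"],
    "LOGMEM": ["LOGMEM_a"],
    "NPIQ": ["NPIQ_a", "NPIQ_b"],
}
splits = list(cv_splits(task_features))
train_patients, test_patients = lov_split(["p1", "p2", "p3", "p4"], ["p2"])
print(splits[0], train_patients, test_patients)
```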

Parameters
We conduct grid search to identify the best parameters on each evaluation metric for each model. We use 0.3 and 0.1 as the values of t p and t o, respectively. In the experimental results, we report the combinations of parameters that achieve the best performance on the evaluation metrics. We implement PLTR and PLTR h using Python 3.7.3 and NumPy 1.16.2, and run the experiments on a Xeon E5-2680 v4 with 128 GB of memory.
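A generic grid search of this kind can be sketched as below. It is a hedged illustration: the function names are hypothetical, the candidate values in the grid are placeholders that are not taken from this paper, and the stand-in objective only exists so that the sketch runs. In a real run, the objective would train the model with the given parameters and return the chosen evaluation metric.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Return (best_params, best_score) over the Cartesian product of `grid`.
    `evaluate` is a caller-supplied function that maps a parameter dict to a
    metric for which higher is better."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = evaluate(params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params, best_score


# Placeholder grid: these candidate values are NOT taken from the paper.
grid = {
    "alpha": [0.0, 0.5, 1.0],
    "beta": [0.5, 1.0, 1.5],
    "gamma": [0.5, 1.0],
    "d": [10, 30, 50],
}


def toy_metric(params):
    """Stand-in objective that peaks at alpha = 0.5 and d = 30."""
    return -abs(params["alpha"] - 0.5) - abs(params["d"] - 30) / 100


best, value = grid_search(toy_metric, grid)
print(best, value)
```

Because the best parameters are selected separately for each evaluation metric, such a search would be repeated once per metric, each time with the corresponding objective.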

Metrics on cognitive feature level
We use a metric named average feature hit at k (QH@k), as in our preliminary study [8], to evaluate the ranking performance,

$$\text{QH@}k(\tau^q, \tilde{\tau}^q) = \sum_{i=1}^{k} \mathbb{I}\big(\tilde{\tau}^q_i \in \tau^q(1:k)\big), \tag{10}$$

where $\tau^q$ is the ground-truth ranking list of all the features in all the tasks, $\tau^q(1:k)$ is the top $k$ features in the list, $\tilde{\tau}^q$ is the predicted ranking list of all the features, and $\tilde{\tau}^q_i$ is the $i$th ranked feature in $\tilde{\tau}^q$. That is, QH@k calculates the number of features among the top k in the predicted feature list that are also among the top k in the ground truth (i.e., hits). Higher QH@k values indicate better prioritization performance.
We use a second evaluation metric, weighted average feature hit at k (WQH@k), as follows:

$$\text{WQH@}k(\tau^q, \tilde{\tau}^q) = \sum_{j=1}^{k} \text{QH@}j(\tau^q, \tilde{\tau}^q) / k, \tag{11}$$

that is, WQH@k is a weighted version of QH@k that calculates the average of QH@j ($j = 1, \cdots, k$) over the top k. A higher WQH@k indicates that there are more feature hits and that those hits are ranked closer to the top of the ranking list.
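The two feature-level metrics can be computed in a few lines. The sketch below follows the verbal definitions above; the function names and the toy rankings are hypothetical.

```python
def qh_at_k(truth, predicted, k):
    """QH@k: number of the top-k predicted features that are also among
    the top-k ground-truth features."""
    top_truth = set(truth[:k])
    return sum(1 for f in predicted[:k] if f in top_truth)


def wqh_at_k(truth, predicted, k):
    """WQH@k: average of QH@j for j = 1, ..., k."""
    return sum(qh_at_k(truth, predicted, j) for j in range(1, k + 1)) / k


# Toy rankings (made up), most relevant feature first.
truth = ["a", "b", "c", "d", "e"]
predicted = ["b", "a", "e", "c", "d"]

print(qh_at_k(truth, predicted, 1))   # "b" is not the top ground-truth feature
print(qh_at_k(truth, predicted, 3))   # "b" and "a" are hits, "e" is not
print(wqh_at_k(truth, predicted, 3))  # average of QH@1, QH@2 and QH@3
```

Because WQH@k averages over all prefixes of the top-k list, a hit that appears early contributes to more of the QH@j terms than a hit that appears late, which is why it rewards hits near the top.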

Metrics on cognitive task level
As in Peng et al. [8], we use the mean of the top-g normalized ground-truth scores/predicted scores on the features of each cognitive task for a patient as the score of that task for that patient. For each patient, we rank the tasks using their ground-truth scores and use the ranking as the ground-truth ranking of these tasks. Thus, these scores measure how relevant to AD the task is for the patient. We use the predicted scores to rank cognitive tasks into the predicted ranking of the tasks. We define a third evaluation metric, task hit at k (NH g @k), as follows to evaluate the ranking performance in terms of tasks,

$$\text{NH}_g\text{@}k(\tau^n_g, \tilde{\tau}^n_g) = \sum_{i=1}^{k} \mathbb{I}\big(\tilde{\tau}^n_{gi} \in \tau^n_g(1:k)\big), \tag{12}$$

where $\tau^n_g$ / $\tilde{\tau}^n_g$ is the ground-truth/predicted ranking list of all the tasks using top-g question scores.

Fig. 2 Data split for leave-out validation (LOV)
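The task-level scoring and the task hit metric described above can be sketched as follows. The function names and the toy scores are hypothetical, and the sketch assumes that scores are oriented so that larger values mean more relevant, which for the normalized ground-truth values (where smaller means more AD-like) would require reversing the direction first.

```python
def task_score(feature_scores, g):
    """Mean of the g largest feature scores of one task."""
    top = sorted(feature_scores, reverse=True)[:g]
    return sum(top) / len(top)


def rank_tasks(task_feature_scores, g):
    """Rank task names from highest to lowest top-g task score."""
    scores = {t: task_score(s, g) for t, s in task_feature_scores.items()}
    return sorted(scores, key=scores.get, reverse=True)


def nh_at_k(truth_ranking, predicted_ranking, k):
    """Task hit at k: top-k predicted tasks that are also top-k in the truth."""
    top_truth = set(truth_ranking[:k])
    return sum(1 for t in predicted_ranking[:k] if t in top_truth)


# Toy relevance scores (made up) for one patient and three tasks.
truth_scores = {"RAVLT": [0.9, 0.8, 0.4], "FAQ": [0.7, 0.6, 0.5], "CDT": [0.2, 0.1]}
pred_scores = {"RAVLT": [0.5, 0.4, 0.3], "FAQ": [0.8, 0.2, 0.1], "CDT": [0.1, 0.0]}

for g in (1, 3):
    truth_rank = rank_tasks(truth_scores, g)
    pred_rank = rank_tasks(pred_scores, g)
    print(g, pred_rank, nh_at_k(truth_rank, pred_rank, 1))
```

In this toy example the predicted top task changes with g: with g = 1 a single high-scoring FAQ feature puts FAQ first and the top-1 hit is missed, while with g = 3 the average over more features puts RAVLT first and the top-1 hit is recovered.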

Experimental results
Overall Performance on CV
Table 1 presents the performance of PLTR, PLTR h and two baseline methods in the CV setting. Note that overall, PLTR and PLTR h have similar standard deviations; KRL and BMTMKL have higher standard deviations compared to PLTR and PLTR h. This indicates that PLTR and PLTR h are more robust than KRL and BMTMKL for the prioritization tasks.

Comparison on cognitive feature level
For cognitive features from all tasks, PLTR is able to identify on average 2.665 ± 0.07 out of the top-5 most relevant ground-truth cognitive features among its top-5 predictions (i.e., QH@5 = 2.665 ± 0.07). PLTR h achieves similar performance to PLTR, and identifies on average 2.599 ± 0.09 of the most relevant ground-truth cognitive features among its top-5 predictions (i.e., QH@5 = 2.599 ± 0.09). PLTR and PLTR h significantly outperform the baseline methods in terms of all the evaluation metrics on cognitive feature level (i.e., QH@5 and WQH@5). Specifically, PLTR outperforms the best baseline method BMTMKL by 9.1 ± 3.7% and 22.1 ± 9.5% on QH@5 and WQH@5, respectively. PLTR h also outperforms BMTMKL by 6.4 ± 4.3% and 19.2 ± 10.1% on QH@5 and WQH@5, respectively. These experimental results demonstrate that among the top 5 features in the ranking list, PLTR and PLTR h are able to rank more relevant features on top than the two state-of-the-art baseline methods, and the positions of those hits are also higher than those in the baseline methods.

Comparison on cognitive task level
For the scenario of prioritizing cognitive tasks that each patient should take, PLTR and PLTR h are able to identify the top-1 most relevant task for 72.5 ± 6.0% and 74.3 ± 4.0% of all the patients, respectively, when using 3 features to score cognitive tasks (i.e., NH 3 @1 = 0.725 ± 0.06 for PLTR and NH 3 @1 = 0.743 ± 0.04 for PLTR h). This indicates the strong power of PLTR and PLTR h in prioritizing cognitive features and in recommending relevant cognitive tasks for real clinical applications. We also find that PLTR and PLTR h are able to outperform baseline methods on most of the metrics on cognitive task level (i.e., NH g @1). PLTR outperforms the best baseline method by 11.6 ± 5.6%, 16.7 ± 6.1% and 14.2 ± 6.6% on NH 1 @1, NH 2 @1 and NH 3 @1, respectively. PLTR h performs even better than PLTR on NH 1 @1 and NH 3 @1, and it outperforms the best performance of baseline methods by 13.7 ± 5.3%, 14.7 ± 4.8% and 17.0 ± 8.8% on NH 1 @1, NH 2 @1 and NH 3 @1, respectively. PLTR and PLTR h perform slightly worse than baseline methods on NH 5 @1 and NH all @1 (0.760 ± 0.05 vs 0.784 ± 0.05 on NH 5 @1 and 0.707 ± 0.03 vs 0.760 ± 0.06 on NH all @1). These experimental results indicate that PLTR and PLTR h are better able than the baseline methods to push the most relevant task to the top of the ranking list when using a small number of features to score cognitive tasks. Note that in CV, each patient has only a few cognitive tasks in the testing set. Therefore, we only consider the evaluation at the top task in the predicted task rankings (i.e., only NH g @1 in Table 1). Table 1 also shows that PLTR h outperforms PLTR on most of the metrics on cognitive task level (i.e., NH g @1). PLTR h outperforms PLTR by 1.9 ± 0.5%, 2.5 ± 1.2%, 1.5 ± 0.3% and 2.6 ± 0.9% on NH 1 @1, NH 3 @1, NH 5 @1 and NH all @1, respectively. This indicates that generally PLTR h is better than PLTR on ranking cognitive tasks in the CV setting. The reason could be that the hinge-based loss functions with pre-defined margins can enforce a significant difference between the scores of relevant features and irrelevant features, and thus effectively push relevant features above irrelevant features.

Table 1 Overall performance in CV. The column "d" corresponds to the latent dimension. The numbers in the form of x ± y represent the mean (x) and standard deviation (y).

Tables 2 and 3 present the performance of PLTR, PLTR h and two baseline methods in the LOV setting. Due to space limits, we did not present the standard deviations in the tables, but they have similar trends to those in Table 1. We first hold out 26 (Table 2) and 52 (Table 3) AD patients as testing patients, respectively. We determine these hold-out AD patients as the ones that have more than 10 similar AD patients in the training set with corresponding patient similarities higher than 0.67 and 0.62, respectively.
Tables 2 and 3 show that PLTR and PLTR h significantly outperform the baseline methods in terms of all the evaluation metrics on cognitive feature level (i.e., QH@5 and WQH@5), which is consistent with the experimental results in the CV setting. When 26 patients are held out for testing, with parameters α = 0.5, β = 1.5, γ = 1.0 and d = 30, PLTR outperforms the best baseline method KRL by 13.4% and 1.3% on QH@5 and WQH@5, respectively. The performance of PLTR h is very comparable with that of PLTR. PLTR h outperforms KRL by 13.4% and 0.5% on QH@5 and WQH@5, respectively. When 52 patients are held out for testing, with parameters α = 0.5, β = 0.5, γ = 1.0 and d = 50, PLTR outperforms the best baseline method KRL by 18.1% and 7.8% on QH@5 and WQH@5, respectively. PLTR h performs even better than PLTR in this setting. In addition, PLTR h outperforms KRL by 19.7% and 9.5% on QH@5 and WQH@5, respectively.
These experimental results demonstrate that for new patients, PLTR and PLTR h are able to rank more relevant features to the top of the ranking list than the two baseline methods. They also indicate that for new patients, ranking-based methods (e.g., PLTR and PLTR h) are more effective than regression-based methods (e.g., KRL and BMTMKL) for biomarker prioritization.
Table 3 shows that when 52 patients are held out for testing, PLTR and PLTR h are both able to identify the most relevant task for 80.8% of the testing patients (i.e., 42 patients) under NH 1 @1. Note that the hold-out testing patients in LOV do not have any cognitive features. Therefore, the performance of PLTR and PLTR h as above demonstrates their strong capability in identifying the most AD-related cognitive features based on imaging features only. We also find that PLTR and PLTR h are able to achieve similar or even better results compared to baseline methods in terms of the evaluation metrics on cognitive task level (i.e., NH g @1 and NH g @5). When 26 patients are held out for testing, PLTR and PLTR h outperform the baseline methods in terms of NH g @1 (i.e., g = 1, 2, ..., 5). They are only slightly worse than KRL on ranking relevant tasks among their top-5 predictions when g = 1 or g = 5 (3.308 vs 3.423 on NH 1 @5 and 3.808 vs 3.962 on NH 5 @5). When 52 patients are held out for testing, PLTR and PLTR h also achieve the best performance on most of the evaluation metrics. They are only slightly worse than KRL on NH 2 @1 and NH 5 @5 (0.423 vs 0.481 on NH 2 @1 and 3.712 vs 3.808 on NH 5 @5). These experimental results demonstrate that among the top 5 tasks in the ranking list, PLTR and PLTR h rank more relevant tasks on top than KRL.

Comparison on cognitive task level
It's notable that in Tables 2 and 3, as the number of features used to score cognitive tasks (i.e., g in NH g @k ) increases, the performance of all the methods in NH g @1 first declines and then increases. This may indicate that as g increases, irrelevant features which happen to have relatively high scores will be included in scoring tasks, and thus degrade the model performance on NH g @1 . However, generally, the scores of irrelevant features are considerably lower than those of relevant ones. Thus, as more features are included, the scores for tasks are more dominated by the scores of relevant features and thus the performance increases.
We also find that BMTMKL performs poorly on NH 3 @1 in both Tables 2 and 3. This indicates that BMTMKL, a regression-based method, could not rank relevant and irrelevant features well. It is also notable that generally the best performance for the 26 testing patients is better than that for the 52 testing patients. This may be because the similarities between the 26 testing patients and their top 10 similar training patients are higher than those for the 52 testing patients. The high similarities enable accurate latent vectors for testing patients.
Tables 2 and 3 also show that PLTR h is better than PLTR on ranking cognitive tasks in the LOV setting.

Table 3 Overall performance in LOV on 52 testing patients. The column "n" corresponds to the number of hold-out testing patients. The best performance of each model is in italic. The best performance under each evaluation metric is underlined. Column groups: Method, Feature level, Task level.

Discussion
Our experimental results show that when NH 1 @1 achieves its best performance of 0.846 for the 26 testing patients in the LOV setting (i.e., the first row block in Table 2), the task that is most commonly prioritized for the testing patients is the Rey Auditory Verbal Learning Test (RAVLT), including the following cognitive features: (1) trial 1 total number of words recalled; (2) trial 2 total number of words recalled; (3) trial 3 total number of words recalled; (4) trial 4 total number of words recalled; (5) trial 5 total number of words recalled; (6) total score; (7) trial 6 total number of words recalled; (8) list B total number of words recalled; (9) 30 min delay total; and (10) 30 min delay recognition score. RAVLT is also the most relevant task in the ground truth if tasks are scored correspondingly. RAVLT assesses learning and memory, and has shown promising performance in early detection of AD [33]. A number of studies have reported high correlations between various RAVLT scores and different brain regions [34]. For instance, RAVLT recall is associated with medial prefrontal cortex and hippocampus; RAVLT recognition is highly correlated with thalamic and caudate nuclei. In addition, genetic analysis of the APOE ε4 allele, the most common variant of AD, reported its association with RAVLT score in an early-MCI (EMCI) study [26]. The fact that RAVLT is prioritized demonstrates that PLTR is powerful in prioritizing cognitive features to assist AD diagnosis. Similarly, we find the top-5 most frequent cognitive tasks corresponding to the performance at NH 3 @5 = 3.731 for the 26 hold-out testing patients. They are: Functional Assessment Questionnaire (FAQ), Clock Drawing Test (CDT), Wechsler's Logical Memory Scale (LOGMEM), Rey Auditory Verbal Learning Test (RAVLT), and Neuropsychiatric Inventory Questionnaire (NPIQ). In addition to RAVLT discussed above, other top prioritized cognitive tasks have also been reported to be associated with AD or its progression. 
In an MCI to AD conversion study, FAQ, NPIQ and RAVLT showed significant differences between MCI-converter and MCI-stable groups [35]. We also notice that for some testing subjects, PLTR is able to reconstruct their ranking structures very well. For example, when NH 3 @5 achieves its optimal performance of 3.731, for a certain testing subject, her top-5 predicted cognitive tasks RAVLT, LOGMEM, FAQ, NPIQ and CDT are exactly the top-5 cognitive tasks in the ground truth. This evidence further demonstrates the diagnostic power of our method.

Conclusions
We have proposed a novel machine learning paradigm to prioritize cognitive assessments based on their relevance to AD at the individual patient level. The paradigm tailors the cognitive biomarker discovery and cognitive assessment selection process to the brain morphometric characteristics of each individual patient. It has been implemented using the newly developed learning-to-rank methods PLTR and PLTR h. Our empirical study on the ADNI data has produced promising results in identifying and prioritizing individual-specific cognitive biomarkers as well as cognitive assessment tasks based on the individual's structural MRI data. In addition, PLTR h shows better performance than PLTR on ranking cognitive assessment tasks. The resulting top-ranked cognitive biomarkers and assessment tasks have the potential to aid personalized diagnosis and disease subtyping, and to make progress towards enabling precision medicine in AD.