Classi�cation of Painful or Painless Diabetic Peripheral Neuropathy and Identi�cation of the Most Powerful Predictors Using Machine Learning Models in Large Cross-Sectional Cohorts

Background: To improve the treatment of painful Diabetic Peripheral Neuropathy (DPN) and associated co-morbidities, a better understanding of the pathophysiology and risk factors for painful DPN is required. Using harmonised cohorts (N = 1230) we have built models that classify painful versus painless DPN. Methods: The Random Forest, Adaptive Regression Splines and Naive Bayes machine learning models were trained for classifying painful/painless DPN. Their performance was estimated using cross-validation in large cross-sectional cohorts (N = 935). Models were externally validated in a large population-based cohort (N = 295) in the presence of missing values. Variables were ranked for importance using model speci�c metrics and marginal effects of predictors were aggregated and assessed at the global level. Model selection was carried out using the Mathews Correlation Coe�cient (MCC) and model performance was quanti�ed in the validation set using MCC, the area under the precision/recall curve (AUPRC) and accuracy. Results: Random Forest (MCC=0.28, AUPRC = 0.76) and Adaptive Regression Splines (MCC = 0.29, AUPRC = 0.77) were the best performing models and showed the smallest reduction in performance between the training and validation dataset


Peripheral neuropathy as a complication of diabetes
The prevalence of Diabetic Peripheral Neuropathy (DPN) is 29-49% in people with diabetes mellitus (1,2).Up to 50% of patients with DPN will develop chronic neuropathic pain (1,3).Painful DPN is characterised by chronic pain that is most severe in the feet (4), but can extend to involve the legs, hands and arms in a typical "glove and stocking distribution".The pain is often described as a burning sensation, associated with paraesthesiae or dysaesthesiae and occasionally allodynia (4,5).The last decade has seen signi cant advances in our ability to characterise the sensory phenotype of DPN using patient-reported sensory symptoms (6) and standardised quantitative sensory testing (7) which facilitate patient strati cation (8).
Painful DPN is strongly associated with poor quality of life and psychological co-morbidities such as depression and anxiety disorders (9,10).The mental health burden associated with painful DPN remains an under-recognised and under-treated complication associated with diabetes mellitus.Management of DPN is complicated by several challenges.The condition is under-diagnosed (11), current treatment options are inadequate (3,4), and we do not understand why some patients with DPN develop pain and others do not.
The pathophysiology of painful DPN is most likely a complex interaction of genetic, environmental and psychological factors.For example, a recent review of studies of the risk factors for neuropathic pain reported that clinical and lifestyle factors such as obesity, poor glucose control, hypertension and neuropathy severity were associated with its presence as well as genetic and psychological factors such as depression and anxiety (12).Interactions between risk factors may also be important.For example, we have found that the negative impact of glucose control on neuropathy severity is larger in males than in females, whereas stress and anxiety had larger effects in females (8,13).
In order to improve the treatment of painful DPN and the associated co-morbidities, it is essential that we develop a better understanding of the pathophysiology and risk factors for painful DPN.The DOLORisk project (14) has provided the scienti c community with large harmonised datasets that can be exploited in order to better understand the risk factors and build models predicting the development of painful or painless DPN.Moreover, data harmonisation between multiple datasets collected from different centres facilitates estimation of the models performance using cross-validation and rigorous external validation in independent datasets.

Machine learning
ML is a technique for statistical learning that involves optimisation in order to minimise the loss of function and optimise the predictive ability.During training an ML algorithm learns patterns and determines the optimal values for its internal parameters from data (15,16).ML is focused on prediction by optimising the discrimination of different classes; no assumptions about the underlying processes that generate data are required and a well performing model does not replace the rigorous statistical techniques needed to infer causality.The advantage of a well-trained ML model is that it can be easily generalised to predict unobserved outcomes on an unknown dataset.In supervised classi cation, we train an ML algorithm in a training dataset with known classes (e.g.painful vs painless neuropathy), and use the nal model to predict classes in new samples.Reducible prediction errors are bias (i.e. the difference between a model's prediction and the actual value) and variance (i.e. the difference between predictions of different realisations of the model) (17).In this context model training involves a trade-off between bias and variance (18,19).A model with high bias under-ts the data and a model with high variance over-ts the data.A manifestation of over-tting is a model that has learned nuances of the training data and cannot be generalised in un-observed data.Thus model validation in an independent test dataset is important in order to avoid highly optimistic estimations of real-world model performance.
Another important aspect is how to handle missing data.Every decision taken in training is considered part of the model building and should be benchmarked for its effect on model performance.The same is true for the imputation of missing values in model training, testing and validation.As large clinical cohorts often suffer from missing values in several data points we have developed a framework of utilising multiple imputation of missing values that does not leak information between model training and validation, models the uncertainty introduced due to the imputation, and performs outcome agnostic imputation of the validation datasets simulating model deployment and prediction in the presence of missing values (20,21).Estimates from multiple imputed datasets are aggregated using Rubin's rules (20).Multiple models are then trained in the multiple completed datasets and predictions are aggregated over all the completed instances of the outcomeagnostic imputed validation set.This is a predict-and-aggregate strategy (22,23).We have also taken care to encapsulate all model building decisions in a cross-validation approach that ensures no information leakage between in-fold and out-of-fold instances during cross-validation.
Algorithmic modelling and ML techniques are notorious for their "black box" approach that can obscure meaningful relations between predictors and present spurious associations due to chance or systematic errors as related to the outcome.A lot of work has focused on making ML interpretable (24)(25)(26).In the context of a model one can see how changes in predictors in uence the model's predictions (26,27) and predictors can be ranked based on model speci c or model independent metrics.These techniques allow us to explain and better understand the behaviour of ML algorithms.In the rest of the manuscript we will use the words predictors or features interchangeably.

Predicting diabetes complications
ML has been successfully utilised for predicting (28-30) and developing (31)(32)(33) risk equations for diabetes and its complications.Hyperglycaemia, hypertension, obesity, smoking, duration of diabetes and female gender have been identi ed as risk factors increasing the odds ratio for DPN (30,31,34,35).However, psychological factors have generally not been considered and importantly, only the presence of DPN was usually amongst the endpoints with no distinction between painful and painless neuropathy.ML has been used to predict co-morbid depression in people with diabetes mellitus, nding that female gender, having a higher number of diabetic complications and the presence of chronic pain were amongst the factors most highly correlated with major depression (32).In this study we have used an array of demographics, clinical, quality of life and psychological features to classify painful or painless DPN in the largest (to the best of our knowledge) deeply phenotyped clinical cohort of people with painful or painless DPN (N= 1230) to date.ML was applied in a realistic context that included the presence of missing values and different ways of de ning the outcome, i.e. clinical diagnosis or questionnaire based.

Methods
In this study we trained a diverse set of Machine Learning (ML) models to classify painful versus painless DPN in three datasets: a deeply phenotyped clinical cohort developed in the University of Oxford (5); Technion -Israel Institute of Technology; and Imperial College London.We then externally validated these models in a questionnaire-based phenotyped population cohort developed in the University of Dundee (15).We followed the TRIPOD reporting guidelines (36), supplementary Figure 1.

Datasets
All data used in this study has been generated using the DOLORisk study protocol which has been described elsewhere (14).The total sample size used for training and validating these models was 1230 people with diabetes mellitus, predominantly Type II.
Three large, deeply phenotyped, cross-sectional cohorts (DOLORisk Imperial College London, PINS -University of Oxford (5) and DOLORisk Technion -Israel Institute of Technology, N = 935) were used to train and estimate model's performance using 5-times repeated 10-fold cross-validation.Training datasets had deep clinical phenotyping.Participants were rst screened for clinical neuropathy based on symptomatology and DPN was con rmed by abnormalities of nerve conduction studies or Intra Epidermal Nerve Fibre Density (IENFD) (37).Neuropathic Pain (NeuP "pain caused by a lesion or disease of the somatosensory system") was determined at the time of the clinical assessment according to the NeuP Special Interest Group (NeuPSIG) of the International Association for the Study of Pain (IASP) grading system (38).
The NeuPSIG grading for neuropathic pain was used to grade neuropathic pain.This is pain with: 1. a distinct neuroanatomically plausible distribution, i.e. pain in symmetrically distributed in the extremities; 2. a history suggestive of a relevant lesion or disease affecting the peripheral or central somatosensory system-diagnosis of diabetes mellitus and a history of neuropathic symptoms including decreased sensation, positive sensory symptoms, e.g., burning, aching pain mainly in the toes, feet, or legs;

Outcome de nition
The outcome of this study was painful or painless Diabetic Peripheral Neuropathy (DPN).For the training datasets one or more physicians have de ned phenotypes after detailed clinical examination and grading of neuropathic pain as discussed above and in line with IASP and NeuPSIG de nitions (37,44).An overview of the training datasets including all independent variables is in Table 2.
For the validation datasets we have used an array of structured, validated questionnaires and screening questions to de ne phenotypes.The Michigan Neuropathy Screening Instrument (MNSI) (45) questionnaire section alone, with a cut-off value of 3, was used to de ne diabetic neuropathy.Various cut-offs had been suggested for the MNSI clinical examination and questionnaire instrument (45)(46)(47).In these it has been consistently reported that the cut-off of 7 for the questionnaire part when used in combination with clinical examination is too insensitive for stand-alone questionnaire use, whereas a cut-off score of 3 and above for the questionnaire only has been shown to have very good performance (AUC = 0.75, optimal cut-off > 2.0318) (45).
A screening question "Are you currently troubled by pain or discomfort, either all the time or on and off?" was used to de ne the presence of pain.Chronicity was screened using the question "How long have you been suffering with this pain or discomfort?", with a cut-off of > 3 months to de ne the temporal aspect of chronic pain.These questions have been validated and are identical to those used in previous population-based epidemiology studies of pain, including UK Biobank (48,49).Location of pain was assessed using the question "In the past three months; a) which of these pains have you had, b) which one of these pains bothered you the most?" followed by a comprehensive choice of body locations including "Pain in your feet".The participant was then asked to complete the self-complete version of the "Douleur Neuropathique en 4 Questions" (DN4) questionnaire (50) for the most bothersome pain where a score of 3 and above indicated the presence of NeuP.A de nition of possible DPN required the presence of neuropathy as screened by the MNSI and chronic pain in the feet, regardless of whether it was the most bothersome pain present.
The DN4 questionnaire-only version was only considered to remove instances of con icting evidence, i.e. painful neuropathy with a DN4 score under the cut-off value or painless neuropathy with a DN4 value over the cut-off , using a cut-off value of 3 to indicate the presence of NeuP.This cut-off has been validated to provide the optimal area under the ROC for the questionnaire-only section, i.e. excluding the clinical examination (50).
The presence of diabetes, neuropathy and chronic pain in the feet was used to de ne painful DPN.The presence of diabetes, neuropathy and no neuropathic pain in the feet de ned the group of painless DPN, gure 1.
An overview of the validation dataset is in Table 3. 798 people had painful diabetic neuropathy (617 (66%) in training and 181 (61.4%) in validation sets) and 432 had painless diabetic neuropathy (318 (34%) in training and 114 (38.6) in validation sets).A sensitivity analysis assessing the change of the pooled coe cient estimates of a logistic regression model tted on the imputed validation dataset showed no or very small sensitivity to various outcome de nitions (Supplementary Figure 2).We further assessed the internal consistency of the validation cohort by calculating the Cohen's Kappa inter-rater agreement between the MNSI question "Do you ever have any burning pain in your legs and/or feet?" and the response to the pain localisation question asking for "Pain in your feet", We observed a signi cant (p.value < 0.01), fair agreement between these responses, Kappa = 0.295.

Missing values and feature construction
Datasets were largely harmonised and follow the DOLORisk core protocol (39).However, in the Oxford cohort, the Depression, Anxiety and Positive Outlook scale (DAPOS) (51) Anxiety and Depression scores initially used have been replaced with the PROMIS (42) anxiety and depression short forms.Under the assumption that these constructs measured the same quantity in different scales we linked DAPOS to PROMIS scores by scaling them together and then using the derived means and standard deviations to bring them in the same scale as PROMIS t-scores.Questions related to smoking were transformed to an "ever smoking" feature by taking into account the response to questions related to smoking at the time when the questionnaire was completed, clinical examination took place or in the past.The EQ-5D-5L (41) questionnaire was used alongside the UK normative data to obtain the EQ-5D index that was used as an independent variable.Alcohol consumption was transformed to a Likert type scale (0-5) using the following ordered levels: "Never", "Less than 1 day per month", "1 to 3 days per month", "1 or 2 days per week", "3 or 4 days per week", "Daily or almost daily".HbA1c was transformed to percentages (%) from mmol/mol when it was reported that way.
After removing all variables with >50% of missingness, anxiety and depression t-scores demonstrated the highest rate of missingness (about 45%).We then performed multiple imputations by chained equations using the predictive mean matching algorithm (52).The number of imputations was set equal to the maximum missingness ratio experienced by any variable.Multiple imputation was done separately in the training and validation datasets to prevent information leakage between datasets.In order to accurately and not overoptimistically model the uncertainty introduced due to missing values we performed outcome agnostic imputation of the validation dataset.For this purpose a dataset was created by stacking both the training and imputation datasets and removing the outcome.The density plot of the observed and imputed values are in Supplementary Figures 3 and 4 for the training and validation datasets respectively.

Statistical Analysis
In an exploratory analysis before model tting, dependencies between all independent variables and the outcome were assessed using the chi-square test for categorical data and the Kruskal Wallis test for numerical data comparison between two groups.Model's performance was assessed by calculating overall accuracy as the proportion of the total number of correctly classi ed instances and the binomial test (p.value< 0.05) was used to assess whether accuracy was higher than the prevalence of the majority class, i.e no-information rate (NIR), the balanced accuracy as the mean of sensitivity and speci city and the Area Under the Precision/Recall Curve (AUPRC).However, all these metrics are sensitive to class imbalance and can lead to the selection of models that severely mis-classify the minority class.Therefore, we used the Matthews Correlation Coe cient (MCC) (53), which is similar to Pearson's correlation coe cient, ranging from -1 to 1 and measures how correlated the prediction is to the true outcome.Moreover, MCC does not change with the substitution of the reference class as it is symmetrical and provides a more robust performance metric than accuracy, balanced accuracy and F1 score (53)(54)(55).Models' calibration was assessed on the validation set by visualising the observed event percentage against the 10 prediction probability deciles.A linear model tted to the calibration curves, i.e. event rate vs midpoint of the decile bin, was used to estimate the calibration slope and intercept.
Models were not updated nor calibrated after training in order to realistically assess the performance in predicting new data.

Model training and validation
We developed a work ow that uses the predict-then-aggregate strategy, after multiple imputation during both model training and validation.This way we modelled the uncertainty due to the imputation in both datasets, did not allow for information leakage between training and validation datasets and encapsulated all model building decisions in resampling and external validation (23).Predictions were aggregated using majority voting, and point estimators were aggregated using the mean and Rubin's rules to calculate the pooled -total standard deviation from the within/between imputations variance (56).The work ow is visualised in Figure 2.
Models were trained on the training dataset using a maximum grid search of 60 tuning parameters optimised for the higher MCC.Numeric variables were centred and scaled for each cross-validation fold to ensure no information leakage between in and out-of-fold samples.The number of multiple imputations was equal to the highest rate of missingness observed across all variables.During training, we imputed missing values using all independent variables and the outcome.Thus we trained an ensemble of m=45 models, optimised for the highest MCC during the 5-times repeated, 10-fold cross-validation.The nal model out-of-fold prediction probabilities, MCC, accuracy, balanced accuracy and AUPRC were calculated using Rubin's rules.Partial dependence was also aggregated by calculating the mean marginal probability across imputations.
In validation, we rst removed the outcome and pooled together the training and validation data.Then we performed m=45 imputations as we did during training.The real outcome was kept unknown during the imputation of the validation set, simulating a real-world scenario of prediction of new data in the presence of missing values.Then we used the ensemble of trained models to predict the outcome on each of the 45 imputed validation datasets, producing 45 2 predictions.Finally, these predictions were aggregated by majority voting.
We trained a series of models using a set of diverged and well-known algorithms presenting the most signi cant subdomains of ML.The Random Forest ( 57) is an algorithm that produces an ensemble of decision trees using bagging, a technique that selects random subsets of potential predictors in order split each node of each growing tree.The Random Forest is also robust to the presence of multicollinearity of independent variables as a subset of predictors is randomly selected for each node split.
The Adaptive Regression Splines (58) is a multi-variable extension of regression that is able to model complex non-linear problems using an ensemble of simple linear functions that in aggregate optimise predictive ability.
The algorithm has a built-in backwards elimination variable selection.
Finally, the Naive Bayes classi er as implemented in the e1071 package ( 59) is a probabilistic classi er that uses the Bayes' rule to calculate the posterior probability of each class given a con guration of independent predictor variables.
In the case of Random Forests and Adaptive Regression Splines we trained an unweighted version and a weighted version with class weights inversely proportional to the class prevalences.Weighted models should match the probability distribution of the outcome closer than unweighted by having better calibration but could also run a higher risk if over-tting training data and be less generalisable.

Interpretability
In order to understand how independent variables values in uenced models outcomes and to provide some interpretability of ML algorithms we calculated the variable importance in a model speci c way and then scaled the metrics value to 100 to make them directly comparable.For the Random Forest we used the Gini importance index that calculates the mean decrease in impurity of the nodes produced by a split that uses a certain variable.For the Multivariate Adaptive Regression Splines we used the built-in backwards variable selection of the model to calculate the reduction in performance estimated by cross-validation when each variable is removed, and for the Naive Bayes model a ROC curve analysis was conducted for each variable.
Moreover, we calculated, aggregated and visualised the marginal effects of all independent variables with a scaled importance of > 10 on each model's outcome prediction.These Partial Dependence Plots (PDP) (66,67) plots represent how the model's outcome was in uenced by changes in an independent variable values.Partial Dependence (PD) was calculated by marginalizing the classi er's predicted probability over the distribution of the feature of interest.
PDP not only show the marginal probability of the outcome given certain feature values but also provide an assessment on how robust and accurate is the information that a ML model can learn across the distribution of a feature's values.A one-dimensional plot below each PDP shows the density of the feature values across each whole range.A dense distribution indicates that the PD can be accurately calculated for this range of values, while a sparse distribution shows that we cannot reliably calculate PD and also that there was probably not enough data in our training datasets for a model to learn a meaningful relationship for this feature range.
We should also note that, while most ML models use different mechanisms to robustly handle some collinearity of features, PDP is sensitive to multicollinearity.
Partial dependence was aggregated for the whole ensemble of trained models using a customised function that calculated the mean across imputed datasets.In addition, trends of PDPs were estimated and visualised using a LOESS (68) smoothed curve.This analysis highlighted how the different values of each independent variable in uenced the models predicted outcome all other things being equal.

Performance estimates
During All models had good performance estimates indicating a moderate positive relationship between the outcome and models' predictions, good balanced accuracy and very good AUPRC, Figure 3.

Variable pro ling
Model speci c variable pro ling revealed that a speci c subset of variables were consistently amongst the most powerful predictors.These included quality of life, personality and psychology traits, age, and glucose control (Figure 4).The built-in backwards feature elimination of the Adaptive Regression Splines algorithm revealed that the best performance was achieved when it considered the EQ5D index, TIPI extraversion and openness, HbA1c, Depression and Anxiety t-scores and age in descending order.EQ5D index, psychology and personality traits were always amongst the top predictors.Random Forest, based on the mean decrease on node impurity, ranked high the importance of BMI, Age and glucose control.Regarding modi able lifestyle factors, Random Forest models also used alcohol consumption and smoking as predictive features, albeit with lower importance.Naive Bayes classi er produced a similar ranking with the addition of ranking alcohol consumption and experience of traumatic events before the age of 18 as having high importance.In all models, gender was not identi ed amongst the most powerful predictors although it was important enough to be included amongst the independent variables used in the nal trained models.Weighted models had similar feature rankings to unweighted ones, with the noticeable difference of the inclusion of the alcohol consumption scale in the nal model of the weighted Adaptive Regression Splines.We should also note that the Adaptive Regression Splines algorithm achieved its best performance using only 7 out of the 16 potential predictors.

Feature effects
The Adaptive Regression Splines classi er was more likely to predict painful DPN with lower EQ5D index (Figure 5, A), indicating worse quality of life and showed a clear elbow around a cut-off threshold of 0.5, lower TIPI extraversion (Figure 5, B) for most of the independent variable's range, lower openness up to a value of 5 and then with increased prevalence for the top two openness values 6-7 (Figure 5, C), and higher HbA1c (Figure 5, D) indicating worse blood glucose level control.
The EQ5D index (Figure 6, A) had the same in uence on the Random Forests predictions.Higher BMI showed increased marginal probability for painful DPN (Figure 6, B) for the part of the range where PD could be accurately calculated.The effect was non-monotonic for the part of the range where we had sparse density of values.Higher values of the PROMIS Depression t-score were associated with increased probability for a participant to be classi ed as having painful DPN (Figure 6, C).Age (Figure 6, D) had a similar non-monotonic effect as BMI.Ages 40-60 showed an increased marginal probability for painful DPN, while higher ages, where marginal probability was calculated with higher accuracy due to the higher density of values, showed a lower marginal probability for painful DPN.HbA1c (Figure 6, E) showed again that worse blood glucose control increased the probability of a classi cation to the painful DPN group.Anxiety (Figure 6, F) showed a similar in uence to Depression for the part of the feature's range that had high density of values.TIPI extraversion and openness (Figure 6, G, H) were similar and had the same effect in Random Forest as in the Adaptive Regression Splines.Higher emotional stability increased the marginal probability of painful DPN (Figure 6, I) and the same was observed for conscientiousness (Figure 6, L).For the part of range where PD could be accurately estimated, higher TIPI agreeableness was negatively associated with painful DPN (Figure 6, J).Higher alcohol consumption was consistently associated with decreased marginal probability of painful DPN (Figure 6, K).
The in uence of EQ5D values in the Naive Bayes classi er (Figure 7, A) was very similar with that which was observed in the Adaptive Regression Splines and the Random Forest.Depression and Anxiety had a strong effect on prediction (Figure 7 B, C) and were both positively associated with increased marginal probability for painful DPN.Again alcohol consumption and Age were negatively associated with a prediction of painful DPN (Figure 7, D, E).Higher HbA1c increased the marginal probability for painful DPN (Figure 7, F).Finally, traumatic experiences under the age of 18 increased the marginal probability for painful DPN (Figure 7, H).The same relationship between higher BMI values and higher probability for painful DPN was observed for the part of the BMI's range where the model could learn meaningful relationships (Figure 7, I).We should note that for most features, with the exception of EQ5D, Alcohol consumption and to an extent HbA1c, models showed unstable behaviour manifested with erratic changes in the marginal probabilities of the sparse values of the features' range.

Validation
Model validation took place in the independent cohort of 295 people.All ensembles of 45 strong trained classi ers predicted and aggregated prediction on all 45 imputed validation sets.Models' performance was benchmarked using the MCC, AUPRC, balanced accuracy, accuracy and the p.value of the associated binomial test (Figure 8, A).The AUPRC was very good for all models.However only the unweighted versions of Random Forest and Adaptive Regression Splines had an overall accuracy signi cantly better than the NIR (p value < 0.05) and were also the models that showed the highest MCC.
In general, performance was markedly reduced between training and validation (Figure 8 B-D).Regarding the most important and robust metric MCC, Random Forest (0.28) and Adaptive Regression Splines (0.29) showed the smallest reduction in performance whilst still achieving good moderate positive correlation between predicted and observed outcomes.Additionally, considering MCC unweighted models markedly outperformed weighted ones (Figure 8B-D).
Model calibration is presented in Figure 9 A. Adaptive Regression Splines, Random forest, but not Naive Bayes, showed a monotonic increasing event rate for increased positive (painful DPN) class probabilities.The intercept and slope of the linear model tted to calibration curves, i.e. event rate vs midpoint of decile bin, indicated good calibration for Random Forest and Adaptive Regression Splines, (Figure 9, B).Intercept estimates showed that the former slightly over-tted the data whereas the latter slightly under-tted.However, the slope of both models was very good given that they had not been calibrated on the validation set.All models gained power moderately fast, requiring approximately 50% of the samples to correctly identify 60% of painful DPN events.

Discussion
This study is one of the largest and most comprehensively phenotyped cohort of people with DPN, in which ML is applied to classify painful versus painless neuropathy.ML has been used in other studies focusing on diabetic neuropathic pain or neuropathy versus no pain or no neuropathy with smaller sample sizes ranging from 327 to 943 instances (29,69,70).Given the large effects of the presence of neuropathy we consider the distinction of painful vs painless DPN to be more challenging for a differential diagnosis, classi cation or prediction.Given this, we have managed to train models that had good to very good performance on an independent validation dataset.
Importantly the training and validation cohorts were collected at different sites, representing different populations and compiled using different methods.Nevertheless they all follow the core DOLORisk protocol, highlighting the importance of large harmonised cohorts.
This study used statistical learning techniques for painful/painless DPN in a realistic framework that accounts for the presence of missing values both during model training and when predicting new instances.In addition, we have taken measures to avoid over-estimating a model's performance.The comparison of scalar metrics often used in statistical learning, including ML, highlights that the usage of accuracy or the AUC alone can signi cantly overestimate the model's performance.Importantly the two best models in the current study, i.e.
Random Forests and Adaptive Regression Splines, showed the smallest reduction in performance in MCC between the training estimate and the validated performance, better calibration and an overall accuracy signi cantly better than the prevalence of the majority class.The fact that the models' performances were reduced in validation highlights the importance of using an independent validation dataset.The optimistic performance estimates during model training are also known as over-t, indicating a model with high variance that has learned nuances of the training dataset that are not necessarily generalisable to the new unknown instances.In addition, datasets that are imbalanced, as is often the case for clinical cohorts, where a condition of interest can be at the same time serious and rare, can also bias some performance metrics and produce highly optimistic estimates.This is a well-known issue in model benchmarking and validation (24, 66, 67) but is sometimes ignored in ML studies.Simpler and more parsimonious models are usually found to be more robust, trading off some variance with bias, and this could be the case with Adaptive Regression Splines, the algorithm that created the most parsimonious model in this study, and that also achieved the best performance in the validation dataset.
In the current study we have opted for using model-agnostic techniques to provide feature rankings and interpretability in order to highlight how independent variables' values in uenced the prediction probabilities of the trained models.Self-reported quality of life, psychological and personality traits have consistently been the most powerful predictors of painful DPN both in the current study and in previous studies using different modelling techniques (70,72,73).BMI was a modi able factor that was positively associated with painful DPN.Interestingly alcohol consumption is shown to be negatively associated with painful DPN.Although alcohol consumption has been found to be negatively associated with chronic pain (74), genetic randomisation had not found evidence supporting this protective effect and shown that most probably the direction of causality is that chronic pain reduces alcohol consumption (75).Finally, age was found to have a complex nonmonotonic relationship with the development of painful DPN.Although people with painful DPN are younger on average than those with painless DPN, this is mostly driven by a greater prevalence in the 40-60 age group.
These kinds of complex relationships highlight the need of advanced statistical learning techniques in order to accurately model the development of painful or painless DPN.In addition, the sensitivity of PD on the density of the values of the respective feature highlights the need to use the largest possible clinical cohorts to allow ML models to learn meaningful relationships from data.
These models can be used either in the clinical context to assist patient strati cation based on the risk of developing painful DPN if proven to be valid in a prospective study, or in the form of an online calculator that can return broad risk categories based on user input.Models' performance and calibration prove that in both cases they can help timely diagnosis and prognosis, and could ultimately help patients and healthcare personnel to improve outcomes in those at highest risk by changing modi able factors such as BMI and HbA1c control and institute earlier treatments including medication and/or psychological interventions.
The main limitation of this study is the fact that the cohorts used were cross-sectional, therefore we cannot consider this a prognostic modelling study in the temporal sense.However, the GoDARTS validation dataset is a longitudinal cohort and the PINS and DOLORisk Imperial cohorts are followed up for 2-and 5-year outcome re-phenotyping.We will use this data in the future to update and re-validate the models.

Conclusions
Machine learning models trained on large cross-sectional cohorts were able to accurately classify painful or painless DPN on an independent population-based dataset.Painful DPN was strongly associated with poorer self-reported quality of life, younger age, poor glucose control, high BMI and a number of psychological/personality factors.These models showed good performance in realistic conditions in the presence of missing values and noisy datasets.

Figures Figure 1 Flow
Figures

Figure 2 Block
Figure 2

Table 1 Descriptive
summary statistics for all datasets.Set index indicates the training or validation dataset.Numerical variables are represented by the median and Inter Quantile Range (IQR) in brackets, categorical variables by the absolute occurrence and percentage in brackets.Columns hold from left to right data for the Training, Validation and both datasets together.Right most column holds the p.value of the comparison between

Table 2
Descriptive summary statistics for the training dataset.Outcome variable indicates painful or painless DPN.Numerical variables are represented by the median and Inter Quantile Range (IQR) in brackets, categorical variables by the absolute occurrence and percentage in brackets.Columns hold from left to right data for people with Painful DPN, Painless DPN and the total of the Training datasets.Right most column holds the p.value of the comparison between Painful and Painless DPN using the chi-square test for categorical variables or the Kruskal Walis test for numerical variables.

Table 3
Descriptive summary statistics for the validation dataset.Outcome variable indicates painful or painless DPN.Numerical variables are represented by the median and Inter Quantile Range (IQR) in brackets, categorical variables by the absolute occurrence and percentage in brackets.Columns hold from left to right data for people with Painful DPN, Painless DPN and the total of the Validation dataset.Right most column holds the p.value of the comparison between Painful and Painless DPN using the chi-square test for categorical variables or the Kruskal Walis test for numerical variables.