Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning

Background: Machine-learning classifiers mostly offer good predictive performance and are increasingly used to support shared decision-making in clinical practice. Focusing on performance and practicability, this study evaluates prediction of patient-reported outcomes (PROs) by eight supervised classifiers including a linear model, following hip and knee replacement surgery.
Methods: NHS PRO data (130,945 observations) from April 2015 to April 2017 were used to train and test eight classifiers to predict binary postoperative improvement based on minimal important differences. Area under the receiver operating characteristic, J-statistic and several other metrics were calculated. The dependent outcomes were generic and disease-specific improvement based on the EQ-5D-3L visual analogue scale (VAS) as well as the Oxford Hip and Knee Score (Q score).
Results: The area under the receiver operating characteristic of the best training models was around 0.87 (VAS) and 0.78 (Q score) for hip replacement, while it was around 0.86 (VAS) and 0.70 (Q score) for knee replacement surgery. Extreme gradient boosting, random forests, multistep elastic net and linear model provided the highest overall J-statistics. Based on variable importance, the most important predictors for post-operative outcomes were preoperative VAS, Q score and single Q score dimensions. Sensitivity analysis for hip replacement VAS evaluated the influence of minimal important difference, patient selection criteria as well as additional data years. Together with a small benchmark of the NHS prediction model, robustness of our results was confirmed.
Conclusions: Supervised machine-learning implementations, like extreme gradient boosting, can provide better performance than linear models and should be considered, when high predictive performance is needed. Preoperative VAS, Q score and specific dimensions like limping are the most important predictors for postoperative hip and knee PROMs.
Electronic supplementary material The online version of this article (10.1186/s12911-018-0731-6) contains supplementary material, which is available to authorized users.


Background
Shared decision making (SDM) is an approach where clinicians and patients share available evidence and preferences to support upcoming treatment decisions [1]. SDM has been found to improve care and reduce costs [2]. A recent Cochrane review of the effects of decision aids included 105 studies (31,043 patients in total) and concluded that while knowledge perception increased, no adverse effects on outcomes or satisfaction were observed [3]. One way to support SDM is to gather and evaluate patient reported outcome measures (PROMs). These are powerful tools which transform symptoms into numerical scores that capture why most patients seek medical attention, namely to improve their health state [4]. To control quality of care, the National Health Service (NHS) has routinely collected PROMs for four elective procedures since 2009 [5] and the majority of Swedish quality registers are obliged to gather PROMs as well [6]. One advantage of individual PROMs compared with average study population results is the possibility to predict individual outcomes [7]. While prediction models exist for reoperations [8], scheduling [9,10] or morbidity risk [11,12] of elective surgery, models that predict health-related quality of life (HRQoL) are rare, despite around 160,000 hip and knee replacement procedures that are conducted in England and Wales every year [13]. To support SDM, accurate prediction models are needed, for example to inform doctors and patients about likely surgery outcomes. While generalized linear models are solid tools, machine-learning techniques are often able to outperform linear approaches [14][15][16][17]. Combining machine learning with expertise from clinicians is needed to improve collective care and to foster precision medicine [18]. However, there is no free lunch in optimization [19,20] and thus, no single model works best for all problems.
Moreover, machine-learning models are often seen as black boxes that deliver very good performance but are less intuitive and transparent than traditional statistical methods. Additional uncertainty is partly rooted in the nature of machine learning where modelers have a wide variety of algorithms and approaches to choose from [21], unless more automated approaches are implemented [22]. Gaining and sharing empirical experience is therefore key to advance the understanding of model applicability and usefulness in respective scenarios. Despite thousands of papers on machine learning in medicine, meaningful contributions to clinical care are still rare [23]. The aim of this study is to evaluate eight supervised classifiers, including one generalized linear model, to predict binary PROM outcomes following hip and knee replacement surgery. Moreover, by evaluating variable importance of respective models, we provide easy-to-interpret evidence illustrating model findings.

Data
The NHS publishes PROMs data for hip replacement, knee replacement, varicose vein and groin hernia on a monthly basis and releases a finalized data set every year [24]. Eligible patients are only those who are treated by or on behalf of the NHS. The PROMs program is mainly limited to England. NHS PROMs data from April 1st 2015 to March 31st 2017 were used to train and test models. The data sets contain 81 variables before filtering. Variables include sociodemographics with living status, age groups, disease affliction by self-report ("Have you been told by a doctor that you have …? "), EQ-5D-3L [25], visual analog scale (VAS), Oxford Hip Score (OHS) [26] dimensions, Oxford Knee Score (OKS) [27] dimensions and respective Q scores (sum of OHS or OKS). We removed observations with missing values or variables with near zero or zero variance. Moreover, we removed all post-operative variables except those of interest (VAS and Q score). Plausibility checks were applied to all variables. Some algorithms are sensitive to data imbalances. Three common options exist to address this issue, downsampling, upsampling and Synthetic Minority Over-sampling Technique (SMOTE) [28]. Downsampling removes observations from the majority class, upsampling randomly increases observations from the minority class and SMOTE is a more complex form of oversampling that artificially creates minority cases using nearest neighbors. We disregarded downsampling because it causes loss of information. One disadvantage of SMOTE is that it can add additional noise to the dataset because of increased overlap between classes. Due to its ease of use and high competitiveness [29] compared with more complex techniques, we chose normal upsampling to reach balanced class ratios. Normal upsampling is associated with two disadvantages. One, it makes overfitting more likely since it replicates the minority class. Two, it increases the number of observations and thereby increases training time. 
To avoid overfitting we use cross-validation and apply upsampling only to the training but not to the test data. The increase of computational time was acceptable for us.
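The upsampling step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the study (which relied on R); the function name `upsample_minority`, the seed and the toy data are assumptions made for the example.

```python
import random


def upsample_minority(rows, labels, seed=42):
    """Replicate randomly chosen minority-class rows until all classes are equally frequent."""
    rng = random.Random(seed)  # fixed seed so the resampling is reproducible
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw with replacement from the class until it reaches the majority size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical training fold: one feature per patient (preoperative VAS),
# label 1 = improvement, 0 = no improvement. Test data would stay untouched.
train_rows = [[70], [80], [85], [90], [75], [95], [40], [35]]
train_labels = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = upsample_minority(train_rows, train_labels)
```

Because the function only ever receives the training portion, replicated minority cases cannot leak into the data used for testing.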

Model selection, outcome metrics, cross-validation and variable importance
Algorithm selection has significant influence on model outcome and is essential for model performance [30]. Due to the vast number of available algorithms (as of May 2018, the caret package [31] in R includes 237 models, of which 189 can be used for classification problems), it is difficult for researchers to know in advance which algorithm performs best. To reduce the number of potential test algorithms, several software environments offer so-called cheat sheets that provide some guidance on algorithm implementation for specific problems [32][33][34]. These cheat sheets are mainly based on expert experience but also on oversimplification and generalization. Moreover, data cleaning, feature engineering, hyper-parameter tuning and ensembling cause additional complexity. To select models, we also incorporated expertise published in supplement 1 of Sauer et al. 2018 [35]. The following algorithms were selected for comparison: logistic regression, extreme gradient boosting [36], multi-step adaptive elastic-net [37], random forest [38], neural net [39,40], Naïve Bayes [41], k-Nearest Neighbors [42] and boosted logistic regression [43]. Caret's pre-defined grid search values for respective algorithm hyper-parameters were used. Originally, a support vector machine with radial basis function kernel [44] was evaluated as well. However, due to functional instabilities, results were inconsistent and we consequently removed the implementation from the analysis.
The area under the receiver operating characteristic (AUROC) is used as outcome metric for the training set. For binary classification, the AUROC combines the sensitivity, in our case the probability of correctly classifying a patient who will reach the minimal important difference (MID), and its specificity, i.e. the probability of correctly classifying a case that will stay below MID. The AUROC combines both characteristics at different probability cutoff points. It has certain advantages compared with overall accuracy, e.g. it is not dependent on decision thresholds or prior class probabilities [45]. It ranges from 0.5 (random predictor) to 1 (perfect predictor). To validate our models and to detect possible overfitting, we test the classifiers with surgery outcomes of the 2016/2017 full data release for both procedures. Since neither cost nor utility nor loss functions for the test characteristics (confusion matrix) are available, we value sensitivity (true positives / (true positives + false negatives); the proportion of people correctly predicted to have improvement among all patients who have improvement) and specificity (true negatives / (true negatives + false positives); the proportion of people correctly predicted to have no improvement among all patients who have no improvement) the same. We also provide the Youden J-statistic [46] (Sensitivity + Specificity -1) for each training model. The statistic is calculated across different thresholds (0 to 1 by steps of 0.05) and allows selecting the threshold that maximizes the sum of sensitivity and specificity. It ranges from − 1 to + 1 and a higher score is considered better. 
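To make the two training metrics concrete, the following sketch computes the AUROC as the probability that a randomly chosen improved patient receives a higher predicted probability than a randomly chosen non-improved patient, and scans thresholds from 0 to 1 in steps of 0.05 for the maximal J-statistic. The function names and the toy predictions are assumptions for illustration; classifying a case as improved when its score is greater than or equal to the threshold is also an assumption.

```python
def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are classified as improvement."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_youden(y_true, scores):
    """Return (J, threshold) maximizing J = sensitivity + specificity - 1 on a 0.05 grid."""
    results = []
    for i in range(21):
        threshold = i / 20  # 0.00, 0.05, ..., 1.00
        sens, spec = sens_spec(y_true, scores, threshold)
        results.append((sens + spec - 1, threshold))
    return max(results, key=lambda r: r[0])


# Hypothetical labels (1 = reached MID) and predicted probabilities.
y = [0, 0, 1, 0, 1, 1, 1, 0]
p = [0.10, 0.32, 0.38, 0.44, 0.63, 0.71, 0.92, 0.57]
area = auroc(y, p)
j_stat, cutoff = best_youden(y, p)
```

On this toy example, 14 of the 16 positive-negative pairs are ranked correctly, so the AUROC is 0.875, and the J-statistic peaks at a threshold of 0.60.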
For the validation models we also report other common metrics like positive predictive value/precision (the proportion of patients correctly predicted to have improvement compared with all patients predicted to have improvement), negative predictive value (the proportion of patients who are correctly predicted to have no improvement compared with all patients predicted to have no improvement), F1-score (2 * (Recall * Precision) / (Recall + Precision); a balanced average of precision and sensitivity) and balanced accuracy (0.5 * (true positives / N positives + true negatives / N negatives); the average proportion of correctly classified cases across patients with actual improvement and no improvement).
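The validation metrics defined above can all be derived from the four cells of the confusion matrix. The sketch below is a minimal illustration; the function name and the example counts are hypothetical and do not correspond to any result of this study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive the reported validation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: improved patients correctly identified
    specificity = tn / (tn + fp)   # non-improved patients correctly identified
    precision = tp / (tp + fp)     # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * (sensitivity * precision) / (sensitivity + precision)
    balanced_accuracy = 0.5 * (sensitivity + specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "j_statistic": sensitivity + specificity - 1,
    }


# Hypothetical counts: 100 patients with improvement, 200 without.
metrics = confusion_metrics(tp=80, fp=50, tn=150, fn=20)
```

Note that balanced accuracy and the J-statistic carry the same information, since J = 2 × balanced accuracy − 1.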
Overfitted models predict outcomes based on spurious correlations or random noise and fit unseen data poorly. To avoid overfitting, we used repeated five-fold cross-validation (CV). For five-fold CV, data are split into five equally sized parts. One part is retained and the other four parts are used for training. Once training is finished, model performance is tested with the retained part. This is iterated until each of the parts has been used for validation once. Seeds were set to make results reproducible and models comparable.
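The splitting scheme can be sketched as an index generator. This is an illustration only: the number of repetitions (three) and the seed value are assumptions, since neither is stated for the five-fold CV, and the function name `repeated_kfold` is hypothetical.

```python
import random


def repeated_kfold(n_obs, k=5, repeats=3, seed=2018):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # seeding makes the splits reproducible
    for rep in range(repeats):
        idx = list(range(n_obs))
        rng.shuffle(idx)
        # Deal the shuffled indices into k parts of (nearly) equal size.
        folds = [idx[i::k] for i in range(k)]
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield rep, held_out, train_idx, test_idx


# 20 hypothetical observations: 3 repetitions x 5 folds = 15 train/test splits.
splits = list(repeated_kfold(n_obs=20))
```

Within each repetition every observation is held out exactly once, and any resampling of the training data would be applied to `train_idx` only.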
Variable importance is a concept to indicate the importance of each variable for the predictive performance of the model. For example, in the case of extreme gradient boosting, the importance is calculated by permuting each predictor variable and summing the importance (change in accuracy) over each boosting iteration [47]. The scaled importance ranges from 0 (unimportant variable) to 100 (most important variable). We calculate variable importance for models where the function is available, namely extreme gradient boosting, multistep elastic net, random forest, neural net and linear model.

Performance comparison
For validation and comparison purposes we benchmark one of our high performing hip models against the hip prediction model used by the NHS (predictions of the NHS model are included in the released dataset). The NHS model [48] is a linear regression model that has access to more detailed variables (e.g. age instead of age groups). Since it predicts actual postoperative outcome values, we use two different approaches to benchmark performance. First, we transform the absolute NHS predictions into binary form, by evaluating if the predicted postoperative value reaches MID (= improvement) or not (= no improvement). Second, we calculate our own regression model based on the respective implementation used for the first comparison, via 10-fold cross validation (3 repetitions) and we compare it against the regression results of the NHS model. Comparison metrics for the regression models are root mean squared error (RMSE) and mean absolute error (MAE).
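Both benchmarking steps can be sketched in a few lines. The example assumes that a regression prediction counts as improvement when the predicted postoperative value exceeds the preoperative value by at least the MID; the function names and the toy values are hypothetical.

```python
import math


def binarize_by_mid(preop, predicted_postop, mid):
    """Label a prediction 1 (improvement) if the predicted change reaches the MID, else 0."""
    return [1 if (post - pre) >= mid else 0 for pre, post in zip(preop, predicted_postop)]


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical VAS values for three patients and a MID of 11.
preop_vas = [50, 70, 60]
predicted_vas = [65, 75, 71]   # regression predictions of postoperative VAS
observed_vas = [60, 80, 70]    # observed postoperative VAS

binary_predictions = binarize_by_mid(preop_vas, predicted_vas, mid=11)
error_rmse = rmse(observed_vas, predicted_vas)
error_mae = mae(observed_vas, predicted_vas)
```

The binary labels feed the classification comparison, while RMSE and MAE compare the regression outputs directly.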

PROMs
The NHS uses the EQ-5D-3L [25] including its VAS, the OHS [26] and the OKS [27] to collect PROMs for hip and knee replacement surgery. The EQ-5D-3L is a widely accepted and validated instrument to measure HRQoL. It consists of five questions, also called dimensions, and the VAS. The five dimensions include mobility, self-care, usual activities, pain/discomfort and anxiety/depression. The survey taker has three answer possibilities (no problems, some/moderate problems, unable to or extreme problems). Moreover, the survey taker is asked to mark his current health state on the VAS. The VAS ranges from 0 (worst imaginable health state) to 100 (best imaginable health state). The VAS measures a broader construct of health and is closer to the patient perspective than population based value sets that are normally used to transform health states. Oxford Hip Score (OHS) as well as Oxford Knee Score (OKS) are hip and knee specific instruments to measure disease-specific HRQoL. They consist of 12 questions with five answer possibilities. Values from 0 (severe) to 4 (none) are assigned to each answer and get summed up to the Q score. The sum score grades are 0-19, 20-29, 30-39 and 40-48 points and can be translated to severe/moderate/mild-to-moderate arthritis and satisfactory joint function. Patients complete the preoperative survey in the interval between having an appointment/ being fit for surgery and the procedure. The time lag between pre-and postoperative questionnaires is at least 6 months. The surveys are voluntary and the response rate is around 75%.
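The construction of the Q score from the twelve OHS or OKS items, and its translation into the four grades listed above, can be sketched as follows. The function names and the example answers are assumptions for illustration.

```python
def q_score(item_scores):
    """Sum the 12 Oxford Hip/Knee Score items, each scored 0 (severe) to 4 (none)."""
    if len(item_scores) != 12 or any(s not in (0, 1, 2, 3, 4) for s in item_scores):
        raise ValueError("expected 12 item scores, each between 0 and 4")
    return sum(item_scores)


def q_grade(score):
    """Map a Q score (0-48) to the grade bands described in the text."""
    if not 0 <= score <= 48:
        raise ValueError("Q score must lie between 0 and 48")
    if score <= 19:
        return "severe arthritis"
    if score <= 29:
        return "moderate arthritis"
    if score <= 39:
        return "mild-to-moderate arthritis"
    return "satisfactory joint function"


# Hypothetical patient answering every question with the middle category.
example_score = q_score([2] * 12)
example_grade = q_grade(example_score)
```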

Minimal important differences (MIDs)
MIDs describe the change of a measure that is detectable by the patient. MIDs are not universally valid and vary by patient group and instrument [49]. Several ways to calculate MIDs for PROMs exist. They include anchor-based methods, clinical-trial-based methods as well as distribution-based methods [50]. 0.5 standard deviations were found to approximate MIDs for HRQoL in chronic diseases very well [51]. Since we had no clinical data, we used half a standard deviation of baseline preoperative VAS as MID. This resulted in VAS MIDs of 11 (hip) and 10 (knee). Using multiple anchor-based approaches, a study from Denmark calculated hip MIDs that ranged from 5 to 23 [52]. Our MID is within this range. The individual MIDs for OHS and OKS were taken from literature, they were 8 and 7 respectively [53]. Table 1 depicts sociodemographic data and patient perception before and after surgery. In total, 30,524 observations for hip and 34,110 observations for knee replacement surgery were included from the training dataset 2015/2016. 59.7 and 56.44% of patients were female, respectively. Over 70% of hip and knee surgery patients were between 60 and 79 years of age. Around 7 to 8% had related surgery before. The majority of both patient groups considered themselves to have a disability. On average, patients before hip replacement had lower generic (64.85) and disease specific (18.47) health perception compared with patients before knee replacement (68.18; 19.34) but average postoperative outcomes were higher for hip patients. The numbers for the testing dataset 2016/2017 are comparable. Only slightly more surgeries were done in 2016/2017 and the percentage of people with VAS improvement increased by around 2 percentage points.
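The distribution-based MID and the resulting binary outcome can be sketched as below. The example assumes the sample standard deviation and treats a change equal to the MID as improvement; neither detail is specified in the text, and the function names and toy values are hypothetical.

```python
import statistics


def half_sd_mid(preop_values):
    """Distribution-based MID: half a standard deviation of the baseline values."""
    return 0.5 * statistics.stdev(preop_values)  # sample SD (assumption)


def improved(preop, postop, mid):
    """Binary outcome: True if the postoperative change reaches the MID."""
    return (postop - preop) >= mid


# Hypothetical preoperative VAS values.
baseline_vas = [40, 50, 60, 70, 80, 90]
mid = half_sd_mid(baseline_vas)

outcome_a = improved(preop=60, postop=72, mid=mid)  # change of 12 points
outcome_b = improved(preop=60, postop=65, mid=mid)  # change of 5 points
```

For the toy values the standard deviation is about 18.7, giving a MID of roughly 9.4 points.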

Results
The histogram (Fig. 1) illustrates postoperative changes (postoperative response minus preoperative response) for both outcomes and procedures. The blue, dashed lines depict MIDs. Outcomes are distributed widely and while only a minority of patients have VAS improvements ranging above MIDs, a clear majority of patients perceive relevant improvements of Q scores. Box plots of model performance (Fig. 2) depict AUROC for the VAS and Q score prediction models following hip replacement. For both outcomes, extreme gradient boosting delivered the best AUROC (0.87; 0.78). However, other models followed closely, especially the multistep elastic net and the linear model. Overall, models had higher predictive performance for VAS results than for Q score. Model outcome variation was The AUROC of VAS models following knee replacement (Fig. 3) were slightly lower compared with the respective hip models. Extreme gradient boosting, multistep elastic net and the linear model delivered the highest median AUROC and were closely trailed by random forest and neural net, which had an AUROC of around 0.83. Linear model, multistep elastic net and extreme gradient boosting had the highest median AUROC (0.71) for post-operative Q score. Table 2 depicts key performance metrics of the three models with the highest J-statistic for each outcome. The optimal probability thresholds to maximize J-statistic ranged between 0.45 and 0.55. The highest validated J-statistic for each outcome was 0.59 (hip VAS), 0.42 (hip Q score), 0.57 (knee VAS) and 0.31 (knee Q score). Across both procedures and both outcomes, extreme gradient boosting delivered the highest J-statistic, while multistep elastic net, neural net and the linear model followed closely. Among the three models with the highest J-statistic, extreme gradient boosting delivered the highest or equally good F1 scores as well as balanced accuracy as the second best model. 
Overall, the performance margin was very small and it was easier to predict VAS than Q score improvement, especially for knee replacement surgery. An overview of all performance metrics for all eight models can be found in Additional file 1. Figure 4 illustrates variable importance of several models for hip replacement surgery and both outcomes. Preoperative VAS is the most important predictor for postoperative VAS. Preoperative Q score and Q score dimensions, especially the limping question, were the most important predictors for postoperative Q score respectively. Neural net and linear model show greater reliance on dimensional variables. Figure 5 depicts the variable importance of several models for knee replacement surgery and both outcomes. Again, preoperative VAS, preoperative Q score and Q score dimensions, especially the limping question, were the most important variables for each outcome respectively.

Discussion
This evaluation unveiled three main findings. First, extreme gradient boosting, linear model, multistep elastic net and neural net delivered the highest J-statistic and thus, represent the most robust real world benchmark for one year hip and knee PRO. Second, preoperative VAS, Q score and Q score dimensions were the most important predictors for each respective outcome. Third, it is easier to predict generic VAS than disease-specific Q score and it is easier to predict hip Q score than it is to predict knee Q score.

Predictive performance and adaptability
The performance margin between the top models was small but extreme gradient boosting delivered the highest overall J-statistic for the four prediction tasks. Extreme gradient boosting is a very versatile algorithm that has been found to perform very well in different machine learning challenges [54]. Its high predictive performance has also been documented for other prediction scenarios such as hip fractures [55], urinary tract infections [56], imaging-based infarcts [57], bioactive molecules [58] and quantitative structure-activity relationships [59]. Due to the ease of implementation and relatively low computing times, compared with other machine learning algorithms, extreme gradient boosting can serve as an alternative to traditional methods or as a benchmarking instrument. For our data, the NHS model delivers a sensitivity of 0.77 and a specificity of 0.80 for hip VAS. Our extreme gradient boosting model delivers a sensitivity of 0.82 and a specificity of 0.77 (J-statistic 0.57 vs. 0.59). For hip Q scores the extreme gradient boosting model also outperforms the NHS predictions for sensitivity but not specificity (sensitivity: 0.44 vs. 0.79; specificity: 0.77 vs. 0.63). Its J-statistic, however, is considerably higher (0.21 vs. 0.41). In a next step, we calculated an extreme gradient boosting regression model for the respective data via 10-fold cross validation (3 repetitions). Overall, despite only incorporating a restrictive set of variables, our model performs slightly better than the predictions provided in the NHS datasets. This confirms robustness of our models.
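The J-statistics in this comparison can be recomputed from the reported sensitivities and specificities. The sketch below reads each reported pair as NHS model first and extreme gradient boosting second, which is an interpretation of the text; the function and variable names are hypothetical.

```python
def youden_j(sensitivity, specificity):
    """Youden J-statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


# (sensitivity, specificity) as reported in the text, rounded to two decimals.
reported = {
    ("hip VAS", "NHS model"): (0.77, 0.80),
    ("hip VAS", "extreme gradient boosting"): (0.82, 0.77),
    ("hip Q score", "NHS model"): (0.44, 0.77),
    ("hip Q score", "extreme gradient boosting"): (0.79, 0.63),
}

j_values = {key: round(youden_j(*pair), 2) for key, pair in reported.items()}
```

Three of the four values reproduce the reported J-statistics (0.57, 0.59 and 0.21). For the extreme gradient boosting hip Q score model the rounded inputs give 0.42 rather than the reported 0.41, a difference that is presumably due to rounding of the published sensitivity and specificity.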
Extreme gradient boosting provides several hyperparameters (eta, max_depth, colsample_bytree, subsample, nrounds) that can be tuned to improve model performance. Since we only used the standard grid search parameters, performance gains are still possible. Naïve Bayes and KNN delivered only relatively low J-statistics.
The Naïve Bayesian classifier tended strongly towards sensitivity for all outcomes (0.99, 0.83, 0.99, and 0.87) but had reduced specificity. Decision makers should be aware that utility, cost or loss functions are needed to optimize models for most clinical scenarios and that blindly following AUROC results or J-statistics does not guarantee finding the best classifier for each respective task. Assume a patient has severe knee or hip pain, suffers from very low HRQoL and, to allow further simplification, has only one opportunity for the respective surgery. In this case, prediction models should avoid false negatives and maximize sensitivity, since a patient who would greatly benefit from surgery but is predicted not to do so will suffer significantly from this decision (assuming the surgery decision is based on the prediction), especially when surgery is only possible now but not in the future. However, the easiest way to avoid false negatives is to maximize sensitivity by always predicting improvement for all patients (sensitivity = 100%, false negatives = 0%), irrespective of actual outcome. This is not realistic for most clinical scenarios, because a high number of false positives is normally associated with risks (e.g. postoperative disability), disutilities and losses. Consequently, sensitivity and specificity should not be viewed in isolation. Patient and doctor preferences as well as the surgery situation have to be accounted for before model selection.
Speaking more broadly, outcome valuation depends on aims and risk attitude of the patient, in assuring that improvements are being achieved, or deterioration or lack of change are being avoided. The advantage of machine learning is that different algorithms or implementations can deliver higher predictive performance than traditional methods. While machine learning excels at handling huge amounts of predictors and combining them in non-linear, interactive ways [60,61], linear models may still be a practical option for restrictive data with linear relationships between variables. By using more versatile, non-linear patient data, performance metrics of respective machine learning models will likely improve. It should be noted that for a comparable analysis with longer follow-up periods and less restrictive data with more variables, computing time will increase superlinearly.
Hardware needs should therefore be accounted for.
Since we only used the standard grid search approach, performance gains are still possible, by finetuning associated hyperparameters. Additional training years will also lead to better predictive performance.

Variable importance
Many machine-learning algorithms can reach very high predictive performance but do not solve the problem of causal inference. However, both traditional methods and machine learning can point us towards meaningful medical conclusions [62]. For example, when overweight is of high importance, doctors may counsel patients to lose weight. While it would be desirable to understand the underlying principles and causative variables of perfect prediction models, this is not a requirement for using the respective models for SDM. The prediction itself provides inherent value by supplementing available evidence. While inference and machine learning are often viewed as separate entities, variable importance of machine learning classifiers is used for the evaluation of a wide variety of different research objectives. These include healthcare spending [63], identification of biomarkers for knee osteoarthritis [64], microarray studies [65], credit default risk of enterprises [66], energy performance of buildings [67] or even landslide susceptibility modeling [68]. By providing the variable importance of five different models, we illustrated the predictive importance of preoperative VAS and Q score as well as respective dimensions. Vogl et al. 2014 [69] and other studies [70,71] confirm the importance of preoperative HRQoL for postoperative HRQoL. The likely reason is that patients with low preoperative HRQoL can benefit significantly from respective surgery, while patients with high preoperative HRQoL cannot or can only improve slightly. The University of York developed an informed clinical decision tool to predict improvement for hip and knee replacement surgery that also strongly relies on preoperative EQ-5D-3L index as well as age, gender and symptom duration [72].
Fig. 5 Needle plots of scaled variable importance, VAS and Q score, knee replacement surgery, top ten variables. Note: Importance does not indicate absolute effect or direction
[75,76] and is likely based on the greater complexity of knee replacement surgery. We also showed that predicting VAS results (AUROC of around 0.87 for hip and 0.86 for knee) is easier than predicting Q scores (AUROC of around 0.78 for hip and 0.70 for knee). One explanation for this difference is the nature of both instruments. VAS results represent a generic summary of health perception and consequently should be less sensitive to disease-specific influences, as shown by our evaluation. Despite ranging from 0 to 100, VAS results improve on average by only 6 and 12 points, while Q scores, ranging only from 0 to 48, improve by 16 and 21 points respectively. Nevertheless, VAS outcomes represent a more holistic approach that may account for aspects of disease that are not directly addressed via disease-specific instruments.

Clinical relevance
One important way to support shared decision-making is to provide patients and doctors with highly accurate prediction models for relevant outcomes. From a patient perspective, relevant outcomes in osteoarthritis include HRQoL as well as contextual barriers, treatment disadvantages and consequences for personal life [77]. Our evaluation focused on HRQoL, since it represents an overall aggregate of patient health perception. When clinicians want to predict postoperative HRQoL, they can rely on either personal expertise, average patient results or individual prediction models. These prediction models should incorporate significant numbers of population-based surgery observations from a real-world context in order to be representative. Our models incorporate data of over 60,000 recent hip and knee replacement surgeries from a real-world, routine care, population-based registry and we apply different algorithms/implementations to reach high predictive performance. By delivering real-world benchmarks, results from our models supplement clinical expertise and thus may contribute to shared decision-making. Clinicians should be aware that predictive performance of our models can be improved further by using more detailed clinical data (e.g. ASA class, blood values, BMI etc.) that were not available for the conduct of this study but that are typically gathered on a routine basis before elective surgery. We further showed that preoperative PROMs are the most important predictors for postoperative PROMs. The underlying PROMs can be gathered easily in clinical settings on a routine basis, though limitations do exist [78]. The two small self-explanatory surveys are filled out in a few minutes or less and do not require any previous knowledge by the patient.
Another aspect of clinical relevance of this study is that PROMs-based quality of care improvement requires defined standards on postoperative PROMs change [79]. By providing individual outcome estimations, we deliver a more (VAS) or less (Q score) reliable standard to incorporate PROMs into clinical quality of care control.

Sensitivity analysis
Different methods exist to calculate MIDs. To evaluate the influence of MID on model performance we conducted several univariate sensitivity analyses, in a first example, for hip VAS patients. Since MID selection influences the proportion of patients who can achieve MID-based improvement, we also tested the influence of removing respective patients from the dataset. A patient with a preoperative VAS score of 90 is not able to achieve postoperative gains greater than 10. Thus, selecting higher MIDs results in fewer patients being able to achieve improvement, supposedly making it easier for models to predict the correct outcome by only incorporating preoperative VAS score. Our first sensitivity analysis (Additional file 2) concerned patients with hip replacement and tested a MID of 23 for EQ-5D-3L VAS that was stated in a Danish study by Paulsen et al. 2014 [52]. This improved the AUROCs of the best five models to 0.91/0.92 (compared with 0.86/0.87 before; MID = 11). This gain is not surprising, since significantly fewer patients can achieve this MID. Removing all patients not able to achieve MID reduced respective AUROCs to 0.83/0.84 for the best five models (Additional file 3) and reduced the number of observations to 19,716. Taking the example of our main evaluation and filtering all patients who could not achieve a VAS MID of 11 resulted in 25,606 remaining observations and AUROCs of the best models ranging around 0.81/0.82 (Additional file 4). MID selection, filtering of patients and number of observations all have significant influence on model performance.
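The patient filter used in these sensitivity analyses follows from the ceiling of the VAS at 100. A minimal sketch, assuming that improvement means a gain of at least the MID and using hypothetical preoperative values and names:

```python
VAS_MAX = 100  # upper bound of the EQ-5D-3L visual analogue scale


def can_reach_mid(preop_vas, mid):
    """True if the largest possible gain (VAS_MAX - preop_vas) is at least the MID."""
    return VAS_MAX - preop_vas >= mid


# Hypothetical preoperative VAS scores.
preop_scores = [45, 60, 78, 89, 90, 95]

# Patients who remain in the dataset under each MID.
eligible_mid_11 = [v for v in preop_scores if can_reach_mid(v, 11)]
eligible_mid_23 = [v for v in preop_scores if can_reach_mid(v, 23)]
```

With a MID of 11, a patient starting at 90 is excluded while a patient starting at 89 remains; raising the MID to 23 moves the cutoff down to a preoperative score of 77, which shrinks the eligible group further.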

Limitations
Strengths of this study include the wide variety of algorithms that were applied for evaluation as well as the testing of specific probability thresholds to find the best classifier. By reporting the J-statistic, we go beyond AUROC calculation and show maximal performance when sensitivity and specificity are valued the same. Moreover, the incorporation of generic and disease-specific outcomes for both hip and knee replacement surgery gave insights into both instruments and both procedures.
One limitation of this study is the lack of controls. It was not possible to model patient trajectories without surgery. It is unknown, if a patient has no improvement because of surgery or if surgery prevented an otherwise significant deterioration of health outcome. The lack of long-term data made it impossible to make long-term predictions. Some patients will only have temporary improvement and long-term data are needed to evaluate this issue. Moreover, we only evaluated a binary outcome (improvement/no improvement) but patients may want to know the degree of improvement or deterioration. This could be investigated in future research but results and associated uncertainty are more difficult to apply in shared decision-making. We had no utility, loss or cost function to optimize model metrics because costs were not available and utilities change by patient. Due to privacy concerns, public NHS PROMs data are restrictive and do not reflect clinical precision and versatility. For example, age bands in NHS data cover 10-year time spans and other variables like rehabilitation, BMI or allergies, despite having been found to influence knee and hip replacement outcomes [80][81][82][83], are completely missing. Incorporating respective data will likely improve predictive performance of models. Furthermore, between pre-and postoperative patient reports, response shift has been observed in the UK PROMs data which potentially reduces patient's gain but could not further be analyzed here [84]. Conflicting evidence regarding the validity of self-reported patient data exists [85,86]. However, a rigorous recent study concluded that patient reporting provides similar and less costly information compared with medical records [87]. Moreover, comorbidities in hospital medical records are often based on self-report as well, since clinical validation is mostly not feasible. 
When we ensembled all models linearly for both procedures and both outcomes (not shown here), the resulting AUROC was either worse or only minimally better (third decimal place) than for single models alone. Ensembling of different models was not the focus of this study and thus, we refrained from adding additional uncertainty.

Conclusion
We provide robust real-world benchmarking results for the prediction of PROMs-based postoperative hip and knee replacement surgery outcomes. Extreme gradient boosting delivered the highest overall J-statistic among all models. Linear model, multistep elastic net and neural net followed closely. One strength of machine learning models is their adaptability to different clinical scenarios where certain levels of sensitivity or specificity are needed. Preoperative VAS, Q score and specific instrument dimensions like limping were the most important predictors for hip and knee replacement surgery PROMs.