Second opinion machine learning for fast-track pathway assignment in hip and knee replacement surgery: the use of patient-reported outcome measures

Abstract

Background

The frequency of hip and knee arthroplasty surgeries has been rising steadily in recent decades. This trend is attributed to an aging population, leading to increased demands on healthcare systems. Fast Track (FT) surgical protocols, perioperative procedures designed to expedite patient recovery and early mobilization, have demonstrated efficacy in reducing hospital stays, convalescence periods, and associated costs. However, the criteria for selecting patients for FT procedures have not fully capitalized on the available patient data, including patient-reported outcome measures (PROMs).

Methods

Our study focused on developing machine learning (ML) models to support decision making in assigning patients to FT procedures, utilizing data from patients’ self-reported health status. These models are specifically designed to predict the potential health status improvement of patients initially selected for FT. Our approach drew on techniques inspired by the concept of controllable AI. This includes eXplainable AI (XAI), which aims to make the model’s recommendations comprehensible to clinicians, and cautious prediction, a method used to alert clinicians to potential control losses, thereby enhancing the models’ trustworthiness and reliability.

Results

Our models were trained and tested using a dataset comprising 899 records from individual patients admitted to the FT program at IRCCS Ospedale Galeazzi-Sant’Ambrogio. After training and hyper-parameter selection, the models were assessed on a separate internal test set. The interpretable models demonstrated performance on par with, or even better than, the most effective ‘black-box’ model (Random Forest). These models achieved sensitivity, specificity, and positive predictive value (PPV) exceeding 70%, with an area under the curve (AUC) greater than 80%. The cautious prediction models exhibited enhanced performance while maintaining satisfactory coverage (over 50%). Further, when externally validated on a separate cohort from the same hospital, comprising patients from a subsequent time period, the models showed no practically meaningful decline in performance.

Conclusions

Our results demonstrate the effectiveness of utilizing PROMs as a basis for developing ML models to plan assignments to FT procedures. Notably, the application of controllable AI techniques, particularly those based on XAI and cautious prediction, emerges as a promising approach. These techniques provide reliable and interpretable support, essential for informed decision-making in clinical processes.

Introduction

In medical practice, seeking second opinions is a common and valued approach to achieving consensus in diagnosing and managing patient care, thereby enhancing the overall quality of healthcare [1]. This practice becomes particularly crucial in situations involving complex healthcare decisions, those that are potentially distressing for the patient, or when significant risks are involved [2]. In contrast to cases where patients themselves seek a second opinion for confirmation of a diagnosis or due to unsatisfactory interactions with their doctors, second opinions initiated by other parties, especially those initiated by doctors, often aim to restrict the use of low-value treatments (i.e., those offering minimal or no benefit, posing potential harm, or yielding marginal benefits at disproportionately high costs [3]). Therefore, within the realm of clinical decision making, a second opinion serves as a significant form of decision support, which enables another physician to either confirm or alter the proposed treatment plan [4], and has been proven to significantly reduce medication errors [5], including diagnostic mistakes [2].

In recent years, Machine Learning (ML) algorithms have increasingly been applied to augment clinical decision-making across various tasks [6, 7]. In particular, to counteract cognitive biases associated with an over-reliance on decision support technologies, ML algorithms have recently been utilized as tools for offering second opinions [8, 9]. In this context, they are viewed as cognitive supports with specialized capacities, designed to confirm or revise (i.e., augment) decisions initially made by clinicians, rather than merely automating the clinical decision-making process [10]. Several studies have explored the impact of algorithmic assistance on clinicians’ diagnostic performance when supplemented by a second opinion from an ML algorithm. For example, Gurusamy et al. [11] investigated the use of ML models for providing second-opinion recommendations in brain tumor classification. Kovalenko et al. [12] developed a prototype ML-based video analytics system to aid in diagnosing Parkinson’s disease. Cabitza et al. [13] assessed various ML-based second-opinion protocols to enhance the diagnostic accuracy of orthopedists in radiological knee lesion readings. Bennasar et al. [14] created an ML-based second-opinion system for predicting root canal treatment outcomes. Similarly, Rosinski et al. [15] proposed an ML-based system for selecting assistive technology in post-stroke patients. While most of these studies primarily focused on the use of ML algorithms for second-opinion support in diagnostic or prognostic tasks, our article shifts focus to another aspect of clinical process management: the assignment of rehabilitation protocols. Specifically, we develop second-opinion decision-support ML models for the assignment of patients to the surgical Fast Track (FT) in hip and knee arthroplasty.

In the field of orthopedics, the FT surgical procedure represents a rapid rehabilitation protocol designed to mitigate the physiological and psychological stresses typically associated with surgery [16]. Its primary aim is to facilitate early mobilization and recovery post-surgery [16], leading to outcomes such as reduced Length Of hospital Stay (LOS) [17], decreased convalescence time [18], and lower overall costs [19]. However, the criteria for FT patient assignment have not fully leveraged the extensive patient data available, including Patient Reported Outcome Measures (PROMs). A few studies have compared the effectiveness of Fast Track versus Care-as-Usual surgical procedures from a patient-centered perspective (e.g., [20]). More generally, some studies have considered the application of ML in the management of orthopedic patients [21], with a specific focus on the prediction of the length of stay [22, 23], which, similarly to FT, can be useful for better managing bed availability as well as for identifying patients most in need of increased rehabilitation therapy. By contrast, to our knowledge, no study has yet developed second-opinion ML models specifically for decision support in the assignment of patients to surgical FT: thus, the focus on this task represents a crucial element of novelty in our contribution.

To achieve our objective, we utilized ML models designed to predict whether a patient, preliminarily assigned to FT by their managing clinician, will experience an improvement in health status. In this context, improvement serves as a proxy for the effectiveness of the FT procedure. Accordingly, the model either validates the clinician’s decision to assign the patient to FT (if an improvement is predicted), suggests a need to reconsider this assignment in favor of an alternative approach, such as Care-as-Usual, or prompts the managing clinicians to more thoroughly assess the patient’s specific situation. As these models provide second-opinion support for clinicians, we developed them based on the principles of controllable AI [24], ensuring that the support offered is comprehensible and, if necessary, can be rejected by the managing clinicians. In line with the definition by Kieseberg et al. [24], we define ‘controllable AI’ as second-opinion AI systems that are not only accurate but also capable of identifying and signaling control loss conditions, wherein their effectiveness cannot be fully assured, necessitating or warranting human intervention to evaluate the second-opinion support. Our focus was particularly on methods for detecting control loss, i.e., situations of high uncertainty or potential anomalies, by using eXplainable AI (XAI) [25] to ensure model recommendations are understandable to clinicians, and cautious prediction [26] for uncertainty quantification and enhanced reliability.

Methods

This retrospective study was conducted using a dataset derived from the electronic health records (EHRs) of IRCCS Ospedale Galeazzi - Sant’Ambrogio (OGSA), a leading orthopedic teaching and research hospital in Italy. Our focus was on developing a second-opinion model; therefore, we exclusively analyzed records of patients who were part of the perioperative Fast Track process. The dataset encompassed 925 individual patient records, collected over the period from January 2018 to November 2020.

The dataset for this study included a comprehensive range of patient data, covering demographic characteristics (Sex, Age, Weight, Height, and Body Mass Index [BMI]), details about the assigned surgical procedure and primary affected area (the ICD code, the distinction between Knee and Hip surgeries, and First intervention vs. Revision), clinical information (ASA Class, pre-surgery Hemoglobin levels), and preoperative PROM scores (VAS, EQ5D, SF12 Mental score, SF12 Physical score). Additionally, the SF12 Physical score recorded at the 3-month follow-up was also included. The distribution of these features is detailed in Table 1.

Table 1 Table of descriptive statistics for data features with P-Value Analysis: The table presents the descriptive statistics for each feature in the dataset, stratified by ‘Improved’ and ‘Not Improved’ sub-cohorts

As outlined in the Introduction section, our approach for providing second-opinion support involved using a proxy for the potential effectiveness of assigning patients to the Fast Track program, namely the improvement in the patients’ health status. Specifically, we defined the target variable as a binary outcome (Improved vs. Not Improved) determined by changes in the SF12 Physical score at the 3-month follow-up. A patient was classified as ‘Improved’ if the difference between their 3-month follow-up score and the preoperative score exceeded the distribution-based Minimum Clinically Important Difference (MCID) for this score [20] (see Footnote 1). If this threshold was not met, patients were categorized as ‘Not Improved’.
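As an illustration of this labeling rule, the following minimal sketch applies the location-specific MCID thresholds reported in the Notes section; the column names are hypothetical placeholders, not those of the actual dataset:

```python
import pandas as pd

# Location-specific, distribution-based MCID thresholds for the SF12
# Physical score (see Footnote 1): 3.68 for knee, 3.80 for hip.
MCID = {"knee": 3.68, "hip": 3.80}

def label_improvement(df: pd.DataFrame) -> pd.Series:
    """Binary target: 1 ('Improved') if the gain in SF12 Physical score
    at the 3-month follow-up exceeds the location-specific MCID."""
    gain = df["sf12_physical_3m"] - df["sf12_physical_pre"]
    threshold = df["location"].map(MCID)  # hypothetical 'knee'/'hip' column
    return (gain > threshold).astype(int)
```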

Due to the presence of missing values, we elected to exclude any patient record lacking even one of the features under consideration. This complete-case filtering left 899 records available for subsequent analysis. It is noteworthy that the distribution of the target variable was unbalanced: 644 patients (approximately 71.6%) were classified in the ‘Improved’ category, whereas 255 patients (about 28.4%) fell into the ‘Not Improved’ category. Apart from the one-hot encoding of categorical variables, no additional pre-processing was undertaken: specifically, we did not implement any pre-processing method to correct the imbalance in the label distribution.
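A minimal sketch of this complete-case filtering and one-hot encoding step (again with hypothetical column and variable names) could look as follows:

```python
# Complete-case analysis: drop any record with at least one missing feature
# (925 -> 899 records in the study). No resampling or other label-imbalance
# correction is applied at this stage.
data = raw_data.dropna(axis=0, how="any")
y = label_improvement(data)

# One-hot encoding of the categorical variables (hypothetical column names)
# is the only other pre-processing step.
X = pd.get_dummies(
    data.drop(columns=["sf12_physical_3m"]),  # follow-up score defines the target
    columns=["sex", "location", "intervention_type", "asa_class"],
)
```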

In the development of the ML models for this study, we considered a variety of model classes, encompassing both ‘black-box’ approaches known for their efficacy with tabular data [27] and models grounded in XAI principles. In alignment with the concepts of controllable AI outlined in the Introduction section, the XAI models were specifically chosen for their interpretability [28]. This feature enables clinicians to ‘look into the models’, thereby understanding the basis of the second-opinion support and potentially identifying classification errors. Among the black-box models, we included Random Forest (RF), Support Vector Machines (SVM), XGBoost (XGB) and Multi-layer Perceptron (MLP). Regarding XAI methods, we opted for Logistic Regression (LR) and Decision Tree (DT), along with two advanced, state-of-the-art approaches: Hierarchical Shrinked Trees (HST) [29] and Fast Interpretable Greedy-Tree Sums (FIGS) [30]. HST functions as a post-hoc regularization method that streamlines decision tree models by shrinking the prediction at each node towards the sample means of its ancestors. FIGS, by contrast, is a generalization of the CART algorithm that constructs a forest of simple trees through a greedy approach based on boosting principles, with the trees subsequently combined by summation.
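A sketch of how such a model zoo might be instantiated with the libraries reported below (scikit-learn, xgboost, and imodels; the FIGSClassifier and HSTreeClassifier class names from imodels are assumptions on our part), with all random seeds set to 0 as described later:

```python
from imodels import FIGSClassifier, HSTreeClassifier  # interpretable models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    # 'black-box' models
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),  # probabilities needed later
    "XGB": XGBClassifier(random_state=0),
    "MLP": MLPClassifier(random_state=0),
    # interpretable (XAI) models
    "LR": LogisticRegression(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "HST": HSTreeClassifier(),
    "FIGS": FIGSClassifier(),
}
```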

All the models were trained with the objective of predicting the target variable, namely, classifying each patient as either ‘Improved’ or ‘Not Improved’, based on the aforementioned features. Prior to training, the dataset was divided into two distinct sets, following a stratified split of 75% for training and 25% for testing. The training set was used both for training the models and for optimizing the hyper-parameters, while the test set was reserved for a blind evaluation of the models’ performance. The test set size was selected based on a minimum sample size determination criterion, so as to ensure that, with high probability (greater than 95%), the measured performance estimates would be close to the true performance values.

The models were implemented as pipelines encompassing three different steps: feature scaling, feature selection and the predictive model. The full list of hyper-parameters for the three steps of each pipeline is reported in Table 2. In particular, we used a class weighting hyper-parameter to control for label imbalance: we either considered equal weighting of the instances (i.e., ignoring label imbalance) or assigned greater weight to the instances in the negative class (i.e., label imbalance correction). All other hyper-parameters not specified in Table 2 were set to their default values, except for random seeds, which were all set to the value 0 to ensure reproducibility. As mentioned above, training and hyper-parameter optimization were performed only on the training set, in order to avoid data leakage and overfitting, using a cross-validation approach. The training set was split into 5 folds (each of which encompassed 15% of the original dataset); at each iteration, 4 folds (60% of the original dataset) were used for training and hyper-parameter selection, while the remaining fold was used for internal evaluation. The performance of each model and hyper-parameter configuration was computed as the average performance across the five iterations of the cross-validation, measured in terms of Balanced Accuracy so as to account for the label imbalance. For each model, we selected the configuration of hyper-parameters that achieved the best cross-validation performance and then re-trained the model on the entire training set.
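The following sketch illustrates this setup for a single model class (Random Forest, as an example). The stratified 75/25 split, the three-step pipeline, the class-weighting hyper-parameter, and the 5-fold cross-validation with balanced accuracy follow the description above, while the specific grid values are hypothetical stand-ins for those in Table 2:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stratified 75/25 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Three-step pipeline: feature scaling -> feature selection -> predictive model.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest()),
    ("clf", RandomForestClassifier(random_state=0)),
])

param_grid = {
    "select__k": [5, 10, "all"],          # hypothetical values
    "clf__n_estimators": [100, 500],      # hypothetical values
    # equal weighting vs. up-weighting the minority ('Not Improved') class
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="balanced_accuracy",          # accounts for the label imbalance
    cv=StratifiedKFold(n_splits=5),       # 5-fold CV on the training set only
    refit=True,                           # re-train best config on full training set
)
search.fit(X_train, y_train)
```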

Table 2 Hyper-parameters for the developed models

After training and hyper-parameter optimization, the models were evaluated on the separate internal validation test set in terms of different evaluation metrics, namely: accuracy, sensitivity, specificity, PPV, NPV, Area under the ROC curve (AUC), F1-score (F1), Matthews correlation coefficient [31] (MCC) and balanced accuracy, as measures of error rate; the Brier score, as a measure of calibration; and the standardized Net Benefit (sNB), as a measure of utility.
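A sketch of this evaluation with scikit-learn is shown below. Since the exact sNB implementation is not reported here, the helper uses one common formulation (net benefit divided by prevalence), which should be read as an assumption:

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score, brier_score_loss, confusion_matrix, f1_score,
    matthews_corrcoef, precision_score, recall_score, roc_auc_score,
)

def standardized_net_benefit(y_true, y_prob, pt=0.5):
    """sNB at threshold pt, here assumed to be the net benefit
    NB = TP/n - (FP/n) * pt/(1-pt), divided by the prevalence."""
    y_pred = (np.asarray(y_prob) >= pt).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    n = len(y_true)
    nb = tp / n - (fp / n) * pt / (1 - pt)
    return nb / np.mean(y_true)

y_prob = search.predict_proba(X_test)[:, 1]
y_pred = search.predict(X_test)
print("sensitivity  :", recall_score(y_test, y_pred))
print("specificity  :", recall_score(y_test, y_pred, pos_label=0))
print("PPV          :", precision_score(y_test, y_pred))
print("NPV          :", precision_score(y_test, y_pred, pos_label=0))
print("F1           :", f1_score(y_test, y_pred))
print("MCC          :", matthews_corrcoef(y_test, y_pred))
print("balanced acc.:", balanced_accuracy_score(y_test, y_pred))
print("AUC          :", roc_auc_score(y_test, y_prob))
print("Brier score  :", brier_score_loss(y_test, y_prob))
print("sNB          :", standardized_net_benefit(y_test, y_prob))
```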

As mentioned in the Introduction section, to enhance the ability of the developed ML models to reliably detect control loss conditions, we also developed cautious prediction models based on the above-mentioned ML models. More specifically, the models developed during the training phase were also considered as cautious prediction models that could abstain whenever the prediction for a given instance was not sufficiently confident [32]. To this purpose, we considered the confidence scores returned by the models, which were thresholded at a 0.75 cutoff: that is, whenever the confidence score assigned to the predicted label was lower than 0.75, the model was considered as abstaining from providing support. We adopted this cautious prediction approach, rather than alternative techniques such as conformal prediction [33] or three-way decision [34], due to its greater efficiency (the computational cost of the thresholding strategy is \(O(1)\), while it is on the order of \(O(\log n)\) for conformal prediction, with n the dataset size, and \(O(2^{|Y|})\) for three-way decision, with Y the set of possible labels), its ease of interpretation, and its equivalence, in the binary classification setting and under weak assumptions, with the two above-mentioned methods [34]. We then evaluated these cautious prediction models according to so-called High-Confidence (HC) evaluation metrics (i.e., metrics that only consider the non-abstained instances), namely the accuracy, sensitivity, specificity, PPV and NPV, as well as the coverage (i.e., the rate of non-abstained instances over the total number of instances in the test set).
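A minimal sketch of this thresholding-based cautious prediction, together with the coverage and a High-Confidence accuracy computation, could look as follows:

```python
import numpy as np

def cautious_predict(model, X, cutoff=0.75):
    """Predict, but flag abstentions: the model abstains on an instance
    whenever the confidence score of the predicted label is below the cutoff."""
    proba = model.predict_proba(X)
    confident = proba.max(axis=1) >= cutoff   # O(1) check per instance
    return proba.argmax(axis=1), confident

preds, confident = cautious_predict(search, X_test)
coverage = confident.mean()                   # rate of non-abstained instances
hc_accuracy = (preds[confident] == np.asarray(y_test)[confident]).mean()
print(f"coverage: {coverage:.2f}, HC accuracy: {hc_accuracy:.2f}")
```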

After training and internal validation, we also evaluated the generalizability and robustness of the developed models by means of an external validation [35]. Specifically, we performed a temporal external validation, through which we evaluated the trained models on a set of data collected at the OGSA institute, as was the internal development set, but in a different time period. The dataset encompassed a total of 1589 individual patient records, collected over the period from January 2021 to October 2023, with the same features as the internal development set. The distribution of features for the external validation dataset is detailed in Table 3. External validation was performed by evaluating the already trained ML models, including the cautious prediction models, on the external validation dataset in terms of the same metrics considered for the internal validation. We also evaluated the similarity between the internal development set and the external validation set in terms of the degree of correspondence \(\Phi\) [35], a comprehensive measure of the differences between the two settings.
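Procedurally, the temporal external validation amounts to applying the frozen models to the later cohort without any re-fitting, as in the following sketch (external_df and the helpers from the previous sketches are assumptions):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# The frozen models are applied to the later cohort (Jan 2021 - Oct 2023)
# without any re-fitting.
X_ext = pd.get_dummies(
    external_df.drop(columns=["sf12_physical_3m"]),
    columns=["sex", "location", "intervention_type", "asa_class"],
)
y_ext = label_improvement(external_df)        # same MCID-based labeling rule
print("external AUC:", roc_auc_score(y_ext, search.predict_proba(X_ext)[:, 1]))
```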

Table 3 Table of Descriptive Statistics for Data Features with P-Value Analysis for the External Validation dataset: This table presents the descriptive statistics for each feature, comparing the external validation and internal development datasets

All software was implemented in Python (v. 3.10.9) using the libraries numpy (v. 1.23.5), scipy (v. 1.9.3), pandas (v. 1.5.2), scikit-learn (v. 1.1.2), imodels (v. 1.4.1), shap (v. 0.41.0), xgboost (v. 1.5.1), matplotlib (v. 3.6.2) and seaborn (v. 0.12.2). The reporting of the methods and results follows the IJMEDI/ChAMAI checklist.

Results

The results of the developed models are detailed in Table 4 and illustrated in Figs. 1, 2 and 3. The FIGS model emerged as the most effective among the considered models: indeed, for all the considered metrics except sensitivity and specificity, the performance of FIGS was not significantly lower than that of the top-ranked model. In particular, FIGS was significantly better than all other models in terms of balanced accuracy, AUC and standardized Net Benefit (sNB). By contrast, XGB was the best model in terms of sensitivity, while HST and LR were the best models in terms of specificity: in both cases, FIGS ranked as the second-best model. When considering the cautious prediction versions of the models, FIGS was likewise among the most effective, ranking among the top models for all considered metrics except sensitivity and achieving the best coverage.

Table 4 The results of the developed Machine Learning (ML) models are presented along with their respective 95% confidence intervals (C.I.)
Fig. 1 ROC Curves comparing the performance of all developed models. This graph provides a visual comparison of the Receiver Operating Characteristic (ROC) curves for each classifier model established in our analysis: Hierarchical Shrinked Trees (HST), Fast Interpretable Greedy Sums (FIGS), Decision Trees (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Logistic Regression (LR), eXtreme Gradient Boosting (XGB), Support Vector Machines (SVM) and Multi-layer Perceptron (MLP). Each line traces the trade-off between sensitivity (true positive rate) and 1-specificity (false positive rate) across different thresholds. The Area Under the ROC Curve (AUC) is provided for each model as a measure of overall performance

Fig. 2 PPV-Sensitivity Curves comparing the performance of all developed models. This graph provides a visual comparison of the PPV-Sensitivity curves for each classifier developed in our analysis: Hierarchical Shrinked Trees (HST), Fast Interpretable Greedy Sums (FIGS), Decision Trees (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Logistic Regression (LR), eXtreme Gradient Boosting (XGB), Support Vector Machines (SVM) and Multi-layer Perceptron (MLP). Each line traces the trade-off between sensitivity (true positive rate) and PPV (positive predictive value) across different thresholds. The Area Under the Precision-Recall (or Positive Predictive Value and Sensitivity) Curve (AUPRC) is provided for each model as a measure of overall performance

Fig. 3 Calibration curves comparing the performance of all developed models. This graph provides a visual comparison of the calibration curves for each classifier developed in our analysis: Hierarchical Shrinked Trees (HST), Fast Interpretable Greedy Sums (FIGS), Decision Trees (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Logistic Regression (LR), eXtreme Gradient Boosting (XGB), Support Vector Machines (SVM) and Multi-layer Perceptron (MLP). Each line represents a model’s ability to estimate the probability of patient improvement after Fast Track (FT) surgery. The closer a curve follows the dashed line (which represents perfect calibration), the more accurately the model predicts the true outcomes. The Brier score is provided for each model as a measure of overall performance, with lower values corresponding to better performance

The results for the best model (FIGS), in terms of both ROC curve (also considering the ROC curve for the corresponding cautious prediction models) and decision curve, are reported in Fig. 4a and b. The FIGS model was uniformly better than the treat-all and treat-none baselines across all probability thresholds. Furthermore, the cautious prediction model based on FIGS improved on the performance of the traditional model across all operating points, and especially so for operating points associated with high specificity (see Fig. 4a).

Fig. 4 ROC curves (a) and Standardized Net Benefit curves (b) for the Fast Interpretable Greedy Sums (FIGS). In the ROC curve diagram, the dashed line represents the cautious prediction model based on FIGS. In the Standardized Net Benefit curve diagram, the blue and red dashed lines represent, respectively, the Treat None and Treat All baselines

To provide an additional form of support, in line with the tenets of XAI, the FIGS model is depicted in Fig. 5. The FIGS model identified the pre-operative SF12 Physical score and the surgical procedure location (knee/hip) as the most predictive features. This finding is also confirmed by the analysis of Shapley values (performed through the SHAP library), shown in Fig. 6, which similarly identified the SF12 Physical score and the procedure location as the most important features, followed by the pre-operative EQ5D and VAS, which were also considered highly predictive in the tree representation shown in Fig. 5.
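As a sketch of how such Shapley values might be computed with the shap library: FIGS is not among the tree models natively supported by shap’s TreeExplainer, so we assume a model-agnostic KernelExplainer over the fitted pipeline’s predicted probabilities (variable names follow the earlier sketches):

```python
import shap

# Model-agnostic explanation of the fitted pipeline's probability for the
# positive class ('Improved'), using a background sample to keep the
# KernelExplainer computation tractable.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(
    lambda X_: search.predict_proba(X_)[:, 1], background)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)        # beeswarm plot, as in Fig. 6
```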

Fig. 5 Fast Interpretable Greedy Sums model. The model is represented as a forest of trees that are combined in additive composition. Given an instance x, the corresponding probability value for the positive class (in our study, Improved, interpreted as a proxy indicator to confirm assignment to Fast Track) is obtained by following a path in each tree, corresponding to the values of the features of x, computing the sum f(x) of the values associated with the leaves and then applying the sigmoid function \(P(y=1 | x) = \sigma \left( f(x)\right) = \frac{1}{1 + e^{-f(x)}}\)

Fig. 6 Feature importance for the Fast Interpretable Greedy Sums model, represented in terms of Shapley values. For each feature, the color denotes the magnitude of the feature’s value: red denotes high values, while blue denotes low values. Values to the right of the black vertical bar denote an increased confidence score for the positive class, while values to the left of the bar denote a correspondingly decreased confidence score

As seen in Tables 1 and 3, the internal development dataset and the external validation dataset differed significantly for most of the continuous features: in particular, the two populations had significantly different distributions in terms of age, SF12 Physical and Mental scores, EQ5D score, height and preoperative hemoglobin. The two populations also differed significantly in terms of the frequency of first interventions and revisions in the knee arthroplasty sub-cohort. The overall similarity between the two datasets was \(\Phi = 0.5\), which, according to the scale defined in [35], corresponds to a moderate level of similarity. The results of the external validation are reported in Table 5. The performance of the FIGS model worsened significantly with respect to balanced accuracy, AUC and sNB; however, for all of these metrics, as well as for accuracy, F1 score and Brier score, the performance of FIGS was not significantly worse than that of the best-performing model, and was always higher than 0.70. In particular, FIGS was the model with the highest balanced accuracy, AUC and sNB: for the latter two metrics, the performance of FIGS was significantly better than that of all other models. In terms of cautious prediction models, no model worsened significantly compared to the internal validation, with the exception of the XGB and MLP models for the HC PPV metric.

Table 5 The results of the developed Machine Learning (ML) models on the external validation dataset are presented along with their respective 95% confidence intervals (C.I.)

Discussion

In recent years, the incidence of hip and knee arthroplasties has steadily increased [39, 40], due to an increasingly aging population. Such treatment procedures, while providing quality-of-life benefits to the treated patients [41], may involve complex rehabilitation and follow-up and can have a significant impact on national health systems. For this reason, FT protocols have become especially pertinent in managing such surgical procedures, reducing hospital stays and associated costs as well as improving patients’ satisfaction and perceived quality of life [20, 42,43,44]. Despite these benefits, however, the criteria for assigning patients to FT are still mostly based on qualitative assessments formulated by the managing clinician, which do not comprehensively take into account patients’ data [45, 46], including PROMs.

Our study has contributed to this field by demonstrating, to our knowledge for the first time in the literature, the effective application of ML to develop second-opinion decision support systems that optimize the assignment of patients to FT surgical protocols for these orthopedic surgeries. As healthcare systems grapple with the demands of an aging population [47], our approach to enhancing decision-making in patient assignment to FT procedures fills a critical gap, providing clinicians with a quantitative tool that helps them validate and optimize the protocol assignment decisions they have formulated for any given patient. To do so, the developed ML models leverage the extensive patient data available, including PROMs (which were identified as being among the most important predictive features, see Figs. 5 and 6), thus addressing a previously underutilized resource in patient care optimization [48].

To more effectively leverage ML models as second-opinion support systems, the core of our contribution lies in the incorporation of controllable AI principles [24], particularly XAI and cautious prediction, into the development of such models, so as to align with the need for accountability and transparency in AI applications in healthcare [49, 50]. Indeed, we showed that interpretable models, and particularly the FIGS model, perform on par with, or even better than, the best black-box model we considered (i.e., Random Forest), indicating that accuracy does not have to be sacrificed for interpretability [51]. Achieving high sensitivity, specificity, and PPV, along with an AUC greater than 80%, this model underscores the viability of ML and PROMs in predicting whether a patient, preliminarily assigned to FT by the managing clinicians, will have favorable post-surgery outcomes. Such an indication serves as a proxy for the actual effectiveness of the protocol assignment decision formulated by the clinician, and can thus be used either to validate this preliminary decision or to notify the doctor that further information should be collected to select the optimal rehabilitation protocol for the given patient. Furthermore, the application of cautious prediction further enhanced the performance of the developed models, showing how equipping ML models with uncertainty quantification and abstention capabilities can make them more accurate while also providing clinicians with an important indication about the reliability of the support they offer. Such an approach not only fosters clinician trust in AI [52] but also ensures that AI supports rather than supplants clinical judgment [53], in full agreement with the second-opinion approach.

Finally, our study’s external validation further testifies to the robustness and generalizability of our models: indeed, while for some metrics the developed models unsurprisingly showed a decrease in performance compared with the internal validation, their error rates remained well within reasonable quality ranges [54]. Interestingly, it was on the external validation that controllable AI approaches best showed their potential: in all cases, the performance of the cautious prediction models did not decrease significantly compared with the internal validation, showing that providing such a form of uncertainty quantification can improve not only reliability and trust, but also generalizability and robustness [52, 55].

Obviously, this study is not without limitations. First, given its retrospective nature, we did not evaluate the effectiveness of the developed ML models in clinical practice: we believe that future prospective studies should evaluate the performance of the developed models when deployed in real-world scenarios [56]. Nonetheless, in this regard, we note that we did not limit our evaluation to an internal validation, but also externally validated the developed models. Such an analysis, while not as informative as a prospective study, provides additional indications about the developed models’ robustness and generalizability [57, 58]. Second, with regard to the external validation just mentioned, we note that our validation procedure considered data coming from the same institute from which the development data were collected. Thus, while we assessed the stability of the developed models under time-related shifts [59], we did not evaluate their transportability to other clinical settings [60]. This is an important consideration [35, 58], as different hospitals may have different criteria for assigning patients to FT or Care-as-Usual protocols, as well as different patient populations. Therefore, we believe that multi-centric validation studies would be particularly relevant for confirming (or disproving) the generalizability of the developed models. Finally, in our study we adopted an approach to ML model development grounded in the principles of controllable AI, with specific reference to providing models that are both explainable and able to provide an indication of their predictive uncertainty. We motivated this design choice by highlighting the importance of controllability for the development of second-opinion support systems [49], and specifically of providing such systems with the ability to detect control loss situations and notify the managing clinician [61]. While we showed that interpretable and cautious models perform on par with, or even better than, traditional and black-box models, we did not perform any user validation aimed at assessing the actual effectiveness of providing such support to clinicians [62]. Although some recent studies suggest that providing domain experts with controllable support could prove more effective both for improving accuracy and for limiting the risk of cognitive biases (e.g., automation bias) [62, 63], the research on this topic is still limited: we therefore believe this to be a particularly relevant direction for future research, both in terms of analyzing the impact of providing cautious prediction support to clinicians and in terms of performing clinical validation of the developed interpretable model (see Fig. 5).

Conclusions

This article has explored the development of ML models in the context of Fast Track surgical procedures, particularly focusing on hip and knee arthroplasties. Our research underscores the increasing relevance of such predictive models in the current healthcare landscape, which is marked by a growing aging population and the consequent rise in demand for efficient and cost-effective surgical management.

Our study demonstrated that ML algorithms can significantly enhance the process of assigning patients to FT protocols. By accurately predicting the improvement in patients’ health status, these models can be used to offer a reliable second-opinion to support clinical decisions. This not only aids in optimizing patient outcomes but also plays a crucial role in reducing the length of hospital stays and associated costs.

Furthermore, our research highlighted the importance of XAI techniques in making these predictive models more transparent and understandable to clinicians. This aspect of controllable AI ensures that the decision-making process remains in the hands of healthcare professionals, thereby enhancing the reliability and ethical integrity of using AI systems in medical settings. We also showed how cautious prediction, another form of controllable AI, can be used to reliably increase the robustness and uncertainty quantification capabilities of predictive models, enabling clinicians to make more accurate and more informed decisions.

Thus, the adoption of ML models in the assignment of patients to FT procedures represents a significant stride towards improving the appropriateness of post-surgical care, one that requires further research and validation studies. Such work would contribute to the broader goal of making healthcare more sustainable, particularly in the face of challenges posed by an aging population and increased demand for medical services. By leveraging predictive analytics, healthcare systems can not only help physicians achieve better patient outcomes but also help them manage resources more effectively, paving the way for a more resilient and responsive healthcare system.

Availability of data and materials

Access to the de-identified data for all the cohorts involved in the study may be made available upon reasonable request to the authors. Access to the computer code used in this research is available upon reasonable request to the authors.

Notes

  1. The MCID for the SF12 Physical score varied according to the primary affected location: 3.68 for knee and 3.80 for hip.

Abbreviations

AUC: Area under the ROC curve
CP: Cautious prediction
DT: Decision tree
EHR: Electronic health records
FIGS: Fast interpretable greedy-tree sums
FT: Fast track
HC: High-confidence
HST: Hierarchical shrinked trees
LR: Logistic regression
MCID: Minimum clinically important difference
ML: Machine learning
MLP: Multi-layer perceptron
NPV: Negative predictive value
OGSA: IRCCS Ospedale Galeazzi - Sant’Ambrogio
PPV: Positive predictive value
PROM: Patient reported outcome measure
RF: Random forest
sNB: standardized Net Benefit
SVM: Support vector machine
XAI: eXplainable AI
XGB: XGBoost

References

  1. Piepkorn MW, Longton GM, Reisch LM, Elder DE, Pepe MS, Kerr KF, et al. Assessment of second-opinion strategies for diagnoses of cutaneous melanocytic lesions. JAMA Netw Open. 2019;2(10):e1912597–e1912597.

  2. Payne VL, Singh H, Meyer AN, Levy L, Harrison D, Graber ML. Patient-initiated second opinions: systematic review of characteristics and impact on diagnosis, treatment, and satisfaction. In: Mayo Clinic Proceedings. vol. 89. Elsevier; 2014. pp. 687–96.

  3. Ferreira GE, Zadro J, Liu C, Harris IA, Maher CG. Second opinions for spinal surgery: a scoping review. BMC Health Serv Res. 2022;22(1):358.

  4. Vashitz G, Davidovitch N, Pliskin JS. Second medical opinions. Harefuah. 2011;150(2):105–10.

  5. Graber M, Gordon R, Franklin N. Reducing diagnostic errors in medicine: what’s the goal? Acad Med. 2002;77(10):981–92.

  6. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56.

  7. Yu KH, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2(10):719–31.

  8. Grote T. Randomised controlled trials in medical AI: Ethical considerations. J Med Ethics. 2022;48(11):899–906.

  9. Grote T, Berens P. How competitors become collaborators-Bridging the gap (s) between machine learning algorithms and clinicians. Bioethics. 2022;36(2):134–42.

  10. Kudina O, de Boer B. Co-designing diagnosis: Towards a responsible integration of Machine Learning decision-support systems in medical diagnostics. J Eval Clin Pract. 2021;27(3):529–36.

  11. Gurusamy R, Subramaniam V. A machine learning approach for MRI brain tumor classification. Comput Mater Continua. 2017;53(2):91–109.

  12. Kovalenko E, Talitckii A, Anikina A, Shcherbak A, Zimniakova O, Semenov M, et al. Distinguishing between Parkinson’s disease and essential tremor through video analytics using machine learning: A pilot study. IEEE Sensors J. 2020;21(10):11916–25.

  13. Cabitza F, Campagner A, Sconfienza LM. Studying human-AI collaboration protocols: the case of the Kasparov’s law in radiological double reading. Health Inf Sci Syst. 2021;9:1–20.

  14. Bennasar C, García I, Gonzalez-Cid Y, Pérez F, Jiménez J. Second Opinion for Non-Surgical Root Canal Treatment Prognosis Using Machine Learning Models. Diagnostics. 2023;13(17):2742.

  15. Rosiński J, Kotlarz P, Rojek I, Mikołajewski D. Machine Learning Classification for a Second Opinion System in the Selection of Assistive Technology in Post-Stroke Patients. Appl Sci. 2023;13(9):5444.

  16. Berg U, Berg M, Rolfson O, Erichsen-Andersson A. Fast-track program of elective joint replacement in hip and knee-patients’ experiences of the clinical pathway and care process. J Orthop Surg Res. 2019;14(1):1–8.

  17. Ansari D, Gianotti L, Schröder J, Andersson R. Fast-track surgery: procedure-specific aspects and future direction. Langenbeck’s Arch Surg. 2013;398:29–37.

  18. de Carvalho Almeida RF, Serra HO, de Oliveira LP. Fast-track versus conventional surgery in relation to time of hospital discharge following total hip arthroplasty: a single-center prospective study. J Orthop Surg Res. 2021;16:1–7.

  19. Kehlet H. Fast-track hip and knee arthroplasty. Lancet. 2013;381(9878):1600–2.

  20. Campagner A, Milella F, Guida S, Bernareggi S, Banfi G, Cabitza F. Assessment of Fast-Track Pathway in Hip and Knee Replacement Surgery by Propensity Score Matching on Patient-Reported Outcomes. Diagnostics. 2023;13(6):1189.

  21. Cabitza F, Locoro A, Banfi G. Machine learning in orthopedics: a literature review. Front Bioeng Biotechnol. 2018;6:75.

  22. Langenberger B. Who will stay a little longer? Predicting length of stay in hip and knee arthroplasty patients using machine learning. Intell Based Med. 2023;8:100111.

  23. Tian CW, Chen XX, Shi L, Zhu HY, Dai GC, Chen H, et al. Machine learning applications for the prediction of extended length of stay in geriatric hip fracture patients. World J Orthop. 2023;14(10):741.

  24. Kieseberg P, Weippl E, Tjoa AM, Cabitza F, Campagner A, Holzinger A. Controllable AI-An Alternative to Trustworthiness in Complex AI Systems? In: International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Springer; 2023. pp. 1–12.

  25. Goebel R, Chander A, Holzinger K, Lecue F, Akata Z, Stumpf S, et al. Explainable AI: the new 42? In: Machine Learning and Knowledge Extraction: Second IFIP TC 5, TC 8/WG 8.4, 8.9, TC 12/WG 12.9 International Cross-Domain Conference, CD-MAKE 2018, Hamburg, Germany, August 27–30, 2018, Proceedings 2. Springer; 2018. pp. 295–303.

  26. Hüllermeier E, Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach Learn. 2021;110(3):457–506.

  27. Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Inf Process Syst. 2022;35:507–20.

  28. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell. 2019;1(5):206–15.

  29. Agarwal A, Tan YS, Ronen O, Singh C, Yu B. Hierarchical Shrinkage: Improving the accuracy and interpretability of tree-based models. In: International Conference on Machine Learning. PMLR; 2022. pp. 111–35.

  30. Tan YS, Singh C, Nasseri K, Agarwal A, Yu B. Fast interpretable greedy-tree sums (FIGS). 2022. arXiv preprint arXiv:220111931.

  31. Zhu Q. On the performance of Matthews correlation coefficient (MCC) for imbalanced dataset. Pattern Recogn Lett. 2020;136:71–80.

  32. Hendrickx K, Perini L, Van der Plas D, Meert W, Davis J. Machine learning with a reject option: A survey. 2021. arXiv preprint arXiv:210711277.

  33. Vovk V, Gammerman A, Shafer G. Algorithmic Learning in a Random World. Cham: Springer International Publishing; 2022.

  34. Campagner A, Cabitza F, Berjano P, Ciucci D. Three-way decision and conformal prediction: Isomorphisms, differences and theoretical properties of cautious learning approaches. Inf Sci. 2021;579:347–67.

  35. Cabitza F, Campagner A, Soares F, de Guadiana-Romualdo LG, Challa F, Sulejmani A, et al. The importance of being external. methodological insights for the external validation of machine learning models in medicine. Comput Methods Prog Biomed. 2021;208:106288.

  36. Riley RD, Debray TP, Collins GS, Archer L, Ensor J, van Smeden M, et al. Minimum sample size for external validation of a clinical prediction model with a binary outcome. Stat Med. 2021;40(19):4230–51.

  37. Brodersen KH, Ong CS, Stephan KE, Buhmann JM. The balanced accuracy and its posterior distribution. In: 2010 20th international conference on pattern recognition. IEEE; 2010. pp. 3121–4.

  38. Bradley AA, Schwartz SS, Hashino T. Sampling uncertainty and confidence intervals for the Brier score and Brier skill score. Weather Forecast. 2008;23(5):992–1006.

  39. Petersen PB, Kehlet H, Jørgensen CC. Improvement in fast-track hip and knee arthroplasty: a prospective multicentre study of 36,935 procedures from 2010 to 2017. Sci Rep. 2020;10(1):21233.

  40. Drosos GI, Kougioumtzis IE, Tottas S, Ververidis A, Chatzipapas C, Tripsianis G, et al. The results of a stepwise implementation of a fast-track program in total hip and knee replacement patients. J Orthop. 2020;21:100–8.

  41. Marsh M, Newman S. Trends and developments in hip and knee arthroplasty technology. J Rehabil Assist Technol Eng. 2021;8:2055668320952043.

  42. Bouman AI, Hemmen B, Evers SM, van de Meent H, Ambergen T, Vos PE, et al. Effects of an integrated ‘fast Track’ Rehabilitation Service for Multi-Trauma Patients: a non-randomized clinical trial in the Netherlands. PLoS One. 2017;12(1):e0170047.

  43. den Hertog A, Gliesche K, Timm J, Mühlbauer B, Zebrowski S. Pathway-controlled fast-track rehabilitation after total knee arthroplasty: a randomized prospective clinical study evaluating the recovery pattern, drug consumption, and length of stay. Arch Orthop Trauma Surg. 2012;132:1153–63.

  44. Maempel J, Clement N, Ballantyne J, Dunstan E. Enhanced recovery programmes after total hip arthroplasty can result in reduced length of hospital stay without compromising functional outcome. Bone Joint J. 2016;98(4):475–82.

  45. Husted H. Fast-track hip and knee arthroplasty: clinical and organizational aspects. Acta Orthopaedica. 2012;83(sup346):1–39.

  46. Jansson MM, Harjumaa M, Puhto AP, Pikkarainen M. Healthcare professionals’ perceived problems in fast-track hip and knee arthroplasty: results of a qualitative interview study. J Orthop Surg Res. 2019;14(1):1–12.

  47. Lin MH, Chou MY, Liang CK, Peng LN, Chen LK. Population aging and its impacts: strategies of the health-care system in Taipei. Ageing Res Rev. 2010;9:S23–7.

  48. Verma D, Bach K, Mork PJ. Application of machine learning methods on patient reported outcome measurements for predicting outcomes: a literature review. In: Informatics. vol. 8. MDPI; 2021. p. 56.

  49. Roy Q, Zhang F, Vogel D. Automation accuracy is good, but high controllability may be better. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM; 2019. pp. 1–8.

  50. Yampolskiy RV. On the Controllability of Artificial Intelligence: An Analysis of Limitations. J Cyber Secur Mobil. 2022;11(3):321–404.

  51. Dziugaite GK, Ben-David S, Roy DM. Enforcing interpretability and its statistical impacts: Trade-offs between accuracy and interpretability. 2020. arXiv preprint arXiv:201013764.

  52. Kanse AS, Kurian NC, Aswani HP, Khan Z, Gann PH, Rane S, et al. Cautious artificial intelligence improves outcomes and trust by flagging outlier cases. JCO Clin Cancer Inform. 2022;6:e2200067.

  53. Shneiderman B. Human-centered artificial intelligence: Three fresh ideas. AIS Trans Hum Comput Interact. 2020;12(3):109–24.

  54. Floares AG. Using computational intelligence to develop intelligent clinical decision support systems. In: International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer; 2009. pp. 266–75.

  55. Kompa B, Snoek J, Beam AL. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digit Med. 2021;4(1):4.

  56. Bin Rafiq R, Modave F, Guha S, Albert MV. Validation methods to promote real-world applicability of machine learning in medicine. In: 2020 3rd International Conference on Digital Medicine and Image Processing. AAAI Press; 2020. pp. 13–9.

  57. König IR, Malley J, Weimar C, Diener HC, Ziegler A. Practical experiences on the necessity of external validation. Stat Med. 2007;26(30):5499–511.

  58. Steyerberg EW, Harrell FE. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245–7.

  59. Youssef A, Pencina M, Thakur A, Zhu T, Clifton D, Shah NH. External validation of AI models in health should be replaced with recurring local validation. Nat Med. 2023;29(11):2686–7.

  60. Degtiar I, Rose S. A review of generalizability and transportability. Ann Rev Stat Appl. 2023;10:501–24.

  61. Cornelissen NAJ, Van Eerdt RJM, Schraffenberger HK, Haselager WFG. Reflection machines: increasing meaningful human control over Decision Support Systems. Ethics Inf Technol. 2022;24(2):19.

  62. Babbar V, Bhatt U, Weller A. On the Utility of Prediction Sets in Human-AI Teams. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. Vienna, Austria: International Joint Conferences on Artificial Intelligence Organization. ACM; 2022. pp. 2457–63.

  63. Schemmer M, Kühl N, Benz C, Satzger G. On the influence of explainable AI on automation bias. 2022. arXiv preprint arXiv:220408859.

Acknowledgements

Not applicable.

About this supplement

This article has been published as part of BMC Medical Informatics and Decision Making Volume 24 Supplement 4, 2024: Selected Articles From The 18th Conference On Computational Intelligence Methods For Bioinformatics & Biostatistics: medical informatics and decision making. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-24-supplement-4.

Funding

This research has been funded by the Italian Ministry of Health through the project “Ricerca Corrente”. The APC was funded by the Italian Ministry of Health - “Ricerca Corrente”. The funding body (Italian Ministry of Health) did not have any role in the conceptualization, design, data collection, analysis, decision to publish, or preparation of the manuscript.

Author information

Contributions

AC: conceptualization, methodology, software, validation, formal analysis, investigation, writing - original draft, visualization. FM: conceptualization, methodology, writing - original draft. GB: conceptualization, writing - review and editing, supervision. FC: conceptualization, writing - review and editing, supervision. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Andrea Campagner.

Ethics declarations

Ethics approval and consent to participate

Approval for the study was given from the IRCCS Ospedale Galeazzi Sant’Ambrogio review board. All patients included in the study signed an informed consent form, authorizing the collection and use of their data, at enrollment time.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Campagner, A., Milella, F., Banfi, G. et al. Second opinion machine learning for fast-track pathway assignment in hip and knee replacement surgery: the use of patient-reported outcome measures. BMC Med Inform Decis Mak 24 (Suppl 4), 203 (2024). https://doi.org/10.1186/s12911-024-02602-3
