
Impact of the Covid-19 pandemic on the performance of machine learning algorithms for predicting perioperative mortality



Machine-learning models are susceptible to external influences, which can result in performance deterioration. The aim of our study was to elucidate the impact of a sudden shift in covariates, like the one caused by the Covid-19 pandemic, on model performance.


After ethical approval and registration at ClinicalTrials.gov (NCT04092933, initial release 17/09/2019), we developed different models for the prediction of perioperative mortality based on preoperative data: one for the pre-pandemic period until March 2020, one including data from before the pandemic and from the first wave until May 2020, and one covering the complete period before and during the pandemic until October 2021. We applied XGBoost as well as a Deep Learning neural network (DL). Performance metrics of each model during the different pandemic phases were determined, and the XGBoost models were analysed for changes in feature importance.


XGBoost and DL provided similar performance on the pre-pandemic data with respect to the area under the receiver operating characteristic curve (AUROC, 0.951 vs. 0.942) and the area under the precision-recall curve (AUPR, 0.144 vs. 0.187). Validation in patient cohorts of the different pandemic waves showed high fluctuations in both AUROC and AUPR for DL, whereas the XGBoost models appeared more stable. Changes in variable frequencies with the onset of the pandemic were visible in age, ASA score, and the higher proportion of emergency operations, among others. Age consistently showed the highest information gain. Models based on pre-pandemic data performed worse during the first pandemic wave (AUROC 0.914 for XGBoost and DL), whereas models augmented with data from the first wave lost performance after the first wave (AUROC 0.907 for XGBoost and 0.747 for DL). The deterioration was also visible in AUPR, which worsened by over 50% for both XGBoost and DL in the first phase after re-training.


A sudden shift in data impacts model performance. Re-training the model with updated data may degrade predictive accuracy if the changes are only transient. Re-training too early should therefore be avoided, and close model surveillance is necessary.



Background

In the spring of 2020, the Covid-19 pandemic rapidly changed clinical routines in our hospitals, with staff redeployment and elective procedures postponed due to increased demand for ICU beds. The surgical spectrum shifted towards emergencies. During the lockdown, the number of trauma cases declined, and some centres reported fewer surgeries outside normal operating hours [1]. Patients who still underwent surgery were, on average, older and sicker than before the pandemic. Such changes in surgical spectrum and patient characteristics can lead to shifts in feature importance and affect the performance of machine learning models [2]. In the medical field, there are few studies on the degradation of predictive models due to evolving data [3]. In general, it is known that models performing well initially can degrade as the data change over time [4]. This so-called "data drift" is gradual in most cases; however, external events can cause a sudden change in feature distribution, i.e., a covariate shift. Another issue is the incidence of the endpoint, which may also be affected if, for example, mortality risk increases [5]. To handle covariate shifts, some researchers suggest that past data should be "forgotten" or down-weighted [4]. The question of whether and at what intervals a model needs to be re-trained is difficult to answer. Many models used in economics are updated automatically at specific time intervals. However, it is not clear whether this approach is also appropriate for models in the clinical setting [6]. This is all the more important as predictive models in surgical medicine have their practical value especially in times of rapidly emerging resource scarcity. An important aspect in this context is, for example, to support responsible and at the same time efficient operating room and intensive care unit (ICU) bed planning based on individual patient risk.
The aim of our study was to analyse the predictive quality of machine-learning models in different time periods of the pandemic and to identify whether re-training with updated data helps to make predictions more reliable.

To address this question, we developed machine learning algorithms to predict perioperative mortality based on preoperatively available data. We did this by creating XGBoost models and Deep Learning neural networks (DL) for three different time periods: one with pre-pandemic data, one with pre-pandemic and first-wave data through May 2020, and one with data from the complete period before and during the pandemic until October 2021. We compared the performance metrics of each model during the different pandemic phases and examined changes in feature importance.

Patients and methods

Patient collective and data

The study to generate the prediction model was approved by the Ethics Committee of the Medical Faculty of the Technical University of Munich (TUM) (253/19 S-SR, 11/06/2019), registered at ClinicalTrials.gov (NCT04092933, initial release 17/09/2019) and conducted at the University Hospital rechts der Isar of TUM. Informed consent was waived due to the retrospective nature of the study in accordance with German legal regulations. The study was performed in conformity with ethical guidelines, the Declaration of Helsinki and recommendations of the German Ethics Council. In accordance with legal data protection requirements, only de-identified data were used.

The study was designed in accordance with the TRIPOD guidelines for reporting predictive model studies [7].

Data from all patients who underwent noncardiac surgery between June 2014 and October 2021 were included in the final analysis. Only the first surgery of each patient was of interest; subsequent surgeries were not considered further. Both elective and urgent procedures were included. Patients admitted to the ICU before the first surgery were excluded, as were patients who had a nonsurgical procedure (e.g., diagnostic) or an outpatient procedure.

The data set was divided into different time periods: patients treated before the Covid-19 pandemic (06/2014 – 03/2020), patients treated during the first pandemic wave (04/2020 – 05/2020), between the first and second pandemic waves (06/2020 – 09/2020), during the second pandemic wave (10/2020 – 05/2021), and after the second pandemic wave (06/2021 – 10/2021). Figure 1 shows the respective time periods in the context of the pandemic; Fig. 2 provides an overview of the patient numbers in each time period.

Fig. 1

Pandemic course. Daily new infections, moving average over 7 days. The defined time periods of our study are colour-coded

Fig. 2

STROBE diagram. The area of the bar of “Source Collective” is proportional to the total number of patients in the given period, and the height is proportional to the number of patients per day

The dataset used included all available preoperative information from the hospital information system (SAP), the laboratory information system (swisslab Lauris) and the anaesthesia patient data management system for the pre-anaesthesia visit (QCare, HIM-Health Information Management GmbH, Bad Homburg, Germany). Data that were not already available in tabular or coded form were structured using a quantity-based search algorithm, and drugs were assigned to their respective anatomical therapeutic chemical (ATC) code and summarized into groups, each with the same first four digits of the ATC code.
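The ATC grouping step can be sketched as follows. This is an illustrative Python snippet with hypothetical drug codes, not the authors' actual pipeline (the study itself was implemented in R):

```python
# Illustrative sketch: group drug entries by the first four characters of
# their ATC code, as described for the medication data.
from collections import defaultdict

def group_by_atc_prefix(drug_atc_codes, prefix_len=4):
    """Map each ATC prefix (e.g. 'C07A') to the list of full codes it covers."""
    groups = defaultdict(list)
    for code in drug_atc_codes:
        groups[code[:prefix_len]].append(code)
    return dict(groups)

# Hypothetical example codes:
codes = ["C07AB02", "C07AB07", "B01AC06", "B01AF01"]
print(group_by_atc_prefix(codes))
# {'C07A': ['C07AB02', 'C07AB07'], 'B01A': ['B01AC06', 'B01AF01']}
```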

Development of the XGBoost model

In total, we had over 12,000 parameters at our disposal, including 9300 surgical codes according to the German operation and procedure codes (OPS) and 780 laboratory values. Parameters such as medical history (241), movements within the hospital (24), medications (199), and preoperative orders from the blood depot (13) were also included in the final models. We did not impute missing values but created dichotomous variables indicating their availability and included this information in the model.
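The missing-value strategy can be sketched in a few lines; this is an assumed minimal Python illustration with a hypothetical patient record, not the study code:

```python
# Instead of imputing, add a dichotomous availability indicator per feature.
def add_missingness_indicators(record, feature_names):
    """Return the record extended with <feature>_available flags (1/0)."""
    out = dict(record)
    for name in feature_names:
        out[f"{name}_available"] = 0 if record.get(name) is None else 1
    return out

patient = {"age": 71, "crp": None}            # hypothetical preoperative record
expanded = add_missingness_indicators(patient, ["age", "crp"])
print(expanded)
# {'age': 71, 'crp': None, 'age_available': 1, 'crp_available': 0}
```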

Models were trained and tested using datasets from three selected time periods. The datasets used for this purpose were split in a 3:1:1 ratio into a training cohort, a test cohort and an internal validation cohort. Randomization into training, test, and validation cohorts was performed such that the frequency of the mortality endpoint was the same in each cohort.
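A stratified 3:1:1 split of this kind can be sketched as follows; the data below are hypothetical (a synthetic cohort with roughly 1% mortality), and the snippet is an illustration in Python rather than the R code used in the study:

```python
# Sketch of a 3:1:1 split stratified on the mortality endpoint, so that each
# cohort keeps the same endpoint frequency.
import random

def stratified_split(indices, labels, ratios=(3, 1, 1), seed=42):
    rng = random.Random(seed)
    cohorts = ([], [], [])
    for label in set(labels):              # split deaths and survivors separately
        group = [i for i in indices if labels[i] == label]
        rng.shuffle(group)
        total = sum(ratios)
        n_train = len(group) * ratios[0] // total
        n_test = len(group) * ratios[1] // total
        cohorts[0].extend(group[:n_train])
        cohorts[1].extend(group[n_train:n_train + n_test])
        cohorts[2].extend(group[n_train + n_test:])
    return cohorts  # training, test, internal validation

labels = [1] * 10 + [0] * 990              # ~1% mortality, as in the study
train, test, valid = stratified_split(list(range(1000)), labels)
print(len(train), len(test), len(valid))   # 600 200 200, each with ~1% deaths
```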

A total of three models were developed:

  1. a model using the pre-pandemic dataset (06/2014 – 03/2020)

  2. a model using the pre-pandemic dataset and that of the first wave (06/2014 – 05/2020)

  3. a model using data from the entire period (06/2014 – 10/2021).

After randomization, the proportion of patients from the first wave in training, testing, and validation cohorts of the second model was 2.0, 2.2, and 1.8%, respectively.

The predictive models were built using Extreme Gradient Boosting (XGBoost), tuning the following hyperparameters: “learning rate,” “minimum loss reduction,” “maximum depth of each tree,” “proportion of features,” “proportion of training samples,” “scale of positive weights,” and “minimum sum of instance weights” [8].

The limits of the hyperparameters were set as follows: learning rate (0.01–0.2), minimum loss reduction (0–6), maximum depth of each tree (3–30 levels), proportion of features (0.5–1), proportion of training samples (0.5–1), scale of positive weights (0.01–10), and minimum sum of instance weights (0–20). Confidence intervals for each prediction were calculated using 100 bootstrap samples. Hyperparameter tuning was performed separately for each model using Bayesian optimization. After setting up the hyperparameter space, 64 runs of parameter optimization were performed. The five best runs yielded very similar AUC values, ranging from 0.9329 to 0.9334, so the search was terminated at this point. Hyperparameter settings are provided in table A1 of supplementary file 1. After training, testing and internal validation, data from the different phases of the pandemic were used as external validation sets to compare the performance of the model in the different time periods. Evaluation plots of the XGBoost models are shown in figure F1 of supplementary file 1.
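The stated search bounds can be written down as a search space; the sketch below uses a simple random draw in place of the Bayesian optimization actually used, and the mapping of the paper's hyperparameter names to typical XGBoost parameter names is an assumption:

```python
# Hedged sketch: the search bounds from the text, with random sampling
# standing in for Bayesian optimization.
import random

SEARCH_SPACE = {
    "learning_rate":    (0.01, 0.2),   # "learning rate" (eta)
    "gamma":            (0.0, 6.0),    # "minimum loss reduction"
    "max_depth":        (3, 30),       # "maximum depth of each tree"
    "colsample_bytree": (0.5, 1.0),    # "proportion of features"
    "subsample":        (0.5, 1.0),    # "proportion of training samples"
    "scale_pos_weight": (0.01, 10.0),  # "scale of positive weights"
    "min_child_weight": (0.0, 20.0),   # "minimum sum of instance weights"
}

def sample_params(rng):
    params = {}
    for name, (low, high) in SEARCH_SPACE.items():
        if name == "max_depth":        # integer-valued parameter
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params

candidate = sample_params(random.Random(0))
assert all(SEARCH_SPACE[k][0] <= v <= SEARCH_SPACE[k][1] for k, v in candidate.items())
```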

Development of the deep learning model

With the same dataset and the approximately 12,000 parameters mentioned above, Deep Learning (DL) neural networks were trained using the H2O framework in the R environment. An exhaustive grid search was performed over common hyperparameters such as learning rate, batch size, number of hidden layers, number of neurons per layer, activation and loss function, regularization and dropout rate. In this way, three DL models were created for the time periods defined above, using the same training, test and validation cohorts as for the XGBoost models.

Statistical analysis

All analyses were performed using R, version 4.2.1 (R Foundation for Statistical Computing, Vienna, Austria). Models were compared based on their area under the receiver operating characteristic (AUROC) and area under precision-recall curve (AUPR) [95% confidence interval]. To further characterize the XGBoost models, a cut-off probability value for mortality was determined on the training sets using the Youden index. Based on this cut-off value, sensitivity, specificity, positive predictive value, and negative predictive value could be determined in each period to compare the performance of the models. Additionally, we calculated feature importance, i.e., the information gain of each feature as well as cover and frequency for each feature used in the XGBoost models.
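The Youden-index cut-off selection described above can be illustrated with a minimal pure-Python sketch (the study used R); the predicted probabilities and outcomes below are hypothetical:

```python
# Pick the probability threshold maximizing sensitivity + specificity - 1
# (the Youden index J).
def youden_cutoff(probs, labels):
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
        fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 0)
        fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold

# Hypothetical predicted mortality probabilities and observed outcomes:
probs = [0.05, 0.10, 0.20, 0.70, 0.90]
labels = [0, 0, 0, 1, 1]
print(youden_cutoff(probs, labels))  # 0.7 separates the two classes perfectly
```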


Results

Patient characteristics and surgical spectrum

Patient characteristics in the different periods are shown in Table 1. The percentage of patients who died ranged from 0.8% before the pandemic to 1.0% during the second pandemic wave. Patients during and after the first and second waves were older than patients before the pandemic. During the pandemic, more patients fell into the American Society of Anaesthesiologists (ASA) 3 and 4 categories and were thus considered more severely ill overall. The number of emergencies was proportionally higher, especially during the first wave. In terms of specialty departments, the proportion of patients in gynaecology/obstetrics and neurosurgery increased during the first wave of the pandemic. The frequency of surgeries performed outside regular operating hours (defined here as 08:00 to 18:00) and on weekends was highest during the first wave. Overall, changes were greatest during the first wave of the pandemic and partially normalized by the end of the study period. Median differences and percentage changes of the individual parameters in the pandemic waves compared with the pre-pandemic period are shown in table A2 of supplementary file 1.

Table 1 Patient characteristics and feature distribution in the respective timeframes

XGBoost vs. deep learning neural network

For model comparison, we calculated both receiver operating characteristic (ROC) and precision-recall (PR) curves for each of the models, as the precision-recall trade-off is a more suitable measure than AUROC for determining model quality on an imbalanced dataset [9].
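The point about imbalance can be made concrete: the random-classifier baseline of a PR curve equals the event prevalence (about 1% here), while the ROC baseline is always 0.5. The sketch below, with hypothetical scores, computes AUROC via its rank interpretation (the probability that a random positive outranks a random negative) alongside the PR baseline:

```python
# AUROC as a rank statistic (Mann-Whitney form), plus the PR-curve baseline.
def auroc(probs, labels):
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1] + [0] * 198                          # ~1% prevalence
probs = [0.85, 0.45] + [(i % 10) / 10 for i in range(198)]
pr_baseline = sum(labels) / len(labels)              # random-classifier precision
print(round(auroc(probs, labels), 3), pr_baseline)   # 0.705 0.01
```

Despite the moderate AUROC, the PR baseline of 0.01 shows how little precision a random classifier would achieve on such data.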

XGBoost and the Deep Learning neural network (DL) show comparable AUROCs on the pre-pandemic data (0.951 [0.941–0.962] vs. 0.942 [0.921–0.962]). The precision-recall trade-off is slightly better for DL. Both pre-pandemic models deteriorate in AUROC as well as AUPR when applied to first-wave data. The XGBoost model improves again in the phases after the first wave and shows stable performance overall, while DL improves, especially in terms of the precision-recall trade-off, but continues to fluctuate in AUROC. Similar results are observed for the second model, trained on pre-pandemic and first-wave data. The XGBoost model trained on the entire dataset performs much better than the DL model. The performances of the XGBoost and DL models are compared in Figs. 3 and 4 as well as in Table 2.

Fig. 3

ROC- (first row) and PR-curves (second row) of the three XGBoost models. The dashed line shows the baseline mortality rate according to the performance of a random classifier

Fig. 4

ROC- (first row) and PR-curves (second row) of the three DL models. The dashed line shows the baseline mortality rate according to the performance of a random classifier

Table 2 Model performance measured by area under the receiver operating characteristic (AUROC-) and area under precision recall (AUPR-) curves [95% CI]

XGBoost feature importance

The XGBoost models, which show higher stability than the DL models in our study, are characterized in more detail below.

In total, of the more than 12,000 possible features, 587 are used in the model from pre-pandemic data, 275 in the model from pre-pandemic and first-wave data, and 923 in the model of the entire period. The most important features of each of the XGBoost models and their percentage share in the prediction are depicted in Fig. 5. In the pre-pandemic phase (model 1), age, number of packed red cells (PRCs) ordered and number of preoperative consults are the top three variables. The model including data from the first wave (model 2) shows an increasing importance of age and number of ordered packed red cells, whereas preoperative C-reactive protein (CRP) displaces the number of preoperative consults. In this model, the top three factors account for approximately 30% of the prediction. Over the entire period (model 3), age, ASA score and number of preoperative consults are most important; however, individual importance decreases, so that only age has an importance greater than 5%. Here, the three most important factors account for only 13% of the prediction. A table showing cover and frequency as well as the gain of all parameters used in the models is provided in supplementary file 2.
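The percentage shares quoted above can be derived by normalizing each feature's gain by the total gain. The sketch below uses hypothetical gain values, not those of our models:

```python
# Turn per-feature gain into the percentage share of prediction importance.
def importance_shares(gains):
    total = sum(gains.values())
    return {name: round(100 * g / total, 1) for name, g in gains.items()}

# Hypothetical gain values for illustration:
gains = {"age": 30.0, "PRCs_ordered": 20.0, "preop_consults": 10.0, "other": 40.0}
shares = importance_shares(gains)
print(shares["age"])  # 30.0 (% of total gain)
```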

Fig. 5

Importance of the top ten features of each model, measured by the average gain of the feature when it is used in trees. PRCs = packed red cells, ASA = American Society of Anaesthesiologists physical status, EVD = external ventricular drain

XGBoost cut-off and performance metrics

Furthermore, we set cut-off values based on the Youden indices of the ROC curves of the three XGBoost models. In the first model, the threshold for predicting the death of a patient was set at a predicted probability of 16.11%; in the second and third models, it was set at 18.05% and 12.96%, respectively. At these thresholds, we determined sensitivity, specificity, positive predictive value, and negative predictive value for each validation period. This consistently showed a poor positive predictive value with acceptable sensitivity and good specificity, while the negative predictive value was consistently high. The specificity of the first model decreased significantly when applied to the first-wave data. The other changes were not significant because of the wide confidence intervals. However, the second model appears to lose sensitivity when applied to data from after the second wave. The cut-off values and metrics are shown in Table 3.
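The four threshold-based metrics reported here follow directly from the confusion-matrix counts at a given cut-off; a minimal Python sketch with hypothetical predictions (not study data):

```python
# Sensitivity, specificity, PPV and NPV at a fixed probability cut-off.
def classification_metrics(probs, labels, cutoff):
    tp = sum(p >= cutoff and y == 1 for p, y in zip(probs, labels))
    fp = sum(p >= cutoff and y == 0 for p, y in zip(probs, labels))
    tn = sum(p < cutoff and y == 0 for p, y in zip(probs, labels))
    fn = sum(p < cutoff and y == 1 for p, y in zip(probs, labels))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

example = classification_metrics(
    probs=[0.9, 0.2, 0.8, 0.1, 0.3],   # hypothetical predicted probabilities
    labels=[1, 0, 0, 0, 1],            # hypothetical outcomes
    cutoff=0.5,
)
print(example["sensitivity"], example["ppv"])  # 0.5 0.5
```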

Table 3 Statistical assessment of different classifiers [95% CI] at the respective cut-off-values for the XGBoost models


Discussion

We developed machine-learning algorithms to predict perioperative mortality based on preoperatively available data. Such models can aid decision-making during periods of scarce resources, as during the Covid-19 pandemic when intensive care beds for non-Covid patients were lacking. This makes the question of how robust such a model is to external influences all the more important. To address this issue, we developed three models with data from different phases before and during the pandemic using an XGBoost algorithm and a Deep Learning neural network. Our results show that the precision-recall trade-off was poor for both XGBoost and DL, which is mostly due to an imbalanced dataset: mortality, the endpoint of the study, is a very rare event with a frequency between 0.8 and 1.0%. AUPR decreases when the pre-pandemic models are applied to first-wave data and recovers over time. The same observation can be made after the first wave when the model trained on pre-pandemic and first-wave data is applied. In this respect, XGBoost and DL behave very similarly. The AUROCs of the XGBoost models are consistently very good, with values > 0.9, while the DL models show strong fluctuations across the individual pandemic phases; both can recover fully or partially after an initial worsening.

To illustrate these changes more concretely, we determined cut-off values for mortality prediction of the XGBoost models based on the Youden index. This shows that the proportion of false positive predictions is quite high, whereas the models perform very well at predicting negatives. Specificity and sensitivity clearly fluctuate across the different pandemic phases.

Gradient boosting methods are among the most commonly used algorithms in the field of perioperative medicine and often show excellent performance [10, 11]. However, there is evidence in the current literature that deep learning methods are superior to XGBoost with respect to AUROC [12], which is why we used both methods. Our results, however, cannot support the superiority hypothesis for DL. Moreover, the DL models in our study showed much more pronounced fluctuations in AUROC than XGBoost, and our results show that the phenomenon of performance degradation under covariate shift is not limited to XGBoost.

It is generally assumed that different models react differently to changes. Overall, logistic regression models appear to be more vulnerable than machine learning algorithms [13]. Davis et al. studied the effects of a case mix shift on predictions and showed that neural networks are relatively robust, while random forests are moderately and most logistic regression models are strongly affected [14]. Unfortunately, XGBoost models, which are among the most frequently used algorithms for predictive models in perioperative medicine [10], were not examined in their study. However, from our data we can conclude that XGBoost as well as DL models may exhibit at least moderate susceptibility to covariate shift.

From our results it can be concluded that the restrictions of the first pandemic wave had a massive impact on model performance, especially considering that the patient population of this period is only a small fraction of the total population. This is caused by a change in patient characteristics and surgical spectrum, with a shift towards urgent and emergency procedures, which produces a covariate shift that affects model quality [15]. However, this problem is not new. For example, it is well known from economics that customer preferences change, making models based on old data inconsistent [16]. A change in the data on which a model is based can occur gradually, in medicine, for example, as examination and treatment methods evolve over time. This can be addressed by removing older data from the dataset or by applying factors so that old data are given less weight. Abrupt changes such as those caused by the Covid-19 pandemic, however, are more difficult to deal with, and although this problem seems obvious and might be clinically relevant, there is little preliminary work on this issue so far.

The pandemic-related changes in the general conditions in our hospitals were manifold: especially in the first phase of the Covid-19 pandemic, the surgical spectrum at our hospitals changed due to the rescheduling of elective procedures and a resulting proportional increase in emergency procedures, with a significant decrease in hospital admissions and outpatient procedures. Case-mix index and mortality rates increased [17]. There is evidence of worsening patient outcomes and a reduction in trauma cases during the first phase of the Covid pandemic [1, 18]. Less obvious changes involve a negative effect on the enrolment of patients in clinical trials [19] as well as a decrease in publications and scientific output of non-infectiology disciplines [20]. As the pandemic went on, conditions returned to normal [21]: the delayed elective procedures that had accumulated, the so-called surgical backlog, had to be performed nonetheless, and the numbers of surgeries normalized again [22].

These manifold dynamics, which have the potential to cause significant covariate shift, are reflected in our data. During the first wave, there were fewer elective cases, patients had higher ASA scores, and the surgical spectrum shifted toward departments that usually perform a greater proportion of urgent procedures, such as obstetrics or neurosurgery. In contrast to other reports, the number of out-of-hours surgeries at our institution increased, a fact that might also have contributed to poorer outcomes [23]. After the second wave, the figures almost returned to the pre-pandemic state. Some changes remained, such as better documentation of presumably important information like the ASA score: anaesthesiologists seemed to attach more importance to the ASA score once the pandemic began and documented it more frequently in the premedication protocol. This higher accuracy of documentation persisted throughout the whole observation period.

Taken together, in the present study we face an abrupt onset of change in the data underlying the model which partially recedes after a period of several months.

Recently, Duckworth and colleagues developed an XGBoost model for prediction of hospital admission from the emergency department and examined data drift caused by the Covid-19 pandemic. In contrast to our study, they found a drop in AUROC after the onset of the pandemic whereas the AUPR increased [2]. However, these changes were caused by a shift in their target variable, admission rate, which increased markedly during the pandemic. In contrast, the target variable in our study, mortality, showed only fluctuations between 0.8 and 1.0%.

The study of Duckworth and colleagues also reports changes in feature importance. As an example, respiration rate rose in importance at the beginning of the lockdown and decreased during the course [2]. In our work we could observe similar phenomena. Regarding feature importance in our XGBoost models, the top variables are mostly the same, only in a different order. This is not surprising: age is an important variable in many models and scores for mortality prediction [24, 25]. The number of preoperative consults reflects a patient's comorbidities, which correlate with mortality just like the ASA score [26], while the number of blood products provided correlates with the severity of surgery. Only the importance of each variable and its place in the ranking of the top variables changes over the different phases of the pandemic.

Whether early re-training improves the predictive quality of a model remains a subject of discussion. There is some evidence that simply re-training the model with new data might not be enough. Lacson and colleagues addressed the question of whether re-training a model with new data is sufficient or whether a newly developed model performs better. They concluded that a completely newly developed model outperforms a model that was simply re-trained with augmented data [27]. We addressed this point by developing three different models for the respective periods, not only updating with augmented data but also performing hyperparameter optimization for each model.

Taken together, we can conclude that the performance of both DL and XGBoost models suffers due to shifts in the data. As a consequence, model performance has to be monitored to detect gradual as well as sudden data drift to regulate model updating cycles [3].
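One possible form such monitoring could take (an illustration, not a method used in this study) is the population stability index (PSI), which compares a reference feature distribution with the current one; values above roughly 0.2 are commonly treated as a marked shift. The binned shares below are hypothetical:

```python
# Population stability index between two binned feature distributions.
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected_fracs, actual_fracs))

# Hypothetical binned shares of e.g. ASA scores before vs during the first wave:
pre_pandemic = [0.30, 0.40, 0.22, 0.08]   # ASA 1-4
first_wave   = [0.18, 0.35, 0.33, 0.14]
print(round(psi(pre_pandemic, first_wave), 3))  # 0.146
```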

Strengths and limitations

A potential weakness is that our models are based on single-centre data with a relatively small number of patients in the first pandemic wave. However, at the onset of the Covid pandemic, patient numbers generally declined due to regulatory restrictions, and to our knowledge, no multicentre-generated models exist on this topic to date.

Furthermore, we chose mortality as a clearly defined endpoint that could easily be determined from routine data. As in-hospital death after surgery is a relatively rare event with a frequency of about 1%, this choice resulted in a highly imbalanced dataset. As a consequence, we obtained consistently good AUROCs but low precision-recall rates, and our models perform very well in predicting survivors at the price of a high false positive rate. However, it is precisely this weakness that illustrates the influence of data drift on performance metrics, causing a drastic decline in the precision-recall trade-off while AUROCs remain almost unaffected, at least for the XGBoost models.

Theoretically, the distribution shift in the data must be taken into account in model building, and appropriate techniques such as covariate shift adaptation should be used. We did not focus on this aspect in our work, because the onset of the Covid pandemic brought sudden unpredictable changes that were difficult to respond to in reality. The true extent of Covid-related changes in the patient and surgical spectrum in our hospitals is only now being analysed and published [28]. Any consideration of adjusting or controlling for the covariates therefore remains necessarily speculative. The fact that conditions in our case largely returned to normal after a few months was also not foreseeable at the beginning of the pandemic.

To date, there are few papers from the medical field that address the problem of sudden covariate shift [2]. Our work is intended to sensitize to this problem and supports the fact that further research is needed in this area.


Conclusions

The present study has shown that a newly developed model with augmented data can perform worse under altered conditions after the initial phase of acute change. XGBoost models and Deep Learning neural networks are both susceptible to covariate shift, although XGBoost seems more stable in the case of sudden changes, at least under the conditions we studied.

These findings tell us that updating a model too early can lead to a noticeable degradation in performance. Therefore, continued monitoring of a model’s predictive ability is necessary even after updating. A viable practical approach might be to use the old and updated models in parallel for a period of time after the update and compare their results. If the changes are only temporary, a model may regain its original predictive power.

Availability of data and materials

The dataset analysed during this study is not publicly available due to legal regulations. To gain access, proposals should be directed to the corresponding author. Requestors will need to sign a data access agreement.



Abbreviations

ASA: American Society of Anaesthesiologists
AUROC: Area under receiver operating characteristic
AUPR: Area under precision-recall curve
CRP: C-reactive protein
ICU: Intensive Care Unit
DL: Deep Learning neural network
PRCs: Packed red cells
XGBoost: Extreme Gradient Boosting


  1. Turley L, Mahon J, Sheehan E. “Out of hours” orthopaedics in an Irish regional trauma unit and the impact of COVID-19. Ir J Med Sci. 2022.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Duckworth C, Chmiel FP, Burns DK, Zlatev ZD, White NM, Daniels TWV, Kiuber M, Boniface MJ. Using explainable machine learning to characterise data drift and detect emergent health risks for emergency department admissions during COVID-19. Sci Rep. 2021;11(1):23017.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Chi S, Tian Y, Wang F, Zhou T, Jin S, Li J. A novel lifelong machine learning-based method to eliminate calibration drift in clinical prediction models. Artif Intell Med. 2022;125:102256.

    Article  PubMed  Google Scholar 

  4. Celik B, Vanschoren J. Adaptation Strategies for Automated Machine Learning on Evolving Data. IEEE Trans Pattern Anal Mach Intell. 2021;43(9):3067–78.

    Article  PubMed  Google Scholar 

  5. Das S: Best Practices for Dealing With Concept Drift [] last Accessed 14 Nov 2022

  6. Kumar S: Should a machine learning model be retrained each time new observations are available? [] last Accessed 14 Nov 2022

  7. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD). Ann Intern Med. 2015;162(10):735–6.

    Article  PubMed  Google Scholar 

  8. Chen TQ, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Kdd’16: Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.

    Chapter  Google Scholar 

  9. Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. Plos One. 2015;10(3):ARTN e0118432.

    Article  CAS  Google Scholar 

  10. Bellini V, Valente M, Bertorelli G, Pifferi B, Craca M, Mordonini M, Lombardo G, Bottani E, Del Rio P, Bognami E. Machine learning in perioperative medicine: a systematic review. J Anesth Analg Crit Care. 2022;2(2):2–13.

    Article  Google Scholar 

  11. Islam MA, Majumder MZH, Hussein MA. Chronic kidney disease prediction based on machine learning algorithms. J Pathol Inform. 2023;14:100189.

    Article  PubMed  PubMed Central  Google Scholar 

  12. Shickel B, Loftus TJ, Ruppert M, Upchurch GR Jr, Ozrazgat-Baslanti T, Rashidi P, Bihorac A. Dynamic predictions of postoperative complications from explainable, uncertainty-aware, and multi-task deep neural networks. Sci Rep. 2023;13(1):1224.

  13. Davis SE, Lasko TA, Chen G, Matheny ME. Calibration Drift Among Regression and Machine Learning Models for Hospital Mortality. AMIA Annu Symp Proc. 2017;2017:625–34.

  14. Davis SE, Lasko TA, Chen G, Siew ED, Matheny ME. Calibration drift in regression and machine learning models for acute kidney injury. J Am Med Inform Assoc. 2017;24(6):1052–61.

  15. Vela D, Sharp A, Zhang R, Nguyen T, Hoang A, Pianykh OS. Temporal quality degradation in AI models. Sci Rep. 2022;12(1):11654.

  16. Tsymbal A. The problem of concept drift: definitions and related work. 2004. Accessed 09 Feb 2023.

  17. Kazakova SV, Baggs J, Parra G, Yusuf H, Romano SD, Ko JY, Harris AM, Wolford H, Rose A, Reddy SC, et al. Declines in the utilization of hospital-based care during COVID-19 pandemic. J Hosp Med. 2022.

  18. Grieco M, Galiffa G, Marcellinaro R, Santoro E, Persiani R, Mancini S, Di Paola M, Santoro R, Stipa F, Crucitti A, et al. Impact of the COVID-19 Pandemic on Enhanced Recovery After Surgery (ERAS) Application and Outcomes: Analysis in the “Lazio Network” Database. World J Surg. 2022;46(10):2288–96.

  19. Pogorzelski D, McKay P, Weaver MJ, Jaeblon T, Hymes RA, Gaski GE, Fraifogl J, Ahn JS, Bzovsky S, Slobogean G, et al. The impact of COVID-19 restrictions on participant enrollment in the PREPARE trial. Contemp Clin Trials Commun. 2022;29:100973.

  20. Wolf M, Landgraeber S, Maass W, Orth P. Impact of Covid-19 on the global orthopaedic research output. Front Surg. 2022;9:962844.

  21. Abdolalizadeh P, Kashkouli MB, Jafarpour S, Rezaei S, Ghanbari S, Akbarian S. Impact of COVID-19 on the patient referral pattern and conversion rate in the university versus private facial plastic surgery centers. Int Ophthalmol. 2022.

  22. Mehta A, Awuah WA, Ng JC, Kundu M, Yarlagadda R, Sen M, Nansubuga EP, Abdul-Rahman T, Hasan MM. Elective surgeries during and after the COVID-19 pandemic: Case burden and physician shortage concerns. Ann Med Surg (Lond). 2022;81:104395.

  23. Bertram A, Hyam D, Hapangama N. Out-of-hours maxillofacial trauma surgery: a risk factor for complications? Int J Oral Maxillofac Surg. 2013;42(2):214–7.

  24. Moll M, Qiao D, Regan EA, Hunninghake GM, Make BJ, Tal-Singer R, McGeachie MJ, Castaldi PJ, San Jose Estepar R, Washko GR, et al. Machine Learning and Prediction of All-Cause Mortality in COPD. Chest. 2020;158(3):952–64.

  25. Le Manach Y, Collins G, Rodseth R, Le Bihan-Benjamin C, Biccard B, Riou B, Devereaux PJ, Landais P. Preoperative Score to Predict Postoperative Mortality (POSPOM): Derivation and Validation. Anesthesiology. 2016;124(3):570–9.

  26. Hackett NJ, De Oliveira GS, Jain UK, Kim JY. ASA class is a reliable independent predictor of medical complications and mortality following surgery. Int J Surg. 2015;18:184–90.

  27. Lacson R, Eskian M, Licaros A, Kapoor N, Khorasani R. Machine Learning Model Drift: Predicting Diagnostic Imaging Follow-Up as a Case Example. J Am Coll Radiol. 2022;19(10):1162–9.

  28. McCoy M, Touchet N, Chapple AG, Cohen-Rosenblum A. Total Joint Arthroplasty Patient Demographics Before and after COVID-19 Elective Surgery Restrictions. Arthroplast Today. 2023:101081.

Acknowledgements

Not applicable.

Funding

Open Access funding enabled and organized by Projekt DEAL. The project was funded by the Central Innovation Program for small and medium-sized enterprises of the German Federal Ministry for Economic Affairs and Energy (ZF4544901TS8) as a joint project between TUM and HIM (Health Information Management GmbH, Bad Homburg, Germany) acting as cooperation partners. Neither HIM nor the Federal Ministry influenced the study design, the analysis, the interpretation of the data or the writing of the report.

Author information

Authors and Affiliations



Contributions

SMK, DIA, BJ, MB and BU designed the study. Data acquisition was performed by MG and BU. BU conducted the statistical analysis. AP assisted in the statistical analysis and created all figures and tables. All authors substantially contributed to the interpretation of the data. DIA and SMK drafted the manuscript. All authors critically revised the submitted material for important intellectual content, approved the submitted version, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The corresponding author has the final responsibility to submit for publication.

Corresponding author

Correspondence to S. M. Kagerbauer.

Ethics declarations

Ethics approval and consent to participate

The study was approved by the institutional ethics committee of the medical faculty of the Technical University of Munich (full name: “Ethikkommission der Technischen Universität München”, postal address: Ismaninger Str. 22, D-81675 Munich, Germany), which waived informed consent due to the retrospective nature of the study. The study reference number is 253/19 S-SR, dated 11/06/2019. The study was performed in accordance with ethical guidelines, the Declaration of Helsinki and the recommendations of the German Ethics Council. In accordance with legal data protection requirements, only de-identified data were used.

Consent for publication

Not applicable.

Competing interests

MG received lecture fees from HIM (Health Information Management GmbH, Bad Homburg, Germany). The other authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table A1.

Hyperparameter settings of the XGBoost models. Table A2. Percentage changes / median differences of features in the different phases of the pandemic with 95% confidence intervals. Figure F1. The evaluation plots show the AUC (top) and logloss (bottom) as a function of the number of iterations.

Additional file 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Andonov, D.I., Ulm, B., Graessner, M. et al. Impact of the Covid-19 pandemic on the performance of machine learning algorithms for predicting perioperative mortality. BMC Med Inform Decis Mak 23, 67 (2023).
