Table 2 Classical performance assessment of multimodal deep learning against benchmark

From: Predicting decompression surgery by applying multimodal deep learning to patients’ structured and unstructured health data

| Prediction | Prevalence | N | Model | Recall | Precision | Balanced accuracy | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|---|---|---|
| Early Surgery | 0.024 | 824 | MDL | 0.300 ± 0.077* | <u>0.086 ± 0.021*</u> | 0.610 ± 0.039* | <u>0.133 ± 0.033*</u> | <u>0.725 ± 0.040*</u> | <u>0.061 ± 0.014*</u> |
| | | | Benchmark | <u>0.375 ± 0.076</u> | 0.069 ± 0.014 | <u>0.624 ± 0.038</u> | 0.116 ± 0.023 | 0.597 ± 0.050 | 0.047 ± 0.011 |
| Late Surgery | 0.049 | 851 | MDL | <u>0.595 ± 0.051*</u> | <u>0.080 ± 0.007*</u> | <u>0.619 ± 0.026*</u> | <u>0.140 ± 0.012*</u> | <u>0.655 ± 0.026*</u> | 0.077 ± 0.009* |
| | | | Benchmark | 0.440 ± 0.056 | 0.076 ± 0.009 | 0.580 ± 0.028 | 0.129 ± 0.016 | 0.635 ± 0.031 | <u>0.079 ± 0.011</u> |

  1. We compared the performance of the MDL architecture against the benchmark (i.e., LASSO). We drew 1000 bootstrap samples from the test set and, for each sample, calculated the performance metrics: recall, precision, balanced accuracy, F1-score, AUC, and AUPRC. We then report the mean and standard deviation across samples. For each prediction task, the model with the best performance on each metric is underlined. Finally, we performed a t-test between the two models' bootstrap distributions of each metric for each prediction task; significance is indicated with an asterisk
  2. AUC, Area Under the Curve; AUPRC, Area Under the Precision-Recall Curve; MDL, Multimodal Deep Learning
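The bootstrap comparison described in note 1 can be sketched roughly as follows. The labels and model scores below are synthetic placeholders (not the study's data), and only AUC is shown for brevity; the same loop would apply to the other metrics.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic test set: prevalence and N loosely mimic the Late Surgery task.
n = 851
y_true = rng.binomial(1, 0.049, n)
# Hypothetical risk scores for two models; positives get a mean shift.
score_mdl = np.clip(y_true * 0.3 + rng.normal(0.3, 0.2, n), 0, 1)
score_bench = np.clip(y_true * 0.2 + rng.normal(0.3, 0.2, n), 0, 1)

def bootstrap_auc(y, s, n_boot=1000):
    """AUC computed on each of n_boot bootstrap resamples of the test set."""
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():  # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y[idx], s[idx]))
    return np.array(aucs)

auc_mdl = bootstrap_auc(y_true, score_mdl)
auc_bench = bootstrap_auc(y_true, score_bench)

# Report mean ± SD across bootstrap samples, as in the table.
print(f"MDL AUC:       {auc_mdl.mean():.3f} ± {auc_mdl.std():.3f}")
print(f"Benchmark AUC: {auc_bench.mean():.3f} ± {auc_bench.std():.3f}")

# t-test between the two bootstrap distributions, as in note 1.
t, p = stats.ttest_ind(auc_mdl, auc_bench)
print(f"t = {t:.2f}, p = {p:.3g}")
```

Resampling the test set with replacement approximates the sampling variability of each metric; the t-test then compares the two resulting distributions, which is how the asterisked significance marks in the table would be produced.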