Table 3 Generalizability performance assessment of multimodal deep learning against benchmark

From: Predicting decompression surgery by applying multimodal deep learning to patients’ structured and unstructured health data

| Prediction | System | Prevalence | N | Model | Recall | Precision | Balanced Accuracy | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|---|---|---|---|
| Early Surgery | Group Health | 0.021 | 239 | MDL | 0.600 ± 0.161* | 0.075 ± 0.020* | 0.720 ± 0.081* | 0.132 ± 0.036* | 0.731 ± 0.109* | 0.105 ± 0.050* |
| | | | | Benchmark | 0.300 ± 0.152 | 0.056 ± 0.028 | 0.595 ± 0.076 | 0.094 ± 0.047 | 0.656 ± 0.113 | 0.149 ± 0.114 |
| | Henry Ford | 0.039 | 324 | MDL | 0.640 ± 0.097* | 0.127 ± 0.021* | 0.732 ± 0.050* | 0.212 ± 0.033* | 0.795 ± 0.047* | 0.128 ± 0.031* |
| | | | | Benchmark | 0.200 ± 0.079 | 0.087 ± 0.034 | 0.557 ± 0.040 | 0.120 ± 0.047 | 0.714 ± 0.050 | 0.088 ± 0.023 |
| Late Surgery | Group Health | 0.079 | 254 | MDL | 0.425 ± 0.079* | 0.143 ± 0.026* | 0.603 ± 0.041* | 0.214 ± 0.038* | 0.630 ± 0.046* | 0.120 ± 0.020 |
| | | | | Benchmark | 0.600 ± 0.080 | 0.109 ± 0.014 | 0.590 ± 0.042 | 0.185 ± 0.024 | 0.641 ± 0.044 | 0.119 ± 0.023 |
| | Henry Ford | 0.042 | 325 | MDL | 0.482 ± 0.099* | 0.085 ± 0.017* | 0.628 ± 0.051* | 0.145 ± 0.029* | 0.700 ± 0.053* | 0.091 ± 0.024* |
| | | | | Benchmark | 0.556 ± 0.096 | 0.112 ± 0.019 | 0.682 ± 0.048 | 0.186 ± 0.031 | 0.707 ± 0.057 | 0.097 ± 0.022 |

  1. We compared the generalizability of the MDL architecture against the benchmark (i.e., LASSO). For each test system, we evaluated model performance on the metrics listed. To estimate the significance of performance differences between models, we drew 1000 bootstrap samples from each test system; for each prediction task and system, we then ran a t-test comparing the two models' bootstrapped samples on each performance metric. An asterisk in the MDL row indicates a significant difference. The model with the best average performance for each metric and system is underlined
  2. AUC, Area Under the Curve; AUPRC, Area Under the Precision-Recall Curve; MDL, Multimodal Deep Learning
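The bootstrap significance procedure described in note 1 can be sketched as follows. This is a minimal illustration on synthetic labels: the function names, the toy data, and the normal-approximation p-value are assumptions for demonstration, not the authors' code or data (only recall is shown; the other metrics would be bootstrapped the same way).

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def recall(y_true, y_pred):
    """Recall = TP / (TP + FN); returns 0.0 if a resample has no positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) if (tp + fn) else 0.0

def bootstrap_metric(y_true, y_pred, metric, n_boot=1000):
    """Evaluate `metric` on n_boot resamples (with replacement) of the test set."""
    n = len(y_true)
    idx = rng.integers(0, n, size=(n_boot, n))
    return np.array([metric(y_true[i], y_pred[i]) for i in idx])

def welch_t_test(a, b):
    """Two-sample Welch t-test; the p-value uses a normal approximation,
    which is adequate here since each sample holds 1000 bootstrap replicates."""
    diff = a.mean() - b.mean()
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    t = diff / se
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))
    return t, p

# Toy test set: true labels plus predictions from a stronger and a weaker
# model (purely illustrative -- not the paper's data).
y = rng.integers(0, 2, 300)
pred_mdl = np.where(rng.random(300) < 0.8, y, 1 - y)    # agrees with truth ~80%
pred_bench = np.where(rng.random(300) < 0.6, y, 1 - y)  # agrees with truth ~60%

scores_mdl = bootstrap_metric(y, pred_mdl, recall)
scores_bench = bootstrap_metric(y, pred_bench, recall)
t, p = welch_t_test(scores_mdl, scores_bench)
print(f"recall MDL={scores_mdl.mean():.3f}  benchmark={scores_bench.mean():.3f}  p={p:.3g}")
```

In the table above, a significant t-test result of this kind is what the asterisks in the MDL rows denote.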