
Table 4 Methods, performance, demographics, evaluation summary for the 19 selected papers [49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67]

From: Prediction and diagnosis of depression using machine learning with electronic health records data: a systematic review

Citation

Outcome Benchmark for Depression

Demographic

Data Source

Data Specifications

Data Sharing

Study Design

Predictors (Note 1)

(Primary, Secondary, Study collected)

  

(Case Control, Case-series, Cross-Sectional, Historical Control,..)

Comorbidities

Demographic

Family History

Obstetric specific

Psychiatric

Smoking

Abar et al. [49]

International Classification of Diseases version 9 (ICD-9) Codes

University of Kentucky (UKY) medical center Electronic Health Records (EHRs). All patient visits during the ten year period 2004–2013. Mixed ethnicity USA assumed, details unspecified

Primary, Secondary

3 million patient visits to the University of Kentucky (UKY) medical center and its affiliated clinics. The dataset has 11,877 unique International Classification of Diseases, Clinical Modification, Version 9 (ICD-9-CM) codes and 1032 unique Cerner Multum™ Lexicon Plus medication codes

Not offered. Source identified

Cohort

 

Geraci et al. [50]

Clinical psychiatrists (2), Diagnostic and Statistical Manual of Mental Disorders-IV (DSM-IV) depression

Aged 12 to 18, 60% female, 40% male. 861 individuals from Centre for Addiction and Mental Health, Toronto, ON M6J1H4. Assumed mixed ethnicity (not specified) representative of Canada

Primary

EHRs format not specified

Not offered. Source identified

Cohort

    

 

Hochman et al. [51]

ICD 9, ICD 10 or antidepressant (WHO Anatomical Therapeutic Chemical [ATC] code N06A) (thus excluding off‐label use)

A nationwide longitudinal cohort that included 214,359 births between January 2008 and December 2015. Israel. Average age 29.4 (SD, 5.4) for training set, 29.8 (SD 5.5) validation set. Mixed Arab (circa 34%) Jewish (circa 66%) ethnicity

Primary, Secondary

Clalit Health Services (CHS) EHR data warehouse. ICD‐9 or ICD‐10 codes recorded in the EHRs

The raw data used for this study will be stored at the Clalit servers and within its firewall, and will be made available upon request under the limitations and requisites of the Clalit regulations and Israeli Privacy Laws

Cohort

(5)

(5)

 

(21)

(4)

(1)

Huang et al. [52]

Patient Health Questionnaire 9 (PHQ-9), ICD-9

Age 18+. EHR data from the Palo Alto Medical Foundation (PAMF): 55.2% female, mixed ethnicity. Group Health Research Institute (GHRI): 70.3% female, mixed ethnicity

Primary

Epic EHR system. International Classification of Diseases, Ninth Revision (ICD-9) diagnosis codes, RxNorm prescription codes, and Current Procedural Terminology (CPT) procedure codes; and unstructured data such as progress notes, pathology reports, radiology reports, and transcription reports. All structured and unstructured data are time-stamped

Not offered. Source identified

Cohort

 

(4)

(1)

 

(7)

 

Jin et al. [53]

PHQ-9, PHQ-8 item 9

Predominantly Latino diabetes patients within a USA public safety net care system: 62% age 45 and older; 68% female

Primary, Secondary

Diabetes-Depression Care-management Adoption Trial (DCAT), a comparative effectiveness study from 2010 to 2013 with three arms: Usual Care (UC) Los Angeles County Department of Health Services (LACDHS), Supported Care (SC), and Technology Care (TC)

Not offered. Source (DCAT) identified

Cohort

    

Kasthurirathne et al. [54]

Physician assessed. ICD-9 and ICD-10 codes

USA sample. Mixed ethnicity. 84,317 adult patients (≥ 18 years of age) with at least 1 primary care visit between the years 2011 and 2016 at Eskenazi Health, Indianapolis, Indiana. Average Age 43.88 (SD 15.60), male 35.09%, White (non-Hispanic) 25.21%, African American (non-Hispanic) 37.23%, Hispanic or Latino 19.47%

Primary

Indiana Network for Patient Care (INPC), structured International Classification of Diseases, ninth revision (ICD-9) and ICD-10 codes. The dataset included a wide array of patient data, including patient demographic, diagnostic, behavioral, and visit data reported in both structured and unstructured form

Not offered. Source (INPC) identified

Cohort

  

 

Koning et al. [55]

Mental Health Problem by WHO International Classification of Primary Care (ICPC) and ATC Code including depression

Patients aged 1–19 years on 31 December 2016 without prior mental health problems. 76 general practice centres in the Leiden area of the Netherlands. Representative of local population

Primary

ELAN primary care network (Extramural Leiden Academic Network) of the Leiden University Medical Centre (LUMC), the Netherlands. Patient data included demographics, consultation dates, symptoms and diagnoses coded according to the WHO International Classification of Primary Care (ICPC), prescribed medication coded according to the Anatomical Therapeutic Chemical (ATC) classification, laboratory test results, and descriptive or coded information of referrals and correspondence with other healthcare professionals

Not offered. Source identified

Cohort

   

Meng et al. [56]

Depression related ICD-9 codes, inclusion of an antidepressant drug in a patient’s medication list, or appearance of an antidepressant drug in clinical notes

EHRs source not specified. Patients selected based on three primary diagnoses: myocardial infarction (MI), breast cancer, and liver cirrhosis. Generally, MI represents the least complexity, cirrhosis the most. Age 68.78 (SD ± 15.46), min 18, max 98. Male 27.46%, female 72.54%

Primary, Secondary

International Classification of Disease, ninth revision (ICD-9) format, procedure codes in Current Procedural Terminology (CPT) format, medication lists, demographic information, and clinical notes

Not offered. Source not identified

Cohorts(3)

 

  

 

Meng et al. [57]

Depression related ICD-9 code, inclusion of an antidepressant drug in a patient's medication list, appearance of an antidepressant drug in clinical notes

EHRs source not specified. Patients selected based on three primary diagnoses: myocardial infarction (MI), breast cancer, and liver cirrhosis. Generally, MI represents the least complexity, cirrhosis the most. Age 68.78 (SD ± 15.46), min 18, max 98. Male 27.46%, female 72.54%

Primary, Secondary

International Classification of Disease, ninth revision (ICD-9) format, procedure codes in Current Procedural Terminology (CPT) format, medication lists, demographic information, and clinical notes

Not offered. Source not identified

Cohorts (3)

 

  

 

Nemesure et al. [58]

DSM-IV

Students. University of Nice Sophia-Antipolis. Ages, under 18 to over 20. Gender and French nationality status

Primary, Study data

CALCIUM database (Consultations Assistés par Logiciel pour les Centres Inter-Universitaire de Médecine) and included information about the students’ lifestyle (living conditions, dietary behavior, physical activity, use of recreational drugs)

Yes. All data, de-identified, were publicly available on Dryad, a nonprofit membership organization that is committed to making data available for research and educational reuse now and into the future. https://datadryad.org/stash/dataset/doi:10.5061/dryad.54qt7

Cohort

 

(5)

  

(8)

(2)

Nichols et al. [59]

National Health Service (UK) Read Codes and British National Formulary (BNF) drug codes

15 to 24 years, representative of UK mixed ethnicity general population, to 2013

Primary

The Health Information Network database (THIN), a large dataset of anonymized electronic medical records extracted from general practices using Vision medical records software. National Health Service Read codes and British National Formulary drug codes

Not offered. Source identified

Cohort

(6)

(1)

(5)

 

(15)

(2)

Półchłopek et al. [60]

ICPC P* code (psychological), T06* ICPC (anorexia, bulimia), 9 ATC values in N05–N07, 21 referral descriptions in Dutch

Electronic Medical Records (EMRs) from 76 general practices in the Leiden area, gathered, concatenated and preliminarily aggregated by a third party, Stichting Informatievoorziening voor Zorg en Onderzoek (STIZON). 27% identified as having Mental Health Problem. Aged 0–19 from the period 2007–2017 (up to and including 31.12.2016)

Primary

Data sourced from the PIPPI project ("Primary care integrated for identification of psychosocial problems in children"), conducted in the Department of Public Health and Primary Care of Leiden University Medical Centre. Symptoms and diagnoses coded with International Classification of Primary Care (ICPC) standard (in Dutch); descriptive symptoms text mined from the notes of general practitioners (in Dutch); all GP encounters, including phone calls and visits; prescriptions coded with Anatomical Therapeutic Chemical (ATC) standard; measurements made by the GP or performed in a laboratory; referrals to specialists (in Dutch)

Not offered. Source identified

Cohort (with target/non target populations for disorders)

  

 

Qiu et al. [61]

Confirmed diagnosis of depression in 2016. Chronic Conditions Data Warehouse (CCW) algorithms by Centers for Medicare and Medicaid Services (CMS)

Subset of 7.2 million patients in a 3-year period, Patients enrolled between 2014 and 2016, with 2,099 variables including the diagnosis, procedure, medication, and health service provider information. Age < 65 years. Female ratio 56.48% (controls), 69.43% (cases). Mixed ethnicity. USA population

Primary, Secondary

MarketScan commercial claims and encounters database owned by IBM (IBM MarketScan® Research Database). 283 CCS (clinical classification software) codes, mapped from both ICD-9-CM and ICD-10-CM (Clinical Modification) codes in the MarketScan database. 242 CCS procedure codes, mapped from ICD-10-PCS (Procedure Coding System), Current Procedural Terminology (CPT) and Healthcare Common Procedure Coding System (HCPCS) codes. Revenue codes, place of service, provider type, and service sub-category code, e.g. magnetic resonance imaging (MRI) and positron emission tomography (PET) scans. 234 drugs and medications as defined in IBM Red Book

Not offered. Source identified

Case/Control

 

   

Sau and Bhakta [62]

Hospital Anxiety and Depression Scale (HADS)

520 geriatric patients attending hospital general Out Patients Department (OPD), Mean (± SD) age was 68.5 (± 4.85) years, 281 (55%) males and 229 (45%) females. Local population ethnicity

Primary, Secondary

Data source was the R.G. Kar Medical College and Hospital, Kolkata, West Bengal, India. Data were collected from 520 geriatric patients who attended the general Out Patients Department of that hospital between January and August 2016. Storage format not specified, hospital data collection

Not offered. Source identified

Case/Control

(6)

(2)

(1)

 

(4)

 

Souza Filho et al. [63]

Diagnostic and Statistical Manual of Mental Disorders-V (DSM V)

971 patients from 20 primary care units in the city of Rio de Janeiro. Mean age 57.67 (± 14.47). 64% male, 36% female

Primary

All data collected were included a posteriori by two blinded and independent researchers in an electronic clinical research form (CRF) database and were stored and managed using Research Electronic Data Capture (REDCap) hosted at Instituto Nacional de Cardiologia. All data were anonymized, as suggested in the General Data Protection Regulation

Not offered, source identified

Cohort

   

Wang et al. [64]

ICD9/10 codes

EHRs from Weill Cornell Medicine and New York-Presbyterian Hospital from 2015 to 2017. Age 33.92 (SD 4.51) in non-PPD group; 34.36 (SD 4.61) in the PPD group. Ethnicities identified included White, Asian, American Indian or Alaska Native, Black or African American

Primary, Secondary

All study data are represented using Observational Medical Outcomes Partnership (OMOP) common data model. All diagnoses were represented as Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) codes. Medication and dosage were standardized by Anatomical Therapeutic Chemical (ATC) Classification System

Not offered. Source identified

Case/Control

 

 

Xu et al. [65]

70 ICD9/10 codes (45.7% ICD9 codes, 54.3% ICD10 codes); RxNorm (USA normalized naming system for generic and branded drugs) codes for drugs

11,275 patients with depression plus the same number of controls from January 2008 to November 2017. Age 18 to > 65. Mean age depressed 62.6 (SD 19.5). Mean age non depressed 63.7 (SD 20.1). Depressed cohort 69.0% female. Non depressed cohort 68.3% female. Race/Ethnicity included: White, Black or African American, Asian, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, Not Hispanic or Latino, Hispanic or Latino

Primary

INSIGHT Clinical Research Network (CRN) database. EHRs of 12 million patients from five large medical centers across New York City: Albert Einstein School of Medicine/Montefiore Medical Center, Columbia University and Weill Cornell Medicine/New York-Presbyterian Hospital, Icahn School of Medicine/Mount Sinai Health System, Clinical Director's Network, and New York University School of Medicine/ Langone Medical Center, 471 federally qualified health centers, safety net clinics, primary care practices, and hospice centers. Multiple comorbidities were also extracted based on the CMS Chronic Conditions Warehouse (CCW). Medication data was mapped to the Anatomical Therapeutic Chemical (ATC) Classification System

Not offered. Source identified

Case/Control

(18)

(4)

 

(17)

(1)

Zhang et al. [66]

ICD-9 Codes

De-identified electronic health records (EHR) data from 10 schools participating in the College Health Surveillance Network (CHSN) from January 1, 2011 through December 31, 2014. The demography of enrolled students (sex, race/ethnicity, age, undergraduate/graduate status) closely matched the demography for the population of 108 Carnegie Research Universities/Very High classification

Primary

The selected 10 schools within the College Health Surveillance Network (CHSN) include 263,947 enrolled students representing all geographic regions of the United States. ICD-9 codes extracted from primary care visits of 213,112 patients

Not offered. Source identified

Case/Control

(2)

  

(1)

(2)

 

            

Citation

Predictors (Note 1)

CEBM Level

Performance Metric

Social/Family

Somatic

Substance/Alcohol abuse

Visit frequency

Word list/text

Other measurements & predictors

Predictors considered (max) Note 2

Oxford Centre for Evidence Based Medicine for diagnosis (1 to 5)

Acc

Prec

Spec

Sens

Abar et al. [49]

  

 > 10,000

3

na

na

na

na

Geraci et al. [50]

    

 

Note 3

4

na

0.77

0.68

0.94

Hochman et al. [51]

  

(1)

  

(2)

156

4

na

na

0.91

0.35

Huang et al. [52]

   

(1)

  

 > 1000

4

na

na

na

na

Jin et al. [53]

 

 

  

29

3

na

na

na

na

Kasthurirathne et al. [54]

 

 

  

1150

3

na

na

76.03—92.18

68.79—83.91

Koning et al. [55]

 

 

100s

4

na

na

na

na

Meng et al. [56]

 

    

 > 1000

4

na

na

na

na

Meng et al. [57]

 

    

 > 1000

4

na

na

na

na

Nemesure et al. [58]

(7)

 

(5)

  

(32)

59

3

na

na

0.66–0.70

0.55—0.66

Nichols et al. [59]

(15)

(8)

(2)

(1)

  

60

3

na

na

na

na

Półchłopek et al. [60]

 

 

 

3240

3

na

na

na

na

Qiu et al. [61]

     

2099

3

na

na

na

na

Sau and Bhakta [62]

(5)

(5)

(1)

  

(1)

20

4

0.91

0.89

0.9

na

Souza Filho et al. [63]

  

34

3

0.89

na

na

0.9

Wang et al. [64]

 

   

98

3

na

na

0.391–0.616

0.867—0.959

Xu et al. [65]

 

(5)

(3)

   

500

3

na

0.61–0.89

na

0.58–0.91

Zhang et al. [67]

 

(21)

   

(2)

1000s

3

0.56–0.58

na

0.40–0.50

0.60–0.70

            

Citation

Performance Metric

Baseline/Comparator

Range (Case / Controls) Training/Test (%/%)

Classifier (s)

Validation

Separate Holdout

Fitting

Code sharing and details

Ethical Approval

Citation

F1

AUC ROC

      

Abar et al. [49]

na

na

Odds Ratio Lower Bound (ORLB)

 > 3 million

Association rule mining (ARM)

None

No

Two-stage pipeline. Stage 1: Reducing the predictor codes into groups, e.g. 11,887 ICD 9 codes reduced to 282 classes. Stage 2: Identify rules then rank the 75,465 rules with depressive disorders and reduce to top 100. For top 100 novelty ratings were assigned on a scale of 1 to 5 (with 5 indicating most novelty) by a practicing psychiatrist. No statement on overfitting

Not stated. Algorithms discussed in main text. Software identified—Linear-time Closed item set Miner (Open Source Data Mining: Frequent Pattern Mining Implementations, OSDM '05. ACM; 2005. LCM Ver.3)

Not stated

Abar et al. [49]

Geraci et al. [50]

na

na

Performance

758 training/103 testing

Deep Learning (DL)

fivefold

Yes

Three-stage pipeline. Stage 1: EHR clinical notes data deidentified and features extracted using NLP. Stage 2: Creation of two DL models, one aimed at identifying those likely to develop depression and those not (to support patient selection for trials). Stage 3: Models combined to form composite model. No statement on overfitting, but hold out set used

Not stated. For deidentification Perl-based software package De-id V.1.1. For machine learning, R language implementation of the H2O.ai package, which includes a multilayer, feedforward deep neural network for the purpose of prediction under a supervised protocol. R programming language (wordnet, RKEA, tm, SDMTools)

Yes, Research Ethics Board-approved

Geraci et al. [50]

Hochman et al. [51]

na

0.71

AUC-ROC (Area Under Curve—Receiver Operating Characteristic)

185,029 (training split 80/20 for testing), 29,330 validation set

XGBoost (Extreme Gradient Boosting)

Yes

Yes

Two-stage pipeline. Stage 1: The main model was fitted using the full set of 156 predictors with initial validation using 20% of training set followed by separate testing on validation data set. A simpler model was also created based on questionnaire derived data. Performance was reported via AUC-ROC, bootstrapping was used to establish 95% confidence intervals. Stage 2: Shapley Additive Explanations (SHAP) was used to show impact of individual features in models. No statement on overfitting, but hold out set used

Not stated. R (R Foundation for Statistical Computing) version 3.4.3 (including the RMS and pROC packages) and Python 3.7.3 (Python Software Foundation)

Not stated

Hochman et al. [51]

Huang et al. [52]

na

0.70–0.80

AUC-ROC

5000 cases and 30,000 matched controls. (80% training, 20% test)

LASSO (Least Absolute Shrinkage and Selection Operator)

None

Yes

Two-stage pipeline. Stage 1: Terms used for defining depression case condition were excluded from the predictors prior to creating model using LASSO. Stage 2: The model is then validated on three test sets created for different cut off points: at time of diagnosis case date, and twelve months prior to that date. The output of the validation being ROC curves. No statement on overfitting, but hold out set used

Not stated. Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression from the R glmnet package

Not stated

Huang et al. [52]

Jin et al. [53]

na

0.73–0.86

AUC-ROC

853 cases (80% training, 20% test)

Poisson Regression

None

Yes

Two-stage pipeline. Stage 1: 20 time varying factors and nine time-invariant factors relating to diabetes as predictors. Estimated effect of each candidate predictor as univariants and obtained p-values. Selected p < 0.05. Stage 2: Models evaluated at baseline, 6 months, 12 and 18 month follow up using ROC. No statement on overfitting, but hold out set used

Not stated. Fixed and random effects for a generalized multilevel model were estimated using quasi-likelihood estimation implemented by the “glmmPQL” function in R package “MASS”. Equations provided in text for generalized multilevel regression model, using the longitudinal dataset from a recent large-scale clinical trial

Not stated

Jin et al. [53]

Kasthurirathne et al. [54]

72–92 (approx.)

0.78—0.94

AUC-ROC

84,317 patients. 90% training 10% testing

Random Forest (RF)

None

Yes

Two-stage pipeline. Stage 1: Using data extracted using NLP techniques and EHR ICD codes creating 5 data vectors (4 patient subgroups and 1 master data vector). Stage 2: These used to train RF models that were then applied to test set to derive AUCROC performance data. No statement on overfitting, but hold out set used

Not stated. Python programming language (version 2.7.6) for all data preprocessing tasks and the Python scikit-learn package for decision model development and testing

Not stated

Kasthurirathne et al. [54]

Koning et al. [55]

na

na

C-statistic (0.62–0.63), odds ratios

19,420 out of 70,000

Logistic Regression (LR), K-Nearest-Neighbours (KNN), Classification and Regression Tree (CART), AdaBoost (AB), Gradient Boosting (GB), Extreme Gradient Boosting (XGB), Random Forests (RF) and Support Vector Machine (SVM)

Bootstrap

No

Three-stage pipeline. Stage 1: Predictor variables derived from EHR data and those with low prevalence (< 1%) eliminated from data set. Data sets split by age group. Stage 2: Logistic regression models trained on each age group dataset. Stage 3: The models were internally validated using bootstrap resampling (500 bootstrap samples) and estimating a shrinkage factor. Brier scores were calculated to assess the average prediction error. No statement on overfitting

Not stated. Analysis and modelling in SPSS (version 23) and R (version 3.5.1)

Yes, Ethics Committee of the Leiden University Medical Centre issued a waiver of consent (G16.018)

Koning et al. [55]

Meng et al. [56]

na

0.76 (PRAUC)

Comparison with RF model via PRAUC performance

10,148 (3,047 developed depression)

Multi-Level Embeddings of diagnoses, procedures, and medication codes with demographic information and Topic modelling (MLET)

None

No

Model trained on combined data set (including Breast cancer, Liver cirrhosis and MI). Results compared for prediction of depression at two weeks, three months, six months and one year prior to depression diagnosis. No statement on overfitting

Yes. The source code and more detailed description of the model is available at https://github.com/lanyexiaosa/brltm. BERT model was implemented in Pytorch 1.4. A visualization tool was identified: https://github.com/jessevig/bertviz

Yes. Patients for this work were identified from EHR in accordance with an Institutional Review Board (IRB) (#14–000204) approved protocol

Meng et al. [56]

Meng et al. [57]

na

0.77 to 0.81

Comparison with established models for varying times for prediction in advance of diagnosis (two weeks, three months, six months, one year). AUC-ROC and precision recall area under the curve (PRAUC)

10,148 (3,047 developed depression) 70% training, 10% validation, and 20% test

Hierarchical Clinical Embeddings combined with Topic modelling, LASSO, SVM, MLP, MiMe, RF, VAE + RF

tenfold

Yes

Three-stage pipeline. Stage 1: ICD codes (9,285, reduced by only using first three digits of code) used to identify features from EHR data followed by extraction of a further 100 features using Latent Dirichlet allocation (LDA) for pre processing clinical notes. Stage 2: Models fitted to EHR/Clinical Notes data. Models created for depression and also for prediction of three comorbidities, breast cancer, liver cirrhosis and myocardial infarction. Stage 3: Models were assessed for predictive value at two weeks, three months, six months and one year prior to case incidence. No statement on overfitting, but hold out set used

Yes. All models created in TensorFlow 1.12. Equations for models provided in text. The source code of HCET is available at https://github.com/lanyexiaosa/hcet

Yes. Patients for this work were (as per Meng et al. [56]) identified from EHRs in accordance with an Institutional Review Board (IRB) (#14–000204) approved protocol

Meng et al. [57]

Nemesure et al. [58]

na

0.67–0.73

AUC-ROC

4184, 70% training (N = 2929) and 30% (N = 1255) held out testing

XGBoost, Random Forest, Support Vector Machine, K-nearest-neighbours and a neural network with Bayesian fine tuning, logistic regression

fivefold

Yes

Two-stage pipeline. Stage 1: Predictions from each classifier are generated using fivefold training on the training data, Stage 2: Predictions from all models are used to train XGBoost classifier on the test data, which consists of predictions made by the six classifiers. No statement on overfitting, but hold out set used

Yes. Code written in Python using sklearn. Vignettes available at https://github.com/mnemesure/MDD_GAD_EHR. Imputation for missing values using a Bayesian Ridge approach. SHAP (Shapley Additive Explanations) scores were used to calculate and visualize feature importance for this complex model

Yes, National Data Protection Authority (CNIL) approved the original study from which data was sourced. This study received institutional exemption from the Committee for the Protection of Human Subjects at Dartmouth College

Nemesure et al. [58]

Nichols et al. [59]

na

0.70–0.72

AUC-ROC

98,562 cases and 281,248 matched controls, 70% training, 30% test

Backward stepwise conditional logistic regression

None

Yes

Predictions from the logistic regression were used to generate ROC curves using test data. No statement on overfitting, but hold out set used

Not stated. STATA was used for statistical analyses and to implement ML models

Yes, Scientific Review Committee on 3 Oct 2014 (SRC Ref: 14–056)

Nichols et al. [59]

Półchłopek et al. [60]

na

0.582 to 0.782

AUC-ROC

92 621 (27% case positive, 63% controls) split by age group (70% for training, 30% test)

Logistic regression, SVM, regression tree, random forest, deep neural network and XGBoost

threefold

Yes

Two-stage pipeline. Stage 1: Patients with insufficient medical history were excluded and case positive patients had medical history excluded after the event and within a fixed time window before it. Then divided into 5 age groups (0–3, 4–7, 8–11, 12–15, 16 +). Stage 2: Models training using training subset and performance evaluated using the test set. For the best performing classifier, XGBoost, variable importance data was calculated. No statement on overfitting, but hold out set used

Yes. Algorithms and sample code provided in appendices, mathematical basis provided in main text

Not stated

Półchłopek et al. [60]

Qiu et al. [61]

na

0.75–0.76

Prediction vs clinical outcome

Case = 254,648/ control = 6,969,972 (training 75%, testing 25%)

Least Absolute Shrinkage and Selection Operator (LASSO) and Random Forest (RF)

None

Yes

Two models were created: the first using a form of penalized regression (LASSO), the second using a decision tree based method (RF). AUCROC was calculated for the models and odds ratios were derived. No statement on overfitting, but hold out set used

Not stated. Details of, e.g., regularization and depth parameter definitions in main text

Not stated

Qiu et al. [61]

Sau and Bhakta [62]

na

na

Performance vs. HADS—independently assessed

48.2% case and 51.8% healthy controls, 520 training set (83%) and 110 test set (17%)

Bayesian Network, Logistic, multilayer perceptron (MLP), Naïve Bayes (NB), random forest (RF), random tree (RT), J48, sequential minimal optimisation (SMO), Random sub-space (RS), and K Star (KS)

tenfold

Yes

Two-stage pipeline. Stage 1: The initial classifiers were subjected to feature selection approaches using machine learning technology in Waikato Environment for Knowledge Analysis (WEKA). Stage 2: Training and testing was done using a ten-fold cross validation method and the classifier with the highest predictive accuracy was then validated against the external (110 instances) data set. No statement on overfitting, but hold out set used

Not stated. Coding system specified, Waikato Environment for Knowledge Analysis (WEKA) (version 3.8.0) (http://www.cs.waikato.ac.nz/ml/weka/documentation.html). Main text in paper describes procedures used

Yes. Ethical clearance from the Institutional Ethics Committee of R.G. Kar Medical College and Hospital, Kolkata, West Bengal, India. Informed consent was taken from every patient before data collection

Sau and Bhakta [62]

Souza Filho et al. [63]

na

0.87

AUC-ROC

971 patients (881 non-depressive and 90 with depression)

Logistic Regression (LR), K-Nearest-Neighbours (KNN), Classification and Regression Tree (CART), AdaBoost (AB), Gradient Boosting (GB), Extreme Gradient Boosting (XGB), Random Forests (RF) and Support Vector Machine (SVM)

tenfold

No

Two-stage pipeline. Stage 1: Synthetic Minority Oversampling Technique (SMOTE) was used to resolve imbalances in the data set. Stage 2: The models were built and cross validation used to determine performance (AUCROC). No statement on overfitting

Not stated. “R” statistical software to perform the randomization for trial. Machine learning implemented in the Python 3 programming language

Yes. The study protocol was approved and monitored by Instituto Nacional de Cardiologia in Brazil. All patients signed informed written consent

Souza Filho et al. [63]

Wang et al. [64]

na

0.69—0.79

AUC-ROC

9980 (769 cases, 9211 controls)

L2-regularized Logistic Regression, Support Vector Machine, Decision Tree, Naïve Bayes, XGBoost, and Random forest

tenfold

No

Two-stage pipeline. Stage 1: To down-select predictors, univariate logistic regression (LR) analyses were used to select those with p-values below 0.05. Stage 2: Models were built using the different classifiers and performance measured as AUCROC by validation. Additionally, odds ratios and variable importance were established to provide interpretable data. No statement on overfitting

Not stated. All machine learning and statistical analyses were performed with R version 3.4.3

Not stated

Wang et al. [64]

Xu et al. [65]

na

0.80–0.87

AUC-ROC

11,275 case / 11,275 control

Logistic Regression (Ridge), Support Vector Machine (SVM), Random Forest (RF), Gradient Boosting Decision Tree (GBDT)

fivefold

No

Two-stage pipeline. Stage 1: Identify 500 features for participants based on selection criteria. Stage 2: Train models and generate performance data. Generate Heatmaps using Clustergrammer. No statement on overfitting

Not stated. For Ridge, RF, SVM, used Scikit-learn software library, for the GBDT, used XGBoost software library, both Python based

Not stated

Xu et al. [65]

Zhang et al. [66]

na

na

Comparison of frequency, pairwise, and M-SEQ representations methods

7322 case/ 205,790 control

SVM, LDA, and RF for models based on frequency, pairwise, and M-SEQ representations

fivefold

No

Two-stage pipeline. Stage 1: To avoid imbalance issues from the unmatched case/control ratio, EasyEnsemble was used to prevent the majority class from dominating the learning process. Stage 2: Frequency, pairwise, and M-SEQ representations used to create SVM, LDA, and RF models. No statement on overfitting

Not stated. Equations and algorithms included and described in text. Software not specified

Not stated, work supported by College Health Surveillance Project

Zhang et al. [66]

         

Not stated. STATA 14 software was used for statistical analyses but it is not clear if this was used to implement ML

Yes. Institutional Review Board at Weill Cornell Medicine (IRB protocol# 1,711,018,789)

Zhang et al. [67]

  1. Note 1: The predictor categories are further described in main text (results section). Where it was practical to obtain or estimate them, numbers in brackets give the predictor count within the category for models. These are indicative only, especially where multiple models were created
  2. Note 2: The total number of predictors used was difficult to determine at a summary level as multiple models used different combinations, in some cases no exact number was provided but a reference to a set of definitions used as a starting point. The number given in the table is the maximum used either as stated or estimated
  3. Note 3: For Geraci et al. [50] the number of predictors/features extracted from EHR text entries is not defined. No estimate has been made
  4. Note 4: In the paper by Półchłopek et al. [60], the use of "*" after the ICPC code, e.g., T06*, indicates all codes under that heading
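
The Acc, Prec, Spec, Sens and F1 columns of the table are the usual confusion-matrix metrics. As a minimal sketch of how they relate to one another, the Python below computes all five from true/false positive and negative counts. The counts are hypothetical and are not taken from any of the reviewed papers, and the names `ConfusionCounts` and `summarize` are illustrative assumptions rather than anything defined in the review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing model predictions with the outcome benchmark."""
    tp: int  # depressed, predicted depressed
    fp: int  # not depressed, predicted depressed
    tn: int  # not depressed, predicted not depressed
    fn: int  # depressed, predicted not depressed


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Return None when the ratio is undefined (zero denominator).
    return numerator / denominator if denominator else None


def summarize(c: ConfusionCounts) -> dict:
    """Compute accuracy, precision, specificity, sensitivity and F1."""
    return {
        "acc": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "prec": _ratio(c.tp, c.tp + c.fp),
        "spec": _ratio(c.tn, c.tn + c.fp),
        "sens": _ratio(c.tp, c.tp + c.fn),
        # F1 is the harmonic mean of precision and sensitivity,
        # which simplifies to 2*TP / (2*TP + FP + FN).
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


# Hypothetical, imbalanced example: 100 cases and 900 controls.
example = ConfusionCounts(tp=90, fp=20, tn=880, fn=10)
metrics = summarize(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced data such as this hypothetical example, accuracy and specificity can look high while precision is noticeably lower, which is one reason the table reports several metrics side by side where the papers provide them.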