
Prediction and diagnosis of depression using machine learning with electronic health records data: a systematic review



Depression is one of the most significant health conditions in terms of personal, social, and economic impact. The aim of this review is to summarize the existing literature in which machine learning methods have been used in combination with Electronic Health Records for the prediction of depression.


Systematic literature searches were conducted within arXiv, PubMed, PsycINFO, Science Direct, SCOPUS and Web of Science electronic databases. Searches were restricted to information published after 2010 (from 1st January 2011 onwards) and were updated prior to the final synthesis of data (27th January 2022).


Following the PRISMA process, the initial 744 studies were reduced to 19 eligible for detailed evaluation. Data extraction identified the machine learning methods used, the types of predictors used, the definition of depression, the classification performance achieved, sample size, and benchmarks used. Area Under the Curve (AUC) values of more than 0.9 were claimed, though the average was around 0.8. Regression methods proved as effective as more sophisticated machine learning techniques.


The categorization, definition, and identification of the numbers of predictors used within models were sometimes difficult to establish. Studies were largely Western, Educated, Industrialised, Rich, and Democratic (WEIRD) in demography.


This review supports the potential use of machine learning techniques with Electronic Health Records for the prediction of depression. All the selected studies used clinically based, though sometimes broad, definitions of depression as their classification criteria. The reported performance of the studies was comparable to or even better than that found in primary care. There are concerns with generalizability and interpretability.



Depression is the most common mental health condition globally, with one-year global prevalence rates ranging from 7 to 21% [1]. Quality of life can be seriously impaired by this disorder, with depression ranking as the second highest cause of Disability-Adjusted Life Years (DALYs) and Years Lived with Disability (YLDs) [2, 3]. Depression is a major contributory factor in suicide, which affects hundreds of thousands of people per year [4, 5]. In addition to the significant personal and social impact of depression, there is a significant economic cost. For example, in 2007 alone, total annual costs of depression in England were £7.5 billion, of which health service costs comprised £1.7 billion and lost earnings £5.8 billion [6, 7]. More recently, in 2019, it was estimated that mental health problems cost the UK £118 billion per year, of which 72% was due to lost productivity and other indirect costs. At 22% prevalence, depression was identified as the third highest contributor to these costs [8, 9].

Depression, like most mental health disorders, can be difficult to diagnose, especially for non-specialist clinicians [10, 11]. Assessment by primary or secondary care clinicians typically relies on the World Health Organisation’s International Classification of Diseases version 10 or 11, ICD-10/11 [12], the Diagnostic and Statistical Manual of Mental Disorders (DSM) [13], or an interview script such as the Composite International Diagnostic Interview (CIDI) [14, 15]. Diagnosis can also be aided by garnering self-reported symptoms in response to standardised questionnaires such as the Hospital Anxiety and Depression Scale (HADS) [16], Beck Depression Inventory (BDI) [17, 18] and Patient Health Questionnaire-9 (PHQ-9) [19, 20]. The PHQ-9 is considered a gold standard [21] for screening rather than standalone clinical diagnosis [22] and has been validated internationally [20]. As such it sets a sound benchmark for sensitivity (e.g., 0.92) and specificity (e.g., 0.78) that is a good comparator for assessing alternative methods [23].

Considering mental health care pathways, early diagnosis could benefit patients by opening the possibility of early interventions. For example, Bohlmeijer et al. [24] observed reduced symptoms of depression for patients who engaged in acceptance and commitment therapy (ACT) as an early intervention compared to those on a wait list, both initially and at a three-month follow-up. Furthermore, a meta-analysis by Davey and McGorry [25] showed a reduction in the incidence of depression by about 20% in the 3 to 24 months following an early intervention. Conversely, late diagnosis of depression can result in longer-term suffering for the patient in terms of symptoms experienced and disorder trajectory, together with increased resource consumption [10, 26].

Recently, attempts to support early medical diagnoses have benefited from a) the growing availability of electronic healthcare records (EHRs) that contain patients’ longitudinal medical histories and b) new advances in predictive modelling and machine learning (ML) approaches. The use of EHRs in primary care in the developed world is well established. For example, in the USA, UK, Netherlands, Australia and New Zealand, take-up in primary care has exceeded 90% [27, 28]. The wide availability of clinical terminology standards such as SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) in the UK [29] is enabling rapid and global implementation of EHR systems and their use for disorder surveillance [30]. For example, ML techniques with EHR data have led to predictive models for cardiovascular conditions [31, 32] and diabetes [33]. These studies have led to cardiovascular risk prediction becoming established in routine clinical care, and the UK QRISK versions 2 and 3 show significant improvements in discrimination performance over the Framingham Risk Score and atherosclerotic cardiovascular disease (ASCVD) score methods [34] that preceded them.

Many of the recent advances were facilitated by the growing popularity of ML in medical data science. As a subfield of artificial intelligence (AI), ML allows computers to be trained on data to identify patterns and make predictions. This approach is well suited to developing algorithms that predict the likelihood of a patient having a disorder by analysing large volumes of medical data. Once trained, these algorithms can then be tested on new data to assess their performance outside of the training environment. There are a variety of ML techniques, but the two most common are supervised and unsupervised methods. In supervised learning, the data are labelled with the desired outcome. In unsupervised learning, the data are not labelled, and the algorithms look for patterns within the data without external guidance.
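The supervised approach described above can be made concrete with a minimal sketch (all data, predictor names and values below are invented for illustration): a classifier is fitted to labelled records and then asked to label a record it has not seen.

```python
# Minimal illustration of supervised learning: a nearest-centroid
# classifier trained on labelled (predictors, label) pairs.
# All numbers and labels are invented for illustration.

def train(examples):
    """Compute a per-class centroid from labelled training data."""
    sums, counts = {}, {}
    for features, label in examples:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical training data: (symptom count, visit frequency) pairs.
training = [([1.0, 2.0], "control"), ([2.0, 1.0], "control"),
            ([6.0, 7.0], "depressed"), ([7.0, 6.0], "depressed")]
model = train(training)
print(predict(model, [6.5, 6.0]))  # label assigned to an unseen record
```

An unsupervised method, by contrast, would receive the same feature vectors without the "control"/"depressed" labels and would have to discover the two groupings itself.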
Further information on these methods in relation to mental health and EHRs is provided in Cho et al. [35] and Wu et al. [36], but here we note that many existing applications combine unsupervised and supervised methods to train algorithms on datasets with large numbers of predictors. A scoping review by Shatte et al. [37] on the general use of ML in mental health identified the use of ML with EHRs for identifying depression as a research area. Similarly, Cho et al. [35] included depression amongst the conditions they identified in their “Review of Machine Learning Algorithms for Diagnosing Mental Illness”. In the examples they cite, which are also covered in the results of this systematic review, ML algorithms were trained on EHR data that included a variety of symptoms and conditions. These algorithms were then assessed on their ability to distinguish between those who did and did not have clinical depression.

If EHR/ML methods are to be considered, a suitable benchmark comparator is needed. Studies assessing the diagnosis of depression in primary care suggest that approximately half of all cases are missed at first consultation, improving to around two thirds being diagnosed at follow-up [38,39,40]. This would be a useful minimum comparator for any diagnostic system based on a combination of ML and EHR data. There exists the potential to develop predictive models of depression using EHR/ML applications, and it is necessary to critically evaluate models developed in recent years. This is particularly important in the context of rapidly developing ML techniques and the growing accessibility and richness of EHR health data. Our starting point for this systematic review was, “Is there a case for using EHRs with machine learning to predict/diagnose depression?” From this we derived the objectives to identify and evaluate studies that have used such techniques.
As part of the evaluation, we specifically focus on identifying key features of the data and ML methods used. Accordingly, our primary focus is to provide a comprehensive overview of the types of ML models and techniques used by researchers, as well as types of data on which these models were trained, how the models were validated and, where done, how they were then tested. By summarizing the data used, identifying and summarising predictors used, describing diagnostic benchmarks, and outlining what types of validation and testing approaches were used, our review offers an important source of information for those who wish to build on existing efforts to improve predictive accuracy of such models.


Search strategy and search terms

Systematic literature searches were conducted within the arXiv, PubMed, PsycINFO, Science Direct, SCOPUS and Web of Science electronic databases. Searches were restricted to information published after 2010 (from 1st January 2011 onwards) and were updated prior to the final synthesis of data on 27th January 2022. Initial searches were made based on titles/key words (where the latter were available) and papers were selected based on the inclusion criteria summarised in Table 1. These were searched as (#1) AND (#2) AND (#3) AND (#4). These papers were evaluated by reading the abstract, and then by evaluating the main body of each manuscript. Next, a backward citation search for all the selected papers was completed, both a) as a quality check to see if other selected papers were included and b) to identify any missing papers. The last search step was a forward search pass in which papers that cited the selected papers were identified, again to find any missed papers. The same time period and inclusion/exclusion criteria were applied to these additional searches. The initial searches together with the primary assessment for inclusion were conducted by DN. 10% of the searches were sampled by LW. The inclusion/exclusion results for the selected papers were audited by LW, and joint discussions were held to resolve any issues. In the event of this not being possible, CT would have been involved as the final arbiter.

Table 1 Search terms for study identification

This systematic review was prospectively registered with the PROSPERO international database of systematic reviews (#CRD42021269270) [41].

Inclusion/exclusion criteria

Table 2 shows the inclusion and exclusion criteria that were adopted to define the publications that came within the scope of the review.

Table 2 Inclusion/exclusion criteria

Data extraction

Data extraction was informed by requirements detailed in: ‘Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD)’ [42]; ‘Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies: The CHARMS Checklist’ [43]; and ‘Protocol for a systematic review on the methodological and reporting quality of prediction model studies using machine learning techniques’ [44]. Table 3 details the data extraction categories. Primary data extraction was conducted by DN; this was then validated by LW.

Table 3 Data extraction summary

Quality of studies

The Oxford Centre for Evidence-Based Medicine (OCEBM) system [45], previously used by Bernert et al. [46] in a systematic review of artificial intelligence and suicide prevention, was used to assess quality, as many of the models were developed and evaluated in a clinical setting and so merit a level of formal assessment. This ranked the evidence on a scale of 1 (highest) to 5 (lowest). The results were added to the data extraction table. OCEBM is designed to provide a hierarchy of levels of evidence for researchers and clinicians whose time is limited; it is well established and widely used. For further information, see Howick et al. as reported in [47].


The search protocol, together with the numbers of studies identified, selected, assessed, and included/excluded, is presented in Fig. 1, in a format compatible with the PRISMA standard [48].

Fig. 1

PRISMA flow diagram with results for systematic review study selection [48]. Note: reasons for excluding full-text articles (for example, relating to disorder focus, scope, data sources, specially selected cohorts, or disorder trajectory rather than diagnosis) are included in the supplementary data, Table S1


A total of 744 research papers were identified in the first stage of the literature search (711 after duplicates were removed). Screening the content of abstracts and, subsequently, the main body of each article reduced the sample to 18 eligible articles. The backward citation search of the selected papers identified 22 papers (including duplicates) that were rejected, 10 that were already in the original selection, and two (duplicates of each other) that were added to the selection, resulting in one additional paper (giving 19 in total). The forward citation search did not produce additional papers at the time of the review.

Review articles are not included in the final total but were used for supporting research and were recorded.

Selected studies overview

This review summarised studies that use ML methods to train, validate, and test ML models for predicting depression based on individual-level EHR data from primary care (11 studies) and from a combination of primary and secondary care (8 studies). Table 4 summarizes the key features of each study. We now turn to a detailed overview of each of the components described in Table 4.

Table 4 Methods, performance, demographics, evaluation summary for the 19 selected papers [49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67]

Depression definition

The definition of depression and the method of its classification varied across the studies in this review. A combination of depression diagnosis definitions based on NHS Read codes [68], SNOMED (Systematized Nomenclature For Medicine) [29] codes, ICD [12] or DSM [13] based assessments and/or the prescription of antidepressants (ADs) was used in 16 of the 19 studies. Only one study, by Xu et al. [65], used antidepressant prescription alone as a case definition. Three other studies relied on the use of a validated questionnaire such as the PHQ-9 [69] or HADS [16].


Predictors

Here we report aspects of the predictors, including their definition, how we grouped them, and their frequency of use.


Most predictors were derived from a combination of variables present in the EHR databases (e.g., SNOMED/NHS Read codes and/or prescription of a drug, in a similar way to the definition used for depression) and were typically categorical. In some cases, additional parameters specifying a time frame for the predictor were also available. Some predictors were defined by identifying components through pre-processing of clinical notes/other textual information. A few studies used non-categorical predictors, such as physiological measurements (for example, Body Mass Index (BMI), blood pressure, and cholesterol). This was usually where participants were receiving some form of secondary care, such as during pregnancy for postpartum depression (PPD) prediction.


No formal method for grouping predictors was evident in the studies and, due to the large number of diverse predictors used in different papers, for clarity these were organised into the following groups: comorbidity, demographic, family history, other (e.g., blood pressure), psychiatric, smoking, social/family, somatic, obstetric specific, substance/alcohol abuse, visit frequency and word list/text. Due to this flexibility in definition, there are overlaps between studies concerning which category a predictor might fall into; for example, a blood test may be in “other” or “obstetric specific”. Table 5 shows the predictor groups and commentary on their content.

Table 5 Grouping of predictors from the studies

Figure 2 indicates frequency of predictor use across the selected studies.

Fig. 2

The approximate number of studies using different groups of predictors. Note 1: Some papers used multiple categories of predictors and not all categorised them. Note 2: The total number of predictors used was difficult to determine at a summary level, as multiple models used different combinations; in some cases no exact number was provided, only a reference to a set of definitions used as a starting point


The studies in this review used data sets from EHR systems, insurance claims databases and health service (primary and secondary) providers. As such, they store, organise, and define data in a variety of ways that are not expected to be consistent with each other. Most of these data are categorical in nature, though some predictors, such as blood pressure, are usually continuous variables within a range. In this section we report how each of the studies dealt with missing or erroneous data and potential sources of bias. We also report whether the authors made their data and/or code publicly available.

Missing or erroneous data

Missing data related either to missing patients or to missing predictor data. In both cases it may not be possible to know that the data are missing. For missing patients, Koning et al. [55] excluded patients whose records did not identify gender or had no postcode registered. Huang et al. [52] removed entries where patients had less than 1.5 years of visit history. Wang et al. [64] excluded from the analysis PPD patients for whom there was no third-trimester data.

With regard to missing predictor data, Nemesure et al. [58] estimated that, for their data set, missing values were present in 5% of the data overall and for 20 out of the 59 predictors they used. In some studies, missing data led to exclusion of cases from the analysis. In Nichols et al. [59], missing smoking status was used to infer non-smoking, on the basis that smoking status was less likely to be missed for smokers/those with smoking-related disorders. Missing data also led to the exclusion of predictors. Again in Nichols et al. [59], the authors did not use ethnicity as it was missing for over 63% of patients. Similarly, Zhang et al. [67] excluded ethnicity from their USA dataset for the same reasons. Many studies (e.g., Koning et al. [55], Meng et al. [57], Nichols et al. [59]) raised concerns that errors in predictor data could affect the performance, generalizability, and reliability of the models. Errors and missing data were identified as being due to misclassification, measurement errors, data entry and bias, all of which can be difficult to identify and/or correct in EHR data, as noted by Wu et al. [36]. Other studies varied in the strategies used for dealing with missing data. Common approaches were to estimate the value of a missing point or simply acknowledge that remedial action was not available. Nemesure et al. [58] used an imputation approach for their numerical data, such as blood pressure. Where remedial action is not possible, the patient might be excluded from the study, e.g. Hochman et al. [51].
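Imputation of the kind described for numerical predictors can be sketched as follows. This is a deliberately minimal mean-imputation example with invented values; the studies themselves may have used more sophisticated imputation methods.

```python
# Minimal sketch of mean imputation for a numerical predictor
# (e.g., systolic blood pressure). None marks a missing value.
def impute_mean(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

systolic = [120.0, None, 140.0, 130.0, None]  # invented readings
print(impute_mean(systolic))  # missing entries filled with the mean, 130.0
```

Mean imputation preserves sample size at the cost of shrinking the predictor's variance, which is one reason studies instead sometimes excluded the affected cases or predictors.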

Sources of bias

Many of the studies (12), for instance Hochman et al. [51], Huang et al. [52] and Koning et al. [55], raised the question of data bias due to cohort selection or collection processes, such as diagnosis, data interpretation and system input. Other studies (12) recognised sources of bias impacting accuracy and generalizability. Jin et al. [53] identified that, as the population in their study was mainly Hispanic and there was incompleteness of comorbidity predictor data (e.g., for diabetes), both performance and generalizability would be affected. Zhang et al. [67] acknowledged that sourcing their data from an urban academic medical centre could result in limited generalizability of their findings. Hochman et al. [51] suggested that their use of an exclusion criterion removing severely depressed patients based on the prescription of specific drugs could also create bias. Zhang et al. [66] chose to exclude ethnicity from their models due to coding inconsistencies and errors, making bias in that area a potential issue. Huang et al. [52] defined depression based solely on antidepressant usage and suggested their sample would be skewed towards the more severely depressed because the sample excluded those whose condition was treated with only psychotherapy or those without any treatment. A similar concern, regarding changing definitions for the detection of depression during their study period, was expressed by Xu et al. [65]. At a broader level, the majority of the studies were from “WEIRD” (Western, Educated, Industrialised, Rich, Democratic) countries, with most (15) from the USA. The remainder were from countries with highly developed IT and healthcare industries such as Brazil, Israel, and India.

Data sharing

The nature of the data, data protection and requirements for anonymity, and privacy issues limited access to source data, though details of the sources themselves were more often made available (e.g., Hochman et al. [51], Nichols et al. [59]).


In this review, we identified a wide array of statistical techniques used on EHR data (see Table 4). Many different types of supervised ML were used for classification of depression versus control; regression models (13 studies), Random Forest (8 studies), XGBoost (8 studies) and SVM (7 studies) were the most common techniques. Use of multiple techniques in a single paper was also common; for instance, Xu et al. [65] and Zhang et al. [66] used four or more methods. Geraci et al. [50] was the only study to use a deep neural network-based deep learning approach as the primary component of their model. Figure 3 summarises the methods used in the selected studies.

Fig. 3

Machine Learning/Artificial Intelligence methods for pre-processing and modelling (note: LR variants add up to 11). Abbreviations: ARM, Association Rule Mining; BRTLM, Bidirectional Representation Learning model with a Transformer architecture on Multimodal EHR; DNN/ANN, Deep Neural Network/Artificial Neural Network; KNN, K Nearest Neighbours; LASSO, Least Absolute Shrinkage and Selection Operator; LR, Logistic Regression; MLP, Multilayer Perceptron; M SEQ, multiple-input multiple-output Sequence; NB, Naïve Bayes; SVM, Support Vector Machine; XGBoost, eXtreme Gradient Boosting

Temporal sequence was referred to in two studies [49, 60], though other studies refer to the time between predictors and diagnosis (e.g., Meng et al. [56]). In other studies, patterns of predictors were used to determine their predictive probabilities of depression, sometimes using time constraints, such as a primary care visit “within the last twelve months”, or specifically including time-distant events such as birth trauma (Koning et al. [55], Nichols et al. [59]). Only one study, Półchłopek et al. [60], implemented temporal sequence in the EHRs, whereby the order of presentation of symptoms was considered, though Abar et al. [49] speculated that temporal sequence might be used to improve performance by taking causal sequence into consideration.

Most studies (17 out of 19) validated their models, most commonly (12) by splitting data into a training and a testing set. Cross-validation data sets for model testing were also used (11 out of 19). Generally, testing and validation were carried out by the same team that created the models; only Sau and Bhakta [62] had diagnostic accuracy checked by an independent team. Only one study, Zhang et al. [67], used a separate data set for testing rather than splitting the original data set.
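The two validation strategies reported, a hold-out train/test split and k-fold cross-validation, reduce at the index level to the following sketch (illustrative only; real studies typically shuffle records and stratify by outcome before splitting):

```python
# Sketch of the two validation strategies reported:
# a hold-out train/test split and k-fold cross-validation
# (index bookkeeping only; no model fitting shown).
def train_test_split_indices(n, test_fraction=0.2):
    """Deterministic split: the last test_fraction of records is held out.
    Real studies shuffle first; omitted here for clarity."""
    cut = int(n * (1 - test_fraction))
    return list(range(cut)), list(range(cut, n))

def k_fold_indices(n, k):
    """Yield (train, test) index lists for each of k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

train_idx, test_idx = train_test_split_indices(10)
print(train_idx, test_idx)      # 8 training records, 2 held out
for train, test in k_fold_indices(6, 3):
    print(sorted(train), test)  # each record sits in exactly one test fold
```

Under cross-validation every record is used for testing exactly once, which is why it gives a less optimistic performance estimate than evaluating on the training data, though neither substitutes for the fully external test set used by Zhang et al.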

Code sharing

Code was made available by the majority (12) of studies. In some cases, just the details of the packages that implemented the ML algorithm were provided; for example, Jin et al. [53] reference the R package MASS rather than providing the complete code.


Several performance metrics were used to evaluate ML models of depression. Among those, researchers reported confusion matrices; the area under the receiver operating characteristic curve (AUC-ROC); and Odds Ratios/Variable Importance for predictors.

Confusion matrix counts (True Positives, True Negatives, False Positives and False Negatives) were used in sixteen of the studies, usually in conjunction with other measures, particularly AUC-ROC. Many performance metrics are derived from this information, including accuracy, F1, sensitivity, specificity, and precision. Sensitivity (also known as recall) and specificity were commonly reported, possibly because they give information relating to the discriminative performance of the model and are well understood by practitioners [70].
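These derived metrics follow directly from the four counts. As a sketch, the counts below are invented, but chosen so the result reproduces the benchmark sensitivity (0.92) and specificity (0.78) quoted for the PHQ-9 in the Background:

```python
# Metrics derived from confusion-matrix counts (all counts invented).
def metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)           # recall / true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    precision = tp / (tp + fp)             # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}

# Counts chosen to give sensitivity 0.92 and specificity 0.78.
m = metrics(tp=46, fp=22, tn=78, fn=4)
print({k: round(v, 2) for k, v in m.items()})
```

The example also illustrates why accuracy alone is uninformative: with a different case/control balance the same sensitivity and specificity would yield a quite different accuracy figure.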

For sensitivity, reported values ranged from 0.35 (Hochman et al. [51]) to 0.94 (Geraci et al. [50]). For specificity, reported values ranged from 0.39 (Wang et al. [64]) to 0.91 (Hochman et al. [51]). Sensitivity was usually higher than specificity across the models, the exceptions being Hochman et al. [51], who reported a high specificity of 0.91 with a low sensitivity of 0.35 using a gradient boosted decision tree algorithm, and Nemesure et al. [58], who reported a specificity of 0.7 and a sensitivity of 0.55. The highest accuracy, at 0.91, was reported by Sau and Bhakta [62] and the lowest was 0.56 (Zhang et al. [67]). This metric only gives a broad overall picture of correctly predicted results vs. all predictions made and gives no indication of the more useful true/false positive rates; it was presented in only six studies.

For the studies that reported performance in terms of the AUC-ROC metric (14), the low extreme for any model was 0.55, specifically from a benchmark model predicting depression in the 12–15 years age group (Półchłopek et al. [60]). The highest AUC-ROC score was 0.94 (Zhang et al. [67], Kasthurirathne et al. [71]). Excluding these extremes, reported AUC-ROC values ranged from 0.70 to 0.90. The average AUC-ROC value was 0.78 with a standard deviation of 0.07. Figure 4 shows the average AUC values achieved in each study.

Fig. 4

Average AUC performance across studies reporting them (mean AUC = 0.78, standard deviation = 0.07)
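AUC-ROC has a simple probabilistic reading that the sketch below makes explicit (all scores invented): it is the probability that a randomly chosen case receives a higher model score than a randomly chosen control, so 0.5 corresponds to chance and 1.0 to perfect discrimination.

```python
# AUC-ROC as the probability that a randomly chosen case is scored
# higher than a randomly chosen control (ties count as half a win).
def auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Invented model scores for depressed cases and for controls.
cases = [0.9, 0.8, 0.6, 0.55]
controls = [0.7, 0.5, 0.4, 0.3]
print(auc(cases, controls))  # 0.5 is chance; 1.0 is perfect ranking
```

Unlike sensitivity and specificity, this quantity does not depend on choosing a particular decision threshold, which is one reason it was the most commonly reported metric across the included studies.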

Generalizability and interpretability

Generalizability was mentioned in 14 studies, for example Jin et al. [53] and Zhang et al. [67]. The points already illustrated under “Sources of bias” (for example, demographically specific participants, factors relating to missing data, and the granularity of data, such as only having social deprivation data at practice level) have negative consequences for generalizability.

Interpretability was identified as a concern in only three studies (Koning et al. [55], Nemesure et al. [58], Meng et al. [56]). For interpretability, Nemesure et al. [58] used SHAP (Shapley Additive Explanations) scores, which offer a decision chart and other visualisations for model predictors [72]. None of the included studies provided visualisations other than AUC-ROC diagrams and bar charts; as such, interpretability was not significantly addressed in the selected studies.
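SHAP scores are grounded in the game-theoretic Shapley value: a predictor's contribution to a prediction, averaged over every order in which predictors could be added to the model. The sketch below (invented weights and predictor names, a toy additive risk model) enumerates all orderings exactly; real SHAP libraries approximate this efficiently for large models, so this is an illustration of the principle rather than of any study's implementation.

```python
# Exact Shapley values for a toy model, by enumerating all orderings
# in which predictors can be added. All weights are invented.
from itertools import permutations

def shapley(features, value):
    """Average marginal contribution of each feature over all orderings."""
    contrib = {f: 0.0 for f in features}
    perms = list(permutations(features))
    for order in perms:
        present = set()
        for f in order:
            before = value(frozenset(present))
            present.add(f)
            contrib[f] += value(frozenset(present)) - before
    return {f: c / len(perms) for f, c in contrib.items()}

# Toy additive risk model: each present predictor adds a fixed amount.
weights = {"anxiety": 0.30, "chronic_pain": 0.15, "low_mood": 0.40}
def risk(present):
    return sum(weights[f] for f in present)

phi = shapley(list(weights), risk)
print(phi)  # in this additive model each Shapley value equals its weight
```

For a purely additive model the Shapley value of each predictor is simply its weight; the attraction of the method is that the same averaging scheme yields sensible per-predictor attributions for non-additive models such as gradient boosted trees.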

Quality of studies

All the included studies achieved a score of 3 (11 studies) or 4 (8 studies) on the OCEBM hierarchy of levels of evidence (a 1 to 5 scale, from highest to lowest), as far as the criteria could be applied to the selected studies, i.e., areas relating to diagnostic tests only (no interventions). This represents a moderate level of performance. Overall, the studies had large sample sizes, were usually case series or cohort trials, and applied a clinically recognised benchmark; had there been randomized trials, studies could have been promoted to level 2.

Only three studies referenced the use of a formal assessment method such as TRIPOD [42], suggesting that following standards is not yet widespread or that the frameworks are not yet sufficiently established or appropriate. This lack of consistent reporting is a limitation, and the use of standardised frameworks should become the expectation rather than the exception.


In this review we have identified three areas of interest as key components to consider for predictive models of depression built on ML with EHR data: generalizability (can the model be reused with, e.g., different populations), interpretability (is the model’s information readily understandable to its users), and performance (does the model meet the needs, e.g. in AUC-ROC terms, of the purpose for which it is intended). All three would need careful evaluation before moving from research to a clinical application environment.


Generalizability

This is a significant consideration for medical ML applications: whilst models may work well in their development and testing environments, this does not guarantee that they will work in a new context [73, 74]. To be widely deployed clinically, the models in the studies would need to be generalizable, i.e., able to work reliably outside of their development environment. Kelly et al. [73] identified the ability to deal with new populations as one prerequisite for clinical success. Areas identified in the studies that could impact generalizability included demographics, sources of bias, inclusion/exclusion criteria, missing/incomplete data, and the definitions of depression and of predictors. All of these were identified in the included studies; for instance, Jin et al. [53] identified that Hispanic participants were highly represented in their data, and Zhang et al. [66] excluded ethnicity from their models.

As noted in the Performance sub-section of the Results, the ML method itself did not seem to be overly critical for outcome performance using the EHR data sets in the included studies and it is provisionally suggested that the method itself may be more generalizable than the data to which it is fitted.

Another area that can limit generalizability is the wide variety of EHR data. This varies depending on the source, for example insurance-derived data, a state health service such as the NHS, or a proprietary standard such as SNOMED. The coding may, or may not, incorporate a recognised medical standard such as the ICD [12] or DSM [13], amongst others found in the included studies. Although not derived from the studies directly, it was noted that individual EHR systems are proprietary in nature and there is no universally accepted extant standard detailing how data should be categorised, stored, and organised for them. There are organisations developing and promoting standards and gaining accreditation for them, for example Health Level Seven International [75] with ANSI (American National Standards Institute) [76]. However, none of these are globally adopted, and the only broadly accepted standard, E1384, was withdrawn in 2017 [77]. Lack of standardisation is currently a barrier to the portability of individual applications. Consequently, it is likely that models are data-source specific to a greater or lesser extent. Further work needs to consider how this can be addressed.

The studies in this review differed in how depression was defined and in the range of predictors selected and their definitions. As mentioned, a commonly used approach was a combination of EHR data entry codes covering diagnoses together with the prescription of an antidepressant. This can result in too many cases being classified as depressed, because antidepressants are used for a wider range of conditions. Similar issues apply to the definition of predictors. In combination, this restricts the generalizability of any models produced.

Another factor for generalization is the robustness of the models and their replicability. None of the studies included replication of their results, and only Sau and Bhakta [62] used an independent team for the verification of results, though the majority employed recognised validation techniques and 12 used a separate hold-out data set. This last point is also relevant to establishing whether models have been overfitted to their data; this possibility was not reported in any of the studies, despite overfitting being a known and serious potential issue for ML models in general. Reducing bias, together with independent validation and testing, is recommended for future work involving the prediction of depression using ML with EHRs.


Interpretability was identified as a concern in only a few studies. However, clinical practitioners may wish to know the explanation for an ML algorithm’s predicted diagnosis so they can fit it into a broader diagnostic picture, rather than treating it as a “black box” as described by Cadario et al. [78]. Similarly, Vellido [79] and Stiglic et al. [80] considered that interpretability and visualisation are important for the effective implementation of medical ML applications. This may be as simple as listing the specific predictors that contributed to the outcome, for example anxiety, low mood, or chronic pain. Of the included studies, Nemesure et al. [58] used SHAP (Shapley Additive Explanations) scores, which have been used in clinical applications [81], to aid interpretability, again by identifying the most important predictors. Techniques such as SHAP and LIME (Local Interpretable Model-agnostic Explanations) [82] offer visualisations which may be more intuitive and provide more easily digested information. However, none of the other studies provided visualisations other than AUC-ROC diagrams and bar charts of predictors. That said, there is a long-standing, unsettled debate regarding interpretability going back to the 1950s: providing interpretive data to support a practitioner, as opposed to a “black box” approach where the diagnosis made by the application is simply accepted, can lead to lower diagnostic performance overall [83, 84]. We recommend that future studies not only develop predictive models but also trial their use, for example with primary practitioners, support staff and/or patients, offering different forms of interpretable/black-box output and assessing acceptability. This need not be done initially in a clinical setting, but can be piloted and demonstrated in prototype form in a controlled environment, assessed using a combination of qualitative and quantitative methods, e.g., surveys, interviews, focus groups and panels, prior to moving to clinical trials.
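As an illustration of how SHAP-style attributions identify the most important predictors, the sketch below computes exact Shapley values for a toy three-predictor risk score. The model, predictor names, and weights are invented for illustration and are not drawn from any reviewed study; real SHAP tooling approximates this computation for large models.

```python
from itertools import permutations

# Exact Shapley values for a toy risk score over three hypothetical
# binary predictors; each predictor's value is its average marginal
# contribution across all orderings in which it could be added.
FEATURES = ["anxiety", "low_mood", "chronic_pain"]

def risk(present):
    """Toy model: additive weights plus an interaction bonus when
    anxiety and low mood co-occur. Entirely invented for illustration."""
    score = 0.0
    if "anxiety" in present:
        score += 0.3
    if "low_mood" in present:
        score += 0.4
    if "chronic_pain" in present:
        score += 0.1
    if {"anxiety", "low_mood"} <= present:
        score += 0.2
    return score

def shapley_values(model, features):
    """Average each feature's marginal contribution over all orderings."""
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feat in order:
            before = model(present)
            present.add(feat)
            contrib[feat] += model(present) - before
    return {f: v / len(orderings) for f, v in contrib.items()}

phi = shapley_values(risk, FEATURES)
print(phi)
# Attributions sum to risk(all features) - risk(none), here 1.0.
assert abs(sum(phi.values()) - risk(set(FEATURES))) < 1e-9
```

Note how the interaction term (0.2) is split equally between anxiety and low mood, so their attributions exceed their standalone weights.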


Here we consider what may be limiting the performance of the models with respect to their intended use as a means of identifying depression. One limiting factor in the included studies relates to the definition of depression itself and the predictors used. Defining depression accurately is critical because this definition is used to train the ML application, a point raised by Meng et al. [57]. In the studies reviewed here, a combination of diagnostic and drug codes within the EHRs was typically used. Using prescription of antidepressants as part of the definition may misidentify too many cases, a point made in the selected studies by, for example, Qiu et al. [61] and Nichols et al. [59]: antidepressants are prescribed for other conditions including anxiety [85, 86], chronic pain [87, 88], obsessive compulsive disorder [89, 90], post-traumatic stress disorder [91, 92] and inflammatory bowel disease [93]. Of the included papers, Xu et al. [65] suggested that under-identification of depression cases could also occur for patients receiving treatment via private care or an alternate service provider.
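To make the over-inclusion problem concrete, here is a minimal sketch of an EHR case definition. The record layout, code lists, and 180-day window are hypothetical, not taken from any included study; requiring a diagnostic code and a prescription to co-occur is one way to narrow the net cast by antidepressant prescriptions alone.

```python
from datetime import date, timedelta

# Hypothetical code lists for illustration only.
DEPRESSION_DX_CODES = {"F32", "F33"}   # e.g., ICD-10 depressive episodes
ANTIDEPRESSANT_CODES = {"N06A"}        # e.g., ATC antidepressant class

def is_depression_case(records, window_days=180):
    """Label a patient a depression case only if a depressive-disorder
    diagnosis and an antidepressant prescription occur within
    `window_days` of each other.

    `records` is a list of (date, kind, code) tuples, kind "dx" or "rx".
    A prescription alone (e.g., for chronic pain) does not qualify.
    """
    dx_dates = [d for d, kind, c in records
                if kind == "dx" and c.split(".")[0] in DEPRESSION_DX_CODES]
    rx_dates = [d for d, kind, c in records
                if kind == "rx" and c[:4] in ANTIDEPRESSANT_CODES]
    window = timedelta(days=window_days)
    return any(abs(dx - rx) <= window for dx in dx_dates for rx in rx_dates)

patient = [
    (date(2020, 3, 1), "dx", "F32.1"),
    (date(2020, 3, 10), "rx", "N06AB06"),
]
print(is_depression_case(patient))  # True: diagnosis and prescription co-occur
```

A patient with only the prescription record would be labelled negative under this definition, which is exactly the case the combined criterion is meant to exclude.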

The prevalence of predictors can be artificially boosted: as suggested by Koning et al. [55] and Nichols et al. [59], primary care physicians who suspect a patient has depression may be more likely to record a precursor or comorbidity, for example other mental health conditions such as low mood or anxiety. There is strong evidence that family history of depression, alcohol, drug, physical and sexual abuse, and comorbidity with other mental health conditions are strong predictors of depression [94,95,96,97]. However, such data appear to be under-recorded, leading to the removal of important predictors due to low prevalence; Nichols et al. [59], for example, removed family history data because of its low prevalence (< 0.02%). This would be expected to have a negative impact on performance. Identifying consistent and valid definitions for depression and any predictors used is a necessity.

The studies in this review reported a mean AUC-ROC of 0.78 with a standard deviation of 0.07 (Fig. 2). This compares well with primary care, where up to half of depression cases are missed at the baseline consultation, improving to around two thirds diagnosed at follow-up [38, 40]. An earlier paper by Sartorius et al. [98] reported that only 39.1% of cases of ICD-10 current depression were identified by primary care practitioners. Based on the studies, we identified potential areas that might support improvements in model performance. A key issue is over/under-diagnosis: as mentioned in our background section, early diagnosis and thus intervention can benefit depression outcomes [25, 99]. However, there is a broader argument that over-diagnosis (i.e., false positives) potentially wastes resources or stigmatises patients.
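For reference, AUC-ROC has a direct probabilistic reading: it is the chance that a randomly chosen case receives a higher model score than a randomly chosen non-case (ties count half). A minimal pure-Python illustration with made-up scores:

```python
def auc_roc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney U statistic divided by n_pos * n_neg:
    the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores for depressed / non-depressed patients.
pos = [0.9, 0.8, 0.55, 0.4]
neg = [0.7, 0.5, 0.3, 0.2]
print(auc_roc(pos, neg))  # 0.8125: 13 of 16 pairs ranked correctly
```

An AUC of 0.78, as found across the reviewed studies, therefore means roughly four out of five case/non-case pairs would be ranked in the right order.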

Although some studies suggested that more sophisticated techniques should improve performance, we noted that simpler methods such as logistic regression were often comparable to more complex ones such as random forest and XGBoost (e.g., Zhang et al. [67]). Christodoulou et al. [100] echoed this conclusion in their systematic review of clinical prediction using ML, where logistic regression performed similarly to ML models such as artificial neural networks, decision trees, random forests, and support vector machines (SVMs). Geraci et al. [50] employed a deep neural network (deep learning) as their main modelling technique and Nemesure et al. [58] used one as a component in a larger ensemble model; however, neither demonstrated performance benefits from its use. Even if higher performance could be obtained using deep learning, small amounts of noise or small errors in the data can cause misclassification and hence significant reliability issues [101, 102]. The use of more sophisticated techniques to improve performance is not supported by this review.
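To illustrate why logistic regression is such a strong baseline, the sketch below trains one from scratch by batch gradient descent on a small synthetic data set. The two "predictors" and the generating model are invented for illustration; nothing here reproduces any reviewed study's model.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic patients: the label depends noisily on two made-up
# predictors (think prior-anxiety score and consultation frequency).
data = []
for _ in range(400):
    x1, x2 = random.random(), random.random()
    y = 1 if sigmoid(3 * x1 + 2 * x2 - 2.5) > random.random() else 0
    data.append(((x1, x2), y))

# Batch gradient descent on the log-loss.
w = [0.0, 0.0]
b = 0.0
lr = 0.5
for _ in range(200):
    gw, gb = [0.0, 0.0], 0.0
    for (x1, x2), y in data:
        err = sigmoid(w[0] * x1 + w[1] * x2 + b) - y
        gw[0] += err * x1
        gw[1] += err * x2
        gb += err
    w[0] -= lr * gw[0] / len(data)
    w[1] -= lr * gw[1] / len(data)
    b -= lr * gb / len(data)

acc = sum((sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5) == (y == 1)
          for (x1, x2), y in data) / len(data)
print(f"training accuracy: {acc:.2f}")
```

A handful of lines, no tuning, and fully inspectable coefficients; this transparency is part of why the reviewed comparisons so often favour keeping the simpler model.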

How else might performance be improved? The use of non-anonymised data sourced from within a primary or secondary care facility, something more achievable in a clinical than a research setting, could be beneficial. For example, in the Nichols et al. [59] study social deprivation indices were only available at a regional/practice level, and inspection of their model suggests that social deprivation had little impact on prediction of depression. This is inconsistent with expectation: Ridley et al. [103] showed a link between increased social deprivation and the probability of developing depression. Having this data at an individual level might be expected to increase model performance, though this is likely only achievable in a clinical trial of an application. Alternatively, the use of synthetically generated EHR data [104, 105] removes the patient confidentiality and related ethical constraints that come with real data and would allow all aspects of a model to be evaluated as fully as with non-anonymised patient data.

Another approach is to use more time-related information in predictive models; EHRs typically time-stamp entries, so it is known when a predictor is activated. Półchłopek et al. [60] considered temporal sequence in EHRs. They were concerned that techniques such as support vector machines and random forests identify predictors that affect the outcome but not the effect of sequence on that outcome. They examined the improvement obtainable by using temporal patterns in addition to non-time-specific predictors and noted a small positive effect. Abar et al. [49] also speculated that temporal sequence might be used to improve model performance. Suitable techniques exist: for example, time series methods such as Gaussian processes, which can cope with the sparse nature of EHR data [106], have been used to make predictions for patients with heart conditions. We recommend exploring the use of more time-dependent factors in building predictive ML models for depression.
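One simple way to exploit time stamps, in the spirit of the temporal patterns explored by Półchłopek et al. [60], is to derive time-aware predictors (recency, recent counts, event order) from the raw event history. The event layout and code names below are hypothetical:

```python
from datetime import date

# Hypothetical timestamped EHR events for one patient: (date, code).
events = [
    (date(2019, 1, 5), "anxiety"),
    (date(2019, 6, 2), "insomnia"),
    (date(2020, 2, 20), "anxiety"),
    (date(2020, 3, 1), "chronic_pain"),
]

def temporal_features(events, index_date):
    """Turn an event history into simple time-aware predictors:
    ever/last-year counts, days since the most recent occurrence of
    each code, and whether anxiety preceded insomnia (an invented
    sequence predictor)."""
    feats = {}
    for when, code in events:
        days_ago = (index_date - when).days
        feats[f"{code}_ever"] = 1
        if days_ago <= 365:
            feats[f"{code}_last_year"] = feats.get(f"{code}_last_year", 0) + 1
        key = f"{code}_days_since"
        feats[key] = min(days_ago, feats.get(key, days_ago))
    ordered = [c for _, c in sorted(events)]
    feats["anxiety_then_insomnia"] = int(
        "anxiety" in ordered and "insomnia" in ordered
        and ordered.index("anxiety") < ordered.index("insomnia"))
    return feats

f = temporal_features(events, date(2020, 6, 1))
print(f["anxiety_last_year"], f["anxiety_then_insomnia"])  # 1 1
```

Features like these can be fed to any of the standard classifiers used in the reviewed studies, letting order and recency influence the prediction without changing the modelling technique itself.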

Although missing data is more of a concern for generalizability, some studies identified it as an opportunity to improve performance. Kasthurirathne et al. [54] noted that missing EHR data can reduce model performance and suggested that this could be mitigated by merging with other data sources, for example related insurance claims. Nichols et al. [59] used missing smoking data as a predictor, and it had a positive effect in their model. Missing data is potentially significant in itself and is an opportunity for further study.

Strengths and limitations

As far as we are aware, this is the first systematic review focussed on the use of EHRs to predict depression using ML methods. The choice of journal databases and the date range covered by the searches means that the studies identified provide a sound basis for comparison. The data extraction protocol was informed by established standards [42,43,44] to best identify the data needed to support meaningful and repeatable analyses.

A limitation of this study is that the inclusion criteria focused on study titles and keywords, which may have led to some ML studies using EHRs being missed. This was mitigated using backwards and forwards citation searches. Additionally, the variety of study designs, including case-control, cohort, and longitudinal studies, precluded the use of some of the more traditional quality assessment tools; we did, however, as stated in the methods, use OCEBM, which has been used in previous ML systematic reviews. The categorization, definition, and identification of the numbers of predictors used within models was sometimes difficult to establish, limiting the scope of this information as presented. It is also likely that the included studies are culturally specific, as they focused on “WEIRD” populations.


In conducting this systematic review, we have shown that there is a body of work supporting the potential use of ML techniques with EHRs for the prediction of depression. This approach can deliver performance comparable to, or better than, that found in primary care. There is clear scope for improvement, both in the adoption of standards for conducting and reporting the research and in the data itself. The development of an accepted global standard for EHRs would improve generalizability and portability. This would involve greater promotion and development of standards for research, such as TRIPOD [42], and for data interchange, such as Health Level Seven International [75], together with their further development to support ML/EHR applications. Future work should pay more attention to generalizability and interpretability, both of which need to be addressed prior to trialling implementation in the clinic. It is also worth investigating areas where performance can be improved, for example by including temporal sequence within the models, better selection of predictors, and the use of non-anonymised or synthetic data. Our review suggests depression prediction using ML/EHRs is a worthwhile area for future development.

Availability of data and materials

All data generated or analysed during this study are included in this published article [and its supplementary information files].



AUC-ROC: Area Under Curve – Receiver Operating Characteristic

ANN: Artificial Neural Network

ARM: Association Rule Mining

BRLTM: Bidirectional Representation Learning model with a Transformer architecture on Multimodal EHR

DNN: Deep Neural Network

EHR: Electronic Health Records

KNN: K Nearest Neighbours

LASSO: Least Absolute Shrinkage and Selection Operator

LR: Logistic Regression

MLP: Multilayer Perceptron

M-SEQ: Multiple-input Multiple-output Sequence

NB: Naïve Bayes

NLP: Natural Language Processing

SVM: Support Vector Machine

XGBoost: eXtreme Gradient Boosting


  1. Lim GY, Tam WW, Lu Y, Ho CS, Zhang MW, Ho RC. Prevalence of depression in the community from 30 countries between 1994 and 2014. Sci Rep. 2018;8(1):2861.

  2. Vigo D, Thornicroft G, Atun R. Estimating the true global burden of mental illness. Lancet Psychiatry. 2016;3(2):171–8.

  3. Ferrari AJ, Charlson FJ, Norman RE, Patten SB, Freedman G, Murray CJL, et al. Burden of depressive disorders by country, sex, age, and year: findings from the global burden of disease study 2010. PLOS Med. 2013;10(11): e1001547.

  4. Chesney E, Goodwin GM, Fazel S. Risks of all-cause and suicide mortality in mental disorders: a meta-review. World Psychiatry. 2014;13(2):153–60.

  5. Organization WH. Depression and other common mental disorders: global health estimates. 2017; Available from: Cited 11 Nov 2022

  6. McCrone P, Dhanasiri S, Patel A, Knapp M, Lawton-Smith S. Paying the price: the cost of mental health care in England to 2026. The King’s Fund; 2008. Available from: Cited 29 Nov 2021

  7. Fineberg NA, Haddad PM, Carpenter L, Gannon B, Sharpe R, Young AH, et al. The size, burden and cost of disorders of the brain in the UK. J Psychopharmacol (Oxf). 2013;27(9):761–70.

  8. McDaid D, Park AL. The economic case for investing in the prevention of mental health conditions in the UK. Care Policy and Evaluation Centre, Department of Health Policy, London School of Economics and Political Science, London; 2022.

  9. Mental health problems cost UK economy at least GBP 118 billion a year - new research. Available from: Cited 18 Sep 2023

  10. McGorry PD, Hickie IB, Yung AR, Pantelis C, Jackson HJ. Clinical staging of psychiatric disorders: a heuristic framework for choosing earlier, safer and more effective interventions. Aust N Z J Psychiatry. 2006;40(8):616–22.

  11. McGorry PD. Early intervention in psychosis. J Nerv Ment Dis. 2015;203(5):310–8.

  12. International Classification of Diseases (ICD). Cited 2023 Jan 20. Available from:

  13. DSM Library [Internet]. [cited 2023 Jul 5]. Diagnostic and Statistical Manual of Mental Disorders. Available from:

  14. Andrews G, Peters L, Guzman AM, Bird K. A comparison of two structured diagnostic interviews: CIDI and SCAN. Aust N Z J Psychiatry. 1995;29(1):124–32.

  15. Robins LN, Wing J, Wittchen HU, Helzer JE, Babor TF, Burke J, et al. The composite international diagnostic interview: an epidemiologic instrument suitable for use in conjunction with different diagnostic systems and in different cultures. Arch Gen Psychiatry. 1988;45(12):1069–77.

  16. Zigmond AS, Snaith RP. The hospital anxiety and depression scale. Acta Psychiatr Scand. 1983;67(6):361–70.

  17. Smarr KL, Keefer AL. Measures of depression and depressive symptoms: Beck Depression Inventory-II (BDI-II), Center for Epidemiologic Studies Depression Scale (CES-D), Geriatric Depression Scale (GDS), Hospital Anxiety and Depression Scale (HADS), and Patient Health Questionnaire-9 (PHQ-9). Arthritis Care Res. 2011;63(Suppl 11):S454–466.

  18. Beck AT, Ward CH, Mendelson M, Mock J, Erbaugh J. An inventory for measuring depression. Arch Gen Psychiatry. 1961;4(6):561–71.

  19. Spitzer RL, Kroenke K, Williams JBW, the Patient Health Questionnaire Primary Care Study Group. Validation and utility of a self-report version of PRIME-MD: the PHQ primary care study. JAMA. 1999;282(18):1737–44.

  20. Kroenke K. PHQ-9: global uptake of a depression scale. World Psychiatry. 2021;20(1):135–6.

  21. Kocalevent RD, Hinz A, Brähler E. Standardization of the depression screener Patient Health Questionnaire (PHQ-9) in the general population. Gen Hosp Psychiatry. 2013;35(5):551–5.

  22. Arroll B, Goodyear-Smith F, Crengle S, Gunn J, Kerse N, Fishman T, et al. Validation of PHQ-2 and PHQ-9 to screen for major depression in the primary care population. Ann Fam Med. 2010;8(4):348–53.

  23. Levis B, Benedetti A, Thombs BD. Accuracy of Patient Health Questionnaire-9 (PHQ-9) for screening to detect major depression: individual participant data meta-analysis. BMJ. 2019;365: l1476.

  24. Bohlmeijer ET, Fledderus M, Rokx TAJJ, Pieterse ME. Efficacy of an early intervention based on acceptance and commitment therapy for adults with depressive symptomatology: evaluation in a randomized controlled trial. Behav Res Ther. 2011;49(1):62–7.

  25. Davey CG, McGorry PD. Early intervention for depression in young people: a blind spot in mental health care. Lancet Psychiatry. 2019;6(3):267–72.

  26. McGorry P, van Os J. Redeeming diagnosis in psychiatry: timing versus specificity. The Lancet. 2013;381(9863):343–5.

  27. Office-based Physician Electronic Health Record Adoption | Available from: Cited 27 Oct 2021

  28. Jha AK, Doolan D, Grandt D, Scott T, Bates DW. The use of health information technology in seven nations. Int J Med Inf. 2008;77(12):848–54.

  29. SNOMED Home page. SNOMED. Available from: Cited 2 Nov 2021

  30. Kruse CS, Stein A, Thomas H, Kaur H. The use of electronic health records to support population health: a systematic review of the literature. J Med Syst. 2018;42(11):214.

  31. QRISK3. Available from: Cited 27 Oct 2021

  32. Pike MM, Decker PA, Larson NB, St Sauver JL, Takahashi PY, Roger VL, et al. Improvement in cardiovascular risk prediction with electronic health records. J Cardiovasc Transl Res. 2016;9(3):214–22.

  33. Klompas M, Eggleston E, McVetta J, Lazarus R, Li L, Platt R. Automated detection and classification of type 1 versus type 2 diabetes using electronic health record data. Diabetes Care. 2013;36(4):914–21.

  34. Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, Minhas R, Sheikh A, et al. Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2. BMJ. 2008;336(7659):1475–82.

  35. Cho G, Yim J, Choi Y, Ko J, Lee SH. Review of machine learning algorithms for diagnosing mental illness. Psychiatry Investig. 2019;16(4):262–9.

  36. Wu H, Yamal JM, Yaseen A, Maroufy V. Statistics and machine learning methods for EHR data: from data extraction to data analytics. CRC Press; 2020. p. 329.

  37. Shatte ABR, Hutchinson DM, Teague SJ. Machine learning in mental health: a scoping review of methods and applications. Psychol Med. 2019;49(9):1426–48.

  38. Kessler D, Bennewith O, Lewis G, Sharp D. Detection of depression and anxiety in primary care: follow up study. BMJ. 2002;325(7371):1016–7.

  39. Kessler RC, Bromet EJ. The epidemiology of depression across cultures. Annu Rev Public Health. 2013;34:119–38.

  40. Mitchell AJ, Rao S, Vaze A. Can general practitioners identify people with distress and mild depression? A meta-analysis of clinical accuracy. J Affect Disord. 2011;130(1):26–36.

  41. Booth A, Clarke M, Dooley G, Ghersi D, Moher D, Petticrew M, et al. The nuts and bolts of PROSPERO: an international prospective register of systematic reviews. Syst Rev. 2012;1(1):2.

  42. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350. Available from: Cited 26 Apr 2021

  43. Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: The CHARMS checklist. PLOS Med. 2014;11(10): e1001744.

  44. Navarro CLA, Damen JAAG, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Protocol for a systematic review on the methodological and reporting quality of prediction model studies using machine learning techniques. BMJ Open. 2020;10(11): e038832.

  45. OCEBM Levels of Evidence — Centre for Evidence-Based Medicine (CEBM), University of Oxford. Available from: Cited 12 Jul 2021

  46. Bernert RA, Hilberg AM, Melia R, Kim JP, Shah NH, Abnousi F. Artificial intelligence and suicide prevention: a systematic review of machine learning investigations. Int J Environ Res Public Health. 2020;17(16):5929.

  47. Explanation of the 2011 OCEBM Levels of Evidence. Available from: Cited 25 Sep 2023

  48. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Int J Surg. 2010;8(5):336–41.

  49. Abar O, Charnigo RJ, Rayapati A, Kavuluru R. On Interestingness Measures for Mining Statistically Significant and Novel Clinical Associations from EMRs. In: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. New York, NY, USA: Association for Computing Machinery; 2016. p. 587–94. (BCB ’16). Available from: Cited 14 Jul 2021

  50. Geraci J, Wilansky P, de Luca V, Roy A, Kennedy JL, Strauss J. Applying deep neural networks to unstructured text notes in electronic medical records for phenotyping youth depression. Evid Based Ment Health. 2017;20(3):83–7.

  51. Hochman E, Feldman B, Weizman A, Krivoy A, Gur S, Barzilay E, et al. Development and validation of a machine learning-based postpartum depression prediction model: a nationwide cohort study. Depress Anxiety. 2021;38(4):400–11.

  52. Huang SH, LePendu P, Iyer SV, Tai-Seale M, Carrell D, Shah NH. Toward personalizing treatment for depression: predicting diagnosis and severity. J Am Med Inform Assoc. 2014;21(6):1069–75.

  53. Jin H, Wu S, Vidyanti I, Di Capua P, Wu B. Predicting depression among patients with diabetes using longitudinal data a multilevel regression model. Methods Inf Med. 2015;54(6):553–9.

  54. Kasthurirathne SN, Biondich PG, Grannis SJ, Purkayastha S, Vest JR, Jones JF. Identification of patients in need of advanced care for depression using data extracted from a statewide health information exchange: a machine learning approach. J Med Internet Res. 2019;21(7): e13809.

  55. Koning NR, Büchner FL, Vermeiren RRJM, Crone MR, Numans ME. Identification of children at risk for mental health problems in primary care—Development of a prediction model with routine health care data. EClinicalMedicine. 2019;15:89–97.

  56. Meng Y, Speier W, Ong MK, Arnold CW. Bidirectional Representation Learning from Transformers using Multimodal Electronic Health Record Data to Predict Depression. ArXiv200912656 Cs. 2020; Available from: Cited 7 Jan 2021

  57. Meng Y, Speier W, Ong M, Arnold CW. HCET: Hierarchical Clinical Embedding with Topic modeling on electronic health records for predicting future depression. IEEE J Biomed Health Inform. 2021;25(4):1265–72.

  58. Nemesure MD, Heinz MV, Huang R, Jacobson NC. Predictive modeling of depression and anxiety using electronic health records and a novel machine learning approach with artificial intelligence. Sci Rep. 2021;11(1):1980.

  59. Nichols L, Ryan R, Connor C, Birchwood M, Marshall T. Derivation of a prediction model for a diagnosis of depression in young adults: a matched case–control study using electronic primary care records. Early Interv Psychiatry. 2018;12(3):444–55.

  60. Półchłopek O, Koning NR, Büchner FL, Crone MR, Numans ME, Hoogendoorn M. Quantitative and temporal approach to utilising electronic medical records from general practices in mental health prediction. Comput Biol Med. 2020;125: 103973.

  61. Qiu R, Kodali V, Homer M, Heath A, Wu Z, Jia Y. Predictive modeling of depression with a large claim dataset. In: 2019 IEEE Int Conf Bioinform Biomed (BIBM). 2019;1589–95.

  62. Sau A, Bhakta I. Predicting anxiety and depression in elderly patients using machine learning technology. Healthc Technol Lett. 2017;4(6):238–43.

  63. de Souza Filho EM, Veiga Rey HC, Frajtag RM, Arrowsmith Cook DM, de DalbonioCarvalho LN, Pinho Ribeiro AL, et al. Can machine learning be useful as a screening tool for depression in primary care? J Psychiatr Res. 2021;132:1–6.

  64. Wang S, Pathak J, Zhang Y. Using electronic health records and machine learning to predict postpartum depression. Stud Health Technol Inform. 2019;264:888–92.

  65. Xu Z, Wang F, Adekkanattu P, Bose B, Vekaria V, Brandt P, et al. Subphenotyping depression using machine learning and electronic health records. Learn Health Syst. 2020;4(4): e10241.

  66. Zhang J, Xiong H, Huang Y, Wu H, Leach K, Barnes LE. M-SEQ: Early detection of anxiety and depression via temporal orders of diagnoses in electronic health data. In: 2015 IEEE International Conference on Big Data (Big Data). 2015;2569–77.

  67. Zhang Y, Wang S, Hermann A, Joly R, Pathak J. Development and validation of a machine learning algorithm for predicting the risk of postpartum depression among pregnant women. J Affect Disord. 2021;279:1–8.

  68. SCIMP Guide to Read Codes | Primary Care Informatics. Available from: Cited 12 Nov 2021

  69. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9. J Gen Intern Med. 2001;16(9):606–13.

  70. Harris M, Taylor G. Medical Statistics Made Easy: 3rd Edition. Scion Publications; 2014. Available from: Cited 20 Jan 2023

  71. Kasthurirathne SN, Biondich PG, Grannis SJ, Purkayastha S, Vest JR, Jones JF. Identification of patients in need of advanced care for depression using data extracted from a statewide health information exchange: a machine learning approach. J Med Internet Res. 2019;21(7): e13809.

  72. Merrick L, Taly A. The explanation game: explaining machine learning models using shapley values. In: Holzinger A, Kieseberg P, Tjoa AM, Weippl E, editors. Machine learning and knowledge extraction. Cham: Springer International Publishing; 2020. p. 17–38.

  73. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):195.

  74. Yang J, Soltan AAS, Clifton DA. Machine learning generalizability across healthcare settings: insights from multi-site COVID-19 screening. Npj Digit Med. 2022;5(1):1–8.

  75. Health Level Seven International - Homepage | HL7 International. Available from: Cited 17 Nov 2022

  76. American National Standards Institute - ANSI Home. Available from: Cited 17 Nov 2022

  77. Standard Practice for Content and Structure of the Electronic Health Record (EHR) (Withdrawn 2017). Available from: Cited 17 Nov 2022

  78. Cadario R, Longoni C, Morewedge CK. Understanding, explaining, and utilizing medical artificial intelligence. Nat Hum Behav. 2021;5(12):1636–42.

  79. Vellido A. The importance of interpretability and visualization in machine learning for applications in medicine and health care. Neural Comput Appl. 2020;32(24):18069–83.

  80. Stiglic G, Kocbek P, Fijacko N, Zitnik M, Verbert K, Cilar L. Interpretability of machine learning-based prediction models in healthcare. WIREs Data Min Knowl Discov. 2020;10(5): e1379.

  81. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56–67.

  82. Molnar C. Chapter 1 Preface by the Author | Interpretable Machine Learning. Available from: Cited 10 May 2023

  83. Meehl PE. Clinical versus statistical prediction: a theoretical analysis and a review of the evidence. Minneapolis, MN, US: University of Minnesota Press; 1954.

  84. Dawes RM. The robust beauty of improper linear models in decision making. Am Psychol. 1979;34(7):571.

  85. Bandelow B, Michaelis S, Wedekind D. Treatment of anxiety disorders. Dialogues Clin Neurosci. 2017;19(2):93–107.

  86. Ströhle A, Gensichen J, Domschke K. The diagnosis and treatment of anxiety disorders. Dtsch Ärztebl Int. 2018;115(37):611–20.

  87. Sutherland AM, Nicholls J, Bao J, Clarke H. Overlaps in pharmacology for the treatment of chronic pain and mental health disorders. Prog Neuropsychopharmacol Biol Psychiatry. 2018;87:290–7.

  88. Urits I, Peck J, Orhurhu MS, Wolf J, Patel R, Orhurhu V, et al. Off-label antidepressant use for treatment and management of chronic pain: evolving understanding and comprehensive review. Curr Pain Headache Rep. 2019;23(9):66.

  89. Brakoulias V, Starcevic V, Albert U, Arumugham SS, Bailey BE, Belloch A, et al. Treatments used for obsessive–compulsive disorder—an international perspective. Hum Psychopharmacol Clin Exp. 2019;34(1): e2686.

  90. Del Casale A, Sorice S, Padovano A, Simmaco M, Ferracuti S, Lamis DA, et al. Psychopharmacological treatment of Obsessive-Compulsive Disorder (OCD). Curr Neuropharmacol. 2019;17(8):710–36.

  91. Abdallah CG, Averill LA, Akiki TJ, Raza M, Averill CL, Gomaa H, et al. The neurobiology and pharmacotherapy of posttraumatic stress disorder. Annu Rev Pharmacol Toxicol. 2019;59:171–89.

  92. Ehret M. Treatment of posttraumatic stress disorder: focus on pharmacotherapy. Ment Health Clin. 2019;9(6):373–82.

  93. Jayasooriya N, Blackwell J, Saxena S, Bottle A, Petersen I, Creese H, et al. Antidepressant medication use in Inflammatory Bowel Disease: a nationally representative population-based study. Aliment Pharmacol Ther;n/a(n/a). Available from: Cited 15 Mar 2022

  94. Milne BJ, Caspi A, Harrington H, Poulton R, Rutter M, Moffitt TE. Predictive value of family history on severity of illness: the case for depression, anxiety, alcohol dependence, and drug dependence. Arch Gen Psychiatry. 2009;66(7):738–47.


  95. van Dijk MT, Murphy E, Posner JE, Talati A, Weissman MM. Association of multigenerational family history of depression with lifetime depressive and other psychiatric disorders in children: results from the Adolescent Brain Cognitive Development (ABCD) study. JAMA Psychiatry. 2021;78(7):778–87.


  96. Weissman MM, Wickramaratne P, Gameroff MJ, Warner V, Pilowsky D, Kohad RG, et al. Offspring of depressed parents: 30 years later. Am J Psychiatry. 2016;173(10):1024–32.


  97. Williamson DE, Ryan ND, Birmaher B, Dahl RE, Kaufman J, Rao U, et al. A case-control family history study of depression in adolescents. J Am Acad Child Adolesc Psychiatry. 1995;34(12):1596–607.


  98. Sartorius N, Üstün TB, World Health Organization. Mental illness in general health care: an international study. Chichester: Wiley; 1995. Cited 10 Feb 2022.

  99. Thapar A, Collishaw S, Pine DS, Thapar AK. Depression in adolescence. Lancet. 2012;379(9820):1056–67.


  100. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.


  101. Basu S, Pope P, Feizi S. Influence functions in deep learning are fragile. arXiv:2006.14651 [cs, stat]. 2021. Cited 28 Mar 2022.

  102. Ghorbani A, Abid A, Zou J. Interpretation of neural networks is fragile. Proc AAAI Conf Artif Intell. 2019;33(01):3681–8.


  103. Ridley M, Rao G, Schilbach F, Patel V. Poverty, depression, and anxiety: causal evidence and mechanisms. Science. 2020;370(6522). Cited 16 Dec 2020.

  104. Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP. Generation and evaluation of synthetic patient data. BMC Med Res Methodol. 2020;20(1):108.


  105. Walonoski J, Kramer M, Nichols J, Quina A, Moesel C, Hall D, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc. 2018;25(3):230–8.


  106. Cheng LF, Dumitrascu B, Darnell G, Chivers C, Draugelis M, Li K, et al. Sparse multi-output Gaussian processes for online medical time series prediction. BMC Med Inform Decis Mak. 2020;20(1):152.




Acknowledgements

University of Warwick provided the library, information technology and office facilities that supported the contributors in the production of this study. We thank the peer reviewers and the BMC editor for their reviews and helpful commentary on this paper.


Funding

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Partnership Award (#2300953). Funding was awarded to CT as PI.

The funding source had no role in study design; the collection, analysis, and interpretation of data; the writing of the report; or the decision to submit the article for publication.

Author information

Authors and Affiliations



Contributions

DN and CT defined the systematic review scope and designed the methods. DN managed the literature searches and analyses. DN and LW undertook the statistical analysis, and DN wrote the first draft of the manuscript. CT, CM and LW reviewed and proofread subsequent versions of the manuscript prior to submission. All authors contributed to and have approved the final manuscript.

Corresponding author

Correspondence to David Nickson.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Table S-1. Studies excluded at full text stage with reasons.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Nickson, D., Meyer, C., Walasek, L. et al. Prediction and diagnosis of depression using machine learning with electronic health records data: a systematic review. BMC Med Inform Decis Mak 23, 271 (2023).
