Prediction and diagnosis of depression using machine learning with electronic health records data: a systematic review

Nickson, David; Meyer, Caroline; Walasek, Lukasz; Toro, Carla

doi:10.1186/s12911-023-02341-x

BMC Medical Informatics and Decision Making

Table 5 Grouping of predictors from the studies

Predictor group	Commentary
Comorbidities	Comorbidities were included in 13 studies. They included long-term conditions, such as diabetes, asthma, epilepsy, and chronic pain. These were commonly used, especially when the study authors highlighted theoretical links with depression
Demographic	Demographic predictors were used in 16 studies. On some occasions, specific demographic variables were excluded due to insufficient availability/coverage (often the case for ethnicity). Gender was included as a predictor and occasionally also as a means of creating gender-specific models (e.g., Nichols et al. [59]). Social deprivation was also used as a predictor, and information about missed immunization(s) was used in two studies, Nemesure et al. [58] and Nichols et al. [59], as a proxy for social deprivation. The age range of cases was often an integral part of the study’s specific aims. Age was treated either as a numeric predictor or as a means of dividing the study population into subgroups. Some studies specifically focussed on older patients. For instance, Sau and Bhakta [62] used data with an average age of 68.5 years (standard deviation 4.85 years), whereas Nichols et al. [59] focused on early diagnosis among young people between 15 and 24 years of age. Some studies restricted the analysis to a narrow age bracket, whereas others included a wide range of ages. For example, Hochman et al. [51], who studied postpartum depression, reported an average age of 29.4 years (standard deviation 5.4), whereas Xu et al. [65] used data from participants whose age ranged from 18 to over 65
Family History	Family history was used in five studies and included family history of abuse (physical/sexual) and drug/substance abuse, often because the study authors cited theoretical links with depression. This group of predictors was often under-recorded, as reported in the Nichols et al. [59] study, where family history data was removed from the model due to low prevalence (< 0.02%) in their data. Insufficient family history data was also highlighted as a limitation in other studies [53, 55]
Obstetric specific	Obstetric-specific predictors were used in five studies focussed on the prediction of postpartum depression. These included predictors such as premature birth, use of specific drugs during pregnancy, and obesity. This type of predictor was also used in non-postpartum depression studies, e.g., Abar et al. [49]
Psychiatric symptoms or other diagnoses	Psychiatric symptoms/diagnoses were used in 15 studies. These include depression-related symptoms such as anxiety, low mood, self-harm, sleeping and eating disorders, and too little sleep. They also include a broader range of conditions, including post-traumatic stress syndrome, obsessive compulsive disorder, personality disorders, and psychoses. Individual studies do not always distinguish between these two subgroups
Smoking	Smoking was used in seven studies. However, it was identified, for instance by Nichols et al. [59], that smoking data may not be complete for all participants and that this might impact the ability to reliably assess correlations with depression. To mitigate this, they used “missing smoker” data as a separate predictor. Smoking was a categorical predictor in the selected studies
Social/family	Social and family-related factors were used in seven studies. These included bereavement, divorce, single parenthood, police or social services involvement, and similar
Somatic	Somatic conditions were used in 14 studies. These include physical conditions such as abdominal pain, back pain, dyspepsia, eczema, headaches, and others
Substance/alcohol abuse	Alcohol/substance abuse was used in seven studies and covered participants identified as having drug/alcohol abuse problems. It was typically categorical, but some studies included levels of abuse and/or combinations of the two
Visit frequency	Visit frequency was used in six studies and was shown to be a significant contributor to model performance. This is an integer variable based on the number of visits to the primary care facility (e.g., NHS GP) in a specified period
Word list/text	Word list/text-derived data was used in only one study, Geraci et al. [50]. This was a source of data that was then analysed, using natural language processing, to extract predictors from clinical notes. It is based on language/defined terms specific
Other measurements and predictors	Other measurements and predictors were used in 11 studies and included, e.g., measurements of physical characteristics such as blood pressure, cholesterol, results of assays, and height/weight

Note: There may be overlap or gaps in these groupings, as the predictors used and the reasons for their use are study-specific and not always explained
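
As an illustration of how three of the encodings described in the table could look in practice (smoking as a categorical predictor with a separate "missing" level, visit frequency as an integer count over a specified period, and age used both as a numeric value and as a subgroup flag), the following is a minimal Python sketch. It is not taken from any of the reviewed studies: the names `PatientRecord` and `encode_predictors`, the field names, and the choice of the 15 to 24 age band as the subgroup are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


# Hypothetical record layout, assumed for illustration only.
@dataclass
class PatientRecord:
    age: int
    smoker: Optional[bool]   # None when smoking status was never recorded
    visit_dates: List[date]  # dates of primary care visits


def encode_predictors(rec: PatientRecord, start: date, end: date) -> Dict[str, int]:
    """Turn one record into a flat dict of model-ready predictors."""
    # Smoking as a categorical predictor with an explicit "missing" level,
    # so absent data becomes a predictor instead of a dropped row.
    smoking = {
        "smoker_yes": int(rec.smoker is True),
        "smoker_no": int(rec.smoker is False),
        "smoker_missing": int(rec.smoker is None),
    }
    # Visit frequency: integer count of visits inside the period (inclusive).
    visit_count = sum(1 for d in rec.visit_dates if start <= d <= end)
    # Age kept as a numeric value and also as a subgroup flag
    # (the 15 to 24 band is an assumed example of a subgroup).
    age_15_24 = int(15 <= rec.age <= 24)
    return {**smoking, "visit_count": visit_count, "age": rec.age, "age_15_24": age_15_24}


# Usage: a 19-year-old with unrecorded smoking status and two visits in 2020.
example = PatientRecord(
    age=19,
    smoker=None,
    visit_dates=[date(2020, 1, 5), date(2020, 6, 1), date(2021, 2, 1)],
)
features = encode_predictors(example, date(2020, 1, 1), date(2020, 12, 31))
```

Keeping "missing" as its own level means records without a smoking entry stay in the data set, which is the general idea behind the "missing smoker" predictor mentioned in the Smoking row; the exact encoding used by any individual study may differ.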

ISSN: 1472-6947
