Principal structure of the feasibility study
We used three test datasets: 105 cases from a telepathology platform in Afghanistan (Data Set I), 124 cases from a medical discussion platform (Coliquio, Data Set II), and 50 case reports taken from the New England Journal of Medicine (NEJM, Data Set III). See the following section and the supplement for details and references (Additional file 1: Table S1). The DDx generators used were IsabelHealth and Memem7.
The same test strategy was adopted for all three datasets (see also Fig. 1).
The terms used for the search function in Memem7 and IsabelHealth were chosen by one author (SS). The target diagnosis for the Afghan dataset was the opinion of at least three experts (senior pathologists). For the Coliquio dataset, the target diagnosis was the differential diagnosis favored by the majority of discussants, and for the NEJM dataset, the gold standard was the principal diagnosis (together with 5 differential diagnoses) proposed by the authors of the case report. For the Afghan and Coliquio datasets, an expert principal diagnosis was not available for every case. Two authors (PF and CF) evaluated the differential diagnoses returned by the systems with regard to their helpfulness (“helpful”/“not helpful”).
Datasets
Dataset I (Afghanistan): This dataset consists of 105 ongoing medical cases acquired in 2017–2019 from a telepathology platform (ipath [18, 19]) that is in daily diagnostic use for patients treated in Afghanistan (Mazar al Sharif). Each test case was diagnosed based on clinical and morphological data. Three primary physicians in Mazar al Sharif, Afghanistan (RR, AS, HF) were responsible for these test cases, and the diagnoses were made by four international senior experts (PF, GS, PD, BS). Unlike the other datasets, the cases in this series were dominated by morphological descriptors and questions. The Afghan test set represents user requests from physicians in a country with limited resources. In most cases, sophisticated testing methods such as specialized laboratory methods, immunohistochemistry, or imaging were not available.
Dataset II (Coliquio test cases) [20]: The 124 cases were collected between 2018 and 2019. Coliquio is a German-language online expert network that specializes in knowledge exchange for physicians. Only licensed physicians and licensed psychotherapists have access. The aim of the platform is to exchange information on patient cases, diagnoses and therapy options. Cases were screened in chronological order of creation. A case was included as a test case if (1) at least two symptoms were reported by the physician presenting the case in the Coliquio forum and (2) both sex and age information were provided. The Coliquio dataset was dominated by clinically oriented descriptions of a patient. Queries from this user group reflect the situation of a primary care physician who is treating a difficult case and who may be looking for an alternative explanation for the patient's symptoms.
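The two inclusion criteria for the Coliquio cases can be expressed as a simple screening step. The following Python sketch is only an illustration: the `ForumCase` record, its field names, and the example cases are hypothetical and are not part of the study data or of any Coliquio interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ForumCase:
    """Hypothetical record for one forum case (field names are assumptions)."""
    case_id: str
    created: str  # ISO date string, e.g. "2018-03-14"
    symptoms: List[str] = field(default_factory=list)
    sex: Optional[str] = None
    age: Optional[int] = None


def meets_inclusion_criteria(case: ForumCase) -> bool:
    """Criterion (1): at least two symptoms; criterion (2): sex and age given."""
    has_enough_symptoms = len(case.symptoms) >= 2
    has_demographics = case.sex is not None and case.age is not None
    return has_enough_symptoms and has_demographics


def screen_cases(cases: List[ForumCase]) -> List[ForumCase]:
    """Screen cases in chronological order of creation and keep eligible ones."""
    ordered = sorted(cases, key=lambda c: c.created)  # ISO dates sort as text
    return [c for c in ordered if meets_inclusion_criteria(c)]


# Invented example cases, for illustration only.
example_cases = [
    ForumCase("c2", "2019-01-10", ["fever", "rash"], "f", 34),
    ForumCase("c1", "2018-05-02", ["cough"], "m", 61),          # one symptom
    ForumCase("c3", "2018-11-20", ["fatigue", "weight loss"]),  # no sex/age
    ForumCase("c4", "2018-07-15", ["headache", "nausea"], "m", 45),
]
included_ids = [c.case_id for c in screen_cases(example_cases)]
```

In this sketch, a case with a single symptom or with missing sex or age is dropped, and the remaining cases are returned in order of creation.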
Dataset III (NEJM): 50 cases from the New England Journal of Medicine were chosen. For each case, the article provided an expert diagnosis and five differential diagnoses. The references can be found in the additional file (Additional file 1: Table S1).
The keywords and target diagnoses used for all test cases can also be found in the additional file (Additional file 1: Table S2).
Examples of test cases
For a better understanding of the study approach, three test cases, one from each dataset, were randomly selected and are described in Tables 3, 4 and 5.
Software systems used
IsabelHealth [21]: IsabelHealth is a commercial DDx generator built using machine learning technology [4]. It is a "black box" system for the user: the thesauri used cannot be reviewed or improved by the tester. For each case, the search function accepts up to 10 symptoms, and IsabelHealth provides a thesaurus of the allowed terms. From these terms, a ranked list of 100 possible differential diagnoses is generated in descending order of probability. Only the first 10 diagnoses were considered for the evaluation.
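The restriction to the first 10 entries of a ranked list can be illustrated with a short Python sketch. It assumes that a target diagnosis is matched by exact, case-insensitive string comparison, which is a simplification made here for illustration; in the study, the returned diagnoses were assessed by the authors. The function name and the placeholder diagnosis labels are hypothetical.

```python
from typing import List, Optional


def rank_within_top_k(ranked_diagnoses: List[str],
                      target: str,
                      k: int = 10) -> Optional[int]:
    """Return the 1-based rank of `target` among the first `k` entries.

    Returns None if the target is not among the first `k` entries.
    Matching by normalized string equality is an assumption of this sketch.
    """
    normalized_target = target.strip().lower()
    for position, diagnosis in enumerate(ranked_diagnoses[:k], start=1):
        if diagnosis.strip().lower() == normalized_target:
            return position
    return None


# Placeholder list standing in for a ranked DDx output (not real system output).
ranked = [f"diagnosis {i}" for i in range(1, 101)]

rank_hit = rank_within_top_k(ranked, "Diagnosis 3")    # within the first 10
rank_miss = rank_within_top_k(ranked, "diagnosis 11")  # ranked, but beyond 10
```

A target that appears only at position 11 or later is treated as not found, which mirrors the cut-off described above.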
Memem7 [22,23,24]: Memem7 is a currently non-commercial DDx generator developed by two of the authors (KA, PF). Memem7 is based on a large semantic network (about 560,000 nodes) that is transparently represented to the user and contains all kinds of entities and relationships, such as objects, classes, parts, attributes, processes, states, properties, etc. The inference algorithms process the semantic network using linguistic logic, which accommodates ambiguity, vagueness and uncertainty. For each case, a search can be run based on the terms entered. The input is mainly structured, but unstructured narrative input (e.g., medical reports) is also possible and is processed by modified NLP algorithms. The results are output as a ranked list of possible differential diagnoses with no length restriction. For each diagnosis, Memem7 provides a relevance value indicating the relevance of the search terms to the proposed diagnosis. Bayesian methods are used for diagnosis ranking: the more the terms match leading symptoms, the higher the relevance value.
Statistical methods
Excel was used for data collection and the statistical package R (version 3.5.3) [25] for statistical analysis. Statistical significance was assumed for p < 0.05. Numerical data were analyzed with the t-test and factors with the chi-square test.
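As an illustration of the chi-square test for factors, the following Python sketch computes Pearson's chi-square statistic and p-value for a 2×2 table, for example system (rows) against rating "helpful"/"not helpful" (columns). It is not the R code used in the study, and the counts are invented for illustration. The sketch applies no continuity correction, whereas R's `chisq.test` applies Yates' correction to 2×2 tables by default, so results can differ.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x2(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Pearson chi-square test for a 2x2 table, without continuity correction.

    Returns (statistic, p_value) for one degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must have a non-zero total")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # For 1 degree of freedom: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Invented counts, not study results: rows = systems, columns = ratings.
hypothetical_table = [[20, 10],
                      [10, 20]]
statistic, p_value = chi_square_2x2(hypothetical_table)
significant = p_value < 0.05  # threshold stated in the text
```

The closed-form p-value holds only for one degree of freedom; larger tables would need the general chi-square distribution.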
Ethical aspects
All data used are anonymized, i.e., they cannot be attributed to patients in any way. For the Coliquio cases, date of birth, name, and place of residence were not given. For the Afghan cases, anonymization of all cases in the ipath network is the responsibility of the treating physician. Only the hospital where the patients were treated is known, not their names or dates of birth. All NEJM test cases are published, and ethical aspects are therefore the responsibility of the publishing authors.