Skip to main content

Contextual property detection in Dutch diagnosis descriptions for uncertainty, laterality and temporality



Accurate, coded problem lists are valuable for data reuse, including clinical decision support and research. However, healthcare providers frequently modify coded diagnoses by including or removing common contextual properties in free-text diagnosis descriptions: uncertainty (suspected glaucoma), laterality (left glaucoma) and temporality (glaucoma 2002). These contextual properties could cause a difference in meaning between underlying diagnosis codes and modified descriptions, inhibiting data reuse. We therefore aimed to develop and evaluate an algorithm to identify these contextual properties.


A rule-based algorithm called UnLaTem (Uncertainty, Laterality, Temporality) was developed using a single-center dataset, including 288,935 diagnosis descriptions, of which 73,280 (25.4%) were modified by healthcare providers. Internal validation of the algorithm was conducted with an independent sample of 980 unique records. A second validation of the algorithm was conducted with 996 records from a Dutch multicenter dataset including 175,210 modified descriptions of five hospitals. Two researchers independently annotated the two validation samples. Performance of the algorithm was determined using means of the recall and precision of the validation samples. The algorithm was applied to the multicenter dataset to determine the actual prevalence of the contextual properties within the modified descriptions per specialty.


For the single-center dataset recall (and precision) for removal of uncertainty, uncertainty, laterality and temporality respectively were 100 (60.0), 99.1 (89.9), 100 (97.3) and 97.6 (97.6). For the multicenter dataset for removal of uncertainty, uncertainty, laterality and temporality it was 57.1 (88.9), 86.3 (88.9), 99.7 (93.5) and 96.8 (90.1). Within the modified descriptions of the multicenter dataset, 1.3% contained removal of uncertainty, 9.9% uncertainty, 31.4% laterality and 9.8% temporality.


We successfully developed a rule-based algorithm named UnLaTem to identify contextual properties in Dutch modified diagnosis descriptions. UnLaTem could be extended with more trigger terms, new rules and the recognition of term order to increase the performance even further. The algorithm’s rules are available as additional file 2. Implementing UnLaTem in Dutch hospital systems can improve precision of information retrieval and extraction from diagnosis descriptions, which can be used for data reuse purposes such as decision support and research.

Peer Review reports


The problem-oriented medical record—a structured organization of patient information per provided medical problem—is successful in helping healthcare providers to get a good understanding of the temporality of patients [1,2,3]. A core element is the problem list, which presents a list of active and inactive diagnoses relevant to current care of the patient [2]. Problem lists support reuse of data to e.g. trigger rules of a decision support system, create patient cohorts for quality registries or medical research [4, 5]. However, in order to realize these benefits, problem lists need to be accurate and complete, and problem list entries should be coded [4,5,6,7].

Although healthcare providers acknowledge the importance of accurate problem lists [8,9,10], problem lists often remain inaccurate and incomplete [11,12,13,14]. Healthcare providers consider free-text documentation typically as important, because they have concerns that by recording structured data on the problem list important information could be omitted [15]. As a consequence, when healthcare providers do code diagnoses on the problem list, they may modify the description of these diagnoses or add details in free-text fields as they often find the default diagnosis description insufficient or because they cannot find the diagnosis they are looking for [4, 16].

The context of the diagnoses in modified descriptions is crucial for determining the clinical status of a patient [17,18,19,20]. Previous research showed that healthcare providers describe levels of certainty in clinical free text, e.g., to specify a working diagnosis [16, 20,21,22,23,24]. Certainty levels vary from affirmed (certain) to non-affirmed (uncertain) levels of speculation. Uncertainty can be defined as the expressions of hypotheses, tentative conclusions, and speculations [24]. Uncertainty described in diagnosis descriptions could indicate a change in the meaning of a modified description compared to the default description. For instance, if the default diagnosis description for a code is glaucoma but the modified description suspected glaucoma, the problem list indicates by its code that the patient has glaucoma, while the description indicates that this diagnosis is not yet confirmed. Consequently, if researchers select all patients who suffered from glaucoma, the system returns all patients with the diagnosis code for glaucoma, although some patients might have had suspected glaucoma.

Next to uncertainty it is also important to know if a diagnosis is specified with laterality (e.g. left or right glaucoma) because missing laterality could lead to medical errors, such as procedures being performed on the wrong extremity [25]. Problems that should be specified by laterality are not always available in clinical terminologies, which therefore requires adding laterality in the description [26]. Furthermore, healthcare providers may argue that listing temporality is important (e.g. glaucoma 2002), explaining the timeline of a disease, symptom or event [23, 27]. Specifying a diagnosis with temporality could also be useful for prompting more frequent testing, for instance for breast cancer [28].

Again, it is important to identify temporality, because temporality described in a modified description indicates that the problem is a former problem, while the code indicates it is a current problem [17, 27, 29,30,31,32]. The examples given above might result in discrepancies between codes and modified descriptions or other free text which might lead to inappropriate care or research findings [5, 33,34,35,36,37,38]. The identification of context of information in terms of uncertainty, laterality and temporality is therefore an important task [17, 21, 35, 39]. Uncertainty, laterality and temporality can be referred to as contextual properties, because the information is not captured in the diagnosis itself, but provides the context of the diagnosis [17].

Several algorithms have been developed and evaluated to identify contextual properties in clinical free text [5, 7, 17, 27, 32]. These algorithms can extract concepts from free text and map these concepts to a standardized vocabulary [6], such as the tools MetaMap [40] and IndexFinder [41]. Additionally, regular expressions can be used to identify specific contextual properties. For instance NegEx is an algorithm that uses regular expressions to identify negated (i.e. ‘ruled-out’) diagnoses [32]. ConText, an algorithm that was developed based on NegEx [32], identifies several contextual properties in clinical free text, including whether a condition is negated, but also hypothetical, historical, or experienced by someone else [17]. However, techniques like ConText have been developed for English text, and few algorithms can identify contextual properties in other languages, such as Dutch [27, 42]. One example is ContextD, which identifies the same contextual properties as ConText, but for Dutch [27]. For the historical values of the temporality property, performance ranged from 26 to 54%. To our knowledge, no algorithms have been developed to recognize laterality or uncertainty in Dutch text.

The purpose of this study was to develop and evaluate a new algorithm to be called UnLaTem (uncertainty, laterality, temporality) for identifying (removal of) uncertainty, laterality and temporality in modified diagnosis descriptions. These properties should be identified before reusing diagnosis data as they could cause a difference in meaning between codes and descriptions. We applied the algorithm to Dutch free-text modified diagnosis descriptions, to gain insights into the extent to which diagnosis descriptions contain (removal of) uncertainty, laterality and temporality.



In most Dutch hospitals, the interface terminology underlying the problem list in the EHR systems is the Diagnosis Thesaurus (DT), provided by Dutch Hospital Data (DHD), which is mapped to ICD-10 and SNOMED CT [43]. The DT is used by healthcare providers to select the best-fitting code for their patients’ problems.

An anonymized dataset was extracted from the EHR system (Epic) of the Amsterdam University Medical Center (UMC). This dataset included all diagnoses recorded on the problem list with the DT and their descriptions, of all non-shielded patients (e.g. VIP-patients are shielded) admitted to the hospital in 2017. To develop and validate the algorithm, we selected all records in which the free-text field ‘description’ differed from the default diagnosis description. This included complete replacements (e.g. glaucoma changed to hypertension), additions (e.g. glaucoma changed to suspected glaucoma) and removals (e.g. suspected glaucoma changed to glaucoma). Thus, only exact matches were not included. A multicenter dataset including data from five anonymous Dutch hospitals, all using the same EHR (Epic), was used for second validation of the algorithm. According to the supplier of the multicenter dataset, a total of 1,035,059 diagnoses were registered for these five hospitals between April 2018 and May 2019. Note that the multicenter dataset also contained records from Amsterdam UMC, but these covered a different time frame than the records of the single-center dataset. In contrast to the Amsterdam UMC dataset, this multicenter dataset was constructed to include only encoded problems for which the problem description was modified by the end-user (n = 175,210). Further characteristics of the two datasets are shown in Table 1.

Table 1 Characteristics of the single-center and multicenter dataset

Both datasets included four variables: an ICD-10 code, default diagnosis descriptions, modified descriptions and the medical specialty that modified the diagnosis description. For example, ICD-10 code I10 has diagnosis description essential hypertension. A modified free-text description could be suspected hypertension. For the multicenter dataset, we combined the specialties that were related, such as Audiology and Audiological center (see Appendix 1) [44] after which 41 specialties remained.

Data selection

Of the 73,280 modified descriptions in the Amsterdam UMC dataset, 54,960 (75%) records were used for development of the algorithm. Of the remaining 18,320 (25%), we randomly selected 1000 records for validation while removing case-insensitive duplicate combinations of diagnosis descriptions and modified descriptions. The remaining 17,320 records could be used for future development and validation of the algorithm. Similarly, a second validation of the algorithm was conducted with 1000 records of the 175,210 modified descriptions of the multicenter dataset.

Development of the algorithm UnLaTem

We developed an algorithm that uses regular expressions to identify whether a modified diagnosis description contained (removal of) uncertainty, laterality and/or temporality. In this study, we treat uncertainty as expressions of belief, including at least one tentative conclusion or speculative fragment described in diagnosis descriptions [24]. The detection of uncertainty addition and removal was based on regular expressions that therefore describe tentative conclusions, speculation and hypotheses in diagnoses. Temporality detection was based on regular expressions indicating various forms of dates and laterality was based on regular expressions that indicate left and right in various forms. To discover all variations of the contextual properties, two authors (ESK and FJP) manually checked all descriptions in the development set (n = 54,960). To identify ‘removal’, we also manually checked whether uncertainty and laterality occurred in the unique default diagnosis descriptions of the DT (n = 10,936). Based on all found variations, regular expressions were established. For instance, laterality could be described as left and right, but also le and ri (Dutch: links and rechts, li and re). A final set of regular expressions was made available for all three properties, which is shown in Fig. 1. The translation of the terms to English can be found as Additional file 1. The algorithm was developed in R using R.Studio v 1.2.1335 for Windows.

Fig. 1

Algorithm regular expressions and corresponding categories. The rectangles in the first and second column contain the regular expressions. the rectangles in the third column contain the properties that result from the inclusion of the regular expressions. 1Modified descriptions can be classified according to multiple contextual properties (uncertainty, laterality and/or temporality)

Validation and performance of the algorithm

The single-center and multicenter validation sets were independently manually annotated by the same two authors (ESK and FJP). The annotators distinguished the four contextual properties based on guidelines provided by ESK explaining the process and each of the properties in detail with examples. We reported the interrater reliability score and Cohen’s kappa score to examine agreement between annotators [45]. After independent annotation, both annotators compared their results and discussed the annotations until they reached consensus on discrepancies. After agreement, the sets were used as reference standard.

Confusion matrices were used to determine the performance, i.e. the agreement between the manual expert-based outcome and the algorithm’s outcome per type of property in both validation sets. An example of a confusion matrix is shown in Appendix 2A. Based on these matrices, we determined the actual prevalence, which is defined as the definitive percentage of identified properties (i.e. false negatives and true positives). We calculated the performance of UnLaTem in terms of recall, specificity and precision for both validation samples [46, 47]. Recall is defined as the proportion of modified descriptions that was correctly identified with the correct type(s) of contextual properties (i.e., true positives). The specificity is defined as the proportion of modified descriptions that were correctly identified not to have type(s) of properties (i.e., true negatives). The precision is the probability that the algorithm correctly identified the type(s) of properties in the modified descriptions.

We determined the mean values of the recall, specificity, prevalence and precision of the two validation samples by using the inverse variance-weighted average method [48, 49]. This method calculates the weighted mean of the two validations samples, by taking into account the standard errors of the recall, specificity and prevalence. Haldane-Anscombe correction was applied to calculate an approximation of the weighted means, i.e., if a confusion matrix contained zeros, we added 0.5 to all counts [50,51,52]. Appendix 2A and 2B show all formulas [48, 52, 53].

Error analysis

In order to assess the adequacy of the rule-based approach for contextual property assignment in Dutch modified diagnosis descriptions, we analyzed all incorrect assignments and placed them in an error class, i.e. the counts of the total number of false positives and false negatives. In line with ConText, four error classes were distinguished [17]. The classes included Missing terms (error can be prevented by adding trigger terms, pseudo-trigger terms or termination terms to UnLaTem’s term lists), Simple extension (error can be prevented by adding another rule to the algorithm), Outside framework (prevention of error requires a change that cannot be accommodated within current UnLaTem’s framework) and Annotation/implementation (error is the result of a mistake in the reference standard annotations or a bug in the implementation of the algorithm).

Application of the algorithm

UnLaTem was used to determine the apparent prevalence of (removal of) uncertainty, laterality and/or temporality within the modified descriptions of the Amsterdam UMC dataset (n = 73,280) and the multicenter dataset (n = 175,210). Based on the apparent prevalence, we calculated the actual prevalence using the Rogan–Gladen estimator, of which the formula is shown in Appendix 2C [54]. Note that this is the actual prevalence of contextual properties within the modified descriptions, and not within the complete datasets. The actual prevalence was reported per specialty in the multicenter dataset.


Development of the algorithm

The regular expressions and trigger terms for (removal of) uncertainty, laterality and temporality for the algorithm are shown in Fig. 1, in the second column. For example, if the default diagnosis description was glaucoma and the modified description glaucoma screening, the algorithm’s uncertainty-value was True. If the default diagnosis description was suspected glaucoma and the modified description included suspected, screening, possible, analysis or advice, the algorithm returned False for uncertainty. That is, because both expressions indicate that having glaucoma was suspected. If the default diagnosis description was suspected glaucoma and the modified description glaucoma, the algorithm returned True for removal of uncertainty. Note that not all default diagnoses available in the DT also contained a version with pre-coded uncertainty. For instance, glaucoma and suspected glaucoma exist as default diagnosis descriptions but hypertension and the corresponding suspected hypertension do not.

Only a few records that included removal of laterality were retrieved. However, these modified descriptions were specifications or generalizations of the diagnosis term. For instance, left sided heart failures was changed to decompensatio cordis, the Latin name for heart failure. We considered these modifications another type of property, which was not included in this version of the algorithm.

Also, note that one modified description can have multiple properties (2015 left eye infection is categorized as both laterality as well as temporality for the default description eye infection).

Validation and performance of the algorithm

In both validation sets (n = 1000 + 1000), we discovered that terms from modified descriptions were identical to default diagnosis descriptions, but were flagged as ‘modified’ because some descriptions included spaces or because the healthcare provider had clicked in the description field. We removed these records after which 980 remained in the internal validation set and 996 remained in the multicenter dataset. The interrater reliability and kappa score between the two raters that defined the reference standard validation sets were determined per property and are shown in Table 2. Please note that all disagreements were solved after the initial scoring.

Table 2 Interrater reliability and kappa scores for the internal (n = 980) and multicenter (n = 996) validation set

Table 3 shows the actual prevalence, which was determined using the reference standards and the recall, specificity and precision of UnLaTem to identify the properties for the validation samples. The confusion matrices are shown in Appendix 3. In addition, Table 3 shows the inverse variance-weighted averages of the recall, specificity and prevalence. We applied the weighted prevalence, recall and specificity to determine the weighted precision. Appendix 2B shows the formulas we used and the numbers for these calculations.

Table 3 Actual prevalence, recall, specificity and precision of all four properties in the internal (n = 980) and multicenter validation set (n = 996)

Error analysis

Tables 4 and 5 show the results of the error analysis in the internal validation set (n = 980) and multicenter validation set (n = 996).

Table 4 Error analysis of false positives (FP) and false negatives (FN) in the internal validation set (n = 980)
Table 5 Error analysis of false positives (FP) and false negatives (FN) in the multicenter validation set (n = 996)

Application of UnLaTem

Table 6 shows the apparent and actual prevalence of the contextual properties which was determined within the modified descriptions for both the Amsterdam UMC dataset (n = 73,280) and multicenter dataset (n = 175,210).

Table 6 Number and percentages of the properties identified by the algorithm in the modified descriptions per dataset

Appendix 4 shows actual prevalence of the contextual properties within the modified descriptions per specialty in the multicenter dataset. From the specialties with more than one thousand diagnosis descriptions, clinical genetics had the highest percentage of uncertainty within the modified descriptions (5287/5655, 93.5%), audiology had the lowest percentage of uncertainty within the modified descriptions (4/1127, 0.4%). Laterality additions were found highest in ophthalmology (13,471/16,901, 79.7%) and lowest in audiology (133/1127, 11.8%). Temporality additions were highest for internal medicine (4822/30,457, 15.8%) and lowest for audiology (0/1127, 0.0%). Removal of uncertainty was highest for urology (49/1306, 3.8%) and lowest for audiology (0/1127, 0.0%).

The most-frequent modification was adding od (right eye) to cataract, which occurred 1,237 (0.7%) times in the dataset (n = 175,210), followed by adding os (left eye) to cataract, which occurred 1,191 (0.7%) times.


In this study, we developed and evaluated an algorithm called UnLaTem to identify (removal of) uncertainty, laterality and temporality in Dutch modified diagnosis descriptions. The evaluation showed high performance measures, indicating good performance of the algorithm. Removal of uncertainty had lower performance scores, indicating that for removal of uncertainty the algorithm could be further improved. Additionally, we determined the prevalence of the contextual properties in the modified descriptions, in total and per specialty. As expected, the percentages of laterality modifications that were considerably higher than the mean value occurred in specialties for which adding laterality is important, namely ophthalmology and orthopedics. For temporality, a specialty with a considerably higher value than the mean was emergency care. This can be explained by the fact that the treating healthcare provider is usually reliant on the medical history of the patient. The percentage uncertainty was highest in clinical genetics. This can be expected as it is a specialty that provides screening service and therefore deals with high uncertainty in (initial) diagnoses. The variations of contextual properties between these and other medical specialties should be looked into in future research.

One could argue why healthcare providers have to record diagnoses on a problem list in a coded way, with the risk of adding contextual properties in free-text. In principle, natural language processing (NLP) algorithms can be used to identify diagnoses in clinical free text [55]. NLP could automatically transform clinical text into structured data that can guide clinical decision-making and data reuse such as research. However, current NLP algorithms cannot accurately identify diagnoses in Dutch or English clinical free text yet [56]. Hence, UnLaTem could be beneficial for the identification of discrepancies in modified diagnosis descriptions in Dutch.

Strengths and limitations

A strength of this study is that we developed a successful algorithm for recognizing four contextual properties in Dutch descriptions, as it showed high performance scores in both validation sets. Furthermore, only few algorithms have been developed for the Dutch language [27, 57]. Following application of UnLaTem to the multicenter dataset, we were also able to gain insights into the correctness of underlying diagnosis codes. Identification of differences between codes and modified descriptions is essential, because reuse of problem list data relies often on coded data alone [5, 58]. The results showed for instance that a relatively high number of modified descriptions (n = 17,347, 9.9%, CI: 9.5–10.3) contained uncertainty in the multicenter dataset. Subsequently, identification of patients with a certain disease, design of disease registers or assessment of quality of care based on coded diagnoses may be error-prone [59, 60].

Our research also has several limitations. Although the overall performance of UnLaTem was high, performance was only determined for identification of contextual properties of diagnoses that were registered on the problem list, but problem list incompleteness is a widespread issue [4, 6]. We expect to find more information on diagnoses in other free-text fields, such as history notes, discharge letters and medication letters. This is important because although ‘modified’ descriptions might correspond to default diagnosis descriptions, the information in other free-text fields might state otherwise. Additionally, other free-text fields might include more problems, which therefore implies that diagnoses are missing on the problem list.

Error analysis

It is important to note that UnLaTem is meant to identify simple sentences including uncertainty, laterality and temporality using regular expressions, and was never expected to capture all properties. We believe that the simplicity of this rule-based algorithm makes it appealing to apply to modified diagnosis descriptions, especially for developers without training in NLP [21]. Nonetheless, there are more contextual properties to be considered for a more complex version of the algorithm. ContextD, an algorithm adapted from ConText [17] identifies whether a condition is temporal, but also whether terms were negated in clinical Dutch reports and letters [27]. Previous research showed that half of the terms retrieved in clinical reports were negated [61]. Hence, the negation detection module by ConTextD could be considered to add to future editions of UnLaTem. Furthermore, the “missing terms” class of errors showed that there are more variations to describe temporality [17, 62]. For example, by adding variations to describe temporality by including a historical function such as hypertension in the past (Dutch: hypertensie in het verleden). It is important to note that some terms were purposely not added, such as ‘request’. Terms like ‘request’ are disputable because it has several definitions. For instance, ‘request’ could suggest requests for lab tests to confirm a diagnosis, which implies it is a kind of uncertainty. However, ‘request’ could also mean that healthcare providers request treatments for that specific diagnosis, which suggests it is not an uncertainty.

We also discovered that abbreviations were sometimes misclassified. For example, AD means right ear but was also used for the pregnancy duration (Dutch: Amenorroe duur). Misusing abbreviations could lead to medical errors [63, 64]. The algorithm also misclassified diagnosis descriptions containing values such as mmol/l. That is because l was one of the trigger items for laterality. Furthermore, the algorithm showed lower performance scores for removal of uncertainty, because some term removals were not discovered in the training set. Additionally, although possible (Dutch: waarschijnlijk and mogelijk) was included in the algorithm, the regular expression only returned True for uncertainty if possible occurred at the beginning of the sentence. In the error analysis, we found that possible also appeared in the middle of sentences. Extending the rules for existing trigger items for laterality and (removal of) uncertainty could improve the performance of the algorithm even further.

In the third error class, “outside the framework”, we discovered that the algorithm did not take into account the term order of the modified descriptions, though this is needed to extract meaningful information [27]. For instance, if the diagnosis code is stomach ache and the modified description is stomach ache, suspected flu, the algorithm will flag this as a case of uncertainty, although it was not the stomach ache that was uncertain, but the flu. Finally, the annotators sometimes misclassified a modified description as an uncertainty, or they overlooked a laterality. Before the algorithm can be implemented to larger free-text documents, more research is needed to determine variations in describing other contextual properties and the term order.

Relation to other literature

We compared the results of temporality of UnLaTem and the temporality module of ContextD. Laterality was not included in ContextD. UnLaTem performed better on temporality (recall = 0.97, precision = 0.95) compared to averaged performance rates of the temporality function of ContextD (recall = 0.73, precision = 0.38). It is important to note that there were more trigger items (i.e. since) for the temporality function of ContextD than for the temporality function of UnLaTem. Another comparable algorithm is PYContextNLP [21]. PyContextNLP reports among others whether uncertainty is present or absent for a diagnosis in English text. The performance scores for uncertainty of PyContextNLP (recall = 0.94, precision = 0.93) are comparable to the performance scores for uncertainty of UnLaTem (recall = 0.98, precision = 0.90). PyContextSwe is the Swedish version of PyContextNLP, and distinguishes four different classes (definite existence, probable existence, probable negated existence and definite negated existence) [65]. Probable existence is comparable to uncertainty. The performance of probable existence was slightly lower (recall = 0.81, precision = 0.82) than UnLaTem’s performance on uncertainty (recall = 0.98, precision = 0.90). Additionally, UnLaTem included laterality triggers. This means that UnLaTem could also be used to detect patterns of laterality for implementation in context-sensitive user interfaces and identify terms that should be further specified by laterality. One solution could be that when the healthcare provider selects the term eye infection, a check-box of ‘left’ and ‘right’ is presented in the design of the EHR system, as we found laterality was commonly added in the modified descriptions. However, no check-box should appear when the healthcare provider selects terms for non-lateralizable concepts, for instance diabetes. Additionally, based on the presence of (removal of) uncertainty, laterality or temporality in the modified descriptions, the algorithm could trigger alerts to a decision support system or module to no longer rely on the captured underlying codes. We believe that UnLaTem is therefore an useful addition as it can provide meaningful insights how information is recorded on the problem list.


An algorithm called UnLaTem was developed to identify contextual properties in Dutch modified diagnosis descriptions, including (removal of) uncertainty, laterality and temporality, which is publicly available for other researchers to be used for further improvement or application within their institutions [66]. Our results indicate that the approach used to develop UnLaTem performs well for identifying the four contextual properties in diagnosis descriptions, which could help improve overall precision of information retrieval and extraction. It thereby provides insights in the correctness of diagnosis descriptions and potential discrepancies, that should be identified before reusing these diagnosis data. However, UnLaTem could be improved with more contextual properties including new trigger terms, extension of the rules and the recognition of the term order, before it can be applied to larger free-text documents in EHR systems. UnLaTem could be implemented eventually in Dutch hospital systems, improving quality of diagnosis data for research and clinical decision support. Finally, although the current algorithm focuses on the Dutch language, the methods to develop and evaluate such an algorithm can be generalized to other languages.

Availability of data and materials

The data are not publicly available as they contain information that could harm the privacy regulations in the Netherlands. The complete overview of outcomes of the algorithm and guidelines used for manual annotation can be requested by ESK. The code used to determine the contextual properties and outcomes per specialty can be found in the Additional R-file or can be downloaded via: (Additional file 2).



Amsterdam University Medical Center


Diagnosis thesaurus


Electronic health record


Uncertainty, laterality, temporality


Natural language processing


  1. 1.

    Hartung DM, Hunt J, Siemienczuk J, Miller H, Touchette DR. Clinical implications of an accurate problem list on heart failure treatment. J Gen Intern Med. 2005;20(2):143–7.

    PubMed  PubMed Central  Article  Google Scholar 

  2. 2.

    Simons SM, Cillessen FH, Hazelzet JA. Determinants of a successful problem list to support the implementation of the problem-oriented medical record according to recent literature. BMC Med Inform Decis Mak. 2016;16(102):1–9.

    Google Scholar 

  3. 3.

    Salmon P, Rappaport A, Bainbridge M, Hayes G, Williams J, editors. Taking the problem oriented medical record forward. In: AMIA Annual Symposium Proceedings; 1996.

  4. 4.

    Wright A, McCoy AB, Hickman T-TT, Hilaire DS, Borbolla D, Bowes WA III, et al. Problem list completeness in electronic health records: a multi-site study and assessment of success factors. Int J Med Inform. 2015;84(10):784–90.

    PubMed  PubMed Central  Article  Google Scholar 

  5. 5.

    Wright A, Pang J, Feblowitz JC, Maloney FL, Wilcox AR, McLoughlin KS, et al. Improving completeness of electronic problem lists through clinical decision support: a randomized, controlled trial. J Am Med Inform Assoc. 2012;19(4):555–61.

    PubMed  PubMed Central  Article  Google Scholar 

  6. 6.

    Meystre S, Haug PJ. Automation of a problem list using natural language processing. BMC Med Inform Decis Mak. 2005;5:30.

    PubMed  PubMed Central  Article  Google Scholar 

  7. 7.

    Meystre S, Haug P, editors. Improving the sensitivity of the problem list in an intensive care unit by using natural language processing. In: AMIA annual symposium proceedings; 2006.

  8. 8.

    Hose B-Z, Hoonakker PL, Wooldridge AR, Brazelton TB III, Dean SM, Eithun B, et al. Physician perceptions of the electronic problem list in pediatric trauma care. Appl Clin Inform. 2019;10(01):113–22.

    PubMed  PubMed Central  Article  Google Scholar 

  9. 9.

    Wright A, Maloney FL, Feblowitz JC. Clinician attitudes toward and use of electronic problem lists: a thematic analysis. BMC Med Inform Decis Mak. 2011;11(36):1–10.

    Google Scholar 

  10. 10.

    Wright A, Feblowitz J, Maloney F, Henkin S, Ramelson H, Feltman J, et al. Increasing patient engagement: patients’ responses to viewing problem lists online. Appl Clin Inform. 2014;5(04):930–42.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  11. 11.

    Wright A, Pang J, Feblowitz JC, Maloney FL, Wilcox AR, Ramelson HZ, et al. A method and knowledge base for automated inference of patient problems from structured data in an electronic medical record. J Am Med Inform Assoc. 2011;18(6):859–67.

    PubMed  PubMed Central  Article  Google Scholar 

  12. 12.

    Kaplan DM. Clear writing, clear thinking and the disappearing art of the problem list. J Hosp Med. 2007;2(4):199–202.

    PubMed  Article  PubMed Central  Google Scholar 

  13. 13.

    Tang PC, LaRosa MP, Gorden SM. Use of computer-based records, completeness of documentation, and appropriateness of documented clinical decisions. J Am Med Inform Assoc. 1999;6(3):245–51.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  14. 14.

    Porter AS, O’Callaghan J, Englund KA, Lorenz RR, Kodish E. Problems with the problem list: challenges of transparency in an era of patient curation. J Am Med Inform Assoc. 2020;27:981–4.

    PubMed  Article  Google Scholar 

  15. 15.

    Hyppönen H, Saranto K, Vuokko R, Mäkelä-Bengs P, Doupi P, Lindqvist M, et al. Impacts of structuring the electronic health record: a systematic review protocol and results of previous reviews. Int J Med Inform. 2014;83(3):159–69.

    PubMed  Article  PubMed Central  Google Scholar 

  16. 16.

    Klappe ES, de Keizer NF, Cornet R. Factors influencing problem list use in electronic health records—application of the unified theory of acceptance and use of technology. Appl Clin Inform. 2020;11(03):415–26.

    PubMed  Article  PubMed Central  Google Scholar 

  17. 17.

    Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform. 2009;42(5):839–51.

    PubMed  PubMed Central  Article  Google Scholar 

  18. 18.

    Elkin PL, Brown SH, Bauer BA, Husser CS, Carruth W, Bergstrom LR, et al. A controlled trial of automated classification of negation from clinical notes. BMC Med Inform Decis Mak. 2005;5(1):13.

    PubMed  PubMed Central  Article  Google Scholar 

  19. 19.

    Morante R, Daelemans W, editors. A metalearning approach to processing the scope of negation. In: Proceedings of conference on CoNLL-2009; 2009.

  20. 20.

    Velupillai S, editor. Automatic classification of factuality levels: A case study on Swedish diagnoses and the impact of local context. In: 4th international symposium on languages in biology and medicine, LBM 2011; 2011.

  21. 21.

    Chapman BE, Lee S, Kang HP, Chapman WW. Document-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithm. J Biomed Inform. 2011;44(5):728–37.

    PubMed  PubMed Central  Article  Google Scholar 

  22. 22.

    Saurí R, Pustejovsky J. FactBank: a corpus annotated with event factuality. Lang Resour Eval. 2009;43(3):227.

    Article  Google Scholar 

  23. 23.

    Dalianis H. Semantic analysis and concept extraction.

  24. 24.

    Light M, Qiu XY, Srinivasan P, editors. The language of bioscience: Facts, speculations, and statements in between. In: HLT-NAACL 2004 workshop: linking biological literature, ontologies and databases; 2004.

  25. 25.

    Seiden SC, Barach P. Wrong-side/wrong-site, wrong-procedure, and wrong-patient adverse events: are they preventable? AMA Arch Surg. 2006;141(9):931–9.

    PubMed  Article  Google Scholar 

  26. 26.

    Stearns M. Value of ICD-10 at the Point of Care. Journal of AHIMA website. 2015.

  27. 27.

    Afzal Z, Pons E, Kang N, Sturkenboom MC, Schuemie MJ, Kors JA. ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinform. 2014;15:373.

    Article  Google Scholar 

  28. 28.

    Holmes C, Brown M, St Hilaire D, Wright A. Healthcare provider attitudes towards the problem list in an electronic health record: a mixed-methods qualitative study. BMC Med Inform Decis Mak. 2012;12:127.

    PubMed  PubMed Central  Article  Google Scholar 

  29. 29.

    Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004;11(5):392–402.

    PubMed  PubMed Central  Article  Google Scholar 

  30. 30.

    Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J, et al. DEEPEN: a negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform. 2015;54:213–9.

    PubMed  PubMed Central  Article  Google Scholar 

  31. 31.

    Horsky J, Drucker EA, Ramelson HZ, editors. Accuracy and completeness of clinical coding using ICD-10 for ambulatory visits. In: AMIA annual symposium proceedings; 2017.

  32. 32.

    Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. 2001;34(5):301–10.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  33. 33.

    Diaz-Garelli J-F, Strowd R, Ahmed T, Wells BJ, Merrill R, Laurini J, et al. A tale of three subspecialties: diagnosis recording patterns are internally consistent but specialty-dependent. JAMIA Open. 2019;2:369–77.

    PubMed  PubMed Central  Article  Google Scholar 

  34. 34.

    Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: data quality issues and informatics opportunities. Summit Translat Bioinform. 2010;2010:1–5.

    Google Scholar 

  35. 35.

    Rinott R, Torresani M, Bertulli R, Goldsteen A, Casali P, Carmeli B, et al. Automatic detection of inconsistencies between free text and coded data in Sarcoma discharge letters. Stud Health Technol Inform. 2012;180:661–6.

    PubMed  Google Scholar 

  36. 36.

    Stein HD, Nadkarni P, Erdos J, Miller PL. Exploring the degree of concordance of coded and textual data in answering clinical queries from a clinical data repository. J Am Med Inform Assoc. 2000;7(1):42–54.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  37. 37.

    Singh H, Mani S, Espadas D, Petersen N, Franklin V, Petersen LA. Prescription errors and outcomes related to inconsistent information transmitted through computerized order entry: a prospective study. Arch Intern Med. 2009;169(10):982–9.

    PubMed  PubMed Central  Article  Google Scholar 

  38. 38.

    Martin PM, Sbaffi L. Electronic Health Record and Problem Lists in Leeds, United Kingdom: variability of general practitioners’ views. Health Informatics J. 2020;26(3):1898–911.

    PubMed  Article  Google Scholar 

  39. 39.

    Soriano IM, Pena JLC. Automatic medical concept extraction from free text clinical reports, a new named entity recognition approach. Int J Comput Appl. 2017;2:38–46.

    Google Scholar 

  40. 40.

    Aronson AR, editor. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: AMIA annual symposium proceedings; 2001: American Medical Informatics Association.

  41. 41.

    Zou Q, Chu WW, Morioka C, Leazer GH, Kangarloo H, editors. IndexFinder: a method of extracting key concepts from clinical texts for indexing. In: AMIA annual symposium proceedings; 2003: American Medical Informatics Association.

  42. 42.

    Dentler K, Cornet R, ten Teije A, Tanis P, Klinkenbijl J, Tytgat K, et al. Influence of data quality on computed Dutch hospital quality indicators: a case study in colorectal cancer surgery. BMC Med Inform Decis Mak. 2014;14(1):32.

    PubMed  PubMed Central  Article  Google Scholar 

  43. 43.

    Dutch Hospital Data. Diagnosethesaurus. 2020. Accessed 17 Dec 2020.

  44. 44.

    Amsterdam UMC. Specialismen. 2020. Accessed 9 July 2020.

  45. 45.

    McHugh ML. Interrater reliability: the kappa statistic. Biochem Med. 2012;22(3):276–82.

    Article  Google Scholar 

  46. 46.

    Parikh R, Mathai A, Parikh S, Sekhar GC, Thomas R. Understanding and using sensitivity, specificity and predictive values. Indian J ophthalmol. 2008;56(1):45.

    PubMed  PubMed Central  Article  Google Scholar 

  47. 47.

    Wong HB, Lim GH. Measures of diagnostic accuracy: sensitivity, specificity. PPV NPV Proc Singap Healthc. 2011;20(4):316–8.

    Article  Google Scholar 

  48. 48.

    Lee CH, Cook S, Lee JS, Han B. Comparison of two meta-analysis methods: inverse-variance-weighted average and weighted sum of Z-scores. Genomics inform. 2016;14(4):173.

    PubMed  PubMed Central  Article  Google Scholar 

  49. 49.

    Sánchez-Meca J, Marin-Martinez F. Weighting by inverse variance or by sample size in meta-analysis: a simulation study. Educ Psychol Meas. 1998;58(2):211–20.

    Article  Google Scholar 

  50. 50.

    Littenberg B, Moses LE. Estimating diagnostic accuracy from multiple conflicting reports: a new meta-analytic method. Med Decis Making. 1993;13(4):313–21.

    CAS  PubMed  Article  Google Scholar 

  51. 51.

    Haldanse B. The estimation and significance of the logarithm of a ratio of frequencies. Ann Hum Genet. 1956;20(4):309–11.

    Article  Google Scholar 

  52. 52.

    Glas AS, Lijmer JG, Prins MH, Bonsel GJ, Bossuyt PM. The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol. 2003;56(11):1129–35.

    PubMed  Article  Google Scholar 

  53. 53.

    Zwinderman AH, Cleophas TF. Statistics applied to clinical trials

  54. 54.

    Rogan WJ, Gladen B. Estimating prevalence from the results of a screening test. Am J Epidemiol. 1978;107(1):71–6.

    CAS  PubMed  Article  Google Scholar 

  55. 55.

    Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012;13(6):395–405.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  56. 56.

    Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR med inform. 2019;7(2):e12239.

    PubMed  PubMed Central  Article  Google Scholar 

  57. 57.

    Fivez P, Šuster S, Daelemans W. Unsupervised context-sensitive spelling correction of English and Dutch clinical free-text with word and character n-gram embeddings. arXiv preprint arXiv:1710.07045. 2017.

  58. 58.

    Ceusters W, Blaisure J. Caveats for the use of the active problem list as ground truth for decision support. Decis Support Sys Educ: Help Support Healthc; 2018. p. 10–4.

    Google Scholar 

  59. 59.

    Ford E, Nicholson A, Koeling R, Tate AR, Carroll J, Axelrod L, et al. Optimising the use of electronic health records to estimate the incidence of rheumatoid arthritis in primary care: what information is hidden in free text? BMC Med Res Methodol. 2013;13(1):105.

    PubMed  PubMed Central  Article  Google Scholar 

  60. 60.

    Diaz-Garelli J-F, Strowd R, Wells BJ, Ahmed T, Merrill R, Topaloglu U. Lost in translation: diagnosis records show more inaccuracies after biopsy in oncology care EHRs. AMIA Jt Summits Transl Sci Proc. 2019;2019:325.

    PubMed  PubMed Central  Google Scholar 

  61. 61.

    Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG, editors. Evaluation of negation phrases in narrative clinical reports. In: AMIA annual symposium proceedings; 2001. American Medical Informatics Association.

  62. 62.

    Garcelon N, Neuraz A, Benoit V, Salomon R, Burgun A. Improving a full-text search engine: the importance of negation detection and family history context to identify cases in a biomedical data warehouse. J Am Med Inform Assoc. 2017;24(3):607–13.

    PubMed  Article  Google Scholar 

  63. 63.

    Ao H, Takagi T. ALICE: an algorithm to extract abbreviations from MEDLINE. J Am Med Inform Assoc. 2005;12(5):576–86.

    PubMed  PubMed Central  Article  Google Scholar 

  64. 64.

    Samaranayake NR, Cheung DS, Lam MP, Cheung TT, Chui WC, Wong IC, et al. The effectiveness of a ‘Do Not Use’ list and perceptions of healthcare professionals on error-prone abbreviations. Int J Clin Pharm-Net. 2014;36(5):1000–6.

    Article  Google Scholar 

  65. 65.

    Velupillai S, Skeppstedt M, Kvist M, Mowery D, Chapman BE, Dalianis H, et al. Cue-based assertion classification for Swedish clinical text—developing a lexicon for pyConTextSwe. Artif Intell Med. 2014;61(3):137–44.

    PubMed  PubMed Central  Article  Google Scholar 

  66. 66.

    Klappe ES. UnLaTem. Github; 2021. Accessed 25 Feb 2021.

Download references


The authors would like to thank Prof. dr. A. H. Zwinderman from the department of Clinical Epidemiology, Biostatistics and Bioinformatics for assistance in the statistical research design.


This research received no specific grant from any funding agency in the public, non-profit or commercial sector. This study was funded by Amsterdam UMC 2019-AMC-JK-7. Amsterdam UMC did not have any role in the study design, collection, analysis, interpretation of the data, writing the report and the decision to submit the report for publication.

Author information




ESK performed the majority of the analysis of the data and the algorithm development and wrote the first draft of the article. FVP was a major contributor in writing the algorithm. ESK and FVP manually annotated the validation sets. RC and NFK conceptualize the study and supervised the process and commented on the drafts of the article presented by ESK. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Eva S. Klappe.

Ethics declarations

Ethics approval and consent to participate

The need for ethical approval was waived by the Medical Ethical Testing Committee (METC) of Amsterdam UMC (W19_300 # 19.357) because the study did not fall under the scope of the Medical Research Involving Human Subjects Act and data was anonymized.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Translation of Figure 1.

Additional file 2.

Code to determine contextual properties and outcomes per specialty.


Appendix 1. Recategorization of specialties

In the multicenter dataset (n = 175,210), 62 different specialty names were available (shown in the left column of Table 7). The right column shows the renaming of the specialties. For example, the records for Audiological Centres (Dutch: audiologische centra) are combined with audiology (Dutch: audiologie).

Table 7 Original names of the specialties and the new names of the specialties

Appendix 2. Performance formulas

2A: Recall, specificity, prevalence and performance formulas

For the calculation of the recall (or sensitivity), specificity and overall performance, we used confusion matrices similar to Table 8.

Table 8 Confusion matrix example

We categorized four possible outcomes; True Positives (TP), False Positives (FP), False Negatives (FN) and True Negatives (TN). TP is a correctly identified description, FP is an incorrectly identified description, FN is an incorrectly denied description and TN is a correctly denied description.

With these numbers, the performance of the algorithm UnLaTem can be assessed by:

  • Actual prevalence, or the proportion of records with a definitive modification type: TP + FN / (TP + FN + FP + TN)

  • Precision: TP / (TP + FP)

  • Recall: TP / (TP + FN)

  • Specificity: TN / (FP + TN)

  • Standard error (SE), where n = sample size, such that:

    • SE recall: √ TP*FN / (TP + FN)^3

    • SE specificity: √ FP*TN / (FP + TN)^3

    • SE proportion of patients with a definitive property (prevalence): √ (TP + FP)(FN + TN)/ (TP + FN + FP + TN)^3

    • SE precision: √ p * (1-p)/n, where p = precision of the internal or multicenter validation set

  • Weighted means, where p = recall, specificity, prevalence or precision of the internal validation set (1) or the multicenter validation set (2):

    $$\left( {{\text{p}}/{\text{SE}}_{{{\text{P1}}}}^{2} + {\text{ p}}/{\text{SE}}_{{{\text{P2}}}}^{2} } \right) \, / \, \left( { \, 1/{\text{ SE}}_{{{\text{P1}}}}^{2} + \, 1/{\text{ SE}}_{{{\text{P2}}}}^{2} } \right)$$
    • Note that for the weighted means of the recall, specificity and prevalence we added 0.5 to all counts

2B: Prediction formulas

Here, “the property type” refers to uncertainty, laterality, temporality or removal of uncertainty.

  • The estimated number of descriptions that will truly have the property type (Dtrue):

    • Weighted mean of the prevalence * number of modified descriptions (175,210)

      • Uncertainty: 19,240

      • Laterality: 47,932

      • Temporality: 21,663

      • Removal of uncertainty: 2,140

  • The number of descriptions that will be recognized by the algorithm to have the property type (Atrue):

    • Dtrue * weighted mean of the recall

      • Uncertainty: 18,869

      • Laterality: 47,880

      • Temporality: 21,091

      • Removal of uncertainty: 1,933

  • The number of descriptions that will truly NOT have the property type (Dfalse):

    • Number of modified descriptions (175,210) − Dtrue

      • Uncertainty: 155,970

      • Laterality: 127,278

      • Temporality: 153,547

      • Removal of uncertainty: 173,070

  • Wrongly categorized descriptions (Afalse)

    • (1-Specificity) * Dfalse

      • Uncertainty: 2,110

      • Laterality: 1,832

      • Temporality: 1,083

      • Removal of uncertainty: 341

  • Total descriptions determined by the algorithm to have the property type (Atotal):

    • Afalse + Atrue

      • Uncertainty: 20,979

      • Laterality: 49,713

      • Temporality: 22,174

      • Removal of uncertainty: 2,275

  • Estimated percentage of descriptions to have the property type (%):

    • 100 * (1 − Afalse/Atotal)

      • Uncertainty: 89.9

      • Laterality: 96.3

      • Temporality: 95.1

      • Removal of uncertainty: 85.0

2C: Actual prevalence, using the Rogan–Gladen estimator [54]

  • Actual prevalence = Apparent prevalence + (weighted mean of the specificity − 1) / weighted mean of the specificity + (weighted mean of the recall − 1)

    • Uncertainty (apparent prevalence: 16,779): 17,347

    • Laterality (apparent prevalence: 54,081):54,931

    • Temporality (apparent prevalence: 16,582): 17,156

    • Removal of uncertainty (apparent prevalence: 1,991):2,208

Appendix 3. Confusion matrices

Confusion matrices per modification type for the internal (n = 980) and multicenter (n = 996) validation sets.

Internal validation

See Tables 9, 10, 11 and 12.

Table 9 The uncertainty confusion matrix of the internal validation set
Table 10 The laterality confusion matrix of the internal validation set
Table 11 The temporality confusion matrix of the internal validation set
Table 12 The removal of uncertainty confusion matrix of the internal validation set

Multicenter validation

See Tables 13, 14, 15 and 16.

Table 13 The uncertainty confusion matrix of the multicenter validation set
Table 14 The laterality confusion matrix of the multicenter validation set
Table 15 The temporality confusion matrix of the multicenter validation set
Table 16 The removal of uncertainty confusion matrix of the multicenter validation set

Appendix 4. Specialties and contextual properties

See Table 17.

Table 17 Specialties and properties in n and % in the multicenter dataset (n = 175,210). Ordered by descending number of modified descriptions. Shown is the actual prevalence, which was determined using the Rogan–Gladen estimator

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Klappe, E.S., van Putten, F.J.P., de Keizer, N.F. et al. Contextual property detection in Dutch diagnosis descriptions for uncertainty, laterality and temporality. BMC Med Inform Decis Mak 21, 120 (2021).

Download citation


  • Electronic health record
  • Problem list
  • Problem-oriented medical record
  • Rule-based algorithm development
  • Single-center and multicenter validation
  • Reuse of clinical data