
Table 2 De-identification performance

From: Clinical records anonymisation and text extraction (CRATE): an open-source software system

Metric                              Condition 1  Condition 2  Condition 3
Number of words in source text (n)  50,274       50,274       50,274
Hits                                1,392        1,326        1,116
False alarms                        275          132          25
For known identifiers (those recorded as structured information in the source database):
 Misses                             1            0            0
 Correct rejections                 48,606       48,816       49,113
 Sensitivity = recall               0.999        1            1
 Precision                          0.835        0.909        0.978
For all identifiers (including those not recorded as structured information in the source database):
 Misses                             127          125          128
 Correct rejections                 48,480       48,691       49,005
 Sensitivity = recall               0.916        0.914        0.897
 Precision                          0.835        0.909        0.978
Performance of the de-identifier on the same corpus of clinical documents under three different specimen configurations. The conditions differed in the definition of “identifying information” used, in whitelisting of geographical location, and in the method used for detecting fragments of addresses (see text). These differences also changed the number of hits counted: for example, successful masking of an address such as “29 Acacia Avenue” counted as one hit if masked to “[___]”, or as several hits if masked to “[___] [___] [___]”. A miss was defined as any identifier appearing in the destination text. Identifiers were defined very liberally, including a single initial, so the appearance of a single identifier in the destination text does not equate to identifying the patient concerned [13].
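
The sensitivity and precision rows can be recomputed from the counts in the table. The sketch below assumes the standard definitions, recall = hits / (hits + misses) and precision = hits / (hits + false alarms); the table does not spell these formulas out, but they reproduce its reported values to three decimal places. The function and variable names are illustrative and are not taken from CRATE.

```python
def recall(hits: int, misses: int) -> float:
    """Sensitivity: fraction of identifiers that were masked."""
    return hits / (hits + misses)


def precision(hits: int, false_alarms: int) -> float:
    """Fraction of masked items that were genuine identifiers."""
    return hits / (hits + false_alarms)


# Counts copied from Table 2, in the order Condition 1, 2, 3.
HITS = [1392, 1326, 1116]
FALSE_ALARMS = [275, 132, 25]
MISSES_KNOWN = [1, 0, 0]       # identifiers recorded as structured data
MISSES_ALL = [127, 125, 128]   # all identifiers, structured or not


def summarise() -> list[dict[str, float]]:
    """Return the derived metrics for each condition, rounded to 3 d.p."""
    rows = []
    for h, fa, mk, ma in zip(HITS, FALSE_ALARMS, MISSES_KNOWN, MISSES_ALL):
        rows.append({
            "recall_known": round(recall(h, mk), 3),
            "recall_all": round(recall(h, ma), 3),
            # Precision does not depend on misses, so it is identical
            # for the "known" and "all" identifier sections.
            "precision": round(precision(h, fa), 3),
        })
    return rows


if __name__ == "__main__":
    for i, row in enumerate(summarise(), start=1):
        print(f"Condition {i}: {row}")
```

Because precision depends only on hits and false alarms, it is the same in both halves of the table; only recall changes when unrecorded identifiers are counted as misses.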