Table 2 De-identification performance

From: Clinical records anonymisation and text extraction (CRATE): an open-source software system

| Metric | Condition 1 | Condition 2 | Condition 3 |
|---|---|---|---|
| Number of words in source text (n) | 50,274 | 50,274 | 50,274 |
| Hits | 1,392 | 1,326 | 1,116 |
| False alarms | 275 | 132 | 25 |
| For known identifiers (those recorded as structured information in the source database): | | | |
| Misses | 1 | 0 | 0 |
| Correct rejections | 48,606 | 48,816 | 49,113 |
| Sensitivity = recall | 0.999 | 1 | 1 |
| Precision | 0.835 | 0.909 | 0.978 |
| For all identifiers (including those not recorded as structured information in the source database): | | | |
| Misses | 127 | 125 | 128 |
| Correct rejections | 48,480 | 48,691 | 49,005 |
| Sensitivity = recall | 0.916 | 0.914 | 0.897 |
| Precision | 0.835 | 0.909 | 0.978 |

Performance of the de-identifier on the same corpus of clinical documents under three different specimen configurations. The conditions differed in the definition of “identifying information” used, in the whitelisting of geographical locations, and in the method used to detect fragments of addresses (see text). These differences also change the number of hits counted: for example, successful masking of an address such as “29 Acacia Avenue” counts as one hit if masked to “[___]”, but as several hits if masked to “[___] [___] [___]”. A miss was defined as any identifier appearing in the destination text. Identifiers were defined very liberally, to include even a single initial, so the appearance of a single identifier in the destination text does not equate to identifying the patient concerned [13].
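The sensitivity and precision rows can be recomputed from the hit, miss and false-alarm counts. The sketch below assumes the standard signal-detection definitions, sensitivity = hits / (hits + misses) and precision = hits / (hits + false alarms). The table does not state these formulas explicitly, and the function and variable names are illustrative rather than taken from CRATE.

```python
# Counts copied from Table 2, per condition:
# (hits, false alarms, misses for known identifiers, misses for all identifiers)
COUNTS = {
    "Condition 1": (1392, 275, 1, 127),
    "Condition 2": (1326, 132, 0, 125),
    "Condition 3": (1116, 25, 0, 128),
}


def sensitivity(hits: int, misses: int) -> float:
    """Fraction of identifiers that were masked (assumed definition of recall)."""
    return hits / (hits + misses)


def precision(hits: int, false_alarms: int) -> float:
    """Fraction of masked words that really were identifiers (assumed definition)."""
    return hits / (hits + false_alarms)


for name, (hits, false_alarms, misses_known, misses_all) in COUNTS.items():
    print(
        f"{name}: "
        f"sensitivity (known) = {sensitivity(hits, misses_known):.3f}, "
        f"sensitivity (all) = {sensitivity(hits, misses_all):.3f}, "
        f"precision = {precision(hits, false_alarms):.3f}"
    )
```

Rounded to three decimal places, these values agree with the sensitivity and precision rows of the table. Precision is identical in the two sub-tables because it depends only on hits and false alarms, which do not change with the definition of a miss.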