| Metric | Condition 1 | Condition 2 | Condition 3 |
|---|---|---|---|
| Number of words in source text (n) | 50,274 | 50,274 | 50,274 |
| Hits | 1,392 | 1,326 | 1,116 |
| False alarms | 275 | 132 | 25 |
| For known identifiers (those recorded as structured information in the source database): | | | |
| Misses | 1 | 0 | 0 |
| Correct rejections | 48,606 | 48,816 | 49,113 |
| Sensitivity = recall | 0.999 | 1 | 1 |
| Precision | 0.835 | 0.909 | 0.978 |
| For all identifiers (including those not recorded as structured information in the source database): | | | |
| Misses | 127 | 125 | 128 |
| Correct rejections | 48,480 | 48,691 | 49,005 |
| Sensitivity = recall | 0.916 | 0.914 | 0.897 |
| Precision | 0.835 | 0.909 | 0.978 |
Performance of the de-identifier on the same corpus of clinical documents under three different specimen configurations. The conditions differed in the definition of “identifying information” used, in whitelisting of geographical location, and in the method used for detecting fragments of addresses (see text). These differences also lead to variation in the number of hits counted: for example, successful masking of an address such as “29 Acacia Avenue” could be counted as one hit, if masked to “[___]”, or as several hits, if masked to “[___] [___] [___]”. A miss was defined as any identifier appearing in the destination text. Identifiers were defined very liberally, including a single initial, so the appearance of a single identifier in the destination text does not equate to identifying the patient concerned [13].
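As a cross-check on the table, the sensitivity and precision rows can be recomputed from the raw counts. The sketch below assumes the standard definitions, recall = hits / (hits + misses) and precision = hits / (hits + false alarms). The names `Counts`, `ALL_IDENTIFIERS`, `KNOWN_IDENTIFIERS`, and `summarise` are hypothetical, introduced only for illustration, and are not part of the de-identifier itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Word-level counts for one condition (hypothetical helper type)."""
    hits: int          # identifiers correctly masked
    false_alarms: int  # non-identifiers masked
    misses: int        # identifiers left in the destination text

    def recall(self) -> float:
        # Sensitivity: proportion of identifiers that were masked.
        return self.hits / (self.hits + self.misses)

    def precision(self) -> float:
        # Proportion of masked words that really were identifiers.
        return self.hits / (self.hits + self.false_alarms)


# Counts copied from the table: hits, false alarms, misses.
ALL_IDENTIFIERS = {
    "Condition 1": Counts(1392, 275, 127),
    "Condition 2": Counts(1326, 132, 125),
    "Condition 3": Counts(1116, 25, 128),
}
KNOWN_IDENTIFIERS = {
    "Condition 1": Counts(1392, 275, 1),
    "Condition 2": Counts(1326, 132, 0),
    "Condition 3": Counts(1116, 25, 0),
}


def summarise(table: dict[str, Counts]) -> dict[str, tuple[float, float]]:
    """Return (recall, precision) per condition, rounded to 3 decimals."""
    return {
        name: (round(c.recall(), 3), round(c.precision(), 3))
        for name, c in table.items()
    }


if __name__ == "__main__":
    for label, table in (("known", KNOWN_IDENTIFIERS), ("all", ALL_IDENTIFIERS)):
        for name, (recall, precision) in summarise(table).items():
            print(f"{label:5s} {name}: recall={recall:.3f} precision={precision:.3f}")
```

Precision does not depend on the miss count, which is why the two precision rows in the table are identical; only recall changes when identifiers not recorded in the source database are included.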