Archived Comments for: Development and evaluation of an open source software tool for deidentification of pathology reports

  1. Comparing de-identification methods

    Jules Berman, Association for Pathology Informatics

    31 March 2006

    Bruce Beckwith and colleagues have made an important contribution to the field of data scrubbing and data de-identification. To their credit, they made all their source code and Java files publicly available and free. The paper is well written and data-driven.

    I have discussed the issue of data scrubbing strategies with Dr. Beckwith on several occasions. Basically, there seem to be two published approaches. One approach is to parse text and remove all the identifying words. This is the approach that Bruce Beckwith recommends.

    The second way is to parse text and to extract EVERY WORD EXCEPT words from an approved list of non-identifying words. That's the strategy that I have previously published.

    Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Arch Pathol Lab Med. 2003 Jun;127(6):680-6.

    There are advantages and disadvantages to both methods.

    Basically, if you write regex rules to extract identifying words (the method of Beckwith et al), you'll miss some identifiers. If you have a large corpus of text, you'll miss a lot of identifying words. Because it allows some identifiers to slip through, the method of Beckwith et al is best suited for limited data use agreements. In addition, the regex rules will need to change for different sets of records (radiology, surgical pathology, op notes, hospital A formats, hospital B formats). To accommodate changes in style and format, the list of regex rules will need to grow and grow. The software will become increasingly complex and slow.
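    To make the removal strategy concrete, here is a minimal sketch in Python (used here only for illustration). The handful of patterns below are hypothetical stand-ins, not the actual rule set of Beckwith et al; a real tool needs far more rules tuned to real report formats, which is exactly the maintenance burden described above.

    ```python
    import re

    # Hypothetical identifier patterns (assumption: a production tool would
    # have many more rules, tuned to each record format).
    PATTERNS = [
        re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+"),   # titled names
        re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),           # numeric dates
        re.compile(r"\b(?:January|February|March|April|May|June|July|"
                   r"August|September|October|November|December)"
                   r"\s+\d{1,2},?\s+\d{4}\b"),                # written dates
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                 # SSN-like numbers
    ]

    def scrub_by_removal(text):
        """Delete any span matching a known identifier pattern."""
        for pat in PATTERNS:
            text = pat.sub("[REMOVED]", text)
        return text

    print(scrub_by_removal("Dr. Brown reviewed the slide on March 14, 1985."))
    # -> [REMOVED] reviewed the slide on [REMOVED].
    ```

    Any identifier that no pattern anticipates (an unusual name format, a new date style) passes through untouched, which is why this approach can never guarantee a fully de-identified output.
    
    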

    If you use the concept match method, the algorithm is simple, fast, and virtually foolproof.

    No identifiers will be present in the output because the only words in the output come from an approved list of terms. But the output of the algorithm will be hard to read. Identifying words in the original text will be replaced by an asterisk, and the text may consist predominantly of asterisks if it contains many words and terms that are not present in the "approved" word list. This was noted, correctly, by Beckwith et al.
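    The extraction strategy can be sketched in a few lines of Python (used here only for illustration; the tiny approved list below is an assumed stand-in, not the published vocabulary):

    ```python
    import re

    # Tiny stand-in for an approved list of non-identifying words
    # (assumption: the real curated list is far larger).
    APPROVED = {"basal", "cell", "carcinoma", "margins", "involved",
                "is", "in", "the", "a", "has"}

    def scrub_by_extraction(text):
        """Keep only words on the approved list; replace everything else with *."""
        words = re.findall(r"[a-z0-9\-]+", text.lower())  # lowercase, keep letters/digits/hyphens
        return " ".join(w if w in APPROVED else "*" for w in words)

    print(scrub_by_extraction("Mr Brown has a basal cell carcinoma"))
    # -> * * has a basal cell carcinoma
    ```

    Nothing off the list can ever appear in the output, but every unlisted word costs an asterisk, which is the readability trade-off noted above.
    
    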

    I have written a much-improved version of my concept-match software that uses doublets (2-grams). The algorithm is now simpler than ever. There is an external list of "approved" word doublets (about 80,000 of them). The doublet list is chosen to contain no identifying terms. My current list of doublets was derived from two open source medical vocabularies. The algorithm is simple: the text is parsed, and all the doublets in the text that match a term in the approved list are retained. Everything else is replaced by an asterisk. It works fast (1 Mbyte per second on my 1.6 GHz CPU) and doesn't allow any unlisted doublets to slip through. It retains more words from the text than the original concept match algorithm.

    The value of using doublets (instead of approved words) is that a single seemingly innocuous word (like "No") can be a person's name ("Dr. No is in the hospital"). Because the doublets "Dr. No" and "No is" are not included in the approved doublet list, the identifying text will be excluded. On the other hand, accepted doublets, like "no way" or "no food", would be saved if they were included in the list of approved doublets.

    The method can be scripted in fewer than 20 lines of Perl.

    #!/usr/local/bin/perl
    #Copyright (C) 2006 Jules J. Berman
    #This program is free software; you can redistribute it and/or modify
    #it under the terms of the GNU General Public License as published by
    #the Free Software Foundation; either version 2 of the License, or
    #(at your option) any later version.
    #
    #The GNU license is available at: www.gnu.org/copyleft/gpl.txt
    #
    #This program is distributed in the hope that it will be useful,
    #but WITHOUT ANY WARRANTY; without even the implied warranty of
    #MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    #GNU General Public License for more details.
    #
    #As a courtesy, users of this script should cite the following
    #publication:
    #
    #Berman JJ. Concept-match medical data scrubbing. How pathology text
    #can be used in research. Arch Pathol Lab Med. 2003 Jun;127(6):680-6.
    #
    use Fcntl;      #Perl needs the standard Fcntl (file control) module for this script
    use SDBM_File;  #Perl needs the standard SDBM_File (simple database module) for this script
    tie %doubhash, "SDBM_File", 'scrub', O_RDWR, 0644; #ties the external doublet database
    #undef($/); #undefine the line separator so we can slurp a text file in one reading
    #open (TEXT, "scrub.txt")||die"Can't open file"; #open an external file to hold scrubbed text
    print "What would you like to scrub?\n";
    $line = <STDIN>;
    print "Scrubbed text.... ";
    $line = lc($line); #convert text to lowercase
    $line =~ s/[a-z]+\'[s]/possessive word/g; #replaces possessive constructions with a placeholder
    $line =~ s/[^a-z0-9 \-]/ /g; #replaces non-alphanumerics with a space
    @hoparray = split(/ +/,$line); #creates an ordered array from the text words
    $lastword = "\*";
    for ($i=0;$i<(scalar(@hoparray));$i++) #steps through the array
      {
      $doublet = "$hoparray[$i] $hoparray[$i+1]"; #finds successive overlapping word doublets
      if (exists $doubhash{$doublet}) #checks to see if doublet is in database of allowed doublets
        {
        print " $hoparray[$i]"; #prints the first word of the doublet
        $lastword = " $hoparray[$i+1]"; #saves the second word of the doublet
        }
      else
        {
        print $lastword; #doublet not in database, so print the second word of the last matching doublet
        $lastword = " \*"; #load an asterisk into the variable that contains the last matching word
        }
      }
    print "\n";
    exit;

    The external sdbm files (scrub.dir and scrub.pag) containing the approved doublets are available for download from the API resources page (http://www.pathologyinformatics.org/informatics_r.htm).

    Some actual output is shown:

    C:\ftp>perl tie_out2.pl
    What would you like to scrub?
    Basal cell carcinoma, margins involved
    Scrubbed text.... basal cell carcinoma margins involved

    C:\ftp>perl tie_out2.pl
    What would you like to scrub?
    Rhabdoid tumor of kidney
    Scrubbed text.... rhabdoid tumor of kidney

    C:\ftp>perl tie_out2.pl
    What would you like to scrub?
    Mr Brown has a basal cell carcinoma
    Scrubbed text.... * * has a basal cell carcinoma

    C:\ftp>perl tie_out2.pl
    What would you like to scrub?
    Mr. Brown was born on Tuesday, March 14, 1985
    Scrubbed text.... * * * * * * * * *

    C:\ftp>perl tie_out2.pl
    What would you like to scrub?
    The doctor killed the patient
    Scrubbed text.... * * * * *

    My opinion is that the method of Beckwith et al may be the best option if you're planning to share pathology records through a limited data use agreement.

    My doublet variant of the concept match method may be the best option if you're working with agnostic text (that doesn't fit any particular format) or if you're preparing data for public distribution.

    I plan to expand this work and eventually publish it. If Bruce Beckwith and colleagues would like to join me in a combined study, in which both methods are applied to the same corpus of text and the advantages and disadvantages of each are described in a controlled comparison, that would be most welcome.

    Competing interests

    I declare that I have no competing interests.
