
Table 1 Example regular expressions

From: Clinical records anonymisation and text extraction (CRATE): an open-source software system

Method                         Source information  Example of case-insensitive regular expression(s) for scrubbing
Words                          John Al’Rahem       \bJohn\b, \bRahem\b
Phrase                         4 Privet Drive      \b4\W+Privet\W+Drive\b
Number                         (01223) 123456      (?<!\d)0\W*1\W*2\W*2\W*3\W*1\W*2\W*3\W*4\W*5\W*6(?!\d)
Alphanumeric code              CB12 3DE            \bC\W*B\W*1\W*2\W*3\W*D\W*E\b
Date                           31 Dec 2016         0*31(?:st|nd|rd|th)?\W*(?:0*12|Dec(?:ember)?)\W*(?:20)?16
Nonspecific: 10-digit numbers  -                   (?<!\d)[0-9][ \t]*[0-9][ \t]*[0-9][ \t]*[0-9][ \t]*[0-9][ \t]*[0-9][ \t]*[0-9][ \t]*[0-9][ \t]*[0-9][ \t]*[0-9](?!\d)
Nonspecific: UK postcodes      -                   \b[A-Z][0-9]\s*[0-9][A-Z][A-Z]\b
  1. For the method specified in the first column, the de-identifying software takes individual instances of sensitive data from scrub-source fields (an example is shown in the second column) and generates regular expressions (third column) with which to scrub sensitive free-text fields. Examples are shown at their default settings; all methods can be configured further (e.g., to allow typographical errors, to set minimum word lengths for scrubbing, or to change boundary detection conditions for the regular expressions), as described in the text. All source examples are fictional. Emphasis added for clarity.
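
The generation step described in the footnote can be sketched in Python as follows. This is a minimal illustration that reproduces the table's example patterns from the example source values. It is not CRATE's actual implementation: the function names, the `min_length=3` threshold (assumed here to explain why "Al" yields no expression), and the `[XXX]` replacement text are all hypothetical.

```python
import re


def word_regexes(source, min_length=3):
    """One whole-word regex per word; short words are skipped (assumed threshold)."""
    words = re.findall(r"\w+", source)
    return [r"\b" + re.escape(w) + r"\b" for w in words if len(w) >= min_length]


def phrase_regex(source):
    """Words of the phrase in order, separated by any run of non-word characters."""
    words = re.findall(r"\w+", source)
    return r"\b" + r"\W+".join(re.escape(w) for w in words) + r"\b"


def number_regex(source):
    """Digits in order, optional non-word characters between them, no adjacent digits."""
    digits = re.findall(r"\d", source)
    return r"(?<!\d)" + r"\W*".join(digits) + r"(?!\d)"


def code_regex(source):
    """Alphanumeric characters in order, optional non-word characters between them."""
    chars = re.findall(r"\w", source)
    return r"\b" + r"\W*".join(re.escape(c) for c in chars) + r"\b"


def nonspecific_digits_regex(n=10):
    """Any run of exactly n digits, allowing spaces or tabs between digits."""
    return r"(?<!\d)" + r"[ \t]*".join(["[0-9]"] * n) + r"(?!\d)"


def scrub(text, patterns, replacement="[XXX]"):
    """Apply every pattern case-insensitively, replacing each match."""
    for pattern in patterns:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


patterns = (
    word_regexes("John Al’Rahem")
    + [phrase_regex("4 Privet Drive"), number_regex("(01223) 123456")]
)
scrubbed = scrub("Mr JOHN lives at 4, Privet  Drive; tel 01223-123456", patterns)
print(scrubbed)  # Mr [XXX] lives at [XXX]; tel [XXX]
```

Building each pattern from individual words, digits, or characters joined by `\W+` or `\W*` is what lets a single source value match differently punctuated or spaced variants in free text.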