Table 3 Specific data cleaning techniques used on each dataset

From: The effect of data cleaning on record linkage quality

Synthetic data

Fields available for linkage: forename, surname, date of birth, sex, postcode

| Technique | No cleaning | Minimal cleaning | High cleaning |
| --- | --- | --- | --- |
| Reformat values | Not required | Not required | Not required |
| Remove alt. missing values and uninformative values | | Invalid dates of birth removed; invalid postcode values removed | Invalid dates of birth removed; invalid postcode values removed |
| Remove punctuation | | Both forename and surname fields had all punctuation and spaces removed | Both forename and surname fields had all punctuation and spaces removed |
| Nickname lookup | | | Nicknames were changed to their more common variant |
| Sex imputation | | | Records with missing sex had a value imputed based on their first name |
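As an illustration only, the high cleaning steps listed for the synthetic data could be sketched in Python as below. The lookup tables `NICKNAMES` and `NAME_TO_SEX`, the function names, and the upper-casing of names are assumptions made for this sketch; the source does not say which lookup lists or implementation were used.

```python
import string

# Hypothetical lookup tables; the source does not list the ones it used.
NICKNAMES = {"BILL": "WILLIAM", "BOB": "ROBERT", "LIZ": "ELIZABETH"}
NAME_TO_SEX = {"WILLIAM": "M", "ROBERT": "M", "ELIZABETH": "F"}

# Characters to delete: ASCII punctuation plus all whitespace.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)


def strip_punctuation(name):
    """Remove punctuation and spaces from a name field."""
    return name.translate(_STRIP)


def clean_record(record):
    """Return a cleaned copy of a record dict (high cleaning level)."""
    cleaned = dict(record)
    # Upper-casing is an assumption so that the lookups ignore case.
    forename = strip_punctuation(record.get("forename") or "").upper()
    surname = strip_punctuation(record.get("surname") or "").upper()
    # Nickname lookup: map a nickname to its more common variant.
    forename = NICKNAMES.get(forename, forename)
    cleaned["forename"] = forename
    cleaned["surname"] = surname
    # Sex imputation: only fill in a value when sex is missing.
    if not record.get("sex"):
        cleaned["sex"] = NAME_TO_SEX.get(forename)
    return cleaned
```

For example, `clean_record({"forename": "Bill", "surname": "O'Neil-Smith", "sex": None})` yields the forename `WILLIAM`, the surname `ONEILSMITH`, and an imputed sex of `M`. A recorded sex value is never overwritten, which matches the table's description of imputing only for records with missing sex.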

Hospital admissions data

Fields available for linkage: forename, middle name, surname, sex, date of birth, address, suburb, postcode, state

| Technique | No cleaning | Minimal cleaning | High cleaning |
| --- | --- | --- | --- |
| Reformat values | Date of birth reformatted | Date of birth reformatted | Date of birth reformatted |
| Remove alt. missing values and uninformative values | | Invalid dates of birth removed; invalid postcode values removed (‘9999’ etc.); uninformative address and suburb values removed (‘NO FIXED ADDRESS’, ‘UNKNOWN’ etc.); birth information encoded in first name removed (‘TWIN ONE OF MARTHA’ etc.) | Invalid dates of birth removed; invalid postcode values removed (‘9999’ etc.); uninformative address and suburb values removed (‘NO FIXED ADDRESS’, ‘UNKNOWN’ etc.); birth information encoded in first name removed (‘TWIN ONE OF MARTHA’ etc.) |
| Remove punctuation | | Forename, middle name, surname and suburb fields had all punctuation and spaces removed | Forename, middle name, surname and suburb fields had all punctuation and spaces removed |
| Nickname lookup | | | Nicknames were changed to their more common variant |
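
The removal of alternative missing values and uninformative values in the hospital admissions data might look roughly like the following Python sketch. The placeholder sets contain only the examples quoted in the table, and the `date_of_birth` key, the ISO date format, the `TWIN` pattern, and the choice to blank the whole forename are all assumptions of this sketch rather than details from the source.

```python
import re
from datetime import datetime

# Placeholder values taken from the examples in the table; the full
# lists used in the source are not given.
INVALID_POSTCODES = {"9999"}
UNINFORMATIVE = {"NO FIXED ADDRESS", "UNKNOWN"}
# Assumed pattern for birth information stored in the forename field.
BIRTH_INFO = re.compile(r"^TWIN\b", re.IGNORECASE)


def valid_date(value, fmt="%Y-%m-%d"):
    """Return value if it parses in the assumed format, else None."""
    try:
        datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        return None
    return value


def blank_uninformative(record):
    """Return a copy of record with uninformative values set to None."""
    cleaned = dict(record)
    # Invalid (or missing) dates of birth become None.
    cleaned["date_of_birth"] = valid_date(record.get("date_of_birth"))
    # Placeholder postcodes carry no linkage information.
    if record.get("postcode") in INVALID_POSTCODES:
        cleaned["postcode"] = None
    # Placeholder address and suburb values are treated as missing.
    for field in ("address", "suburb"):
        value = record.get(field)
        if value and value.strip().upper() in UNINFORMATIVE:
            cleaned[field] = None
    # Assumption: a forename holding birth information is blanked entirely.
    forename = record.get("forename")
    if forename and BIRTH_INFO.match(forename.strip()):
        cleaned["forename"] = None
    return cleaned
```

Setting these values to `None` instead of leaving the placeholder text in place means that two records sharing `UNKNOWN` as an address are not treated as agreeing on that field during linkage.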