
Table 1 The data sets that will be included in our simulation

From: Estimating the re-identification risk of clinical data sets

| Data set | Description | Quasi-identifiers | No. Records |
| --- | --- | --- | --- |
| Adult | The adult dataset from the UC Irvine machine learning data repository. This is an extract from the US census and has common demographics and socio-economic status variables: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult | Age; Profession; Education; Marital status; Race; Sex; Country | 32,561 |
| FARS | Department of Transportation Fatal crash information: http://www-fars.nhtsa.dot.gov/main.cfm | Age; Race; Month of Death; Day of Death | 43,330 |
| CUP | Data from the Paralyzed Veterans Association on veterans with spinal cord injuries or disease: http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html | ZIP code; Age; Gender; Income | 95,412 |
| Pharm | Prescription records from the Children’s Hospital of Eastern Ontario pharmacy from July 2006 to March 2009. This is for inpatients only and excludes acute cases. A de-identified version of this data was disclosed to commercial data aggregators [67]. | Age; Postal code (FSA); Admission date; Discharge date; Sex | 16,424 |
| ED | Emergency department records from Children’s Hospital of Eastern Ontario from 1st June 2007 to 1st June 2009. This data is disclosed for the purpose of disease outbreak surveillance. | Admission date; Postal Code; Date of Birth; Sex | 108,344 |
| Niday | A registry of all newborns in Ontario from 1st April 2004 to 31st March 2009. This data set is used frequently for research purposes: http://www.bornontario.ca | Maternal postal code; Baby DoB; Mother DoB; Baby sex | 637,964 |
 
Each data set is treated as a population. The table shows the size of each data set and the variables included in the analysis.
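
To illustrate what treating a data set as a population over a set of quasi-identifiers can mean in practice, here is a minimal sketch. It assumes that records sharing the same quasi-identifier values form an equivalence class, and that records alone in their class are population uniques. The function names, the field names, and the toy records are hypothetical and are not drawn from any of the data sets above. The table itself does not specify this computation.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def proportion_unique(records, quasi_identifiers):
    """Fraction of records that are the only member of their equivalence class."""
    if not records:
        return 0.0
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    uniques = sum(1 for n in sizes.values() if n == 1)
    return uniques / len(records)


# Hypothetical toy "population" (illustrative values only).
toy_population = [
    {"age": 34, "sex": "F", "race": "White"},
    {"age": 34, "sex": "F", "race": "White"},
    {"age": 51, "sex": "M", "race": "Black"},
    {"age": 29, "sex": "F", "race": "Asian"},
]
toy_quasi_identifiers = ["age", "sex", "race"]

sizes = equivalence_class_sizes(toy_population, toy_quasi_identifiers)
share_unique = proportion_unique(toy_population, toy_quasi_identifiers)
```

In this toy example two of the four records are unique on the chosen quasi-identifiers. Adding more quasi-identifiers, or finer-grained ones such as full dates or postal codes, generally shrinks the equivalence classes.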