
Table 1 The data sets that will be included in our simulation

From: Estimating the re-identification risk of clinical data sets

Adult (No. records: 32,561)
Description: The adult dataset from the UC Irvine machine learning data repository. This is an extract from the US census and has common demographics and socio-economic status variables: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult
Quasi-identifiers:
  · Age
  · Profession
  · Education
  · Marital status
  · Race
  · Sex
  · Country

FARS (No. records: 43,330)
Description: Department of Transportation fatal crash information: http://www-fars.nhtsa.dot.gov/main.cfm
Quasi-identifiers:
  · Age
  · Race
  · Month of death
  · Day of death

CUP (No. records: 95,412)
Description: Data from the Paralyzed Veterans Association on veterans with spinal cord injuries or disease: http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html
Quasi-identifiers:
  · ZIP code
  · Age
  · Gender
  · Income

Pharm (No. records: 16,424)
Description: Prescription records from the Children’s Hospital of Eastern Ontario pharmacy from July 2006 to March 2009. This is for inpatients only and excludes acute cases. A de-identified version of this data was disclosed to commercial data aggregators [67].
Quasi-identifiers:
  · Age
  · Postal code (FSA)
  · Admission date
  · Discharge date
  · Sex

ED (No. records: 108,344)
Description: Emergency department records from Children’s Hospital of Eastern Ontario from 1st June 2007 to 1st June 2009. This data is disclosed for the purpose of disease outbreak surveillance.
Quasi-identifiers:
  · Admission date
  · Postal code
  · Date of birth
  · Sex

Niday (No. records: 637,964)
Description: A registry of all newborns in Ontario from 1st April 2004 to 31st March 2009. This data set is used frequently for research purposes: http://www.bornontario.ca
Quasi-identifiers:
  · Maternal postal code
  · Baby DoB
  · Mother DoB
  · Baby sex
  1. Each data set is treated as a population. The table shows the size of each data set and the variables included in the analysis.