Table 2 Settings used for the validation experiment. These settings are all configurable by the user. F2C, first two characters. § Values encoded directly or indirectly in the proband file for hashed comparisons; other values set at comparison time. ¶ For probands with gender X or absent gender, the weighted mean of F/M values was used. † From empirical data from a single database or database pair in this study; see Results

From: De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation

Setting

Value

Comment

§ SECURITY

 Hash method

HMAC-MD5

HMAC-MD5 has a space of 16^32 = 3.40 × 10^38. The software also offers HMAC-SHA-256 (space size 16^64 = 1.16 × 10^77) and HMAC-SHA-512 (space size 16^128 = 1.34 × 10^154)

 Number of significant figures for rounding frequencies in hashed version

5

Rounding reduces the identifiability of numbers. Some precision is required to distinguish metaphone from name frequencies
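
The two SECURITY settings above can be illustrated with a minimal sketch, using only Python's standard `hmac`, `hashlib`, and `math` modules. The key, the example name, and the helper names `hash_identifier` and `round_sig` are hypothetical and are not taken from the published software; the sketch only shows why a hex digest of n characters spans 16^n values and what rounding to 5 significant figures does to a frequency.

```python
import hashlib
import hmac
import math

# Hypothetical key for illustration only; not a value from the study.
SECRET_KEY = b"example-secret-key"


def hash_identifier(value: str, key: bytes = SECRET_KEY, method: str = "md5") -> str:
    """Return the hex HMAC digest of an identifier (method: md5, sha256, sha512)."""
    return hmac.new(key, value.encode("utf-8"), digestmod=method).hexdigest()


def round_sig(x: float, sig: int = 5) -> float:
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - math.floor(math.log10(abs(x))))


# Each hex character carries 4 bits, so a digest of n hex characters
# spans 16**n possible values.
for method in ("md5", "sha256", "sha512"):
    n_hex = len(hash_identifier("ALICE", method=method))
    print(method, n_hex, f"{16 ** n_hex:.2e}")

# A frequency rounded to 5 significant figures, as in the hashed version.
print(round_sig(0.0123456789))
```

The digest lengths (32, 64, and 128 hex characters) correspond to the three space sizes quoted in the Comment column.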

§ POPULATION PRIORS: NATIONAL

 Name/metaphone/F2C frequencies for forenames, by gender

[many]

¶ From US baby name frequencies 1880–2015 [58], covering ~345 M people, processed via CRATE [59]. UK data by year is also available [60]

 Name/metaphone/F2C frequencies for surnames

[many]

From US 1990 and 2010 Census surname frequencies [61, 62], processed via CRATE [59]

 Minimum frequency for forenames, fminforename. (If a frequency was less than this, this minimum was used instead.)

5 × 10^−6

A minimum is required for unknown names. For the US forename data cited, the floor frequency is ~2.9 × 10^−8; however, allowing extremely low frequencies (e.g. much below 1/np) increases the chances of a spurious match, because a name match can add up to ln(1/fmin) to the log odds

 Minimum frequency for surnames, fminsurname

5 × 10^−6

As above. For the US surname data cited, the lowest frequency reported is 3 × 10^−7, but we used a threshold above 1/np

 P(female | female or male)

0.51

With a binary sex choice, the UK is 51% female and 49% male [63]

 P(not female or male)

0.004

Approximately 0.4% of the UK consider their gender neither male nor female [64]

 Postcode data, for pfpostcode, ppnfpostcode, and pnpostcode

From UK Office for National Statistics data [65], licensed under the Open Government Licence version 3.0
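
The frequency floor and the gender priors in this group can be sketched as follows. The floor rule ("if a frequency was less than this, this minimum was used instead") and the ln(1/fmin) bound come from the table; the function names are hypothetical, and combining the two gender settings into three unconditional priors by multiplication is an assumption made here for illustration, not a formula stated in the table.

```python
import math

F_MIN_FORENAME = 5e-6          # minimum forename frequency (table value)
RAW_FORENAME_FLOOR = 2.9e-8    # lowest frequency in the cited US forename data
N_P = 852_523                  # local population size, np (table value)
P_FEMALE_GIVEN_BINARY = 0.51   # P(female | female or male)
P_NOT_BINARY = 0.004           # P(not female or male)


def floored_frequency(f: float, f_min: float = F_MIN_FORENAME) -> float:
    """Use the minimum frequency whenever the observed frequency is lower."""
    return max(f, f_min)


def max_log_odds_from_name(f_min: float) -> float:
    """Upper bound on the log odds a single name match can add: ln(1/f_min)."""
    return math.log(1.0 / f_min)


def gender_priors(p_f_binary: float = P_FEMALE_GIVEN_BINARY,
                  p_x: float = P_NOT_BINARY) -> dict:
    """Assumed combination: scale the binary split by the binary proportion."""
    return {
        "F": p_f_binary * (1.0 - p_x),
        "M": (1.0 - p_f_binary) * (1.0 - p_x),
        "X": p_x,
    }


print(floored_frequency(1e-9))                      # an unknown or very rare name
print(max_log_odds_from_name(F_MIN_FORENAME))       # bound with the chosen floor
print(max_log_odds_from_name(RAW_FORENAME_FLOOR))   # bound with the raw data floor
print(F_MIN_FORENAME > 1.0 / N_P)                   # the floor sits above 1/np
print(gender_priors())
```

With the chosen floor, one name match can contribute at most about 12.2 to the log odds, compared with about 17.4 if the raw data floor were used.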

POPULATION PRIORS: LOCAL

 Population size, np

852,523

Population estimate of Cambridgeshire and Peterborough for 2018 [66]

 Birth year “range” b

30

† The prior probability of two people sharing a DOB was taken as 1/(365.25 × b). A value of 90 may be reasonable for a full UK population with few long-deceased people [67], but we used an empirical value reflecting the subsampled age composition of one of our databases

 Postcode frequency multiple kpostcode

nUK/np

Where nUK is the 2017 UK population, 66,040,000 [67]

 Population proportion assumed to be assigned a pseudopostcode (e.g. ZZ99 3VZ, no fixed abode; ZZ99 3CZ, England/UK not otherwise specified) or a postcode unknown to the postcode database (including typographical errors creating an invalid postcode), ppseudopostcode_unit. Taken as an estimate for each unknown/pseudopostcode unit frequency

0.00201

† Based on the proportion of people in the SystmOne database with a ZZ99 3VZ (no fixed abode) postcode. This is higher than an estimate from national data (see Results), potentially reflecting a bias from a healthcare environment, so this value may need alteration in other contexts

 Pseudopostcode multiple kpseudopostcode such that ppseudopostcode_sector = kpseudopostcode × ppseudopostcode_unit

1.83

† Based on an empirical value for ZZ993:ZZ993VZ (see Results). This number cannot be < 1 and should be > 1 to avoid ppnfpostcode = 0
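
The derived quantities in the LOCAL group follow directly from the table values, as this short sketch shows. The variable and function names are hypothetical. The final quantity, the difference between the sector and unit pseudopostcode proportions, is an assumed reading of why the multiple must exceed 1; the table itself only states that a multiple of 1 would give ppnfpostcode = 0.

```python
N_UK = 66_040_000            # 2017 UK population (table value)
N_P = 852_523                # local population size, np (table value)
BIRTH_YEAR_RANGE = 30        # b (table value)
P_PSEUDO_UNIT = 0.00201      # ppseudopostcode_unit (table value)
K_PSEUDO = 1.83              # kpseudopostcode (table value)


def p_same_dob(b: float) -> float:
    """Prior probability that two people share a DOB: 1 / (365.25 * b)."""
    return 1.0 / (365.25 * b)


# Postcode frequency multiple, kpostcode = nUK / np.
k_postcode = N_UK / N_P

# ppseudopostcode_sector = kpseudopostcode * ppseudopostcode_unit.
p_pseudo_sector = K_PSEUDO * P_PSEUDO_UNIT

# Assumed reading: the proportion sharing the sector but not the unit,
# which would be zero if the multiple were exactly 1.
p_sector_not_unit = p_pseudo_sector - P_PSEUDO_UNIT

print(p_same_dob(BIRTH_YEAR_RANGE))   # b = 30, as used in the study
print(p_same_dob(90))                 # b = 90, suggested for a full UK population
print(k_postcode)
print(p_pseudo_sector, p_sector_not_unit)
```

With b = 30 the DOB prior is roughly 9.1 × 10^−5, three times the value obtained with b = 90, and kpostcode is roughly 77.5.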

ERROR RATES (given proband/candidate are the same person)

 pep1forename

F: 0.00894

M: 0.00840

†¶ Probability that a forename pair exhibits partial 1 (metaphone) match but not a full (name) match

 pep2np1forename

F: 0.00881

M: 0.00688

†¶ Probability that a forename pair exhibits a partial 2 (F2C) match, but not a partial 1 (metaphone) or full (name) match

 penforename

F: 0.00572

M: 0.00625

†¶ Probability that a forename pair exhibits no match at all

 puforename

0.00191

† Probability, amongst a set of ≥ 2 forenames, of an error that shuffles the names out of strict order

 pep1surname

F: 0.00551

M: 0.00471

†¶ Probability that a surname pair exhibits a partial 1 (metaphone) match but not a full (name) match

 pep2np1surname

F: 0.00378

M: 0.00247

†¶ Probability that a surname pair exhibits a partial 2 (F2C) match, but not a partial 1 (metaphone) or full (name) match

 pensurname

F: 0.0567

M: 0.0134

†¶ Probability that a surname pair exhibits no match at all

 pepdob

0.00459

† Probability of a DOB error causing a partial (year/month, month/day, or year/day) match

 pendob

0

The probability of a DOB error causing no match at all. Using 0 rather than the empirical value of 0.00033 produces a major speed advantage; see Results

 pegender

0.0033

† The probability that proband/candidate (when the same person) do not match on gender

 peppostcode

0.0097

† The probability that a proband/candidate postcode pair (when the same person) exhibits a partial (postcode sector) match but not a full (postcode unit) match, e.g. due to error or because someone has moved within a postcode sector

 penpostcode

0.300

† The probability that two postcodes for the same person mismatch completely
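
The error rates above can be combined as in the following sketch. It rests on two assumptions that are labeled here because the table does not state them explicitly: first, that the listed outcomes (partial 1, partial 2 without partial 1, no match) are mutually exclusive and, together with a full match, exhaustive, so the full-match probability for the same person is their complement; second, that the weighted mean used for probands with gender X or absent gender (¶) takes P(female | female or male) = 0.51 as its weight. All names are hypothetical.

```python
P_FEMALE_GIVEN_BINARY = 0.51  # assumed weight for the F/M weighted mean

# (partial 1, partial 2 not partial 1, no match) error rates from the table.
ERROR_RATES = {
    "forename": {"F": (0.00894, 0.00881, 0.00572), "M": (0.00840, 0.00688, 0.00625)},
    "surname": {"F": (0.00551, 0.00378, 0.0567), "M": (0.00471, 0.00247, 0.0134)},
}
P_EP_POSTCODE = 0.0097   # peppostcode
P_EN_POSTCODE = 0.300    # penpostcode


def p_full_match(rates) -> float:
    """Assumed complement: P(full match | same person) = 1 - sum of error rates."""
    return 1.0 - sum(rates)


def weighted_mean(f_value: float, m_value: float,
                  p_female: float = P_FEMALE_GIVEN_BINARY) -> float:
    """Weighted mean of F and M values, for gender X or absent gender."""
    return p_female * f_value + (1.0 - p_female) * m_value


for field, by_gender in ERROR_RATES.items():
    for gender, rates in by_gender.items():
        print(field, gender, p_full_match(rates))

print("postcode", p_full_match((P_EP_POSTCODE, P_EN_POSTCODE)))

# Example: the surname no-match rate applied to a proband of gender X.
print(weighted_mean(ERROR_RATES["surname"]["F"][2], ERROR_RATES["surname"]["M"][2]))
```

Under these assumptions, a full postcode match for the same person is expected only about 69% of the time, far less often than a full forename or surname match, which reflects the large penpostcode value.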