De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation

Table 1 Systems for full and partial matching, applicable to different identifier types. D, data; H, hypothesis that the proband and candidate under consideration are the same person. For each system, columns sum to 1, since the options for D are mutually exclusive. (A) Two-state system: a match occurs or does not occur. There is a probability p_c (c, correct) that an identifier is correctly represented (is the same) when the proband and candidate are the same person, and a probability p_e = p_en (e, any error; en, error yielding no match) that an error or mismatch occurs. In the population, there is a probability p_f (f, full match) that a randomly selected other person shares the proband’s identifier, and a probability p_n (n, no match) that they do not. (B) Three-state (“fuzzy”) system. The nature of a partial match is specific to the identifier type. For example, for date of birth (DOB), the partial match is a DOB with two of the three components (year, month, day) correct. Probabilities p_c, p_f, p_en, and p_n are as before, but now there is a probability p_ep (ep, error yielding partial match) that, when the proband and candidate are the same person, the identifiers match only partially (so p_e = p_ep + p_en), and a probability p_pnf (pnf, partial but not full match) that a random other person will share the partial but not the full identifier. It may be easier to measure p_p, the total probability of a partial or full match, than p_pnf. (C) Four-state system. Partial matches now occur in two variants, hierarchically. (D) Adjustments for unordered pick-the-best comparisons between multiple identifiers of the same type (e.g. surnames, postcodes). “Positive” comparisons are those for which the log likelihood ratio is > 0. (E) Adjustments for ordered pick-the-best comparisons between multiple identifiers of the same type (e.g. forenames), using the probability p_o (o, ordered) that, given H, for ≥ 2 candidate identifiers, the candidate’s order strictly matches the proband’s, and its converse probability p_u (u, unordered). Positive comparisons are strictly ordered when each proband identifier’s index matches the corresponding candidate identifier’s index. † For P(D | ¬H), adjustments use the Bonferroni correction (see text).

Data, D	P(D | H, same person)	P(D | ¬H, different person)
A. Two-state comparison
Match	p_c = 1 − p_e = 1 − p_en	p_f
No match	p_e = p_en	p_n = 1 − p_f
B. Three-state comparison
Full match	p_c = 1 − p_e = 1 − p_ep − p_en	p_f
Partial (but not full) match	p_ep	p_pnf = p_p − p_f
No match	p_en	p_n = 1 − p_p
C. Four-state comparison
Full match	p_c = 1 − p_e = 1 − p_ep1 − p_ep2np1 − p_en	p_f
Partial match type 1 (but not full)	p_ep1	p_p1nf = p_p1 − p_f
Partial match type 2 (but not full or partial type 1)	p_ep2np1	p_p2np1
No match	p_en	p_n = 1 − p_p = 1 − p_p2np1 − p_p2nf
D. Adjustments for unordered multi-identifier comparison
For 1 ≤ c ≤ min(n, m) “positive” comparisons between proband identifiers 1…n and candidate identifiers 1…m †	× 1, no correction	\(\times \prod^{c-1}_{i=0} \left(m-i\right)\)
E. Adjustments for ordered multi-identifier comparison
For c ≥ 1 “positive” comparisons, m > 1, and strict order match	× p_o	× 1, no correction
For c ≥ 1 “positive” comparisons, m > 1, and order mismatch †	× p_u = 1 − p_o	\(\times \left(\left[\prod^{c-1}_{i=0} \left(m-i\right)\right]-1\right)\)
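
As an illustration of how the entries in Table 1 might be used, the following minimal Python sketch builds the three-state system (B) for one identifier, computes the log likelihood ratio for each outcome, and evaluates the multipliers from (D) and (E) that are applied to P(D | ¬H). The function names and the numerical probabilities are hypothetical, chosen only for illustration; they are not taken from the study.

```python
import math


def three_state_table(p_ep, p_en, p_f, p_p):
    """Return {outcome: (P(D | H), P(D | not H))}, following Table 1B."""
    p_c = 1.0 - p_ep - p_en  # probability of a correct (full) match given H
    return {
        "full": (p_c, p_f),
        "partial": (p_ep, p_p - p_f),  # p_pnf = p_p - p_f
        "none": (p_en, 1.0 - p_p),     # p_n = 1 - p_p
    }


def log_likelihood_ratio(table, outcome):
    """Natural log of P(D | H) / P(D | not H) for one observed outcome."""
    p_d_given_h, p_d_given_not_h = table[outcome]
    return math.log(p_d_given_h / p_d_given_not_h)


def unordered_multiplier(c, m):
    """Table 1D factor for P(D | not H): product of (m - i) for i = 0..c-1."""
    return math.prod(m - i for i in range(c))


def ordered_mismatch_multiplier(c, m):
    """Table 1E factor for P(D | not H) when the order does not match."""
    return unordered_multiplier(c, m) - 1


# Hypothetical probabilities, for illustration only.
table = three_state_table(p_ep=0.03, p_en=0.01, p_f=0.001, p_p=0.02)
for outcome in table:
    print(outcome, round(log_likelihood_ratio(table, outcome), 3))
```

With these illustrative values, a full or partial match yields a positive log likelihood ratio (evidence for H) and a non-match yields a negative one. Each column of the returned table sums to 1, as the caption requires.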

ISSN: 1472-6947