
Table 5 Multivariate Logistic model classifying bad clusters from good

From: An efficient record linkage scheme using graphical analysis for identifier error detection

   

| Model: Any females | | | | | Model: No females | |
|---|---|---|---|---|---|---|
| Parameter | Example inputs | Distance function | Coefficient | p | Parameter | Example inputs |
| constant | | | -3.33 | <1 × 10⁻¹⁶ | constant | |
| Date of birth, day | 02, 24 | Levenshtein | 0.25 | 5 × 10⁻⁴ | Date of birth, day | 02, 24 |
| Date of birth, month | 01, 11 | Levenshtein | 0.43 | 1 × 10⁻⁴ | Date of birth, month | 01, 11 |
| Date of birth, year | 1969, 2007 | Levenshtein | 5.12 | <1 × 10⁻¹⁶ | Date of birth, year | 1969, 2007 |
| Forename | John, Chris | Jaro-Winkler | 2.45 | <1 × 10⁻¹⁶ | Forename | John, Chris |
| Hospital Number | 110111 or 223456 | Jaro-Winkler | -0.56 | <1 × 10⁻¹⁶ | Hospital Number | 110111 or 223456 |
| Surname | Smith, Jones | Jaro-Winkler | not present | - | Surname | Smith, Jones |
| Sex | M, F | Levenshtein | 0.80 | <1 × 10⁻¹⁶ | Sex | M, F |

  1. A random sample of 25,000 clusters was obtained after initial record linkage. A series of rules divided these clusters into those thought to represent one individual ('good') and the remainder ('uncertain'). Uncertain clusters were not used in model generation. Good clusters were then combined at random to create a new set of clusters ('bad'). Maximal distances were computed by pairwise comparison of good and bad clusters. Separate logistic models of bad cluster status relative to good cluster status were then fitted for clusters without females and for clusters including at least one record identified as being from a female, with backwards selection based on AIC. In the female model, surname was omitted; in the non-female model, the Sex field has only one level and was therefore omitted. One fitted model is shown; very similar estimates were obtained from a large number of other builds with different random samples. p is the p-value for the null hypothesis that the coefficient is zero.
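As a rough illustration of how the 'Any females' coefficients could be applied, the Python sketch below computes the maximal pairwise Levenshtein distance for one field within a cluster and passes the resulting distances through a logistic function. It rests on several assumptions not stated in the table: that distances are compared between the records inside a cluster, that they are used unscaled, and that Jaro-Winkler distances (forename, hospital number) are supplied by the caller rather than computed here. The field keys and function names are hypothetical.

```python
import math
from itertools import combinations

# Coefficients copied from the 'Any females' model in Table 5.
# Surname is 'not present' in that model, so it has no entry.
# The dictionary keys are hypothetical names chosen for this sketch.
INTERCEPT = -3.33
COEFFICIENTS = {
    "dob_day": 0.25,
    "dob_month": 0.43,
    "dob_year": 5.12,
    "forename": 2.45,
    "hospital_number": -0.56,
    "sex": 0.80,
}


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def max_pairwise_distance(values, distance):
    """Largest distance between any two values of one field in a cluster."""
    return max((distance(a, b) for a, b in combinations(values, 2)), default=0)


def bad_cluster_probability(max_distances):
    """Logistic transform of the linear predictor (assumes unscaled distances)."""
    linear = INTERCEPT + sum(
        COEFFICIENTS[field] * value for field, value in max_distances.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Example: a cluster whose records carry two different birth years.
cluster_years = ["1969", "1969", "2007"]
year_distance = max_pairwise_distance(cluster_years, levenshtein)  # 4: no shared digits
probability = bad_cluster_probability({"dob_year": year_distance})
```

Fields left out of the dictionary passed to `bad_cluster_probability` contribute nothing to the linear predictor, which is equivalent to treating their maximal distance as zero. If the original analysis normalised distances to a fixed range, the inputs would need the same scaling before the coefficients apply.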