Table 5 Multivariate Logistic model classifying bad clusters from good

From: An efficient record linkage scheme using graphical analysis for identifier error detection

Model: Any females

| Parameter | Example inputs | Distance function | Coefficient | p |
| --- | --- | --- | --- | --- |
| constant | | | -3.33 | <1 × 10⁻¹⁶ |
| Date of birth, day | 02, 24 | Levenshtein | 0.25 | 5 × 10⁻⁴ |
| Date of birth, month | 01, 11 | Levenshtein | 0.43 | 1 × 10⁻⁴ |
| Date of birth, year | 1969, 2007 | Levenshtein | 5.12 | <1 × 10⁻¹⁶ |
| Forename | John, Chris | Jaro-Winkler | 2.45 | <1 × 10⁻¹⁶ |
| Hospital Number | 110111 or 223456 | Jaro-Winkler | -0.56 | <1 × 10⁻¹⁶ |
| Surname | Smith, Jones | Jaro-Winkler | not present | - |
| Sex | M, F | Levenshtein | 0.80 | <1 × 10⁻¹⁶ |

Model: No females

| Parameter | Example inputs |
| --- | --- |
| constant | |
| Date of birth, day | 02, 24 |
| Date of birth, month | 01, 11 |
| Date of birth, year | 1969, 2007 |
| Forename | John, Chris |
| Hospital Number | 110111 or 223456 |
| Surname | Smith, Jones |
| Sex | M, F |
  1. A random sample of 25,000 clusters was obtained after initial record linkage. On the basis of a series of rules, these clusters were divided into those thought to represent one individual ('good') and the remainder ('uncertain'). The uncertain records were not used in model generation. Good clusters were then combined at random to create a new set of clusters ('bad'). Maximal distances were computed by pairwise comparison of good and bad clusters. Logistic models of bad cluster status relative to good cluster status were then fitted separately for clusters without females and for clusters including at least one record identified as being from a female, with backwards selection based on AIC. In the female model, surname was omitted; in the non-female model, the Sex field has only one level and was therefore omitted. One fitted model is shown; very similar estimates were obtained from a large number of other builds with different random samples. p refers to the null hypothesis that the coefficient is zero.
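
As an illustration of how the coefficients in the table might be applied, the following minimal Python sketch computes the maximal pairwise distance for a field within a cluster and converts a set of such distances into a predicted probability of 'bad' status under the "Any females" model. Several points are assumptions rather than details stated in the source: the function and field names (`levenshtein`, `max_pairwise_distance`, `bad_cluster_probability`, `dob_day` and so on) are hypothetical, the Levenshtein distance is used unnormalised, and the Jaro-Winkler distances for forename and hospital number are expected to be supplied by the caller, since the scaling the authors used is not given here.

```python
import math
from itertools import combinations

# Coefficients copied from the "Any females" model in Table 5.
# Surname is absent because it was not retained in that model.
# The dictionary keys are hypothetical field names chosen for this sketch.
INTERCEPT = -3.33
COEFFICIENTS = {
    "dob_day": 0.25,
    "dob_month": 0.43,
    "dob_year": 5.12,
    "forename": 2.45,
    "hospital_number": -0.56,
    "sex": 0.80,
}


def levenshtein(a: str, b: str) -> int:
    """Unnormalised edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def max_pairwise_distance(values, distance):
    """Largest distance between any two values of one field in a cluster.

    Returns 0 for a cluster with fewer than two values.
    """
    return max((distance(a, b) for a, b in combinations(values, 2)), default=0)


def bad_cluster_probability(distances):
    """Probability of 'bad' status from per-field maximal distances.

    Fields missing from `distances` are treated as distance 0
    (an assumption made for this sketch).
    """
    linear = INTERCEPT + sum(
        COEFFICIENTS[field] * distances.get(field, 0.0)
        for field in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Example: a two-record cluster using the example inputs from the table.
example = {
    "dob_day": max_pairwise_distance(["02", "24"], levenshtein),
    "dob_month": max_pairwise_distance(["01", "11"], levenshtein),
    "sex": max_pairwise_distance(["M", "F"], levenshtein),
}
print(example, bad_cluster_probability(example))
```

Under these assumptions, a cluster whose records agree on every field has a linear predictor equal to the constant, -3.33, which corresponds to a predicted probability of roughly 0.035. Any threshold for flagging a cluster as bad would be a separate choice not specified in this table.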