An efficient record linkage scheme using graphical analysis for identifier error detection

Table 5 Multivariate Logistic model classifying bad clusters from good

A random sample of 25,000 clusters was obtained after initial record linkage. These clusters were divided into those which, on the basis of a series of rules, were thought to represent one individual ('good'), or the others ('uncertain'). The uncertain records were not used in model generation. Good clusters were then combined randomly creating a new set of clusters ('bad'). Maximal distances were computed by pairwise comparison of good and bad clusters, and a logistic model was fitted modelling bad cluster status relative to good cluster status for clusters without females, or for clusters including at least one record identified as being from a female, with backwards selection based on AIC. In the female model, surname was omitted; in the non-female model, there is only one level for the Sex field, which was therefore omitted. A model fitted is shown; very similar estimates were obtained from a large number of other builds with different random samples. p refers to the null hypothesis that the coefficient is zero.

ISSN: 1472-6947