Classification of data into good and bad clusters. A random sample of 25,000 complex clusters was obtained after initial record linkage. Complex clusters are those with more than one variant of at least one identifier. Using a series of rules, these clusters were divided into those thought to represent one individual ('likely good', purple line) and the remainder ('uncertain', blue line). 'Likely good' clusters were then combined at random to create a new set of clusters ('bad by simulation', green line). Maximal pairwise distances between members were computed within each 'likely good' and each simulated bad cluster. A logistic model of bad versus good cluster status was fitted separately for (top) clusters without females and (bottom) clusters including at least one record identified as being from a female. Logistic scores are plotted for each of the three groups. The dashed vertical line marks a score of -1.5 in both models, a cut-off chosen empirically as suitable for discriminating good from bad clusters.
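The procedure described in this caption can be illustrated with a minimal sketch on synthetic data. Everything specific here is an assumption made for illustration and is not taken from the source: the 8-digit identifiers, the normalised Hamming distance, the use of a single predictor, the gradient-descent fitter, and all function names are hypothetical. The score is taken to be the linear predictor (log-odds of being a bad cluster), which is one plausible reading of "logistic score". A single model is fitted, whereas the caption describes separate models for clusters with and without female records.

```python
import math
import random
from itertools import combinations

def distance(a, b):
    """Toy distance (assumed): fraction of positions at which two identifiers differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def max_pairwise_distance(cluster):
    """Largest distance between any two records in a cluster."""
    return max((distance(a, b) for a, b in combinations(cluster, 2)), default=0.0)

def make_good_cluster(rng, length=8):
    """One synthetic individual: a base identifier plus one or two single-digit variants."""
    base = [rng.choice("0123456789") for _ in range(length)]
    records = ["".join(base)]
    for _ in range(rng.randint(1, 2)):
        variant = base[:]
        pos = rng.randrange(length)
        variant[pos] = rng.choice([d for d in "0123456789" if d != base[pos]])
        records.append("".join(variant))
    return records

def simulate_bad(good_clusters, rng):
    """Merge randomly paired good clusters, giving clusters known to mix individuals."""
    shuffled = good_clusters[:]
    rng.shuffle(shuffled)
    return [shuffled[i] + shuffled[i + 1] for i in range(0, len(shuffled) - 1, 2)]

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit intercept b0 and slope b1 by batch gradient descent on the log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

rng = random.Random(0)
good = [make_good_cluster(rng) for _ in range(200)]
bad = simulate_bad(good, rng)

# Outcome is 1 for simulated bad clusters and 0 for good clusters.
xs = [max_pairwise_distance(c) for c in good + bad]
ys = [0] * len(good) + [1] * len(bad)
b0, b1 = fit_logistic(xs, ys)

# Scores on the log-odds scale; higher means more likely to be a bad cluster.
good_scores = [b0 + b1 * max_pairwise_distance(c) for c in good]
bad_scores = [b0 + b1 * max_pairwise_distance(c) for c in bad]

# Cut-off value quoted in the caption; whether it suits this toy data is not claimed.
CUTOFF = -1.5
flagged_bad = [c for c, s in zip(good + bad, good_scores + bad_scores) if s > CUTOFF]
```

In practice, the same fitted coefficients would also be applied to the 'uncertain' clusters, so that each can be scored and compared against the cut-off.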