This article has Open Peer Review reports available.
An efficient record linkage scheme using graphical analysis for identifier error detection
© Finney et al; licensee BioMed Central Ltd. 2011
Received: 14 June 2010
Accepted: 1 February 2011
Published: 1 February 2011
Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone.
We describe a two-step record linkage algorithm in which identifiers with high cardinality are identified or generated, and used to perform an initial exact match based linkage. Subsequently, the resulting clusters are studied and, if appropriate, partitioned using a graph based algorithm detecting erroneous identifiers.
The system was used to cluster over 250 million health records from five data sources within a large UK hospital group. Linkage, which was completed in about 30 minutes, yielded 3.6 million clusters of which about 99.8% contain, with high likelihood, records from one patient. Although computationally efficient, the algorithm's requirement for exact matching of at least one identifier of each record to another for cluster formation may be a limitation in some databases containing records of low identifier quality.
The technique described offers a simple, fast and highly efficient two-step method for large scale initial linkage for records commonly found in the UK's National Health Service.
Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications[1–3]. Large data sources are increasingly available in many areas, but unfortunately accurate and ubiquitously applied unique identifiers are rarely available. Frequently, identifiers which are supposed to be unique to an individual (e.g. UK National Health Service numbers, or the patient numbers generated by hospitals) are missing from some or all data items; additionally, some are incorrectly entered due to clerical or typographic errors.
Identifier inaccuracies can result in both organisation costs and risks to individuals.
Because of the problem of identifier error, in many processes (such as the identification of samples sent to hospital laboratories) the record to be linked is usually identified by multiple pieces of information, such as a name and date of birth as well as the supposedly unique identifying number.
Approaches mapping records with multiple identifiers to individuals have been extensively studied. In a classical, probabilistic approach, after similarities between identifiers (such as surnames, forenames, and so on) have been computed using functions such as the Levenshtein distance and Jaro-Winkler functions, Bayesian classification is used to discern likely matches. In general, probabilistic approaches can require up to n(n-1)/2 comparisons to merge two files each containing n records, and so prove infeasibly expensive computationally for the very large, dynamic datasets available in many situations. This complexity requires the use of heuristics to divide the database into smaller sections (termed "blocks", or "canopies") within which comparisons can be made. Since linkage only occurs within each small section, the algorithm dividing the database can prevent linkage if it does not assign all of an individual's records to the same block[4, 7]. Nevertheless, it has been shown that carefully optimised heuristics combining exact and probabilistic matching can be used to generate large databases of healthcare records with good performance, especially when good quality unique identifiers (e.g. NHS number) can be used in initial linkage steps.
There is growing interest in the use of graphs, data structures in which items of data are represented as nodes and their similarities as edges, to store complex relationships in general. An important example is the work of Sauleau and colleagues, who considered the problem of clustering 300,000 health records from a French hospital. They used an approach derived from the probabilistic clustering literature on canopies (overlapping blocks) of records, generating pairwise distances between each record. In contrast to classical probabilistic linkage, they then considered records as nodes, and their pairwise distances as weighted edges in an undirected graph from which records of similar patients can be recovered using a hard clustering method (i.e. cluster membership is binary, not probabilistic). In their approach, assignment of edge weight is a critical step. Another sophisticated example is the work of Kalashnikov and colleagues. They investigated clustering based on paths of connectivity between identifiers which are not themselves unique, but the strength of whose connections can be determined by path analysis, and where the optimal cluster edges can be determined by minimizing edge weights.
In many important 'real-world' situations, including healthcare, identifier(s) are available for each record which would be expected to be unique to an individual. These would include 'purpose built' identifiers, such as hospital numbers, but also identifiers comprising high-cardinality combinations of personal data (surname, forename, home telephone number), or (surname, forename, date of birth). Additionally, in 'real world' situations, these identifiers contain errors at some low frequency. In commonly used systems for allocating identifying numbers, particularly sequential allocation, typographical errors in one identifier number may not only result in records containing a novel identifier, but also may generate another person's identifier with high probability. As numbers of records increase, these errors become increasingly important, producing theoretically unique identifiers in records genuinely belonging to two or more different individuals. The existence of this type of error detracts from otherwise appealing and efficient exact-match record linkage methods combining records sharing unique identifiers, as the identifier errors cause co-clustering of records from different individuals, referred to here as record collision, compromising overall linkage quality.
Here we describe a simple and highly efficient solution to the identifier-collision problem, in which collisions are detected by noting discrepancies in unique identifiers within collision-affected records. Our research was driven by the need to link accurately more than 250 million health care records from large UK hospitals for clinical and epidemiological purposes.
From large sets of health records (such as patient admissions and laboratory samples), each of which is identified by one or more pieces of personal data, the objective is to assign records to the individuals from whom they originated in an efficient, sensitive and specific manner. In this setting, we considered that a 'masterlist' or 'gold standard' data set of all patients is not available, requiring that the individuals be discerned from the identifiers present in the record set. Computational efficiency is important, since rapid performance is desirable for clinical decision making. So too are sensitivity and specificity: specific linkage means that each cluster contains the records of only one individual, and sensitive linkage means that all records from a single individual are assigned to a single cluster.
Overview of algorithm
We present an algorithm which has the following components, which are described in subsequent sections:
Identifier cleaning (Section 1)
Construction of high-cardinality identifiers from combinations of identifiers, such as forename, surname and date of birth (Section 2)
Exact match using the constructed high-cardinality identifiers (Section 3)
Detection of clusters containing more than one individual ('identity collision') using logistic classification, applied to all clusters containing any variation in identifiers (Section 4)
Breaking apart of clusters with identity collision (Section 5)
Data set available
We had available medical records from each of five data sources: the patient administration system (PAS), details of any previous names patients may have had (PAShistory), an emergency admission tracking system (Jonah), a microbiology information system (Micro) and a haematology, biochemistry and immunology laboratory information management system (LIMS) of a large UK hospital, covering about 1% of the population of the United Kingdom.
The laboratory systems (LIMS and Micro) provide services to the hospital, but also to a large number of general practitioners supplying outpatient samples, samples which make up about half the workload. The outpatient workload is not necessarily represented in the PAS system; therefore, a 'masterlist' of patients does not exist prior to linkage.
Table: Identifiers and record linkage operation. Column headings recoverable from the source: start cluster id; new cluster id; date of birth (ddmmyyyy); frequency of occurrence.
Data platform used and Statistical software
We stored the data on a single Windows Server 2003 machine running SQL Server 2005 databases, with Dell hardware, two 2.5 GHz Xeon processors (8 cores total), 1.2 TB of RAID 5 hard disc space, and 16 GB of RAM. Linkage algorithms were implemented as SQL Server stored procedures. Jaro-Winkler and Levenshtein distance calculations were implemented using the SimMetrics package, compiled into SQL Server using CLR integration.
Here, we describe the stages of the clustering operation used.
1 Data cleaning
All fields: converted to uppercase; blanks (e.g. whitespace) deleted.
Forename & Surname: forenames containing baby/infant/twins, or synonyms, removed; all symbols (e.g. ') removed; records matching internal hospital test individuals deleted; non-alphabetic values removed; values containing only one letter removed; forename and surname reversed if the stated forename does not exist in any patient administration record as a forename.
Sex: values removed unless M, F or U, representing male, female or unknown, respectively.
Hospital and NHS numbers: out-of-range numeric values removed; deleted, along with all other identifiers, if the patient is from a Genito Urinary medicine clinic, or from the Occupational Health Department; values not conforming to the check digit requirement deleted.
Birthdate & Deathdate: converted to SQL date format; dates before 1860-01-01 removed; dates in the future removed.
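The cleaning rules above can be illustrated with a short Python sketch. It is not the authors' implementation (which used SQL Server stored procedures); the function names, the placeholder-forename list and the assumed ddmmyyyy input format are hypothetical. The check digit test assumes the standard Modulus 11 scheme published for NHS numbers, since the source's own description of the requirement is not available here.

```python
import re
from datetime import date, datetime

# Hypothetical list of placeholder forenames; the source mentions "synonyms" without listing them.
PLACEHOLDER_FORENAMES = ("BABY", "INFANT", "TWIN")


def clean_name(value):
    """Uppercase, strip blanks and symbols; return None for unusable values."""
    if value is None:
        return None
    name = re.sub(r"[\W_]", "", value.upper())  # removes whitespace and symbols
    if not name.isalpha() or len(name) <= 1:    # non-alphabetic or single letter
        return None
    return name


def clean_forename(value):
    """As clean_name, but also rejects placeholder forenames such as BABY."""
    name = clean_name(value)
    if name is None or any(p in name for p in PLACEHOLDER_FORENAMES):
        return None
    return name


def clean_sex(value):
    """Keep only M, F or U."""
    code = (value or "").strip().upper()
    return code if code in {"M", "F", "U"} else None


def clean_date(value, today):
    """Parse an assumed ddmmyyyy string; reject dates before 1860 or in the future."""
    try:
        parsed = datetime.strptime(value, "%d%m%Y").date()
    except (TypeError, ValueError):
        return None
    if parsed < date(1860, 1, 1) or parsed > today:
        return None
    return parsed


def nhs_number_valid(value):
    """Modulus 11 check: weights 10..2 on the first nine digits."""
    if not re.fullmatch(r"[0-9]{10}", value or ""):
        return False
    total = sum(int(d) * w for d, w in zip(value[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    return check != 10 and check == int(value[9])
```

Each function returns None (or False) for a value that would be deleted under the rules listed, so downstream linkage simply ignores missing identifiers.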
2 Construction of high cardinality identifiers by concatenation
We considered whether concatenation of existing identifiers, particularly surname, forename and date of birth, might offer an identifier of high cardinality with potential to act as a unique identifier, using records in the PAS data source having NHS numbers. This subset represents recent PAS entries and is more heavily curated than other data sources, due to cross-checking against central NHS identity databases via the NHS's Spine infrastructure. It represents the subset of patients with contact with the hospital, which we do not believe to differ systematically in terms of names and dates of birth from those without. We counted distinct NHS numbers (chosen because it is supposed to be unique to an individual) mapping to particular combinations of name and date of birth.
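As an illustration of the counting step described above, the following Python sketch computes the average number of distinct NHS numbers mapping to each composite identifier. The field names (`surname`, `forename`, `dob`, `nhs`) and the toy records are invented for the example; a value close to 1 suggests the combination behaves almost like a unique identifier.

```python
from collections import defaultdict


def mean_nhs_numbers_per_key(records, key_fields):
    """Average count of distinct NHS numbers per composite identifier.

    Records lacking an NHS number or any component of the key are skipped.
    """
    nhs_by_key = defaultdict(set)
    for rec in records:
        parts = [rec.get(field) for field in key_fields]
        if rec.get("nhs") is None or any(p is None for p in parts):
            continue
        nhs_by_key["|".join(parts)].add(rec["nhs"])
    if not nhs_by_key:
        return None
    return sum(len(s) for s in nhs_by_key.values()) / len(nhs_by_key)


# Invented toy data: two different people share a surname and forename.
toy_records = [
    {"surname": "SMITH", "forename": "JOHN", "dob": "01011950", "nhs": "A"},
    {"surname": "SMITH", "forename": "JOHN", "dob": "02021960", "nhs": "B"},
    {"surname": "JONES", "forename": "ANN", "dob": "03031970", "nhs": "C"},
    {"surname": "SMITH", "forename": "JOHN", "dob": "01011950", "nhs": "A"},
]

name_only = mean_nhs_numbers_per_key(toy_records, ("surname", "forename"))
name_and_dob = mean_nhs_numbers_per_key(toy_records, ("surname", "forename", "dob"))
```

In the toy data, name alone maps to more than one NHS number on average, whereas name plus date of birth maps to exactly one, which is the pattern the comparison is designed to reveal.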
3 Initial Record linkage using identifiers
a) We used an initial linkage algorithm which joins all records having any of three high-cardinality identifiers in common, using an iterative procedure, one set of which is described in Table 1. The operation ceases when no set of records contains a shared identifier. The sets of records thus formed are termed clusters.
Considering complexity, the implementation is possible using SQL. For one identifier (such as NHS number), in a table with n rows, identifying the records sharing a common identifier can be implemented as a hash join of the table to itself on the identifier, and has complexity O(n), where n is the number of rows to be analysed. The operation has to be repeated across all m identifiers (3, in this case), so complexity for one operation (b) is O(mn) [13, 14]. Because of the nature of set union operations, the order of these operations does not matter, and the solution found is deterministic. The number of iterations required to complete linkage depends on the combinations of identifiers present within the clusters. If the number of shared identifiers remaining after the first set union is very small relative to the total number of records, then overall complexity is approximated by O(mn), as the number of iterations is 1 for almost all records.
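The same clusters can be obtained in memory with a union-find (disjoint-set) structure, sketched below in Python. This is a substitute technique chosen for illustration, not the iterative SQL hash-join procedure the authors describe; the field names `nhs`, `hosp` and `name_dob` are assumptions. Each record is visited once per identifier, so the work is roughly proportional to mn.

```python
def cluster_records(records, identifier_fields):
    """Label each record with the smallest record index in its cluster.

    Two records fall in the same cluster if they are connected by any chain
    of exactly matching, non-missing identifiers.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (field, value) -> index of first record carrying it
    for idx, rec in enumerate(records):
        for field in identifier_fields:
            value = rec.get(field)
            if value is None:
                continue
            key = (field, value)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx
    return [find(i) for i in range(len(records))]


# Invented example: the first three records are chained by shared identifiers.
example = [
    {"nhs": "N1", "hosp": "H1", "name_dob": None},
    {"nhs": None, "hosp": "H1", "name_dob": "SMITH|JOHN|01011950"},
    {"nhs": None, "hosp": None, "name_dob": "SMITH|JOHN|01011950"},
    {"nhs": "N2", "hosp": "H2", "name_dob": "JONES|ANN|03031970"},
]
labels = cluster_records(example, ("nhs", "hosp", "name_dob"))
```

As with the set-union formulation in the text, the result does not depend on the order in which identifiers are processed.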
4 Identity collision and its detection
It is possible for records from more than one individual to be combined into the same cluster; this results in intra-cluster variation, and is discussed in Results. First, we studied the variation occurring within the clusters produced by initial linkage. We did this by obtaining a random sample of 25,000 "complex" clusters, defining "complex" to mean any difference in any of the identifiers in section 2 within the records comprising the cluster, i.e. having intra-cluster variation.
Some of the intra-cluster variation arises from variations which occur in identifiers, e.g. on marriage. Other variation arises from the combination of individuals. We devised a sensitive test for clusters belonging to one person, based on domain knowledge: we provisionally regarded clusters among the 25,000 as 'good' if they had
one NHS number and one hospital number, or
one NHS number and one name and date of birth, or
one hospital number and one name and date of birth
We then simulated clusters resulting from inappropriate combination (termed 'bad' clusters) of two 'good' clusters (defined operationally as above) by randomly combining pairs of 'good' clusters. The artificially formed bad clusters generated in this way resemble one frequently observed pattern of mislinkage, as is described in Results.
Thus, three groups of records were derived: 'good', meeting the above criteria, 'uncertain', which were present in the original dataset but did not meet the above criteria, and 'bad' which are derived from good clusters by simulation.
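The rule-based labelling and the simulation of 'bad' clusters can be sketched as follows. This is an illustrative Python version under stated assumptions: clusters are lists of record dictionaries, the field names are invented, and "one name and date of birth" is read as exactly one distinct name/date-of-birth combination in the cluster.

```python
import random


def distinct(cluster, field):
    """Set of non-missing values of one identifier within a cluster."""
    return {rec[field] for rec in cluster if rec.get(field) is not None}


def is_good(cluster):
    """Provisional 'good' rule: any two of the three identifiers are single-valued."""
    nhs = len(distinct(cluster, "nhs")) == 1
    hosp = len(distinct(cluster, "hosp")) == 1
    name_dob = len(distinct(cluster, "name_dob")) == 1
    return (nhs and hosp) or (nhs and name_dob) or (hosp and name_dob)


def simulate_bad(good_clusters, n_pairs, rng):
    """Build artificial 'bad' clusters by merging random pairs of good clusters."""
    bad = []
    for _ in range(n_pairs):
        first, second = rng.sample(good_clusters, 2)
        bad.append(first + second)
    return bad


# Invented clusters: cluster_a varies in name (e.g. on marriage) but is still 'good'.
cluster_a = [
    {"nhs": "N1", "hosp": "H1", "name_dob": "SMITH|JANE|01011950"},
    {"nhs": "N1", "hosp": "H1", "name_dob": "BROWN|JANE|01011950"},
]
cluster_b = [{"nhs": "N2", "hosp": "H2", "name_dob": "JONES|ANN|03031970"}]
bad_clusters = simulate_bad([cluster_a, cluster_b], n_pairs=1, rng=random.Random(0))
```

Clusters from the original data that fail `is_good` correspond to the 'uncertain' group; they are neither confirmed as single-person nor simulated as bad.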
We considered 'good' and 'bad' clusters further. We computed the maximum distance between all combinations of the values of each field in the cluster and fitted logistic regression models for clusters including females (excluding distances based on surname, because of the frequent changes in surname occurring on marriage), or for clusters without any females, modelling the odds of bad status relative to good. This process used R scripts calling SQL Server stored procedures to extract and simulate clusters, followed by backwards conditional logistic regression modelling with the stepAIC function from the R MASS package. The distribution of scores was plotted using the R lattice densityplot function for 'good' and 'bad' clusters, as well as for clusters considered uncertain by rule-based classification. Cutoffs were chosen by visual inspection of density plots, and performance of the fitted model was assessed on an independent simulation, extracted as above.
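The per-cluster features used by the classifier can be illustrated with the Python sketch below, which computes the maximum pairwise Levenshtein distance for each field. The source does not state which distance function was applied to which field, so using Levenshtein distance throughout is an assumption, as are the field names; the resulting feature dictionary would then be passed to a logistic regression, which is not shown.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (char_a != char_b)))  # substitution
        prev = curr
    return prev[-1]


def max_distance(cluster, field):
    """Largest distance between any two distinct values of a field (0 if fewer than two)."""
    values = sorted({rec[field] for rec in cluster if rec.get(field) is not None})
    return max((levenshtein(x, y) for x, y in combinations(values, 2)), default=0)


def cluster_features(cluster, fields):
    """One maximal intra-cluster distance per field."""
    return {field: max_distance(cluster, field) for field in fields}


# Invented cluster with a surname variant and a transposed year of birth.
toy_cluster = [
    {"surname": "SMITH", "dob": "01011950"},
    {"surname": "SMYTH", "dob": "01011905"},
    {"surname": "SMITH", "dob": None},
]
features = cluster_features(toy_cluster, ("surname", "dob"))
```

Because the feature is a maximum over the whole cluster, a single discordant record is enough to raise the score, which matches the aim of flagging clusters containing any second individual.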
5 Identity collision resolution
(1) Identifiers whose removal leads to the division of a cluster into two, where the two divided clusters have improved 'quality' relative to the initial cluster, are potentially bad. An example is illustrated in Figure 1.
(2) When one views the identifier combinations as a graph, with edges comprising shared identifiers (Figure 1), the erroneous identifiers form the origins of edges.
It follows that potentially erroneous identifiers are only a small subset of all the identifiers in the records of interest: they form the origins of edges (as in 2, above) and lie within the set of identifiers having the properties in (1, above).
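A minimal Python sketch of the graph view follows: records are nodes, shared identifiers are edges, and an identifier is a candidate if ignoring it divides a connected cluster into exactly two parts. The function and field names are hypothetical, and the quality comparison between the divided clusters is deliberately left out here.

```python
def count_components(records, fields, ignore=None):
    """Number of connected groups of records, optionally ignoring one (field, value)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for idx, rec in enumerate(records):
        for field in fields:
            key = (field, rec.get(field))
            if key[1] is None or key == ignore:
                continue
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])  # join the two groups
            else:
                first_seen[key] = idx
    return len({find(i) for i in range(len(records))})


def splitting_identifiers(records, fields):
    """Identifiers whose removal divides a single cluster into exactly two."""
    keys = sorted({(f, r[f]) for r in records for f in fields if r.get(f) is not None})
    return [k for k in keys if count_components(records, fields, ignore=k) == 2]


# Invented cluster: the last record carries hospital number H1 in error.
collided = [
    {"nhs": "N1", "hosp": "H1"},
    {"nhs": "N1", "hosp": "H1"},
    {"nhs": "N2", "hosp": "H2"},
    {"nhs": "N2", "hosp": "H1"},
]
candidates = splitting_identifiers(collided, ("nhs", "hosp"))
```

In this example two identifiers split the cluster, the erroneous H1 and the genuine N2, which is why a further quality criterion on the divided clusters is needed to choose between candidates.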
6 Trial of a splitting heuristic by simulation
a) Pairs of randomly selected 'good' clusters were combined, replacing one randomly selected hospital number from one cluster with the maximal hospital number from the other, thus simulating a cluster collision caused by entry of a 'bad' hospital number.
b) Variants of the cluster were generated, and in each variant one identifier was deleted; the variant was then re-clustered (as described above in Initial record linkage).
c) We identified the variant generated in (b) which maximised the increase in identifiers with cardinality = 1. If more than one variant had this property, we selected one at random.
d) We scored the cluster splitting as successful if the identifier picked for deletion on the basis that it was bad in step (c) was the same 'bad' identifier selected in step (a).
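The deletion-and-rescoring heuristic can be sketched in Python as below. The quality measure is an interpretation: it counts, over all resulting clusters, the identifier fields that take exactly one distinct value. Function names, field names and the hand-built collided cluster are assumptions made for the example, not the authors' code or data.

```python
import random


def cluster_labels(records, fields):
    """Union-find labelling of records connected by shared identifiers."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for idx, rec in enumerate(records):
        for field in fields:
            if rec.get(field) is None:
                continue
            key = (field, rec[field])
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx
    return [find(i) for i in range(len(records))]


def quality(records, fields):
    """Count (cluster, field) pairs in which the field has cardinality 1."""
    labels = cluster_labels(records, fields)
    score = 0
    for label in set(labels):
        members = [r for r, lab in zip(records, labels) if lab == label]
        for field in fields:
            if len({m[field] for m in members if m.get(field) is not None}) == 1:
                score += 1
    return score


def pick_identifier_to_delete(records, fields, rng):
    """Delete each identifier in turn, re-cluster, and keep the best-scoring deletion."""
    base = quality(records, fields)
    gains = {}
    for field, value in sorted({(f, r[f]) for r in records for f in fields
                                if r.get(f) is not None}):
        variant = [{**r, field: None} if r.get(field) == value else r for r in records]
        gains[(field, value)] = quality(variant, fields) - base
    best = max(gains.values())
    return rng.choice([key for key, gain in gains.items() if gain == best])


# Hand-built collision: the last record carries hospital number H1 in error.
FIELDS = ("nhs", "hosp", "name_dob")
collided = [
    {"nhs": "N1", "hosp": "H1", "name_dob": "A"},
    {"nhs": "N1", "hosp": "H1", "name_dob": "A"},
    {"nhs": "N2", "hosp": "H2", "name_dob": "B"},
    {"nhs": "N2", "hosp": "H1", "name_dob": "B"},
]
chosen = pick_identifier_to_delete(collided, FIELDS, random.Random(0))
```

Here deleting H1 separates the two individuals and gives every remaining identifier a single value per cluster, so it scores far higher than any other deletion; ties, when they occur, are broken at random as in step (c).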
7 Fuzzy Search
We used two fuzzy search algorithms, together with manual curation, to estimate the proportion of clusters which are highly similar. Firstly, to find records similar to a query record, we determined all trigrams (substrings containing three consecutive characters) of surname, forename, date of birth, hospital and NHS numbers within the query, and identified the top 10 matches by ranking matches according to the numbers of shared trigrams between the query record and all other records in the database. Secondly, we identified the subset of records having identical first surname and forename to the query record. After both approaches, the candidates generated were scored as 'likely to be from the same patient' or 'not likely to be from the same patient' subjectively by two different observers.
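The first, trigram-based search can be illustrated with the Python sketch below. Pooling the trigrams of all fields into one set per record is an assumption (the source does not say whether fields were compared separately), and the field names and example records are invented.

```python
def trigrams(text):
    """All substrings of three consecutive characters."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def record_trigrams(record, fields):
    """Pooled trigram set over the chosen fields of one record."""
    grams = set()
    for field in fields:
        value = record.get(field)
        if value:
            grams |= trigrams(value)
    return grams


def top_matches(query, records, fields, k=10):
    """Return up to k (shared trigram count, record index) pairs, best first."""
    query_grams = record_trigrams(query, fields)
    scored = []
    for idx, rec in enumerate(records):
        shared = len(query_grams & record_trigrams(rec, fields))
        if shared > 0:
            scored.append((shared, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


SEARCH_FIELDS = ("surname", "forename")
database = [
    {"surname": "SMYTH", "forename": "JOHN"},
    {"surname": "JONES", "forename": "ANN"},
    {"surname": "SMITH", "forename": "JON"},
]
matches = top_matches({"surname": "SMITH", "forename": "JOHN"}, database, SEARCH_FIELDS)
```

The candidates returned would then be reviewed manually, as in the text; the ranking only narrows down which records an observer needs to look at.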
Identifiers with high cardinality
Table: High cardinality of combinations of name and date of birth. Measure reported: average NHS numbers per identifier. Identifiers listed: National Health Service Number; Date of birth; Surname, complete forename; Surname, first letter of forename, date of birth; Surname, first three letters of forename, date of birth; Surname, complete forename, date of birth.
Table: Combinations of identifiers available on different record sources. Column heading recoverable from the source: Name/date of birth.
Initial Record linkage using identifiers
Identity collision and its detection
Despite the efficiency of the initial clustering operation, it has a major problem. If an identifier is mistyped, and happens to be an identifier used by another individual, then these two individuals will be assigned to the same cluster. This process ("collision") can generate differences in multiple identifiers (e.g. in surname, forename, date of birth) between elements in the set generated by identity collisions. We used these differences, expressed as distances, to build logistic regression models predicting that the clusters created by initial exact match clustering contained multiple individuals.
Table: Multivariate logistic model classifying bad clusters from good. Two models are reported (Model: Any females; Model: No females). Terms recoverable from the source include date of birth (day, month and year) and a term labelled "110111 or 223456"; the p values recoverable range from 5 × 10^-4 to <1 × 10^-16.
Table: Classifier performance on an independent validation set of 25,000 complex clusters. Column headings recoverable from the source: predicted not bad; % predicted bad.
Resolution of bad clusters
The logistic classifier suggested that 44,330 clusters (1.2%) were derived from more than one individual ("bad"). As described in Methods, we developed and tested by simulation a cluster partitioning algorithm which aimed to detect single, defective identifiers combining two clusters. In a simulation, which used real clusters combined randomly using a single identifier (see Methods), of 19,863 combined clusters, 19,258 were correctly partitioned. Thus, ~96.9% of bad clusters of this type can be successfully resolved by the algorithm described. We incorporated the cluster resolution algorithm into the overall pipeline (see Methods). After the cluster resolution operation, only 5,570 clusters (0.15%) were classified bad on re-scoring by the same algorithm.
Assessment of process overall
We wished to investigate the overall performance of the linkage algorithm. We considered three aspects: specificity, sensitivity, and speed.
Non-specific linkage refers to placing multiple individuals in the same cluster ('identity collision'). The logistic classifier suggested that at the end of the linkage process there may be 5,570 such clusters among the 3.6 M clusters, or 0.15%. A random sample of these clusters was inspected manually. Of 100 such clusters examined by one author (DW), 6 (6%) were thought to represent two individuals, while the rest were thought to represent one individual, with a variety of identifier variants, name variants, etc. which led to false positive classification. Examples causing false classification of a cluster as containing >1 individual included incorrect NHS numbers (which are heavily weighted in the logistic classifier), and variants of forenames and surnames (particularly where double-barrelled names are variably used, or where transliteration from the original language into Latin script is needed). Thus, estimated in this way, the true number of clusters containing multiple individuals in the database may be as low as 300-400. However, the logistic classifier is estimated to be only about 97% sensitive at detecting bad clusters, so about 3% of true bad clusters will be missed. The classifier was applied to the 284,636 clusters with some intra-cluster variation, so another ~8,400 bad clusters might be undetected, leaving ~9,000 remaining clusters (~0.2% of the 3.6 M clusters) with identity collision in the database.
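The bounds quoted above can be checked with a few lines of Python. This is one reading of the arithmetic that reproduces the rounded figures; in particular, applying the 3% miss rate to the clusters that were classified but not flagged is an assumption, since the text does not spell out the calculation.

```python
total_clusters = 3_600_000
flagged_after_resolution = 5_570      # clusters still classified bad
confirmed_fraction = 6 / 100          # 6 of 100 inspected clusters held two individuals
classified_clusters = 284_636         # clusters with some intra-cluster variation
classifier_sensitivity = 0.97

# Lower bound: flagged clusters that manual inspection suggests are truly bad.
lower_bound = flagged_after_resolution * confirmed_fraction

# Assumed reading: 3% of the classified-but-unflagged clusters may be missed bad clusters.
possibly_missed = (1 - classifier_sensitivity) * (classified_clusters - flagged_after_resolution)

upper_bound = lower_bound + possibly_missed
upper_percent = 100 * upper_bound / total_clusters
```

Under this reading the lower bound falls in the 300-400 range and the upper bound is a little under 9,000 clusters, in line with the ~0.2% quoted.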
We conducted two additional automated tests of linkage specificity. Firstly we determined the numbers of clusters with multiple, theoretically unique identifiers. Secondly, we measured the numbers of patients with multiple death dates recorded. Death date is recorded in the PAS system, so having multiple death dates reflects having two PAS entries. There are two possible explanations for multiple identifiers or death dates:
the information systems contain details on the same individual, but with different identifiers, or
the information systems contain details on different individuals, each with their own identifiers; however, these are incorrectly clustered together.
Table: Effect of collision resolution. Columns: before collision resolution; after collision resolution. Rows recoverable from the source: number of clusters; clusters with multiple (categories not recovered).
Finally, we considered clusters which have both a microbiology sample and a PAS record; this is a large group which includes all inpatients who have ever had a microbiology sample. We identified all those who appeared to have had microbiology samples taken >7 days before they were born (according to their PAS entry), or who had samples taken >7 days after they were reported to have died. The 7-day cut off is arbitrary, and is used to select events which are unlikely to be physiological. Of 281 cases, we found 139 (49%) with no differences in identifiers within the cluster, and 142 (51%) with differences in identifiers within the cluster. Careful inspection suggests that 14/142 are caused by combining two individuals inappropriately, with the other cases being due to typographical errors in dates of collection or dates of death.
Overall, we concluded that linkage specificity is high, with ~99.8% of clusters containing, with high likelihood, only the records of one individual; the number of clusters containing mislinked individuals lies between ~300 and ~9,000 out of 3.6 M. Additionally, for studies where date of death is an important outcome, it appears that mislinkage is only a minor contributor to errors in reported death date, being responsible for about 10% of a series of errors identified.
We also examined linkage sensitivity. By sensitivity, we mean that all the records of one individual are partitioned into a single cluster. To determine whether it was likely that the records from a single individual were distributed across two clusters, we used a combination of a fuzzy searching method (see Methods), and manual curation.
We investigated clusters containing at least one Patient Administration system (PAS) record, a clinically and epidemiologically important group which represents an important test of the linkage system, as it includes inpatient visits and large numbers of laboratory and other records. A random sample of 250 clusters was compared with all other clusters containing at least one PAS record. This indicated that approximately 7% of clusters containing a PAS record were similar to another cluster containing a PAS record. Notably, it appeared that where there were 'duplicate' PAS records, one was created on one single hospital visit and usually lacked an NHS number, whereas all other hospital records of the patient were assigned to the other PAS entry which contained all the other hospital admission information (not shown). Likewise, approximately 14% of clusters appeared similar to clusters with no PAS record. In many of these cases, the similar records were derived from the LIMS data source, where identifiers were few (Table 4), and after examining similarities manually, it was difficult to be sure whether the observed similarities reflected two patients with similar identifiers, or one patient with typographical errors in the identifier.
Overall pipeline: identifier cleaning and forename/surname duplication screening; construction of unique identifiers; initial clustering using identifiers; identity collision detection; identity collision resolution; identity collision reassessment.
We describe an exact-match based, highly efficient linkage scheme suitable for large scale linkage of hospital records. A key requirement is that identifiers expected to be exact should exist, or can be constructed; if this requirement is met, clustering is very efficient and readily implemented. The problem inherent in the approach is 'identity collision' - the inappropriate combination of two individuals based on mis-entry of an identifier in one of them, and its coincidence with an identifier belonging to another patient - which we have demonstrated can be addressed by systematic investigation of identifiers within suspect clusters, followed by cluster partitioning. This process also finds suspect identifiers within datasets. The algorithm is rapid, and is capable of incremental updates. Testing on five data sources including 9.2 million records indicates that ~99.8% of clusters formed consist of records from one individual.
The approach differs from classical probabilistic linkage in several respects:
pairwise distances between all elements are not computed in the initial linkage, and blocking steps are not used;
decisions about cluster quality are made on analysis of the whole cluster formed deterministically, not on pairwise comparison of records. Maximal weighted distances within a cluster are used to classify clusters into good and bad;
subsequent cluster division relies on edge structure, which probabilistic linkage does not do.
Whilst we are confident that the vast majority of clusters contain only one patient, a more difficult issue concerns the situation when records from one patient are assigned to multiple clusters. We note that among the 2.26 M patients registered with the hospital's administration system, close matches were found in about 5% of clusters. Most of these appear to represent odd orphan records together with a main record to which almost all other data is attached, and so their epidemiological impact may be small for some applications. Put another way, it may be that about 5% of the patient administration system's entries are duplicates, although they differ in all of name and date of birth, hospital number and NHS number. In many cases, it is difficult to be sure whether these entries do reflect the same individual, and we did not add a fuzzy matching component to our routine pipeline, although for some applications this will prove helpful, with or without a manual curation step.
What is an acceptable level of linkage? All linkage methods have a mislinkage rate, and we would argue that the issue of 'acceptable' levels of mislinkage is highly application specific. For clinical use it can be argued that the most dangerous situation is that in which a result is assigned to the wrong patient. This is an event which is not commonly considered clinically; because some tests are highly likely to change management, there is a substantial risk of inappropriate change in therapy. By contrast, the risk associated with the test going 'missing' - not being linked to a patient - is often less, because it can usually be repeated, although there are obviously exceptions. For epidemiological purposes, whether modelling or reporting in a tabular form, the critical issue is bias associated with mislinkage, which is application and data specific. Our study of deaths suggests that this linkage method biases analysis of death following infection, one of our epidemiological goals, relatively little.
The approach presented has a number of limitations. Firstly, it is dependent on having samples with unique identifiers, and preferably multiple unique identifiers. As alluded to above, records without at least one shared identifier will not be linked using the approach shown. This situation arises relatively commonly with our LIMS dataset, which contains low numbers of identifiers, particularly prior to 2003, and will degrade the performance of many linkage algorithms. An additional fuzzy matching step would be required to merge these clusters, if one had sufficient confidence in the match, which we do not in our current application. Alternatively, a composite identifier with high cardinality suitable for incorporating into the exact matching system could potentially be constructed using transformations designed to eliminate common spelling or other errors, e.g. the double metaphone algorithm. Lack of a fuzzy matching step in the existing pipeline contributes to efficiency, but for some data sets and applications, addition of such a step may be important.
Provided unique identifiers exist, however, if one can detect records from different individuals containing a shared, erroneous identifier, then there is the opportunity to partition the clusters formed in order to drive up clustering quality. The logistic classifier used here is not necessarily the optimal tool to do this with, and other supervised classification systems might offer increased performance.
Indeed, one interesting aspect of the algorithm used here is the separation of the algorithm used for detection of bad clusters, which relies on a logistic classifier, from that used for bad cluster partitioning, which relies on graph-based edge editing and was designed for the situation in which identifier error is relatively rare. This setting allows quality scoring of the effect of removal of individual identifiers from the clusters. A simple heuristic is used to score the result, and although this has good performance, it is possible that other quality measures, based around inter-node distances, other forms of edge weighting, cluster entropy, or the maximal intra-cluster distance (as in the logistic classifier used here) might offer increased performance in both partitioning and selecting records for partitioning. In situations where unique identifiers cannot be found, although initial clustering based on non-unique identifiers could be performed, large clusters would then result, likely requiring more sophisticated algorithms to partition them efficiently. Future work developing these, and comparing these algorithms with probabilistic linkage, is planned.
The technique described appears to offer a simple, rapid, highly efficient two-step method for large scale linkage for some important record types, including those found in healthcare. Clustering performance is enhanced by a system for finding erroneous identifiers and subsequent record partitioning.
The work was supported in part by the NIHR Biomedical Research Centre, Oxford, and by the NIHR Programme Development grant RP-DG-1108-10125. We thank Dr Iryna Schlackow for helpful comments, and particularly the many people who have helped with data provision, including Jane Simms, Kevin Paddon, Dr Brian Shine, Andrew Summers, Adrian Steel, Kevin Woodley, and other members of the Oxfordshire Health Informatics Service.
- Wyllie DH, Crook DW, Peto TE: Mortality after Staphylococcus aureus bacteraemia in two hospitals in Oxfordshire, 1997-2003: cohort study. BMJ. 2006, 333: 281. 10.1136/bmj.38834.421713.2F.
- Wyllie DH, Peto TE, Crook D: MRSA bacteraemia in patients on arrival in hospital: a cohort study in Oxfordshire 1997-2003. BMJ. 2005, 331: 992. 10.1136/bmj.38558.453310.8F.
- Wyllie DH, Walker AS, Peto TE, Crook DW: Hospital exposure in a UK population, and its association with bacteraemia. J Hosp Infect. 2007, 67: 301-307. 10.1016/j.jhin.2007.08.018.
- Winkler WE: Overview of Record Linkage and Current Research Directions. 2006, Statistical Research Division, U.S. Census Bureau, 2006-2, 44 pp.
- Levenshtein V: Binary code capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 1966, 10: 707-710.
- Fellegi I, Sunter A: A theory for record linkage. Journal of the American Statistical Association. 1969, 64: 1183-1210. 10.2307/2286061.
- McCallum A, Nigam K, Ungar LH: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000, New York: ACM
- Lyons RA, Jones KH, John G, Brooks CJ, Verplancke JP, Ford DV, Brown G, Leake K: The SAIL databank: linking multiple health and social care datasets. BMC Med Inform Decis Mak. 2009, 9: 3. 10.1186/1472-6947-9-3.
- Schaeffer SE: Graph Clustering. Computer Science Review. 2007, 1: 27-64. 10.1016/j.cosrev.2007.05.001.
- Sauleau EA, Paumier JP, Buemi A: Medical record linkage in health information systems by approximate string matching and clustering. BMC Med Inform Decis Mak. 2005, 5: 32. 10.1186/1472-6947-5-32.
- Kalashnikov DV, Mehrotra S: Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph. ACM Transactions on Database Systems. 2006, 31: 52. 10.1145/1138394.1138401.
- Chapman S: SimMetrics: a Similarity Metric Library. 2005, Sheffield University, 1.6.2.d07
- Teorey T, Lightstone S, Nadeau T, Jagadish HV: Database Modelling and Design: Logical Design. 2006, Elsevier, 5
- Arasu A, Ganti V, Kaushik R: Efficient Exact set-similarity joins. Proceedings of the 32nd international conference on Very large data bases. 2006
- Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, New York: Springer, 4
- Deepayan S: Lattice: Multivariate Data Visualisation with R. 2008, Springer, 1
- Rares V, Li C: Efficient top-k algorithms for fuzzy search in string collections. International Conference on Management of Data Rhode Island. 2009, ACM
- Phillips L: The double metaphone search algorithm. C/C++ Users Journal. 2000
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/11/7/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.