BMC Medical Informatics and Decision Making
Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models

Background: Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs).


Introduction
Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections [1]. The entity is often a person, in which case record linkage may be used for tasks such as building a longitudinal health record [2], or relating genotypic information to phenotypic information [3,4]. In other settings, the aim may be to link several sources of information about the same event, such as police, accident investigation, ambulance, emergency department and hospital admitted patient records which all relate to the same motor vehicle accident [5]. Record linkage (originally known as "medical record linkage") is now widely used in research: in October 2002, a search of the biomedical literature via PubMed for "medical record linkage" as a Medical Subject Heading returned over 1,300 references [6].
The process of record linkage is trivial where the records that relate to the same entity or event all share a common, unique key or identifier: an SQL "equijoin" operation, or its equivalent in other data management environments, can be used to link records. However, often there is no unique key which is shared by all the data collections which need to be linked, particularly when these data collections are administered by separate organisations, possibly operated for quite different purposes in disparate subject domains.
In these settings, more specialised record linkage techniques need to be used. These techniques can be broadly divided into two groups: deterministic, or rule-based techniques, and probabilistic techniques. A full description of these techniques is beyond the scope of this paper. A number of recent reviews of this topic are available [7,8]. However, all of these techniques rely on an element-wise comparison between pairs of records each comprising an ensemble of non-unique, partially identifying personal (or event) attributes. These attributes commonly include name, residential address, date of birth (or age at a particular date), sex (or gender), marital status, and country of birth.
For example, consider the fictitious personally-identified records in Table 1.
The evident variability in the formatting and encoding of these records is quite typical of data collections which have been assembled from multiple sources. This variability tends to frustrate naive attempts at automated linkage of these records. To a human, it is obvious that records 0 and 2 represent the same person. It is quite likely, but not certain, that records 1 and 3 also represent the same person. The status of record 4 with respect to records 0 and 2 is far less clear: could this be Gwendolynne's spouse, Evelyn, or is this Gwendolynne with her sex and age wrongly recorded?
Regardless of the method used to automate such decisions, it is clear that transformation of the source data into a normalised form is required before valid and reliable comparisons between pairs of records can be made. Such transformation and normalisation is usually called "data standardisation" in the medical record literature, and "data cleaning" or "data scrubbing" in the computer science literature. We will refer to the process as "standardisation" henceforth, which should not be confused with the epidemiological technique of "age-sex standardisation" of incidence or prevalence rates.
Standardisation of scalar attributes such as height or weight involves transformation of all quantities into a common set of units, such as from British imperial to SI units. Categorical attributes such as sex are usually transformed to a common set of representations through simple look-up tables or mapping of various encodings; for example, both "Female" and "2" might be mapped to "F", and both "Male" and "1" to "M", in order to provide a consistent encoding of the sex attribute for each record. Such transformations do not present a major challenge. However, standardisation of attributes which are recorded in highly variable formats, such as names or residential addresses, is far less straightforward, and it is with this task that this paper is concerned.
This standardisation task can itself be decomposed into two steps: segmentation of the data into specific, atomic data elements; and the transformation of these atomic elements into their canonical forms. In some cases, a third step, the imputation of missing or blank data items, and a fourth step, the enhancement of the original data with known alternatives, may also be required.
Some examples of the first two steps will make this clearer. Table 2 shows the segmented and transformed forms of the name, address and sex attributes of the illustrative records introduced in Table 1.
Once the original data have been segmented and standardised in this way, further enhancement of the data is possible. For example, missing postal codes and territories can be automatically filled in from reference tables, and alternate, canonical forms of names can be added where informal, anglicised or other known variations are found, such as "Angie" (Angela, Angelique) or "Lyn" (Evelyn, Lyndon).

Related work
The terms data cleaning (or data cleansing), data standardisation, data scrubbing, data pre-processing and ETL (extraction, transformation and loading) are used synonymously to refer to the general tasks of transforming source data into clean and consistent sets of records suitable for loading into a data warehouse, or for linking with other data sets. A number of commercial software products are available which address this task, and a complete review is beyond the scope of this paper; a summary can be found in [9]. Name and address standardisation is also closely related to the more general problem of extracting structured data, such as bibliographic references, from unstructured or variably structured texts, such as scientific papers.
The most common approach for name and address standardisation is the manual specification of parsing and transformation rules. A well-known example of this approach in biomedical research is AutoStan, which was the companion product to the widely-used AutoMatch probabilistic record linkage software [10].
AutoStan first parses the input string into individual words, and each word is then mapped to a token of a particular class. The choice of class is determined by the presence of that word in user-supplied, class-specific lexicons (look-up tables), or by the type of characters found in the word (such as all numeric, alphanumeric or alphabetical). An ordered set of regular expression-like patterns is then evaluated against this sequence of class tokens. If a class token sequence matches a pattern, a corresponding set of actions for that pattern is performed. These actions might include dynamically changing the class of one or more tokens, removing particular tokens from the class token sequence, or modifying the value of the word associated with that token. The remaining patterns are then evaluated against the now modified class token sequence. In other words, the pattern matcher is re-entrant, and the actions associated with more than one pattern may act on any given token sequence. When the evolving token sequence for a particular record has been tested against all the available patterns, the words in the input string are output into specific fields corresponding to the final class of the tokens associated with each word.
Such approaches necessarily require both an initial and an ongoing investment in rule programming by skilled staff. In order to mitigate this requirement for skilled programming, some investigators have recently described systems which automatically induce rules for information extraction from unstructured text. These include Whisk [11], Nodose [12] and Rapier [13].
Note: The "bleeding" of street address data into the locality column in record 4 is deliberate, and typical of real-life data captured by information systems with fixed-length data fields.
Probabilistic methods are an alternative to these deterministic approaches. Statistical models, particularly hidden Markov models, have been used extensively in the computer science fields of speech recognition and natural language processing to help solve problems such as word-sense disambiguation and part-of-speech tagging [14]. More recently, hidden Markov and related models have been applied to the problem of extracting structured information from unstructured text [15-20].
This paper describes an implementation of lexicon-based tokenisation with hidden Markov models for name and address standardisation, an approach strongly influenced by the work of Borkar et al. [20]. This implementation is part of a free, open source [21] record linkage package known as Febrl (Freely extensible biomedical record linkage) [22]. Febrl is written in the free, open source, object-oriented programming language Python [23]. Other aspects of the Febrl project will be described in subsequent papers.

Cleaning and tokenisation
The following steps are used to clean and tokenise the raw name or address input string. Firstly, all letters are converted to lower case. Various sub-strings in the input string, such as " c/-" or " c.of " are then converted to their canonical form, such as "care_of ", based on a user-specified and domain-specific substitution table. Similarly, punctuation marks are regularised; for example, all forms of quotation marks are converted to a single character (a vertical bar). The cleaned string is then split into a vector of words, using white space and punctuation marks as delimiters.
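The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration only, not the Febrl code: the names `SUBSTITUTIONS`, `QUOTE_CHARS` and `clean_and_split` are hypothetical, the substitution table holds just the two examples mentioned in the text, and treating the vertical bar as a delimiter when splitting is an assumption.

```python
import re

# Hypothetical substitution table; in practice it is user-specified
# and domain-specific. Only the two examples from the text are listed.
SUBSTITUTIONS = [
    (" c/- ", " care_of "),
    (" c.of ", " care_of "),
]

# Assumed set of quotation-mark variants to regularise.
QUOTE_CHARS = "\"'`\u2018\u2019\u201c\u201d"


def clean_and_split(raw):
    """Lower-case, substitute, regularise quotes, then split into words."""
    # Pad with spaces so that patterns with leading/trailing spaces
    # also match at the start and end of the string.
    text = " " + raw.lower() + " "
    for old, new in SUBSTITUTIONS:
        text = text.replace(old, new)
    # Convert every form of quotation mark to a single vertical bar.
    text = text.translate({ord(ch): "|" for ch in QUOTE_CHARS})
    # Split on white space and punctuation (assumption: the vertical
    # bar is treated as a delimiter). Underscores stay inside words.
    return [word for word in re.split(r"[\s,.;:|/\-]+", text) if word]
```

For instance, `clean_and_split('C/- Mary Smith, 17 "Epping" St')` yields the word vector `['care_of', 'mary', 'smith', '17', 'epping', 'st']`.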
Using look-up tables and some hard-coded rules, the words in this input vector are assigned one or more tokens, to which we will refer as "observation symbols" henceforth. The hard-coded rules include, for example, the assignment of the AN (alphanumeric) observation symbol to all words which are a mixture of alphabetic and numeric characters. However, the majority of observation symbols are assigned by searching for words, or sub-sequences of words, in various look-up tables. A list of the observation symbols currently supported by the Febrl package is given in Table 3. For example, one of the look-up tables may be a list of locality names. If a word (or contiguous group of words) is found in the locality table, then the LN (locality name) observation symbol is assigned to that word (or group). This look-up uses a "greedy" matching algorithm. For example, the wayfare name look-up table might contain a record for "macquarie", the locality qualifier look-up table might contain a record for "fields" and the locality name look-up table might contain a record for "macquarie fields". If the first word in the input vector is "macquarie" and the second word is "fields", these first two words will be coalesced (into "macquarie_fields") and will be assigned an LN (locality name) observation symbol, rather than the first word being assigned a WN (wayfare name) symbol and the second word an LQ (locality qualifier) symbol.
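A minimal sketch of greedy, longest-match-first lexicon tagging follows. It is an illustration under stated assumptions rather than the Febrl implementation: the tiny `LEXICON`, the function name `tag_words`, the fallback `UN` (unknown) symbol and the rule that a four-digit number is a postal code are all hypothetical, and each word receives a single symbol here even though the text allows more than one.

```python
# Hypothetical miniature look-up table, keyed by tuples of words.
LEXICON = {
    ("macquarie",): "WN",           # wayfare name
    ("fields",): "LQ",              # locality qualifier
    ("macquarie", "fields"): "LN",  # locality name
    ("northmead",): "LN",
    ("road",): "WT",                # wayfare type
    ("nsw",): "TR",                 # territory
}
MAX_WORDS = max(len(key) for key in LEXICON)


def tag_words(words):
    """Return (word, symbol) pairs, preferring the longest lexicon match."""
    tagged = []
    i = 0
    while i < len(words):
        # Greedy step: try the longest group of words first.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in LEXICON:
                tagged.append(("_".join(key), LEXICON[key]))
                i += n
                break
        else:
            # No lexicon entry: fall back to simple hard-coded rules.
            word = words[i]
            if word.isdigit():
                # Assumption: four digits means a postal code.
                symbol = "PC" if len(word) == 4 else "NU"
            elif any(c.isdigit() for c in word) and any(c.isalpha() for c in word):
                symbol = "AN"
            else:
                symbol = "UN"  # hypothetical "unknown" symbol
            tagged.append((word, symbol))
            i += 1
    return tagged
```

With this toy lexicon, the words "macquarie" and "fields" are coalesced into "macquarie_fields" and tagged LN, because the two-word entry is tried before either single-word entry.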
Such lexicon-based tokenisation allows readily-available lists of postal codes, locality names, states and territories, as typically published by postal authorities or government gazetteers, to be leveraged to provide the probabilistic model used in the next stage with the maximum number of "hints" about the semantic content of the input string. Note that these probabilistic models are able to cope with situations in which incorrect observation symbols are assigned to particular words in the input string; the only requirement is that the symbols are assigned in a consistent fashion. For example, the input string "17 macquarie fields road, northmead nsw 2345" might be tokenised as "NU-LN-WT-LN-TR-PC" (number-locality name-wayfare type-locality name-territory-postal code). The first LN symbol is wrong in this context because "macquarie fields" is a wayfare name, not a locality name. The hidden Markov models described in the next section are readily able to accommodate such incorrect tokenisation.

Hidden Markov models
A hidden Markov model (HMM) is a probabilistic finite state machine comprising a set of observable facts or observation symbols (also known as output symbols), a finite set of discrete, unobserved (hidden) states, a matrix of transition probabilities between those hidden states, and a matrix of the probabilities with which each hidden state emits an observation symbol [24]. This "emission matrix" is sometimes also called the "observation matrix".
In the case of residential addresses, we posit that hidden states exist for each segment of an address, such as the wayfare (street) number, the wayfare name, the wayfare type, the locality and so on. We treat the tokenised input address as an ordered sequence of observation symbols, and we assume that each observation symbol has been emitted by one of the hidden address states. In other words, we first replace individual words with tokens which represent a guess (based on look-up tables and simple rules) about the part of the name or address which that word represents. These tokens are our observable facts (observation symbols). We then try to determine by statistical induction which of a large number of possible arrangements of hypothetical "emitters" is most likely to have produced the observed sequence. These hypothetical emitters of observation symbols are the hidden states in our model.
Training data are representative samples of the input records which have been tokenised into sequences of observation symbols as described above, and then tagged with the hidden state which the trainer thought was most likely to have been responsible for emitting each observation symbol. Maximum likelihood estimates (MLEs) are derived for the HMM transition and emission probability matrices by accumulating frequency counts for each type of state transition and observation symbol from the training records. The probability of making the transition from state i to state j is the number of transitions from state i to state j in the training data divided by the total number of transitions from state i to a subsequent state. Similarly, the probability of observing symbol k given an underlying (hidden) state j is the number of times, in the training data, that symbol k was emitted by state j divided by the total number of symbol emissions by state j. Because of the use of frequency-based MLEs, it is important that the records in the training data set are reasonably representative of the data sets to be standardised. However, as reported below, the HMMs appear to be quite robust with respect to the training set used and quite general with respect to the data sources with which they can be used. As a result, it is quite feasible to add training records which are archetypes of unusual name or address patterns, without compromising the performance of the HMMs on more typical source records.
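The frequency-count estimates described above can be sketched as follows. This is a minimal illustration, not the Febrl code: the function name `train_hmm` and the record layout (a list of `(state, symbol)` pairs per training record) are assumptions, and the initial-state probabilities, which the text does not discuss, are estimated here in the same way from the first state of each record.

```python
from collections import Counter, defaultdict


def train_hmm(tagged_records):
    """Estimate start, transition and emission probabilities by counting.

    tagged_records: list of records, each a list of (state, symbol) pairs
    in the order in which they appear in the input string.
    """
    start_counts = Counter()
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)

    for record in tagged_records:
        if not record:
            continue
        states = [state for state, _ in record]
        start_counts[states[0]] += 1
        for state, symbol in record:
            emit_counts[state][symbol] += 1
        # Count each transition from state i to the following state j.
        for prev_state, next_state in zip(states, states[1:]):
            trans_counts[prev_state][next_state] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    start = normalise(start_counts)
    trans = {state: normalise(c) for state, c in trans_counts.items()}
    emit = {state: normalise(c) for state, c in emit_counts.items()}
    return start, trans, emit
```

Each row of `trans` and each row of `emit` sums to one by construction; any transition or emission that never occurs in the training records is simply absent, which corresponds to a maximum likelihood estimate of zero.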
The trained HMM can then be used to determine which sequence of hidden states was most likely to have emitted the observed sequence of symbols. In an ergodic (fully connected) HMM, in which each state can be reached from every other state, if there are N states and T observation symbols in a given sequence, then there are N^T different paths through the model. Even with quite simple models and input sequences, it is computationally infeasible to evaluate the probability of every path to find the most likely one. Fortunately, the Viterbi algorithm [25] provides an efficient method for pruning the number of probability calculations needed to find the most likely path through the model.
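The following is a minimal sketch of the Viterbi algorithm for a model stored as nested dictionaries. It is illustrative rather than the Febrl implementation: the function name, the dictionary layout and the `floor` value substituted for zero probabilities (to avoid taking the logarithm of zero) are assumptions. The work grows in proportion to T x N^2 rather than N^T, because only the best path into each state is kept at every step.

```python
import math


def viterbi(symbols, states, start, trans, emit, floor=1e-12):
    """Return (most likely state path, its log-probability)."""

    def log_p(value):
        # Assumption: zero probabilities are replaced by a small floor.
        return math.log(max(value, floor))

    # best[t][s]: log-probability of the best path ending in state s at step t.
    best = [{
        s: log_p(start.get(s, 0.0)) + log_p(emit.get(s, {}).get(symbols[0], 0.0))
        for s in states
    }]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(symbols)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log_p(trans.get(r, {}).get(s, 0.0)))
                 for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_p(emit.get(s, {}).get(symbols[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(symbols) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]
```

Working in log space avoids numerical underflow on long sequences; the path returned is the same as the one that maximises the product of the probabilities.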
Once found, the most likely path through the HMM can then be used to associate each word in the original input string with a hidden state, and this information is then used to segment the input string into atomic data elements like those illustrated in Table 2. This approach can also be used with names or other variably-formatted text, using different sets of hidden states, observation symbols, transition and output matrices. Notice that the probabilities in each row of the transition matrix and in each column of the emission matrix add up to one. Also notice that none of the probabilities in the emission matrix are zero. In practice, it is common for some combinations of state and observation symbol not to appear in the training data, resulting in a maximum likelihood estimate of zero for that element of the emission matrix. Such zero probabilities can cause problems when the model is presented with new data, so smoothing techniques are used to assign small probabilities (in this case 0.01) to all unencountered observation symbols for all states. Traditionally Laplace smoothing is used [26], but Borkar et al. have also described the use of absolute discounting as an alternative when there are a large number of distinct observation symbols [20]. The Febrl package offers both types of smoothing.
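As an illustration of the smoothing idea, the sketch below applies additive (Laplace) smoothing to the emission counts of a single hidden state. It is a generic textbook form, not necessarily the exact variant used in Febrl: the function name `laplace_smooth` and the pseudo-count parameter `alpha` are assumptions, and absolute discounting is not shown.

```python
def laplace_smooth(counts, symbols, alpha=1.0):
    """Additive smoothing of the emission counts for one hidden state.

    counts:  dict mapping observation symbol -> count in the training data
    symbols: the full list of observation symbols known to the model
    alpha:   pseudo-count added to every symbol (1.0 gives Laplace smoothing)
    """
    total = sum(counts.get(sym, 0) for sym in symbols) + alpha * len(symbols)
    return {sym: (counts.get(sym, 0) + alpha) / total for sym in symbols}
```

For example, with counts of 3 for LN and 1 for UN and a symbol set of LN, UN and WT, the smoothed probabilities are 4/7, 2/7 and 1/7: the unseen symbol WT receives a small non-zero probability and the row still sums to one.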
Now consider an example address: "17 Epping St Smithfield New South Wales 2987". This would first be cleaned and tokenised into a sequence of observation symbols. Note that Epping is a suburb of the city of Sydney in the state of New South Wales, Australia, hence the word "epping" in the input string is assigned an LN (locality name) observation symbol even though to a human observer it is clearly a wayfare name in this context. This does not matter because we are ultimately not interested in the types of the observed symbols but rather in the underlying hidden states which were most likely to have generated them. Once the most likely sequence of hidden states has been found, it is a simple matter to use this information to segment the cleaned version of the input string into address elements and output them, as shown in Table 6.
Further details of the way in which HMMs are implemented in the Febrl package are available in the associated documentation [22]. The hidden states used in the name and address HMMs are shown in Tables 7 and 8 respectively. These hidden states, and the observation symbols listed in Table 3, were derived heuristically from AutoStan tokens and rules developed previously by two of the authors (TC and KL) for use with Australian names and residential addresses. Figures 2 and 3 show directed graphs of these models. Currently, the observation symbols and hidden states are "hard-coded" into the Febrl software package, although they can be altered by editing the freely available source code. Future versions of the package will use "soft-coded" observation symbols and hidden states, allowing users in other countries to adapt the HMMs for other types of name and address information, or indeed for quite different information extraction tasks, without the need for Python programming skills.

Methods
We evaluated the performance of the approach described above with typical Australian residential address data using two data sources.
The first source was a set of approximately 1 million addresses taken from uncorrected electronic copies of death certificates as completed by medical practitioners and coroners in the state of New South Wales (NSW) in the years 1988 to 2002. The majority of these data were entered from hand-written death certificate forms. The information systems into which the data were entered underwent a number of changes during this period.
The second data set was a random sample of 1,000 records of residential addresses drawn from the NSW Inpatient Statistics Collection for the years 1993 to 2001 [27]. This collection contains abstracts for every admission to a public-or private-sector acute care hospital in NSW. Most of the data were extracted from a variety of computerised hospital information systems, with a small proportion entered from paper forms.
Data from the NSW Midwives Data Collection (MDC) were also used. Most of these data were entered from hand-written forms, although some of the data for the latter years were extracted directly from computerised obstetric information systems.
Access to these data sets for the purpose of this project was approved by the Australian National University Human Research Ethics Committee and by the relevant data custodians within the NSW Department of Health. The data sets used in this project were held on secure computing facilities at the Australian National University and the NSW Department of Health head offices. In order to minimise the invasion of privacy which is necessarily associated with almost all research use of identified data, the medical and health status details were removed from the files used in this project. Thus, for this project the investigators had access to files of names and addresses, but not to any of the medical or other details for the individuals identified in those files, other than the fact that they had died or had given birth.

Address standardisation
Training of HMMs for residential address standardisation was performed by a process of iterative refinement.
An initial hidden Markov model (HMM) was trained using 100 randomly selected death certificate (DC) records.
Annotating these records with state and observation symbol information took less than one person-hour. The resulting model was used to process 1,100 randomly chosen DC records. These records then became a second-stage training set, with each record already annotated with states and observation symbols derived from the initial model. This annotation was manually checked and corrected where necessary, which took about 5 person-hours. An HMM derived from this second training set was then used to standardise 50,000 randomly chosen DC records,    the accuracy assessed. In other words, an HMM trained using one data source (DC) was used to standardise addresses from a different data source (ISC) without any retraining of the HMM.
An additional 1,000 randomly chosen address training records derived from the Midwives Data Collection (MDC) were then added to the 1,450 training records described above, and this larger training set was used to derive HMM2. HMM2 was then used to re-standardise the same sets of randomly chosen test records described in the first and second steps above, and the results were assessed.
A further 60 training records, based on archetypes of those records which were incorrectly standardised in all of the preceding tests, were then added to the training set to produce HMM3. HMM3 was then used to re-standardise the same DC and ISC test sets. Thus, HMM3 could be considered as an "overfitted" model for the particular records in the two test sets, although in practice researchers are likely to use such overfitting to maximise standardisation accuracy for the particular data sets used in their studies. The total training time for all address standardisation models was not more than 20 person-hours.
Finally, by way of comparison, the same two 1,000 record test data sets were standardised using AutoStan in conjunction with a rule set which had been developed and refined by two of the investigators (TC and KL) over several years for use with ISC (but not DC) address data, representing a cumulative investment of at least several person-weeks of programming time.

Name standardisation
To assess the accuracy of name standardisation, a subset of 10,000 records with non-empty name components was selected from the MDC data set (approximately a one per cent sample). This sample was split into ten test sets each containing 1,000 records. A ten-fold cross validation study was performed, with each of the folds having a training set of 9,000 records and the remaining 1,000 records being the test set. The training records were marked up with state and observation symbol information in about 10 person-hours using the iterative refinement method described above. HMMs were then trained without smoothing, and with Laplace and absolute discount smoothing, resulting in 30 different HMMs. We found that smoothing had a negligible effect on performance, and only the results from the unsmoothed HMMs are reported here.
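A ten-fold split of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `ten_fold_splits`, the shuffling step and the fixed random seed are hypothetical, and nothing here is taken from the Febrl code.

```python
import random


def ten_fold_splits(records, folds=10, seed=0):
    """Yield (training set, test set) pairs for k-fold cross validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumed: random fold membership
    size = len(shuffled) // folds
    for i in range(folds):
        test = shuffled[i * size:(i + 1) * size]
        # Everything outside the current test fold is used for training.
        train = shuffled[:i * size] + shuffled[(i + 1) * size:]
        yield train, test
```

With 10,000 records this gives ten folds, each with 9,000 training and 1,000 test records. Training one model per fold for each of the three smoothing settings (none, Laplace, absolute discounting) gives the 30 HMMs mentioned above. If the number of records is not a multiple of the number of folds, the leftover records are only ever used for training in this sketch.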
The performance of HMMs for name standardisation was compared with a deterministic rule-based standardisation algorithm which is also implemented in the Febrl package; details of this algorithm can be found in the associated documentation [22].

Evaluation criteria
For all tests, records were judged to be accurately standardised when all of the elements present in the input address string, with the exception of punctuation, were allocated to the correct output field, and the values in each output field were correctly transformed to their canonical form where required. Thus, a record was judged to have been incorrectly standardised if any element of the input string was not allocated to an output field, or if any element was allocated to the wrong output field. Due to resource constraints, the investigators were not blind to the nature of the standardisation process (HMM versus AutoStan) used. Exact binomial 95 per cent confidence limits for the proportion of correctly standardised records were calculated using the method given in [29].
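The method of [29] is not reproduced here. As an illustration only, the sketch below computes exact binomial limits using the Clopper-Pearson construction, which is one standard way of obtaining such limits and is assumed, not confirmed, to correspond to the cited method. The function names are hypothetical, and the limits are found by simple bisection on the binomial distribution function using only the Python standard library.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _solve(func, target, increasing, iterations=60):
    """Bisection on [0, 1] for func(p) = target, func being monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_binomial_ci(k, n, alpha=0.05):
    """Clopper-Pearson limits for k successes observed in n trials."""
    half = alpha / 2.0
    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), half, increasing=True)
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve(
        lambda p: binom_cdf(k, n, p), half, increasing=False)
    return lower, upper
```

A call such as `exact_binomial_ci(970, 1000)` returns the limits for a hypothetical test set in which 970 of 1,000 records were correctly standardised; the counts are illustrative and are not results from this study.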
In the records which were standardised incorrectly, not every data element was assigned to the wrong output field. For each of these address records, the proportions (and corresponding 95 per cent confidence limits) of data elements which were assigned to the wrong output field, or which were not assigned to an output field at all, were calculated. These quantities were not calculated for names due to the much simpler form of the name data.
tem's rules were developed, and better when used on a different data set. In other words, HMMs trained on a particular data source appear to be more general than a rule-based system using rules developed for the same data.
In addition, the improvements in performance observed with HMM2 and HMM3 suggest that, although frequency-based maximum likelihood estimates are used to derive the probability matrices, the resulting HMMs are fairly indifferent to the source of their training data, and their performance can even be improved by the addition of a small number of "atypical" training records which do not "fit" the HMM very well.
It is probable that some of the observed generality of the HMMs stems from the use of lexicon-based tokenisation as implemented in the Febrl package, which enables exhaustive but readily available place name and other lists to be leveraged. In contrast, Borkar et al. [20] replaced each word in each input address with symbols based on a simple regular expression grouping, e.g. 3-digit number, 5-digit number, single character, multi-character word, mixed alphanumeric word. These symbols contain much less semantic information than the lexicon-based symbols used in Febrl, although they have the advantage of not requiring look-up tables (lexicons). Borkar et al. also used nested HMMs to achieve acceptable accuracy on more complex addresses [20]. At least for Australian addresses, which are of similar complexity to North American addresses, but less complex than most European and Asian addresses, we have not found nested models to be necessary. This may be because the lexicon-based tokenisation used in Febrl preserves more information from the source string for use by the HMM, at the expense of a more complex model. However, the computational performance of these models is satisfactory. Future attempts at optimisation, by re-writing parts of the code, such as the Viterbi algorithm, in C are expected to yield significant increases in speed. In addition, the standardisation of each record is completely independent from other records, and hence can readily be performed in parallel on clusters of workstations (COWs).
Table note: Table cells contain the proportion of correctly standardised address records for each of the two data sources listed. Ninety-five per cent confidence limits for the proportions are given in brackets.
Table note: Table cells contain the mean proportion of data items in each address which were assigned to the incorrect output field, or to no output field. Ninety-five per cent confidence limits for the proportions are given in brackets.
Standardisation is not an all-or-nothing transformation, and both the rule-based and HMM approaches appear to degrade gracefully when the model or rules make errors.
In the address records which were not accurately standardised by the HMMs, at least two-thirds of all data elements present in the input record were allocated to the correct output fields. Thus, even these incorrectly standardised records would have considerable discriminatory power when used for record linkage purposes. In only two test records (out of 2,000) were all of the address elements wrongly assigned, and both of these were foreign addresses in non-English speaking countries. The performance of our AutoStan rule set was similar in this respect. It is unlikely that further training would assist the HMM in resolving this conundrum. One solution would be to validate the wayfare names as output by the HMM for each locality (where lists of wayfare names for each locality are available), and in cases in which the validation fails, to re-allocate the first of the two (apparent) wayfare names as a property name. Other incorrectly standardised records would also benefit from this type of specific post-processing, which would be applied only to those records which have been assigned a particular sequence of hidden states by the HMM.

Name standardisation
The performance of the HMM approach for name standardisation, compared to a rule-based approach, was less favourable. Given the simple form of most names in the test data, the rule-based approach was very accurate, achieving 97 per cent accuracy or better, whereas up to 17 per cent of names in the test data were incorrectly standardised by the HMM.
A possible reason for this poor performance may lie in the relative homogeneity of the MDC name data. Out of the 10,000 randomly selected names, approximately 85 per cent were of the simple form "givenname surname", and a further nine per cent were either of the form "givenname givenname surname" or "givenname surname surname". Thus the trained HMMs had very few non-zero transition probabilities, with a consequent restriction in the number of likely paths through the models.
Names with either two given names or two surnames seemed to be especially problematic. Often the HMMs misclassified the middle name as the first of two surnames instead of as a second given name. This is due to the large number of names of the form "givenname surname", which resulted in a very high transition probability from the first given name state to the first surname state. Therefore a second given name is often assigned by the HMM as a first surname, and the real surname as a second surname.
We plan to investigate whether higher-order HMMs, in which the transition probabilities between the current state and a sequence of two or more subsequent states are modelled, may perform better on this type of data.
Other areas which warrant further investigation include the utility of iterative re-estimation of the HMM parameters using the Baum-Welch [24], expectation maximisation (EM) [31] or gradient methods [32], and the substitution of maximum entropy Markov models [33] for the hidden Markov models currently used.
One further difficulty is that the estimated probability for the most likely path through the model for each input string depends on the number of words in that string: the more words there are, the more state transitions and hence the lower the overall probability of paths through the model. Thus, input strings cannot be ranked by the maximum probability returned by the Viterbi algorithm in order to find those for which the model is a "poor fit". This problem can be overcome by calculating the "log odds score" [34,35], which is the logarithm of the ratio of the probability that an input string was generated by the HMM to the probability that it was generated by a very general "null" model. Input strings can be ranked by this score, and strings with low scores considered for addition to the training data set.
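A log odds score of this kind can be sketched as follows. This is an illustration under stated assumptions, not the formulation of [34,35] or of Febrl: the probability under the HMM is computed with a scaled forward algorithm, the "null" model is assumed to emit every symbol independently and uniformly over an alphabet of `n_symbols` symbols, zero probabilities are replaced by a small `floor`, and all function names are hypothetical.

```python
import math


def forward_log_prob(symbols, states, start, trans, emit, floor=1e-12):
    """Log-probability that the HMM generated the symbol sequence."""

    def p(value):
        # Assumption: zero probabilities are replaced by a small floor.
        return max(value, floor)

    alpha = {s: p(start.get(s, 0.0)) * p(emit.get(s, {}).get(symbols[0], 0.0))
             for s in states}
    log_total = 0.0
    for sym in symbols[1:]:
        # Rescale at each step to avoid underflow on long sequences.
        scale = sum(alpha.values())
        log_total += math.log(scale)
        alpha = {
            s: sum(alpha[r] / scale * p(trans.get(r, {}).get(s, 0.0))
                   for r in states) * p(emit.get(s, {}).get(sym, 0.0))
            for s in states
        }
    return log_total + math.log(sum(alpha.values()))


def log_odds_score(symbols, states, start, trans, emit, n_symbols):
    """Log of P(sequence | HMM) divided by P(sequence | uniform null model)."""
    log_null = len(symbols) * math.log(1.0 / n_symbols)
    return forward_log_prob(symbols, states, start, trans, emit) - log_null
```

Because the null model's probability also shrinks with every additional symbol, the ratio is much less sensitive to the length of the input string than the raw path probability, so scores for strings of different lengths can be ranked against one another.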

Conclusions
Clearly more work needs to be done to improve the performance of HMMs on simpler, more homogeneous data such as mothers' names. However, the use of lexicon-based tokenisation combined with simple first-order HMMs as described in this paper does appear to be a viable alternative to traditional rule-based standardisation methods for more complex data such as residential addresses. Furthermore, the HMM approach does not require substantial initial and ongoing input by skilled programmers in order to set up and maintain complex sets of rules. Instead, clerical staff can be used to create and update the training files from which the probabilistic models are derived.
Future work on the standardisation aspects of the Febrl package will focus on internationalisation, the addition of post-processing rules which are associated with particular hidden state sequences which are known to be problematic, and investigation of higher order models and re-estimation procedures as noted above. We hope that other researchers will take advantage of the free, open source license under which the package is available to contribute to this development work.