Health information exchanges (HIEs), with their highly heterogeneous data, are becoming increasingly important sources of integrated clinical data supporting many healthcare tasks and health-related research. HIE data are captured from independent databases that use different patient identifiers, and best practices for implementing and operating HIEs are needed. With respect to data integration and patient matching specifically, the HIT Policy Committee, in its 2011 formal recommendations to the Director of the Office of the National Coordinator for Health Information Technology (HIT), recognized the need to develop and disseminate best practices for patient matching [1], because such practices are currently lacking for HIE data.

Many methodologies have been proposed to identify records in two or more databases that relate to the same entity. Deterministic approaches are based on ad hoc rules, which classify a pair of records as a match if the two records satisfy certain conditions. Although straightforward to implement, deterministic approaches are often too conservative, with unacceptably high false negative (missed-match) rates, especially when data are noisy [2]. Missed matches may lead to suboptimal care because physicians may lack the information necessary to make informed medical decisions.
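To make the missed-match problem concrete, the following minimal sketch applies one hypothetical deterministic rule. The rule itself, the field names (`ssn`, `last_name`, `dob`), and the two records are invented for illustration and are not taken from any cited system.

```python
def deterministic_match(rec_a, rec_b):
    """Hypothetical rule: declare a match if SSN agrees, or if both
    last name and date of birth agree exactly."""
    def same(field):
        a, b = rec_a.get(field), rec_b.get(field)
        # A missing value never counts as agreement (an assumption of this sketch).
        return a is not None and a == b

    return same("ssn") or (same("last_name") and same("dob"))


# Two illustrative records for the same person; one last name has a typo.
rec_a = {"ssn": None, "last_name": "Johnson", "dob": "1980-04-12"}
rec_b = {"ssn": None, "last_name": "Jonhson", "dob": "1980-04-12"}

missed = deterministic_match(rec_a, rec_b)  # the rule rejects this pair
```

Because the rule demands exact agreement, a single transposition in the last name is enough to reject the pair, which is the conservative behavior described above.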

Distance-based methods that can handle numerical or categorical fields, as described in [3], are another approach to linking records. These methods have been shown to perform similarly to probabilistic methods for both numeric [4] and categorical data [5], but they require appropriate distance measures to be established for each variable under consideration. They are not investigated further here because they are not commonly used in practice and have not yet been investigated thoroughly in the HIE setting, although they may be of interest in future work [6].

Probabilistic methods are another alternative to deterministic linkage methods. A common probabilistic record linkage method was proposed by Fellegi and Sunter in 1969 [7]. This model is a latent class model, where the latent, or unknown, class represents the *true match status* of the record pair. In this model, each field contained in both data sources is compared within a record pair, and a binary variable is created that equals 1 if the two values agree and 0 otherwise; thus a binary agreement vector is created for each record pair. The Fellegi-Sunter (F-S) model assumes that the agreement patterns of the fields are independent conditional on the true match status.
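The agreement vector and the role of the conditional independence assumption can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the field names, the example records, and the agreement probabilities `m` (agreement given a true match) and `u` (agreement given a true nonmatch) are assumed values, and treating a missing value as disagreement is a simplifying assumption.

```python
import math


def agreement_vector(rec_a, rec_b, fields):
    """Return 1 for each field whose values are present and equal, else 0."""
    return [
        int(rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f))
        for f in fields
    ]


def fs_weight(gamma, m, u):
    """Log2 likelihood ratio of an agreement vector, match vs. nonmatch.

    Under conditional independence the joint probability factorizes,
    so the weight is a sum of per-field terms.
    """
    total = 0.0
    for g, m_k, u_k in zip(gamma, m, u):
        if g == 1:
            total += math.log2(m_k / u_k)
        else:
            total += math.log2((1.0 - m_k) / (1.0 - u_k))
    return total


fields = ["last_name", "first_name", "dob"]  # illustrative field names
m = [0.9, 0.9, 0.9]  # assumed P(field agrees | true match)
u = [0.1, 0.1, 0.1]  # assumed P(field agrees | true nonmatch)

rec_a = {"last_name": "Smith", "first_name": "Ann", "dob": "1975-01-30"}
rec_b = {"last_name": "Smith", "first_name": "Anne", "dob": "1975-01-30"}

gamma = agreement_vector(rec_a, rec_b, fields)
weight = fs_weight(gamma, m, u)
```

Here the two agreeing fields each contribute log2(9) and the disagreeing field contributes log2(1/9), so the terms simply add. That additivity is exactly what fails when field agreements are dependent within a class.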

This conditional independence assumption is often violated in real-world record linkage scenarios [8]. When conditional independence does not hold, estimates of model parameters can be substantially biased [9]. This bias can lead to inaccurate record linkage outcomes as described previously [2, 8]. Therefore, finding the most parsimonious model that accounts for the conditional dependence will provide the most accurate classification of record pairs.

Various methods have been proposed to address the lack of conditional independence in latent class models for record linkage. For example, Tromp et al. incorporated conditional dependence between two fields by combining them into one field with four nominal levels of agreement [2]. This strategy can be cumbersome if conditional dependence exists among more than two fields, since the number of nominal categories grows exponentially with the number of fields combined (2^k levels for k fields). Schürle proposed an alternative approach to incorporating conditional dependence in the traditional F-S model framework by working directly with the joint distribution of the observed agreement pattern given the true match status. However, this model involves heavy parameterization that leads to significant overfitting [10]. For example, when seven fields are used for record matching, this model involves 255 parameters, whereas the data can support estimation of at most 127 parameters. Because of the extreme complexity of the model, the proper choice of starting values is critical for parameter estimation. This greatly limits the usefulness of the approach, given the computational effort required to examine multiple starting values.
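The parameter counts quoted above can be reproduced with a short calculation. The breakdown below is one way to arrive at those figures, assuming binary agreement on each of k fields and two latent classes; the function names are ours.

```python
def n_agreement_patterns(k):
    """Number of distinct binary agreement patterns for k fields."""
    return 2 ** k


def max_estimable_parameters(k):
    """Free cell probabilities in the observed table (they must sum to 1)."""
    return 2 ** k - 1


def unrestricted_joint_parameters(k):
    """Unrestricted joint distribution in each of two classes, plus the
    match prevalence: 2 * (2^k - 1) + 1."""
    return 2 * (2 ** k - 1) + 1


def conditional_independence_parameters(k):
    """One agreement probability per field per class, plus the prevalence."""
    return 2 * k + 1


k = 7
counts = (
    n_agreement_patterns(k),                 # 128
    max_estimable_parameters(k),             # 127
    unrestricted_joint_parameters(k),        # 255
    conditional_independence_parameters(k),  # 15
)
```

For comparison, the conditional independence model on the same seven fields needs only 15 parameters, which is why a parsimonious middle ground with a few selected interactions is attractive.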

Latent class models with conditional independence can be equivalently formulated using a loglinear framework [11, 12]. Using this formulation, the conditional independence assumption can be readily relaxed to account for conditional dependence among fields by including interactions among fields within the match class or the nonmatch class or both [13]. Such loglinear approaches incorporating interactions within latent classes have been used in many applications, notably in diagnostic testing [14, 15].
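As a sketch of this formulation (the notation here is ours and may differ from that of the cited references), let $\gamma = (\gamma_1, \dots, \gamma_K)$ be the binary agreement pattern over $K$ fields, let $c$ index the latent class (match or nonmatch), and let $\mu_{\gamma c}$ be the expected number of record pairs with pattern $\gamma$ in class $c$. Under conditional independence, and with the usual identifiability constraints on the $\lambda$ terms, the model contains only main effects and field-by-class terms. Conditional dependence is introduced by adding interactions for a chosen set $\mathcal{I}_c$ of field pairs within class $c$:

```latex
% Conditional independence: main effects and field-by-class terms only
\log \mu_{\gamma c}
  = \lambda + \lambda^{C}_{c}
  + \sum_{k=1}^{K} \lambda^{F_k}_{\gamma_k}
  + \sum_{k=1}^{K} \lambda^{F_k C}_{\gamma_k c}

% Conditional dependence: add within-class interactions for pairs in I_c
\log \mu_{\gamma c}
  = \lambda + \lambda^{C}_{c}
  + \sum_{k=1}^{K} \lambda^{F_k}_{\gamma_k}
  + \sum_{k=1}^{K} \lambda^{F_k C}_{\gamma_k c}
  + \sum_{(j,k) \in \mathcal{I}_c} \lambda^{F_j F_k \mid c}_{\gamma_j \gamma_k}

% Only the margin over the latent class is observed
\mu_{\gamma} = \sum_{c} \mu_{\gamma c}
```

Setting $\mathcal{I}_c$ to the empty set in both classes recovers the conditional independence model, while a nonempty set in only one class allows dependence within the match class or the nonmatch class alone.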

Similarly, loglinear models have been applied to record linkage. Using survey data whose record pairs had known match status, Thibaudeau identified fields with conditional dependence using a loglinear model with selected interactions [8]. Winkler estimated a loglinear model using three-way interactions, acknowledging that identifying the correct set of interactions is difficult when a large number of fields are involved [16]. Loglinear models with certain interaction terms have also been applied in record linkage by Larsen and colleagues [17, 18]. Until the stepwise model-building strategy recently proposed by Zhu et al. [19], there had been no research on effectively identifying appropriate interactions in record linkage. However, that approach can only identify models in which all interactions are of the same order.

Many previous record linkage studies focused largely on maximum likelihood (ML) estimation, where the parameter estimates of the loglinear model were obtained using an Expectation-Maximization (EM) algorithm. For situations such as the latent class model, where incomplete data (unobserved classes) are involved, the EM algorithm is a powerful tool for estimating model parameters [20]. However, as noted by Winkler (1995), the EM algorithm takes substantially longer to reach convergence when conditional dependence is incorporated in the loglinear model because the M-step does not have a closed-form solution [21]. Alternatively, the loglinear latent class model can be conveniently estimated using routines in existing software, such as SAS® PROC NLMIXED (SAS Institute, Cary, NC), thus providing a pragmatic approach to incorporating conditional dependence more efficiently.
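For reference, the EM iteration in the simpler conditional independence case, where the M-step is available in closed form, can be sketched as below. This is an illustrative pure-Python version under assumed starting values and synthetic agreement patterns; it is not the estimation routine used in this work, and it does not cover models with interactions, for which the M-step would require numerical optimization.

```python
def em_conditional_independence(patterns, n_iter=100, p=0.1,
                                m0=0.9, u0=0.1, eps=1e-6):
    """EM for a two-class latent class model with conditionally
    independent binary agreement fields.

    patterns: list of 0/1 agreement vectors, one per record pair.
    Returns (p, m, u, post): match prevalence, per-field agreement
    probabilities in each class, and posteriors from the final E-step.
    """
    def clamp(x):
        # Keep probabilities away from 0 and 1 to avoid division by zero.
        return min(max(x, eps), 1.0 - eps)

    n, k = len(patterns), len(patterns[0])
    m, u = [m0] * k, [u0] * k
    post = [0.0] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        for i, g in enumerate(patterns):
            lm, lu = p, 1.0 - p
            for j in range(k):
                lm *= m[j] if g[j] else 1.0 - m[j]
                lu *= u[j] if g[j] else 1.0 - u[j]
            post[i] = lm / (lm + lu)
        # M-step: closed-form weighted proportions.
        w_match = sum(post)
        p = clamp(w_match / n)
        for j in range(k):
            agree_m = sum(w for w, g in zip(post, patterns) if g[j])
            agree_u = sum(1.0 - w for w, g in zip(post, patterns) if g[j])
            m[j] = clamp(agree_m / w_match)
            u[j] = clamp(agree_u / (n - w_match))
    return p, m, u, post


# Synthetic agreement patterns over three fields (illustrative only).
patterns = ([[1, 1, 1]] * 30 + [[1, 1, 0]] * 5 + [[1, 0, 1]] * 5
            + [[1, 0, 0]] * 10 + [[0, 1, 0]] * 10 + [[0, 0, 1]] * 10
            + [[0, 0, 0]] * 130)

p_hat, m_hat, u_hat, post = em_conditional_independence(patterns)
```

Each M-step here is a set of weighted proportions. Once interaction terms enter the loglinear model, those proportions no longer maximize the expected complete-data likelihood, so every M-step needs an inner iterative fit, which is the source of the slowdown noted above.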

Even though loglinear models have been proposed by multiple authors for handling conditional dependence in record linkage, implementing such models requires customized programs, and the process for choosing pairwise interactions in these models has not been specified. We therefore describe and evaluate a method for identifying conditional dependencies among fields, which are subsequently incorporated as interactions in a loglinear model fitted using standard software. To illustrate the methodology, we use an application that links the client list of a county health department to itself for de-duplication. The step-by-step method described is supplemented by sample code that can be readily modified for linking any two data sets using standard statistical software.