We first describe a loglinear formulation of the extended F-S model with conditional dependence. Let M be the true match status of a pair of records (M = 1 for a true match and M = 0 for a true non-match). For each record pair with K fields, an agreement vector is observed,
$$Y = (y_1, y_2, \ldots, y_K),$$
where $y_i = 1$ if the $i$th field agrees and 0 otherwise. The match prevalence is defined as the proportion of record pairs belonging to the true match class and is π = P(M = 1). The parameters of the classical F-S model include
$$m_i = P(y_i = 1 \mid M = 1), \qquad u_i = P(y_i = 1 \mid M = 0), \qquad i = 1, \ldots, K,$$
where the m-probabilities are the probabilities of field agreement given that the record pair is a true match, and the u-probabilities are the probabilities of field agreement given that the record pair is a true non-match.
To more readily accommodate departures from conditional independence, the traditional F-S model can be reparameterized using a loglinear formulation, where the mean number of record pairs with agreement pattern Y and match status M is given as follows:
(1)
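As a hedged sketch, one standard way to write a conditional independence latent class model in loglinear form is shown below. It assumes 0/1 coding for both the $y_i$ and M, and the parameter symbols ($\lambda_0$, $\lambda_i$, $\lambda_M$, $\lambda_{iM}$) are assumptions for illustration, not necessarily the notation of Equation (1).

```latex
\log m(Y, M) = \lambda_0 + \sum_{i=1}^{K} \lambda_i\, y_i + \lambda_M\, M + \sum_{i=1}^{K} \lambda_{iM}\, y_i M
```

In a model of this form there are no terms involving products of two different fields, which is what makes the fields conditionally independent given M.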
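With K fields, there are $D = 2^K$ possible agreement patterns. Let $f_d$ represent the frequency count for the agreement pattern $Y_d$ ($d = 1, 2, \ldots, D$). Then the log-likelihood is given by
$$\ell = \sum_{d=1}^{D} f_d \log P(Y_d),$$
where the marginal probability of observing the agreement pattern $Y_d$ is
$$P(Y_d) = \frac{m(Y_d, M = 1) + m(Y_d, M = 0)}{\sum_{d'=1}^{D} \left[ m(Y_{d'}, M = 1) + m(Y_{d'}, M = 0) \right]}. \tag{2}$$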
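To make the pattern enumeration and the log-likelihood concrete, the following minimal Python sketch evaluates the marginal pattern probabilities under the classical conditional independence F-S parameterization (m-probabilities, u-probabilities, and π) rather than the loglinear parameters. The numerical values and the function names are hypothetical and serve only as illustration.

```python
from itertools import product
from math import log, prod


def pattern_probability(y, m, u, pi):
    """Marginal P(Y) as a two-class mixture with conditionally independent fields."""
    p_match = prod(mi if yi else 1.0 - mi for yi, mi in zip(y, m))
    p_nonmatch = prod(ui if yi else 1.0 - ui for yi, ui in zip(y, u))
    return pi * p_match + (1.0 - pi) * p_nonmatch


def log_likelihood(freq, m, u, pi):
    """Sum of f_d * log P(Y_d) over agreement patterns with nonzero counts."""
    return sum(f * log(pattern_probability(y, m, u, pi))
               for y, f in freq.items() if f > 0)


# Hypothetical parameter values for K = 3 fields (not taken from the source).
m = [0.95, 0.90, 0.85]
u = [0.10, 0.05, 0.20]
pi = 0.02

patterns = list(product([0, 1], repeat=len(m)))  # all D = 2**K agreement patterns
probs = {y: pattern_probability(y, m, u, pi) for y in patterns}

# Hypothetical frequency counts, here simply rounded expected counts for 1000 pairs.
freq = {y: round(1000 * p) for y, p in probs.items()}
ll = log_likelihood(freq, m, u, pi)
```

The probabilities over all $2^K$ patterns sum to one, which is a quick sanity check on any implementation of the marginal probability.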
The match score for a specific agreement pattern $Y_d$ is defined as
The loglinear formulation has been shown to be equivalent to the F-S classical probabilistic formulation of the conditional independence latent class model [12] through the following relationships:
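As an illustration of how such relationships arise, suppose (as an assumption, with symbols that need not match the source's notation) that the conditional independence model is written with 0/1 coding as $\log m(Y, M) = \lambda_0 + \sum_i \lambda_i y_i + \lambda_M M + \sum_i \lambda_{iM} y_i M$. Summing the expected counts over the relevant cells then gives the F-S parameters in terms of the loglinear parameters:

```latex
m_i = \frac{\exp(\lambda_i + \lambda_{iM})}{1 + \exp(\lambda_i + \lambda_{iM})}, \qquad
u_i = \frac{\exp(\lambda_i)}{1 + \exp(\lambda_i)},
\qquad
\pi = \frac{e^{\lambda_M} \prod_{i=1}^{K} \left(1 + e^{\lambda_i + \lambda_{iM}}\right)}
           {\prod_{i=1}^{K} \left(1 + e^{\lambda_i}\right) + e^{\lambda_M} \prod_{i=1}^{K} \left(1 + e^{\lambda_i + \lambda_{iM}}\right)}
```

Under this assumed coding, $\lambda_i$ is the log-odds of agreement on field i among non-matches, and $\lambda_{iM}$ is the shift in that log-odds among matches.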
To incorporate conditional dependence in the loglinear model setting, we add the appropriate interaction terms to the model. For example, if there is dependence between fields j and l within each latent class, the model then includes two additional terms:
(3)
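A sketch of what such an extended model can look like is given below, again assuming 0/1 coding and hypothetical parameter symbols ($\lambda_{jl}$ and $\lambda_{jlM}$ are illustrative names, not necessarily those of Equation (3)):

```latex
\log m(Y, M) = \lambda_0 + \sum_{i=1}^{K} \lambda_i\, y_i + \lambda_M\, M + \sum_{i=1}^{K} \lambda_{iM}\, y_i M
             + \lambda_{jl}\, y_j y_l + \lambda_{jlM}\, y_j y_l M
```

With this coding, the j-l interaction equals $\lambda_{jl}$ in the non-match class and $\lambda_{jl} + \lambda_{jlM}$ in the match class, so constraining one or both terms yields models with dependence in only one class or with a common coefficient in both.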
The above loglinear model with interaction terms can be fitted in standard statistical software such as SAS (example code is provided in Additional file 1). The goodness of fit of a model is measured by both the deviance $G^2$ and the Bayesian Information Criterion (BIC). We use the deviance to compare nested models: the model with the lower deviance fits the data better and is preferred. For models that are not nested within each other, BIC is the most commonly used criterion in latent class modeling because it takes the sample size into account [22]. The model with the lower BIC is preferred.
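The two fit statistics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the deviance is taken as the likelihood-ratio statistic comparing observed and expected pattern counts, and BIC is taken in the form $-2\ell + p \log N$ (other conventions exist and differ by terms that do not affect model ranking on the same data). The counts below are hypothetical.

```python
from math import log


def deviance(observed, expected):
    """G^2 = 2 * sum f_d * log(f_d / e_d); cells with f_d = 0 contribute zero."""
    return 2.0 * sum(f * log(f / e) for f, e in zip(observed, expected) if f > 0)


def bic(log_lik, n_params, n_pairs):
    """BIC = -2 * log-likelihood + (number of free parameters) * log(N)."""
    return -2.0 * log_lik + n_params * log(n_pairs)


# Hypothetical observed and model-expected counts for four agreement patterns.
observed = [50, 30, 15, 5]
expected = [45.0, 35.0, 12.0, 8.0]
g2 = deviance(observed, expected)
```

A model whose expected counts reproduce the observed counts exactly has a deviance of zero, and larger values indicate poorer fit.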
In what follows, we describe a series of steps to fit a loglinear model with appropriate interactions. Specifically, we follow a six-step procedure that identifies the pairwise dependencies between fields using the correlation residual plot proposed by Qu, Tan, and Kutner [23], incorporates these dependencies into the model, and re-examines the fit of the new model. We iterate between these steps as follows:
Step 1
Fit a loglinear model with no interactions using the observed agreement vectors. This is simply the F-S model formulated as a loglinear model, and it provides initial parameter estimates for the next model. Obtain the deviance and BIC of this conditional independence model. See Additional files 1, 2, and 3 for SAS code with an example.
Step 2
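Compute the observed pairwise correlation between fields j and l. The correlation between $y_j$ and $y_l$ is
$$\operatorname{corr}(y_j, y_l) = \frac{p_{jl} - p_j\, p_l}{\sqrt{p_j (1 - p_j)\, p_l (1 - p_l)}}, \tag{4}$$
where $p_j = P(y_j = 1)$, $p_l = P(y_l = 1)$, and $p_{jl} = P(y_j = 1, y_l = 1)$. Using the observed data, the estimates for $p_j$, $p_l$, and $p_{jl}$ are given by
$$\hat{p}_j = \frac{1}{N} \sum_{d=1}^{D} f_d\, y_{dj}, \qquad \hat{p}_l = \frac{1}{N} \sum_{d=1}^{D} f_d\, y_{dl}, \qquad \hat{p}_{jl} = \frac{1}{N} \sum_{d=1}^{D} f_d\, y_{dj}\, y_{dl},$$
respectively, where $y_{dj}$ is the jth element of $Y_d$ and $N = \sum_{d} f_d$ is the total number of record pairs.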
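The calculation in this step can be sketched as follows. The function computes the correlation between two binary agreement indicators from a table of pattern counts; the function name, the dictionary layout, and the example counts are assumptions made for illustration.

```python
from math import sqrt


def pairwise_correlation(counts, j, l):
    """Correlation between binary fields j and l from pattern counts.

    `counts` maps each agreement pattern (a tuple of 0/1 values) to its count.
    """
    n = sum(counts.values())
    p_j = sum(c for y, c in counts.items() if y[j]) / n
    p_l = sum(c for y, c in counts.items() if y[l]) / n
    p_jl = sum(c for y, c in counts.items() if y[j] and y[l]) / n
    return (p_jl - p_j * p_l) / sqrt(p_j * (1 - p_j) * p_l * (1 - p_l))


# Hypothetical counts for K = 2 fields.
counts = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}
r = pairwise_correlation(counts, 0, 1)
```

Because the function only needs a table of counts, the same code applies unchanged to model-expected counts.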
Step 3
Substitute the estimates of the λ parameters from the model fitted in Step 1 into Equation (1) to obtain the expected number of record pairs $m(Y_d, M_d)$ for each vector pattern $Y_d$ and match status $M_d$. Calculate the expected marginal probability $P(Y_d)$ using Equation (2) and the expected cell count $N \cdot P(Y_d)$ for each vector pattern, where $N$ is the total number of record pairs. Expected pairwise correlations are then estimated using Equation (4) (the same formulas as in Step 2), based on the expected counts rather than the observed counts $f_d$.
Step 4
Compute the correlation residual, which is the difference between the observed correlation and the expected correlation for each pair of fields. Plot the residuals across the different pairs of fields. A correlation residual that differs substantially from zero suggests dependence for the corresponding pair of fields.
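Steps 2 through 4 can be sketched end to end in Python. In this minimal illustration the expected counts come from a conditional independence model expressed with hypothetical m-probabilities, u-probabilities, and prevalence (rather than fitted λ estimates), and the "observed" counts are constructed by shifting mass so that fields 0 and 1 agree together more often than the model predicts. All names and numbers are assumptions.

```python
from itertools import combinations, product
from math import prod, sqrt


def correlation(counts, j, l):
    """Correlation between binary fields j and l from a pattern-count table."""
    n = sum(counts.values())
    p_j = sum(c for y, c in counts.items() if y[j]) / n
    p_l = sum(c for y, c in counts.items() if y[l]) / n
    p_jl = sum(c for y, c in counts.items() if y[j] and y[l]) / n
    return (p_jl - p_j * p_l) / sqrt(p_j * (1 - p_j) * p_l * (1 - p_l))


def expected_counts(m, u, pi, n):
    """Expected count N * P(Y_d) for every pattern under conditional independence."""
    def prob(y, p):
        return prod(q if yi else 1 - q for yi, q in zip(y, p))
    return {y: n * (pi * prob(y, m) + (1 - pi) * prob(y, u))
            for y in product([0, 1], repeat=len(m))}


def correlation_residuals(observed, expected):
    """Observed minus expected correlation for every pair of fields."""
    k = len(next(iter(observed)))
    return {(j, l): correlation(observed, j, l) - correlation(expected, j, l)
            for j, l in combinations(range(k), 2)}


# Hypothetical model parameters and sample size.
expected = expected_counts([0.95, 0.90, 0.85], [0.10, 0.05, 0.20], 0.02, 10000)

# Construct illustrative "observed" counts with extra dependence between
# fields 0 and 1, leaving all single-field margins unchanged.
observed = dict(expected)
for x in (0, 1):
    observed[(1, 1, x)] += 50
    observed[(0, 0, x)] += 50
    observed[(1, 0, x)] -= 50
    observed[(0, 1, x)] -= 50

residuals = correlation_residuals(observed, expected)
```

In this constructed example only the residual for the pair (0, 1) is clearly away from zero, which is the pattern a correlation residual plot is meant to reveal.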
Step 5
Incorporate the conditional dependence between the pair of fields identified in Step 4 as an interaction term in the loglinear model. Specifically, fit the following four models: interaction in the match class only, interaction in the non-match class only, interaction in both classes with different coefficients, and interaction in both classes with the same coefficient. Since the four models are not all nested, BIC is used to compare them, and the model with the lowest BIC is chosen. Repeat Steps 3 through 5, now obtaining the expected number of record pairs $m(Y_d, M_d)$ by substituting the λ estimates of the chosen model into Equation (3) (with the appropriate interactions) instead of Equation (1), until no large correlation residuals are apparent.
Step 6
To classify individual pairs as matches, non-matches, or uncertain matches, we use the final model parameter estimates to calculate the match score for each agreement pattern. Record pairs are then classified based on these match scores.
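A threshold-based classification of this kind can be sketched as follows. The sketch makes two assumptions that are not taken from the text above: the score is the classical F-S agreement weight (the base-2 log-likelihood ratio under conditional independence), not a score from a model with interactions, and the two cutoffs are hypothetical values chosen only for illustration.

```python
from math import log2


def fs_weight(y, m, u):
    """Classical F-S weight: log2 of P(Y | match) / P(Y | non-match),
    assuming conditionally independent fields."""
    return sum(log2(mi / ui) if yi else log2((1 - mi) / (1 - ui))
               for yi, mi, ui in zip(y, m, u))


def classify(score, lower, upper):
    """Three-way decision using two cutoffs (hypothetical values)."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "uncertain"


# Hypothetical m- and u-probabilities for K = 3 fields and hypothetical cutoffs.
m = [0.95, 0.90, 0.85]
u = [0.10, 0.05, 0.20]
LOWER, UPPER = 0.0, 6.0

labels = {y: classify(fs_weight(y, m, u), LOWER, UPPER)
          for y in [(1, 1, 1), (1, 1, 0), (0, 0, 0)]}
```

In practice the cutoffs would be chosen to control the error rates of the two definite decisions, with the uncertain band sent for manual review.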
Approval to perform this study was obtained from the Indiana University Institutional Review Board: approval number 1010002784 (0909–68). De-identified data for the HIE example described in the next section is provided as Additional file 3.