The other side of the de-identification coin is data utility: a trade-off exists between privacy and utility. If an optimal trade-off can be found, patient privacy can be preserved while data users remain satisfied with the data utility. Data utility can be measured by information loss metrics; typically, lower levels of information loss are associated with higher data utility, and vice versa.

The extent of suppression performed on the data is an important indicator of information loss. Although the extent of suppression has known disadvantages as an information loss metric [87], it provides an intuitive way for an expert data analyst to gauge data quality. The more suppressed records and/or individual values in a data set, the greater the potential biases introduced in an analysis of the data.

A reasonable quantitative assessment of information loss could be based on comparing the analysis results obtained from the original and disclosed (de-identified) data [88]. However, this is difficult to achieve because the potential uses of data vary and are difficult to predict in advance. In the case of the DAD PUMF, it is not possible to know a priori, with precision, all the ways that data recipients may analyze the data, which can include statistical as well as machine learning methods. In fact, one purpose of creating a PUMF is to encourage the development of novel data modeling and data mining techniques.

Despite the lack of universally acceptable information loss criteria or metrics, it has been argued that there is little information loss if a data set is *valid and interesting* [89]. A de-identified data set is considered *valid* if it preserves (i) means and covariances in a small subset of records, (ii) marginal values in a few tabulations of the data, and (iii) at least one distributional characteristic. A data set is called *interesting* if six variables on important subsets of records can be validly analyzed. While a useful starting point, this definition can only be meaningfully operationalized if there is some knowledge of the analysis that will be performed on the de-identified data. Another suggested approach is to examine the function that maps original records to the protected records [88]. As this function gets closer to the identity function, the information loss decreases, and vice versa.

Information loss metrics for continuous data include comparing the original and de-identified data sets on the mean square error, mean absolute error, and mean variation [59, 90]. Such metrics cannot be easily computed for categorical variables; therefore, three methods are suggested [59, 88]: (i) direct comparisons based on a distance definition using category ranges, (ii) comparison of contingency tables, and (iii) entropy-based measures.
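For continuous variables, the first two of these metrics are straightforward to compute on paired values. The following is a minimal sketch; the age values and the rounding scheme are illustrative assumptions, not drawn from any real data set:

```python
def mean_square_error(original, deidentified):
    """Average squared difference between paired original/de-identified values."""
    return sum((o - d) ** 2 for o, d in zip(original, deidentified)) / len(original)

def mean_absolute_error(original, deidentified):
    """Average absolute difference between paired original/de-identified values."""
    return sum(abs(o - d) for o, d in zip(original, deidentified)) / len(original)

ages_original = [34, 57, 42, 68, 23]
ages_rounded  = [35, 55, 40, 70, 25]  # ages rounded to the nearest 5 (illustrative)

mse = mean_square_error(ages_original, ages_rounded)   # 3.4
mae = mean_absolute_error(ages_original, ages_rounded)  # 1.8
```

Lower values on both metrics indicate that the de-identified data stayed closer to the original.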

An entropy [91, 92] metric was used in a number of studies to measure the information loss resulting from suppression, global recoding, and PRAM [49, 93, 94]. Recently, the entropy metrics described in [49, 93] were extended to handle non-uniform distributions; the resulting measure has been called non-uniform entropy [95] and has been used specifically in k-anonymity algorithms [96].
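As a simplified illustration of the entropy-based approach (not the exact formulation of [49, 93]), the information loss from a global recoding can be gauged as the reduction in Shannon entropy of a variable's empirical distribution. The city and province values below are hypothetical:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

original = ["Ottawa", "Toronto", "Ottawa", "Montreal", "Toronto", "Ottawa"]
recoded  = ["Ontario", "Ontario", "Ontario", "Quebec", "Ontario", "Ontario"]

# Recoding cities to provinces collapses categories, reducing entropy;
# the reduction serves as a rough measure of information loss.
entropy_loss = shannon_entropy(original) - shannon_entropy(recoded)
```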

Samarati used *height* as an information loss metric [81]. Height indicates the generalization level in a quasi-identifier generalization hierarchy. Greater height means more information loss. Height is considered a weaker metric compared to non-uniform entropy because it does not take into account the information loss contributed by individual variables [87].

Another information loss metric based on the generalization hierarchy is Precision or *Prec* [83, 97]. For every variable, the ratio of the number of generalization steps applied to the total number of possible generalization steps (total height of the generalization hierarchy) gives the amount of information loss for that particular variable. Overall, *Prec* information loss is the average of the *Prec* values across all quasi-identifiers in the data set.
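A sketch of *Prec* along the lines described above; the quasi-identifier names and hierarchy heights are hypothetical:

```python
def prec_loss(applied_levels, hierarchy_heights):
    """Average, across quasi-identifiers, of the ratio of generalization
    steps applied to the total height of each variable's hierarchy."""
    ratios = [applied_levels[v] / hierarchy_heights[v] for v in hierarchy_heights]
    return sum(ratios) / len(ratios)

heights = {"age": 4, "postal_code": 3, "sex": 1}  # total generalization levels
applied = {"age": 2, "postal_code": 3, "sex": 0}  # levels actually applied

loss = prec_loss(applied, heights)  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```

A fully suppressed variable (generalized to the top of its hierarchy, like `postal_code` here) contributes a ratio of 1, and an untouched variable contributes 0.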

A frequently used information loss metric in the computational disclosure control literature is the Discernability Metric (*DM*) [98–105]. *DM* assigns to every record a penalty proportional to the number of records that are indistinguishable from it.
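A minimal sketch of *DM*, grouping records by their quasi-identifier values and charging each record the size of its equivalence class (the additional penalty that some formulations apply to fully suppressed records is omitted here):

```python
from collections import Counter

def discernability_metric(records):
    """DM: each record is penalized by the size of its equivalence class,
    i.e. the number of records sharing its quasi-identifier values, so a
    class of size s contributes s * s to the total."""
    class_sizes = Counter(records)
    return sum(size * size for size in class_sizes.values())

# Four records falling into equivalence classes of sizes 2, 1, and 1
# (age band, postal code prefix); the values are illustrative:
qi_values = [("30-39", "K1A"), ("30-39", "K1A"), ("40-49", "K2B"), ("50-59", "K1C")]
dm = discernability_metric(qi_values)  # 2*2 + 1*1 + 1*1 = 6
```

More aggressive generalization produces larger equivalence classes and hence a larger *DM* value.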

The *minimal distortion* (*MD*) metric measures the dissimilarities between the original and de-identified records [30, 31, 106], charging a unit of penalty for each generalized or suppressed instance of a value. While both *DM* and *MD* can assess the level of distortion, *DM* has the advantage over *MD* that it can differentiate how much indistinguishability increased in going from the original to the de-identified data set [107].
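*MD* can be sketched as a simple count of distorted values; the per-record flagging scheme used here is an illustrative assumption:

```python
def minimal_distortion(records):
    """MD: one unit of penalty per generalized or suppressed value.
    Each record is a list of (value, is_distorted) pairs, where the flag
    marks values changed by generalization or suppression."""
    return sum(1 for record in records for (_, distorted) in record if distorted)

records = [
    [("30-39", True), ("K1A", False)],  # age generalized, postal code intact
    [("*",     True), ("K2B", False)],  # age suppressed,  postal code intact
]
md = minimal_distortion(records)  # 2
```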

Information loss caused by generalization can also be measured using the *ILoss* metric [108]. This metric captures the fraction of domain values generalized to a certain value [45]. The *ILoss* of a record is the sum of the *ILoss* values over all of its variable values; different weights can be applied to different variables when computing this sum. Similarly, the overall *ILoss* for a data set is obtained by adding up the *ILoss* values of its records.
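A sketch of *ILoss*, taking the loss for one generalized value as the fraction of the attribute's domain it covers beyond the original single value; the domain sizes and weights are hypothetical:

```python
def iloss_value(n_covered, domain_size):
    """ILoss of one generalized value: (|values covered| - 1) / |domain|.
    An ungeneralized value covers only itself and contributes 0."""
    return (n_covered - 1) / domain_size

def iloss_record(values, weights=None):
    """Weighted sum of per-value ILoss over a record's variable values.
    `values` is a list of (n_covered, domain_size) pairs."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(w * iloss_value(c, d) for w, (c, d) in zip(weights, values))

# One record: age generalized to a 10-value band out of a 100-value domain,
# and city generalized to a region covering 5 of 50 cities (illustrative):
record = [(10, 100), (5, 50)]
loss = iloss_record(record)  # 9/100 + 4/50 = 0.17
```

The data-set-level *ILoss* would then simply sum `iloss_record` over all records.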

Iyengar [109] used a classification metric, *CM*, which assigns a penalty to a record if suppression or generalization assigns it to a different majority class. This metric is applied to the training set and requires that a classification method be chosen; the associated problem is that the exact classification approach needed may not be known at the time of data publishing. Fung et al. [110, 111] used a metric called *IGPL* to measure the trade-off between information gain (*IG*) and privacy loss (*PL*). *IGPL* is obtained by dividing *IG* by *PL* incremented by 1. The formulas for *IG* and *PL* can be found in [45].
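The *IGPL* score itself is simple to state; the *IG* and *PL* values below are placeholders, since their actual formulas depend on the candidate generalization and are given in [45]:

```python
def igpl(information_gain, privacy_loss):
    """Trade-off score IGPL = IG / (PL + 1); a higher score indicates a
    candidate generalization with a better gain-to-loss trade-off."""
    return information_gain / (privacy_loss + 1)

# A candidate with IG = 0.6 and PL = 0.2 scores 0.6 / 1.2 = 0.5 (placeholder values):
score = igpl(0.6, 0.2)
```

Incrementing *PL* by 1 in the denominator keeps the score defined even when a candidate incurs no privacy loss at all.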