A new adaptive testing algorithm for shortening health literacy assessments

Background Low health literacy has a detrimental effect on health outcomes, as well as ability to use online health resources. Good health literacy assessment tools must be brief to be adopted in practice; test development from the perspective of item-response theory requires pretesting on large participant populations. Our objective was to develop a novel classification method for developing brief assessment instruments that does not require pretesting on large numbers of research participants, and that would be suitable for computerized adaptive testing. Methods We present a new algorithm that uses principles of measurement decision theory (MDT) and Shannon's information theory. As a demonstration, we applied it to a secondary analysis of data sets from two assessment tests: a study that measured patients' familiarity with health terms (52 participants, 60 items) and a study that assessed health numeracy (165 participants, 8 items). Results In the familiarity data set, the method correctly classified 88.5% of the subjects, and the average length of test was reduced by about 50%. In the numeracy data set, for a two-class classification scheme, 96.9% of the subjects were correctly classified with a more modest reduction in test length of 35.7%; a three-class scheme correctly classified 93.8% with a 17.7% reduction in test length. Conclusions MDT-based approaches are a promising alternative to approaches based on item-response theory, and are well-suited for computerized adaptive testing in the health domain.


Background
More than half of the US adult population has limited health literacy skills [1]. Low health literacy can limit a person's ability to communicate effectively with healthcare providers, comprehend physician's instructions and make informed health decisions. In addition, a strong correlation between low health literacy and poor health outcomes have been documented in a range of medical problems [1].
The goal of the consumer health informatics initiatives is to improve patients' comprehension of health information. This requires tools that identify deficiencies in disadvantaged patient populations. Several health literacy measures are available, including the Rapid Estimate of Adult Literacy of Medicine (REALM) [2], the Test of Functional Health Literacy in Adults (TOFHLA) [3], and the Newest Vital Sign (NVS) [4]. In addition, several other assessments focus on numerical skills or numerical self-efficacy [5][6][7]. However, no single assessment tool covers text reading skills, numerical skills, ability to use forms, tables, and graphs, oral communication skills, conceptual understanding of physiological health issues, and ability to understand and navigate health systems such as hospitals and insurance companies, which are just some of the domains considered to be relevant to health literacy [8][9][10].
Even this partial list suggests that a truly comprehensive health literacy assessment would be prohibitively burdensome for both the patient and the clinician. In order for testing to be useful in clinical practice or for patientoriented informatics initiatives, it would have to be made as simple and short as possible. Abbreviated versions of REALM [11][12][13] and TOFHLA [14] have been proposed, although the applicability of methods used to shorten these tests on new tests remains to be investigated.
Our current work is part of a research program with an overall objective of developing a flexible health literacy instrument for older adults. A short and powerful assessment would be particularly useful for older adults, who: (a) are more likely to have low health literacy [10,15], (b) use health care services more often than younger people [16], and (c) are more likely to experience fatigue from taking an exhaustive literacy exam.
The task of health literacy assessment is similar to mastery tests used in educational and professional domains, in which a subject needs to be classified as master or non-master. Huynh [17] and van der Linden [18,19] have studied the problem of administering fixed-length mastery tests while minimizing the cost of misclassification of the subjects. However, in cases where the subjects exhibit clear mastery (or non-mastery) early termination of the tests would be beneficial. This would require variable-length tests in which the decision to continue or terminate testing can be made after each item. In one of the earliest attempt at a variable-length testing, Ferguson [20] proposed an approach based on Wald's sequential probability ratio [21]. However, in this approach all items were considered to be of equal difficulty. Kingsbury and Weiss [22], Reckase [23] and Lewis and Sheenan [24] have proposed methods that do not rely on this assumption and estimate each item's difficulty using item response theory (IRT).
Under IRT, the probability of a correct response to item i, by an examinee of estimated ability θ is calculated as: where a i [0, ∞), b i (-∞,∞) and c i [0,1] are itemspecific parameters that represent the difficulty, discriminating power and guessing probability of item i respectively [25,26]. IRT requires the items to be pre-tested on a large population in order to estimate the probability that an item is answered correctly by examinees of different competency levels. Although IRT-based approaches provide excellent results when examinees need to be scored with precision on a continuous scale, this calibration can be complex and resource-intensive. Multi-stage testing approaches can relax the calibration size requirements of IRT but would still require calibration samples in the order of a few hundred subjects [27].
The methods proposed by Kingsbury and Weiss [22] and Reckase [23] have used decision theory principles to evaluate the stopping function i.e. the cost of continuation of the test compared to the reduction in cost of misclassification (both false positives and false negatives) that can be expected to result from administering additional items. Vos [28], Welch and Frick [29] and Rudner [30,31] have discussed the extension of decision theory to item-selection.
These testing methods, which adapt to the examinees' perceived ability in order to determine the item order and test length, can be suitably administered using computers. Computerized adaptive testing (CAT) provides a way to make assessment more comprehensive while limiting additional burden on the test-taker [26,32]. CAT is routinely used in standardized educational testing and has recently been applied to patientoriented assessments in health and medicine. Mead and Drasgow's review of 123 tests showed that CAT and paper-and-pencil tests are equivalent for carefully constructed tests [33]. For instance, the Patient-Reported-Outcomes Measurement Information System (PROMIS) is an NIH Roadmap network project that uses CAT to improve the reliability, validity, and precision of patient-reported outcomes [34]. CAT has also been used successfully to screen for developmental problems [35], assess tobacco beliefs among young people [36] and to administer headache impact tests [37]. It has consistently been shown to be an efficient testing approach with high precision. We believe that in situations when the examinees need to be classified into relatively few categories, measurement decision theory (MDT) can provide comparable performance with significantly fewer items and a much smaller test population [30].
In this paper we discuss the application of a MDTbased approach to administer variable-length tests in order classify examinees into a limited number of discrete categories. Pilot testing with a relatively small calibration sample is used to estimate the conditional probabilities that subjects of particular literacy levels will answer a particular question correctly. Subsequently, Bayes' Theorem is used to compute the likelihood that an individual test-taker who answers a series of questions correctly is of a specified literacy level. We demonstrate the validity of the proposed method using data from two patient health literacy questionnaires.

Algorithm development
In the MDT-based CAT process, the goal is to place the examinee in one of k literacy classes (e.g., low or adequate; or low, adequate, or high). One item is presented at a time, and on the basis of the test-taker's previous answers, the next 'best' item is the item that eliminates the most uncertainty about the classification.
Specifically, as described below, the best question is the one that maximally reduces the entropy of the classification. Prior to testing, background or calibration data is needed to calculate: 1. P(L i )-the distribution of different competence levels (L i ) in the participant population; 2. P(Q j = 1/L i ) -the conditional probability that participants of a particular competence level L i respond correctly to a question Q j .
At each point in the testing process, the test-taker's current response vector is Z = Z 0 Z 1 ...Z n , where Z j is 1 if the participant provided the correct response to Q j and 0 otherwise. According to Bayes' Theorem, the conditional probability that s/he belongs to competence level L i given the data vector Z is thus: where 'k' is the number of possible classes. If questions are independent (a central assumption in MDT), then P(Z/L i ), the conditional probability of the response vector given competence level L i , is the product of the responses' individual conditional probabilities.
The response vector also defines the current state of the participant, S curr . The information content of this state can be calculated as entropy formulated by Shannon's information theory [38]: After each answered question, the next question is selected such that the new state, S next resulting from the participant's response to it results in the greatest reduction in entropy i.e. the question that maximizes: In (4), Z' = Z 0 Z 1 ...Z n Z n+1 can be one of two vectors depending on whether the participant answers Q n+1 correctly or incorrectly. If Z' 1 is the vector resulting from a correct response and Z' 0 the vector resulting from an incorrect response, for each of the unanswered questions, Q new the entropy of the state resulting from presenting the question is calculated as: Here P (Q new = 1) is the probability that Q new is answered correctly and is given by: Termination: The testing process could theoretically continue until the participant answered all available questions. However, to gain efficiency, we terminate testing if H(S next ) (from 4) was less than a specified threshold for three successive questions. At termination, the participant's predicted class is the class L i with maximum P(L i /Z).

Materials
We validated our MDT-based CAT approach through a secondary analysis of data from two assessment tests developed for consumers in the health domain. The algorithm was applied retrospectively to the data sets to (a) determine an optimal question order and (b) to categorize each participant into a literacy or numeracy category. Finally, we also performed ROC analysis to characterize the sensitivity, specificity, and predictive power (area under the curve) of the algorithm.

Data Set 1
This paper-based questionnaire study [39] was originally conducted to assess patients' familiarity with healthrelated terms. The 52 participants were recruited from the Brigham and Women's Hospital, an academic medical center in Boston, MA (Table 1). Each participant was required to answer 60 questions that tested his/her familiarity with a term (n = 45) or concept (n = 15). The target terms for the questions were manually selected from consumer health terms identified in three frequently visited MedlinePlus articles related to hypertension, back pain and gastroesophageal reflux disease (GERD). For each question (Figure 1), the participant was invited to select one of the four available responses: 1. a key term that was related to the target term in the question, either at the surface-level ("biopsy" is a "test") or at the concept level ("biopsy" means "removing sample of a tissue"); 2. a distractor term that had the same semantic relation to the target term as the key term and of approximately the same difficulty as the key term; 3. a second distractor term satisfying the same criteria as 2; 4. a do not know option.
Results of the study are reported elsewhere [39]. Cronbach's alpha was found to be 0.93 signifying very high internal consistency. Factor analysis without rotation showed that all questions load heavily on one factor accounting for 26% of the variance. The second and third factors account for 10% and 7% respectively. Different literacy measures use different cut-off points to assign participants to a category. For the current analysis, the participants' scores were used to classify them into one of three categories: low literacy (a score of 44 or lower), moderate literacy (a score in the range of 45 and 52) or high literacy (a score of 53 or higher). These score thresholds were selected for demonstration purposes so as to obtain approximately equal-sized groups: 34.6% of participants in the low literacy group and 32.7% in moderate and high literacy groups. However, different thresholds could be selected for other purposes.

Data Set 2
This on-line questionnaire study was conducted as part of a larger study to evaluate the effect of different visual illustrations of risk on patient decision-making [40]. Two samples of adult consumers (n = 165) were recruited: one online and one from clinics at New York-Presbyterian Hospital, an academic medical center in New York City ( Table 1). As part of this study, participants were assessed for numeracy with a short scale modified from that of Lipkus et al [5]. The scale consists of 8 questions to assess comprehension of probabilities. Two of the questions are in multiple choice format and the rest are of fill-in-theblank format( Table 2). In the Lipkus study [5], the first question in Table 2 was an unscored practice question; however, following the example of Schwartz et al [6] and Gurmankin et al [41,42], the same item was used here as a valid numeracy question. Internal reliability of this modified Lipkus scale was good (Cronbach's alpha = 0.70). A factor analysis using principal components extraction without rotation showed that all questions loaded heavily on one factor that accounted for 34.3% of the variance; the two multiple-choice questions loaded on a weak second factor for 17.0% of the variance, and the two most difficult fill-in-the-blank questions (numbers 1 and 8) loaded on a weak third factor that accounted for 13.3% of the variance. These results were extremely similar to those reported by   Lipkus et al [5], who concluded that the scale measured a single construct of numeracy and that the weak second and third factors were probably artifacts of the measurement scale.
For the current analysis, two categorization schemes were developed for demonstration purposes. In the first scheme, participants were categorized as low numeracy (a score of 5 or lower, n = 48 (29.1%)) or high numeracy; in the second scheme, they were categorized as low numeracy (a score of 5 or lower, n = 48(29.1%)), moderate numeracy (a score of 6 or 7, n = 74(44.8%)), or high numeracy. Classifying patients into 3 categories of competency is typical of other assessments such as the TOFHLA [3].

Evaluation
The gold standard literacy or numeracy level, true category, of a participant was defined by his/her total score on the original questionnaire as described above. The MDT-CAT algorithm was applied to predict the true category of the participant using a leave-one-out approach. For example, for the first data set, data from 51 of the 52 participants served as calibration cases -i.e. used to estimate the distribution P(L i ) and P(Q j = 1|L i )and the 52nd participant served as the test case. The process was repeated so that each participant served as the test case exactly once.
To initialize the testing process, i.e. to determine the first question to be presented, the following two alternatives were explored: SA: Question selection criteria (i.e. entropy minimization) described by (4), (5) and (6) by substituting P(L i ) for P(L i |Z) (since Z is not initialized).
SB: Any question that a participant known to be of moderate literacy has a 40-60% chance of answering correctly.
The average number of questions that needed to be presented prior to termination and the number of wrong classifications were calculated and reported.

Results
For data set 1, using the start criterion SA and a termination threshold of 0 (i.e. no termination threshold), the algorithm classified 88.5% (n = 46) of the participants correctly. Of the misclassified participants (n = 6), half were of moderate literacy incorrectly classified as low literacy while the others were of high literacy classified as moderate.
All the misclassified cases appear to result from the same type of outlier question/response pair. For instance, one of the misclassification resulted when a participant with a score of 45 was misclassified to be of low literacy (true category: moderate literacy). An analysis of the response vector and state entropies revealed that this can be attributed to the participant's response to one particular question, for which the participant gave an incorrect response, in contrast to all other participants of moderate literacy. As the algorithm uses a leave-one-out approach, P(Q i = 0|moderate literacy) for this question would be In a sample of 287 veterans who completed a mail questionnaire [6], 54% answered this question correctly. b Does not include responses of three subjects who were interrupted while completing the online questionnaire; on restarting the test, an unexpected system malfunction prevented their responses from being captured.
equal to 0, and the participant's incorrect response results in a misclassification. The use of a random moderate-difficulty start question (start criterion SB) did not result in any discernible difference in the number or type of errors produced by the algorithm.
As expected, with an increase in the threshold the number of questions to be answered decreases and the error rate increases. Figure 2 shows the average number of question per participant and error rates for threshold values in the range of 0 and 0.2 in increments of 0.01. At the smallest attempted threshold of 0.01 the average number of questions to be answered was found to be 25 (42%) with one additional misclassification.
To determine the largest possible reduction in questions without increasing errors, threshold values in the range of 0 and 0.01 were further investigated and the results are shown in Figure 3. A threshold value of 0.005 appears to be the most promising as this would reduce the number of questions by half without causing any additional misclassifications. For all participants, the predicted class at this threshold is identical to the predicted class when no termination criterion was specified. Figure 2 also shows the result of using start criterion SB in combination with different entropy thresholds. As can be seen in the figure, the reduction in questions is comparable to our earlier results. However, the method tends to misclassify a slightly larger number of participants.
Data set 2 has a larger number of subjects but a small item pool. Figure 4 shows the changes in the average number of questions to be answered and error rate for the two class classification scheme. Without a threshold, 96.9% of the participants were found to be correctly classified. The number of misclassifications remains constant with increase in threshold, up to an entropy threshold of 0.17 and the average number of questions to be answered reduces to 5.14 (64.25%).
A three-class classification can be expected to have more uncertainty than the two-class classification scheme, producing a higher error rate and a smaller reduction in question count. Figure 5 shows that for the three-class classification scheme on dataset 2, without a threshold, 93.8% of the subjects were correctly classified. Though this is less than the corresponding value in the two-class classification scheme, it is better than the observed errors of the three-class classification scheme on data set 1. The number of questions to be answered, 6.59 (82.3%), is also higher than that observed in the twoclass classification scheme.
In order to estimate the sensitivity of the performance of the method to calibration data, we tried two calibration alternatives to the leave-one-out approach described above: (a) calibration with a random half of the population; (b) calibration using the online sample. In both cases, only subjects not used for calibration were used for testing. For example in scheme (b), P(Q j = 1 | L i ) and P(L i ) were computed using the subjects recruited online (n = 100) and used to classify subjects recruited at the clinic (n = 62). Table 3 and Table 4 list P(Q j = 1 | L i ) and P(L i ) observed in the three calibration schemes.
Using the leave one-out approach, the average number of questions to be answered was 6.6 at 93.8% classification accuracy (accuracy possible if no threshold were used). For calibration scheme (a), 6.8 questions needed to be answered and a higher accuracy of 97.5% was observed. Calibration scheme (b) resulted in a classification accuracy of 91.9% with the subject having had to answer 5.3 items on average. As can be seen in Table  4 for scheme (b), the calibration sample has a very different distribution from the testing sample. For example, in the online sample, 16% of the subjects were of low numeracy and 37% of high numeracy whereas in the clinic population 52% were of low numeracy and only 5% of high numeracy. The decrease in performance using scheme (b) can probably be attributed to this difference. Table 5 shows the actual and predicted distribution of the numeracy classes in the testing sets of the three calibration schemes. The misclassifications resulted from overestimation of the numeracy of subjects of low numeracy group and when the calibration set is not representative of the testing sample, as in scheme (b), these misclassifications were found to increase.  As illustrated in the ROC curve in Figure 6, the sensitivity and specificity of the algorithm are very good. The top ROC curve shows the sensitivity and false positive rate on the numeracy data set at an entropy threshold of 0 (area under the curve = 0.96); the lower ROC curve shows very little decrement in performance at an entropy of 0.6 (area under the curve = 0.93).

Discussion
Comprehensive, robust and parsimonious diagnostic health literacy measures are needed in consumer health initiatives, as well as clinical practice, to identify low literacy individuals. The need is particularly acute in older adults who are more likely to require frequent health care services and are also more likely to experience fatigue and cognitive burden from a lengthy testing procedure.
Our results show that the method discussed in this paper can be useful in reducing the number of questions that need to be answered by a participant without seriously compromising classification accuracy. For the data described in this study, we found that by selecting an appropriate threshold, the number of questions can be reduced by half without making any additional misclassifications. In addition, this high degree of accuracy was achieved with only few calibration cases. By contrast, the number of cases needed for calibrating IRTbased algorithms is estimated to be quite large [25,30].   For demonstration purposes, we applied it to data sets from relatively brief assessments. Even with these short assessments, the algorithm markedly reduced the number of test items; the impact on our target population would be even more substantive if this algorithm were applied to lengthier competency assessments. Although MDT-based adaptive testing methodologies have been proposed and used in other domains, to the best of our knowledge this is its first application in the health domain.
The thresholds used to report the results are values at which the greatest reduction in test length was observed at the same classification accuracy as that achieved had no threshold been used. In a real testing scenario, these values can be estimated by simulating the leave-one out approach on the calibration sample. These simulations can also inform the test administrators on the accuracy/test-length trade-offs and help them choose higher thresholds that further reduce the length of the tests.
Results observed with calibration scheme (b) of the second dataset also suggest that the users of the method should be encouraged to have a calibration set that is representative of the overall population and if this is not possible the performance of the classification accuracy can be expected to decrease. Additional analysis, preferably with larger datasets, would be necessary to determine the response of the performance to deviations between calibration and testing samples.
Although the results are promising, this study has some limitations. First, it relies on secondary analysis of data from assessments that measure vocabulary familiarity and numeracy, constructs that are related to or contribute to health literacy but do not fully cover the domain of health literacy. Second, the MDT algorithm is predicated on the assumption that the responses to questions are independent of other responses, and this may not be valid. The effect of the violation of this assumption on the performance of this algorithm remains to be investigated. Future work could compare algorithm performance on assessments that have highly correlated questions and assessments with highly independent questions, with independence inferred from Cronbach's alpha or factor analysis. It is relevant to point out however, that IRT makes a similar assumption of question independence. A final limitation is that the assessment would have to be administered on a computer, and many individuals with low literacy also have low computer literacy. We are currently developing a computer interface for this assessment that is expressly designed to be easier to use for our target population of older adults, many of whom are novice computer users. The processing/memory requirements to administer this kind of testing are minimal and a fully equipped computer is not necessary. Touch screen devices such as tablet computers -which are becoming more affordable -hold promise as a more user-friendly medium for administering these tests in healthcare settings.
In addition, in future work, we intend to study how to apply this method to scenario-based tests where each scenario is described and is followed by several related questions. In such cases, probability and entropy calculations may need to be performed at the module level rather than at the individual question level. Additionally, we intend to validate the method by applying it to directly test the subjects in contrast to the retrospective process used in this study.

Conclusions
The computer-adaptive testing method presented in this paper, which is based on measurement decision theory, can significantly reduce the number of items needed to classify subjects into categories corresponding to levels of competency without significantly compromising the accuracy of the classification. In addition, the measure can be validated with few subjects and test items. This method creates the potential for the development of sensitive diagnostic instruments for measuring health literacy efficiently and without undue burden on subjects.