BMC Medical Informatics and Decision Making
Task-oriented Evaluation of Electronic Medical Records Systems: Development and Validation of a Questionnaire for Physicians

Background: Evaluation is a challenging but necessary part of the development cycle of clinical information systems like the electronic medical records (EMR) system. It is believed that such evaluations should include multiple perspectives, be comparative and employ both qualitative and quantitative methods. Self-administered questionnaires are frequently used as a quantitative evaluation method in medical informatics, but very few validated questionnaires address clinical use of EMR systems.


Background
Evaluation is a challenging but necessary part of the development cycle of clinical information systems such as the electronic medical records (EMR) systems used in hospitals. EMR systems handle the storage, distribution and processing of the information needed to deliver health care to each patient. Such systems have been described as "complex systems used in complex organizations", and their evaluation seems to follow that logic. It is generally believed that multiple perspectives need to be considered, and that qualitative and quantitative methods should be integrated when evaluating EMR systems [1]. In addition, the evaluation should include a comparative element [2] and rely heavily on how humans react to the system [3]. Since a multi-perspective, multi-method approach easily exceeds any realistic allocation of resources, methods that require modest resources should be considered whenever possible. Task-oriented self-reporting of EMR use and task performance is one such quantitative method.
In this paper, we present a new questionnaire instrument. The questionnaire may be used to survey and compare physicians' use of and performance with a given EMR system at various points in time. Furthermore, it may be used to compare general patterns in use and performance with those of EMR systems in other hospitals and from other vendors. EMR use is not necessarily a quality indicator by itself, but an indicator of the potential impact of the system. Specific problem areas may be identified by demonstrating a self-reported lack of EMR use or a reduced reported performance of specific tasks. Although clinically oriented task inventories have been published previously, these task inventories have been found either too broad [4,5] or too detailed [6] for the questionnaire's intended purpose. Also, very few of them have been tested in several sites or with various EMR systems. Bürkle et al [7] state that questionnaires should be specified depending on the functions of the observed computer system. The design of the questionnaire makes this specification possible, as the tasks generally follow the boundaries of common EMR functionality. In addition, a table of minimum functionality requirements for each task is publicly available [8]. In this paper, we describe the development and successful application of the questionnaire in two demonstration surveys. Support for the validity of its content is demonstrated in an interview study, and support for the reliability of its questions in a test-retest study [9]. In addition, a modified response choice scale is investigated in a scaling study.

Development of the task list for the questionnaire
The questionnaire is task-oriented, i.e. it builds upon 24 general tasks essential to physicians' work. These tasks were formulated by a work group composed of two computer scientists and two physicians, including the author. The group based its work on observations of 40 hours of clinical activity in five departments in two university teaching hospitals, performed January-February 2000 by two of the members of the group. Parts of the observations (7 hours observation time, five physicians from two departments, 27 patients) were transcribed verbatim and categorized by hierarchical task analysis [10]. However, the resulting hierarchy of low-level tasks was too large (104 tasks) for use in questionnaires. Thus, the tasks were transformed and merged into higher-level tasks. In the process, the tasks were formulated to be easy to understand, relevant to clinical work in all specialties, and attributable to functionality found in present EMR systems. Tasks regarded as rarely performed, representing negligible time consumption, or not likely to be supported by an EMR system in the near future were deleted. Further, the principal information needs of physicians defined by Gorman [11] were taken into account by adding three new tasks (table 1, tasks 6, 7 and 8). We used the refined list of 23 clinical tasks in a national survey, the first demonstration study in this paper [8]. Preceding the second demonstration study, a local survey [12], the questionnaire was reviewed in Aust-Agder hospital by six internists in two focus group sessions, and one new task (table 1, task 24) was added to the list. In November 2002, we used video recordings (4.5 h) of two physicians in a rheumatology outpatient clinic attending to nine patients to review the 24 defined tasks; the tasks remained unchanged. Definitions and examples of all tasks are found in additional file 1. Although native English-speaking professionals were consulted during translation, all translated material should be regarded as a guide rather than final.

Development of the questions and the response labels in the questionnaire
The questionnaire principally consists of two sections: one covering the self-reported frequency of use of a given EMR system for each task, the other covering the perceived ease of performing the tasks using the system. The first section appeared in the national survey, and both sections in the local survey. The questions and response labels were adapted from the validated questionnaires of Doll & Torkzadeh [13] and Aydin & Rice [14], both appearing in Anderson et al [15]. Within each section, the questions are worded identically for every task. For details on the incremental changes in each revision of the questionnaire, see appendix A in additional file 17.

Validation of the questionnaire
The validation of the questionnaire was performed in four separate studies.

Structured interviews with physicians
Content validity of the questionnaire was addressed in a structured interview study of physicians from ten selected departments in a university teaching hospital. The two senior residents and eight consultants were named by the head of each department. Three physicians declined to be interviewed and were substituted by others from the same department. Each one-hour interview was recorded digitally and was initiated by the physician filling out the questionnaire while being observed. A fixed set of 153 open and closed questions was asked [9,16], mostly about the tasks defined in the questionnaire. During the interviews, answers to the open questions were transcribed and those to the closed questions were registered directly in a database. Unclear or incomplete transcriptions were revised and completed using the recordings of the interviews. We analyzed the open questions qualitatively by categorizing the responses into themes. The interview guide is provided in additional files 11 and 12.

Post hoc analysis of two demonstration studies
The data from two published demonstration studies were used for missing-response analysis and criterion validation. The first, a national survey, comprised responses from 219 of 307 physicians (72%) in 17 hospitals [8]. The survey included task-oriented EMR use and two translated user satisfaction measures: Doll & Torkzadeh's "End User Computing Satisfaction" scale [13] and Aydin & Rice's "Short global user satisfaction" measure [14]. The second demonstration study, a local survey, comprised responses from 70 of 80 physicians (88%) in Aust-Agder Hospital [12]. The questionnaire contained all of the questions from the national survey, except those regarding five tasks not supported in this hospital (table 1). In addition, the section covering task performance was added in this second revision of the questionnaire (table 2). The questionnaires used in these studies are provided in Norwegian original and English translated versions in additional files 2, 3 and 5, 6.

Test-retest study
We measured test-retest reliability in a postal survey of physicians from three hospitals having EMR systems from separate vendors. Within each hospital, equal groups of physicians were randomly selected from surgical, medical and other wards. The first questionnaire was sent to the 96 included physicians, and a reminder was sent to 57 nonresponders two weeks later. Three weeks after this, the second questionnaire was sent to the 52 responders, along with a music compact disc as inducement. The response rate of the first and second questionnaire was 55.2% (52/96) and 71% (37/52), respectively. On average, we received the second questionnaire 4.4 weeks after the first.
To estimate test-retest reliability in the task-oriented questions, we used Cohen's weighted kappa. The kappa values were interpreted according to Lewis' guidelines [17]. The questionnaire used in this study is provided in Norwegian original and English translated version in additional files 8 and 9.
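The paper does not include analysis code; as a rough illustration, the weighted kappa described above can be computed as in the sketch below. Python with scikit-learn and the example responses are assumptions, not the authors' material; each task-oriented question is scored as an ordinal integer per respondent.

```python
# Illustrative sketch only (not the authors' code): Cohen's weighted kappa
# with quadratic weights for one task-oriented question, with the two
# questionnaire rounds coded as ordinal integers per respondent.
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired responses for one task (first round vs. retest).
test_round   = [5, 4, 4, 3, 5, 2, 1, 4, 3, 5]
retest_round = [5, 4, 3, 3, 5, 2, 2, 4, 4, 5]

kappa = cohen_kappa_score(test_round, retest_round, weights="quadratic")
print(f"Weighted kappa (quadratic weights): {kappa:.3f}")
```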

Scaling of response labels
To validate and scale the response labels in the "Frequency of EMR use" scale, we selected 31 respondents by convenience sampling and asked them to interpret a set of response labels by placing marks on a visual analogue scale (VAS). The VAS ranged from "never" to "always", and the eight Norwegian labels (five original response labels and three alternatives) appeared on separate sheets in random order. Using a standard ruler, we measured the marks on the VAS in millimeters from the "never" end, and calculated the mean VAS value and confidence interval for each response label, as well as the number of disordinal label pairs [18]. The combination of labels providing the lowest number of disordinal pairs was selected for the final frequency scale. The VAS form used in this study is provided in additional file 15.
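As an illustration of this scaling analysis, the following sketch computes the mean VAS position with a 95% confidence interval per label and counts disordinal pairs. The data are hypothetical, and the operationalization of "disordinal pairs" (respondent-level label pairs whose VAS ordering contradicts the intended order of the labels) is an assumption about how reference [18] is applied.

```python
# Illustrative sketch (hypothetical data and an assumed definition of
# disordinal pairs): per-label mean VAS position with 95% CI, and the
# count of respondent-level label pairs placed in the "wrong" order.
from itertools import combinations
import numpy as np
from scipy import stats

# Rows = respondents, columns = labels in intended order (least to most
# frequent), values = mm measured from the "never" end of the VAS.
vas = np.array([
    [ 4, 22, 48, 80, 96],
    [ 2, 30, 52, 75, 99],
    [ 6, 18, 55, 85, 93],
    [ 3, 25, 44, 70, 98],
])

for j in range(vas.shape[1]):
    mean = vas[:, j].mean()
    sem = stats.sem(vas[:, j])
    lo, hi = stats.t.interval(0.95, len(vas) - 1, loc=mean, scale=sem)
    print(f"label {j}: mean {mean:.1f} mm, 95% CI ({lo:.1f}, {hi:.1f})")

# Disordinal pairs: label pairs where the intended lower-frequency label is
# marked at or beyond the higher-frequency label (ties counted as violations).
pairs = list(combinations(range(vas.shape[1]), 2))
disordinal = sum(row[a] >= row[b] for row in vas for a, b in pairs)
print(f"Disordinal pairs: {disordinal} of {len(vas) * len(pairs)}")
```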

Results
The studies provided evaluation of the questionnaire in terms of 1) content validity, 2) compliance, 3) criterion validity, 4) test-retest reliability and 5) scaling of response labels.

Relevance of tasks
The interviews included structured questions about task relevance, frequency and time consumption. The majority of the physicians (7-10 of 10) found each of the 24 tasks to be part of their work, except task 8 (figure 1, part A). In the open-ended questions, they perceived this task partly as an administrative task best performed by other personnel, and partly as not fully applicable to medical work (table 3, themes 1 and 5). However, four of the five physicians who did not consider this task a part of their job agreed that it could become part of it in the future, provided new technology was implemented. The comments transcribed during the interviews suggested that tasks otherwise considered appropriate for other staff could be done by physicians (e.g. gathering and presenting data to the physicians, mediating orders to other units), if computer support made the tasks less time consuming (theme 1).
To broadly assess the amount of work represented by each task, the physicians were asked to estimate the frequency and time consumption of each task. Regarding frequency, most physicians (7-10 of 10) found that all but four tasks were performed frequently, i.e. maximally weekly or daily (median value). Tasks 8, 6 and 19 were all performed infrequently, i.e. maximally less than monthly, but they were relatively time consuming. Regarding the time consumption of each task, most of the tasks (17 of 24) were estimated at 1-10 minutes, and two tasks at more than 10 minutes (tasks 7 and 19). Some tasks (5 of 24) were estimated to take less than a minute using current paper-based routines (e.g. ordering lab tests, writing prescriptions, registering codes), but these tasks were performed frequently (figure 1, part B).

Table 2: Questionnaire revisions, with the number of questions and the corresponding section in the questionnaire.

Rev. 1 (national study): Frequency of PC use for each task, and use of EMR or other program, 23 + 23 questions, section D; End User Computing Satisfaction [13], 12 questions, section F; Short Global User Satisfaction [14], 5 questions, section G.

Rev. 2 (local study): Frequency of EMR use for each task, 19 questions, sections D1 and D2; Task performance using the EMR compared to previous routines, 19 questions, section F; End User Computing Satisfaction [13], 12 questions, sections E1 and E2; Short Global User Satisfaction [14], 5 questions, section G.

Rev. 3 (test-retest study and interviews): Frequency of EMR use for each task, 24 questions, sections B1 and B2; Task performance using the EMR compared to previous routines, 24 questions, section C; End User Computing Satisfaction [13], 12 questions, section D; Short Global User Satisfaction [14], 5 questions, section E.

Accuracy of task interpretation, and estimation of EMR use
The interviews included structured questions about how the physicians interpreted each task, and whether or not they found answering the accompanying question about EMR use (figure 2) difficult. The majority of the physicians found all tasks comprehensible (figure 2, part A). As a control, we asked eight of the physicians to formulate their interpretation of each task in their own words. Respondents who chose wording identical to that of the defined task were asked to name an example. The answers, either formulations or examples, were compared to the definition of each task (figure 2, part B).

Figure 1: Relevance of tasks. Responses in the interview study about A) task relevance, B) how frequently the tasks are maximally performed, and C) how much time the physicians estimate that they take. The corresponding interview questions were: A) "How much do you agree or disagree with the following statement: 'I consider the task to be part of my work as a physician in this hospital'?"; B) "About how often do you maximally perform this task?"; C) "Try to remember the last time you performed this task. About how much time did it take?"

Table 3: Themes from the interviews. The themes, typically appearing in the open-ended questions, are sorted in descending order by the number of physicians providing answers attributable to the given theme. In the "Tasks" column, the tasks to which each answer is attributed are sorted in descending order by the number of physicians commenting on the task. In the "Typical quote" column, the quotes are followed by the physician's specialty in parentheses.

Nine of the 24 task-oriented questions about EMR use were found difficult to answer by 2-4 of 10 physicians (figure 2, part C). Five of these addressed functionality not specifically supported by the EMR. An escape choice ("Task not supported by EMR") had been provided, but the physicians nevertheless found answering these questions confusing. Further explanations were found in the open-ended questions (table 3).

Themes appearing in open-ended questions
The answers to the open-ended questions and the spontaneous comments were categorized into themes. Those mentioned by at least two physicians are shown in table 3.

Figure 2: Accuracy of task interpretation, and estimation of EMR use. Responses in the interview study about A) whether a task is comprehensible or not, B) whether the physicians' interpretation of each task fitted the actual definition or not, and C) whether estimating one's own EMR use for a given task was found difficult or not.

Compliance
Overall, the task-oriented questions had a low percentage of missing responses in both the national and the local demonstration study. However, the questionnaire design in the former was slightly problematic. In the national study, each question about frequency of PC use for a given task was followed by a question about the type of computer program used (i.e. "EMR" and/or "other program"). The percentage of missing responses was low in the former, but quite high in the latter (table 4). As a consequence, a number of respondents reported that they were using a computer without indicating whether they were using the EMR or not. This subgroup needed to be presented along with explicitly reported EMR use, making interpretation and presentation of the results challenging. In the local demonstration study, we simplified the task-oriented questions about PC use by limiting them to EMR use only. In addition, we omitted questions about tasks not explicitly supported by the EMR under study. In this study, the percentages of missing responses were low, both in the questions about EMR use and in those about task performance. In the latter, the question for task 8 [Produce data reviews for specific patient groups] had the highest proportion of missing responses (14.3%). However, the reported EMR use for this task was very low in this study (91% of the physicians answered "seldom" or "never/almost never").
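A minimal sketch of the missing-response analysis is shown below; the column names and data are hypothetical, and the use of pandas is an assumption rather than the authors' approach.

```python
# Illustrative sketch (hypothetical data): percentage of missing responses
# per task-oriented question, with NaN marking an unanswered item.
import numpy as np
import pandas as pd

responses = pd.DataFrame({
    "freq_task_01": [3, 5, np.nan, 4, 2],
    "freq_task_08": [1, np.nan, np.nan, 2, np.nan],
    "perf_task_08": [np.nan, 2, np.nan, 3, 1],
})

missing_pct = responses.isna().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```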

Criterion validity
Criterion validity was assessed in three ways: by correlating task-oriented EMR use with general EMR use, task performance with overall work performance, and task performance with user satisfaction. As the first criterion, we assessed general EMR use by asking the physicians how often they used the EMR as an information source in their daily clinical work (table 5, row 1). This question correlated with nine of the 12 tasks about information retrieval, and with 12 of all 24 tasks. This suggests that a considerable proportion of the tasks are regarded as essential to the EMR's information retrieval function. Of the remaining three tasks of this kind (tasks 6-8), explicit functionality was available only for task 8 [Produce data reviews for specific patient groups] in this study. As a second criterion, we assessed overall work performance by asking whether performance of the department's work, and of the respondent's own work, had become easier or more difficult using the EMR system (table 5, rows 2-4). A high proportion of the questions about task performance correlated with both forms of overall work performance, which suggests that these tasks are regarded as important elements of clinical work. As a third criterion for validation of the tasks, we calculated correlations between task performance and two standard measures of user satisfaction (the End User Computing Satisfaction scale [13] and the Short Global User Satisfaction measure [14]).
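To make the criterion validation concrete, the sketch below correlates one task-oriented item with a global criterion item. The data are hypothetical, and the choice of a Spearman rank correlation for the ordinal response scales is an assumption, since the excerpt does not restate which coefficient the authors used.

```python
# Illustrative sketch (hypothetical data; Spearman rank correlation assumed
# for the ordinal response scales): correlating reported EMR use for one
# task with the global question about using the EMR as an information source.
from scipy.stats import spearmanr

task_use    = [2, 4, 5, 3, 1, 4, 5, 2, 3, 4]
overall_use = [2, 5, 5, 3, 2, 4, 4, 1, 3, 5]

rho, p_value = spearmanr(task_use, overall_use)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```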

Test-retest reliability
In the test-retest study, we measured reliability by calculating Cohen's weighted kappa (quadratic weights) for all task-oriented questions. Generally, the weighted kappa was high (figure 3), but the questions about EMR use showed better reliability than those about task performance (median kappa 0.718 and 0.617, respectively).
In the questions about EMR use, kappa values indicating excellent test-retest agreement were found for seven tasks (figure 3). On the other hand, a low or non-significant kappa was found for tasks 7, 9 and 13, and for some of the questions about task performance.

Scaling of response labels
In the scaling study, the original set of labels performed better than the alternatives. In the best alternative set of labels, the proportion of disordinal pairs was 5%, whereas the original combination of labels remained the better choice at 4%. The mean positions of the original labels (figure 4) constituted a symmetrical, s-shaped curve. The confidence intervals of the sample show some overlap between adjacent labels (figure 4), whereas the confidence intervals of the mean do not (data not shown, ANOVA p < 0.001, LSD p < 0.001 between all labels).
We regarded the response choices in the task performance questions as standard, and hence did not include them in this study. (The data from the scaling study is provided in additional file 16.)

Discussion
The results suggest that this questionnaire may provide valid and reliable information about how an implemented EMR system is utilized on an overall level in clinical practice, and how well the system supports clinical tasks.

The task-oriented questions are relevant for clinical work, but some are difficult to answer
During development, the tasks were based on observations of clinical activity and further refined to suit their purpose as a common denominator for assessments of various EMR systems. In the interviews, the tasks were recognized and correctly interpreted (figure 2) by a wide range of physicians. However, some of the task-oriented questions about EMR use were found difficult to answer, particularly for the higher-level tasks. Four themes appearing in the interviews provided reasons for these problems. First, the respondents were confused when asked about use of the EMR for tasks for which no explicit functionality was offered (table 3, theme 3), despite the presence of relevant 'escape' response choices. This confusion may partly explain the contradictory responses in the national survey, where a minor proportion of respondents reported use of the EMR system for tasks it did not explicitly support (tasks 6 and 7) [8], and the low reliability of three questions about EMR use in the test-retest study (tasks 7, 9 and 13). It may also explain the few missing responses in the local study, where unsupported tasks were omitted. Second, distinguishing the EMR from other software or media appeared as a problem in the interviews (theme 4). This may explain the many missing responses in parts of the national study (table 4). The reduction of missing responses in the local study suggests that considering EMR use only (and not use of other software) is easier for the respondent. However, the problem will remain for respondents who use software other than the EMR during clinical work, making it necessary to review all software available to the physicians. Third, questions about tasks that were not completely supported by the EMR system were found hard to answer, even though the wording of the questions only implied a supportive role. This problem was in particular attributed to general tasks. However, the test-retest reliability was relatively high in these questions, suggesting a limited negative effect. Fourth and finally, distinguishing other employees' use of the system from one's own appeared as a problem in the interviews (theme 7) for tasks 5 and 15. Regarding task 5 [Enter daily notes], the explanation was confusion about whose use of the EMR should be stated, the physician's or the transcriptionist's. This problem can probably be amended by revising the instructions to the respondent in the questionnaire.

Figure 3: Test-retest reliability. Reliability (weighted kappa, quadratic weights) is shown for the task-oriented questions about A) frequency of EMR use and B) task performance. Error bars show confidence intervals of the kappa values; non-significant tests (p > 0.05) are hidden.
In addition to providing explanations for the findings of the closed questions, the results from the open-ended questions addressed a number of themes on their own. First, wording problems (table 3, theme 2) were expressed particularly for tasks 16, 4 and 21. However, the respondents' interpretations of these tasks (figure 1) were all concordant with and covered essential parts of the task definitions. Another important theme involved functionality missed by the respondent (table 3, theme 6), i.e. that the questionnaire did not allow them to express what functionality they were missing in the EMR system. This in particular made it difficult to answer the questions about user satisfaction, as the respondent had problems deciding whether to provide answers based on the functionality actually available in the EMR system, or on the functionality that should have been in the system. The problem is closely related to the problems regarding the EMR only supporting parts of a given defined task (table 3, theme 8).

Figure 4: Scaling of response labels.

The tasks are relevant for EMR systems
Moderately high correlations were consistently found between a majority of the task-oriented questions and the overall questions on EMR use, task performance and user satisfaction. The correlations with self-reported overall EMR use suggest that the tasks are regarded as essential to EMR systems as such, and the correlations with work performance suggest that the tasks are regarded as important to clinical work. The correlations with user satisfaction agree with the results of both Sittig et al [20] and Lee et al [21], who found significant correlations between user satisfaction and questions about how easily the work was done. In combination, this means that high reported EMR use for individual tasks goes together with high reported use of the EMR as a whole, and that improved performance of individual tasks goes together with improved overall work performance and high satisfaction with the system as a whole. Although this does not prove the validity of each task, it is highly suggestive. Furthermore, the correlations were limited to tasks for which clear functionality existed in the EMR systems. For the uncorrelated tasks, further clarification must await completion of the functionality of current EMR systems.
This way of correlating a set of lower-level task-oriented questions to higher-level questions is commonly used as criterion validation [22]. However, higher-level questions regarding EMR use are difficult to answer, as physicians' work consists of a complex mix of tasks that are suited for computer support and tasks that are not. A more direct form of criterion validation could have been achieved by studying system audit trails [2]. Such trails are readily available, but they must be validated themselves, and they cannot be more detailed than the structure of the EMR system itself. In Norway, the EMR systems are document-based in structure [12]. This limits the interpretation of such trails, particularly when considering information-seeking behavior.

The questionnaire produces interpretable results
The demonstration studies provided readily interpretable results. In the national study, the physicians generally reported a much lower frequency of EMR use than expected from the functionality implemented in each hospital [8]. In the local study, the physicians reported a very high frequency of EMR use, mainly for tasks related to retrieval of patient data [12]. In this study, the physicians generally had little choice of information sources, as the paper-based medical records had been eliminated in this hospital. The use of the EMR system for other tasks was, however, much lower. The results from both the national and the local study indicate that the physicians are able to report overall patterns in their use of the EMR that are not in line with the implicit expectations signalled by this questionnaire. These results should not be too surprising. The physicians' traditional autonomous position may allow them to withstand instructions from the hospital administration, e.g. regarding the ordering of clinical biochemical investigations [23]. Also, in most hospitals having EMR systems, the physicians may freely choose their source of patient data. This is because both the paper-based and the electronic medical record are generally updated concurrently [12], and they are only two of many information sources available in clinical practice (e.g. asking the patient, calling the primary care physician, etc.).
Compared to the 400-600 tasks commonly found in full task inventories [6], the number of tasks in the questionnaire is moderate (24). The high response rates suggest that the number of questions is manageable for the respondents. Compared to similar questionnaires [4,21], the task list provides the evaluator with more detail about areas for improvement, and it is not designed with one particular EMR system in mind [21]. In addition, more emphasis is placed on clinical use of the EMR system, since the tasks are limited to information-related rather than both practical and information-related tasks [24], and to clinical rather than both clinical and academic work [4]. On the other hand, questionnaires describing self-reported usage patterns have previously been criticized for lack of precision and accountability [25,26]. However, the critics often seem to be considering poorly validated questionnaires or overly optimistic interpretations of them [27], rather than the principle of self-reporting as such. When interpreting the results from a survey describing self-reported work patterns, the inherent limitations of self-reporting must be taken into account. Respondents remember recent and extraordinary events much more easily than distant or everyday events, suggesting in our case an over-estimation by those who use the EMR infrequently. Also, even in a systematically validated questionnaire, a considerable degree of bias should be expected towards answers that the respondents believe are expected from them. However, when the responses both fit the structural premises (i.e. the marked EMR use in the local study, where the paper-based medical record was missing) and defy the implicit expectations (i.e. the lack of EMR use in the national study), the degree of bias seems to be manageable.

Reliability and scaling
The test-retest reliability study generally showed high kappa values both in the section about EMR use and in that about task performance, in spite of some tasks performing poorly in either section. The poorly performing tasks in the EMR use section addressed functionality that was available to few respondents, while those performing excellently addressed functionality supported by all EMR systems. This means that changes demonstrated for well-supported tasks are more likely to reflect real changes in the underlying processes than to have happened by chance. On the one hand, small differences should be interpreted with caution when using the questionnaire, e.g. when significant differences are found in rank values but not in median response values. On the other hand, the evaluator should be careful not to disregard non-significant differences in small samples for the tasks having reliability less than 0.6, as the most likely effect of reliability problems is attenuation of real differences [28].
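The attenuating effect of imperfect reliability can be illustrated with the classical Spearman correction for attenuation; this is an illustrative textbook relation, not an analysis performed in the paper.

```latex
% Classical correction for attenuation (illustrative only; not from the paper).
% r_xy: observed correlation; r_xx, r_yy: reliabilities of the two measures.
r_{\text{true}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}},
\qquad \text{so } r_{xy} = r_{\text{true}}\sqrt{r_{xx}\, r_{yy}} \le r_{\text{true}}.
```

For example, with a reliability of 0.6 in both measures, an observed correlation would be attenuated to roughly 60% of its true value (since sqrt(0.6 x 0.6) = 0.6), which is why non-significant differences for the less reliable tasks should not be over-interpreted.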
In the study of the frequency scale (appearing in the questionnaire section about EMR use), the order of the response labels coincides with that of the respondents' visual analogue scale (VAS) markings. In addition, the confidence intervals of the means are clearly separated in this relatively small sample. This suggests that the response labels are considered separate steps on an ordinal scale by the respondents. However, the mean VAS values do not increase linearly, but follow a symmetric s-shaped curve in which the largest increments appear at the middle of the scale. This suggests that differences in frequency of EMR use might be considered slightly larger when they involve or span the central label than when they involve the labels at each end of the scale. In sum, the scale is ordinal but not linear, making non-parametric methods the best choice for statistical analysis.
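As an example of the kind of non-parametric analysis the ordinal frequency scale calls for, the following sketch compares one task's reported frequency of EMR use between two respondent groups with a Mann-Whitney U test; the data and group labels are hypothetical.

```python
# Illustrative sketch (hypothetical data): non-parametric comparison of one
# task's reported frequency of EMR use between two respondent groups,
# with responses coded 1 ("never/almost never") to 5 ("always").
from scipy.stats import mannwhitneyu

hospital_a = [1, 2, 2, 3, 1, 2, 4, 2]
hospital_b = [3, 4, 5, 4, 3, 5, 4, 4]

statistic, p_value = mannwhitneyu(hospital_a, hospital_b, alternative="two-sided")
print(f"Mann-Whitney U = {statistic:.1f}, p = {p_value:.4f}")
```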

Comparing the development and evaluation of this questionnaire to that of other questionnaires
When developing questionnaires, existing literature [22,29] and expert groups [30,31] are commonly used to produce the initial items. For our questionnaire, the literature search was mostly unfruitful, and we had to rely on expert groups and observational work. A common way of structuring the initial collection of items is to identify latent (and possibly unrelated) variables by performing exploratory factor analysis [22]. For our questionnaire, no factor analysis has been performed. In the national demonstration study, this was due to the considerable differences in implemented functionality between the various EMR systems; in the local demonstration study, it was due to the low sample size relative to the number of questions, i.e. below 10:1 [32]. Although consistent patterns of use (e.g. "the notes reader", "the super-user", "the lab test aficionado", etc.) might be identified by factor analysis, it is unlikely that completely unrelated variables would be extracted from a set of work tasks all designed for the same profession. Work tasks found irrelevant by the physicians could have been identified by analyses of internal consistency among the task-oriented questions, e.g. Cronbach's alpha [22]. However, such investigations should ask about the work tasks per se, not about the tasks for which the EMR system is used, rendering our demonstration studies of little value in this respect. Instead of performing another survey, we chose to explore the tasks as well as the task-oriented questions in a structured interview study. This way, we had an opportunity to explain why some of the tasks performed better than others in the demonstration studies.
When evaluating questionnaires, criterion and content validation are frequently used [29,33]. As the list of tasks in our questionnaire is rather heterogeneous and covers a considerable field of clinical activity, a single global criterion is hard to find. Instead, we used either criteria explaining parts of the task list (e.g. the tasks regarding information retrieval) or indirect criteria based on well-documented relations (e.g. overall user satisfaction vs. task performance).

Limitations of this study
The questionnaire described in this study applies to physicians only, missing the contribution of other types of health personnel. Further, the list of tasks does not cover communication or planning, suggesting that the list could be augmented in future versions of the questionnaire. Finally, three different revisions of the questionnaire appear in this paper, which might appear confusing. The revisions are however incremental, and should be considered consequences of lessons learned during the demonstration studies.

Application of the questionnaire
The questionnaire described here may be used as an important part of an EMR system evaluation. Instead of a simple summed score, the questionnaire's task list provides a framework by which EMR systems may be described and compared in an informative way. Since the questionnaire does not provide reasons or hypotheses for the results it produces, surveys involving it should always be accompanied by a qualitative study. The combination of methods will, however, provide more than the sum of its parts. Qualitative studies such as in-depth interviews may probe deeper when the results of the preceding survey are presented to the informant, and observational studies may focus on phenomena explaining the survey results. Conversely, the interpretation of a qualitative study may be aided by the results of a subsequent quantitative study, as it provides a way of weighting the proposed hypotheses.

Conclusions
The task-oriented questionnaire is relevant for clinical work and EMR systems. It provides interpretable and reliable results at its chosen level of detail, as part of any evaluation effort involving the hospital physician's perspective. However, the development of a questionnaire should be considered a continuous process, in which each revision is guided by further validation studies.

List of abbreviations
EMR: Electronic Medical Records
VAS: Visual Analogue Scale