Study population
We focused on patients with Chronic Kidney Disease (CKD), for whom there is an online platform, PatientView [32], that provides access to laboratory results and is available in 90% of renal units in the UK. We included patients who had received a kidney transplant at least 12 months before recruitment. These patients undergo longitudinal follow-up with quarterly visits, including a review of their laboratory test results, which allowed us to obtain a group of participants homogeneous in their experience with and knowledge of the disease. We excluded patients with any visual impairment, to avoid unreliable eye-tracking data collection, and patients who did not use the internet in their everyday lives, to ensure the level of digital proficiency required of a potential patient portal user.
We recruited 20 patients from the Renal Transplant Clinic at Salford Royal NHS Foundation Trust (SRFT), which has one of the largest communities of PatientView users in the UK. The study received ethical approval from NHS and local R&D ethics committees (IRAS ID: 183845). Research nurses from the NIHR Clinical Research Network Portfolio (Study CPMS ID: 20645) were responsible for approaching eligible patients and collecting signed informed consent from patients willing to participate.
Patient involvement
Increasingly, patients are becoming involved as active partners in planning and undertaking research, rather than only as participants or data sources [33]. Throughout the project, we involved three patients from the local CKD patient community (http://gmkin.org.uk/) as collaborators, two of whom had experience of using PatientView. Initially, these patient collaborators participated in a workshop with researchers, during which they provided insights on the perceived importance and relevance of monitoring laboratory test results in CKD, and on the role of PatientView in such tasks. During the workshop, the patient collaborators also commented on preliminary visual presentations that we had prepared based on the literature discussed in the background, and suggested additional features that might support them in interpreting their laboratory results on PatientView. After the workshop, the patient collaborators remained involved via email, providing comments on the visual presentations and the study protocol. One patient collaborator also pilot-tested our data collection procedure and commented on our interpretation of the results.
The continuous involvement of patient collaborators allowed us to design a more realistic, relevant and acceptable experiment. Furthermore, two of our patient collaborators were experienced PatientView users who often discussed this topic with fellow patients in the local CKD patient community. Their advice and feedback were therefore extremely important in developing our visual presentations.
Controlled study design
The study followed a “3 × 3” repeated measures, within-subjects design where each participant used, in a random order, three different presentations of web-based laboratory test results to complete the same simulated task in three different clinical scenarios. These were designed by a nephrologist at SRFT to reflect:
- High risk clinical scenarios: characterised as life-threatening situations requiring immediate action; creatinine and estimated Glomerular Filtration Rate (eGFR) (i.e. the main indicators of kidney function [34]), as well as potassium (associated with higher mortality in kidney patients [35]), strongly deviated from the standard range.
- Medium risk clinical scenarios: identified by abnormal creatinine and eGFR but normal potassium and stable conditions; no urgent action was necessary, but further tests were required within 4 weeks.
- Low risk clinical scenarios: characterised by normal creatinine, eGFR and potassium; no action was required until the next scheduled appointment.
In addition to creatinine, eGFR and potassium, each scenario included 25 laboratory test results with different deviations from the standard range in relation to the reflected risk (i.e. higher risk scenarios had more concomitant abnormal values).
There was no prior training and no time limit for performing the task. Participants could stop exploring the laboratory test results whenever they felt ready to answer the follow-up questions.
The controlled study was conducted at the Interaction Analysis and Modelling laboratory of the University of Manchester, and each patient participated individually. All participants performed the tasks using a desktop computer with a 17-inch screen and an embedded eye tracker (Tobii T60), which provides a 60-Hz sampling rate, 0.5-degree gaze-point accuracy, and free head motion.
Presentations
We implemented three presentations of laboratory test results (see Fig. 1 and Additional file 1: Figures S1-S9 for details). In developing these presentations, we aimed to keep the amount of textual information similar across presentations while showing it through different formats and visual cues to enhance patient interpretation. These formats were chosen based on a review of the literature and on patient collaborators’ feedback on preliminary prototypes.
The Baseline presentation was based directly on, and very similar to, the current PatientView [32] interface, which uses tiles to show the latest available laboratory test results. Each tile reports the value of a laboratory test, its unit, the date of the test, the data source, and the standard range. By clicking on a tile, the user can access previous results (i.e. longitudinal information), with the possibility of comparing the selected test with another test within the same graph. This feature can be particularly useful in the context of CKD, where the use of some medications improves renal function but can also negatively affect the functioning of other organs, which would be reflected by abnormal values in laboratory test results.
The comparison presentations were based on the Baseline presentation, but provided different visual cues, colours and tools to distinguish normal from abnormal values. Chronic patients have an increased risk of test results falling slightly outside the reference range, and only results with a high deviance from the population reference range are likely to be clinically relevant [36]. Therefore, the second presentation (Contextualised presentation) used horizontal coloured bars that contextualise the latest value in relation to the standard range in each tile [16, 23, 37]; as noted in the background, such bars have outperformed tables in terms of perceived usefulness [16] and decrease the perceived urgency of borderline (i.e. low deviance from the reference range) laboratory test results [23]. The third presentation (Grouped presentation) aimed to help patients identify abnormal results, using personalised grouping (i.e. dynamically grouping the tiles into “Outside the standard range”, “No standard range available”, and “Inside the standard range”) and the aforementioned overview-preview metaphor [28,29,30]. The graphs in both the Contextualised and Grouped presentations reported personalised statistics for the selected test [38,39,40] and used colour, showing the area inside the standard range in green and the remaining area in red [41]. We applied the same colour scheme to the latest laboratory test results in the tiles of both presentations. This aimed to draw patients’ attention to laboratory test results that might require more careful review (i.e. those in red), while visually filtering out those that were normal (i.e. those in green).
All plots within the same scenario displayed the same time period to avoid the well-known difficulties with scale changes [42]. Specifically, each plot displayed the period from the earliest to the latest available laboratory test result, which for each scenario spanned between 1 and 2 years. Furthermore, to make the task as realistic as possible (i.e. as if the patient had just received some results from the clinic), all results were shifted towards the day the experiment was carried out.
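Such a shift amounts to translating all dates by a fixed offset. A minimal sketch in R, where the data frame `results` and its `date` column are hypothetical names, not the study's actual code:

```r
# Shift all result dates so the latest result falls on the experiment day.
shift <- Sys.Date() - max(results$date)  # offset in days (difftime)
results$date <- results$date + shift     # translate the whole series forward
```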
Since the purpose of the experiment was to expose patients to three very different types of scenario, keeping fixed value ranges for each laboratory test, as suggested by Zikmund-Fisher et al. [23], would have been detrimental to our purpose: value ranges predefined for each type of clinical scenario (i.e. low, medium and high risk) might have revealed the pattern in the data. Therefore, we dynamically tailored the value ranges shown in the plots (for all three presentations) and in the horizontal coloured bars (for the Contextualised presentation). Specifically, the range of values shown always included the minimum and maximum of the longitudinal series of results for each laboratory test. If the minimum or maximum fell inside the laboratory test reference range, it was replaced by the lower or upper reference range limit, respectively. To ensure that the areas within and outside the reference range were always clearly displayed, an offset was added to the minimum and maximum of the value range.
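A minimal sketch of this range-tailoring logic in R; the function name, argument names and the padding fraction are our own illustrative choices, not the study's documented implementation:

```r
# Compute the value range displayed for one laboratory test.
# `values`: longitudinal series of results; `ref_low`/`ref_high`:
# reference range limits; `offset_frac`: padding fraction (assumed).
display_range <- function(values, ref_low, ref_high, offset_frac = 0.05) {
  lo <- min(values)
  hi <- max(values)
  # If the observed extremes fall inside the reference range, widen the
  # displayed range to the reference limits so both areas remain visible.
  if (lo > ref_low)  lo <- ref_low
  if (hi < ref_high) hi <- ref_high
  # Add an offset so the within- and outside-range areas are clearly shown.
  offset <- (hi - lo) * offset_frac
  c(lo - offset, hi + offset)
}

# Example: potassium results (mmol/L) with reference range 3.5-5.1
display_range(c(4.1, 4.6, 5.0), ref_low = 3.5, ref_high = 5.1)
```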
Data collection
At the beginning of the experiment, each participant was asked to complete four questionnaires: demographics (age, gender, education, years since transplant, frequency of internet usage, and frequency of PatientView use); the Subjective Numeracy Scale (SNS) [43], scored on a 1–6 scale; self-reported health literacy on a 0–4 scale based on Chew et al. [44], with lower scores indicating better health literacy; and graph literacy, calculated as the percentage of correct answers on the questionnaire from Galesic et al. [45].
After exploring each presentation of laboratory test results, participants were asked to respond to a question about their behavioural intentions in relation to what they saw, which we used as a proxy for their risk interpretation. Specifically, patients were asked what they would do in real life if the results they had just explored were their own. They could choose between: 1) calling their doctor immediately (high interpreted risk); 2) trying to arrange an appointment within the next 4 weeks (medium interpreted risk); 3) waiting for the next appointment in 3 months (low interpreted risk).
Data analysis
Risk interpretation
To assess the effect of the presentations (Baseline, Contextualised and Grouped) on the accuracy of risk interpretation, we created a 3 × 3 confusion matrix for each presentation, reporting the judgements made by patients against our gold standard (i.e. the nephrologists’ clinical judgement). From the confusion matrices, we calculated precision (the proportion of all interpretations as risk X that were correct), recall (the proportion of clinical scenarios with risk X that were correctly interpreted) and accuracy (the proportion of all interpretations that were correct) for each presentation, and compared these using chi-squared tests.
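For concreteness, these metrics can be computed from a confusion matrix as follows; a sketch in R with illustrative counts, not the study's actual data:

```r
# Illustrative 3x3 confusion matrix for one presentation:
# rows = patients' interpreted risk, columns = gold-standard risk.
cm <- matrix(c(5, 1, 0,
               1, 4, 1,
               0, 1, 7),
             nrow = 3, byrow = TRUE,
             dimnames = list(interpreted = c("low", "medium", "high"),
                             gold        = c("low", "medium", "high")))

precision <- diag(cm) / rowSums(cm)   # correct / all interpretations as risk X
recall    <- diag(cm) / colSums(cm)   # correct / all scenarios with risk X
accuracy  <- sum(diag(cm)) / sum(cm)  # correct / all interpretations
```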
We repeated the analysis with a secondary definition of the outcome, aimed at investigating situations in which, from a safety perspective, a misjudgement could have serious consequences. We evaluated each presentation’s performance in terms of patients correctly identifying the need for action (i.e. at least medium interpreted risk in medium and high risk clinical scenarios). To evaluate whether performance was driven by individual patients (i.e. whether some patients misinterpreted most of the information), we counted the mistakes each patient made. We distinguished between patients under-estimating the need for action (i.e. low interpreted risk in medium or high risk scenarios) and patients over-estimating the need for action, asking for help when it was not needed (i.e. medium or high interpreted risk in low risk clinical scenarios).
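A sketch of how this safety-oriented outcome and the per-patient mistake counts can be derived, in R with hypothetical trial data:

```r
# Hypothetical per-trial data: gold-standard risk and interpreted risk.
trials <- data.frame(
  patient     = rep(1:2, each = 3),
  gold        = c("low", "medium", "high", "low", "medium", "high"),
  interpreted = c("low", "low",    "high", "medium", "medium", "high")
)

action_needed     <- trials$gold        %in% c("medium", "high")
action_identified <- trials$interpreted %in% c("medium", "high")

# Safety-oriented outcome: need for action correctly identified
correct_action <- action_needed & action_identified

# Per-patient counts of under- and over-estimation mistakes
under <- tapply(action_needed & !action_identified, trials$patient, sum)
over  <- tapply(!action_needed & action_identified, trials$patient, sum)
```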
Visual search behaviour
To investigate whether correct interpretations were related to specific visual search behaviours, our secondary objective was to evaluate differences (if any) in eye-tracking data between patients who consistently identified the need for action and those who failed to do so on at least one occasion. We collected the following metrics:
- Fixation count: a fixation is a stable gaze on the screen lasting between 40 and 500 ms. We counted fixations lasting at least 180 ms [46], because higher thresholds are considered more reliable when the stimuli include graphical content [47]. Fixation count is an indicator of visual search efficiency [48].
- Average fixation duration: computed as an indicator of task difficulty and cognitive load, whereby longer fixation durations indicate more challenging tasks [48].
- Dwell time: the aggregated fixation duration, typically an indicator of attention and interest [48]. A sketch of how these three metrics can be computed from a raw fixation log follows this list.
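A minimal sketch in R, where the fixation log `fix` and its column names are hypothetical; the 180 ms threshold follows the definition above:

```r
# Hypothetical fixation log: one row per fixation, with its duration (ms)
# and the AoI it fell in.
fix <- data.frame(
  patient  = c(1, 1, 1, 2, 2),
  aoi      = c("tiles", "graph", "tiles", "tiles", "comparison"),
  duration = c(220, 150, 410, 310, 185)
)
fix <- fix[fix$duration >= 180, ]  # keep fixations lasting at least 180 ms

fixation_count <- aggregate(duration ~ patient + aoi, fix, length)  # search efficiency
avg_duration   <- aggregate(duration ~ patient,       fix, mean)    # cognitive load
dwell_time     <- aggregate(duration ~ patient + aoi, fix, sum)     # attention/interest
```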
There were three AoIs in each presentation (see Additional file 1: Figure S1): 1) the tiles showing the latest values for all laboratory tests; 2) the graph showing detailed longitudinal information for a single laboratory test; 3) the graphs comparing detailed longitudinal information for two laboratory tests. At first sight, given their clear demarcations, the natural unit for defining AoIs would have been the single tile. However, their small size would have resulted in unreliable data, because the precision of the eye-tracker is compromised with smaller AoIs. Moreover, all tiles convey the same functionality, so the added value of treating them independently would have been negligible. Consequently, we defined larger AoIs that grouped widgets with the same appearance and functionality.
Overall differences in fixation count and dwell time were assessed with a mixed ANOVA, including patient group (i.e. patients who never under-estimated the need for action versus the others) as a between-subjects factor and presentation, clinical scenario and AoI as within-subjects factors. We also assessed differences in fixation duration with a mixed ANOVA, this time excluding AoI from the within-subjects factors; this choice was mandated by the low frequency with which participants looked at some AoIs, which limited our statistical power. To account for the skewness and non-normality of residuals introduced by count and bounded data, we ran the mixed ANOVAs on the log-transformed fixation counts, fixation durations, and dwell times, rather than on the raw data.
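A minimal sketch of such a mixed ANOVA in base R; the data frame `et`, its column names, and the +1 offset before log-transforming counts are our assumptions, not the study's documented code:

```r
# `et` is an assumed long-format data frame with one row per
# patient x presentation x scenario x AoI cell.
et$patient   <- factor(et$patient)
et$log_count <- log(et$fixation_count + 1)  # +1 offset for zero counts (assumed)

# Between-subjects factor: group; within-subjects: presentation, scenario, aoi.
fit <- aov(log_count ~ group * presentation * scenario * aoi +
             Error(patient / (presentation * scenario * aoi)),
           data = et)
summary(fit)
```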
Eye-tracking data were extracted using Tobii Studio (version 3.4.0). All data analyses were performed in R version 3.3.1 [49].