
AutoDiscern: rating the quality of online health information with hierarchical encoder attention-based neural networks



Patients increasingly turn to search engines and online content before, or in place of, talking with a health professional. Low quality health information, which is common on the internet, presents risks to the patient in the form of misinformation and a possibly poorer relationship with their physician. To address this, the DISCERN criteria (developed at University of Oxford) are used to evaluate the quality of online health information. However, patients are unlikely to take the time to apply these criteria to the health websites they visit.


We built an automated implementation of the DISCERN instrument (Brief version) using machine learning models. We compared the performance of a traditional model (Random Forest) with that of a hierarchical encoder attention-based neural network (HEA) model using two language embeddings, BERT and BioBERT.


The HEA BERT and BioBERT models achieved average F1-macro scores across all criteria of 0.75 and 0.74, respectively, outperforming the Random Forest model (average F1-macro = 0.69). Overall, the neural network based models achieved 81% and 86% average accuracy at 100% and 80% coverage, respectively, compared to 94% manual rating accuracy. The attention mechanism implemented in the HEA architectures not only provided “model explainability” by identifying reasonable supporting sentences for the documents fulfilling the Brief DISCERN criteria, but also boosted F1 performance by 0.05 compared to the same architecture without an attention mechanism.


Our research suggests that it is feasible to automate online health information quality assessment, which is an important step towards empowering patients to become informed partners in the healthcare process.



Patients often turn to search engines and online content before, or in place of, talking with a health professional [1]. However, online health information is not regulated, and prior studies have found wide variations in information quality [2]. Poor risk communication, biased writing, and lack of transparency about the source of the information plague online health texts [3, 4]. This presents a real risk to patients, both through misinformation [5–7] and through negative effects on their interactions with health care providers [8, 9].

In response to this problem, many organizations, such as the Health on the Net Organization, the Journal of the American Medical Association, and the National Health Service of the UK, have established guidelines for assessing the quality of online health information [10]. These guidelines describe a set of criteria an article must meet to be considered of high quality. It is worth noting that quality is distinct from accuracy. While these guidelines check for indicators of well written, unbiased, and evidence based articles, they do not attempt to verify the scientific accuracy of the information (a significantly more challenging problem). Similarly, the concept of quality is also distinct from that of credibility, or how likely a reader is to believe the information. The degree to which readers believe the content they consume is influenced not only by information accuracy, but also by structural aspects of the media, such as a website’s design, appearance, and overall readability [11]. Thus, quality guidelines form a basis by which systems may affect individuals’ perceptions of credibility, without venturing into the field of information accuracy assessment.

The implementation strategies of these quality guidelines so far fall into two categories: Distributed Guidelines and Centralized Approvers. However, both of these strategies have scalability issues that limit their reach and prevent them from broadly affecting patient information consumption [10]. In the following section, we describe both of these implementation approaches in use today, and then describe a third solution that addresses the issue of scalability.

Distributed Guidelines One approach to helping patients find high quality health information is to develop a set of criteria and publish them as a public tool that citizens can use. An example of this approach is the DISCERN instrument [12]. The DISCERN instrument’s criteria are specifically designed to be understood and applied by any lay person; no medical knowledge is required. This implementation approach puts a significant burden on the patient. For this approach to be successful, the patient has to be aware of the guideline, learn how to evaluate the criteria, and take considerable time to apply the guidelines to every website the patient encounters.

Centralized Approvers The second implementation approach in use today is Centralized Approvers. In this approach, an organization manually assesses web pages for health information quality. An example of this approach is the Health on the Net Foundation, which developed the HONcode guidelines. It assesses websites for quality, and allows those that pass their criteria to display a HONcode badge on their webpage [13]. A variant on this approach is to register all manually approved content in a centralized repository. Patients can search the repository with the confidence that all listed sites have been vetted for quality.

The Centralized Approver approach is not scalable in the face of a massive and rapidly growing internet. Quality assessment is a costly manual process. Not only do new pages need to be evaluated, but previously-evaluated pages need to be re-evaluated on a regular basis in case of content changes [10].

Automated Assessment An automated quality assessment process is key to providing the public with scalable tools for assessing online health information quality.

Initial attempts to automate the assessment of health information used simplistic approaches, such as readability scores, and did not capture more complex issues with health information, such as tone and bias [4]. A machine learning model developed by the HON organization showed promising but limited initial results [14]. With the recent developments in machine learning and natural language processing methods, however, there is a renewed opportunity for tackling this problem. Neural Language Models have been successfully applied in many domains, including translation and question answering [15–18], capturing details and nuances in language that made information quality assessment an expensive manual process for so long.

Research objectives In this research, we study and develop machine learning models to automate the application of the DISCERN instrument. The DISCERN instrument was developed by Charnock et al. [12] at Oxford University and funded by the National Health Service (UK). The instrument consists of 15 questions to help a lay person evaluate the quality of online health information regarding treatment options. The validity of the DISCERN instrument has been evaluated in multiple studies, and the instrument is commonly used among researchers [19]. The DISCERN instrument suffers from the same sustainability issues as all distributed guidelines do: patients are unlikely to take the time to apply these criteria to each website they find. In this study, we built and evaluated machine learning models for the automated annotation of the Brief DISCERN criteria [20], a 6-question subset of the DISCERN criteria that has been shown to capture the quality of health information as reliably as the complete DISCERN instrument. Separate models were developed and tested for each of five Brief DISCERN questions (one question, Q13, was excluded due to low interrater reliability). We compared the use of traditional machine learning (Random Forest) with feature engineering vs. hierarchical encoder attention-based neural network (HEA) models. We also compared the performance of neural models with the attention mechanism (HEA) and without it (HE). Additionally, for both neural architectures, we experimented with the use of two pre-trained neural language models, BERT [21] and BioBERT [22], as embeddings in the HEA and HE models. Thus, in total, we trained and compared 5 different architectures: RF, HE+BERT, HE+BioBERT, HEA+BERT, and HEA+BioBERT.


Data collection

Using Google Trends, we identified breast cancer, arthritis, and depression as medical topics with the highest search volume since 2004. Using Google and Yahoo search engines, we identified a total of 269 Web pages (HTML articles) with a focus on treatment choices and options across the 3 topics. Two raters (master’s students) were trained for 2 months on using the DISCERN instrument and scoring platform. Both raters scored all articles on DISCERN’s 5 point scale. Interrater agreement for the DISCERN criteria was adequate to high, ranging from 0.61 to 0.91 as measured by Krippendorff’s alpha. The process of building the training corpus is described in more detail in [23]. The dataset is described in Table 1.

Table 1 Description of the dataset by health topic

Data preprocessing

We converted the scores for each question in the DISCERN instrument, which range from 1 to 5, into a binary classification, where a score of 3-5 passes and a score of 1-2 fails the criterion. The texts from the HTML articles were extracted and cleaned using the beautifulsoup library.
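The binarization rule can be illustrated with a short sketch. The function name `binarize_discern_score` and the sample ratings are hypothetical; only the cut-off (3-5 passes, 1-2 fails) comes from the text, and how non-integer aggregated ratings were handled is not stated, so the sketch accepts integer ratings only.

```python
def binarize_discern_score(score: int, passing_threshold: int = 3) -> int:
    """Map a 1-5 DISCERN rating to 1 (pass) or 0 (fail)."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"DISCERN ratings are integers from 1 to 5, got {score!r}")
    return 1 if score >= passing_threshold else 0


# Hypothetical ratings for one question across five articles.
ratings = [1, 2, 3, 4, 5]
labels = [binarize_discern_score(r) for r in ratings]
print(labels)
```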

Neural network model

We designed and implemented a Hierarchical Encoder Attention-based (HEA) model in PyTorch [24] taking into consideration the structure of our problem and the limits of our training data. The model architecture design is primarily motivated by the intrinsic hierarchy of the documents (i.e. sequences of word/tokens represent a sentence, and sequences of sentences represent a document). In addition, our attention-based modeling architecture reflects the property that passing or failing the DISCERN criteria depends on only small fragments throughout the article. This architecture enables the model to "pay attention" to single sentences within a larger article.

HEA’s architecture is composed of a hierarchical structure with two encoders and a classifier (Fig. 1). The first encoder is a sentence encoder SentEncoder (Fig. 2), which is based on a bidirectional recurrent neural network (RNN) that encodes each sentence (i.e. sequence of tokens) into a dense vector representation. The second encoder is a document encoder DocEncoder (Fig. 3), which is also based on a bidirectional RNN; it encodes the sequence of sentence representations (i.e. vectors computed by the first encoder) and uses an attention mechanism [15] along with a global context vector to compute a dense vector representation for the whole document. A decoder/classifier maps the document’s learned vector representation to the labels using an affine map followed by a softmax layer that computes a probability distribution over the labels for the processed document.

Fig. 1

Overview of the HEA neural network architecture. In lieu of traditional feature engineering, the HEA architecture learns representations at the word, sentence, and document level before making a classification. Word representations are generated by the pre-trained BERT embedder. An attention mechanism aids in learning a document representation from amongst many sentences

Fig. 2

HEA’s SentEncoder architecture for computing sentence embedding

Fig. 3

Model architecture for converting a document’s sentence embeddings into a document prediction

Sentence encoder (SentEncoder)

Formally, given an input sentence \(\underline {S} = [\overline {w}_{1}, \overline {w}_{2}, \cdots, \overline {w}_{T_{S}}]\) where \(\overline {w}_{t}\) represents the token representation at position t (i.e. 1-of-K encoding where K is the size of vocabulary V – the set of all tokens in the training corpus), a vanilla RNN will compute a hidden vector at each position (i.e. state vector \(\overline {h}_{t}\) at position t), representing a history or context summary of the sequence using the input and the hidden state vectors from the previous steps. Equation 1 shows the computation of the hidden vector \(\overline {h}_{t}\) using the input \(\overline {w}_{t}\) and the previous hidden vector \(\overline {h}_{t-1}\) where ϕ is a non-linear transformation such as ReLU(z)=max(0,z) or \(tanh(z)=\frac {e^{z} - e^{-z}}{e^{z} + e^{-z}}\).

$$ \overline{h}_{t}=\phi(\mathbf{W}_{hw}\overline{w}_{t}+\mathbf{W}_{hh}\overline{h}_{t-1}+\overline{b}_{hw}) $$

\(\mathbf W_{hh} \in \mathbb {R}^{D_{h} \times D_{h}}\), \(\mathbf W_{hw} \in \mathbb {R}^{D_{h} \times D_{w}}\), \(\overline {b}_{hw} \in \mathbb {R}^{D_{h}}\) represent the RNN’s weights to be optimized, and \(D_{h}\), \(D_{w}\) are the dimensions of the \(\overline {h}_{t}\) and \(\overline {w}_{t}\) vectors respectively. Note that the weights are shared across the network and \(D_{w}\) could be equal to K, the size of the vocabulary (i.e. in case of 1-of-K encoding), or to the size of a dense embedding vector generated using a language model such as BERT [21]. The use of an RNN allows the model to learn long-range dependencies, where the network is unfolded as many times as the length of the sequence (sentence in our case) it is modeling. Although RNNs are capable of handling and representing variable-length sequences, in practice, the learning process faces challenges due to the vanishing/exploding gradient problem [25–27]. In this work, we used the gated recurrent unit (GRU) [28, 29] to overcome these challenges by updating the computation mechanism of the hidden state vector \(\overline {h}_{t}\) through the equations below.

$$ \begin{aligned} \overline{z}_{t} & =\sigma(\mathbf{W}_{hw}^{z}\overline{w}_{t}+\mathbf{W}_{hh}^{z}\overline{h}_{t-1}+\overline{b}_{hw}^{z})\qquad\quad\!\! \text{(update gate)}\\ \overline{r}_{t} & =\sigma(\mathbf{W}_{hw}^{r}\overline{w}_{t}+\mathbf{W}_{hh}^{r}\overline{h}_{t-1}+\overline{b}_{hw}^{r}) \qquad\quad\!\!\text{(reset gate)}\\ \overline{\tilde{h}}_{t} & =\phi(\mathbf{W}_{hw}^{\tilde{h}}\overline{w}_{t}+\overline{r}_{t}\odot\mathbf{W}_{hh}^{\tilde{h}}\overline{h}_{t-1}+\overline{b}_{hw}^{\tilde{h}}) \quad\text{(new state/memory cell)}\\ \overline{h}_{t} & =(1-\overline{z}_{t})\odot\overline{\tilde{h}}_{t}+\overline{z}_{t}\odot \overline{h}_{t-1} \qquad\qquad\;\text{(hidden state vector)} \end{aligned} $$

The GRU model computes a reset gate \(\overline {r}_{t}\) that is used to modulate the effect of the previous hidden state vector \(\overline {h}_{t-1}\) when computing the new memory vector \(\overline {\tilde {h}}_{t}\). The update gate \(\overline {z}_{t}\) determines the importance/contribution of the newly generated memory vector \(\overline {\tilde {h}}_{t}\) compared to the previous hidden state vector \(\overline {h}_{t-1}\) when computing the current hidden vector \(\overline {h}_{t}\). The weights \(\mathbf {W}^{z}_{hw}\), \(\mathbf {W}^{r}_{hw}\), \(\mathbf {W}^{\tilde {h}}_{hw}\) are each \(\in \mathbb {R}^{D_{h} \times D_{w}}\) and \(\mathbf {W}^{z}_{hh}\), \(\mathbf {W}^{r}_{hh}\), \(\mathbf {W}^{\tilde {h}}_{hh}\) are each \(\in \mathbb {R}^{D_{h} \times D_{h}}\). The biases \(\overline {b}^{z}_{hw}\), \(\overline {b}^{r}_{hw}\), \(\overline {b}^{\tilde {h}}_{hw}\) are each \(\in \mathbb {R}^{D_{h}}\), where \(D_{h}\) and \(D_{w}\) are the dimensions of the \(\overline {h}_{t}\) and \(\overline {w}_{t}\) vectors respectively. The operator σ represents the sigmoid function, ϕ the tanh or ReLU function, and \(\odot\) the element-wise product (i.e. Hadamard product). The SentEncoder uses a bidirectional GRU that computes two hidden state vectors \(\overrightarrow {\overline {h}_{t}}\) and \(\overleftarrow {\overline {h}_{t}}\) for each token \(\overline {w}_{t}\) in sentence \(\underline {S}\), corresponding to left-to-right and right-to-left GRU encoding of the sentence. We experimented with two options for computing the sentence representation vector \(\overline {S}\): (1) concatenation \([\overrightarrow {\overline {h}_{T_{S}}}^{\top };\overleftarrow {\overline {h}_{0}}^{\top }]^{\top }\), or (2) summation \([\overrightarrow {\overline {h}_{T_{S}}} + \overleftarrow {\overline {h}_{0}}]\) of the computed left and right hidden state vectors of the last \(\overline {w}_{T_{S}}\) and first \(\overline {w}_{0}\) tokens respectively in sentence \(\underline {S}\).
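To make the four GRU equations concrete, the following is a minimal pure-Python sketch of a single update step with ϕ = tanh. It is an illustration only, not the authors' PyTorch implementation; the `params` layout and the helper names `matvec`, `gru_step`, and `zero_params` are assumptions introduced here.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(w_t, h_prev, params):
    """One GRU update for token vector w_t and previous state h_prev.

    `params` is a hypothetical dict holding, for each gate name g in
    {"z", "r", "h"}, the matrices W_hw[g], W_hh[g] and the bias b[g].
    """
    def parts(gate):
        return (matvec(params["W_hw"][gate], w_t),
                matvec(params["W_hh"][gate], h_prev),
                params["b"][gate])

    wx, uh, b = parts("z")
    z = [sigmoid(a + c + d) for a, c, d in zip(wx, uh, b)]  # update gate
    wx, uh, b = parts("r")
    r = [sigmoid(a + c + d) for a, c, d in zip(wx, uh, b)]  # reset gate
    wx, uh, b = parts("h")
    # The reset gate scales the recurrent term before the non-linearity.
    h_new = [math.tanh(a + r_i * c + d) for a, r_i, c, d in zip(wx, r, uh, b)]
    # Interpolate between the candidate state and the previous state.
    return [(1.0 - z_i) * n_i + z_i * p_i
            for z_i, n_i, p_i in zip(z, h_new, h_prev)]


def zero_params(d_h, d_w):
    """All-zero parameters, used only to make the example easy to check."""
    gates = ("z", "r", "h")
    return {
        "W_hw": {g: [[0.0] * d_w for _ in range(d_h)] for g in gates},
        "W_hh": {g: [[0.0] * d_h for _ in range(d_h)] for g in gates},
        "b": {g: [0.0] * d_h for g in gates},
    }


h_next = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], zero_params(d_h=2, d_w=3))
print(h_next)
```

With all-zero parameters both gates equal 0.5 and the candidate state is zero, so the new state is half the previous one, which gives a quick sanity check of the interpolation in the last equation.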

Document encoder (DocEncoder) with attention

Originally, each document \(\underline {Doc}\) in our corpus is composed of a sequence of sentences (i.e. \(\underline {Doc}=[\underline {S}_{1}, \underline {S}_{2}, \cdots, \underline {S}_{T_{Doc}}]\) where \(\underline {S}_{i}\) represents the i-th sentence and \(T_{Doc}\) is the number of sentences in \(\underline {Doc}\)). Each sentence \(\underline {S}_{i}\) is composed of a sequence of tokens (as described in “Sentence encoder (SentEncoder)” section above) that are processed using the SentEncoder model to compute the sentence vector representation \(\overline {S}_{i}\). As a result, the processed document \(\underline {Doc}^{proc}\) is a sequence of sentence vector representations (i.e. \(\underline {Doc}^{proc}=[\overline {S}_{1}, \overline {S}_{2}, \cdots, \overline {S}_{T_{Doc}}]\)) that is used as input to the DocEncoder model. The DocEncoder uses a bidirectional GRU that computes two hidden state vectors \(\overrightarrow {\overline {l}_{i}}\) and \(\overleftarrow {\overline {l}_{i}}\) for each sentence representation \(\overline {S}_{i}\), corresponding to left-to-right and right-to-left GRU encoding of the sentences in \(\underline {Doc}^{proc}\). We experimented with two options for joining both hidden state vectors \(\overrightarrow {\overline {l}_{i}}\) and \(\overleftarrow {\overline {l}_{i}}\) into one vector: (1) concatenation \([\overrightarrow {\overline {l}_{i}}^{\top };\overleftarrow {\overline {l}_{i}}^{\top }]^{\top }\), or (2) summation \([\overrightarrow {\overline {l}_{i}} + \overleftarrow {\overline {l}_{i}}]\); the joined vector will be denoted by \(\overrightarrow {\overleftarrow {\overline {l}_{i}}}\) from now on. Hence, the output of the DocEncoder is a sequence of joined hidden state vectors \(\underline {O}=[\overrightarrow {\overleftarrow {\overline {l}_{1}}}, \overrightarrow {\overleftarrow {\overline {l}_{2}}}, \cdots, \overrightarrow {\overleftarrow {\overline {l}_{T_{Doc}}}}]\) that is fed to an attention layer to compute the weights associated with each vector, which in turn are used to compute a weighted vector sum to obtain a document vector representation \(\overline {z}\).

Attention layer

For many of the DISCERN criteria, passing or failing a criterion depends on only small fragments throughout the article. For example, for the question “Is it clear when the information used or reported in the publication was produced?”, there is likely only one line among a 200+ sentence article (i.e. “Last reviewed on...”) that determines whether the article passes the criterion. Our attention-based modeling architecture reflects this problem structure: the model can “pay attention” to single sentences within a larger article. We adapt the idea of the global attention model [16], in which a global context/query vector \(\overline {q}\) (i.e. trainable parameters in the model) is used along with the output \(\underline {O}=[\overrightarrow {\overleftarrow {\overline {l}_{1}}}, \overrightarrow {\overleftarrow {\overline {l}_{2}}}, \cdots, \overrightarrow {\overleftarrow {\overline {l}_{T_{Doc}}}}]\) from DocEncoder to generate the document representation vector \(\overline {z}\). The objective is to compute attention weights for every \(\overrightarrow {\overleftarrow {\overline {l}_{i}}}\) vector such that \(\overline {z}=\sum _{i=1}^{T_{Doc}} \alpha _{i}\overrightarrow {\overleftarrow {\overline {l}_{i}}}\) where \(\alpha _{i}\) is the normalized weight computed using Eq. 2.

$$ \alpha_{i} = \frac{\exp{(score(\overline{q}, \overrightarrow{\overleftarrow{\overline{l}_{i}}}))}}{\sum_{k=1}^{T_{Doc}}\exp{(score(\overline{q}, \overrightarrow{\overleftarrow{\overline{l}_{k}}}))}} $$

For the attention scoring function, we experimented with two options inspired by the additive approach [16, 30] and the scaled dot-product work in [15] (see Equations 3 and 4 respectively). In Eq. 3, the score is computed using three operations: (1) a weight matrix \(\mathbf {W}^{l}_{ql} \in \mathbb {R}^{D_{q} \times D_{l}}\) maps \(\overrightarrow {\overleftarrow {\overline {l}_{i}}}\) to a fixed-length vector of dimension equal to that of the query vector \(\overline {q}\) (i.e. \(D_{q}\)), (2) a non-linear transformation tanh is applied, and (3) a dot-product with \(\overline {q}\) is performed. In contrast, in Eq. 4, the score is computed by performing a dot-product between the query vector \(\overline {q}\) and \(\overrightarrow {\overleftarrow {\overline {l}_{i}}}\), scaled by \(\sqrt {D_{l}}\), where \(D_{l}\) is the dimension of both vectors, in a similar approach to [15]. Our choice of attention score functions from the vast array of options in the literature [16, 30] was based on limiting the number of parameters in our model given the size of our dataset.

$$ score(\overline{q}, \overrightarrow{\overleftarrow{\overline{l}_{i}}}) = \overline{q}^{\top} tanh(\mathbf{W}^{l}_{ql}\overrightarrow{\overleftarrow{\overline{l}_{i}}}) $$
$$ score(\overline{q}, \overrightarrow{\overleftarrow{\overline{l}_{i}}}) = \frac{\overline{q}^{\top} \overrightarrow{\overleftarrow{\overline{l}_{i}}}}{\sqrt{D_{l}}} $$
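As a hedged illustration of Eq. 2 combined with the scaled dot-product score of Eq. 4, the sketch below computes the attention weights and the resulting document vector in plain Python. The names `scaled_dot_score` and `attention_pool` and the toy vectors are hypothetical; the additive score of Eq. 3 is not shown.

```python
import math


def scaled_dot_score(q, l_i):
    """Dot product of the query and a sentence state, scaled by sqrt(D_l)."""
    return sum(a * b for a, b in zip(q, l_i)) / math.sqrt(len(l_i))


def attention_pool(q, outputs):
    """Return (attention weights, weighted sum) over DocEncoder outputs."""
    scores = [scaled_dot_score(q, l_i) for l_i in outputs]
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(outputs[0])
    z = [sum(a * l_i[d] for a, l_i in zip(alphas, outputs)) for d in range(dim)]
    return alphas, z


# Toy example: three sentence states of dimension 2 and a query vector.
outputs = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
q = [1.0, 1.0]
alphas, z = attention_pool(q, outputs)
print(alphas, z)
```

The third toy sentence is most aligned with the query, so it receives the largest weight and dominates the document vector, which is the behaviour the attention layer relies on when a single sentence decides a criterion.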

Decoder/output classifier

The last layer in the HEA model takes as input the computed document representation vector \(\overline {z}\) from the DocEncoder layer and performs an affine transformation followed by softmax operation to compute a probability distribution on the labels for the document under consideration. That is, the outcome \(\hat {y}\) for a given Brief DISCERN criterion is computed using Eq. 5

$$ \hat{y}=\sigma(\mathbf{W}_{V_{label}z}\overline{z}+\overline{b}_{V_{label}}) $$

where \(\mathbf W_{V_{label}z} \in \mathbb {R}^{|V_{label}| \times D_{z}}\) and \(\overline {b}_{V_{label}} \in \mathbb {R}^{|V_{label}|}\) represent the classifier’s weights to be optimized, \(V_{label}=\{0,1\}\) is the set of admissible labels for a criterion (binary variable in our case), \(|V_{label}|\) is the number of labels, \(D_{z}\) is the dimension of \(\overline {z}\) (document representation vector), and σ is the softmax function. As a result, the outcome \(\hat {y}\) represents a probability distribution over the set of possible labels \(V_{label}\).

Objective function

We used cross-entropy loss as our objective function for each Brief DISCERN criterion model. The loss function for the j-th document is defined by Eq. 6, where \(y_{c} \in \{0,1\}\) is equivalent to \(\mathbb {1}{\big [y = c\big ]}\) (i.e. a boolean indicator equal to 1 when c is the reference/ground-truth class), and \(\hat {y}_{c}\) is the probability of the class c. The objective function for the whole training set \(D_{train}\) is defined by the average loss across all the documents in \(D_{train}\) plus a weight regularization term (i.e. \(l_{2}\)-norm regularization) applied to the model parameters represented by θ (see Eq. 7).

$$ l^{(j)}=-\sum_{c=1}^{|V_{label}|}y_{c}^{(j)}\times \log(\hat{y}_{c}^{(j)}) $$
$$ L(\boldsymbol{\theta}) =\frac{1}{N}\sum_{j=1}^{N}l^{(j)} + \frac{\lambda}{2}||\boldsymbol{\theta}||_{2}^{2} $$
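The classifier of Eq. 5 and the objective of Eqs. 6 and 7 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the function names, the zero weights, and the toy values of θ and λ are all hypothetical.

```python
import math


def softmax(logits):
    shift = max(logits)  # for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z, weights, bias):
    """Affine map of the document vector followed by softmax (Eq. 5)."""
    logits = [sum(w * x for w, x in zip(row, z)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs, true_class):
    """Per-document loss (Eq. 6): only the true class term is non-zero."""
    return -math.log(probs[true_class])


def objective(losses, theta, lam):
    """Mean loss plus (lambda / 2) * squared l2-norm of parameters (Eq. 7)."""
    return sum(losses) / len(losses) + 0.5 * lam * sum(t * t for t in theta)


# Toy example with zero weights, so both classes get probability 0.5.
probs = classify([0.5, -0.5], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
loss = cross_entropy(probs, true_class=1)
total = objective([loss, loss], theta=[1.0, 2.0], lam=0.1)
print(probs, loss, total)
```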

In addition to the \(l_{2}\)-norm regularization, we also experimented with dropout [31] by deactivating neurons in the network layers with probability \(p_{dropout}\). Moreover, we used pre-trained language models such as BERT [17, 21] and BioBERT [22] to extract token embeddings that are used as input to the HEA model (i.e. representation of token \(\overline {w}_{t}\)).

We additionally implemented a neural-based model (HE) that follows the same architecture as the HEA model but without the attention layer, such that the output of the DocEncoder, a sequence of joined hidden state vectors \(\underline {O}=[\overrightarrow {\overleftarrow {\overline {l}_{1}}}, \overrightarrow {\overleftarrow {\overline {l}_{2}}}, \cdots, \overrightarrow {\overleftarrow {\overline {l}_{T_{Doc}}}}]\), is mean pooled (i.e. averaged) to obtain an overall document vector representation \(\overline {z}\).
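For contrast with the attention layer, mean pooling weights every sentence equally. A minimal sketch (the function name `mean_pool` and the toy vectors are hypothetical):

```python
def mean_pool(outputs):
    """Average the DocEncoder outputs dimension by dimension."""
    n = len(outputs)
    dim = len(outputs[0])
    return [sum(vec[d] for vec in outputs) / n for d in range(dim)]


z_mean = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(z_mean)
```

This is equivalent to fixing every attention weight at \(1/T_{Doc}\), so a single decisive sentence in a long article contributes only a small share of the document vector.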

Hyperparameter optimization for neural models

We developed a multiprocessing module that used a uniform random search strategy [32], which randomly chose a set of hyperparameter configurations (i.e. layer depth, embedding size, attention approach, etc.) from the set of all possible configurations. The best configuration for each model (i.e. the one achieving the best performance on the validation set) was then used for the final training and testing.
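A uniform random search of this kind can be sketched as follows. The search space, its values, the helper names, and the toy scoring function are all hypothetical placeholders; the actual grid and the multiprocessing logic used in the study are not reproduced here.

```python
import random
from math import prod

# Hypothetical search space; the real options are not listed in the text.
search_space = {
    "num_layers": [1, 2, 3],
    "hidden_size": [64, 128, 256],
    "attention": ["additive", "scaled_dot"],
    "dropout": [0.1, 0.3, 0.5],
}


def sample_configurations(space, n_samples, seed=0):
    """Draw distinct configurations uniformly at random from the grid."""
    rng = random.Random(seed)
    keys = sorted(space)
    n_total = prod(len(space[k]) for k in keys)
    n_samples = min(n_samples, n_total)
    seen, configs = set(), []
    while len(configs) < n_samples:
        combo = tuple(rng.choice(space[k]) for k in keys)
        if combo not in seen:
            seen.add(combo)
            configs.append(dict(zip(keys, combo)))
    return configs


def random_search(space, n_samples, validation_score, seed=0):
    """Return the sampled configuration with the best validation score."""
    configs = sample_configurations(space, n_samples, seed)
    return max(configs, key=validation_score)


def toy_score(cfg):
    """Stand-in for training a model and scoring it on the validation set."""
    return -abs(cfg["hidden_size"] - 128) - cfg["dropout"]


configs = sample_configurations(search_space, 10)
best = random_search(search_space, 54, toy_score)
print(best)
```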

Baseline machine learning models

For the traditional modeling approach, the content of each article was converted into a bag of words representation and weighted using the term frequency–inverse document frequency (TF-IDF) weighting scheme. We also computed a set of features based on the existence of HTML links, bibliography keywords, references to medical terms (extracted using MetaMap Lite [33]), and named entities within the text, as well as a measure of text polarity. Recursive feature elimination with cross validation was used to identify the optimal subset of features. For its ease of interpretability and good performance on feature sets with many categorical variables, we implemented a Random Forest model with scikit-learn [34] to predict if the criterion is fulfilled or not for every criterion in Brief DISCERN.
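The TF-IDF weighting can be illustrated with a small pure-Python sketch. It uses the textbook form tf × log(N / df), which is an assumption here; scikit-learn's `TfidfVectorizer` applies a smoothed idf and l2 normalization by default, so its numbers would differ. The sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "treatment options for arthritis",
    "treatment risks and benefits",
    "depression treatment last reviewed 2019",
]
vectors = tfidf(docs)
print(vectors[0])
```

A term that appears in every document, such as "treatment" in this toy corpus, receives zero weight, while terms specific to one article are weighted up.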

Experimental setup

We followed a stratified 5-fold cross-validation scheme where each fold was defined as a distinct 80%-20% train-test split. Due to the imbalance in outcome classes, training examples were weighted inversely proportional to class/outcome frequencies in the training data. Articles from the three health topics were randomly distributed between the 5 folds. Within each fold, parameter selection was performed with a validation set consisting of 10% of the training set. During the training of the models, the epoch in which the model achieved the best F1-macro score on the validation set was recorded, and model state as it was trained up to that epoch was saved. This best model, as determined by the validation set, was then tested on the test split.
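One common way to weight training examples inversely to class frequency is n / (k × count), where n is the number of examples and k the number of classes. The sketch below uses that formula as an assumption, since the exact weighting formula is not given in the text; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1 / class frequency."""
    counts = Counter(labels)
    n_examples, n_classes = len(labels), len(counts)
    return {cls: n_examples / (n_classes * count)
            for cls, count in counts.items()}


# Toy training labels: three passing articles and one failing article.
train_labels = [1, 1, 1, 0]
class_weights = inverse_frequency_weights(train_labels)
example_weights = [class_weights[y] for y in train_labels]
print(class_weights)
```

With this scheme the weights of all examples sum to n, and each class contributes the same total weight to the loss regardless of how many articles it contains.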

Model performance was evaluated using F1-macro and classification accuracy. In this quality assessment problem, we value precision equally with recall, so F1 is a good measure that captures both. The evaluation of the trained models was based on their average performance on the test sets of the five folds.
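For reference, F1-macro is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. A minimal sketch (the function name and the toy labels are hypothetical):

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
score = f1_macro(y_true, y_pred)
print(round(score, 4))
```

In this toy case the passing class has F1 = 0.8 and the failing class has F1 = 2/3, so the macro average is about 0.73 even though accuracy is 0.75.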

We also performed a coverage analysis to determine how the model could be adapted to handle uncertainty. In addition to classifying articles as low or high quality, we also have the option of allowing the model to report that it is unsure about a criterion. In instances when the model has low confidence in its prediction, it is more valuable to the user for the model to convey that uncertainty than to make a less accurate prediction. In addition, there is also the option to send articles where the model has low confidence to a human for manual evaluation. However, there is a direct trade-off between the quality (accuracy) and the quantity of the predictions; by requiring a higher threshold of confidence, the model will by definition make fewer predictions. The frequency with which the model makes a prediction above a certain confidence threshold, i.e. outputs a prediction to the user, is called coverage. We calculated the models’ accuracy at different levels of coverage and their associated confidence thresholds. For example, to calculate the accuracy associated with a coverage of 80%, we computed the 20th percentile prediction confidence score, and computed accuracy metrics on only the articles with prediction confidence scores (i.e. the probability from the softmax layer) that exceed the 20th percentile. Predictions that are below this threshold are considered unsure. These are instances where the model would abstain from making a prediction, or the article could be sent for manual review.
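The coverage analysis can be approximated with a short sketch. Instead of computing a percentile directly, it ranks predictions by confidence and keeps the top fraction, which is an assumption that may differ from the study's procedure when confidences are tied; the function name and the toy values are hypothetical.

```python
def accuracy_at_coverage(confidences, correct, coverage):
    """Accuracy over the `coverage` fraction of most confident predictions.

    Returns (accuracy, threshold), where threshold is the lowest confidence
    among the predictions that are kept.
    """
    n_keep = max(1, round(coverage * len(confidences)))
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0],
                    reverse=True)
    kept = ranked[:n_keep]
    accuracy = sum(is_correct for _, is_correct in kept) / len(kept)
    threshold = kept[-1][0]
    return accuracy, threshold


# Toy predictions: confidence of the predicted class and whether it was right.
confidences = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.51]
correct = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

full = accuracy_at_coverage(confidences, correct, coverage=1.0)
partial = accuracy_at_coverage(confidences, correct, coverage=0.8)
print(full, partial)
```

In this toy example, abstaining on the two least confident articles raises accuracy from 0.70 to 0.875, at the cost of leaving 20% of articles without an automated rating.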

Code availability

The data preprocessing and the models’ implementation (training and testing) workflow is made publicly available at


We compared the performance of the five trained models (Random Forest, HEA with BERT and BioBERT embeddings, and HE with BERT and BioBERT embeddings) across all five folds using F1-macro scores (Table 2 and Fig. 4). Overall, the HEA architecture performed the best, with an average F1-macro score of 0.75 with BERT embeddings and 0.74 with BioBERT embeddings. In comparison, the HE architectures without the attention mechanism averaged 0.70 on both embeddings. The Random Forest model achieved an average F1-macro score of 0.69.

Fig. 4

Performance comparison of the model architectures on each of the Brief-DISCERN questions. Each point represents the performance of the architecture on each of the 5 cross validation folds

Table 2 Average F1-macro scores with standard deviation by model architecture

Almost all models performed the best on question 4 (“Is it clear what sources of information were used to compile the publication [other than the author or producer]?”) with HEA BERT, HEA BioBERT and Random Forest scoring 0.86, 0.80, and 0.83 respectively. The HE models performed worse on this question, with 0.72 with BERT and 0.71 with BioBERT. All five models achieved high F1-macro scores on question 5 (“Is it clear when the information used or reported in the publication was produced?”) with HEA BioBERT coming first (0.82), HE BioBERT second (0.78), HEA BERT and HE BERT tying for third (0.77), and Random Forest last (0.70).

For treatment related questions, HEA BioBERT performed the best. On question 9 (“Does it describe how each treatment works?”), HEA BioBERT came first with an average F1-macro score of 0.72, with the remaining models ranging between 0.66 and 0.68. For question 11 (“Does it describe the risks of each treatment?”), both neural models using the BioBERT embeddings performed the best: the HEA BioBERT and HE BioBERT models scored 0.81 and 0.80 respectively, with the remaining models following between 0.72 and 0.76. In contrast, for question 10 (“Does it describe the benefits of each treatment?”), the neural models using the BERT embeddings performed better, with HEA BERT at 0.66 and HE BERT at 0.60, and the remaining models ranging between 0.53 and 0.56. It is worth mentioning that Q10 has the greatest class imbalance (i.e. 77% of the articles in the data set described the benefits of the treatment).

As measured by F1-macro, HEA BioBERT took first place in 3 of the 5 questions, and HEA BERT took first in the remaining 2 questions (see Table 2). However, when computing the average score on all questions, HEA BERT and HEA BioBERT performed comparably, and the variance across folds was lower for HEA BERT compared to HEA BioBERT (see Fig. 4).

We explored the relationship between the model’s coverage, accuracy, and confidence (prediction probability threshold), focusing on the HEA BioBERT model (see Fig. 5). As the model’s coverage decreases, both its confidence (i.e. outcome probability) and its accuracy increase. At 80% coverage, the model achieves 86% average accuracy with average confidence equal to 0.79 (see Table 3).

Fig. 5

Relationship between Prediction Coverage, Confidence Threshold, and Model Accuracy. This data is for the HEA BioBERT architecture

Table 3 Comparison of performance metrics for the HEA BioBERT architecture at 80% and 100% coverage. Coverage refers to the percent of articles the model makes a prediction for (as opposed to abstaining from making a prediction when the model has a confidence below the Threshold). The Precision, Recall, and Accuracy scores reflect the accuracy of the model on the resulting 80% of predicted articles

Table 4 compares the machine learning model (HEA BioBERT) performance to human manual performance on the DISCERN and HON guidelines. We report the DISCERN Manual Performance as the frequency with which each of our two raters agrees with the aggregated average of both raters (i.e. percent agreement to aggregate). The manual rater accuracy score on our data set averaged 94% (spanning 88% - 97% across criteria). Compared to the DISCERN raters’ manual performance of 94%, the model’s performance was adequate (81% accuracy across all questions at 100% coverage). In order to bring the model’s performance closer to the DISCERN raters’ manual performance, we could reduce the model coverage to 80%, which would yield an accuracy of 86%.

Table 4 Performance Comparison between Human Manual Rating and Deep Learning Model. Manual performance is reported as percent agreement. Automated performance is reported as Implementation Accuracy (see Table 3).

As an additional comparison, Table 4 also contains HON organization manual percent agreement scores that were computed while developing training sets for their own machine learning models [14]. The average percent agreement among the HON raters was 85% on their full criteria, and this drops to 81% when only considering criteria that are shared with Brief DISCERN (Reference, Date, and Justifiability). Overall, the machine learning model achieved a competitive performance at full coverage compared to HON raters: the model averaged 83% accuracy at 100% coverage on the questions that overlap with HON, and the HON raters had percent agreement of 81%.

Table 5 shows the top three sentences (based on attention probability score) from the most confidently predicted documents, as determined by the prediction probability score, for each question in the three medical topics.

Table 5 Example sentences that the models paid the most attention to for each disease category. These are the sentences with the highest attention weight for the top three most confidently predicted documents as determined by the prediction probability score. These results are from the HEA BioBERT model.

Lastly, the time and space requirements of the architectures differed substantially. For the neural models, running the full hyperparameter search and training routine took between 25 and 30 hours on a GPU node with 256 GB of RAM, parallelized across 5 Nvidia GTX 1080 GPUs. In comparison, the baseline model trained in 15 minutes on a machine with 4 CPUs and 16 GB of RAM.


In this research, we developed an attention-based neural network model with the aim of automatically determining the quality of online health information.

The experiments suggest that a neural network model using language embeddings pretrained on large text corpora (generic or medical) performs better than a conventional baseline model (Random Forest). Importantly, this superior performance was achieved without the need to hand-craft input features, as was the case with the baseline model. However, it comes at the cost of much higher computing requirements for the neural network models.

Our results reiterate the success of using pretrained language models [15, 17] and transfer learning [18] in achieving competitive results even on small datasets (as in our case). The BioBERT embeddings show a slight advantage over the BERT embeddings (Table 2), and we believe this could be due to the medical topics and the language used to describe treatments in each topic.

Our results suggest that the neural attention mechanism not only provided a performance boost over a mean pooled neural architecture, but also enabled greater model explainability. The HEA models performed 7% higher in F1-macro compared to the HE models (Table 2). In addition, Table 5 shows that the HEA BioBERT model provided reasonable context sentences (i.e. sentences supporting a prediction). In other words, the models identify textual snippets in the articles that act as surrogates, or proxies, for the criteria. For question 4 (References), the model identified sentences containing citations, and in question 5 (Date), the model identified text referring to dates when the article was “reviewed”, “revised”, or “updated”. Questions 9 (How Treatment Works) and 10 (Treatment Benefits) are related questions, and we see that the textual snippets identified by the model for these questions overlap, as expected. However, question 10 achieved poorer accuracy scores, which is probably due to class imbalance in the training data for that question (only 33% of articles were in the negative class). For question 11 (Treatment Risks), the model often identified section headings containing the phrase “side effects”.
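The difference between attention pooling and mean pooling, and how attention weights can be read off as supporting sentences, can be illustrated with a small sketch. It assumes dot-product scoring of each sentence vector against a single query vector followed by a softmax; this scoring function, the two-dimensional toy vectors, and all function names are assumptions for illustration and may differ from the HEA implementation.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(sentence_vecs, query):
    """Weight each sentence by softmax(query . sentence) and sum the vectors."""
    weights = softmax([dot(query, s) for s in sentence_vecs])
    dim = len(sentence_vecs[0])
    pooled = [sum(w * s[d] for w, s in zip(weights, sentence_vecs))
              for d in range(dim)]
    return pooled, weights


def mean_pool(sentence_vecs):
    """Give every sentence the same weight."""
    n = len(sentence_vecs)
    return [sum(s[d] for s in sentence_vecs) / n
            for d in range(len(sentence_vecs[0]))]


def top_k_sentences(sentences, weights, k=3):
    """Return the k sentences with the highest attention weight."""
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]


# Toy document; vectors and query are made up, not model outputs.
sentences = ["Intro text.", "Last reviewed in 2019.", "Contact us.", "Page updated yearly."]
vecs = [[0.1, 0.0], [2.0, 0.1], [0.0, 0.2], [1.5, 0.0]]
query = [1.0, 0.0]

pooled, weights = attention_pool(vecs, query)
print(top_k_sentences(sentences, weights, k=2))
print(pooled, mean_pool(vecs))
```

With mean pooling, the two date-related sentences are diluted by the unrelated ones; with attention pooling, they dominate the document representation, and the same weights identify them as the supporting evidence.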

We compared the models’ quality assessment performance to humans manually performing the same task. Our raters achieved an average of 94% agreement across criteria. Similarly, the HON organization reported an average percent agreement of 85% on their criteria, and this drops to 81% when only considering criteria that are shared with Brief DISCERN. While the HEA BioBERT model performed below the manual raters in this study (81% vs. 94%), it showed competitive results when compared to HON raters (83% vs. 81% average accuracy). Restricting the HEA BioBERT model’s prediction coverage to 80% further improves prediction performance, achieving 86% average accuracy across all criteria. In this case, the model refrains from making a prediction when its prediction probability score, or confidence, is below the 20th percentile. This model or a similarly trained model could be effectively used for pre-screening health web pages and for assisting manual raters in the quality assessment task. As suggested by the HON organization, assisting manual rating with automated systems could reduce manual effort [14].

Future work

We are seeking to further improve our models’ performance to more closely approach human performance in assessing online health information quality. One straightforward approach is to train on a larger data set using the same model architecture. To achieve this aim, we could look beyond manually labeling more health articles. For example, we could construct a larger corpus by combining our current dataset with other existing bodies of online health information that have independently been assessed as being of high quality. In particular, articles approved by HON could be used as positive examples in an augmented training set. An additional avenue is to use semi-supervised learning and unsupervised data augmentation approaches [35, 36], where unlabeled data is incorporated to improve classification performance without additional annotation burden.

In future experiments, we plan to further develop our use of language embeddings. For example, in this research we simply used the last layer embeddings from the BERT and BioBERT models. However, recent experiments suggest that using different layers (the BERT network contains 12 layers) or further training the embedding networks can yield performance improvements.
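As a sketch of what using different layers could look like, the function below averages token vectors over a chosen set of layers instead of taking only the last one. It operates on plain nested lists rather than on actual BERT outputs; the data layout (`hidden_states[layer][token]`), the function name `combine_layers`, and the choice of averaging the last four layers are all assumptions made for illustration.

```python
def combine_layers(hidden_states, layers):
    """Average token vectors over the selected layers.

    hidden_states[l][t] is the vector of token t at layer index l.
    layers is a list of layer indices (negative indices count from the end).
    """
    n_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    selected = [hidden_states[l] for l in layers]
    return [
        [sum(layer[t][d] for layer in selected) / len(selected) for d in range(dim)]
        for t in range(n_tokens)
    ]


# Toy stand-in for a 12-layer encoder with 2 tokens and 2 dimensions.
hidden_states = [[[float(l), float(l + t)] for t in range(2)] for l in range(1, 13)]

last_layer_only = combine_layers(hidden_states, [-1])
last_four_mean = combine_layers(hidden_states, [-4, -3, -2, -1])
print(last_layer_only)
print(last_four_mean)
```

Which layers work best is an empirical question for each task, so a setup like this would typically treat the layer selection as a hyperparameter.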

Finally, the DISCERN instrument is designed to be applied to articles describing treatment options. Thus, our model’s applicability is limited to these types of articles. Similarly, our model does not extend to the medium of social media, which online users are increasingly using to share and consume health information [11]. More research is needed to develop models for assessing the quality of other types and mediums of health information.


Our study demonstrates that neural models are able to perform online health information quality assessment in accordance with existing quality criteria (Brief DISCERN) with a performance above 80% accuracy. The neural approach achieves better performance than a conventional approach using Random Forest. In addition, we observe that existing biomedical language models improve performance on this task. Finally, we show that attention-based neural approaches are able to retrieve relevant supporting sentences from the text, which makes model decisions more explainable to users.

Availability of data and materials

The preprocessing and the models’ implementation (training and testing) workflow is made publicly available at





HE: Hierarchical Encoder Network

HEA: Hierarchical Encoder Network with Attention

HON: Health on the Net Organization

RF: Random Forest

TF-IDF: Term frequency–inverse document frequency


  1. Hesse BW, Nelson DE, Kreps GL, Croyle RT, Arora NK, Rimer BK, Viswanath K. Trust and Sources of Health Information. Arch Intern Med. 2005; 165(22):2618.

  2. Fahy E. Quality of patient health information on the internet: reviewing a complex and. Australas Med J. 2014; 7(1):24–8.

  3. Zhang Y, Sun Y, Xie B. Quality of health information for consumers on the web: A systematic review of indicators, criteria, tools, and evaluation results. J Assoc Inf Sci Technol. 2015; 66(10).

  4. Saunders CH, Petersen CL, Durang M-A, Bagley PJ, Elywn G. Bring on the Machines: Could Machine Learning Improve the Quality of Patient Education Materials? A Systematic Search and Rapid Review. JCO Clin Cancer Inform. 2018; 2:1–16.

  5. Murray E, Lo B, Pollack L, Donelan K, Catania J, Lee K, Zapert K, Turner R. The Impact of Health Information on the Internet on Health Care and the Physician-Patient Relationship: National U.S. Survey among 1.050 U.S. Physicians. J Med Internet Res. 2003; 5(3):17.

  6. Allam A, Schulz PJ, Nakamoto K. The impact of search engine selection and sorting criteria on vaccination beliefs and attitudes: two experiments manipulating Google output. J Med Internet Res. 2014; 16(4):100.

  7. Ludolph R, Allam A, Schulz PJ. Manipulating Google’s Knowledge Graph Box to Counter Biased Information Processing During an Online Search on Vaccination: Application of a Technological Debiasing Strategy. J Med Internet Res. 2016; 18(6):137.

  8. Iverson SA, Howard KB, Penney BK. Impact of internet use on health-related behaviors and the patient-physician relationship: a survey-based study and review. J Am Osteopath Assoc. 2008; 108(12):699–711.

  9. Wald HS, Dube CE, Anthony DC. Untangling the Web-The impact of Internet use on health care and the physician-patient relationship. Patient Educ Couns. 2007; 68(3):218–24.

  10. Risk A, Dzenowagis J. Review of Internet health information quality initiatives. J Med Internet Res. 2001.

  11. Viviani M, Pasi G. Credibility in social media: opinions, news, and health information-a survey. Wiley Interdiscip Rev Data Min Knowl Disc. 2017; 7(5):1209.

  12. Charnock D, Shepperd S, Needham G, Gann R. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health. 1999; 53(2):105–11.

  13. Boyer C, Dolamic L. Feasibility of automated detection of HONcode conformity for health-related websites. Int J Adv Comput Sci Appl. 2014; 5(3).

  14. Boyer C, Dolamic L. Automated Detection of HONcode Website Conformity Compared to Manual Detection: An Evaluation. J Med Internet Res. 2015; 17(6):135.

  15. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention Is All You Need. 2017. Accessed 14 Oct 2019.

  16. Luong M-T, Pham H, Manning CD. Effective Approaches to Attention-based Neural Machine Translation. 2015. Accessed 15 Oct 2019.

  17. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J. Transformers: State-of-the-art Natural Language Processing. 2019.

  18. Ruder S, Peters ME, Swayamdipta S, Wolf T. Transfer Learning in Natural Language Processing. In: Proceedings of the 2019 Conference of the North. Stroudsburg: Association for Computational Linguistics: 2019. p. 15–18.

  19. Rees CE, Ford JE, Sheard CE. Evaluating the reliability of DISCERN: A tool for assessing the quality of written patient information on treatment choices. Patient Educ Couns. 2002.

  20. Khazaal Y, Chatton A, Cochand S, Coquard O, Fernandez S, Khan R, Billieux J, Zullino D. Brief DISCERN, six questions for the evaluation of evidence-based content of health-related websites. Patient Educ Couns. 2009.

  21. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.

  22. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2019.

  23. Allam A, Schulz PJ, Krauthammer M. Toward automated assessment of health Web page quality using the DISCERN instrument. J Am Med Inform Assoc. 2017; 24(3):481–87.

  24. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. PyTorch: An Imperative Style, High-Performance Deep Learning Library In: Wallach H, Larochelle H, Beygelzimer A, Alché-Buc F, Fox E, Garnett R, editors. Advances in Neural Information Processing Systems 32. Curran Associates, Inc.: 2019. p. 8024–8035.

  25. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997; 9(8):1735–80.

  26. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw. 1994; 5(2):157–66.

  27. Graves A. Supervised Sequence Labelling with Recurrent Neural Networks. Berlin, Heidelberg: Springer; 2012.

  28. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha: Association for Computational Linguistics: 2014. p. 1724–1734. Accessed 01 Nov 2019.

  29. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. 2014. Accessed 01 Nov 2018.

  30. Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. 2014. Accessed 18 Dec 2019.

  31. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J Mach Learn Res. 2014; 15:1929–58.

  32. Bergstra J, Bengio Y. Random Search for Hyper-Parameter Optimization. J Mach Learn Res. 2012.

  33. Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc. 2017; 177.

  34. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12:2825–30.

  35. Berthelot D, Carlini N, Goodfellow I, Papernot N, Oliver A, Raffel C. MixMatch: A Holistic Approach to Semi-Supervised Learning. 2019. Accessed 18 Dec 2019.

  36. Xie Q, Dai Z, Hovy E, Luong M-T, Le QV. Unsupervised Data Augmentation for Consistency Training. 2019.


Not Applicable


The authors would like to acknowledge the Swiss National Science Foundation for their previous funding of A-Discern project (grant number P2TIP1-161635) awarded to AA.

Author information

Authors and Affiliations



LK and AA worked on the development of processing and analysis workflow, algorithms and models implementation. LK, AA and MK analyzed and interpreted the data. LK and AA drafted the manuscript. AA and MK supervised and edited the manuscript. All authors approved the final article.

Corresponding author

Correspondence to Ahmed Allam.

Ethics declarations

Ethics approval and consent to participate

Not Applicable

Consent for publication

Not Applicable …

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kinkead, L., Allam, A. & Krauthammer, M. AutoDiscern: rating the quality of online health information with hierarchical encoder attention-based neural networks. BMC Med Inform Decis Mak 20, 104 (2020).
