AutoDiscern: rating the quality of online health information with hierarchical encoder attention-based neural networks

Background: Patients increasingly turn to search engines and online content before, or in place of, talking with a health professional. Low-quality health information, which is common on the internet, presents risks to the patient in the form of misinformation and a possibly poorer relationship with their physician. To address this, the DISCERN criteria (developed at the University of Oxford) are used to evaluate the quality of online health information. However, patients are unlikely to take the time to apply these criteria to the health websites they visit.
Methods: We built an automated implementation of the DISCERN instrument (Brief version) using machine learning models. We compared the performance of a traditional model (Random Forest) with that of a hierarchical encoder attention-based neural network (HEA) model using two language embeddings, BERT and BioBERT.
Results: The HEA BERT and BioBERT models achieved average F1-macro scores across all criteria of 0.75 and 0.74, respectively, outperforming the Random Forest model (average F1-macro = 0.69). Overall, the neural network-based models achieved 81% and 86% average accuracy at 100% and 80% coverage, respectively, compared to 94% manual rating accuracy. The attention mechanism implemented in the HEA architectures not only provided "model explainability" by identifying reasonable supporting sentences for the documents fulfilling the Brief DISCERN criteria, but also boosted F1 performance by 0.05 compared to the same architecture without an attention mechanism.
Conclusions: Our research suggests that it is feasible to automate online health information quality assessment, which is an important step towards empowering patients to become informed partners in the healthcare process.


Introduction
Background
Patients often turn to search engines and online content before, or in place of, talking with a health professional [1]. However, online health information is not regulated, and prior studies have found wide variations in information quality [2]. Poor risk communication, biased writing, and lack of transparency about the source of the information plague online health texts [3,4]. This presents a real risk to patients, in the form of misinformation [5,6,7] and of negative effects on their interactions with health care providers [8,9].
In response to this problem, many organizations, such as the Health on the Net Organization, the Journal of the American Medical Association, and the National Health Service of the UK, have established guidelines for assessing the quality of online health information [10]. These guidelines describe a set of criteria an article must meet to be considered of high quality. The implementation strategies of these guidelines so far fall into two categories: Distributed Guidelines and Centralized Approvers. However, both of these strategies have scalability issues that limit their reach and prevent them from broadly affecting patient information consumption [10].

Distributed Guidelines
One approach to helping patients find high quality health information is to develop a set of criteria and publish it as a public tool that citizens can use. An example of this approach is the DISCERN instrument [11]. The DISCERN instrument's criteria are specifically designed to be understood and applied by any lay person; no medical knowledge is required. This implementation approach puts a significant burden on the patient. For this approach to be successful, the patient has to be aware of the guideline, learn how to evaluate the criteria, and take considerable time to apply the guidelines to every website the patient encounters.

Centralized Approvers
The second implementation approach in use today is Centralized Approvers. In this approach, an organization manually assesses web pages for health information quality. An example of this approach is the Health on the Net Foundation, which developed the HONcode guidelines. It assesses websites for quality, and allows those that pass its criteria to display a HONcode badge on their webpage [12]. A variant on this approach is to register all manually approved content in a centralized repository. Patients can search the repository with the confidence that all listed sites have been vetted for quality.
The Centralized Approver approach is not scalable in the face of a massive and rapidly growing internet. Quality assessment is a costly manual process. Not only do new pages need to be evaluated, but previously evaluated pages need to be re-evaluated on a regular basis in case of content changes [10].

Automated Assessment
An automated quality assessment process is key to providing the public with scalable tools for assessing online health information quality. Initial attempts to automate the assessment of health information used simplistic approaches, such as readability scores, and did not capture more complex issues with health information, such as tone and bias [4]. A machine learning model developed by the HON organization showed promising but limited initial results [13]. With recent developments in machine learning and natural language processing methods, however, there is a renewed opportunity for tackling this problem. Neural language models have been successfully applied in many domains, including translation and question answering [14,15,16,17], capturing details and nuances in language that made information quality assessment an expensive manual process for so long.

Research Objectives
In this research, we study and develop machine learning models to automate the application of the DISCERN instrument. The DISCERN instrument was developed by Charnock et al. [11] at Oxford University and funded by the National Health Service (UK). The instrument consists of 15 questions to help a layperson evaluate the quality of online health information regarding treatment options. The validity of the DISCERN instrument has been evaluated in multiple studies, and the instrument is commonly used among researchers [18]. The DISCERN instrument suffers from the same sustainability issues as all distributed guidelines do: patients are unlikely to take the time to apply these criteria to each website they find. In this study, we built and evaluated machine learning models for the automated annotation of the Brief DISCERN criteria [19]. We compared the use of traditional machine learning (Random Forest) with feature engineering vs. hierarchical encoder attention-based neural network (HEA) models. Additionally, we experimented with the use of two pre-trained neural language models, BERT [20] and BioBERT [21], as embeddings in the HEA model.

Data Collection
Using Google Trends, we identified breast cancer, arthritis, and depression as medical topics with the highest search volume since 2004. Using the Google and Yahoo search engines, we identified a total of 269 Web pages (HTML articles) with a focus on treatment choices and options across the 3 topics. Two raters (master's students) were trained for 2 months on using the DISCERN instrument and scoring platform. Both raters scored all articles on DISCERN's 5-point scale. Interrater agreement for the DISCERN criteria was adequate to high, ranging between 0.61 and 0.91 as measured by Krippendorff's alpha. The process of building the training corpus is described in more detail in [22].

Modeling Approach
We converted the scores for each question in the DISCERN instrument, which range from 1 to 5, into a binary classification, where a score of 3-5 is passing and a score of 1-2 is failing the criterion. The texts from the HTML articles were extracted and cleaned using the Beautiful Soup library [23]. In this work we focus on the Brief DISCERN criteria [19], a 6-question subset of the DISCERN criteria that has been shown to capture the quality of health information as reliably as the complete DISCERN instrument. Separate models were developed and tested for each of the 5 remaining Brief DISCERN questions (one question, Q13, was excluded due to low interrater reliability).
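The score conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `binarize_discern_score` and the handling of non-integer (e.g., averaged) scores by a simple threshold at 3 are assumptions.

```python
def binarize_discern_score(score: float, passing_threshold: float = 3.0) -> int:
    """Map a DISCERN score on the 1-5 scale to 1 (pass) or 0 (fail).

    Scores of 3-5 pass and scores of 1-2 fail, as stated in the text.
    Treating fractional scores with the same threshold is an assumption.
    """
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"DISCERN scores range from 1 to 5, got {score}")
    return 1 if score >= passing_threshold else 0


# Hypothetical scores for one question across five articles.
example_scores = [1, 2, 3, 4, 5]
example_labels = [binarize_discern_score(s) for s in example_scores]
```

With the hypothetical scores above, the first two articles fail the criterion and the remaining three pass.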

Neural Network Model
We designed and implemented a Hierarchical Encoder Attention-based (HEA) model in PyTorch [24], taking into consideration the structure of our problem and the limits of our training data. HEA's architecture is composed of two hierarchical encoders and a classifier. The first encoder is a sentence encoder, SentEncoder, which is based on a bidirectional recurrent neural network (RNN) that encodes each sentence (i.e., sequence of tokens) into a dense vector representation. The second encoder is a document encoder, DocEncoder, which is also based on a bidirectional RNN that encodes the sequence of sentence representations (i.e., vectors computed from the first encoder) and uses an attention mechanism [14] along with a global context vector to compute a dense vector representation for the whole document. A decoder/classifier maps the document's learned vector representation to the labels using an affine map followed by a softmax layer that computes a probability distribution on the labels for the processed document.

Sentence Encoder (SentEncoder)
Formally, given an input sentence $S = [w_1, w_2, \cdots, w_{T_S}]$, where $w_t$ represents the token representation at position $t$ (i.e., a 1-of-$K$ encoding, where $K$ is the size of the vocabulary $V$, the set of all tokens in the training corpus), a vanilla RNN computes a hidden vector at each position (i.e., the state vector $h_t$ at position $t$), representing a history or context summary of the sequence, using the input and the hidden state vector from the previous step. Equation 1 shows the computation of the hidden vector $h_t$ using the input $w_t$ and the previous hidden vector $h_{t-1}$, where $\phi$ is a non-linear transformation such as $\mathrm{ReLU}(z) = \max(0, z)$ or $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$.
$$h_t = \phi(W_{hw} w_t + W_{hh} h_{t-1} + b_h) \quad (1)$$
Here $W_{hw} \in \mathbb{R}^{D_h \times D_w}$, $W_{hh} \in \mathbb{R}^{D_h \times D_h}$, and $b_h \in \mathbb{R}^{D_h}$ represent the RNN's weights to be optimized, and $D_h$, $D_w$ are the dimensions of the $h_t$ and $w_t$ vectors, respectively. Note that the weights are shared across the network, and $D_w$ could be equal to $K$, the size of the vocabulary (i.e., in the case of 1-of-$K$ encoding), or to the size of a dense embedding vector generated using a language model such as BERT [20]. The use of an RNN allows the model to learn long-range dependencies, where the network is unfolded as many times as the length of the sequence (a sentence in our case) it is modeling. Although RNNs are capable of handling and representing variable-length sequences, in practice the learning process faces challenges due to the vanishing/exploding gradient problem [25,26,27]. In this work, we used the gated recurrent unit (GRU) [28,29] to overcome these challenges by changing how the hidden state vector $h_t$ is computed, as described below.
The GRU model computes a reset gate $r_t$ that is used to modulate the effect of the previous hidden state vector $h_{t-1}$ when computing the new memory vector $\tilde{h}_t$. The update gate $z_t$ determines the importance/contribution of the newly generated memory vector $\tilde{h}_t$ compared to the previous hidden state vector $h_{t-1}$ when computing the current hidden vector $h_t$. The weight matrices and bias vectors of the two gates and of the memory computation are the parameters to be optimized, where $D_h$ and $D_w$ are the dimensions of the $h_t$ and $w_t$ vectors, respectively. The operator $\sigma$ represents the sigmoid function, $\phi$ the tanh or ReLU function, and $\odot$ the element-wise product (i.e., Hadamard product).
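The gating described above can be written out explicitly. The following is one common GRU formulation, given as a sketch rather than as the exact equations used in this work: the weight names ($W_{rw}$, $W_{rh}$, $W_{zw}$, $W_{zh}$, $W_{hw}$, $W_{hh}$) and biases ($b_r$, $b_z$, $b_h$) are assumed for illustration, and other formulations swap the roles of $z_t$ and $1 - z_t$ or apply the reset gate before the matrix product.

```latex
\begin{aligned}
r_t &= \sigma\left(W_{rw} w_t + W_{rh} h_{t-1} + b_r\right) \\
z_t &= \sigma\left(W_{zw} w_t + W_{zh} h_{t-1} + b_z\right) \\
\tilde{h}_t &= \phi\left(W_{hw} w_t + r_t \odot \left(W_{hh} h_{t-1}\right) + b_h\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

In this form, a reset gate close to zero suppresses the contribution of $h_{t-1}$ to the new memory vector, and an update gate close to one lets the new memory vector replace the previous hidden state.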
The SentEncoder uses a bidirectional GRU that computes two hidden state vectors $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ for each token $w_t$ in sentence $S$, corresponding to the left-to-right and right-to-left GRU encoding of the sentence. We experimented with two options for computing the sentence representation vector $\mathbf{S}$: (1) concatenation $[\overrightarrow{h}_{T_S}; \overleftarrow{h}_{1}]$ and (2) summation $[\overrightarrow{h}_{T_S} + \overleftarrow{h}_{1}]$ of the computed left and right hidden state vectors of the last ($w_{T_S}$) and first ($w_1$) tokens, respectively, in sentence $S$.

Document Encoder (DocEncoder) with Attention
Originally, each document $Doc$ in our corpus is composed of a sequence of sentences (i.e., $Doc = [S_1, S_2, \cdots, S_{T_{Doc}}]$, where $S_i$ represents the $i$-th sentence and $T_{Doc}$ is the number of sentences in $Doc$). Each sentence $S_i$ is composed of a sequence of tokens (as described in the SentEncoder section above) that are processed using the SentEncoder model to compute the sentence vector representation $\mathbf{S}_i$. As a result, the processed document $Doc_{proc}$ is a sequence of sentence vector representations (i.e., $Doc_{proc} = [\mathbf{S}_1, \mathbf{S}_2, \cdots, \mathbf{S}_{T_{Doc}}]$). This sequence is encoded by the DocEncoder, and the result is fed to an attention layer that computes the weights associated with each vector; these weights are in turn used to compute a weighted vector sum that yields the document vector representation $z$.

Attention Layer
For many of the DISCERN criteria, passing or failing the criterion depends on only small fragments throughout the article. For example, for the question "Is it clear when the information used or reported in the publication was produced?", there is likely only one line among a 200+ sentence article (e.g., "Last reviewed on...") that determines whether the article passes the criterion. Our attention-based modeling architecture reflects this problem structure: the model can "pay attention" to single sentences within a larger article. We adapt the idea of the global attention model [15], in which a global context/query vector $q$ (i.e., trainable parameters in the model) is used along with the output $O = [\overleftrightarrow{l}_1, \overleftrightarrow{l}_2, \cdots, \overleftrightarrow{l}_{T_{Doc}}]$ from DocEncoder to generate the document representation vector $z$. The objective is to compute attention weights for every $\overleftrightarrow{l}_i$ in $O$ such that $z = \sum_{i=1}^{T_{Doc}} \alpha_i \overleftrightarrow{l}_i$, where $\alpha_i$ is the normalized weight computed using Eq. 2.
$$\alpha_i = \frac{\exp\left(score(q, \overleftrightarrow{l}_i)\right)}{\sum_{j=1}^{T_{Doc}} \exp\left(score(q, \overleftrightarrow{l}_j)\right)} \quad (2)$$
For the attention scoring function, we experimented with two options inspired by the additive approach [15,30] and the scaled dot-product work in [14] (see Equations 3 and 4, respectively). In Eq. 3, the score is computed using three operations: (1) a weight matrix $W_{ql} \in \mathbb{R}^{D_q \times D_l}$ maps $\overleftrightarrow{l}_i$ to a fixed-length vector of dimension equal to that of the query vector $q$ (i.e., $D_q$), (2) a non-linear transformation tanh is applied, and (3) a dot-product with $q$ is performed. In contrast, in Eq. 4, the score is computed by performing a dot-product between the query vector $q$ and $\overleftrightarrow{l}_i$, scaled by $D_l$, which is the dimension of both vectors, in an approach similar to [14]. Our choice of attention score functions from the vast array of options in the literature [15,30] was based on limiting the number of parameters in our model given the size of our dataset.
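The attention pooling described above can be illustrated with a small, dependency-free sketch. It is not the authors' PyTorch implementation: the function names are hypothetical, the weights are normalized with a softmax, and the dot-product score is divided by the square root of the vector dimension as in [14], which is an assumption since the text states only that the score is scaled by $D_l$.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into weights that sum to 1 (numerically stable)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    sentence_vectors: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (document vector z, attention weights) for one document."""
    dim = len(query)
    # Scaled dot-product score between the query and each sentence vector
    # (sqrt scaling is an assumption following [14]).
    scores = [
        sum(q * x for q, x in zip(query, vec)) / math.sqrt(dim)
        for vec in sentence_vectors
    ]
    weights = softmax(scores)
    # Weighted sum of the sentence vectors gives the document vector.
    doc_vector = [
        sum(w * vec[k] for w, vec in zip(weights, sentence_vectors))
        for k in range(dim)
    ]
    return doc_vector, weights


def top_sentences(weights: Sequence[float], k: int = 3) -> List[int]:
    """Indices of the k sentences with the largest attention weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return ranked[:k]


# Toy example: three 2-dimensional "sentence" vectors and a query vector.
toy_sentences = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_query = [2.0, 0.0]
toy_doc_vector, toy_weights = attention_pool(toy_sentences, toy_query)
```

In the toy example the first sentence is most aligned with the query and therefore receives the largest weight; ranking sentences by weight in this way is what allows supporting sentences to be shown to a user.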

Decoder/output classifier
The last layer in the HEA model takes as input the computed document representation vector $z$ from the DocEncoder layer and performs an affine transformation followed by a softmax operation to compute a probability distribution on the labels for the document under consideration. That is, the outcome $\hat{y}$ for a given Brief DISCERN criterion is computed using Eq. 5,
$$\hat{y} = \sigma(W z + b) \quad (5)$$
where $W \in \mathbb{R}^{|V_{label}| \times D_z}$ and $b \in \mathbb{R}^{|V_{label}|}$ represent the classifier's weights to be optimized, $V_{label} = \{0, 1\}$ is the set of admissible labels for a criterion (a binary variable in our case), $|V_{label}|$ is the number of labels, $D_z$ is the dimension of $z$ (the document representation vector), and $\sigma$ is the softmax function. As a result, the outcome $\hat{y}$ represents a probability distribution over the set of possible labels $V_{label}$.
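The classifier step (an affine map followed by softmax) can be sketched as follows. This is a plain-Python illustration with hypothetical names and toy weights, not the trained model.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits into a probability distribution."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    z: Sequence[float],
    weight_rows: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Affine transformation of z (one weight row per label), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, z)) + b
        for row, b in zip(weight_rows, bias)
    ]
    return softmax(logits)


# Toy document vector and toy weights for the two labels (fail=0, pass=1).
toy_probs = classify([1.0, 0.0], [[-1.0, 0.0], [1.0, 0.0]], [0.0, 0.0])
```

The larger of the two probabilities can later serve as the model's confidence in its prediction.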

Objective function
We used cross-entropy loss as our objective function for each Brief DISCERN criterion model. The loss function for the $j$-th document is defined by Eq. 6, where $y_c \in \{0, 1\}$ is equivalent to $\mathbb{1}[y = c]$ (i.e., a boolean indicator equal to 1 when $c$ is the reference/ground-truth class/label), and $\hat{y}_c$ is the predicted probability of class $c$. The objective function for the whole training set $D_{train}$ is defined by the average loss across all the documents in $D_{train}$ plus a weight regularization term (i.e., $l_2$-norm regularization) applied to the model parameters represented by $\theta$ (see Eq. 7).
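As a rough illustration of this objective, the sketch below computes the per-document cross-entropy and the regularized average over a batch. The function names, the clipping constant, and the $\frac{\lambda}{2}\lVert\theta\rVert_2^2$ form of the penalty are assumptions made for the example; the text does not state the exact scaling of the regularization term.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], label: int, eps: float = 1e-12) -> float:
    """Negative log-probability assigned to the ground-truth label."""
    return -math.log(max(probs[label], eps))  # clip to avoid log(0)


def regularized_objective(
    batch_probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    params: Sequence[float],
    l2_lambda: float,
) -> float:
    """Average cross-entropy plus an l2 penalty on the (flattened) parameters."""
    data_loss = sum(
        cross_entropy(p, y) for p, y in zip(batch_probs, labels)
    ) / len(labels)
    penalty = 0.5 * l2_lambda * sum(w * w for w in params)
    return data_loss + penalty
```

For instance, a document with predicted probabilities `[0.5, 0.5]` contributes a loss of log 2 (about 0.693) regardless of its label.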
In addition to the $l_2$-norm regularization, we also experimented with dropout [31] by deactivating neurons in the network layers with probability $p_{dropout}$. Moreover, we used pre-trained language models such as BERT [20,16] and BioBERT [21] to extract token embeddings that are used as input to the HEA model (i.e., the representation of token $w_t$).

Hyperparameter optimization for neural models
We developed a multiprocessing module that used a uniform random search strategy [34], which randomly chose a set of hyperparameter configurations (e.g., layer depth, embedding size, attention approach) from the set of all possible configurations. Then the best configuration for each model (i.e., the one achieving the best performance on the validation set) was used for the final training and testing.
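A uniform random search of this kind can be sketched as below. The search space values, the function names, and the single-process design are hypothetical; the actual module evaluated configurations in parallel, and the real search space is not listed in the text.

```python
import random
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical search space; the option values are placeholders.
SEARCH_SPACE: Dict[str, Sequence[Any]] = {
    "num_layers": [1, 2],
    "hidden_size": [64, 128, 256],
    "attention": ["additive", "scaled_dot_product"],
    "dropout": [0.1, 0.3, 0.5],
}


def sample_configurations(
    search_space: Dict[str, Sequence[Any]], n_samples: int, seed: int = 0
) -> List[Dict[str, Any]]:
    """Draw configurations uniformly at random (with replacement)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(list(options)) for name, options in search_space.items()}
        for _ in range(n_samples)
    ]


def select_best(
    configs: Sequence[Dict[str, Any]],
    validation_score: Callable[[Dict[str, Any]], float],
) -> Dict[str, Any]:
    """Return the configuration with the highest validation score."""
    return max(configs, key=validation_score)
```

Here `validation_score` stands in for training a model with the given configuration and evaluating it on the validation set.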

Baseline Machine Learning Models
For the traditional modeling approach, the content of each article was converted into a bag-of-words representation and weighted using the term frequency-inverse document frequency (TF-IDF) weighting scheme. We also computed a set of features based on the existence of HTML links, bibliography keywords, references to medical terms (extracted using MetaMap Lite [32]), and named entities within the text, as well as a measure of text polarity. Recursive feature elimination with cross-validation was used to identify the optimal subset of features. For its ease of interpretability and good performance on feature sets with many categorical variables, we implemented a Random Forest model with scikit-learn [33] to predict, for every criterion in Brief DISCERN, whether the criterion is fulfilled or not.
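To make the TF-IDF weighting concrete, here is a small sketch using the plain, unsmoothed definition (term frequency times the log of the inverse document frequency). The exact variant used for the baseline is not stated, and libraries such as scikit-learn apply smoothed variants by default, so the numbers produced here are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Compute unsmoothed TF-IDF weights for tokenized documents."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq: Counter = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return vectors


# Two toy "articles", already tokenized.
toy_docs = [["treatment", "risks"], ["treatment", "benefits"]]
toy_tfidf = tfidf_vectors(toy_docs)
```

A term that occurs in every document, such as "treatment" in the toy example, receives a weight of zero, while terms specific to one document receive positive weights.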

Experimental setup
We followed a stratified 5-fold cross-validation scheme where each fold was defined as a distinct 80%-20% train-test split. Within each fold, parameter selection was performed with a validation set consisting of 10% of the training set. Due to the imbalance in outcome classes, training examples were weighted inversely proportional to class/outcome frequencies in the training data. Articles from the three health topics were randomly distributed between the 5 partitions.
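The inverse-frequency weighting of training examples can be sketched as follows. The normalization used here (number of samples divided by number of classes times the class count) is one common convention and is an assumption; the text states only that weights are inversely proportional to class frequency.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count) for label, count in counts.items()
    }


# Toy training labels: three passing articles and one failing article.
toy_class_weights = inverse_frequency_weights([1, 1, 1, 0])
```

In the toy example the minority class (0) receives a weight of 2.0 and the majority class (1) a weight of about 0.67, so errors on the rarer class count more during training.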
Model performance was evaluated using F1-macro, F1-micro, and classification accuracy. In this quality assessment problem, we value precision equally with recall, so F1 is a good measure that captures both. The evaluation of the trained models was based on their average performance on the test sets of the five folds.
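The difference between the two F1 averages can be seen in a short sketch. The function name is hypothetical; the formulas are the standard ones, with F1-macro averaging the per-class F1 scores and F1-micro pooling the counts over classes first.

```python
from typing import Hashable, Sequence, Tuple


def f1_macro_micro(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (F1-macro, F1-micro) for single-label classification."""
    labels = set(y_true) | set(y_pred)
    per_class = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_class) / len(per_class)
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0
    return macro, micro


toy_macro, toy_micro = f1_macro_micro([1, 1, 1, 0], [1, 1, 0, 0])
```

For single-label classification, F1-micro coincides with accuracy, whereas F1-macro gives each class equal weight and is therefore more sensitive to performance on the minority class.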
Furthermore, we performed a coverage assessment of the models. In addition to classifying articles as low or high quality, we also have the option of allowing the model to report that it is unsure about a criterion. In this information quality assessment task, our goal is to provide as much information to the user as possible. In instances when the model has low confidence in its prediction, it is more valuable to the user for the model to convey that uncertainty than to make a less accurate prediction.
To calculate the models' accuracy at different levels of coverage, we first ranked the model's predictions according to their associated probabilities (i.e., the probability from the softmax layer, which we refer to as the model's "confidence"). Then, we computed the confidence thresholds for various coverage values, and computed accuracy metrics on only the predictions exceeding that threshold. For example, for a coverage of 80%, we computed the 20th percentile confidence score, and computed accuracy metrics on only the articles with probability scores that exceed the 20th percentile. Predictions that are below this threshold are considered "unsure", and are where the model would abstain from making a prediction.
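The coverage procedure can be sketched as follows. This is a simplified illustration with hypothetical names: it keeps the top-ranked fraction of predictions rather than computing an interpolated percentile, so the threshold it reports is the lowest confidence among the retained predictions.

```python
from typing import Sequence, Tuple


def accuracy_at_coverage(
    confidences: Sequence[float], correct: Sequence[bool], coverage: float
) -> Tuple[float, float]:
    """Return (accuracy, confidence threshold) at the given coverage level."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, round(coverage * len(confidences)))
    # Rank predictions from most to least confident and keep the top share.
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[:n_keep]
    threshold = kept[-1][0]
    accuracy = sum(1 for _, is_correct in kept if is_correct) / len(kept)
    return accuracy, threshold


# Toy predictions: confidence of each prediction and whether it was correct.
toy_confidences = [0.9, 0.8, 0.7, 0.6, 0.55]
toy_correct = [True, True, True, False, False]
```

On the toy data, accuracy is 0.6 at full coverage and 0.75 at 80% coverage, mirroring the trade-off between coverage and accuracy described above.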

Public Online Validator
The preprocessing and the models' implementation (training and testing) workflow is made publicly available at https://github.com/uzh-dqbm-cmi/auto-discern. We also make our HEA model publicly available in the form of an online validator. A user can enter the URL of a website to score, and the validator will return pass/fail for each of the Brief DISCERN criteria, as well as relevant sentences for that classification decision as determined by the attention weights.

Results
We compared the performance of the three trained models (Random Forest, HEA with BERT embeddings, and HEA with BioBERT embeddings) across all five folds using F1-micro and F1-macro scores (Tables 1 and 2). Overall, the HEA architecture outperformed the Random Forest model, scoring an average F1-macro score of 0.75 with BERT embeddings and 0.74 with BioBERT embeddings. In comparison, the Random Forest model achieved an average F1-macro score of 0.69.
All models performed best on question 4 ("Is it clear what sources of information were used to compile the publication [other than the author or producer]?"), with HEA BERT, HEA BioBERT, and Random Forest scoring 0.86, 0.80, and 0.83, respectively. Similarly, all three models achieved high F1-macro scores on question 5 ("Is it clear when the information used or reported in the publication was produced?"), with HEA BioBERT coming first (0.82), HEA BERT second (0.77), and Random Forest last (0.70).
For treatment-related questions, HEA BioBERT generally performed best. On question 9 ("Does it describe how each treatment works?"), HEA BioBERT came first with an average F1-macro score of 0.72, HEA BERT second (0.68), and Random Forest last (0.66). The order was the same for question 11 ("Does it describe the risks of each treatment?"), where HEA BioBERT again came first with an average F1-macro score of 0.81, HEA BERT second (0.76), and Random Forest last (0.72). In contrast, for question 10 ("Does it describe the benefits of each treatment?"), HEA BERT achieved a higher F1-macro score than HEA BioBERT (0.66 vs. 0.54), and the Random Forest model again came last with an average score of 0.53. It is worth mentioning that Q10 has the greatest class imbalance (i.e., 77% of the articles in the data set described the benefits of the treatment).
As measured by both F1-macro and F1-micro, HEA BioBERT took first place in 3 of the 5 questions, and HEA BERT took first place in the remaining 2 questions (see Tables 1 and 2). However, when computing the average score on all questions (F1-macro and F1-micro), HEA BERT and HEA BioBERT performed comparably, and the variance across folds was lower for HEA BERT than for HEA BioBERT (see Figure 3).
We explored the relationship between the model's coverage, accuracy, and confidence (prediction probability threshold), focusing on the BioBERT model (see Figure 4). The trend is that as the model's coverage decreases, its confidence (i.e., outcome probability) and accuracy increase. At 80% coverage, the model achieves 86% average accuracy with average confidence equal to 0.79 (see Table 3).
To compare the model with manual performance on the DISCERN guidelines, the raters' scores on DISCERN's 5-point scale were converted to the same binary classification used as the target variable for the machine learning models. Then we measured the frequency with which each rater agreed with the ground truth (the average of the two raters) to compute a rater accuracy score. In this comparable form, the rater accuracy score on our data set averaged 94% (spanning 88%-97% across criteria). Additionally, Table 4 contains the HON organization's manual interrater agreement scores that were computed while developing training sets for their own machine learning models [13]. The average interrater agreement was 85% on their criteria, and this drops to 81% when only considering criteria that are shared with Brief DISCERN (Reference, Date, and Justifiability). Overall, the machine learning model achieved a competitive performance when compared to HON raters (average accuracy of 83% vs. 81% at full coverage) and an adequate one when compared to DISCERN raters (81% vs. 94% at 100% coverage). Reducing the model's coverage to 80% brings its performance closer to manual performance (86% vs. 94%).
Lastly, Table 5 shows the top-3 sentences (based on attention probability score) belonging to the most confidently predicted documents, as determined by the prediction probability score, for each question in the three medical topics.

Table 3: Comparison of performance metrics for the HEA BioBERT architecture at 80% and 100% coverage. Coverage refers to the percent of articles the model makes a prediction for (as opposed to abstaining from making a prediction when the model has a confidence below the Threshold). The Precision, Recall, and Accuracy scores reflect the accuracy of the model on the resulting 80% of predicted articles.

Discussion
In this research, we developed an attention-based neural network model with the aim to automatically determine the quality of online health information.
The experiments suggest that a neural network model using language embeddings trained on large text corpora (generic or medical) performs better than a conventional baseline model (Random Forest). Importantly, this superior performance was achieved without the need to hand-craft input features, as was the case with the baseline model.
Our results reiterate the success of using trained language models [14,16] and transfer learning [17] in achieving competitive results even on small datasets (as in our case). The BioBERT embeddings show a slight advantage in comparison to the BERT ones, and we believe this could be due to the medical topics and the language used to describe treatments in each topic.
We compared the models' quality assessment performance to that of humans manually performing the same task. Our raters achieved an average of 94% agreement across criteria. Similarly, the HON organization reported an average interrater agreement of 85% on their criteria, and this drops to 81% when only considering criteria that are shared with Brief DISCERN.
While the HEA BioBERT model performed below the manual raters used in this study (81% vs. 94%), it showed competitive results when compared to HON raters (83% vs. 81% average accuracy). By restricting the model's prediction coverage to 80%, we further improve the prediction performance, achieving 86% average accuracy across all criteria. In this case, the model refrains from making a prediction when its prediction probability score, or confidence, is below the 20th percentile. This model or a similarly trained model could be effectively used for pre-screening health web pages and for assisting manual raters in the quality assessment task. As suggested by the HON organization, assisting manual rating with automated systems could reduce manual effort [13].
Inspecting Table 5, it can be seen that the HEA BioBERT model provided reasonable context sentences (i.e., sentences supporting a prediction). In other words, the models identify textual snippets (surrogates, or proxies) in the articles. For question 4 (References), the model identified sentences containing citations, and for question 5 (Date), the model identified text referring to dates when the article was "reviewed", "revised", or "updated". Questions 9 (How Treatment Works) and 10 (Treatment Benefits) are related questions, and we see that the textual snippets identified by the model for these questions overlap, as expected. However, question 10 achieved poorer accuracy scores, which is probably due to class imbalance in the training data for that question (only 33% of articles were in the negative class). For question 11 (Treatment Risks), the model often identified section headings containing the phrase "side effects".

Future Work
We are seeking to further improve our models' performance to more closely match human performance in assessing online health information quality. One straightforward approach is to train on a larger data set using the same model architecture. To achieve this aim, we could look beyond manually labeling more health articles. For example, we could construct a larger corpus by combining our current dataset with other existing bodies of online health information that have independently been assessed as being of high quality. For instance, articles approved by HON could be used as positive examples in an augmented training set. An additional avenue is to use semi-supervised learning and unsupervised data augmentation approaches [35,36], where unlabeled data is incorporated to improve classification performance without additional annotation burden.
In future experiments, we plan to further develop our use of language embeddings. For example, in this research we simply used the last-layer embeddings from the BERT and BioBERT models. However, recent experiments suggest that using different layers (the BERT network contains 12 layers) or further training the embedding networks can yield performance improvements.
Finally, the DISCERN instrument is designed to be applied to articles describing treatment options. Thus, our model's applicability is limited to these types of articles. More research is needed to develop models for assessing the quality of other types of health information.

Conclusion
Our study demonstrates that neural models are able to perform online health information quality assessment in accordance with existing quality criteria (Brief DISCERN) with a performance above 80% accuracy. The neural approach achieves better performance than a conventional approach using Random Forest. In addition, we observe that existing biomedical language models improve performance on this task. Finally, we show that attention-based neural approaches are able to retrieve relevant supporting sentences from the text, which makes their decisions more explainable to users.

Figure 2: Model architecture for converting a document's sentence embeddings into a document prediction.
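The processed document $Doc_{proc}$ is used as input to the DocEncoder model. The DocEncoder uses a bidirectional GRU that computes two hidden state vectors $\overrightarrow{l}_i$ and $\overleftarrow{l}_i$ for each sentence representation $\mathbf{S}_i$, corresponding to the left-to-right and right-to-left GRU encoding of the sentences in $Doc_{proc}$. We experimented with two options for joining both hidden state vectors $\overrightarrow{l}_i$ and $\overleftarrow{l}_i$ into one vector: (1) concatenation $[\overrightarrow{l}_i; \overleftarrow{l}_i]$ and (2) summation $[\overrightarrow{l}_i + \overleftarrow{l}_i]$; the joined vector is denoted by $\overleftrightarrow{l}_i$ from now on. Hence, the output of the DocEncoder is a sequence of joined hidden state vectors $O = [\overleftrightarrow{l}_1, \overleftrightarrow{l}_2, \cdots, \overleftrightarrow{l}_{T_{Doc}}]$, which is fed to the attention layer.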

Figure 4: Relationship between Prediction Coverage, Confidence Threshold, and Model Accuracy.

Table 4 compares the machine learning model (HEA BioBERT) performance to human manual performance on the DISCERN and HON guidelines. To estimate the manual performance on the DISCERN guidelines, we use our own training data. We converted the manual rater scores of DISCERN's 5-point scale to the same binary classification used as the target variable for the machine learning models.

Table 1: Average F1-macro scores with standard deviation by model architecture.

Table 2: Average F1-micro scores with standard deviation by model architecture.

Table 4: Performance comparison between human manual rating and deep learning model. Manual performance is reported as interrater agreement. Automated performance is reported as Implementation Accuracy (see Table 3).