Deep learning for pollen allergy surveillance from Twitter in Australia

Background: The paper introduces a deep learning-based approach for real-time detection and insight generation about one of the most prevalent chronic conditions in Australia, pollen allergy. A popular social media platform is used for data collection as a cost-effective and unobtrusive alternative for public health monitoring that complements the traditional survey-based approaches.
Methods: The data was extracted from Twitter based on pre-defined keywords (i.e. 'hayfever' OR 'hay fever') over a period of 6 months, covering the high pollen season in Australia. The following deep learning architectures were adopted in the experiments: CNN, RNN, LSTM and GRU. Both default (GloVe) and domain-specific (HF) word embeddings were used in training the classifiers. Standard evaluation metrics (i.e. Accuracy, Precision and Recall) were calculated to validate the results. Finally, visual correlation with weather variables was performed.
Results: The neural network-based approach was able to correctly identify implicit mentions of symptoms and treatments, even those previously unseen (accuracy up to 87.9% for GRU with GloVe embeddings of 300 dimensions).
Conclusions: The system addresses the shortcomings of conventional machine learning techniques with manual feature engineering, which prove limiting when exposed to a wide range of non-standard expressions relating to medical concepts. The case study presented demonstrates an application of a 'black-box' approach to a real-world problem, along with a demonstration of its internal workings, towards more transparent, interpretable and reproducible decision-making in the health informatics domain.


Introduction
According to the Australian Institute of Health and Welfare (AIHW) [1], in 2014-15 nearly 1 in 5 Australians suffered from pollen allergy, which amounts to 4.5 million citizens, predominantly working-aged adults. What is more, the expenditure on allergic rhinitis medications doubled between 2001 and 2010, going from $107.8 million to $226.8 million per year, as reported by Australian pharmacies [1]. Overall, allergies are increasing, but the reasons for the observed growth are not entirely clear [2,3].
Previous studies of user-generated content have already demonstrated the potential of social media for public health monitoring.
Deep learning emerged as a sub-field of machine learning and has already benefited numerous Natural Language Processing (NLP) tasks [20]. The ability to learn the most salient aspects of text automatically eliminated the need for conventional classifiers dependent on manual feature engineering. The further application of word embeddings made it possible to account for syntactic and semantic regularities between words, leading to improved classification performance. Although deep learning is a state-of-the-art approach, its use in the public health mining domain is still in its infancy. Previous studies on allergy surveillance from social media, conducted in the UK and US, utilised either traditional machine learning classifiers such as Multinomial Naive Bayes [13,17], or lexicon-based approaches [14][15][16]. The application of deep learning to the identification of Hay fever-related user-generated content and to knowledge discovery about the condition in Australia is yet to be explored in the literature.

Prevalence and severity of Hay fever
Pollen allergy, commonly known as Hay fever, significantly reduces the quality of life and affects physical, psychological and social functioning. The symptoms experienced are caused by the body's immune response to inhaled pollen, resulting in chronic inflammation of the eyes and nasal passages. Nasal congestion is often associated with sleep disturbance, resulting in daytime fatigue and somnolence. Increased irritability and self-consciousness, along with a decreased level of energy and alertness, are frequently observed during the pollen season [21]. Moderate and severe symptoms of Hay fever considerably impair learning ability in children, while adults suffer from work absences and reduced productivity [21,22]. According to the World Allergy Organisation (WAO) [22], Hay fever is increasing in prevalence and severity, and will continue to be a concern.
Around the world, in both developed and developing countries, environments are undergoing profound changes [3]. Increased air pollution and global warming have a substantial impact on the respiratory health of the population. Ziska et al. [23] have already reported that the duration of the ragweed pollen season has been increasing in recent decades in North America. Any potential pattern changes, including a prolonged pollen season, increased intensity of allergens or the detection of unexpected pollens, directly affect the physical, psychological and social functioning of allergy sufferers [22]. The response to external factors further differs among individuals, which is particularly exacerbated in countries with high migration rates [3]. As of 2015, approx. 30% of Australia's Estimated Resident Population (ERP) was born overseas [24].
The ever-changing and unpredictable evolution of pollen allergies necessitates accurate and timely statistics about the state of the condition. The conventional, survey-based approaches involve a fraction of the population, and incur significant reporting delays (approx. 1 year in the case of official government reports [1]). Alternative approaches involve the number of hospital admissions and General Practitioners' (GPs) reports of Hay fever instances. According to a study conducted in New South Wales, Australia [25], 'patients believe that Allergic rhinitis is the condition that should be self-managed'. Bypassing Health Care Professionals (HCPs) and relying on over-the-counter drugs can cause statistics derived from health services to under-estimate the condition. Also, pharmacy supply data for oral antihistamines, the common Hay fever medicine, is used to indicate the yearly start and peak of the season [1,2]. Although insightful, such analyses are not conducted systematically, as they require the collection of data from drug manufacturers and pharmacy outlets across the country. Finally, pollen rates assist in estimating the starting and peaking points of allergy seasons. Still, the actual condition prevalence may vary due to different responses to particular allergens among individuals.

Allergies surveillance from social media
Given the limitations of traditional approaches to allergy surveillance, alternative sources of data are becoming increasingly important for reflecting more closely the state of the condition within the population. One domain that has grown massively in recent years, and continues to grow, is social media [6,26]. Online platforms attract and encourage users to discuss their health issues, use of drugs, side effects and alternative treatments [6]. The updates range from generic signs of dissatisfaction (e.g. 'hay fever sucks') to descriptions of specific symptoms (e.g. 'my head is killing me'). Also, it has been observed that individuals often prefer to share their health-related experiences with peers, rather than during clinical studies, or even with physicians [27]. As a result, social media has become a source of valuable data, increasingly used for real-time detection and knowledge discovery [28].
Previous studies conducted in the UK and US have already investigated the potential of Twitter for allergy surveillance. De Quincey et al. [15] observed that Twitter users self-report symptoms as well as medications, and that the volume of Hay fever-related tweets strongly correlates (r=0.97, p<0.01) with incidents of Hay fever reported by the Royal College of General Practitioners (RCGP) within the same year in the UK. Another correlation was found in the work published by Cowie et al. [17], where the volume of pollen allergy-related tweets collected in the UK over a period of 1 year resembled the pattern of pollen counts, grass pollen in particular. A study performed in the US reported similar findings: strong correlations between (1) pollen rates and tweets reporting Hay fever symptoms (r = 0.95), and (2) pollen rates and tweets reporting the use of antihistamines (r = 0.93) [16]. Lee et al. [13] further observed a relationship between weather conditions (daily maximum temperature) and the number of conversations about allergies on Twitter. Additionally, they classified posts into actual allergy incidents and general awareness promotion, and extracted the particular allergy types. Correlations between environmental factors and Hay fever-related tweets were also examined in a small-scale Australian study [29], where moderately strong dependencies were found for Temperature, Evaporation and Wind, all crucial factors in allergy development.

Deep learning in text classification
Gao et al. [30] demonstrated how a deep learning approach can improve model performance, compared to conventional methods, for multiple information extraction tasks on unstructured cancer pathology reports. A corpus of 2505 reports was manually annotated for the identification of (1) primary site (9 labels), and (2) histological grade (4 labels). The models tested were RNN, CNN, LSTM and GRU, and word embeddings were implemented for word-to-vector representation. Another study explored the effect of domain-specific word embeddings on classification performance in the extraction of Adverse Drug Reactions (ADRs) from social media [5]. The data was collected from Twitter and DailyStrength (an online support community dedicated to health issues), followed by annotation of a total of 7663 posts for the presence of (1) adverse reactions, (2) beneficial effects, (3) condition suffered, and (4) other symptoms. The use of word embeddings enabled the correct identification of even non-medical expressions in highly informal social media streams. Improved performance following the development of domain-specific embeddings was also demonstrated in the classification of ADR-related tweets [12] (medical embeddings) and crisis-related tweets [31] (crisis embeddings). The former employed a bi-directional LSTM model for the detection of ADRs, Drug Entities and others. The latter used a CNN model for binary identification of useful versus non-useful posts during a crisis event. Similarly, CNNs were successfully applied in personality identification [32], sarcasm detection [33], aspect extraction [34] and emotion recognition [35].
CNNs capture the most salient n-gram information by means of their convolution and max-pooling operations. For NLP tasks, RNNs are found particularly suitable due to their ability to process variable-length inputs as well as long-distance word relationships [36]. In text classification, the dependencies between the centre word and far-away words can be meaningful and contribute to performance improvement [37]. LSTMs (Long Short-Term Memory networks), as variants of RNNs, can leverage both short- and long-distance word relationships [37]. Unlike LSTMs, GRUs (Gated Recurrent Units) fully expose their memory content at each timestep; whenever a previously detected feature, or the memory content, is considered important for later use, the update gate will be closed to carry the current memory content across multiple timesteps [38]. In empirical comparisons on selected datasets, with a fixed number of parameters for all models, GRUs outperformed LSTMs in terms of convergence in CPU time, number of parameter updates and generalisation [39].
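For reference, the gating behaviour described above can be written out in one commonly used GRU formulation. The notation is an assumption made here for illustration (input $x_t$, hidden state $h_t$, update gate $z_t$, reset gate $r_t$, weight matrices $W$ and $U$) and is not taken from this study; some references swap the roles of $z_t$ and $1 - z_t$.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W x_t + U \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Under this convention, a "closed" update gate ($z_t \approx 0$) gives $h_t \approx h_{t-1}$, i.e. the existing memory content is carried forward unchanged across timesteps.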

Contributions
The main contributions of the study can be stated as follows:
• We introduce a Deep Learning application in the context of Pollen Allergy surveillance from Social Media, in place of the currently dominant conventional Machine Learning classifiers;
• We focus on challenging informal vocabulary, which leads to condition under/over-estimation if unaddressed, in place of the traditional limited keyword/lexicon-based approaches;
• We propose a fine-grained classification into 4 classes in place of the most common binary classifiers, i.e. Hay Fever-related/Hay Fever-non-related;
• We enrich the data with an extensive list of weather variables for the identification of potential patterns, where previous studies focus mainly on Temperature and Pollen Rate.

Study design
The objectives of the study are as follows:
• Framework development for quantitative and qualitative Hay fever monitoring from Twitter;
• Evaluation of multiple deep learning architectures for the classification of online user-generated content;
• Training and evaluation of domain-specific embeddings for accuracy improvement;
• Demonstration of the internal workings through investigation of the predictive probabilities and embedding vectors;
• Correlation with weather variables for pattern identification and future forecasting.
The high-level methodological framework is presented in Fig. 1, and the particular steps are detailed in the following sub-sections.

Data extraction
The extraction phase included the following stages:

Embeddings development
For the purpose of HF embeddings development, the relevant posts and comments from popular online platforms were crawled. The sources considered were: Twitter, YouTube and Reddit. In order to include only Hay fever-related data, the following keywords were searched for: 'hay fever' OR 'hayfever' OR 'pollen allergy'. In the case of Twitter, the inclusion of pre-defined keywords in the content was required. As for YouTube and Reddit, the associated comments/posts from videos/threads that contained one or more keywords from the list in their titles were extracted. In total, approximately 22k posts were collected.
The following web crawling methods were applied based on the data sources used: (i) Twitter: the twitteR R package, (ii) Reddit: the RedditExtractoR R package, and (iii) YouTube: NVivo. The Gensim library for Python, which provides access to Word2Vec training algorithms, was used with the window size set to 5. To enhance the reproducibility of the results and inform future research, the details of the particular embeddings development schema implemented are presented in Table 1.

Target data
As the purpose of the study is Hay fever surveillance in Australia, the posts were extracted using the geocoordinates of the following locations: (1) Alice Springs (radius=2,000mi), and (2) Sydney, Melbourne, and Brisbane (radius=300mi). Given that extracting the exact location is practically infeasible if the geo-tag option was disabled, separate datasets were created for (1) the whole of Australia, and (2) its major cities. Dataset 1 was used for classifier training, whereas dataset 2 was used for correlating tweet volumes with weather conditions for the particular area. A custom script was used to extract the data using the R programming language and the 'twitteR' package. The posts were captured retrospectively at regular time intervals. High precision was prioritised over high recall, hence the very narrow scope of the search terms. Preliminary data exploration showed that a wider list of search queries introduced excessive noise into the dataset. For instance, the generic term 'allergy' included other popular allergy types (e.g. Cats, Peanuts), and specific symptoms such as 'sneezing', 'runny nose' and 'watery eyes' frequently referred to other common conditions (e.g. Cold, Flu). Data was obtained for 191 out of 214 days in total (89%). The posts from the remaining 23 days were not captured due to technical issues 1. Still, for the quantitative analysis the missing values were accounted for to ensure the validity of the findings. The compensation approach is detailed in subsection Weather correlation, and the Extraction calendar is presented in Fig. 2, where 'x' indicates the gaps in data collection. The qualitative analysis remained unaffected.

Annotation process
The full dataset of 4,148 posts (Sydney: 1,040; Melbourne: 1,928; and Brisbane: 222) was annotated by two researchers active in the health informatics domain. The annotators performed the evaluation using the tweet text, as well as a link to the online version of the tweet if the text was unclear, where certain commonly occurring emojis (e.g. nose or tears) provided further context for interpreting the tweet. The approach followed the methodological considerations for undertaking Twitter research outlined by Colditz et al. [40]. In case of disagreement, either a consensus was reached or the 'Unrelated/Ambiguous' class was selected. The inter-rater reliability was calculated using Cohen's kappa statistic [41], which takes into account the probability of agreement by chance. The score achieved was κ = 0.78, which is considered significant [42]. The usernames have been removed from the posts for privacy reasons.
The study conducted by Lee et al. [13] categorised allergy-related posts into actual incidents of the condition and general awareness promotion. Analogously, the posts here were annotated as Informative and Non-Informative, as detailed in Table 2. The Informative category was split to separate (1) personal detailed reporting from (2) personal generic reporting. Class 1 was further used for the extraction of symptoms and/or treatments, whereas the combined classes 1 and 2 were used for the quantitative estimation of condition prevalence. The Non-Informative category included public broadcasting (3), and unrelated content (4).
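To make the inter-rater reliability calculation concrete, the following is a minimal sketch of Cohen's kappa for two annotators, written with the Python standard library only. The function name and the toy label lists are hypothetical and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations over the four classes (1-4); illustrative only.
annotator_1 = [1, 1, 2, 3, 4, 2, 1, 3]
annotator_2 = [1, 2, 2, 3, 4, 2, 1, 4]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 6 of 8 posts (observed agreement 0.75) while chance agreement is 0.25, which illustrates why kappa is lower than the raw agreement rate.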

Training and testing
Experiments with 4 deep learning architectures were conducted because previous studies reported varying performance across different datasets. The pre-processing performed was minimal, and included lowercasing and the removal of URLs and non-alphanumeric characters. Emojis were retained in their numerical representation following the punctuation removal. No excessive pre-processing was applied, as the models operate on the sequence of words in the order they appear. Words were preserved in their original form without stemming or lemmatising due to their context-dependent representation, e.g. 'allergy', 'allergic', 'allergen'. Also, Sarker et al. [6] suggested that stop words can have a positive effect on classifier performance. Analogous pre-processing steps were implemented for the embeddings development.
For feature extraction, the word-to-vector representation was adopted due to its ability to effectively capture the relationships between words, which makes it well suited to text classification tasks. Additionally, the use of word embeddings naturally extends the feature set, which is particularly advantageous in the case of small to moderate datasets. Two word embedding variants were implemented: (1) GloVe embeddings as the default, and (2) HF embeddings as the alternative. The pre-trained Common Crawl 840B tokens GloVe embeddings were downloaded from the website 2. Both the 50 dimensions (min) and 300 dimensions (max) options were tested. The HF embeddings were generated using 10 iterations and a vector dimension of 50, given the moderate training data size. A previous study [4] reported improved classification performance with 50 dimensions when training domain-specific embeddings.
In terms of the parameters, the mini-batch size was set to the default of 32, the most popular non-linear activation function, ReLU, was selected, the number of recurrent units was set to the standard 128, and the Nadam optimiser was used. The models were trained for up to 50 epochs and implemented with the open source neural network library Keras 3.
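The minimal pre-processing described above can be sketched as follows. This is an illustrative sketch rather than the script used in the study: the regular expressions and the function name are assumptions, and emoji handling is left out (the sketch assumes any emojis have already been converted to their numerical representation upstream, otherwise they would be stripped here).

```python
import re

# Assumed patterns; the study does not specify its exact expressions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(post):
    """Lowercase, drop URLs and non-alphanumeric characters, split into tokens."""
    text = post.lower()
    text = URL_RE.sub(" ", text)        # remove URLs before stripping punctuation
    text = NON_ALNUM_RE.sub(" ", text)  # punctuation, hashtags, other symbols
    # No stemming, lemmatising or stop-word removal: word order and form are kept.
    return text.split()


tokens = preprocess("Hay fever SUCKS!! https://t.co/abc #sneezing")
```

URLs are removed first so that their fragments do not survive as tokens once punctuation is stripped.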
Finally, the standard evaluation metrics were adopted, such as Accuracy, Precision (exactness) and Recall (completeness). The 5-fold cross-validation was followed, with 80:20 training and testing split as in [43]. The Confusion Matrices were further produced to examine in-detail the performances obtained for the particular classes.
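As an illustration of how these metrics relate to the confusion matrix, the sketch below computes accuracy and per-class precision and recall for a 4-class problem. The helper names and the toy labels are hypothetical; the study does not state how per-class scores were averaged, so no averaging is shown.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of all posts that lie on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def precision_recall(matrix, i):
    """Precision (exactness) and recall (completeness) for class index i."""
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum: all predicted as i
    actual = sum(matrix[i])                    # row sum: all truly of class i
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Hypothetical labels for the four classes; illustrative only.
classes = [1, 2, 3, 4]
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4]
cm = confusion_matrix(y_true, y_pred, classes)
acc = accuracy(cm)
```

Reading the matrix row by row shows which true class is confused with which predicted class, which is the information an overall accuracy figure hides.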

Weather correlation
To investigate patterns, the weather factors were superimposed on the tweet volume charts over the period of 6 months (2018/06/01 to 2018/12/31). The weekly averages of the number of Informative posts (classes 1 + 2) were taken into account for Sydney, Melbourne, and Brisbane. The approach followed a previous study conducted by Gesualdo et al. [16], where weekly averages of tweets were used to avoid daily fluctuations when correlating with pollen rates and antihistamine prescriptions. The environmental data was obtained from the Bureau of Meteorology 4 (BOM), Australia's official weather forecast and weather radar service. For gaps in data collection (Fig. 2), a compensation approach was adopted: given 1 day's worth of data missing within a week, the average of the remaining 6 days was calculated and taken as the tweet volume of the 7th day. The weekly average was then estimated from the complete 7-day record.
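The compensation approach can be sketched as below. The function name and the example counts are hypothetical, and extending the rule from one missing day to several missing days per week is an assumption made here, since the text only describes the single-day case.

```python
def weekly_average(daily_counts):
    """Weekly mean tweet volume; missing days (None) are filled with the
    mean of the observed days before averaging over the full 7-day record."""
    if len(daily_counts) != 7:
        raise ValueError("expected exactly 7 daily values")
    observed = [c for c in daily_counts if c is not None]
    if not observed:
        return None  # nothing was captured that week
    fill = sum(observed) / len(observed)
    completed = [fill if c is None else c for c in daily_counts]
    return sum(completed) / 7


# Hypothetical week with one missing day; illustrative only.
week = [3, 5, None, 4, 6, 2, 4]
avg = weekly_average(week)
```

Filling a gap with the mean of the observed days leaves the weekly mean equal to the mean of the observed days, so the procedure mainly ensures that every week is treated as a complete 7-day record.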

Accuracy evaluation
The accuracies obtained for RNN, LSTM, CNN and GRU models are presented in Table 3. The default (GloVe) and alternative (HF) word embeddings options were considered. In terms of GloVe, the min (50) and max (300) number of dimensions were implemented. The highest accuracy was obtained for GRU model with GloVe embeddings of 300 dimensions (87.9%). Further evaluation metrics (Precision and Recall) were produced for GloVe/300 and HF/50 options, and are included in Table 4.

Classification output
Example posts with the corresponding Classes, Class IDs, Predictive Probabilities and Post Implications are presented in Table 5. The implicit reference to either a symptom or a treatment is highlighted within each post. The official list of Hay fever symptoms was extracted from the Australasian Society of Clinical Immunology and Allergy (ASCIA) [21]. Furthermore, a sample of outputs in the form of word-word co-occurrence statistics was produced for both the GloVe and HF embeddings. Table 6 shows the top 15 terms with the highest associations with the following keywords: 'hayfever', 'antihistamines' (as the most common Hay fever medication), 'eyes' and 'nose' (as the most affected body parts).

Error analysis
In order to investigate the classification performance with respect to the particular classes, confusion matrices were generated for both the GloVe/300 and HF/50 options (Fig. 3). The highest performing deep learning architectures were selected according to the outputs presented in Table 4, i.e. GloVe/300 with GRU and HF/50 with CNN. Given the different weights associated with the classes, the fine-grained performance examination facilitates the selection of the most suitable classifier for the task at hand. For instance, the performance achieved for classes 1 and 2 (Informative) is prioritised over the performance achieved for classes 3 and 4 (Non-Informative). The visual format of the analysis further assists the interpretation of the results.
In order to better understand the sources of misclassification, examples of inaccurate predictions were returned along with the corresponding classification probabilities (Table 7). The approach provides insight into the sources of classifier confusion, and makes it possible to re-annotate the falsely identified posts as part of Active Learning, towards improved classification performance.
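Returning the misclassified posts with their probabilities can be sketched as follows. The function name, the dictionary fields and the toy inputs are hypothetical; sorting the most confident mistakes first is a design choice assumed here, not a detail reported in the study.

```python
def misclassified_posts(posts, y_true, probabilities, classes):
    """List wrongly predicted posts with the probability of the predicted class."""
    errors = []
    for post, true_label, probs in zip(posts, y_true, probabilities):
        best = max(range(len(classes)), key=lambda i: probs[i])
        predicted = classes[best]
        if predicted != true_label:
            errors.append({
                "post": post,
                "true": true_label,
                "predicted": predicted,
                "probability": probs[best],
            })
    # Most confident mistakes first: likely annotation issues or class overlap.
    errors.sort(key=lambda e: e["probability"], reverse=True)
    return errors


# Hypothetical posts and class probabilities for classes 1-4; illustrative only.
classes = [1, 2, 3, 4]
posts = ["post a", "post b", "post c"]
y_true = [1, 3, 2]
probabilities = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]
errors = misclassified_posts(posts, y_true, probabilities, classes)
```

Confident errors are natural candidates for re-annotation, whereas low-probability errors tend to point at genuinely ambiguous posts.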

Weather correlation
To explore potential patterns between environmental factors and HF-related Twitter activity, graphs were produced representing the weekly averages of selected weather variables and the weekly averages of Informative tweets (classes 1 + 2) throughout the 6-month period. An interactive approach made it possible to visually inspect the emerging correlations for Sydney, Melbourne and Brisbane. The most salient examples are presented in Fig. 4. A normalisation procedure was applied for calculating the inferential statistics. Also, the start as well as the peak of the Hay fever season based on Twitter self-reports was indicated, e.g. for Melbourne: start at the beginning of September, peak in October and November.
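A minimal sketch of how two weekly series could be normalised and compared is given below. The study does not name its normalisation procedure or correlation statistic, so min-max scaling and Pearson's r are assumptions made for illustration, and the weekly values are invented.

```python
import math


def min_max(values):
    """Rescale a series to [0, 1] so that variables in different units
    (e.g. tweet counts and humidity in %) can share one chart axis."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


# Hypothetical weekly averages; illustrative only.
weekly_tweets = [2, 4, 6, 8]
weekly_humidity = [80, 70, 60, 50]
r = pearson_r(min_max(weekly_tweets), min_max(weekly_humidity))
```

Pearson's r is unchanged by min-max scaling, so in this sketch the normalisation matters for superimposing the series on a chart rather than for the coefficient itself.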

Deep learning approach validation
A deep learning approach was adopted in order to address the limitations of lexicon-based and conventional machine learning techniques in accurately identifying non-standard expressions from social media, in the context of Hay fever. The maximum classification accuracy was achieved for the GRU model with pre-trained GloVe embeddings of 300 dimensions (87.9%). The application of HF word embeddings did not improve the performance of the classifier, which can be attributed to the relatively moderate size of the training dataset (20k posts). Future work will investigate large-scale domain-specific embeddings development, including data from online health communities (e.g. DailyStrength).
In the 1st part of the classification outputs (Table 5), the classifier was able to correctly identify informal and often implicit references to symptoms (e.g. 'cried', 'tears', 'sniff', 'snot'), and classify them as Informative - symptom (1). Only posts containing the 'hayfever' OR 'hay fever' keywords were considered, to ensure their relevance to the scope of the study. Additionally, 'new' symptoms (e.g. 'cough', 'lose my voice') were recognised and classified as Informative - symptom (1). For consistency, 'new' symptoms have been defined as those not listed on the official website of the Australasian Society of Clinical Immunology and Allergy [21]. Also, medication-related terms ranging from generic in the level of granularity ('spray', 'tablet' etc.) to specific brand names ('Sudafed', 'Zyrtec' etc.) were recognised as treatments, proving the flexibility of the approach. Despite correct classification, lower predictive probabilities were obtained for very rare ...
In the 2nd part of the classification outputs (Table 5), examples of posts classified accurately despite their confusing content are presented. For instance, an advertisement post including distinct Hay fever symptoms such as 'red nose' and 'itchy eyes' was classified correctly as Non-Informative - marketing (3), excluding it from further analysis and preventing over-estimation of condition prevalence.
With a relatively small training dataset (approx. 4,000 posts), the model proves its robustness in capturing the subtle regularities within the dataset. The lack of reliance on external, pre-defined lexicons makes it suitable for the detection of emerging symptoms and treatments. Deep learning eliminates the manual feature engineering effort, facilitating a more automated and systematic approach. The ability to produce a text representation that is selective to the aspects important for discrimination, but invariant to irrelevant factors, is essential given the highly noisy character of social media data. The traditional approaches, commonly referred to as 'shallow processing', allow only for surface-level feature extraction, which proves effective for well-structured documents, but frequently fails when exposed to more challenging user-generated content. Thus, advanced techniques are required if minor and often latent details are decisive for the correct class assignment.
In order to obtain greater insight into the classification process, the word embeddings outputs were produced for the following keywords: 'hayfever', 'antihistamines', 'eyes' and 'nose' (Table 6). For 'hayfever', mostly synonyms (e.g. 'rhinitis'), plurals (e.g. 'allergies') or derivatives (e.g. 'allergic') were captured, accounting for their interdependence. The general term 'antihistamines' demonstrated a close relationship with specific Hay fever drugs (e.g. 'Cetirizine', 'Loratadine', 'Zyrtec'), proving effective in the identification of treatments not identified a priori. Equivalent expressions such as 'eyelids' and 'nostril' were found to be associated with the body parts most commonly affected by pollen allergy, i.e. eyes and nose. Despite the linguistic variety that abounds on social media, the deep learning-based system with word embeddings demonstrated its ability to recognise the linkages between concepts, which is essential for any NLP task. On the other hand, the HF embeddings returned mostly symptoms related to particular organs (e.g. itchy, watery, blocked etc.), which can be considered informative for syndromic surveillance. Still, because numerous symptoms occur at once in the extracted posts, it is difficult to distinguish which body part a particular symptom relates to. Furthermore, the analysis of the embeddings outputs can be beneficial for informal health-related [...] [44], the knowledge of the symptoms experienced is as important as the language used to describe them. Finally, a model trained on the casual language prevalent on social media facilitates more robust symptom-driven, rather than disease-driven, surveillance approaches [44].
For continuous performance improvement, the concept of Active Learning was incorporated. The misclassified posts are returned along with the corresponding predictive probabilities, which makes it possible to identify the sources of classifier confusion and potentially refine the classes. A sample of incorrectly identified posts with a brief explanation is presented in Table 7.
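One way to retrieve the terms most associated with a keyword from a set of word vectors is a nearest-neighbour lookup, sketched below. Ranking by cosine similarity is an assumption made for illustration, and the 3-dimensional toy vectors are invented; they are not the GloVe or HF embeddings used in the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_terms(query, embeddings, top_n=15):
    """Rank all other vocabulary terms by similarity to the query term."""
    query_vector = embeddings[query]
    scored = [
        (word, cosine(query_vector, vector))
        for word, vector in embeddings.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "antihistamines": [0.90, 0.10, 0.00],
    "cetirizine": [0.85, 0.15, 0.00],
    "zyrtec": [0.80, 0.20, 0.10],
    "eyes": [0.10, 0.00, 0.95],
    "nose": [0.00, 0.10, 0.90],
}
top_terms = [word for word, _ in nearest_terms("antihistamines", toy_embeddings, top_n=2)]
```

With real embeddings the same lookup would be run over the full vocabulary, with `top_n=15` matching the number of terms reported per keyword.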

Knowledge discovery about Hay fever
Deep learning-based classification makes it possible to effectively and efficiently extract the relevant information from a large volume of streaming data. Real-time analysis is crucial for disease surveillance purposes. After the posts are classified into Informative and Non-Informative groups, the prevalence can be accurately estimated once news, advertisements and ambiguous content have been discarded. The finer-grained identification of (1) detailed symptoms/treatments versus (2) generic Hay fever mentions enables further knowledge discovery about the condition severity from the relevant class (1). The combined classes 1 and 2 allow for the quantitative prevalence estimation. As an example, the volume of HF-related tweets in Melbourne peaked in October and November, paralleling the findings obtained by the Australian Institute of Health and Welfare [1] regarding the wholesale supply of antihistamines sold throughout the year. The results prove useful for estimating the seasonality of the pollen season, accounting for its unpredictable and ever-changing pattern.
As for the correlation with weather factors, an inverse relationship was observed between Humidity [%] and Hay fever self-reports in Melbourne. Also, a close dependency was found in Brisbane, where the volume of HF-related posts approximated the pattern of the Evaporation variable [mm]. This can be attributed to the fact that plants are more likely to release pollen into the air on a sunny day than on a rainy one [29]. Thus, the proof-of-concept for a future forecasting model was demonstrated.