
Discovering thematic change and evolution of utilizing social media for healthcare research



Social media plays an increasingly important role in health and healthcare research owing to the rapid development of internet communication and information exchange. This paper conducts a bibliometric analysis to discover the thematic change and evolution of research on utilizing social media for healthcare.


Based on 4361 publications from Web of Science and PubMed published during 2008–2017, the analysis applies topic modelling and science mapping analysis.


Utilizing social media for healthcare research has attracted increasing attention from scientific communities. Journal of Medical Internet Research is the most prolific journal, and the USA dominates the research. Overall, major research themes such as YouTube analysis and Sex event are revealed. Themes in each time period and how they evolve across the time span are also detected.


This systematic mapping of the research themes and research areas helps identify research interests and how they evolve across time, as well as providing insight into future research direction.


In the past decade, the research field of utilizing social media for healthcare has attracted great interest from scientific communities, as can be observed from the annual increase in research publications. The internet is becoming an increasingly important source of information on public health issues [1]. Health-related information is being actively searched, shared, communicated, and discussed through social media. This kind of online information exchange benefits users through immediate access to information on health concerns [2], emotional and psychological support [3], and support for health-related decision making [4]. Furthermore, the development of digital social media brings relatively inexpensive and readily available means for the collection and storage of large volumes of data [5].

In recent years especially, researchers have begun to explore how social media can be used in health and healthcare research [6], and a rich body of work has emerged. For example, based on a regression analysis of county-level HIV rates and the aggregate usage of future tense language, Ireland et al. [7] found that there were fewer HIV cases in counties with higher rates of future tense on Twitter. Similar works focusing on sex-related events can be found, e.g., HIV prevention among men who have sex with men [8], and assessment of personal and environmental factors associated with premarital sex among adolescents [9]. Some researchers conduct studies on specific diseases, e.g., breast cancer [10], testicular cancer [11], and prostate cancer [12], with social media content as analysis material, e.g., videos [13], Twitter messages [14], and publicly available user profiles [15]. Similar studies centering on drugs can also be found, e.g., online drug sales [16], and direct-to-consumer drug advertising [17]. As a result, the research field of utilizing social media for healthcare is growing fast and is receiving more and more attention. It is therefore important to conduct a systematic analysis of existing research publications to understand the status of recent development.

As an effective statistical method for evaluating scientific publications, bibliometric analysis has been widely applied in various fields [18, 19]. It has been especially applied in interdisciplinary research, e.g., artificial intelligence on electronic health records research [20], natural language processing empowered mobile computing research [21], natural language processing in medical research [22], text mining in medical research [23], technology enhanced language learning research [24], and event detection in social media research [25].

To that end, this study carries out a bibliometric analysis of utilizing social media for healthcare research based on the research publications from Web of Science and PubMed during the year 2008–2017. The main aim is to develop a general approach to analyze the thematic change and evolution in the research field. As for the overall thematic detection, topic modelling analysis is conducted to identify major topics in the whole period. As for the thematic evolution, the approach combines performance analysis and science mapping for detecting and visualizing conceptual subdomains to quantify and visualize the thematic evolution of the research field.


Data retrieval and preprocessing

In this study, bibliometric methodology is applied using data from Web of Science (WoS) and PubMed. WoS is the most authoritative citation database and has been widely applied for bibliometric analysis, while PubMed provides a wide coverage of medical-related publications.

The keywords of social media are developed by domain experts after an extensive literature review. In the WoS Core Collection database, Topic Subject is used as the retrieval field. Publications indexed in “Science Citation Index Expanded (SCI-EXPANDED)” and “Social Sciences Citation Index (SSCI)” are considered. Further, publications of “Article” and “Proceedings paper” types indexed in the research areas pertaining to healthcare are selected manually. In the PubMed database, Title and MeSH Terms are used as the two retrieval fields. Specific exclusion strategies are also applied to ensure high relatedness of the retrieved publications. The specific search strategy is shown in Additional file 1. In total, 4361 unique publications are identified for analysis. Since there is no citation data available in PubMed, we use Google Scholar citations as the measure of citation count for the 4361 publications.

The raw data are downloaded as plain text. Key elements, e.g., title, published year, abstract, and author address are automatically extracted. Author affiliations and countries are identified based on author addresses. Inconsistent expressions are standardized.

For the thematic analysis, in addition to author keywords, KeyWords Plus, and PubMed MeSH, we also include keywords from the title and abstract using a self-developed Python program with a natural language processing module based on syntactic tree analysis. 1) The singular and plural forms of all the author keywords, KeyWords Plus, and PubMed MeSH are first stored as a database; 2) keywords in the title and in the abstract are automatically and separately extracted by matching the text against the database; 3) from the remaining text of the title and abstract, notional words are also extracted; 4) all the keywords are merged and unified to their singular form.

To improve the effectiveness of the thematic analysis, a de-duplication process is conducted following the experience of Cobo et al. [26]. Abbreviations are replaced by the corresponding full names using a mapping table, e.g., SMS is replaced by short message service; ADE is replaced by adverse drug event; MSM is replaced by men who have sex with men. Keywords representing the same concept are grouped, e.g., diabete mellitus, type 2, type 2 diabete, type 2 diabete mellitus, etc. We also apply a weight of 0.4 to author keywords, KeyWords Plus, and PubMed MeSH, and weights of 0.4 and 0.2 to the keywords from the title and abstract, respectively, based on our former experiment [22]. We then keep only terms with TF-IDF ≥ 0.1, which excludes terms with low frequency as well as those occurring in too many publications.
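The weighting and filtering step can be illustrated with a minimal sketch. The text gives the field weights and the TF-IDF threshold but not the exact scoring formula, so the combination used here (field weights summed into a weighted term frequency, multiplied by a natural-log inverse document frequency) is an assumption, and the names `FIELD_WEIGHTS`, `weighted_tf`, and `tfidf_filter` are hypothetical.

```python
import math
from collections import Counter

# Field weights as stated in the text; how they enter the score is an assumption.
FIELD_WEIGHTS = {"keywords": 0.4, "title": 0.4, "abstract": 0.2}

def weighted_tf(doc):
    """doc maps a field name to the list of keywords extracted from that field."""
    tf = Counter()
    for field, terms in doc.items():
        for term in terms:
            tf[term] += FIELD_WEIGHTS[field]
    return tf

def tfidf_filter(docs, threshold=0.1):
    """Return, per document, the terms whose weighted TF-IDF is >= threshold."""
    tfs = [weighted_tf(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tf in tfs for term in tf)
    kept = []
    for tf in tfs:
        scores = {t: w * math.log(n_docs / df[t]) for t, w in tf.items()}
        kept.append({t: s for t, s in scores.items() if s >= threshold})
    return kept
```

Under this scoring, a term that appears in every publication gets an inverse document frequency of zero and is dropped, which matches the stated aim of excluding terms that occur in too many publications.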

Approach for thematic detection analysis

Proposed by Blei et al. [27], the Latent Dirichlet Allocation (LDA) model has been widely applied to topic detection in various domains. It is a Bayesian mixture model for discrete data with the assumption that topics are uncorrelated. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

A document is represented as a sequence of N words denoted by d = (w1,  … , wN), where a word is an item from a vocabulary indexed by {1,  … , V}. A corpus is a collection of M documents denoted by D = {d1,  … , dM}. LDA assumes the following generative process. 1) The term distribution β of each topic is drawn as β~Dirichlet(δ), denoting the probability of a word occurring in a given topic; 2) the topic proportions θ of a document d are drawn as θ~Dirichlet(α); 3) for each word wi in the document d, a topic zi is drawn as zi~Multinomial(θ), and the word wi is then drawn from p(wi| zi, β). The log-likelihood for one document d ∈ D is given in Eq. (1), and Eq. (2) is the log-likelihood used for Gibbs sampling estimation with k topics, where n_K^(j) is the number of times term j is assigned to topic K.

$$ \ell \left(\alpha, \beta \right)=\log \left(p\left(d|\alpha, \beta \right)\right)=\log \int \left\{{\sum}_z\left[{\prod}_{i=1}^Np\left({w}_i|{z}_i,\beta \right)p\left({z}_i|\theta \right)\right]\right\}p\left(\theta |\alpha \right) d\theta $$
$$ \log \left(p\left(d|z\right)\right)=k\log \left(\frac{\varGamma \left( V\delta \right)}{\varGamma {\left(\delta \right)}^V}\right)+{\sum}_{K=1}^k\left\{\left[{\sum}_{j=1}^V\log \left(\varGamma \left({n}_K^{(j)}+\delta \right)\right)\right]-\log \left(\varGamma \left({n}_K^{(.)}+ V\delta \right)\right)\right\} $$
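The generative process described above can be sketched in a few lines of standard-library Python. This is only an illustration of the sampling steps, not the estimation procedure used in the study; the function names, the symmetric Dirichlet priors, and the fixed document length are assumptions made for the example.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, delta, seed=0):
    """Sample a toy corpus following the LDA generative process."""
    rng = random.Random(seed)
    # Step 1: one term distribution beta[k] per topic, beta ~ Dirichlet(delta).
    beta = [sample_dirichlet(rng, delta, vocab_size) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Step 2: topic proportions of this document, theta ~ Dirichlet(alpha).
        theta = sample_dirichlet(rng, alpha, n_topics)
        doc = []
        for _ in range(doc_len):
            # Step 3: draw a topic z_i from theta, then a word w_i from beta[z_i].
            z = rng.choices(range(n_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=beta[z])[0]
            doc.append(w)
        docs.append(doc)
    return beta, docs
```

Fitting LDA reverses this process: given only the words, it infers θ and β, for example by Gibbs sampling or variational expectation-maximization as in the text.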

We use 10-fold cross-validation to evaluate model performance for 16 candidate topic numbers (2 to 10, 15, 20, 30, 40, 50, 100, and 200). The perplexity criterion is used to select the optimal topic number [27]. The α used for Gibbs sampling is the mean of the α values estimated in the 10 cross-validation folds when fitting the model by variational expectation-maximization (VEM) with the optimal topic number. With this α and the optimal topic number, we estimate the LDA model using both Gibbs sampling and VEM. The best matches between the resulting topics are determined by the Hellinger distance in Eq. (3), in which P and Q are two probability measures.

$$ {H}^2\left(P,Q\right)=\frac{1}{2}\int {\left(\sqrt{dP}-\sqrt{dQ}\right)}^2 $$
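For discrete distributions such as topic-term distributions, the integral in the Hellinger distance becomes a sum over terms. The sketch below shows that discrete form and a simple nearest-topic matching; the matching rule (each topic is paired with its closest counterpart, without enforcing a one-to-one assignment) and the function names are assumptions for illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    squared = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared)

def best_matches(topics_a, topics_b):
    """For each topic in topics_a, return the index of the closest topic in topics_b."""
    return [
        min(range(len(topics_b)), key=lambda j: hellinger(p, topics_b[j]))
        for p in topics_a
    ]
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, so smaller values indicate better-matching topics.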

Further, we conduct a comparative analysis using the Affinity Propagation (AP) clustering method [28] based on keyword co-occurrence. In this analysis, only author keywords, KeyWords Plus, and PubMed MeSH are utilized. Keywords with a frequency of less than 40, or that do not reach a co-occurrence frequency of 40, are excluded. The 139 keywords meeting the thresholds are selected. Based on the keyword co-occurrence matrix of the 139 keywords, a keyword correlation matrix is calculated using the Ochiai correlation coefficient expressed in Eq. (4). Oij represents the co-occurrence probability of two keywords. Ai and Aj represent the keyword frequencies. Aij indicates the co-occurrence frequency of the two keywords. AP clustering is then conducted on the correlation matrix. The exemplars determined are used to represent and explain each cluster.

$$ {O}_{ij}={A}_{ij}/\sqrt{A_i{A}_j} $$
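As a small worked illustration of Eq. (4), the following sketch computes Ochiai coefficients from per-publication keyword lists. The input layout and the function name `ochiai_matrix` are assumptions for the example; the frequency thresholds and the AP clustering step itself are not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations

def ochiai_matrix(publications):
    """Map each co-occurring keyword pair (a, b), a < b, to its Ochiai coefficient."""
    freq = Counter()   # A_i: number of publications containing keyword i
    cooc = Counter()   # A_ij: number of publications containing both i and j
    for keywords in publications:
        unique = sorted(set(keywords))
        freq.update(unique)
        cooc.update(combinations(unique, 2))
    return {
        pair: n / math.sqrt(freq[pair[0]] * freq[pair[1]])
        for pair, n in cooc.items()
    }
```

For example, two keywords that each occur in two publications and co-occur in one of them get a coefficient of 1 / sqrt(2 × 2) = 0.5.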

Approach for thematic evolution analysis

Science mapping or bibliometric mapping is a spatial representation of the relationship between disciplines, fields, and documents or authors [29]. It has been widely used in different research fields [30,31,32] to reveal hidden key elements such as topics.

Science mapping analysis is carried out with SciMAT [33], a science mapping software tool that integrates the majority of the advantages of the available tools [34]. In this paper, we adopt the bibliometric approach defined by Cobo et al. [35], which is based on co-word analysis [36] and the H-index [37]. This approach establishes four stages to detect and visualize conceptual subdomains and the thematic evolution of a research field in a longitudinal framework:

1) Research themes detection

The research themes for each period are detected using co-word analysis [36]. The clustering of keywords into themes is based on the simple centers algorithm [38], a simple and well-known algorithm in the context of co-word analysis. The algorithm locates subgroups of keywords that are strongly linked to each other and that correspond to research interests or problems of great significance in academia. The similarity between keywords is measured by the equivalence index [39] defined in Eq. (5). In the equation, cij is the count of publications in which two keywords i and j co-occur, and ci and cj represent the counts of publications in which each one appears.

$$ {e}_{ij}=\frac{c_{ij}^2}{c_i{c}_j} $$

2) Research themes visualization

The detected networks can be characterized by two measures [39], i.e., Callon’s centrality and Callon’s density. Callon’s centrality measures the degree of interaction of a network with other networks and is defined in Eq. (6), with k a keyword belonging to the theme and h a keyword belonging to other themes. The internal strength of the network is measured by Callon’s density, defined in Eq. (7), with keywords i and j belonging to the theme and w the keyword count of the theme.

$$ c=10\times \sum {e}_{kh} $$
$$ d=100\left(\sum {e}_{ij}/w\right) $$

Based on the two measures, research themes can be mapped in a two-dimensional strategic diagram with four quadrants. Commonly, themes in the upper-right quadrant known as the motor-themes are both well developed and are important for structuring a research field. Themes in the upper-left quadrant are of only marginal importance for the field with well-developed internal ties but unimportant external ties. Themes in the lower-left quadrant are both weakly developed and marginal. They mainly represent either emerging or disappearing themes. Transversal and basic themes are contained in the lower-right quadrant, and they are important but are not developed.
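The equivalence index and the two Callon measures can be sketched as follows. The dictionary layout for the equivalence values, the function names, and in particular the cut-off values used to split the strategic diagram into quadrants are assumptions for illustration; the text does not state how the quadrant boundaries are chosen.

```python
from itertools import combinations

def equivalence_index(c_ij, c_i, c_j):
    """Eq. (5): squared co-occurrence count over the product of the two counts."""
    return c_ij ** 2 / (c_i * c_j)

def callon_measures(theme, all_keywords, e):
    """Return (centrality, density) of a theme.

    e maps frozenset({a, b}) to the equivalence index; missing pairs count as 0.
    """
    theme = set(theme)
    others = set(all_keywords) - theme
    # Eq. (6): links between keywords of the theme and keywords of other themes.
    external = sum(e.get(frozenset((k, h)), 0.0) for k in theme for h in others)
    # Eq. (7): links among keywords inside the theme, divided by theme size w.
    internal = sum(e.get(frozenset(p), 0.0) for p in combinations(sorted(theme), 2))
    return 10 * external, 100 * internal / len(theme)

def quadrant(centrality, density, c_cut, d_cut):
    """Place a theme in the strategic diagram; c_cut and d_cut are hypothetical cut-offs."""
    if centrality >= c_cut:
        return "motor" if density >= d_cut else "basic and transversal"
    return "developed but marginal" if density >= d_cut else "emerging or disappearing"
```

Centrality is plotted on the horizontal axis and density on the vertical axis, so a theme with high values on both lands in the upper-right (motor) quadrant.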

3) Thematic evolution discovery and performance analysis

A thematic area is a set of themes that have evolved across different subperiods. Suppose T^t is the set of detected themes of subperiod t, with U ∈ T^t denoting each detected theme. Let V ∈ T^(t+1) be each detected theme in the next subperiod t + 1. A thematic evolution from theme U to theme V is considered to exist if there are keywords present in both associated thematic networks. The keywords k ∈ U ∩ V are considered to be a “thematic nexus”. The inclusion index [40] shown in Eq. (8) is used to weight the importance of a thematic nexus. Note that a theme may belong to different thematic areas, or may not come from any.

In a bibliometric map of thematic evolution over two periods, a solid line shows that the linked themes share the same name, while a dotted line indicates that the themes share elements other than the theme names. The thickness of the lines and the sphere volume are proportional to the inclusion index and to the publication count associated with each theme, respectively. Different thematic areas can thus be distinguished by different colors. A theme in the first period that has no link to any theme in the second period is discontinued, while a theme in the second period that has no link to any theme in the first period is a new one.

$$ \mathrm{Inclusion}\kern0.20em \mathrm{index}=\frac{\#\left(U\cap V\right)}{\min \left(\#U,\#V\right)} $$
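A minimal sketch of Eq. (8), treating each theme as a set of keywords, may help make the definition concrete. The function names and the toy keyword sets in the usage note are hypothetical.

```python
def inclusion_index(theme_u, theme_v):
    """Eq. (8): shared keywords divided by the size of the smaller theme.

    Both themes are assumed to be non-empty collections of keywords.
    """
    u, v = set(theme_u), set(theme_v)
    return len(u & v) / min(len(u), len(v))

def evolution_links(themes_t, themes_next):
    """Return (name_u, name_v, weight) for every theme pair sharing a keyword.

    themes_t and themes_next map a theme name to its set of keywords.
    """
    links = []
    for name_u, kw_u in themes_t.items():
        for name_v, kw_v in themes_next.items():
            if set(kw_u) & set(kw_v):
                links.append((name_u, name_v, inclusion_index(kw_u, kw_v)))
    return links
```

For instance, a theme with keywords {facebook, privacy, student} and a later theme with {facebook, student, adolescent, profile} share two keywords, giving an inclusion index of 2/3; themes with no shared keyword produce no link.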

The analysis of the science mapping work-flow can be further enriched by a performance analysis with two kinds of bibliometric indicators, i.e., quantitative and qualitative ones. The quantitative indicators, e.g., publication count, author count, publication source count, and country count, measure the productivity of the detected themes and thematic areas. The qualitative indicators, e.g., citation count and H-index, measure the quality based on the bibliometric impact of those themes and thematic areas.
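The H-index used as a qualitative indicator can be computed from a list of citation counts, as in the sketch below (the function name is an assumption): it is the largest h such that h publications each have at least h citations.

```python
def h_index(citations):
    """Largest h such that at least h publications have h or more citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h
```

Applied to the publications of a theme, a country, or an author, this gives the corresponding H-index reported in the performance analysis.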


Performance bibliometric analysis

The publication counts and citation counts from 2008 to 2017 are presented in Fig. 1. Research on utilizing social media for healthcare is clearly becoming more influential in scientific communities, as evidenced by the significant growth of publications from the two databases, from 18 publications in 2008 to 1030 publications in 2017. A similar increasing trend can also be observed in the publication count in WoS. These results may be explained by the increasing global concern with and interest in exploring the use of social media data for healthcare research. It is worth mentioning that there is a remarkable upsurge in research in 2010, with growth rates of up to 309% in WoS and 170% in PubMed. The citation count curve shows an increasing trend between 2008 and 2013, and publications in 2013 have received the most citations. A decreasing trend is shown between 2014 and 2017, which may result from the fact that newer publications usually have fewer citations because they have had less time to accumulate them. On the whole, research on utilizing social media for healthcare has received growing attention in the last decade.

Fig. 1

Publication count and citation count

Studies in the field have been published in nearly one thousand publication sources. Some of these publication sources are highly relevant to the field, while others are only partially related. Table 1 lists the top 20 publication sources ranked by publication count in the research field. According to both publication percentage and H-index, Journal of Medical Internet Research, PLoS One, and Cyberpsychology, Behavior and Social Networking are the most influential journals in the field.

Table 1 Prolific publication sources

The 4361 publications involve 3311 affiliations and 14,154 authors from 88 countries/regions. Of these, 18.18% of the countries/regions, 65.06% of the affiliations, and 84.41% of the authors contribute only one publication. Table 2 lists the top 20 most prolific countries/regions, affiliations, and authors.

Table 2 Prolific countries/regions, affiliations and authors

From the country/region perspective, the USA dominates the field with 2394 publications, accounting for 54.90% of the total publications. The USA also has the highest H-index, 125, indicating the high impact of its publications. Other prolific countries/regions with more than 100 publications include England, Australia, Canada, China, Germany, and Spain.

Fifteen of the top 20 prolific affiliations are from the USA, with Harvard University (97 publications, H-index of 30) and University of Washington (86 publications, H-index of 30) ranking as the top two. University of Toronto and University of British Columbia from Canada, as well as three affiliations from Australia (University of Melbourne, University of Sydney, and Monash University), also appear in the list.

The leading position of the USA in the research field is also reflected in the analysis of prolific authors. Most of the top 20 authors are from the USA, except Mowafa Househ from Saudi Arabia, King-Wa Fu from Hong Kong, and Luis Fernandez-Luque from Norway. Megan A. Moreno has the most publications as well as the highest H-index, indicating the high productivity and high influence of her research.

Thematic detection analysis

With the optimal topic number of 20 and the initial α of 0.028204, the LDA model is fitted using Gibbs sampling for the overall thematic detection. The 20 topics with their top 15 representative terms are shown in Table 3, along with their possible themes, e.g., YouTube analysis, Sex event, Web-based medical education, Students’ use of Facebook, and Twitter use.

Table 3 Top 15 most frequent terms for the 20 detected topics

The most frequent keywords used for the AP clustering analysis include social media (3484), human (2109), internet (1323), female (886), male (817), adolescent (694), adult (624), young adult (522), Facebook (473), and social networking (463). Figure 2 shows that the 139 keywords are classified into 28 clusters with exemplars, e.g., self concept, male, middle aged, internet, cancer, YouTube, and weight loss.

Fig. 2

AP clustering result for the publications during 2008–2017 (terms in bold and italic type denote the exemplar of each cluster)

Thematic evolution analysis

For each time period, two kinds of strategic diagrams are generated to analyze the most highlighted themes. The sphere size in the first diagram is proportional to publication count associated with each theme, while in the second one, the sphere size is proportional to the citation count received for each theme. We split the 10 years into five periods, i.e., [2008–2009], [2010–2011], [2012–2013], [2014–2015], and [2016–2017]. The identified themes with publication count are reported in Table 4 and are visualized using the strategic diagrams as Figs. 3, 4, 5, 6 and 7.

Table 4 Performance measures for the themes of each subperiod
Fig. 3

Strategic diagrams for the period 2008–2009

Fig. 4

Strategic diagrams for the period 2010–2011

Fig. 5

Strategic diagrams for the period 2012–2013

Fig. 6

Strategic diagrams for the period 2014–2015

Fig. 7

Strategic diagrams for the period 2016–2017

In the period 2008–2009, there are a total of 39 publications. According to the strategic diagrams (Fig. 3) and quantitative measures (Table 4), we can observe that the motor themes PROFILE and SOCIAL-NETWORKING have high citations and H-index scores. Theme MANAGEMENT has the highest H-index score, indicating that it has a higher impact.

In the period 2010–2011, there are a total of 240 publications. The motor-theme FACEBOOK is the most cited and presents the highest impact. The other motor-themes TECHNOLOGY and ADOLESCENT also receive high citations and have high H-index scores. The themes MESSAGE and DATA-COLLECTION receive rather low citations and H-index scores.

In the period 2012–2013, a total of 729 publications are published. According to the performance measures, the following four themes can be highlighted: FACEBOOK, PATIENT, MESSAGE, and WEB-2. These research themes have considerable impact, achieving higher citations and H-index scores compared with the remaining themes. The motor-theme FACEBOOK receives the most citations and also has the highest H-index score. The basic and transversal theme SURVEY-AND-QUESTIONNAIRE receives rather low citations and a low H-index score.

In the period 2014–2015 with a total of 1385 publications, according to the strategic diagrams (Fig. 6) and quantitative measures (Table 4), motor-themes present the highest citations and impact scores. The following seven themes with high citations and H-index scores are highlighted: FACEBOOK, PATIENT, TWEET, TECHNOLOGY, PUBLIC-HEALTH, WEB, and SCHOOL.

A total of 1968 publications are published in the period 2016–2017. The strategic diagrams (Fig. 7) and quantitative measures (Table 4) also show that the motor-themes present the highest citations and impact scores, i.e., FACEBOOK, PATIENT, TWITTER, PROGRAM, YOUNG-ADULT, and MEDIA. The theme NETWORK also receives high citations and has a high H-index score. The basic and transversal theme PERCEPTION receives rather low citations and a low H-index score.

The evolution of the themes detected in each period, considering their keywords and their development across time, is shown in Fig. 8. Eight main thematic areas are identified, such as FACEBOOK, PATIENT, TWEET, WEB, and SOCIAL-NETWORK. According to Fig. 8, the research in this field presents strong cohesion, since the majority of the detected themes are grouped under a thematic area and come from a theme existing in the previous period. Some thematic areas, such as FACEBOOK and PATIENT, are present over all five periods studied. Others, such as SOCIAL-NETWORK, appear only in the later periods.

Fig. 8

Thematic evolution of the research field (2008–2017)


Based on the 4361 research publications from Web of Science and PubMed during the year 2008–2017, a bibliometric analysis of utilizing social media for healthcare research is conducted, aiming at exploring the thematic detection and evolution of the research field.

The first finding worth noting is that the research field has attracted more and more attention from scientific communities throughout the last ten years. The most prolific publication sources are Journal of Medical Internet Research, PLoS One, and Cyberpsychology, Behavior and Social Networking. The USA dominates the research with a comparatively higher publication count. Its dominant role can also be observed from the top prolific authors and affiliations, most of which belong to the USA.

In the overall thematic detection, 20 topics are detected by topic modelling analysis, e.g., YouTube analysis, Sex event, Web-based medical education, Students’ use of Facebook, and Twitter use. Most topics identified are recognizable because they are generally major issues in the research field. We here provide interpretations for some representative topics. Topic 14 contains words such as YouTube, YouTube video, video recording, viewer, and viewed. Thus it pertains to YouTube analysis. As a video-sharing platform, YouTube is nowadays widely utilized to search, share and disseminate health-related information. Topic 18 discusses Sex event. It includes terms such as men who have sex with men, HIV, adolescent, sexual, youth, sex, prevention, and intervention. Most relevant studies are about sexually transmitted infections with HIV as the major research focus, e.g., HIV prevention, treatment, and testing, in which men who have sex with men are often the main focus. Topic 10 mainly focuses on Web-based medical education with terms such as student, learning, medical education, teaching, course, nursing student, web-2, and technology. Participatory web-based platforms, including social media, have been increasingly recognized as valuable learning tools in medical and health education.

Comparing the results of topic modelling and AP clustering, we find that for most of the identified groups, the representative terms within each group are more similar and more understandable in AP clustering. The reason may lie in the choice of analysis units. In AP clustering, only author keywords, KeyWords Plus, and PubMed MeSH are used, on the consideration that too many analysis units may lead to poor performance when the selected frequent keywords are not of high quality. In topic modelling, by contrast, keywords from the title and abstract are used in addition to author keywords, KeyWords Plus, and PubMed MeSH, on the consideration that more analysis units may lead to higher performance for topic modelling. However, phrase extraction is a difficult task due to the complexity of natural language text, so the developed extraction program may extract keywords of low quality. Therefore, in future work, more attention should be paid to improving keyword extraction performance.

From the thematic evolution analysis, eight main thematic areas can be detected, e.g., FACEBOOK, PATIENT, TWEET, WEB, and SOCIAL-NETWORK. In general, the motor-themes present the highest citations and impact scores in each period. FACEBOOK, for instance, is a motor-theme in all of the last four periods, while PATIENT and TWEET are motor-themes in all of the last three periods, demonstrating their significant roles in the research field.

Specifically, the evolution of a thematic area can be represented by a series of thematic networks, one for each period. Taking the thematic area TWEET in Fig. 9 as an example, it first evolves in a decreasing way and then in an increasing way. The thematic area originates from the themes MANAGEMENT and VIRTUAL-COMMUNITY in the period 2008–2009; these two themes evolve into MESSAGE in 2010–2011, which stays constant in the following period. In the period 2014–2015, it evolves into TWEET and PUBLIC-HEALTH, and finally into TWEET and MEDIA in the last period. Some thematic areas, such as FACEBOOK, evolve in a constant way, as shown in Fig. 10.

Fig. 9

The TWEET thematic area (2008–2017)

Fig. 10

The FACEBOOK thematic area (2011–2017)

Topic modelling analysis depicts the major research themes from a holistic perspective and does not take their evolution throughout different periods into consideration. The science mapping analysis fills this gap by making it possible to detect themes period by period and to trace how the detected themes evolve in a longitudinal framework. Comparing Tables 3 and 4, more themes are detected by the topic modelling analysis. For example, some significant themes such as Sex event, Alcohol & drug, Vaccine, and Exercise, food, and weight do not appear in the science mapping analysis. This may be because, in the topic modelling analysis, all the keywords selected by TF-IDF are used as analysis units, whereas not all of them are included in the science mapping analysis.

In the science mapping analysis, data reduction and network reduction are used to obtain networks and dendrograms of modest size. On the one hand, data reduction is conducted by using a minimum frequency as a threshold to filter out infrequent keywords so that the networks are not too complex to interpret. On the other hand, as noted in [38], two keywords that appear infrequently in the corpus but always appear together usually have larger strength values than keywords that appear many times in the corpus and almost always together, so that possibly irrelevant or weak associations may dominate the network. Thus, SciMAT allows the network to be filtered using a minimum edge-value threshold. The simple centers algorithm also has two parameters to limit the size of the detected themes: the minimum and maximum size of the networks. Although data reduction and network reduction are intended to show the most significant keywords and their relationships more clearly, some keywords with a comparatively low frequency that are not taken into account may also be important. Thus, in future work, we will explore ways to analyze periodical thematic evolution that take every keyword into consideration.


Aiming at understanding the thematic change and evolution of utilizing social media for healthcare research during the last decade, this paper presents a quantitative analysis of publications from Web of Science and PubMed. Topic modelling analysis is used to identify major areas from an overall perspective. An approach of science mapping combining performance analysis is applied to quantify and visualize the thematic evolution. This systematic mapping of the research themes and research areas helps identify research interests and how they evolve across time, as well as providing insight into future research direction.



AP: Affinity Propagation

HIV: Human immunodeficiency virus

LDA: Latent Dirichlet Allocation

MeSH: Medical subject headings

SciMAT: Science mapping analysis tool

TF-IDF: Term frequency-inverse document frequency

USA: United States

VEM: Variational expectation-maximization

WoS: Web of Science


  1. Baker L, Wagner TH, Singer S, Bundorf MK. Use of the internet and e-mail for health care information: results from a national survey. JAMA. 2003;289(18):2400–6.
  2. Oh KM, Jun JM, Zhao XQ, Kreps GL, Lee EE. Cancer information seeking behaviors of Korean American women: a mixed-methods study using surveys and focus group interviews. J Health Commun. 2015;20(10):1143–54.
  3. Lee SY, Hawkins R. Why do patients seek an alternative channel? The effects of unmet needs on patients' health-related internet use. J Health Commun. 2010;15(2):152–66.
  4. Fox S, Purcell K. Chronic disease and the internet. Washington, DC: Pew Internet & American Life Project; 2010.
  5. Lavrač N, Keravnou ET, Zupan B. Intelligent data analysis in medicine and pharmacology: an overview. In: Lavrač N, Keravnou ET, Zupan B, editors. Intelligent data analysis in medicine and pharmacology. Boston, MA: Springer US; 1997. p. 1–13.
  6. Sinnenberg L, DiSilvestro CL, Mancheno C, Dailey K, Tufts C, Buttenheim AM, et al. Twitter as a potential data source for cardiovascular disease research. JAMA Cardiol. 2016;1(9):1032–6.
  7. Ireland ME, Schwartz HA, Chen QJ, Ungar LH, Albarracín D. Future-oriented tweets predict lower county-level HIV prevalence in the United States. Health Psychol. 2015;34(S):1252–60.
  8. Ross MW, Berg RC, Schmidt AJ, Hospers HJ, Breveglieri M, Furegato M, Weatherburn P. Internalised homonegativity predicts HIV-associated risk behavior in European men who have sex with men in a 38-country cross-sectional study: some public health implications of homophobia. BMJ Open. 2013;3(2):e001928.
  9. Wong ML, Chan RKW, Koh D, Tan HH, Lim FS, Emmanuel S, Bishop G. Premarital sexual intercourse among adolescents in an Asian country: multilevel ecological factors. Pediatrics. 2009;124(1):e44–52.
  10. Bender JL, Jimenez-Marroquin MC, Ferris LE, Katz J, Jadad AR. Online communities for breast cancer survivors: a review and analysis of their characteristics and levels of use. Support Care Cancer. 2013;21(5):1253–63.
  11. Bender JL, Wiljer D, To MJ, Bedard PL, Chung P, Jewett MA, et al. Testicular cancer survivors’ supportive care needs and use of online support: a cross-sectional survey. Support Care Cancer. 2012;20(11):2737–46.
  12. Bravo CA, Hoffman-Goetz L. Tweeting about prostate and testicular cancers: do Twitter conversations and the 2013 Movember Canada campaign objectives align? J Cancer Educ. 2016;31(2):236–43.
  13. Stellefson M, Chaney B, Ochipa K, Chaney D, Haider Z, Hanik B, et al. YouTube as a source of chronic obstructive pulmonary disease patient education: a social media content analysis. Chron Respir Dis. 2014;11(2):61–71.
  14. Park S, Oh HK, Park G, Suh B, Bae WK, Kim JW, et al. The source and credibility of colorectal cancer information on Twitter. Medicine. 2016;95(7):e2775.
  15. Himelboim I, Han JY. Cancer talk on Twitter: community structure and information sources in breast and prostate cancer social networks. J Health Commun. 2014;19(2):210–25.
  16. Mackey TK, Liang BA. Global reach of direct-to-consumer advertising using social media for illicit online drug sales. J Med Internet Res. 2013;15(5):e105.
  17. Mackey TK, Cuomo RE, Liang BA. The rise of digital direct-to-consumer advertising? Comparison of direct-to-consumer advertising expenditure trends from publicly available data sources and global policy implications. BMC Health Serv Res. 2015;15(1):236.
  18. Chen XL, Weng H, Hao TY. A data-driven approach for discovering the recent research status of diabetes in China. Lect Notes Comput Sci. 2017;10594:89–101.
  19. Chen XL, Chen BY, Zhang CX, Hao TY. Discovering the recent research in natural language processing field based on a statistical approach. Lect Notes Comput Sci. 2017;10676:507–17.
  20. Chen XL, Liu ZQ, Wei L, Yan J, Hao TY, Ding RY. A comparative quantitative study of utilizing artificial intelligence on electronic health records in the USA and China during 2008–2017. BMC Med Inform Decis Mak. 2018;18(Suppl 5):117.
  21. Chen XL, Ding RY, Xu K, Wang S, Hao TY, Zhou Y. A bibliometric review of natural language processing empowered mobile computing. Wirel Commun Mob Comput. 2018:1–21.
  22. Chen XL, Xie HR, Wang FL, Liu ZQ, Xu J, Hao TY. A bibliometric analysis of natural language processing in medical research. BMC Med Inform Decis Mak. 2018;18(1):14.
  23. Hao TY, Chen XL, Li GZ, Yan J. A bibliometric analysis of text mining in medical research. Soft Comput. 2018:1–18.
  24. Chen XL, Hao JT, Chen JJ, Hua SS, Hao TY. A bibliometric analysis of the research status of the technology enhanced language learning. Lect Notes Comput Sci. 2018;11284:169–79.
  25. Chen XL, Wang S, Tang Y, Hao TY. A bibliometric analysis of event detection in social media. Online Inf Rev. 2019;43(1):29–52.
  26. Cobo MJ, Martinez MA, Gutierrez-Salcedo M, Fujita H, Herrera-Viedma E. 25 years at Knowledge-Based Systems: a bibliometric analysis. Knowl-Based Syst. 2015;80:3–13.
  27. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3(Jan):993–1022.
  28. Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007;315(5814):972–6.
  29. Small H. Visualizing science by citation mapping. J Am Soc Inf Sci. 1999;50(9):799–813.
  30. Cartes-Velásquez R, Manterola-Delgado C. Bibliometric analysis of articles published in ISI dental journals, 2007–2011. Scientometrics. 2014;98(3):2223–33.
  31. Cobo MJ, Chiclana F, Collop A, de Oña J, Herrera-Viedma E. A bibliometric analysis of the intelligent transportation systems research based on science mapping. IEEE Trans Intell Transp Syst. 2014;15(2):901–8.
  32. Huang MH, Chang CP. Detecting research fronts in OLED field using bibliographic coupling with sliding window. Scientometrics. 2014;98(3):1721–44.
  33. Cobo MJ, López-Herrera AG, Herrera-Viedma E, Herrera F. SciMAT: a new science mapping analysis software tool. J Am Soc Inf Sci Technol. 2012;63(8):1609–30.
  34. Cobo MJ, López-Herrera AG, Herrera-Viedma E, Herrera F. Science mapping software tools: review, analysis and cooperative study among tools. J Am Soc Inf Sci Technol. 2011;62(7):1382–402.
  35. Cobo MJ, López-Herrera AG, Herrera-Viedma E, Herrera F. An approach for detecting, quantifying, and visualizing the evolution of a research field: a practical application to the fuzzy sets theory field. J Informetrics. 2011;5(1):146–66.
  36. Callon M, Courtial JP, Turner WA, Bauin S. From translations to problematic networks: an introduction to co-word analysis. Soc Sci Inf. 1983;22(2):191–235.
  37. Hirsch JE. An index to quantify an individual's scientific research output. Proc Natl Acad Sci. 2005;102(46):16569–72.
  38. Coulter N, Monarch I, Konda S. Software engineering as seen through its research literature: a study in co-word analysis. J Am Soc Inf Sci. 1998;49(13):1206–23.
  39. Callon M, Courtial JP, Laville F. Co-word analysis as a tool for describing the network of interactions between basic and technological research: the case of polymer chemistry. Scientometrics. 1991;22(1):155–205.
  40. Sternitzke C, Bergmann I. Similarity measures for document mapping: a comparative study on the level of an individual scientist. Scientometrics. 2009;78(1):113–30.



Not applicable.


Publication of this article is supported by grants from the National Natural Science Foundation of China (No. 61772146 and No. 61871141) and the Guangzhou Science Technology and Innovation Commission (No. 201803010063).

Availability of data and materials

The datasets used and analyzed during the current study are available from the first author upon reasonable request.

About this supplement

This article has been published as part of BMC Medical Informatics and Decision Making Volume 19 Supplement 2, 2019: Proceedings from the 4th China Health Information Processing Conference (CHIP 2018). The full contents of the supplement are available online at URL.

Author information




XLC led the method application, the experiments, and the result analysis. YHL participated in the data extraction and preprocessing. JY participated in the manuscript revision. TYH provided theoretical guidance and revised the paper. HW participated in the manuscript revision. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Tianyong Hao or Heng Weng.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Table S1. Search strategy and keywords used for Web of Science. Table S2. Search strategy and keywords used for PubMed. (DOCX 16 kb)

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Chen, X., Lun, Y., Yan, J. et al. Discovering thematic change and evolution of utilizing social media for healthcare research. BMC Med Inform Decis Mak 19, 50 (2019).



  • Social media
  • Healthcare research
  • Topic modelling
  • Science mapping
  • Thematic detection
  • Thematic evolution