A statistical analysis of vaccine-adverse event data

Background Vaccination has been one of the most successful public health interventions to date, and the U.S. FDA/CDC Vaccine Adverse Event Reporting System (VAERS) currently contains more than 500,000 reports for post-vaccination adverse events that occur after the administration of vaccines licensed in the United States. The VAERS dataset is huge, contains very large dimension nominal variables, and is complex due to multiple listing of vaccines and adverse symptoms in a single report. So far there has not been any statistical analysis conducted in attempting to identify the cross-board patterns on how all reported adverse symptoms are related to the vaccines. Methods For studies of the relationship between vaccines and reported adverse events, we consider a partial VAERS dataset which includes all reports filed over a period of 24 years between 1990-2013. We propose a neighboring method to process this dataset for dealing with the complications caused by multiple listing of vaccines and adverse symptoms in a single report. Then, the combined approaches based on our neighboring method and novel utilization of data visualization techniques are employed to analyze the large dimension dataset for characterization of the cross-board patterns of the relations between all reported vaccines and events. Results The results of our analysis indicate that those events or symptoms with overall high occurrence frequencies are positively correlated, and those most frequently occurred adverse symptoms are mostly uncorrelated or negatively correlated under different bacteria vaccines, but they are in many cases positively correlated under different virus vaccines, especially under flu vaccines. No particular patterns are shown under live vs. inactive vaccines. Conclusions This article identifies certain cross-board patterns of the relationship between the vaccines and the reported adverse events or symptoms. This helps for better understanding the VAERS data, and provides a useful starting point for the development of statistical models and procedures to further analyze the VAERS data.

new, unusual or rare vaccine adverse events or symptoms; monitor increase in known adverse events; identify potential patient risk factors for particular types of adverse events; assess the safety of newly licensed vaccines; etc.
Each VAERS report includes the following information of an individual: patient ID, place of vaccination, age, gender, vaccines administrated, adverse events or symptoms observed, time between vaccination and adverse event onset, etc. The VAERS data at FDA site are not ready for statistical analysis without being processed, because each report lists adverse events or symptoms in the form of non-regulated words or phrases, and often contains multiple listing of symptoms along with multiple listing of vaccines. Taking into account the possible multiple listing of vaccines and adverse events or symptoms in one report, a well processed dataset file based on current 530,716 case reports during 1990-2016 is estimated to have 2,000,000 -3,000,000 rows. Thus, this is a big and complicated data set.
Challenges: In addition to the large data size issue, as the key components for our research interests the vaccine variable V and symptom variable Z in VAERS data are nominal variables, and the already very large dimension of symptom variable Z (i.e., the total number of different categories) can still increase as more reports are being filed each year. In statistical literature, we have few tools for such kind of data analysis involving nominal categorical variable with unlimited dimension. Another big complication of the VAERS data is due to above mentioned multiple listing of vaccines administrated and multiple listing of adverse symptoms in one single VAERS report. For instance, one report may list vaccines A and B and list adverse symptoms C, D and E. In such a case, we do not exactly know which symptom was triggered by which vaccine. Unfortunately, such huge complication in VAERS data will continue until one vaccination per time is enforced in U.S. Thus, this posts great challenges for the analysis of vaccine data.
Dr. He of this project team was the primary developer of the vaccine ontology. Recently, he and Dr. Zhang (coauthor of this article) along with other collaborators have conducted some network-based studies on the VAERS data to summarize and analyze the vaccine-adverse event association [1][2][3], and have done some ontology-based comparative analyses on the adverse event associated with killed and live influenza vaccines [4]. But these works are not the statistical analysis in the usual sense.
It is well-known that before a particular vaccine was marketed, clinical trials had already identified some adverse symptoms or events associated with such vaccine. However, this is not equivalent to the cross-board patterns of the relations between vaccines and adverse events or symptoms. With huge VAERS data accumulated at this point, the analysis of such cross-board patterns becomes possible, but so far there has not been any statistical analysis conducted in attempting to identify the crossboard patterns on how all reported adverse symptoms are related to the vaccines. Characterizing such cross-board patterns is of importance on its own for better understanding the VAERS data, and would provide insights for developing statistical models and procedures for further analysis of VAERS data. In particular, the characterization of cross-board patterns is in fact a method of using all available data together to deal with the big complication problem in VAERS data caused by aforementioned multiple listing of vaccines and adverse symptoms in a single report; that is one single report with multiple listing makes it impossible for us to know exactly which symptom was triggered by which vaccine, but putting all reports with related information together can lead us to identify cross-board patterns on the relationship between vaccines and adverse symptoms.
In this article, a partial VAERS dataset is considered for characterizing the cross-board patterns of the relationship between all reported vaccines and all reported adverse symptoms or events. We propose a neighboring method to process the raw VAERS data, and we analyze this processed large dimension dataset via novel utilization of data visualization techniques [5] developed for the big data analysis.

Data processing
As mentioned above, the original VAERS data at FDA site are not ready for statistical analysis without being processed. Here, for the study of causal relationship between all reported vaccines and all reported events or symptoms, we consider a partial dataset of VAERS data which was based on all 407,453 reports filed over a period of 24 years between 1990-2013. This partial dataset is processed using our proposed neighboring method into the following form of n = 277, 698 vectors:

Neighboring Method
For the case of a report with multiple listing of vaccines and events or symptoms as aforementioned, it is processed as follows. If a report lists vaccines A and B and lists symptoms or events C, D and E, each of symptoms C, D and E is counted once for each of vaccines A and B, respectively, for frequency variable W in Eq. (1). The description and rationale of our proposed neighboring method are: (i) From this one single report, we do not know whether symptom C was triggered by vaccine A or vaccine B or both; the same goes with symptoms D and E; (ii) Because of (i), we count the occurrence of symptom C under vaccine A once, adding 1 into the corresponding frequency variable W in Eq. (1); also count the occurrence of symptom C under vaccine B once; and do the same for symptoms D and E for the same reasons; (iii) The resulting processed data in the form of Eq. (1) as a whole allow us to use all reports including, say, symptom C and vaccine A, to study the cross-board patterns of the relationship between all reported vaccines and all reported adverse symptoms, which contain symptom C and vaccine A as a pair. This is the idea of using all neighboring information to study the relation of a particular pair.

Additional Notes
Some of the VAERS reports considered in our studies here contain errors or incomplete information. For instance, some reports list the vaccine as "unknown", thus these reports are excluded in some parts of our data analysis. Also, among the reported events or symptoms, some of them are adverse, while some are not considered to be adverse, such as drug ineffective, inappropriate schedule of drug administration, unevaluable event, wrong drug administration, full blood count, full blood count normal, etc. In the parts of our analysis on the relationship between the vaccines and the adverse events or symptoms, we exclude those vectors in Eq. (1) if Z is a non-adverse event or symptom.

Top 100 Adverse Symptoms
Due to the large size of the dataset being considered in this research and due to our limited computing power, parts of our analysis here focus on the cross-board patterns of how those most frequently occurred adverse symptoms or events are related to the vaccines, because it would take several weeks to complete just one explorative data visualization plot for all 7368 symptoms due to its large dimension. Specifically, excluding those non-adverse events or symptoms aforementioned, the top 100 adverse symptoms or events with highest overall occurrence frequencies in the processed VAERS dataset (1) are identified and listed in Table 1, where Z 1 is the adverse symptom with the highest occurrence frequency in the dataset, Z 2 is the adverse symptom with the 2nd highest occurrence frequency in the dataset, and so forth; and FQ i is the total occurence frequency for symptom Z i . Hereafter in this article, these are referred as the top 100 adverse symptoms. We note that among top 107 events or symptoms with highest overall occurrence frequencies, seven are nonadverse, thus Table 1 does not include these 7 non-adverse events.

Data visualization and statistical analysis
In addition to the large size issue, the analysis of VAERS data deals with nominal variables such as vaccines and events or symptoms; in particular, the symptom is a nominal variable of very large dimension. Here, we use data visualization methods in our studies.
For an initial data visualization, we consider all different n = 7368 events or symptoms reported in processed VAERS dataset (1) and arrange them according to the alphabetical order: E 1 , E 2 , · · · , E n . We denote all reported 72 vaccines according to the following order: where V 1 , · · · , V 24 are alphabetically ordered 24 bacteria vaccines, V 25 , · · · , V 62 are alphabetically ordered 38 virus vaccines, V 63 , · · · , V 71 are alphabetically ordered 9 bacteria/virus combined vaccines, and V 72 represents the vaccine listed as unknown. For each vaccine V k , we obtain the frequency vector X k = (X k1 , X k2 , · · · , X kn ), where n = 7, 368 and X ki is the total number of times that event E i was reported for vaccine V k . Based on these 72 vectors X k , we compute the rotated 7368 × 7368 matrix of sample correlation coefficients: whereX i is the sample mean of X 1,i , · · · , X 72,i , andρ ij is the sample correlation coefficient of symptoms E i and E j . This matrix is displayed in Fig. 1a, where red dots represent for thoseρ ij > 0.01, white dots for |ρ ij | ≤ 0.01, and blue dots forρ ij < −0.01. Throughout this article, all matrices are displayed as the rotated version of the conventional matrix, i.e., with the bottom row of the conventional matrix as the top row here. Obviously, Fig. 1a shows no informative patterns about the dataset.

Fig. 1 Correlation matrix of all reported events
Next, we denote all reported symptoms or events in VAERS data (1) by: E 1 , E 2 , · · · , E n , where E 1 is the symptom or event with the highest occurrence frequency in the dataset, E 2 is the symptom or event with the 2nd highest occurrence frequency in the dataset, and so forth. For each vaccine V k in (2), we obtain the frequency vector where Y ki is the total number of times that event E i was reported for vaccine V k . Based on such 72 vectors Y k , we compute the rotated matrix of sample correlation coefficientsρ Y ij using the formula in (3) for Y ki 's, whereρ Y ij is the sample correlation coefficient of symptoms E i and E j . This matrix is displayed in Fig. 1b, where the colored dots have the same meaning forρ Y ij as for those in Fig. 1a. In addition, Fig. 1c displays the matrix of Fig. 1b with 20 different colors to illustrate the values of the sample correlation coefficientsρ Y ij , where green color corresponds to values ofρ Y ij around 0, color from green to red corresponds toρ Y ij > 0, and color from green to blue corresponds toρ Y ij < 0. Interestingly, such a method of data visualization clearly indicates cross-board patterns.
For the study of the cross-board patterns on the relationship between the vaccines and the adverse events or symptoms, we consider the top 100 adverse symptoms Z 1 , · · · , Z 100 listed in Table 1, and consider the vaccines V 1 , · · · , V 71 listed in (2); that is in our analysis hereafter we exclude those vectors in processed VAERS dataset (1) that list the vaccine as "unknown".
For each year, we obtain frequency vector F k = (F k,1,1 , · · · , F k,1,100 , F k,2,1 , · · · , F k,2,100 , · · · F k,71,100 ), where k = 1, · · · , 24 represent 24 years between 1990-2013; and F kij is the total number of times that symptom Z j was reported for vaccine V i during year k. Based on these 24 vectors F k , we compute the rotated 7100 × 7100 matrix of sample correlation coefficientsρ ij,lq using the formula  (3) for F kij 's, whereρ ij,lq is the sample correlation coefficient of symptom Z j under vaccine V i and symptom Z q under vaccine V l , thusρ ij,iq is the sample correlation coefficient of symptoms Z j and Z q under vaccine V i . This matrix is displayed in Fig. 2, where the colored dots have the same meaning forρ ij,lq as for those in Fig. 1c.
As indicated by solid lines, the matrix in Fig. 2  Due to the order of vaccines V i 's in (2), the bold dashed lines separate the matrix of Fig. 2 into 9 big block matrices, among which the square block matrix in the bottom left, displayed separately in Fig. 3, is the matrix of sample correlation coefficients of top 100 adverse symptoms under all 24 different bacteria vaccines; and the square block matrix in the middle, displayed separately in Fig. 5, is the the matrix of sample correlation coefficients of top 100 adverse symptoms under all 38 different virus vaccines.
In Fig. 4, the top are block matrices M 16,22 and M 22,16 in Fig. 3, and the bottom are block matrices M 16,21 and M 21,16 in Fig. 3. Due to better picture resolution reason, these block matrices clearly show that equation M ij = M ji holds. The two block matrices on the top of Fig. 4 are among those mostly green-blue colored block matrices in Fig. 3, while the two block matrices on the bottom are the  very few non-diagonal block matrices in Fig. 3 that are mostly red colored. Figure 6 contains the block matrices M ij of Fig. 5 for i, j = 3, 4, 5, 6, which are the correlation matrices for the top 100 adverse symptoms under 4 different flu vaccines: FLU, FLU(H1N1) , FLUN and FLUN(H1N1).
For the study of the relations between vaccine-adverse events and attributes of vaccines, such as live attenuated vaccine vs. killed inactivated vaccine, Fig. 7 Figure 1b shows that over all reported vaccines, those reported events or symptoms (adverse or non-adverse) with overall high occurrence frequencies are positively correlated, while those with low occurrence frequencies are negatively correlated. In comparison, the blue area of Fig. 1b mostly shows green color in Fig. 1c, which, by color design, indicates that the low-occurrence events or symptoms are mostly uncorrelated. Figure 3 shows that the top 100 adverse symptoms listed in Table 1 are mostly uncorrelated or negatively correlated under different bacteria vaccines. Also, the big rectangular block matrix in the bottom middle of Fig. 2 outlined by the bold dashed lines are mostly green-blue colored, except the row block #16 (bacteria vaccine MNQ), which indicates that the top 100 adverse symptoms under bacteria vaccines are mostly uncorrelated or negatively correlated with the top 100 adverse symptoms under virus vaccines. Figures 5 and 6 show that the top 100 adverse symptoms are in many cases positively correlated under different virus vaccines, especially under flu vaccines. In particular, Fig. 6 shows that the top 100 adverse symptoms are strongly positively correlated under vaccines FLU and FLUN, and they are even more strongly positively correlated under vaccines FLU(H1N1) and FLUN(H1N1). Figures 7 and 8 show that under different live or inactive vaccines, the top 100 adverse symptoms are in some cases positively correlated and in some cases negatively correlated, because in both figures many mostly red or mostly

Summary
The results of our analysis indicate: (a) Over all reported vaccines, those events or symptoms (adverse or non-adverse) with overall high occurrence frequencies are positively correlated, while those with low occurrence frequencies are uncorrelated; (b) Those most frequently occurred adverse symptoms or events are mostly uncorrelated or negatively correlated under different bacteria vaccines, but they are in many cases positively correlated under different virus vaccines, especially under flu vaccines; (c) Under different live or inactive vaccines, those most frequently occurred adverse symptoms or events are in some cases positively correlated and in some cases negatively correlated.

Discussion
The FDA VAERS database provides useful information for the analysis of the relations between the vaccines and the adverse events or symptoms. However, the dataset is huge, includes reports with multiple listing of vaccines and adverse symptoms in a single report, and contains reports with errors or incomplete information. Using our proposed neighboring method for processing the raw VAERS data coupled with novel and proper utilization of data visualization techniques (arbitrary use of data visualization obviously does not work, eg., Fig. 1a), here we conclusively reveal some interesting cross-board patterns For instance, our finding of the low-occurrence events or symptoms' being mostly uncorrelated may be interpreted as that the rarely occurred events or symptoms are mainly vaccine-specific, they generally are not associated among one another, thus are not onset as a cluster. Also, although Fig. 3 shows that the top 100 adverse symptoms are mostly uncorrelated or negatively correlated under different bacteria vaccines, the block matrices M 16,21 and M 21,16 in Fig. 4 show that they are, as an isolated case, In addition to the differences between live vs inactivated vaccines and between bacterial and viral vaccine types which have been considered in this article, other factors such as whole organism vs subunit vaccines, etc., may also affect the outcome of adverse events or symptoms. Further investigation and data analysis on VAERS data are needed.

Conclusions
In this article, we identify certain cross-board patterns of the relationship between the vaccines and the reported events or symptoms via the combined approaches based on our proposed neighboring method and novel utilization of data visualization techniques. This is useful for better understanding the VAERS data, and shows that the data visualization method, if used properly, can serve as a helpful tool for big data analysis problems involving large dimension nominal variables. Moreover, what is discovered in this article provides a needed starting point for the development of statistical models and procedures to further analyze the VAERS data. In fact, a statistical methodology paper (Ren and Sun: An empirical likelihood based NROC classification procedure, in preparation) based on the results here is forthcoming. The ultimate goal is using reliable statistical analysis to help detect and monitor the adverse events or symptoms after vaccination in the years to come.