An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors
© Jafari and Azuaje. 2006
Received: 06 March 2006
Accepted: 21 June 2006
Published: 21 June 2006
The analysis of large-scale gene expression data is a fundamental approach to functional genomics and the identification of potential drug targets. Results derived from such studies cannot be trusted unless they are adequately designed and reported. The purpose of this study is to assess current practices on the reporting of experimental design and statistical analyses in gene expression-based studies.
We reviewed hundreds of MEDLINE-indexed papers involving gene expression data analysis, which were published between 2003 and 2005. These papers were examined on the basis of their reporting of several factors, such as sample size, statistical power and software availability.
Among the examined papers, we concentrated on 293 papers consisting of applications and new methodologies. These papers did not report approaches to sample size and statistical power estimation. Explicit statements on data transformation and descriptions of the normalisation techniques applied prior to data analyses (e.g. classification) were not reported in 57 (37.5%) and 104 (68.4%) of the methodology papers respectively. With regard to papers presenting biomedical-relevant applications, 41 (29.1%) of these papers did not report on data normalisation and 83 (58.9%) did not describe the normalisation technique applied. Clustering-based analysis, the t-test and ANOVA represent the most widely applied techniques in microarray data analysis. Remarkably, however, only 5 (3.5%) of the application papers included statements on, or references to, assumptions about variance homogeneity for the application of the t-test and ANOVA. There is still a need to promote the reporting of the software packages applied and their availability.
Recently-published gene expression data analysis studies may lack key information required for properly assessing their design quality and potential impact. There is a need for more rigorous reporting of important experimental factors such as statistical power and sample size, as well as the correct description and justification of statistical methods applied. This paper highlights the importance of defining a minimum set of information required for reporting on statistical design and analysis of expression data. By improving practices of statistical analysis reporting, the scientific community can facilitate quality assurance and peer-review processes, as well as the reproducibility of results.
The analysis of large-scale gene expression has become a fundamental approach to functional genomics, the identification of clinical diagnostic factors and potential drug targets. DNA microarray technologies provide exciting opportunities for analysing the expression levels of thousands of genes simultaneously. A fundamental objective in microarray data analysis is to identify a subset of genes that are differentially expressed between different samples (e.g. conditions, treatments or experimental perturbations) of interest. However, despite the exponential growth of these studies published in journals, relatively little attention has been paid to the task of reporting important experimental design and analysis factors. Nowadays, researchers, clinicians and decision makers rely on such publications, and implicitly on the peer review process, to assess the potential impact of research, reproduce findings and further develop the research area. Information on experimental design and the correct use of statistical methods is fundamental to aid the community in correctly accomplishing their interpretations and assessments.
Over the past few decades the medical research disciplines, especially the area of clinical trials, have widely emphasised the importance of rigorous experimental design, statistical analysis implementation and the correct use of statistics in peer-reviewed publications [2–6]. Although the general understanding of basic statistical methods (e.g. t-test, ANOVA) has improved in these disciplines, some errors regarding their sound application and reporting can still be found. For instance, the t-test and ANOVA are fairly robust to moderate departures from their underlying assumptions of normally-distributed data and equality of variance (homogeneity) except in the presence of very small or unequal sample sizes, which can considerably decrease the statistical power of the analyses [7–10]. In order to promote a more rigorous application and reporting of data analyses in the area of clinical trials, the Consolidated Standards of Reporting Trials (CONSORT) have been adopted. CONSORT has significantly assisted researchers in improving the design, analysis and reporting of clinical trials. This is an example of how a community-driven effort can help to improve the reporting of scientific information. Moreover, this instrument has been shown to help authors, reviewers, editors and publishers to improve the readers' confidence in the scientific quality, relevance and validity of the studies published. We and others argue [12, 13] that there is still a need for more rigorous approaches to reporting information relevant to gene expression data analysis. Therefore, it is important to have a closer look at the level achieved by recently published papers in connection to fundamental factors for correctly justifying, describing and interpreting data analysis techniques and results.
The main objective of this investigation is to assess the reporting of experimental design and statistical methodologies in recently published microarray data analysis studies. Among the experimental design factors under study are sample size estimation, statistical power and normalisation. This paper also provides insights into the design of studies based on well-known statistical approaches, such as t-test and ANOVA. Our research also examined how papers present fundamental statistical justifications or assumptions for the correct application of the t-test and ANOVA, which are widely applied to gene expression data analysis.
PubMed was used to identify papers presenting results on gene expression data analysis between 2003 and 2005 using "gene expression data" as the query expression. A manual selection process was implemented in which the following categories of papers were excluded: a) review articles; b) commentaries and brief communications; and c) editorial notes including correspondence to editors. Furthermore, we excluded papers concentrating on: a) Web servers, b) databases, and c) software tools. Full papers were then obtained from different journals [see Additional file 1]. The reporting of the following factors was examined: a) type of study (two main types: papers focused on the presentation of new analysis methodologies and biomedical-relevant applications); b) reporting of methods of sample size calculation and statistical power; c) reporting of data standardisation (i.e. normalisation) and method of normalisation applied; d) description of data analysis techniques applied; e) discussion about missing values; f) explicit statement of directionality (i.e. one-sided or two-sided test); g) explicit statement of hypothesis and alternative; and h) reference to software tools applied for implementing data analyses. In this study application papers refer to any paper whose main contribution is the generation or testing of biological or biomedical hypotheses, including potential diagnostic, prognostic and therapy design applications, as well as biologically-relevant discoveries. Methodology articles emphasise the presentation of a novel, problem-specific (experimental or computational) method or procedure, which may drive a biologically-relevant advancement or discovery.
In connection to the description of data analysis techniques applied, we concentrated on the assessment of techniques or models that were fundamental to obtain key findings in the application and methodology papers. With regard to the discussion of missing data estimation methods, we targeted the application of previously-published imputation or estimation methods.
Definition of factors assessed in gene expression data analysis papers.
Brief definition or question of interest
Estimation of the number of arrays required in order to identify significantly, differentially expressed genes.
Ability of a study to detect a true difference between genes, biological category or condition
Does the paper report normalisation of data? (yes or no)
Does the paper describe how sources of variation were removed or data standardisation method, e.g. total intensity normalisation, normalisation using regression techniques, normalisation using ratio statistics etc.
Explicit statement of directionality of the statistical test applied, i.e. one-sided or two-sided test
Hypothesis and alternative: Explicit statement of null (H0) or alternative hypothesis (H1)
Report of missing values, report of estimation of missing values or description of method for estimating missing values.
Which software, programs or tools were used for statistical analysis?
Which statistical approaches were used for gene expression data analysis?
Homogeneity of variances: Does the paper report the equality of variances assumption for the application of ANOVA and t-test?
We reviewed papers published in Medline-indexed journals. Among these papers 152 (51.9%) concentrated on the presentation of new methodologies for gene expression data analysis, and 141 (48.1%) papers mainly contributed application studies, e.g. discoveries directly relevant to molecular biology and clinical studies. The definition of these paper categories was provided above.
Table: Reporting of normalisation, and description of the method of normalisation, in published methodology and application papers.
Table: Main types of statistical methods applied in microarray data analysis studies. Columns: application papers (%), methodology papers (%). Row labels recovered: mixed classification models, fuzzy logic methods, time series analysis.
Our results showed that among the methodology and application papers, 133 (87.5%) and 115 (82%) did not report the directionality of the tests (one-sided or two-sided test) respectively. Also among the methodology and application papers, only 19 (12.5%) and 26 (18%) included discussions about missing values (report of missing values, estimation of missing values or description of methods for missing value estimation) respectively. Explicit statements of hypothesis and alternative hypothesis were reported in only 43 (28%) and 29 (20.6%) methodology and application papers respectively. In addition, of the 141 application papers, only 52 (36.9%) included sections or sub-sections to describe data analysis methods applied.
The reporting of software tools or programs for data analysis is summarised in Table 3.
The most applied software tools
Our assessment suggests that published papers lack relevant information regarding the determination of sample sizes and statistical power in microarray data analysis studies. These studies often involve hundreds or thousands of genes and only a fraction of genes are expected to be differentially expressed. Therefore, genes that do not show clear patterns of differential expression are filtered out, by performing statistical group comparisons. However, if the subjects or arrays (sample size) have not been properly estimated before the statistical comparisons (e.g. ANOVA or t-test) then spurious predictions and type II errors (β) can be seriously misleading. In fact, undetected significant differences may be explained by a lack of statistical power for detecting true differences between genes or as a result of inadequate sample sizes (subjects or arrays). Our study showed that very few research studies (i.e. either methodology or application papers) discuss power and sample size requirements in microarray experiments, which are fundamental factors to accomplish the validation of the statistical analyses [15–26].
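As an illustration of the kind of calculation a paper could report, the following minimal sketch estimates the number of arrays per group needed to detect a given standardised difference for a single gene. It uses the standard normal approximation for a two-sided, two-sample comparison, which slightly underestimates the requirement for small groups. The function name, its parameters and the use of a standardised effect size are assumptions made for illustration; they are not taken from the reviewed papers or from the methods cited above.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate arrays per group for a two-sided, two-sample comparison.

    effect_size: standardised difference (mean difference / common SD).
    alpha: per-gene significance level (type I error rate).
    power: 1 - beta, the probability of detecting a true difference.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    # Halving the effect size roughly quadruples the required group size.
    for d in (1.0, 0.5):
        print(d, samples_per_group(d))
    # A stricter per-gene alpha (e.g. after correcting for 10,000 genes)
    # increases the requirement further.
    print(samples_per_group(1.0, alpha=0.05 / 10_000))
```

Reporting the assumed effect size, significance level and target power alongside the resulting sample size would allow readers to judge whether undetected differences might reflect insufficient power.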
Our review also shows that although classic ANOVA and the t-test are widely applied to the analysis of gene expression data, fundamental statistical assumptions, such as the homogeneity of variances, are seldom mentioned. Therefore, even if we ignore the constraints defined by small sample size in the application of ANOVA and t-test, these papers fail to justify their application on the basis of their assumptions of homogeneity of variance. Researchers also have the option of implementing other statistical significance tests that may relax the assumption of homogeneity of variance. Researchers should also be aware of the limitations of the classic t-test and ANOVA methods for detecting differential expression patterns, e.g. statistical power and detection of spurious relations. Therefore, relatively more powerful and reliable alternatives may be carefully considered, such as distribution-free tests, linear models with empirical Bayes corrections or other significance analysis techniques for gene expression data.
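One widely known test that relaxes the equal-variance assumption is Welch's unequal-variance t-test. The sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom for one gene measured in two groups. It is offered only as an example of such an alternative, not as the approach recommended by the works cited above; the function name and the input layout (two lists of expression values) are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t, df) for Welch's two-sample t-test with unequal variances."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(group_a) / na
    se2_b = variance(group_b) / nb
    if se2_a + se2_b == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


if __name__ == "__main__":
    control = [1.0, 2.0, 3.0, 4.0, 5.0]
    treated = [2.0, 4.0, 6.0, 8.0, 10.0]  # same shape, four times the variance
    print(welch_t(control, treated))
```

When the two groups have equal sizes and equal sample variances, the degrees of freedom reduce to those of the classic t-test (n_a + n_b - 2); with unequal variances they shrink, which makes the test more conservative.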
Furthermore, our results indicate that gene expression data analysis papers should provide additional information on data normalisation methods applied. This important data analysis reporting task deserves more attention in order to support a more accurate interpretation and reproducibility of results. Although previous research has suggested relatively high robustness of microarray data analysis to different types of normalisation techniques, more evidence clearly indicates that prediction outcomes can be significantly affected by the selection of normalisation methods [29–32]. Therefore, we argue that authors should not only indicate that their data have been normalised, but should also provide details on the normalisation method applied and its assumptions.
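To make concrete what a described normalisation method looks like, the sketch below implements total intensity normalisation, one of the techniques named in the factor definitions above. Each array is rescaled so that its summed intensity equals the mean total across arrays. The underlying assumption, which a paper should state, is that the overall amount of signal is comparable between arrays. The function name and the list-of-lists data layout are hypothetical.

```python
def total_intensity_normalise(arrays):
    """Rescale each array so that its total intensity equals the mean total.

    arrays: list of arrays, each a list of non-negative intensities measured
    for the same genes in the same order.
    """
    if not arrays:
        raise ValueError("at least one array is required")
    totals = [sum(array) for array in arrays]
    if any(total <= 0 for total in totals):
        raise ValueError("every array must have a positive total intensity")
    target = sum(totals) / len(totals)  # common total shared by all arrays
    return [
        [value * target / total for value in array]
        for array, total in zip(arrays, totals)
    ]


if __name__ == "__main__":
    # The second array is uniformly twice as bright as the first.
    raw = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
    print(total_intensity_normalise(raw))
```

Because regression-based or ratio-statistic methods would yield different normalised values from the same raw data, naming the method and its assumptions is necessary for others to reproduce the analysis.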
Our findings show that only 45 (15.4%) methodology and application papers explicitly discussed issues relating to missing values, e.g. sources and estimation methods. Gene expression data often contain missing expression values, which may require the application of missing data estimation or imputation techniques to obtain a complete matrix of expression values. As in the case of data normalisation, authors should report not only on missing values, but also on their approaches to dealing with them. Again this is a crucial factor because different estimation methods may have different effects on the same dataset [40–42, 54]. Our results also stress the need to continue encouraging authors to provide adequate descriptions of the software tools or resources applied to implement their data analyses. For instance, 53 (18.1%) of the application and methodology papers examined did not provide any information on the software package or programs used to implement their statistical analyses.
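The sketch below shows one very simple way to complete an expression matrix: each missing value is replaced by the mean of the observed values for the same gene. Row-mean imputation is used here purely as a baseline for illustration and is not one of the estimation methods evaluated in the cited studies. The function name and the convention of marking missing values with None are assumptions. The function also returns the number of imputed entries, which is the kind of figure a paper could report.

```python
def impute_row_mean(matrix):
    """Fill missing entries (None) with the mean of the observed values in the row.

    matrix: list of rows, one row per gene and one column per array.
    Returns (completed_matrix, number_of_imputed_values).
    """
    completed = []
    n_imputed = 0
    for row in matrix:
        observed = [value for value in row if value is not None]
        if not observed:
            raise ValueError("a gene with no observed values cannot be imputed")
        fill = sum(observed) / len(observed)
        n_imputed += len(row) - len(observed)
        completed.append([fill if value is None else value for value in row])
    return completed, n_imputed


if __name__ == "__main__":
    expression = [
        [1.0, None, 3.0],  # one missing measurement for this gene
        [2.0, 2.0, 2.0],
    ]
    print(impute_row_mean(expression))
```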
Finally, our review suggests that the above reporting practices may be improved by encouraging authors to provide separate sections or sub-sections focusing on data analysis. Only 36.9% of the application papers, for example, included a section dedicated to these aspects, i.e. detailed discussion of methods, tools, assumptions. A section (or sub-section) on statistical methods should clearly state, for instance, how the sample size was estimated and how the data were analysed in relation to each of the objectives and underlying biological and statistical assumptions made. Such a section should also include information about statistical software or tools applied for data analysis (e.g. origin and availability) and the directionality of the statistical tests applied.
Although this study did not aim to analyse the possible causes of such relative lack of statistical information reporting standards, it is necessary to stress the importance of ensuring the participation of statisticians in both the design and analysis phases of gene expression studies. However, in some cases this may be accomplished only if adequate provisions and decisions are made during the project formulation and funding assessment phases (i.e. adequate budget considerations should be made to achieve such participation). An interpretation of the results on the reporting of test directionality should also take into account that for many authors it may be common practice not to report test directionality as they may assume that two-sided directionality is the default setting. However, this assumption should not be used to justify the lack of more rigorous reporting practices, which are commonly adopted in other statistics-driven areas, such as medical sciences, epidemiology and clinical trials.
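The practical consequence of leaving directionality unstated can be seen in a small sketch: the same test statistic yields different P-values depending on whether the test is one-sided or two-sided. For simplicity the example assumes a standard normal reference distribution (a z statistic) rather than a t distribution; the function name and the labels used for the alternatives are hypothetical.

```python
from statistics import NormalDist


def z_test_pvalue(z, alternative="two-sided"):
    """P-value for a z statistic under the stated alternative hypothesis.

    alternative: "two-sided", "greater" (upper tail) or "less" (lower tail).
    """
    cdf = NormalDist().cdf
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(z)))
    if alternative == "greater":
        return 1 - cdf(z)
    if alternative == "less":
        return cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")


if __name__ == "__main__":
    z = 1.8
    # For a positive statistic the two-sided P-value is twice the upper-tail
    # one, so a result can fall on either side of a 0.05 threshold depending
    # on a choice that the paper never states.
    print(z_test_pvalue(z, "two-sided"), z_test_pvalue(z, "greater"))
```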
It is also necessary to recognise that the lack of more rigorous reporting standards may be understood in the light of the technical complexities and constraints presented by the area of gene expression data analysis. For example, there is a need for more comprehensive theoretical and empirical studies about the statistical nature of gene expression data in order to help researchers to present deeper discussions on sample size and power analysis. In relation to these factors, one may also argue that, unlike the clinical sciences domain, there is a lack of accepted, comprehensively-validated methods tailored to gene expression data. Therefore, it is fundamental to promote deeper investigations and the generation of robust, user-friendly tools to assist researchers in their approaches to the discussion of these factors.
More investigations on the application and reporting of other important experimental procedures, such as sample pooling prior to hybridization, are required. It has been shown that pooling may significantly affect the quality of data analysis. Our review showed that only 13 (8.6%) methodology and 21 (14.9%) application papers reported pooling procedures in their studies. These figures are in general consistent with previous estimates of the number of datasets catalogued in the Gene Expression Omnibus Database using this procedure.
Another fundamental analysis factor that continues to deserve further investigation is the application and reporting of P-value adjustments. Our review revealed that only 15 (10.7%) and 28 (18.4%) of the application and methodology papers respectively explicitly reported the P-value adjustment method applied. For instance, among the 141 application papers, 8 (5.7%) and 7 (5%) papers reported the use of Bonferroni and Benjamini-Hochberg adjustment methods respectively. With regard to the methodology papers (152 in total): 14 (9.2%), 12 (7.9%) and 2 (1.3%) papers reported the application of Bonferroni, Benjamini-Hochberg and Hochberg adjustment methods respectively. The selection of a suitable adjustment method depends on the error rate that one wants to control. For example, for controlling family-wise error rates (FWER) Bonferroni and Hochberg are recommended, but for controlling false discovery rates (FDR) Benjamini-Hochberg may be a more appropriate choice [55, 57–59].
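The difference between the two most frequently reported adjustments can be illustrated with a short sketch. Bonferroni multiplies each P-value by the number of tests, which controls the FWER. The Benjamini-Hochberg step-up procedure scales each P-value by the number of tests divided by its rank and then enforces monotonicity, which controls the FDR. The function names and the toy P-values are hypothetical and serve only to show why the method applied should be named.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted P-values (family-wise error rate control)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (false discovery rate control)."""
    m = len(pvalues)
    # Indices of the P-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that each adjusted
    # value never exceeds the adjusted value of a larger raw P-value.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print("Bonferroni:        ", bonferroni(raw))
    print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

With these four toy values, all adjusted P-values remain below 0.05 under Benjamini-Hochberg, whereas only two do under Bonferroni, so the list of genes declared significant depends directly on the adjustment method.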
Our study may be complemented by other reviews on the correct application of evaluation strategies, such as data sampling and significance interpretation. Additional studies may be useful to assess more specific data analysis components, such as cross-validation techniques for estimating predictive performance of supervised classification models in medical diagnosis and prognosis. To further support a deeper understanding of issues relevant to statistical information reporting, the reader is also referred to [44, 45, 55], which review some of the most representative approaches to analysing gene expression data in different biomedical applications.
Future work may involve an analysis of potentially interesting, significant time-dependent trends relating to statistical information reporting. This may allow the scientific community to assess emergent practices and patterns of knowledge generation and reporting in gene expression data analysis.
Medical research disciplines, especially the area of clinical trials, have placed relatively more emphasis on the reporting of experimental design, statistical analysis implementation and the correct use of statistics in peer-reviewed publications [2–6] in comparison to the current state in gene expression data analysis.
The present survey indicates that the quality and coverage of information regarding experimental design and statistical analysis in gene expression data-driven studies deserve to be improved. The reporting of statistical power, sample size, normalisation and missing data estimation techniques requires a more rigorous treatment. Poor or incomplete reports may significantly affect our capacity to interpret results and assess the relevance and validity of research studies. Moreover, inadequate reporting of statistical analysis information may increase the likelihood of publishing spurious associations or predictions. By paying more attention to these factors authors will be facilitating quality assurance and peer-review processes, as well as the reproducibility of results, which are fundamental factors for the advancement of scientific and technological development, policy and decision making.
Community-driven efforts such as the MIAME (Minimum Information About a Microarray Experiment) protocol may be useful for motivating or guiding the definition of a well-defined set of requirements for reporting fundamental data analysis and experimental statistical design factors. This research calls for greater discussions involving researchers, editors, publishers and decision makers.
This study was conducted while PJ was a visiting researcher at the School of Computing and Mathematics, University of Ulster (UU). PJ's visit was funded by Iran Ministry of Health and Medical Education. We thank Prof. K. Farahmand at UU for inviting and facilitating PJ's visit. We thank the three reviewers for their comments and suggestions to help us to improve the quality of this manuscript. This work was supported in part by a grant from EU FP6, CARDIOWORKBENCH project, to FA.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.