Finding type 2 diabetes causal single nucleotide polymorphism combinations and functional modules from genome-wide association data
© Kang et al.; licensee BioMed Central Ltd. 2013
Published: 5 April 2013
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration show no significance compared to the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
Detection of causal single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWASs) has focused on measuring the statistical power of single SNPs, which have a relatively small effect on predicting disease susceptibility, and has ignored prior biological information about the target disease. Especially in complex diseases such as type 2 diabetes (T2D), the effect of each single SNP is too small to explain the disease association significantly.
To enhance the statistical power, we propose considering combinations of SNPs. Yang et al. discovered that estimates of variance explained by genome-wide SNPs are unbiased with the proportion of SNPs used to estimate genetic relationships in human height. Although SNPs with relatively low statistical power are considered together, the statistical power is not significantly affected. In addition, Park et al. compared the discriminatory power of the risk models in Crohn's disease and breast, prostate, and colorectal (BPC) cancer and found that a risk model with all the predicted susceptibility loci has more discriminatory power than a risk model with only the known susceptibility loci. Therefore, combinations of SNPs that include not only significant SNPs that satisfy the genome-wide significance threshold but also common SNPs that have larger p-values than the genome-wide significance threshold may improve the prediction power of disease risk.
To rank SNPs and find SNP combinations, various methods are applied: Bayes factors, logistic regression [4, 5], Hidden Markov Model (HMM), Support Vector Machine (SVM) [7, 8], and Random Forests (RF) [8–12]. Among the applied standard statistical methods and the machine learning-based methods, RF effectively ranks causal SNPs to detect SNP interactions [13, 14].
RF is known to have a relatively low risk of overfitting compared to other machine learning algorithms. However, if the number of variables is much larger than the number of samples, overfitting can occur. Furthermore, large datasets can greatly increase the computational complexity. Although Maenner et al. and Wang et al. did not apply specific threshold criteria for the GWAS dataset and applied 355,649 SNPs and 530,959 SNPs on RF analysis, respectively, previous causal SNP studies applied various threshold criteria to reduce the number of variables. Roshan et al. ranked T1D causal SNPs using RF and SVM from the Wellcome Trust Case Control Consortium (WTCCC) T1D dataset and the Genetics of Kidneys in Diabetes (GoKinD) T1D dataset by using Bonferroni thresholds. Because of the computational capacity, Liu et al. selected the top 65,000 SNPs, which corresponded to a p-value threshold of 0.13 for SNP interaction screening, and selected 862 SNPs to analyze with RF. To accommodate the computational requirements of SNPInterForest, Yoshida et al. selected the top 10,000 SNPs from a single SNP association analysis. An optimal filtration method is therefore required to avoid overfitting and to reduce the computational complexity.
From T2D GWA studies, approximately 40 causal individual SNPs have been identified. However, the heritability of T2D is not yet fully understood and only about 10% of the T2D risk is explained by the causal SNPs that have been detected so far. In addition, the accuracy of the T2D risk prediction with GWAS datasets from recent studies was approximately 0.55-0.63, which is lower than that of other complex diseases such as T1D, Crohn's disease, rheumatoid arthritis, Alzheimer's disease, and multiple sclerosis. The statistical simulation with heritability indicated that the accuracy of the T2D risk prediction can be improved to 0.8-0.9 if more common SNPs are combined. Therefore, finding the missing heritability by combining common SNPs is essential to discover the T2D mechanism and clinical applications.
To detect T2D causal SNP combinations, Ban et al. tried to find SNP combinations from a dataset containing 408 SNPs in 462 T2D cases and 456 controls using SVM. From p-values that were less than 0.6, a SNP combination with 14 SNPs was selected using a p-value-based filtering method, and the prediction accuracy was 0.653. Ban et al. successfully suggested the possibility of T2D causal SNP combinations, but the size of the dataset was small and the accuracy was not significantly improved.
Recently, we presented T2D causal SNP combinations from an optimal SNP dataset and found the biological meaning of the detected causal SNP combination. Our previous method was one of the first T2D SNP combination-finding studies using RF with biological meaning detection, even though RF has advantages for detecting SNP combinations. To avoid overfitting and to reduce the computational complexity, our previous method applies linkage disequilibrium (LD) pruning and finds an optimal SNP dataset by comparing the error rates of the selected SNP combinations from Bonferroni thresholds and p-value thresholds. Extending our previous research, we apply expanded GSEA not only to T2D causal SNP combinations but also to genome-wide SNPs to find the T2D-associated functional modules, and we compare the prediction error rates of T2D causal SNP combinations from an optimal SNP dataset and from a functional module-based filtration.
Linkage disequilibrium pruning based filtration
T2D association of single SNPs is analyzed by measuring the statistical power of individual markers. The WTCCC GWAS dataset contains 500,567 SNP markers from 1,999 T2D cases and 3,004 controls. Quality control (QC) is applied for single SNP association analysis. To identify and remove poor quality samples, per-individual QC excludes samples with a missing genotype rate > 3%. To identify and remove poor quality SNPs, per-marker QC excludes SNPs with a missing genotype rate > 1%, a minor allele frequency (MAF) < 1%, or a Hardy-Weinberg Equilibrium (HWE) p-value ≤ 10⁻⁴. After QC, 409,656 SNPs remained. For single SNP association analysis, the Cochran-Armitage trend test is applied using PLINK 1.07 (http://pngu.mgh.harvard.edu/purcell/plink/).
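The per-marker QC criteria and the trend test described above can be illustrated with a minimal sketch. It is not PLINK's implementation: the function names `passes_marker_qc` and `cochran_armitage_trend` are hypothetical, the additive weights (0, 1, 2) are an assumption, and the p-value uses the chi-square distribution with one degree of freedom.

```python
from math import erfc, sqrt

def passes_marker_qc(missing_rate, maf, hwe_p):
    """Return True if a SNP survives the per-marker QC thresholds in the text."""
    # Exclusion criteria: missing rate > 1%, MAF < 1%, HWE p-value <= 1e-4.
    return missing_rate <= 0.01 and maf >= 0.01 and hwe_p > 1e-4

def cochran_armitage_trend(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on genotype counts (AA, Aa, aa).

    Returns (chi2, p_value) with one degree of freedom.
    The additive weights (0, 1, 2) are an assumption of this sketch.
    """
    R = sum(case_counts)        # total cases
    S = sum(control_counts)     # total controls
    N = R + S
    col = [r + s for r, s in zip(case_counts, control_counts)]  # genotype totals

    # Trend statistic T = sum_i t_i * (r_i * S - s_i * R)
    T = sum(t * (r * S - s * R)
            for t, r, s in zip(weights, case_counts, control_counts))

    # Variance of T under the null hypothesis of no association
    term1 = sum(t * t * c * (N - c) for t, c in zip(weights, col))
    term2 = 2 * sum(weights[i] * weights[j] * col[i] * col[j]
                    for i in range(len(col)) for j in range(i + 1, len(col)))
    var = (R * S / N) * (term1 - term2)
    if var <= 0:
        return 0.0, 1.0  # monomorphic or degenerate table

    chi2 = T * T / var
    p_value = erfc(sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p_value
```

For example, `cochran_armitage_trend((10, 20, 30), (30, 20, 10))` yields a chi-square statistic of 20 for that toy table.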
To reduce the number of variables, LD pruning is used on 409,656 SNPs after the single SNP association analysis. Reducing the number of variables has advantages for reducing the computational complexity and avoiding the possibility of overfitting. In addition, to map SNP combinations at the gene level and the functional module level, LD pruning is an effective solution to avoid overestimation of a specific gene or a functional module containing multiple SNPs. The most significant SNP is selected among the SNPs in LD with r² > 0.8 within 1 Mb. After LD pruning, 42,798 SNPs were selected for the SNP combination analysis.
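A greedy sketch of this pruning rule follows. It is an illustration under stated assumptions, not the tool the study used: SNPs are visited from most to least significant, r² is taken as the squared Pearson correlation of genotype codes (0/1/2), and the record layout (`id`, `chrom`, `pos`, `p`, `genotypes`) and the function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)

def ld_prune(snps, r2_threshold=0.8, window_bp=1_000_000):
    """Keep the most significant SNP of every group in LD within the window.

    Each SNP is a dict with hypothetical keys:
    'id', 'chrom', 'pos', 'p' (association p-value), 'genotypes'.
    """
    kept = []
    # Most significant SNPs are considered first, so they win their LD group.
    for snp in sorted(snps, key=lambda s: s["p"]):
        in_ld_with_kept = any(
            k["chrom"] == snp["chrom"]
            and abs(k["pos"] - snp["pos"]) <= window_bp
            and r_squared(k["genotypes"], snp["genotypes"]) > r2_threshold
            for k in kept
        )
        if not in_ld_with_kept:
            kept.append(snp)
    return kept
```

The pairwise scan is quadratic in the number of retained SNPs; a genome-scale run would restrict comparisons to a sliding positional window.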
Finding SNP combinations from an optimal SNP dataset
To reduce the computational complexity, the proposed method finds an effective threshold of the significance of SNPs by comparing the error rates of the SNP combinations from the Bonferroni threshold, the p-value rank-based threshold and the p-value range-based threshold criteria. Based on the Bonferroni threshold criteria, r is defined as the number of SNPs within the corrected p-value threshold that is determined as 0.05 divided by the total number of SNPs. The Bonferroni thresholds r, 2r, 5r, and 10r are applied to select SNP datasets. Based on the p-value rank-based threshold criteria, 500 SNP sets that are generated from the SNPs of p-value ranks 1-500, which are calculated using a cumulative approach, are fully tested to find the patterns of error rates. Furthermore, the p-value range-based threshold criteria are applied with p-value cutoffs of 0.01, 0.05 and 0.1 to 1.0 based on a cumulative approach. The error rates of SNP combinations from the SNP datasets with various threshold criteria are compared to find the optimal SNP dataset.
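The three threshold families can be sketched as follows. The helper names are hypothetical, and two details are assumptions not stated in the text: the range-based cutoffs between 0.1 and 1.0 are taken in steps of 0.1, and "within the threshold" is read as a strict inequality.

```python
# Assumed cutoffs: 0.01, 0.05, then 0.1 to 1.0 in steps of 0.1 (step is an assumption).
DEFAULT_CUTOFFS = (0.01, 0.05) + tuple(k / 10 for k in range(1, 11))

def ranked_indices(pvalues):
    """SNP indices ordered from most to least significant."""
    return sorted(range(len(pvalues)), key=lambda i: pvalues[i])

def bonferroni_sets(pvalues, alpha=0.05, multipliers=(1, 2, 5, 10)):
    """Top r, 2r, 5r and 10r SNPs, where r counts SNPs passing alpha / m."""
    cutoff = alpha / len(pvalues)
    r = sum(1 for p in pvalues if p < cutoff)
    ranked = ranked_indices(pvalues)
    return {m: ranked[: m * r] for m in multipliers}

def rank_sets(pvalues, max_rank=500):
    """Cumulative sets: top 1, top 2, ..., top max_rank SNPs."""
    ranked = ranked_indices(pvalues)
    return [ranked[:k] for k in range(1, min(max_rank, len(ranked)) + 1)]

def range_sets(pvalues, cutoffs=DEFAULT_CUTOFFS):
    """Cumulative sets of all SNPs with a p-value below each cutoff."""
    return {c: [i for i, p in enumerate(pvalues) if p < c] for c in cutoffs}
```

Each returned set would then be passed to the RF variable selection step, and the set giving the lowest error rate taken as the optimal SNP dataset.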
The RF algorithm is an ensemble classifier that aggregates multiple classification trees into one classifier. Each classification tree is generated from a bootstrapped sample set, and the Gini index is measured for splitting. RF is selected to find SNP combinations because of its effective performance in ranking the causal SNPs. In addition, RF can detect SNPs with small statistical power because separate models are automatically fit to subsets of data from early splits in the tree. To find a SNP combination from a GWAS dataset, RF with a variable selection algorithm is applied. The R package varSelRF can select very small sets of features such as SNPs or genes that retain high predictive accuracy.
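To make the splitting criterion concrete, the sketch below computes the Gini impurity of a set of case/control labels and the impurity decrease obtained by splitting on one SNP's genotype. The function names are hypothetical and the binary split on a genotype threshold is an assumption chosen for illustration; it shows the criterion only, not a full random forest.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease when samples are split on genotype <= threshold."""
    n = len(labels)
    if n == 0:
        return 0.0
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted
```

A SNP that separates cases from controls perfectly gives the maximum decrease (0.5 for balanced binary labels), whereas an uninformative SNP gives a decrease of zero.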
For the optimal parameter settings of varSelRF for finding a SNP combination from the WTCCC T2D dataset, various values are applied to mtryFactor (the multiplication factor to decide the number of variables for the splitting), ntree (the number of trees for the first forest), ntreeIterat (the number of trees for all additional forests), and vars.drop.frac (the fraction of variables dropped at each iteration). The error rates are not significantly affected by ntree and ntreeIterat when ntree is changed from 500 to 10,000 and ntreeIterat is changed from 200 to 4,000. However, the error rates decrease as the values of vars.drop.frac decrease from 0.35 to 0.2, even though the computation time is greatly increased. Furthermore, mtryFactor values from 0 to 13 are tested, and the error rates are smaller than 0.12 for mtryFactor values between 0.75 and 2. Therefore, the default values of arguments from varSelRF are accepted for the SNP combination analysis: mtryFactor = 1, ntree = 5,000, ntreeIterat = 2,000, and vars.drop.frac = 0.2.
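The effect of vars.drop.frac on computation time can be seen from the number of forests the backward elimination has to fit. The sketch below is a rough illustration in Python rather than the varSelRF code itself; the rounding rule, the minimum of one dropped variable per iteration, and the stopping point of two variables are assumptions, as are the function names. The mtry value follows the documented meaning of mtryFactor as a multiplier of the square root of the number of variables, with flooring assumed.

```python
from math import floor, sqrt

def mtry_for(n_variables, mtry_factor=1.0):
    """Variables tried per split: floor(sqrt(p) * mtryFactor), at least 1 (assumed)."""
    return max(1, floor(sqrt(n_variables) * mtry_factor))

def elimination_schedule(n_variables, drop_frac=0.2, min_vars=2):
    """Number of variables in each successive forest of a backward elimination.

    Assumption: each iteration drops round(drop_frac * remaining) variables,
    at least one, until min_vars variables are left.
    """
    sizes = [n_variables]
    n = n_variables
    while n > min_vars:
        drop = max(1, floor(n * drop_frac + 0.5))  # round half up
        n = max(min_vars, n - drop)
        sizes.append(n)
    return sizes

if __name__ == "__main__":
    # Compare how many forests are fit for the two drop fractions in the text.
    for frac in (0.35, 0.2):
        print(frac, len(elimination_schedule(42798, frac)))
```

Under these assumptions a smaller drop fraction gives a longer schedule, which is consistent with the longer computation time reported for vars.drop.frac = 0.2.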
Mapping the biological meanings of SNP combinations and functional module-based filtration
T2D genes are collected to find the biological meanings of SNP combinations by using gene level mapping. T2D genes are collected from public disease gene databases such as OMIM, KEGG, and GAD. From the DrugBank, KEGG Drug, and PharmGKB Drug databases, 36 T2D drug targets were collected.
To discover the biological meaning and disease mechanism of a SNP combination that consists of SNPs from diverse genes, an expanded gene set enrichment analysis (GSEA) is applied to test the disease association of functionally related genes. Expanded GSEA can help to find the biological processes and pathways underlying complex diseases.
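The details of the expanded GSEA are not given in this text, so the sketch below only shows the classic unweighted enrichment score of standard GSEA as a point of reference: a running sum over a gene list ranked by association strength that steps up at genes in the set and down otherwise. The function name and the toy gene symbols are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA enrichment score (Kolmogorov-Smirnov-like running sum).

    ranked_genes: genes ordered from most to least associated.
    gene_set: collection of genes in the functional module.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score undefined when the set covers none or all genes

    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

Significance is usually assessed by recomputing the score on permuted data; that step is omitted here.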
Table: Databases used for enrichment analysis of expanded functional modules. Columns: Gene Set Category.
Measurement of prediction error rates from random forest analysis
We compare the prediction error rates of SNP combinations and SNP sets from an optimal SNP dataset and from a functional module-based filtration. To measure the prediction error rates from RF, the R package varSelRF is applied. The default argument values (mtryFactor = 1, ntree = 5,000, ntreeIterat = 2,000, and vars.drop.frac = 0.2) are accepted for analyzing the gene sets and SNP sets. In addition to top T2D-associated gene sets, various SNP sets are analyzed with RF to measure the significance of T2D-associated gene sets.
SNP combinations from an optimal SNP dataset
The detection of T2D causal SNP combinations considering common SNPs with low statistical power in single SNP association analysis may perform better in disease risk predictions because common SNPs can explain critical effects if they interact with other SNPs as a SNP combination. An optimal SNP combination can be selected by comparing the error rates from RF analysis with variable selection. To avoid overfitting and to reduce the computational complexity, the proposed method detects an optimal SNP dataset by comparing the error rates of SNP combinations from Bonferroni thresholds and p-value thresholds.
Table: Error rates of SNP combinations from a GWAS dataset with Bonferroni threshold-based cutoff criteria. Columns: Number of SNPs; No. of SNPs in SNP combination.
Table: Error rates of SNP combinations from a GWAS dataset with p-value range-based cutoff criteria. Columns: No. of SNPs in p-value range; No. of selected SNPs in SNP combination.
Biological meaning mappings of SNP combinations
The selected SNP combination from RF analysis contains 101 SNPs that can be mapped to 107 nearby genes (see Additional Table 1 in Additional file 1). To find the biological meaning of the selected SNP combination, multiple levels of information are matched. SOS1 and FTO are found to be T2D-related genes by matching with collected disease genes. In addition, TFB1M was recently revealed as a T2D-related gene with a common variant that is associated with insulin secretion.
Table: Pathway functional modules from SNP combination. Columns: Functional Module Name; # of genes; # of genes in functional module.
Functional modules listed: ErbB1 downstream signaling; Formation of Platelet plug; PDGF signaling pathway; BCR signaling pathway; amine and polyamine degradation; Rho GTPase cycle; Regulation of RAC1 activity; Signaling by Rho GTPases; Thrombin-mediated activation of PARs; Fc epsilon RI signaling pathway; EGF receptor (ErbB1) signaling pathway.
Table: TF-target functional modules and miRNA-target functional modules from SNP combinations. Columns: Functional Module Name; # of genes; # of genes in functional module.
Expanded gene set enrichment analysis of WTCCC T2D dataset
From the 42,798 SNPs filtered from the WTCCC T2D GWAS dataset, 451 T2D-associated gene sets with a p-value threshold of < 0.05 are detected from the expanded GSEA to match with the collected T2D-associated genes. In all, 2,112 of the 3,663 expanded gene sets contain T2D-associated genes. Surprisingly, 441 of the 451 T2D-associated gene sets detected from the expanded GSEA contain T2D-associated genes. Moreover, the average number of T2D-associated genes from the selected 441 gene sets with expanded GSEA is 11.188, whereas the average number of T2D-associated genes from selected 2,121 gene sets from total expanded gene sets is 6.554.
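The contrast in these counts is easier to see as proportions. The short calculation below uses only the numbers quoted in the paragraph; the variable names are illustrative.

```python
def proportion(part, whole):
    """Fraction of gene sets that contain at least one T2D-associated gene."""
    return part / whole

# Gene sets detected by expanded GSEA (p < 0.05) that contain T2D genes.
enriched = proportion(441, 451)
# All expanded gene sets that contain T2D genes.
background = proportion(2112, 3663)
# Ratio of the average number of T2D genes per gene set.
mean_ratio = 11.188 / 6.554

print(f"{enriched:.1%} of detected sets vs {background:.1%} of all sets; "
      f"mean T2D genes per set ratio {mean_ratio:.2f}")
```

About 98% of the detected gene sets contain known T2D-associated genes, compared with about 58% of all expanded gene sets, and the detected sets carry roughly 1.7 times as many T2D-associated genes on average.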
Selected pathways from expanded GSEA with a p-value < 0.05 in the WTCCC T2D dataset
WNT signaling pathway
Calcium signaling pathway
Pathways in cancer
Neuroactive ligand-receptor interaction
MAPK signaling pathway
Cell adhesion molecules (CAMs)
Regulation of RhoA activity
Signalling by NGF
Non-small cell lung cancer
Natural killer cell mediated cytotoxicity
Muscarinic acetylcholine receptor 1 and 3 signaling pathway
Integrin cell surface interactions
Cell Cycle, Mitotic
Notch signaling pathway
Signaling events mediated by focal adhesion kinase
EGF receptor (ErbB1) signaling pathway
ErbB signaling pathway
Neurotrophic factor-mediated Trk receptor signaling
Beta1 adrenergic receptor signaling pathway
Hypoxic and oxygen homeostasis regulation of HIF-1-alpha
Heterotrimeric G-protein signaling pathway-Gq alpha and Go alpha mediated pathway
Cytokine-cytokine receptor interaction
Signaling in Immune system
G-protein mediated events
Thromboxane A2 receptor signaling
FGF signaling pathway
Leukocyte transendothelial migration
Integration of energy metabolism
Measurement of prediction error rates from random forest analysis
Table: RF-based prediction error rates of SNP sets from functional module-based filtration and SNP combinations with various thresholds from the WTCCC T2D dataset. Columns: Functional Module Description; Number of SNPs; Number of Selected SNPs; Error Rate with Variable Selection.
Rows listed: Average of results from 66 functional modules; Average of results from 66 SNP sets consisting of randomly selected 1615 SNPs; Top 5 functional modules; Top 66 functional modules; Top 1 SNP; Top 10 SNPs; SNPs with Bonferroni threshold; SNPs with p-value < 0.01.
Various p-value-based SNP sets were also applied as references. The prediction error rate of the top 1 SNP by p-value rank was 33.93%, and the prediction error rate of the top 10 SNPs was 27.19%, which are relatively lower than the functional module-based prediction error rates when considering 1500-2000 SNPs. On the basis of the Bonferroni threshold criteria, a SNP set of the top 82 SNPs whose p-values passed the Bonferroni correction was selected, and the prediction error rate was 11.76%. The prediction error rate of a SNP set selected with a p-value cutoff of 0.1 was 11.70%, and the prediction error rate of a SNP set with all 42,798 SNPs was 11.47%. Compared with various p-value-based SNP sets, no specific functional module (pathway, GO function, TF-target, miRNA-target, and protein complex) could significantly explain the T2D association alone.
To enhance the predictive power, the combination of the top 5 functional modules and the combination of the top 66 functional modules were applied. However, the prediction error rates were only slightly reduced, although the number of considered SNPs was dramatically increased.
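The random baseline used in this comparison (66 SNP sets of 1,615 randomly selected SNPs) can be drawn as in the sketch below. The seed and function names are hypothetical, and the empirical p-value is only one possible way to compare a module's error rate with the random sets; the text does not state how significance was assessed.

```python
import random

def draw_random_snp_sets(n_snps=42798, set_size=1615, n_sets=66, seed=0):
    """Draw n_sets random SNP index sets without replacement within each set."""
    rng = random.Random(seed)  # fixed seed (hypothetical) for reproducibility
    return [rng.sample(range(n_snps), set_size) for _ in range(n_sets)]

def empirical_p_value(observed_error, random_errors):
    """Share of random sets with an error rate at least as low as the observed one.

    Assumption of this sketch: lower error is better and one is added to
    numerator and denominator so the estimate is never exactly zero.
    """
    as_good = sum(1 for e in random_errors if e <= observed_error)
    return (1 + as_good) / (1 + len(random_errors))
```

Each random set would be analyzed with the same RF settings as the functional module SNP sets, so that the two error rate distributions are comparable.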
Various thresholds, including Bonferroni correction thresholds and p-value-based thresholds, are tested to find the optimal threshold while considering SNPs with low statistical power. SNP combinations that contain SNPs with low statistical power had lower error rates than SNP combinations with only significant SNPs. With the consideration of common SNPs with low statistical power, the disease risk prediction rate can be improved, especially for complex diseases.
Notable disease genes could be found from SNP combinations. From the selected SNP combination, there are still many genes that are not yet identified as T2D-related disease genes. SNP combinations have high statistical power. In addition, the activation or inhibition of a T2D-related pathway could prevent and cure T2D and T2D complications. For example, the Rho GTPase pathway inhibitor can prevent T2D development.
No specific functional module from the T2D-associated gene sets shows significance in T2D development from the RF-based prediction error rates. From the results of the measurement of prediction error rates from RF analysis, we can infer that significant T2D SNPs and genes with high importance are widespread in the genome and are not concentrated in a specific functional module. The expansion of functional modules with protein-protein interaction network may increase the susceptibility of T2D.
To overcome the low statistical power of single SNPs, considering multiple SNPs together becomes a solution for analyzing complex diseases. A T2D causal SNP combination is detected using RF with variable selection from an optimal SNP dataset filtered with a p-value threshold and LD pruning. From the WTCCC T2D GWAS dataset, 101 SNPs are selected as a SNP combination. Not only significant SNPs but also common SNPs with low statistical power are combined as a SNP combination. Mapping the SNP combination at the SNP, gene, and functional module levels gives clues to the relationship with T2D. Functional module-based filtration is also tested using T2D-associated functional modules from genome-wide SNPs, and the results showed no significance compared to randomly selected SNP sets. The proposed method can detect a SNP combination while considering SNPs with low statistical power. Additionally, this method can reveal the biological meaning of the detected SNP combination by mapping functional modules and mapping the T2D-related information at multiple levels including disease genes.
This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) grant No. 2012M3A9C4048759, the Basic Science Research Program through the NRF grant 2012R1A1A2008510, and the NRF grant No. 2012-0001001 funded by the Ministry of Education, Science and Technology (MEST) of Korean government.
This work is based on an earlier work: “Detecting type 2 diabetes causal single nucleotide polymorphism combinations from a genome-wide association study dataset with optimal filtration”, in Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics, 2012 © ACM, 2012. http://doi.acm.org/10.1145/2390068.2390070
The publication costs for this article were partially funded by the Basic Science Research Program through the National Research Foundation of Korea (NRF) grant 2012R1A1A2008510 and the NRF grant No. 2012-0001001 funded by the Ministry of Education, Science and Technology (MEST) of Korean government.
This article has been published as part of BMC Medical Informatics and Decision Making Volume 13 Supplement 1, 2013: Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics (DTMBio 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedinformdecismak/supplements/13/S1.
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al: Common SNPs explain a large proportion of the heritability for human height. Nature genetics. 2010, 42 (7): 565-569. 10.1038/ng.608.PubMed CentralView ArticlePubMedGoogle Scholar
- Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, Chatterjee N: Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nature genetics. 2010, 42 (7): 570-575. 10.1038/ng.610.PubMed CentralView ArticlePubMedGoogle Scholar
- WTCCC Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7145): 661-678. 10.1038/nature05911.View ArticleGoogle Scholar
- Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009, 25 (6): 714-721. 10.1093/bioinformatics/btp041.PubMed CentralView ArticlePubMedGoogle Scholar
- Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies. PLoS Genet. 2008, 4 (7): e1000130-10.1371/journal.pgen.1000130.PubMed CentralView ArticlePubMedGoogle Scholar
- Wei Z, Sun W, Wang K, Hakonarson H: Multiple testing in genome-wide association studies via hidden Markov models. Bioinformatics. 2009, 25 (21): 2802-2808. 10.1093/bioinformatics/btp476.View ArticlePubMedGoogle Scholar
- Ban HJ, Heo JY, Oh KS, Park KJ: Identification of type 2 diabetes-associated combination of SNPs using support vector machine. BMC genetics. 2010, 11: 26-PubMed CentralView ArticlePubMedGoogle Scholar
- Roshan U, Chikkagoudar S, Wei Z, Wang K, Hakonarson H: Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest. Nucleic acids research. 2011, 39 (9): e62-10.1093/nar/gkr064.PubMed CentralView ArticlePubMedGoogle Scholar
- Maenner MJ, Denlinger LC, Langton A, Meyers KJ, Engelman CD, Skinner HG: Detecting gene-by-smoking interactions in a genome-wide association study of early-onset coronary heart disease using random forests. BMC Proceedings. 2009, 3 (Suppl 7): S88-10.1186/1753-6561-3-s7-s88.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang M, Chen X, Zhang M, Zhu W, Cho K, Zhang H: Detecting significant single-nucleotide polymorphisms in a rheumatoid arthritis study using random forests. BMC Proceedings. 2009, 3 (Suppl 7): S69-10.1186/1753-6561-3-s7-s69.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu C, Ackerman HH, Carulli JP: A genome-wide screen of gene-gene interactions for rheumatoid arthritis susceptibility. Human genetics. 2011, 129 (5): 473-485. 10.1007/s00439-010-0943-z.View ArticlePubMedGoogle Scholar
- Yoshida M, Koike A: SNPInterForest: a new method for detecting epistatic interactions. BMC bioinformatics. 2011, 12: 469-10.1186/1471-2105-12-469.PubMed CentralView ArticlePubMedGoogle Scholar
- Molinaro AM, Carriero N, Bjornson R, Hartge P, Rothman N, Chatterjee N: Power of Data Mining Methods to Detect Genetic Associations and Interactions. Human Heredity. 2011, 72 (2): 85-97. 10.1159/000330579.PubMed CentralView ArticlePubMedGoogle Scholar
- Lunetta K, Hayward LB, Segal J, Van Eerdewegh P: Screening large-scale association study data: exploiting interactions using random forests. BMC genetics. 2004, 5 (1): 32-10.1186/1471-2156-5-32.PubMed CentralView ArticlePubMedGoogle Scholar
- Breiman L: Random Forests. 2001, 5-32. 1Google Scholar
- Imamura M, Maeda S: Genetics of type 2 diabetes: the GWAS era and future perspectives. Endocrine journal. 2011, 58 (9): 723-739. 10.1507/endocrj.EJ11-0113.View ArticlePubMedGoogle Scholar
- Herder C, Roden M: Genetics of type 2 diabetes: pathophysiologic and clinical relevance. European journal of clinical investigation. 2011, 41 (6): 679-692. 10.1111/j.1365-2362.2010.02454.x.View ArticlePubMedGoogle Scholar
- Jostins L, Barrett JC: Genetic risk prediction in complex disease. Human molecular genetics. 2011, 20 (R2): R182-188. 10.1093/hmg/ddr378.PubMed CentralView ArticlePubMedGoogle Scholar
- Kang C, Yu H, Yi G-S: Detecting type 2 diabetes causal single nucleotide polymorphism combinations from a genome-wide association study dataset with optimal filtration. Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics. 2012, New York: ACM, 1-8. 10.1145/2390068.2390070.Google Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics. 2007, 81 (3): 559-575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
- Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC bioinformatics. 2006, 7: 3-10.1186/1471-2105-7-3.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu Q, Sung A, Chen Z, Liu J, Chen L, Qiao M, Wang Z, Huang X, Deng Y: Gene selection and classification for cancer microarray data based on machine learning and similarity measures. BMC Genomics. 2011, 12 (Suppl 5): S1-10.1186/1471-2164-12-S5-S1.PubMed CentralView ArticlePubMedGoogle Scholar
- Oyston J: Online Mendelian Inheritance in Man. Anesthesiology. 1998, 89 (3): 811-812. 10.1097/00000542-199809000-00060.View ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research. 2000, 28 (1): 27-30. 10.1093/nar/28.1.27.PubMed CentralView ArticlePubMedGoogle Scholar
- Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database. Nature genetics. 2004, 36 (5): 431-432. 10.1038/ng0504-431.View ArticlePubMedGoogle Scholar
- Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, et al: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic acids research. 2011, 39 (Database): D1035-1041. 10.1093/nar/gkq1126.PubMed CentralView ArticlePubMedGoogle Scholar
- Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE: PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic acids research. 2002, 30 (1): 163-165. 10.1093/nar/30.1.163.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang L, Jia P, Wolfinger RD, Chen X, Zhao Z: Gene set analysis of genome-wide association studies: methodological issues and perspectives. Genomics. 2011, 98 (1): 1-8.View ArticlePubMedGoogle Scholar
- Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, et al: Reactome: a knowledge base of biologic pathways and processes. Genome biology. 2007, 8 (3): R39-10.1186/gb-2007-8-3-r39.PubMed CentralView ArticlePubMedGoogle Scholar
- Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH: PID: the Pathway Interaction Database. Nucleic acids research. 2009, 37 (Database): D674-679. 10.1093/nar/gkn653.PubMed CentralView ArticlePubMedGoogle Scholar
- Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, et al: The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic acids research. 2005, 33 (Database): D284-288.PubMed CentralPubMedGoogle Scholar
- Morgat A, Coissac E, Coudert E, Axelsen KB, Keller G, Bairoch A, Bridge A, Bougueleret L, Xenarios I, Viari A: UniPathway: a resource for the exploration and annotation of metabolic pathways. Nucleic acids research. 2012, 40 (Database): D761-769.PubMed CentralView ArticlePubMedGoogle Scholar
- Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, Gilham F, Kaipa P, Karthikeyan AS, Kothari A, Krummenacker M, et al: The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic acids research. 2010, 38 (Database): D473-479. 10.1093/nar/gkp875.PubMed CentralView ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102 (43): 15545-15550. 10.1073/pnas.0506580102.PubMed CentralView ArticlePubMedGoogle Scholar
- Sun CH, Kim MS, Han Y, Yi GS: COFECO: composite function annotation enriched by protein complex data. Nucleic acids research. 2009, 37 (Web Server): W350-355. 10.1093/nar/gkp331.PubMed CentralView ArticlePubMedGoogle Scholar
- Ruepp A, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Stransky M, Waegele B, Schmidt T, Doudieu ON, Stumpflen V, et al: CORUM: the comprehensive resource of mammalian protein complexes. Nucleic acids research. 2008, 36 (Database): D646-650.PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics. 2000, 25 (1): 25-29. 10.1038/75556.PubMed CentralView ArticlePubMedGoogle Scholar
- Luc PV, Tempst P: PINdb: a database of nuclear protein complexes from human and yeast. Bioinformatics. 2004, 20 (9): 1413-1415. 10.1093/bioinformatics/bth114.
- Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stumpflen V: MPact: the MIPS protein interaction resource on yeast. Nucleic acids research. 2006, 34 (Database): D436-441.
- Weng L, Macciardi F, Subramanian A, Guffanti G, Potkin SG, Yu Z, Xie X: SNP-based pathway enrichment analysis for genome-wide association studies. BMC bioinformatics. 2011, 12: 99. 10.1186/1471-2105-12-99.
- Zhao J, Gupta S, Seielstad M, Liu J, Thalamuthu A: Pathway-based analysis using reduced gene subsets in genome-wide association studies. BMC bioinformatics. 2011, 12: 17. 10.1186/1471-2105-12-17.
- Koeck T, Olsson AH, Nitert MD, Sharoyko VV, Ladenvall C, Kotova O, Reiling E, Ronn T, Parikh H, Taneera J, et al: A common variant in TFB1M is associated with reduced insulin secretion and increased future risk of type 2 diabetes. Cell metabolism. 2011, 13 (1): 80-91. 10.1016/j.cmet.2010.12.007.
- Blaine SA, Ray KC, Branch KM, Robinson PS, Whitehead RH, Means AL: Epidermal growth factor receptor regulates pancreatic fibrosis. American journal of physiology Gastrointestinal and liver physiology. 2009, 297 (3): G434-441. 10.1152/ajpgi.00152.2009.
- Nyblom HK, Bugliani M, Fung E, Boggi U, Zubarev R, Marchetti P, Bergsten P: Apoptotic, regenerative, and immune-related signaling in human islets from type 2 diabetes individuals. Journal of proteome research. 2009, 8 (12): 5650-5656. 10.1021/pr9006816.
- Zhou H, Li Y: Long-term diabetic complications may be ameliorated by targeting Rho kinase. Diabetes/metabolism research and reviews. 2011, 27 (4): 318-330. 10.1002/dmrr.1182.
- Jackerott M, Moldrup A, Thams P, Galsgaard ED, Knudsen J, Lee YC, Nielsen JH: STAT5 activity in pancreatic beta-cells influences the severity of diabetes in animal models of type 1 and 2 diabetes. Diabetes. 2006, 55 (10): 2705-2712. 10.2337/db06-0244.
- Shu Y, Sheardown SA, Brown C, Owen RP, Zhang S, Castro RA, Ianculescu AG, Yue L, Lo JC, Burchard EG, et al: Effect of genetic variation in the organic cation transporter 1 (OCT1) on metformin action. The Journal of clinical investigation. 2007, 117 (5): 1422-1431. 10.1172/JCI30558.
- Al-Mulla F, Leibovich SJ, Francis IM, Bitar MS: Impaired TGF-beta signaling and a defect in resolution of inflammation contribute to delayed wound healing in a female rat model of type 2 diabetes. Molecular bioSystems. 2011, 7 (11): 3006-3020. 10.1039/c0mb00317d.
- Perry JR, McCarthy MI, Hattersley AT, Zeggini E, Weedon MN, Frayling TM: Interrogating type 2 diabetes genome-wide association data using a biological pathway-based approach. Diabetes. 2009, 58 (6): 1463-1467. 10.2337/db08-1378.
- Grant SF, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A, Sainz J, Helgason A, Stefansson H, Emilsson V, Helgadottir A, et al: Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature genetics. 2006, 38 (3): 320-323. 10.1038/ng1732.
- Saxena R, Elbers CC, Guo Y, Peter I, Gaunt TR, Mega JL, Lanktree MB, Tare A, Castillo BA, Li YR, et al: Large-scale gene-centric meta-analysis across 39 studies identifies type 2 diabetes loci. American journal of human genetics. 2012, 90 (3): 410-425. 10.1016/j.ajhg.2011.12.022.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.