This article has Open Peer Review reports available.
Predicting disease risks from highly imbalanced data using random forest
© Khalilia et al; licensee BioMed Central Ltd. 2011
Received: 28 September 2010
Accepted: 29 July 2011
Published: 29 July 2011
We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare.
We employed the National Inpatient Sample (NIS) data, which is publicly available through Healthcare Cost and Utilization Project (HCUP), to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF to predict the risk of eight chronic diseases.
We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
In combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
The reporting requirements of various US governmental agencies such as the Centers for Disease Control and Prevention (CDC), the Agency for Healthcare Research and Quality (AHRQ) and the US Department of Health and Human Services Centers for Medicare & Medicaid Services (CMS) have created huge public datasets that, we believe, are not utilized to their full potential. For example, CDC (http://www.cdc.gov) makes available National Health and Nutrition Examination Survey (NHANES) data. Using NHANES data, Yu et al. predict diabetes risk using an SVM classifier. CMS (http://www.cms.gov) uses Medicare and Medicaid claims to create the minimum dataset (MDS). Hebert et al. use MDS data to identify people with diabetes. In this paper we use the National Inpatient Sample (NIS) data created by the AHRQ (http://www.ahrq.gov) Healthcare Cost and Utilization Project (HCUP) to predict the risk for eight chronic diseases.
Disease prediction can be applied to different domains such as risk management, tailored health communication and decision support systems. Risk management plays an important role in health insurance companies, mainly in the underwriting process. Health insurers use underwriting to classify an applicant as standard or substandard, and on that basis they compute the policy rate and the premiums individuals have to pay. Currently, in order to classify applicants, insurers require every applicant to complete a questionnaire, report current medical status and sometimes provide medical records or clinical laboratory results such as blood tests. By incorporating machine learning techniques, insurers can make evidence-based decisions and can optimize, validate and refine the rules that govern their business. For instance, Yi et al. applied association rules and SVM to an insurance company database to classify applicants as standard, substandard or declined.
Another domain where disease prediction can be applied is tailored health communication. For example, one can target tailored educational materials and news to a subgroup, within the general population, that has specific disease risks. Cohen et al. discussed how tailored health communication can motivate cancer prevention and early detection. Disease risk prediction along with tailored health communication can provide an effective channel for delivering disease-specific information to people who are likely to need it.
In addition to population-level clinical knowledge, deidentified public datasets represent an important resource for clinical data mining researchers. While full-featured clinical records are hard to access due to privacy issues, large deidentified national public datasets are readily available. Although these public datasets do not have all the variables of the original medical records, they still maintain some of their main characteristics, such as data imbalance and the use of controlled terminologies (ICD-9 codes).
Several machine learning techniques have been applied to healthcare data sets to predict future health care utilization, such as individual expenditures and disease risks for patients. Moturu et al. predicted future high-cost patients based on data from the Arizona Medicaid program. They created 20 non-random data samples, each with 1,000 data points, to overcome the problem of imbalanced data. A combination of undersampling and oversampling was employed to obtain a balanced sample. They used a variety of classification methods such as SVM, Logistic Regression, Logistic Model Trees, AdaBoost and LogitBoost. Davis et al. used clustering and collaborative filtering to predict individual disease risks based on medical history. The prediction was performed multiple times for each patient, each time employing a different set of variables. In the end, the clusterings were combined to form an ensemble. The final output was a ranked list of possible diseases for a given patient. Mantzaris et al. predicted osteoporosis using Artificial Neural Networks (ANN). They used two different ANN techniques: Multi-Layer Perceptron (MLP) and Probabilistic Neural Network (PNN). Hebert et al. identified persons with diabetes using Medicare claims data. They ran into a problem where diabetes claims occurred too infrequently to be sensitive indicators for persons with diabetes. In order to increase the sensitivity, physician claims were included. Yu et al. illustrated a method using SVM for detecting persons with diabetes and pre-diabetes.
Zhang et al. conducted a comparative study of ensemble learning approaches. They compared AdaBoost, LogitBoost and RF to logistic regression and SVM in the classification of breast cancer metastasis. They concluded that ensemble learners have higher accuracy compared to the non-ensemble learners.
Together with methods for predicting disease risks, in this paper we discuss a method for dealing with highly imbalanced data. We mentioned two examples [2, 7] where the authors encountered class imbalance problems. Class imbalance occurs if one class contains significantly more samples than the other class. Since the classification process assumes that the test data is drawn from the same distribution as the training data, presenting imbalanced data to the classifier will produce undesirable results. The data set we use in this paper is highly imbalanced. For example, only 3.59% of the patients have heart disease, so a classifier that labels every patient as disease-free would achieve an accuracy of 96.41% while having 0% sensitivity.
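The arithmetic behind this example can be checked with a short sketch. The synthetic labels below are an assumption made for illustration (10,000 patients, 359 of them active, which matches the 3.59% prevalence); they are not the HCUP records.

```python
def accuracy_and_sensitivity(y_true, y_pred):
    """Return (accuracy, sensitivity) for binary labels where 1 = active."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    accuracy = (true_pos + true_neg) / len(y_true)
    # Sensitivity is the fraction of active samples that were detected.
    sensitivity = true_pos / positives if positives else 0.0
    return accuracy, sensitivity


# Synthetic labels (assumption): 359 active out of 10,000 gives 3.59% prevalence.
y_true = [1] * 359 + [0] * 9641
# A degenerate classifier that always predicts the inactive (majority) class.
y_pred = [0] * len(y_true)

accuracy, sensitivity = accuracy_and_sensitivity(y_true, y_pred)
print(f"accuracy = {accuracy:.2%}, sensitivity = {sensitivity:.0%}")
```

The high accuracy comes entirely from the inactive class, which is why accuracy alone is a poor yardstick on data like this.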
The Nationwide Inpatient Sample (NIS) is a database of hospital inpatient admissions that dates back to 1988 and is used to identify, track, and analyze national trends in health care utilization, access, charges, quality, and outcomes. The NIS database is developed by the Healthcare Cost and Utilization Project (HCUP) and sponsored by the Agency for Healthcare Research and Quality (AHRQ). This database is publicly available and does not contain any patient identifiers. The NIS data contains discharge-level information on all inpatients from a 20% stratified sample of hospitals across the United States, representing approximately 90% of all hospitals in the country. The five strata for hospitals are based on the American Hospital Association classification. HCUP data from the year 2005 is used in this paper.
HCUP data elements
Age in years at admission
Age in days (when age < 1 year)
Admission source (uniform)
Admission source (UB-92 standard coding)
Admission source (as received from source)
Admission day is a weekend
Died during hospitalization
Weight to discharges in AHA universe
Disposition of patient (UB-92 standard coding)
Disposition of patient (uniform)
DRG in effect on discharge date
DRG, version 18
DRG grouper version used on discharge date
Data source hospital identifier
CCS: principal diagnosis
CCS: diagnosis 2
CCS: diagnosis 3
CCS: diagnosis 4
CCS: diagnosis 5
CCS: diagnosis 6
CCS: diagnosis 7
CCS: diagnosis 8
CCS: diagnosis 9
CCS: diagnosis 10
CCS: diagnosis 11
CCS: diagnosis 12
CCS: diagnosis 13
CCS: diagnosis 14
CCS: diagnosis 15
E code 1
E code 2
E code 3
E code 4
Elective versus non-elective admission
CCS: E Code 1
CCS: E Code 2
CCS: E Code 3
CCS: E Code 4
Indicator of sex
HCUP hospital identification number
Hospital state postal code
HCUP record identifier
Length of stay (cleaned)
Length of stay (as received from source)
MDC in effect on discharge date
MDC, version 18
Physician 1 number (re-identified)
Physician 2 number (re-identified)
Number of diagnoses on this record
Number of E codes on this record
Neonatal and/or maternal DX and/or PR
Stratum used to sample hospital
Number of procedures on this record
Primary expected payer (uniform)
Primary expected payer (as received from source)
Secondary expected payer (uniform)
Secondary expected payer (as received from source)
Patient Location: Urban-Rural 4 Categories
CCS: principal procedure
CCS: procedure 2
CCS: procedure 3
CCS: procedure 4
CCS: procedure 5
CCS: procedure 6
CCS: procedure 7
CCS: procedure 8
CCS: procedure 9
CCS: procedure 10
CCS: procedure 11
CCS: procedure 12
CCS: procedure 13
CCS: procedure 14
CCS: procedure 15
Number of days from admission to PR1
Number of days from admission to PR2
Number of days from admission to PR3
Number of days from admission to PR4
Number of days from admission to PR5
Number of days from admission to PR6
Number of days from admission to PR7
Number of days from admission to PR8
Number of days from admission to PR9
Number of days from admission to PR10
Number of days from admission to PR11
Number of days from admission to PR12
Number of days from admission to PR13
Number of days from admission to PR14
Number of days from admission to PR15
Total charges (cleaned)
Total charges (as received from source)
Median household income quartile for patient's ZIP Code
The 10 most prevalent disease categories (table values not recovered; surviving row labels: Other Circulatory Diseases, Diabetes mellitus no complication)
Some of the most imbalanced disease categories, with the percent of the active class (table values not recovered; surviving row labels: Male Genital Disease, Diabetes Mellitus w/complication)
One limitation of the 2005 HCUP data set is the arbitrary order in which the ICD-9 codes and disease categories are listed. The codes are not listed in chronological order according to the date they were diagnosed. Also, the data set does not provide an anonymous patient identifier, which could be used to check whether multiple records belong to the same patient or to determine the elapsed time between diagnoses.
The data set was provided in a large ASCII file containing 7,995,048 records. The first step was to parse the data set, randomly select N records and extract a set of relevant features. Every record is a sequence of characters that are not delimited. However, the data set instructions specify the starting column and the ending column in the ASCII file for each data element (and hence its length). HCUP provides a SAS program to parse the data set, but we chose to develop our own program to perform the parsing.
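A minimal sketch of this parsing step is shown below. The element names, column positions and sample records are hypothetical, chosen only for illustration; the real start and end columns come from the HCUP data set instructions, and this is not the authors' program.

```python
import random

# Hypothetical layout (assumption): element name -> (start column, end column),
# 1-based and inclusive, in the style of a fixed-width file specification.
LAYOUT = {"age": (1, 3), "sex": (4, 4), "ccs_dx1": (5, 8)}


def parse_record(line, layout=LAYOUT):
    """Slice one undelimited record into named fields using column positions."""
    return {
        name: line[start - 1:end].strip()
        for name, (start, end) in layout.items()
    }


def sample_records(lines, n, seed=0):
    """Randomly select n records and parse each of them."""
    rng = random.Random(seed)
    return [parse_record(line) for line in rng.sample(lines, n)]


# Made-up records that follow the hypothetical layout above.
lines = [" 671  49", " 450  98", "  31 101"]
print(parse_record(lines[0]))
print(sample_records(lines, 2))
```

For a file with millions of records, the same slicing could be applied line by line while reading, so the whole file never has to be held in memory.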
Sample Dataset, the bolded column (Cat. 50) represents the category to predict
While, in general, using only disease categories may not lead to a valid disease prediction, the approach presented in this paper needs to be seen in the larger context of our TigerPlace eldercare research . By integrating large public data sets (such as the one used in this paper) with monitoring sensors and electronic health records (EHR) data, we can achieve the required prediction precision for an efficient delivery of tailored medical information.
Learning from Imbalanced Data
A data set is class-imbalanced if one class contains significantly more samples than the other. For many disease categories, the imbalance rate (that is, the percentage of the data samples that belong to the active class) ranges from 0.01% to 29.1%. For example (see Table 3), only 3.16% of the patients have Peripheral Atherosclerosis. In such cases, it is challenging to create appropriate testing and training data sets, given that most classifiers are built with the assumption that the test data is drawn from the same distribution as the training data.
Presenting imbalanced data to a classifier will produce undesirable results, such as a much lower performance on the testing data than on the training data. Among the classifier learning techniques that deal with imbalanced data we mention oversampling, undersampling, boosting, bagging and repeated random sub-sampling [13, 14]. In the next section we describe the repeated random sub-sampling method that we employ in this paper.
Repeated Random Sub-Sampling
Repeated random sub-sampling was found to be very effective in dealing with data sets that are highly imbalanced. Because most classification algorithms assume that the class distribution in the data set is uniform, it is essential to pay attention to the class distribution when addressing medical data. This method divides the data set into active and inactive instances, from which the training and testing data sets are generated. The training data is partitioned into sub-samples, each containing an equal number of instances from each class, except (in some cases) for the last sub-sample. The classification model is fitted repeatedly on every sub-sample and the final result is a majority vote over all the sub-samples.
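The partitioning step can be sketched as follows. This is one possible reading, not the authors' code: it assumes that the inactive training samples are shuffled and cut into chunks the size of the active set, and that every chunk is paired with all active training samples. The function name and toy identifiers are hypothetical.

```python
import random


def balanced_subsamples(active, inactive, seed=0):
    """Pair all active samples with successive chunks of the inactive samples.

    Each chunk has len(active) inactive samples, so every sub-sample is
    balanced except possibly the last one, which receives the leftovers.
    """
    if not active:
        raise ValueError("at least one active sample is required")
    rng = random.Random(seed)
    shuffled = list(inactive)
    rng.shuffle(shuffled)
    size = len(active)
    subsamples = []
    for start in range(0, len(shuffled), size):
        chunk = shuffled[start:start + size]
        subsamples.append((list(active), chunk))
    return subsamples


# Toy identifiers (assumption): 3 active and 10 inactive training samples.
active = [f"a{i}" for i in range(3)]
inactive = [f"n{i}" for i in range(10)]
for positives, negatives in balanced_subsamples(active, inactive):
    print(len(positives), len(negatives))
```

With 3 active and 10 inactive samples this yields four sub-samples, the last of which holds the single leftover inactive sample.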
In this paper we used the following repeated random sub-sampling approach. For every target disease we randomly choose N samples from the original HCUP data set. The N samples are divided into two separate data sets, N_1 active data samples and N_0 inactive data samples, where N_1 + N_0 = N. The testing data will contain 30% active samples drawn from N_1 (TsN_1), while the remaining 70% will be sampled from the N_0 inactive samples (TsN_0). The 30/70 ratio was chosen by trial and error. The training data set will contain the remaining active samples (TrN_1) and inactive samples (TrN_0).
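The split can be sketched as below. The text admits more than one reading, so the following is an assumption rather than the authors' procedure: 30% of the active samples are held out for testing, and enough inactive samples are drawn so that active samples make up 30% of the testing data. The function and parameter names are hypothetical.

```python
import random


def split_test_train(active, inactive, active_holdout=0.30,
                     test_active_share=0.30, seed=0):
    """Split active and inactive samples into testing and training parts.

    Assumed reading: hold out `active_holdout` of the active samples, then
    draw inactive samples until actives form `test_active_share` of the test set.
    """
    rng = random.Random(seed)
    active, inactive = list(active), list(inactive)
    rng.shuffle(active)
    rng.shuffle(inactive)
    n_ts1 = round(active_holdout * len(active))
    n_ts0 = round(n_ts1 * (1 - test_active_share) / test_active_share)
    n_ts0 = min(n_ts0, len(inactive))  # cannot draw more than are available
    return {
        "TsN1": active[:n_ts1],    # active samples used for testing
        "TsN0": inactive[:n_ts0],  # inactive samples used for testing
        "TrN1": active[n_ts1:],    # remaining active samples for training
        "TrN0": inactive[n_ts0:],  # remaining inactive samples for training
    }


# Toy sample identifiers (assumption): 100 active and 900 inactive samples.
parts = split_test_train(range(100), range(100, 1000))
print({name: len(samples) for name, samples in parts.items()})
```

Because the testing and training parts are slices of the same shuffled lists, no sample can appear in both.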
- RF has an effective method for estimating missing data.
- RF has a method, weighted random forest (WRF), for balancing error in imbalanced data.
- RF estimates the importance of variables used in the classification.
Chen et al. compared WRF and balanced random forest (BRF) on six different and highly imbalanced data sets. In WRF, they tuned the weights for every data set, while in BRF, they changed the votes cutoff for the final prediction. They concluded that BRF is computationally more efficient than WRF for imbalanced data. They also found that WRF is more vulnerable to noise compared to BRF. In this paper, we used RF without tuning the class weights or the cutoff parameter.
The decision of the splitting criterion will be based on the lowest Gini impurity value computed among the m variables. In RF, each tree employs a different set of m variables to construct the splitting rules.
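A small sketch of this selection rule follows. It assumes the standard definition of Gini impurity (one minus the sum of squared class proportions) and binary variables that mark the presence or absence of a disease category; the variable names and toy rows are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, variable):
    """Weighted Gini impurity after splitting on a binary variable."""
    left = [y for row, y in zip(rows, labels) if row[variable] == 0]
    right = [y for row, y in zip(rows, labels) if row[variable] == 1]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + \
           (len(right) / n) * gini_impurity(right)


def best_variable(rows, labels, candidates):
    """Pick, among the m candidate variables, the split with the lowest Gini."""
    return min(candidates, key=lambda v: split_gini(rows, labels, v))


# Toy data (assumption): two binary variables, four patients.
rows = [
    {"hypertension": 1, "diabetes": 1},
    {"hypertension": 1, "diabetes": 0},
    {"hypertension": 0, "diabetes": 1},
    {"hypertension": 0, "diabetes": 0},
]
labels = [1, 1, 0, 0]
print(best_variable(rows, labels, ["hypertension", "diabetes"]))
```

In a random forest the `candidates` list would be a fresh random draw of m variables, which is what makes the trees differ from one another.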
One of the most important features of RF is the output of the variable importance. Variable importance measures the degree of association between a given variable and the classification result. RF has four measures for the variable importance: raw importance score for class 0, raw importance score for class 1, decrease in accuracy and the Gini index. To estimate the importance of some variable j, the out-of-bag (OOB) samples are passed down the tree and the prediction accuracy is recorded. Then the values of variable j are permuted in the OOB samples and the accuracy is measured again. These calculations are carried out tree by tree as the RF is constructed. The decrease in accuracy caused by these permutations is then averaged over all the trees and is used to measure the importance of variable j. If the prediction accuracy decreases substantially, that suggests that variable j has a strong association with the response. After measuring the importance of all the variables, RF returns a list of the variables ranked by importance.
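The permutation procedure can be sketched in a few lines. This is a simplified illustration, not the RF implementation: each tree is represented by a plain prediction function together with its own OOB rows and labels, and all names and toy data are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the true label."""
    return sum(predict(row) == y for row, y in zip(rows, labels)) / len(labels)


def permutation_decrease(predict, oob_rows, oob_labels, variable, rng):
    """Accuracy drop for one tree after permuting `variable` in its OOB rows."""
    baseline = accuracy(predict, oob_rows, oob_labels)
    values = [row[variable] for row in oob_rows]
    rng.shuffle(values)
    permuted = [{**row, variable: v} for row, v in zip(oob_rows, values)]
    return baseline - accuracy(predict, permuted, oob_labels)


def variable_importance(trees, variable, seed=0):
    """Average the per-tree accuracy decrease over all trees."""
    rng = random.Random(seed)
    drops = [
        permutation_decrease(predict, rows, labels, variable, rng)
        for predict, rows, labels in trees
    ]
    return sum(drops) / len(drops)


# Toy "forest" (assumption): two stand-in trees that only look at variable x.
def tree(row):
    return row["x"]


oob_rows = [{"x": i % 2, "z": (i // 2) % 2} for i in range(20)]
oob_labels = [row["x"] for row in oob_rows]
trees = [(tree, oob_rows, oob_labels), (tree, oob_rows, oob_labels)]

print("x:", variable_importance(trees, "x"))
print("z:", variable_importance(trees, "z"))
```

Permuting z cannot change the predictions here because the stand-in trees never read it, so its importance is exactly zero, while permuting x can only lower the accuracy.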
Classification with Repeated Random Sub-Sampling
Training the classifier on a data set that is small and highly imbalanced will produce unpredictable results, as discussed in earlier sections. To overcome this issue, we used repeated random sub-sampling. Initially, we construct the testing data and the NoS training data sub-samples, where NoS is the number of sub-samples. For each disease, we train NoS classifiers and test all of them on the same data set. The final labels of the testing data are computed using a majority voting scheme.
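The voting step can be illustrated with a short sketch. It assumes that each of the NoS classifiers has already produced one label per testing sample; the function name and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine label lists from several classifiers by majority vote.

    `predictions_per_classifier` holds one list per classifier, each with one
    label per testing sample. Ties go to the label seen first, so an odd
    number of classifiers is a natural choice for two classes.
    """
    final_labels = []
    for votes in zip(*predictions_per_classifier):
        label, _count = Counter(votes).most_common(1)[0]
        final_labels.append(label)
    return final_labels


# Toy predictions (assumption): 3 classifiers, 3 testing samples.
predictions = [
    [1, 0, 1],  # classifier trained on sub-sample 1
    [1, 1, 0],  # classifier trained on sub-sample 2
    [0, 0, 1],  # classifier trained on sub-sample 3
]
print(majority_vote(predictions))  # [1, 0, 1]
```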
To evaluate the performance of RF, we compared it to SVM on imbalanced data sets for eight different chronic disease categories. Two sets of experiments were carried out:
Set II: In this experiment we compare RF, bagging, boosting and SVM performance without the sampling approach. Without sampling the data set is highly imbalanced, while sampling should improve the accuracy since the training data sub-samples fitted to the model are balanced. The process was again repeated 100 times and the ROC curve and the average AUC were calculated and compared.
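For readers unfamiliar with the metric, the AUC can be computed directly from classifier scores. The sketch below uses the rank interpretation of the AUC (the probability that a randomly chosen active sample receives a higher score than a randomly chosen inactive one, with ties counted as one half); the scores and labels are made up, and this is not the evaluation code used in the experiments.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and binary labels (1 = active)."""
    active = [s for s, y in zip(scores, labels) if y == 1]
    inactive = [s for s, y in zip(scores, labels) if y == 0]
    if not active or not inactive:
        raise ValueError("both classes must be present")
    wins = 0.0
    for a in active:
        for b in inactive:
            if a > b:
                wins += 1.0   # active sample ranked above inactive sample
            elif a == b:
                wins += 0.5   # ties count as half
    return wins / (len(active) * len(inactive))


def average_auc(runs):
    """Mean AUC over repeated runs, each given as a (scores, labels) pair."""
    return sum(auc(scores, labels) for scores, labels in runs) / len(runs)


# Made-up scores (assumption) for four testing samples.
scores = [0.9, 0.8, 0.35, 0.1]
labels = [1, 0, 1, 0]
print(auc(scores, labels))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a sort-based rank sum gives the same value faster on large test sets.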
For SVM we used a linear kernel, termination criterion (tolerance) was set to 0.001, epsilon for the insensitive-loss function was 0.1 and the regularization term (cost) was set to 1. Also, we left bagging and boosting with the default parameters.
We randomly selected N = 10,000 data points from the original HCUP data set. We predicted the disease risks on 8 out of the 259 disease categories. Those categories are: breast cancer, type 1 diabetes, type 2 diabetes, hypertension, coronary atherosclerosis, peripheral atherosclerosis, other circulatory diseases and osteoporosis.
Result set I: Comparison of RF, bagging, boosting and SVM
RF, SVM, bagging and boosting performance in terms of AUC on eight disease categories (table values not recovered; surviving row labels: Diabetes no complication, Other Circulatory Diseases)
Statistical comparison of RF and boosting ROC curves; the lower the value, the more significant the difference (table values not recovered; surviving row labels: Diabetes no complication, Other Circulatory Diseases)
We compared our disease prediction results to the ones reported by other authors. For instance, Yu et al. describe a method using SVM for detecting persons with diabetes and pre-diabetes. They used a data set from the National Health and Nutrition Examination Survey (NHANES). NHANES collects demographic, health history and behavioural information, and it may also include detailed physical, physiological, and laboratory examinations for each patient. The AUC for their classification schemes I and II was 83.47% and 73.81% respectively. We also predicted diabetes with complications and without complications, and the AUC values were 94.31% and 87.91% respectively (Diabetes without complication ROC curve in Figure 5).
Davis et al.  used clustering and collaborative filtering to predict disease risks of patients based on their medical history. Their algorithm generates a ranked list of diseases in the subsequent visits of that patient. They used an HCUP data set, similar to the data set we used. Their system predicts more than 41% of all the future diseases in the top 20 ranks. One reason for their low system performance might be that they tried to predict the exact ICD-9 code for each patient, while we predict the disease category.
Zhang et al.  performed classification on breast cancer metastasis. In their study, they used two published gene expression profiles. They compared multiple methods (logistic regression, SVM, AdaBoost, LogitBoost and RF). In the first data set, the AUC for SVM and RF was 88.6% and 89.9% respectively and for the second data set 87.4% and 93.2%. The results we obtained for breast cancer prediction for RF were 91.23% (ROC curve in Figure 7).
Mantzaris et al. predicted osteoporosis using multi-layer perceptron (MLP) and probabilistic neural network (PNN). Age, sex, height and weight were the input variables to the classifier. They reported a prognosis rate on the testing data of 84.9%. On the same disease, we reported an AUC for RF of 87%.
Top four most important variables for the eight disease categories (table values not recovered; surviving row labels: 1. Breast cancer, 2. Diabetes no complication, 3. Diabetes with complication, 5. Coronary atherosclerosis, 6. Peripheral atherosclerosis, 7. Other circulatory diseases; surviving cell fragments: Secondary malignant, Diabetes without compl.)
Result set II: Sampling vs. non-sampling
RF, SVM, bagging and boosting performance without sub-sampling in terms of AUC on eight disease categories (table values not recovered; surviving row labels: Diabetes no complication, Other Circulatory Diseases)
Disease prediction is becoming an increasingly important research area due to the large medical datasets that are slowly becoming available. While full-featured clinical records are hard to access due to privacy issues, large deidentified public datasets are still a valuable resource for at least two reasons. First, they may provide population-level clinical knowledge. Second, they allow the data mining researcher to develop methodologies for clinical decision support systems that can then be employed for electronic medical records. In this study, we presented a disease prediction methodology that employs random forests (RF) and a nation-wide deidentified public dataset (HCUP). We show that, since no national medical warehouse is available to date, using nation-wide datasets provides a powerful prediction tool. In addition, we believe that the presented methodology can be employed with electronic medical records, if available.
To test our approach we selected eight chronic diseases with high prevalence in the elderly. We performed two sets of experiments (set I and set II). In set I, we compared RF to other classifiers with sub-sampling, while in set II we compared RF to other classifiers without sub-sampling. Our results show that we can predict diseases with an acceptable accuracy using the HCUP data. In addition, the use of repeated random sub-sampling is useful when dealing with highly imbalanced data. We also found that incorporating demographic information increased the area under the curve by 0.33-10.1%.
Some of the limitations of our approach come from limitations of the HCUP data set, such as the arbitrary order of the ICD-9 codes and the lack of patient identification. For example, since the ICD-9 codes are not listed in chronological order according to the date they were diagnosed, we inherently use future diseases in our prediction. This explains, in part, the high accuracy of our prediction. In addition, the HCUP data set does not provide an anonymous patient identifier, which could be used to check whether multiple records belong to the same patient and to estimate the time interval between two diagnoses. Hence we might use the data for the same patient multiple times.
Additionally, the data set does not include family history; rather, it includes the individual diagnosis history, which is represented by the disease categories.
In this study we used the NIS dataset (HCUP) created by AHRQ. Few researchers have utilized the NIS dataset for disease prediction. The only work we found on disease prediction using NIS data was presented by Davis et al., in which clustering and collaborative filtering were used to predict individual disease risks based on medical history. In this work we provided extensive evidence that RF can be successfully used for disease prediction in conjunction with the HCUP dataset.
The accuracy achieved in disease prediction is comparable to or better than previously published results. The average RF AUC obtained across all diseases was about 89.05%, which may be acceptable in many applications. Additionally, unlike many other published results, which focus on predicting one specific disease, our method can be used to predict the risk for any disease. Finally, we consider the results obtained with the proposed method adequate for our intended use, which is tailored health communication.
The authors would like to thank the University of Missouri Bioinformatics Consortium for providing us with a Supercomputer account to run the experiments.
- Yu W: Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Medical Informatics and Decision Making. 2010, 10 (1): 16. 10.1186/1472-6947-10-16.
- Hebert P: Identifying persons with diabetes using Medicare claims data. American Journal of Medical Quality. 1999, 14 (6): 270. 10.1177/106286069901400607.
- Fuster V: Medical Underwriting for Life Insurance. 2008, McGraw-Hill's AccessMedicine
- Yi T, Guo-Ji Z: The application of machine learning algorithm in underwriting process. Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on. 2005
- Cohen E: Cancer coverage in general-audience and black newspapers. Health Communication. 2008, 23 (5): 427-435. 10.1080/10410230802342176.
- HCUP Project: Overview of the Nationwide Inpatient Sample (NIS). 2009, [http://www.hcup-us.ahrq.gov/nisoverview.jsp]
- Moturu ST, Johnson WG, Huan L: Predicting Future High-Cost Patients: A Real-World Risk Modeling Application. Bioinformatics and Biomedicine, 2007. BIBM 2007. IEEE International Conference on. 2007
- Davis DA, Chawla NV, Blumm N, Christakis N, Barabási AL: Predicting individual disease risk based on medical history. Proceeding of the 17th ACM conference on Information and knowledge management. 2008, 769-778.
- Mantzaris DH, Anastassopoulos GC, Lymberopoulos DK: Medical disease prediction using Artificial Neural Networks. BioInformatics and BioEngineering, 2008. BIBE 2008. 8th IEEE International Conference on. 2008
- Zhang W: A Comparative Study of Ensemble Learning Approaches in the Classification of Breast Cancer Metastasis. Bioinformatics, Systems Biology and Intelligent Computing, 2009. IJCBS '09. International Joint Conference on. 2009, 242-245.
- Skubic M, Alexander G, Popescu M, Rantz M, Keller J: A Smart Home Application to Eldercare: Current Status and Lessons Learned. Technology and Health Care. 2009, 17 (3): 183-201.
- Provost F: Machine learning from imbalanced data sets 101. Proceedings of the AAAI'2000 Workshop on Imbalanced Data Sets. 2000
- Japkowicz N, Stephen S: The class imbalance problem: A systematic study. Intelligent Data Analysis. 2002, 6 (5): 429-449.
- Quinlan JR: Bagging, boosting, and C4.5. Proceedings of the National Conference on Artificial Intelligence. 1996, 725-730.
- Breiman L: Classification and regression trees. 1984, Wadsworth, Inc., Belmont, CA, 358
- Breiman L: Random forests. Machine learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
- Chen C, Liaw A, Breiman L: Using random forest to learn imbalanced data. 2004, University of California, Berkeley
- Breiman L, others: Manual-Setting Up, Using, and Understanding Random Forests V4.0. 2003, [ftp://ftpstat.berkeley.edu/pub/users/breiman]
- Hastie T: The elements of statistical learning: data mining, inference and prediction. 2009, 605-622.
- Bjoern M: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics. 10:
- Mingers J: An empirical comparison of selection measures for decision-tree induction. Machine learning. 1989, 3 (4): 319-342.
- Bradley AP: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997, 30: 1145-1159. 10.1016/S0031-3203(96)00142-2.
- Palmer D: Random forest models to predict aqueous solubility. J Chem Inf Model. 2007, 47 (1): 150-158. 10.1021/ci060164k.
- Liaw A, Wiener M: Classification and Regression by randomForest.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/11/51/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.