Predicting Disease Risks from Highly Unbalanced Data using Random Forest

Abstract


Background
Disease prediction can be applied in different domains such as risk management, tailored health communication and decision support systems. Risk management plays a role in health insurance companies, mainly in the underwriting process [1]. Health insurers use a process called underwriting to classify an applicant as standard or substandard, based on which they compute the policy rate and the premiums individuals have to pay. Currently, in order to classify the applicants, insurers require every applicant to complete a questionnaire and report current medical status, and sometimes medical records or clinical laboratory results, such as blood tests.
Incorporating machine learning techniques, insurers can make evidence-based decisions and can optimize, validate and refine the rules that govern their business.
For instance, Yi et al. [2], applied association rules and SVM to an insurance company database to classify applicants as standard, substandard or declined.
Another important domain where disease prediction can be applied is tailored health communication. For example, one can target tailored educational materials and news to a subgroup, within the general population, that has specific disease risks. Cohen et al. [3], discussed how tailored health communication for cancer patients can motivate cancer prevention and early detection. Disease risk prediction along with tailored health communication can provide an effective channel for delivering disease-specific information to the people who are likely to need it.
Several machine learning techniques have been applied to health care data sets for the prediction of future health care utilization, such as predicting individual expenditures and disease risks for patients. Moturu et al. [4], predicted future high-cost patients based on data from the Arizona Medicaid program. They used random data sampling to overcome the problem of unbalanced data, along with a variety of classification methods such as SVM, Logistic Regression, Logistic Model Trees, AdaBoost and LogitBoost. Davis et al. [5], used clustering and collaborative filtering to predict individual disease risks based on medical history. The prediction is performed multiple times for each patient, with each iteration employing a different basis for clustering; the clusterings are then combined to form an ensemble. The output is a ranked list of diseases that can be used in subsequent visits of that patient. Mantzaris et al. [6], predicted osteoporosis using Artificial Neural Networks (ANN). They used two different ANN techniques: Multi-Layer Perceptron (MLP) and Probabilistic Neural Network (PNN).
Hebert et al. [7], identified persons with diabetes using Medicare claims data. They ran into a problem where the diabetes claims occurred too infrequently to be sensitive indicators for persons with diabetes. In order to increase the sensitivity, physician claims were included. Yu et al. [8], illustrated a method using SVM for detecting persons with diabetes and pre-diabetes.
Zhang et al. [9], conducted a comparative study of ensemble learning approaches. They compared AdaBoost, LogitBoost and RF to logistic regression and SVM in the classification of breast cancer metastasis. They concluded that the ensemble learners have higher accuracy than the non-ensemble learners.
Together with methods for predicting disease risks, in this paper we discuss a method for dealing with highly unbalanced data. Earlier, we mentioned two examples [4,7] where the authors encountered class imbalance problems. Class imbalance occurs when one class contains significantly more samples than the other. Most classifiers assume that the test data is drawn from the same distribution as the training data; on class-imbalanced data sets this assumption leads to undesirable results, since a classifier can score well simply by ignoring the minority class. The Medicare data set we use in this paper is highly unbalanced; the majority of the patients do not have a given disease, while a very small subset does. For example, only 3.59% of the patients have heart disease, so a classifier learning from this data might achieve an accuracy of 96.41% while having a sensitivity of 0%.
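The accuracy/sensitivity gap described above can be verified with a small numeric illustration (not from the paper; the patient counts are synthetic, chosen to match the 3.59% prevalence mentioned):

```python
# Illustration: a classifier that always predicts "no disease" on data in
# which only 3.59% of patients are positive scores high accuracy but has
# zero sensitivity.
def accuracy_and_sensitivity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    sensitivity = tp / positives if positives else 0.0
    return accuracy, sensitivity

# 10,000 synthetic patients, 359 (3.59%) with heart disease
y_true = [1] * 359 + [0] * 9641
y_pred = [0] * 10000                      # majority-class classifier
acc, sens = accuracy_and_sensitivity(y_true, y_pred)
# acc == 0.9641, sens == 0.0
```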

Data
The Nationwide Inpatient Sample (NIS) is a database of hospital inpatient admissions dating back to 1988 and is used to identify, track, and analyze national trends in health care utilization, access, charges, quality, and outcomes. The NIS database is developed by the Healthcare Cost and Utilization Project (HCUP) and sponsored by the Agency for Healthcare Research and Quality (AHRQ) [10]. This database is publicly available and does not contain any patient identifiers. The database contains discharge-level information on all inpatients from a 20% stratified sample of hospitals across the United States, representing approximately 90% of all hospitals in the country [10]. The five strata for hospitals are based on the American Hospital Association classification. HCUP data from the year 2005 are used in this paper. The data set is highly unbalanced: the unbalance rate ranges from 0.01% to 29.1%. 222 diagnosis categories occur in less than 5% of the patients, and only 13 categories occur in more than 10% of the patients. Table 1 shows the top 10 most prevalent disease categories and Table 2 shows some of the rarest diseases in the 2005 HCUP data set. One limitation of the 2005 HCUP data set is the arbitrary order in which the ICD-9 codes and disease categories are listed; the codes are not listed in chronological order according to the date they were diagnosed. Also, the data set does not provide an anonymous patient identifier, which could be used to check whether multiple records belong to the same patient or to determine the elapsed time between diagnoses.

Methods Data Pre-processing
The data set was provided as a large ASCII file containing 7,995,048 records. The first step was to parse the data set, randomly select N records and extract a set of relevant features. Every record is an undelimited sequence of characters.
However, the data set instructions specify, for each data element, its starting and ending column in the ASCII file (i.e., the length of the data element). HCUP provides a SAS program to parse the data set, but we chose to develop our own C program to perform the parsing.
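The fixed-width parsing described above can be sketched as follows. This is an illustration only, not the authors' C program, and the column offsets in `LAYOUT` are hypothetical; the real offsets come from the HCUP data set instructions.

```python
# Sketch of parsing undelimited fixed-width ASCII records.
# The (start, end) column positions below are made up for illustration.
LAYOUT = {
    "AGE":  (0, 3),    # hypothetical: columns 1-3
    "SEX":  (3, 4),
    "RACE": (4, 5),
    "DX1":  (5, 9),    # hypothetical: first diagnosis code
}

def parse_record(line):
    """Slice one undelimited record into named fields using the layout."""
    return {name: line[start:end].strip() for name, (start, end) in LAYOUT.items()}

rec = parse_record("067F24109")
# rec["AGE"] == "067", rec["SEX"] == "F", rec["DX1"] == "4109"
```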

Feature Selection
For every record, we extracted the age, race, sex and 15 diagnosis categories. As mentioned before, there are 259 diagnosis categories, so every record is transformed into an m = 262 dimensional feature vector. Features 1-259 are binary, each representing a disease category: a patient either has the disease (1, i.e., belongs to the "active" class) or does not have the disease (0). The remaining three features are age, race and sex.
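The feature construction above can be sketched in a few lines. The function below is an illustration (the category indices and demographic encodings are placeholders, not the paper's actual coding scheme):

```python
# Map a record's diagnosis categories plus age, race and sex into a
# fixed-length vector: 259 binary disease features + 3 demographic features.
NUM_CATEGORIES = 259

def to_feature_vector(diagnosis_categories, age, race, sex):
    vec = [0] * NUM_CATEGORIES
    for c in diagnosis_categories:      # categories indexed 1..259
        vec[c - 1] = 1                  # 1 = patient is "active" for category c
    return vec + [age, race, sex]       # m = 262 features total

v = to_feature_vector([5, 100], age=67, race=2, sex=1)
# len(v) == 262, v[4] == 1, v[99] == 1
```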

Learning from Unbalanced Data
A data set is class-imbalanced if one class contains many more samples than the other. As mentioned above, our data set is highly unbalanced (Table 2): the unbalance rate ranges between 0.01% and 29.1%, where this percentage represents the fraction of the data samples that belong to the active class. For example, 3.16% of the patients have peripheral atherosclerosis while the rest do not. In such cases it is hard to create appropriate training and testing data sets, given that most classifiers are built under the assumption that the test data is drawn from the same distribution as the training data [11].
Presenting unbalanced data to a classifier will produce undesirable results: learning from unbalanced data will most likely cause the classifier to perform much worse on the testing data than on the training data. There exist techniques for learning classifiers from unbalanced data, such as oversampling, undersampling, boosting, bagging and repeated random sub-sampling. We briefly describe each of these techniques, but the focus will be on repeated random sub-sampling.

Oversampling and Undersampling
In oversampling, the number of examples from the small class is increased to reach the size of the larger class, while in undersampling, the examples from the large class are reduced to reach the size of the smaller class. In this paradigm, one needs to decide whether to use oversampling, undersampling, or a combination of both, and to select the appropriate sampling rate. For example, we could oversample and undersample to full balance (an equal number of samples from each class) or we could choose a different ratio of the two classes [12].
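A minimal sketch of both options, sampling to full balance (the class sizes and seed are arbitrary choices for illustration):

```python
import random

# Oversampling grows the minority class with replacement; undersampling
# shrinks the majority class without replacement.
def oversample(minority, target_size, rng):
    """Grow the minority class by sampling with replacement."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def undersample(majority, target_size, rng):
    """Shrink the majority class by sampling without replacement."""
    return rng.sample(majority, target_size)

rng = random.Random(0)
active, inactive = list(range(30)), list(range(1000))
balanced_up = oversample(active, len(inactive), rng)     # 1000 active samples
balanced_down = undersample(inactive, len(active), rng)  # 30 inactive samples
```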

Bagging
Bagging constructs a number, T, of weak classifiers [13]. For each weak classifier t = 1, 2, ..., T, a training set of size N is sampled from the original data set with replacement. The sampled set has the same size as the original training data; some samples may appear more than once, while others may not appear at all. The T classifiers are then aggregated to form a stronger classifier.
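The bootstrap-and-vote loop can be sketched as follows. This is an illustration only: the "weak classifier" here is a stub that predicts the majority class of its bootstrap sample, standing in for a real learner such as a tree.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample |data| items with replacement (same size as the original)."""
    return [rng.choice(data) for _ in data]

def train_stub(sample):
    """Toy weak classifier: always predicts its sample's majority class."""
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: majority

def bagging_predict(data, x, T=25, seed=0):
    """Train T weak classifiers on bootstrap samples; majority-vote on x."""
    rng = random.Random(seed)
    votes = [train_stub(bootstrap(data, rng))(x) for _ in range(T)]
    return Counter(votes).most_common(1)[0][0]

data = [(i, 0) for i in range(80)] + [(i, 1) for i in range(20)]
label = bagging_predict(data, x=None)   # majority vote of the T stubs
```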

Boosting
Unlike bagging, where independent bootstrap samples are selected from the original training data, boosting associates a weight with every instance [13]. The classifier uses the weights to focus on the samples that are most difficult to classify. For every weak classifier t = 1, 2, ..., T, the weights are updated; that is, the weights of misclassified instances are increased. At the end, the weak classifiers are aggregated to form a strong classifier using weighted voting.
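One round of the weight update can be made concrete with an AdaBoost-style sketch (an illustration of the general idea, not tied to any particular paper's variant):

```python
import math

# One boosting round: increase the weights of misclassified instances and
# renormalise, so the next weak learner focuses on the hard samples.
def update_weights(weights, correct, error):
    """`correct[i]` is True if instance i was classified correctly;
    `error` is the weighted error of the current weak classifier."""
    alpha = 0.5 * math.log((1 - error) / error)  # weak classifier's vote weight
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

w = [0.25] * 4
w, alpha = update_weights(w, correct=[True, True, True, False], error=0.25)
# the misclassified instance now carries weight 0.5; each correct one, 1/6
```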

Repeated Random Sub-Sampling
Repeated random sub-sampling was found to be very effective in dealing with data sets that are highly unbalanced. Because most classification algorithms make the assumption that the class distribution in the data set is uniform, it is essential to pay attention to the class distribution when addressing medical data (the procedure is given in Algorithm 1).

Random Forest
RF is an ensemble learner, i.e., a method that generates many classifiers and aggregates their results. RF adds a layer of randomness to bagging by building a large collection of de-correlated trees. Due to this randomness, RF is usually faster than bagging. RF creates multiple CART-like trees, each trained on a bootstrap sample of the original training data, and searches across a randomly selected subset of the input variables to determine each split. Each tree in the RF casts a vote for an input x, and the output of the classifier is determined by majority voting over the trees (Algorithm 2). RF can handle high-dimensional data and use a large number of trees in the ensemble. Some important features of RF, as mentioned in [14]:
1. It has an effective method for estimating missing data.
2. It has a method for balancing error in unbalanced data. We attempted to use weighted random forest (WRF), but the results were not satisfactory. That could be due to a poor selection of weights, since the weights in WRF are tuning parameters. Chen et al. [15], compared WRF and balanced random forest (BRF) on six different and highly unbalanced data sets. In WRF, they tuned the weights for every data set, while in BRF, they changed the vote cutoff for the final prediction. They concluded that BRF is computationally more efficient than WRF for unbalanced data. They also found that WRF is more vulnerable to noise than BRF. In this paper, we used RF without tuning the class weights or the cutoff parameter. In order to address the class imbalance problem we used repeated random sub-sampling, as discussed in the previous section.

Splitting Criterion
Like CART, RF uses the Gini measure of impurity to select the split with the lowest impurity at every node [16]. The Gini measure quantifies the impurity of a node with respect to the classes and produces a value in the range [0, 1].
Formally, let the split on variable X = {x_1, x_2, ..., x_j} produce j children at node t, let N be the number of samples at node t, and let n_ci be the number of samples of class c_i at node t. The Gini impurity at node t is

    I(t) = 1 - Σ_i (n_ci / N)^2    (1)

and the impurity of the split on X at node t is the weighted impurity of the resulting children,

    I(t, X) = Σ_{k=1}^{j} (N_k / N) I(t_k)    (2)

where N_k is the number of samples falling into child t_k.
For the permutation-based variable importance, let β_t be the OOB samples for tree t, t ∈ {1, ..., ntree}, let y'_i(t) be the predicted class for instance i before the permutation in tree t, and let y'_{i,α}(t) be the predicted class for instance i after the permutation. The variable importance VI for variable m in tree t is given by

    VI_t(m) = ( Σ_{i ∈ β_t} 1(y_i = y'_i(t)) - Σ_{i ∈ β_t} 1(y_i = y'_{i,α}(t)) ) / |β_t|

The raw importance value for variable m is then averaged over all trees in the RF.
The variable importance used in this paper is the Mean Decrease Gini (MDG), which is based on the Gini splitting criterion discussed earlier. The MDG measures the decrease ΔI that results from the splitting. For a two-class problem, the change in I at node t is defined as [19]

    ΔI(t) = I(t) - I(t, X)

where I(t) is defined in equation 1 and I(t, X) is defined in equation 2. The decrease in Gini impurity is recorded for all nodes t in all ntree trees of the RF for every variable, and the Gini Importance (GI) of variable m is then computed as [19]

    GI(m) = Σ_{t=1}^{ntree} Σ_{t' ∈ tree t} ΔI(t')    (6)
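A small numeric illustration of the Gini splitting criterion described above (the node counts are made up):

```python
# Gini impurity of a node from its per-class counts, and the impurity
# decrease obtained by a candidate split into children.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_decrease(parent, children):
    """Decrease in impurity: parent impurity minus weighted child impurity."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

parent = [40, 40]              # 40 active, 40 inactive: impurity 0.5
split = [[30, 10], [10, 30]]   # candidate split into two children
d = gini_decrease(parent, split)
# gini(parent) == 0.5, each child has impurity 0.375, so d == 0.125
```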

Classification with Repeated Random Sub-Sampling
Training the classifier on a data set that is small and highly unbalanced will produce unreliable results, as discussed in earlier sections. To overcome this issue, we used repeated random sub-sampling. Initially, we construct the testing data and the NoS training data sub-samples. The classifier is then learned NoS times; in every iteration the input to the classifier is one of the training data sub-samples together with the testing data. Each iteration produces a classification, and the final vote is computed using majority voting over all the splits.
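The train-NoS-times-and-vote loop can be sketched as below. The `fit_predict` stub is a hypothetical stand-in for the real learner (RF in the paper); it simply predicts the majority class of its training sub-sample.

```python
from collections import Counter

# Train once per balanced sub-sample; the final label for each test point
# is a majority vote over the NoS per-sub-sample predictions.
def classify_with_subsampling(sub_samples, test_data, fit_predict):
    all_preds = [fit_predict(sub, test_data) for sub in sub_samples]
    final = []
    for i in range(len(test_data)):
        votes = Counter(preds[i] for preds in all_preds)
        final.append(votes.most_common(1)[0][0])
    return final

def fit_predict(train, test):
    """Toy stand-in classifier: predicts its sub-sample's majority class."""
    maj = Counter(y for _, y in train).most_common(1)[0][0]
    return [maj] * len(test)

subs = [[(0, 1), (1, 1)], [(2, 0), (3, 0)], [(4, 1), (5, 1)]]
labels = classify_with_subsampling(subs, test_data=[10, 20], fit_predict=fit_predict)
# per-sub-sample votes are [1, 0, 1], so the majority vote is 1 for each point
```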

Model Evaluation
To evaluate the performance of RF we compared it to SVM on unbalanced data sets for eight different chronic disease categories. Two sets of experiments were carried out:
1. Set I: this scheme compares the performance of RF and SVM, both with repeated random sub-sampling. Both classifiers were fitted to the same training and testing data and the process was repeated 100 times. The ROC curve and the average AUC for each classifier were calculated and compared.
2. Set II: this scheme compares RF performance with and without the sampling approach. Without sampling the data set is highly unbalanced, while sampling should improve accuracy since the training sub-samples fitted to the model are balanced. The process was again repeated 100 times and the ROC curve and the average AUC were calculated and compared.

Results
We performed the classification using R, an open-source statistical software environment.
We used the R randomForest and SVM (e1071) packages. The parameters to the RF were as follows: the number of trees (ntree) was set to 500. Overall, the number of trees did not strongly influence the classification results. We verified this by running RF with different ntree values to predict breast cancer and plotting ntree versus sensitivity (Figure 2).
The number of variables randomly sampled as candidates at each split (mtry) is the square root of the number of features, in our case for 262 features, mtry was set to 16.
Palmer et al. [20] and Liaw et al. [21] also reported that RF is usually insensitive to the training parameters. Figure 2 shows how the sensitivity of RF varies with the number of trees (ntree); we varied ntree from 1 to 1001 in intervals of 25 and measured the sensitivity at every interval. Sensitivity ranged from 0.8457 at ntree = 1 to 0.8984 at ntree = 726. In our experiments we used ntree = 500, since ntree did not have a large effect on accuracy for ntree > 1. For SVM we used a linear kernel; the termination criterion (tolerance) was set to 0.001, the epsilon of the insensitive-loss function was 0.1 and the regularization term (cost) was 1.
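The mtry rule above is simply the floor of the square root of the feature count; the scikit-learn call in the comments is an approximate translation of the R configuration, not the authors' code.

```python
import math

# mtry = number of variables tried at each split = floor(sqrt(#features));
# for the paper's 262 features this gives 16.
n_features = 262
mtry = int(math.sqrt(n_features))
# mtry == 16

# Roughly equivalent scikit-learn configuration (an assumption, shown for
# readers without R; the paper used the R randomForest package):
# from sklearn.ensemble import RandomForestClassifier
# rf = RandomForestClassifier(n_estimators=500, max_features="sqrt")
```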
We randomly selected N=10,000 data points from the original HCUP data set.We predicted the disease risks on 8 out of the 259 disease categories.Those categories are: breast cancer, type 1 diabetes, type 2 diabetes, hypertension, coronary atherosclerosis, peripheral atherosclerosis, other circulatory diseases and osteoporosis.

Result set I: Comparison of RF and SVM
Both the SVM and RF classification were performed 100 times and the average area under the curve (AUC) was measured. The repeated random sub-sampling approach improved the detection rate considerably. On seven out of eight disease categories RF outperformed SVM in terms of AUC (Table 3). In addition, RF has ROC curves that reach a 100% detection rate faster than SVM (Figures 3, 4, 5); that is, we obtain a 100% detection rate at a lower false alarm rate. We compared our results to the ones reported by other authors. For instance, Yu et al.
[8] illustrated a method using SVM for detecting persons with diabetes and pre-diabetes. They used a data set from the National Health and Nutrition Examination Survey (NHANES). NHANES collects demographic, health history and behavioural information, and may also include detailed physical, physiological and laboratory examinations. The AUC for their classification schemes I and II was 83.47% and 73.81% respectively. We also predicted diabetes using SVM and RF, and the AUC values were 87.8% and 90% respectively (ROC curve in Figure 3). The ROC curve for breast cancer compares RF with the sampling and non-sampling approaches; the average AUC over the 100 runs is 89.76% and 82.02% respectively. The ROC curve for other circulatory diseases compares RF with the sampling and non-sampling approaches; the average AUC over the 100 runs is 78.46% and 75.95% respectively.
The ROC curve for peripheral atherosclerosis compares RF with the sampling and non-sampling approaches; the average AUC over the 100 runs is 90.69% and 88.41% respectively.

Conclusions and Future Work
In this paper we presented a method for predicting disease risks for patients based on the HCUP data set. Our method can be used in various domains such as health risk management, tailored health communication and medical decision making. For example, a list of possible future diagnoses can be provided to the physician so that appropriate medical tests can be ordered before the disease onset. Moreover, early discussion of lifestyle adjustments may lead to disease prevention. Finally, early diagnosis and prevention may lead to a decrease in healthcare spending and an increase in care quality.
Our results show that the use of repeated random sub-sampling is useful when dealing with highly unbalanced data. While incorporating demographic information increased the accuracy of the prediction, the increase was not as large as we expected. The variable importance is useful in determining the interactions among variables and the association between the variables and the prediction results. However, the usefulness and accuracy of our proposed method cannot be exploited without expert medical knowledge. Only physicians can provide insight on how and when to use the proposed method.
Also, a better data set could open the door to new ideas. Limitations of the HCUP data set, such as the arbitrary order of the ICD-9 codes and the lack of patient identification, lead to a rough disease prediction model. For example, if the codes were listed in chronological order according to the date they were diagnosed, one could use techniques such as HMMs to perform disease prediction and association. In addition, the HCUP data set does not provide an anonymous patient identifier, which could be used to check whether multiple records belong to the same patient and to determine the time interval between two diagnoses. However, we believe that the developed method can be improved when used on patient data warehouses where the above information is available.
Overall, the results obtained with the proposed method were satisfactory. The accuracy achieved in disease prediction is comparable to or better than previously published results. Also, the average AUC of about 88.58% may be acceptable in many applications.
Demographic variables age, race and sex are also included in the data set. Figure 1 shows the distribution of patients across age, race and sex. These three demographic variables are important in predicting certain medical conditions.

Figure 1 -
Figure 1 - Demographics of patients by age, race and sex for the HCUP data set

Algorithm 1 - Repeated Random Sub-Sampling

This method divides the data set into training and testing data, then partitions the training data into sub-samples, each containing an equal number of instances from each class (except, in some cases, the last sub-sample). The model is fitted repeatedly on every sub-sample and the final result is a majority vote over all the sub-samples. In this paper we used the repeated random sub-sampling approach to randomly choose N samples from the original HCUP data set. The testing data contains 30% of N and the training data contains the remaining 70%. The N samples are divided into two separate sets, N1 active data samples and N0 inactive data samples, where N1 + N0 = N. The testing data contains 30% of the N1 active samples (TsN1), while the remainder of the testing data (TsN0) is sampled from the N0 inactive samples. The training data set contains the remaining active samples (TrN1) and inactive samples (TrN0). Since the training data is highly unbalanced (TrN1 << TrN0), the TrN0 samples are partitioned into NoS training sub-samples, where NoS is the ratio between TrN0 and TrN1. Every sub-sample has an equal number of instances of each class. The training active samples (TrN1) are fixed among all the training sub-samples, while the inactive samples are sampled without replacement from TrN0. There will be NoS sub-samples on which to train the model. Eventually, every inactive sample in the training data is selected once, while every active sample is selected NoS times. After training the model on all the sub-samples, we take the majority vote to determine the final predictions (Algorithm 1).

1: TsN = total number of samples in the testing data
2: N0 = number of inactive samples
3: N1 = number of active samples
4: N = total number of samples in the data set, where N = N0 + N1
5: Generate testing data
   a. Randomly select TsN1 samples from N1, where TsN1 = 0.3 * N1
   b. Randomly select TsN0 samples from N0, where TsN0 = TsN - TsN1
6: Ts = TsN0 samples + TsN1 samples
7: Generate training data
   a. Contains TrN1 samples: the N1 samples remaining after generating the testing data
   b. Contains TrN0 samples: the N0 samples remaining after generating the testing data
8: NoS = TrN0 / TrN1
9: for s = 1 to NoS do
   a. Generate the s-th training data sub-sample
      i. TrSS1 = all TrN1 samples (active samples)
      ii. TrSS0 = randomly select TrN1 samples from the TrN0 samples without replacement (guarantees full balance of the training sub-samples, except, in some cases, the last sub-sample)
   b. TrSS = TrSS0 + TrSS1
   c. ys(x) = classifier(TrSS, Ts) (predicted class labels for Ts using sub-sample TrSS)
10: end for
11: y(x) = majority voting {ys(x)}, s = 1, ..., NoS (final predicted class is the majority vote over all sub-samples)

3. It estimates the importance of the variables used in the classification.

Algorithm 2 - Random Forest for Classification
1: ntree = number of trees to be generated
2: N = number of samples in the data set
3: for t = 1 to ntree do
   a. Generate bootstrap sample Z of size N from the original data, with replacement
   b. Grow a classification tree on the bootstrap sample Z
   c. for i = 1 to NumberOfNodes do
      i. Randomly sample mtry variables from the M variables
      ii. Choose the best split among the sampled variables (bagging is the special case of RF obtained when mtry = M)
   d. end for
   e. yt(x) = class prediction of the t-th tree
4: end for
5: Yrf(x) = majority voting {yt(x)}, t = 1, ..., ntree (final predicted class is the majority vote over all trees in the RF)

Figure 2 -
Figure 2 - RF behaviour when the number of trees (ntree) varies

Figure 5 -
Figure 5 - ROC curve for breast cancer

Table 1 -
The 10 most prevalent diseases

Table 2 -
Some of the most unbalanced diseases

Table 3 -
RF vs. SVM performance in terms of AUC on eight disease categories

Figure 3 - ROC curve for diabetes mellitus
ROC curve for diabetes mellitus comparing both SVM and RF; the average AUC over the 100 runs is 87.8% and 90% respectively.

Table 5 -
RF (sampling vs. non-sampling) performance in terms of AUC on eight disease categories