Development and validation of classifiers and variable subsets for predicting nursing home admission

Background In previous years a substantial number of studies have identified statistically important predictors of nursing home admission (NHA). However, as far as we know, these analyses have been done at the population level. No prior research has analysed the prediction accuracy of an NHA model for individuals.
Methods This study is an analysis of 3056 longer-term home care customers in the city of Tampere, Finland. Data were collected from the records of social and health service usage and from the RAI-HC (Resident Assessment Instrument - Home Care) assessment system between January 2011 and September 2015. The aim was to find the most efficient variable subsets for predicting NHA for individuals and to validate their accuracy. The variable subsets were searched using the sequential forward selection (SFS) method, a variable ranking metric and three classifiers: logistic regression (LR), support vector machine (SVM) and Gaussian naive Bayes (GNB). The results were validated using randomly balanced data sets and cross-validation. The primary performance metrics for the classifiers were the prediction accuracy and the average area under the curve (AUC).
Results The LR and GNB classifiers achieved 78% accuracy for predicting NHA. The most important variables were RAI MAPLE (Method for Assigning Priority Levels), functional impairment (RAI IADL, Instrumental Activities of Daily Living), cognitive impairment (RAI CPS, Cognitive Performance Scale), memory disorders (diagnoses G30-G32 and F00-F03), the use of community-based health services and prior hospital use (emergency visits and periods of care).
Conclusion The accuracy of the classifier for individuals was high enough to convince the officials of the city of Tampere to integrate the predictive model based on the findings of this study into the home care information system. Further work needs to be done to evaluate variables that are modifiable and responsive to interventions.

risk factors include advanced age, functional and cognitive impairments, depression, caregiver burden, use of health services, prior hospitalization or nursing home use and dementia. In a literature review [5] and meta-analysis [16], the strongest predictors of NHA were increased age, low self-rated health status, functional and cognitive impairment, dementia and prior NHA.
The above research has focused on finding risk factors for NHA. However, as far as we know, no prior research has reported the prediction performance or accuracy of an NHA model for individuals. The prior research identifies the statistically important variables that increase or decrease the risk of NHA at the population level. It is based on the traditional statistical data processing approach, in which statistical modeling connects data to a population of interest. It does not answer the question of how accurately nursing home admission can be predicted for individuals.
In this study, we identify the most important variable subsets of different sizes for predicting NHA. In particular, we measure and validate the performance of our NHA prediction models in terms of classification accuracy. That is, we search for the best model and measure how good it is for individuals. The variable subsets were searched by machine learning (ML) methods, and the classification accuracy was calculated using cross-validation. The data consisted of the service records of home care clients in the city of Tampere¹, Finland. Because our data set is highly unbalanced, we use a random operator to form balanced data sets, and all performance results are reported on those balanced data sets instead of the original unbalanced set.
We claim that knowledge of the classification accuracy is highly valuable when deciding on the adoption of a prediction model in actual service production. It should be noted that statistically significant variables do not guarantee high classification accuracy. Without adequate accuracy, the cost-effectiveness of targeted interventions is not good enough: interventions are targeted at a significant number of people who are not at risk ("false positives"), and some of those in need of an intervention do not receive one ("false negatives"). Furthermore, the resource planning of service production benefits from individual predictions of upcoming admissions. The primary contributions of this paper are summarized below:
• As far as we know, no prior research has investigated variable subsets of different sizes for predicting NHA. A few scholars have applied variable selection methods [3-5, 13]. However, they did not investigate variable subsets of different sizes (1 to n variables), as we did in this study.
• The second contribution relates to the way classification algorithms are used, trained and validated for predicting NHA. In contrast to prior research, the present study investigates NHA prediction models for individuals. Prior research investigated statistically significant population-level risk factors for NHA, with the 5% level of significance as the de facto standard for important variables. In this study, we measure the classification accuracy of classifiers trained and validated using cross-validation. That is, we study how accurately our model classifies unseen home care clients according to their risk of NHA.
The objective of this study was to gain a better understanding of the level of accuracy at which NHA can be predicted, in order to support decision making in home care services and the allocation of resources between customers. The classification accuracy of our method was 78%, which was high enough to support the decision to integrate it into the local home care information system². The remainder of this paper is divided into three parts. In the first part, we describe how the variables of our prediction models were aggregated and how the variables were selected for the subset selection process. The second part introduces the methods for training and validating the classifier algorithms. The third part presents the performance of the variable selection and discusses the results and practical implications.

Data source
The data consisted of the records of 7259 home care customers between January 2011 and September 2015 in the city of Tampere, Finland. These data were linked to records that contained information regarding all social and health care service usage during the same period. Nursing home admission (the model outcome) was coded as a binary indicator of whether or not the customer was admitted to a nursing home. The data were linked at the customer level using unique encrypted identifiers. We excluded clients whose recorded home care episode between January 2011 and September 2015 was shorter than 12 months (n = 3192) and those whose RAI-HC (Resident Assessment Instrument - Home Care [17]) values were missing (n = 981). In total, we had 3056 customers for analysis (NHA "true" for 539 and "false" for 2517).
All the variables were calculated 3-12 months before the evaluation day $t_{ev}$. In addition, the variables were calculated 6-12 months before $t_{ev}$ for additional analyses. The main variables are listed in the column "variable" of Table 1. The variables were selected by experts in elderly care services. Figure 1 shows a time scale in which $t_{s,i}$ is the starting day and $t_{e,i}$ the ending day of home care according to the home care service data for customer $i$ ($i = 1, \ldots, 3056$). The variables were the numbers of events, or boolean values (true/false) indicating that an event occurred, between times $t_k$ and $t_{k+1}$ ($k = 1, 2, 3$) or between $t_1$ and $t_4$. For example, variable $j$ (a blue box in Fig. 1) for customer $i$ was calculated from the time period between $t_{2,ij}$ and $t_{3,ij}$. The interval between times $t_1$ and $t_{ev}$ was set to 12 months, with $t_1 < t_2 < t_3 < t_4 < t_{ev}$. If the NHA variable was "true" for customer $i$, that is, customer $i$ was admitted to a nursing home at time $t_{e,i}$, then $t_{ev,i}$ was set to the admission day ($t_{ev,i} = t_{e,i}$). If the NHA variable of customer $i$ was "false", then $t_{ev,i}$ was a random day between $t_{s,i}$ and $t_{e,i}$ such that $t_{ev,i} > t_{s,i} + 12$ months. Table 1 presents general characteristics of the study sample (n = 3056), of which 539 (17.6%) were admitted to a nursing home. The table includes the results of the t-test for continuous variables and the chi-squared test for categorical variables, testing for significant differences between the groups of home care customers and nursing home residents.

The Instrumental Activities of Daily Living (IADL) scale [36] provides a measure of the customer's self-performance of seven daily tasks: meal preparation, ordinary housework, managing finances, managing medications, phone use, shopping and transportation. The scores range from 0 to 21. The Method for Assigning Priority Levels (MAPLe) differentiates customers into five groups ranging from low to very high risk of health decline [34]. A higher risk group indicates a higher risk of being admitted to a long-term care facility.
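The choice of the evaluation day and of the observation window can be illustrated with a minimal Python sketch. It is not the study's code: the function names are hypothetical, and "12 months" is approximated as 365 days purely for illustration.

```python
import random
from datetime import date, timedelta

YEAR = timedelta(days=365)  # assumption: "12 months" approximated as 365 days


def evaluation_day(t_s, t_e, admitted, rng):
    """Return t_ev: the admission day for NHA cases, otherwise a random
    day later than t_s + 12 months and no later than t_e."""
    if admitted:
        return t_e
    earliest = t_s + YEAR + timedelta(days=1)
    if earliest > t_e:
        raise ValueError("home care episode shorter than 12 months")
    return earliest + timedelta(days=rng.randrange((t_e - earliest).days + 1))


def observation_window(t_ev, months_before=3):
    """Return (t_1, window_end): variables are computed from 12 months
    before t_ev up to `months_before` months before t_ev (3 or 6)."""
    t_1 = t_ev - YEAR
    window_end = t_ev - timedelta(days=round(months_before * 365 / 12))
    return t_1, window_end


rng = random.Random(0)
t_s, t_e = date(2012, 1, 1), date(2014, 6, 30)  # hypothetical home care episode
t_ev = evaluation_day(t_s, t_e, admitted=False, rng=rng)
t_1, t_4 = observation_window(t_ev, months_before=3)
```

Calling `observation_window(t_ev, months_before=6)` gives the shorter 6-12 month window used in the additional analyses.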

Variable subset selection
The aim of this study was to find efficient variable subsets $X_{sub} = \{x_i \mid i = 1, \ldots, n\}$ from a large variable set $X = \{x_j \mid j = 1, \ldots, k\}$ for predicting NHA, where $n$ and $k$ are the numbers of variables and $n < k$. Let $Y$ be the binary vector of the NHA variable and $F(\cdot)$ a classifier such that $Y = F(X_{sub})$. That is, we predict the state of $Y_i$ for customer $i$ at time $t_{ev,i}$ (Fig. 1), when the variable vector $X_{sub,i}$ is calculated from the time range $t_{1,i}$ to $t_{4,i}$ (3-12 months before $t_{ev,i}$) or $t_{1,i}$ to $t_{3,i}$ (6-12 months before $t_{ev,i}$).
Variable selection is a mature research topic and has been used for many applications [18]. In this study we applied the sequential forward selection (SFS) method [19] for variable subset generation. SFS starts with an empty set and adds one variable at a time from the original set $X$, choosing at each step the variable that maximizes the performance measure of the classifier. Our primary performance metric was classification accuracy:

$$acc(y, y_{pred}) = \frac{1}{n_{samples}} \sum_{i=1}^{n_{samples}} L(y_{pred,i} = y_i) \qquad (1)$$

where $y_{pred,i}$ is the predicted NHA class of the $i$-th sample, $y_i$ is the corresponding true NHA class, $n_{samples}$ is the number of samples and $L(\cdot)$ is the indicator function ($L = 1$ if $y_{pred} = y$; $L = 0$ if $y_{pred} \neq y$) [20]. We additionally calculated the average area under the curve (AUC) and the true-positive rate (recall) for the classifiers. The AUC values correlated almost perfectly with the acc values, but we decided to report them because in some research areas they are more familiar than the acc values. The recall of a classifier is calculated by dividing the number of correctly classified positives (true positives) by the total positive count (true positives + false negatives) [21]. That is, recall is the probability that a customer at risk is found. The strength of the accuracy metric, compared to the other common metrics, is that it is easy to understand. It should be noted that our data set is highly unbalanced. We use a random operator to form balanced data sets, and the performance results are reported on those balanced data sets instead of the original set. Otherwise, the accuracy metric would be biased and unsuitable.
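As an illustration of the accuracy and recall definitions above, the following short Python sketch computes both on toy labels and then shows why accuracy is misleading on the original class sizes (539 versus 2517). The function names and the toy labels are illustrative assumptions, not part of the study's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(int(p == t) for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """True positives divided by (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


# Toy labels (1 = admitted to a nursing home, 0 = not admitted).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
acc = accuracy(y_true, y_pred)   # 6 of 8 correct
rec = recall(y_true, y_pred)     # 3 of 4 positives found

# Why balancing matters: on the original class sizes (539 vs 2517),
# a classifier that always predicts "not admitted" looks accurate
# but finds no customer at risk.
y_unbalanced = [1] * 539 + [0] * 2517
always_negative = [0] * len(y_unbalanced)
naive_acc = accuracy(y_unbalanced, always_negative)  # about 0.82
naive_rec = recall(y_unbalanced, always_negative)    # 0.0
```

On a balanced set the same trivial classifier would score 50%, which is the chance level against which the reported accuracies should be read.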
An alternative to SFS would be sequential backward elimination (SBE). SBE starts with $X$ and eliminates one variable at a time by maximizing the performance measure. Our selection of SFS instead of SBE is justified by the ratio of the number of relevant variables (#r) to the number of all variables (k). According to Liu et al. [18], if #r is small, then the SFS strategy should be used, and if the number of irrelevant variables (k − #r) is small, then the SBE strategy should be used. According to our pre-tests, the original variable set $X$ includes many irrelevant variables (low univariate prediction power); thus, we prefer the SFS strategy.
Different classifiers perform differently on different data sets. In this study, we evaluated the performance of three classifiers: logistic regression (LR) [22], Gaussian naive Bayes (GNB) [23] and support vector classifier (SVC) [24]. That is, SFS was run three times, once with each of the LR, GNB and SVC classifiers. Figure 2 shows the components of the variable subset selection process. The subset generation component (SFS) feeds a candidate variable subset $X_{sub}$ to the subset evaluation component. The evaluation component trains and validates the classifier and calculates the accuracy values for the subset $X_{sub}$.
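The interplay between subset generation and subset evaluation can be sketched as a generic greedy loop. This is a simplified illustration, not the implementation used in the study: `sequential_forward_selection` and `toy_score` are hypothetical names, and the toy scoring function with its invented gains merely stands in for the cross-validated accuracy of a classifier.

```python
def sequential_forward_selection(variables, score, max_size):
    """Greedy SFS: start from the empty set and repeatedly add the single
    variable that maximizes score(subset). Returns the subsets of sizes
    1..max_size in selection order, each paired with its score."""
    selected, remaining, path = [], list(variables), []
    while remaining and len(selected) < max_size:
        best = max(remaining, key=lambda v: score(selected + [v]))
        selected = selected + [best]
        remaining.remove(best)
        path.append((list(selected), score(selected)))
    return path


# Hypothetical stand-in for cross-validated accuracy: each variable adds
# some predictive value, and two overlapping variables add less together.
GAIN = {"x1": 0.15, "x2": 0.12, "x3": 0.10, "x4": 0.01}


def toy_score(subset):
    s = 0.5 + sum(GAIN[v] for v in subset)
    if "x1" in subset and "x3" in subset:
        s -= 0.08  # overlap penalty (invented for the example)
    return s


path = sequential_forward_selection(GAIN, toy_score, max_size=3)
for subset, value in path:
    print(len(subset), subset, round(value, 2))
```

In the setting described above, `score` would train and validate LR, GNB or SVC on the candidate subset, and the position at which a variable enters `path` is its rank for that run.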
It should be noted that we use a random operator to form balanced data sets for the analyses. Let $A = \{X \mid Y = 1\}$ and $B = \{X \mid Y = 0\}$ be the data sets. That is, the set $A$ contains the data of customers for whom the NHA variable is "true" and the set $B$ those for whom it is "false". Because the set $A$ is smaller (n = 539) than the set $B$ (n = 2517), the balanced data set $C$ was formed such that $C = A \cup R(B)$, where $R$ is a random operator that selects 539 random samples from $B$, thus setting the chance level at 50%. To make sure that the selection did not bias the results, the data set $C$ was formed 100 times.
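A minimal sketch of the balancing step, assuming that the random operator draws without replacement (the text does not state this explicitly). The placeholder records and the function name `balanced_sample` are hypothetical.

```python
import random


def balanced_sample(A, B, rng):
    """C = A ∪ R(B): keep all of A and draw len(A) items of B at random,
    without replacement (an assumption of this sketch)."""
    return list(A) + rng.sample(list(B), len(A))


# Placeholder records standing in for the customer data.
A = [("pos", i) for i in range(539)]    # NHA "true"
B = [("neg", i) for i in range(2517)]   # NHA "false"

rng = random.Random(42)
samples = [balanced_sample(A, B, rng) for _ in range(100)]  # C_1 ... C_100
```

Each of the 100 sets contains all 539 positive cases and a different random draw of 539 negative cases, so every downstream result can be averaged over the draws.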
Furthermore, the classifier algorithms were trained and validated using ten-fold cross-validation. That is, we formed sample $C_i$ ($i = 1, \ldots, 100$) from the data sets $A$ and $B$ and split it into 10 equal-sized parts $P_{ik}$ ($P_{ik} \subset C_i$, $k = 1, \ldots, 10$). The classification accuracy value $acc_{ik}$ was calculated by Eq. 1 for the part $P_{ik}$ of the data set $C_i$ when the parameters of the classifier were trained with the other nine parts of $C_i$. The process was repeated for $k = 1, 2, \ldots, 10$. The overall classification accuracy, $CA$, for the subset $X_{sub,n}$ of size $n$ ($n = 1, \ldots, 15$) was calculated as the average over all folds and balanced samples:

$$CA_{X_{sub,n}} = \frac{1}{100 \cdot 10} \sum_{i=1}^{100} \sum_{k=1}^{10} acc_{ik} \qquad (2)$$

SFS calculated the best variable subsets for every balanced data set $C_i$. That is, we have 100 variable subsets for each size from 1 to 15 variables. The (average) importance of each variable $j$ was measured by a rank metric $R(j)$ (Eq. 3), where $r(i, j)$ is the rank of variable $j$ based on sample $C_i$ and #F is the size of the largest subset that was formed by SFS [25-27]. In this study #F = 15. A higher $R(j)$ indicates that variable $j$ is more important according to SFS, because it was selected for smaller variable subsets. That is, the variable has a higher NHA classification ability according to SFS.
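The validation loop can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the overall accuracy is taken as the plain mean of the fold accuracies over all balanced samples, the one-variable threshold classifier and the simulated data are hypothetical stand-ins for LR, GNB or SVC and the real customer data, and only 5 samples are drawn instead of 100 to keep the example short.

```python
import random
from statistics import mean


def k_fold_parts(n_samples, k, rng):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal parts."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fold_accuracies(X, y, fit, predict, k, rng):
    """Return one accuracy per fold: train on the other k-1 parts,
    then test on the held-out part."""
    accs = []
    for test_idx in k_fold_parts(len(y), k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accs.append(hits / len(test_idx))
    return accs


def fit_threshold(X, y):
    """Hypothetical one-variable classifier: threshold midway between class means."""
    m1 = mean(x for x, t in zip(X, y) if t == 1)
    m0 = mean(x for x, t in zip(X, y) if t == 0)
    return (m0 + m1) / 2


def predict_threshold(threshold, x):
    return int(x > threshold)


rng = random.Random(0)
all_accs = []
for _ in range(5):  # the study used 100 balanced samples; 5 keeps the demo short
    y = [1] * 50 + [0] * 50                             # balanced toy labels
    X = [rng.gauss(1.0 if t else 0.0, 1.0) for t in y]  # simulated variable
    all_accs.extend(fold_accuracies(X, y, fit_threshold, predict_threshold, 10, rng))

CA = mean(all_accs)  # average over all samples and folds
```

Because every accuracy value comes from a part that played no role in fitting, `CA` estimates how the classifier performs on unseen customers.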

Software
We used four Python packages, sklearn [20], mlxtend [28], numpy [29] and pandas [30], to implement the classifiers and to compute acc, AUC, recall, $CA_{X_{sub}}$ and $R(j)$. SFS was computed by the function "SequentialFeatureSelector" in the package mlxtend. The LR, SVC and GNB classifiers were implemented using the functions from the sklearn.linear_model, sklearn.svm and sklearn.naive_bayes packages. The numpy and pandas packages were used for data reading and processing.

Results
[...] [32]. In general, classification ability is useful if AUC > 0.75 [33]. That is, the performance of the classifiers with 15 variables was at a good level.
When the variables were calculated 6-12 months before the evaluation day $t_{ev}$, the average accuracy of the LR, SVC and GNB classifiers was 0.747 (CI95% = .0030), 0.737 (CI95% = .0029) and 0.734 (CI95% = .0029), respectively, for the variable subset of 15 variables. The AUC values were 0.819 (CI95% = .0027), 0.810 (CI95% = .0028) and 0.813 (CI95% = .0025). The recall values were 0.732 (CI95% = .0017), 0.738 (CI95% = .0025) and 0.732 (CI95% = .0026). The results for the 6-12 months variables show a moderate decrease in performance compared to the 3-12 months variables (e.g. LR CA: 0.776 → 0.747). The performance of the classifiers with the 6-12 months variables, however, is still at a good level (AUC > 0.8).

Figure 4 shows the p-values calculated by Student's t-test when the average classification accuracy values for the subsets of 15 variables (3-12 months) were compared to those for the subsets of n variables (n = 1, ..., 15). We defined the difference between the performances of variable subsets to be statistically significant if p < .05. According to this definition, the optimal subset size for the LR method was 9 variables. That is, the performance achieved by the subset of 9 variables did not differ statistically from that of the subset of 15 variables when the classifier was LR.

Fig. 4 P-value as a function of the size of the variable subset compared to the subsets of 15 variables.

Table 2 sorts the variables according to the ranking score, R, described by Eq. 3, for the LR and GNB classifiers. A large R(j) value means that the variable j was selected regularly in small variable subsets for different balanced data sets C. That is, the NHA classification ability of the variable j is high. According to the results, the most important variables were the diagnoses G30-G32 and F00-F03 and the RAI metrics IADL (Instrumental Activities of Daily Living), MAPLE (Method for Assigning Priority Levels) and CPS (Cognitive Performance Scale). In addition, variables related to the numbers of periods of care were important for predicting NHA with both classifiers. It should be noted that the RAI variables (IADL, MAPLE and CPS) are not simple measurements or observations, but scoring systems developed by researchers and practitioners (e.g., MAPLE [34], CPS [35] and IADL [36]). Therefore, it is not surprising that these variables perform so well at predicting NHA.

Figures 5 and 6 plot the normalized ranking score values for the SVC and GNB classifiers as a function of the values for LR. The ten variables with the highest R values for the LR classifier are labelled in the figures. The 45° identity line visualizes the differences between the R values of the classifiers. Variables in the lower-right region of the line were more important for the LR than for the SVC (Fig. 5) or for the GNB (Fig. 6). Similarly, those in the upper-left region were more important for the SVC (Fig. 5) or for the GNB (Fig. 6) than for the LR. For example, the diagnosis N30-N39 was more important for the SVC classifier than for the LR. However, the differences between the most important variables of the classifiers were rather small. The RAI MAPLE, RAI IADL and RAI CPS variables and the diagnoses F00-F03 and G30-G32 were the five most important variables for all classifiers.

Discussion
The aim of the study was to analyse predictors and find efficient variable subsets for predicting NHA in a sample of home care customers. In particular, we wanted to find and report the level of accuracy at which NHA can be predicted for individuals. Our results show that nursing home admission can be predicted with an accuracy of 78% when the variables are calculated 3-12 months before the evaluation day and 74% when they are calculated 6-12 months before it. Thus, on average, our model assigns roughly four out of five (3-12 months) or three out of four (6-12 months) home care customers to the right class in terms of nursing home admission. This is crucial information for decision makers for two reasons. Firstly, the model has to be accurate enough to justify investments in preventive interventions. If the accuracy of the model is too low, there are too many false positives and the cost-effectiveness of the interventions is low. Secondly, the model needs to identify the individuals at high risk well in advance of the admission. Otherwise, it is too late to implement any interventions. Therefore, the fact that the accuracy of our model with variables calculated 6-12 months before the evaluation day is as high as 74% is important.
As far as we know, no prior research has published the classification accuracy of an NHA model for individuals. It should be noted that classification accuracy is a very common metric in machine learning and other fields. However, prior research has done the analyses at the population level. The important variables have been detected using the 5% level of significance. That is, the values of the parameters of a model (e.g. linear regression [12], logistic regression [9,14] or the Cox model [4,5]) are estimated from the whole data set (without a split into training and test sets) and the significance levels for the coefficients are derived. Nothing further has been done to check whether the model generalizes to data and individuals that played no role in estimating the model parameters. A few scholars of NHA (e.g. [2,12]) have applied goodness-of-fit measures (e.g. AIC, $R^2$) to the model, but the results were often closer to zero than to one (≈ .20-.25).
We see the above shortcomings in NHA research as related to the public health science and data modelling cultures, in which model validation is omitted or calculated only on training data [37]. In this study we searched for the important variables by averaging the results of variable selection executed on many random splits of the whole data set. The importance of the variables was measured by the ranking metric. The classification accuracy of the model for different variable subsets was tested by cross-validation. The variable selection from many random data samples and the cross-validation support the generalizability of our variables and models.
The most important variables were RAI MAPLE, functional impairment (RAI IADL), cognitive impairment (RAI CPS), memory disorders (G30-G32 and F00-F03), the use of community-based health services and prior hospital use (emergency visits and periods of care). The ICD-10 (International Classification of Diseases) group G30-G32 contains the codes for other degenerative diseases of the nervous system (e.g. Alzheimer's disease) and F00-F03 those for dementia. A comparison of our results with the findings of other investigations revealed that, in particular, functional [1-3, 5-9, 11, 13, 14] and cognitive [2,5,8,9,11,14] impairment, dementia [1,3,13,14] and the use of community-based health services [2,4] or prior hospitalization [9] were also strong predictors of NHA in those studies. In contrast to our findings, several studies [2, 4-6, 9, 13, 14] found that increased age leads to an increased risk of NHA. In our study, the importance of the age variable was rather low according to the ranking score.
The major strengths of this study include its detailed assessment of important variables, its model validation and the availability of a range of important variables for nursing home admission. The accuracy of the model was high enough to convince the officials of the city of Tampere to integrate the predictive model into the home care information system. However, the present study has some limitations. First, we were unable to investigate the associations of social relationships with nursing home admission. Some studies have shown that caregiver characteristics [4,7,14,38,39], having children [8,9] and marital status [6] can be important factors for NHA. Second, this was a study of home care clients living in a defined geographical location, which may limit the generalizability to older adults living in other areas. Also, the findings of this study may not be applicable to populations without home care services. In addition, many of the risk variables identified in this study are not modifiable. Further work needs to be done to evaluate variables that are modifiable and responsive to interventions.

Practical implications
It is clear that applying ML methods will advance and reform the work of gerontology researchers and practitioners. The benefits can be viewed from two aspects: 1) ML methods can be used to construct practical computer software for predicting NHA to aid the decision-making of practitioners, and 2) large variable groups can be studied and the most important variables can be found.
Aspect (1) contributes most to the work of practitioners, e.g. home care case managers. The problem the case managers face is the same as in any preventive care: it is difficult to achieve cost-efficiency if a specific subgroup cannot be targeted. Usually a small intervention is provided for everyone, which is not enough for those at high risk. In order to be effective, the preventive measure needs to be substantial (e.g. in the case of home care customers, EUR 2000), but it becomes too expensive if offered to many customers. With limited resources, one needs to know which customers are most in need of a rehabilitation intervention and target those individuals to maximize cost-effectiveness.
In the case of home care in Tampere, about 17% of customers are admitted to a nursing home within a year, which is the a priori risk for everyone. The algorithm produced with ML techniques gives a much more accurate risk value, enabling the targeting of interventions. Without an accurate prediction algorithm, it is difficult to identify the high-risk individuals. It is not enough to identify variables that have a statistically significant relationship with NHA, because this does not provide guidelines that can be applied in practice. For example, we know that a diagnosis indicating dementia or Alzheimer's disease increases NHA risk, but this information is not specific enough to identify the individuals in need of an intervention (unless we target everyone with that particular diagnosis). The ML algorithm provides a risk classification and also allows the accuracy of the prediction to be estimated. Also, in many cases the case managers need to convince their superiors of the need to invest in rehabilitation interventions for a particular customer. The risk estimate from a validated prediction algorithm can be used as a means of communication between the case manager and their superior.
Furthermore, the prediction model can also be used to estimate resource requirements for 24 h services by summing up the individual predictions. The predictions provide an upper-limit estimate of the capacity requirement. Over time, as data are gathered on how much targeted rehabilitation interventions can reduce NHA, the capacity estimates will become more accurate.
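Summing the individual predictions can be sketched as follows. The risk values are invented for illustration, the function names are hypothetical, and reading the sum of risks as an expected count assumes that the model outputs calibrated probabilities, which the text does not state; counting the customers above a decision threshold is shown as an alternative reading.

```python
def expected_admissions(risks):
    """Expected number of admissions, assuming calibrated risk estimates."""
    return sum(risks)


def predicted_admissions(risks, threshold=0.5):
    """Number of customers classified as at risk (risk >= threshold)."""
    return sum(1 for r in risks if r >= threshold)


# Hypothetical per-customer NHA risk estimates produced by a classifier.
risks = [0.82, 0.10, 0.35, 0.05, 0.64]

expected = expected_admissions(risks)     # sum of the risk values
flagged = predicted_admissions(risks)     # customers at or above the threshold
```

Either figure is an upper-limit estimate in the sense of the text, because successful interventions would reduce the number of realized admissions.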
In this study, the city of Tampere integrated software containing the prediction algorithm into its data warehouse. The software aggregates and processes the variables from different databases and calculates the customer-specific NHA risk value. If the risk is high, the case managers consider customer-specific interventions, e.g. a new service level assessment, more home care visits, a particular therapy or revised medication. Prior to the implementation of the prediction algorithm, the rehabilitation interventions were not targeted systematically. Most often, interventions were used when a caretaker, nurse or next of kin noticed a change in functional ability and notified the case manager. With the prediction algorithm, interventions are targeted based on more objective evaluations and customers are screened regularly. In this way it is possible to identify customers at risk earlier than before. Also, after the implementation of the prediction algorithm, the selection of rehabilitation interventions available for home care customers has been increased.
The next step in the study and implementation project is to gather data on the interventions and their effects, and to build another ML model to predict the effectiveness of each intervention for each type of customer. The model can also be used to predict who is no longer capable of benefitting from an intervention. This added information will further improve the cost-effectiveness of home care.
Aspect (2) contributes to both gerontology research and practical work. Variable selection can be used to identify which of the available variables are closely related to the prediction of NHA and to discard those unrelated to it, reducing the dimensionality of the data set. For the gerontology researcher, the process of variable selection may reveal new variables that had not previously been considered relevant to NHA. For example, in this study, we found about 10 important variables for predicting NHA. Furthermore, the model validity is easier to evaluate after variable selection has reduced the dimensionality of the model. After dimension reduction, the researchers know the variables on which they should focus in their research [40]. For NHA research, this may mean that the variables on which the interventions should be focused can be found.
The second benefit of variable selection is that the number of variables integrated into the software tool can be minimized. This is important because each added variable requires resources for data aggregation and validation and creates requirements for data integration from different databases.

Conclusion
Most elderly people prefer to live at home in a familiar environment rather than move to a nursing home. The findings of our study indicate important variable subsets for predicting NHA of community-dwelling home care customers, and offer the potential to identify individuals who are at risk of NHA with an accuracy of 78%. The most important variables were RAI MAPLE, functional impairment (RAI IADL), cognitive impairment (RAI CPS), memory disorders (diagnoses G30-G32 and F00-F03), the use of community-based health services and prior hospital use.

¹ Tampere is the third largest city in Finland. The proportion of the population aged over 65 years is 18.0%, which is approximately the same as in the other big cities in Finland (http://www.stat.fi). Also, the scope of services offered for the elderly as well as the eligibility criteria for home care and nursing home care are fairly similar in all areas in Finland.
2