Decision tree–based classifier in providing telehealth service

Chern, Ching-Chin; Chen, Yu-Jen; Hsiao, Bo

doi:10.1186/s12911-019-0825-9

Research article
Open access
Published: 30 May 2019

BMC Medical Informatics and Decision Making volume 19, Article number: 104 (2019)

Abstract

Background

Although previous research showed that telehealth services can reduce the misuse of resources and urban–rural disparities, most healthcare insurers do not include telehealth services in their health insurance schemes. Therefore, no target variable exists for the classification approaches to learn from or train with. The problem of identifying the potential recipients of telehealth services when introducing telehealth services into health welfare or health insurance schemes becomes an unsupervised classification problem without a target variable.

Methods

We propose the heuristic decision tree telehealth classification approach (HDTTCA), a systematic approach for identifying those who are eligible for telehealth services. The main process of HDTTCA involves (1) data set preprocessing, (2) decision tree model building, and (3) predicting and explaining the most important attributes in the data set for patients who qualify for telehealth service.

Results

This work uses 2012 data from the NHIRD provided by the NHIA in Taiwan as its research scope. The data set consists of 55,389 distinct hospitals and 653,209 distinct patients with 15,882,153 outpatient and 135,775 inpatient records. After HDTTCA produces the final version of the decision tree, the rules can be used to assign the values of the target variable in the entire NHIRD. Our data indicate that 3.56% (23,262 out of 653,209) of the patients were eligible for telehealth services in 2012. This study verifies the efficiency and validity of HDTTCA by using a large data set from the NHI of Taiwan.

Conclusion

This study repeats a series of experiments 30 times to compare the HDTTCA results with the logistic regression findings by measuring their average performance and determining which model addresses the telehealth patient classification problem better. Four metrics are used to compare the results. In terms of sensitivity, the decision tree generated by HDTTCA and the logistic regression model perform equally. In terms of accuracy, specificity, and precision, the decision tree generated by HDTTCA performs better than the logistic regression model. When HDTTCA is applied, the decision tree model delivers competitive performance and provides clear, easily understandable rules. Therefore, HDTTCA is a suitable choice for solving telehealth service classification problems.

Background

The concept of telehealth first appeared in the 1900s, when physicians began discussing diseases by telephone, and it has since evolved to the point that robotic surgery can be performed regardless of geographical restrictions [19]. Telehealth uses electronic information and communication technology to deliver health and medical information and services over large and small distances [11, 19]. The U.S. Health Resources and Services Administration defines telehealth as “the use of electronic information and telecommunications technologies to support long-distance clinical health care, patient and professional health-related education, public health, and health administration” [7]. For chronically ill patients and people with disabilities who require frequent updates of health parameters, telehealth services can provide convenience, mobility, and ease of use. As the aging population and the number of people with disabilities increase, teleassistance and telemonitoring platforms play increasingly significant roles in delivering efficient and low-cost remote care in assisted living environments [22].

The emergence of wireless technologies and the advancements in on-body sensor design can facilitate changes in the conventional healthcare system by replacing it with wearable healthcare systems centered on individuals [21]. For example, devices and techniques for monitoring blood pressure, blood glucose level, cardiac activity, and respiratory activity are recent advances in noninvasive monitoring technologies for chronic disease management. Patients can improve or maintain their health states by using telecommunication and information technology without the need to schedule in-person healthcare visits. However, designing a telemetry system for health monitoring is complicated and expensive, and insurance providers must carefully consider and calculate who will benefit most from it.

A new concept called elderly welfare, which incorporates health welfare and the development of a telehealth system for the aging population, has also emerged. The telecare industry has expanded worldwide. Many countries, such as Japan, the United Kingdom, the United States, and Canada, have developed long-term care assistance policies to utilize telehealth systems [5, 6, 18]. Previous studies [2, 10, 13] summarized five benefits of adopting telehealth systems for three stakeholders (patients, healthcare facilities, and countries as a whole), namely, (1) cost savings for patients and healthcare facilities, (2) far-reaching care for patients, (3) reduced delays in medical treatment for chronic patients, (4) reductions in healthcare facility admission rates and duration of outpatient visits, and (5) improved quality of life for countries as a whole.

If patients must pay their own expenses for telehealth services without insurance reimbursement, then extremely few patients will be motivated to use these services [12, 15]. However, an elderly patient who lives alone in a remote village may spend more than 8 h in transit to see a physician in a healthcare facility, which can actually worsen the patient’s chronic disease condition. For example, patients with diabetes or hypertension may be unaware of an abnormality and miss the crucial time to see a physician. Telehealth services can reduce the urban–rural gap by allowing patients in remote areas to access medical resources without long transport times [12, 15].

When Taiwan introduced its National Health Insurance (NHI) plan in 1995, the Department of Health also introduced a pilot project for providing telehealth services [12]. Nevertheless, a review of the government plan in 2013 [16] showed only 9606 cumulative applicants and up to 343,000 service sessions by the end of 2011. Researchers have identified several main reasons why patients in Taiwan seldom use telehealth services. First, patients are willing to pay less than 1000 New Taiwan Dollars (NTD) monthly on average for a telehealth service, but renting remote physiological monitoring equipment costs at least 3000 NTD monthly, excluding service fees [12]. Second, outpatients prefer receiving medical advice in person and are not accustomed to using telehealth services [20]. Third, because telehealth services are excluded from NHI coverage, paying for prevention is not an attractive option compared with the deductibles in medical treatment [6]. Despite these reasons, researchers [3, 4, 9, 11] still found numerous social benefits of using telehealth services, including reductions in hospitalization frequency, healthcare facility medical costs, and caregivers’ burden.

Given that health insurance policy has not officially recognized telehealth services as an efficient treatment, we have no information to compute the cost and benefit of using them if they were reimbursed by health insurance. All patients must be classified into two groups, namely, “need telehealth service” and “do not need telehealth service”, which would be a time-consuming task without computer aid. Thus, a proper classification algorithm developed with telehealth experts’ assistance is necessary. Among the many classification algorithms, decision trees are the most suitable because they are simpler to understand and interpret than association rules or logistic regression. Decision trees also require only a simple data preparation stage and can handle categorical data.

This study aims to address the problem of identifying the patients who are the best candidates for receiving telehealth services subsidized by health insurance reimbursements. Specifically, patients with certain chronic diseases can benefit from noninvasive monitoring devices such as those evaluating blood pressure, blood glucose levels, and cardiac activity [21]. However, designing a telehealth system with professional healthcare staff to operate these noninvasive devices is complicated and costly. To prevent overburdening the telehealth system before insurers implement a telehealth reimbursement policy, researchers must identify the best qualified patients for receiving telehealth services to ensure that the neediest patients are assisted, instead of simply those who are able to pay for them.

Although previous research showed that telehealth services can reduce the misuse of resources and urban–rural disparities [2], most healthcare insurers do not include telehealth services in their health insurance schemes [6, 12]. Therefore, no target variable exists for the classification approaches to learn from or train with. Thus, the problem of identifying the potential recipients of telehealth services when introducing telehealth services into health welfare or health insurance schemes becomes an unsupervised classification problem without a target variable.

The first challenge of this study is to generate the target variable for the unsupervised telehealth classification problem. The type of target variable (interval, ordinal, or nominal) determines which data-mining techniques can be used. In classifying patients into recipients and nonrecipients, the target variable is generally the patient’s status (e.g., unqualified or qualified). The target variable can also be defined with several classes that carry different meanings. For example, we can classify all patients into several classes, such as “necessary”, “maybe necessary in the long term”, and “unnecessary”. However, having many classes leads to frequent misclassification because the individuality of each class becomes diluted, so that similar classes are confused with one another. This condition can also lead to an overfitting problem caused by an excess number of predictor variables for a multiclass target variable combined with insufficient data points. Consequently, this study limits the target variable to two classes (binary). Thus, the target variable can be transformed into a 0/1 code, and the telehealth service classification problem can be handled easily by many data-mining techniques, including decision trees and logistic regression.

The second challenge of this study involves generating the required information from the existing attributes for the insurance providers to determine whether the applicant is a suitable recipient of telehealth services. In a telehealth classification problem, the attributes are the personal and outpatient information that patients provide when they submit telehealth service applications. However, applying these attributes directly from the data set to a data classification technique may be inappropriate. For example, the decision tree–building algorithm does not handle numeric attributes uniformly. When numeric attributes are applied to generate the decision tree, they may be used more than once with different thresholds. Some important attributes are also missing from the data set; thus, these attributes need to be derived from other attributes. For example, the patient’s traveling distance or transportation time to the hospital is generally not included in the healthcare data set. Thus, these data need to be generated. Solving this challenge ensures that insurance providers receive the patients’ detailed medical-related data, which can be used to generate the telehealth service classifier and applied directly in the classifier to determine the status of an applicant.

The third challenge is building a classifier to solve the problem of identifying candidates for health insurance reimbursement of telehealth services. In this study, we choose decision trees to generate the classifier because the rules they generate are simple to interpret, such that the results can be easily understood by both medical professionals and patients. Constructing a decision tree–based classifier involves three main steps, namely, variable selection, node splitting, and tree pruning [17]. Generally, researchers use entropy and information gain for the first step and then obtain the local maximum information by splitting the data according to a variable. Given that this method requires the data to be categorical, researchers have developed various methods for interval data, such as ID3, C4.5 and CART. Building a decision tree–based classifier also involves applying appropriate feature selection and feature extraction to enhance classification performance. Feature selection is the process of selecting representative attributes, whereas feature extraction transforms the original attributes into other forms to decrease the dimensions of the data set. For the current study, we must determine which node-splitting approach, together with feature selection and feature extraction, is most suitable for building the classifier. After splitting nodes to generate a tree, the next step is pruning the tree if it has too many levels or nodes, to avoid overfitting. Two pruning approaches have been developed [14]: pre-pruning stops the tree from growing before the entire training data set is classified, and post-pruning prunes the tree after the decision tree is finished. For the current study, we must determine which pruning approach is most suitable for building the classifier.
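To make the entropy and information gain criterion concrete, the following minimal Python sketch scores two categorical attributes on a toy data set. The function names, the record layout, and the four patient records are hypothetical illustrations and are not taken from the NHIRD or from the HDTTCA implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on a categorical attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the child nodes after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy records (illustrative values only).
patients = [
    {"F_TD": "Y", "F_Age": "elderly", "qualified": 1},
    {"F_TD": "Y", "F_Age": "middle-aged", "qualified": 1},
    {"F_TD": "N", "F_Age": "elderly", "qualified": 0},
    {"F_TD": "N", "F_Age": "young", "qualified": 0},
]

gain_td = information_gain(patients, "F_TD", "qualified")
gain_age = information_gain(patients, "F_Age", "qualified")
```

In this toy example, the attribute with the larger gain would be chosen for the split, which is the local, greedy choice the paragraph above describes.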

Finally, the fourth challenge is selecting a validation method. Validation is the process of assessing how well the classification models perform against the validation data (real data) by verifying whether the models’ misclassification rates meet the established requirements. The validation techniques consider the probability of the worst-case scenario, wherein a model’s complexity is high. For example, the widely used k-fold validation technique divides a data set into k subsets and takes k – 1 subsets as the training data, with the remaining subset as the validation data set. The model is then trained k times, and each iteration holds out a different subset i as the validation data. However, the problem considered herein has a relatively small training data set, because experts must classify the patients as candidates for receiving telehealth services. Given that the training data set is extremely small, we will not split (k-fold validation) or cross-validate the training set in the validation step. We need to develop a new validation method suitable for an unsupervised classification problem with an extremely small training set and an exceedingly large test data set.
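For reference, the k-fold partitioning described above can be sketched in a few lines of Python. This illustrates only the standard technique that the study decides not to use; the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs, holding out one fold per iteration."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


# Ten samples split into five folds: each fold validates on two samples
# and trains on the remaining eight.
folds = list(k_fold_indices(10, 5))
```

With a very small expert-labeled training set, each held-out fold would contain only a handful of records, which is one way to see why the study looks for a different validation scheme.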

In summary, this study aims to solve the unsupervised classification problem of identifying the patients who are the best candidates for receiving telehealth services. Four challenges are addressed: (1) generating the target variable, (2) generating the needed information from the existing attributes, (3) building a classifier, and (4) selecting a validation method.

Methods

To classify candidates to receive telehealth services through health insurance reimbursements, we propose a new decision tree approach, that is, heuristic decision tree telehealth classification approach (HDTTCA), which consists of three major steps, namely, (1) data analysis and preprocessing, (2) decision tree model building, and (3) prediction and explanation, as shown in Fig. 1.

As mentioned before, four challenges are addressed in HDTTCA: step 1 tackles challenges 1, 2 and 4, while step 2 tackles challenges 3 and 4. Finally, in step 3, HDTTCA predicts and explains incoming data by using the decision tree classification model that was chosen previously in step 2. In other words, after building the decision tree model, we use this model to predict the applicability of telehealth services. In the following subsection, we explain the details of steps 1 and 2 and then clarify the time complexity of HDTTCA.

Step 1: data analysis and preprocessing

As discussed previously, the target variable and some important attributes are missing from the original data set. Therefore, HDTTCA first needs to derive some attributes from the current data set and then use them in determining the value of the target variable. To validate the performance of the decision tree classification model, HDTTCA divides the data set into several subsets.

Step 1.1 generating derivative attributes

Given that the raw data containing the critical attributes are often collected from different sources, these data should be integrated into a single data set first. We focus on the two primary actors involved in healthcare activity, namely, patients and hospitals. The patient-related data sets describe the information about those who have seen physicians and contain two types of information, namely, basic information (patients’ important attributes, including gender, age, address, and health history) and clinical information (all medical activities the patients received, including medical treatments and physician visits). We retain only the important attributes of patient-related data sets for telehealth services, such as the hospital where a patient seeks treatment, the code for the international classification of diseases, and the number of days of prescription. The hospital-related data sets describe the information about hospitals that patients visit to see their physicians. For our purposes, we only need the hospital’s location and size.

To reduce the number of age categories and balance the percentages among them, we use an attribute, that is, age group (F_Age), to represent the age of ≤30 years as young, 30 years ≤ age ≤ 70 years as middle-aged and age of ≥70 years as elderly. Similarly, we transcode the monthly insurance amount into an attribute, that is, insurance level (F_IL), which is categorized as low, middle, and high with the suggested percentages of 20, 60 and 20%, respectively. We also mark the situations in which patients are not required to pay the copayments with an attribute, that is, copayment exemptions (F_CEM), marked as Y or N.
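A minimal Python sketch of these two transcodings is given below. Because the stated age ranges overlap at exactly 30 and 70 years, the sketch assumes that 30 counts as young and 70 as elderly; this tie-breaking rule, the rank-based cut points for the 20/60/20 split, the function names, and the sample amounts are all assumptions made for illustration.

```python
def age_group(age):
    """Map an age in years to F_Age (assumption: 30 -> young, 70 -> elderly)."""
    if age <= 30:
        return "young"
    if age >= 70:
        return "elderly"
    return "middle-aged"


def insurance_levels(amounts):
    """Label monthly insurance amounts as F_IL with roughly 20/60/20% shares."""
    ordered = sorted(amounts)
    n = len(ordered)
    low_cut = ordered[max(round(0.2 * n) - 1, 0)]   # top of the lowest ~20%
    high_cut = ordered[min(round(0.8 * n), n - 1)]  # bottom of the highest ~20%
    labels = []
    for amount in amounts:
        if amount <= low_cut:
            labels.append("low")
        elif amount >= high_cut:
            labels.append("high")
        else:
            labels.append("middle")
    return labels


# Hypothetical monthly insurance amounts (illustrative values only).
levels = insurance_levels([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
```

Tied amounts at a cut point all receive the same label, so the realized shares can deviate from 20/60/20 on real data.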

We summarize the number of times in a year that each patient visits a hospital (outpatient) or is hospitalized (inpatient) into op_time and ip_time, respectively. We use an attribute, that is, outpatient frequency (F_op), to categorize op_time as none when its value is 0, low for 1 ≤ op_time ≤ 12, middle for 13 ≤ op_time ≤ 36, and high when its value is ≥37. We also denote an attribute, that is, inpatient frequency (F_ip), as 0 for ip_time = 0, 1 for ip_time = 1, 2 for ip_time = 2, and 3+ for ip_time ≥ 3.
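The thresholds above translate directly into two small mapping functions. The following Python sketch uses hypothetical function names; only the cut points come from the text.

```python
def outpatient_frequency(op_time):
    """Map yearly outpatient visits (op_time) to the F_op category."""
    if op_time == 0:
        return "none"
    if op_time <= 12:
        return "low"
    if op_time <= 36:
        return "middle"
    return "high"


def inpatient_frequency(ip_time):
    """Map yearly hospitalizations (ip_time) to the F_ip category."""
    return str(ip_time) if ip_time < 3 else "3+"


# A patient with 14 outpatient visits and 4 hospitalizations in the year.
f_op = outpatient_frequency(14)
f_ip = inpatient_frequency(4)
```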

Some diseases are inapplicable for telehealth services (e.g., a car accident victim who goes to the emergency room for treatment of an injury and then rests in chronic care for rehabilitation). Therefore, we differentiate the total number of days that a patient uses an emergency bed (EB day) and a chronic bed (CB day). We use an attribute, that is, chronic bed rate (F_CBR), which is the number of CB days divided by the sum of EB days and CB days, to distinguish those whose symptoms cannot be helped by telehealth services and eliminate the patients with F_CBR = 0. We also summarize the number of drug prescription days as drug day and the total amount of the medical fees as total amount.
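As a small worked illustration of the chronic bed rate, the Python sketch below computes F_CBR and drops the records whose rate is 0. The text does not say how patients without any bed days are treated, so returning `None` for them is an assumption, as are the function name and the sample records.

```python
def chronic_bed_rate(eb_days, cb_days):
    """F_CBR = CB days / (EB days + CB days)."""
    total = eb_days + cb_days
    if total == 0:
        return None  # assumption: undefined when the patient used no beds
    return cb_days / total


# Hypothetical inpatient records: (patient id, EB days, CB days).
records = [("p1", 3, 9), ("p2", 5, 0), ("p3", 0, 4)]

# Keep only the patients whose chronic bed rate is not 0.
kept = [pid for pid, eb, cb in records if chronic_bed_rate(eb, cb) != 0]
```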

In previous studies, the critical factors influencing the adoption of telehealth services include the patient’s traveling distance or transportation time to the hospital, health status, and financial status [5, 6, 20]. However, these attributes are not recorded directly in the data sets; thus, they need to be derived from existing attributes. Given that telehealth services are more beneficial for patients who live farther away from the hospitals, the transportation time from a patient’s home to the hospital should be determined. We can generate the distance that a patient travels from home to hospital [1], that is, distance (F_Dis), by combining the zip codes of the hospital location and the patient’s residence location with the assistance of Google Maps. We use the great-circle distance to estimate the shortest distance between two points on the surface of a sphere, which is calculated as follows:

$$ {F}_{Dis}=r\ {\cos}^{-1}\left[\sin\ {\upvarphi}_1\times \sin\ {\upvarphi}_2+\cos\ {\upvarphi}_1\times \cos\ {\upvarphi}_2\times \cos |\ {\uplambda}_1-{\uplambda}_2|\right]\ \mathrm{km} $$

where (φ₁, λ₁) and (φ₂, λ₂) denote the latitudes and longitudes of points 1 and 2 (in radians), respectively; and r is the mean earth radius (approximately 6371 km). For example, the longitude and latitude of the zip codes 100 and 700 are (121.5199, 25.0324) and (120.1929, 22.99594), respectively. Therefore, the distance between the two points is calculated as follows:

$$ {F}_{Dis}=6371\times {\cos}^{-1}\left[\ \sin \left(25.0324\uppi /180\right)\times \sin \left(22.9959\uppi /180\right)+\cos \left(25.0324\uppi /180\right)\times \cos \left(22.9959\uppi /180\right)\times \cos \left(\left(121.5199-120.1929\right)\uppi /180\right)\right]=263.5161\ \mathrm{km} $$
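The same calculation can be reproduced with a short Python sketch of the spherical law of cosines. The function name and the clamping step are illustrative choices rather than part of the published method; the coordinates are those of the worked example for zip codes 100 and 700, and the result should come out at roughly 263.5 km.

```python
from math import acos, cos, radians, sin

EARTH_RADIUS_KM = 6371.0  # mean earth radius used in the text


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    delta_lambda = radians(abs(lon1 - lon2))
    central = sin(phi1) * sin(phi2) + cos(phi1) * cos(phi2) * cos(delta_lambda)
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    central = max(-1.0, min(1.0, central))
    return EARTH_RADIUS_KM * acos(central)


# Zip code 100 (lat 25.0324, lon 121.5199) to zip code 700 (lat 22.9959, lon 120.1929).
f_dis = great_circle_km(25.0324, 121.5199, 22.9959, 120.1929)
```

Note that the function takes latitude first, whereas the example in the text lists each point as (longitude, latitude).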

Telehealth services are beneficial for patients with chronic diseases because their drug administration period is > 7 days. We create a special attribute, that is, drug duration (F_DD), to record whether a drug is administered for an extended period of time. We use the attribute economic priority (F_Eco) to distinguish patients with special conditions, such as low income or disability. Telehealth services are mostly needed by patients living in rural areas, even if their traveling distances to the hospitals are shorter than those of the other patients. Given that remote area is undefined, we use an attribute, that is, remoteness (F_R), to distinguish patients residing in rural areas by recoding their addresses.

As mentioned previously, telehealth equipment can at present monitor only some physiological values, such as blood pressure, blood glucose level, and cardiac activity [21]. Thus, telehealth equipment is mostly useful for target diseases, such as diabetes, hypertension, and hyperlipidemia. We highlight the disease codes in a special attribute, that is, target disease (F_TD), with Y indicating suitability and N indicating unsuitability for telehealth services. Another way to mark the potential telehealth users is to differentiate the treatment that a patient receives. We create an attribute, that is, target treatment (F_TT), to record the specific treatments for diabetes, hypertension, and hyperlipidemia symptoms, with telehealth-applicable treatment as A, treatment for other chronic diseases as B, and nonchronic treatment as N. For special cases that do not fit in the preceding categories, we create an attribute, that is, Reim_Spe (F_RS), to record these special telehealth-applicable cases, with telehealth-applicable denoted as Y and nontelehealth-applicable as N. Table 1 lists the attributes used to consult with the experts and generate the decision tree for each expert in the following discussion.

Table 1 Attributes Used to Consult with the Experts
