Predicting polypharmacy in half a million adults in the Iranian population: comparison of machine learning algorithms

Background Polypharmacy (PP) is increasingly common in Iran and contributes to the substantial burden of drug-related morbidity, increasing the potential for drug interactions and potentially inappropriate medications. Machine learning (ML) algorithms can be employed as an alternative solution for predicting PP. Therefore, our study aimed to compare several ML algorithms for predicting PP using health insurance claims data and to choose the best-performing algorithm as a predictive tool for decision-making. Methods This population-based cross-sectional study was performed between April 2021 and March 2022. After feature selection, information about 550 thousand patients was obtained from the National Center for Health Insurance Research (NCHIR). Afterwards, several ML algorithms were trained to predict PP. Finally, to assess the models' performance, metrics derived from the confusion matrix were calculated. Results The study sample comprised 554 133 adults with a median (IQR) age of 51 years (40-62), nested in 27 cities within Khuzestan province, Iran. Most of the patients were female (62.5%), married (63.5%), and employed (83.2%) during the last year. The prevalence of PP in the whole population was about 36.0%. After feature selection, out of 23 features, the number of prescriptions, insurance coverage for prescription drugs, and hypertension were found to be the top three predictors. Experimental results showed that Random Forest (RF) performed better than the other ML algorithms, with recall, specificity, accuracy, precision, and F1-score of 63.92%, 89.92%, 79.99%, 63.92%, and 63.92%, respectively. Conclusion ML provides a reasonable level of accuracy in predicting polypharmacy. Prediction models based on ML, especially the RF algorithm, therefore performed better than other methods for predicting PP in the Iranian population in terms of the performance criteria.


Introduction
Polypharmacy (PP) refers to the "administration of many drugs simultaneously and/or the administration of more drugs than is clinically indicated, representing an unnecessary use of drugs" [1]. Global definitions of PP consider the actual number of drugs taken by one person, and in recent studies the intake of five or more medications is the most commonly used definition [2]. The prevalence of PP has mostly been investigated among the elderly (≥ 65 years), and data on adults (≥ 18 years) have received less attention. The simultaneous use of multiple prescription drugs is increasingly common, with 27% of the population and 38% of adults in Iran using five or more medications at the same time [3]. Similarly high prevalences among adults have been reported in other countries (e.g., 51.5% in the Kingdom of Saudi Arabia [4], 36.8% in the United States [5], 22.4% in Poland [6], 30.7% in Scotland [7], 24.4% in Sweden [8], 39.1% in Germany [9], and 45.8% among COVID-19 patients [10]).
A great majority of studies on PP have focused on its potentially negative consequences, e.g., inappropriate prescribing, higher health care costs, non-adherence to medications, drug interactions, adverse drug reactions, decreased physical functioning, and reduced quality of life [2, 5-7, 11, 12]. Some researchers have also investigated the prevalence of PP in the elderly or in patients with chronic diseases [13], and the factors and conditions leading to PP have received attention in recent studies [14][15][16]. To our knowledge, no study so far has analysed possible predictors of polypharmacy in patients consuming multiple drugs using modern statistical classification methods.
Machine learning (ML) is becoming essential for solving problems in many domains, including healthcare [17]. Currently, various ML methods are being introduced in different healthcare fields to help professionals improve diagnosis [18][19][20]. An example is the use of a four-model ensemble strategy to categorise the probability of death of patients infected with COVID-19 [21]. Similarly, clinical decision support systems (CDSSs) have been developed to reduce prescribing errors by helping to prioritise the review of prescriptions [22,23]. Similar support systems can help pharmaceutical companies select a suitable molecule for research, one that is likely to pass the approval process and reach the market [24]. Maternal health initiatives can use a CDSS to predict ectopic pregnancies [25]. Pharmaceutical companies are turning to machine learning to facilitate drug discovery and manufacturing, and the FDA has proposed regulations that allow the use of AI and machine learning in medical devices [26].
Despite recent studies assessing PP [27], its modelling has received less attention. Hence, we compared the performance of five ML methods in predicting PP in more than half a million Iranian people to find the most favourable features and methods for our data.
In the next section, we describe the required dataset and the details of the ML algorithms. In the Results section, the ML models are compared using metrics derived from the confusion matrix. The conclusion and some possible further work are presented in the Discussion section.

Data collection and preparation
A retrospective cohort study was conducted on health insurance claims data from April 2021 to March 2022, provided by the National Center for Health Insurance Research (NCHIR), which manages the "Bimeh Salamat" insurance program for Iranians, for adults in Khuzestan province, Iran. As of March 2022, the insurance program covered 554 133 beneficiaries from 27 cities in Khuzestan province.
The data include patients' clinical and demographic characteristics, such as age (≥ 18 years), sex (female, male), marital status (married, single), occupation (employed, unemployed), income (low, middle, high), residence area (rural, urban), and ethnicity (Arab, Fars, Lor, Tork & Kord), as well as prescription variables over the last 12 months: number of prescriptions (NOP), number of drugs (NOD) per prescription, season of prescription (season), insurance coverage for prescription drugs (ICPD), total pharmaceutical spending (TPS $), number of visits to a general practitioner (NVGP), and number of visits to a specialist (NVS). In addition, the commonest non-communicable diseases (NCDs) in the subjects were selected using International Classification of Diseases (ICD) codes: diabetes mellitus (DM); dyslipidemia (DLP); asthma; gastroesophageal reflux disease (GERD); hypertension (HTN); cardiovascular diseases (CVD), including heart failure, ischemic heart disease, arrhythmia, and stroke; chronic kidney disease (CKD); rheumatoid arthritis (RA), including rheumatoid arthritis and osteoarthritis; and mental health conditions (MHC), including dementia, anxiety, and depression. It is worth noting that, whereas in the US the number of prescriptions is usually the same as the number of drugs, in Iran one prescription may contain several drugs.
All 24 variables in the patients' records were extracted and examined. Continuous variables were normalized. The outcome was binary PP, calculated from NOD. The imbalanced dataset problem was handled using the SMOTE method. The research protocol was approved by the Ethics Committee of the Abadan University of Medical Science (No. IR.ABADANUMS.REC.1401.101).
Certain classes were clustered to reduce the number of classes of these variables. Records with over 70% missing data were excluded from the analysis. Imputation was used for the remaining missing values, assuming that the missing data were randomly distributed [28]. Little's test was used to evaluate the null hypothesis that the data were missing completely at random (MCAR) [29].
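For illustration, the imputation step for continuous variables can be sketched as follows. This is a minimal scikit-learn sketch on made-up values, not the study's R workflow; the array contents and the choice of mean imputation are assumptions for demonstration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries (illustrative values, not the study data)
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

# Mean-impute missing continuous entries, assuming missingness is random
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
```

Each missing cell is replaced by the mean of the observed values in its column, which is appropriate only under the randomness assumption evaluated by Little's test.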

Predictor variables
The analysis was performed on three classes of predictor variables obtained from the health insurance claims data. Twenty-three variables were classified as sociodemographic characteristics (seven), prescription variables (six), and comorbidities (ten).

Outcome variable
There is no single consensus on the definition of PP. As reported earlier, PP is defined as the concomitant prescription of five or more medications per prescription [3,30]. This feature serves as the binary class variable: for each patient, if the average number of prescribed drugs (NOD) per prescription per year is less than five, PP is 0; otherwise, it is 1. Out of the 554 133 patients, 199 485 instances were labeled as 1 (Table 1).
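The labeling rule above can be sketched in a few lines. This is an illustrative Python sketch with made-up drug counts; the function name and the input layout (one list of per-prescription drug counts per patient) are assumptions, not the NCHIR data schema.

```python
def pp_label(drugs_per_prescription, threshold=5):
    """Return 1 if the average number of drugs per prescription over the
    year is at or above the threshold (polypharmacy), else 0."""
    avg = sum(drugs_per_prescription) / len(drugs_per_prescription)
    return 1 if avg >= threshold else 0

# Three hypothetical patients' per-prescription drug counts for one year
labels = [pp_label(p) for p in [[6, 7, 5], [2, 3, 1], [5, 5, 5]]]  # -> [1, 0, 1]
```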

Data balancing
The imbalanced data problem is an important barrier for ML algorithms and arises when the classes are unequally represented. In our dataset, the amount of data in the outcome classes is markedly imbalanced, with more samples in the non-polypharmacy class (64.0%) than in the much smaller PP class (36.0%). Trained models therefore tend to produce results biased towards the predominant class and to assign new observations to the majority class. We applied the edited nearest neighbour (ENN) method along with the synthetic minority over-sampling technique (SMOTE), as implemented in the imbalanced-learn toolbox (SMOTEENN 0.9.1), to deal with the class imbalance and make the dataset balanced.
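The core SMOTE idea, synthesising new minority-class points by interpolating between a minority sample and one of its minority-class neighbours, can be sketched as follows. This is a simplified numpy illustration, not the imbalanced-learn implementation used in the study; the ENN cleaning step and the k-nearest-neighbour sampling details are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new):
    """Simplified SMOTE: create n_new synthetic points, each interpolated
    between a random minority sample and its nearest minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Nearest neighbour of X_min[i] among the other minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy 2-D minority class; generate four synthetic minority samples
X_minority = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
X_new = smote_sketch(X_minority, n_new=4)
```

Because each synthetic point lies on a segment between two existing minority samples, oversampling enriches the minority region rather than duplicating records; ENN then removes ambiguous points near the class boundary.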

Feature selection
Feature selection improves the performance of a predictive model and reduces the computational cost of modeling by selecting the most important variables, thereby reducing the computational complexity of the model. Another goal was to gain insight into the underlying processes that generated the data [31,32]. Therefore, feature selection should be done prior to model prediction. By calculating different ML algorithms and removing irrelevant factors, errors in clinical decisions were reduced and accuracy improved [32]. To identify the best predictors, the effectiveness of different feature selection methods was compared: in the training set, five methods, namely eXtreme Gradient Boosting (XGBoost), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), and Artificial Neural Networks (ANNs), were applied to select the relevant features for predicting PP.
To prevent overfitting, ten-fold cross-validation was applied during the training process.
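One common way to select features with a tree ensemble, ranking variables by importance and keeping the top k before cross-validating on the reduced set, can be sketched as follows. This is an illustrative scikit-learn sketch on synthetic data, not the study's R pipeline; the number of kept features (5) is an arbitrary choice for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the claims data: 10 features, 4 informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# Rank features by random-forest importance (most important first)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
top_k = ranking[:5]

# Ten-fold cross-validation on the reduced feature set, as in the study design
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X[:, top_k], y, cv=10)
```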

Model development
We trained five ML algorithms, namely DT, RF, XGBoost, SVM, and ANN, in the "Rattle" (R Analytical Tool to Learn Easily) package. Rattle is a data mining tool written in R that provides a graphical user interface [33]. To implement these models, we tuned the hyperparameters on the training split of the dataset based on cross-validation (CV). A standard ML technique called k-fold cross-validation (ten-fold in our study) was used to train and test the ML models. Each method is described below.

Decision trees
DT induction is a classic ML technique widely deployed in data mining [34]. It is very effective, as it uses a simple algorithm and a simple tree structure to represent the model. A DT can be regarded as a series of IF-THEN rules, as well as conditional probability distributions defined over class and feature spaces [34,35]. When the samples at a node all belong to one class, the node becomes a leaf and is labeled with that class. Otherwise, the algorithm selects the most discriminatory attribute as the current DT node [36]. Based on the attribute value at the current decision node, the training samples are categorized into subsets, each forming a branch. For every branch or subset obtained, the previous steps are repeated, recursively producing a DT on each partitioned sample [37][38][39]. Such an induction structure is easy to interpret, easy to implement due to its less complicated calculations, and does not need data normalization [40,41]. The rpart package is employed to build the DT.
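A minimal decision-tree sketch on synthetic data is shown below. The study itself used R's rpart package; this is an illustrative scikit-learn stand-in, and the depth limit and split sizes are arbitrary demonstration choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the claims records
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a shallow tree (recursive IF-THEN partitioning of the feature space)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
test_accuracy = tree.score(X_te, y_te)
```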

Random forest
The RF was proposed by Breiman and consists of many individual DTs that work together as an ensemble [42]. It boosts accuracy by using a group of decision models instead of a single learning model. The important difference between this technique and traditional DT algorithms is that node splits are based on randomly selected subsets of the features [43]. The trees protect each other against their individual errors, which is what makes the ensemble strong: some trees may produce a wrong classification, but several others are correct, moving the prediction in the right direction. For the RF to perform well, the predictions and errors of the individual trees should therefore be only weakly correlated with each other [44]. Moreover, RF has several advantages, such as being usable for both regression and classification tasks and handling missing variables. In addition, overfitting occurs less as more DTs are added to the forest [45][46][47].
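A random-forest sketch on synthetic data follows, again an illustrative scikit-learn stand-in for the R implementation used in the study. The out-of-bag (OOB) score computed here mirrors the OOB error rate reported in the Results; the tree count is an arbitrary demonstration choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Ensemble of trees; each split considers a random subset of features,
# and oob_score=True estimates error from samples left out of each bootstrap
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X_tr, y_tr)
oob_error = 1.0 - rf.oob_score_
test_accuracy = rf.score(X_te, y_te)
```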

eXtreme gradient boosting
Chen et al. proposed the XGBoost method in 2016 [48]; it is an ensemble approach based on the DT method. XGBoost is an open-source library presented as a scalable tree boosting system built on DT models. Trees are added to the ensemble one at a time, each fitted to correct the prediction errors of the previous models, after which the prediction is made [37,38]. The gradient boosting framework is used and models are added sequentially. Hence, it is capable of minimizing errors, maximizing model performance, and reducing tree construction time [49]. XGBoost has been deployed on many challenges and can produce state-of-the-art outcomes on many difficult problems [50]. It is also computationally efficient (fast to execute). The xgboost package is used to build the boosted model.
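The sequential-boosting idea, each new tree fitted to the errors of the ensemble built so far, can be sketched as follows. Since the xgboost package may not be available everywhere, scikit-learn's GradientBoostingClassifier is used here as an illustrative stand-in implementing the same gradient boosting framework; hyperparameter values are demonstration choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 shallow trees added sequentially; each one fits the residual errors
# of the current ensemble, scaled by the learning rate
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0).fit(X_tr, y_tr)
test_accuracy = gb.score(X_te, y_te)
```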

Support vector machine
The SVM method is based on statistical learning theory and was developed by V. N. Vapnik and colleagues [51]. SVM was designed for binary classification but has been effectively extended to multi-class situations. SVM finds a line/hyperplane in a multidimensional space capable of splitting the feature space into distinct groups [52][53][54]. The "kernel" is central to the SVM algorithm: data that cannot be linearly separated in a lower dimension are transformed by the kernel into a higher dimension. This capacity gives SVM better performance than many other techniques [55][56][57]. SVR is an extension of SVM for regression that applies structural risk minimization, reducing the generalization error by increasing the hyperplane margin within a tolerated error [58,59]. Rattle deploys ksvm from the kernlab package.
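An RBF-kernel SVM sketch on synthetic data is shown below, an illustrative scikit-learn stand-in for kernlab's ksvm used via Rattle. Standardizing the features first is a standard practice for SVMs; the kernel choice and C value are demonstration assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale features, then fit an SVM whose RBF kernel implicitly maps the data
# to a higher-dimensional space where a separating hyperplane is sought
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0)).fit(X_tr, y_tr)
test_accuracy = svm.score(X_te, y_te)
```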

Artificial neural networks
Neural networks, introduced by Warren McCulloch and Walter Pitts in 1943, are a long-established modeling method that imitates the human neural network and was designed with the central nervous system in mind [60]. A neural network, as a non-parametric regression approach, uses a series of highly interconnected nodes to model complex functions [61,62]. Like a biological neural network, an ANN is built from nodes, neurons, or processing elements connected into a network. Each neuron accumulates data from the surrounding neurons and produces an output according to its activation function and weights. The adaptive weights indicate the strength of the connections between neurons. During the learning process, they are adjusted so that the network output comes close to the desired output. Mathematically, this can be described in a fairly simple, if not straightforward, way. Rattle employs the functionality offered by the nnet package.
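A single-hidden-layer network sketch follows, an illustrative scikit-learn stand-in for R's nnet: weighted sums pass through an activation function, and the weights are adjusted during training to reduce the output error. The hidden-layer size and iteration limit are demonstration assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale inputs, then train a network with one 16-unit hidden layer;
# backpropagation adapts the connection weights toward the desired output
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                  random_state=0)).fit(X_tr, y_tr)
test_accuracy = ann.score(X_te, y_te)
```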

Cross-validation
The k-fold cross-validation (k-fold CV) procedure is based on repeated holdout and has become the de facto standard for estimating model performance. Instead of repeated random sampling, k-fold CV randomly divides the data into k folds [63]. To assess algorithm performance and obtain reliable findings, ten-fold CV was applied to evaluate the predictive models. Using stratified random sampling, the main training dataset was divided into ten folds (each comprising 10 percent of the total data). For each fold, an ML model was trained on the remaining 90% of the data and evaluated on the held-out fold. After repeating the training and evaluation 100 times (with 100 different training/testing combinations), the mean performance is reported. All samples in the dataset are used for both training and evaluation, which reduces the variance of the estimate [64,65]. Datasets for cross-validation were formed with the createFolds function in the caret package.
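The ten-fold stratified procedure can be sketched as follows (an illustrative scikit-learn sketch on synthetic data; the study used caret's createFolds in R). Each fold serves once as the held-out set, and the mean score over the ten iterations is reported.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified ten-fold split: each fold preserves the class proportions
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=cv)
mean_accuracy = scores.mean()
```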

Model evaluation
Evaluating model performance is a vital stage of producing a useful ML model and is done using performance indices, mostly derived from the confusion matrix. In this study, recall, accuracy, specificity, precision, and F1-score were used to compare the performance of the methods on the validation and training sets in each cross-validation iteration (Table 1). The interpretation of all measures was: poor < 50%, acceptable 50-80%, good 80-90%, and very good > 90%. Such criteria are commonly reported in model evaluation using ML techniques [66]. The caret (Classification and Regression Training) package by Max Kuhn has functions to compute several performance measures and offers many tools for training, preparing, visualizing, and evaluating ML models and data [67]. Although different types of ML take distinct approaches to training the model, there are basic steps that most models follow [68]. Figure 1 gives an overview of the steps taken to create the machine learning models for the prediction of PP.
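The metrics above follow directly from the confusion-matrix cells, and computing them can be sketched as follows. The TP/FP/FN/TN counts below are made-up illustrative values, not the study's results.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Compute the standard metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)                       # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}

# Hypothetical confusion-matrix counts for a binary PP classifier
m = metrics_from_confusion(tp=80, fp=20, fn=40, tn=160)
# accuracy = (80 + 160) / 300 = 0.80
```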

Developing and evaluating models
After selecting the best subset of features, various ML algorithms were used to build the predictive model. Five ML algorithms, namely DT, XGBoost, SVM, ANN, and RF, were trained to develop PP prediction models, and their performance was assessed through the sensitivity (recall), precision, accuracy, specificity, and F1-score performance metrics. Table 3 shows their discriminative capacity for predicting PP in the training and test sets. The RF method's performance, regarding recall, accuracy, precision, and F1-score, was highest for both training and test sets. On the test set, the specificity and accuracy of XGBoost were similar to those of RF (specificity = 90.2%; accuracy = 79.94%). ANN had the highest specificity among the ML methods (98.82%). ANN and RF had the lowest and highest values in sensitivity, accuracy, precision, and F1-score, respectively. The average accuracy of the ML methods ranged from 72.23% to 79.99% on the test sets, with ANN and RF showing the lowest and highest values, respectively. Also, the average specificity of all ML methods was more than 88%. Figure 2 displays the average performance indices of the considered ML algorithms for the test set. Figure 3 shows the top ten variable importance measures (VIMPs) derived from RF using the test and training sets (whole dataset). The variables were ranked using the average over 100 runs of the mean decrease in classification accuracy (MDA) and the mean decrease in classification Gini impurity (MDG). All 24 variables were ranked by their MDA and MDG for classifying the subjects into PP and non-PP categories. The ten most important variables were identified according to MDG, which is highly stable under classification permutation. NOP, ICPD, and HTN were the three most important variables for predicting PP in patients. Among the socio-demographic features, age, income, and employment status were the most influential variables (Fig. 3A, B).
The optimal classification was obtained with this set of ten variables, with an accuracy of 82.81% and an out-of-bag (OOB) error rate of 19.84% (Fig. 3C, D).

Discussion
PP is a complicated issue whose implications and appropriateness can differ between medically complex patients and those for whom multiple medications are clinically beneficial. The predictors of PP include features associated with the patient (sociodemographic factors such as age, gender, income, place of residence, and ethnicity), with the healthcare system or the physician (prescribed drug information such as costs and number of prescriptions), and with the disease (certain diseases, such as hypertension or diabetes mellitus, and multiple comorbidity status). How to accurately diagnose and predict PP using ML algorithms is worth studying. Based on the experiments described above, we found good performance for DT, XGBoost, SVM, and RF (though not for ANN), and the analysis of the important features yielded worthwhile findings. PP is common in adults, especially females, the elderly, and cases with comorbidities. Considering the adverse outcomes of PP, its prevalence and related features should be understood. Clinicians should regularly assess patients for the presence of PP and institute measures to decrease inappropriate PP where possible.
In our study, among the socio-demographic features, age, income, and employment status were the most influential variables. Taherifard et al. studied the population-based prevalence of polypharmacy and patterns of medication use in southwestern Iran and found that socioeconomic status was not associated with polypharmacy but was significantly associated with patterns of medication use for digestive, metabolic, and nervous system diseases [16]. Doheny et al., in a population-based study examining sociodemographic differences in polypharmacy among the elderly, showed that sociodemographic differences were greater among independently living individuals, with those with less education, older age, and women being more likely to have polypharmacy [69]. In our study, among the prescription features, the number of prescriptions and prescription drug insurance coverage were found to be the two most important predictors. Akande et al. showed in a cross-sectional study that taking too many prescription drugs, intentionally skipping pills because there are too many, and regularly receiving prescriptions from more than one doctor are the most important factors associated with polypharmacy [70]. In many studies, chronic disease was associated with increased odds of polypharmacy [69,71,72]. Mizokami et al. concluded that physicians should carefully consider the type of chronic disease when assessing the risk of polypharmacy; older patients with multiple diseases may experience further polypharmacy [72]. In our study, NCDs, particularly HTN, DM, and CVD, were significantly associated with the odds of polypharmacy. A large randomised controlled multicentre trial was conducted by Almodovar et al. to analyse the characteristics of an elderly multimorbid population with polypharmacy. The results show that frailty, multimorbidity, obesity, and reduced physical as well as mental health status are risk factors for excessive polypharmacy [71].
Finally, in this research, as in Almodovar's study, gender and marital status were not associated with excessive polypharmacy [71].
The comparison of the machine learning algorithms showed that, regarding the performance criteria, RF was more favourable than the other ML methods for predicting PP. The other ML approaches, except ANN, showed similar performance and acceptable discrimination (accuracy: 79.49%-79.94%). ML can be used for analysis and inference on large retrospective datasets to extract specific relationships or detect unusual patterns with minimal human intervention or programming effort [73]. Similarly, ML techniques can be used in medical practice to improve prognostic modelling and uncover new factors associated with a particular target outcome in order to predict future or hidden trends [74]. In medical imaging studies, for example, ML and deep learning help with COVID-19 diagnosis and provide non-invasive detection measures that protect medical staff from becoming infected with pathogens [75]. In virological studies, ML is used to study the genetics associated with the SARS-CoV-2 protein and to predict new combinations that can be used to produce drugs and vaccines [75]. Such models can therefore also be used to predict PP.
Our main limitation was the absence of features related to physical activity, body mass index, health habits, nutrition patterns, and certain clinical data that influence medication use, PP, and their related outcomes. Nevertheless, we showed that ML methods perform well in predicting PP in the Iranian population. The long running time of the programs, due to the sample size (big data), was another limitation of this research.

Conclusions
In this paper, we compared five ML models that predict polypharmacy in the adult Iranian population. The models were trained on the records of all individuals in the NCHIR data for Khuzestan province, using data for the last 12 months. The results suggest that our approach can be implemented for effective screening and prioritization when assessing polypharmacy in the general population. In conclusion, according to all the above experiments, we found that RF performed better than the other ML methods for predicting PP in Iranian people. In addition, clinicians should be aware of the common occurrence of PP and try to reduce improper prescribing or inappropriate PP where possible. In future studies, the proposed method can be used to predict polypharmacy in the elderly. Furthermore, the performance of our model may improve as we test more classification techniques on smaller, higher-quality datasets.