On the interpretability of machine learning-based model for predicting hypertension

Background Although complex machine learning models commonly outperform traditional simple interpretable models, clinicians find it hard to understand and trust these complex models due to the lack of intuition and explanation of their predictions. The aim of this study is to demonstrate the utility of various model-agnostic explanation techniques for machine learning models, with a case study analyzing the outcomes of the random forest model for predicting individuals at risk of developing hypertension based on cardiorespiratory fitness data. Methods The dataset used in this study contains information on 23,095 patients who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009 and had a complete 10-year follow-up. Five global interpretability techniques (Feature Importance, Partial Dependence Plot, Individual Conditional Expectation, Feature Interaction, Global Surrogate Models) and two local interpretability techniques (Local Surrogate Models, Shapley Value) have been applied to demonstrate how interpretability techniques can assist the clinical staff in gaining a better understanding of, and more trust in, the outcomes of machine learning-based predictions. Results Several experiments have been conducted and reported. The results show that different interpretability techniques shed light on different aspects of the model's behavior: global interpretations enable clinicians to understand the entire conditional distribution modeled by the trained response function, while local interpretations promote the understanding of small parts of the conditional distribution for specific instances. Conclusions Various interpretability techniques can vary in their explanations of the behavior of the machine learning model.
Global interpretability techniques have the advantage that they can generalize over the entire population, while local interpretability techniques focus on giving explanations at the level of individual instances. Both kinds of methods can be equally valid depending on the application need. Both are effective methods for assisting clinicians in the medical decision process; however, clinicians will always hold the final say on accepting or rejecting the outcomes of the machine learning models, and their explanations, based on their domain expertise.


Introduction
Machine learning prediction models have been used in different areas such as financial systems, advertising, marketing, criminal justice system, and medicine. The inability of machine learning users to interpret the outcomes of the complex machine learning models becomes problematic [1]. Machine learning interpretability is defined as the degree to which a machine learning user can understand and interpret the prediction made by a machine learning model [2,3]. Despite the growing use of machine learning-based prediction models in the medical domains [4][5][6][7], clinicians still find it hard to rely on these models in practice for different reasons. First, most of the available predictive models target particular diseases and depend on domain knowledge of clinicians [8][9][10]. Applying such predictive models on large health information systems may not perform well because of the availability of multiple, complex data sources and the heterogeneous mixture of patients and diagnoses. Second, most of the models developed by data scientists mainly focus on prediction accuracy as a performance metric but rarely explain their prediction in a meaningful way [11,12]. This is especially true with complex machine learning, commonly described as black-box models, such as Support Vector Machines [13], Random Forest [14] and Neural Networks [15].
Although many predictive models have been developed to predict the risk of hypertension [16][17][18], the frameworks for establishing trust and confidence in these predictions have always been missing. Thus, there has been some criticism of using machine learning models in the medical domain even with their promise of high accuracy [19]. In practice, addressing this issue is critical for different reasons, especially if clinicians are expected to use these models in practice. First, explaining the predictions of the developed model contributes to the trust problem by enabling clinicians to make sure that the model makes the right predictions for the right reasons and wrong predictions for the right reasons. Second, explaining predictions is always useful for getting some insights into how the model is working and helps in improving model performance. Since May 2018, the General Data Protection Regulation (GDPR) forces industries to explain any decision taken by a machine when automated decision making takes place: "a right of explanation for all individuals to obtain meaningful explanations of the logic involved", and thus increases the efforts of developing interpretable and explainable prediction models [20].
In our previous study [21], we evaluated the performance of several machine learning techniques on predicting individuals at risk of developing hypertension using cardiorespiratory fitness data. In particular, we evaluated and compared six well-known machine learning techniques: LogitBoost, Bayesian Network, Locally Weighted Naive Bayes, Artificial Neural Network, Support Vector Machine, and Random Forest. Using different validation methods, the Random Forest model, a complex ensemble machine learning model, has shown the maximum area under the curve (AUC = 0.93). The attributes used in the Random Forest model are Age, METS, Resting Systolic Blood Pressure, Peak Diastolic Blood Pressure, Resting Diastolic Blood Pressure, HX Coronary Artery Disease, Reason for test, History of Diabetes, Percentage HR achieved, Race, History of Hyperlipidemia, Aspirin Use, and Hypertension response. In this study, we apply various techniques to present a complete interpretation of the best performing model (Random Forest) in predicting individuals at risk of developing hypertension, in a manner understandable to clinicians, either at the global level of the model or at the local level of specific instances. We believe that this study is an important step toward improving the understanding of and trust in intelligible healthcare analytics by inducing a comprehensive set of explanations for predictions at the local and global levels. The remainder of this paper is organized as follows. In Section 2, we highlight the main interpretability techniques considered in this work. Related work is discussed in Section 3. In Section 4, we introduce the dataset employed in our experiments and discuss the interpretability methodologies. Results are presented in Section 5. In Section 6, we discuss our results. Threats to the validity of this study are discussed in Section 7 before we finally draw the main conclusions in Section 8.

Background
One simple question that can be posed is: "Why do we not simply use interpretable, white-box models, such as linear regression or decision trees?". For example, linear models [22] present the relationship between the independent variables (input) and the target (output) variable as a linear relationship that is commonly described by weighted equations, which makes the prediction procedure a straightforward process. Thus, linear models and decision trees have broad usage in different domains such as medicine, sociology, psychology, and various quantitative research fields [23][24][25]. The decision tree [26] is another example, where the dataset is split based on particular cutoff values and conditions in a tree shape such that each record in the dataset belongs to only one subset, a leaf node. In decision trees, predicting the outcome of an instance is done by navigating the tree from its root node down to a leaf, and thus the interpretation of the prediction is pretty straightforward using a nice, natural visualization. However, in practice, even though black-box models such as Neural Networks can achieve better performance than white-box models (e.g., linear regression, decision trees), they are less interpretable.
In general, methods for machine learning interpretability can be classified as either Model-Specific or Model-Agnostic. In principle, model-specific interpretation methods are limited to specific types of models. For example, the interpretation of regression weights in a linear model is a model-specific interpretation and does not work for any other model. On the other hand, model-agnostic interpretation methods are more general, can be applied to any machine learning model, and are usually post hoc [27]. This facilitates the comparison of different types of interpretability techniques and eliminates the need to replace the interpretability technique when the models are replaced, so such techniques are more flexible and usable [27]. These agnostic techniques work by analyzing pairs of input features and outputs without depending on the underlying model. Model-agnostic techniques also have some challenges [27]. One challenge is that it is hard to get a global understanding of complex models due to the trade-off between model interpretability and model flexibility. In some applications, an exact explanation may be a must, and using black-box models is then not acceptable. In this case, using an interpretable model such as a linear regression model is preferable, and the same holds for any application in which interpretability is more important than model performance. Another challenge is to make model-agnostic explanations actionable. It is easier to incorporate user feedback into a model implemented using explainable models than into a black-box model [28].
Another way to classify machine learning interpretability methods is based on whether the interpretation of the model is global or local. In principle, global interpretations enable a clinician to understand the entire conditional distribution modeled by the trained response function. They are obtained based on average values. In contrast, local interpretations promote the understanding of small parts of the conditional distribution. Such small parts of the conditional distribution are more likely to be linear or otherwise well behaved and can hence be explained by interpretable models such as linear regression and decision trees.
In this study, we apply various global and local model-agnostic methods that facilitate global model interpretation and local instance interpretation of a model that has been used in our previous study [21]. In particular, in our previous study, we evaluated and compared the performance of six machine learning models on predicting the risk of hypertension using cardiorespiratory fitness data of 23,095 patients who underwent treadmill stress testing at Henry Ford Health hospitals over the period between 1991 and 2009 and had a complete 10-year follow-up. The six machine learning models evaluated were logit boost, Bayesian network, locally weighted naive Bayes, artificial neural network, support vector machine and random forest. Among these models, random forest achieved the highest performance of AUC = 0.93. Figure 1 illustrates the steps of our interpretation process.

Related work
The volume of research in machine learning interpretability has grown rapidly over the last few years. One way to explain complex machine learning models is to use interpretable models, such as linear models and decision trees, to explain their behavior. The LIME interpretability technique explains the prediction of a complex machine learning model by fitting an interpretable model on perturbed data in the neighborhood of the instance to be explained. Decision trees have been used intensively as proxy models to explain complex models. Decision trees have several desirable properties [29]. Firstly, due to their graphical presentation, they allow users to easily get an overview of complex models. Secondly, the most important features that affect the model prediction are shown closer to the top of the tree, which shows the relative importance of features in the prediction. A considerable body of work considers decomposing neural networks into decision trees, with the main focus on shallow networks [30,31].
Decision rules have been used intensively to mimic the behavior of a black-box model globally or locally, given that the training data is available when providing local explanations [32]. Koh and Liang [33] used influence functions to find the most influential training examples that lead to a particular decision. This method requires access to the training dataset used in training the black-box model. Anchors [34] is an extension of LIME that uses a bandit algorithm to generate decision rules with high precision and coverage. Another notable rule-extraction technique is the MofN algorithm [35], which tries to extract rules that explain single neurons by clustering and ignoring the least significant neurons. The FERNN algorithm [36] is another interpretability technique that uses a decision tree and identifies the meaningful hidden neurons and inputs of a particular network.
Another common interpretability technique is saliency maps, which aim to explain neural network models by identifying the significance of individual outcomes as an overlay on the original input [37]. Saliency-based interpretability techniques are popular means for visualizing the importance of large numbers of features, such as in image and text data. Saliency maps can be computed efficiently when neural network parameters can be inspected, by computing the input gradient [38]. Derivatives may, however, miss some essential aspects of the information that flows through the network being explained, and hence some other approaches have considered propagating quantities other than gradients through the network [39][40][41].
Interpretability of black-box models via visualization has been used extensively [42][43][44]. Several tools have been designed to provide an explanation of the importance of features for random forest predictions [45]; however, these tools are model-specific and cannot be generalized to other models. The authors of [46,47] discussed several methods for extracting rules from neural networks. Poulet [48] presented a methodology for explaining the prediction model by assigning a contribution value to each feature using a visualization technique. However, this work is only able to handle linear additive models. Strumbelj et al. [49] provided insights for explaining the predictions of breast cancer recurrence by assigning a contribution value to each feature, which could be positive, negative, or zero. A positive contribution means that the feature supports the prediction of the class of interest, a negative contribution means that the feature is against the prediction of the class of interest, and zero means that the feature has no influence on the prediction of the class of interest. Caruana et al. [50] presented an explanation technique based on selecting the instances in the training dataset that are most similar to the instance to be explained. This type of explanation is called case-based explanation and uses the k-nearest neighbors (KNN) algorithm to find the k examples nearest to the instance to be explained based on a particular distance metric such as Euclidean distance [51].

Research design and methods
In this section, we describe the characteristics of the cohort of our study. In addition, we describe the global and local interpretability techniques which we used for explaining the predictions of the model that has been developed for predicting the risk of hypertension using cardiorespiratory fitness data.

Cohort study
The dataset of this study has been collected from patients who underwent treadmill stress testing by physician referrals at Henry Ford Affiliated Hospitals in metropolitan Detroit, MI in the U.S. The data has been obtained from the electronic medical records, administrative databases, and the linked claim files and death registry of the hospital [52]. Study participants underwent routine clinical treadmill exercise stress testing using the standard Bruce protocol between January 1st, 1991 and May 28th, 2009. The total number of patients included in this study is (n = 23,095). The data set includes 43 attributes containing information on vital signs, diagnosis and clinical laboratory measurements. The baseline characteristics of the included cohort are shown in Table 1. The dataset contains 23,095 individuals (12,694 males (55%) and 10,401 females (45%)) with ages that range between 17 and 96. Half of the patients have a family history of cardiovascular diseases. During the 10-year follow-up, around 35% of the patients experienced hypertension. Male hypertension patients represent around 55% of the total hypertension patients, while female patients represent around 44% of the total hypertension patients. For more details about the dataset, the process of developing the prediction model and the FIT project, we refer the reader to [21,52].
In the following, we highlight the interpretability methods that are used in this study. Table 2 summarizes the main features of the model-agnostic interpretability techniques used in this study, and we list and explain each of them below.

Feature Importance
It is a global interpretation method where feature importance is defined as the increase in the model's prediction error after permuting the values of the feature, which breaks the relationship between the feature and the outcome [53]. A feature is considered important if permuting its values increases the error (i.e., degrades the performance).
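The permutation scheme can be sketched in a few lines of Python. Here a toy scoring function over (age, METS) stands in for the paper's trained Random Forest; the weights and simulated data are purely illustrative assumptions:

```python
import random

random.seed(0)

# Toy stand-in for the trained black-box risk model over (age, METS).
# The weights are illustrative assumptions, not the paper's Random Forest.
def model(x):
    age, mets = x
    return 0.01 * age - 0.02 * mets

X = [(random.uniform(20, 80), random.uniform(5, 15)) for _ in range(500)]
y = [model(x) + random.gauss(0, 0.01) for x in X]

def mse(rows):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permutation_importance(j):
    """Increase in prediction error after shuffling feature j, which
    breaks the relationship between the feature and the outcome."""
    col = [x[j] for x in X]
    random.shuffle(col)
    permuted = [tuple(v if k == j else x[k] for k in range(len(x)))
                for x, v in zip(X, col)]
    return mse(permuted) - mse(X)

imp_age = permutation_importance(0)   # age dominates the toy score
imp_mets = permutation_importance(1)
```

Because age spans a much wider effective range in this toy score, shuffling it degrades the error far more than shuffling METS, which is exactly the ranking signal the method produces.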

Partial Dependence Plot (PDP)
It is a global interpretation method where the plot shows the marginal effect of a single feature on the predicted risk of hypertension of a previously fit model [54]. The prediction function is fixed at a few values of the chosen feature and averaged over the other features. Partial dependence plots are interpreted in the same way as a regression model, which makes their interpretation easy. The main disadvantage of the partial dependence plot is the assumption that the feature for which the PDP is computed is distributed completely independently of the other features that we average over.
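The fix-and-average procedure can be sketched as follows; the sigmoid scoring function and the grid values are illustrative assumptions standing in for the fitted model:

```python
import math
import random

random.seed(1)

# Toy stand-in for the previously fit model (illustrative weights only).
def model(x):
    age, mets = x
    return 1 / (1 + math.exp(-(0.05 * (age - 60) - 0.1 * (mets - 10))))

X = [(random.uniform(20, 80), random.uniform(5, 15)) for _ in range(300)]

def partial_dependence(j, grid):
    """For each grid value v: fix feature j to v in every instance,
    predict, and average the predictions over the dataset."""
    curve = []
    for v in grid:
        preds = [model(tuple(v if k == j else x[k] for k in range(len(x))))
                 for x in X]
        curve.append(sum(preds) / len(preds))
    return curve

pd_age = partial_dependence(0, [30, 50, 70])  # marginal effect of age
```

Plotting `grid` against the returned averages gives the PDP curve; here the toy model's risk rises monotonically with age, so the curve increases across the grid.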

Individual Conditional Expectation (ICE)
The partial dependence plot aims to visualize the average effect of a feature on the predicted risk of hypertension.
Partial dependence is a global method as it does not focus on specific instances but on an overall average. An ICE plot can be seen as the disaggregated view of the PDP, displaying the estimated functional relationship for each instance in the dataset; the partial dependence plot is the average of the lines of an ICE plot [55]. In other words, ICE visualizes the dependence of the predicted risk of hypertension on a particular feature for each instance in the dataset. One main advantage of ICE is that it is easier to understand and more intuitive to interpret than the PDP. ICE suffers from the same independence assumption as the PDP.
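The disaggregation is a small change to the PDP computation: keep one curve per instance instead of averaging. A minimal sketch, with the same kind of illustrative toy model as before:

```python
import math
import random

random.seed(2)

# Toy stand-in for the fitted model (illustrative weights only).
def model(x):
    age, mets = x
    return 1 / (1 + math.exp(-(0.05 * (age - 60) - 0.1 * (mets - 10))))

X = [(random.uniform(20, 80), random.uniform(5, 15)) for _ in range(50)]
grid = [30, 50, 70]

def ice_curves(j):
    """One line per instance: vary feature j over the grid while holding
    that instance's remaining features fixed."""
    return [[model(tuple(v if k == j else x[k] for k in range(len(x))))
             for v in grid] for x in X]

curves = ice_curves(0)
# The PDP is recovered as the pointwise average of the ICE lines.
pdp = [sum(c[i] for c in curves) / len(curves) for i in range(len(grid))]
```

Heterogeneous effects show up as ICE lines whose shapes differ from the averaged PDP line, which is what the plot makes visible.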

Feature Interaction
It is a global interpretation method where the interaction between two features represents the change in the prediction that occurs by varying the features, after the individual feature effects have been accounted for. It presents the effect that comes on top of the sum of the individual feature effects. One way to measure the interaction strength is to measure how much of the variation of the predicted outcome depends on the interaction of the features; this measure is known as the H-statistic [56]. One of the main advantages of the H-statistic is that it has a meaningful interpretation: the share of the prediction variance that is explained by the interaction. The main disadvantage of the feature interaction measure is that it is computationally expensive, as it iterates over all the instances in the dataset.
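A minimal sketch of the pairwise H-statistic in Python, assuming a toy model with an explicit age × METS interaction term (an illustrative assumption, not the paper's model): H² is the share of the variance of the joint partial dependence that is not explained by the two one-dimensional partial dependences.

```python
import random

random.seed(3)

# Toy black-box with an explicit age x METS interaction term
# (illustrative assumption, not the paper's model).
def model(x):
    age, mets = x
    return 0.02 * age - 0.03 * mets + 0.004 * age * mets

X = [(random.uniform(20, 80), random.uniform(5, 15)) for _ in range(100)]

def pd(js, vals):
    """Partial dependence: fix the features in `js` at `vals` and
    average the predictions over the dataset."""
    total = 0.0
    for x in X:
        z = list(x)
        for j, v in zip(js, vals):
            z[j] = v
        total += model(tuple(z))
    return total / len(X)

def center(vs):
    m = sum(vs) / len(vs)
    return [v - m for v in vs]

# Friedman's H^2 for the pair (age, METS): variance of the joint partial
# dependence not explained by the two one-dimensional PDs.
pd_joint = center([pd((0, 1), x) for x in X])
pd_age = center([pd((0,), (x[0],)) for x in X])
pd_mets = center([pd((1,), (x[1],)) for x in X])

h2 = (sum((j - a - m) ** 2
          for j, a, m in zip(pd_joint, pd_age, pd_mets))
      / sum(j ** 2 for j in pd_joint))
```

The nested partial-dependence evaluations over every instance are where the quoted computational cost comes from: each of the three PD functions is itself an average over the whole dataset.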

Global Surrogate Models
It is a global interpretation method which aims to approximate the predictions of a complex machine learning model (such as a neural network) using a simple interpretable machine learning model (such as linear regression) [57]. Global surrogate models are considered model-agnostic methods as they do not require any information about the internal workings or hyper-parameter settings of the black-box model. One way to obtain a surrogate model is as follows: train an interpretable model, such as logistic regression or a decision tree, on the same dataset used to train the black-box model (or a dataset that has the same distribution), such that the target for the interpretable model is the predictions of the black-box model. The main advantages of surrogate models are their flexibility and the ease of assessing how well they approximate the black-box model. However, it remains an open question how closely the surrogate model must approximate the black-box model in order to be trusted.
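The procedure can be sketched as follows, with a depth-1 decision stump standing in for the interpretable surrogate and a toy threshold rule standing in for the black-box classifier (both illustrative assumptions). The key detail is that the surrogate is fit to the black-box *predictions*, not to the true labels, and the agreement between the two measures the surrogate's fidelity:

```python
import random

random.seed(4)

# Toy black-box classifier (illustrative rule, not the Random Forest):
# flags high risk when a weighted score exceeds a threshold.
def black_box(x):
    age, mets = x
    return 1 if 0.05 * age - 0.1 * mets > 1.5 else 0

X = [(random.uniform(20, 80), random.uniform(5, 15)) for _ in range(400)]
# Surrogate target: the black-box predictions, not the true labels.
yhat = [black_box(x) for x in X]

def fit_stump():
    """Exhaustively fit a depth-1 decision stump to the black-box output;
    the returned accuracy measures fidelity to the black-box model."""
    best = None
    for j in (0, 1):                        # candidate split feature
        for t in sorted({x[j] for x in X}):  # candidate threshold
            for left in (0, 1):             # label of the left branch
                preds = [left if x[j] <= t else 1 - left for x in X]
                acc = sum(p == g for p, g in zip(preds, yhat)) / len(yhat)
                if best is None or acc > best[0]:
                    best = (acc, j, t, left)
    return best

fidelity, feature, threshold, left_label = fit_stump()
```

A deeper tree (as in the depth-3 and depth-4 surrogates reported later in this paper) follows the same idea recursively; the fidelity score is what tells us how far the single split falls short of the black-box decision boundary.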

Local interpretability techniques

Local Surrogate Models (LIME)
It is a local model-agnostic interpretation method which focuses on explaining a single prediction of any black-box machine learning model locally (within the neighborhood of the instance to be explained) [58]. The idea of LIME is quite intuitive: it generates a new dataset that consists of perturbed samples and then gets the associated predictions from the black-box model. Next, LIME weights the perturbed samples by how close they are to the instance to be explained, where the closer a sample is, the higher the weight it receives. Then, LIME fits an interpretable model (such as linear regression) on the weighted sampled instances. The learned model should be a good approximation of the machine learning model locally, but not globally.
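The perturb-weight-fit loop can be sketched in Python. The black-box function, perturbation width, and sample size are illustrative assumptions, and a tiny weighted least-squares solver stands in for the regularized linear model that LIME implementations normally use:

```python
import math
import random

random.seed(5)

# Toy black-box over (age, METS) (illustrative assumption).
def black_box(x):
    age, mets = x
    return 1 / (1 + math.exp(-(0.08 * (age - 55) - 0.15 * (mets - 10))))

def lime_explain(x0, n=500, width=5.0):
    """Perturb around x0, weight samples by proximity to x0, and fit a
    weighted linear model; its coefficients are the local explanation."""
    samples, weights, targets = [], [], []
    for _ in range(n):
        z = tuple(v + random.gauss(0, width) for v in x0)
        d2 = sum((a - b) ** 2 for a, b in zip(z, x0))
        samples.append((1.0,) + z)                  # prepend intercept
        weights.append(math.exp(-d2 / (2 * width ** 2)))
        targets.append(black_box(z))
    # Weighted normal equations: (S^T W S) beta = S^T W t
    p = len(samples[0])
    A = [[sum(w * s[i] * s[j] for s, w in zip(samples, weights))
          for j in range(p)] for i in range(p)]
    b = [sum(w * s[i] * t for s, w, t in zip(samples, weights, targets))
         for i in range(p)]
    for c in range(p):             # Gaussian elimination with pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][k] * beta[k]
                              for k in range(r + 1, p))) / A[r][r]
    return beta  # [intercept, local age effect, local METS effect]

coef = lime_explain((60.0, 8.0))
```

The fitted coefficients recover the local behavior of the toy model around the instance: a positive age effect and a negative METS effect, even though the underlying function is nonlinear globally.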

Shapley Value Explanations
It is a local interpretation method from game theory [59]. This interpretation method assumes that each feature of the instance to be explained is a 'player' in a game and the prediction is the payout. The Shapley value aims to distribute the payout among the features in a fair way. The main idea of the Shapley value is that, for each feature f in the instance to be explained, the model is evaluated using all possible coalitions (sets) of features with and without f. Such an approach is extremely computationally expensive, as the number of coalitions increases exponentially with the number of features. Strumbelj and Kononenko [57] presented an approximation algorithm for Shapley values using a Monte-Carlo sampling technique. This approximation algorithm has been used in this work as an example of a local explainer and will be referred to as the Shapley Values explainer. The analysis of the global and local machine learning interpretability techniques has been conducted using R.
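The Monte-Carlo approximation can be sketched in Python (the paper's own analysis was done in R); the three-feature toy model and the chosen instance are illustrative assumptions. Instead of enumerating all coalitions, the sketch averages feature j's marginal contribution over random feature orderings, filling "absent" features in from a randomly drawn background instance:

```python
import random

random.seed(6)

# Toy black-box over three features (illustrative weights only).
def model(x):
    age, mets, rsbp = x
    return 0.02 * age - 0.03 * mets + 0.01 * rsbp

X = [(random.uniform(20, 80), random.uniform(5, 15), random.uniform(90, 160))
     for _ in range(200)]

def shapley(x0, j, m=2000):
    """Monte-Carlo Shapley value of feature j for instance x0: average
    marginal contribution of j over random feature orderings, with the
    'absent' features filled in from a random background instance."""
    idx = range(len(x0))
    total = 0.0
    for _ in range(m):
        bg = random.choice(X)
        order = list(idx)
        random.shuffle(order)
        pos = order.index(j)
        with_j = tuple(x0[k] if order.index(k) <= pos else bg[k]
                       for k in idx)
        without_j = tuple(x0[k] if order.index(k) < pos else bg[k]
                          for k in idx)
        total += model(with_j) - model(without_j)
    return total / m

# Contribution of Age for an older patient, relative to the dataset.
phi_age = shapley((75.0, 6.0, 150.0), 0)
```

For a linear toy model like this one, the estimate converges to the feature's weight times its deviation from the dataset mean, which is the "fair payout" the Shapley value formalizes.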

Results
In this section, we present the results of applying various global and local interpretability techniques to our predictive model for individuals at risk of developing hypertension based on cardiorespiratory fitness data. In particular, we present the results of five global interpretability techniques, namely, feature importance, partial dependence plot, individual conditional expectation, feature interaction and global surrogate models. In addition, we present the results of two local explanation techniques, namely, LIME and Shapley value explanation.

Global interpretability techniques
Feature Importance

Figure 2 shows the ranking of the importance of the selected input features in predicting the high risk of hypertension. The feature importance represents the factor by which the error is increased compared to the original model error. As shown in the figure, Age is the most important feature, followed by Resting Systolic Blood Pressure. The History of Coronary Artery Disease is the least significant feature.

The yellow line in Fig. 3 shows the partial dependence plot of the probability of high risk of hypertension for each of the highly ranked features for predicting hypertension: Age, METS, Resting Systolic Blood Pressure and Resting Diastolic Blood Pressure. The black lines in Fig. 3 show the individual conditional expectation plot of the high risk of hypertension probability for these features. Each of the black lines represents the conditional expectation for one patient. For the Age feature, the partial dependence plot shows that, on average, the probability of high risk of hypertension increases gradually from 0.25 to reach 0.5 at the age of 65 and then remains stable up to the age of 100 (Fig. 3a). For the METS feature, the partial dependence plot shows that, on average, an increase in METS is associated with a lower probability of high risk of hypertension (Fig. 3b). On average, an increase in the Resting Diastolic Blood Pressure is associated with a gradual increase in the probability of high risk of hypertension (Fig. 3c). For the Resting Systolic Blood Pressure, the plot shows that the probability of high risk of hypertension increases with the blood pressure. Overall, the interaction effects between the features are considerably strong.

Global Surrogate Models
We fit decision trees of depths 3 and 4 using the original dataset, but with the predictions of the model (Random Forest) used as the outcome for the decision tree model, instead of the real classes (high risk of hypertension and low risk of hypertension) from the original dataset. Figures 5 and 6 show the terminal nodes of surrogate decision trees of depths 3 and 4, respectively. The counts in the nodes show the distribution of the random forest model predictions in the nodes. The counts in the nodes in Fig. 5 show that the surrogate tree predicts a higher number of low risk of hypertension patients when the Age is less than or equal to 50.2, Resting Diastolic Blood Pressure is less than or equal to 83 and METS is less than or equal to 12.9. Also, the counts show that the surrogate tree of depth 3 predicts a higher number of high risk of hypertension patients when the Age is greater than 50.2 and the Resting Systolic Blood Pressure is between 126 and 140.

Local interpretability techniques
The explanatory plot produced by the LIME explanation mechanism illustrates, for each feature and class, in which range of values a representative data point would fall. If the instance falls within this range, this gets counted as support for the prediction; if it does not, it gets scored as contradictory. In addition, LIME produces what is called the Explanation fit, which refers to the R-squared of the linear ridge regression model fitted locally to explain the variance in the neighborhood of the examined instance. The explanatory plot produced by the Shapley Values explainer is close to the one generated by LIME in the sense that it shows the features' names and the features' contributions that are used in the explanation. Since LIME and Shapley Values are instance-based explainers, in the following we evaluate both explainers based on 20 randomly selected instances from the testing dataset and present their explanations in detail. We present two instances that have been correctly predicted by the black-box prediction model: one instance from the True Positive group (correctly predicted as high risk of hypertension) and another instance from the True Negative group (correctly predicted as low risk of hypertension). In general, the generated explanations for the correctly predicted instances are commonly very intuitive and clear, and they mostly follow common standard patterns. Thus, we chose to focus more on the incorrectly predicted instances, as understanding the rationale and explanations for such incorrect predictions of the model increases the trust of the clinicians in the model's behavior and performance. Thus, we present instances that comprehensively cover the False Positive and False Negative groups, with consideration of the most important prediction factor, the patient's age. For the first correctly predicted instance, the explanation shows that older age increases the predicted risk of hypertension, which matches the partial dependence plot created in Fig. 3a and is in line with the medical study by Rockwood et al. [60].
The explanations also show that people whose METS are greater than 12.9 are at low risk of developing hypertension, which matches the medical study by Juraschek et al. [61]. The LIME explanation also shows that white people are at lower risk of developing hypertension compared to black people, which is supported by the study conducted by Ergul et al. [62].

Instance 2 (True Positive)
The description of this instance is as follows. Figure 9 shows the LIME explanation of the prediction of the black-box model for instance 2 as high risk of hypertension (assigning a strong probability of 0.98 to high risk of hypertension). The explanation is created based on five features: Age, METS, Race, Hypertension Response, and Peak Diastolic Blood Pressure. The three features Age, METS, and Race positively support the prediction of high risk of hypertension. In the following, we take a closer look at the instances misclassified by the Random Forest model and examine their explanations using LIME. To ensure diversity, we selected nine instances from each of the False Positive instances (incorrectly classified as high risk of hypertension) and the False Negative instances (incorrectly classified as low risk of hypertension), based on the patient's age, as it has been identified to be the most important feature according to the feature importance plot and the partial dependence plot.
We start by studying the false positive instances. Figure 11 shows the frequency distribution of the false positive instances based on the probability of low risk of hypertension. The probability of low risk of hypertension has been split into three groups (bins): Group 1 represents instances with a probability of low risk of hypertension in [0, 0.2], while Group 2 and Group 3 represent instances with probabilities in ]0.2, 0.35] and ]0.35, 0.5[, respectively. The frequency of the instances in Group 3 is the highest (the black-box model predicts a patient as low risk of hypertension if the low-risk probability is greater than or equal to 0.5). In the following, we present sample instances of False Positive predictions from Group 1. The instances are selected based on the patient's age: one instance close to the maximum age, one close to the minimum age and one close to the average age. Figure 12 shows the LIME explanation of instance 3 based on Age, Resting Systolic Blood Pressure, METS, Percentage HR achieved, and Peak Diastolic Blood Pressure. All the features used in the explanation positively contributed to the prediction of high risk of hypertension, with a probability equal to 0.68. Figure 13 shows the Shapley Values explanation of instance 3 based on Percentage HR achieved, Aspirin Use, METS, Age, and Reason for test. The feature contributing most toward increasing the probability of high risk of hypertension is Percentage HR achieved, while Reason for test is the feature contributing most toward decreasing it. Age and METS are the most important features for LIME that positively contributed to the prediction of high risk of hypertension, while being white and having a negative Hypertension Response test negatively contributed to the prediction of high risk of hypertension, as shown in Fig. 16.
LIME explains instance 5 as high risk of hypertension with a probability of 0.68. Figure 17 shows Shapley Values explanation of instance 5 based on Resting Systolic Blood Pressure, HX Coronary Artery Disease, METS, Reason for test and Age. All the features except Resting Systolic Blood Pressure contributed toward decreasing the probability of the high risk of hypertension.
In the following, we present sample instances of False Positive predictions from Group 2. The instances are selected based on the patient's age: one instance close to the maximum age, one close to the minimum age and one close to the average age. Figure 18 shows the explanation of instance 6 as high risk of hypertension with a weak probability of 0.64. We then present sample instances of False Positive predictions from Group 3, selected in the same way. Instance 10 is incorrectly predicted by the black-box model as high risk of hypertension, with a weak probability equal to 0.52, as shown by the LIME explainer. Figure 28 shows the LIME explanation of instance 11 as high risk of hypertension with a probability of 0.51. The features Age, Resting Systolic Blood Pressure and METS are the main features that contributed to the prediction of high risk of hypertension. The Shapley Values explanation for instance 11 is shown in Fig. 29, based on Race, Hypertension Response, Age, Resting Systolic Blood Pressure, and Reason for test. The two features Race and Hypertension Response are the only features contributing toward increasing the probability of high risk of hypertension. The explanations of these False Positive examples show that Age is the most influential feature in the explanation of high risk of hypertension based on LIME. We noticed that instances in Group 3 have the lowest average age of 56, while instances in Group 1 have the highest average age of 68 amongst the three groups, which clearly indicates that the probability of low risk of hypertension decreases with the increase in the patient's age.
In the following, we take a closer look at examples of instances with False Negative predictions (incorrectly classified as low risk of hypertension). Figure 30 shows the frequency distribution of the false negative instances based on the probability of high risk of hypertension. The probability of high risk of hypertension has been split into three further groups. Group 4 represents instances with a probability of high risk of hypertension in [0, 0.2]. Group 5 and Group 6 represent instances with a probability in (0.2, 0.35] and (0.35, 0.48), respectively (0.48 is the highest probability among the False Negative instances). In particular, we present sample instances of False Negative predictions from Group 4. The instances are selected based on the patient's age: one instance is close to the maximum age, one is close to the minimum age, and one is close to the average age.

Figure 31 shows the explanation of instance 12 as being at low risk of hypertension with a strong probability of 0.8. The explanation is based on Age, METS, Race, Hypertension Response, and Reason for test. Age is the most influential feature and contributed negatively to the prediction of low risk of hypertension, while METS, Race, and Hypertension Response contributed positively. Figure 32 shows the Shapley value explanation for instance 12 based on METS, Resting Systolic Blood Pressure, Hypertension Response, Reason for test, and Age. Similar to the LIME explanation, METS and Hypertension Response contributed toward the probability of low risk of hypertension.

Instance 13 (False Negative Prediction of Low Risk - Group 4 - Close to Minimum Age)
The description of this instance is as follows: Age = 18.

In the following, we present sample instances of False Negative predictions from Group 5. The instances are selected based on the patient's age: one instance is close to the maximum age, one is close to the minimum age, and one is close to the average age.

Instance 15 (False Negative Prediction of Low Risk - Group 5 - Close to Maximum Age)
The description of this instance is as follows: Age = 79.6, METS = 7, Resting Systolic Blood Pressure = 120, Peak …

In the following, we present sample instances of False Negative predictions from Group 6. The instances are selected based on the patient's age: one instance is close to the maximum age, one is close to the minimum age, and one is close to the average age.
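The probability-based grouping used for these False Negative instances (Groups 4, 5, and 6) amounts to a simple binning of the predicted probabilities of high risk. The sketch below reproduces that binning; the probability values are hypothetical, and only the bin edges (0.2, 0.35, and the observed maximum of 0.48) come from the analysis above.

```python
import numpy as np

def assign_group(probs, edges=(0.2, 0.35, 0.48)):
    """Map each false-negative probability of high risk to Group 4, 5, or 6.

    Group 4: [0, 0.2], Group 5: (0.2, 0.35], Group 6: (0.35, 0.48].
    """
    probs = np.asarray(probs)
    # With right=True, a value equal to an edge falls into the lower bin
    bins = np.digitize(probs, edges, right=True)  # 0, 1, or 2
    return bins + 4  # shift to group labels 4, 5, 6

# Hypothetical predicted probabilities, for illustration only
groups = assign_group([0.05, 0.2, 0.3, 0.4, 0.47])
print(groups)  # → [4 4 5 6 6]
```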

Discussion
In general, the global interpretability techniques have the advantage that they can generalize over the entire population, while local interpretability techniques give explanations at the level of individual instances. Both kinds of methods can be equally valid depending on the application need. For example, a healthcare application such as predicting the progression of the risk of hypertension may require a global understanding of the main risk factors for developing hypertension; in this case, local explainers may not be suitable. One way to meet this goal is to use global explanation methods. Another way is to obtain local explanations and then aggregate them into global-level explanations; however, such a technique is computationally expensive.

One of the main advantages of LIME is that its explanation is based on a local regression model, which allows physicians to reason about how the explanation would change if the features of the patient changed, for example, "what would the probability of hypertension be for this patient in five years?". One of the main limitations of LIME is the instability of its explanations: patients with very similar characteristics may receive very different explanations, and even for a single patient, requesting the explanation twice may yield two different results. Another limitation is that the perturbed data points, which act as the training data for the interpretable model, are sampled from a Gaussian distribution that ignores the correlation between features. This may lead to a poor selection of data points and, in turn, a poor explanation. In addition, LIME makes the strong assumption that the model fitted on the perturbed data is locally linear, but there is no clear theory about when this assumption is valid.
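The LIME procedure just described (Gaussian perturbation around the instance, proximity weighting, and a weighted linear fit) can be sketched in a few lines without the `lime` package itself. Everything below — the stand-in black-box function, the kernel width, the sample size, and the instance — is an illustrative assumption, not the configuration used in our experiments; it is a sketch of the mechanism, not a definitive implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    """Stand-in for the trained model: a nonlinear risk score in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] + 0.5 * X[:, 0] * X[:, 1])))

def lime_sketch(x, predict, n_samples=500, kernel_width=0.75):
    """Fit a proximity-weighted linear surrogate around instance x."""
    d = x.shape[0]
    # 1. Perturb: sample from a Gaussian centred on the instance
    #    (ignoring feature correlations, as noted in the text)
    Z = x + rng.normal(scale=1.0, size=(n_samples, d))
    # 2. Weight: exponential kernel on the distance to the instance
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Fit: weighted least squares of black-box outputs on perturbations
    A = np.hstack([Z, np.ones((n_samples, 1))])  # add intercept column
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * A, sw * predict(Z), rcond=None)
    return coef[:-1]  # local feature weights (intercept dropped)

x = np.array([1.0, 0.5])
weights = lime_sketch(x, black_box)
```

The returned weights are the local explanation: rerunning with a fresh random seed gives slightly different weights, which is precisely the instability issue noted above.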
One of the main advantages that distinguishes the Shapley value explanation from LIME is that the difference between the average prediction and the prediction of the instance to be explained is fairly distributed among the feature values of that instance. One disadvantage is that the Shapley value explanation is computationally expensive. Another disadvantage is that, unlike LIME, we need access to the training examples used to train the model being explained.

Many methods have been proposed to make complex machine learning models interpretable; however, these methods have been evaluated individually on small datasets [60]. To the best of our knowledge, this is the first study that applies and demonstrates the utility of various model-agnostic explanation techniques for analyzing the outcomes of a prediction model for individuals at risk of developing hypertension based on cardiorespiratory fitness data. This study is designed to take advantage of a unique and rich clinical research dataset of 23,095 patients to explain the predictions of the best-performing machine learning model for predicting individuals at risk of developing hypertension in a manner understandable to clinicians. The results show that different interpretability techniques shed light on different aspects of the model's behavior: global interpretations enable clinicians to understand the entire conditional distribution modeled by the trained response function, whereas local interpretations promote the understanding of small parts of the conditional distribution for specific instances. In practice, both kinds of methods can be equally valid depending on the application need. Both are effective in assisting clinicians in the medical decision process; however, clinicians will always hold the final say on accepting or rejecting the outcome of the machine learning models and their explanations based on their domain expertise.

Threats to validity

External validity
A main limitation of this study is that the predictors of the models, the predictions of the models for new instances, and the explanations of the interpretability techniques are all based on the characteristics and predictors of the cohort of this study.

Construct validity
This study has been mainly focusing on two local interpretability techniques, namely, LIME and Shapley Value Explanations. The inclusion of additional local interpretability techniques may lead to different explanations and additional insights.

Conclusion validity
Due to the nature of this study and the limited availability of similar comparable cohorts, generalizing the findings and explanations of this study would require the inclusion of multiple datasets representing multiple cohorts.

Conclusion
Explaining the predictions of black-box machine learning models has become a crucial issue that is gaining increasing momentum. In particular, achieving optimal performance is no longer the only focus of data scientists; instead, there is growing attention on the need to explain the predictions of black-box models at both the global and local levels. The explanations produced by the various methods in this study reflect the significant role of these techniques in assisting the clinical staff in the decision-making process. For example, the LIME technique allows physicians to reason about how an explanation would change if the features of the patient changed; however, it suffers from instability of its explanations. Meanwhile, the Shapley value explanation technique fairly distributes the difference between the average prediction and the prediction of the instance to be explained among the feature values of that instance; on the other hand, it is computationally expensive and, unlike LIME, requires access to the training data. Finally, we believe that this study is an important step toward improving the understanding and trust of intelligible healthcare analytics by producing a comprehensive set of explanations for predictions at both the local and global levels. As future work, there are various directions in which to extend and build upon this work, for example, generalizing the explanations by including multiple datasets representing multiple cohorts, incorporating additional local interpretability techniques and studying their impact, and investigating how the outcomes of the various explanation techniques can be effectively utilized to update and improve the accuracy of the prediction model and, consequently, the quality of the provided interpretations.