 Technical advance
 Open Access
AdaWHIPS: explaining AdaBoost classification with applications in the health sciences
BMC Medical Informatics and Decision Making volume 20, Article number: 250 (2020)
Abstract
Background
Computer Aided Diagnostics (CAD) can support medical practitioners to make critical decisions about their patients’ disease conditions. Practitioners require access to the chain of reasoning behind CAD to build trust in the CAD advice and to supplement their own expertise. Yet, CAD systems might be based on black box machine learning models and high dimensional data sources such as electronic health records, magnetic resonance imaging scans, cardiotocograms, etc. These foundations make interpretation and explanation of the CAD advice very challenging. This challenge is recognised throughout the machine learning research community. eXplainable Artificial Intelligence (XAI) is emerging as one of the most important research areas of recent years because it addresses the interpretability and trust concerns of critical decision makers, including those in clinical and medical practice.
Methods
In this work, we focus on AdaBoost, a black box model that has been widely adopted in the CAD literature. We address the challenge of explaining AdaBoost classification with a novel algorithm that extracts simple, logical rules from AdaBoost models. Our algorithm, Adaptive-Weighted High Importance Path Snippets (AdaWHIPS), makes use of AdaBoost’s adaptive classifier weights. Using a novel formulation, AdaWHIPS uniquely redistributes the weights among individual decision nodes of the internal decision trees of the AdaBoost model. Then, a simple heuristic search of the weighted nodes finds a single rule that dominated the model’s decision. We compare the explanations generated by our novel approach with the state of the art in an experimental study. We evaluate the derived explanations with simple statistical tests of well-known quality measures, precision and coverage, and a novel measure, stability, that is better suited to the XAI setting.
Results
Experiments on nine CAD-related data sets showed that AdaWHIPS explanations consistently generalise better (mean coverage 15% to 68%) than the state of the art while remaining competitive for specificity (mean precision 80% to 99%). A very small trade-off in specificity is shown to guard against overfitting, which is a known problem in the state of the art methods.
Conclusions
The experimental results demonstrate the benefits of using our novel algorithm for explaining CAD AdaBoost classifiers widely found in the literature. Our tightly coupled, AdaBoost-specific approach outperforms model-agnostic explanation methods and should be considered by practitioners looking for an XAI solution for this class of models.
Background
Introduction
Medical diagnosis is a complex, knowledge intensive process. A medical expert must consider the symptoms of a patient, along with their medical and family history including complications and comorbidities [1]. The expert may carry out physical examinations and order laboratory tests and combine the results with their prior knowledge. These activities are time intensive and, increasingly, considered sources of Big Data [2, 3]. Suitably experienced, available practitioners and experts are needed to orchestrate and interpret the results, yet these experts are a scarce resource in many healthcare settings. As healthcare needs grow and the sources of medical data increase in size and complexity, the diagnostic process must scale to meet these growing demands.
State of the art machine learning (ML) methods underpin many computer aided diagnostics (CAD) systems. CAD can address the aforementioned scalability challenges and may improve patient outcomes [4–6]. These ML methods demonstrate exceptional predictive and classification accuracy and can handle high dimensional data sets that often have very high rates of missing values. Examples of such challenging data sets include high throughput bioinformatics, magnetic resonance imaging scans, microarray experiments, and complex electronic health records (EHR) [7, 8], as well as unstructured, user-generated content (e.g. from social media feeds) that has been used to learn individuals’ sub-health and mental health status outside of a clinical setting [9, 10]. Unfortunately, however, many state of the art ML models are so-called “black boxes” because they defy explanation. The complexity of black box models renders them opaque to human reasoning. Consequently, experts and medical practitioners are reluctant to accept black box models in practice since they need to reason about, verify and approve the model’s output before making a final decision. In the clinical setting, the model’s output should facilitate professional decision-making alongside the practitioner’s expert clinical training and experience. A standalone classification from a black box model does not serve this purpose well, if at all. This barrier to adoption is evident, even when the black box models are demonstrably more accurate [1, 11–17]. There is also a legal right to explanation for high stakes decisions, which includes medical diagnosis and treatment recommendations [5, 18].
Some might argue that a black box model is no less transparent than a doctor [19]. Nevertheless, a doctor can be asked to justify their diagnosis and will do so from a position of domain understanding. In contrast, providing explanations for black box models is a very complex challenge. These models find patterns in data without domain understanding. Yet we wish to communicate explanations to a variety of levels of domain expertise: patient, practitioner, healthcare administrators and regulators. Additionally, we set higher standards of statistical rigour before granting our trust to ML derived decisions and explanations [20, 21].
Recent studies found that classification is the most widely implemented ML task in the medical sector and solutions using the AdaBoost algorithm [22] form a significant subset of the available research. Clinical applications include the diagnosis of Alzheimer’s disease, diabetes, hypertension and various cancers [23–26]. There are also non-clinical assessments of self-reported mental health and sub-health status. The latter is characterised by chronic fatigue and infirmity that often leads to future ill-health. These non-clinical approaches used unstructured, user-generated content from online health communities [9, 10]. AdaBoost has also been used as a preprocessing tool to select automatically the most important features from high dimensional data [27, 28]. Yet, AdaBoost is considered a typical black box as a consequence of its internal structure: an ensemble of typically hundreds to thousands of shallow decision trees. The ensemble uses a weighted majority vote to classify data instances, a system that is difficult to analyse mathematically. The widespread adoption of AdaBoost in medical applications, coupled with its black box nature, leads to the challenge of making AdaBoost explainable.
We present Adaptive-Weighted High Importance Path Snippets (AdaWHIPS), a novel method for explaining multi-class AdaBoost classification through inspection of the model internals: a collection of adaptive-weighted, shallow decision trees. The method proceeds by extracting the decision path from each tree that is specific to the data instance requiring an explanation (the explanandum). Only the paths that agree with the weighted majority vote are retained. These paths are disaggregated into individual decision nodes (which we call path snippets), and the weights are reassigned according to depth within the tree and frequency within the ensemble. The most important snippets are filtered and sorted by the newly applied weights. These adaptive-weighted, high importance path snippets are then greedily added to a classification rule. The final rule is tested for quality metrics and counterfactual conditions against the training (or historical) data.
To demonstrate our contribution, we now present four illustrative examples of AdaWHIPS explanations. These examples have been drawn at random from the data sets used in our experiments, which are all CAD or medically relevant ML problems. An AdaWHIPS explanation is a simple, conjunctive classification rule, presented alongside confidence and counterfactual (contrast) information. This includes: generality (coverage), specificity (precision), and how much precision decreases (% points) when any single rule term is violated. The end user can immediately determine the essential attributes (the features and decision boundary) that led to the model’s confident classification:
In Table 1, statistical features computed from foetal cardiotocograms are used to diagnose heart abnormalities. In Table 2, an online health community (self-selecting) responded to a twenty-four question survey on their mental health. The classification model identifies those individuals who have actually sought treatment. The individual shown in the examples has responded that they are experiencing problems at work and that there may be a family history of mental illness. Table 3 shows attributes from an EHR that were critical in determining the risk of readmission for one particular patient. Table 4 shows the results of a classifier for abnormal thyroid conditions. Full details of the data sets used can be found in Table 6.
We proceed with a walk through of the interpretation of Table 1: The model has classified the instance as "Normal." This is on a prior of 79.0% Normal in the training (historical) data. However, the given instance has a set of readings that raises the precision to 98.2%. If an almost identical instance were found with a point change in any one of the features listed (taking the instance outside the decision boundary), precision would decrease by the amount shown on the adjacent Contrast column. The new values would be worse than a random guess on this prior, with a raised number of prolonged decelerations per second returning a different outcome code altogether. These conditions hold on 60% of the historical data, making this a high quality rule that can inform the clinician’s decision on whether any intervention is necessary – most likely not, in this case.
The rest of this paper is organised as follows: We continue this Background section with an in-depth review of the current state of the art in XAI, related work in CAD and a recap of the Multi-Class AdaBoost algorithm. We introduce our novel algorithm and describe our experimental setup in the Method section. We report our results and elaborate on their significance in the Results section. Further important points are presented in the Discussion section. The article finishes with a section on Conclusion & future work.
XAI and interpretable models: current state of the art
Medical practitioners making safety critical decisions need explanations of ML classification results that provide the required level of accountability. The current research seeks to address the challenge posed by the use of AdaBoost models in healthcare applications. In contrast to model-agnostic methods that operate on input sensitivity to synthetic data, our approach is to “open the black box” of an already trained and well performing AdaBoost model. This approach provides explanations that directly relate to the model internals. In the following paragraphs, we outline the state of the art and the novelty of our approach.
The decompositional approach [29] to interpretability is well established. “Decompositional” refers to the process of querying directly the smallest information unit of a model, e.g. the set of all decision nodes within each decision tree of an ensemble. Examples in the literature include: DefragTrees [30], Forex++ [31], RF+HC [32], inTrees [33], RuleFit [34], Brute [35]. All these methods generate a cascading rule list (CRL) as a simpler, surrogate of the original classification model. The prevalence of CRL as interpretable models indicates the importance of logical rules for explainability. Logical rules are intuitive to understand, being the standard language of reasoning [20, 36] and are the paradigm that we have adopted in our method.
The above mentioned methods are examples of globally interpretable proxy models; they allow the user to infer some understanding of the black box model’s overall behaviour. However, with such proxy models there is always a trade-off: interpretability increases, but so does classification error, and there are no guarantees of fidelity with the original model. Anything less than perfect fidelity means that, for some instances, proxy and model do not agree. Explanations that refer to a different class than the model’s predicted class are of no use in a safety-critical setting, such as CAD. AdaWHIPS uses logical rules and is a decompositional method but, unlike the above mentioned methods, AdaWHIPS explains one classification instance at a time rather than the global model behaviour. The method is local and post-hoc [37]. AdaWHIPS also has perfect fidelity by design. That is, the explanation generating process begins with the black box model’s classification as its starting point and is, therefore, guaranteed to match.
Several post-hoc, per instance explanation methods have been proposed as model-agnostic frameworks (also known as didactic methods [29]). The model-agnostic assumption is that any model’s behaviour can be explained given unfettered access only to the model inputs and outputs (that is, to make an unlimited number of calls) but no access to the training data nor the model internals. Model-agnostic methods probe the model’s behaviour by generating a large, synthetic input sample. Each explanation is inferred from the effect of different input attributes on the outputs. Local Interpretable Model-agnostic Explanations (LIME) [21] generates a sparse linear model, while SHapley Additive exPlanations (SHAP) [38] uses a game theoretic approach for a similar result: a set of non-zero coefficients for the input attributes. The coefficients are additive and their magnitude is proportional to the importance in the classification of the attributes they represent. As a result, these methods are categorised as Additive Feature Attribution Methods (AFAM) [38]. The main disadvantage of AFAM is that it is difficult to know when to apply an AFAM explanation to another previously unseen instance that does not share all of the same attribute values associated with the coefficients. Anchors [36] and LOcal Rule-based Explanations (LORE) [39] also use synthetic samples but generate a single classification rule (CR) as an explanation (as opposed to the many rules in a CRL). A CR-based explanation resolves the main disadvantage of AFAM because it is trivial to generalise a CR to another instance; the rule either covers the instance or it does not. Anchors uses the same synthetic sampling technique used by LIME since it was developed by the same research team to overcome the shortcoming of AFAM. LORE uses a genetic algorithm to generate the synthetic sample but this requires a very large number of calls to the black box model, and is computationally expensive to run in its own right.
Model-agnostic techniques, while effective in image and text classification, have disadvantages on tabular data sets. For one thing, they require additional checks; variance in the sampling process can cause variance in the resulting explanations over repeated trials [40, 41]. Furthermore, for tabular data, a realistic synthetic distribution must be estimated from the training data set or a large i.i.d. sample. This requirement violates the model-agnostic assumption of accessing only the inputs and outputs of the black box model. LIME, Anchors, and SHAP sample from the marginal training distribution, while LORE explores the marginal input domains. Clearly such synthetic samples have no guarantees to represent the underlying population because they do not use the joint distribution. In most real-world problems, the joint distribution is unknown or intractable. Although these methods explicitly access the training data, no rationale is given in the relevant articles for not using the empirical distribution, for example by the bootstrapping method used in Brute [35]. Consequently, these model-agnostic methods are thought to put too much weight on unlikely or impossible examples. Moreover, LIME and Anchors require all features of tabular data to be categorical. Continuous features must be discretised in advance of training the classification model. To this end, quartile binning [36] is proposed by the authors. This is an arbitrary procedure and a significant compromise that puts constraints on the model of choice and potentially loses important information from the continuous features.
AdaWHIPS, in contrast, assumes access to both the model internals and the training data. By decomposing the internals, using the adaptive weights and executing a greedy heuristic against the bootstrapped training data, AdaWHIPS is an open-the-box method that uses the empirical distribution instead of a synthetic distribution. Furthermore, AdaWHIPS exploits the information-theoretic discretisation of the continuous features that occurs when the individual decision trees are induced during the AdaBoost model training. This information preserving approach is an advantage over the methods that require discretisation as a preprocessing step. Model-agnostic methods can also be slow to compute. For example, computing Shapley Values entails solving a large combinatorial problem which limits the scalability [42], while LORE’s synthetic samples are generated by a genetic algorithm that is not parallelisable in the currently available version (see Footnote 1). AdaWHIPS is fast, as our experimental study shows.
We suggest that the model-agnostic assumption should be taken with caution. There is a prevailing view in the XAI research community that model-agnostic methods are a very active research area while model-specific methods may be in decline. Yet, in a recent, comprehensive literature review [43] the following methods were categorised as model-agnostic when, in fact, they are model-specific: Saliency Maps, Activation Maximisation, Layer-wise Relevance Propagation. These methods all require access to the internal neurons in an Artificial Neural Network and their categorisation as model-agnostic may be a sign of confirmation bias in the research community. We also argue that model-agnostic methods are only required for a subset of ML problems, such as model auditing by an external third party. This scenario does not apply in CAD system development where the capability to add explanations would come from the owners of the model and data themselves. With access to both the training data and the model, decompositional methods should always be considered since they do not rely on synthetic data and can deliver explanations that are more representative of the model’s internals [43]. Treeinterpreter [44] is possibly the earliest model-specific explanation method, applicable to regression problems with Random Forest models. TreeSHAP [42], based on the SHAP method, assumes an underlying XGBoost model and queries the internal decision nodes. This model-specific design provides faster and more consistent results than the original SHAP algorithm for XGBoost models. Thus, model-specific methods are and should remain an active and relevant research area.
Finally, very few XAI methods have so far implemented counterfactuals, which are “what if” scenarios that indicate minimal changes to the inputs that would yield a different classification. To the best of our knowledge, LORE is the only well-cited example; it applies a strict change-of-class counterfactual paradigm and only works for binary classification. AdaWHIPS provides a more flexible counterfactual solution that shows how the confidence (specificity) of a classification changes, as opposed to a discrete change of class. This novel, probabilistic approach allows the expert user to control and interpret the results since a decreasing confidence has ramifications even if the outcome code does not change. For example, CAD may involve rare conditions in very unbalanced data sets, thus simply decreasing the probability that the individual is disease free may be enough to suggest an intervention. The method works just as well for multi-class problems.
As a minor contribution, we also provide a novel method to avoid overfitting explanations that could potentially be applied elsewhere.
Related work
CAD is an active research area. Yet, the safety critical nature suggests that it is unethical to make diagnoses without human intervention [45, 46]. XAI in healthcare offers the paradigm to assist rather than replace the medical expert. Hence, we present recent research that aligns to this paradigm. We focus on methods that predict or classify from non-image based clinical data. Table 5 summarises our review.
Lamy et al. [47] use a case-based reasoning (CBR) approach to recommend treatments for breast cancer patients. Using a combination of weighted k-nearest neighbours (WkNN) and multidimensional scaling (MDS), the user is presented with a visual interface making recommendations based on similarities/differences with historical cases. CBR provides the medical expert with several comparison instances/cases to evaluate. AdaWHIPS, in contrast, presents one classification rule directly extracted from the model internals that must be true of the explanandum instance, while coverage statistics measure the rule’s generalisation to other instances.
Kwon et al. [48] present RetainVis, a visual analytics application for predicting health status from health insurance data. Feature attribution values and t-SNE clustering are used to provide an interactive interface. The paper demonstrates the benefits and deeper insights available from tight coupling to a specific model; a recurrent neural network (RNN), in this case.
Adnan and Islam [31] uses a novel algorithm to simplify an existing tree ensemble. The compact, surrogate model is a rule list that can be used for classifying unseen instances. The authors claim that the global behaviour of the compact model is easier to interpret than the black box ensemble but the rule list can itself be long and time consuming to interpret. In contrast, our method is concerned with generating a single rule to explain a single instance at a time.
Jalali and Pfeifer [8] use an ensemble of linear support vector machines (L1-SVM) to predict cancer diagnosis and identify important patterns of gene expression. This novel approach is tightly coupled to the data domain (genetic biomarkers) whereas AdaWHIPS could feasibly be applied to any tabular data including those not related to medicine or healthcare.
Turgeman and May [12] propose a simple ensemble of a C5.0 decision tree and a support vector machine (SVM). The easiest to classify instances can be explained by traversing the tree, while hard to classify instances are left to the SVM which remains a black box. Consequently, this method cannot produce a straightforward explanation for all instances, unlike our method.
Jovanovic et al. [11] implement a Tree-Lasso system for introducing domain knowledge about serious disease conditions into a sparse logistic regression model that is easy to interpret. Lasso based methods discover a small set of important features using \(L_{1}\)-norm regularisation but the Tree-Lasso requires domain knowledge to be provided a priori. AdaWHIPS rule conditions are discovered by information theoretic tree induction during the AdaBoost model training, and do not require any a priori inputs.
Letham et al. [13] proposes a novel interpretable model, the Bayesian Rule List (BRL). The model is used in stroke prediction. The predictive results are competitive with state of the art, but in common with cascading rule lists, interpretability decreases with rule depth as all previous rules must be considered and excluded. AdaWHIPS generates one rule for one instance from a pretrained AdaBoost model.
Caruana et al. [6] use generalised additive models (GAM) allowing second order interaction (GA^{2}M) to predict pneumonia risk and hospital readmission. GAMs inherently provide partial independence (PI) plots, giving insight into the global model behaviour, and excellent predictive results. Domain knowledge was required a priori to discretise several features and to determine which second order interactions to include. However, interpretation of the non-linear components remains a challenge. Our method is a completely different approach that provides an explanation for individual cases and requires no a priori domain expertise.
Kästner et al. [49] integrate expert knowledge into a neural gas. Interpretability arises from the activation of the explicitly incorporated fuzzy rules. The outputs of this novel method include scored rule conditions but the fuzzy rules must be introduced a priori, again in contrast to AdaWHIPS, which requires no a priori domain knowledge.
Multi-class AdaBoost
In this section, we describe multi-class AdaBoost, with which our method is tightly coupled. Boosting is a method for generating a strong classifier by sequentially combining weak, base classifiers. It is one of the most significant developments in Machine Learning [50, 51]. AdaBoost [52] was the first widely used implementation of boosting and is still favoured for its accuracy, ease of deployment and fast training time [53–55]. It uses shallow decision trees as the base classifiers. On each iteration, the training sample is reweighted such that the next decision tree focuses on examples that were previously misclassified, while previously generated classifiers remain unchanged (the details of this iterative reweighting are not central to this research so we refer the interested reader to [52, 56]). AdaBoost also adaptively updates its base classifier weights based on their individual performance, which we discuss now in further detail. Two algorithms, Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME) and real-valued SAMME (SAMME.R) [56], have emerged as the standard [57] for extending the original AdaBoost algorithm from binary classification to multi-class problems. The following formulations are based on [56].
Let \(f : \mathcal {X} \longmapsto \mathcal {Y}\) be an unknown classification function that we would like to approximate, where \(\mathcal {X}\) is an \(\mathbb {R}^{d}\) input space and \(\mathcal {Y} = \{C_{1},\ \dots,\ C_{K} \}\) is the set of possible classes. Let X be an input data set and our multiclass AdaBoost model be g(X)≈f(X). To classify an instance x, the output of a SAMME model is the weighted majority vote of all the base classifiers.
where \([c_{1},\ \dots,\ c_{K}]\) is a one dimensional (1D) vector indicating the position of the output class and is the output of a single tree T^{(m)} at iteration m. Within this 1D vector, c_{k}=1, c_{j}=0, j≠k indicates that C_{k} is the predicted class. The whole model \(g = \left \{\left \{T^{(1)},\ \dots,\ T^{(M)}\right \},\ \left \{ \alpha ^{(1)},\ \dots,\ \alpha ^{(M)} \right \} \right \}\) is the combination of a set of M base decision tree classifiers and a set of M classifier weights. These weights are calculated during the training phase as:
where err^{(m)} is the error rate at iteration m.
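For reference, the two SAMME expressions described in words above can be written as follows. This is a sketch in the form given by the SAMME formulation in [56]; the indicator notation \(\mathbb{1}(\cdot)\) is an assumption of this sketch, and the numbering is chosen to agree with the later in-text reference to Eq. (2) as the source of the classifier weights.

\[ g(\mathbf{x}) = \arg\max_{k} \sum_{m=1}^{M} \alpha^{(m)} \cdot \mathbb{1}\left(T^{(m)}(\mathbf{x}) = C_{k}\right) \tag{1} \]

\[ \alpha^{(m)} = \log \frac{1 - err^{(m)}}{err^{(m)}} + \log (K - 1) \tag{2} \]

Under this form, a base classifier receives a positive weight whenever \(err^{(m)} < 1 - \frac{1}{K}\), that is, whenever it performs better than a uniform random guess among the K classes.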
To classify an instance x with SAMME.R, each base classifier returns a vector of the conditional probabilities that the class of x is C_{k}. This is the distribution of training instance weights in the terminal node of the decision path taken by x through each tree:
and confidence weights are calculated at run time as:
The output of the whole model is the majority vote based on the additive contribution of these confidence weights per class:
where \(g = \left \{T^{(1)},\ \dots,\ T^{(M)}\right \}\) (weights \(\alpha ^{(m)}_{k}\) evaluated at run time).
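The SAMME.R vote described above can be illustrated with a minimal sketch. It assumes the confidence weight takes the form given in [56], \(\alpha^{(m)}_{k} = (K-1)\left(\log p^{(m)}_{k} - \frac{1}{K}\sum_{k'} \log p^{(m)}_{k'}\right)\), where \(p^{(m)}_{k}\) is the class probability returned by tree m. The function names, the `eps` clipping constant and the example probabilities are hypothetical and are not taken from the original implementation.

```python
import math


def samme_r_weights(probs, eps=1e-12):
    """Confidence weights for one tree, given its class probability vector.

    Assumed form (from the SAMME.R formulation):
    (K - 1) * (log p_k - mean_k' log p_k').
    Probabilities are clipped at eps (an assumption) to avoid log(0).
    """
    n_classes = len(probs)
    logs = [math.log(max(p, eps)) for p in probs]
    mean_log = sum(logs) / n_classes
    return [(n_classes - 1) * (lp - mean_log) for lp in logs]


def samme_r_predict(per_tree_probs):
    """Sum the per-class confidence weights over all trees and take the argmax."""
    n_classes = len(per_tree_probs[0])
    totals = [0.0] * n_classes
    for probs in per_tree_probs:
        for k, weight in enumerate(samme_r_weights(probs)):
            totals[k] += weight
    winner = max(range(n_classes), key=totals.__getitem__)
    return winner, totals


# Three hypothetical trees voting over K = 3 classes.
example = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.8, 0.1, 0.1]]
predicted_class, class_totals = samme_r_predict(example)
```

Because the weights of a single tree sum to zero across classes, a tree that returns a near-uniform distribution contributes almost nothing to the vote, while a confident tree contributes a large positive weight to one class and negative weights to the others.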
Method
AdaWHIPS
We now present AdaWHIPS, our algorithm for generating a CR based explanation for the classification of an explanandum instance x by a previously trained AdaBoost model g. The algorithm begins by initialising a rule as an empty antecedent and the classification outcome g(x) as the consequent. Thus, the CR always agrees with the black box, by design. The algorithm then proceeds through the steps shown in Fig. 1, to identify a small set of antecedent terms, or logical conditions. These conditions must be true of x and must exert the most influence on the classification result. The source of these logical conditions is the ensemble of decision trees that make up g. The influence is determined by the classifier weights within the internals of g, which themselves are derived from the error rates (weights increase as errors decrease).
Extract decision paths
An AdaBoost model typically comprises hundreds to thousands of shallow decision trees, potentially resulting in a very large search space. For a given x∈X, we can reduce this space logarithmically by considering only the decision path of that x in each decision tree and ignoring all other branches. The paths retain all the information about how g(x) was determined. A conceptual example of extracting the decision path is shown in Fig. 2. Here, \(\mathbf {x} = \{\dots,\ x_{i} = 0.1,\ x_{j} = 10,\ \dots \}\), where x_{i} is the attribute value of the i^{th} feature. The decision path starts from the root node Q_{1}, following the binary split conditions down to a leaf node. The decision path contains node detail triples of the form (j,ν,τ), where j is a feature index, \(\nu \in \mathbb {R}\) is the threshold for the inequality x_{j}<ν and τ∈{0,1} is the binary truth of evaluating the inequality. Note that for this instance, all other nodes are irrelevant. For example, even though Q_{7} applies (x_{i}<1.0), it cannot be reached by x because of the evaluation at Q_{5}.
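The path extraction described above can be sketched as a simple traversal. The nested-dictionary tree representation (keys `feature`, `threshold`, `true`, `false`, `leaf`) and the function name are hypothetical, chosen only to keep the example self-contained; a real implementation would read the same information from the trained model’s tree structures.

```python
def extract_decision_path(tree, x):
    """Follow instance x from the root to a leaf.

    Returns the list of (j, nu, tau) triples visited, where j is the feature
    index, nu the threshold and tau = 1 if x[j] < nu else 0, together with
    the class label stored at the leaf that was reached.
    """
    path = []
    node = tree
    while "leaf" not in node:
        j, nu = node["feature"], node["threshold"]
        tau = 1 if x[j] < nu else 0
        path.append((j, nu, tau))
        # Branches that x does not follow are never visited.
        node = node["true"] if tau else node["false"]
    return path, node["leaf"]


# A hypothetical depth-2 tree over two features.
toy_tree = {
    "feature": 0, "threshold": 0.5,
    "true": {
        "feature": 1, "threshold": 5.0,
        "true": {"leaf": "A"},
        "false": {"leaf": "B"},
    },
    "false": {"leaf": "C"},
}

path, label = extract_decision_path(toy_tree, [0.1, 10.0])
# path  -> [(0, 0.5, 1), (1, 5.0, 0)]
# label -> "B"
```

The cost of the traversal is proportional to the depth of the tree rather than to the number of nodes, which is the sense in which the search space shrinks logarithmically.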
The search space can be further reduced by considering only those trees that agreed with the weighted majority vote. The rationale for this is based on the application of maximum margin theory to boosting [58]. If x is an unseen instance, the margin in SAMME is:
The quantity a^{+} represents the sum of weights from the classifiers that voted for the majority class, and a^{+}>a^{−} is always true for the majority class. The set \(\mathcal {T}^{+}\) comprises the base classifiers that voted in the majority and thus contributed their weight to a^{+}, and \(\mathcal {T}^{-}\) comprises the remaining classifiers. \(\mathcal {T}^{+}\) completely determines the ensemble’s output for a given instance because an ensemble classifier formed from the union of \(\mathcal {T}^{+}\) and any subset of \(\mathcal {T}^{-}\) would return the same classification with a larger margin because \(a^{-}_{*} < a^{-},\ \mathcal {T}^{-}_{*} \subset \mathcal {T}^{-}\). We found no margin formalisation for SAMME.R in the literature but we can define \(\mathcal {T}^{+} := \left \{ (T^{(m)}, \alpha ^{(m)}_{k}) : \alpha ^{(m)}_{k} \geq \alpha ^{(m)}_{j},\ k,j \in \{1,\ \dots,\ K\} \right \}\) and, as a convenience, we can substitute the α terms in Eq. (6) for the following Kullback-Leibler (KL) Divergence. The KL-Divergence (also known as “relative entropy”) measures the information lost if a distribution P^{′} is used instead of another distribution P to encode a random variable and is defined as:
\[ D_{KL}(P \,\|\, P^{\prime}) = \sum_{k} P(k) \log \frac{P(k)}{P^{\prime}(k)} \tag{7} \]
and we set P,P^{′} as the posterior class distribution of each T^{(m)}(x) given in Eq. (3), and prior class distribution in the training data, respectively. The KL-Divergence will be larger for trees that classify with greater accuracy, relative to the prior distribution. The D_{KL} emulates the classifier weights yielded by Eq. (2), which allows the rest of the algorithm to proceed in an identical manner for SAMME and SAMME.R.
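As an illustration of this step, the sketch below partitions the base classifiers by their vote and computes a relative entropy that could stand in for the classifier weight in the SAMME.R case. All function and variable names are hypothetical. Taking a^{−} as the total weight of all classifiers outside the majority is an assumption of this sketch, as is the `eps` guard against division by zero.

```python
import math


def kl_divergence(p, p_prime, eps=1e-12):
    """Relative entropy D_KL(P || P') for two discrete class distributions."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, p_prime)
        if pi > 0  # terms with P(k) = 0 contribute nothing
    )


def split_by_vote(votes, alphas):
    """Partition tree indices into those agreeing with the weighted vote and the rest.

    votes  : predicted class index for each tree
    alphas : classifier weight for each tree
    """
    totals = {}
    for vote, alpha in zip(votes, alphas):
        totals[vote] = totals.get(vote, 0.0) + alpha
    majority = max(totals, key=totals.get)
    t_plus = [m for m, vote in enumerate(votes) if vote == majority]
    t_minus = [m for m, vote in enumerate(votes) if vote != majority]
    a_plus = totals[majority]
    a_minus = sum(alphas) - a_plus  # assumption: all remaining weight
    return majority, t_plus, t_minus, a_plus, a_minus


# Four hypothetical trees: trees 0 and 2 vote for class 0.
majority, t_plus, t_minus, a_plus, a_minus = split_by_vote(
    votes=[0, 1, 0, 2], alphas=[1.0, 0.5, 0.8, 0.3]
)

# A confident tree diverges more from a uniform prior than a hesitant one.
confident = kl_divergence([0.9, 0.1], [0.5, 0.5])
hesitant = kl_divergence([0.6, 0.4], [0.5, 0.5])
```

Only the trees indexed by `t_plus` would be carried forward, and in the SAMME.R case the divergence value would replace α^{(m)} as the weight attached to each retained path.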
Redistribute adaptive weights
To avoid a combinatorial search of all the available decision nodes, we sort them, prior to rule merging, according to their ability to separate the classes. To do this, we disaggregate the entire set of decision paths into individual decision nodes and redistribute the classifier weights onto the nodes. This procedure is illustrated in Algorithm 1. The contribution of each node is conditional on the previous nodes in the path and this sorting must take into account the node order in the originating tree. To do this, we apply Eq. (7) to determine the relative entropies at each point in a path. For each root node, we set P,P^{′} as the class distribution when applying that decision to the training data, and the prior class distribution respectively. For subsequent nodes, P is the class distribution after applying all previous decision nodes including the current node and P^{′} is the distribution up to but not including the current node. The relative entropy scores for nodes in a single path are normalised such that their total is equal to that of the classifier weight α^{(m)}. The scores are grouped and summed for nodes that appear in multiple paths. We filter the nodes, keeping only those with the largest weights (e.g. top 20%). Finally, all nodes from all paths are sorted by this score in descending order.
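The redistribution procedure can be sketched as follows. This is not a reproduction of Algorithm 1; it is a minimal illustration under stated assumptions: paths are lists of (j, ν, τ) triples, the class distributions are estimated directly from a small training set `X`, `y`, a tiny additive `smooth` constant keeps the distributions strictly positive, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_distribution(labels, classes, smooth=1e-9):
    """Empirical class distribution with a tiny smoothing term (an assumption)."""
    counts = Counter(labels)
    total = len(labels) + smooth * len(classes)
    return [(counts.get(c, 0) + smooth) / total for c in classes]


def kl(p, q):
    """Relative entropy between two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def satisfies(x, node):
    """True if instance x agrees with the node triple (j, nu, tau)."""
    j, nu, tau = node
    return (x[j] < nu) == bool(tau)


def redistribute_weights(paths, alphas, X, y, classes, keep_fraction=0.2):
    """Spread each classifier weight over the nodes of its path, then rank nodes."""
    scores = defaultdict(float)
    for path, alpha in zip(paths, alphas):
        rows = list(range(len(X)))
        before = class_distribution([y[i] for i in rows], classes)  # prior
        entropies = []
        for node in path:
            # Apply the nodes cumulatively, in their original tree order.
            rows = [i for i in rows if satisfies(X[i], node)]
            after = class_distribution([y[i] for i in rows], classes)
            entropies.append(kl(after, before))
            before = after
        total = sum(entropies)
        if total <= 0:
            continue  # path carries no information relative to the prior
        for node, entropy in zip(path, entropies):
            # Normalise so that the scores of one path sum to its alpha;
            # nodes shared between paths accumulate their scores.
            scores[node] += alpha * entropy / total
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:n_keep]
```

A node that appears near the root of many highly weighted trees accumulates a large score under this scheme, which is consistent with the stated intent of weighting by depth within the tree and frequency within the ensemble.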
Generate classification rule
It is trivial to convert the node detail triples (j,ν,τ) into antecedent terms of a CR [59]. We use nodes and terms interchangeably from here on. The objective is to find a minimal set of terms that maximises both precision and coverage while mitigating the problem of overfitting. Overfitting can occur if we maximise precision as an objective function. We risk converging on “tautological” rules that provide no generalisation. This is because precision is trivially maximised by single instances. A tautological rule contains enough terms to identify a single instance uniquely. In a noisy data set, there could be many such local maxima. Therefore, we propose stability as a novel objective function, defined as:
where Z is the set of instances covered by the current rule and K the number of classes. The maximum achievable ζ is \(\frac {1}{K}\) for a single instance but approaches precision asymptotically as |Z|→∞. Stability, therefore, acts as a brake on adding too many terms and overfitting. We proceed with a breadth-first search, iteratively adding terms to an initially empty rule. We always add the first term in the sorted list. Then, we work down the list, greedily adding further terms if they increase stability and discarding them if they do not. The algorithm stops when a threshold stability (e.g. 0.95) is reached or the list is exhausted. These steps are illustrated in Algorithm 2.
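The greedy merge can be sketched in a few lines of Python. The sketch below is not Algorithm 2 itself: it assumes that terms are hypothetical (feature index, inequality, threshold) tuples, that labels are the classes predicted by the model, and that stability takes the smoothed form (n + 1)/(m + K), with n covered and correct instances, m covered instances and K classes, as described in the stability analysis of this article.

```python
def covers(rule, z):
    """True if instance z satisfies every term of the rule."""
    return all((z[j] <= t) if op == "<=" else (z[j] > t) for j, op, t in rule)


def stability(rule, data, labels, target, n_classes):
    """Smoothed precision (n + 1) / (m + K); an assumed form of stability."""
    covered = [y for z, y in zip(data, labels) if covers(rule, z)]
    n_correct = sum(1 for y in covered if y == target)
    return (n_correct + 1) / (len(covered) + n_classes)


def greedy_rule(sorted_terms, data, labels, target, n_classes, threshold=0.95):
    """Add terms in sorted order, keeping only those that raise stability."""
    if not sorted_terms:
        return [], stability([], data, labels, target, n_classes)
    rule = [sorted_terms[0]]  # the top-ranked term is always added
    best = stability(rule, data, labels, target, n_classes)
    for term in sorted_terms[1:]:
        if best >= threshold:
            break  # threshold stability reached
        candidate = rule + [term]
        score = stability(candidate, data, labels, target, n_classes)
        if score > best:
            rule, best = candidate, score  # keep the term
        # otherwise the term is discarded
    return rule, best


# Hypothetical data: two features, two classes "a" and "b".
data = [(1.0, 5.0), (2.0, 6.0), (1.5, 1.0), (4.0, 5.5), (5.0, 1.0), (1.2, 7.0)]
labels = ["a", "a", "b", "b", "b", "a"]
sorted_terms = [(0, "<=", 3.0), (1, ">", 3.0), (1, "<=", 100.0)]

rule, score = greedy_rule(sorted_terms, data, labels, target="a", n_classes=2)
print(rule, score)
```

In this toy run the second term raises stability from 4/6 to 4/5 and is kept, while the third term covers exactly the same instances, leaves stability unchanged and is discarded.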
Generate counterfactuals
Counterfactuals answer the question “what would have happened if... ?” They illustrate minimal changes in the inputs that would give different results. Some authors define counterfactual (sometimes called contrastive) explanations as a minimal change set on the inputs that would return a different result [5, 15, 39, 60]. However, discrete change-of-classification counterfactuals do not allow any uncertainty. We suggest a fuzzy definition is better suited here; namely, a change is counterfactual if precision (specificity) decreases beyond a user-defined tolerance. The expert can better exercise their judgement with this approach. For example, decreasing from high to low confidence in a CAD or risk score can lead to requests for additional tests, a less aggressive clinical intervention and so on. Since the definition of counterfactuals is a minimal change set, it is not necessary (nor even practical) to provide every possible input scenario. It suffices to show the effect of each point change and this is easy to do with CR simply by changing each of the rule terms, one at a time. Any point changes that do not decrease the precision beyond the user-defined tolerance represent a non-counterfactual change and can be removed from the rule. This procedure provides an intuitive pruning mechanism for removing redundant terms that might have been added during the greedy rule merge algorithm. We illustrate this concept visually in Fig. 3. Here a model with a complex decision boundary is trained on a synthetic data set (a Gaussian mixture model) which has two classes, shown as triangles and circles. The model classifies an explanandum instance x as a triangle. The explanation found is the following CR: \(\{\mathbf {z} : a \leq z_{1} \leq b,\ c \leq z_{2} \leq d,\ \mathbf {z} \in \mathcal {X}\} \implies \text {triangle}\). The counterfactual spaces are those spaces immediately adjacent to the four rule boundaries, derived by reversing one inequality at a time:
\[\{\mathbf {z} : z_{1} < a,\ c \leq z_{2} \leq d\},\quad \{\mathbf {z} : z_{1} > b,\ c \leq z_{2} \leq d\},\quad \{\mathbf {z} : a \leq z_{1} \leq b,\ z_{2} < c\},\quad \{\mathbf {z} : a \leq z_{1} \leq b,\ z_{2} > d\},\quad \mathbf {z} \in \mathcal {X}\]
Even though the triangle class is still predicted for parts of these spaces, the expected precision decreases drastically for a CR that is formed from any one of these counterfactual spaces for the antecedent and the same consequent. Thus, the original rule provides a crisp boundary where the maximal precision holds. The counterfactual rules communicate how much precision decreases when the rule is violated in any one dimension.
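The counterfactual procedure lends itself to a short sketch. The code below is illustrative only and rests on several assumptions: rule terms are hypothetical (feature index, inequality, threshold) tuples, precision is computed against the classes predicted by the model, and a term whose counterfactual space contains no instances is conservatively kept.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}
REVERSED = {"<=": ">", ">=": "<", "<": ">=", ">": "<="}


def covers(rule, z):
    """True if instance z satisfies every term of the rule."""
    return all(OPS[op](z[j], t) for j, op, t in rule)


def precision(rule, data, labels, target):
    """Fraction of covered instances with the target class, or None if none are covered."""
    covered = [y for z, y in zip(data, labels) if covers(rule, z)]
    if not covered:
        return None
    return sum(1 for y in covered if y == target) / len(covered)


def counterfactual_rules(rule):
    """Yield (index, rule with the inequality of that one term reversed)."""
    for i, (j, op, t) in enumerate(rule):
        yield i, rule[:i] + [(j, REVERSED[op], t)] + rule[i + 1:]


def prune(rule, data, labels, target, tolerance=0.1):
    """Keep only terms whose reversal lowers precision by more than the tolerance."""
    base = precision(rule, data, labels, target)
    kept = []
    for i, counterfactual in counterfactual_rules(rule):
        p = precision(counterfactual, data, labels, target)
        if p is None or base - p > tolerance:
            kept.append(rule[i])
    return kept


# Hypothetical two-feature data set; the rule is 1 <= z1 <= 3 and 4 <= z2 <= 8.
data = [(2.0, 5.0), (1.5, 6.0), (2.5, 7.0), (0.5, 5.0),
        (3.5, 6.0), (2.0, 3.0), (2.0, 9.0), (2.2, 8.5)]
labels = ["triangle", "triangle", "triangle", "circle",
          "circle", "circle", "triangle", "triangle"]
rule = [(0, ">=", 1.0), (0, "<=", 3.0), (1, ">=", 4.0), (1, "<=", 8.0)]

pruned = prune(rule, data, labels, target="triangle")
print(pruned)
```

In this constructed example, reversing the upper bound on the second feature leaves precision unchanged, so that term is a non-counterfactual change and is pruned; the other three boundaries each separate triangles from circles and are retained.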
Experimental design
We compared AdaWHIPS in an experimental study with the state of the art. Three metrics are used to measure effectiveness, namely, coverage, precision and our new measure of stability. Efficiency, in terms of computing performance, is measured using the average time to generate an explanation. Comparisons are made against two other CR-based, per-instance explanation methods: Anchors [36] and LORE [39]. Both methods are model-agnostic. Readers who are familiar with XAI research may question the omission of LIME [21] and SHAP [38], which are the most discussed per-instance explanation methods. LIME and SHAP fall into a different class of methods, described as additive feature attribution methods (AFAM). AFAM are, effectively, local linear models (LM) whose coefficients relate the importance of various attributes to the original model’s classification of the explanandum. There is no obvious way to apply the local LM for one instance to any other instances in order to calculate the quality measures such as precision and coverage, and comparison with CR-based methods is of limited value [36]. Fortunately, Anchors has been developed by the same research group that contributed LIME and uses the same synthetic sampling technique. Anchors can be viewed as a rule-based extension of LIME and its inclusion in this experimental study provides a useful comparison to best-in-class AFAM research.
Hardware setup
The experiments were conducted using Python 3.6.x running on a standalone Lenovo ThinkCentre with an Intel i7-7600 CPU @ 3.4GHz and 32GB RAM using the Windows 10 operating system.
Data sets
We used nine data sets described in Table 6. These were sourced from the UCI Machine Learning repository [61] and represent specific disease diagnoses from clinical test results, with the following exceptions: the mental health surveys (Kaggle), which represent case studies in detecting mental health conditions from non-clinical online health community data; the hospital readmission data (Kaggle), which represents a large EHR; and understanding society [62], which is from the General Population Sample of the UK Household Longitudinal Study and is used under license. We use the file from waves 2 and 3 where participants had a health visit carried out by a qualified nurse. At least one study [63] has shown that the biomarkers measured in the survey may be associated with the results from self-completion instruments measuring mental health. We run a classification task for the SF-12 Mental Component Summary (MCS), which has been discretised into nominal values “poor”, “neutral” and “good”.
Limitations of the study
Unfortunately, we discovered that LORE was not scalable after finalising our experimental design. The time cost of generating a synthetic distribution by means of a genetic algorithm rendered the method unusable on some of the data sets. The time per instance was on average twenty-five to thirty minutes for the hospital readmission data set and more than two hours per instance on the understanding society data set. The method generated system errors on the mental health survey ’14 data set and was not runnable at all. We thoroughly examined the source code to look for opportunities to parallelise the operation, but the presence of a dynamically generated, non-serialisable distance function rendered this impossible. We have included the results where the method did run to completion.
AdaBoost model training and testing
Each data set was split into training and test sets (70%, 30%) by random sampling without stratification or other class imbalance correction. We trained AdaBoost models using ten-fold cross-validation of the training set on the number of trees \({ntrees} \in \{200,\ 400,\ \dots,\ 1600 \}\); the maximum tree depth parameter maxdepth was always 4. We used the ntrees setting that delivered the highest classification accuracy to train a final model on the whole training set.
As mentioned in the section on related work, Anchors requires all features of the data to be categorical [36]. For our experiments, we generated a copy of each data set, and discretised it using the quartile binning function provided with Anchors. A second AdaBoost model was generated from this discretised data set for Anchors to explain. Training and test splits used identical indices to the undiscretised versions. Each test set was then used as the pool of unseen instances to be classified by the AdaBoost model and explained by AdaWHIPS, Anchors and LORE. Thus, there are three comparable explanations for each test instance. Generating explanations is done instance by instance, not batch-wise as in classification. So, because of time constraints, the number of instances (test units) was limited to either the whole test set or the first one thousand test instances, whichever was the smaller. For each explanation, all the remaining instances from the entire test set were used to assess the standard quality measures, precision and coverage, along with the novel quality measure, stability (Eq. 8), which is more sensitive to overfitting. This leave-one-out procedure ensures that test scores are not biased by leakage of information from the explanation-generating instance. The entire procedure is repeated for SAMME and SAMME.R AdaBoost models.
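The leave-one-out scoring of a single explanation can be sketched as below. The rule encoding, the function names and the toy data are hypothetical, and two conventions are assumed: correctness is judged against the class that the model predicted for the explanandum, and an explanation that covers no other instance is given a precision of 0.0.

```python
def covers(rule, z):
    """True if instance z satisfies every (feature, inequality, threshold) term."""
    return all((z[j] <= t) if op == "<=" else (z[j] > t) for j, op, t in rule)


def evaluate_explanation(rule, test_instances, predictions, explanandum_index):
    """Score one rule on all test instances except the one it was generated for."""
    target = predictions[explanandum_index]
    others = [
        (z, y)
        for i, (z, y) in enumerate(zip(test_instances, predictions))
        if i != explanandum_index  # leave the explanandum out
    ]
    covered = [y for z, y in others if covers(rule, z)]
    coverage = len(covered) / len(others)
    if covered:
        precision = sum(1 for y in covered if y == target) / len(covered)
    else:
        precision = 0.0  # assumed convention: the rule identifies only the explanandum
    return precision, coverage


# Hypothetical single-feature test set with model predictions.
test_instances = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
predictions = ["pos", "pos", "neg", "neg", "pos"]
rule = [(0, "<=", 2.5)]

precision, coverage = evaluate_explanation(rule, test_instances, predictions, 0)
print(precision, coverage)
```

Excluding the explanandum from both numerator and denominator is what prevents the explanation-generating instance from inflating its own score.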
We present the performance scores of the trained models in Table 7. It is important to note that the model training is part of the experimental setup and not to be taken as results per se. These training scores simply reflect the performance of AdaBoost; critiquing the performance of AdaBoost itself is not the objective of this work. We provide this level of detail only to demonstrate that the trained AdaBoost models reasonably approximate the underlying data sets and are very accurate. However, a true explanation by definition must stay faithful to the trained model regardless of whether the model is accurate or not (though a poor model would never be used in clinical practice). We show generalisation accuracy scores and Cohen’s κ for the two models (discretised and undiscretised data set variants). Cohen’s κ is a useful measure in multiclass problems and class imbalanced data because this statistic corrects for chance agreement, which can be high in such cases. Values close to zero indicate a high degree of chance agreement. See Appendix for further details on Cohen’s κ.
Significance testing
Our approach for the experimental study is based on the simulated user study implemented in [36]. In that study, coverage represents the fraction of previously unseen instances a user could attempt to classify after seeing an explanation and thence how generally the rule applies to the whole population. Similarly, precision represents the fraction of those classifications that would be correct if a user applied the explanation correctly, indicating the specificity of the rule. Real users who were shown high coverage and precision rule-based explanations demonstrated significantly improved task completion scores over those who were shown AFAM explanations.
To determine statistical significance, we report differences between precision, stability and coverage among the algorithms using non-parametric hypothesis tests. The reason for using these tests is that these measures are proportions from the interval [0,1] and are heavily skewed by design, since each method tries to generate very high precision explanations. We use the paired samples Wilcoxon signed rank test where we have results for just AdaWHIPS and Anchors. The null hypothesis of this test is that the medians of the two samples are equal and the alternative is that the medians are unequal. We use the Friedman test where we have results for all three methods. The Friedman test is a non-parametric equivalent to ANOVA and an extension of the rank sum test for multiple comparisons. The null hypothesis of this test is that there is no significant difference between the mean ranks of all the groups and the alternative is that at least two mean ranks are different. For all our three-way comparisons using the Friedman test, p-values were vanishingly small (≈0). So, in our report that follows, we proceed directly to the recommended pairwise, post-hoc comparison test with the Bonferroni correction (for three pairwise comparisons) proposed in [64]. It is sufficient for this study to demonstrate whether the top scoring algorithm was significantly greater than the second place algorithm on our quality measures of interest. The critical value for a two-tailed test with the Bonferroni correction is \(\frac {0.025}{3} = 0.00833\). See Appendix for further details on the Friedman test applied here.
The three-way post-hoc tests and the two-way comparisons are shown in separate tables to avoid drawing invalid comparisons. The mean rank, rather than the mean, is given in the tables, as this is the statistic compared between groups by the chosen tests. A significant result is indicated by ** and the winning algorithm is formatted in boldface only if the results are significant.
Results
We begin by presenting the four worked examples from the introduction. Then, we assess the aggregated quality measures for the test samples. For each measure, we present a dot chart showing the mean score (with standard errors) aggregated over all the test instances. In several cases, the results are close, resulting in overplotting that could lead to confusion as to whether two or three results are returned for a given data set. To assist the reader in distinguishing the scores, a guide line has been added. However, each data set should still be viewed as a separate experiment.
Worked examples
Tables 8, 9, 10 and 11 present the worked examples from our introduction. Readers are reminded that the paths taken by a single instance in a pre-trained AdaBoost model are disaggregated into individual decision nodes. The most important of these nodes are recombined into a high quality rule for explaining the model’s classification. Note that models had different numbers of iterations, and trees can grow to any depth up to the maximum of 4. It is also interesting to note a detail about the paths from trees that disagreed with the majority classification; that is, while they covered the instance (as they must), the boundary attributes are very distant from the instance attributes in the input space. We suggest that this is in keeping with the theoretical principles of AdaBoost: each iteration focuses on misclassified instances of the previous iteration, leading to a very different decision boundary in the next tree.
Coverage analysis
We present a visual analysis of the raw data (see Appendix for results tables) and tabulate the results of our statistical tests. A cursory inspection of the mean coverage charts shown in Figs. 4 and 5 indicates that Anchors has the lowest mean coverage over all the data sets but the comparison between AdaWHIPS and LORE is less clear cut. The results of the hypothesis tests are given in Tables 12 and 13. The Wilcoxon tests showed that AdaWHIPS always has significantly higher coverage than Anchors. AdaWHIPS was the top algorithm in all but three of the post-hoc tests for three-way comparisons and in the top two alongside LORE with no significant difference for the remaining tests.
Precision analysis
The mean precision charts (Figs. 6 and 7) show that LORE has the lowest precision in all but one of the data sets where LORE results are available. It is harder to see if there is a definitive lead between AdaWHIPS and Anchors.
However, the complete picture (and the cost to Anchors of implementing a precision guarantee) can be seen in the distribution charts in Figs. 8 and 9. Here we see that a certain proportion of explanations have a precision of 0.0. The result shows that Anchors (and LORE to a lesser extent) is overfitting. Some explanations are so specific that they only explain the explanandum and do not generalise to other instances in the test set. We present the proportion of 0.0 precision explanations that were returned by each algorithm in Table 14.
The proportions vary from around 0.5% to 28%. There are important consequences for methods that suffer this level of overfitting. The most important consequence is that 0.0 precision rules are so specific that they uniquely identify the explanandum but cover no other instance. A unique identifier does not provide any useful new information to explain the model’s classification. For the person requiring the explanation, this outcome represents a failure of the system. The lowest failure rates (0.5%) may be tolerable, depending on the criticality or compliance requirements of the application. However, we do not foresee any circumstances where a failure rate at the upper end of this range (28%) would ever be acceptable. Secondly, such overfitting is symptomatic of an algorithm that generates rules that are overly long, having too many terms in the antecedent to be easily interpretable. To show the link between overfitting and rule length we present the rule length distribution in Fig. 10.
We present the results of the hypothesis tests in Tables 15 and 16. Clearly, Anchors dominates out of the three algorithms on a statistical test of median differences. However, we have shown that these results should be taken with caution. To begin with, Anchors required us to discretise the data as a preprocessing step, which resulted in alternative models that were less accurate classifiers. The difference was two or more percentage points in 7/9 for SAMME models and 5/9 for SAMME.R models. Moreover, Anchors has a long tail distribution of rule length, and sometimes a high proportion of critically overfitting explanations. The tabulated means of precision do not show a clear difference between AdaWHIPS and Anchors (see Appendix). Furthermore, precision (specificity) is in a trade-off with coverage (generality). Rules that are too specific only apply to a small fraction of other instances. AdaWHIPS makes a very small trade-off (just a percentage point or two in most cases), and delivers much more generalisable rules that rarely, if ever, overfit. This behaviour is the result of optimising the novel stability function (Eq. 8).
Stability analysis
Stability can also be used as a quality measure in the XAI setting. A precision of 0.0 for an explanation on a held-out test set can be caused by sampling artefacts (i.e. the ground truth may be a non-zero probability of finding certain attributes and that they are simply under-represented in the data set). For this reason, it can be argued that a precision of 0.0 is a harsh penalty against the aggregate score. Yet, if the rule covers and is correct for just a single instance in the held-out set, the precision will be 1.0. This circumstance creates a discontinuity and gives a huge advantage to undesirable, overfitting explanations. Instead of precision, we can measure stability while including the explanandum in the held-out set. This condition results in the formulation \(\frac {n + 1}{m + K}\) where n is the number of covered and correct instances, m is the number of covered instances and K is the number of classes. See Eq. (8). Thus, stability is very similar to the classical additive smoothing function (precision with Laplace correction [65]). The minimum/maximum are both \(\frac {1}{1 + K}\) for N=1 but approach 0/1 asymptotically as N→∞. We present the visual analysis of stability in Figs. 11 and 12 and the results of the hypothesis tests in Tables 17 and 18. The post-hoc tests for the three-way comparisons show that AdaWHIPS is the top or in the top two with no statistical difference in all except mental health survey ’16 for the SAMME model. For the two-way comparisons, AdaWHIPS has a significantly higher rank for hospital readmission (SAMME) and thyroid (SAMME.R) but lower for the remaining results.
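A small numerical sketch may help to show how the smoothed formulation removes the discontinuity. It uses only the expression (n + 1)/(m + K) quoted above; the function names and the example counts are hypothetical, and precision is assumed to be 0.0 when no instance is covered.

```python
def precision(n_correct, n_covered):
    """Plain precision; taken as 0.0 when the rule covers nothing."""
    return n_correct / n_covered if n_covered else 0.0


def stability(n_correct, n_covered, n_classes):
    """Smoothed precision (n + 1) / (m + K)."""
    return (n_correct + 1) / (n_covered + n_classes)


K = 2  # hypothetical binary problem
# (covered and correct, covered) counts for four hypothetical explanations
examples = [(0, 0), (1, 1), (9, 10), (95, 100)]
for n, m in examples:
    print(n, m, precision(n, m), round(stability(n, m, K), 3))
```

A rule that covers one held-out instance correctly scores a precision of 1.0, the same as or better than a rule that is right for 95 of 100 covered instances, whereas its stability stays well below that of the high-coverage rule.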
Efficiency analysis
Finally, we show the distribution of computation time per explanation in Fig. 13. A brief visual inspection shows that AdaWHIPS and Anchors are roughly comparable for all data sets. The shortest runtimes are fractions of a second and the longest are two to three minutes. LORE runs at several orders of magnitude longer than this. As we discussed in previous sections, it was prohibitive to run LORE for the data sets mental health survey ’14, hospital readmission, thyroid and understanding society with a single explanation taking over two hours to generate. We performed both static and dynamic analysis of the LORE source code and discovered that the bottleneck was in a non-parallelisable, genetic-algorithmic step.
Discussion
Advantages of AdaWHIPS
Our method improves on prior research in that it delivers explanations that have high mean coverage (15% to 68%). AdaWHIPS explanations generalise well while making only a very small trade-off to keep precision/specificity competitive (80% to 99%). At the same time, AdaWHIPS is guarded against overfitting while competing methods have the tendency to present critically overfitting explanations, in 0.05% to 28% of cases. A critically overfitting explanation is defined as an explanation that uniquely identifies the explanandum and covers no other instances. AdaWHIPS does not make any assumptions about the underlying data distribution, while some competing methods require continuous features to be discretised prior to model training. This treatment of the data can result in a less accurate model, detracting from the main benefit of using AdaBoost at the outset. By design, AdaWHIPS rules extract discrete, logical conditions from the base decision tree classifiers of the AdaBoost model. These logical conditions have an information-theoretic derivation and we speculate that this is what leads to AdaWHIPS’s favourable trade-off between precision and coverage. AdaWHIPS is efficient. At its fastest, explanations are generated in fractions of seconds. On high dimensional data sets, we recorded times of up to three minutes per explanation. This is in line with competing methods and could still be considered real-time in the context of a medical consultation. As a minor contribution, we presented stability, a novel measure that is a regularised version of precision. It gives more informative results in the XAI setting as it penalises low coverage while correcting for sampling artefacts.
Limitations of AdaWHIPS
By design, AdaWHIPS is a companion method for AdaBoost models and the algorithm is not transferable to other models without adaptation. In contrast, model-agnostic methods, such as Anchors and LORE, can be applied to any black box model with few restrictions. It is up to the end user to determine which approach best suits their specific scenario. AdaWHIPS is an heuristic method for finding a short rule with high coverage and precision. Consequently, AdaWHIPS will not provide a feature attribution value for each attribute with theoretical guarantees. If such values with guarantees are required, then the combinatorial calculation of Shapley Values is the recommended method.
Challenges
Experimental studies of XAI are challenging in terms of their time cost. Each explanation must be generated individually and, for all currently well-cited methods, generation of explanations is a much more time consuming process than the classification step. Furthermore, each explanation must be evaluated individually, rather than batch-wise. For example, a trivial confusion matrix or AUC-ROC test is not appropriate. We calculated scores for each explanation and then used the means, medians and mean ranks to compare methods. Any experimental design for evaluating XAI must allow for this time cost, and also consider how instances used to generate explanations can be separated from instances used to evaluate explanations. Such designs may require three data partitions (training, explanation generating, explanation evaluating). We opted for a leave-one-out procedure, training a model on a training set then generating explanations one at a time and evaluating on the remaining instances from a held-out set.
Conclusion & future work
Our main contribution is the novel algorithm AdaWHIPS for explaining the classification of AdaBoost models with simple classification rules. AdaBoost models are widely adopted as computer aided diagnostic tools and for the non-clinical identification of sub-health and mental health conditions using unconventional data sources such as online health communities. As a minor contribution, we propose stability as a novel function for optimisation of explanation algorithms that explicitly avoids overfitting and can be used as a quality metric in evaluations of XAI experimental research.
Directions for future work include developing the method for Gradient Boosting Machines such as XGBoost that use decision trees as the base classifiers, and applying the proposed method on a variety of healthcare and medical data sets.
Appendix
Supplementary
Cohen’s κ
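Cohen’s κ is calculated as:
\[\kappa = \frac{N \sum_{i=1}^{K} N_{ii} - \sum_{i=1}^{K} N_{i+} N_{+i}}{N^{2} - \sum_{i=1}^{K} N_{i+} N_{+i}}\]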
where K is the number of classes, N is the total number of instances, N_{ij} is the number of instances in cell ij of the confusion matrix of true vs. predicted class counts, and N_{i+},N_{+j} are the i^{th} row and j^{th} column marginal totals, respectively.
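As a worked illustration of these quantities, the following sketch computes Cohen’s κ from a confusion matrix using the standard count-based form of the statistic. The function name and the example matrices are hypothetical, and the confusion matrix is assumed to be square with rows for true classes and columns for predicted classes.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a K x K confusion matrix of counts."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)  # total number of instances N
    agreed = sum(confusion[i][i] for i in range(k))  # diagonal counts N_ii
    row_totals = [sum(row) for row in confusion]  # N_i+
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # N_+j
    chance = sum(r * c for r, c in zip(row_totals, col_totals))
    # Undefined (division by zero) if every instance falls in one class
    # for both the true and the predicted labels.
    return (n * agreed - chance) / (n * n - chance)


accurate_model = [[45, 5], [10, 40]]  # hypothetical counts
chance_model = [[25, 25], [25, 25]]  # agreement no better than chance

print(cohens_kappa(accurate_model))
print(cohens_kappa(chance_model))
```

The second matrix has 50% raw accuracy but a κ of zero, which shows how the statistic discounts agreement that would be expected by chance alone.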
Friedman test
The original Friedman test produces an approximately χ^{2} distributed statistic, but this is known to be very conservative. Therefore, we use the modified F-test given in [64], because we have very large values for N, i.e. the count of instances in the test set. The null hypothesis of this test is that there is no significant difference between the mean ranks R of all the groups and the alternative is that at least two mean ranks are different. The null hypothesis is rejected when F_{F} exceeds the critical value for an F distributed random variable with the first degrees of freedom df_{1}=k−1 and the second df_{2}=(k−1)(N−1), where k is the number of algorithms:
\[\chi^{2}_{F} = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_{j}^{2} - \frac{k(k+1)^{2}}{4} \right], \qquad F_{F} = \frac{(N-1)\chi^{2}_{F}}{N(k-1) - \chi^{2}_{F}}\]
The recommended pairwise, post-hoc comparison test with the Bonferroni correction (for three pairwise comparisons) proposed in [64] is:
\[z = \frac{R_{i} - R_{j}}{\sqrt{\frac{k(k+1)}{6N}}}\]
where R_{i} and R_{j} are the mean ranks of two algorithms and z is distributed as a standard normal under the null hypothesis that the pair of ranks are not significantly different. The critical value for a two-tailed test with the Bonferroni correction is \(\frac {0.025}{3} = 0.00833\).
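The test procedure can be sketched with the Python standard library alone. The sketch assumes the widely used forms of these statistics, namely the Friedman χ² computed from mean ranks, its F-distributed correction with df_{1}=k−1 and df_{2}=(k−1)(N−1), and a z statistic with standard error \(\sqrt{k(k+1)/(6N)}\). The function names and the scores are hypothetical, and the one-tail probability of z is compared with 0.025/3 as an interpretation of the critical value stated above.

```python
import math


def mean_ranks(scores):
    """Mean rank per algorithm; rank 1 is the best score, ties share the average rank."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1  # extend the group of tied scores
            for position in range(i, j + 1):
                totals[order[position]] += (i + j) / 2 + 1
            i = j + 1
    return [total / len(scores) for total in totals]


def friedman_f(ranks, n):
    """Friedman chi-squared statistic and its F-distributed correction."""
    k = len(ranks)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat


def posthoc_z(rank_i, rank_j, k, n):
    """Pairwise post-hoc z statistic for two mean ranks."""
    return (rank_i - rank_j) / math.sqrt(k * (k + 1) / (6 * n))


def one_tail_p(z):
    """Upper tail probability of |z| under the standard normal."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))


# Hypothetical precision scores; columns are three algorithms, rows are instances.
scores = [
    [0.90, 0.80, 0.70], [0.90, 0.70, 0.80], [0.80, 0.90, 0.70],
    [0.90, 0.80, 0.70], [0.90, 0.90, 0.60], [0.95, 0.80, 0.70],
]
ranks = mean_ranks(scores)
chi2, f_stat = friedman_f(ranks, len(scores))
z = posthoc_z(ranks[0], ranks[2], k=3, n=len(scores))
significant = one_tail_p(z) < 0.025 / 3
print(ranks, chi2, f_stat, z, significant)
```

The critical value of F_{F} itself requires the F distribution, which is not in the standard library; a statistics package would be needed for that final lookup.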
Availability of data and materials
The source code and data sets analysed during the current study are available in our repository: https://tinyurl.com/yxuhfh4e.
Notes
Abbreviations
AdaWHIPS: Adaptive-weighted high importance path snippets
AFAM: Additive feature attribution methods
BRL: Bayesian rule list
CAD: Computer aided diagnostics
CBR: Case-based reasoning
CR: Classification rule
CRL: Cascading rule list
DT: Decision tree(s)
EHR: Electronic health record(s)
GAM: Generalised additive model(s)
GA^{2}M: Generalised additive model(s) with second order interactions
KL: Kullback-Leibler (divergence)
LIME: Local interpretable model-agnostic explanations
LORE: LOcal rule-based explanations
MDS: Multi-dimensional scaling
ML: Machine learning
PI: Partial independence (plots)
RNN: Recurrent neural network
SAMME: Stagewise additive modeling using a multi-class exponential loss function
SAMME.R: Real-valued SAMME
SHAP: SHapley additive exPlanations
SVM: Support vector machine(s)
TSH: Thyroid stimulating hormone
WkNN: Weighted k-nearest neighbours
XAI: eXplainable artificial intelligence
References
 1
ElSappagh S, Alonso JM, Ali F, Ali A, Jang JH, Kwak KS. An ontologybased interpretable fuzzy decision support system for diabetes diagnosis. IEEE Access. 2018; 6:37371–94.
 2
Mahdi MA, Al Janabi S. A Novel Software to Improve Healthcare Base on Predictive Analytics and Mobile Services for Cloud Data Centers. In: International Conference on Big Data and Networks Technologies. Leuven: Springer: 2019. p. 320–39.
 3
AlJanabi S, Patel A, Fatlawi H, Kalajdzic K, Al Shourbaji I. Empirical rapid and accurate prediction model for data mining tasks in cloud computing environments. In: International Congress on Technology, Communication and Knowledge (ICTCK). Mashhad: IEEE: 2014. p. 1–8.
 4
AlJanabi S, Mahdi MA. Evaluation prediction techniques to achievement an optimal biomedical analysis. Int J Grid Util Comput. 2019; 10(5):512–27.
 5
Wachter S, Mittelstadt B, Russell C. Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR. Harv J Law Technol. 2017; 31(2). https://doi.org/10.2139/ssrn.3063289.
 6
Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30day Readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining  KDD ’15. Sydney: ACM Press: 2015. p. 1721–30.
 7
Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine Learning and Data Mining Methods in Diabetes Research. Comput Struct Biotechnol J. 2017; 15:104–16.
 8
Jalali A, Pfeifer N. Interpretable per case weighted ensemble method for cancer associations. BMC Genomics. 2016; 17(1). https://doi.org/10.1186/s1286401626479.
 9
Yin Z, Sulieman LM, Malin BA. A systematic literature review of machine learning in online personal health data. J Am Med Informat Assoc. 2019; 26(6):561–76.
 10
Sun S, Zuo Z, Li GZ, Yang X. Subhealth state classification with AdaBoost learner. Int J Funct Informat Personalised Med. 2013; 4(2):167.
 11
Jovanovic M, Radovanovic S, Vukicevic M, Van Poucke S, Delibasic B. Building interpretable predictive models for pediatric hospital readmission using TreeLasso logistic regression. Artif Intell Med. 2016; 72:12–21.
 12
Turgeman L, May JH. A mixedensemble model for hospital readmission. Artif Intell Med. 2016; 72:72–82.
 13
Letham B, Rudin C, McCormick TH, Madigan D. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann Appl Stat. 2015; 9(3):1350–71.
 14
Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015; 13:8–17.
 15
Subianto M, Siebes A. Understanding Discrete Classifiers with a Case Study in Gene Prediction. Omaha: IEEE: 2007. p. 661–6.
 16
Huysmans J, Baesens B, Vanthienen J. Using Rule Extraction to Improve the Comprehensibility of Predictive Models. SSRN Electron J. 2006. Accessed 16 Nov 2018.
 17
Pazzani MJ, Mani S, Shankle WR. Acceptance of Rules Generated by Machine Learning among Medical Experts. Methods Inf Med. 2001; 40(05):380–5.
 18
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). 2018.
 19
Pande V. Artificial Intelligence’s ’Black Box’ Is Nothing to Fear. The New York Times. 2019. Accessed 14 Aug 2019.
 20
Pedreschi D, Giannotti F, Guidotti R, Monreale A, Pappalardo L, Ruggieri S, Turini F. Open the Black Box DataDriven Explanation of Black Box Decision Systems. 2018. arXiv:1806.09936 [cs].
 21
Ribeiro MT, Singh S, Guestrin C. Why Should I Trust You?: Explaining the Predictions of Any Classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery And Data Mining. San Francisco: ACM Press: 2016. p. 1135–44.
 22
Freund Y. An adaptive version of the boost by majority algorithm. In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory  COLT ’99. Santa Cruz: ACM Press: 1999. p. 102–13.
 23
Asgari S, Scalzo F, Kasprowicz M. Pattern Recognition in Medical Decision Support. BioMed Res Int. 2019; 2019:1–2.
 24
Rajendra Acharya U, Vidya KS, Ghista DN, Lim WJE, Molinari F, Sankaranarayanan M. Computeraided diagnosis of diabetic subjects by heart rate variability signals using discrete wavelet transform method. KnowlBased Syst. 2015; 81:56–64.
 25
Yoo I, Alafaireet P, Marinov M, PenaHernandez K, Gopidi R, Chang JF, Hua L. Data Mining in Healthcare and Biomedicine: A Survey of the Literature. J Med Syst. 2012; 36(4):2431–48.
 26
Dolejsi M, Kybic J, Tuma S, Polovincak M. Reducing false positive responses in lung nodule detector system by asymmetric adaboost. In: 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. Paris: IEEE: 2008. p. 656–9.
 27
Shakeel PM, Tolba A, Al-Makhadmeh Z, Jaber MM. Automatic detection of lung cancer from biomedical data set using discrete AdaBoost optimized ensemble learning generalized neural networks. Neural Comput Appl. 2019.
 28
Rangini M, Jiji DGW. Identification of Alzheimer’s Disease Using Adaboost Classifier. In: Proceedings of the International Conference on Applied Mathematics and Theoretical Computer Science: 2013. p. 229–34.
 29
Andrews R, Diederich J, Tickle AB. Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl-Based Syst. 1995; 8(6):373–89.
 30
Hara S, Hayashi K. Making Tree Ensembles Interpretable: A Bayesian Model Selection Approach. 2016. arXiv:1606.09066 [stat].
 31
Adnan MN, Islam MZ. ForEx++: A New Framework for Knowledge Discovery from Decision Forests. Australas J Inf Syst. 2017; 21.
 32
Mashayekhi M, Gras R. Rule Extraction from Random Forest: the RF+HC Methods. In: Advances in Artificial Intelligence 2015. Lecture notes in computer science Artificial intelligence, vol. 9091. Halifax: Springer: 2015. p. 223–37.
 33
Deng H. Interpreting tree ensembles with inTrees. Int J Data Sci Anal. 2014; 7(4):277–87.
 34
Friedman J, Popescu BE. Predictive Learning via Rule Ensembles. Ann Appl Stat. 2008; 2(3):916–54.
 35
Waitman LR, Fisher DH, King PH. Bootstrapping rule induction to achieve rule stability and reduction. J Intell Inf Syst. 2006; 27(1):49–77.
 36
Ribeiro MT, Singh S, Guestrin C. Anchors: High-Precision Model-Agnostic Explanations. In: AAAI. vol. 18. New Orleans: 2018. p. 1527–35.
 37
Lipton ZC. The mythos of model interpretability. 2016. arXiv preprint arXiv:1606.03490.
 38
Lundberg SM, Lee SI. A Unified Approach to Interpreting Model Predictions. Adv Neural Inf Process Syst. 2017; 30:4768–77.
 39
Guidotti R, Monreale A, Ruggieri S, Pedreschi D, Turini F, Giannotti F. Local Rule-Based Explanations of Black Box Decision Systems. 2018. arXiv:1805.10820.
 40
Michal F. "Please, explain." Interpretability of black-box machine learning models. 2019. https://tinyurl.com/y5qruqgf. Accessed 19 April 2019.
 41
Tan HF, Song K, Udell M, Sun Y, Zhang Y. Why should you trust my interpretation? Understanding uncertainty in LIME predictions. 2019. arXiv:1904.12991.
 42
Lundberg SM, Lee SI. Consistent feature attribution for tree ensembles. Sydney: 2017. arXiv:1706.06060 [cs, Stat].
 43
Adadi A, Berrada M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access. 2018; 6:52138–60.
 44
Saabas A. Interpreting Random Forests. 2014. http://blog.datadive.net/interpreting-random-forests/. Accessed 11 Oct 2017.
 45
Tjoa E, Guan C. A Survey on Explainable Artificial Intelligence (XAI): towards Medical XAI. 2019:21. arXiv preprint arXiv:1907.07374.
 46
Mencar C. Interpretability of Fuzzy Systems. In: Fuzzy Logic and Applications: 10th International Workshop. Genoa: Springer: 2013. p. 22–35.
 47
Lamy JB, Sekar B, Guezennec G, Bouaud J, Séroussi B. Explainable artificial intelligence for breast cancer: A visual case-based reasoning approach. Artif Intell Med. 2019; 94:42–53.
 48
Kwon BC, Choi MJ, Kim JT, Choi E, Kim YB, Kwon S, Sun J, Choo J. RetainVis: Visual Analytics with Interpretable and Interactive Recurrent Neural Networks on Electronic Medical Records. IEEE Trans Vis Comput Graph. 2018; 25(1):255–309.
 49
Kästner M, Hermann W, Villmann T. Integration of Structural Expert Knowledge about Classes for Classification Using the Fuzzy Supervised Neural Gas. Comput Intell. 2012.
 50
Appel R, Fuchs T, Dollár P, Perona P. Quickly Boosting Decision Trees – Pruning Underachieving Features Early. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13): 2013. p. 594–602.
 51
Friedman J, Hastie T, Tibshirani R. Additive Logistic Regression: A Statistical View of Boosting. Ann Stat. 2000; 28(2):337–407.
 52
Freund Y, Schapire RE. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J Comput Syst Sci. 1997; 55(1):119–39.
 53
Walker KW, Jiang Z. Application of adaptive boosting (AdaBoost) in demand-driven acquisition (DDA) prediction: A machine-learning approach. J Acad Librariansh. 2019; 45(3):203–12.
 54
Aravindh K, Moorthy S, Kumaresh R, Sekar K. A Novel Data Mining approach for Personal Health Assistance. Int J Pure Appl Math. 2018; 119(15):415–26.
 55
Jaree T, Guangdong X, Yanchun Z, Fuchun H. Breast cancer survivability via AdaBoost algorithms. In: Proceedings of the Second Australasian Workshop on Health Data and Knowledge Management, vol. 80. Wollongong: Australian Computer Society: 2008. p. 55–64.
 56
Hastie T, Rosset S, Zhu J, Zou H. Multiclass AdaBoost. Stat Interface. 2009; 2(3):349–60.
 57
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011; 12:2825–30.
 58
Freund Y, Schapire RE. A Short Introduction to Boosting. J Japan Soc Artif Intell. 1999; 14(5):771–80.
 59
Quinlan JR. Generating Production Rules From Decision Trees. In: Proceedings of the Tenth International Joint Conference on Artificial Intelligence. Milan, Italy, August 23–28, 1987. Morgan Kaufmann: 1987. p. 304–7. http://ijcai.org/proceedings/1987-1.
 60
Dhurandhar A, Chen PY, Luss R, Tu CC, Ting P, Shanmugam K, Das P. Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives. 2018. arXiv:1802.07623 [cs].
 61
Dheeru D, Karra Taniskidou E. UCI Machine Learning Repository. Irvine: University of California, Irvine, School of Information and Computer Sciences; 2017. https://archive.ics.uci.edu/ml/datasets/. Accessed 31 Mar 2019.
 62
Understanding Society: Waves 2–3 Nurse Health Assessment, 2010–2012 [data collection]. vol. 7251, 3rd edn: UK Data Service, University of Essex, Institute for Social and Economic Research and National Centre for Social Research; 2019.
 63
Davillas A, Benzeval M, Kumari M. Association of Adiposity and Mental Health Functioning across the Lifespan: Findings from Understanding Society (The UK Household Longitudinal Study). PLoS ONE. 2016;11(2). https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148561. Accessed 18 Aug 2019.
 64
Demsar J. Statistical Comparisons of Classifiers over Multiple Data Sets. J Mach Learn Res. 2006; 7:1–30.
 65
Clark P, Boswell R. Rule induction with CN2: some recent improvements. Mach Learn. 1991; 482:151–63.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
Original concept was by JH and MMG. JH was the major contributor in developing the software, designing and executing the experiments, analysing the data and writing the manuscript. JH, MMG and RMAA read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Hatwell, J., Gaber, M.M. & Atif Azad, R.M. AdaWHIPS: explaining AdaBoost classification with applications in the health sciences. BMC Med Inform Decis Mak 20, 250 (2020). https://doi.org/10.1186/s12911-020-01201-2
Keywords
 Explainable AI
 Computer aided diagnostics
 AdaBoost
 Black box problem
 Interpretability