Clinical decision making under uncertainty: a bootstrapped counterfactual inference approach
BMC Medical Informatics and Decision Making volume 24, Article number: 275 (2024)
Abstract
Background
Learning policies for decision-making, such as recommending treatments in clinical settings, is important for enhancing clinical decision-support systems. However, the challenge lies in accurately evaluating and optimizing these policies for maximum efficacy. This paper addresses this gap by focusing on two key aspects of policy learning: evaluation and optimization.
Method
We develop counterfactual policy learning algorithms for practical clinical applications to suggest viable treatment for patients. We first design a bootstrap method for counterfactual assessment and enhancement of policies, aiming to diminish uncertainty in clinical decisions. Building on this, we introduce an innovative adversarial learning algorithm, inspired by bootstrap principles, to further advance policy optimization.
Results
The efficacy of our algorithms was validated using both semi-synthetic and real-world clinical datasets. Our method outperforms baseline algorithms, reducing the variance in policy evaluation by 30% and the error rate by 25%. In policy optimization, it enhances the reward by 1% to 3%, highlighting the practical value of our approach in clinical decision-making.
Conclusion
This study demonstrates the effectiveness of combining bootstrap and adversarial learning techniques in policy learning for clinical decision support. It not only enhances the accuracy and reliability of policy evaluation and optimization but also paves avenues for leveraging advanced counterfactual machine learning in healthcare.
Introduction
Developing Clinical Decision Support Systems (CDSS) that include both diagnosing disease states of patients and formulating corresponding treatment plans is a crucial stride towards realizing the potential of personalized medicine (PM) [1, 2]. However, one of the most significant modeling challenges in applying machine learning to clinical decision-making is its inherent counterfactual nature. When considering treatment options for a patient at any given time point, several alternatives exist. Yet, we only witness the outcome of the treatment option selected by the clinician, leaving the potential effectiveness of other treatments unexplored. To understand the effectiveness of our treatment suggestions, we must compare the counterfactual outcome (what might have happened with a different treatment) with the observed factual outcome.
To solve the challenge of counterfactual inference in clinical decision-making, a traditional approach is randomized controlled trials (RCTs). In these trials, patients are randomly assigned to different treatment groups. By comparing the average outcomes of the treatment and control groups, we can derive a reliable estimation of the treatment effectiveness. However, RCTs have inherent limitations. To adequately capture patient diversity and ensure broad representativeness, these trials need to be conducted on a large scale. Even so, they often provide limited insights into the suitability of treatments for individual patients due to their generalized nature [3]. Most clinical treatment data is available as observational data (e.g., electronic health records (EHRs)) maintained by healthcare providers, hospitals, and insurance companies. In this context, treatments are determined by physicians based on their expertise and patient conditions. Most policy learning algorithms in reinforcement learning require real-time interaction with the environment (patients, in this case) for feedback and updates. Such interaction poses risks, as learning algorithms could make arbitrary decisions potentially detrimental to patient health. Directly deploying these algorithms in a clinical setting would be unethical. Therefore, we focus on learning decision policies offline using EHRs under counterfactual settings.
In clinical decision-making, collecting a sufficient amount of pretraining data presents significant challenges due to personal health data protection regulations and individual privacy concerns. A common strategy to address data scarcity is to aggregate smaller datasets from various data sources (e.g., hospitals). However, this approach leads to high heterogeneity in multicenter datasets owing to differences in clinical protocols and medical devices. This heterogeneity introduces uncertainty, resulting in considerable variability in patient-specific model predictions and decisions [4]. Additionally, even datasets from a single institution exhibit diversity in patient demographics. The challenge of uncertainty in clinical decision-making models is further compounded by the limited understanding of the clinician's intrinsic model for selecting treatments and their corresponding reward functions. Neural network-based predictive models, especially in scenarios with limited data, are known to be susceptible to model uncertainty. Policy learning methods relying on estimating clinicians' action propensity scores to derive optimal policies are also vulnerable to uncertainty. It is observed that models with similar performance levels can exhibit significant disparities in their predictions, particularly in data-scarce cases. Prior research has demonstrated that effectively capturing model uncertainty leads to reduced variance and enhanced exploration in policy learning [4, 5]. Given the critical need for lower risk and higher confidence in medical decision-making, there is an urgent need to address model uncertainty in the estimation of propensity scores and to integrate uncertainty into existing off-policy learning frameworks.
To solve these challenges, we advocate a counterfactual inference framework for learning clinical decision-making from observational data under uncertainty. We first formulate the clinical decision-making process in the framework of contextual bandits. At each round, we select a clinical action according to the policy and the patient state. By executing the clinical action, we collect an observation of the reward function. Thus, an observational dataset collected under a behavioral policy \(h_0\) (e.g., the decisions made by one physician) has the format \(\mathcal {D}=\{(x_i,a_i,r_i)\}_{i=1}^{n}\), consisting of patient state \(x_i\), clinical action \(a_i\), and the observed reward \(r_i\). R is the reward accumulated from the observed rewards at each step. The two tasks in policy learning given the observation dataset \(\mathcal {D}\) are: 1) Evaluation: For another policy h, what is the expected reward R(h) if it had been applied to the data? 2) Optimization: Is it possible to identify a policy h parametrized with parameters \(\theta\) that maximizes the reward on this dataset, i.e., \(h_\theta = \arg \max R(h_\theta )\)?
In this paper, we introduce a novel methodology for making informed treatment suggestions with the following contributions to the field:

Acknowledging the high uncertainty in medical data and the critical need for lower risk in medical decision-making, we propose a bootstrapping-based approach for learning decision-making policies. This method provides not only reward estimates but also confidence intervals, enabling physicians to select actions with reduced variance when necessary.

Our bootstrapping framework effectively addresses the model uncertainty typically associated with IPS-based estimators, leading to decreased variance in policy evaluation and enhanced policy optimization.

We address model uncertainty from the perspective of distributionally robust counterfactual risk minimization. Specifically, we introduce an adversarial IPS learner (IPS\(_{adv}\)), designed to maximize rewards under the worst-case propensity model within a defined uncertainty set.

We validate the effectiveness of our proposed frameworks (IPS\(_{inv}\), IPS\(_{avg}\), IPS\(_{adv}\)) in a clinical scenario involving the oral dosing of anticoagulants, heparin and warfarin. Our approaches not only facilitate better initial dosing policies but also achieve higher rewards. Moreover, we introduce the generation of semi-synthetic and real-world clinical bandit datasets to promote further research in this field.
Related work
Treatment recommendation
Effective treatment recommendations are an important component in building CDSS to enhance long-term patient benefits. These recommendations in clinical treatment broadly fall into two categories. Predictive Modeling: The first category concentrates on improving the accuracy of future patient outcome predictions, often known as prognosis prediction. This approach is particularly useful in areas like cancer treatment [6] and dermatology [7]. Methods in this category typically employ supervised learning on historical patient data to forecast disease progression, survival outcomes, and specific clinical events following a treatment. The ultimate goal is to suggest treatment options that are likely to yield the best outcomes for the patient. Reward-Based Policy Learning: The second category involves developing models that directly link observed clinical features to treatment actions, aiming to maximize overall rewards closely tied to patient health. This approach has been increasingly adopted in recent biomedical studies, using bandit and reinforcement learning algorithms to recommend adaptive treatment policies, particularly in chronic disease and critical care scenarios. Clinical applications include optimizing antiretroviral therapy in HIV patients [8], tailoring anti-epilepsy drugs [9], managing ventilation support in ICU settings [10], and determining optimal antibiotic dosing for sepsis [11]. Contrary to the predictive models of the first group, which align closely with clinician judgment, this category explores alternative optimal actions to create policies that enhance the likelihood of favorable clinical outcomes. This adds complexity as the policy output influences not only the patient's future health but also subsequent treatment plans. Our algorithm seeks to integrate and build upon the complementary aspects of these approaches, addressing the complexities inherent in clinical decision-making.
Counterfactual inference
Counterfactual inference includes two basic questions: counterfactual evaluation and counterfactual learning, also referred to as off-policy learning in the bandit and reinforcement learning literature. Off-policy evaluation aims to estimate the quality of an alternate target policy h by assessing its expected reward if applied to the dataset D:
\[ R(h) = \mathbb{E}_{x}\, \mathbb{E}_{a \sim h(\cdot \mid x)} \left[ r(x, a) \right] \]
Various statistical approaches have been developed to evaluate the quality of target policies based on historical data. There are primarily two classes of evaluation approaches: 1) the direct method (DM) based estimator, also known as regression adjustment, and 2) the importance sampling-based estimator. The direct method uses a regression approach to fit a parametric or non-parametric approximation to the true reward function as \(\hat{r}(x, a; \theta )\), and the reward of a new policy h is estimated as:
\[ \hat{R}^{DM}_h = \frac{1}{n} \sum_{i=1}^{n} \sum_{a} p_h(a \mid x_i)\, \hat{r}(x_i, a; \theta) \]
where \(p_h\), known as the propensity score, represents the probability of selecting action a under policy h, given the observed features x. While this approach is straightforward in design, it is susceptible to several biases. The first bias may arise from potential misspecifications in the reward function \(\hat{r}\) (e.g., linear vs. nonlinear models). The second bias emanates from the sampling distribution: the target policy might choose actions differently compared to the logging policy \(h_0\). If \(h_0\) is biased towards a specific region in the action space, the logged data will predominantly consist of samples from that region, leading to an imbalanced dataset and, subsequently, biased estimation in the reward function.
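As a minimal sketch of the direct method described above, the snippet below averages a fitted reward model over the target policy's action probabilities. The names `direct_method_value`, `reward_model`, and `target_probs`, as well as the toy reward table, are assumptions made for illustration and do not come from the paper.

```python
def direct_method_value(contexts, target_probs, reward_model):
    """Direct-method estimate: average over logged contexts of
    sum_a p_h(a|x) * r_hat(x, a)."""
    total = 0.0
    for x in contexts:
        probs = target_probs(x)  # p_h(.|x), one entry per action
        total += sum(p * reward_model(x, a) for a, p in enumerate(probs))
    return total / len(contexts)


# Toy illustration (all numbers are made up for this example).
contexts = [0, 1, 1, 0]  # two hypothetical patient "types"
r_hat = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.2, (1, 1): 0.8}


def reward_model(x, a):
    # Stand-in for a fitted regression model r_hat(x, a; theta).
    return r_hat[(x, a)]


def target_probs(x):
    # Hypothetical target policy: prefers action 0 for x=0, action 1 for x=1.
    return [0.9, 0.1] if x == 0 else [0.1, 0.9]


dm_value = direct_method_value(contexts, target_probs, reward_model)
print(round(dm_value, 4))
```

Because the estimate depends entirely on `reward_model`, any misspecification of the reward function carries over directly into the policy value, which is the first bias discussed above.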
A common approach to correct for the mismatch in the action distributions under h and \(h_0\) is importance weights, defined as \(w(x, a)= \frac{p_h(a \mid x)}{p_{h_0}(a \mid x)}\), where \(p_h\) and \(p_{h_0}\) are the probabilities of selecting the action a given the observed features, under policies h and \(h_0\) respectively. Importance sampling-based estimators are built on importance weighting, with a widely popular estimator being the inverse propensity scoring (IPS) estimator [12]:
\[ \hat{R}^{IPS}_h = \frac{1}{n} \sum_{i=1}^{n} r_i\, \frac{p_h(a_i \mid x_i)}{p_{h_0}(a_i \mid x_i)} \]
From the formulation, it can be noted that the IPS estimator is an unbiased estimator of R, i.e., \(\mathbb {E}\left[\hat{R}^{IPS}_h\right] = R_h\), which makes it well-suited to policy optimization. However, the estimator suffers from high variance in reward estimation, especially when \(p_h(a \mid x) \gg p_{h_0}(a \mid x)\). For consistent estimation, it is standard to assume that whenever \(p_h > 0\), then \(p_{h_0} > 0\) also, and we assume this throughout our analysis. To reduce the variance of IPS, several techniques have been proposed in the bandit literature. A line of work focuses on regularizing the variance of IPS [13, 14, 15] with the POEM estimator being widely used. Another straightforward approach is capping propensity weights [16, 17], which leads to the estimator
\[ \hat{R}^{M}_h = \frac{1}{n} \sum_{i=1}^{n} r_i \min \left\{ M, \frac{p_h(a_i \mid x_i)}{p_{h_0}(a_i \mid x_i)} \right\} \]
Smaller values of M reduce the variance of \(\hat{R}_h\) but introduce bias. Given that the IPS estimator is not equivariant [18], thresholding propensity weights exacerbates this effect. Moreover, the IPS estimator is prone to overfitting propensity weights, i.e., for positive reward, policies that avoid actions in the dataset D are selected; for negative reward, policies that over-represent actions in D are selected. Hence, Swaminathan and Joachims [18] proposed the self-normalized inverse propensity scoring (SNIPS) estimator, which uses weight normalization to counter the propensity overfitting problem of IPS:
\[ \hat{R}^{SNIPS}_h = \frac{\sum_{i=1}^{n} r_i\, w(x_i, a_i)}{\sum_{i=1}^{n} w(x_i, a_i)} \]
SNIPS has a lower variance than the vanilla IPS estimator because of its ability to normalize and bound the propensity weights between 0 and 1. Additionally, another line of work focuses on reducing both the bias and variance of off-policy estimators by combining the direct method and IPS-based methods in a linear fashion, leading to the doubly-robust estimator [19].
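The three importance-sampling estimators discussed above (vanilla IPS, IPS with capped weights, and SNIPS) can be sketched in a few lines. This is an illustrative sketch on made-up numbers; the function names and the toy propensities are assumptions, and the logging propensities are taken as known here, which is not the case in clinical data.

```python
def importance_weights(p_target, p_logging):
    """w_i = p_h(a_i|x_i) / p_h0(a_i|x_i) for each logged sample."""
    return [pt / pl for pt, pl in zip(p_target, p_logging)]


def ips(rewards, weights):
    """Vanilla inverse propensity scoring estimate."""
    return sum(r * w for r, w in zip(rewards, weights)) / len(rewards)


def capped_ips(rewards, weights, cap):
    """IPS with weights clipped at `cap` (the threshold M in the text)."""
    return sum(r * min(cap, w) for r, w in zip(rewards, weights)) / len(rewards)


def snips(rewards, weights):
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    return sum(r * w for r, w in zip(rewards, weights)) / sum(weights)


# Toy logged data (made-up numbers): the first sample has a large weight
# because the target policy favors an action the logging policy rarely took.
rewards = [1.0, 0.0, 1.0, 1.0]
p_target = [0.9, 0.5, 0.2, 0.8]
p_logging = [0.1, 0.5, 0.4, 0.8]

w = importance_weights(p_target, p_logging)
estimates = {
    "ips": ips(rewards, w),
    "capped_ips": capped_ips(rewards, w, cap=2.0),
    "snips": snips(rewards, w),
}
print(estimates)
```

On this toy data the single large weight pushes the vanilla IPS estimate above 1 even though every reward lies in [0, 1], whereas the capped and self-normalized estimates stay inside that range. This is the variance problem, and the two remedies, described in the text.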
In clinical settings, the behavior policy is typically unknown. Since IPS-based approaches require the behavior policy's propensity score \(p_{h_0}\), we need to impute these scores using a behavior propensity model. The model must accurately represent the clinician's treatment action probability distribution. If the behavior policy is misestimated, IPS-based estimators suffer from significant bias and variance. Given that we do not know the parametric class of behavior policy, we can leverage universal function approximators such as neural networks to estimate the propensity scores. Neural networks often lead to a reduced approximation error with an increasing number of layers and neurons and have been shown to work well in off-policy bandit scenarios [18, 20]. However, learning a highly accurate model for imputing behavior policy is not enough; our model should provide well-calibrated probability estimates representing true probabilities. Using over-parameterized approximators such as neural networks, which are capable of expressing a wide range of functions, along with the limited size and heterogeneity of clinical datasets, leads to model uncertainty, i.e., uncertainty regarding the true underlying parameters. Multiple neural networks can achieve similar accuracy. However, the probability estimates can widely differ, and every model might not be able to capture the true conditional probability of the clinician's actions.
Therefore, the question we ask here is: How can we confidently estimate the propensity score in the presence of model uncertainty due to the limited scale and heterogeneity of clinical data? Our algorithms work as a meta-framework that can combine existing algorithms.
Preliminaries
Problem definition
In this study, we focus on two problems (Table 1) in CDSS. For a patient with observed features x, (1) a predictive model f predicts the outcome of interest \(\hat{y}\) (e.g., diagnosis) that aligns closely with a human physician's judgment y; (2) decision-making learns a policy h(a|x) that suggests the action a to take based on the observed features x, where the objective of the policy is to maximize a specific reward function r(x, a), such as the clinical risk score.
Contextual bandit-based clinical decision making
At each round t, we observe a patient with feature representation \(x_t\); we choose a clinical action \(a_t\) based on some policy h; we observe the outcome/reward \(r_t\) of the action we choose, but not for other unchosen actions \(a' \ne a_t\). Here \(r_t\) may be dependent on \(x_t\) and \(a_t\), and we may write generically as \(r_t=r_t (x_t,a_t)\), for some reward function \(r_t\). The goal of policy learning for decision-making is to find the optimal policy h that obtains the maximum cumulative reward when applied, i.e., \(h^* \sim \arg \max _{h} R = \mathbb {E}_{x_t, a_t \sim h} [r_t]\).
Uncertainty of predictive models
There are two types of uncertainty in machine learning and deep learning models: data uncertainty and model uncertainty [21]. Consider a binary classification setting in which we have \(y \sim \text {Bernoulli}(\lambda )\), where y is the binary classification target, and \(\lambda (\cdot \mid x;\theta )\) is the logit representing the conditional distribution \(p(y \mid x; \theta )\) with feature x and parameters \(\theta\). In data uncertainty, the logit \(\lambda\) is a deterministic function of x and \(\theta\), i.e., \(\lambda = g(x, \theta )\), and the uncertainty in data is reflected in the feature x. This uncertainty might be due to inherent noise in the process that generated the data or unaccounted factors that created variability in the targets. This is often referred to as irreducible or aleatoric uncertainty.
On the other hand, model or epistemic uncertainty refers to the uncertainty in the values of the parameters \(\theta\) for modeling the prediction, i.e., we are unable to properly constrain our model's parameters. More specifically, we can model \(\lambda\) as a distribution over plausible values instead of a point estimate, as \(\lambda \sim \mathcal {P}(\lambda \mid x, w)\), and are unsure which distributions better explain the data. This could be due to the use of a complex model relative to the amount of training data. Additionally, our choice of model structure might be wrong and unable to reflect the process that generated the data (here, the clinician). Model uncertainty can be reduced by observing more data; however, typical clinical datasets for bandit learning have limited size (\(\le\) 5,000 patients). Our focus here is on tackling model uncertainty caused by uncertainty in the parameters. We explore two popular approaches to quantify model uncertainty in the clinical setting: Model Ensembling and Bayesian Neural Networks.
Model ensembling
Deep ensembles, proposed by Lakshminarayanan et al. [22], are a simple yet powerful method for characterizing model uncertainty. The method has been shown to yield high-quality predictive uncertainty estimates, requires little hyperparameter tuning, and is readily parallelizable. Ensembles tackle uncertainty by collecting predictions from M independently trained deterministic models (ensemble components). We train an ensemble of neural networks (NNs) (\(NN_1, ..., NN_M\)) by varying the random seed in our training process. The seed affects the initialization of the neural network's weights and the order of mini-batch samples seen by the neural network during training. At test time, for a given patient, we output the ensembled action prediction as \(p(a \mid x) = \frac{1}{M} \sum \limits _{m=1}^M p_{NN_m}(a \mid x)\). In addition, the collection of prediction values \(p_{NN_i}(x);\ i \in \{1,2, ..., M\}\) can be seen as samples from the distribution \(p(\lambda \mid x, \theta )\) describing the model uncertainty.
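The ensembling step above amounts to averaging the members' action probabilities and reading their spread as a rough signal of model uncertainty. The sketch below assumes the per-member probabilities are already available (the three probability vectors are made up for illustration); training the networks themselves is out of scope here, and the function name `ensemble_propensity` is hypothetical.

```python
from statistics import mean, pstdev


def ensemble_propensity(member_probs):
    """Average action probabilities over ensemble members for one patient.

    member_probs[m][a] plays the role of p_{NN_m}(a|x). Returns the ensembled
    probabilities and the per-action standard deviation across members.
    """
    n_actions = len(member_probs[0])
    avg = [mean(p[a] for p in member_probs) for a in range(n_actions)]
    spread = [pstdev([p[a] for p in member_probs]) for a in range(n_actions)]
    return avg, spread


# Made-up softmax outputs of M = 3 networks trained with different seeds.
member_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
]
avg, spread = ensemble_propensity(member_probs)
print(avg, spread)
```

The averaged vector is still a valid probability distribution because each member's output sums to one; the spread is what a single deterministic network cannot provide.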
Bayesian neural networks
Bayesian inference is a principled approach to modeling the distribution over possible outcomes and estimating the uncertainty in the prediction of a machine learning model. Bayesian Neural Networks (BNNs) are neural networks whose parameters \(\theta\) are represented by probability distributions, so the uncertainty of weights characterizes the uncertainty of models. Given a dataset D = \(\{(x_i, y_i)\}_{i=1}^{N}\), a BNN is defined in terms of a prior p(w) on the weights and the data likelihood p(D|w). By sampling from the posterior weight distributions, a BNN could train an infinite number of different realizations of the NNs, and these realizations capture the model uncertainty in the predictive distribution \(p(\lambda \mid x, \theta )\). However, training BNNs is much more challenging since we need to compute the posterior distribution. Various approximate inference methods have been proposed to efficiently train BNNs, such as MC-Dropout [23], Variational Inference [5], and the Noisy Natural Gradient method [24]. Bayesian approaches to uncertainty estimation have been proposed to assess the reliability of clinical predictions [4] but have been applied to very few real-world policy learning settings using clinical data.
Variational Inference
Variational approximation methods aim to estimate the weight posterior by maximizing the evidence lower bound (ELBO) to fit an approximate posterior \(q(w \mid \theta )\), given data D. Variational inference is formulated as an optimization problem of minimizing the Kullback-Leibler (KL) divergence between the approximate posterior \(q(w \mid \theta )\) and the exact posterior \(p(w \mid D)\). The loss function embodies a trade-off between data-dependent likelihood cost and prior-dependent complexity cost as follows:
\[ \mathcal{F}(D, \theta) = \mathrm{KL}\left[ q(w \mid \theta)\, \|\, p(w) \right] - \mathbb{E}_{q(w \mid \theta)} \left[ \log p(D \mid w) \right] \]
where p(w) is the prior distribution on weights, which enforces simplicity. The most common approach to learn an approximate posterior over the weights \(q_{\theta }(w)\) given the prior is mean-field variational inference wherein we assume a fully factorized Gaussian prior and posterior, \(q(w) = \prod _{i=1}^{m}q_i(w_i)\). This reduces the computational complexity of estimating ELBO. To reduce the time complexity of computing KL-divergence during a forward pass through the network, we leverage Monte Carlo estimates. Blundell et al. [5] proposed Bayes-by-Backprop by applying the reparametrization trick from Kingma et al. [25] to variational inference and reduced the computational complexity involved in calculating the data likelihood expectation \(E_q[\log p(D \mid w)]\) over \(q(w \mid D)\). They estimate the variational inference loss function by sampling weights from the posterior \(q(w \mid D)\):
\[ \mathcal{F}(D, \theta) \approx \sum_{i=1}^{n} \left[ \log q(w^i \mid \theta) - \log p(w^i) - \log p(D \mid w^i) \right] \]
where \(w^i\) are the sampled weights. To enable training by backpropagation, they choose a Gaussian variational posterior on weights given as \(q(w \mid \theta ) = \prod _{i=1}^n \mathcal {N}(w^i \mid \mu ,\sigma ^2)\). To perform inference using BNNs, Monte Carlo sampling is performed from the weight distribution. Multiple networks are sampled from the variational posterior q, and their predictions are averaged to compute the network output. In BNNs learned using variational inference, typically, both the mean and variance of weights are learnable.
MC-Dropout
Gal et al. [23] showed that optimizing a standard neural network with dropout and \(L_2\) regularization is equivalent to a form of variational inference under a probabilistic interpretation. MC-Dropout is quite popular due to the simplicity of the idea: by enabling dropout during testing and applying different dropout masks, multiple networks can be sampled to predict the output and related uncertainty. This contrasts with performing inference using a deterministic neural network wherein the dropout approximation is fixed at test time. However, in practical applications, MC-Dropout faces challenges, such as the choice of dropout probability and \(L_2\) regularization strength, and the position at which to insert the dropout layers.
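The test-time sampling idea behind MC-Dropout can be sketched with a tiny one-hidden-layer network in plain Python. The weights, the input, the dropout rate, and all function names below are illustrative assumptions; a real propensity model would be trained with dropout first, which this sketch skips.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(x, W1, W2, drop_p, rng):
    """One stochastic forward pass with dropout applied to the hidden layer."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]  # ReLU
    if drop_p > 0.0:
        # Inverted dropout, kept active at test time for MC-Dropout.
        hidden = [0.0 if rng.random() < drop_p else v / (1.0 - drop_p) for v in hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return softmax(logits)


def mc_dropout_predict(x, W1, W2, drop_p=0.5, n_samples=200, seed=0):
    """Mean and variance of action probabilities over sampled dropout masks."""
    rng = random.Random(seed)
    samples = [forward(x, W1, W2, drop_p, rng) for _ in range(n_samples)]
    n_actions = len(samples[0])
    mean_p = [sum(s[a] for s in samples) / n_samples for a in range(n_actions)]
    var_p = [
        sum((s[a] - mean_p[a]) ** 2 for s in samples) / n_samples
        for a in range(n_actions)
    ]
    return mean_p, var_p


# Made-up weights: 3 input features, 4 hidden units, 3 actions.
W1 = [[0.5, 0.1, 0.0], [-0.3, 0.8, 0.2], [0.2, 0.2, 0.9], [0.7, -0.4, 0.1]]
W2 = [[1.0, -0.5, 0.3, 0.2], [-0.2, 0.9, 0.1, -0.6], [0.1, 0.1, -0.8, 0.5]]
x = [1.0, 0.5, -0.2]

mean_p, var_p = mc_dropout_predict(x, W1, W2)
det_p, det_var = mc_dropout_predict(x, W1, W2, drop_p=0.0)
print(mean_p, var_p)
```

With `drop_p=0.0` every pass is identical, so the variance collapses to zero; this is the deterministic inference the text contrasts MC-Dropout with.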
Bootstrapped counterfactual evaluation
In this section, we first introduce the off-policy evaluation problem and then present our framework for bootstrap-based evaluation. For each patient with feature x, a policy \(h_0\) (e.g. a physician) recommended a treatment a, and a reward \(r=r(x, a)\) was observed. The collection of these triplets \(\{ (x_i, a_i, r_i) \}_{i=1, ..., N}\) forms an offline dataset \(\mathcal {D}\). The goal of offline evaluation is defined as follows:
Definition 1
Given an offline dataset collected from a logging policy \(h_0\), for a new policy h, we aim to estimate its expected reward R(h) using the offline dataset \(\mathcal {D}\) only.
This problem is important as it represents a majority of scenarios arising during the evaluation of a clinical decision support system. Suppose we use our machine learning algorithm to build a new treatment policy h using EHRs obtained from a hospital; how do we ensure this policy is advantageous before deploying it in the clinic? Traditionally, to get an unbiased estimate of R(h), we can use the inverse propensity scoring estimator as follows:
\[ \hat{R}(h) = \frac{1}{N} \sum_{i=1}^{N} r_i\, \frac{p_h(a_i \mid x_i)}{p_{h_0}(a_i \mid x_i)} \]
where \(p_h\) is the propensity score of the policy h.
In the clinical setting, however, \(p_{h_0}(a \mid x)\) is not available, as physicians do not record the exact probability of choosing a treatment. Modeling \(h_0\) via supervised learning using a maximum likelihood-based approach is possible but introduces additional model uncertainty: there can be multiple versions of \(h_0\) that share equal probabilities and evaluate the same on a finite training set of N data points, yet behave differently on other data points (test set). To see this, imagine our policy is only a polynomial of degree '\(N+1\)'; with N data points x, we can fit an infinite number of functions f(x, w) attaining zero error and satisfying the learning objective, thus giving a diverse range of model parameters w. The distribution over the model parameters \(w \sim p(w)\) induces uncertainty in the learned function, characterized by \(\hat{h_0} \sim U(f_w)\), subsequently leading to variance in the marginalized predictive probability distribution \(p_{h_0}(a \mid x)\). For more complex functions such as neural networks, the set of potential solutions for \(h_0\) is even larger.
Thus, we propose to reduce such model uncertainty in IPS-based estimators using a bootstrapping-based approach. By bootstrapping over multiple resamples of the dataset D and using model ensembling, we can reduce the uncertainty from learning \(h_0\) and obtain a better estimate of the policy reward. In addition, we also obtain a confidence interval for the overall performance of the new policy h. When we have multiple policies, we can choose not only based on the mean rewards but also the tightness of the reward confidence interval as a criterion for the stability of the policy. We present our bootstrapped policy evaluation framework in Algorithm 1.
We explore deterministic NN ensembles and probabilistic BNN-based approaches to tackle model uncertainty. For simplicity, we discretize the clinician actions a and formulate the propensity score imputation problem as a multiclass classification problem. We train a classifier on \((x_i,a_i) \in D\) and derive the propensity scores from softmax-layer probability scores.
After bootstrapping B networks, we propose our counterfactual estimators based on the propensity score estimates obtained from those models:
\[ \hat{R}^{IPS_{avg}}(h) = \frac{1}{N} \sum_{i=1}^{N} r_i\, \frac{p_h(a_i \mid x_i)}{\frac{1}{B} \sum_{b=1}^{B} p^b_{h_0}(a_i \mid x_i)} \]
\[ \hat{R}^{IPS_{inv}}(h) = \frac{1}{N} \sum_{i=1}^{N} r_i\, p_h(a_i \mid x_i)\, \frac{1}{B} \sum_{b=1}^{B} \frac{1}{p^b_{h_0}(a_i \mid x_i)} \]
where \(p^b_{h_0}\) is the propensity score derived from \(b^{th}\) bootstrapped model.
The simplest approach is to average the propensity scores from bootstrapped models to reduce the variance of \(p_{h_0}\), leading to IPS\(_{avg}\). The inverse estimator, IPS\(_{inv}\), computes a harmonic mean of propensity scores and is equivalent to averaging the estimated rewards \(\hat{R}(h)^b\) from each bootstrapped model. The average estimator can also be seen as a special case of multiple importance sampling and is equivalent to the Balance Heuristic estimator (Veach et al. [26]) when \(N=KN_k\):
By bootstrapping over multiple resamples of the dataset D, we can reduce the uncertainty from learning \(h_b\) and obtain a better estimate of the policy reward. In addition, we also obtain a confidence interval for the overall performance of new policy h. In this way, when we have multiple policies, we can choose based not only on the average reward but also on the tightness of the confidence intervals as an evaluation of the stability of the policies.
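The bootstrapped evaluation idea can be sketched end to end on toy data. This is an illustrative sketch under stated assumptions, not the paper's Algorithm 1: the propensity model is a smoothed frequency table standing in for the neural-network classifier, contexts are discrete, the confidence interval is a simple percentile interval over per-model IPS estimates, and all names and numbers are hypothetical.

```python
import random


def fit_propensity(data, n_actions, alpha=1.0):
    """Tabular propensity model with Laplace smoothing (a stand-in for the
    neural-network classifier described in the text)."""
    counts = {}
    for x, a, _ in data:
        counts.setdefault(x, [0] * n_actions)[a] += 1

    def prob(a, x):
        c = counts.get(x, [0] * n_actions)
        return (c[a] + alpha) / (sum(c) + alpha * n_actions)

    return prob


def bootstrap_ips(data, target_prob, n_actions, n_boot=50, seed=0):
    rng = random.Random(seed)
    n = len(data)
    # Fit one propensity model per bootstrap resample of the logged data.
    models = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        models.append(fit_propensity(resample, n_actions))
    # IPS estimate under each bootstrapped propensity model.
    per_model = [
        sum(r * target_prob(a, x) / m(a, x) for x, a, r in data) / n
        for m in models
    ]
    # IPS_inv: average of the per-model estimates (mean of inverse propensities).
    ips_inv = sum(per_model) / n_boot
    # IPS_avg: a single IPS estimate using the averaged propensity.
    ips_avg = sum(
        r * target_prob(a, x) / (sum(m(a, x) for m in models) / n_boot)
        for x, a, r in data
    ) / n
    ranked = sorted(per_model)
    lo = ranked[int(0.025 * (n_boot - 1))]
    hi = ranked[int(0.975 * (n_boot - 1))]
    return ips_avg, ips_inv, (lo, hi)


# Toy logged data: in each context x the logger picks a == x 60% of the time,
# and the reward is 1 exactly when a == x.
data = []
for x in (0, 1):
    data += [(x, x, 1.0)] * 60 + [(x, 1 - x, 0.0)] * 40


def target_prob(a, x):
    # Hypothetical target policy: picks a == x with probability 0.8.
    return 0.8 if a == x else 0.2


ips_avg, ips_inv, (lo, hi) = bootstrap_ips(data, target_prob, n_actions=2)
print(ips_avg, ips_inv, (lo, hi))
```

For non-negative rewards, the averaged-propensity estimate can never exceed the averaged-reward estimate, since the reciprocal of a mean is at most the mean of the reciprocals. The width of `(lo, hi)` gives the stability criterion mentioned above for choosing between policies.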
Bootstrapped counterfactual learning
The goal of counterfactual learning is defined as follows:
Definition 2
Given an offline dataset collected from a logging policy \(h_0\), we aim to find \(h^*\) using the offline dataset \(\mathcal {D}\) only, such that its expected reward R(h) is maximized, i.e.
\[ h^* = \arg\max_{h} R(h) \]
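In the previous section, we have developed how to evaluate any h for its expected reward R(h). Intuitively, if we have a finite selection of h, by an exhaustive evaluation of all h, we can find the optimum \(h^*\). For an infinite space of h, we can apply gradient-based optimization; for a policy \(h_\theta\) with parameters \(\theta\), we can evaluate the gradient of the IPS estimate as
\[ \nabla_\theta \hat{R}(h_\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{r_i}{p_{h_0}(a_i \mid x_i)}\, \nabla_\theta\, p_{h_\theta}(a_i \mid x_i) \]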
When we parametrize h with deep neural networks, we can automatically compute its gradient via backpropagation techniques to apply gradient-based optimization. Based on our bootstrapped evaluation algorithm, we design a bootstrapped learning algorithm as follows:
The benefit of this algorithm is that by adding bootstrapping, we reduce the variance of learning \(h_0\) to improve the performance and stability of the learned h. In addition, we can also populate the confidence intervals.
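A minimal sketch of bootstrapped policy learning is given below: propensities are averaged over bootstrapped models and a softmax policy is trained by gradient ascent on the resulting IPS objective. Several simplifications are assumptions of this sketch rather than details from the paper: the policy is a per-context table of logits instead of a neural network, the propensity model is a smoothed frequency table, and the learning rate, epoch count, and toy data are arbitrary.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def fit_propensity(data, n_actions, alpha=1.0):
    """Smoothed frequency table standing in for a propensity classifier."""
    counts = {}
    for x, a, _ in data:
        counts.setdefault(x, [0] * n_actions)[a] += 1

    def prob(a, x):
        c = counts.get(x, [0] * n_actions)
        return (c[a] + alpha) / (sum(c) + alpha * n_actions)

    return prob


def averaged_propensity(data, n_actions, n_boot, rng):
    """Average p^b_{h0}(a_i|x_i) over models fit on bootstrap resamples."""
    n = len(data)
    models = [
        fit_propensity([data[rng.randrange(n)] for _ in range(n)], n_actions)
        for _ in range(n_boot)
    ]
    return [sum(m(a, x) for m in models) / n_boot for x, a, _ in data]


def learn_policy(data, n_contexts, n_actions, n_boot=20, lr=0.5, epochs=200, seed=0):
    rng = random.Random(seed)
    p0 = averaged_propensity(data, n_actions, n_boot, rng)
    n = len(data)
    theta = [[0.0] * n_actions for _ in range(n_contexts)]  # policy logits
    for _ in range(epochs):
        probs = [softmax(t) for t in theta]
        grad = [[0.0] * n_actions for _ in range(n_contexts)]
        for (x, a, r), p in zip(data, p0):
            for k in range(n_actions):
                # d p_theta(a|x) / d theta[x][k] for a softmax policy
                dp = probs[x][a] * ((1.0 if k == a else 0.0) - probs[x][k])
                grad[x][k] += r * dp / (p * n)
        for x in range(n_contexts):
            for k in range(n_actions):
                theta[x][k] += lr * grad[x][k]  # gradient ascent on the IPS objective
    return theta


# Toy logged data: reward is 1 exactly when the action matches the context.
data = []
for x in (0, 1):
    data += [(x, x, 1.0)] * 60 + [(x, 1 - x, 0.0)] * 40

theta = learn_policy(data, n_contexts=2, n_actions=2)
learned = [softmax(t) for t in theta]
print(learned)
```

On this toy problem the learned policy concentrates on the rewarded action in each context. Repeating the training with different seeds would give a spread of learned policies from which intervals could be formed.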
Adversarial bandit learner
In decision theory, robust decision-making based on Wald's maximin paradigm [27] suggests acting pessimistically: the optimal decision is one with the least bad worst outcome. Since multiple propensity scoring models are likely and bootstrapping optimizes the learned policy against an ensemble, the empirical reward \(\hat{R}[h_w]\) cannot be used as a performance certificate for the optimal true reward. This is because we are not explicitly tackling the uncertainty due to the worst-case propensity model \(h^{worst}_0\). To address this limitation, we propose an adversarial learning-based framework, treating the propensity model parameter \(P(\theta )\) distribution with skepticism and optimizing the worst-case reward objective with respect to pessimistic model parameters. Instead of selecting the worst-case model from the ensemble, we assume that the model parameter distribution belongs to an uncertainty set \(U_{\epsilon }(P)\), which is already constrained by the cross-entropy loss by virtue of behavior policy imputation (the goal of \(h_0\) is to model the clinician's actions accurately). Hence, we can derive an adversarial robust counterfactual learning objective as follows:
where CE is the standard multiclass cross-entropy loss, \(\lambda\) is a hyperparameter defining the trade-off between accurate behavior policy imputation (2\(^{nd}\) term) vs reward maximization (1\(^{st}\) term). Consequently, we propose an adversarial policy learning framework (IPS\(_{adv}\)) (Algorithm 3) as an iterative two-player optimization scheme. h is optimized to maximize reward against the worst-case possible \(h_0\), which acts in an adversarial manner to h: the goal of \(h_0\) is to learn a classification model to impute the clinician's action probabilities accurately and, at the same time, reduce the reward achieved by the learned policy h.
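The two-player scheme can be illustrated with a small gradient descent-ascent loop. The sketch assumes one plausible reading of the objective, namely that h maximizes the IPS reward while \(h_0\) minimizes that same reward plus \(\lambda\) times its cross-entropy on the logged actions. Both players are per-context logit tables, updates are simultaneous, and the names, step sizes, and data are hypothetical; it is not the paper's Algorithm 3.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def adversarial_ips(data, n_contexts, n_actions, lam=1.0, lr=0.1, rounds=300):
    n = len(data)
    theta = [[0.0] * n_actions for _ in range(n_contexts)]  # logits of policy h
    phi = [[0.0] * n_actions for _ in range(n_contexts)]    # logits of propensity model h_0
    for _ in range(rounds):
        ph = [softmax(t) for t in theta]
        p0 = [softmax(f) for f in phi]
        g_theta = [[0.0] * n_actions for _ in range(n_contexts)]
        g_phi = [[0.0] * n_actions for _ in range(n_contexts)]
        for x, a, r in data:
            w = ph[x][a] / p0[x][a]  # importance weight under the current h_0
            for k in range(n_actions):
                ind = 1.0 if k == a else 0.0
                # Gradient of r * p_h(a|x) / p_h0(a|x) w.r.t. theta[x][k].
                g_theta[x][k] += r * w * (ind - ph[x][k]) / n
                # Gradient of r * p_h / p_h0 + lam * (-log p_h0(a|x)) w.r.t. phi[x][k].
                g_phi[x][k] += -(r * w + lam) * (ind - p0[x][k]) / n
        for x in range(n_contexts):
            for k in range(n_actions):
                theta[x][k] += lr * g_theta[x][k]  # h ascends the reward
                phi[x][k] -= lr * g_phi[x][k]      # h_0 descends reward + lam * CE
    return theta, phi


# Toy logged data: the clinician picks a == x 60% of the time; reward 1 iff a == x.
data = []
for x in (0, 1):
    data += [(x, x, 1.0)] * 60 + [(x, 1 - x, 0.0)] * 40

theta, phi = adversarial_ips(data, n_contexts=2, n_actions=2)
policy = [softmax(t) for t in theta]
propensity = [softmax(f) for f in phi]
print(policy, propensity)
```

In this toy run the adversarial propensity for the rewarded action settles above its empirical frequency of 0.6, which shrinks the importance weights and makes the reward estimate more pessimistic. A larger `lam` pulls \(h_0\) back toward the plain maximum-likelihood fit.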
Results and discussions
We evaluate the efficacy of our proposed frameworks on the nonclinical semisynthetic bandit dataset as well as on the clinical task of dosing initialization for orally administered anticoagulant drugs. Anticoagulants are blood thinners administered to remove blood clots, and their dosage during treatment initiation varies significantly across patients. Moreover, incorrect dosing can have significant side effects, thus making it a challenging clinical setting for treatment recommendation systems. We consider two commonly used anticoagulants in hospitals: warfarin and heparin. We use two freely available electronic health records databases to derive the clinical bandit datasets: 1) PharmGKB (Consortium 2009) [28] for warfarin dosing and 2) Multiparameter Intelligent Monitoring in Intensive Care (MIMICIII v1.4) [29] for heparin dosing. The semisynthetic dataset is derived from a fullylabeled classification dataset (with access to counterfactuals) by simulating the logging/clinician policy. For Warfarin dosing, we have access to counterfactuals and artificially simulate the logging policy to derive a semisynthetic dataset. For heparin dosing, we construct a true realworld bandit dataset from MIMICIII without access to counterfactuals.
Non-clinical dataset (Semi-synthetic)
We select two multi-class classification datasets from the UCI repository [30], namely Optdigits and Letter, which have previously been used for off-policy bandit evaluation [31, 32], and convert them to contextual bandit datasets by choosing actions derived from a multi-class logistic regression policy trained on 5% of the dataset, similar to Dudík et al. [31].
Clinical datasets
Warfarin dosing (Semi-synthetic)
Using the PharmGKB [33] dataset, we develop a case study to evaluate our framework on warfarin dosing. Warfarin dosing is concerned with determining the correct dosage of the blood anticoagulant drug for a heart patient. The dataset includes patient information (demographic, physiological, and genotype features) along with the final ideal therapeutic dosage. Warfarin's administration needs to be monitored closely since incorrect dosage can lead to adverse side effects such as heart attacks. The therapeutic dosage varies widely across patients due to different contextual features. Physicians typically prescribe an initial dose and then adjust it according to the patient's response. Previous work [34] on predicting dosage policies using bandits discretizes the dosage into three categories: 'low' (< 21 mg/wk), 'medium' (\(\ge\) 21 mg/wk, \(\le\) 49 mg/wk), or 'high' (> 49 mg/wk). Although warfarin dosing has recently been approached in the continuous domain, which allows for finer adjustments, we focus on tackling uncertainty in propensity score estimation under a discrete dosage setting to keep the overall formulation simple. With dosage discretization, the Warfarin dataset was converted to a supervised classification dataset \(D = \{(x_i, y_i)\}_{i=1}^{n}\) with access to treatment counterfactuals. This provides us with the ground-truth action for each patient. Since the dataset is supervised, we simulate a contextual bandit environment by simulating the clinician's policy and using a custom reward function. We follow the Supervised \(\rightarrow\) Bandit conversion approach highlighted in [13, 19, 35] and simulate expert (clinician's) behavior using a stochastic logging policy to sample \(y^*_i \sim h_0(\cdot \mid x_i)\), with the reward defined by the match between ground-truth and sampled actions, \(r_i = I(y_i = y^*_i)\). We simulate the following stochastic logging policies with 3 and 5 discrete dosage levels (policy actions). These policies are also referred to as expert policies.

1
LR: We follow the experimental design specified in [13] and use a multi-class logistic regression model trained on 5% of the data as the logging policy. For each simulation, we randomly sample 5% of the data from our training set and fit a multi-class logistic regression model to obtain the weight vector \(w_{lr}\). To introduce further stochasticity, we randomly perturb \(w_{lr}\) using random noise drawn from a standard normal distribution, \(u \sim N(0,1)\).

2
PHARMA: We adopt the clinical policies (WPGA, WCGA) [33] as our base deterministic policies (\(h_1, h_2\)). Both WPGA and WCGA are clinically motivated linear models, with WPGA incorporating genotype features to improve over WCGA. Our aim was to emulate clinicians using WPGA or WCGA for dosage recommendation and to combine the two in a stochastic manner. Motivated by the friendly softening approach proposed by Farajtabar et al. [35], we transform the deterministic policies into a stochastic policy by drawing actions \(a_i = h_0(x_i)\) from a mixture of these models with equal probability:
$$\begin{aligned} a_i = \left\{ \begin{array}{ll} h_1(x_i), & r_i \le 0.5 \\ h_2(x_i), & \text{otherwise} \end{array}\right. \end{aligned} \tag{13}$$
where \(r_i \in [0,1]\) is a uniform random number drawn for patient \(x_i\).
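The Supervised \(\rightarrow\) Bandit conversion and the equal-probability mixture of Eq. (13) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two base policies below are hypothetical stand-ins (they are not WPGA or WCGA), the context is reduced to a single number, and all function names are invented for the example. Only the three-level dose thresholds and the indicator reward come from the text.

```python
import random
from typing import Callable, List, Sequence, Tuple

Policy = Callable[[float], int]


def discretize_dose(mg_per_week: float) -> int:
    """Map a weekly dose to 0 (low, < 21), 1 (medium, 21 to 49) or 2 (high, > 49)."""
    if mg_per_week < 21:
        return 0
    if mg_per_week <= 49:
        return 1
    return 2


def mixture_policy(h1: Policy, h2: Policy, x: float, rng: random.Random) -> int:
    """Eq. (13): use h1 when the uniform draw is <= 0.5, otherwise h2."""
    r = rng.random()
    return h1(x) if r <= 0.5 else h2(x)


def to_bandit(contexts: Sequence[float], labels: Sequence[int],
              h1: Policy, h2: Policy, rng: random.Random) -> List[Tuple[float, int, int]]:
    """Turn labeled data into logged (context, action, reward) triples.

    The reward is 1 when the sampled action matches the ground-truth label.
    """
    logged = []
    for x, y in zip(contexts, labels):
        a = mixture_policy(h1, h2, x, rng)
        logged.append((x, a, int(a == y)))
    return logged


# Hypothetical stand-ins for the two deterministic base policies.
def always_medium(x: float) -> int:
    return 1


ideal_doses = [10.0, 30.0, 60.0]                 # toy "contexts"
truth = [discretize_dose(d) for d in ideal_doses]
bandit_data = to_bandit(ideal_doses, truth, discretize_dose, always_medium,
                        random.Random(0))
```

Note that only the sampled action and its reward are logged; the rewards of the other dosage levels stay hidden, which is what makes the resulting dataset a bandit dataset.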
Heparin dosing (True bandit)
Heparin is one of the most commonly used anticoagulant medications in hospitals and ICUs. The dosage of intravenous unfractionated heparin is commonly based on the patient's weight, as per most clinical practice guidelines [36]. Such a weight-based approach alone may result in improper dosage for obese patients. Although some works have recommended using an adjusted body weight [37], in practice, activated partial thromboplastin time (aPTT) is a good indicator of blood coagulation level. There is significant variation in the guidelines for the initial loading dose of heparin, the rate of dosage, and the time intervals between aPTT measurements. A higher aPTT level reflects slow blood clotting, whereas a low level indicates fast clotting. Blood samples are usually taken every 4-6 hours to measure aPTT levels, and the result of anticoagulation therapy is analyzed by observing whether aPTT reaches the therapeutic window in a timely manner. Typically, aPTT between 60s and 100s is considered therapeutic, with aPTT > 100s being supratherapeutic and aPTT < 60s being subtherapeutic. While machine learning techniques have been applied to provide clinical decision support for heparin dosing, the high patient variability has led to the underperformance of multinomial logistic regression-based models [38]. Here, we formulate heparin dosing as an offline bandit problem by considering the aPTT after 6 hours of dosage initialization as the reward outcome. We discretize the heparin dosages into 3 categories (actions): 'low' (< 10 mg/wk), 'medium' (\(\ge\) 10 mg/wk, \(\le\) 15 mg/wk), or 'high' (> 15 mg/wk). The outcome of interest was the aPTT value 6 hours after the initial heparin infusion, and the rewards were defined as:
Patient demographics and physiological features of interest used to define the context included: age, height, weight, ethnicity, gender, obesity, creatinine concentration, SOFA score, type of ICU admission, end-stage renal disease (ESRD), and pulmonary embolism. These features contribute collectively to the patient's response to the heparin dose; for instance, creatinine concentration reflects the filtration function of the glomeruli and, together with ESRD, serves as an indicator of renal function. We selected these features in line with previous studies [38, 39], with most of the features being statistically significant for predicting aPTT outcomes. To create the patient cohort, we follow the scheme proposed by Ghassemi et al. [38]. A total of 4,761 adult patients who had undergone heparin dosing during their ICU stay were extracted from the MIMIC-III database. We included only those patients with aPTT measurements 6 hours after the initial heparin infusion, reducing the cohort size to 2,981. Further, some patients had missing covariates; removing these patients left 2,136 patients. Lastly, we removed patients who were transferred from another hospital, since their heparin infusion might have started before the ICU admission and we have limited knowledge of medical interventions taken before transfer. Our final cohort comprises 1,378 patients.
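The cohort selection steps and the aPTT therapeutic window quoted above can be sketched in a few lines. This is an illustrative sketch only: the record layout and the field names (`aptt_6h`, `transferred`, and the covariate keys) are hypothetical and do not correspond to actual MIMIC-III column names, and no reward values are assigned because the reward definition is not reproduced here.

```python
from typing import Dict, List, Optional, Sequence

Patient = Dict[str, Optional[float]]


def aptt_category(aptt_seconds: float) -> str:
    """Classify an aPTT value using the 60s to 100s therapeutic window."""
    if aptt_seconds < 60:
        return "subtherapeutic"
    if aptt_seconds <= 100:
        return "therapeutic"
    return "supratherapeutic"


def build_cohort(patients: Sequence[Patient], covariates: Sequence[str]) -> List[Patient]:
    """Apply the three exclusion criteria in the order described in the text."""
    cohort = []
    for p in patients:
        if p.get("aptt_6h") is None:                    # no aPTT at 6 hours
            continue
        if any(p.get(c) is None for c in covariates):   # missing covariates
            continue
        if p.get("transferred"):                        # transferred from another hospital
            continue
        cohort.append(p)
    return cohort


# Toy records with hypothetical field names.
records = [
    {"age": 64, "weight": 80.0, "aptt_6h": 72.0, "transferred": 0},
    {"age": 51, "weight": 95.0, "aptt_6h": None, "transferred": 0},
    {"age": 70, "weight": None, "aptt_6h": 110.0, "transferred": 0},
    {"age": 45, "weight": 68.0, "aptt_6h": 55.0, "transferred": 1},
]
cohort = build_cohort(records, covariates=["age", "weight"])
labels = [aptt_category(p["aptt_6h"]) for p in cohort]
```

In this toy example only the first record survives all three filters, mirroring how each criterion in the text shrinks the cohort in turn.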
Baselines
We consider two popular off-policy estimators, IPS and SNIPS, and use the propensity score imputed from a single neural network with the vanilla IPS/SNIPS formulation as the baseline. The single neural network can be a deterministic neural network in the case of an NN ensemble or a network obtained by sampling once from the posterior weight distribution of a Bayesian NN.
Our logging policy imputation model is a single hidden-layer perceptron network with ReLU activation units. We establish baseline estimators by selecting one of the bootstrapped models as the propensity score estimator. We denote these baseline estimators as Vanilla SNIPS/Vanilla IPS. To bootstrap the deterministic NN model, we randomly initialize the model weights and use dropout (0.25) to fit the models. To train the BNN within the variational inference framework, we follow the 'Bayes-by-Backpropagation' approach [5], assuming a scale mixture of two Gaussian densities as the prior distribution for the weights, \(w_{h_0} \sim 0.5N(0,0.5) + 0.5N(0,0.002)\). The network configurations differ for Warfarin dosing (hidden units = 20) and Heparin dosing (hidden units = 40). We use the Adam optimizer [40] (\(\beta _{1}\) = 0.999, \(\beta _{2}\) = 0.9) with a learning rate of \(10^{-3}\) and a mini-batch size of 50 for both datasets, and use progressive validation to detect convergence. We determined the optimal training hyperparameters using 5-fold cross-validation on both datasets. For adversarial IPS learners, we determined \(\lambda\) = 1 to be optimal after experimenting with multiple values (0.5, 1, 1.5, 2).
Policy evaluation
In this experiment, we evaluate whether a bootstrapping-based framework leads to a more confident reward estimation of a custom clinical policy. Here, we leverage the semi-synthetic non-clinical and Warfarin dosing datasets since they allow for the comparison of the estimated policy reward with the ground-truth reward (estimated from counterfactuals). We perform 20 simulations and report the root-mean-squared error (RMSE = \(\sqrt{E[(\hat{R}(h) - R(h))^2]}\)) of our proposed estimators and baselines over these 20 sampled datasets, where R(h) is the ground-truth reward. During each simulation, we follow the methodology of Dudík et al. [19] to derive the semi-synthetic bandit dataset:

1
For each logging policy, we create a partially labeled bandit dataset by applying the transformations described in the previous Warfarin dosing (Semi-synthetic) section.

2
We randomly subsample 70% of the synthetic bandit dataset as our evaluation dataset and divide it into train/validation sets in an 80/20 ratio for fitting the propensity model.

3
We obtain the evaluation policy h by training a multiclass logistic regression model on a full classification dataset and define its classification accuracy as the ground truth reward R(h).

4
We bootstrap 10 models (\(\hat{h}_0^b, b \in \{1, 2, \ldots , 10\}\)) for imputing logging policy propensity scores as described in the Bootstrapped Counterfactual Evaluation section. For the ensemble approach, we initialize 10 models with seeds in multiples of 2. In variational inference, we train 10 BNNs and sample 10 models, one from each of the 10 weight distributions. For the MC-Dropout model, we apply dropout randomly to sample networks during inference. We also resample data while bootstrapping an NN ensemble model or training a new BNN.
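The evaluation protocol above can be sketched with the standard IPS and self-normalized IPS (SNIPS) estimators and the RMSE criterion. The two ways of combining bootstrapped propensities shown here (arithmetic mean of the propensities, and mean of the inverse propensities) are an assumption about what the "average" and "inverse" formulations refer to; their exact definitions appear in an earlier section and may differ. All function names and the toy numbers are hypothetical.

```python
import math
from typing import List, Sequence


def ips(rewards: Sequence[float], target: Sequence[float],
        logging: Sequence[float]) -> float:
    """Inverse propensity score estimate of the target policy's reward."""
    n = len(rewards)
    return sum(r * t / p for r, t, p in zip(rewards, target, logging)) / n


def snips(rewards: Sequence[float], target: Sequence[float],
          logging: Sequence[float]) -> float:
    """Self-normalized IPS: importance weights are normalized by their sum."""
    weights = [t / p for t, p in zip(target, logging)]
    return sum(r * w for r, w in zip(rewards, weights)) / sum(weights)


def average_propensity(ensemble: Sequence[Sequence[float]]) -> List[float]:
    """Assumed 'avg' variant: per-sample arithmetic mean over bootstrapped models."""
    b = len(ensemble)
    return [sum(col) / b for col in zip(*ensemble)]


def inverse_propensity(ensemble: Sequence[Sequence[float]]) -> List[float]:
    """Assumed 'inv' variant: propensity whose reciprocal is the mean of 1/p_b."""
    b = len(ensemble)
    return [b / sum(1.0 / p for p in col) for col in zip(*ensemble)]


def rmse(estimates: Sequence[float], truth: float) -> float:
    """Root-mean-squared error of repeated estimates against the true reward."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


# Toy logged data: rewards, target-policy probabilities of the logged actions,
# and propensities imputed by two bootstrapped models (rows) for four samples.
rewards = [1, 0, 1, 1]
target = [0.5, 0.5, 0.5, 0.5]
ensemble = [[0.5, 0.2, 0.5, 0.4],
            [0.5, 0.3, 0.5, 0.6]]

estimate_avg = snips(rewards, target, average_propensity(ensemble))
estimate_inv = snips(rewards, target, inverse_propensity(ensemble))
```

Repeating such estimates over the simulated datasets and passing them to `rmse` together with the ground-truth reward R(h) gives the error measure reported in the evaluation tables.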
Policy learning
In this experiment, we use the bootstrapping and adversarial learning frameworks to learn optimal policies with maximum reward. Based on the performance of our bootstrapped estimators for policy evaluation, we expect that addressing uncertainty with bootstrapping and adversarial formulations will translate to learning better policies. We perform 10 simulations and learn the dosing policy h using an IPS-based loss formulation and mini-batch stochastic gradient descent. As the choice of method for learning \(\hat{h}_0\) (i.e., using NNs vs. using Bayesian NNs) is orthogonal to the policy learning method, we use a regular NN for demonstration in this section. We follow these steps during each simulation:

1
We randomly split the data into training (70%) and test (30%) sets.

2
For each logging policy type, we obtain a partially labeled semi-synthetic bandit dataset for Warfarin dosing by applying the transformations described in the previous Warfarin dosing (Semi-synthetic) section. Moreover, we also consider the Heparin dosing dataset, which is a true bandit dataset and allows us to evaluate policy learning in a non-simulated real-world clinical setting.

3
Bootstrapping: We bootstrap ten models for imputing the logging policy. By incorporating the average and inverse learning formulations into IPS and SNIPS estimators, we learn optimum policies \(h_{avg}\) and \(h_{inv}\), respectively.

4
Adversarial Bandit Learner: We train the models \(h_0\) and h alternately using the IPS\(_{adv}\) loss formulation. Before initiating the adversarial training, we initialize the propensity model \(h_0\) by training it for four epochs on the bandit dataset. This ensures that \(h_0\) starts with parameters not far from those of the optimal propensity model, which stabilizes the subsequent adversarial learning process. We train both networks alternately for 100 epochs with a learning rate of 0.001.
To evaluate our frameworks, we report the mean reward achieved by our learned policies along with their variance (\(\mu _{R(h)} \pm \sigma _{R(h)}\)). For Warfarin dosing, we execute the learned policy on the test dataset and compare the predicted actions with the ground-truth dosage actions from the full classification dataset. For Heparin dosing, we only have a real-world bandit dataset and therefore no access to counterfactuals, i.e., the optimal ground-truth dosage for each patient. Hence, we leverage the SNIPS estimator to evaluate the performance of our learned policies, given that offline SNIPS estimates were shown by Zenati et al. [41] to be highly correlated with the true (online) performance for a wide range of policies. In our evaluation experiments (Table 2), we found SNIPS to have lower variance and bias than IPS.
Main results
We present the policy evaluation results for SNIPS and IPS estimators on the Warfarin bandit dataset (LR and PHARMA policies) in Tables 2 and 3, respectively. We highlight both the bias and the variance of the estimated policy rewards. Using bootstrapping leads to significantly lower bias and variance, even in the case of SNIPS, which typically has lower variance due to weight normalization. Comparing the two bootstrap-based estimators, we find that the average propensity score estimator can achieve lower policy evaluation bias than the inverse estimator. We also observe that NN ensemble and MC-Dropout-based networks lead to slightly better variance reduction than BNNs, which is in line with the uncertainty reduction results observed in [22].
Impact of number of bootstraps on evaluation error
In the case of the NN ensemble, we also evaluate the impact of bootstrap count on the reduction in bias and variance of SNIPS-based reward estimators (Fig. 1). We observe that an ensemble of 5 neural networks performs sufficiently well in reducing both the variance and bias. As the number of bootstrapped models increases, the bias and variance of the SNIPS\(_{inv}\) and SNIPS\(_{avg}\) estimators reduce significantly, with SNIPS\(_{avg}\) achieving lower bias and variance. Thus, bootstrapping multiple models allows for sampling from multiple proposal distributions and avoids the situation wherein a single propensity score model suffers from very low probability coverage over certain regions of the action space.
In Tables 4 and 5, we highlight the results of policy learning on clinical datasets. Using bootstrapping leads to improved policy learning both on semi-synthetic data (LR logging policy in Warfarin dosing) and on true bandit data (Heparin dosing). Moreover, we observe that IPS\(_{inv}\) outperforms IPS\(_{avg}\) across multiple datasets. Consistent with the policy evaluation results, we find that the NN ensemble is more effective at reducing uncertainty than BNNs. An interesting observation is that bootstrapping leads to lower rewards for warfarin datasets simulated using the PHARMA logging policy. On further analysis, we find that this is because the PHARMA policy actions are heavily biased towards certain actions (dosage 1 in the 3-action case and dosages 1 & 2 in the 5-action case). This bias in the simulated actions of the logging policy leads to the learned policy being substantially biased towards action '1'. The bootstrapped framework, however, leads to a policy that is less biased and more balanced in its actions, although it achieves a lower overall reward. As observed in Fig. 2, policy learning using IPS\(_{inv}\) achieves higher accuracy for infrequent actions (dosages 0 & 2 in the 3-action scenario and dosages 3, 4 & 5 in the 5-action scenario).
Discussions
In this study, we explored the application of policy evaluation and optimization algorithms in clinical decision-making scenarios. Acknowledging the high uncertainty in medical data and the critical need for lower risk in medical decision-making, we proposed a bootstrapping-based approach for learning decision-making policies. This method provides not only reward estimates but also confidence intervals, enabling physicians to select actions with reduced variance when necessary. Our bootstrapping framework effectively addresses the model uncertainty typically associated with IPS-based estimators, leading to decreased variance in policy evaluation and enhanced policy optimization. Furthermore, we introduced an innovative adversarial learning technique to further advance policy optimization performance. Specifically, we addressed model uncertainty from the perspective of distributionally robust counterfactual risk minimization. We proposed an adversarial IPS learner (IPS\(_{adv}\)), designed to maximize rewards under the worst-case propensity model within a defined uncertainty set.
Our experiments demonstrate the efficacy of our proposed frameworks (IPS\(_{inv}\), IPS\(_{avg}\), and IPS\(_{adv}\)) in a clinical setting involving the oral dosing of anticoagulants, heparin and warfarin. Our approaches not only facilitate better initial dosing policies but also achieve higher rewards. Moreover, we introduce the generation of semi-synthetic and real-world clinical bandit datasets to promote further research in this field. The experimental results highlight the potential of applying the policy learning paradigm to clinical applications, paving the way for various follow-up studies in this line of research. When investigating the impact of the number of bootstraps on evaluation error, we present the results of 10 simulations on the Warfarin dataset. Bootstrapping leads to improved policy learning, particularly in the PHARMA scenario, where accurately imputing the logging policy is more challenging. Furthermore, the benefits of bootstrapping are more pronounced for the IPS estimator, which is less reliable than the DR estimator for policy learning [19]. We also observe that bootstrapping using Bayesian neural networks results in lower variance compared to ensembling, owing to their enhanced ability to model the uncertainty in logged data. For both logging policies, bootstrapping both data and models yields a slight improvement over model bootstrapping alone.
Limitations and future works
This study demonstrates the effectiveness of combining bootstrap and adversarial learning techniques in policy learning for clinical decision support. However, there are several limitations and potential avenues for future research. Firstly, our framework could be extended to incorporate other policy evaluation algorithms, especially doubly robust estimators, which may further enhance the accuracy and reliability of policy evaluation and optimization. Secondly, in real-world clinical applications, datasets often originate from multiple institutions, each with potentially different underlying distributions. Investigating methods to reduce uncertainty when learning policies from multiple real-world heterogeneous datasets would be a valuable research direction, as it could improve the generalizability and robustness of the learned policies. Lastly, personalized medicine is an increasingly important aspect of clinical decision-making. Future research should explore the development of more personalized policies that incorporate individual patient characteristics, such as comprehensive genetic information. This could lead to more tailored and effective treatment recommendations, ultimately improving patient outcomes.
Conclusion
In this paper, we explore the application of policy evaluation and optimization algorithms in clinical decision-making scenarios. Given the significant uncertainties inherent in clinical data, our initial approach involves employing a bootstrap method to mitigate these uncertainties. Furthermore, we introduce an adversarial learning technique to enhance policy optimization performance. Our findings demonstrate the potential of leveraging the policy learning paradigm in clinical contexts, opening avenues for future research endeavors. One potential direction involves examining the integration of other policy evaluation algorithms, such as doubly robust algorithms, within our proposed framework. Additionally, there is a compelling need to explore more personalized approaches to clinical decision-making, e.g., including genetic information in the decision-making process and tailoring treatments to individual patient profiles.
Availability of data and materials
The real-world datasets used and/or analyzed during the current study are publicly available at (1) PharmGKB (Consortium 2009): https://www.pharmgkb.org; (2) Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-III v1.4): https://mimic.mit.edu.
References
Castaneda C, Nalley K, Mannion C, Bhattacharyya P, Blake P, Pecora A, et al. Clinical decision support systems for improving diagnostic accuracy and achieving precision medicine. J Clin Bioinforma. 2015;5(1):1–16.
Wu H, Shi W, Wang MD. Developing a novel causal inference algorithm for personalized biomedical causal graph learning using meta machine learning. BMC Med Inform Decis Mak. 2024;24(1):137.
Carey TA, Stiles WB. Some problems with randomized controlled trials and some viable alternatives. Clin Psychol Psychother. 2016;23(1):87–95.
Dusenberry MW, Tran D, Choi E, Kemp J, Nixon J, Jerfel G, et al. Analyzing the role of model uncertainty for electronic health records. In: Proceedings of the ACM Conference on Health, Inference, and Learning. ACM. 2020. p. 204–213.
Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D. Weight Uncertainty in Neural Network. In: International Conference on Machine Learning. pp. 1613–1622. http://proceedings.mlr.press/v37/blundell15.html. Accessed 25 Nov 2019.
Zhu W, Xie L, Han J, Guo X. The application of deep learning in cancer prognosis prediction. Cancers. 2020;12(3):603.
Thomsen K, Iversen L, Titlestad TL, Winther O. Systematic review of machine learning for diagnosis and prognosis in dermatology. J Dermatol Treat. 2020;31(5):496–510.
Parbhoo S, Bogojeska J, Zazzi M, Roth V, Doshi-Velez F. Combining Kernel and Model Based Learning for HIV Therapy Selection. AMIA Summits on Translational Science Proceedings. 2017:239.
Guez A, Vincent RD, Avoli M, Pineau J. Adaptive treatment of epilepsy via batch-mode reinforcement learning. AAAI. 2008. p. 1671–1678.
Prasad N, Cheng LF, Chivers C, Draugelis M, Engelhardt BE. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. 2017. arXiv preprint arXiv:1704.06300.
Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Med. 2018;24(11):1716–20.
Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47(260):663–85.
Swaminathan A, Joachims T. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. In: International Conference on Machine Learning. pp. 814–823. http://proceedings.mlr.press/v37/swaminathan15.html. Accessed 8 Oct 2019.
Wu H, Wang M. Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization. In: Dy J, Krause A, editors. Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80. PMLR; 2018. pp. 5353–5362. https://proceedings.mlr.press/v80/wu18g.html. Accessed 8 Oct 2019.
Faury L, Tanielian U, Vasile F, Smirnova E, Dohmatob E. Distributionally Robust Counterfactual Risk Minimization. arXiv:1906.06211. Accessed 2 Oct 2019.
Ionides EL. Truncated importance sampling. J Comput Graph Stat. 2008;17(2):295–311.
Bottou L, Peters J, Quiñonero-Candela J, Charles DX, Chickering DM, Portugaly E, et al. Counterfactual reasoning and learning systems: the example of computational advertising. J Machine Learning Res. 2013;14:3207–3260. http://jmlr.org/papers/v14/bottou13a.html. Accessed 8 Oct 2019.
Swaminathan A, Joachims T. The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems. 2015. p. 3231–3239.
Dudík M, Erhan D, Langford J, Li L, et al. Doubly robust policy evaluation and optimization. Stat Sci. 2014;29(4):485–511.
Xie Y, Liu B, Liu Q, Wang Z, Zhou Y, Peng J. Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy. 2018. arXiv preprint arXiv:1808.00232.
Dusenberry MW, Tran D, Choi E, Kemp J, Nixon J, Jerfel G, et al. Analyzing the Role of Model Uncertainty for Electronic Health Records. arXiv:1906.03842. Accessed 16 Oct 2019.
Lakshminarayanan B, Pritzel A, Blundell C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. arXiv:1612.01474. Accessed 16 Oct 2019.
Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv:1506.02142. Accessed 16 Oct 2019.
Zhang G, Sun S, Duvenaud D, Grosse R. Noisy Natural Gradient as Variational Inference. In: International Conference on Machine Learning. pp. 5852–5861. http://proceedings.mlr.press/v80/zhang18l.html. Accessed 25 Nov 2019.
Kingma DP, Salimans T, Welling M. Variational dropout and the local reparameterization trick. In: Advances in Neural Information Processing Systems. 2015. pp. 2575–2583.
Veach E, Guibas LJ. Optimally combining sampling techniques for Monte Carlo rendering. In: Proceedings of the 22nd annual conference on Computer graphics and interactive techniques. 1995. pp. 419–428.
Wald A. Contributions to the theory of statistical estimation and testing hypotheses. Ann Math Stat. 1939;10(4):299–326.
Owen RP, Altman RB, Klein TE. PharmGKB and the international warfarin pharmacogenetics consortium: the changing role for pharmacogenomic databases and single-drug pharmacogenetics. Hum Mut. 2008;29(4):456–60.
Johnson AEW, Pollard TJ, Shen L, Lehman LwH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3(1):1–9. https://doi.org/10.1038/sdata.2016.35. Accessed 14 Nov 2019.
Dua D, Graff C. UCI Machine learning repository. 2017. http://archive.ics.uci.edu/ml. Accessed 14 Nov 2019.
Dudík M, Langford J, Li L. Doubly robust policy evaluation and learning. 2011. arXiv preprint arXiv:1103.4601.
Vlassis N, Bibaut A, Dimakopoulou M, Jebara T. On the design of estimators for bandit off-policy evaluation. In: International Conference on Machine Learning. 2019. pp. 6468–6476.
Consortium IWP. Estimation of the warfarin dose with clinical and pharmacogenetic data. N Engl J Med. 2009;360(8):753–64.
Bastani H, Bayati M. Online Decision Making with High-Dimensional Covariates. https://doi.org/10.1287/opre.2019.1902. Accessed 14 Nov 2019.
Farajtabar M, Chow Y, Ghavamzadeh M. More Robust Doubly Robust Off-policy Evaluation. In: International Conference on Machine Learning. 2018. pp. 1447–1456.
Schurr JW, Muske AM, Stevens CA, Culbreth SE, Sylvester KW, Connors JM. Derivation and validation of age- and body mass index-adjusted weight-based unfractionated heparin dosing. Clin Appl Thromb Hemost. 2019;25:1076029619833480.
Fan J, John B, Tesdal E. Evaluation of heparin dosing based on adjusted body weight in obese patients. Am J Health Syst Pharm. 2016;73(19):1512–22.
Ghassemi MM, Richter SE, Eche IM, Chen TW, Danziger J, Celi LA. A data-driven approach to optimized medication dosing: a focus on heparin. Intensive Care Med. 2014;40(9):1332–9.
Nemati S, Ghassemi MM, Clifford GD. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2016. pp. 2978–81.
Kingma DP, Ba J. Adam: A method for stochastic optimization. 2014. arXiv preprint arXiv:1412.6980.
Zenati H, Bietti A, Martin M, Diemert E, Mairal J. Counterfactual learning of continuous stochastic policies. 2020. arXiv preprint arXiv:2004.11722.
Acknowledgements
We would like to express our gratitude to Dr. Monica Isgut and Dr. Felipe Giuste for their careful proofreading of our final manuscript and their insightful suggestions for enhancing the paper's quality.
Funding
This research has been supported by a Wallace H. Coulter Distinguished Faculty Fellowship, a Petit Institute Faculty Fellowship, and research funding from Amazon and Microsoft Research to Professor May D. Wang.
Author information
Contributions
H.W., W.S., and A.C. contributed to the study design, data preprocessing, statistical analysis, model development, result analysis, and writing of the manuscript, including figures and tables. M.W. contributed to the study design, result evaluation, and extensive refining of the manuscript. All authors reviewed the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wu, H., Shi, W., Choudhary, A. et al. Clinical decision making under uncertainty: a bootstrapped counterfactual inference approach. BMC Med Inform Decis Mak 24, 275 (2024). https://doi.org/10.1186/s12911-024-02606-z
DOI: https://doi.org/10.1186/s12911-024-02606-z