
Table 1 The Causal Policy Gradient (CPG) Algorithm

From: Incorporating causal factors into reinforcement learning for dynamic treatment regimes in HIV

Algorithm 1: The CPG Algorithm

Function CPG

 Input: a differentiable policy parameterization π(a|s,θ), a ∈ A, s ∈ S, θ ∈ ℝ^d; causal factor C = 0;

 Initialize policy parameter θ;

 Repeat forever:

  Define events A and B;

  Generate an episode s_0, a_0, r_1, ..., s_{T−1}, a_{T−1}, r_T, following π(a|s,θ);

  For each step of the episode t=0,...,T-1:

   G ← average future return from step t;

   C ← P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A));

   θ ← θ + α ∇_θ log π(a_t|s_t,θ) · G · C;

 End for

 Return θ;

End CPG
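
The algorithm above is a REINFORCE-style update whose per-step gradient is scaled by the causal factor C. The following is a minimal Python sketch of the inner loop under assumptions not stated in the table: a linear-softmax policy over discrete actions, episodes supplied as (features, action, reward) tuples, and externally estimated probabilities P(A∩B), P(¬A∩B), and P(A) for the user-defined events A and B; the numbers in the usage example are placeholders, not values from the paper.

import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def causal_factor(p_a_and_b, p_not_a_and_b, p_a):
    # C = P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A)), i.e. P(B|A) − P(B|¬A).
    return p_a_and_b / p_a - p_not_a_and_b / (1.0 - p_a)

def cpg_episode_update(theta, episode, alpha, C):
    # One pass of the inner for-loop for an assumed linear-softmax policy
    # pi(a|s,theta) = softmax(theta.T @ phi(s)), with theta of shape (d, |A|).
    # episode: list of (phi, a, r), where phi is the feature vector of s_t,
    # a the action taken at step t, and r the reward received afterwards.
    rewards = np.array([r for _, _, r in episode], dtype=float)
    for t, (phi, a, _) in enumerate(episode):
        G = rewards[t:].mean()                    # average future return from step t
        probs = softmax(theta.T @ phi)            # pi(.|s_t, theta)
        grad_log = np.outer(phi, -probs)          # grad_theta log pi(a_t|s_t, theta)
        grad_log[:, a] += phi
        theta = theta + alpha * grad_log * G * C  # causal-scaled policy-gradient step
    return theta

# Toy usage with hypothetical data and event-probability estimates.
rng = np.random.default_rng(0)
d, n_actions = 4, 3
theta = np.zeros((d, n_actions))
C = causal_factor(p_a_and_b=0.30, p_not_a_and_b=0.10, p_a=0.50)  # assumed estimates
episode = [(rng.normal(size=d), int(rng.integers(n_actions)), float(rng.normal()))
           for _ in range(10)]
theta = cpg_episode_update(theta, episode, alpha=0.01, C=C)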