Table 1 The Causal Policy Gradient (CPG) Algorithm

From: Incorporating causal factors into reinforcement learning for dynamic treatment regimes in HIV

Algorithm 1: The CPG Algorithm
Function CPG
 Input: a differentiable policy parameterization π(a|s, θ), a ∈ A, s ∈ S, θ ∈ ℝ^d; causal factor C = 0;
 Initialize policy parameter θ;
 Repeat forever:
  Define event A and event B;
  Generate an episode s_0, a_0, r_1, ..., s_{T−1}, a_{T−1}, r_T, following π(a|s, θ);
  For each step of the episode t=0,...,T-1:
   G ← average future return from step t;
   C = P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A));
   θ ← θ + α ∇_θ log π(a_t|s_t, θ) · G · C;
 End for
 Return θ;
End CPG
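
The table above is a REINFORCE-style policy-gradient loop whose per-step update is scaled by a causal factor C = P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A)), i.e. the contrast P(B|A) − P(B|¬A) between the two events defined at the start of each iteration. Below is a minimal NumPy sketch of that update under an assumed tabular softmax policy; the names SoftmaxPolicy, causal_factor and cpg_episode_update are illustrative, not the paper's implementation, and the 0/1 event indicators and the undiscounted "average future return" follow the table rather than any detail of the HIV treatment model.

import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """Tabular softmax policy pi(a|s, theta); theta has shape (n_states, n_actions)."""
    def __init__(self, n_states, n_actions):
        self.theta = np.zeros((n_states, n_actions))

    def probs(self, s):
        return softmax(self.theta[s])

    def act(self, s):
        return rng.choice(self.theta.shape[1], p=self.probs(s))

    def grad_log_pi(self, s, a):
        # d/dtheta log pi(a|s, theta): (one_hot(a) - pi(.|s)) on row s, zero elsewhere
        g = np.zeros_like(self.theta)
        g[s] = -self.probs(s)
        g[s, a] += 1.0
        return g

def causal_factor(a_indicators, b_indicators):
    """Estimate C = P(A∩B)/P(A) - P(¬A∩B)/(1 - P(A)) from 0/1 event samples."""
    A = np.asarray(a_indicators, dtype=bool)
    B = np.asarray(b_indicators, dtype=bool)
    p_a = A.mean()
    if p_a == 0.0 or p_a == 1.0:         # contrast undefined; treat as no causal effect
        return 0.0
    return (A & B).mean() / p_a - (~A & B).mean() / (1.0 - p_a)

def cpg_episode_update(policy, states, actions, rewards, C, alpha=0.01):
    """One pass of the inner loop: a REINFORCE update scaled by the causal factor C.
    rewards[t] is the reward r_{t+1} received after taking actions[t] in states[t]."""
    for t in range(len(actions)):
        G = np.mean(rewards[t:])                         # average future return from step t
        grad = policy.grad_log_pi(states[t], actions[t])
        policy.theta += alpha * grad * G * C

Because C multiplies every step's gradient, an episode whose events show no causal contrast (C close to 0) contributes little to the parameter update, which is the effect the causal weighting in the table is designed to produce.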