
Table 1 The Causal Policy Gradient (CPG) Algorithm

From: Incorporating causal factors into reinforcement learning for dynamic treatment regimes in HIV

Algorithm 1: The CPG Algorithm

Function CPG

 Input: a differentiable policy parameterization π(a|s,θ), a ∈ A, s ∈ S, θ ∈ ℝ^d; causal factor C = 0;

 Initialize policy parameter θ;

 Repeat forever:

  Define events A and B;

  Generate an episode s_0, a_0, r_1, ..., s_{T−1}, a_{T−1}, r_T, following π(a|s,θ);

  For each step of the episode t=0,...,T-1:

   G ← average future return from step t;

   C ← P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A));

   θ ← θ + α ∇_θ log π(a_t|s_t,θ) · G · C;

 End for

 Return θ;

End CPG
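
The algorithm above is a REINFORCE-style update whose per-step gradient is scaled by the causal factor C. The following is a minimal Python sketch of the inner loop under assumptions not stated in the table: a linear-softmax policy over discrete actions, episodes supplied as (features, action, reward) tuples, and externally estimated probabilities P(A∩B), P(¬A∩B), and P(A) for the user-defined events A and B; the numbers in the usage example are placeholders, not values from the paper.

import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def causal_factor(p_a_and_b, p_not_a_and_b, p_a):
    # C = P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A)), i.e. P(B|A) − P(B|¬A).
    return p_a_and_b / p_a - p_not_a_and_b / (1.0 - p_a)

def cpg_episode_update(theta, episode, alpha, C):
    # One pass of the inner for-loop for an assumed linear-softmax policy
    # pi(a|s,theta) = softmax(theta.T @ phi(s)), with theta of shape (d, |A|).
    # episode: list of (phi, a, r), where phi is the feature vector of s_t,
    # a the action taken at step t, and r the reward received afterwards.
    rewards = np.array([r for _, _, r in episode], dtype=float)
    for t, (phi, a, _) in enumerate(episode):
        G = rewards[t:].mean()                    # average future return from step t
        probs = softmax(theta.T @ phi)            # pi(.|s_t, theta)
        grad_log = np.outer(phi, -probs)          # grad_theta log pi(a_t|s_t, theta)
        grad_log[:, a] += phi
        theta = theta + alpha * grad_log * G * C  # causal-scaled policy-gradient step
    return theta

# Toy usage with hypothetical data and event-probability estimates.
rng = np.random.default_rng(0)
d, n_actions = 4, 3
theta = np.zeros((d, n_actions))
C = causal_factor(p_a_and_b=0.30, p_not_a_and_b=0.10, p_a=0.50)  # assumed estimates
episode = [(rng.normal(size=d), int(rng.integers(n_actions)), float(rng.normal()))
           for _ in range(10)]
theta = cpg_episode_update(theta, episode, alpha=0.01, C=C)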