Policy Gradients Based Reinforcement Learning

Policy Gradients


Partial Observability

Markov property is not actually used!

Can use policy gradient in partially observed MDPs without modification.

What is Wrong with the Policy Gradient


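In reinforcement learning there is a causal structure: the policy at the current time step cannot affect rewards that were already received at earlier time steps. That is, the policy at time t’ cannot affect the reward at time t when t < t’.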
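
The causality argument above means each log-probability term only needs to be weighted by the rewards from its own time step onward (the "reward to go"). A minimal sketch of that computation, assuming undiscounted rewards stored in a plain list; the function name `rewards_to_go` is illustrative, not from the original notes:

```python
def rewards_to_go(rewards):
    """Return, for each t, the sum of rewards[t:] using one backward pass."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]  # accumulate rewards from the end of the episode
        out[t] = running
    return out


print(rewards_to_go([1.0, 0.0, 2.0, 3.0]))  # [6.0, 5.0, 5.0, 3.0]
```

The backward pass keeps the cost linear in the episode length, instead of re-summing the tail for every t.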


Subtracting a baseline is unbiased in expectation!
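
The unbiasedness claim follows in one line. The notation here is an assumption: p_θ(τ) is the trajectory distribution induced by π_θ, and b is a constant that does not depend on τ.

$$ E_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, b \, d\tau = \int \nabla_\theta p_\theta(\tau)\, b \, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0 $$

So subtracting b changes the variance of the estimator but not its expectation.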

Average reward is not the best baseline, but it’s pretty good!
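
To see the effect of the average-reward baseline, here is a small exact calculation (no sampling) for a toy problem. Everything in it is an illustrative assumption: a single-step problem with two actions, π(a=1) = sigmoid(θ), and fixed rewards r0 and r1.

```python
import math


def grad_stats(theta, r0, r1, baseline):
    """Exact mean and variance of g = d/dtheta log pi(a) * (r(a) - baseline)
    for a two-action policy with pi(a=1) = sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    # (probability, score function d/dtheta log pi(a), reward) per action
    outcomes = [(1.0 - p, -p, r0), (p, 1.0 - p, r1)]
    mean = sum(prob * score * (r - baseline) for prob, score, r in outcomes)
    second = sum(prob * (score * (r - baseline)) ** 2 for prob, score, r in outcomes)
    return mean, second - mean ** 2


theta, r0, r1 = 0.0, 10.0, 11.0
p = 1.0 / (1.0 + math.exp(-theta))
avg_reward = (1.0 - p) * r0 + p * r1  # the average-reward baseline

mean_plain, var_plain = grad_stats(theta, r0, r1, baseline=0.0)
mean_base, var_base = grad_stats(theta, r0, r1, baseline=avg_reward)
print("no baseline:     mean", mean_plain, "variance", var_plain)
print("average reward:  mean", mean_base, "variance", var_base)
```

Both estimators have the same mean, while the baseline removes the large common offset in the rewards that dominates the variance. This toy case is unusually favorable because the two actions are equally likely; in general the average reward reduces variance without being optimal.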

Analyzing Variance

Section Review

  • The high variance of policy gradient
  • Exploiting causality
    • Future doesn’t affect the past
  • Baselines
    • Unbiased
  • Analyzing variance
    • Can derive optimal baselines
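
The optimal baseline mentioned in the last bullet can be sketched as follows. The shorthand g(τ) = ∇_θ log p_θ(τ) for one component of the gradient is an assumption of this note; the derivation applies per parameter dimension.

$$ \mathrm{Var} = E\left[\big(g(\tau)(r(\tau) - b)\big)^2\right] - \Big(E\left[g(\tau)(r(\tau) - b)\right]\Big)^2 $$

The second term does not depend on b, because the baseline is unbiased in expectation. Setting the derivative of the first term to zero gives

$$ \frac{d\,\mathrm{Var}}{db} = -2\,E\left[g(\tau)^2 r(\tau)\right] + 2b\,E\left[g(\tau)^2\right] = 0 \quad\Rightarrow\quad b = \frac{E\left[g(\tau)^2\, r(\tau)\right]}{E\left[g(\tau)^2\right]} $$

That is, the optimal baseline is the expected reward weighted by squared gradient magnitudes.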

Policy Gradient is On-policy



Off-policy Learning & Importance Sampling
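
Importance sampling rests on the identity E_{x~p}[f(x)] = E_{x~q}[(p(x)/q(x)) f(x)], which lets samples drawn from q estimate an expectation under p. A minimal sketch on a three-outcome discrete distribution; the distributions, the function f and the helper name are made up for illustration:

```python
import random


def importance_sampled_mean(f, p, q, n, rng):
    """Estimate E_{x~p}[f(x)] from n samples drawn from q, reweighted by p/q."""
    xs = rng.choices(range(len(q)), weights=q, k=n)
    return sum(p[x] / q[x] * f[x] for x in xs) / n


p = [0.1, 0.2, 0.7]        # target distribution
q = [1 / 3, 1 / 3, 1 / 3]  # sampling distribution (must be nonzero where p is)
f = [1.0, 2.0, 3.0]

exact = sum(pi * fi for pi, fi in zip(p, f))
estimate = importance_sampled_mean(f, p, q, n=100_000, rng=random.Random(0))
print("exact:", exact, "importance-sampled estimate:", estimate)
```

In the policy gradient setting, p plays the role of the new policy's trajectory distribution and q the distribution of the policy that collected the data.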




Deriving the Policy Gradient with IS
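
A sketch of the standard derivation, with notation assumed here: θ are the parameters of the policy that collected the samples, θ' the parameters being optimized, and p_θ(τ) the trajectory distribution. Applying importance sampling to the objective gives

$$ J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, r(\tau)\right] $$

The initial-state and transition probabilities appear in both numerator and denominator and cancel, so the ratio only involves the policies:

$$ \frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} $$

Differentiating with respect to θ' then yields

$$ \nabla_{\theta'} J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right] $$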

The Off-Policy Policy Gradient

A first-order Approximation for IS

Policy Gradient with Automatic Differentiation

Pseudo-code example (with discrete actions):

Maximum likelihood

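# Given:
# actions - (N*T) x Da tensor of one-hot actions
# states - (N*T) x Ds tensor of states
# policy, variables - the policy network and its trainable variables (defined elsewhere)
# Build the graph (TensorFlow 1.x style):
import tensorflow as tf

logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
# Cross-entropy against the taken actions is the negative log-likelihood, shape (N*T,)
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)
gradients = tf.gradients(loss, variables)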

Policy Gradient

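# Given:
# actions - (N*T) x Da tensor of one-hot actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# policy, variables - the policy network and its trainable variables (defined elsewhere)
# Build the graph (TensorFlow 1.x style):
import tensorflow as tf

logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
# Cross-entropy against the taken actions is the negative log-likelihood, shape (N*T,)
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
# Squeeze q_values to shape (N*T,) so the product is elementwise rather than broadcast to (N*T) x (N*T)
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, tf.squeeze(q_values, axis=1))
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables)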

$$ \bar{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(a_{i,t} \mid s_{i,t})\, Q_{i,t} $$
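
The point of this surrogate objective is that differentiating it yields the policy gradient estimator, so an automatic differentiation package can do the work. A minimal sketch that checks this numerically, under simplifying assumptions: a single state, a softmax policy whose parameters are the logits themselves, and a finite-difference gradient standing in for automatic differentiation. All names and numbers are illustrative.

```python
import math


def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def surrogate(logits, actions, q_values):
    """(1/N) * sum_i log pi(a_i) * Q_i for one state with the given logits."""
    logp = log_softmax(logits)
    return sum(logp[a] * q for a, q in zip(actions, q_values)) / len(actions)


def policy_gradient(logits, actions, q_values):
    """Score-function estimator: (1/N) * sum_i (onehot(a_i) - pi) * Q_i."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    grad = [0.0] * len(logits)
    for a, q in zip(actions, q_values):
        for k in range(len(logits)):
            grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * q
    return [g / len(actions) for g in grad]


def finite_difference(logits, actions, q_values, eps=1e-6):
    """Central-difference gradient of the surrogate with respect to the logits."""
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        diff = surrogate(up, actions, q_values) - surrogate(down, actions, q_values)
        grad.append(diff / (2 * eps))
    return grad


logits = [0.2, -0.5, 1.0]
actions = [0, 2, 2, 1]             # sampled action indices
q_values = [1.0, 0.5, -2.0, 3.0]   # treated as constants, not differentiated
analytic = policy_gradient(logits, actions, q_values)
numeric = finite_difference(logits, actions, q_values)
print(analytic)
print(numeric)
```

Note that the pseudo-code above minimizes the negative of this surrogate, since cross-entropy is a negative log-likelihood.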


Policy Gradient in Practice

Remember that the gradient has high variance

  • This isn’t the same as supervised learning!
  • Gradients will be really noisy!
  • Consider using much larger batches
  • Tweaking learning rates is very hard
    • Adaptive step size rules like ADAM can be OK-ish

Section Review

Covariant/Natural Policy Gradient



Trust region policy optimization: deep RL with natural policy gradient and adaptive step size.
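
For reference, the natural policy gradient update can be sketched as follows, in notation assumed here: F is the Fisher information matrix of the policy, and ε is a bound on the KL divergence between the new and old policies.

$$ \theta \leftarrow \theta + \alpha\, F^{-1} \nabla_\theta J(\theta), \qquad F = E_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{T}\right] $$

Under the constraint D_KL(π_θ' ‖ π_θ) ≤ ε, with the usual second-order approximation of the KL divergence, the step size becomes

$$ \alpha = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^{T} F^{-1} \nabla_\theta J(\theta)}} $$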

