Policy Gradients
Partial Observability
Markov property is not actually used!
Can use policy gradient in partially observed MDPs without modification.
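Concretely, since the derivation never uses the Markov property, the same estimator can be written with observations in place of states (a sketch following the standard form):
$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid o_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right) $$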
What is Wrong with the Policy Gradient
One problem with the policy gradient is its high variance.
There is also a causality structure in reinforcement learning that we can exploit: the current policy cannot influence rewards that were already received, i.e. the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$.
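Exploiting this causality, each log-probability term only needs to be weighted by the rewards that come after it (the "reward-to-go"), dropping terms that are zero in expectation but add variance (a sketch of the standard estimator):
$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t} $$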
Baselines
Subtracting a baseline is unbiased in expectation!
Average reward is not the best baseline, but it’s pretty good!
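A quick check of the unbiasedness claim: for a constant baseline $b$ (e.g. the average return), the extra term vanishes in expectation:
$$ E_{\tau \sim \pi_\theta(\tau)}\!\left[\nabla_\theta \log \pi_\theta(\tau)\, b\right] = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, b\, d\tau = \int \nabla_\theta \pi_\theta(\tau)\, b\, d\tau = b\, \nabla_\theta \int \pi_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0 $$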
Analyzing Variance
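A sketch of the standard analysis: writing $g(\tau) = \nabla_\theta \log \pi_\theta(\tau)$ and minimizing the variance of $g(\tau)\,(r(\tau) - b)$ with respect to $b$ gives, per gradient dimension, the optimal baseline
$$ \frac{d\,\mathrm{Var}}{db} = -2\, E\!\left[g(\tau)^2\,(r(\tau) - b)\right] = 0 \quad\Longrightarrow\quad b = \frac{E\!\left[g(\tau)^2\, r(\tau)\right]}{E\!\left[g(\tau)^2\right]} $$
so the best baseline is a reward average weighted by squared gradient magnitudes, not the plain average reward.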
Section Review
- The high variance of policy gradient
- Exploiting causality
- Future doesn’t affect the past
- Baselines
- Unbiased
- Analyzing variance
- Can derive optimal baselines
Policy Gradient is On-policy
The biggest problem with the policy gradient method is that it is on-policy: each optimization step requires sampling trajectories from the current policy, and even a small change to the neural network parameters produces a different policy, so on-policy learning is extremely sample-inefficient.
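The on-policy requirement is visible directly in the gradient itself: the expectation is taken under the current policy's own trajectory distribution, so the samples must be regenerated after every parameter update:
$$ \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta(\tau)}\!\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right] $$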
Off-policy Learning & Importance Sampling
The fix is importance sampling. Its core idea is to rewrite an expectation under the hard-to-sample distribution $\pi_\theta(\tau)$ as an expectation over samples $x'$ drawn from another, easier-to-sample distribution $\bar{\pi}(\tau)$ (for example, an earlier policy), corrected by an importance weight.
The last formula in the figure below is the important one; it gets substituted in later.
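The key identity (presumably the final formula in that figure, reconstructed here) is the importance sampling rewrite of an expectation:
$$ E_{x \sim p(x)}[f(x)] = \int p(x)\, f(x)\, dx = \int q(x)\, \frac{p(x)}{q(x)}\, f(x)\, dx = E_{x \sim q(x)}\!\left[\frac{p(x)}{q(x)}\, f(x)\right] $$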
Deriving the Policy Gradient with IS
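A sketch of the standard derivation: with samples from an old policy $\pi_\theta$ and the objective evaluated at new parameters $\theta'$,
$$ J(\theta') = E_{\tau \sim \pi_\theta(\tau)}\!\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, r(\tau)\right], \qquad \nabla_{\theta'} J(\theta') = E_{\tau \sim \pi_\theta(\tau)}\!\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, \nabla_{\theta'} \log \pi_{\theta'}(\tau)\, r(\tau)\right] $$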
The Off-Policy Policy Gradient
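Because the initial-state and transition probabilities cancel in the ratio, only the policy terms remain, a product over time steps whose value can explode or vanish exponentially in $T$:
$$ \frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} = \frac{p(s_1)\prod_{t=1}^{T}\pi_{\theta'}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}{p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} $$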
A First-Order Approximation for IS
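A common first-order approximation keeps only the per-step action-probability ratio instead of the full product, which is no longer exact but avoids the exponentially growing importance weights (a sketch of the resulting estimator):
$$ \nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \frac{\pi_{\theta'}(a_{i,t} \mid s_{i,t})}{\pi_\theta(a_{i,t} \mid s_{i,t})}\, \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t} $$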
Policy Gradient with Automatic Differentiation
Pseudo-code example (with discrete actions):
Maximum likelihood
# Given:
# actions - (N*T) x Da tensor of actions
# states - (N*T) x Ds tensor of states
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)
gradients = tf.gradients(loss, variables)
Policy Gradient
# Given:
# actions - (N*T) x Da tensor of actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, q_values)
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables)
$$ \tilde{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t} $$
Each log-probability is weighted by the estimated reward-to-go, i.e. the $\hat{Q}$ value; differentiating this weighted pseudo-loss with automatic differentiation yields exactly the policy gradient.
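As a complementary sketch (not from the original code above), the same weighted pseudo-loss can be written in TF2 eager style with tf.GradientTape; policy_net, optimizer, and the tensor names below are assumed, not given in the original:
# Assumed: policy_net is a Keras model mapping (N*T) x Ds states to (N*T) x Da logits,
# actions_onehot is (N*T) x Da, q_values is (N*T) x 1; all names here are hypothetical.
import tensorflow as tf

def policy_gradient_step(policy_net, optimizer, states, actions_onehot, q_values):
    with tf.GradientTape() as tape:
        logits = policy_net(states)  # (N*T) x Da action logits
        neg_log_probs = tf.nn.softmax_cross_entropy_with_logits(
            labels=actions_onehot, logits=logits)  # per-sample -log pi(a|s)
        # Weight each -log-probability by its reward-to-go / Q estimate, then average.
        loss = tf.reduce_mean(neg_log_probs * tf.squeeze(q_values, axis=-1))
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return loss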
Policy Gradient in Practice
Remember that the gradient has high variance
- This isn't the same as supervised learning!
- Gradients will be really noisy!
- Consider using much larger batches
- Tweaking learning rates is very hard
- Adaptive step size rules like ADAM can be OK-ish
Section Review
Covariant/Natural Policy Gradient
A problem with the vanilla policy gradient is that some parameters change the policy much more than others, so a single step size is poorly matched to all parameter directions.
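The natural (covariant) gradient addresses this by rescaling the update with the Fisher information matrix $F$, so that step sizes are measured by how much the policy distribution changes rather than how much the raw parameters change (a sketch of the standard update):
$$ F = E_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\right], \qquad \theta \leftarrow \theta + \alpha\, F^{-1} \nabla_\theta J(\theta) $$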
TRPO
Trust region policy optimization: deep RL with natural policy gradient and adaptive step size.
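TRPO's update can be summarized as maximizing an importance-sampled surrogate objective inside a KL-divergence trust region (a sketch of the standard formulation):
$$ \max_{\theta'} \; E_{s,a \sim \pi_\theta}\!\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}\, A^{\pi_\theta}(s, a)\right] \quad \text{s.t.} \quad E_{s \sim \pi_\theta}\!\left[D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_{\theta'}(\cdot \mid s)\big)\right] \le \delta $$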
Policy Gradient Suggested Readings
Classic Papers
- Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning: introduces REINFORCE algorithm
- Baxter & Bartlett (2001). Infinite-horizon policy-gradient estimation: temporally decomposed policy gradient
- Peters & Schaal (2008). Reinforcement learning of motor skills with policy gradients: very accessible overview of optimal baselines and natural gradient
DRL Policy Gradient Papers
- Levine & Koltun (2013). Guided policy search: deep RL with importance sampled policy gradient
- Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization: deep RL with natural policy gradient and adaptive step size
- Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient