Meta-reinforcement Learning
In supervised meta-learning, the adaptation data is given to us. In meta-reinforcement learning, the agent has to collect its own adaptation data.
Recurrent vs. Gradient-Based Adaptation
Implement the policy as a recurrent network and train it across a set of tasks.
Pro: general and expressive. Con: not consistent (more adaptation data does not guarantee convergence to the task's optimal policy).
Persist the hidden state across episode boundaries for continued adaptation.
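A minimal sketch of the recurrent approach (in the spirit of RL^2), assuming a discrete action space and PyTorch; the class and argument names are illustrative, not from any particular codebase:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Recurrent meta-RL policy: the GRU hidden state carries the task
    information gathered so far, so adaptation happens implicitly as the
    hidden state is updated from (observation, previous action, reward)."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        # Input: current observation + previous action (one-hot) + previous reward
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.logits = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs, prev_action, prev_reward, hidden):
        x = torch.cat([obs, prev_action, prev_reward], dim=-1).unsqueeze(1)
        out, hidden = self.gru(x, hidden)           # update the implicit task belief
        return self.logits(out.squeeze(1)), hidden  # action logits + new hidden state

# Key point from the notes: do NOT reset `hidden` at episode boundaries
# within a task, so the policy keeps adapting across episodes.
```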
Implement adaptation as a policy-gradient update on the data collected in the new task (gradient-based meta-RL).
Pro: consistent (with enough data it reduces to standard policy-gradient RL). Con: less expressive.
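A rough sketch of the gradient-based inner loop (MAML-style), where `policy_gradient_loss` is an assumed helper that computes a REINFORCE-style surrogate loss on the collected trajectories:

```python
import torch

def adapt(policy, adaptation_trajs, inner_lr=0.1):
    """One inner policy-gradient step on trajectories the agent collected
    in the new task.  Because adaptation is a plain gradient step, the
    procedure is consistent: with enough data it reduces to ordinary
    policy-gradient RL on that task."""
    loss = policy_gradient_loss(policy, adaptation_trajs)  # assumed helper
    grads = torch.autograd.grad(loss, policy.parameters(), create_graph=True)
    # create_graph=True keeps the graph so the outer (meta) loss can
    # backpropagate through this update; return adapted parameters
    # without overwriting the meta-parameters.
    return [p - inner_lr * g for p, g in zip(policy.parameters(), grads)]
```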
How these Algorithms Learn to Explore
Exploration requires stochasticity, but optimal policies are typically deterministic.
Typical methods add independent noise at each time step, so the resulting exploration is not temporally extended.
Temporally extended exploration with MAESN: condition the policy on a latent variable that is sampled once per episode from a learned, task-specific distribution and held fixed, so the exploration noise is coherent across the whole episode.
See more from MAESN
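A toy sketch of the idea (sampling a latent once per episode and holding it fixed, as MAESN does), assuming a classic Gym-style environment and a policy that takes the latent as an extra input; the full method additionally adapts the latent distribution per task with policy gradients:

```python
import torch

def run_episode(env, policy, task_mu, task_log_std):
    """Sample one latent z at the start of the episode and hold it fixed,
    so the injected noise is temporally extended rather than independent
    at each time step."""
    z = task_mu + task_log_std.exp() * torch.randn_like(task_mu)  # sampled once per episode
    obs, done = env.reset(), False
    while not done:
        action = policy(torch.as_tensor(obs, dtype=torch.float32), z)  # same z at every step
        obs, reward, done, info = env.step(action)
```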
Differences between Meta-RL Approaches
Why is Off-policy Meta-RL Difficult?
PEARL
POMDP
The state is unobserved (hidden); the observation gives only incomplete information about the state, e.g., incomplete sensor data.
Model belief over latent task variables
RL with task-belief states
Encoder design: the context encoder should be permutation-invariant, since the order of the collected context transitions carries no information about the task.
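A sketch of a permutation-invariant encoder in the spirit of PEARL's design: each context transition produces a Gaussian factor, and the posterior over the latent task variable is the product of those factors. The layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """Maps a set of context transitions (s, a, r, s') to a Gaussian
    posterior over the latent task variable z.  Taking a product over
    per-transition factors makes the encoder permutation-invariant."""

    def __init__(self, transition_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),  # mean and log-variance per factor
        )

    def forward(self, context):                     # context: (num_transitions, transition_dim)
        mu, log_var = self.net(context).chunk(2, dim=-1)
        precision = torch.exp(-log_var)             # 1 / sigma^2 for each Gaussian factor
        post_var = 1.0 / precision.sum(dim=0)       # product of Gaussians: precisions add
        post_mu = post_var * (precision * mu).sum(dim=0)
        return post_mu, post_var                    # posterior q(z | context)
```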
SAC
See more from Soft Actor-Critic.
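For reference, SAC maximizes the entropy-regularized return, where $\alpha$ is a temperature trading off reward against policy entropy:

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
$$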
Integrating Task-belief with SAC
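A condensed sketch of how the task belief plugs into SAC, roughly following PEARL: sample z from the encoder's posterior, condition the actor and critics on z, and regularize the posterior toward a unit-Gaussian prior with a KL term. It assumes the actor returns both an action sample and its log-probability, and `soft_bellman_target` is an assumed helper; the names and weighting are illustrative.

```python
import torch
import torch.nn.functional as F

def pearl_style_losses(encoder, actor, critic, batch, context, kl_weight=0.1):
    """Losses for a single task: infer z from context, then run
    SAC-style updates with everything conditioned on z."""
    post_mu, post_var = encoder(context)
    z = post_mu + post_var.sqrt() * torch.randn_like(post_mu)   # posterior sample

    obs, action, reward, next_obs = batch
    z_batch = z.expand(obs.shape[0], -1)

    # Critic regresses toward a soft Bellman target (target networks omitted).
    q_pred = critic(obs, action, z_batch)
    critic_loss = F.mse_loss(q_pred, soft_bellman_target(reward, next_obs, z_batch))  # assumed helper

    # Actor maximizes the soft Q-value of its own actions; z is detached so
    # actor gradients do not flow into the encoder (entropy temperature omitted).
    new_action, log_prob = actor(obs, z_batch.detach())
    actor_loss = (log_prob - critic(obs, new_action, z_batch.detach())).mean()

    # KL between the diagonal-Gaussian task posterior and a standard normal prior.
    kl = 0.5 * (post_var + post_mu ** 2 - 1.0 - post_var.log()).sum()
    return critic_loss + kl_weight * kl, actor_loss
```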
Summary
- Building on policy-gradient RL, we can implement meta-RL algorithms via a recurrent network or via gradient-based adaptation.
- Adaptation in meta-RL involves both exploring and learning to perform well.
- We can improve exploration by conditioning the policy on latent variables held constant across an episode, resulting in temporally coherent strategies.
- Meta-RL can be expressed as a particular kind of POMDP.
- We can do meta-RL by inferring a belief over the task, exploring via posterior sampling from this belief, and combining this with SAC for a sample-efficient algorithm.