Model-based RL
Main idea: learn a model of the environment.
Reasons:
- Often leads to better sample efficiency
- The learned model can be reused (e.g. for new rewards/tasks)
Gradient-based Optimization
Gradient-free Optimization
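As an illustration of the gradient-free option, here is a minimal sketch of a cross-entropy method (CEM) planner that optimizes an action sequence against a learned model; `dynamics_model` and `reward_fn` are hypothetical placeholders for whatever model and reward you have.

```python
# Minimal CEM sketch for gradient-free action optimization against a learned model.
# `dynamics_model(s, a) -> s_next` and `reward_fn(s, a) -> float` are assumed interfaces.
import numpy as np

def cem_plan(dynamics_model, reward_fn, s0, horizon=10, action_dim=2,
             n_samples=500, n_elites=50, n_iters=5):
    """Return an action sequence maximizing predicted return under the model."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # 1. Sample candidate action sequences from the current Gaussian.
        actions = mean + std * np.random.randn(n_samples, horizon, action_dim)
        returns = np.zeros(n_samples)
        # 2. Score each sequence by rolling it out through the learned model.
        for i in range(n_samples):
            s = s0
            for t in range(horizon):
                returns[i] += reward_fn(s, actions[i, t])
                s = dynamics_model(s, actions[i, t])
        # 3. Refit the sampling distribution to the elite (highest-return) sequences.
        elites = actions[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```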
Action optimization will exploit imprecision in the learned model; refitting the model on newly collected data helps.
But generally, learning a good global model is hard.
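A minimal sketch of the resulting loop, assuming a classic Gym-style `env`, a hypothetical `fit_model` helper, and the `cem_plan` sketch above: it executes only the first planned action before replanning and keeps refitting the model on the aggregated data, which is the model-predictive control pattern introduced next.

```python
# Outer loop sketch: refit the model on newly collected data so the planner cannot
# keep exploiting its imprecision. Classic Gym step API (obs, reward, done, info) assumed.
def model_based_rl(env, fit_model, reward_fn, n_outer_iters=20, steps_per_iter=200):
    data = []                                        # (s, a, s') transitions
    s = env.reset()
    for _ in range(steps_per_iter):                  # bootstrap with a random policy
        a = env.action_space.sample()
        s_next, _, done, _ = env.step(a)
        data.append((s, a, s_next))
        s = env.reset() if done else s_next

    for _ in range(n_outer_iters):
        dynamics_model = fit_model(data)             # refit on all data collected so far
        s = env.reset()
        for _ in range(steps_per_iter):
            plan = cem_plan(dynamics_model, reward_fn, s)
            s_next, _, done, _ = env.step(plan[0])   # execute only the first planned action
            data.append((s, plan[0], s_next))        # aggregate the new transition
            s = env.reset() if done else s_next
    return dynamics_model
```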
Model-Predictive Control (MPC)
Do you know the form of the reward $r_i(s,a)$ for each task?
Yes
- You can learn a single model and plan w.r.t. each $r_i$ at test time (see the sketch after this list)
- Be careful: the reward will change how you collect data
- Model-based RL is sample efficient enough to train on real hardware
No
- Multi-task RL: learn $r_\theta(s,a,z_i)$, use it to plan
- Meta-RL: meta-learn $r_\theta(s,a,D_i^{tr})$, use it to plan
- These solve the multi-task RL & meta-RL problem statements, respectively
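A minimal sketch of how one dynamics model is reused across tasks, building on the `cem_plan` sketch above; the `task` structure and `learned_reward` (standing in for $r_\theta(s,a,z_i)$) are hypothetical.

```python
# Reuse one learned dynamics model for every task: only the reward plugged into the
# planner changes. All names here are illustrative assumptions.
def plan_for_task(dynamics_model, s0, task, learned_reward=None):
    if task.get("reward_fn") is not None:            # "Yes": the reward form r_i is known
        reward_fn = task["reward_fn"]
    else:                                            # "No": use a learned r_theta(s, a, z_i)
        reward_fn = lambda s, a: learned_reward(s, a, task["z"])
    return cem_plan(dynamics_model, reward_fn, s0)   # same model, task-specific reward
```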
Model-based RL with Image Observations
Approaches:
- Learn model in latent space
- Learn model of observations (e.g. video)
- Predict alternative quantities
Learning in Latent Space
Key idea: learn embedding $s_t = g(o_t)$, then do model-based RL in latent space.
Algorithm: collect data, learn the embedding $s_t = g(o_t)$ together with a dynamics model in the latent space, plan with the learned model, and repeat with newly collected data.
See more: learn embedding & model jointly, as in Deep Spatial Autoencoders for Visuomotor Learning.
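A minimal PyTorch sketch of learning the embedding $s_t = g(o_t)$ jointly with a latent dynamics model via reconstruction and latent-prediction losses; the fully connected architecture (flattened observations) and equal loss weighting are simplifying assumptions, not the models from the paper above.

```python
# Joint training sketch: encoder g(o_t), decoder for reconstruction, and a forward
# model f(s_t, a_t) in the latent space. Convolutional encoders would be used for
# real image observations; flat vectors are used here for brevity.
import torch
import torch.nn as nn

class LatentSpaceModel(nn.Module):
    def __init__(self, obs_dim, latent_dim, action_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))       # s_t = g(o_t)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))          # reconstruct o_t
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
                                      nn.Linear(256, latent_dim))      # s_{t+1} = f(s_t, a_t)

    def loss(self, o_t, a_t, o_next):
        s_t, s_next = self.encoder(o_t), self.encoder(o_next)
        recon = ((self.decoder(s_t) - o_t) ** 2).mean()                # reconstruction loss
        # Latent forward-prediction loss (target embedding detached as a design choice).
        pred = ((self.dynamics(torch.cat([s_t, a_t], -1)) - s_next.detach()) ** 2).mean()
        return recon + pred                                            # train everything jointly

# Usage (assumed loader of (o_t, a_t, o_{t+1}) batches):
# model = LatentSpaceModel(obs_dim=64*64*3, latent_dim=32, action_dim=4)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# for o_t, a_t, o_next in loader:
#     opt.zero_grad(); model.loss(o_t, a_t, o_next).backward(); opt.step()
```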
Summary for Learning in Latent Space
Pros:
- Learn complex visual skills very efficiently
- Structured representation enables effective learning
Cons:
- Reconstruction objectives might not recover the right representation
Need better unsupervised representation learning methods.
Models Directly in Image Space
MPC consists of three main steps:
- Step 1: use a random policy to collect samples as a dataset.
- Step 2: train an environment (dynamics) model on the collected samples.
- Step 3: use the environment model to optimize action sequences.
A concrete example follows:
How to predict the next frame of a video:
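A minimal PyTorch sketch of an action-conditioned next-frame predictor trained with a pixel reconstruction loss; real video-prediction models (as used in visual foresight work) are recurrent and far more sophisticated, so this only illustrates the interface.

```python
# Predict o_{t+1} from (o_t, a_t): encode the frame, inject the action into the
# feature map, and decode back to pixels. Architecture details are assumptions.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, channels=3, action_dim=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.action_fc = nn.Linear(action_dim, 64)            # broadcast action into features
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 4, stride=2, padding=1))

    def forward(self, frame, action):
        h = self.conv(frame)                                  # encode current frame o_t
        h = h + self.action_fc(action)[:, :, None, None]      # condition on action a_t
        return self.deconv(h)                                 # predicted next frame o_{t+1}

# Trained by minimizing ((predictor(o_t, a_t) - o_{t+1}) ** 2).mean() over real transitions.
```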
How to plan:
- Consider potential action sequences
- Predict the future for each action sequence
- Pick best future & execute corresponding action
- Repeat 1-3 to replan in real time
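A minimal sketch of these four steps using random shooting; `predictor` (e.g. the next-frame model above) and `score_future` (e.g. how close a designated pixel is to its goal) are assumed placeholders.

```python
# Visual MPC sketch: sample action sequences, roll each out through the learned
# video model, score the predicted futures, and execute the best first action.
import torch

def visual_mpc_step(predictor, score_future, frame, horizon=5,
                    n_samples=200, action_dim=4):
    # frame: current observation of shape (1, C, H, W)
    actions = torch.randn(n_samples, horizon, action_dim)     # 1. candidate action sequences
    frames = frame.expand(n_samples, *frame.shape[1:]).clone()
    scores = torch.zeros(n_samples)
    with torch.no_grad():
        for t in range(horizon):                              # 2. predict the future for each
            frames = predictor(frames, actions[:, t])
            scores += score_future(frames)
    best = scores.argmax()                                    # 3. pick the best future
    return actions[best, 0]                                   # 4. execute first action, then replan
```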
This is visual model-predictive control (visual MPC).
A concrete example:
See more from Deep visual foresight for planning robot motion
An improved version:
See more from Improvisation through Physical Understanding
Predict Alternative Quantities
Model-Based Meta-RL
What is an RL Task?
Deriving tasks from dynamic environments:
A concrete example:
See more from Learning to Adapt Meta-Learning for Model-Based Control
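A minimal sketch of the adaptation step, assuming a meta-learned dynamics model with a `model(s, a)` interface; this is a simplified take on online adaptation in the spirit of the paper above, not its exact procedure.

```python
# At every step, adapt a copy of the meta-learned model to the most recent
# transitions (the current dynamics), then plan with the adapted model.
import copy
import torch

def adapt_model(meta_model, recent_transitions, inner_lr=1e-3, inner_steps=5):
    model = copy.deepcopy(meta_model)                 # start from the meta-learned weights
    opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
    s, a, s_next = recent_transitions                 # tensors holding the last K transitions
    for _ in range(inner_steps):
        loss = ((model(s, a) - s_next) ** 2).mean()   # one-step prediction error
        opt.zero_grad(); loss.backward(); opt.step()
    return model                                      # plan with this model (e.g. CEM sketch above)
```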
Model-Based vs Model-Free
Model-based
Pros:
- Easy to collect data in a scalable way (self-supervised)
- Easy to transfer across rewards
- Typically requires a smaller quantity of reward-supervised data
Cons:
- Models don't optimize for task performance
- Sometimes harder to learn than a policy
- Often needs assumptions to learn complex skills (continuity, resets)
Model-Free
Pros:
- Makes few assumptions beyond a reward function
- Effective for learning complex policies
Cons:
- Requires a lot of experience (slower)
- Harder optimization problem in the multi-task setting