Terminology & Notation
The reward function $r(s, a)$ determines which states and actions are better or worse.
$s, a, r(s,a), p(s'|s,a)$ define a Markov decision process: states, actions, the reward function, and the transition dynamics.
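A minimal sketch of how these pieces can be grouped in code for a small tabular MDP; the class and argument names (`TabularMDP`, `rewards`, `transitions`) are illustrative assumptions, not from the lecture.

```python
import numpy as np

class TabularMDP:
    """Hypothetical container for states, actions, r(s, a), and p(s' | s, a)."""
    def __init__(self, n_states, n_actions, rewards, transitions, gamma=0.99):
        assert rewards.shape == (n_states, n_actions)                 # r(s, a)
        assert transitions.shape == (n_states, n_actions, n_states)   # p(s' | s, a)
        self.n_states, self.n_actions = n_states, n_actions
        self.r, self.p, self.gamma = rewards, transitions, gamma

    def step(self, s, a, rng=np.random.default_rng()):
        # Sample the next state from p(. | s, a) and return the reward r(s, a).
        s_next = rng.choice(self.n_states, p=self.p[s, a])
        return s_next, self.r[s, a]
```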
The Goal of Reinforcement Learning
What is a Reinforcement Learning Task
The Goal of Multi-Task Reinforcement Learning
Multi-task RL
The same as before, except a task identifier is part of the state: $s = (\bar{s}, z_i)$, e.g. a one-hot task ID, a language description, or a desired goal state $z_i = s_g$, which gives goal-conditioned RL.
The reward is the same as before, or for goal-conditioned RL:
$$ r(s) = r(\bar{s}, s_g) = -d(\bar{s}, s_g) $$
Distance function $d$ examples (see the sketch after this list):
- Euclidean $L_2$
- Sparse 0/1
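A minimal sketch of the goal-conditioned reward $r(\bar{s}, s_g) = -d(\bar{s}, s_g)$ with the two distance choices above; the success threshold `eps` is an illustrative assumption.

```python
import numpy as np

def goal_reward(s_bar, s_goal, kind="l2", eps=0.05):
    """r(s) = -d(s_bar, s_goal) for a chosen distance d."""
    if kind == "l2":
        # Euclidean distance: a dense, shaped reward.
        return -np.linalg.norm(s_bar - s_goal)
    elif kind == "sparse":
        # Sparse 0/1: reward 0 within eps of the goal, -1 otherwise.
        return 0.0 if np.linalg.norm(s_bar - s_goal) < eps else -1.0
    raise ValueError(kind)
```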
The Anatomy of a RL Algorithm
Evaluating the Objective
$$ \theta^* = \arg\max_{\theta} J(\theta) $$
Estimate the expectation with a sum over trajectories sampled from $\pi_\theta$:
$$ J(\theta) = E_{\tau \sim p_{\theta}(\tau)} \left [ \sum_{t} r(s_t, a_t) \right ] \approx \frac{1}{N} \sum_{i}\sum_{t}r(s_{i,t}, a_{i,t}) $$
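A sketch of this Monte Carlo estimate of $J(\theta)$: roll out $N$ trajectories under $\pi_\theta$ and average the summed rewards. The `env` and `policy` interfaces (`reset`/`step`/`sample`) are assumptions.

```python
def estimate_objective(env, policy, n_trajectories=100, horizon=200):
    """Approximate J(theta) = E[sum_t r(s_t, a_t)] by sampling from pi_theta."""
    returns = []
    for _ in range(n_trajectories):
        s, total = env.reset(), 0.0
        for _ in range(horizon):
            a = policy.sample(s)        # a ~ pi_theta(a | s)
            s, r, done = env.step(a)    # assumed environment interface
            total += r
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)  # (1/N) * sum_i sum_t r(s_it, a_it)
```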
Direct Policy Differentiation
Evaluating the Policy Gradient
Comparison to Maximum Likelihood
Which means:
- Good stuff is made more likely
- Bad stuff is made less likely
- Simply formalizes the notion of “trial and error”
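A sketch contrasting the policy gradient surrogate with maximum likelihood, using PyTorch for autodiff: the only difference is that each log-probability is weighted by the trajectory's return, so good trajectories are made more likely and bad ones less likely. The `policy` module (returning a `torch.distributions` object) and tensor shapes are assumptions.

```python
import torch

def policy_gradient_loss(policy, states, actions, returns):
    # REINFORCE surrogate: grad J ~ (1/N) sum_i [grad log pi(tau_i)] * R(tau_i)
    log_probs = policy(states).log_prob(actions)   # log pi_theta(a_t | s_t)
    return -(log_probs * returns).mean()           # weighted by return R(tau)

def max_likelihood_loss(policy, states, actions):
    # Maximum likelihood on the same data: every action weighted equally.
    log_probs = policy(states).log_prob(actions)
    return -log_probs.mean()
```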
Summary for Policy Gradients
Pros:
- Simple
- Easy to combine with existing multi-task & meta-learning algorithms.
Cons:
- Produces a high-variance gradient
- Can be mitigated with baselines (used by all algorithms in practice) and trust regions; see the sketch after this list
- Requires on-policy data
- Cannot reuse existing experience to estimate the gradient
- Importance weights can help, but also high variance
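A minimal sketch of the baseline idea from the cons list above: subtracting the batch-average return from each trajectory's return before weighting the log-probabilities leaves the gradient unbiased while reducing its variance. The `policy` interface is the same assumption as in the earlier sketch.

```python
import torch

def policy_gradient_loss_with_baseline(policy, states, actions, returns):
    # Constant baseline b = mean(returns): unbiased, lower-variance gradient.
    advantages = returns - returns.mean()
    log_probs = policy(states).log_prob(actions)
    return -(log_probs * advantages).mean()
```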
Q-Learning
Value-Based RL: Definitions
For the optimal policy $\pi^*$, the Q-function satisfies the Bellman equation:
$$ Q^*(s, a) = E_{s' \sim p(\cdot \mid s, a)} \left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right] $$
Fitted Q-iteration Algorithm
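A minimal sketch of fitted Q-iteration under the Bellman equation above: regress $Q_\phi(s, a)$ onto targets $r + \gamma \max_{a'} Q_\phi(s', a')$ over a dataset of transitions. The `q_net`/optimizer setup and the dataset layout are assumptions.

```python
import torch

def fitted_q_iteration(q_net, optimizer, dataset, gamma=0.99, n_iters=100):
    """dataset: tensors (s, a, r, s_next) of off-policy transitions."""
    s, a, r, s_next = dataset
    for _ in range(n_iters):
        with torch.no_grad():
            # y = r + gamma * max_a' Q_phi(s', a')
            target = r + gamma * q_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q_phi(s, a)
        loss = ((q_sa - target) ** 2).mean()                  # fit Q to targets
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return q_net
```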
Multi-Task Q-Learning
Policy:
$$ \pi_\theta(a| \bar{s}) \rightarrow \pi_\theta(a|\bar{s}, z_i) $$
Q-function:
$$ Q_{\phi} (\bar{s}, a) \rightarrow Q_\phi(\bar{s},a,z_i) $$
Analogous to multi-task supervised learning: stratified sampling, soft/hard weight sharing, etc.
What's different about RL:
- The data distribution is controlled by the agent
- You may know what aspect(s) of the MDP are changing across tasks
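A sketch of the conditioning change above: concatenate the task descriptor $z_i$ to the state before feeding the Q-network, so one set of weights is shared across tasks (hard weight sharing). Layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskQNetwork(nn.Module):
    """Q_phi(s_bar, a, z_i): the task descriptor z_i is part of the input."""
    def __init__(self, state_dim, task_dim, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + task_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per discrete action
        )

    def forward(self, s_bar, z):
        # Hard weight sharing: a single network conditioned on (s_bar, z_i).
        return self.net(torch.cat([s_bar, z], dim=-1))
```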
Goal-conditioned RL with Hindsight Relabeling
Relabeling strategies: use any state reached in the trajectory as the goal.
The result: exploration challenges are alleviated. For more, see Hindsight Experience Replay.
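A minimal sketch of hindsight relabeling: replay a collected trajectory with its goal replaced by a state actually reached (here a random future state from the same trajectory, per the strategy above), and recompute the sparse goal-conditioned reward. `goal_reward` refers to the earlier sketch; the trajectory layout is an assumption.

```python
import numpy as np

def relabel_trajectory(trajectory, rng=np.random.default_rng()):
    """trajectory: list of (s_bar, a, s_bar_next) tuples from one rollout."""
    relabeled = []
    for t, (s_bar, a, s_bar_next) in enumerate(trajectory):
        # Pick any later state from the same trajectory as the new goal.
        future = trajectory[rng.integers(t, len(trajectory))][2]
        r = goal_reward(s_bar_next, future, kind="sparse")  # see earlier sketch
        relabeled.append(((s_bar, future), a, r, (s_bar_next, future)))
    return relabeled
```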
Multi-task RL with Relabeling
Random, unlabeled interactions are optimal under a 0/1 reward for reaching their final state.
Unsupervised Visuo-motor Control through Distributional Planning Networks