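David Silver Reinforcement Learning Lecture 8

Course homepage: https://www.davidsilver.uk/teaching/

These notes review Lecture 8 of David Silver's reinforcement learning course, which gives a short introduction to integrating learning and planning.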

Introduction

We start with the basic concepts:

Model-Based and Model-Free RL

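Model-free RL
- No model
- Learn the value function (and/or policy) from experience
Model-based RL
- Learn a model from experience
- Plan the value function (and/or policy) from the model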

Model-Based RL

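Model-based RL forms a loop: experience is used to learn a model, the model is used to plan a value function (and/or policy), and acting on the result generates new experience.

Advantages:

Can efficiently learn the model with supervised learning methods
Can reason about model uncertainty

Disadvantages:

First learn a model, then construct a value function, so there are two sources of approximation error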

Model-Based Reinforcement Learning

What is a Model?

A model $\mathcal M$ is a representation of an MDP $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}\rangle$, parametrised by $\eta$
We assume the state space $\mathcal S$ and the action space $\mathcal A$ are known
So a model $\mathcal{M}=\left\langle\mathcal{P}_{\eta}, \mathcal{R}_{\eta}\right\rangle$ represents the state transitions $\mathcal{P}_{\eta} \approx \mathcal{P}$ and the rewards $\mathcal{R}_{\eta} \approx \mathcal{R}$
$\begin{aligned} &S_{t+1} \sim \mathcal{P}_{\eta}\left(S_{t+1} | S_{t}, A_{t}\right)\\ &R_{t+1}=\mathcal{R}_{\eta}\left(R_{t+1} | S_{t}, A_{t}\right) \end{aligned}$
Typically we assume that state transitions and rewards are conditionally independent
$\mathbb{P}\left[S_{t+1}, R_{t+1} | S_{t}, A_{t}\right]=\mathbb{P}\left[S_{t+1} | S_{t}, A_{t}\right] \mathbb{P}\left[R_{t+1} | S_{t}, A_{t}\right]$

Model Learning

Goal: estimate the model $\mathcal M_\eta$ from experience $\left\{S_{1}, A_{1}, R_{2}, \ldots, S_{T}\right\}$
This is a supervised learning problem
$\begin{aligned} S_{1}, A_{1}& \rightarrow R_{2}, S_{2}\\ S_{2}, A_{2} &\rightarrow R_{3}, S_{3}\\ &\vdots\\ S_{T-1}, A_{T-1} &\rightarrow R_{T}, S_{T} \end{aligned}$
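Learning $s, a \rightarrow r$ is a regression problem
Learning $s, a \rightarrow s'$ is a density estimation problem
Pick a loss function, e.g. mean-squared error or KL divergence
Find the parameters $\eta$ that minimise the empirical loss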

Table Lookup Model

The model is an explicit MDP, $\hat{\mathcal{P}}, \hat{\mathcal{R}}$
Count the visits $N(s,a)$ to each state-action pair
$\begin{aligned} \hat{\mathcal{P}}_{s, s^{\prime}}^{a} &=\frac{1}{N(s, a)} \sum_{t=1}^{T} \mathbf{1}\left(S_{t}, A_{t}, S_{t+1}=s, a, s^{\prime}\right) \\ \hat{\mathcal{R}}_{s}^{a} &=\frac{1}{N(s, a)} \sum_{t=1}^{T} \mathbf{1}\left(S_{t}, A_{t}=s, a\right) R_{t+1} \end{aligned}$
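Alternatively:
- At each time-step $t$, record the experience tuple $\left\langle S_{t}, A_{t}, R_{t+1}, S_{t+1}\right\rangle$
- To sample from the model, randomly pick a stored tuple matching $\langle s, a, \cdot, \cdot\rangle$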
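
The two table lookup variants above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the class name `TableLookupModel`, its method names and the toy transitions are all assumptions made for the example.

```python
import random
from collections import defaultdict


class TableLookupModel:
    """Count-based estimates of P(s' | s, a) and R(s, a) (hypothetical helper)."""

    def __init__(self):
        self.n = defaultdict(int)                                 # N(s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))  # visits to s' after (s, a)
        self.reward_sum = defaultdict(float)                      # summed rewards after (s, a)
        self.tuples = defaultdict(list)                           # stored (r, s') per (s, a)

    def update(self, s, a, r, s_next):
        """Record one real transition <s, a, r, s'>."""
        self.n[(s, a)] += 1
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.tuples[(s, a)].append((r, s_next))

    def p_hat(self, s, a, s_next):
        """Empirical transition probability; assumes (s, a) has been visited."""
        return self.next_counts[(s, a)][s_next] / self.n[(s, a)]

    def r_hat(self, s, a):
        """Empirical mean reward; assumes (s, a) has been visited."""
        return self.reward_sum[(s, a)] / self.n[(s, a)]

    def sample(self, s, a, rng=random):
        """Alternative variant: replay a stored tuple matching <s, a, ., .>."""
        return rng.choice(self.tuples[(s, a)])


# Toy experience, made up for illustration only.
model = TableLookupModel()
for s, a, r, s_next in [("A", 0, 0.0, "B"), ("A", 0, 0.0, "B"),
                        ("A", 0, 1.0, "T"), ("B", 0, 1.0, "T")]:
    model.update(s, a, r, s_next)

print(model.p_hat("A", 0, "B"), model.r_hat("A", 0), model.sample("B", 0))
```

Sampling stored tuples reproduces the same empirical distribution as the count-based estimates without ever forming the probabilities explicitly.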

Once a model has been learned, it is used for planning. The following sections cover several common methods.

Sample-Based Planning

A simple but powerful approach to planning
Use the model only to generate samples
Sample experience from the model
$\begin{array}{l} S_{t+1} \sim \mathcal{P}_{\eta}\left(S_{t+1} | S_{t}, A_{t}\right) \\ R_{t+1}=\mathcal{R}_{\eta}\left(R_{t+1} | S_{t}, A_{t}\right) \end{array}$
Apply model-free RL to the samples, e.g.
- Monte-Carlo control
- Sarsa
- Q-learning
Sample-based planning methods are often more efficient
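
As a rough sketch of sample-based planning, the snippet below applies Q-learning to transitions drawn from a model rather than from the environment. The function name `q_planning`, the dictionary representation of the model and the four-entry toy model are assumptions chosen for the example; the model is taken as already learned.

```python
import random
from collections import defaultdict


def q_planning(model, actions, terminal, n_updates=5000,
               alpha=0.1, gamma=0.9, seed=0):
    """Q-learning on experience sampled from a model (hypothetical helper).

    `model` maps (s, a) to a list of (reward, next_state) outcomes; drawing
    uniformly from that list stands in for sampling S' and R from the model.
    """
    rng = random.Random(seed)
    q = defaultdict(float)
    keys = sorted(model)  # fixed order so runs are reproducible
    for _ in range(n_updates):
        s, a = rng.choice(keys)                # pick a state-action pair to simulate
        r, s_next = rng.choice(model[(s, a)])  # simulated outcome from the model
        if s_next in terminal:
            target = r
        else:
            target = r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])
    return q


# Toy deterministic model, made up for illustration only.
model = {
    ("A", "left"): [(0.0, "T")],
    ("A", "right"): [(0.0, "B")],
    ("B", "left"): [(0.0, "A")],
    ("B", "right"): [(1.0, "T")],
}
q = q_planning(model, actions=["left", "right"], terminal={"T"})
print({k: round(v, 3) for k, v in sorted(q.items())})
```

No real environment step is taken here: every update uses simulated experience, so the quality of the resulting values is bounded by the quality of the model.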

Integrated Architectures

This part introduces integrated architectures:

Real and Simulated Experience

We consider two sources of experience:

Real experience, sampled from the environment (the true MDP)

$\begin{aligned} S^{\prime} &\sim \mathcal{P}_{s, s^{\prime}}^{a}\\ R&=\mathcal{R}_{s}^{a} \end{aligned}$

Simulated experience, sampled from the model (an approximate MDP)

$\begin{aligned} S^{\prime} &\sim \mathcal{P}_{\eta}\left(S^{\prime} | S, A\right)\\ R&=\mathcal{R}_{\eta}(R | S, A) \end{aligned}$

Integrating Learning and Planning

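This adds a third architecture, distinct from model-free RL and model-based RL:

Dyna
- Learn a model from real experience
- Learn and plan the value function (and/or policy) from both real and simulated experience

In the Dyna architecture, real experience drives both direct RL updates and model learning, while the learned model generates simulated experience for planning updates to the same value function (and/or policy).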

Dyna-Q Algorithm
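
The algorithm listing is not reproduced in these notes. As a stand-in, here is a minimal tabular sketch of the usual Dyna-Q loop: a Q-learning update on each real transition, a model update, then `n_planning` Q-learning updates on transitions replayed from the model. The `corridor` environment, the function names and all hyperparameters are assumptions for illustration, and the model assumes deterministic dynamics.

```python
import random
from collections import defaultdict


def dyna_q(step, start, actions, episodes=50, n_planning=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q sketch. `step(s, a)` returns (reward, next_state, done)."""
    rng = random.Random(seed)
    q = defaultdict(float)
    model = {}  # (s, a) -> last observed (r, s', done); deterministic assumption

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def backup(s, a, r, s_next, done):
        target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s_next, done = step(s, a)       # real experience
            backup(s, a, r, s_next, done)      # direct RL update
            model[(s, a)] = (r, s_next, done)  # model learning
            for _ in range(n_planning):        # planning with simulated experience
                ps, pa = rng.choice(list(model))
                backup(ps, pa, *model[(ps, pa)])
            s = s_next
    return q


def corridor(s, a):
    """Toy environment: states 0..4, reward 1 on reaching state 4."""
    s_next = min(max(s + a, 0), 4)
    return (1.0 if s_next == 4 else 0.0), s_next, s_next == 4


q = dyna_q(corridor, start=0, actions=[-1, 1])
print([round(q[(s, 1)], 3) for s in range(4)])
```

Setting `n_planning=0` reduces the sketch to plain Q-learning, which makes it easy to compare how much the simulated updates speed up learning.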

Simulation-Based Search

Forward Search

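Forward search algorithms select the best action by lookahead
They build a search tree with the current state $s_t$ at the root
They use a model of the MDP to look ahead
There is no need to solve the whole MDP, only the sub-MDP starting from now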

Simulation-Based Search

A forward search paradigm that uses sample-based planning
Simulate episodes of experience from now with the model
$\left\{s_{t}^{k}, A_{t}^{k}, R_{t+1}^{k}, \ldots, S_{T}^{k}\right\}_{k=1}^{K} \sim \mathcal{M}_{\nu}$
Apply model-free RL to the simulated episodes
- Monte-Carlo control $\rightarrow$ Monte-Carlo search
- Sarsa $\rightarrow$ TD search

Here is a concrete example:

Simple Monte-Carlo Search

Given a model $\mathcal M_{\nu}$ and a simulation policy $\pi$
For each action $a\in \mathcal A$
- Simulate $K$ episodes starting from the current (real) state $s_t$
  $\left\{s_{t}, a, R_{t+1}^{k}, S_{t+1}^{k}, A_{t+1}^{k}, \ldots, S_{T}^{k}\right\}_{k=1}^{K} \sim \mathcal{M}_{\nu}, \pi$
- Evaluate the action by the mean return (Monte-Carlo evaluation)
  $Q(s_{t}, a)=\frac{1}{K} \sum_{k=1}^{K} G_{t}^{k} \stackrel{P}{\rightarrow} q_{\pi}(s_{t}, a)$
Select the current (real) action with the maximum $Q$ value
$a_{t}=\underset{a \in \mathcal{A}}{\operatorname{argmax}} Q\left(s_{t}, a\right)$
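
The procedure above can be sketched as follows. The model is treated as a black-box sampler `model(s, a, rng)` returning `(reward, next_state, done)`; that interface, the toy random-walk model and the uniform simulation policy are assumptions made for the example.

```python
import random


def simple_mc_search(model, state, actions, rollout_policy,
                     k=200, gamma=1.0, seed=0):
    """Return (best_action, Q), where Q[a] is the mean return over K simulated episodes."""
    rng = random.Random(seed)
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(k):
            s, act, g, discount, done = state, a, 0.0, 1.0, False
            while not done:
                r, s, done = model(s, act, rng)   # simulate one step
                g += discount * r
                discount *= gamma
                if not done:
                    act = rollout_policy(s, rng)  # follow the simulation policy
            total += g
        q[a] = total / k                          # Monte-Carlo evaluation
    return max(q, key=q.get), q


def walk_model(s, a, rng):
    """Toy model: states 0..3, terminal at 0 (reward 0) and 3 (reward 1)."""
    s_next = s + a
    return (1.0 if s_next == 3 else 0.0), s_next, s_next in (0, 3)


def uniform_policy(s, rng):
    return rng.choice([-1, 1])


best, q = simple_mc_search(walk_model, state=1, actions=[-1, 1],
                           rollout_policy=uniform_policy)
print(best, q)
```

Only the first action is chosen by the search; everything after it follows the fixed simulation policy, so the estimates converge to $q_\pi$ rather than to the optimal values.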

Advantages of MC Tree Search

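Highly selective best-first search
Evaluates states dynamically (unlike DP)
Uses sampling to break the curse of dimensionality
Works for black-box models (only requires samples)
Computationally efficient and parallelisable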

Temporal-Difference Search

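Simulation-based search
Uses TD instead of MC (bootstrapping)
MC tree search applies MC control to the sub-MDP from now
TD search applies Sarsa to the sub-MDP from now

The difference from the MC approach lies in how $Q$ is computed: after each simulated step, the action values are updated by Sarsa: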

$\Delta Q(S, A)=\alpha\left(R+\gamma Q\left(S^{\prime}, A^{\prime}\right)-Q(S, A)\right)$
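
A minimal tabular sketch of TD search: simulate episodes from the current state with an $\epsilon$-greedy policy and apply the Sarsa update above after every simulated step. The black-box model interface, the toy random-walk model and the hyperparameters are assumptions for illustration only.

```python
import random
from collections import defaultdict


def td_search(model, root, actions, n_episodes=2000,
              alpha=0.1, gamma=1.0, epsilon=0.2, seed=0):
    """Sarsa applied to simulated episodes that all start from `root`."""
    rng = random.Random(seed)
    q = defaultdict(float)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    for _ in range(n_episodes):
        s, done = root, False
        a = eps_greedy(s)
        while not done:
            r, s_next, done = model(s, a, rng)  # simulated step
            a_next = None if done else eps_greedy(s_next)
            # Sarsa target bootstraps from Q(S', A') unless the episode ended
            target = r if done else r + gamma * q[(s_next, a_next)]
            q[(s, a)] += alpha * (target - q[(s, a)])
            s, a = s_next, a_next
    return max(actions, key=lambda a: q[(root, a)]), q


def walk_model(s, a, rng):
    """Toy model: states 0..3, terminal at 0 (reward 0) and 3 (reward 1)."""
    s_next = s + a
    return (1.0 if s_next == 3 else 0.0), s_next, s_next in (0, 3)


best, q = td_search(walk_model, root=1, actions=[-1, 1])
print(best, round(q[(1, 1)], 3), round(q[(2, 1)], 3))
```

Unlike the simple Monte-Carlo search, the simulation policy here improves during the search because it is $\epsilon$-greedy with respect to the values being learned.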

MC vs. TD search

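Bootstrapping helps both model-free RL and simulation-based search
- TD learning reduces variance but increases bias
- TD learning is usually more efficient than MC
- $\text{TD}(\lambda)$ can be much more efficient than MC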

Dyna-2

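In Dyna-2, the agent stores two sets of feature weights
- Long-term memory
- Short-term (working) memory
Long-term memory is updated from real experience using TD learning
- General domain knowledge that applies to any episode
Short-term memory is updated from simulated experience using TD search
- Specific local knowledge about the current situation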
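
One way to picture the two memories is as two weight vectors over linear features, with the search value taken as the sum of both contributions. The sketch below makes several assumptions that are not stated in these notes: both memories share the same feature vector `phi`, the updates are one-step linear TD, and the class and method names are hypothetical.

```python
class Dyna2Memory:
    """Two linear weight vectors: long-term (theta) and short-term (theta_bar)."""

    def __init__(self, n_features, alpha=0.1, gamma=1.0):
        self.theta = [0.0] * n_features      # long-term memory
        self.theta_bar = [0.0] * n_features  # short-term (working) memory
        self.alpha, self.gamma = alpha, gamma

    @staticmethod
    def _dot(w, phi):
        return sum(wi * xi for wi, xi in zip(w, phi))

    def q_long(self, phi):
        """Value from long-term memory only."""
        return self._dot(self.theta, phi)

    def q_combined(self, phi):
        """Value used during search: long-term plus short-term contribution."""
        return self._dot(self.theta, phi) + self._dot(self.theta_bar, phi)

    def learn_from_real(self, phi, r, phi_next=None):
        """TD learning on real experience; phi_next=None marks a terminal step."""
        bootstrap = self.gamma * self.q_long(phi_next) if phi_next is not None else 0.0
        delta = r + bootstrap - self.q_long(phi)
        self.theta = [w + self.alpha * delta * x for w, x in zip(self.theta, phi)]

    def search_from_simulated(self, phi, r, phi_next=None):
        """TD search on simulated experience; only the short-term weights change."""
        bootstrap = self.gamma * self.q_combined(phi_next) if phi_next is not None else 0.0
        delta = r + bootstrap - self.q_combined(phi)
        self.theta_bar = [w + self.alpha * delta * x for w, x in zip(self.theta_bar, phi)]

    def clear_short_term(self):
        """Forget situation-specific knowledge, e.g. when the situation changes."""
        self.theta_bar = [0.0] * len(self.theta_bar)


mem = Dyna2Memory(n_features=2, alpha=0.5)
phi = [1.0, 0.0]
mem.learn_from_real(phi, r=1.0)        # real experience updates long-term memory
mem.search_from_simulated(phi, r=0.0)  # simulated experience adjusts short-term memory
print(mem.q_long(phi), mem.q_combined(phi))
```

The short-term weights act as a local correction on top of the long-term values, so clearing them returns the agent to its general knowledge.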