David silver 强化学习 Lecture 10

课程主页： https://www.davidsilver.uk/teaching/

这里回顾David silver 强化学习 Lecture 10的课程内容，这一讲简单介绍了经典游戏。

Game Theory

Optimality in Games

第$i$个玩家的最佳策略$\pi^i$是什么，如果其他的玩家固定他们的策略$\pi^{-i}$？
最优响应$\pi_{*}^{i}\left(\pi^{-i}\right)$为对抗这些策略的最优策略
纳什均衡是所有玩家的共同策略，其中
$\pi^{i}=\pi_{*}^{i}\left(\pi^{-i}\right)$
使得每个玩家的策略都是最佳响应
即没有玩家会选择偏离纳什

Single-Agent and Self-Play Reinforcement Learning

最优响应是解决单智能体RL问题
- 其他参与者成为环境的一部分
- 游戏简化为MDP
- 最佳响应是此MDP的最佳策略
纳什均衡是self-play RL的不动点
- 经验是通过在智能之间玩游戏而产生的
  $a_{1} \sim \pi^{1}, a_{2} \sim \pi^{2}, \ldots$
- 每个智能体都学会了对其他玩家的最佳反应
- 一个玩家的策略决定了另一个玩家的环境
- 所有玩家都在互相适应

Two-Player Zero-Sum Games

我们将专注于特殊类别的游戏：

一个两人游戏包含两个（交替）玩家
- 我们将玩家1命名为白方，将玩家2命名为黑方
零和博弈对黑白双方的奖励相等且相反
$R^{1}+R^{2}=0$

我们考虑在这些游戏中找出纳什均衡的方法

游戏树搜索（即规划）
self-paly强化学习

Perfect and Imperfect Information Games

完全信息或马尔可夫游戏是完全可观测的
- 棋
- 跳棋
- 黑白棋
- 西洋双陆棋
- 围棋
不完全信息游戏是部分可观测的
- 拼字游戏
- 扑克
我们首先专注于完全信息游戏

Minimax

价值函数定义了联合策略$\pi=\left\langle\pi^{1}, \pi^{2}\right\rangle$下的预期总回报
$v_{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t} | S_{t}=s\right]$
minimax价值函数最大化白方的期望回报，同时最小化黑方的期望回报
$v_{*}(s)=\max _{\pi^{1}} \min _{\pi^{2}} v_{\pi}(s)$
minimax策略是一种联合策略$\pi=\left\langle\pi^{1}, \pi^{2}\right\rangle$，该策略达到minimax价值
存在唯一的minimax价值函数
minimax策略是纳什均衡

来看一个具体例子：

Value Function in Minimax Search

搜索树呈指数增长
搜索到游戏结束不切实际
而是使用值函数逼近器$v(s, \mathbf{w}) \approx v_{*}(s)$
- 例如评估函数，启发式函数
使用价值函数来估计叶节点的minimax值

Binary-Linear Value Function

二进制特征向量$\mathrm x(s)$：例如每个格子一个值
权重向量$\mathrm w$：例如每个格子的价值
通过对特征的权重求和来评估位置

例如

Self-Play Reinforcement Learning

Self-Play Temporal-Difference Learning

将基于价值的RL算法应用于self-play游戏
$\text{MC}$：利用回报$G_t$更新价值函数
$\Delta \mathbf{w}=\alpha\left(G_{t}-v\left(S_{t}, \mathbf{w}\right)\right) \nabla_{\mathbf{w}} v\left(S_{t}, \mathbf{w}\right)$
$\text{TD}(0)$：利用$v(S_{t+1})$更新价值函数
$\Delta \mathbf{w}=\alpha\left(v\left(S_{t+1}, \mathbf{w}\right)-v\left(S_{t}, \mathbf{w}\right)\right) \nabla_{\mathbf{w}} v\left(S_{t}, \mathbf{w}\right)$
$\text{TD}(\lambda)$：利用$\lambda$回报$G_t^\lambda$更新价值函数
$\Delta \mathbf{w}=\alpha\left(G_{t}^{\lambda}-v\left(S_{t}, \mathbf{w}\right)\right) \nabla_{\mathbf{w}} v\left(S_{t}, \mathbf{w}\right)$

Policy Improvement with Afterstates

对于确定性游戏，可以估计$v_{*}(s)$
这是因为我们可以有效地评估后继状态
$q_{*}(s, a)=v_{*}(\operatorname{succ}(s, a))$
游戏规则定义了后继状态$\operatorname{succ}(s, a)$
选择动作，通过最小/最大化后继状态值
$\begin{aligned} A_{t}&=\underset{a}{\operatorname{argmax}} v_{*}\left(\operatorname{succ}\left(S_{t}, a\right)\right)& \text{for white}\\ A_{t}&=\underset{a}{\operatorname{argmin}} v_{*}\left(\operatorname{succ}\left(S_{t}, a\right)\right) & \text{for black} \end{aligned}$
这改善了双方的共同策略

Combining Reinforcement Learning and Minimax Search

Simple TD

通过TD学习价值函数
然后在minimax搜索中使用价值函数（无学习）
$v_{+}\left(S_{t}, \mathbf{w}\right)=\operatorname{minimax}_{\boldsymbol{s} \in \text {leaves}\left(S_{t}\right)} v(s, \mathbf{w})$

TD Root

TD根：利用后续搜索结果更新价值
搜索值是从根节点在$S_t$处计算
$v_{+}\left(S_{t}, \mathbf{w}\right)=\operatorname{minimax}_{s \in \text {leaves}\left(S_{t}\right)} v(s, \mathbf{w})$
利用下一个状态的搜索值更新价值函数
$v\left(S_{t}, \mathbf{w}\right) \leftarrow v_{+}\left(S_{t+1}, \mathbf{w}\right)=v\left(l_{+}\left(S_{t+1}\right), \mathbf{w}\right)$
其中$l_{+}(s)$是从$s$处获得minimax值的叶节点

TD Leaf

TD叶子：根据后继搜索值更新搜索值
计算当前和下一步的搜索值
$v_{+}\left(S_{t}, \mathbf{w}\right)=\operatorname{minimax}_{s \in \text { leaves }\left(S_{t}\right)} v(s, \mathbf{w}), \quad v_{+}\left(S_{t+1}, \mathbf{w}\right)=\operatorname{minimax}_{s \in \text { leaves }\left(S_{t+1}\right)} v(s, \mathbf{w})$
利用$t+1$的搜索值更新$t$的搜索值
$\begin{aligned} v_{+}\left(S_{t}, \mathbf{w}\right) & \leftarrow v_{+}\left(S_{t+1}, \mathbf{w}\right) \\ \Longrightarrow v\left(l_{+}\left(S_{t}\right), \mathbf{w}\right) & \leftarrow v\left(l_{+}\left(S_{t+1}\right), \mathbf{w}\right) \end{aligned}$

TreeStrap

TreeStrap：将搜索值更新为更深的搜索值
在所有节点$s \in \text { nodes }\left(S_{t}\right)$处计算出的Minimax搜索值
根据搜索值更新全部节点
$\begin{aligned} & v(s, \mathbf{w}) \leftarrow v_{+}(s, \mathbf{w}) \\ \Longrightarrow & v(s, \mathbf{w}) \leftarrow v\left(l_{+}(s), \mathbf{w}\right) \end{aligned}$

Simulation-Based Search

Self-play强化学习可以代替搜索
从根状态$S_t $模拟Self-play游戏
将RL应用于模拟经验
- 蒙特卡洛控制$\Longrightarrow$蒙特卡洛树搜索
- 最有效的变体是UCT算法
  - 使用UCB平衡每个节点中的探索/开发
- Self-play UCT收敛于minimax值
- 完全信息，零和，2人游戏
- 不完全信息：请参阅下一部分

Reinforcement Learning in Imperfect-Information Games

Game-Tree Search in Imperfect Information Games

玩家的信息状态不同，因此搜索树不同
每个信息状态有一个节点
- 总结玩家所知道的
- 例如他们看过的卡片
许多真实状态可能共享相同的信息状态
也可能汇总状态，例如具有相似的价值

Solution Methods for Imperfect Information Games

信息状态博弈树可以通过以下方式解决：

迭代前向搜索方法
Self-play强化学习
平滑的UCT

Smooth UCT Search

将MCTS应用于信息状态游戏树
UCT的变体，受博弈理论Fictitious Play的启发
- 智能体学习对手的平均行为并做出回应
从节点的动作计数中提取平均策略，$\pi_{a v g}(a | s)=\frac{N(s, a)}{N(s)}$
在每个节点上，根据如下方式选择动作
$A \sim\left\{\begin{array}{ll} \mathrm{UCT}(S), & \text { with probability } \eta \\ \pi_{\text {avg}}(\cdot | S), & \text { with probability } 1-\eta \end{array}\right.$
根据经验，在扑克的变体中：
- 普通的MCTS发散
- 平滑UCT收敛到Nash均衡

Game Theory

Optimality in Games

Single-Agent and Self-Play Reinforcement Learning

Two-Player Zero-Sum Games

Perfect and Imperfect Information Games

Minimax

Value Function in Minimax Search

Binary-Linear Value Function

Self-Play Reinforcement Learning

Self-Play Temporal-Difference Learning

Policy Improvement with Afterstates

Combining Reinforcement Learning and Minimax Search

Simple TD

TD Root

TD Leaf

TreeStrap

Simulation-Based Search

Reinforcement Learning in Imperfect-Information Games

Game-Tree Search in Imperfect Information Games

Solution Methods for Imperfect Information Games

Smooth UCT Search

Conclusions