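I recently started studying machine learning theory, beginning with Stanford's Machine Learning Theory (CS229M/STATS214). The course has fairly complete lecture notes and videos; the drawback is that the homework is not publicly available. This post summarizes Chapter 1, which introduces the supervised learning paradigm and the concepts of population risk, excess risk, and empirical risk.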

Course homepage:

https://web.stanford.edu/class/stats214/

Course videos:

https://www.youtube.com/playlist?list=PLoROMvodv4rP8nAmISxFINlGKSK4rbLKh

Lecture notes:

https://github.com/tengyuma/cs229m_notes/blob/main/master.pdf

Chapter 1 Supervised Learning Formulations

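In this chapter, we set up the standard theoretical formulation of supervised learning and introduce the empirical risk minimization (ERM) paradigm.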

1.1 Supervised learning

Basic concepts

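In supervised learning, we have inputs and outputs: inputs belong to the input space $\mathcal X$ and outputs belong to the output space $\mathcal Y$. We consider a probability distribution $P$ over $\mathcal X \times \mathcal Y$, from which we draw a training set of $n$ independent and identically distributed (i.i.d.) examples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^n$. The goal of supervised learning is to learn, from the training set, a mapping from $\mathcal X$ to $\mathcal Y$. Such a function $h:\mathcal X\to \mathcal Y$ is called a predictor (also hypothesis or model).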

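Given two predictors, how do we decide which one is better? To do so, we define a loss function on predictions, $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$, where $\ell(\hat y, y)$ measures how far the model's prediction $\hat y=h(x)$ is from the true label $y$. We typically assume that $\ell$ is nonnegative, i.e. $\ell(\hat y, y)\ge 0$.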

With these definitions, we can formalize the supervised learning problem: the goal is to find a model $h$ that minimizes the expected loss:

$L(h) \triangleq \underset{(x, y) \sim P}{\mathbb{E}}[\ell(h(x), y)].$

(Note: expected loss, population loss, expected risk, and population risk all mean the same thing.)

Since $\ell$ is nonnegative, $L$ is nonnegative as well, so our goal is to find an $h$ that makes $L(h)$ as close to $0$ as possible.

Examples

In regression, $\mathcal Y =\mathbb R$, and the loss function is typically the squared loss $\ell(\hat{y}, y)=(\hat{y}-y)^2$.

In classification, $\mathcal{Y}=[k]=\{1, \cdots, k\}$, and a common loss function is the 0-1 loss: $\ell(\hat{y}, y)=\mathbb{1}(\hat{y} \neq y)$.
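As a minimal sketch of the two losses above (the function names `squared_loss` and `zero_one_loss` are chosen here for illustration and do not come from the course notes), they can be written with NumPy as:

```python
import numpy as np

def squared_loss(y_hat, y):
    """Squared loss for regression: (y_hat - y)^2, applied elementwise."""
    return (np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)) ** 2

def zero_one_loss(y_hat, y):
    """0-1 loss for classification: 1.0 where y_hat != y, else 0.0."""
    return (np.asarray(y_hat) != np.asarray(y)).astype(float)
```

Both functions are nonnegative, matching the assumption on $\ell$, and both accept either scalars or arrays of predictions.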

Hypothesis class

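So far, we have talked about finding an arbitrary function that minimizes the population risk, but in practice we cannot optimize over all functions. Instead, we usually work within a more restricted set of functions $\mathcal H$, called the hypothesis family or hypothesis class. Each element of $\mathcal H$ is a function $h:\mathcal X\to \mathcal Y$. We typically choose $\mathcal H$ to be a set of functions over which optimization is tractable, such as linear models or neural networks.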

Given some $h\in \mathcal H$, we define the excess risk of $h$ with respect to $\mathcal H$:

$E(h) \triangleq L(h)-\inf _{g \in \mathcal{H}} L(g).$
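To make the definition concrete, here is a small sketch on a toy problem where the population risk can be computed exactly. The joint distribution `P`, the two-element class `H`, and the helper names are all made up for illustration; the loss is the 0-1 loss.

```python
# Toy joint distribution P over (x, y) in {0, 1} x {0, 1} (made-up numbers).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def population_risk(h):
    """Exact population risk L(h) under P with the 0-1 loss."""
    return sum(p * float(h(x) != y) for (x, y), p in P.items())

# A deliberately small hypothesis class: the two constant predictors.
H = {"const0": lambda x: 0, "const1": lambda x: 1}

def excess_risk(h, hypothesis_class):
    """E(h) = L(h) - min over g in the class of L(g) (the class is finite here)."""
    best = min(population_risk(g) for g in hypothesis_class.values())
    return population_risk(h) - best
```

Here `const0` has risk 0.4 and `const1` has risk 0.6, so their excess risks with respect to `H` are 0 and about 0.2. The predictor `lambda x: x` has a lower risk (0.3) but lies outside `H`, which illustrates that excess risk is measured against the best element of the class, not against the best possible function.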

Parameterization

Usually, the family we choose can be parameterized by $\theta \in \Theta$; in that case, we write $h_\theta$ for an element of $\mathcal H$. One example is the linear model: $\mathcal{H}=\left\{h_\theta: h_\theta(x)=\theta^{\top} x, \theta \in \mathbb{R}^d\right\}$.

1.2 Empirical risk minimization

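As discussed above, our goal is to minimize the expected loss $L(h) \triangleq \underset{(x, y) \sim P}{\mathbb{E}}[\ell(h(x), y)]$. In practice, however, we only have a training set of $n$ examples, so we can only compute the empirical risk and try to minimize that instead. In short, this is the paradigm known as empirical risk minimization (ERM): we optimize the training loss in the hope that this yields a model with low population risk. From now on, we usually write $\ell\left(h_\theta(x), y\right)$ as $\ell((x, y), \theta)$. The empirical risk of the model $h_\theta$ is then: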

$\widehat{L}\left(h_\theta\right) \triangleq \frac{1}{n} \sum_{i=1}^n \ell\left(h_\theta\left(x^{(i)}\right), y^{(i)}\right)=\frac{1}{n} \sum_{i=1}^n \ell\left(\left(x^{(i)}, y^{(i)}\right), \theta\right).$

Empirical risk minimization finds the parameter $\hat \theta$ that minimizes $\widehat L$:

$\hat{\theta} \triangleq \underset{\theta \in \Theta}{\operatorname{argmin}} \widehat{L}\left(h_\theta\right).$
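For the linear model with squared loss, ERM reduces to ordinary least squares. The sketch below is one way to compute $\hat\theta$ in that special case; the synthetic data, the "true" parameter `theta_star`, and the function names are assumptions made for illustration only.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average squared loss of h_theta(x) = theta^T x over the training set."""
    return float(np.mean((X @ theta - y) ** 2))

def erm_linear(X, y):
    """ERM for the linear class with squared loss, solved by least squares."""
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat

# Synthetic training set (hypothetical data-generating process).
rng = np.random.default_rng(0)
n, d = 200, 3
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ theta_star + 0.1 * rng.normal(size=n)

theta_hat = erm_linear(X, y)
```

By construction, `theta_hat` attains an empirical risk no larger than that of any other parameter, including `theta_star`. For other losses or hypothesis classes there is generally no closed form, and $\hat\theta$ would be found with an iterative optimizer instead.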

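Note that for any fixed $\theta$ (chosen independently of the training set), the expectation of the empirical risk is the population risk: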

$\begin{aligned} \underset{\left(x^{(i)}, y^{(i)}\right) \stackrel{\mathrm{iid}}{\sim} P}{\mathbb{E}} \widehat{L}\left(h_\theta\right) & =\underset{\left(x^{(i)}, y^{(i)}\right) \stackrel{\mathrm{iid}}{\sim} P}{\mathbb{E}} \frac{1}{n} \sum_{i=1}^n \ell\left(h_\theta\left(x^{(i)}\right), y^{(i)}\right) \\ & =\frac{1}{n} \sum_{i=1}^n \underset{\left(x^{(i)}, y^{(i)}\right) \stackrel{\mathrm{iid}}{\sim} P}{\mathbb{E}} \ell\left(h_\theta\left(x^{(i)}\right), y^{(i)}\right) \\ & =\frac{1}{n} \cdot n \cdot \underset{(x, y) \sim P}{\mathbb{E}} \ell\left(h_\theta(x), y\right) \\ & =L\left(h_\theta\right) . \end{aligned}$
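This identity can be checked numerically. The sketch below assumes a made-up setting ($x \sim \mathcal N(0, I_d)$, $y = \theta_*^\top x + \sigma\varepsilon$ with standard normal noise, squared loss) in which the population risk of a fixed $\theta$ has the closed form $\|\theta-\theta_*\|_2^2+\sigma^2$, and compares it with the empirical risk averaged over many independent training sets.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials, sigma = 2, 50, 2000, 0.5
theta_star = np.array([1.0, -2.0])  # hypothetical data-generating parameter
theta = np.zeros(d)                 # fixed parameter, chosen before seeing any data

# Draw `trials` independent training sets, each with n i.i.d. examples.
X = rng.normal(size=(trials, n, d))
y = X @ theta_star + sigma * rng.normal(size=(trials, n))

# Empirical risk of the fixed theta on each training set, then its average.
emp_risks = np.mean((X @ theta - y) ** 2, axis=1)
avg_empirical_risk = float(emp_risks.mean())

# Closed-form population risk under the assumed model.
population_risk = float(np.sum((theta - theta_star) ** 2) + sigma ** 2)
```

The average of the empirical risks should be close to `population_risk`, while each individual entry of `emp_risks` fluctuates around it. The check relies on `theta` being fixed in advance: if `theta` were the ERM solution fitted on the same training set, the derivation above would no longer apply.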

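The key question that the first part of the course seeks to answer is: what guarantees do we have on the excess risk of the parameters learned by ERM?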

Summary

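This lecture introduced the basic concepts, such as population risk, excess risk, and ERM, and then raised the follow-up question: under what conditions is the population risk of the ERM solution also small?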