Neural Networks for Machine Learning Lecture 14

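Course link: https://www.coursera.org/learn/neural-networks

Instructor's homepage: http://www.cs.toronto.edu/~hinton

Note: the content and figures in these notes are based on the instructor's lecture slides.

This week covered DBNs and pre-training; these notes mainly review the quiz questions.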

Question 1

Why is a Deep Belief Network not a Boltzmann Machine?

A DBN is not a probabilistic model of the data.
All edges in a DBN are directed.
Some edges in a DBN are directed.
A DBN does not have hidden units.

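The top layer of a DBN is an RBM, so its connections are undirected, while the connections of every other layer are directed. Only some of the edges are directed, so the third option is correct.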
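The mixed structure described above can be illustrated with a minimal sketch of how a sample might be drawn from a DBN: alternating Gibbs sampling in the undirected top RBM, followed by a single top-down pass through the directed layers. The function names, argument names, and layer sizes are hypothetical and chosen only for illustration; binary stochastic units are assumed throughout.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_bernoulli(p, rng):
    # Draw binary states with the given "on" probabilities.
    return (rng.random(p.shape) < p).astype(float)


def dbn_generate(W_top, b_pen, b_top, down_layers, n_gibbs, rng):
    """Draw one visible sample from a DBN (illustrative sketch).

    W_top       : (n_pen, n_top) weights of the undirected top RBM
    b_pen/b_top : biases of the penultimate and top layers
    down_layers : list of (W, b) for the directed top-down layers,
                  W has shape (n_upper, n_lower)
    """
    # Undirected part: alternating Gibbs sampling in the top RBM.
    pen = sample_bernoulli(np.full(W_top.shape[0], 0.5), rng)
    for _ in range(n_gibbs):
        top = sample_bernoulli(sigmoid(pen @ W_top + b_top), rng)
        pen = sample_bernoulli(sigmoid(top @ W_top.T + b_pen), rng)

    # Directed part: one top-down pass along the arrows.
    state = pen
    for W, b in down_layers:
        state = sample_bernoulli(sigmoid(state @ W + b), rng)
    return state


# Hypothetical sizes: top 4, penultimate 6, then 5, visible 8.
rng = np.random.default_rng(0)
W_top = rng.normal(size=(6, 4))
down_layers = [(rng.normal(size=(6, 5)), np.zeros(5)),
               (rng.normal(size=(5, 8)), np.zeros(8))]
v = dbn_generate(W_top, np.zeros(6), np.zeros(4), down_layers, 20, rng)
```

The Gibbs loop is the only place where information flows in both directions, which reflects the undirected top-level RBM; everything below it follows the arrows downward.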

Question 2

Brian looked at the direction of arrows in a DBN and was surprised to find that the data is at the “output”. “Where is the input ?!”, he exclaimed, “How will I give input to this model and get all those cool features?” In this context, which of the following statements are true? Check all that apply.

In order to get features $h$ given some data $v$, he must perform inference to find out $P(h|v)$. There is an easy exact way of doing this, just traverse the arrows in the opposite direction.
A DBN is a generative model of the data and cannot be used to generate features for any given input. It can only be used to get features for data that was generated by the model.
A DBN is a generative model of the data, which means that, its arrows define a way of generating data from a probability distribution, so there is no “input”.
In order to get features $h$ given some data $v$, he must perform inference to find out $P(h|v)$. There is an easy approximate way of doing this, just traverse the arrows in the opposite direction.

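A DBN is a generative model and does not map inputs to outputs, so the third option is correct. $P(h|v)$ can be inferred by "traversing the arrows in the opposite direction", but note that this is only an approximate method, so the fourth option is correct as well, while the first option, which calls the method exact, is not.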
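As a rough sketch of the approximate inference mentioned above, the bottom-up pass can be written as a chain of sigmoid layers that traverse the arrows in reverse. The function name `dbn_features` and the `(W, a)` layer format are assumptions made for illustration; the sketch also assumes the recognition weights are simply the transposes of the generative weights, which is what makes the result approximate rather than exact.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dbn_features(v, up_layers):
    """Approximate P(h|v) layer by layer with a single bottom-up pass.

    up_layers : list of (W, a), W has shape (n_lower, n_upper) and
                a is the bias of the upper layer (hypothetical format).
    Returns the "on" probabilities of every hidden layer.
    """
    activations = []
    state = np.asarray(v, dtype=float)
    for W, a in up_layers:
        # Factorial approximation: each unit's probability depends only
        # on the layer below; the exact posterior is not computed.
        state = sigmoid(state @ W + a)
        activations.append(state)
    return activations


# Hypothetical sizes: visible 8, hidden layers of 5 and 3 units.
rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=8)
up_layers = [(rng.normal(size=(8, 5)), np.zeros(5)),
             (rng.normal(size=(5, 3)), np.zeros(3))]
features = dbn_features(v, up_layers)
```

Here the probabilities are propagated upward instead of sampled binary states, a common simplification when the features are fed into a classifier.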

Question 3

In which of the following cases is pretraining likely to help the most (compared to training a neural net from random initialization)?

A dataset of images is to be classified into 100 semantic classes. Fortunately, there are 100 million labelled training examples.
A speech dataset with 10 billion labelled training examples.
A dataset of binary pixel images which are to be classified based on parity, i.e., if the sum of pixels is even the image has label 0, otherwise it has label 1.
A dataset of movie reviews is to be classified. There are only 1,000 labelled reviews but 1 million unlabelled ones can be extracted from crawling movie review web sites and discussion forums.

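As the lecture slides show, pretraining helps the most when there are only a few labels but a large amount of unlabelled data, so the fourth option is correct.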

Question 4

Why does pretraining help more when the network is deep?

As nets get deeper, contrastive divergence objective used during pretraining gets closer to the classification objective.
Backpropagation algorithm cannot give accurate gradients for very deep networks. So it is important to have good initializations, especially, for the lower layers.
Deeper nets have more parameters than shallow ones and they overfit easily. Therefore, initializing them sensibly is important.
During backpropagation in very deep nets, the lower level layers get very small gradients, making it hard to learn good low-level features. Since pretraining starts those low-level features off at a good point, there is a big win.

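The first option is wrong: a deeper network extracts more features, but that does not bring the pretraining objective any closer to the classification objective. The second option is clearly wrong: the backpropagation formulas are exact, so the gradients they give are accurate. More parameters mean the model can find ingenious ways of overfitting by learning features that do not generalize well. Pretraining can initialize the weights in a sensible region of weight space so that the learned features are not too bad, so the third option is correct. During backpropagation, the gradients of the first few layers can be very small, so good initial values are needed; pretraining helps with exactly this, so the fourth option is correct.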
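The point about small gradients in the lower layers can be checked numerically. The sketch below is not from the lecture: it builds a hypothetical 10-layer sigmoid network with small random weights, runs one manual backpropagation step on a squared-error loss, and reports the norm of the weight gradient for each layer. The depth, width, weight scale, and loss are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_gradient_norms(n_layers=10, width=50, seed=0):
    """Return ||dL/dW|| for each layer, from the lowest to the highest."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=0.1, size=(width, width)) for _ in range(n_layers)]
    x = rng.normal(size=width)
    target = np.zeros(width)

    # Forward pass, keeping every layer's activation.
    acts = [x]
    for W in Ws:
        acts.append(sigmoid(acts[-1] @ W))

    # Backward pass for L = 0.5 * ||output - target||^2.
    delta = acts[-1] - target          # dL/d(output)
    norms = []
    for i in reversed(range(n_layers)):
        a_in, a_out = acts[i], acts[i + 1]
        delta_pre = delta * a_out * (1.0 - a_out)   # dL/d(pre-activation)
        norms.append(np.linalg.norm(np.outer(a_in, delta_pre)))
        delta = delta_pre @ Ws[i].T                 # dL/d(a_in)
    return norms[::-1]


norms = layer_gradient_norms()
for i, n in enumerate(norms, start=1):
    print(f"layer {i:2d}: gradient norm = {n:.3e}")
```

In this setup the gradient norm of the lowest layer comes out several orders of magnitude smaller than that of the top layer, because each backward step multiplies by a sigmoid derivative (at most 0.25) and by small weights.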

Question 5

The energy function for binary RBMs goes by

$E(\mathbf{v,h}) = -\sum_iv_ib_i -\sum_jh_ja_j - \sum_{i,j}v_iW_{ij}h_j$

When modeling real-valued data (i.e., when $\mathbf{v}$ is a real-valued vector not a binary one) we change it to

$E(\mathbf{v,h}) = \sum_i\frac{(v_i-b_i)^2}{2\sigma_i^2} -\sum_jh_ja_j - \sum_{i,j}\frac{v_i}{\sigma_i}W_{ij}h_j$

Why can't we still use the same old one?

If we continue to use the same one, then in general, there will be infinitely many $\mathbf{v}$’s and $\mathbf{h}$’s such that, $E(\mathbf{v, h})$ will be infinitely small (close to $-\infty$). The probability distribution resulting from such an energy function is not useful for modeling real data.
Probability distributions over real-valued data can only be modeled by having a conditional Gaussian distribution over them. So we have to use a quadratic term.
If we use the old one, the real-valued vectors would end up being constrained to be binary.
If the model assigns an energy $e_1$ to state $\mathbf{v_1,h}$, and $e_2$ to state $\mathbf{v_2,h}$, then it would assign energy $(e_1 + e_2)/2$ to state $\mathbf{(v_1+v_2)/2,h}$. This does not make sense for the kind of distributions we usually want to model.

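For the first model, the energy is linear in each $v_i$ (for example, with $\mathbf{h} = 0$ only the term $-v_ib_i$ depends on $v_i$). If $b_i < 0$ and $v_i \rightarrow -\infty$, then $E \rightarrow -\infty$; if $b_i > 0$ and $v_i \rightarrow \infty$, then $E \rightarrow -\infty$ as well. So with the first formula there are many ways to drive the energy to negative infinity, and the first option is correct. The second option is incorrect: a model of real-valued data does not have to be Gaussian. The third option is clearly incorrect. For the last option, note that the first model is linear in $\mathbf{v}$ for a fixed $\mathbf{h}$, so the average of two visible vectors always gets the average of their energies. Take two normal images $\mathbf{v_1}$ and $\mathbf{v_2}$, both with low energy: their average is usually a blurry blend rather than a normal image and should get a higher energy, but the first model cannot express this, so the last option is also correct.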
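Both properties of the binary-RBM energy discussed above (it is unbounded below and it averages linearly in $\mathbf{v}$) can be checked numerically against the two formulas from the question. The sketch below is only an illustration: the function names, the random parameters, and the choice $\sigma_i = 1$ are assumptions, not part of the course material.

```python
import numpy as np


def binary_rbm_energy(v, h, W, b, a):
    # E(v,h) = -sum_i v_i b_i - sum_j h_j a_j - sum_ij v_i W_ij h_j
    return -(v @ b) - (h @ a) - (v @ W @ h)


def gaussian_rbm_energy(v, h, W, b, a, sigma):
    # E(v,h) = sum_i (v_i - b_i)^2 / (2 sigma_i^2) - sum_j h_j a_j
    #          - sum_ij (v_i / sigma_i) W_ij h_j
    return (np.sum((v - b) ** 2 / (2.0 * sigma ** 2))
            - (h @ a) - ((v / sigma) @ W @ h))


# Hypothetical small model: 4 visible units, 3 hidden units.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
a = rng.normal(size=3)
sigma = np.ones(4)
h = np.array([1.0, 0.0, 1.0])

# 1) Old energy: moving v along d = b + W h lowers E without bound,
#    while the quadratic term makes the new energy grow instead.
d = b + W @ h
for t in (1.0, 1e2, 1e4):
    print(t,
          binary_rbm_energy(t * d, h, W, b, a),
          gaussian_rbm_energy(t * d, h, W, b, a, sigma))

# 2) Old energy: the midpoint of two visible vectors gets exactly
#    the average of their energies.
v1, v2 = rng.normal(size=4), rng.normal(size=4)
e_mid = binary_rbm_energy((v1 + v2) / 2, h, W, b, a)
e_avg = (binary_rbm_energy(v1, h, W, b, a)
         + binary_rbm_energy(v2, h, W, b, a)) / 2
print(np.isclose(e_mid, e_avg))
```

Along the direction $\mathbf{d} = \mathbf{b} + W\mathbf{h}$ the old energy equals $-t\lVert\mathbf{d}\rVert^2 - \sum_j h_ja_j$, which is why it decreases without bound as $t$ grows.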