Course videos: https://study.163.com/courses-search?keyword=CS231

Course homepage: http://cs231n.stanford.edu/2017/

References:

https://github.com/Halfish/cs231n/tree/master/assignment2/cs231n

https://github.com/wjbKimberly/cs231n_spring_2017_assignment/blob/master/assignment2/TensorFlow.ipynb

My code: https://github.com/Doraemonzzz/CS231n

This part reviews the key points of Assignment 3.

Preparation

If the "Look at the data" section reports the following error:

The process cannot access the file because it is being used by another process

then comment out the following line in the image_from_url function in image_utils.py:

#os.remove(fname)

1. Image Captioning with Vanilla RNNs

Vanilla RNN: step forward

Make the following assumptions about the quantities involved: $x_t\in \mathbb{R}^{N\times D}$, $W_x\in \mathbb{R}^{D\times H}$, $W_h\in \mathbb{R}^{H\times H}$, $b\in \mathbb{R}^{H}$, and $h_{t-1}, h_t\in \mathbb{R}^{N\times H}$.

For a single example the update formula is

$$h_t = \tanh\left(x_t W_x + h_{t-1} W_h + b\right),$$

and the same expression applies to the whole minibatch. The last addition should be understood as numpy broadcasting, and $\tanh$ acts elementwise. The code:

next_h = np.tanh(prev_h.dot(Wh) + x.dot(Wx) + b)
cache = [x, prev_h, Wx, Wh, b, next_h]
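
As a quick sanity check, here is a minimal usage sketch (it assumes the assignment's cs231n.rnn_layers module is importable from the notebook directory):

import numpy as np
from cs231n.rnn_layers import rnn_step_forward  # assignment module, assumed on the path

N, D, H = 3, 10, 4
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

next_h, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
print(next_h.shape)           # (3, 4)
print(np.abs(next_h).max())   # strictly less than 1 because of tanh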
Vanilla RNN: step backward

Note that $\tanh'(z) = 1 - \tanh^2(z)$, so with $a = x_t W_x + h_{t-1} W_h + b$ and $h_t = \tanh(a)$,

$$\frac{\partial L}{\partial a} = \frac{\partial L}{\partial h_t}\odot\left(1 - h_t^2\right).$$

Therefore, the first step is to compute this quantity:

# read the cache
x, prev_h, Wx, Wh, b, next_h = cache
# gradient of the loss with respect to a
d1 = dnext_h * (1 - next_h ** 2)

Because $a = x_t W_x + h_{t-1} W_h + b$ is affine in each of its inputs, the remaining gradients follow the usual affine backward rules:

dx = d1.dot(Wx.T)
dprev_h = d1.dot(Wh.T)
dWx = x.T.dot(d1)
dWh = prev_h.T.dot(d1)
db = np.sum(d1, axis=0)

For the derivation, see Problem 4 of the first assignment.
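
As a quick recap of the result used above (only the matrix-product case needed here): for $Y = XW$ with $X\in\mathbb{R}^{N\times D}$, $W\in\mathbb{R}^{D\times H}$, and upstream gradient $\frac{\partial L}{\partial Y}\in\mathbb{R}^{N\times H}$,

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\,W^{\top},\qquad \frac{\partial L}{\partial W} = X^{\top}\,\frac{\partial L}{\partial Y},\qquad \frac{\partial L}{\partial b} = \sum_{n}\left(\frac{\partial L}{\partial Y}\right)_{n,:},$$

which is exactly what the five lines above compute, with d1 playing the role of $\frac{\partial L}{\partial Y}$.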

Vanilla RNN: forward

Just loop over the time steps:

# read dimensions
N, T, D = x.shape
H = h0.shape[1]
# initialization
h = np.zeros((N, T, H))
h_ = h0
cache = []
for t in range(T):
	x_ = x[:, t, :]
	h_, cache_ = rnn_step_forward(x_, h_, Wx, Wh, b)
	h[:, t, :] = h_
	cache.append(cache_)

# also store the inputs, matching what rnn_backward reads from cache[-1]
cache.append([x, h0, Wx, Wh, b])
Vanilla RNN: backward

First, recall the structure of the RNN unrolled over time.

For $h_i$, the gradient in the backward pass has two contributions: one is $\nabla_{h_i} y_i$, which corresponds to dh in the code, and the other is the gradient coming through $h_{i+1}$, i.e. the one computed at the previous iteration of the backward loop. So the code is as follows:

# read the cache appended after the loop in rnn_forward
x, h0, Wx, Wh, b = cache[-1]
# initialization
dx = np.zeros_like(x)
dh_prev = np.zeros_like(h0)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
db = np.zeros_like(b)
N, T, H = dh.shape

for t in range(T-1, -1, -1):
	cache_ = cache[t]
	dh_ = dh[:, t, :]
	# the gradient reaching h_t is dh[:, t, :] plus the gradient from step t + 1
	dx_, dh_prev, dWx_, dWh_, db_ = rnn_step_backward(dh_ + dh_prev, cache_)

	# accumulate
	dx[:, t, :] = dx_
	dWx += dWx_
	dWh += dWh_
	db += db_

dh0 = dh_prev
Word embedding: forward

Each element of $x$ is a word index and $W[i,:]$ is the vector for the $i$-th word, so the following code looks up the vectors for all the words:

W[x.flatten(), :]

Reshaping to the output shape gives the result:

N, T = x.shape
V, D = W.shape
out = W[x.flatten(), :].reshape(N, T, D)

cache = [x, W]
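
A tiny self-contained example of this lookup (hypothetical sizes, plain numpy):

import numpy as np

V, D = 5, 3                               # vocabulary size, embedding dimension
N, T = 2, 4                               # batch size, sequence length
W = np.arange(V * D).reshape(V, D)        # row i is the embedding of word i
x = np.array([[0, 3, 1, 1],
              [2, 4, 0, 3]])              # word indices, shape (N, T)

out = W[x.flatten(), :].reshape(N, T, D)
print(out.shape)    # (2, 4, 3)
print(out[1, 1])    # equals W[4], the embedding of word index 4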
Word embedding: backward

This problem needs the np.add.at function (see the numpy documentation on ufunc.at for details).

The answer is:

x, W = cache
dW = np.zeros_like(W)
np.add.at(dW, x, dout)

The effect of this code is equivalent to:

x, W = cache
dW = np.zeros_like(W)

N, T = x.shape
for i in range(N):
	for j in range(T):
		dW[x[i][j]] += dout[i][j]
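
The reason np.add.at is needed (rather than dW[x] += dout) is that the same word index can appear several times in x: fancy-index += applies only one update per repeated index, while np.add.at accumulates all of them. A minimal illustration:

import numpy as np

dW_wrong = np.zeros(3)
dW_right = np.zeros(3)
idx = np.array([0, 0, 2])          # index 0 appears twice
upd = np.array([1.0, 1.0, 5.0])

dW_wrong[idx] += upd               # buffered: index 0 receives only one of the two updates
np.add.at(dW_right, idx, upd)      # unbuffered: index 0 receives both updates

print(dW_wrong)    # [1. 0. 5.]
print(dW_right)    # [2. 0. 5.]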

The "RNN for image captioning" and "Test-time sampling" parts are discussed together with the LSTM sections below.

2. Image Captioning with LSTMs

LSTM: step forward

Just apply the formulas given in the assignment:

A = x.dot(Wx) + prev_h.dot(Wh) + b
H = Wh.shape[0]
i = sigmoid(A[:, : H])
f = sigmoid(A[:, H: 2*H])
o = sigmoid(A[:, 2*H: 3*H])
g = np.tanh(A[:, 3*H: ])

next_c = f * prev_c + i * g
next_h = o * np.tanh(next_c)

cache = (x, prev_h, prev_c, Wx, Wh, b, A, H, i, f, o, g, next_c, next_h)
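
For reference, these lines implement the standard LSTM step, with the gate order i, f, o, g matching the column slices above:

$$a = xW_x + h_{t-1}W_h + b,\qquad i = \sigma(a_i),\quad f = \sigma(a_f),\quad o = \sigma(a_o),\quad g = \tanh(a_g),$$

$$c_t = f\odot c_{t-1} + i\odot g,\qquad h_t = o\odot\tanh(c_t),$$

where $a_i, a_f, a_o, a_g$ are the four $H$-column blocks of $a$ and $\odot$ denotes the elementwise product.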
LSTM: step backward

A full theoretical derivation is deferred for now; a brief sketch is given after the code below. The code:

x, prev_h, prev_c, Wx, Wh, b, A, H, i, f, o, g, next_c, next_h = cache
tan_next_c = np.tanh(next_c)

di = (dnext_c * g + dnext_h * o * (1 - tan_next_c ** 2) * g) * i * (1 - i)
dg = (dnext_c * i + dnext_h * o * (1 - tan_next_c ** 2) * i) * (1 - g ** 2)
df = (dnext_c * prev_c + dnext_h * o * (1 - tan_next_c ** 2) * prev_c) * f * (1 - f)
do = dnext_h * tan_next_c * o * (1 - o)
dA = np.c_[di, df, do, dg]

dx = dA.dot(Wx.T)
dprev_h = dA.dot(Wh.T)
dprev_c = dnext_c * f + dnext_h * o * (1 - tan_next_c ** 2) * f
dWx = x.T.dot(dA)
dWh = prev_h.T.dot(dA)
db = np.sum(dA, axis=0)
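
A brief sketch of where these expressions come from, using the notation of the forward step. Write $\delta_c = \frac{\partial L}{\partial c_t} + \frac{\partial L}{\partial h_t}\odot o\odot\left(1-\tanh^2(c_t)\right)$ for the total gradient reaching $c_t$ (dnext_c plus the part of dnext_h that flows through $h_t = o\odot\tanh(c_t)$). Then

$$\frac{\partial L}{\partial a_i} = \delta_c\odot g\odot i\odot(1-i),\qquad \frac{\partial L}{\partial a_f} = \delta_c\odot c_{t-1}\odot f\odot(1-f),$$

$$\frac{\partial L}{\partial a_o} = \frac{\partial L}{\partial h_t}\odot\tanh(c_t)\odot o\odot(1-o),\qquad \frac{\partial L}{\partial a_g} = \delta_c\odot i\odot(1-g^2),$$

$$\frac{\partial L}{\partial c_{t-1}} = \delta_c\odot f.$$

Stacking $dA = [\,da_i \mid da_f \mid da_o \mid da_g\,]$ then reduces the rest to the same affine rules as the vanilla RNN: $dx = dA\,W_x^{\top}$, $dh_{t-1} = dA\,W_h^{\top}$, $dW_x = x^{\top}dA$, $dW_h = h_{t-1}^{\top}dA$, $db = \sum_n dA_{n,:}$.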
LSTM: forward

Similar to the RNN part:

# read dimensions
N, T, D = x.shape
H = h0.shape[1]
# initialization
h = np.zeros((N, T, H))
h_ = h0
c_ = np.zeros_like(h0)
cache = []
for t in range(T):
	x_ = x[:, t, :]
	h_, c_, cache_ = lstm_step_forward(x_, h_, c_, Wx, Wh, b)
	h[:, t, :] = h_
	cache.append(cache_)

# also store the inputs for lstm_backward to read from cache[-1]
cache.append([x, h0, Wx, Wh, b])
LSTM: backward

Again, similar to the RNN part:

# read the cache
x, h0, Wx, Wh, b = cache[-1]
# initialization
dx = np.zeros_like(x)
dh_prev = np.zeros_like(h0)
dnext_c = np.zeros_like(h0)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
db = np.zeros_like(b)
N, T, H = dh.shape

for t in range(T-1, -1, -1):
	cache_ = cache[t]
	dh_ = dh[:, t, :]
	dx_, dh_prev, dnext_c, dWx_, dWh_, db_ = lstm_step_backward(dh_ + dh_prev, dnext_c, cache_)
	
	# accumulate
	dx[:, t, :] = dx_
	dWx += dWx_
	dWh += dWh_
	db += db_

dh0 = dh_prev
LSTM captioning model

Follow the comments to implement the forward and backward passes. The affine layer gave me a bit of trouble here, so its forward and backward passes are re-implemented directly in the code:

#1 affine: image features -> initial hidden state
h1 = features.dot(W_proj) + b_proj
cache_affine_1 = (features, W_proj, b_proj, h1)
#2 word embedding
x1, cache_word = word_embedding_forward(captions_in, W_embed)
#3 RNN / LSTM over the caption
if self.cell_type == "rnn":
	h, cache_forward = rnn_forward(x1, h1, Wx, Wh, b)
else:
	h, cache_forward = lstm_forward(x1, h1, Wx, Wh, b)

#4 temporal affine: hidden states -> vocabulary scores
s2, cache_affine_2 = temporal_affine_forward(h, W_vocab, b_vocab)
#5 temporal softmax loss
loss, dx = temporal_softmax_loss(s2, captions_out, mask, verbose=False)


#temporal affine backward
dx1, grads["W_vocab"], grads["b_vocab"] = temporal_affine_backward(dx, cache_affine_2)
#rnn / lstm backward
if self.cell_type == "rnn":
	dx2, dh0, dWx, dWh, db = rnn_backward(dx1, cache_forward)
else:
	dx2, dh0, dWx, dWh, db = lstm_backward(dx1, cache_forward)
grads['Wx'] = dWx
grads['Wh'] = dWh
grads["b"] = db
#word embedding backward
grads['W_embed'] = word_embedding_backward(dx2, cache_word)
#backward through the re-implemented feature-to-hidden affine layer
grads['W_proj'] = features.T.dot(dh0)
grads['b_proj'] = np.sum(dh0, axis=0)
LSTM test-time sampling

This part is a bit hard to follow, and I am not entirely sure my implementation is correct. First, look at word_embedding_forward:

def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """

Note that the output has shape (N, T, D), so to get the vectors for the t-th word you should use:

out[:, t, :]

The complete code is as follows:

# initialization: every caption begins with the <START> token
start = np.array([self._start] * N).reshape(-1, 1)
prev_word = start
h = features.dot(W_proj) + b_proj
# initial cell state for the LSTM
c = np.zeros_like(h)
for i in range(max_length):
	word, cache = word_embedding_forward(prev_word, W_embed)
	# get the word vectors; the embedding output has shape (N, 1, D)
	word = word[:, 0, :]
	if self.cell_type == "rnn":
		h, cache = rnn_step_forward(word, h, Wx, Wh, b)
	else:
		h, c, cache = lstm_step_forward(word, h, c, Wx, Wh, b)
	# compute vocabulary scores
	score = h.dot(W_vocab) + b_vocab
	# index of the highest-scoring word
	idx = np.argmax(score, axis=1)
	# record the result
	captions[:, i] = idx
	# remember the sampled word as the input for the next step
	prev_word = idx.reshape(-1, 1)
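
A usage sketch of this method (assuming the variable names used in the notebook: the small_data dict, a trained model small_lstm_model, and the helpers from cs231n.coco_utils; adjust the names to your own setup):

from cs231n.coco_utils import sample_coco_minibatch, decode_captions

# sample a couple of validation images and caption them with the trained model
captions_gt, features, urls = sample_coco_minibatch(small_data, split='val', batch_size=2)
sample_captions = small_lstm_model.sample(features)          # shape (2, max_length)
for caption in decode_captions(sample_captions, small_data['idx_to_word']):
	print(caption)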

3. Network Visualization: Saliency maps, Class Visualization, and Fooling Images

If the following error appears:

You need to download SqueezeNet!

then comment out the following statements:

if not os.path.exists(SAVE_PATH):
    raise ValueError("You need to download SqueezeNet!")

Since this part uses TensorFlow, which I am not very familiar with, it mostly follows code found online.

Saliency Maps

Compute the maximum of the absolute value of the gradient over the channel axis:

# the loss: mean of the correct-class scores
loss = tf.reduce_mean(correct_scores)
# gradient of the loss with respect to the input images
grad = tf.gradients(loss, model.image)
# tf.gradients returns a list; the first element is the gradient tensor
gradabs = tf.abs(grad[0])
# maximum absolute value over the channel axis
saliency_ = tf.reduce_max(gradabs, axis=-1)
# run
saliency = sess.run(saliency_, feed_dict={model.image: X, model.labels: y})
Fooling Images

Gradient ascent:

# gradient of the target class score with respect to the image
grad = tf.gradients(model.classifier[0, target_y], model.image)[0]
# normalized gradient-ascent step
dX = learning_rate * grad / tf.norm(grad)
for i in range(100):
	# run one step
	dx = sess.run(dX, feed_dict={model.image: X_fooling})
	# update the image
	X_fooling += dx
Class visualization

For the loss computation here, I found that others used the following code:

loss = model.classifier[0, target_y] - l2_reg * model.image * model.image

This does not look quite right to me, because the result is not a scalar, yet in practice it works better and I have not figured out why. One possible explanation: tf.gradients differentiates the sum of its first argument, so the non-scalar loss is equivalent to summing it first, which multiplies the broadcast score term by the number of pixels and therefore makes the L2 regularization relatively much weaker.

########################################################################
# TODO: Compute the loss and the gradient of the loss with respect to  #
# the input image, model.image. We compute these outside the loop so   #
# that we don't have to recompute the gradient graph at each iteration #
#                                                                      #
# Note: loss and grad should be TensorFlow Tensors, not numpy arrays!  #
#                                                                      #
# The loss is the score for the target label, target_y. You should     #
# use model.classifier to get the scores, and tf.gradients to compute  #
# gradients. Don't forget the (subtracted) L2 regularization term!     #
########################################################################
loss = None # scalar loss
grad = None # gradient of loss with respect to model.image, same size as model.image
loss = model.classifier[0, target_y] - l2_reg * tf.norm(model.image) ** 2
#loss = model.classifier[0, target_y] - l2_reg * model.image * model.image
# compute the gradient of the loss with respect to the image
grad = tf.gradients(loss, model.image)[0]
# normalize to get the gradient-ascent step
dX = learning_rate * grad / tf.norm(grad)
############################################################################
#                             END OF YOUR CODE                             #
############################################################################


for t in range(num_iterations):
	# Randomly jitter the image a bit; this gives slightly nicer results
	ox, oy = np.random.randint(-max_jitter, max_jitter+1, 2)
	Xi = X.copy()
	X = np.roll(np.roll(X, ox, 1), oy, 2)
	
	########################################################################
	# TODO: Use sess to compute the value of the gradient of the score for #
	# class target_y with respect to the pixels of the image, and make a   #
	# gradient step on the image using the learning rate. You should use   #
	# the grad variable you defined above.                                 #
	#                                                                      #
	# Be very careful about the signs of elements in your code.            #
	########################################################################
	dx = sess.run(dX, feed_dict={model.image: X})
	X += dx
	############################################################################
	#                             END OF YOUR CODE                             #
	############################################################################

4. Style Transfer

If check_scipy() reports an error, just comment out the following:

'''
vnum = int(scipy.__version__.split('.')[1])
assert vnum >= 16, "You must install SciPy >= 0.16.0 to complete this notebook."
'''
Content loss

Compute it with the following formula, where $F^{\ell}$ are the features of the current image at the chosen layer, $P^{\ell}$ are the features of the original content image, and $w_c$ is the content weight:

$$L_c = w_c\sum_{i,j}\left(F^{\ell}_{ij} - P^{\ell}_{ij}\right)^2$$

n, height, width, channels = content_current.shape
c = tf.reshape(content_current, (-1, height * width))
o = tf.reshape(content_original, (-1, height * width))
loss = content_weight * tf.reduce_sum((c - o) ** 2)
Style loss

First compute the Gram matrix $G = F^{\top}F$, where $F$ is the feature map reshaped to shape $(N\cdot H\cdot W, C)$; when normalize is set, divide by $H\cdot W\cdot C$:

shape = tf.shape(features)
n, height, width, channels = shape[0], shape[1], shape[2], shape[3]
x = tf.reshape(features, shape=[height * width * n, channels])
gram = tf.matmul(tf.transpose(x), x)

if normalize:
	gram /= tf.cast(height * width * channels, tf.float32)

Then compute the style loss:

n = len(style_layers)
style_loss = 0
for i in range(n):
	gram = gram_matrix(feats[style_layers[i]])
	d = tf.reduce_sum((gram - style_targets[i]) ** 2)
	style_loss += style_weights[i] * d
Total-variation regularization

Compute it with the following formula:

$$L_{tv} = w_t\sum_{c=1}^{3}\sum_{i=1}^{H-1}\sum_{j=1}^{W-1}\left[\left(x_{i,j+1,c} - x_{i,j,c}\right)^2 + \left(x_{i+1,j,c} - x_{i,j,c}\right)^2\right]$$

shape = tf.shape(img)
n, H, W, c = shape[0], shape[1], shape[2], shape[3]
x1 = img[:, :H-1, :W-1, :]
x2 = img[:, :H-1, 1:, :]
x3 = img[:, 1:, :W-1, :]

loss = tf.reduce_sum((x1 - x2) ** 2) + tf.reduce_sum((x1 - x3) ** 2)
loss *= tv_weight

5. Generative Adversarial Networks

The first few parts are fairly simple and are skipped here.

Discriminator

A fully connected net with two hidden layers of 256 units, LeakyReLU(0.01) activations, and a single output logit:

fc1 = tf.layers.dense(x, units=256)
relu1 = leaky_relu(fc1, 0.01)
fc2 = tf.layers.dense(relu1, units=256)
relu2 = leaky_relu(fc2, 0.01)
logits = tf.layers.dense(relu2, 1)
Generator

A fully connected net with two hidden layers of 1024 units, ReLU activations, and a 784-dimensional output squashed with tanh:

fc1 = tf.layers.dense(z, units=1024)
relu1 = tf.nn.relu(fc1)
fc2 = tf.layers.dense(relu1, units=1024)
relu2 = tf.nn.relu(fc2)
fc3 = tf.layers.dense(relu2, units=784)
img = tf.tanh(fc3)
GAN Loss

Generator loss:

$$\ell_G = -\mathbb{E}_{z\sim p(z)}\left[\log D(G(z))\right]$$

Discriminator loss:

$$\ell_D = -\mathbb{E}_{x\sim p_{\text{data}}}\left[\log D(x)\right] - \mathbb{E}_{z\sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

Both follow directly from the hint, using the numerically stable sigmoid cross-entropy op:

d1 = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(logits_real), logits=logits_real)
d2 = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(logits_fake), logits=logits_fake)
D_loss = tf.reduce_mean(d1 + d2)

G_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(logits_fake), logits=logits_fake))
Optimizing our loss

Both the discriminator and the generator use an Adam optimizer with the given learning rate and beta1:

D_solver = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1)
G_solver = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1)

The remaining parts are similar to what came before and are omitted here.