Course videos: https://study.163.com/courses-search?keyword=CS231

Course homepage: http://cs231n.stanford.edu/2017/

References:

https://github.com/Halfish/cs231n/tree/master/assignment2/cs231n

https://github.com/wjbKimberly/cs231n_spring_2017_assignment/blob/master/assignment2/TensorFlow.ipynb

My code: https://github.com/Doraemonzzz/CS231n

This part reviews the key points of Assignment 3.

Preparation

If the "Look at the data" section reports the following error:

The process cannot access the file because it is being used by another process

then comment out the following line in the image_from_url function in image_utils.py:

#os.remove(fname)

1. Image Captioning with Vanilla RNNs

Vanilla RNN: step forward

Make the following assumptions about the quantities involved: $x_t\in \mathbb{R}^{N\times D}$, $W_x\in \mathbb{R}^{D\times H}$, $W_h\in \mathbb{R}^{H\times H}$, $b\in \mathbb{R}^{H}$, and $h_{t-1}, h_t\in \mathbb{R}^{N\times H}$.

For a single example the update formula is

$$h_t = \tanh\left(x_t W_x + h_{t-1} W_h + b\right),$$

and the same expression applies to the whole minibatch. The last addition should be understood as numpy broadcasting, and $\tanh$ acts elementwise. The code:

next_h = np.tanh(prev_h.dot(Wh) + x.dot(Wx) + b)
cache = [x, prev_h, Wx, Wh, b, next_h]
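
As a quick sanity check, here is a minimal usage sketch (it assumes the assignment's cs231n.rnn_layers module is importable from the notebook directory):

import numpy as np
from cs231n.rnn_layers import rnn_step_forward  # assignment module, assumed on the path

N, D, H = 3, 10, 4
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

next_h, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
print(next_h.shape)           # (3, 4)
print(np.abs(next_h).max())   # strictly less than 1 because of tanh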
Vanilla RNN: step backward

Note that $\tanh'(z) = 1 - \tanh^2(z)$, so with $a = x_t W_x + h_{t-1} W_h + b$ and $h_t = \tanh(a)$,

$$\frac{\partial L}{\partial a} = \frac{\partial L}{\partial h_t}\odot\left(1 - h_t^2\right).$$

Therefore, the first step is to compute this quantity:

# read the cache
x, prev_h, Wx, Wh, b, next_h = cache
# gradient of the loss with respect to a
d1 = dnext_h * (1 - next_h ** 2)

Because $a = x_t W_x + h_{t-1} W_h + b$ is affine in each of its inputs, the remaining gradients follow the usual affine backward rules:

dx = d1.dot(Wx.T)
dprev_h = d1.dot(Wh.T)
dWx = x.T.dot(d1)
dWh = prev_h.T.dot(d1)
db = np.sum(d1, axis=0)

For the derivation, see Problem 4 of the first assignment.
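
As a quick recap of the result used above (only the matrix-product case needed here): for $Y = XW$ with $X\in\mathbb{R}^{N\times D}$, $W\in\mathbb{R}^{D\times H}$, and upstream gradient $\frac{\partial L}{\partial Y}\in\mathbb{R}^{N\times H}$,

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\,W^{\top},\qquad \frac{\partial L}{\partial W} = X^{\top}\,\frac{\partial L}{\partial Y},\qquad \frac{\partial L}{\partial b} = \sum_{n}\left(\frac{\partial L}{\partial Y}\right)_{n,:},$$

which is exactly what the five lines above compute, with d1 playing the role of $\frac{\partial L}{\partial Y}$.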

Vanilla RNN: forward

Just loop over the time steps:

# read dimensions
N, T, D = x.shape
H = h0.shape[1]
# initialization
h = np.zeros((N, T, H))
h_ = h0
cache = []
for t in range(T):
	x_ = x[:, t, :]
	h_, cache_ = rnn_step_forward(x_, h_, Wx, Wh, b)
	h[:, t, :] = h_
	cache.append(cache_)

# also store the inputs, matching what rnn_backward reads from cache[-1]
cache.append([x, h0, Wx, Wh, b])
Vanilla RNN: backward

First, recall the structure of the RNN unrolled over time.

For $h_i$, the gradient in the backward pass has two contributions: one is $\nabla_{h_i} y_i$, which corresponds to dh in the code, and the other is the gradient coming through $h_{i+1}$, i.e. the one computed at the previous iteration of the backward loop. So the code is as follows:

# read the cache appended after the loop in rnn_forward
x, h0, Wx, Wh, b = cache[-1]
# initialization
dx = np.zeros_like(x)
dh_prev = np.zeros_like(h0)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
db = np.zeros_like(b)
N, T, H = dh.shape

for t in range(T-1, -1, -1):
	cache_ = cache[t]
	dh_ = dh[:, t, :]
	# the gradient reaching h_t is dh[:, t, :] plus the gradient from step t + 1
	dx_, dh_prev, dWx_, dWh_, db_ = rnn_step_backward(dh_ + dh_prev, cache_)

	# accumulate
	dx[:, t, :] = dx_
	dWx += dWx_
	dWh += dWh_
	db += db_

dh0 = dh_prev
Word embedding: forward

Each element of $x$ is a word index and $W[i,:]$ is the vector for the $i$-th word, so the following code looks up the vectors for all the words:

W[x.flatten(), :]

Reshaping to the output shape gives the result:

N, T = x.shape
V, D = W.shape
out = W[x.flatten(), :].reshape(N, T, D)

cache = [x, W]
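
A tiny self-contained example of this lookup (hypothetical sizes, plain numpy):

import numpy as np

V, D = 5, 3                               # vocabulary size, embedding dimension
N, T = 2, 4                               # batch size, sequence length
W = np.arange(V * D).reshape(V, D)        # row i is the embedding of word i
x = np.array([[0, 3, 1, 1],
              [2, 4, 0, 3]])              # word indices, shape (N, T)

out = W[x.flatten(), :].reshape(N, T, D)
print(out.shape)    # (2, 4, 3)
print(out[1, 1])    # equals W[4], the embedding of word index 4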
Word embedding: backward

This problem needs the np.add.at function (see the numpy documentation on ufunc.at for details).

The answer is:

x, W = cache
dW = np.zeros_like(W)
np.add.at(dW, x, dout)

The effect of this code is equivalent to:

x, W = cache
dW = np.zeros_like(W)

N, T = x.shape
for i in range(N):
	for j in range(T):
		dW[x[i][j]] += dout[i][j]
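
The reason np.add.at is needed (rather than dW[x] += dout) is that the same word index can appear several times in x: fancy-index += applies only one update per repeated index, while np.add.at accumulates all of them. A minimal illustration:

import numpy as np

dW_wrong = np.zeros(3)
dW_right = np.zeros(3)
idx = np.array([0, 0, 2])          # index 0 appears twice
upd = np.array([1.0, 1.0, 5.0])

dW_wrong[idx] += upd               # buffered: index 0 receives only one of the two updates
np.add.at(dW_right, idx, upd)      # unbuffered: index 0 receives both updates

print(dW_wrong)    # [1. 0. 5.]
print(dW_right)    # [2. 0. 5.]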

The "RNN for image captioning" and "Test-time sampling" parts are discussed together with the LSTM sections below.

2. Image Captioning with LSTMs

LSTM: step forward

Just apply the formulas given in the assignment:

A = x.dot(Wx) + prev_h.dot(Wh) + b
H = Wh.shape[0]
i = sigmoid(A[:, : H])
f = sigmoid(A[:, H: 2*H])
o = sigmoid(A[:, 2*H: 3*H])
g = np.tanh(A[:, 3*H: ])

next_c = f * prev_c + i * g
next_h = o * np.tanh(next_c)

cache = (x, prev_h, prev_c, Wx, Wh, b, A, H, i, f, o, g, next_c, next_h)
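
For reference, these lines implement the standard LSTM step, with the gate order i, f, o, g matching the column slices above:

$$a = xW_x + h_{t-1}W_h + b,\qquad i = \sigma(a_i),\quad f = \sigma(a_f),\quad o = \sigma(a_o),\quad g = \tanh(a_g),$$

$$c_t = f\odot c_{t-1} + i\odot g,\qquad h_t = o\odot\tanh(c_t),$$

where $a_i, a_f, a_o, a_g$ are the four $H$-column blocks of $a$ and $\odot$ denotes the elementwise product.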
LSTM: step backward

A full theoretical derivation is deferred for now; a brief sketch is given after the code below. The code:

x, prev_h, prev_c, Wx, Wh, b, A, H, i, f, o, g, next_c, next_h = cache
tan_next_c = np.tanh(next_c)

di = (dnext_c * g + dnext_h * o * (1 - tan_next_c ** 2) * g) * i * (1 - i)
dg = (dnext_c * i + dnext_h * o * (1 - tan_next_c ** 2) * i) * (1 - g ** 2)
df = (dnext_c * prev_c + dnext_h * o * (1 - tan_next_c ** 2) * prev_c) * f * (1 - f)
do = dnext_h * tan_next_c * o * (1 - o)
dA = np.c_[di, df, do, dg]

dx = dA.dot(Wx.T)
dprev_h = dA.dot(Wh.T)
dprev_c = dnext_c * f + dnext_h * o * (1 - tan_next_c ** 2) * f
dWx = x.T.dot(dA)
dWh = prev_h.T.dot(dA)
db = np.sum(dA, axis=0)
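
A brief sketch of where these expressions come from, using the notation of the forward step. Write $\delta_c = \frac{\partial L}{\partial c_t} + \frac{\partial L}{\partial h_t}\odot o\odot\left(1-\tanh^2(c_t)\right)$ for the total gradient reaching $c_t$ (dnext_c plus the part of dnext_h that flows through $h_t = o\odot\tanh(c_t)$). Then

$$\frac{\partial L}{\partial a_i} = \delta_c\odot g\odot i\odot(1-i),\qquad \frac{\partial L}{\partial a_f} = \delta_c\odot c_{t-1}\odot f\odot(1-f),$$

$$\frac{\partial L}{\partial a_o} = \frac{\partial L}{\partial h_t}\odot\tanh(c_t)\odot o\odot(1-o),\qquad \frac{\partial L}{\partial a_g} = \delta_c\odot i\odot(1-g^2),$$

$$\frac{\partial L}{\partial c_{t-1}} = \delta_c\odot f.$$

Stacking $dA = [\,da_i \mid da_f \mid da_o \mid da_g\,]$ then reduces the rest to the same affine rules as the vanilla RNN: $dx = dA\,W_x^{\top}$, $dh_{t-1} = dA\,W_h^{\top}$, $dW_x = x^{\top}dA$, $dW_h = h_{t-1}^{\top}dA$, $db = \sum_n dA_{n,:}$.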
LSTM: forward

Similar to the RNN part:

# read dimensions
N, T, D = x.shape
H = h0.shape[1]
# initialization
h = np.zeros((N, T, H))
h_ = h0
c_ = np.zeros_like(h0)
cache = []
for t in range(T):
	x_ = x[:, t, :]
	h_, c_, cache_ = lstm_step_forward(x_, h_, c_, Wx, Wh, b)
	h[:, t, :] = h_
	cache.append(cache_)

# also store the inputs for lstm_backward to read from cache[-1]
cache.append([x, h0, Wx, Wh, b])
LSTM: backward

Again, similar to the RNN part:

# read the cache
x, h0, Wx, Wh, b = cache[-1]
# initialization
dx = np.zeros_like(x)
dh_prev = np.zeros_like(h0)
dnext_c = np.zeros_like(h0)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
db = np.zeros_like(b)
N, T, H = dh.shape

for t in range(T-1, -1, -1):
	cache_ = cache[t]
	dh_ = dh[:, t, :]
	dx_, dh_prev, dnext_c, dWx_, dWh_, db_ = lstm_step_backward(dh_ + dh_prev, dnext_c, cache_)
	
	# accumulate
	dx[:, t, :] = dx_
	dWx += dWx_
	dWh += dWh_
	db += db_

dh0 = dh_prev
LSTM captioning model

Follow the comments to implement the forward and backward passes. The affine layer gave me a bit of trouble here, so its forward and backward passes are re-implemented directly in the code:

#1 affine: image features -> initial hidden state
h1 = features.dot(W_proj) + b_proj
cache_affine_1 = (features, W_proj, b_proj, h1)
#2 word embedding
x1, cache_word = word_embedding_forward(captions_in, W_embed)
#3 RNN / LSTM over the caption
if self.cell_type == "rnn":
	h, cache_forward = rnn_forward(x1, h1, Wx, Wh, b)
else:
	h, cache_forward = lstm_forward(x1, h1, Wx, Wh, b)

#4 temporal affine: hidden states -> vocabulary scores
s2, cache_affine_2 = temporal_affine_forward(h, W_vocab, b_vocab)
#5 temporal softmax loss
loss, dx = temporal_softmax_loss(s2, captions_out, mask, verbose=False)


#temporal affine backward
dx1, grads["W_vocab"], grads["b_vocab"] = temporal_affine_backward(dx, cache_affine_2)
#rnn / lstm backward
if self.cell_type == "rnn":
	dx2, dh0, dWx, dWh, db = rnn_backward(dx1, cache_forward)
else:
	dx2, dh0, dWx, dWh, db = lstm_backward(dx1, cache_forward)
grads['Wx'] = dWx
grads['Wh'] = dWh
grads["b"] = db
#word embedding backward
grads['W_embed'] = word_embedding_backward(dx2, cache_word)
#backward through the re-implemented feature-to-hidden affine layer
grads['W_proj'] = features.T.dot(dh0)
grads['b_proj'] = np.sum(dh0, axis=0)
LSTM test-time sampling

This part is a bit hard to follow, and I am not entirely sure my implementation is correct. First, look at word_embedding_forward:

def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """

Note that the output has shape (N, T, D), so to get the vectors for the t-th word you should use:

out[:, t, :]

The complete code is as follows:

# initialization: every caption begins with the <START> token
start = np.array([self._start] * N).reshape(-1, 1)
prev_word = start
h = features.dot(W_proj) + b_proj
# initial cell state for the LSTM
c = np.zeros_like(h)
for i in range(max_length):
	word, cache = word_embedding_forward(prev_word, W_embed)
	# get the word vectors; the embedding output has shape (N, 1, D)
	word = word[:, 0, :]
	if self.cell_type == "rnn":
		h, cache = rnn_step_forward(word, h, Wx, Wh, b)
	else:
		h, c, cache = lstm_step_forward(word, h, c, Wx, Wh, b)
	# compute vocabulary scores
	score = h.dot(W_vocab) + b_vocab
	# index of the highest-scoring word
	idx = np.argmax(score, axis=1)
	# record the result
	captions[:, i] = idx
	# remember the sampled word as the input for the next step
	prev_word = idx.reshape(-1, 1)
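
A usage sketch of this method (assuming the variable names used in the notebook: the small_data dict, a trained model small_lstm_model, and the helpers from cs231n.coco_utils; adjust the names to your own setup):

from cs231n.coco_utils import sample_coco_minibatch, decode_captions

# sample a couple of validation images and caption them with the trained model
captions_gt, features, urls = sample_coco_minibatch(small_data, split='val', batch_size=2)
sample_captions = small_lstm_model.sample(features)          # shape (2, max_length)
for caption in decode_captions(sample_captions, small_data['idx_to_word']):
	print(caption)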

3. Network Visualization: Saliency maps, Class Visualization, and Fooling Images

If the following error appears:

You need to download SqueezeNet!

then comment out the following statements:

if not os.path.exists(SAVE_PATH):
    raise ValueError("You need to download SqueezeNet!")

Since this part uses TensorFlow, which I am not very familiar with, it mostly follows code found online.

Saliency Maps

Compute the maximum of the absolute value of the gradient over the channel axis:

# the loss: mean of the correct-class scores
loss = tf.reduce_mean(correct_scores)
# gradient of the loss with respect to the input images
grad = tf.gradients(loss, model.image)
# tf.gradients returns a list; the first element is the gradient tensor
gradabs = tf.abs(grad[0])
# maximum absolute value over the channel axis
saliency_ = tf.reduce_max(gradabs, axis=-1)
# run
saliency = sess.run(saliency_, feed_dict={model.image: X, model.labels: y})
Fooling Images

Gradient ascent:

# gradient of the target class score with respect to the image
grad = tf.gradients(model.classifier[0, target_y], model.image)[0]
# normalized gradient-ascent step
dX = learning_rate * grad / tf.norm(grad)
for i in range(100):
	# run one step
	dx = sess.run(dX, feed_dict={model.image: X_fooling})
	# update the image
	X_fooling += dx
Class visualization

For the loss computation here, I found that others used the following code:

loss = model.classifier[0, target_y] - l2_reg * model.image * model.image

This does not look quite right to me, because the result is not a scalar, yet in practice it works better and I have not figured out why. One possible explanation: tf.gradients differentiates the sum of its first argument, so the non-scalar loss is equivalent to summing it first, which multiplies the broadcast score term by the number of pixels and therefore makes the L2 regularization relatively much weaker.

########################################################################
# TODO: Compute the loss and the gradient of the loss with respect to  #
# the input image, model.image. We compute these outside the loop so   #
# that we don't have to recompute the gradient graph at each iteration #
#                                                                      #
# Note: loss and grad should be TensorFlow Tensors, not numpy arrays!  #
#                                                                      #
# The loss is the score for the target label, target_y. You should     #
# use model.classifier to get the scores, and tf.gradients to compute  #
# gradients. Don't forget the (subtracted) L2 regularization term!     #
########################################################################
loss = None # scalar loss
grad = None # gradient of loss with respect to model.image, same size as model.image
loss = model.classifier[0, target_y] - l2_reg * tf.norm(model.image) ** 2
#loss = model.classifier[0, target_y] - l2_reg * model.image * model.image
# compute the gradient of the loss with respect to the image
grad = tf.gradients(loss, model.image)[0]
# normalize to get the gradient-ascent step
dX = learning_rate * grad / tf.norm(grad)
############################################################################
#                             END OF YOUR CODE                             #
############################################################################


for t in range(num_iterations):
	# Randomly jitter the image a bit; this gives slightly nicer results
	ox, oy = np.random.randint(-max_jitter, max_jitter+1, 2)
	Xi = X.copy()
	X = np.roll(np.roll(X, ox, 1), oy, 2)
	
	########################################################################
	# TODO: Use sess to compute the value of the gradient of the score for #
	# class target_y with respect to the pixels of the image, and make a   #
	# gradient step on the image using the learning rate. You should use   #
	# the grad variable you defined above.                                 #
	#                                                                      #
	# Be very careful about the signs of elements in your code.            #
	########################################################################
	dx = sess.run(dX, feed_dict={model.image: X})
	X += dx
	############################################################################
	#                             END OF YOUR CODE                             #
	############################################################################

4. Style Transfer

If check_scipy() reports an error, just comment out the following:

'''
vnum = int(scipy.__version__.split('.')[1])
assert vnum >= 16, "You must install SciPy >= 0.16.0 to complete this notebook."
'''
Content loss

Compute it with the following formula, where $F^{\ell}$ are the features of the current image at the chosen layer, $P^{\ell}$ are the features of the original content image, and $w_c$ is the content weight:

$$L_c = w_c\sum_{i,j}\left(F^{\ell}_{ij} - P^{\ell}_{ij}\right)^2$$

n, height, width, channels = content_current.shape
c = tf.reshape(content_current, (-1, height * width))
o = tf.reshape(content_original, (-1, height * width))
loss = content_weight * tf.reduce_sum((c - o) ** 2)
Style loss

First compute the Gram matrix $G = F^{\top}F$, where $F$ is the feature map reshaped to shape $(N\cdot H\cdot W, C)$; when normalize is set, divide by $H\cdot W\cdot C$:

shape = tf.shape(features)
n, height, width, channels = shape[0], shape[1], shape[2], shape[3]
x = tf.reshape(features, shape=[height * width * n, channels])
gram = tf.matmul(tf.transpose(x), x)

if normalize:
	gram /= tf.cast(height * width * channels, tf.float32)

Then compute the style loss:

n = len(style_layers)
style_loss = 0
for i in range(n):
	gram = gram_matrix(feats[style_layers[i]])
	d = tf.reduce_sum((gram - style_targets[i]) ** 2)
	style_loss += style_weights[i] * d
Total-variation regularization

Compute it with the following formula:

$$L_{tv} = w_t\sum_{c=1}^{3}\sum_{i=1}^{H-1}\sum_{j=1}^{W-1}\left[\left(x_{i,j+1,c} - x_{i,j,c}\right)^2 + \left(x_{i+1,j,c} - x_{i,j,c}\right)^2\right]$$

shape = tf.shape(img)
n, H, W, c = shape[0], shape[1], shape[2], shape[3]
x1 = img[:, :H-1, :W-1, :]
x2 = img[:, :H-1, 1:, :]
x3 = img[:, 1:, :W-1, :]

loss = tf.reduce_sum((x1 - x2) ** 2) + tf.reduce_sum((x1 - x3) ** 2)
loss *= tv_weight

5. Generative Adversarial Networks

The first few parts are fairly simple and are skipped here.

Discriminator

A fully connected net with two hidden layers of 256 units, LeakyReLU(0.01) activations, and a single output logit:

fc1 = tf.layers.dense(x, units=256)
relu1 = leaky_relu(fc1, 0.01)
fc2 = tf.layers.dense(relu1, units=256)
relu2 = leaky_relu(fc2, 0.01)
logits = tf.layers.dense(relu2, 1)
Generator

A fully connected net with two hidden layers of 1024 units, ReLU activations, and a 784-dimensional output squashed with tanh:

fc1 = tf.layers.dense(z, units=1024)
relu1 = tf.nn.relu(fc1)
fc2 = tf.layers.dense(relu1, units=1024)
relu2 = tf.nn.relu(fc2)
fc3 = tf.layers.dense(relu2, units=784)
img = tf.tanh(fc3)
GAN Loss

Generator loss:

$$\ell_G = -\mathbb{E}_{z\sim p(z)}\left[\log D(G(z))\right]$$

Discriminator loss:

$$\ell_D = -\mathbb{E}_{x\sim p_{\text{data}}}\left[\log D(x)\right] - \mathbb{E}_{z\sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

Both follow directly from the hint, using the numerically stable sigmoid cross-entropy op:

d1 = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(logits_real), logits=logits_real)
d2 = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(logits_fake), logits=logits_fake)
D_loss = tf.reduce_mean(d1 + d2)

G_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(logits_fake), logits=logits_fake))
Optimizing our loss

Both the discriminator and the generator use an Adam optimizer with the given learning rate and beta1:

D_solver = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1)
G_solver = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1)

The remaining parts are similar to what came before and are omitted here.