CS231n Assignment 3
Course videos: https://study.163.com/courses-search?keyword=CS231
Course homepage: http://cs231n.stanford.edu/2017/
References:
https://github.com/Halfish/cs231n/tree/master/assignment2/cs231n
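My code: https://github.com/Doraemonzzz/CS231n
This post reviews the key points of Assignment 3.
Preparation
If the "Look at the data" part fails with the following error (Windows):
The process cannot access the file because it is being used by another process
then comment out the following line in the function image_from_url in image_utils.py: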
#os.remove(fname)
1. Image Captioning with Vanilla RNNs
Vanilla RNN: step forward
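Assume the following shapes for the quantities involved (as in the assignment's docstring):
$$x \in \mathbb{R}^{N \times D},\quad h_{\text{prev}} \in \mathbb{R}^{N \times H},\quad W_x \in \mathbb{R}^{D \times H},\quad W_h \in \mathbb{R}^{H \times H},\quad b \in \mathbb{R}^{H}$$
For a single sample, i.e. one row $x^{(i)}$ of $x$ and $h_{\text{prev}}^{(i)}$ of $h_{\text{prev}}$ written as column vectors, the update is:
$$h_{\text{next}}^{(i)} = \tanh\left(W_h^T h_{\text{prev}}^{(i)} + W_x^T x^{(i)} + b\right)$$
So for the whole batch:
$$h_{\text{next}} = \tanh\left(h_{\text{prev}} W_h + x W_x + b\right)$$
The last addition should be read as NumPy addition (the bias $b$ is broadcast over the rows), and $\tanh$ is applied elementwise. The code is: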
next_h = np.tanh(prev_h.dot(Wh) + x.dot(Wx) + b)
cache = [x, prev_h, Wx, Wh, b, next_h]
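As a quick sanity check, the step above can be wrapped in a function and compared against the per-sample formula. This is only a minimal sketch: the sizes N, D, H and the random inputs are made up for illustration, and the function signature follows the assignment's `rnn_step_forward`.

```python
import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """One vanilla RNN step: next_h = tanh(prev_h @ Wh + x @ Wx + b)."""
    next_h = np.tanh(prev_h.dot(Wh) + x.dot(Wx) + b)
    cache = [x, prev_h, Wx, Wh, b, next_h]
    return next_h, cache

# Hypothetical sizes, chosen only for illustration.
N, D, H = 3, 4, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((N, D))
prev_h = rng.standard_normal((N, H))
Wx = rng.standard_normal((D, H))
Wh = rng.standard_normal((H, H))
b = rng.standard_normal(H)

next_h, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)

# Per-sample formula applied to the first row only.
row0 = np.tanh(Wh.T.dot(prev_h[0]) + Wx.T.dot(x[0]) + b)

print(next_h.shape)                  # (3, 5)
print(np.allclose(next_h[0], row0))  # True
```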
Vanilla RNN: step backward
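Note that
$$\tanh'(z) = 1 - \tanh^2(z)$$
So, writing $a = h_{\text{prev}} W_h + x W_x + b$ and $h_{\text{next}} = \tanh(a)$,
$$\frac{\partial L}{\partial a} = \frac{\partial L}{\partial h_{\text{next}}} \odot \left(1 - h_{\text{next}}^2\right)$$
The first step is therefore to compute:
# read the cache
x, prev_h, Wx, Wh, b, next_h = cache
# gradient with respect to the pre-activation a
d1 = dnext_h * (1 - next_h ** 2)
Because
$$a = h_{\text{prev}} W_h + x W_x + b$$
is affine in each of its arguments, the remaining gradients are computed as follows: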
dx = d1.dot(Wx.T)
dprev_h = d1.dot(Wh.T)
dWx = x.T.dot(d1)
dWh = prev_h.T.dot(d1)
db = np.sum(d1, axis=0)
For the derivation, see Exercise 4 of Assignment 1.
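The analytic gradients above can also be checked numerically. The sketch below is not the assignment's own gradient checker: `numeric_grad` is a hypothetical helper that uses centered finite differences, and the sizes and random inputs are arbitrary.

```python
import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    next_h = np.tanh(prev_h.dot(Wh) + x.dot(Wx) + b)
    return next_h, [x, prev_h, Wx, Wh, b, next_h]

def rnn_step_backward(dnext_h, cache):
    x, prev_h, Wx, Wh, b, next_h = cache
    d1 = dnext_h * (1 - next_h ** 2)   # gradient w.r.t. the pre-activation
    dx = d1.dot(Wx.T)
    dprev_h = d1.dot(Wh.T)
    dWx = x.T.dot(d1)
    dWh = prev_h.T.dot(d1)
    db = np.sum(d1, axis=0)
    return dx, dprev_h, dWx, dWh, db

def numeric_grad(f, a, dout, eps=1e-5):
    """Hypothetical helper: centered differences of sum(f(a) * dout) w.r.t. a."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        old = a[idx]
        a[idx] = old + eps
        pos = f(a).copy()
        a[idx] = old - eps
        neg = f(a).copy()
        a[idx] = old                   # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * eps)
    return grad

# Arbitrary illustrative sizes and inputs.
N, D, H = 2, 3, 4
rng = np.random.default_rng(1)
x = rng.standard_normal((N, D))
prev_h = rng.standard_normal((N, H))
Wx = rng.standard_normal((D, H))
Wh = rng.standard_normal((H, H))
b = rng.standard_normal(H)
dnext_h = rng.standard_normal((N, H))

_, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)

dx_num = numeric_grad(lambda v: rnn_step_forward(v, prev_h, Wx, Wh, b)[0], x, dnext_h)
dWh_num = numeric_grad(lambda v: rnn_step_forward(x, prev_h, Wx, v, b)[0], Wh, dnext_h)

print(np.allclose(dx, dx_num, atol=1e-6))    # True
print(np.allclose(dWh, dWh_num, atol=1e-6))  # True
```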
Vanilla RNN: forward
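Simply loop over the time steps:
# read the dimensions
N, T, D = x.shape
H = h0.shape[1]
# initialize
h = np.zeros((N, T, H))
h_ = h0
cache = []
for t in range(T):
    x_ = x[:, t, :]
    h_, cache_ = rnn_step_forward(x_, h_, Wx, Wh, b)
    h[:, t, :] = h_
    cache.append(cache_)
# store the inputs as the last entry; the backward pass reads them from cache[-1]
cache.append([x, h0, Wx, Wh, b])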
Vanilla RNN: backward
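First, recall the structure of the RNN: each hidden state $h_i$ feeds both the output $y_i$ and the next hidden state $h_{i+1}$.
For $h_i$, the backward data flow therefore has two terms. One is $\nabla_{h_i} y_i$, which corresponds to dh in the code; the other is the gradient flowing back from $h_{i+1}$, i.e. the gradient returned by the step processed just before in the backward loop. The code is:
# read the cache (the forward pass must store [x, h0, Wx, Wh, b] as its last entry)
x, h0, Wx, Wh, b = cache[-1]
# initialize
dx = np.zeros_like(x)
dh_prev = np.zeros_like(h0)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
db = np.zeros_like(b)
N, T, H = dh.shape
for t in range(T-1, -1, -1):
    cache_ = cache[t]
    dh_ = dh[:, t, :]
    # add the gradient from the output and the gradient from the next time step
    dx_, dh_prev, dWx_, dWh_, db_ = rnn_step_backward(dh_ + dh_prev, cache_)
    # accumulate
    dx[:, t, :] = dx_
    dWx += dWx_
    dWh += dWh_
    db += db_
dh0 = dh_prev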
Word embedding: forward
Each element of $x$ is a word index and $W[i,:]$ is the vector representation of word $i$, so the following expression gives the vectors for all words:
W[x.flatten(), :]
Reshaping to the output shape gives the result:
N, T = x.shape
V, D = W.shape
out = W[x.flatten(), :].reshape(N, T, D)
cache = [x, W]
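A small check of the lookup above, on a made-up vocabulary (V=4, D=3; the values are purely illustrative). It also shows that indexing `W` directly with the integer array `x` gives the same `(N, T, D)` result without the flatten/reshape round trip.

```python
import numpy as np

# Hypothetical toy vocabulary: V=4 words, D=3 dimensions.
W = np.arange(12, dtype=float).reshape(4, 3)
x = np.array([[0, 3],
              [2, 2]])                      # (N, T) word indices

N, T = x.shape
V, D = W.shape
out = W[x.flatten(), :].reshape(N, T, D)    # as in the text
out_direct = W[x]                           # integer-array indexing, already (N, T, D)

print(out.shape)                            # (2, 2, 3)
print(out[0, 1].tolist())                   # [9.0, 10.0, 11.0], i.e. row 3 of W
print(np.array_equal(out, out_direct))      # True
```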
Word embedding: backward
This part relies on the np.add.at function; two useful references are:
- https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ufunc.at.html
- https://stackoverflow.com/questions/45473896/np-add-at-indexing-with-array
The solution is:
x, W = cache
dW = np.zeros_like(W)
np.add.at(dW, x, dout)
This is equivalent to the following loop:
x, W = cache
dW = np.zeros_like(W)
N, T = x.shape
for i in range(N):
    for j in range(T):
        dW[x[i][j]] += dout[i][j]
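The reason `np.add.at` is needed rather than `dW[x] += dout` is that the same word index can appear several times. A minimal illustration on made-up data (V=4, D=2, upstream gradient of all ones):

```python
import numpy as np

# Hypothetical toy sizes: vocabulary V=4, embedding dimension D=2.
x = np.array([[0, 1, 1],
              [3, 1, 0]])          # (N, T); index 1 appears three times, index 0 twice
dout = np.ones((2, 3, 2))          # (N, T, D) upstream gradient

# Buffered fancy-index update: a repeated index is only written once.
dW_naive = np.zeros((4, 2))
dW_naive[x] += dout

# Unbuffered accumulation: every occurrence contributes.
dW = np.zeros((4, 2))
np.add.at(dW, x, dout)

print(dW_naive[:, 0].tolist())     # [1.0, 1.0, 0.0, 1.0]
print(dW[:, 0].tolist())           # [2.0, 3.0, 0.0, 1.0]
```

Only the `np.add.at` version matches the explicit double loop.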
The "RNN for image captioning" and "Test-time sampling" parts are discussed in the LSTM section.
2. Image Captioning with LSTMs
LSTM: step forward
Just follow the formulas given in the assignment:
A = x.dot(Wx) + prev_h.dot(Wh) + b
H = Wh.shape[0]
i = sigmoid(A[:, : H])
f = sigmoid(A[:, H: 2*H])
o = sigmoid(A[:, 2*H: 3*H])
g = np.tanh(A[:, 3*H: ])
next_c = f * prev_c + i * g
next_h = o * np.tanh(next_c)
cache = (x, prev_h, prev_c, Wx, Wh, b, A, H, i, f, o, g, next_c, next_h)
LSTM: step backward
The full theoretical derivation is left for a later update; the code is:
x, prev_h, prev_c, Wx, Wh, b, A, H, i, f, o, g, next_c, next_h = cache
tan_next_c = np.tanh(next_c)
di = (dnext_c * g + dnext_h * o * (1 - tan_next_c ** 2) * g) * i * (1 - i)
dg = (dnext_c * i + dnext_h * o * (1 - tan_next_c ** 2) * i) * (1 - g ** 2)
df = (dnext_c * prev_c + dnext_h * o * (1 - tan_next_c ** 2) * prev_c) * f * (1 - f)
do = dnext_h * tan_next_c * o * (1 - o)
dA = np.c_[di, df, do, dg]
dx = dA.dot(Wx.T)
dprev_h = dA.dot(Wh.T)
dprev_c = dnext_c * f + dnext_h * o * (1 - tan_next_c ** 2) * f
dWx = x.T.dot(dA)
dWh = prev_h.T.dot(dA)
db = np.sum(dA, axis=0)
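As a sketch of where these expressions come from (not a full derivation), the code follows from applying the chain rule to the forward formulas. Here $dc_t$ and $dh_t$ stand for dnext_c and dnext_h, $\delta_c$ is an auxiliary symbol introduced only for this sketch, and $\odot$ is the elementwise product.

```latex
\begin{aligned}
A &= x W_x + h_{t-1} W_h + b = [\,a_i,\ a_f,\ a_o,\ a_g\,] \\
i &= \sigma(a_i),\quad f = \sigma(a_f),\quad o = \sigma(a_o),\quad g = \tanh(a_g) \\
c_t &= f \odot c_{t-1} + i \odot g,\qquad h_t = o \odot \tanh(c_t) \\[4pt]
\delta_c &= dc_t + dh_t \odot o \odot \left(1 - \tanh^2(c_t)\right) \\
da_i &= \delta_c \odot g \odot i \odot (1 - i) \\
da_f &= \delta_c \odot c_{t-1} \odot f \odot (1 - f) \\
da_o &= dh_t \odot \tanh(c_t) \odot o \odot (1 - o) \\
da_g &= \delta_c \odot i \odot \left(1 - g^2\right) \\
dc_{t-1} &= \delta_c \odot f \\[4pt]
dA &= [\,da_i,\ da_f,\ da_o,\ da_g\,] \\
dx &= dA\, W_x^T,\quad dh_{t-1} = dA\, W_h^T,\quad
dW_x = x^T dA,\quad dW_h = h_{t-1}^T dA,\quad db = \textstyle\sum_{n} dA_{n,:}
\end{aligned}
```

The term $\delta_c$ collects both paths into $c_t$: the direct gradient from the next time step and the path through $h_t = o \odot \tanh(c_t)$. This is why the same factor appears in di, df, dg and dprev_c in the code.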
LSTM: forward
Similar to the RNN case:
# read the dimensions
N, T, D = x.shape
H = h0.shape[1]
# initialize
h = np.zeros((N, T, H))
h_ = h0
c_ = np.zeros_like(h0)
cache = []
for t in range(T):
    x_ = x[:, t, :]
    h_, c_, cache_ = lstm_step_forward(x_, h_, c_, Wx, Wh, b)
    h[:, t, :] = h_
    cache.append(cache_)
# store the inputs as the last entry; the backward pass reads them from cache[-1]
cache.append([x, h0, Wx, Wh, b])
LSTM: backward
Again similar to the RNN case:
# read the cache
x, h0, Wx, Wh, b = cache[-1]
# initialize
dx = np.zeros_like(x)
dh_prev = np.zeros_like(h0)
dnext_c = np.zeros_like(h0)
dWx = np.zeros_like(Wx)
dWh = np.zeros_like(Wh)
db = np.zeros_like(b)
N, T, H = dh.shape
for t in range(T-1, -1, -1):
    cache_ = cache[t]
    dh_ = dh[:, t, :]
    dx_, dh_prev, dnext_c, dWx_, dWh_, db_ = lstm_step_backward(dh_ + dh_prev, dnext_c, cache_)
    # accumulate
    dx[:, t, :] = dx_
    dWx += dWx_
    dWh += dWh_
    db += db_
dh0 = dh_prev
LSTM captioning model
Follow the comments in the skeleton for the forward and backward passes. The provided affine layer caused problems here, so the affine projection is reimplemented inline:
#1 project the image features to the initial hidden state
h1 = features.dot(W_proj) + b_proj
cache_affine_1 = (features, W_proj, b_proj, h1)
#2 word embedding
x1, cache_word = word_embedding_forward(captions_in, W_embed)
#3 RNN or LSTM
if self.cell_type == "rnn":
    h, cache_forward = rnn_forward(x1, h1, Wx, Wh, b)
else:
    h, cache_forward = lstm_forward(x1, h1, Wx, Wh, b)
#4 scores over the vocabulary
s2, cache_affine_2 = temporal_affine_forward(h, W_vocab, b_vocab)
#5 loss
loss, dx = temporal_softmax_loss(s2, captions_out, mask, verbose=False)
#temporal_affine
dx1, grads["W_vocab"], grads["b_vocab"] = temporal_affine_backward(dx, cache_affine_2)
#rnn
if self.cell_type == "rnn":
    dx2, dh0, dWx, dWh, db = rnn_backward(dx1, cache_forward)
    grads['Wx'] = dWx
    grads['Wh'] = dWh
    grads["b"] = db
else:
    dx2, dh0, dWx, dWh, db = lstm_backward(dx1, cache_forward)
    grads['Wx'] = dWx
    grads['Wh'] = dWh
    grads["b"] = db
#word_embedding
grads['W_embed'] = word_embedding_backward(dx2, cache_word)
#affine (feature projection)
grads['W_proj'] = features.T.dot(dh0)
grads['b_proj'] = np.sum(dh0, axis=0)
LSTM test-time sampling
This part is harder to follow, and I am not entirely sure my implementation is correct. First look at word_embedding_forward:
def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    to a vector of dimension D.
    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.
    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
Note that the output has shape (N, T, D), so the vector of the t-th word is obtained with:
out[:, t, :]
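The full code is as follows (N and captions are already defined by the assignment skeleton):
# initialize: every sequence starts with the <START> token
start = np.array([self._start] * N).reshape(-1, 1)
prev_word = start
h = features.dot(W_proj) + b_proj
# initial LSTM cell state
c = np.zeros_like(h)
for i in range(max_length):
    word, _ = word_embedding_forward(prev_word, W_embed)
    # word has shape (N, 1, D); take the single time step
    word = word[:, 0, :]
    if self.cell_type == "rnn":
        h, _ = rnn_step_forward(word, h, Wx, Wh, b)
    else:
        h, c, _ = lstm_step_forward(word, h, c, Wx, Wh, b)
    # scores over the vocabulary
    score = h.dot(W_vocab) + b_vocab
    # index of the highest-scoring word
    idx = np.argmax(score, axis=1)
    # record the result
    captions[:, i] = idx
    # feed the chosen word back in at the next step
    prev_word = idx.reshape(-1, 1)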
3. Network Visualization: Saliency maps, Class Visualization, and Fooling Images
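If the following error appears:
You need to download SqueezeNet!
then comment out the following statements:
if not os.path.exists(SAVE_PATH):
    raise ValueError("You need to download SqueezeNet!")
This part uses TensorFlow, which I am not very familiar with, so it is mostly based on code found online.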
Saliency Maps
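Take the absolute value of the gradient, then the maximum over the channel axis:
# loss: mean of the correct-class scores
loss = tf.reduce_mean(correct_scores)
# gradient with respect to the input image
grad = tf.gradients(loss, model.image)
# tf.gradients returns a list; its first element is the gradient
gradabs = tf.abs(grad[0])
# maximum of the absolute gradient over the channels
saliency_ = tf.reduce_max(gradabs, axis=-1)
# run
saliency = sess.run(saliency_, feed_dict={model.image: X, model.labels: y})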
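The reduction itself is easy to see in NumPy. The sketch below uses a made-up gradient for a single 2x2 image with 3 channels; only the absolute value and the channel-wise maximum correspond to the TensorFlow code above.

```python
import numpy as np

# Hypothetical gradient of shape (N, H, W, C) = (1, 2, 2, 3).
grad = np.array([[[[0.1, -0.5, 0.2], [-0.3, 0.0, 0.1]],
                  [[0.0,  0.0, 0.0], [ 0.7, -0.9, 0.4]]]])

# Saliency: largest absolute gradient across the channel axis.
saliency = np.abs(grad).max(axis=-1)

print(saliency.shape)        # (1, 2, 2)
print(saliency[0].tolist())  # [[0.5, 0.3], [0.0, 0.9]]
```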
Fooling Images
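Gradient ascent:
# gradient of the target-class score with respect to the image;
# tf.gradients returns a list, so take its first element
grad = tf.gradients(model.classifier[0, target_y], model.image)[0]
# normalized update step
dX = learning_rate * grad / tf.norm(grad)
for i in range(100):
    # run
    dx = sess.run(dX, feed_dict={model.image: X_fooling})
    # update
    X_fooling += dx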
Class visualization
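For the loss here, I found that other people use the following code:
loss = model.classifier[0, target_y] - l2_reg * model.image * model.image
I do not think this is quite right, because the result is not a scalar. In practice, however, it gives better results, and I have not yet found the reason.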
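One possible explanation, offered here only as a sketch: tf.gradients sums the gradients over all elements of a non-scalar loss. Write $s_y(I)$ for the target-class score, $\lambda$ for l2_reg, and $P$ for the number of elements of the image tensor $I$ (these symbols are introduced just for this note). The elementwise loss $\ell_p$ then yields:

```latex
\begin{aligned}
\ell_p &= s_y(I) - \lambda I_p^2, \qquad p = 1,\dots,P \\
\nabla_I \sum_{p=1}^{P} \ell_p &= P\,\nabla_I s_y(I) - 2\lambda I
  = P\left(\nabla_I s_y(I) - 2\,\tfrac{\lambda}{P}\, I\right) \\
\nabla_I\left(s_y(I) - \lambda \lVert I \rVert_2^2\right) &= \nabla_I s_y(I) - 2\lambda I
\end{aligned}
```

If the update step is normalized by the gradient norm, the factor $P$ cancels, so the elementwise version behaves like the scalar loss with regularization strength $\lambda / P$, i.e. far weaker regularization. That difference in effective regularization may be what changes the visual result.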
########################################################################
# TODO: Compute the loss and the gradient of the loss with respect to #
# the input image, model.image. We compute these outside the loop so #
# that we don't have to recompute the gradient graph at each iteration #
# #
# Note: loss and grad should be TensorFlow Tensors, not numpy arrays! #
# #
# The loss is the score for the target label, target_y. You should #
# use model.classifier to get the scores, and tf.gradients to compute #
# gradients. Don't forget the (subtracted) L2 regularization term! #
########################################################################
loss = None # scalar loss
grad = None # gradient of loss with respect to model.image, same size as model.image
loss = model.classifier[0, target_y] - l2_reg * tf.norm(model.image) ** 2
#loss = model.classifier[0, target_y] - l2_reg * model.image * model.image
# compute the gradient
grad = tf.gradients(loss, model.image)[0]
# normalized update step
dX = learning_rate * grad / tf.norm(grad)
############################################################################
# END OF YOUR CODE #
############################################################################
for t in range(num_iterations):
    # Randomly jitter the image a bit; this gives slightly nicer results
    ox, oy = np.random.randint(-max_jitter, max_jitter+1, 2)
    Xi = X.copy()
    X = np.roll(np.roll(X, ox, 1), oy, 2)
    ########################################################################
    # TODO: Use sess to compute the value of the gradient of the score for #
    # class target_y with respect to the pixels of the image, and make a   #
    # gradient step on the image using the learning rate. You should use   #
    # the grad variable you defined above.                                 #
    #                                                                      #
    # Be very careful about the signs of elements in your code.            #
    ########################################################################
    dx = sess.run(dX, feed_dict={model.image: X})
    X += dx
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
4. Style Transfer
If check_scipy() raises an error, comment out the following:
'''
vnum = int(scipy.__version__.split('.')[1])
assert vnum >= 16, "You must install SciPy >= 0.16.0 to complete this notebook."
'''
Content loss
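Use the following formula, where $F^{\ell}$ denotes the features of the current image and $P^{\ell}$ the features of the content image at layer $\ell$, and $w_c$ is content_weight:
$$L_c = w_c \times \sum_{i,j} \left(F_{ij}^{\ell} - P_{ij}^{\ell}\right)^2$$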
n, height, width, channels = content_current.shape
c = tf.reshape(content_current, (-1, height * width))
o = tf.reshape(content_original, (-1, height * width))
loss = content_weight * tf.reduce_sum((c - o) ** 2)
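Since the loss is just a weighted sum of squared differences, the reshape does not change its value. A NumPy sketch on made-up feature maps (the function name mirrors the assignment, but this is not the TensorFlow implementation):

```python
import numpy as np

def content_loss(content_weight, content_current, content_original):
    """Weighted sum of squared differences between two feature maps."""
    return content_weight * np.sum((content_current - content_original) ** 2)

# Hypothetical feature maps of shape (1, height, width, channels) = (1, 2, 2, 3).
cur = np.arange(12, dtype=float).reshape(1, 2, 2, 3)
orig = cur + 0.5                     # every entry differs by 0.5

loss = content_loss(2.0, cur, orig)
loss_reshaped = content_loss(2.0, cur.reshape(-1, 4), orig.reshape(-1, 4))

print(loss)                          # 6.0  (2 * 12 * 0.25)
print(loss == loss_reshaped)         # True
```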
Style loss
First compute the Gram matrix:
#print(tf.shape(features)[0])
shape = tf.shape(features)
n, height, width, channels = shape[0], shape[1], shape[2], shape[3]
x = tf.reshape(features, shape=[height * width * n, channels])
gram = tf.matmul(tf.transpose(x), x)
if normalize:
    gram /= tf.cast(height * width * channels, tf.float32)
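The same computation in NumPy, on a tiny made-up feature map, makes the shapes explicit: the spatial positions are flattened into rows, and the Gram matrix is the (channels x channels) matrix of inner products between channel columns. This is an illustrative sketch, not the TensorFlow version used in the notebook.

```python
import numpy as np

def gram_matrix(features, normalize=True):
    """Gram matrix of a feature map of shape (n, height, width, channels)."""
    n, height, width, channels = features.shape
    x = features.reshape(n * height * width, channels)
    gram = x.T.dot(x)                          # (channels, channels)
    if normalize:
        gram = gram / (height * width * channels)
    return gram

# Hypothetical feature map: 1 image, 2x2 spatial positions, 2 channels.
features = np.arange(8, dtype=float).reshape(1, 2, 2, 2)
gram = gram_matrix(features)

print(gram.tolist())                           # [[7.0, 8.5], [8.5, 10.5]]
```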
Then compute style_loss:
n = len(style_layers)
style_loss = 0
for i in range(n):
    gram = gram_matrix(feats[style_layers[i]])
    d = tf.reduce_sum((gram - style_targets[i]) ** 2)
    style_loss += style_weights[i] * d
Total-variation regularization
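Use the following formula, where $w_t$ is tv_weight:
$$L_{tv} = w_t \times \sum_{c}\sum_{i=1}^{H-1}\sum_{j=1}^{W-1}\left(\left(x_{i,j+1,c} - x_{i,j,c}\right)^2 + \left(x_{i+1,j,c} - x_{i,j,c}\right)^2\right)$$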
shape = tf.shape(img)
n, H, W, c = shape[0], shape[1], shape[2], shape[3]
x1 = img[:, :H-1, :W-1, :]
x2 = img[:, :H-1, 1:, :]
x3 = img[:, 1:, :W-1, :]
loss = tf.reduce_sum((x1 - x2) ** 2) + tf.reduce_sum((x1 - x3) ** 2)
loss *= tv_weight
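A NumPy sketch of the same slicing on a made-up 3x3 single-channel image. Note that this variant, like the code above, only sums differences anchored at the top-left (H-1) x (W-1) block; other write-ups also include the horizontal differences of the last row and the vertical differences of the last column, which gives a slightly different value.

```python
import numpy as np

def tv_loss(img, tv_weight):
    """Total-variation loss for img of shape (n, H, W, c), same slicing as above."""
    x1 = img[:, :-1, :-1, :]     # pixel (i, j)
    x2 = img[:, :-1, 1:, :]      # right neighbour (i, j+1)
    x3 = img[:, 1:, :-1, :]      # lower neighbour (i+1, j)
    return tv_weight * (np.sum((x1 - x2) ** 2) + np.sum((x1 - x3) ** 2))

# Hypothetical image: rows [0,1,2], [3,4,5], [6,7,8].
img = np.arange(9, dtype=float).reshape(1, 3, 3, 1)

loss = tv_loss(img, 0.5)
flat = tv_loss(np.ones((1, 3, 3, 1)), 0.5)

print(loss)   # 20.0  (0.5 * (4 * 1 + 4 * 9))
print(flat)   # 0.0   (a constant image has no variation)
```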
5. Generative Adversarial Networks
The first few parts are straightforward and are skipped here.
Discriminator
fc1 = tf.layers.dense(x, units=256)
relu1 = leaky_relu(fc1, 0.01)
fc2 = tf.layers.dense(relu1, units=256)
relu2 = leaky_relu(fc2, 0.01)
logits = tf.layers.dense(relu2, 1)
Generator
fc1 = tf.layers.dense(z, units=1024)
relu1 = tf.nn.relu(fc1)
fc2 = tf.layers.dense(relu1, units=1024)
relu2 = tf.nn.relu(fc2)
fc3 = tf.layers.dense(relu2, units=784)
img = tf.tanh(fc3)
GAN Loss
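Generator loss:
$$\ell_G = -\mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]$$
Discriminator loss:
$$\ell_D = -\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] - \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
Following the hints in the notebook, this can be implemented as: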
d1 = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(logits_real), logits=logits_real)
d2 = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(logits_fake), logits=logits_fake)
D_loss = tf.reduce_mean(d1 + d2)
G_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(logits_fake), logits=logits_fake))
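To see what these calls compute, here is a NumPy sketch. `sigmoid_ce_with_logits` is a hypothetical re-implementation of the numerically stable formula documented for tf.nn.sigmoid_cross_entropy_with_logits, `max(x, 0) - x * z + log(1 + exp(-|x|))`; the logits are made up.

```python
import numpy as np

def sigmoid_ce_with_logits(labels, logits):
    """Stable cross-entropy between labels z and sigmoid(logits x)."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def gan_loss(logits_real, logits_fake):
    d1 = sigmoid_ce_with_logits(np.ones_like(logits_real), logits_real)   # -log D(x)
    d2 = sigmoid_ce_with_logits(np.zeros_like(logits_fake), logits_fake)  # -log(1 - D(G(z)))
    D_loss = np.mean(d1 + d2)
    G_loss = np.mean(sigmoid_ce_with_logits(np.ones_like(logits_fake), logits_fake))
    return D_loss, G_loss

# With all logits equal to 0, D outputs 0.5 everywhere.
D_loss, G_loss = gan_loss(np.zeros(4), np.zeros(4))
print(np.isclose(D_loss, 2 * np.log(2)))   # True
print(np.isclose(G_loss, np.log(2)))       # True

# The stable form agrees with the naive -log(sigmoid(x)) for moderate logits.
x = np.array([2.0, -1.0])
naive = -np.log(1.0 / (1.0 + np.exp(-x)))
stable = sigmoid_ce_with_logits(np.ones_like(x), x)
print(np.allclose(naive, stable))          # True
```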
Optimizing our loss
D_solver = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1)
G_solver = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1)
The remaining parts are similar to the earlier ones and are omitted here.