CS231n Assignment 2

Course videos: https://study.163.com/courses-search?keyword=CS231

Course homepage: http://cs231n.stanford.edu/2017/

References:

https://github.com/Halfish/cs231n/tree/master/assignment2/cs231n

https://github.com/wjbKimberly/cs231n_spring_2017_assignment/blob/master/assignment2/TensorFlow.ipynb

My code: https://github.com/Doraemonzzz/CS231n

This post reviews the key points of Assignment 2.

Preparation

If loading the data raises an error, edit the following function in data_utils.py:

def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000,
                     subtract_mean=True):

Find the line

cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

and change it to the directory where you store the dataset.

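1. Fully-Connected Neural Networks

To simplify the discussion below, assume

$X\in \mathbb R^{N\times D}, W\in \mathbb R^{D\times M} ,b\in \mathbb R^M$

Affine layer: forward

This part is simple. The output is

$S=XW+b \in \mathbb R^{N\times M}$

where $b$ is broadcast across the $N$ rows. The input x may have more than two dimensions, so it is first flattened to shape $(N, D)$. The corresponding code is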

N = x.shape[0]
X = x.reshape(N, -1)
out = X.dot(w) + b

Affine layer: backward

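At the end of the previous assignment we derived

$\nabla_W f = X^T F\in \mathbb R^{D\times M}, \quad \nabla_X f = F W^T\in \mathbb R^{N\times D}, \quad \nabla_b f = \sum_{i=1}^N F_{i,:}\in \mathbb R^{M}$

where $F=\nabla_S f\in \mathbb R^{N\times M}$ is the upstream gradient passed into the backward pass, and $\nabla_b f$ is $F$ summed over its rows. These identities follow directly from the definitions, and an easy way to remember them is to match the matrix dimensions. With these formulas, the corresponding code is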

# flatten each example into a row vector
N = x.shape[0]
X = x.reshape(N, -1)
dx = dout.dot(w.T)
# restore the original input shape
dx = np.reshape(dx, x.shape)
dw = X.T.dot(dout)
db = np.sum(dout, axis=0)
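A quick way to gain confidence in these formulas is a numerical gradient check. The sketch below is self-contained and is not the assignment's own checker: the helper names `affine_forward`, `affine_backward` and `numerical_grad` are my own simplified versions (the backward function takes `x` and `w` directly instead of a cache), and the shapes are arbitrary.

```python
import numpy as np

def affine_forward(x, w, b):
    """Flatten each example to a row and apply S = XW + b."""
    X = x.reshape(x.shape[0], -1)
    return X.dot(w) + b

def affine_backward(dout, x, w):
    """Analytic gradients of the affine layer for upstream gradient dout."""
    X = x.reshape(x.shape[0], -1)
    dx = dout.dot(w.T).reshape(x.shape)
    dw = X.T.dot(dout)
    db = dout.sum(axis=0)
    return dx, dw, db

def numerical_grad(f, a, dout, h=1e-5):
    """Centered-difference gradient of sum(f() * dout) with respect to array a."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(*a.shape):
        old = a[idx]
        a[idx] = old + h
        pos = f()
        a[idx] = old - h
        neg = f()
        a[idx] = old  # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 3))  # N = 4, each example has shape (2, 3), so D = 6
w = rng.standard_normal((6, 5))
b = rng.standard_normal(5)
dout = rng.standard_normal((4, 5))

dx, dw, db = affine_backward(dout, x, w)
f = lambda: affine_forward(x, w, b)
err_dx = np.max(np.abs(dx - numerical_grad(f, x, dout)))
err_dw = np.max(np.abs(dw - numerical_grad(f, w, dout)))
err_db = np.max(np.abs(db - numerical_grad(f, b, dout)))
print(err_dx, err_dw, err_db)
```

Because the layer is linear in each argument, the centered difference should agree with the analytic gradient up to floating-point rounding.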

ReLU layer: forward

Nothing special here. Apply the definition directly:

out = np.copy(x)
out[out < 0] = 0

ReLU layer: backward

Just set the gradient to $0$ wherever the input is less than $0$:

dx = np.copy(dout)
dx[x < 0] = 0

Two-layer network

Implement a two-layer network with the architecture affine - relu - affine - softmax.

Step 1, initialization:

W1 = np.random.randn(input_dim, hidden_dim) * weight_scale
b1 = np.zeros(hidden_dim)
W2 = np.random.randn(hidden_dim, num_classes) * weight_scale
b2 = np.zeros(num_classes)
self.params["W1"] = W1
self.params["b1"] = b1
self.params["W2"] = W2
self.params["b2"] = b2

Step 2, forward pass:

W1 = self.params["W1"]
b1 = self.params["b1"]
W2 = self.params["W2"]
b2 = self.params["b2"]

# hidden layer
z1, cache1 = affine_relu_forward(X, W1, b1)
# output scores
scores, cache2 = affine_forward(z1, W2, b2)

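Step 3, backward pass:

# loss and gradient with respect to the scores
loss, dout = softmax_loss(scores, y)
# add the L2 regularization term
loss += self.reg * (np.sum(W2 ** 2) + np.sum(W1 ** 2)) / 2
# compute dW2, db2
dz1, dW2, db2 = affine_backward(dout, cache2)
# add the regularization gradient
dW2 += self.reg * W2
# compute dW1, db1
dx, dW1, db1 = affine_relu_backward(dz1, cache1)
dW1 += self.reg * W1
# store the gradients in the dictionary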
grads["W1"] = dW1
grads["b1"] = db1
grads["W2"] = dW2
grads["b2"] = db2

Multilayer network

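This part generalizes the previous one.

Step 1, initialization. The weight shapes depend on whether the layer is the first layer, the output layer, or a hidden layer in between: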

for i in range(self.num_layers):
	if i == 0:
		W = np.random.randn(input_dim, hidden_dims[i]) * weight_scale
		b = np.zeros(hidden_dims[i])
	elif i == self.num_layers - 1:
		W = np.random.randn(hidden_dims[i-1], num_classes) * weight_scale
		b = np.zeros(num_classes)
	else:
		W = np.random.randn(hidden_dims[i-1], hidden_dims[i]) * weight_scale
		b = np.zeros(hidden_dims[i])
        
	self.params["W"+str(i+1)] = W
	self.params["b"+str(i+1)] = b

Step 2, forward pass. The output layer is handled separately because it has no ReLU:

x = X
# caches for the backward pass
Cache = {}
Cache_dropout = {}
for i in range(self.num_layers):
	W = self.params["W"+str(i+1)]
	b = self.params["b"+str(i+1)]
	if i < self.num_layers - 1:
		x, cache = affine_relu_forward(x, W, b)
	else:
		x, cache = affine_forward(x, W, b)
	# store the cache
	Cache["cache"+str(i+1)] = cache
    
# output scores
scores = x

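Step 3, backward pass. Again the output layer is handled separately:

# loss and gradient with respect to the scores
loss, dout = softmax_loss(scores, y)
# add the L2 regularization term
for i in range(self.num_layers):
	W = self.params["W"+str(i+1)]
	loss += self.reg * (np.sum(W ** 2)) / 2
# compute the gradients layer by layer, from the output layer back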
for i in range(self.num_layers, 0, -1):
	cache = Cache["cache"+str(i)]
	W = self.params["W"+str(i)]
	if i == self.num_layers:
		dz, dW, db = affine_backward(dout, cache)
	else:
		dz, dW, db = affine_relu_backward(dz, cache)

	# add the regularization gradient
	dW += self.reg * W
	# store the gradients in the dictionary
	grads["W"+str(i)] = dW
	grads["b"+str(i)] = db

This part is not hard. It just requires care.

SGD+Momentum

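The update rule used here is plain (non-Nesterov) momentum, with the gradient evaluated at the current point $x_t$:

$\begin{aligned} v_{t+1} &= \rho v_t -\alpha \nabla f(x_t) \\ x_{t+1}&= x_t + v_{t+1} \end{aligned}$

where $\rho$ is config["momentum"] and $\alpha$ is config['learning_rate']. In code: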

v = config["momentum"] * v - config['learning_rate'] * dw
next_w = w + v
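To see the update rule in action, here is a small self-contained sketch on a toy quadratic. The function name `sgd_momentum` mirrors the assignment, but the `config` layout (a `"velocity"` entry initialized to zeros) and the objective $f(w)=\frac 1 2 \|w\|^2$ are assumptions made for this example only.

```python
import numpy as np

def sgd_momentum(w, dw, config):
    """One plain-momentum step; config holds learning_rate, momentum and velocity."""
    v = config["momentum"] * config["velocity"] - config["learning_rate"] * dw
    config["velocity"] = v
    return w + v, config

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
config = {"learning_rate": 0.1, "momentum": 0.9, "velocity": np.zeros_like(w)}
for _ in range(200):
    w, config = sgd_momentum(w, w, config)

final_norm = np.linalg.norm(w)
print(final_norm)
```

The iterates oscillate around the minimum while their norm shrinks towards zero, which is the typical behaviour of momentum on a well-conditioned quadratic.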

RMSProp

config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dx * dx
next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])

Adam

# NOTE: incrementing t at the end assumes config['t'] starts at 1.
# If it starts at 0, increment it first, otherwise 1 - beta1 ** 0 == 0 and the division blows up.
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dx * dx
first_unbias = config['m'] / (1 - config['beta1'] ** config['t'])
second_unbias = config['v'] / (1 - config['beta2'] ** config['t'])
next_x = x - config['learning_rate'] * first_unbias / (np.sqrt(second_unbias) + config['epsilon'])
config['t'] += 1
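The effect of the bias correction is easiest to see on the very first step. The sketch below is a standalone variant, not the assignment's function: `adam_step` is a hypothetical name, and it assumes `t` starts at 0 and is incremented before use. With zero-initialized moments, the corrected estimates on step one reduce to the gradient and its square, so every coordinate moves by roughly the learning rate regardless of the gradient's scale.

```python
import numpy as np

def adam_step(x, dx, config):
    """One Adam update; t counts completed steps and is incremented before use."""
    config["t"] += 1
    config["m"] = config["beta1"] * config["m"] + (1 - config["beta1"]) * dx
    config["v"] = config["beta2"] * config["v"] + (1 - config["beta2"]) * dx * dx
    m_hat = config["m"] / (1 - config["beta1"] ** config["t"])
    v_hat = config["v"] / (1 - config["beta2"] ** config["t"])
    next_x = x - config["learning_rate"] * m_hat / (np.sqrt(v_hat) + config["epsilon"])
    return next_x, config

x = np.array([1.0, -3.0, 0.5])
dx = np.array([0.2, -40.0, 1e-3])  # gradients of very different magnitudes
config = {
    "learning_rate": 1e-3, "beta1": 0.9, "beta2": 0.999, "epsilon": 1e-8,
    "m": np.zeros_like(x), "v": np.zeros_like(x), "t": 0,
}
next_x, config = adam_step(x, dx, config)
first_step = next_x - x
print(first_step)
```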

2. Batch Normalization

Batch normalization: Forward

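The formulas are as follows, for a mini-batch $\mathcal B=\{x_1,\ldots,x_m\}$:

$\begin{aligned} \mu_{\mathcal B} &= \frac 1 m \sum_{i=1}^m x_i \\ \sigma_{\mathcal B}^2 &= \frac 1 m \sum_{i=1}^m (x_i-\mu_{\mathcal B})^2 \\ \hat x_i &= \frac{x_i-\mu_{\mathcal B}}{\sqrt{\sigma_{\mathcal B}^2+\epsilon}} \\ y_i &= \gamma \hat x_i+\beta \end{aligned}$

The implementation has two parts. First, the training branch: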

# per-feature batch mean
sample_mean = np.mean(x, axis=0)
# per-feature batch variance
sample_var = np.var(x, axis=0)

# normalization factor sqrt(var + eps)
k = np.sqrt(sample_var + eps)
x1 = (x - sample_mean) / k
out = gamma * x1 + beta

cache.append(k)
cache.append(sample_mean)
cache.append(x1)

running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

The last two lines update the running averages of the mean and variance, which are used at test time. The test branch is:

x1 = (x - running_mean) / np.sqrt(running_var + eps)
out = gamma * x1 + beta
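A simple sanity check for the training branch: with $\gamma=1,\beta=0$ every output feature should have mean close to 0 and standard deviation close to 1, and with general $\gamma,\beta$ the mean and standard deviation should be close to $\beta$ and $\gamma$. The sketch below uses a stripped-down function, `batchnorm_train`, which is my own name and omits the running averages and the cache.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization over axis 0 (no running averages)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((200, 4))  # features far from zero mean / unit variance

out = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))
out_means = out.mean(axis=0)
out_stds = out.std(axis=0)

gamma = np.array([1.0, 2.0, 3.0, 4.0])
beta = np.array([11.0, 12.0, 13.0, 14.0])
out2 = batchnorm_train(x, gamma, beta)
print(out_means, out_stds, out2.mean(axis=0), out2.std(axis=0))
```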

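Note that the training branch stores three extra quantities in the cache: k, sample_mean and x1. k corresponds to

${\sqrt{\sigma_{\mathcal B}^2+\epsilon}}$

x1 corresponds to

$\hat x_i$

and out is the output

$\gamma \hat x_i+\beta$

k, sample_mean and x1 are all used in the backward pass. The backward code below also reads gamma and the input x, so those need to be available from the cache as well.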

Batch Normalization: backward

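Here I computed the gradient directly by differentiation, which in effect also completes the optional part of the assignment. This is one of the hardest parts of the assignment.

Suppose the matrix obtained after batch normalization is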

$Y= \left( \begin{matrix} y^{(1)}_1 & \ldots & y^{(1)}_d \\ \ldots& \ldots &\ldots \\ y^{(m)}_1 & \ldots & y^{(m)}_d \\ \end{matrix} \right) \in \mathbb R^{m\times d}$

Note that $\beta ,\gamma$ are actually vectors:

$\beta =(\beta_1,...,\beta_d)\\ \gamma =(\gamma_1,...,\gamma_d)$

$\beta_j ,\gamma_j$ act on the $j$-th component, i.e.

$y^{(i)}_j =\gamma_j \hat x^{(i)}_j +\beta_j$

Suppose our function is

$f(y_1,...,y_d)$

and the gradient passed into the backward pass is

$(\frac{\partial f}{\partial y_1},...,\frac{\partial f}{\partial y_d})$

We now compute the partial derivatives of $f$ with respect to each quantity:

$\begin{aligned} \frac{\partial f}{\partial \beta_k} &=\sum_{i=1}^m \sum_{j=1}^d\frac{\partial f}{\partial y^{(i)}_j} \frac{\partial y^{(i)}_j}{\partial \beta_k} \\ &=\sum_{i=1}^m \sum_{j=1}^d\frac{\partial f}{\partial y^{(i)}_j} \frac{\partial(\gamma_j\hat x^{(i)}_j +\beta_j)}{\partial \beta_k} \\ &=\sum_{i=1}^m \frac{\partial f}{\partial y^{(i)}_k} \end{aligned}$

The corresponding code is

dbeta = np.sum(dout ,axis=0)

$\begin{aligned} \frac{\partial f}{\partial \gamma_k} &=\sum_{i=1}^m \sum_{j=1}^d\frac{\partial f}{\partial y^{(i)}_j} \frac{\partial y^{(i)}_j}{\partial \gamma_k} \\ &=\sum_{i=1}^m \sum_{j=1}^d\frac{\partial f}{\partial y^{(i)}_j} \frac{\partial(\gamma_j\hat x^{(i)}_j +\beta_j)}{\partial \gamma_k} \\ &=\sum_{i=1}^m \frac{\partial f}{\partial y^{(i)}_k}\hat x^{(i)}_k \end{aligned}$

The corresponding code is

dgamma = np.sum(dout * x1, axis=0)

$\begin{aligned} \frac{\partial f}{\partial x^{(i)}_k} &=\sum_{s=1}^m \sum_{t=1}^d\frac{\partial f}{\partial y^{(s)}_t} \frac{\partial y^{(s)}_t}{\partial \hat x_t^{(s)}} \frac{\partial \hat x_t^{(s)}}{\partial x_k^{(i)}}\\ &=\sum_{s=1}^m \frac{\partial f}{\partial y^{(s)}_k} \frac{\partial y^{(s)}_k}{\partial \hat x_k^{(s)}} \frac{\partial \hat x_k^{(s)}}{\partial x_k^{(i)}} \\ &=\sum_{s=1}^m \frac{\partial f}{\partial y^{(s)}_k} \frac{\partial(\gamma_k\hat x^{(s)}_k +\beta_k)}{\partial \hat x_k^{(s)}} \frac{\partial \hat x_k^{(s)}}{\partial x_k^{(i)}} \\ &=\sum_{s=1}^m \gamma_k\frac{\partial f}{\partial y^{(s)}_k} \frac{\partial \hat x_k^{(s)}}{\partial x_k^{(i)}} \\ &=\gamma_k\sum_{s=1}^m \frac{\partial f}{\partial y^{(s)}_k} \frac{\partial \hat x_k^{(s)}}{\partial x_k^{(i)}} \end{aligned}$

The key step is computing $\frac{\partial \hat x_k^{(s)}}{\partial x_k^{(i)}} $. First recall the formulas

$\begin{aligned} \hat x_k^{(s)} &=\frac{x_k^{(s)}-\mu_k}{\sqrt{\sigma_k^2 +\epsilon}} \\ \mu_k &=\frac 1 m\sum_{i=1}^m x_k^{(i)}\\ \sigma_k^2&= \frac 1 m \sum_{i=1}^m ( x_k^{(i)} -\mu_k)^2\\ &=\frac 1 m \sum_{i=1}^m\Big((x_k^{(i)})^2 -2\mu_k x_k^{(i)}+\mu_k^2 \Big)\\ &=\frac 1 m \sum_{i=1}^m(x_k^{(i)})^2 -\frac 2 m \mu_k \Big(\sum_{i=1}^m x_k^{(i)}\Big)+\mu_k ^2 \\ &=\frac 1 m \sum_{i=1}^m(x_k^{(i)})^2 -\frac 2 m \Big(\frac 1 m\sum_{i=1}^m x_k^{(i)}\Big) \Big(\sum_{i=1}^m x_k^{(i)}\Big)+\Big(\frac 1 m\sum_{i=1}^m x_k^{(i)}\Big)^2\\ &=\frac 1 m \sum_{i=1}^m(x_k^{(i)})^2-\frac 1 {m^2}\Big(\sum_{i=1}^m x_k^{(i)}\Big)^2 \end{aligned}$

Therefore

$\begin{aligned} \frac{\partial \sigma_k^2 }{\partial x_k^{(i)}} &= \frac 2 m x_k^{(i)} -\frac 2 {m^2}\Big(\sum_{s=1}^m x_k^{(s)}\Big)\\ &=\frac 2 m \Big( x_k^{(i)} -\mu_k \Big)\\ \frac{\partial \mu_k }{\partial x_k^{(i)}} &=\frac 1 m \end{aligned}$

With this preparation in place, we can compute $\frac{\partial \hat x_k^{(s)}}{\partial x_k^{(i)}} $:

$\begin{aligned} \frac{\partial \hat x_k^{(s)}}{\partial x_k^{(i)}} &=\frac{\partial }{\partial x_k^{(i)}}\Big( \frac{x_k^{(s)}-\mu_k}{\sqrt{\sigma_k^2 +\epsilon}} \Big)\\ &=\frac{\partial (x_k^{(s)}-\mu_k) }{\partial x_k^{(i)}} \frac{1}{\sqrt{\sigma_k^2 +\epsilon}} +(x_k^{(s)}-\mu_k) \frac{\partial}{\partial x_k^{(i)}} \Big(\frac{1}{\sqrt{\sigma_k^2 +\epsilon}}\Big)\\ &=\frac{1\{s=i\}-\frac 1 m}{\sqrt{\sigma_k^2 +\epsilon}}+(x_k^{(s)}-\mu_k)(-\frac 1 2)\frac 1 {(\sqrt{\sigma_k^2 +\epsilon})^3} \frac{\partial \sigma_k^{2}}{\partial x_k^{(i)}}\\ &=\frac{1\{s=i\}-\frac 1 m}{\sqrt{\sigma_k^2 +\epsilon}}+(x_k^{(s)}-\mu_k)(-\frac 1 2)\frac 1 {(\sqrt{\sigma_k^2 +\epsilon})^3} \frac 2 m \Big( x_k^{(i)} -\mu_k \Big)\\ &=\frac{1\{s=i\}-\frac 1 m}{\sqrt{\sigma_k^2 +\epsilon}}-\frac 1 m \frac{(x_k^{(s)}-\mu_k)( x_k^{(i)} -\mu_k )} {\sqrt{(\sigma_k^2 +\epsilon)^3}} \end{aligned}$

So

$\begin{aligned} \frac{\partial f}{\partial x^{(i)}_k} &=\gamma_k\sum_{s=1}^m \frac{\partial f}{\partial y^{(s)}_k} \frac{\partial \hat x_k^{(s)}}{\partial x_k^{(i)}} \\ &=\gamma_k\sum_{s=1}^m \frac{\partial f}{\partial y^{(s)}_k} \Big(\frac{1\{s=i\}-\frac 1 m}{\sqrt{\sigma_k^2 +\epsilon}} -\frac 1 m \frac{(x_k^{(s)}-\mu_k)( x_k^{(i)} -\mu_k )} {\sqrt{(\sigma_k^2 +\epsilon)^3}} \Big)\\ &=\gamma_k \Big( \frac{\frac{\partial f}{\partial y^{(i)}_k}}{\sqrt{\sigma_k^2 +\epsilon}} -\frac {\sum_{s=1}^m \frac{\partial f}{\partial y^{(s)}_k} }{m\sqrt{\sigma_k^2 +\epsilon}} \Big)- \gamma_k\frac {1} m\frac{( x_k^{(i)} -\mu_k )} {\sqrt{(\sigma_k^2 +\epsilon)^3}} \sum_{s=1}^ m\frac{\partial f}{\partial y^{(s)}_k}(x_k^{(s)}-\mu_k) \\ &= \frac{\gamma_k.\frac{\partial f}{\partial y^{(i)}_k}}{\sqrt{\sigma_k^2 +\epsilon}} -\frac {\gamma_k.\big(\sum_{s=1}^m \frac{\partial f}{\partial y^{(s)}_k}\big )}{m\sqrt{\sigma_k^2 +\epsilon}}- \gamma_k\frac {1} m\frac{( x_k^{(i)} -\mu_k )} {\sqrt{(\sigma_k^2 +\epsilon)^3}} \sum_{s=1}^ m\frac{\partial f}{\partial y^{(s)}_k}(x_k^{(s)}-\mu_k) \end{aligned}$

We split this expression into three terms. The first is

$\frac{\gamma_k.\frac{\partial f}{\partial y^{(i)}_k}}{\sqrt{\sigma_k^2 +\epsilon}}$

Write the numerator in matrix form:

$\left( \begin{matrix} \gamma_1.\frac{\partial f}{\partial y^{(1)}_1} & \ldots & \gamma_d.\frac{\partial f}{\partial y^{(1)}_d} \\ \ldots &\ldots &\ldots \\ \gamma_1.\frac{\partial f}{\partial y^{(m)}_1} & \ldots & \gamma_d.\frac{\partial f}{\partial y^{(m)}_d} \end{matrix} \right)$

Using NumPy broadcasting, this matrix is

gamma * dout

Recall that k is

${\sqrt{\sigma_{\mathcal B}^2+\epsilon}}$

so, using broadcasting again, the first term can be computed as

t1 = gamma * dout / k

Next, the second term:

$-\frac {\gamma_k.\big(\sum_{s=1}^m \frac{\partial f}{\partial y^{(s)}_k}\big )}{m\sqrt{\sigma_k^2 +\epsilon}}$

Again by broadcasting, we get

m = x.shape[0]
t2 = - gamma / m * np.sum(dout, axis=0).reshape(1, -1) / k

Finally, we compute the third term:

$- \gamma_k\frac {1} m\frac{( x_k^{(i)} -\mu_k )} {\sqrt{(\sigma_k^2 +\epsilon)^3}} \sum_{s=1}^ m\frac{\partial f}{\partial y^{(s)}_k}(x_k^{(s)}-\mu_k)$

This term is more involved. We first compute

$\sum_{s=1}^ m\frac{\partial f}{\partial y^{(s)}_k}(x_k^{(s)}-\mu_k)$

Start with the centered matrix:

t3 = x - sample_mean

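Next, $\frac{\partial f}{\partial y^{(s)}_k}(x_k^{(s)}-\mu_k)$ is the elementwise product of the gradient matrix and the centered matrix, so $\sum_{s=1}^ m\frac{\partial f}{\partial y^{(s)}_k}(x_k^{(s)}-\mu_k)$ is the sum of that product over the rows (axis 0). The code is: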

t4 = np.sum(dout * t3, axis=0).reshape(1, -1)

Finally, broadcasting gives

$- \gamma_k\frac {1} m\frac{( x_k^{(i)} -\mu_k )} {\sqrt{(\sigma_k^2 +\epsilon)^3}} \sum_{s=1}^ m\frac{\partial f}{\partial y^{(s)}_k}(x_k^{(s)}-\mu_k)$

with the code

t5 = - gamma / m * t3 / (k ** 3) * t4

Adding the three terms gives the full gradient

dx= t1 + t2 + t5
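Since the derivation is long, it is worth checking the closed-form result numerically. The sketch below re-implements the three terms in a self-contained way and compares dx with a centered-difference estimate. The function names `bn_forward` and `bn_backward`, and the cache layout (a tuple that also stores x and gamma), are assumptions of this sketch and not the layout used in my assignment code.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm; returns the output and a cache for the backward pass."""
    mu = x.mean(axis=0)
    k = np.sqrt(x.var(axis=0) + eps)
    x_hat = (x - mu) / k
    return gamma * x_hat + beta, (x, mu, k, x_hat, gamma)

def bn_backward(dout, cache):
    """Closed-form gradients, written as the three terms t1 + t2 + t5."""
    x, mu, k, x_hat, gamma = cache
    m = x.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    xc = x - mu                                    # centered input
    t1 = gamma * dout / k
    t2 = -gamma / m * dout.sum(axis=0) / k
    t5 = -gamma / m * xc / k ** 3 * (dout * xc).sum(axis=0)
    return t1 + t2 + t5, dgamma, dbeta

rng = np.random.default_rng(1)
x = 1.0 + 2.0 * rng.standard_normal((6, 3))
gamma = rng.standard_normal(3)
beta = rng.standard_normal(3)
dout = rng.standard_normal((6, 3))

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, cache)

# Centered-difference estimate of dx, one input entry at a time.
h = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp = x.copy()
    xp[idx] += h
    xm = x.copy()
    xm[idx] -= h
    diff = bn_forward(xp, gamma, beta)[0] - bn_forward(xm, gamma, beta)[0]
    dx_num[idx] = np.sum(diff * dout) / (2 * h)

max_err = np.max(np.abs(dx - dx_num))
print(max_err)
```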

Fully Connected Nets with Batch Normalization

This part modifies the FullyConnectedNet code. Since each hidden layer now has the structure affine - batchnorm - relu, I wrote the following helper functions:

def affine_batchnorm_relu_forward(x, W, b, gamma, beta, bn_params):
    #affine
    x, cache_affine = affine_forward(x, W, b)
    #batchnorm
    x, cache_batch = batchnorm_forward(x, gamma, beta, bn_params)
    #relu
    x, cache_relu = relu_forward(x)
    
    return x, (cache_affine, cache_batch, cache_relu)

def affine_batchnorm_relu_backward(dout, cache):
    cache_affine, cache_batch, cache_relu = cache
    #relu
    dx = relu_backward(dout, cache_relu)
    #batchnorm
    dx, dgamma, dbeta = batchnorm_backward(dx, cache_batch)
    #affine
    dx, dw, db = affine_backward(dx, cache_affine)
    
    return dx, dw, db, dgamma, dbeta

These helpers only modularize the code. Next, modify FullyConnectedNet, starting with initialization:

for i in range(self.num_layers):
	if i == 0:
		W = np.random.randn(input_dim, hidden_dims[i]) * weight_scale
		b = np.zeros(hidden_dims[i])
	elif i == self.num_layers - 1:
		W = np.random.randn(hidden_dims[i-1], num_classes) * weight_scale
		b = np.zeros(num_classes)
	else:
		W = np.random.randn(hidden_dims[i-1], hidden_dims[i]) * weight_scale
		b = np.zeros(hidden_dims[i])
	
	if self.use_batchnorm and i != self.num_layers - 1:
		gamma = np.ones(hidden_dims[i])
		beta = np.zeros(hidden_dims[i])
		self.params["gamma"+str(i+1)] = gamma
		self.params["beta"+str(i+1)] = beta
	
	self.params["W"+str(i+1)] = W
	self.params["b"+str(i+1)] = b

Next, the forward pass:

x = X
# caches for the backward pass
Cache = {}
Cache_dropout = {}
for i in range(self.num_layers):
	W = self.params["W"+str(i+1)]
	b = self.params["b"+str(i+1)]
	if i < self.num_layers - 1:
		#batchnorm
		if self.use_batchnorm:
			gamma = self.params["gamma"+str(i+1)]
			beta = self.params["beta"+str(i+1)]
			x, cache = affine_batchnorm_relu_forward(x, W, b, gamma, beta, self.bn_params[i])
		else:
			x, cache = affine_relu_forward(x, W, b)
	else:
		x, cache = affine_forward(x, W, b)
	# store the cache
	Cache["cache"+str(i+1)] = cache
	
# output scores
scores = x

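Finally, the backward pass:

# loss and gradient with respect to the scores
loss, dout = softmax_loss(scores, y)
# add the L2 regularization term
for i in range(self.num_layers):
	W = self.params["W"+str(i+1)]
	loss += self.reg * (np.sum(W ** 2)) / 2
# compute the gradients layer by layer, from the output layer back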
dz, dW, db, dgamma, dbeta = 0, 0, 0, 0, 0
for i in range(self.num_layers, 0, -1):
	cache = Cache["cache"+str(i)]
	W = self.params["W"+str(i)]
	if i == self.num_layers:
		dz, dW, db = affine_backward(dout, cache)
	else:
		if self.use_batchnorm:
			dz, dW, db, dgamma, dbeta = affine_batchnorm_relu_backward(dz, cache)
			grads["gamma"+str(i)] = dgamma
			grads["beta"+str(i)] = dbeta
		else:
			dz, dW, db = affine_relu_backward(dz, cache)

	# add the regularization gradient
	dW += self.reg * W
	# store the gradients in the dictionary
	grads["W"+str(i)] = dW
	grads["b"+str(i)] = db

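3. Dropout

Dropout forward pass

See the course notes for the formulas. The code is given directly here. First the forward pass, with separate training and test branches. Note that this code treats p as the probability of keeping a unit (inverted dropout):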

if mode == 'train':
	mask = (np.random.rand(*x.shape) < p) / p
	out = x * mask
elif mode == 'test':
	out = x
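The reason for dividing the mask by p is that the expected value of the output then matches the input, so nothing has to be rescaled at test time. The sketch below illustrates this on an array of ones. It assumes, as in the code above, that p is the probability of keeping a unit. The function signature (passing p, mode and a random generator explicitly) is my own simplification.

```python
import numpy as np

def dropout_forward(x, p, mode, rng):
    """Inverted dropout where p is the probability of keeping a unit."""
    if mode == "train":
        mask = (rng.random(x.shape) < p) / p   # kept entries are scaled by 1 / p
        return x * mask, mask
    return x, None                             # test mode is the identity

rng = np.random.default_rng(0)
x = np.ones((1000, 500))

out_train, mask = dropout_forward(x, 0.5, "train", rng)
out_test, _ = dropout_forward(x, 0.5, "test", rng)

train_mean = out_train.mean()                  # close to the input mean of 1
kept_fraction = (out_train != 0).mean()        # close to p
print(train_mean, kept_fraction)
```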

Dropout backward pass

Then the backward pass, again with training and test branches:

if mode == 'train':
	dx = dout * mask
elif mode == 'test':
	dx = dout

Fully-connected nets with Dropout

Only one extra check is needed. Forward pass:

x = X
# caches for the backward pass
Cache = {}
Cache_dropout = {}
for i in range(self.num_layers):
	W = self.params["W"+str(i+1)]
	b = self.params["b"+str(i+1)]
	if i < self.num_layers - 1:
		#batchnorm
		if self.use_batchnorm:
			gamma = self.params["gamma"+str(i+1)]
			beta = self.params["beta"+str(i+1)]
			x, cache = affine_batchnorm_relu_forward(x, W, b, gamma, beta, self.bn_params[i])
		else:
			x, cache = affine_relu_forward(x, W, b)
			
		if self.use_dropout:
			x, cache_dropout = dropout_forward(x, self.dropout_param)
			Cache_dropout["cache"+str(i+1)] = cache_dropout
	else:
		x, cache = affine_forward(x, W, b)
	# store the cache
	Cache["cache"+str(i+1)] = cache
	
# output scores
scores = x

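Backward pass:

# loss and gradient with respect to the scores
loss, dout = softmax_loss(scores, y)
# add the L2 regularization term
for i in range(self.num_layers):
	W = self.params["W"+str(i+1)]
	loss += self.reg * (np.sum(W ** 2)) / 2
# compute the gradients layer by layer, from the output layer back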
dz, dW, db, dgamma, dbeta = 0, 0, 0, 0, 0
for i in range(self.num_layers, 0, -1):
	cache = Cache["cache"+str(i)]
	W = self.params["W"+str(i)]
	if i == self.num_layers:
		dz, dW, db = affine_backward(dout, cache)
	else:
		if self.use_dropout:
			cache_dropout = Cache_dropout["cache"+str(i)]
			dz = dropout_backward(dz, cache_dropout)
		if self.use_batchnorm:
			dz, dW, db, dgamma, dbeta = affine_batchnorm_relu_backward(dz, cache)
			grads["gamma"+str(i)] = dgamma
			grads["beta"+str(i)] = dbeta
		else:
			dz, dW, db = affine_relu_backward(dz, cache)

	# add the regularization gradient
	dW += self.reg * W
	# store the gradients in the dictionary
	grads["W"+str(i)] = dW
	grads["b"+str(i)] = db

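4. Running a Convolutional Network on CIFAR-10

To fix notation, define the following: $x$ is the image data with shape $(N, C, H, W)$; $w$ holds the filters, with shape $(F, C, HH, WW)$; $b$ is the bias with shape $(F, )$. Here $N$ is the number of images, $C$ is the number of channels ($3$ for RGB images), $H,W$ are the image height and width, $F$ is the number of filters, and $HH,WW$ are the filter height and width. In addition, let stride be the step size and pad the amount of zero padding on each side. The padded data $x\_$ then has shape $(N, C, H_2, W_2)$, where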

$H_2=H+2\times \text{pad}, W_2=W+2\times \text{pad}$

The output has shape $(N,F,H_1,W_1)$, where

$\begin{aligned} H_1 &= 1+ (H+2\times \text{pad}-HH) / \text{stride} \\ W_1 &= 1+ (W+2\times \text{pad}-WW) / \text{stride}\\ \end{aligned}$
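As a quick worked example of the size formula, the pure-Python helper below evaluates it along one spatial axis. The name `conv_output_size` is hypothetical, and it uses integer division, matching the `//` used in the naive implementation later in this post.

```python
def conv_output_size(size, filter_size, pad, stride):
    """Output length along one spatial axis: 1 + (size + 2 * pad - filter_size) // stride."""
    return 1 + (size + 2 * pad - filter_size) // stride

# A 7x7 filter with pad 3 and stride 1 keeps a 32-pixel axis at 32.
same = conv_output_size(32, 7, pad=3, stride=1)
# A 3x3 filter with pad 1 and stride 2 maps a 5-pixel axis to 3.
strided = conv_output_size(5, 3, pad=1, stride=2)
print(same, strided)
```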

Denote the output by out and consider its $(i,j)$-th element

$\text{res}=\text{out}[i][j]\in \mathbb R^{H_1 \times W_1}$

This element is computed from $x_1=x\_[i]\in \mathbb R^{C\times H_2\times W_2}$ and $w_1= w[j]\in \mathbb R^{C\times HH \times WW}$, and its $(s,t)$-th entry is computed as follows:

x1 = x_[i]
w1 = w[j]
res[s][t] = np.sum(x1[:, s*stride: s*stride+HH, t*stride: t*stride+WW] * w1) + b[j]

(Note: thanks to NumPy broadcasting, in practice $b[j]$ can be added once at the end.)

To prepare for the discussion of the backward pass, denote

x1[:, s*stride: s*stride+HH, t*stride: t*stride+WW]

by $x'\in \mathbb R^{C\times HH\times WW}$. Recall that $w_1\in \mathbb R^{C\times HH \times WW}$ and that $b[j]\in \mathbb R$ is the corresponding bias, so

$\text{res}[s][t]= \sum_{l=1}^C \sum_{p=1}^{HH} \sum_{q=1}^{WW} (x'[l][p][q] \times w_1[l][p][q]) + b[j]$

Then

$\begin{aligned} \frac{\partial (\text{res}[s][t])}{\partial (x'[l][p][q])} &= w_1[l][p][q] \\ \frac{\partial (\text{res}[s][t])}{\partial (w_1[l][p][q])} &= x'[l][p][q] \\ \frac{\partial (\text{res}[s][t])}{\partial (b[j])} &= 1 \end{aligned}$

Hence

$\begin{aligned} \nabla_{x'} { (\text{res}[s][t])} &= w_1\in \mathbb R^{C\times HH \times WW} \\ \nabla_{w_1} { (\text{res}[s][t])} &= x'\in \mathbb R^{C\times HH\times WW} \\ \nabla_{b[j]} { (\text{res}[s][t])} &=1 \end{aligned}$

Let dout be the upstream gradient passed into the backward pass. Its $(i,j)$-th element is

$\text{dout1}=\text{dout}[i][j]\in \mathbb R^{H_1 \times W_1}$

Suppose the function ultimately applied to out is $f$. Then

$\begin{aligned} \nabla_{\text{out}} { f} &= \text{dout} \\ \nabla_{\text{out}[i][j]} { f}&=\nabla_{\text{res}} { f} = \text{dout}[i][j]= \text{dout1} \in \mathbb R^{H_1 \times W_1} \end{aligned}$

So the output position $(s,t)$ contributes

$\begin{aligned} \nabla_{x'} { f} &= \frac{\partial f}{\partial (\text{res}[s][t])} \nabla_{x'} { (\text{res}[s][t])}= \text{dout1}[s][t]\cdot w_1\\ \nabla_{w_1} { f} &= \frac{\partial f}{\partial (\text{res}[s][t])} \nabla_{w_1} { (\text{res}[s][t])}= \text{dout1}[s][t]\cdot x'\\ \nabla_{b} { f} &=\frac{\partial f}{\partial (\text{res}[s][t])} \nabla_{b} { (\text{res}[s][t])}= \text{dout1}[s][t] \end{aligned}$

These contributions are accumulated over all positions $(s,t)$, which gives the following code:

for s in range(H1):
	for t in range(W1):
		dw1 += x1[:, s*stride: s*stride+HH, t*stride: t*stride+WW] * dout1[s][t]
		dx1[:, s*stride: s*stride+HH, t*stride: t*stride+WW] += w1 * dout1[s][t]
		db1 += dout1[s][t]

The rest is handled by looping over $i$ and $j$.

Convolution: Naive forward pass

First, zero-pad the input with np.pad:

stride = conv_param["stride"]
pad = conv_param["pad"]
x_ = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), "constant")

Then compute the output dimensions:

# input dimensions
N, C, H, W = x.shape
F, C, HH, WW = w.shape
# output dimensions
H1 = 1 + (H + 2 * pad - HH) // stride
W1 = 1 + (W + 2 * pad - WW) // stride
out = np.zeros((N, F, H1, W1))

Then compute the output from the definition, here with explicit loops:

for i in range(N):
	for j in range(F):
		x1 = x_[i]
		w1 = w[j]
		res = np.zeros((H1, W1))
		for s in range(H1):
			for t in range(W1):
				res[s][t] = np.sum(x1[:, s*stride: s*stride+HH, t*stride: t*stride+WW] * w1)
		res += b[j]
		out[i][j] = res
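The four nested loops are easy to follow but slow. One possible vectorized alternative, which is not part of the assignment and assumes NumPy 1.20 or newer for `sliding_window_view`, extracts all windows at once and contracts them with the filters using `einsum`. The sketch below compares it against a compact copy of the naive loop. Both function names are my own.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_forward_loops(x, w, b, stride, pad):
    """Reference implementation with explicit loops, as in the naive version."""
    x_ = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), "constant")
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    H1 = 1 + (H + 2 * pad - HH) // stride
    W1 = 1 + (W + 2 * pad - WW) // stride
    out = np.zeros((N, F, H1, W1))
    for i in range(N):
        for j in range(F):
            for s in range(H1):
                for t in range(W1):
                    window = x_[i, :, s*stride: s*stride+HH, t*stride: t*stride+WW]
                    out[i, j, s, t] = np.sum(window * w[j]) + b[j]
    return out

def conv_forward_windows(x, w, b, stride, pad):
    """Vectorized variant: gather every window, then contract with the filters."""
    x_ = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), "constant")
    HH, WW = w.shape[2:]
    # windows has shape (N, C, H2-HH+1, W2-WW+1, HH, WW): all stride-1 windows
    windows = sliding_window_view(x_, (HH, WW), axis=(2, 3))
    # keep only the window positions that the stride actually visits
    windows = windows[:, :, ::stride, ::stride]
    out = np.einsum("nchwij,fcij->nfhw", windows, w)
    return out + b[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5, 5))
w = rng.standard_normal((4, 3, 3, 3))
b = rng.standard_normal(4)

out_loops = conv_forward_loops(x, w, b, stride=2, pad=1)
out_windows = conv_forward_windows(x, w, b, stride=2, pad=1)
print(out_loops.shape, np.max(np.abs(out_loops - out_windows)))
```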

Aside: Image processing via convolutions

If this part fails with the following error:

cannot import name imread

installing Pillow fixes it:

pip install Pillow

Convolution: Naive backward pass

Most of this was covered above. What remains is looping over the data. First, the setup:

x, w, b, conv_param = cache
stride = conv_param["stride"]
pad = conv_param["pad"]
# zero padding
x_ = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), "constant")
# input dimensions
N, C, H, W = x.shape
F, C, HH, WW = w.shape
H2, W2 = x_.shape[2:]
# gradient buffers (dx is allocated with the padded shape)
dx = np.zeros_like(x_)
dw = np.zeros_like(w)
db = np.zeros_like(b)

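Next, the loops:

for i in range(N):
	for j in range(F):
		t1 = dout[i][j]
		# output spatial dimensions
		H1, W1 = t1.shape
		# gradient accumulators for this (i, j) pair
		dx1 = np.zeros((C, H2, W2))
		dw1 = np.zeros((C, HH, WW))
		db1 = 0
		# the i-th padded image and the j-th filter
		x1 = x_[i]
		w1 = w[j]
		# the matching slice of dout
		dout1 = dout[i][j]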
		for s in range(H1):
			for t in range(W1):
				dw1 += x1[:, s*stride: s*stride+HH, t*stride: t*stride+WW] * dout1[s][t]
				dx1[:, s*stride: s*stride+HH, t*stride: t*stride+WW] += w1 * dout1[s][t]
				db1 += dout1[s][t]
		db[j] += db1
		dx[i] += dx1
		dw[j] += dw1

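The loops over $i,j$ simply iterate over the first dimension of $dx$ and $dw$. Note that what we have computed is the gradient with respect to the padded input, so the padding must be cropped from the result: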

dx = dx[:, :, pad: pad+H, pad: pad+W]

Max pooling: Naive forward

This is similar to the convolution forward pass, except that the convolution is replaced by taking a maximum:

pool_height = pool_param["pool_height"]
pool_width = pool_param["pool_width"]
stride = pool_param["stride"]
# input dimensions
N, C, H, W = x.shape
# output dimensions
H1 = 1 + (H - pool_height) // stride
W1 = 1 + (W - pool_width) // stride

out = np.zeros((N, C, H1, W1))
for i in range(N):
    for j in range(C):
        x1 = x[i][j]
        res = np.zeros((H1, W1))
        for s in range(H1):
            for t in range(W1):
                res[s][t] = np.max(x1[s*stride: s*stride+pool_height, t*stride: t*stride+pool_width])
        out[i][j] = res

Max pooling: Naive backward

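By the nature of max pooling, we only need to accumulate dout at the position of the maximum in each window. The code is similar to the convolution backward pass: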

x, pool_param = cache
pool_height = pool_param["pool_height"]
pool_width = pool_param["pool_width"]
stride = pool_param["stride"]
# input dimensions
N, C, H, W = x.shape
# output dimensions
H1 = 1 + (H - pool_height) // stride
W1 = 1 + (W - pool_width) // stride

dx = np.zeros_like(x)
for i in range(N):
	for j in range(C):
		# the matching slice of dout
		dout1 = dout[i][j]
		x1 = x[i][j]
		dx1 = np.zeros((H, W))
		for s in range(H1):
			for t in range(W1):
				# flatten the pooling window
				temp = x1[s*stride: s*stride+pool_height, t*stride: t*stride+pool_width].flatten()
				# index of the maximum in the flattened window
				index = np.argmax(temp)
				# convert back to a (row, column) position inside the window
				m, n = index // pool_width, index % pool_width
				dx1[s*stride + m][t*stride + n] += dout1[s][t]
		dx[i][j] = dx1

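I did not find a function that returns the row and column of a matrix's maximum, so I computed them by hand:

# flatten the pooling window
temp = x1[s*stride: s*stride+pool_height, t*stride: t*stride+pool_width].flatten()
# index of the maximum in the flattened window
index = np.argmax(temp)
# convert back to a (row, column) position inside the window
m, n = index // pool_width, index % pool_width

Then accumulate dout at that position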

dx1[s*stride + m][t*stride + n] += dout1[s][t]
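For reference, NumPy does provide a function for this conversion: `np.unravel_index` turns a flat index into a tuple of coordinates for a given shape. The small sketch below shows that it agrees with the manual integer division and remainder. The example window is made up.

```python
import numpy as np

window = np.array([[1.0, 7.0, 3.0],
                   [4.0, 2.0, 9.0]])

# Manual conversion, as in the loop above.
index = np.argmax(window.flatten())
m_manual, n_manual = index // window.shape[1], index % window.shape[1]

# The same conversion with np.unravel_index (np.argmax flattens by default).
m, n = np.unravel_index(np.argmax(window), window.shape)
print((m_manual, n_manual), (m, n))
```

Both approaches return the first occurrence when the maximum is not unique, so they route the gradient to the same position.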

Fast layers

This part uses Cython. At first I got the following error:

error: Unable to find vcvarsall.bat

I eventually solved it by following a blog post. In practice it only requires downloading and installing one package.

Three-layer ConvNet

I felt the assignment did not explain this part clearly, though I may have misunderstood it. The network architecture is:

conv - relu - 2x2 max pool - affine - relu - affine - softmax

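The weights belong to the conv layer and the two affine layers. I was unsure about the dimensions at first, and then noticed the following code: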

# pass conv_param to the forward pass for the convolutional layer
filter_size = W1.shape[2]
conv_param = {'stride': 1, 'pad': (filter_size - 1) // 2}

# pass pool_param to the forward pass for the max-pooling layer
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

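With stride 1 and pad = (filter_size - 1) // 2, the convolution keeps the spatial size of its input (for odd filter_size), so the initialization is: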

C, H, W = input_dim
F, HH, WW = num_filters, filter_size, filter_size
W1 = np.random.randn(F, C, HH, WW) * weight_scale
b1 = np.zeros(F)
# the first conv layer keeps H and W, so its output has n values per image
n = F * H * W
# W2 is applied after the 2x2 max pool, which quarters that size, so its first dimension is n // 4
W2 = np.random.randn(n // 4, hidden_dim) * weight_scale
b2 = np.zeros(hidden_dim)
W3 = np.random.randn(hidden_dim, num_classes) * weight_scale
b3 = np.zeros(num_classes)
self.params["W1"] = W1
self.params["b1"] = b1
self.params["W2"] = W2
self.params["b2"] = b2
self.params["W3"] = W3
self.params["b3"] = b3
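To make the n // 4 argument concrete, the pure-Python sketch below traces the spatial size through the conv layer and the 2x2 pool. The helper name is hypothetical, and the example values (input (3, 32, 32), 32 filters of size 7) are just illustrative CIFAR-10-style settings.

```python
def flattened_size_after_conv_pool(input_dim, num_filters, filter_size):
    """Features per image after a 'same' conv (stride 1) and a 2x2 max pool (stride 2)."""
    C, H, W = input_dim
    pad = (filter_size - 1) // 2
    # stride-1 convolution with this padding keeps H and W for odd filter sizes
    conv_h = 1 + (H + 2 * pad - filter_size) // 1
    conv_w = 1 + (W + 2 * pad - filter_size) // 1
    # 2x2 max pool with stride 2 halves each spatial axis
    pool_h = 1 + (conv_h - 2) // 2
    pool_w = 1 + (conv_w - 2) // 2
    return num_filters * pool_h * pool_w

size = flattened_size_after_conv_pool((3, 32, 32), num_filters=32, filter_size=7)
print(size)  # 32 * 16 * 16
```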

Forward pass:

X1, cache1 = conv_forward_fast(X, W1, b1, conv_param)
X2, cache2 = relu_forward(X1)
X3, cache3 = max_pool_forward_fast(X2, pool_param)
X4, cache4 = affine_forward(X3, W2, b2)
X5, cache5 = relu_forward(X4)
X6, cache6 = affine_forward(X5, W3, b3)
scores = X6

Backward pass:

loss, dz6 = softmax_loss(scores, y)
loss += self.reg * (np.sum(W1 ** 2) + np.sum(W2 ** 2) + np.sum(W3 ** 2)) / 2

dz5, dW3, db3 = affine_backward(dz6, cache6)
dz4 = relu_backward(dz5, cache5)
dz3, dW2, db2 = affine_backward(dz4, cache4)
dz2 = max_pool_backward_fast(dz3, cache3)
dz1 = relu_backward(dz2, cache2)
dz, dW1, db1 = conv_backward_fast(dz1, cache1)

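# include the gradient of the L2 regularization term for each weight
grads["W3"] = dW3 + self.reg * W3
grads["b3"] = db3
grads["W2"] = dW2 + self.reg * W2
grads["b2"] = db2
grads["W1"] = dW1 + self.reg * W1
grads["b1"] = db1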

Spatial batch normalization: forward

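This part applies batch normalization per channel: the statistics of each channel are computed over the $N$, $H$ and $W$ dimensions. The channel axis therefore has to be moved last before reshaping to $(N\cdot H\cdot W, C)$, so the code is:

N, C, H, W = x.shape
# move channels last so that each column of x1 holds one channel
x1 = x.transpose(0, 2, 3, 1).reshape(-1, C)
out, cache = batchnorm_forward(x1, gamma, beta, bn_param)
# restore the (N, C, H, W) layout
out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)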
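The reshape to $(N\cdot H\cdot W, C)$ only groups values by channel if the channel axis is moved to the last position first. The sketch below illustrates this with a made-up array in which channel c is filled with the constant c: after the transpose each column holds a single channel, while a plain `reshape(-1, C)` mixes channels within a column.

```python
import numpy as np

N, C, H, W = 2, 3, 2, 2
x = np.zeros((N, C, H, W))
for c in range(C):
    x[:, c, :, :] = c            # channel c is filled with the value c

plain = x.reshape(-1, C)                             # no transpose
moved = x.transpose(0, 2, 3, 1).reshape(-1, C)       # channels moved last first

plain_ok = all(np.all(plain[:, c] == c) for c in range(C))
moved_ok = all(np.all(moved[:, c] == c) for c in range(C))

# Undo the layout change to get back to (N, C, H, W).
restored = moved.reshape(N, H, W, C).transpose(0, 3, 1, 2)
print(plain_ok, moved_ok, np.array_equal(restored, x))
```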

Spatial batch normalization: backward

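The backward pass works the same way:

N, C, H, W = dout.shape
# move channels last, matching the forward pass
dout1 = dout.transpose(0, 2, 3, 1).reshape(-1, C)
dx, dgamma, dbeta = batchnorm_backward(dout1, cache)
# restore the (N, C, H, W) layout
dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)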

5. TensorFlow

If loading the data raises an error, change the following path to where your data is stored:

cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

TensorFlow Details

The tricky part here is where the number 5408 comes from. Since the padding mode is 'VALID', the official documentation gives the following formula:

out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))

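With a $32\times 32$ input, a $7\times 7$ filter and stride $2$, the formula gives

$\lceil (32-7+1)/2 \rceil = \lceil 13 \rceil = 13$

which agrees with the usual expression $\lfloor (32-7+0)/2 \rfloor +1 = 12 + 1 = 13$.

With $32$ filters, the flattened output size is therefore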

$5408 = 32 \times 13 \times 13$
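The arithmetic can be double-checked with a few lines of pure Python. The helper names are my own. `valid_out` follows the 'VALID' formula quoted above, and `same_out` uses the corresponding 'SAME' rule, ceil(in / stride). The second computation shows that a stride-1 'VALID' convolution followed by a 2x2 pool with stride 2, the combination used in the model of the next section, also ends up at 5408.

```python
import math

def valid_out(size, filter_size, stride):
    """Output length for 'VALID' padding: ceil((in - filter + 1) / stride)."""
    return math.ceil((size - filter_size + 1) / stride)

def same_out(size, stride):
    """Output length for 'SAME' padding: ceil(in / stride)."""
    return math.ceil(size / stride)

# 7x7 convolution with stride 2, 'VALID' padding, 32 filters
side = valid_out(32, 7, 2)
flat = 32 * side * side

# 7x7 convolution with stride 1 ('VALID'), then 2x2 max pool with stride 2 ('SAME')
conv_side = valid_out(32, 7, 1)
pool_side = same_out(conv_side, 2)
flat_alt = 32 * pool_side * pool_side
print(side, flat, conv_side, pool_side, flat_alt)
```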


Training a specific model

Since I am not very familiar with TensorFlow, this part mainly follows someone else's solution. The code is:

# define model
def complex_model(X,y,is_training):
    #conv1
    Wconv1 = tf.get_variable("Wconv1", shape=[7, 7, 3, 32])
    bconv1 = tf.get_variable("bconv1", shape=[32])
    #Affine layer
    W1 = tf.get_variable("W1", shape=[5408, 1024])
    b1 = tf.get_variable("b1", shape=[1024])
    #Affine layer
    W2 = tf.get_variable("W2", shape=[1024, 10])
    b2 = tf.get_variable("b2", shape=[10])

    #conv
    a1 = tf.nn.conv2d(X, Wconv1, strides=[1,1,1,1], padding='VALID') + bconv1
    #relu
    h1 = tf.nn.relu(a1)
    #Spatial Batch Normalization Layer
    axis = [0, 1, 2]
    mean, variance = tf.nn.moments(h1, axis)
    offset = tf.Variable(tf.zeros([32]))
    scale = tf.Variable(tf.ones([32]))
    bn1 = tf.nn.batch_normalization(h1, mean, variance, offset, scale, 0.001)
    #Max Pooling
    p1 = tf.nn.max_pool(bn1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="SAME")
    #Affine layers
    p1_flat = tf.reshape(p1, [-1, 5408])
    a2 = tf.matmul(p1_flat, W1) + b1
    #relu
    h2 = tf.nn.relu(a2)
    h2_flat = tf.reshape(h2, [-1, 1024])
    #Affine layer 
    y_out = tf.matmul(h2_flat, W2) + b2

    return y_out

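The Spatial Batch Normalization Layer is a bit more involved. The other parts just follow the same pattern as the earlier examples. Note that this version always normalizes with the statistics of the current batch and does not use is_training.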

Summary

This assignment was really hard and takes several passes to digest. I will probably optimize the convolution code later.