Deep Learning Systems HW1
This post reviews HW1, whose main task is to implement a basic automatic differentiation framework.
Course homepage:
References:
- https://forum.dlsyscourse.org/t/q3-topologic-sort-inputs-is-not-a-node-list-solved/1995
- https://blog.csdn.net/qq_34384524/article/details/86892864
- https://en.wikipedia.org/wiki/Topological_sorting#Depth-first_search
- https://forum.dlsyscourse.org/t/q4-how-to-multiply-the-partial-derivative/2131
- https://forum.dlsyscourse.org/t/q4-reverse-mode-ad-test-throws-grad-attributeerror-for-tensor-object/2035
- https://forum.dlsyscourse.org/t/not-understand-several-sentences-meaning-in-hw1/2090/2
- https://blog.csdn.net/qq1483661204/article/details/70543952
Notes
The instructor gives a handy recipe for deriving an op's backward pass (a worked example follows this list):
- first take the derivative in scalar form;
- then convert it to matrix form;
- finally use the input and output shapes to pin down the final result.
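As a worked instance of this recipe (standard matrix calculus; the same derivation reappears in the MatMul op below), take $o = xy$ with $x \in \mathbb{R}^{a \times b}$, $y \in \mathbb{R}^{b \times c}$ and a scalar loss $\ell$:
$$o_{ij} = \sum_k x_{ik} y_{kj} \;\Rightarrow\; \frac{\partial \ell}{\partial x_{ik}} = \sum_j \frac{\partial \ell}{\partial o_{ij}}\, y_{kj} \;\Rightarrow\; \frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial o}\, y^{\top}.$$
The shape check confirms the last step: $\partial \ell / \partial o$ is $a \times c$ and $y^{\top}$ is $c \times b$, so their product is $a \times b$, matching $x$.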
The forward and backward passes return different types:
- the forward pass outputs NDArray arrays;
- the backward pass outputs Tensors.
Notation:
- in the discussion below, the output is denoted by $o$.
Question 1: Implementing forward computation & Question 2: Implementing backward computation
EWiseAdd
Formula: $o = x + y$, with $\frac{\partial o}{\partial x} = \frac{\partial o}{\partial y} = 1$.
Code:
class EWiseAdd(TensorOp):
def compute(self, a: NDArray, b: NDArray):
return a + b
def gradient(self, out_grad: Tensor, node: Tensor):
return out_grad, out_grad
AddScalar
Formula: $o = x + c$ for a scalar $c$, with $\frac{\partial o}{\partial x} = 1$.
Code:
class AddScalar(TensorOp):
def __init__(self, scalar):
self.scalar = scalar
def compute(self, a: NDArray):
return a + self.scalar
def gradient(self, out_grad: Tensor, node: Tensor):
return out_grad
EWiseMul
Formula: $o = x \odot y$ (elementwise), with $\frac{\partial o}{\partial x} = y$ and $\frac{\partial o}{\partial y} = x$.
Code:
class EWiseMul(TensorOp):
def compute(self, a: NDArray, b: NDArray):
return a * b
def gradient(self, out_grad: Tensor, node: Tensor):
lhs, rhs = node.inputs
return out_grad * rhs, out_grad * lhs
MulScalar
Formula: $o = c\,x$ for a scalar $c$, with $\frac{\partial o}{\partial x} = c$.
Code:
class MulScalar(TensorOp):
def __init__(self, scalar):
self.scalar = scalar
def compute(self, a: NDArray):
return a * self.scalar
def gradient(self, out_grad: Tensor, node: Tensor):
return out_grad * self.scalar
PowerScalar
Formula: $o = x^{n}$ for an integer $n$, with $\frac{\partial o}{\partial x} = n x^{n-1}$.
Code:
class PowerScalar(TensorOp):
"""Op raise a tensor to an (integer) power."""
def __init__(self, scalar: int):
self.scalar = scalar
def compute(self, a: NDArray) -> NDArray:
### BEGIN YOUR SOLUTION
return array_api.power(a, self.scalar)
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
input, = node.inputs
# nx^(n-1)
return out_grad * (self.scalar * array_api.power(input, self.scalar - 1))
### END YOUR SOLUTION
EWiseDiv
Formula: $o = x / y$ (elementwise), with $\frac{\partial o}{\partial x} = \frac{1}{y}$ and $\frac{\partial o}{\partial y} = -\frac{x}{y^{2}}$.
Code:
class EWiseDiv(TensorOp):
"""Op to element-wise divide two nodes."""
def compute(self, a, b):
### BEGIN YOUR SOLUTION
return a / b
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
lhs, rhs = node.inputs
return out_grad / rhs, -out_grad * lhs / rhs / rhs
### END YOUR SOLUTION
DivScalar
Formula: $o = x / c$ for a scalar $c$, with $\frac{\partial o}{\partial x} = \frac{1}{c}$.
Code:
class DivScalar(TensorOp):
def __init__(self, scalar):
self.scalar = scalar
def compute(self, a):
### BEGIN YOUR SOLUTION
return a / self.scalar
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
return out_grad / self.scalar
### END YOUR SOLUTION
Transpose
Just swap the two given axes. Note that np.swapaxes is used here rather than np.transpose; see the reference links above for the difference (a quick comparison follows).
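A minimal NumPy comparison (illustration only, not part of the handout) of the two functions:
import numpy as np

a = np.zeros((2, 3, 4))
# np.transpose with no axes argument reverses *all* axes ...
print(np.transpose(a).shape)         # (4, 3, 2)
# ... while the needle Transpose op only swaps two axes (the last two by default)
print(np.swapaxes(a, -1, -2).shape)  # (2, 4, 3)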
Code:
class Transpose(TensorOp):
def __init__(self, axes: Optional[tuple] = None):
self.axes = axes
def compute(self, a):
### BEGIN YOUR SOLUTION
if self.axes:
x, y = self.axes[0], self.axes[1]
else:
x, y = -1, -2
return array_api.swapaxes(a, x, y)
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
if self.axes:
x, y = self.axes[0], self.axes[1]
else:
x, y = -1, -2
return transpose(out_grad, axes=(x, y))
### END YOUR SOLUTION
Reshape
Just change the shape; for the gradient, reshape out_grad back to the input's shape.
Code:
class Reshape(TensorOp):
def __init__(self, shape):
self.shape = shape
def compute(self, a):
### BEGIN YOUR SOLUTION
return array_api.reshape(a, self.shape)
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
input, = node.inputs
return reshape(out_grad, input.shape)
### END YOUR SOLUTION
BroadcastTo
This one is more involved. Its main job is the following kind of mapping (case 1): broadcast $x_0$ of shape $(n_1, \dots, 1, \dots, n_k)$ to $x_1$ of shape $(n_1, \dots, n_i, \dots, n_k)$, where every slice of $x_1$ along the $i$-th dimension equals $x_0$.
A special case maps a scalar to a tensor (case 2): $x_0 \in \mathbb{R} \mapsto x_1 \in \mathbb{R}^{n_1 \times \cdots \times n_k}$, where every element of $x_1$ equals $x_0$.
The forward pass is simple: just call broadcast_to. For the backward pass, remember that BroadcastTo only copies data, so its adjoint is a sum over the copies (a sketch follows this list). For case 1:
- find the axes that were broadcast, sum out_grad over them, and reshape the result back to $x_0$'s shape;
for case 2:
- simply sum over all axes.
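A small NumPy sketch (illustration only) of why the adjoint of broadcasting is a sum over the broadcast axes:
import numpy as np

# x of shape (3, 1) broadcast to (3, 4): each element of x is copied 4 times,
# so it contributes to 4 output elements and its gradient is the sum of the
# corresponding 4 upstream gradients.
x = np.arange(3.0).reshape(3, 1)
out_grad = np.ones((3, 4))                    # upstream gradient of the broadcast result
x_grad = out_grad.sum(axis=1, keepdims=True)  # sum over the broadcast axis
print(x_grad.shape)                           # (3, 1), matches x.shape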
代码:
class BroadcastTo(TensorOp):
def __init__(self, shape):
self.shape = shape
def compute(self, a):
return array_api.broadcast_to(a, self.shape)
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
        input, = node.inputs
        # find the axes along which the input was broadcast
        n1 = len(input.shape)
        n2 = len(self.shape)
        # scaling coefficient; for any valid broadcast the mismatched input
        # dimensions have size 1, so c stays 1
        c = 1
        if n1 != n2:
            # scalar -> tensor: every output axis is a broadcast axis
            axes = range(n2)
        else:
            # tensor -> tensor: collect the axes whose sizes differ
            axes = []
            for i in range(n1):
                if input.shape[i] != self.shape[i]:
                    axes.append(i)
                    c *= input.shape[i]
        # sum over the broadcast axes, then restore the input shape
        return reshape(summation(out_grad, axes=tuple(axes)), input.shape) / c
### END YOUR SOLUTION
Summation
Another relatively involved op. The forward computation simply sums $x$ over the given axes:
$$o = \sum_{\text{axes}} x.$$
For the backward pass, we first restore $o$ to the same number of dimensions as $x$ (re-inserting the reduced axes with size 1) and then broadcast back to the input's shape, since the derivative of a sum with respect to each summand is $1$. An example (a NumPy sketch follows this list):
- suppose the input $x$ has shape $(3, 4, 5)$ and we sum over axis $1$ (the dimension of size $4$); the output then has shape $(3, 5)$;
- during backpropagation the incoming gradient $g$ has shape $(3, 5)$; first reshape it to $(3, 1, 5)$, then broadcast it to $(3, 4, 5)$.
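The same example as a NumPy sketch (illustration only):
import numpy as np

# Forward: x of shape (3, 4, 5) summed over axis 1 gives o of shape (3, 5).
x = np.random.rand(3, 4, 5)
o = x.sum(axis=1)

# Backward: the incoming gradient g has o's shape; re-insert the reduced axis
# with size 1, then broadcast back to x's shape, so every element of x receives
# the gradient of the output entry it was summed into.
g = np.ones_like(o)                                    # shape (3, 5)
x_grad = np.broadcast_to(g.reshape(3, 1, 5), x.shape)  # shape (3, 4, 5)
print(x_grad.shape)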
Code:
class Summation(TensorOp):
def __init__(self, axes: Optional[tuple] = None):
self.axes = axes
def compute(self, a):
### BEGIN YOUR SOLUTION
return array_api.sum(a, self.axes)
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
        input, = node.inputs
        grad_shape = list(out_grad.shape)
        # axes=None means the sum was taken over all axes
        if self.axes is None:
            axes = range(len(input.shape))
        else:
            axes = self.axes
        # normalize negative axes and sort them in ascending order
        n = len(input.shape)
        new_axes = sorted(x if x >= 0 else x + n for x in axes)
        # re-insert the reduced axes as size-1 dimensions so grad_shape has the
        # same number of dimensions as input.shape
        for axis in new_axes:
            grad_shape.insert(axis, 1)
        return broadcast_to(reshape(out_grad, grad_shape), input.shape)
### END YOUR SOLUTION
MatMul
Formula: in the basic (2-D) case, $o = x y$ with $x \in \mathbb{R}^{a \times b}$ and $y \in \mathbb{R}^{b \times c}$, and
$$\frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial o}\, y^{\top}, \qquad \frac{\partial \ell}{\partial y} = x^{\top}\, \frac{\partial \ell}{\partial o}.$$
In the more general case, $x$ and $y$ have different numbers of dimensions but their last two dimensions can still be matrix-multiplied, e.g.
$$x \in \mathbb{R}^{B \times a \times b}, \quad y \in \mathbb{R}^{b \times c}, \quad o \in \mathbb{R}^{B \times a \times c},$$
or
$$x \in \mathbb{R}^{a \times b}, \quad y \in \mathbb{R}^{B \times b \times c}, \quad o \in \mathbb{R}^{B \times a \times c}.$$
Because the forward pass broadcasts the lower-dimensional input over the leading batch dimensions, the backward pass has to sum that input's gradient over those dimensions (a shape check follows the code). The full code:
class MatMul(TensorOp):
def compute(self, a, b):
### BEGIN YOUR SOLUTION
return array_api.matmul(a, b)
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
# (A, a, b), (B, b, c)
lhs, rhs = node.inputs
# out_grad: (C, a, c)
# (C, a, b)
lhs_grad = matmul(out_grad, transpose(rhs, axes=(-1, -2)))
# (C, b, c)
rhs_grad = matmul(transpose(lhs, axes=(-1, -2)), out_grad)
        # sum over any leading batch dimensions that were broadcast so each gradient matches its input's shape
n1 = len(out_grad.shape)
n2 = len(lhs.shape)
n3 = len(rhs.shape)
if n1 > n2:
lhs_grad = summation(lhs_grad, axes=tuple(range(n1 - n2)))
if n1 > n3:
rhs_grad = summation(rhs_grad, axes=tuple(range(n1 - n3)))
return lhs_grad, rhs_grad
### END YOUR SOLUTION
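A NumPy shape check for the broadcast case (illustration only; the shapes follow the comments in the code above):
import numpy as np

# lhs is 2-D while rhs carries an extra batch dimension, so the forward pass
# broadcasts lhs and out_grad picks up the batch dimension.
lhs = np.random.rand(4, 5)                       # (a, b)
rhs = np.random.rand(3, 5, 6)                    # (B, b, c)
out_grad = np.ones(np.matmul(lhs, rhs).shape)    # (3, 4, 6) = (B, a, c)

# the raw lhs gradient keeps the batch dimension; summing it out restores lhs.shape
lhs_grad = np.matmul(out_grad, rhs.swapaxes(-1, -2)).sum(axis=0)
# the rhs gradient already has rhs's shape, so no reduction is needed
rhs_grad = np.matmul(lhs.swapaxes(-1, -2), out_grad)
print(lhs_grad.shape, rhs_grad.shape)            # (4, 5) (3, 5, 6)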
Negate
Formula: $o = -x$, with $\frac{\partial o}{\partial x} = -1$.
Code:
class Negate(TensorOp):
def compute(self, a):
### BEGIN YOUR SOLUTION
return -a
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
return -out_grad
### END YOUR SOLUTION
Log
Formula: $o = \log x$, with $\frac{\partial o}{\partial x} = \frac{1}{x}$.
Code:
class Log(TensorOp):
def compute(self, a):
### BEGIN YOUR SOLUTION
return array_api.log(a)
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
input, = node.inputs
return out_grad / input
### END YOUR SOLUTION
Exp
Formula: $o = \exp x$, with $\frac{\partial o}{\partial x} = \exp x$.
Code:
class Exp(TensorOp):
def compute(self, a):
### BEGIN YOUR SOLUTION
return array_api.exp(a)
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
input, = node.inputs
return out_grad * exp(input)
### END YOUR SOLUTION
Question 3: Topological sort
node_list contains the output nodes of the computation graph, not the whole graph. To obtain a topological order, run a post-order DFS from each node in node_list, treating a node's inputs as its neighbors (a small sanity check follows the code). The code:
def find_topo_sort(node_list: List[Value]) -> List[Value]:
"""Given a list of nodes, return a topological sort list of nodes ending in them.
A simple algorithm is to do a post-order DFS traversal on the given nodes,
going backwards based on input edges. Since a node is added to the ordering
after all its predecessors are traversed due to post-order DFS, we get a topological
sort.
"""
### BEGIN YOUR SOLUTION
    # visit states: 0 = unvisited, 1 = in progress, 2 = done
    visited = dict()
    # initialize the given output nodes as unvisited
for node in node_list:
visited[node] = 0
topo_order = []
for node in node_list:
if visited[node] == 0:
topo_sort_dfs(node, visited, topo_order)
return topo_order
### END YOUR SOLUTION
def topo_sort_dfs(node, visited, topo_order):
"""Post-order DFS"""
### BEGIN YOUR SOLUTION
if visited[node] == 2:
return
visited[node] = 1
for node1 in node.inputs:
if node1 not in visited:
visited[node1] = 0
if visited[node1] == 0:
topo_sort_dfs(node1, visited, topo_order)
visited[node] = 2
topo_order.append(node)
### END YOUR SOLUTION
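A quick sanity check (a sketch; it assumes the homework's needle package is importable as ndl and that find_topo_sort is reachable via ndl.autograd):
import needle as ndl

# For y = (a + b) * a, any valid topological order must list a and b before
# a + b, and a + b before the final product.
a = ndl.Tensor([1.0])
b = ndl.Tensor([2.0])
c = a + b
y = c * a
order = ndl.autograd.find_topo_sort([y])
assert order.index(a) < order.index(c) < order.index(y)
assert order.index(b) < order.index(c)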
Question 4: Implementing reverse mode differentiation
Implement the reverse-mode AD pseudocode from the lecture: traverse the nodes in reverse topological order; for each node, sum the partial adjoints it has received to obtain its gradient, then push $\bar{v}_i \cdot \partial v_i / \partial v_k$ to each input $v_k$ (a quick check follows the code).
Code:
def compute_gradient_of_variables(output_tensor, out_grad):
"""Take gradient of output node with respect to each node in node_list.
Store the computed result in the grad field of each Variable.
"""
# a map from node to a list of gradient contributions from each output node
node_to_output_grads_list: Dict[Tensor, List[Tensor]] = {}
    # Special note on initializing gradient of output_tensor:
# We are really taking a derivative of the scalar reduce_sum(output_node)
# instead of the vector output_node. But this is the common case for loss function.
node_to_output_grads_list[output_tensor] = [out_grad]
# Traverse graph in reverse topological order given the output_node that we are taking gradient wrt.
reverse_topo_order = list(reversed(find_topo_sort([output_tensor])))
### BEGIN YOUR SOLUTION
for node in reverse_topo_order:
v = sum_node_list(node_to_output_grads_list[node])
        # store the accumulated adjoint in node.grad (the gradients are read from this field)
node.grad = v
        # make sure every input node has a list to append partial adjoints to
for node1 in node.inputs:
if node1 not in node_to_output_grads_list:
node_to_output_grads_list[node1] = []
        # propagate partial adjoints to the inputs (leaf nodes have no op)
if not node.is_leaf():
gradient = node.op.gradient_as_tuple(v, node)
for i, node1 in enumerate(node.inputs):
node_to_output_grads_list[node1].append(gradient[i])
### END YOUR SOLUTION
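A quick end-to-end check (a sketch, again assuming the needle package built in this homework is importable as ndl): for $y = x_1 x_2 + x_1$ we expect $\partial y/\partial x_1 = x_2 + 1$ and $\partial y/\partial x_2 = x_1$.
import numpy as np
import needle as ndl

x1 = ndl.Tensor(np.array([2.0]))
x2 = ndl.Tensor(np.array([3.0]))
y = x1 * x2 + x1
y.backward()             # defaults to an all-ones out_grad
print(x1.grad.numpy())   # [4.] = x2 + 1
print(x2.grad.numpy())   # [2.] = x1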
Question 5: Softmax loss
Just use the following formula:
$$\ell_{\mathrm{softmax}}(z, y) = \log \sum_{i=1}^{k} \exp z_i - z_y.$$
For reference, here is the NumPy version from the previous assignment:
def softmax_loss(Z, y):
""" Return softmax loss. Note that for the purposes of this assignment,
you don't need to worry about "nicely" scaling the numerical properties
of the log-sum-exp computation, but can just compute this directly.
Args:
Z (np.ndarray[np.float32]): 2D numpy array of shape
(batch_size, num_classes), containing the logit predictions for
each class.
y (np.ndarray[np.int8]): 1D numpy array of shape (batch_size, )
containing the true label of each example.
Returns:
Average softmax loss over the sample.
"""
### BEGIN YOUR CODE
exp_sum_z = np.sum(np.exp(Z), axis=-1)
b = Z.shape[0]
z_y = Z[np.arange(b), y]
loss = np.mean(np.log(exp_sum_z) - z_y)
return loss
### END YOUR CODE
Then rewrite it with needle ops; the only observation needed is that $z^{\top} y_{\mathrm{onehot}} = z_{y_{\mathrm{label}}}$, so the label term becomes an elementwise multiply followed by a sum (a usage sketch follows the code):
def softmax_loss(Z, y_one_hot):
""" Return softmax loss. Note that for the purposes of this assignment,
you don't need to worry about "nicely" scaling the numerical properties
of the log-sum-exp computation, but can just compute this directly.
Args:
Z (ndl.Tensor[np.float32]): 2D Tensor of shape
(batch_size, num_classes), containing the logit predictions for
each class.
y (ndl.Tensor[np.int8]): 2D Tensor of shape (batch_size, num_classes)
containing a 1 at the index of the true label of each example and
zeros elsewhere.
Returns:
Average softmax loss over the sample. (ndl.Tensor[np.float32])
"""
### BEGIN YOUR SOLUTION
# n, d -> n
exp_sum_z = ndl.summation(ndl.exp(Z), axes=(-1,))
b = Z.shape[0]
z_y = ndl.summation(ndl.multiply(Z, y_one_hot), axes=(-1,))
# print(exp_sum_z.shape, z_y.shape)
loss = ndl.summation(ndl.log(exp_sum_z) - z_y) / b
# print(loss.shape)
return loss
### END YOUR SOLUTION
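A usage sketch (the shapes here are made up for illustration): 4 examples and 3 classes, with the integer labels converted to one-hot before calling the needle version above.
import numpy as np
import needle as ndl

Z = ndl.Tensor(np.random.randn(4, 3).astype(np.float32))
y = np.array([0, 2, 1, 2])
y_one_hot = np.zeros((4, 3), dtype=np.float32)
y_one_hot[np.arange(4), y] = 1
loss = softmax_loss(Z, ndl.Tensor(y_one_hot))
print(loss.numpy())      # average loss over the 4 examples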
Question 6: SGD for a two-layer neural network
First, the ReLU operator:
class ReLU(TensorOp):
def compute(self, a):
### BEGIN YOUR SOLUTION
return array_api.maximum(a, 0)
### END YOUR SOLUTION
def gradient(self, out_grad, node):
### BEGIN YOUR SOLUTION
        input, = node.inputs
        # mask of positions where the input is positive; going through numpy
        # keeps the mask out of the computation graph
        # input_relu = input.realize_cached_data()
        input_relu = relu(input).numpy()
        return out_grad * Tensor(input_relu > 0)
### END YOUR SOLUTION
Then the training loop itself (a smoke test follows the code):
def nn_epoch(X, y, W1, W2, lr = 0.1, batch=100):
""" Run a single epoch of SGD for a two-layer neural network defined by the
weights W1 and W2 (with no bias terms):
logits = ReLU(X * W1) * W2
The function should use the step size lr, and the specified batch size (and
again, without randomizing the order of X).
Args:
X (np.ndarray[np.float32]): 2D input array of size
(num_examples x input_dim).
y (np.ndarray[np.uint8]): 1D class label array of size (num_examples,)
W1 (ndl.Tensor[np.float32]): 2D array of first layer weights, of shape
(input_dim, hidden_dim)
W2 (ndl.Tensor[np.float32]): 2D array of second layer weights, of shape
(hidden_dim, num_classes)
lr (float): step size (learning rate) for SGD
batch (int): size of SGD mini-batch
Returns:
Tuple: (W1, W2)
W1: ndl.Tensor[np.float32]
W2: ndl.Tensor[np.float32]
"""
### BEGIN YOUR SOLUTION
n = X.shape[0]
m = W2.shape[1]
y_one_hot = np.zeros((n, m))
y_one_hot[np.arange(n), y] = 1
step = n // batch
index = np.arange(batch)
for i in range(step + 1):
start = i * batch
end = min(start + batch, n)
if start == end:
break
x1 = ndl.Tensor(X[start: end])
y1 = ndl.Tensor(y_one_hot[start: end])
# compute
Z = ndl.matmul(ndl.relu(ndl.matmul(x1, W1)), W2)
# loss
loss = softmax_loss(Z, y1)
# backward
loss.backward()
        # SGD update: rebuild the weights from numpy so they remain leaf tensors detached from the old graph
W1 = ndl.Tensor(W1.numpy() - lr * W1.grad.numpy())
W2 = ndl.Tensor(W2.numpy() - lr * W2.grad.numpy())
return W1, W2
### END YOUR SOLUTION
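A synthetic smoke test with random data (a sketch; the shapes and dtypes just follow the docstring above):
import numpy as np
import needle as ndl

np.random.seed(0)
X = np.random.randn(500, 10).astype(np.float32)          # 500 examples, 10 features
y = np.random.randint(0, 3, size=500).astype(np.uint8)   # 3 classes
W1 = ndl.Tensor(np.random.randn(10, 32).astype(np.float32) / np.sqrt(32))
W2 = ndl.Tensor(np.random.randn(32, 3).astype(np.float32) / np.sqrt(3))
W1, W2 = nn_epoch(X, y, W1, W2, lr=0.1, batch=100)
print(W1.shape, W2.shape)                                 # (10, 32) (32, 3)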