This post reviews HW1, whose main task is to implement a basic automatic differentiation framework.

Course homepage:

References:

Notes

The instructor offers a recipe for deriving the backward pass:

  • first take the derivative in scalar form;
  • then convert it to matrix form;
  • finally, read off the final expression from the input and output shapes (a worked example follows this list);
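As a worked example of this recipe (writing $\ell$ for the final scalar loss and $\bar{o} = \partial \ell / \partial o$ for out_grad), take matrix multiplication $o = x y$ with $x \in \mathbb{R}^{m \times k}$ and $y \in \mathbb{R}^{k \times n}$. In scalar form, $o_{ij} = \sum_{t} x_{it} y_{tj}$, so $\frac{\partial \ell}{\partial x_{it}} = \sum_{j} \bar{o}_{ij} y_{tj}$. In matrix form this is $\bar{x} = \bar{o}\, y^{\top}$, and the shapes confirm it: $(m \times n)$ times $(n \times k)$ yields $m \times k$, matching $x$. The same steps give $\bar{y} = x^{\top} \bar{o}$, which is exactly what the MatMul gradient below implements.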

The forward and backward passes return different types:

  • the forward pass (compute) returns NDArray arrays;
  • the backward pass (gradient) returns Tensors;

Notes:

  • in the discussion below, the output of an op is denoted $o$;

Question 1: Implementing forward computation & Question 2: Implementing backward computation

EWiseAdd

Formula: $o = x + y$; $\frac{\partial o}{\partial x} = \frac{\partial o}{\partial y} = 1$, so both inputs simply receive out_grad.

Code:

class EWiseAdd(TensorOp):
    def compute(self, a: NDArray, b: NDArray):
        return a + b

    def gradient(self, out_grad: Tensor, node: Tensor):
        return out_grad, out_grad

AddScalar

Formula: $o = x + c$ for a scalar $c$; $\frac{\partial o}{\partial x} = 1$.

Code:

class AddScalar(TensorOp):
    def __init__(self, scalar):
        self.scalar = scalar

    def compute(self, a: NDArray):
        return a + self.scalar

    def gradient(self, out_grad: Tensor, node: Tensor):
        return out_grad

EWiseMul

Formula: $o = x \odot y$ (element-wise); $\frac{\partial o}{\partial x} = y$, $\frac{\partial o}{\partial y} = x$.

Code:

class EWiseMul(TensorOp):
    def compute(self, a: NDArray, b: NDArray):
        return a * b

    def gradient(self, out_grad: Tensor, node: Tensor):
        lhs, rhs = node.inputs
        return out_grad * rhs, out_grad * lhs
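
A simple way to spot-check gradients like these is a central-difference comparison. Below is a minimal sketch; it only relies on the ndl.Tensor API that also appears later in this post (backward(), .grad, .numpy()), and check_mul_grad is a throwaway helper name, not part of the framework:

import numpy as np
import needle as ndl

def check_mul_grad(eps=1e-4):
    a = np.random.rand(3, 4).astype(np.float32)
    b = np.random.rand(3, 4).astype(np.float32)
    # analytic gradient of sum(a * b) w.r.t. a, via the framework
    ta, tb = ndl.Tensor(a), ndl.Tensor(b)
    out = ndl.summation(ndl.multiply(ta, tb))
    out.backward()
    analytic = ta.grad.numpy()
    # central difference on the (0, 0) entry only, for brevity
    ap, am = a.copy(), a.copy()
    ap[0, 0] += eps
    am[0, 0] -= eps
    numeric = ((ap * b).sum() - (am * b).sum()) / (2 * eps)
    print(np.isclose(analytic[0, 0], numeric, atol=1e-2))  # expected: True

check_mul_grad()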

MulScalar

Formula: $o = c x$; $\frac{\partial o}{\partial x} = c$.

Code:

class MulScalar(TensorOp):
    def __init__(self, scalar):
        self.scalar = scalar

    def compute(self, a: NDArray):
        return a * self.scalar

    def gradient(self, out_grad: Tensor, node: Tensor):
        return out_grad * self.scalar

PowerScalar

Formula: $o = x^{n}$ for an integer scalar $n$; $\frac{\partial o}{\partial x} = n x^{n-1}$.

Code:

class PowerScalar(TensorOp):
    """Op raise a tensor to an (integer) power."""

    def __init__(self, scalar: int):
        self.scalar = scalar

    def compute(self, a: NDArray) -> NDArray:
        ### BEGIN YOUR SOLUTION
        return array_api.power(a, self.scalar)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        input, = node.inputs
        # d/dx x^n = n * x^(n-1); use the Tensor-level power_scalar so the result stays in the graph
        return out_grad * self.scalar * power_scalar(input, self.scalar - 1)
        ### END YOUR SOLUTION

EWiseDiv

Formula: $o = x / y$ (element-wise); $\frac{\partial o}{\partial x} = \frac{1}{y}$, $\frac{\partial o}{\partial y} = -\frac{x}{y^{2}}$.

Code:

class EWiseDiv(TensorOp):
    """Op to element-wise divide two nodes."""

    def compute(self, a, b):
        ### BEGIN YOUR SOLUTION
        return a / b
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        lhs, rhs = node.inputs
        return out_grad / rhs, -out_grad * lhs / rhs / rhs
        ### END YOUR SOLUTION

DivScalar

Formula: $o = x / c$; $\frac{\partial o}{\partial x} = \frac{1}{c}$.

Code:

class DivScalar(TensorOp):
    def __init__(self, scalar):
        self.scalar = scalar

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return a / self.scalar
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        return out_grad / self.scalar
        ### END YOUR SOLUTION

Transpose

Just swap the two given dimensions (the last two by default). Note that np.swapaxes is used here rather than np.transpose; see the link for details.
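
A quick NumPy comparison of the two (illustrative only): np.transpose expects a full permutation of all axes (and reverses them all when no axes are given), while np.swapaxes exchanges exactly two axes, which matches the semantics wanted here.

import numpy as np

a = np.zeros((2, 3, 4))
# swap only axes 0 and 1, leaving axis 2 alone
print(np.swapaxes(a, 0, 1).shape)        # (3, 2, 4)
# np.transpose needs the full permutation to express the same thing
print(np.transpose(a, (1, 0, 2)).shape)  # (3, 2, 4)
# np.transpose without axes reverses every axis, which is not what Transpose wants
print(np.transpose(a).shape)             # (4, 3, 2)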

Code:

class Transpose(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        if self.axes:
            x, y = self.axes[0], self.axes[1]
        else:
            x, y = -1, -2
        return array_api.swapaxes(a, x, y)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        if self.axes:
            x, y = self.axes[0], self.axes[1]
        else:
            x, y = -1, -2
        return transpose(out_grad, axes=(x, y))
        ### END YOUR SOLUTION

Reshape

Just change the shape in the forward pass; the gradient reshapes out_grad back to the input's shape.

Code:

class Reshape(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.reshape(a, self.shape)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        input, = node.inputs
        return reshape(out_grad, input.shape)
        ### END YOUR SOLUTION

BroadcastTo

This is one of the trickier ops. Its main job is to implement the following mapping (case 1):

$x_0 \in \mathbb{R}^{d_1 \times \cdots \times 1 \times \cdots \times d_n} \mapsto x_1 \in \mathbb{R}^{d_1 \times \cdots \times d_i \times \cdots \times d_n},$

where every slice of $x_1$ along dimension $i$ is a copy of $x_0$.

A special case maps a scalar to a tensor (case 2):

$x_0 \in \mathbb{R} \mapsto x_1 \in \mathbb{R}^{d_1 \times \cdots \times d_n},$

where every element of $x_1$ equals $x_0$.

The forward pass is simple: just call broadcast_to. For the backward pass, remember that BroadcastTo only copies data. For case 1:

  • find the broadcast dimensions, sum out_grad over them, and reshape back to the shape of $x_0$;

For case 2:

  • sum over all dimensions (a small NumPy example follows this list);
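
A small NumPy example of the backward rule (shapes made up for illustration): broadcasting $x$ of shape $(3, 1)$ to $(3, 4)$ copies each entry four times, so the gradient of $x$ is the upstream gradient summed over the broadcast axis.

import numpy as np

x = np.arange(3.0).reshape(3, 1)       # input, shape (3, 1)
y = np.broadcast_to(x, (3, 4))         # forward pass, shape (3, 4)
g = np.ones((3, 4))                    # upstream gradient for y
# each x[i, 0] was copied 4 times, so its gradient is the sum of those 4 entries
grad_x = g.sum(axis=1, keepdims=True)  # shape (3, 1), every entry equals 4
print(grad_x)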

Code:

class BroadcastTo(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        return array_api.broadcast_to(a, self.shape)

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        input, = node.inputs
        # find the dimensions that were broadcast
        n1 = len(input.shape)
        n2 = len(self.shape)
        # scaling factor (stays 1 here, since broadcastable dims have size 1)
        c = 1
        if n1 != n2:
            # scalar -> tensor: every output dimension was broadcast
            axes = range(n2)
        else:
            # tensor -> tensor: broadcast along the dims where the sizes differ
            axes = []
            for i in range(n1):
                if input.shape[i] != self.shape[i]:
                    axes.append(i)
                    c *= input.shape[i]
        # sum over the broadcast dims, then restore the input shape
        return reshape(summation(out_grad, axes=tuple(axes)), input.shape) / c
        ### END YOUR SOLUTION

Summation

Another relatively involved op. It sums $x$ over the dimensions given by axes (all dimensions when axes is None); for instance, summing a 2-D input over axis 0 gives $o_j = \sum_i x_{ij}$.

For the backward pass, we first restore the reduced dimensions of $o$ (as size-1 dimensions matching $x$), then broadcast to the input's shape (the derivative of a sum with respect to each summand is $1$). An example:

  • suppose the input $x$ has shape $(3, 4, 5)$ and we sum over axis 1; the output then has shape $(3, 5)$;
  • during backpropagation the incoming gradient $g$ has shape $(3, 5)$; first reshape it to $(3, 1, 5)$, then broadcast it to $(3, 4, 5)$ (see the NumPy check after this list);
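
A quick NumPy check of this reshape-then-broadcast rule (illustrative only):

import numpy as np

x = np.random.rand(3, 4, 5)
g = np.ones((3, 5))                                # gradient of x.sum(axis=1)
gx = np.broadcast_to(g.reshape(3, 1, 5), x.shape)
# every element of x enters the sum with coefficient 1
print(gx.shape)                                    # (3, 4, 5)
print(np.allclose(gx, 1.0))                        # True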

Code:

class Summation(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.sum(a, self.axes)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        input, = node.inputs
        grad_shape = list(out_grad.shape)
        # normalize the reduced axes: non-negative and sorted in ascending order
        if self.axes is None:
            axes = tuple(range(len(input.shape)))
        else:
            axes = self.axes
        n = len(input.shape)
        new_axes = []
        for x in axes:
            if x >= 0:
                new_axes.append(x)
            else:
                new_axes.append(x + n)
        new_axes = sorted(new_axes)
        # re-insert the reduced dims (as size 1) so grad_shape has the same ndim as input.shape
        for axis in new_axes:
            grad_shape.insert(axis, 1)

        return broadcast_to(reshape(out_grad, grad_shape), input.shape)
        ### END YOUR SOLUTION

MatMul

Formula:

The basic case is $o = x y$ with $x \in \mathbb{R}^{a \times b}$ and $y \in \mathbb{R}^{b \times c}$; writing $\bar{o}$ for out_grad, the input gradients are $\bar{x} = \bar{o}\, y^{\top}$ and $\bar{y} = x^{\top} \bar{o}$.

In the more general case, $x$ and $y$ have different numbers of dimensions but their last two dimensions are still compatible for matrix multiplication, so the computation takes the form

$o_{i_1 \cdots i_k} = x_{i_1 \cdots i_k}\, y, \qquad x \in \mathbb{R}^{A_1 \times \cdots \times A_k \times a \times b},\ y \in \mathbb{R}^{b \times c},$

or

$o_{i_1 \cdots i_k} = x\, y_{i_1 \cdots i_k}, \qquad x \in \mathbb{R}^{a \times b},\ y \in \mathbb{R}^{B_1 \times \cdots \times B_k \times b \times c}.$

Because the forward pass broadcasts over these extra batch dimensions, the backward pass has to sum the gradients over them (a NumPy sanity check follows the code). The full code:

class MatMul(TensorOp):
    def compute(self, a, b):
        ### BEGIN YOUR SOLUTION
        return array_api.matmul(a, b)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        # (A, a, b), (B, b, c)
        lhs, rhs = node.inputs
        # out_grad: (C, a, c)
        # (C, a, b)
        lhs_grad = matmul(out_grad, transpose(rhs, axes=(-1, -2)))
        # (C, b, c)
        rhs_grad = matmul(transpose(lhs, axes=(-1, -2)), out_grad)
        # match shapes: sum over any batch dimensions that were broadcast in the forward pass
        n1 = len(out_grad.shape)
        n2 = len(lhs.shape)
        n3 = len(rhs.shape)
        
        if n1 > n2:
            lhs_grad = summation(lhs_grad, axes=tuple(range(n1 - n2)))
        if n1 > n3:
            rhs_grad = summation(rhs_grad, axes=tuple(range(n1 - n3)))

        return lhs_grad, rhs_grad
        ### END YOUR SOLUTION
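
A small NumPy sanity check of the batched case (a sketch; shapes chosen only for illustration). When $x$ carries an extra batch dimension and $y$ does not, every batch slice reuses the same $y$, so the gradient w.r.t. $y$ is the sum of the per-slice contributions $x_i^{\top} \bar{o}_i$, which is exactly what the summation over the leading axes computes.

import numpy as np

x = np.random.rand(2, 3, 4)                # batched left operand
y = np.random.rand(4, 5)                   # shared right operand
g = np.random.rand(2, 3, 5)                # upstream gradient of o = x @ y
# per-batch contributions, summed explicitly over the batch axis
grad_y = sum(x[i].T @ g[i] for i in range(2))
# vectorized form used in MatMul.gradient: batched matmul, then sum over axis 0
grad_y_vec = (np.swapaxes(x, -1, -2) @ g).sum(axis=0)
print(np.allclose(grad_y, grad_y_vec))     # True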

Negate

Formula: $o = -x$; $\frac{\partial o}{\partial x} = -1$.

Code:

class Negate(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return -a
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        return -out_grad
        ### END YOUR SOLUTION

Log

Formula: $o = \log x$; $\frac{\partial o}{\partial x} = \frac{1}{x}$.

Code:

class Log(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.log(a)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        input, = node.inputs
        return out_grad / input
        ### END YOUR SOLUTION

Exp

Formula: $o = e^{x}$; $\frac{\partial o}{\partial x} = e^{x}$.

Code:

class Exp(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.exp(a)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        input, = node.inputs
        return out_grad * exp(input)
        ### END YOUR SOLUTION

Question 3: Topological sort

node_list contains the output nodes of the computational graph, not the whole graph. To obtain a topological order, run a post-order DFS from the nodes in node_list, treating each node's inputs as its neighbors. Code:

def find_topo_sort(node_list: List[Value]) -> List[Value]:
    """Given a list of nodes, return a topological sort list of nodes ending in them.

    A simple algorithm is to do a post-order DFS traversal on the given nodes,
    going backwards based on input edges. Since a node is added to the ordering
    after all its predecessors are traversed due to post-order DFS, we get a topological
    sort.
    """
    ### BEGIN YOUR SOLUTION
    # 0: unvisited, 1: temporarily marked, 2: permanently marked
    visited = dict()
    # mark all output nodes as unvisited
    for node in node_list:
        visited[node] = 0

    topo_order = []

    for node in node_list:
        if visited[node] == 0:
            topo_sort_dfs(node, visited, topo_order)

    return topo_order
    ### END YOUR SOLUTION

def topo_sort_dfs(node, visited, topo_order):
    """Post-order DFS"""
    ### BEGIN YOUR SOLUTION
    if visited[node] == 2:
        return
    visited[node] = 1
    for node1 in node.inputs:
        if node1 not in visited:
            visited[node1] = 0
        if visited[node1] == 0:
            topo_sort_dfs(node1, visited, topo_order)
    visited[node] = 2
    topo_order.append(node)
    ### END YOUR SOLUTION
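
A standalone illustration of the same post-order DFS on a toy graph (a sketch; SimpleNode is a made-up stand-in that only mimics the .inputs attribute of Value, and find_topo_sort from above is assumed to be in scope):

class SimpleNode:
    """Toy stand-in for Value: a name plus a list of input nodes."""
    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)

    def __repr__(self):
        return self.name

# a diamond-shaped graph: d depends on b and c, which both depend on a
a = SimpleNode("a")
b = SimpleNode("b", [a])
c = SimpleNode("c", [a])
d = SimpleNode("d", [b, c])

print(find_topo_sort([d]))  # [a, b, c, d]: every node appears after its inputs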

Question 4: Implementing reverse mode differentiation

Implement it following the reverse-mode AD pseudocode from the lecture:

Code:

def compute_gradient_of_variables(output_tensor, out_grad):
    """Take gradient of output node with respect to each node in node_list.

    Store the computed result in the grad field of each Variable.
    """
    # a map from node to a list of gradient contributions from each output node
    node_to_output_grads_list: Dict[Tensor, List[Tensor]] = {}
    # Special note on initializing the gradient of output_tensor:
    # We are really taking a derivative of the scalar reduce_sum(output_node)
    # instead of the vector output_node. But this is the common case for loss function.
    node_to_output_grads_list[output_tensor] = [out_grad]

    # Traverse graph in reverse topological order given the output_node that we are taking gradient wrt.
    reverse_topo_order = list(reversed(find_topo_sort([output_tensor])))

    ### BEGIN YOUR SOLUTION
    for node in reverse_topo_order:
        v = sum_node_list(node_to_output_grads_list[node])
        # store the gradient on the node (sum of all partial adjoints)
        node.grad = v
        # make sure every input has a list to accumulate partial adjoints into
        for node1 in node.inputs:
            if node1 not in node_to_output_grads_list:
                node_to_output_grads_list[node1] = []
        # propagate this node's contribution to each input's partial adjoints
        if not node.is_leaf():
            gradient = node.op.gradient_as_tuple(v, node)
            for i, node1 in enumerate(node.inputs):
                node_to_output_grads_list[node1].append(gradient[i])
    ### END YOUR SOLUTION
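
A quick end-to-end check of the result (a sketch; it only uses the ndl.Tensor API already relied on in Questions 5 and 6: backward(), .grad, .numpy()). For $z = x y + x$ we expect $\partial z / \partial x = y + 1$ and $\partial z / \partial y = x$; the gradient of $x$ accumulates contributions from both paths, which is exactly what node_to_output_grads_list handles.

import numpy as np
import needle as ndl

x = ndl.Tensor(np.array([2.0], dtype=np.float32))
y = ndl.Tensor(np.array([3.0], dtype=np.float32))
z = x * y + x          # two paths reach x: through the product and directly
z.backward()

print(x.grad.numpy())  # [4.] = y + 1
print(y.grad.numpy())  # [2.] = x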

Question 5: Softmax loss

Use the following formula:

$\ell_{\mathrm{softmax}}(Z, y) = \frac{1}{B} \sum_{i=1}^{B} \Big( \log \sum_{j} \exp Z_{ij} - Z_{i, y_i} \Big), \qquad B = \text{batch size}.$

For reference, the code from the previous homework:

def softmax_loss(Z, y):
    """ Return softmax loss.  Note that for the purposes of this assignment,
    you don't need to worry about "nicely" scaling the numerical properties
    of the log-sum-exp computation, but can just compute this directly.

    Args:
        Z (np.ndarray[np.float32]): 2D numpy array of shape
            (batch_size, num_classes), containing the logit predictions for
            each class.
        y (np.ndarray[np.int8]): 1D numpy array of shape (batch_size, )
            containing the true label of each example.

    Returns:
        Average softmax loss over the sample.
    """
    ### BEGIN YOUR CODE
    exp_sum_z = np.sum(np.exp(Z), axis=-1)
    b = Z.shape[0]
    z_y = Z[np.arange(b), y]
    loss = np.mean(np.log(exp_sum_z) - z_y)

    return loss
    ### END YOUR CODE

Then it only needs to be rewritten with needle ops; the key fact is that $z^{\top} y_{onehot}= z_{y_{label}}$, i.e. the dot product with the one-hot label picks out the logit of the true class:

def softmax_loss(Z, y_one_hot):
    """ Return softmax loss.  Note that for the purposes of this assignment,
    you don't need to worry about "nicely" scaling the numerical properties
    of the log-sum-exp computation, but can just compute this directly.

    Args:
        Z (ndl.Tensor[np.float32]): 2D Tensor of shape
            (batch_size, num_classes), containing the logit predictions for
            each class.
        y (ndl.Tensor[np.int8]): 2D Tensor of shape (batch_size, num_classes)
            containing a 1 at the index of the true label of each example and
            zeros elsewhere.

    Returns:
        Average softmax loss over the sample. (ndl.Tensor[np.float32])
    """
    ### BEGIN YOUR SOLUTION
    # n, d -> n
    exp_sum_z = ndl.summation(ndl.exp(Z), axes=(-1,))
    b = Z.shape[0]
    z_y = ndl.summation(ndl.multiply(Z, y_one_hot), axes=(-1,))
    # print(exp_sum_z.shape, z_y.shape)
    loss = ndl.summation(ndl.log(exp_sum_z) - z_y) / b
    # print(loss.shape)
    return loss
    ### END YOUR SOLUTION
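
A quick numerical cross-check of the two versions (a sketch; it assumes both functions are in scope, and softmax_loss_np is just a local rename of the NumPy variant above to avoid a name clash):

import numpy as np
import needle as ndl

def softmax_loss_np(Z, y):
    # the direct NumPy version from the previous homework
    exp_sum_z = np.sum(np.exp(Z), axis=-1)
    z_y = Z[np.arange(Z.shape[0]), y]
    return np.mean(np.log(exp_sum_z) - z_y)

Z = np.random.randn(4, 3).astype(np.float32)
y = np.array([0, 2, 1, 2])
y_one_hot = np.zeros((4, 3), dtype=np.float32)
y_one_hot[np.arange(4), y] = 1

print(softmax_loss_np(Z, y))
print(softmax_loss(ndl.Tensor(Z), ndl.Tensor(y_one_hot)).numpy())  # should match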

Question 6: SGD for a two-layer neural network

First, the ReLU operator:

class ReLU(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.maximum(a, 0)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        input, = node.inputs
        # input_relu = input.realize_cached_data()
        input_relu = relu(input).numpy()
        # the 0/1 mask is a constant w.r.t. the graph, so building it from raw data is fine
        return out_grad * Tensor(input_relu > 0)
        ### END YOUR SOLUTION

Then the training code:

def nn_epoch(X, y, W1, W2, lr = 0.1, batch=100):
    """ Run a single epoch of SGD for a two-layer neural network defined by the
    weights W1 and W2 (with no bias terms):
        logits = ReLU(X * W1) * W2
    The function should use the step size lr, and the specified batch size (and
    again, without randomizing the order of X).

    Args:
        X (np.ndarray[np.float32]): 2D input array of size
            (num_examples x input_dim).
        y (np.ndarray[np.uint8]): 1D class label array of size (num_examples,)
        W1 (ndl.Tensor[np.float32]): 2D array of first layer weights, of shape
            (input_dim, hidden_dim)
        W2 (ndl.Tensor[np.float32]): 2D array of second layer weights, of shape
            (hidden_dim, num_classes)
        lr (float): step size (learning rate) for SGD
        batch (int): size of SGD mini-batch

    Returns:
        Tuple: (W1, W2)
            W1: ndl.Tensor[np.float32]
            W2: ndl.Tensor[np.float32]
    """

    ### BEGIN YOUR SOLUTION
    n = X.shape[0]
    m = W2.shape[1]
    y_one_hot = np.zeros((n, m))
    y_one_hot[np.arange(n), y] = 1
    # iterate over the mini-batches in order (the final batch may be smaller)
    step = n // batch
    for i in range(step + 1):
        start = i * batch
        end = min(start + batch, n)
        if start == end:
            break
        x1 = ndl.Tensor(X[start: end])
        y1 = ndl.Tensor(y_one_hot[start: end])
        # forward: logits = ReLU(X W1) W2
        Z = ndl.matmul(ndl.relu(ndl.matmul(x1, W1)), W2)
        # loss
        loss = softmax_loss(Z, y1)
        # backward
        loss.backward()
        # SGD step; rebuild the Tensors from raw data so the weights are detached from the old graph
        W1 = ndl.Tensor(W1.numpy() - lr * W1.grad.numpy())
        W2 = ndl.Tensor(W2.numpy() - lr * W2.grad.numpy())

    return W1, W2
    ### END YOUR SOLUTION
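
A minimal usage sketch with synthetic data (the shapes, initialization, and hyperparameters here are made up for illustration; it assumes nn_epoch and softmax_loss from this post are in scope and that the needle package is importable as ndl):

import numpy as np
import needle as ndl

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20)).astype(np.float32)   # 500 examples, 20 features
y = rng.integers(0, 5, size=500).astype(np.uint8)       # 5 classes
W1 = ndl.Tensor((rng.standard_normal((20, 32)) / np.sqrt(32)).astype(np.float32))
W2 = ndl.Tensor((rng.standard_normal((32, 5)) / np.sqrt(5)).astype(np.float32))

for epoch in range(3):
    W1, W2 = nn_epoch(X, y, W1, W2, lr=0.1, batch=100)
    logits = ndl.matmul(ndl.relu(ndl.matmul(ndl.Tensor(X), W1)), W2)
    acc = (logits.numpy().argmax(axis=1) == y).mean()
    print(f"epoch {epoch}: train acc {acc:.3f}")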