ECE408 Lecture 4 CUDA Memory Model

这次回顾ECE408 Lecture 4，这次介绍了CUDA内存模型。

课程主页：

https://wiki.illinois.edu/wiki/display/ECE408

搬运视频：

https://www.youtube.com/playlist?list=PL6RdenZrxrw-UKfRL5smPfFFpeqwN3Dsz

本讲内容

了解CUDA线程可访问内存的基本特征；
准备MP-2
- 基本矩阵乘法；
学习评估全局内存访问的性能影响；

内容回顾

控制（分支）divergence

分支的主要性能问题是divergence
- 单个warp中的线程采用不同的路径；
- 不同的执行路径在当前的GPU中被序列化；
常见情况：当分支条件是线程ID的函数时出现divergence
- if (threadIdx.x % 2) { }
  - 这为warp中的线程创建了两个不同的控制路径；
  - 有divergence（50%的线程什么都不做）；
- if ((threadIdx.x / WARP_SIZE) % 2) { }
  - 还创建了两个不同的控制路径；
  - 但分支粒度是warp 大小的整数倍；
  - 任何给定warp中的所有线程都遵循相同的路径；
  - 无divergence；

程序员视角的CUDA内存的视图

每个线程可以：

读/写每线程寄存器，registers（~1 个周期）；
读/写每块共享内存，shared memory（~5 个周期）；
读/写每个网格全局内存，global memory（~500 个周期）；
读每网个格常量内存，constant memory（在使用缓存时大约5个周期）；

CUDA变量类型限定符

Variable declaration	Memory	Scope	Lifetime
`int LocalVar;`	register	thread	thread
`__device__ __shared__ int SharedVar;`	shared	block	block
`_device__ int GlobalVar;`	global	app	application
`__device__ __constant__ int ConstantVar;`	constant	app	application

说明：

__device__
- 可使用__shared__或__constant__；
- 在函数内部不允许；
没有限定符的自动变量
- 在用于原始类型和结构的register中；
- 在用于每个线程数组的global memory中；

下一个应用：矩阵乘法

给定两个方阵$M$和$N$，维度$Width\times Width$；我们计算$P=MN$；即：

${P}_{\mathrm{ij}}=\sum_{k=1}^ M_{i k} N_{k j}$

示意图：

CPU版本

// Matrix multiplication on the (CPU) host in single precision
void MatrixMul(float *M, float *N, float *P, int Width)
{
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
        	P[i * Width + j] = sum;
    	}
    }
}

Tiling

让每个2D线程块计算结果矩阵大小为$(\text{BLOCK_WIDTH})^2$的子矩阵；
- 每个块有$(\text{BLOCK_WIDTH})^2$个线程；
生成$(\text{WIDTH/BLOCK_WIDTH})^2$个block的二维grid；
此概念称为tiling，每个block代表一个tile；
思路是将矩阵用分块矩阵表示，每个分块矩阵作为一个block；

Tiling: 例子

实现

根据分块矩阵乘法的定义，不难得到实现过程如下：

GPU版本

Host代码：

// Setup the execution configuration
// BLOCK_WIDTH is a #define constant
dim3 dimGrid(ceil((1.0*Width)/BLOCK_WIDTH),
			ceil((1.0*Width)/BLOCK_WIDTH), 1);

dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Device代码：

__global__
void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int Width)
{
    // Calculate the column index of d_P and d_N
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Calculate the row index of d_P and d_M
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of d_P
        for (int k = 0; k < Width; ++k)
        	Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
        d_P[Row*Width+Col] = Pvalue;
    }
}