课程主页：https://www.cs.cmu.edu/afs/cs/academic/class/15418-s18/www/index.html

课程视频：https://www.bilibili.com/video/BV11A411c7pF

CMU 15-418的课程名为Parallel Computer Architecture and Programming，主要介绍并行计算相关内容。这里回顾Assignment 1，主要是熟悉多核的基本操作，利用ISPC编写一些并行程序。

参考资料：

Assignment 1 Exploring Multi-Core and SIMD Parallelism

准备工作

下载代码：

git clone https://github.com/cmu15418/asst1-s18.git

Problem 1: Parallel Fractal Generation Using Pthreads (15 points)

1,2,3

问题1,2整合到一起完成。

首先编译代码并运行baseline：

╰─± ./mandelbrot --view 1 --thread 1
[mandelbrot serial]:            [201.027] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0:               [208.111] ms
Hello world from thread 0
Thread 0:               [206.634] ms
Hello world from thread 0
Thread 0:               [204.978] ms
[mandelbrot thread]:            [205.970] ms
Wrote image file mandelbrot-thread.ppm
++++                            (0.98x speedup from 1 threads)

其次实现多线程版本，首先在WorkerArgs中添加字段：

// Struct for passing arguments to thread routine
typedef struct {
    float x0, x1;
    float y0, y1;
    unsigned int width;
    unsigned int height;
    // add begin
    int startRow;
    int totalRows;
    int startCol;
    int totalCols;
    // add end
    int maxIterations;
    int* output;
    int threadId;
    int numThreads;
} WorkerArgs;

其次添加多线程版本，重点是添加起始行（列）以及每行（列）对应需要完成的数量：

void mandelbrotThread(
    int numThreads,
    float x0, float y0, float x1, float y1,
    int width, int height,
    int maxIterations, int output[])
{
    const static int MAX_THREADS = 32;

    if (numThreads > MAX_THREADS)
    {
        fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
        exit(1);
    }

    pthread_t workers[MAX_THREADS];
    WorkerArgs args[MAX_THREADS];

    // 把行拆成numThreads份
    int height_per_thread = height / numThreads;
    // 已经计算的height数量
    int cum_height = 0;
    // 把列拆成numThreads份
    int width_per_thread = width / numThreads;
    // 已经计算的width数量
    int cum_width = 0;

    for (int i=0; i<numThreads; i++) {
        // TODO: Set thread arguments here.
        args[i].threadId = i;
        // add;
        args[i].x0 = x0;
        args[i].y0 = y0;
        args[i].x1 = x1;
        // 最后一个thread特殊处理, 处理剩余全部的height和width
        if (i == numThreads - 1) {
            height_per_thread = height - cum_height;
            width_per_thread = width - cum_width;
        }
        args[i].y1 = y1;
        args[i].width = width;
        args[i].height = height;
        args[i].maxIterations = maxIterations;
        // 起始行
        args[i].startRow = cum_height;
        args[i].totalRows = height_per_thread;
        args[i].startCol = cum_width;
        args[i].totalCols = width_per_thread;

        args[i].output = output;

        // 更新
        cum_width += width_per_thread;
        cum_height += height_per_thread;
    }

    // Fire up the worker threads.  Note that numThreads-1 pthreads
    // are created and the main app thread is used as a worker as
    // well.

    for (int i=1; i<numThreads; i++)
        pthread_create(&workers[i], NULL, workerThreadStart, &args[i]);

    workerThreadStart(&args[0]);

    // wait for worker threads to complete
    for (int i=1; i<numThreads; i++)
        pthread_join(workers[i], NULL);
}

实验结果：

╰─± ./mandelbrot --view 1 --thread 2
[mandelbrot serial]:            [200.329] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0:               [102.766] ms
Hello world from thread 1
Thread 1:               [108.591] ms
Hello world from thread 0
Thread 0:               [102.456] ms
Hello world from thread 1
Thread 1:               [104.368] ms
Hello world from thread 0
Thread 0:               [102.174] ms
Hello world from thread 1
Thread 1:               [104.064] ms
[mandelbrot thread]:            [104.270] ms
Wrote image file mandelbrot-thread.ppm
++++                            (1.92x speedup from 2 threads)

╰─± ./mandelbrot --view 1 --thread 4 
[mandelbrot serial]:            [200.309] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0:               [20.384] ms
Hello world from thread 3
Thread 3:               [21.224] ms
Hello world from thread 1
Thread 1:               [86.168] ms
Hello world from thread 2
Thread 2:               [86.405] ms
Hello world from thread 3
Thread 3:               [20.456] ms
Hello world from thread 0
Thread 0:               [22.139] ms
Hello world from thread 2
Thread 2:               [88.118] ms
Hello world from thread 1
Thread 1:               [88.628] ms
Hello world from thread 0
Thread 0:               [20.436] ms
Hello world from thread 3
Thread 3:               [23.111] ms
Hello world from thread 2
Thread 2:               [86.491] ms
Hello world from thread 1
Thread 1:               [86.188] ms
[mandelbrot thread]:            [86.637] ms
Wrote image file mandelbrot-thread.ppm
++++                            (2.31x speedup from 4 threads)

╰─± ./mandelbrot --view 1 --thread 8
[mandelbrot serial]:            [201.094] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0:               [3.553] ms
Hello world from thread 7
Thread 7:               [4.552] ms
Hello world from thread 1
Thread 1:               [17.341] ms
Hello world from thread 6
Thread 6:               [18.726] ms
Hello world from thread 2
Thread 2:               [35.681] ms
Hello world from thread 5
Thread 5:               [37.703] ms
Hello world from thread 3
Thread 3:               [52.302] ms
Hello world from thread 4
Thread 4:               [56.164] ms
Hello world from thread 0
Thread 0:               [3.739] ms
Hello world from thread 7
Thread 7:               [4.516] ms
Hello world from thread 1
Thread 1:               [18.916] ms
Hello world from thread 6
Thread 6:               [18.928] ms
Hello world from thread 2
Thread 2:               [34.906] ms
Hello world from thread 5
Thread 5:               [36.526] ms
Hello world from thread 3
Thread 3:               [52.649] ms
Hello world from thread 4
Thread 4:               [58.855] ms
Hello world from thread 0
Thread 0:               [3.846] ms
Hello world from thread 7
Thread 7:               [4.191] ms
Hello world from thread 1
Thread 1:               [18.252] ms
Hello world from thread 6
Thread 6:               [18.705] ms
Hello world from thread 2
Thread 2:               [37.429] ms
Hello world from thread 5
Thread 5:               [38.546] ms
Hello world from thread 3
Thread 3:               [53.211] ms
Hello world from thread 4
Thread 4:               [54.999] ms
[mandelbrot thread]:            [55.322] ms
Wrote image file mandelbrot-thread.ppm
++++                            (3.63x speedup from 8 threads)

╰─± ./mandelbrot --view 1 --thread 16
[mandelbrot serial]:            [200.019] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0:               [0.820] ms
Hello world from thread 15
Thread 15:              [1.372] ms
Hello world from thread 1
Thread 1:               [3.454] ms
Hello world from thread 2
Thread 2:               [4.942] ms
Hello world from thread 14
Thread 14:              [4.093] ms
Hello world from thread 13
Thread 13:              [6.803] ms
Hello world from thread 3
Thread 3:               [15.707] ms
Hello world from thread 11
Thread 11:              [17.481] ms
Hello world from thread 4
Thread 4:               [18.651] ms
Hello world from thread 12
Thread 12:              [21.464] ms
Hello world from thread 5
Thread 5:               [22.844] ms
Hello world from thread 10
Thread 10:              [25.601] ms
Hello world from thread 9
Thread 9:               [28.617] ms
Hello world from thread 6
Thread 6:               [32.949] ms
Hello world from thread 8
Thread 8:               [30.751] ms
Hello world from thread 7
Thread 7:               [36.431] ms
Hello world from thread 0
Thread 0:               [0.796] ms
Hello world from thread 1
Thread 1:               [3.054] ms
Hello world from thread 2
Thread 2:               [4.158] ms
Hello world from thread 15
Thread 15:              [1.274] ms
Hello world from thread 14
Thread 14:              [3.666] ms
Hello world from thread 13
Thread 13:              [6.966] ms
Hello world from thread 3
Thread 3:               [11.873] ms
Hello world from thread 12
Thread 12:              [15.463] ms
Hello world from thread 5
Thread 5:               [20.226] ms
Hello world from thread 11
Thread 11:              [19.453] ms
Hello world from thread 4
Thread 4:               [22.439] ms
Hello world from thread 6
Thread 6:               [26.175] ms
Hello world from thread 9
Thread 9:               [27.223] ms
Hello world from thread 10
Thread 10:              [24.995] ms
Hello world from thread 7
Thread 7:               [32.344] ms
Hello world from thread 8
Thread 8:               [33.665] ms
Hello world from thread 0
Thread 0:               [0.708] ms
Hello world from thread 1
Thread 1:               [3.183] ms
Hello world from thread 2
Thread 2:               [4.689] ms
Hello world from thread 15
Thread 15:              [1.817] ms
Hello world from thread 13
Thread 13:              [8.091] ms
Hello world from thread 14
Thread 14:              [8.535] ms
Hello world from thread 3
Thread 3:               [15.579] ms
Hello world from thread 12
Thread 12:              [18.564] ms
Hello world from thread 5
Thread 5:               [22.614] ms
Hello world from thread 4
Thread 4:               [24.750] ms
Hello world from thread 11
Thread 11:              [24.949] ms
Hello world from thread 6
Thread 6:               [27.631] ms
Hello world from thread 9
Thread 9:               [30.044] ms
Hello world from thread 10
Thread 10:              [32.918] ms
Hello world from thread 8
Thread 8:               [33.039] ms
Hello world from thread 7
Thread 7:               [35.752] ms
[mandelbrot thread]:            [34.221] ms
Wrote image file mandelbrot-thread.ppm
++++                            (5.84x speedup from 16 threads)

结果：

线程数	理论加速比	实际加速比
2	2	1.92
4	4	2.31
8	8	3.63
16	16	5.84

可以看到实际加速比和理论不符，从输出中可以看到，线程负载不均匀，应该是实际加速比和理论不符的原因。

4

尝试了如下两种方法，但是并不奏效：

按列划分：

void mandelbrotSerial_v2(
    float x0, float y0, float x1, float y1,
    int width, int height,
    int startCol, int totalCols,
    int maxIterations,
    int output[])
{
    // 前闭后开
    float dx = (x1 - x0) / width;
    float dy = (y1 - y0) / height;

    int endCol = startCol + totalCols;

    for (int i = startCol; i < endCol; i++) {
        for (int j = 0; j < height; j++) {
            float x = x0 + i * dx;
            float y = y0 + j * dy;

            int index = (j * width + i);
            output[index] = mandel(x, y, maxIterations);
        }
    }
}

void* workerThreadStart_v2(void* threadArgs) {
    double startTime = CycleTimer::currentSeconds();
    WorkerArgs* args = static_cast<WorkerArgs*>(threadArgs);

    // TODO: Implement worker thread here.
    mandelbrotSerial_v2(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height, args->startCol, args->totalCols,
                     args->maxIterations, args->output);

    printf("Hello world from thread %d\n", args->threadId);
    double endTime = CycleTimer::currentSeconds();

    printf("Thread %d:\t\t[%.3f] ms\n", args->threadId, (endTime - startTime) * 1000);

    return NULL;
}

void mandelbrotThread(
    int numThreads,
    float x0, float y0, float x1, float y1,
    int width, int height,
    int maxIterations, int output[])
{
    const static int MAX_THREADS = 32;

    if (numThreads > MAX_THREADS)
    {
        fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
        exit(1);
    }

    pthread_t workers[MAX_THREADS];
    WorkerArgs args[MAX_THREADS];

    // 把行拆成numThreads份
    int height_per_thread = height / numThreads;
    // 已经计算的height数量
    int cum_height = 0;
    // 把列拆成numThreads份
    int width_per_thread = width / numThreads;
    // 已经计算的width数量
    int cum_width = 0;

    for (int i=0; i<numThreads; i++) {
        // TODO: Set thread arguments here.
        args[i].threadId = i;
        // add;
        args[i].x0 = x0;
        args[i].y0 = y0;
        args[i].x1 = x1;
        // 最后一个thread特殊处理, 处理剩余全部的height和width
        if (i == numThreads - 1) {
            height_per_thread = height - cum_height;
            width_per_thread = width - cum_width;
        }
        args[i].y1 = y1;
        args[i].width = width;
        args[i].height = height;
        args[i].maxIterations = maxIterations;
        // 起始行
        args[i].startRow = cum_height;
        args[i].totalRows = height_per_thread;
        args[i].startCol = cum_width;
        args[i].totalCols = width_per_thread;

        args[i].output = output;

        // 更新
        cum_width += width_per_thread;
        cum_height += height_per_thread;
    }

    // Fire up the worker threads.  Note that numThreads-1 pthreads
    // are created and the main app thread is used as a worker as
    // well.

    for (int i=1; i<numThreads; i++)
        pthread_create(&workers[i], NULL, workerThreadStart_v2, &args[i]);

    workerThreadStart_v2(&args[0]);

    // wait for worker threads to complete
    for (int i=1; i<numThreads; i++)
        pthread_join(workers[i], NULL);

}

测试：

╰─± ./mandelbrot --view 1 --thread 16
[mandelbrot serial]:            [200.328] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 1
Thread 1:               [0.574] ms
Hello world from thread 2
Thread 2:               [0.882] ms
Hello world from thread 0
Thread 0:               [0.237] ms
Hello world from thread 15
Thread 15:              [0.434] ms
Hello world from thread 3
Thread 3:               [1.502] ms
Hello world from thread 14
Thread 14:              [0.573] ms
Hello world from thread 13
Thread 13:              [0.761] ms
Hello world from thread 12
Thread 12:              [5.800] ms
Hello world from thread 4
Thread 4:               [8.490] ms
Hello world from thread 6
Thread 6:               [12.065] ms
Hello world from thread 5
Thread 5:               [16.190] ms
Hello world from thread 7
Thread 7:               [28.204] ms
Hello world from thread 11
Thread 11:              [34.645] ms
Hello world from thread 8
Thread 8:               [37.265] ms
Hello world from thread 10
Thread 10:              [39.751] ms
Hello world from thread 9
Thread 9:               [47.683] ms
Hello world from thread 1
Thread 1:               [0.685] ms
Hello world from thread 0
Thread 0:               [0.294] ms
Hello world from thread 2
Thread 2:               [0.788] ms
Hello world from thread 15
Thread 15:              [0.465] ms
Hello world from thread 14
Thread 14:              [0.442] ms
Hello world from thread 3
Thread 3:               [1.634] ms
Hello world from thread 13
Thread 13:              [0.517] ms
Hello world from thread 12
Thread 12:              [6.032] ms
Hello world from thread 4
Thread 4:               [8.249] ms
Hello world from thread 6
Thread 6:               [13.591] ms
Hello world from thread 5
Thread 5:               [16.731] ms
Hello world from thread 7
Thread 7:               [30.074] ms
Hello world from thread 11
Thread 11:              [37.594] ms
Hello world from thread 8
Thread 8:               [38.285] ms
Hello world from thread 10
Thread 10:              [46.705] ms
Hello world from thread 9
Thread 9:               [47.721] ms
Hello world from thread 1
Thread 1:               [0.687] ms
Hello world from thread 0
Thread 0:               [0.231] ms
Hello world from thread 2
Thread 2:               [0.828] ms
Hello world from thread 13
Thread 13:              [0.481] ms
Hello world from thread 15
Thread 15:              [0.541] ms
Hello world from thread 14
Thread 14:              [0.458] ms
Hello world from thread 3
Thread 3:               [1.757] ms
Hello world from thread 12
Thread 12:              [6.099] ms
Hello world from thread 4
Thread 4:               [8.668] ms
Hello world from thread 6
Thread 6:               [11.606] ms
Hello world from thread 5
Thread 5:               [19.509] ms
Hello world from thread 7
Thread 7:               [31.520] ms
Hello world from thread 11
Thread 11:              [35.165] ms
Hello world from thread 8
Thread 8:               [37.545] ms
Hello world from thread 10
Thread 10:              [43.472] ms
Hello world from thread 9
Thread 9:               [48.508] ms
[mandelbrot thread]:            [48.391] ms
Wrote image file mandelbrot-thread.ppm
++++                            (4.14x speedup from 16 threads)

╰─± ./mandelbrot --view 2 --thread 16
[mandelbrot serial]:            [121.285] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 15
Thread 15:              [3.000] ms
Hello world from thread 10
Thread 10:              [5.815] ms
Hello world from thread 11
Thread 11:              [5.630] ms
Hello world from thread 0
Thread 0:               [7.868] ms
Hello world from thread 9
Thread 9:               [5.731] ms
Hello world from thread 8
Thread 8:               [9.158] ms
Hello world from thread 2
Thread 2:               [9.700] ms
Hello world from thread 7
Thread 7:               [10.800] ms
Hello world from thread 1
Thread 1:               [11.537] ms
Hello world from thread 14
Thread 14:              [5.657] ms
Hello world from thread 12
Thread 12:              [4.674] ms
Hello world from thread 13
Thread 13:              [4.436] ms
Hello world from thread 3
Thread 3:               [14.780] ms
Hello world from thread 6
Thread 6:               [15.527] ms
Hello world from thread 4
Thread 4:               [16.482] ms
Hello world from thread 5
Thread 5:               [18.798] ms
Hello world from thread 11
Thread 11:              [5.809] ms
Hello world from thread 10
Thread 10:              [5.872] ms
Hello world from thread 8
Thread 8:               [7.771] ms
Hello world from thread 0
Thread 0:               [7.641] ms
Hello world from thread 2
Thread 2:               [9.835] ms
Hello world from thread 15
Thread 15:              [4.964] ms
Hello world from thread 9
Thread 9:               [6.659] ms
Hello world from thread 13
Thread 13:              [3.724] ms
Hello world from thread 14
Thread 14:              [6.282] ms
Hello world from thread 7
Thread 7:               [12.417] ms
Hello world from thread 1
Thread 1:               [12.729] ms
Hello world from thread 12
Thread 12:              [4.940] ms
Hello world from thread 6
Thread 6:               [11.874] ms
Hello world from thread 3
Thread 3:               [15.604] ms
Hello world from thread 4
Thread 4:               [16.420] ms
Hello world from thread 5
Thread 5:               [18.860] ms
Hello world from thread 11
Thread 11:              [5.304] ms
Hello world from thread 10
Thread 10:              [6.258] ms
Hello world from thread 9
Thread 9:               [8.182] ms
Hello world from thread 14
Thread 14:              [3.032] ms
Hello world from thread 15
Thread 15:              [3.047] ms
Hello world from thread 0
Thread 0:               [8.809] ms
Hello world from thread 8
Thread 8:               [8.716] ms
Hello world from thread 7
Thread 7:               [11.343] ms
Hello world from thread 2
Thread 2:               [12.098] ms
Hello world from thread 1
Thread 1:               [12.429] ms
Hello world from thread 6
Thread 6:               [12.611] ms
Hello world from thread 13
Thread 13:              [3.898] ms
Hello world from thread 12
Thread 12:              [4.488] ms
Hello world from thread 4
Thread 4:               [14.674] ms
Hello world from thread 3
Thread 3:               [15.797] ms
Hello world from thread 5
Thread 5:               [20.155] ms
[mandelbrot thread]:            [19.312] ms
Wrote image file mandelbrot-thread.ppm
++++                            (6.28x speedup from 16 threads)

加速比为4.14和6.28。

另一种方式是给每个args[i].output赋予单独的数组，然后利用循环展开：

void mandelbrotSerial_v3(
    float x0, float y0, float x1, float y1,
    int width, int height,
    int startRow, int totalRows,
    int maxIterations,
    int output[])
{
    // 前闭后开
    float dx = (x1 - x0) / width;
    float dy = (y1 - y0) / height;
    int endRow = startRow + totalRows;

    for (int j = startRow; j < endRow; j++) {
        int i = 0;
        int index = (j - startRow) * width;
        float x;
        float y = y0 + j * dy;
        // 循环展开
        for (; i + 1 < width; i += 2, index += 2) {
            x = x0 + i * dx;
            output[index] = mandel(x, y, maxIterations);

            x = x0 + (i + 1) * dx;
            output[index + 1] = mandel(x, y, maxIterations);
        }
        // 处理剩余部分
        for (; i < width; ++i, ++index) {
            x = x0 + i * dx;
            output[index] = mandel(x, y, maxIterations);
        }
    }
}

void* workerThreadStart_v3(void* threadArgs) {
    double startTime = CycleTimer::currentSeconds();
    WorkerArgs* args = static_cast<WorkerArgs*>(threadArgs);

    // TODO: Implement worker thread here.
    mandelbrotSerial_v3(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height, args->startRow, args->totalRows,
                     args->maxIterations, args->output);

    printf("Hello world from thread %d\n", args->threadId);
    double endTime = CycleTimer::currentSeconds();

    printf("Thread %d:\t\t[%.3f] ms\n", args->threadId, (endTime - startTime) * 1000);

    return NULL;
}

void mandelbrotThread(
    int numThreads,
    float x0, float y0, float x1, float y1,
    int width, int height,
    int maxIterations, int output[])
{
    const static int MAX_THREADS = 32;

    if (numThreads > MAX_THREADS)
    {
        fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
        exit(1);
    }

    pthread_t workers[MAX_THREADS];
    WorkerArgs args[MAX_THREADS];

    // 把行拆成numThreads份
    int height_per_thread = height / numThreads;
    // 已经计算的height数量
    int cum_height = 0;
    // 把列拆成numThreads份
    int width_per_thread = width / numThreads;
    // 已经计算的width数量
    int cum_width = 0;

    for (int i=0; i<numThreads; i++) {
        // TODO: Set thread arguments here.
        args[i].threadId = i;
        // add;
        args[i].x0 = x0;
        args[i].y0 = y0;
        args[i].x1 = x1;
        // 最后一个thread特殊处理, 处理剩余全部的height和width
        if (i == numThreads - 1) {
            height_per_thread = height - cum_height;
            width_per_thread = width - cum_width;
        }
        args[i].y1 = y1;
        args[i].width = width;
        args[i].height = height;
        args[i].maxIterations = maxIterations;
        // 起始行
        args[i].startRow = cum_height;
        args[i].totalRows = height_per_thread;
        args[i].startCol = cum_width;
        args[i].totalCols = width_per_thread;

        // 每个thread一个单独的数组
        args[i].output = new int[args[i].totalRows * args[i].width];
        // 更新
        cum_width += width_per_thread;
        cum_height += height_per_thread;
    }

    // Fire up the worker threads.  Note that numThreads-1 pthreads
    // are created and the main app thread is used as a worker as
    // well.

    for (int i=1; i<numThreads; i++)
        pthread_create(&workers[i], NULL, workerThreadStart_v3, &args[i]);

    workerThreadStart_v3(&args[0]);

    // wait for worker threads to complete
    for (int i=1; i<numThreads; i++)
        pthread_join(workers[i], NULL);

    // 合并结果
    int start = 0;
    for (int i = 0; i < numThreads; i++) {
        int m = args[i].totalRows * args[i].width;
        for (int j = 0; j < m; j++) {
            output[start + j] = args[i].output[j];
        }
        start += m;
    }
}

实验结果为：

╰─± ./mandelbrot --view 1 --thread 16
[mandelbrot serial]:            [200.283] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0:               [0.831] ms
Hello world from thread 1
Thread 1:               [3.641] ms
Hello world from thread 2
Thread 2:               [4.788] ms
Hello world from thread 14
Thread 14:              [3.878] ms
Hello world from thread 15
Thread 15:              [1.449] ms
Hello world from thread 13
Thread 13:              [7.393] ms
Hello world from thread 3
Thread 3:               [13.183] ms
Hello world from thread 4
Thread 4:               [20.360] ms
Hello world from thread 12
Thread 12:              [15.916] ms
Hello world from thread 5
Thread 5:               [21.638] ms
Hello world from thread 11
Thread 11:              [22.251] ms
Hello world from thread 10
Thread 10:              [22.507] ms
Hello world from thread 6
Thread 6:               [27.589] ms
Hello world from thread 9
Thread 9:               [27.889] ms
Hello world from thread 8
Thread 8:               [30.121] ms
Hello world from thread 7
Thread 7:               [32.634] ms
Hello world from thread 0
Thread 0:               [0.808] ms
Hello world from thread 1
Thread 1:               [2.960] ms
Hello world from thread 15
Thread 15:              [1.483] ms
Hello world from thread 2
Thread 2:               [5.813] ms
Hello world from thread 14
Thread 14:              [3.791] ms
Hello world from thread 13
Thread 13:              [7.110] ms
Hello world from thread 3
Thread 3:               [16.928] ms
Hello world from thread 12
Thread 12:              [16.516] ms
Hello world from thread 4
Thread 4:               [17.820] ms
Hello world from thread 5
Thread 5:               [22.885] ms
Hello world from thread 11
Thread 11:              [25.085] ms
Hello world from thread 10
Thread 10:              [24.466] ms
Hello world from thread 6
Thread 6:               [27.473] ms
Hello world from thread 7
Thread 7:               [29.975] ms
Hello world from thread 8
Thread 8:               [30.919] ms
Hello world from thread 9
Thread 9:               [32.183] ms
Hello world from thread 0
Thread 0:               [1.001] ms
Hello world from thread 1
Thread 1:               [3.189] ms
Hello world from thread 2
Thread 2:               [5.523] ms
Hello world from thread 15
Thread 15:              [1.307] ms
Hello world from thread 14
Thread 14:              [3.797] ms
Hello world from thread 13
Thread 13:              [6.020] ms
Hello world from thread 3
Thread 3:               [16.198] ms
Hello world from thread 12
Thread 12:              [18.236] ms
Hello world from thread 4
Thread 4:               [19.965] ms
Hello world from thread 10
Thread 10:              [25.051] ms
Hello world from thread 11
Thread 11:              [25.945] ms
Hello world from thread 5
Thread 5:               [26.681] ms
Hello world from thread 6
Thread 6:               [28.467] ms
Hello world from thread 7
Thread 7:               [31.332] ms
Hello world from thread 9
Thread 9:               [32.669] ms
Hello world from thread 8
Thread 8:               [34.029] ms
[mandelbrot thread]:            [33.867] ms
Wrote image file mandelbrot-thread.ppm
++++                            (5.91x speedup from 16 threads)

╰─± ./mandelbrot --view 2 --thread 16
[mandelbrot serial]:            [125.893] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 6
Thread 6:               [6.453] ms
Hello world from thread 11
Thread 11:              [6.348] ms
Hello world from thread 5
Thread 5:               [6.949] ms
Hello world from thread 10
Thread 10:              [6.905] ms
Hello world from thread 9
Thread 9:               [6.997] ms
Hello world from thread 4
Thread 4:               [7.510] ms
Hello world from thread 7
Thread 7:               [8.023] ms
Hello world from thread 3
Thread 3:               [8.780] ms
Hello world from thread 2
Thread 2:               [12.412] ms
Hello world from thread 1
Thread 1:               [12.899] ms
Hello world from thread 8
Thread 8:               [12.710] ms
Hello world from thread 14
Thread 14:              [7.545] ms
Hello world from thread 15
Thread 15:              [8.204] ms
Hello world from thread 12
Thread 12:              [8.528] ms
Hello world from thread 13
Thread 13:              [10.376] ms
Hello world from thread 0
Thread 0:               [22.576] ms
Hello world from thread 5
Thread 5:               [6.598] ms
Hello world from thread 6
Thread 6:               [6.313] ms
Hello world from thread 4
Thread 4:               [7.217] ms
Hello world from thread 9
Thread 9:               [6.935] ms
Hello world from thread 11
Thread 11:              [7.983] ms
Hello world from thread 7
Thread 7:               [8.460] ms
Hello world from thread 8
Thread 8:               [9.118] ms
Hello world from thread 3
Thread 3:               [10.458] ms
Hello world from thread 2
Thread 2:               [11.159] ms
Hello world from thread 10
Thread 10:              [6.856] ms
Hello world from thread 14
Thread 14:              [7.672] ms
Hello world from thread 12
Thread 12:              [8.058] ms
Hello world from thread 15
Thread 15:              [8.282] ms
Hello world from thread 1
Thread 1:               [16.162] ms
Hello world from thread 13
Thread 13:              [8.554] ms
Hello world from thread 0
Thread 0:               [16.416] ms
Hello world from thread 6
Thread 6:               [6.467] ms
Hello world from thread 5
Thread 5:               [7.401] ms
Hello world from thread 4
Thread 4:               [7.772] ms
Hello world from thread 3
Thread 3:               [8.608] ms
Hello world from thread 15
Thread 15:              [6.725] ms
Hello world from thread 10
Thread 10:              [7.944] ms
Hello world from thread 9
Thread 9:               [8.051] ms
Hello world from thread 14
Thread 14:              [8.397] ms
Hello world from thread 1
Thread 1:               [14.227] ms
Hello world from thread 2
Thread 2:               [16.222] ms
Hello world from thread 11
Thread 11:              [8.859] ms
Hello world from thread 8
Thread 8:               [7.992] ms
Hello world from thread 13
Thread 13:              [7.937] ms
Hello world from thread 12
Thread 12:              [10.159] ms
Hello world from thread 0
Hello world from thread 7
Thread 7:               [16.727] ms
Thread 0:               [20.767] ms
[mandelbrot thread]:            [18.906] ms
Wrote image file mandelbrot-thread.ppm
++++                            (6.66x speedup from 16 threads)

性能略有提高，加速比为5.91和6.66。

Problem 2: Vectorizing Code Using SIMD Intrinsics (20 points)

1

模仿absVector的写法即可：

void clampedExpVector(float* values, int* exponents, float* output, int N) {
    // Implement your vectorized version of clampedExpSerial here
    //  ...
	int i = 0;
	// r表示第一次处理的数量
	int r = N % VECTOR_WIDTH;
	if (r == 0) {
		r = VECTOR_WIDTH;
	}
	__cmu418_vec_float x;
    __cmu418_vec_float result;
	__cmu418_vec_float constant;
	__cmu418_vec_int y;
	__cmu418_vec_float xpower;
	__cmu418_vec_int zero, one, yAndOne;
	__cmu418_mask mask, greaterThanZero, productMask, greater;

	for (i = 0; i < N; ) {
		if (i == 0) {
			// 前r个为1
			mask = _cmu418_init_ones(r);
			// mask = _cmu418_mask_not(mask);
		} else {
			// 全1
			mask = _cmu418_init_ones();
		}
		
		// 初始化为false
		greaterThanZero = _cmu418_init_ones(0);
		productMask = _cmu418_init_ones(0);
		greater = _cmu418_init_ones(0);
		_cmu418_vset_int(zero, 0, mask);
		_cmu418_vset_int(one, 1, mask);
		constant = _cmu418_vset_float(4.18f);
		
		// x = values[i];
		_cmu418_vload_float(x, values + i, mask);
		// result = 1.f
		result = _cmu418_vset_float(1.f);
		// y = exponents[i]
		_cmu418_vload_int(y, exponents + i, mask);
		// xpower = x
		_cmu418_vmove_float(xpower, x, mask);
		// 判断是否大于0
		_cmu418_vlt_int(greaterThanZero, zero, y, mask);
		// while (y > 0)
		while (_cmu418_cntbits(greaterThanZero)) {
			// y & 0x1
			 _cmu418_vbitand_int(yAndOne, y, one, mask);
			// 判断y & 0x1是否为1
			_cmu418_veq_int(productMask, yAndOne, one, mask);
			// result *= xpower;
			_cmu418_vmult_float(result, result, xpower, productMask);
			// xpower = xpower * xpower;
			_cmu418_vmult_float(xpower, xpower, xpower, mask);
			// y >>= 1
			_cmu418_vshiftright_int(y, y, one, mask);
			// 判断是否大于0
			_cmu418_vlt_int(greaterThanZero, zero, y, mask);
		}
		// if (result > 4.18f)
		_cmu418_vlt_float(greater, constant, result, mask);
		// result = 4.18f
		_cmu418_vmove_float(result, constant, greater);
		// output[i] = result
		_cmu418_vstore_float(output + i, result, mask);
		// 更新
		if (i == 0) {
			i += r;
		} else {
			i += VECTOR_WIDTH;
		}
	}
}

测试：

╰─± ./vrun -s 16   
CLAMPED EXPONENT (required) 
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 124
Vector Utilization:        92.237903%
Utilized Vector Lanes:     915
Total Vector Lanes:        992
************************ Result Verification *************************
Passed!!!

2

修改VECTOR_WIDTH的值然后运行：

╰─± ./vrun -s 10000
CLAMPED EXPONENT (required) 
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 301610
Vector Utilization:        91.087000%
Utilized Vector Lanes:     549455
Total Vector Lanes:        603220
************************ Result Verification *************************
Passed!!!

╰─± ./vrun -s 10000
CLAMPED EXPONENT (required) 
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 154195
Vector Utilization:        90.843250%
Utilized Vector Lanes:     560303
Total Vector Lanes:        616780
************************ Result Verification *************************
Passed!!!

╰─± ./vrun -s 10000
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 77485
Vector Utilization:        90.789024%
Utilized Vector Lanes:     562783
Total Vector Lanes:        619880
************************ Result Verification *************************
Passed!!!

╰─± ./vrun -s 10000
CLAMPED EXPONENT (required) 
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              16
Total Vector Instructions: 38750
Vector Utilization:        90.786935%
Utilized Vector Lanes:     562879
Total Vector Lanes:        620000
************************ Result Verification *************************
Passed!!!

3

span的含义暂时没有理解，目前的做法是将数组拆成$N/W$份，每份分别求和，存储到数组中，然后对数组按顺序调用_cmu418_hadd_float,_cmu418_interleave_float，调用数量为$\log_2W$，不难得到，第$i$次调用时，数组的第一个元素为

$\sum_{j=0}^{2^i -1} a_i$

所以$\log_2 W$次调用可以得到数组的求和，代码如下：

float arraySumVectorHelper(float* values) {
	int i = 1;
	__cmu418_mask mask = _cmu418_init_ones();
	__cmu418_vec_float result, tmp;
	_cmu418_vload_float(tmp, values, mask); 
	while (i < VECTOR_WIDTH) {
		_cmu418_hadd_float(result, tmp);
		_cmu418_interleave_float(tmp, result);
		i *= 2;
	}
	_cmu418_vstore_float(values, result, mask);

	return values[0];
}

float arraySumVector(float* values, int N) {
    // Implement your vectorized version here
    //  ...
	int l = N / VECTOR_WIDTH;
	float res[VECTOR_WIDTH];
	// 一次最多只能看max(N/ W, log2W)
	for (int i = 0; i < VECTOR_WIDTH; i++) {
		res[i] = arraySumSerial(values + i * l, l);
	}

	return arraySumVectorHelper(res);
}

测试：

ARRAY SUM (bonus) 
Passed!!!

Problem 3: Parallel Fractal Generation Using ISPC (15 points)

准备工作

首先是环境配置，在官网下载操作系统对应的版本，解压后将对应文件添加到环境变量中：

export PATH=~/ispc-v1.17.0-linux/bin:$PATH

官网地址：

https://github.com/ispc/ispc

然后阅读ispc的例子：

https://ispc.github.io/example.html

3.1

测试：

╰─± ./mandelbrot_ispc -v 1   
[mandelbrot serial]:            [215.696] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [46.147] ms
Wrote image file mandelbrot-ispc.ppm
                                (4.67x speedup from ISPC)

╰─± ./mandelbrot_ispc -v 2
[mandelbrot serial]:            [131.712] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [32.244] ms
Wrote image file mandelbrot-ispc.ppm
                                (4.08x speedup from ISPC)

╰─± ./mandelbrot_ispc -v 3
[mandelbrot serial]:            [308.290] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [68.822] ms
Wrote image file mandelbrot-ispc.ppm
                                (4.48x speedup from ISPC)

╰─± ./mandelbrot_ispc -v 4
[mandelbrot serial]:            [304.620] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [68.016] ms
Wrote image file mandelbrot-ispc.ppm
                                (4.48x speedup from ISPC)

╰─± ./mandelbrot_ispc -v 5
[mandelbrot serial]:            [60.267] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [13.930] ms
Wrote image file mandelbrot-ispc.ppm
                                (4.33x speedup from ISPC)

╰─± ./mandelbrot_ispc -v 6
[mandelbrot serial]:            [534.112] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [112.647] ms
Wrote image file mandelbrot-ispc.ppm
                                (4.74x speedup from ISPC)

因为使用了8wide AVX2 矢量指令，所以理论加速应该为8，但是和实际情况不符。注意到图片中白色部分计算量最多，黑色部分最少，所以可能和计算量不均匀有关。

3.2

1,2

修改代码，使之可以处理任意的threadCount：

// slightly different kernel to support tasking
task void mandelbrot_ispc_task(uniform float x0, uniform float y0, 
                               uniform float x1, uniform float y1,
                               uniform int width, uniform int height,
                               uniform int rowsPerTask,
                               uniform int maxIterations,
                               uniform int output[])
{

    // taskIndex is an ISPC built-in
    
    uniform int ystart = taskIndex * rowsPerTask;
    // 最后一个task特殊处理
    uniform int yend = ystart + rowsPerTask;
    if (yend + rowsPerTask > height) {
        yend = height;
    }
    
    uniform float dx = (x1 - x0) / width;
    uniform float dy = (y1 - y0) / height;
    
    foreach (j = ystart ... yend, i = 0 ... width) {
            float x = x0 + i * dx;
            float y = y0 + j * dy;
            
            int index = j * width + i;
            output[index] = mandel(x, y, maxIterations);
    }
}

export void mandelbrot_ispc_withtasks(uniform float x0, uniform float y0,
                                      uniform float x1, uniform float y1,
                                      uniform int width, uniform int height,
                                      uniform int maxIterations,
                                      uniform int output[])
{

    // uniform int threadCount = 2;
    uniform int threadCount = 1000;
    uniform int rowsPerTask = height / threadCount;
    // add
    rowsPerTask = max(1, rowsPerTask);

    // create threadCount tasks
    launch[threadCount] mandelbrot_ispc_task(x0, y0, x1, y1,
                                     width, height,
                                     rowsPerTask,
                                     maxIterations,
                                     output); 
}

测试：

╰─± ./mandelbrot_ispc -t -v 1
[mandelbrot serial]:            [215.468] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [46.115] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]:    [12.226] ms
Wrote image file mandelbrot-task-ispc.ppm
                                (4.67x speedup from ISPC)
                                (17.62x speedup from task ISPC)

3

问题描述：

Pthread 抽象（用于问题 1）和 ISPC 任务抽象有什么区别？ (create/join 和 (launch/sync) 机制之间在语义上有一些明显的差异，但这些差异的含义更微妙。这里有一个思想实验来指导你的答案：当你启动 10,000 个 ISPC 任务时会发生什么？什么当你启动 10,000 个 pthread 时会发生什么？

回答：

暂时没有理解题目，感觉可能的回答是pthread需要等待线程完成，而ISPC不需要。

Problem 4: Iterative Square Root (10 points)

1

╰─± ./sqrt     
[sqrt serial]:          [829.827] ms
[sqrt ispc]:            [188.124] ms
[sqrt task ispc]:       [20.934] ms
                                (4.41x speedup from ISPC)
                                (39.64x speedup from task ISPC)

2

在3附近的加速比最大：

// Generate data that gives high relative speedup
void initGood(float *values, int N) {
    for (int i=0; i<N; i++)
    {
        // Todo: Choose values
        values[i] = 2.998f - i / N;
    }
}

测试：

╰─± ./sqrt -d g
[sqrt serial]:          [2101.552] ms
[sqrt ispc]:            [321.759] ms
[sqrt task ispc]:       [35.236] ms
                                (6.53x speedup from ISPC)
                                (59.64x speedup from task ISPC)

3

在1附近的加速比最小：

// Generate data that gives low relative speedup
void initBad(float *values, int N) {
    for (int i=0; i<N; i++)
    {
        // Todo: Choose values
        values[i] = 1.0f;
    }
}

测试：

╰─± ./sqrt -d b 
[sqrt serial]:          [27.776] ms
[sqrt ispc]:            [17.068] ms
[sqrt task ispc]:       [12.275] ms
                                (1.63x speedup from ISPC)
                                (2.26x speedup from task ISPC)

Problem 5: BLAS saxpy (10 points)

1

╰─± ./saxpy    
[saxpy serial]:         [12.381] ms     [24.072] GB/s   [1.615] GFLOPS
[saxpy streaming]:      [13.943] ms     [21.374] GB/s   [1.434] GFLOPS
[saxpy ispc]:           [15.216] ms     [19.586] GB/s   [1.314] GFLOPS
[saxpy task ispc]:      [14.207] ms     [20.977] GB/s   [1.408] GFLOPS
                                (0.89x speedup from streaming)
                                (0.81x speedup from ISPC)
                                (0.87x speedup from task ISPC)

可以看到性能没有太多提升，推测是内存读取太多的原因，应该可以改进，但试了很多种并没有成功。

2

每步使用用了4个内存的原因应该是要存中间结果a * X[i]。

3

这部分需要学习如下资料：

代码：

void saxpyStreaming(int N,
                    float scale,
                    float X[],
                    float Y[],
                    float result[])
{
    // Replace this code with ones that make use of the streaming instructions
    // saxpySerial(N, scale, X, Y, result);
    __m128 x, y;
    // scale, scale, scale, scale
    __m128 scalar = _mm_set1_ps(scale);
    for (int i = 0; i < N; i += 4) {
        // X[0], ... , X[3]
        x = _mm_loadu_ps(X + i);
        y = _mm_loadu_ps(Y + i);
        // 乘以标量
        x = _mm_mul_ps(x, scalar);
        // 加法
        x = _mm_add_ps(x, y);
        // 存储
        _mm_store_ps(result + i, x);
    }
}

但是实验结果并不理想（见1）。