CMU 15-418 Assignment 1 Exploring Multi-Core and SIMD Parallelism
课程主页:https://www.cs.cmu.edu/afs/cs/academic/class/15418-s18/www/index.html
课程视频:https://www.bilibili.com/video/BV11A411c7pF
CMU 15-418的课程名为Parallel Computer Architecture and Programming,主要介绍并行计算相关内容。这里回顾Assignment 1,主要是熟悉多核的基本操作,利用ISPC编写一些并行程序。
参考资料:
- https://www.cs.cmu.edu/afs/cs/academic/class/15418-s18/www/assignment_writeups/asst1/asst1.pdf
- http://15418.courses.cs.cmu.edu/spring2016/article/3
- https://github.com/ispc/ispc
- https://ispc.github.io/example.html
- https://www.cnblogs.com/ggjucheng/archive/2013/01/14/2859613.html
- https://blog.csdn.net/gengshenghong/article/details/7010615
- https://stackoverflow.com/questions/9079580/sse-simd-multiply-vector-by-scalar
- https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/C++/CompilerInfrastructure/Intrinsics/src/intrin_dot_sample.cpp#L188:7
- https://www.aarongutierrez.com/15418/index.html
Assignment 1 Exploring Multi-Core and SIMD Parallelism
准备工作
下载代码:
git clone https://github.com/cmu15418/asst1-s18.git
Problem 1: Parallel Fractal Generation Using Pthreads (15 points)
1,2,3
问题1,2整合到一起完成。
首先编译代码并运行baseline:
╰─± ./mandelbrot --view 1 --thread 1
[mandelbrot serial]: [201.027] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0: [208.111] ms
Hello world from thread 0
Thread 0: [206.634] ms
Hello world from thread 0
Thread 0: [204.978] ms
[mandelbrot thread]: [205.970] ms
Wrote image file mandelbrot-thread.ppm
++++ (0.98x speedup from 1 threads)
其次实现多线程版本,首先在WorkerArgs中添加字段:
// Struct for passing arguments to thread routine
typedef struct {
float x0, x1;
float y0, y1;
unsigned int width;
unsigned int height;
// add begin
int startRow;
int totalRows;
int startCol;
int totalCols;
// add end
int maxIterations;
int* output;
int threadId;
int numThreads;
} WorkerArgs;
其次添加多线程版本,重点是添加起始行(列)以及每行(列)对应需要完成的数量:
void mandelbrotThread(
int numThreads,
float x0, float y0, float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
const static int MAX_THREADS = 32;
if (numThreads > MAX_THREADS)
{
fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
exit(1);
}
pthread_t workers[MAX_THREADS];
WorkerArgs args[MAX_THREADS];
// 把行拆成numThreads份
int height_per_thread = height / numThreads;
// 已经计算的height数量
int cum_height = 0;
// 把列拆成numThreads份
int width_per_thread = width / numThreads;
// 已经计算的width数量
int cum_width = 0;
for (int i=0; i<numThreads; i++) {
// TODO: Set thread arguments here.
args[i].threadId = i;
// add;
args[i].x0 = x0;
args[i].y0 = y0;
args[i].x1 = x1;
// 最后一个thread特殊处理, 处理剩余全部的height和width
if (i == numThreads - 1) {
height_per_thread = height - cum_height;
width_per_thread = width - cum_width;
}
args[i].y1 = y1;
args[i].width = width;
args[i].height = height;
args[i].maxIterations = maxIterations;
// 起始行
args[i].startRow = cum_height;
args[i].totalRows = height_per_thread;
args[i].startCol = cum_width;
args[i].totalCols = width_per_thread;
args[i].output = output;
// 更新
cum_width += width_per_thread;
cum_height += height_per_thread;
}
// Fire up the worker threads. Note that numThreads-1 pthreads
// are created and the main app thread is used as a worker as
// well.
for (int i=1; i<numThreads; i++)
pthread_create(&workers[i], NULL, workerThreadStart, &args[i]);
workerThreadStart(&args[0]);
// wait for worker threads to complete
for (int i=1; i<numThreads; i++)
pthread_join(workers[i], NULL);
}
实验结果:
╰─± ./mandelbrot --view 1 --thread 2
[mandelbrot serial]: [200.329] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0: [102.766] ms
Hello world from thread 1
Thread 1: [108.591] ms
Hello world from thread 0
Thread 0: [102.456] ms
Hello world from thread 1
Thread 1: [104.368] ms
Hello world from thread 0
Thread 0: [102.174] ms
Hello world from thread 1
Thread 1: [104.064] ms
[mandelbrot thread]: [104.270] ms
Wrote image file mandelbrot-thread.ppm
++++ (1.92x speedup from 2 threads)
╰─± ./mandelbrot --view 1 --thread 4
[mandelbrot serial]: [200.309] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0: [20.384] ms
Hello world from thread 3
Thread 3: [21.224] ms
Hello world from thread 1
Thread 1: [86.168] ms
Hello world from thread 2
Thread 2: [86.405] ms
Hello world from thread 3
Thread 3: [20.456] ms
Hello world from thread 0
Thread 0: [22.139] ms
Hello world from thread 2
Thread 2: [88.118] ms
Hello world from thread 1
Thread 1: [88.628] ms
Hello world from thread 0
Thread 0: [20.436] ms
Hello world from thread 3
Thread 3: [23.111] ms
Hello world from thread 2
Thread 2: [86.491] ms
Hello world from thread 1
Thread 1: [86.188] ms
[mandelbrot thread]: [86.637] ms
Wrote image file mandelbrot-thread.ppm
++++ (2.31x speedup from 4 threads)
╰─± ./mandelbrot --view 1 --thread 8
[mandelbrot serial]: [201.094] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0: [3.553] ms
Hello world from thread 7
Thread 7: [4.552] ms
Hello world from thread 1
Thread 1: [17.341] ms
Hello world from thread 6
Thread 6: [18.726] ms
Hello world from thread 2
Thread 2: [35.681] ms
Hello world from thread 5
Thread 5: [37.703] ms
Hello world from thread 3
Thread 3: [52.302] ms
Hello world from thread 4
Thread 4: [56.164] ms
Hello world from thread 0
Thread 0: [3.739] ms
Hello world from thread 7
Thread 7: [4.516] ms
Hello world from thread 1
Thread 1: [18.916] ms
Hello world from thread 6
Thread 6: [18.928] ms
Hello world from thread 2
Thread 2: [34.906] ms
Hello world from thread 5
Thread 5: [36.526] ms
Hello world from thread 3
Thread 3: [52.649] ms
Hello world from thread 4
Thread 4: [58.855] ms
Hello world from thread 0
Thread 0: [3.846] ms
Hello world from thread 7
Thread 7: [4.191] ms
Hello world from thread 1
Thread 1: [18.252] ms
Hello world from thread 6
Thread 6: [18.705] ms
Hello world from thread 2
Thread 2: [37.429] ms
Hello world from thread 5
Thread 5: [38.546] ms
Hello world from thread 3
Thread 3: [53.211] ms
Hello world from thread 4
Thread 4: [54.999] ms
[mandelbrot thread]: [55.322] ms
Wrote image file mandelbrot-thread.ppm
++++ (3.63x speedup from 8 threads)
╰─± ./mandelbrot --view 1 --thread 16
[mandelbrot serial]: [200.019] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0: [0.820] ms
Hello world from thread 15
Thread 15: [1.372] ms
Hello world from thread 1
Thread 1: [3.454] ms
Hello world from thread 2
Thread 2: [4.942] ms
Hello world from thread 14
Thread 14: [4.093] ms
Hello world from thread 13
Thread 13: [6.803] ms
Hello world from thread 3
Thread 3: [15.707] ms
Hello world from thread 11
Thread 11: [17.481] ms
Hello world from thread 4
Thread 4: [18.651] ms
Hello world from thread 12
Thread 12: [21.464] ms
Hello world from thread 5
Thread 5: [22.844] ms
Hello world from thread 10
Thread 10: [25.601] ms
Hello world from thread 9
Thread 9: [28.617] ms
Hello world from thread 6
Thread 6: [32.949] ms
Hello world from thread 8
Thread 8: [30.751] ms
Hello world from thread 7
Thread 7: [36.431] ms
Hello world from thread 0
Thread 0: [0.796] ms
Hello world from thread 1
Thread 1: [3.054] ms
Hello world from thread 2
Thread 2: [4.158] ms
Hello world from thread 15
Thread 15: [1.274] ms
Hello world from thread 14
Thread 14: [3.666] ms
Hello world from thread 13
Thread 13: [6.966] ms
Hello world from thread 3
Thread 3: [11.873] ms
Hello world from thread 12
Thread 12: [15.463] ms
Hello world from thread 5
Thread 5: [20.226] ms
Hello world from thread 11
Thread 11: [19.453] ms
Hello world from thread 4
Thread 4: [22.439] ms
Hello world from thread 6
Thread 6: [26.175] ms
Hello world from thread 9
Thread 9: [27.223] ms
Hello world from thread 10
Thread 10: [24.995] ms
Hello world from thread 7
Thread 7: [32.344] ms
Hello world from thread 8
Thread 8: [33.665] ms
Hello world from thread 0
Thread 0: [0.708] ms
Hello world from thread 1
Thread 1: [3.183] ms
Hello world from thread 2
Thread 2: [4.689] ms
Hello world from thread 15
Thread 15: [1.817] ms
Hello world from thread 13
Thread 13: [8.091] ms
Hello world from thread 14
Thread 14: [8.535] ms
Hello world from thread 3
Thread 3: [15.579] ms
Hello world from thread 12
Thread 12: [18.564] ms
Hello world from thread 5
Thread 5: [22.614] ms
Hello world from thread 4
Thread 4: [24.750] ms
Hello world from thread 11
Thread 11: [24.949] ms
Hello world from thread 6
Thread 6: [27.631] ms
Hello world from thread 9
Thread 9: [30.044] ms
Hello world from thread 10
Thread 10: [32.918] ms
Hello world from thread 8
Thread 8: [33.039] ms
Hello world from thread 7
Thread 7: [35.752] ms
[mandelbrot thread]: [34.221] ms
Wrote image file mandelbrot-thread.ppm
++++ (5.84x speedup from 16 threads)
结果:
线程数 | 理论加速比 | 实际加速比 |
---|---|---|
2 | 2 | 1.92 |
4 | 4 | 2.31 |
8 | 8 | 3.63 |
16 | 16 | 5.84 |
可以看到实际加速比和理论不符,从输出中可以看到,线程负载不均匀,应该是实际加速比和理论不符的原因。
4
尝试了如下两种方法,但是并不奏效:
按列划分:
void mandelbrotSerial_v2(
float x0, float y0, float x1, float y1,
int width, int height,
int startCol, int totalCols,
int maxIterations,
int output[])
{
// 前闭后开
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
int endCol = startCol + totalCols;
for (int i = startCol; i < endCol; i++) {
for (int j = 0; j < height; j++) {
float x = x0 + i * dx;
float y = y0 + j * dy;
int index = (j * width + i);
output[index] = mandel(x, y, maxIterations);
}
}
}
void* workerThreadStart_v2(void* threadArgs) {
double startTime = CycleTimer::currentSeconds();
WorkerArgs* args = static_cast<WorkerArgs*>(threadArgs);
// TODO: Implement worker thread here.
mandelbrotSerial_v2(args->x0, args->y0, args->x1, args->y1,
args->width, args->height, args->startCol, args->totalCols,
args->maxIterations, args->output);
printf("Hello world from thread %d\n", args->threadId);
double endTime = CycleTimer::currentSeconds();
printf("Thread %d:\t\t[%.3f] ms\n", args->threadId, (endTime - startTime) * 1000);
return NULL;
}
void mandelbrotThread(
int numThreads,
float x0, float y0, float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
const static int MAX_THREADS = 32;
if (numThreads > MAX_THREADS)
{
fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
exit(1);
}
pthread_t workers[MAX_THREADS];
WorkerArgs args[MAX_THREADS];
// 把行拆成numThreads份
int height_per_thread = height / numThreads;
// 已经计算的height数量
int cum_height = 0;
// 把列拆成numThreads份
int width_per_thread = width / numThreads;
// 已经计算的width数量
int cum_width = 0;
for (int i=0; i<numThreads; i++) {
// TODO: Set thread arguments here.
args[i].threadId = i;
// add;
args[i].x0 = x0;
args[i].y0 = y0;
args[i].x1 = x1;
// 最后一个thread特殊处理, 处理剩余全部的height和width
if (i == numThreads - 1) {
height_per_thread = height - cum_height;
width_per_thread = width - cum_width;
}
args[i].y1 = y1;
args[i].width = width;
args[i].height = height;
args[i].maxIterations = maxIterations;
// 起始行
args[i].startRow = cum_height;
args[i].totalRows = height_per_thread;
args[i].startCol = cum_width;
args[i].totalCols = width_per_thread;
args[i].output = output;
// 更新
cum_width += width_per_thread;
cum_height += height_per_thread;
}
// Fire up the worker threads. Note that numThreads-1 pthreads
// are created and the main app thread is used as a worker as
// well.
for (int i=1; i<numThreads; i++)
pthread_create(&workers[i], NULL, workerThreadStart_v2, &args[i]);
workerThreadStart_v2(&args[0]);
// wait for worker threads to complete
for (int i=1; i<numThreads; i++)
pthread_join(workers[i], NULL);
}
测试:
╰─± ./mandelbrot --view 1 --thread 16
[mandelbrot serial]: [200.328] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 1
Thread 1: [0.574] ms
Hello world from thread 2
Thread 2: [0.882] ms
Hello world from thread 0
Thread 0: [0.237] ms
Hello world from thread 15
Thread 15: [0.434] ms
Hello world from thread 3
Thread 3: [1.502] ms
Hello world from thread 14
Thread 14: [0.573] ms
Hello world from thread 13
Thread 13: [0.761] ms
Hello world from thread 12
Thread 12: [5.800] ms
Hello world from thread 4
Thread 4: [8.490] ms
Hello world from thread 6
Thread 6: [12.065] ms
Hello world from thread 5
Thread 5: [16.190] ms
Hello world from thread 7
Thread 7: [28.204] ms
Hello world from thread 11
Thread 11: [34.645] ms
Hello world from thread 8
Thread 8: [37.265] ms
Hello world from thread 10
Thread 10: [39.751] ms
Hello world from thread 9
Thread 9: [47.683] ms
Hello world from thread 1
Thread 1: [0.685] ms
Hello world from thread 0
Thread 0: [0.294] ms
Hello world from thread 2
Thread 2: [0.788] ms
Hello world from thread 15
Thread 15: [0.465] ms
Hello world from thread 14
Thread 14: [0.442] ms
Hello world from thread 3
Thread 3: [1.634] ms
Hello world from thread 13
Thread 13: [0.517] ms
Hello world from thread 12
Thread 12: [6.032] ms
Hello world from thread 4
Thread 4: [8.249] ms
Hello world from thread 6
Thread 6: [13.591] ms
Hello world from thread 5
Thread 5: [16.731] ms
Hello world from thread 7
Thread 7: [30.074] ms
Hello world from thread 11
Thread 11: [37.594] ms
Hello world from thread 8
Thread 8: [38.285] ms
Hello world from thread 10
Thread 10: [46.705] ms
Hello world from thread 9
Thread 9: [47.721] ms
Hello world from thread 1
Thread 1: [0.687] ms
Hello world from thread 0
Thread 0: [0.231] ms
Hello world from thread 2
Thread 2: [0.828] ms
Hello world from thread 13
Thread 13: [0.481] ms
Hello world from thread 15
Thread 15: [0.541] ms
Hello world from thread 14
Thread 14: [0.458] ms
Hello world from thread 3
Thread 3: [1.757] ms
Hello world from thread 12
Thread 12: [6.099] ms
Hello world from thread 4
Thread 4: [8.668] ms
Hello world from thread 6
Thread 6: [11.606] ms
Hello world from thread 5
Thread 5: [19.509] ms
Hello world from thread 7
Thread 7: [31.520] ms
Hello world from thread 11
Thread 11: [35.165] ms
Hello world from thread 8
Thread 8: [37.545] ms
Hello world from thread 10
Thread 10: [43.472] ms
Hello world from thread 9
Thread 9: [48.508] ms
[mandelbrot thread]: [48.391] ms
Wrote image file mandelbrot-thread.ppm
++++ (4.14x speedup from 16 threads)
╰─± ./mandelbrot --view 2 --thread 16
[mandelbrot serial]: [121.285] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 15
Thread 15: [3.000] ms
Hello world from thread 10
Thread 10: [5.815] ms
Hello world from thread 11
Thread 11: [5.630] ms
Hello world from thread 0
Thread 0: [7.868] ms
Hello world from thread 9
Thread 9: [5.731] ms
Hello world from thread 8
Thread 8: [9.158] ms
Hello world from thread 2
Thread 2: [9.700] ms
Hello world from thread 7
Thread 7: [10.800] ms
Hello world from thread 1
Thread 1: [11.537] ms
Hello world from thread 14
Thread 14: [5.657] ms
Hello world from thread 12
Thread 12: [4.674] ms
Hello world from thread 13
Thread 13: [4.436] ms
Hello world from thread 3
Thread 3: [14.780] ms
Hello world from thread 6
Thread 6: [15.527] ms
Hello world from thread 4
Thread 4: [16.482] ms
Hello world from thread 5
Thread 5: [18.798] ms
Hello world from thread 11
Thread 11: [5.809] ms
Hello world from thread 10
Thread 10: [5.872] ms
Hello world from thread 8
Thread 8: [7.771] ms
Hello world from thread 0
Thread 0: [7.641] ms
Hello world from thread 2
Thread 2: [9.835] ms
Hello world from thread 15
Thread 15: [4.964] ms
Hello world from thread 9
Thread 9: [6.659] ms
Hello world from thread 13
Thread 13: [3.724] ms
Hello world from thread 14
Thread 14: [6.282] ms
Hello world from thread 7
Thread 7: [12.417] ms
Hello world from thread 1
Thread 1: [12.729] ms
Hello world from thread 12
Thread 12: [4.940] ms
Hello world from thread 6
Thread 6: [11.874] ms
Hello world from thread 3
Thread 3: [15.604] ms
Hello world from thread 4
Thread 4: [16.420] ms
Hello world from thread 5
Thread 5: [18.860] ms
Hello world from thread 11
Thread 11: [5.304] ms
Hello world from thread 10
Thread 10: [6.258] ms
Hello world from thread 9
Thread 9: [8.182] ms
Hello world from thread 14
Thread 14: [3.032] ms
Hello world from thread 15
Thread 15: [3.047] ms
Hello world from thread 0
Thread 0: [8.809] ms
Hello world from thread 8
Thread 8: [8.716] ms
Hello world from thread 7
Thread 7: [11.343] ms
Hello world from thread 2
Thread 2: [12.098] ms
Hello world from thread 1
Thread 1: [12.429] ms
Hello world from thread 6
Thread 6: [12.611] ms
Hello world from thread 13
Thread 13: [3.898] ms
Hello world from thread 12
Thread 12: [4.488] ms
Hello world from thread 4
Thread 4: [14.674] ms
Hello world from thread 3
Thread 3: [15.797] ms
Hello world from thread 5
Thread 5: [20.155] ms
[mandelbrot thread]: [19.312] ms
Wrote image file mandelbrot-thread.ppm
++++ (6.28x speedup from 16 threads)
加速比为4.14和6.28。
另一种方式是给每个args[i].output赋予单独的数组,然后利用循环展开:
void mandelbrotSerial_v3(
float x0, float y0, float x1, float y1,
int width, int height,
int startRow, int totalRows,
int maxIterations,
int output[])
{
// 前闭后开
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
int endRow = startRow + totalRows;
for (int j = startRow; j < endRow; j++) {
int i = 0;
int index = (j - startRow) * width;
float x;
float y = y0 + j * dy;
// 循环展开
for (; i + 1 < width; i += 2, index += 2) {
x = x0 + i * dx;
output[index] = mandel(x, y, maxIterations);
x = x0 + (i + 1) * dx;
output[index + 1] = mandel(x, y, maxIterations);
}
// 处理剩余部分
for (; i < width; ++i, ++index) {
x = x0 + i * dx;
output[index] = mandel(x, y, maxIterations);
}
}
}
void* workerThreadStart_v3(void* threadArgs) {
double startTime = CycleTimer::currentSeconds();
WorkerArgs* args = static_cast<WorkerArgs*>(threadArgs);
// TODO: Implement worker thread here.
mandelbrotSerial_v3(args->x0, args->y0, args->x1, args->y1,
args->width, args->height, args->startRow, args->totalRows,
args->maxIterations, args->output);
printf("Hello world from thread %d\n", args->threadId);
double endTime = CycleTimer::currentSeconds();
printf("Thread %d:\t\t[%.3f] ms\n", args->threadId, (endTime - startTime) * 1000);
return NULL;
}
void mandelbrotThread(
int numThreads,
float x0, float y0, float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
const static int MAX_THREADS = 32;
if (numThreads > MAX_THREADS)
{
fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
exit(1);
}
pthread_t workers[MAX_THREADS];
WorkerArgs args[MAX_THREADS];
// 把行拆成numThreads份
int height_per_thread = height / numThreads;
// 已经计算的height数量
int cum_height = 0;
// 把列拆成numThreads份
int width_per_thread = width / numThreads;
// 已经计算的width数量
int cum_width = 0;
for (int i=0; i<numThreads; i++) {
// TODO: Set thread arguments here.
args[i].threadId = i;
// add;
args[i].x0 = x0;
args[i].y0 = y0;
args[i].x1 = x1;
// 最后一个thread特殊处理, 处理剩余全部的height和width
if (i == numThreads - 1) {
height_per_thread = height - cum_height;
width_per_thread = width - cum_width;
}
args[i].y1 = y1;
args[i].width = width;
args[i].height = height;
args[i].maxIterations = maxIterations;
// 起始行
args[i].startRow = cum_height;
args[i].totalRows = height_per_thread;
args[i].startCol = cum_width;
args[i].totalCols = width_per_thread;
// 每个thread一个单独的数组
args[i].output = new int[args[i].totalRows * args[i].width];
// 更新
cum_width += width_per_thread;
cum_height += height_per_thread;
}
// Fire up the worker threads. Note that numThreads-1 pthreads
// are created and the main app thread is used as a worker as
// well.
for (int i=1; i<numThreads; i++)
pthread_create(&workers[i], NULL, workerThreadStart_v3, &args[i]);
workerThreadStart_v3(&args[0]);
// wait for worker threads to complete
for (int i=1; i<numThreads; i++)
pthread_join(workers[i], NULL);
// 合并结果
int start = 0;
for (int i = 0; i < numThreads; i++) {
int m = args[i].totalRows * args[i].width;
for (int j = 0; j < m; j++) {
output[start + j] = args[i].output[j];
}
start += m;
}
}
实验结果为:
╰─± ./mandelbrot --view 1 --thread 16
[mandelbrot serial]: [200.283] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 0
Thread 0: [0.831] ms
Hello world from thread 1
Thread 1: [3.641] ms
Hello world from thread 2
Thread 2: [4.788] ms
Hello world from thread 14
Thread 14: [3.878] ms
Hello world from thread 15
Thread 15: [1.449] ms
Hello world from thread 13
Thread 13: [7.393] ms
Hello world from thread 3
Thread 3: [13.183] ms
Hello world from thread 4
Thread 4: [20.360] ms
Hello world from thread 12
Thread 12: [15.916] ms
Hello world from thread 5
Thread 5: [21.638] ms
Hello world from thread 11
Thread 11: [22.251] ms
Hello world from thread 10
Thread 10: [22.507] ms
Hello world from thread 6
Thread 6: [27.589] ms
Hello world from thread 9
Thread 9: [27.889] ms
Hello world from thread 8
Thread 8: [30.121] ms
Hello world from thread 7
Thread 7: [32.634] ms
Hello world from thread 0
Thread 0: [0.808] ms
Hello world from thread 1
Thread 1: [2.960] ms
Hello world from thread 15
Thread 15: [1.483] ms
Hello world from thread 2
Thread 2: [5.813] ms
Hello world from thread 14
Thread 14: [3.791] ms
Hello world from thread 13
Thread 13: [7.110] ms
Hello world from thread 3
Thread 3: [16.928] ms
Hello world from thread 12
Thread 12: [16.516] ms
Hello world from thread 4
Thread 4: [17.820] ms
Hello world from thread 5
Thread 5: [22.885] ms
Hello world from thread 11
Thread 11: [25.085] ms
Hello world from thread 10
Thread 10: [24.466] ms
Hello world from thread 6
Thread 6: [27.473] ms
Hello world from thread 7
Thread 7: [29.975] ms
Hello world from thread 8
Thread 8: [30.919] ms
Hello world from thread 9
Thread 9: [32.183] ms
Hello world from thread 0
Thread 0: [1.001] ms
Hello world from thread 1
Thread 1: [3.189] ms
Hello world from thread 2
Thread 2: [5.523] ms
Hello world from thread 15
Thread 15: [1.307] ms
Hello world from thread 14
Thread 14: [3.797] ms
Hello world from thread 13
Thread 13: [6.020] ms
Hello world from thread 3
Thread 3: [16.198] ms
Hello world from thread 12
Thread 12: [18.236] ms
Hello world from thread 4
Thread 4: [19.965] ms
Hello world from thread 10
Thread 10: [25.051] ms
Hello world from thread 11
Thread 11: [25.945] ms
Hello world from thread 5
Thread 5: [26.681] ms
Hello world from thread 6
Thread 6: [28.467] ms
Hello world from thread 7
Thread 7: [31.332] ms
Hello world from thread 9
Thread 9: [32.669] ms
Hello world from thread 8
Thread 8: [34.029] ms
[mandelbrot thread]: [33.867] ms
Wrote image file mandelbrot-thread.ppm
++++ (5.91x speedup from 16 threads)
╰─± ./mandelbrot --view 2 --thread 16
[mandelbrot serial]: [125.893] ms
Wrote image file mandelbrot-serial.ppm
Hello world from thread 6
Thread 6: [6.453] ms
Hello world from thread 11
Thread 11: [6.348] ms
Hello world from thread 5
Thread 5: [6.949] ms
Hello world from thread 10
Thread 10: [6.905] ms
Hello world from thread 9
Thread 9: [6.997] ms
Hello world from thread 4
Thread 4: [7.510] ms
Hello world from thread 7
Thread 7: [8.023] ms
Hello world from thread 3
Thread 3: [8.780] ms
Hello world from thread 2
Thread 2: [12.412] ms
Hello world from thread 1
Thread 1: [12.899] ms
Hello world from thread 8
Thread 8: [12.710] ms
Hello world from thread 14
Thread 14: [7.545] ms
Hello world from thread 15
Thread 15: [8.204] ms
Hello world from thread 12
Thread 12: [8.528] ms
Hello world from thread 13
Thread 13: [10.376] ms
Hello world from thread 0
Thread 0: [22.576] ms
Hello world from thread 5
Thread 5: [6.598] ms
Hello world from thread 6
Thread 6: [6.313] ms
Hello world from thread 4
Thread 4: [7.217] ms
Hello world from thread 9
Thread 9: [6.935] ms
Hello world from thread 11
Thread 11: [7.983] ms
Hello world from thread 7
Thread 7: [8.460] ms
Hello world from thread 8
Thread 8: [9.118] ms
Hello world from thread 3
Thread 3: [10.458] ms
Hello world from thread 2
Thread 2: [11.159] ms
Hello world from thread 10
Thread 10: [6.856] ms
Hello world from thread 14
Thread 14: [7.672] ms
Hello world from thread 12
Thread 12: [8.058] ms
Hello world from thread 15
Thread 15: [8.282] ms
Hello world from thread 1
Thread 1: [16.162] ms
Hello world from thread 13
Thread 13: [8.554] ms
Hello world from thread 0
Thread 0: [16.416] ms
Hello world from thread 6
Thread 6: [6.467] ms
Hello world from thread 5
Thread 5: [7.401] ms
Hello world from thread 4
Thread 4: [7.772] ms
Hello world from thread 3
Thread 3: [8.608] ms
Hello world from thread 15
Thread 15: [6.725] ms
Hello world from thread 10
Thread 10: [7.944] ms
Hello world from thread 9
Thread 9: [8.051] ms
Hello world from thread 14
Thread 14: [8.397] ms
Hello world from thread 1
Thread 1: [14.227] ms
Hello world from thread 2
Thread 2: [16.222] ms
Hello world from thread 11
Thread 11: [8.859] ms
Hello world from thread 8
Thread 8: [7.992] ms
Hello world from thread 13
Thread 13: [7.937] ms
Hello world from thread 12
Thread 12: [10.159] ms
Hello world from thread 0
Hello world from thread 7
Thread 7: [16.727] ms
Thread 0: [20.767] ms
[mandelbrot thread]: [18.906] ms
Wrote image file mandelbrot-thread.ppm
++++ (6.66x speedup from 16 threads)
性能略有提高,加速比为5.91和6.66。
Problem 2: Vectorizing Code Using SIMD Intrinsics (20 points)
1
模仿absVector的写法即可:
void clampedExpVector(float* values, int* exponents, float* output, int N) {
// Implement your vectorized version of clampedExpSerial here
// ...
int i = 0;
// r表示第一次处理的数量
int r = N % VECTOR_WIDTH;
if (r == 0) {
r = VECTOR_WIDTH;
}
__cmu418_vec_float x;
__cmu418_vec_float result;
__cmu418_vec_float constant;
__cmu418_vec_int y;
__cmu418_vec_float xpower;
__cmu418_vec_int zero, one, yAndOne;
__cmu418_mask mask, greaterThanZero, productMask, greater;
for (i = 0; i < N; ) {
if (i == 0) {
// 前r个为1
mask = _cmu418_init_ones(r);
// mask = _cmu418_mask_not(mask);
} else {
// 全1
mask = _cmu418_init_ones();
}
// 初始化为false
greaterThanZero = _cmu418_init_ones(0);
productMask = _cmu418_init_ones(0);
greater = _cmu418_init_ones(0);
_cmu418_vset_int(zero, 0, mask);
_cmu418_vset_int(one, 1, mask);
constant = _cmu418_vset_float(4.18f);
// x = values[i];
_cmu418_vload_float(x, values + i, mask);
// result = 1.f
result = _cmu418_vset_float(1.f);
// y = exponents[i]
_cmu418_vload_int(y, exponents + i, mask);
// xpower = x
_cmu418_vmove_float(xpower, x, mask);
// 判断是否大于0
_cmu418_vlt_int(greaterThanZero, zero, y, mask);
// while (y > 0)
while (_cmu418_cntbits(greaterThanZero)) {
// y & 0x1
_cmu418_vbitand_int(yAndOne, y, one, mask);
// 判断y & 0x1是否为1
_cmu418_veq_int(productMask, yAndOne, one, mask);
// result *= xpower;
_cmu418_vmult_float(result, result, xpower, productMask);
// xpower = xpower * xpower;
_cmu418_vmult_float(xpower, xpower, xpower, mask);
// y >>= 1
_cmu418_vshiftright_int(y, y, one, mask);
// 判断是否大于0
_cmu418_vlt_int(greaterThanZero, zero, y, mask);
}
// if (result > 4.18f)
_cmu418_vlt_float(greater, constant, result, mask);
// result = 4.18f
_cmu418_vmove_float(result, constant, greater);
// output[i] = result
_cmu418_vstore_float(output + i, result, mask);
// 更新
if (i == 0) {
i += r;
} else {
i += VECTOR_WIDTH;
}
}
}
测试:
╰─± ./vrun -s 16
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 124
Vector Utilization: 92.237903%
Utilized Vector Lanes: 915
Total Vector Lanes: 992
************************ Result Verification *************************
Passed!!!
2
修改VECTOR_WIDTH
的值然后运行:
╰─± ./vrun -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 301610
Vector Utilization: 91.087000%
Utilized Vector Lanes: 549455
Total Vector Lanes: 603220
************************ Result Verification *************************
Passed!!!
╰─± ./vrun -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 154195
Vector Utilization: 90.843250%
Utilized Vector Lanes: 560303
Total Vector Lanes: 616780
************************ Result Verification *************************
Passed!!!
╰─± ./vrun -s 10000
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 77485
Vector Utilization: 90.789024%
Utilized Vector Lanes: 562783
Total Vector Lanes: 619880
************************ Result Verification *************************
Passed!!!
╰─± ./vrun -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 38750
Vector Utilization: 90.786935%
Utilized Vector Lanes: 562879
Total Vector Lanes: 620000
************************ Result Verification *************************
Passed!!!
3
span的含义暂时没有理解,目前的做法是将数组拆成$N/W$份,每份分别求和,存储到数组中,然后对数组按顺序调用_cmu418_hadd_float,_cmu418_interleave_float
,调用数量为$\log_2W$,不难得到,第$i$次调用时,数组的第一个元素为
所以$\log_2 W$次调用可以得到数组的求和,代码如下:
float arraySumVectorHelper(float* values) {
int i = 1;
__cmu418_mask mask = _cmu418_init_ones();
__cmu418_vec_float result, tmp;
_cmu418_vload_float(tmp, values, mask);
while (i < VECTOR_WIDTH) {
_cmu418_hadd_float(result, tmp);
_cmu418_interleave_float(tmp, result);
i *= 2;
}
_cmu418_vstore_float(values, result, mask);
return values[0];
}
float arraySumVector(float* values, int N) {
// Implement your vectorized version here
// ...
int l = N / VECTOR_WIDTH;
float res[VECTOR_WIDTH];
// 一次最多只能看max(N/ W, log2W)
for (int i = 0; i < VECTOR_WIDTH; i++) {
res[i] = arraySumSerial(values + i * l, l);
}
return arraySumVectorHelper(res);
}
测试:
ARRAY SUM (bonus)
Passed!!!
Problem 3: Parallel Fractal Generation Using ISPC (15 points)
准备工作
首先是环境配置,在官网下载操作系统对应的版本,解压后将对应文件添加到环境变量中:
export PATH=~/ispc-v1.17.0-linux/bin:$PATH
官网地址:
然后阅读ispc的例子:
3.1
测试:
╰─± ./mandelbrot_ispc -v 1
[mandelbrot serial]: [215.696] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [46.147] ms
Wrote image file mandelbrot-ispc.ppm
(4.67x speedup from ISPC)
╰─± ./mandelbrot_ispc -v 2
[mandelbrot serial]: [131.712] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [32.244] ms
Wrote image file mandelbrot-ispc.ppm
(4.08x speedup from ISPC)
╰─± ./mandelbrot_ispc -v 3
[mandelbrot serial]: [308.290] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [68.822] ms
Wrote image file mandelbrot-ispc.ppm
(4.48x speedup from ISPC)
╰─± ./mandelbrot_ispc -v 4
[mandelbrot serial]: [304.620] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [68.016] ms
Wrote image file mandelbrot-ispc.ppm
(4.48x speedup from ISPC)
╰─± ./mandelbrot_ispc -v 5
[mandelbrot serial]: [60.267] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [13.930] ms
Wrote image file mandelbrot-ispc.ppm
(4.33x speedup from ISPC)
╰─± ./mandelbrot_ispc -v 6
[mandelbrot serial]: [534.112] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [112.647] ms
Wrote image file mandelbrot-ispc.ppm
(4.74x speedup from ISPC)
因为使用了8wide AVX2 矢量指令,所以理论加速应该为8,但是和实际情况不符。注意到图片中白色部分计算量最多,黑色部分最少,所以可能和计算量不均匀有关。
3.2
1,2
修改代码,使之可以处理任意的threadCount:
// slightly different kernel to support tasking
task void mandelbrot_ispc_task(uniform float x0, uniform float y0,
uniform float x1, uniform float y1,
uniform int width, uniform int height,
uniform int rowsPerTask,
uniform int maxIterations,
uniform int output[])
{
// taskIndex is an ISPC built-in
uniform int ystart = taskIndex * rowsPerTask;
// 最后一个task特殊处理
uniform int yend = ystart + rowsPerTask;
if (yend + rowsPerTask > height) {
yend = height;
}
uniform float dx = (x1 - x0) / width;
uniform float dy = (y1 - y0) / height;
foreach (j = ystart ... yend, i = 0 ... width) {
float x = x0 + i * dx;
float y = y0 + j * dy;
int index = j * width + i;
output[index] = mandel(x, y, maxIterations);
}
}
export void mandelbrot_ispc_withtasks(uniform float x0, uniform float y0,
uniform float x1, uniform float y1,
uniform int width, uniform int height,
uniform int maxIterations,
uniform int output[])
{
// uniform int threadCount = 2;
uniform int threadCount = 1000;
uniform int rowsPerTask = height / threadCount;
// add
rowsPerTask = max(1, rowsPerTask);
// create threadCount tasks
launch[threadCount] mandelbrot_ispc_task(x0, y0, x1, y1,
width, height,
rowsPerTask,
maxIterations,
output);
}
测试:
╰─± ./mandelbrot_ispc -t -v 1
[mandelbrot serial]: [215.468] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [46.115] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [12.226] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.67x speedup from ISPC)
(17.62x speedup from task ISPC)
3
问题描述:
Pthread 抽象(用于问题 1)和 ISPC 任务抽象有什么区别? (create/join 和 (launch/sync) 机制之间在语义上有一些明显的差异,但这些差异的含义更微妙。这里有一个思想实验来指导你的答案:当你启动 10,000 个 ISPC 任务时会发生什么?什么 当你启动 10,000 个 pthread 时会发生什么?
回答:
暂时没有理解题目,感觉可能的回答是pthread需要等待线程完成,而ISPC不需要。
Problem 4: Iterative Square Root (10 points)
1
╰─± ./sqrt
[sqrt serial]: [829.827] ms
[sqrt ispc]: [188.124] ms
[sqrt task ispc]: [20.934] ms
(4.41x speedup from ISPC)
(39.64x speedup from task ISPC)
2
在3附近的加速比最大:
// Generate data that gives high relative speedup
void initGood(float *values, int N) {
for (int i=0; i<N; i++)
{
// Todo: Choose values
values[i] = 2.998f - i / N;
}
}
测试:
╰─± ./sqrt -d g
[sqrt serial]: [2101.552] ms
[sqrt ispc]: [321.759] ms
[sqrt task ispc]: [35.236] ms
(6.53x speedup from ISPC)
(59.64x speedup from task ISPC)
3
在1附近的加速比最小:
// Generate data that gives low relative speedup
void initBad(float *values, int N) {
for (int i=0; i<N; i++)
{
// Todo: Choose values
values[i] = 1.0f;
}
}
测试:
╰─± ./sqrt -d b
[sqrt serial]: [27.776] ms
[sqrt ispc]: [17.068] ms
[sqrt task ispc]: [12.275] ms
(1.63x speedup from ISPC)
(2.26x speedup from task ISPC)
Problem 5: BLAS saxpy (10 points)
1
╰─± ./saxpy
[saxpy serial]: [12.381] ms [24.072] GB/s [1.615] GFLOPS
[saxpy streaming]: [13.943] ms [21.374] GB/s [1.434] GFLOPS
[saxpy ispc]: [15.216] ms [19.586] GB/s [1.314] GFLOPS
[saxpy task ispc]: [14.207] ms [20.977] GB/s [1.408] GFLOPS
(0.89x speedup from streaming)
(0.81x speedup from ISPC)
(0.87x speedup from task ISPC)
可以看到性能没有太多提升,推测是内存读取太多的原因,应该可以改进,但试了很多种并没有成功。
2
每步使用用了4个内存的原因应该是要存中间结果a * X[i]
。
3
这部分需要学习如下资料:
- https://blog.csdn.net/gengshenghong/article/details/7010615
- https://stackoverflow.com/questions/9079580/sse-simd-multiply-vector-by-scalar
- https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/C++/CompilerInfrastructure/Intrinsics/src/intrin_dot_sample.cpp#L188:7
- https://www.aarongutierrez.com/15418/index.html
代码:
void saxpyStreaming(int N,
float scale,
float X[],
float Y[],
float result[])
{
// Replace this code with ones that make use of the streaming instructions
// saxpySerial(N, scale, X, Y, result);
__m128 x, y;
// scale, scale, scale, scale
__m128 scalar = _mm_set1_ps(scale);
for (int i = 0; i < N; i += 4) {
// X[0], ... , X[3]
x = _mm_loadu_ps(X + i);
y = _mm_loadu_ps(Y + i);
// 乘以标量
x = _mm_mul_ps(x, scalar);
// 加法
x = _mm_add_ps(x, y);
// 存储
_mm_store_ps(result + i, x);
}
}
但是实验结果并不理想(见1)。