## Description

Submit responses to all tasks which don't specify a file name to Canvas in a file called assignment4.{txt, docx, pdf, rtf, odt} (choose one of the formats). All plots should also be submitted on Canvas. All source files should be submitted in the HW04 subdirectory on the master branch of your homework git repo with no subdirectories.

All commands or code must work on Euler with only the cuda module loaded unless specified otherwise. Commands and/or code may behave differently on your computer, so be sure to test on Euler before you submit.

Please submit clean code. Consider using a formatter like __clang-format__.

- Before you begin, copy the provided files from HW04 of the __ME759-2020 repo__. Do not change any of the provided files because we will write clean copies over them when grading.

- (a) Implement in a file called matmul.cu the matmul and matmul_kernel functions as declared and described in matmul.cuh.

- (b) Write a program task1.cu which does the following:

  - Creates matrices (as 1D row major arrays) A and B of size n*n in managed (aka unified) memory.

  - Fills those matrices however you like.

  - Calls your matmul function.

  - Prints the last element of the resulting matrix.

  - Prints the time taken to perform the multiplication in milliseconds using CUDA events.

  - Compile: nvcc task1.cu matmul.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task1

  - Run (where n and threads_per_block are positive integers): ./task1 n threads_per_block

  - Example expected output: 11.36 1.23
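One way to sanity-check the value printed by task1 is to compare it against a plain CPU multiplication for small n. The sketch below is only an illustration: the name matmul_reference, the use of std::vector, and the float element type are assumptions and are not taken from matmul.cuh, which defines the actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the provided files): CPU reference for
// C = A * B, where A, B and C are n-by-n matrices stored as 1D row-major
// arrays. The float element type is an assumption.
std::vector<float> matmul_reference(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float a_ik = A[i * n + k];  // element (i, k) of A
            for (std::size_t j = 0; j < n; ++j) {
                // element (i, j) of C accumulates A(i, k) * B(k, j)
                C[i * n + j] += a_ik * B[k * n + j];
            }
        }
    }
    return C;
}
```

For a small n, the last entry of the returned vector (index n * n - 1) should agree with the last element printed by the GPU version up to floating-point rounding.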

- (c) On an Euler compute node, run task1 for each value $n = 2^{5}, 2^{6}, \dots, 2^{15}$ and generate a plot in pdf format which plots the time taken by your algorithm as a function of n when threads_per_block = 1024. Overlay another plot which plots the same relationship with a different choice of threads_per_block.


- (a) Implement in a file called stencil.cu the stencil and stencil_kernel functions as declared and described in stencil.cuh. These functions should produce the 1D convolution of image and mask:

$$\text{output}[i] = \sum_{j=-R}^{R} \text{image}[i + j] \cdot \text{mask}[j + R], \qquad i = 0, \dots, n - 1$$

Assume that image[i] = 0 when i < 0 or i > n - 1. Pay close attention to what data you are asked to store and compute in shared memory.
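The convolution and its zero boundary condition can be written out on the CPU as a reference for small inputs. This is a minimal sketch under stated assumptions: the name stencil_reference, the std::vector containers, and the float element type are hypothetical and do not come from stencil.cuh, and the sketch says nothing about how the kernel should use shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the provided files): CPU reference for
// output[i] = sum over j = -R..R of image[i + j] * mask[j + R],
// with image treated as 0 outside the index range [0, n - 1].
std::vector<float> stencil_reference(const std::vector<float>& image,
                                     const std::vector<float>& mask,
                                     std::size_t R) {
    assert(mask.size() == 2 * R + 1);
    const long long n = static_cast<long long>(image.size());
    const long long r = static_cast<long long>(R);
    std::vector<float> output(image.size(), 0.0f);
    for (long long i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (long long j = -r; j <= r; ++j) {
            const long long idx = i + j;
            // Out-of-range image entries contribute zero, so skip them.
            if (idx >= 0 && idx < n) {
                sum += image[static_cast<std::size_t>(idx)] *
                       mask[static_cast<std::size_t>(j + r)];
            }
        }
        output[static_cast<std::size_t>(i)] = sum;
    }
    return output;
}
```

For example, with image = {1, 2, 3, 4}, mask = {1, 2, 3} and R = 1, the first output is 0·1 + 1·2 + 2·3 = 8 and the last is 3·1 + 4·2 + 0·3 = 11, which shows the zero padding acting at both ends.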

- (b) Write a program task2.cu which does the following:

  - Creates arrays image (length n), output (length n), and mask (length 2 * R + 1) all in managed memory.

  - Fills those arrays however you like.

  - Calls your stencil function.

  - Prints the last element of the resulting array.

  - Prints the time taken to perform the convolution in milliseconds using CUDA events.

  - Compile: nvcc task2.cu stencil.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task2

  - Run (where n, R, and threads_per_block are positive integers): ./task2 n R threads_per_block

  - Example expected output: 11.36 1.23

- (c) On an Euler compute node, run task2 for each value $n = 2^{10}, 2^{11}, \dots, 2^{31}$ and generate a plot in pdf format which plots the time taken by your algorithm as a function of n when threads_per_block = 1024 and R = 128. Overlay another plot which plots the same relationship with a different choice of threads_per_block.
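Note that the largest size in this sweep, 2^31, is one more than the largest value a signed 32-bit integer can hold (2^31 - 1), so size arithmetic in a driver may need a wider type. The sketch below illustrates this with two hypothetical helpers, powers_of_two and blocks_needed; the names, the 64-bit types, and the one-thread-per-output-element launch layout are assumptions, since the actual parameter types and launch configuration are set by stencil.cuh.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: the sizes 2^lo, 2^(lo+1), ..., 2^hi as 64-bit values.
std::vector<std::uint64_t> powers_of_two(unsigned lo, unsigned hi) {
    assert(lo <= hi && hi < 64);
    std::vector<std::uint64_t> sizes;
    for (unsigned e = lo; e <= hi; ++e) {
        sizes.push_back(std::uint64_t{1} << e);  // shift done in 64 bits
    }
    return sizes;
}

// Hypothetical helper: smallest block count such that
// blocks * threads_per_block >= n (assumes one thread per output element).
std::uint64_t blocks_needed(std::uint64_t n, std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;  // ceiling division
}
```

With these definitions, powers_of_two(10, 31) yields 22 sizes, and blocks_needed(2^31, 1024) is 2^21 = 2097152. At n = 2^31, a single array of 4-byte floats occupies 8 GiB, which is worth keeping in mind when sizing the job.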
