## Description

Submit responses to all tasks which don't specify a file name to Canvas in a file called assignment5.txt, .docx, .pdf, .rtf, or .odt (choose one of the formats). All plots should also be submitted on Canvas. All source files should be submitted in the HW05 subdirectory on the master branch of your homework git repo, with no subdirectories.

All commands or code must work on Euler with only the cuda module loaded unless specified otherwise. They may behave differently on your computer, so be sure to test on Euler before you submit.

Please submit clean code. Consider using a formatter like __clang-format__.

- Before you begin, copy the provided files from HW05 of the __ME759-2020 repo__.

- a) Implement in a file called reduce.cu the functions reduce and reduce_kernel as declared and described in reduce.cuh. Your reduce_kernel should use the alteration from reduction #3 ("sequential addressing" from Lecture 13), and reduce should wrap calls to reduce_kernel.
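The exact signatures are defined in reduce.cuh; the sketch below is only one plausible shape (the float element type and both parameter lists are assumptions). It shows the key feature of reduction #3: each step has the lower half of the threads add the upper half's values, so active threads stay contiguous and shared-memory accesses are conflict-free.

```cuda
#include <cuda_runtime.h>

// Sketch only: reduce.cuh defines the real signatures; float data and
// these parameter lists are assumptions for illustration. Assumes
// threads_per_block is a power of two, as in the lecture version.
__global__ void reduce_kernel(const float *g_idata, float *g_odata,
                              unsigned int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element (0 if past the end of the array)
    sdata[tid] = (i < n) ? g_idata[i] : 0.0f;
    __syncthreads();

    // "Sequential addressing": the active half of the block adds the
    // other half; the stride halves each iteration, avoiding the
    // divergent modulo test and bank conflicts of reductions #1 and #2.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // one partial sum per block
}

// reduce wraps repeated kernel launches: each pass shrinks the array to
// one partial sum per block until a single value remains.
float reduce(float *d_in, float *d_scratch, unsigned int n,
             unsigned int threads_per_block) {
    while (n > 1) {
        unsigned int blocks = (n + threads_per_block - 1) / threads_per_block;
        reduce_kernel<<<blocks, threads_per_block,
                        threads_per_block * sizeof(float)>>>(d_in, d_scratch, n);
        // The next pass reduces the array of per-block partial sums
        float *tmp = d_in;
        d_in = d_scratch;
        d_scratch = tmp;
        n = blocks;
    }
    float result;
    cudaMemcpy(&result, d_in, sizeof(float), cudaMemcpyDeviceToHost);
    return result;
}
```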

- Write a test program task1.cu which does the following.

Creates and fills (however you like) an array of length N, where N is the first command line argument as below.

Uses your reduce to sum the elements in the array, using the threads per block from the second command line argument as below.

Prints the resulting sum.

Prints the time taken to run the reduction in milliseconds.

Compile: nvcc task1.cu reduce.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task1

Run (where N ≤ 2^{30} and threads_per_block are positive integers):

./task1 N threads_per_block

Example expected output: 102536 1031.2

- On an Euler compute node, run task1 for each value N = 2^{10}, 2^{11}, ..., 2^{30} and generate a plot of the time taken by your algorithm as a function of N when threads_per_block = 1024. Overlay another plot of the same relationship with a different choice of threads_per_block.


- a) Implement in a file called matmul.cu the functions matmul and matmul_kernel as declared and described in matmul.cuh. Be sure to follow the use of shared memory tiles. These should be based on the tiled matrix multiplication method presented in Lecture.
- Your implementation should work for arbitrary matrix dimension n ≤ 2^{15}.
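The tiled method is usually structured as below. This is only a sketch under assumed signatures (consult matmul.cuh and the lecture slides for the required interface); the float element type and a square block of threads are assumptions. The bounds checks zero-pad partial tiles, so n need not be a multiple of the tile width.

```cuda
#include <cuda_runtime.h>

// Sketch only: matmul.cuh defines the real signatures; float elements and
// a block_dim x block_dim thread block are assumptions for illustration.
__global__ void matmul_kernel(const float *A, const float *B, float *C,
                              unsigned int n) {
    // Dynamic shared memory holds one tile of A followed by one tile of B
    extern __shared__ float smem[];
    unsigned int bd = blockDim.x;  // tile width == block_dim
    float *sA = smem;
    float *sB = smem + bd * bd;

    unsigned int row = blockIdx.y * bd + threadIdx.y;
    unsigned int col = blockIdx.x * bd + threadIdx.x;
    float acc = 0.0f;

    // Walk tiles along the shared dimension; out-of-range loads become 0
    for (unsigned int t = 0; t < (n + bd - 1) / bd; ++t) {
        unsigned int aCol = t * bd + threadIdx.x;
        unsigned int bRow = t * bd + threadIdx.y;
        sA[threadIdx.y * bd + threadIdx.x] =
            (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        sB[threadIdx.y * bd + threadIdx.x] =
            (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before use

        for (unsigned int k = 0; k < bd; ++k)
            acc += sA[threadIdx.y * bd + k] * sB[k * bd + threadIdx.x];
        __syncthreads();  // done with these tiles before the next load
    }

    if (row < n && col < n) C[row * n + col] = acc;
}
```

Each element of A and B is read from global memory only n / block_dim times instead of n times, which is the source of the speedup over the naive kernel.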


- Write a test program task2.cu which does the following.

Creates and fills (however you like) row-major representations of n × n matrices A, B, and C in managed memory, where n is the first command line argument as below.

Uses your matmul function to produce C as the matrix product AB. Prints the first element of the resulting C.

Prints the last element of the resulting C.

Prints the time taken to run the matrix multiplication in milliseconds using CUDA events.
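CUDA event timing around the multiplication typically looks like the sketch below (the matmul call signature here is an assumption; the real one is in matmul.cuh):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed signature for illustration only; see matmul.cuh for the real one.
void matmul(const float *A, const float *B, float *C, unsigned int n,
            unsigned int block_dim);

void time_matmul(const float *A, const float *B, float *C, unsigned int n,
                 unsigned int block_dim) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    matmul(A, B, C, n, block_dim);  // the kernel launch(es) being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // block until the recorded work finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("%f\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Synchronizing on the stop event before calling cudaEventElapsedTime matters: kernel launches are asynchronous, so without it the measured time would not cover the whole multiplication.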

Compile: nvcc task2.cu matmul.cu -Xcompiler -O3 -Xcompiler -Wall -Xptxas -O3 -o task2

Run (where n and block_dim are positive integers and n is not necessarily a multiple of block_dim): ./task2 n block_dim

Example expected output: 1025.1 561.3

10256.2

- On an Euler compute node, run task2 for each value n = 2^{5}, 2^{6}, ..., 2^{15} and generate a plot of the time taken by your algorithm as a function of n.

- What is the best performing value of block_dim when n = 2^{15}?

- Present the best runtime for n = 2^{15} from your HW04 matrix multiplication task (naive GPU implementation). Explain why the tiled approach performs better.

- Present the runtime for n = 2^{15} from HW02 (serial implementation mmul1) (or state that it exceeds the Euler time limit). Explain why both GPU implementations perform better.