## Description

Submit responses to all tasks which don’t specify a le name to Canvas in a le called assign-ment2.txt, docx, pdf, rtf, odt (choose one of the formats). Also submit all plots on Canvas. Do not zip your Canvas submission. All source les should be submitted in the HW02 subdirec-tory on the master branch of your git repo. The HW02 subdirectory should have no subdirectories.

All commands or code must work on Euler with no modules loaded unless speci ed otherwise. Your commands and code may behave di erently on your computer; please be sure to test on Euler before you submit.

Please submit clean code. Consider using a formatter like __clang-format__.

IMPORTANT: Before you begin, copy any provided les from HW02 of the __ME759-2020 repo__.

- a) Implement the Scanfunction in a le called cpp with signature as in scan.h. You should write the exclusive scan
^{1}yourself and not use any library scan function.

- Write a le cppwith a main function which (in this order)

Creates an array of n random floats however you like between -1.0 and 1.0. n should be read as the rst command line argument as below.

Scans the array using your Scan function.

Prints out the time taken by your Scan function in milliseconds^{2}. Prints the rst element of the output scanned array.

Prints the last element of the scanned array.

This le task1.cpp and its output won’t be closely graded; simply make sure that you create reasonable input for the Scan function.

Compile: g++ scan.cpp task1.cpp -Wall -O3 -o task1

Run (where n is a positive integer): ./task1 n

Example output (followed by newline):

0.01

0.0

103.3

- c) On an Euler compute node (using a Slurm Job), run task1for each value n= 2
^{10}; x^{11}; ; 2^{30}and generate a plot pdf (with axis labels) which plots the time taken by your algorithm as a function of n. This is called a scaling analysis.

Feel free to post your scaling analysis plot in case you want to help other colleagues get an idea of what they should obtain at the end of this exercise.

^{1}Given an array [a_{0}; a_{1}; ; a_{n}], the exclusive scan function produces the array [0; a_{0}; a_{0}+a_{1}; a_{0}+a_{1}+a_{2}; ; a_{0}+ a_{1} + + a_{n} _{1}]. (In general, a scan can involve any other associative operation, not only +, like in this example.)

- Recall the document md.

- Convolutions
^{3}appear very prominently in image processing and in other elds like numerical solution of partial di erential equations.

- Implement the Convolvefunction in a le called cpp with signature as in convolution.h.

m 1 m 1 | m | 1 | m | 1 | |||

g[x; y] = | ![i; j]f[x + i | ; y + j | ]x; y = 0; ; n 1 | ||||

2 | 2 | ||||||

X X_{j} |
_{ } |
||||||

i=0 | =0 |

where f is the original image, ! is the mask, and g is the result of the convolution. m is the dimension of the square matrix !. There are several ways of dealing with edges, but here we’ll just assume that f[i; j] = 0 whenever 0 i < n; 0 j < n is not satis ed.

Example:

- 3

1348

f = | 6 | 5 | 2 | 4 | 7 | ||

6_{1} |
_{ } |
4 | 5 | 2 | |||

6 | 3 | 4 | 6 | 8 | 7 | ||

4 | 5 | ||||||

2 | 3 | ||||||

! = | 0 | 1 | 0 | ||||

4 | 0 | 0 | 1 | 5 | |||

1 | 0 | 0 |

- 3

19910

6 | 9 | 12 | 14 | 10 | 7 | |

5 | 10 | 13 | 2 | |||

g = | 6 | 8 | 7 | 14 | 13 | 7 |

4 | 5 | |||||

- Write a le cppwith a main function which (in this order)

Creates an n n image matrix (stored in 1D in row-major order) of random floats however you like. The value of n should be read as the rst command line argument.

Creates a 3 3 mask matrix (stored in 1D in row-major order) of whatever mask values you like.

Applies the mask to image using your Convolve function.

Prints out the time taken by your Convolve function in milliseconds. Prints the rst element of the resulting convolved array.

Prints the last element of the resulting convolved array.

The le task2.cpp and its output won’t be closely graded; simply make sure that you create reasonable input for the Convolve function.

Compile: g++ convolution.cpp task2.cpp -Wall -O3 -o task2

Run (where n is a positive integer): ./task2 n Example output (followed by newline):

0.1

1.5

52.36

- See
__here____for more on convolutions in image processing, just note that we use a slightly di erent formulation.__

2

- Implement in a le called cppthe four functions with signatures and descriptions as in matmul.h to produce the matrix product C = AB. For all of the cases, the array C that stores the matrix C should be reported in row-major order.

- mmul1 should have threefor loops: the outer loop sweeps index i through the rows of C, the middle loop sweeps index j through the columns of C, and the innermost loop sweeps index k through the dot product of the i
^{th}row A with the j^{th}column of B. Inside the innermost loop, you should have a single line of code which increments C_{ij}. Assume that A and B are 1D arrays storing the matrices in row-major order.

- mmul2 should also have threefor loops, but the two outermost loops should be swapped relative to mmul1. That is the only di erence between mmul1 and mmul2.

- mmul3 should have thefor loops ordered as in mmul1. Assume that A is stored in row-major order and B is stored in column-major order. The only di erence between this and mmul1 should be the index calculations.

- mmul4 should have thefor loops ordered as in mmul1. Assume that A is stored in column-major order and B is stored in row-major order. The only di erence between this and mmul1 should be the index calculations.

- Write a program cppthat accomplishes the following:

generates square matrices A and B of dimension at least 1000 1000 stored in row-major order.

computes the matrix product C = AB using each of your functions (note that you may have to rearrange the input arrays to be in the correct storage pattern (row major or column major order) for each function in order to represent the same matrix). Your result stored in matrix C should be the same no matter which function de ned at a) through d) above you call.

prints the number of rows of your input matrices.

for each mmul function in ascending order, prints the amount of time taken in milliseconds and the last element of the resulting C.

Compile command: g++ task3.cpp matmul.cpp -Wall -O3 -o task3

Run command: ./task3 Sample expected output:

1024

1.23

2.365

1.23

2.365

1.23

2.365

1.23

2.365

- In a couple sentences, explain the di erence that you see in the times for mmul3and mmul4 when running on an Euler compute node. Make sure to discuss the hardware design and access patterns that cause this di erence.

3