## Description

Submit responses to all tasks which don’t specify a le name to Canvas in a le called assign-ment9.txt, docx, pdf, rtf, odt (choose one of the formats). Also, all plots should be submitted in Canvas. All source les should be submitted in the HW09 subdirectory on the master branch of your homework git repo with no subdirectories.

All commands or code must work on Euler without loading additional modules unless speci ed otherwise. They may behave di erently on your computer, so be sure to test on Euler before you submit. For the OpenMP related tasks, i.e. Task 1 and Task 2, the following speci cations need to be included in your slurm script:

#SBATCH –nodes=1 –cpus-per-task=20 (or -N 1 -c 20 for short).

For Task 3 (MPI related), on top of the slurm script used for OpenMP, the following changes need to be made to your slurm script:

Remove #SBATCH –cpus-per-task=20 (or -c 20 for short).

Add #SBATCH –ntasks-per-node=2. This speci cation is only for this assignment, since only two processes are needed for task 3.

Please submit clean code. Consider using a formatter like __clang-format__.

- Before you begin, copy the provided les from HW09of the
__ME759-2020 repo__.

- In this task, you are given a function that displays a false sharing issue. Speci cally, in your code, each thread will calculate the sum of the distances between a \center point” associated with this thread and a chunk of entries of a large reference array arrof size n. The function declaration is in h; its de nition is in cluster.cpp. You will need to use the techniques discussed in class (Lecture 24, slides 21-24) to x the false sharing issues and assess any performance bene ts (false sharing vs. no sharing). To that end:

- Modify the current cpple to solve the false sharing issue. You may use macros if needed.

- Write a program cppthat will accomplish the following:

Create and ll with int type random numbers an array arr of length n where n is the rst command line argument, see below. The range for these random integers is [0, n].

Sort arr (std::sort will do).

Create and ll an array centers of length t (if no padding is used; this length can be modi ed if padding is used), where t is the number of threads and it’s the second command line argument, see below. Assuming that n is always a multiple of 2 * t (your code does not need to account for the cases where n is not a multiple of 2 * t), the entries in the array centers should be:

n | ; | 3n | ; :::; | (2t | 1)n | _{:} |
|||

2t | 2t | 2t |

Create and initialize with zeros an array dists of length t (if no padding is used; this length can be modi ed if padding is used), where t is the number of threads and it’s the second command line argument, see below.

Call the cluster function and save the output distances to the dists array. Calculate the maximum distance in the dists array.

Print the maximum distance.

Print the partition ID (the thread number) that has the maximum distance. Print the time taken to run the cluster function in milliseconds.

Compile: g++ task1.cpp cluster.cpp -Wall -O3 -o task1 -fopenmp

Run (where n is a positive integer, t is an integer in the range [1, 10] ):

./task1 n t

1

Example expected output: 32516 2

0.032

A more concrete example (also given in the cluster.h le):

Input: arr = [0, 1, 3, 4, 6, 6, 7, 8], n = 8, t = 2.

Expected centers: [2, 6]. | ||||

Expected dists: [6, 3], where | 2j, | |||

6 = j0 | 2j + j1 | 2j + j3 | 2j + j4 | |

3 = j6 | 6j + j6 | 6j + j7 | 6j + j8 | 6j. |

Your code needs to output on the rst line 6, and on the second line 0.

- On an Euler compute node:

Run task1 for value n = 50400, and value t = 1; 2; ; 10. Generate a plot called task1.pdf which plots time taken by your cluster function vs. t in linear-linear scale. If there are irregular spikes in the run time, run the cluster function for 10 times and use the average time for plotting.

2

- In this task, you will implement an estimation of using the Monte Carlo Method
^{1}. The idea is to generate nrandom oats in the range [ r; r], where r is the radius of a circle, and count the number of oats that reside in the circle (call it incircle), then use 4 * incircle / n as the estimation of . When n becomes very large, this estimation could be fairly accurate. Upon implementing this method with omp for, you will also compare the performance di erence when the simd directive is added.

- Implement in a le called cppwith the prototype speci ed in montecarlo.h the function that accomplishes the Monte Carlo Method.

- Your program cppshould accomplish the following:

Create and ll with float-type random numbers array x of length n, where n is the rst command line argument, see below. x should be drawn from the range [-r, r], where r is the circle radius and it can be set to 1.0.

Create and ll with float-type random numbers array y of length n, where n is the rst command line argument, see below. y should be drawn from the range [-r, r], where r is the circle radius and it can be set to 1.0.

Call the montecarlo function that returns the number of points that reside in the circle.

Print the estimated .

Print the time taken to run the montecarlo function in milliseconds.

Compile^{2}: g++ task2.cpp montecarlo.cpp -Wall -O3 -o task2 -fopenmp -fno-tree-vectorize -march=native -fopt-info-vec

Run (where n is a positive integer, t is an integer in the range [1, 10] ):

./task2 n t

Example expected output: 3.1416

0.352

- On an Euler compute node:

Run task2 for n = 10^{6}, and t = 1; 2; ; 10. Generate a gure called task2.pdf which includes two patterns:

{ The time taken by your montecarlo function without the simd directive vs. t in linear-linear scale.

{ The time taken by your montecarlo function with the simd directive vs. t in linear-linear scale.

If there are irregular spikes in the run time, run the montecarlo function for 10 times and use the average time for plotting.

- See the details about Monte Carlo Method in
__this____It also has an explanation about estimation in the second paragraph of its Overview section__

^{2}Some notes about the compiler ags: -fno-tree-vectorize helps to avoid auto-vectorization that does not come from the OpenMP simd directive; -march=native helps to map to the vector size available on the hardware; -fopt-info-vec outputs a message when vectorization is achieved. You can compile with gcc/9.2.0 (module load gcc/9.2.0) to see the size of the vector that simd can achieve.

3

- In this task, you will write an MPI program to quantify the communication latency and bandwidth between two MPI processes (executed on the same node) using MPI Sendand MPI Recv.

- Write a program cppthat will accomplish the following:

Create and ll two bu er arrays with float-type numbers however you like that represent the messages to be sent or received. Both arrays should have length n, where n is the rst command line argument, see below.

Initialize the variables necessary for the function call of MPI Send and MPI Recv. Implement the procedure listed below for the communication between two processes:

i f | ( rank == 0 ) f | |

// s t a r t t i m i n g t 0 | ||

MPI Send ( . . . ) ; | ||

MPI Recv ( . . . ) ; | ||

g | // end | t i m i n g t 0 |

e l s e i f | ( rank == 1 ) f | |

// s t a r t t i m i n g t 1 | ||

MPI Recv ( . . . ) ; | ||

MPI Send ( . . . ) ; | ||

g | // end | t i m i n g t 1 |

Print the time taken for both processes to complete (t0+t1) in milliseconds.

Compile^{1}: mpicxx task3.cpp -Wall -O3 -o task3

Run^{2} (where n is a positive integer): mpirun -np 2 task3 n

Example expected output: 3.45

- On an Euler compute node:

Run task3 for value n = 2^{1}; 2^{2}; ; 2^{25}. Generate a plot called task3.pdf which plots the time taken by the communication between two processes vs. n in log-log scale.

Based on the timing results, roughly estimate the latency and bandwidth of this point-to-point communication and discuss how well these estimations match with the hardware speci cs.

- Use module load mpi/openmpi

^{ }

^{2}Use module load mpi/openmpi

4