Warming up on CUDA Threads and Shared Memory Model
Some of the most important factors that affect performance are related to memory. We have discussed the effects of caches and some strategies for optimizing code to take advantage of memory layout on the CPU. Similar situations arise when working with GPUs. As we well know, memory is one of the biggest limitations, whether in terms of layout, size, or bandwidth.
Lab 4 required you to use vectors of different sizes (2^7, 2^9, 2^12 and 2^15) and run the kernel for vector addition. In this lab you will experiment with different execution configurations to solve vector addition.
Investigate and list the hardware specifications of the GPU that you will be using in this lab. Include this information in your report.
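One possible way to gather these specifications is to query the CUDA runtime directly. The sketch below assumes a working CUDA toolkit and prints a handful of fields from `cudaDeviceProp`; the helper name `printDeviceSpecs` and the choice of fields are illustrative, not a required format for the report.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints selected properties of every visible GPU.
// Returns the number of devices found, or -1 if the query itself fails.
int printDeviceSpecs() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        std::printf("Device %d: %s\n", dev, prop.name);
        std::printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        std::printf("  Global memory:           %.1f MiB\n",
                    prop.totalGlobalMem / (1024.0 * 1024.0));
        std::printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        std::printf("  Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
        std::printf("  Warp size:               %d\n", prop.warpSize);
        std::printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        std::printf("  Max block dimensions:    %d x %d x %d\n",
                    prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        std::printf("  Max grid dimensions:     %d x %d x %d\n",
                    prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return deviceCount;
}

int main() {
    return printDeviceSpecs() < 0 ? 1 : 0;
}
```

The "Max threads per block" value is the one that matters most for the execution configurations explored later in the lab.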
Write a simple sequential CPU implementation of vector addition to create a reference result vector. You will compare your results from running on the GPU with the ones from the sequential implementation.
Use your code from Program 0 to get started.
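A minimal sketch of such a reference implementation, together with a comparison routine, might look like the following. The names `vecAddCPU` and `countMismatches` are assumptions made for illustration, and the "GPU result" in `main` is only a stand-in copy, since the actual device result would come from your own kernel.

```cuda
#include <cstddef>
#include <cstdio>
#include <vector>

// Sequential reference: c[i] = a[i] + b[i] for every element.
void vecAddCPU(const int* a, const int* b, int* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Counts the positions where the two result vectors disagree.
std::size_t countMismatches(const int* expected, const int* actual, std::size_t n) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) ++mismatches;
    }
    return mismatches;
}

int main() {
    std::vector<int> a = {1, -2, 3, 10};
    std::vector<int> b = {4, 5, -6, -10};
    std::vector<int> reference(a.size());
    vecAddCPU(a.data(), b.data(), reference.data(), a.size());

    // Stand-in for the vector copied back from the GPU (assumption for this sketch).
    std::vector<int> gpuResult = reference;

    std::printf("mismatches: %zu\n",
                countMismatches(reference.data(), gpuResult.data(), a.size()));
    return 0;
}
```

Because the inputs are integers, an exact element-by-element comparison is sufficient; no floating-point tolerance is needed.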
For this lab, use arrays of random integers between -10 and 10 and vary the vector size (number of elements) between 2^5 and 2^20 for starters (later we can move to try larger sizes).
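As a rough sketch of the input generation, the host code below fills a vector with random integers and loops over the sizes 2^5 through 2^20. It assumes the range -10 to 10 is inclusive at both ends and uses a fixed seed so runs are repeatable; `makeRandomVector` is a hypothetical helper name.

```cuda
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical helper: n random integers drawn uniformly from [-10, 10].
// The inclusive bounds are an assumption about what "between -10 and 10" means.
std::vector<int> makeRandomVector(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(-10, 10);
    std::vector<int> v(n);
    for (int& x : v) {
        x = dist(gen);
    }
    return v;
}

int main() {
    // Vector sizes 2^5, 2^6, ..., 2^20.
    for (int exponent = 5; exponent <= 20; ++exponent) {
        std::size_t n = static_cast<std::size_t>(1) << exponent;
        std::vector<int> a = makeRandomVector(n, 1234u);
        std::vector<int> b = makeRandomVector(n, 5678u);
        std::printf("2^%d = %zu elements (a[0] = %d, b[0] = %d)\n",
                    exponent, n, a[0], b[0]);
    }
    return 0;
}
```

Using different seeds for the two input vectors avoids adding a vector to an identical copy of itself.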
Use different execution configurations (of blocks and threads) for each array size.
Start with 1 block and vary the number of threads (start with 8 and increase up to 512) as well as the array size.
Next, vary the number of blocks (1, 8, …, 512). For each number of blocks, keep the number of threads per block fixed and record the results. Vary the array size as before.
Based on your observations from above, start varying the number of threads per block together with the number of blocks.
Try odd/corner cases and report what happens! For instance, assign 2048 threads per block, or use an execution configuration with 1 block and 1 thread.
Note: remember that as the size of the array grows you will need to adjust the number of blocks (as in Program 0). Next, increase the number of threads by powers of 2 until you get to 512 or 1024 threads per block (depending on the GPU architecture).
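The experiments above could be driven by something like the following sketch. It assumes a one-element-per-thread kernel with a bounds check, computes the block count by rounding up, and checks the launch for errors so that corner cases such as 2048 threads per block show up as a message rather than a silent failure. The names `vecAdd`, `blocksFor`, and `launchVecAdd` are illustrative and are not taken from Program 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element; the bounds check guards the last, partially filled block.
__global__ void vecAdd(const int* a, const int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Smallest number of blocks such that blocks * threadsPerBlock >= n.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Launches the kernel and reports the first error encountered, if any.
cudaError_t launchVecAdd(const int* dA, const int* dB, int* dC,
                         int n, int blocks, int threads) {
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    cudaError_t err = cudaGetLastError();   // reports invalid launch configurations
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // reports errors raised while running
}

int main() {
    const int n = 1 << 12;
    const size_t bytes = n * sizeof(int);
    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemset(dA, 0, bytes);
    cudaMemset(dB, 0, bytes);

    // 2048 is included deliberately as a corner case.
    const int threadOptions[] = {8, 64, 512, 2048};
    for (int threads : threadOptions) {
        int blocks = blocksFor(n, threads);
        cudaError_t err = launchVecAdd(dA, dB, dC, n, blocks, threads);
        std::printf("<<<%d, %d>>>: %s\n", blocks, threads, cudaGetErrorString(err));
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

Note that with this kernel, a configuration whose total thread count (blocks times threads per block) is smaller than the array size leaves the trailing elements uncomputed. Comparing against the CPU reference should reveal this, and it is worth describing in the report.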
Collect, report, and plot two different times:
1. Time including memory communications from host to device and vice versa (e.g., copying input data and output)
2. Time for only the mathematical operations
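One way to measure the two times listed above is with CUDA events, as in the sketch below. It assumes that "time including memory communications" covers the host-to-device copies, the kernel, and the device-to-host copy, but not `cudaMalloc`; that boundary is a choice made for this sketch. The helper `timeVecAdd` and the CSV-style output are likewise illustrative.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void vecAdd(const int* a, const int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: times one configuration, in milliseconds.
// totalMs covers copies + kernel; kernelMs covers the kernel only.
bool timeVecAdd(int n, int blocks, int threads, float* totalMs, float* kernelMs) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    std::vector<int> hA(n, 1), hB(n, 2), hC(n, 0);   // placeholder input data
    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cudaEvent_t startTotal, startKernel, stopKernel, stopTotal;
    cudaEventCreate(&startTotal);
    cudaEventCreate(&startKernel);
    cudaEventCreate(&stopKernel);
    cudaEventCreate(&stopTotal);

    cudaEventRecord(startTotal);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(startKernel);
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    cudaEventRecord(stopKernel);
    cudaError_t launchErr = cudaGetLastError();      // invalid configurations show up here
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stopTotal);
    cudaError_t syncErr = cudaEventSynchronize(stopTotal);

    cudaEventElapsedTime(kernelMs, startKernel, stopKernel);
    cudaEventElapsedTime(totalMs, startTotal, stopTotal);

    cudaEventDestroy(startTotal);
    cudaEventDestroy(startKernel);
    cudaEventDestroy(stopKernel);
    cudaEventDestroy(stopTotal);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return launchErr == cudaSuccess && syncErr == cudaSuccess;
}

int main() {
    const int threads = 256;                         // one fixed configuration
    std::printf("n,blocks,threads,total_ms,kernel_ms\n");
    for (int exponent = 5; exponent <= 20; ++exponent) {
        int n = 1 << exponent;
        int blocks = (n + threads - 1) / threads;
        float totalMs = 0.0f, kernelMs = 0.0f;
        if (timeVecAdd(n, blocks, threads, &totalMs, &kernelMs)) {
            std::printf("%d,%d,%d,%f,%f\n", n, blocks, threads, totalMs, kernelMs);
        }
    }
    return 0;
}
```

The first measurement in a run may include one-time startup costs, so it can be worth doing an untimed warm-up launch or averaging several repetitions before plotting.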
For a single execution configuration, plot vector size (number of elements) vs. the two different times.
Suggestion: think about plotting time, vector size and number of blocks (or number of threads) in a single graph.
Report your findings! Include the following in your report:
GPU specifications
Different execution combinations
Screenshots of your code execution
Plots as described in Part 3
Submit source code in a zip file.
Submit the report separately as a PDF or DOCX file (do not exceed 8 pages, including figures).