Lab 9 Solution


Setup

Download Makefile, sseTest.c and sum.c to an appropriate location in your home directory.

Note that we are using SSE and SSE2 in this lab, since they are enabled by default in GCC on x86-64 platforms. If your CPU does not support SSE and SSE2 (which is very unlikely), switch to a machine that does.

Exercises


Exercise 1: Familiarize Yourself

Given the large number of available SIMD intrinsics, we want you to learn how to find the ones you'll need in your application.

Open the Intel Intrinsics Guide. Do your best to interpret the new syntax and terminology. Find the 128-bit intrinsics for the following SIMD operations (one for each):

Four floating-point divisions in single precision (i.e. float)

Sixteen max operations over signed 8-bit integers (i.e. char)

Arithmetic shift right of eight signed 16-bit integers (i.e. short)
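
For reference, one plausible answer set is sketched below (confirm each name in the Intrinsics Guide yourself). Note that the signed 8-bit max, _mm_max_epi8, is an SSE4.1 intrinsic, so it may need -msse4.1 when compiling; everything else here is SSE/SSE2.

    #include <smmintrin.h>   /* pulls in SSE through SSE4.1 */

    /* Hypothetical demo of candidate intrinsics (not part of the lab files). */
    void exercise1_demo(void)
    {
        __m128  f = _mm_set1_ps(8.0f), g = _mm_set1_ps(2.0f);
        __m128  q = _mm_div_ps(f, g);        /* four single-precision divisions        */

        __m128i x = _mm_set1_epi8(-3), y = _mm_set1_epi8(5);
        __m128i m = _mm_max_epi8(x, y);      /* sixteen signed 8-bit max operations    */

        __m128i v = _mm_set1_epi16(-64);
        __m128i s = _mm_srai_epi16(v, 3);    /* arithmetic shift right of eight shorts */

        (void)q; (void)m; (void)s;           /* results unused in this sketch          */
    }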

Checkoff

Record these intrinsics in a text file to show your TA.

Exercise 2: Reading SIMD Code

In this exercise you will consider the vectorization of 2-by-2 matrix multiplication in double precision:

C = C + A × B, where the 2-by-2 matrices A, B, and C are stored column-major in arrays of four doubles.

This amounts to the following arithmetic operations:

C[0] += A[0]*B[0] + A[2]*B[1];

C[1] += A[1]*B[0] + A[3]*B[1];

C[2] += A[0]*B[2] + A[2]*B[3];

C[3] += A[1]*B[2] + A[3]*B[3];

You are given the code sseTest.c that implements these operations in a SIMD manner.

The following intrinsics are used:

__m128d _mm_loadu_pd( double *p )              returns vector (p[0], p[1])
__m128d _mm_load1_pd( double *p )              returns vector (p[0], p[0])
__m128d _mm_add_pd( __m128d a, __m128d b )     returns vector (a0+b0, a1+b1)
__m128d _mm_mul_pd( __m128d a, __m128d b )     returns vector (a0*b0, a1*b1)
void    _mm_storeu_pd( double *p, __m128d a )  stores p[0]=a0, p[1]=a1
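
As a reading aid, here is a sketch of how these five intrinsics can express the 2-by-2 update above, assuming column-major storage and a small double loop; the actual sseTest.c may be organized differently.

    #include <emmintrin.h>   /* SSE2 double-precision intrinsics */

    /* Sketch only: C += A*B for 2-by-2 column-major matrices of doubles. */
    static void mmul_2x2_sse(double *A, double *B, double *C)
    {
        for (int j = 0; j < 2; j++)                      /* column j of B and C   */
        {
            __m128d c = _mm_loadu_pd(C + 2*j);           /* (C[2j], C[2j+1])      */
            for (int k = 0; k < 2; k++)
            {
                __m128d a = _mm_loadu_pd(A + 2*k);       /* column k of A         */
                __m128d b = _mm_load1_pd(B + 2*j + k);   /* broadcast B[2j+k]     */
                c = _mm_add_pd(c, _mm_mul_pd(a, b));     /* c += a * B[2j+k]      */
            }
            _mm_storeu_pd(C + 2*j, c);                   /* write the column back */
        }
    }

Fully unrolled (j and k each take only the values 0 and 1), this is exactly the four C[...] updates listed above, computed two elements at a time.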

Compile sseTest.c into x86 assembly by running:


make sseTest.s

Find the for-loop in sseTest.s and identify what each intrinsic is compiled into. Does the loop actually exist? Comment the loop so that your TA can see that you understand the code.


Checkoff

Show your commented code to your TA and explain the for-loop.

Exercise 3: Writing SIMD Code

For Exercise 3, you will vectorize/SIMDize the following code to achieve approximately 4x speedup over the naive implementation shown here:

static int sum_naive(int n, int *a)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += a[i];
    }
    return sum;
}

You might find the following intrinsics useful:

__m128i _mm_setzero_si128( )                    returns 128-bit zero vector
__m128i _mm_loadu_si128( __m128i *p )           returns 128-bit vector stored at pointer p
__m128i _mm_add_epi32( __m128i a, __m128i b )   returns vector (a0+b0, a1+b1, a2+b2, a3+b3)
void    _mm_storeu_si128( __m128i *p, __m128i a )  stores 128-bit vector a at pointer p


Start with sum.c. Use SSE intrinsics to implement the sum_vectorized() function.
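
If you get stuck, the following sketch shows one way sum_vectorized() could be written using only the four intrinsics above; your solution in sum.c may differ in details such as the tail handling.

    #include <emmintrin.h>   /* SSE2 integer intrinsics; sum.c likely includes this already */

    static int sum_vectorized(int n, int *a)
    {
        __m128i partial = _mm_setzero_si128();
        /* add four ints per iteration */
        for (int i = 0; i < n / 4 * 4; i += 4)
        {
            partial = _mm_add_epi32(partial, _mm_loadu_si128((__m128i *) (a + i)));
        }
        /* reduce the four lane sums to one scalar */
        int tmp[4];
        _mm_storeu_si128((__m128i *) tmp, partial);
        int sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
        /* tail case: leftover elements when n is not a multiple of 4 */
        for (int i = n / 4 * 4; i < n; i++)
        {
            sum += a[i];
        }
        return sum;
    }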

To compile your code, run the following command:


make sum


Checkoff

Show your TA your working code and performance improvement.

Exercise 4: Loop Unrolling

Happily, you can obtain even more performance improvement! Carefully unroll the SIMD vector sum code that you created in the previous exercise. This should get you about a factor of 2 further increase in performance. As an example of loop unrolling, consider the supplied function sum_unrolled():


static int sum_unrolled(int n, int *a)
{
    int sum = 0;

    // unrolled loop
    for (int i = 0; i < n / 4 * 4; i += 4)
    {
        sum += a[i+0];
        sum += a[i+1];
        sum += a[i+2];
        sum += a[i+3];
    }

    // tail case
    for (int i = n / 4 * 4; i < n; i++)
    {
        sum += a[i];
    }

    return sum;
}

Also, feel free to check out Wikipedia’s article on loop unrolling for more information.

Within sum.c, copy your sum_vectorized() code into sum_vectorized_unrolled() and unroll it four times.
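
A possible shape for the unrolled version, assuming the sum_vectorized() sketch above (four independent vector accumulators, 16 ints per iteration); your variable names and tail handling may differ.

    static int sum_vectorized_unrolled(int n, int *a)
    {
        __m128i s0 = _mm_setzero_si128();
        __m128i s1 = _mm_setzero_si128();
        __m128i s2 = _mm_setzero_si128();
        __m128i s3 = _mm_setzero_si128();
        /* unrolled loop: 16 ints per iteration, four independent accumulators */
        for (int i = 0; i < n / 16 * 16; i += 16)
        {
            s0 = _mm_add_epi32(s0, _mm_loadu_si128((__m128i *) (a + i)));
            s1 = _mm_add_epi32(s1, _mm_loadu_si128((__m128i *) (a + i + 4)));
            s2 = _mm_add_epi32(s2, _mm_loadu_si128((__m128i *) (a + i + 8)));
            s3 = _mm_add_epi32(s3, _mm_loadu_si128((__m128i *) (a + i + 12)));
        }
        /* combine the four accumulators, then reduce to a scalar */
        s0 = _mm_add_epi32(_mm_add_epi32(s0, s1), _mm_add_epi32(s2, s3));
        int tmp[4];
        _mm_storeu_si128((__m128i *) tmp, s0);
        int sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
        /* tail case */
        for (int i = n / 16 * 16; i < n; i++)
        {
            sum += a[i];
        }
        return sum;
    }

Using separate accumulators (rather than one) lets the CPU overlap the independent vector additions, which is where the extra factor of about 2 comes from.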

To compile your code, run the following command:


make sum


Checkoff:

Show your TA the unrolled implementation and performance improvement.

Exercise 5: Switch on Compiler Optimization

Modify the Makefile to activate compiler optimization (e.g. -O2).
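
For example, if the Makefile collects its compiler flags in a CFLAGS variable (the exact variable name in the provided Makefile may differ), the change could be as small as:

    # Hypothetical edit: append -O2 to whatever flags the Makefile already passes to gcc
    CFLAGS += -O2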

To compile your code, run the following command:


make sum


Checkoff:

Show your TA the performance of the compiler-optimized code. Explain the results.

soerensch AT shanghaitech.edu.cn

wangchd AT shanghaitech.edu.cn

Modeled after UC Berkeley’s CS61C.

Last modified: 2020-05-01
