Description
You (plus an optional teammate) are tasked with writing the fastest possible matrix multiplication program for all machines. That means you cannot target one specific machine, but you are free to research the usual architecture specifications for personal and server machines. You may assume that everything is Intel architecture (x86_64) to make life easier.
Background Reading: Chapter 4.12
The matrix is column major. A naïve implementation is given in dgemmnaive.c, and you can run benchnaive to see the output.
void dgemm( int m, int n, float *A, float *C )
{
    for( int i = 0; i < m; i++ )
        for( int k = 0; k < n; k++ )
            for( int j = 0; j < m; j++ )
                C[i+j*m] += A[i+k*m] * A[j+k*m];
}
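Since both operands come from the same matrix, this kernel computes C = A·Aᵀ in column-major storage. The small self-check below (the 2×2 values are worked out by hand, not part of the handout) can confirm the indexing:

```c
#include <assert.h>

/* The handout's naive kernel, reproduced with the same indexing:
   A(i,k) = A[i + k*m] and C(i,j) = C[i + j*m], both column major. */
static void dgemm_naive(int m, int n, float *A, float *C)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < m; j++)
                C[i + j * m] += A[i + k * m] * A[j + k * m];
}
```

For the 2×2 column-major example A = {1, 2, 3, 4} (i.e. A(0,0)=1, A(1,0)=2, A(0,1)=3, A(1,1)=4), the matrix is A = [1 3; 2 4] and the result C = A·Aᵀ = [10 14; 14 20], stored as {10, 14, 14, 20}.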
C is where the result is stored, and all the calculations use just one input matrix, A. You are required to do all the calculations; no optimization is allowed on this front, to make benchmarking easier. The zip contains the following files:
Makefile: to make and benchmark
benchmark.c: do not modify. It checks results and produces performance numbers
dgemmnaive.c: naïve implementation as shown above
dgemmoptimize.c: your optimization
Choose at most 3 of the following common optimizations (1 per function, DO NOT combine). The project is worth 100 points; any extra points will go into your final exam as extra credit (if submitted on time; having multiple projects due or exams is not an excuse!).
• [20 points] Register blocking (reusing the same registers for multiple calculations)
• [20 points] Reordering of instructions (compiler peephole optimization)
• Cache optimizations (each subbullet counts as one optimization)
  o [40 points] Blocking/Tiling (trying to keep the data in the cache for large matrices)
  o [40 points] Prefetching
  o [40 points] Copying small matrix blocks into contiguous chunks of memory
  o [40 points] Precompute transpose for spatial locality
• [20 points] SSE instructions
• Loop optimizations (each subbullet counts as one optimization)
  o [20 points] Unrolling (at least 3 iterations)
  o [40 points] Reordering
• [40 points] Padding Matrices (odd sizes can hurt pipeline performance)
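As one illustration of the blocking/tiling subbullet, a sketch of what dgemmoptimize.c could contain follows. The tile size BLOCK and the function names are assumptions for illustration only; the right tile size depends on the cache sizes your research turns up. Hoisting A[i + k*m] into a local also shows a bit of register reuse:

```c
#include <assert.h>

#define BLOCK 32  /* assumed tile size; tune so the working set fits in L1 */

/* Tiled version of the naive kernel: identical arithmetic, but the
   j and k loops are split into BLOCK-sized tiles so each tile of A
   and C stays resident in cache while it is reused. */
static void dgemm_blocked(int m, int n, float *A, float *C)
{
    for (int jj = 0; jj < m; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int i = 0; i < m; i++)
                for (int k = kk; k < n && k < kk + BLOCK; k++) {
                    float a_ik = A[i + k * m];   /* reuse in a register */
                    for (int j = jj; j < m && j < jj + BLOCK; j++)
                        C[i + j * m] += a_ik * A[j + k * m];
                }
}

/* Reference copy of the handout's naive kernel, kept only to check
   that the tiled loops cover every (i, k, j) triple exactly once. */
static void dgemm_ref(int m, int n, float *A, float *C)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < m; j++)
                C[i + j * m] += A[i + k * m] * A[j + k * m];
}
```

Because the k index still ascends within and across tiles, the floating-point additions for each C(i, j) happen in the same order as in the naive kernel, so the two versions agree exactly, which is what benchmark.c needs.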
You should not use any libraries for parallel computing, such as OpenMP. You may assume multiple cores. Anything else you can find or think of is fine for increasing performance. Just remember to calculate all the results (copying from one part of the resulting matrix to another is not allowed).
Your solution will be run across different machines and the results aggregated.
Note that you should not optimize just for your computer or one particular matrix size. Use your knowledge of computer architecture and all the modern features that try to accelerate execution. Caches will play a big role, but it is not safe to assume a particular architecture; in general, optimizing memory accesses will lead to big gains. Matrix size and corner cases will also matter, as the same optimization will not work across the board. Have fun with this project.
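As an illustration of the SSE bullet, here is a sketch under the assumption that SSE1 intrinsics from <xmmintrin.h> are available (they are baseline on x86_64; the function names are illustrative). The loops are reordered to (j, k, i) so that the innermost index i is unit-stride in both C and A for column-major storage, then vectorized four floats at a time, with a scalar tail when m is not a multiple of 4:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; part of baseline x86_64 */

/* Same arithmetic as the naive kernel, loops reordered to (j, k, i)
   so i walks C[i + j*m] and A[i + k*m] contiguously, then vectorized.
   Unaligned loads/stores avoid any alignment assumption. */
static void dgemm_sse(int m, int n, float *A, float *C)
{
    for (int j = 0; j < m; j++)
        for (int k = 0; k < n; k++) {
            __m128 s = _mm_set1_ps(A[j + k * m]);  /* broadcast A(j,k) */
            int i = 0;
            for (; i + 4 <= m; i += 4) {
                __m128 c = _mm_loadu_ps(&C[i + j * m]);
                __m128 a = _mm_loadu_ps(&A[i + k * m]);
                _mm_storeu_ps(&C[i + j * m],
                              _mm_add_ps(c, _mm_mul_ps(a, s)));
            }
            for (; i < m; i++)  /* scalar tail for the m % 4 leftovers */
                C[i + j * m] += A[i + k * m] * A[j + k * m];
        }
}

/* Scalar reference (the handout's naive kernel) for checking. */
static void dgemm_scalar(int m, int n, float *A, float *C)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < m; j++)
                C[i + j * m] += A[i + k * m] * A[j + k * m];
}
```

For each C(i, j) the additions over k still happen in ascending order, so the vectorized version matches the scalar one on inputs whose products and sums are exactly representable.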