ICS632 Assignment #2

For this assignment turn in an archive of a single directory with your code
(exercise1.c, exercise2.c, exercise3.c) and your report. For your report a
README plain text file is absolutely fine, but if you want to turn in a README.pdf
that’s fine (no points awarded for prettiness, only for readability and content).
At the beginning of this file state the hardware, OS (with version number), and
compiler (with version number) that you used to run your experiments.
Warning: if you’re running on a Mac OS X machine, OpenMP may not
work correctly unless you have gcc 4.8 or above (your program may abort with
"Abort trap: 6"). You simply have to upgrade your compiler (the default
compiler is often a 4.2 Darwin build with a known OpenMP bug).
Exercise #1 (30%)
In this exercise we parallelize the i-k-j non-blocked matrix multiplication
program from Exercise #1 in Assignment #1. For this assignment, and all code
we’ll write throughout the semester, we always compile code with the highest
level of compiler optimization ("-O4").
1. Setting the number of OpenMP threads to the number of physical cores
on your machine, produce 3 versions of the program, where in each version
you parallelize a different loop with an OpenMP pragma (taking care of
declaring the correct variables private or shared).
• Here again, each matrix multiplication should be a different run of
an executable. C macros and the -D option of gcc are your friends if
you want to avoid having 3 distinct copies of your code. If you have
multiple copies, name them exercise1_1.c, exercise1_2.c, etc.
• BEWARE: Parallelizing some of these loops with a simple OpenMP
parallel for pragma may cause race conditions. You must prevent
race conditions using atomic operations whenever necessary. State
in your README which loop parallelizations produce race conditions
and explain why.
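To illustrate the macro approach, here is one possible sketch that folds the three variants into a single source file. The macro names (PAR_I, PAR_K, PAR_J), the function name, and the row-major flat-array layout are illustrative choices, not prescribed by the assignment:

```c
/* Illustrative sketch: three parallelization variants selected at compile
 * time, e.g.  gcc -O4 -fopenmp -DPAR_I exercise1.c
 * C is assumed to be zero-initialized by the caller. */
void matmul_ikj(int n, const double *A, const double *B, double *C)
{
    int i, j, k;
#if defined(PAR_I)
    /* Safe: each thread works on distinct rows i of C. */
    #pragma omp parallel for private(k, j) shared(A, B, C, n)
#endif
    for (i = 0; i < n; i++) {
#if defined(PAR_K)
        /* RACE: different k-iterations all update the same C[i*n + j];
         * every "+=" below would need "#pragma omp atomic" to be correct. */
        #pragma omp parallel for private(j)
#endif
        for (k = 0; k < n; k++) {
#if defined(PAR_J)
            /* Safe: each j-iteration writes a distinct C[i*n + j]. */
            #pragma omp parallel for
#endif
            for (j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
        }
    }
}
```

Without -fopenmp (or without any -D flag) the pragmas are ignored and the code runs sequentially, which is convenient for the sequential baseline in part 3.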
2. Report on wall-clock time results for the 3 versions. Discuss their relative
performance and venture explanations.
• It is possible that some versions of the code produce very high wall-clock
times. Explain why, and feel free to have a time-out in your code or
to interrupt the execution when it takes too long without completing
the trials.
• Feel free to reduce N in case wall-clock times are too high, but make
sure that N is large enough so that the fastest version runs in, say,
more than 1 second.
• Compute average wall-clock times over at least 10 trials.
3. Compare the fastest of the 3 versions to the purely sequential version (i.e.,
not compiled with -fopenmp). How much of a speedup do you observe?
What is the parallel efficiency? If the parallel efficiency is far from
100% (e.g., by more than just a few percent), why do you think that is?
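For reference, speedup and parallel efficiency follow directly from the measured times and the thread count (the numbers below are purely illustrative):

```c
/* Speedup: how many times faster the parallel run is than the sequential one. */
double speedup(double t_seq, double t_par)
{
    return t_seq / t_par;
}

/* Parallel efficiency: speedup divided by the number of threads p
 * (1.0 means perfect scaling). */
double efficiency(double t_seq, double t_par, int p)
{
    return t_seq / (t_par * p);
}
```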
Exercise #2 (50%)
Consider the program provided on the assignment page (exercise2_startingpoint.c),
which applies a heat-transfer-like transformation to a square 2-D array over
several iterations. The program takes two command-line arguments: the
number of iterations and a number of threads (which is unused in this program
for now).
1. In your README file explain why using a simple parallel for pragma to
parallelize the i or the j loop will not lead to correct executions.
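As a concrete (hypothetical) illustration, assuming the update has a Gauss-Seidel-like shape in which each element reads neighbor values written earlier in the same sweep, the dependence pattern looks like the sketch below; the exact formula in exercise2_startingpoint.c may differ:

```c
/* Hypothetical sketch of the dependence: A[i][j] reads A[i-1][j] and
 * A[i][j-1], values the SAME sweep has just overwritten. Both the i loop
 * and the j loop therefore carry dependences, so a plain "parallel for"
 * on either one lets a thread read a neighbor that another thread has
 * not updated yet. */
void sweep(int n, double A[n][n])
{
    for (int i = 1; i < n; i++)
        for (int j = 1; j < n; j++)
            A[i][j] = (A[i-1][j] + A[i][j-1] + A[i][j]) / 3.0;
}
```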
2. Re-organize the computation (not the data) so that you can have a parallelizable loop. This can be done by changing the order in which elements
of the matrix are evaluated (by adding an extra loop). Implement this
change in exercise2.c and briefly explain the general idea.
• Hint: think of the computation as a number of steps so that at each
step a number of array values can be computed safely in parallel. It
may help to draw a small example for a small 2-D array and proceed
through the computational steps.
• Hint: it may be helpful to think of the computation as proceeding in
two distinct phases (each phase computes half of the array values).
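One possible reorganization matching the first hint is a wavefront (anti-diagonal) ordering. The sketch below assumes a hypothetical update that reads only the left and upper neighbors (the actual update, and hence the correct answer, may differ), and the extra loop runs over anti-diagonals:

```c
/* Wavefront sketch: all elements on the same anti-diagonal d = i + j are
 * independent of each other (each depends only on diagonal d - 1), so the
 * inner loop is safely parallelizable. */
void sweep_wavefront(int n, double A[n][n])
{
    for (int d = 2; d <= 2 * (n - 1); d++) {   /* extra loop: diagonals */
        int lo = (d - (n - 1) > 1) ? d - (n - 1) : 1;
        int hi = (d - 1 < n - 1) ? d - 1 : n - 1;
        #pragma omp parallel for
        for (int i = lo; i <= hi; i++) {
            int j = d - i;
            A[i][j] = (A[i-1][j] + A[i][j-1] + A[i][j]) / 3.0;
        }
    }
}
```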
3. Compare the performance of the original program to the sequential performance of your new program. What do you observe? Why do you think
that is? By how much does using multiple threads help? Should we be
happy with this performance?
• Use a number of iterations that makes the original program run in
about 10 seconds.
4. Use the blocking optimization idea to try to accelerate your program with
a view to executing it on multiple threads. The idea is that each block
is processed sequentially by a thread, and that threads will be able to
process blocks in parallel. Write the corresponding code in a program
called exercise2 fast.c. Empirically determine a good block size to
use. Explain in your README file why too small a block size is not a
good idea, and why too large a block size is not a good idea either.
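A blocked variant of the same (hypothetical) wavefront idea might look like the sketch below: blocks on the same block-anti-diagonal are mutually independent, each block is swept sequentially by one thread, and B is the block size you would tune empirically:

```c
/* Blocked wavefront sketch (assumes the same hypothetical left/upper-
 * neighbor update; B need not divide n - 1 evenly). */
void sweep_blocked(int n, int B, double A[n][n])
{
    int nb = (n - 1 + B - 1) / B;              /* blocks per dimension */
    for (int d = 0; d <= 2 * (nb - 1); d++) {  /* block anti-diagonals */
        int lo = (d - (nb - 1) > 0) ? d - (nb - 1) : 0;
        int hi = (d < nb - 1) ? d : nb - 1;
        #pragma omp parallel for
        for (int bi = lo; bi <= hi; bi++) {
            int bj = d - bi;
            int imax = (1 + (bi + 1) * B < n) ? 1 + (bi + 1) * B : n;
            int jmax = (1 + (bj + 1) * B < n) ? 1 + (bj + 1) * B : n;
            /* Each block is processed sequentially by one thread. */
            for (int i = 1 + bi * B; i < imax; i++)
                for (int j = 1 + bj * B; j < jmax; j++)
                    A[i][j] = (A[i-1][j] + A[i][j-1] + A[i][j]) / 3.0;
        }
    }
}
```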
5. Compare the performance of the blocked implementation with 1 thread to
that with 4 threads. What speedup and parallel efficiency do you achieve?
How about when compared to the original sequential program?
Exercise #3 (20%)
Consider the parallel sort implementation discussed in the last section of the
lecture notes. We saw that the sequential multi-way merge has a negative impact on performance, and should thus be parallelized.
1. Explain how you would parallelize the merge, assuming that the number
of processors is a power of two.
• The idea is to do parallel merges of pairs of array chunks.
2. What is the big-O complexity?
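The pairwise-merge structure can be sketched sequentially as below, assuming the number of sorted chunks p is a power of two and all chunks have equal length; the pair loop in each round is the one that admits a parallel for, since the merges within a round touch disjoint ranges:

```c
#include <string.h>

/* Standard two-way merge of sorted runs a (length na) and b (length nb). */
static void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* data holds p sorted chunks of length len each, back to back; tmp is
 * scratch of the same size. Each round halves the number of runs, so
 * there are log2(p) rounds in total. */
void tree_merge(int *data, int *tmp, int p, int len)
{
    for (int run = len; run < p * len; run *= 2) {
        /* The pair merges below are independent of one another. */
        #pragma omp parallel for
        for (int lo = 0; lo < p * len; lo += 2 * run)
            merge(data + lo, run, data + lo + run, run, tmp + lo);
        memcpy(data, tmp, (size_t)(p * len) * sizeof(int));
    }
}
```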
Beyond this assignment towards a project?
• Matrix multiplication, stencil applications, sorting are well-studied topics:
– Do a brief review of (a subset of) the literature
– Re-implement some of the algorithms
– Compare performance to publicly available "fast" implementations
– Discover what works, what doesn’t, and try to explain why
– Go to distributed-memory settings