Vu logo

Parallel programming Practical 2011/2012

Vu logo

The MPI assignment

Introduction

This assignment for the parallel programming practical requires you to implement a parallel SOR (Successive Overrelaxation), using C and the MPI (Message Passing Interface) communication library. The application has to be ran and benchmarked on the DAS3.

Send any questions you have for this assignment to Ana Lucia Varbanescu .

The Message-Passing Interface or MPI is widely used. It is not a new programming language; rather it is a library of subprograms that can be called from C and Fortran programs. It was developed by an open, international forum consisting of representatives from industry, academia, and government laboratories. It has rapidly received widespread acceptance because it has been carefully designed to permit maximum performance on a wide variety of systems, and it is based on message passing, one of the most powerful and widely used paradigms for programming parallel systems. The introduction of MPI makes it possible for developers of parallel software to write libraries of parallel programs that are both portable and efficient. Use of these libraries will hide many of the details of parallel programming, and, as a consequence, make parallel computing much more accessible to students and professionals in all branches of science and engineering.
Taken from Parallel Programming with MPI by Peter Pacheco.

Successive Overrelaxation (SOR)

Successive Overrelaxation is an iterative algorithm for solving Laplace equations on a grid. The sequential algorithm works as follows. For all non-boundary points of the grid, SOR first calculates the average value of the four neighbors of the point:

formula

Then the new value of the point is determined using the following correction:

formula

w is known as the relaxation parameter and a suitable value can be calculated in the following way:

formula

The entire process terminates if during the last iteration no point in the grid has changed more than a certain quantity.

Parallel SOR

The parallel implementation of SOR is based on the Red/Black SOR algorithm. The grid is treated as a checkerboard and each iteration is split into two phases, red and black. During the red phase only the red points of the grid are updated. Red points only have black neighbors and no black points are changed during the red phase. During the black phase, the black points are updated in a similar way.

Using the Red/Black SOR algorithm the grid can be partitioned among the available processors, each processor receiving a number of adjacent rows. All processors can now update different points of the same color in parallel. Before a processor starts the update of a certain color, it exchanges the border points of the opposite color with its neighbor.

After each iteration, the processors must decide if the calculation must continue. Each processor determines if it wants to continue and sends its vote to the first processor. The calculation is terminated only if all the processors agree to stop. This means a processor may continue the calculation after the termination condition is reached locally.

Requirements

Implement a parallel Successive Overrelaxation Algorithm (SOR) using MPI. Unlike the algorithm discussed in the class, this algorithm should not use a fixed number of iterations but terminate when none of the grid elements have changed more than a certain (constant) value (so the program should compute the maximum change over all processors). Data distribution at the start, and gathering at the end of the process must also be implemented in your program. However, you should not take the time this takes into account when performing speedup measurements.

The version of SOR must partition the NxN grid as discussed in the class (i.e. each processor gets approximately N/P consecutive rows), should run on any number of machines, and accept any given problem size. Furthermore, you must implement a solution that overlaps the communication and computation (using boundary optimization).

Measure the speedups of the SOR version on DAS3, using up to 16 nodes. A sequential version (for the 'row'-variant) and a Makefile are available in /home/ppp/pub/mpi on the DAS3 for your convenience. This sequential program also contains the 'stencil' operation that should repeatedly be applied to each grid point and it illustrates when the program should terminate (see the 'stopdiff' variable). Consider this version as the reference sequential implementation.

Make sure you use the right MPI operations for the best performance. For example, use MPI's collective communication operations whenever that is useful. (e.g., to do a broadcast, use MPI's broadcast primitive, and not N serial send operations). Similarly, pay attention when choosing between synchronous and asynchronous communication. Finally, use the timers provided by MPI to measure the performance of sections of your code.

Write a short report, in English, describing your design, implementation, and performance testing for your paralell SOR solution. Describe (briefly) the way the algorithm works, focusing on the MPI implementation details and the optimizations you have applied (if any). Before presenting the performance results, mention the composition of the experimental set-up - number of machines/nodes, input data-sets, application versions, etc. - and the way you measured the execution time - for which operations/regions. Include the performance measurements of your application (including a performance breakdown, if needed). Report execution times as well as speedups, and include speedup graphs as needed. Furthermore, compare the achieved performance and speed-up with the expected ones, and explain the eventual unexpected results.

Compiling and running your applications

To compile your MPI programs you should use the MPI compiler mpicc, you should also construct a simple Makefile for your needs. Programs are to be written in the C programming language. If you are unfamiliar with C and Makefiles, you may want to read the guide C for Java Programmers by Jason Maassen.

If you want to run MPI programs on the DAS, you have to use a special script. The commandline for running SOR looks like this:

prun -v -1 -sge-script /usr/local/sitedep/reserve.sge/sge-script ./sor-par 4

Submitting

You have to submit both the code (preferably containing useful comments that illustrate how the application works) and the report.

Important Because we use automatic test scripts to test and benchmark your submissions, you must strictly follow the instructions below.

  1. Make sure that your submission has the exact same directory structure as the provided template in /home/ppp/pub/mpi on the DAS-3.
  2. The parallel program should be compiled in the same directory as the sequential program. Change the Makefile to make sure that your parallel SOR version is compiled with the make sor-par command. The parallel executable must be called sor-par.
  3. Make sure that your parallel program gives the exact same output as the sequential program. The -print option should also work the same. Because we compare your application's output with the correct output using diff, any difference (except for the run time you print) leads to a rejection for your submission. A sanity check program is available in /home/ppp/pub/mpi/bin/ on the DAS to check this.
  4. Make sure you compile with optimizations turned on. Solutions that obtain substantially lower speedups when compared with our reference implementation are rejected.
  5. Create the report as a PDF file, and make sure you place the file in the pre-created docs directory.
  6. Place the code and documentation directories in directory that contains your VU netid, full name, and the type of assignment (i.e., mpi - e.g., jj400_JanJanssen_mpi). Archive the directory as a .tar.gz file (i.e. jj400_JanJanssen_mpi.tar.gz) and submit this archive via the blackboard.
  7. You are allowed a total of 3 submission attempts for the MPI assignment. We run a check on your submission(s), and notify you of the results (i.e., pass/fail for the sanity checks) within a couple of days. Note that passing the sanity check does not guarantee a passing grade, but guarantees the assignment can be considered ready for grading.

Documentation

Grading

A correct implementation, compliant with all the requirements above (both for code and documentation) is graded with 8. Up to 2 bonus points (to grade 10) can be given for extra work on implementation, optimizations, testing and benchmarking. Examples are:

  1. implemeting SOR also using a different data distribution (i.e., block-wise) and comparing as the achieved performance;
  2. making a more detailed evaluation of communication overheads and proposing solutions to minimize them.
We encourage creative attempts to improve the implementation, the performance, or the analysis of the parallel application beyond the mandatory requirements.

Important You get bonus points if you find interesting and/or creative ways to improve the parallel application (its implementation, performance, or analysis). However, optimizing the (reference) sequential algorithm, does not count as a bonus. Furthremore, make sure that the basic requiremets for the assignment are still met! Additional features are only graded for working solutions.

TODO list

In order make sure you complete this part of the practical, we recommend the following recipe:

  • Get the sequential version of the SOR application from /home/ppp/pub/mpi on the fs0.das3.
  • Create a parallel application using MPI and C.
  • Test the functionality and correctness of your application on the DAS, using prun (see the DAS-3 site for more information on the DAS-3 and prun). Make sure you check the application runs correctly for various input data (test some borderline cases as well).
  • Benchmark your parallel application for various input data.
  • Write a report about your MPI assignment. Make sure you mention the difficulties you encountered and how you solved them. Also, present the experimental results and discuss the achieved performance of your solution. Include both run-time and speedups graphs.
  • (Optional) To get bonus points for this assignment: add extra features in the parallel application and/or include extra analysis. Be sure to describe your bonus work in your report, explaining what is interesting about it, what it solves, what is original, and include any additional performance data.
  • Test your assignment with the sanity-check script.
  • Check that the code and report are placed in the correct directories, make sure the Makefile is also correct, archive the entire directory, and submit the archive via the blackboard.
  • Wait/check for the pass/fail notification of your assignment, update the code and/or report if wanted/needed, and re-submit. Note that you have a total of 3 attempts to submit your final solution. Only the last submission found on the site at the submission deadline will be graded.

What's new?

January 31, 2012:
Deadline extended

January 31, 2012:
GPU assignment is updated

November 28, 2011:
GPU assignment is available!

November 7, 2011:
MPI assignment is available!

October 31, 2011:
The site for PPP is updated!

October 28, 2011:
The registration for the practical is open on blackboard.

Valid CSS!

Valid HTML 4.01 Strict