Parallel programming Practical 2011/2012

The GPU assignment

Introduction

This assignment for the parallel programming practical requires you to implement an image processing pipeline. The pipeline will convert a color input image to grayscale, compute its histogram, enhance the contrast, and perform edge detection. Thus, for any input image, the filter outputs a gray image (of the same size as the input) with the detected edges. For simplicity and accuracy all operations are done in floating point. The filters have to be accelerated using a GPU; the accelerated code can be implemented using CUDA or OpenCL. The program will be benchmarked on the DAS3 nodes that have NVIDIA GPUs.

Send any questions you have about this assignment to Ana Lucia Varbanescu.

Converting a color image to grayscale

Converting a color image to a grayscale image is straightforward. The Red, Green and Blue (RGB) components of each pixel are weighted and added together: gray = 0.3*R + 0.59*G + 0.11*B.

Contrast enhancement

Next, the application will enhance the contrast of the image. This is done in two steps. First, we have to calculate a histogram. The histogram counts how often each gray scale value (between 0 and 255) is used in the image. We will use the histogram to determine a weight that is used to scale the gray values of each pixel. Second, each pixel in the image is scaled with the weight.

Edge detection and the Sobel operator

Edge detection is a fundamental tool in image processing and computer vision, particularly in the areas of feature detection and feature extraction, which aim at identifying points in a digital image at which the image brightness changes sharply or, more formally, has discontinuities (from Wikipedia).

The Sobel operator is used in image processing, particularly within edge detection algorithms. Technically, it is a discrete differentiation operator, computing an approximation of the gradient of the image intensity function. At each point in the image, the result of the Sobel operator is either the corresponding gradient vector or the norm of this vector. The Sobel operator is based on convolving the image with a small, separable, and integer-valued filter in the horizontal and vertical directions and is therefore relatively inexpensive in terms of computations. On the other hand, the gradient approximation which it produces is relatively crude, in particular for high-frequency variations in the image (from Wikipedia).

Mathematically, the operator uses two 3 x 3 kernels which are convolved with the original image to calculate approximations of the derivatives - one for horizontal changes, and one for vertical. If we define A as the source image, and Gx and Gy are two images which at each point contain the horizontal and vertical derivative approximations, the computations are as follows (* here denotes the 2-dimensional convolution operation):

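$$\mathbf{G}_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * \mathbf{A} \quad \text{and} \quad \mathbf{G}_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * \mathbf{A}$$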

The x-coordinate is here defined as increasing in the "right"-direction, and the y-coordinate is defined as increasing in the "down"-direction. At each point in the image, the resulting gradient approximations can be combined to give the gradient magnitude, using:

$$\mathbf{G} = \sqrt{\mathbf{G}_x^2 + \mathbf{G}_y^2}$$

Using this information, we can also calculate the gradient's direction (for example, Θ is 0 for a vertical edge which is darker on the left side):

$$\Theta = \arctan\left(\frac{\mathbf{G}_y}{\mathbf{G}_x}\right)$$

However, the output image (for this assignment) is generated using the gradient magnitude of each pixel.
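A minimal C++ sketch of the gradient magnitude computation, using the standard Sobel kernels on a row-major float image. Setting border pixels to 0 and clamping the result to 255 are assumptions of this sketch, and the function name is hypothetical; check the provided sequential code for the exact behaviour your output has to match.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Gradient magnitude of a row-major image of width*height floats.
// ASSUMPTIONS: border pixels stay 0, and results are clamped to 255.
std::vector<float> sobelMagnitude(const std::vector<float>& in, int width, int height) {
    static const int kx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    static const int ky[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };

    std::vector<float> out(in.size(), 0.0f);
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            float gx = 0.0f;
            float gy = 0.0f;
            // Accumulate the horizontal and vertical responses over the 3x3 neighbourhood.
            for (int j = -1; j <= 1; ++j) {
                for (int i = -1; i <= 1; ++i) {
                    const float p = in[(y + j) * width + (x + i)];
                    gx += kx[j + 1][i + 1] * p;
                    gy += ky[j + 1][i + 1] * p;
                }
            }
            out[y * width + x] = std::min(255.0f, std::sqrt(gx * gx + gy * gy));
        }
    }
    return out;
}
```

The kernels are applied here without flipping (as a correlation). Rotating a Sobel kernel by 180 degrees only negates it, so the magnitude is the same as for a true convolution.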

Accelerated application

A sequential version of the application is provided, for your convenience, on fs0.das3 in /home/ppp/pub/gpu.

There are no restrictions on how the accelerated filter should be implemented. However, we recommend starting from the implementation provided as an example. The four filtering steps (RGBtoGray, histogram, contrast and sobel filter) all have to be parallelized and offloaded to the GPU.

Similar to the sequential code, the accelerated version has to run on gray images of any size. Furthermore, the output should be the same as the one generated by the sequential version. We provide a test program to verify this.

Clarification: The output that is verified for compliance is the final image, obtained after the edge detection (named convolution.bmp in the given code). In case you have written the code to generate the intermediate files (for debugging, most likely), you can leave it in (it might help the assignment checking process), but please make sure it can be enabled/disabled (e.g., by using a #define) and, more importantly, do not include these transfers in the performance measurements. Finally, the text output of the application is not verified for compliance, but a structure similar to the one in the sequential code makes testing easier.
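One way to make the intermediate output switchable is a compile-time macro. The sketch below is only an illustration: the macro name SAVE_INTERMEDIATE and the helper dumpIntermediate are hypothetical, and it writes raw bytes rather than using the provided BMP routines.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical switch: build with -DSAVE_INTERMEDIATE=1 to enable the debug dumps.
#ifndef SAVE_INTERMEDIATE
#define SAVE_INTERMEDIATE 0
#endif

// Writes the raw bytes to `path` only when the switch is on.
// Returns true if a complete file was written.
bool dumpIntermediate(const char* path, const std::vector<unsigned char>& pixels) {
#if SAVE_INTERMEDIATE
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    const std::size_t written = std::fwrite(pixels.data(), 1, pixels.size(), f);
    std::fclose(f);
    return written == pixels.size();
#else
    (void)path;
    (void)pixels;
    return false;  // compiled out: no file I/O happens in this build
#endif
}
```

With the switch off, the call does no work, so leaving it in the code does not disturb the timed regions.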

Requirements

Implement the accelerated image processing application using CUDA or OpenCL. Use the provided code for reading and writing the images, and the sequential code as a reference implementation. You can use the provided Makefile for your GPU implementation or modify it according to your own needs. However, whatever changes you make to the Makefile, be sure that running make in the gpu/cuda-filters or gpu/opencl-filters directory of your submission (depending on your choice of language) will generate a running program named cuda-filters or opencl-filters, respectively.

To check for correctness, use the input images provided in the gpu/images directory. You can use at most 5 additional images for special testing and benchmarking. If you choose to do so, include these images in the same directory gpu/images, and name them gpu/images/image_1x.bmp, where x is a single digit (i.e., between 0 and 9). Note that sending inappropriate images as part of your submission leads to automatic rejection.

Benchmark your application on DAS3, using nodes that have GPU cards (see the section Compiling and running your applications below) and the images from the gpu/images directory. Measure wall-clock time for the entire application, and also measure kernel execution time and memory transfer overhead(s). Use the provided sequential implementation as a reference for your speed-up computation(s). As performance metrics, use execution time(s), speed-up(s), GFLOPs, and utilization.
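The metrics above reduce to a few ratios, sketched here in C++. All names are hypothetical, and defining utilization as the achieved fraction of the device's peak GFLOP/s is an assumption of this sketch; state the definition you use in your report.

```cpp
#include <chrono>

// Wall-clock seconds taken by `work` (any callable), using a monotonic clock.
template <typename F>
double timeSeconds(F work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speed-up of the accelerated version relative to the sequential reference.
inline double speedUp(double sequentialSeconds, double acceleratedSeconds) {
    return sequentialSeconds / acceleratedSeconds;
}

// Achieved GFLOP/s, given the number of floating-point operations executed.
inline double gflops(double floatingPointOps, double seconds) {
    return floatingPointOps / seconds / 1e9;
}

// ASSUMED definition: fraction of the peak GFLOP/s, with the peak supplied by the caller.
inline double utilization(double achievedGflops, double peakGflops) {
    return achievedGflops / peakGflops;
}
```

When timing GPU work from the host this way, keep in mind that kernel launches are asynchronous: the device has to be synchronized before the stop time is taken, otherwise only the launch overhead is measured.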

Write a short report, in English, describing your design, implementation, and performance testing for your accelerated application. Describe (briefly) the way the parallel algorithm works, focusing on the parallelization details and the optimizations you have applied (if any). Before presenting the performance results, mention the composition of the experimental set-up - which node, what features, what the input data-sets are, etc. - and what is measured. Include the performance measurements of your application (including timings for both the communication and computation). Report execution time(s), speed-up(s), GFLOPs, and utilization. Finally, compare the achieved performance and speed-up with the expected ones, and explain any unexpected results.

Bug Reports

We have received a couple of bug reports for the GPU assignment. We have updated the code accordingly. In case you have downloaded your sources before 31/01/2012, please download them again.

For the students that have already submitted a GPU assignment prior to these corrections, the sanity check will be performed with the *old* version (i.e., the version included in the submitted archive). In case the sanity check fails, however, the following submissions need to be compliant with the *new* versions.

Changelog:

  • fixed bug with adding values to uninitialized memory in convolution2D (replaced malloc with calloc)
  • fixed bug with pixel values larger than 255 in convolution (added extra test in combineImages)
  • code layout fixes
  • printing to stdout / stderr made consistent

Compiling and running your applications

No matter if you choose CUDA or OpenCL, your code should be written in C or C++. If you are unfamiliar with C/C++ and Makefiles, you may want to read the guide C for Java Programmers by Jason Maassen.

To be able to use the CUDA compiler and/or find the OpenCL libraries, add the following lines to your .bashrc (note that you need to log out and log in again for these changes to take effect):

 
export CUDA=/usr/local/package/cuda-3.2.9/cuda
export PATH=$PATH:$CUDA/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA/lib64

To compile your GPU code, use the compiler that matches your choice of language. For CUDA, you should use the NVIDIA CUDA compiler nvcc. If you choose to use OpenCL, you can just use gcc (g++ for C++ sources), for example:

 
g++ -I/usr/local/package/cuda-3.2.9/cuda/include -Wall -m64 -g opencl-filters.cc -o opencl-filters -lOpenCL
However, simple Makefiles are provided for your convenience, for both CUDA and OpenCL.

If you want to run CUDA or OpenCL programs on the DAS, you have to use prun like this:

prun -v -np 1 -q gpu.q /[path_to_file]/cuda-filters /[path_to_file]/image_01.bmp

Available hardware

DAS3 has 8 NVIDIA GT430 GPUs. The GT430 has 2 streaming multiprocessors, with 48 cores each. Thus, in total, the GPU has 96 cores. The chips are based on the Fermi architecture, and provide compute capability 2.1. They have 1 GB of device memory. The exact specifications are available here.

Submitting

You have to submit both the code (preferably containing useful comments that illustrate how the application works) and the report.

Important: Because we use test scripts to test and benchmark your submissions, you must strictly follow the instructions below.

  1. Make sure that your submission has the exact same directory structure as the provided template in /home/ppp/pub/gpu on the DAS-3.
  2. The accelerated program should be compiled in the correct directory, depending on your choice of language (cuda or opencl), and must be called cuda-filters or opencl-filters, respectively.
  3. In case your application has been tested/benchmarked with more images than the ones provided by us in the /images directory, be sure to submit these images in the same directory, and name them image_1x.bmp, where x is a single digit (i.e., between 0 and 9).
  4. Make sure that the accelerated program gives the same output as the sequential program. We compare your application's output with the correct output at byte-level, using a sanity check program. For your convenience, a similar sanity checker called compare_images is available in /home/ppp/pub/gpu/bin/ on the DAS3.
  5. Make sure you compile with optimizations turned on. Solutions that obtain substantially lower speedups when compared with our reference implementation are rejected.
  6. Create the report as a PDF file, and make sure you place the file in the pre-created docs directory.
  7. Place the code and documentation directories in a directory whose name contains your VUNet-id, full name, and the type of assignment (i.e., gpu; for example, jj400_JanJanssen_gpu). Archive the directory as a .tar.gz file (i.e., jj400_JanJanssen_gpu.tar.gz) and submit this archive via the blackboard.
  8. You are allowed a total of 3 submission attempts for the GPU assignment. We run a check on your submission(s), and notify you of the results (i.e., pass/fail for the sanity checks) within a few days. Note that passing the sanity check does not guarantee a passing grade, but guarantees the assignment can be considered ready for grading.
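To illustrate what the byte-level comparison in step 4 implies, here is a minimal C++ sketch that checks whether two files are identical. It is not the compare_images tool provided on the DAS3, and the function names are hypothetical.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads a whole file into memory. Returns false if the file cannot be opened.
bool readBytes(const std::string& path, std::vector<char>& bytes) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    bytes.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    return true;
}

// True only if both files can be read and have identical length and content.
bool sameBytes(const std::string& a, const std::string& b) {
    std::vector<char> da;
    std::vector<char> db;
    return readBytes(a, da) && readBytes(b, db) && da == db;
}
```

A comparison this strict means that even a one-bit rounding difference in a single pixel makes the check fail, so the order of the floating-point operations in the accelerated code can matter.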

Documentation

  • The main documentation source for CUDA is the Programming guide. Multiple examples and tutorials are available online.
  • A good documentation source for OpenCL is the OpenCL Programming guide.
  • The OpenCL online man pages can be found here. A quick reference card can be found here.
  • The reader.
  • The code for the sequential implementation, available on the DAS3 (fs0.das3) in /home/ppp/pub/gpu.

Grading

A correct implementation of a GPU-accelerated filter pipeline compliant with the requirements above (both for code and documentation) is graded with 8. Up to 2 bonus points (to grade 10) can be given for extra work on implementation, optimizations, testing and benchmarking. Examples are:

  1. applying different CUDA optimizations (like coalescing, using the constant memory and/or shared memory) and showing their effect on the overall application performance;
  2. comparing different ways/methods to optimize the host-device communication;
  3. verifying the OpenCL (code-level) portability and its impact on performance.
  4. We encourage creative attempts to improve the implementation, the performance, or the analysis of the parallel application beyond the mandatory requirements.

Important: You get bonus points if you find interesting and/or creative ways to improve the parallel application (its implementation, performance, or analysis). However, optimizing the (reference) sequential algorithm does not count as a bonus. Furthermore, make sure that the basic requirements for the assignment are still met! Additional features are only graded for working solutions.

Note that we check all your submissions (up to 3) for compliance, but we only grade the last one available on the site at the time of the deadline. Grading is done after the submission deadline(s). Therefore, make sure your last submission is compliant, complete, and correct.

TODO list

In order to make sure you complete this part of the practical, we recommend the following recipe:

  • Get the sequential version of the Filters application from /home/ppp/pub/gpu on fs0.das3.
  • Create a parallel application using the CUDA or OpenCL templates, and fill in your parallelized kernel(s).
  • Test the functionality and correctness of your application on the DAS, using prun (see the DAS-3 site for more information on the DAS-3 and prun). Make sure you check that the application runs correctly for various input data (test some borderline cases as well).
  • Benchmark your parallel application for various input data. Vary other parameters, if needed, to prove flexible behaviour or increased performance.
  • Write a report about your GPU assignment. Make sure you mention the difficulties you encountered and how you solved them. Also, present the experimental results, discuss the achieved performance of your solution, and how far it is from the expected performance. Include graphs to support your claims for scalability, speed-up and/or utilization.
  • (Optional) To get bonus points for this assignment: add extra features to the parallel application and/or include extra analysis. Be sure to describe your bonus work in your report, explaining what is interesting about it, what it solves, what is original, and include any additional performance data.
  • Test your assignment with the compare_images sanity checker.
  • Check that the code and report are placed in the correct directories, make sure the Makefile is also correct, archive the entire directory, and submit the archive via the blackboard.
  • Wait/check for the pass/fail notification of your assignment, update the code and/or report if wanted/needed, and re-submit. Note that you have a total of 3 attempts to submit your final solution, and that only your final solution (the last one on the site at the submission deadline) will be graded.

What's new?

January 31, 2012:
Deadline extended

January 31, 2012:
GPU assignment is updated

November 28, 2011:
GPU assignment is available!

November 7, 2011:
MPI assignment is available!

October 31, 2011:
The site for PPP is updated!

October 28, 2011:
The registration for the practical is open on blackboard.
