Parallel Programming Practical 2011/2012
The GPU assignment

Introduction

This assignment for the parallel programming practical requires you to implement an image processing pipeline. The pipeline converts a color input image to grayscale, computes its histogram, enhances the contrast, and performs edge detection. Thus, for any input image, the filter outputs a gray image (of the same size as the input) with the detected edges. For simplicity and accuracy, all operations are done in floating point.

The filters have to be accelerated using a GPU; the accelerated code can be implemented using CUDA or OpenCL. The program will be benchmarked on the DAS3 nodes that have NVIDIA GPUs.

Send any questions you have about this assignment to Ana Lucia Varbanescu.

Converting a color image to grayscale

Converting a color image to a grayscale image is straightforward. The Red, Green and Blue (RGB) components of each pixel are weighted and added together: gray = 0.3*R + 0.59*G + 0.11*B.

Contrast enhancement

Next, the application enhances the contrast of the image. This is done in two steps. First, we calculate a histogram. The histogram counts how often each grayscale value (between 0 and 255) is used in the image. We use the histogram to determine a weight that is used to scale the gray value of each pixel. Second, each pixel in the image is scaled with the weight.

Edge detection and the Sobel operator

Edge detection is a fundamental tool in image processing and computer vision, particularly in the areas of feature detection and feature extraction, which aim at identifying points in a digital image at which the image brightness changes sharply or, more formally, has discontinuities (from Wikipedia).

The Sobel operator is used in image processing, particularly within edge detection algorithms. Technically, it is a discrete differentiation operator, computing an approximation of the gradient of the image intensity function.
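Before the details of the Sobel operator, the first two pipeline steps (grayscale conversion and histogram) can be illustrated with a minimal sequential sketch. It is not the provided reference code: the function names `rgbToGray` and `histogram`, the interleaved R,G,B float layout, and the truncate-and-clamp rule for choosing a histogram bin are all assumptions made for illustration. Check the provided sequential implementation for the exact conventions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: converts interleaved RGB floats (R, G, B per pixel)
// to gray values using gray = 0.3*R + 0.59*G + 0.11*B.
std::vector<float> rgbToGray(const std::vector<float>& rgb) {
    assert(rgb.size() % 3 == 0);
    std::vector<float> gray(rgb.size() / 3);
    for (std::size_t i = 0; i < gray.size(); ++i) {
        gray[i] = 0.3f * rgb[3 * i] + 0.59f * rgb[3 * i + 1] + 0.11f * rgb[3 * i + 2];
    }
    return gray;
}

// Hypothetical helper: counts how often each gray value 0..255 occurs.
// Assumption: float values are truncated to an integer bin and clamped to [0, 255].
std::array<unsigned int, 256> histogram(const std::vector<float>& gray) {
    std::array<unsigned int, 256> bins{};  // all counters start at zero
    for (float g : gray) {
        int bin = static_cast<int>(g);
        if (bin < 0) bin = 0;
        if (bin > 255) bin = 255;
        ++bins[static_cast<std::size_t>(bin)];
    }
    return bins;
}
```

On a GPU, the grayscale loop maps naturally to one thread per pixel, while the histogram needs care because many threads may increment the same bin at once.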
At each point in the image, the result of the Sobel operator is either the corresponding gradient vector or the norm of this vector. The Sobel operator is based on convolving the image with a small, separable, and integer-valued filter in the horizontal and vertical directions and is therefore relatively inexpensive in terms of computations. On the other hand, the gradient approximation it produces is relatively crude, in particular for high-frequency variations in the image (from Wikipedia).

Mathematically, the operator uses two 3 x 3 kernels which are convolved with the original image to calculate approximations of the derivatives, one for horizontal changes and one for vertical changes. If we define A as the source image, and Gx and Gy as two images which at each point contain the horizontal and vertical derivative approximations, the computations are as follows (* here denotes the 2-dimensional convolution operation):

$$
G_x = \begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix} * A
\qquad
G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A
$$

The x-coordinate is here defined as increasing in the "right"-direction, and the y-coordinate is defined as increasing in the "down"-direction. At each point in the image, the resulting gradient approximations can be combined to give the gradient magnitude, using:

$$
G = \sqrt{G_x^2 + G_y^2}
$$

Using this information, we can also calculate the gradient's direction (for example, Θ is 0 for a vertical edge which is darker on the left side):

$$
\Theta = \operatorname{atan2}(G_y, G_x)
$$

However, the output image (for this assignment) is generated using the gradient magnitude of each pixel.

Accelerated application

A sequential version of the application is provided, for your convenience, on fs0.das3 in /home/ppp/pub/gpu. There are no restrictions on how the accelerated filter should be implemented. However, we recommend starting from the implementation provided as an example. The four filtering steps (RGBtoGray, histogram, contrast and Sobel filter) all have to be parallelized and offloaded to the GPU. Similar to the sequential code, the accelerated version has to run on gray images of any size.
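As an illustration of what the Sobel step computes per pixel, here is a minimal sequential sketch of the gradient magnitude. It is not the provided reference code: the name `sobelMagnitude`, the row-major float layout, and the choice to set border pixels (those without a full 3 x 3 neighbourhood) to 0 are assumptions. The provided sequential implementation determines the actual border handling and output range that your result must match.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Sobel gradient magnitude of a row-major gray image.
// Assumption: border pixels are left at 0; no clamping of the result is done.
std::vector<float> sobelMagnitude(const std::vector<float>& gray,
                                  std::size_t width, std::size_t height) {
    assert(gray.size() == width * height);
    std::vector<float> out(gray.size(), 0.0f);
    if (width < 3 || height < 3) return out;

    auto at = [&](std::size_t y, std::size_t x) { return gray[y * width + x]; };

    for (std::size_t y = 1; y + 1 < height; ++y) {
        for (std::size_t x = 1; x + 1 < width; ++x) {
            // Horizontal derivative: right column minus left column, weights 1, 2, 1.
            float gx = (at(y - 1, x + 1) + 2.0f * at(y, x + 1) + at(y + 1, x + 1))
                     - (at(y - 1, x - 1) + 2.0f * at(y, x - 1) + at(y + 1, x - 1));
            // Vertical derivative: bottom row minus top row, weights 1, 2, 1.
            float gy = (at(y + 1, x - 1) + 2.0f * at(y + 1, x) + at(y + 1, x + 1))
                     - (at(y - 1, x - 1) + 2.0f * at(y - 1, x) + at(y - 1, x + 1));
            out[y * width + x] = std::sqrt(gx * gx + gy * gy);
        }
    }
    return out;
}
```

Each output pixel depends only on its 3 x 3 input neighbourhood, so the two loops can be replaced by one GPU thread per output pixel. The sign convention of the kernels does not matter here, because only the magnitude is written to the output.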
Furthermore, the output should be the same as the one generated by the sequential version. We provide a test program to verify this.

Clarification: The output that is verified for compliance is the final image, obtained after the edge detection (named convolution.bmp in the given code). If you have written code to generate the intermediate files (most likely for debugging), you can leave it in (it might help the assignment checking process), but please make sure it can be enabled/disabled (e.g., by using a #define) and, more importantly, do not include these transfers in the performance measurements. Finally, the text output of the application is not verified for compliance, but a structure similar to the one in the sequential code makes testing easier.

Requirements

Implement the accelerated image processing application using CUDA or OpenCL. Use the provided code for reading and writing the images, and the sequential code as a reference implementation. You can use the provided Makefile for your GPU implementation or modify it according to your own needs. However, whatever changes you make to the Makefile, be sure that running make in the gpu/cuda-filters or gpu/opencl-filters directory of your submission (depending on your choice of language) generates a running program named cuda-filters or opencl-filters, respectively.

To check for correctness, use the input images provided in the gpu/images directory. You can use at most 5 additional images for special testing and benchmarking. If you choose to do so, include these images in the same directory, gpu/images, and name them gpu/images/image_1x.bmp, where x is a single digit (i.e., between 0 and 9). Note that sending inappropriate images as part of your submission leads to automatic rejection.

Benchmark your application on DAS3, using nodes that have GPU cards (see the section Compiling and running your applications below) and the images from the gpu/images directory.
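For the benchmarking itself, a minimal sketch of how wall-clock time, speed-up, and achieved GFLOP/s could be computed on the host is given here. The helper names `wallClockSeconds`, `speedup`, and `gflops` are hypothetical and not part of the provided code, and the number of floating-point operations has to be counted by hand for each filter.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: wall-clock seconds spent in any callable.
// Note: GPU kernel launches are asynchronous, so the callable must wait for
// the device to finish before returning, or the time will be too short.
template <typename F>
double wallClockSeconds(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speed-up of the accelerated run relative to the sequential reference.
double speedup(double sequentialSeconds, double acceleratedSeconds) {
    assert(acceleratedSeconds > 0.0);
    return sequentialSeconds / acceleratedSeconds;
}

// Achieved GFLOP/s for a hand-counted number of floating-point operations.
double gflops(double floatingPointOps, double seconds) {
    assert(seconds > 0.0);
    return floatingPointOps / seconds / 1e9;
}
```

As an example of counting operations, the grayscale formula gray = 0.3*R + 0.59*G + 0.11*B costs 3 multiplications and 2 additions, so 5 floating-point operations per pixel.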
Measure wall-clock time for the entire application, but also measure kernel execution time and memory transfer overhead(s). Use the provided sequential implementation as a reference for your speed-up computation(s). As performance metrics, use execution time(s), speed-up(s), GFLOPs, and utilization.

Write a short report, in English, describing your design, implementation, and performance testing for your accelerated application. Describe (briefly) the way the parallel algorithm works, focusing on the parallelization details and the optimizations you have applied (if any). Before presenting the performance results, describe the experimental set-up (which node, what features, what input data sets, etc.) and what is measured. Include the performance measurements of your application (including timings for both the communication and the computation). Report execution time(s), speed-up(s), GFLOPs, and utilization. Finally, compare the achieved performance and speed-up with the expected ones, and explain any unexpected results.

Bug Reports

We have received a couple of bug reports for the GPU assignment and have updated the code accordingly. If you downloaded your sources before 31/01/2012, please download them again. For students who submitted a GPU assignment prior to these corrections, the sanity check will be performed with the *old* version (i.e., the version included in the submitted archive). If the sanity check fails, however, the following submissions need to be compliant with the *new* version.

Changelog:
Compiling and running your applications

Whether you choose CUDA or OpenCL, your code should be written in C or C++. If you are unfamiliar with C/C++ and Makefiles, you may want to read the guide C for Java Programmers by Jason Maassen.

To be able to use the CUDA compiler and/or find the OpenCL libraries, add the following lines to your .bashrc (note that you need to log out and log in again for these changes to take effect):

    export CUDA=/usr/local/package/cuda-3.2.9/cuda
    export PATH=$PATH:$CUDA/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA/lib64

The compiler for your GPU code depends on your choice of language. For CUDA programs, you should use the NVIDIA CUDA compiler nvcc. If you choose to use OpenCL, you can just use gcc/g++, for example:

    g++ -I/usr/local/package/cuda-3.2.9/cuda/include -Wall -m64 -g opencl-filters.cc -o opencl-filters -lOpenCL

Simple Makefiles are provided for your convenience, for both CUDA and OpenCL.

If you want to run CUDA or OpenCL programs on the DAS, you have to use prun like this:

    prun -v -np 1 -q gpu.q /[path_to_file]/cuda-filters /[path_to_file]/image_01.bmp

Available hardware

DAS3 has 8 NVIDIA GT430 GPUs. The GT430 has 3 streaming multiprocessors, with 32 cores each. Thus, in total, the GPU has 96 cores. The chips are based on the Fermi architecture, and provide compute capability 2.1. They have 1 GB of device memory. The exact specifications are available here.

Submitting

You have to submit both the code (preferably containing useful comments that illustrate how the application works) and the report.

Important: Because we use test scripts to test and benchmark your submissions, you must strictly follow the instructions below.
Documentation
Grading

A correct implementation of a GPU-accelerated filter pipeline compliant with the requirements above (both for code and documentation) is graded with an 8. Up to 2 bonus points (to grade 10) can be given for extra work on implementation, optimizations, testing and benchmarking. Examples are:
Important: You get bonus points if you find interesting and/or creative ways to improve the parallel application (its implementation, performance, or analysis). However, optimizing the (reference) sequential algorithm does not count as a bonus. Furthermore, make sure that the basic requirements for the assignment are still met! Additional features are only graded for working solutions.

Note that we check all your submissions (up to 3) for compliance, but we only grade the last one available on the site at the time of the deadline. Grading is done after the submission deadline(s). Therefore, make sure your last submission is compliant, complete, and correct.

TODO list

To make sure you complete this part of the practical, we recommend the following recipe: