Create a fault-tolerant task-farming library that uses the JavaGAT grid toolkit to run applications on a cluster.
BOINC is a network computing infrastructure developed at Berkeley. BOINC is a generalized version of the well-known SETI@home program which allows volunteers to participate in distributed computations. Several distributed computing projects are currently using BOINC, such as einstein@home (gravitational wave detection), predictor@home (protein folding) and climateprediction.net (climate modeling).
BOINC is designed for so called Task Farming applications. These applications typically consist of a single master application which hands out a large number of independent tasks to worker applications. The workers then process the tasks simultaneously and return the results to the master. No inter-worker communication takes place.
Even though the task farming model is very simple, it can still be used to express a large number of applications. Usually, they are either Parameter Sweep Applications, consisting of some model for which many combinations of input parameters must be tried (e.g., Protein Folding, Climate Prediction, Medicine Research) or Search Applications, where a large amount of data must be processed in search of interesting events (e.g., Gravitational Wave Detection, SETI).
As a first assignment for the C&CG practical course you are going to build a Java-based grid-enabled task farming library called JOINC. Unlike its predecessor (BOINC), JOINC does not depend on pre-installed workers, but is able to start new workers on demand using a piece of grid software called the JavaGAT.
To implement JOINC, you will use JavaGAT, the Java version of the Grid Application Toolkit (GAT) developed by the Gridlab project. The JavaGAT provides a set of interfaces which allow you to perform the basic tasks needed on a grid, such as copying files or scheduling jobs. That GAT is designed in a modular plug-and-play manner. This allows multiple implementations of each interface, which can even be used at the same time. Using the interfaces offered by the JavaGAT, you should be able to create an efficient and robust JOINC implementation.
After you have implemented JOINC, you can use three applications to test your implementation and do a performance analysis.
We will use the Distributed ASCI Supercomputer (DAS-4) as the platform for developing JOINC. You have probably participated in the Parallel-programming course in the previous semester, so you may already have an account on this machine. Otherwise, send email to Karol to get a DAS account. All the software you need to complete your assignment is already installed for you on the DAS.
At the end of this assignment you are expected to hand in the following items:
- a tarball containing your implementation of JOINC.
- use ‘ant distro’ to generate the tarball
- documentation (short, 4 pages or so) describing:
- how you schedule tasks (what part of the JavaGAT did you use ?)
- how you handle in and output files (what parts of JavaGAT do you use ?)
- how you handle crashing tasks (how do you detect that something went wrong and what do you do ?)
- application measurements using the Prime or TSP application (or both)
- run the application(s) on 4, 8 and 16 workers
- analyse the overhead of your JOINC and the rest of the grid middleware
- show a breakdown of the average worker time
- describe the failure rates you’ve encountered
- give an estimate of the speedup that you would get if the queue was empty
The JOINC model:
Do not start from scratch when implementing JOINC. Instead, there is a template for your JOINC implementation on the DAS. All you have to do is to complete the implementation by filling in the blanks. This ensures that your JOINC implementation has the correct API, and is able the run the applications. An description of how to install the template is given below.
A JOINC applications are an example of so called Task-Farming or Master-Worker applications. These applications typically consists of two smaller applications: the Master and the Worker.
The Worker is the application that “does the real work”; it performs some complex computation or searches thought a data set. It typically is a straight forward sequential application that doesn’t know anything about Grids. When it is started, it simply gets some parameters on its command line, reads input file(s), performs computations, writes output file(s), and exits.
The Master is the application that generates the tasks that are solved by the workers. This is usually done by taking some large problem and splitting it into a number of smaller sub-problems. Each of these subproblems is then solved seperately by a worker. The result is returned to the master which then merges all results into a single one.
In this assignment, it is the responsibility of your JOINC implementation to take the tasks that are generated by the master application, and then use the JavaGAT to start worker applications on the cluster which process these tasks. Your JOINC implementation must make sure that the neccesary input files and parameters are provided to the worker when it starts, and that the output files are brought back to the master when the worker is done. Every time a task is finished, the master must be notified, so it can process the output.
The number of tasks produced by the master can be large so they cannot all be solved at the same time. Instead, your JOINC implementation should ask the master the maximum number of workers it is allowed to use simulataneously, and ensure that this number is never exceeded.
Since Grid environments are generally unreliable, workers may fail. The machine that the worker is running on may be claimed by some other (higher priority) application, or it may even crash. It is therefore important that your JOINC implementation provides fault-tolerance. Whenever a worker crashes or fails to produce the expected output files, it must not be returned to the master application, but instead be restarted by JOINC.
The JOINC template:
The JOINC template in the tarball consists of two Java classes, Master and Task (note that there is no worker class, since the workers do not have to know anything about JOINC). The Master class is designed to be extended by the ‘master’ of an application. It contains several abstract methods which will be implemented by the application. Your JOINC implementation can use these abstract methods to retrieve information from the application’s master. Since these methods are the interface between your Joinc implementation and the applications, you are not allowed to change them in any way! The Javadoc can be found here. You can also have a look at the source code of the applications to see how the abstract methods are actually implemented.
The abstract methods in the Master are:
int maximumWorkers() – how many workers may JOINC use ?
int totalTasks() – how many task will the master create ?
void idle() – to give CPU time to the master when JOINC is idle.
Task getTask() – get next task from the master.
void taskDone(Task t) – return finished task to the master.
Besides the abstract methods, the Master contains one extra (non-abstract) method:
void start() – start the JOINC library
When a master application is started, it will first perform some initialization, such as reading some data file and generating the required tasks. When it it ready to start it will invoke the start() method of your JOINC implementation. From that moment on, you are in control!
Using the abstract methods, you can then ask the application how many tasks it will produce, and how many workers you are allowed to use simultaneously. By using the getTask method you can retrieve a task from the application. This method returns a Task object which contains all the information you need to start the worker:
- A number which uniquely identifies the task.
- A file name which indicates where the standard output of the taskshould be written.
- A file name which indicates where the standard error of the task should be written.
- A file name which indicates from here the standard input of the task should be read.
- A list of input files on which the task depends.
- A list of output files which the task will produce.
- A list of jar files which contain the code of the worker.
- The name of the class which contains the main method of the worker.
- A list of command-line parameters that must be passed to the worker.
With this information and the JavaGAT you should then be able to run the task on one of the compute nodes of the cluster.
When a task is finished, you should call the taskDone method to notify the application of this fact. The application will then process the task’s output files.
All the applications tasks should be run to completion before the start method returns. Keep in mind that grids are unreliable, so some tasks may fail. The application will expect you to keep trying until the task succeeds.
When your JOINC implementation has nothing else to do, you may call the idle method. This will give the application a chance to produce some feedback to the user (like the fancy screensavers used in SETI@home).
Developing JOINC on the DAS:
To create your own JOINC implementation, you must first get an account on the DAS, and set you environment correctly so that the build tools used for JOINC are able to find the JavaGAT installation. On the frontend of the DAS machine (fs0.das4.cs.vu.nl), you will find all the files related to the assignments in the following directory:
In this directory you will find the tarballs of JOINC template and the applications. It also contains installations of the tools and libraries you need to develop JOINC, such as ant, the JavaGAT, and the script that you need to run the JOINC applications.
To be able to use ant and to make sure that your JOINC implementation build correctly, you have to set a number of environment variables correctly. To do this you must run the environment script that can be found in the /usr/local/VU/practicum/gridcomputing directory. You can run it like this:
The required environment variables should now be set. To check, execute the following command:
It should then print the location of the JavaGAT installation. Note that the variables are only set in the shell/terminal where you have run the command, and they will be lost as soon as you log out. To automatically load the setting whenever you log in, you can and the following line to the end of your .bashrc file in you home directory:
Once you’ve log in to our local frontend machine (fs0) and set your environment correctly, you can install your own copy of the JOINC template tarball. Since the tarball is already on the system. the only thing you have to do is unpack it:
tar xvfz /usr/local/VU/practicum/gridcomputing/tarballs/joinc-template.tgz
This will create a joinc subdirectory which contains the template for your JOINC implementation. To build this implementation you can type:
The template will then compile and create a jar file containing the JOINC implementation (which doesn’t do anything yet). These are all the ant command you can use:
ant build – to build the JOINC jar file
ant clean – to clean the tree
ant distro – to create a tarball of your JOINC
ant docs – to generate javadoc for users
You are now ready to create you own JOINC implementation by implementing the start method in the Master.java file.
You may develop JOINC using any tools you like (vi, emacs, eclipse, etc.) and add all the methods and classes to the JOINC template that you may need. Just make sure that your implementation does not change any of the abstract methods, and that it compiles with the build command explained above. We do expect you to use a reasonable coding style, like the one advocated by Sun. Don’t forget to document your code!
Once you are ready to test your implemetation you can build it and run one of the applications. We wil use the Prime Factorization application as an example.
Before you can run the application, you must first set an environent variable which points to your JOINC implementation, for example:
Note that the exact location depends on where you have untarred the joinc directory. Once the variable is set, you can test it by typing:
ls -all $JOINC_JAR
This should show you your jar file containing your JOINC implementation. Next, untar the application and build it:
tar xvfz /usr/local/VU/practicum/gridcomputing/tarballs/prime.tgz
Finally, you can run the application using the run_prime script that comes with the application:
./run_prime 1 1
This command will run the prime factorization master application, which will generate 1 task, and then run it on 1 worker. Have a look at the README file that comes with the application for an explanation of the arguments of the script. If all goes well, your application will print something like this:
Using worker jar: ./prime-worker.jar
Number of tasks : 1
Tasks waiting 0 / running 0 / finished 1
Tasks finished: 1 (OK)
1234567890 -> 2 * 3 * 3 * 5 * 3607 * 3803
Master time: 189 sec.
Worker time: 153 sec.
When your application has started, you can check if any workers are running on the cluster by typing:
Which should show you the current jobs in the queue of the cluster. If your user name is in their somewhere, your worker is in the queue. Note that it may take a couple of seconds before it appears (especially if the frontend machine is busy)
Also try the other applications to see if your JOINC is working correctly. Use the simulator see if it can also handles fault-tolerance. Finally do a performance analysis of your JOINC using the Prime or TSP application.
You can use the application simulator to test your implementation of JOINC. The workers of the simulator do not actually perform any computation. They only check if their input file is present, wait for a specified amount of time, write a simple output file, and (on request) crash occasionally. Using this simulation you can test if all features of your joinc implementation are working.
The tarball can be found on the frontend: /usr/local/VU/practicum/gridcomputing/tarballs/simulator.tgz
Prime Number Factorization
A simple prime number factorization application using JOINC. The master application generates a set of numbers, taking care that all numbers are approximately the same size. Each number is passed to a worker as a command line parameter. The worker then number is then split into prime factors. The result is written into an output file. The master gathers all output files and prints them as soon as all results are received. Because all number to be factored are similar in size, the workers all take the same about of time.
The tarball can be found on the frontend: /usr/local/VU/practicum/gridcomputing/tarballs/prime.tgz
Traveling Salesman Problem
A Traveling Salesman implementation that tries to find the shortest possible tour that visits each of a given number of cities exactly once. The master application takes reads an input files which contains the locations of each of the cities, and then generates a number of ‘partial solutions’. Each of these ‘partial solutions’ is then examined further by a worker.
The workers of this application expect several command line parameters and an input file. Each of them solves part of the complete TSP problem, and writes its results to an output file. The best result of all workers is the result of the total computation. Note that the workers get the best result found so far (by previous workers) as a parameter, which they use to optimize their own search. Therefore, the time required by the worker may vary (a lot).
Unlike the other two applications, you cannot specify the number of jobs that the TSP master generates. Instead, the Master decides for itself how many jobs it needs (depending on the problem size).
The tarball can be found on the frontend: /usr/local/VU/practicum/gridcomputing/tarballs/tsp.tgz
To implement Joinc, you will use JavaGAT, the Java version of the Grid Application Toolkit (GAT) developed by the Gridlab project. The JavaGAT provides a set of interfaces which allow you to perform the basic tasks needed on a grid, such as copying files or scheduling jobs.
In a grid environment, there are often many different ways of performing the same task. For example, a file can be copied using any one of the following commands: cp, scp, ftp, gsiftp, gridftp, globus-url-copy or wget. Although each of these command are somehow different, in the end they all perform the same task; copying a file.
The purpose of the GAT is to hide most of this complexity from the programmer. For each of the basic tasks that applications need to perform on the grid the GAT offers a single (and simple) interface. This interface allows the application to perform the task it is interested in (copying a file for example), without having to worry about how this is exactly implemented.
The GAT is designed in a modular plug-and-play manner. This allows multiple implementations (called adaptors in the GAT) of each interface. The file copy interface, for example, has an implementation (cp), an ssh implementation (scp), an ftp implementation, etc. When the application is trying to use an interface, the GAT will decide which implementation to use, depending on the information it has available (such as source and target URLs, user certificates, etc). Multiple implementations of the same interface may even be used at the same time.
JavaGAT on DAS-4
To run JavaGAT jobs on DAS-4 you need to use SGE (Sun Grid Engine) adaptor. Therefore your Resource Broker URI for VU DAS-4 should be: “sge://fs0.das4.cs.vu.nl”. Currently submitting jobs to remote clusters is not supported by SGE adaptor, therefore your solution may support local cluster only.