Running parallel programs on DAS-2

Programs are started on the DAS-2 compute nodes using the Sun Grid Engine (SGE) batch queueing system. The SGE system reserves the requested number of nodes for the duration of a program run. It is also possible to reserve a number of hosts in advance, terminate running jobs, or query the status of current jobs. A full documentation set is available.

NOTE: We have switched over from PBS; the information about PBS is only kept for reference purposes, to help you port your PBS scripts to SGE.

Job time policy on DAS-2

The default run time for SGE jobs on DAS-2 is 15 minutes, which is also the maximum for jobs on DAS-2 during working hours. We do not want people to monopolize the clusters for a long time, since that makes interactively running short jobs on many or all nodes practically impossible.

DAS-2 is specifically NOT a cluster for doing production work. It is meant for people doing experimental research on parallel and distributed programming. Long jobs are only allowed during the night and in the weekend, when DAS-2 is regularly idle. In all other cases, you first have to ask permission from das-sysadm@cs.vu.nl to make sure you are not causing too much trouble for other users. More information is on the DAS-2 usage page.

SGE

SGE is installed on DAS-2 in /usr/local/package/sge, including extensive documentation. The most often used commands are qsub (submit a job script), qstat (query the status of jobs), and qdel (terminate a job).
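
For example (a minimal sketch; myjob.sge is a placeholder job script and 1215 a sample job ID):

$ qsub myjob.sge     # submit a job script
$ qstat -u $USER     # show the status of your own jobs
$ qdel 1215          # terminate/remove a job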

Starting a program by means of SGE usually involves the creation of an SGE job script first, which takes care of setting up the proper environment, possibly copying some files, querying the nodes that are used, and starting the actual processes on the compute nodes that were reserved for the run. An MPI-based example can be found below.

Prun user interface

An alternative, often more convenient way to use SGE is via the prun user interface. The advantage is that the prun command acts as a synchronous, shell-like interface, which was originally developed for DAS-1. For DAS-2, the user interface is kept largely the same, but the node reservation is done by SGE. Therefore, SGE- and prun-initiated jobs don't interfere with each other. See also the manual page for preserve.
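
For example, a simple command can be run synchronously on two nodes with a single prun invocation (a minimal sketch using options shown in the examples below; /bin/hostname is just an arbitrary simple command):

$ prun -v -1 -no-panda /bin/hostname 2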

The SGE system should be the only way in which processes on the DAS compute nodes are invoked; it provides exclusive access to the reserved processors. This is vitally important for doing controlled performance experiments, which is one of the main uses of DAS-2.

People circumventing the reservation system risk having their account blocked!

SGE caveats

SGE/MPI example

Step 1: Inspect the source code:


$ cat cpi.c
#include "mpi.h"
#include <stdio.h>
#include <math.h>

static double
f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int
main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int  namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    fprintf(stderr, "Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done) {
        if (myid == 0) {
            if (n == 0) {
                n = 100; /* precision for first iteration */
            } else {
                n = 0;   /* second iteration: force stop */
            }

            startwtime = MPI_Wtime();
        }

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) {
            done = 1;
        } else {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double) i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0) {
                printf("pi is approximately %.16f, error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }

    MPI_Finalize();

    return 0;
}

Step 2: Compile the code with MPICH-GM:


$ which mpicc
/usr/local/mpich/mpich-gm-gcc/bin/mpicc
$ mpicc -o cpi cpi.c

Step 3: Adapt the SGE job submission script to your needs.

In general, running a new program using SGE only requires a few minor changes to an existing job script. A job script can be any regular shell script, but a few SGE-specific annotations in comments starting with "#$" are used to influence scheduling behavior. A script to start an MPI (MPICH) application could look like this:


$ cat cpi.sge
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N cpi
#$ -l h_rt=0:15:00
#$ -pe mpi 4

PROG=cpi
ARGS=""

/usr/local/mpich/mpich-gm-gcc/bin/mpirun \
    -np $NSLOTS -machinefile $TMPDIR/machines $PROG $ARGS

In this example, the first two lines specify that the shell "/bin/sh" is to be used for the execution of the job script. The first line is in fact sufficient on DAS-2, but the second line is included for backward compatibility with other SGE installations.

The "-cwd" in line 3 requests execution of the script from the current working directory; without this, the job would be started from the user's home directory.

The name of the SGE job is then explicitly set to "cpi" with "-N cpi"; otherwise the name of the SGE script itself would be used (i.e., "cpi.sge"). The job name also determines how the job's output files will be named.

In lines five and six, we set the job parameters. Here the job is given 15 minutes maximum; if it takes longer than that, it will automatically be terminated by SGE. This is important since during working hours only short jobs are allowed on DAS-2. The phrase "-pe mpi 4" asks for a "parallel environment" called "mpi", to be run with 4 nodes. The parallel environment is basically a wrapper script that knows about the SGE job interfaces, and which can be used to conveniently start parallel applications.
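
For example, a longer run on more nodes (allowed without prior permission only during the night or in the weekend, see the job time policy above) would only require changing these two lines, e.g.:

#$ -l h_rt=1:00:00
#$ -pe mpi 8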

The script then sets the program to be run and the arguments to be passed. The application itself is started by means of MPICH-GM's "mpirun" tool. Mpirun determines the Myrinet ports to be used during the run, and starts the MPI processes on the hosts it obtained from the parallel environment. The example job script starts one process on each node; to start two processes per node, double the -np parameter by specifying `expr $NSLOTS \* 2` instead of $NSLOTS, as shown below.
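
For example, the mpirun invocation for two processes per node would become:

/usr/local/mpich/mpich-gm-gcc/bin/mpirun \
    -np `expr $NSLOTS \* 2` -machinefile $TMPDIR/machines $PROG $ARGS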

Note: the mpirun command used should be from the same MPICH directory that was used during compilation, since the startup procedure may be different between the various versions of MPICH supported on DAS-2.
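
A quick way to check this (assuming the MPICH-GM installation shown above is first in your PATH):

$ which mpirun
/usr/local/mpich/mpich-gm-gcc/bin/mpirun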

Step 4: Check the current node usage with qstat, or use "preserve -llist" for a quick overview of the current reservations:


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@node000.das2.cs.vu.nl    BIP   2/2       0.00     lx24-x86      
   1210 0.55500 prun-job   versto       r     04/15/2005 15:02:09     2        
----------------------------------------------------------------------------
all.q@node001.das2.cs.vu.nl    BIP   2/2       0.00     lx24-x86      
   1210 0.55500 prun-job   versto       r     04/15/2005 15:02:09     2        
----------------------------------------------------------------------------
all.q@node002.das2.cs.vu.nl    BIP   2/2       0.00     lx24-x86      
   1210 0.55500 prun-job   versto       r     04/15/2005 15:02:09     2        
----------------------------------------------------------------------------
all.q@node003.das2.cs.vu.nl    BIP   2/2       0.00     lx24-x86      
   1210 0.55500 prun-job   versto       r     04/15/2005 15:02:09     2        
[...]

$ preserve -llist
Fri Apr 15 15:02:36 2005

id      user    start           stop            state   nhosts  hosts
1211    versto  04/15   15:02   04/15   15:18   r       4       node000 node001 node002 node003

Note: DAS-2 compute nodes will generally be in the SGE queue called "all.q" (the default host queue containing all nodes). Depending on the site, other nodes for special purposes may be available as well, but they will typically be put in a different queue to prevent their accidental use.
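
The queues configured at a particular site can be listed with the standard SGE configuration command (a minimal sketch; the exact list depends on the site):

$ qconf -sql
all.q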

Step 5: Submit the SGE job and check its status until it has completed:


$ qsub cpi.sge
Your job 1215 ("cpi") has been submitted.

$ qstat -u $USER
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
   1215 0.00000 cpi        versto       qw    04/15/2005 15:21:44                                    4        

$ preserve -llist
Fri Apr 15 15:22:45 2005

id      user    start           stop            state   nhosts  hosts
1215    versto  04/15   15:22   04/15   15:37   r       -       node000 node001 node002 node003

$ qstat -u $USER
$ 

Note that besides the concise "preserve -llist" output, there are many ways to get specific information about running jobs using SGE's native qstat command. To get an overview of the options available, run "qstat -help".
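
For example (two standard qstat options; a sketch):

$ qstat -j 1215     # detailed information about job 1215
$ qstat -f -u '*'   # full queue listing, including jobs of all users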

Step 6: Examine the standard output and standard error files for the job with ID 1215:


$ cat cpi.o1215 
pi is approximately 3.1416009869231245, error is 0.0000083333333314
wall clock time = 0.000550

$ cat cpi.e1215 
Process 0 on node000.das2.cs.vu.nl
Process 1 on node001.das2.cs.vu.nl
Process 2 on node002.das2.cs.vu.nl
Process 3 on node003.das2.cs.vu.nl

Prun/MPI example

Using the Prun user interface on top of SGE is often more convenient, as the following examples show.
Note carefully the difference between using the "-1" and "-2" options and the default behavior: with "-1" the last argument specifies the number of nodes, each running one process; with "-2" it specifies the number of nodes, each running two processes; by default it specifies the total number of processes, which are placed two per node.


$ prun -1 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Process 0 on node000.das2.cs.vu.nl
Process 1 on node001.das2.cs.vu.nl
Process 2 on node002.das2.cs.vu.nl
Process 3 on node003.das2.cs.vu.nl
pi is approximately 3.1416009869231245, error is 0.0000083333333314
wall clock time = 0.000541

$ prun -2 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Process 0 on node000.das2.cs.vu.nl
Process 1 on node001.das2.cs.vu.nl
Process 2 on node002.das2.cs.vu.nl
Process 3 on node003.das2.cs.vu.nl
Process 4 on node000.das2.cs.vu.nl
Process 5 on node001.das2.cs.vu.nl
Process 6 on node002.das2.cs.vu.nl
Process 7 on node003.das2.cs.vu.nl
pi is approximately 3.1416009869231245, error is 0.0000083333333314
wall clock time = 0.000808

$ prun -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Process 0 on node000.das2.cs.vu.nl
Process 1 on node001.das2.cs.vu.nl
Process 2 on node000.das2.cs.vu.nl
Process 3 on node001.das2.cs.vu.nl
pi is approximately 3.1416009869231245, error is 0.0000083333333314
wall clock time = 0.000637


The generic SGE script /usr/local/sitedep/reserve.sge/sge-script is used to start the application, similar to the SGE example above.
The script can use a number of environment variables that are provided by Prun.
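
To experiment with a customized version of the startup script, you can copy it and pass the copy to prun explicitly (a sketch; $HOME/my-sge-script is just a hypothetical name for the copy):

$ cp /usr/local/sitedep/reserve.sge/sge-script $HOME/my-sge-script
$ prun -1 -sge-script $HOME/my-sge-script `pwd`/cpi 4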

A few more Prun examples

The following examples show that commands which do not require pre-execution coordination can easily be started with Prun, without the help of a startup script. For more details, see the prun manual page.


$ prun -1 -v /bin/echo 2 hello
Reservation number 1224: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Fri Apr 15 16:04:33 2005
: node000/0 node001/0 
0 2 hello
1 2 hello
$ prun -1 -v -no-panda /bin/echo 2 hello
Reservation number 1225: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Fri Apr 15 16:04:52 2005
: node002/0 node003/0 
hello
hello
$ prun -1 -v -np 2 /bin/echo hello
Reservation number 1226: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Fri Apr 15 16:05:13 2005
: node001/0 node000/0 
hello
hello



