Programs are started on the DAS-2 compute nodes using the Sun Grid Engine (SGE) batch queueing system. The SGE system reserves the requested number of nodes for the duration of a program run. It is also possible to reserve a number of hosts in advance, terminate running jobs, or query the status of current jobs. A full documentation set is available.
NOTE: We have switched over from PBS; the information about PBS is only kept for reference purposes, to help porting your PBS scripts to SGE.
The default run time for SGE jobs on DAS-2 is 15 minutes, which is also the maximum for jobs on DAS-2 during working hours. We do not want people to monopolize the clusters for a long time, since that makes interactively running short jobs on many or all nodes practically impossible.
DAS-2 is specifically NOT a cluster for doing production work. It is meant for people doing experimental research on parallel and distributed programming. Only during the night and in the weekend, when DAS-2 is regularly idle, it is allowed to run long jobs. In all other cases, you will first have to ask permission in advance from das-sysadm@cs.vu.nl to make sure you are not causing too much trouble for other users. More information is on the DAS-2 usage page.
SGE is installed on DAS-2 in /usr/local/package/sge, including extensive documentation. The most often used commands are
Starting a program by means of SGE usually involves the creation of a SGE job script first, which takes care of setting up the proper environment, possibly copying some files, querying the nodes that are used, and starting the actual processes on the compute nodes that were reserved for the run. An MPI-based example can be found below.
An alternative, often more convenient way to use SGE is via the prun user interface. The advantage is that the prun command acts as a synchronous, shell-like interface, which was originally developed for DAS-1. For DAS-2, the user interface is kept largely the same, but the node reservation is done by SGE. Therefore, SGE- and prun-initiated jobs don't interfere with each other. See also the manual page for preserve.
The SGE system should be the only way in which processes on the DAS compute nodes are invoked; it provides exclusive access to the reserved processors. This is vitally important for doing controlled performance experiments, which is one of the main uses of DAS-2.
Step 1: Inspect the source code:
$ cat cpi.c
#include "mpi.h"
#include <stdio.h>
#include <math.h>
static double
f(double a)
{
return (4.0 / (1.0 + a*a));
}
int
main(int argc, char *argv[])
{
int done = 0, n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
double startwtime = 0.0, endwtime;
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Get_processor_name(processor_name, &namelen);
fprintf(stderr, "Process %d on %s\n", myid, processor_name);
n = 0;
while (!done) {
if (myid == 0) {
if (n == 0) {
n = 100; /* precision for first iteration */
} else {
n = 0; /* second iteration: force stop */
}
startwtime = MPI_Wtime();
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0) {
done = 1;
} else {
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
x = h * ((double) i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, & pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0) {
printf("pi is approximately %.16f, error is %.16f\n",
pi, fabs(pi - PI25DT));
endwtime = MPI_Wtime();
printf("wall clock time = %f\n", endwtime - startwtime);
}
}
}
MPI_Finalize();
return 0;
}
Step 2: Compile the code with MPICH-GM:
$ which mpicc /usr/local/mpich/mpich-gm-gcc/bin/mpicc $ mpicc -o cpi cpi.c
Step 3: Adapt the SGE job submission script to your needs.
In general, running a new program using SGE only requires a few minor changes to an existing job script. A job script can be any regular shell script, but a few SGE-specific annotations in comments starting with "#$" are used to influence scheduling behavior. A script to start an MPI (MPICH) application could look like this:
$ cat cpi.sge
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N cpi
#$ -l h_rt=0:15:00
#$ -pe mpi 4
PROG=cpi
ARGS=""
/usr/local/mpich/mpich-gm-gcc/bin/mpirun \
-np $NSLOTS -machinefile $TMPDIR/machines $PROG $ARGS
In this example, on the first two lines we specified the shell "/bin/sh" to be used for the execute of the job script. The first line is in fact sufficient on DAS-2, but the second line is included for backward compatibility with other SGE installations.
The "-cwd" in line 3 requests execution of the script from the current working directory; without this, the job would be started from the user's home directory.
The name of the SGE job is then explicitly set to "cpi" with "-N cpi"; otherwise the name of the SGE script itself would be used (i.e., "cpi.sge"). The name of the job also determines how the jobs output files will be called.
In lines five and six, we set the job parameters. Here the job is given 15 minutes maximum; if it takes longer than that, it will automatically be terminated by SGE. This is important since during working hours only short jobs are allowed on DAS-2. The phrase "-pe mpi 4" asks for a "parallel environment" called "mpi", to be run with 4 nodes. The parallel environment is basically a wrapper script that knows about the SGE job interfaces, and which can be used to conveniently start parallel applications.
The script then sets the program to be run and arguments to be passed. The actual starting of the application is by means of MPICH-GM's "mpirun" tool. Mpirun determines the Myrinet ports to be used during the run, and starts the MPI processes on the hosts it obtained from the parallel environment. The example job script starts one process on each node; to start two processes per node, double the -np parameter by specifying `expr $NSLOTS \* 2` instead of $NSLOTS.
Note: the mpirun command used should be from the same MPICH directory that was used during compilation, since the startup procedure may be different between the various versions of MPICH supported on DAS-2.
Step 4: Check to see how many nodes there are using qstat, or use "preserve -llist" to quickly check the current node usage:
$ qstat -f queuename qtype used/tot. load_avg arch states ---------------------------------------------------------------------------- all.q@node000.das2.cs.vu.nl BIP 2/2 0.00 lx24-x86 1210 0.55500 prun-job versto r 04/15/2005 15:02:09 2 ---------------------------------------------------------------------------- all.q@node001.das2.cs.vu.nl BIP 2/2 0.00 lx24-x86 1210 0.55500 prun-job versto r 04/15/2005 15:02:09 2 ---------------------------------------------------------------------------- all.q@node002.das2.cs.vu.nl BIP 2/2 0.00 lx24-x86 1210 0.55500 prun-job versto r 04/15/2005 15:02:09 2 ---------------------------------------------------------------------------- all.q@node003.das2.cs.vu.nl BIP 2/2 0.00 lx24-x86 1210 0.55500 prun-job versto r 04/15/2005 15:02:09 2 [...] $ preserve -llist Fri Apr 15 15:02:36 2005 id user start stop state nhosts hosts 1211 versto 04/15 15:02 04/15 15:18 r 4 node000 node001 node002 node003
Step 5: Submit the SGE job and check its status until it has completed:
$ qsub cpi.sge
Your job 1215 ("cpi") has been submitted.
$ qstat -u $USER
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1215 0.00000 cpi versto qw 04/15/2005 15:21:44 4
$ preserve -llist
Fri Apr 15 15:22:45 2005
id user start stop state nhosts hosts
1215 versto 04/15 15:22 04/15 15:37 r - node000 node001 node002 node003
$ qstat -u $USER
$
Step 6: Examine the standard output and standard error files for job with ID 1215:
$ cat cpi.o1215 pi is approximately 3.1416009869231245, error is 0.0000083333333314 wall clock time = 0.000550 $ cat cpi.e1215 Process 0 on node000.das2.cs.vu.nl Process 1 on node001.das2.cs.vu.nl Process 2 on node002.das2.cs.vu.nl Process 3 on node003.das2.cs.vu.nl
Using the Prun user interface on top of SGE can often be more convenient
as the following examples show.
Note carefully the difference between using the
"-1" and "-2" option and the default behavior.
$ prun -1 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4 Process 0 on node000.das2.cs.vu.nl Process 1 on node001.das2.cs.vu.nl Process 2 on node002.das2.cs.vu.nl Process 3 on node003.das2.cs.vu.nl pi is approximately 3.1416009869231245, error is 0.0000083333333314 wall clock time = 0.000541 $ prun -2 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4 Process 0 on node000.das2.cs.vu.nl Process 1 on node001.das2.cs.vu.nl Process 2 on node002.das2.cs.vu.nl Process 3 on node003.das2.cs.vu.nl Process 4 on node000.das2.cs.vu.nl Process 5 on node001.das2.cs.vu.nl Process 6 on node002.das2.cs.vu.nl Process 7 on node003.das2.cs.vu.nl pi is approximately 3.1416009869231245, error is 0.0000083333333314 wall clock time = 0.000808 $ prun -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4 Process 0 on node000.das2.cs.vu.nl Process 1 on node001.das2.cs.vu.nl Process 2 on node000.das2.cs.vu.nl Process 3 on node001.das2.cs.vu.nl pi is approximately 3.1416009869231245, error is 0.0000083333333314 wall clock time = 0.000637
The following examples show that starting commands that do not require
pre-execution coordination is quite easy using Prun, and can
be done without the help of a startup script.
For more details, see the
prun manual page.
$ prun -1 -v /bin/echo 2 hello Reservation number 1224: Reserved 2 hosts for 900 seconds Run on 2 hosts for 960 seconds from Fri Apr 15 16:04:33 2005 : node000/0 node001/0 0 2 hello 1 2 hello $ prun -1 -v -no-panda /bin/echo 2 hello Reservation number 1225: Reserved 2 hosts for 900 seconds Run on 2 hosts for 960 seconds from Fri Apr 15 16:04:52 2005 : node002/0 node003/0 hello hello $ prun -1 -v -np 2 /bin/echo hello Reservation number 1226: Reserved 2 hosts for 900 seconds Run on 2 hosts for 960 seconds from Fri Apr 15 16:05:13 2005 : node001/0 node000/0 hello hello