The Distributed ASCI Supercomputer 3

DAS-3 Job Execution


General

    Programs are started on the DAS-3 compute nodes using the Sun Grid Engine (SGE) batch queueing system. The SGE system reserves the requested number of nodes for the duration of a program run. It is also possible to reserve a number of hosts in advance, terminate running jobs, or query the status of current jobs. A full documentation set is available.

Job time policy on DAS-3

    The default run time for SGE jobs on DAS-3 is 15 minutes, which is also the maximum for jobs on DAS-3 during working hours. We do not want people to monopolize the clusters for a long time, since that makes interactively running short jobs on many or all nodes practically impossible.

    During the daytime, DAS-3 is specifically NOT a cluster for doing production work. It is meant for people doing experimental research on parallel and distributed programming. Long jobs are allowed only at night and at the weekend, when DAS-3 is regularly idle. In all other cases, you must ask permission in advance from das-sysadm@cs.vu.nl to make sure you are not causing too much trouble for other users. More information is on the DAS-3 Usage Policy page.

SGE

    SGE on DAS-3 is imported into the development environment of the current login session (by setting the PATH and appropriate other environment variables) using the commands

    module load default-ethernet

    or

    module load default-myrinet

    (the latter only if the high speed Myri-10G interconnect should be used, which is not available in DAS-3/Delft).

    The most often used SGE commands are
    • qsub: submit a new job
    • qstat: query the status of current jobs
    • qdel: delete a queued or running job
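
    As a quick sanity check after loading one of the modules, a sketch like the following reports which of the three commands can be found on the PATH. The function name check_sge_tools is hypothetical and not part of SGE or DAS-3; it only relies on the standard shell built-in "command -v".

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): report whether each of the
# three SGE commands named above can be found on the current PATH.
# Prints one line per tool: "<tool>: found" or "<tool>: missing".
check_sge_tools() {
    for tool in qsub qstat qdel; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

check_sge_tools
```

    If any tool is reported as missing, the most likely cause is that no module command has been run in the current login session yet.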

    Starting a program by means of SGE usually involves the creation of an SGE job script first, which takes care of setting up the proper environment, possibly copying some files, querying the nodes that are used, and starting the actual processes on the compute nodes that were reserved for the run. An MPI-based example can be found below.
    The SGE system should be the only way in which processes on the DAS compute nodes are invoked; only then do jobs have exclusive access to the processors reserved for them. This is vitally important for doing controlled performance experiments, which is one of the main uses of DAS-3.

    NOTE: People circumventing the reservation system will risk their account being blocked!


Temporary files in /local

    If the variable SGE_KEEP_TMPFILES is set to "no" in "$HOME/.bashrc", SGE removes any files created by the user in the /local file system at the end of the job.
    If SGE_KEEP_TMPFILES is set to "yes", the user's files in /local will be left untouched.
    If SGE_KEEP_TMPFILES is not set (as is the default), only the user's JavaGAT sandbox directories named /local/.JavaGAT_SANDBOX.* will be deleted.
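
    As an illustration, the setting could be placed in "$HOME/.bashrc" roughly as follows. Whether the variable has to be exported, rather than merely assigned, is an assumption here; the text above only says that it must be set.

```shell
# In $HOME/.bashrc: ask SGE to remove everything I created in /local
# when a job ends. Use "yes" instead to keep the files, or leave the
# variable unset to have only the JavaGAT sandbox directories removed.
# Note: using "export" is an assumption; the page only says "set".
export SGE_KEEP_TMPFILES=no
```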

Prun user interface

    An alternative, often more convenient way to use SGE is via the prun user interface. The advantage is that the prun command acts as a synchronous, shell-like interface, which was originally developed for DAS. For DAS-3, the user interface is kept largely the same, but the node reservation is done by SGE. Therefore, SGE- and prun-initiated jobs don't interfere with each other. Note that on DAS-3 one of the module commands mentioned above for SGE should be run before invoking prun. See also the manual pages for preserve.

SGE caveats

  • SGE does not enforce exclusive access to the reserved processors. It is not difficult to run processes on the compute nodes behind the reservation system's back. However, this harms your fellow users, and yourself when you are interested in performance information.
  • SGE does not alter the users' environment. In particular, this means that pre-set execution limits (such as memory_use, cpu_time, etc.) are not changed. We think this is the way it ought to be: if users want to change their environment, they should do so with the appropriate "ulimit" command in their .bashrc (for bash users) or .cshrc (for csh/tcsh users) files in their home directory.
  • NOTE: your .bash_profile (for bash users) or .login (for csh/tcsh users) is NOT executed within the SGE job, so be very careful with environment settings in your .bashrc/.cshrc.
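
    A minimal sketch of such a "ulimit" setting for bash users is shown below. The particular limit chosen (core file size) and its value are purely illustrative and are not a DAS-3 recommendation.

```shell
# In $HOME/.bashrc (bash users). The limit and value are illustrative only.
# Lower the soft limit on core file size to 0, so that a crashing
# process does not write a core dump.
ulimit -S -c 0

# Show the resulting soft limit.
ulimit -S -c
```

    Because .bash_profile is not executed within the SGE job, a setting like this only takes effect in jobs when it lives in .bashrc.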

SGE/MPI example:

Here, we will discuss a simple parallel example application, using MPI on a single DAS-3 cluster. See the DAS-3 OpenMPI page for an example regarding multi-cluster Grid runs using OpenMPI.
  • Step 1: Inspect the source code:
$ cat cpi.c
#include "mpi.h"
#include <stdio.h>
#include <math.h>

static double
f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int
main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int  namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    fprintf(stderr, "Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done) {
        if (myid == 0) {
            if (n == 0) {
                n = 100; /* precision for first iteration */
            } else {
                n = 0;   /* second iteration: force stop */
            }

            startwtime = MPI_Wtime();
        }

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) {
            done = 1;
        } else {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double) i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0) {
                printf("pi is approximately %.16f, error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }

    MPI_Finalize();

    return 0;
}

  • Step 2: Compile the code with MPICH:
$ module load default-ethernet   # or "default-myrinet"
$ which mpicc
/usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin/mpicc
$ mpicc -o cpi cpi.c

  • Step 3: Adapt the SGE job submission script to your needs. In general, running a new program using SGE only requires a few minor changes to an existing job script. A job script can be any regular shell script, but a few SGE-specific annotations in comments starting with "#$" are used to influence scheduling behavior. A script to start an MPI (MPICH) application could look like this:
$ cat cpi.sge
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N cpi-job
#$ -l h_rt=0:15:00
#$ -pe mpich 4

MYAPP=cpi
MYAPP_FLAGS=''

# The number of cores per node can be different on the various DAS sites.
# Here we will use every cpu core there is on the first compute node,
# and assume this will be the same on the other nodes in the same cluster:
cpuspernode=`cat /proc/cpuinfo | grep processor | wc -l`
ncpus=`expr $NSLOTS \* $cpuspernode`

export MPICH_PROCESS_GROUP=no
. /etc/profile.d/modules.sh
module add default-ethernet  # or "default-myrinet"

#$ -v MPI_HOME
#$ -v LD_LIBRARY_PATH

echo "Got $NSLOTS nodes."
echo "Original machines file:"
cat $TMPDIR/machines
for i in `cat $TMPDIR/machines`; do
    j=0
    while [ $j -lt $cpuspernode ]; do
        echo $i
        j=`expr $j + 1`
    done
done > $TMPDIR/machines2
echo "Using following machines file:"
cat $TMPDIR/machines2

cmd="$MPI_HOME/bin/mpirun -np $ncpus -machinefile \
$TMPDIR/machines2  $MYAPP $MYAPP_FLAGS"
echo Will run: $cmd
$cmd

    In this example, the first two lines specify that the shell "/bin/sh" is used to execute the job script. The first line is in fact sufficient on DAS-3, but the second line is included for backward compatibility with other SGE installations.

    The "-cwd" in line 3 requests execution of the script from the current working directory; without this, the job would be started from the user's home directory.

    The name of the SGE job is then explicitly set to "cpi-job" with "-N cpi-job"; otherwise the name of the SGE script itself would be used (i.e., "cpi.sge"). The name of the job also determines how the job's output files will be called.

    In lines five and six, we set the job parameters. Here the job is given a maximum of 15 minutes; if it takes longer than that, it will automatically be terminated by SGE. This is important since during working hours only short jobs are allowed on DAS-3. The phrase "-pe mpich 4" asks for a "parallel environment" called "mpich", to be run on 4 nodes. The parallel environment specification causes SGE to use wrapper scripts that know about the SGE job interfaces, and which can be used to conveniently start parallel applications of a particular kind, in this case MPICH.
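
    The "h_rt" value is written as hours:minutes:seconds. The following sketch converts such a value to seconds, which can help when comparing it with the 900-second reservations reported by Prun further down this page. The function name h_rt_to_seconds is hypothetical, and the sketch assumes the value always has exactly three colon-separated fields.

```shell
#!/bin/sh
# Hypothetical helper: convert an h_rt value of the form H:MM:SS
# into a number of seconds. Assumes exactly three fields.
h_rt_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

h_rt_to_seconds 0:15:00    # the 15-minute limit from the job script: prints 900
```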

    The script then sets the program to be run and the arguments to be passed. The application itself is started with MPICH's "mpirun" tool. The example job script starts one process for each processor core on each node. Since the number of cores per node depends on the cluster in DAS-3 (it is 4 on DAS-3/VU and DAS-3/UvA, but 2 on DAS-3/Leiden, DAS-3/Delft and DAS-3/UvA-MN), it is determined dynamically by the SGE script, which runs on the first node allocated. To use only one or two MPI processes per node, explicitly set "cpuspernode" to 1 or 2.
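
    The machines-file expansion performed by the job script can be tried outside SGE with a sketch like the one below. The function name expand_machines is hypothetical, and a throwaway file stands in for $TMPDIR/machines, which only exists inside a running job.

```shell
#!/bin/sh
# Hypothetical helper: repeat every host name in a machines file a fixed
# number of times, mirroring the nested loop in the job script above.
#   $1 = machines file (one host name per line)
#   $2 = number of processes per node
expand_machines() {
    machines_file=$1
    per_node=$2
    while read -r host; do
        j=0
        while [ "$j" -lt "$per_node" ]; do
            echo "$host"
            j=$((j + 1))
        done
    done < "$machines_file"
}

# Example with a temporary file instead of $TMPDIR/machines:
tmp=$(mktemp)
printf 'node353\nnode322\n' > "$tmp"
expand_machines "$tmp" 2    # each host is printed twice
rm -f "$tmp"
```

    Passing 1 as the second argument yields one line per host, which corresponds to setting "cpuspernode" to 1 in the job script.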

    Note: the mpirun command used should be from the same MPICH directory that was used during compilation, since the startup procedure may be different between the various versions of MPICH supported on DAS-3.
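
    One way to guard against mixing MPICH versions is to compare the directories of the compiler wrapper and of mpirun before starting a run. The sketch below is only an illustration: same_install_dir is a hypothetical name, and the example reuses the mpicc path shown in Step 2 rather than querying a real system.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both paths are in the same directory.
same_install_dir() {
    [ "$(dirname "$1")" = "$(dirname "$2")" ]
}

# Example using the directory shown in Step 2. On a real system the two
# arguments would come from "which mpicc" and "$MPI_HOME/bin/mpirun".
prefix=/usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin
if same_install_dir "$prefix/mpicc" "$prefix/mpirun"; then
    echo "mpicc and mpirun match"
else
    echo "warning: mpicc and mpirun come from different directories"
fi
```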

  • Step 4: Check how many nodes there are and whether they are available using "qstat -f", or use "preserve -llist" to quickly check the current node usage:
$ qstat

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@node301.beowulf.cluster  BIP   0/1       0.00     lx26-amd64    
----------------------------------------------------------------------------
all.q@node302.beowulf.cluster  BIP   0/1       0.00     lx26-amd64    
----------------------------------------------------------------------------
all.q@node303.beowulf.cluster  BIP   0/1       0.00     lx26-amd64    
----------------------------------------------------------------------------
all.q@node304.beowulf.cluster  BIP   0/1       0.00     lx26-amd64    
[...]

$ preserve -llist
Mon Oct  9 15:29:39 2006

id      user            start           stop            state   nhosts  hosts
2590    versto          10/09   15:29   10/09   15:44   r       4       node308 node320
node354 node366

    Note: DAS-3 compute nodes will generally be in the SGE queue called "all.q", the default host queue containing all nodes. Depending on the site, other nodes for special purposes may be available as well, but they will typically be put in a different queue to prevent their accidental use.
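
    For a quick count of idle nodes, the "qstat -f" output can be filtered with awk. The sketch below is a rough illustration under stated assumptions: count_idle_nodes is a hypothetical name, only queue instances in "all.q" are considered, the third column is taken to be "used/tot." as in the listing above, and the "states" column is ignored, so nodes that are unavailable for other reasons would still be counted.

```shell
#!/bin/sh
# Hypothetical helper: read "qstat -f" output on stdin and print the number
# of all.q queue instances whose used-slot count is 0.
# Assumption: column 3 has the form "used/total"; the states column is ignored.
count_idle_nodes() {
    awk '$1 ~ /^all\.q@/ {
             split($3, slots, "/")
             if (slots[1] == 0) idle++
         }
         END { print idle + 0 }'
}

# Example on two lines taken from the listing above; on DAS-3 one would
# run "qstat -f | count_idle_nodes" instead.
count_idle_nodes <<'EOF'
queuename                      qtype used/tot. load_avg arch          states
all.q@node301.beowulf.cluster  BIP   0/1       0.00     lx26-amd64
all.q@node302.beowulf.cluster  BIP   0/1       0.00     lx26-amd64
EOF
```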

  • Step 5: Submit the SGE job and check its status until it has completed:
$ qsub cpi.sge 
Your job 2591 ("cpi-job") has been submitted
$ qstat -u $USER
job-ID  prior   name     user     state submit/start at     queue    slots ja-task-ID 
-------------------------------------------------------------------------------------
  2591  0.00000 cpi-job  versto   qw    10/09/2006 15:31:14              4        
$ preserve -llist
Mon Oct  9 15:31:24 2006

id      user            start           stop            state   nhosts  hosts
2591    versto          10/09   15:31   10/09   15:46   r       4       node322 node353
node359 node368

$ preserve -llist
Mon Oct  9 15:35:02 2006

$

    Note that besides the concise "preserve -llist" output, there are many ways to get specific information about the running jobs using SGE's native qstat command. To get an overview of the options available, run "qstat -help".
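
    Checking the status "until it has completed" can also be scripted. The following is a minimal sketch, not a DAS-3 tool: job_is_listed and wait_for_job are hypothetical names, the job ID is assumed to be the first column of the "qstat -u $USER" output as in the listing above, and the 10-second polling interval is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: read "qstat -u $USER" output on stdin and succeed
# if a job with the given ID appears in the first column.
job_is_listed() {
    awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Hypothetical polling loop: return once the job is no longer listed.
# The 10-second interval is arbitrary.
wait_for_job() {
    while qstat -u "$USER" | job_is_listed "$1"; do
        sleep 10
    done
}

# Example on the status line shown above (no SGE needed for this check):
printf '  2591  0.00000 cpi-job  versto   qw    10/09/2006 15:31:14              4\n' |
    job_is_listed 2591 && echo "job 2591 is still queued or running"
```

    On DAS-3 itself, "wait_for_job 2591" would block until job 2591 has disappeared from the qstat listing.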

  • Step 6: Examine the standard output and standard error files for job with ID 2591:
$ cat cpi-job.o2591 
Got 4 nodes.
Original machines file:
node353
node322
node359
node368
Using following machines file:
node353
node353
node322
node322
node359
node359
node368
node368
Will run: /usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin/mpirun -np 8 -machinefile /tmp/2591.1.all.q/machines2 cpi
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.007812

$ cat cpi-job.e2591 
Process 0 on node353.beowulf.cluster
Process 1 on node353.beowulf.cluster
Process 2 on node322.beowulf.cluster
Process 3 on node322.beowulf.cluster
Process 5 on node359.beowulf.cluster
Process 4 on node359.beowulf.cluster
Process 6 on node368.beowulf.cluster
Process 7 on node368.beowulf.cluster


Prun/MPI example:

    Using the Prun user interface on top of SGE can often be more convenient, as the following examples show.
    Note carefully the difference between using the "-1", "-2" and "-4" options (explicitly starting one, two or four processes per node, with -np specifying the number of nodes) and the default behavior: starting two processes per node, with -np specifying the total number of processes.
$ module load default-ethernet   # or "default-myrinet"
$ prun -v -1 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2592: Reserved 4 hosts for 900 seconds 
Run on 4 hosts for 960 seconds from Mon Oct  9 15:50:08 2006
: node355/0 node334/0 node338/0 node305/0 
Process 0 on node355.beowulf.cluster
Process 1 on node334.beowulf.cluster
Process 2 on node338.beowulf.cluster
Process 3 on node305.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906

$ prun -v -2 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2593: Reserved 4 hosts for 900 seconds 
Run on 4 hosts for 960 seconds from Mon Oct  9 15:50:37 2006
: node321/0 node321/1 node349/0 node349/1 node302/0 node302/1 node356/0 node356/1 
Process 0 on node321.beowulf.cluster
Process 1 on node349.beowulf.cluster
Process 2 on node302.beowulf.cluster
Process 3 on node356.beowulf.cluster
Process 4 on node321.beowulf.cluster
Process 5 on node349.beowulf.cluster
Process 6 on node302.beowulf.cluster
Process 7 on node356.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906

$ prun -v -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2594: Reserved 4 hosts for 900 seconds 
Run on 4 hosts for 960 seconds from Mon Oct  9 15:50:52 2006
: node350/0 node350/1 node301/0 node301/1 
Process 0 on node350.beowulf.cluster
Process 1 on node301.beowulf.cluster
Process 2 on node350.beowulf.cluster
Process 3 on node301.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906
    The generic Prun/SGE script /usr/local/sitedep/reserve.sge/sge-script is used to start the application, similar to the SGE example above. The script also uses a number of environment variables that are provided by Prun.
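
    The difference between the explicit per-node options and the default behavior, as seen in the three cpi runs above, can be summarized in a small sketch. The function prun_layout is hypothetical and not part of Prun; for the default case, rounding the node count up for odd process counts is an assumption, since the page only shows an even count.

```shell
#!/bin/sh
# Hypothetical helper: print "<nodes> <processes>" for a Prun invocation.
#   $1 = per-node option without the dash (1, 2 or 4), or "default"
#   $2 = the count given to Prun
prun_layout() {
    per_node=$1
    count=$2
    case "$per_node" in
        default)
            # The count is the total number of processes, two per node.
            # Rounding up for odd counts is an assumption.
            echo "$(( ($count + 1) / 2 )) $count"
            ;;
        *)
            # The count is the number of nodes.
            echo "$count $(( $count * $per_node ))"
            ;;
    esac
}

prun_layout 1 4          # prints "4 4": 4 nodes, 4 processes
prun_layout 2 4          # prints "4 8": 4 nodes, 8 processes
prun_layout default 4    # prints "2 4": 2 nodes, 4 processes
```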


A few more Prun examples:

    The following examples show that starting commands that do not require pre-execution coordination is quite easy using Prun, and can be done without the help of a startup script. For more details, see the prun manual page.
$ prun -1 -v /bin/echo 2 hello
Reservation number 2595: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Mon Oct  9 16:01:37 2006
: node341/0 node323/0 
0 2 hello
1 2 hello

$ prun -1 -v -no-panda /bin/echo 2 hello
Reservation number 2596: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Mon Oct  9 16:01:52 2006
: node318/0 node343/0 
hello
hello

$ prun -1 -v -np 2 /bin/echo hello
Reservation number 2597: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Mon Oct  9 16:02:38 2006
: node317/0 node365/0 
hello
hello

