Running parallel programs on DAS-2 using PBS

NOTE: We have switched over to the Sun Grid Engine (SGE) system. The information about PBS on this page is only kept for reference purposes, to help porting your PBS scripts to SGE.

PBS

PBS is installed on DAS-2 in /usr/local/pbs, including extensive documentation. The most often used commands are qsub (submit a job), qstat (show the status of jobs and queues), and qdel (remove a job).
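
For example, using the job script and job ID from the example further down this page purely as an illustration:

[versto@fs0 MPI]$ qsub cpi.pbs      # submit a job script; prints the job ID
[versto@fs0 MPI]$ qstat             # show the status of queued and running jobs
[versto@fs0 MPI]$ qdel 521          # remove job 521 from the queue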

Starting a (parallel) program by means of PBS usually involves first creating a PBS job script, which takes care of setting up the proper environment, possibly copying some files, querying the nodes that are used, and starting the actual processes on the compute nodes reserved for the run. An MPI-based example can be found below.

Prun user interface

An alternative, often more convenient way to use PBS is via the prun user interface. The advantage is that the prun command acts as a synchronous, shell-like interface; it was originally developed for DAS-1. For DAS-2, the user interface is kept largely the same, but the node reservation is done by PBS, so PBS- and prun-initiated jobs don't interfere with each other. See also the manual pages for preserve and pkill.

People who circumvent the reservation system risk having their account blocked!

The PBS system should be the only way in which processes on the DAS compute nodes are invoked; it provides exclusive access to the reserved processors. This is vitally important for doing controlled performance experiments, which is one of the main uses of DAS-2.

PBS caveats

PBS does not enforce exclusive access to the reserved processors: it is not difficult to run processes on the compute nodes behind the reservation system's back. However, doing so harms your fellow users, and yourself whenever you are interested in performance information.

PBS does not alter the user's environment. In particular, this means that pre-set execution limits (such as memory use, cpu time, etc.) are not changed. We think this is the way it ought to be: if users want to change their environment, they should do so in their .bashrc (for bash users) or .cshrc (for csh/tcsh users) in their home directory.

NOTE: your .bash_profile (for bash users) or .login (for csh/tcsh users) is NOT executed within the PBS job, so be very careful with environment settings in your .bashrc/.cshrc.
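
For example, a bash user could add something like the following to ~/.bashrc (the MPICH-GM path is the one used elsewhere on this page; the core-dump limit is only an illustration of an execution limit you may want to adjust yourself):

# Fragment of ~/.bashrc -- read by PBS jobs as well as interactive shells
export PATH=/usr/local/mpich/mpich-gm/bin:$PATH   # find mpicc/mpirun without full paths
ulimit -S -c 0                                    # e.g., disable core dumps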

PBS/MPI example

Step 1: Inspect the source code:


[versto@fs0 MPI]$ cat cpi.c
#include "mpi.h"
#include <stdio.h>
#include <math.h>

static double
f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int
main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int  namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    fprintf(stderr, "Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done) {
        if (myid == 0) {
            if (n == 0) {
                n = 100; /* precision for first iteration */
            } else {
                n = 0;   /* second iteration: force stop */
            }

            startwtime = MPI_Wtime();
        }

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) {
            done = 1;
        } else {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double) i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0) {
                printf("pi is approximately %.16f, error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }

    MPI_Finalize();

    return 0;
}

Step 2: Compile the code with MPICH-GM:


[versto@fs0 MPI]$ which mpicc
/usr/local/mpich/mpich-gm/bin/mpicc
[versto@fs0 MPI]$ mpicc -o cpi cpi.c

Step 3: Adapt the PBS job submission script to your needs.

In general, running a new program using PBS only requires a few minor changes to an existing job script.

In lines two and three of the script below, we set the PBS job parameters. We first ask for 4 nodes for 15 minutes, with both processors allocated on each node ("ppn=2"). The name of the PBS job is then explicitly set to "cpi" with "#PBS -N cpi"; otherwise the name of the PBS script itself (i.e., "cpi.pbs") would be used.

Next, the script sets the variable "NODES", by default to $PBS_NODEFILE (the file in which the PBS system lists the allocated cpus). Since two cpus are reserved per node, each node appears twice in this file, resulting in two processes per node (NP, the number of allocated cpus, will thus be set to 8).

Occasionally it is useful to start only a single process per allocated dual-cpu node. Setting ALLCPUS in the script to "no" causes it to filter out the duplicate node entries.
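
As an illustration, for the "nodes=4:ppn=2" reservation requested here, $PBS_NODEFILE could look like the following (hypothetical node names, in the style of the qstat output in Step 5); the "uniq" in the script below then leaves each of the four node names only once:

node071
node071
node070
node070
node069
node069
node068
node068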

Finally, MPICH-GM's mpirun command determines the Myrinet ports to be used during the run and starts the MPI processes on the specified cpus.

Note: the mpirun command used should come from the same MPICH-GM directory as was used during compilation, since the startup procedure may differ between MPICH-GM versions.
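
Assuming the MPICH-GM bin directory from Step 2 is first in your PATH, the same "which" check as used for mpicc can verify this:

[versto@fs0 MPI]$ which mpirun
/usr/local/mpich/mpich-gm/bin/mpirun

The job script below simply uses the full path to mpirun, which avoids any ambiguity.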


[versto@fs0 MPI]$ cat cpi.pbs
#!/bin/bash
#PBS -l nodes=4:ppn=2,walltime=00:15:00 
#PBS -N cpi

PROG=cpi
ARGS=""

# By default start a process on each cpu, as in $PBS_NODEFILE:
ALLCPUS=yes
if [ $ALLCPUS = yes ]; then
    GMCONF=no
    NODES=$PBS_NODEFILE
else
    # Only start a single process per node
    mkdir -p ~/.gmpi
    GMCONF=~/.gmpi/mconf.$PBS_JOBID
    uniq < $PBS_NODEFILE > $GMCONF
    NODES=$GMCONF
fi

# Run from the directory in which the job was submitted:
cd $PBS_O_WORKDIR
# NP = number of cpus listed in $NODES:
NP=`cat $NODES | wc -l`
/usr/local/mpich/mpich-gm/bin/mpirun -np $NP -machinefile $NODES $PROG $ARGS
result=$?

if [ $GMCONF != no ]; then
    rm -f $GMCONF
fi

exit $result

Step 4: Check the Maui scheduler and/or the PBS server to see if enough nodes are available:


[versto@fs0 MPI]$ showq
ACTIVE JOBS--------------------
           JOBNAME USERNAME      STATE  PROC   REMAINING            STARTTIME

               520   rutger    Running    8     0:02:05  Tue Jul 30 13:19:04

     1 Active Jobs       8 of  144 Processors Active (5.56%)
                         4 of   72 Nodes Active      (5.56%)

IDLE JOBS----------------------
           JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

NON-QUEUED JOBS----------------
           JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Non-Queued Jobs: 0

[versto@fs0 MPI]$ qstat -aRn dque

fs0.das2.cs.vu.nl: 
                                          Req'd  Req'd   Elap 
Job ID          Username Queue    NDS TSK Memory Time  S Time   BIG  FAST  PFS
--------------- -------- -------- --- --- ------ ----- - ----- ----- ----- ---
520.fs0.das2.cs rutger   dque       4  --    --  00:15 R   --    --    --   --
node060/1+node060/0+node059/1+node059/0+node058/1+node058/0+node057/1+node057/0

Note: DAS-2 compute nodes will generally be in the PBS queue called "dque" (the "default" queue). Depending on the site, other nodes for special purposes may be available as well, but they will typically be put in a different queue to prevent their accidental use.
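
Which queues are configured at your site can be checked with the standard PBS queue listing (output omitted here, since it differs per site):

[versto@fs0 MPI]$ qstat -Q      # list all queues on this PBS server, including "dque"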

Step 5: Submit the PBS job and check its status until it has completed:


[versto@fs0 MPI]$ qsub cpi.pbs
521.fs0.das2.cs.vu.nl
[versto@fs0 MPI]$ qstat -aRn dque

fs0.das2.cs.vu.nl: 
                                          Req'd  Req'd   Elap 
Job ID          Username Queue    NDS TSK Memory Time  S Time   BIG  FAST  PFS
--------------- -------- -------- --- --- ------ ----- - ----- ----- ----- ---
520.fs0.das2.cs rutger   dque       4  --    --  00:15 R   --    --    --   --
node060/1+node060/0+node059/1+node059/0+node058/1+node058/0+node057/1+node057/0
521.fs0.das2.cs versto   dque       4  --    --  00:15 R   --    --    --   --
node071/1+node071/0+node070/1+node070/0+node069/1+node069/0+node068/1+node068/0
[versto@fs0 MPI]$ qstat -aRn
[versto@fs0 MPI]$ 
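
A job that should not run to completion after all (for example, one submitted with the wrong parameters) can be removed with qdel. Purely as an illustration, using the job ID printed by qsub above:

[versto@fs0 MPI]$ qdel 521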

Step 6: Examine the standard output and standard error files for job ID 521; PBS writes them in the submission directory as <jobname>.o<jobid> and <jobname>.e<jobid>:


[versto@fs0 MPI]$ cat cpi.o521 
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.000623

[versto@fs0 MPI]$ cat cpi.e521 
Process 0 on node071.das2.cs.vu.nl
Process 1 on node071.das2.cs.vu.nl
Process 2 on node070.das2.cs.vu.nl
Process 3 on node070.das2.cs.vu.nl
Process 4 on node069.das2.cs.vu.nl
Process 5 on node069.das2.cs.vu.nl
Process 6 on node068.das2.cs.vu.nl
Process 7 on node068.das2.cs.vu.nl

Prun/MPI example

Using the prun user interface on top of PBS is often more convenient, as the following examples show.
Note carefully the difference between the -1 and -2 options and the default behavior: with "-1" the argument 4 is the number of nodes, each running a single process; with "-2" it is the number of nodes, each running two processes (one per cpu); by default it is the total number of processes, packed two per dual-cpu node.


$ prun -1 -pbs-script /usr/local/sitedep/reserve/pbs_script `pwd`/cpi 4
Process 0 on node071.das2.cs.vu.nl
Process 1 on node070.das2.cs.vu.nl
Process 2 on node069.das2.cs.vu.nl
Process 3 on node068.das2.cs.vu.nl
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.000603

$ prun -2 -pbs-script /usr/local/sitedep/reserve/pbs_script `pwd`/cpi 4
Process 0 on node071.das2.cs.vu.nl
Process 1 on node071.das2.cs.vu.nl
Process 2 on node070.das2.cs.vu.nl
Process 3 on node070.das2.cs.vu.nl
Process 4 on node069.das2.cs.vu.nl
Process 5 on node069.das2.cs.vu.nl
Process 6 on node068.das2.cs.vu.nl
Process 7 on node068.das2.cs.vu.nl
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.000615

$ prun -pbs-script /usr/local/sitedep/reserve/pbs_script `pwd`/cpi 4
Process 0 on node071.das2.cs.vu.nl
Process 1 on node071.das2.cs.vu.nl
Process 2 on node070.das2.cs.vu.nl
Process 3 on node070.das2.cs.vu.nl
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.000485


The generic PBS script /usr/local/sitedep/reserve/pbs_script is used to start the application, similar to the PBS example above. The script can use a number of environment variables provided by prun; for more details, see the prun manual page.

This page is maintained by Kees Verstoep. Last modified: Thu Mar 4 17:25:43 CET 2004