DAS-4 Job Execution


General

Programs are started on the DAS-4 compute nodes using the Grid Engine (SGE) batch queueing system. The SGE system reserves the requested number of nodes for the duration of a program run. It is also possible to reserve a number of hosts in advance, terminate running jobs, or query the status of current jobs. A full documentation set is available.

Job time policy on DAS-4

The default run time for SGE jobs on DAS-4 is 15 minutes, which is also the maximum during working hours. We do not want people to monopolize the clusters for long periods, since that makes interactively running short jobs on many or all nodes practically impossible.

During daytime, DAS-4 is specifically NOT a cluster for doing production work. It is meant for people doing experimental research on parallel and distributed programming. Only during the night and on weekends, when DAS-4 is regularly idle, are long jobs allowed. In all other cases, you will have to ask permission in advance from das-sysadm@cs.vu.nl to make sure you are not causing too much trouble for other users. More information is on the DAS-4 Usage Policy page.

SGE

Both SGE and prun are imported into the environment of the current DAS-4 login session (by setting PATH and other appropriate environment variables) using the command
module load prun
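
To verify that the SGE and prun commands are now available in your session, you can for instance run (illustrative check):

$ module list
$ which qsub prun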

The most often used SGE commands are:

  • qsub: submit a new job
  • qstat: ask status about current jobs
  • qdel: delete a queued job
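
For example (the job script name and job ID are just placeholders here):

$ qsub cpi.job        # submit the job script
$ qstat -u $USER      # show the status of your own jobs
$ qdel 4675           # delete the job with ID 4675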

Starting a program by means of SGE usually involves first creating an SGE job script, which takes care of setting up the proper environment, possibly copying some files, querying the nodes that are used, and starting the actual processes on the compute nodes reserved for the run. An MPI-based example can be found below.

The SGE system should be the only way in which processes on the DAS compute nodes are invoked; it provides exclusive access to the reserved processors. This is vitally important for doing controlled performance experiments, which is one of the main uses of DAS-4.

NOTE: People circumventing the reservation system will risk their account being blocked!

Temporary files in /local

If the variable SGE_KEEP_TMPFILES is set to "no" in "$HOME/.bashrc", any files created by the user in the /local file system will be removed by SGE at the end of the job.

If SGE_KEEP_TMPFILES is set to "yes", the user's files in /local will be left untouched.

If SGE_KEEP_TMPFILES is not set (as is the default), only the user's JavaGAT sandbox directories named /local/.JavaGAT_SANDBOX.* will be deleted.
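
For example, to keep all files in /local after job completion, one could add the following line to "$HOME/.bashrc" (illustrative; use "no" to have everything cleaned up instead):

export SGE_KEEP_TMPFILES=yes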

Prun user interface

An alternative, often more convenient way to use SGE is via the prun user interface. The prun command acts as a synchronous, shell-like interface, which was originally developed for DAS. For DAS-4 the user interface has been kept the same, but the node reservation is done by SGE, so SGE- and prun-initiated jobs do not interfere with each other. Note that on DAS-4 the "module load prun" command mentioned above should be used before invoking prun. See also the manual pages for preserve.
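
As a minimal illustration (this simply runs the standard /bin/hostname program on two nodes, one process per node):

$ module load prun
$ prun -v -np 2 /bin/hostname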

SGE caveats

  • SGE does not enforce exclusive access to the reserved processors. It is not difficult to run processes on the compute nodes behind the reservation system's back. However, this harms your fellow users, and yourself when you are interested in performance information.
  • SGE does not alter the users' environment. In particular, this means that pre-set execution limits (such as memory_use, cpu_time, etc.) are not changed. We think this is the way it ought to be: if users want to change their environment, they should do so with the appropriate "ulimit" command in their .bashrc (for bash users) or .cshrc (for csh/tcsh users) files in their home directory; an example follows this list.
  • NOTE: your .bash_profile (for bash users) or .login (for csh/tcsh users) is NOT executed within the SGE job, so be very careful with environment settings in your .bashrc/.cshrc.
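
For instance, a bash user could add ulimit settings to "$HOME/.bashrc" like this (example values only; pick limits appropriate for your experiments):

# Example ulimit settings for $HOME/.bashrc (illustrative values)
ulimit -c unlimited     # allow core dumps of any size
ulimit -s unlimited     # remove the stack size limit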

SGE/MPI example

Here, we will discuss a simple parallel example application, using MPI on a single DAS-4 cluster.

  • Step 1: Inspect the source code:

$ cat cpi.c
#include "mpi.h"
#include <stdio.h>
#include <math.h>

static double
f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int
main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int  namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    fprintf(stderr, "Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done) {
        if (myid == 0) {
            if (n == 0) {
                n = 100; /* precision for first iteration */
            } else {
                n = 0;   /* second iteration: force stop */
            }

            startwtime = MPI_Wtime();
        }

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) {
            done = 1;
        } else {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double) i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0) {
                printf("pi is approximately %.16f, error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }

    MPI_Finalize();

    return 0;
}


  • Step 2: Compile the code with OpenMPI:

$ module load openmpi/gcc
$ which mpicc
/cm/shared/apps/openmpi/gcc/64/1.4.2/bin/mpicc
$ mpicc -O2 -o cpi cpi.c


  • Step 3: Adapt the SGE job submission script to your needs. In general, running a new program using SGE only requires a few minor changes to an existing job script. A job script can be any regular shell script, but a few SGE-specific annotations in comments starting with "#$" are used to influence scheduling behavior. A script to start an MPI (OpenMPI) application could look like this:

$ cat cpi.job
#!/bin/bash
#$ -pe openmpi 16
#$ -l h_rt=0:15:00
#$ -N CPI
#$ -cwd

APP=./cpi
ARGS=""

# Get OpenMPI settings
. /etc/bashrc
module load openmpi/gcc

# Make new hostfile specifying the cores per node wanted
ncores=8
HOSTFILE=$TMPDIR/hosts
for host in `uniq $TMPDIR/machines`; do
    echo $host slots=$ncores
done > $HOSTFILE
nhosts=`wc -l < $HOSTFILE`
totcores=`expr $nhosts \* $ncores`

# Use regular ssh-based startup instead of OpenMPI/SGE native one
unset PE_HOSTFILE
PATH=/usr/bin:$PATH

$MPI_RUN -np $totcores --hostfile $HOSTFILE $APP $ARGS

In this example, the first line specifies that the shell "/bin/bash" should be used to execute the job script.

The statement "-pe openmpi 16" asks for a "parallel environment" called "openmpi", to be run with 2 nodes. We need to specify 16 since every regular compute node on DAS-4 has 8 cores, corresponding to 8 SGE "job slots", and 2 times 8 is 16. The parallel environment specification causes SGE to use a wrapper script that knows about the SGE job interfaces, and which can be used to conveniently start parallel applications of a particular kind, in this case OpenMPI.

Next we set the job length parameters. Here the job is given 15 minutes walltime maximum; if it takes longer than that, it will automatically be terminated by SGE. This is important since during working hours by default only relatively short jobs are allowed on DAS-4.
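
If a longer run is permitted (e.g., for a night or weekend job, as described in the job time policy above), the same directive simply requests more wall-clock time; the value below is only an illustration:

#$ -l h_rt=8:00:00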

The name of the SGE job is then explicitly set to "CPI" with "-N CPI"; otherwise the name of the SGE script itself (i.e., "cpi.job") would be used. The name of the job also determines how the job's output files will be named.

The "-cwd" line requests execution of the script from the current working directory at submission time; without this, the job would be started from the user's home directory.

The script then sets the program to be run (APP) and the arguments to be passed (ARGS). The application itself is started by means of OpenMPI's "mpirun" tool. The example job script starts one process for each processor core on each node. The number of processes per node to use may depend on the user's preferences, on whether the processes themselves are internally multithreaded, etc. To use only one MPI process per node, set "ncores" to 1, as shown below.
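
For example, to run one MPI process per node, only the core count in cpi.job needs to change:

ncores=1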

Note: the mpirun command used should match the MPI version used during compilation, since the startup procedure typically differs between the various MPI versions supported on DAS-4. To select the right version, we do not use an absolute path but use the "module" command to import the right OpenMPI environment, after which the MPI_RUN shell variable can be used.

  • Step 4: Check how many nodes there are and their availability using "qstat -f", or use "preserve -llist" to quickly check the current node usage:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
   4671 0.50617 prun-job   versto       r     11/18/2010 23:15:21 all.q@node001.cm.cluster          32        

$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001.cm.cluster       BIP   0/8/8          0.00     lx26-amd64    
   4671 0.50617 prun-job   versto       r     11/18/2010 23:15:21     8        
---------------------------------------------------------------------------------
all.q@node002.cm.cluster       BIP   0/8/8          0.00     lx26-amd64    
   4671 0.50617 prun-job   versto       r     11/18/2010 23:15:21     8        
---------------------------------------------------------------------------------
all.q@node003.cm.cluster       BIP   0/8/8          0.00     lx26-amd64    
   4671 0.50617 prun-job   versto       r     11/18/2010 23:15:21     8        
---------------------------------------------------------------------------------
all.q@node004.cm.cluster       BIP   0/8/8          0.00     lx26-amd64    
   4671 0.50617 prun-job   versto       r     11/18/2010 23:15:21     8        
---------------------------------------------------------------------------------
all.q@node005.cm.cluster       BIP   0/8/8          0.00     lx26-amd64    
---------------------------------------------------------------------------------
all.q@node006.cm.cluster       BIP   0/0/8          0.01     lx26-amd64    
[..]

$ preserve -llist
Thu Nov 18 23:16:15 2010

id   user   start       stop        state nhosts hosts
4671 versto 11/18 23:15 11/18 23:19 r     4      node001 node002 node003 node004

Note: DAS-4 compute nodes are generally divided over two queues: "all.q" (containing the standard nodes) and "gpu.q" (similar nodes, but with a GPU). Depending on the site, additional nodes for special purposes may be available as well (e.g., for jobs requiring more memory, more or faster cores, or more local storage); these are put in a queue called "fat.q", which requires special SGE parameters to prevent accidental use.
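
As an illustration only (the exact parameters are site-dependent; consult the DAS-4 documentation for the resource requests required for the special queues), a specific queue can be requested with SGE's standard "-q" option:

$ qsub -q gpu.q cpi.job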

  • Step 5: Submit the SGE job and check its status until it has completed:

$ module load sge
$ qsub cpi.job
Your job 4675 ("CPI") has been submitted
$ qstat -u $USER
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
   4675 0.45617 CPI        versto       r     11/18/2010 23:38:51 all.q@node043.cm.cluster          16        

$ preserve -llist
Thu Nov 18 23:39:18 2010

id   user   start       stop        state nhosts hosts
4675 versto 11/18 23:38 11/18 23:53 r      2     node010 node043

$ preserve -llist
Thu Nov 18 23:45:33 2010

$

Note that besides the concise "preserve -llist" output, there are many ways to get specific information about running jobs using SGE's native qstat command. To get an overview of the available options, run "qstat -help".
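
For example, detailed information about a single job can be requested by its job ID (here the job submitted above):

$ qstat -j 4675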

  • Step 6: Examine the standard output and standard error files for job with ID 4675:

[versto@fs0 cpi]$ cat CPI.o4675 
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.050803
[versto@fs0 cpi]$ cat CPI.e4675 | sort -n -k 2
Process 0 on node043
Process 1 on node043
Process 2 on node043
Process 3 on node043
Process 4 on node043
Process 5 on node043
Process 6 on node043
Process 7 on node043
Process 8 on node010
Process 9 on node010
Process 10 on node010
Process 11 on node010
Process 12 on node010
Process 13 on node010
Process 14 on node010
Process 15 on node010

Prun/MPI example

Using the Prun user interface on top of SGE can often be more convenient, as the following examples show.

The number of compute nodes is specified by the "-np <nodes>" argument. By default one process per node is started; to run more processes per node, pass the desired number per node as an additional option (e.g., "-4" for four processes per node, as in the second example below).


$ module load prun

$ prun -v -np 2 -sge-script $PRUN_ETC/prun-openmpi `pwd`/cpi
Reservation number 4679: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Fri Nov 19 00:04:51 2010
: node043/0 node010/0 
Process 0 on node043
Process 1 on node010
pi is approximately 3.1416009869231241, error is 0.0000083333333309
wall clock time = 0.018761

$ prun -v -4 -np 2 -sge-script $PRUN_ETC/prun-openmpi `pwd`/cpi
Reservation number 4680: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Fri Nov 19 00:06:22 2010
: node026/0 node026/1 node026/2 node026/3 node044/0 node044/1 node044/2 node044/3 
Process 0 on node026
Process 1 on node026
Process 2 on node026
Process 3 on node026
Process 4 on node044
Process 5 on node044
Process 6 on node044
Process 7 on node044
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.036693

Here, the generic Prun/SGE script $PRUN_ETC/prun-openmpi is used to start the OpenMPI application, similar to the SGE example above. The script also uses a number of environment variables that are provided by Prun:


$ cat $PRUN_ETC/prun-openmpi
#!/bin/sh

# Construct host file for OpenMPI's mpirun:
NODEFILE=$TMPDIR/hosts

# Configure specified number of CPUs per node:
( for i in $PRUN_PE_HOSTS; do
    echo $i slots=$PRUN_CPUS_PER_NODE
  done
) > $NODEFILE

# Need to disable SGE's PE_HOSTFILE, or OpenMPI will use it instead of the
# constructed nodefile based on prun's info:
unset PE_HOSTFILE

module load openmpi/gcc

$MPI_RUN $OMPI_OPTS --hostfile $NODEFILE $PRUN_PROG $PRUN_PROGARGS

Note, though, that SGE "#$" directives are ignored by prun when using a job script.

The environment variable OMPI_OPTS is passed to the MPI deployment tool "mpirun" and can be used to provide MPI-specific runtime options. For example, it can be used to enforce MPI process-to-CPU-socket binding (sometimes useful to improve performance on the nodes' NUMA architecture):


$ prun OMPI_OPTS="--bind-to-socket --bysocket" -sge-script $PRUN_ETC/prun-openmpi etc

To use TCP/IP as a network protocol using the IP-over-InfiniBand implementation instead of the low-level InfiniBand driver:


$ prun OMPI_OPTS="--mca btl tcp,self --mca btl_tcp_if_include ib0" -sge-script $PRUN_ETC/prun-openmpi etc

For more details, see the prun manual page and the OpenMPI documentation.