Announcements

2 Jan 2017

DAS-5/VU has been extended with 4 TitanX-Pascal GPUs. Check DAS-5 special nodes for an overview of all special nodes.

9 Nov 2016

CUDA 8 is now available on all DAS-5 sites with GPU nodes. Check the DAS-5 GPU page for usage info.

May, 2016

IEEE Computer publishes a paper about 20 years of the Distributed ASCI Supercomputer. See the DAS Achievements page.

28 Sep 2015

DAS-5/VU has been extended with 16 GTX TitanX GPUs.

6 Aug 2015

DAS-5/UvA has been extended with 4 GTX TitanX GPUs: two each in node205 and node206.

6 Jul 2015

DAS-5 is fully operational!

DAS-5 Job Execution


General

Programs are started on the DAS-5 compute nodes using the SLURM batch queueing system. The SLURM system reserves the requested number of nodes for the duration of a program run. It is also possible to reserve a number of hosts in advance, terminate running jobs, or query the status of current jobs. A full documentation set is available.

Job time policy on DAS-5

The default run time for jobs on DAS-5 is 15 minutes, which is also the maximum during working hours. We do not want people to monopolize the clusters for long periods, since that makes interactively running short jobs on many or all nodes practically impossible.

During daytime, DAS-5 is specifically NOT a cluster for doing production work. It is meant for people doing experimental research on parallel and distributed programming. Long jobs are allowed only during the night and in the weekend, when DAS-5 is regularly idle. In all other cases, you first have to ask permission from das-sysadm@cs.vu.nl to make sure you are not causing too much trouble for other users. More information is on the DAS-5 Usage Policy page.

SLURM

Both SLURM and prun on DAS-5 are imported into the development environment of the current login session (by setting PATH and other appropriate environment variables) using the command
module load prun

The most often used SLURM commands are (brief usage examples follow the list):

  • sbatch: submit a new batch job
  • srun: run/submit a new job
  • squeue: ask status about current jobs
  • scancel: delete a queued job
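
For example, typical invocations look like this (a minimal sketch; substitute your own job script name and job ID):

$ sbatch cpi.job        # submit the job script "cpi.job"
$ squeue -u $USER       # show the status of your own jobs
$ scancel 2621          # cancel the job with ID 2621
$ srun -N 2 hostname    # run "hostname" on two nodes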

Starting a program by means of SLURM usually involves first creating a SLURM job script, which takes care of setting up the proper environment, possibly copying some files, querying the nodes that are used, and starting the actual processes on the compute nodes reserved for the run. An MPI-based example can be found below.

The SLURM system should be the only way in which processes on the DAS compute nodes are invoked; it provides exclusive access to the reserved processors. This is vitally important for doing controlled performance experiments, which is one of the main uses of DAS-5.

NOTE: People circumventing the reservation system will risk their account being blocked!

Prun user interface

An alternative, often more convenient way to use SLURM is via the prun user interface. The advantage is that the prun command acts as a synchronous, shell-like interface, which was originally developed for DAS. For DAS-5, the user interface is kept the same, but the node reservation is done by SLURM; SLURM- and prun-initiated jobs therefore do not interfere with each other. Note that on DAS-5 the module command mentioned above for SLURM should be used before invoking prun. See also the manual page for preserve.

SLURM caveats

  • SLURM does not enforce exclusive access to the reserved processors. It is not difficult to run processes on the compute nodes behind the reservation system's back. However, this harms your fellow users, and yourself when you are interested in performance information.
  • SLURM does not alter the users' environment. In particular this means that pre-set execution limits (such as memory use, CPU time, etc.) are not changed. We think this is the way it ought to be: if users want to change their environment, they should do so with the appropriate "ulimit" command in their .bashrc (for bash users) or .cshrc (for csh/tcsh users) in their home directory; see the sketch after this list.
  • NOTE: your .bash_profile (for bash users) or .login (for csh/tcsh users) is NOT executed within the SLURM job, so be very careful with environment settings in your .bashrc/.cshrc.
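
As a minimal sketch (assuming bash and that your site allows raising these limits), adjusting execution limits for SLURM jobs could be done in ~/.bashrc like this:

# in ~/.bashrc, which is read by the non-interactive shells of SLURM jobs
ulimit -c unlimited    # allow core dumps to be written
ulimit -s unlimited    # remove the stack size limit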

SLURM/MPI example

Here, we will discuss a simple parallel example application, using MPI on a single DAS-5 cluster.

  • Step 1: Inspect the source code:

$ cat cpi.c
#include "mpi.h"
#include <stdio.h>
#include <math.h>

static double
f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int
main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int  namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    fprintf(stderr, "Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done) {
        if (myid == 0) {
            if (n == 0) {
                n = 100; /* precision for first iteration */
            } else {
                n = 0;   /* second iteration: force stop */
            }

            startwtime = MPI_Wtime();
        }

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) {
            done = 1;
        } else {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double) i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0) {
                printf("pi is approximately %.16f, error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }

    MPI_Finalize();

    return 0;
}


  • Step 2: Compile the code with OpenMPI:

$ module load openmpi/gcc/64
$ which mpicc
/cm/shared/apps/openmpi/gcc/64/1.10.1/bin/mpicc
$ mpicc -O2 -o cpi cpi.c


  • Step 3: Adapt the SLURM job submission script to your needs. In general, running a new program using SLURM only requires a few minor changes to an existing job script. A job script can be any regular shell script, but a few SLURM-specific annotations in comments starting with "#SBATCH" are used to influence scheduling behavior. A script to start an MPI (OpenMPI) application could look like this:

$ cat cpi.job
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH -N 2
#SBATCH --ntasks-per-node=16

. /etc/bashrc
. /etc/profile.d/modules.sh
module load openmpi/gcc/64

APP=./cpi
ARGS=""
OMPI_OPTS="--mca btl ^usnic"

$MPI_RUN $OMPI_OPTS $APP $ARGS

In this example, the first line specifies that the shell "/bin/bash" is to be used for the execution of the job script.

We first set the job length parameters. Here the job is given 15 minutes walltime maximum; if it takes longer than that, it will automatically be terminated by SLURM. This is important since during working hours by default only relatively short jobs are allowed on DAS-5.
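
For reference, SLURM also accepts time limits in days-hours:minutes:seconds form, so a longer request for a night or weekend run (subject to the usage policy above) could look like:

#SBATCH --time=1-00:00:00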

Every regular compute node on DAS-5 has 16 cores. In this job we request two compute nodes (line "#SBATCH -N 2") and use all 16 cores on each of them by running a separate MPI process per core (line "#SBATCH --ntasks-per-node=16").

Next, the script imports the OpenMPI module environment. Since a batch job starts without the initialization that comes with interactive sessions, the bash and module environments have to be imported explicitly first.

The script then sets the program to be run (APP) and the arguments to be passed (ARGS). The application itself is started by means of OpenMPI's "mpirun" tool. By default, OpenMPI as currently installed tries to use a "usnic" network that is not available; the OMPI_OPTS line works around this and avoids the resulting warnings (this should be fixed shortly).

The mpirun command used should match the MPI version that was used during compilation, since the startup procedure differs between the various MPI versions supported on DAS-5. To select the right version, we do not use an absolute path, but use the "module" command to import the right OpenMPI environment, after which the MPI_RUN shell variable can be used.

  • Step 4: Check how many nodes there are and their availability using "sinfo":

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite     65   idle node[001-068]

DAS-5 compute nodes will generally be in partition "defq", which contains all standard CPU and GPU-capable nodes. Depending on the site, additional nodes for special purposes may be available as well (e.g., for jobs requiring more memory, more or faster cores, or more local storage), but these are put in a partition called "fatq" that requires special SLURM parameters, to prevent their accidental allocation.
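
For example, a job that really needs one of the special nodes could request that partition explicitly, either in the job script or on the command line (a sketch; check the partition names "sinfo" reports on your site):

#SBATCH --partition=fatq

$ sbatch -p fatq cpi.job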

  • Step 5: Submit the SLURM job and check its status until it has completed:

$ sbatch cpi.job; squeue
Submitted batch job 2621
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2621      defq  cpi.job   versto  R       0:00      2 node[017-018]
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


  • Step 6: Examine the standard output and standard error files for job with ID 2621:

$ cat slurm-2621.out 
srun: cluster configuration lacks support for cpu binding
Process 0 on node017
Process 1 on node017
Process 2 on node017
..
Process 14 on node017
Process 15 on node017
Process 16 on node018
Process 17 on node018
..
Process 31 on node018
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.064584

Prun/MPI example

Using the Prun user interface on top of SLURM can often be more convenient, as the following examples show.

The number of compute nodes is specified with the "-np nodes" argument. By default one process per node is started. To specify more processes per node, add a numeric argument "-<nprocs>" (so to replicate the native SLURM sbatch example above, use "-np 2 -16").


$ module load prun

$ prun -np 2 -script $PRUN_ETC/prun-openmpi ./cpi
Process 0 on node017
Process 1 on node018
pi is approximately 3.1416009869231241, error is 0.0000083333333309
wall clock time = 0.028548

$ prun -np 2 -4 -script $PRUN_ETC/prun-openmpi ./cpi
Process 0 on node017
Process 1 on node017
Process 2 on node017
Process 3 on node017
Process 4 on node018
Process 5 on node018
Process 6 on node018
Process 7 on node018
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.041275

Here, the generic Prun/SLURM script $PRUN_ETC/prun-openmpi is used to start the OpenMPI application, similar to the SLURM example above. The script also uses a number of environment variables that are provided by Prun. Here is the actual script; for non-OpenMPI use cases it is little work to write an alternative re-usable prun run script (a minimal sketch follows below).

Note: the output can currently contain the spurious warning "srun: cluster configuration lacks support for cpu binding", which is a known minor SLURM/OpenMPI integration issue that should be fixed shortly.


$ cat $PRUN_ETC/prun-openmpi
#!/bin/sh

# Sanity checks to make sure we are running under prun/SLURM:
if [ "X$SLURM_JOB_ID" = X ]; then
    echo "No SLURM_JOB_ID in environment; not running under SLURM?" >&2
    exit 1
fi
if [ "X$PRUN_PE_HOSTS" = X ]; then
    echo "No PRUN_PE_HOSTS in environment; not running under prun?" >&2
    exit 1
fi

module load openmpi/gcc/64

# Construct host file for OpenMPI's mpirun:
mkdir -p $HOME/tmp
NODEFILE=$HOME/tmp/hosts.$SLURM_JOB_ID

# Configure specified number of CPUs per node:
( for i in $PRUN_PE_HOSTS; do
    echo $i slots=$PRUN_CPUS_PER_NODE
  done
) > $NODEFILE

OMPI_OPTS="--mca btl ^usnic"
$MPI_RUN $OMPI_OPTS --hostfile $NODEFILE $PRUN_PROG $PRUN_PROGARGS
retval=$?

rm $NODEFILE
exit $retval
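
For non-OpenMPI programs, an alternative re-usable run script could be as small as the following sketch, which lets "srun" start one task per reserved CPU slot (this is only an illustration based on the prun environment variables above, not an installed script):

#!/bin/sh

# Sanity check: must run under prun/SLURM.
if [ "X$SLURM_JOB_ID" = X ]; then
    echo "No SLURM_JOB_ID in environment; not running under SLURM?" >&2
    exit 1
fi

# One task per reserved CPU slot on every host in the reservation:
NHOSTS=$(echo $PRUN_PE_HOSTS | wc -w)
srun -n $((NHOSTS * PRUN_CPUS_PER_NODE)) $PRUN_PROG $PRUN_PROGARGS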

NUMA hardware locality (hwloc), or cpu/memory affinity

The environment variable OMPI_OPTS is passed to the MPI deployment tool "mpirun" and can be used to provide MPI-specific runtime options. For example, it can be used to enforce a specific binding of MPI processes to CPU sockets. Sometimes alternative mappings are useful to improve performance on the NUMA architecture:


$ prun OMPI_OPTS="--map-by core --bind-to core" -script $PRUN_ETC/prun-openmpi etc

For more details on NUMA binding, check out OpenMPI's "hwloc"-related documentation on the OpenMPI webpage. Hwloc is integrated in OpenMPI for convenience, but it is also available as a separate hwloc toolset, including the "hwloc-bind" utility. The most recent version is accessible via "module load hwloc".
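
As a small sketch of the standalone tools (assuming a recent hwloc version; check the documentation of the installed module):

$ module load hwloc
$ lstopo                          # show the node's NUMA/cache/core topology
$ hwloc-bind socket:0 -- ./cpi    # run the program bound to the first CPU socket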

If an MPI application employs extra application threads, the default CPU core affinity settings in the MPI runtime system may in fact be counterproductive. In that case it may be preferable to disable the affinity settings: for OpenMPI use the option "--bind-to none", and for MVAPICH2 set MV2_ENABLE_AFFINITY=0.
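
Following the same pattern as the binding example above, disabling OpenMPI's affinity via prun could look like:

$ prun OMPI_OPTS="--bind-to none" -script $PRUN_ETC/prun-openmpi etc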


To use TCP/IP as the network protocol, via the IP-over-InfiniBand implementation instead of the native low-level InfiniBand driver:


$ prun OMPI_OPTS="--mca btl tcp,self --mca btl_tcp_if_include ib0" -script $PRUN_ETC/prun-openmpi etc

For more details, see the prun manual page and the OpenMPI documentation.