The Distributed ASCI Supercomputer 3

MPI Grid jobs using OpenMPI/TCP on DAS-3


OpenMPI is a highly configurable MPI implementation, offering many options to control its exact runtime behavior. For details, please consult the OpenMPI website.
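
OpenMPI's ompi_info tool can be used to inspect the available MCA parameters; for instance, the TCP-related parameters that appear in the mpirun commands below can be listed as follows (output omitted here, since it varies per OpenMPI version):

$ ompi_info --param oob tcp
$ ompi_info --param btl tcp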

OpenMPI example:


  • Step 1: Compile the code with OpenMPI instead of MPICH:

$ module load default-myrinet   # or "default-ethernet" on DAS-3/Delft
$ module del mpich
$ module add openmpi
$ module list
Currently Loaded Modulefiles:
  1) mx/64/1.2.0j          4) prun/default          7) openmpi/gcc/default
  2) sge/6.0u8             5) globus/4.0.3
  3) cluster-tools/2.0.5   6) default-myrinet
$ which mpicc
/usr/local/package/openmpi-1.2.1/bin/mpicc
$ mpicc -o cpi_openmpi cpi.c
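
To double-check that the wrapper compiler picks up the OpenMPI installation just loaded, it can also be asked to print the underlying compile command (a quick sanity check; the exact paths depend on the installed module versions):

$ mpicc --showme    # prints the gcc command line with OpenMPI's include and library paths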

  • Step 2: reserve compute nodes on the clusters you want to use.
    In this case we assume we have already reserved two nodes each on DAS-3/VU and DAS-3/Leiden (using manual preserve commands or by letting Koala do the co-allocation), and have put the host names in two variables:

$ hostlist0="node030.das3.cs.vu.nl node031.das3.cs.vu.nl"
$ hostlist1="node110.das3.liacs.nl node111.das3.liacs.nl"
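
Before continuing, it may be worth verifying that all reserved nodes are reachable without a password from the current node (a simple check, assuming the variables above are set; BatchMode makes ssh fail instead of prompting for a password):

$ for h in $hostlist0 $hostlist1; do ssh -o BatchMode=yes $h hostname; done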

  • Step 3: create a comma-separated list of all hosts to be used, in the required rank order.
    In this example we want to run two processes on each allocated host:

$ hosts=`for i in $hostlist0 $hostlist1; do for j in 0 1; do echo -n $i,; done; done`
$ echo $hosts
node030.das3.cs.vu.nl,node030.das3.cs.vu.nl,node031.das3.cs.vu.nl,node031.das3.cs.vu.nl,node110.das3.liacs.nl,node110.das3.liacs.nl,node111.das3.liacs.nl,node111.das3.liacs.nl,
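
Instead of a comma-separated --host list, OpenMPI also accepts a hostfile with a slot count per node. The following sketch writes the same allocation to a file (the name openmpi.hosts is arbitrary); passing it as "--hostfile openmpi.hosts" instead of "--host $hosts" in the mpirun commands below should yield the same process placement with OpenMPI's default by-slot mapping:

$ for i in $hostlist0 $hostlist1; do echo "$i slots=2"; done > openmpi.hosts
$ cat openmpi.hosts
node030.das3.cs.vu.nl slots=2
node031.das3.cs.vu.nl slots=2
node110.das3.liacs.nl slots=2
node111.das3.liacs.nl slots=2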

  • Step 4: transfer the binary to all other sites participating in the grid run:

$ rsync -e ssh -avz `pwd`/cpi_openmpi fs1.das3.liacs.nl:`pwd`/cpi_openmpi
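
When more sites take part, the same transfer can simply be repeated for each remote headnode; a small loop keeps this manageable (the list below only contains the Leiden headnode used in this example; add the headnodes of any other participating sites):

$ for fs in fs1.das3.liacs.nl; do rsync -e ssh -avz `pwd`/cpi_openmpi $fs:`pwd`/cpi_openmpi; done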

  • Step 5: run the binary using the compute nodes' external eth0:0 interfaces across the internet.
    NOTE: It is important that this is done by running OpenMPI's mpirun startup tool on one of the compute nodes, not on one of the DAS-3 headnodes.
    NOTE: This only works if you have set up your ssh keys such that password-less ssh logins are possible between the DAS-3 sites (see the sketch after the example output below).

$ set $hostlist0
$ starthost=$1
$ echo $starthost
node030.das3.cs.vu.nl
$ incl=eth0:0
$ excl=myri0,eth0
$ ssh $starthost "cd `pwd`; $MPI_HOME/bin/mpirun --prefix $MPI_HOME \
   --mca oob tcp,self --mca btl sm,tcp,self                         \
   --mca oob_tcp_include $incl --mca oob_tcp_exclude $excl          \
   --mca btl_tcp_if_include $incl --mca btl_tcp_if_exclude $excl    \
   --host $hosts -np 8 `pwd`/cpi_openmpi"
Process 0 on node030
Process 1 on node030
Process 2 on node031
Process 3 on node031
Process 4 on node110
Process 5 on node110
Process 6 on node111
Process 7 on node111
pi is approximately 3.1416009869231241, error is 0.0000083333333309
wall clock time = 0.051411
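
As noted above, this requires password-less ssh between the DAS-3 sites. A minimal sketch of one common way to set this up, assuming a default ssh configuration and a separate home directory per site (repeat the second command for every other participating site):

$ ssh-keygen -t rsa                  # accept the defaults; use ssh-agent if you set a passphrase
$ cat ~/.ssh/id_rsa.pub | ssh fs1.das3.liacs.nl 'cat >> ~/.ssh/authorized_keys'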

  • Step 6: Alternatively, run the binary using the compute nodes' internal myri0 interfaces across DAS-3's dedicated 10G WAN links (not available on DAS-3/Delft):

$ incl=myri0
$ excl=eth0,eth0:0
$ ssh $starthost "cd `pwd`; $MPI_HOME/bin/mpirun --prefix $MPI_HOME \
   --mca oob tcp,self --mca btl sm,tcp,self                         \
   --mca oob_tcp_include $incl --mca oob_tcp_exclude $excl          \
   --mca btl_tcp_if_include $incl --mca btl_tcp_if_exclude $excl    \
   --host $hosts -np 8 `pwd`/cpi_openmpi"
Process 0 on node030
Process 1 on node030
Process 2 on node031
Process 3 on node031
Process 4 on node110
Process 5 on node110
Process 6 on node111
Process 7 on node111
pi is approximately 3.1416009869231241, error is 0.0000083333333309
wall clock time = 0.024332
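
Since steps 5 and 6 only differ in the interface selection, the whole launch can also be wrapped in a small helper script. The sketch below (a hypothetical run_wan.sh, not part of the DAS-3 software) takes the interface as its first argument and otherwise reuses the mpirun options shown above:

$ cat > run_wan.sh << 'EOF'
#!/bin/sh
# Usage: run_wan.sh <eth0:0|myri0> <np> <comma-separated hosts> <binary>
# MPI_HOME is expected in the environment (passed on the ssh command line below).
case "$1" in
  eth0:0) incl=eth0:0; excl=myri0,eth0 ;;
  myri0)  incl=myri0;  excl=eth0,eth0:0 ;;
  *) echo "usage: $0 <eth0:0|myri0> <np> <hosts> <binary>"; exit 1 ;;
esac
np=$2; hosts=$3; binary=$4
exec $MPI_HOME/bin/mpirun --prefix $MPI_HOME \
   --mca oob tcp,self --mca btl sm,tcp,self \
   --mca oob_tcp_include $incl --mca oob_tcp_exclude $excl \
   --mca btl_tcp_if_include $incl --mca btl_tcp_if_exclude $excl \
   --host $hosts -np $np $binary
EOF
$ chmod +x run_wan.sh
$ ssh $starthost "cd `pwd`; MPI_HOME=$MPI_HOME ./run_wan.sh myri0 8 $hosts `pwd`/cpi_openmpi"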
