The Distributed ASCI Supercomputer 3

prun - Reserve compute nodes from a cluster and run a job


SYNOPSIS

prun [options] application ncpus [application args]
prun [options] -np ncpus application [application args]

DESCRIPTION

prun provides a convenient way to run applications on a cluster. It reserves the requested number of cpus (or nodes) and executes a parallel application on them. Host scheduling is exclusive, i.e., prun does not allocate multiple jobs on one host. prun builds on a reservation system that is partly based on goodwill: compute node reservation is implemented, but generally not strictly enforced. However, users sidestepping the reservation mechanism and accessing compute nodes directly will incur the wrath of both fellow users and the system administrators.

Scheduling

prun runs an application in parallel on the requested number of cpus. The default maximum execution time is 15 minutes on DAS-2 and DAS-3, which is also the maximum allowed reservation during daytime. If no start time is specified explicitly, the parallel run is scheduled as soon as the requested number of cpus is available. If not enough cpus are available immediately, prun waits until the reserved time, or until canceled reservations allow earlier execution. In the latter case, the reservation schedule is compressed. If a start time is specified explicitly, and the requested resources are available, the reservation is scheduled, and prun sleeps until the specified time. Users are themselves responsible for honoring the execution time limits; exceptions can be requested by email to the system administrators.
By default, prun rounds the requested number of cpus upwards to a multiple of the number of cpus per node, i.e., 2 for DAS-2 and DAS-3. This ensures that there are no other jobs on the reserved nodes (from the same or another user) that might interfere.
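For example, the following asks for 3 cpus and 30 minutes of walltime (./myapp is a hypothetical executable); the cpu count is rounded up to 4, i.e., 2 full nodes:
$ prun -t 30:00 -np 3 ./myapp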

Rsh peculiarities

prun by default uses rsh(1) for application invocation. rsh only works if the caller node is trusted by the callee node. Therefore, the caller node must be present in the user's .rhosts file, or, as is the case on DAS-2 and DAS-3, this trust must be enabled system-wide.
Another limitation of rsh is that its protocol requires TCP ports from a restricted range, which can run out when large numbers of processes are to be started. This restriction can be circumvented by letting prun use ssh instead: see the -rsh option below.
Since rsh has some problems in dealing with standard input, prun imposes two limitations. The first is that only cpu 0 of the started parallel application is allowed to read from standard input. The second is caused by the fact that standard input is always opened, whether the application wants to read it or not. Therefore, if prun must run in the background and no input is necessary, standard input must be redirected to /dev/null. See also rsh(1).
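For example, to run a (hypothetical) application in the background without standard input:
$ prun -np 4 ./myapp < /dev/null &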
Since prun uses rsh to start remote processes, the process limits (like memory usage, execution time limit) are derived from the user's default values. When the process limits are to be changed, users must change them in their .cshrc (csh, tcsh users) or .bashrc (bash users) or .kshrc (ksh users).
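For example, bash users might raise their limits along these lines (a sketch; csh and tcsh users would use the limit built-in in .cshrc instead):
# in ~/.bashrc
ulimit -c unlimited    # allow core dumps of any size
ulimit -d unlimited    # raise the data segment (memory) limit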

Single-shot property

prun generates a run-unique key for each parallel run. This key can be used for synchronization by other software layers, like the ones based on Panda. prun should not be used to invoke scripts that run multiple parallel programs in sequence, since in that case the run-unique key would be shared between consecutive runs. This leads to start-up problems. Therefore invoke prun for each parallel program run separately.
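For example, instead of wrapping two parallel runs in one script passed to prun, invoke prun twice (./phase1 and ./phase2 are hypothetical programs):
$ prun -np 4 ./phase1
$ prun -np 4 ./phase2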

OPTIONS

-c dir
Symbolic name of the directory where the parallel application is to be executed (default: current directory).
prun writes temporary files into the current directory, with instructions and environment information for the worker processes. For this reason, worker processes must run from the current directory. rsh(1) starts its remote execution from the user's home directory, so prun must remotely change to the desired directory. However, the current directory name on the local host may differ from its (symbolic) name on remote hosts (since the file system may have been mounted differently). To overcome this, the -c dir option is supplied, in which the (symbolic) name of the current directory is specified. To determine the current directory, prun first inspects the environment variable PWD, which is set by tcsh(1) and bash(1). For other shells, it may be necessary to specify the (symbolic) name of the current directory with -c.
Since prun creates temporary files in this directory, the user must have write permission in it.
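For example, when running from a shell that does not set PWD (directory and application are hypothetical):
$ prun -c /home/jdoe/experiment -np 2 ./myapp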
-core
allow application core dumps (default).
-no-core
suppress application core dumps.
-d time
poll every time seconds (default: 1).
-delay time
add a delay of time seconds (default: depends on file size) between spawns of remote processes. time is a floating point number.
-export-env
export prun's process environment to forked application processes (default).
-no-export-env
do not export prun's process environment to forked application processes.
-keep
do not return reservation after execution. Generally used in conjunction with -reserve.
-no-keep
return reservation after execution (default).
-n
echo rsh and reserve commands, but do not execute.
-np ncpus
The -np option expresses, in a more natural way, the common case of parallel runs that do not expect Panda-style cpu rank arguments.
prun -np ncpu app args
is an alias for
prun -no-panda app ncpu args.
-o outputfile
output from each of the parallel processes is diverted to a separate file, named outputfile.0, outputfile.1, ....
This option does not work in combination with -sge-script.
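For example (with a hypothetical application):
$ prun -o out -np 2 ./myapp
leaves the output of the two processes in out.0 and out.1.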
-panda
feed the application its process rank and the total number of processes as the first two command line arguments (default).
-no-panda
do not add any process ranking arguments to the application command line.
-sge-script script
Runs script on cpu 0. The script should start up the processes on the other cpus, as is customary for SGE scripts. To ease the development cycle, prun -sge-script sets a number of environment variables: PRUN_CPUS contains the total number of cpus; PRUN_CPUS_PER_NODE contains the number of cpus per node; PRUN_PROG contains the name of the executable specified to prun; PRUN_PROGARGS contains the list of application arguments specified to prun.
An example script is found in /usr/local/sitedep/reserve.sge/sge_script; it allows the user to run an MPICH/MX or MPICH/GE application without having to bother about SGE or MPICH configuration. As is usual with prun, but in contrast to SGE schedules, stdin, stdout and stderr are redirected to the terminal, and the program is run from the current directory or the directory indicated with the -c option.
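The following minimal sketch of such a script assumes an MPICH-style mpirun on the PATH and assumes that PRUN_HOSTNAMES (see ENVIRONMENT below) lists all allocated hosts; the site-maintained sge_script above remains the authoritative example:
#!/bin/sh
# Sketch only: build a machine file from the hosts allocated by prun
# (assuming PRUN_HOSTNAMES holds them) and start the application
# with an MPICH-style mpirun.
MACHINEFILE=$(mktemp) || exit 1
for h in $PRUN_HOSTNAMES; do echo "$h"; done > "$MACHINEFILE"
mpirun -np "$PRUN_CPUS" -machinefile "$MACHINEFILE" "$PRUN_PROG" $PRUN_PROGARGS
status=$?
rm -f "$MACHINEFILE"
exit $status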
-pg dir_prefix
each process changes directory to dir_prefixXX, where XX is the instance number. This can be used, for example, to generate separate profile dumps or core dumps per process.
-ping
ping all hosts on which the program is to run before forking the processes. If the ping fails, an indication that the host is down is printed (default).
-no-ping
do not ping the hosts before forking.
-q queue
Enter the reservation into the cluster queue named queue. The default is the system-default queue; typically this is the queue containing all available nodes.
For prun running on SGE (as on DAS-2 and DAS-3), this option can be used to enforce scheduling on a specific subset of nodes, e.g.:
-q "all.q@node001,all.q@node002"
-reserve id
use previously obtained reservation id id. By default this also sets -keep. This option can be used to reserve nodes for a time spanning a number of runs. A reservation id can be obtained by calling preserve(1) with the required time and nodes.
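For example, to run several short jobs within a single reservation (the reservation id 12345 and the application are hypothetical):
$ prun -reserve 12345 -np 4 ./myapp input1
$ prun -reserve 12345 -np 4 ./myapp input2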
-rsh remote-shell
use remote-shell (as an absolute path name) to spawn remote processes, instead of rsh, e.g., -rsh /usr/bin/ssh. This is typically used to replace prun's use of rsh by ssh which, besides being more secure, also does not suffer from rsh limitations related to the number of TCP ports that can be allocated on the submitting host.
-s time
start at time [[mm-]dd-]hh:mm (default: now).
-t time
the maximum application walltime is set to time = [[hh:]mm:]ss (default: 15 minutes).
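For example, to schedule an 8-cpu run of at most one hour starting at 23:00 (application hypothetical):
$ prun -s 23:00 -t 1:00:00 -np 8 ./myapp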
-tmk
start the application in the Treadmarks manner. This means that only the process on the first cpu is started, and this process forks the other Treadmarks processes. Also, a file $HOME/.Tmkrc is created to distribute the list of nodes. Since the name of this file is shared between all parallel runs of a given user, a user cannot run more than one parallel Treadmarks program at the same time.
-v
report host allocation.
-[124]
By default, prun reserves the requested number of cpus and starts one process per cpu, i.e., 2 processes per node on DAS-2 and DAS-3. If -[124] is specified, however, prun interprets the requested count as a number of nodes, allocates and schedules those nodes, and then runs the number of processes per node specified by this option, ignoring the number of cpus per node.
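For example, the following reserves 4 nodes and starts a single process on each of them (application hypothetical):
$ prun -1 -np 4 ./myapp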
var=value
add var=value to application environment.
-?
print usage.

SEE ALSO

rsh(1), preserve(1).

ENVIRONMENT

prun copies all its own environment variables to the environment of the spawned processes. It adds some extra variables: PRUN_CPU_RANK contains the rank of the current spawned process; PRUN_HOSTNAMES contains a list of host names, one per spawned process. The -sge-script option adds some more environment variables.
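A quick sketch to inspect these variables (the single quotes defer expansion to the spawned shell):
$ prun -np 2 /bin/sh -c 'echo $PRUN_CPU_RANK'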

POSSIBLE PITFALLS

``Illegal option: 0 16''
The user has not specified -no-panda, so prun prepends the process rank and the total number of processes to the command line, which the application rejects as illegal options.
Example:
$ prun -v /bin/echo 2 hello
Reserved 2 hosts for 900 seconds from Tue Mar 27 14:43:02 CEST 2007
: node001 node002
All hosts are alive

1 2 hello
0 2 hello

Another example:
$ prun -no-panda -v /bin/echo 2 hello
Reserved 2 hosts for 900 seconds from Tue Mar 27 14:43:25 CEST 2007
: node001 node002
All hosts are alive

hello
hello

The previous example can also be started using the -np option, as follows (note that the cpu count follows -np rather than the application name):
$ prun -np 2 -v /bin/echo hello
``Fatal error: cannot stat application a.out''
The user has specified an incomplete path for his executable.
Prolonged silence
There is no room in the current schedule for the requested number of cpus and compute time, so prun waits. prun -v or preserve -llist shows information on host allocation and presumed start time.
``Out of memory''
The user has not changed the process memory limit in his .cshrc or .bashrc file. Maybe he did it in his .profile or his .login file, but rsh(1) does not look there.
No core dumps
The user has not changed the process coredump limit in his .cshrc or .bashrc file. Maybe he did it in his .profile or his .login file, but rsh(1) does not look there.
``watchit fatal error: Cannot open environment file .PRUN_ENVIRONMENT.procid.host''
One of two possibilities: either the user has no write permission in the current directory, or an invalid directory was specified with the -c option. In the latter case, prun is requested to run from a nonexistent directory, a directory without write permission, or a directory that has not been remote-mounted: /tmp and /var/tmp are excellent examples of the latter error.
I cancelled my reservation, but the compute jobs live on!
Reservation and execution are two different things. True, prun(1) obtains a reservation, executes your jobs and cancels the reservation. But these are three separate actions. To kill the jobs, interrupt your prun process by sending it a ^C or a SIGINT (kill(1)). Your prun process will propagate a SIGINT to your jobs so they will die, and then cancel your reservation.
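For example (with a hypothetical prun process id):
$ kill -INT 12345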


