Introduction

This document is intended as a short introduction to the RAL batch system. I describe the system itself, job submission and job monitoring. It is hoped that this will be enough to get people started.

RAL system

The RAL system consists of a few frontend machines (csfX.rl.ac.uk, where X = a-f) intended for logins and interactive work, 454 dual-processor worker nodes intended for batch job processing, and a host of additional supporting machines (database, disk servers, job scheduler) that users need not worry much about. The interactive machines and batch machines share a common environment, so a job running on a batch machine behaves just as if it were running on one of the interactive nodes.

The RAL system is described here, though some of the information is a little out of date. I usually look at the Ganglia monitoring site, which has lots of detailed information if you click around. That site is here.

MINOS software

Nick has a site which describes how to use the MINOS software at RAL. You should probably run a software setup script in your login file (for bash this is ~/.profile):

source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
source /rutherford/minos-soft2/labyrinth/setup_labyrinth.sh
export CVSROOT=:ext:minoscvs@minoscvs.fnal.gov:/cvs/minoscvs/rep1
export DEFAULT_CVSROOT=:ext:minoscvs@minoscvs.fnal.gov:/cvs/minoscvs/rep1

The lines above set up the development release of the offline software, the associated ROOT version and the labyrinth (i.e. GMINOS). The CVSROOT variables allow read/write access to the MINOS CVS repository so that you can check code changes in (if you have permissions for the package in question).

In my job scripts (i.e. the actual "job" that runs on the farm) I usually rerun the setup routines to make sure I am using the proper version of the software.
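As a rough sketch of what the start of such a job script might look like, the example below re-sources the two setup scripts from the login file before doing any work. The existence check, the warning message and the placeholder for the actual work are illustrative assumptions on my part, not something the RAL setup requires.

```shell
#!/bin/bash
# Sketch of a batch job script that reruns the MINOS setup before doing any work.
# The two setup paths are the ones used in the login file; the rest is illustrative.

missing=0
for script in \
    /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh \
    /rutherford/minos-soft2/labyrinth/setup_labyrinth.sh
do
    if [ -r "$script" ]; then
        # Re-source the setup so the job does not depend on the login environment.
        source "$script"
    else
        # Report the problem in the job log instead of failing silently.
        echo "WARNING: setup script not found: $script" >&2
        missing=$((missing + 1))
    fi
done

echo "Job running on $(uname -n); $missing setup script(s) missing"

# ... the actual work (hypothetical) would go here ...
```

Printing the node name and the number of missing setup scripts at the top of the log makes it easier to tell, after the fact, whether a failed job ever had a working software environment.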

Batch System Commands

RAL has a batch queue system based on PBS. If you've used PBS before you'll feel at home, and many of the following commands will be familiar. Most of the documentation can be found by running "man pbs" from any interactive node. I describe the most commonly used commands below.

qsub

Purpose: Submits a job for batch processing. For extended documentation, run "man qsub" from any interactive machine.

Example:

qsub -l cput=15:00:00,pmem=350mb -q prod -M someone@something.ac.uk -j oe -o /some/path/some_logfile.log -v "remote_a=${local_a}, remote_b=${local_b}"  /path/to/some_job.sh

This submits the script /path/to/some_job.sh for execution on the batch system. Anything done inside some_job.sh will be done on the worker node when the job runs. The options are: