This document is intended as a short introduction to the RAL batch system. I describe the system itself, job submission and job monitoring. It is hoped that this will be enough to get people started.
The RAL system consists of a few frontend machines (csfX.rl.ac.uk, where X = a-f) intended for logins and interactive work, 454 dual-processor worker nodes intended for batch job processing, and a host of additional supporting machines (database, disk servers, job scheduler) which users shouldn't worry much about. The interactive machines and batch machines share a common environment: when your job runs on a batch machine it is just as if it were running on one of the interactive nodes.
The RAL system is described here, though some of the information is a little out of date. I usually look at the Ganglia monitoring site, which has lots of detailed information if you click around. That site is here.
Nick has a site which describes how one uses the MINOS software at RAL. You should probably source the software setup scripts in your login file (for bash this is ~/.profile):
source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
source /rutherford/minos-soft2/labyrinth/setup_labyrinth.sh
export CVSROOT=:ext:minoscvs@minoscvs.fnal.gov:/cvs/minoscvs/rep1
export DEFAULT_CVSROOT=:ext:minoscvs@minoscvs.fnal.gov:/cvs/minoscvs/rep1
The lines above set up the development release of the offline software, the associated ROOT version and the labyrinth (i.e. GMINOS). The CVSROOT variables allow read/write access to the MINOS CVS repository so that you can check code changes in (if you have permissions for the package in question...).
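For example, once the CVSROOT above is set (and you have commit rights to the package in question), a check-out/commit cycle looks roughly like this. Mad is just an illustrative package name, taken from the macro path used later, and you may also need CVS_RSH=ssh depending on how the :ext: access method is configured:

# export CVS_RSH=ssh                      # may be needed for the :ext: access method
cvs checkout Mad                          # check out a package (Mad is only an example)
cd Mad
# ... edit some files ...
cvs commit -m "describe your change" macros/MakePanMK.C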
In my job scripts (i.e. the actual "job" that runs on the farm) I usually rerun the setup routines to make sure I am using the proper version of the software.
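For example, the first few lines of a job script might look like this (just a sketch; the complete ntuple job shown later does the same thing and also runs srt_setup for a test release):

#! /bin/bash
# rerun the software setup so the worker node sees the same environment as the login nodes
source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
source /rutherford/minos-soft2/labyrinth/setup_labyrinth.sh
# ... the actual work of the job goes here ...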
RAL has a batch queue system based on PBS. If you've used PBS before you'll feel at home, and many of the following commands will be familiar. Most documentation can be found by doing "man pbs" from any interactive node. I describe the most commonly used commands below.
Purpose: Submits a job for batch processing. For extended documentation "man qsub" from any interactive machine.
Example:
qsub -l cput=15:00:00,pmem=350mb -q prod -M someone@something.ac.uk -j oe -o /some/path/some_logfile.log -v "remote_a=${local_a}, remote_b=${local_b}" /path/to/some_job.sh
This submits the script /path/to/some_job.sh for execution on the batch system. Anything done inside some_job.sh will be done on the worker node when the job runs. The options are:
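Briefly, assuming the standard PBS meanings of these flags (see "man qsub" for the authoritative descriptions):

-l cput=15:00:00,pmem=350mb      # resource limits: at most 15 hours of CPU time, 350 MB of memory per process
-q prod                          # submit to the prod queue
-M someone@something.ac.uk       # address the batch system mails about the job
-j oe                            # merge the job's standard error into its standard output
-o /some/path/some_logfile.log   # write the (merged) job output to this file
-v "remote_a=${local_a}, remote_b=${local_b}"   # set remote_a and remote_b in the job's environment from values in the submitting shell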
The qsub command returns a JobID which may be used to refer to the specific job.
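Since the JobID is printed on standard output, you can capture it in a script and feed it to the other batch commands, for instance (a sketch; qdel is the standard PBS command for removing a job):

jid=`qsub -q prod -j oe -o /some/path/some_logfile.log /path/to/some_job.sh`
echo "submitted job ${jid}"
qstat ${jid}      # check on it
# qdel ${jid}     # remove it from the queue if you change your mind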
Purpose: Shows the status of jobs, queues or hosts in the cluster. For extended documentation "man qstat" from any interactive machine. qstat can do quite a lot; I show the most common usage below.
Example: Check on my jobs.
[csfc] /home/csf/kordosky > qstat -u kordosky

csflnx353.rl.ac.uk:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1822248.csflnx3 kordosky sl3p     one_single    953   1  --     -- 15:00 R 01:50
1822287.csflnx3 kordosky sl3p     one_single   8051   1  --     -- 15:00 R 00:53
1822289.csflnx3 kordosky sl3p     one_single  20850   1  --     -- 15:00 R 01:11
1822292.csflnx3 kordosky sl3p     one_single  26257   1  --     -- 15:00 R 00:50
.
.
1822363.csflnx3 kordosky sl3p     one_single     --   1  --     -- 15:00 Q    --
.
.
1822916.csflnx3 kordosky sl3p     one_single     --   1  --     -- 15:00 H    --
.
.
The command displayed all the jobs I had running (R), queued (Q) or held (H). I often hold some jobs out of the queue until others finish, since having several hundred of this sort of job running at once puts a large load on the machines serving out the disk areas.
The JobID column is identical to what is returned by qsub.
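The held (H) state above most likely comes from the job-dependency trick used in the submission script later on, but you can also hold and release jobs by hand with the standard PBS commands qhold and qrls (see their man pages):

qhold 1822916     # hold a queued job so it will not start
qrls 1822916      # release it again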
Purpose: Look at the output of one of your jobs. For extended documentation "man qcat" from any interactive machine.
Example:
qcat 1822248
This will show the standard output of your job. You can use it to figure out what your job is doing.
Purpose: Show jobs in the active and idle queues. The latter is more interesting, as the former is similar to qstat.
Example: Look at the jobs waiting to run. Jobs are sorted in order of priority. How long till my job runs!?
[csfa] /stage/minos-data7/kordosky > showq -i

JobName    Priority  XFactor  Q   User   Group  Procs     WCLimit  Class      SystemQueueTime

1840758*     132410      1.0  -   h1mc      h1      1  2:08:00:00   sl3p  Wed Jan 11 12:29:54
1840759      132410      1.0  -   h1mc      h1      1  2:08:00:00   sl3p  Wed Jan 11 12:29:54
1840891      132410      1.0  -   h1mc      h1      1  2:08:00:00   sl3p  Wed Jan 11 12:30:23
.
.
.
1840987      113232      1.0  -    cbs   minos      1  2:08:00:00   sl3p  Wed Jan 11 12:48:33
1840988      113232      1.0  -    cbs   minos      1  2:08:00:00   sl3p  Wed Jan 11 12:48:35
.
.
.
Chris Smith (cbs) will have to wait a while since the MINOS priority is kinda low (since Yours Truly has been running many jobs lately). To learn more about the priority system you can look at the GridPP Wiki, specifically the PBS scheduling part of it. You can also have a look at the FairShare metrics link off of the Ganglia page.
Here is how I run PAN (ntuple) making jobs. Each job runs on an sntp file and writes out an ntuple file.
This script reads a list of files from standard input and submits an ntuple-making job (pan_job_pro.sh) for each one. It uses standard getopts to parse its options, including the output directory. The script implements a mechanism (the -n option) so that at most nmax jobs are able to run at once. gnumi and mcbeam are specific to this particular job.
#! /bin/bash
# loop over the list of files supplied on standard input
# for each file submit a job which will make a pan out of that file

# default values for options
out_dir="/stage/minos-data6/near/pan_data_R1.18"
gnumi=""
mcbeam="z_000"
#
let nmax=99999

tf=`mktemp /tmp/temp.XXXXXX`
# temporary file is used as a fifo
# store jobid in it then read from it later
# used to make the second set of nmax jobs depend on the first finishing
# before they will run, and the third set on the second set, and .. etc.
# i.e. only run nmax jobs at a time
exec 5<> ${tf}

while getopts "o:g:b:n:" opt
do
    case $opt in
        o) out_dir=${OPTARG};;
        g) gnumi=${OPTARG};;
        b) mcbeam=${OPTARG};;
        n) let nmax=${OPTARG};;
        '?') echo "bad command line option"
             exit 1;;
        ':') echo "Missing arg to option: $OPTARG"
             exit 1 ;;
    esac
done
shift $((OPTIND-1))

echo "out_dir: ${out_dir}"
echo "gnumiaux: ${gnumi}"
echo "mcbeam: ${mcbeam}"
echo "max jobs at once : ${nmax}"

let cnt=1;
while read f ; do
    echo "********************************************"
    # echo ${f}
    b=`basename ${f}`
    r=`echo ${b} | sed 's/.root//'`
    log_name="${out_dir}/pan_${r}.log"
    in_name="${f}"
    out_name="${out_dir}/pan_${r}.root"
    echo "in: ${in_name}"
    echo "out: ${out_name}"
    echo "log: ${log_name}"

    depstr=""
    if [ ${cnt} -gt ${nmax} ]; then
        read dep <&5
        echo "This job will depend on job ${dep}"
        depstr="-W depend=afterany:${dep}"
    fi

    # submit the job to the batch system
    jid=`qsub -q prod -M ${USER}@fnal.gov -j oe -o "${log_name}" -v "inname=${in_name},outname=${out_name},bmonpath=${bmon},mcbeam=${mcbeam},gnumi=${gnumi},outdir=${out_dir}" ${depstr} ${HOME}/bin/pan_job_pro.sh`
    # to do a "dry run", echo the jid=... line above instead of running it
    # e.g. echo "jid=`....`"
    # and comment in the following line
    #jid=${cnt}
    echo "storing jobid=${jid}"
    echo ${jid} >> ${tf}

    # do at end
    let cnt=cnt+1
done
This is the ntuple making job.
#! /bin/bash
source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
source /rutherford/minos-soft2/labyrinth/setup_labyrinth.sh
cd /home/csf/kordosky/test_pro
srt_setup -a
ulimit -c unlimited

cd $WORKDIR

# GNUMIAUX is needed by the pan making code
export GNUMIAUX=${gnumi}

# all the quoting here is needed to properly feed arguments into a root macro
loon -bnq '/home/csf/kordosky/test_pro/Mad/macros/MakePanMK.C("'"${inname}"'", "test",0,"'"${mcbeam}"'")'

mv PAN_test.root $outname

echo "Moving any core dumps to ${outdir}/coredumps"
for fcore in `ls core*` ; do
    echo "found core dump:"
    echo "`file ${fcore}`"
    echo "Moving ${fcore} to ${outdir}/coredumps"
    mv ./${fcore} ${outdir}/coredumps/${fcore}
done
A couple of comments are in order. First, every job is assigned a unique working directory $WORKDIR on the execution host. You should cd to that directory at the start of your job and write any output there. Do not write an ntuple (or, heaven forbid, a gaf file) over the network; it's not efficient. At the end of your job you should use cp or mv to copy your output files back to a standard MINOS disk area. Second, you do not need to clean up $WORKDIR after your job ends; the system will do it for you. However, I've found in the past that it makes sense to archive any core dumps your job has produced into some standard location. This only works if you set ulimit -c unlimited as I show above; otherwise you won't get any core dumps. Anyway, take my advice and save the core dumps: they make it easy to figure out which jobs have crashed, and you can also do some debugging.
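If a job does crash, you can load the archived core dump into gdb together with the executable that produced it. Roughly (the core file name here is just an illustration; make sure you set up the same software version that the job ran, including srt_setup if it used a test release):

source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
gdb `which loon` /stage/minos-data6/near/pan_data_R1.18/coredumps/core.12345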
To run one file I do:
outdir="/stage/minos-data7/near/pan_mc" gnumi="/stage/minos-data6/gnumi/v17/le010z185i" beam="z_000" echo "/stage/minos-data3/dcm_catalogue/n13020190_0000_L010185.sntp.R1_18.root" | submit_pan_jobs_pro.sh -o $outdir -b $beam -g $gnumi
Running over many files is easy:
find /stage/minos-data3/dcm_catalogue -name 'n1301*L010185*.sntp.R1_18.root' | submit_pan_jobs_pro.sh -o $outdir -b $beam -g $gnumi
Unless the number of files is large (and it is if you are running over all the FD or ND data) one could use ls rather than find.
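For a smaller set of files, using ls might look something like:

ls /stage/minos-data3/dcm_catalogue/n13020190*L010185*.sntp.R1_18.root | submit_pan_jobs_pro.sh -o $outdir -b $beam -g $gnumi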
My experience is that I rarely get a job script right the first time. Thus, if you are like me, you will want to start by submitting one job, possibly a shortened one, and using qcat to watch it run. Writing a script and then using it to submit 1000 jobs cold-turkey is a recipe for chaos.
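A typical test cycle, reusing the single-file example above, might look like this (substitute the JobID that qsub actually returns for your test job):

echo "/stage/minos-data3/dcm_catalogue/n13020190_0000_L010185.sntp.R1_18.root" | submit_pan_jobs_pro.sh -o $outdir -b $beam -g $gnumi
qstat -u $USER      # find the JobID of the test job and wait for it to start running (R)
qcat 1822248        # watch its output as it runs (use your own JobID)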