This document is intended as a short introduction to the RAL batch system. I describe the system itself, job submission and job monitoring. It is hoped that this will be enough to get people started.
The RAL system consists of a few frontend machines (csfX.rl.ac.uk, where X = a-f) intended for logins and interactive work, 454 dual-processor worker nodes intended for batch job processing, and a host of additional supporting machines (database, disk servers, job scheduler) which users shouldn't worry much about. The interactive machines and batch machines share a common environment: when your job runs on a batch machine it is just as if it were running on one of the interactive nodes.
The RAL system is described here, though some of the information is a little out of date. I usually look at the Ganglia monitoring site, which has lots of detailed information if you click around. That site is here.
Nick has a site which describes how one uses the MINOS software at RAL. You should probably source the software setup scripts in your login file (for bash this is ~/.profile):
source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
source /rutherford/minos-soft2/labyrinth/setup_labyrinth.sh
export CVSROOT=:ext:minoscvs@minoscvs.fnal.gov:/cvs/minoscvs/rep1
export DEFAULT_CVSROOT=:ext:minoscvs@minoscvs.fnal.gov:/cvs/minoscvs/rep1
The lines above set up the development release of the offline software, the associated ROOT version and the labyrinth (i.e. GMINOS). The CVSROOT variables allow read/write access to the MINOS CVS repository so that you can check code changes in (if you have permissions for the package in question...).
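For example, once the CVSROOT above is set (and you have commit rights to the package in question), a check-out/commit cycle looks roughly like this. Mad is just an illustrative package name, taken from the macro path used later, and you may also need CVS_RSH=ssh depending on how the :ext: access method is configured:

# export CVS_RSH=ssh                      # may be needed for the :ext: access method
cvs checkout Mad                          # check out a package (Mad is only an example)
cd Mad
# ... edit some files ...
cvs commit -m "describe your change" macros/MakePanMK.C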
In my job scripts (i.e. the actual "job" that runs on the farm) I usually rerun the setup routines to make sure I am using the proper version of the software.
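For example, the first few lines of a job script might look like this (just a sketch; the complete ntuple job shown later does the same thing and also runs srt_setup for a test release):

#! /bin/bash
# rerun the software setup so the worker node sees the same environment as the login nodes
source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
source /rutherford/minos-soft2/labyrinth/setup_labyrinth.sh
# ... the actual work of the job goes here ...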
RAL has a batch queue system based on PBS. If you've used PBS before you'll feel at home, and many of the following commands will be familiar. Most documentation can be found by doing "man pbs" from any interactive node. I describe the most commonly used commands below.
Purpose: Submits a job for batch processing. For extended documentation "man qsub" from any interactive machine.
Example:
qsub -l cput=15:00:00,pmem=350mb -q prod -M someone@something.ac.uk -j oe -o /some/path/some_logfile.log -v "remote_a=${local_a}, remote_b=${local_b}" /path/to/some_job.sh
This submits the script /path/to/some_job.sh for execution on the batch system. Anything done inside some_job.sh will be done on the worker node when the job runs. The options are:
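Briefly, assuming the standard PBS meanings of these flags (see "man qsub" for the authoritative descriptions):

-l cput=15:00:00,pmem=350mb      # resource limits: at most 15 hours of CPU time, 350 MB of memory per process
-q prod                          # submit to the prod queue
-M someone@something.ac.uk       # address the batch system mails about the job
-j oe                            # merge the job's standard error into its standard output
-o /some/path/some_logfile.log   # write the (merged) job output to this file
-v "remote_a=${local_a}, remote_b=${local_b}"   # set remote_a and remote_b in the job's environment from values in the submitting shell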
The qsub command returns a JobID which may be used to refer to the specific job.
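Since the JobID is printed on standard output, you can capture it in a script and feed it to the other batch commands, for instance (a sketch; qdel is the standard PBS command for removing a job):

jid=`qsub -q prod -j oe -o /some/path/some_logfile.log /path/to/some_job.sh`
echo "submitted job ${jid}"
qstat ${jid}      # check on it
# qdel ${jid}     # remove it from the queue if you change your mind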
Purpose: Shows the status of jobs, queues or hosts in the cluster. For extended documentation "man qstat" from any interactive machine. qstat can do quite a lot; I show the most common usage below.
Example: Check on my jobs.
[csfc] /home/csf/kordosky > qstat -u kordosky

csflnx353.rl.ac.uk:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1822248.csflnx3 kordosky sl3p     one_single    953   1  --     -- 15:00 R 01:50
1822287.csflnx3 kordosky sl3p     one_single   8051   1  --     -- 15:00 R 00:53
1822289.csflnx3 kordosky sl3p     one_single  20850   1  --     -- 15:00 R 01:11
1822292.csflnx3 kordosky sl3p     one_single  26257   1  --     -- 15:00 R 00:50
.
.
1822363.csflnx3 kordosky sl3p     one_single     --   1  --     -- 15:00 Q    --
.
.
1822916.csflnx3 kordosky sl3p     one_single     --   1  --     -- 15:00 H    --
.
.
The command displayed all the jobs I had running (R), queued (Q) or held (H). I often hold some jobs out of the queue until others finish, since having several hundred of this sort of job running at once puts a large load on the machines serving out the disk areas.
The JobID column is identical to what is returned by qsub.
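The held (H) state above most likely comes from the job-dependency trick used in the submission script later on, but you can also hold and release jobs by hand with the standard PBS commands qhold and qrls (see their man pages):

qhold 1822916     # hold a queued job so it will not start
qrls 1822916      # release it again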
Purpose: Look at the output of one of your jobs. For extended documentation "man qcat" from any interactive machine.
Example:
qcat 1822248
This will show the standard output of your job. You can use it to figure out what your job is doing.
Purpose: Show jobs in the active and idle queues. The latter is more interesting, as the former is similar to qstat.
Example: Look at the jobs waiting to run. Jobs are sorted in order of priority. How long till my job runs!?
[csfa] /stage/minos-data7/kordosky > showq -i

JobName    Priority  XFactor  Q   User   Group  Procs     WCLimit  Class      SystemQueueTime

1840758*     132410      1.0  -   h1mc      h1      1  2:08:00:00   sl3p  Wed Jan 11 12:29:54
1840759      132410      1.0  -   h1mc      h1      1  2:08:00:00   sl3p  Wed Jan 11 12:29:54
1840891      132410      1.0  -   h1mc      h1      1  2:08:00:00   sl3p  Wed Jan 11 12:30:23
.
.
.
1840987      113232      1.0  -    cbs   minos      1  2:08:00:00   sl3p  Wed Jan 11 12:48:33
1840988      113232      1.0  -    cbs   minos      1  2:08:00:00   sl3p  Wed Jan 11 12:48:35
.
.
.
Chris Smith (cbs) will have to wait a while since the MINOS priority is kinda low (since Yours Truly has been running many jobs lately). To learn more about the priority system you can look at the GridPP Wiki, specifically the PBS scheduling part of it. You can also have a look at the FairShare metrics link off of the Ganglia page.
Here is how I run PAN (ntuple) making jobs. Each job runs on an sntp file and writes out an ntuple file.
This script reads a list of files from standard input and submits an ntuple-making job (pan_job_pro.sh) for each one. It uses standard getopts to parse its options, including the output directory. The script implements a mechanism (the -n option) so that at most nmax jobs are able to run at once. gnumi and mcbeam are specific to this particular job.
#! /bin/bash
# loop over the list of files supplied on standard input
# for each file submit a job which will make a pan out of that file

# default values for options
out_dir="/stage/minos-data6/near/pan_data_R1.18"
gnumi=""
mcbeam="z_000"
#
let nmax=99999

tf=`mktemp /tmp/temp.XXXXXX`
# temporary file is used as a fifo
# store jobid in it then read from it later
# used to make the second set of nmax jobs depend on the first finishing
# before they will run, and the third set on the second set, and .. etc.
# i.e. only run nmax jobs at a time
exec 5<> ${tf}

while getopts "o:g:b:n:" opt
do
    case $opt in
        o) out_dir=${OPTARG};;
        g) gnumi=${OPTARG};;
        b) mcbeam=${OPTARG};;
        n) let nmax=${OPTARG};;
        '?') echo "bad command line option"
             exit 1;;
        ':') echo "Missing arg to option: $OPTARG"
             exit 1 ;;
    esac
done
shift $((OPTIND-1))

echo "out_dir: ${out_dir}"
echo "gnumiaux: ${gnumi}"
echo "mcbeam: ${mcbeam}"
echo "max jobs at once : ${nmax}"

let cnt=1;
while read f ; do
    echo "********************************************"
    # echo ${f}
    b=`basename ${f}`
    r=`echo ${b} | sed 's/.root//'`
    log_name="${out_dir}/pan_${r}.log"
    in_name="${f}"
    out_name="${out_dir}/pan_${r}.root"
    echo "in: ${in_name}"
    echo "out: ${out_name}"
    echo "log: ${log_name}"

    depstr=""
    if [ ${cnt} -gt ${nmax} ]; then
        read dep <&5
        echo "This job will depend on job ${dep}"
        depstr="-W depend=afterany:${dep}"
    fi

    # submit the job to the batch system
    jid=`qsub -q prod -M ${USER}@fnal.gov -j oe -o "${log_name}" -v "inname=${in_name},outname=${out_name},bmonpath=${bmon},mcbeam=${mcbeam},gnumi=${gnumi},outdir=${out_dir}" ${depstr} ${HOME}/bin/pan_job_pro.sh`
    # to do a "dry run", echo the jid=... line above instead of running it
    # e.g. echo "jid=`....`"
    # and comment in the following line
    #jid=${cnt}
    echo "storing jobid=${jid}"
    echo ${jid} >> ${tf}

    # do at end
    let cnt=cnt+1
done
This is the ntuple making job.
#! /bin/bash
source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
source /rutherford/minos-soft2/labyrinth/setup_labyrinth.sh
cd /home/csf/kordosky/test_pro
srt_setup -a
ulimit -c unlimited

cd $WORKDIR

# GNUMIAUX is needed by the pan making code
export GNUMIAUX=${gnumi}

# all the quoting here is needed to properly feed arguments into a root macro
loon -bnq '/home/csf/kordosky/test_pro/Mad/macros/MakePanMK.C("'"${inname}"'", "test",0,"'"${mcbeam}"'")'

mv PAN_test.root $outname

echo "Moving any core dumps to ${outdir}/coredumps"
for fcore in `ls core*` ; do
    echo "found core dump:"
    echo "`file ${fcore}`"
    echo "Moving ${fcore} to ${outdir}/coredumps"
    mv ./${fcore} ${outdir}/coredumps/${fcore}
done
A couple of comments are in order. First, every job is assigned a unique working directory $WORKDIR on the execution host. You should cd to that directory at the start of your job and write any output there. Do not write an ntuple (or, heaven forbid, a gaf file) over the network; it's not efficient. At the end of your job you should use cp or mv to copy your output files back to a standard MINOS disk area. Second, you do not need to clean up $WORKDIR after your job ends; the system will do it for you. However, I've found in the past that it makes sense to archive any core dumps your job has produced into some standard location. This only works if you set ulimit -c unlimited as I show above; otherwise you won't get any core dumps. Anyway, take my advice and save the core dumps: they make it easy to figure out which jobs have crashed, and you can also do some debugging.
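If a job does crash, you can load the archived core dump into gdb together with the executable that produced it. Roughly (the core file name here is just an illustration; make sure you set up the same software version that the job ran, including srt_setup if it used a test release):

source /rutherford/minos-soft2/OO/minossoft/setup/setup_minossoft_csf.sh
gdb `which loon` /stage/minos-data6/near/pan_data_R1.18/coredumps/core.12345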
To run one file I do:
outdir="/stage/minos-data7/near/pan_mc" gnumi="/stage/minos-data6/gnumi/v17/le010z185i" beam="z_000" echo "/stage/minos-data3/dcm_catalogue/n13020190_0000_L010185.sntp.R1_18.root" | submit_pan_jobs_pro.sh -o $outdir -b $beam -g $gnumi
Running over many files is easy:
find /stage/minos-data3/dcm_catalogue -name 'n1301*L010185*.sntp.R1_18.root' | submit_pan_jobs_pro.sh -o $outdir -b $beam -g $gnumi
Unless the number of files is large (and it is if you are running over all the FD or ND data) one could use ls rather than find.
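For a smaller set of files, using ls might look something like:

ls /stage/minos-data3/dcm_catalogue/n13020190*L010185*.sntp.R1_18.root | submit_pan_jobs_pro.sh -o $outdir -b $beam -g $gnumi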
My experience is that I rarely get a job script right the first time. Thus, if you are like me, you will want to start by submitting one job, possibly a shortened one, and using qcat to watch it run. Writing a script and then using it to submit 1000 jobs cold-turkey is a recipe for chaos.
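A typical test cycle, reusing the single-file example above, might look like this (substitute the JobID that qsub actually returns for your test job):

echo "/stage/minos-data3/dcm_catalogue/n13020190_0000_L010185.sntp.R1_18.root" | submit_pan_jobs_pro.sh -o $outdir -b $beam -g $gnumi
qstat -u $USER      # find the JobID of the test job and wait for it to start running (R)
qcat 1822248        # watch its output as it runs (use your own JobID)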