Introduction

Introductions to the GRID can be found in several places (e.g. see link 1 or bibliography 1), depending on how deep the user wants to go. Here I give only rough instructions; the focus is on tips and scripts that let a physicist run his/her jobs and get results fast.

A very good introduction to the GRID is given by Steve Lloyd (see link 2), and another is on the Atlas Wiki (see link 3). In terms of "bureaucracy" you will need (a) to get a GRID certificate and (b) to join a Virtual Organisation (VO), which for us is the atlas VO. The certificate will be valid only on the machine that you used to request it. This is very important: if you need access to web pages (for example the Grid User Support), you must use the machine you got the certificate from, or log in to that machine and work from there. These steps now take a few days (instead of a few months, as it used to be).

After that, you will need to get an account on a User Interface and install your certificate. The final aim of doing these steps is to create your proxy certificate, which will allow you to have access to the grid for a desired period of time.

The guides on how to do the above steps are on our group's web page (see link 4).

By now, you will be able to use the grid. Each time you log in, you should do:

source /usr/local/lcg/etc/profile.d/grid_env.sh

For convenience, you could add this command to your shell startup file (usually .bashrc).
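A guarded version for the startup file might look like the sketch below. The check only avoids an error message on machines where the LCG software is not installed; the variable name GRID_ENV is illustrative and not part of the LCG setup.

```shell
# Set up the LCG environment only if the setup script is readable.
# GRID_ENV is just a convenience variable (an assumption, not an LCG name).
GRID_ENV=/usr/local/lcg/etc/profile.d/grid_env.sh
if [ -r "$GRID_ENV" ]; then
    # "." is the portable spelling of "source"
    . "$GRID_ENV"
fi
```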

You also need a valid grid proxy, so check whether you have one and, if so, for how long it remains valid. To get a proxy type:

grid-proxy-init -valid 24:30

That will give you a proxy for 1 day (24 hours) and 30 minutes.

  • Caution: You should make sure that your proxy is valid for the whole running time of your jobs. If, for example, you have the above proxy and you run a full simulation of 100 events, your job will be killed when the proxy expires.

  • Tip: Type grid-proxy- followed by TAB to list all the proxy commands, and add --help at the end of a command to see its syntax.

  • Tip: A 3-day proxy is enough for the full simulation of 100 events (multi-particle final state samples).
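The arithmetic behind the proxy lifetime can be wrapped in two small helpers, sketched below. The function names proxy_valid_arg and proxy_covers are hypothetical. The commented usage assumes that grid-proxy-info -timeleft prints the remaining lifetime in seconds, and the 48-hour job length is only an example figure.

```shell
# Convert a number of days into the HH:MM string passed to -valid.
proxy_valid_arg() {
    echo "$(( $1 * 24 )):00"
}

# Succeed if the seconds left on the proxy cover the hours the job needs.
proxy_covers() {
    local left_s=$1 need_h=$2
    [ "$left_s" -ge $(( need_h * 3600 )) ]
}

# Typical use on the User Interface (commented out: needs the grid tools):
# grid-proxy-init -valid "$(proxy_valid_arg 3)"
# proxy_covers "$(grid-proxy-info -timeleft)" 48 || echo "proxy too short"
```

Checking the remaining lifetime just before submission is a cheap way to avoid jobs being killed halfway through.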

Prepare a Grid Job

The detailed guide/manual for the LCG is given in Link 5.

It is essential to understand that between you and the site where your job will eventually run there is the Resource Broker (RB), which controls and distributes the jobs. For us, the RB is at RAL. You submit your jobs there and you retrieve your output from there (you will see that when you perform these requests). The concept is that the user has no direct contact with the final site.

To make the first step, run the HelloWorld example described in Steve's notes (link 2). There are three main components of your job: (a) the .jdl file, which tells the RB what your job needs; (b) your .sh file, which is the executable script (the same one that we submit to PBS); and (c) your .py file, which is the normal python jobOptions file that we all run within ATHENA.

One example of these files can be found on the work-book. Another example is given here. The full simulation (GEANT) of (Pythia) generated events is used as a case study.

(a) *simulate.jdl

* simulate.jdl: That is an example of a jdl file

The first line declares which file is the executable. The second and third lines define the names of the files that receive the errors and the print-outs. In the fourth line we give the files that we want to be copied to the site: the executable, of course, and the jobOptions file which I will use to run ATHENA. The OutputSandbox variable defines which files I want to get back. One of these is of course the GEANT output (Hits). Finally, the Requirements variable holds all our options: we need to define within which VO we will run our jobs (here atlas) and which release we want to use (here 11.0.4). So the RB will search all the sites in the atlas VO which have the release 11.0.4. Finally, we require our jobs to run in the long queue by adding other.GlueCEPolicyMaxCPUTime > 120.

  • Tip: You can check which sites satisfy all of your requirements by typing:

edg-job-list-match --vo atlas simulate.jdl

The next part of the Requirements line lists the sites that we want to exclude (for instance because we noticed that they are not set up correctly).

  • Caution: The Requirements line must be continuous, without line breaks.
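For orientation, a jdl file with the structure described above can be written out from the shell as in the sketch below. This is not the attached simulate.jdl. The output file names, the software tag string "VO-atlas-release-11.0.4", the excluded host bad.site.example and the file name simulate_sketch.jdl are all assumptions made for illustration. Note that the Requirements expression stays on a single line.

```shell
# Write an illustrative jdl file; every value here is a placeholder to adapt.
JDL_FILE=${JDL_FILE:-simulate_sketch.jdl}
cat > "$JDL_FILE" <<'EOF'
Executable = "simulate.sh";
StdOutput = "simulate.out";
StdError = "simulate.err";
InputSandbox = {"simulate.sh", "simulate.py"};
OutputSandbox = {"simulate.out", "simulate.err", "hits.pool.root"};
Requirements = Member("VO-atlas-release-11.0.4", other.GlueHostApplicationSoftwareRunTimeEnvironment) && other.GlueCEPolicyMaxCPUTime > 120 && !RegExp("bad.site.example", other.GlueCEUniqueID);
EOF
```

Running edg-job-list-match on the resulting file is a quick way to see whether the requirements are too restrictive before submitting anything.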

(b) *simulate.sh

* simulate.sh: That is an example of a sh file

This file is rather long to explain line by line, but it is simple to understand, even for someone with only a basic knowledge of bash commands. There are a few important things to mention here:

  • Since our generated files are too big to be transferred with the job (the GRID allows only up to a certain number of MB to be shipped with the jdl file), we need to copy them to the site where the job is running, which of course we neither know in advance nor control. Therefore we must first copy them to storage site(s) from where the job can retrieve them. It is essential that we register the file on the grid. Say, for example, that you need to transfer the file /home/storage/fileGenEvents.pool.root to the site se1.pp.rhul.ac.uk. You will have to type:

lcg-cr -d se1.pp.rhul.ac.uk -l lfn:/grid/atlas/fileGenEvents.pool.root --vo atlas file:///home/storage/fileGenEvents.pool.root

The above command makes a copy of that file with the logical name /grid/atlas/fileGenEvents.pool.root. The file also gets a unique identification code (something like file7764465a-55ca-4396-85e8-655c86d2c1bd), which identifies where exactly it is.

  • Caution: All the file names should start with /grid/atlas.

  • Tip: You can list all the available storage elements by typing: lcg-infosites --vo atlas se. Adding --help explains how to use this and all the other lcg commands.

You can check the existence of the file by typing:

lcg-lr --vo atlas lfn:/grid/atlas/fileGenEvents.pool.root

and copy it by typing:

lcg-cp --vo atlas lfn:/grid/atlas/fileGenEvents.pool.root file:///home/storage/copiedFile.pool.root

But the site which hosts the file you want may not be available. Therefore you should make replicas of that file. A replica has the same file name but is hosted in a different place. The command lcg-rep does this job.

  • Tip: It is good to have the file in several places, to make sure that it can be copied successfully. It is also advisable to put copies on sites outside the UK, since GRID problems are sometimes country-dependent.
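Replicating to several storage elements is easy to script. The sketch below loops over a list of SE host names and calls lcg-rep for each one. The function name replicate_to and the DRY_RUN switch are hypothetical; with DRY_RUN=1 (the default here) the commands are only printed, so the loop can be checked before anything is copied.

```shell
# Logical file name registered earlier with lcg-cr.
LFN=lfn:/grid/atlas/fileGenEvents.pool.root
# DRY_RUN=1 prints the commands instead of running them (illustrative switch).
DRY_RUN=${DRY_RUN:-1}

# replicate_to SE_HOST...: make one replica of $LFN on each storage element.
replicate_to() {
    local se
    for se in "$@"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo lcg-rep --vo atlas -d "$se" "$LFN"
        else
            lcg-rep --vo atlas -d "$se" "$LFN" \
                || echo "replication to $se failed" >&2
        fi
    done
}

# Example (host names are taken from the output of lcg-infosites --vo atlas se):
# DRY_RUN=0 replicate_to se1.pp.rhul.ac.uk
```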

This is what we do at the start of the sh file: we list the sites where we made replicas of our generated samples and then loop over them, checking each time whether the copy has been successful, until we get the file.
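A loop of this kind might be written as in the sketch below. It is not the code from the attached simulate.sh. It assumes that each source argument is a replica location that lcg-cp accepts (for instance as printed by lcg-lr), and the names fetch_first_available and COPY_CMD are hypothetical. COPY_CMD can be overridden so that the loop can be tried without the grid tools.

```shell
# Command used for each copy attempt; override COPY_CMD to test locally.
COPY_CMD=${COPY_CMD:-"lcg-cp --vo atlas"}

# fetch_first_available DEST SOURCE...
# Try each source in turn; stop at the first copy that succeeds and
# leaves a non-empty file at DEST (an absolute local path).
fetch_first_available() {
    local dest=$1
    shift
    local src
    for src in "$@"; do
        # $COPY_CMD is left unquoted on purpose so its options split into words
        if $COPY_CMD "$src" "file://$dest" && [ -s "$dest" ]; then
            echo "copied from $src"
            return 0
        fi
        echo "copy from $src failed, trying next" >&2
    done
    return 1
}
```

The function returns non-zero when every source fails, so the job script can stop early instead of starting ATHENA without its input file.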

  • The other important line (only for the simulation step) copies the file geomDB_sqlite, which is needed by GEANT, to the local working area:

cp $SITEROOT/atlas/offline/data/geomDB_sqlite $PWD

(c) *simulate.py

* simulate.py: This is an example of a jobOptions file

This is a well-known file. Nothing to stress.

Running a Grid Job

In order to run a grid job you will need to type:

edg-job-submit --vo atlas -o simulate_jobIDfile simulate.jdl

To check the status:

edg-job-status -i simulate_jobIDfile

To retrieve the output:

edg-job-get-output -i simulate_jobIDfile

The key thing to mention here is the file simulate_jobIDfile, which holds the identification of the submitted job. Unfortunately, the code given to the job is a random one (like akqIkNdtGa4LPNUTUrsWgg). Therefore the book-keeping must be very careful.

If you want to submit many jobs, it is wise to first make a template of your jdl, sh and py files. Then you can use the script:

to create as many files as you want.
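As a stand-in for the attached script, a minimal generator might look like the sketch below. It assumes a naming convention that is purely hypothetical: templates called simulate.jdl.tmpl, simulate.sh.tmpl and simulate.py.tmpl in the current directory, each containing the placeholder @JOB@ wherever the job number should appear.

```shell
# make_job_files BASE NJOBS
# For each job number i, expand BASE.{jdl,sh,py}.tmpl into BASE_i.{jdl,sh,py},
# replacing every @JOB@ placeholder with i.
make_job_files() {
    local base=$1 njobs=$2 i ext
    for i in $(seq 1 "$njobs"); do
        for ext in jdl sh py; do
            sed "s/@JOB@/$i/g" "$base.$ext.tmpl" > "${base}_$i.$ext"
        done
        # the job script must be executable on the worker node
        chmod +x "${base}_$i.sh"
    done
}

# Example: make_job_files simulate 10
```

Using the job number in the output file names inside the templates keeps the results of different jobs from overwriting each other.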

You can use the following scripts to submit your jobs, check the status of the submitted jobs and retrieve the completed jobs:
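A submission loop in the same spirit can be sketched as follows; it is not the attached script. Each jdl file gets its own job ID file, named after the jdl, which keeps the book-keeping simple. The names submit_all and SUBMIT are hypothetical, and SUBMIT can be overridden (for example with echo) to preview the commands.

```shell
# Command used to submit; override SUBMIT=echo to preview without submitting.
SUBMIT=${SUBMIT:-"edg-job-submit --vo atlas"}

# submit_all JDL_FILE...
# Submit each jdl file and store its job ID in <name>_jobIDfile.
submit_all() {
    local jdl
    for jdl in "$@"; do
        # $SUBMIT is left unquoted on purpose so its options split into words
        $SUBMIT -o "${jdl%.jdl}_jobIDfile" "$jdl" \
            || echo "submission of $jdl failed" >&2
    done
}

# Example:
# submit_all simulate_*.jdl
# for f in simulate_*_jobIDfile; do edg-job-status -i "$f"; done
```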

A detailed guide for the LCG is given in Link 5.

Links

Bibliography

  • 1. The GRID: Blueprint for a New Computing Infrastructure

-- StathisStefanidis - 07 Jun 2006

Topic revision: r6 - 2007-01-31 - LilyAsquith
 