Difference: AtlasGrid (19 vs. 20)

Revision 20, 2010-04-21 - JamesRobinson

  META TOPICPARENT 
 name="HEPGroup.AtlasStuff" 

 Introduction
- META TOPICPARENT
+ name="HEPGroup.AtlasStuff"
-<
<
+Introductions to the GRID can be found in several places (e.g. see link 1 or bibliography 1), depending on the level of detail you want. 
Here, I'll give rough instructions, but I want to focus on tips and scripts that help a physicist run jobs
and get results fast.

A very good introduction to the GRID is given by Steve Lloyd (see link 2), and another is on the Atlas Wiki (see link 3).
In terms of "bureaucracy" you will need (a) to get a GRID certificate and (b) to join a Virtual Organisation (VO), which for us is the atlas VO.
The certificate will be valid only on the machine that you used to issue it. This is very important: if you need access
to web pages (for example the Grid User Support), you must use the machine you got the certificate from.
Otherwise, you should log in to that machine and do whatever you need from there.
These steps now take a few days (instead of a few months, as they used to).
->
>
+ Introductions to the GRID can be found in several places (e.g. see link 1 or bibliography 1), depending on the level of detail you want. Here, I'll give rough instructions, but I want to focus on tips and scripts that help a physicist run jobs and get results fast.
-<
<
+After that, you will need to get an account on a User Interface and install your certificate. The final aim of doing these steps is 
to create your proxy certificate, which will allow you to have access to the grid for a desired period of time.
->
>
+A very good introduction to the GRID is given by Steve Lloyd (see link 2), and another is on the Atlas Wiki (see link 3). In terms of "bureaucracy" you will need (a) to get a GRID certificate and (b) to join a Virtual Organisation (VO), which for us is the atlas VO. The certificate will be valid only on the machine that you used to issue it. This is very important: if you need access to web pages (for example the Grid User Support), you must use the machine you got the certificate from. Otherwise, you should log in to that machine and do whatever you need from there. These steps now take a few days (instead of a few months, as they used to).

After that, you will need to get an account on a User Interface and install your certificate. The final aim of doing these steps is to create your proxy certificate, which will allow you to have access to the grid for a desired period of time.
 The guides on how to do the above steps are on our group's web-page (see link 4)
->
>
+ Ganga
->
>
+Ganga is a command line (or graphical) frontend for submitting and running jobs on the grid. Ganga usage is explained at AtlasGanga. The rest of this page is probably not so relevant any more.
 The Grid
 By now, you will be able to use the grid. Each time you log in, you should do:
 For convenience, you could add this command to your shell startup script (usually .bashrc).
-<
<
+You also have to get a grid proxy, or check whether you already have a valid one and, if so, for how long.
To get a proxy, type:
->
>
+You also have to get a grid proxy, or check whether you already have a valid one and, if so, for how long. To get a proxy, type:
 grid-proxy-init -valid 24:30
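 The check-then-renew step can be scripted. The following is a minimal sketch, not part of the original page: it assumes that grid-proxy-info -timeleft prints the remaining proxy lifetime in seconds, and the one-hour threshold and the function names are illustrative choices only.

```shell
#!/bin/bash
# Decide whether a proxy should be renewed.
# $1 = seconds left on the current proxy, $2 = minimum acceptable seconds.
proxy_status() {
    local left="$1" min="$2"
    if [ "$left" -lt "$min" ]; then
        echo "renew"
    else
        echo "ok"
    fi
}

# Renew the proxy only if less than one hour remains (threshold is an assumption).
# Assumes grid-proxy-info and grid-proxy-init are on the PATH of the User Interface.
ensure_proxy() {
    local left
    left=$(grid-proxy-info -timeleft 2>/dev/null) || left=0
    if [ "$(proxy_status "${left:-0}" 3600)" = "renew" ]; then
        grid-proxy-init -valid 24:30
    fi
}
```

 Calling ensure_proxy at the start of a submission script would avoid typing the pass phrase when the existing proxy is still good.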
 The detailed guide/manual for the LCG is given in Link 5.
-<
<
+It is essential to understand that between you and the site where your job will eventually run there is the Resource Broker (RB), which controls and distributes the jobs. 
For us, the RB is RAL. For example, you submit your jobs there, and you retrieve your jobs from there (you will see that when you perform these requests). The concept is that the user has no 
direct contact with the final site.

To make the first step, run the HelloWorld example described in Steve's notes (link 2). There are three main components of your job:
(a) the .jdl file, which tells the RB what your job will need; (b) your .sh file, which is the executable script (the same
one that we submit to PBS); and (c) your .py file, which is the normal python jobOptions file that we all run within ATHENA.
->
>
+It is essential to understand that between you and the site where your job will eventually run there is the Resource Broker (RB), which controls and distributes the jobs. For us, the RB is RAL. For example, you submit your jobs there, and you retrieve your jobs from there (you will see that when you perform these requests). The concept is that the user has no direct contact with the final site.

To make the first step, run the HelloWorld example described in Steve's notes (link 2). There are three main components of your job: (a) the .jdl file, which tells the RB what your job will need; (b) your .sh file, which is the executable script (the same one that we submit to PBS); and (c) your .py file, which is the normal python jobOptions file that we all run within ATHENA.
 One example of these files can be found on the work-book. Another example is given here. The full simulation (GEANT) of (Pythia) generated events is used as a case study.
 * simulate.jdl: That is an example of a jdl file
-<
<
+The first line declares which file is the executable. The second and third lines name the files that receive the errors and the print-outs.
The fourth line lists the files that we want copied to the site. For example, I copy the executable, of course, and the jobOptions file which I will
use to run ATHENA.
The OutputSandbox variable defines which files I want to get back. One of these is of course the GEANT output (Hits).
Finally, the Requirements variable holds all our options: we need to define the VO within which our jobs will run (here atlas) and the release we want to 
use (here 11.0.4). The RB will then search all the sites in the atlas VO which have release 11.0.4. Finally, we require our jobs to run in the long
queue by adding other.GlueCEPolicyMaxCPUTime > 120
->
>
+The first line declares which file is the executable. The second and third lines name the files that receive the errors and the print-outs. The fourth line lists the files that we want copied to the site. For example, I copy the executable, of course, and the jobOptions file which I will use to run ATHENA. The OutputSandbox variable defines which files I want to get back. One of these is of course the GEANT output (Hits). Finally, the Requirements variable holds all our options: we need to define the VO within which our jobs will run (here atlas) and the release we want to use (here 11.0.4). The RB will then search all the sites in the atlas VO which have release 11.0.4. Finally, we require our jobs to run in the long queue by adding other.GlueCEPolicyMaxCPUTime > 120
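The attached simulate.jdl is not reproduced on this page, so the following is only a sketch of a jdl file with the layout described above, written out by a small shell function. The file names (simulate.err, simulate.out, jobOptions.py, hits.pool.root) and the spelling of the software tag (VO-atlas-release-...) are assumptions made for illustration; check them against the real attachment and your site's published tags.

```shell
#!/bin/bash
# Print a jdl file for one simulation job on stdout.
# $1 = ATLAS release to require (for example 11.0.4).
# All file names below are hypothetical placeholders.
make_jdl() {
    local release="$1"
    cat <<EOF
Executable = "simulate.sh";
StdError = "simulate.err";
StdOutput = "simulate.out";
InputSandbox = {"simulate.sh", "jobOptions.py"};
OutputSandbox = {"simulate.err", "simulate.out", "hits.pool.root"};
Requirements = Member("VO-atlas-release-${release}", other.GlueHostApplicationSoftwareRunTimeEnvironment) && other.GlueCEPolicyMaxCPUTime > 120;
EOF
}
```

For example, make_jdl 11.0.4 > simulate.jdl would write the file for release 11.0.4.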
   Tip: You can check which sites satisfy all of your requirements by typing:
 * simulate.sh: That is an example of a sh file
-<
<
+This file is too long to explain line by line, but it is very simple to understand, even for someone with only a basic knowledge of bash commands. There are a few important things
to mention here:
->
>
+This file is too long to explain line by line, but it is very simple to understand, even for someone with only a basic knowledge of bash commands. There are a few important things to mention here:
  Since our generated files are too big to be transferred with the job (the GRID allows only a limited number of MB to be transferred with the jdl file), we need to copy them
-<
<
+to the site where the job is running, which of course we neither know nor control. Therefore we must first copy them to one or more sites from which we can retrieve them. It is essential that we register the file on the grid. 
Say, for example, that you need to transfer the file /home/storage/fileGenEvents.pool.root to the site se1.pp.rhul.ac.uk. You will have to type:
->
>
+ to the site where the job is running, which of course we neither know nor control. Therefore we must first copy them to one or more sites from which we can retrieve them. It is essential that we register the file on the grid. Say, for example, that you need to transfer the file /home/storage/fileGenEvents.pool.root to the site se1.pp.rhul.ac.uk. You will have to type:
 lcg-cr -d se1.pp.rhul.ac.uk -l lfn:/grid/atlas/fileGenEvents.pool.root --vo atlas file:////home/storage/fileGenEvents.pool.root
-<
<
+ The above command will make a copy of that file under the name /grid/atlas/fileGenEvents.pool.root. The file will also get a unique identification code (something
like file7764465a-55ca-4396-85e8-655c86d2c1bd) which identifies exactly where it is.
->
>
+The above command will make a copy of that file under the name /grid/atlas/fileGenEvents.pool.root. The file will also get a unique identification code (something like file7764465a-55ca-4396-85e8-655c86d2c1bd) which identifies exactly where it is.
        *  Caution: All file names should start with /grid/atlas
-<
<
+       *  Tip: You can check all the available storage elements by typing: lcg-infosites --vo atlas se. The --help option explains how to use this and all the
other lcg commands.
->
>
+*  Tip: You can check all the available storage elements by typing: lcg-infosites --vo atlas se. The --help option explains how to use this and all the other lcg commands.
 You can check the existence of the file by typing:
 lcg-cp --vo atlas lfn:/grid/atlas/fileGenEvents.pool.root file:///home/storage/copiedFile.pool.root
-<
<
+However, the site that hosts the file you want may not be available. Therefore you must make replicas of that file. A replica means
that the file name stays the same, but the file is hosted in several different places. The command lcg-rep does this job.
->
>
+However, the site that hosts the file you want may not be available. Therefore you must make replicas of that file. A replica means that the file name stays the same, but the file is hosted in several different places. The command lcg-rep does this job.
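As a rough sketch of making several replicas in one go, the function below prints one lcg-rep command per destination storage element so that the commands can be inspected before they are run. The function name is hypothetical, and the argument order passed to lcg-rep (--vo, -d destination, then the lfn) should be checked with lcg-rep --help.

```shell
#!/bin/bash
# Print one lcg-rep command per destination storage element (dry run).
# $1 = logical file name (must start with /grid/atlas),
# remaining arguments = storage elements that should hold a replica.
replica_commands() {
    local lfn="$1"
    shift
    local se
    for se in "$@"; do
        echo "lcg-rep --vo atlas -d ${se} lfn:${lfn}"
    done
}
```

Piping the output to sh, for example replica_commands /grid/atlas/fileGenEvents.pool.root SE1 SE2 | sh with real storage element names in place of SE1 and SE2, would execute the commands once they look right.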
   Tip: It is good to have the file in quite a few places, so that you can be sure it will be copied successfully. It is also advisable to copy it to sites outside the UK, since GRID problems are sometimes country-dependent.
-<
<
+This is what we do in the first line of the sh file. We list the sites where we made replicas of our generated samples and then loop over those sites to get the file, 
checking each time whether the copy has been successful.
->
>
+This is what we do in the first line of the sh file. We list the sites where we made replicas of our generated samples and then loop over those sites to get the file, checking each time whether the copy has been successful.
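The attached simulate.sh is not shown on this page, so the loop it describes can only be sketched. The version below tries each source in turn and stops at the first successful copy; the function names are hypothetical, and the copy command is passed in as a parameter so that the loop can be tried out without grid access.

```shell
#!/bin/bash
# Try each source in turn until the copy command succeeds.
# $1 = copy command (a function or program taking: source destination),
# $2 = destination, remaining arguments = sources to try in order.
fetch_with_fallback() {
    local copy_cmd="$1" dest="$2"
    shift 2
    local src
    for src in "$@"; do
        if "$copy_cmd" "$src" "$dest"; then
            echo "copied from $src"
            return 0
        fi
        echo "copy from $src failed, trying next" >&2
    done
    return 1
}

# Copy command for real use: lcg-cp within the atlas VO.
grid_copy() {
    lcg-cp --vo atlas "$1" "$2"
}
```

Inside the job script this would be called as fetch_with_fallback grid_copy file:///path/to/local/copy followed by the list of replica locations, and the job should exit if it returns a non-zero status.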
  The other important point (only for the simulation step) is that the file geomDB_sqlite, which is needed by GEANT, must be copied to the local area:
 edg-job-get-output -i simulate_jobIDfile
-<
<
+The important thing to mention here is the file simulate_jobIDfile, which holds the identifier of the submitted job. Unfortunately, the identifier given 
to the job is a random string (like akqIkNdtGa4LPNUTUrsWgg), so the book-keeping must be very careful.
->
>
+The important thing to mention here is the file simulate_jobIDfile, which holds the identifier of the submitted job. Unfortunately, the identifier given to the job is a random string (like akqIkNdtGa4LPNUTUrsWgg), so the book-keeping must be very careful.
 If you want to submit many jobs, it is wise first to make a template of your jdl, sh and py files. Then you can use the script:
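 The original script is not reproduced on this page. As a minimal sketch of the template idea only, the functions below replace a placeholder in each template with the job number. The placeholder @JOB@ and the file naming scheme (simulate.jdl.template, simulate_1.jdl and so on) are assumptions made for illustration.

```shell
#!/bin/bash
# Fill a template by replacing every @JOB@ placeholder with a job number.
# $1 = template file, $2 = job number. The result goes to stdout.
fill_template() {
    sed "s/@JOB@/$2/g" "$1"
}

# Create jdl, sh and py files for jobs 1..N from simulate.<ext>.template.
# $1 = number of jobs.
make_jobs() {
    local n="$1" i ext
    for i in $(seq 1 "$n"); do
        for ext in jdl sh py; do
            fill_template "simulate.${ext}.template" "$i" > "simulate_${i}.${ext}"
        done
    done
}
```

 Giving each job its own numbered files also makes it easier to keep one job ID file per job, which helps with the book-keeping mentioned above.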
