Last modified: Fri Sep 18 10:53:10 BST 2009
Nick West

GridTools: Grid Scripts


Introduction

Eventually this page will hold everything necessary to develop and operate GRID production systems based on the GridTools RSD, DCM and Ganga.

The Scripts are divided into the following:-


Installing GridTools and supporting applications

The following assumes bash, so csh users will have to make the obvious changes:-

  1. Decide on some directory into which GridTools and support applications will be installed and cd into it.
      cd (my top level directory)
    

  2. Prepare applications directory and download GridTools
      mkdir GridApps 
      cvs -d :pserver:anonymous@minoscvs.fnal.gov:/cvs/minoscvs/rep1 get minossoft/GridTools
      mv minossoft/GridTools/ ./
      rm -r minossoft
      cd GridTools
      export MOG_TOOLS=`pwd`
      source $MOG_TOOLS/Scripts/setup/setup_minos_lcg_grid.sh
    

  3. RSD configuration.

    Pick a name for your site. It can be arbitrary, but the examples we have so far are:-

      ral_t1_ui
      oxford_t2_ui
    
    Tell RSD what name to use and set up a configuration for it.
     local_name= (whatever name you have chosen)
    
     cd  $MOG_TOOLS/RemoteSoftwareDeployment/config 
    
     echo $local_name > minos.site_name
     cp example.config  minos-$local_name.config
    
    You could tweak the .config if you want to, but the default ought to be O.K.

    Have a test run of RSD

      rsd
    
    It should give help instructions and near the top you should see the line
      VO name:   MINOS_VO_GRIDPP_AC_UK (from default or minos-[your site name here].config)
    

  4. Install the Sam Web Client
      rsd install $MOG_TOOLS/../GridApps SamWebClient:v0_9_2_NULL-build_1\
      --soft_link=pro:delete
    
    This tells RSD to install version v0_9_2_NULL below $MOG_TOOLS/../GridApps and to make a soft 'pro' link to it. If all goes well the log should end:-
      RSD terminating. No error reported
    
    and then, by re-sourcing the setup script, you should be able to use it
       source $MOG_TOOLS/Scripts/setup/setup_minos_lcg_grid.sh
        [ should see the line: Setting up SamWebClient ]
    
       samLocate --file=N00008695_0023.cosmic.sntp.R1_18.0.root 
        [ should give: /pnfs/minos/reco_near/R1_18/sntp_data/2005-10 ]
    

  5. DCM configuration.

    Tell DCM what the site name is and which SEs it can access, using the Oxford setup as a typical example.

      cd  $MOG_TOOLS/DataCacheManager/config/
    
      echo $local_name > minos.site_name
      cp minos.site_oxford_t2_ui.se_access  minos.site_$local_name.se_access
    
    Now you need to tell DCM about the local disks and directories.
      data_dir= (the top directory of your data disk)
    
      rm -f minos.site_$local_name.local_disks    (should not exist, but just in case)
    
      echo Group  minos                                      >> minos.site_$local_name.local_disks
      echo Scratch_dir        /tmp                           >> minos.site_$local_name.local_disks
      echo @Disks             $data_dir                      >> minos.site_$local_name.local_disks
      echo @Exclude_dirs      $data_dir                      >> minos.site_$local_name.local_disks
      echo Soft_links_dir     $data_dir/dcm_catalogue        >> minos.site_$local_name.local_disks
      echo Catalogue_dir      $data_dir/dcm_catalogue/DCM    >> minos.site_$local_name.local_disks
      echo Resource_lock_dir  $data_dir/dcm_resource_locks   >> minos.site_$local_name.local_disks
    
    DCM is capable of surveying everything below @Disks and providing a catalogue, but we assume that you don't need this feature, which is why @Exclude_dirs is set to the same thing.

    Now create all the required directories giving group write access.

      mkdir --mode 0775  $data_dir/dcm_catalogue
      mkdir --mode 0775  $data_dir/dcm_cache
      mkdir --mode 0775  $data_dir/dcm_catalogue/DCM
      mkdir --mode 0775  $data_dir/dcm_resource_locks 
    
    Confirm that dcm runs
      dcm
    
    It should print its help and, near the top, list 'host_name' (the name you chose), the SEs it can see and the local disk setup.

  6. Set up local and FNAL catalogues
      dcm survey
    
    It will take no time to survey the local disk because everything was excluded but then will take about 15 minutes to download a ~ 0.3GB file from FNAL and reformat it for DCM usage.

    Note: DCM does not automatically refetch this file, because the download takes a while, so the catalogue will slip out of date. One way to prevent this is to have a nightly cron job that just executes this command.

  7. Test that DCM can do catalogue searches and Sam queries
      dcm get --accept_dcm_url [ file_name like N00008695_002%.cosmic.sntp.R1_18.0.root ]
      [  should locate 4 files in fnal-dcache-enstore ]
      dcm get --accept_dcm_url N00006771_cat0.spill.sntp.R1_18_2.0.root
      [  should locate one file in ral_t1-dcache-tape ]
      dcm get --accept_dcm_url AnaNue-N00009062_0018.spill.sntp.cedar.0.root
      [  should locate a file in ral_t1_ui-nfs ]
    
    Note that you cannot actually get data from SEs at RAL with DCM yet (it isn't supported), but you could get data from FNAL if you needed to.
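
The local disk configuration from step 5 can also be written in one go. The following is a minimal sketch, not part of GridTools: the function name write_dcm_local_disks is hypothetical, and it assumes the same file name, keywords and directories as the echo and mkdir commands in step 5 (the mkdir --mode option assumes GNU mkdir, as step 5 does).

```shell
#!/bin/bash
# Hypothetical helper (not part of GridTools): writes the local_disks file
# from step 5 with a here-document and creates the directories it needs.
# Arguments: 1 = DCM config directory, 2 = site name, 3 = top of data disk.
write_dcm_local_disks() {
  local config_dir=$1 site=$2 data_dir=$3
  local out="$config_dir/minos.site_$site.local_disks"

  # Overwrite rather than append, so a stale file cannot leave duplicates.
  # Single spaces match what the unquoted echo commands in step 5 write.
  cat > "$out" <<EOF
Group minos
Scratch_dir /tmp
@Disks $data_dir
@Exclude_dirs $data_dir
Soft_links_dir $data_dir/dcm_catalogue
Catalogue_dir $data_dir/dcm_catalogue/DCM
Resource_lock_dir $data_dir/dcm_resource_locks
EOF

  # Same directories and group-write mode as the mkdir commands in step 5.
  local d
  for d in dcm_catalogue dcm_cache dcm_catalogue/DCM dcm_resource_locks; do
    mkdir -p --mode 0775 "$data_dir/$d"
  done
}
```

With $local_name and $data_dir set as in steps 3 and 5 it would be called as:-

      write_dcm_local_disks "$MOG_TOOLS/DataCacheManager/config" "$local_name" "$data_dir"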


Setting up the Environment

The environment components

In order to use the GridTools and run applications on the GRID, two components of the environment have to be established:-

The scripts are meant primarily to run on software structures defined by RSD, but they can also be used to set up GridTools interactively on systems maintained by SRT. In that case they cannot set up an application, because they cannot know which setup script in $SRT_DIST/setup to use.

Running the setup scripts

The scripts are:-
  setup_minos_lcg_grid.sh   (for sh/bash shells)
  setup_minos_lcg_grid.csh  (for csh/tcsh shells)
and take a single, optional, argument specifying the application:-
  app-name:app-version
  example: CedarDaikon:03-build_0-SL4
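
To make the app-name:app-version form concrete, here is a small sketch of how such an argument can be split in bash. The function split_app_spec is hypothetical and only illustrates the format; the real setup scripts may parse the argument differently.

```shell
#!/bin/bash
# Hypothetical helper: split "app-name:app-version" into its two parts.
# Illustration only; the real setup scripts may parse the argument differently.
split_app_spec() {
  local spec=$1
  case $spec in
    ?*:?*) ;;                         # need non-empty text on both sides of ':'
    *) echo "expected app-name:app-version, got '$spec'" >&2; return 1 ;;
  esac
  app_name=${spec%%:*}                # text before the first ':'
  app_version=${spec#*:}              # text after the first ':'
}

split_app_spec "CedarDaikon:03-build_0-SL4" &&
  echo "name=$app_name version=$app_version"
# prints: name=CedarDaikon version=03-build_0-SL4
```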

There are several cases to be covered:-

On RAL T1 UI or PBS batch worker

  source /stage/minos-data1/software/grid/setup_minos_local-SL4.sh (or .csh) {application}
Each time a new release of GridTools is installed, it creates these wrapper scripts in the top level directory which do nothing more than invoke the latest setup_minos_lcg_grid.sh (or .csh) and thus simplify job setup.

Caution: As their names suggest these are Scientific Linux 4 scripts so should not be used on any SL3 UI.

On any GRID WN with RSD installed software stack

In a perfect world the correct setup would be:-
  source $VO_MINOS_VO_GRIDPP_AC_UK_SW_DIR/setup_minos_local-SL4.sh (or .csh) {application}
as that's the standard environmental variable. However at present the only LCG computing element is at RAL T1 and its software disk was full in October. Although there is now space, for the moment we will continue to use the RAL T1 UI/PBS scripts instead i.e.
  source /stage/minos-data1/software/grid/setup_minos_local-SL4.sh (or .csh) {application}

On any other machine

Follow the installation instructions and then to run the script:-
   export MOG_TOOLS=wherever
   source $MOG_TOOLS/Scripts/setup/setup_minos_lcg_grid.sh
or
   setenv MOG_TOOLS wherever
   source $MOG_TOOLS/Scripts/setup/setup_minos_lcg_grid.csh
Note that you cannot specify an application in this case; the script can only be used to set up GridTools themselves.

The resulting environment

After successful execution the following environmental variables will be set:-

Each entry below gives the Variable, its Meaning and, where there is one, an Example.

Variable: MOG_SW_DIR
  Meaning: Software top level (see RSD_SW_DIR)
  Example: /stage/sl3-lcg-exp/minossgm

Variable: MOG_TOOLS
  Meaning: GridTools top directory. Derived from $MOG_SW_DIR
  Example: /stage/sl3-lcg-exp/minossgm/apps/GridTools/pro-SL4

Variable: MOG_SCRIPTS
  Meaning: GridTools script directory. Derived from $MOG_SW_DIR
  Example: /stage/sl3-lcg-exp/minossgm/apps/GridTools/pro-SL4/Scripts

Variable: MOG_CE_NAME
  Meaning: The GRID queue name (or empty if not on GRID)
  Example: lcgce01.gridpp.rl.ac.uk

Variable: MOG_HOST_NAME
  Meaning: The standardised host name:-
           <site> - t1 | t2 - ui | wn
  Example: ral_t1_wn

Variable: MOG_OS_TYPE
  Meaning: The standardised operating system type
  Example: SL4

Variable: MOG_WORK_DIR
  Meaning: Scratch area. Based on the first defined of the following:-
             1. $SCRATCH_DIRECTORY
             2. $WORKDIR
             3. /tmp
           It may not be empty and must not be erased.
           Instead create a directory whose name includes $$
  Example: /pool/minosmc_6237043.csflnx353.rl.ac.uk

Variable: DCM_HOME
  Meaning: DCM source home directory. Derived from $MOG_SW_DIR
  Example: /stage/sl3-lcg-exp/minossgm/apps/GridTools/pro-SL4/DataCacheManager

Variable: dcm (wrapper executable)
  Meaning: Use to run DCM. Derived from $MOG_SW_DIR
  Example: n/a

Variable: RSD_HOME
  Meaning: RSD source home directory. Derived from $MOG_SW_DIR
  Example: /stage/sl3-lcg-exp/minossgm/apps/GridTools/pro-SL4/RemoteSoftwareDeployment

Variable: rsd (wrapper executable)
  Meaning: Use to run RSD. Derived from $MOG_SW_DIR
  Example: n/a

Variable: ganga (wrapper executable)
  Meaning: Use to run Ganga. Hardwired for Oxford and RAL.
           Only if installed.
           If multiple versions installed can select version

Variable: sam* e.g. samLocate (added to path)
  Meaning: Use to run SAM Web Services package. Picked up from SamWebClient pro link.
           Only if installed

Variable: svn (added to path)
  Meaning: Use to run Subversion. Picked up from Subversion pro link.
           Only if installed

Variable: MOG_APP_DIR
  Meaning: Application top level (see RSD_TOP_DIR).
           Only set if application specified.
  Example: /stage/sl3-lcg-exp/minossgm/apps/CedarDaikon/0-build_0-SL3

Variable: *
  Meaning: Application specific variables.
           Only set if application specified.
  Example: SRT_PUBLIC_CONTEXT
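
The MOG_WORK_DIR entry above says that the scratch area must not be erased and that jobs should instead create a directory whose name includes $$. A minimal sketch of that pattern follows; the function names and the "job_" prefix are arbitrary choices made here, and treating an empty variable as undefined is an assumption.

```shell
#!/bin/bash
# Pick the scratch base in the order given for MOG_WORK_DIR above.
# Illustration only: the setup script already sets MOG_WORK_DIR for you.
# Assumption: an empty value is treated the same as an undefined one.
scratch_base() {
  echo "${SCRATCH_DIRECTORY:-${WORKDIR:-/tmp}}"
}

# Create a private, per-process work directory below the scratch area.
# The name "job_$$" is an arbitrary choice; $$ is the shell's process id.
make_job_dir() {
  local base=${MOG_WORK_DIR:-$(scratch_base)}
  local dir="$base/job_$$"
  mkdir -p "$dir" && echo "$dir"
}
```

A job script would then work inside its own directory and remove only that directory when finished:-

      job_dir=$(make_job_dir) && cd "$job_dir"
      (run the job)
      cd / && rm -rf "$job_dir"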

If an application has been located then $MOG_APP_DIR will be defined and can be used to source further setup scripts for individual libraries if required. To see what library scripts are available and what they do, look at the installation scripts in the RSD library scripts directory and in particular:-

install_cernlib
install_daikon_scripts
install_GENIE
install_minossoft
install_neugen3
install_pythia
install_root

Note that different libraries have different locations and naming conventions for their scripts, so RSD always provides a standardised location and naming convention by creating scripts, or soft links to them, in:-
  source $MOG_APP_DIR/setup_library/setup_<library-name>.sh
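
Because the location and naming are standardised, sourcing several library scripts can be wrapped in a loop. This is only a sketch: setup_libraries is a hypothetical function, and the library names you pass must match whatever setup_<library-name>.sh files actually exist at your site.

```shell
#!/bin/bash
# Source the RSD-provided setup script for each named library, if present.
# Hypothetical helper; stops at the first library with no setup script.
setup_libraries() {
  local lib script
  for lib in "$@"; do
    script="$MOG_APP_DIR/setup_library/setup_$lib.sh"
    if [ -r "$script" ]; then
      . "$script"                     # source into the current shell
    else
      echo "no setup script for '$lib' under $MOG_APP_DIR/setup_library" >&2
      return 1
    fi
  done
}
```

To see which names are valid, list the directory first:-

      ls $MOG_APP_DIR/setup_library/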


Simulated GRID Mode

Eventually production jobs running on the RAL farm will only have access to shared NFS disk for software; there will be no shared NFS disk for data, including any flux files. This means that all data will have to be obtained from Storage Elements via DCM.

At the time of writing, NFS disks are not expected to be phased out until some months into 2008. However, it makes sense to test out DCM based production before then and this can be done using "Simulated GRID Mode". If the environmental variable:-

MOG_SIMULATE_GRID
exists (its value is irrelevant) then DCM disregards the local NFS disk and switches to "Worker Node mode". See Running on a Worker Node (WN)
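
The distinction between a variable that exists and one that has a particular value can be seen in the sketch below. It only shows how to switch the mode on and off from bash and how a script can test for existence; how DCM itself performs the check is not shown here, so giving the variable a non-empty value such as 1 is the cautious choice.

```shell
#!/bin/bash
# True if MOG_SIMULATE_GRID exists in the shell, whatever its value.
# This is an illustration of an "exists" test, not DCM's own code.
simulated_grid_mode() {
  [ -n "${MOG_SIMULATE_GRID+set}" ]
}

export MOG_SIMULATE_GRID=1          # switch Simulated GRID Mode on
simulated_grid_mode && echo "Simulated GRID Mode on"

unset MOG_SIMULATE_GRID             # back to the normal NFS-aware behaviour
simulated_grid_mode || echo "Simulated GRID Mode off"
```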


Running a Loon Job on a Worker Node

In this section we are going to run 2 loon jobs:-
  1. Loon with NFS disk and no Ganga
  2. Loon with SE access via DCM submitted by Ganga


Loon with NFS disk and no Ganga

For this first example we are going to spoon feed you; the next one is going to be a lot harder! Prepare by logging onto your UI and setting up your proxy as described in Creating a short term proxy.

Next set the GridTools environment:-

RAL:    source /stage/minos-data1/software/grid/setup_minos_local-SL4.csh (or .sh)
Oxford: source /datadisk/minos/software/setup_minos_oxford.csh (or .sh)
We are going to be using the two files:-
  1. demo_loon_job_nfs.sh starts by defining a release, script and input file to run, creates a work directory and cd's into it. Then it runs loon, lists the output files it produces and then finally cd's out of the work directory and wipes it.

  2. demo_loon_job_nfs.jdl which is the JDL to pass the job script to the input sandbox, run the job on the queue:-

      lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS 
    
    and return the output.
Now move to a scratch directory and copy over these files:-
  cd to/some/scratch/directory

  cp $MOG_SCRIPTS/jobs/demo_loon_job_nfs.* ./
Take a look at:-
 demo_loon_job_nfs.sh
You will see that it contains:-
 #  Define the release, file and script to run.  Will have to update from time to time.

    release=minossoft:S07-10-22-R1-26-build_2-SL4
That release is history, so look at Installed Base Releases at RAL and update the script to use a current one.

Now you can run this like your very first "Hello World" job, but, for very short jobs where you are keen to see the results ASAP, we have a little tool that will submit the JDL for you and then poll waiting for the job to end. To run it:-

   perl $MOG_SCRIPTS/jobs/run_test_job.perl demo_loon_job_nfs.jdl
If all goes well, never a given in the GRID, you should see something like this:-

  Creating work directory: /tmp/run_test_job_27403
  Submitting: edg-job-submit --vo minos.vo.gridpp.ac.uk --output /tmp/run_test_job_27403/job_id demo_loon_job_nfs.jdl ...
  
  
  Entering polling phase ...
  2007-11-02 14:56:01 Ready unavailable
  2007-11-02 14:56:34 Scheduled Job successfully submitted to Globus
  2007-11-02 14:58:45 Running Job successfully submitted to Globus
  2007-11-02 15:02:04 Done Job terminated successfully
  
  
  Retrieving job output ...
  Job output returned to /tmp/jobOutput/nwest_bmjPHgQA3WnoIPTcOqT8bQ:-
  
    File: demo_loon_job_nfs.err  begins (first 20 lines max):-
      (output from demo_loon_job_nfs.err)

    File: demo_loon_job_nfs.out  begins (first 20 lines max):-
      (output from demo_loon_job_nfs.out)
  
  Cleaning up and removing /tmp/run_test_job_27403
You can see the files in full by examining them in the temporary directory where they have been returned. To be considerate you should delete this directory when done rather than leave it to be eventually removed.


Loon with SE access via DCM submitted by Ganga

In this section the gloves come off and we go in detail through the steps necessary to run a loon job on the GRID. For the purposes of this exercise we shall assume:-

  1. The version of MINOSSOFT you want is installed on the GRID; we will get to how you can check that in a minute.

  2. You have an input data set that DCM can resolve i.e. a SAM query or some file name, possibly with wild-carding.

  3. You want to run Loon with a script you already have prepared.

  4. You want to write the resulting Loon binary output files back to a SE write accessible from RAL.
It's a bit painful, and the plan is to make this easier, but it's no bad thing to see all the details; it will help if (or rather when) things go wrong or if you want to do something a bit non-standard.

Remember you need to have obtained a GRID Certificate before you can play, otherwise you can only sit and watch.

  1. Start by logging onto the UI at RAL and create a GRID proxy

    To see what software is installed on the GRID, use lcg-infosites

      lcg-infosites --vo minos.vo.gridpp.ac.uk tag
    
    you should see something like:-
    Name of the CE: lcgce01.gridpp.rl.ac.uk
    
    Name of the CE: lcgce02.gridpp.rl.ac.uk
    
    These are the SL3 (lcgce01 - don't use) and SL4 (lcgce02) queues.

    In an ideal world the 'tag' argument to lcg-infosites would list software tags, but as explained above we don't live in such a world, so for now you have to check the Installed Base Releases for RAL SL4.

    For the sake of this exercise, let's pick:-

      minossoft:S07-10-22-R1-26-build_2-SL4
    

    Besides selecting the CE and the software, we also need to decide which queue to use as CEs typically have more than one available. To get this information:-

      lcg-infosites --vo minos.vo.gridpp.ac.uk ce
    
    and look for lcgce02.gridpp.rl.ac.uk:-
      #CPU    Free    Total Jobs      Running Waiting ComputingElement
      ----------------------------------------------------------
      1214      51       0              0        0    lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
      1214      51     223             32      191    lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
    
    By the way, don't get too excited if it looks like you have the farm to yourself; it's not showing you jobs from other experiments.

    So there are two queues:-

      lcgpbs-gridS
      lcgpbs-grid500M
    
    and it's not hard to guess that they are running PBS and one is a general purpose short queue and the other a long one specifically for MINOS. We will pick the MINOS one so the full queue name we want is:-
      lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
    

  2. Next we have to decide what input files to process. We want DCM to resolve the files as DCM URLs but not retrieve them; that will happen on the WN. Further, we don't want DCM to find copies that just happen to be on the local disk - the WN cannot see them! Finally, we want DCM to record the files it finds in a file (say 'my_input_files'), and we have to remember that DCM demands that this file does not exist beforehand. So the commands are something like this:-
      rm -f my_input_files
      dcm get --accept_dcm_url           \
              --remote_se not_nfs        \
              --file_list my_input_files \
              [ file_name like F00034638_0000.mdaq.root ]
    
    O.K., so the SAM query is contrived, but you get the idea. That query only gave one file name of course:-
     dcm://fnal-dcache-enstore/fardet_data/2006-04/F00034638_0000.mdaq.root#17711445
     
    but your query could produce lots.

  3. In order to run a job you will need to pass 2 files in your "input sandbox":-

    1. The script to run the job, getting the input files, running loon and writing the output. This is:-
        $MOG_SCRIPTS/jobs/demo_loon_job_se.sh
      
      Take a few minutes to look at that script and see if it all makes sense.

    2. The loon script you want to run. Here we shall assume it's:-
        reco_far_Alt_All_development.C
      
      It simplifies things to have both of these files in the current directory:-
      cp $MOG_SCRIPTS/jobs/demo_loon_job_se.sh                 ./
      cp from/where/ever/reco_far_Alt_All_development.C   ./
    
    If you cannot easily lay your hands on a reco_far_Alt_All_development.C you can take a look at:-
      $MINOS_TOOLS/LoonValidationJobs/README
    
    and adapt the appropriate
      LVJ_reco_far_Alt_All_ver
    
    by renaming the file and the internal function call.

  4. Now we have to decide which output files DCM will write and where. For simplicity let's assume that we want to write out:-
      *.root
    
    Whenever you run DCM it will list the SEs that it can access from there. For example from the UI at RAL:-
      ral_t1_ui has access to the following SEs:-
        ral_t1_ui-nfs                   Local NFS Disks
        ral_t1-castor-prod_d0t1         RAL T1 CASTOR disk0tape1 Production Service
        ral_t1-castor-test_d0t1         RAL T1 CASTOR disk0tape1 Test Service
        ral_t1-dcache-disk              RAL T1 dCache Disk Store
        ral_t1-dcache-tape              RAL T1 dCache Tape Store
        fnal-dcache-enstore             FNAL dCache interface to Enstore
    
    but of course what you need are the SEs that the WN on lcgce02.gridpp.rl.ac.uk can see. In fact RAL Tier 1 can also see them all.

    For this exercise we will assume you want to write into the directory:-

      grid_tests/loon_job/output
    
    below the top-level minos directory of the
      RAL T1 dCache Disk Store
    
    in which case DCM has to write to:-
      ral_t1-dcache-disk/grid_tests/loon_job/output
    

  5. Finally, we are ready to run Ganga to submit a job!

    1. Start Ganga:-
        ganga
      

    2. Prepare a list, element by element, containing the args to be given to demo_loon_job_se.sh:-
        arglist = []
        arglist.append('minossoft:S07-10-22-R1-26-build_2-SL4')
        arglist.append('reco_far_Alt_All_development.C')
        arglist.append('dcm://fnal-dcache-enstore/fardet_data/2006-04/F00034638_0000.mdaq.root#17711445')
        arglist.append('ral_t1-dcache-disk/grid_tests/loon_job/output')
        arglist.append('*.root')
      
      That is filling out:-
      1. The required version of software
      2. The input script
      3. The input DCM URL. If you have more than one use the first.
      4. The output SE name and directory
      5. The output file names.

    3. Next create a job object:-
        j = Job(application=Executable(exe=File('demo_loon_job_se.sh'),args=arglist),backend='LCG')
      
      Quite a lot is going on here but should all make sense if you have worked through the Ganga Tutorial and even if you haven't it's still clear that the intention is to run demo_loon_job_se.sh with the supplied arguments on the LCG GRID.

    4. Now you place the loon script in the input sandbox; the executable script is added automatically.
        j.inputsandbox = ["reco_far_Alt_All_development.C"]
      

    5. Nearly done now! You need to tell Ganga which queue to send the job to:-
        j.backend.CE = 'lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M'
      

    6. That's it, the job is ready to run. You can take one last look at it simply by typing:-
        j
      
      and if it looks O.K.:-
        j.submit()
      

    7. If you have more input files you need to create further jobs, but can save a lot of time by just updating the arglist and copying the one you already have:-
        arglist[2] = 'dcm://fnal-dcache-enstore/fardet_data/2006-04/F00034638_0001.mdaq.root#34339325'
        j=j.copy()
        j.application.args = arglist
        j.submit()
      
      I have chosen to use the same 'j' variable, but all the previous jobs are still available and can be listed with:-
        jobs
      

    8. Repeat the last step until all the input file list is exhausted. Then type
        jobs
      
      until all the jobs show "submitted". At that point, unless something bad has happened, nothing is going to happen for a while, so you may as well quit Ganga and return later to see how your jobs have fared.

    9. When the jobs have ended you can examine them within Ganga e.g. for job 52:-
        !ls -l $jobs[52].outputdir
        !more  $jobs[52].outputdir/stdout
      
      or examine them directly from the command line of course.
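
Before creating one job per input file it can help to check what step 2 wrote into my_input_files. The sketch below splits a DCM URL into its parts and lists a file of them. It assumes one URL per line in the file, the function names are hypothetical, and since the meaning of the "#..." suffix is not described here it is simply kept as-is.

```shell
#!/bin/bash
# Split a DCM URL of the form dcm://<se>/<path>#<suffix> into its parts.
# Hypothetical helper; the "#..." suffix is kept without interpreting it.
split_dcm_url() {
  local url=$1 rest
  case $url in
    dcm://?*/?*) ;;
    *) echo "not a DCM URL: $url" >&2; return 1 ;;
  esac
  rest=${url#dcm://}
  dcm_se=${rest%%/*}                  # storage element name
  rest=${rest#*/}
  dcm_path=${rest%%\#*}               # path below the SE's top directory
  dcm_file=${dcm_path##*/}            # bare file name
  case $rest in
    *'#'*) dcm_suffix=${rest##*\#} ;;
    *)     dcm_suffix= ;;
  esac
}

# Print "<se> <file name>" for each URL in a list (one URL per line assumed).
list_dcm_files() {
  local url
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    split_dcm_url "$url" && echo "$dcm_se $dcm_file"
  done < "$1"
}
```

For the example URL above, "list_dcm_files my_input_files" would print the SE name fnal-dcache-enstore followed by F00034638_0000.mdaq.root.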


Installing a Test Release on a Worker Node

Introduction

There will be times when you want to run Test Release code on the GRID. This section covers such cases and assumes:-

Transferring your Test Release to the GRID involves creating a tar file of it, passing that with your GRID job script and invoking a perl script to install it.

Creating a Test Release tar file

  1. Set up your Test Release
      cd /my/test/release
      srt_setup -a
    

  2. Run the script to make tar file
      perl $MOG_SCRIPTS/setup/create_test_release_tar.perl
    
    This creates a tar file of a clean (i.e. without binaries) version of your Test Release and places it in your Test Release top level directory. It does this by making a temporary copy; it does not clean yours. The tar file is called:-
      test_release_tar-<your-test-release-directory-name>.tar.gz
    

Installing the Test Release on the WN

The actual installation will be done by:-
  $MOG_SCRIPTS/setup/install_test_release_from_tar.perl
The script takes two arguments:-
  1. The name of the Test Release tar file.
  2. The directory where the Test Release is to be installed. If this directory already exists it must be empty.
The script assumes that the Base Release has already been set up.

Your first job is to decide how to get the tar file to the GRID. You have two choices, depending on its size:-

In your GRID job script, you need to
  1. Set up your Base Release.
  2. Use install_test_release_from_tar.perl to install your Test Release
  3. cd into your Test Release and use srt_setup -a
  4. Run your executable.
For an example script that does the first 3 steps see:- demo_test_release_setup_job.sh

Running the demos

There are two demos based on demo_test_release_setup_job.sh showing both sandbox and URL methods of installation. To run them:-

  1. Set up GridTools
      e.g. at RAL T1 UI: source /stage/minos-data1/software/grid/setup_minos_local-SL4.csh/.sh
    

  2. Set up a GRID proxy
      voms-proxy-init  -voms minos.vo.gridpp.ac.uk
    

  3. Copy over the demo files
      cd to/some/scratch/directory
      cp $MOG_SCRIPTS/jobs/demo_test_release_setup_job* ./
    

  4. Submit to GRID via helper tool:-
      Sandbox demo: perl $MOG_SCRIPTS/jobs/run_test_job.perl demo_test_release_setup_job_sbox.jdl 
      URL demo:     perl $MOG_SCRIPTS/jobs/run_test_job.perl demo_test_release_setup_job_url.jdl 
    
Incidentally, if you run these demos you will see warnings that the .base_release does not match the Base Release. You too will get that if you don't use the same version of minossoft locally and on the GRID.


Setting up a GBS based MC Production

Preparation

In this section we deal with the steps required before starting to set up a GBS based MC production system.

  1. You need to have a GRID certificate and to have joined the MINOS VO. See User Administration

  2. You need to have GridTools installed. See the first two steps of Installing GridTools and supporting applications

  3. You need to have configured GBS and be familiar with it by reading the tutorial

  4. You have read access to the properly configured KERBEROS keytab file in:
      /stage/minos-data1/vo/grid/mcarchiver.keytab
    
    but all that should mean is that you are in the minos group.

  5. You have looked over the multi-step job script run_gbs_job.sh which will be configured to invoke the set of gbs_do_*.sh scripts to carry out the individual job steps. At RAL this set of files can be found in:-
      /stage/minos-data1/vo/mc_production/daikon_scripts 
    

The GBS Job Steps

You will configure run_gbs_job.sh by setting the environmental variable GBS_JOB_STEPS, to run some or all of the following steps:-

Step            Action
setup           Define global environment including
                  1. $WRKDIR workspace within job step
                  2. $CACHE_DIR workspace between job steps
gminos          Run gminos and store results as $CACHE_DIR/${GAFBASE}.tar.gz
save_gminos     Copy $CACHE_DIR/${GAFBASE}.tar.gz to local Storage Element
reroot          Run rerootjob on gminos output and store results as $CACHE_DIR/${GAFBASE}.reroot.tar
recon           Run loon on reroot output and store results as $CACHE_DIR/${GAFBASE}.recon.tar
save_recon      Copy $CACHE_DIR/${GAFBASE}.recon.tar to local Storage Element
copy_to_remote  Copy required files, as determined by $COPY_MODE, to subdirectories of mindata@minos26.fnal.gov:STAGED
cleanup         Removes $CACHE_DIR plus any empty parent directory

For example, to just do MC and send all the results to FNAL:-

  GBS_JOB_STEPS=setup;gminos;copy_to_remote;cleanup
Whereas to do detector and rock MC at RAL and then run overlay reconstruction and just send the ntuples to FNAL would involve a series of:-
  GBS_JOB_STEPS=setup;gminos;reroot
followed by a:-
  GBS_JOB_STEPS=setup;recon;cleanup
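
As a rough picture of how a semicolon separated step list can be walked in bash, consider the sketch below. It is not the real run_gbs_job.sh: the mapping from a step name to a file called gbs_do_<step>.sh is an assumption, and the loop only prints what it would run. Note that if you ever set such a value directly in a shell it must be quoted, because an unquoted ';' ends the command.

```shell
#!/bin/bash
# Split a semicolon separated GBS_JOB_STEPS value and visit each step in order.
# Quoted here because an unquoted ';' would end the assignment.
GBS_JOB_STEPS='setup;gminos;copy_to_remote;cleanup'

# IFS is changed for this one read only, so the rest of the script is unaffected.
IFS=';' read -r -a steps <<< "$GBS_JOB_STEPS"

for step in "${steps[@]}"; do
  # Assumption: each step maps to a script named gbs_do_<step>.sh.
  echo "would run: gbs_do_${step}.sh"
done
```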

The setup step

This step performs a global setup for the entire job. It also handles conversion of "..." to ";" in any user supplied Database Cascade
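
The "..." to ";" conversion can be pictured as a single bash substitution. The function name and the example value below are hypothetical and are not taken from the real setup step.

```shell
#!/bin/bash
# Replace every "..." in a value with ";" using bash pattern substitution.
# In a shell pattern '.' is a literal dot, so "..." matches three dots only.
dots_to_semicolons() {
  local value=$1
  echo "${value//.../;}"
}

dots_to_semicolons 'first_db...second_db'
# prints: first_db;second_db
```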

The gminos step

This step skips if the output file $CACHE_DIR/${GAFBASE}.tar.gz already exists; otherwise it:-
  1. Loads the appropriate flux files.
  2. Runs gminos using $TYP configuration.
  3. Stores its output in $CACHE_DIR/${GAFBASE}.tar.gz.
The script can be made to behave differently when running in Test and Production Modes :-
short_test_config=""
if [ "$GBS_MODE" = "Test" ] ; then
  short_test_config="short_test"
  if [ "$DETECTOR" = "far" ] ; then short_test_config="short_test_far"; fi
  $GBS_LOG INFO GBS_MODE = Test - configuring short  test: $short_test_config
fi

$GBS_LOG INFO Running gminos_jobs.py -r $RUNN -s $SUBRUNN ${TYP[*]} $short_test_config
gminos_jobs.py -r $RUNN -s $SUBRUNN ${TYP[*]} $short_test_config > ${TYP[0]}_${RUNN}_${SUBRUNN}.log 2>&1
  

The save_gminos step

This step saves the gminos output $CACHE_DIR/${GAFBASE}.tar.gz to a local SE (Storage Element).

It creates a directory structure below $SE_TOP_DIR that mirrors the directory structure $CACHE_DIR below $NFS_TOP_DIR.
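
The mirroring amounts to replacing the $NFS_TOP_DIR prefix of $CACHE_DIR with $SE_TOP_DIR. A minimal sketch, with a hypothetical function name and made-up example values:

```shell
#!/bin/bash
# Map a cache directory below the NFS top directory to the mirrored
# directory below the SE top directory. Hypothetical helper.
# Arguments: 1 = cache dir, 2 = NFS top dir, 3 = SE top dir.
mirror_se_dir() {
  local cache_dir=$1 nfs_top=${2%/} se_top=${3%/}
  case $cache_dir in
    "$nfs_top"/*) echo "$se_top/${cache_dir#"$nfs_top"/}" ;;
    *) echo "$cache_dir is not below $nfs_top" >&2; return 1 ;;
  esac
}

# Made-up example values, for illustration only.
mirror_se_dir /nfs/top/daikon_04/L010185N/far/400 /nfs/top se_name/mc
# prints: se_name/mc/daikon_04/L010185N/far/400
```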

The reroot step

This step unpacks $CACHE_DIR/${GAFBASE}.tar.gz and runs rerootjob $GAFBASE.fz_gaf storing the results in $CACHE_DIR/${GAFBASE}.reroot.tar.

The recon step

If output $CACHE_DIR/${GAFBASE}.recon.tar does not already exist, this step:-
  1. Unpacks ${GAFNAME}.reroot.root from $CACHE_DIR/$GAFBASE.reroot.tar
  2. Runs loon on ${GAFNAME}.reroot.root using scripts $LOON_MACROS adding additional paths as specified by $LOON_MACRO_PATHS.
  3. Renames all products by prefixing with ${GAFNAME}.
  4. Stores as $CACHE_DIR/${GAFBASE}.recon.tar
LOON_MACRO_PATHS
LOON_MACRO_PATHS is a semicolon separated list of relative or absolute paths to be added to ROOT's include and macro paths. For example:-
task.SetGlobalEnvironment('+LOON_MACRO_PATHS=Dogwood')
For any element in the list that is not a directory GBS tries to make it into one by
  1. Prefixing $SRT_PUBLIC_CONTEXT/Production/
  2. Prefixing a "$" and performing parameter expansion.
For the above example this produces:-
$SRT_PUBLIC_CONTEXT/Production/Dogwood
LOON_MACRO_PATHS is required when the macros you specify with LOON_MACROS directly or indirectly use headers and macros with relative file names. If this is not the case, LOON_MACRO_PATHS can be left empty.

LOON_MACROS
LOON_MACROS is a semicolon separated list of relative or absolute paths to macros to be passed in order to loon. For example:-
task.SetGlobalEnvironment('+LOON_MACROS=asciidb/set_tsql_override.C;Dogwood/macros/GoodSpillTime.C;SRT_PUBLIC_CONTEXT/Production/Dogwood/reco_far_spill_daikon04_base_dogwood0.C')
For any element in the list that is not a file GBS tries to make it into one by
  1. Prefixing $SRT_PUBLIC_CONTEXT/Production/
  2. Prefixing a "$" and performing parameter expansion.
For the above example this produces:-
$SRT_PUBLIC_CONTEXT/Production/asciidb/set_tsql_override.C
$SRT_PUBLIC_CONTEXT/Production/Dogwood/macros/GoodSpillTime.C
$SRT_PUBLIC_CONTEXT/Production/Dogwood/reco_far_spill_daikon04_base_dogwood0.C
Non-standard scripts can be added to the task to be passed with each job:-
task.SetGlobalInputSandbox('/minossoft/releases/R2.0/my_recon_script.C')
On the worker node it will appear in the start-up directory $GBS_WORK_DIR so to select it:-
LOON_MACROS=  ...  GBS_WORK_DIR/my_recon_script.C ...
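
The two resolution rules for LOON_MACROS elements can be sketched as follows. This is an illustration of the rules as described above, not the code GBS actually runs: the function names are hypothetical, and trying rule 1 before rule 2 is an assumption based on the order in which they are listed.

```shell
#!/bin/bash
# Resolve one LOON_MACROS element following the two rules in the text:
#   1. prefix $SRT_PUBLIC_CONTEXT/Production/
#   2. prefix "$" to the leading name and expand it as a variable
# Sketch only: the real GBS job step may differ in detail.
resolve_macro() {
  local elem=$1 candidate var rest
  [ -f "$elem" ] && { echo "$elem"; return 0; }       # already a file

  candidate="$SRT_PUBLIC_CONTEXT/Production/$elem"     # rule 1
  [ -f "$candidate" ] && { echo "$candidate"; return 0; }

  var=${elem%%/*}                                      # rule 2: leading name
  rest=${elem#"$var"}
  if [[ $var =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] && [ -n "${!var}" ]; then
    candidate="${!var}$rest"                           # indirect expansion
    [ -f "$candidate" ] && { echo "$candidate"; return 0; }
  fi
  echo "cannot resolve macro: $elem" >&2
  return 1
}

# Resolve a whole semicolon separated list, printing one path per line.
resolve_macros() {
  local list=$1 elems elem
  IFS=';' read -r -a elems <<< "$list"
  for elem in "${elems[@]}"; do
    resolve_macro "$elem" || return 1
  done
}
```

The same approach applies to LOON_MACRO_PATHS, testing for a directory (-d) instead of a file (-f).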

The save_recon step

This step saves the recon output $CACHE_DIR/${GAFBASE}.recon.tar to a local SE (Storage Element).

It creates a directory structure below $SE_TOP_DIR that mirrors the directory structure $CACHE_DIR below $NFS_TOP_DIR.

The copy_to_remote step

This step copies required files to subdirectories of the remote site (mindata@minos26.fnal.gov). Exactly what gets copied (i.e. MC or Recon, and as a tar file or as separate files) and where (i.e. all in one place or with a separate area for log files) is controlled by COPY_MODE.

There is further work here that will be completed when we know where Cambridge ntuples are to be stored.

The cleanup step

This step cleans out all results from $CACHE_DIR and, if that leaves the directory empty, removes it and also removes any empty parent directory.
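
A cautious way to picture this clean-up is sketched below. The function is hypothetical: it assumes the results are plain files directly inside the cache directory, and it adds a stop directory as a safety limit, which the description above does not mention.

```shell
#!/bin/bash
# Remove the files in a cache directory, then remove the directory and any
# parent directories left empty, never climbing above a given top directory.
# Arguments: 1 = cache dir, 2 = top dir (the safety limit, an assumption).
cleanup_cache() {
  local dir=${1%/} top=${2%/}
  case $dir in
    "$top"/?*) ;;
    *) echo "refusing to clean $dir: not below $top" >&2; return 1 ;;
  esac
  rm -f -- "$dir"/*                  # assumption: results are plain files
  # rmdir fails on a non-empty directory, which ends the climb.
  while [ "$dir" != "$top" ] && rmdir -- "$dir" 2>/dev/null; do
    dir=${dir%/*}
  done
  return 0
}
```

It would be called as "cleanup_cache $CACHE_DIR $NFS_TOP_DIR", so that $NFS_TOP_DIR itself is never removed.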

Task creation and configuration

Introduction

This section deals with the creation of a Task based on the RSMonteCarlo model and its configuration necessary for a particular MC.

The first step is to decide upon a Task naming convention. A reasonable one is to use the name of the configuration for example:-

  L010185_near_bhcurv
We will use the RSMonteCarlo model, as this enforces a naming convention that encodes run and subrun, which it passes as part of the environment, and it supports the allocation of a new subrun as the way of changing the MC seed.

Job Environment

Before looking at how the Task will be configured we will look at the environment that the executing job sees. Elements of it come from the configuration you supply to the Task (highlighted in red) but others come from GBS and the job steps themselves.

Variable Name | Set by | Required by steps | Meaning | Example
APPLICATION | Task | setup | RSD application to setup | DogwoodDaikon04:build_0-SL4
ASCII_DB_NAME | Task | recon | The name of the temporary DB to hold ASCII tables | minos_temp
CACHE_DIR | setup | * | The directory used to hold results between job steps. The subdirectory structure below $NFS_TOP_DIR consists of:- 1. ${VEG_NAME}_${VEG_VERSION} 2. ${TYP} 3. near or far 4. run/10 | $NFS_TOP_DIR/daikon_04/L010185N/far/400
CONCURRENT_COPY_MAX | Task | copy_to_remote | Determines the maximum number of files to copy concurrently. Optional, default 5. | 1
COPY_MODE | Task | copy_to_remote | Determines what gets copied and to where. Not fully operational but following defined:- gminos - gminos output; recon_sntp - recon sntp | gminos
DETECTOR | setup | None | Either "near" or "far" | near
FLUXDIR | Task | gminos | DCM style SE name and subdirectory hold flux files | ral_t1-castor-prod_d0t1/flux/gnumi/v19
GAFBASE | gminos | * | The base name for output files | f21134005_0008_L010185F_D04
GBS_CHECK_ENV_VARS | GBS | * | Check list of supplied environmental variables and fail if any missing. | -
GBS_CURRENT_JOB_STEP | GBS | * | The current job step | gminos
GBS_JOB_STEPS | Task | GBS | Semicolon separated list of job steps | setup;gminos;copy_to_remote;cleanup
GBS_LAST_STEP | GBS | * | = "YES" if in RUN or RERUN mode and last step in chain [signals step to communicate SUCCEEDED, FAILED, HOLD or RETRY] | NO
GBS_LOG | GBS | * | Invoke the logger | -
GBS_MODE | Task | gminos | One of "Production" or "Test". See Test and Production Modes | Production
GBS_NUM_RETRY_ARGS | GBS | GBS | Number of retry args | 1
GBS_PREVIOUS_JOB_STEP | GBS | GBS | The name of the previous job step (null string for first step) | setup
GBS_RETRY_ARG_n | GBS | GBS | nth retry arg | -
GBS_SCRIPTS_DIR | Task | GBS | The location of the scripts directory | /stage/minos-data1/vo/mc_production/daikon_scripts
GBS_WORK_DIR | GBS | setup | Work (starting) directory | /pool/13832632.csflnx353.rl.ac.uk/gangajob_qiy14649
JOB_TYPE | Task | setup | One of DETECTOR, ROCK or OVERLAY (not currently used) | DETECTOR
LOON_MACRO_PATHS | Task | recon | Semicolon separated list of macro/include paths |
LOON_MACROS | Task | recon | Semicolon separated list of macros | Dogwood
MINIFLUX | Task | gminos | Controls flux set size: = "yes" use reduced flux set; = "no" use full flux set. | no
NFS_TOP_DIR | Task | setup | Top directory under which to create $CACHE_DIR | /stage/minos-data1/vo/mc_production/cambridge/STAGE
REMOTE_HOST | Task | copy_to_remote | Determines the remote host. Optional: Default: mindata@minos26.fnal.gov | mindata@minos27.fnal.gov
run | Job | * | Run number | 1020
SE_NAME | Task | save_gminos save_recon | Storage Element name | ral_t1-castor-prod_d0t1
SE_TOP_DIR | Task | save_gminos save_recon | Storage Element top directory | mc_production/cambridge/STAGE
STAGED | Task | copy_to_remote | The subdirectory beneath remote site | STAGE/nwest/gbs_test
subrun | Job | * | Subrun number | 19
TYP | Task | gminos | Beam configuration (config file is gminos_cfg_${TYP}.py) |
VEG_NAME | setup | None | MC vegetable name | daikon
VEG_VERSION | setup | None | MC vegetable version | 04
WRKBASE | setup | gminos | Top level empty work directory | /tmp/tmpwyS67u/work_dir
WRKDIR | setup | * | Work directory for individual job steps | /tmp/tmpwyS67u/work_dir/L010185_far_NC_LEM_1020_23
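To make the CACHE_DIR layout concrete, here is a minimal sketch that assembles the path from the four components listed in the table. It assumes "run/10" means integer division of the run number by 10, which the table does not state explicitly; the function name build_cache_dir is hypothetical.

```python
import posixpath

def build_cache_dir(nfs_top_dir, veg_name, veg_version, typ, detector, run):
    """Assemble CACHE_DIR from its components (hypothetical helper)."""
    if detector not in ("near", "far"):
        raise ValueError("detector must be 'near' or 'far': %r" % detector)
    return posixpath.join(
        nfs_top_dir,
        "%s_%s" % (veg_name, veg_version),  # 1. ${VEG_NAME}_${VEG_VERSION}
        typ,                                # 2. ${TYP}
        detector,                           # 3. near or far
        str(run // 10))                     # 4. run/10 (assumed integer division)

# Any run from 4000 to 4009 would share this directory under that assumption.
print(build_cache_dir("$NFS_TOP_DIR", "daikon", "04", "L010185N", "far", 4005))
```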

Task Configuration

To run GBS and create our task:-
  ganga -i $GBS_HOME/python/bootstrap.py

  man = GetManager()
  task = man.AddTask("L010185_near_bhcurv","RSMonteCarlo")
Now we have to specify the top-level script, i.e. run_gbs_job.sh; to simplify things, use the GBS_HOME environmental variable:-
  import os
  task.SetScriptFileName('%s/scripts/run_gbs_job.sh' % os.environ["GBS_HOME"] )
Next we have to tell it what job steps are to be executed and where it will find these job step scripts:-
  task.SetGlobalEnvironment('+GBS_SCRIPTS_DIR=/stage/minos-data1/vo/mc_production/daikon_scripts')
  task.SetGlobalEnvironment('+GBS_JOB_STEPS=setup;gminos;save_gminos;reroot;copy_to_remote;cleanup')
Of course what steps you will want depends on what production work you want to do.

Now consult Job Environment to see what Task configuration is required for the steps you want to carry out.
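The lookup can be sketched as a small helper that, given a GBS_JOB_STEPS value, lists the Task-supplied variables the chosen steps need. The mapping below was transcribed by hand from the Job Environment table and is an assumption, not something read from GBS. It leaves out the optional CONCURRENT_COPY_MAX and REMOTE_HOST, and GBS_MODE, which is changed through task.SetMode. The function names are hypothetical.

```python
# Task-supplied variables per job step, transcribed from the table (assumption).
TASK_VARS_BY_STEP = {
    "setup": ["APPLICATION", "JOB_TYPE", "NFS_TOP_DIR"],
    "gminos": ["FLUXDIR", "MINIFLUX", "TYP"],
    "recon": ["ASCII_DB_NAME", "LOON_MACRO_PATHS", "LOON_MACROS"],
    "save_gminos": ["SE_NAME", "SE_TOP_DIR"],
    "save_recon": ["SE_NAME", "SE_TOP_DIR"],
    "copy_to_remote": ["COPY_MODE", "STAGED"],
}

def required_task_vars(job_steps):
    """Return the variables needed by a semicolon separated list of steps."""
    required = []
    for step in job_steps.split(";"):
        # Steps with no entry (e.g. reroot, cleanup) add nothing.
        for name in TASK_VARS_BY_STEP.get(step, []):
            if name not in required:
                required.append(name)
    return required

def missing_task_vars(job_steps, configured):
    """Return the required variables that are not yet configured."""
    return [n for n in required_task_vars(job_steps) if n not in configured]

print(required_task_vars("setup;gminos;copy_to_remote;cleanup"))
```

For the step list shown, the result is the same set of variables that Example 1 sets with task.SetGlobalEnvironment.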

Database Cascade

The Database Cascade environmental variables are hardwired into the application setup script, so what if you want a non-standard one? The obvious thing to do would be to add it to the Task environment, but there are a number of complicating factors. In summary, here is an example of how to supply a DB cascade:-
  task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_URL=mysql:odbc://sql.gridpp.rl.ac.uk/minos_temp...mysql:odbc://lcgsql0365.gridpp.rl.ac.uk/minos_dogwood1')
  task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_USER=minos_reader')
  task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_PSWD=\\\\0')

Example 1: Run gminos and send output to FNAL

  import os
  task.SetScriptFileName('%s/scripts/run_gbs_job.sh' % os.environ["GBS_HOME"] )
  task.SetGlobalEnvironment('+GBS_SCRIPTS_DIR=/stage/minos-data1/vo/mc_production/daikon_scripts')
  task.SetGlobalEnvironment('+GBS_JOB_STEPS=setup;gminos;copy_to_remote;cleanup')

  task.SetGlobalEnvironment('+APPLICATION=DogwoodDaikon04:build_0-SL4')
  task.SetGlobalEnvironment('+JOB_TYPE=DETECTOR')
  task.SetGlobalEnvironment('+NFS_TOP_DIR=/stage/minos-data1/vo/mc_production/cambridge/STAGE')

  task.SetGlobalEnvironment('+TYP=L010185F_far_beam')
  task.SetGlobalEnvironment('+FLUXDIR=ral_t1-castor-prod_d0t1/flux/gnumi/v19')
  task.SetGlobalEnvironment('+MINIFLUX=no')

  task.SetGlobalEnvironment('+COPY_MODE=gminos')
  task.SetGlobalEnvironment('+STAGED=STAGE/nwest/gbs_test')

Example 2: Run full chain and send SNTP to FNAL

  import os
  task.SetScriptFileName('%s/scripts/run_gbs_job.sh' % os.environ["GBS_HOME"] )
  task.SetGlobalEnvironment('+GBS_SCRIPTS_DIR=/data/minos/software/mc_production/daikon_scripts')
  task.SetGlobalEnvironment('+GBS_JOB_STEPS=setup;gminos;save_gminos;reroot;recon;save_recon;copy_to_remote;cleanup')
  
  task.SetGlobalEnvironment('+APPLICATION=DogwoodDaikon04:build_0-SL4')
  task.SetGlobalEnvironment('+JOB_TYPE=DETECTOR')
  task.SetGlobalEnvironment('+NFS_TOP_DIR=/data/minos/west/gbs_mc_test/STAGE')

  task.SetGlobalEnvironment('+FLUXDIR=ral_t1-castor-prod_d0t1/flux/gnumi/v19')
  task.SetGlobalEnvironment('+MINIFLUX=no')
  task.SetGlobalEnvironment('+TYP=L010185_far_NC_LEM')
  
  task.SetGlobalEnvironment('+LOON_MACROS=asciidb/set_tsql_override.C;Dogwood/macros/GoodSpillTime.C;SRT_PUBLIC_CONTEXT/Production/Dogwood/reco_far_spill_daikon04_base_dogwood0.C')
  task.SetGlobalEnvironment('+LOON_MACRO_PATHS=Dogwood')
  task.SetGlobalEnvironment('+ASCII_DB_NAME=minos_temp')
  
  task.SetGlobalEnvironment('+SE_NAME=ral_t1-castor-prod_d0t1')
  task.SetGlobalEnvironment('+SE_TOP_DIR=mc_production/gbs_test/STAGE')

  task.SetGlobalEnvironment('+COPY_MODE=recon_sntp')
  task.SetGlobalEnvironment('+STAGED=STAGE/nwest/gbs_test')

Single job testing

This section covers the launching of single jobs to test the system using a short queue.

We will create a job with run number and subrun number both equal to 1. Because of the RSMonteCarlo model we have selected, job names are forced to have the format:-

  job_rrrrrrrr_ssss
so to create such a job:-
  job=task.AddJob("job_00000001_0001")
If you examine the job you will see it has:-
  Local environment: 'run=1,subrun=1'
Next we need to select the short job queue on RAL Tier 1:-
  task.SetBackend("LCG:lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS")
and submit the job:-
  job.Submit()
after that you can check at intervals either by looking for the Ganga job to end:-
  jobs(<Job ID>).status
or by updating the job status:-
  job.UpdateStatus()
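The job_rrrrrrrr_ssss naming format used above can be built and checked with a couple of plain Python helpers. This is a stand-alone sketch that does not call GBS; the names make_job_name and parse_job_name are hypothetical.

```python
import re

# job_ followed by an 8-digit run number and a 4-digit subrun number.
JOB_NAME_RE = re.compile(r"^job_(\d{8})_(\d{4})$")

def make_job_name(run, subrun):
    """Build a job name in the job_rrrrrrrr_ssss format."""
    return "job_%08d_%04d" % (run, subrun)

def parse_job_name(name):
    """Return (run, subrun) from a job name, or raise ValueError."""
    match = JOB_NAME_RE.match(name)
    if match is None:
        raise ValueError("not in job_rrrrrrrr_ssss format: %r" % name)
    return int(match.group(1)), int(match.group(2))

print(make_job_name(1, 1))
print(parse_job_name("job_00001020_0019"))
```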

Production running

This section covers the switching to Production mode with the long job queue, the generation of the job set, and the use of a cron job to automate the process of job submission and resubmission.

  1. The first step is to cull all the test jobs. Make sure that none are submitted and then:-
    task.RemoveJobs()
    

  2. Next switch to production mode:-
    task.SetMode("Production")
    
    This prevents you from accidentally changing things that affect all jobs globally. It also changes the value of the application script environmental variable
    GBS_MODE
    
    and consequently will change any script behaviour that is based on this variable.

  3. Now you can populate the Task with a full set of Jobs, creating them first as ProtoJobs. In the submit_ral.sh script there are entries like:-
    #runnum=`seq 1000 1020`
    #subrunnum=`seq 0 19`
    
    to create ProtoJobs equivalent to this:-
    for runnum in range(1000,1021):
        for subrunnum in range(0,20):
            task.AddProtoJob("job_" + str(runnum).zfill(8) + "_" + str(subrunnum).zfill(4))
    
    
    If that looks O.K., promote them to real jobs:-
    task.PromoteProtoJobs()
    
    and if not:-
    task.RemoveProtoJobs()
    

  4. Next switch the backend to the long MINOS queue:-
    task.SetBackend("LCG:lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M")
    
    Enable Task submission and decide the total number of jobs to have running at one time and the maximum number that you want to have submitted each time GBS runs.
    task.EnableSubmit()
    task.SetMaxGangaJobs(500)
    task.SetMaxSubmitJobs(10)
    
    GBS doesn't exploit any fancy machinery to launch multiple jobs efficiently; the best it does is about 10/minute, so the value passed to SetMaxSubmitJobs ought not to exceed about 50.

  5. You are now ready to launch your first wave with:-
    task.SubmitJobs()
    
    after which you can either list the updated status:-
    task.ListJobs()
    
    or create a web page with links to the jobs e.g.
    task.WriteHtmlReport("/home/west/work/minos/temp")
    

  6. If that's O.K., then you are ready to set up cron scripts based on run_gbs_cron.sh and run_gbs_cron.py, making suitable changes to:-

    1. export MOG_TOOLS=...

    2. task=man.GetTask('...')

    3. task.WriteHtmlReport("...")
    and then setting up a crontab entry e.g.
     0,30 * * * * .../GridTools/Ganga/GBS/scripts/run_gbs_cron.sh
    
    When choosing the cron frequency, bear in mind the characteristics of a typical MC production. A good frequency would be 30 minutes. That implies submitting/retrieving 10 jobs on each run to maintain the 200 running, and the cron job should run for about:-
    Submit:    10/10  =  1 min
    Check:  2*200/40  = 10 min
    Retrieve:  10/25  =  1 min
    
    Total               12 min
    

  7. Don't forget to set up and maintain both short and long term proxies, and also read up about cron job authentication.

    If, or rather when, you need to investigate problems, it is best to suspend the cron job while you run GBS interactively, to prevent it from submitting jobs while you are working.
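The cron timing estimate in step 6 can be reproduced with a few lines of Python. The figures of 200 running jobs and 10 hours per job come from the throttling discussion in this document. The per-minute rates (10 submitted, 40 checked, 25 retrieved) and the factor of 2 on checks are read off the divisors in the estimate itself, so treat them as assumptions. The function names are hypothetical.

```python
import math

def jobs_per_cycle(running_jobs, job_hours, cron_minutes):
    """Jobs finishing (and needing replacement) in one cron interval."""
    return running_jobs * cron_minutes / (job_hours * 60.0)

def cron_minutes_needed(per_cycle, running_jobs, submit_rate=10.0,
                        check_rate=40.0, retrieve_rate=25.0, checks_per_job=2):
    """Estimate cron run time in minutes, rounding each phase up."""
    submit = math.ceil(per_cycle / submit_rate)
    check = math.ceil(checks_per_job * running_jobs / check_rate)
    retrieve = math.ceil(per_cycle / retrieve_rate)
    return submit + check + retrieve

per_cycle = jobs_per_cycle(200, 10, 30)       # 10 jobs every 30 minutes
print(per_cycle)
print(cron_minutes_needed(per_cycle, 200))    # 1 + 10 + 1 minutes
```

With these numbers a cron run takes well under the 30 minute interval, so successive runs should not overlap.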

Throttling data flow to FNAL

In the copy_to_remote step script gbs_do_copy_to_remote.sh you will see the lines:-
    CONCURRENT_COPY_MAX=5
    CONCURRENT_COPY_TIMEOUT=36000 #10 hours
This limits copying to FNAL to at most 5 concurrent copies. If more jobs want to copy, they have to wait, up to the TIMEOUT value, for permission to copy. Based on the rough figures above for an average MC job, together with an estimate that one connection can sustain the copy of 5 jobs an hour, 4 concurrent connections should keep up with a steady load of 200 jobs each lasting 10 hours.
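The arithmetic behind the figure of 4 connections is short enough to check directly. The sketch below uses only the numbers quoted in the paragraph above; the function name connections_needed is hypothetical.

```python
import math

def connections_needed(running_jobs, job_hours, copies_per_connection_per_hour):
    """Concurrent connections needed to keep up with finishing jobs."""
    # In steady state, jobs finish at running_jobs / job_hours per hour.
    jobs_finishing_per_hour = running_jobs / float(job_hours)
    return int(math.ceil(jobs_finishing_per_hour / copies_per_connection_per_hour))

# 200 jobs of 10 hours each finish at 20 per hour; at 5 copies per hour
# per connection that needs 4 connections.
print(connections_needed(200, 10, 5))
```

The default CONCURRENT_COPY_MAX of 5 therefore leaves one connection spare under this estimate.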

However, if necessary that number can be reduced. The next 3 lines of the script contain:-

# Override those values if  run_ral_lcg_set_copy_parms.sh exists
override_copy_parms_file="${BUNDLE_TOP}/daikon_scripts/run_ral_lcg_set_copy_parms.sh"
if [ -f $override_copy_parms_file ] ; then eval `cat $override_copy_parms_file`; fi
and normally the file run_ral_lcg_set_copy_parms.sh contains the same values:-
CONCURRENT_COPY_MAX=5
CONCURRENT_COPY_TIMEOUT=36000 #10 hours
Placing a hacked copy of this file with a smaller CONCURRENT_COPY_MAX at RAL will take effect on the next job to start copying. Note that this is better than hacking gbs_do_copy_to_remote.sh: that script is loaded at job start time, so a change to it would not take effect until a job started after the change reached the copy stage.

If:-

    CONCURRENT_COPY_MAX=0
then copies will not be undertaken. The value should not be left like that for long as GBS will preferentially rerun failures rather than start fresh ones. This means that the job queue slowly switches to one in which all jobs are doing nothing more than unpacking a tar and then throwing it away when they discover there are no free locks. So if you want to suspend copying for more than a few hours do:-
    CONCURRENT_COPY_MAX=-1
This special value tells GBS to place the job on hold. Then when the crisis has passed you can release all HELD jobs and let them complete.
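The three cases for CONCURRENT_COPY_MAX can be summarised in a short sketch. It is not taken from gbs_do_copy_to_remote.sh: the helper names, the returned labels and the simple KEY=VALUE parser are all assumptions made for illustration.

```python
def parse_copy_parms(text):
    """Parse KEY=VALUE lines, ignoring '#' comments (hypothetical helper)."""
    parms = {}
    for line in text.splitlines():
        # Drop any trailing comment, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            parms[key.strip()] = int(value)
    return parms

def copy_action(concurrent_copy_max):
    """Describe what a job does for a given CONCURRENT_COPY_MAX."""
    if concurrent_copy_max == -1:
        return "hold"       # special value: job is placed on hold
    if concurrent_copy_max == 0:
        return "no_copy"    # no locks available: copy is not undertaken
    if concurrent_copy_max > 0:
        return "copy"       # copy, subject to the concurrency limit
    raise ValueError("unsupported value: %d" % concurrent_copy_max)

override = "CONCURRENT_COPY_MAX=-1\nCONCURRENT_COPY_TIMEOUT=36000 #10 hours"
parms = parse_copy_parms(override)
print(copy_action(parms["CONCURRENT_COPY_MAX"]))
```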

