GBS: Ganga-based Batch Submission: User Manual V01-16-05

A fault tolerant batch submission framework layered on Ganga

Status: As of version V01-16-05 GBS is a running system although it has only been tested so far on an internal dummy application. Major features (IOPool, IOItem and TaskConnector) are missing. As a temporary, and possibly permanent, replacement for the IO items, Tasks directly create the inputs for their Jobs.

n west
Last modified: Fri May 8 07:27:13 BST 2009


GBS User Manual Contents

This manual describes how to install and run GBS.

Installing

As of now GBS is only available as part of the GridTools package of the minossoft release and the objective is that it should run "out of the box" - see Running. The directory structure is:-
  GBS/
    docs     This documentation + script to produce doxygen code
    python   The generic source code
    minos    MINOS specific extensions

Configuring

GBS has a built-in configuration file:-
  GBS/python/.gbsrc
which it reads first and which should not be modified. Instead, to change any of the values, take a copy of this file, edit it and either:-

The file itself consists of simple key-value pairs with attendant comments.

  Key                           Value                                                     Example

  Expt                          Experiment name                                           Minos
  VO                            VO name                                                   minos.vo.gridpp.ac.uk
  DataDirectory                 Top level directory for GBS data.                         /home/west/work/python/gbs_datadir
  LoggerThresholdTerminal       Threshold for stdout (terminal) logging:-                 3
                                  Level           Prints
                                  FATAL     = 6   Crash and burn
                                  ERROR     = 5   Definitely wrong but can continue
                                  WARNING   = 4   Unusual but not definitely wrong
                                  INFO      = 3   Normal stuff user expects to see
                                  SYNOPSIS  = 2   Behind the scenes detail
                                  DEBUG     = 1   In depth debug level
  LoggerThresholdPermanent      Threshold for permanent (file) logging                    3
                                Values as for LoggerThresholdTerminal
  LoggerGlobalDirectory         Default for all file (permanent) logging                  /tmp/gbs_logdir
  Logger-<task-name>-Directory  Directory for task <task-name> (file) logging (optional)  /tmp/gbs_logdir
  DefaultBackend                Default Backend. One of: Local, PBS:queue, LCG:queue      Local
  MaxTimeEarlyFails             Maximum time in minutes (start to end time in GBS log     5
                                file) for failures to be classified as early.
  MaxRetryEarlyFails            Maximum number of early failure retries.                  20
  MaxRetryLateFailsHandled      Maximum number of late, handled failure retries.          5
  MaxRetryLateFailsUnandled     Maximum number of late, unhandled failure retries.        1
  DefaultMaxGangaJobs           Default maximum number of jobs that can be submitted      100
                                to Ganga at any one time
  DefaultMaxSubmitJobs          Default maximum number of jobs that can be submitted      10
                                by a single call to Task.SubmitJobs()
  UserModelsPath                Directory holding register_user_models.py (if adding      $GBS_HOME/minos
                                user extensions)

At the very least, you must alter:-

Tutorial: Running GBS

Introduction: Concepts and Objects

This is a pragmatic tutorial on how to use GBS. A nodding acquaintance with Python and Ganga would help but is not essential. If you really need to understand GBS in detail then you need to read:-

but one of the ideas behind GBS is to make your life simpler, not harder, so for now all you need to know is that GBS is a fault tolerant batch submission framework.

This means:-

GBS is a Python object-oriented system and you need to be introduced to 3 key objects:-

Running your first application script

Right, you are ready to run GBS. If you are working with MINOS GridTools, the environment variables GBS_HOME and PYTHONPATH and the executable "ganga" should be set for you, and if necessary you can select a specific version using GANGA_VER.

If you are not working with these tools you need to:-

   export GBS_HOME=path-to-GBS-directory
   export PYTHONPATH=${PYTHONPATH}:$GBS_HOME/python

or

   setenv GBS_HOME   path-to-GBS-directory
   setenv PYTHONPATH ${PYTHONPATH}:$GBS_HOME/python

and define "ganga" to invoke Ganga.

If you have a GRID certificate, now would be a good time to create a proxy if you haven't already got one. If you don't, you will see lots of warnings along the lines of:-
  WARNING  GridProxy creation/renewal failed [1].
throughout the Ganga session. They are benign, if irritating: you don't need a GRID proxy to run most of this tutorial, as it will just use your local machine, but there doesn't appear to be any way to explain this to Ganga.

Run ganga with the GBS bootstrap:-

  ganga -i $GBS_HOME/python/bootstrap.py
If you haven't run Ganga before it goes through an initial setup, but if things are working you should eventually see:-
  GBS version V01-16-05
Now you need to get hold of the manager object:-
  man = GetManager()
A very nice feature of python is that if you just type in an object identifier, it will tell you something about it. So try:-
  man
and you will get something like:-
  GBSManager named 'Manager' stored in /tmp/gbs_datadir/Manager.state
  Managing 0 Tasks
Caution: If you get exactly that you are headed for trouble; GBS is storing all its data on /tmp so it will get wiped at some point. Stop what you are doing and go back to Installing and configure GBS to write its data somewhere safer!

Every GBS object that needs to record its state has a '.state' file in which it records it. These files are designed to be easy to read and offer a way to sneak in and fiddle if things ever get broken. That's a last resort mind! For now all you need to know is that every time an object changes its state, it writes itself back to its file. So you never have to tell GBS you are done; you can exit at any time (by ^D) and come back later and resume your GBS session.

A second very nice feature of python is that every object carries some documentation about it in its __doc__ data member so to ask man what it can do:-

  print man.__doc__
You will find that very brief. GBS exploits the improved built-in help provided by ipython, on which Ganga is based, so now try:-
  help(man)
You should see the __doc__ as above but now with descriptions of all the methods. Press 'q' to leave help.

Don't forget these handy features; they help both in learning the system and in examining its state.

Managers don't do a lot besides create tasks, so that's what you need to do next:-

  task = man.AddTask("my_first_task")  
That name is going to be used to create a directory so don't include spaces or other "odd" characters. GBS will shout at you if you do. When you do this for real, you should choose names for tasks that will remind you what they are for.

Now that you have got some tasks, you can list them:-

  man.ListTasks()
and you should see:-
  The following tasks are setup up:-
  
  Name                !Ready(other)  Ready(retry)    Sub(!run)   Done(fail)   Backend  Script file                             Args  
  my_first_task              0(  0)        0(  0)       0(  0)       0(  0)     Local
For an explanation of the counts see the reference section ListTasks()

Before going further, try quitting from Ganga, coming back in, getting the manager and listing its tasks:-

  ^D
  ganga -i $GBS_HOME/python/bootstrap.py
  man = GetManager()
  man.ListTasks()
and hopefully you will see your task again.

Now pick up your task and examine it:-

  task = man.GetTask("my_first_task")
  task
It is a bit more interesting. For now we will concentrate on a couple of things:-
  Model: default
  Backend: Local
GBS is designed to be extensible, with an experiment neutral set of core objects that can be replaced in different "models". For now we will stick with the core default one.

At the moment the task is talking to the local backend, which means it will run jobs as child processes on the current machine. This is very useful for checking things out before running production, and also for learning in this tutorial!

This object has a lot of methods:-

 help(task)    
too many for that printout to be useful at this stage, but as you might guess, one of the things you can do is to create jobs, so try that now:-
  job = task.AddJob("my_first_job")
  task.ListJobs()
Notice something? Your job is called:-
  job_my_first_job
GBS requires that all job names begin "job_" and prefixes those that don't.
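
The prefixing rule is simple enough to picture in a few lines of python. This is only a sketch of the behaviour described above: normalise_job_name is a hypothetical helper invented for illustration, not a GBS function, and the real code may well differ.

```python
def normalise_job_name(name):
    """Return the name with a leading 'job_', adding it only if missing."""
    # Hypothetical helper for illustration; not part of the GBS API.
    prefix = "job_"
    return name if name.startswith(prefix) else prefix + name


print(normalise_job_name("my_first_job"))       # job_my_first_job
print(normalise_job_name("job_my_second_job"))  # job_my_second_job
```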

By now you probably have already tried:-

  job
  help(job)
but never mind about them and instead try to submit it:-
  job.Submit()
and you will get told:-
  Cannot submit job, no user application script assigned to Task 'my_first_task'
which is fair enough, you have to tell GBS what to run!

In any directory you like create the following little bash script:-

  echo "Hello World! (what else?)"
  echo "Here is my GBS environment:-"
  env | grep GBS_
Note that your application will run in a bash shell, but there is nothing to stop you saying:-
  csh my_application_job.csh
if you prefer that shell.

Now that you have a script, you have to give it to your task so that all its jobs can use it. Remember: all Jobs of a Task run the same application script.

In my case:-

  task.SetScriptFileName("/home/west/work/minos/temp/my_first_script.sh")
and you will see something similar to:-
  Copying /home/west/work/minos/temp/my_first_script.sh -> /home/west/work/python/gbs_datadir/Manager/my_first_task/my_first_script.sh
GBS has taken its own copy. Well, you wouldn't want an entire production to crash because you accidentally deleted a script in some random directory, would you?

Now can we run a job?

  job.Submit()
Now your script should get run and you should see output that includes lines like:-
  Ganga.GPIDev.Lib.Job               : INFO     submitting job 93
  Ganga.GPIDev.Adapters              : INFO     submitting job 93 to Local backend
  Ganga.GPIDev.Lib.Job               : INFO     job 93 status changed to "submitted"
In this case GBS has created a Ganga job with the ID 93. Ganga supports the organisation of its jobs into a JobTree and within this structure your task has created the folder:-
  /gbs/<task-name>
and if a Ganga job is created it will be placed in this folder.

If you list your jobs one should be running:-

  task.ListJobs()
but no matter how many times you type that command, the job continues to run. When you ask your task to list its jobs, that's a "lightweight" question; its jobs don't check with Ganga. To get up to date information, you can work either at the Task or at the individual Job level:-
  task.UpdateJobsStatus()
or
  job.UpdateStatus()
so do that now.

Once your job has ended, and chances are that your trivial one has, you will notice:-

  Ganga.GPIDev.Lib.Job               : INFO     removing job 93
If you know a little about Ganga, then you also need to understand how GBS interacts with it. Once Ganga considers the job complete and the associated GBS Job updates, the GBS Job moves all of the Ganga job files into its own area and then erases the Ganga job. This has two advantages:-

Now checking on your jobs:-
  task.ListJobs()
shows:-
  job_my_first_job    RETRY 
So your very first job has failed! How can that happen with anything so simple? It's time to introduce the fault analysis and handling framework.

The fault analysis and handling framework

Look at your job:-
  job
and now not only does it tell you about the Job object but also the output it produced:-
The output for try 1 can be found in

   /home/west/work/python/gbs_datadir/Manager/my_first_task/job_my_first_job/try_001

 and consists of:-

  total 40
  -rw-r--r--  1 west minos   10 Nov 11 18:54 gbs_ganga.status
  -rw-r--r--  1 west minos  546 Nov 11 18:54 gbs_my_first_task_job_my_first_job_1.log
  -rw-r--r--  1 west minos 1369 Nov 11 18:54 gbs_grid_info.log
  -rw-r--r--  1 west minos   86 Nov 11 18:54 __jobstatus__
  -rw-r--r--  1 west minos    0 Nov 11 18:54 stderr
  -rw-r--r--  1 west minos  217 Nov 11 18:54 stdout
  -rw-r--r--  1 west minos    0 Nov 11 18:54 __syslog__

The GLF (GBS Log File) gbs_my_first_task_job_my_first_job_1.log contains:-

  2007-11-11 18:54:07 INFO GBS_JOB_SUBMIT submitting job
  2007-11-11 18:54:07 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.sh
  2007-11-11 18:54:07 INFO GBS_JOB_WRAPPER Terminating. User script returned 0
  2007-11-11 18:54:31 INFO GBS_JOB_ANALYSIS:-
      Communication Level: APPLICATION [Application failed to record SUCCEEDED, FAILED, HOLD or RETRY]
      Ganga Exit Status: 'completed' Recorded job interval:0.0mins
      Appl. Job Status Code: UNKNOWN []
      Failure category: EARLY
      Judgement: Status Code:RETRY  [] Retry Args:''
First it shows you the output directory name and contents. If you have used Ganga before then you will be familiar with some of these.
  stderr                                      Your job error output
  stdout                                      Your job standard output

  gbs_ganga.status                            Dump of the Ganga job
  gbs_grid_info.log                           Summary of the passage of the job within the GRID.
                                                (only present for GRID jobs)
  gbs_my_first_task_job_my_first_job_1.log    The GLF - GBS Log File
                                                (format: gbs_<task-name>_<job-name>_<try_num>.log)
                      
The GLF is the cornerstone to fault handling. GBS helpfully displays its contents:-
  2007-11-11 18:54:07 INFO GBS_JOB_SUBMIT submitting job
That's written when your Job asked Ganga to run the job.
  2007-11-11 18:54:07 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.sh
Your Job sent along a little wrapper script to get things ready for your application script and this line says that it has started.
  2007-11-11 18:54:07 INFO GBS_JOB_WRAPPER Terminating. User script returned 0
That's the wrapper again saying that your job exited normally and that it too is exiting. Looks good, doesn't it?
  2007-11-11 18:54:31 INFO GBS_JOB_ANALYSIS:-
      Communication Level: APPLICATION [Application failed to record SUCCEEDED, FAILED, HOLD or RETRY]
      Ganga Exit Status: 'completed' Recorded job interval:0.0mins
      Appl. Job Status Code: UNKNOWN []
      Failure category: EARLY
      Judgement: Status Code:RETRY  [] Retry Args:''
This last part gets written when your Job receives control back and analyses what went on. Its fault recovery system works on a positive signal concept: it's not enough that the job looks O.K.; the application script, i.e. your script, has to positively tell it that it is O.K.

The analysis organises failures at different levels. If you want to see the gory details look at:- Detailed Design: Error Recovery: Job Interface. If you do, you will see that the Communication Level APPLICATION means that the GLF exists with job wrapper start and end lines, but either the application ends with a non-zero code or it fails to write one of SUCCEEDED, FAILED, HOLD or RETRY to the GLF.

Your script exited O.K. but didn't report back. How does it do that?

Take a look at your stdout. It should look like:-

  Hello World! (what else?)
  Here is my GBS environment:-
  GBS_HOME=/tmp/tmp-3wLUy
  GBS_RETRY_COUNT=0
  GBS_MODE=Test
  GBS_LOG_FILE=/tmp/tmp-3wLUy/gbs_my_first_task_job_my_first_job_1.log
  GBS_LOG=/tmp/tmp-3wLUy/gbs_logger.sh
The wrapper has set up an environment for you and it includes a little logger script ($GBS_LOG) that will write to the GLF (called $GBS_LOG_FILE).

Take your script file and modify it:-

  echo "Hello World! (what else?)"
  echo "Here is my GBS environment:-"
  env | grep GBS_
  $GBS_LOG INFO Everything looks O.K.  
  $GBS_LOG SUCCEEDED my_data_file_1  my_data_file_2
These last two lines get written, timestamped, to the log file. The INFO line isn't essential, but is a handy way to record general information; the second reassures GBS that the job really is O.K. Caution: The SUCCEEDED (and HOLD, RETRY and FAILED) lines all require some string after them or they will not be recognised; this is the data your script is passing back, although its form is completely arbitrary.
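
The positive signal check can be pictured with a short python sketch. It assumes that every line written through $GBS_LOG has the form "<date> <time> <KEYWORD> <data>", as in the GLF listings in this manual, and that the first status line found is the one that counts. The names find_application_status and STATUS_LINE are hypothetical, and the real analysis inside GBS is more involved.

```python
import re

# Timestamp, one of the four status keywords, then at least one
# non-blank character of data (a keyword with no data is not recognised).
STATUS_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"(SUCCEEDED|FAILED|HOLD|RETRY)\s+(\S.*)$"
)


def find_application_status(glf_text):
    """Return (keyword, data) from the first status line, or None."""
    for line in glf_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            return match.group(1), match.group(2)
    return None  # no positive signal: the application never reported back


glf = """\
2007-11-11 19:43:42 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.sh
2007-11-11 19:43:42 INFO Everything looks O.K.
2007-11-11 19:43:42 SUCCEEDED my_data_file_1 my_data_file_2
2007-11-11 19:43:42 INFO GBS_JOB_WRAPPER Terminating. User script returned 0
"""
print(find_application_status(glf))
```

A GLF holding only INFO lines, like the one from the first try, gives None, which corresponds to the "Application failed to record" judgement.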

Don't forget to hand your revised script to your task:-

  task.SetScriptFileName("/home/west/work/minos/temp/my_first_script.sh")
and try running again, waiting a few moments, updating the status and checking
  job.Submit()
  task.UpdateJobsStatus()
  task.ListJobs()
This time your job should work and you should see the line:-
  job_my_first_job    SUCCEEDED    [my_data_file_1 my_data_file_2]  
The information after the keyword "SUCCEEDED" is recorded as "Status Details". The convention here is to record the names of the output files. In some future incarnation GBS may be able to connect Tasks together and, by using this convention, it would be possible to hand the output from one job to another.

Take a look at your job again:-

  job
This time the GLF contains:-
  2007-11-11 19:43:42 INFO GBS_JOB_SUBMIT submitting job
  2007-11-11 19:43:42 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.sh
  2007-11-11 19:43:42 INFO Everything looks O.K.
  2007-11-11 19:43:42 SUCCEEDED my_data_file_1 my_data_file_2
  2007-11-11 19:43:42 INFO GBS_JOB_WRAPPER Terminating. User script returned 0
  2007-11-11 19:43:53 INFO GBS_JOB_ANALYSIS:-
      Communication Level: USER [Achieved communication with application]
      Ganga Exit Status: 'completed' Recorded job interval:0.0mins
      Appl. Job Status Code: SUCCEEDED [my_data_file_1 my_data_file_2]
      Failure category: NONE
      Judgement: Status Code:SUCCEEDED  [my_data_file_1 my_data_file_2] Retry Args:''
and tells you that your Job has managed to communicate on the USER level, i.e. with your script. Further, your script signalled back SUCCEEDED and that made GBS happy, so it marked your job down as finished. Attempting to:-
  job.Submit()
will only result in your being told:-
  Cannot submit job job_my_first_job: Status: SUCCEEDED [my_data_file_1 my_data_file_2]
It's not much of an error recovery system if all it can handle is success, so what else can it do? Well it can also handle RETRY. To demonstrate this, take another look at your stdout files from your first two tries and in particular the GBS environment:-
  in try_001/stdout: GBS_RETRY_COUNT=0
  in try_002/stdout: GBS_RETRY_COUNT=1
You can use this to make your script a bit more cantankerous:-
  echo "Hello World! (what else?)"
  echo "Here is my GBS environment:-"
  env | grep GBS_
  if     [ $GBS_RETRY_COUNT  = 0  ] ; then $GBS_LOG RETRY 1 abc
  elif   [ $GBS_RETRY_COUNT  = 1  ] ; then $GBS_LOG SUCCEEDED my_data_file_1  my_data_file_2
  fi
So it will fail the first time but succeed the second. Note also that the RETRY has some (odd looking) data with it.

To do this you will again have to pass the script in and then create a second job, as the first one is finished, and run that:-

  task.SetScriptFileName("/home/west/work/minos/temp/my_first_script.sh")
  job = task.AddJob("job_my_second_job")
  job.Submit()

  ( wait a few seconds )

  job.UpdateStatus()
  task.ListJobs()
You will see it signal retry and that its data [1 abc] is displayed. Now look at its GLF:-
  2007-11-11 21:57:25 INFO GBS_JOB_SUBMIT submitting job
  2007-11-11 21:57:25 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.sh
  2007-11-11 21:57:25 RETRY 1 abc
  2007-11-11 21:57:25 INFO GBS_JOB_WRAPPER Terminating. User script returned 0
  2007-11-11 21:57:30 INFO GBS_JOB_ANALYSIS:-
      Communication Level: USER [Achieved communication with application]
      Ganga Exit Status: 'completed' Recorded job interval:0.0mins
      Appl. Job Status Code: RETRY [1 abc]
      Failure category: EARLY
      Judgement: Status Code:RETRY  [1 abc] Retry Args:'1 abc'
  
This time it reaches USER communication and your application signals retry. The judgement is retry with retry args '1 abc'.

Run again and now the contrived job succeeds. This time look at its stdout:-

  GBS_NUM_RETRY_ARGS=2
  GBS_RETRY_ARG_1=1
  GBS_RETRY_ARG_2=abc
As you can see, the data you passed back from your script is returned to you for the next try along with a count of the number of args.
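
If the application, or a helper it calls, is written in python, the same values can be collected from the environment. The sketch below assumes only the variable names shown above, and that GBS_NUM_RETRY_ARGS is simply absent on a first try, as the earlier stdout listing suggests. The function name read_retry_args is hypothetical.

```python
import os


def read_retry_args(environ=None):
    """Return the retry arguments passed back by the previous try."""
    environ = os.environ if environ is None else environ
    # Assumption: the count is missing (treated as 0) when there was no retry data.
    count = int(environ.get("GBS_NUM_RETRY_ARGS", "0"))
    return [environ["GBS_RETRY_ARG_%d" % i] for i in range(1, count + 1)]


# Environment as shown in the stdout listing of the second try.
example = {
    "GBS_NUM_RETRY_ARGS": "2",
    "GBS_RETRY_ARG_1": "1",
    "GBS_RETRY_ARG_2": "abc",
}
print(read_retry_args(example))  # ['1', 'abc']
```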

Here then is the central concept of error recovery: a situation arises that the application script can identify and is unable to rectify, but might if allowed to start again later. Situations in which this strategy could apply can arise early, in the middle or late in the script's execution:-

  1. Early: The script checks and finds that the software disk isn't available. So it just retries again later.

  2. Middle: An MC SegVs and trying another seed might help. The script requests a new seed. In this case this will need the co-operation of the GBS Job, to understand and respond to the signal. This means adding code, which is why GBS is designed to be extensible.

  3. Late: The script finds that the SE it is to write to is unavailable but can find another and writes the file there. It passes back information that it itself can then recognise, telling where the file is and where it is to be moved to.

As you have already seen, GBS has the concept of a communication level and this is to allow it to do sensible things when the failure prevents it communicating directly with the script.

Our job has a single step but it isn't hard to extend it to multiple steps and have the script signal back which step it failed at. If you look at:-

  $GBS_HOME/scripts/run_gbs_job.sh
you will see a ready to run example that can be used as the basis for multi-step jobs.

By now you should have understood enough to be able to read through:- Error Recovery if you want to, and understand how GBS classifies other errors. To keep it simple, for now all you need to understand is that GBS classifies all failures into 3 types based on the job duration and whether or not it managed to communicate with your script:-

For each of these types of failures there is a configuration option:-

  MaxRetryEarlyFails
  MaxRetryLateFailsUnandled
  MaxRetryLateFailsHandled
Take a look at your job:-
  job
and you see what its current failure counters are:-
  Early Fails: 1, Late Handled Fails: 0, Late Unhandled Fails: 0
Of course all are early failures, they could hardly be otherwise given the script.
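
As a rough python sketch of that three-way split: the rule below assumes that a failure counts as early when the recorded job interval does not exceed MaxTimeEarlyFails, and that a late failure is "handled" when GBS managed to communicate with the script. The exact boundaries, the function name classify_failure and the labels other than EARLY are assumptions made for illustration; Error Recovery describes the real rules.

```python
def classify_failure(duration_mins, talked_to_script, max_time_early_fails=5):
    """Return a failure category for one failed try."""
    # Assumption: duration is tested first, so even a script that reports
    # RETRY itself is an early failure if it ran for only a short time.
    if duration_mins <= max_time_early_fails:
        return "EARLY"
    # Hypothetical labels for the two kinds of late failure.
    return "LATE_HANDLED" if talked_to_script else "LATE_UNHANDLED"


print(classify_failure(0.0, True))     # EARLY
print(classify_failure(45.0, True))    # LATE_HANDLED
print(classify_failure(45.0, False))   # LATE_UNHANDLED
```

Each category would then be counted against its own limit (MaxRetryEarlyFails, MaxRetryLateFailsHandled or MaxRetryLateFailsUnandled).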

Logging

GBS logs messages both to stdout (terminal) and to a set of log files (permanent) stored in a configurable directory. By default the log files have the name:-
  gbs_global_<current-date>.log
although for any specific task you can configure its output to be written to:-
  gbs_<task-name>_<current-date>.log
Messages are classified according to severity and the threshold for the terminal and permanent can be changed independently e.g.:-
  SetLoggerThreshold(logger.SYNOPSIS,"Terminal")
  SetLoggerThreshold(logger.INFO,"Permanent")
If you only supply a threshold, it applies to terminal output. The permanent threshold should never be set higher than INFO for otherwise it will not record normal job submission and retrieval messages.

You can also check the current level e.g.:-

  GetLoggerThreshold("Terminal")
  GetLoggerThreshold("Permanent")
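
The effect of a threshold can be sketched in python using the level numbers from the configuration table. The sketch assumes the usual convention that a message is kept when its level is at least the threshold; is_logged and LEVELS are hypothetical names, not part of the GBS logger.

```python
# Numeric severities as listed in the configuration table.
LEVELS = {"DEBUG": 1, "SYNOPSIS": 2, "INFO": 3,
          "WARNING": 4, "ERROR": 5, "FATAL": 6}


def is_logged(message_level, threshold):
    """True if a message at message_level passes the given threshold."""
    # Assumption: a message is kept when its severity is at least the threshold.
    return LEVELS[message_level] >= LEVELS[threshold]


print(is_logged("INFO", "INFO"))       # True
print(is_logged("INFO", "WARNING"))    # False: normal messages would be lost
print(is_logged("SYNOPSIS", "INFO"))   # False
```

This is why a permanent threshold above INFO would drop the normal job submission and retrieval messages.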

Batch Production: Script args, ProtoJobs, Submitting jobs, Back-ends

So far we have dealt with individual jobs, but this is a task based system where a task is defined to be:-

"A set of jobs all running the same script but with different inputs".

The system allows you to have one set of arguments that are global to all jobs and a second set that is local to the current job. For example if you:-

  task.SetScriptGlobalArgs("a string with spaces,123")
  job.SetScriptLocalArgs("(this is an array),456")
Then when your script runs it will be given the arguments:-
  "a string with spaces" "123" "(this is an array)" "456"
so pass in your arguments as comma separated lists; all spaces are significant. For you trouble makers who want to pass in a string that includes a comma, just escape it with a '\' e.g.:-
  task.SetScriptGlobalArgs("a string with spaces\, commas and other odd stuff e.g. \";!,123")
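
The splitting rule can be sketched in python as follows. This is a minimal illustration that assumes a comma is a separator unless it is directly preceded by a backslash; split_script_args is a hypothetical helper, and corner cases such as an escaped backslash are not considered.

```python
import re


def split_script_args(arg_string):
    """Split a comma separated argument string, honouring '\\,' escapes."""
    # Split on commas that are NOT preceded by a backslash ...
    parts = re.split(r"(?<!\\),", arg_string)
    # ... then turn each escaped comma back into a plain one.
    return [part.replace("\\,", ",") for part in parts]


global_args = split_script_args("a string with spaces,123")
local_args = split_script_args("(this is an array),456")
print(global_args + local_args)
# ['a string with spaces', '123', '(this is an array)', '456']
```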

An alternative way to provide different inputs is to provide a different environment to each job. In a way analogous to setting script arguments the system allows you to set one environment that is global to all jobs and a second set that is local to the current job. For example if you:-

  task.SetGlobalEnvironment('config=L010185_near_bhcurv,daikon_ver=daikon_04,mini_flux=no')
  job.SetLocalEnvironment('run=1001,subrun=1')
Then when your script runs it will be given the environment:-
  config=L010185_near_bhcurv
  daikon_ver=daikon_04
  mini_flux=no
  run=1001
  subrun=1
GBS does some basic syntax checking on the environment and objects if it does not consist of a comma separated list of key=value pairs. It also removes duplicates, in favour of the later entry, and sorts into alphabetical order.

You can incrementally establish an environment by prefixing a '+' to the start of your string. This allows you both to add new items and replace existing ones. For example:-

  task.SetGlobalEnvironment('+mini_flux=yes,REROOT=0')
changes mini_flux and adds REROOT.
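
The set, replace and sort behaviour can be sketched in python. The helper names parse_environment and set_environment are hypothetical, and the sketch uses python's default (case sensitive) sort, which may not be the ordering GBS itself applies to mixed-case keys.

```python
def parse_environment(env_string):
    """Turn 'a=1,b=2' into a dict; later duplicates replace earlier ones."""
    pairs = {}
    for item in env_string.split(","):
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("not a key=value pair: %r" % item)
        pairs[key] = value
    return pairs


def set_environment(current_string, new_string):
    """Return the environment string after applying new_string."""
    if new_string.startswith("+"):
        # Incremental form: start from what is already there.
        merged = parse_environment(current_string) if current_string else {}
        merged.update(parse_environment(new_string[1:]))
    else:
        merged = parse_environment(new_string)
    return ",".join("%s=%s" % (key, merged[key]) for key in sorted(merged))


env = set_environment("", "config=L010185_near_bhcurv,daikon_ver=daikon_04,mini_flux=no")
env = set_environment(env, "+mini_flux=yes,REROOT=0")
print(env)
```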

If you just

  task.GetGlobalEnvironment()
  job.GetLocalEnvironment()
you will see the environment string but these methods take an optional argument:-
  prettyPrint = False
which lists the environment one entry per line e.g.
  task.GetGlobalEnvironment(True)

You can also set up the local arguments and local environment when you create a job:-

  job = task.AddJob("my_job_with_args_and_env","123,456","run=1001,subrun=1")
As always you can inspect your task and job to see what these are set to.

Clearly it would be extremely tedious, not to say error prone, to create hundreds of jobs individually, but hey, this is python, so you can create them in seconds:-

  for job_no in range(1,101):
    task.AddJob(str(job_no).zfill(8),str(job_no))
      <--- blank line here
If you want to try that, and you are new to python, then make sure you indent the second line a few spaces and then end with a blank line. You will also need to know:-

Right, that's just made 100 jobs and given them all different arguments. Easy, wasn't it? Trouble is, it was too easy! Picture this situation: you already have a flock of several hundred jobs browsing away on the GRID hillsides and you find you need to add some more, perhaps because some dataset has just increased in size. So you write a little python script to add them, but get it a teeny bit wrong and it goes mad creating millions of jobs before you can hit ^C and stop it! Now you have to exit and go into the directory structure GBS uses and clean up the mess.

GBS offers a safer alternative: ProtoJobs. They are made just like Jobs, with a name, an argument list and environment. They will be listed along with your other jobs.

  for job_no in range(1,101):
    task.AddProtoJob(str(job_no).zfill(8),str(job_no),"run=" + str(job_no))

  task.ListJobs()  
but are not written to disk. Exit from GBS or type
  task.RemoveProtoJobs()  
and they are history.

So you can check and only when you are sure that you want to keep them:-

  task.PromoteProtoJobs()  
Incidentally, the following 3 Task methods actually take a pattern match argument
  ListJobs(job_name_pattern = ".*")
  PromoteProtoJobs(job_name_pattern = ".*")
  RemoveProtoJobs(job_name_pattern = ".*")
The pattern ".*" means any number of any character, but if you know about pattern matching you can be more selective about what you list, promote and remove.
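
For readers new to pattern matching, here is a small python sketch of how such a pattern selects job names. Whether GBS anchors the pattern at the start of the name or searches anywhere within it is not stated here, so the sketch assumes a search anywhere in the name; select_jobs is a hypothetical helper.

```python
import re


def select_jobs(job_names, job_name_pattern=".*"):
    """Return the names that the regular expression matches."""
    regex = re.compile(job_name_pattern)
    # Assumption: the pattern may match anywhere in the name.
    return [name for name in job_names if regex.search(name)]


names = ["job_00000001", "job_00000010", "job_00000100"]
print(select_jobs(names))                           # all three
print(select_jobs(names, r"1$"))                    # ['job_00000001']
print(select_jobs(names, r"job_000000[0-9]{2}$"))   # the first two
```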

So that covers block creation for now, but for production work there also has to be block submission. That is done using the call:-

  task.SubmitJobs()
but try that now and you will be told:-
  Sorry, cannot submit; submit not enabled.
Although that's easy to fix:-
  task.EnableSubmit()
you need to understand the GBS submit philosophy. Instead of submitting every job it can, it works on the principle: submit few, submit often. The principal use of SubmitJobs is with a cron job that gets run frequently. On each call to SubmitJobs() GBS launches a small number of jobs, preferring retries over new jobs, until a maximum number have been submitted to Ganga. Thereafter it won't submit any until some are returned. This is to avoid situations, which we have experienced in the past, of catastrophic system failure after which every job fails as soon as it runs. In such cases it doesn't help if GBS just launches them all again only to have them fail as soon as they start to execute. Instead only a few get launched and fail on each round, but once things get fixed, the job levels will start to rise again.

The number of jobs your task launches with each submit, and the maximum number, are both configuration options, namely: DefaultMaxSubmitJobs and DefaultMaxGangaJobs.

Take a look at your task and you will see the current values:-

  Limits:  10(single submit) 100(maximum)
that can be changed with the Task methods:-
  SetMaxSubmitJobs(n)
  SetMaxGangaJobs(n)
The system can be disabled entirely with:-
  task.EnableSubmit(False)
so if you know that the farm is going to be down you can suspend operations. In fact it doesn't entirely disable the system, as the first thing SubmitJobs() does is to call UpdateJobsStatus() and it will continue to do this to check running jobs (the farm could be draining, i.e. continuing to run existing jobs but refusing new ones).

You can override the MaxSubmit limit, but not the MaxGanga limit by passing in the number to submit e.g.:-

  task.SubmitJobs(100)
which is handy on a Friday evening if there is a risk that the GRID proxy your cron job uses will expire before you can get back to the terminal and refresh it.
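
The "submit few, submit often" rule can be sketched in python. This is an illustration of the selection logic only, under the assumption that retries are simply placed ahead of new jobs; choose_jobs_to_submit and its arguments are hypothetical names, not the GBS implementation.

```python
def choose_jobs_to_submit(retry_jobs, new_jobs, num_in_ganga,
                          max_submit_jobs=10, max_ganga_jobs=100):
    """Pick the jobs to launch on one SubmitJobs()-style call."""
    # Never exceed the ceiling on jobs held by Ganga at any one time.
    room = max(0, max_ganga_jobs - num_in_ganga)
    quota = min(max_submit_jobs, room)
    # Retries go to the front of the queue, ahead of new jobs.
    candidates = list(retry_jobs) + list(new_jobs)
    return candidates[:quota]


retries = ["job_00000003", "job_00000007"]
fresh = ["job_%08d" % n for n in range(10, 30)]
print(len(choose_jobs_to_submit(retries, fresh, num_in_ganga=0)))    # 10
print(len(choose_jobs_to_submit(retries, fresh, num_in_ganga=96)))   # 4
print(len(choose_jobs_to_submit(retries, fresh, num_in_ganga=100)))  # 0
```

Passing a larger max_submit_jobs plays the part of task.SubmitJobs(100): the per-call limit is lifted but the Ganga ceiling still applies.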

Finally for this section, through the magic of Ganga you can switch from running jobs as child processes on your machine to running on machines on the outermost rim of the known GRID, well RAL anyway. When you create a new Task its back-end is set to the configuration option DefaultBackend, but just by flipping a single switch:-

  task.SetBackend("LCG:lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M")
you are on the GRID, but don't forget that you will need to set up a GRID proxy before you can run on the GRID.

Possible values for backend are:-

  Local          - local machine
  PBS:{queue}    - local qsub. Can optionally specify queue
  LCG:{queue}    - GRID.  Can optionally specify queue

As you might expect, submitting jobs to the GRID introduces an extra delay in job turnaround and even short jobs submitted to a quick queue take time. Even so, when testing it is sometimes useful to wait for the job to end, and you can do this for an individual job using:-

  job.WaitForJob()
It won't wait for ever. The full argument list for this method is
  job.WaitForJob(num_tries=100,time_interval=30)
so by default it will try 100 times, sleeping 30 seconds between attempts. You could instead do e.g.:-
  job.WaitForJob(time_interval=60)
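
The waiting behaviour amounts to a bounded polling loop, sketched here in python. Only the parameter names num_tries and time_interval come from the method above; wait_for, is_finished and the injectable sleep argument are hypothetical, added so the loop can be tried without really waiting.

```python
import time


def wait_for(is_finished, num_tries=100, time_interval=30, sleep=time.sleep):
    """Poll is_finished() up to num_tries times; True if it ever succeeds."""
    for attempt in range(num_tries):
        if is_finished():
            return True
        if attempt < num_tries - 1:
            sleep(time_interval)  # pause only between attempts
    return False


# Demonstration with a fake job that finishes on the third check and a
# fake sleep that just records the requested pauses instead of waiting.
checks = iter([False, False, True])
pauses = []
print(wait_for(lambda: next(checks), time_interval=60, sleep=pauses.append))  # True
print(pauses)  # [60, 60]
```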

Input and Output Sandboxes

Although the main input and output data files from a job are read from and written back to Storage Elements, it is frequently necessary to associate small (typically ~ a few KB) files with a job. For example an executable that runs as part of a job may require its own input file and may produce its own output log. GRID submission middleware provides input and output sandboxes that are used to transport such files.

GBS supports both global and local sandboxes. For example:-

  task.SetGlobalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
adds the input sandbox files my_input_data.dat and my_script.sh to all its jobs. Copies are taken of all these files so there is no need to retain the supplied files once they have been passed to the Task. All .csh and .sh sandbox input files are made executable.

In a similar way:-

  task.SetGlobalOutputSandbox('my_output_data.dat,my_output.log')
requests that the files my_output_data.dat and my_output.log are returned in the sandbox for every job. In this case file names must be supplied simply as file names i.e. without any directory names.
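
The bare file name rule can be checked in python before a list is handed to the Task. This is a minimal sketch: check_output_sandbox is a hypothetical helper, and how GBS itself reacts to an entry with a directory part is not shown here.

```python
import os.path


def check_output_sandbox(file_list):
    """Split a comma separated list and reject names with directory parts."""
    names = file_list.split(",")
    # A bare file name is unchanged by basename(); empty entries are rejected too.
    bad = [name for name in names if not name or os.path.basename(name) != name]
    if bad:
        raise ValueError("output sandbox entries must be bare file names: %r" % bad)
    return names


print(check_output_sandbox("my_output_data.dat,my_output.log"))
# ['my_output_data.dat', 'my_output.log']
```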

There are corresponding Job methods that are local to an individual job e.g.:-

  job.SetLocalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
  job.SetLocalOutputSandbox('my_output_data.dat,my_output.log')
that work in exactly the same way.

It's not unusual that the names of the output sandbox files are specific to the job. GBS provides two alternative ways to deal with this situation:-

ProtoJobs have the same methods:-

  pjob.SetLocalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
  pjob.SetLocalOutputSandbox('my_output_data.dat,my_output.log')
but in this case there is one difference: SetLocalInputSandbox does not take copies of the input files. That only happens if the ProtoJobs get promoted so the files must be retained until then.

For every "Set" method there is a corresponding "Get" method e.g.:-

  task.GetGlobalInputSandbox()
  job.GetLocalInputSandbox()
  pjob.GetLocalOutputSandbox()

Test and Production Modes

The central concept of GBS is the ability to run the same script with a range of inputs, but so far there has been nothing to stop you running a few jobs and then changing the script file and running some more. If you really want to do that then that's fine, but for serious production it's much more likely that, having set up a system and run a few test jobs, you will want to freeze those methods that apply globally to all jobs and switch to production.

GBS supports this way of working. When a Task is first created you will see that its state includes:-

    Mode:    Test
While it remains in this mode all Task methods are enabled. However by typing:-
    task.SetMode('Production')
some become disabled. You will be warned as to which will be and asked for confirmation before the change is accepted. You can also see which will be disabled by typing:-
    task.GetTestOnlyMethods()
All Jobs submitted include in their environment the variable:-
    GBS_MODE
which takes the current setting of the Task mode. So you can use this in your application scripts to do different things in test mode, for example only run short jobs and write output files to scratch areas.
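
As a sketch of how an application might act on GBS_MODE, the function below chooses settings from the environment. Only the variable name GBS_MODE and the values 'Test' and 'Production' come from this manual; the setting names, the event limit and the fallback to 'Test' when the variable is missing are illustrative assumptions.

```python
import os

def job_settings(environ=os.environ):
    """Pick run settings from GBS_MODE; the setting names are hypothetical."""
    # Assumption: treat a missing variable as Test mode.
    mode = environ.get("GBS_MODE", "Test")
    if mode == "Production":
        return {"max_events": None, "output_dir": "production"}
    # Test mode: keep the job short and write to a scratch area.
    return {"max_events": 100, "output_dir": "scratch"}
```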

If it becomes necessary, you can

    task.SetMode('Test')
to re-enable all methods, but naturally this exposes you to the risk that jobs are run with different sets of conditions.

Manipulating (holding, releasing etc.) Jobs

A job that is ready to be run can be held:-
  job.Hold()
and then cannot be submitted until it is released:-
  job.Release()
This can be useful if you want to prepare jobs ahead of time but don't want GBS to submit them until you are ready.

You can clear the error counts that a job has accumulated with:-

  job.ClearErrorCounts()
which also has the side effect of changing its status to RETRY if it was FAILED. This does not wipe the history of the job and when next submitted it will use the retry args that came from the previous attempt.

You can go further with:-

  job.ClearHistory()
which resets the job to its initial processing state, even if it was SUCCEEDED. The only process state retained is that if the job was held it will remain so. As this method wipes all job output it asks for confirmation before proceeding.

There is also the nuclear option:-

  job.Remove()
which removes the job entirely, even if it was SUCCEEDED. Not surprisingly it asks for confirmation before proceeding!

You can kill a job that has been submitted to Ganga:-

  job.Kill()
this will kill the Ganga job and place your Job on HOLD so it will need to be released before it can be resubmitted.

You can also manipulate groups of jobs a task owns using the Task methods:-

  ClearErrorCountsJobs(job_name_pattern= ".*")
  ClearHistoryJobs(job_name_pattern= ".*")
  HoldJobs(job_name_pattern= ".*")
  KillJobs(job_name_pattern= ".*")
  ReleaseJobs(job_name_pattern= ".*")
  RemoveJobs(job_name_pattern= ".*")
All work in the same way: they take a pattern, with a default that matches every job, and then collect a list of all jobs that match this pattern and for which the action is appropriate. If the list is not empty the list is shown to you and you are asked for confirmation before the action is applied.
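
The default ".*" suggests that the patterns are regular expressions. Assuming that, and assuming matching is anchored at the start of the job name (neither detail is confirmed here), the selection step can be pictured as follows. The function select_jobs and the sample names are hypothetical.

```python
import re

def select_jobs(job_names, job_name_pattern=".*"):
    """Return the names that match the pattern (anchored at the start)."""
    regex = re.compile(job_name_pattern)
    return [name for name in job_names if regex.match(name)]

names = ["job_00001001_0001", "job_00001001_0002", "job_00002000_0001"]
run_1001 = select_jobs(names, "job_00001001_.*")
```

Here run_1001 holds the first two names, which is how a pattern can restrict an action to the jobs of a single run.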

Job Perusal

As explained in Job Advice: Can I see what my job is doing or why it failed? jobs that fail at the batch level do not return any output. A feature called Job Perusal can be enabled to retrieve selected files in such cases as a debug measure. Ganga supports Job Perusal and, to have GBS activate it, the Job Submit method has a single optional argument:-
  Submit(Perusable=False)
To submit with the feature enabled:-
  job.Submit(True)
As explained in Reference: glite-wms-job-perusal Job Perusal is expensive which is why there is no way to permanently enable it at either the Job or the Task level; you must request it explicitly every time you need it.

Ganga only enables perusal of the stdout, so to make best use of the feature you will want to merge stdout and stderr in your application script. Then, when your job is running, use the Ganga job peek method:-

  job.GetGangaJob().peek('stdout','cat')
If the job doesn't currently have an associated Ganga job then this will fail:-
  AttributeError: 'NoneType' object has no attribute 'peek'
When the Ganga job has terminated and the job's UpdateStatus() method is called it will attempt to get a copy of the final version of stdout and store it as
  stdout.perusable
in the job try output directory. So it's worth submitting with the perusable option even if you won't be around to look at the file while it is running.

Aside: If you run Ganga directly, enable job perusal with:-

  job.backend.perusable = True
When using perusal, it is often useful to enable Job Monitoring. See the next section.

Job Monitoring

As explained above in Job Perusal jobs that fail at the batch level do not return any output and then the first recourse is to use Job Perusal. However, batch failures typically occur because some resource such as CPU or memory has exceeded the limit set by the batch queue, and then even having the stdout file may not be enough to diagnose the problem: what resource limit was exceeded and what was happening when the limit was reached?

To help debug such situations GBS can, along with your application script, run a second process that at intervals runs a command and outputs the results to stdout. The Job Submit method actually has 3 optional arguments:-

  Submit(Perusable=False,
         MonitorFrequency=0,
         MonitorCommand="ps -o pid,ppid,rss,vsize,pcpu,pmem,cmd -u $USER")
By default monitoring is switched off, but if MonitorFrequency is set to some positive number N then every N seconds the monitor will run MonitorCommand, which by default runs ps and reports on memory and CPU usage. Normally monitoring is used in conjunction with perusal, but that's not a requirement.

You can write your own monitoring script and submit it in the input sandbox and then select it e.g.

  MonitorCommand="$GBS_HOME/my_monitoring_script.sh"
Note that GBS makes all sandbox .csh and .sh files executable.
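
The idea of a periodic monitor can be sketched in a few lines. This is not how GBS implements it: the function name run_monitor is hypothetical, and the sketch takes a fixed number of samples, whereas the real monitor runs alongside the application script.

```python
import subprocess
import time

def run_monitor(command, frequency, samples, sleep=time.sleep):
    """Run command through the shell samples times, frequency seconds apart.

    Returns the captured stdout of each run. A frequency of 0 (the
    documented default) means monitoring is off and nothing is run.
    """
    if frequency <= 0:
        return []
    outputs = []
    for i in range(samples):
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        outputs.append(result.stdout)
        # Wait between samples, but not after the last one.
        if i < samples - 1:
            sleep(frequency)
    return outputs
```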

Cron jobs

Cron scripts

As explained in Batch Production the Task.SubmitJobs() method is really meant to run in a regular cron job to maintain a steady level of jobs on the GRID. To run like this you need two scripts:-

You may want to keep an eye on how your task is progressing and you can do that without having to run Ganga with GBS. Another Task method we have not yet introduced (but which is used in the above run_gbs_cron.py) is:-
  task.WriteHtmlReport(dir)
where
  dir  Is a directory GBS can write to
In that directory GBS will create:-
  <task-name>.html  Top level index
  <task-name>/      Directory holding job and Ganga job data
The document produced allows you to get an overview of the task and to look at individual jobs, and their associated Ganga jobs (if any). For details of the layout see WriteHtmlReport(task_dir).

Cron frequency

Submitting jobs, checking on their status, and retrieving output all take time and this will in part determine a sensible cron frequency. From a few ad hoc trials at Oxford to RAL Tier1:-

  Activity            Rate (jobs/minute)
  Submitting          10
  Status checking     40
  Output retrieval    25

In principle these rates could be improved a lot if the batch features of gLite/WMS were exploited, but at the time of writing GBS does not use them.

For now at least, if maintaining a high volume of very long running jobs then it's the status checking that will take the time. For example with 1000 concurrent jobs it will take about 25 minutes to check them all. The script file example run_gbs_cron.py actually causes the status to be checked twice, first when submitting and again when updating the status, so that brings the time to run to close to an hour. The cron could run a little more frequently than once an hour if the launcher script follows the run_gbs_cron.sh example and quits if the last run is still in progress, but there is no point in running much more frequently.

At the other end of the spectrum, if submitting many very short jobs to the short queue where they turn round very fast, it is the time to submit and retrieve that will count. If for example you limit the number submitted to 10 then in a steady state that will take less than 2 minutes and the cron could run say every 10 minutes.
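
The arithmetic behind these two estimates can be written out as below. The rates are the ad hoc figures quoted above; the function name and its parameters are hypothetical.

```python
# Rates in jobs per minute, taken from the ad hoc trials quoted above.
RATES = {"submit": 10.0, "status": 40.0, "retrieve": 25.0}

def minutes_needed(n_submit=0, n_status=0, n_retrieve=0, status_passes=1):
    """Estimate the minutes one cron pass spends talking to the GRID."""
    return (n_submit / RATES["submit"]
            + status_passes * n_status / RATES["status"]
            + n_retrieve / RATES["retrieve"])

# 1000 long running jobs, status checked twice per pass: 50 minutes.
long_jobs = minutes_needed(n_status=1000, status_passes=2)
# 10 short jobs submitted and 10 retrieved per pass: about 1.4 minutes.
short_jobs = minutes_needed(n_submit=10, n_retrieve=10)
```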

Potential problems

We don't have much experience yet running cron jobs but there are at least two potential problems:-

MINOS Extensions

In this section you can find out about MINOS extensions to GBS.

DCMquery

You start as before.
  man = GetManager()
but this time, when you create a Task, you pass in a second argument:-
  task = man.AddTask("My_DCM_Query_task","DCMquery")
You are not taking the default model but one tailored to the DCM query. If you look at the task there is one new member:-
  DCM Query: ''
You can set that to a DCM query using the SetDCMQuery method, and you may find python's triple quote string useful. [New python users, it's basically """ anything you like here """].
  task.SetDCMQuery("""[     "run_type physics% and data_tier sntp-near
                 and physical_datastream_name spill%
                 and start_time < to_date('2006-02-18','yyyy-mm-dd')
                 and end_time   > to_date('2006-02-17','yyyy-mm-dd')
                 and version cedar" ]""")
Now when you do
  task.AddProtoJob()
it goes off, executes the DCM query and then tries to add ProtoJobs where the job name is formed from the run and subrun number and the single local script arg is the DCM URL. So if you want to run over a set of data files that can be returned as a DCM query you only have to add the script file and you are all set. If the DCM query is actually a SAM query then the set may change over time and you can repeat the AddProtoJob() at any time, safe in the knowledge that only new entries will form ProtoJobs. The Task also warns you if you have jobs that are not in the query and then you will have to investigate why that is.

RSMonteCarlo

You start as before.
  man = GetManager()
but this time, when you create a Task, you pass in a second argument:-
  task = man.AddTask("My_DCM_Query_task","RSMonteCarlo")
which selects the Run Seeded Monte Carlo (RSMonteCarlo) model, which is one in which the Monte Carlo random number seed is determined by the run and subrun numbers. Job names are of the form:-
  job_rrrrrrrr_ssss

  where

    rrrrrrrr  is an 8 digit zero padded run number
    ssss      is a 4 digit zero padded subrun number
The local job environment always includes:-
  run=<run-number>
  subrun=<subrun-number>
If the application script signals back RETRY with a single retry arg NEW_SEED i.e.
   $GBS_LOG RETRY NEW_SEED
then the MinosRSMJobAnalyser finds the highest subrun so far for the job's run, increments it and then renames the job to use this new subrun number.
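
The naming scheme and the NEW_SEED renaming rule can be illustrated with a short sketch. This is not the MinosRSMJobAnalyser code; the function names are hypothetical and the sketch only follows the rule as described above.

```python
def job_name(run, subrun):
    """Build a name of the form job_rrrrrrrr_ssss."""
    return "job_%08d_%04d" % (run, subrun)

def parse_job_name(name):
    """Split job_rrrrrrrr_ssss back into (run, subrun) integers."""
    _, run, subrun = name.split("_")
    return int(run), int(subrun)

def next_seed_name(name, existing_names):
    """Rename to one past the highest subrun already used for this run."""
    run, subrun = parse_job_name(name)
    # Include the job's own subrun so the list is never empty.
    subruns = [subrun] + [s for r, s in map(parse_job_name, existing_names)
                          if r == run]
    return job_name(run, max(subruns) + 1)
```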


Log File Analysis

Using the LogAnalyser

As explained in the section on Logging in the Tutorial GBS produces date stamped logs that record job submission and retrieval and subsequent analysis along with errors encountered. A LogAnalyser object is provided that can scan a set of directories and process all the log files they contain and produce an analysis for a specified audit period.

For example:-

  cd $GBS_HOME
  python
  from LogAnalyser import LogAnalyser
  LogAnalyser("2008-05-01","2008-05-29",["/home/user1/gbs_logs","/home/user2/gbs_logs","/home/user3/gbs_logs"])
would scan the 3 directories in the list for all log files and write a summary for May 2008 to the terminal.

An HTML version of the output can be written to a file by using the 'html_file' argument:-

  LogAnalyser(...,html_file="/home/user1/gbs_statistics.html")
By default, the summary is entitled "MINOS Grid Production", but this can be changed:-
  LogAnalyser(...,title="My Private Analysis")

Definition of the Audit Period

For auditing purposes the system analyses all the jobs that were submitted in the accounting period so ignores any that end within the period if they started before it and includes any that end after so long as they started within the period. This definition means that an arbitrary interval can be divided up into accounting intervals and every job submission will be counted exactly once. However, from this definition it follows that, in order to record all the jobs, it may be necessary to analyse log files for a few days after the accounting period ends. Make sure that these log files are available in the directories supplied.
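
The counting rule depends only on when a job was submitted, as the following sketch shows. The function name is hypothetical, and treating both ends of the period as inclusive calendar dates is an assumption.

```python
from datetime import date

def in_audit_period(submit_date, period_start, period_end):
    """A job counts if it was submitted within the period, whenever it ended."""
    return period_start <= submit_date <= period_end

start, end = date(2008, 5, 1), date(2008, 5, 31)
jobs = [
    ("started before, ended inside", date(2008, 4, 30)),
    ("started inside, ended after", date(2008, 5, 31)),
]
counted = [label for label, submitted in jobs
           if in_audit_period(submitted, start, end)]
```

Only the second job is counted for May. The first belongs to April, so consecutive periods count every submission exactly once.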


Reference


GBSManager

This is a singleton that gives the user access to the system. It can list active Tasks, create new ones according to different models and eventually will be able to request the destruction of old ones (the Task will refuse if there are running jobs). It can also give access to individual Tasks.

To obtain the manager object:-

  man = GetManager()
The print out of a Manager reveals the following state:-
Schema version 1
Managing 9 Tasks
The schema version is a global number that gets incremented each time the storage schema changes in any object. The mechanism is that, on start up, a Manager passes itself to a schema migrator (schema_migrator.py) which holds the current schema number. If the manager's version is out of date its WriteFamily method is called and its version number updated. See Schema Evolution.
GetSchemaVersion()
Lists the current schema version number.
ListModels()
Lists the available models on which to base tasks.
AddTask(task_name,model_name="default")
Creates a new task based on a given model e.g.
  task = man.AddTask("my_first_task")  
  task = man.AddTask("My_DCM_Query_task","DCMquery")
GetTask(name)
Gives access to an existing named task e.g.
  task = man.GetTask("my_first_task")
ListTasks()
Lists all existing tasks:-
  man.ListTasks()
The header line is:-
  The following tasks are setup up:-
  
  Name                !Ready(other)  Ready(retry)    Sub(!run)   Done(fail)   Backend  Script file                             Args  
The counts are divided into 4 job phases: Not Ready, Ready, Sub and Done, each of which shows two numbers: the total and the number within that total that are the "exception". In general the fewer exceptions the better. The definitions are as follows:-

  Group    Description                                    Exception
  !Ready   Not ready to run (normally because on HOLD)   other - anything not on HOLD
  Ready    Ready to run                                   retry - retry jobs
  Sub      Everything submitted to Ganga                  !run - anything not actually running on a farm
  Done     Anything GBS has finished with                 fail - anything that failed and needs user intervention


GBSTask

A Task is responsible for running a single application script multiple times, each with a different set of inputs. Its purpose is to shepherd all its Jobs through to successful completion. It may create more Jobs if the task grows, for example if a data set increases. It can monitor its jobs to see how many are active running Ganga jobs and decide how many passive ones should become active.

State

The print out of a Task reveals the following state:-
Model: default

Job Definition
  Script file: run_lcg.sh
  Global args: ''
  Global env:  'config =L010185_near_bhcurv,daikon_ver=daikon_04,mini_flux=no'
  Input Sandbox:     'input.dat'
  Output Sandbox:    'output.log,output.dat'

Job Submission
  Disabled
  Backend: Local
  Limits:  10(single submit) 100(maximum)
  Mode:    Production

Managing 3 jobs
  Holding:       0 (0 other)
  Waiting:       2 (0 retries)
  Submitted:     1 (1 not running)
  Done:          0 (0 failed)
It comprises:-
AsString(level = "Full")
This method can be used to return the task state as a string although
repr(task)
works just as well (it calls AsString).

Job Creation

Jobs can be created individually using AddJob, but if adding blocks using scripts you are strongly advised to use AddProtoJob. See the Tutorial section Batch Production: Script args, ProtoJobs, Submitting jobs, Back-ends.
AddJob(job_name,args_str = "",env_str = "")
Adds a new Job called 'job_name' which must be unique and must start job_ (GBS prepends this if not present). A local argument list and environment list may also be supplied for the job.
AddProtoJob(job_name,args_str="",env_str="")
Adds a new GBSProtoJob called job_name which must be unique and must start job_ (GBS prepends this if not present). A local argument list and environment list may also be supplied for the ProtoJob.
PromoteProtoJobs(job_name_pattern = ".*")
After user confirmation, promotes all GBSProtoJobs matching the supplied pattern.
RemoveProtoJobs(job_name_pattern = ".*")
Deletes all GBSProtoJobs matching the supplied pattern.

Job Setup

The Task sets up all the elements of a batch job that are global to all its Jobs.
SetScriptFileName(ext_file_spec)
Takes a copy of the supplied script file 'ext_file_spec' that will be used as the executable when creating Ganga jobs. The original file can be removed once the Task has taken a copy e.g.
  task.SetScriptFileName("/home/west/work/minos/temp/my_first_script.sh")
GetScriptFileName()
Returns the name, if any, of the current script file.
GetScriptFileSpec()
Returns the file spec (directory + name), if any, of the internal copy of the current script file.
SetScriptGlobalArgs(arg_str)
Provide a set of arguments that are global to all jobs e.g.
  task.SetScriptGlobalArgs("a string with spaces,123")
These are supplied to the application script before any defined by the individual job's SetScriptLocalArgs(arg_str)
GetScriptGlobalArgs()
Returns the global argument list, if any. These are supplied to the application script before any defined by the individual job's SetScriptLocalArgs(arg_str)
SetGlobalEnvironment(env_str)
Provide a set of environment variables that are global to all jobs e.g.
  task.SetGlobalEnvironment('config=L010185_near_bhcurv,daikon_ver=daikon_04,mini_flux=no')
These are in addition to any defined by the individual job's SetLocalEnvironment(env_str)
GetGlobalEnvironment(prettyPrint = False)
Returns the global environment list, if any. These are in addition to any defined by the individual job's SetLocalEnvironment(env_str)

If prettyPrint is True, prints environment as a list, one entry per line.

SetGlobalInputSandbox(in_sbox_str)
Set, as a comma separated list string, the input sandbox file list that is global to all jobs. Copies are taken of all these files so there is no need to retain the supplied files once they have been passed to the Task. These are supplied in addition to any defined by the individual job's SetLocalInputSandbox(in_sbox_str)
  task.SetGlobalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
GetGlobalInputSandbox()
Return, as a comma separated list string, the input sandbox file list that is global to all jobs. These are supplied in addition to any defined by the individual job's SetLocalInputSandbox(in_sbox_str)
SetGlobalOutputSandbox(out_sbox_str)
Set, as a comma separated list string, the output sandbox file list that is global to all jobs. These must be supplied simply as file names i.e. without any directory names. These are supplied in addition to any defined by the individual job's SetLocalOutputSandbox(out_sbox_str)
  task.SetGlobalOutputSandbox('my_output_data.dat,my_output.log')
GetGlobalOutputSandbox()
Return, as a comma separated list string, the output sandbox file list that is global to all jobs. These are supplied in addition to any defined by the individual job's SetLocalOutputSandbox(out_sbox_str)
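
The environment strings accepted by SetGlobalEnvironment and SetLocalEnvironment are comma separated name=value lists, and the values may contain spaces. A sketch of how such strings could be parsed and combined is given below. The function names are hypothetical, and letting a local value override a global one of the same name is an assumption, since this manual only says that local entries are supplied in addition to global ones.

```python
def parse_env_str(env_str):
    """Turn 'a=1,b=two words' into a list of (name, value) pairs."""
    pairs = []
    for item in env_str.split(","):
        if not item.strip():
            continue
        name, _, value = item.partition("=")
        pairs.append((name.strip(), value.strip()))
    return pairs

def merged_environment(global_env_str, local_env_str):
    """Global entries first, then local ones (assumed to win on a clash)."""
    env = dict(parse_env_str(global_env_str))
    env.update(parse_env_str(local_env_str))
    return env
```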

Job Access

A Task provides access to one or a collection of its Jobs.
GetJob(name)
Returns the job named 'name'.
GetJobs(job_name_pattern = ".*")
Returns a list of jobs whose names match 'job_name_pattern'.

Job Submission

The GBS submit philosophy is submit few, submit often. The principal use of Task level job submission is with a cron job that gets run frequently. On each call to SubmitJobs() GBS launches a small number of jobs, preferring retries over new jobs, until a maximum number have been submitted to Ganga. Thereafter it won't submit any until some are returned. This is to avoid situations, which we have experienced in the past, of catastrophic system failure after which every job fails as soon as it runs. In such cases it doesn't help if GBS just launches them all again only to have them fail as soon as they start to execute. Instead only a few get launched and fail on each round, but once things get fixed, the job levels will start to rise again.

The number of jobs your task launches with each submit, and the maximum number, are both configuration options, namely DefaultMaxSubmitJobs and DefaultMaxGangaJobs.
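
The selection made on each submit round can be sketched as follows. This is a reading of the description in this section, not the GBS source. The function name is hypothetical, and the default limits of 10 and 100 are taken from the example Task state shown earlier rather than from the configuration file.

```python
def jobs_to_submit(ready_retry, ready_new, n_in_ganga,
                   max_submit=10, max_ganga=100, num_req=0):
    """Pick the jobs for one submit round, retries first.

    ready_retry and ready_new are lists of job names; n_in_ganga is the
    number of jobs already submitted to Ganga.
    """
    # A num_req of 0 means "use the per-call default".
    wanted = num_req if num_req > 0 else max_submit
    # Never exceed the overall limit on jobs held by Ganga.
    room = max(0, max_ganga - n_in_ganga)
    return (ready_retry + ready_new)[:min(wanted, room)]
```

Once the overall limit is reached the function returns an empty list, which is the behaviour that stops a broken system from being flooded with doomed jobs.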

SetBackend(backend)
Sets the backend to 'backend'. Choices are:-
  Local
  PBS{:queue}  e.g. PBS  or PBS:prod4
  LCG{:queue}  e.g. LCG:lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
GetBackend()
Return the current backend.
SetMaxGangaJobs(n)
Sets the maximum number of jobs that can be submitted to Ganga at any one time e.g.
  task.SetMaxGangaJobs(300)
GetMaxGangaJobs()
Returns the maximum number of jobs that can be submitted to Ganga at any one time.
SetMaxSubmitJobs(n)
Sets the maximum number of jobs that can be submitted by a single call to SubmitJobs() e.g.
  task.SetMaxSubmitJobs(10)
GetMaxSubmitJobs()
Returns the maximum number of jobs that can be submitted by a single call to SubmitJobs().
GetGangaTreeDir()
Returns the JobTree folder used to hold Ganga jobs. Ganga jobs created by Submit(Perusable=False,MonitorFrequency=0,MonitorCommand="ps -o pid,ppid,rss,vsize,pcpu,pmem,cmd -u $USER") are stored in this folder.
EnableSubmit(enable=True)
Enable or disable submission of jobs via the SubmitJobs() method. The function returns the previous value of the switch.
IsAuthorisedToSubmit(warn=True)
This returns True if the user is authorised to submit jobs. For Local and PBS backends it always returns True and for LCG it looks for a valid GRID proxy. This method is used both by GBSJob.Submit(Perusable=False,MonitorFrequency=0,MonitorCommand="ps -o pid,ppid,rss,vsize,pcpu,pmem,cmd -u $USER") and SubmitJobs(num_req = 0) to avoid wasting time submitting jobs that will fail when presented to the GRID.
SubmitJobs(num_req = 0)
If enabled, submit more jobs but only up to the maximum number of running jobs, see GetMaxGangaJobs(). If 'num_req' is not supplied then use the default, see GetMaxSubmitJobs(). First choose ones scheduled for retry and then new jobs. Function returns number submitted.
UpdateJobsStatus()
After job submission use this method to have all Jobs that have been submitted check for changes to their associated Ganga jobs. As a by-product, it also removes any orphaned Ganga jobs i.e. Ganga jobs in the Task's jobtree folder that are not owned by any Job.

Task Mode

As explained in the tutorial section Test and Production Modes Tasks can be in one of two modes: 'Test' or 'Production'. In 'Production' mode methods that change the global properties of all Jobs are disabled. All Jobs submitted include in their environment the variable:-
    GBS_MODE
which takes the current setting of the Task mode. This allows application scripts to respond to the current mode.
GetMode()
Returns the current mode.
GetTestOnlyMethods()
Lists the methods that are disabled in 'Production' mode.
IsDisabled(method)
Returns True if 'method' is currently disabled.
SetMode(mode)
Asks for confirmation and then sets mode to 'mode' which must be one of 'Test' or 'Production'.

Job Manipulation

The methods in this section all perform some sort of action on a collection of jobs. They work in the same way: they take a pattern, with a default that matches every job, and they collect a list of all jobs that match this pattern and for which the action is appropriate. If the list is not empty the list is shown to you and you are asked for confirmation before the action is applied.
ClearErrorCountsJobs(job_name_pattern= ".*")
Clear the error counts that jobs have accumulated:-
  task.ClearErrorCountsJobs()
which also has the side effect of changing their status to RETRY if it was FAILED. This does not wipe the histories of the jobs and when next submitted they will use the retry args that came from the previous attempt.
ClearHistoryJobs(job_name_pattern= ".*")
Clears the entire processing history that the jobs have accumulated:-
  task.ClearHistoryJobs()
and resets the jobs to their initial processing state, even if they were SUCCEEDED. The only process state retained is that if the jobs were held they will remain so.
RemoveJobs(job_name_pattern= ".*")
Entirely removes the jobs:-
  task.RemoveJobs()
even if they were SUCCEEDED.
HoldJobs(job_name_pattern= ".*")
A group of jobs that are ready to be run can be held:-
  task.HoldJobs()
This can be useful if you want to prepare jobs ahead of time but don't want GBS to submit them until you are ready.
KillJobs(job_name_pattern= ".*")
You can kill a group of jobs that have been submitted to Ganga:-
  task.KillJobs()
this will kill the Ganga jobs and place your Jobs on HOLD so they will have to be released before they can be resubmitted.

ReleaseJobs(job_name_pattern = ".*")
A group of jobs that are being held can be released:-
  task.ReleaseJobs()

Job Summaries

Task has a couple of methods to produce summaries of the jobs it is managing. They report the current state of the GBS objects but are not proactive: they don't query Ganga to see if the state of any Ganga job has changed. To do that:-
  task.UpdateJobsStatus()
ListJobs(job_name_pattern = ".*")
This lists to the terminal a summary of all the jobs.
WriteHtmlReport(task_dir)
Write an HTML status web page to directory 'task_dir' e.g.:-
  task.WriteHtmlReport("/home/west/work/minos/temp")
In that directory GBS will create:-
  <task-name>.html  Top level index
  <task-name>/      Directory holding job and Ganga job data
The document produced allows you to get an overview of the task and to look at individual jobs, and their associated Ganga jobs (if any). The top level table has 3 columns:-
  Job Name
  Ganga Job
  Status
Each row of the table has two colours; one for Job Name and Ganga Job and one for Status as follows:-

  Colour for Job Name and Ganga Job    Colour for Status
  Not ready                            HELD
  Not ready                            Not HELD
  Ready                                First attempt
                                       Retry
  Submitted                            Running
                                       Not running
  Succeeded
  Failed

By adding a call to that method at the end of the python script you run with the cron job you can record the current status and look at it off-line.

RefreshJobStats()
The Task state includes counts of the number of jobs in each phase (Not Ready, Ready, Submitted and Complete). This state is not stored in the Task's state file but instead is derived from the Jobs that the task is managing. To refresh these numbers:-
  task.RefreshJobStats()
It should not be necessary (although it is harmless) for the user to call this, as child jobs call it when their state changes. Note: unlike UpdateJobsStatus() this simply asks jobs what their current state is, without checking Ganga.

Proxy Monitoring

WarnLowGridProxy(...)
Using myproxy servers to refresh GRID proxies (see Potential problems) allows GBS to run unattended for extended periods, but there is a risk that you may forget to renew the myproxy proxy. To help avoid this you can ask Tasks to send email when a proxy lifetime is running low. The method is:-
  WarnLowGridProxy(self,email_list,proxy_min_hours=3.,myproxy_min_days=3.)
This does nothing if there isn't a valid GRID proxy or if 'email_list' is not defined. If there is a valid proxy and 'email_list' is defined then an email is sent to 'email_list' if:-

  1. 'proxy_min_hours' is defined and the proxy lifetime is less than 'proxy_min_hours'

  2. 'myproxy_min_days' is defined and the myproxy lifetime is less than 'myproxy_min_days'
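
The decision described above can be summarised in a short sketch. It is not the GBS implementation: the function name and the way lifetimes are passed in are hypothetical, and a threshold of 0 or None is assumed to mean "not defined".

```python
def low_proxy_reasons(email_list, proxy_hours, myproxy_days,
                      proxy_min_hours=3.0, myproxy_min_days=3.0):
    """Return the reasons for sending a warning email (empty means no email).

    proxy_hours is the remaining GRID proxy lifetime, or None if there is
    no valid proxy; myproxy_days is the remaining myproxy lifetime.
    """
    if proxy_hours is None or not email_list:
        return []
    reasons = []
    if proxy_min_hours and proxy_hours < proxy_min_hours:
        reasons.append("GRID proxy low")
    if myproxy_min_days and myproxy_days < myproxy_min_days:
        reasons.append("myproxy low")
    return reasons
```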


GBSProtoJob

ProtoJobs are the safe way to create batches of new Jobs. See the Tutorial section Batch Production: Script args, ProtoJobs, Submitting jobs, Back-ends. They are made just like Jobs, with a name, an argument list and environment. They will be listed along with your other jobs.
  for job_no in range(1,101):
    task.AddProtoJob(str(job_no).zfill(8),str(job_no),"run=" + str(job_no))

  task.ListJobs()  
but are not written to disk. Exit from GBS and they are history.

ProtoJobs are manipulated by the Task methods:-

AsString(level = "Brief")
This method can be used to return the ProtoJob state as a string although
repr(pjob)
works just as well (it calls AsString).
GetLocalEnvironment()
Return, as a comma separated list string, the environment that is local to this ProtoJob.
GetLocalInputSandbox()
Return, as a comma separated list string, the input sandbox file list that is local to this ProtoJob.
GetLocalOutputSandbox()
Return, as a comma separated list string, the output sandbox file list that is local to this ProtoJob.
GetScriptLocalArgs()
Return (as a string) the comma list of application script args that are local to this ProtoJob.
SetLocalEnvironment(env_str)
Set, as a comma separated list string, the environment that is local to this ProtoJob.
 pjob.SetLocalEnvironment('var1=123,var2=a string with spaces,var3=456')
SetLocalInputSandbox(in_sbox_str)
Set, as a comma separated list string, the input sandbox file list that is local to this ProtoJob. The files must exist but copies are not taken until the ProtoJobs are promoted so they must be retained until then.
  pjob.SetLocalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
SetLocalOutputSandbox(out_sbox_str)
Set, as a comma separated list string, the output sandbox file list that is local to this ProtoJob. These must be supplied simply as file names i.e. without any directory names.
  pjob.SetLocalOutputSandbox('my_output_data.dat,my_output.log')
SetScriptLocalArgs(arg_str)
Set (as a string) the comma separated list of application script args that are local to this ProtoJob.
 pjob.SetScriptLocalArgs('123,a string with spaces,456')


GBSJob

This is responsible for running a single application script with a fixed set of inputs to produce an output. When requested, a Job submits a Ganga job to run the application script and have it return the termination status in a text file which it uses in co-operation with the application script to recover from errors. Jobs, once created, live until the end of the Task.

State

The print out of a Job reveals the following state:-
Status: RETRY []
  Associated Ganga Job ID: -1

Job Definition
  Script local args: '1001,1'
  Local environment: 'run=1001,subrun=1'
  Input Sandbox:     'input.dat'
  Output Sandbox:    'output.log,output.dat'

Retry Status
  Try:                  5
  Retry Args:           ''
  Early Fails:          5
  Late Handled Fails:   0
  Late Unhandled Fails: 0
It comprises:-
Getters
A Job has a wide range of getters to examine its state. They are described below.
AsString(level = "Full")
Return the state as a string, although
  repr(job)
works just as well.
CanClear()
Return true if the methods ClearErrorCounts and ClearHistory can be called.
CanKill()
Return true if there is an associated Ganga job that can be killed.
CanSubmit()
Return true if job is ready to submit a Ganga job.
GetEarlyFailsCount()
Return Early Fails Count.
GetGangaJobId()
Return associated Ganga job ID, if any, returns -1 otherwise.
GetGangaJob()
Return associated Ganga job, if any, returns None otherwise.
GetLateHandledFailsCount()
Return Late Handled Fails Count.
GetLateUnhandledFailsCount()
Return Late Unhandled Fails Count.
GetLocalEnvironment(prettyPrint = False)
Return, as a comma separated list string, the environment that is local to this job. These are supplied in addition to any defined by the parent Task's SetGlobalEnvironment(env_str)

If prettyPrint is True, prints environment as a list, one entry per line.

GetLocalInputSandbox()
Return, as a comma separated list string, the input sandbox file list that is local to this job. These are supplied in addition to any defined by the parent Task's SetGlobalInputSandbox(in_sbox_str)
GetLocalOutputSandbox()
Return, as a comma separated list string, the output sandbox file list that is local to this job. These are supplied in addition to any defined by the parent Task's SetGlobalOutputSandbox(out_sbox_str)
GetRetryArgs()
Return, as a string, the current retry args, i.e. as determined from the previous try (or empty for the first try).
GetScriptLocalArgs()
Return (as a string) the comma separated list of application script args that are local to this job. These are supplied to the script after any defined by the parent Task's SetScriptGlobalArgs(arg_str)
GetStatusCode()
Return status code.
GetStatusText()
Return status text which qualifies the Status Code.
GetStatusTime()
Return, as a string, the date and time at which the last change to the status code or text was recorded.
GetPhaseCode()
Return phase code. These are broad categories of status code used by Task for Job statistics. These correspond to the counts described in the Manager's ListTasks() method.
GetTryNumber()
Return Try Number (0 before first try).
IsComplete()
Return true if job is complete (Successful or Failed).
IsFailure()
Return true if job is failure.
IsHeld()
Return true if job is Held.
IsNotReady()
Return true if job is not ready to submit.
IsReady()
Return true if job is ready to submit.
IsRunning()
Return true if job is submitted and associated Ganga Job status is running.
IsSubmitted()
Return true if job is submitted.
IsSuccessful()
Return true if job is successful.
Setters
A Job has several setters to modify its state. They are described below, but see also Manipulation.
SetLocalEnvironment(env_str)
Set, as a comma separated list string, the environment that is local to this job. These are supplied in addition to any defined by the parent Task's SetGlobalEnvironment(env_str)
  job.SetLocalEnvironment('var1=123,var2=a string with spaces,var3=456')
SetLocalInputSandbox(in_sbox_str)
Set, as a comma separated list string, the input sandbox file list that is local to this job. Copies are taken of all these files so there is no need to retain the supplied files once they have been passed to the Job. These are supplied in addition to any defined by the parent Task's SetGlobalInputSandbox(in_sbox_str)
  job.SetLocalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
SetLocalOutputSandbox(out_sbox_str)
Set, as a comma separated list string, the output sandbox file list that is local to this job. These must be supplied simply as file names i.e. without any directory names. These are supplied in addition to any defined by the parent Task's SetGlobalOutputSandbox(out_sbox_str)
  job.SetLocalOutputSandbox('my_output_data.dat,$printf "%s_%8.8d_%4.4d.log" $config $run $subrun')
See Tutorial Input and Output Sandboxes for further information.
SetScriptLocalArgs(arg_str)
Set (as a string) the comma separated list of application script args that are local to this job. These are supplied to the script after any defined by the parent Task's SetScriptGlobalArgs(arg_str)
  job.SetScriptLocalArgs('123,a string with spaces,456')
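As a rough illustration of the comma separated list convention used by these setters, the sketch below splits such strings and places a Job's local script args after the Task's global ones. The helper names split_list, combine_args and parse_environment are hypothetical and are not part of GBS; how GBS itself parses these strings is not described in this manual, and the global arg values are invented for the example.

```python
def split_list(list_str):
    """Split a comma separated list string; spaces inside an item are kept."""
    if not list_str:
        return []
    return list_str.split(',')


def combine_args(global_args_str, local_args_str):
    """Return the Task's global args followed by the Job's local args."""
    return split_list(global_args_str) + split_list(local_args_str)


def parse_environment(env_str):
    """Turn 'name=value,name=value' into a dict (hypothetical helper)."""
    env = {}
    for item in split_list(env_str):
        name, _, value = item.partition('=')
        env[name] = value
    return env


# Hypothetical global args followed by the local args used in the example above.
print(combine_args('near,cedar', '123,a string with spaces,456'))
print(parse_environment('var1=123,var2=a string with spaces,var3=456'))
```

Because the comma is the only separator in this sketch, an individual item can contain spaces but not commas.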

Manipulation

A Job has a series of methods that affect job submission.
Hold(warn=True)
A job that is ready to be run can be held with:-
  job.Hold()
If 'warn' is True a warning will be issued if the job is not suitable for holding.
Kill(warn=True)
A job that is submitted can be killed with:-
  job.Kill()
If 'warn' is True a warning will be issued if the job is not suitable for killing. If successful this will kill the Ganga job and place the Job on HOLD so it will need to be released before it can be resubmitted.
Release(warn=True)
A job that is held can be released with:-
  job.Release()
If 'warn' is True a warning will be issued if the job is not suitable for releasing.
ClearErrorCounts(warn=True)
The error counts that a job has accumulated can be cleared with:-
  job.ClearErrorCounts()
which also has the side effect of changing its status to RETRY if it was FAILED. This does not wipe the history of the job and when next submitted it will use the retry args that came from the previous attempt.

If 'warn' is True a warning will be issued if the job is not suitable for error count clearing.

ClearHistory(confirm=True,warn=True)
Clear the entire processing history that the job has accumulated:-
  job.ClearHistory()
which resets the job to its initial processing state, even if it was SUCCEEDED. The only process state retained is that if the job was held it will remain so. As this method wipes all job output it asks for confirmation before proceeding.

If 'warn' is True a warning will be issued if the job is not suitable for history clearing.

Remove(confirm=True,warn=True)
To remove a job entirely:-
  job.Remove()
which deletes the job, even if it was SUCCEEDED. Not surprisingly it asks for confirmation before proceeding!

If 'warn' is True a warning will be issued if the job is not suitable for removal.
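The transitions described in this section can be pictured with the toy model below. It is only a sketch and not the GBS implementation: the class ToyJob and the status names READY and SUBMITTED are assumptions made for illustration (only RETRY, FAILED, SUCCEEDED and HOLD appear in this manual), and the real Job tracks far more state.

```python
class ToyJob:
    """Toy stand-in for a Job, modelling only Hold, Kill, Release and ClearErrorCounts."""

    def __init__(self, status='READY'):
        self.status = status      # 'READY' and 'SUBMITTED' are hypothetical names
        self.early_fails = 0

    def _refuse(self, action, warn):
        # Mirror the documented behaviour: optionally warn, then do nothing.
        if warn:
            print('Warning: job in state %s is not suitable for %s' % (self.status, action))
        return False

    def Hold(self, warn=True):
        # Only a job that is ready to run can be held.
        if self.status not in ('READY', 'RETRY'):
            return self._refuse('holding', warn)
        self.status = 'HOLD'
        return True

    def Kill(self, warn=True):
        # Killing a submitted job leaves it on HOLD until it is released.
        if self.status != 'SUBMITTED':
            return self._refuse('killing', warn)
        self.status = 'HOLD'
        return True

    def Release(self, warn=True):
        if self.status != 'HOLD':
            return self._refuse('releasing', warn)
        self.status = 'READY'
        return True

    def ClearErrorCounts(self):
        # Clearing the counts turns a FAILED job back into a RETRY job.
        self.early_fails = 0
        if self.status == 'FAILED':
            self.status = 'RETRY'


job = ToyJob('SUBMITTED')
job.Kill()       # now on HOLD
job.Release()    # ready to be submitted again
print(job.status)
```

In this model a killed job cannot be resubmitted until Release() has been called, which matches the description of Kill above.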

Submission and Analysis

The methods in this section deal with job submission, monitoring of the associated Ganga job and analysis after that job completes.
Submit(Perusable=False,MonitorFrequency=0,MonitorCommand="ps -o pid,ppid,rss,vsize,pcpu,pmem,cmd -u $USER")
To submit a suitable job:-
  job.Submit()
A warning will be issued if it is not suitable.

The method takes 3 optional arguments. Set Perusable to True to make the job output perusable; MonitorFrequency and MonitorCommand control job monitoring. See Tutorial: Job Perusal and Tutorial: Job Monitoring.

Ganga supports the organisation of its jobs into a JobTree and within this structure the job's parent task will have created the folder:-

  /gbs/<task-name>
see GetGangaTreeDir(), into which it places all the Ganga jobs its Jobs create.

UpdateStatus()
Check for changes to the associated Ganga job:-
  job.UpdateStatus()
A warning will be issued if there is no associated Ganga job. If the Ganga job appears stalled, i.e. has stayed in the same state for too long, it is killed. Once the Ganga job is complete the UpdateStatus method moves all of the Ganga job files into its own area and then erases the Ganga job. This has two advantages:-

This method creates and calls a GBSJobAnalyser to decide what to do next and that object also creates, for GRID jobs, the file:-
  gbs_grid_info.log
which summarises the passage of the job within the GRID.
WaitForJob(num_tries=100,time_interval=30)
This method makes up to 'num_tries' calls to UpdateStatus() with a sleep of 'time_interval' between each call, waiting for a submitted job to end.
Analyse(update = True)
This method performs post Ganga job execution analysis and, if 'update' is True, applies the results to determine the new state of the Job. It should not be necessary to call this method as it is called internally by UpdateStatus().
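The polling behaviour described for WaitForJob can be sketched as follows. This is a minimal illustration rather than the GBS code: the function wait_for_job, its return value, the injectable 'sleep' argument and the FakeJob class are all assumptions, and the sketch treats a job as ended once IsSubmitted() returns false.

```python
import time


def wait_for_job(job, num_tries=100, time_interval=30, sleep=time.sleep):
    """Call job.UpdateStatus() up to num_tries times, sleeping between calls.

    Returns True if the job is no longer submitted when polling stops
    (the return value is an assumption of this sketch).
    """
    for attempt in range(num_tries):
        job.UpdateStatus()
        if not job.IsSubmitted():
            return True
        if attempt < num_tries - 1:
            sleep(time_interval)   # no sleep after the final try
    return False


class FakeJob:
    """Hypothetical stand-in whose Ganga job ends after a fixed number of updates."""

    def __init__(self, updates_needed):
        self.updates_left = updates_needed

    def UpdateStatus(self):
        if self.updates_left > 0:
            self.updates_left -= 1

    def IsSubmitted(self):
        return self.updates_left > 0


# Use a zero interval so the example runs instantly.
print(wait_for_job(FakeJob(3), num_tries=5, time_interval=0))
```

With the defaults (100 tries, 30 seconds apart) a loop of this shape gives up after roughly 50 minutes.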


GBSJobAnalyser

For an introduction see the JobAnalyser section of the Design Manual.
GetRetryArgs()
Return the retry args that apply to the job that has just been analysed.
Analyse(job)
Perform post job execution analysis on a supplied job but don't modify it.
Apply()
Apply the results of analysis to the supplied job.


DCMquery (MinosDCM*)

For an introduction to the DCMquery model see its description in the tutorial. It consists of the following classes:-

MinosDCMTask

This inherits from GBSTask and has one additional data member: a DCM query string with the associated get and set:-
GetDCMQuery()
Return the DCM query.
SetDCMQuery(query)
Set the DCM query e.g.:-
  task.SetDCMQuery("""[     "run_type physics% and data_tier sntp-near
                 and physical_datastream_name spill%
                 and start_time < to_date('2006-02-18','yyyy-mm-dd')
                 and end_time   > to_date('2006-02-17','yyyy-mm-dd')
                 and version cedar" ]""")
The data set is extended using the method:-
AddProtoJob()
This overrides the GBSTask's AddProtoJob(job_name,args_str="",env_str="") and adds a ProtoJob for each result of the DCM query that does not already have one. The method also warns if there exist jobs that are not in the query.
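The bookkeeping this method performs amounts to comparing two sets of names. The sketch below shows that comparison only; the function reconcile and the sample names are hypothetical, and it neither runs a DCM query nor creates ProtoJobs.

```python
def reconcile(existing_job_names, query_result_names):
    """Compare existing jobs with the results of a query.

    Returns (to_add, not_in_query): names that need a new ProtoJob, and
    names of existing jobs that the query no longer returns.
    """
    existing = set(existing_job_names)
    wanted = set(query_result_names)
    to_add = sorted(wanted - existing)
    not_in_query = sorted(existing - wanted)
    return to_add, not_in_query


# Hypothetical names standing in for jobs and query results.
to_add, not_in_query = reconcile(['file_a', 'file_b'], ['file_b', 'file_c'])
print('add ProtoJobs for:', to_add)
for name in not_in_query:
    print('Warning: job %s is not in the query' % name)
```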


RSMonteCarlo (MinosRSM*)

For an introduction to the RSMonteCarlo model see its description in the tutorial. It consists of the following classes:-

MinosRSMTask

Inherits from GBSTask and overrides the internal _AddJobOrProtoJob method to ensure that all jobs have the correct form of name.
IsValidJobName(name)
Returns True if 'name' is a valid Job name.
RenameJob(old_name,new_name)
Rename a Job from 'old_name' to 'new_name'

MinosRSMJob

Inherits from GBSJob and has the internal method _Rename, which is required when changing seeds.
GetRun()
Returns the run number.
GetSubrun()
Returns the subrun number.
SetLocalEnvironment(env_str)
Overrides GBSJob's SetLocalEnvironment(env_str) by extending it to include 'run' and 'subrun' environmental variables.

MinosRSMJobAnalyser

Inherits from GBSJobAnalyser
Apply()
Overrides GBSJobAnalyser's Apply() by extending it to cover the case when the retry args returned consist of the single string "NEW_SEED", in which case it renames the job to have a subrun number that is one higher than the current highest subrun number of any job for the run.
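The NEW_SEED rule can be illustrated with the sketch below. It represents each job simply as a (run, subrun) pair; the functions next_subrun and apply_new_seed are hypothetical, and the real method renames the Job itself rather than returning a pair.

```python
def next_subrun(jobs, run):
    """Return one more than the highest subrun among the jobs for 'run'.

    'jobs' is an iterable of (run, subrun) pairs containing at least one
    entry for 'run'.
    """
    subruns = [subrun for (job_run, subrun) in jobs if job_run == run]
    return max(subruns) + 1


def apply_new_seed(job, retry_args, jobs):
    """Return the (run, subrun) the job should have after analysis."""
    run, subrun = job
    if retry_args == 'NEW_SEED':
        return (run, next_subrun(jobs, run))
    return job   # any other retry args leave the name unchanged


jobs = [(1001, 1), (1001, 2), (1002, 1)]
print(apply_new_seed((1001, 1), 'NEW_SEED', jobs))
```

Taking the maximum over every job for the run, rather than incrementing the job's own subrun, avoids colliding with a subrun that another job already uses.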


Member Function List

GBSJob

GBSJobAnalyser

GBSManager

GBSProtoJob

GBSTask

MinosDCMTask

MinosRSMTask

MinosRSMJob

MinosRSMJobAnalyser

Future Additions

Still somewhere in the pipeline:-