A fault tolerant batch submission framework layered on Ganga
Status: As of version V01-16-05 GBS is a running system
although it has only been tested so far on an internal dummy application.
Major features (IOPool, IOItem and TaskConnector) are missing. As a
temporary, and possibly permanent, replacement for the IO items, Tasks
directly create the inputs for their Jobs.
Here is the
code
See also
GBS/
docs This documentation + script to produce doxygen code
python The generic source code
minos MINOS specific extensions
GBS/python/.gbsrcwhich it reads first and which should not be modified. Instead to change any of the values, take a copy of this file, edit it and either
$HOME/.gbsrc
GBS_CONFIG_PATHto point to it.
| Key | Value | Example |
|---|---|---|
| Expt | Experiment name | Minos |
| VO | VO name | minos.vo.gridpp.ac.uk |
| DataDirectory | Top level directory for GBS data. | /home/west/work/python/gbs_datadir |
| LoggerThresholdTerminal | Threshold for stdout (terminal) logging:-
Level Prints FATAL = 6 Crash and burn ERROR = 5 Definitely wrong but can continue WARNING = 4 Unusual but not definitely wrong INFO = 3 Normal stuff user expects to see SYNOPSIS = 2 Behind the scenes detail DEBUG = 1 In depth debug level | 3 |
| LoggerThresholdPermanent | Threshold for permanent (file) logging Values as for LoggerThresholdTerminal | 3 |
| LoggerGlobalDirectory | Default for all file (permanent) logging | /tmp/gbs_logdir |
| Logger-<task-name>-Directory | Directory for task <task-name>(file) logging (optional) | /tmp/gbs_logdir |
| DefaultBackend | Default Backend. One of: Local, PBS:queue, LCG:queue | Local |
| MaxTimeEarlyFails | Maximum time in minutes (start to end time in GBS log file) for failures to be classified as early. | 5 |
| MaxRetryEarlyFails | Maximum number of early failure retries. | 20 |
| MaxRetryLateFailsHandled | Maximum number of late, handled failure retries. | 5 |
| MaxRetryLateFailsUnandled | Maximum number of late, unhandled failure retries. | 1 |
| DefaultMaxGangaJobs | Default maximum number of jobs that can be submitted to Ganga at any one time | 100 |
| DefaultMaxSubmitJobs | Default maximum number of jobs that can be submitted by a single call to Task.SubmitJobs() | 10 |
| UserModelsPath | Directory holding register_user_models.py (if adding user extensions) | $GBS_HOME/minos |
At the very least, you must alter:-
This means:-
GANGA_VER. If you are not working with these tools you need to:-
export GBS_HOME=path-to-GBS-directory
export PYTHONPATH=${PYTHONPATH}:$GBS_HOME/python
or setenv GBS_HOME path-to-GBS-directory
setenv PYTHONPATH ${PYTHONPATH}:$GBS_HOME/python
and define "ganga" to invoke Ganga
If you have a GRID certificate, now would be a good time to create a
proxy if you haven't already got one. If you don't you will see lots
of warning along the lines
WARNING GridProxy creation/renewal failed [1].throughout the Ganga session. They are benign, if irritating, you don't need a GRID proxy to run most of this tutorial as it will just use your local machine but there doesn't appear to be any way to explain this to Ganga.
Run ganga with the GBS bootstrap:-
ganga -i $GBS_HOME/python/bootstrap.pyIf you haven't run Ganga before it goes through an initial setup up but if things are working you will eventually you should see:-
GBS version V01-16-05Now you need to get hold on the manager object:-
man = GetManager()A very nice feature of python is that if you just type in an object identifier, it will tell you something about it. So try:-
manand you will get something like:-
GBSManager named 'Manager' stored in /tmp/gbs_datadir/Manager.state Managing 0 TasksCaution: If you get exactly that you are headed for trouble; GBS is storing all its data on /tmp so it will get wiped at some point. Stop what you are doing and go back to Installing and configure GBS to write its data somewhere safer!
Every GBS object that needs to record its state has a '.state' file in which it records it. These files are designed to be easy to read and offer a way to sneak in and fiddle if things ever get broken. That's a last resort mind! For now all you need to know is that every time an object changes its state, it writes itself back to its file. So you never have to tell GBS you are done; you can exit at any time (by ^D) and come back later and resume your GBS session.
A second very nice feature of python is that every object carries some documentation about it in its __doc__ data member so to ask man what it can do:-
print man.__doc__You will find that very brief. GBS exploits the improved built-help provided by ipython which Ganga is based on, so now try:-
help(man)You should see the __doc_ as above but now with descriptions of all the methods. Press 'q' to leave help.
Don't forget these handy features, they help both learning the system and examining its state.
Managers don't do a lot besides create tasks, so that's what you need to do next:-
task = man.AddTask("my_first_task")
That name is going to be used to create a directory so don't include
spaces or other "odd" characters. GBS will shout at you if you do.
When you do this for real, you should choose names for tasks that will
remind you what they are for.Now that you have got some tasks, you can list them:-
man.ListTasks()and you should see:-
The following tasks are setup up:- Name !Ready(other) Ready(retry) Sub(!run) Done(fail) Backend Script file Args my_first_task 0( 0) 0( 0) 0( 0) 0( 0) LocalFor an explanation of the counts see the reference section ListTasks()
Before going further, trying quiting from Ganga, coming back in, getting the manager and listing its tasks:-
^D ganga -i $GBS_HOME/python/bootstrap.py man = GetManager() man.ListTasks()and hopefully you will see your task again.
Now pick up your task and examine it:-
task = man.GetTask("my_first_task")
task
It is a bit more interesting. For now we will concentrate on a couple of things:-
Model: default Backend: LocalGBS is designed to be extensible, with an experiment neutral set of core objects that can be replaced in different "models". For now we will stick with the core default one.
At the moment the task is talking to the local backend, which means it will run jobs as child processes on the current machine. This is very useful for checking things out before running production, and also for learning in this tutorial!
This object has a lot of methods:-
help(task)too many for that printout to be useful at this stage, but as you might guess, one of the things you can do is to create jobs, so try that now:-
job = task.AddJob("my_first_job")
task.ListJobs()
Notice something? Your job is called:-
job_my_first_jobGBS requires that all job names begin "job_" and prefixes those that don't.
By now you probably have already tried:-
job help(job)but never mind about them and instead try to submit it:-
job.Submit()and you will get told:-
Cannot submit job, no user application script assigned to Task 'my_first_task'which is fair enough, you have to tell GBS what to run!
In any directory you like create the following little bash script:-
echo "Hello World! (what else?)" echo "Here is my GBS environment:-" env | grep GBS_- note that your application will run in a bash shell, but there is nothing to stop you say:-
csh my_application_job.cshif you prefer that shell.
Now that you have a script, you have to give it to your task so that all its jobs can use it. Remember: all Jobs of a Task run the same application script.
In my case:-
task.SetScriptFileName("/home/west/work/minos/temp/my_first_script.sh")
and you will see something similar to:-
Copying /home/west/work/minos/temp/my_first_script.sh -> /home/west/work/python/gbs_datadir/Manager/my_first_task/my_first_script.shGBS has taken its own copy, well you wouldn't want an entire production to crash because you accidentally deleted a script in some random directory would you?
Now can we run a job?
job.Submit()now you script should get run and you should see output that includes lines like:-
Ganga.GPIDev.Lib.Job : INFO submitting job 93 Ganga.GPIDev.Adapters : INFO submitting job 93 to Local backend Ganga.GPIDev.Lib.Job : INFO job 93 status changed to "submitted"In this case GBS has created Ganga job with the ID 93. Ganga supports the organisation of its jobs into a JobTree and within this structure your task has created the folder:-
/gbs/<task-name>and if a Ganga job is created it will be placed in this folder.
If you list your jobs one should be running:-
task.ListJobs()but no matter how many time you type that command, the job continues to run. When you ask your task to list its jobs, that's a "lightweight" question; its jobs don't check with Ganga. To get up to date information, you can either work at the Task or the individual Job level:-
task.UpdateJobsStatus() or job.UpdateStatus()so do that now.
Once you job has ended, and chances are that you trivial one has, you will notice:-
Ganga.GPIDev.Lib.Job : INFO removing job 93If you know a little about Ganga, then you also need to understand how GBS interacts with it. Once Ganga considers the job complete and associated the GBS Job updates, it moves all of the Ganga job files into it's own area and then erases the Ganga job. This has two advantages:-
Right now your directory structure should look something like this:-
./Manager ./Manager/my_first_task ./Manager/my_first_task/job_my_first_job ./Manager/my_first_task/job_my_first_job/try_001If things are bad, the next attempt will go into try_002 and so on.
task.ListJobs()shows:-
job_my_first_job RETRYSo your very first job has failed! How can that happen with anything so simple? It's time to introduce the fault analysis and handling framework
joband now not only does it tell you about the Job object but also the output it produced:-
The output for try 1 can be found in
/home/west/work/python/gbs_datadir/Manager/my_first_task/job_my_first_job/try_001
and consists of:-
total 40
-rw-r--r-- 1 west minos 10 Nov 11 18:54 gbs_ganga.status
-rw-r--r-- 1 west minos 546 Nov 11 18:54 gbs_my_first_task_job_my_first_job_1.log
-rw-r--r-- 1 west minos 1369 Nov 11 18:54 gbs_grid_info.log
-rw-r--r-- 1 west minos 86 Nov 11 18:54 __jobstatus__
-rw-r--r-- 1 west minos 0 Nov 11 18:54 stderr
-rw-r--r-- 1 west minos 217 Nov 11 18:54 stdout
-rw-r--r-- 1 west minos 0 Nov 11 18:54 __syslog__
The GLF (GBS Log File) gbs_my_first_task_job_my_first_job_1.log contains:-
2007-11-11 18:54:07 INFO GBS_JOB_SUBMIT submitting job
2007-11-11 18:54:07 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.sh
2007-11-11 18:54:07 INFO GBS_JOB_WRAPPER Terminating. User script returned 0
2007-11-11 18:54:31 INFO GBS_JOB_ANALYSIS:-
Communication Level: APPLICATION [Application failed to record SUCCEEDED, FAILED, HOLD or RETRY]
Ganga Exit Status: 'completed' Recorded job interval:0.0mins
Appl. Job Status Code: UNKNOWN []
Failure category: EARLY
Judgement: Status Code:RETRY [] Retry Args:''
First it shows you the output directory name and contents. If you have
used Ganga before then you will be familiar with some of these.
stderr Your job error output
stdout Your job standard output
gbs_ganga.status Dump of the Ganga job
gbs_grid_info.log Summary the passage of the job within the GRID.
(only present for GRID jobs)
gbs_my_first_task_job_my_first_job_1.log The GLF - GBS Log File
(format: gbs_<task-name>_<job-name>_<try_num>.log)
The GLF is the cornerstone to fault handling. GBS helpfully displays its
contents:-
2007-11-11 18:54:07 INFO GBS_JOB_SUBMIT submitting jobThat's written when your Job asked Ganga to run the job.
2007-11-11 18:54:07 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.shYour Job sent along a little wrapper script to get things ready for your application script and this line says that it has started.
2007-11-11 18:54:07 INFO GBS_JOB_WRAPPER Terminating. User script returned 0That's the wrapper again saying that you job exited normally and that it too is exiting. Looks good doesn't it?
2007-11-11 18:54:31 INFO GBS_JOB_ANALYSIS:-
Communication Level: APPLICATION [Application failed to record SUCCEEDED, FAILED, HOLD or RETRY]
Ganga Exit Status: 'completed' Recorded job interval:0.0mins
Appl. Job Status Code: UNKNOWN []
Failure category: EARLY
Judgement: Status Code:RETRY [] Retry Args:''
This last part gets written when your Job receives control back and
analyses what went on. Its fault recovery system works of a positive
signal concept: it's not enough that the job looks O.K., the
application script, i.e. your script, has to positively tell
it that it is O.K.The analysis organises failures at different levels. If you want to see the gory details look at:- Detailed Design: Error Recovery: Job Interface If you do you will see that the Communication Level APPLICATION means the GLF exists with job wrapper start line and end lines but either application ends with a non-zero code or fails to write one of SUCCEEDED, FAILED, HOLD or RETRY to the GLF.
You script exited O.K. but didn't report back. How does it do that?
Take a look at your stdout. It should look like:-
Hello World! (what else?) Here is my GBS environment:- GBS_HOME=/tmp/tmp-3wLUy GBS_RETRY_COUNT=0 GBS_MODE=Test GBS_LOG_FILE=/tmp/tmp-3wLUy/gbs_my_first_task_job_my_first_job_1.log GBS_LOG=/tmp/tmp-3wLUy/gbs_logger.shThe wrapper has set up an environment for you and it include a little script ($GBS_LOG) logger that will write to the GLF (called $GBS_LOG_FILE).
Take your script file and modify it:-
echo "Hello World! (what else?)" echo "Here is my GBS environment:-" env | grep GBS_ $GBS_LOG INFO Everything looks O.K. $GBS_LOG SUCCEEDED my_data_file_1 my_data_file_2These last two lines get written, timestamped, to the log file. The INFO isn't essential, but is a handy to record general information but the second reassures GBS that the job really is O.K. Caution: The SUCCEEDED (and HOLD, RETRY and FAILED) lines all require some string after them or they will not be recognised; this is the data your script is passing back, although its form is completely arbitrary.
Don't forget to hand you revised script to your task:-
task.SetScriptFileName("/home/west/work/minos/temp/my_first_script.sh")
and try running again, waiting a few moments, updating the status and
checking
job.Submit() task.UpdateJobsStatus() task.ListJobs()This time you should job work and you should see the line:-
job_my_first_job SUCCEEDED [my_data_file_1 my_data_file_2]The information after the keyword "SUCCEEDED" is recorded as "Status Details" The convention here is to record the names of the output files. In some future incarnation GBS may be able to connect Tasks together and by using the convention it would be possible to hand the output from one job to another.
Take a look at you job again:-
jobThis time the GLF contains:-
2007-11-11 19:43:42 INFO GBS_JOB_SUBMIT submitting job
2007-11-11 19:43:42 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.sh
2007-11-11 19:43:42 INFO Everything looks O.K.
2007-11-11 19:43:42 SUCCEEDED my_data_file_1 my_data_file_2
2007-11-11 19:43:42 INFO GBS_JOB_WRAPPER Terminating. User script returned 0
2007-11-11 19:43:53 INFO GBS_JOB_ANALYSIS:-
Communication Level: USER [Achieved communication with application]
Ganga Exit Status: 'completed' Recorded job interval:0.0mins
Appl. Job Status Code: SUCCEEDED [my_data_file_1 my_data_file_2]
Failure category: NONE
Judgement: Status Code:SUCCEEDED [my_data_file_1 my_data_file_2] Retry Args:''
and tells you the your Job has managed to communicate on the USER
level i.e. with your script. Further your script signalled back
SUCCEEDED and that made GBS happy and it marked you job down as
finished and attempting to:-
job.Submit()only result in your being told:-
Cannot submit job job_my_first_job: Status: SUCCEEDED [my_data_file_1 my_data_file_2]It's not much of an error recovery system if all it can handle is success, so what else can it do? Well it can also handle RETRY. To demonstrate this, take another look at your stdout files from your first two tries and in particular the GBS environment:-
in try_001/stdout: GBS_RETRY_COUNT=0 in try_002/stdout: GBS_RETRY_COUNT=1You can use this to make your script a bit more cantankerous:-
echo "Hello World! (what else?)" echo "Here is my GBS environment:-" env | grep GBS_ if [ $GBS_RETRY_COUNT = 0 ] ; then $GBS_LOG RETRY 1 abc elif [ $GBS_RETRY_COUNT = 1 ] ; then $GBS_LOG SUCCEEDED my_data_file_1 my_data_file_2 fiSo it will fail the first time but succeed the second. Note also that the RETRY has some (odd looking) data with it.
To do this you will have again pass the script in and the create a second job, as the first one is finished, and run that:-
task.SetScriptFileName("/home/west/work/minos/temp/my_first_script.sh")
job = task.AddJob("job_my_second_job")
job.Submit()
( wait a few seconds )
job.UpdateStatus()
task.ListJobs()
you will see it signal retry and that it's data [1 abc] is displayed.
Now look at its GLF:-
2007-11-11 21:57:25 INFO GBS_JOB_SUBMIT submitting job
2007-11-11 21:57:25 INFO GBS_JOB_WRAPPER Starting. About to execute my_first_script.sh
2007-11-11 21:57:25 RETRY 1 abc
2007-11-11 21:57:25 INFO GBS_JOB_WRAPPER Terminating. User script returned 0
2007-11-11 21:57:30 INFO GBS_JOB_ANALYSIS:-
Communication Level: USER [Achieved communication with application]
Ganga Exit Status: 'completed' Recorded job interval:0.0mins
Appl. Job Status Code: RETRY [1 abc]
Failure category: EARLY
Judgement: Status Code:RETRY [1 abc] Retry Args:'1 abc'
This time it reaches USER communication and your application signals
retry. The judgement is retry with retry args '1 abc'Run again and now the contrived job succeeds. This time look at its stdout:-
GBS_NUM_RETRY_ARGS=2 GBS_RETRY_ARG_1=1 GBS_RETRY_ARG_2=abcAs you can see, the data you passed back from your script is returned to you for the next try along with a count of the number of args.
Here then is the central concept of error recovery: a situation arises that the application script can identify and is unable to rectify but might, if allowed to start again later. Situations in which this strategy could arise both early and late in the script's execution:-
Our job has a single step but it isn't hard to extend it to multiple steps and have the script signal back which step it failed at. If you look at:-
$GBS_HOME/scripts/run_gbs_job.shyou will see a ready to run example that can be used as the basis for multi-step jobs.
By now you should have understood enough to be able to read through:- Error Recovery if you want to and understand how GBS classifies other errors, but to keep simple, for now all you need to understand is that GBS classifies all failures into 3 types based on the job duration and whether or not it managed to communicate with your script:-
For each of these types of failures there is a configuration option:-
MaxRetryEarlyFails MaxRetryLateFailsUnandled MaxRetryLateFailsHandledTake a look at you job:-
joband you see what its current failure counters are:-
Early Fails: 1, Late Handled Fails: 0, Late Unhandled Fails: 0Of course all are early failures, they could hardly be otherwise given the script.
gbs_global_<current-date>.logalthough for any specific task you can configure its output to be written to:-
gbs_<task-name>_<current-date>.logMessages are classified according to severity and the threshold for the terminal and permanent can be changed independently e.g.:-
SetLoggerThreshold(logger.SYNOPSIS,"Terminal") SetLoggerThreshold(logger.INFO,"Permanent")If you only supply a threshold, it applies to terminal output. The permanent threshold should never be set higher than INFO for otherwise it will not record normal job submission and retrieval messages.
You can also check the current level e.g.:-
GetLoggerThreshold("Terminal")
GetLoggerThreshold("Permanent")
"A set of jobs all running the same script but with different inputs".
The system allows you to have one set of arguments that are global to all jobs and a second set that is local to the current job. For example if you:-
task.SetScriptGlobalArgs("a string with spaces,123")
job.SetScriptLocalArgs("(this is an array),456")
Then when your script runs it will be given the arguments:-
"a string with spaces" "123" "(this is an array)" "456"so pass in your arguments as comma separated lists, all spaces are significant. For you trouble makers who want to pass in a string that includes a comma just escape it with a '\' e.g.
task.SetScriptGlobalArgs("a string with spaces\, commas and other odd stuff e.g. \";!,123")
An alternative way to provide different inputs is to provide a different environment to each job. In a way analogous to setting script arguments the system allows you to set one environment that is global to all jobs and a second set that is local to the current job. For example if you:-
task.SetGlobalEnvironment('config=L010185_near_bhcurv,daikon_ver=daikon_04,mini_flux=no')
job.SetLocalEnvironment('run=1001,subrun=1')
Then when your script runs it will be given the environment:-
config=L010185_near_bhcurv daikon_ver=daikon_04 mini_flux=no run=1001 subrun=1GBS does some basic syntax checking on the environment and objects if it does not consist of a comma separated list of key=value pairs. It also removes duplicates, in favour of the later entry, and sorts into alphabetical order.
You can incrementally establish an environment by prefixing a '+' to the start of your string. This allows you both to add new items and replace existing ones. For example:-
task.SetGlobalEnvironment('+mini_flux=yes,REROOT=0')
changes mini_flux and adds REROOT.If you just
task.GetGlobalEnvironment() job.GetLocalEnvironment()you will see the environment string but these methods take an optional argument:-
prettyPrint = Falsewhich lists the environment one entry per line e.g.
task.GetGlobalEnvironment(True)
You can also set up the local arguments and local environment when you create a job:-
job = task.AddJob("my_job_with_args_and_env","123,456","run=1001,subrun=1")
As always you can inspect you task and job to see what these are set
to.Clearly it would be extremely tedious, not to say error prone, to create hundreds of jobs individually, but hey, this is python, so you can create them in seconds:-
for job_no in range(1,101):
task.AddJob(str(job_no).zfill(8),str(job_no))
<--- blank line here
If you want to try that, and you are new to python then make sure you
indent the second line a few spaces and then end with a blank line.
You will also need to know:-
GBS offers a safer alternative: ProtoJobs. They are made just like Jobs, with a name, an argument list and environment. They will be listed along with your other jobs.
for job_no in range(1,101):
task.AddProtoJob(str(job_no).zfill(8),str(job_no),"run=" + str(job_no))
task.ListJobs()
but are not written to disk. Exit from GBS or type
task.RemoveProtoJobs()and they are history.
So you can check and only when you are sure that you want to keep them:-
task.PromoteProtoJobs()Incidentally, the following 3 Task methods actually take a pattern match argument
ListJobs(job_name_pattern = ".*") PromoteProtoJobs(job_name_pattern = ".*") RemoveProtoJobs(job_name_pattern = ".*")The pattern ".*" means any number of any character, but if you know about pattern matching you can be more selective about what you list, promote and remove.
So that covers block creation for now, but for production work there is also has to be block submission. That is done using the call:-
task.SubmitJobs()but try that now and you will be told:-
Sorry, cannot submit; submit not enabled.Although that's easy to fix:-
task.EnableSubmit()you need to understand the GBS submit philosophy. Instead of submitting every job it can instead it works on the principle: submit few, submit often. The principle use of SubmitJobs is with a cron job that gets run frequently. On each call to SubmitJobs() GBS launches a small number of jobs, preferring retries over new jobs, until a maximum number have been submitted to Ganga. Thereafter it won't submit any until some are returned. This is to avoid situations, which we have experienced in the past, of catastrophic system failure after which every job fails as soon as it runs. In such cases it doesn't help if GBS just launches them all again only to have the fail as soon as they start to execute. Instead only a few get launched and fail on each round, but once things set fixed, the job levels will start to rise again.
The number of jobs you task launches with each submit, and the maximum number are both configuration options, namely: DefaultMaxSubmitJobs and DefaultMaxGangaJobs
Take a look at your task and you will see the current values:-
Limits: 10(single submit) 100(maximum)that can be changed with the Task methods:-
SetMaxSubmitJobs(n) SetMaxGangaJobs(n)The system can be disabled entirely with:-
task.EnableSubmit(False)so if you know that the farm is going to be down you can suspend operations. In fact it doesn't entirely disable the system as the first thing SubmitJobs() does is to call UpdateJobsStatus() and it will continue to do this to check running jobs (the farm could be draining i.e. continuing to run existing jobs but refusing new ones)
You can override the MaxSubmit limit, but not the MaxGanga limit by passing in the number to submit e.g.:-
task.SubmitJobs(100)which is handy on a Friday evening if there is a risk that the GRID proxy your cron job uses will expire before you can get back to the terminal and refresh it.
Finally for this section, through the magic of Ganga you can switch from running jobs as child processes on your machine to running on machines on the outermost rim of the known GRID, well RAL anyway. When you create a new Task its back-end is set to the configuration option DefaultBackend, but just by flipping a single switch:-
task.SetBackend("LCG:lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M")
you are on the GRID, but don't forget that you will need to set up a
GRID proxy
Possible values for backend are:-
Local - local machine
PBS:{queue} - local qsub. Can optionally specify queue
LCG:{queue} - GRID. Can optionally specify queue
GRID proxy
before you can run on the GRID.As you might expect, submitting jobs to the GRID introduces an extra delay in job turnaround and even short jobs submitted to a quick queue takes time. Even so, when testing it is sometimes useful to wait for the job to end and you can do this for an individual job using:-
job.WaitForJob()It won't wait for ever. The full argument list for this method is
job.WaitForJob(num_tries=100,time_interval=30)so by default it will try 100 times sleeping 30 seconds between attempts so you could do e.g.
job.WaitForJob(time_interval=60)
GBS supports both global and local sandboxes. For example:-
task.SetGlobalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
adds the input sandbox files my_input_data.dat and my_script.sh to all
its jobs. Copies are taken of all these files so there is no need to
retain the supplied files once they have been passed to the Task.
All .csh and .sh sandbox input files are made executable.
In a similar way:-
task.SetGlobalOutputSandbox('my_output_data.dat,my_output.log')
requests that the files my_output_data.dat and my_output.log are
returned in the sandbox for every job. In this case file names must
be supplied simply as file names i.e. without any directory names.There corresponding Job methods that local to an individual job e.g.:-
job.SetLocalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
job.SetLocalOutputSandbox('my_output_data.dat,my_output.log')
that work in exactly the same way.It's not unusual that the names of the output sandbox files are specific to the job. GBS provides two alternative ways to deal with this situation:-
If the name can be constructed from the job's environment (either global and/or local) then this method can be used. The name must begin with a '$' followed by a printf command to create the name. For example suppose the job's environment includes:-
config=L010185_near run=12345 subrun=12then
job.SetLocalOutputSandbox('my_output_data.dat,'$printf "%s_%8.8d_%4.4.log" $config $run $subrun' )
would generate the output sandbox file names
my_output_data.dat L010185_near_00012345_0012.log
Where the output file names are only known at execution time they can be returned by having your application script write one or more RETURN_FILE entries to the GLF each giving the name of an output file e.g.:-
$GBS_LOG RETURN_FILE `pwd`/${TYP[0]}_${RUNN}_${SUBRUNN}.log
The job wrapper collects all such file names and creates a single tar file named
gbs_output_sandbox.tar.gzOn return this file is automatically unpacked. Note to avoid problems where the working directory changes between the time the command is issued and the tar file created, it is better to use absolute file names, hence the `pwd` in the above GBS_LOG command.
ProtoJobs have the same methods:-
pjob.SetLocalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
pjob.SetLocalOutputSandbox('my_output_data.dat,my_output.log')
but in this case there is one difference: SetLocalInputSandbox does
not take copies of the input files. That only happens if the
ProtoJobs get promoted so the files must be retained until then.For every "Set" method there is a corresponding "Get" method e.g.:-
task.GetGlobalInputSandbox() job.GetLocalInputSandbox() pjob.GetLocalOutputSandbox()
GBS supports this way of working. When a Task is first created you will see that its state includes:-
Mode: Test
While it remains in this mode all Task methods are enabled. However by typing:-
task.SetMode('Production')
some become disabled. You will be warned as to which will be and
asked for confirmation before the change is accepted. You can also
see which will be disabled by typing:-
task.GetTestOnlyMethods()
All Jobs submitted include in their environment the variable:-
GBS_MODE
which takes the current setting of the Task mode. So you can use this
in your application scripts to do different things in test mode, for
example only run short jobs and write output files to scratch areas.If it becomes necessary, you can
task.SetMode('Test')
to re-enable all methods, but naturally this exposes you to the risk
that jobs are run with different sets of conditions.
job.Hold()and then cannot be submitted until it is released:-
job.Release()This can be useful if you want to prepare jobs ahead of time but don't want GBS to submit them until you are ready.
You can clear the error counts that a job has accumulated with:-
job.ClearErrorCounts()which also has the side effect of changing its status to RETRY if it was FAILED. This does not wipe the history of the job and when next submitted it will use the retry args that came from the previous attempt.
You can go further with:-
job.ClearHistory()which resets the job to its initial processing state, even if it was SUCCEEDED. The only process state retained is that if the job was held it will remain so. As this method wipes all job output it asks for confirmation before proceeding.
There is also the nuclear option:-
job.Remove()which removes the job entirely, even if it was SUCCEEDED. Not surprisingly it asks for confirmation before proceeding!
You can kill a job that has been submitted to Ganga:-
job.Kill()this will kill the Ganga job and place your Job on HOLD so it will need to be released before it can be resubmitted.
You can also manipulate groups of jobs a task owns using the Task methods:-
ClearErrorCountsJobs(job_name_pattern= ".*") ClearHistoryJobs(job_name_pattern= ".*") HoldJobs(job_name_pattern= ".*") KillJobs(job_name_pattern= ".*") ReleaseJobs(job_name_pattern= ".*") RemoveJobs(job_name_pattern= ".*")All work in the same way: they take a pattern, with a default that matches every job, and then collect a list of all jobs that match this pattern and for which the action is appropriate. If the list is not empty the list is shown to you and you are asked for confirmation before the action is applied.
Submit(Perusable=False)To submit with the feature enabled:-
job.Submit(True)As explained in Reference: glite-wms-job-perusal Job Perusal is expensive which is why there is no way to permanently enable it at either the Job or the Task level; you must request it explicitly every time you need it.
Ganga only enables perusal of the stdout so to make best use of the feature in your application script you will want to merge stdout and stderr then, when you job is running use the Ganga job peek method:-
job.GetGangaJob().peek('stdout','cat')
If job doesn't currently have an associated Ganga job then this will
fail:-
AttributeError: 'NoneType' object has no attribute 'peek'When the Ganga job has terminated and the job's UpdateStatus() method is called it will attempt to get a copy of the final version of stdout and store it as
stdout.perusablein the job try output directory. So it's worth submitting with the perusable option even if you won't be around to look at the file while it is running.
Aside: If you run Ganga directly, enable job perusal with:-
job.backend.perusable = TrueWhen using perusal, it is opten useful to enable Job Monitoring See the next sectiom.
To help debug such situations GBS can, along with your application script, run a second process that at intervals runs a command and outputs the results to stdout. The Job Submit method actually has 3 optional argument:s-
Submit(Perusable=False,
MonitorFrequency=0,
MonitorCommand="ps -o pid,ppid,rss,vsize,pcpu,pmem,cmd -u $USER")
By default monitoring is switched off but if MonitorFrequency is set
to some positive number N, then every N seconds the monitor will run
MonitorCommand, which by default is set to run ps and report on
memory and cpu command. Normally monitoring is used in conjunction
with perusal, but that's not a requirementYou can write your own monitoring script and submit it in the input sandbox and then select it e.g.
MonitorCommand="$GBS_HOME/my_monitoring_script.sh")Note that GBS makes all sandbox .csh and .sh files executable.
Something along the lines:-
setup UI source local-setup-script check that the previous cron job isn't still running and quits if it is ganga gbs-python-scriptSee run_gbs_cron.sh for a working example, well at Oxford at any rate.
Something like this:-
import GBSConfig # That gets the configuration Singleton up and running
import register_models # That sets up the models
# Create a Singleton Manager
import GBSModelRegistry
man = GBSModelRegistry.GetModelRegistry().CreateObject("default","Manager","Manager",None)
task = man.GetTask("my_first_task")
task.SubmitJobs()
task.ListJobs()
See
run_gbs_cron.py for a working example
task.WriteHtmlReport(dir) where dir Is a directory GBS can write toIn that directory GBS will create:-
<task-name>.html Top level index <task-name>/ Directory holding job and Ganga job dataThe document produced allows you to get an overview of the task and to look at individual jobs, and their associated Ganga jobs (if any). For details of the layout see WriteHtmlReport(task_dir)
| Activity | Rate (jobs/minute) |
|---|---|
| Submitting | 10 |
| Status checking | 40 |
| Output retrieval | 25 |
In principle these figures could be reduced a lot if the batch features of gLite/WMS were exploited, but at the time of this writing, GBS does not.
For now at least, if maintaining a high volume of very long running jobs then it's the status checking that will take the time. For example with 1000 concurrent jobs it will take about 25 minutes to check them all. The script file example run_gbs_cron.py actually causes the status to be checked twice, first when submitting and again when updating the status, so that brings the time to run to close to an hour. The cron could run a little more frequently than once an hour if the launcher scripts follows the run_gbs_cron.sh example and quits if the last is still running, but there is no point in running much more frequently.
At the other end of the spectrum, if submitting many very short jobs to the short queue where they turn round very fast, it is the time to submit and retrieve that will count. If for example you limit the number submitted to 10 then in a steady state that will take less than 2 minutes and the cron could run say every 10 minutes.
What can be done about it?
Using this GBS should be able to run unattended for extended periods, but there is a risk that you may forget to renew the myproxy proxy. To help avoid this you can ask Tasks to send email when a proxy lifetime is running low. See WarnLowGridProxy(...)
man = GetManager()but this time, when you create a Task, you pass in a second argument:-
task = man.AddTask("My_DCM_Query_task","DCMquery")
You are not taking the default model but one tailored to the
DCM
query. If you look at the task there is one new member:-
DCM Query: ''You can set that to a DCM query using the SetDCMQuery method, and you may find python's triple quote string useful. [New python users, it's basically """ anything you like here """].
task.SetDCMQuery("""[ "run_type physics% and data_tier sntp-near
and physical_datastream_name spill%
and start_time < to_date('2006-02-18','yyyy-mm-dd')
and end_time > to_date('2006-02-17','yyyy-mm-dd')
and version cedar" ]""")
Now when you do
task.AddProtoJob()It goes off, executes the DCM query and then tries to add ProtoJobs where the job name is formed from the run and subrun number and the single local script arg is the DCM URL. So if you want to run over a set of data files that can be returned as a DCM query you only have to add the script file and you are all set. If the DCM query is actually a SAM query then the set may change over time and you can repeat the AddProtoJob() at any time, safe in the knowledge that only new entry will form ProtoJobs. The Task also warns you if you have jobs that are not in the query and then you will have to investigate why that is.
man = GetManager()but this time, when you create a Task, you pass in a second argument:-
task = man.AddTask("My_DCM_Query_task","RSMonteCarlo")
which selects the Run Seeded Monte Carlo (RSMonteCarlo) model which is
is one in which the Monte Carlo random number seed is determined by
the run and subrun numbers. Job names are of the form:-
job_rrrrrrrr_ssss
where
rrrrrrrr is an 8 digit zero padded run number
ssss is a 4 digit zero padded subrun number
The local job environment always includes:-
run=<run-number> subrun=<subrun-number>If the application script signals back RETRY with a single retry arg NEW_SEED i.e.
$GBS_LOG RETRY NEW_SEEDThen the MinosRSMJobAnalyser finds the highest subrun so far for the job's run, increments it and then renames the job to the this subrun number.
For example:-
cd $GBS_HOME
python
from LogAnalyser import LogAnalyser
LogAnalyser("2008-05-01","2008-05-29",["/home/user1/gbs_logs","/home/user2/gbs_logs","/home/user2/gbs_logs",])
would scan the 3 directories in the list for all log files and write
to the terminal output a summary for the May 2008.An HTML version of the output can be written to a file by using the 'html_file' argument:-
LogAnalyser(...,html_file="/home/user1/gbs_statistics.html")By default, the summary is entitled "MINOS Grid Production", but this can be changed:-
LogAnalyser(...,title="My Private Analysis")
To obtain the manager object:-
man = GetManager()The print out of a Manager reveals the following state:-
Schema version 1 Managing 9 TasksThe schema version is a global number that gets incremented each time the storage schema changes in any object. The mechanism is that, on start up, a Manager passes itself to a schema migrator (schema_migrator.py) which holds the current schema number. If the manager's version is out of date its WriteFamily method is called and its version number updated. See Schema Evolution.
task = man.AddTask("my_first_task")
task = man.AddTask("My_DCM_Query_task","DCMquery")
task = man.GetTask("my_first_task")
man.ListTasks()The header line is:-
The following tasks are setup up:- Name !Ready(other) Ready(retry) Sub(!run) Done(fail) Backend Script file ArgsThe counts are divided into 4 job phases: Not Ready, Ready, Sub and Done each of which show two numbers, the total and the number within that total that are the "exception". In general the fewer exceptions the better. The definitions are as follows:-
| Group | Description | Exception |
|---|---|---|
| !Ready | Not ready to run (normally because on HOLD) | other - anything not on HOLD |
| Ready | Ready to run | retry - retry jobs |
| Sub | Everything submitted to Ganga | !run - anything not actually running on a farm. |
| Done | Anything GBS has finished with | fail - anything that failed and needs user intervention |
Model: default Job Definition Script file: run_lcg.sh Global args: '' Global env: 'config =L010185_near_bhcurv,daikon_ver=daikon_04,mini_flux=no' Input Sandbox: 'input.dat' Output Sandbox: 'output.log,output.dat' Job Submission Disabled Backend: Local Limits: 10(single submit) 100(maximum) Mode: Production Managing 3 jobs Holding: 0 (0 other) Waiting: 2 (0 retries) Submitted: 1 (1 not running) Done: 0 (0 failed)It comprises:-
repr(task)works just well (it calls AsString).
task.SetScriptFileName("/home/west/work/minos/temp/my_first_script.sh")
task.SetScriptGlobalArgs("a string with spaces,123")
These are supplied to the application script before any defined by the
individual job's
SetLocalEnvironment(env_str)
task.SetGlobalEnvironment('config=L010185_near_bhcurv,daikon_ver=daikon_04,mini_flux=no')
These are in addition to any defined by the individual job's
SetLocalEnvironment(env_str)
If prettyPrint is True, prints environment as a list, one entry per line.
task.SetGlobalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
task.SetGlobalOutputSandbox('my_output_data.dat,my_output.log')
The number of jobs you task launches with each submit, and the maximum number are both configuration options, namely: DefaultMaxSubmitJobs and DefaultMaxGangaJobs
Local
PBS{:queue} e.g. PBS or PBS:prod4
LCG{:queue} e.g. LGC:lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
task.SetMaxGangaJobs(300)
task.SetMaxSubmitJobs(10)
GBS_MODE
which takes the current setting of the Task mode which allow
application scripts to respond to the current mode.
task.ClearErrorCountsJobs()which also has the side effect of changing their status to RETRY if it was FAILED. This does not wipe the histories of the jobs and when next submitted they will use the retry args that came from the previous attempt.
task.ClearErrorCountsJobs()and resets the jobs to their initial processing state, even if they were SUCCEEDED. The only process state retained is that if the jobs were held they will remain so.
task.RemoveJobs()even if they were SUCCEEDED.
job.HoldJobs()This can be useful if you want to prepare jobs ahead of time but don't want GBS to submit them until you are ready.
job.KillJobs()this will kill the Ganga jobs and place your Jobs on HOLD so they will have to be released before they can be resubmitted.
job.ReleaseJobs()
task.UpdateJobsStatus()
task.WriteHtmlReport("/home/west/work/minos/temp")
In that directory GBS will create:-
<task-name>.html Top level index <task-name>/ Directory holding job and Ganga job dataThe document produced allows you to get an overview of the task and to look at individual jobs, and their associated Ganga jobs (if any). The top level table has 3 columns:-
Job Name Ganga Job StatusEach row of the table has two colours; one for Job Name and Ganga Job and one for Status as follows:-
| Colour for Job Name and Ganga Job | Colour for Status |
|---|---|
| Not ready HELD | |
| Not ready Not HELD | |
| Ready | First attempt |
| Retry | |
| Submitted | Running |
| Not running | |
| Succeeded | |
| Failed | |
By adding a call to that method at the end of the python script you run with the cron job you can record the current status and look at if off-line.
task.RefreshJobStats()It should not be necessary (but harmless) for user to call this as child jobs call it when their state changes. Note: unlike UpdateJobsStatus() this simply asks jobs what their current state without checking Ganga.
WarnLowGridProxy(self,email_list,proxy_min_hours=3.,myproxy_min_days=3.)This does nothing if there isn't a valid GRID proxy or if 'email_list' is not defined. If both are true then an email is sent to 'email_list' if:-
for job_no in range(1,101):
task.AddProtoJob(str(job_no).zfill(8),str(job_no),"run=" + str(job_no))
task.ListJobs()
but are not written to disk. Exit from GBS and they are history.ProtoJobs are maniplated by the Task methods:-
repr(pjob)works just well (it calls AsString).
pjob.SetLocalEnvironment('var1=123,var2=a string with spaces,var3=456')
pjob.SetLocalInputSandbox('my_output_data.dat,my_output.log')
pjob.SetLocalOutputSandbox('my_output_data.dat,my_output.log')
pjob.job.SetScriptLocalArgs('123,a string with spaces,456')
Status: RETRY [] Associated Ganga Job ID: -1 Job Definition Script local args: '1001,1' Local environment: 'run=1001,subrun=1' Input Sandbox: 'input.dat' Output Sandbox: 'output.log,output.dat' Retry Status Try: 5 Retry Args: '' Early Fails: 5 Late Handled Fails: 0 Late Unhandled Fails: 0It comprises:-
repr(job)works just as well.
If prettyPrint is True, prints environment as a list, one entry per line.
job.SetLocalEnvironment('var1=123,var2=a string with spaces,var3=456')
job.SetLocalInputSandbox('/home/users/west/my_input_data.dat,../my_script.sh')
job.SetLocalOutputSandbox('my_output_data.dat,$printf "%s_%8.8d_%4.4.log" $config $run $subrun')
See Tutorial Input and Output Sandboxes
for further information.
job.SetScriptLocalArgs('123,a string with spaces,456')
job.Hold()If 'warn' is True a warning will be issued if job not suitable for Holding.
job.Kill()If 'warn' is True a warning will be issued if job not suitable for Killing. If successful this will kill the Ganga job and place the Job on HOLD so it will need to be released before it can be resubmitted.
job.Release()If 'warn' is True a warning will be issued if job not suitable for Releasing.
job.ClearErrorCounts()which also has the side effect of changing its status to RETRY if it was FAILED. This does not wipe the history of the job and when next submitted it will use the retry args that came from the previous attempt.
If 'warn' is True a warning will be issued if job not suitable for Error count clearing.
job.ClearHistory()which resets the job to its initial processing state, even if it was SUCCEEDED. The only process state retained is that if the job was held it will remain so. As this method wipes all job output it asks for confirmation before proceeding.
If 'warn' is True a warning will be issued if job not suitable for history clearing.
job.Remove()which removes the job entirely, even if it was SUCCEEDED. Not surprisingly it asks for confirmation before proceeding!
If 'warn' is True a warning will be issued if job not suitable for history clearing.
job.Submit()An warning will be issued if it is not suitable.
The method takes 3 option arguments. Set it True to make the job output perusable. See Tutorial: Job Perusal and Tutorial: Job Monitoring
Ganga supports the organisation of its jobs into a JobTree and within this structure the job's parent task will have created the folder:-
/gbs/<task-name>see GetGangaTreeDir(), into which it places all the Ganga jobs its Jobs create.
job.UpdateStatus()A warning will be issued if there is no associated Ganga job. If the Ganga job appears stalled, i.e. has stayed in the same state for too long, it is killed. Once the Ganga job is complete the UpdateStatus method moves all of the Ganga job files into it's own area and then erases the Ganga job. This has two advantages:-
gbs_grid_info.logwhich summarises the passage of the job within the GRID.
task.SetDCMQuery("""[ "run_type physics% and data_tier sntp-near
and physical_datastream_name spill%
and start_time < to_date('2006-02-18','yyyy-mm-dd')
and end_time > to_date('2006-02-17','yyyy-mm-dd')
and version cedar" ]""")
The data set is extended using the method:-