Last modified: Fri May 23 07:23:17 BST 2008
Nick West

Running Jobs: Practical Advice

This section contains practical advice about writing and submitting job scripts.


Writing Scripts


How do I merge stdout and stderr?

To merge stdout and stderr so that error messages appear in context, add the following to the top of your bash script:-
  exec 2>&1 
Alternatively you can get stdout and stderr in separate files, as well as having one file containing both, by putting these two lines at the beginning of your script:
  exec 2> >(tee std.err)
  exec 1> >(tee std.out)
This will send stderr to the file std.err, stdout to the file std.out, and both will go to the script's normal stdout (so you might run "./myscript > alloutput").
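
As a quick check of the first form, here is a minimal sketch that runs a stand-in job step with stderr merged into stdout. The names job_step and run_merged are made up for illustration. The subshell is only there so that the redirection does not affect the rest of the calling script; in a real job script the exec line would simply go at the top.

```shell
#!/bin/bash
# Illustrative only: job_step and run_merged are made-up names.

# A stand-in for real work that writes to both streams.
job_step() {
    echo "starting step"
    echo "warning: something odd happened" >&2
    echo "finished step"
}

# Run the step in a subshell so the redirection stays local to it.
run_merged() {
    (
        exec 2>&1    # from here on, stderr goes wherever stdout goes
        job_step
    )
}

# All three lines, including the warning, arrive on stdout in order.
run_merged
```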


Submitting and Monitoring Jobs


How do I direct my job to a specific queue?

The first step is to find out what queues are available and what their limits are. See What limits does this job queue have?

If you plan to submit your job using glite-wms-job-* then use the JDL Requirements attribute to force it to a specific queue, e.g.

  Requirements = other.GlueCEUniqueID == "lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS";

If you plan to submit your job using Ganga then set the job's backend.CE attribute e.g.

  j.backend.CE = 'lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS'
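
For the glite-wms-job-* route, the Requirements line has to sit inside a complete JDL file. The following is a minimal sketch of generating one from a script. The helper name write_jdl, the file name myjob.jdl and the executable myscript.sh are assumptions made for illustration, and the other attributes are just a plausible minimum rather than a recommended template.

```shell
#!/bin/bash
# Illustrative only: write_jdl, myjob.jdl and myscript.sh are made-up names.

# Write a minimal JDL file that pins the job to the given CE queue.
write_jdl() {
    local ce="$1" out_file="$2"
    cat > "$out_file" <<EOF
Executable    = "myscript.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"myscript.sh"};
OutputSandbox = {"std.out", "std.err"};
Requirements  = other.GlueCEUniqueID == "$ce";
EOF
}

jdl_file="${TMPDIR:-/tmp}/myjob.jdl"
write_jdl 'lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS' "$jdl_file"
cat "$jdl_file"
```

The resulting file would then be submitted with glite-wms-job-submit in the usual way.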


Where is my job and what is its state?

If you submitted your job using glite-wms-job-* then glite-wms-job-status reports the job queue it was submitted to and its current state, in output that contains:-
  ...
  Current Status:     Scheduled
  Status Reason:      Job successfully submitted to Globus
  Destination:        lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-minosL
  ..
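
If you want to pick these fields out in a script, a small filter is enough. This is a sketch that assumes the "Name:   value" layout shown in the sample; status_field is a made-up helper name, and the sample text is copied from the output shown here rather than produced by a live command.

```shell
#!/bin/bash
# Illustrative only: status_field is a made-up helper name.

# Print the value of one "Name:   value" field from status text on stdin.
status_field() {
    local name="$1"
    # Match the field name and its colon, then strip them and the padding.
    sed -n "s/^[[:space:]]*${name}:[[:space:]]*//p"
}

# Sample text copied from the glite-wms-job-status output shown in this answer.
sample='  Current Status:     Scheduled
  Status Reason:      Job successfully submitted to Globus
  Destination:        lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-minosL'

printf '%s\n' "$sample" | status_field 'Current Status'
printf '%s\n' "$sample" | status_field 'Destination'
```

In real use the sample would be replaced by the output of glite-wms-job-status for your job identifier, piped into status_field.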

If you submitted your job using Ganga then the job's status will give you its Ganga status, and its backend should tell you the queue it was submitted to and its GRID state, in output that contains:-

  ...
  status = 'completed' ,
  ...
  backend = LCG (
    status = 'Done (Success)' ,
    actualCE = 'lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS' ,
   ...
  ..


When will my job run?

You first need to know the job queue it was submitted to. See Where is my job and what is its state?

Then you can use the lcg-info command:-

    lcg-info --list-ce --vo minos.vo.gridpp.ac.uk \
    --query 'CE=lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M' \
    --attrs EstRespTime
which produces output like:-
- CE: lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
  - EstRespTime         63422
The time is in seconds.
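
A figure such as 63422 seconds is easier to judge in hours and minutes. A minimal sketch of the conversion follows; format_seconds is a made-up helper name.

```shell
#!/bin/bash
# Illustrative only: format_seconds is a made-up helper name.

# Convert a whole number of seconds into hours, minutes and seconds.
format_seconds() {
    local total="$1"
    local hours=$(( total / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%dh %dm %ds\n' "$hours" "$minutes" "$seconds"
}

format_seconds 63422    # prints "17h 37m 2s"
```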


Can I see what my job is doing or why it failed?

One of the GRID's endearing features is that if a job fails at the batch level e.g. the job exceeds some resource limit such as CPU time, then it gets thrown off and no output is returned with the job. This makes debugging somewhat challenging.

Fortunately, with the introduction of the gLite/WMS middleware it is possible to examine a job's output before the job has completed, and also to retrieve output from failed jobs, using the glite-wms-job-perusal command. Both Ganga and GBS support job perusal.

At the time of writing, the Ganga User Manual does not explain how to do this. To enable perusal, set the following before job submission:-

  job.backend.perusable = True
and then to view the output during job execution:-
  job.peek('stdout','cat')


What control do I have to hold, kill and resubmit my job?

I don't think there is any concept of holding a job; once a job is submitted it is placed in a queue and will eventually run, unless there are some clever tricks that can be played with the JDL. I will, however, raise it as an outstanding problem.

Jobs can be killed and resubmitted.

If you submitted your job using glite-wms-job-* then glite-wms-job-cancel can be used to kill it, and it can then be resubmitted in the usual way.

If you submitted your job using Ganga then the job's kill method can be used to cancel it and its resubmit method to resubmit it.

Caution: resubmit entirely erases the job's former output so take care not to issue it on a job that has ended normally unless you mean to start again.


What limits does this job queue have?

Recall, to see what queues are available to MINOS use the lcg-infosites command:-
    lcg-infosites --vo minos.vo.gridpp.ac.uk ce

This should give something like:-

  #CPU    Free    Total Jobs    Running    Waiting    ComputingElement
  --------------------------------------------------------------------
  1190      45             1          1          0    lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
  1190      45            56          9         47    lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
    72       1             0          0          0    t2ce02.physics.ox.ac.uk:2119/jobmanager-lcgpbs-minos
    72       1             7          0          7    t2ce02.physics.ox.ac.uk:2119/jobmanager-lcgpbs-short

or

    lcg-infosites --vo minos.vo.gridpp.ac.uk ce -v 2

which should give something like:-

  RAMMemory    Operating System    System Version    Processor    Subcluster name
  ---------------------------------------------------------------------------------
        512    ScientificSL        SL                PIII         lcgce02.gridpp.rl.ac.uk
       2048    Scientific Linux    3                 xeon         t2ce02.physics.ox.ac.uk

Then, once you have a specific queue, e.g. lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS, use the lcg-info command:-
    lcg-info --list-ce --vo minos.vo.gridpp.ac.uk \
    --query 'CE=lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS' \
    --attrs MaxRunningJobs,MaxCPUTime,MaxTotalJobs,AssignedJobSlots,Priority,MaxWCTime

which should give something like:-

- CE: lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
  - MaxRunningJobs      0
  - MaxCPUTime          60
  - MaxTotalJobs        0
  - AssignedJobSlots    0
  - Priority            1
  - MaxWCTime           120

Omit the --query 'CE=...' to look at all queues that MINOS can use.

Also recall that

   lcg-info --list-attrs
will give a complete list of all available attributes.
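
To use these limits in a script, the attribute values can be pulled out of the lcg-info listing. The following is a sketch that assumes each attribute line has the "- Name value" layout of the sample in this answer; attr_value is a made-up helper name and the sample text is copied from that output, not produced by a live query. No assumption is made here about the units of the values.

```shell
#!/bin/bash
# Illustrative only: attr_value is a made-up helper name.

# Print the value of one attribute from lcg-info style text on stdin.
# Assumes attribute lines look like "  - Name    value".
attr_value() {
    awk -v key="$1" '$1 == "-" && $2 == key { print $3 }'
}

# Sample text copied from the lcg-info output shown in this answer.
sample='- CE: lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
  - MaxRunningJobs      0
  - MaxCPUTime          60
  - MaxTotalJobs        0
  - AssignedJobSlots    0
  - Priority            1
  - MaxWCTime           120'

max_cpu=$(printf '%s\n' "$sample" | attr_value MaxCPUTime)
max_wc=$(printf '%s\n' "$sample" | attr_value MaxWCTime)
echo "MaxCPUTime=$max_cpu MaxWCTime=$max_wc"
```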


How heavily loaded is this job queue?

The procedure is very similar to that described in What limits does this job queue have? The only difference is that you list different attributes:-
    lcg-info --list-ce --vo minos.vo.gridpp.ac.uk \
    --query 'CE=lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS' \
    --attrs EstRespTime,TotalCPUs,Memory,ClockSpeed

which should give something like:-

- CE: lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
  - EstRespTime         0
  - TotalCPUs           1190
  - Memory              512
  - ClockSpeed          1001
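
As a complement to lcg-info, the lcg-infosites listing shown under What limits does this job queue have? can be filtered to find the queue with the fewest waiting jobs. This is only a sketch: least_waiting is a made-up helper name, the sample is copied from that listing, and the filter assumes each data row has exactly six whitespace-separated columns in the order shown. Ties go to the first queue listed.

```shell
#!/bin/bash
# Illustrative only: least_waiting is a made-up helper name.

# From an lcg-infosites "ce" listing on stdin, print the ComputingElement
# with the fewest waiting jobs. Data rows are assumed to have six columns:
# CPU, Free, Total Jobs, Running, Waiting, ComputingElement.
least_waiting() {
    awk 'NF == 6 && $1 ~ /^[0-9]+$/ {
             if (best == "" || $5 + 0 < min) { min = $5 + 0; best = $6 }
         }
         END { if (best != "") print best }'
}

# Sample rows copied from the lcg-infosites listing in this document.
sample='#CPU    Free    Total Jobs      Running Waiting ComputingElement
----------------------------------------------------------
1190      45       1              1        0    lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
1190      45      56              9       47    lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
  72       1       0              0        0    t2ce02.physics.ox.ac.uk:2119/jobmanager-lcgpbs-minos
  72       1       7              0        7    t2ce02.physics.ox.ac.uk:2119/jobmanager-lcgpbs-short'

printf '%s\n' "$sample" | least_waiting
```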


My job fails with "Cannot read JobWrapper output, both from Condor and from Maradona".

This rather unhelpful error message indicates that the batch job aborted, typically because some resource limit such as CPU or elapsed time has been exceeded. In such cases no output is returned and your only option is to rerun the job and use the glite-wms-job-perusal command. See Can I see what my job is doing or why it failed?


