GRID Problems

Last modified: Mon Jan 21 15:47:28 GMT 2008
Nick West
Return to home page
This section lists existing problems, and things that need further investigation.

There is also a page of Fixed Problems.


Monitoring and Debugging Jobs

Is there any way to check on running jobs submitted via edg-job-submit (possibly via Ganga)? If a job appears to be taking too long to run, it would be good to be able to check its stdout and stderr, and possibly other output files. When some resource limit is exceeded it appears that nothing gets returned. How do we debug cases like that? Should we use edg-job-attach?

As of November 2007 the situation appears to be:-


Once submitted, can jobs be held?

Once a job has been sent to an RB is there any control over it beyond canceling it? For example can it be held?

Status (December 2007): This should be available with the CREAM (Computing Resource Execution And Management) Service but is at least a few months away.


Retrieving data from the GRID

It's typical that the final results from production jobs are data sets that are analysed interactively outside the GRID environment. Have any experiments developed systems to help automate migrating data out of the GRID, or does the final step always involve running on a UI and using GRID tools to pull data out of an SE?

Also, the LFC isn't well suited to our needs, as most of our data is outside of its reach on local disk or at FNAL. So what are the best tools to read, write, list, create directories and delete data in an SE if we cannot use lcg-utils? The SRM tools (srmcp, srmls etc.) look ideal, but at Oxford they fail:-

  org.globus.ftp.exception.ServerException: 
    Server refused performing the request. 
      Custom message: Server reported transfer failure (error code1) 
      [Nested exception message:  Custom message: 
      Unexpected reply: 426 Transfer aborted, closing connection 
      :Unexpected Exception : java.net.ConnectException: Connection timed out].  
       Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  
       Custom message: Unexpectedreply: 426 Transfer aborted, closing connection :
       Unexpected Exception : java.net.ConnectException: Connection timed out
        at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:178)
        at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:105)
        at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1451)
        at org.globus.ftp.GridFTPClient.extendedGet(GridFTPClient.java:452)
        at org.globus.ftp.GridFTPClient.extendedGet(GridFTPClient.java:416)
        at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:842)
        at java.lang.Thread.run(Thread.java:534)
        GridftpClient:  transfer exception

and several people have also warned that SRM is not an end-user API, but one intended for developers of such APIs.

On the other hand, globus-url-copy worked for both reading and writing data, but how do I list, create directories and delete?

Status (January 2008): A version of lcg-utils that can be used independently of the LFC is available, although not yet on the UIs we use. Once it is, this should be our API of choice. I will need a version that includes lcg-ls, i.e. 1.6.4-1 or later [21 January 2008: 1.6.5 released].


Jobs abort when short term proxy expires despite long term MyProxy

We have yet to run any full-length production jobs on the GRID; despite having a long-lived MyProxy server running, all our jobs fail as soon as the short-term proxy expires! As the grid500M queue typically has a day's worth of jobs, our jobs often don't even start to run. Trying to simulate the problem in order to investigate it, I hit another: when jobs got within about an hour of the lifetime of the short-term proxy they would fail:-
   Globus error 158: the job manager could not lock the state lock file
A ticket [Gridpp #22053] has been raised.
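
The timing side of this problem is simple arithmetic, and a minimal Python sketch may make it concrete. It only checks whether a proxy that is never renewed would run out while the job is still queued or running. The function names are hypothetical, and the 12 hour proxy lifetime and 6 hour run time are assumptions for illustration; only the roughly one day queue wait comes from the text above.

```python
from datetime import timedelta


def expires_in_queue(proxy_lifetime, queue_wait):
    """True if an unrenewed proxy runs out before the job even starts."""
    return proxy_lifetime < queue_wait


def expires_before_finish(proxy_lifetime, queue_wait, run_time):
    """True if an unrenewed proxy runs out before the job has finished."""
    return proxy_lifetime < queue_wait + run_time


# Assumed values, for illustration only.
short_proxy = timedelta(hours=12)   # assumed short-term proxy lifetime
queue_wait = timedelta(hours=24)    # "a day's worth of jobs" in the queue
run_time = timedelta(hours=6)       # assumed length of a production job

if __name__ == "__main__":
    print(expires_in_queue(short_proxy, queue_wait))
    print(expires_before_finish(short_proxy, queue_wait, run_time))
```

With these numbers both checks are true, which is why a job can only survive if proxy renewal from MyProxy actually works.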

December 13: Derek Ross has an explanation of the Globus error 158: MyProxy proxies don't contain the VOMS extensions, and the RB isn't smart enough to renew that part separately, so the VOMS extensions get dropped when the proxy is renewed, which means that new connections using the new proxy get mapped to a different user. My problem was that without a role I was being mapped to minos003 by my proxy but to minossgm (i.e. my admin role) by the MyProxy proxy. So I could fix it by using my lcgadmin role, so that both mapped to minossgm.

December 14: Sadly that doesn't work for others. Tobi gets minos010 with a VO and minos004 without.
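
The mapping mismatch described in the December 13 and 14 notes can be sketched as a toy model in Python. This is not middleware code: the table simply records the account names quoted above, the user labels "nick" and "tobi" and all function and constant names are hypothetical, and the model treats each mapping as fixed, which is an assumption.

```python
# Kinds of proxy, as seen by the site's account mapping (labels are made up).
NO_VOMS = "no-voms"             # proxy renewed from MyProxy by the RB
VOMS_PLAIN = "voms"             # VOMS proxy with no role
VOMS_ADMIN = "voms-lcgadmin"    # VOMS proxy with the lcgadmin role

# (user, proxy kind) -> local account, using the accounts quoted in the text.
ACCOUNT_MAP = {
    ("nick", VOMS_PLAIN): "minos003",
    ("nick", VOMS_ADMIN): "minossgm",
    ("nick", NO_VOMS): "minossgm",
    ("tobi", VOMS_PLAIN): "minos010",
    ("tobi", NO_VOMS): "minos004",
}


def renew_via_rb(kind):
    """Model of RB renewal: whatever was submitted, the VOMS part is lost."""
    return NO_VOMS


def survives_renewal(user, submitted_kind):
    """True if the job maps to the same local account before and after renewal."""
    before = ACCOUNT_MAP[(user, submitted_kind)]
    after = ACCOUNT_MAP[(user, renew_via_rb(submitted_kind))]
    return before == after


if __name__ == "__main__":
    for user, kind in [("nick", VOMS_PLAIN), ("nick", VOMS_ADMIN), ("tobi", VOMS_PLAIN)]:
        print(user, kind, survives_renewal(user, kind))
```

In this model only the lcgadmin submission keeps the same account across renewal, which matches the observation that the workaround helps a user with an admin role but nobody else.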

December 15: Better news, the new gLite WMS middleware does do it properly. The setup of the MyProxy server is as before, but the WMS server, which replaces the RB, is smart enough to get a proxy from MyProxy and then contact the VOMS server to get the VOMS extensions for this new proxy, so it will match the proxy used to submit the job. I have tested it and it works although, as the WMS configuration isn't yet standard, the job had to be submitted with an explicit configuration file:-

glite-wms-job-submit -a -c ~/grid/config/lcgwms01.gridpp.rl.ac.uk.conf ./sleep_job.jdl

 [-a requests automatic proxy delegation]
Where ~/grid/config/lcgwms01.gridpp.rl.ac.uk.conf contains:-
[
        VirtualOrganisation = "minos.vo.gridpp.ac.uk";
        HLRLocation = "";
        NSAddresses = {
            "lcgwms01.gridpp.rl.ac.uk"
        };
        LBAddresses = {
            {"lcgwms01.gridpp.rl.ac.uk"}
        };
        WMProxyEndpoints = {
            "https://lcgwms01.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server"
        };

        MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk";
]
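
If more than one WMS needs such a file, the layout above could be generated rather than copied by hand. The following Python sketch is only an illustration: render_wms_conf is a hypothetical helper, and it assumes that the port (7443) and endpoint path shown above are the same for every WMS host, which may not hold elsewhere.

```python
def render_wms_conf(vo, wms_host, myproxy_host):
    """Return the text of a WMS client config in the layout shown above.

    Assumes the WMProxy port and path are those of the example file.
    """
    return (
        "[\n"
        f'        VirtualOrganisation = "{vo}";\n'
        '        HLRLocation = "";\n'
        "        NSAddresses = {\n"
        f'            "{wms_host}"\n'
        "        };\n"
        "        LBAddresses = {\n"
        f'            {{"{wms_host}"}}\n'
        "        };\n"
        "        WMProxyEndpoints = {\n"
        f'            "https://{wms_host}:7443/glite_wms_wmproxy_server"\n'
        "        };\n"
        "\n"
        f'        MyProxyServer = "{myproxy_host}";\n'
        "]\n"
    )


if __name__ == "__main__":
    # Reproduces the file above from its three variable parts.
    print(render_wms_conf(
        "minos.vo.gridpp.ac.uk",
        "lcgwms01.gridpp.rl.ac.uk",
        "lcgrbp01.gridpp.rl.ac.uk",
    ), end="")
```

The output would be written to a file such as the one passed to -c in the submit command above.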


The following have been fixed or have expired:-
