GRID Fixed Problems

Last modified: Sat Nov 24 16:12:48 GMT 2007
Nick West
This section lists past problems and investigations in case we want to revisit them. It does not include current problems.


ldapsearch - what objectclasses are returned?

Expired (has not been followed up for too long)

I still have problems understanding what objectclasses of a tree are selected when using a filter (objectClass=XXX). On 5 July 2006 I mailed Steve:-

Sorry, but I am still struggling here: exactly what is meant by:

  "instances in the tree"

I have been trying to reconcile what I get with the schema as shown in

 https://edms.cern.ch/file/454439/2/LCG-2-UserGuide.html#appGLUE

and specifically:-


             ---- .2. GlueCETop   
             |     |
             |     ---- .1. ObjectClass
             |     |     |
             |     |     ---- .1  GlueCE 
             |     |     |
             |     |     ---- .2  GlueCEInfo 
             |     |     |
             |     |     ---- .3  GlueCEState 
             |     |     |
             |     |     ---- .4  GlueCEPolicy 
             |     |     |
             |     |     ---- .5  GlueCEAccessControlBase 
             |     |     |
             |     |     ---- .6  GlueCEJob
             |     |     |
             |     |     ---- .7  GlueVOView
             |     |
             |     ---- .2. Attributes
             |     |     |
             |     |     ---- .1.  Attributes for GlueCE
             |     |     |              . . .
             |     |     |
             |     |     ---- .7.  Attributes for GlueVOView


I have tried some very simple filters to see what objectClass values I get:-

objectClass=GlueCETop

objectClass: GlueCE
objectClass: GlueCEAccessControlBase
objectClass: GlueCEInfo
objectClass: GlueCEPolicy
objectClass: GlueCEState
objectClass: GlueCETop
objectClass: GlueInformationService
objectClass: GlueKey
objectClass: GlueSchemaVersion
objectClass: GlueVOView

i.e. gives everything below except GlueCEJob but also has GlueInformationService, GlueKey and GlueSchemaVersion.

objectClass=GlueCE

objectClass: GlueCE
objectClass: GlueCEAccessControlBase
objectClass: GlueCEInfo
objectClass: GlueCEPolicy
objectClass: GlueCEState
objectClass: GlueCETop
objectClass: GlueInformationService
objectClass: GlueKey
objectClass: GlueSchemaVersion

i.e. everything at the same level except GlueVOView and also has GlueInformationService, GlueKey and GlueSchemaVersion.

objectClass=GlueCEPolicy

objectClass: GlueCE
objectClass: GlueCEAccessControlBase
objectClass: GlueCEInfo
objectClass: GlueCEPolicy
objectClass: GlueCEState
objectClass: GlueCETop
objectClass: GlueInformationService
objectClass: GlueKey
objectClass: GlueSchemaVersion
objectClass: GlueVOView

i.e. the same as objectClass=GlueCE but also has GlueVOView.

Perhaps the schema shown in the LCG User Guide is out of date but this is driving me nuts!
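
One reading that would reproduce all three outputs above: an LDAP search filter selects whole entries, and every matching entry comes back with all of the objectClass values it carries. The list printed is then the union over the matching entries, not a subtree of the schema diagram. The sketch below models that with two made-up entries; the objectClass sets are assumptions chosen to mirror the quoted output, not data read from a real information system.

```python
# Hypothetical entries: each is just the set of objectClass values it carries.
# These sets are assumptions picked to match the output quoted above.
CE_ENTRY = {
    "GlueCETop", "GlueCE", "GlueCEInfo", "GlueCEState", "GlueCEPolicy",
    "GlueCEAccessControlBase", "GlueInformationService", "GlueKey",
    "GlueSchemaVersion",
}
VOVIEW_ENTRY = {
    "GlueCETop", "GlueVOView", "GlueCEInfo", "GlueCEState", "GlueCEPolicy",
    "GlueCEAccessControlBase", "GlueKey", "GlueSchemaVersion",
}
ENTRIES = [CE_ENTRY, VOVIEW_ENTRY]


def classes_returned(wanted):
    """Union of objectClass values over entries matching (objectClass=wanted)."""
    matched = [entry for entry in ENTRIES if wanted in entry]
    return sorted(set().union(*matched))


for name in ("GlueCETop", "GlueCE", "GlueCEPolicy"):
    print(name, "->", classes_returned(name))
```

Under these assumptions objectClass=GlueCE matches only the first entry (so no GlueVOView), while objectClass=GlueCETop and objectClass=GlueCEPolicy match both.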

RAL dCache - recommended file sizes

I asked Derek Ross: "Am I right to assume that we have no control over the way files are organised on tapes? Also what size of tape gets allocated and is there a minimum average file size below which tape storage is inefficient? I ask this as I know that, at least using the ADS TAPE command, you can only write a maximum of 1000 files to a single tape."

He replied: For individual data files, 1GB is a reasonable size, but really anything over a few hundred MBs is okay.


TGFALFile fails: Communication error on send

30 June 2006 I am trying to use ROOT's TGFALFile to access a file given its LFN. Under the covers TGFALFile does:-
   Int_t ret = ::gfal_open64(pathname, flags, (Int_t) mode);

and in my case:  pathname = lfn:/grid/minos/nwest/tape/test/LVJ_F00034638_0000.mdaq.root
                 flags    = 0  RDONLY
                 mode     = 0644 (and not used for read)

but it fails:-
  file lfn:/grid/minos/nwest/tape/test/LVJ_F00034638_0000.mdaq.root
  can not be opened for reading (Communication error on send)
The error comes from gfal_open64 and I found some documentation at:
 http://grid-deployment.web.cern.ch/grid-deployment/gis/GFAL/gfal.3.html
In particular I need to set LCG_GFAL_VO and LCG_GFAL_INFOSYS and have a valid Grid proxy, which I have.
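
A missing setting is an easy thing to rule out first, so a small pre-flight check can confirm both variables are present before calling into GFAL. This is only a sketch: the helper name missing_gfal_settings is made up for illustration, and it does not check the Grid proxy.

```python
import os

# The two variables named above; the proxy is not checked here.
REQUIRED = ("LCG_GFAL_VO", "LCG_GFAL_INFOSYS")


def missing_gfal_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_gfal_settings()
if missing:
    print("not set:", ", ".join(missing))
else:
    print("GFAL environment looks complete")
```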

7 July Steve Traylen has contacted the storage group. He suggests raising a GGUS ticket, which I have done: 10129

11 July Jean-Philippe is investigating. He suggests trying gfal_open instead of gfal_open64. That fixes it, but only temporarily: after switching back to gfal_open64 to confirm that the problem still exists there, gfal_open fails too! He asked me to try gfal_testread. That works on the dcap: URL. After more tests he has isolated it: gfal_open64 does a dcap library dc_open with O_RDONLY|O_LARGEFILE and that fails:-

   Open failure : Invalid flags passed to open
Taking off O_LARGEFILE fixes it.
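
The flag change can be pictured as a bit mask. A minimal sketch, assuming the 32-bit x86 Linux value for O_LARGEFILE; the constants are written out by hand because Python's os module may report a different value, or none, on other platforms. The real fix belongs inside gfal/dcap, not in user code.

```python
O_RDONLY = 0
O_LARGEFILE = 0o100000  # assumption: value used on 32-bit x86 Linux


def without_largefile(flags):
    """Clear the O_LARGEFILE bit and leave every other flag untouched."""
    return flags & ~O_LARGEFILE


flags64 = O_RDONLY | O_LARGEFILE   # what gfal_open64 is reported to pass
print(oct(flags64), "->", oct(without_largefile(flags64)))
```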

12 July We are using dcap v 1-2-36. I went to http://www.dcache.org/downloads/IAgree.shtml and got the latest linux client (1-2-38). It fails too so I have contacted support@dcache.org (ticket 1106). Steve assigns the ticket to the dCache support unit.

14 August. I get a very brief reply from dCache support:-


  Thankyou for this bug, I hope this can be resolved in future versions of D-Cache.
  Regards
  Owen Synge

15 September. Derek Ross:

  "RAL-LCG2 has now upgraded from 1.6.6-1 to the latest
  production release 1.6.6-5. There is some mention on

    http://www.dcache.org/manuals/Book/rf-changelog-1661-1663.shtml

  that dccp now supports large files. So now would be a good time to
  retest to see if the problem still exists here."
If I try my current dcap client: 1-2-38 Nov 11 2005 it still fails, but if I don't supply a library it picks up /opt/d-cache/dcap/lib/libdcap.so which is 1-2-39 Jan 17 2006 and succeeds. Frustratingly, the download page
 http://www.dcache.org/downloads/IAgree.shtml
only lists dcap 1-2-39 for Solaris; for linux it is still 1-2-38! I have kludged a linux 1-2-39 tar file from /opt/d-cache/dcap/lib and that seems to work. I'll update my build to use dcap 1-2-39. Derek will ask dCache support to make the dcap 1-2-39 client tar available for linux.


TURL - when can it be released?

I had read about lcg-gt and of the importance of doing the closing lcg-sd, without which the server could end up with too many open requests. I understand that a TURL is ephemeral but don't quite understand how long it needs to live. When is the earliest I can safely dismiss it with lcg-sd?

Derek Ross: "For dCache (but not necessarily other SRMs), it is entirely possible to run the lcg-sd before you get anywhere near the actual data file, the TURL will not be invalidated (modulo the server it told you to use going down). But properly the lcg-sd should be called after you've closed the file."
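
The "proper" ordering Derek describes (get the TURL, use and close the file, then lcg-sd) maps naturally onto a context manager. The sketch below does not call the real commands: get_turl and set_done are hypothetical stand-ins for lcg-gt and lcg-sd, and the example.invalid names are placeholders.

```python
from contextlib import contextmanager


@contextmanager
def turl_session(get_turl, set_done, surl):
    """Yield a TURL for surl and always report 'done' afterwards."""
    turl, request = get_turl(surl)      # stand-in for lcg-gt
    try:
        yield turl
    finally:
        set_done(surl, request)         # stand-in for lcg-sd, even on error


# Fake callables that only record the order of events.
events = []


def fake_get_turl(surl):
    events.append("lcg-gt")
    return "dcap://example.invalid/some/file", "request-1"


def fake_set_done(surl, request):
    events.append("lcg-sd")


with turl_session(fake_get_turl, fake_set_done, "srm://example.invalid/some/file") as turl:
    events.append("open")    # the file would be opened via the TURL here
    events.append("close")

print(events)
```

The finally clause means the closing call still happens if the read raises, which is the case most likely to leave requests open on the server.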


dcap access fails: Server error message for [1]: ""path (errno 22)

I am trying to access the tape pool of the RAL dCache using the URL:-
  dcache-head.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/tape/minos/nwest/test/LVJ_F00034638_0000.mdaq.root 
using ROOT's TDCacheFile that makes calls to the dcap library. If I run from my UI (lcgui0358) it runs fine. There was a long delay the first time (retrieving from tape) but after that it is fast. However, if I submit a job to the RAL Tier 1 CE lcgce02.gridpp.rl.ac it fails:-
  Server error message for [1]: ""path (errno 22).
  Error in : 
      file dcache-head.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/tape/minos/nwest/test/LVJ_F00034638_0000.mdaq.root 
      does not exist
It looks like some path is not set correctly. I wondered if it was some environment variable that wasn't configured properly when on a WN but I don't see any reference to such in
   C - API to the dCache Access Protocol (dcap)
I have run a trivial dcap program to open and close the file and that runs O.K. from RAL Tier 1 (lcg0472) and RAL Tier 2 (heplnx41) so it looks like something goes wrong when wrapped in TDCacheFile. All tests are with Dcap version: version-1-2-36 Jun 20 2005 11:00:59. On the backends I am testing with ROOT 5.08 and, looking at TDCacheFile, there have been changes since it was released back in December 2005. I need to test with something more up to date.

I have now installed ROOT 5.11/06 and the error has changed:-

Command failed!
Server error message for [1]: ""path (errno 22).

Processing reco_far_Alt_All_dev.C...
Error in : error reading all requested bytes from file dcap://dcache-head.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/tape/minos/nwest/test/LVJ_F00034638_0000.mdaq.root, got 403129 of 1634559346
Error in : error reading all requested bytes from file dcap://dcache-head.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/tape/minos/nwest/test/LVJ_F00034638_0000.mdaq.root, got 0 of 1024
Error in : error reading all requested bytes from file dcap://dcache-head.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/tape/minos/nwest/test/LVJ_F00034638_0000.mdaq.root, got 0 of 6640
Error in : Unknown class
Info in : cannot find the StreamerInfo record in file dcap://dcache-head.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/tape/minos/nwest/test/LVJ_F00034638_0000.mdaq.root
I have confirmed that this still works O.K. from the UI.
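
A side observation on the first error line, offered as a guess rather than a diagnosis: the requested byte count 1634559346 is 0x616D6572, and those four bytes read as ASCII text. That would be consistent with a length field being taken from the wrong place in the data, though it could be coincidence. The check assumes the count is a big-endian 32-bit integer.

```python
import struct

requested = 1634559346                 # "got 403129 of 1634559346"
raw = struct.pack(">I", requested)     # big-endian 32-bit, an assumption
print(hex(requested), raw)
```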

I have R1.23 installed with ROOT 5.12/00 on both the UI and RAL Tier 2 so I can try:-

  loon -bq reco_far_Alt_All_dev.C  \
    dcap://dcache-head.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/tape/minos/nwest/test/LVJ_F00034638_0000.mdaq.root
and

rsd job submit lcg:heplnx206.pp.rl.ac.uk run_loon.sh loon_tier_2_dcap \
   --arguments="R1.23-build_1 reco_far_Alt_All_dev.C dcap://dcache-head.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/tape/minos/nwest/test/LVJ_F00034638_0000.mdaq.root"\
   --input_sandbox=reco_far_Alt_All_dev.C\
   --output_sandbox="ntupleStA.root ntupleStA.sub.root"\
   --output_dir=/rutherford/minos-soft2/rsd_work
on both. It runs fine on the UI; the log file starts:-
Warning in : class TSQLStatement already in TClassTable
Warning in : class timespec already in TClassTable
loon [0]
Processing reco_far_Alt_All_dev.C...
Successfully opened connection to: mysql:odbc://sql.gridpp.rl.ac.uk/minos_temp?option=1;
Successfully opened connection to: mysql:odbc://sql.gridpp.rl.ac.uk/minos_offline?option=1;
Running on RAL Tier 2 (heplnx26):-
Warning in : class TSQLStatement already in TClassTable
Warning in : class timespec already in TClassTable
Command failed!
Server error message for [1]: ""path (errno 22).
Processing reco_far_Alt_All_dev.C...
Successfully opened connection to: mysql:odbc://sql.gridpp.rl.ac.uk/minos_temp?option=1;
Successfully opened connection to: mysql:odbc://sql.gridpp.rl.ac.uk/minos_offline?option=1;
but it did run and produce output. I have repeated on RAL Tier 1 (lcg0616) and got the same results i.e. the odd
Command failed!
Server error message for [1]: ""path (errno 22).
but otherwise O.K.. I'll declare this solved.


SysError in <TGFALFile::TGFALFile>: file ... can not be opened for reading (Invalid argument)

Expired (has not been followed up for too long)

21 September 2006, having fixed TGFALFile fails: Communication error on send, I had hoped TGFALFile might now work, but with the latest ROOT (cvs) and my hacked dcap 1-2-39 tar file from /opt/d-cache/dcap/lib it fails:-

Info in <TPluginManager::FindHandler>: found plugin for TGFALFile
Info in <TUnixSystem::Load>: loaded library /rutherford/minos-soft2/OO/minos_packs/root_cvs/root/lib/libGFAL.so, status 0
Info in <TPluginManager::FindHandler>: did not find plugin for class TArchiveFile and uri lfn:/grid/minos/nwest/tape/test/LVJ_F00034638_0000.mdaq.root
free(): invalid pointer 0x68c2078!
Exanded file name: ( binary i.e. unprintable data ? here)
Open with pathname ( binary i.e. unprintable data ? here) flags 0 mode 420
Ret -1
SysError in <TGFALFile::TGFALFile>: file ( binary i.e. unprintable data ? here) can not be opened for reading (Invalid argument)
Open failed-1
Segmentation fault (core dumped)
It looks like a corrupted path name.
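
A quick check on the debug line above supports that reading. Assuming the debug print shows the mode in decimal, 420 is the octal 0644 quoted in the earlier TGFALFile entry, and flags 0 is the read-only value given there, so those two arguments look intact and only the path name looks damaged.

```python
mode_in_log = 420          # from "flags 0 mode 420"
print(oct(mode_in_log))    # 0o644
```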


Replication failure: lcg-rep gives "No such file or directory"

I thought that I would start playing with file replication now that I
have two SEs:-

    RAL T1 (dcache.gridpp.rl.ac.uk)  
and RAL T2 (heplnx204.pp.rl.ac.uk).

I have environment:-

  LCG_GFAL_INFOSYS=lcgbdii02.gridpp.rl.ac.uk:2170
  LCG_GFAL_VO=minos
  LFC_HOST=lfc.gridpp.rl.ac.uk

Inspired by Steve Traylen's example:-

  4) Replicate one to CERN.
  
  $ lcg-rep -v  --vo dteam  lfn:/grid/dteam/user/t/traylen/mygroupfile1 -d castorsrm.cern.ch
  
I tried:-

  lcg-rep -v --vo minos -d heplnx204.pp.rl.ac.uk \
    lfn:/grid/minos/nwest/test/LVJ_F00034638_0000.mdaq.root

which gives

  Using grid catalog type: lfc
  Using grid catalog : lfc.gridpp.rl.ac.uk
  lcg_rep: No such file or directory

even though

  lcg-lr lfn:/grid/minos/nwest/test/LVJ_F00034638_0000.mdaq.root

gives

  srm://dcache.gridpp.rl.ac.uk/pnfs/gridpp.rl.ac.uk/data/minos/nwest/test/LVJ_F00034638_0000.mdaq.root

If I turn my LFN into a GUID or SURL I get the same thing.  Well
mostly I do, but just occasionally, for variety it returns:-

  No information found for SE : dcache.gridpp.rl.ac.uk
  lcg_rep: Invalid argument
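
When retrying the same replication with an LFN, a GUID and a SURL, it helps to build the command line in one place so that only the source name changes. A minimal sketch: both helper names are made up for illustration, the prefixes are just the ones that appear in this log, and nothing is executed.

```python
def grid_name_kind(name):
    """Classify a grid file name by its prefix (prefixes as seen in this log)."""
    for prefix, kind in (("lfn:", "LFN"), ("guid:", "GUID"), ("srm://", "SURL")):
        if name.startswith(prefix):
            return kind
    return "unknown"


def lcg_rep_argv(vo, dest_se, source):
    """Build the lcg-rep command line in the form used above; runs nothing."""
    if grid_name_kind(source) == "unknown":
        raise ValueError("expected an lfn:, guid: or srm:// name: " + source)
    return ["lcg-rep", "-v", "--vo", vo, "-d", dest_se, source]


argv = lcg_rep_argv(
    "minos",
    "heplnx204.pp.rl.ac.uk",
    "lfn:/grid/minos/nwest/test/LVJ_F00034638_0000.mdaq.root",
)
print(" ".join(argv))
```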
3 Aug: Steve Traylen can reproduce my problem and is talking to Chris Brew.

15 Aug Chris Brew fixes the problem:-

  lcgui0357:~>lcg-cr -v --vo minos file:/etc/group -d heplnx204.pp.rl.ac.uk
  Using grid catalog type: lfc
  Using grid catalog : lfc.gridpp.rl.ac.uk
  Source URL: file:/etc/group
  File size: 571
  VO name: minos
  Destination specified: heplnx204.pp.rl.ac.uk
  Destination URL for copy: gsiftp://heplnx204.pp.rl.ac.uk:2811//pnfs/pp.rl.ac.uk/data/minos/generated/2006-08-15/file57d8a9e0-2fda-4190-8c58-d2720f20c30f
  # streams: 1
  # set timeout to 0 seconds
  Alias registered in Catalog: lfn:/grid/minos/generated/2006-08-15/file-553dbd25-bc07-4ce3-b78b-38b26cab4370
            571 bytes      0.77 KB/sec avg      0.77 KB/sec inst
  Transfer took 2030 ms
  Destination URL registered in Catalog: srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/minos/generated/2006-08-15/file57d8a9e0-2fda-4190-8c58-d2720f20c30f
  guid:b6d009d8-79fc-440f-8f88-6077e1f79b29
  lcgui0357:~>lcg-lr --vo minos lfn:/grid/minos/generated/2006-08-15/file-553dbd25-bc07-4ce3-b78b-38b26cab4370
  srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/minos/generated/2006-08-15/file57d8a9e0-2fda-4190-8c58-d2720f20c30f
