Data Cache Manager V01-16-07

- access to MINOS Storage Elements

Last modified: Tue Nov 11 09:04:12 GMT 2008

Nick West

Introduction

A Generic Interface to MINOS Data

MINOS jobs running in the UK need access to data held in a variety of locations, and the situation is complicated by a number of further factors.

DCM presents a uniform interface to all data sources. A user makes a series of "requests", which may be either SAM queries or explicit file names, and these pass through a 4-stage process:-
  1. Resolution - A SAM query is resolved into a list of file names.
  2. Location - The SE catalogues are searched to find the best SE to get each file from.
  3. Cache check - The local NFS disk catalogue is checked to see if the file is already on local disk.
  4. Access - The client is given access to the file. This can mean copying the file from an SE to disk and then passing back the name of the local copy or, if acceptable, passing back an access URL that allows the client to read directly from the SE.
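
The four stages above can be pictured with a minimal sketch. This is Python written for illustration only, not DCM's own Perl code; the function name process_request, the resolve callback and the in-memory catalogues are all assumptions.

```python
from typing import Callable, Dict, List, Set


def process_request(
    request: str,
    resolve: Callable[[str], List[str]],   # hypothetical SAM resolver
    se_catalogues: Dict[str, Set[str]],    # SE name -> file names, in search order
    local_files: Set[str],                 # stand-in for the local disk catalogue
) -> List[str]:
    """Return one line per file describing how it would be accessed."""
    # 1. Resolution: a SAM query (square brackets) becomes a list of names.
    names = resolve(request) if request.startswith("[") else [request]
    results = []
    for name in names:
        # 2. Location: first SE, in search order, whose catalogue holds the file.
        se = next((s for s, files in se_catalogues.items() if name in files), None)
        # 3. Cache check: is the file already on local disk?
        if name in local_files:
            results.append(f"{name}: already on local disk")
        elif se is not None:
            # 4. Access: copy from the SE (or hand back an access URL).
            results.append(f"{name}: fetch from {se}")
        else:
            results.append(f"{name}: not found")
    return results
```

The sketch only decides what would happen to each file; the real transfers and URL hand-back are outside its scope.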
DCM confers additional advantages over directly accessing the SEs:-

Storage Element Access, Names and The DCM URL

DCM has a list of locally available Storage Elements (SEs) from which it builds ASCII flat file master catalogues, and it then uses these to locate and transfer files to local disk. It rebuilds these catalogues afresh by scanning the local SEs each night to ensure they stay in sync. So even if files are written to a local SE without DCM, DCM will quickly know about them.

Several times a day a cron job takes these master catalogues and places them in a web-visible directory. A DCM running at a site that does not have access to the master catalogues instead uses slave copies, which it refreshes from the web when they are more than a few hours old.

DCM names SEs using the following syntax:-

  <site>-<type>-<service>
  e.g. ral_t1-castor-test_d0t1

Where:-

  <site>    Site name e.g. ral_t1 (RAL Tier 1) or fnal
  <type>    The storage technology e.g. dcache or castor
  <service> The individual service e.g. tape

When DCM runs, its initial output includes a list of the SEs it can access. For example:-
Local Storage Elements (in search order):-
  ral_t1_ui-nfs                   Local NFS Disks
  ral_t1-castor-prod_d0t1         RAL T1 CASTOR disk0tape1 Production Service
  ral_t1-castor-test_d0t1         RAL T1 CASTOR disk0tape1 Test Service
  ral_t1-dcache-disk              RAL T1 dCache Disk Store
  ral_t1-dcache-tape              RAL T1 dCache Tape Store
  fnal-dcache-enstore             FNAL dCache interface to Enstore
The order reflects the default search order used to retrieve files. Note how the local disk is treated as an SE.
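
As an illustration of the naming syntax, here is a small Python sketch that splits an SE name into its parts. It is not part of DCM; parse_se_name and the SEName type are hypothetical, and the sketch assumes that a two-part name such as ral_t1_ui-nfs simply has no service component.

```python
from typing import NamedTuple, Optional


class SEName(NamedTuple):
    site: str
    technology: str          # the <type> field, e.g. dcache or castor
    service: Optional[str]   # None when the name has no <service> part


def parse_se_name(name: str) -> SEName:
    """Split <site>-<type>-<service>; fields themselves use '_' not '-'."""
    parts = name.split("-")
    if len(parts) == 2:
        return SEName(parts[0], parts[1], None)
    if len(parts) == 3:
        return SEName(parts[0], parts[1], parts[2])
    raise ValueError(f"not a recognisable SE name: {name!r}")
```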

DCM uses the SE names as the basis for a "DCM URL". The syntax is:-

  dcm://<SE_name>/<SE_dir>/<File_name>#<byte_size>
  e.g. dcm://fnal-dcache-enstore/pnfs/fs/usr/minos/reco_near/R1_18/snts_data/2005-04/N00007148_0008.spill.snts.R1_18.0.root#129262
When requested to transfer files, DCM first converts the file names into URLs, which are then used to determine the appropriate commands to perform the operation.
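
To make the URL syntax concrete, the following Python sketch packs and unpacks a DCM URL. It is illustrative only: the names DcmUrl, unpack_dcm_url and pack_dcm_url are hypothetical and are not DCM's own routines.

```python
from typing import NamedTuple


class DcmUrl(NamedTuple):
    se_name: str
    se_dir: str
    file_name: str
    byte_size: int


def unpack_dcm_url(url: str) -> DcmUrl:
    """Split dcm://<SE_name>/<SE_dir>/<File_name>#<byte_size> into its parts."""
    prefix = "dcm://"
    if not url.startswith(prefix):
        raise ValueError(f"not a DCM URL: {url!r}")
    body, sep, size = url[len(prefix):].rpartition("#")
    if not sep or not size.isdigit():
        raise ValueError(f"missing or bad byte size: {url!r}")
    se_name, _, path = body.partition("/")         # SE name ends at first '/'
    se_dir, _, file_name = path.rpartition("/")    # file name follows last '/'
    return DcmUrl(se_name, se_dir, file_name, int(size))


def pack_dcm_url(u: DcmUrl) -> str:
    """Inverse of unpack_dcm_url."""
    return f"dcm://{u.se_name}/{u.se_dir}/{u.file_name}#{u.byte_size}"
```

Because the SE name contains no '/', the first '/' after the scheme cleanly separates it from the directory.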

Users may specify requests in the form of DCM URLs and for this purpose a basic form of wildcarding is allowed:-

For example:-
  "dcm://ral_t1-castor-prod_d0t1/user/nwest/grid_tests/.*\.root"
i.e. any .root file in the directory user/nwest/grid_tests/

The advantage of using DCM URLs is that it bypasses the catalogue look-up so latency issues associated with the catalogues, which could make them over a day out of date, can be avoided. The cost is that the user has to know the exact location of the files.

Local Disk Management

DCM retains its original function of local disk management and on each disk it manages there must be a top level directory:
    dcm_cache/
which is where DCM will place files, although it can also manage files that users have placed elsewhere on these disks.

On the first disk in the list there must also be a top-level directory:-

    dcm_catalogue/
This is the "soft links catalogue", which is where DCM places soft links to data files on all the disks it manages. That directory has the sub-directory
    DCM/
where DCM maintains its text catalogues. It also holds a series of Global Log files of the form
    global_log_YYYY-MM-DD  e.g. global_log_2007-05-25
together with 2 soft links:-
    global_log_current
    global_log_previous
which point to the latest two. These log files record file transfers, both successes and failures.

When DCM is run it starts by listing the disks it is managing. For example:-

DCM configuration:-
  List of DCM-managed disks:      /stage/minos-data1
  List of excluded directories:
  Ownership group:                minos
  Soft links catalogue:           /stage/minos-data1/dcm_catalogue
  DCM catalogue directory:        /stage/minos-data1/dcm_catalogue/DCM
  Scratch directory               /tmp/dcm_scratch_area_9321

Running on a Worker Node (WN)

Production jobs running on GRID WNs require DCM to move data in and out of accessible SEs. However this presents a basic problem: the locally available disk is different on each WN and is cleared before the job starts to run. This means that there is no local disk cache nor does DCM have anywhere locally to store either catalogues or global logs.

Access to a local (scratch) disk isn't a problem; its name ($work_dir) is communicated via the environment (see get_site_info.pm) and on start-up DCM can create the standard subdirectories that it requires.

At least in the current LCG implementation there is a shared disk where files can be held permanently - the software disk - and as the logs and catalogues are small it's possible to have a group-writable directory on the software disk and store them there. However, that only solves part of the problem. This disk is not the same as the one that holds the catalogues and logs on the User Interface (UI), so there has to be a system to keep them synchronised.

The system works as follows:-

  1. When DCM is installed on the GRID using RSD it will create, if it doesn't already exist, the software top level directories
    dcm_data/catalogues
    dcm_data/resource_locks
    
    and make them group writable.

  2. When DCM runs on a WN it detects from the environment that it is on a WN and switches to "Worker Node mode". It initialises its local disk and sets its catalogue directory to dcm_data/catalogues.

  3. DCM records operations into the global log on its catalogue directory. Although nothing is in place yet, it should not be too hard to collect up the logs every few days and merge them to get a global picture of how well the system is running.

[July 2007: The system to propagate the catalogues to the software disk is not yet in place so for now the system only runs on RAL T1 and gets the catalogues from the UI disk.]

Although initially we only expect to run on RAL T1 and T2, this model would also work if operations were extended to other nearby T2s. In the longer term we should start to use the LCG File Catalog (LFC) as this is accessible from WNs. However, this catalogue is not searchable in the same way as the current DCM catalogue; it's more like a dCache directory structure, so we may have to change the way we form queries and be explicit about the directories to be searched before we can take this step.

User Manual

Command syntax

    $MINOS_TOOLS/dcm.sh  {global options} command {command options} {command args}

Global Options

  --debug n    Switch on debug level n (=0 off)
  --expt  e    Selected experiment.  Allowed values: minos [default] and sno
  --site  s    Select site.  CAUTION: use for testing only!!

The catalogue command

catalogue {<file>...<file>...} { --all}
Example:  catalogue  /stage/minos-data1/d4/C00080277_0000.mdaq.root
This adds the file to the text catalogue and also adds a soft link to it in the soft links catalogue:-
    dcm_catalogue/
which must be the top level directory on the first disk managed by DCM. This command is useful when adding a file that is not within the set of directories managed by DCM.
Example:  catalogue --all
This uses the results of the last disk scan (see the survey command) and checks that all the data files it found are in the text and soft links catalogues.

The directory_ownership command

directory_ownership {mode}
where
  mode  (optional):-
          "full" [default] show every directory
          "compress" suppress sub-directories wholly owned by a single user
This command uses the results of the last disk scan (see the survey command) and reports, for each data directory, the users who own files in it including sub directories.

The disk_usage command

disk_usage {command option}  ...

Command Options

 --user_lists <user_list_dir>  Produce set of files_owned_by_<user>.txt and disk_usage.txt in <user_list_dir>
This uses the results of the last disk scan (see the survey command) to produce a summary of usage, both by disk and by user.

DCM classifies all files into 1 of 4 types:-

The get command

get {command options} file-query  file-query ...
Transfer one or more files from an SE (Storage Element).

Command Options

  --accept_dcm_url  Return files as DCM URLs; doesn't attempt any transfers

  --accept_root_url Return files as ROOT URLs if supported; otherwise transfer.

  --demand_complete_set
                    Quit without getting any files unless able to get them all
                    Default: return whatever files can be located.

  --file_list f     If command succeeds, record list of files (or URLs) in file f.
                    Will include all files i.e. even those already on disk.
                    Caution: On input f must not exist.

  --force_local     Force a copy to local dir (see  --local_dir) unless already
                    there.

  --local_dir d     Copy files to specified directory.  
                    Default: the dcm_cache directory of the disk with most space

  --max_files n     Set upper limit on number of files to transfer.
                    Default 10.  Hard upper limit of 1000 files.
                    Used to prevent misplaced wildcard from transferring
                    huge amounts of data!

  --num_get_jobs n  Run up to n transfer jobs at once.
                    Default 1.  Hard upper limit of 10 jobs.

  --names_not_unique
                    Use this option if the file names are not guaranteed to be unique.
                    It prevents DCM from checking whether it can find a copy already on
                    the local disk rather than getting it from the SE.
                    USE FOR FLUX FILES: otherwise DCM could find the wrong version (it has happened!)

  --preserve_rel_dir d
                    Preserve relative directory structure: Directory d in the SE maps to top of 
                    local dir.  It's useful when downloading flux files e.g.:-
                       --remote_se "ral_t1-castor-prod_d0t1/flux/gnumi/v19/fluka05_le010z185i/job[0-3]"
                       --preserve_rel_dir flux/gnumi (or even --preserve_rel_dir gnumi)
                       --local_dir /some/local/dir
                    Files would be written to /some/local/dir/v19/fluka05_le010z185i/...
                    Local directories will be created as necessary.
                    If the remote SE directory path does not start with d the file is
                    placed in the top level directory.

  --remote_se se_name{/se_dir} 
                    Only copy files from selected SE  {and within selected /se_dir}
                    e.g. --remote_se ral_t1-castor-test_d0t1/gnumi/v19/fluka05_le010z185i
                      Only look in SE ral_t1-castor-test_d0t1 within directory gnumi/v19/fluka05_le010z185i
                    e.g. --remote_se 'ral_t1-dcache-disk/gnumi/v19/fluka05_le010z185i/job1.*'
                      Only look in SE ral_t1-dcache-disk within directory sub-tree gnumi/v19/fluka05_le010z185i/job1.*
                      Note: . - any single char; .* - any char string

                    Use the se_name "not_nfs" to just exclude the local disk.

  --test            Determine what files have to be transferred and from where  but 
                    don't transfer files
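
The --preserve_rel_dir mapping described above can be sketched as follows. This is a Python illustration, not DCM's code, and local_target_dir is a hypothetical name. Because the example allows both "flux/gnumi" and "gnumi" for a path beginning flux/gnumi, the sketch assumes that d may match anywhere in the SE directory path as a run of whole path components.

```python
import posixpath


def local_target_dir(se_dir: str, preserve: str, local_dir: str) -> str:
    """Map an SE directory to a local directory, keeping the part below d."""
    se_parts = [p for p in se_dir.split("/") if p]
    keep = [p for p in preserve.split("/") if p]
    n = len(keep)
    # Look for d as consecutive path components of the SE directory.
    for i in range(len(se_parts) - n + 1):
        if n and se_parts[i:i + n] == keep:
            # Everything below d is recreated under local_dir.
            return posixpath.join(local_dir, *se_parts[i + n:])
    # d not found: the file goes in the top level directory.
    return local_dir
```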

Command Args

file-query  

  Either: File name 
          e.g. F00030574_0002.mdaq.root
          or an 'egrep' wildcard regular expression: 'F000256.*.cand.R1.14.root'
          Note: . - any single char; .* - any char string
          CAUTION: Once a match is found in any SE, DCM stops searching.

  Or:     A database query for SAM enclosed in square brackets
          e.g. [ file_name like N00008695_002%.cosmic.sntp.R1_18.0.root ]
          e.g. [     "run_type physics% 
                 and data_tier sntp-near 
                 and physical_datastream_name spill%
                 and start_time < to_date('2006-02-18','yyyy-mm-dd') 
                 and end_time   > to_date('2006-02-17','yyyy-mm-dd') 
                 and version cedar" ]
          e.g. [ dataset_def_name gemma3-Cedar-near-all-sntp-2007-5-w2 ]

          Make sure there is a space after the leading '[' or the shell
          command parser may treat it as a wildcard construction.

          Enclose in double quotes if query includes parentheses.

  Or:     A DCM URL e.g. dcm://fnal-dcache-enstore/pnfs/fs/usr/minos/rec ... .snts.R1_18.0.root#129234

All 3 types of command arg may be mixed in the same invocation. DCM first executes all SAM queries to resolve them into file names. Then, for file names that are not already DCM URLs, it searches the SE catalogues and converts them to DCM URLs. It then transfers any that it locates that are not already on local disk.

Note that the 2 stage approach allows users to have a dataset defined by a SAM query and yet retrieve files from the closest SE.

Note that, for a given file-query, DCM stops searching SE catalogues as soon as it finds any match. The logic is that a dataset should always be defined by applying a search to a single SE and not by the logical OR of all SEs. So if you want to copy some data set, say a group of files matching a wildcard, and some are already on the local disk, then by default DCM will find only those and will not copy the rest. The solution is to use the --remote_se option to force DCM to look at the SE which has the full set; it will still check the local disk so there is no risk that it will copy files it already has.
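
The first-match behaviour can be illustrated with a short Python sketch. It is not DCM's implementation: locate, the in-memory catalogues and the only_se parameter (standing in for --remote_se) are assumptions, and the sketch assumes egrep-like matching anywhere in the file name.

```python
import re
from typing import Dict, List, Optional, Tuple


def locate(
    pattern: str,
    catalogues: Dict[str, List[str]],   # SE name -> file names, in search order
    only_se: Optional[str] = None,      # restrict the search to a single SE
) -> Tuple[Optional[str], List[str]]:
    """Return (SE name, matching files) for the first SE with any match."""
    regex = re.compile(pattern)
    for se_name, files in catalogues.items():
        if only_se is not None and se_name != only_se:
            continue
        hits = [f for f in files if regex.search(f)]
        if hits:
            # Stop at the first SE with a match; later SEs are never consulted.
            return se_name, hits
    return None, []
```

With a partial copy of a dataset on local disk, the unrestricted search returns only the local files, while restricting the search to the SE that holds the full set returns them all.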

If using the --file_list option be sure that the name of the file you pass is unique. The normal way to do that is to include the process ID (shell variable $$) in the file name. Otherwise, on a system with multiple jobs running that all get files via DCM, there is a danger that two might use the same name to return their file list. As an additional precaution, DCM will reject the command if it is passed a pre-existing file.
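
For a job that drives DCM from a script rather than directly from the shell, the same idea might look like the sketch below. It is Python for illustration; the helper names and the dcm_file_list_<pid>.txt naming pattern are assumptions, and only the get command and the --file_list option come from this manual.

```python
import os
import tempfile
from typing import List, Optional


def file_list_path(directory: Optional[str] = None, pid: Optional[int] = None) -> str:
    """Build a per-process file name and refuse to reuse an existing file."""
    directory = tempfile.gettempdir() if directory is None else directory
    pid = os.getpid() if pid is None else pid
    path = os.path.join(directory, f"dcm_file_list_{pid}.txt")
    if os.path.exists(path):
        # DCM itself rejects a pre-existing file, so fail early here too.
        raise FileExistsError(path)
    return path


def build_get_command(dcm_path: str, file_list: str, queries: List[str]) -> List[str]:
    """Assemble the argument list for a 'get' with --file_list."""
    return [dcm_path, "get", "--file_list", file_list, *queries]
```

The resulting list can be handed to subprocess.run without involving a shell, which also sidesteps the quoting issues around '[' noted above.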

The --accept_dcm_url option can be useful to see what files would satisfy a request without doing any transfer. The --test option, by contrast, only shows the files that would have to be transferred (and hence whether any transfers would take place at all), whereas the URL request also shows files already on local disk. The resultant URLs can later be passed to DCM for transfer, so long as they are still valid. This might be useful when running a job on a Worker Node if no catalogue were available.

The help command

The help command provides brief on-line help, but for details this document should be consulted.

The put command

put {command options} file_name file_name ...
Transfer one or more files to an SE (Storage Element).

Command Options

  --create_remote_dir
                     If necessary create remote directory
  --file_list f      If command succeeds, record list of files transferred
                     Each line of file is:-
                       Either: Name of file successfully written
                           Or: Error message starting with the character '?'
                     Caution: On input f must not exist.
  --local_dir d      Copy files from specified directory.  Default: current directory
  --overwrite        Overwrite existing file.  Default don't overwrite
  --remote_se se_name/se_dir
                     Directory on SE. Compulsory
  --test             Just test, don't transfer files

Command Args

file-name   File name relative to  --local_dir.  
            No wild-cards permitted and no check that file is recognisable as a data file.

The survey command

survey {<se>...<se>...}
Example: survey ral_t1-castor-test_d0t1 fnal-dcache-enstore
This command rebuilds the catalogues for the selected SEs or from all available SEs if none is specified. The resulting catalogue is stored in
  dcm_catalogue/DCM/<SE name>.cat
For most SEs the scan is carried out using the appropriate commands for the SE concerned, but there are two special cases:-

Normally the survey command is executed by a nightly cron job.

The test command

  test <sub-command> <arg> ...
Is used to test and debug DCM. Typing the test command without further arguments will list what tests are currently available.

The uncatalogue command

uncatalogue <file>{ <file>...}
Example:  uncatalogue  /stage/minos-data1/d4/C00080277_0000.mdaq.root
This removes the file from the disk-based catalogue and also removes the soft link to it from the soft links catalogue:-
    dcm_catalogue/
which must be the top level directory on the first disk managed by DCM.

Installation

Individual experiments and sites are configured with the following files stored in the
  DataCacheManager/config
subdirectory.

  1. <expt>.site_name This file should contain a single line giving the site name.

  2. <expt>.se_servers e.g. minos.se_servers

    This file identifies all the SEs used by the experiment, the services each provides and the way to access these services.

  3. <expt>.site_<site>.se_access e.g. minos.site_ral_t1_ui.se_access

    This file specifies which of the experiment's SEs can be accessed from the local site, the mean transfer rate (used to calculate a timeout) and which interfaces to use to access them.

  4. <expt>.site_<site>.local_disks e.g. minos.site_ral_t1_ui.local_disks

    This file specifies the local disk setup at the site.

The remainder of this section is specific to MINOS but also serves as an example for other experiments.

Prepare the configuration files and the associated directories as follows.

  1. minos.site_name This file should contain a single line giving the site name. Names follow the convention:-
       <site><tier-level><ui or wn>
    
    For example:-
      ral_t1_wn
      oxford_t2_ui
      sussex_ui
    
    This file remains local to the site and is included in the .cvsignore list.

      cd DataCacheManager/config
    
      local_name=sussex_ui    # example only: use whatever name you have chosen
    
      echo "$local_name" > minos.site_name
    

  2. minos.site_<site-name>.se_access This file identifies what SEs the local site can access. Take a copy of minos.site_oxford_t2_ui.se_access. It should not need editing.
      cp minos.site_oxford_t2_ui.se_access  minos.site_$local_name.se_access
    

  3. minos.site_<site-name>.local_disks This file identifies the local disks and directories to be used by DCM. You could start by taking a copy of
       minos.site_ral_t1_ui.local_disks
    
    and renaming entries to match the local disks, creating directories as required with group write permission.

    Alternatively, start from scratch: define 'data_dir' to be the top directory of your data and then use that to fill out the file:-

      data_dir=/stage/minos-data1    # example only: use the top directory of your data disk
    
      rm -f minos.site_$local_name.local_disks    # should not exist, but just in case
    
      echo Group  minos                                      >> minos.site_$local_name.local_disks
      echo Scratch_dir        /tmp                           >> minos.site_$local_name.local_disks
      echo @Disks             $data_dir                      >> minos.site_$local_name.local_disks
      echo @Exclude_dirs      $data_dir                      >> minos.site_$local_name.local_disks
      echo Soft_links_dir     $data_dir/dcm_catalogue        >> minos.site_$local_name.local_disks
      echo Catalogue_dir      $data_dir/dcm_catalogue/DCM    >> minos.site_$local_name.local_disks
      echo Resource_lock_dir  $data_dir/dcm_resource_locks   >> minos.site_$local_name.local_disks
    
    DCM is capable of surveying everything below @Disks and providing a catalogue, but we assume that you don't need this feature, which is why @Exclude_dirs is set to the same thing.

    Now create all the required directories giving group write access.

      mkdir --mode 0775  $data_dir/dcm_catalogue
      mkdir --mode 0775  $data_dir/dcm_cache
      mkdir --mode 0775  $data_dir/dcm_catalogue/DCM
      mkdir --mode 0775  $data_dir/dcm_resource_locks 
    
    Confirm that dcm runs:
      dcm
    
    It should print its help and, near the top, list 'host_name' (the name you chose), the SEs it can see and the local disk setup.

  4. Set up local and FNAL catalogues
      dcm survey
    
    Surveying the local disk will take no time because everything was excluded, but it will then take about 15 minutes to download a ~0.3 GB file from FNAL and reformat it for DCM usage.

    Note: Because this takes a while, DCM does not automatically refetch this file, so the catalogue will slip out of date. One way to prevent this is to have a nightly cron job that just executes this command.

  5. Test that DCM can do catalogue searches
      dcm get --accept_dcm_url N00006771_cat0.spill.sntp.R1_18_2.0.root
      [  should locate one file in ral_t1-dcache-tape ]
      dcm get --accept_dcm_url AnaNue-N00009062_0018.spill.sntp.cedar.0.root
      [  should locate a file in ral_t1_ui-nfs ]
    

Implementation notes

    Internal Structure

    When DCM was originally developed its function was local disk management, and as such it didn't require any formal internal structure. Now that its principal objective is SE access, two layers have been developed:-

    1. SEI: Storage Element Interface
      This layer is responsible for all commands that directly access an SE. Changes to the SEs available and methods of access should only affect this layer.

    2. FRS: File Retrieval System
      This layer takes user requests, converts them to DCM URLs and executes the commands to effect transfers and handles failures, all using the SEI layer.
    These layers are described in more detail in the following sections.

    SEI: Storage Element Interface

    Introduction

    This layer is responsible for all commands that directly access an SE. Changes to the SEs available and methods of access should only affect this layer.

    Configuration

    The system is essentially data driven and is built upon the following concepts.

    1. A SE offers a series of services e.g. 'rfio' or 'dcap'

    2. The combination of an SE name and a specific service constitutes a server named <name>;<service>

    3. From a server there is a mapping to:-

      1. URL prefix: a prefix that has to be prepended to the SE directory before it can be used in a command.

      2. Environment commands: a set of 0 or more bash commands that have to be executed before the command.

      For example, for the server "ral_t1-castor-prod_d0t1;rfio"

      • The prefix is
          /castor/ads.rl.ac.uk/prod/grid/hep/disk0tape1/minos;
        
      • The environment is
        export STAGE_SVCCLASS=minosDisk0Tape1
        export STAGE_HOST=castorstager.ads.rl.ac.uk
        export RFIO_USE_CASTOR_V2=YES
        

    4. A site wanting to access an SE makes a request e.g. "get" or "list" to it. The combination of the SE name and the request constitutes an action named <name>;<request>

    5. Sites are configured by enumerating the actions that are available and hence what services they can call upon on different SEs.

    6. An action maps to:-

      1. A service

      2. A command - but only if a departure from the default for the service - see below.

    7. Typically an accessible SE offers a number of services to a site and then, to avoid explicitly enumerating all the actions the following shortcut can be used:-

      1. A request can be set to "*" which matches any request that's not explicitly listed in an action

      2. In such cases the command is the default for the service. For example, for the request list of the rfio service the command is rfdir.

      3. A specification can be mapped to the special service "disabled" meaning that it's not available. This allows the use of a wildcard to cover most cases and then fine tune the remainder, either with their own commands and services or disabled.

    8. In at least one case, the command also requires that the local file name has a prefix (globus-url-copy requires 'file:' before the local file name). This is dealt with by having a hardwired look-up from command to local prefix.
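
The action look-up described in points 4 to 7 can be sketched in a few lines. This is illustrative Python rather than DCM's Perl; the function resolve_action and the dictionary layout are assumptions, and the only default command taken from the text is rfdir for the list request of the rfio service.

```python
from typing import Dict, Optional, Tuple

# (service, request) -> default command; only this one default is quoted above.
DEFAULT_COMMANDS = {("rfio", "list"): "rfdir"}


def resolve_action(
    actions: Dict[str, Tuple[str, Optional[str]]],  # "<name>;<request>" -> (service, command)
    se_name: str,
    request: str,
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (service, command) for a request, or None if unavailable."""
    # An explicitly listed action wins over the "*" wildcard.
    entry = actions.get(f"{se_name};{request}")
    if entry is None:
        entry = actions.get(f"{se_name};*")
    if entry is None:
        return None
    service, command = entry
    if service == "disabled":
        return None
    if command is None:
        # No override given: fall back to the service's default command.
        command = DEFAULT_COMMANDS.get((service, request))
    return service, command
```

This ordering is what lets a wildcard cover most requests while individual ones are overridden or disabled.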

    The Routines

    The central routines are sei_assemble_command and sei_get_server_cmd (which it calls). Together they are responsible for assembling the appropriate command given the SE name and a request, for example "list" or "put" (copy to SE). Construction of ROOT URLs is handled by sei_get_root_url.

    Handling of DCM URLs (which encode the SE name, SE directory and file size) is done by sei_dcm_url_pack and sei_dcm_url_unpack.

    SE directory creation is done by sei_prepare_directory and file overwriting is done by sei_prepare_file.

    Catalogue handling is provided by sei_survey, which can scan an SE and build a text catalogue. Searching such a catalogue for a file name, and hence inferring the DCM URL, is done by sei_search_catalogue.

    FNAL Anomalies

    There are a couple of related anomalies when it comes to FNAL:-

    I don't claim to really understand this but it's basically what Art does, or at least did, in 2006!

    FRS: File Retrieval System

    This layer takes user requests, converts them to DCM URLs, executes the commands to effect transfers and handles failures, all using the SEI layer. If the user supplies SAM queries, FRS is responsible for resolving them into a series of file names by passing them to a web-based SAM client. The SEI catalogues are then searched for the file names to convert them to DCM URLs. The file transfers themselves are performed by a separate perl script, dcm_frs_job, which does the transfer, checks the size of the copied file and retries in cases where an error has occurred.
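
The copy, size check and retry cycle can be pictured with the Python sketch below. It is not dcm_frs_job: the function name, the pluggable copy callable and the default of three attempts are assumptions made for illustration.

```python
import os
import shutil
from typing import Callable


def transfer_with_retry(
    copy: Callable[[str, str], object],  # e.g. shutil.copyfile, or a wrapper round an SE command
    source: str,
    dest: str,
    expected_size: int,                  # the byte size carried in the DCM URL
    max_attempts: int = 3,               # assumed value, not taken from DCM
) -> int:
    """Copy until the destination has the expected size; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            copy(source, dest)
            if os.path.getsize(dest) == expected_size:
                return attempt
        except OSError:
            pass  # treat a failed copy like a size mismatch and retry
        # Remove any partial or wrong-sized copy before the next attempt.
        if os.path.exists(dest):
            os.remove(dest)
    raise RuntimeError(f"failed to transfer {source} after {max_attempts} attempts")
```

For a local test, shutil.copyfile can be passed as the copy callable.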

    Having the transfer as a separate script allows FRS, when transferring multiple files, to run multiple jobs in parallel.
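
A rough picture of the parallel dispatch, again as an illustrative Python sketch rather than FRS code: run_transfers is a hypothetical name, the transfer callable stands in for launching one dcm_frs_job, and threads are used here only to keep the example short. The limit of 10 is the hard upper limit quoted for --num_get_jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

MAX_GET_JOBS = 10  # hard upper limit quoted for --num_get_jobs


def run_transfers(
    transfer: Callable[[str], str],  # stands in for running one transfer job
    urls: Sequence[str],
    num_get_jobs: int = 1,
) -> List[str]:
    """Run up to num_get_jobs transfers at once; results keep input order."""
    workers = max(1, min(num_get_jobs, MAX_GET_JOBS))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transfer, urls))
```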

    After a successful transfer FRS updates the SEI catalogues.

    Experiment API

    Introduction

    Originally designed for MINOS, DCM has been extended for SNO. Some of the code is experiment specific and will be held in the subdirectories:-
      dcm/minos
      dcm/sno
    
    apart from
      init_minos.pm
      init_sno.pm
    
    After parsing any global switches DCM knows which experiment it is dealing with and then executes the appropriate experiment initialisation.

    Calls from the generic to the experiment specific code constitute the experiment API.

    identify_file

    
         Parameters:-
         ==========
      
         $file_name   Name of file to be identified (can contain directory)
      
         Return:-
         ======
      
         $data_name   MINOS:   Currently this is returned as the component
                               between the sub-run and the data type
                      SNO:     The module name e.g. Reconstruct
         $data_type   MINOS:   Data type i.e. the extension e.g. mdaq.root
                      SNO:     Data type e.g. sno_root
         $detector    MINOS:   The detector. One of "CalDet", "Far" or "Near".
                      SNO:     The phase e.g. salt
         $run_no      Run number
         $sub_run_no  Sub-run number (or -1 if n/a)
         $version     MINOS:   Release (or "" if n/a)
                      SNO:     Pass number (or "" if n/a)
    

    frs_locate_file

        Parameters:-
        ==========
    
        Either: $file_name   File name whose access info is required.
        Or:     $db_query    A database query for SAM (MINOS) or Ral (SNO)
    
        Return:-
        ======
    
        A list of file_access_size variables, each consisting of:-
    
           $file_name:$access_info:$estimated_file_size
    
        where:-
    
        $file_name            File name.
        $access_info          MINOS:   ENSTORE directory 
                              SNO:     Tape name:file number 
        $estimated_file_size  Estimated size in GB
    
        In the case of an error a single entry is returned: "? Error message"
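
A caller might unpack one returned entry as in the Python sketch below (illustrative only; parse_file_access_size is a hypothetical helper). Since the SNO access info is itself "Tape name:file number", the sketch splits on the first and last colon and assumes file names contain no colon.

```python
from typing import Tuple


def parse_file_access_size(entry: str) -> Tuple[str, str, float]:
    """Split $file_name:$access_info:$estimated_file_size into its fields."""
    if entry.startswith("?"):
        # Error convention: a single entry of the form "? Error message".
        raise ValueError(entry[1:].strip())
    file_name, rest = entry.split(":", 1)      # first colon ends the file name
    access_info, size = rest.rsplit(":", 1)    # last colon starts the size
    return file_name, access_info, float(size)
```

The tape name and sizes used when trying this out are made-up values, not real catalogue entries.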
    

    Global Data Structures

    See the routine init.pm