MINOS SAM Tutorial

What is SAM?

SAM is a data handling system originally developed by FNAL for use by D0 in Run II. It has since been adopted by CDF. The name stands for Sequential data Access with Meta-data. More general background can be found here.

SAM handles file storage: it keeps track of the locations of files on tape and disk. It also manages the file meta-data, so you do not need to know a file's name to find data of interest. It delivers files to your computer or your job without you having to know where they come from. Finally, it does bookkeeping: it keeps track of the datasets you have created and the files you have processed. All of this information is kept in an Oracle database.

There are four steps to running a loon job under SAM. First, find out what data you would like to process. Second, create a dataset definition describing that data. Third, take a snapshot, which captures the files that match the dataset definition at that moment in time. Finally, run a loon job that reads that snapshot. These steps are described in more detail below.
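
For orientation, the whole workflow condenses to four commands, each covered in its own section below. This is only a sketch: the dataset name my-dataset and the query are placeholders.

      # 1. Find the data: translate a dimension query into a file list
      sam translate constraints --dim="run_type physics% and data_tier raw-near"

      # 2. Save the query as a named dataset definition
      sam create definition --defName=my-dataset --defdesc="Example dataset"
                            --group=minos --dim="run_type physics% and data_tier raw-near"

      # 3. Take a snapshot of the files that currently match the definition
      sam create dataset --group=minos --definitionName=my-dataset

      # 4. Run a loon job that reads the snapshot
      loon -bq run_samtest.C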

Getting Started

  1. The first step is to register as a SAM user. This will create an entry for you in the database and add you to the minos-sam-users mailing list.
  2. If you are at FNAL and are using the software from AFS space with the standard setup_minossoft_FNALU, then all the SAM products are already installed and loon is SAM-enabled. If you have your own local software installation, you will need to install the SAM client products on your machine. This allows you to run loon jobs against the FNAL stations and to use the command line interface. Running loon jobs against the FNAL stations requires access to dcache and AFS. A quick check of your setup is sketched below.
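
As a minimal sanity check that the SAM client is working (a sketch: the exact setup invocation depends on your site; the second command simply lists the registered working groups, which should include minos):

      setup_minossoft_FNALU
      sam get registered work groups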

MINOS SAM Stations

MINOS currently has one SAM station, minos. All the MINOS raw data and official production data are stored in ENSTORE and can be accessed from the station minos via dcache. In addition, ntuple data is stored in AFS space. This data will be accessible from the station minos-afs, but that station is not live yet.

Determining what data you would like to process

This can be done in one of two ways. You can browse the data via the SAM Data Browsing web interface. You may find these metadata definitions useful.

The second method is to use the SAM command line interface. The relevant command is sam translate constraints or, equivalently, sam list files. Here are some examples.

      sam translate constraints --dim="run_type physics% and data_tier raw-near and 
                                      start_time > to_date('2005-08-10','yyyy-mm-dd')"
                         or
      sam list files --dim="run_type physics% and data_tier raw-near and 
                            start_time > to_date('2005-08-10','yyyy-mm-dd')"

This returns


Files:
   N00008273_0004.mdaq.root
   N00008273_0005.mdaq.root
   N00008273_0006.mdaq.root
   N00008273_0007.mdaq.root
   N00008273_0008.mdaq.root
   N00008273_0009.mdaq.root
   N00008273_0010.mdaq.root
   N00008273_0011.mdaq.root
   N00008273_0012.mdaq.root
   N00008273_0013.mdaq.root
   N00008273_0014.mdaq.root
   N00008273_0015.mdaq.root
   N00008273_0016.mdaq.root
   N00008273_0017.mdaq.root
   N00008273_0018.mdaq.root
   N00008273_0019.mdaq.root
   N00008273_0020.mdaq.root
 
File Count:         17
Average File Size:  94.35MB
Total File Size:    1.57GB
Total Event Count:  866876

You can include a time in the date query, for example

      sam translate constraints --dim="run_type physics% and data_tier raw-near and 
                         start_time > to_date('2005-08-10 18:00:00','yyyy-mm-dd hh24:mi:ss')"

You have to use hh24 to specify times on the 24-hour clock.

The arguments given to the --dim qualifier (short for --dimension) form a simplified ORACLE query. Dimensions are defined for various fields in the ORACLE tables and are grouped into categories. A list of the available categories can be obtained using sam get registered dimension categories. This returns

      alpgen
      cdf
      cdfsim
      datafile
      datasetdef
      dfc
      herwig
      madgraph
      mc
      mcrun
      pythia
      run

The only one currently of relevance to MINOS is datafile. Typing

      sam get dimension info --category=datafile

will get you a list of the available dimensions for the datafile category. You can also use this list. The to_date function lets you specify dates in the more usual yyyy-mm-dd form rather than the ORACLE default, dd-MON-yyyy, e.g. 01-OCT-2004.
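
Several datafile dimensions can be combined in a single query. As an illustrative sketch (assuming, as the date-range examples below suggest, that two constraints on the same dimension can be joined with and), this selects near detector raw physics data from runs started during August 2005:

      sam translate constraints --dim="run_type physics% and data_tier raw-near and
                                       start_time >= to_date('2005-08-01','yyyy-mm-dd') and
                                       start_time < to_date('2005-09-01','yyyy-mm-dd')"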

If you want to create a dataset using files from AFS space, then you need to look for files that have an AFS path in addition to the usual PNFS one. You can do this by adding full_path %afs% to the --dim argument, namely

      sam translate constraints --dim="run_type physics% and full_path %afs%"
 
      Files:
         F00019841_0000.snts.R1.0.0a.root
         F00019844_0000.snts.R1.0.0a.root
         F00019844_0001.snts.R1.0.0a.root
         F00019844_0002.snts.R1.0.0a.root
         F00019853_0000.snts.R1.0.0a.root
         F00019881_0000.snts.R1.0.0a.root
         F00019884_0000.snts.R1.0.0a.root
         F00019888_0000.snts.R1.0.0a.root
         F00019888_0001.snts.R1.0.0a.root
         F00019888_0002.snts.R1.0.0a.root

      File Count:         10
      Average File Size:  2.21MB
      Total File Size:    22.15MB
      Total Event Count:  818220

Creating a dataset definition

Once you have decided what data you want to process, you need to create a dataset definition. You can also look at what datasets other people have created and use an existing definition. There are test datasets created from production.

You can find out what dataset definitions already exist by using the web interface. Clicking on the Submit request button without entering any data will return all datasets in the database.

To create your own dataset definition, use the command sam create dataset definition or, equivalently, sam create definition:

     sam create definition --defName=test-near --defdesc="Near detector test dataset" --group=minos 
     --dim="run_type physics% and data_tier raw-near and start_time > to_date('2004-11-1','yyyy-mm-dd')"

The dataset definition is assigned a unique ID number in the database.
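
You can check a new definition straight away by querying on the dataset_def_name dimension; for the definition created above this would be

      sam translate constraints --dim="dataset_def_name test-near"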

You can use an existing dataset definition and add constraints to create your own dataset. For example, let's say you are interested in Far detector physics raw data for the first half of June 2005. You can use the dataset zeval-far-raw-physics and add an additional constraint for the date range.

     sam translate constraints --dim="dataset_def_name zeval-far-raw-physics and start_time <= to_date('2005-06-15','yyyy-mm-dd') 
     and end_time >= to_date('2005-06-01','yyyy-mm-dd')"

This finds 332 files. Note the way the date query is done: a file overlaps the period of interest exactly when it starts before the period ends and ends after the period begins, so this pair of constraints picks up every file that could have been taking data during the window. You can then use sam create dataset definition to create the dataset.

     sam create definition --defName=my-zeval-far-raw-physics --defdesc="Far Raw data June 1-15 2005" --group=minos 
     --dim="dataset_def_name zeval-far-raw-physics and start_time <= to_date('2005-06-15','yyyy-mm-dd') and end_time >=
     to_date('2005-06-01','yyyy-mm-dd')"

Creating a snapshot

The final step is to create a snapshot. This runs the query from the dataset definition against the data files in the database and generates the list of files that currently match the definition. Note that the dataset defined in the example above would produce a different list of files each time the query is run. You can tell loon to use the last snapshot that was defined, to create a new one at the time the job is run, or to use a specific snapshot. You can see which snapshot versions exist on the Snapshot query page. Clicking on the link in the Snapshot version column will show you a list of the files in that snapshot. The dataset name and the snapshot version together form a unique combination.

To create a snapshot from the sam command line, use sam create dataset or sam take snapshot:

      sam create dataset --group=minos --definitionName=test-near

Running a loon job that reads one dataset

Now you are ready to run a loon job. Below is a simple script that runs the EventDump module and dumps every 5000th snarl.

     run_samtest() 
     {

         // Create the Job Controller.
         JobC j;

         // Ask it to create the "Demo" path.
         j.Path.Create("Demo",
		"EventDump::Ana");

         // Configure the EventDump module.
         j.Path("Demo").Mod("EventDump").Cmd("Dump RawHeader");
         j.Path("Demo").Mod("EventDump").Set("Freq=5000");


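         // Configure the SAM input; each option is explained below.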
         j.Input.Set("ProjectName=samtest");
         j.Input.Set("SnapShotVers=1");
         j.Input.Set("Station=minos");
         j.Input.Set("ApplicationVers=r1.14");
         j.Input.Set("MaxNumberOfFiles=20");

         j.Input.Report();

         j.Input.AddFile("SAM:test-near","DaqSnarl");


         // Run the job
         j.Path("Demo").Run();

         // Print run summary report
         j.Path("Demo").Report();

     }
j.Input.Set("ProjectName=samtest")
You can give your SAM project a name; the date and a timestamp will be appended to it. If you don't specify a name, it defaults to your username plus the date and timestamp. The default is always to create a new project. See below for how to run one or more loon jobs on a pre-existing project.
j.Input.Set("SnapShotVers=1")
This is the version of the snapshot to use. To take a new snapshot, use a value of 0. To use the most recent snapshot, use -1. To use a specific version, give that version number; the example uses 1. You can see which snapshots are available from the Snapshot query page. The default is 0.
j.Input.Set("Station=minos")
This is the name of the SAM station. You should specify station minos for projects that read dcache data and minos-afs for projects that read files from AFS space. The default is minos.
j.Input.Set("ApplicationVers=r1.12")
This is the version of the offline software that you are running. The default is dev.
j.Input.AddFile("SAM:test-near","DaqSnarl")
This command defines the SAM dataset that you want to process. The SAM: prefix tells JobControl that this is a SAM dataset; the name that follows, test-near in this case, is the name of the dataset. You should add the name of the stream you want to process, otherwise the IoInputModule will attempt to open all files at the start of the job. This is bad.
j.Input.Set("MaxNumberOfFiles=20")
This option allows you to take a dataset that contains many files and process only the first N. The default is 0, which means process everything.

There are two other options that can be set in the Input Module:

j.Input.Set("WorkGroupName=minos")
This is the name of the working group that the project should be run under. There is currently only one, minos. You must be a member of the chosen working group. Valid groups can be found by using the following query or the command line sam get registered work groups.
j.Input.Set("ApplicationName=loon")
This is the name of the application that you are running. The default is loon. The name must be defined in the database. Valid combinations of ApplicationName and ApplicationVers can be found by browsing the SAM metadata or by using the command line sam get registered application families.
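
Set explicitly, these two options sit alongside the others in the script above; a sketch, using the values described in this section:

      j.Input.Set("WorkGroupName=minos");
      j.Input.Set("ApplicationName=loon");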

You can now go ahead and run your loon job in the usual manner:

     loon -bq run_samtest.C

You do not need to specify a list of input files; the list is generated internally by the loon job by talking to the SAM station. Output files should be defined as usual. At the moment we are not automatically storing user output files back into SAM. Should you create a stripped dataset that is of use to other people, contact the SAM team to arrange for it to be stored back into SAM for access by other users.

Running a loon job that uses more than one dataset

You can also access multiple datasets in the same loon job. This is useful when you want two input streams that will be synchronized.

run_samtest()
{
 
  // Create the Job Controller.
  JobC j;
 
  //j.Msg.SetLevel("Io","Verbose");
  //j.Msg.SetLevel("Per","Verbose");
   
  // Ask it to create the "Demo" path.
  j.Path.Create("Demo",
                "EventDump::Ana");
 
  // Configure the EventDump module.
  j.Path("Demo").Mod("EventDump").Cmd("Dump RawHeader");
  j.Path("Demo").Mod("EventDump").Set("Freq=5000");
 
 
  j.Input.Set("ProjectName=samtest");
  j.Input.Set("SnapShotVers=1");
  j.Input.Set("Station=minos");
  j.Input.Set("ApplicationVers=r1.0.0a");
 
  j.Input.Report();
 
  j.Input.DefineStream("RawData","DaqSnarl");
  j.Input.DefineStream("DCSData","DcsMonitor");
  j.Input.Set("Streams = RawData,DCSData");
  j.Input.AddFile("SAM:test-data1","RawData");
 
  j.Input.Report();
 
  j.Input.AddFile("SAM:dcs-test1","DCSData");
   
  j.Input.List();
  // Run the job
  j.Path("Demo").Run();
 
  // Print run summary report
  j.Path("Demo").Report();
 
}

Each SAM dataset will be added to a different stream. You can then proceed as if you had added the files to the streams manually.

Running one or more loon jobs on a pre-existing project

If your dataset consists of a large number of files, you will probably not want to run a single loon job on all of them. You first need to create a project outside of your loon job. To do this you use the SAM command sam start project.


         sam start project --station=minos
                           --definitionName=test-near --snapshotVersion=new 
                           --project=samtest --group=minos

This will start a project called samtest on station minos using the dataset test-near with a new snapshot version. You can see the project status from the Sam-at-a-glance page (click on the minos station link) or by querying the database. You then need to create your loon script, setting the variable StartNewProject to 0 (the default is 1, which means create a new project). This forces loon to use the project you have just created rather than starting a new one. You should also specify how many files you want each loon job to process using the MaxNumberOfFiles variable. Then run the loon jobs; the project will deliver the stated number of files to each loon job. Note that although the files within a loon job will be processed in time order, the overall decision of which files are given to which loon job is under the control of the project and will not be in time order. The script below runs the EventDump module and dumps every 5000th snarl for 20 files using the project defined above.

     run_samtest() 
     {

         // Create the Job Controller.
         JobC j;

         // Ask it to create the "Demo" path.
         j.Path.Create("Demo",
		"EventDump::Ana");

         // Configure the EventDump module.
         j.Path("Demo").Mod("EventDump").Cmd("Dump RawHeader");
         j.Path("Demo").Mod("EventDump").Set("Freq=5000");


         j.Input.Set("ProjectName=samtest");
         j.Input.Set("SnapShotVers=1");
         j.Input.Set("Station=minos");
         j.Input.Set("ApplicationVers=r1.14");
         j.Input.Set("MaxNumberOfFiles=20");
         j.Input.Set("StartNewProject=0");

         j.Input.Report();

         j.Input.AddFile("SAM:test-near","DaqSnarl");


         // Run the job
         j.Path("Demo").Run();

         // Print run summary report
         j.Path("Demo").Report();

     }

Once all your loon jobs have ended, you should stop the project.

         sam stop project --project=samtest --station=minos
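
Putting the pieces together, the full cycle for a large dataset looks something like this sketch (three loon invocations shown for illustration; with the settings above, each job asks the project for the next 20 files):

      sam start project --station=minos
                        --definitionName=test-near --snapshotVersion=new
                        --project=samtest --group=minos

      loon -bq run_samtest.C      # first job: processes up to 20 files
      loon -bq run_samtest.C      # second job: the next 20 files
      loon -bq run_samtest.C      # and so on, until the dataset is exhausted

      sam stop project --project=samtest --station=minos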

Accessing individual files with SAM

You can access individual files in a loon job using SAM. This allows you to specify files without needing to know the dcache path. It is designed for accessing handfuls of files; if you want to process large numbers, it is still more efficient to create a dataset using the database. The following example shows how to access three files without needing to know their dcache locations. This method of access does not generate a SAM project; it just uses the database to locate the pnfs path for each file and convert it into a dcache path.

     run_samtest() 
     {

         // Create the Job Controller.
         JobC j;

         // Ask it to create the "Demo" path.
         j.Path.Create("Demo",
		"EventDump::Ana");

         // Configure the EventDump module.
         j.Path("Demo").Mod("EventDump").Cmd("Dump RawHeader");
         j.Path("Demo").Mod("EventDump").Set("Freq=5000");


         j.Input.Report();

         j.Input.AddFile("SAM_FILE:N00001159_0000.mdaq.root","DaqSnarl");
         j.Input.AddFile("SAM_FILE:N00001243_0000.mdaq.root","DaqSnarl");
         j.Input.AddFile("SAM_FILE:N00001306_0000.mdaq.root","DaqSnarl");


         // Run the job
         j.Path("Demo").Run();

         // Print run summary report
         j.Path("Demo").Report();

     }
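
Run the script in the usual manner; no input file list is needed because SAM resolves each SAM_FILE: name to a dcache location:

      loon -bq run_samtest.C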
