Fermilab


MINOS Offline Documentation

Distributing Auxiliary Data Files Associated with Code Releases



Data Files Associated with Code Releases

A code release often needs accompanying ancillary data files that contain information other than source code. Examples include PDFs, histograms, and b-field maps. Since the beginning of MINOS we have distributed such auxiliary files using CVS. This document describes the new, preferred means of distributing these files for MINOS.


Overview:

Using CVS and SRT for code management is relatively straightforward from the ordinary user's point of view: there is a $SRT_PUBLIC_CONTEXT and possibly a $SRT_PRIVATE_CONTEXT, each representing a release. Under each of these is a set of packages that appears, to the user, as a complete release (base) or as a few select packages overriding the base (test). Behind the scenes, this means that if package MyPkg has a file mystuff.root, then completely independent copies of it will be found in locations of the form:

    $SRT_DIST/packages/MyPkg/HEAD/mystuff.root
    $SRT_DIST/packages/MyPkg/R1-24/mystuff.root
    $SRT_DIST/packages/MyPkg/R1-24-0/mystuff.root
    ...
    $SRT_DIST/packages/MyPkg/R1-24-4/mystuff.root
    $SRT_DIST/packages/MyPkg/R1-28/mystuff.root
    $SRT_DIST/packages/MyPkg/S08-02-24-R1-28/mystuff.root
    ...
    $SRT_PRIVATE_CONTEXT/MyPkg/mystuff.root

If a binary-identical file is committed into both MyPkg and TheirPkg, separate copies are made for those as well.

The new mechanism allows releases and packages to share copies. A master copy of each unique file is kept at FNAL and made available via the web. Where a package previously contained a file mystuff.root, that file is removed and a new file mystuff.root.proxy is created. The contents of the .proxy file tell the system which copy of the master file is needed. Master files that go into the system must be uniquely named even if the target (e.g. mystuff.root) is generic. So the first file into the system might be named mystuff.v1.root, the next mystuff.v2.root, and so on. The text in each instance of the .proxy file names one file in this series. Different copies of mystuff.root.proxy might have different contents for different releases or packages, but each is a small text file that can comfortably be handled by CVS.

The sharing can then be accomplished in any of four ways:

    flag         action
    --cache      download copies to a site-specific cache area and symlink from the release/package to that location
    --afs        symlink to AFS at /afs/fnal.gov/files/data/minos/release_data/ (the site machine must run AFS)
    --minosdata  symlink to /minos/data/release_data/ (not always possible because of limits on mount permissions)
    --local      download copies into the same directories as the .proxy files; this leaves the installation no different from before this method was introduced (and so is not really "sharing")

In general the first option, a site-specific cache, is probably the right approach for most non-FNAL installations.
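The --cache action can be sketched in Python as follows. This is only an illustration of the idea, not the code of proxy_resolver.py: the function name resolve_via_cache and the caller-supplied fetch callable are hypothetical, and the real script may order or guard these steps differently.

```python
import os


def resolve_via_cache(target, remote_name, cache_dir, fetch):
    """Point `target` at a shared cache copy of `remote_name`.

    `fetch(remote_name, dest_path)` is a hypothetical, caller-supplied
    download function; it is called only when the cache lacks the file.
    Returns True if a download was needed, False otherwise.
    """
    cached = os.path.join(cache_dir, remote_name)
    fetched = False
    if not os.path.exists(cached):
        # remote_name may include a subdirectory, so create parents.
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        fetch(remote_name, cached)
        fetched = True
    # Remove any stale link or old in-place copy, then link to the cache.
    if os.path.lexists(target):
        os.remove(target)
    os.symlink(cached, target)
    return fetched
```

With this arrangement a second release that needs the same master file reuses the cached copy, so only one download and one copy on disk are required.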

Procedure For Updating Releases:

Base Release:

The simplest way to resolve all the .proxy files (assuming that the site has chosen to use a local cache and has configured a .proxyrc to handle that) is to issue the command:

    $ $SRT_DIST/setup/proxy_resolver.py

This should be done after any update to the local minossoft installation. The script correctly handles .proxy files that are newer than their target on the local machine: it automatically gets the new file and remakes the link. At some point this will probably become the default behaviour when using msrt update to work on a base release.
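One plausible way to decide whether a target needs refreshing is to compare modification times. The sketch below assumes that criterion; the name needs_refresh is hypothetical and the actual test used by proxy_resolver.py is not documented here.

```python
import os


def needs_refresh(proxy_path, target_path):
    """Return True if the target is missing or older than its .proxy file.

    Assumption: "newer" is judged by modification time. lstat is used so
    that, for a symlinked target, the link itself is inspected rather
    than the shared file it points to.
    """
    if not os.path.lexists(target_path):
        return True
    proxy_mtime = os.lstat(proxy_path).st_mtime
    target_mtime = os.lstat(target_path).st_mtime
    return proxy_mtime > target_mtime
```

The -f/--force flag described under Commandline Flags bypasses any such check and refetches or relinks unconditionally.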

If a site is using AFS or /minos/data one adds the command line flags --afs or --minosdata. If instead one desires to mimic the old multiple instance case, use --local.

Test Release:

    $ setup_minos
    $ cd /path/to/testrel
    $ srt_setup -a
    $ $SRT_DIST/setup/proxy_resolver.py -r test

BField Maps:

    $ $SRT_DIST/setup/proxy_resolver.py -r bmap

Commandline Flags:

    $ $SRT_DIST/setup/proxy_resolver.py --help
    usage: proxy_resolver.py
      -h, --help            this message
      -q, --quiet           don't show actions
      -v                    increase debug verbosity level
      -f, --force           force refetch/relink of file
      -t, --test            print actions, but don't do them
      -p, --package <pkg>   limit search to just package pattern [*]
      -r, --release <rel>   limit search to just release pattern [*]
                            if "test" act on $SRT_PRIVATE_CONTEXT
                            otherwise HEAD, R1-NN, SYY-MM-DD-R1-NN
                            ( also super-special case: "bmap" )
     The following determine the action taken:
      --cache <dir>         Set symlink to site cache copy.
                            Fetch remote file to site cache if necessary.
                            Default action, but if not specified on the
                            cmd line the cache location my be resolved
                            by a $SITE_PROXY_CACHE env variable or a line
                            in a .proxyrc starting with SITE_PROXY_CACHE:
      --afs                 symlink via AFS
      --minosdata           symlink to /minos/data
      --local               fetch copy to pkg directory
      --unlink              remove target
     special flags:
      --no-proxyrc          ignore all .proxyrc lists
      --no-sys-proxyrc      ignore .proxyrc except in dir w/ .proxy
      --source <urlbase>    alternative master source directory

Transitioning From Old to New Approach:

The transition from the old scheme to the new takes a few easy steps. One must choose a site cache that is visible to all local instances of the minossoft installation (i.e. if the base release is on an NFS filesystem visible to many nodes, the cache should be as well).

The initial steps, when using a site cache, are:

    $ setup_minos
    $ cd /path/to/site/cache
    $ mkdir release_data
    $ echo "SITE_PROXY_CACHE:/path/to/site/cache/release_data" > $SRT_DIST/setup/.proxyrc
    $ $SRT_DIST/setup/proxy_resolver.py --unlink    # remove any stale files
    $ $SRT_DIST/setup/proxy_resolver.py --force

If a site-wide decision is made to exclude particular files (or patterns of files), the .proxyrc file can be used to accomplish that. Make that decision, and the corresponding changes to the .proxyrc, before the last step above to avoid downloading unnecessary files.

Proxy Files:

Here is an example .proxy file:

    # This is an example .proxy file.  It is named "mytarget.root.proxy"
    # and gets resolved locally with the addition of a file "mytarget.root"
    # in the same directory which is a symlink to (or copy of) the real file.
    #
    # The real contents of the .proxy is a single named file which must be on
    # the first non-blank, non-comment line.  The remote file must be absolutely
    # unique in its name, so should incorporate a version number and be
    # as descriptive as possible.  The remote file line can include a subdir
    # path to facilitate clustering of related files (not generally recommended
    # unless a long series is anticipated or sharing between packages
    # is unlikely).  Note that the remote file name needn't be a simple
    # transformation of the target file name -- though that would generally
    # be a wise choice.
    #
    # The comments can serve to describe the file and provide extra metadata
    # e.g. "the v3 version of the file has the correct blah-blah PDF"
    # This allows one to keep track of why changes were made to the file
    # and provide other helpful hints.
    #
    mysubdir/myremotefile.v3.root
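The format rule stated in the example (the remote file is named on the first non-blank, non-comment line) can be expressed in a few lines of Python. This is a sketch under that stated rule only; the helper names read_proxy and target_name are hypothetical and are not taken from proxy_resolver.py.

```python
def read_proxy(text):
    """Return the remote file name from the text of a .proxy file.

    The name is the first line that is neither blank nor a '#' comment.
    Returns None if the file contains no such line.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return stripped
    return None


def target_name(proxy_filename):
    """Map 'mytarget.root.proxy' to the local target 'mytarget.root'."""
    suffix = ".proxy"
    if not proxy_filename.endswith(suffix):
        raise ValueError("not a .proxy file: " + proxy_filename)
    return proxy_filename[:-len(suffix)]
```

Applied to the example above, read_proxy would yield mysubdir/myremotefile.v3.root, and target_name("mytarget.root.proxy") gives mytarget.root.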

Proxy Control Files:

The role of the .proxyrc file is to allow local control over the proxy resolution. These files should never be committed back to the repository -- they are strictly for local site configuration.

Here is an example .proxyrc file:

    # This is an example .proxyrc file.  It serves two purposes:
    #   * provide a list of file names and patterns that aren't desired
    #     locally at this site.
    #   * provide a place for specifying where the local site cache is located
    #
    SITE_PROXY_CACHE:/path/to/where/actual/files/live
    excluded-file.dat
    excluded-pattern*.dat

These files can be located in individual package directories alongside the .proxy files, or in $SRT_DIST/setup, ~, or $SRT_PUBLIC_CONTEXT, as well as in $SRT_PRIVATE_CONTEXT when using the "test" release and in $BMAPPATH when using the "bmap" release.
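A .proxyrc file of this shape could be read as sketched below. Two assumptions are made that the text above does not confirm: exclusion patterns are treated as shell-style globs, and they are matched against a bare file name. The function names parse_proxyrc and is_excluded are hypothetical.

```python
import fnmatch


def parse_proxyrc(text):
    """Split .proxyrc text into (cache_dir, exclusion_patterns).

    cache_dir is None when no SITE_PROXY_CACHE: line is present.
    """
    key = "SITE_PROXY_CACHE:"
    cache_dir = None
    patterns = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped.startswith(key):
            cache_dir = stripped[len(key):].strip()
        else:
            patterns.append(stripped)
    return cache_dir, patterns


def is_excluded(filename, patterns):
    """True if filename matches any exclusion name or glob (assumed syntax)."""
    return any(fnmatch.fnmatchcase(filename, p) for p in patterns)
```

For the example file, parse_proxyrc returns the cache path /path/to/where/actual/files/live and the two exclusion entries; a name such as excluded-pattern-7.dat would then be skipped.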

Adding and Updating Files:

There are now two steps involved in adding or updating a file. First, the master copy must be put in the FNAL repository. Second, a .proxy file must be created or modified to point to that copy. The steps should be done in this order, allowing sufficient time for the master copy to be made available before the proxy file is committed to CVS.

Users desiring to distribute a large or .root file should copy it to FNAL (either AFS or /minos/scratch). They then notify Robert and Arthur where it is located (core software group as furlough/vacation/sickday backup). Robert or Arthur verifies that the name is unique and makes appropriate copies into both the master AFS and /minos/data areas. That person then informs the user that the file has been installed.

An example of a .proxy is shown above. If the desired target file is named mystuff.root then the proxy file must be named mystuff.root.proxy. The user puts the new/updated proxy file into the CVS repository.
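Creating the proxy file itself is simple enough to do by hand, but the naming rule can also be captured in a small helper. The sketch below is illustrative only; write_proxy is a hypothetical name, and committing the resulting file to CVS remains a separate manual step.

```python
import os


def write_proxy(directory, target, remote_name, note=""):
    """Create '<target>.proxy' in `directory`, naming `remote_name`.

    `note` is optional free text stored as '#' comment lines ahead of
    the remote file name. Returns the path of the file written.
    """
    path = os.path.join(directory, target + ".proxy")
    lines = ["# " + n for n in note.splitlines()]
    lines.append(remote_name)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path
```

For example, write_proxy(".", "mystuff.root", "mystuff.v2.root") writes ./mystuff.root.proxy containing the single line mystuff.v2.root.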


The Problem:

Recently the number and size of the auxiliary files have exploded out of proportion and are starting to cause problems. As of 2008-03-07 there were 114 files in the primary minossoft packages that were either binary .root files or files over 1 MB; counting revisions, there are 140 distinct files. Over 330 MB of this data is checked out for every recent release. Many of these files are then duplicated when individuals make test releases. There are also another 55 b-field maps.

The CVS backup process is having a hard time handling this and code checkouts are taking up too much space on disk, often with duplicate copies of identical files located in different directories and/or different releases.
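The amount of duplication at a given site can be estimated by hashing file contents. The following sketch is not part of the MINOS tools; the function names are hypothetical, and it simply groups regular files under a directory tree (for example a local packages area) by SHA-256 digest.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Group regular files under `root` by SHA-256 of their contents.

    Returns {digest: [paths]} for digests shared by two or more files.
    Symlinks are skipped so that already-shared files are not counted.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks to keep memory use bounded.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


def wasted_bytes(duplicates):
    """Bytes saved if only one copy of each duplicate group were kept."""
    return sum(os.path.getsize(paths[0]) * (len(paths) - 1)
               for paths in duplicates.values())
```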

cvs pro's:
cvs con's:
The new approach has several advantages, though a few weaknesses:
pro:
con:

Original Request For Comment:

Nick's message (2008-02-28):
Dear All,

We hope finally to bring closure to an item back last June:-

Latest snapshot release is 3 times the size of R1.24!

http://listserv.fnal.gov/scripts/wa.exe?A2=ind0706&L=MINOS_SOFTWARE_DISCUSSION&D=0&I=-3&P=2068

We need to do this both to dramatically reduce the size of the CVS Repository, which is important both for backing it up and for checkout and to cut down on disk space when multiple releases are installed, as frequently releases have identical versions of the files.

IF YOU HAVE ANY OBJECTION TO THE SCHEME BELOW, PLEASE SEND EMAIL TO THIS LIST ASAP

We want to do this in a way that minimises disruption to users by moving all large data files to a public directory leaving behind symlinks of the same name in the releases. In order to support multiple versions of the same files, the reason they ended up in CVS in the first place, the public directory files will be qualified by a version number with the symlinks pointing to the appropriate version.

Naturally this will complicate installation in two respects: access to the data and setting the symlinks.

Access to the data
------------------
The data will be available both via AFS and the web. Sites for which AFS access is not acceptable can maintain their own copy by using rsync or some web incremental access method e.g. wget -N as the first step in the installation procedure.

Setting the symlinks
--------------------
For each file that is removed we will leave behind a text file with the same name + a suffix .proxy whose contents is the qualified name of the data file. Then an installation script, which will also be invoked as required by msrt, will hunt out all proxy files and generate the corresponding symlink using the information they contain and some global environmental variable that points to the public data area.

In order to clear the files from the Repository, it is not only necessary to remove the current version of the data files; all earlier versions must also be removed. The first step will be to perform a sweep through the Repository looking for candidates to be removed. Mostly these are .root files but any file statistically much larger than the average source file will be considered. Once these have been identified all versions will be extracted into the public directory. At this stage we announce that the directory is available and allow sites time to set up their own copies.

Then we develop a script that, for each target data file, 'cvs removes' the current version and, for each tagged release, moves the tag forward to the new revision (which removes it from the release), adds and tags a proxy that holds the qualified name of the version that the release did hold, and updates/creates a .cvsignore to ignore the symlink. Once that is accomplished any release can be cvs updated to replace the data files it contains by their proxies and can then run the script to set up the symlinks.

In future people will be asked to consider carefully before adding large data files to the Repository and will instead be encouraged to contribute them to the public area and commit a .proxy file and an entry in .cvsignore.

Cheers,
Nick.


Last Modified: $Date: 2009/01/20 19:46:28 $
Contact: rhatcher@fnal.gov
Page viewed from http://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/release_data.html