The Scripts are divided into the following:-
cd (my top level directory)
mkdir GridApps
cvs -d :pserver:anonymous@minoscvs.fnal.gov:/cvs/minoscvs/rep1 get minossoft/GridTools
mv minossoft/GridTools/ ./
rm -r minossoft
cd GridTools
export MOG_TOOLS=`pwd`
source $MOG_TOOLS/Scripts/setup/setup_minos_lcg_grid.sh
Pick a name for your site. It can be arbitrary, but the examples we have so far are:-
ral_t1_ui
oxford_t2_ui
Tell RSD what name to use and set up a configuration for it.
local_name= (whatever name you have chosen)
cd $MOG_TOOLS/RemoteSoftwareDeployment/config
echo $local_name > minos.site_name
cp example.config minos-$local_name.config
You could tweak the .config if you want to, but the default ought to be O.K.
Have a test run of RSD
rsd
It should give help instructions and near the top you should see the line
VO name: MINOS_VO_GRIDPP_AC_UK (from default or minos-[your site name here].config)
rsd install $MOG_TOOLS/../GridApps SamWebClient:v0_9_2_NULL-build_1 \
    --soft_link=pro:delete
This tells RSD to install version v0_9_2_NULL below $MOG_TOOLS/../GridApps and to make a soft 'pro' link to it. If all goes well the log should end:-
RSD terminating. No error reported
and then, by re-sourcing the setup script, you should be able to use it:-
source $MOG_TOOLS/Scripts/setup/setup_minos_lcg_grid.sh
[ should see the line: Setting up SamWebClient ]
samLocate --file=N00008695_0023.cosmic.sntp.R1_18.0.root
[ should give: /pnfs/minos/reco_near/R1_18/sntp_data/2005-10 ]
Tell DCM what the site name is and what SEs it can access using the Oxford setup as typical.
cd $MOG_TOOLS/DataCacheManager/config/
echo $local_name > minos.site_name
cp minos.site_oxford_t2_ui.se_access minos.site_$local_name.se_access
Now you need to tell DCM about the local disks and directories.
data_dir= (the top directory of your data disk)
rm -f minos.site_$local_name.local_disks (should not exist, but just in case)
echo Group minos >> minos.site_$local_name.local_disks
echo Scratch_dir /tmp >> minos.site_$local_name.local_disks
echo @Disks $data_dir >> minos.site_$local_name.local_disks
echo @Exclude_dirs $data_dir >> minos.site_$local_name.local_disks
echo Soft_links_dir $data_dir/dcm_catalogue >> minos.site_$local_name.local_disks
echo Catalogue_dir $data_dir/dcm_catalogue/DCM >> minos.site_$local_name.local_disks
echo Resource_lock_dir $data_dir/dcm_resource_locks >> minos.site_$local_name.local_disks
DCM is capable of surveying everything below @Disks and providing a catalogue, but we assume that you don't need this feature, which is why @Exclude_dirs is set to the same thing.
Now create all the required directories giving group write access.
mkdir --mode 0775 $data_dir/dcm_catalogue
mkdir --mode 0775 $data_dir/dcm_cache
mkdir --mode 0775 $data_dir/dcm_catalogue/DCM
mkdir --mode 0775 $data_dir/dcm_resource_locks
Confirm that dcm runs:-
dcm
It should print its help and near the top list 'host_name' (the name you chose), the SEs it can see and the local disk setup.
dcm survey
It will take no time to survey the local disk because everything was excluded, but it will then take about 15 minutes to download a ~0.3 GB file from FNAL and reformat it for DCM usage.
Note: DCM does not automatically refetch this file, because doing so takes a while, so the local copy will slip out of date. One way to prevent this is to have a nightly cron job that just executes this command.
dcm get --accept_dcm_url [ file_name like N00008695_002%.cosmic.sntp.R1_18.0.root ]
[ should locate 4 files in fnal-dcache-enstore ]
dcm get --accept_dcm_url N00006771_cat0.spill.sntp.R1_18_2.0.root
[ should locate one file in ral_t1-dcache-tape ]
dcm get --accept_dcm_url AnaNue-N00009062_0018.spill.sntp.cedar.0.root
[ should locate a file in ral_t1_ui-nfs ]
Note that you cannot actually get data from SEs at RAL with DCM yet, as that isn't supported, but you could get data from FNAL if you needed to.
The scripts are meant primarily to run on software structures defined by RSD, but they can also be used to set up GridTools interactively on systems maintained by SRT. In that case they cannot set up an application, because they cannot know which setup script in $SRT_DIST/setup to use.
setup_minos_lcg_grid.sh (for sh/bash shells)
setup_minos_lcg_grid.csh (for csh/tcsh shells)
and take a single, optional, argument specifying the application:-
app-name:app-version example: CedarDaikon:03-build_0-SL4
There are several cases to be covered:-
source /stage/minos-data1/software/grid/setup_minos_local-SL4.sh (or .csh) {application}
Each time a new release of GridTools is installed, it creates these
wrapper scripts in the top level directory which do nothing more than
invoke the latest setup_minos_lcg_grid.sh (or .csh) and thus simplify
job setup.
Caution: As their names suggest, these are Scientific Linux 4 scripts, so they should not be used on any SL3 UI.
source $VO_MINOS_VO_GRIDPP_AC_UK_SW_DIR/setup_minos_local-SL4.sh (or .csh) {application}
as that's the standard
environmental variable.
However at present the only LCG computing element is at RAL T1 and its
software disk was full in October. Although there is now space, for
the moment we will continue to use the RAL T1 UI/PBS scripts instead
i.e.
source /stage/minos-data1/software/grid/setup_minos_local-SL4.sh (or .csh) {application}
export MOG_TOOLS=wherever
source $MOG_TOOLS/Scripts/setup/setup_minos_lcg_grid.sh
or
setenv MOG_TOOLS wherever
source $MOG_TOOLS/Scripts/setup/setup_minos_lcg_grid.csh
Note that you cannot specify an application in this case; the script can only be used to set up GridTools themselves.
| Variable | Meaning | Example |
|---|---|---|
| MOG_SW_DIR | Software top level (see RSD_SW_DIR) | /stage/sl3-lcg-exp/minossgm |
| MOG_TOOLS | GridTools top directory. Derived from $MOG_SW_DIR | /stage/sl3-lcg-exp/minossgm/apps/GridTools/pro-SL4 |
| MOG_SCRIPTS | GridTools script directory. Derived from $MOG_SW_DIR | /stage/sl3-lcg-exp/minossgm/apps/GridTools/pro-SL4/Scripts |
| MOG_CE_NAME | The GRID queue name (or empty if not on GRID) | lcgce01.gridpp.rl.ac.uk |
| MOG_HOST_NAME | The standardised host name, formed from <site>, t1 or t2, and ui or wn | ral_t1_wn |
| MOG_OS_TYPE | The standardised operating system type | SL4 |
| MOG_WORK_DIR | Scratch area. Based on the first defined of the following:- Instead create a directory whose name includes $$ | /pool/minosmc_6237043.csflnx353.rl.ac.uk |
| DCM_HOME | DCM source home directory. Derived from $MOG_SW_DIR | /stage/sl3-lcg-exp/minossgm/apps/GridTools/pro-SL4/DataCacheManager |
| dcm (wrapper executable) | Use to run DCM. Derived from $MOG_SW_DIR | n/a |
| RSD_HOME | RSD source home directory. Derived from $MOG_SW_DIR | /stage/sl3-lcg-exp/minossgm/apps/GridTools/pro-SL4/RemoteSoftwareDeployment |
| rsd (wrapper executable) | Use to run RSD. Derived from $MOG_SW_DIR | n/a |
| ganga (wrapper executable) | Use to run Ganga. Hardwired for Oxford and RAL | Only if installed. If multiple versions installed can select version |
| sam* e.g. samLocate (added to path) | Use to run SAM Web Services package. Picked up from SamWebClient pro link | Only if installed |
| svn (added to path) | Use to run Subversion. Picked up from Subversion pro link | Only if installed |
| MOG_APP_DIR | Application top level (see RSD_TOP_DIR). Only set if application specified. | /stage/sl3-lcg-exp/minossgm/apps/CedarDaikon/0-build_0-SL3 |
| * | Application specific variables. Only set if application specified. | SRT_PUBLIC_CONTEXT |
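The standardised host name in the table can be taken apart mechanically. The following is a minimal Python 3 sketch, not part of GridTools: the function name `parse_host_name` and the underscore-separated pattern are assumptions inferred from the examples ral_t1_wn and oxford_t2_ui.

```python
import re

# Assumed pattern: <site>_<t1|t2>_<ui|wn>, inferred from the examples
# ral_t1_wn and oxford_t2_ui. The real GridTools rules may differ.
HOST_NAME_RE = re.compile(
    r"^(?P<site>[A-Za-z0-9]+)_(?P<tier>t1|t2)_(?P<role>ui|wn)$"
)


def parse_host_name(name):
    """Split a standardised host name into its site, tier and role parts."""
    match = HOST_NAME_RE.match(name)
    if match is None:
        raise ValueError("not a standardised host name: %r" % name)
    return match.groupdict()


print(parse_host_name("ral_t1_wn"))
```

A job script could use such a split to decide, for example, whether it is running on a worker node (role "wn") or a user interface (role "ui").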
If an application has been located then $MOG_APP_DIR will be defined and can be used to source further setup scripts for individual libraries if required. To see what library scripts are available and what they do, look at the installation scripts in the RSD library scripts directory and in particular:-
install_cernlib
install_daikon_scripts
install_GENIE
install_minossoft
install_neugen3
install_pythia
install_root
Note that different libraries have different locations and naming conventions for their scripts, so RSD always provides a standardised location and naming convention by creating scripts, or soft links to them, in:-
source $MOG_APP_DIR/setup_library/setup_<library-name>.sh
At the time of writing NFS disks will not be phased out for some months into 2008. However, it makes sense to test out DCM based production before then and this can be done using "Simulated GRID Mode". If the environmental variable:-
MOG_SIMULATE_GRID
exists (its value is irrelevant) then DCM disregards the local NFS disk and switches to "Worker Node mode". See Running on a Worker Node (WN).
Next set the GridTools environment:-
RAL: source /stage/minos-data1/software/grid/setup_minos_local-SL4.csh (or .sh)
Oxford: source /datadisk/minos/software/setup_minos_oxford.csh (or .sh)
We are going to be using the two files:-
demo_loon_job_nfs.sh starts by defining a release, script and input file to run, creates a work directory and cd's into it. Then it runs loon, lists the output files it produces and then finally cd's out of the work directory and wipes it.
demo_loon_job_nfs.jdl which is the JDL to pass the job script to the input sandbox, run the job on the queue:-
lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
and return the output.
cd to/some/scratch/directory
cp $MOG_SCRIPTS/jobs/demo_loon_job_nfs.* ./
Take a look at:-
demo_loon_job_nfs.sh
You will see that it contains:-
# Define the release, file and script to run. Will have to update from time to time.
release=minossoft:S07-10-22-R1-26-build_2-SL4
That release is history, so look at
Installed Base Releases at RAL
and update the script to use a current one.
Now you can run this like your very first "Hello World" job but, for very short jobs where you are keen to see the results ASAP, we have a little tool that will submit the JDL for you and then poll, waiting for the job to end. To run it:-
perl $MOG_SCRIPTS/jobs/run_test_job.perl demo_loon_job_nfs.jdl
If all goes well, never a given in the GRID, you should see something like this:-
Creating work directory: /tmp/run_test_job_27403
Submitting: edg-job-submit --vo minos.vo.gridpp.ac.uk --output /tmp/run_test_job_27403/job_id demo_loon_job_nfs.jdl ...
Entering polling phase ...
2007-11-02 14:56:01 Ready unavailable
2007-11-02 14:56:34 Scheduled Job successfully submitted to Globus
2007-11-02 14:58:45 Running Job successfully submitted to Globus
2007-11-02 15:02:04 Done Job terminated successfully
Retrieving job output ...
Job output returned to /tmp/jobOutput/nwest_bmjPHgQA3WnoIPTcOqT8bQ:-
File: demo_loon_job_nfs.err begins (first 20 lines max):-
(output from demo_loon_job_nfs.err)
File: demo_loon_job_nfs.out begins (first 20 lines max):-
(output from demo_loon_job_nfs.out)
Cleaning up and removing /tmp/run_test_job_27403
You can see the files in full by examining them in the temporary
directory where they have been returned. To be considerate you should
delete this directory when done rather than leave it to be eventually
removed.
Remember you need to have obtained a GRID Certificate before you can play, otherwise you can only sit and watch.
To see what software is installed on the GRID, use lcg-infosites
lcg-infosites --vo minos.vo.gridpp.ac.uk tag
you should see something like:-
Name of the CE: lcgce01.gridpp.rl.ac.uk
Name of the CE: lcgce02.gridpp.rl.ac.uk
These are the SL3 (lcgce01 - don't use) and SL4 (lcgce02) queues.
In an ideal world the 'tag' argument to lcg-infosites would list software tags but, as explained above, we don't live in such a world, so for now you have to check the Installed Base Releases for RAL SL4.
For the sake of this exercise, let's pick:-
minossoft:S07-10-22-R1-26-build_2-SL4
Besides selecting the CE and the software, we also need to decide which queue to use as CEs typically have more than one available. To get this information:-
lcg-infosites --vo minos.vo.gridpp.ac.uk ce
and look for lcgce02.gridpp.rl.ac.uk:-
#CPU Free Total Jobs Running Waiting ComputingElement
----------------------------------------------------------
1214 51 0 0 0 lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
1214 51 223 32 191 lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
By the way, don't get too excited if it looks like you have the farm to yourself; it's not showing you jobs from other experiments.
So there are two queues:-
lcgpbs-gridS
lcgpbs-grid500M
and it's not hard to guess that they are running PBS and that one is a general purpose short queue and the other a long one specifically for MINOS. We will pick the MINOS one, so the full queue name we want is:-
lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
rm -f my_input_files
dcm get --accept_dcm_url \
--remote_se not_nfs \
--file_list my_input_files \
[ file_name like F00034638_0000.mdaq.root ]
O.K., so the SAM query is contrived, but you get the idea. That query
only gave one file name of course:-
dcm://fnal-dcache-enstore/fardet_data/2006-04/F00034638_0000.mdaq.root#17711445
but your query could produce lots.
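If a job script needs to pick such a DCM URL apart, the Python 3 standard library can do it. This is only an illustrative sketch: `parse_dcm_url` is a hypothetical helper, and the text does not say what the number after the `#` means, so it is returned unchanged under the neutral key `suffix`.

```python
from urllib.parse import urlsplit


def parse_dcm_url(url):
    """Split a dcm:// URL into SE name, path below the SE, and '#' suffix."""
    parts = urlsplit(url)
    if parts.scheme != "dcm":
        raise ValueError("not a dcm URL: %r" % url)
    return {
        "se": parts.netloc,              # e.g. fnal-dcache-enstore
        "path": parts.path.lstrip("/"),  # path below the SE top level
        "suffix": parts.fragment,        # number after '#', meaning not stated
    }


example = ("dcm://fnal-dcache-enstore/fardet_data/2006-04/"
           "F00034638_0000.mdaq.root#17711445")
print(parse_dcm_url(example))
```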
$MOG_SCRIPTS/jobs/demo_loon_job_se.sh
Take a few minutes to look at that script and see if it all makes sense.
reco_far_Alt_All_development.C
It simplifies things to have both of these files in the current directory:-
cp $MOG_SCRIPTS/jobs/demo_loon_job_se.sh ./
cp from/where/ever/reco_far_Alt_All_development.C ./
If you cannot easily lay your hands on a reco_far_Alt_All_development.C you can take a look at:-
$MINOS_TOOLS/LoonValidationJobs/README
and adapt the appropriate
LVJ_reco_far_Alt_All_ver
by renaming the file and the internal function call.
*.root
Whenever you run DCM it will list the SEs that it can access from there. For example from the UI at RAL:-
ral_t1_ui has access to the following SEs:-
ral_t1_ui-nfs Local NFS Disks
ral_t1-castor-prod_d0t1 RAL T1 CASTOR disk0tape1 Production Service
ral_t1-castor-test_d0t1 RAL T1 CASTOR disk0tape1 Test Service
ral_t1-dcache-disk RAL T1 dCache Disk Store
ral_t1-dcache-tape RAL T1 dCache Tape Store
fnal-dcache-enstore FNAL dCache interface to Enstore
but of course what you need are the SEs that the WN on
lcgce02.gridpp.rl.ac.uk can see. In fact RAL Tier 1 can
also see them all.
For this exercise we will assume you want to write into the directory:-
grid_tests/loon_job/output
below the top-level minos directory of the
RAL T1 dCache Disk Store
in which case DCM has to write to:-
ral_t1-dcache-disk/grid_tests/loon_job/output
ganga
arglist = []
arglist.append('minossoft:S07-10-22-R1-26-build_2-SL4')
arglist.append('reco_far_Alt_All_development.C')
arglist.append('dcm://fnal-dcache-enstore/fardet_data/2006-04/F00034638_0000.mdaq.root#17711445')
arglist.append('ral_t1-dcache-disk/grid_tests/loon_job/output')
arglist.append('*.root')
That is filling out:-
j = Job(application=Executable(exe=File('demo_loon_job_se.sh'),args=arglist),backend='LCG')
Quite a lot is going on here but should all make sense if you have
worked through the
Ganga Tutorial
and even if you haven't it's still clear that the intention is to run
demo_loon_job_se.sh with the supplied arguments on the LCG GRID.
j.inputsandbox = ["reco_far_Alt_All_development.C"]
j.backend.CE = 'lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M'
j
and if it looks O.K.:-
j.submit()
arglist[2] = 'dcm://fnal-dcache-enstore/fardet_data/2006-04/F00034638_0001.mdaq.root#34339325'
j=j.copy()
j.application.args = arglist
j.submit()
I have chosen to use the same 'j' variable, but all the previous jobs are still available and can be listed with:-
jobs
jobs
until all the jobs show "submitted". At that point, unless something bad has happened, nothing is going to happen for a while, so you may as well quit Ganga and return later to see how your jobs have fared.
!ls -l $jobs[52].outputdir
!more $jobs[52].outputdir/stdout
or examine them directly from the command line of course.
cd /my/test/release
srt_setup -a
perl $MOG_SCRIPTS/setup/create_test_release_tar.perl
This creates a clean (i.e. without binaries) version of your Test Release and places that file in your Test Release top level directory. It does this by making a temporary copy; it does not clean yours. The tar file is called:-
test_release_tar-<your-test-release-directory-name>.tar.gz
$MOG_SCRIPTS/setup/install_test_release_from_tar.perl
The script takes two arguments:-
Your first job is to decide how to get the tar file to the GRID. You have two choices, depending on its size:-
e.g. at RAL T1 UI: source /stage/minos-data1/software/grid/setup_minos_local-SL4.csh/.sh
voms-proxy-init -voms minos.vo.gridpp.ac.uk
cd to/some/scratch/directory
cp $MOG_SCRIPTS/jobs/demo_test_release_setup_job* ./
Sandbox demo: perl $MOG_SCRIPTS/jobs/run_test_job.perl demo_test_release_setup_job_sbox.jdl
URL demo: perl $MOG_SCRIPTS/jobs/run_test_job.perl demo_test_release_setup_job_url.jdl
/stage/minos-data1/vo/grid/mcarchiver.keytab
but all that should mean is that you are in the minos group.
/stage/minos-data1/vo/mc_production/daikon_scripts
| Step | Action |
|---|---|
| setup | Define global environment including |
| gminos | Run gminos and store results as $CACHE_DIR/${GAFBASE}.tar.gz |
| save_gminos | Copy $CACHE_DIR/${GAFBASE}.tar.gz to local Storage Element |
| reroot | Run rerootjob on gminos output and store results as $CACHE_DIR/${GAFBASE}.reroot.tar |
| recon | Run loon on reroot output and store results as $CACHE_DIR/${GAFBASE}.recon.tar |
| save_recon | Copy $CACHE_DIR/${GAFBASE}.recon.tar to local Storage Element |
| copy_to_remote | Copy required files, as determined by $COPY_MODE, to subdirectories of mindata@minos26.fnal.gov:STAGED |
| cleanup | Removes $CACHE_DIR plus any empty parent directory |
For example, to just do MC and send all the results to FNAL:-
GBS_JOB_STEPS=setup;gminos;copy_to_remote;cleanup
Whereas to do detector and rock MC at RAL and then run overlay reconstruction and just send the ntuples to FNAL would involve a series of:-
GBS_JOB_STEPS=setup;gminos;reroot
followed by a:-
GBS_JOB_STEPS=setup;recon;cleanup
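Before configuring a task it can be handy to check a GBS_JOB_STEPS value against the step table. The sketch below is not GBS code: `parse_job_steps` and `KNOWN_STEPS` are hypothetical names, and the list of steps is simply copied from the table above.

```python
# Step names copied from the job step table; GBS itself may accept others.
KNOWN_STEPS = (
    "setup", "gminos", "save_gminos", "reroot",
    "recon", "save_recon", "copy_to_remote", "cleanup",
)


def parse_job_steps(value):
    """Split a semicolon separated step list and reject unknown step names."""
    steps = [step for step in value.split(";") if step]
    unknown = [step for step in steps if step not in KNOWN_STEPS]
    if unknown:
        raise ValueError("unknown job steps: %s" % ", ".join(unknown))
    return steps


print(parse_job_steps("setup;gminos;copy_to_remote;cleanup"))
```

A misspelt step name then fails straight away on the UI rather than after the job has reached the GRID.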
short_test_config=""
if [ "$GBS_MODE" = "Test" ] ; then
  short_test_config="short_test"
  if [ "$DETECTOR" = "far" ] ; then short_test_config="short_test_far"; fi
  $GBS_LOG INFO GBS_MODE = Test - configuring short test: $short_test_config
fi
$GBS_LOG INFO Running gminos_jobs.py -r $RUNN -s $SUBRUNN ${TYP[*]} $short_test_config
gminos_jobs.py -r $RUNN -s $SUBRUNN ${TYP[*]} $short_test_config > ${TYP[0]}_${RUNN}_${SUBRUNN}.log 2>&1
It creates a directory structure below $SE_TOP_DIR that mirrors the directory structure $CACHE_DIR below $NFS_TOP_DIR.
task.SetGlobalEnvironment('+LOON_MACRO_PATHS=Dogwood')
For any element in the list that is not a directory GBS tries to make
it into one by
$SRT_PUBLIC_CONTEXT/Production/Dogwood
LOON_MACRO_PATHS is required when the macros you specify with LOON_MACROS, directly or indirectly, use headers and macros with relative file names. If this is not the case LOON_MACRO_PATHS can be left empty.
task.SetGlobalEnvironment('+LOON_MACROS=asciidb/set_tsql_override.C;Dogwood/macros/GoodSpillTime.C;SRT_PUBLIC_CONTEXT/Production/Dogwood/reco_far_spill_daikon04_base_dogwood0.C')
For any element in the list that is not a file GBS tries to make it into one by
$SRT_PUBLIC_CONTEXT/Production/asciidb/set_tsql_override.C
$SRT_PUBLIC_CONTEXT/Production/Dogwood/macros/GoodSpillTime.C
$SRT_PUBLIC_CONTEXT/Production/Dogwood/reco_far_spill_daikon04_base_dogwood0.C
Non-standard scripts can be added to the task to be passed with each job:-
task.SetGlobalInputSandbox('/minossoft/releases/R2.0/my_recon_script.C')
On the worker node it will appear in the start-up directory
$GBS_WORK_DIR so to select it:-
LOON_MACROS= ... GBS_WORK_DIR/my_recon_script.C ...
It creates a directory structure below $SE_TOP_DIR that mirrors the directory structure $CACHE_DIR below $NFS_TOP_DIR.
There is further work here that will be completed when we know where Cambridge ntuples are to be stored.
The first step is to decide upon a Task naming convention. A reasonable one is to use the name of the configuration for example:-
L010185_near_bhcurv
We will use an RSMonteCarlo model, as this enforces a naming convention that encodes run and subrun, which it passes as part of the environment, and supports the allocation of a new subrun as the way of changing the MC seed.
| Variable Name | Set by | Required by steps | Meaning | Example |
|---|---|---|---|---|
| APPLICATION | Task | setup | RSD application to set up | DogwoodDaikon04:build_0-SL4 |
| ASCII_DB_NAME | Task | recon | The name of the temporary DB to hold ASCII tables | minos_temp |
| CACHE_DIR | setup | * | The directory used to hold results between job steps. The subdirectory structure below $NFS_TOP_DIR consists of:- | $NFS_TOP_DIR/daikon_04/L010185N/far/400 |
| CONCURRENT_COPY_MAX | Task | copy_to_remote | Determines the maximum number of files to copy concurrently. Optional, default 5. | 1 |
| COPY_MODE | Task | copy_to_remote | Determines what gets copied and to where. Not fully operational but following defined:- | gminos |
| DETECTOR | setup | None | Either "near" or "far" | near |
| FLUXDIR | Task | gminos | DCM style SE name and subdirectory holding flux files | ral_t1-castor-prod_d0t1/flux/gnumi/v19 |
| GAFBASE | gminos | * | The base name for output files | f21134005_0008_L010185F_D04 |
| GBS_CHECK_ENV_VARS | GBS | * | Check list of supplied environmental variables and fail if any missing. | - |
| GBS_CURRENT_JOB_STEP | GBS | * | The current job step | gminos |
| GBS_JOB_STEPS | Task | GBS | Semicolon separated list of job steps | setup;gminos;copy_to_remote;cleanup |
| GBS_LAST_STEP | GBS | * | = "YES" if in RUN or RERUN mode and last step in chain [signals step to communicate SUCCEEDED, FAILED, HOLD or RETRY] | NO |
| GBS_LOG | GBS | * | Invoke the logger | - |
| GBS_MODE | Task | gminos | One of "Production" or "Test". See Test and Production Modes | Production |
| GBS_NUM_RETRY_ARGS | GBS | GBS | Number of retry args | 1 |
| GBS_PREVIOUS_JOB_STEP | GBS | GBS | The name of the previous job step (null string for first step) | setup |
| GBS_RETRY_ARG_n | GBS | GBS | nth retry arg | - |
| GBS_SCRIPTS_DIR | Task | GBS | The location of the scripts directory | /stage/minos-data1/vo/mc_production/daikon_scripts |
| GBS_WORK_DIR | GBS | setup | Work (starting) directory | /pool/13832632.csflnx353.rl.ac.uk/gangajob_qiy14649 |
| JOB_TYPE | Task | setup | One of DETECTOR, ROCK or OVERLAY (not currently used) | DETECTOR |
| LOON_MACRO_PATHS | Task | recon | Semicolon separated list of macro/include paths | |
| LOON_MACROS | Task | recon | Semicolon separated list of macros | Dogwood |
| MINIFLUX | Task | gminos | Controls flux set size: = "yes" use reduced flux set, = "no" use full flux set. | no |
| NFS_TOP_DIR | Task | setup | Top directory under which to create $CACHE_DIR | /stage/minos-data1/vo/mc_production/cambridge/STAGE |
| REMOTE_HOST | Task | copy_to_remote | Determines the remote host. Optional, default mindata@minos26.fnal.gov | mindata@minos27.fnal.gov |
| run | Job | * | Run number | 1020 |
| SE_NAME | Task | save_gminos save_recon | Storage Element name | ral_t1-castor-prod_d0t1 |
| SE_TOP_DIR | Task | save_gminos save_recon | Storage Element top directory | mc_production/cambridge/STAGE |
| STAGED | Task | copy_to_remote | The subdirectory beneath remote site | STAGE/nwest/gbs_test |
| subrun | Job | * | Subrun number | 19 |
| TYP | Task | gminos | Beam configuration (config file is gminos_cfg_${TYP}.py) | |
| VEG_NAME | setup | None | MC vegetable name | daikon |
| VEG_VERSION | setup | None | MC vegetable version | 04 |
| WRKBASE | setup | gminos | Top level empty work directory | /tmp/tmpwyS67u/work_dir |
| WRKDIR | setup | * | Work directory for individual job steps | /tmp/tmpwyS67u/work_dir/L010185_far_NC_LEM_1020_23 |
ganga -i $GBS_HOME/python/bootstrap.py
man = GetManager()
task = man.AddTask("L010185_near_bhcurv","RSMonteCarlo")
Now we have to specify the top level script i.e.
run_gbs_job.sh
and to simplify things, use the GBS_HOME environmental variable:-
import os
task.SetScriptFileName('%s/scripts/run_gbs_job.sh' % os.environ["GBS_HOME"] )
Next we have to tell it what job steps are to be executed and where it
will find these job step scripts:-
task.SetGlobalEnvironment('+GBS_SCRIPTS_DIR=/stage/minos-data1/vo/mc_production/daikon_scripts')
task.SetGlobalEnvironment('+GBS_JOB_STEPS=setup;gminos;save_gminos;reroot;copy_to_remote;cleanup')
Of course what steps you will want depends on what production work you
want to do.
Now consult Job Environment to see what Task configuration is required for the steps you want to carry out.
task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_URL="entry1;entry2;entry3"')
won't work; when the job is passed to the GRID middleware the string
quoting confuses it and it complains: "ClassAd utils - cannot
parse classad". The solution is to replace ";" by "..." i.e.
task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_URL=entry1...entry2...entry3')
and have the
setup
convert the "..." back to ";"
task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_PSWD=\\\\0')
task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_URL=mysql:odbc://sql.gridpp.rl.ac.uk/minos_temp...mysql:odbc://lcgsql0365.gridpp.rl.ac.uk/minos_dogwood1')
task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_USER=minos_reader')
task.SetGlobalEnvironment('+ENV_TSQL_UPDATE_PSWD=\\\\0')
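The ";" to "..." substitution described above amounts to a pair of string replacements. The Python sketch below only illustrates the round trip; `encode_list` and `decode_list` are hypothetical names, and the real conversion back is done by the setup job step, whose code is not shown here.

```python
def encode_list(value):
    """Replace ';' separators by '...' so the GRID middleware accepts the value."""
    return value.replace(";", "...")


def decode_list(value):
    """Restore the ';' separators, as the setup step is described as doing."""
    return value.replace("...", ";")


encoded = encode_list("entry1;entry2;entry3")
print(encoded)
print(decode_list(encoded))
```

One consequence of this scheme is that an entry which itself contains "..." would not survive the round trip, so such entries have to be avoided.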
import os
task.SetScriptFileName('%s/scripts/run_gbs_job.sh' % os.environ["GBS_HOME"] )
task.SetGlobalEnvironment('+GBS_SCRIPTS_DIR=/stage/minos-data1/vo/mc_production/daikon_scripts')
task.SetGlobalEnvironment('+GBS_JOB_STEPS=setup;gminos;copy_to_remote;cleanup')
task.SetGlobalEnvironment('+APPLICATION=DogwoodDaikon04:build_0-SL4')
task.SetGlobalEnvironment('+JOB_TYPE=DETECTOR')
task.SetGlobalEnvironment('+NFS_TOP_DIR=/stage/minos-data1/vo/mc_production/cambridge/STAGE')
task.SetGlobalEnvironment('+TYP=L010185F_far_beam')
task.SetGlobalEnvironment('+FLUXDIR=ral_t1-castor-prod_d0t1/flux/gnumi/v19')
task.SetGlobalEnvironment('+MINIFLUX=no')
task.SetGlobalEnvironment('+COPY_MODE=gminos')
task.SetGlobalEnvironment('+STAGED=STAGE/nwest/gbs_test')
import os
task.SetScriptFileName('%s/scripts/run_gbs_job.sh' % os.environ["GBS_HOME"] )
task.SetGlobalEnvironment('+GBS_SCRIPTS_DIR=/data/minos/software/mc_production/daikon_scripts')
task.SetGlobalEnvironment('+GBS_JOB_STEPS=setup;gminos;save_gminos;reroot;recon;save_recon;copy_to_remote;cleanup')
task.SetGlobalEnvironment('+APPLICATION=DogwoodDaikon04:build_0-SL4')
task.SetGlobalEnvironment('+JOB_TYPE=DETECTOR')
task.SetGlobalEnvironment('+NFS_TOP_DIR=/data/minos/west/gbs_mc_test/STAGE')
task.SetGlobalEnvironment('+FLUXDIR=ral_t1-castor-prod_d0t1/flux/gnumi/v19')
task.SetGlobalEnvironment('+MINIFLUX=no')
task.SetGlobalEnvironment('+TYP=L010185_far_NC_LEM')
task.SetGlobalEnvironment('+LOON_MACROS=asciidb/set_tsql_override.C;Dogwood/macros/GoodSpillTime.C;SRT_PUBLIC_CONTEXT/Production/Dogwood/reco_far_spill_daikon04_base_dogwood0.C')
task.SetGlobalEnvironment('+LOON_MACRO_PATHS=Dogwood')
task.SetGlobalEnvironment('+ASCII_DB_NAME=minos_temp')
task.SetGlobalEnvironment('+SE_NAME=ral_t1-castor-prod_d0t1')
task.SetGlobalEnvironment('+SE_TOP_DIR=mc_production/gbs_test/STAGE')
task.SetGlobalEnvironment('+COPY_MODE=recon_sntp')
task.SetGlobalEnvironment('+STAGED=STAGE/nwest/gbs_test')
We will create a job with run number and subrun number both equal to 1. Because of the RSMonteCarlo model we have selected, job names are forced to have the format:-
job_rrrrrrrr_ssss
so to create such a job:-
job=task.AddJob("job_00000001_0001")
If you examine the job you will see it has:-
Local environment: 'run=1,subrun=1'
Next we need to select the short job queue on RAL Tier 1:-
task.SetBackend("LCG:lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS")
and submit the job:-
job.Submit()
after that you can check at intervals either by looking for the Ganga job to end:-
jobs(<Job ID>).status
or by updating the job status:-
job.UpdateStatus()
task.RemoveJobs()
task.SetMode("Production")
this will prevent you accidentally changing things that affect all
jobs globally and also changes the value of the application script
environmental variable
GBS_MODE
and consequently will change any script behaviour that is based on this variable.
#runnum=`seq 1000 1020`
#subrunnum=`seq 0 19`
to create ProtoJobs equivalent to this:-
for runnum in range(1000,1021):
    for subrunnum in range(0,20):
        task.AddProtoJob("job_" + str(runnum).zfill(8) + "_" + str(subrunnum).zfill(4))
If that looks O.K., promote them to real jobs:-
task.PromoteProtoJobs()
and if not:-
task.RemoveProtoJobs()
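The ProtoJob names used above follow the job_rrrrrrrr_ssss format, and it can be reassuring to list them before touching the task. This stand-alone Python sketch needs neither Ganga nor GBS; the helper name `job_name` is hypothetical.

```python
def job_name(run, subrun):
    """Build a job name with an 8-digit run and a 4-digit subrun."""
    return "job_%08d_%04d" % (run, subrun)


# Same ranges as the ProtoJob loop: runs 1000..1020, subruns 0..19.
names = [job_name(run, subrun)
         for run in range(1000, 1021)
         for subrun in range(0, 20)]

print(len(names), names[0], names[-1])
```

Under these ranges the list holds 21 x 20 = 420 names, running from job_00001000_0000 to job_00001020_0019.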
task.SetBackend("LCG:lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M")
enable Task submission and decide the total number of jobs to have
running at one time and the maximum number that you want to have
submitted each time GBS runs.
task.EnableSubmit()
task.SetMaxGangaJobs(500)
task.SetMaxSubmitJobs(10)
GBS doesn't exploit any fancy machinery to launch multiple jobs efficiently and the best it does is about 10/minute, so the SetMaxSubmitJobs value ought not to exceed about 50.
task.SubmitJobs()
after which you can either list the updated status:-
task.ListJobs()
or create a web page with links to the jobs e.g.
task.WriteHtmlReport("/home/west/work/minos/temp")
export MOG_TOOLS=...
task=man.GetTask('...')
task.WriteHtmlReport("...")
0,30 * * * * .../GridTools/Ganga/GBS/scripts/run_gbs_cron.sh
When considering Cron frequency a typical MC production has:-
Submit: 10/10 = 1 min
Check: 2*200/40 = 10 min
Retrieve: 10/25 = 1 min
Total 12 min
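The timing estimate above can be reproduced with a few lines of Python. This is a sketch under an assumed reading of the figures: each phase is taken to be a number of operations divided by a rate per minute, rounded up to a whole minute. The text gives only the numbers, so the names `operations` and `per_minute` are labels chosen for illustration.

```python
def phase_minutes(operations, per_minute):
    """Whole minutes needed for a phase, rounding up (integer ceiling)."""
    return (operations + per_minute - 1) // per_minute


submit = phase_minutes(10, 10)        # Submit: 10/10
check = phase_minutes(2 * 200, 40)    # Check: 2*200/40
retrieve = phase_minutes(10, 25)      # Retrieve: 10/25, rounded up
total = submit + check + retrieve

print(submit, check, retrieve, total)
```

On this reading one GBS pass takes about 12 minutes, which fits within the 30 minute interval of the cron entry shown above.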
If, or rather when, you need to investigate problems it is best to suspend the cron job while you run GBS interactively, to prevent it from submitting jobs while you are working.
CONCURRENT_COPY_MAX=5
CONCURRENT_COPY_TIMEOUT=36000 #10 hours
This will limit to at most 5 concurrent copies to FNAL. If more want
to copy they will have to wait, up to the TIMEOUT value, for
permission to copy. Based on rough figures above for an average MC
job together with an estimate that one connection can sustain the copy
of 5 jobs an hour, 4 concurrent connections should keep up with a
steady load of 200 jobs each lasting 10 hours.
However, if necessary that number can be reduced. The next 3 lines of the script contain:-
# Override those values if run_ral_lcg_set_copy_parms.sh exists
override_copy_parms_file="${BUNDLE_TOP}/daikon_scripts/run_ral_lcg_set_copy_parms.sh"
if [ -f $override_copy_parms_file ] ; then eval `cat $override_copy_parms_file`; fi
and normally the file run_ral_lcg_set_copy_parms.sh contains
the same values:-
CONCURRENT_COPY_MAX=5
CONCURRENT_COPY_TIMEOUT=36000 #10 hours
Placing a hacked copy of this file with a smaller CONCURRENT_COPY_MAX at RAL will take effect on the next job to start copying. Note that this is better than hacking gbs_do_copy_to_remote.sh, as that is loaded at job start time, so changes would not take effect until a newly started job reached the copy stage.
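The figure of 4 concurrent connections quoted earlier follows from simple arithmetic, sketched here in Python. The variable names are illustrative only, and the calculation assumes a steady state in which jobs finish at the same rate as they start.

```python
running_jobs = 200             # steady load quoted in the text
job_length_hours = 10          # each job lasts about 10 hours
copies_per_connection_hour = 5 # one connection copies about 5 jobs an hour

# In a steady state, jobs finish (and so need copying) at this hourly rate.
jobs_finishing_per_hour = running_jobs // job_length_hours

# Connections needed to keep up, rounded up to a whole connection.
connections_needed = -(-jobs_finishing_per_hour // copies_per_connection_hour)

print(jobs_finishing_per_hour, connections_needed)
```

With the default CONCURRENT_COPY_MAX=5 this leaves one connection of headroom over the 4 that the estimate requires.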
If:-
CONCURRENT_COPY_MAX=0
then copies will not be undertaken. The value should not be left like
that for long as GBS will preferentially rerun failures rather than
start fresh ones. This means that the job queue slowly switches to
one in which all jobs are doing nothing more than unpacking a tar and
then throwing it away when they discover there are no free locks. So
if you want to suspend copying for more than a few hours do:-
CONCURRENT_COPY_MAX=-1
This special value tells GBS to place the job on hold. Then when the
crisis has passed you can release all HELD jobs and let them complete.
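The three cases of CONCURRENT_COPY_MAX described above can be summarised as a small decision function. This Python sketch is not taken from the GBS scripts: `copy_action` and its return labels are hypothetical, and rejecting values below -1 is an assumption, since the text does not say how they are treated.

```python
def copy_action(concurrent_copy_max):
    """Summarise what a job does at the copy stage for a given limit."""
    if concurrent_copy_max == -1:
        return "hold"       # job is placed on hold until released
    if concurrent_copy_max == 0:
        return "no_copy"    # no copies are undertaken, so jobs fail and rerun
    if concurrent_copy_max > 0:
        return "copy"       # copy, subject to at most this many at once
    # Assumption: other negative values are treated as a configuration error.
    raise ValueError("unsupported CONCURRENT_COPY_MAX: %r" % concurrent_copy_max)


for value in (5, 0, -1):
    print(value, copy_action(value))
```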