Database Distribution: Slave Site Operation

Laptop users: see also DBS-lite: A Minimal Database for Laptops

Preparation

Note To use this system you must have:-

Proceed as follows:-

  1. Database Priming

    DBMauto distributes updates only, so you need a database that is already primed with existing tables. Unless your database is up to date, see Database Priming.

  2. Create top level directory

    Subdirectories of this will be used for local scripts, configuration files, log files etc. A suggested convention is to use:-

        $MINOS_SOFT/dbm
    
    but any empty directory with read/write access will do. For the remainder of these instructions this directory will be called dbm_top_dir.
        mkdir dbm_top_dir
    
  3. Prime top level directory

    This is done by going to the directory and then running the DBMauto launcher script, supplying the --prime option as an argument:-

      cd dbm_top_dir
      $SRT_PUBLIC_CONTEXT/DatabaseMaintenance/scripts/primer/launch.csh --prime
    
       (sh/bash shell users: use launch.sh instead)
    
    Notes:-
    1. SRT should already be set up; the priming process takes the current environment settings to set up the local configuration file, although you can then customise it if necessary.
    2. In particular you will need to edit the local.config and
      1. replace the value of any line having ???. One such value is the name of the SRT setup script.
      2. Update dbm_host.

    If everything is O.K., you will see output similar to the following:-

    Creating /home/west/work/minos/temp/work...
    Creating /home/west/work/minos/temp/scripts...
    Populating the scripts subdirectory...
      Creating local.config...
      Creating run_import.sh...
      Creating run_checksum.sh...
    Populating the work subdirectory...
      Creating FNAL_import.context...
      Creating FNAL_import.log...
    Priming complete
    
    If you run it a second time you will get:-
    Populating the scripts subdirectory...
      File local.config already exists; please move it away to recreate it
      File run_import.sh already exists; please move it away to recreate it
      File run_checksum.sh already exists; please move it away to recreate it
    Populating the work subdirectory...
      File FNAL_import.context already exists; please move it away to recreate it
      File FNAL_import.log already exists; please move it away to recreate it
    Priming complete 
    
    showing that the system only recreates missing files.

    This priming process creates the following structure:-

      scripts/
        local.config
        run_import.sh
        run_checksum.sh
      work/
        FNAL_import.context
        FNAL_import.log
    
    The scripts directory contains a file holding the local configuration and two scripts that should be set up as cron jobs. The work directory contains a log file that records imports and a context file used to determine what update to take next.

    sh/bash shell users

    The priming system assumes you are running with csh or tcsh (not sh or bash). To use sh or bash:-

  4. Check local.config and correct as required.

  5. Check FNAL_import.context and change the update number to the one to which you synchronised when you carried out Database Priming

  6. Set up the cron jobs

  7. If you want to receive reports of significant changes to the database i.e. those that could affect production running, add your email address to the file:-
        $SRT_PUBLIC_CONTEXT/DatabaseMaintenance/scripts/logentry_mail_list 
    
    and commit it, or get someone who has the privileges to do it for you.
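
As a quick check after step 3, the layout that priming creates can be verified with a small script. This is only a sketch: check_primed is a hypothetical helper, not part of DBMauto, and it tests only for the five files listed in the structure shown in step 3.

```shell
#!/bin/sh
# Sketch: report which of the files created by the --prime step are missing.
# The first argument is dbm_top_dir (defaults to the current directory).
check_primed() {
    top=${1:-.}
    missing=0
    for f in scripts/local.config scripts/run_import.sh scripts/run_checksum.sh \
             work/FNAL_import.context work/FNAL_import.log; do
        if [ ! -e "$top/$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "layout complete"
    fi
    # The return status is the number of missing files (0 means all present).
    return "$missing"
}
```

Calling check_primed dbm_top_dir prints one line per missing file, or "layout complete" if all five are present.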

If you have problems see Slave Site Problems.

Setting up a CRON Job

The priming process produces the scripts run_import.sh and run_checksum.sh, which should be ready to run as cron jobs, bearing in mind that such jobs must run within a minimal environment. If you are taking part in the Distribution Validation System then run_checksum.sh should run once a day. How often you want the run_import.sh job to run depends on how up to date you want to stay. It makes little sense to update more frequently than once an hour, so suitable crontab entries might look something like:-

  00 * * * * /minos/software/OO/dbm/scripts/run_import.sh > /dev/null 2>&1
  30 1 * * * /minos/software/OO/dbm/scripts/run_checksum.sh > /dev/null 2>&1

Note the discarding of stdout and stderr into /dev/null so that the contact list doesn't get a mail every time the job runs! The redirection to /dev/null must come before 2>&1; written the other way round, stderr would still reach cron and be mailed.
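
The shell processes redirections from left to right, so the order in which stdout and stderr are redirected decides whether anything is left for cron to mail. The following sketch demonstrates the difference; noisy is a hypothetical stand-in for a cron job and is not part of DBMauto.

```shell
#!/bin/sh
# noisy stands in for a cron job: it writes one line to each output stream.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# stdout is sent to /dev/null first, then stderr is pointed at the same place:
# nothing escapes, so cron would have nothing to mail.
both_discarded=$(noisy > /dev/null 2>&1)

# stderr is duplicated onto the original stdout first and only then is stdout
# discarded: the stderr line still escapes and would be mailed by cron.
stderr_escapes=$(noisy 2>&1 > /dev/null)

echo "first form leaked:  '$both_discarded'"
echo "second form leaked: '$stderr_escapes'"
```

The first form leaks nothing, while the second still lets the "to stderr" line through.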

If you are not used to crontab, then a quick primer:-
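
In outline, crontab -e edits your crontab and crontab -l lists it. Each entry consists of five time fields (minute, hour, day of month, month, day of week) followed by the command to run, with * meaning "every". The sketch below simply labels the fields of an entry; describe_cron_entry is a hypothetical helper written for illustration.

```shell
#!/bin/sh
# Sketch: label the five time fields of a crontab entry and the command.
describe_cron_entry() {
    # read splits on whitespace without expanding the "*" wildcards;
    # everything after the fifth field is kept together as the command.
    printf '%s\n' "$1" | {
        read -r minute hour dom month dow command
        echo "minute=$minute hour=$hour day-of-month=$dom month=$month day-of-week=$dow"
        echo "command=$command"
    }
}

describe_cron_entry "00 * * * * /minos/software/OO/dbm/scripts/run_import.sh"
describe_cron_entry "30 1 * * * /minos/software/OO/dbm/scripts/run_checksum.sh"
```

Read this way, the first entry runs at minute 00 of every hour and the second at 01:30 every day.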

Important
The first time you run run_import.sh it will take a long time as it will have to process a large backlog of update files. It may not even complete; for example, Linux sometimes kills jobs if it runs short of resources. The system is designed to be failsafe, so repeat the job and check FNAL_import.log until you see that it is waiting for updates. Only then should you set it up as a cron job, for otherwise there is a risk that a second cron job will start up before the first has completed.
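
One generic way to guard against two jobs overlapping is to wrap the command in a lock. The sketch below is an assumption about how you might do this locally, not something DBMauto is documented to provide: run_exclusive and the lock directory path are hypothetical, and the status 75 is an arbitrary choice to signal "skipped".

```shell
#!/bin/sh
# Sketch: run a command only if no earlier instance still holds the lock.
# mkdir is atomic, so only one process can succeed in creating the directory.
run_exclusive() {
    lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous job still running; skipping" >&2
        return 75
    fi
    "$@"
    status=$?
    # Release the lock and pass on the command's own exit status.
    rmdir "$lock_dir"
    return "$status"
}
```

A cron entry could then call a small wrapper script that runs run_exclusive /some/lock/dir /minos/software/OO/dbm/scripts/run_import.sh. If a job is killed part way through, the lock directory is left behind and has to be removed by hand before the next run will start.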

As an alternative you might consider adding executables to the /etc/cron.hourly, /etc/cron.daily and /etc/cron.weekly directories, which get executed at the frequency their name suggests. For example the file:-

  /etc/cron.hourly/dbm_import.cron
could contain the line:-
  /minos/software/OO/dbm/scripts/run_import.sh > /var/log/dbm_import_latest.log 2>&1

If you maintain a Master site then you will need to run run_export.sh instead of, or as well as, run_import.sh.

Important
You must run Slave site imports at least weekly, and fix problems promptly. If you let it get more than 2 weeks out of date then it may fail with "Unable to process update files .. - there is a gap after update..."

There is a recommendation on the scheduling of cron jobs. If the job involves access to AFS disks under KERBEROS control see Using KERBEROS.

Slave Site Problems

How do I know if it's working?

If you have followed the Slave: Preparation instructions then your local.config will have dbm_heartbeat_freq set to 24. If you have also followed the recommendation in Setting up a CRON Job to run every hour, then you should get a report every day.

If you don't get a report daily, or the report does not say that there have been at least a few successful imports, then something is wrong. Depending on other jobs on the local machine, some jobs may fail and send error reports. For example, if you have a regular build running each night, then DBMauto may fail simply because the dbmjob has been deleted. So long as these are isolated incidents, and you continue to get good daily reports, such errors can be ignored.

What should I do if it's not working?

Carry out the following steps:-

  1. Check your local.config and in particular:-

  2. If the dbm_publish_dir is /afs/fnal.gov/files/data/minos/d210/rsync check that you can access it by typing:-
      ls /afs/fnal.gov/files/data/minos/d210/rsync

  3. If the dbm_publish_dir isn't /afs/fnal.gov/files/data/minos/d210/rsync and you are using rsync to distribute update files to the local site, check that this is working and that updates are being copied to the local site.

  4. If your local.config looks O.K. and your dbm_publish_dir has update files of the form FNAL_000nnnnn.dbm.gz then try running run_import.sh interactively :-
      setenv DBM_DEBUG
      ./run_import.sh
    
    to see if something is wrong, and then examine FNAL_import.log to see if that gives any clues. If there are errors, only a summary is placed in the log file, but it will contain the name of an error log file that holds the full job output, which should also be examined.

  5. If you are having problems with CAL*, PLEX* or UGLI* tables, then you may need to re-prime from CVS as follows:-

    1. Ensure DatabaseTables is up to date:-
      
      cd $SRT_PUBLIC_CONTEXT/DatabaseTables/
      cvs update
      
      
    2. Reprime the database
      
      cd CalibrationTables/
      mysql --local-infile=1 -u writer -p ...
      use offline
      \. create_and_fill_calibration.mysql
      quit
      cd ..
      
      cd UgliTables/
      mysql --local-infile=1 -u writer -p ...
      use offline
      \. define_and_fill_ugli.mysql
      quit
      cd ..
      
      cd PlexTables/
      mysql --local-infile=1 -u writer -p ...
      use offline
      \. define_and_fill_plex.mysql
      quit
      cd ..
      
      
      
    3. Currently the files in CVS have some updates that are not ready for general consumption. The following patch has also to be applied:-
      
      mysql -u writer -p ...
      use offline
      
      delete from PLEXVETOSHIELDMUXTOMDL    where seqno = 200000401;
      delete from PLEXVETOSHIELDMUXTOMDLVLD where seqno = 200000401;
      delete from UGLIDBISCINTMDL           where seqno between 210004528 and 210004575;
      delete from UGLIDBISCINTMDLVLD        where seqno between 210004528 and 210004575;
      delete from UGLIDBISCINTPLN           where seqno between 210004528 and 210004575;
      delete from UGLIDBISCINTPLNVLD        where seqno between 210004528 and 210004575;
      delete from UGLIDBISTEELPLN           where seqno between 210004528 and 210004575;
      delete from UGLIDBISTEELPLNVLD        where seqno between 210004528 and 210004575;
      delete from UGLIDBISTRIP              where seqno between 210004528 and 210004575;
      delete from UGLIDBISTRIPVLD           where seqno between 210004528 and 210004575;
      
      quit
      
      

  6. If none of this helps then mail Nick West and send along:-

How can I confirm that my database is O.K.?

To confirm the state of your database:-

  1. Make sure that your local.config contains the line:-
      set dbm_coordinator_list = "n.west1\@physics.ox.ac.uk"

  2. Run run_checksum.sh

  3. Send mail to Nick West to ask him to check it.

Nick will compare it against the database at FNAL and, if there are problems, may send back a Checksum Analysis report.

What is a Checksum Analysis report?

The output from two or more run_checksum.sh jobs can be compared to produce a Checksum Analysis report. Here are a few fragments of an example comparing the Database at RAL to the one at FNAL:-

  Using filter file: /minos/software/OO/dbm/scripts/analyse_checksum.filter

  Filter for host:*
  *:0-0
  DBUSUBRUNSUMMARY:200000000-299999999
  CAL*:0-999999999
  PLEX*:0-999999999
  UGLI*:0-999999999
  ...

The system has a set of global filters that allow selected regions of selected tables to be ignored. For example, some sites may choose not to import DBUVA* tables and, if these were not ignored when comparing reports, they would swamp the analysis with false positives. The filters can apply to a specific site or to all sites. In this example the filter applies to all sites.

Filters apply to each table and select the range of SEQNOs to be included. The filter:-

  *:0-0

means ignore everything from every table. However, the filters:-

  DBUSUBRUNSUMMARY:200000000-299999999
  CAL*:0-999999999
  PLEX*:0-999999999
  UGLI*:0-999999999

mean accept SEQNOs 2xxxxxxxx for DBUSUBRUNSUMMARY and everything for CAL*, PLEX* and UGLI*.

For each table the best (i.e. longest) match filter is taken, so in the above example CAL*, PLEX* and UGLI* are matched in full, DBUSUBRUNSUMMARY is matched for 2xxxxxxxx SEQNOs and everything else is ignored.
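
The selection rule can be sketched in shell as follows. This is an illustration only, not the DBMauto implementation: seqno_included is a hypothetical helper, "longest match" is assumed to mean the longest pattern text, and filters are passed as PATTERN:LOW-HIGH arguments.

```shell
#!/bin/sh
# Sketch: pick the longest filter pattern that matches a table name and
# report (via the exit status) whether a SEQNO lies inside that filter's range.
seqno_included() {
    table=$1
    seqno=$2
    shift 2
    best_pattern=""
    best_range=""
    for filter in "$@"; do
        pattern=${filter%%:*}
        range=${filter#*:}
        # An unquoted $pattern in case is treated as a glob, so CAL* matches
        # any table name beginning with CAL.
        case $table in
            $pattern)
                if [ "${#pattern}" -gt "${#best_pattern}" ]; then
                    best_pattern=$pattern
                    best_range=$range
                fi
                ;;
        esac
    done
    # No filter matched this table at all.
    [ -n "$best_pattern" ] || return 1
    low=${best_range%-*}
    high=${best_range#*-}
    [ "$seqno" -ge "$low" ] && [ "$seqno" -le "$high" ]
}
```

With the five filters from the example, seqno_included DBUSUBRUNSUMMARY 200487600 succeeds, while a table that matches only the * filter is rejected for every non-zero SEQNO because its range is 0-0.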

  Using the following reports from sites:-

  Site ----------------------- Report -----------------------
    0  Database minos_offline on server RAL_csf at 2003-05-07 05:46:05 (last update: 1828)
    1  Database offline on server Fnal at 2003-05-07 02:20:04 (last update: 1825)

  Ignoring records with insertion dates between 2003-05-05 13:32:44 and 2003-05-07 13:32:44
  ...

In this case there are only two site reports, called site 0 and site 1. Updates take time to propagate so the system ignores records with very recent insert dates.

  Analysing table: DBUSUBRUNSUMMARY

    SEQNO      Creation Date        Insert Date          Site 0  Site 1
    200487600  2003-03-07 00:00:00  2003-04-12 07:19:01  o.k.    Date
    200487700  2003-03-07 00:00:00  2003-04-12 07:19:01  o.k.    Date

This is the analysis of the DBUSUBRUNSUMMARY table. 2 conflicts were found; for both there was a "Date" error at site 1 (Fnal). The possible conflicts are:-

  Type  Description
  Cksm  Wrong checksum
  Date  Wrong Creation Date
  Ins   Bad Insert Date
  Miss  Missing

  Range processed: 200429500-900000000. Number: Conflicts: 2, O.K.: 8885, Part ign: 0, All ign: 3584

The analysis of each table concludes with a summary of the full range of SEQNOs analysed and results found:-

  Type       Description
  Conflicts  Conflicts of some type (Cksm, Date, Ins, Miss)
  O.K.       O.K.
  Part ign   No conflict but record ignored at some sites.
  All ign    Record ignored at all sites.

What does "Found ... in ..., but not in its VLD" mean?

It's complaining that it has found SEQNOs that are in the main table but not in the associated VLD table. It may be an indication of a more serious problem, or there may not be a problem at all. Very occasionally there isn't a problem, i.e. the VLD entry really is there but has been missed. The cause hasn't been tracked down but must involve either the writing of the mysql dump of the table or its subsequent input. If the VLD entry really is missing, use mysql to delete the SEQNOs from the main table.
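
To see for yourself which SEQNOs are affected, one approach is to compare the SEQNO columns of the two tables. The sketch below assumes you have already exported each column to a plain text file with one SEQNO per line; orphan_seqnos and the file arguments are hypothetical and not part of DBMauto.

```shell
#!/bin/sh
# Sketch: print SEQNOs that appear in the main-table list but not in the
# VLD-table list. Each argument is a text file with one SEQNO per line.
orphan_seqnos() {
    tmp_main=$(mktemp) && tmp_vld=$(mktemp) || return 2
    # comm needs both inputs sorted in the same order, with duplicates
    # (a main table usually has several rows per SEQNO) removed.
    sort -u "$1" > "$tmp_main"
    sort -u "$2" > "$tmp_vld"
    # -23 suppresses lines unique to the second file and lines common to both.
    comm -23 "$tmp_main" "$tmp_vld"
    rm -f "$tmp_main" "$tmp_vld"
}
```

Any SEQNO it prints can then be checked by hand in mysql before deciding whether to delete it from the main table.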

What does "Unable to process update files .. - there is a gap after update..." mean?

DBMauto files are only kept for a minimum of 50 days, to keep the directory holding them to a reasonable size. The assumption is that anyone using DBMauto is running it at least once a week. If you let DBMauto operation fall behind to the point that it loses updates then it fails, reporting that it has found a gap. Your only course of action is then to reprime that database following the standard procedure given in Database Priming.
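
Before running an import after a long break, you can check whether the update files you still have are contiguous. The sketch below is a local check only, not how DBMauto itself detects the gap: first_gap_after is a hypothetical helper, and it assumes file names of the form FNAL_000nnnnn.dbm.gz carry an 8-digit zero-padded update number.

```shell
#!/bin/sh
# Sketch: report the first missing update file after a given update number.
# Arguments: directory holding the update files, last update already imported.
first_gap_after() {
    dir=$1
    last=$2
    # Highest update number present; leading zeros are stripped so the
    # shell does not treat the number as octal.
    highest=$(ls "$dir" | sed -n 's/^FNAL_0*\([0-9][0-9]*\)\.dbm\.gz$/\1/p' | sort -n | tail -1)
    [ -n "$highest" ] || { echo "no update files in $dir"; return 2; }
    n=$((last + 1))
    while [ "$n" -le "$highest" ]; do
        name=$(printf 'FNAL_%08d.dbm.gz' "$n")
        if [ ! -e "$dir/$name" ]; then
            echo "gap: update $n ($name) is missing"
            return 1
        fi
        n=$((n + 1))
    done
    echo "no gap between update $last and $highest"
}
```

For example, with files for updates 1826, 1827 and 1829 present, first_gap_after on that directory with 1825 as the last imported update reports update 1828 as missing.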


Last Modified: $Date: 2007/09/07 15:31:42 $