Database Distribution: Validation

Comparing Local Database to FNAL

To properly validate the distribution flows, a separate system has been set up that works as follows.

At regular intervals, say once a day, each participating site runs run_checksum.sh as a cron job, e.g.:-

  30 1 * * * /minos/software/OO/dbm/scripts/run_checksum.sh 2>&1 > /dev/null

This scans the database and produces a checksum report, which it compares to the master checksum report produced at FNAL and written to the dbm_publish_dir directory. The results of this comparison are mailed both to the dbm_contact_list and to the central database manager (dbm_coordinator_list - currently Nick West).

Generating a complete checksum for the entire database takes many hours so the system works as follows:-

  1. Each table checksum is written to a separate log file.

  2. The system can do 3 types of checksum:-

    1. incremental
      To do an incremental checksum, the system reads the previous checksum log file for the table and only recomputes checksums for entries whose insertion dates are later than the creation date of that file.

    2. differential
      To do a differential checksum, the system reads the SEQNO and insertion date from the VLD table and compares these to the checksum log file. It then reads and computes checksums for any SEQNO where the log file insertion date does not match. This type of checksum is almost as fast as incremental and almost as comprehensive as full, because any normal update of a table will change insertion dates.

    3. full
      To do a full checksum, the system reads both the VLD and main tables in full and recreates the checksum log file from scratch. For the largest tables this can take many hours so, after each full checksum, the system also records the results of the MySQL command
      
        checksum table xxx,xxxVLD;  [where xxx = table name]
      
      Before doing the next full checksum it repeats this command and, if the result is unchanged, simply makes a copy of the existing log.

  3. Differential and full refresh intervals are assigned to each table (see get_table_attribute.pm), with essential tables, e.g. UGLI*, being assigned shorter intervals than less essential ones, e.g. DCS*.

  4. Each time the checksumming process runs it determines the type of checksum based on the refresh interval and the time that has elapsed since the last checksum of that type. The test is not a simple threshold (do the checksum once the interval has passed), as this can lead to tables getting into sync, with the result that on most days the system has little to do and then occasionally has a great deal to do. So instead a randomising element is introduced that spreads the checksums uniformly across the refresh interval. The test is applied first to the full checksum and, if it fails, to the differential. If that too fails then an incremental checksum is done.
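
The randomised selection in step 4 can be illustrated with a minimal Python sketch. The exact rule used by the real scripts is not given here, so this is an assumption: on each run, a checksum type is chosen with probability step / (interval - elapsed), which spreads the chosen day evenly across the refresh interval. The names is_due and choose_checksum_type, and the one-day step, are hypothetical and not taken from the dbm scripts.

```python
import random


def is_due(elapsed_days, interval_days, step_days=1.0, rng=random):
    """Randomised test (assumed rule, not the dbm scripts' actual one).

    Fires with probability step / (interval - elapsed), and always fires
    once no more than one step of the interval remains.
    """
    remaining = interval_days - elapsed_days
    if remaining <= step_days:
        return True
    return rng.random() < step_days / remaining


def choose_checksum_type(days_since_full, days_since_diff,
                         full_interval, diff_interval, rng=random):
    """Apply the test to full first, then differential, else incremental."""
    if is_due(days_since_full, full_interval, rng=rng):
        return "full"
    if is_due(days_since_diff, diff_interval, rng=rng):
        return "differential"
    return "incremental"
```

With a simple threshold, every table whose interval expires on the same day would be redone together; with a rule of this kind, tables that start in step drift apart after the first cycle.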

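The differential checksum described in the list above can be sketched in Python as follows. This is an illustration only: the in-memory log layout (SEQNO mapped to insertion date and checksum), the fetch_rows callback and the use of SHA-256 are all assumptions, since the real log file format and checksum algorithm are not described here.

```python
import hashlib


def differential_update(log, vld_rows, fetch_rows):
    """Recompute checksums only where the insertion date has changed.

    log        -- dict: seqno -> (insert_date, checksum), from the old log
    vld_rows   -- iterable of (seqno, insert_date) read from the VLD table
    fetch_rows -- hypothetical callback: seqno -> bytes of the main-table rows
    """
    updated = {}
    recomputed = []
    for seqno, insert_date in vld_rows:
        entry = log.get(seqno)
        if entry is not None and entry[0] == insert_date:
            # Insertion date matches the log: reuse the stored checksum.
            updated[seqno] = entry
        else:
            # New or modified SEQNO: read the data and checksum it again.
            digest = hashlib.sha256(fetch_rows(seqno)).hexdigest()
            updated[seqno] = (insert_date, digest)
            recomputed.append(seqno)
    return updated, recomputed
```

Only the VLD table is read in full; main-table data is fetched just for the SEQNOs whose insertion dates differ from the log, which is why this mode is nearly as cheap as an incremental one.
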
So the system ensures that:-

Tuning the Comparison using: analyse_checksum.filter

In all probability local databases will only keep a subset of all the tables held in the Master at FNAL, and this could lead to spurious conflicts. To avoid them the system looks for the file:-
  analyse_checksum.filter
and uses it to ignore entire tables, or bands of SEQNOs within tables. If the file does not exist, one is created and set to ignore the tables:-
  DBU*
  DCS*
  PULSER*
but you can modify it to suit your local requirements, i.e. to match the filter in the dbm_command_options of the local.config file. See the file itself for instructions.

The checksum analysis report produced by the system lists the filter. Make sure that the filter isn't set too broad, for that might hide conflicts!
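
As an illustration of how such a filter affects the comparison, here is a minimal Python sketch. It does not reproduce the real analyse_checksum.filter syntax (see the file itself for that): the rule layout (a table pattern plus an optional SEQNO band), the function names and the table names in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase

# Default rules mirroring the tables ignored when no filter file exists.
# Each rule is (table pattern, SEQNO band or None for the whole table).
DEFAULT_FILTER = [("DBU*", None), ("DCS*", None), ("PULSER*", None)]


def is_ignored(table, seqno, filter_rules):
    """Return True if a rule covers this table (and SEQNO band, if any)."""
    for pattern, band in filter_rules:
        if not fnmatchcase(table, pattern):
            continue
        if band is None:
            return True
        low, high = band
        if low <= seqno <= high:
            return True
    return False


def conflicts(local, master, filter_rules):
    """List (table, seqno) keys whose checksums differ and are not filtered.

    local, master -- dict: (table, seqno) -> checksum
    """
    found = []
    for key in sorted(set(local) | set(master)):
        table, seqno = key
        if is_ignored(table, seqno, filter_rules):
            continue
        if local.get(key) != master.get(key):
            found.append(key)
    return found
```

For example, if the master holds a made-up entry ("DCSHV", 5) that the local database does not keep, the default rules suppress it, while a rule such as ("*", None) would suppress every conflict, which is the over-broad case warned about above.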


Last Modified: $Date: 2009/08/28 07:25:25 $