Database Distribution: Validation
Comparing Local Database to FNAL
To properly validate the distribution flows, a separate system has been set up
that works as follows.
At regular intervals, say once a day, each participating site runs a
run_checksum.sh cron job, e.g.:-
30 1 * * * /minos/software/OO/dbm/scripts/run_checksum.sh > /dev/null 2>&1
This scans the database and produces a checksum report, which it
compares to the master checksum report produced at FNAL and
written to the dbm_publish_dir directory. The results of
this comparison are mailed both to the dbm_contact_list
and to the central database manager (dbm_coordinator_list, currently Nick West).
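The comparison step can be pictured with a minimal sketch. It assumes, purely for illustration, that each report has been loaded as a mapping from table name to a mapping of SEQNO to checksum; the real report format, and the function name compare_reports, are not taken from the scripts themselves.

```python
def compare_reports(local, master):
    """Compare two checksum reports and list the conflicts.

    Hypothetical layout: each report is {table: {seqno: checksum}}.
    Returns a list of (table, seqno, reason) tuples.
    """
    conflicts = []
    for table in sorted(set(local) | set(master)):
        local_rows = local.get(table, {})
        master_rows = master.get(table, {})
        for seqno in sorted(set(local_rows) | set(master_rows)):
            if seqno not in local_rows:
                conflicts.append((table, seqno, "missing locally"))
            elif seqno not in master_rows:
                conflicts.append((table, seqno, "missing at master"))
            elif local_rows[seqno] != master_rows[seqno]:
                conflicts.append((table, seqno, "checksum differs"))
    return conflicts
```

Note that in this sketch a table held only at the master shows up as a run of "missing locally" conflicts.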
Generating a complete checksum for the entire database takes many hours,
so the system works as follows:-
- Each table checksum is written to a separate log file.
- The system can do 3 types of checksum:-
- incremental
To do an incremental checksum, the
system reads the previous checksum log file for the table and only
recomputes checksums for entries whose insertion dates are
later than the creation date of that file.
- differential
To do a differential checksum, the
system reads the SEQNO and insertion date from the VLD table and
compares these to the checksum log file. It then reads and computes
checksums for any SEQNO where the log file insertion date does not
match. This type of checksum is almost as fast as incremental
and almost as comprehensive as full, because any normal update of
a table will change insertion dates.
- full
To do a full checksum, the system reads both
the VLD and main tables in full and recreates the checksum log file
from scratch. For the largest tables this can take many hours so,
after each full checksum, the system also records the results of the
MySQL command
checksum table xxx,xxxVLD; [where xxx = table name]
Before doing the next full checksum it repeats this command and, if the
result is unchanged, simply makes a copy of the existing log.
- Differential and full refresh intervals are
assigned to each table (see get_table_attribute.pm), with essential
tables, e.g. UGLI*, being assigned shorter intervals than less essential
ones, e.g. DCS*.
- Each time the checksumming process runs it determines the type of
checksum based on the refresh interval and the time that has elapsed
since the last checksum of that type. It is not a simple threshold
(do the checksum once the interval has passed), as this can lead to tables
getting into step with each other, with the result that on most days the
system has little to do and then occasionally has a great deal to do.
So instead a randomising element is introduced that spreads the checksums
uniformly across the refresh interval. The test is applied
first to the full checksum and, if it fails, to the
differential. If that too fails, then an incremental checksum is
done.
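The selection order described above can be sketched as follows. The randomising rule used here (probability of passing equal to elapsed time divided by refresh interval, capped at 1) is only an assumption chosen for illustration; the rule actually used by run_checksum.sh is not described in this document, and the function names are hypothetical.

```python
import random


def due(elapsed_days, interval_days, rng=random):
    """Randomised test: the closer the elapsed time is to the refresh
    interval, the more likely the test is to pass (assumed rule)."""
    if interval_days <= 0:
        return True
    probability = min(1.0, elapsed_days / interval_days)
    return rng.random() < probability


def choose_checksum_type(days_since_full, full_interval,
                         days_since_diff, diff_interval, rng=random):
    """Apply the test to full first, then differential, else incremental."""
    if due(days_since_full, full_interval, rng):
        return "full"
    if due(days_since_diff, diff_interval, rng):
        return "differential"
    return "incremental"
```

With this rule a checksum that is overdue is always done, while one that has only just been done is never repeated on the same run.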
So the system ensures that:-
- All data in transit gets checked each time it runs.
- All tables are checked regularly by differential checksum and
occasionally in full.
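As an illustration of why the differential checksum is cheap, the sketch below reuses logged checksums wherever the insertion date is unchanged and recomputes only the rest. The data layout (dictionaries keyed by SEQNO) and the names differential_update and row_checksum are assumptions made for this example, not the format of the real log files.

```python
def differential_update(vld_rows, log, row_checksum):
    """Bring a checksum log up to date using insertion dates only.

    vld_rows:     {seqno: insertion_date} as read from the VLD table
    log:          {seqno: (insertion_date, checksum)} from the previous log
    row_checksum: callable(seqno) -> checksum; stands in for the expensive
                  read of the main table
    Returns (new_log, recomputed_seqnos).
    """
    new_log = {}
    recomputed = []
    for seqno, insertion_date in vld_rows.items():
        previous = log.get(seqno)
        if previous is not None and previous[0] == insertion_date:
            # Insertion date matches the log: keep the old checksum.
            new_log[seqno] = previous
        else:
            # New or updated entry: read the data and recompute.
            new_log[seqno] = (insertion_date, row_checksum(seqno))
            recomputed.append(seqno)
    return new_log, recomputed
```

Only the SEQNOs returned in the second element require the main table to be read, which is where the time saving over a full checksum comes from.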
Tuning the Comparison using: analyse_checksum.filter
In all probability local databases will only keep a subset of all the
tables held in the Master at FNAL, and this could lead to spurious
conflicts. To avoid them the system looks for the file:-
analyse_checksum.filter
and uses it to ignore entire tables, or bands of SEQNOs within tables.
If the file does not exist, one is created and set to ignore the tables:-
DBU*
DCS*
PULSER*
but you can modify it to suit your local requirements, i.e. to match the
filter in the dbm_command_options of the local.config file.
See the file itself for instructions.
The checksum analysis report produced by the system lists the filter.
Make sure that the filter isn't set too broad, as that might hide conflicts!
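The effect of the default table patterns can be illustrated with a short sketch. It assumes the patterns behave like shell-style wildcards and covers whole-table filtering only, not SEQNO bands; the syntax of the real analyse_checksum.filter file is described in the file itself, and the table names used in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase

# Default patterns listed above; treated here as shell-style wildcards.
DEFAULT_IGNORED = ["DBU*", "DCS*", "PULSER*"]


def is_ignored(table, patterns=DEFAULT_IGNORED):
    """True if the table name matches any ignore pattern."""
    return any(fnmatchcase(table, pattern) for pattern in patterns)


def filter_conflicts(conflicts, patterns=DEFAULT_IGNORED):
    """Drop conflicts, given as (table, seqno, reason), for ignored tables."""
    return [c for c in conflicts if not is_ignored(c[0], patterns)]
```

For example, is_ignored("DCS_EXAMPLE") is true while is_ignored("UGLI_EXAMPLE") is false. A pattern as broad as "*" would suppress every conflict, which is exactly the risk noted above.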
Last Modified: $Date: 2009/08/28 07:25:25 $