Database Distribution: Introduction
Contents

The flow of data is shown schematically in diagram. At the heart of
the system is the Master Database at Fermilab. Most database updates
enter the database realm here. At regular intervals the DBM job
dbmjob is used to export all recently updated data and these export
files are distributed to all other databases where the data is
imported if not already present. This is done by the local database
manager again using dbmjob. These primary data flows are shown in
red.
Smaller amounts of data come from secondary databases e.g. at CalDet
and these are exported up to the Master Database where they join other
up
This system relies on the ability to:-
- Record the insertion date so that updates can be incremental.
- Uniquely identify data so that it is not accidentally duplicated
if attempting import more than once. For example updates to a
secondary database might be reflected back if exporting all recent
changes. However such data is ignored as duplicated data when
resubmitted to the Master.
dbmjob exploits the fact that all Dbi compliant database tables come
in pairs, the main data table and an auxiliary validity range table.
The auxiliary table records insertion dates and have globally unique
SeqNos (Sequence Numbers). The diagram shows how globally unique
numbers are assigned. Every database that is a source of data has a
GlobalSeqNo table that is used to generate sequence numbers. Each
time one is allocated the count is incremented in the table. For each
database the table operates in a different range of numbers hence
ensuring that all are unique. dbmjob moves data in ``Validity
Packets'' i.e. a single row in the auxiliary table and all its
associated data rows. The insertion date and SeqNo on the auxiliary
row allow dbmjob to support incremental updates and avoid data
duplication.
All this implies a very important restriction on dbmjob:-
dbmjob can only distribute Dbi compliant database tables i.e.
ones that come in pairs, the main data table and an auxiliary validity range
table.
Along with the data, dbmjob exports table descriptions in the database
update file. The Slave database ignores these descriptions for any
table it already has otherwise it uses them to create empty tables
before importing the associated data. This allows new tables to be
added the Master database and for the Slave databases to automatically
create them when they first receive data for them.
The system also distributes updates to the DBILOGENTRY table which
records significant changes to the database i.e. those that could
affect production running. These updates are written to the import
log files and also broadcast to the list of names contained in:-
$SRT_PUBLIC_CONTEXT/DatabaseMaintenance/scripts/logentry_mail_list

The DatabaseMaintenance package provides a perl driver script to
automate the system described above. The key points are:-
- At a site with access to the Master Database a cron job runs the
script with a configuration file that specifies what data to export
and where. Each time the cron job runs dbmjob it creates a new update
file in a work directory, although these will be empty if no data
has been added to the database since the last update. The driver
script just removes empty updates and moves non-empty ones into a
publish directory.
- At Slave sites the same script is configured to act as an importer.
The driver script checks to see what update file was last imported and
then looks for further updates, in order, in the publish directory.
- For Slave sites that don't have direct access to the Master's publish
directory some type of mirror software has to run to provide a locally accessible
mirror. One possible way to mirror is to use rsync. This is described
below
- Note that the Master and Slave don't need to be in synchronisation. For
example the Master might attempt to export every hour but a particular
Slave site might run a cron job only once a week to import all the updates
generated.
- The scripts
mails
errors, and regular summaries to a contact list
specified as part of the configuration, so if all is working normally
the contacts have only to ensure that they receive the summaries.
- As the scripts are configured locally its possible to build a distribution
web with databases acting both as masters and slaves. The reflection logic
described in the
Overview
ensures that updates don't get re-imported if a pair of databases both export to,
and import from, one another.
The perl driver scripts are to be found in:-
$SRT_PUBLIC_CONTEXT/DatabaseMaintenance/scripts/
and the sub-directory:-
primer/
has a series of files that can be used as the starting point.
The exporting logic overlaps the date window used to select data to
ensure no update is missed but it does mean that most updated
"Validity Packet"s with their associated globally unique SeqNo get
written to two consecutive update files. The first time a Slave site
receives a validity packet it imports it but the second time, it finds
that it already has the SeqNo so instead compares it.
The import logic that checks for repeated SeqNos classifies them into 3 types:-
- Duplicate The two packets are identical. The imported
packet is discarded.
- Out of Date The imported packet is considered to be out of
date wrt. the database packet, i.e. its CREATIONDATE is earlier, and
is discarded.
- Update The imported packet is considered to be more recent
than the database packet, i.e. its CREATIONDATE is later, and replaces
the database packet.
The Out of Date and Update classifications will probably
only ever apply to DBUSUBRUNSUMMARY packets because the DBU creates
them in the Master database as soon as the start of run record is
found and then updates them when the end run record is reached, so it
is not abnormal to see this packet in two forms.
As is explained
above
it is not necessary for Masters and Slaves to synchronise exports and
imports; it is fine for the Slave to import at a lower frequency than a
Master exports, or even a higher frequency although rather pointless
of course. Actually synchronisation brings the low-level risk that
two jobs, one writing and one reading, will try to simultaneously
access the same update file. To minimise this risk the
following convention is proposed for the scheduling of cron jobs:-
- Cron jobs run no more frequently than once an hour.
- When they run:-
- Master exports run on the half hour
- Mirroring jobs run at 45 mins past the hour
- Slave imports run one the hour
So, at the highest frequency of hourly updates, an change in the Master would be
caught within an hour and then propagate to a Slave 30 mins later.
Return to the top-level Database Distribution document
a
Last Modified: $Date: 2006/08/03 07:21:27 $