IRIS DMC Data Transcription: The Process of Implementing New Technologies

"What doesn't kill you, makes you stronger"

This article explains the data transcription process that the IRIS
DMC undertook during the first six months of 2001. Because we operate
a very active archive, it offers insight into how waveform data are
transcribed to new media within the new StorageTek Powderhorn mass
storage robot, and what is done to keep the archive in peak
condition. Keeping the archive healthy is, after all, the point of
doing transcription in the first place.

When data are stored on media (tape, disk or paper), it is a fact of
life that these media will become outdated, wear out, or both. In our
case this happens about every 4-5 years. For this reason it is an
accepted fact that everything will need to be read back and migrated,
or transcribed, to new media. This was necessary in 1992 and 1997,
and we recently completed our third transcription.

Our mass storage file system currently holds about 2.5 million files,
totaling about 19 terabytes, where each file contains all the data
generated at one station for one day. These are what we call
"station/day" files. Since we have archived data from over 100
different networks, spanning the period from 1970 to the present, we
take stewardship of the data very seriously. Equally important is the
need to preserve a good sort order for the data, so that servicing
requests for data is optimized. The DMC holdings are stored in two
sort orders: by time and by station. This lets us quickly stage data
back for event-based requests, such as earthquake gathers, or for
single-station requests. Not only does this sorting minimize tape
loads into drives, it also gives us a built-in backup, ensuring that
we still have access to the data if the media holding one copy go
bad.

The process of transcription begins with staging back all the data
from one network for one year from the old media. These data are then
parsed, and the exact file sizes for each channel of each station are
compared with the Oracle database, so that the database and the
waveform data are synchronized internally. This reveals any
inconsistencies between what the database says we have and what
resides in the waveform files. This two-way check verifies that we
are internally consistent before we go on to stage 2: synchronizing
with the network operators that originally submitted the data. This
step is important for verifying that we have all the data that the
network operator recorded. (As you might suspect, because this is a
two-way synchronization, we also find data that the network operator
submitted but did not intend to.) Currently, the only data collection
centers with which we are able to synchronize our holdings are the
IRIS nodes of the Data Management System: Albuquerque Seismological
Laboratory, IDA at UCSD, and PASSCAL. We intend to extend the
synchronization mechanism to others within the FDSN, to regional
network operators, and to anyone else who submits data to the DMC.
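
The two-way check described above can be illustrated with a minimal
sketch. This is not the DMC's actual software: the function name, the
dictionary layout, and the station and channel codes are all
hypothetical, and a real check would read its expected sizes from the
database and its observed sizes from the parsed waveform files.

```python
def two_way_check(db_sizes, file_sizes):
    """Compare sizes recorded in a database with sizes found in the files.

    Both arguments map a (station, channel) key to a size in bytes.
    The layout is an assumption made for this illustration.
    """
    db_keys = set(db_sizes)
    file_keys = set(file_sizes)
    return {
        # The database lists data that the waveform files do not contain.
        "missing_from_files": sorted(db_keys - file_keys),
        # The waveform files contain data the database does not know about.
        "missing_from_db": sorted(file_keys - db_keys),
        # Both sides have the entry, but the sizes disagree.
        "size_mismatch": sorted(
            key for key in db_keys & file_keys
            if db_sizes[key] != file_sizes[key]
        ),
    }


# Hypothetical sample holdings, for illustration only.
db_sizes = {("STA1", "BHZ"): 4096, ("STA1", "BHN"): 4096, ("STA2", "BHZ"): 8192}
file_sizes = {("STA1", "BHZ"): 4096, ("STA1", "BHN"): 2048, ("STA2", "BHE"): 8192}

report = two_way_check(db_sizes, file_sizes)
for category, keys in report.items():
    print(category, keys)
```

Because the comparison runs in both directions, the same routine
surfaces data that one side is missing and data that one side holds
unexpectedly, which is the property the synchronization with network
operators relies on as well.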

Once we have determined that we have all available data for this
one-year time period (or, where data volumes are very large, for as
little as two months), we begin staging these data to the Powderhorn
mass storage machine. There, a UNIX file system called SAM-FS,
commercially available from LSC, is configured to associate either
time-sorted or station-sorted data with archive_sets, enabling data
to be streamed to designated tapes efficiently. Without the ability
to control the writes to media, a random sort order would take over,
and a request for event data would likely require at least an order
of magnitude more tape mounts. As important as this process is to the
health of the archive, we remain very aware that we must
simultaneously service some 4,000 requests for data per month.
Because of the way we have engineered the system for transcription,
there has been no interruption in service to the community.
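
The effect of sort order on tape mounts can be seen in a toy
simulation. Everything here is an assumption made for illustration:
the station names, the figure of 500 files per tape, and the placement
function are hypothetical, and none of it models SAM-FS or the
Powderhorn hardware.

```python
import random

FILES_PER_TAPE = 500  # hypothetical tape capacity, in station/day files


def place(files, files_per_tape=FILES_PER_TAPE):
    """Write files in the given order, filling one tape before the next."""
    return {f: i // files_per_tape for i, f in enumerate(files)}


def tapes_needed(placement, wanted):
    """Count the distinct tapes that must be mounted to read the wanted files."""
    return len({placement[f] for f in wanted})


# One year of station/day files for 100 made-up stations.
stations = [f"S{n:03d}" for n in range(100)]
files = [(day, sta) for day in range(365) for sta in stations]

# Copy 1: sorted by time. Copy 2: sorted by station. Control: random order.
time_sorted = place(sorted(files))
station_sorted = place(sorted(files, key=lambda f: (f[1], f[0])))
shuffled = files[:]
random.Random(0).shuffle(shuffled)
unsorted = place(shuffled)

# An event-based request: every station for a single day.
event_request = [(42, sta) for sta in stations]
# A single-station request: one station for every day of the year.
station_request = [(day, "S007") for day in range(365)]

event_sorted_mounts = tapes_needed(time_sorted, event_request)
event_random_mounts = tapes_needed(unsorted, event_request)
station_sorted_mounts = tapes_needed(station_sorted, station_request)

print("event request, time-sorted copy:", event_sorted_mounts)
print("event request, random order:    ", event_random_mounts)
print("station request, station copy:  ", station_sorted_mounts)
```

In this toy layout the event request is satisfied from a single tape
on the time-sorted copy, while the randomly ordered copy scatters the
same files across many tapes. The exact counts depend entirely on the
assumed tape capacity and station count.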

Because the IRIS DMC continues to expand its holdings at a nonlinear
rate each year, we have had to look down the road to the next period
of transcription. This is one reason we chose StorageTek to help us:
they have a migration path built into their product cycle that will
include the ability to manage old and new tape media simultaneously.

Submitted by Rick Benson, IRIS DMC
For more information or comments contact