Fedora, Objects, Datastreams, Filesystems, and a Correction

In an earlier post, I extolled the virtues of Fedora as an ideal candidate for digital preservation because “[a]ll of the metadata (descriptive, preservation, and relationship to other objects) and managed datastreams that make up a digital object are ‘serialized’ to a single XML file on a file system.” Well, as I found out last week, it isn’t quite that straightforward.

Last week we had a problem with Fedora running on our production server — it had locked up tight in a wait-state for some sort of disk I/O and not even a `kill -9` could get rid of it. So I rebooted the server. Then the Fedora service wouldn’t start again — it complained about corruption in the Kowari triplestore. Okay, so we’d have to blow away the Kowari database and the MySQL database and reload.

Now I can only imagine that readers who are familiar with the Fedora software are yelling “what version of Fedora were you using?!?” Well, unfortunately it was 2.0. (And those that know are probably groaning.) For everyone else — and I used to be in that category — the consensus might be, “well, simply reload the objects off the disk,” right? So here’s the kicker:

the storage format of the objects on disk isn’t really a single serialized XML file.

Gasp! Yep, that’s right. As it turns out, the framework of the object is stored as a single XML file (METS-like; “FOXML,” to be exact) in the “objects” directory, and each of that object’s datastreams is stored as a single binary file in the “datastreams” directory. (Or, to be completely accurate, the “objects” and “datastreams” directories are further subdivided into a year, month-day, hour and minute directory structure.) The FOXML markup points to each datastream file by reference:

    <foxml:datastream ID="DS1" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
        <foxml:datastreamVersion ID="DS1.0" LABEL="DSCN0349" CREATED="2006-03-20T20:58:54.352Z" MIMETYPE="image/jpeg" SIZE="0">
            <foxml:contentLocation TYPE="INTERNAL_ID" REF="peter:4+DS1+DS1.0" />
        </foxml:datastreamVersion>
    </foxml:datastream>

This is a sample piece of a FOXML file for a digital object in FEDORA. In summary, this snippet defines a datastream called “DS1” with the label “DSCN0349” that is a JPEG file managed by the FEDORA server. The innermost tag is a contentLocation with the type “INTERNAL_ID” and a REF attribute of “peter:4+DS1+DS1.0”. That reference corresponds exactly to a file name in the “datastreams” directory (which has the same subdivided directory structure as the “objects” directory), and that file is the datastream itself.
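To make the link between the two directories concrete, here is a minimal Python sketch that pulls those references out of a FOXML document. It is an illustration, not part of Fedora: the function name `managed_datastream_refs` is made up for this example, and the sample wraps the datastream fragment in a bare `foxml:digitalObject` root (with the FOXML namespace declared) only so that it parses as a complete document.

```python
import xml.etree.ElementTree as ET

# FOXML namespace, written in ElementTree's "{uri}tag" form.
CONTENT_LOCATION = "{info:fedora/fedora-system:def/foxml#}contentLocation"

# The datastream fragment from this post, wrapped in a minimal root element
# (an assumption added here so the fragment is well-formed on its own).
SAMPLE_FOXML = """\
<foxml:digitalObject xmlns:foxml="info:fedora/fedora-system:def/foxml#">
  <foxml:datastream ID="DS1" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="DS1.0" LABEL="DSCN0349"
        CREATED="2006-03-20T20:58:54.352Z" MIMETYPE="image/jpeg" SIZE="0">
      <foxml:contentLocation TYPE="INTERNAL_ID" REF="peter:4+DS1+DS1.0" />
    </foxml:datastreamVersion>
  </foxml:datastream>
</foxml:digitalObject>
"""


def managed_datastream_refs(foxml_text):
    """Return the REF of every INTERNAL_ID contentLocation in a FOXML document.

    Hypothetical helper for illustration; each returned value is the name of
    a file expected somewhere under the "datastreams" directory.
    """
    root = ET.fromstring(foxml_text)
    return [
        loc.get("REF")
        for loc in root.iter(CONTENT_LOCATION)
        if loc.get("TYPE") == "INTERNAL_ID"
    ]


if __name__ == "__main__":
    for ref in managed_datastream_refs(SAMPLE_FOXML):
        print(ref)
```

Anything that reads only the “objects” directory gets these reference strings and nothing more; the bytes of the JPEG live in a separate file.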

So what happened with the server restore? I thought I could just point the automated ingestion utility at the ‘objects’ directory and have them all sucked back into the repository. What happened, though, was a whole bunch of errors like “invalid location” and “no objects loaded” because, of course, the batch loader couldn’t resolve the datastream locations.

(By the way, the reason this isn’t so much an issue in v2.1b and beyond is that the core developers added a reindexing tool that can walk the ‘objects’ and ‘datastreams’ hierarchies in order to rebuild the triplestore and SQL databases. The FEDORA process itself, of course, must be stopped while this kind of bulk rebuild is happening. What we ended up doing was jumping to version 2.1.1 faster than we had planned so we could make use of the reindexing tool.)

This division between the ‘objects’ and ‘datastreams’ directories means the preservation benefits are not as straightforward as I had originally thought. There isn’t one XML file that represents the entire object; rather, there is an XML file that has structure, some metadata, and opaque (albeit easily decoded) references to files elsewhere in the file system. This makes me think about changing our backup scheme to use a utility that will put together the XML framework with the datastream file(s) before that whole combination is written to off-line storage. Doing so would make me much more comfortable about the prospects for recovery at a later date.
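A rough sketch of what such a backup utility might look like, in Python, follows. Everything here is an assumption for illustration: the function names `index_by_name` and `bundle_object`, the archive layout, and the idea that a datastream can be found by matching the REF value against file names under the “datastreams” directory. It has not been run against a real Fedora installation.

```python
import os
import tarfile
import xml.etree.ElementTree as ET

# FOXML namespace, written in ElementTree's "{uri}tag" form.
CONTENT_LOCATION = "{info:fedora/fedora-system:def/foxml#}contentLocation"


def index_by_name(root_dir):
    """Walk root_dir (e.g. the "datastreams" tree) and map file name -> path."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            index[name] = os.path.join(dirpath, name)
    return index


def bundle_object(foxml_path, datastream_index, archive_path):
    """Write one FOXML file plus its managed datastream files to a tar archive.

    Returns the list of references for which no datastream file was found,
    so the caller can flag incomplete objects before trusting the backup.
    """
    missing = []
    tree = ET.parse(foxml_path)
    with tarfile.open(archive_path, "w") as tar:
        # Hypothetical layout: the XML framework first, then its datastreams.
        tar.add(foxml_path, arcname="object.xml")
        for loc in tree.iter(CONTENT_LOCATION):
            if loc.get("TYPE") != "INTERNAL_ID":
                continue  # not a file managed by the repository
            ref = loc.get("REF")
            path = datastream_index.get(ref)
            if path is None:
                missing.append(ref)
            else:
                tar.add(path, arcname="datastreams/" + ref)
    return missing
```

The idea would be to build the index once with `index_by_name` on the “datastreams” directory, then call `bundle_object` for each file found under “objects”. Each resulting archive is self-contained, and a non-empty return value points at exactly the datastreams that would be unrecoverable.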

(This post was updated on 06-Jun-2006.)