Best Practice Proposal for a DESCRIPTION Datastream


OhioLINK is deep in the process of migrating content from our old Bulldog/Documentum-based system to, well, something else, and we've been talking about the treatment of the metadata in the course of the migration. I think it is safe to say that the Bulldog asset management system (and Documentum, which bought and integrated Bulldog into its product line about five years ago) is not really known for its rich handling of metadata, or at least not of metadata as the library community thinks of it: Dublin Core, MIX, MODS, MARC, VRA Core, PREMIS, FGDC, etc., all at the same time in the same application engine with structured crosswalks between them. (Reality check for those in the "library community": do you think of metadata in this way?) I think it is also safe to say that pure, unqualified Dublin Core, the only datastream that is required for every FEDORA object, does not completely encompass the descriptive fidelity needed for our objects.

These observations, combined with reading a mid-term project report from the RepoMMan effort in the U.K., got me thinking about metadata and how we should store it in FEDORA objects. The outcome of that line of thinking is this proposal: "to establish a practice of creating an in-line XML datastream with the label 'DESCRIPTION' that contains the primary descriptive metadata for each object."

Rationale

Although FEDORA mandates an unqualified Dublin Core datastream for every object, unqualified Dublin Core is not expressive enough to describe our objects. Therefore I recommend establishing this practice so subsequent agents/consumers of the objects (internal disseminators and external applications) will know the location of the most expressive metadata for the object.

Risks/Unknowns

  • FEDORA does not provide a mechanism to keep elements of the DESCRIPTION datastream in sync with the DC datastream. Do we store common data elements (e.g. "creator") in both places? If so, our front-end applications would need to change the value of "creator" in two places and there is always the risk that they will get out of sync. How much real value is there in maintaining the FEDORA-mandated DC datastream?
  • There is no convention (that I know of) for a "primary descriptive metadata" datastream label in a FEDORA object, so "DESCRIPTION" is an arbitrary choice at this point. Future practices may go against this decision (although the choice does set us up to start using datastream labels like "PRESERVATION" for PREMIS metadata and so forth).
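One way to sidestep the synchronization risk is to treat DESCRIPTION as the only place a value like "creator" is edited and to regenerate the DC datastream from it. The following is a minimal sketch of that idea, not a FEDORA feature: the function name `derive_dc` and the sample record are hypothetical, the `oai_dc:dc` wrapper element is an assumption about how the DC datastream is serialized, and writing the result back to the repository is left out entirely.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Assumption: the DC datastream is wrapped in an oai_dc:dc element.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Register prefixes so the serialized output uses readable names.
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)

# Hypothetical, abbreviated DESCRIPTION content used for illustration.
SAMPLE_DESCRIPTION = """<metadata xmlns="http://drc.ohiolink.edu/schema/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:creator>Ohio Agricultural Experiment Station. Dept. of Forestry.</dc:creator>
  <dc:date>1908-12</dc:date>
  <dcterms:spatial>Ohio</dcterms:spatial>
</metadata>"""


def derive_dc(description_xml: str) -> str:
    """Copy only the unqualified dc: elements into a new DC record."""
    source = ET.fromstring(description_xml)
    target = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for element in source.iter():
        # Skip anything outside the unqualified Dublin Core namespace.
        if element.tag.startswith(f"{{{DC_NS}}}"):
            copy = ET.SubElement(target, element.tag)
            # Collapse line wraps and repeated spaces in the value.
            copy.text = " ".join((element.text or "").split())
    return ET.tostring(target, encoding="unicode")


if __name__ == "__main__":
    print(derive_dc(SAMPLE_DESCRIPTION))
```

With this arrangement a front-end application changes "creator" once, in DESCRIPTION, and the DC copy is rebuilt rather than edited. Qualified terms such as `dcterms:spatial` are simply dropped here; mapping them down to unqualified elements would be a separate policy decision.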

Background

In their "Experiences with Fedora" report, the RepoMMan team noted:

...working with Fedora's compulsory Dublin Core (DC) datastream started one thinking about the metadata that a repository object would eventually need and how this might be mapped onto the Dublin Core fields. It was some considerable time later that an e-mail on the Fedora-users list made it clear that the inherent DC datastream was intended solely for Fedora's internal use and not as the basis of external searches.

Richard Green, "Experiences with Fedora during the project's first year," Report D-D8, July 2006, page 8; retrieved 28-Aug-2006 from http://www.hull.ac.uk/esig/repomman/downloads/D-D8-fedora-exp-v10.pdf.

Even with our simplest collection, we already know that unqualified Dublin Core will not be sufficient (most specifically, we had discussions about the lack of precision of "Date" and "Coverage" as compared to the field labels we already have in the Bulldog data dictionary). It is important that our metadata be parsable by machine processes, so I would advocate the proposed practice rather than trying to "shoe-horn" our descriptions into unqualified Dublin Core with text labels added to the values and the like. And if we keep the metadata machine-parsable, we will have a wider variety of options for indexing the data and displaying it at the presentation layer.
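To make the "shoe-horn" concern concrete, here is a small sketch contrasting the two approaches for a single "date available" value. Both sample records and both function names are hypothetical, and the "Date Available:" label is an invented convention of the kind we would have to define and then parse forever after.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical record with text labels packed into unqualified dc:date values.
SHOEHORNED = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:date>Date Created: 1908-12</dc:date>
  <dc:date>Date Available: 2003-04-17</dc:date>
</record>"""

# Hypothetical record that uses a qualified element instead of a label.
QUALIFIED = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:date>1908-12</dc:date>
  <dcterms:available>2003-04-17</dcterms:available>
</record>"""


def available_from_labels(xml_text):
    """Find the value by string-matching a label inside each dc:date."""
    root = ET.fromstring(xml_text)
    for element in root.findall("dc:date", NS):
        label, separator, value = (element.text or "").partition(":")
        if separator and label.strip().lower() == "date available":
            return value.strip()
    return None


def available_from_element(xml_text):
    """Find the value by element name alone; no string conventions needed."""
    root = ET.fromstring(xml_text)
    element = root.find("dcterms:available", NS)
    if element is None or not element.text:
        return None
    return element.text.strip()


if __name__ == "__main__":
    print(available_from_labels(SHOEHORNED))
    print(available_from_element(QUALIFIED))
```

Both functions return the same date for their respective records, but the first depends on every cataloger spelling and punctuating the label identically, while the second depends only on the element name.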

The "in-line XML" part of this proposal means that the DESCRIPTION datastream would be "managed" by the FEDORA server (i.e., not external or referenced), so it would become part of the object in the content store.

Example

If we take for a moment what is displayed in the presentation layer for a sample object from the Forestry collection as the sum total of all of the descriptive metadata for an object of this collection, a corresponding DESCRIPTION datastream would look something like:

<metadata xmlns="http://drc.ohiolink.edu/schema/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://drc.ohiolink.edu/schema/
                      http://drc.ohiolink.edu/schema/schema.xsd"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Catalpa speciosa, bignonoides and Kampfera seeds.</dc:title>
  <dc:creator>Ohio Agricultural Experiment Station.  Dept. of
    Forestry.</dc:creator>
  <dc:description>Catalpa speciosa, bignonoides and Kampfera seeds.
    Item #2</dc:description>
  <dc:contributor>Ohio Agricultural Research and Development
    Center</dc:contributor>
  <dc:date>1908-12</dc:date>
  <dcterms:available xsi:type="dcterms:W3CDTF">
    2003-04-17T00:00:00
  </dcterms:available>
  <dc:type>photographic prints</dc:type>
  <dc:identifier>hdl:21151</dc:identifier>
  <dc:source>2</dc:source>
  <dcterms:spatial>Ohio</dcterms:spatial>
  <dc:rights>Copyright: Ohio State University</dc:rights>
  <dcterms:license xsi:type="dcterms:URI">
    http://library.osu.edu/sites/dlib/terms.html
  </dcterms:license>
</metadata>
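As a rough illustration of what a consumer of this datastream (an indexer or a disseminator) might do with it, the sketch below flattens a DESCRIPTION document into a dictionary of field names and values. The function name `index_fields`, the `prefix.localname` field-naming scheme, and the abbreviated sample record are all assumptions made for the example; retrieving the datastream from the repository is not shown.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace URIs we recognize, mapped to the prefixes used in field names.
PREFIXES = {
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://purl.org/dc/terms/": "dcterms",
}

# Hypothetical, abbreviated DESCRIPTION content used for illustration.
SAMPLE = """<metadata xmlns="http://drc.ohiolink.edu/schema/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Catalpa speciosa, bignonoides and Kampfera seeds.</dc:title>
  <dc:date>1908-12</dc:date>
  <dcterms:available xsi:type="dcterms:W3CDTF">
    2003-04-17T00:00:00
  </dcterms:available>
  <dcterms:spatial>Ohio</dcterms:spatial>
</metadata>"""


def index_fields(description_xml):
    """Map each recognized child element to a list of cleaned values."""
    fields = defaultdict(list)
    for element in ET.fromstring(description_xml):
        if not element.tag.startswith("{"):
            continue  # element is in no namespace; nothing to map
        uri, local_name = element.tag[1:].split("}", 1)
        prefix = PREFIXES.get(uri)
        if prefix is None:
            continue  # namespace we do not index
        # Collapse the indentation and line wraps around the value.
        value = " ".join((element.text or "").split())
        if value:
            fields[f"{prefix}.{local_name}"].append(value)
    return dict(fields)


if __name__ == "__main__":
    for name, values in index_fields(SAMPLE).items():
        print(name, values)
```

Because each field arrives under its own name, the presentation layer can label "available" and "spatial" separately instead of lumping them under "Date" and "Coverage".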

Comments?

Reactions to the proposal? A rational step forward, or is there a better way?