Open Repositories 2011 Report: Day 2 with DSpace plus Fedora and Lots of Lightning Talks

Today was the second day of the Open Repositories conference, and the big highlight of the day for me was the panel discussion on using Fedora as a storage and service layer for DSpace. This seems like such a natural fit, but with two pieces of complex software the devil is in the details. Below that summary are some brief paragraphs about some of the 24x7 lightning talks.

Fedora inside DSpace

Mark Diggory of @MIRE moderated a panel of Mark Leggott (Islandora, DiscoveryGarden and UPEI), Bradley McLean (CTO for DuraSpace), Richard Rogers (Head of Software at MIT Libraries), Ryan Scherle (Technical Lead, Dryad Digital Repository), and Matt Zumwalt (MediaShelf; Technical Lead, Hydra) on the topic of "DSpace with Fedora Inside." At last year's Open Repositories conference there was a call for the DSpace and Fedora communities to explore this idea. All of the content and metadata would be stored in Fedora with DSpace continuing to provide the user interface for workflow, discovery and administration. Or, viewed another way, retain the out-of-the-box experience of DSpace while exposing the versioning, object relationship, and flexible architecture features provided by Fedora. Work on this has been going on for a number of years, starting in 2007 with Scott Yeadon demonstrating object portability between Fez/Fedora and DSpace. In 2008, 2009 and 2010 there were three Google Summer of Code projects by Andrius Blažinskas that laid the groundwork for some of this integration by abstracting the DSpace storage layer options.

Here are some of the questions and responses from the panel. I hope I'm representing everyone's views as intended; comments and clarifications are welcome.

Will adding DSpace on Fedora make DSpace even more complex? Matt: It is an opportunity to revisit the design assumptions "and clean up your work." This is an example of where the transition will create the opportunity to tidy up the complexity and make DSpace simpler while gaining the flexibility of Fedora. Mark: The complexity of DSpace isn't in its content model, it is in trying to use the existing content model to do things DSpace wasn't designed to do. For instance, Islandora has more complex atomistic content models, particularly with science data, than DSpace's content model. Bradley: Acknowledged that there is a risk here, but "if it becomes more complex we are doing it wrong."

There is concern from the DSpace community because DSpace may have to change to accommodate Fedora, but Fedora will not need to be changed. Bradley: New ideas are sources of concern. It is difficult to categorize DSpace developers as a whole. Any time you move a major application on top of another component you may find the underlying APIs need to change.

At what level do we align DSpace and Fedora? Do we really need an intermediary format (AIP)? Bradley: If all we do is find a way to graft the existing DSpace workflows onto a Fedora that is very specific to DSpace and can't be used with other Fedora tools, then we haven't moved very far. The end goal is to find those places where DSpace is not formally specified enough and get it formally specified. Mark: Islandora hooks Drupal into Fedora. 80%-90% of the time we work with the Drupal-Fedora API -- a simple PHP wrapper around the Fedora API. It transforms calls into appropriate actions in Drupal. Another option comes from an Italian project that created a synchronization of Fedora objects and Drupal nodes; it copies information from Fedora into the Drupal RDBMS. Other applications like Omeka have a Fedora plugin. DiscoveryGarden has also looked at things like WordPress with Fedora underneath. As repository services become more intelligent about microservices it would take even less time to make these integrations.

What benefit does Fedora receive? Richard: For the Fedora community, DSpace alignment would provide a rich IR content model for Fedora. Ryan: DSpace was designed as an IR and nothing else; it has that at its core. The problems that people have with DSpace come when they try to make it do something outside that vision. The idea is to have a Fedora repository with a DSpace interface for those IR use cases and something like Hydra or Islandora for people with other use cases. Matt: There has never been a large cohesion of IR workflow in Fedora; this workflow would satisfy the IR use case. Bradley: DSpace with Fedora inside is a repackaging with a slightly different set of existing components. DSpace across its lifetime has tried to become more modular; integration with Fedora will make this clearer. Mark: Would agree that one of the main things DSpace brings to Fedora is the workflow tool and also the back-end data transformation workflows. But he has also never been a fan of the DSpace workflow because the staunch requirement to fill out a lot of metadata is a mistake. Working with science data, researchers want to ingest 100K microscope images without metadata, then go back and add metadata over time. Ryan: (Agreeing) Some of the requirements of the native installation of DSpace are difficult to work with in other use cases. Dryad started by configuring the workflow as much as could be done with the config file, but then created a new workflow process that still used many of the underlying tools.

The way I think of the motivation is that DSpace on Fedora will have the same easy setup with access to the underlying APIs for customization. Bradley: Yes -- that is an aspirational goal. The practical realities mean that we will have to take steps there one at a time. And given the time scale the question comes whether we will get there before we decide to do something else. One of the steps -- sort of unsaid -- is to take a look at EPeople and see how that would migrate. Ryan: Unlocking the data is one thing, but unlocking the underlying datastreams as an API is another. The DSpace storage API is opaque. You can rebuild everything from the underlying storage. [Ryan also calls out an old DLTJ post on a key advantage of DSpace with Fedora inside: "Why Fedora? Because You Don’t Need Fedora".]

Lightning Talks

The other sessions that I went to today were of the quick 24x7 type. Here are some highlights.

Mark Phillips talked about a PREMIS Event Service. He needed a way to log events that occur during the life of digital objects (virus checks, ingest, fixity check, replication) -- when they occur and their outcomes. So he built a microservice based on AtomPub. Each repository component sends outcomes of an event to this service, a central event collector. It uses the PREMIS Event and Agent Modules. Event metadata includes: event_identifier, event_type, event_timestamp, event_outcome, outcome_detail, agent_identifier, object_identifier, and event_detail. Agent metadata includes details about the software, human or organization triggering the event. An AtomPub feed for each object returns all events for that object. The system includes a basic search interface to see all events of a particular type and enables a feed to be set up on those searches. There is the ability to harvest all events via OAI-PMH and Atom. It is built with Django and Python and will be released on the MetaArchive Google Code page.
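To make the shape of such an event record concrete, here is a minimal Python sketch. It is not the code from the talk: the field names come from the list above, but the `PremisEvent` and `EventCollector` classes, the in-memory list, and the sample identifiers are all assumptions made for illustration. The real service is described as AtomPub-based and built on Django.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List
import uuid


@dataclass
class PremisEvent:
    """One event record, using the field names listed in the talk summary."""
    event_type: str
    event_outcome: str
    agent_identifier: str
    object_identifier: str
    outcome_detail: str = ""
    event_detail: str = ""
    # Assumption: identifiers and timestamps are generated when the event is created.
    event_identifier: str = field(default_factory=lambda: uuid.uuid4().hex)
    event_timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class EventCollector:
    """Hypothetical in-memory stand-in for the central event collector."""

    def __init__(self) -> None:
        self._events: List[PremisEvent] = []

    def record(self, event: PremisEvent) -> str:
        """Store an event reported by a repository component."""
        self._events.append(event)
        return event.event_identifier

    def events_for_object(self, object_identifier: str) -> List[PremisEvent]:
        """Rough analogue of the per-object feed: all events for one object."""
        return [e for e in self._events if e.object_identifier == object_identifier]

    def events_of_type(self, event_type: str) -> List[PremisEvent]:
        """Rough analogue of the search interface: all events of one type."""
        return [e for e in self._events if e.event_type == event_type]


collector = EventCollector()
collector.record(PremisEvent("virus check", "success", "agent:scanner", "obj:1"))
collector.record(
    PremisEvent("fixity check", "failure", "agent:fixity", "obj:1",
                outcome_detail="checksum mismatch")
)
collector.record(PremisEvent("ingest", "success", "agent:ingest", "obj:2"))
```

Calling `collector.events_for_object("obj:1")` returns the two events recorded for that object, which is the kind of view the per-object feed is described as providing.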

Peter Sefton talked about The Fascinator & Fedora Commons: A Toolkit Tour. Fascinator is a Java-based platform targeting repository solutions. It is open source (GPL), a plugin-based platform, and highly customizable. It was first used as an aggregator of data from various sources into a discovery service. They tried doing the same thing on the desktop computer (something a researcher could put on a personal computer to index data and group/describe it). He thinks of the process as a conveyor belt: harvest digital objects, transform them, store them, index them and find them. Harvest: draw into your digital ecosystem files, databases, online resources. Transform: be ready to present and share. Real web stuff -- not PDFs. Video and image previews. Multiple renditions for a multi-platform world. Storage: Store originals and their friends; basic filesystem storage or use Fedora Commons. Index: Apache Solr index. Find: Faceted search interface. Web previews (turn Word into HTML and PDF for preview, same for video transcoding). Easily customizable UI (Jython and Velocity). It is used by ReDBox, Mint (described at a session yesterday), and a library of university policies (policies sent from Microsoft SharePoint and transformed into HTML and PDF).
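The conveyor-belt idea can be sketched in a few lines of Python. This is only a toy illustration of the harvest, transform, store, index, find sequence; it does not use The Fascinator's actual plugin API (which is Java-based), and every name here (`harvest`, `transform`, `Storage`, `Index`, `run_pipeline`) is hypothetical. A dictionary stands in for filesystem or Fedora Commons storage, and a word-to-ids map stands in for the Solr index.

```python
import html
from collections import defaultdict
from typing import Dict, Iterator, List, Set


def harvest(sources: Dict[str, str]) -> Iterator[dict]:
    """Harvest: pull each source item into the pipeline as a digital object."""
    for object_id, text in sources.items():
        yield {"id": object_id, "original": text}


def transform(obj: dict) -> dict:
    """Transform: add a web-ready rendition next to the original."""
    obj["html"] = "<p>" + html.escape(obj["original"]) + "</p>"
    return obj


class Storage:
    """Store: keep originals and their renditions together (stand-in for Fedora)."""

    def __init__(self) -> None:
        self.objects: Dict[str, dict] = {}

    def store(self, obj: dict) -> None:
        self.objects[obj["id"]] = obj


class Index:
    """Index and find: a tiny word index (stand-in for a real search engine)."""

    def __init__(self) -> None:
        self.terms: Dict[str, Set[str]] = defaultdict(set)

    def add(self, obj: dict) -> None:
        for word in obj["original"].lower().split():
            self.terms[word].add(obj["id"])

    def find(self, word: str) -> List[str]:
        return sorted(self.terms.get(word.lower(), set()))


def run_pipeline(sources: Dict[str, str], storage: Storage, index: Index) -> None:
    """Move every harvested object along the belt, one stage at a time."""
    for obj in harvest(sources):
        obj = transform(obj)
        storage.store(obj)
        index.add(obj)


storage, index = Storage(), Index()
run_pipeline(
    {"doc1": "Repository policy draft", "doc2": "Imaging policy & data"},
    storage,
    index,
)
```

Keeping each stage as a separate function or class mirrors the plugin-based design described in the talk: any one stage could be swapped out without touching the others.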

Rich Rogers talked about Publishing Large, Data-Rich Collections on the Web with Exhibit3. We collect and we like to share what we collect. Nowadays we live on the web, so how can we share our collections there? Exhibit is all I need to publish my collection to the web: no backend database, no server application; it will even convert a spreadsheet into usable data on the web. Originally created by the SIMILE Project at MIT, Exhibit is an entire data publication platform with 'list' and 'tabular' views. There is also a rich library of additional views: for temporal data, a scrollable interactive timeline; for geospatial data, interactive mapping displays; for numerical data, scatter or time plots. It is installable by HTML configuration and uses a browser-resident RDF database in JSON.

Geri Ingram and Carol Godby talked about Sustaining Collaboration Among Open Access Repositories, focused on the WorldCat Digital Collection Gateway. WorldCat broadens exposure to digital object collections; end users click through to the hosting repository server from WorldCat.org. Metadata from OAI-PMH-compliant repositories is regularly harvested to WorldCat through the Gateway; digital objects remain on the local repository server. The gateway includes a translation service that allows repository managers to create mappings from Dublin Core (or selected other metadata schemas) to MARC.
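As a rough picture of what a manager-defined mapping might look like, here is a small Python sketch. It is not the Gateway's interface or its default crosswalk, neither of which was described in the talk. The `dc_to_marc` function, the `DEFAULT_MAPPING` table, and the sample record are assumptions for illustration, and the MARC tags are loosely modeled on common Dublin Core to MARC crosswalks.

```python
from typing import Dict, List, Tuple

# A MARC field reduced to (tag, subfield code, value) for this sketch.
MarcField = Tuple[str, str, str]

# Illustrative mapping only; a repository manager would define their own.
DEFAULT_MAPPING: Dict[str, Tuple[str, str]] = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
}


def dc_to_marc(
    record: Dict[str, List[str]],
    mapping: Dict[str, Tuple[str, str]] = DEFAULT_MAPPING,
) -> List[MarcField]:
    """Translate a Dublin Core record into MARC-like fields using a mapping."""
    fields: List[MarcField] = []
    for element, values in record.items():
        target = mapping.get(element)
        if target is None:
            continue  # Elements without a mapping are skipped, not guessed.
        tag, code = target
        for value in values:
            fields.append((tag, code, value))
    return sorted(fields)  # Order by tag, as in a MARC record.


harvested = {
    "title": ["Campus Photographs"],
    "creator": ["Smith, Jane"],
    "rights": ["CC BY"],
}
marc_fields = dc_to_marc(harvested)
```

Passing a different `mapping` dictionary changes the translation without changing the code, which is the general idea behind letting each repository manager configure their own crosswalk.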