Position Announcement: OhioLINK Systems Developer

The Ohio Library and Information Network (OhioLINK) is seeking a hard-working, analytical individual to participate in the creation and maintenance of our internationally recognized set of online library information services, with special focus on the Ohio Digital Resource Commons. OhioLINK serves the higher education population in the State of Ohio with over 85 college and university member institutions.

The position requires a four-year degree in Computer Science, or a graduate degree in Information or Library Science, or equivalent technical experience. The candidate should have strong programming skills in languages such as Java, and should be comfortable working in a Unix/Linux environment with open source software. Experience with the following is highly valued: Digital Repositories, Cocoon, Apache Tomcat, XML/XSLT, PostgreSQL. Experience with the following is desirable: DSpace/Manakin, HTML/CSS site design, metadata, Subversion, Perl, shell scripting.

Salary:  $49,000 minimum

If you are interested in this position, please send a resume, a summary statement of experience, and an indication of your salary expectations to resume@ohiolink.edu.

The text was modified to update a link from http://www.ohiolink.edu/member-info/ to http://www.ohiolink.edu/members-info/ on January 13th, 2011.

Disseminators As the Core of an Object Repository

I’ve been working to get JBoss Seam tied into Fedora, and along the way thought it would be wise to stop and document a core concept of this integration: the centrality of Fedora Disseminators in the design of the Ohio Digital Resource Commons. Although there is nothing specific to JBoss Seam (a Java Enterprise Edition application framework) in these concepts, making an object “render itself” does make the Seam-based interface application easier to code and understand. A disseminator-centric architecture also allows us to put our code investment where it matters the most — in the repository framework — and exploit that investment in many places. So what does it mean to have a disseminator-centric architecture and have objects “render themselves”?

How It Works

This is a sequence diagram showing all of the pieces:

  • Browser: The user’s browser
  • DRCseam: A JBoss Seam application that generates the user interface and performs much of the business logic. DRCseam, however, does not render the objects or their metadata into browser-consumable artifacts. Read on!
  • Fedora: A basic Fedora digital object repository.
  • Disseminator: A simple servlet that performs various transformations on object datastreams to render content usable by the browser.

With these components in play, here is the description of a sequence to render a page showing the metadata for a repository item:

  1. request item page: The browser follows a link to an item detail page.
  2. API-A ObjectProfile: The interface application asks the repository for the ‘Object Profile’ of the item…
  3. return object profile: …which the repository returns. The interface application now knows basic details about the object: that it exists, the creation and updated timestamps, and so forth.
  4. API-A DatastreamDissemination for fullDisplay: The interface application needs the object’s metadata display, so it asks the object to “render itself” by making a call to the Fedora repository for the object’s “FullDisplay” disseminator.
  5. call getFullDisplay: The Fedora repository in turn calls the object’s disseminator with the Persistent Identifier (PID) of the object as a parameter.
  6. API-A Datastream for metadata: Using the object PID, the disseminator calls back to the Fedora repository for the descriptive metadata datastream (the DC datastream, in this case)…
  7. XML metadata: …which the Fedora repository returns.
  8. transform metadata: The disseminator performs some transformation or derivation on the descriptive datastream to create an XHTML representation…
  9. XHTML fragment: …which it returns to the Fedora software…
  10. XHTML fragment: …which is returned to the interface application…
  11. XHTML page: …which inserts it at the appropriate place in the XHTML page it has built and returns the XHTML page to the browser.

Step #4 is where we diverge from previous architectures. Instead of making the interface application transform the metadata into a human-readable format, the interface application calls the object’s disseminator to do the job.
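The shape of this sequence can be sketched in Java. This is an illustrative outline, not the actual DRCseam code: the `fedora` function stands in for the API-A HTTP calls, and the URL paths mirror the demo URLs shown later in this post.

```java
import java.util.function.Function;

/**
 * A sketch (not the actual DRCseam code) of the call sequence: the
 * interface application delegates rendering to the object's getFullDisplay
 * dissemination instead of transforming the metadata itself.
 */
class ItemPageRenderer {

    private final Function<String, String> fedora; // stands in for API-A HTTP calls

    ItemPageRenderer(Function<String, String> fedora) {
        this.fedora = fedora;
    }

    /** Builds the item page by asking the object to "render itself". */
    String renderItemPage(String pid) {
        // Steps 2-3: fetch the object profile (existence check, timestamps, ...)
        String profile = fedora.apply("/fedora/get/" + pid);
        if (profile == null) {
            throw new IllegalArgumentException("No such object: " + pid);
        }
        // Steps 4-10: ask the repository for the FullDisplay dissemination;
        // the repository calls the disseminator, which returns an XHTML fragment.
        String fragment = fedora.apply(
            "/fedora/get/" + pid + "/demo:bDefExample/getFullDisplay/");
        // Step 11: insert the fragment into the page the interface app builds.
        return "<html><body>" + fragment + "</body></html>";
    }
}
```

Because the repository call is injected as a function, the sketch can be exercised with a stub in place of a live Fedora server — which is also roughly how the rendering logic stays testable apart from the repository.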

The Heart of It All: The Disseminator

The key to this architecture is asking the object to “render itself”. This puts the task of creating the appropriate representation at the object level. The object can be an image, a video, a spreadsheet, or a PDF file. More importantly, the object can be a PDF of a journal article or a PDF of a thesis; in both cases the metadata describing that PDF file will be different (journal/volume/issue in one case and department/degree/advisor in the other).

Rather than putting special case code in the interface application to render the description of the journal article one way and the thesis another way, that special case code is bound to the object in the form of a “disseminator”. The disseminator methods for the journal article and the thesis share the same name — getFullDisplay — but will return entirely different XHTML fragments — one for a journal article and one for a thesis. For both objects, though, the interface application will make a call to the object in the Fedora repository asking for the output of each getFullDisplay dissemination. In the case of a Dublin Core description, the dissemination output could look like this:

<table class="drc_dublinCore_table">
<tr class="drc_dublinCore_row drc_dublinCore_title">
<td class="drc_dublinCore_label drc_dublinCore_title">Title:</td>
<td class="drc_dublinCore_value drc_dublinCore_title">Jester Example</td>
</tr>
<tr class="drc_dublinCore_row drc_dublinCore_identifier">
<td class="drc_dublinCore_label drc_dublinCore_identifier">Identifier:</td>
<td class="drc_dublinCore_value drc_dublinCore_identifier">demo:exampleObject</td>
</tr>
</table>

You’ll note that there is a liberal application of CSS class names on all of the XHTML elements, allowing the look of the dissemination to be further styled in the browser via CSS stylesheets. A getFullDisplay dissemination for a journal article could look like this:

<table class="drc_ejc_table">
<tr class="drc_ejc_row drc_ejc_title">
<td class="drc_ejc_label drc_ejc_title">Article Title:</td>
<td class="drc_ejc_value drc_ejc_title">Taking Advantage of Fedora Disseminations</td>
</tr>
<tr class="drc_ejc_row drc_ejc_volume">
<td class="drc_ejc_label drc_ejc_volume">Volume:</td>
<td class="drc_ejc_value drc_ejc_volume">3</td>
</tr>
<tr class="drc_ejc_row drc_ejc_issue">
<td class="drc_ejc_label drc_ejc_issue">Issue:</td>
<td class="drc_ejc_value drc_ejc_issue">2</td>
</tr>
</table>

Looking at the Pieces

There is a demonstration system set up for a short period of time that shows all of the pieces. First, the disseminator:

  • http://drc-dev.ohiolink.edu:8080/BaseDisseminator/getFullDisplay/demo:exampleObject

Next, how this disseminator looks as accessed through the Fedora repository:

  • http://drc-dev.ohiolink.edu:8080/fedora/get/demo:exampleObject/demo:bDefExample/getFullDisplay/

And finally, how this result looks through the Seam-based interface application. (A note about this application — only this URL works at the moment even though there are other links on the page. This is also the ‘trunk’ version of our interface code, so it is likely to change and/or break and/or work better at any time.)

  • http://drc-dev.ohiolink.edu:8080/drc/item.seam?itemId=demo%3AexampleObject

Fedora Setup

In addition to the Seam-based interface application and the disseminator code, there is setup required at the Fedora repository — specifically, the creation of a Behavior Definition (bDef) that describes the disseminators that the objects share in common and the creation of a Behavior Mechanism (bMech) that describes the implementation of that definition for a particular object type. Below is a series of screen shots that show the steps to create the bDef and bMech.

Disseminator Behavior Definition (bDef)

Using the Fedora Admin client, under the “Builders” menu, select “Behavior Definition Builder”. In the first pane, “General”, use a specific PID of ‘demo:bDefExample’ and put something in for the Behavior Object Name, Behavior Object Description, and one of the Dublin Core Metadata fields. (It doesn’t matter what you put in for these values.)
Fedora Admin Behavior Definition Builder “General” pane

Under the “Abstract Methods” pane, create new definitions for each of the disseminator methods.
Fedora Admin Behavior Definition Builder “Abstract Methods” pane

Under the “Documentation” pane, put something in the first entry. Again, it doesn’t matter what is put in for these values, but they are required.
Fedora Admin Behavior Definition Builder “Documentation” pane

Select “Ingest” at the bottom of the window, and the demo:bDefExample bDef will be created. Alternatively, you could import the demo:bDefExample saved in the DRC source code repository (choose “original format” at the bottom of that page).

Disseminator Mechanism Definition (bMech)

The bMech is a little more complicated. Under the “Builders” menu, select “Behavior Mechanism Builder”. In the first pane, “General”, use a specific PID of ‘demo:bMechExample’ and put something in for the Behavior Object Name, Behavior Object Description, and one of the Dublin Core Metadata fields. (It doesn’t matter what you put in for these values.) In the “Behavior Definition Contract” area, pick the bDef just created (demo:bDefExample).
Fedora Admin Behavior Mechanism Builder “General” pane

In the “Service Profile” pane, put in values in the “General” area (it doesn’t matter what). In the Service Binding area, make sure the Message Protocol is HTTP GET, put in text/html, text/xml for Input MIME Types and put in text/html, text/xml, text/plain for Output MIME Types.
Fedora Admin Behavior Mechanism Builder “Service Profile” pane

Under the Service Methods pane, put in http://localhost:8080/BaseDisseminator for the Base URL. (The disseminator is also loaded in the same servlet as the Fedora repository and the Seam interface application, and it is loaded at the “/BaseDisseminator” context path in the servlet.) Create Service Method Definitions that correspond to the Abstract Methods in the bDef.
Fedora Admin Behavior Mechanism Builder “Service Methods” pane

Select “Properties” for each one of the Service Method Definitions in turn. “echo” is a unique disseminator method that simply echoes back the context parameters of the disseminator request. This is useful for seeing exactly what the Fedora server is going to give to the disseminator.
Fedora Admin Behavior Mechanism Builder “Service Methods” Definitions for “echo” Method

With the exception of “echo”, all of the other Service Method Definitions are the same. The Method Binding consists of the disseminator method, followed by a slash and the PID placeholder, followed by a question mark and ‘dc=’ with the DC placeholder. Since the Method Binding field has two placeholders, there are two entries in the Method Parameter Definitions area. The first is for PID — a “Default” parameter that is required and passed by value to the disseminator. The default value is the special value $PID, which the repository software will replace with the PID of the object as the disseminator is called. The second is for DC, a “Datastream” parameter that is required and passed to the disseminator by URL reference. The disseminator doesn’t actually use this reference to a datastream, but it is a requirement that all bMechs pass a datastream of one sort or another to the disseminator.
Fedora Admin Behavior Mechanism Builder “Service Methods” Definitions for “getFullDisplay” Method
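As a rough illustration, the placeholder substitution the repository performs can be mimicked in a few lines of Java. The `(PID)`/`(DC)` placeholder notation and the method name here are assumptions made for the sketch; Fedora’s actual binding resolution happens inside the server.

```java
/**
 * A sketch of how the repository might expand a bMech Method Binding such
 * as getFullDisplay/(PID)?dc=(DC) into a concrete disseminator URL.
 * Illustrative only -- not Fedora's implementation.
 */
class BindingExpander {

    /** Replaces the (PID) and (DC) placeholders with concrete values. */
    static String expand(String baseUrl, String binding,
                         String pid, String dcUrl) {
        return baseUrl + "/" + binding
            .replace("(PID)", pid)   // $PID default parameter, passed by value
            .replace("(DC)", dcUrl); // DC datastream, passed by URL reference
    }
}
```

For example, expanding `getFullDisplay/(PID)?dc=(DC)` against the Base URL from the previous step yields a URL of the same shape as the demonstration disseminator URL shown earlier.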

If you have followed all of the steps so far, under the “Datastream Input” pane there will be one entry for DC in the table. The only thing that needs to be done here is adding “text/xml” in the MIMEType column.
Fedora Admin Behavior Mechanism Builder “Datastream Input” pane

Under the “Documentation” pane, put something in the first entry. Again, it doesn’t matter what is put in for these values, but they are required.
Fedora Admin Behavior Mechanism Builder “Documentation” pane

Select “Ingest” at the bottom of the window, and the demo:bMechExample bMech will be created. Alternatively, you could import the demo:bMechExample saved in the DRC source code repository (choose “original format” at the bottom of that page).

Sample Object

The last step is to add this disseminator bDef/bMech combination to an object. Edit any object in the repository and go to the “Disseminators” pane. If there are other disseminators already defined for this object, select “New” along the left side. Put in a label — any label will do. Next to “Behavior defined by…” select demo:bDefExample. Then next to “Mechanism” select demo:bMechExample. The admin client will prompt for a DC binding; select “Add” and choose the DC datastream in the pop-up window.
Fedora Admin Sample Object’s “Disseminators” pane in progress

Select “Save Changes” at the bottom. The completed disseminator looks like this:
Fedora Admin Sample Object’s “Disseminators” pane completed

There is a sample object in the DRC source code repository that has the disseminator already defined.


Comments about this architecture are certainly welcome. I’m sure I’ll be writing about it more in the future, but here are some thoughts at this point:

Future Directions

In this case, I’m using an XSLT stylesheet to transform the Dublin Core XML into an XHTML table. That stylesheet is stored in the BaseDisseminator WAR file. The stylesheet could just as easily be a datastream of a special “formatting” object in the repository. One of the key distinctions of OhioLINK’s Fedora implementation is that institutions using the repository will be able to “brand” their content in any way they choose. Having the flexibility of storing metadata transformations just like any other object in the repository would seem to be of great advantage in that scenario.
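To make the transformation step concrete, here is a minimal sketch of applying an XSLT stylesheet to DC XML with the standard JAXP API. The inline stylesheet is a toy stand-in for the one in the BaseDisseminator WAR; the class and template contents are illustrative, not the production code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/**
 * A sketch of the disseminator's transform step: XSLT turns descriptive
 * DC XML into an XHTML table fragment. The stylesheet here is a toy; in
 * the disseminator it lives in the WAR (or, in the future, could be a
 * datastream of a "formatting" object in the repository).
 */
class DcToXhtml {

    private static final String XSL =
        "<xsl:stylesheet version='1.0' "
      + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform' "
      + "    xmlns:dc='http://purl.org/dc/elements/1.1/'>"
      + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "  <xsl:template match='/'>"
      + "    <table class='drc_dublinCore_table'>"
      + "      <tr class='drc_dublinCore_row drc_dublinCore_title'>"
      + "        <td class='drc_dublinCore_label'>Title:</td>"
      + "        <td class='drc_dublinCore_value'>"
      + "          <xsl:value-of select='//dc:title'/></td>"
      + "      </tr>"
      + "    </table>"
      + "  </xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to a DC XML string, returning XHTML. */
    static String transform(String dcXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(dcXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException("Transform failed", e);
        }
    }
}
```

Swapping the inline string for a stylesheet loaded from a repository object would require only changing where the `StreamSource` for the stylesheet comes from, which is what makes the per-institution “branding” scenario attractive.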

On a related front, this style of implementation would be greatly enhanced by the work of the Fedora Content Model Dissemination Architecture (CMDA). Because disseminators must be bound to specific objects rather than classes of objects, management of the variety of bMechs in a scenario such as this will likely become difficult very soon. I’m heartened by the fact that the CMDA work is going on and will cut our management complexity dramatically when it becomes available.


These concepts are based in part on the work of the Digital Library Federation’s Aquifer Asset Actions technical working group and discussions among members of the OAI Object Reuse and Exchange technical committee as well as conversations with many Fedora developers and implementors. Thanks, everyone.

[Update 20070426T1147 : Fixed the sample object URL. Thanks, Jodi.]

The text was modified to update a link from http://rama.grainger.uiuc.edu/assetActions/ to https://wiki.dlib.indiana.edu/display/DLFAquifer/Asset+Action+Project on January 19th, 2011.

Building an Institutional Repository Interface Using EJB3 and JBoss Seam

This tour is designed to show the overall architecture of a FEDORA digital object repository application within the JBoss Seam framework while at the same time pointing out individual design decisions and extension points that are specific to the Ohio Digital Resource Commons application. The tour is geared towards software developers; some familiarity with Java Servlet programming is assumed. Knowledge of JBoss Seam, Hibernate/Java Persistence API, EJB3 and Java EE would be helpful but is not required; brief explanations of the core concepts of these technologies are included in this tour.

The tour is based on revision 709 of /drc/trunk and was last updated on 18-Jan-2007.

This tour will also be incorporated into a presentation at Open Repositories 2007 on Tuesday afternoon.

Directory Layout

The source directory tree has four major components: ‘lib’, ‘resources’, ‘src’, and ‘view’.

lib – libraries required by the application. The lib directory contains all of the JAR libraries required by the application. Its contents are a mix of the Seam-generated skeleton (pretty much everything at the top level of the ‘lib’ directory) and JAR libraries that are specific to the DRC application (in subdirectories of ‘lib’ named for the library in use). For instance, the ‘commons-codec-1.3’, ‘hibernate-all’, and ‘jboss-seam’ JAR files were all brought into the project via ‘seam-gen’, while the ‘lib/commons-net-1.4.1/commons-net-1.4.1.jar’ library was added specifically for this project. A convention has been established whereby new libraries added to the project appear as entries in the lib.properties file, which is used by a series of directives in the build.xml file to set up the classpaths for compiling and for building the EJB JAR. This is done to make the testing and transition of new libraries into the application more explicit and easily testable. Note that a newly included library directory also includes a copy of any license file associated with that library; this is not only a requirement for using some libraries but is also a good practice to show the lineage of some of the lesser-known libraries. (For an example of what is required, see the changes to build.xml and to lib.properties in order to bring the Apache Commons Net library into the application.)

resources – configuration files and miscellaneous stuff. The resources directory holds the various configuration files required by the application plus other files used for testing and demonstration. Much of this was generated by the Seam-generated skeleton as well. Some key files here are the import.sql file (SQL statements that are used to preload the RDBMS used by Hibernate as the mocked up repository system) and the test-datastreams directory which has sample files for each of the media types.

src – Java source code. The src directory contains all of the Java source code for the application. Everything exists in a package called ‘edu.ohiolink.drc’ with subpackages for classes handling actions from the view component of the MVC, entity beans (sometimes known as Data Access Objects — or DAOs — I think), exception classes (more on this below), classes for working with FEDORA (not currently used), media type handler classes (more on this below), unit test classes (not currently used), and utility classes.

view – XHTML templates, CSS files, and other web interface needs. The view directory holds all of the files for the “view” aspect of the Model-View-Controller paradigm. More information about the view components is below.

Entity Classes

The entity beans package has three primary entity beans defined: Item.java, Datastream.java, and Description.java. (The FedoraServer.java entity bean is not used at this time.) Item.java is the primary bean that represents an object in the repository. Datastream.java and Description.java are component beans that only exist in the lifecycle of an Item.java bean; Datastream.java holds a representation of a FEDORA object datastream and Description.java holds a representation of a Dublin Core datastream for that object.

The Datastream and Description objects are annotated with @Embedded in the Item.java source; this is Hibernate’s way of saying that these objects do not stand on their own. Item.java also has numerous methods marked with a @javax.persistence.Transient annotation meaning that this is information not stored in the backing Hibernate database; these methods are for the various content handlers, which will be outlined below.
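The ownership relationship can be sketched in plain Java; the Hibernate annotations are shown only in comments because this sketch leaves out the persistence machinery, and the field names are illustrative rather than the actual DRC source.

```java
/**
 * A plain-Java sketch of the entity relationships described above: an Item
 * owns a Description and a Datastream whose lifecycles are bound to it.
 * In the real beans this is expressed with Hibernate's @Embedded, and
 * derived values carry @javax.persistence.Transient.
 */
class EmbeddedDescription {
    String title;   // mapped from a table column in the real bean
}

class EmbeddedDatastream {
    String mimeType; // likewise a mapped column
}

class ItemEntitySketch {
    String pid;
    EmbeddedDescription description = new EmbeddedDescription(); // @Embedded in the real bean
    EmbeddedDatastream datastream = new EmbeddedDatastream();    // @Embedded in the real bean

    /** Derived, not stored: would carry @Transient in the real bean. */
    String displayTitle() {
        return description.title == null ? pid : description.title;
    }
}
```

The point of the embedded arrangement is that a Description or Datastream never exists apart from its Item; deleting the Item record removes them as well.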

Mock Repository

As currently configured, the entity beans pull their information from a static RDBMS using Hibernate rather than from an underlying FEDORA digital object repository. (You’ll need to go back to revision 691 to see how far we got with the FEDORA integration into JBoss Seam before we switched our development focus to the presentation ‘view’ aspects of the application.) Hibernate currently uses an embedded Hypersonic SQL database for its datastore. As part of the application deploy process, the Java EE container will instantiate a Hypersonic database and preload it with the contents of the import.sql file. (The import.sql file contains just three sample records at the moment: one each for a text file, a PDF file, and a graphic file.)

All of the data for a repository object is contained in a single table record. Hibernate manages the process for us of reading that record out of the database and creating the three corresponding Java objects: Item, Datastream and Description. (Hibernate could also handle the process of updating the underlying table record if we were to change a value in one of the Java objects.) The mapping of table column to Java object field is handled by the @Column(name="xx") annotations in the entity beans.

For Datastream, what is stored in the database is not the datastream content itself but rather a filename that points to the location of the datastream file. The file path in this field can either be absolute (meaning a complete path starting from the root directory of the filesystem) or a relative path. In the case of the latter, the path is relative to the deployed application’s WAR directory (something like “…/jboss-4.0.5.GA/server/default/deploy/drc.ear/drc.war/” for instance). Note that the getter/setter methods for the contentLocation are private — the rest of the application does not need to know the location of the datastreams; this will also be true when the DRC application is connected to a FEDORA digital object repository. The method marked public instead is getContent, and the implementation of getContent hides the complexity of the fact that the datastream is coming from a disk file rather than a FEDORA repository call. For the three records/repository-objects currently defined in ‘import.sql’ there are three corresponding demo datastreams in the test-datastreams directory.
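A simplified sketch of that arrangement: the content location stays private, and getContent() hides whether the bytes come from an absolute path or a path relative to the deployed WAR directory. The class and field names here are illustrative, not the actual DRC source.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * A sketch of the Datastream idea described above. Only getContent() is
 * public; the rest of the application never learns where the bytes live,
 * which is what will let a FEDORA repository call replace the disk read
 * later without touching callers.
 */
class DatastreamSketch {

    private final Path warDirectory;      // e.g. .../deploy/drc.ear/drc.war/
    private final String contentLocation; // kept private to the rest of the app

    DatastreamSketch(Path warDirectory, String contentLocation) {
        this.warDirectory = warDirectory;
        this.contentLocation = contentLocation;
    }

    /** Resolves a relative location against the WAR directory. */
    private Path resolveLocation() {
        Path p = Paths.get(contentLocation);
        return p.isAbsolute() ? p : warDirectory.resolve(p);
    }

    /** The only public accessor: callers never see where the bytes live. */
    byte[] getContent() throws IOException {
        return Files.readAllBytes(resolveLocation());
    }
}
```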

In all likelihood, this representation of the FEDORA repository will be too simple for us to move forward much further. In particular, the current notion of one datastream per repository object is too simplistic. The Datastream embedded object will likely need to be broken out into a separate table and as a corresponding distinct Java applet. (We may reach the same point soon for the Description object as well.)

By using the Entity Beans as a buffer between the business logic and the view components of the rest of the application, I hope we can minimize/localize the changes required in the future in order to replace the mock repository with a real underlying FEDORA repository.

View Templates

The preferred view technology for JBoss Seam is Facelets, an implementation of Java Server Faces that does not require the use of Java Server Pages (JSP). Although the ‘.xhtml’ pages in the view directory bear a passing resemblance to JSP, behind the scenes they are radically different. Of note for us is the clean templating system used to generate pages. The home.xhtml file has a reference to the template.xhtml file in the ‘layout’ directory. If you read through the template.xhtml file, you can see where the Facelets engine will pull in other .xhtml files in addition to the content within the <ui:define name="body"> tag of home.xhtml.

Content Handlers

The paradigm of handling different media types within the DRC application is guided in large part by the notion of disseminators for FEDORA objects and the Digital Library Federation Aquifer Asset Actions experiments. The underlying concept is to push the media-specific content handling into the digital object repository and to have the presentation interface consume those content handlers as it is preparing the end-user presentation.

For instance, the DRC will need to handle content models for PDFs, images, video, and so forth. Furthermore, how a video datastream from the Digital Video Collection is offered to the user may be different from how a video datastream from a thesis is offered to the user. Rather than embedding the complexity of making those interface decisions into the front-end DRC application, this model of content handlers pushes that complexity closer to the objects themselves by encoding those behaviors as disseminators of the object. What the presentation layer gets from the object is a chunk of XHTML that it inserts into the dynamically generated HTML page at the right place.

There is work beginning on a framework for FEDORA disseminators at /BaseDisseminator/trunk in the source code repository; that work has been put on hold at the moment in favor of focusing on the presentation interface. In order to prepare for the time when the presentation behaviors are encoded as FEDORA object disseminators, the current presentation layer makes use of Content Handlers for each of the media types. The Handler interface defines the methods required by each handler and the TextHandler class, the ImageHandler class, and the PdfHandler class implement the methods for the three media types already defined.

Of these, TextHandler class is the most complete, so I’ll use it as an example.

  • The getRawDatastream method takes the datastream and sends it back to the browser with the HTTP headers that cause a File-Save dialog box to open.
  • The getFullDisplay method returns a chunk of XHTML that presents the full metadata in a manner that can be included in a full metadata display screen.
  • The getRecordDisplay method (currently unwritten) returns a chunk of XHTML used to represent the object in a list of records that resulted from a user’s search or browse request.
  • The getThumbnail method (currently unwritten) returns a static graphic thumbnail rendition of the datastream (e.g. a cover page, a key video frame, etc.).

By making these content handlers distinct classes, it is anticipated that the rendering code for each of these methods can be more easily moved to FEDORA object disseminators with minimal impact to the surrounding DRC interface application.
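The handler arrangement described above can be sketched as an interface with one implementation per media type. The method names match the tour (getRawDatastream is omitted here because it writes HTTP headers rather than returning a string); the bodies are illustrative, not the actual DRC classes.

```java
/** A sketch of the content-handler interface described in the tour. */
interface HandlerSketch {
    String getFullDisplay();   // full metadata display fragment
    String getRecordDisplay(); // short form for search/browse result lists
    String getThumbnail();     // markup for a thumbnail rendition
}

/** One handler per media type; a TextHandler-like example. */
class TextHandlerSketch implements HandlerSketch {
    private final String title;

    TextHandlerSketch(String title) { this.title = title; }

    public String getFullDisplay() {
        return "<table class=\"drc_dublinCore_table\"><tr>"
             + "<td>Title:</td><td>" + title + "</td></tr></table>";
    }

    public String getRecordDisplay() {
        return "<div class=\"drc_record\">" + title + "</div>";
    }

    public String getThumbnail() {
        // a static icon for text; an ImageHandler would derive a real thumbnail
        return "<img src=\"/images/text-icon.png\" alt=\"" + title + "\"/>";
    }
}
```

Because each method already returns a self-contained XHTML chunk, moving a method body into a FEDORA disseminator later changes where the chunk is produced, not what the interface application receives.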

Exception Handling

The DRC application follows the practice suggested by Barry Ruzek in Effective Java Exceptions (found via this link on The Server Side). The article can be summarized as:

One type of exception is a contingency, which means that a process was executed that cannot succeed because of a known problem (the example he uses is that of a checking account, where the account has insufficient funds, or a check has a stop payment issued.) These problems should be handled by way of a distinct mechanism, and the code should expect to manage them.

The other type of exception is a fault, such as the IOException. A fault is typically not something that is or should be expected, and therefore handling faults should probably not be part of a normal process.

With these two classes of exception in mind, it’s easy to see what should be checked and what should be unchecked: the contingencies should be checked (and descend from Exception) and the faults should be unchecked (and descend from RuntimeException).

All unchecked exceptions generated by the application are subclasses of DrcBaseAppException. (DrcBaseAppException itself is a subclass of RuntimeException.) For an example, see NoHandlerException. By setting up all of the application’s exceptions to derive from this point, we have one place where logging of troubleshooting information can take place (although this part of the application has not been set up yet). Except when there is good reason to do otherwise, this pattern should be maintained.

At this point, no checked (or contingency) exceptions specific to the DRC have been defined. When they are needed, though, they will follow the same basic structure with a base exception derived from Exception.
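The two families can be sketched as below. Apart from the DrcBaseAppException/NoHandlerException names taken from the tour, the class names and messages are invented for illustration (the contingency example in particular is hypothetical, since none have been defined yet).

```java
/**
 * A sketch of the fault/contingency split described above.
 * Faults: unchecked, all derived from one application-wide base class,
 * giving a single place to hook in troubleshooting/logging later.
 */
class DrcBaseAppExceptionSketch extends RuntimeException {
    DrcBaseAppExceptionSketch(String message) { super(message); }
}

class NoHandlerExceptionSketch extends DrcBaseAppExceptionSketch {
    NoHandlerExceptionSketch(String mimeType) {
        super("No content handler registered for " + mimeType);
    }
}

/**
 * Contingencies: checked, expected by callers and handled explicitly
 * as part of the normal process. (Hypothetical example -- none are
 * defined in the DRC yet.)
 */
class DatastreamUnavailableException extends Exception {
    DatastreamUnavailableException(String pid) {
        super("Datastream temporarily unavailable for " + pid);
    }
}
```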

The text was modified to update a link from http://rama.grainger.uiuc.edu/assetActions/ to https://wiki.dlib.indiana.edu/display/DLFAquifer/Asset+Action+Project on January 19th, 2011.

The text was modified to update a link from http://dev2dev.bea.com/pub/a/2006/11/effective-exceptions.html to http://www.oracle.com/technetwork/articles/entarch/effective-exceptions-092345.html on January 20th, 2011.


The text was modified to remove a link to http://facelets.dev.java.net/ on November 6th, 2012.

Looking Forward to Version 2.2 of FEDORA

Sandy Payette, Co-Director of the Fedora Project and Researcher in the Cornell Information Science department, announced a tentative date for the release 2.2 of the FEDORA digital object repository.

The Fedora development team would like to announce that Fedora 2.2 will be released on Friday, January 19, 2007.

This new release will contain many significant new features and enhancements, including [numbers added to the original for the sake of subsequent commentary]:

  1. Fedora repository is now a web application (.war) that can be installed in any container
  2. Fedora authentication has been refactored to use servlet filters (no longer Tomcat realms)
  3. A new Fedora installer makes it easy to get started with Fedora (with both “quick” and “custom” install options)
  4. GSearch service (backed by Lucene or Zebra) – flexible, configurable, indexes any datastream
  5. Journaling service to create a backup/mirror repository
  6. New checksum features for datastreams
  7. Support for Postgres database configuration
  8. Standard system logging with Log4J
  9. Over 40 bug fixes
  10. Many other enhancements

Be on the lookout for the release announcement in the new year! Also, there will be opportunities to talk with the Fedora development team at Open Repositories 2007 (http://openrepositories.org/).

This is great news and a major step forward for the project. Here are some reasons why I think this is true.

1. Fedora repository is now a web application (.war)

To this point, the FEDORA repository application distribution has been pre-bundled inside a Tomcat Java servlet container. The binding has been pretty tight with certain dependencies written into the Tomcat configuration itself. That made it very difficult to install FEDORA into an organization’s existing servlet container (be it another installation of Tomcat or Jetty/JBoss/Glassfish, etc.). Even more problematic, there were reports of problems trying to get JSP-based applications to work inside the FEDORA-supplied container (we ran into this ourselves) meaning that organizations wanting to run both FEDORA and another servlet-based application needed to run two servlet containers; pretty inefficient. (OhioLINK was in this position in its early implementations of the Ohio DRC project.)

With release 2.2, the core developers have effectively turned the software distribution inside out. The primary output of the new build process is a standard Web ARchive (or WAR) file that can be put inside any servlet container. The new installation program (see #3 below) comes with a Tomcat distribution, should a new installation need it, but it is no longer required. There have been reports that the new WAR-based distribution works inside the Jetty servlet container; we’re hoping it will work in the JBoss Application Server as well (since that is what we’re using to build our next generation interface).

2. Fedora authentication has been refactored to use servlet filters

I’m not quite sure what this means, but I have hopes that it will make integration with Shibboleth easier. Can anyone else see the path between FEDORA and Shibboleth and comment on it?

3. A new Fedora installer makes it easy to get started with Fedora

From the start, FEDORA required a Java servlet container in order to run. To make the installation job easier for those who are not familiar with Java servlet containers, the FEDORA installation process did everything for you. Now that the relationship between the FEDORA application and the servlet container has been flipped around (see #1 above), the core developers devised an easy-to-use installation application that mimics the simplicity of the previous installation style while allowing others to make use of FEDORA as an integrated application within an existing servlet container.

4. GSearch service

The original FEDORA search service, the appropriately-named “basic search,” indexes only the Dublin Core (DC) datastream of each object. As has been mentioned on the Fedora-Users mailing list several times, the DC datastream is really meant as an administrative metadata datastream and not necessarily the full description of the object; that full description can be stored in other datastreams of a FEDORA object. Not only did basic search not index these other descriptive metadata streams, but it also wouldn’t index the full text of PDF, text, and other indexable datastreams.

GSearch — where “G” stands for “General” but could equally well stand for “Gert” Schmeltz Pedersen, its lead developer from the Technical University of Denmark — does all of the above as a new component in the FEDORA Service Framework. We extend our gratitude to Gert and his colleagues for contributing their work to the general FEDORA distribution as well as to DEFF, Denmark’s Electronic Research Library, which funded the GSearch project.

5. Journaling service

Like a journaling file system or a journaling database, this capability allows one to capture all of the transactions applied to the repository and replay them against a secondary repository instance or to restore a repository from backup.

6. Datastream checksums

As part of its ingestion and maintenance functions, the FEDORA software can now calculate, store, and verify checksums of datastreams. This helps ensure the integrity of the repository content, or at least detect when something goes wrong.
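As a minimal sketch of the general technique (this is illustrative code, not Fedora's actual implementation): compute a hex digest for a datastream's bytes at ingest time, store it, and verify later by recomputing and comparing.

```java
import java.security.MessageDigest;

// Illustrative sketch of datastream checksumming (not Fedora's code):
// compute a hex digest at ingest, then verify by recompute-and-compare.
public class DatastreamChecksum {

    // Return the hex digest of the given bytes under the named algorithm
    // (e.g. "MD5" or "SHA-1", both provided by the Java runtime).
    static String checksum(byte[] data, String algorithm) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] datastream = "hello".getBytes("UTF-8");
        String stored = checksum(datastream, "MD5");
        System.out.println(stored);  // 5d41402abc4b2a76b9719d911017c592

        // Integrity verification is just a recompute-and-compare:
        System.out.println(stored.equals(checksum(datastream, "MD5")));  // true
    }
}
```

A mismatch between the stored and recomputed digest is the signal that the datastream's bytes have changed since ingest.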

7. Support for PostgreSQL

In the battle over which relational database engine is best, FEDORA now supports most of the big ones out-of-the-box: Oracle, MySQL, and now PostgreSQL. Here at OhioLINK, we’ve started with MySQL but are considering a migration to PostgreSQL as our in-house, preferred RDBMS, so the timing of this announcement is great.

8. Standard system logging with Log4J

Put this one in the category of “playing nicely with others.” We’ve already reaped the benefit of the refactored logging code in the client JAR file in a pre-release version of the code.
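For context, standardizing on Log4J means an adopting site controls logging through an ordinary properties file rather than code changes; a minimal illustrative log4j.properties (not Fedora's shipped configuration) looks like:

```properties
# Illustrative Log4J 1.x setup, not Fedora's shipped configuration:
# send INFO and above to the console with a timestamped pattern.
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p [%c] %m%n
```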

9 and 10. Bug fixes and many other enhancements

The core code is evolving along a nice trajectory. This is good to see for the health of the overall project!

Version 2.2 represents another monumental step towards the vision of a Flexible, Extensible Digital Object Repository Architecture. Congratulations to the core developers for what sounds like it will be a great release.

The text was modified to update a link from http://comm.nsdl.org/pipermail/fedora-users/2006-December/002330.html to http://article.gmane.org/gmane.comp.cms.fedora-commons.user/2330/ on January 19th, 2011.

Why FEDORA? Answers to the FEDORA Users Interview Survey

The Fedora Outreach and Communications team is conducting a survey of the high-level sense of passion and commitment inherent in the Fedora community. I’ve posted some answers back to the FEDORA wiki on behalf of OhioLINK, and am also including the responses here as they fit into the “Why FEDORA?” series of blog postings. (If you are reading this through an RSS news reader, I think you’ll have to actually come to the DLTJ website and scroll down to the bottom of this post to see the table of contents of the series.) On with the responses!

How did you hear about Fedora?

I first remember hearing about FEDORA at a Coalition for Networked Information meeting in 2003. I only really remember it in passing because what was being presented was so radical that I didn’t appreciate what was being described.

I next encountered FEDORA during a conference call with the Internet2 Shibboleth core developers in mid-2004. The topic was enabling cross-repository access management — a topic that is still a challenge today (although the Shibboleth team is working on it). But that time I really started to catch on to what the FEDORA team was doing, and started paying closer attention.

Why did you choose Fedora?

When I arrived as project manager to the Ohio Digital Resource Commons (DRC) project in January 2005, OhioLINK was on the path to expand their existing Documentum installation to include a hosted institutional repository service. The Ohio DRC Steering Committee reviewed and accepted a proposal to use FEDORA as the foundation of this new hosted institutional repository service primarily because OhioLINK would be working with peers to develop the service (rather than working in isolation as likely would have happened with a Documentum-based solution).

Were there economic advantages to your project/org. in selecting Fedora?

The open source, free-to-license nature of FEDORA was definitely an advantage. It allowed us to turn grant funding that would have been used to pay for additional Documentum modules and licenses into salary for temporary-hire programmers. In that way we felt that we had better control over our destiny by creating the application code ourselves rather than relying on consultants.

What is Fedora’s unique role in your production system?

OhioLINK is beginning to look at the Service Oriented Architecture (SOA) software design paradigm, and FEDORA fits right into that model as the content repository for all of our digital objects. If anything, FEDORA’s nature as a best-of-breed content repository — and nothing else — encourages us to think along those lines sooner than we might otherwise have done.

Is there one specific Fedora attribute that enables your project/organization to accomplish your overall goals?

The fact that FEDORA is completely agnostic to what is contained in a datastream — be it audio, video, image, dataset, PDF, Dublin Core, MODS, EAD, FGDC, TEI, etc. — means that we can truly pursue a goal of managing all of our content in one place. The robustness of the content repository functions allows us to consider more interesting questions such as how this different content is ultimately presented to the end user.

Do you see yourself as an active member of the Fedora community? Why?

Yes. FEDORA represents the ability to take long-term control over the destiny of our digital objects. If, for some reason, the existing core developers at Cornell and UVa disappeared, a vibrant user community (OhioLINK included) can pick up the task of maintaining the software for the collective good. And if no one but OhioLINK is left in a “FEDORA community” our job of migrating out of it, should we desire to do so, is eased by the fact that we have the full view of the source code to help us move content and services to a new platform.

What would inspire you to become more involved?

It would take the existence of more hours in the day, I’m afraid!

What should be the mission of an ongoing Fedora organization?

A FEDORA community should first and foremost inspire communication among users of the FEDORA software. Almost all of us are working with extremely limited resources, and it weakens our collective effort if there is duplicated work underway. This communication should include not only developers but also users of the software.

Analysis of CDL’s XTF textIndexer to Replace the Local Files with FEDORA Objects

This is a continuation of the investigation, started earlier, into integrating the California Digital Library’s XTF software with the FEDORA digital object repository. This analysis looks at the textIndexer module in particular, starting with an overview of how textIndexer works now with filesystem-based objects and ending with an outline of how it could work when reading objects from a FEDORA repository instead.

XTF’s Native File System handler

Natively, XTF wants to read content out of the file system. The core of the processing is done in these two class files:


TextIndexer.java: The main() driver for ingesting content into the index. It reads command-line arguments (cfgInfo.readCmdLine( args, startArg );) to determine the various parameters, one of which is the top of the document source directory (String srcRootDir = Path.resolveRelOrAbs( xtfHomeFile, cfgInfo.indexInfo.sourcePath );). Assuming all goes well, it calls a method to open the Lucene index for writing, process files in the source directory, and close the Lucene index:
srcTreeProcessor.open( cfgInfo );
srcTreeProcessor.processDir( new File(srcRootDir), 0 );


SrcTreeProcessor.java: processDir() is called recursively on the directory structure to process the files in each directory. For each directory, a docBuf XML-as-a-string buffer is constructed consisting of an element for every directory entry. docBuf is fed into the SAXON processor along with the docSelector XSLT stylesheet. The resulting XML is read node-by-node looking for file entries that have an “indexFile” tag. For each matching node, it calls processFile() to index each entry.

processFile() will run the prefilter XSLT against the file content, build the Lazy Tree (if possible and requested), create the IndexSource version by running the source document through the appropriate file type “*IndexSource” method (e.g. PDFIndexSource(), XMLIndexSource(), and MARCIndexSource()), and queue the content for indexing by the Lucene indexer.

Requirements for an Object Handler for textIndexer

Based on this analysis, if one were to replace the TextIndexer.java and SrcTreeProcessor.java “front end” of textIndexer, I think these would be the pieces that would be required. (Note that some steps are skipped in this overview — any replacement of these two classes would need to be sure to do everything that those classes do now.)

  1. Parse command line and configuration file parameters to create an IndexerConfig instance (guiding parameters for the indexer as a whole) and an IndexInfo instance (parameters specific to the identified index-name).
  2. Specify a collection of objects that you want in index-name.
  3. Open up a writable instance of the index-name’s Lucene index (a la srcTreeProcessor.open( cfgInfo );)
  4. For each object to be put into index-name, do these things:
    1. Optionally, run the source object through a prefilter (an XSLT transformation used to restructure the source document just prior to indexing without changing the stored source document).
    2. Optionally, remove a DOCTYPE declaration in the source object before it is indexed.
    3. Set up a transformation object from the native file format to something that is XML and call textProcessor.checkAndQueueText() to add it to a queue to be processed.
  5. Close index-name’s Lucene index (a la srcTreeProcessor.close();), which should have the side effect of processing the queued text (a la textProcessor.processQueuedTexts();) which will ultimately create the Lazy Tree (if specified) and add the object to the Lucene index.
  6. Optionally, compare the collection of objects that you want in index-name with what is actually in index-name before you started, and remove anything that wasn’t in the specified collection.

Considering a FEDORA-based XTF handler

So, all-in-all, that doesn’t seem too bad. Here is where we get to mix in some FEDORA pieces and see what we get in the end.

First off, in terms of dealing with “collections of source objects to be indexed” I think it would be best to have this start with one of our “collection aggregation” objects as the root level of a source collection. We’d perform an RDF “isMemberOf” query against the resource index using the FEDORA PID of the aggregation object (and optionally make an “isMemberOf” query recursively against the returned set — as if one was drilling down a file system).
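To make that concrete, such a membership query might be expressed in iTQL against the resource index (the PID demo:collection1 is a placeholder, and the predicate spelling should be verified against the Fedora resource index documentation):

```
select $member
from <#ri>
where $member
      <info:fedora/fedora-system:def/relations-external#isMemberOf>
      <info:fedora/demo:collection1>
```

Running the same query against each returned member's PID would give the recursive drill-down behavior described above.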

Secondly, to get the XML content to be indexed, each object would have a getXML disseminator (see Thinking about Our FEDORA Disseminators for background) that would render to XTF an XML version of itself. If the source object is an XML-based object, it just returns the XML. If the source object is a PDF or Word document or something else that can be rendered into a text-like form, the disseminator would handle that. If the source object is an image or audio clip, the disseminator can return the descriptive XML of the object. The point is, though, that by the time the object gets to XTF’s textIndexer, it has already been rendered to XML, so just the XML transformation tool would be needed (as in this snippet from SrcTreeProcessor.java):
IndexSource srcFile = null;
if( format.equalsIgnoreCase("XML") ) {
    InputSource finalSrc = new InputSource( systemId );
    srcFile = new XMLIndexSource( finalSrc, srcPath, key,
                                  preFilters, displayStyle, lazyStore );
    if( removeDoctypeDecl )
        ((XMLIndexSource)srcFile).removeDoctypeDecl( true );
}

Third, a FEDORA-aware driver that replaces TextIndexer.java and SrcTreeProcessor.java. Given a configuration file location and a starting PID, it would gather the objects to be indexed, “open” the Lucene index, run through the snippet of Java above for each object, and “close” the Lucene index.

The quick-and-dirty first implementation would copy the XML source to a directory on the hard drive (directory and subdirectory names would be the PID of the aggregation object containing the collection of objects), and have XTF use that local filesystem copy as the indexed source. Lazy Tree files for each object would also be created and stored locally. This means we have two copies (three, if you count the Lazy Tree) of the object lying around, so eventually I think we’d want to modify XTF to pull content directly from FEDORA using a REST-based URL. Eventually I think we may also want to store the Lazy Tree in something other than the local file system. Could that be another datastream in the FEDORA object?

CDL’s XTF as a Front End to Fedora

We’re experimenting pretty heavily now with the California Digital Library’s XTF framework as a front-end to a FEDORA object repository. Initial efforts look promising — thanks go out to Brian Tingle and Kirk Hastings of CDL; Jeff Cousens, Steve DiDomenico, and Bill Parod from Northwestern; and Ross Wayland from UVa for helping us along in the right direction.

XTF into Eclipse How-To

As we get more serious about XTF, I wrote up a How-To document for bringing XTF into Eclipse so that it can be deployed as a dynamic web application. Let me know if you find it useful. Definitely let me know if you find it in error. We haven’t put a version of XTF into OhioLINK’s source code repository, but that might follow shortly.

Points of Integration

In its base configuration, XTF reads documents out of a “data” directory that is in the application’s Tomcat context directory. It looks like two of the XTF components will need to be modified to successfully converse with a FEDORA-based object repository: DynaXML and textIndexer. Of the two, DynaXML seems to be the most straightforward.


First I went looking for where XTF’s DynaXML reads documents and found the DocLocator interface with one implementation that looks into the file system. John Davison, one of the DRC programmers, figured out (with help from the CDL folks) that in fact it is possible to pass a FEDORA API-A URL to DefaultDocLocator and have it do the right thing. Its ‘getInputSource()’ method has this signature:

public InputSource getInputSource( String sourcePath,
boolean removeDoctypeDecl ) throws IOException
…followed shortly by:

// If it's non-local, load the URL.
if( sourcePath.startsWith("http:") ||
    sourcePath.startsWith("https:") )
    return new InputSource( sourcePath );
where “InputSource” is the entry point into the SAX parser, which will accept a URI as a parameter.
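A small self-contained sketch of that behavior (the class name is hypothetical, and the Fedora URL in the comment is a placeholder): the same SAX pipeline works whether the InputSource wraps a local character stream or a bare URI string.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch: an InputSource can wrap either a character stream
// or a system identifier (URI string), so one SAX pipeline can read a
// local document or a remote one such as a Fedora API-A URL.
public class InputSourceDemo {

    // Parse whatever the InputSource points at and count its elements.
    static int countElements(InputSource src) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        final int[] count = {0};
        parser.parse(src, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Local content via a Reader...
        InputSource local = new InputSource(new StringReader("<a><b/><c/></a>"));
        System.out.println(countElements(local));  // prints 3

        // ...or a URI-style system identifier, e.g. a Fedora API-A
        // dissemination URL (placeholder host and PID, not fetched here):
        // InputSource remote = new InputSource(
        //     "http://localhost:8080/fedora/get/demo:1/getXML");
    }
}
```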

Unfortunately, using DefaultDocLocator in this way negates the use of CDL’s “Lazy Trees” (a binary version of each XML document containing all the original contents of the document, plus an index telling XTF where each element starts and ends). Lazy Trees are a good thing because they speed up parsing of the XML document and the resulting rendering to the user.

When dealing with local files (as opposed to the URL method described above), DefaultDocLocator will build a Lazy Tree in its index directory the first time the XML document is called up. In implementing a FEDORA interface for XTF’s DynaXML, what is required is a mixture of the two: a URL (or, in the case of FEDORA, a PID plus API-A call) to get the document, followed by creating/storing its lazy tree in the XTF index directory for subsequent retrieval. This does seem pretty straightforward, does it not?


XTF’s textIndexer, on the other hand, really wants the XML it is indexing to be files on the local hard drive. The XTF programming guide speaks of a textIndexer Document Selector whose job it is to create a single XML file with the specifications of which documents to index and how to do it:

It is the responsibility of the Document Selector XSLT code to output an XML fragment that identifies which of the files in the directory should be indexed. This output XML fragment should take the following form:
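From memory of the XTF programming guide, that fragment is an indexFiles element with one indexFile entry per document, roughly like this (attribute names illustrative; check the guide itself):

```xml
<!-- Illustrative shape only; verify attribute names against the XTF
     programming guide. One indexFile entry per document to index. -->
<indexFiles>
  <indexFile fileName="document1.xml"
             type="XML"
             preFilter="style/textIndexer/preFilter.xsl"/>
</indexFiles>
```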


Now the trick seems to be to build an alternate Document Selector that will not use filenames but rather URIs to build the index. That’ll be the subject of the next round of investigations.

Comments and observations are welcome!

The text was modified to update a link from http://xtf.cvs.sourceforge.net/xtf/xtf/WEB-INF/src/org/cdlib/xtf/dynaXML/DocLocator.java?revision=1.5&view=markup to http://xtf.hg.sourceforge.net/hgweb/xtf/xtf/file/549e4167039e/WEB-INF/src/org/cdlib/xtf/dynaXML/DocLocator.java on January 28th, 2011.

The text was modified to update a link from http://xtf.cvs.sourceforge.net/xtf/xtf/WEB-INF/src/org/cdlib/xtf/dynaXML/DefaultDocLocator.java?revision=1.10&view=markup to http://xtf.hg.sourceforge.net/hgweb/xtf/xtf/file/de7d8a406bef/WEB-INF/src/org/cdlib/xtf/dynaXML/DefaultDocLocator.java on January 28th, 2011.