Presentation Summary: “MPTStore: Implementing a fast, scalable, and stable RDBMS-backed triplestore for Fedora and the NSDL”

Chris Wilper gave this presentation on behalf of the work that he and Aaron Birkland did to improve the performance of the Fedora Resource Index.

Version 2.0 of the Fedora digital object repository software added a feature called the Resource Index (RI). Based on Resource Description Framework (RDF) triples, the RI provided quick access to relationships between objects as well as to the descriptive elements of the object itself. After about two years of use with the Kowari software, the RI has pointed to a number of challenges for “triplestores”: scalability (few triplestores are designed for more than 100 million triples), performance, and stability (frequent “rebuilds”).

The real motivation behind experimenting with a new triplestore, however, was the NSDL use case. The National Science Digital Library (NSDL) is a moderately large repository (4.7 million objects, 250 million triples) with a lot of write activity (driven by periodic OAI harvests; primarily mixed ingests and datastream modifications). The NSDL data model also includes existential/referential integrity constraints that must be enforced. Querying the RI to determine correct repository state proved difficult: Kowari aggressively buffers triples, sometimes for seconds, before writing them to disk, and flushing the buffer after every write is computationally expensive (hence the drive to use buffers in the first place).

The NSDL team also encountered corruption under concurrent use and with abnormal shutdowns, forcing rebuilds of the triplestore. And the solution was not scaling well; performance was becoming notably worse. In looking for alternatives, other triplestores were considered but rejected. Using an RDBMS seemed attractive — efficient transactions, very stable, generally speedy — but a “one big table” paradigm for storing all of the relations did not seem to provide the desired scalability.

NSDL developers observed that the total number of distinct predicates is much lower than the number of distinct subjects or objects; NSDL has about 50 distinct predicates. Based on this observation, their solution, called “Mapped Predicate Tables,” creates a table for every predicate in the triplestore. This has several advantages: a low computational cost for triple adds and deletes, queries with known predicates are fast, complex queries benefit from the relatively mature RDBMS planner having finer-granularity statistics and query plans, and data can be partitioned flexibly to help address scalability. This solution comes with several disadvantages, however: one needs to manage the predicate-to-table mapping, complex queries crossing many predicates require more effort to formulate, and with a naive approach simple unbound queries scale linearly with the number of predicates.
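
To make the approach concrete, here is a minimal sketch of the mapped-predicate-table idea using plain JDBC. This is not MPTStore’s actual API or schema; the class, table-naming scheme, and column names are invented for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of the mapped-predicate-table idea: one two-column
 * (subject, object) table per distinct predicate, plus an in-memory map
 * recording which predicate lives in which table.  This is not MPTStore's
 * API or schema; all names here are invented for the example.
 */
public class MappedPredicateSketch {

    private final Connection conn;
    private final Map<String, String> predicateToTable = new HashMap<String, String>();
    private int tableCount = 0;

    public MappedPredicateSketch(Connection conn) {
        this.conn = conn;
    }

    /** Find, or lazily create, the table that holds triples for a predicate. */
    private String tableFor(String predicate) throws SQLException {
        String table = predicateToTable.get(predicate);
        if (table == null) {
            table = "t" + (++tableCount);
            Statement s = conn.createStatement();
            s.executeUpdate("CREATE TABLE " + table + " (s VARCHAR(255), o VARCHAR(1024))");
            s.close();
            predicateToTable.put(predicate, table);
        }
        return table;
    }

    /** Adding (or deleting) a triple touches only the table for its predicate. */
    public void add(String subject, String predicate, String object) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO " + tableFor(predicate) + " (s, o) VALUES (?, ?)");
        ps.setString(1, subject);
        ps.setString(2, object);
        ps.executeUpdate();
        ps.close();
    }

    /** A query with a known predicate becomes a simple scan of a single table. */
    public void printObjects(String subject, String predicate) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT o FROM " + tableFor(predicate) + " WHERE s = ?");
        ps.setString(1, subject);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        ps.close();
    }
}
```

A query with an unbound predicate, by contrast, has to visit (or UNION over) every predicate table, which is the linear-scaling drawback mentioned above. In practice the predicate-to-table map would itself live in a table so the mapping survives restarts; MPTStore handles that accounting behind the scenes.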

So the NSDL team created the MPTStore triplestore and contributed it back to the Fedora core developers for use by the community. MPTStore is a Java library that handles all of the predicate mapping and accounting behind the scenes. The basic API remains the same as for other triplestores, performing triple writes and queries, and the library hides all of the implementation details of translating queries from a particular language (SPO, SPARQL) into SQL statements. The library is also designed to expose transaction/connection semantics should the developer wish to have direct access to the predicate tables.

A solution like MPTStore is well suited to the NSDL use case. The NSDL team was very familiar with the operations of RDBMS administration: performance tuning, backups, etc. The stored triplestore data is transparent and “hackable” — ad hoc SQL queries and analysis are relatively simple. In fact, the RDBMS triplestore helped track down Fedora middleware bugs that resulted in an inconsistent state. Fixing these bugs also improved the performance of the Kowari-based RI.

[Updated 20070129T1447 to include links to Chris’ presentation on SlideShare.]

Open Source for Open Repositories — New Models for Software Development and Sustainability

This is a summary of a presentation by James L. Hilton, Vice President and CIO of the University of Virginia, at the opening keynote session of Open Repositories 2007. I tried to capture the essence of his presentation; any omissions, contradictions, and inaccuracies in this summary are likely mine and not the presenter’s.

Setting the stage

This is a moment in which institutions may be willing to invest in open source development in a systematic way (as opposed to what could currently be characterized as an ad hoc fashion) driven by these factors:

  • Fear. Prior to Oracle’s hostile take-over of PeopleSoft, the conventional wisdom of universities was that they needed to buy their core enterprise applications rather than build them. In doing so, they sought the comfort of buying the security of a leading platform. Oracle’s actions diminished that comfort level. Blackboard’s acquisition of WebCT and its lawsuit against a competitor do not help either.
  • Disillusionment and ERP fatigue. What was largely thought to be an outsourced project was found to be an endless upgrade cycle. Organizations need to build entire support units to handle the upgrades for large ERP systems rather than supporting the needs of the users.
  • Incredulity — we’re supposed to do what? The application of technology typically has a disruptive impact (cannot predict the end), the stakes are incredibly high (higher education and/or research could be lost in a decade), it tends to be expensive, and the most common survival strategy is to seed many expensive experiments in the hopes that one will be in the right place at the time the transition needs to happen. The massive investment anticipated for technology to support academic computing (libraries, high-performance clusters, etc) will pale in comparison to the investment in administrative computing.
  • Rising tide of collaboration. This is a realization that the only way to succeed is through collaboration. To paraphrase Hilton, “In the new order it will be picking the right collaborative partners where the new competitive advantage will come from.”


Hilton offered these definitions and contrasts as a way to frame the rest of his discussion. First was Open or “free” software. Free as in beer, or free as in “adopt a puppy.” The software comes with the ability to do what you want with the code, not just the ability to use the code. Then he defined the term License as a contract — whatever you agree to you are bound to; you cannot rely on copyright law to protect you. The rules and conditions that are applied to the software do matter.

Lastly, he talked about Copyleft or “viral” licensing. There are different interpretations of “open” in open source. “Copyleft” has come to mean that code should be freely available to be used and modified, and it should never be locked up. GPL is an example. This is often called “viral” because if you include software with this license in any other work that is released, that work must also be released under the same license. This is seen by some as valuable because it prevents open source from being encircled by proprietary code. Copyleft is contrasted with an “open/open” license — you can do whatever you want with code under such a license. An “open/open” license places no restrictions on what users do with code in derivative software packages.

Case Study — Michigan’s Sakai Sojourn

Hilton briefly described why UMich went down the Sakai path in 2001-2002:

  • Legacy system with no positive trajectory forward. It could never be released into open source; all of the development would have to be carried on UMich’s shoulders forever.
  • Saw market consolidation in CMS. This was mostly evident in the commercial sector with Blackboard and WebCT being the dominant choices. They had concerns about the cost of licenses in this environment down the road.
  • Saw the potential of tapping the institution’s core competencies and starting a virtuous cycle of development, teaching and research. Or, put another way, they didn’t want core competencies in teaching and research held hostage to a commercial development cycle.
  • Strategic desire to blur the distinction between the laboratory/classroom and between knowledge creation/digestion. They realized that the functions of a research support tool and a course support tool were pretty much the same under different skins, and they sought to blur that distinction even more.
  • NRC report and the need for collaboration. UMich was willing to fund the project internally for two years but knew that after that it would need to find collaborative partners by the fifth year in order for the project to be declared a success.
  • A moment of time opportunity that synchronized the development process of several partners with funding provided by the Mellon Foundation.

There were also specific goals for the Sakai project. The new system had to replicate the functionality of existing course and research collaboration environments. They also wanted experience in finding partners willing to collaborate. Hilton said, “Sakai was/is at least as interesting from a collaboration perspective as it is from the technology perspective.” Bringing together disparate organizations with different beliefs on how things should be done is a challenge. Additionally, they wanted to get better as an institution at discerning open source winners; it shouldn’t be like a lottery. Lastly, they wanted to implement software parts that were not built at UMich. Each partner institution committed to implementing the same thing even if it wasn’t built at that institution. This is tough to do, but they knew they needed to do it for their own good in the long run.

What happened? Not only did the original partners show up, but the community came, too. Even more interesting was that the community was formed with dues-paying members — even in a world where the software is free. It became a vibrant community, too, with a conference every six months. Sakai was released under an open-open license model, and corporate partners showed up as well (selling support services, hosting services, or hardware for the software). The software grew up and left its home; a separate foundation now holds the intellectual property of the code (originally, partners assigned copyright to UMich). They also positioned Sakai to be a credible threat to the commercial entities in order to force them to the standards table.

Takeaway lessons that generalize to open source development

First, the benefits of open source development.

  • destiny control (but only when you really need to drive). having the control is not always a good thing. Is it worth the effort? Is the project core to the institution’s mission? (Does it directly support scholarship and teaching?)
  • builds community and camaraderie (in the case of Sakai, both locally at UMich and internationally)
  • unbundles software ownership and its support. inspires more competition in the implementation and support space.
  • community source provides institutions an opportunity to leverage links between open source, open access and culture of the academy/wider world (a.k.a. put up or shut up)

Then, the challenges of open source development.

  • Guaranteeing clean code (IP) is hard (read as “impossible”). A certain amount of faith is required about the code you get, and there needs to be consideration given to mitigating risks.
  • Figuring out who is authorized to license institutionally-owned code is challenging, and then you have to convince them to give it away. No one in the institution typically has been appointed or given the authority to release code. One of the things that the Sakai licensing discussions highlighted was institutional differences in requirements and aesthetics.
  • Patent quagmire always looming. How do you know your software is not infringing? How do you make sure you don’t inadvertently give away all institution patents? Be careful when looking at licenses from an institutional perspective versus an individual perspective.
  • There is also the inevitable lawsuit risk. Or, as your counsel might say to you, “Let me get this straight, we can get sued but there’s no one we can sue.”

Then, some discoveries that they made along the way.

  • An open source project is not a silver bullet. The commitment to build rather than buy must align with institutional priorities and competencies; it is not right for every project/application.
  • Licensing does matter; it is a contract: whatever you stick in its rules is what sticks. There are probably too many open source license options, and some sort of standardization is needed. Also keep in mind that if you release something under an open/open license, you can’t include any copyleft components.
  • Communities don’t just happen, they require: specific shared purpose (when visions vary, or when they change, collaborations struggle); and governance (e.g., separate board with dedicated developers sitting between institutions). Cooperation (“I won’t hurt you if you don’t hurt me”) is not collaboration.
  • Open (community) source requires real project discipline. “It is as spontaneous as a shuttle launch.” Along the way one needs to learn to balance pragmatics and ideals. One also needs to learn to trust your partners. “It really requires learning to let go.” Letting go, and having the community make the decisions, may be the quickest path to efficiency.

Reflection on open/community source for repositories

Repositories are at the center of everything at the institution. A repository connects with the library, with the presses/scholarly publishing operation, with classroom teaching, with the laboratory, and with the world. It is a core piece of infrastructure for the university of the 21st century. As institutions, we need to make sustaining investments in our repositories.

Hilton sees three different approaches to “community” in the existing projects:

  • dspace: a community of user/developers. They come together to talk about what they want to do, write code, and support each other. Clearly there are enthusiastic users acting as developers.
  • eprints: appears more like a vendor talking with customers, wanting the community to help shape the direction.
  • fedora: in transition from a combination of the previous two models towards a Sakai-like model; it will require institutions to make commitments to it.

In the end, Hilton asked some thought-provoking questions. Is now the time for institutional investment in open/community source? Will a coherent community (or communities) emerge in ways that are sustainable? — is there a shared vision?


A Vision for FEDORA’s Future, an Implementation Plan to Get There, and a Project Update

This morning, Sandy Payette of Cornell University and FEDORA project co-director, gave an update on the FEDORA project including a statement of a vision for FEDORA’s future, information about the emerging FEDORA Commons non-profit, and a status report/roadmap for the software itself. Below is a summary based on my notes of Sandy’s comments and slide content.

Vision for FEDORA’s Future

From her perspective, Sandy sees many kinds of projects using FEDORA, and they fall into these general categories: Scholarly Workbenches — capturing, managing, and publishing the process of scholarship; Linking Data and Publications — complex objects built up of relationships with different types of internal and external objects; Reviews and Annotations of Objects — blogs and wikis on top of information spaces, and collaborations surrounding a repository object; and Museum Exhibits with K-12 Lesson Plans.

Based on these observations, she can envision the evolution of FEDORA as an open source software for building robust information spaces with these major components:

  • repository services: manage, access, versioning, and storage of digital objects
  • preservation services: repository integrity checking, monitoring, alerting, migration, and replication
  • process management services: workflow centered around digital objects and messaging with peer applications
  • collaboration services: annotation, discussion, and rating of digital objects

The collaboration services suite has not been part of the core FEDORA project to date. Other people have found clever ways to put services such as blogs and wikis on top of a FEDORA content repository, but there are functions that can be put into the FEDORA system that can enable and enhance collaborative services.

FEDORA, of course, does not exist in isolation from other activities on the internet, and there are implications of what is commonly called “Web 2.0” for the FEDORA system. The key theme of Web 2.0 is an “architecture of participation”: the capability to remix and transform data sources — building on top of objects that already exist — to harness the collective intelligence of the community. Some specific examples are collaborative classification (Del.Icio.Us), content sharing and feedback (YouTube), the power of collective intelligence (Wikipedia, Amazon reviews), and alternative trust models (such as eBay’s, based on reputation). This emergent behavior is influencing upcoming generations of scholars and scientists; they will have completely different expectations regarding the technology they use for learning and research.

Taken as a whole, the vision for FEDORA is to enable “object-centric” collaborations. FEDORA is evolving into an open source platform that integrates a robust core (repositories and enterprise SOA) with dynamic content access (collaborative applications and web access/re-use). It is a technology for complex digital objects. As contrasted with a technology such as Wikipedia’s MediaWiki — ideal for working with wiki-based resources — FEDORA is great for many different applications, including as a content store for wikis. In other words, one is not tied to one particular application or use case.

Fedora Commons Non-Profit

FEDORA as a project is evolving into FEDORA as an organization. That organization, called Fedora Commons, will be a non-profit “to enable software developers to collaborate and create open source software (repository, services, tools) that enables information-oriented communities to collaborate in creating new forms of network-based knowledge objects (educational, scholarly, cultural) and that, ultimately, enables institutions to manage and preserve these information objects over time in a robust repository and service architecture.” FEDORA Commons will be a custodian of the software platform and the means to steer its direction.

Structurally, it is envisioned as a 501c3 (as in the section of the IRS tax code) non-profit charitable organization. A proposal to the Moore Foundation is being prepared to secure the initial start-up funds for the Fedora Commons, focusing on sustainability and community building. The Commons may also seek matching funds from other foundations (Getty, Mellon) in later years until the organization is fully self-sustaining. The current thinking is that the Commons will achieve “steady state” with its own business model in 2010. The startup funds will extend the funding for the core development team as well as foster a community of contributors to the project and committers to the code base. The plans include several funded positions: a board of directors, an executive director, a technology architect (supervising a sysadmin and build master as well as developers), a financial/accounting specialist, and a communications specialist.

Sustainability in this context means increasing the installed base of FEDORA as well as moving towards a community leadership model. One model is the Eclipse Foundation, with four technical councils (collaboration, repository, enterprise, preservation) and corresponding community outreach councils. The community will also need to develop an income-generating model, be it corporate membership (a dues structure like Eclipse’s) and/or university and government members.

Fedora Project Status Report

Fedora 2.2 was released on January 19th, and Sandy went through the major changes and features. First is FEDORA as a web application; it has been refactored and repackaged so it can now run in different (even existing) servlet/web containers. Along with this is a new installer application that steps one through the process of bringing up the software. There is a “Quick” option to get running immediately and a “Custom” option to set Fedora up optimally for a particular environment.

Within FEDORA itself, datastreams can now have checksums, and this is supported with new repository configuration options. This enables trusted client/server collaboration and offers on-demand integrity checking of the repository. The manner in which authentication is handled has changed as well; version 2.2 uses servlet filters instead of Tomcat realms, which decouples FEDORA authentication from Tomcat. Three filters come with the core software: username/password file, LDAP, and Pubcookie.
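
To give a rough sense of the mechanism (this is a generic sketch, not Fedora’s actual filter code), a servlet filter is a class that sits in front of the web application and decides whether each request proceeds; swapping in a different filter changes the authentication source without touching the container’s realm configuration:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Generic shape of a servlet authentication filter.  This is NOT Fedora's
 * actual filter code; it only illustrates why filters decouple
 * authentication from the Tomcat realm mechanism.
 */
public class ExampleAuthFilter implements Filter {

    public void init(FilterConfig config) throws ServletException {
        // A real filter would read its user store (file, LDAP, etc.) here.
    }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        if (isAuthenticated(request)) {
            // Hand the request on to the repository web application.
            chain.doFilter(req, res);
        } else {
            response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
        }
    }

    private boolean isAuthenticated(HttpServletRequest request) {
        // Placeholder: a real filter would check a password file, LDAP,
        // Pubcookie, or perhaps a Shibboleth assertion here.
        return request.getHeader("Authorization") != null;
    }

    public void destroy() {
    }
}
```

Presumably the username/password file, LDAP, and Pubcookie filters shipped with Fedora 2.2 differ mainly in how that authentication check is carried out.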

FEDORA 2.2 also includes several modules from community committers: GSearch (configurable search for datastreams in FEDORA); Journaling (replication/recovery module for repositories); and MPTStore (new high-performing triplestore).

Sandy also covered the roadmap. The Mellon Phase 2 grant runs through 4Q2007, and the remaining work includes content models, the content model dissemination architecture, a basic messaging service, and preservation services. Next is “FEDORA Enterprise” (in the form of a grant proposal now in front of Mellon, ending in 2Q2009), which would include a workflow engine and supporting tools, message-oriented middleware for an enterprise service bus (ESB), and distributed transactions. Finally, there is the FEDORA Commons 501c3 work (starting 3Q2007) in two parts: the technical (evolution of the integrated platform) and community building (fostering development and outreach, evolving a business model, and tapping ongoing sources of funding).

[Updated 20070129T1655 to correct the section of the U.S. Tax Code in the last paragraph. I don’t think we want anything to do with 26 USC 301c3.]

Open Repositories Presentation: Building an IR Interface Using EJB3 and JBoss Seam

Below is the outline of the Ohio DRC presentation from today’s FEDORA session at the Open Repositories conference. Comments welcome!

Blogging About Open Repositories 2007? Use the ‘icor2007’ tag.

Are you blogging the Open Repositories conference in San Antonio this week? Are you posting pictures to Flickr?

If so, may I suggest using the ‘icor2007’ tag when posting your content. When you do, the HitchHikr service will aggregate content from Technorati and Flickr based on that tag. (Why not simply “or2007”? It looks like that tag picks up extra cruft in a non-Latin-1 character set.)

Building an Institutional Repository Interface Using EJB3 and JBoss Seam

This tour is designed to show the overall architecture of a FEDORA digital object repository application within the JBoss Seam framework while at the same time pointing out individual design decisions and extension points that are specific to the Ohio Digital Resource Commons application. The tour is geared towards software developers; a familiarity with Java Servlet programming is assumed, although not strictly required. Knowledge of JBoss Seam, Hibernate/Java Persistence API, EJB3, and Java EE would be helpful but not required; brief explanations of the core concepts of these technologies are included in this tour.

The tour is based on revision 709 of /drc/trunk and was last updated on 18-Jan-2007.

This tour will also be incorporated into a presentation at Open Repositories 2007 on Tuesday afternoon.

Directory Layout

The source directory tree has four major components: ‘lib’, ‘resources’, ‘src’, and ‘view’.

lib – libraries required by the application. The lib directory contains all of the JAR libraries required by the application. Its contents are a mix of the Seam-generated skeleton (pretty much everything at the top level of the ‘lib’ directory) and JAR libraries that are specific to the DRC application (in subdirectories of ‘lib’ named for the library in use). For instance, the ‘commons-codec-1.3’, ‘hibernate-all’, and ‘jboss-seam’ JAR files were all brought into the project via ‘seam-gen’, while the ‘lib/commons-net-1.4.1/commons-net-1.4.1.jar’ library was added specifically for this project. A convention has been established whereby new libraries added to the project appear as entries in the file which is used by a series of directives in the build.xml file to set up the classpaths for compiling and for building the EJB JAR. This is done to make the testing and transition of new libraries into the application more explicit and more easily testable. Note that the newly included library directory also includes a copy of any license file associated with that library; this is not only a requirement for using some libraries but also a good practice to show the lineage of some of the lesser-known libraries. (For an example of what is required, see the changes to build.xml and to in order to bring the Apache Commons Net library into the application.)

resources – configuration files and miscellaneous stuff. The resources directory holds the various configuration files required by the application plus other files used for testing and demonstration. Much of this comes from the Seam-generated skeleton as well. Some key files here are the import.sql file (SQL statements that are used to preload the RDBMS used by Hibernate as the mocked-up repository system) and the test-datastreams directory, which has sample files for each of the media types.

src – Java source code. The src directory contains all of the Java source code for the application. Everything exists in a package called ‘edu.ohiolink.drc’ with subpackages for classes handling actions from the view component of the MVC, entity beans (sometimes known as Data Access Objects — or DAOs — I think), exception classes (more on this below), classes for working with FEDORA (not currently used), media type handler classes (more on this below), unit test classes (not currently used), and utility classes.

view – XHTML templates, CSS files, and other web interface needs. The view directory holds all of the files for the “view” aspect of the Model-View-Controller paradigm. More information about the view components is below.

Entity Classes

The entity beans package has three primary entity beans defined: Item, Datastream, and Description. (A fourth entity bean defined in the package is not used at this time.) Item is the primary bean that represents an object in the repository. Datastream and Description are component beans that only exist in the lifecycle of an Item bean; Datastream holds a representation of a FEDORA object datastream and Description holds a representation of a Dublin Core datastream for that object.

The Datastream and Description objects are annotated with @Embedded in the source; this is Hibernate’s way of saying that these objects do not stand on their own. Item also has numerous methods marked with a @javax.persistence.Transient annotation, meaning that this information is not stored in the backing Hibernate database; these methods are for the various content handlers, which will be outlined below.
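
A minimal sketch of what such an entity might look like is below; only the class names Item and Description come from the tour, and the fields, column names, and nested-class arrangement are invented for illustration.

```java
import javax.persistence.Column;
import javax.persistence.Embeddable;
import javax.persistence.Embedded;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Transient;

/** Simplified sketch of the Item entity bean; field and column names are illustrative, not from the DRC source. */
@Entity
public class Item {

    @Id
    @Column(name = "pid")
    private String pid;                // persistent identifier of the repository object

    @Embedded
    private Description description;   // exists only within the lifecycle of this Item

    /** Derived value for the view layer; @Transient keeps it out of the backing database. */
    @Transient
    public String getDisplayTitle() {
        return description == null ? pid : description.getTitle();
    }

    /** Dublin Core view of the object, stored in columns of the same table row. */
    @Embeddable
    public static class Description {

        @Column(name = "title")
        private String title;

        public String getTitle() {
            return title;
        }
    }
}
```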

Mock Repository

As currently configured, the entity beans pull their information from a static RDBMS using Hibernate rather than from an underlying FEDORA digital object repository. (You’ll need to go back to revision 691 to see how far we got with the FEDORA integration into JBoss Seam before we switched our development focus to the presentation ‘view’ aspects of the application.) Hibernate uses an embedded Hypersonic SQL database for its datastore. As part of the application deploy process, the Java EE container will instantiate a Hypersonic database and preload it with the contents of the import.sql file. (The import.sql file contains just three sample records at the moment: one each for a text file, a PDF file, and a graphic file.)

All of the data for a repository object is contained in a single table record. Hibernate manages the process for us of reading that record out of the database and creating the three corresponding Java objects: Item, Datastream and Description. (Hibernate could also handle the process of updating the underlying table record if we were to change a value in one of the Java objects.) The mapping of table column to Java object field is handled by the @Column(name="xx") annotations in the entity beans.

For Datastream, what is stored in the database is not the datastream content itself but rather a filename that points to the location of the datastream file. The file path in this field can either be absolute (meaning a complete path starting from the root directory of the filesystem) or a relative path. In the case of the latter, the path is relative to the deployed application’s WAR directory (something like “…/jboss-4.0.5.GA/server/default/deploy/drc.ear/drc.war/” for instance). Note that the getter/setter methods for the contentLocation are private — the rest of the application does not need to know the location of the datastreams; this will also be true when the DRC application is connected to a FEDORA digital object repository. The method marked public instead is getContent, and the implementation of getContent hides the complexity of the fact that the datastream is coming from a disk file rather than a FEDORA repository call. For the three records/repository-objects currently defined in ‘import.sql’ there are three corresponding demo datastreams in the test-datastreams directory.
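
The same pattern for the Datastream side might look like the following sketch; the column name, the WAR-directory lookup, and the method bodies are assumptions made for illustration rather than the DRC’s actual code.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.persistence.Column;
import javax.persistence.Embeddable;

/** Sketch of the Datastream embeddable; field and column names are illustrative. */
@Embeddable
public class Datastream {

    /** Absolute path, or a path relative to the deployed WAR directory. */
    @Column(name = "content_location")
    private String contentLocation;

    // The location stays private: the rest of the application never learns
    // where the bytes actually live.
    private String getContentLocation() {
        return contentLocation;
    }

    /**
     * Public accessor that hides whether the content comes from a local file
     * (the mock repository) or, eventually, from a FEDORA repository call.
     */
    public InputStream getContent() throws IOException {
        File f = new File(getContentLocation());
        if (!f.isAbsolute()) {
            f = new File(warDirectory(), getContentLocation());
        }
        return new FileInputStream(f);
    }

    private File warDirectory() {
        // Placeholder: resolving the deployed drc.war directory is
        // environment-specific (e.g., from the servlet context).
        return new File(".");
    }
}
```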

In all likelihood, this representation of the FEDORA repository will be too simple for us to move forward much further. In particular, the current notion of one datastream per repository object is too simplistic. The Datastream embedded object will likely need to be broken out into a separate table with a corresponding distinct Java entity class. (We may reach the same point soon for the Description object as well.)

By using the Entity Beans as a buffer between the business logic and the view components of the rest of the application, I hope we can minimize/localize the changes required in the future in order to replace the mock repository with a real underlying FEDORA repository.

View Templates

The preferred view technology for JBoss Seam is Facelets, an implementation of Java Server Faces that does not require the use of Java Server Pages (JSP). Although the ‘.xhtml’ pages in the view directory bear a passing resemblance to JSP, behind the scenes they are radically different. Of note for us is the clean templating system used to generate pages. The home.xhtml file has a reference to the template.xhtml file in the ‘layout’ directory. If you read through the template.xhtml file, you can see where the Facelets engine will pull in other .xhtml files in addition to the content within the <ui:define name="body"> tag of home.xhtml.

Content Handlers

The paradigm of handling different media types within the DRC application is guided in large part by the notion of disseminators for FEDORA objects and the Digital Library Federation Aquifer Asset Actions experiments. The underlying concept is to push the media-specific content handling into the digital object repository and to have the presentation interface consume those content handlers as it is preparing the end-user presentation.

For instance, the DRC will need to handle content models for PDFs, images, video, and so forth. Furthermore, how a video datastream from the Digital Video Collection is offered to the user may be different from how a video datastream from a thesis is offered to the user. Rather than embedding the complexity of making those interface decisions into the front-end DRC application, this model of content handlers pushes that complexity closer to the objects themselves by encoding those behaviors as disseminators of the object. What the presentation layer gets from the object is a chunk of XHTML that it inserts into the dynamically generated HTML page at the right place.

There is work beginning on a framework for FEDORA disseminators at /BaseDisseminator/trunk in the source code repository; that work has been put on hold at the moment in favor of focusing on the presentation interface. In order to prepare for the time when the presentation behaviors are encoded as FEDORA object disseminators, the current presentation layer makes use of Content Handlers for each of the media types. The Handler interface defines the methods required by each handler and the TextHandler class, the ImageHandler class, and the PdfHandler class implement the methods for the three media types already defined.

Of these, the TextHandler class is the most complete, so I’ll use it as an example; a rough sketch of the Handler interface follows the list below.

  • The getRawDatastream method takes the datastream and sends it back to the browser with the HTTP headers that cause a File-Save dialog box to open.
  • The getFullDisplay method returns a chunk of XHTML that presents the full metadata in a manner that can be included in a full metadata display screen.
  • The getRecordDisplay method (currently unwritten) returns a chunk of XHTML used to represent the object in a list of records that resulted from a user’s search or browse request.
  • The getThumbnail method (currently unwritten) returns a static graphic thumbnail rendition of the datastream (e.g. a cover page, a key video frame, etc.).
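
As mentioned above, here is a rough sketch of the Handler interface. The four method names come from the tour; the parameter and return types are guesses made for illustration, and Item refers to the entity bean sketched earlier.

```java
import javax.servlet.http.HttpServletResponse;

/**
 * Sketch of the Handler interface.  The method names come from the tour;
 * the parameter and return types are guesses for illustration, and Item
 * refers to the entity bean sketched earlier.
 */
public interface Handler {

    /** Stream the raw datastream to the browser with "save as" HTTP headers. */
    void getRawDatastream(Item item, HttpServletResponse response);

    /** XHTML fragment presenting the full metadata for a single-record display. */
    String getFullDisplay(Item item);

    /** XHTML fragment representing the object in a search or browse result list. */
    String getRecordDisplay(Item item);

    /** Static thumbnail rendition (cover page, key video frame, and so on). */
    String getThumbnail(Item item);
}
```

TextHandler, ImageHandler, and PdfHandler would then each implement these four methods for their respective media types.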

By making these content handlers distinct classes, it is anticipated that the rendering code for each of these methods can be more easily moved to FEDORA object disseminators with minimal impact to the surrounding DRC interface application.

Exception Handling

The DRC application follows the practice suggested by Barry Ruzek in Effective Java Exceptions (found via this link on The Server Side). The article can be summarized as:

One type of exception is a contingency, which means that a process was executed that cannot succeed because of a known problem (the example he uses is that of a checking account, where the account has insufficient funds, or a check has a stop payment issued.) These problems should be handled by way of a distinct mechanism, and the code should expect to manage them.

The other type of exception is a fault, such as the IOException. A fault is typically not something that is or should be expected, and therefore handling faults should probably not be part of a normal process.

With these two classes of exception in mind, it’s easy to see what should be checked and what should be unchecked: the contingencies should be checked (and descend from Exception) and the faults should be unchecked (and descend from RuntimeException).

All unchecked exceptions generated by the application are subclasses of DrcBaseAppException. (DrcBaseAppException itself is a subclass of RuntimeException.) For an example, see NoHandlerException. By setting up all of the application’s exceptions to derive from this point, we have one place where logging of troubleshooting information can take place (although this part of the application has not been set up yet). Except when there is good reason to do otherwise, this pattern should be maintained.
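
A sketch of that hierarchy might look like the following; only the class names come from the tour, and the constructors and messages are invented.

```java
/**
 * Sketch of the unchecked (fault) exception hierarchy described above.
 * Constructor shapes and messages are illustrative; only the class names
 * come from the tour.
 */
public class DrcBaseAppException extends RuntimeException {

    public DrcBaseAppException(String message) {
        super(message);
    }

    public DrcBaseAppException(String message, Throwable cause) {
        super(message, cause);
        // A single base class gives one place to hook in logging of
        // troubleshooting information once that part of the application exists.
    }
}

/** Thrown when no content handler is available for an item's media type. */
class NoHandlerException extends DrcBaseAppException {

    public NoHandlerException(String mimeType) {
        super("No content handler available for media type: " + mimeType);
    }
}
```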

At this point, no checked (or contingency) exceptions specific to the DRC have been defined. When they are needed, though, they will follow the same basic structure with a base exception derived from Exception.


Looking Forward to Version 2.2 of FEDORA

Sandy Payette, Co-Director of the Fedora Project and Researcher in the Cornell Information Science department, announced a tentative date for the release 2.2 of the FEDORA digital object repository.

The Fedora development team would like to announce that Fedora 2.2 will be released on Friday, January 19, 2007.

This new release will contain many significant new features and enhancements, including [numbers added to the original for the sake of subsequent commentary]:

  1. Fedora repository is now a web application (.war) that can be installed in any container
  2. Fedora authentication has been refactored to use servlet filters (no longer Tomcat realms)
  3. A new Fedora installer makes it easy to get started with Fedora (with both “quick” and “custom” install options)
  4. GSearch service (backed by Lucene or Zebra) – flexible, configurable, indexes any datastream
  5. Journaling service to create a backup/mirror repository
  6. New checksum features for datastreams
  7. Support for Postgres database configuration
  8. Standard system logging with Log4J
  9. Over 40 bug fixes
  10. Many other enhancements

Be on the lookout for the release announcement in the new year! Also, there will be opportunities to talk with the Fedora development team at Open Repositories 2007.

This is great news and a major step forward for the project. Here are some reasons why I think this is true.

1. Fedora repository is now a web application (.war)

To this point, the FEDORA repository application distribution has been pre-bundled inside a Tomcat Java servlet container. The binding has been pretty tight with certain dependencies written into the Tomcat configuration itself. That made it very difficult to install FEDORA into an organization’s existing servlet container (be it another installation of Tomcat or Jetty/JBoss/Glassfish, etc.). Even more problematic, there were reports of problems trying to get JSP-based applications to work inside the FEDORA-supplied container (we ran into this ourselves) meaning that organizations wanting to run both FEDORA and another servlet-based application needed to run two servlet containers; pretty inefficient. (OhioLINK was in this position in its early implementations of the Ohio DRC project.)

With release 2.2, the core developers have effectively turned the software distribution inside out. The primary output of the new build process is a standard Web ARchive (or WAR) file that can be put inside any servlet container. The new installation program (see #3 below) comes with a Tomcat distribution, should a new installation need it, but it is no longer required. There have been reports that the new WAR-based distribution works inside the Jetty servlet container; we’re hoping it will work in the JBoss Application Server as well (since that is what we’re using to build our next generation interface).

2. Fedora authentication has been refactored to use servlet filters

I’m not quite sure what this means, but I have hopes that it will make integration with Shibboleth easier. Can anyone else see the path between FEDORA and Shibboleth and comment on it?

3. A new Fedora installer makes it easy to get started with Fedora

From the start, FEDORA required a Java servlet container in order to run. To make the installation job easier for those not familiar with Java servlet containers, the FEDORA installation process did everything for you. Now that the relationship between the FEDORA application and the servlet container has been flipped around (see #1 above), the core developers have devised an easy-to-use installation application that mimics the simplicity of the previous installation style while allowing others to make use of FEDORA as an integrated application within an existing servlet container.

4. GSearch service

The original FEDORA search service, the appropriately-named “basic search,” indexes only the Dublin Core (DC) datastream of each object. As has been mentioned on the Fedora-Users mailing list several times, the DC datastream is really meant as an administrative metadata datastream and not necessarily the full description of the object; that full description can be stored in other datastreams of a FEDORA object. Not only did basic search not index these other descriptive metadata streams, but it also wouldn’t index the full text of PDF, text, and other indexable datastreams.

GSearch — where “G” stands for “General” but could equally well stand for “Gert” Schmeltz Pedersen, its lead developer from the Technical University of Denmark — does all of the above as a new component in the FEDORA Service Framework. We extend our gratitude to Gert and his colleagues for contributing their work to the general FEDORA distribution as well as to DEFF, Denmark’s Electronic Research Library, which funded the GSearch project.

5. Journaling service

Like a journaling file system or a journaling database, this capability allows one to capture all of the transactions applied to the repository and replay them against a secondary repository instance or to restore a repository from backup.

6. Datastream checksums

As part of its ingestion and maintenance functions, the FEDORA software can now calculate, store, and verify checksums of datastreams. This helps ensure the integrity of the repository content, or at least detect when something goes wrong.
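
For illustration only (this is not Fedora’s implementation), computing and verifying a checksum for a datastream with the standard java.security API looks roughly like this:

```java
import java.io.InputStream;
import java.security.MessageDigest;

/**
 * Illustration of the kind of datastream checksum a repository can compute
 * and verify; this is not Fedora's code, just the standard java.security
 * approach.
 */
public class ChecksumExample {

    /** Compute a hex-encoded digest (e.g. "MD5" or "SHA-1") of a datastream. */
    public static String checksum(InputStream content, String algorithm) throws Exception {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192];
        int read;
        while ((read = content.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Verification is just recomputing and comparing against the stored value. */
    public static boolean verify(InputStream content, String algorithm, String stored) throws Exception {
        return checksum(content, algorithm).equalsIgnoreCase(stored);
    }
}
```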

7. Support for PostgreSQL

In the battle over which relational database engine is best, FEDORA now supports most of the big ones out of the box: Oracle, MySQL, and now PostgreSQL. Here at OhioLINK, we’ve started with MySQL but are considering a migration to PostgreSQL as our in-house, preferred RDBMS, so the timing of this announcement is great.

8. Standard system logging with Log4J

Put this one in the category of “playing nicely with others.” We’ve already reaped the benefit of the refactored logging code in the client JAR file in a pre-release version of the code.

9 and 10. Bug fixes and many other enhancements

The core code is evolving along a nice trajectory. This is good to see for the health of the overall project!

Version 2.2 represents another monumental step towards the vision of a Flexible, Extensible Digital Object Repository Architecture. Congratulations to the core developers for what sounds like is going to be a great release.


Heads up! International Conference on Open Repositories (01/23/07 – 01/27/07, San Antonio, TX, US)

Open Repositories 2007 is coming up next year, and it looks to be an interesting meeting. The first day is open user group meetings for DSpace, Fedora, and Eprints, followed by general conference sessions that cover issues that cut across all of the open repository systems. This year, the user groups will partition their programs into Plenary, Technical Issues, and Management Issues and the partitions will be staggered so that IT managers can attend all plenary sessions, technical staff can attend all technical sessions, etc.

The call for participation for the general conference has gone out. Its Program Committee is seeking submissions in the form of an extended abstract of no more than 500 words by October 2, 2006. The contributions must be written in English and should be double spaced. The Program Committee will select relevant submissions. Selected speakers will receive an email by November 6, 2006 with guidelines for their presentation. Presentations will be limited to 20 minutes, plus 10 minutes for questions.

This looks to be a really good meeting. You can track it on HitchHikr at