CDL’s XTF as a Front End to Fedora



We're experimenting pretty heavily now with the California Digital Library's XTF framework as a front-end to a FEDORA object repository. Initial efforts look promising -- thanks go out to Brian Tingle and Kirk Hastings of CDL; Jeff Cousens, Steve DiDomenico, and Bill Parod from Northwestern; and Ross Wayland from UVa for helping us along in the right direction.

XTF into Eclipse How-To

As we get more serious about XTF, I wrote up a How-To document for bringing XTF into Eclipse so that it can be deployed as a dynamic web application. Let me know if you find it useful. Definitely let me know if you find it in error. We haven't put a version of XTF into OhioLINK's source code repository, but that might follow shortly.

Points of Integration

In its base configuration, XTF reads documents out of a "data" directory that is in the application's Tomcat context directory. It looks like two of the XTF components will need to be modified to successfully converse with a FEDORA-based object repository: DynaXML and textIndexer. Of the two, DynaXML seems to be the more straightforward.


First I went looking for where XTF's DynaXML reads documents and found the DocLocator interface with one implementation that looks into the file system. John Davison, one of the DRC programmers, figured out (with help from the CDL folks) that in fact it is possible to pass a FEDORA API-A URL to DefaultDocLocator and have it do the right thing. Its 'getInputSource()' method has this signature:

public InputSource getInputSource( String sourcePath,
     boolean removeDoctypeDecl ) throws IOException

...followed shortly by:

// If it's non-local, load the URL.
if( sourcePath.startsWith("http:") ||
    sourcePath.startsWith("https:") )
    return new InputSource( sourcePath );

Here "InputSource" is the input object handed to the SAX parser; its constructor will accept a URI as a parameter.
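To make the URL route concrete, here is a minimal sketch of building a FEDORA datastream URL and wrapping it in an InputSource. The class and method names are hypothetical and not part of XTF. The URL shape (base, then "/get/", then PID, then datastream ID) assumes FEDORA's API-A-Lite datastream dissemination syntax, and the host, PID, and datastream ID in the usage note are made up, so adjust them to match your repository.

```java
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of XTF): builds a FEDORA URL and hands it to SAX. */
class FedoraSourceSketch {

    /** Assumed API-A-Lite form: {fedoraBase}/get/{pid}/{dsId}. */
    static String datastreamUrl(String fedoraBase, String pid, String dsId) {
        // Drop a trailing slash so we never emit "//get/".
        String base = fedoraBase.endsWith("/")
                ? fedoraBase.substring(0, fedoraBase.length() - 1)
                : fedoraBase;
        return base + "/get/" + pid + "/" + dsId;
    }

    /** Mirrors the non-local branch of getInputSource() quoted above. */
    static InputSource toInputSource(String sourcePath) {
        if (sourcePath.startsWith("http:") || sourcePath.startsWith("https:")) {
            // InputSource(String) records the value as the system ID;
            // nothing is fetched until a parser actually reads the source.
            return new InputSource(sourcePath);
        }
        throw new IllegalArgumentException("Not a remote URL: " + sourcePath);
    }
}
```

Calling FedoraSourceSketch.datastreamUrl("http://localhost:8080/fedora", "demo:5", "DS1") and passing the result to toInputSource() gives an InputSource that a SAX parser can open directly, with no local file involved.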

Unfortunately, using DefaultDocLocator in this way negates the use of CDL's "Lazy Trees" (a binary version of each XML document containing all the original contents of the document, plus an index telling XTF where each element starts and ends). Lazy Trees are a good thing because they speed up parsing of the XML document and the resulting rendering to the user.

When dealing with local files (as opposed to the URL method described above), DefaultDocLocator will build a Lazy Tree in its index directory the first time the XML document is called up. A FEDORA interface for XTF's DynaXML needs to combine the two approaches: fetch the document by URL (or, in the case of FEDORA, by a PID plus an API-A call), then create and store its Lazy Tree in the XTF index directory for subsequent retrieval. This does seem pretty straightforward, does it not?
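The fetch-once-then-reuse idea can be sketched roughly as follows. This is not XTF code and it does not build a real Lazy Tree: it only caches the raw bytes for a PID in a local directory, so that later requests are served from disk. The class name, the injected fetcher function, and the file-naming scheme are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

/** Hypothetical fetch-once cache: a stand-in for "build the Lazy Tree on first access". */
class FetchOnceCacheSketch {
    private final Path cacheDir;
    private final Function<String, byte[]> fetcher; // e.g. an API-A call keyed by PID

    FetchOnceCacheSketch(Path cacheDir, Function<String, byte[]> fetcher) {
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
    }

    /** Returns the local copy for a PID, fetching it only if it is not cached yet. */
    Path get(String pid) {
        // PIDs contain ':', which is not a safe file-name character everywhere.
        Path cached = cacheDir.resolve(pid.replace(':', '_') + ".xml");
        if (!Files.exists(cached)) {
            try {
                Files.createDirectories(cacheDir);
                Files.write(cached, fetcher.apply(pid));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return cached;
    }
}
```

In a real integration the cached artifact would be the Lazy Tree that XTF itself builds, and the cache would also need some way to notice when the object changes in FEDORA. The sketch ignores both concerns.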


XTF's textIndexer, on the other hand, really wants the XML it is indexing to be files on the local hard drive. The XTF programming guide speaks of a textIndexer Document Selector whose job it is to create a single XML file with the specifications of which documents to index and how to do it:

It is the responsibility of the Document Selector XSLT code to output an XML fragment that identifies which of the files in the directory should be indexed. This output XML fragment should take the following form:

    <indexfile fileName      = "FileName"
               {format       = "FileFormatID"}
               {preFilter    = "PreFilterPath"}
               {displayStyle = "DocumentFormatterPath"}>

Now the trick seems to be to build an alternate Document Selector that will not use filenames but rather URIs to build the index. That'll be the subject of the next round of investigations.

Comments and observations are welcome!

The text was modified to update a link from to on January 28th, 2011.
