
Analysis of CDL’s XTF textIndexer to Replace the Local Files with FEDORA Objects


This is a continuation of the investigation, started earlier, into integrating the California Digital Library’s XTF software with the FEDORA digital object repository. This analysis looks at the textIndexer module in particular, starting with an overview of how textIndexer works now with filesystem-based objects and ending with an outline of how it could work when reading objects from a FEDORA repository instead.

XTF’s Native File System handler

Natively, XTF wants to read content out of the file system. The core of the processing is done in these two class files:

TextIndexer.java

The main() driver for ingesting content into the index. It reads command-line arguments (cfgInfo.readCmdLine( args, startArg );) to determine the various parameters, one of which is the top of the document source directory (String srcRootDir = Path.resolveRelOrAbs( xtfHomeFile, cfgInfo.indexInfo.sourcePath );). Assuming all goes well, it calls methods to open the Lucene index for writing, process the files in the source directory, and close the Lucene index:
[java]
srcTreeProcessor.open( cfgInfo );
srcTreeProcessor.processDir( new File(srcRootDir), 0 );
srcTreeProcessor.close();
[/java]

SrcTreeProcessor.java

processDir() is called recursively on the directory structure to process the files in each directory. For each directory, a docBuf XML-as-a-string buffer is built, consisting of one element for every directory entry. docBuf is fed into the SAXON processor along with the docSelector XSLT stylesheet. The resulting XML is read node-by-node looking for file entries that have an “indexFile” tag. For each matching node, it calls processFile() to index that entry.
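To make the "XML-as-a-string buffer" idea concrete, here is a minimal sketch of building such a listing for one directory. It is not XTF's code: the class name DirListingSketch and the element and attribute names (directory, subdirectory, file, dirPath, name) are placeholders of mine, and XTF's actual docBuf vocabulary may well differ.

```java
import java.io.File;
import java.util.Arrays;

/** Hypothetical sketch: build an XML-as-a-string listing for one directory. */
class DirListingSketch {

    /** Escape the characters that are unsafe inside an XML attribute value. */
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Emit one element per directory entry. The element and attribute
     * names are illustrative only, not XTF's real docBuf format.
     */
    static String buildDocBuf(File dir) {
        StringBuilder docBuf = new StringBuilder();
        docBuf.append("<directory dirPath=\"")
              .append(escapeAttr(dir.getPath()))
              .append("\">\n");

        File[] entries = dir.listFiles(); // null if dir is missing or unreadable
        if (entries != null) {
            Arrays.sort(entries); // predictable order across platforms
            for (File entry : entries) {
                String tag = entry.isDirectory() ? "subdirectory" : "file";
                docBuf.append("  <").append(tag)
                      .append(" name=\"").append(escapeAttr(entry.getName()))
                      .append("\"/>\n");
            }
        }
        docBuf.append("</directory>\n");
        return docBuf.toString();
    }
}
```

A stylesheet could then decide, entry by entry, which of these elements should be indexed, which is the role the docSelector stylesheet plays in the description above.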

processFile() will run the prefilter XSLT against the file content, build the Lazy Tree (if possible and requested), create the IndexSource version by running the source document through the appropriate file type “*IndexSource” method (e.g. PDFIndexSource(), XMLIndexSource(), and MARCIndexSource()), and queue the content for indexing by the Lucene indexer.

Requirements for an Object Handler for textIndexer


Based on this analysis, if one were to replace the TextIndexer.java and SrcTreeProcessor.java “front end” of textIndexer, I think these are the pieces that would be required. (Note that some steps are skipped in this overview; any replacement of these two classes would need to be sure to do everything that those classes do now.)

  1. Parse command line and configuration file parameters to create an IndexerConfig instance (guiding parameters for the indexer as a whole) and an IndexInfo instance (parameters specific to the identified index-name).
  2. Specify a collection of objects that you want in index-name.
  3. Open up a writable instance of the index-name’s Lucene index (a la srcTreeProcessor.open( cfgInfo );)
  4. For each object to be put into index-name, do these things:
    1. Optionally, run the source object through a prefilter (an XSLT transformation used to restructure the source document just prior to indexing without changing the stored source document).
    2. Optionally, remove a DOCTYPE declaration in the source object before it is indexed.
    3. Set up a transformation object from the native file format to something that is XML and call textProcessor.checkAndQueueText() to add it to a queue to be processed.
  5. Close index-name’s Lucene index (a la srcTreeProcessor.close();), which should have the side effect of processing the queued text (a la textProcessor.processQueuedTexts();) which will ultimately create the Lazy Tree (if specified) and add the object to the Lucene index.
  6. Optionally, compare the collection of objects that you want in index-name with what is actually in index-name before you started, and remove anything that wasn’t in the specified collection.
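The control flow of steps 3 through 6 can be sketched without any XTF classes at all. In the sketch below, IndexSink, InMemorySink, and IndexDriverSketch are hypothetical names of mine, standing in for the real Lucene-backed machinery; the only thing the sketch tries to capture is the ordering (open, queue, optionally prune, close) and the fact that queued text is not actually indexed until the close.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for one index-name's index. Not an XTF type. */
interface IndexSink {
    void open();                          // step 3
    Set<String> existingKeys();           // what the index holds right now
    void queue(String key, String xml);   // step 4.3
    void remove(String key);              // step 6
    void close();                         // step 5: flushes the queue
}

/** Toy in-memory sink so the control flow can be exercised. */
class InMemorySink implements IndexSink {
    final Map<String, String> indexed = new LinkedHashMap<>();
    private final Map<String, String> queued = new LinkedHashMap<>();
    boolean isOpen = false;

    public void open() { isOpen = true; }
    public Set<String> existingKeys() { return new LinkedHashSet<>(indexed.keySet()); }
    public void queue(String key, String xml) { queued.put(key, xml); }
    public void remove(String key) { indexed.remove(key); }
    public void close() {
        indexed.putAll(queued); // queued text is only indexed at close
        queued.clear();
        isOpen = false;
    }
}

class IndexDriverSketch {
    /**
     * Runs steps 3 through 6 over a map of object key -> XML.
     * Returns the keys that were pruned from the index.
     */
    static List<String> run(Map<String, String> wanted, IndexSink sink, boolean prune) {
        List<String> pruned = new ArrayList<>();
        sink.open();
        try {
            // Snapshot the index contents before queueing anything new.
            Set<String> before = sink.existingKeys();
            for (Map.Entry<String, String> obj : wanted.entrySet()) {
                sink.queue(obj.getKey(), obj.getValue());
            }
            if (prune) {
                // Step 6: drop anything that is not in the wanted collection.
                for (String key : before) {
                    if (!wanted.containsKey(key)) {
                        sink.remove(key);
                        pruned.add(key);
                    }
                }
            }
        } finally {
            sink.close(); // always close, which also flushes the queue
        }
        return pruned;
    }
}
```

Prefiltering and DOCTYPE removal (steps 4.1 and 4.2) are left out here; in a real driver they would happen per object just before the queue call.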

Considering a FEDORA-based XTF handler


So, all-in-all, that doesn’t seem too bad. Here is where we get to mix in some FEDORA pieces and see what we get in the end.

First off, in terms of dealing with “collections of source objects to be indexed” I think it would be best to have this start with one of our “collection aggregation” objects as the root level of a source collection. We’d perform an RDF “isMemberOf” query against the resource index using the FEDORA PID of the aggregation object (and optionally make an “isMemberOf” query recursively against the returned set, as if one were drilling down a file system).
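The "drilling down" could look something like the sketch below. The actual resource index query is deliberately abstracted away: membersOf is a hypothetical lookup function that, given a PID, returns the PIDs of the objects asserting “isMemberOf” that PID. How that lookup talks to FEDORA is not shown and would need to be filled in. The class and method names are also mine.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class MembershipWalkSketch {
    /**
     * Collect the PIDs that are members of rootPid. When recursive is true,
     * each member is itself queried for members, like descending into
     * subdirectories. membersOf is a hypothetical lookup, e.g. one backed
     * by an "isMemberOf" query against the resource index.
     */
    static Set<String> collectMembers(String rootPid,
                                      Function<String, List<String>> membersOf,
                                      boolean recursive) {
        Set<String> found = new LinkedHashSet<>();
        Set<String> seen = new LinkedHashSet<>(); // guards against membership cycles
        Deque<String> toVisit = new ArrayDeque<>();
        seen.add(rootPid);
        toVisit.push(rootPid);

        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            for (String member : membersOf.apply(current)) {
                if (seen.add(member)) {        // true only the first time we see it
                    found.add(member);
                    if (recursive) {
                        toVisit.push(member);  // query this member's members later
                    }
                }
            }
        }
        return found;
    }
}
```

Unlike a file system, nothing stops two aggregation objects from claiming membership in each other, which is why the sketch keeps a set of PIDs it has already seen.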

Secondly, to get the XML content to be indexed, each object would have a getXML disseminator (see Thinking about Our FEDORA Disseminators for background) that would render to XTF an XML version of itself. If the source object is an XML-based object, it just returns the XML. If the source object is a PDF or Word document or something that can be rendered into a text-like form, the disseminator would handle that. If the source object is an image or audio clip, the disseminator can return the descriptive XML of the object. The point, though, is that by the time the object gets to XTF’s textIndexer, it has already been rendered to XML, so just the XML transformation tool would be needed (as in this snippet from SrcTreeProcessor.java):
[java]
// Wrap the already-XML source so the text processor can index it.
IndexSource srcFile = null;
if( format.equalsIgnoreCase("XML") ) {
    InputSource finalSrc = new InputSource( systemId );
    srcFile = new XMLIndexSource( finalSrc, srcPath, key,
                                  preFilters, displayStyle, lazyStore );
    // Optionally strip the DOCTYPE declaration before indexing.
    if( removeDoctypeDecl )
        ((XMLIndexSource)srcFile).removeDoctypeDecl( true );
}
[/java]

Third, we would need a FEDORA-aware driver that replaces TextIndexer.java and SrcTreeProcessor.java. Given a configuration file location and a starting PID, it would gather the objects to be indexed, “open” the Lucene index, run through the snippet of Java above for each object, and “close” the Lucene index.

The quick-and-dirty first implementation would copy the XML source to a directory on the hard drive (directory and subdirectory names would be the PID of the aggregation object containing the collection of objects), and have XTF use that local filesystem copy as the indexed source. Lazy Tree files for each object would also be created and stored locally. This means we have two copies (three, if you count the Lazy Tree) of the object lying around, so eventually I think we’d want to modify XTF to pull content directly from FEDORA using a REST-based URL. Eventually I think we may also want to store the Lazy Tree in something other than the local file system. Could that be another datastream in the FEDORA object?
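As a rough sketch of the quick-and-dirty layout, the code below maps a chain of aggregation PIDs to nested directories and writes an object's XML there. Everything named here (LocalCopySketch, pidToName, dirFor, writeCopy, the ".xml" suffix) is my own invention for illustration. It also assumes PIDs of the usual “namespace:id” form and swaps the colon for an underscore, since a colon is not a safe file name character on every platform.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class LocalCopySketch {
    /**
     * Make a PID usable as a directory or file name. Assumption: replacing
     * ':' with '_' is good enough here; it is not collision-proof if a PID
     * can itself contain an underscore in the same position.
     */
    static String pidToName(String pid) {
        return pid.replace(':', '_');
    }

    /** Directory for a chain of aggregation PIDs, outermost first. */
    static Path dirFor(Path sourceRoot, List<String> aggregationPids) {
        Path dir = sourceRoot;
        for (String pid : aggregationPids) {
            dir = dir.resolve(pidToName(pid));
        }
        return dir;
    }

    /** Write one object's XML under its aggregation directory. */
    static Path writeCopy(Path sourceRoot, List<String> aggregationPids,
                          String objectPid, String xml) throws IOException {
        Path dir = dirFor(sourceRoot, aggregationPids);
        Files.createDirectories(dir); // no-op if the directory already exists
        Path file = dir.resolve(pidToName(objectPid) + ".xml");
        Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
        return file;
    }
}
```

With a layout like this, the existing SrcTreeProcessor could walk the copied tree unchanged, which is the whole appeal of the quick-and-dirty approach.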

(This post was updated on 22-Aug-2006.)

3 Comments

  1. Martin Haye | August 22, 2006 at 5:52 pm | Permalink

    One clarification before I get to general comments: processFile() forms the path for the lazy tree, but the file isn’t actually created until the queued documents are processed.

  2. Martin Haye | August 22, 2006 at 6:04 pm | Permalink

    I agree with your strategy. However you get a set of documents from Fedora (I’m not that familiar with it), you basically want to replicate the functionality of SrcTreeProcessor, which does the work of wrapping the input sources and passing them to XMLTextProcessor for the heavy lifting.

    One thought on lazy files: if you have some mechanism for random byte-level access to a data stream in the Fedora object, you could supply that to XTF through the StructuredStore interface. The interface was designed with this in mind, though we haven’t ever used it for that.

    What you call a “disseminator” is similar to subclasses of IndexSource, such as XMLIndexSource. There is also a PDFIndexSource and HTMLIndexSource that you might be interested in. If you end up creating others for Word or other formats, perhaps you’d consider contributing them back into XTF.

  3. the jester | August 23, 2006 at 9:57 am | Permalink

    “One thought on lazy files: if you have some mechanism for random byte-level access to a data stream in the Fedora object, you could supply that to XTF through the StructuredStore interface. The interface was designed with this in mind, though we haven’t ever used it for that.”

    Ah, that is a bit of a problem — at present, access to objects in FEDORA requires a Web Services call and the calls to retrieve data (through a “disseminator” or directly a datastream) do not support a byte-range request. (There has been some talk about this, so it might get done some day.)

    “What you call a ‘disseminator’ is similar to subclasses of IndexSource, such as XMLIndexSource. There is also a PDFIndexSource and HTMLIndexSource that you might be interested in. If you end up creating others for Word or other formats, perhaps you’d consider contributing them back into XTF.”

    Exactly — the point of distinction is where the transformation to XML takes place…as a subclass of IndexSource in XTF or as a function of the repository (the disseminator). In either case, the transformation is happening in a piece of Java code. My inclination, probably borne out by just raw familiarity at the moment, is to do the transformation as a FEDORA disseminator, but the code would certainly be used to write a new IndexSource class. I’ll let you know if we come up with anything new that might be useful to you.

    Thank you for your comments! They have been most helpful.

1 Trackback

  1. SourceForge.net: xtf-user | September 13, 2006 at 2:48 pm | Permalink

    [...] To the FEDORA and XTF communities: At OhioLINK, we’re aggressively pursuing the California Digital Library’s XTF software as a front end for our digital collections in a FEDORA content repository. I’ve written up some observations about what XTF integrated with FEDORA might look like and would welcome your comments and observations. We’d particularly like to know if anyone else is pursuing a similar path. The URLs are: http://dltj.org/2006/08/xtf-fedora-1/ http://dltj.org/2006/08/xtf-fedora-2/ Public comments (in the form of responses on the blog) or private ones (e-mail replies) would be most appreciated. Martin Haye, one of the lead developers of XTF, has been kind enough to offer some replies already and so far this seems like a viable solution. Peter [...]


From the Disruptive Library Technology Jester (http://dltj.org/), printed on Friday the 14th of November 2008 at 5:43:24 PM EST (-0500). The URL to this page is http://dltj.org/article/xtf-fedora-2/

[Creative Commons Logo] This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.