Analysis of CDL’s XTF textIndexer to Replace the Local Files with FEDORA Objects


This article was imported from this blog's previous content management system (WordPress), and may have errors in formatting and functionality. If you find these errors are a significant barrier to understanding the article, please let me know.

This is a continuation of the earlier investigation into integrating the California Digital Library's XTF software into the FEDORA digital object repository. This analysis looks at the textIndexer module in particular, starting with an overview of how textIndexer works now with filesystem-based objects and ending with an outline of how it could work reading objects from a FEDORA repository instead.

XTF's Native File System Handler

Natively, XTF wants to read content out of the file system. The core of the processing is done in these two class files:

TextIndexer.java

The main() driver for ingesting content into the index. It reads command-line arguments (cfgInfo.readCmdLine( args, startArg );) to determine the various parameters, one of which is the top of the document source directory (String srcRootDir = Path.resolveRelOrAbs( xtfHomeFile, cfgInfo.indexInfo.sourcePath );). Assuming all goes well, it calls methods to open the Lucene index for writing, process the files in the source directory, and close the Lucene index:

srcTreeProcessor.open( cfgInfo );
srcTreeProcessor.processDir( new File(srcRootDir), 0 );
srcTreeProcessor.close();

SrcTreeProcessor.java

processDir() is called recursively on the directory structure to process the files in each directory. For each directory, a docBuf XML-as-a-string buffer is built, consisting of an element for every directory entry. docBuf is fed into the SAXON processor along with the docSelector XSLT stylesheet. The resulting XML is read node by node, looking for file entries that have an "indexFile" tag. For each matching node, processFile() is called to index the entry.

processFile() runs the prefilter XSLT against the file content, builds the Lazy Tree (if possible and requested), creates the IndexSource version by running the source document through the appropriate file-type "*IndexSource" class (e.g. PDFIndexSource(), XMLIndexSource(), and MARCIndexSource()), and queues the content for indexing by the Lucene indexer.

Requirements for an Object Handler for textIndexer

Based on this analysis, if one were to replace the TextIndexer.java and SrcTreeProcessor.java "front end" of textIndexer, I think these would be the pieces that would be required. (Note that some steps are skipped in this overview -- any replacement of these two classes would need to be sure to do everything that those classes do now.) A rough Java skeleton of this flow follows the list.

  1. Parse command line and configuration file parameters to create an IndexerConfig instance (guiding parameters for the indexer as a whole) and an IndexInfo instance (parameters specific to the identified index-name).
  2. Specify a collection of objects that you want in index-name.
  3. Open up a writable instance of the index-name's Lucene index (a la srcTreeProcessor.open( cfgInfo );)
  4. For each object to be put into index-name, do these things:
    1. Optionally, run the source object through a prefilter (an XSLT transformation used to restructure the source document just prior to indexing without changing the stored source document).
    2. Optionally, remove a DOCTYPE declaration in the source object before it is indexed.
    3. Set up a transformation object from the native file format to something that is XML and call textProcessor.checkAndQueueText() to add it to a queue to be processed.
  5. Close index-name's Lucene index (a la srcTreeProcessor.close();), which should have the side effect of processing the queued text (a la textProcessor.processQueuedTexts();) which will ultimately create the Lazy Tree (if specified) and add the object to the Lucene index.
  6. Optionally, compare the collection of objects that you want in index-name with what is actually in index-name before you started, and remove anything that wasn't in the specified collection.
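
Sketched as Java, the replacement front end might look something like this. It is only a skeleton: gatherObjects() and makeIndexSource() are hypothetical placeholders for steps 2 and 4, the exact signature of checkAndQueueText() is my guess from the description above, and error handling is omitted.

IndexerConfig cfgInfo = new IndexerConfig();
cfgInfo.readCmdLine( args, startArg );             // step 1: read parameters

List objects = gatherObjects( cfgInfo.indexInfo ); // step 2: decide what belongs
                                                   //   in index-name

srcTreeProcessor.open( cfgInfo );                  // step 3: open the Lucene index

for( Iterator i = objects.iterator(); i.hasNext(); ) {
    IndexSource src = makeIndexSource( i.next() ); // step 4: prefilter, DOCTYPE
                                                   //   removal, format conversion...
    textProcessor.checkAndQueueText( src );        //   ...then queue it
}

srcTreeProcessor.close();                          // step 5: processes the queued
                                                   //   texts, builds Lazy Trees,
                                                   //   and updates the index

// step 6 (optional): compare the gathered set against what is in the
// index and delete entries that are no longer in the collection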

Considering a FEDORA-based XTF handler

So, all-in-all, that doesn't seem too bad. Here is where we get to mix in some FEDORA pieces and see what we get in the end.

First off, in terms of dealing with "collections of source objects to be indexed," I think it would be best to start with one of our "collection aggregation" objects as the root level of a source collection. We'd perform an RDF "isMemberOf" query against the resource index using the FEDORA PID of the aggregation object (and optionally make "isMemberOf" queries recursively against the returned set -- as if one were drilling down a file system).
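
As a concrete sketch, such a member query could go against the repository's Resource Index search service (risearch). Everything here -- the server location, the aggregation PID, and the endpoint parameters and output format -- is an assumption for illustration:

import java.io.*;
import java.net.*;
import java.util.*;

// Sketch: list the members of an aggregation object via the Resource Index.
List collectMembers( String aggregationPid ) throws IOException {
    String itql = "select $member from <#ri> where $member "
        + "<info:fedora/fedora-system:def/relations-external#isMemberOf> "
        + "<info:fedora/" + aggregationPid + ">";
    URL query = new URL( "http://localhost:8080/fedora/risearch"
        + "?type=tuples&lang=itql&format=CSV"
        + "&query=" + URLEncoder.encode( itql, "UTF-8" ) );
    BufferedReader in = new BufferedReader(
        new InputStreamReader( query.openStream() ) );
    in.readLine();                      // skip the CSV header row
    List members = new ArrayList();
    String line;
    while( (line = in.readLine()) != null )
        members.add( line.trim() );     // e.g. info:fedora/demo:someObject
    in.close();
    return members;                     // recurse on each member to drill down
}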

Secondly, to get the XML content to be indexed, each object would have a getXML disseminator (see Thinking about Our FEDORA Disseminators for background) that would render an XML version of itself for XTF. If the source object is an XML-based object, it just returns the XML. If the source object is a PDF or Word document or something else that can be rendered into a text-like form, the disseminator would handle that. If the source object is an image or audio clip, the disseminator can return the descriptive XML of the object. The point is that by the time the object gets to XTF's textIndexer, it has already been rendered to XML, so only the XML transformation tool would be needed (as in this snippet from SrcTreeProcessor.java):

IndexSource srcFile = null;
if( format.equalsIgnoreCase("XML") ) {
    InputSource finalSrc = new InputSource( systemId );
    srcFile = new XMLIndexSource( finalSrc, srcPath, key,
                             preFilters, displayStyle, lazyStore );
    if( removeDoctypeDecl )
        ((XMLIndexSource)srcFile).removeDoctypeDecl( true );
}
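
In the FEDORA case, the systemId handed to the InputSource could simply be the object's getXML dissemination URL instead of a local file path. A sketch, assuming FEDORA's API-A-Lite URL pattern; the PID and the bDef/method names are illustrative, and the remaining constructor arguments would be supplied as in the snippet above:

// Hypothetical: point the SAX InputSource at the object's getXML
// dissemination rather than at a file on disk.
String pid = "demo:someObject";                   // hypothetical PID
String systemId = "http://localhost:8080/fedora/get/"
    + pid + "/demo:bDefXML/getXML";               // API-A-Lite dissemination URL
InputSource finalSrc = new InputSource( systemId );
IndexSource srcFile = new XMLIndexSource( finalSrc, srcPath, key,
                                          preFilters, displayStyle, lazyStore );
if( removeDoctypeDecl )
    ((XMLIndexSource)srcFile).removeDoctypeDecl( true );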

Third, we'd need a FEDORA-aware driver that replaces TextIndexer.java and SrcTreeProcessor.java. Given a configuration file location and a starting PID, it would gather the objects to be indexed, "open" the Lucene index, run through the snippet of Java above for each object, and "close" the Lucene index, much as in the skeleton sketched earlier.

The quick-and-dirty first implementation would copy the XML source to a directory on the hard drive (with directory and subdirectory names taken from the PID of the aggregation object containing the collection of objects) and have XTF use that local filesystem copy as the indexed source. Lazy Tree files for each object would also be created and stored locally. This means we have two copies (three, if you count the Lazy Tree) of the object lying around, so eventually I think we'd want to modify XTF to pull content directly from FEDORA using a REST-based URL. Eventually we may also want to store the Lazy Tree in something other than the local file system. Could that be another datastream in the FEDORA object?
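
For that quick-and-dirty version, the mirroring step itself could be as small as the following sketch. The names (srcRootDir, aggregationPid, objectPid, disseminationUrl) are placeholders for values produced by the surrounding driver, and exception handling and PID-to-filename escaping are glossed over:

// Hypothetical: copy one object's getXML dissemination into the local
// source tree so stock XTF can index it from the file system.
File dir = new File( srcRootDir, aggregationPid );
dir.mkdirs();
InputStream in = new URL( disseminationUrl ).openStream();
OutputStream out = new FileOutputStream( new File( dir, objectPid + ".xml" ) );
byte[] buf = new byte[4096];
int len;
while( (len = in.read( buf )) > 0 )
    out.write( buf, 0, len );
out.close();
in.close();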