XTF and FEDORA — Comments from the Community
Some questions and observations on the analysis of the XTF/FEDORA integration have come in through channels other than blog comments. I've reproduced them here for the sake of completeness, but be sure to go back to the first two entries in this series to read the comments there as well.
Indiana University's Observations
As it turns out, Indiana University is considering much the same path. They have an existing FEDORA-based repository and a number of XTF projects that have been in development for a while. They, too, are looking to put these two technologies together, and a page on their project website (Digital Repository Architecture > Search) records IU's observations of an XTF plus FEDORA (plus more!) combination.
Incremental Indexing
How will the XTF processor know when new documents are added? Or modified or deleted? From your description it sounds like only manual bulk-indexing of a collection would be supported. This will make it inefficient to update the index if, for instance, only one document out of a collection is changed or deleted.
A good observation. XTF does this now by querying the index at the start to see which documents it knows about (along with when they were last updated), then scanning the source directory tree for new and revised objects and comparing them against that index. While scanning, it marks each indexed document that it encounters, and at the end it deletes the index entries for documents it did not find.
We'll probably need to mimic the same kind of functionality with a FEDORA interface. Since FEDORA doesn't provide hooks for firing off messages or processes when items are added to, modified in, or deleted from the repository, indexing will have to be done as this kind of bulk update.
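To make the compare-and-sweep idea concrete, here is a minimal sketch in Python. It is not XTF or FEDORA code: the function name, the identifiers, and the dates are all made up for illustration, and it assumes we can somehow get two listings, one from the index and one from the repository, each mapping a document identifier to its last-modified time.

```python
from datetime import datetime


def plan_index_update(indexed, repository):
    """Compare what the index knows with what the repository holds.

    Both arguments map a document identifier to its last-modified time.
    Returns three sorted lists: identifiers to add, to re-index, to delete.
    (Hypothetical helper for illustration; not part of XTF or FEDORA.)
    """
    seen = set()  # identifiers found in the repository (the "mark" step)
    to_add, to_update = [], []

    for doc_id, modified in repository.items():
        seen.add(doc_id)
        if doc_id not in indexed:
            to_add.append(doc_id)        # new since the last run
        elif modified > indexed[doc_id]:
            to_update.append(doc_id)     # revised since the last run

    # The "sweep" step: anything indexed but never marked is gone.
    to_delete = [doc_id for doc_id in indexed if doc_id not in seen]
    return sorted(to_add), sorted(to_update), sorted(to_delete)


# Made-up listings: what the index holds vs. what the repository reports.
indexed = {
    "demo:1": datetime(2007, 1, 10),
    "demo:2": datetime(2007, 1, 10),
    "demo:3": datetime(2007, 1, 10),
}
repository = {
    "demo:1": datetime(2007, 1, 10),  # unchanged
    "demo:2": datetime(2007, 2, 1),   # revised
    "demo:4": datetime(2007, 2, 3),   # new
}

to_add, to_update, to_delete = plan_index_update(indexed, repository)
print("add:", to_add, "update:", to_update, "delete:", to_delete)
```

Note that the whole repository listing still has to be walked on every run; the savings come from only re-indexing the documents that land in the add and update lists.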
Differentiated Indexing
How feasible is it to search within certain elements of XML documents with this model? For instance, with the EAD documents I end up storing things such as the title, elements with the attribute audience="internal", and subjects as separate Lucene fields in order to make searching for these elements a lot easier. (I just took a look at some of the XTF documentation -- I assume you can apply a pre-filter to separate your XML file into different sections for searching).
The pre-filter can apply rules to the form of the document that is subsequently indexed in Lucene. As you saw in the documentation, it is normally used to add weight to specific fields or terms, but I suppose it is also useful for suppressing terms from the index entirely. I think if you wanted those suppressed terms to be searchable by internal users, you would need to create a second index (or "index-name" as XTF calls them) with a different pre-filter.
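As a rough illustration of the two-index idea, the Python sketch below splits a tiny EAD-like document into named fields and either suppresses or keeps the audience="internal" content. This is only a stand-in for what a pre-filter would do; it is not XTF code, and the sample document, the field names, and the build_fields function are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample; real EAD documents are far richer than this.
SAMPLE = """<ead>
  <titleproper>Papers of an Example Family</titleproper>
  <subject>Agriculture</subject>
  <subject>Railroads</subject>
  <note audience="internal">Donor file is in cabinet 3.</note>
  <p>Correspondence and ledgers.</p>
</ead>"""

# Elements that get a field of their own (hypothetical field names).
FIELD_FOR_TAG = {"titleproper": "title", "subject": "subject"}


def text_of(element):
    """All text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def build_fields(xml_text, include_internal):
    """Split a document into fields, as one pre-filter variant might."""
    fields = {"title": [], "subject": [], "internal": [], "text": []}

    def walk(element):
        if element.get("audience") == "internal":
            if include_internal:
                fields["internal"].append(text_of(element))
            return  # never descend, so internal text stays out of "text"
        if element.tag in FIELD_FOR_TAG:
            fields[FIELD_FOR_TAG[element.tag]].append(text_of(element))
            return
        # Tail text after child elements is ignored to keep the sketch short.
        if element.text and element.text.strip():
            fields["text"].append(" ".join(element.text.split()))
        for child in element:
            walk(child)

    walk(ET.fromstring(xml_text))
    return fields


public = build_fields(SAMPLE, include_internal=False)  # public index
staff = build_fields(SAMPLE, include_internal=True)    # internal index
print(public)
print(staff)
```

The same source document is processed twice with different rules, which is the reason a second index is needed: the public index never sees the internal note at all, rather than merely hiding it at query time.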