<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule"><channel><title>Disruptive Library Technology Jester &#187; California Digital Library</title> <atom:link href="http://dltj.org/tag/cdlib/feed/" rel="self" type="application/rss+xml" /><link>http://dltj.org</link> <description>We&#039;re Disrupted, We&#039;re Librarians, and We&#039;re Not Going to Take It Anymore</description> <lastBuildDate>Mon, 06 Feb 2012 20:04:22 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <cloud domain='dltj.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' /> <creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/3.0/us/</creativeCommons:license> <item><title>Google Book Search Settlement: Public Access Service</title><link>http://dltj.org/article/gbs-settlement-public-access/</link> <comments>http://dltj.org/article/gbs-settlement-public-access/#comments</comments> <pubDate>Tue, 11 Nov 2008 17:10:32 +0000</pubDate> <dc:creator>Peter Murray</dc:creator> <category><![CDATA[policy]]></category> <category><![CDATA[California Digital Library]]></category> <category><![CDATA[copyright]]></category> <category><![CDATA[Google]]></category> <category><![CDATA[Google Book Search]]></category> <category><![CDATA[higher education]]></category><guid isPermaLink="false">http://dltj.org/?p=582</guid> <description><![CDATA[One of the very relevant aspects of the Google Book Search Settlement Agreement to libraries is the provision that allows for free public access to the full text of books in 
public and academic libraries. The Notice of Settlement summary &#8230; <a href="http://dltj.org/article/gbs-settlement-public-access/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<abbr class="unapi-id ignore noPrint" title="http://dltj.org/?p=582"></abbr><p>One of the very relevant aspects of the <a href="http://books.google.com/booksrightsholders/agreement-contents.html" title="Google Book Search Copyright Settlement">Google Book Search Settlement Agreement</a> to libraries is the provision that allows for free public access to the full text of books in public and academic libraries.  The Notice of Settlement summary says:  &#8220;Google will provide, on request, &#8216;Public Access&#8217; licenses for free through a dedicated computer terminal at each public library building and through an agreed number of dedicated computer terminals at non-profit higher educational institutions located in the United States.&#8221; (Notice; Q9(F)(1)(c); p.&nbsp;18)  The details beyond the summary are quite a bit more interesting and, of course, have tidbits of useful information that isn&#8217;t in the summary.</p><p>Two notes before we get started. First, for readers who don&rsquo;t normally follow <acronym title="Disruptive Library Technology Jester"><i>DLTJ</i></acronym>, this exploration of the Settlement is taken from a decidedly library (and library technologist) point of view. There are lots of bits that will probably be of greater interest to authors and publishers (like the formula for determining how much compensation is due, etc.) that are not covered here. Second, this is a review of the document as it was <a href="http://docs.justia.com/cases/federal/district-courts/new-york/nysdce/1:2005cv08136/273913/56/" title="The Author&#039;s Guild et al v. Google Inc. Document 56 - :: Justia Docs">submitted</a> to the U.S. 
District Court for the Southern District of New York in the case of <a href="http://dockets.justia.com/docket/court-nysdce/case_no-1:2005cv08136/case_id-273913/" title="The Author&#039;s Guild et al v. Google Inc. - Justia">The Author&rsquo;s Guild et al v. Google Inc</a>. The court needs to review and approve it, and details may change between now and then. There are lots of holes in the document &mdash; notably dates and URLs &mdash; that will need to be filled in.</p><p>In the summary below, references to the Proposed Class Action Settlement take the form of the word &ldquo;Settlement&rdquo; followed by the question number and possibly the paragraph number plus the page number. For instance: &ldquo;(Settlement: 3.1, p. 21)&rdquo;.  Likewise, references to the summary Notice of Proposed Class Action Settlement use the word &#8220;Notice&#8221; in the citation.</p><p><h2>The Most Interesting Points</h2><br />The complete text of the section is reproduced below.  Here are the critical bits:</p><ul type="disc"><li>Each Public Library<sup><a href="http://dltj.org/article/gbs-settlement-public-access/#footnote_0_582" id="identifier_0_582" class="footnote-link footnote-identifier-link" title="&amp;#8216;Public Library&amp;#8217; means a library that (a) is accessible by the public, (b) is, or is part of, a not-for-profit or government-funded institution other than a not-for-profit or government-funded institution that is classified under the Carnegie Classification of Institutions of Higher Education, and (c) allows patrons to take books and other materials off the premises but may also have non-circulating reference collections or provide other library services; however, &amp;#8216;Public Library&amp;#8217; does not include any library primarily funded or managed by the federal government or an agency thereof.  
(Settlement; 1.119; p.&amp;nbsp;15) ">1</a></sup> can have one access point per building.</li><li>Higher Education Institutions<sup><a href="http://dltj.org/article/gbs-settlement-public-access/#footnote_1_582" id="identifier_1_582" class="footnote-link footnote-identifier-link" title="&amp;#8216;Higher Education Institution&amp;#8217; means an Institution of Higher Education, as defined by the Carnegie Classifications of Institutions of Higher Education from time to time or, if and when the Carnegie Foundation for the Classification of Teaching is no longer classifying colleges and universities in the United States, as such term or its successor term is defined by any successor classification system used to classify colleges and universities in the United States. (Settlement; 1.66; p.&amp;nbsp;9) ">2</a></sup> are divided into two tiers, based on the <a href="http://www.carnegiefoundation.org/classifications/" title="Carnegie Classifications homepage">Carnegie Classifications of Institutions of Higher Education</a>.  Institutions that qualify as &#8220;Associate&#8217;s Colleges&#8221; can get one access point per 4,000 <abbrev title="Full-Time Equivalent">FTE</abbrev> students; everyone else gets one access point per 10,000 FTE students.</li><li>For higher education institutions, the definition includes the phrase &#8220;computer terminal may change from time to time&#8221;, but details about how to specify and/or change the access point are not included in the Settlement document.</li><li>Printing can be enabled for public access service points for a per-page fee, but the institution must have a way to collect that per-page fee and remit it to Google for forwarding to the Book Rights Registry (which doesn&#8217;t yet exist) to be split among the rightsholders.  The per-page fee will be set by the Registry, and there doesn&#8217;t appear to be a need to report <em>what</em> was printed by users.  
How Google will actually collect this money is also not specified in the Settlement.</li><li>Google and the Registry may agree to make this public access service available to one or more public libraries or not-for-profit higher education institutions.  This access can be free or for an annual fee; it isn&#8217;t clear how this would be different from an institutional subscription.  (I&#8217;ll cover institutional subscriptions in a later <acronym title="Disruptive Library Technology Jester"><i>DLTJ</i></acronym> post.)</li></ul><p>Since the per-page charge requires action by the Book Rights Registry, we probably shouldn&#8217;t expect the Public Access provision to be executed before the Registry is formed and its new board of directors has met.  Formation of the Registry must be done by the &#8220;Effective Date&#8221; of the agreement, and the effective date of the agreement is after all parties have agreed to the settlement, the court has dismissed the relevant actions against Google, and the time for filing appeals to the court&#8217;s decision has passed.  The date for the final approval hearing by the court has not been set yet.</p><p>There sure is a lot left undefined in this section about how such a program would work.  How does an institution register?  Who at an institution gets to decide which computer is used?  A &#8220;computer terminal&#8221; (what I called an &#8220;access point&#8221;) presumably means a single IP address.  One also presumes that the single IP address can&#8217;t be a proxy server of sorts that is enabling more than one user to have access at a time, but this isn&#8217;t explicitly stated.  (And, believe me, there is <em>a lot</em> of stuff that is explicitly stated in the agreement.)</p><p>There are also some interesting mechanics.  Are institutions going to be able to detect and charge more for prints from the public access Google Books station?  
Presumably institutions that have systems to charge for printing do so to recover the cost of consumables.  Will such a system be able to charge for consumables and, under certain circumstances, charge more for the content itself?</p><p>Also tucked in this section is a bit about the possibility of a &#8220;Commercial Public Access Service&#8221; for &#8220;copy shops and other entities&#8221; (Settlement; 4.8(b); p.&nbsp;61).  Under the terms of the Settlement, the Registry and Google may choose to make this available for an annual fee based on concurrent users plus a fee per printed page.  The notion of &#8220;print on demand&#8221; immediately leaps to mind, but that would seem to be covered in a special paragraph for print-on-demand (POD) under the heading of &#8220;New Revenue Models&#8221; (Settlement; 4.7(a); p.&nbsp;59).</p><p><h2>What the Settlement Says About &#8220;Public Access Service&#8221;</h2><br />This is the actual text of the &#8220;Public Access Service&#8221; section of the settlement (Settlement; 4.8; pp.&nbsp;60-61):</p><blockquote><ol type="a" start="1"><li>Public Access Service.<ol type="i" start="1"><li><u>Free Public Access Service</u>.  
Google may provide the Public Access Service to each not-for-profit Higher Education Institution and Public Library that so requests at no charge (and without any payment to the Rightsholders, through the Registry or otherwise (other than as set forth in Section 4.8(a)(ii) (Printing)) as follows:<ol type="1" start="1"><li>in the case of not-for-profit Higher Education Institutions that do not qualify as Associate&#8217;s Colleges pursuant to the Carnegie Classification of Institutions of Higher Education, one computer terminal for every ten thousand (10,000) Full-Time Equivalency (i.e., full-time equivalent students) at each such institution (which computer terminal may change from time to time);</li><li>in the case of not-for-profit Higher Education Institutions that qualify as Associate&#8217;s Colleges pursuant to the Carnegie Classification of Institutions of Higher Education, one computer terminal for every four thousand (4,000) Full-Time Equivalency (i.e., full-time equivalent students) at each such institution (which computer terminal may change from time to time); and</li><li>in the case of each Public Library, no more than one terminal per Library Building.</li></ol></li><li><u>Printing</u>.  Google shall design the Public Access Service to enable users at a not-for-profit Higher Education Institution to print pages from Display Books for a per-page fee, and to enable users at a Public Library to print pages from Display Books for a per-page fee to the extent that such Public Library offers per-page printing services for a fee for other products and services.  The Registry shall set a reasonable fee for such printing.  Google shall collect all such printing fees and shall pay them to the Registry in accordance with the Standard Revenue Split for Purchases.</li><li><u>Additional Public Access Service</u>.  
The Registry and Google may agree that Google may make available the Public Access Service to one or more Public Libraries or not-for-profit Higher Education Institutions either for free or for an annual fee, in addition to the Public Access Service provided under Section 4.8(a)(i) (Free Public Access Service).</li></ol></li><li><u>Commercial Public Access Service</u>.  The Registry and Google may agree to make a commercial public access service available to copy shops and other entities for an annual fee per concurrent user and a fee per printed page.</li></ol></blockquote><h2>Footnotes</h2><ol class="footnotes"><li id="footnote_0_582" class="footnote">&#8216;Public Library&#8217; means a library that (a) is accessible by the public, (b) is, or is part of, a not-for-profit or government-funded institution other than a not-for-profit or government-funded institution that is classified under the Carnegie Classification of Institutions of Higher Education, and (c) allows patrons to take books and other materials off the premises but may also have non-circulating reference collections or provide other library services; however, &#8216;Public Library&#8217; does not include any library primarily funded or managed by the federal government or an agency thereof.  (Settlement; 1.119; p.&nbsp;15)</li><li id="footnote_1_582" class="footnote">&#8216;Higher Education Institution&#8217; means an Institution of Higher Education, as defined by the Carnegie Classifications of Institutions of Higher Education from time to time or, if and when the Carnegie Foundation for the Classification of Teaching is no longer classifying colleges and universities in the United States, as such term or its successor term is defined by any successor classification system used to classify colleges and universities in the United States. 
(Settlement; 1.66; p.&nbsp;9)</li></ol><div class='series_links'><a href='http://dltj.org/article/oclc-gbs-speculation/' title='Is OCLC&#8217;s Change of WorldCat Record Use/Transfer Policy Related to the Google Book Search Agreement?'>Previous in series</a> <a href='http://dltj.org/article/gbs-settlement-preliminary-approval/' title='Preliminary Court Approval of Google Book Settlement; Final Approval Hearing Set'>Next in series</a></div>]]></content:encoded> <wfw:commentRss>http://dltj.org/article/gbs-settlement-public-access/feed/</wfw:commentRss> <slash:comments>4</slash:comments> </item> <item><title>Soundprint&#8217;s &#8216;Who Needs Libraries?&#8217;</title><link>http://dltj.org/article/who-needs-libraries/</link> <comments>http://dltj.org/article/who-needs-libraries/#comments</comments> <pubDate>Fri, 01 Feb 2008 20:15:49 +0000</pubDate> <dc:creator>Peter Murray</dc:creator> <category><![CDATA[Disruption in Libraries]]></category> <category><![CDATA[academic libraries]]></category> <category><![CDATA[audio]]></category> <category><![CDATA[California Digital Library]]></category> <category><![CDATA[digitization]]></category> <category><![CDATA[libraries]]></category> <category><![CDATA[preservation]]></category><guid isPermaLink="false">http://dltj.org/article/who-needs-libraries/</guid> <description><![CDATA[OhioLINK&#8217;s Meg Spernoga pointed our staff to a 30 minute audio documentary called Who Needs Libraries? 
from Soundprint.org: As more and more information is available on-line, as Amazon rolls out new software that allows anyone to find any passage in any &#8230; <a href="http://dltj.org/article/who-needs-libraries/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<abbr class="unapi-id ignore noPrint" title="http://dltj.org/article/who-needs-libraries/"></abbr><p>OhioLINK&#8217;s Meg Spernoga pointed our staff to a 30 minute audio documentary called <a href="http://www.soundprint.org/radio/display_show/ID/629/name/Who+needs+libraries" title="Who needs libraries?">Who Needs Libraries?</a> from Soundprint.org:</p><blockquote><p>As more and more information is available on-line, as Amazon rolls out new software that allows anyone to find any passage in any book, an important question becomes: Who needs libraries anymore? Why does anyone need four walls filled with paper between covers? Surprisingly, they still do and in this program Producer Richard Paul explores why; looking at how university libraries, school libraries and public libraries have adapted to the new information world. This program airs as part of our ongoing series on education and technology, and is funded in part by the U.S. Department of Education.</p><p>Produced by Richard Paul. 
Hosted by Lisa Simeone.</p></blockquote><p>Some of the topics covered:</p><ul type="square"><li>Numbers of New/Renovated public libraries are steady</li><li>Use of consortial depositories by academic libraries</li><li>Licensed content that can&#8217;t be found in Google (pros &#8212; immediate access; and cons &#8212; preservation)</li><li>Widespread digitization of content for online access (pros and cons)</li><li>Impact of Gates Foundation money on public library services</li><li>Changing ways libraries are being used</li></ul><p>Thanks, Meg!</p>]]></content:encoded> <wfw:commentRss>http://dltj.org/article/who-needs-libraries/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Analysis of CDL&#8217;s XTF textIndexer to Replace the Local Files with FEDORA Objects</title><link>http://dltj.org/article/xtf-fedora-2/</link> <comments>http://dltj.org/article/xtf-fedora-2/#comments</comments> <pubDate>Tue, 22 Aug 2006 20:57:57 +0000</pubDate> <dc:creator>Peter Murray</dc:creator> <category><![CDATA[DRC]]></category> <category><![CDATA[Fedora]]></category> <category><![CDATA[California Digital Library]]></category> <category><![CDATA[digital libraries]]></category> <category><![CDATA[libraries]]></category> <category><![CDATA[xtf]]></category><guid isPermaLink="false">http://dltj.org/2006/08/xtf-fedora-2/</guid> <description><![CDATA[This is a continuation of the investigation about integrating the California Digital Library&#8217;s XTF software into the FEDORA digital object repository that started earlier. 
This analysis looks at the textIndexer module in particular, starting with an overview of how textIndexer &#8230; <a href="http://dltj.org/article/xtf-fedora-2/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<abbr class="unapi-id ignore noPrint" title="http://dltj.org/2006/08/xtf-fedora-2/"></abbr><p>This is a continuation of the investigation about integrating the California Digital Library&#8217;s XTF software into the FEDORA digital object repository that <a href="http://dltj.org/2006/08/xtf-fedora-1">started earlier</a>.  This analysis looks at the textIndexer module in particular, starting with an overview of how textIndexer works now with filesystem-based objects and ending with an outline of how this could work when reading objects from a FEDORA repository instead.</p><p><h2>XTF&#8217;s Native File System handler</h2></p><p>Natively, XTF wants to read content out of the file system.  The core of the processing is done in these two class files:</p><p><h3>TextIndexer.java</h3></p><p>The <code>main()</code> driver for ingesting content into the index.  It reads command-line arguments (<code>cfgInfo.readCmdLine( args, startArg );</code>) to determine the various parameters, one of which is the top of the document source directory (<code>String srcRootDir = Path.resolveRelOrAbs( xtfHomeFile, cfgInfo.indexInfo.sourcePath );</code>).  Assuming all goes well, it calls methods to open the Lucene index for writing, process files in the source directory, and close the Lucene index:<br />[java]<br />srcTreeProcessor.open( cfgInfo );<br />srcTreeProcessor.processDir( new File(srcRootDir), 0 );<br />srcTreeProcessor.close();<br />[/java]</p><p><h3>SrcTreeProcessor.java</h3></p><p><code>processDir()</code> is called recursively on the directory structure to process files in that directory.  For each directory, a <code>docBuf</code> XML-as-a-string buffer is built, consisting of an element for every directory entry. 
<code>docBuf</code> is fed into the SAXON processor along with the docSelector XSLT stylesheet.  The resulting XML is read node-by-node looking for file entries that have an &#8220;indexFile&#8221; tag.  For each matching node, it calls <code>processFile()</code> to index each entry.</p><p><code>processFile()</code> will run the prefilter XSLT against the file content, build the Lazy Tree (if possible and requested), create the <code>IndexSource</code> version by running the source document through the appropriate file type &#8220;*IndexSource&#8221; method (e.g. <code>PDFIndexSource()</code>, <code>XMLIndexSource()</code>, and <code>MARCIndexSource()</code>) and queue the content for indexing by the Lucene indexer.</p><p><h2>Requirements for an Object Handler for textIndexer</h2><br />Based on this analysis, if one were to replace the TextIndexer.java and SrcTreeProcessor.java &#8220;front end&#8221; of textIndexer, I think these would be the pieces that would be required.  (Note that some steps are skipped in this overview &#8212; any replacement of these two classes would need to be sure to do everything that those classes do now.)</p><ol type="1"><li>Parse command line and configuration file parameters to create an <span class="removed_link" title="http://texts-stage.cdlib.org/xtf/javadoc/org/cdlib/xtf/textIndexer/IndexerConfig.html">IndexerConfig</span> instance (guiding parameters for the indexer as a whole) and an <span class="removed_link" title="http://texts-stage.cdlib.org/xtf/javadoc/org/cdlib/xtf/textIndexer/IndexInfo.html">IndexInfo</span> instance (parameters specific to the identified index-name).</li><li>Specify a collection of objects that you want in index-name.</li><li>Open up a writable instance of the index-name&#8217;s Lucene index (<i>a la</i> <code>srcTreeProcessor.open( cfgInfo );</code>)</li><li>For each object to be put into index-name, do these things:<ol type="a"><li>Optionally, run the source object through a prefilter (an XSLT 
transformation used to restructure the source document just prior to indexing without changing the stored source document).</li><li>Optionally, remove a DOCTYPE declaration in the source object before it is indexed.</li><li>Set up a transformation object from the native file format to something that is XML and call <code>textProcessor.checkAndQueueText()</code> to add it to a queue to be processed.</li></ol></li><li>Close index-name&#8217;s Lucene index (<i>a la</i> <code>srcTreeProcessor.close();</code>), which should have the side effect of processing the queued text (<i>a la</i> <code>textProcessor.processQueuedTexts();</code>) which will ultimately create the Lazy Tree (if specified) and add the object to the Lucene index.</li><li>Optionally, compare the collection of objects that you want in index-name with what is actually in index-name before you started, and remove anything that wasn&#8217;t in the specified collection.</li></ol><p><h2>Considering a FEDORA-based XTF handler</h2><br />So, all-in-all, that doesn&#8217;t seem too bad.  Here is where we get to mix in some FEDORA pieces and see what we get in the end.</p><p>First off, in terms of dealing with &#8220;collections of source objects to be indexed&#8221; I think it would be best to have this start with one of our &#8220;collection aggregation&#8221; objects as the root level of a source collection.  We&#8217;d perform an RDF &#8220;isMemberOf&#8221; query against the resource index using the FEDORA PID of the aggregation object (and optionally make an &#8220;isMemberOf&#8221; query recursively against the returned set &#8212; as if one was drilling down a file system).</p><p>Secondly, to get the XML content to be indexed, each object would have a <code>getXML</code> disseminator (see <a href="http://dltj.org/2006/05/fedora-disseminators/">Thinking about Our FEDORA Disseminators</a> for background) that would render to XTF an XML version of itself.  
If the source object is an XML-based object, it just returns the XML.  If the source object is a PDF or Word document or something that can be rendered into a text-like form, the disseminator would handle that.  If the source object is an image or audio clip, the disseminator can return the descriptive XML of the object.  The point being, though, by the time the object gets to XTF&#8217;s textIndexer, it has already been rendered to XML, so just the XML transformation tool would be needed (as in this snippet from SrcTreeProcessor.java):<br />[java]<br />IndexSource srcFile = null;<br />if( format.equalsIgnoreCase("XML") ) {<br /> InputSource finalSrc = new InputSource( systemId );<br /> srcFile = new XMLIndexSource( finalSrc, srcPath, key,<br /> preFilters, displayStyle, lazyStore );<br /> if( removeDoctypeDecl )<br /> ((XMLIndexSource)srcFile).removeDoctypeDecl( true );<br />}<br />[/java]</p><p>Third, a FEDORA-aware driver that replaces TextIndexer.java and SrcTreeProcessor.java.  Given a configuration file location and a starting PID, it would gather the objects to be indexed, &#8220;open&#8221; the Lucene index, run through the snippet of Java above for each object, and &#8220;close&#8221; the Lucene index.</p><p>The quick-and-dirty first implementation would copy the XML source to a directory on the hard drive (directory and subdirectory names would be the PID of the aggregation object containing the collection of objects), and have XTF use that local filesystem copy as the indexed source.  Lazy Tree files for each object would also be created and stored locally.  This means we have two copies (three, if you count the Lazy Tree) of the object lying around, so eventually I think we&#8217;d want to modify XTF to pull content directly from FEDORA using a REST-based URL.  Eventually I think we may also want to store the Lazy Tree in something other than the local file system.  
Could that be another datastream in the FEDORA object?<p style="padding:0;margin:0;font-style:italic;" class="removed_link">The text was modified to remove a link to http://texts-stage.cdlib.org/xtf/javadoc/org/cdlib/xtf/textIndexer/IndexerConfig.html on December 31st, 2010.</p><p style="padding:0;margin:0;font-style:italic;" class="removed_link">The text was modified to remove a link to http://texts-stage.cdlib.org/xtf/javadoc/org/cdlib/xtf/textIndexer/IndexInfo.html on December 31st, 2010.</p><div class='series_links'><a href='http://dltj.org/article/xtf-fedora-1/' title='CDL&#8217;s XTF as a Front End to Fedora'>Previous in series</a> <a href='http://dltj.org/article/xtf-fedora-3/' title='XTF and FEDORA &mdash; Comments from the Community'>Next in series</a></div>]]></content:encoded> <wfw:commentRss>http://dltj.org/article/xtf-fedora-2/feed/</wfw:commentRss> <slash:comments>4</slash:comments> </item> <item><title>CDL&#8217;s XTF as a Front End to Fedora</title><link>http://dltj.org/article/xtf-fedora-1/</link> <comments>http://dltj.org/article/xtf-fedora-1/#comments</comments> <pubDate>Tue, 22 Aug 2006 13:29:09 +0000</pubDate> <dc:creator>Peter Murray</dc:creator> <category><![CDATA[DRC]]></category> <category><![CDATA[Fedora]]></category> <category><![CDATA[California Digital Library]]></category> <category><![CDATA[digital libraries]]></category> <category><![CDATA[libraries]]></category> <category><![CDATA[OhioLINK]]></category> <category><![CDATA[xtf]]></category><guid isPermaLink="false">http://dltj.org/2006/08/xtf-fedora-1/</guid> <description><![CDATA[We&#8217;re experimenting pretty heavily now with the California Digital Library&#8217;s XTF framework as a front-end to a FEDORA object repository. 
Initial efforts look promising &#8212; thanks go out to Brian Tingle and Kirk Hastings of CDL; Jeff Cousens, Steve DiDomenico, &#8230; <a href="http://dltj.org/article/xtf-fedora-1/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<abbr class="unapi-id ignore noPrint" title="http://dltj.org/2006/08/xtf-fedora-1/"></abbr><p>We&#8217;re experimenting pretty heavily now with the <a href="http://cdlib.org/" title="California Digital Library">California Digital Library</a>&#8217;s <a href="http://sourceforge.net/projects/xtf" title="SourceForge.net: eXtensible Text Framework (XTF)">XTF</a> framework as a front-end to a <a href="http://www.fedora.info/" title="Fedora">FEDORA object repository</a>.  Initial efforts look promising &#8212; thanks go out to Brian Tingle and Kirk Hastings of CDL; Jeff Cousens, Steve DiDomenico, and Bill Parod from Northwestern; and Ross Wayland from UVa for helping us along in the right direction.</p><p><h2>XTF into Eclipse How-To</h2><br />As we get more serious about XTF, I wrote up a <span class="removed_link" title="http://drc-dev.ohiolink.edu/wiki/EclipseXTFHowTo">How-To document for bringing XTF into Eclipse</span> so that it can be deployed as a dynamic web application.  Let me know if you find it useful.  Definitely let me know if you find it in error.  We haven&#8217;t put a version of XTF into OhioLINK&#8217;s source code repository, but that might follow shortly.</p><p><h2>Points of Integration</h2><br />In its base configuration, XTF reads documents out of a &#8220;data&#8221; directory that is in the application&#8217;s Tomcat context directory.  It looks like two of the XTF components will need to be modified to successfully converse with a FEDORA-based object repository:  DynaXML and textIndexer.  
Of the two, DynaXML seems to be the more straightforward.</p><p><h3>DynaXML</h3><br />First I went looking for where XTF&#8217;s DynaXML reads documents and found the <a href="http://xtf.hg.sourceforge.net/hgweb/xtf/xtf/file/549e4167039e/WEB-INF/src/org/cdlib/xtf/dynaXML/DocLocator.java" title="">DocLocator interface</a> with one <a href="http://xtf.hg.sourceforge.net/hgweb/xtf/xtf/file/de7d8a406bef/WEB-INF/src/org/cdlib/xtf/dynaXML/DefaultDocLocator.java" title="">implementation that looks into the file system</a>.  John Davison, one of the DRC programmers, figured out (with help from the CDL folks) that in fact it is possible to pass a FEDORA API-A URL to DefaultDocLocator and have it do the right thing.  Its &#8216;getInputSource()&#8217; method has this signature:</p><p>[java]<br />public InputSource getInputSource( String sourcePath,<br /> boolean removeDoctypeDecl ) throws IOException<br />[/java]<br />&#8230;followed shortly by:</p><p>[java]<br />// If it's non-local, load the URL.<br />if( sourcePath.startsWith("http:") ||<br /> sourcePath.startsWith("https:") )<br />{<br /> return new InputSource( sourcePath );<br />}<br />[/java]<br />where &#8220;InputSource&#8221; is the <a href="http://www.docjar.com/docs/api/org/xml/sax/InputSource.html" title="InputSource">entry point into the SAX parser</a>, which will accept a URI as a parameter.</p><p>Unfortunately, using DefaultDocLocator in this way negates the use of <span class="removed_link" title="http://xtf.sourceforge.net/WebDocs/HTML/XTF_Under_Hood/XTFUnderHood.html#LazyFiles">CDL&#8217;s &#8220;Lazy Trees&#8221;</span> (a binary version of each XML document containing all the original contents of the document, plus an index telling XTF where each element starts and ends).  
Lazy Trees are a good thing because they speed up parsing of the XML document and the resulting rendering to the user.</p><p>When dealing with local files (as opposed to the URL method described above), DefaultDocLocator will build a Lazy Tree in its index directory the first time the XML document is called up.  In implementing a FEDORA interface for XTF&#8217;s DynaXML, what is required is a mixture of URL (or, in the case of FEDORA, a PID plus API-A call) to get the document and then create/store its lazy tree in the XTF index directory for subsequent retrieval.  This does seem pretty straightforward, does it not?</p><p><h3>textIndexer</h3><br />XTF&#8217;s textIndexer, on the other hand, really wants the XML it is indexing to be files on the local hard drive.  The XTF programming guide speaks of a <span class="removed_link" title="http://xtf.sourceforge.net/WebDocs/HTML/XTF_Programming_Guide/XTFProgGuide.html#textIndexer_DocSelector_Prog">textIndexer Document Selector</span> whose job it is to create a single XML file with the specifications of which documents to index and how to do it:</p><blockquote><p>It is the responsibility of the <b>Document Selector</b> XSLT code to output an XML fragment that identifies which of the files in the directory should be indexed. This output XML fragment should take the following form:<br />[xml]<br /><indexfiles><br /> <indexfile fileName      = "FileName"<br /> {format       = "FileFormatID"}<br /> {preFilter    = "PreFilterPath"}<br /> {displayStyle = "DocumentFormatterPath"}><br /></indexfile></indexfiles><br />[/xml]</p></blockquote><p>Now the trick seems to be to build an alternate Document Selector that will not use filenames but rather URIs to build the index.  
That&#8217;ll be the subject of the next round of investigations.</p><p>Comments and observations are welcome!<p style="padding:0;margin:0;font-style:italic;" class="removed_link">The text was modified to remove a link to http://drc-dev.ohiolink.edu/wiki/EclipseXTFHowTo on December 31st, 2010.</p><p style="padding:0;margin:0;font-style:italic;" class="removed_link">The text was modified to remove a link to http://xtf.sourceforge.net/WebDocs/HTML/XTF_Under_Hood/XTFUnderHood.html#LazyFiles on December 31st, 2010.</p><p style="padding:0;margin:0;font-style:italic;" class="removed_link">The text was modified to remove a link to http://xtf.sourceforge.net/WebDocs/HTML/XTF_Programming_Guide/XTFProgGuide.html#textIndexer_DocSelector_Prog on December 31st, 2010.</p><p style="padding:0;margin:0;font-style:italic;">The text was modified to update a link from http://xtf.cvs.sourceforge.net/xtf/xtf/WEB-INF/src/org/cdlib/xtf/dynaXML/DocLocator.java?revision=1.5&#038;view=markup to http://xtf.hg.sourceforge.net/hgweb/xtf/xtf/file/549e4167039e/WEB-INF/src/org/cdlib/xtf/dynaXML/DocLocator.java on January 28th, 2011.</p><p style="padding:0;margin:0;font-style:italic;">The text was modified to update a link from http://xtf.cvs.sourceforge.net/xtf/xtf/WEB-INF/src/org/cdlib/xtf/dynaXML/DefaultDocLocator.java?revision=1.10&#038;view=markup to http://xtf.hg.sourceforge.net/hgweb/xtf/xtf/file/de7d8a406bef/WEB-INF/src/org/cdlib/xtf/dynaXML/DefaultDocLocator.java on January 28th, 2011.</p><div class='series_links'> <a href='http://dltj.org/article/xtf-fedora-2/' title='Analysis of CDL&#8217;s XTF textIndexer to Replace the Local Files with FEDORA Objects'>Next in series</a></div>]]></content:encoded> <wfw:commentRss>http://dltj.org/article/xtf-fedora-1/feed/</wfw:commentRss> <slash:comments>2</slash:comments> </item> </channel> </rss>
