<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule"><channel><title>Disruptive Library Technology Jester &#187; National Science Digital Library</title> <atom:link href="http://dltj.org/tag/nsdl/feed/" rel="self" type="application/rss+xml" /><link>http://dltj.org</link> <description>We&#039;re Disrupted, We&#039;re Librarians, and We&#039;re Not Going to Take It Anymore</description> <lastBuildDate>Mon, 06 Feb 2012 20:04:22 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <cloud domain='dltj.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' /> <creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/3.0/us/</creativeCommons:license> <item><title>Presentation Summary: &#8220;MPTStore: Implementing a fast, scalable, and stable RDBMS-backed triplestore for Fedora and the NSDL&#8221;</title><link>http://dltj.org/article/fedora-mptstore/</link> <comments>http://dltj.org/article/fedora-mptstore/#comments</comments> <pubDate>Mon, 29 Jan 2007 19:03:09 +0000</pubDate> <dc:creator>Peter Murray</dc:creator> <category><![CDATA[Fedora]]></category> <category><![CDATA[Meeting]]></category> <category><![CDATA[Raw Technology]]></category> <category><![CDATA[icor2007]]></category> <category><![CDATA[National Science Digital Library]]></category> <category><![CDATA[programming]]></category> <category><![CDATA[RDF]]></category> <category><![CDATA[semantic web]]></category><guid isPermaLink="false">http://dltj.org/2007/01/fedora-mptstore/</guid> <description><![CDATA[Chris Wilper gave 
this presentation on the work that he and Aaron Birkland did to improve the performance of the Fedora Resource Index. Presentation slides via SlideShare. Version 2.0 of the Fedora digital object repository software added a feature &#8230; <a href="http://dltj.org/article/fedora-mptstore/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<abbr class="unapi-id ignore noPrint" title="http://dltj.org/2007/01/fedora-mptstore/"></abbr><p>Chris Wilper gave this presentation on the work that he and Aaron Birkland did to improve the performance of the Fedora Resource Index.</p><div style="margin-left: 5em;"><object type="application/x-shockwave-flash" data="https://s3.amazonaws.com:443/slideshare/ssplayer.swf?id=20618&#038;doc=mptstore-a-fast-scalable-and-stable-resource-index-1484" width="425" height="348"><param name="movie" value="https://s3.amazonaws.com:443/slideshare/ssplayer.swf?id=20618&#038;doc=mptstore-a-fast-scalable-and-stable-resource-index-1484" /></object><br /><span style="font-size: 85%"><a href="http://www.slideshare.net/cwilper/mptstore-a-fast-scalable-and-stable-resource-index" title="MPTStore: A Fast, Scalable, and Stable Resource Index &amp;raquo; Slideshare">Presentation slides via SlideShare</a></span></div><p>Version 2.0 of the <a href="http://www.fedora.info/" title="Fedora">Fedora digital object repository software</a> added a feature called the Resource Index (RI).  Based on <a href="http://www.w3.org/RDF/" title="http://www.w3.org/RDF/">Resource Description Framework</a> (RDF) triples, the RI provided quick access to relationships between objects as well as to the descriptive elements of the objects themselves.  
After about two years of use with the <a href="http://www.kowari.org/" title="kowari">Kowari software</a>, the RI has exposed a number of challenges for &#8220;triplestores&#8221;:  scalability (few triplestores are designed for more than 100 million triples); performance; and stability (frequent &#8220;rebuilds&#8221;).</p><p>The real motivation behind experimenting with a new triplestore, however, was the NSDL use case.  The <a href="http://nsdl.org/" title="The National Science Digital Library">National Science Digital Library</a> (NSDL) is a moderately large repository (4.7 million objects, 250 million triples) with a lot of write activity (driven by periodic OAI harvests; primarily mixed ingests and datastream modifications).  The NSDL data model also includes existential/referential integrity constraints that must be enforced.  Querying the RI to determine correct repository state proved to be difficult:  Kowari aggressively buffers triples, sometimes on the order of seconds, before writing them to disk.  Flushing the buffer after every write is also computationally expensive (hence the drive to use buffers in the first place).</p><p>The NSDL team also encountered corruption under concurrent use and with abnormal shutdowns, forcing the rebuild of the triplestore.  And the solution was not scaling well; performance was becoming notably worse.  In looking for solutions, other triplestores were considered but rejected.  Using an RDBMS seemed attractive &#8212;  efficient transactions, very stable, generally speedy &#8212; but a &#8220;one big table&#8221; paradigm to store all of the relations did not seem to give them the desired scalability.</p><p>NSDL developers observed that the total number of distinct predicates is much lower than the number of subjects or objects;  NSDL has about 50 distinct predicates.  Based on this observation, their solution, called &#8220;Mapped Predicate Tables,&#8221; creates a table for every predicate in the triplestore.  
This has several advantages:  a low computational cost for triple adds and deletes, fast queries for known predicates, finer-granularity statistics and query plans from the relatively mature RDBMS planner for complex queries, and flexible data partitioning to help address scalability.  This solution comes with several disadvantages, however:  one needs to manage the predicate-to-table mapping, complex queries crossing many predicates require more effort to formulate, and with a naive approach simple unbound queries scale linearly with the number of predicates.</p><p>So the NSDL team created the <a href="http://mptstore.sourceforge.net/" title="MPTStore 0.9.1 Documentation">MPTStore triplestore</a> and contributed it back to the Fedora core developers for use by the community.  MPTStore is a Java library that handles all of the predicate mapping and accounting behind the scenes.  The basic API remains the same as for other triplestores, performing triple writes and queries, and the library hides all of the implementation details of translating queries from a particular language (SPO, SPARQL) into SQL statements.  The library is also designed to expose transaction/connection semantics should the developer wish to have direct access to the predicate tables.</p><p>A solution like MPTStore is well suited to the NSDL use case.  The NSDL team was very familiar with the operations of RDBMS administration: performance tuning, backups, etc.  The stored triplestore data is transparent and &#8220;hackable&#8221; &#8212; ad hoc SQL queries and analysis are relatively simple.  In fact, the RDBMS triplestore helped track down Fedora middleware bugs that resulted in an inconsistent state.  
Fixing these bugs also improved the performance of the Kowari-based RI.</p><p>[Updated 20070129T1447 to include links to Chris' presentation on SlideShare.]</p>]]></content:encoded> <wfw:commentRss>http://dltj.org/article/fedora-mptstore/feed/</wfw:commentRss> <slash:comments>4</slash:comments> </item> <item><title>&#8220;Cautiously Optimistic&#8221;</title><link>http://dltj.org/article/cautiously-optimistic/</link> <comments>http://dltj.org/article/cautiously-optimistic/#comments</comments> <pubDate>Tue, 13 Jun 2006 23:21:58 +0000</pubDate> <dc:creator>Peter Murray</dc:creator> <category><![CDATA[Disruption in Libraries]]></category> <category><![CDATA[digital libraries]]></category> <category><![CDATA[Joint Conference on Digital Libraries 2006]]></category> <category><![CDATA[metadata]]></category> <category><![CDATA[National Science Digital Library]]></category> <category><![CDATA[standards]]></category> <category><![CDATA[xml]]></category><guid isPermaLink="false">http://dltj.org/2006/06/cautiously-optimistic/</guid> <description><![CDATA[During the cookies and lemonade break during JCDL this afternoon I surprised one of the well-respected elders of the field with this question: are we really making progress? are we winning a fight against entropy 1? I wasn&#8217;t out for &#8230; <a href="http://dltj.org/article/cautiously-optimistic/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<abbr class="unapi-id ignore noPrint" title="http://dltj.org/2006/06/cautiously-optimistic/"></abbr><p>During the cookies and lemonade break during JCDL this afternoon I surprised one of the well-respected elders of the field with this question:  are we really making progress?  
are we winning a fight against entropy <sup><a href="http://dltj.org/article/cautiously-optimistic/#footnote_0_70" id="identifier_0_70" class="footnote-link footnote-identifier-link" title="Defined as:  &amp;#8220;Measure of disorganization or degradation in the universe that reduces available energy, or tendency of available energy to dwindle. Chaos, opposite of order.&amp;#8221;  Do you remember your Second Law of Thermodynamics?">1</a></sup>?  I wasn&#8217;t out for a quote for publication at the time so I won&#8217;t reveal the individual&#8217;s name, but I will report that there was a chuckle then the reply &#8220;cautiously optimistic.&#8221;</p><p>This person went on to say that access to raw information has improved much over the last five years &mdash; that the internet and its tools have increased the capacity to publish and retrieve information.  &#8216;Sure,&#8217; s/he went on to say, &#8216;we have a number of hard problems to solve &mdash; linking related objects to each other and so forth &mdash; but we are making progress.&#8217;  I, too, offered a chuckle and agreed, and we went back to our cookies and lemonade.</p><p>Entropy and chaos are powerful forces, however, and it was just after this brief encounter that we heard from <a href="http://www.cs.cornell.edu/lagoze/">Carl Lagoze</a> with a talk called <a href="http://arxiv.org/abs/cs.DL/0601125">Metadata aggregation and &#8220;automated digital libraries&#8221;: A Retrospective on the NSDL experience</a>.  Although the paper is a modestly dry report on the issues resolved and overcome in &#8220;running a relatively large-scale digital library (over a million objects) by collecting, processing, storing, and using metadata&#8221; <sup><a href="http://dltj.org/article/cautiously-optimistic/#footnote_1_70" id="identifier_1_70" class="footnote-link footnote-identifier-link" title="Lagoze, C., Krafft, D. B., Cornwell, T., Dushay, N., Eckstrom, D., Saylor, J. 2006. 
Metadata aggregation and &amp;#8220;automated digital libraries&amp;#8221;: A retrospective on the NSDL experience. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (Chapel Hill, NC, USA, June 11 &amp;#8211; 15, 2006). JCDL &amp;#8217;06. ACM Press, New York, NY, 231. [arXiv:cs.DL/0601125]">2</a></sup>, the oral presentation was anything but dry.  In fact, it offered a sobering reminder of how hard this is and the challenges before us.  He did it with four questions:</p><ol><li>What is a digital library anyway?</li><li>What is the role of metadata in a digital library?</li><li>What is &#8220;low barrier&#8221; technology? <i>[This one was tied to the observation that OAI-PMH, while modestly simple compared to other protocols, still requires a lot of effort to get right.  See reality lesson #4 below.]</i></li><li>Where should expensive and limited human energy be allocated?</li></ol><p>&#8230; and seven reality lessons:</p><ul><li>Reality lesson #1: <i>Metadata is not being created</i><br />In truth, there is not a lot of funding set aside in projects to create metadata.</li><li>Reality lesson #2: <i>Participating as a metadata provider is complicated by a &#8220;knowledge gap&#8221;</i><br />Doing so requires three skill sets that are frequently distinct: Domain expertise (e.g. &#8220;mathematics&#8221;); Metadata expertise (e.g. &#8220;Dublin Core&#8221;); and Technical expertise (e.g. encode it in XML and use a formal protocol).</li><li>Reality lesson #3: <i>Harvested metadata is not necessarily useful metadata</i><br />&#8220;Correct&#8221; metadata is not necessarily &#8220;rich&#8221; metadata.  
The general problem of metadata quality remains unsolved &#8212; even the best automated/automatic transformations are not good enough.</li><li>Reality lesson #4: <i>OAI-PMH is not necessarily low-barrier and automatic</i><br />Doing OAI-PMH right incorporates lots of details and assumed knowledge (UTF-8, XML schema validation, URL encoding, date stamping, resumption tokens, etc.).  And even after sometimes months of hand-holding a data provider, the initial success does not persist in the majority of cases; the failure rate of subsequent harvests is high.  And the &#8220;incremental harvest&#8221; functionality is a nice concept but it doesn&#8217;t work: support for &#8220;deleted&#8221; records is inconsistent in data providers; less than 50% of providers claim to persist deletions and many persistence claims are faulty.  Too often server failures and harvest failures require a full harvest &#8216;resync&#8217;.</li><li>Reality lesson #5: <i>Human cost of large-scale harvesting is high</i><br />In the case of NSDL, their metrics show that they exchange 170 messages per year per provider and that it takes on average 98 messages exchanged for the first harvest to succeed (which, as previously noted, subsequently fails).</li><li>Reality lesson #6: <i>Matching individual metadata records of equivalent resources is hard</i><br />I didn&#8217;t have anything in my notes about this, but as I recall his comments were about the lack of ways to uniformly handle these surrogate objects in the OAI-PMH protocol.</li><li>Reality lesson #7: <i>Lots of (even good) metadata does not make a complete digital library (and maybe not even a digital library that is highly useful for education)</i><br />There is a real need to understand the value-add of a digital library: capturing the wisdom of the community served as well as focusing less on structured information and more on relationships among resources and user-derived relationships and annotations.</li></ul><p>So what do I think?  
You know &mdash; I&#8217;m not sure.  These are tough problems, and the world would be a better place if they were solved.  We can demand answers, but sometimes there just isn&#8217;t enough of a shoulder to stand on from the giant below.  Still, one can&#8217;t help but wonder if all of the energy put into the collective &#8220;digital library&#8221; problem so far has just dissipated into chaos.</p><h2>Footnotes</h2><ol class="footnotes"><li id="footnote_0_70" class="footnote"><a href="http://www.himalayasaltcrystal.com/glossary.htm">Defined as</a>:  &#8220;Measure of disorganization or degradation in the universe that reduces available energy, or tendency of available energy to dwindle. Chaos, opposite of order.&#8221;  Do you remember your Second Law of Thermodynamics?</li><li id="footnote_1_70" class="footnote">Lagoze, C., Krafft, D. B., Cornwell, T., Dushay, N., Eckstrom, D., Saylor, J. 2006. Metadata aggregation and &#8220;automated digital libraries&#8221;: A retrospective on the NSDL experience. In <i>Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries</i> (Chapel Hill, NC, USA, June 11 &#8211; 15, 2006). JCDL &#8217;06. ACM Press, New York, NY, 231. [<a href="http://arxiv.org/abs/cs.DL/0601125">arXiv:cs.DL/0601125</a>]</li></ol>]]></content:encoded> <wfw:commentRss>http://dltj.org/article/cautiously-optimistic/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> </channel> </rss>
