Chris Wilper gave this presentation on the work that he and Aaron Birkland did to improve the performance of the Fedora Resource Index.
Version 2.0 of the Fedora digital object repository software added a feature called the Resource Index (RI). Based on Resource Description Framework (RDF) triples, the RI provided quick access to relationships between objects as well as to the descriptive elements of the objects themselves. After about two years of use with the Kowari software, the RI has exposed a number of challenges for “triplestores”: scalability (few triplestores are designed for more than 100 million triples); performance; and stability (frequent “rebuilds”).
The real motivation behind experimenting with a new triplestore, however, was the NSDL use case. The National Science Digital Library (NSDL) is a moderately large repository (4.7 million objects, 250 million triples) with a lot of write activity (driven by periodic OAI harvests; primarily mixed ingests and datastream modifications). The NSDL data model also includes existential/referential integrity constraints that must be enforced. Querying the RI to determine the correct repository state proved to be difficult: Kowari aggressively buffers triples, sometimes for a period on the order of seconds, before writing them to disk, so a query issued shortly after a write may not reflect it. Flushing the buffer after every write is computationally expensive (hence the drive to use buffers in the first place).
The NSDL team also encountered corruption under concurrent use and after abnormal shutdowns, forcing rebuilds of the triplestore. The solution was also not scaling well; performance was becoming notably worse. In looking for solutions, other triplestores were considered but rejected. Using an RDBMS seemed attractive (efficient transactions, very stable, generally speedy), but a “one big table” paradigm for storing all of the relations did not seem to give them the desired scalability.
NSDL developers observed that the total number of distinct predicates is much lower than the number of distinct subjects or objects; NSDL has about 50 distinct predicates. Based on this observation, their solution, called “Mapped Predicate Tables,” creates a table for every predicate in the triplestore. This has several advantages: triple adds and deletes have a low computational cost, queries for known predicates are fast, complex queries benefit from the relatively mature RDBMS planner having finer-granularity statistics and query plans, and data can be partitioned flexibly to help address scalability. The solution comes with several disadvantages, however: one needs to manage the predicate-to-table mapping, complex queries crossing many predicates require more effort to formulate, and with a naive approach simple unbound queries scale linearly with the number of predicates.
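The table-per-predicate idea can be illustrated with a minimal sketch using Python's built-in sqlite3 module. This is not the MPTStore schema or API, which the talk summary does not spell out; the class name `PredicateTableStore`, the `pred_map` table, the `t<N>` table naming, and the example URIs are all hypothetical choices made for illustration.

```python
import sqlite3


class PredicateTableStore:
    """Toy triplestore keeping one (s, o) table per distinct predicate.

    Illustrative only: names and schema are assumptions, not MPTStore's.
    """

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        # Mapping table: predicate URI -> numeric id used to name its table.
        self.db.execute(
            "CREATE TABLE pred_map (id INTEGER PRIMARY KEY, predicate TEXT UNIQUE)"
        )

    def _table_for(self, predicate, create=False):
        """Return the table name for a predicate, optionally creating it."""
        row = self.db.execute(
            "SELECT id FROM pred_map WHERE predicate = ?", (predicate,)
        ).fetchone()
        if row is not None:
            return f"t{row[0]}"
        if not create:
            return None
        cur = self.db.execute(
            "INSERT INTO pred_map (predicate) VALUES (?)", (predicate,)
        )
        table = f"t{cur.lastrowid}"  # name derives from an integer, never user text
        self.db.execute(f"CREATE TABLE {table} (s TEXT, o TEXT)")
        self.db.execute(f"CREATE INDEX {table}_s ON {table} (s)")
        self.db.execute(f"CREATE INDEX {table}_o ON {table} (o)")
        return table

    def add(self, s, p, o):
        """Adding a triple is a single-row insert into one small table."""
        table = self._table_for(p, create=True)
        with self.db:  # commits on success, rolls back on error
            self.db.execute(f"INSERT INTO {table} (s, o) VALUES (?, ?)", (s, o))

    def delete(self, s, p, o):
        table = self._table_for(p)
        if table is not None:
            with self.db:
                self.db.execute(
                    f"DELETE FROM {table} WHERE s = ? AND o = ?", (s, o)
                )

    def objects(self, s, p):
        """Known-predicate query: touches exactly one table."""
        table = self._table_for(p)
        if table is None:
            return []
        rows = self.db.execute(
            f"SELECT o FROM {table} WHERE s = ? ORDER BY o", (s,)
        )
        return [r[0] for r in rows]

    def predicates_of(self, s):
        """Unbound-predicate query: the naive approach visits every table."""
        mapping = self.db.execute(
            "SELECT id, predicate FROM pred_map ORDER BY id"
        ).fetchall()
        found = []
        for pid, predicate in mapping:
            hit = self.db.execute(
                f"SELECT 1 FROM t{pid} WHERE s = ? LIMIT 1", (s,)
            ).fetchone()
            if hit:
                found.append(predicate)
        return found
```

In this sketch `objects()` reads a single indexed table, while `predicates_of()` loops over every entry in the mapping, which is the linear scaling with the number of predicates noted above.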
So the NSDL team created the MPTStore triplestore and contributed it back to the Fedora core developers for use by the community. MPTStore is a Java library that handles all of the predicate mapping and accounting behind the scenes. The basic API remains the same as for other triplestores, performing triple writes and queries, and the library hides all of the implementation details of translating queries from a particular language (SPO, SPARQL) into SQL statements. The library is also designed to expose transaction/connection semantics should the developer wish to have direct access to the predicate tables.
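To give a rough idea of what translating a query into SQL over predicate tables can involve, here is a small Python sketch for a single subject-predicate-object pattern. It is not MPTStore's actual translator; the function name `pattern_to_sql`, the `(s, o)` column layout, and the `table_map` argument (predicate to table name) are assumptions made for illustration.

```python
def pattern_to_sql(pattern, table_map):
    """Translate one (s, p, o) pattern into (sql, params).

    None marks an unbound position. table_map is a hypothetical
    predicate -> table-name mapping that is trusted (table names are
    interpolated into the SQL text; values are always bound as parameters).
    Returns None when the predicate has no table, i.e. nothing can match.
    """
    s, p, o = pattern
    if p is not None:
        if p not in table_map:
            return None
        targets = [(p, table_map[p])]
    else:
        # Unbound predicate: every predicate table has to be consulted.
        targets = sorted(table_map.items())

    selects, params = [], []
    for predicate, table in targets:
        conditions = []
        branch_params = [predicate]  # fills the "? AS p" placeholder
        if s is not None:
            conditions.append("s = ?")
            branch_params.append(s)
        if o is not None:
            conditions.append("o = ?")
            branch_params.append(o)
        where = " WHERE " + " AND ".join(conditions) if conditions else ""
        selects.append(f"SELECT s, ? AS p, o FROM {table}{where}")
        params.extend(branch_params)

    return " UNION ALL ".join(selects), params
```

With a bound predicate the result is a single SELECT against one table; with an unbound predicate it becomes a UNION ALL with one branch per predicate table.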
A solution like MPTStore is well suited to the NSDL use case. The NSDL team was very familiar with the operations of RDBMS administration: performance tuning, backups, etc. The stored triplestore data is transparent and “hackable”: ad hoc SQL queries and analysis are relatively simple. In fact, the RDBMS triplestore helped track down Fedora middleware bugs that resulted in an inconsistent state. Fixing these bugs also improved the performance of the Kowari-based RI.
[Updated 20070129T1447 to include links to Chris' presentation on SlideShare.]