Thursday Threads: RDF, Digital Document Tampering, and Amazon’s Mechanical Turk




This is definitely becoming a habit...welcome to the fourth edition of DLTJ's Thursday Threads. If you find these interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the left. If you would like a more raw and immediate version of these types of stories, watch my FriendFeed stream (or subscribe to its feed in your feed reader). Comments, as always, are welcome.

Defining Linked Data By Analogy

RDF is the grammar for a language of data. URIs are the words of that language. As in natural language, these words (i.e., the URIs) belong to grammatical categories. RDF properties (such as "isReferencedBy") function a bit like verbs, RDF classes like nouns.

As in natural languages, where utterances are meaningful only if they follow a sentence grammar, RDF statements follow a simple and consistent three-part grammar of subject, predicate, and object. Analogously to paragraphs, RDF statements are aggregated into RDF graphs.

This is a posting from Thomas Baker on the W3C Library Linked Data exploratory group mailing list. It compares RDF to natural languages using analogies of grammar, words, sentences, and paragraphs. I think this is a useful way to think about RDF and linked data, although as an initial introduction to the topic, you might want to see the presentation below.
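To make the analogy concrete, here is a minimal sketch in plain Python of the three-part grammar and of statements aggregated into a graph. It is an illustration only, not how any RDF library models data: the `Triple` type, the `objects_of` helper, and the `example.org` URIs are hypothetical names made up for this example. The `isReferencedBy` property is shown with its Dublin Core Terms URI.

```python
from typing import List, NamedTuple, Set


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs, used only for illustration.
ARTICLE = "http://example.org/article/1"
BOOK = "http://example.org/book/42"
DOCUMENT_CLASS = "http://example.org/vocab/Document"

# Properties act a bit like verbs, classes a bit like nouns.
IS_REFERENCED_BY = "http://purl.org/dc/terms/isReferencedBy"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# A graph is an aggregation of statements, much as a paragraph
# is an aggregation of sentences. A set models it well because
# repeating a statement adds nothing new.
graph: Set[Triple] = {
    Triple(ARTICLE, RDF_TYPE, DOCUMENT_CLASS),
    Triple(BOOK, RDF_TYPE, DOCUMENT_CLASS),
    Triple(ARTICLE, IS_REFERENCED_BY, BOOK),
}


def objects_of(g: Set[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object asserted for the given subject and predicate."""
    return sorted(t.obj for t in g if t.subject == subject and t.predicate == predicate)


referencing = objects_of(graph, ARTICLE, IS_REFERENCED_BY)
```

Because every statement has the same shape, anyone can add new statements about the same URIs without changing a record structure.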

RDF For Librarians presentation recording

The RDF model underlying Semantic Web technologies is frequently described as the future of structured metadata. Its adoption in libraries has been slow, however. This is due in no small part to fundamental differences in the modeling approach that RDF takes, representing a "bottom up" architecture where a description is distributed and can be made up of any features deemed necessary, whereas the record-centric approach taken by libraries tends to be more "top down" relying on prespecified feature sets that all should strive to make the best use of. This presentation will delve deeply into the differences between these two approaches to explore why the RDF approach has proven difficult for libraries, look at some RDF-based initiatives that are happening in libraries and how they are allowing different uses of this metadata than was previously possible, and pose some questions about how libraries might best.

Jenn Riley gave this hour-long presentation to the Indiana University Digital Library Brown Bag earlier this month. A recording with the slides synchronized to the audio is available, as are the presentation slides and the handout from the session. I highly recommend spending an hour with this presentation to learn how linked data compares and contrasts with MARC records. (via Diane Hillmann)

The Future of the Federal Depository Libraries

[ProPublica's Dafna] Linzer's expose of government tampering with a court docket is an example of the problem on which the LOCKSS Program has been working for more than a decade, how to make the digital record resistant to tampering and other threats. The only reason this case was detected was because Linzer created and kept a copy of the information the government published, and this copy was not under their control. Maintaining copies under multiple independent administrations (i.e. not all under control of the original publisher) is a fundamental requirement for any scheme that can recover from tampering (and in practice from many other threats).

David Rosenthal summarizes a story about how a document published by the U.S. government was changed, and explains why we need highly distributed copies of government documents to detect and recover from tampering. There are big implications here for the future of government documents depository programs.
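The idea of detecting tampering by comparing independently held copies can be sketched in a few lines of Python. This is a deliberately simplified illustration that treats the most common digest as the reference version. It is not the LOCKSS polling protocol, and the holder names, the sample text, and the `find_divergent_copies` function are all hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of a document's bytes."""
    return hashlib.sha256(content).hexdigest()


def find_divergent_copies(copies: Dict[str, bytes]) -> Tuple[str, List[str]]:
    """Compare copies held by different parties.

    Returns the most common digest and the holders whose copy
    differs from it. Assumes a clear majority exists; a real
    system would need a far more careful agreement procedure.
    """
    digests = {holder: digest(content) for holder, content in copies.items()}
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    divergent = sorted(h for h, d in digests.items() if d != majority_digest)
    return majority_digest, divergent


# Hypothetical copies of one document under independent administrations.
original = b"Docket entry: motion filed"
copies = {
    "publisher": b"Docket entry: motion filed (revised)",
    "library_a": original,
    "library_b": original,
    "journalist": original,
}

reference_digest, divergent_holders = find_divergent_copies(copies)
```

With only the publisher's copy, there would be nothing to compare against, which is the point of the quoted passage: the comparison only works when some copies are outside the publisher's control.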

ProPublica’s Guide to Mechanical Turk

Amazon Mechanical Turk – or mTurk – is an online marketplace, set up by the online shopping site Amazon, where anyone can hire workers to complete short, simple tasks over the Internet. Amazon originally developed it as an in-house tool, and commercialized it in 2005. The mTurk workforce now numbers more than 100,000 workers in 200 countries, according to Amazon. At ProPublica, we use it for tasks like collecting, reformatting, and de-duplicating data. This is a guide to journalists looking to use Mechanical Turk in their data projects. It’s meant for users who are already familiar with mTurk and are looking for ways to improve their results.

Do you have repetitive digital conversion or analysis jobs that can be broken down into manageable chunks? ProPublica published this guide to using Amazon's Mechanical Turk service to outsource that kind of work.
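As a rough illustration of what "broken down into manageable chunks" might look like before any work is posted, the sketch below splits a list of records into small batches, each of which could become one short task. It does not call the Mechanical Turk API. The `chunked` helper, the record names, and the batch size of four are assumptions made for this example.

```python
from typing import Iterator, List, Sequence


def chunked(records: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


# Hypothetical records awaiting reformatting or de-duplication.
records = [f"record-{n}" for n in range(1, 11)]

# Each batch is small enough for one worker to finish quickly.
batches = list(chunked(records, 4))
```

Smaller batches are quicker for a worker to finish and cheaper to redo if a result looks wrong, at the cost of more tasks to manage.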