For the heart and soul of librarianship -- human description versus fulltext analytics

A non-librarian colleague forwarded a link to an essay by Mark Pesce called The Alexandrine Dilemma. From the context of one of the comments, I think it might have been the text of a keynote given at the New Librarians Symposium in Australia last month. It is a thought-provoking piece that, well, provoked some thoughts.

Some Corrections

As I was reading through it, my train of thought was tripped up by a couple of places in the text that needed correction. A comment describing these has been pending in the blog author's moderation queue for a couple of days, so I'm adding it here in case it gets lost there. These are minor points, and don't detract too much from Pesce's essay. First, Wikipedia's current content license is the GNU Free Documentation License, not a Creative Commons license (the full explanation is available on the Wikipedia Copyrights page). The GFDL is not the same as the Creative Commons licenses. The recently amended GFDL will allow Wikipedia to adopt the CC-BY-SA license (according to the GFDL FAQ). Wikimedia has a very complicated chart on what licenses can be used when reusing Wikimedia content, but seems to be migrating towards Creative Commons licenses in general.

Second, Google is not scanning the whole Harvard library; it is only scanning the public domain works at the library. After reviewing the terms of the settlement agreement with the book authors and book publishers, Harvard reaffirmed its decision to limit its participation to public domain books only. I think it is also important to note that the presentation of in-copyright books, as the essay describes it, should be in the future tense; the settlement agreement has not been finalized, and money from subscriptions to and purchases of books under the Google book reader interface is not yet being collected.

The Heart of the Matter

My colleague chose to pull out these three paragraphs as hooks to get me and others to read the article.

…the basic point is this: wherever data is being created, that’s the opportunity for library science in the 21st century. Since data is being created almost absolutely everywhere, the opportunities for library science are similarly broad. It’s up to you to show us how it’s done, lest we drown in our own creations.

The dilemma that confronts us is that for the next several years, people will be questioning the value of libraries; if books are available everywhere, why pay the upkeep on a building? Yet the value of a library is not the books inside, but the expertise in managing data. That can happen inside of a library; it has to happen somewhere. Libraries could well evolve into the resource the public uses to help manage their digital existence. Librarians will become partners in information management, indispensable and highly valued.

In a time of such radical and rapid change, it’s difficult to know exactly where things are headed. We know that books are headed online, and that libraries will follow. But we still don’t know the fate of librarians. I believe that the transition to a digital civilization will flounder without a lot of fundamental input from librarians. We are each becoming archivists of our lives, but few of us have training in how to manage an archive. You are the ones who have that knowledge. Consider: the more something is shared, the more valuable it becomes. The more you share your knowledge, the more invaluable you become. That’s the future that waits for you.

I don't question Pesce's premise that the more something is shared, the more valuable it becomes. In particular, I believe that is the basis of many of the arguments being made against OCLC's new Records Use Policy. What I do question is the notion that librarians are the privileged holders of the keys to the information management lock-box. I question his premise that library science has the way to manage information; that is, through careful examination, description and categorization of information. Library science, with these techniques, certainly has a way to deal with information overload, but it is not the only way.

One of Pesce's examples is the Google Book Search project. With it, Google is testing a very powerful paradigm: that it is easier to search than it is to sort. Or, more accurately, that it is becoming easier for computer algorithms to ferret out the information being sought than it is for library science practitioners to categorize it in a standard vocabulary. Pesce calls the reliance on computer algorithms "a beginner’s mistake" among "Google’s army of PhDs". I'd call it a fascinating experiment.

Google's Experiment

The concepts of Precision and Recall are well known in the library science field. Recall is a measure of effectiveness in retrieving (or selecting) performance and can be viewed as a measure of effectiveness in including relevant items in the retrieved set. Precision is a measure of purity in retrieval performance, a measure of effectiveness in excluding nonrelevant items from the retrieved set. ((Michael Buckland and Fredric Gey. "The Relationship between Recall and Precision" Journal of the American Society for Information Science. 45(1): 12-19.)) As a general rule, Precision and Recall are inversely related: if you construct a search geared towards high Precision, you lose Recall -- that is, you'll miss some of the relevant material that is out there. The opposite is also true: a high Recall search results in low Precision -- lots of extraneous stuff that you don't want. The key goal of any information retrieval system is to push the boundaries of Precision versus Recall as far as they will go.
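To make the two measures concrete, here is a minimal Python sketch. The function name `precision_recall` and the book identifiers are made up for illustration; the arithmetic is simply the standard definition (hits divided by the size of the retrieved set for Precision, hits divided by the size of the relevant set for Recall).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: the items the search returned
    relevant:  the items the searcher actually wanted
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four of the eight books are relevant to the query.
relevant = {"b1", "b2", "b3", "b4"}

# A narrow search returns only relevant books but misses half of them.
narrow = ["b1", "b2"]

# A broad search finds every relevant book, padded with as many irrelevant ones.
broad = ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The two toy searches show the trade-off described above: the narrow one is pure but incomplete, the broad one is complete but noisy.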

Traditional library science invokes the tools of descriptive surrogates for the item itself, specifically the use of controlled vocabularies. We have taxonomies of names (the most popular in North America is the Library of Congress Name Authority File). We also have rich ontologies for subjects and topics, be they general like the Library of Congress Subject Headings or specialized like the Medical Subject Headings. In all cases, a skilled 'descriptionist' (a.k.a. a cataloging librarian) distills the nature of the item into a descriptive record, at a level of characterization that the descriptionist judges will strike a balance for users between Precision and Recall.
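As a rough illustration of what a controlled vocabulary of names does, the Python sketch below maps variant forms of a name to a single authorized heading, so that a search under any variant lands on the same records. The `AUTHORITY` table, the variant list, and the function names are assumptions made for this example; a real authority file is far larger and carries much more structure than a dictionary.

```python
# Illustrative sample only: one heading with a few variant forms.
AUTHORITY = {
    "Twain, Mark, 1835-1910": [
        "Clemens, Samuel Langhorne",
        "Samuel Clemens",
        "Mark Twain",
    ],
}


def normalize(name):
    """Lowercase, drop commas, and collapse whitespace so forms compare equal."""
    return " ".join(name.lower().replace(",", " ").split())


def build_lookup(authority):
    """Index every heading and every variant under its authorized heading."""
    lookup = {}
    for heading, variants in authority.items():
        for form in [heading, *variants]:
            lookup[normalize(form)] = heading
    return lookup


def authorized_heading(name, lookup):
    """Return the authorized heading for a name, or None if it is unknown."""
    return lookup.get(normalize(name))


LOOKUP = build_lookup(AUTHORITY)
print(authorized_heading("clemens, samuel langhorne", LOOKUP))
print(authorized_heading("Mark Twain", LOOKUP))
print(authorized_heading("Melville, Herman", LOOKUP))  # not in the sample table
```

Collapsing variants onto one heading is how the descriptive approach recovers Recall without giving up Precision, at the cost of a person having to build and maintain the table.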

Google, on the other hand, relies on an analysis of the text of the item itself to be the descriptive surrogate. It employs algorithms that look at every word -- perhaps even every concept ((See “We are scanning them to be read by an AI.” for a discussion around Google's desire to scan books to be read by an artificial intelligence engine.)) -- in an item and weight it relative to the words and terms in other items in the corpus. I doubt that this technique is new, but the scale at which Google is applying it -- to all knowledge recorded in all available books in major world libraries -- is definitely new.
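Google's actual algorithms are not public, so the following is only a sketch of the general idea of weighting a word in one item relative to the rest of the corpus. It uses TF-IDF, a textbook weighting scheme swapped in here for illustration; the three-item corpus and the function name `tf_idf` are invented for the example.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Weight each term in each item relative to the whole corpus.

    corpus: dict mapping an item id to its full text.
    Returns a dict mapping item id -> {term: weight}.
    """
    tokenized = {item: text.lower().split() for item, text in corpus.items()}
    n_items = len(tokenized)

    # Document frequency: in how many items does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    weights = {}
    for item, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[item] = {
            # Frequent in this item and rare elsewhere gives a high weight.
            term: (count / len(tokens)) * math.log(n_items / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical miniature corpus standing in for scanned books.
corpus = {
    "whales": "the whale hunts the sea",
    "gardens": "the garden and the rose",
    "ships": "the ship sails the sea",
}

weights = tf_idf(corpus)
for term, weight in sorted(weights["whales"].items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.3f}")
```

In this toy corpus "the" appears in every item and so gets a weight of zero, while "whale" outranks "sea" because it appears in only one item. No human described any of the items; the text itself served as the surrogate.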

Descriptionists versus fulltext analytics

So which is better: the descriptionist technique or the fulltext-analytics technique? For hundreds of years, the descriptionist technique was undoubtedly the best, and this seems to be the thesis of Pesce's article. But the world is changing. Have the recent availability of cheap computing power and the invention of new algorithms shifted the balance towards the fulltext-analytics technique? Google seems to think so. At the very least, they have plunked down a sizable chunk of cash to test out that proposition.

The best answer is probably some combination of human description and automated analysis, and that is a puzzle no computer will be able to solve. I think the next challenge for the field of library science is to explore the balance between descriptionists and fulltext analytics, and to employ the most rational and cost-effective uses of both. Understanding these techniques and balancing them is the dilemma that confronts us.

The text was modified to update a link from http://www.loc.gov/cds/lcsh.html to http://en.wikipedia.org/wiki/Library_of_Congress_Subject_Headings on November 17th, 2010.