
For the heart and soul of librarianship — human description versus fulltext analytics

A non-librarian colleague forwarded a link to an essay by Mark Pesce called The Alexandrine Dilemma. From the context of one of the comments, I think it might have been the text of a keynote given at the New Librarians Symposium in Australia last month. It is a thought-provoking piece that, well, provoked some thoughts.

Some Corrections


As I was reading through it, my train of thought was tripped up by a couple of points in the text that needed correction. A comment describing these has been pending in the blog author’s moderation queue for a couple of days, so I’m adding it here in case it gets lost there. These are minor points, and don’t detract too much from Pesce’s essay. First, Wikipedia’s current content license is the GNU Free Documentation License, not a Creative Commons license (the full explanation of the Wikipedia Copyrights is available). The GFDL is not the same as the Creative Commons licenses. The recently amended GFDL will allow Wikipedia to adopt the CC-BY-SA license (according to the GFDL FAQ). Wikimedia has a very complicated chart on what licenses can be used when reusing Wikimedia content, but seems to be migrating towards Creative Commons licenses in general.

Second, Google is not scanning the whole Harvard library; it is only scanning the public domain works at the library. After reviewing the terms of the settlement agreement with the book authors and book publishers, Harvard reaffirmed its decision to limit its participation to public domain books only. I think it is also important to note that the presentation of in-copyright books as you described should be in the future tense; the settlement agreement has not been finalized, and money from subscriptions to and purchases of books under the Google book reader interface is not yet being collected.

The Heart of the Matter


My colleague chose to pull out these three paragraphs as hooks to get me and others to read the article.

…the basic point is this: wherever data is being created, that’s the opportunity for library science in the 21st century. Since data is being created almost absolutely everywhere, the opportunities for library science are similarly broad. It’s up to you to show us how it’s done, lest we drown in our own creations.

The dilemma that confronts us is that for the next several years, people will be questioning the value of libraries; if books are available everywhere, why pay the upkeep on a building? Yet the value of a library is not the books inside, but the expertise in managing data. That can happen inside of a library; it has to happen somewhere. Libraries could well evolve into the resource the public uses to help manage their digital existence. Librarians will become partners in information management, indispensable and highly valued.

In a time of such radical and rapid change, it’s difficult to know exactly where things are headed. We know that books are headed online, and that libraries will follow. But we still don’t know the fate of librarians. I believe that the transition to a digital civilization will flounder without a lot of fundamental input from librarians. We are each becoming archivists of our lives, but few of us have training in how to manage an archive. You are the ones who have that knowledge. Consider: the more something is shared, the more valuable it becomes. The more you share your knowledge, the more invaluable you become. That’s the future that waits for you.

I don’t question Pesce’s premise that the more something is shared, the more valuable it becomes. In particular, I believe that is the basis of many of the arguments being made against OCLC’s new Records Use Policy. What I do question is the notion that librarians are the privileged holders of the keys to the information management lock-box. I question his premise that library science has the one way to manage information; that is, through careful examination, description and categorization of information. Library science, with these techniques, certainly has a way to deal with information overload, but it is certainly not the only way.

One of Pesce’s examples is the Google Book Search project. With it, Google is testing a very powerful paradigm: that it is easier to search than it is to sort. Or, more accurately, that it is becoming easier for computer algorithms to ferret out the information being sought than it is for library science practitioners to categorize it in a standard vocabulary. Pesce calls the reliance on computer algorithms “a beginner’s mistake” among “Google’s army of PhDs”. I’d call it a fascinating experiment.

Google’s Experiment


The measures of Precision and Recall are well known in the library science field. Recall is a measure of effectiveness in retrieving (or selecting) performance and can be viewed as a measure of effectiveness in including relevant items in the retrieved set. Precision is a measure of purity in retrieval performance, a measure of effectiveness in excluding nonrelevant items from the retrieved set.[1] As a general rule, Precision and Recall are inversely related: if you construct a search geared towards high Precision, you lose Recall; that is, you’ll miss some of the relevant material that is out there. The opposite is also true: a high Recall search results in low Precision, with lots of extraneous stuff that you don’t want. The key goal of any information retrieval system is to push the boundaries of Precision versus Recall as far as they will go.
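To make the trade-off concrete, here is a minimal sketch in Python that computes both measures from a set of retrieved items and a set of relevant items. The function name, the item numbers and the two example searches are hypothetical, chosen only to illustrate the definitions above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection of items numbered 1-10; items 1-4 are relevant.
relevant = {1, 2, 3, 4}

# A narrow search: everything it returns is relevant, but it misses half.
narrow = precision_recall({1, 2}, relevant)

# A broad search: it finds every relevant item, plus four extraneous ones.
broad = precision_recall({1, 2, 3, 4, 5, 6, 7, 8}, relevant)

print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```

The narrow search is all signal but incomplete; the broad search is complete but half noise. A better retrieval system is one that raises both numbers at once.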

Traditional library science invokes the tools of descriptive surrogates for the item itself, specifically the use of controlled vocabularies. We have taxonomies of names (the most popular in North America is the Library of Congress Name Authority File). We also have rich ontologies for subjects and topics, be they general like the Library of Congress Subject Headings or specialized like the Medical Subject Headings. In all cases, a skilled ‘descriptionist’ (a.k.a. a cataloging librarian) distills the nature of the item into a descriptive record, at a level of characterization that the descriptionist expects will strike a balance for users between Precision and Recall.

Google, on the other hand, relies on an analysis of the text of the item itself to be the descriptive surrogate. It employs algorithms that look at every word (perhaps even every concept[2]) in an item and weight it relative to the words and terms in other items in the corpus. I doubt that this technique is new, but the scale at which Google is applying it (to all knowledge recorded in all available books in major world libraries) is definitely new.
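Google’s actual algorithms are not public, so the following is only a sketch of one classic way to weight a word relative to the rest of a corpus: TF-IDF (term frequency times inverse document frequency). It is offered as an illustration of the general idea, not as Google’s method. The three-sentence corpus and the function name are made up for the example.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: weight} dict per document in the corpus.

    A word scores high in a document when it is frequent there (tf)
    and rare across the other documents (idf).
    """
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each word.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    weights = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical three-"book" corpus.
corpus = [
    "the whale hunts the sea",
    "the library catalogs the book",
    "the book describes the whale",
]
weights = tf_idf(corpus)

# "the" appears in every document, so it carries no weight at all.
print(weights[0]["the"])  # 0.0
# "sea" appears only in the first document, so it outweighs "whale",
# which also appears in the third.
print(weights[0]["sea"] > weights[0]["whale"])  # True
```

No human assigned a subject heading here; the weights fall out of word counts alone. That is the sense in which the full text acts as its own surrogate.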

Descriptionists versus fulltext analytics


So which is better: the descriptionist technique or the fulltext-analytics technique? For hundreds of years, the descriptionist technique was undoubtedly the best, and this seems to be the thesis of Pesce’s article. But the world is changing. Has the recent availability of cheap computing power and the invention of new algorithms shifted the balance towards the fulltext-analytics technique? Google seems to think so. At the very least, they have plunked down a sizable chunk of cash to test out that proposition.

The best answer is probably some combination of human description and automated analysis, and finding that combination is a puzzle no computer will be able to solve. I think the next challenge for the field of library science is to explore the balance between descriptionists and fulltext analytics in order to employ the most rational and cost-effective uses of both. Understanding these techniques and balancing them is the dilemma that confronts us.

Footnotes

  1. Michael Buckland and Fredric Gey. “The Relationship between Recall and Precision.” Journal of the American Society for Information Science. 45(1): 12-19.
  2. See “We are scanning them to be read by an AI.” for a discussion around Google’s desire to scan books to be read by an artificial intelligence engine.

3 Comments

  1. Ron Murray | January 8, 2009 at 2:17 pm | Permalink

    The descriptionist/fulltext analytics (perhaps better thought of as computationalist – that way we can deal with certain aspects of image and sound description) question is indeed an interesting one to pose.

    I think the computationalists can do a decent job of relating their assumptions to a broad range of theories in computer and cognitive sciences, statistics, and linguistics.

    I should like to see the descriptionists do likewise in attaching their assertions to fields of inquiry beyond that of the library (but amply represented within the library).

    Descriptionists might try avoiding the easy reach for their philosophical roots and pay a great deal more attention to psychology, linguistics, anthropology, and sociology.

    Each of these fields has much to say about the kinds of behavior indulged in when one describes a Resource for oneself or on behalf of others.

  2. the Jester | January 9, 2009 at 11:24 am | Permalink

    Ron –

    I like the term computationalist because, as you suggest, computer analysis can be performed on more than just text. The terms ‘descriptionist’ and ‘fulltext analytics’ went through several iterations in drafts of this post, and I was never quite comfortable with them. Perhaps my remaining doubts are that the terms relate to the actors doing the description. The term ‘descriptionist’ (in my mind, at least) implies the human element. A ‘computationalist’ is a human actor as well, but one that creates the algorithms. It is the execution of the algorithms themselves that generates the description — akin to the activity that the descriptionist performs.

    That the computationalist and the descriptionist bring different behaviors to the act of creating a surrogate for the target item is why I think a combination of the two is important.

  3. the Jester | January 16, 2009 at 12:30 pm | Permalink

    A colleague sent a private note about the term computationalist and the implied relation to the field of ‘computational linguistics’. That is an interesting observation because it starts to get at the meaning behind the strings of words that Google is digitizing — or the “concepts” as I say in the original post. I wish I had more time to study the intersection of these two areas — the descriptionist and the computationalist. This seems like a rich area of exploration.

From the Disruptive Library Technology Jester (http://dltj.org/), printed on Friday the 12th of March 2010 at 8:22:48 AM EST (-0500). The URL to this page is http://dltj.org/article/human-description-vs-fulltext-analytics/

[Creative Commons Logo] This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.