At ALA Midwinter, ALCTS sponsored a panel discussion about sharing library-created data inside and outside the library community, with a particular focus on cataloging data. I was honored to be asked to speak on the topic from the perspective of a consortial office. This is the first in a series of posts that represents an approximation of what I said on the panel. (Also be sure to read the summary of the session by Norman Oder in Library Journal.)
I think it is important to step back and reflect on the nature of what we are talking about. We build bibliographic records as surrogates for the desired object, meaning that the surrogate is a means to an end – retrieving the described object – and not an end unto itself. We build indexes of these surrogates for patrons to use to discover information. All other factors held constant, the better the surrogate, the greater the chance the user will find the information they are seeking. The following discussion looks at the sources of records, the way they are built, and what it means to try to share them.
Sources of Records
The most familiar surrogates are the records generated by humans, and in our field AACR2 cataloging encoded in MARC is the most common form. There are many sources of human-generated cataloging records. For academic libraries, the most obvious is OCLC, but it is not the only one. Integrated library systems include Z39.50 clients that can search remote catalogs and import the resulting MARC records into the local catalog. The Library of Congress catalog and the OhioLINK library catalog can be used in this fashion. Records can also be purchased from vendors. One emerging source is the recently announced ‘‡biblios.net’ from LibLime. The Open Library project of the Internet Archive could conceivably also be a source of cataloging records, although such use is not the goal of the project.
The humans creating these surrogate records are typically called “catalogers,” although I’m coming to prefer the term descriptionists as a more accurate portrayal of their activity (if for no other reason than that the description activity engaged in by these professionals extends beyond what we traditionally consider “the catalog”). Using the tools of taxonomies and ontologies, the descriptionist creates the surrogate of the item to put into the catalog systems.
[Video: Google Tech Talk by David Weinberger about his book 'Everything is Miscellaneous', starting at 29 minutes and 25 seconds into the talk]
There is another way emerging, however, to create these surrogates: through computer algorithmic computation against the object itself. In his book Everything is Miscellaneous, David Weinberger talks about the fungible nature of metadata and data: “metadata” is what we know and “data” is what we want to find out. In a talk given at Google in May 2007, he gave an example of using something known — like a quote from a book — to find out something — like the author of that book. (Skip to 29 minutes and 25 seconds into the video.) The “metadata” (the quote) was used to find the “data” (the author) that was being sought.
This is Google’s big experiment in computational analytics. It is relying on an analysis of the text of the item itself to be the descriptive surrogate. It employs algorithms that look at every word — perhaps even every concept — in an item and weights it relative to those words and terms in other items in the corpus.
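To make the idea of weighting a word relative to the rest of the corpus concrete, here is a minimal sketch using TF-IDF, a textbook term-weighting scheme. I am not claiming this is what Google actually runs; the function names and the three tiny "items" are invented for illustration.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase the text and split on whitespace, dropping edge punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def tf_idf(corpus):
    """Return {item_id: {term: weight}} for a corpus given as {item_id: text}."""
    tokens = {item_id: tokenize(text) for item_id, text in corpus.items()}
    # Document frequency: in how many items does each term appear?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    n_items = len(corpus)
    weights = {}
    for item_id, words in tokens.items():
        counts = Counter(words)
        # Frequent in this item, rare elsewhere => high weight.
        weights[item_id] = {
            term: (count / len(words)) * math.log(n_items / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

# Hypothetical three-item corpus standing in for the full text of books.
corpus = {
    "whale": "the whale and the sea and the whale ship",
    "garden": "the garden and the rose",
    "ship": "the ship and the harbor",
}
weights = tf_idf(corpus)
top_term = max(weights["whale"], key=weights["whale"].get)
print(top_term)
```

Words that appear in every item, such as "the", get a weight of zero, while words concentrated in one item rise to the top and become candidate index points for it.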
Amazon has done something similar for years with its “Key Phrases” techniques. For Amazon, Capitalized Phrases are people, places, events, or terms mentioned in a book. Statistically Improbable Phrases are the most distinctive phrases in a book as compared against the corpus of texts in its catalog. These become a form of index points – the surrogates – for finding the item.
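Amazon has not, as far as I know, published the details of how it computes these phrases, so the following is only a rough sketch of the general idea: score each two-word phrase in a book by how much more often it occurs there than in a background corpus. The function names, the smoothing choice, and the sample texts are all my own assumptions.

```python
from collections import Counter

def bigrams(text):
    """Return the list of adjacent word pairs in the text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def improbable_phrases(book, background_texts, top_n=3):
    """Rank a book's two-word phrases by how unusual they are in the background."""
    book_counts = Counter(bigrams(book))
    background = Counter()
    for text in background_texts:
        background.update(bigrams(text))
    book_total = sum(book_counts.values())
    if book_total == 0:
        return []
    background_total = sum(background.values())
    scores = {}
    for phrase, count in book_counts.items():
        book_rate = count / book_total
        # Add-one smoothing so phrases unseen in the background do not divide by zero.
        background_rate = (background[phrase] + 1) / (background_total + 1)
        scores[phrase] = book_rate / background_rate
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [" ".join(phrase) for phrase in ranked[:top_n]]

# Hypothetical sample texts.
book = "the white whale swam and the white whale dove in the sea"
background_texts = [
    "the sea was calm and the ship was in the harbor",
    "the garden was in the sun",
]
phrases = improbable_phrases(book, background_texts, top_n=2)
print(phrases)
```

Phrases common across the whole corpus, such as "in the", score low, while phrases that recur in this book and nowhere else score high.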
Which of these is better, human-generated or machine-generated, is undoubtedly a topic of research and debate; the answer depends on effectiveness, as measured by precision and recall, and on the relative cost to create. For those of us more familiar with the ways of the descriptionists, however, it is time to become familiar with the ways of the computationalists in order to understand the strengths of each.
Ownership of records
Our surrogate records contain two varieties of data: recitation of facts and efforts of creativity. Under U.S. intellectual property law, facts are not creative works and therefore are not covered by copyright. (This is the common interpretation of Feist v. Rural, a 1991 Supreme Court case that determined that telephone directories are not creative works and therefore are not afforded copyright protection.) The legal status of our surrogate records is somewhat murkier, though. While the inclusion of facts such as title, author, and publisher is not a creative act, the assignment of classification numbers and subject headings could be a creative act, and the person doing the creation would hold copyright in those contributions.
Ownership of the data in records is even cloudier than what is outlined above. By my estimation, we are entering a world in which there are four types of attributes that make up our bibliographic records. The first is the recitation of the facts, whether copied from the item-in-hand or obtained as a feed of information from publishers. The second is the creative work of the descriptionist in adding value to the surrogate in the form of classification numbers, subject headings, abstracts, and the like. The third is the additions that the computationalists (or, more specifically, their algorithms) bring to the object’s description in the surrogate. This can take the form of the previously discussed Google Book Search algorithms as well as Amazon’s Capitalized Phrases and Statistically Improbable Phrases. It also includes actions specific to our traditional domain of human-generated surrogates; for instance, the algorithms OCLC runs across the WorldCat dataset to merge and improve records. And finally, thinking beyond the boundaries typical of AACR2 and MARC, there is a fourth type: user-contributed information. The creative efforts of users to add tags, reviews, summaries, and commentary add to the surrogate, and with that we kick wide open a door to the problem of who “owns” these surrogate records.
The issue of who “owns” the user-contributed information in records is handled in a wide variety of ways. In LibraryThing, for instance, the creator retains copyright over tags, reviews, summaries, and comments, and grants LibraryThing non-exclusive rights to make use of that creative effort. LibraryThing also has a function where users can add “Common Knowledge” (character names and occupations, locations, honors/awards, and quotations, among other details) about the item. Users add Common Knowledge augmentations under a Creative Commons “By, Share-Alike” license. The Open Library project asserts no rights over the data in its system; it declares the material in the Open Library database to be facts, and therefore in the public domain (at least under U.S. copyright law). The newly announced ‡biblios.net repository of records uses the Open Data Commons Public Domain Dedication and License, developed by Talis in the UK and Creative Commons. It, too, asserts that the subject data is in the public domain and offers suggested community norms surrounding the use of the data.
Ownership of records is also a continuing source of discussion among members of the OCLC cooperative. If a brief record from a publisher is added into the WorldCat database and is subsequently enhanced by one or more members, who “owns” the record? If a library systematically adds enhancements to records that matter to its own community – enhancements such as paper and binding types – who “owns” those records?
One way to handle this variety of who (if anyone) owns portions of the surrogate is to split records based on who contributed what. This would enable consuming systems to make decisions on what data to include based on the conditions the owners might put on their creative acts. (Keep in mind that legal precedent would seem to indicate that the facts expressed in the surrogate couldn’t be protected by copyright restrictions.) But our existing record structures do not give us a way to do this, nor does there seem to be any work going on to enable this to happen. Perhaps we are all just burying our collective heads in the sand and hoping it will go away.
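As a thought experiment, here is a minimal sketch of what a record split by contributor might look like, and how a consuming system could filter it. Nothing like this exists in MARC today; the `Statement` structure, the contributor names, and the license labels are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    """One attribute of a surrogate record, tagged with who contributed it."""
    field: str
    value: str
    contributor: str
    kind: str     # "fact", "descriptive", "computed", or "user"
    license: str  # hypothetical license label attached by the contributor

def usable_statements(record, accepted_licenses):
    """Keep facts unconditionally, plus anything under an accepted license."""
    return [
        s for s in record
        if s.kind == "fact" or s.license in accepted_licenses
    ]

# A hypothetical record mixing the four types of attributes discussed above.
record = [
    Statement("title", "Moby Dick", "publisher feed", "fact", "none"),
    Statement("subject", "Whaling -- Fiction", "member library",
              "descriptive", "restricted"),
    Statement("key phrase", "white whale", "vendor algorithm",
              "computed", "restricted"),
    Statement("tag", "sea stories", "patron", "user", "cc-by-sa"),
]

kept = usable_statements(record, accepted_licenses={"public-domain", "cc-by-sa"})
print([s.field for s in kept])
```

The consuming system here accepts only public-domain and share-alike contributions, so it keeps the title and the user's tag and drops the two restricted statements. Facts pass through regardless of label, following the reading of Feist described earlier.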
More to come...
Part 2 reflects on the implication of restrictive licenses, with a focus on the now-withdrawn OCLC records use policy, and a call for open discussion.
The text was modified to update a link from https://biblios.net/ to http://biblios.net/ on February 11th, 2011.
The text was modified to update a link from https://biblios.net/pddl to http://biblios.net/pddl on February 11th, 2011.
The text was modified to update a link from http://video.google.com/videoplay?docid=2159021324062223592 to http://www.youtube.com/watch?v=WHeta_YZ0oE on August 22nd, 2013.