“We are scanning them to be read by an AI.”

Towards the end of the last chapter of his book, Nicholas Carr relates an anecdote about the visit of a guest speaker to the Google headquarters (emphasis added):

George Dyson, a historian of technology…, Freeman Dyson, was invited to Google’s headquarters in Mountain View, California, in October 2005 to give a speech at the party celebrating the sixtieth anniversary of von Neumann’s invention [of an electronic computer that could store in its memory the instructions for its use]. “Despite the whimsical furniture and other toys,” Dyson would later recall of his visit, “I felt I was entering a 14th-century cathedral — not in the 14th century but in the 12th century, while it was being built. Everyone was busy carving one stone here and another stone there, with some invisible architect getting everything to fit. The mood was playful, yet there was a palpable reverence in the air.” After his talk, Dyson found himself chatting with a Google engineer about the company’s controversial plan to scan the contents of the world’s libraries into its database. “We are not scanning all of those books to be read by people,” the engineer told him. “We are scanning them to be read by an [artificial intelligence engine].”


A Glimpse into the Internet Archive’s Scanning and Print-on-Demand Operations

Wired magazine published a brief story and online photo gallery of the book scanning and print-on-demand projects at the Internet Archive. It is a fascinating glimpse into their vision and processes. Included below are cropped thumbnails and part of the text captions that accompanied the pictures in the Wired online gallery.

The book to be scanned sits in front of a technician underneath a V-shaped glass platter. Two opposing cameras angled at each page take photos of the book. On screen is the multipage view that the operator uses to verify the quality of the scans and the book’s pagination.
Scanning books into the Internet Archive’s custom-built Scribe Station is a manual process. Although automated page-turning machines exist, Internet Archive has chosen to go the manual route due to the large amount of extremely delicate, rare and valuable manuscripts they scan.
The book scanner uses off-the-shelf Canon hardware including the EOS 1-Ds Mark II and the EF 100 mm f/2.8 macro lens. The newer systems use the 5-D instead of the 1-Ds, which saves money in the short term. But, according to Internet Archive staff, the 5-D fails much more frequently, resulting in increased maintenance costs.
At the start of every shift the operator calibrates the color levels using a pair of color-calibration cards. When the scanning project first started, Internet Archive attempted to color correct the scanned pages to white, but later decided to capture and store them as they are in their various aged shades of yellow. Preservation of the oxidized tints makes the virtual viewing of old books more lifelike.
At the turn of the last century, fold-out illustrations were all the rage. These foldouts are cool to look at, but present a problem for scanning due to their size. When an operator comes across one of these foldouts in a book, they scan the closed version and note the foldout in the Scribe software. Later, another scanner is used consisting of a camera mounted on a copy stand.
Soon, you’ll be able to print books found at the Internet Archive with this self-contained, fully automated book machine. Send it a PDF and it will print and bind it into a complete book. The process takes about 10 minutes depending on the size of the book, and costs $10 plus a penny per page.
Inside the book machine, the laser-printed pages are trimmed, then slathered with adhesive on what will become the book’s spine. The cover is then wrapped around the book. After another trim, out pops a custom-printed book ready for reading.
Instead of stacks of books, these archival volumes are now contained in racks of 160 terabyte boxes. Multiple redundant copies of the archive’s data are spread across servers all over the world.
Before entering the world of public-domain-promoting nonprofits, Robert Miller spent the last few decades at the top levels of various brick-and-mortar tech corporations. He is currently the director of books at the Internet Archive, and it’s his vision that drives the archive’s quest to digitize all public-domain knowledge and publish it online.

The text was modified to update a link from http://redjar.org/jared/blog/archives/2006/02/10/more-details-on-open-archives-scribe-book-scanner-project/ to http://web.archive.org/web/20061206025609/http://redjar.org/jared/blog/archives/2006/02/10/more-details-on-open-archives-scribe-book-scanner-project/ on November 13th, 2012.

New Blog for Ebooks in Libraries: “No Shelf Required”

Sue Polanka, head of reference and instruction at the main library of Wright State University, sent a message to the OhioLINK membership today about a new blog she is moderating called No Shelf Required:

No Shelf Required provides a forum for discussion among librarians, publishers, distributors, aggregators, and others interested in the publishing and information industry. The discussion will focus on the issues, concepts, current and future practices of Ebook publishing including: finding, selecting, licensing, policies, business models, usage (tracking), best practices, and promotion/marketing. The concept of the blog is to have open discussion, propose ideas, and provide feedback on the best ways to implement Ebooks in library settings. The blog will be a moderated discussion with timely feature articles and product reviews available for discussion and comment.

No Shelf Required will be moderated by Sue Polanka, Wright State University. The role of the moderator will be to articulate discussion topics, provide feature articles and product reviews, and ask poignant questions to the group in order to stimulate open discussion and collaborative learning about Ebooks. The moderator will also provide audio content in the form of interviews with librarians and those in the publishing industry.

The blog has been running for about a week, and already has topics like:

It sounds like it is going to be an interesting place to keep an eye on, particularly since ebooks can/could be a disruptive influence on library services. Good luck, Sue!

The text was modified to update a link from http://noshelfrequired.blogspot.com/ to http://www.libraries.wright.edu/noshelfrequired/ on November 6th, 2012.

The text was modified to update a link from http://noshelfrequired.blogspot.com/2008/02/should-ebooks-be-updated.html to http://www.libraries.wright.edu/noshelfrequired/2008/02/21/should-ebooks-be-updated/ on November 6th, 2012.

The text was modified to update a link from http://noshelfrequired.blogspot.com/2008/02/book-sales-increase-at-years-end.html to http://www.libraries.wright.edu/noshelfrequired/2008/02/20/book-sales-increase-at-year%E2%80%99s-end/ on November 6th, 2012.

The text was modified to update a link from http://noshelfrequired.blogspot.com/2008/02/ala-program-future-of-electronic.html to http://www.libraries.wright.edu/noshelfrequired/2008/02/19/ala-program-the-future-of-electronic-reference-publishing/ on November 6th, 2012.

The text was modified to update a link from http://noshelfrequired.blogspot.com/2008/02/xreferplus-credo-reference.html to http://www.libraries.wright.edu/noshelfrequired/2008/02/19/xreferplus-credo-reference/ on November 6th, 2012.

The text was modified to update a link from http://noshelfrequired.blogspot.com/2008/02/oxford-reference-online.html to http://www.libraries.wright.edu/noshelfrequired/2008/02/19/oxford-reference-online/ on November 6th, 2012.

The text was modified to update a link from http://noshelfrequired.blogspot.com/2008/02/greenwood-digital-collection.html to http://www.libraries.wright.edu/noshelfrequired/2008/02/19/greenwood-digital-collection/ on November 6th, 2012.

The text was modified to update a link from http://noshelfrequired.blogspot.com/2008/02/gale-virtual-reference-library_19.html to http://www.libraries.wright.edu/noshelfrequired/2008/02/19/gale-virtual-reference-library/ on November 6th, 2012.

Out of Print Books Get New Life via Amazon and Participating Libraries

Why settle for mere digital copies of books (a la the Google Book Search project and the Open Content Alliance) when you can have an edition printed, bound and sent to you in the mail? That’s the twist behind a recent partnership announced by Amazon.com, Kirtas Technologies, Emory University, University of Maine, Toronto Public Library, and the Public Library of Cincinnati and Hamilton County.

More information via C|Net News, The Chronicle of Higher Education (subscription required), and Inside Higher Ed. I’m putting this in the “Disruption in Libraries” category because it is an example of using a technical innovation to serve an un-served or under-served population — not only the digitization of books but also the ability to deliver a physical reproduction to the user. That aspect makes this program distinct from the others, and it is the first time that we’ve seen a glimpse of a reasonable business model: costs recovered and profits made that go back into the digitization program for new books. Since this is a non-exclusive agreement that puts the libraries in control, the texts can be made available freely online or available at a nominal cost to the user in a physical form.

[Update 20070704T0904 : Ack! I linked to the wrong Chronicle of Higher Ed article. Fixed now — thanks Jodi.]

The text was modified to update a link from http://news.emory.edu/Releases/KirtasPartnership1181162558.html to http://www.emory.edu/news/Releases/KirtasPartnership1181162558.html on November 13th, 2012.

Brewster Kahle on the Economics and Feasibility of Mass Book Digitization

Brewster Kahle, Director of the Internet Archive, was interviewed this week in a Chronicle of Higher Education podcast on the Economics and Feasibility of Mass Book Digitization. Among the many interesting points in the interview was that one of the biggest challenges to such a mass digitization effort is believing that digitizing massive numbers of books and making them available is actually possible. The Open Content Alliance has put together a suite of technology that brings down the cost for a color scan with OCR to 10 cents per page or about $30 per book. He then goes on to perform this calculation: the library system in the U.S. is a $12 billion industry. Digitizing one million books a year costs $30 million, or “a little less than .3 percent of one year’s budget of the United States library system would build a 1 million book library that would be available to anyone for free.” He also covers copyright concerns including the more liberal copyright laws in countries such as China.
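Kahle’s arithmetic is easy to check. The short Python sketch below uses only the figures quoted above (10 cents per page, $30 per book, one million books a year, a $12 billion annual budget); the function and constant names are my own, and the pages-per-book figure is simply what the two quoted prices imply, not a number from the interview.

```python
# Figures quoted in the interview summary above.
COST_PER_PAGE = 0.10                     # dollars per color-scanned, OCR'd page
COST_PER_BOOK = 30.00                    # dollars per book
BOOKS_PER_YEAR = 1_000_000               # books digitized per year
LIBRARY_BUDGET = 12_000_000_000          # dollars per year, U.S. library system


def implied_pages_per_book(cost_per_book, cost_per_page):
    """Average book length implied by the two quoted prices."""
    return round(cost_per_book / cost_per_page)


def annual_cost(books, cost_per_book):
    """Total yearly cost of digitizing the given number of books."""
    return books * cost_per_book


def share_of_budget(cost, budget):
    """Cost expressed as a percentage of the yearly budget."""
    return cost / budget * 100


pages = implied_pages_per_book(COST_PER_BOOK, COST_PER_PAGE)
cost = annual_cost(BOOKS_PER_YEAR, COST_PER_BOOK)
share = share_of_budget(cost, LIBRARY_BUDGET)

print(f"Implied pages per book: {pages}")
print(f"Annual cost: ${cost:,.0f}")
print(f"Share of library budget: {share:.2f}%")
```

The share works out to a quarter of one percent, which matches the “a little less than .3 percent” in the quote.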

Source: Audio: How Digital Book Collections Will Change Academe
Address: <http://chronicle.com/media/audio/v53/i30/khale/>
Date Visited: Fri Mar 30 2007 16:19:24 GMT-0400 (EDT)

Just In Time Acquisitions versus Just In Case Acquisitions

What if a service existed where the patrons selected an item they needed out of our library catalog and that item was delivered to the patron even when the library did not yet own the item? Would that be useful? With the growth of online bookstores, our users do have the expectation of finding something they need on the web, clicking a few buttons and having it delivered. When such expectations of what is possible exist, where is the first place a patron would go to find recently published items: the online bookstore or their local library catalog? Does your gut tell you it is the online bookstore? Would it be desirable if the patron’s instinct were to be the local library catalog?

A savvy patron looking for a recently published item will likely try the local library catalog to see if the item has been selected, purchased, received, cataloged, processed, and shelved (hereafter “SPRCPS”) by the staff — in other words, gone through the traditional process libraries use for acquiring items. If not, the patron has one of three choices (that I can think of):

  1. make a request for the item to be SPRCPS’d with a hold placed on the item so that the patron is notified when it is ready;
  2. start an Interlibrary Loan process to try to get the item from another site that has SPRCPS’d the item faster than your library; or
  3. pay a cost premium — buy the book themselves and have it delivered.

Looking at this from the perspective of elapsed time, #1 is likely many weeks, #2 is likely a few weeks, and #3 is likely a few days. Looking at this from the perspective of direct cost to the patron, #1 is the cheapest, #2 may be free or some nominal ILL transaction cost (depending on local policy), and #3 is the most expensive. All-in-all, reasonable tradeoffs.

But what if our libraries offered a service that had the speed of #3 and the cost of #1? Do you think that would be an appropriate service to our users?

Local Catalog Display

In my mind, such a system would be predicated on four factors. First, our local catalog would need to display some record of items that are not yet held but could be acquired on an expedited basis. If the savvy patron is going to start at the library catalog to determine if we already have the item, thereby executing the cheapest (no direct cost to the patron) and likely fastest (hop down to your local branch and pick it up) path to getting the item in hand, there needs to be a way to show patrons that the item could be added to the library’s collection on an expedited basis. Here in OhioLINK and with other similar consortial catalog systems, that expectation is already being set. “Can’t find the item in your local catalog? Push this button and see if it is available from one of our consortial members. If so, push this other button and we’ll transport it from that library to here for you.”

In terms of the mechanics of getting these records into our systems, it seems that we need a new form of MARC record load into our systems. The most likely source? How about what booksellers use now: the ONIX format “that publishers can use to distribute electronic information about their books to wholesale, e-tail and retail booksellers, other publishers, and anyone else involved in the sale of books.” A feed of ONIX records from publishers, filtered through selection criteria, converted into MARC21, and loaded into our local catalogs would do the trick.
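To make the filter, convert, and load idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the sample feed is invented and drastically simplified (its element names are modeled on ONIX 2.1 reference tags, but a real feed carries far more), the selection profile is just a list of subject-code prefixes, and the “MARC” output is a plain list of tag and subfield pairs rather than a real MARC21 record.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified sample feed (assumption: ONIX 2.1-style tags).
SAMPLE_ONIX = """\
<ONIXMessage>
  <Product>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <Title><TitleType>01</TitleType><TitleText>An Invented Title</TitleText></Title>
    <Contributor>
      <ContributorRole>A01</ContributorRole>
      <PersonNameInverted>Author, Example</PersonNameInverted>
    </Contributor>
    <BASICMainSubject>COM000000</BASICMainSubject>
    <Publisher><PublisherName>Example Press</PublisherName></Publisher>
  </Product>
  <Product>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000019</IDValue>
    </ProductIdentifier>
    <Title><TitleType>01</TitleType><TitleText>Another Invented Title</TitleText></Title>
    <BASICMainSubject>FIC000000</BASICMainSubject>
    <Publisher><PublisherName>Example Press</PublisherName></Publisher>
  </Product>
</ONIXMessage>
"""


def parse_products(xml_text):
    """Yield one flat dict per <Product> element in the feed."""
    root = ET.fromstring(xml_text)
    for product in root.iter("Product"):
        yield {
            "isbn": product.findtext("ProductIdentifier/IDValue"),
            "title": product.findtext("Title/TitleText"),
            "author": product.findtext("Contributor/PersonNameInverted"),
            "publisher": product.findtext("Publisher/PublisherName"),
            "subject": product.findtext("BASICMainSubject"),
        }


def matches_profile(record, subject_prefixes):
    """Hypothetical selection profile: keep records whose subject code
    starts with one of the prefixes chosen by the selectors."""
    subject = record["subject"] or ""
    return subject.startswith(tuple(subject_prefixes))


def to_marc_like(record):
    """Map a parsed record onto a few MARC21 tags (020, 100, 245, 260).
    Fields with no value in the feed are left out."""
    fields = [
        ("020", {"a": record["isbn"]}),
        ("100", {"a": record["author"]}),
        ("245", {"a": record["title"]}),
        ("260", {"b": record["publisher"]}),
    ]
    return [(tag, subs) for tag, subs in fields if all(subs.values())]


def build_catalog_load(xml_text, subject_prefixes):
    """Filter the feed through the profile and convert what survives."""
    return [
        to_marc_like(record)
        for record in parse_products(xml_text)
        if matches_profile(record, subject_prefixes)
    ]


for marc_record in build_catalog_load(SAMPLE_ONIX, ["COM"]):
    for tag, subfields in marc_record:
        print(tag, subfields)
```

A production load would presumably rely on an established ONIX-to-MARC crosswalk and a proper MARC library rather than a hand-rolled mapping like this one.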

Automated “Request This Item” Function

Second factor: a highly automated process to get the requested book to the library. Again, those familiar with OhioLINK and other similar consortial borrowing/lending systems know that there is a ubiquitous “Request This Item” button (“RTIB” hereafter) for objects that are not in the patron’s own library but can be requested from a consortial partner. In this new Just-In-Time acquisition based on the ONIX record in our catalog, that RTIB would need the addition of a second workflow: the buy-this-item-and-deliver-it-to-my-library workflow. Like a business-to-business transaction, the RTIB would trigger the purchase of the item to be expedited to the patron’s library.
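One way to picture the two workflows behind the button is the small Python sketch below. The class and function names are hypothetical, and so is the rule that a consortial loan is tried before a purchase; a real system would make that choice according to local policy.

```python
from dataclasses import dataclass
from enum import Enum


class Workflow(Enum):
    CONSORTIAL_LOAN = "request the item from a consortial partner"
    PURCHASE = "buy the item and deliver it to the patron's library"


@dataclass
class CatalogRecord:
    title: str
    held_by_partner: bool = False   # a consortial member owns a copy
    purchasable: bool = False       # record came from the ONIX feed


def route_request(record):
    """Decide which workflow a click on the RTIB should trigger."""
    if record.held_by_partner:
        # Existing workflow: borrow from a consortial partner.
        return Workflow.CONSORTIAL_LOAN
    if record.purchasable:
        # New workflow: trigger the business-to-business purchase.
        return Workflow.PURCHASE
    raise ValueError(f"No request workflow available for {record.title!r}")


requests = [
    CatalogRecord("Held elsewhere in the consortium", held_by_partner=True),
    CatalogRecord("Not yet owned by anyone", purchasable=True),
]
for record in requests:
    print(f"{record.title}: {route_request(record).value}")
```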

Speedy Copy Cataloging and Shelf Prep

Third factor: the item must get through copy cataloging and shelf prep quickly. When a RTIB item reaches the library loading dock, there must be a workflow and a commitment by copy catalogers and shelf prep staff to turn the item around in four hours for patron pick up. If the RTIB immediately buys the item from the distributor, the distributor turns it around for same-business-day shipping, and the item arrives on our doorstep via an expedited courier (no “library rate postage” here, please), then the only place where we have an influence on the time it takes to get the item into the hands of the user is right here, in our own technical services processes. And there are a number of short-cuts that can be made here as well:

  • Use “on the fly” circulation procedures to lend the book out immediately. When it is returned, route the item through technical services for formal copy cataloging (or decide that the ONIX data is acceptable as is for the copy cataloging).
  • Use a distributor that will deliver the item shelf-ready. Just yesterday, through an LISnews posting, I learned that Amazon is now one such distributor:

    Amazon offers a wide array of library processing options. In addition to mylar jackets on hard-cover books, Amazon also offers MARC records, spine labels, and barcodes. By partnering with leading cataloging companies and organizations, Amazon is also able to offer you highly customized MARC records, spine labels, and barcodes that meet your specific needs.

New Roles for Staff

The fourth factor is the hardest: the humans involved in the process. And I don’t think it is the patrons that would have as big of an issue with this Just-In-Time acquisitions process. Here in Ohio a user expectation exists to tolerate receipt of an item in 24 to 48 hours via consortial borrowing services. I think it will be the library staff who would need first convincing and then time to adjust to this new way of selecting and purchasing items. Some initial thoughts on new roles:

  • Selectors/bibliographers still have front-end work to do. They are the ones to tune the profile of “items that could be purchased” records (informed by the requesting patterns from users) that are loaded into the system and to buy items not yet requested that round out a collection. This is modestly akin to the approval plan systems we use now.
  • Copy cataloging staff may have a reduced workload for items that come in through the RTIB process — particularly if the distributor selected does much of the shelf prep and copy cataloging work already. This, too, is nothing new: we have been outsourcing more of our technical services work and assigning the copy cataloging and shelf prep staff to work on other areas of the collection.


Let’s take one more look at the traditional SPRCPS process and see how things would change under a Just-In-Time acquisitions model.

Selected: An initial round of selection is done by the bibliographers and collection managers. They decide which broad categories of ONIX records from publishers/distributors will be represented as “items-to-be-acquired” in the local catalog. Patrons then vote with their fingers and mouse clicks as to which items meet their needs.
Purchased: An entirely automated business-to-business transaction. Once the user decides the item is what they need, our library computer talks directly to the publisher/distributor computer and buys the item.
Received: The publisher/distributor doesn’t dally; they ship the item to us for next-day or second-day delivery. When it arrives in our mail room, we need to act fast.
Cataloged: Copy cataloging could be done by us, we could receive copy catalog records from the publisher/distributor, or we could decide, at least for now, that the ONIX data is good enough and that, like an “on-the-fly” transaction, the formal copy cataloging will happen after the item is returned.
Processed: Choices here, too. Will our staff do the shelf-prep work or is that something we contract with the publisher/distributor? In any case, quick processing here, too, because…
Shelved: …we want to get the item in the hands of the user who requested it. “Shelved” in this case could be the hold-pickup shelf, or it could be a local physical delivery service that sends the item to the patron.

Can we do this as fast as it would take the patron to get the item directly from the online bookseller? Maybe not — we do have some necessary processing steps that a direct patron purchase process does not have. Can we make that delay short enough so that the patron considers it acceptable as compared to the direct price premium of ordering it themselves?

Do we want to?

The text was modified to update a link from http://www.bisg.org/onix/index.html to http://www.bisg.org/activities-programs/activity.php?n=d&id=15&cid=2.

The text was modified to update a link from http://lisnews.org/article.pl?sid=06/08/01/023238 to http://lisnews.org/node/19233 on January 13th, 2011.