A Glimpse into the Internet Archive’s Scanning and Print-on-Demand Operations

Wired magazine published a brief story and online photo gallery about the book scanning and print-on-demand projects at the Internet Archive. It is a fascinating glimpse into the Archive’s vision and processes. Included below are cropped thumbnails and excerpts from the captions that accompanied the pictures in the Wired online gallery.

The book to be scanned sits in front of a technician underneath a V-shaped glass platter. Two opposing cameras, one angled at each page, photograph the book. On screen is the multipage view that the operator uses to verify the quality of the scans and the book’s pagination.
Scanning books into the Internet Archive’s custom-built Scribe Station is a manual process. Although automated page-turning machines exist, the Internet Archive has chosen the manual route because of the large number of extremely delicate, rare and valuable manuscripts it scans.
The book scanner uses off-the-shelf Canon hardware, including the EOS-1Ds Mark II and the EF 100mm f/2.8 macro lens. The newer systems use the EOS 5D instead of the 1Ds, which saves money in the short term. But, according to Internet Archive staff, the 5D fails much more frequently, resulting in increased maintenance costs.
At the start of every shift, the operator calibrates the color levels using a pair of color-calibration cards. When the scanning project first started, the Internet Archive attempted to color-correct the scanned pages to white, but later decided to capture and store them as they are, in their various aged shades of yellow. Preserving the oxidized tints makes the virtual viewing of old books more lifelike.
At the turn of the last century, fold-out illustrations were all the rage. These foldouts are cool to look at, but their size makes them a problem for scanning. When an operator comes across a foldout in a book, they scan the closed version and note the foldout in the Scribe software. Later, the foldout is captured on a separate scanner consisting of a camera mounted on a copy stand.
Soon, you’ll be able to print books found at the Internet Archive with this self-contained, fully automated book machine. Send it a PDF and it will print and bind it into a complete book. The process takes about 10 minutes depending on the size of the book, and costs $10 plus a penny per page.
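The pricing quoted above can be worked through as a quick calculation. The following is only a sketch, assuming the caption means a flat $10 charge plus one cent for every page; the function and constant names are made up for illustration and do not come from the Internet Archive or the book machine's software.

```python
# Hypothetical sketch of the pricing described in the caption:
# a flat $10 plus a penny per page. Amounts are kept in whole cents
# so the arithmetic stays exact.

BASE_FEE_CENTS = 1000  # assumed: the flat "$10" from the caption
PER_PAGE_CENTS = 1     # assumed: "a penny per page"


def print_cost_cents(pages: int) -> int:
    """Return the estimated cost in cents for a book with `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return BASE_FEE_CENTS + PER_PAGE_CENTS * pages


def format_dollars(cents: int) -> str:
    """Format a non-negative cent amount as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


if __name__ == "__main__":
    for pages in (100, 300, 500):
        print(pages, "pages:", format_dollars(print_cost_cents(pages)))
```

Under that reading, a 300-page book would come to $13.00, so the flat charge dominates the cost for all but very long books.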
Inside the book machine, the laser-printed pages are trimmed, then slathered with adhesive on what will become the book’s spine. The cover is then wrapped around the book. After another trim, out pops a custom-printed book ready for reading.
Instead of stacks of books, these archival volumes are now contained in racks of 160 terabyte boxes. Multiple redundant copies of the archive’s data are spread across servers all over the world.
Before entering the world of public-domain-promoting nonprofits, Robert Miller spent the last few decades at the top levels of various brick-and-mortar tech corporations. He is currently the director of books at the Internet Archive, and it’s his vision that drives the archive’s quest to digitize all public-domain knowledge and publish it online.

The text was modified to update a link from http://redjar.org/jared/blog/archives/2006/02/10/more-details-on-open-archives-scribe-book-scanner-project/ to http://web.archive.org/web/20061206025609/http://redjar.org/jared/blog/archives/2006/02/10/more-details-on-open-archives-scribe-book-scanner-project/ on November 13th, 2012.

(This post was updated on 13-Nov-2012.)