JPEG2000 to Zoomify Shim — Creating JPEG tiles from JPEG2000 images

This is a textual representation of a lightning talk given on Feb 26th at Code4Lib 2008. When the video of the talk is up (thanks, Noel!) I’ll link it here, too. The video is now available, and that article includes an update on progress since this article was posted.

OhioLINK has a collection of JPEG2000 images as an access format that were generated for use in our DLXS-based content system. We are in the process of migrating those collections to DSpace and were looking for a mechanism to leverage the existing JPEG2000 files and not have to generate new derivatives. We are also considering the use of JPEG2000 as a preservation format, and would find it attractive to use the same image format for both access copies and preservation copies. We looked at Zoomify, but to perform its scaling function it generates JPEG tiles at several resolutions, and storing those tiles can triple or quadruple disk space requirements. Or, one could use the ‘enterprise’ version of Zoomify and its proprietary PFF format, or the equally proprietary MrSID format. We didn’t want to be locked into either of these scenarios. Our solution is to create a web application that mimics the directory-of-JPEG-tiles approach but dynamically generates the tiles out of a JPEG2000 master.

The free version of Zoomify reads JPEG tiles out of a directory structure that looks like this:

  • /ImageProperties.xml: Includes descriptive elements of the source image like height, width, and tile size.
  • /TileGroup0/0-0-0.jpg: The highest power-of-2 zoom-out level, which creates an image with dimensions less than 256×256.
  • /TileGroup0/1-0-0.jpg: The tile at the upper left corner of the first power-of-2 zoom level.
  • /TileGroup0/1-1-0.jpg: The tile to the right of 1-0-0.jpg.

The shim mimics that directory structure. It parses the URL of the request and dynamically creates the appropriate JPEG tile (or metadata file) out of the JPEG2000 image.
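
To make that mapping concrete, below is a minimal sketch (not the shim’s actual code) of the arithmetic that turns a Zoomify tile name such as “2-3-1” (zoom level, column, row) into a region of the full-resolution image. The class and method names are hypothetical, and it assumes Zoomify’s standard 256-pixel tiles and power-of-2 zoom levels; the JPEG2000 codec is then asked to decode only that region at the matching reduction factor, which is what makes on-the-fly tile generation practical.

// Hypothetical sketch: translate a Zoomify tile name into a source-image region.
// Assumes 256x256 tiles and power-of-2 zoom levels; not the actual shim code.
import java.awt.Rectangle;

public class ZoomifyTileMath {

    static final int TILE_SIZE = 256;

    // tileName is "level-column-row", e.g. "2-3-1" from ".../TileGroup0/2-3-1.jpg";
    // maxLevel is the zoom level at which the image is at full resolution.
    public static Rectangle tileRegion(String tileName, int fullWidth, int fullHeight, int maxLevel) {
        String[] parts = tileName.split("-");
        int level = Integer.parseInt(parts[0]);
        int col   = Integer.parseInt(parts[1]);
        int row   = Integer.parseInt(parts[2]);

        // Each step down from maxLevel halves the image, so one tile at this level
        // covers TILE_SIZE * 2^(maxLevel - level) pixels of the full-resolution image.
        int scale = 1 << (maxLevel - level);
        int x = col * TILE_SIZE * scale;
        int y = row * TILE_SIZE * scale;
        int w = Math.min(TILE_SIZE * scale, fullWidth - x);
        int h = Math.min(TILE_SIZE * scale, fullHeight - y);
        return new Rectangle(x, y, w, h);
    }

    public static void main(String[] args) {
        // Tile "2-3-1" of a 4096x3072 image; 4096/256 = 16 = 2^4, so maxLevel is 4.
        System.out.println(tileRegion("2-3-1", 4096, 3072, 4));
    }
}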

The Code

The JPEG2000 for Zoomify shim requires Java 1.5 or greater. It does not require a servlet engine; rather, it uses the Restlet library to perform as a stand-alone application. The OneJar library allows the Java classes and required dependencies to be bundled into a single JAR file. We’re using the Kakadu Software JPEG2000 library to perform the on-the-fly decoding of JPEG2000 images. Kakadu is a commercial JPEG2000 codec, although inexpensive licenses are available for not-for-profit activity. We are using the Enterprise version of Zoomify, a Flash-based image viewer, although I believe the free version will work as well. (You’ll need the Enterprise version to be able to modify and adapt the appearance of the Zoomify applet.) The same techniques can also be used for other Flash applets and probably even JavaScript-based viewers (a la Google Maps).
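
The real shim builds on Restlet, but since the exact Restlet wiring is beside the point here, the sketch below uses the JDK’s built-in HttpServer (Java 6 and later) as a stand-in to show the general shape of the stand-alone application: listen on a port, parse the Zoomify-style path, and hand the tile coordinates to the JPEG2000 codec. The port, context path, and image dimensions are example values, and the Kakadu decoding step is reduced to a placeholder comment.

// Illustrative stand-in only: the real shim uses the Restlet library, but this
// sketch uses the JDK's built-in HttpServer to show how Zoomify-style request
// paths could be parsed and dispatched. Port, paths, and dimensions are examples.
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class ZoomifyShimServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8182), 0);
        server.createContext("/zoomify/", new HttpHandler() {
            public void handle(HttpExchange exchange) throws IOException {
                // Expected paths:
                //   /zoomify/{imageId}/ImageProperties.xml
                //   /zoomify/{imageId}/TileGroup{n}/{level}-{column}-{row}.jpg
                String path = exchange.getRequestURI().getPath();
                byte[] body;
                String contentType;
                if (path.endsWith("ImageProperties.xml")) {
                    // In the real shim the width, height, and tile count come from the JP2 header;
                    // these are example values for a 4096x3072 image.
                    body = ("<IMAGE_PROPERTIES WIDTH=\"4096\" HEIGHT=\"3072\" NUMTILES=\"257\""
                            + " NUMIMAGES=\"1\" VERSION=\"1.8\" TILESIZE=\"256\" />").getBytes("UTF-8");
                    contentType = "text/xml";
                } else {
                    String tileName = path.substring(path.lastIndexOf('/') + 1).replace(".jpg", "");
                    // ... ask the JPEG2000 codec (Kakadu, in our case) for just the region and
                    // reduction level implied by tileName, then encode that region as a JPEG ...
                    body = new byte[0]; // placeholder for the encoded JPEG tile
                    contentType = "image/jpeg";
                }
                exchange.getResponseHeaders().set("Content-Type", contentType);
                exchange.sendResponseHeaders(200, body.length);
                OutputStream out = exchange.getResponseBody();
                out.write(body);
                out.close();
            }
        });
        server.start();
        System.out.println("Zoomify shim listening at http://localhost:8182/zoomify/");
    }
}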

The source code is available from the OhioLINK DRC source code repository (Subversion access). We plan to integrate it into DSpace 1.5 as part of the Ohio Digital Resource Commons, and I may create a Fedora disseminator to serve up the tiles as well.

Thanks go out to Keith Gilbertson and John Davison on the OhioLINK staff for their help in making this work as well as Stu Hicks and François d’Erneville for being a sounding board for these ideas.

New Blog for Ebooks in Libraries: “No Shelf Required”

Sue Polanka, head of reference and instruction at the main library of Wright State University, sent a message to the OhioLINK membership today about a new blog she is moderating called No Shelf Required:

No Shelf Required provides a forum for discussion among librarians, publishers, distributors, aggregators, and others interested in the publishing and information industry. The discussion will focus on the issues, concepts, current and future practices of Ebook publishing including: finding, selecting, licensing, policies, business models, usage (tracking), best practices, and promotion/marketing. The concept of the blog is to have open discussion, propose ideas, and provide feedback on the best ways to implement Ebooks in library settings. The blog will be a moderated discussion with timely feature articles and product reviews available for discussion and comment.

No Shelf Required will be moderated by Sue Polanka, Wright State University. The role of the moderator will be to articulate discussion topics, provide feature articles and product reviews, and ask poignant questions to the group in order to stimulate open discussion and collaborative learning about Ebooks. The moderator will also provide audio content in the form of interviews with librarians and those in the publishing industry.

The blog has been running for about a week and already has several topics posted.

It sounds like it is going to be an interesting place to keep an eye on, particularly since ebooks can/could be a disruptive influence on library services. Good luck, Sue!

Voting open for Code4Lib 2009; Central Ohio is a candidate

The Columbus Metropolitan Library, OCLC, Ohio State University, and OhioLINK have put in a bid as host site for the 2009 Code4Lib meeting. Code4Lib is an informal organization of self-selected librarians and technology professionals. It exists as a volunteer organization run by consensus of interested individuals. The meeting in 2009 will be the fourth1 face-to-face meeting of this group. Details of the central Ohio host location proposal are on the web at

Information about becoming a member of the Code4Lib community and voting in the host site selection process is included below.

The meeting is conducted in an “unconference” or “barCamp” format. It is a highly democratic style consisting of prepared talks, “lightning talks” (described below), and breakouts; the meeting schedule is divided almost equally between these three components. Prepared talks are 20 minutes long and are proposed by speakers prior to the meeting. Proposals are voted on by the entire Code4Lib community, and the highest-ranking ones are slotted into the schedule. “Lightning talks” are 5 minutes long and are assigned on a first-come, first-scheduled basis at the start of the meeting. Prepared talks and lightning talks are presented to the entire attendee body (i.e., a single-track meeting); they are also usually recorded and published to the web after the meeting. Time slots for breakouts are built into the schedule and rooms are provided by the conference organizers. Attendees create breakout sessions at the meeting on any topic on a first-come, first-scheduled basis.

If you have any questions about Code4Lib in general or the central Ohio site proposal in particular, please let me know.

Code4Lib Host Site Voting Process

Adapted from a message by Mike Giarlo.

We received four very good proposals for hosting the 2009 conference, and now it is time to vote on them! Voting is open until 3am Eastern Time on Thursday, February 28th. We expect to announce results at the conference later that day.

How to vote:

  1. Go here:
  2. Log in using your credentials (register at if you haven’t done so already)
  3. Click on a host’s name to read the proposal in full
  4. Assign the proposal a rank from 0 to 3, with 0 being the least desirable and 3 the most desirable.
  5. Once you are satisfied with your rankings, click “Cast your ballot”

Feel free to watch for returns.

And as always, if you have questions or other feedback, let us know.


P.S. Your vote counts! Please keep the conference requirements and desirables in mind as you make your selection:

P.P.S. The election is not powered by Diebold.


  1. Thanks for the correction, Mike!

Microsoft Giving Away Developer Software to Students

Stu Hicks, one of OhioLINK’s systems engineers, told the OhioLINK staff last night about a new program at Microsoft called DreamSpark. Through this program, post-secondary students around the world who are attending accredited schools or universities can download some of Microsoft’s big developer and designer tools free of charge. At the time this post is being written, the list of software is:

  • Visual Studio 2008 Professional Edition
  • Windows Server 2003 Standard Edition
  • SQL Server 2005 Developers Edition
  • Expression Studio
  • XNA Game Studio
  • Visual Studio 2005 Professional Edition
  • Visual C# 2005 Express Edition
  • Visual C++ 2005 Express Edition
  • Visual Basic 2005 Express Edition
  • SQL Server 2005 Express Edition
  • Visual Web Developer 2005 Express Edition
  • Visual J# 2005 Express Edition
  • Virtual PC 2007

Eligibility is determined by either a Shibboleth or a Windows CardSpace identity provider on the student’s campus. One must link a Windows Live ID account with that campus identity provider and renew that eligibility about once every 12 months. They are using Shibboleth for what it was designed for; it is actually nice to see Microsoft recognize that only a true/false response from the campus is required to determine eligibility and that no personally-identifying attributes are passed from the campus to the Microsoft server to make this happen. There are FAQs for students and for higher education administrators.

The blog post announcing the program has a video interview with Bill Gates, but unfortunately one needs Microsoft’s Flash alternative called Silverlight to watch it.

Note to Future Self: Use `ssh -D` to bypass annoying interception proxies

Dear future self,

If you are reading this, you are remembering a time when you ran into a really nasty interception proxy1 and you are looking for a way around it. Do you remember when you were sitting in the Denver International Airport using their free wireless service? And remember how it inserted advertising banners in HTML frames at the top of random web pages as you surfed?

After about a half an hour of this, you started looking for solutions and found that the secure shell client can act as a SOCKS proxy2. Using ‘ssh’, you set up a tunnel between your laptop and a server in the office that encrypted and effectively hid all of your network communications from the interception proxy. And if you are reading this again you want to remember how you did it.

Set up the SOCKS proxy

SOCKS is a client protocol that can be used to tunnel all of your traffic to a remote host before it fans out across the internet. The OpenSSH client can set up a local SOCKS proxy that uses an ‘ssh’ session as the network tunnel. To set up the tunnel, use the -D option followed by a local port number:

ssh -D 9050 [username]@[]

To refresh your memory, here is an extract from the ‘ssh’ manual page for the -D option:

-D [bind_address:]port
Specifies a local “dynamic” application-level port forwarding. This works by allocating a socket to listen to port on the local side, optionally bound to the specified bind_address. Whenever a connection is made to this port, the connection is forwarded over the secure channel, and the application protocol is then used to determine where to connect to from the remote machine. Currently the SOCKS4 and SOCKS5 protocols are supported, and ssh will act as a SOCKS server. Only root can forward privileged ports. Dynamic port forwardings can also be specified in the configuration file.

Using the SOCKS proxy

[Image: Mac OS X 10.5 Proxy settings screen]

Next you need to tell the applications to use the SOCKS proxy. If you are still using a Mac when you are reading this, you’ll probably have it pretty easy. Mac OS X lets you set a proxy system-wide that all well-written Mac applications will use to get their proxy parameters. It is in the “Proxies” tab of the Advanced… network settings. On Mac OS X version 10.5 (Leopard), it looks like the graphic above.

If you’re using some sort of UNIX variant, the application may have a setting to use a SOCKS client, or you may need to use the ‘tsocks’ shim that intercepts the network calls of the application. And, future self, if you are using a Microsoft Windows box right now, please remember how much simpler life was when you used a Mac or Linux desktop. If you find yourself in such a spot, some reader of this blog posting may have left a comment for you below that will help you use a SOCKS proxy with a Windows platform.
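
One more reminder, since some programs ignore the system-wide setting entirely: if the stubborn application happens to be one of your own Java programs, the JDK can be pointed at the tunnel directly, either per connection with java.net.Proxy or JVM-wide with the socksProxyHost/socksProxyPort system properties. A minimal sketch, with example host, port, and URL:

// Minimal sketch: fetch a page through the local SOCKS tunnel that
// "ssh -D 9050 ..." created. The host, port, and URL are example values.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

public class ThroughTheTunnel {
    public static void main(String[] args) throws Exception {
        Proxy socks = new Proxy(Proxy.Type.SOCKS, new InetSocketAddress("127.0.0.1", 9050));
        URLConnection conn = new URL("http://example.org/").openConnection(socks);
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        System.out.println(in.readLine()); // first line of the page, fetched via the tunnel
        in.close();
        // JVM-wide alternative: java -DsocksProxyHost=127.0.0.1 -DsocksProxyPort=9050 MyApp
    }
}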

Hope this helps. Sincerely,

Self, circa February 2008


  1. Version of the "Proxy Server" Wikipedia page when this posting was written
  2. Version of the SOCKS Wikipedia page when this posting was written

Would the Real “Dublin Core” Please Stand Up?

I’ve been following the discussion by Stu Weibel on his blog about the relationship between the Resource Description Framework (RDF) and the Dublin Core Abstract Model (DCAM), and I think I’m as confused as ever. It comes as a two-part posting with comments by Pete Johnston (apologies, Pete), Mikael Nilsson, Jonathan Rochkind, and Ed Summers. Jonathan’s and Ed’s comments describe the same knowledge black hole that I’ve been facing as well; in Ed’s words: “The vocabulary I get — the DCAM is a tougher nut for me to crack.”

I’m struggling to get beyond Dublin Core as simply the definition of metadata terms. That does seem to be the heart of Dublin Core, doesn’t it? The Mission and Scope of the Dublin Core Metadata Initiative, as described on the Dublin Core Metadata Initiative’s (DCMI) “about” page, is:

The development and maintenance of a core set of metadata terms (the DCMI Metadata Terms) continues to be one of the main activities of DCMI. In addition, DCMI is developing guidelines and procedures to help implementers define and describe their usage of Dublin Core metadata in the form of Application Profiles. This work is done in a work structure that provide discussion and cooperation platforms for specific communities (e.g. education, government information, corporate knowledge management) or specific interests (e.g. technical architecture, accessibility).

Terms? … yeah, it’s in there. Application Profiles? — how one would actually use DC? … yeah, it’s there too. An abstract model? Either it is so fundamental to Dublin Core that it doesn’t get mentioned as a work activity, or its definition is somehow secondary to the work of the DCMI. To be honest, I’m not sure which it is (or even if this is a fair dichotomy).

An outsider’s view of the history of Dublin Core

There doesn’t seem to be a “brief history of Dublin Core” document out there; if there is, I can’t find it. [Update 20080218T1514 : My wife, rightly so, asked if I had done an actual literature search for one; alas, no I had not. So I Google’d "dublin core" history (relax, I used Google Scholar) and came up with a book chapter by Stu Weibel and others called Dublin core: process and principles that takes the history through 2002. There is also a rather unflattering article by Jeffery Beall called Dublin Core: an obituary but it doesn’t really contain much history. I might need to try a more rigorous literature search later.] I think I get the history of Dublin Core — in very broad strokes, it goes something like this.

First we had the Dublin Core Metadata Element Set (DCMES, sometimes DCES) — otherwise known simply as “Dublin Core” as it was the first product of the Dublin Core body — which defined the 15 elements that we all think we know and love: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type. This “Dublin Core” was codified and ratified in all sorts of places: IETF RFC 5013 (version 1.1, which obsoletes version 1.0, RFC 2413), ANSI/NISO Standard Z39.85, and ISO Standard 15836:2003. In the common vernacular, when one refers to “Dublin Core” one is talking about these 15 elements. (It is sort of like how the Open Archives Initiative Protocol for Metadata Harvesting — OAI-PMH — is called OAI even though OAI is certainly now bigger than just the PMH with the work on the definition of Object Reuse and Exchange.)

Next we had the Qualified Dublin Core, which (in part) added attributes to some of the 15 “core” elements (the ones that always leap to mind are “spatial” and “temporal” for the core term “coverage”). This was a tweak to the DCMES — done in such a way, it would appear, so as not to invalidate all of the nicely codified and ratified versions. I imagine all of that codifying and ratifying took a lot of effort; I wouldn’t intentionally want to mess it up either.

But then the story gets sort of fuzzy. Dublin Core is successful, and so an effort starts to define why it is successful. To me, this seems like the W3C work on the Architecture of the World Wide Web. It isn’t an attempt at revisionist history as much as it is trying to put the genie back in the bottle by coming up with formal definitions for all of the stuff that is successful. (In the W3C, this seems to result in efforts by the Technical Architecture Group to reconcile the way things are with the way one would like things to be.) In the case of the DCMI, it is the release last year of the Dublin Core Abstract Model followed by a corresponding realignment of the “DCMI Metadata Terms” to match the DCAM.

By the way, in case you hadn’t noticed, the core 15 elements plus the qualifications on some of the elements were expanded in 2002 to include a lot more. In fact, in the latest definition of DCMI Metadata Terms, the original 15 are called “legacy terms”:

Implementers may freely choose to use these fifteen properties either in their legacy dc: variant (e.g., or in the dcterms: variant (e.g., depending on application requirements…. Over time…, implementers are encouraged to use the semantically more precise dcterms: properties, as they more fully follow emerging notions of best practice for machine-processable metadata.

And where does RDF fit in?

This is where it gets really fuzzy for me, and I, too, am trying to reconcile what differences exist between RDF and the DCAM based on these postings and comments from Stu’s blog. The DCAM, on the surface, makes complete sense as a model for defining the description of a digital object. The use of URIs from the DCMI Metadata Terms as predicates of triples in RDF makes perfect sense, too. The overlap of the DCMI Description Set Model — in particular its apparent redefinition of value surrogates and value strings from RDF’s URI references and plain/typed literals — is confusing.
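
For what it is worth, here is roughly what that RDF view looks like in practice: a single description whose predicates are DCMI term URIs, showing both a legacy dc: element and its refined dcterms: counterpart, plus dcterms:spatial as a refinement of coverage. The sketch uses the Jena RDF library purely for illustration (nothing in the DCMI documents or Stu’s posts prescribes it), with Jena 2.x-era package names; the resource URI and literal values are invented for the example.

// Illustrative only: a Dublin Core description expressed as RDF triples, with
// DCMI term URIs as predicates. Uses Jena 2.x-era package names; the resource
// URI and literal values are invented for the example.
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class DcAsRdf {
    static final String DC  = "http://purl.org/dc/elements/1.1/"; // the 15 "legacy" elements
    static final String DCT = "http://purl.org/dc/terms/";        // the refined dcterms: properties

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("dc", DC);
        model.setNsPrefix("dcterms", DCT);

        Property dcTitle        = model.createProperty(DC,  "title");
        Property dctermsTitle   = model.createProperty(DCT, "title");   // semantically refined variant
        Property dctermsSpatial = model.createProperty(DCT, "spatial"); // a refinement of dc:coverage

        Resource doc = model.createResource("http://example.org/item/42");
        doc.addProperty(dcTitle, "JPEG2000 to Zoomify Shim");      // legacy dc: variant
        doc.addProperty(dctermsTitle, "JPEG2000 to Zoomify Shim"); // dcterms: variant
        doc.addProperty(dctermsSpatial, "Columbus, Ohio");

        model.write(System.out, "N3"); // serialize the triples
    }
}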

Stu’s second post says:

The abstract model provides a syntax-independent (hence the abstract bit) set of conventions for expressing metadata on the web. RDF is the natural idiom for the expression of the DCAM, but it is NOT essential. You can build any arbitrary syntactical representation of the metadata according to DCAM, and a lossless transformation to any other arbitrary syntactical representation should be possible between two machines that grok both syntaxes.

One of the concepts that I think I’m missing here is the value, either by description or by example, of other syntactical representations of the DCAM that get us further than RDF. It is bad enough that the original native representation of “Dublin Core” was XML when one considers “RDF is the natural idiom for the expression of the DCAM.” I think I’m in tune with an RDF view of the world, but I suspect that for many others RDF is a foreign, albeit graspable, notion. Now to layer on top of this that RDF is natural but not essential really muddies the waters.

So what is “Dublin Core”? Is it the abstract model? Is it the set of terms that can be used as predicates in RDF expressions? Is it the legacy 15-element XML-based standard for describing digital objects? Count me in among those who want more help in trying to figure this out….

Is JPEG Good Enough for Archival Masters?

On the ImageLib mailing list, Rob Lancefield (Manager of Museum Information Services for Wesleyan University) posted a link to the Universal Photographic Digital Imaging Guidelines (UPDIG) for image creators. The introduction says: “These 12 guidelines — provided as a Quick Guide plus an in-depth Complete Guide — aim to clarify the issues affecting accurate reproduction and management of digital image files. Although they largely reflect a photographer’s perspective, anyone working with digital images should find them useful…. This document, prepared by the UPDIG working group, represents the industry consensus as of September 2007.” The list of UPDIG members leads one to believe that this is a professional photography group. One thing in the introduction to the guidelines caught my eye, though:

The chapter on archiving now has a discussion of JPEG as an archival format.

Note that the authors do indeed mean JPEG (circa 1994), not JPEG2000. The chapter on archiving lists the pros and cons of a number of formats, to include JPEG. The following bullet points are excerpted from the text.

  • Conversion to TIFF files: By converting images to TIFF format [from camera RAW], the photographer is storing the images in the most accessible file format… There is a downside, however. TIFF files are much larger than RAW files… Another downside to conversion to TIFF is that it precludes the use of better RAW converters that are surely coming in the future.
  • Archiving JPEG files: Conventional wisdom holds that the TIFF format holds a quality advantage over the JPEG format. This holds true only if the JPEG file is saved at less than 10 quality using the Photoshop standard. When using JPEG quality 10 or 12, the artifacts are either non-existent or insignificant. Higher bit-depth is really the only advantage of using TIFF over JPEG 10 or 12 (in terms of image quality)… Update 2008-02-11: Please see below.
  • Archiving RAW files: If a photographer chooses to archive the RAW file, then he will be preserving the largest number of options for future conversion of the files… This, too, has its downside. RAW files will likely have to be converted to a more universal file format at some time in the future.
  • Archiving DNG files: RAW files can be converted to DNG, a documented TIFF-based format created by Adobe that can store the RAW image data, metadata, and a color-corrected JPEG preview of the image. The DNG file format provides a common platform for information about the file and adjustments to the image… DNG is likely to be readable long after the original RAW format becomes obsolete, simply because there will be so many more of them than any particular RAW file format… There’s a downside to DNG, of course. Conversion to DNG requires an extra step at the time of RAW file processing; it does not take terribly long, but it is an extra process.

Update 2008-02-11: Ken Fleisher noted in the comments that the excerpt above was truncated before his reasoning was described. In the interest of clarity, the full text of this bullet point on the UPDIG site is:

Archiving JPEG files: Conventional wisdom holds that the TIFF format holds a quality advantage over the JPEG format. This holds true only if the JPEG file is saved at less than 10 quality using the Photoshop standard. When using JPEG quality 10 or 12, the artifacts are either non-existent or insignificant. Higher bit-depth is really the only advantage of using TIFF over JPEG 10 or 12 (in terms of image quality). Some have argued that that JPEG, because of the way it encodes data, compromises color. This is a misconception. When using the highest quality settings, there is no loss of color fidelity. Therefore, if JPEG files are saved at 10-12 quality, and if they do not require much pixel editing before use, archiving JPEG files is not a bad concept, and it can save a lot of space. For many picture archives, the economics of storing large numbers of files dominates all other considerations, and JPEG offers a feasible solution to the problem.

The notes at the end of the chapter say: “The archiving JPEG section is based on research and analysis by Ken Fleisher.”

So I wonder what is going on here. Does the cultural heritage community have a different definition of the word archive from the professional photography community? Are there sufficient differences in our goals that warrant the differences in practices?

This topic is of interest because the JPEG2000 in Archives and Libraries Interest Group of the Library and Information Technology Association (LITA) will be holding a panel at the ALA Annual Conference in Anaheim this summer on using the JPEG2000 file format for archival purposes. Part of the discussion will center around the notion of visually lossless versus data lossless compression. This mention of lossy-yet-high-quality JPEG compression seems to fit into the same topic.

Soundprint’s ‘Who Needs Libraries?’

OhioLINK’s Meg Spernoga pointed our staff to a 30-minute audio documentary called Who Needs Libraries? from Soundprint:

As more and more information is available on-line, as Amazon rolls out new software that allows anyone to find any passage in any book, an important question becomes: Who needs libraries anymore? Why does anyone need four walls filled with paper between covers? Surprisingly, they still do and in this program Producer Richard Paul explores why; looking at how university libraries, school libraries and public libraries have adapted to the new information world. This program airs as part of our ongoing series on education and technology, and is funded in part by the U.S. Department of Education.

Produced by Richard Paul. Hosted by Lisa Simeone.

Some of the topics covered:

  • Numbers of New/Renovated public libraries are steady
  • Use of consortial depositories by academic libraries
  • Licensed content that can’t be found in Google (pros — immediate access; and cons — preservation)
  • Widespread digitization of content for online access (pros and cons)
  • Impact of Gates Foundation money on public library services
  • Changing ways libraries are being used

Thanks, Meg!

OAI-ORE U.K. Open Meeting, April 4, 2008, Southampton, England

Copied from the press release announcing the U.K. Public Meeting for OAI/ORE.

Open Archives Initiative Announces U.K. Public Meeting on April 4, 2008 for European Release of Object Reuse and Exchange Specifications

Ithaca, NY and Los Alamos, NM, January 21, 2008 – As a result of initiatives in eScholarship, the format of scholarly communication, and the process that underlies it, are becoming increasingly expressive and complex. The resulting new artifacts of scholarship are aggregations composed of multiple media types, links to data, and to applications that allow interaction with that data. The success of these innovations depends on standard methods to identify, describe, and exchange these new forms of scholarly communication.

On April 4, 2008 the Open Archives Initiative (OAI) will hold a public meeting in Southampton, England, in connection with Open Repositories 2008, to introduce the Object Reuse and Exchange (ORE) specifications, which propose such standards. This meeting is the European follow-on to a U.S.-based meeting on March 3 at Johns Hopkins University.

The OAI-ORE specifications define a data model to represent aggregations as resources on the web, identified with a URI like any resource. They also define a machine-readable format in ATOM to describe these aggregations. The specifications provide a foundation for new forms of citation, reuse, and analysis of the products and process of scholarship.

In addition to eScholarship applications, ORE specifications are useful for the aggregations that are part of our everyday interaction with the web. These include multi-page HTML documents and collections of multi-format images on sites like flickr. ORE descriptions of aggregations can be used to improve search engine behavior, provide input for browser-based navigation tools, and make it possible to develop automated web services to analyze and preserve information.

Attendees of the April 4 meeting will learn about the ORE data model, which is based on techniques developed in the Semantic Web initiative. They will also learn about the translation of this data model to the XML-based ATOM syndication format, allowing exchange of ORE-based descriptions via standardized feed software. Finally, they will hear the results of initial experiments with the specifications by ORE community members. There will be ample time for discussion and questions. Detailed information and registration for the meeting is available online (NOTE: attendees must register in advance and attendance is limited).

About the Open Archives Initiative: The Open Archives Initiative (OAI) develops and promotes interoperability standards that aim to facilitate the efficient dissemination, sharing, and reuse of web-based content. OAI-ORE work is supported by the Andrew W. Mellon Foundation, Microsoft Corporation, the JISC, and the National Science Foundation (IIS-0430906). More information is available at
