Open Repositories 2011 Report: Day 1 with Apache, Technology Trends, and Bolded Labels

Today was the first main conference day of the Open Repositories conference in Austin, Texas. There are 300 developers here from 20 countries and 30 states. I have lots of notes from the sessions, and I’ve tried to make sense of some of them below before I lose track of the entire context.

The meeting opened with a keynote by Jim Jagielski, president of the Apache Software Foundation. He gave a presentation on what it means to be an open source project, with a focus on how Apache creates a community of developers and users around its projects.


Slides 50 and 51 of "Open Source: It's Not Just for IT Anymore"

One of the takeaways was a characterization of Open Source Licenses from Dave Johnson. Although it is a basic shorthand, it is useful in understanding the broad classes of licenses.

He also explained why community and code are at peer levels in Apache. One is not possible without the other; it takes an engaged community to create great code and great code can only be created by a healthy community. He also described how the primary communications tool for projects is not new-fangled technologies like wikis, conference calls, and IRC. The official record of a project is its e-mail lists. This enables the broadest possible inclusion of participants across many time zones, and the list archives enable people to look into the history of decisions. If discussions take place in other forums or tools, the summary is always brought back to the e-mail list.

Jim’s concluding thoughts were a great summary of the presentation, and I’ve inserted them on the right.

I missed the first concurrent session of the day due to a work conflict, so the first session I went to was the after-lunch 24×7 presentations. That is no more than 24 slides in no more than seven minutes. I like this format because it forces the presenters to be concise, and if the topic is not one that interests you it isn’t long until the next topic comes up. The short presentations are also great for generating discussion points with the speakers during breaks and the reception. Two of these in particular struck a chord with me.

The first was “Technology Trends Influencing Repository Design” by Brad McLean of DuraSpace. His four trends were:

  1. Design for mobile, not just PCs. The model of a mobile app — local computation and page rendering backed by web services for retrieving data — is having several impacts on design: a reinforcement of the need for lightweight web services and UIs; accounting for how screen size has shrunk again; and having a strategy for multi-platform apps will become critical.
  2. More programming language(s) than you need/want. Java, Python, Ruby, Scala, LISP, Groovy, JavaScript and the list goes on. This proliferation of languages has forced looser coupling between components (e.g. a JavaScript based script can consume data from and write data to a Java-based servlet engine). The implications he listed for this are that it is even clearer that true integration challenges are in the data modeling and policy domains; harder to draw neat boxes around required skill sets; and that you might lose control of your user experience (and it might be a good thing).
  3. Servers and clusters. Clusters are not just for high-performance computing and search engines anymore. Techniques like map/reduce are available to all. He said that eBay was the last major internet company to deploy its infrastructure on “big iron” but he didn’t attribute that statement to a source. (Seems kind of hard to believe…) The implications are that we should look to replicated and distributed Solr indexing (hopefully stealing a page from the “NoSQL” handbook); keep an eye on Map/Reduce-based triple stores (interesting idea!); and expect repository storage to span multiple systems.
  4. What is a filesystem. Brad noted that with filesystems what was once hidden from the end user (think the large systems of the 1960s, 1970s and 1980s) became visible (the familiar desktop file folder structure) and is now becoming hidden again (as with mobile device apps). Applications are now storing opaque objects again; how do we effectively ingest them into our repositories?
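
For readers who have not met map/reduce, here is a minimal single-process sketch of the idea behind the third trend. It is my own illustration, not something from Brad’s slides: the sample triples and the function names are hypothetical, and a real cluster framework would run the map and reduce phases across many machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("item:1", "dc:creator", "Smith"),
    ("item:1", "dc:title", "Report"),
    ("item:2", "dc:creator", "Jones"),
    ("item:3", "dc:creator", "Smith"),
]


def map_phase(triple):
    """Emit a (key, 1) pair for one triple, keyed by its predicate."""
    _subject, predicate, _obj = triple
    return (predicate, 1)


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Combine all values emitted for one key into a single count."""
    return reduce(lambda a, b: a + b, values, 0)


def count_by_predicate(triples):
    """Run map, shuffle, and reduce in sequence on one machine."""
    grouped = shuffle(map(map_phase, triples))
    return {key: reduce_phase(values) for key, values in grouped.items()}


counts = count_by_predicate(TRIPLES)
```

Because each call to `map_phase` looks at one triple and each call to `reduce_phase` looks at one key, both phases can be split across servers, which is what makes the technique a fit for clusters.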

Takeaways from Simeon: think about what to present sans label; find cues you can use instead of labels; use labels for 2ndary info. #or11

Tweet from Dorothea Salo

The second 24×7 talk that struck a chord was “Don’t Bold the Field Name” by Simeon Warner. And by that he literally meant “Don’t Bold the Field Name”. He walked through a series of library interfaces and noted how we have a tendency to display bolded field labels. He then pointed out how this draws the eye’s attention to the labels and not the record content beside the labels. Amazon doesn’t do this (at least with the metadata at the top of the page), eBay doesn’t do this, and the search engines don’t do this. He did note one exception, pointing to the “Product Details” section of an Amazon item page, where “the task of finding the piece of information is more important than consuming it.” (In that case, the purpose of bolding the label is to draw the eye to the location of data like publisher and shipping weight on the page.) I think Dorothea Salo’s tweet summed it up best: “Takeaways from Simeon: think about what to present sans label; find cues you can use instead of labels; use labels for 2ndary info. #or11”

I also attended the two sessions on identifiers in the afternoon (Peter Sefton’s “A Hint of Mint: Linked Authority Control Service” and Richard Rodgers’s “ORCID: Open Research and Contributor ID — An Open Registry of Scholarly IDs”), but the time is late and tomorrow’s events will come soon enough. Given enough time and energy, I’ll try to summarize those sessions later.

New Web Expectations and Mobile Web Techniques

Late last year I was asked to put together a 20-minute presentation for my employer (LYRASIS) on what I saw as upcoming technology milestones that could impact member libraries. It was a good piece, so I thought I’d share what I learned with others as well. The discussion was in two parts — general web technologies/expectations and mobile applications/web.


DLTJ now uses reCAPTCHA on comment forms. reCAPTCHA is an enhanced version of CAPTCHA (an acronym for “completely automated public Turing test to tell computers and humans apart”) and like the original it is a type of challenge-response test used to determine whether there is a human user at the other end of the browser or if it is a software agent (such as a SPAM robot). And like the original it asks the user to type in recognized words from an image or a set of numbers from an audio clip.


Help with reCAPTCHA

The reCAPTCHA box contains three buttons to help use the service:

Refresh button: Refresh the word images. If you are unsure what the two words are, select this button to receive a new pair of words. (Alternatively, just try to guess what the two words are; if you are wrong, you’ll get a new pair of words automatically.)
Audio button / Text button: Alternate between the Audio- and Text-based challenges. If you cannot see the word images, select this audio button to hear a set of digits among random noise that can be entered instead of the visual challenge.
Help button: Get help from the reCAPTCHA site about this human detection scheme. Also includes introductory information about the reCAPTCHA service itself.

What’s Special About reCAPTCHA

The human mind is still a more powerful computer than any silicon circuitry in place now or in the foreseeable future. With just a glance our brains can recognize the patterns among the noise, something that is computationally very expensive or impossible for a machine to do. reCAPTCHA researchers at Carnegie Mellon University, also the home of the original CAPTCHA concept, estimate that 60 million CAPTCHAs are solved by humans around the world every day, with roughly ten seconds of human time being spent in each instance. That is not a lot of time per person, but in aggregate it adds up to more than 150,000 hours of work each day.
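
The aggregate figure is easy to check. This is just the arithmetic on the two estimates quoted above (60 million CAPTCHAs a day, about ten seconds each); the variable names are my own.

```python
# Estimates quoted from the reCAPTCHA researchers.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600

# Total human time spent per day, converted from seconds to hours.
hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / SECONDS_PER_HOUR

print(f"{hours_per_day:,.0f} hours per day")  # prints "166,667 hours per day"
```

So the estimates work out to roughly 167,000 hours a day, consistent with “more than 150,000.”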

In the original CAPTCHA scheme, that work is wasted on deciphering random strings of letters and numbers. The researchers at Carnegie Mellon realized that they could harness that work to resolve ambiguities in deciphering scanned text from books. As with the original CAPTCHA system, there are some blocks of scanned text that computers cannot decipher yet are easily readable by humans. reCAPTCHA pairs a known word with one of these unknown blocks of text. If the human types the known word correctly, the reCAPTCHA system tells the DLTJ system that the comment is coming from a human. And if enough humans type the same response for the unknown block of text, the reCAPTCHA system can be pretty sure the word has been deciphered.
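
To make the pairing logic concrete, here is a rough sketch of the scheme as I understand it from the description above. It is not the actual reCAPTCHA implementation: the class name, the identifiers, and the agreement threshold of three are all assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical: how many matching answers count as "enough humans".
AGREEMENT_THRESHOLD = 3


class WordPairChallenge:
    """Toy model of pairing a known word with an undeciphered scan."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # unknown scan id -> guess counts
        self.deciphered = {}               # unknown scan id -> agreed word

    def check(self, known_word, unknown_id, typed_known, typed_unknown):
        """Return True if the known word was typed correctly.

        A correct known word marks the user as human, and only then is
        the guess for the unknown scan recorded as a vote.
        """
        if typed_known.strip().lower() != known_word.lower():
            return False  # failed the control word; ignore the other guess
        guess = typed_unknown.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= self.threshold:
            self.deciphered[unknown_id] = guess
        return True


# Three commenters all read the unknown scan the same way.
service = WordPairChallenge()
for _ in range(3):
    passed = service.check("morning", "scan-17", "morning", "overlooks")
```

After the third matching answer, `service.deciphered` maps `"scan-17"` to `"overlooks"`. The point of the design is that the site only ever needs the answer for the known word; the unknown word rides along for free.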

So by commenting here on DLTJ you are helping make the world a better place by aiding in the digital conversion of texts from the Internet Archive. This is a bit of an experiment, so if it is not working out, please let me know.

2007 Web Design Survey

Friend and former colleague Eric Meyer writes about the 2007 Web Design Survey (first annual) on his blog. It is an effort to “increase knowledge of web design and boost respect for the profession” and asks questions to learn “Who are we? Where do we live? What are our titles, our skills, our educational backgrounds? Where and with whom do we work? What do we earn? What do we value?”

I hesitate to put myself in the “we” category of web designers, but I do try to keep up on best practices by reading Eric’s blog and following A List Apart. I’ve never had a good enough business justification to attend An Event Apart event, but the possibility of winning a free registration for filling out this survey is compelling!

If you work in web design, and I get the sense that many readers of DLTJ dabble in it about as much as I do, please take a few minutes to fill out the survey.