Open Repositories 2011 Report: Day 1 with Apache, Technology Trends, and Bolded Labels

Today was the first main conference day of the Open Repositories conference in Austin, Texas. There are 300 developers here from 20 countries and 30 states. I have lots of notes from the sessions, and I've tried to make sense of some of them below before I lose track of the entire context.

The meeting opened with a keynote by Jim Jagielski, president of the Apache Software Foundation. He gave a presentation on what it means to be an open source project, with a focus on how Apache creates a community of developers and users around its projects.

Slides 50 and 51 of 'Open Source: It's Not Just for IT Anymore'

One of the takeaways was a characterization of open source licenses from Dave Johnson. Although it is a basic shorthand, it is useful in understanding the broad classes of licenses:

Give Me Credit ("You can use, modify and redistribute my code in your product but give me credit"): Apache License, Berkeley License, and MIT License
Give Me Fixes ("You can use, modify and redistribute my code in your product but give me the source for any fixes you make to it."): Mozilla Public License, Eclipse Public License, and GNU Library or "Lesser" General Public License
Give Me Everything ("You can use, modify and redistribute my code in your product but give me your entire product's source code."): GPL

He also explained why community and code are at peer levels in Apache. One is not possible without the other; it takes an engaged community to create great code and great code can only be created by a healthy community. He also described how the primary communications tool for projects is not new-fangled technologies like wikis and conference calls and IRC. The official record of a project is its e-mail lists. This enables the broadest possible inclusion of participants across many time zones, and the list archives enable people to look into the history of decisions. If discussions take place in other forums or tools, the summary is always brought back to the e-mail list.

Jim's concluding thoughts were a great summary of the presentation, and I've included them on the right.

I missed the first concurrent session of the day due to a work conflict, so the first session I went to was the after-lunch 24x7 presentations. That is no more than 24 slides in no more than seven minutes. I like this format because it forces the presenters to be concise, and if the topic is not one that interests you it isn't long until the next topic comes up. The short presentations are also great for generating discussion points with the speakers during breaks and the reception. Two of these in particular struck a chord with me.

The first was "Technology Trends Influencing Repository Design" by Brad McLean of DuraSpace. His four trends were:

Design for mobile, not just PCs. The model of a mobile app -- local computation and page rendering backed by web services for retrieving data -- is having several impacts on design: it reinforces the need for lightweight web services and UIs; designs must account for how screen size has shrunk again; and having a strategy for multi-platform apps will become critical.
More programming language(s) than you need/want. Java, Python, Ruby, Scala, LISP, Groovy, JavaScript and the list goes on. This proliferation of languages has forced looser coupling between components (e.g. a JavaScript-based script can consume data from and write data to a Java-based servlet engine). The implications he listed for this are that it is even clearer that the true integration challenges are in the data modeling and policy domains; that it is harder to draw neat boxes around required skill sets; and that you might lose control of your user experience (and it might be a good thing).
Servers and clusters. Clusters are not just for high-performance computing and search engines anymore. Techniques like map/reduce are available to all. He said that eBay was the last major internet company to deploy its infrastructure on "big iron" but he didn't attribute that statement to a source. (Seems kind of hard to believe...) The implications are that we should look to replicated and distributed Solr indexing (hopefully stealing a page from the "noSQL" handbook); keep an eye on Map/Reduce-based triple stores (interesting idea!); and expect that repository storage will be spanning multiple systems.
What is a filesystem. Brad noted that with filesystems what was once hidden from the end user (think the large systems of the 1960s, 1970s and 1980s) became visible (the familiar desktop file folder structure) and is now becoming hidden again (as with mobile device apps). Applications are now storing opaque objects again; how do we effectively ingest them into our repositories?
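For readers who haven't met map/reduce before, here is a minimal, single-process Python sketch of the idea applied to indexing. This is my own toy illustration, not something from Brad's talk and not how Solr or any real cluster framework is implemented; the function names and the tiny sample documents are hypothetical. A real deployment would spread the map and reduce calls across many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct word in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]

def shuffle(pairs):
    """Group the emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_phase(term, doc_ids):
    """Reduce step: collapse all values for one term into a sorted posting list."""
    return term, sorted(doc_ids)

def build_index(documents):
    """Run map, shuffle, and reduce in one process (hypothetical helper)."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    grouped = shuffle(pairs)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())

if __name__ == "__main__":
    docs = {"a": "open repositories", "b": "open source"}
    print(build_index(docs))
```

The point of the split is that each `map_phase` call only looks at one document and each `reduce_phase` call only looks at one term, so both can run in parallel on separate machines.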

[caption id="tweet_78539910046425089" align="alignright" width="300" caption="Tweet from Dorothea Salo"]

Takeaways from Simeon: think about what to present sans label; find cues you can use instead of labels; use labels for 2ndary info. #or11
Rattus repositor (RepoRat), via TweetDeck

[/caption]
The second 24x7 talk that struck a chord was "Don’t Bold the Field Name" by Simeon Warner. And by that he literally meant "Don't Bold the Field Name". He walked through a series of library interfaces and noted how we have a tendency to display bolded field labels. He then pointed out how this draws the eye's attention to the labels and not the record content beside the labels. Amazon doesn't do this (at least with the metadata at the top of the page), eBay doesn't do this, and the search engines don't do this. He did note -- pointing to the case of the "Product Details" section of an Amazon item page -- that bolding can make sense when "the task of finding the piece of information is more important than consuming it." (Again, in the Amazon case, the purpose of bolding the label is to draw the eye to the location of data like publisher and shipping weight on the page.) I think Dorothea Salo's tweet summed it up best: "Takeaways from Simeon: think about what to present sans label; find cues you can use instead of labels; use labels for 2ndary info. #or11"
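To make that advice concrete, here is a small Python sketch of how a record display might follow it. This is only my own illustration under stated assumptions, not anything Simeon showed: the `render_record` function, the choice of `title` and `author` as the label-free fields, and the CSS class names are all hypothetical.

```python
from html import escape

def render_record(record, primary=("title", "author")):
    """Render a metadata record as HTML without bolding any field name.

    Fields named in `primary` are shown with no label at all; every other
    field gets a plain (unbolded) label that a stylesheet could de-emphasize.
    """
    lines = []
    # Primary fields: present the value alone, using its position as the cue.
    for field in primary:
        if field in record:
            lines.append(f'<p class="{field}">{escape(record[field])}</p>')
    # Secondary fields: keep the label, but do not bold it.
    for field, value in record.items():
        if field in primary:
            continue
        lines.append(
            f'<p><span class="label">{escape(field)}:</span> {escape(value)}</p>'
        )
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_record({"title": "Moby-Dick", "publisher": "Harper"}))
```

The title appears first and unlabeled, while the publisher keeps a quiet label so it can still be found when someone goes looking for it.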

I also attended the two sessions on identifiers in the afternoon (Peter Sefton's "A Hint of Mint: Linked Authority Control Service" and Richard Rodgers's "ORCID: Open Research and Contributor ID -- An Open Registry of Scholarly IDs"), but the hour is late and tomorrow's events will come soon enough. Given enough time and energy, I'll try to summarize those sessions later.