Local and Unique and Digital: An Evolving Trend for Libraries and Cultural Heritage Institutions

These are the slides and audio from a presentation given at the LOUIS Users Group meeting on October 4, 2013, in Baton Rouge, LA. The description of the talk was:

Libraries have been digitizing materials for decades as surrogates for access to physical materials, and in doing so have broadened the range of people and uses for library materials. With projects like HathiTrust and Google Book Search systematically digitizing mass-produced monographs and making them available within the bounds of copyright law, libraries continue the trend of digitizing what is local and unique, and the emergence of projects like the Digital Public Library of America and OCLC’s WorldCat Digital Collection Gateway expands discoverability of the local and unique well beyond the library’s traditional reach. This presentation provides an overview of this trend, offers updates on what libraries can do, and describes activities LYRASIS is undertaking to help libraries and other cultural heritage institutions expand their reach.

Links to resources and pages mentioned in the talk are below.


Interoperability and Its Role In Standardization, Plus A ResourceSync Overview: Slidecast from ALA2013

At the American Library Association meeting in Chicago last month I gave a 20-minute presentation for the NISO Update session that combined an overview of interoperability and standards with a brief look at the ResourceSync activity. Included below are my slides with a synchronized audio track.

Host Your Own Virtual Lightning Talks Using Google Hangout — Slides and Links from ALA2013 Presentation

In the “very meta” category, this morning I gave a lightning talk about lightning talks to a crowd of about 150 at the LITA Lightning Talks session. More specifically, it was a brief presentation on how Code4Lib uses Google Hangouts-on-Air for its Virtual Lightning Talks. The slides and links from the slides are included below.

URLs from the presentation

As promised, here are the URLs from the presentation.

Notes on the Code4Lib Virtual Lightning Talks

Last week I emceed the second Code4Lib Virtual Lightning Talk session, and I wanted to record some notes and pointers here in case I (or anyone else) want to do the same thing again. First, though, here is a list of those who presented, with links to the talks archived on the Internet Archive.

  • Terry Brady: File Analyzer and Metadata Harvester
  • Misty De Meo: Transitioning a legacy thesaurus to SKOS/RDF
  • Roy Tennant: Under the Hood of Hadoop Processing at OCLC Research
  • Kate Kosturski: How I Taught Myself Drupal In a Weekend (And You Can Too!)

The full session recording is also on YouTube. (If I get the time, I’d like to try my hand at cleaning up the automatically generated captions track as well.)

Here are the notes:

  • The beep track was created using Audacity and its function to create DTMF tones. I’ll add the full five-minute recording here in MP3 format, but next time I do this I think I’m inclined to add a minute or two to each presentation, so I’ll have to recreate the track. (A scripted way to generate such a track is sketched after this list.)
  • The beep track and my voice were combined in real time on my Mac using the two-channel Soundflower mixer. I was using a Blue USB external microphone, and needed the LineIn application to route the Blue microphone’s input into one of the Soundflower channels. (I couldn’t figure out how to route the USB mic input natively.)
  • When you create a Google+ Event, you have the option of saying it will be via Google+ Hangout. I had set the start time of the event to 1:30 Eastern U.S. time, but wanted to open up the Hangout 30 minutes early so the presenters could come in and test the environment. I started an ad hoc hangout 30 minutes early, but right at the start time another Hangout was created and some viewers went there instead. I don’t think there is an elegant way around this, but next time I’ll set the start time of the event to include that 30-minute window and mention in the event description that it won’t really start until 30 minutes later.
  • Warn the presenters about the start tones on the beep track. The start tones will cause the Hangout to focus on the emcee’s screen, which will have the title slide. Some presenters got eager, though, and talked before or through the beep track. Add 10 seconds to the first minute’s beep track time, then tell the presenters that leeway is built in.
  • Download the MP4 recording from YouTube and split it using the QuickTime Player “Trim” feature. It helps to have QuickTime Player go fullscreen so you have a finer granularity on the editing times.
  • Presentations in Prezi format did seem to work out fine.
  • Remind other speakers to mute their mics when they are not presenting so they don’t steal Hangout video focus from the presenter. Hangouts-on-Air has a Cameraman app that might be useful in limiting who is seen/heard at any one time during the session. Explore this before the next session…
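
The beep track described in the first note was built by hand in Audacity; for anyone who would rather script it, here is a minimal sketch that generates a comparable five-minute track with a DTMF-style tone at each minute boundary, using only the standard javax.sound.sampled classes. The output file name, beep lengths, and end-tone pattern are assumptions for illustration, not a recreation of the original track.

```java
import javax.sound.sampled.*;
import java.io.ByteArrayInputStream;
import java.io.File;

public class BeepTrack {
    static final float RATE = 44100f;

    // Render a DTMF-style two-tone beep into a 16-bit mono PCM buffer.
    static void tone(short[] buf, int start, double seconds, double f1, double f2) {
        int n = (int) (seconds * RATE);
        for (int i = 0; i < n && start + i < buf.length; i++) {
            double t = i / RATE;
            double v = 0.4 * (Math.sin(2 * Math.PI * f1 * t) + Math.sin(2 * Math.PI * f2 * t)) / 2;
            buf[start + i] = (short) (v * Short.MAX_VALUE);
        }
    }

    public static void main(String[] args) throws Exception {
        int minutes = 5;                                   // length of one presentation slot
        short[] samples = new short[(int) (minutes * 60 * RATE)];
        for (int m = 0; m < minutes; m++) {
            // one short beep at each minute boundary; DTMF "1" is 697 Hz + 1209 Hz
            tone(samples, (int) (m * 60 * RATE), 0.5, 697, 1209);
        }
        // three quick end tones in the final two seconds
        for (int k = 0; k < 3; k++) {
            tone(samples, (int) ((minutes * 60 - 2 + k * 0.7) * RATE), 0.3, 852, 1477);
        }
        // pack the samples as little-endian 16-bit PCM and write a WAV file
        byte[] pcm = new byte[samples.length * 2];
        for (int i = 0; i < samples.length; i++) {
            pcm[2 * i] = (byte) (samples[i] & 0xff);
            pcm[2 * i + 1] = (byte) (samples[i] >> 8);
        }
        AudioFormat fmt = new AudioFormat(RATE, 16, 1, true, false);
        AudioInputStream ais = new AudioInputStream(
                new ByteArrayInputStream(pcm), fmt, samples.length);
        AudioSystem.write(ais, AudioFileFormat.Type.WAVE, new File("beep-track.wav"));
    }
}
```

Changing the minutes variable is all it would take to add the extra minute or two mentioned in the first note, without rebuilding the track by hand.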

Trip Report of DPLA Audience & Participation Workstream

On December 6, 2012, the Audience and Participation workstream met at the Roy Rosenzweig Center for History and New Media at George Mason University. About two dozen colleagues participated in person and remotely via Google+ Hangout. We talked about processes and strategies for getting content into the DPLA (the content hubs and service hubs strategy), brainstormed about the types of users and uses for the DPLA, and outlined marketing and branding messages that align with the goals and technology of the DPLA while getting content contributors and application developers excited about what the DPLA represents. I’m happy to have been invited to take part in the meeting, am grateful to DPLA for funding my travel to attend in person, and came away excited and energized about the DPLA plans — if also with a few commitments to help move the project along.

Emily Gore, DPLA’s Director of Content, started the first topic by describing the mechanisms being set up to feed metadata to the DPLA database. The first version of DPLA will be an aggregation of metadata about objects held by various services and cultural heritage organizations around the country. The DPLA will leverage and promote metadata coming through hubs, where a hub can be an existing large gathering of stuff (“content hubs” — think Harvard University, Smithsonian Institution, National Archives and Records Administration) or a meeting point for state or regional content (“service hubs”). From the perspective of the Audience and Participation workstream, the service hubs are probably the most interesting because that will be how information about an institution’s content gets into the DPLA.

Just about every state in the country is covered by a state or regional digital library program, so the infrastructure is already out there to help organizations. The DPLA itself is aiming to be a small organization of about five to ten people, and at that scale it would be impossible to have a one-on-one relationship between the DPLA and all the possible organizations in the country. So the DPLA Service Hubs will offer a number of services in a region: aggregation of metadata from local repositories, help with new digitization and creation of associated metadata, and engagement of participants in the region around different uses of the content in the DPLA. By the April 2013 launch of the DPLA site, the goal is to have seven service hubs operating and a similar number of content hubs. Some of the state and regional collaboratives have already reached out to the DPLA to join, and DPLA is working on updating a list of collaboratives that was created a few years ago. One path of outreach is through the Chief Officers of State Library Agencies (COSLA) group. Talking to state library agencies makes sense because there are indications that IMLS — which grants money to state library agencies — is aligning its LSTA funding with the goals of participating in DPLA. State humanities councils and ALA can also be venues for getting the word out. The ALA Washington Office can be especially useful for getting word to legislators about the importance and value of collaboration with the DPLA.

We talked about how there are technical tasks involved with adding new hubs to the DPLA — it isn’t as easy as just ingesting and indexing new metadata. There will be iterations of mapping adjustments, tuning the weighting of fields in the index, and other tasks, so DPLA will need to set expectations about how fast it can add new hubs to the service. It was noted in the meeting that the service and content hubs will in one sense be customers of the DPLA and in another sense will be providers to the DPLA. This relationship will be akin to a business-to-business relationship, and it will be important that the DPLA can provide adequate “customer support” to the hubs to make the relationship work out best.

The focus at launch is on cultural heritage objects, books, and manuscripts. The common denominator is that the metadata must be sharable under a Creative Commons Zero (CC0) license, allowing for the free reuse and remixing of the metadata. In this form, the DPLA will be an index of descriptive metadata that leads the searcher to where the item is stored at the contributing institution. That institution can specify other rights on the reuse of the digital object itself. Interestingly, the CC0 policy for metadata is a source of concern for some potential DPLA participants. Whereas libraries have less of a sense of ownership over the metadata describing their objects, the museum community has a higher sense of ownership because of the extra effort it puts into creating and curating this metadata.

We talked for a bit about the impact that the visibility of the DPLA will have on the desire of organizations and even individuals to digitize, describe, and mount their content online. (“If they have stuff like that, I have stuff like that, too, that I want to add.”) The DPLA can be helpful by providing clear pipelines for adding content to places that will be harvested and integrated into the DPLA. It could even bring digitization “to the masses” by working through local historical societies, where there will be opportunities for conversation about what is worth keeping and how to do it.

This discussion of what content will be in the DPLA led into talks about the kinds of people using the DPLA and what they will want to use it for. The goal is to create “personas” of DPLA users — fictional representations that encompass research about the users, their motivations, and their desires. (As examples, we briefly looked at the HathiTrust personas and the earlier work on DPLA Use Cases.) The driving goal is to give these personas to the contracted developer (iFactory) for use in creating the initial front-end website. As an aside at this point, the heart of the DPLA at this stage will be the aggregation, enhancement, and serving of descriptive metadata to software applications that remix and display results to users. One way, but not the only way, this will happen is via the http://dp.la/ website interface being created by iFactory.

We brainstormed possible labels for the personas: Casual Searchers, Genealogists, Hardcore Enthusiasts, Wikipedia / Open Source Folks, Info Nerds, Small Businesses / Startups, Writers / Journalists, Artists, Students, Public School Teachers, Home Schoolers, Scholars, Other Digital Libraries, State Libraries, Public Libraries / Public Librarians, Museums, and Historical Societies. We also brainstormed a whole slew of behaviors these personas could engage in (several hundred post-it notes worth), and then grouped them into broad categories:

  • Finding Specific Knowledge: school research, curricular-related; local/personal history; specific “laser-like” focus; open-ended, on-going activity; awareness of a body-of-knowledge problem
  • Learn: skill-acquisition (things that take longer, as a project)
  • Harvest and/or reuse: visualizations, build new collections
  • Contribution: contribute content; enhance metadata (DPLA needs to be able to answer the question “I want you to add X”)
  • Sharing/Connecting: outwardly focused; using DPLA as a point to go out to other people, find partners, start a book group, or share something cool with friends; building community; connecting institutions, seeing what other libraries are doing, sharing content with other libraries
  • General, accessibility: featurish-type notes

After a little more refinement in sorting and labeling, these behaviors will then be used to create the characteristics of the personas.

The last activity was talking about branding and marketing — how to get organizations and individuals excited about using the DPLA. A backdrop of this discussion is making people — especially funders — aware that the DPLA is an enhancement to every library’s services and collections, not a replacement for them. That the DPLA will be seen as complementary to the local library came out strongly in the October DPLA Plenary session. Amid the discussion of “what’s in a name?” (‘dp.la’ or ‘library.us’ or something else) and what it is that DPLA wants to pitch to users (the metadata platform, a single-entry homepage at http://dp.la/, or an app store of DPLA-enabled applications) was a fascinating discussion about getting developers interested in the DPLA platform and programming interface. We talked about getting library school, college, and high school classes interested in building DPLA apps as term projects. We also talked about the role of existing organizations like Code4Lib and LITA in introducing and supporting developers creating applications using the DPLA platform.

In the end what emerged is a possible thread of activities from the midwinter meeting of the American Library Association (ALA) through the Code4Lib conference into South-by-Southwest and the annual ALA meeting. The thread goes something like this:

  1. Petition LITA to form an interest group for DPLA activities at the ALA Midwinter meeting, and possibly hold a forum there for interested librarians.
  2. Hold a half-day preconference tutorial at the mid-February Code4Lib meeting in Chicago covering example uses of the DPLA API, effective ways to process and remix JSON-LD data (the computerized format of information returned by the DPLA platform), and discussions of the content available in DPLA. (A sketch of such an API call follows this list.)
  3. Use the Code4Lib meeting to launch a four- to five-week contest for teams of developers to create interesting applications around the DPLA platform.
  4. Show the entries at an already-arranged presentation at the South-by-Southwest conference in mid-March, and announce a winner.
  5. Arrange for space at the ALA Annual meeting in June in Chicago for a more in-depth discussion and exploration of DPLA.
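
As a taste of what that preconference might cover, here is a hedged sketch of calling the DPLA items API and walking the JSON-LD it returns. The endpoint shape (api.dp.la/v2/items with an api_key parameter and a docs array of sourceResource records) reflects the API as it was later published; the query, the key placeholder, and the field handling are assumptions for illustration, and the sketch leans on the Jackson library for JSON parsing.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical search against the DPLA items API; YOUR_KEY is a placeholder api_key.
public class DplaSearch {
    public static void main(String[] args) throws Exception {
        String url = "https://api.dp.la/v2/items?q=chattanooga&page_size=10&api_key=YOUR_KEY";

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        JsonNode root = new ObjectMapper().readTree(response.body());
        System.out.println("Total matches: " + root.path("count").asInt());

        // Each doc carries a sourceResource block with the descriptive metadata
        // and an isShownAt URL pointing back to the contributing institution.
        for (JsonNode doc : root.path("docs")) {
            // title may be a string or an array, so print its raw JSON form
            String title = doc.path("sourceResource").path("title").toString();
            String provider = doc.path("provider").path("name").asText("(unknown provider)");
            String link = doc.path("isShownAt").asText("");
            System.out.printf("%s | %s | %s%n", title, provider, link);
        }
    }
}
```

Remixing results like these into maps, timelines, or visualizations is exactly the kind of term project the contest in step 3 has in mind.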

The hook for developers is showing them a new, vast store of liberated data that they can remix with other data, use to create absorbing visualizations, and use to facilitate deeper user engagement. The DPLA is going to become a huge set of liberated data, and we think that can be attractive not only to library developers but also to developers outside the traditional library and cultural heritage community.

And with that we ended the meeting at George Mason University. As I said in my previous post recounting the November Appfest meeting in Chattanooga, these are exciting times when the reality of the DPLA vision can start to be seen. I’m eager to see the effort progress, and to participate in it as much as I can.

The text was modified to update a link from http://dp.la/get-involved/events/dplamidwest/ to http://dp.la/info/get-involved/events/dplamidwest/ on September 26th, 2013.

The text was modified to update a link from http://dp.la/workstreams/audience/ to http://dp.la/wiki/Audience_and_Participation on September 26th, 2013.

The text was modified to update a link from http://dp.la/about/digital-hubs-pilot-project/ to http://blogs.law.harvard.edu/dplaalpha/about/digital-hubs-pilot-project/ on September 26th, 2013.

Trip Report of DPLA Chattanooga Appfest: Project Shows Signs of Life

Below is my report of the DPLA AppFest last month. This post is the raw input of an article on the IMLS blog that was co-written with Mary Barnett, Social Media Coordinator at the Chattanooga Public Library. I also attended yesterday’s DPLA Audience and Participation workstream meeting at George Mason University, and hope to have a similar trip report posted soon.

The Digital Public Library of America held an AppFest gathering at the Chattanooga Public Library on November 8-9, 2012 for a full day of designing, developing, and discussing. About 40 people attended from a wide range of backgrounds:

  • Public libraries, academic libraries, companies, library consortia and other groups.
  • Designers, user experience professionals, metadata specialists, coders, and “idea people”.
  • A variety of U.S. states and Canadian provinces.

One of the primary purposes of the day was to test the DPLA application programming interface (API) to see if it could answer queries from programs that would do useful tasks. The quick answer to that challenge would seem to be “yes!” and there were a lot of positive feelings coming out of the meeting.

The goal of the DPLA is a little unusual. Although there will be a public web-based interface to the DPLA content, the overriding desire is to build the back-end services that will enable DPLA content to be used through local libraries, archives and museums in a variety of interfaces. In this way, DPLA is like a traditional publisher: it will gather content from a variety of sources (authors), add new value (copyedit, create an index), and send out content through a variety of streams (libraries, local bookstores, internet retailers and other venues like convenience stores and airport shops). The DPLA will gather content from regional and subject hubs, enhance it, and make it available to websites, mobile applications, and other tools. So the AppFest was a way to figure out if the early technology designs of the DPLA could meet those goals.

The AppFest started on Thursday evening with a recorded video welcome from John Palfrey and a recording of Emily Gore from DPLA Midwest introducing the DPLA Digital Hubs Pilot Program, followed by a brief introduction to the developer productivity tools Git and GitHub by Andromeda Yelton. The evening ended with people pitching ideas for projects that participants would work on the next day. The ideas were ones previously recorded on the DPLA wiki page or were spontaneously created at the meeting.

Photo: the sign-up sheet for AppFest projects (the DPLA project sign-up page)

Friday morning started with an introduction to the DPLA API by Jeffrey Licht of Pod Consulting and a review of the possible projects. The project titles were put up on a wall and people self-selected what they wanted to work on for the day. I picked the project that wanted to create a map interface with pins corresponding to the positions of resources in the repository; other choices were distributed reference services, visualization techniques to see collection content in context with the whole repository, and a new-content notification service. The teams presented their work at the end of the day to the other participants and a panel of judges. We even had a remote participant show his work over a Skype video call. After much deliberation, the judges picked the “DPLA Plus” project as best of the AppFest; that project focused on the need of public users to discover what is in the repository.

I think most would agree that the AppFest was a success. The API lived up to the challenge, both in terms of robustness of functionality and its technical ability to respond to queries from the projects. The records returned from the DPLA data set show the challenges of aggregating metadata from disparate sources. In our project we ran into nonsensical geo-located records, the kind of thing that could be filtered, checked, and enhanced by automated routines at the central hub (a sketch of such a sanity check is below). We also ran into the classic difficulty of finding reasonable thumbnails to display to users. Web addresses of thumbnail images are not a common part of metadata since thumbnails are usually created by the repository system itself. Having a DPLA service to augment records from hubs with URLs to real thumbnail images would be a very helpful addition.
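
To make the idea of automated checks at the central hub concrete, here is a small sketch of the kind of coordinate sanity test our map project needed. The Place record and its field names are hypothetical, made up for illustration; this is not the AppFest code or the DPLA metadata model.

```java
// A tiny coordinate sanity check of the sort a central hub could run before
// handing geo-located records to a map interface.
public class GeoSanity {

    record Place(String recordId, Double latitude, Double longitude) {}

    static boolean looksPlausible(Place p) {
        if (p.latitude() == null || p.longitude() == null) {
            return false;                                   // no coordinates at all
        }
        double lat = p.latitude(), lon = p.longitude();
        if (lat < -90 || lat > 90 || lon < -180 || lon > 180) {
            return false;                                   // outside the valid range
        }
        // (0, 0) almost always means a failed geocode rather than the Gulf of Guinea
        return !(lat == 0.0 && lon == 0.0);
    }

    public static void main(String[] args) {
        Place good = new Place("rec-1", 35.0456, -85.3097);   // Chattanooga
        Place bogus = new Place("rec-2", 0.0, 0.0);
        Place broken = new Place("rec-3", 412.0, -85.0);

        for (Place p : new Place[] { good, bogus, broken }) {
            System.out.println(p.recordId() + " plausible? " + looksPlausible(p));
        }
    }
}
```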

The AppFest was more evidence that the DPLA is definitely pulling itself up by its bootstraps. Coming a month after a successful Midwest plenary session and two months after formal incorporation and installation of its board of directors plus announcements of new rounds of grant funding from IMLS, NEH, and the Knight Foundation, the DPLA is making steady progress towards fulfilling its mission. The next major event is slated to be the unveiling of a public interface to the DPLA collection in April 2013.

The text was modified to update a link from http://dp.la/get-involved/events/appfest/ to http://dp.la/info/developers/appfest/ on September 26th, 2013.

The text was modified to update a link from http://dp.la/get-involved/events/dplamidwest/ to http://dp.la/info/get-involved/events/dplamidwest/ on September 26th, 2013.

ALA Virtual Conference Includes Talk on Open Source in Libraries

ALA has its “Virtual Conference” coming up on July 18th and 19th. It is two days of at-your-desktop talks on some of the most interesting topics in libraries today. I’m presenting a derivative of the Introducing FOSS4Lib webinar and in-person presentations. The version I’m doing for the ALA Virtual Conference takes a broader look at open source software in libraries in addition to the tools and software registry on FOSS4Lib.org. There are a number of sessions on the state of ebooks in libraries plus talks on effective engagement with patrons and building responsive organizations. Registration for the virtual conference is $69 ($51.75 if you attended the ALA Annual Conference in Anaheim), and group registration for up to 15 IP addresses is $300 ($225 if you registered for the Annual Conference).

My ALA Anaheim 2012 Schedule

It is that time of year again when representatives from the library profession all gather for the annual American Library Association meeting. This year it is in Anaheim, California on June 21–26. And as the pace of technology continues to push libraries into new areas of content and service, this meeting promises to be an exciting one. Or, at least I’m planning on having a fun and engaging time. Here is my tentative schedule of public events. If you’d like to get together to chat outside these times, please get in touch.

Updated to correct the date for the LYRASIS lounge.


Time: Friday, June 22, 2012 from 7:30 PM to 9:30 PM
Location: Anabella House – Magnolia Poolside Meeting Room and Private Patio

LYRASIS members, friends, and those interested are invited to join staff for this get-together. RSVP via Facebook or email.

Conversation Starters: Discovery Here, Discovery There: Pros and Cons of Local or Remote Hosting of Discovery Tools

Session in the ALA scheduler
Time: Saturday, June 23, 2012 from 9:15 AM to 10:00 AM
Location: Anaheim Convention Center – 208A
Discovery systems are powerful tools to help users find information resources across the breadth of the library’s online holdings. Many of these tools offer APIs for libraries to build their own user interfaces to the search index, allowing a library to keep site visitors within the library until the time they access the full text of a resource. What are the pros and cons of keeping discovery local? This talk will explore the user interaction, interface design, and user expectations of such homegrown interfaces.

Discovery systems continue to be a hot topic, as is the question about whether libraries should run their own systems or subscribe to commercial services. This is a topic that is also addressed in the FOSS4Lib Control versus Responsibility 40-question survey tool. I’m interested to hear Ken Varnum‘s take on it and the kinds of questions that come from the audience.

As an aside, ALA is trying out two new session formats this year: 5-minute/20-slide “Ignite” style sessions and 45-minute “conversation starter” sessions. This is one of the latter.

Kuali OLE: Community Ready Software for Your Library!

Session in the ALA scheduler
Time: Saturday, June 23, 2012 from 10:30 AM to 12:00 PM
Location: Anaheim Convention Center – 210D
Kuali OLE (pronounced oh-lay) is in the first year of building a community-source library management system that takes advantage of existing Kuali Foundation software. Operating since July 2010, and supported by The Andrew W. Mellon Foundation, Kuali OLE is one of the largest academic library software collaborations in the United States. This program will provide an overview of our current software release and how you can get involved with Kuali OLE at your library.

At LYRASIS I’m hearing lots of questions from our members about the Kuali OLE project. I’m heading to this session to see what the needs of libraries would be as a part of forming a strategy for LYRASIS. OLE is an important open source software project for library automation of a kind we haven’t seen in a decade (since the foundation was laid for Evergreen and Koha). The fruits of the early labor are just ripening, and the results could have a profound impact — not only on the ILS marketplace but also in how libraries come together to work on shared software development. Note that this is one of several sessions at ALA featuring the OLE project.

New Library Technology Paradigms: OS vs Black Box vs Hybrids

Session in the ALA scheduler
Time: Saturday, June 23, 2012 from 1:30 PM to 3:30 PM
Location: Anaheim Convention Center – 206A
Some libraries build new Open Source products, some adopt existing ones, and others buy packaged products. How do libraries make the choice? What are the trade-offs, benefits, and pitfalls of building something in house, using an existing OS solution, buying something out of the box, or building a hybrid solution? Our panelists will talk about how and why they build systems and what drives their decision-making processes.

Here’s the place in the program where I’ll be speaking. Joining me on the panel moderated by Evviva Weinraub are Bohyun Kim and Megan Banasek, on the decision-making process for choosing software. I’ll be talking about the FOSS4Lib Decision Support tools, and the other two speakers will be talking about their experiences.

ACRL / SPARC Forum: Campus Open Access Funds: The State of Play

Session in the ALA scheduler
Time: Saturday, June 23, 2012 from 4:00 PM to 5:30 PM
Location: Disneyland Hotel – Disneyland Grand Ballroom South
A sustainable, Open Access scholarly communication system requires robust, stable sources of funding. One key source of such funding is campus-based Open Access funds – pools of money provided by academic institutions specifically earmarked to help authors offset the cost of journal publication. These funds have sprung up on campuses large and small, in colleges and universities across the U.S., Canada, and increasingly, worldwide. How are these funds created? Where are they located and who administers them? Where does the money come from? Are authors using these funds? Where can my institution turn for information on creating such a fund?

This forum will explore all of these questions and more, as a panel of experts delve into the latest developments in creating, implementing and sustaining this crucial resource.

Presenters include:

  • Chuck Eckman, Librarian and Dean of Library Services at Simon Fraser University
  • Sue Kriegsman, Program Manager for the Office for Scholarly Communication at Harvard University Library
  • Andrew Waller, Librarian at University of Calgary

These panelists will share their experiences in establishing and running some of the most visible and longest-running Open Access campus funds in existence, and discuss what’s working, what needs fine-tuning, and what new developments they see on the horizon for these crucial resources.

Bibliographic Framework Transition Initiative update

Session in the ALA scheduler
Time: Sunday, June 24, 2012 from 10:30 AM to 12:00 PM
Location: Anaheim Marriott Grand Salon A-C

Community forum to share news and views on the LC Bibliographic Framework Transition Initiative.

I’m expecting lots of news here, particularly with the recent announcement of the modeling initiative. I’m also eager to hear how librarians can participate more deeply in the effort.

As an aside, if this session wasn’t going on I’d be going to the “Responsive Web Design: Get Beyond the Myth of the Mobile Web” session. Responsive web design is an important technique, and I’m glad to see it getting some exposure with a broader library audience.

Chat Library Technology With Me at the LYRASIS Booth

Exhibitor information in the ALA scheduler
Time: Sunday, June 24, 2012 from 1:00 PM to 3:00 PM
Location: Exhibit floor booth 2001

Want to talk open source software? FOSS4Lib? Ebooks? Discovery layers? Come meet with me at the LYRASIS booth and we can chat about these topics and more.

The Fourth Paradigm: Data-Intensive Research, Digital Scholarship and Implications for Libraries

Session in the ALA scheduler
Time: Sunday, June 24, 2012 from 4:00 PM to 5:30 PM
Location: Anaheim Convention Center – Ballroom A
Tony Hey will describe the emergence of a new, ‘fourth paradigm’ for scientific research – involving the acquisition, management and analysis of vast quantities of scientific data. This ‘data deluge’ is already affecting many fields of science, most notably fields like biology, astronomy, particle physics, environmental science, and oceanography. The term eScience or eResearch is used to describe the development of the tools and technologies to support this more data-intensive, collaborative and often multidisciplinary research. This revolution will not be confined to the physical sciences but will also transform large parts of the humanities and social sciences as more and more of their primary research data is now being born digital.

Tony Hey is Vice President of Microsoft Research Connections, and he has a lot of good things to say on the topic of data that we should be listening to.

The Ultimate Debate: Cloud Computing: Floating or Free Falling?

Session in the ALA scheduler
Time: Monday, June 25, 2012 from 1:30 PM to 3:30 PM
Location: Anaheim Convention Center – 213AB
The Ultimate Debate returns for the seventh straight year with a lively discussion over the promises and pitfalls of cloud computing. Three panelists will tease out the various components of cloud computing to give you the insight needed to decide if you should be in the clouds or on terra firma.

A recent article on GigaOm[1] said, “The good news is that you’re not going to mind that your cloud computing budget will be higher than what you’re paying now for IT, because you’ll be able to do more.” I wonder if that is true. Cloud computing, and Software-as-a-Service specifically, has taken off in libraries of all types and sizes, but I haven’t seen where we’ve engaged in a cost-benefit analysis. I expect this “ultimate debate” will shed some light on the topic.

Drive Your Project Forward with Scrum

Session in the ALA scheduler
Time: Monday, June 25, 2012 from 4:00 PM to 5:30 PM
Location: Anaheim Convention Center – 203A
NPR librarian Janel Kinlaw shares lessons learned from adapting Scrum, an agile process framework, to content management projects. She’ll discuss how this approach freed the team to innovate in structuring projects, gathering feedback from end users in real time, identifying risk and scope creep sooner, and aligning library goals to the broader objectives of the organization. Janel will demonstrate where the Scrum process took the team further than traditional methodologies.

This last session is for the geeky side of me. I haven’t worked in a formal Scrum environment, but I enjoy hearing the stories of those who do.


  1. The cloud will cost you, but you’ll be happy to pay, by Dave Roberts, GigaOm Cloud Computing News, published Jun. 9, 2012

My ALA Midwinter 2012 Schedule

The snow is falling here in central Ohio, so I’m eager to leave here and head to warm Dallas for ALA Midwinter 2012. I’m looking forward to catching up with colleagues; making new acquaintances; learning the latest thinking on RDA, linked data, and standards activity; and talking about free/open source software in libraries. On the latter point, I encourage you to come see me give an introduction to the newly announced FOSS4Lib site, answer questions, and take feedback on Saturday morning (10:30 to 11:30) or Sunday morning (10:30 to 11:30). (Or, if you are not coming to Midwinter, sign up for one of the free webinar sessions later in January and February.)

ALA is using a new iteration of its scheduler this year, and it keeps getting better and better. This one even allows you to embed your selected schedule as an <iframe> on an arbitrary page. So here is my schedule:

You can follow me on Twitter where I’ll be tweeting about #alamw12. A Twitter mention or direct message is also the best way to get ahold of me while in Dallas.

Safe travels if you are headed to Midwinter, and I hope to run into you there.

Open Repositories 2011 Report: Day 3 – Clifford Lynch Keynote on Open Questions for Repositories, Description of DSpace 1.8 Release Plans, and Overview of DSpace Curation Services

The main Open Repositories conference concluded this morning with a keynote by Clifford Lynch and the separate user group meetings began. I tried to transcribe Cliff’s great address as best I could from my notes; hopefully I’m not misrepresenting what he said in any significant ways. He has some thought-provoking comments about the positioning of repositories in institutions and the policy questions that come from that. For an even more abbreviated summary, check out this National Conversation on the Economic Sustainability of Digital Information (skip to “chapter 2” of the video) held April 1, 2010 in Washington DC.

Not only have institutional repositories acted as a focal point for policy, they have also been a focal point for collaborations. Library and IT collaborations were happening long before institutional repositories surfaced. Institutional repositories, though, have been a great place to bring other people into that conversation, including faculty leaders, to start engaging them in questions about the dissemination of their work. Also chief research officers: in 1995, if you were a university librarian doing leadership work constructing digital resources to change scholarly communication, you would have talked to the CIO but may not have known who your chief research officer was. That set of conversations, which are now critical when talking about data curation, got their start with institutional repositories and related policies.

Another place for conversation has been with those in university administrations concerned with building public support for the institution, by giving the public a deeper understanding of what the institution contributes to culture, industry, health, and science, and by connecting faculty to this effort. This goes beyond the press release by opening a public window into the work of the institution. This is particularly important today with questions of public support for institutions.

That said, there are a number of open questions and places where we are dealing with works-in-progress. Cliff then went into an incomplete and, from his perspective, perhaps idiosyncratic, list of these issues.

Repositories are one of the threads that are leading us nationally and internationally into a complete rethinking of the practice of name authority. While it is an old-fashioned librarian concept, it is converging with “identity management” from IT. He offered an abbreviated and exaggerated example: librarians did name authority for authors of stuff in general in the 19th century. In the 20th century there was too much stuff; material in journals and magazines became overwhelming. So libraries backed off and focused only on books and the stuff that went into catalogs; the rest they turned over to indexing and abstracting services. We made a few weird choices, like deciding that an authority file should be as simple as possible to disambiguate authors rather than as full as possible, so we had the development of things alongside name authority files, like the national dictionaries of literary biography.

For scientific journal literature, publishers followed practices about how obscure author names could be (e.g. just last name and first initial). The huge amount of ambiguity in these “telegraphic author names” results in a horribly dirty corpus of data. A variety of folks are realizing that we need to disambiguate authorship by assigning author identifiers and somehow go back and clean up the mess in the existing bibliographic data of the scholarly literature, especially journal literature. Institutions are taking more responsibility for the work of their communities, and are having to do local name authority all over again. We have the challenge of how to reconnect this activity to national and international files. We also have a set of challenges on whether we want to connect this to biographical resources. It brings up issues of privacy, of when people do things of record, and of how much else should come along with building a public biography resource. We also see a vast parallel investment in institutional identity management. Institutions haven’t quite figured out that people don’t necessarily publish with the same name that is recorded in the enrollment or employment systems that the institution manages, and that it would be a good idea to tie those literary names to the identity files that the institution manages.

We’re not confident of the ecological positioning of institutional repositories among the pretty complicated array of information systems found at a typical large university. Those systems include digital library platforms, course management systems, lecture capture systems, facilities for archiving the digital records of the institution, and platforms intended to directly support active research by faculty. All are evolving at their own rates. It is unclear where institutional repositories fit, and what the boundaries around them are.

Here is one example. What is the difference between an institutional repository and a digital library/collection? You’d get very different answers from different people. One might be who does the curation, how it is sourced, and how it is scoped. The decisions are largely intellectual. Making this confusing is that you’ll see the same platform for institutional repositories and digital library platforms. We are seeing a convergence of the underpinning platforms.

Another one: learning management systems (LMS). These have become virtually universal among institutions in the same timeframe that institutional repositories have been deployed. We’ve done a terrible job at thinking about what happens to the stuff in them when the course is over. We can’t decide if it is scholarly material, institutional records, or something else. They are tangled up between learning materials and all of the stuff that populates a specific performance of a course, such as quizzes and answers, discussion lists, and student course projects. We don’t have taxonomies and policies here, nor a working distinction between institutional repositories and learning management systems. It is an unusual institution that has a systematic export from the LMS to an IR.

Lecture capture systems are becoming quite commonplace; students are demanding them in much the same way that the LMS was demanded. A lecture capture system may be more universally helpful than an LMS. Lectures are being captured for a wide range of reasons, but not knowing why makes it difficult to know whether to keep them and how to integrate them into the institution’s resources.

Another example: the extent to which institutional repositories should sit in the stream of active work. As faculty are building datasets and doing computation with them, when is it time for something to go into an institutional repository? How volatile can content be in the repository? How should repositories be connected to, or considered as, robust working storage? He suspects that many institutional repositories are not provisioned with high-performance storage and network connections, and would become a bottleneck in the research process. The answers would be different for big data sets and small data sets, and we are starting to see datasets that are too big to back up or too big to replicate.

Another issue is that of virtual organizations, the kind of collaborative efforts that span institutions and nations. They often allow researchers to be mobilized with relatively low overhead to work on a problem, and are becoming commonplace in the sciences and social sciences and starting to pop up in the humanities. We have a problem with the rules of the road between virtual organizations and institution-based repositories. It is easy to spin up an institutional repository for a virtual organization, but what happens to it when the virtual organization shuts down? Some of these organizations are intentionally transient; how do we assign responsibility for a world of virtual organizations and map them onto institutional organizations for long-term stewardship?

Software is starting to concern people. So much scholarship is tied up now in complicated software systems that we are starting to see a number of phenomena. One is data that is difficult to reuse or understand without the software. Another is the difficulty surrounding reproducibility — taking results and realizing they are dependent on an enormous stack of software, without a clear way to talk about the provenance of a result based on that stack of software versions that would allow for high confidence in reproducing the results. We’re going to have to deal with software. We are also entering an era of deliberate obsolescence of software; for instance, any Apple product that is older than a few years is headed to the dustbin, and this hasn’t been fully announced or realized so that people can deal with it.

Another place that has been under-exploited is the question of retiring faculty and repositories: taking inventory of someone’s scholarly collections and migrating them to an institutional framework in an orderly fashion.

How do we reinterpret institutional repositories beyond universities? For example, there is something that looks a bit like an institutional repository, but with some differences, that belongs in public libraries or historical societies or similar organizations. This dimension bears exploration.

To conclude his comments, he talked about a last open issue. When we talk about good stewardship and preservation of digital materials, there are a couple of ideas that have emerged as we tried to learn from our past stewardship of the print scholarly literature. One of these principles is that geographic replication is a good thing; we’re starting to see this in the sense that most repositories are based on some geographically redundant storage system, or we’ll see a steady migration towards this in the next few years. A second one is organizational redundancy. If you look at the print world, it wasn’t just that the scholarly record was held in a number of independent locations, but also that control was replicated among institutions that were making independent decisions about adding materials to their library collections. Clearly they coordinated to a point, but they also had institutional independence. We don’t know how to do this with institutional repositories. This is also emerging in special collections as they become digital. Because they didn’t start life as published materials in many replicated versions, we need other mechanisms to have curatorial responsibility distributed. This is linked to the notion that it is usually not helpful to talk about preservation in terms like “eternity” or “perpetuity” or the life of the republic. It is probably better in most cases to think about preservation one chunk at a time: an institution making a 20-year or 50-year commitment with a well-structured process at the end. That process includes deciding whether the institution should renew the commitment and, if not, letting other interested parties come in and take responsibility with a well-ordered hand-off. This ties into policies and strategies for curatorial replication across institutions and ways that institutional repositories will need to work together. It may be less critical today, but it will become increasingly critical.

In conclusion, Cliff said that he hoped he had left the attendees with a sense that repositories are not things that stand on their own; they are in fact mechanisms that advance policy in a very complex ecology of systems. Indeed, we don’t have our policy act together on many of the systems adjacent to the repository, which leads to issues of appropriate scope and interfaces with those systems. Where repositories will evolve to in the future as we come to understand the role of big data is also of interest.

DSpace 1.8

Robin Taylor, the DSpace version 1.8 release manager, gave an overview of what was planned (not promised!) for the next major release. The release schedule was to have a beta last week, but that didn’t happen. The remainder of the schedule is to have a beta on July 8th, feature freeze on August 19th, release candidate 1 published on September 2nd in time for the test-a-thon from the 5th to the 16th, followed by a second release candidate on September 30th, final testing October 3rd through the 12th, and a final release on October 14th. He then went into some of the planned highlights of this release.

SWORD is a lightweight protocol for depositing items between repositories; it is a profile of the Atom Publishing Protocol. At the current release, DSpace has been able to accept items; the planned work for 1.8 will make it possible to send items. Some possible use cases: publishing from a closed repository to an open repository, sending from the repository to a publisher, from the repository to a subject-specific service (such as arXiv), or vice versa. The functionality was copied from the Swordapp demo. It supports SWORD v1 and only the DSpace XMLUI. A question was asked about whether the SWORD copy process is restricted to just the repository manager. The answer was that it should be configurable. On the one hand it can be open, because it is up to the receiving end to determine whether or not to accept the deposit. On the other hand, a repository administrator might want to prevent items being exported out of a collection.
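
For readers who have not seen SWORD on the wire, here is a hedged sketch of what a v1 deposit looks like as a plain HTTP POST, written against Java's standard HttpClient rather than any DSpace code. The deposit URL, credentials, and package file name are placeholders; the X-Packaging value shown is the METS SIP profile commonly used with DSpace's SWORD module, but the target repository's service document is the authority on what it actually accepts.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical SWORD v1 deposit; a DSpace deposit URL usually looks like
// .../sword/deposit/<collection-handle>, but verify against your service document.
public class SwordDeposit {
    public static void main(String[] args) throws IOException, InterruptedException {
        String depositUri = "https://repository.example.edu/sword/deposit/123456789/2";
        String auth = Base64.getEncoder().encodeToString("depositor:secret".getBytes());

        HttpRequest request = HttpRequest.newBuilder(URI.create(depositUri))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/zip")
                // SWORD v1 packaging header; METS SIP is the profile commonly used with DSpace
                .header("X-Packaging", "http://purl.org/net/sword-types/METSDSpaceSIP")
                .header("Content-Disposition", "filename=item-package.zip")
                .header("X-No-Op", "false")       // set to "true" for a dry-run deposit
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("item-package.zip")))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // A successful deposit returns 201 Created plus an Atom entry describing the new item.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```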

MIT has rewritten the Creative Commons licensing selection steps. It uses the Creative Commons web services (as XML) rather than HTML iframes, which allows better integration with DSpace. As an aside, the Creative Commons and license steps have been split into two discrete steps allowing different headings in the progress bar.

The DSpace Community Advisory Team prioritized issues to be addressed by the developers, and for this release they include JIRA issue DS-638 for virus checking during submission. The solution invokes the existing Curation Task framework and requires the ClamAV antivirus software to be installed. It is switched off by default and is configured in submission-curation.cfg. Two other issues that were addressed are DS-587 (Add the capability to indicate a withdrawn reason to an Item) and DS-164 (Deposit interface), which was completed as the Google Summer of Code Submission Enhancement project.

Thanks to Bojan Suzic’s Google Summer of Code project, DSpace has had a REST API. The code has been publicly available and repositories have been making use of it, so the committers group wants to get it into a finished state and include it in 1.8. There is also work on an alternative approach to a REST API.

DSpace and DuraCloud was also covered; it was much the same that I reported on earlier this week, so I’m not repeating it here.

From the geek perspective, the new release will see increasing modularization of the codebase and more use of Spring and the DSpace Services Framework. The monolithic dspace.cfg will be split up into separate pieces; some pieces would move into Spring config while other pieces could go into the database. It will have a simplified installation process, and several components that were talked about elsewhere at the meeting: WebMVC UI, configurable workflow, and more curation tasks.

Introduction to DSpace Curation Services

Bill Hays talked about curation tasks in DSpace. Curation tasks are Java objects managed by the Curation System. Functionally, they are operations run on a DSpace object and (optionally) its contained objects (e.g., a community, subcommunity, collection, and items). They do not work site-wide, nor on bundles or bitstreams. The tasks can be run in multiple ways by different types of administrative users, and they are configured separately from dspace.cfg.

Some built-in tasks are: validate metadata against input forms (halts on task failure), count bitstreams by format type, and a virus scan (using an external virus detection service) on ingest (the desired use case), plus the replication suite of tasks for DuraCloud. Other tasks: a link checker and 11 others (from Stuart Lewis and Kim Shepherd), format identification with DROID (in development), validate/add/replace metadata, a status report on workflow items, filter media in workflow (proposed), and checksum validation (proposed).

What does this mean for different users? For a repository or collection manager, it means new functionality — GUI access without GUI development: curation, preservation, validation, reporting. For a developer: rapid development and deployment of functionality without rebuilding or redeploying the DSpace instance.

The recommended Java development environment for tasks is a package outside of dspace-api. Make a POM with a dependency on dspace-api, especially /curate. Required features of a task are a constructor with no arguments (to support loading as a plugin) and that it implements the CurationTask interface or extends the AbstractCurationTask class. Deploy it as a JAR and configure it (similar to a DSpace plugin).

There are some Java annotations for curation task code that are important to know about. Setting @Distributive means that the task is responsible for handling any contained DSpace objects as appropriate; otherwise the default is to have the task executed across all contained objects (subcommunities, collections, or items). Setting @Suspendable means the task interrupts processing when the first FAIL status is returned. Setting @Mutative means the task makes changes to target objects.
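
Putting the last two paragraphs together, here is a minimal sketch of a curation task written against the 1.8-era API as Bill described it. The class name, the bitstream-counting behavior, and the result strings are invented for illustration; AbstractCurationTask, the Curator status constants, and the @Suspendable annotation are the pieces named in the talk, though exact signatures may differ in other DSpace releases.

```java
import java.io.IOException;

import org.dspace.content.Bundle;
import org.dspace.content.DSpaceObject;
import org.dspace.content.Item;
import org.dspace.core.Constants;
import org.dspace.curate.AbstractCurationTask;
import org.dspace.curate.Curator;
import org.dspace.curate.Suspendable;

// Illustrative task that counts the bitstreams on each item it visits.
@Suspendable
public class BitstreamCountTask extends AbstractCurationTask {

    @Override
    public int perform(DSpaceObject dso) throws IOException {
        // Containers (communities, collections) are walked by the framework's default
        // distribution because this task is not marked @Distributive.
        if (dso.getType() != Constants.ITEM) {
            return Curator.CURATE_SKIP;
        }
        Item item = (Item) dso;
        int count = 0;
        try {
            for (Bundle bundle : item.getBundles()) {
                count += bundle.getBitstreams().length;
            }
        } catch (Exception e) {
            setResult("Could not read bundles for " + item.getHandle());
            return Curator.CURATE_ERROR;
        }
        String result = item.getHandle() + " has " + count + " bitstream(s)";
        report(result);      // goes to the curation report stream
        setResult(result);   // shown as the task status in the UI or on the command line
        return Curator.CURATE_SUCCESS;
    }
}
```

Once the JAR is on the classpath and the task is registered in the curation configuration (as a plugin, per the paragraph above), it can be invoked from the XMLUI or from the command line; the usual form is something like [dspace]/bin/dspace curate -t taskname -i handle, though treat those flags as an assumption to check against your version's documentation.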

Invoking tasks can be done several ways: from the web application (XMLUI), from the command line, from workflow, from other code, or from a queue (deferred operation). In the case of workflow, one can target the action of the task anywhere in the workflow steps (e.g. before step 1, step 2, or step 3, or at item installation). Actions (reject or approve) are based on task results, and notifications are sent by e-mail.

A mechanism for discovering and sharing tasks doesn’t exist yet. What is needed is a community repository of tasks. For each task what is needed is: a descriptive listing, documentation, reviews/ratings, link to source code management system, and link to binaries applicable to specific versions.

With dynamic loading of scripting languages via JSR-223, it is theoretically possible to create curation tasks in Groovy, JRuby, or Jython, although the only one Bill has been able to get to work so far is Groovy. Scripting code needs a high level of interoperability with Java, and must implement the CurationTask interface. Configuration is a little bit different: one needs a task catalog with descriptors for the language, the name of the script, and how the constructor is called. Bill demonstrated some sample scripts.

In his conclusion, Bill said that the new Curation Services increase functionality for content in a managed framework; offer multiple ways of running tasks for different types of users and scenarios; make it possible to add new code without a rebuild; simplify extending DSpace functionality; and, with scripting, lower the bar even more.