Open Repositories 2011 Report: Day 3 – Clifford Lynch Keynote on Open Questions for Repositories, Description of DSpace 1.8 Release Plans, and Overview of DSpace Curation Services

The main Open Repositories conference concluded this morning with a keynote by Clifford Lynch, and the separate user group meetings began. I tried to transcribe Cliff’s great address as best I could from my notes; hopefully I’m not misrepresenting what he said in any significant way. He had some thought-provoking comments about the positioning of repositories in institutions and the policy questions that come from that. For an even more abbreviated summary, check out the National Conversation on the Economic Sustainability of Digital Information held April 1, 2010, in Washington, DC (skip to “chapter 2” of the video).

Not only have institutional repositories acted as a focal point for policy, they have also been a focal point for collaborations. Library and IT collaborations were happening long before institutional repositories surfaced, but institutional repositories have been a great place to bring other people into that conversation, including faculty leaders, who can start engaging with questions about the dissemination of their work. The same is true of chief research officers: in 1995, a university librarian doing leadership work constructing digital resources to change scholarly communication would have talked to the CIO but might not have known who the chief research officer was. That set of conversations, which is now critical when talking about data curation, got its start with institutional repositories and related policies.

Another place for conversation has been with those in university administrations concerned with building public support for the institution, by giving the public a deeper understanding of what the institution contributes to culture, industry, health, and science, and by connecting faculty to this effort. This goes beyond the press release by opening a public window into the work of the institution, which is particularly important today given questions of public support for institutions.

That said, there are a number of open questions and places where we are dealing with works-in-progress. Cliff then went into an incomplete and, from his perspective, perhaps idiosyncratic list of these issues.

Repositories are one of the threads that are leading us, nationally and internationally, into a complete rethinking of the practice of name authority. Name authority is an old-fashioned librarian concept, but it is converging with “identity management” from IT. He offered an abbreviated and exaggerated history: in the 19th century, librarians did name authority for authors of material in general. In the 20th century there was too much material; the volume in journals and magazines in particular became overwhelming, so libraries backed off and focused only on books and other things that went into catalogs, turning the rest over to indexing and abstracting services. We made a few odd choices along the way, such as deciding that an authority file should be as simple as possible in order to disambiguate authors rather than as full as possible, which is why things like the national dictionaries of literary biography developed alongside name authority files.

For scientific journal literature, publishers followed practices that allowed author names to be quite obscure (e.g., just a last name and first initial). The enormous ambiguity of these “telegraphic author names” results in a horribly dirty corpus of data. A variety of folks are realizing that we need to disambiguate authorship by assigning author identifiers and somehow go back and clean up the mess in the existing bibliographic data of the scholarly literature, especially the journal literature. Institutions are taking more responsibility for the work of their communities, and so are having to do local name authority all over again. We have the challenge of how to reconnect this activity to national and international files. We also have a set of challenges around whether we want to connect this to biographical resources; that brings up issues of privacy, of when people do things “of record,” and of how much else should come along with building a public biography resource. We also see a vast parallel investment in institutional identity management. Institutions haven’t quite figured out that people don’t necessarily publish under the same name that is recorded in the enrollment or employment systems the institution manages, and that it would be a good idea to tie those literary names to the identity files the institution manages.

We’re not confident about the ecological positioning of institutional repositories among the pretty complicated array of information systems found at a typical large university. Those systems include digital library platforms, course management systems, lecture capture systems, facilities for archiving the digital records of the institution, and platforms intended to directly support active research by faculty. All are evolving at their own rates. It is unclear where institutional repositories fit and what the boundaries around them are.

Here is one example. What is the difference between an institutional repository and a digital library/collection? You’d get very different answers from different people. One answer might be who does the curation, how the content is sourced, and how it is scoped; those decisions are largely intellectual. Making this confusing is that you’ll see the same platform used for institutional repositories and for digital libraries. We are seeing a convergence of the underpinning platforms.

Another one: learning management systems (LMS). These have become virtually universal among institutions in the same timeframe that institutional repositories have been deployed. We’ve done a terrible job of thinking about what happens to the stuff in them when the course is over. We can’t decide if it is scholarly material, institutional records, or something else. They are a tangle of learning materials and all of the stuff that populates a specific performance of a course, such as quizzes and answers, discussion lists, and student course projects. We don’t have taxonomies and policies here, nor a working distinction between institutional repositories and learning management systems. It is an unusual institution that has a systematic export from the LMS to an IR.

Lecture capture systems are becoming quite commonplace; students are demanding them in much the same way that the LMS was demanded. A lecture capture system may be more universally helpful than an LMS. Lectures are being captured for a wide range of reasons, but not knowing why makes it difficult to know whether to keep them and how to integrate them into the institution’s resources.

Another example: the extent to which institutional repositories should sit in the stream of active work. As faculty build datasets and do computation with them, when is it time for something to go into an institutional repository? How volatile can content be in the repository? How should repositories be connected to, or considered as, robust working storage? He suspects that many institutional repositories are not provisioned with high-performance storage and network connections, and would become a bottleneck in the research process. The answers would be different for big datasets and small datasets, and we are starting to see datasets that are too big to back up or too big to replicate.

Another issue is that of virtual organizations, the kind of collaborative efforts that span institutions and nations. They allow researchers to be mobilized around a problem with relatively low overhead, and they are becoming commonplace in the sciences and social sciences and starting to pop up in the humanities. We have a problem with the rules of the road between virtual organizations and institution-based repositories. It is easy to spin up a repository for a virtual organization, but what happens to it when the virtual organization shuts down? Some of these organizations are intentionally transient; how do we assign responsibility for a world of virtual organizations and map them onto institutional organizations for long-term stewardship?

Software is starting to concern people. So much scholarship is now tied up in complicated software systems that we are starting to see a number of phenomena. One is data that is difficult to reuse or understand without the software. Another is the difficulty surrounding reproducibility: a result may depend on an enormous stack of software, and we don’t have a clear way to talk about the provenance of a result in terms of that stack of software versions, which would allow for high confidence in reproducing it. We’re going to have to deal with software. We are also entering an era of deliberate obsolescence of software; for instance, any Apple product more than a few years old is headed for the dustbin, and this hasn’t been fully announced or realized in a way that lets people deal with it.

Another under-explored question is that of retiring faculty and repositories: taking inventory of someone’s scholarly collections and migrating them into an institutional framework in an orderly fashion.

There is also the question of how we reinterpret institutional repositories beyond universities. For example, there are things that look a bit like institutional repositories, but with some differences, that belong in public libraries or historical societies or similar organizations. This dimension bears exploration.

To conclude his comments he talked about a last open issue. When we talk about good stewardship and preservation of digital materials, there are a couple of ideas that have emerged as we have tried to learn from our past stewardship of the print scholarly literature. One of these principles is that geographic replication is a good thing; most repositories are either based on some geographically redundant storage system already or will migrate steadily towards one in the next few years. A second is organizational redundancy. With print, it wasn’t just that the scholarly record was held in a number of independent locations, but also that control was replicated among institutions making independent decisions about adding materials to their library collections. Clearly they coordinated to a point, but they also had institutional independence. We don’t know how to do this with institutional repositories. The same issue is emerging in special collections as they become digital: because those materials didn’t start life as published works in many replicated copies, we need other mechanisms to distribute curatorial responsibility.

This is linked to the notion that it is usually not helpful to talk about preservation in terms like “eternity” or “perpetuity” or the life of the republic. It is probably better in most cases to think about preservation one chunk at a time: an institution making a 20-year or 50-year commitment with a well-structured process at the end. That process includes deciding whether the institution should renew the commitment and, if not, letting other interested parties come in and take responsibility through a well-ordered hand-off. This ties into policies and strategies for curatorial replication across institutions and the ways institutional repositories will need to work together. It may be less critical today, but it will become increasingly critical.

In conclusion, Cliff said that he hoped he had left the attendees with a sense that repositories are not things that stand on their own; they are in fact mechanisms that advance policy within a very complex ecology of systems. We don’t have our policy act together on many of the systems adjacent to the repository, which leads to issues of appropriate scope and of interfaces with those systems. Where repositories will evolve in the future as we come to understand the role of big data is also of interest.

DSpace 1.8

Robin Taylor, the DSpace version 1.8 release manager, gave an overview of what was planned (not promised!) for the next major release. The release schedule was to have a beta last week, but that didn’t happen. The remainder of the schedule is to have a beta on July 8th, feature freeze on August 19th, release candidate 1 published on September 2nd in time for the test-a-thon from the 5th to the 16th, followed by a second release candidate on September 30th, final testing October 3rd through the 12th, and a final release on October 14th. He then went into some of the planned highlights of this release.

SWORD is a lightweight protocol for depositing items between repositories; it is a profile of the Atom Publishing Protocol. As of the current release, DSpace is able to accept items via SWORD; the planned work for 1.8 will make it possible to send items as well. Some possible use cases: publishing from a closed repository to an open repository, sending from the repository to a publisher, or moving items between the repository and a subject-specific service (such as arXiv) in either direction. The functionality was adapted from the Swordapp demo. It supports SWORD v1 and only the DSpace XMLUI. A question was asked about whether the SWORD copy process is restricted to just the repository manager; the answer was that it should be configurable. On the one hand it can be open, because it is up to the receiving end to determine whether or not to accept the item. On the other hand, a repository administrator might want to prevent items from being exported out of a collection.
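
To make the protocol concrete, here is a minimal sketch of what a SWORD v1 deposit looks like on the wire. This is my own illustration, not code from the 1.8 release: the endpoint URL, credentials, and package name are hypothetical, and the X-Packaging value shown is the METS-based DSpace SIP profile.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Base64;

    // Sketch of a SWORD v1 (AtomPub profile) deposit: POST a packaged item
    // to a collection's deposit URL. Endpoint and credentials are made up.
    public class SwordDepositSketch {
        public static void main(String[] args) throws Exception {
            URL deposit = new URL("https://repo.example.edu/sword/deposit/123456789/2");
            byte[] pkg = Files.readAllBytes(Paths.get("item-package.zip"));

            HttpURLConnection conn = (HttpURLConnection) deposit.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/zip");
            // SWORD v1 declares the package format in the X-Packaging header.
            conn.setRequestProperty("X-Packaging",
                "http://purl.org/net/sword-types/METSDSpaceSIP");
            conn.setRequestProperty("Content-Disposition", "filename=item-package.zip");
            String auth = Base64.getEncoder()
                .encodeToString("depositor:secret".getBytes("UTF-8"));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            try (OutputStream out = conn.getOutputStream()) {
                out.write(pkg);
            }
            // A 201 Created response with an Atom entry body indicates success.
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }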

MIT has rewritten the Creative Commons license selection steps. The new code uses the Creative Commons web services (as XML) rather than HTML iframes, which allows better integration with DSpace. As an aside, the Creative Commons and license-granting steps have been split into two discrete steps, allowing different headings in the progress bar.

The DSpace Community Advisory Team prioritized issues to be addressed by the developers; for this release they include JIRA issue DS-638, virus checking during submission. The solution invokes the existing curation task mechanism and requires the ClamAV antivirus software to be installed. It is switched off by default and is configured in submission-curation.cfg. Two other issues that were addressed are DS-587 (add the capability to indicate a withdrawn reason for an item) and DS-164 (deposit interface), which was completed as the Google Summer of Code Submission Enhancement project.
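
As a small illustration of how light that configuration is, the enabling switch in submission-curation.cfg should look roughly like the following. I have not verified the exact key name against the shipped file, so treat it as an assumption and check your release’s copy:

    # submission-curation.cfg (sketch; verify the key name in your release)
    # Run the ClamAV-backed virus scan task when items are submitted.
    virus-scan = true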

Thanks to Bojan Suzic’s Google Summer of Code project, DSpace has had a REST API. The code has been publicly available and repositories have been making use of it, so the committers group wants to get it into a finished state and include it in 1.8. There is also work on an alternative approach to a REST API.

DSpace and DuraCloud were also covered; the material was much the same as what I reported on earlier this week, so I’m not repeating it here.

From the geek perspective, the new release will see increasing modularization of the codebase and more use of Spring and the DSpace Services Framework. The monolithic dspace.cfg will be split into separate pieces; some pieces would move into Spring configuration while other pieces could go into the database. The release will also have a simplified installation process, plus several components that were talked about elsewhere at the meeting: the WebMVC UI, configurable workflow, and more curation tasks.

Introduction to DSpace Curation Services

Bill Hays talked about curation tasks in DSpace. Curation tasks are Java objects managed by the Curation System. Functionally, a task is an operation run on a DSpace object and, optionally, its contained objects (e.g., a community, its subcommunities, collections, and items). Tasks do not work site-wide, nor on bundles or bitstreams. They can be run in multiple ways by different types of administrative users, and they are configured separately from dspace.cfg.

Some built-in tasks: validating metadata against input forms (halts on task failure), counting bitstreams by format type, virus scanning (which uses an external virus detection service, with scanning on ingest as the desired use case), and the replication suite of tasks for DuraCloud. Other tasks: a link checker and 11 others (from Stuart Lewis and Kim Shepherd), format identification with DROID (in development), validate/add/replace metadata, a status report on workflow items, filter media in workflow (proposed), and checksum validation (proposed).

What does this mean for different users? For a repository or collection manager, it means new functionality (GUI access without GUI development): curation, preservation, validation, and reporting. For a developer, it means rapid development and deployment of new functionality without rebuilding or redeploying the DSpace instance.

The recommended Java development environment for tasks is a package outside of dspace-api. Create a Maven POM with a dependency on dspace-api (especially the org.dspace.curate package). A task is required to have a no-argument constructor, to support loading as a plugin, and to implement the CurationTask interface or extend the AbstractCurationTask class. Deploy it as a JAR and configure it (similar to a DSpace plugin).
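
To ground this, here is a minimal sketch of a custom task, under the assumption that the 1.x curation API (org.dspace.curate) works as described in the talk; the package, class name, and metadata check are my own invention:

    package edu.example.curate;

    import java.io.IOException;

    import org.dspace.content.DSpaceObject;
    import org.dspace.content.Item;
    import org.dspace.curate.AbstractCurationTask;
    import org.dspace.curate.Curator;

    // Hypothetical task: flag items that lack a dc.title value.
    public class RequireTitleTask extends AbstractCurationTask {

        // AbstractCurationTask supplies the required no-argument constructor.

        @Override
        public int perform(DSpaceObject dso) throws IOException {
            if (!(dso instanceof Item)) {
                return Curator.CURATE_SKIP;  // only items are checked here
            }
            Item item = (Item) dso;
            boolean hasTitle =
                item.getMetadata("dc", "title", null, Item.ANY).length > 0;
            String msg = "Item " + item.getHandle()
                + (hasTitle ? " has a title" : " is missing a title");
            report(msg);      // line-oriented output back to the invoker
            setResult(msg);   // short status string surfaced in the UI
            return hasTitle ? Curator.CURATE_SUCCESS : Curator.CURATE_FAIL;
        }
    }

Packaged as a JAR and registered under a short name in the curation configuration, a task like this becomes runnable from the UI or the command line without touching the DSpace build.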

There are some Java annotations for curation task code that are important to know about. @Distributive means the task is responsible for handling any contained DSpace objects as appropriate; otherwise the default is for the task to be executed across all contained objects (subcommunities, collections, or items). @Suspendable means the task interrupts processing when the first FAIL status is returned. @Mutative means the task makes changes to target objects.
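
A hedged sketch of how one of these annotations is applied, again assuming they live in org.dspace.curate alongside the task classes; the task itself is a made-up no-op:

    package edu.example.curate;

    import java.io.IOException;

    import org.dspace.content.DSpaceObject;
    import org.dspace.curate.AbstractCurationTask;
    import org.dspace.curate.Curator;
    import org.dspace.curate.Distributive;

    // @Distributive: the framework hands this task the container object only,
    // instead of fanning the task out over every contained item; the task
    // decides whether and how to recurse.
    @Distributive
    public class ContainerNoteTask extends AbstractCurationTask {
        @Override
        public int perform(DSpaceObject dso) throws IOException {
            report("Curation requested on " + dso.getHandle());
            return Curator.CURATE_SUCCESS;
        }
    }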

Tasks can be invoked in several ways: from the web application (XMLUI), from the command line, from workflow, from other code, or from a queue (deferred operation). In the case of workflow, the task can be targeted anywhere in the workflow steps (e.g., before step 1, step 2, or step 3, or at item installation). Actions (reject or approve) are based on task results, and notifications are sent by e-mail.
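
Invoking a task from other code is compact. Here is a sketch under the assumption that the Curator API behaves as presented; the task name “vscan” and the handle are placeholders:

    import org.dspace.core.Context;
    import org.dspace.curate.Curator;

    public class RunTaskSketch {
        public static void main(String[] args) throws Exception {
            Context context = new Context();
            try {
                // Attach one or more configured tasks by name, then run them
                // against the object identified by a handle.
                Curator curator = new Curator();
                curator.addTask("vscan");
                curator.curate(context, "123456789/42");
                System.out.println("vscan: " + curator.getResult("vscan"));
            } finally {
                context.abort();  // illustration only; no changes committed
            }
        }
    }

From the command line, the equivalent should be something along the lines of [dspace]/bin/dspace curate -t vscan -i 123456789/42.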

A mechanism for discovering and sharing tasks doesn’t exist yet. What is needed is a community repository of tasks, where each task has a descriptive listing, documentation, reviews/ratings, a link to its source code management system, and links to binaries applicable to specific versions.

With dynamic loading of scripting languages via JSR-223, it is theoretically possible to write curation tasks in Groovy, JRuby, or Jython, although the only one Bill has gotten to work so far is Groovy. Scripting code needs a high level of interoperability with Java and must implement the CurationTask interface. Configuration is a little different: one needs a task catalog with descriptors for the language, the name of the script, and how the constructor is called. Bill demonstrated some sample scripts.
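
The machinery underneath is the standard javax.script (JSR-223) API. Here is a minimal sketch of loading and invoking a Groovy script from plain Java; the script path and function name are hypothetical, and DSpace’s own loader adds the task-catalog indirection described above:

    import java.io.FileReader;

    import javax.script.Invocable;
    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class ScriptTaskLoader {
        public static void main(String[] args) throws Exception {
            // Requires the Groovy runtime on the classpath so that JSR-223
            // can discover the "groovy" engine.
            ScriptEngine engine =
                new ScriptEngineManager().getEngineByName("groovy");
            engine.eval(new FileReader("mytask.groovy"));

            // If the script defines a perform() function, call it from Java.
            Invocable inv = (Invocable) engine;
            Object result = inv.invokeFunction("perform", "123456789/42");
            System.out.println("script returned: " + result);
        }
    }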

In his conclusion, Bill said that the new curation services increase the functionality available for content in a managed framework; provide multiple ways of running tasks for different types of users and scenarios; make it possible to add new code without a rebuild; simplify extending DSpace functionality; and, with scripting, lower the bar even more.

Open Repositories 2011 Report: Day 2 with DSpace plus Fedora and Lots of Lightning Talks

Today was the second day of the Open Repositories conference, and the big highlight of the day for me was the panel discussion on using Fedora as a storage and service layer for DSpace. This seems like such a natural fit, but with two pieces of complex software the devil is in the details. Below that summary are some brief paragraphs about some of the 24×7 lightning talks.

Open Repositories 2011 Report: Day 1 with Apache, Technology Trends, and Bolded Labels

Today was the first main conference day of the Open Repositories conference in Austin, Texas. There are 300 developers here from 20 countries and 30 states. I have lots of notes from the sessions, and I’ve tried to make sense of some of them below before I lose track of the entire context.

The meeting opened with a keynote by Jim Jagielski, president of the Apache Software Foundation. He gave a presentation on what it means to be an open source project, with a focus on how Apache creates a community of developers and users around its projects.


Slides 50 and 51 of "Open Source: It's Not Just for IT Anymore"

One of the takeaways was a characterization of open source licenses from Dave Johnson. Although it is a basic shorthand, it is useful in understanding the broad classes of licenses:

He also explained why community and code are peers in Apache: one is not possible without the other. It takes an engaged community to create great code, and great code can only be created by a healthy community. He also described how the primary communications tool for projects is not new-fangled technologies like wikis, conference calls, and IRC; the official record of a project is its e-mail lists. This enables the broadest possible inclusion of participants across many time zones, and the list archives enable people to look into the history of decisions. If discussions take place in other forums or tools, a summary is always brought back to the e-mail list.

Jim’s concluding thoughts were a great summary of the presentation, and I’ve inserted them on the right.

I missed the first concurrent session of the day due to a work conflict, so the first session I went to was the after-lunch 24×7 presentations: no more than 24 slides in no more than seven minutes. I like this format because it forces the presenters to be concise, and if the topic is not one that interests you, it isn’t long until the next topic comes up. The short presentations are also great for generating discussion points with the speakers during breaks and the reception. Two of these in particular struck a chord with me.

The first was “Technology Trends Influencing Repository Design” by Brad McLean of DuraSpace. He listed four trends:

  1. Design for mobile, not just PCs. The model of a mobile app (local computation and page rendering backed by web services for retrieving data) is having several impacts on design: it reinforces the need for lightweight web services and UIs, it means accounting for how screen size has shrunk again, and it makes a strategy for multi-platform apps critical.
  2. More programming languages than you need/want. Java, Python, Ruby, Scala, LISP, Groovy, JavaScript, and the list goes on. This proliferation of languages has forced looser coupling between components (e.g., a JavaScript-based script can consume data from and write data to a Java-based servlet engine). The implications he listed: it is even clearer that the true integration challenges are in the data modeling and policy domains; it is harder to draw neat boxes around required skill sets; and you might lose control of your user experience (and that might be a good thing).
  3. Servers and clusters. Clusters are not just for high-performance computing and search engines anymore; techniques like map/reduce are available to all. He said that eBay was the last major internet company to deploy its infrastructure on “big iron,” though he didn’t attribute that statement to a source. (Seems kind of hard to believe…) The implications: look to replicated and distributed Solr indexing (hopefully stealing a page from the “NoSQL” handbook); keep an eye on map/reduce-based triple stores (interesting idea!); and repository storage will span multiple systems.
  4. What is a filesystem? Brad noted that filesystems were once hidden from the end user (think of the large systems of the 1960s, 1970s, and 1980s), then became visible (the familiar desktop file-folder structure), and are now becoming hidden again (as with mobile device apps). Applications are storing opaque objects again; how do we effectively ingest them into our repositories?


The second 24×7 talk that struck a chord was “Don’t Bold the Field Name” by Simeon Warner. And by that he literally meant “don’t bold the field name.” He walked through a series of library interfaces and noted our tendency to display bolded field labels, then pointed out how this draws the eye to the labels rather than to the record content beside them. Amazon doesn’t do this (at least with the metadata at the top of the page), eBay doesn’t do this, and the search engines don’t do this. He did note, pointing to the “Product Details” section of an Amazon item page, that sometimes “the task of finding the piece of information is more important than consuming it.” (In the Amazon case, the purpose of bolding the label is to draw the eye to the location of data like publisher and shipping weight on the page.) I think Dorothea Salo’s tweet summed it up best: “Takeaways from Simeon: think about what to present sans label; find cues you can use instead of labels; use labels for 2ndary info. #or11”

I also attended the two afternoon sessions on identifiers (Peter Sefton’s “A Hint of Mint: Linked Authority Control Service” and Richard Rodgers’s “ORCID: Open Research and Contributor ID — An Open Registry of Scholarly IDs”), but the hour is late and tomorrow’s events will come soon enough. Given enough time and energy, I’ll try to summarize those sessions later.

Open Repositories 2011 Report: DSpace on Spring and DuraSpace

This week I am attending the Open Repositories conference in Austin, Texas, and yesterday was the second preconference day (and the first day I was in Austin). Coming in as I did, I only had time to attend two preconference sessions: one on the integration (or maybe “invasion”) of the Spring Framework into DSpace, and one introducing the DuraCloud service and code.