Open Repositories 2011 Report: Day 3 - Clifford Lynch Keynote on Open Questions for Repositories, Description of DSpace 1.8 Release Plans, and Overview of DSpace Curation Services

The main Open Repositories conference concluded this morning with a keynote by Clifford Lynch, and the separate user group meetings began. I tried to transcribe Cliff's great address as best I could from my notes; hopefully I'm not misrepresenting what he said in any significant way. He had some thought-provoking comments about the positioning of repositories in institutions and the policy questions that come from that. For an even more abbreviated summary, check out this Storify archive of tweets during his keynote. Then I attended the DSpace track of user group programming, and below are summaries of plans for DSpace version 1.8 and the new DSpace Curation Services.

Repositories: Major Progress and Open Questions

Mindful that we are roughly a decade into building institutional repositories, Cliff said it was an appropriate time to look at what has been accomplished along with some of the open issues and new questions that have emerged. We still don't have a good way to measure content in repositories. People have radically different ideas from institution to institution on what an "object" is, so those metrics don't mean much; counting terabytes is equally fruitless because some repositories have video and others only have textual material.

Instead, the growth of repositories has highlighted critical policy discussions about what the missions of institutions of higher education are supposed to be. These include questions such as an institution's responsibility to curate the knowledge it creates, to curate the evidence on which inquiry is based, and to disseminate knowledge. These weren't on the table 10-15 years ago. Now they are central issues for discussion in the leadership of universities. Tied to institutional repositories is the question of open access. The scope of issues goes beyond just open access, though. It reaches into the kinds of questions that are now getting traction, like access to research data and an institution's role in the stewardship and dissemination of research data, and also the creation of open educational resources and an institution's responsibility to disseminate such resources. These questions wouldn't have emerged without the effort to build out institutional repositories. As soon as you start talking about these questions, they demand investments in the infrastructure of institutional repositories. So we should take satisfaction in the role that the efforts of deploying institutional repositories have played in advancing these critical policy discussions. These questions have gone unanswered far too long.

That said, there is a danger of confusing mechanism with policy. We went through a bad period around 2004 when institutional repositories were deployed without knowing what would be put in them. Institutional repositories are services to support policies, not ends in themselves. We have mostly gotten past this and can have the discussion of whether institutionally based repositories are the appropriate tool and when we should build discipline-specific repositories or other kinds of platforms that are not institutionally focused.

He also noted that the question of institutional assets and the balance of faculty control with institutional responsibility is being talked about, if only quietly. The piece of video that Clifford referred to -- a talk by Derek Law on how universities have failed in their stewardship responsibilities of research -- may be this video from the Blue Ribbon Task Force on Sustainable Digital Preservation and Access's National Conversation on the Economic Sustainability of Digital Information (skip to "chapter 2" of the video) held April 1, 2010 in Washington DC.

Not only have institutional repositories acted as a focal point for policy, they have also been a focal point for collaborations. Library and IT collaborations were happening long before institutional repositories surfaced. Institutional repositories, though, have been a great place to bring other people into that conversation, including faculty leaders, to start engaging them in questions about dissemination of their work. The same goes for chief research officers: in 1995, if you were a university librarian doing leadership work constructing digital resources to change scholarly communication, you would have talked to the CIO but might not have known who your chief research officer was. That set of conversations, which are now critical when talking about data curation, got their start with institutional repositories and related policies.

Another place for conversation has been with those in university administrations concerned with building public support for the institution, by giving the public a deeper understanding of what the institution contributes to culture, industry, health and science, and by connecting faculty to this effort. This goes beyond the press release by opening a public window into the work of the institutions. This is particularly important today with questions of public support for institutions.

That said, there are a number of open questions and places where we are dealing with works-in-progress. Cliff then went into an incomplete and, from his perspective, perhaps idiosyncratic, list of these issues.

Repositories are one of the threads that are leading us nationally and internationally into a complete rethinking of the practice of name authority. While it is an old-fashioned librarian concept, it is converging with "identity management" from IT. He offered an abbreviated and exaggerated example: librarians did name authority for authors of stuff in general in the 19th century. In the 20th century there was too much stuff; the stuff in journals and magazines in particular became overwhelming. So libraries backed off and focused only on books and stuff that went into catalogs; the rest they turned over to indexing and abstracting services. We made a few weird choices, like deciding that an authority file should be as simple as possible to disambiguate authors rather than as full as possible, so we had the development of things alongside name authority files like the national dictionaries of literary biographies.

For scientific journal literature, publishers followed practices about how obscure author names could be (e.g. just last name and first initial). The huge amount of ambiguity in these "telegraphic author names" results in a horribly dirty corpus of data. A variety of folks are realizing that we need to disambiguate authorship by assigning author identifiers and somehow go back and clean up the mess in the existing bibliographic data of scholarly literature, especially journal literature. Institutions are taking more responsibility for the work of their community, and are having to do local name authority all over again. We have the challenge of how to reconnect this activity to national and international files. We also have a set of challenges on whether we want to connect this to biographical resources. It brings up issues of privacy, when people do things of record, and how much else should come along with building a public biography resource. We also see a vast parallel investment in institutional identity management. Institutions haven't quite figured out that people don't necessarily publish with the same name that is recorded in the enrollment or employment systems that the institution manages, and that it would be a good idea to tie those literary names to identity files that the institution manages.

We're not confident of the ecological positioning of institutional repositories among the pretty complicated array of information systems found at a typical large university. Those systems include digital library platforms, course management systems, lecture capture systems, facilities for archiving the digital records of the institution, and platforms intended to directly support active research by faculty. All are evolving at their own rate. It is unclear where institutional repositories fit, and what the boundaries around them are.

Here is one example. What is the difference between an institutional repository and a digital library/collection? You'd get very different answers from different people. One might be who does the curation, how it is sourced, and how it is scoped. The decisions are largely intellectual. Making this confusing is that you'll see the same platform for institutional repositories and digital library platforms. We are seeing a convergence of the underpinning platforms.

Another one: learning management systems (LMS). These have become virtually universal among institutions in the same timeframe that institutional repositories have been deployed. We've done a terrible job at thinking about what happens to the stuff in them when the course is over. We can't decide if it is scholarly material, institutional records, or something else. They are tangled up between learning materials and all of the stuff that populates a specific performance of a course such as quizzes and answers, discussion lists, and student course projects. We don't have taxonomies and policies here, or a working distinction between institutional repositories and learning management systems. It is an unusual institution that has a systematic export from the LMS to an IR.

Lecture capture systems are becoming quite commonplace; students are demanding them in much the same way that the LMS was demanded. A lecture capture system may be more universally helpful than an LMS. Lectures are being captured for a wide range of reasons, but not knowing why means it is difficult to know whether to keep them and how to integrate them into the institution's resources.

Another example: the extent to which institutional repositories should sit in the stream of active work. As faculty are building datasets and doing computation with them, when is it time for something to go into an institutional repository? How volatile can content be in the repository? How should repositories be connected or considered as robust working storage? He suspects that many institutional repositories are not provisioned with high-performance storage and network connections, and would become a bottleneck in the research process. The answers would be different for big data sets and small data sets, and we are starting to see datasets that are too big to back up or too big to replicate.

Another issue is that of virtual organizations, the kind of collaborative efforts that span institutions and nations. They often allow researchers to be mobilized to work on a problem with relatively low overhead, and are becoming commonplace in the sciences and social sciences and starting to pop up in the humanities. We have a problem with the rules of the road between virtual organizations and institution-based repositories. It is easy to spin up an institutional repository for a virtual organization, but what happens to it when the virtual organization shuts down? Some of these organizations are intentionally transient; how do we assign responsibility in a world of virtual organizations and map them onto institutional organizations for long-term stewardship?

Software is starting to concern people. So much scholarship is tied up now in complicated software systems that we are starting to see a number of phenomena. One is data that is difficult to reuse or understand without the software. Another is the difficulty surrounding reproducibility -- taking results and realizing they are dependent on an enormous stack of software, and we don't have a clear way to talk about the provenance of a result, based on the stack of software versions, that would allow for high confidence in reproducing results. We're going to have to deal with software. We are also entering an era of deliberate obsolescence of software; for instance, any Apple product that is older than a few years is going to the dustbin and it hasn't been fully announced or realized so that people can deal with it.

Another place that has been under-exploited is the question of retiring faculty and repositories: taking inventory of someone's scholarly collections and migrating them to an institutional framework in an orderly fashion.

Another question is how we reinterpret institutional repositories beyond universities. For example, there is something that looks a bit like an institutional repository, but differs from it in some respects, that belongs in public libraries or historical societies or similar. This dimension bears exploration.

To conclude his comments he talked about a last open issue. When we talk about good stewardship and preservation of digital materials, there are a couple of ideas that have emerged as we tried to learn from our past stewardship of print scholarly literature. One of these principles is that geographic replication is a good thing; we're starting to see this in the sense that most repositories are based on some geographically redundant storage system, or we'll see a steady migration towards this in the next few years. A second one is organizational redundancy. If you look at the print world, it wasn't just that the scholarly record was held in a number of independent locations but also that control was replicated among institutions that were making independent decisions about adding materials to their library collections. Clearly they coordinated to a point, but they also had institutional independence. We don't know how to do this with institutional repositories. This is also emerging in special collections as they become digital. Because they didn't start life as published materials in many replicated versions, we need other mechanisms to have curatorial responsibility distributed. This is linked to the notion that it is usually not helpful to talk about preservation in terms like "eternity" or "perpetuity" or life-of-the-republic. It is probably better in most cases to think about preservation one chunk at a time: an institution making a 20-year or 50-year commitment with a well-structured process at the end. That process includes deciding whether the institution should renew the commitment and, if not, how other interested parties could come in and take responsibility with a well-ordered hand-off. This ties into policies and strategies for curatorial replication across institutions and ways that institutional repositories will need to work together. It may be less critical today, but will become increasingly critical.

In conclusion, Cliff said that he hoped he had left the attendees with a sense that repositories are not things that stand on their own; they are in fact mechanisms that advance policy in a very complex ecology of systems. In fact, we don't have our policy act together on many systems adjacent to the repository, which leads to issues of appropriate scope and interfaces with those systems. Where repositories will evolve to in the future as we understand the role of big data is also of interest.

DSpace 1.8

Robin Taylor, the DSpace version 1.8 release manager, gave an overview of what was planned (not promised!) for the next major release. The release schedule was to have a beta last week, but that didn't happen. The remainder of the schedule is to have a beta on July 8th, feature freeze on August 19th, release candidate 1 published on September 2nd in time for the test-a-thon from the 5th to the 16th, followed by a second release candidate on September 30th, final testing October 3rd through the 12th, and a final release on October 14th. He then went into some of the planned highlights of this release.

SWORD is a lightweight protocol for depositing items between repositories; it is a profile of the Atom Publishing Protocol. As of the current release, DSpace has been able to accept items; the planned work for 1.8 will make it possible to send items. Some possible use cases: publishing from a closed repository to an open repository, sending from the repository to a publisher, from the repository to a subject-specific service (such as arXiv), or vice versa. The functionality was copied from the Swordapp demo. It supports SWORD v1 and only the DSpace XMLUI. A question was asked about whether the SWORD copy process is restricted to just the repository manager. The answer was that it should be configurable. On the one hand it can be open because it is up to the receiving end to determine whether or not to accept it. On the other hand, a repository administrator might want to prevent items being exported out of a collection.
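For readers who haven't seen SWORD on the wire, here is a rough sketch of what a version 1 deposit looks like from the sending side: an HTTP POST of a packaged item to a collection's deposit URI. This is my own illustration, not code from the talk or from DSpace. The header names and the packaging identifier are as I understand the SWORD 1.3 profile, and the class name, deposit URL, and credentials are made up for the example. It only builds the request; it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustrative only: builds (but does not send) a SWORD v1 style deposit request. */
class SwordDepositSketch {

    // Packaging identifier for a DSpace METS SIP, as I understand the SWORD types registry.
    static final String METS_DSPACE_SIP = "http://purl.org/net/sword-types/METSDSpaceSIP";

    static HttpRequest buildDeposit(URI collectionUri, byte[] zipPackage,
                                    String fileName, String user, String password) {
        // HTTP Basic credentials; a real deployment may use another scheme.
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));

        return HttpRequest.newBuilder(collectionUri)
                .header("Content-Type", "application/zip")
                .header("Content-Disposition", "filename=" + fileName)
                .header("X-Packaging", METS_DSPACE_SIP)   // tells the receiver how the zip is structured
                .header("Authorization", "Basic " + credentials)
                .POST(HttpRequest.BodyPublishers.ofByteArray(zipPackage))
                .build();
    }
}
```

A caller would pass something like `URI.create("https://repo.example.org/sword/deposit/123456789/2")` (a hypothetical address) and hand the resulting request to a `java.net.http.HttpClient`. Whether the deposit is accepted is up to the receiving repository, which matches the point made in the Q&A above.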

MIT has rewritten the Creative Commons licensing selection steps. It uses the Creative Commons web services (as XML) rather than HTML iframes, which allows better integration with DSpace. As an aside, the Creative Commons and license steps have been split into two discrete steps allowing different headings in the progress bar.

The DSpace Community Advisory Team prioritized issues to be addressed by the developers, and for this release they include JIRA issue DS-638 for virus checking during submission. The solution invokes the existing Curation Task and requires Clam AV antivirus software to be installed. It is switched off by default and is configured in submission-curation.cfg. Two other issues that were addressed are DS-587 (Add the capability to indicate a withdrawn reason to an Item) and DS-164 (Deposit interface), which was completed as the Google Summer of Code Submission Enhancement project.

Thanks to Bojan Suzic in his Google Summer of Code project, DSpace has had a REST API. The code has been publicly available and repositories have been making use of it, so the committers group want to get it into a finished state and include it in 1.8. There is also work on an alternative approach to a REST API.

DSpace and DuraCloud were also covered; it was much the same as what I reported on earlier this week, so I'm not repeating it here.

From the geek perspective, the new release will see increasing modularization of the codebase and more use of Spring and the DSpace Services Framework. The monolithic dspace.cfg will be split up into separate pieces; some pieces would move into Spring config while other pieces could go into the database. It will have a simplified installation process, and several components that were talked about elsewhere at the meeting: WebMVC UI, configurable workflow, and more curation tasks.

Introduction to DSpace Curation Services

Bill Hays talked about curation tasks in DSpace. Curation tasks are Java objects managed by the Curation System. Functionally, they are an operation run on a DSpace Object and (optionally) its contained objects (e.g., community, subcommunity, collection, and items). They do not work site-wide, nor on bundles or bitstreams. The tasks can be run in multiple ways by different types of administrative users, and they are configured separately from dspace.cfg.

Some built-in tasks are: validate metadata against input forms (halts on task failure), count bitstreams by format type, virus scan on ingest, which is the desired use case (uses an external virus detection service), and the replication suite of tasks for DuraCloud. Other tasks: link checker and 11 others (from Stuart Lewis and Kim Shepherd), format id with DROID (in development), validate/add/replace metadata, status report on workflow items, filter media in workflow (proposed), and checksum validation (proposed).

What does this mean for different users? As a repository or collection manager, it means new functionality -- GUI access without GUI development: curation, preservation, validation, reporting. As a developer: rapid development, and deployment of functionality without rebuilding or redeploying the DSpace instance.

The recommended Java development environment for tasks is a package outside of dspace-api. Make a POM with a dependency on dspace-api, especially /curate. A task is required to have a constructor with no arguments, to support loading as a plugin, and to implement the CurationTask interface or extend the AbstractCurationTask class. Deploy it as a JAR and configure it (similar to a DSpace plugin).
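To show why the no-argument constructor matters, here is a small self-contained sketch of how a plugin-style loader typically works: it knows only a configured class name, so it has nothing to pass to a constructor. Everything here is a stand-in I made up for illustration (`SketchTask`, `CountingTask`, `TaskLoader`); the real CurationTask interface and the DSpace plugin machinery have a different shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a task contract; not the real DSpace interface. */
interface SketchTask {
    String perform(String objectId);
}

/** Has only the implicit no-argument constructor, so it can be created reflectively. */
class CountingTask implements SketchTask {
    private int seen = 0;

    @Override
    public String perform(String objectId) {
        seen++;
        return "SUCCESS " + objectId + " (#" + seen + ")";
    }
}

/** Toy loader: maps a configured task name to a class name and instantiates it. */
class TaskLoader {
    private final Map<String, String> registry = new LinkedHashMap<>();

    void register(String taskName, String className) {
        registry.put(taskName, className);
    }

    SketchTask load(String taskName) {
        String className = registry.get(taskName);
        if (className == null) {
            throw new IllegalArgumentException("No task configured under name: " + taskName);
        }
        try {
            // Only the class name is known here, so there are no constructor
            // arguments to supply; a no-argument constructor is the only option.
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            return (SketchTask) instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load task class " + className, e);
        }
    }
}
```

Registering `CountingTask.class.getName()` under a name such as `"count"` and then calling `load("count")` returns a fresh task instance. A class whose only constructor takes arguments would fail at the `getDeclaredConstructor()` call.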

There are some Java annotations for Curation Task code that are important to know about. Setting @Distributive means that the task is responsible for handling any contained DSpace objects as appropriate. Otherwise the default is to have the task executed across all contained objects (subcommunities, collections, or items). Setting @Suspendable means the task interrupts processing when the first FAIL status is returned. Setting @Mutative means the task makes changes to target objects.
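The traversal rules are easier to see in a toy model. The sketch below is my own simplification, not DSpace code: it declares stand-in `@Distributive` and `@Suspendable` annotations and a hypothetical `ToyCurator` that reads them by reflection to decide how to walk a small tree of objects. The real annotations carry more detail than this, and I have left out `@Mutative` because in this model it would be purely descriptive.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Stand-in annotations; runtime retention is needed so reflection can see them.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE) @interface Distributive {}
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE) @interface Suspendable {}

enum ToyStatus { SUCCESS, FAIL }

/** Toy container object: a name plus contained objects. */
class ToyNode {
    final String name;
    final List<ToyNode> children;
    ToyNode(String name, ToyNode... kids) { this.name = name; this.children = List.of(kids); }
}

interface ToyTask { ToyStatus perform(ToyNode target); }

/** No annotations: runs on the container and every contained object. */
class FailOnB implements ToyTask {
    public ToyStatus perform(ToyNode target) {
        return target.name.equals("B") ? ToyStatus.FAIL : ToyStatus.SUCCESS;
    }
}

/** Same logic, but processing stops at the first FAIL. */
@Suspendable
class SuspendOnB extends FailOnB {}

/** Handles contained objects itself, so the curator calls it once only. */
@Distributive
class SelfManaged implements ToyTask {
    public ToyStatus perform(ToyNode target) { return ToyStatus.SUCCESS; }
}

class ToyCurator {
    final List<String> visited = new ArrayList<>();

    ToyStatus run(ToyTask task, ToyNode target) {
        Class<?> type = task.getClass();
        if (type.isAnnotationPresent(Distributive.class)) {
            return invoke(task, target);              // task walks its own children
        }
        return walk(task, target, type.isAnnotationPresent(Suspendable.class));
    }

    private ToyStatus invoke(ToyTask task, ToyNode node) {
        visited.add(node.name);
        return task.perform(node);
    }

    private ToyStatus walk(ToyTask task, ToyNode node, boolean stopOnFirstFail) {
        ToyStatus overall = invoke(task, node);
        if (overall == ToyStatus.FAIL && stopOnFirstFail) return overall;
        for (ToyNode child : node.children) {
            if (walk(task, child, stopOnFirstFail) == ToyStatus.FAIL) {
                overall = ToyStatus.FAIL;
                if (stopOnFirstFail) return overall;  // interrupt at first FAIL
            }
        }
        return overall;
    }
}
```

With a tree of A containing B and C, `FailOnB` visits A, B, and C; `SuspendOnB` visits A and B and then stops; `SelfManaged` is invoked on A only.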

Invoking tasks can be done several ways: from the web application (XMLUI), from the command line, from workflow, from other code, or from a queue (deferred operation). In the case of the workflow, one can target the action of the task anywhere in the workflow steps (e.g. before step 1, step 2, step 3 or at item installation). Actions (reject or approve) are based on task results, and notifications are sent by e-mail.

A mechanism for discovering and sharing tasks doesn't exist yet. What is needed is a community repository of tasks. For each task what is needed is: a descriptive listing, documentation, reviews/ratings, link to source code management system, and link to binaries applicable to specific versions.

With dynamic loading with scripting languages in JSR-223, it is theoretically possible to create Curation Tasks in Groovy, JRuby, Jython, although the only one Bill has been able to get to work so far has been Groovy. Scripting code needs a high level of interoperability with Java, and must implement the CurationTask interface. Configuration is a little bit different: one needs a taskcatalog with descriptors for language, name of script, and how the constructor is called. Bill demonstrated some sample scripts.
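For anyone unfamiliar with JSR-223, the general pattern looks something like the sketch below, which uses the standard `javax.script` API. This is not Bill's code and not how DSpace actually wires it up; `ScriptedTaskLoader` and its `ScriptedTask` interface are names I invented, and the three parameters simply mirror the descriptor fields mentioned above (language, script, constructor call). It also assumes the engine returns the value of the last expression in a script, which I have not verified for every language, and it only does anything useful if an engine for the named language is on the classpath.

```java
import java.util.Optional;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

/** Illustrative JSR-223 loader; names and structure are assumptions, not DSpace's. */
class ScriptedTaskLoader {

    /** Hypothetical stand-in for the Java interface a scripted task must satisfy. */
    public interface ScriptedTask {
        String perform(String objectId);
    }

    static Optional<ScriptedTask> load(String language, String scriptSource, String constructorCall) {
        // Returns null when no engine for that language is installed.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (!(engine instanceof Invocable)) {
            return Optional.empty();
        }
        try {
            // Evaluate the class definition followed by the constructor call,
            // relying on the engine to return the value of the last expression.
            Object instance = engine.eval(scriptSource + "\n" + constructorCall);
            if (instance == null) {
                return Optional.empty();
            }
            // Ask the engine to expose the script object through the Java interface;
            // this yields null if the object does not implement the methods.
            return Optional.ofNullable(((Invocable) engine).getInterface(instance, ScriptedTask.class));
        } catch (ScriptException e) {
            throw new IllegalStateException("Script failed to evaluate", e);
        }
    }
}
```

Calling `load` with a language that has no installed engine returns an empty `Optional` rather than failing, which is the case most people will hit first on a plain JDK.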

In his conclusion, Bill said that the new Curation Services increase functionality for content in a managed framework; offer multiple ways of running tasks for different types of users and scenarios; make it possible to add new code without a rebuild; simplify extending DSpace functionality; and, with scripting, lower the bar even more.