Call for Public Comment — W3C Library Linked Data Incubator Group

The W3C Library Linked Data (LLD) Incubator Group invites librarians, publishers, linked data researchers, and other interested parties to review and comment on drafts of reports to be published later this year. The LLD group has been chartered from May 2010 through August 2011 to prepare a series of reports on the existing and potential use of Linked Data technology for publishing library data. The group is currently preparing:

  • A report describing Benefits of LLD, an Overview of Existing Vocabularies and Data Sets, Relevant Technologies, Implementation Challenges, and Recommendations
  • A survey report of use cases describing existing projects
  • A survey report of Vocabularies and Datasets

Submitting Comments

The incubator group invites comments in one of two ways. Feedback can be posted as comments to individual sections on a dedicated blog. Comments can also be sent by e-mail, using descriptive subject lines such as:

    Subject: [COMMENTS] "Benefits" - section on "Benefits to Developers"

Comments sent this way are archived in the public mailing list.

Comments will be especially welcome in the four weeks from 24 June through 22 July. Reviewers should note that as with Wikipedia, the text may be revised and corrected by its editors in response to comments at any time, but that earlier versions of a document may be viewed by clicking on the History tab.

It is anticipated that the three reports will be published in final form by 31 August.

Thursday Threads: Publisher/Librarian Rights, Cultural Commons, HTML5 Web Apps, Wifi Management


This week’s list of threads starts with a pointer to a statement by the International Coalition of Library Consortia on the growing pressure between publishers and libraries over the appropriate rights and permissions for scholarly material. In that same vein, Joe Lucia writes about his vision for libraries and the cultural commons to the Digital Public Library of America mailing list. On the geekier side is a third link, to an article on the experience of content producers creating HTML5-enabled web apps. And finally, on the far geeky side, is a view of what happens when a whole lot of new wireless devices — smartphones, tablets, and the like — show up on a wifi network.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my FriendFeed stream (or subscribe to its feed in your feed reader). Comments and tips, as always, are welcome.

ICOLC Response to the International Association of Scientific Technical and Medical (STM) Statement

A recent statement by the International Association of Scientific Technical and Medical Publishers (STM) advocates a set of new guidelines for document delivery. While intellectual property laws vary from country to country, STM’s approach would radically alter well-established library practices that advance knowledge, support scholarship, and are compliant with current copyright laws. The STM recommendations are in conflict with widely held principles that provide a copyright exception for interlibrary loan (ILL) activities. The regime anticipated by the STM statement would place unfair restrictions on researchers’ access to information. In particular, ICOLC contends that:

  1. interlibrary loan, under existing principles and laws, is consistent with the three-step test of Berne;
  2. cross-border deliveries are adequately and appropriately governed by current copyright law;
  3. digital document delivery directly to an end-user is best coordinated through the end-user’s library or community of learners;
  4. libraries are able to deliver on-site articles to library walk-up patrons in any format, including both digital and print;
  5. current copyright law appropriately places the burden on the library user to affirm that the documents they receive are for private, non-commercial use.

The ICOLC strongly supports IFLA’s Draft Library Treaty, Article 7, which states “It shall be permissible for a library or archive to supply a copy of any work. . . lawfully acquired or accessed by the library or archive, to another library or archive for subsequent supply to any of its users, by any means . . . provided that such use is compatible with fair practice as determined in national law”. See also ARL’s statement clarifying legal protections afforded to libraries for national and international ILL use, and related documents.

ICOLC Response to the International Association of Scientific Technical and Medical (STM) Statement, International Coalition of Library Consortia, issued June 22, 2011

On the heels of last week’s frightening copyright scenario comes this statement from the International Coalition of Library Consortia. It was short, so the main content of the statement is posted above. Follow the link in the citation to find contact information for the ICOLC statement. Library Journal also has an article on the statement with quotes from Tracy Thompson-Przylucki and Ann Okerson.

Libraries & the Cultural Commons

Reduced to its medium-independent core, the mission of libraries is to subsidize and sustain barrier-free access to intellectual and cultural resources for our constituents and communities. In that sense, libraries establish a bridge between the proprietary realm of commercially supplied intellectual property and the gift economies of intellectual and cultural expression. From my perspective, everything we do flows from that core function. The DPLA will be, in effect, a new global networked digital face of the library as cultural and intellectual commons.

Libraries & the Cultural Commons, by Joe Lucia, DPLA mailing list, 22-Jun-2011

Joe Lucia, University Librarian at Villanova University, posted this broad and, frankly, energizing view of the role for libraries to the Digital Public Library of America mailing list. If you want a concise view of how libraries are about content and services and not the historical carrier and delivery mechanisms, then take a look at this message.

The FT and NPR: HTML5 as part of a multi-platform strategy

I had heard that the FT and Apple were struggling to come to an agreement on digital subscriptions, so it came as no surprise to me that the FT has launched an HTML5 web app. Some folks have added sneer quotes around app, but I’m not going to. The HTML5 version of the FT’s app looks, behaves and has even more functionality than their native iPad app.

The FT and NPR: HTML5 as part of a multi-platform strategy, Strange Attractor blog, 7-Jun-2011

I think there is a strong future in common agreement on web markup standards over proprietary app development. I’ve made that point several times on DLTJ, so I remain attuned to stories that point in that direction. This article describes how the U.K.’s Financial Times built an iPad app using the built-in Safari browser and HTML5 tools like advanced cascading stylesheets and offline storage for reading when you are off the net (just like the old Financial Times native app). And, of course, the techniques work on other tablet platforms with minimal modification. NPR is experimenting with the same technique using Google’s Chrome web browser.

Wi-Fi client surge forcing fresh wireless LAN thinking

IDC reports that twice as many smartphones and tablets, nearly all with Wi-Fi, will ship compared to laptops this year. The number of Wi-Fi certified handsets in 2010 was almost 10 times the number certified in 2007, according to the Wi-Fi Alliance. Tablets, e-readers and portable audio devices are helping to drive this growth.

The result is a very different wireless environment in terms of radio behaviors, Wi-Fi implementations, applications, usage and traffic compared to just a year or two ago. This raises a different set of issues from simply managing these mobile devices with tools from vendors…

Long ago I used to have to manage network infrastructure. That was back in the days when, for a small organization, one person could be the unix system administrator, the network administrator, and help with desktop support. With the complexity and pervasiveness of devices, though, I don’t think one person can do all of that any more. Articles like this one, about the difficulty of managing wireless networks that are bursting at the seams with new devices, make me realize how far networking has come in the past two decades.

The text was modified to update a link on November 21st, 2012.

PPTP VPN for iOS with AT&T Uverse and DD-WRT

Wandering into public or semi-public wireless networks makes me nervous because I know how easily my network traffic can be watched, and because I’m a geek with control issues I’m even more nervous when using devices that I can’t get to the insides of (like phones and tablets). One way to tamp down my concerns is to use a Virtual Private Network (VPN) to tunnel the device’s network connection through the public wireless network to a trusted end-point, but most of those options require a subscription to a VPN service or a VPN installed in a corporate network. I thought about using one of the open source VPN implementations with an Amazon EC2 instance, but judging from the comments on the Amazon Web Services support forums it isn’t possible with the EC2 network configuration. (Besides, installing one of the open source VPN software implementations looks far from turnkey.) Just before I lost hope, though, I saw a reference to using the open source DD-WRT consumer router firmware to do this. After plugging away at it for an hour or so, I made it work with my home router, an AT&T U-verse internet connection, and iOS devices. It wasn’t easy, so I’m documenting the steps here in case I need to set this up again.


To make this happen, I’m using a D-Link DIR-825 that has been flashed with “v24-sp2 (04/23/10) std” of the DD-WRT firmware. For my internet connection I have an AT&T U-verse residential gateway and a “Max Turbo” plan (I work from home so I need the 3 Mbps uplink speed that is only available with “Max Turbo”, although that added uplink capacity is certainly helpful for this road-warrior VPN use). I also have a pair of iOS version 4.3.3 devices; this setup might work for other handheld operating systems (e.g. Android or Windows Mobile), but I don’t have any of those to test with.

DD-WRT comes with support for a Point-to-Point Tunneling Protocol (PPTP) server. I know PPTP has some inherent security risks. At this point I’m just aiming to make it harder for someone passively listening on the public wireless network to eavesdrop on my connections. I’m not doing anything so ultra-sensitive that I need advanced encryption; I just don’t want to make it easy to watch what my devices are doing.

Setting up the AT&T U-verse Residential Gateway

Since the D-Link router is behind the U-verse residential gateway, we need to punch a couple holes through its firewall to allow downstream connections from the iOS devices to reach the D-Link router. Specifically, one needs to forward ports 1723/TCP and 1723/UDP through the residential gateway firewall to the internal D-Link router. To do this:

  1. Connect to the web interface of the residential gateway, select the Settings tab followed by the Firewall tab then the Applications, Pinholes and DMZ tab.
  2. This screen has two steps: 1) Select a computer; then 2) Edit firewall settings for this computer. Click on the link to “Choose” the DIR-825 router (by name).
  3. In the second step choose the “Add a new user-defined application” link. Use “PPTP” for the Application Profile Name.
  4. Select “TCP” and put “1723” in the From text box, under Application Type select PPTP virtual private network server and leave the rest of the boxes blank for the defaults; click on Add to List.
  5. Repeat everything in the last step except choose UDP in place of TCP.
  6. Click on the Back button to return to the Allow device application traffic to pass through firewall screen.
  7. Select the Allow individual application(s) radio button, click on the User-defined applications list, pick “PPTP” from the Application List, and click on Add.
  8. Click Save.

The U-verse residential gateway will now pass everything inbound on ports 1723/TCP and 1723/UDP to the D-Link router. You’re done with the residential gateway setup now.
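If you want a quick sanity check that the pinhole is in place, probe the PPTP control port from a machine outside your network (a friend’s connection, or a laptop tethered to a phone). This is just a sketch, assuming nmap is installed; 203.0.113.10 stands in for your gateway’s public IP address:

    # probe the forwarded PPTP control port from outside the network
    nmap -p 1723 203.0.113.10

Until the PPTP server is configured on the D-Link router (next section) the port will likely show as closed or filtered, so the more telling test is to re-run the probe after the DD-WRT setup below.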

Setting up the PPTP Service on DD-WRT

Now we need to set up the DD-WRT PPTP service. This is harder than it probably should be, but given the geeky focus of the DD-WRT effort (in my humble opinion), features seem to come before user interface and documentation niceties. This works for me, but it isn’t entirely clear or easy, and I can’t offer troubleshooting insights if it doesn’t work for you. It has two main steps — first, turn on and configure the PPTP server; and second, patch the PPTP server configuration with a start-up script so that it actually works. First, the PPTP server configuration:

  1. Log onto the DD-WRT web interface, select the Services tab then the VPN tab.
  2. Enable PPTP Server, Broadcast support, and Force MPPE Encryption.
  3. Put in the WAN IP (listed in the upper right corner of the web page) in the Server IP box. (Some instructions I have seen said that this can be left blank and the firmware will automatically pick it up. That didn’t work for me.)
  4. For Client IPs, put in a range of LAN-side IPs that aren’t being used by the DHCP server. In my case I’m using “”.
  5. Put in one or more CHAP-Secrets. These are the username and passwords used on the PPTP client to connect to this server, and they follow a weird form: username-space-asterisk-space-password-space-asterisk. For example:
    username * password *
  6. Leave Radius disabled.
  7. At the bottom of the screen, pick Apply Settings.

The second step is the startup script:

  1. Select the Administration tab then the Commands tab.
  2. Put this in the Commands text box, then select Save Startup:
    # force MPPE encryption in the options file pptpd generates
    sed -i -e 's/mppe .*/mppe required,stateless/' /tmp/pptpd/options.pptpd
    # disable protocol and address/control compression (per the DD-WRT notes for iOS clients)
    echo "nopcomp" >> /tmp/pptpd/options.pptpd
    echo "noaccomp" >> /tmp/pptpd/options.pptpd
    # stop the pptpd instance that started with the stock options...
    kill `ps | grep pptp | cut -d ' ' -f 1`
    # ...and restart it with the patched configuration
    pptpd -c /tmp/pptpd/pptpd.conf -o /tmp/pptpd/options.pptpd
  3. Go to the Management subtab of Administration and at the bottom select Reboot Router.

This script comes from the PPTP Server Configuration page. The bulk of it is from the iOS 4.3 heading with the addition of the sed line to force encryption.
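After the reboot, an easy way to confirm the startup script did its job is to telnet or ssh into the router and look at the running daemon and the patched options file. A rough sketch; the exact output will vary:

    # is the PPTP daemon running?
    ps | grep pptpd
    # did the sed/echo lines from the startup script take effect?
    cat /tmp/pptpd/options.pptpd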

Configuring the iOS Device

iOS PPTP VPN Configuration

The iOS device was pretty straightforward to configure (particularly compared to the previous steps):

  1. In the Settings app, choose General then Network then VPN.
  2. Select Add VPN Configuration…
  3. At the top choose PPTP and give this configuration a descriptive label.
  4. For Server put in the IP address of your U-verse residential gateway. (Setting up something like Dynamic DNS with DD-WRT is left as an exercise to the reader.)
  5. For Account put in the username field from the CHAP-Secrets text box above.
  6. Leave RSA SecurID off and put in the password field from the CHAP-Secrets text box.
  7. Under Encryption Level select Maximum.
  8. Select Save in the upper right hand corner.

Now when you connect to a public network, before starting any applications that will access the internet, go into the Settings app and near the top will be a choice to turn on the VPN. Give it about five or six seconds to make the connection, and you’ll then see a blue VPN icon in the status bar at the top next to the WiFi icon.


The PPTP Server Configuration page was much more helpful than the built-in documentation for figuring out what was needed to make this work. A series of posts on the Whirlpool Forums, starting with this reply and continuing through a half-dozen more, had the final pieces.

Thursday Threads: RDA Test Results, Author’s Rights Denied, Future Copyright Scenario


This week we got the long-awaited report from the group testing RDA to see if its use would be approved for the major U.S. national libraries. And the answer? An unsatisfying, if predictable, maybe-but-not-yet. This week also brought new examples of the tensions between authors and publishers and libraries. The first example is an author’s story of an attempt to navigate an author’s rights agreement and coming to an insurmountable barrier. The second example tries to look into the future of teaching and learning in a world where fair use has been dramatically scaled back from the existing status quo, and it is a frightening one.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my FriendFeed stream (or subscribe to its feed in your feed reader). Comments and tips, as always, are welcome.

Implementation of RDA Contingent on Improvements

Contingent on the satisfactory progress/completion of the tasks and action items below, the [U.S. RDA Test] Coordinating Committee recommends that RDA should be implemented by [the Library of Congress], [National Agricultural Library], and [National Library of Medicine] no sooner than January 2013. The three national libraries should commit resources to ensure progress is made on these activities that will require significant effort from many in and beyond the library community.

To achieve a viable and robust metadata infrastructure for the future, the Coordinating Committee believes that RDA should be part of the infrastructure. Before RDA is implemented, however, the activities below must be well underway. In order to allow sufficient lead time for these actions to occur, the Committee recommends that RDA implementation not proceed prior to January 2013. Timeframes in these recommendations assume a start date of July 1, 2011 and represent the Coordinating Committee’s best estimates.

Over three years in the making, the work of the U.S. RDA Test Coordinating Committee is starting to be published. “Resource Description and Access” (or RDA) is the name of the standard that has been under formal development since 2005 to “provide a comprehensive set of guidelines and instructions on resource description and access covering all types of content and media.” Building on the foundation of the existing standard, the Anglo-American Cataloguing Rules (AACR), work on RDA has been delayed and debated quite a bit in the past half-decade, and this mixed report from the RDA Test Coordinating Committee casts both light and doubt on the viability of RDA in the U.S. All told, it is hard to separate the issues with the text of the standard from those of the inflexibility of the underlying carrier (MARC) and the tool used to access the standard.

An Author’s Rights Horror Story

Now, let me really conclude by saying this: I hereby boycott all [Taylor & Francis] journals. T&F publishes a fair number of journals in ILS (Journals by Subject > Information Science), and I shall not publish in any of them ever again. And, furthermore, I would encourage you, whether you are in ILS or not, to not publish in T&F journals either. Because, let’s face it, the only way publishers will change their restrictive copyright policies is if authors refuse to publish with those publishers. Give ‘em hell.

My Copyfight, PomeRantz, 14 Jun 2011

This is the last paragraph of a detailed story from a tenured faculty member who agreed to write an article for a special issue of the journal The Reference Librarian only to run afoul of the author’s rights agreement. The points brought up by Jeffrey Pomerantz in the post are something I suspect we are going to see more of as copyright conflicts between authors, publishers, and libraries remain unresolved.

It sounds preposterous, right? This is what Kevin Smith has called a nightmare scenario, one that doubles down with new guidelines for interlibrary loan (which in his terms are opening “a second front” of attack on education).

But this is our future if publishers prevail. We may have to adhere to a strict and highly conservative interpretation of old guidelines drawn up by – you guessed it – publishers, who back in 1976 were troubled by that disruptive new technology, the Xerox machine. If they call the shots, we will have to create a bureaucracy to enforce copyright compliance or face litigation. We will have to reserve interlibrary loan for journal articles only for rare instances and in a manner controlled by “rightsholders” – which, by design, are publishers, not the authors. Where would we get the lines to staff compliance mechanisms? And the money to pay permissions for everything we use in teaching and research, every time we use it? Out of our existing budgets. The ones that keep getting smaller.

Librarians have been Cassandras for long enough. It’s time for the rest of the academy to wake up before they have this nightmare and stop treating research as a commodity we naturally give away in exchange for personal advancement, assuming it will always be available, somehow. Otherwise, get ready for a future that will not be a hospitable place for that old-fashioned pursuit, the advancement of knowledge.

Dispatches from the Future, Barbara Fister’s Library Babel Fish blog, Inside Higher Ed, 13 Jun 2011

Speaking of copyright, this post describes a worst-case scenario in which publishers’ desires to control copyrighted academic content come to fruition. We can see some of it coming in cases like the author rights story above, in the Georgia State University copyright case, and in the International Association of Scientific, Technical and Medical Publishers’ Statement on Document Delivery. And the crystal ball is too murky to try to make out which direction this is going.


The text was modified to update a link on December 4th, 2012.

Open Repositories 2011 Report: Day 3 – Clifford Lynch Keynote on Open Questions for Repositories, Description of DSpace 1.8 Release Plans, and Overview of DSpace Curation Services

The main Open Repositories conference concluded this morning with a keynote by Clifford Lynch and the separate user group meetings began. I tried to transcribe Cliff’s great address as best I could from my notes; hopefully I’m not misrepresenting what he said in any significant ways. He has some thought-provoking comments about the positioning of repositories in institutions and the policy questions that come from that. For an even more abbreviated summary, check out this National Conversation on the Economic Sustainability of Digital Information (skip to “chapter 2” of the video) held April 1, 2010 in Washington DC.

Not only have institutional repositories acted as a focal point for policy, they have also been a focal point for collaborations. Library and IT collaborations were happening long before institutional repositories surfaced. Institutional repositories, though, have been a great place to bring other people into that conversation, including faculty leaders, to start engaging them in questions about the dissemination of their work. The same goes for chief research officers: in 1995, if you were a university librarian doing leadership work constructing digital resources to change scholarly communication, you would have talked to the CIO but may not have known who your chief research officer was. That set of conversations, which are now critical when talking about data curation, got their start with institutional repositories and related policies.

Another place for conversation has been with those in university administrations concerned with building public support for the institution, by giving the public a deeper understanding of what the institution contributes to culture, industry, health, and science, and by connecting faculty to this effort. This goes beyond the press release by opening a public window into the work of the institution. It is particularly important today, given questions of public support for institutions.

That said, there are a number of open questions and places where we are dealing with works-in-progress. Cliff then went into an incomplete and, from his perspective, perhaps idiosyncratic, list of these issues.

Repositories are one of the threads that are leading us nationally and internationally into a complete rethinking of the practice of name authority. While it is an old-fashioned librarian concept, it is converging with “identity management” from IT. He offered an abbreviated and exaggerated example: librarians did name authority for authors of stuff in general in the 19th century. In the 20th century there was too much stuff; in particular, the stuff in journals and magazines became overwhelming. So libraries backed off and focused only on books and the things that went into catalogs; the rest they turned over to indexing and abstracting services. We made a few weird choices, like deciding that an authority file should be as simple as possible to disambiguate authors rather than as full as possible, so things like the national dictionaries of literary biography developed alongside name authority files.

For scientific journal literature, publishers followed practices about how obscure author names could be (e.g. just last name and first initial). The huge amount of ambiguity in these “telegraphic author names” results in a horribly dirty corpus of data. A variety of folks are realizing that we need to disambiguate authorship by assigning author identifiers and somehow go back and clean up the mess in the existing bibliographic data of scholarly literature, especially journal literature. Institutions are taking more responsibility for the work of their communities and having to do local name authority all over again. We have the challenge of how to reconnect this activity to national and international files. We also have a set of challenges around whether we want to connect this to biographical resources; that brings up issues of privacy, of when people do things of record, and of how much else should come along with building a public biography resource. We also see a vast parallel investment in institutional identity management. Institutions haven’t quite figured out that people don’t necessarily publish with the same name that is recorded in the enrollment or employment systems the institution manages, and that it would be a good idea to tie those literary names to the identity files that the institution manages.

We’re not confident about the ecological positioning of institutional repositories among the pretty complicated array of information systems found at a typical large university. Those systems include digital library platforms, course management systems, lecture capture systems, facilities for archiving the digital records of the institution, and platforms intended to directly support active research by faculty. All are evolving at their own rate. It is unclear where institutional repositories fit, and what the boundaries around them are.

Here is one example. What is the difference between an institutional repository and a digital library/collection? You’d get very different answers from different people. One might be who does the curation, how it is sourced, and how it is scoped. The decisions are largely intellectual. Making this confusing is that you’ll see the same platform for institutional repositories and digital library platforms. We are seeing a convergence of the underpinning platforms.

Another one: learning management systems (LMS). These became virtually universal among institutions in the same timeframe that institutional repositories were being deployed. We’ve done a terrible job at thinking about what happens to the stuff in them when the course is over. We can’t decide if it is scholarly material, institutional records, or something else. They are a tangle of learning materials and all of the stuff that populates a specific performance of a course, such as quizzes and answers, discussion lists, and student course projects. We don’t have taxonomies and policies here, nor a working distinction between institutional repositories and learning management systems. It is an unusual institution that has a systematic export from the LMS to an IR.

Lecture capture systems are becoming quite commonplace; students are demanding them in much the same way that the LMS was demanded. A lecture capture system may be more universally helpful than an LMS. Lectures are being captured for a wide range of reasons, but not knowing why makes it difficult to know whether to keep them and how to integrate them into the institution’s resources.

Another example: the extent to which institutional repositories should sit in the stream of active work. As faculty are building datasets and doing computation with them, when is it time for something to go into an institutional repository? How volatile can content be in the repository? How should repositories be connected to, or considered as, robust working storage? He suspects that many institutional repositories are not provisioned with high-performance storage and network connections, and would become a bottleneck in the research process. The answers would be different for big data sets and small data sets, and we are starting to see datasets that are too big to back up or too big to replicate.

Another issue is that of virtual organizations, the kind of collaborative efforts that span institutions and nations. They often make it possible to mobilize researchers to work on a problem with relatively low overhead, and they are becoming commonplace in the sciences and social sciences and starting to pop up in the humanities. We have a problem with the rules of the road between virtual organizations and institution-based repositories. It is easy to spin up an institutional repository for a virtual organization, but what happens to it when the virtual organization shuts down? Some of these organizations are intentionally transient; how do we assign responsibility in a world of virtual organizations and map them onto institutional organizations for long-term stewardship?

Software is starting to concern people. So much scholarship is now tied up in complicated software systems that we are starting to see a number of phenomena. One is data that is difficult to reuse or understand without the software. Another is the difficulty surrounding reproducibility: taking results and realizing they are dependent on an enormous stack of software, with no clear way to talk about the provenance of a result in terms of the stack of software versions that would allow for high confidence in reproducing it. We’re going to have to deal with software. We are also entering an era of deliberate obsolescence of software; for instance, any Apple product older than a few years is headed to the dustbin, and that hasn’t been fully announced or realized in a way that lets people plan for it.

Another place that has been under-exploited is the question of retiring faculty and repositories: taking inventory of someone’s scholarly collections and migrating them to an institutional framework in an orderly fashion.

There is also the question of how we reinterpret institutional repositories beyond universities. For example, there is something that looks a bit like an institutional repository, but with some differences, that belongs in public libraries, historical societies, and similar organizations. This dimension bears exploration.

To conclude his comments he talked about a last open issue. When we talk about good stewardship and preservation of digital materials, there are a couple of ideas that have emerged as we have tried to learn from our past stewardship of the print scholarly literature. One of these principles is that geographic replication is a good thing; we’re starting to see this, in the sense that most repositories are based on some geographically redundant storage system, or we’ll see a steady migration toward this in the next few years. A second is organizational redundancy. If you look at the print world, it wasn’t just that the scholarly record was held in a number of independent locations but also that control was replicated among institutions that were making independent decisions about adding materials to their library collections. Clearly they coordinated to a point, but they also had institutional independence. We don’t know how to do this with institutional repositories. The same issue is emerging in special collections as they become digital: because they didn’t start life as published materials in many replicated copies, we need other mechanisms to distribute curatorial responsibility. This is linked to the notion that it is usually not helpful to talk about preservation in terms like “eternity” or “perpetuity” or the life of the republic. It is probably better in most cases to think about preservation one chunk at a time: an institution making a 20-year or 50-year commitment with a well-structured process at the end. That process includes deciding whether the institution should renew the commitment and, if not, letting other interested parties come in and take responsibility through a well-ordered hand-off. This ties into policies and strategies for curatorial replication across institutions and the ways institutional repositories will need to work together. It may be less critical today, but it will become increasingly critical.

In conclusion, Cliff said that he hoped he had left the attendees with a sense that repositories are not things that stand on their own; they are in fact mechanisms that advance policy in a very complex ecology of systems. We don’t have our policy act together on many systems adjacent to the repository, which leads to issues of appropriate scope and interfaces with those systems. Where repositories will evolve in the future as we come to understand the role of big data is also of interest.

DSpace 1.8

Robin Taylor, the DSpace version 1.8 release manager, gave an overview of what was planned (not promised!) for the next major release. The release schedule was to have a beta last week, but that didn’t happen. The remainder of the schedule is to have a beta on July 8th, feature freeze on August 19th, release candidate 1 published on September 2nd in time for the test-a-thon from the 5th to the 16th, followed by a second release candidate on September 30th, final testing October 3rd through the 12th, and a final release on October 14th. He then went into some of the planned highlights of this release.

SWORD is a lightweight protocol for depositing items between repositories; it is a profile of the Atom Publishing Protocol. In the current release, DSpace has been able to accept items; the planned work for 1.8 will make it possible to send items as well. Some possible use cases: publishing from a closed repository to an open repository, sending from the repository to a publisher, from the repository to a subject-specific service (such as arXiv), or vice versa. The functionality was copied from the Swordapp demo. It supports SWORD v1 and only the DSpace XMLUI. A question was asked about whether the SWORD copy process is restricted to just the repository manager. The answer was that it should be configurable. On the one hand it can be open because it is up to the receiving end to determine whether or not to accept the item. On the other hand, a repository administrator might want to prevent items being exported out of a collection.
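For a sense of what a SWORD v1 deposit looks like on the wire, here is a rough curl sketch of POSTing a package to a DSpace deposit URL. The hostname, credentials, handle, and package file are made up for illustration, and the packaging header can differ between repositories and profiles:

    # deposit a packaged item into a (hypothetical) DSpace collection via SWORD v1
    curl -u depositor@example.edu:secret \
         -H "Content-Type: application/zip" \
         -H "X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP" \
         --data-binary @item-package.zip \
         https://repository.example.edu/sword/deposit/123456789/2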

MIT has rewritten the Creative Commons licensing selection steps. It uses the Creative Commons web services (as XML) rather than HTML iframes, which allows better integration with DSpace. As an aside, the Creative Commons and license steps have been split into two discrete steps allowing different headings in the progress bar.

The DSpace Community Advisory Team prioritized issues to be addressed by the developers, and for this release they include JIRA issue DS-638 for virus checking during submission. The solution invokes the existing Curation Task and requires Clam AV antivirus software to be installed. It is switched off by default and is configured in submission-curation.cfg. Two other issues that were addressed are DS-587 (Add the capability to indicate a withdrawn reason to an Item) and DS-164 (Deposit interface), which was completed as the Google Summer of Code Submission Enhancement project.

Thanks to Bojan Suzic’s Google Summer of Code project, DSpace has had a REST API. The code has been publicly available and repositories have been making use of it, so the committers group wants to get it into a finished state and include it in 1.8. There is also work on an alternative approach to a REST API.

DSpace and DuraCloud was also covered; it was much the same that I reported on earlier this week, so I’m not repeating it here.

From the geek perspective, the new release will see increasing modularization of the codebase and more use of Spring and the DSpace Services Framework. The monolithic dspace.cfg will be split up into separate pieces; some pieces would move into Spring config while other pieces could go into the database. It will have a simplified installation process, and several components that were talked about elsewhere at the meeting: WebMVC UI, configurable workflow, and more curation tasks.

Introduction to DSpace Curation Services

Bill Hays talked about curation tasks in DSpace. Curation tasks are Java objects managed by the Curation System. Functionally, a task is an operation run on a DSpace Object and (optionally) its contained objects (e.g., a community, subcommunity, collection, and items). Tasks do not work site-wide, nor on bundles or bitstreams. They can be run in multiple ways by different types of administrative users, and they are configured separately from dspace.cfg.

Some built-in tasks are: validate metadata against input forms (halts on task failure), count bitstreams by format type, virus scan (using an external virus detection service) on ingest (the desired use case), and the replication suite of tasks for DuraCloud. Other tasks: a link checker and 11 others (from Stuart Lewis and Kim Shepherd), format identification with DROID (in development), validate/add/replace metadata, status report on workflow items, filter media in workflow (proposed), and checksum validation (proposed).

What does this mean for different users? For a repository or collection manager, it means new functionality — GUI access without GUI development: curation, preservation, validation, reporting. For a developer, it means rapid development and deployment of functionality without rebuilding or redeploying the DSpace instance.

The recommended Java development environment for tasks is a package outside of dspace-api. Make a POM with a dependency on dspace-api, especially /curate. The required features of a task are a constructor with no arguments (to support loading as a plugin) and that it implements the CurationTask interface or extends the AbstractCurationTask class. Deploy it as a JAR and configure it (similar to a DSpace plugin).

There are some Java annotations for curation task code that are important to know about. Setting @Distributive means that the task is responsible for handling any contained DSpace objects as appropriate; otherwise the default is to have the task executed across all contained objects (subcommunities, collections, or items). Setting @Suspendable means the task interrupts processing when the first FAIL status is returned. Setting @Mutative means the task makes changes to target objects.

Invoking tasks can be done several ways: from the web application (XMLUI), from the command line, from workflow, from other code, or from a queue (deferred operation). In the case of workflow, one can target the action of the task anywhere in the workflow steps (e.g. before step 1, step 2, or step 3, or at item installation). Actions (reject or approve) are based on task results, and notifications are sent by e-mail.
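As an illustration of the command-line route, DSpace ships a curate launcher command; something like the following runs a named task against a collection identified by its handle. The task name and handle here are only examples, and the option letters may vary between releases:

    # run the virus-scan curation task over everything in collection 123456789/42
    [dspace]/bin/dspace curate -t vscan -i 123456789/42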

A mechanism for discovering and sharing tasks doesn’t exist yet. What is needed is a community repository of tasks. For each task what is needed is: a descriptive listing, documentation, reviews/ratings, link to source code management system, and link to binaries applicable to specific versions.

With dynamic loading with scripting languages in JSR-223, it is theoretically possible to create Curation Tasks in Groovy, JRuby, Jython, although the only one Bill has been able to get to work so far has been Groovy. Scripting code needs a high level of interoperability with Java, and must implement the CurationTask interface. Configuration is a little bit different: one needs a taskcatalog with descriptors for language, name of script, and how the constructor is called. Bill demonstrated some sample scripts.

In his conclusion, Bill said that the new curation services increase functionality for content in a managed framework; offer multiple ways of running tasks for different types of users and scenarios; make it possible to add new code without a rebuild; simplify extending DSpace functionality; and, with scripting, lower the bar even more.

Open Repositories 2011 Report: Day 2 with DSpace plus Fedora and Lots of Lightning Talks

Today was the second day of the Open Repositories conference, and the big highlight of the day for me was the panel discussion on using Fedora as a storage and service layer for DSpace. This seems like such a natural fit, but with two pieces of complex software the devil is in the details. Below that summary are some brief paragraphs about a few of the 24×7 lightning talks.

Thursday Threads: Machine-Meaningful Web Content and Successful IPv6 Test


Two threads this week: the first is an announcement from the major search engines of an agreed-upon way to embed machine-processable information in web pages. The search engines want this so they can do a better job understanding the information in web pages, but it stomps on the linked data work that has been a hot topic in libraries recently. The second is a red-letter day in the history of the internet as major services tried out a new way for machines to connect. The test was successful, and its success means a big hurdle has been crossed as the internet grows up.

Open Repositories 2011 Report: Day 1 with Apache, Technology Trends, and Bolded Labels

Today was the first main conference day of the Open Repositories conference in Austin, Texas. There are 300 developers here from 20 countries and 30 states. I have lots of notes from the sessions, and I’ve tried to make sense of some of them below before I lose track of the entire context.

The meeting opened with a keynote by Jim Jagielski, president of the Apache Software Foundation. He gave a presentation on what it means to be an open source project, with a focus on how Apache creates a community of developers and users around its projects.


Slides 50 and 51 of "Open Source: It's Not Just for IT Anymore"

One of the takeaways was a characterization of open source licenses from Dave Johnson. Although it is a basic shorthand, it is useful in understanding the broad classes of licenses.

He also explained why community and code are peers in Apache. One is not possible without the other; it takes an engaged community to create great code, and great code can only be created by a healthy community. He also described how the primary communications tool for projects is not new-fangled technologies like wikis, conference calls, and IRC. The official record of a project is its e-mail lists. This enables the broadest possible inclusion of participants across many time zones, and the list archives enable people to look into the history of decisions. If discussions take place in other forums or tools, a summary is always brought back to the e-mail list.

Jim’s concluding thoughts were a great summary of the presentation, and I’ve inserted them on the right.

I missed the first concurrent session of the day due to a work conflict, so the first session I went to was the after-lunch 24×7 presentations: no more than 24 slides in no more than seven minutes. I like this format because it forces the presenters to be concise, and if the topic is not one that interests you it isn’t long until the next topic comes up. The short presentations are also great for generating discussion points with the speakers during breaks and the reception. Two of these in particular struck a chord with me.

The first was “Technology Trends Influencing Repository Design” by Brad McLean of DuraSpace. His list of four trends were:

  1. Design for mobile, not just PCs. The model of a mobile app — local computation and page rendering backed by web services for retrieving data — is having several impacts on design: a reinforcement of the need for lightweight web services and UIs; accounting for how screen size has shrunk again; and having a strategy for multi-platform apps will become critical.
  2. More programming language(s) than you need/want. Java, Python, Ruby, Scala, LISP, Groovy, JavaScript and the list goes on. This proliferation of languages has forced looser coupling between components (e.g. a JavaScript based script can consume data from and write data to a Java-based servlet engine). The implications he listed for this are that it is even clearer that true integration challenges are in the data modeling and policy domains; harder to draw neat boxes around required skill sets; and that you might lose control of your user experience (and it might be a good thing).
  3. Servers and clusters. Clusters are not just for high-performance computing and search engines anymore; techniques like map/reduce are available to all. He said that Ebay was the last major internet company to deploy its infrastructure on “big iron”, but he didn’t attribute that statement to a source. (Seems kind of hard to believe…) The implications are that we should look to replicated and distributed SOLR indexing (hopefully stealing a page from the “noSQL” handbook); keep an eye on Map/Reduce-based triple stores (interesting idea!); and expect repository storage to span multiple systems.
  4. What is a filesystem? Brad noted that with filesystems what was once hidden from the end user (think of the large systems of the 1960s, 1970s, and 1980s) became visible (the familiar desktop file folder structure) and is now becoming hidden again (as with mobile device apps). Applications are now storing opaque objects again; how do we effectively ingest them into our repositories?

Takeaways from Simeon: think about what to present sans label; find cues you can use instead of labels; use labels for 2ndary info. #or11

Tweet from Dorothea Salo

The second 24×7 talk that struck a chord was “Don’t Bold the Field Name” by Simeon Warner. And by that he literally meant “don’t bold the field name.” He walked through a series of library interfaces and noted how we have a tendency to display bolded field labels. He then pointed out how this draws the eye’s attention to the labels and not to the record content beside them. Amazon doesn’t do this (at least with the metadata at the top of the page), Ebay doesn’t do this, and the search engines don’t do this. He did note — pointing to the case of the “Product Details” section of an Amazon item page — that “the task of finding the piece of information is more important than consuming it.” (Again, in the Amazon case, the purpose of bolding the label is to draw the eye to the location of data like publisher and shipping weight on the page.) I think Dorothea Salo’s tweet summed it up best: “Takeaways from Simeon: think about what to present sans label; find cues you can use instead of labels; use labels for 2ndary info. #or11”

I also attended the two sessions on identifiers in the afternoon (Peter Sefton’s “A Hint of Mint: Linked Authority Control Service” and Richard Rodgers’s “ORCID: Open Research and Contributor ID — An Open Registry of Scholarly IDs”), but the hour is late and tomorrow’s events will come soon enough. Given enough time and energy, I’ll try to summarize those sessions later.

Open Repositories 2011 Report: DSpace on Spring and DuraSpace

This week I am attending the Open Repositories conference in Austin, Texas, and yesterday was the second preconference day (and the first day I was in Austin). Coming in as I did, I only had time to attend two preconference sessions: one on the integration — or maybe “invasion” — of the Spring Framework into DSpace, and one on the introduction of the DuraCloud service and code.

Does the Google/Bing/Yahoo Markup Promote Invalid HTML?

[Update on 10-Jun-2011: The answer to the question of the title is “not really” — see the update at the bottom of this post and the comments for more information.]

Yesterday Google, Microsoft Bing, and Yahoo! announced a project to promote machine-readable markup for structured data on web pages.

Many sites are generated from structured data, which is often stored in databases. When this data is formatted into HTML, it becomes very difficult to recover the original structured data. Many applications, especially search engines, can benefit greatly from direct access to this structured data. On-page markup enables search engines to understand the information on web pages and provide richer search results in order to make it easier for users to find relevant information on the web. Markup can also enable new tools and applications that make use of the structure.

The problem is, I think, that the markup they describe on their site generates invalid HTML. Did they really do this?

Take this example from the Event description page:

< !DOCTYPE html>
<html xmlns="">
<div itemscope itemtype="">
  <a itemprop="url" href="nba-miami-philidelphia-game3.html">
  NBA Eastern Conference First Round Playoff Tickets:
  Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)
  <time itemprop="startDate" datetime="2011-04-21T20:00">
    Thu, 04/21/11
    8:00 p.m.
  <div itemprop="location" itemscope itemtype="">
    <a itemprop="url" href="wells-fargo-center.html">
    Wells Fargo Center
    <div itemprop="address" itemscope itemtype="">
      <span itemprop="addressLocality">Philadelphia</span>,
      <span itemprop="addressRegion">PA</span>
  <div itemprop="offers" itemscope itemtype="">
    Priced from: <span itemprop="lowPrice">$35</span>
    <span itemprop="offerCount">1,938</span> tickets left

The problem is in the first <div> line and the attribute ‘itemscope’ that has no value associated with it. If you copy-and-paste that markup into the W3 validator (using the “Validate by Direct Input” option and manually removing the space between the less-than sign and the exclamation point in the first line), it comes back with:

Line 7, Column 16: required character (found i) (expected =)

A bare attribute may be valid in some forms of HTML, but it certainly isn’t valid XML, and I think that will cause all sorts of problems down the line. Does anyone else see this as an issue?
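If you would rather script this kind of check than paste markup into the validator’s form, the W3C’s Nu HTML Checker accepts documents over plain HTTP. A hedged sketch, assuming the example has been saved to event.html:

    # POST the page to the Nu HTML Checker and get plain-text results
    curl -H "Content-Type: text/html; charset=utf-8" \
         --data-binary @event.html \
         "https://validator.w3.org/nu/?out=gnu"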


I heard back from one of the keepers of W3C’s validator, and the xmlns="" attribute of the html tag was triggering the XML version of the validator. The bare itemscope attribute is valid HTML but invalid XML (important for the XML serialization of HTML), but can be fixed by making it itemscope="itemscope". See the comments for more information.