Google and DataNet: Two Ships Passing in the Night, or Maybe Something More?

Wired Magazine’s blog network says “Google to Host Terabytes of Open-Source Science Data” while the National Science Foundation (NSF) is reviewing submissions to the DataNet solicitation “to catalyze the development of a system of science and engineering data collections that is open, extensible and evolvable.” On the surface, you might think they are working on the same project, but there is more here than meets the eye (or, rather, the ear listening to these two sound-bites).

Disclosure: OhioLINK is a named party in a submission by The Ohio State University to the NSF DataNet solicitation. We’re looking forward to a positive reception to our proposal in the first round of DataNet reviews.

As with most things Google, the real nuts and bolts of their strategy are unknown until Google chooses to unveil them. This much seems to be known: under the moniker of “Google Research,” the company will make large datasets available to the world for free. According to the Wired article, “two planned datasets are all 120 terabytes of Hubble Space Telescope data and the images from the Archimedes Palimpsest, the 10th century manuscript that inspired the Google dataset storage project.” Sources at Google told Wired that the Research site will offer YouTube-style annotation features and data visualization technology purchased from Gapminder last year. Part of the plan also includes shipping and loaning large disk packs so the data doesn’t have to flow across the internet. At this point, the presumed home of Google Research describes contributions by Google staff to the research community, but I’m guessing that will change when the new service is made public.

On the other hand, the NSF DataNet solicitation envisions a new type of organization that “will integrate library and archival sciences, cyberinfrastructure, computer and information sciences, and domain science expertise to: provide reliable digital preservation, access, integration, and analysis capabilities for science and/or engineering data…; anticipate and adapt to changes in technologies and in user needs and expectations; [perform R/D in] computer and information science and cyberinfrastructure…; and serve as component elements of an interoperable data preservation and access network.” More than a service, DataNet seeks a model of organization that brings varied expertise together on the issues surrounding data curation. By way of comparison, it would seem like NSF thinks of this as a people challenge while Google Research thinks of it as a technology platform challenge.

A technology platform is certainly part of what DataNet needs, but not all of it. As one of the commenters on the Wired article noted, “masses of data are of course completely useless without extensive meta-data describing provenance.” Still, given the cyberinfrastructure that Google can bring to bear on the problem of large-scale data archiving and the dataset visualization technology it now has in house, Google could supply a big part of a potential solution. One wonders about the viability of a response to the DataNet solicitation that, in effect, outsources the cyberinfrastructure piece to Google and focuses on building a sustainable organizational model around the description and dissemination of the data.
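To make the commenter's point concrete, here is a minimal sketch of the kind of provenance metadata a dataset would need to travel with it. The field names and the record shape are my own illustration, loosely modeled on Dublin Core-style descriptive fields; they are not taken from any DataNet or Google Research schema.

```python
import json

# Illustrative only: a hypothetical provenance record for a dataset.
# Field names are assumptions, not any published DataNet schema.
def make_provenance_record(title, creator, source, collected, processing_steps):
    """Bundle the descriptive metadata a re-user would need to trust a dataset."""
    return {
        "title": title,
        "creator": creator,
        "source": source,                  # instrument or origin of the raw data
        "date_collected": collected,
        "provenance": processing_steps,    # ordered history of transformations
    }

record = make_provenance_record(
    title="Hubble WFPC2 image mosaic (example)",
    creator="Space Telescope Science Institute",
    source="Hubble Space Telescope / WFPC2",
    collected="1999-04-23",
    processing_steps=[
        "raw telemetry decoded",
        "flat-field calibration applied",
        "mosaicked and resampled to a common grid",
    ],
)
print(json.dumps(record, indent=2))
```

Without something like the `provenance` list above, a downstream researcher has no way to know whether the numbers are raw readings or heavily processed derivatives, which is exactly why description can't be bolted on after the storage problem is solved.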

Anybody working on that?

Digital Preservation Activities: NSF’s “DataNet” and the NSF/Mellon Blue Ribbon Task Force

The past few weeks have seen announcements of two large digital preservation programs. I find it interesting that the National Science Foundation is involved in both of them.

Sustainable Digital Data Preservation and Access Network Partners

The NSF’s Office of Cyberinfrastructure has announced a request for proposals with the name Sustainable Digital Data Preservation and Access Network Partners (DataNet). The lead paragraph of its synopsis is:

Science and engineering research and education are increasingly digital and increasingly data-intensive. Digital data are not only the output of research but provide input to new hypotheses, enabling new scientific insights and driving innovation. Therein lies one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams. This solicitation addresses that challenge by creating a set of exemplar national and global data research infrastructure organizations (dubbed DataNet Partners) that provide unique opportunities to communities of researchers to advance science and/or engineering research and learning.

The introduction in the solicitation goes on to say:

Chapter 3 (Data, Data Analysis, and Visualization) of NSF’s Cyberinfrastructure Vision for 21st Century Discovery presents a vision in which “science and engineering digital data are routinely deposited in well-documented form, are regularly and easily consulted and analyzed by specialists and non-specialists alike, are openly accessible while suitably protected, and are reliably preserved.” The goal of this solicitation is to catalyze the development of a system of science and engineering data collections that is open, extensible and evolvable.

The full program solicitation is available (here’s a hint if the left side of the PDF version is cut off when printing — in the Acrobat print dialog, reduce the document size to 94% of the paper size). There will be up to five awards of $20 million each for five years with the possibility of continuing funding.

The part that I find interesting, from a library technologist’s perspective, is this: “Successfully providing stability for long-term preservation and agility both to embrace constant technological change and to engage evolving research challenges requires a novel combination of expertise in library and archival sciences, computer, computational, and information sciences, cyberinfrastructure, and the other domain sciences and engineering. A goal of this solicitation is to support the creation of new types of organizations that fully integrate all of these capabilities.” Undertaking such an endeavor must be a truly cross-disciplinary effort — marrying the best of library and archival practice with the domain sciences and engineering to accomplish the task.

It would seem that the Fedora Commons platform is a great starting point for the technological infrastructure. It is as if the solicitation could have been written with Fedora in mind: “content heterogeneity requires that each awardee create a resource that serves a broad disciplinary and subject matter range, manages a diverse array of data types and formats, and provides the capability to support collections at the research, resource, and reference levels.” Another component of the program goals — developing models for economic and technological sustainability — is similar to OhioLINK’s attempts to aggregate the creation and support of content repositories at state-wide economies of scale.

Peter Brantley, Executive Director of the Digital Library Federation, has established a group on Nature’s Network service for those who want to collaborate or get further information (open to participation from anyone, but registration is required). There is the kernel of a group in Ohio that is considering the possibility of a joint application; if you’re interested, please let me know. Peter also has a post on the topic on O’Reilly’s Radar.

Blue Ribbon Task Force on Sustainable Digital Preservation and Access

The National Science Foundation (NSF) and the Andrew W. Mellon Foundation are funding a blue-ribbon task force to address the issue of economic sustainability for digital preservation and persistent access. Co-chaired by Fran Berman of the San Diego Supercomputer Center and Brian Lavoie of OCLC, the task force will meet over the next two years to examine the issue. It is intended as an international effort; support is also coming from JISC in the U.K.

In its final report, the Task Force is charged with developing a comprehensive analysis of current issues, and actionable recommendations for the future to catalyze the development of sustainable resource strategies for the reliable preservation of digital information. During its tenure, the Task Force also will produce a series of articles about the challenges and opportunities of digital information preservation, for both the scholarly community and the public.1

The only news so far appears to be the press releases linked above. I recognize it is a two-year effort that only got started late last month, but I half expect some public face for the task force’s work to be available somewhere, even in these early stages. If DLTJ readers see anything, please mention it in this post’s comments.


  1. From the OCLC press release.