Google and DataNet: Two Ships Passing in the Night, or Maybe Something More?



Wired Magazine's blog network says "Google to Host Terabytes of Open-Source Science Data" while the National Science Foundation (NSF) is reviewing submissions to the DataNet solicitation "to catalyze the development of a system of science and engineering data collections that is open, extensible and evolvable." On the surface, you might think they are working on the same project, but there is more here than meets the eye (or, rather, the ear listening to these two sound-bites).

Disclosure: OhioLINK is a named party in a submission by The Ohio State University to the NSF DataNet solicitation. We're looking forward to a positive reception to our proposal in the first round of DataNet reviews.

As with most things Google, the real nuts and bolts of their strategy are unknown until Google chooses to unveil them. This much seems to be known: under the moniker of "Google Research" the company will make large datasets available to the world for free. According to the Wired article, "two planned datasets are all 120 terabytes of Hubble Space Telescope data and the images from the Archimedes Palimpsest, the 10th century manuscript that inspired the Google dataset storage project." Sources at Google told Wired that the Research site will offer YouTube-style annotation features and data visualization technology purchased from Gapminder last year. Part of the plan also includes shipping and loaning large disk packs so the data doesn't have to flow across the internet. The presumed home of Google Research is http://research.google.com/. At this point, that URL describes contributions by Google staff to the research community, but I'm guessing that will change when the new service is made public.

On the other hand, the NSF DataNet solicitation envisions a new type of organization that "will integrate library and archival sciences, cyberinfrastructure, computer and information sciences, and domain science expertise to: provide reliable digital preservation, access, integration, and analysis capabilities for science and/or engineering data...; anticipate and adapt to changes in technologies and in user needs and expectations; [perform R/D in] computer and information science and cyberinfrastructure...; and serve as component elements of an interoperable data preservation and access network." More than a service, DataNet seeks a model of organization that brings varied expertise together on the issues surrounding data curation. By way of comparison, it would seem like NSF thinks of this as a people challenge while Google Research thinks of it as a technology platform challenge.

A technology platform is certainly part of what DataNet needs, but not all of it. As one of the commenters on the Wired article noted, "masses of data are of course completely useless without extensive meta-data describing provinence." Still, given the cyberinfrastructure that Google can bring to bear on the problem of large-scale data archiving and the dataset visualization technology that it now has in-house, Google could be a big part of a potential solution. One wonders about the viability of creating a response to the DataNet solicitation that, in effect, outsources the cyberinfrastructure piece to Google and focuses on building the sustainable organization model surrounding the description and dissemination of the data.

Anybody working on that?