What Does it Mean to Have Unlimited Storage in the Cloud?

We’ve seen big announcements recently about unlimited cloud storage offerings for a flat monthly fee. Dropbox offers it to subscribers of its Business plan. Similarly, Google has unlimited storage for Google Apps for Business customers. In both cases, though, you have to be part of a business group of some sort. Then Microsoft announced unlimited storage for all Office 365 subscribers (Home, School, and soon Business) as a bundled offering of OneDrive with the Office suite of products. Now comes word today from Amazon of unlimited storage for consumers…no need to be part of a business grouping or have bundled software come with it.

Today a colleague asked why all of this cloud storage couldn’t be used as file storage for the Islandora hosting service offered by LYRASIS. On the surface, it would seem to be a perfect backup strategy, particularly if you subscribed to several of these services and ran audits between them to make sure they were truly in sync. Alas, the terms of service prevent you from doing something like that. Here is an excerpt from Amazon:

1.2 Using Your Files with the Service. You may use the Service only to store, retrieve, manage, and access Your Files for personal, non-commercial purposes using the features and functionality we make available. You may not use the Service to store, transfer or distribute content of or on behalf of third parties, to operate your own file storage application or service, to operate a photography business or other commercial service, or to resell any part of the Service. You are solely responsible for Your Files and for complying with all applicable copyright and other laws, including import and export control laws and regulations, and with the terms of any licenses or agreements to which you are bound. You must ensure that Your Files are free from any malware, viruses, Trojan horses, spyware, worms, or other malicious or harmful code.

Amazon Cloud Drive Terms of Use, Last updated March 25, 2015

It did get me wondering, though. Decades ago the technology community created RAID storage: Redundant Array of Inexpensive Disks. The concept is that if you copy your data across many different disks, you can survive the failure of one of those disks and rebuild the information from the remaining drives. We also have virtual storage systems like iRODS and distributed file systems like Google File System and Apache Hadoop Distributed File System. I wonder what it would take to layer these concepts together to have a cloud-independent, cloud-redundant storage array for personal backups. Sort of like a poor-man’s RAID over Dropbox/Amazon/Microsoft/Google. Something that would take care of the file verifications, the rebuilding from redundant copies, and the caching of content between services. Even if we couldn’t use it for our library services, it would be a darn good way to ensure the survivability of our cloud-stored files against the failure of a storage provider’s business model.
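
To make the idea concrete, here is a minimal sketch of the verification-and-rebuild piece, assuming each provider’s sync client exposes a local folder (the directory names below are hypothetical). A file that is damaged or missing in one mirror gets restored from the copy that the majority of mirrors agree on:

```python
import hashlib
import shutil
from collections import Counter
from pathlib import Path

# Hypothetical local folders kept in sync by each provider's client.
MIRRORS = [Path.home() / d for d in ("Dropbox/backup", "GoogleDrive/backup",
                                     "OneDrive/backup", "AmazonDrive/backup")]

def digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def repair(rel: str) -> None:
    """Compare one file across mirrors; rewrite outliers from the majority copy."""
    digests = {m: digest(m / rel) for m in MIRRORS if (m / rel).is_file()}
    if not digests:
        print(f"LOST: {rel} is missing from every mirror")
        return
    majority, _ = Counter(digests.values()).most_common(1)[0]
    # With an even split there is no safe majority; a real tool would need a
    # tie-breaker, such as a separately stored manifest of known-good hashes.
    source = next(m for m, d in digests.items() if d == majority)
    for mirror in MIRRORS:
        if digests.get(mirror) != majority:
            (mirror / rel).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source / rel, mirror / rel)
            print(f"repaired {rel} on {mirror}")

def all_files() -> set:
    """Union of relative file paths found across every mirror."""
    found = set()
    for mirror in MIRRORS:
        if mirror.is_dir():
            found.update(str(p.relative_to(mirror))
                         for p in mirror.rglob("*") if p.is_file())
    return found

if __name__ == "__main__":
    for rel in sorted(all_files()):
        repair(rel)
```

Real RAID also stripes data and uses parity; this sketch only mirrors, which is the simpler (and more storage-hungry) end of the spectrum.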

Iron Mountain to Close its Virtual File Store Service

About two years ago I wrote a blog post wondering if we could outsource the preservation of digital bits. What prompted that blog post was an announcement from Iron Mountain of a Cloud-Based File Archiving service. Since then a number of other services have sprung up that are more attuned to the needs of cultural heritage communities (DuraCloud and Chronopolis come to mind), but I have wondered if the commercial sector had a way to do this cheaply and efficiently. The answer to that question is “maybe not”: Iron Mountain has told Gartner Group (PDF archive) that it is closing its Virtual File Store service and its Archive Service Platform.

The Gartner analysis goes on to say: “Virtual File Store customers that stay with Iron Mountain will be transferred to a higher-value offering, File System Archiving (FSA) in 2012. The new offering will be a hybrid that leverages policy-based archiving on site and in the cloud with indexing and classification capabilities.” The Register has more details and speculation about what happened. As always, the full story might be more interesting than what the news reports are saying. In any case, just to close this loop: if you were thinking of trying this particular option, think no further.

Thursday Threads: Estimating and Understanding Big Data, Key Loggers Steal Patron Keystrokes


Two entries on big data lead this week’s edition of DLTJ Thursday Threads. The first is at the grandest scale possible: a calculation of the amount of information in the world. Add up all the digital memory (in cell phones, computers, and other devices) and analog media (for instance, paper) and it comes to a very big number. The authors try to put it in perspective, which for me brought home how insignificant my line of work can be. (All of our information is still less than 1% of what is encoded in the DNA of a single human being?) The second “big data” entry describes an effort to make sense of huge amounts of data in the National Archives through the use of visualization tools. Rounding out this week is a warning to those who run public computers: be on the lookout for key loggers that can be used to steal information from users.

If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my FriendFeed stream (or subscribe to its feed in your feed reader). Comments and tips, as always, are welcome.

How Much Information Is There in the World?

Video with author of paper (4 minutes)

So how much information is there in the world? How much has it grown?

Prepare for some big numbers:

  • Looking at both digital memory and analog devices, the researchers calculate that humankind is able to store at least 295 exabytes of information. (Yes, that’s a number with 20 zeroes in it.)

    Put another way, if a single star is a bit of information, that’s a galaxy of information for every person in the world. But it’s still less than 1 percent of the information stored in all the DNA molecules of a human being.

  • 2002 could be considered the beginning of the digital age, the first year worldwide digital storage capacity overtook total analog capacity. As of 2007, almost 94 percent of our memory is in digital form.
  • In 2007, humankind successfully sent 1.9 zettabytes of information through broadcast technology such as televisions and GPS. That’s equivalent to every person in the world reading 174 newspapers every day.

Feeling swamped in data? You probably don’t have it too bad. Also see a video interview (4 minutes, embedded above) and a podcast interview (12 minutes) with one of the authors that briefly describe some of the findings in the original paper (subscription to Science Magazine required).
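
Out of curiosity, here is a quick back-of-the-envelope check of the 174-newspapers figure. The world-population estimate is my own assumption (roughly 6.6 billion in 2007), not a number taken from the paper:

```python
# Back-of-the-envelope check of the "174 newspapers per person per day" claim.
broadcast_bytes = 1.9e21   # 1.9 zettabytes broadcast in 2007
population = 6.6e9         # rough 2007 world population (my assumption)
days = 365

per_person_per_day = broadcast_bytes / (population * days)
implied_newspaper_mb = per_person_per_day / 174 / 1e6
print(f"{per_person_per_day / 1e6:.0f} MB per person per day")
print(f"implied 'newspaper' size: {implied_newspaper_mb:.1f} MB")
```

That works out to roughly 789 MB per person per day, or about 4.5 MB per “newspaper” — a plausible size for a text-heavy document, so the comparison holds together.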

A Window on the Archives of the Future

In collaborating with NARA, members of TACC’s Data and Information Analysis group developed a multi-pronged approach to address technical challenges. The overall goal of their research is to investigate different data analysis methods within a visualization framework. The visualization interface is the bridge between the archivist and the analysis results, which are rendered visually onscreen as the archivists make selections and interact with the data. The results are presented as forms, colors and ranges of color to assist in synthesis and to facilitate an understanding of large-scale electronic records collections.

This article from the Texas Advanced Computing Center describes a research project to visualize the volumes of digital data in the National Archives. The visualization provides information about the amount of particular types of information, an assessment of the risks to files in the archive based on file type, and other metrics. A brief paper from the Society for Imaging Science and Technology “Archives” proceedings last year, Visualization for Archival Appraisal of Large Digital Collections [PDF], goes into more detail.

Hardware keyloggers discovered at public libraries

Photograph of a USB key logger, courtesy of Sophos

Public libraries in Manchester, England, have been advised to keep their eyes peeled for USB bugs after two devices were discovered monitoring every keystroke made by every user of affected PCs.

According to local media reports, the small surveillance devices were found attached to the keyboard sockets at the back of two PCs in Wilmslow and Handforth libraries.

Sophos, maker of internet security software, posted this notice about key-logging devices attached to public library computers in the U.K. Such a device would make it possible to capture usernames and passwords typed at the keyboard by patrons. The article goes on to suggest two actions: conducting frequent checks of hardware and plugging keyboards into USB ports on the front of computers, where they are easier to inspect visually. A script along the lines of the sketch below could automate part of that checking. [Via Jessamyn West]
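
For Linux-based public PCs, a cron job like this sketch could snapshot the USB bus and flag new devices. The baseline path is my own choice, and lsusb comes from the usbutils package; note that a purely passive inline logger may not enumerate on the bus at all, so this supplements physical inspection rather than replacing it:

```python
import subprocess
from pathlib import Path

BASELINE = Path("/var/lib/usb-baseline.txt")  # hypothetical location

def current_devices() -> set:
    """One line per enumerated USB device, as reported by lsusb."""
    out = subprocess.run(["lsusb"], capture_output=True, text=True, check=True)
    return set(out.stdout.splitlines())

def check() -> None:
    now = current_devices()
    if not BASELINE.exists():
        BASELINE.write_text("\n".join(sorted(now)))  # first run: record baseline
        return
    known = set(BASELINE.read_text().splitlines())
    for device in sorted(now - known):
        print(f"NEW USB DEVICE -- inspect this PC: {device}")

if __name__ == "__main__":
    check()
```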

Options in Storage for Digital Preservation

A last-minute change to my plans for ALA Midwinter came on Tuesday when I was sought out to fill in for a speaker who canceled at the ALCTS Digital Preservation Interest Group meeting. Options for outsourcing storage and services for preserving digital content have been a recent interest, so I volunteered to combine two earlier DLTJ blog posts with some new information and present it to the group for feedback. The reaction was great, and here is the promised slide deck, links to further information, and some thoughts from the audience response.

Slide Deck and References

Slides for 'Options in Storage for Digital Preservation'

In the presentation there is a Table About Costs that uses a scenario from an earlier DLTJ blog post. The text of the scenario is:
To examine the similarities and differences in costs, let’s use the OhioLINK Satellite Image collection as a prototypical example. It consists of about 2 terabytes (2TB) of high-quality images in TIFF format, with about 7.5GB of data going into the repository each month. In the interest of exploring everything that S3 can do, there is an assumption that approximately 4GB of data will be transferred out of the archive each month; OCLC’s Digital Archive does not have an end-user dissemination component.

The point of this scenario is to show the widest range of costs, from a storage-only solution like Amazon S3 to a soup-to-nuts service like OCLC Digital Archive. A word about the redacted costs: some of the numbers for OCLC’s Digital Archive response (from 2008) came from a confidential quote, so they were removed from the public table. The values that are publicly listed come from Barbara Quint’s article.
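
For anyone who wants to rebuild the table with current numbers, the arithmetic behind a storage-only estimate is simple. The per-gigabyte rates below are placeholders of mine, not Amazon’s actual prices; substitute whatever the current S3 price sheet says:

```python
# Illustrative monthly cost for the scenario above; all rates are placeholders.
STORAGE_RATE = 0.14       # $/GB-month (placeholder, not a real quote)
TRANSFER_IN_RATE = 0.00   # $/GB ingested (some providers charge nothing here)
TRANSFER_OUT_RATE = 0.15  # $/GB retrieved (placeholder)

stored_gb = 2 * 1024      # the ~2TB TIFF collection
in_gb = 7.5               # monthly ingest
out_gb = 4.0              # monthly dissemination

monthly = (stored_gb * STORAGE_RATE
           + in_gb * TRANSFER_IN_RATE
           + out_gb * TRANSFER_OUT_RATE)
print(f"~${monthly:,.2f}/month, ~${monthly * 12:,.2f}/year (storage-only)")
```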

The articles and blog posts I referenced in the course of the presentation were:

Iglesias, Edward and Wittawat Meesangnil (2010). Using Amazon S3 in Digital Preservation in a mid sized academic library: A case study of CCSU ERIS digital archive system. The Code4Lib Journal, issue 12. Retrieved 5-Jan-2011 from http://journal.code4lib.org/articles/4468

Murray, Peter (2008). Long-term Preservation Storage: OCLC Digital Archive versus Amazon S3. Disruptive Library Technology Jester. Retrieved 5-Jan-2011 from http://dltj.org/article/oclc-digital-archive-vs-amazon-s3/

Murray, Peter (2009). Can We Outsource the Preservation of Digital Bits?. Disruptive Library Technology Jester. Retrieved 5-Jan-2011 from http://dltj.org/article/outsource-digital-bits/

Quint, Barbara (2008). OCLC Introduces High-Priced Digital Archive Service. Information Today. Retrieved 5-Jan-2011 from http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=49018

Some Thoughts

There was a great deal of discussion after the presentation about how good of a guarantee is good enough. Amazon S3 quotes two figures: “Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year.” The question was whether that slight risk of loss is “good enough” for our purposes. To come to grips with this kind of digital storage, can we (as the library profession) get someone from Amazon to talk about what they do to assure that data is available? Can the terms they use be translated into terms that we use and understand? Can we get familiar and comfortable enough with what they do to trust them as a long-term data warehouse? Can we pull out the appropriate questions from the Trusted Repositories Audit & Certification: Criteria and Checklist to see how Amazon S3 measures up?
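
Taking the durability figure at face value, some quick arithmetic shows what eleven nines implies (it also assumes object losses are independent, which real-world failures are not):

```python
durability = 0.99999999999        # 99.999999999% per object per year
p_loss = 1 - durability           # about 1e-11

objects = 10_000_000              # say, a ten-million-file archive
expected_per_year = objects * p_loss
print(f"expected losses: {expected_per_year:.4f} objects/year")
print(f"roughly one lost object every {1 / expected_per_year:,.0f} years")
```

For a ten-million-file archive that is about one lost object every 10,000 years. The harder question the group raised still stands, though: a number like that is only as trustworthy as the process behind it.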

Can We Outsource the Preservation of Digital Bits?

A colleague forwarded an article from The Register with news of a new service from Iron Mountain for Cloud-Based File Archiving. It is billed as a “storage archiving service designed to help companies reduce costs of storing and managing static data files.” My place of work is facing an increasing need for large-scale digital preservation storage with the acquisition of a large collection of music and the conversion of our educational videos from physical DVD preservation to digital preservation. We’re talking terabytes of content that we need to keep in its archival form: uncompressed, high-quality media files (not the lower-quality derivatives for day-to-day access). It doesn’t make sense to keep that on expensive SAN storage, of course, so this article struck me at just the right time to consider alternatives.

Architecture diagram for Iron Mountain's Virtual File Store service, showing the placement of the Virtual File Store appliance relative to other assets on the data center network. Graphic from the product datasheet (http://www.ironmountain.com/resources/vfs/virtual_file_store_datasheet.pdf)

According to the product literature, the service works by putting a black box on your network where one can drop files via CIFS or NFS. The black box transfers the files over the internet to two Iron Mountain data centers. Files can then be retrieved via an on-line on-demand service or by exchanging physical media with Iron Mountain for bulk retrieval needs.

We know, of course, that digital preservation is more than just preserving the digital bits: it is the intellectual exercise of describing the stored information, the effort of maintaining an accurate catalog of that information, and the burden of migrating file formats or emulating platforms to read old file formats. Handling the raw bits is a big deal, too: checksumming to ensure unaltered status, refreshing files to new storage media, and protection from physical disasters. This Iron Mountain solution seems to address that more mechanical portion of digital preservation, and it is one that can probably benefit from aggregating the service needs of many customers (and so is ripe for outsourcing).
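
As an illustration of that mechanical portion, here is a minimal fixity-audit sketch: it compares current SHA-256 checksums against a previously recorded manifest. The tab-separated manifest format is my own convention, not any standard:

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """SHA-256 of a file, read in 1 MB chunks to handle large media files."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(archive_root: Path, manifest: Path) -> None:
    """Report files that have gone missing or whose bits have changed."""
    for line in manifest.read_text().splitlines():
        rel, recorded = line.split("\t")
        target = archive_root / rel
        if not target.is_file():
            print(f"MISSING  {rel}")
        elif sha256(target) != recorded:
            print(f"ALTERED  {rel}")

# audit(Path("/archive"), Path("/archive/manifest.tsv"))  # hypothetical paths
```

Any outsourced bit-preservation service would be doing something like this at scale, plus media refresh and geographic replication.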

Is anyone doing something similar with their physical preservation of digital media? Are there other companies that do the same thing? (I know of OCLC’s Digital Archive service — I did a comparison of it with Amazon S3 last year.)