Long-term Preservation Storage: OCLC Digital Archive versus Amazon S3

Last month OCLC announced a new service offering for long-term storage of libraries’ digital collections. Called Digital Archive™, it provides “a secure storage environment for you to easily manage and monitor the health of your master files and digital originals.” Barbara Quint has an article in Information Today called “OCLC Introduces High-Priced Digital Archive Service” in which she makes a comparison to Amazon’s Simple Storage Service (or “S3”) primarily from a cost perspective: “The price for S3 storage at Amazon Web Services is 15 cents a gigabyte a month or $1.80 a year, in comparison to OCLC’s $7.50 a gig.” Barbara also goes into some of the technical differences, but I think it might be worthwhile to go a little more into depth on them.

OCLC’s Digital Archive


According to the service overview, Digital Archive is a content hosting service that provides:

  • Systems management
  • Physical security
  • Data security
  • Data backups
  • Disaster recovery
  • ISO 9001 certification
  • Manifest verification
  • Virus check
  • Format verification
  • Fixity check

It is targeted towards the preservation of digital masters. There is a document on the Digital Archive website called Our commitment that describes other aspects of a digital preservation program: “OCLC is actively developing processes for full preservation of digital assets to ensure complete renderability, regardless of technology changes. This preservation system will likely involve a combination of migration and emulation.” But it is not clear whether these services, beyond “bit preservation” activities, are part of the Digital Archive service or will be part of an add-on service to be developed later.

This “Digital Archive” is a revamping of an older product from OCLC, also called “Digital Archive,” that included a web harvesting component. The service and support documentation on the OCLC website still refers to the former version of Digital Archive, so there is little information about how the new service works beyond what one can infer from the sales information.

Amazon’s S3


Amazon describes S3 as “a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites.” Files are transferred across the internet to Amazon’s services and stored in multiple data centers. Files can be retrieved using standard HTTP mechanisms (the same protocol that powers the web) and are protected by an optional access control mechanism. S3 does have a Service Level Agreement (SLA) that offers guarantees on performance.
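Because objects are addressed by ordinary HTTP URLs, no special client software is needed for retrieval. A minimal sketch of one common S3 URL form (the bucket and key names are hypothetical, and actual access still depends on the bucket’s access control settings):

```python
def s3_object_url(bucket: str, key: str) -> str:
    """Build the standard HTTP URL for an object stored in an S3 bucket.

    Any HTTP client (a browser, curl, urllib) can then fetch the object,
    provided the object's access control settings allow public reads.
    """
    return f"https://{bucket}.s3.amazonaws.com/{key}"

# Hypothetical bucket and key, for illustration only.
url = s3_object_url("example-preservation-bucket", "masters/image0001.tif")
print(url)
```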

The SLA, though, seems to extend only to the availability of the service, not to a long-term commitment to keeping track of files on the service.

AWS [Amazon Web Services, LLC] will use commercially reasonable efforts to make Amazon S3 available with a Monthly Uptime Percentage (defined below) of at least 99.9% during any monthly billing cycle (the “Service Commitment”). In the event Amazon S3 does not meet the Service Commitment, you will be eligible to receive a Service Credit as described below.

The S3 SLA makes no specific mention of the permanence of file storage. In lieu of that, one seems to be covered by the overarching Amazon Web Services Customer Agreement, which has several points of interest from a preservation use perspective:

3.3. Termination or Suspension by Us Other Than for Cause.
3.3.2. Paid Services…. We may suspend your right and license to use any or all Paid Services (and any associated Amazon Properties)…, or terminate this Agreement in its entirety (and, accordingly, cease providing all Services to you), for any reason or for no reason, at our discretion at any time by providing you sixty (60) days’ advance notice in accordance with the notice provisions set forth in Section 15 below.

So if they desire to terminate a library’s use of the service (assuming there was no specific cause — such as a violation of the terms of use — to do so), they have to give 60 days notice. That’s when the “Data Preservation in the Event of Suspension or Termination” clause kicks in:

3.7.2. In the Event of Termination Other Than for Cause. In the event of any termination by us of any Service or any set of Services, or termination of this Agreement in its entirety, other than a for cause termination under Section 3.4.1, (i) we will not take any action to intentionally erase any of your data stored on the Services for a period of thirty (30) days after the effective date of termination; and (ii) your post termination retrieval of data stored on the Services will be conditioned on your payment of Service data storage charges for the period following termination, payment in full of any other amounts due us, and your compliance with terms and conditions we may establish with respect to such data retrieval.

The customer agreement then goes on to say:

3.8. Post-Termination Assistance. Following the suspension or termination of your right to use the Services by us or by you for any reason other than a for cause termination (i.e., a termination under Section 3.2 or under Section 3.3), you shall be entitled to take advantage of any post-termination assistance we may generally make available with respect to the Services, such as data retrieval arrangements we may elect to make available. We may also endeavor to provide you unique post-suspension or post-termination assistance, but we shall be under no obligation to do so. Your right to take advantage of any such assistance, whether generally made available with respect to the Services or made available uniquely to you, shall be conditioned upon your acceptance of and compliance with any fees and terms we specify for such assistance.

Perhaps the most troubling aspect, from a preservation point-of-view, deals with data security and backups. Specifically, Amazon says that data security and backups are the responsibility of the customer. The Amazon Web Services Customer Agreement says (emphasis added):

7.2. Security. We strive to keep Your Content secure, but cannot guarantee that we will be successful at doing so, given the nature of the Internet. Accordingly, without limitation to Section 4.3 above and Section 11.5 below, you acknowledge that you bear sole responsibility for adequate security, protection and backup of Your Content. We strongly encourage you, where available and appropriate, to use encryption technology to protect Your Content from unauthorized access and to routinely archive Your Content. We will have no liability to you for any unauthorized access or use, corruption, deletion, destruction or loss of any of Your Content.

That kind of security and data backup is exactly what you’d want built into a preservation service. Since operations against S3 storage are limited only by knowledge of a private “key”1 (as opposed to being limited to particular IP addresses, or disallowing deletes/modifications from the web entirely), there is a real possibility that the archive can be harmed if the private key is disclosed. Furthermore, S3 does not have a backup/restore service for retrieving files that were accidentally or maliciously deleted.

Feature Comparison


It is useful to compare Amazon’s S3 with OCLC’s Digital Archive service on a point-by-point basis to try to put some meaning behind the cost numbers.

| | OCLC Digital Archive | Amazon S3 |
| --- | --- | --- |
| Systems management | Yes | Yes |
| Physical security | Yes | Yes |
| Data security | Yes | No |
| Data backups | Yes | No |
| Disaster recovery | Yes | unclear |
| ISO 9001 certification (whatever the heck that might mean in this context) | Yes | No |
| Manifest verification | Yes | No |
| Format verification | Yes | No |
| Virus check | Yes | No |
| Fixity check | Yes | No |
| “Light archive” capability | No | Yes |

This is a useful comparison because it would indicate what one would have to layer on top of S3 to reach the level of service provided by Digital Archive. For instance, it would be possible to create an application that would perform the manifest and format verifications as well as the periodic virus and fixity checks against the files in S3. It would even be possible to run that application in Amazon’s Elastic Compute Cloud (EC2) — a “virtual computing environment” that allows developers to easily create and deploy software on the internet. Since data transferred between Amazon EC2 and Amazon S3 is free of charge, there wouldn’t be the S3 cost of periodically downloading the data to perform the virus and fixity checks.
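By way of illustration, the fixity-check piece of such a layered application could be as simple as recomputing checksums and comparing them against a stored manifest. A minimal sketch, assuming a manifest that maps file names to checksums (the function names and manifest format here are my own invention, not any OCLC or Amazon interface):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fixity_check(files: dict[str, Path], manifest: dict[str, str]) -> list[str]:
    """Return the names of manifest entries that fail the fixity check.

    A file fails if it is missing from the archive or if its freshly
    computed checksum no longer matches the one recorded in the manifest.
    """
    failures = []
    for name, expected in manifest.items():
        path = files.get(name)
        if path is None or sha256_of(path) != expected:
            failures.append(name)
    return failures
```

In an EC2-hosted version of this, the files would be streamed from S3 rather than read from a local path, taking advantage of the free data transfer between EC2 and S3.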

One advantage to note about the S3 solution is that it can perform as a “light archive” — meaning the data is available to users in addition to being part of the content repository. The OCLC Digital Archive service, in contrast, is a “dark archive”: access to the data is highly or completely restricted. Still, the lack of automated backups and a robust data security infrastructure in S3 is notable from a preservation data service perspective.

Cost Comparison


To examine the similarities and differences in costs, let’s use the OhioLINK Satellite Image collection as a prototypical example. It consists of about 2 terabytes (2TB) of high-quality images in TIFF format, with about 7.5GB of data going into the repository each month. In the interest of exploring everything that S3 can do, there is an assumption that approximately 4GB of data will be transferred out of the archive each month; OCLC’s Digital Archive does not have an end-user dissemination component.

| | OCLC Digital Archive rate | OCLC Digital Archive cost | Amazon S3 rate | Amazon S3 cost |
| --- | --- | --- | --- | --- |
| Setup Cost | redacted | | none | |
| Startup Ingest Cost | redacted | | $0.10/GB into S3 [#1] | $200 |
| Initial Storage Cost | $750/100GB/year [#2] | $15,000/year | $0.15/GB/month | $3,600/year |
| Ongoing Ingest Cost | redacted | | $0.10/GB into S3 [#1] | $9/year |
| Ongoing Storage Cost | $750/100GB/year [#2] | previous year plus $750/year [#3] | $0.15/GB/month | previous year plus $10.80/year [#3] |
| Ongoing Access Cost | Not available | | varies [#1, #4] | $8.16/year |

Note #1: Amazon S3 also adds charges by HTTP request, but those are considered negligible for the data load and the ongoing accesses.

Note #2: As listed in Barbara Quint’s article. Charge is for any part of 100GB used.

Note #3: Additions each year factor in the assumption of adding 90GB/year to the collection.

Note #4: Costs for transfers out of S3 are: $0.17/GB for the first 10TB/month; $0.13/GB for the next 40TB/month; $0.11/GB for the next 100TB/month; and $0.10/GB for outflowing data over 150TB/month.

For this prototypical example, S3 would cost $3,800 in the first year and roughly $3,615 per year after that, with the added benefit that end-users could access the content without using our infrastructure. There are costs associated with the OCLC Digital Archive service that had to be redacted from the public version of this table due to a confidentiality clause, but the costs that are assumed for ongoing storage based on Barbara Quint’s article are comparable to the quote I got from OCLC and represent a large portion of the total yearly costs.
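The S3 side of that arithmetic can be reproduced directly from the published rates. A quick sketch, where the collection sizes are the assumptions stated earlier:

```python
GB_PER_TB = 1000  # decimal terabytes, as storage vendors count them

collection_gb = 2 * GB_PER_TB   # 2TB initial collection
monthly_growth_gb = 7.5         # new data ingested each month
monthly_access_gb = 4           # assumed end-user downloads each month

ingest_rate = 0.10              # $/GB transferred into S3
storage_rate = 0.15             # $/GB/month stored
access_rate = 0.17              # $/GB out, first 10TB/month tier

startup_ingest = collection_gb * ingest_rate            # $200
first_year_storage = collection_gb * storage_rate * 12  # $3,600/year
first_year_total = startup_ingest + first_year_storage  # $3,800

ongoing_ingest = monthly_growth_gb * 12 * ingest_rate   # $9/year
ongoing_access = monthly_access_gb * 12 * access_rate   # $8.16/year
```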

By way of comparison, we are planning the purchase of 50TB of storage this summer for roughly $250K; that is about $5,000/TB. Amortize the cost of the hardware over five years and assume 150% of the purchase price represents maintenance, personnel support, and other factors, and we get $2,500/TB/year. This doesn’t include software costs, so it is comparable to S3 in the functions table above; software would have to be written to verify the manifest and file formats on ingest as well as the monthly fixity and virus scanning. It also represents only one copy of the data; it does not include the duplication across data centers that both Digital Archive and S3 provide.
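The local-storage amortization above works out as follows (a sketch of the arithmetic, using the figures just stated):

```python
purchase_per_tb = 250_000 / 50   # $250K for 50TB = $5,000/TB purchase price
amortization_years = 5
maintenance_factor = 1.5         # maintenance, personnel, etc. = 150% of purchase

# Total cost of ownership per TB, spread over the amortization period.
local_per_tb_year = purchase_per_tb * (1 + maintenance_factor) / amortization_years
print(local_per_tb_year)  # $2,500/TB/year
```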

Conclusions

OCLC’s Digital Archive product goes pretty far down the path of a preservation-worthy archive of digital files. The value-added services, in addition to simply storing and retrieving files, make it as close to a one-stop shop as I’ve seen so far. Whether outsourced digital preservation services make sense, particularly at this price point, remains to be seen. It is hard to make the comparison, since I’m betting that most of us aren’t (yet) doing all of the ongoing activities with digital preservation masters that Digital Archive is doing.

Amazon’s S3 is an inexpensive, network-oriented file hosting service, and as such it doesn’t have many of the features built into it that we would want to see in a preservation archive service. Beyond raw file service, one would need to add layers of software and human activities to perform the functions that Digital Archive provides now.

Looking at OCLC’s Digital Archive and Amazon S3 is almost an apples-to-oranges comparison, both in price and in functionality. Comparing functionality first, S3 is missing critical components of a preservation storage system — namely, rigorous access control and a content backup/restore facility. Comparing costs, though, S3 is dramatically cheaper…and has the benefit of serving up large files to end-users using Amazon’s distributed infrastructure.

It is possible to level the functionality playing field a bit by taking responsibility for the ongoing maintenance of files in the S3 archive, the things that Digital Archive offers as value-added services over raw file storage. An EC2 virtual machine running in Amazon’s infrastructure can perform the virus and fixity scanning. And with good key maintenance (as with passwords, regularly changing the private key and securing it appropriately), S3 could conceivably serve as the offsite copy of content that is also stored offline (e.g., burned to preservation-quality optical media). Again, in this scenario one has to take responsibility for refreshing the offline media and occasionally running comparisons against the S3 offsite copy.
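Those occasional comparisons reduce to checking two checksum manifests against each other: one built from the offline media, one from a listing of the S3 copy. A minimal sketch, assuming each manifest maps file names to checksums (the structure is illustrative, not a real S3 listing format):

```python
def compare_manifests(offline: dict[str, str], remote: dict[str, str]):
    """Report discrepancies between an offline manifest and a remote (S3) one.

    Returns three sorted lists: names missing from the remote copy, names
    missing from the offline copy, and names whose checksums disagree.
    """
    missing_remotely = sorted(set(offline) - set(remote))
    missing_offline = sorted(set(remote) - set(offline))
    mismatched = sorted(
        name for name in set(offline) & set(remote)
        if offline[name] != remote[name]
    )
    return missing_remotely, missing_offline, mismatched
```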

The text was modified to update a link from http://www.oclc.org/us/en/digitalarchive/overview/ to http://web.archive.org/web/20081012181719/http://www.oclc.org/us/en/digitalarchive/overview/ on November 13th, 2012.

The text was modified to update a link from http://www.oclc.org/us/en/digitalarchive/about/commitment/default.htm to http://web.archive.org/web/20110830065826/http://www.oclc.org/us/en/digitalarchive/about/commitment/default.htm on August 22nd, 2013.

Footnotes

  1. S3 uses secret keys — a 40-character password — to verify the identity of the client making the request. If the private key becomes known, anyone on the internet can perform operations as the content owner.
(This post was updated on 21-Aug-2013.)