Last month OCLC announced a new service offering for long-term storage of libraries’ digital collections. Called Digital Archive™, it provides “a secure storage environment for you to easily manage and monitor the health of your master files and digital originals.” Barbara Quint has an article in Information Today called “OCLC Introduces High-Priced Digital Archive Service” in which she compares it to Amazon’s Simple Storage Service (or “S3”), primarily from a cost perspective: “The price for S3 storage at Amazon Web Services is 15 cents a gigabyte a month or $1.80 a year, in comparison to OCLC’s $7.50 a gig.” Barbara also goes into some of the technical differences, but I think it is worthwhile to go into a little more depth on them.
OCLC’s Digital Archive
According to the service overview, Digital Archive is a content hosting service that provides:
- Systems management
- Physical security
- Data security
- Data backups
- Disaster recovery
- ISO 9001 certification
- Manifest verification
- Virus check
- Format verification
- Fixity check
It is targeted towards the preservation of digital masters. There is a document on the Digital Archive website called Our commitment that describes other aspects of a digital preservation program: “OCLC is actively developing processes for full preservation of digital assets to ensure complete renderability, regardless of technology changes. This preservation system will likely involve a combination of migration and emulation.” But it is not clear whether these services, beyond “bit preservation” activities, are part of the Digital Archive service or will be part of an add-on service to be developed later.
This “Digital Archive” is a revamping of an older product from OCLC, also called “Digital Archive” but one that included a web harvesting tools component. The service and support documentation on the OCLC website still refers to the former version of Digital Archive, so there is little information about how the service works beyond what one can infer from the sales information.
Amazon’s S3
Amazon describes S3 as “a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites.” Files are transferred across the internet to Amazon’s services and stored in multiple data centers. Files can be retrieved using standard HTTP mechanisms (the same protocol that powers the web) and are protected by an optional access control mechanism. S3 does have a Service Level Agreement (SLA) that offers guarantees on performance.
However, the SLA seems to extend only to the availability of the service, not to a long-term commitment to keeping track of files on the service:
AWS [Amazon Web Services, LLC] will use commercially reasonable efforts to make Amazon S3 available with a Monthly Uptime Percentage (defined below) of at least 99.9% during any monthly billing cycle (the “Service Commitment”). In the event Amazon S3 does not meet the Service Commitment, you will be eligible to receive a Service Credit as described below.
The S3 SLA makes no specific mention of the permanence of file storage. In lieu of that, one seems to be covered by the overarching Amazon Web Services Customer Agreement, which has several points of interest from a preservation use perspective:
3.3. Termination or Suspension by Us Other Than for Cause.
3.3.2. Paid Services…. We may suspend your right and license to use any or all Paid Services (and any associated Amazon Properties)…, or terminate this Agreement in its entirety (and, accordingly, cease providing all Services to you), for any reason or for no reason, at our discretion at any time by providing you sixty (60) days’ advance notice in accordance with the notice provisions set forth in Section 15 below.
So if they desire to terminate a library’s use of the service (assuming there was no specific cause to do so, such as a violation of the terms of use), they have to give 60 days’ notice. That’s when the “Data Preservation in the Event of Suspension or Termination” clause kicks in:
3.7.2. In the Event of Termination Other Than for Cause. In the event of any termination by us of any Service or any set of Services, or termination of this Agreement in its entirety, other than a for cause termination under Section 3.4.1, (i) we will not take any action to intentionally erase any of your data stored on the Services for a period of thirty (30) days after the effective date of termination; and (ii) your post termination retrieval of data stored on the Services will be conditioned on your payment of Service data storage charges for the period following termination, payment in full of any other amounts due us, and your compliance with terms and conditions we may establish with respect to such data retrieval.
The customer agreement then goes on to say:
3.8. Post-Termination Assistance. Following the suspension or termination of your right to use the Services by us or by you for any reason other than a for cause termination (i.e., a termination under Section 3.2 or under Section 3.3), you shall be entitled to take advantage of any post-termination assistance we may generally make available with respect to the Services, such as data retrieval arrangements we may elect to make available. We may also endeavor to provide you unique post-suspension or post-termination assistance, but we shall be under no obligation to do so. Your right to take advantage of any such assistance, whether generally made available with respect to the Services or made available uniquely to you, shall be conditioned upon your acceptance of and compliance with any fees and terms we specify for such assistance.
Perhaps the most troubling aspect, from a preservation point-of-view, deals with data security and backups. Specifically, Amazon says that data security and backups are the responsibility of the customer. The Amazon Web Services Customer Agreement says (emphasis added):
7.2. Security. We strive to keep Your Content secure, but cannot guarantee that we will be successful at doing so, given the nature of the Internet. Accordingly, without limitation to Section 4.3 above and Section 11.5 below, you acknowledge that you bear sole responsibility for adequate security, protection and backup of Your Content. We strongly encourage you, where available and appropriate, to use encryption technology to protect Your Content from unauthorized access and to routinely archive Your Content. We will have no liability to you for any unauthorized access or use, corruption, deletion, destruction or loss of any of Your Content.
That kind of security and data backup is something you’d want in a preservation service. Since activities against S3 storage are limited only by knowledge of a private “key” [1] (as opposed to limiting access to particular IP addresses or not allowing deletes/modifications from the web at all), it is a real possibility that the archive can be harmed if the private key is disclosed. Furthermore, S3 does not have a backup/restore service for retrieving files that were accidentally or maliciously deleted.
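To make the key-disclosure risk concrete, here is a minimal sketch of shared-secret request signing, loosely modeled on the HMAC-based scheme S3 used at the time. The exact layout of the string being signed, the function name `sign_request`, and the bucket, file, and key values are my assumptions for illustration; Amazon's own documentation is the authority on the real format.

```python
import base64
import hashlib
import hmac


def sign_request(secret_key: str, verb: str, date: str, resource: str,
                 content_md5: str = "", content_type: str = "") -> str:
    """Return a base64-encoded HMAC-SHA1 signature for a request description.

    The string layout below is an assumption for illustration only.
    """
    string_to_sign = "\n".join([verb, content_md5, content_type, date]) + "\n" + resource
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical 40-character secret and object path.
secret = "k" * 40
date = "Tue, 27 Mar 2007 19:36:42 +0000"
resource = "/example-bucket/masters/image-0001.tif"

# Reading and deleting are signed with the same secret, so whoever holds
# the secret can produce a valid signature for either operation.
get_signature = sign_request(secret, "GET", date, resource)
delete_signature = sign_request(secret, "DELETE", date, resource)
print(get_signature)
print(delete_signature)
```

The point of the sketch is that nothing in the signature distinguishes the rightful owner from anyone else who has obtained the secret: a signed DELETE is as easy to produce as a signed GET.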
Feature Comparison
It is useful to compare Amazon’s S3 on a point-by-point basis with OCLC’s Digital Archive service to try to put some meaning behind the cost numbers.
| | OCLC Digital Archive | Amazon S3 |
|---|---|---|
| Systems management | Yes | Yes |
| Physical security | Yes | Yes |
| Data security | Yes | No |
| Data backups | Yes | No |
| Disaster recovery | Yes | unclear |
| ISO 9001 certification | whatever the heck that might mean in this context | |
| Manifest verification | Yes | No |
| Format verification | Yes | No |
| Virus check | Yes | No |
| Fixity check | Yes | No |
| “Light archive” capability | No | Yes |
This is a useful comparison because it would indicate what one would have to layer on top of S3 to reach the level of service provided by Digital Archive. For instance, it would be possible to create an application that would perform the manifest and format verifications as well as the periodic virus and fixity checks against the files in S3. It would even be possible to run that application in Amazon’s Elastic Compute Cloud (EC2) — a “virtual computing environment” that allows developers to easily create and deploy software on the internet. Since data transferred between Amazon EC2 and Amazon S3 is free of charge, there wouldn’t be the S3 cost of periodically downloading the data to perform the virus and fixity checks.
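As a rough idea of what the manifest and fixity part of such an application might look like, here is a minimal sketch. It assumes a manifest that maps file names to SHA-256 checksums recorded at ingest, and it uses an in-memory dictionary as a stand-in for objects read back from S3. The function names and the choice of SHA-256 are my own assumptions, not anything OCLC or Amazon specifies.

```python
import hashlib
from typing import Dict, Iterable, List


def sha256_of(chunks: Iterable[bytes]) -> str:
    """Hash content supplied in chunks, so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest: Dict[str, str], stored: Dict[str, bytes]) -> List[str]:
    """Compare stored content against the manifest.

    Returns a sorted list of problem reports; an empty list means every
    file in the manifest is present and unchanged, with no extra files.
    """
    problems = []
    for name, expected in manifest.items():
        if name not in stored:
            problems.append(f"missing: {name}")
        elif sha256_of([stored[name]]) != expected:
            problems.append(f"checksum mismatch: {name}")
    for name in stored:
        if name not in manifest:
            problems.append(f"not in manifest: {name}")
    return sorted(problems)


# Hypothetical two-file collection; `stored` stands in for data fetched from S3.
originals = {"scene-001.tif": b"first image", "scene-002.tif": b"second image"}
manifest = {name: sha256_of([data]) for name, data in originals.items()}

stored = dict(originals)
stored["scene-002.tif"] = b"second imagE"  # simulate silent corruption

for problem in check_fixity(manifest, stored):
    print(problem)
```

A real version would stream each object from S3 rather than hold it in a dictionary, and it would run on a schedule; virus scanning and format verification would be separate steps layered on the same loop.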
One advantage to note about the S3 solution is that it can perform as a “light archive,” meaning the data is available to users in addition to being part of the content repository. In contrast, the OCLC Digital Archive service is a “dark archive,” in which access to the data is highly or completely restricted. Still, the lack of automated backups and of a robust data security infrastructure in S3 is notable from a preservation data service perspective.
Cost Comparison
To examine the similarities and differences in costs, let’s use the OhioLINK Satellite Image collection as a prototypical example. It consists of about 2 terabytes (2TB) of high-quality images in TIFF format, with about 7.5GB of data going into the repository each month. In the interest of exploring everything that S3 can do, there is an assumption that approximately 4GB of data will be transferred out of the archive each month; OCLC’s Digital Archive does not have an end-user dissemination component.
| | OCLC Digital Archive: Rate | OCLC Digital Archive: Cost | Amazon S3: Rate | Amazon S3: Cost |
|---|---|---|---|---|
| Setup Cost | redacted | redacted | none | none |
| Startup Ingest Cost | redacted | redacted | $0.10/GB into S3 [#1] | $200 |
| Initial Storage Cost | $750/100GB/year [#2] | $15,000/year | $0.15/GB/month | $3,600/year |
| Ongoing Ingest Cost | redacted | redacted | $0.10/GB into S3 [#1] | $9/year |
| Ongoing Storage Cost | $750/100GB/year [#2] | previous year plus $750/year [#3] | $0.15/GB/month | previous year plus $10.80/year [#3] |
| Ongoing Access Cost | Not available | Not available | varies [#1, #4] | $8.16/year |
Note #2: As listed in Barbara Quint’s article. Charge is for any part of 100GB used.
Note #3: Additions each year factor in the assumption of adding 90GB/year to the collection.
Note #4: Costs for transfers out of S3 are: $0.17/GB for the first 10TB/month; $0.13/GB for the next 40TB/month; $0.11/GB for the next 100TB/month; and $0.10/GB for outflowing data over 150TB/month.
For this prototypical example, S3 would cost $3,800 in the first year and roughly $3,615 per year after that, with the added benefit that end-users could access the content without using our infrastructure. There are costs associated with the OCLC Digital Archive service that had to be redacted from the public version of this table due to a confidentiality clause, but the costs that are assumed for ongoing storage based on Barbara Quint’s article are comparable to the quote I got from OCLC and represent a large portion of the total yearly costs.
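For anyone who wants to plug in their own collection size, the S3 side of the table can be sketched as a short calculation. The rates are the ones quoted in the table and in Note #4. The variable and function names are mine, and the sketch assumes 1TB = 1,000GB, which is what the $200 startup ingest figure implies.

```python
GB_PER_TB = 1000  # assumption: the table treats 2TB as 2,000GB

INGEST_RATE = 0.10   # dollars per GB transferred into S3
STORAGE_RATE = 0.15  # dollars per GB per month

# Transfer-out tiers from Note #4: (tier size in GB per month, dollars per GB).
EGRESS_TIERS = [
    (10 * GB_PER_TB, 0.17),
    (40 * GB_PER_TB, 0.13),
    (100 * GB_PER_TB, 0.11),
    (float("inf"), 0.10),
]


def egress_cost(gb_per_month: float) -> float:
    """Monthly cost of transferring gb_per_month out of S3, tier by tier."""
    remaining = gb_per_month
    cost = 0.0
    for tier_size, rate in EGRESS_TIERS:
        used = min(remaining, tier_size)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost


collection_gb = 2 * GB_PER_TB  # initial collection size
monthly_ingest_gb = 7.5        # new data added each month
monthly_access_gb = 4          # data served to end-users each month

startup_ingest = collection_gb * INGEST_RATE
initial_storage = collection_gb * STORAGE_RATE * 12
ongoing_ingest = monthly_ingest_gb * 12 * INGEST_RATE
ongoing_access = egress_cost(monthly_access_gb) * 12

print(f"startup ingest:   ${startup_ingest:,.2f}")
print(f"initial storage:  ${initial_storage:,.2f}/year")
print(f"ongoing ingest:   ${ongoing_ingest:,.2f}/year")
print(f"ongoing access:   ${ongoing_access:,.2f}/year")
print(f"first-year total: ${startup_ingest + initial_storage:,.2f}")
```

The first-year total here counts only the startup ingest and initial storage lines, which is how the $3,800 figure is reached. It does not model the growth in storage charges as the collection expands.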
By way of comparison, we are planning the purchase of 50TB of storage this summer for roughly $250K; that is about $5,000/TB. Amortize the cost of the hardware over five years and assume 150% of the purchase price represents maintenance, personnel support, and other factors, and we get $2,500/TB/year. This doesn’t include software costs, so it is comparable to S3 in the functions table above; software would have to be written to verify the manifest and file formats on ingest as well as the monthly fixity and virus scanning. It also represents only one copy of the data; it does not include the duplication across data centers that both Digital Archive and S3 provide.
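The amortization arithmetic can be written out as a small function. This is one reading of the assumptions that reproduces the $2,500/TB/year figure: the 150% for maintenance, personnel, and other factors is added on top of the purchase price, and the total is spread evenly over five years. The function and parameter names are hypothetical.

```python
def local_cost_per_tb_year(purchase_price: float, capacity_tb: float,
                           years: int, overhead_fraction: float) -> float:
    """Yearly cost per TB for locally purchased storage.

    overhead_fraction is the extra cost (maintenance, personnel, other
    factors) expressed as a fraction of the purchase price.
    """
    hardware_per_tb = purchase_price / capacity_tb
    total_per_tb = hardware_per_tb * (1 + overhead_fraction)
    return total_per_tb / years


# $250K for 50TB, amortized over five years, with 150% overhead.
local = local_cost_per_tb_year(250_000, 50, 5, 1.5)

# S3 storage alone at $0.15/GB/month, assuming 1TB = 1,000GB.
s3 = 0.15 * 1000 * 12

print(f"local storage: ${local:,.0f}/TB/year")
print(f"S3 storage:    ${s3:,.0f}/TB/year")
```

Changing the amortization period or the overhead fraction shifts the local figure considerably, so the comparison is only as good as those two assumptions.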
Conclusions
OCLC’s Digital Archive product goes pretty far down the path of a preservation-worthy archive of digital files. The value-added services, in addition to simply storing and retrieving files, make it as close to a one-stop shop as I’ve seen so far. Whether outsourced digital preservation services make sense, particularly at this price point, remains to be seen. It is hard to make a comparison, since I’m betting that most of us aren’t (yet) doing all of the ongoing activities with digital preservation masters that Digital Archive is doing.
Amazon’s S3 is an inexpensive, network-oriented file hosting service, and as such it doesn’t have many of the features built into it that we would want to see in a preservation archive service. Beyond raw file service, one would need to add layers of software and human activities to perform the functions that Digital Archive provides now.
Looking at OCLC’s Digital Archive and Amazon S3 is almost an apples-to-oranges comparison, both in price and in functionality. Comparing functionality first, S3 is missing critical components of a preservation storage system — namely, rigorous access control and a content backup/restore facility. Comparing costs, though, S3 is dramatically cheaper…and has the benefit of serving up large files to end-users using Amazon’s distributed infrastructure.
It is possible to level the functionality playing field a bit by taking responsibility for the ongoing maintenance of files in the S3 archive, that is, for those things that Digital Archive offers as value-added services over raw file storage. An EC2 virtual machine running in Amazon’s infrastructure can perform the virus and fixity scanning. And with good key maintenance (as with passwords, regularly changing the private key and securing it appropriately), S3 could conceivably hold offsite copies of content that is also stored offline (e.g. burned to preservation-quality optical media). Again, in this scenario one has to take responsibility for refreshing the offline media and occasionally running comparisons against the S3 offsite copy.
Footnotes
- S3 uses a secret key (a 40-character password) to verify the identity of the client making the request. If the private key becomes known, anyone on the internet can perform operations as the content owner.





7 Comments
This is an outstanding job, Peter.
My first question when I read about OCLC’s archive service was how this compared to LOCKSS. That’s not to criticize your piece one bit — just to suggest that someone needs to run those numbers as well.
A good point, Karen — I had not considered comparing it to LOCKSS yet. My first thought, obviously, was S3 as a raw hosting service. A comparison to LOCKSS would be more appropriate — and harder given the fuzzier (perhaps better described as “cooperative” instead) economics.
If anyone else embarks on such an effort, be sure to post a comment here that points to your work. I’m sure we’d all be interested.
wow. your analysis is impressive. i feel very proud to have contributed to its creation. the original piece also went slightly into the issue of buying 2 1-terabyte external hard drives for the cost of 1 year’s worth of a 10th of a terabyte, which oclc sells in chunks. complicated, all these archive choices.
bq
One thing that I see missing from your wonderful analysis is the network costs associated with S3. This may be figured into Barbara’s numbers but if not including at least 1 put and 1 get cost factor for networking then you should at least include some type of cost for ec2 for checking on the data during the course of the year.
Robert, in retrospect, it isn’t entirely clear from the table of cost distributions, but the start-up and ongoing “ingest costs” represent the network transmission costs. For data coming into the S3 service, the user is charged a flat rate of $0.10/GB. The rate for returning content out of the S3 archive varies. Outputting up to 10TB per month (which would cover outputting the entire 2TB prototypical content discussed in the article) costs 17 cents per GB; the cost for 2TB would be $340/month. If you tested the content once a month by downloading it to a local server, that would be $4,080 per year.
Costs for using an Amazon EC2 virtual machine are harder to calculate, and probably can’t be accomplished without a little bit of experimentation. That would involve actually building an EC2 machine to do the virus and fixity testing.
Thanks for the comments.
[DLTJ editor's note: With Barbara Quint's permission, I'm posting a comment received as e-mail here along with my reply.]
both amazon and oclc had put/get costs as i recall, but i didn’t compare them. have to suspect that amazon’s were probably lower.
about your analysis, though, you describe oclc’s digital archive as complete, including end-user access. i don’t think so. it’s for archival copies (hi-res) only and, as far as i know, accessible to archive managers. they expect enduser access to come through their contentDM service to which libraries subscribe separately. the contentDM subscribers don’t even get a special rate from oclc for the new digital archive service.
and as for putting formatting issues on the user/subscriber, oclc was quite clear that the archivist sending them the files was responsible for any changes in preservation formatting. at present, they didn’t even have a forum for warning people that their formats are getting obsolete, although there was a glimmer of interest when i suggested it as a good idea. i got the feeling that it would only happen if the service sold big enough to develop its own forum.
what i didn’t look into is whether other amazon web services could supply tools that would appeal to archivists. package together a number of their services and you might get a better deal all around and probably still at a lower price.
What I had intended to write was that OCLC’s Digital Archive service is “complete” in terms of a traditional dark archive service that may be sought by an archivist. (It occurs to me, though, that we don’t have much of a tradition of dark archive digital services, so that may be a bit of an oxymoron.) Amazon S3 could be considered an “incomplete” dark archive service (no inherent backup or disaster recovery assurance) with a “light archive” add-on.
It is true that OCLC’s Digital Archive service does not allow for end-user access to content out of the Digital Archive. For archive managers, there is a cost associated with retrieving bulk batches of content out of the archive. I did not include the latter in the analysis since it was outside the bounds of what we had envisioned the Digital Archive could be used for. But I’ll agree that it is an important aspect to look at.
I don’t think there are other Amazon Web Services that by themselves would inherently be appealing to a digital archive service, but the ability to build such tools in the Amazon EC2 virtual computing cloud is possible, if only theoretical at this point.