Digital Repository Software: How Far Have We Come? How Far Do We Have to Go?


Bryan Brown’s tweet led me to Ruth Kitchin Tillman’s Repository Ouroboros post about the treadmill of software development/deployment. And wow do I have thoughts and feelings.

[Image: an ouroboros (a snake or dragon biting its own tail), a digital representation of a copy of a 1478 drawing by Theodoros Pelecanos from an alchemical tract attributed to Synesius.]

Ouroboros: an ancient symbol depicting a serpent or dragon eating its own tail. Or, in this context, constantly chasing what you can never have. Source: Wikipedia

Let’s start with feelings. I feel the pain and misery in Ruth’s post. As Bryan said in a subsequent tweet, I’ve been on both sides: the system maintainer watching much-needed features put off until major software updates (or rewrites), and the person participating in decisions to put off feature development in favor of those major updates and rewrites. It is a bit like a serpent chasing its tail (the “ouroboros” of Ruth’s post title): as someone who just wants a workable, running system, I’m on a never-ending quest to get what my users need.

I think it will get better. I offer as evidence the fact that almost all of us can now assume network connectivity. That certainly wasn’t always the case: routers used to break, file servers would crash under stress, and network drivers would go out of date at inopportune times. Now we take network connectivity for granted—almost (almost!) as if it were a utility as common as water and electricity. We no longer have to chase our tails; we can simply assume those things.

When we make those assumptions, we push that technology down the stack and layer on new things. Only after electricity is reliable can we layer on network connectivity. With reliable network connectivity, we layer on—say—digital repositories. Each layer goes through its own refinement process…getting better and better as it relies on the layers below it.

Are digital repositories as reliable as printed books? No way! Without electricity and network connectivity, we can’t have digital repositories but we can still use books. Will there come a time when digital repositories are as reliable as electricity and network connectivity? That sounds like a Star Trek world, but if history is our guide, I think the profession will get there. (I’m not necessarily saying I’ll get there with it—such reliability is probably outside my professional lifetime.) So, yeah, I feel pain and misery in Ruth’s post about the achingly out-of-reach nature of repository software that can be pushed down the stack…that can be assumed to exist with all of the capabilities that our users need.

That brings me around to one of Bryan’s tweets:

> Can digital repositories really be trusted in-and-of-themselves?

No. (Not yet?)

That isn’t to say that steps aren’t being made. Take, for example, HTTP and HTML. Those are getting pretty darn reliable, and assumptions can be built that rely on HTML as a markup language and HTTP as a protocol to move it around the network. I think that is a driver behind the growth of “static websites”—systems that rely on nothing more than delivering HTML and other files over HTTP. The infrastructure for doing that—servers, browsers, caching, network connectivity, etc.—is all pretty sound. HTML and HTTP have also stood the test of time—much like how we assume we will always understand how to process TIFF files for images.
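As a small illustration of that last point, here is a sketch of how little it takes to deliver a folder of static files over HTTP using nothing but the Python standard library. The folder name and port are arbitrary examples, not anything this blog actually runs:

```python
# Serve a folder of pre-built HTML over HTTP with only the standard library.
# "public_html" and port 8000 are placeholder choices for illustration.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = partial(SimpleHTTPRequestHandler, directory="public_html")
HTTPServer(("", 8000), handler).serve_forever()
```

You wouldn’t run a production site this way, but it makes the point: everything the browser needs is already sitting on disk as plain files.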

Now there are many ways to generate static websites. This blog uses Markdown text files and Jekyll as a pre-processor to create a stand-alone folder of HTML and supporting files. A more sophisticated method might use Drupal as a content management system that exports to a static site. Jekyll and Drupal are nowhere near as assumed-to-work as HTML and HTTP, but they work well as mechanisms for generating a static site. Last year, colleagues from the University of Iowa published a paper in the Code4Lib Journal about making a static-site front end to CONTENTdm, which could be the basis for developing a digital collection website. So if your digital repository creates HTML to be served over HTTP and—for the purposes of preservation—the metadata can be encoded in HTML structures that are readily machine-processable? Well, then you might be getting pretty close to a system you can trust.
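To make that last question more concrete, here is a rough sketch (my own illustration, not anything from the Iowa paper) of what “readily machine-processable” metadata in HTML could look like: Dublin Core elements carried in ordinary `<meta>` tags and harvested with the Python standard library. The element names and the sample record are made up:

```python
# A sketch of metadata encoded in HTML that any HTML parser can recover,
# independent of whatever software generated the page. The Dublin Core-style
# <meta> names and the sample record below are hypothetical.
from html.parser import HTMLParser

SAMPLE_PAGE = """
<html><head>
  <meta name="DC.title" content="Aerial photograph of campus, 1948">
  <meta name="DC.identifier" content="https://example.org/objects/1234">
  <meta name="DC.format" content="image/tiff">
</head><body>...</body></html>
"""

class DublinCoreHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name", ""), attrs.get("content")
        if name.startswith("DC.") and content:
            self.record[name] = content

harvester = DublinCoreHarvester()
harvester.feed(SAMPLE_PAGE)
print(harvester.record)
# {'DC.title': 'Aerial photograph of campus, 1948', 'DC.identifier': ..., 'DC.format': ...}
```

Nothing in that little harvester knows or cares which repository, CMS, or static site generator produced the page; the metadata survives as long as the HTML does.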

But what about the digital objects themselves? Back in 2006, I crowed about the ability of Fedora repository software to recover itself based just on the files stored on disk. (Read the article for more details…it has the title “Why Fedora? Because You Don’t Need Fedora” in case that might make it more enticing to read.) Fedora used a bespoke method of saving digital objects as a series of files on disk, and the repository software provided commands to rebuild the repository database from those files. That worked for Fedora up to version 3. For Fedora version 4, some of the key object metadata only existed in the repository database. From what I understand of version 5 and beyond, Fedora adopted the Oxford Common File Layout (OCFL), “an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner.” The OCFL website goes on to say: “It is designed to promote long-term object management best practices within digital repositories.” So Fedora is back in a state where you could rebuild the digital object repository system from a simple filesystem backup. The repository software becomes a way of optimizing access to the underlying digital objects. Will OCFL stand the test of time like HTML, HTTP, TIFF, network connectivity, and electricity? Only time will tell.
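To sketch why that rebuild is possible (this is only an illustration of the OCFL idea, not Fedora’s actual rebuild command), every OCFL object carries an inventory.json that maps the content files on disk to the logical files of each version. Walking a storage root and reading those inventories is enough to reconstruct a basic index; the storage root path below is hypothetical, and real layouts add details this skips:

```python
# Rebuild a minimal index of an OCFL storage root from the filesystem alone.
# "/var/ocfl" is a hypothetical path; error handling and OCFL niceties
# (fixity checking, storage-root layout extensions) are omitted.
import json
from pathlib import Path

def head_version_files(object_root: Path) -> dict:
    """Map an object's head-version logical paths to content files on disk."""
    inventory = json.loads((object_root / "inventory.json").read_text())
    manifest = inventory["manifest"]                            # digest -> content paths
    state = inventory["versions"][inventory["head"]]["state"]   # digest -> logical paths
    files = {}
    for digest, logical_paths in state.items():
        for logical_path in logical_paths:
            files[logical_path] = str(object_root / manifest[digest][0])
    return files

storage_root = Path("/var/ocfl")
index = {}
# Every OCFL 1.0 object root is marked with a "0=ocfl_object_1.0" file.
for marker in storage_root.rglob("0=ocfl_object_1.0"):
    object_root = marker.parent
    object_id = json.loads((object_root / "inventory.json").read_text())["id"]
    index[object_id] = head_version_files(object_root)
```

Seen this way, the database the repository software keeps is just a performance optimization; the files and their inventories are the system of record.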

So I think we are getting closer. It is possible to conceive of a system that uses simple files and directories as long-term preservation storage. Those can be backed up and duplicated using a wide variety of operating systems and tools. We also have examples of static HTML sites, delivered over HTTP, that various tools can create and that many, many programs can deliver and render. We’re missing some key capabilities—access control comes to mind. I, for one, am not ready to push JavaScript very far down our stack of technologies—certainly not as far as HTML—but JavaScript robustness seems to be getting better over time.

Ruth: I’m sorry this isn’t easy and that software creators keep moving the goalposts. (I’ll put myself in the “software creator” category.) We could be better at setting expectations and delivering on them. (There is probably another lengthy blog post in how software development is more “art” than it is “engineering”.) Developers—the ones fortunate to have the ability and permission to think long term—are trying to make new tools/techniques good enough to push down the stack of assumed technologies. We’re clearly not there for digital repository software, but…hopefully…we are moving in the right direction.