With Mastodon on the Rise, Who Archives the Digital Public Square?

Image generated from the DALL·E prompt: photorealistic waves of twitter logos and mastodon logos crashing onto a sandy beach

Much has been made of the differences between Twitter and Mastodon: the challenge of finding a home for your account (and the corresponding differences between your “local” timeline and your “global” timeline), the intentional antiviral design choices (no quote-tweets and a narrow search system), and the more-empowering block and mute features. A recent article in MIT Technology Review about the potential loss to history if Twitter goes away had me thinking of one more difference: a Mastodon-filled world changes expectations for archiving this kind of primary source material.

Think Bigger Than Mastodon

Let's set some common ground. "Mastodon" is being used here as a shortcut for the growing federation of servers that follow the ActivityPub protocol (the "fediverse"). Most people caught up in the migration away from Twitter are looking for a "Twitter-equivalent", and the option that has caught the popular imagination is Mastodon. When we consider the fediverse as a digital public square, we could just as well be talking about Mastodon forks like Hometown. We should also include the genre-specific ActivityPub software like Pixelfed (for photographers; I'm there), Bookwyrm (for book groups and reader commentary; I'm there too), Funkwhale (for music), and write.as (for long-form articles). Although Mastodon is getting the most traction right now, the question of archiving the digital public square is bigger than just Mastodon, so keep that in mind as you read below.

Twitter Archiving Challenges

As the MIT Technology Review article points out, there are challenges to archiving Twitter.

For eight years, the US Library of Congress took it upon itself to maintain a public record of all tweets, but it stopped in 2018, instead selecting only a small number of accounts’ posts to capture. “It never, ever worked,” says William Kilbride, executive director of the Digital Preservation Coalition. The data the library was expected to store was too vast, the volume coming out of the firehose too great. “Let me put that in context: it’s the Library of Congress. They had some of the best expertise on this topic. If the Library of Congress can’t do it, that tells you something quite important,” he says.

The challenges include that of scale:

[In January 2013] We now have an archive of approximately 170 billion tweets and growing. The volume of tweets the Library receives each day has grown from 140 million beginning in February 2011 to nearly half a billion tweets each day as of October 2012. - Update on the Twitter Archive at the Library of Congress, Library of Congress blog, January 2013.

And also of scope—the Library does not receive the multimedia parts of tweets. As the whitepaper attached to the Update on the Twitter Archive at the Library of Congress says:

The Library only receives text. It does not receive images, videos or linked content. Tweets now are often more visual than textual, limiting the value of text-only collecting.

Both points speak to the changing nature of Twitter, from its origins as an extension of text messaging geared towards a U.S. audience to a world-wide multimedia platform. Michael Zimmer wrote in great detail about these challenges and the issues of processing, privacy, and user consent for First Monday in 2015. The donor agreement between Twitter and the Library of Congress is also silent on the matters of privacy and user consent.

On December 26, 2017, the Library of Congress announced that it was no longer collecting a comprehensive archive of tweets as of January 1, 2018. What is at the Library now has known limitations in its comprehensiveness, and we may not see open access to that archive in our lifetime because of privacy concerns.

The MIT Technology Review article talks about the loss to historians, human rights lawyers, and researchers using "open source intelligence" — that which is openly published in the digital public square. Given that we are facing this moment of reckoning as Twitter may be on the brink of disappearing and people are finding community on Mastodon, should we consider an explicit archiving role for the fediverse?

Mastodon Archiving Challenges

With Twitter's recent upheaval and the migration to Mastodon, I've seen mentions of how Twitter was unique to its time. When Twitter launched in 2006, the only way to interact with it was through text messages. Apple would introduce the iPhone the following year, and it was another year before an app for iPhone would launch. Twitter's growth was jumpstarted by the influx of users at the 2007 South-by-Southwest (SXSW) conference as attendees publicly shared their experiences in real time in a way they could not have previously. The combination of an experience that straddled mobile and desktop devices and the ability of the company to scale to meet the demand made this Twitter's moment, a moment that it ran with for the next 15 years.

Mastodon is different. Conceptually, there isn't one "Mastodon" (like there is one "Twitter"); there are many little Mastodons that have a standard way of talking to each other. (Yes, this is where the "ActivityPub" standard becomes key.) And crucially, these many little Mastodons are run by individual users and organizations. We witnessed firsthand the difficulties these Mastodons had in scaling to meet the demand from the outflow of Twitter users. (Many of the larger Mastodon instances halted or greatly limited new user registrations in November 2022.)

Now consider what would be needed to construct a "Mastodon Digital Archive" similar in scope to the timeline of tweets that Twitter donated to the Library of Congress. At the very least, it would mean contacting each of these Mastodon instances to get copies of their databases and feeds of ongoing posts. And even if there were a mechanism to do that, internet users are now more aware of their rights to their digital content (or at least savvier about their digital footprint); some sort of user consent would likely be needed.
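To make the shape of that problem concrete, here is a minimal sketch in Python of the aggregation step, assuming the feeds have already been collected from each instance. Everything in it is hypothetical: the function name, the simplified post dictionaries (loosely modeled on the `uri`, `account`, and `created_at` fields of a Mastodon status), and the idea of an opt-in list of accounts are illustrations, not an existing tool or anyone's actual archiving design.

```python
def merge_consented_posts(feeds, opted_in):
    """Merge per-instance feeds into one archive of opted-in authors' posts.

    feeds:    dict mapping an instance name to a list of post dicts
              (hypothetical, simplified shape: uri, account, created_at, content)
    opted_in: set of account handles whose owners consented to archiving
              (an assumed consent mechanism; no such registry exists today)
    """
    archive = {}
    for instance, posts in feeds.items():
        for post in posts:
            # Skip anything from an author who has not opted in.
            if post["account"] not in opted_in:
                continue
            # Federation means the same post shows up on many instances;
            # its URI identifies it, so keep a single copy per URI.
            entry = archive.get(post["uri"])
            if entry is None:
                entry = {**post, "seen_on": []}
                archive[post["uri"]] = entry
            entry["seen_on"].append(instance)
    # Timestamps are assumed to be ISO 8601 strings in UTC with the same
    # format, so sorting them as plain strings is chronological.
    return sorted(archive.values(), key=lambda p: p["created_at"])
```

Even in this toy form, the two hard parts are visible: someone has to supply `feeds` for every instance, and someone has to maintain `opted_in`. Neither has an obvious owner in a federation with no central point of aggregation.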

Inherent in the structure of independent Mastodon instances is the fact that there isn't a central point of aggregation, and that is seen by the broader community as a good thing. (The most common reason I've heard is that the lack of a search tool makes finding the discussion of controversial topics harder and decreases the likelihood of bad actors "dogpiling" into a conversation.) There have been attempts to aggregate content for a search engine, but Mastodon administrators quickly ban those kinds of ActivityPub peers. Creating an archive of Mastodon posts will likely run into the same issues.

Do We Want a Digital Public Square Archive?

Let's step further back: should there be an archive of the digital public square? Physical public squares don't have comprehensive archives. The fact that a digital public square is made up of ones and zeros in files and databases makes it at least conceivable. (Setting aside the technical challenges that the Library of Congress faced with the Twitter archive; with progress in technology and techniques, having such an archive will likely be technically possible at some point.) As the MIT Technology Review article points out, there are benefits to such an archive. Perhaps archivists and historians can help aim us toward ideas that make sense for this new public space.