Web Archives on, of, and off the Web

By Ed Summers

This is a guest post by Ed Summers, who is working with Webrecorder as a technical writer and designer on the WACZ spec. The post is cross-posted from his blog at: https://inkdroid.org/2021/11/24/wacz/

Last month Webrecorder announced a new effort to improve browser support for web archives by initiating three new streams of work: standardization, design research and browser integration. They are soliciting use cases for the Web Archive Collection Zipped (WACZ) format, which could be of interest if you use, create or publish web archives…or develop tools to support those activities.

Webrecorder’s next community call will include a discussion of these use cases as well as upcoming design research that is being run by New Design Congress. NDC specialize in thinking critically about design, especially with regard to how technical systems encode power, and how designs can be weaponized. I think this conversation could be of interest to people working adjacent to the web archiving field, who want to better understand strategies for designing technology for the web we have, but don’t necessarily always want.

I’m helping out by doing a bit of technical writing to support this work and thought I would jot down some notes about why I’m excited to be part of the project, and why I think WACZ is an important development for web archives.

So what is WACZ and why do we need another standard for web archives? Before answering that let’s take a quick look at the web archiving standards that we already have.

Since 2009, WARC (ISO 28500) has been the canonical file format for saving content collected from the web. In addition to persisting the payload content, WARC allows essential metadata to be recorded, such as the HTTP request and response headers that document when and how web resources were retrieved, as well as information about how the content was crawled. ISO 28500 kicked off a decade of innovation that has resulted in the emergence of non-profit and commercial web archiving services, as well as a host of crawling, indexing and playback tools.
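
To make this concrete, here is a minimal sketch of reading WARC records with warcio, a Python library from the Webrecorder project (the filename is a placeholder):

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over the records in a WARC file and, for each HTTP response,
# print some of the metadata that WARC preserves alongside the payload.
# "example.warc.gz" is a placeholder filename.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(
                record.rec_headers.get_header("WARC-Target-URI"),
                record.rec_headers.get_header("WARC-Date"),
                record.http_headers.get_statuscode(),
            )
```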

In 2013, after three years of development, the Memento protocol was published as RFC 7089 at the IETF. Memento provides a uniform way to discover and retrieve previous versions of web resources using the web’s own protocol, HTTP. Memento is now supported in major web archive replay tools such as OpenWayback and pywb, as well as in services such as the Internet Archive, Archive-It, archive.today, Perma.cc, and cultural heritage organizations around the world. Memento adoption has made it possible to develop services like MemGator that search across many archives to see which might have a snapshot of a specific page, as well as software extensions that bring a versioned web to content management systems like MediaWiki.
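
To give a feel for the protocol, here is a minimal sketch of Memento datetime negotiation using Python’s requests library. The Time Travel aggregator’s TimeGate is used as an example endpoint; any Memento-compliant archive follows the same pattern:

```python
import requests

# Ask a TimeGate for the memento closest to a desired datetime (RFC 7089).
resp = requests.get(
    "http://timetravel.mementoweb.org/timegate/https://example.com/",
    headers={"Accept-Datetime": "Tue, 01 Jan 2019 00:00:00 GMT"},
    allow_redirects=False,
)

# A TimeGate typically answers with a redirect to the selected memento,
# along with Link headers pointing at the original resource and a TimeMap.
print(resp.status_code)              # e.g. 302
print(resp.headers.get("Location"))  # URI of the selected memento
print(resp.headers.get("Link"))      # rel="original", rel="timemap", ...
```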

More recently, the Web Archiving Systems API (WASAPI) specification was developed to allow customers of web archiving services like Archive-It and LOCKSS to itemize and download the WARC data that makes up their collections. This allows customers to automate the replication of their remote web archive data, for backup and access outside of those services.
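
As a sketch of what that looks like in practice, here is how a client might list its WARC files using Archive-It’s WASAPI endpoint; the credentials and collection id are placeholders, and the field names follow the WASAPI data transport specification:

```python
import requests

# List the WARC files in a collection via WASAPI. The credentials and
# collection id are placeholders for a real Archive-It account.
resp = requests.get(
    "https://warcs.archive-it.org/wasapi/v1/webdata",
    params={"collection": 12345},
    auth=("username", "password"),
)
resp.raise_for_status()

# The response is JSON with a "files" list; each entry includes a
# filename, checksums, and one or more download locations.
for f in resp.json().get("files", []):
    print(f["filename"], f["locations"][0])
```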

So, if we have standards for writing, accessing and replicating web archives what more do we need?

One constant running through these standards is the infrastructure needed to implement them. Creating, storing, serving and maintaining WARC data with Memento and WASAPI usually requires managing complex software and server infrastructure. In many ways web archives are similar to the brick-and-mortar institutions that preceded them, where only “the most powerful, the richest elements in society have the greatest capacity to find documents, preserve them, and decide what is or is not available to the public” (Zinn, 1977). This was meant as a critique in 1977, and it remains valid today. But really it is a simple observation about the resources that are often needed to create authoritative and persistent repositories of any kind.

The Webrecorder project is working to both broaden and deepen web archiving practice by allowing everyday users of the web to create and share high-fidelity archives of web content using their web browser. Initial work on WACZ v1.0 began during the development of ArchiveWeb.page and ReplayWeb.page, which are client-side JavaScript applications for creating and sharing archived web content. That’s right, they run directly on your computer, using your browser, and don’t require servers or services running in the cloud (apart from the websites you are archiving).

You can think of a WACZ as a predictable package for collected WARC data that includes an index to that content, as well as metadata that describes what can be found in that particular collection. Using the well-understood and widely deployed ZIP format means that a WACZ file can be placed directly on the web as a single file, and because individual ZIP entries can be fetched with HTTP range requests, archived web pages can be read from the archive on demand without retrieving the entire file or implementing a server-side API like Memento.
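
For example, here is a sketch of peeking inside a WACZ using nothing but Python’s standard library; the filename is a placeholder, and the layout shown in the comments follows the draft spec:

```python
import zipfile

# List the contents of a WACZ file. A typical layout, per the draft spec:
#   datapackage.json      - Frictionless Data descriptor with collection metadata
#   archive/data.warc.gz  - the collected WARC data
#   indexes/index.cdx.gz  - CDXJ index for random access into the WARCs
#   pages/pages.jsonl     - the list of archived pages
# "example.wacz" is a placeholder filename.
with zipfile.ZipFile("example.wacz") as wacz:
    for name in wacz.namelist():
        print(name)
```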

WACZ, and WACZ-enabled tools, will be a game changer for sharing web archives, because they make web archive data into a media type for the web. A WACZ file can be moved from place to place as a simple file, and viewing and interacting with it requires no complex server-side cloud services, just your browser.

It’s important to remember that games can change in unanticipated ways, and that this is an important moment to think critically about the use cases a technology like WACZ will be enabling. You can see some of these threats starting to be documented in the WACZ spec repository alongside the standard use cases. These threats are just as important to document as the desired use cases; perhaps they are even more consequential. Recognizing threats helps to delineate the positionality of a project like Webrecorder, and recognizes that specifications and their implementations are not neutral, just like the archives that they make possible.

However, it’s important to open up the conversation around WACZ because there are potentially other benefits to having a standard for packaging up web archive data that are not necessarily exclusive to the ArchiveWeb.page and ReplayWeb.page applications. For example:

  1. Traditional web archives (perhaps even non-public ones) might want to make data exports available to their users.
  2. It might be useful to be able to package up archived web content so that it can be displayed in content management systems like WordPress, Drupal or Omeka.
  3. A WACZ could be cryptographically signed to allow data to be delivered and made accessible for evidentiary purposes.
  4. Community archivists and other memory workers could collaborate on collections of web content from social media platforms that are made available on their collective’s website.
  5. Using a standard like Frictionless Data could allow WACZ metadata to be simple to create, use and reuse in different contexts as data (see the sketch after this list).
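
On that last point: WACZ already describes its contents with a Frictionless Data descriptor, datapackage.json. Here is a sketch of reading it, with field names drawn from the draft spec and Frictionless Data conventions (treat them as illustrative):

```python
import json
import zipfile

# Read the Frictionless Data descriptor from a WACZ and list the
# resources it declares. "example.wacz" is a placeholder filename.
with zipfile.ZipFile("example.wacz") as wacz:
    package = json.loads(wacz.read("datapackage.json"))

print(package.get("title"), package.get("wacz_version"))
for resource in package.get("resources", []):
    print(resource.get("path"), resource.get("bytes"), resource.get("hash"))
```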

Webrecorder are convening an initial conversation about this work at their November community call. I hope to see you there! If you’d rather jump right in and submit a use case you can use the GitHub issue tracker, which has a template to help you. Or, if you prefer, you can also send your idea to info [at] webrecorder.net.