Improving browser-based web archiving with standards and design research

By Cade Diehm, Ed Summers, and Lorena Ramírez-López

After 30 years, we can see how much the web and its users have grown and evolved. Web archiving, “the process of collecting portions of the World Wide Web”, must adapt its technology, tools and workflows to evolve with the web, so that these preserved collections remain accessible and usable in an archival format.

The Webrecorder project has been working to both broaden and deepen web archiving practice by allowing everyday users of the web to create and share high-fidelity archives of web content using their web browser. We do this with our suite of open-source tools: ArchiveWeb.page, ReplayWeb.page, pywb, Browsertrix Crawler and many others that can be found in our Tools section.

Thanks to support from the Filecoin Foundation’s Open Source Development Grant, our team is working to improve browser support for web archives through three new streams of work: standardization of our new format, Web Archive Collection Zipped (WACZ); design research; and the beginning of integration of this work into existing tools.

The main value of this project is to create a formally standardized approach to creating, storing and accessing web archives on p2p/decentralized systems. As part of these efforts, Webrecorder has partnered on design research with the New Design Congress, a research organization that recognizes all infrastructure as expressions of power, and sees interfaces and technologies as social, economic, political and ecological accelerants. For browser integration, we have partnered with Mauve Software, the Agregore integration developer.

While we hope that our tools will further users’ ability to safely and easily create browser-based high-fidelity web archives, store them on decentralized systems, and efficiently access them, it’s important to remember that tools and even file formats can change in unanticipated ways.

Web archiving remains a niche discipline, but a profoundly important one. The preservation and curation of digital material for cultural, legal or historical reasons is just as crucial as that of its physical equivalents. Both within the US and internationally, the landscape for tooling is small. Almost all archive systems rely on just a handful of base tools to capture and maintain their collections. As the new decade begins, we have new challenges to consider. Gone are the days of technological optimism; as tool builders, we must acknowledge the challenges of what we make, and how those challenges evolve and change across regions, over time and under unexpected outside influences.

In response, the attitudes of tool-builders, policy makers, and infrastructure designers have not kept pace. Far too often in digital spaces, hasty and purportedly ethical answers are offered at scale in response to structural harms before the actual underlying problem is fully identified or its complexities accounted for. Perhaps unsurprisingly, there is little publicly available research or critical evaluation of the existing beliefs and practices of web archiving and how they manifest consequences for those who are involved with, subjected to or interact with the web archiving process.

Through the WACZ specification, we aim to produce a collection of deeper understandings of these threats, together with proactive recommendations that will both make WACZ a more resilient format and provide real, tangible examples of responding to the challenges inherent in designing and building digital tools.

As part of our milestones for this project, we’ve already begun gathering use cases, as well as threats and anti-use cases, on our GitHub. We took the conversation public on our last two community calls in November and December of 2021. Over the coming weeks, we’ll be conducting a series of interviews with archivists, journalists and researchers, with the specific goals of both getting a better sense of individual and institutional archival practices and uncovering deeper concerns specific to archival tools. Each interview will take approximately 1 to 1¼ hours to complete. We will use these contributions to help inform a set of design recommendations for Webrecorder that are more resilient to the effects of weaponised design and other threats.

Do you web archive, and want to help? Get in touch or fill out our Google Form.