What Is Web Archiving? • Webrecorder Blog

Brief History of Web Archiving

The World Wide Web was created in 1989, with web archiving efforts beginning in the late 1990s. During the early 2000s, major institutions and organizations started building their own web archives and forming collaborative groups to establish and develop best practices. In 2003, the International Internet Preservation Consortium (IIPC), an international group of libraries and other institutions, was established to coordinate efforts to preserve internet content for the future. It continues to spearhead international conversations about developing web archiving practices.

In 2014, the first iteration of Webrecorder was started. Initially just a browser-based interactive web archiving tool, the set of tools built under the Webrecorder name expanded to cover more use cases, and in 2021 work began on Browsertrix, which is now our most complete and advanced web archiving platform.

General Concepts

Most practitioners in the community agree that web archiving involves developing a clear strategy and workflow for selecting content, capturing parts of the web, and storing that content in a standardized format for long-term preservation. The goal is to keep it accessible and readable in the future, even if the original website goes offline.

Web archiving is a multi-step process within the broader field of digital preservation. It’s vital for safeguarding our digital heritage, culture, and society, especially given the constantly changing nature of the internet. National libraries, galleries, museums, governments, private and public companies, and individuals all archive websites. Organizations typically collect content according to their web archiving policy, a guiding set of principles that dictate the scope, subject, metadata, retention policy, and best practices for how content should be archived. Individuals can archive content that is meaningful to their own interests, research, or projects.

A Technical Introduction to Web Archiving

Websites are composed of a few key components, including:

the structure of the page (HTML)
styling information telling the browser how to render it (CSS)
code that the browser should execute (JavaScript)

The page might also include media such as images, audio, and videos. When you visit a website for the first time, your browser will download all the files it needs from a server (or often, many different servers), parse them, and use their contents to lay out and render the web page. Web browsers usually don’t keep these files for long though. Apart from a temporary cache, they re-download them when you load the page again.

Diagram showing image assets being downloaded from the web onto a laptop
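To make the idea of "all the files it needs" more concrete, here is a minimal sketch in Python that lists the extra resources a page's HTML asks the browser to fetch. It uses only the standard library, and the sample page, the URLs, and the names `ResourceCollector` and `list_page_resources` are made up for illustration. A real browser discovers many more resources than this (fonts, images referenced from CSS, anything requested by JavaScript), so treat it as a simplification rather than how any archiving tool actually works.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collects the URLs of files referenced directly in a page's HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Scripts and media point at their files with a "src" attribute.
        if tag in ("img", "script", "audio", "video", "source") and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        # Stylesheets are referenced with <link rel="stylesheet" href="...">.
        elif tag == "link" and attrs.get("href"):
            if "stylesheet" in (attrs.get("rel") or "").lower():
                self.resources.append(urljoin(self.base_url, attrs["href"]))


def list_page_resources(html, base_url):
    """Return absolute URLs for the resources referenced in `html`."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources


# A made-up page that pulls files from two different servers.
SAMPLE_HTML = """
<html>
  <head>
    <link rel="stylesheet" href="/css/site.css">
    <script src="https://cdn.example.org/app.js"></script>
  </head>
  <body>
    <img src="images/logo.png" alt="Logo">
  </body>
</html>
"""

for url in list_page_resources(SAMPLE_HTML, "https://example.com/blog/"):
    print(url)
```

Each URL in that list is a separate download, and an archive of the page needs a copy of every one of them.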

This works well for serving up-to-date information, but if we want to preserve the website we need to save all these files somewhere we can easily reassemble, at a later time, everything the browser loaded. Though there are a variety of conventions for how this is done, most web archiving software uses either WARC or WACZ files to store this archived data.

WARC & WACZ Files

WACZ is a media type that allows web archive collections to be packaged and shared on the web as a discrete file. Developed by Webrecorder in 2021, a WACZ file includes all the data needed to render archived content, as well as the contextual information users need to interpret it. As of May 2024, the Library of Congress has acknowledged WACZ as a sustainable digital format. WACZ files function as ZIP files with a custom extension, and contain WARC files alongside additional indexing metadata that enables faster replay. You can open a WACZ in any unarchiving program such as 7-Zip or Keka to reveal its component WARC files.
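Because a WACZ is a ZIP file underneath, you can also peek inside one with a few lines of code. The sketch below uses Python's standard `zipfile` module. To stay self-contained it first builds a tiny stand-in archive in memory. The file names inside it follow our understanding of the typical WACZ layout, but the contents are empty placeholders, so this is not a valid web archive, and the helper names `build_sample_wacz` and `list_warc_members` are hypothetical.

```python
import io
import json
import zipfile


def build_sample_wacz():
    """Build an in-memory ZIP laid out like a WACZ (placeholder contents only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("datapackage.json", json.dumps({"profile": "data-package"}))
        zf.writestr("archive/data.warc.gz", b"")  # where the WARC data would live
        zf.writestr("indexes/index.cdx", "")      # index used for faster replay
        zf.writestr("pages/pages.jsonl", "")      # list of captured pages
    return buffer.getvalue()


def list_warc_members(wacz_bytes):
    """Return the names of the WARC files stored inside a WACZ."""
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as zf:
        return [
            name
            for name in zf.namelist()
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
        ]


sample = build_sample_wacz()
print(list_warc_members(sample))
```

To try this on a real file, read its bytes with `open("my-archive.wacz", "rb").read()` (the file name here is just an example) and pass them to `list_warc_members`.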

WARC is an open format for web archive data, standardized as ISO 28500. Every time an archiving program downloads a file from the web, it adds the file as a record to the WARC file, along with additional header information such as where it was downloaded from and at what time. Both WARC and WACZ files can be viewed in ReplayWeb.page, Webrecorder’s free, open source web archive viewer.
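To give a rough feel for what one of those records looks like, here is a simplified Python sketch that wraps a captured HTTP response in WARC-style headers recording where it came from and when. The function name `make_warc_record` and the example response are invented for illustration. Real archiving tools write additional fields (such as payload digests) and usually compress each record, so this is only a sketch of the idea, not a replacement for a proper WARC library.

```python
import uuid
from datetime import datetime, timezone


def make_warc_record(target_uri, http_response, captured_at):
    """Wrap raw HTTP response bytes in a minimal WARC 'response' record.

    `captured_at` should be a timezone-aware datetime; it is written in UTC.
    """
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",          # where the file was downloaded from
        f"WARC-Date: {timestamp}",                 # when it was downloaded
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record is its headers, a blank line, the captured bytes, then two CRLFs.
    return head + http_response + b"\r\n\r\n"


# A made-up HTTP response standing in for a real download.
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
record = make_warc_record(
    "https://example.com/",
    http_response,
    datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

A WARC file is essentially many of these records written one after another, which is what lets a replay tool look up what was captured for a given URL and time.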

Browser-Based Archiving

Modern websites will often use JavaScript to load files based on user interaction. One example of this is an infinite scroll website that loads content when you reach the bottom of a page. JavaScript that runs in your browser tells the website to load more items! Examples like this are why Webrecorder’s archiving tools are developed around the browser: if you can’t load a page exactly as a real user would, you won’t be able to create the highest-fidelity archive possible.

How to Get Started

ArchiveWeb.page is Webrecorder’s free interactive archiving extension that runs in Chrome-based browsers. It is a great way to get started with web archiving by capturing sites manually as you browse them!

Browsertrix is Webrecorder’s high-fidelity browser-based crawling service, suitable for archiving entire websites. Check it out once you’re ready to expand your practice beyond manual capture or want to share a link to your archives with others.