What does it take to archive a web page/project?

As the web has transitioned from static documents to interactive web applications, the challenge of archiving and preserving the web has only increased.

But some web pages/projects/publications - let’s refer to them as ‘web objects’ - are easier to archive than others. Some require no effort at all, while others require significant effort and still cannot be correctly archived. Sure, the number of pages there are, or how ‘interactive’ a project is, plays a role, but it’s not the full story.

This is a key question for Webrecorder, and something I have wondered about for a while in determining what is possible and how much effort may be required.

To fully express this difficulty, a new methodology is needed, something I’ve decided to call ‘encapsulation complexity’.

Introducing Encapsulation Complexity

At its core, web archiving is really a reproducibility problem: the problem of capturing web objects and then replaying, or reproducing, them later from archival storage, as accurately as possible, in an isolated environment. To reproduce a captured web object, it must first be encapsulated, meaning all of its dependencies must be determined and also captured.

How difficult it is to encapsulate any web object and reproduce it later can be called the ‘encapsulation complexity’ of the object, which depends on a number of factors, such as external dependencies, explained further below.

Different levels of encapsulation complexity require different digital preservation approaches and tools and lead to different expectations. Sometimes a web object can be saved as a single page, or a web archive stored in WARC or WACZ files is sufficient. But sometimes it is necessary to also encapsulate a fully running web server running in an emulation system, and for other cases, even that is not feasible.

This complexity can be categorized with the following levels, with each level being an order of magnitude ‘harder’ than the previous one.

Level 0 (Single-Page Encapsulatable)

A single-page web object, with zero external dependencies. Images, CSS, Javascript, if any, are fully inlined in the page. The page can be directly loaded in any browser. The page need not be fully static, but it should not have any external resources, and does not require any web server.

A Level 0 object does not require much effort for encapsulation, as it can simply be saved using the browser’s ‘Save Page’ functionality and replayed again, as it is already self-encapsulating in a sense.

A screenshot, or a PDF can also fit into this category, as they are single-page web objects that can be loaded in a web browser.

Possible Tools: Built-in Save Page As, Single-Page Extension, Taking a Screenshot, Saving as PDF.
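
As a rough illustration of the ‘zero external dependencies’ condition, here is a minimal Python sketch that scans a page’s HTML for attributes that would trigger a network fetch. The names `FETCHING_ATTRS`, `find_external_refs` and `looks_like_level_0` are hypothetical, and the attribute list is an assumption, not an exhaustive one. It does not see resources loaded from CSS `url()` rules or from Javascript, so a clean result only suggests a page may be Level 0.

```python
from html.parser import HTMLParser

# (tag, attribute) pairs that commonly cause the browser to fetch a URL.
# Illustrative assumption only; real pages have more ways to load resources.
FETCHING_ATTRS = {
    ("img", "src"),
    ("script", "src"),
    ("link", "href"),
    ("iframe", "src"),
    ("video", "src"),
    ("audio", "src"),
    ("source", "src"),
}


class ExternalRefFinder(HTMLParser):
    """Collects (tag, url) pairs for resources that are not inlined."""

    def __init__(self):
        super().__init__()
        self.external_refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in FETCHING_ATTRS and value:
                # data: URIs are inlined in the page, so they are not external.
                if not value.strip().lower().startswith("data:"):
                    self.external_refs.append((tag, value))


def find_external_refs(html):
    finder = ExternalRefFinder()
    finder.feed(html)
    finder.close()
    return finder.external_refs


def looks_like_level_0(html):
    # True when the scan found no externally loaded resources.
    return not find_external_refs(html)
```

Running `find_external_refs` on a saved page lists what would still need to be inlined (or captured in a web archive) before the page could be treated as self-encapsulating.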

Level 1 (Web Archive Encapsulatable)

Level 1 web objects consist of a finite number of URL resources that can be exhaustively captured or crawled. The resources can be loaded from any number of different web servers, including embeds. The object can be arbitrarily complex on the client, running complex Javascript and requiring arbitrary user input. The web server interaction of this object must be limited to a fixed amount of data from a fixed number of URLs. Other dynamic network data, such as websocket connections, cannot be included. The Level 1 object can be fully encapsulated as a web archive in a single WARC or WACZ file, and can load directly in a browser.

Most web objects that work well within web archives fit into this level of complexity, from small or single-page projects to large sites requiring crawling.

Possible Tools: Browser-based capture (Webrecorder), web crawling (Browsertrix, Heritrix, etc…)

Level 2 (Web Archive + Server Emulation Encapsulatable)

Level 2 web objects require a fixed, known web server to also be encapsulated to be fully functional, along with a WARC/WACZ based web archive. The web server must be encapsulated and reproducible in a self-contained computing environment. The server can have other dependencies, such as a database, as long as the database is deployed alongside the service and not externally. The client web object can make any number of dynamic URL requests to the fixed web server(s), including with websockets, that are running within the encapsulated environments.

Possible Tools: Orchestrated web server containers (Docker Compose, Kubernetes) combined with web archives, web server emulation

Level 3+ (Not Fully Encapsulatable)

Any web objects that have an unknown number of external dependencies, or dependencies that simply cannot be enumerated in an encapsulation/preservation system, are Level 3 objects or higher. This includes web objects that make dynamic requests to external servers that are outside the control of the user, such as doing a search on Google, and web objects that rely on dynamic external data, such as specific camera, microphone, or geolocation inputs, or network speed. Of course, there is no limit to how complex such objects can get, and examining them further is not useful, as they are not ‘encapsulatable’ at full fidelity.

Possible Tools: None currently, requires migration to a Level 2, 1, or 0 object.
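
To make the four levels concrete, the decision can be sketched as a small Python function. This is only one possible reading of the definitions above: the `WebObjectTraits` fields and the `classify` function are hypothetical names, and in practice each of the three yes/no answers has to come from actually inspecting the web object.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    SINGLE_PAGE = 0          # Level 0: Single-Page Encapsulatable
    WEB_ARCHIVE = 1          # Level 1: Web Archive Encapsulatable
    SERVER_EMULATION = 2     # Level 2: Web Archive + Server Emulation
    NOT_ENCAPSULATABLE = 3   # Level 3+: Not Fully Encapsulatable


@dataclass
class WebObjectTraits:
    # Hypothetical field names, chosen for this sketch only.
    # Loads a finite, enumerable set of URLs beyond the page itself.
    has_external_resources: bool = False
    # Makes open-ended requests (or websocket connections) to a fixed,
    # known server that could itself be encapsulated.
    needs_dynamic_server: bool = False
    # Depends on third-party services or live inputs that cannot be
    # enumerated (external search, camera, geolocation, and so on).
    has_unbounded_external_deps: bool = False


def classify(traits):
    # Check the hardest condition first: the highest level that applies wins.
    if traits.has_unbounded_external_deps:
        return Level.NOT_ENCAPSULATABLE
    if traits.needs_dynamic_server:
        return Level.SERVER_EMULATION
    if traits.has_external_resources:
        return Level.WEB_ARCHIVE
    return Level.SINGLE_PAGE
```

For example, `classify(WebObjectTraits(has_external_resources=True, needs_dynamic_server=True))` gives `Level.SERVER_EMULATION`, since the need for a live server outweighs the otherwise crawlable resources.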

Determining Encapsulatability: So what does it take to archive a particular web object?

Coming back to the original question, when looking at a particular web object, determining its encapsulation complexity can indicate how difficult it will be to encapsulate/preserve and what the options may be.

Even with the above methodology, this determination can still be tricky without examining ‘how’ a web object ‘works’: looking at the network traffic, interacting with it, and even examining the code.

For one, in web objects consisting of multiple pages, the level of complexity of each individual page may vary. For example, a mostly static blog (Level 1) may contain a page with a YouTube embed (a Level 3 object). Therefore, the whole blog would be a Level 3 object, because it involves a YouTube video which brings with it an infinitely interactive external dependency (including recommendations, related videos, etc.).
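
Put another way, the level of a multi-page web object is the maximum of the levels of its pages. A minimal Python sketch of that rule, using a hypothetical mapping of page paths to levels (the paths and the function names are made up for illustration):

```python
def overall_level(page_levels):
    """Return the level of a multi-page object: the highest level of any page."""
    if not page_levels:
        raise ValueError("at least one page is required")
    return max(page_levels.values())


def hardest_pages(page_levels):
    """Return the pages responsible for the overall level, in sorted order."""
    top = overall_level(page_levels)
    return sorted(page for page, level in page_levels.items() if level == top)


# Hypothetical blog: mostly static pages plus one post with a video embed.
blog = {
    "/": 1,
    "/about": 1,
    "/posts/with-video-embed": 3,
}
```

Here `overall_level(blog)` is 3, and `hardest_pages(blog)` points at the single post that pulls the whole blog up, which is also the page to look at first when considering a migration to a lower level.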

There are a few telltale signs that can help, though, such as embeds of external services like YouTube. As another example, a web object that has server-side search will be at least a Level 2 object, as it must make dynamic requests to the server for search.

If the search is implemented entirely client-side using a JSON search index and a client-side framework like FlexSearch, then the same object could become Level 1, all other things being equal. However, without looking at the network traffic, it may not be evident if server or client-side search is used.
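
One rough way to tell the two apart from the network traffic is to run a search and check whether any request made afterwards carries the query text. The Python sketch below assumes the list of request URLs has already been exported from the browser’s network panel; the function name `search_hits_server` is hypothetical. It only inspects URL query strings, so a search sent in a POST body or in the URL path would be missed.

```python
from urllib.parse import parse_qsl, urlsplit


def search_hits_server(query, urls_after_search):
    """Heuristic: does any request made after searching carry the query text?

    A True result suggests server-side search (at least Level 2).
    A False result is only weak evidence of client-side search.
    """
    needle = query.lower()
    for url in urls_after_search:
        # parse_qsl decodes percent-encoding and '+' into plain text.
        params = parse_qsl(urlsplit(url).query)
        if any(needle in value.lower() for _, value in params):
            return True
    return False
```

For instance, a request to `https://example.com/search?q=web+archive` after typing ‘web archive’ would return True, while a one-time fetch of something like `https://example.com/search-index.json` would not.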

For the author of a web object that will need to be encapsulated/preserved, it may be worth it to choose client-side search over server-side search to make encapsulation as a web archive easier down the road.

Alternatively, it may be a reasonable trade-off to ‘migrate’ a Level 2 object with server-side search to a Level 1 encapsulated web archive, which could have a built-in web archive search, but lose its original search features. Indeed, web archives can fully encapsulate web objects up to Level 1, and anything higher is necessarily migrated to Level 1 to be encapsulated as a web archive.

The encapsulation complexity level provides an upper bound on how hard it may be to encapsulate a particular web object, as well as what the maintenance costs can be.

In a future blog post, I’ll provide additional approaches to determine this complexity and discuss migration of web objects to a lower complexity level and the trade-offs involved.