Showing Provenance on ReplayWeb.page Embeds

By Ilya Kreymer and Henry Wilkinson

The ReplayWeb.page viewer allows web archives to be embedded and displayed almost anywhere. It can be used as a single-page embed, or integrated into existing services and digital repository systems.

Previously, ReplayWeb.page embeds were displayed either with the location and navigation bar UI, or without any additional UI or context. With the interface hidden, little signals to users that the content is being served from a web archive, since a high-fidelity web archive should look and feel the same as the original. At the same time, we want people viewing web archives to understand that the content within an archive may not be a perfect record of a live website, and that the content is frozen: it won’t be updated or deleted like other content embedded through traditional means.

To help with this, we’ve added a new embed mode that lets ReplayWeb.page embeds be added seamlessly, without extra UI, but with a dropdown ‘archive receipt’ that shows the provenance of the archive.

This mode can be enabled by setting the embed="replay-with-info" attribute in the <replay-web-page> tag.
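
As a minimal sketch, an embed using this mode might look like the following. The script URL, archive location, and page URL are placeholders rather than values from this post, and a working embed also needs the ReplayWeb.page service worker to be served from your own site.

```html
<!-- Load the ReplayWeb.page UI (placeholder URL; pin a specific version in practice) -->
<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>

<!-- source: the archive to load; url: the archived page to show first (both hypothetical) -->
<replay-web-page
  source="https://example.com/archives/my-archive.wacz"
  url="https://example.com/"
  embed="replay-with-info"></replay-web-page>
```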

Archive Provenance

The embed now includes a dropdown, which expands to show the ‘archive receipt’ with info about the web archive.

In the technical info section, the archive receipt can show basic provenance metadata, such as Original URL and Archived On, as well as the tools used to create the archive. Additional provenance info can be added here as needed. A download link at the bottom lets users download the full archive locally, and the size of the full archive file is also shown.

Validating Signed Web Archives

A key use case for the receipt is to show cryptographic signature metadata about the web archive. If the web archive is a signed WACZ file, created as per the WACZ Signing Spec, the real-time validation status, the cryptographic keys, and the hash of the full data package in the WACZ file are also shown.

When loading a signed WACZ file, all data (WARC records, indexes, page lists, etc.) is validated on the fly as it is loaded (via wabac.js). The receipt includes a Validation field, which shows the number of hashes validated thus far. If the WACZ file has been tampered with in any way, the hashes will not match, and this is also displayed to the user.
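
To illustrate the idea of hash validation, here is a rough offline sketch in Python. It is not how wabac.js works internally (which validates incrementally as data is loaded), and it does not check the signature itself. It assumes the WACZ layout in which `datapackage.json` lists each resource with a `path` and a `hash` of the form `sha256:<hex digest>`; the function names `verify_wacz_hashes` and `build_demo_wacz` are hypothetical helpers for this example.

```python
import hashlib
import io
import json
import zipfile


def verify_wacz_hashes(wacz_file):
    """Return (validated_count, mismatched_paths) for a WACZ (ZIP) archive.

    Assumes each entry in datapackage.json "resources" has a "path" and a
    "hash" formatted as "<algorithm>:<hex digest>".
    """
    validated = 0
    mismatched = []
    with zipfile.ZipFile(wacz_file) as zf:
        datapackage = json.loads(zf.read("datapackage.json"))
        for resource in datapackage.get("resources", []):
            algorithm, _, expected = resource["hash"].partition(":")
            data = zf.read(resource["path"])
            actual = hashlib.new(algorithm, data).hexdigest()
            if actual == expected:
                validated += 1
            else:
                mismatched.append(resource["path"])
    return validated, mismatched


def build_demo_wacz(tamper=False):
    """Build a tiny in-memory, WACZ-like ZIP for demonstration only."""
    pages = b'{"format": "json-pages-1.0"}\n'
    datapackage = {
        "profile": "data-package",
        "resources": [
            {
                "name": "pages.jsonl",
                "path": "pages/pages.jsonl",
                "hash": "sha256:" + hashlib.sha256(pages).hexdigest(),
                "bytes": len(pages),
            }
        ],
    }
    # Optionally alter the stored file after its hash was recorded.
    stored = (pages + b"tampered") if tamper else pages
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", json.dumps(datapackage))
        zf.writestr("pages/pages.jsonl", stored)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(verify_wacz_hashes(build_demo_wacz()))
    print(verify_wacz_hashes(build_demo_wacz(tamper=True)))
```

The count of matching hashes plays the role of the receipt's Validation field in this sketch, while any path in the mismatch list corresponds to the tampering case described above.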

The cryptographic metadata also relates to provenance: it includes either a public key or a link to a trusted third-party observer certificate, which establishes the creator of the web archive. (We’ll discuss how this works in a future blog post!)

Trusting Web Archives

We hope the display of this information will be a first step in making distributed web archives more trusted. While some of the data may not be applicable to most users and is fairly technical, we believe this can encourage independent verification of archived content in the future.

In a follow-up blog post, we will discuss the different WACZ signing approaches and their implications for authenticity!