A screenshot of the home page of https://govarchive.us/ listing available collections

Introducing GovArchive.us & Mirroring Entire Sites with Web Archives

By Ilya Kreymer

We’re excited to announce the launch of GovArchive.us, a dedicated site for exploring our US Government Web Archive on Browsertrix. The project also introduces a brand new approach for viewing web archives: the ability to host a full-site “mirror” from any web archive, keeping original links intact while hosting them on a new domain.

One example of this is our archived version of the previous usaid.gov website, which is now accessible at usaid.govarchive.us. Unlike traditional web archive replay, this “mirror” archive preserves the original URL structure, making the site as easy to navigate and reference as the original. For instance, the archived version of a page originally hosted at https://usaid.gov/about-us/mission-vision-values can be viewed at https://usaid.govarchive.us/about-us/mission-vision-values simply by replacing the domain usaid.gov with usaid.govarchive.us.
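To make the domain swap concrete, here is a minimal Python sketch of the mapping described above. The function name `to_mirror_url` and the handling of a leading `www.` are our own assumptions for illustration; the mirror itself does not need any such helper, since the path simply stays the same.

```python
from urllib.parse import urlsplit, urlunsplit


def to_mirror_url(original_url: str,
                  original_host: str = "usaid.gov",
                  mirror_host: str = "usaid.govarchive.us") -> str:
    """Swap the host of original_url for the mirror host.

    The path, query string, and fragment are kept unchanged.
    """
    parts = urlsplit(original_url)
    host = (parts.hostname or "").lower()
    # Assumption for this sketch: treat "www.usaid.gov" as the same site.
    if host.startswith("www."):
        host = host[len("www."):]
    if host != original_host:
        raise ValueError(f"{original_url!r} is not hosted on {original_host}")
    return urlunsplit(("https", mirror_host, parts.path, parts.query, parts.fragment))


print(to_mirror_url("https://usaid.gov/about-us/mission-vision-values"))
```

Because only the host changes, any existing link to the original site can be converted mechanically, which is what makes the mirror easy to reference.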

We’ve reserved govarchive.us and its subdomains (*.govarchive.us) so that we can dynamically add more archives of US Government sites from our collections to this system.

What is Available Now?

Here’s a selection of the “mirror” sites currently available from govarchive.us. Each mirror is a static site that loads an archived version from our collection, hosted on a dedicated domain:

Screenshot of GovArchive.us

Check GovArchive.us for an up-to-date list as we add more mirrors from our archives!

Mirroring Sites with Web Archives — Getting Started

Anyone can use this approach to mirror a dynamic website as a static site powered by web archives!

If you control a domain, you can set up a web archive of its site as a static site and point the domain to that static version instead!

Or, you can host a mirror elsewhere, as we have done. This can be used to migrate off costly or obsolete infrastructure, while still preserving a site at the highest fidelity!

We provide the following template to get started with a single site mirror created from a web archive:

Using the above template, you can host your own web archive mirror entirely on GitHub Pages!

How it works: GovArchive and Wildcard Subdomains

GovArchive.us demonstrates a more complex setup with wildcard subdomains.

We’ve set up a wildcard DNS record that points any *.govarchive.us subdomain to a single static site. (For this, we use Bunny CDN, as GitHub Pages does not support wildcard subdomains pointing to the same repo.)

Then, we dynamically choose the correct site to mirror in the browser based on the subdomain. A specific Browsertrix collection is chosen based on the current subdomain, allowing for maximum flexibility to add more collections.
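The lookup can be pictured roughly as follows. This is only a sketch of the logic in Python; the real selection happens in the browser, and the table `COLLECTIONS`, its identifiers, and the function `collection_for_host` are hypothetical names made up for illustration, not the actual Browsertrix collection IDs or code.

```python
from typing import Optional

MIRROR_BASE = "govarchive.us"

# Hypothetical table: mirror subdomain -> collection identifier.
# The real identifiers are not given in this post.
COLLECTIONS = {
    "usaid": "example-collection-usaid",
    "nca2023-globalchange": "example-collection-nca2023",
}


def collection_for_host(host: str) -> Optional[str]:
    """Return the collection for a mirror hostname, or None if there is none."""
    host = host.lower().rstrip(".")
    suffix = "." + MIRROR_BASE
    if not host.endswith(suffix):
        # The bare top-level site, or an unrelated host.
        return None
    subdomain = host[: -len(suffix)]
    # Mirrors live exactly one level below the base domain.
    if not subdomain or "." in subdomain:
        return None
    return COLLECTIONS.get(subdomain)


print(collection_for_host("usaid.govarchive.us"))
```

With a table like this, adding a new mirror only requires a new entry; the DNS and hosting setup stay the same.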

Nested subdomains are flattened by replacing each . with a -, so that more.subdomains.example.gov becomes the single-level subdomain more-subdomains-example.govarchive.us. This lets us use one wildcard SSL certificate, since a wildcard certificate only covers a single level of subdomain. For example, nca2023-globalchange.govarchive.us mirrors nca2023.globalchange.gov.
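As a rough illustration, the flattening rule could be written like this in Python. The function name `flatten_host` is our own, and the behavior for hosts that do not end in .gov (keeping every label, so the top-level domain becomes part of the name) is an assumption based on reader reports rather than something spelled out here.

```python
MIRROR_BASE = "govarchive.us"


def flatten_host(original_host: str) -> str:
    """Map an original hostname to its single-level mirror hostname."""
    labels = original_host.lower().rstrip(".").split(".")
    # Drop the trailing "gov" label, matching the examples in this post.
    # Assumption: other top-level domains are kept as part of the name.
    if len(labels) > 1 and labels[-1] == "gov":
        labels = labels[:-1]
    return "-".join(labels) + "." + MIRROR_BASE


print(flatten_host("nca2023.globalchange.gov"))
```

Note that the mapping is not reversible in general: a hostname that already contains a hyphen and one with a dot in the same position would flatten to the same mirror name.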

With GovArchive.us, we also provide a custom banner and loading screen. If the archive is already initialized, it loads right away; otherwise, the bootstrap script runs and a loading screen is shown while the service worker is being initialized. Finally, the top-level site just provides a landing page index, hosted in a different repo.

As always, the whole thing is open source, and further details are available in our GitHub repos:

The replay itself is provided by our low-level browser-based replay engine, wabac.js, which is also used in ReplayWeb.page. (In the future, the mirror capability may be added to ReplayWeb.page itself.)

We hope GovArchive.us provides a much-needed resource, as well as an example of how web archive-powered site mirrors can be done at scale.

If you need help setting up your own web archive mirror, reach out and we’d be happy to support your efforts!


Comments

James R Jacobs @freegovinfo.bsky.social · a day ago

Thanks! this looks really interesting. a new avenue to govinfo in addition to @eotarchive.org EOT collection search at web.archive.org.

Nintendofan885 @nintendofan885.bsky.social · a day ago

nice :) What about domains not on .gov? (e.g. military sites)

Nintendofan885 @nintendofan885.bsky.social · a day ago

ah, just realised that [something]-mil.govarchive.us works for .mil domains
