A Quick Look Back at 2023

It has been almost two years since we initially announced Browsertrix! Since then, we’ve been pretty much solely focused on developing our next generation cloud-based archiving platform. One of the downsides of this sole focus is that sometimes you forget to update the company blog and actually tell people about what you made! I’m not going to go back and write update posts for every major release we’ve done (there have been eight!), but here are some of the more recent highlights if you missed them:

Collections

In 1.6 we added collections, the ability to add archived items to multiple different groups for sharing and export! Collections serve as the base for future curation features, but right now they allow for both crawls and uploads to be replayed, downloaded, or shared together as one package.

Because both uploads and crawls share their data within a collection, they also allow you to manually patch automated crawls created through our ArchiveWeb.page browser extension. If elements on a site you’ve tried to capture in Browsertrix aren’t available when replaying the crawl but are available when you capture the page with ArchiveWeb.page, try uploading a WACZ from ArchiveWeb.page and adding both to a collection!

A screenshot of the Browsertrix collection archived item list tab with both crawls and uploads in the list.

Dashboard & Execution Time

In 1.7 we added the Overview page which displays key org metrics for storage and crawling. In 1.9 we’ve updated the usage history table to give you more granular stats on your execution time, separately listing the execution minutes used per-month based on how you’re charged for them.

Screenshot of the Org Dashboard page, the storage section shows a bar graph of how much data is being used for each archived item type, a table lists the breakdown of execution minutes used per month.

Documentation

We may not have updated the blog much, but we definitely upgraded our docs! Browsertrix now has a full user guide using the excellent Material for MKDocs theme. One page I’m personally proud of is our extensive list of every crawl workflow setting, a handy reference if the in-app help text isn’t quite enough.

A screenshot of the Browsertrix workflow settings documentation, a long article that lists every setting available to users

1.9 Release!

Crawler Version Selection

We frequently release beta versions of Browsertrix Crawler, the core component of our software actually responsible for capturing websites. Up until now, the version the app uses has been set by us, soon we’ll be providing you with some options! Release channels can now be set on a per-crawl basis so if we’ve implemented a fix in the latest beta for a site you’re trying to crawl, you can use it — while keeping in mind that there may be other unresolved issues that aren’t quite ready for prime time yet and that’s why it’s a beta version. But you knew that already, right?

More information can be found in the Crawler Release Channel section of the workflow setup page.

A screenshot of the release channel selection dropdown menu and custom user agent field

Custom User Agents

Release channels aren’t the only new crawling feature though, we’ve also added the ability to set a custom user agent that the crawler will use to identify itself to websites. In addition to bypassing sites that try to restrict which browsers can access them — something they generally shouldn’t be doing anyway — this feature is also useful for some of our larger clients for coordinating with publications to ensure their crawls don’t get blocked.

Updated Collection Selection UI

While we updated some of this to remove the clunky multi-stage setup and editing process for collections in 1.8 (if you know, you know), the release of 1.9 completes our overhaul to the collection content editing process. “Auto add” can be toggled for workflows right in the collections editor, and archived items in and out of the collection are now displayed in the same window giving us more room to display information about each item.

The new collection item selection UI, a unified list of archived items with a checkbox to add or exclude from the collection.

Fixes & Small Things

As always, a full list of fixes (and additions) can be found on our GitHub releases page, but here’s some of the big small stuff:

  • The “workflow settings” tab that displays the current workflow settings and crawl settings tab that displays the workflow settings used for that crawl now display the same data without any discrepancies. #1473
    • Useful for nailing down what might have changed when crawling the same site multiple times with different settings!
  • We’ve increased the max width of the app. More data on the screen at once! #1484
  • Fixed a memory leak, now the server doesn’t have to restart every day! #1468
  • Run on Save is now only toggled on by default when creating a new workflow. #1458
    • No more workflows running by accident!

What’s next?

We’re busy developing the initial version of our assistive quality assurance tools, currently focused on screenshot analysis of captured content. While there’s still a little ways to go before it’s ready for testing, everyone is pretty excited to get that into your hands. Look for it in the next major release! 🙂

If you’re interested in signing up to crawl with Browsertrix for your institution, check out the details at Browsertrix.com.

Have thoughts or comments on this post? Feel free to start a discussion on our forum!