The Browsertrix logo with version 1.17 displayed next to it, overlaid on a pause and resume button

Browsertrix 1.17: Crawl Pause/Resume and Lower Numbers of Browser Windows


By Tessa Walsh / Senior Applications & Tools Engineer

We are excited to announce the release of Browsertrix 1.17. This release brings some commonly requested crawling improvements: the ability to pause and resume crawls, and the ability to crawl with only 1, 2, or 3 browser windows.

Pausing and Resuming Crawls

A common piece of feedback we’ve received since launching Browsertrix is that it would be nice to be able to pause crawls and then later resume them from where they left off. We listened, and this feature is now available via a new Pause button in the crawl workflow.

After you click the Pause button, the running workflow will tidy up any remaining pages and upload WACZ files containing all of the data crawled so far. Once the workflow is successfully paused, you can replay all of the pages crawled so far, download your WACZ files, and inspect the logs. A crawl can stay paused for up to a week, so you are free to review it at your convenience. If you forget to resume or stop the paused crawl within that week, Browsertrix will stop it for you, preserving all of your already-crawled data.

Based on previous conversations with many of you, we anticipate pausing will be especially useful for conducting test crawls. Not sure how well a website will be captured with a given workflow’s settings? Start the crawl with its full scope, pause it once it has crawled as many pages as you want in your sample, and then inspect the replay and logs to see if you’re happy with the results. If all looks good, you can simply resume your crawl and it will pick up right where it left off. If not, you can modify the workflow settings before resuming the crawl, or cancel the crawl without having used many of your execution minutes in the process.

This functionality is made possible by the new Latest Crawl tab in the crawl workflow, which consolidates several pre-existing tabs such as Watch Crawl and Logs into a simpler interface. Latest Crawl displays information about the currently active crawl or, if the workflow is not currently running, the last crawl that was run from that workflow. This also means you can now replay a workflow’s latest crawl without needing to navigate away from the workflow!

A screenshot of a paused crawl in Browsertrix, showing details about the crawl status, options to resume or cancel the crawl, and a replay viewer

Browser Windows

Another commonly requested feature is the ability to crawl with fewer browser windows, to avoid issues with sites that aggressively rate limit users. Now you can do just that. In the workflow editor, you can configure crawls to run with 1, 2, or 3 browser windows, in addition to the multiples of 4 that were previously offered. In combination with other politeness settings such as Delay Before Next Page, this should help significantly with avoiding rate limiting and IP bans.

For our users who primarily interact with Browsertrix via the REST API, you’ll note in the API documentation that the scale field in the /crawlconfigs/ endpoints has been deprecated in favor of a new, simpler browserWindows field. When used, browserWindows overrides scale, and it can configure workflows to use fewer browser windows than the increments available via scale. But don’t worry: to avoid breaking any existing user tooling and integrations, we have made sure to continue to support scale when it is used for creating and updating crawl workflows.
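
As a rough illustration of switching to browserWindows, the Python sketch below builds (but does not send) a request that sets the field on an existing workflow. Only the browserWindows field name and the /crawlconfigs/ endpoint family come from this post. The base URL, the /orgs/{org_id}/crawlconfigs/{workflow_id} path, the PATCH method, the bearer-token header, and the helper names are assumptions made for illustration, so check them against the API documentation for your deployment.

```python
import json
import urllib.request

# Assumption: placeholder base URL; substitute your own deployment's API root.
API_BASE = "https://browsertrix.example/api"


def is_valid_window_count(n: int) -> bool:
    """Accept 1, 2, 3, or a positive multiple of 4, matching the options
    described in this post. Any upper limit depends on the deployment."""
    return n in (1, 2, 3) or (n > 0 and n % 4 == 0)


def build_update_request(org_id: str, workflow_id: str, token: str,
                         browser_windows: int) -> urllib.request.Request:
    """Build a request that sets browserWindows on a crawl workflow.

    Hypothetical helper: the URL layout, HTTP method, and auth header
    are assumptions, not confirmed details of the Browsertrix API.
    """
    if not is_valid_window_count(browser_windows):
        raise ValueError(f"unsupported browser window count: {browser_windows}")

    # Send browserWindows only; the deprecated scale field is left out.
    body = json.dumps({"browserWindows": browser_windows}).encode("utf-8")
    url = f"{API_BASE}/orgs/{org_id}/crawlconfigs/{workflow_id}"
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # All identifiers below are placeholders.
    req = build_update_request("my-org-id", "my-workflow-id", "MY_TOKEN", 2)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the payload before pointing the script at a real workflow. Existing scripts that still send scale should keep working, as noted above.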

A screenshot of the browser settings panel in the crawl workflow settings page in Browsertrix, showing the number of browser windows to use with options ranging from 1 to 12

Sign up and start crawling with Browsertrix
