The Browsertrix logo with version 1.18 displayed next to it, overlaid on an email icon and a file icon

Browsertrix 1.18: Large URL Lists and Beautiful Emails

Browsertrix 1.18 brings support for large URL lists, new email templates, and UX improvements for crawling and curating.

By Emma Segal-Grossman / Senior Software Developer

In the new version of Browsertrix, we’re adding support for large URL lists and taking some steps towards better and more consistent communication.

Large URL Lists

A long-requested feature in Browsertrix has been support for crawling a very long list of URLs. While we’ve long supported large numbers of pages in crawls via crawl scopes that allow the crawler to discover pages, that doesn’t make sense in some archiving workflows. We’d limited the number of individual URLs you could enter on the workflow configuration page to 100, but in some cases users were splitting crawls into multiple workflows just to get all of their pages crawled, which then made other operations in Browsertrix unwieldy and in some cases a little broken.

Now, you can upload or paste your huge URL list into Browsertrix and everything will just work! You can still enter URLs manually if you want, but if you have a large number of URLs you can upload a plain text file with one URL per line, and we’ll save it for you and use it in your workflow.

Upload a seed file to crawl a large number of URLs

For more details, check out our documentation.
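To illustrate the one-URL-per-line format described above, here is a minimal sketch of how such a file could be read and sanity-checked before uploading. The `parseSeedFile` helper and its rules (skip blank lines, accept only http and https) are assumptions made for this example, not Browsertrix’s actual validation logic.

```typescript
// Hypothetical helper for illustration only; not Browsertrix's implementation.
interface SeedFileResult {
  urls: string[];    // lines that parsed as http(s) URLs
  invalid: string[]; // non-empty lines that did not
}

function parseSeedFile(text: string): SeedFileResult {
  const urls: string[] = [];
  const invalid: string[] = [];

  // Handle both Unix and Windows line endings.
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue; // ignore blank lines

    try {
      // The URL constructor throws on strings it cannot parse.
      const parsed = new URL(line);
      if (parsed.protocol === "http:" || parsed.protocol === "https:") {
        urls.push(line);
      } else {
        invalid.push(line); // assumption: only web URLs are wanted
      }
    } catch {
      invalid.push(line);
    }
  }

  return { urls, invalid };
}

// Usage: check a small list before uploading it.
const seedExample = parseSeedFile(
  "https://example.com/\nhttps://example.org/about\nnot a url\n"
);
console.log(`${seedExample.urls.length} valid, ${seedExample.invalid.length} invalid`);
```

Keeping the rejected lines in a separate list makes it easy to spot typos in a long file before the crawl starts.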

Pretty Emails

Emails for various account-related interactions with Browsertrix (such as new sign-up invitations and password resets) now more clearly come from us and are much easier to read. Welcome emails have clearer instructions for getting started, and updates related to your subscription will include more details.

Screenshot of an invite email in Apple Mail

One of the new email templates now in use

For a while now, these types of emails had been… functional, but very plain. We’d heard they were sometimes mistaken for spam, so we’ve built a new look for all of our transactional emails, and updates to our marketing and newsletter emails should follow soon. Implementing this posed some interesting technical challenges, detailed below if you’re interested.

How We Built Our Own Emails

Our email templates were originally implemented as plain-text Jinja templates. While this got the job done early on in Browsertrix’s development, it wasn’t ideal for a number of reasons: editing these templates was a tedious process that required a lot of trial and error, and styling emails is notoriously messy and difficult because of how few HTML and CSS features are consistently supported across email clients.

When looking into alternatives, we initially considered other Python libraries that could more easily drop in as a replacement for the existing templates in the backend, but ultimately found that React Email was the best-maintained and easiest-to-use library available, despite being a JavaScript library. I started putting together a template for the welcome email, and within a few hours had templates for almost all of the other email types done as well.

One of the nice things React Email does, especially compared to other libraries, is let us use style tokens from our design system Hickory as part of a Tailwind config. It’s got a bunch of built-in components that provide abstractions for some of the more frustrating aspects of email development, namely all the style and layout constraints imposed by email clients, and a very simple API that can output both HTML and plain text from the same template. I originally intended to use it to just generate HTML that I’d then convert back to Jinja2 templates, but realized that with some of the more advanced templating I ended up using (for example, date and time formatting, where Python’s utilities don’t support the same types of formatting as JavaScript’s Intl.DateTimeFormat) it would be easier to just use React Email directly. I put together a simple API server for generating email templates in both HTML and plain text along with subject lines, and got it all set up to deploy alongside the main Browsertrix application in the Helm chart.
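For a sense of the kind of locale-aware formatting that Intl.DateTimeFormat provides, here is a small sketch. The `formatEmailDate` helper, its default locale and time zone, and the chosen options are all assumptions for illustration; they are not taken from the Browsertrix templates.

```typescript
// Hypothetical helper for illustration; the option choices are assumptions.
function formatEmailDate(
  date: Date,
  locale: string = "en-US",
  timeZone: string = "UTC"
): string {
  const formatter = new Intl.DateTimeFormat(locale, {
    year: "numeric",
    month: "long",
    day: "numeric",
    hour: "numeric",
    minute: "2-digit",
    timeZone,              // render in the recipient's zone, not the server's
    timeZoneName: "short", // append a short zone label
  });
  return formatter.format(date);
}

// Usage: the same instant rendered for two different audiences.
const sampleDate = new Date(Date.UTC(2025, 0, 5, 15, 30));
console.log(formatEmailDate(sampleDate));
console.log(formatEmailDate(sampleDate, "en-GB", "Europe/London"));
```

The ordering of the date parts, the punctuation, and the 12- or 24-hour clock all follow the locale, and the exact output depends on the locale data shipped with the JavaScript runtime.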

If you’d like to check out the changes, mess around with templates yourself, or contribute to the project, feel free to check out the codebase on GitHub. The email templating service lives in the emails folder, and has instructions for running the very handy dev server that React Email provides.

What’s Next

There are a few more updates to email communication in the works, including some improvements to the onboarding and trial experience as well as improvements to styling and layout for marketing and newsletter emails. We’re also working hard on a few exciting crawler updates.

Other Updates

  • You can now use some of our Quality Assurance tools without running QA Analysis on crawls! While we initially wanted users to try out our QA tools with QA Analysis, we’ve found the tooling we built to be useful even without it, so we’re making these tools available and more easily accessible from workflow detail pages. We’ve updated the documentation to reflect this as well; it’s worth another read even if you’ve already been using QA.
  • We’ve added a crawler setting that will fail a crawl if the site you’re crawling logs you out. This works on specific sites we’ve built detection for, which at the moment include Facebook, Instagram, TikTok, and X. For more details, check out the docs.

As always, you can view the full list of changes on GitHub.

Sign up and start crawling with Browsertrix
