<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Webrecorder Blog</title><description>What&apos;s new in the world of web archiving</description><link>https://webrecorder.net/</link><language>en-us</language><item><title>Execution Time Addons, Robots.txt, Profile Refreshes, Custom Schedules, and More</title><link>https://webrecorder.net/blog/2025-12-18-browsertrix-1-21/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-12-18-browsertrix-1-21/</guid><description>An overview of exciting new features from Browsertrix 1.19, 1.20, and 1.21.</description><pubDate>Thu, 18 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We are excited to announce a new release of Browsertrix, &lt;a href=&quot;#browsertrix-121&quot;&gt;1.21&lt;/a&gt;. This blog post covers the changes in this new release, as well as the previous two releases, &lt;a href=&quot;#browsertrix-119&quot;&gt;1.19&lt;/a&gt; and &lt;a href=&quot;#browsertrix-120&quot;&gt;1.20&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Each of these releases brings exciting new features, as well as bug fixes and performance improvements for Browsertrix. This blog post highlights some of the key features from each release, starting with the newest.&lt;/p&gt;
&lt;h2 id=&quot;browsertrix-121&quot;&gt;Browsertrix 1.21&lt;/h2&gt;
&lt;h3 id=&quot;purchase-additional-execution-time&quot;&gt;Purchase additional execution time&lt;/h3&gt;
&lt;p&gt;A common pain point that we’ve heard from customers is that it can be frustrating when your org reaches its monthly execution time limit. Previously, any running crawls were automatically stopped at that point and could not later be resumed, and there wasn’t much users could do (short of upgrading their plan) other than wait for the limits to reset at the beginning of the next month.&lt;/p&gt;
&lt;p&gt;In Browsertrix 1.21, it is now possible to purchase additional execution minutes at any time from right within Browsertrix itself. Org admins can go to &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Billing &amp;amp; Usage&lt;/strong&gt;, and then click the button to purchase additional execution time. This will lead you directly to Stripe to complete the transaction. We have set some preset amounts of minutes that we expect users might want to use, but the amount you purchase is also fully configurable in Stripe. Once the purchase is complete, you will be returned to your org in Browsertrix and are able to begin archiving again immediately.&lt;/p&gt;
&lt;figure&gt;&lt;p&gt;&lt;img src=&quot;/_astro/Image1_AdditionalMinutes.zsIphkG9_Z1Se2v4.webp&quot; alt=&quot;A screenshot of the Usage &amp;#38; Billing page, showing the dropdown to purchase various amounts of additional execution minutes&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;2430&quot; height=&quot;1388&quot;&gt;
&lt;figcaption&gt;Purchase additional execution minutes through the Billing &amp;amp; Usage section of your organization settings.&lt;/figcaption&gt;&lt;/p&gt;&lt;/figure&gt;
&lt;p&gt;Additional minutes purchased this way are not tied to the month in which they are purchased and do not expire, so you are welcome to add and use these additional minutes however you would like.&lt;/p&gt;
&lt;h3 id=&quot;pausing-crawls-when-org-limits-are-reached&quot;&gt;Pausing crawls when org limits are reached&lt;/h3&gt;
&lt;p&gt;We’ve also changed what happens to crawls when limits are reached. Running crawls are now paused rather than stopped when the org’s storage or execution time limits are reached. This gives users up to a week to resume paused crawls from where they left off. At any point during that week, you can free up some storage space, purchase additional execution time, or wait until the monthly limits reset, and then continue archiving without needing to restart your crawls from the beginning.&lt;/p&gt;
&lt;h3 id=&quot;robotstxt-support&quot;&gt;Robots.txt support&lt;/h3&gt;
&lt;p&gt;Another long-requested feature from some of our user base has been support for the &lt;a href=&quot;https://www.rfc-editor.org/rfc/rfc9309.html&quot;&gt;Robots Exclusion Protocol&lt;/a&gt;, more commonly known as &lt;em&gt;robots.txt&lt;/em&gt;. This is a convention that allows website administrators to specify which content on their sites should not be captured by crawlers and other bots. This new option, disabled by default, is available in the Scope section of the crawl workflow editor.&lt;/p&gt;
&lt;p&gt;At this time, Browsertrix’s robots.txt support skips any web pages disallowed by the host’s robots.txt policy, if one exists. It does not yet check each resource on a page, as that could quickly break Browsertrix’s promise of high-fidelity web archiving. If there is sufficient demand from our users, we may revisit other options to expand the scope of robots.txt support in future releases.&lt;/p&gt;
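&lt;p&gt;For readers curious how per-host gating under the Robots Exclusion Protocol works, here is a minimal sketch using Python’s standard library. It illustrates the protocol itself, not Browsertrix’s actual implementation; the rules and URLs are made up.&lt;/p&gt;

```python
from urllib import robotparser

# Sketch of a per-host robots.txt check: parse the host's policy once,
# then test each page URL against it before crawling. (Illustrative
# rules; a real crawler would fetch https://host/robots.txt.)
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Pages under /private/ are skipped; everything else may be crawled.
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
```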
&lt;h2 id=&quot;browsertrix-120&quot;&gt;Browsertrix 1.20&lt;/h2&gt;
&lt;h3 id=&quot;browser-profile-refresh&quot;&gt;Browser profile refresh&lt;/h3&gt;
&lt;p&gt;One of the most significant changes in Browsertrix 1.20 is a reworked user interface for browser profiles. The new browser profile interface includes a number of improvements that make it easier for users to manage their browser profiles and the crawl workflows that use them.&lt;/p&gt;
&lt;p&gt;Sites that have been visited with a browser profile are now prominently displayed in the &lt;strong&gt;Saved Sites&lt;/strong&gt; list. Clicking on the saved site will open the site with the browser profile, making it easier than before to verify that a site is still logged in or otherwise configured the way you want.&lt;/p&gt;
&lt;figure&gt;&lt;p&gt;&lt;img src=&quot;/_astro/Image2_BrowserProfileDetail.CNJcaYbD_2uWaIK.webp&quot; alt=&quot;A screenshot of a browser profile detail page, showing its name, saved sites, related workflows, and various actions&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1999&quot; height=&quot;1112&quot;&gt;
&lt;figcaption&gt;Configure profiles with a more powerful and intuitive interface and manage related workflows.&lt;/figcaption&gt;&lt;/p&gt;&lt;/figure&gt;
&lt;p&gt;To add additional sites to a browser profile, you can now click the &lt;strong&gt;Load New Url&lt;/strong&gt; button to open the browser profile to a new URL of your choosing.&lt;/p&gt;
&lt;p&gt;The Load Profile dialog, opened by clicking &lt;strong&gt;Load New Url&lt;/strong&gt; or by choosing &lt;strong&gt;Load Profile&lt;/strong&gt; from the &lt;strong&gt;Actions&lt;/strong&gt; menu, now also has a &lt;strong&gt;Reset previous configuration on save&lt;/strong&gt; option that reconfigures a profile from scratch. With it, you can switch a profile to a different logged-in user account or different cookie settings without any interference from previously saved data, and without having to create a new browser profile and then manually update every related crawl workflow to use it. This has saved us a lot of time in managing browser profiles for social media and other sites, and we expect our users to enjoy the same benefits.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/Image3_LoadProfile.DAKiEqnC_1NeB1L.webp&quot; alt=&quot;A screenshot of the Load Profile modal dialog with a primary site and URL to load entered&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1170&quot; height=&quot;920&quot;&gt;&lt;/p&gt;
&lt;p&gt;The interface for the interactive browser used to configure browser profiles has also been updated to improve the user experience.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/Image4_InteractiveBrowser.DcbOVr-2_1pDQAW.webp&quot; alt=&quot;A full-screen modal showing TikTok open in an embedded browser window, with various profile actions on a toolbar above&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1999&quot; height=&quot;1112&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;auto-updating-browser-profiles&quot;&gt;Auto-updating browser profiles&lt;/h3&gt;
&lt;p&gt;Starting in Browsertrix 1.20, browser profiles are automatically updated after each crawl with the profile data from that crawl. This ensures that crawls do not run with outdated browser profile data, and more closely matches what a real user browsing the site would look like to the site host. In our testing, this significantly reduced the frequency of logouts on some social media sites when crawl workflows using the profile run on a regular (e.g. daily or weekly) schedule.&lt;/p&gt;
&lt;h3 id=&quot;browser-profiles-and-crawling-proxies&quot;&gt;Browser profiles and crawling proxies&lt;/h3&gt;
&lt;p&gt;Previously, it was possible to create a browser profile with one crawling proxy, and then specify a different crawling proxy in the associated workflow. This could result in sub-optimal crawls or even getting blocked by websites.&lt;/p&gt;
&lt;p&gt;Starting in Browsertrix 1.20, if a crawling proxy is set on a browser profile, all crawl workflows that use that browser profile will automatically use the same crawling proxy. If you want to change the crawling proxy that a browser profile and its related crawl workflows use, you need only reset the browser profile and select a new crawling proxy. This keeps the proxy configuration consistent between a browser profile and the crawl workflows that use it.&lt;/p&gt;
&lt;h3 id=&quot;crawling-tab-refresh&quot;&gt;Crawling tab refresh&lt;/h3&gt;
&lt;p&gt;Browsertrix 1.20 also brings a revamped user interface for the Crawling page. We made a few changes here.&lt;/p&gt;
&lt;p&gt;First, while the Crawling page still defaults to showing all of your crawl workflows, we’ve now added a &lt;strong&gt;Crawl Runs&lt;/strong&gt; tab that lists all of the crawls run in your organization, including any failed and canceled crawls that have previously been difficult to find information about. Both the &lt;strong&gt;Workflows&lt;/strong&gt; and &lt;strong&gt;Crawl Runs&lt;/strong&gt; tabs also have expanded filtering and sorting options so it’s easier than ever to locate the exact crawl workflow or crawl run you are looking for.&lt;/p&gt;
&lt;figure&gt;&lt;p&gt;&lt;img src=&quot;/_astro/Image5_CrawlRuns.BxFSaoPy_19iqgo.webp&quot; alt=&quot;The new Crawl Runs tab on the Crawling page, showing various crawl runs. A filter is set to show only canceled and failed crawl runs.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1999&quot; height=&quot;1112&quot;&gt;
&lt;figcaption&gt;Easily access all of your crawl runs in one place, making it simpler to find failed and canceled crawls.&lt;/figcaption&gt;&lt;/p&gt;&lt;/figure&gt;
&lt;h3 id=&quot;host-specific-global-proxies&quot;&gt;Host-specific global proxies&lt;/h3&gt;
&lt;p&gt;Another change in Browsertrix 1.20 that is invisible to most users of our hosted service is that Browsertrix now supports setting global proxies that are applied only to specific hosts. In our hosted service, we are using this to always crawl YouTube through a specific proxy, which enables our users to crawl YouTube links and videos without the need for a logged-in browser profile.&lt;/p&gt;
&lt;p&gt;If a crawling proxy is set in a crawl workflow, that proxy will override the global host-specific proxies.&lt;/p&gt;
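&lt;p&gt;Putting the proxy rules from this release together, the precedence can be sketched as follows. This is an illustration of the behavior described above, not Browsertrix’s actual code; the function and proxy names are hypothetical.&lt;/p&gt;

```python
# Sketch of the stated precedence (hypothetical names, not Browsertrix code):
# a proxy set on the crawl workflow wins; otherwise a host-specific global
# proxy applies, if one is configured for that host.
def resolve_proxy(workflow_proxy, global_host_proxies, host):
    if workflow_proxy:
        return workflow_proxy
    return global_host_proxies.get(host)  # None if no host-specific proxy

host_proxies = {"youtube.com": "proxy-us-video"}  # hypothetical mapping

print(resolve_proxy(None, host_proxies, "youtube.com"))         # proxy-us-video
print(resolve_proxy("proxy-eu-1", host_proxies, "youtube.com"))  # proxy-eu-1
print(resolve_proxy(None, host_proxies, "example.com"))          # None
```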
&lt;h2 id=&quot;browsertrix-119&quot;&gt;Browsertrix 1.19&lt;/h2&gt;
&lt;h3 id=&quot;custom-cron-schedule-for-crawl-workflows&quot;&gt;Custom cron schedule for crawl workflows&lt;/h3&gt;
&lt;p&gt;Browsertrix has long supported scheduling crawl workflows to run at daily, weekly, or monthly intervals. In Browsertrix 1.19, we added the option to specify custom schedules. These schedules can be input using the Unix Cron syntax, providing our users the greatest possible flexibility in setting schedules. Some helpful macros are also supported, including &lt;code&gt;@yearly&lt;/code&gt; and &lt;code&gt;@hourly&lt;/code&gt;. More details and resources for working with the Cron syntax are available in the &lt;a href=&quot;https://docs.browsertrix.com/user-guide/workflow-setup/#cron-schedule&quot;&gt;User Guide&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;&lt;p&gt;&lt;img src=&quot;/_astro/Image6_CustomCronSchedule.DcZ7GdmH_y2n73.webp&quot; alt=&quot;The “Scheduling” section of a crawl workflow configuration, showing custom frequency options and a custom cron schedule entered&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1942&quot; height=&quot;1070&quot;&gt;
&lt;figcaption&gt;Set custom cron schedules for your crawl workflows for granular control over when your crawls run.&lt;/figcaption&gt;&lt;/p&gt;&lt;/figure&gt;
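&lt;p&gt;As a quick refresher on the syntax itself (standard Unix cron; see the User Guide for exactly what Browsertrix accepts), a cron expression has five whitespace-separated fields. The schedules below are illustrative examples:&lt;/p&gt;

```python
# The five whitespace-separated cron fields, in order:
#   minute  hour  day-of-month  month  day-of-week
examples = {
    "0 3 * * *":   "every day at 03:00",
    "30 6 * * 1":  "every Monday at 06:30",
    "0 0 1 * *":   "at midnight on the first of each month",
    "0 */6 * * *": "every six hours, on the hour",
}

for expr, meaning in examples.items():
    fields = expr.split()
    assert len(fields) == 5  # minute, hour, day-of-month, month, day-of-week
    print(f"{expr!r:>14}: {meaning}")
```

Macros like `@hourly` and `@yearly` replace the entire five-field expression rather than a single field.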
&lt;h3 id=&quot;filter-archived-items-by-tag&quot;&gt;Filter archived items by tag&lt;/h3&gt;
&lt;p&gt;We’ve added tag filters to the &lt;strong&gt;Archived Items&lt;/strong&gt; page, as part of a general strategy of improving our filtering and sorting options across Browsertrix in recent releases. This makes tags more useful in Browsertrix and should help you better organize your archived items and make finding crawls and uploads easier than ever.&lt;/p&gt;
&lt;h3 id=&quot;crawl-download-improvements&quot;&gt;Crawl download improvements&lt;/h3&gt;
&lt;p&gt;Previously, downloading crawls from the action menu’s &lt;strong&gt;Download Item&lt;/strong&gt; button or the &lt;strong&gt;Download All&lt;/strong&gt; button in a crawl’s &lt;strong&gt;WACZ Files&lt;/strong&gt; tab would always download a multi-WACZ, even if the crawl only contained a single WACZ file. This added an unnecessary layer of nesting to some downloads. Starting in Browsertrix 1.19, the Download Item and Download All options will only generate a multi-WACZ if the crawl has multiple WACZ files; otherwise, the crawl’s WACZ file will be downloaded as-is rather than repackaged as a multi-WACZ.</content:encoded><author>Tessa Walsh</author></item><item><title>Our statement on Conifer sunset announcement</title><link>https://webrecorder.net/blog/2025-12-18-conifer-twilight/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-12-18-conifer-twilight/</guid><description>Some thoughts on the Conifer sunset.</description><pubDate>Thu, 18 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;On December 15th, &lt;a href=&quot;https://rhizome.org/&quot;&gt;Rhizome&lt;/a&gt; announced the sunset, or &lt;a href=&quot;https://blog.conifer.rhizome.org/2025/12/15/twilight-announcement.html&quot;&gt;“twilight” of the Conifer service&lt;/a&gt;. The service will be taken down sometime around June 2026. As some of you know, Conifer was a previous-generation archiving service developed during my collaboration with Rhizome between 2016 and 2020.&lt;/p&gt;
&lt;p&gt;Having led the development of Conifer for many years, I find this announcement a bit bittersweet, but I understand this is the right move as technology evolves and Conifer’s approach to archiving is no longer sustainable. Dragan Espenschied, the Preservation Director at Rhizome, has coordinated closely with us around this process and I believe the result is a carefully planned platform wind-down.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://blog.conifer.rhizome.org/2025/12/15/twilight-announcement.html&quot;&gt;As explained in the post&lt;/a&gt;, Rhizome is offering many options for Conifer users, including self-hosting using &lt;a href=&quot;/replaywebpage&quot;&gt;ReplayWeb.page&lt;/a&gt;, and continued hosting by Rhizome (also via ReplayWeb.page). As part of this process, Rhizome will migrate all Conifer collections to WACZ files in the future, and we at Webrecorder will provide guidance throughout the process as needed.&lt;/p&gt;
&lt;p&gt;I’m especially excited to partner with Rhizome in inviting Conifer users to sign up for &lt;a href=&quot;/browsertrix&quot;&gt;Browsertrix&lt;/a&gt;. Once the migration to WACZ files is complete, Conifer users will be able to upload their collections into Browsertrix. Conifer users will also receive a discount code for Browsertrix for one year. Of course, this is only one of the options available to Conifer users, and no data will be transferred between Conifer and Browsertrix without users’ consent and opt-in.&lt;/p&gt;
&lt;p&gt;For Conifer users who’d like to continue to manually web archive in their own browsers, similar to how Conifer operated, we recommend the &lt;a href=&quot;/archivewebpage&quot;&gt;ArchiveWeb.page extension or desktop app&lt;/a&gt;. The extension can also be integrated with Browsertrix for hosting locally created WACZ files as uploads in Browsertrix.&lt;/p&gt;
&lt;p&gt;I believe Rhizome is offering a model example of how to wind down a long-running service with flexible options and plenty of advance notice to meet the varied needs of Conifer users. I want to especially thank Dragan Espenschied, as well as Mark Beasley, Rhizome’s former Senior Developer, who has kept Conifer running for the past five years, for preparing a careful transition process.&lt;/p&gt;
&lt;p&gt;As the final sunset/twilight date gets closer, we will continue to coordinate with Rhizome and hope to welcome interested Conifer users to Browsertrix in the future!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Browsertrix 1.18: Large URL Lists and Beautiful Emails</title><link>https://webrecorder.net/blog/2025-08-11-browsertrix-1-18/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-08-11-browsertrix-1-18/</guid><description>Browsertrix 1.18 brings support for large URL lists, new email templates, and UX improvements for crawling and curating.</description><pubDate>Mon, 11 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In the new version of Browsertrix we’re adding support for large URL lists and taking steps towards better, more consistent communication.&lt;/p&gt;
&lt;h2 id=&quot;large-url-lists&quot;&gt;Large URL Lists&lt;/h2&gt;
&lt;p&gt;A long-requested feature in Browsertrix has been support for crawling a very long list of URLs. While we’ve long supported large numbers of pages per crawl via crawl scopes that let the crawler discover pages, in some archiving workflows that doesn’t make sense. We’d limited the number of individual URLs you could enter on the workflow configuration page to 100, and some users were splitting crawls across multiple workflows just to get all of their pages crawled, which made other operations in Browsertrix unwieldy and occasionally a little broken.&lt;/p&gt;
&lt;p&gt;Now, you can upload or paste your huge URL list into Browsertrix and everything will just work! You can still enter URLs manually if you want, but if you have a large number of URLs you can upload a plain text file with one URL per line, and we’ll save it for you and use it in your workflow.&lt;/p&gt;
&lt;figure&gt;&lt;video autoplay muted=&quot;true&quot; playsinline loop disablepictureinpicture disableremoteplayback class=&quot;lazy aspect-[3024/1704] w-full rounded-md border border-brand-green/30 bg-white&quot;&gt;&lt;source data-src=&quot;/assets/video/seed-file-av1.mp4&quot; type=&quot;video/mp4; codecs=&amp;#34;av01.0.12M.08.0&amp;#34;&quot;/&gt;&lt;source data-src=&quot;/assets/video/seed-file-h264.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;&lt;figcaption&gt;&lt;p&gt;Upload a seed file to crawl a large number of URLs&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;For more details, check out our &lt;a href=&quot;https://docs.browsertrix.com/user-guide/workflow-setup/#list-of-pages&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
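&lt;p&gt;If you are generating a seed file programmatically, the expected format is simply a plain text file with one URL per line. Here is a small, hypothetical sketch in Python; the file name and URLs are made up:&lt;/p&gt;

```python
import os
import tempfile
from urllib.parse import urlparse

# Build a seed file in the format described above: plain text, one URL
# per line. (The path and URLs here are illustrative.)
urls = [f"https://example.com/reports/{year}" for year in range(2000, 2026)]

path = os.path.join(tempfile.gettempdir(), "seeds.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(urls) + "\n")

# Quick sanity check before uploading: every non-empty line parses as an
# http(s) URL.
with open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]
assert all(urlparse(line).scheme in ("http", "https") for line in lines)
print(f"{len(lines)} seed URLs written")  # 26 seed URLs written
```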
&lt;h2 id=&quot;pretty-emails&quot;&gt;Pretty Emails&lt;/h2&gt;
&lt;p&gt;As of now, emails for various account-related interactions with Browsertrix (such as new sign-up invitations, password resets, etc.) more clearly come from us, and are much easier to read. Welcome emails include clearer instructions for getting started, and updates related to your subscription will include more details.&lt;/p&gt;
&lt;figure&gt;&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-emails.DJaZmLZW_1eydfs.webp&quot; alt=&quot;Screenshot of an invite email in Apple Mail&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1688&quot; height=&quot;1589&quot;&gt;&lt;/p&gt;&lt;figcaption&gt;&lt;p&gt;One of the new email templates now in use&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;For a while now, these types of emails had been… functional, but very plain. We’d heard they were sometimes mistaken for spam, so we’ve built a new look for all of our transactional emails, and updates to our marketing and newsletter emails should follow soon. Implementing this posed some interesting technical challenges, detailed below if you’re interested.&lt;/p&gt;
&lt;h3 id=&quot;how-we-built-our-own-emails&quot;&gt;How We Built Our Own Emails&lt;/h3&gt;
&lt;p&gt;Our email templates were originally implemented as plain-text &lt;a href=&quot;https://jinja.palletsprojects.com/en/stable/&quot;&gt;Jinja templates&lt;/a&gt;. While this got the job done early on in Browsertrix’s development, it wasn’t ideal for a number of reasons: editing these templates was a tedious process that required a lot of trial and error, and styling emails is notoriously messy and difficult because of how few HTML and CSS features are consistently supported across email clients.&lt;/p&gt;
&lt;p&gt;When looking into alternatives, we initially considered other Python libraries that could more easily drop in as a replacement for the existing templates in the backend, but ultimately found that &lt;a href=&quot;https://react.email/&quot;&gt;React Email&lt;/a&gt; was the best maintained and easiest to use library available, despite being a JavaScript library. I started putting together a template for the welcome email, and within a few hours had templates for almost all of the other email types done as well.&lt;/p&gt;
&lt;p&gt;One of the nice things React Email does, especially compared to other libraries, is let us use style tokens from our design system &lt;a href=&quot;https://github.com/webrecorder/hickory&quot;&gt;Hickory&lt;/a&gt; as part of a Tailwind config. It’s got a bunch of built-in components that provide abstractions for some of the more frustrating aspects of email development, such as the style and layout constraints imposed by email clients, and a very simple API that can output both HTML and plain text from the same template. I originally intended to use it to just generate HTML that I’d then convert back to Jinja2 templates, but realized that with some of the more advanced templating I ended up using (e.g. date and time formatting; Python’s date and time formatting utilities don’t support the same types of formatting as the ones in JavaScript’s &lt;code&gt;Intl.DateTimeFormat&lt;/code&gt;) it would be easier to just use React Email directly. I put together a simple API server for generating email templates in both HTML and plain text along with subject lines, and got it all set up to deploy alongside the main Browsertrix application in the Helm chart.&lt;/p&gt;
&lt;p&gt;If you’d like to check out the changes, mess around with templates yourself, or contribute to the project, feel free to check out the codebase on &lt;a href=&quot;https://github.com/webrecorder/browsertrix&quot;&gt;GitHub&lt;/a&gt;. The email templating service lives in &lt;a href=&quot;https://github.com/webrecorder/browsertrix/tree/main/emails&quot;&gt;the &lt;code&gt;emails&lt;/code&gt; folder&lt;/a&gt;, and has instructions for running the very handy dev server that React Email provides.&lt;/p&gt;
&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next&lt;/h2&gt;
&lt;p&gt;There are a few more updates to email communication in the works, including some improvements to the onboarding and trial experience as well as improvements to styling and layout for marketing and newsletter emails. We’re also working hard on a few exciting crawler updates.&lt;/p&gt;
&lt;h2 id=&quot;other-updates&quot;&gt;Other Updates&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;You can now use some of our Quality Assurance tools without running QA Analysis on crawls! While we initially wanted users to try out our QA tools with QA Analysis, we’ve found the tooling we built to be useful even without it, so we’re making it available and more easily accessible from workflow detail pages. We’ve updated &lt;a href=&quot;https://docs.browsertrix.com/user-guide/quality-assurance/&quot;&gt;the documentation&lt;/a&gt; to reflect this as well; it’s worth another read even if you’ve already been using QA.&lt;/li&gt;
&lt;li&gt;We’ve added a crawler setting that will fail a crawl if the site you’re crawling logs you out. This will work on specific sites we’ve built detection for, which at the moment includes Facebook, Instagram, TikTok, and X. For more details, &lt;a href=&quot;https://docs.browsertrix.com/user-guide/workflow-setup/#fail-crawl-if-not-logged-in&quot;&gt;check out the docs&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As always, you can view the full list of changes on &lt;a href=&quot;https://github.com/webrecorder/browsertrix/releases/tag/v1.18.0&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;a data-astro-prefetch=&quot;true&quot; class=&quot;group/arrow-link inline&quot; href=&quot;/browsertrix/&quot;&gt;Sign up and start crawling with Browsertrix&lt;svg width=&quot;1em&quot; height=&quot;1em&quot; class=&quot;relative top-[3px] ml-1 inline-block size-4 align-baseline transition-transform duration-300 ease-out group-hover/arrow-link:translate-x-1&quot; data-icon=&quot;bi:arrow-right&quot;&gt;   &lt;symbol id=&quot;ai:bi:arrow-right&quot; viewBox=&quot;0 0 16 16&quot;&gt;&lt;path fill=&quot;currentColor&quot; fill-rule=&quot;evenodd&quot; d=&quot;M1 8a.5.5 0 0 1 .5-.5h11.793l-3.147-3.146a.5.5 0 0 1 .708-.708l4 4a.5.5 0 0 1 0 .708l-4 4a.5.5 0 0 1-.708-.708L13.293 8.5H1.5A.5.5 0 0 1 1 8&quot;/&gt;&lt;/symbol&gt;&lt;use href=&quot;#ai:bi:arrow-right&quot;&gt;&lt;/use&gt;  &lt;/svg&gt;&lt;/a&gt;</content:encoded><author>Emma Segal-Grossman</author></item><item><title>Browsertrix 1.17: Crawl Pause/Resume and Lower Numbers of Browser Windows</title><link>https://webrecorder.net/blog/2025-07-23-browsertrix-1-17/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-07-23-browsertrix-1-17/</guid><description>Crawl pause/resume and lower number of browser windows</description><pubDate>Wed, 23 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We are excited to announce the release of Browsertrix 1.17. This release includes some commonly-requested improvements to crawling, including the ability to pause and resume crawls, and the ability to crawl with only 1 or 2 browser windows.&lt;/p&gt;
&lt;h2 id=&quot;pausing-and-resuming-crawls&quot;&gt;Pausing and Resuming Crawls&lt;/h2&gt;
&lt;p&gt;A common piece of feedback we’ve received since launching Browsertrix is that it would be nice to be able to pause crawls and then later resume them from where they left off. We listened, and this feature is now available via a new &lt;em&gt;Pause&lt;/em&gt; button in the crawl workflow.&lt;/p&gt;
&lt;p&gt;After clicking the &lt;em&gt;Pause&lt;/em&gt; button, a running workflow will tidy up any remaining pages and upload WACZ files containing all of the data crawled so far. Once the workflow is successfully paused, you can replay all of the pages that have been crawled so far, download your WACZ files, and inspect the logs. You are then free to inspect the crawl thus far at your convenience for up to a week. If you forget to resume or stop the paused crawl within that week, Browsertrix will stop it for you, preserving all of your already-crawled data.&lt;/p&gt;
&lt;p&gt;Based on previous conversations with many of you, we anticipate pausing will be especially useful for conducting test crawls. Not sure how well a website will be captured with a given workflow’s settings? Start the crawl with its full scope, pause it after however many pages you want to use as your sample have been crawled, and then inspect the replay and logs to see if you’re happy with the results. If all looks good, you can simply resume your crawl and it will pick up right where it left off. If not, you can modify the workflow settings before resuming the crawl, or cancel the crawl without having used many of your execution minutes in the process.&lt;/p&gt;
&lt;p&gt;This functionality is made possible by the new &lt;em&gt;Latest Crawl&lt;/em&gt; crawl workflow tab, which consolidates several pre-existing tabs such as &lt;em&gt;Watch Crawl&lt;/em&gt; and &lt;em&gt;Logs&lt;/em&gt; into a simpler interface. &lt;em&gt;Latest Crawl&lt;/em&gt; displays information about the currently active crawl, or if the workflow is not currently running, the last crawl that was run from that workflow. This also means you can now replay a workflow’s latest crawl without needing to navigate away from the workflow!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-paused-crawl.5Kf4kVsD_Z2ezuWJ.webp&quot; alt=&quot;A screenshot of a paused crawl in Browsertrix, showing details about the crawl status, options to resume or cancel the crawl, and a replay viewer&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;2844&quot; height=&quot;1864&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;browser-windows&quot;&gt;Browser Windows&lt;/h2&gt;
&lt;p&gt;Another commonly requested feature is being able to crawl with a lower number of browser windows to avoid issues with sites that aggressively rate limit users. Now you can do just that. In the workflow editor, it’s now possible to configure crawls to run with 1, 2, or 3 browser windows, in addition to the multiples of 4 that were previously offered. In combination with other politeness settings such as Delay Before Next Page, this should help significantly with avoiding rate limiting and IP bans.&lt;/p&gt;
&lt;p&gt;For our users who primarily interact with Browsertrix via the REST API, you’ll note in the API documentation that the &lt;code&gt;scale&lt;/code&gt; field in the &lt;code&gt;/crawlconfigs/&lt;/code&gt; endpoints has been deprecated in favor of a new, simpler &lt;code&gt;browserWindows&lt;/code&gt; field, which overrides &lt;code&gt;scale&lt;/code&gt; when both are set and can configure workflows to use a number of browser windows lower than the increments available via &lt;code&gt;scale&lt;/code&gt;. Don’t worry: to avoid breaking existing tooling and integrations, we continue to support &lt;code&gt;scale&lt;/code&gt; when it is used for creating and updating crawl workflows.&lt;/p&gt;
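&lt;p&gt;To make the deprecation concrete, here is a hypothetical sketch of the two request bodies. Only the &lt;code&gt;scale&lt;/code&gt; and &lt;code&gt;browserWindows&lt;/code&gt; field names come from this post; the rest of the payload and the scale-to-windows mapping are assumptions for illustration, not the actual API schema:&lt;/p&gt;

```python
import json

# Hypothetical request bodies for the /crawlconfigs/ endpoints discussed
# above. Field names `scale` and `browserWindows` come from the post; the
# payload shape and the scale-to-windows fallback are assumptions.
legacy = {"name": "news-site", "scale": 2}            # older increment-based field
current = {"name": "news-site", "browserWindows": 2}  # exactly 2 browser windows

# Per the post, `browserWindows` overrides `scale` when both are present.
both = {**legacy, **current}
effective = both.get("browserWindows", both.get("scale", 1) * 4)  # assumed mapping
print(json.dumps(both), "->", effective, "windows")
```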
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-browser-windows.BYszkmlM_Z1kcFY8.webp&quot; alt=&quot;A screenshot of the browser settings panel in the crawl workflow settings page in Browsertrix, showing the number of browser windows to use with options ranging from 1 to 12&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1940&quot; height=&quot;718&quot;&gt;&lt;/p&gt;
&lt;a data-astro-prefetch=&quot;true&quot; class=&quot;group/arrow-link inline&quot; href=&quot;/browsertrix/&quot;&gt;Sign up and start crawling with Browsertrix&lt;svg width=&quot;1em&quot; height=&quot;1em&quot; class=&quot;relative top-[3px] ml-1 inline-block size-4 align-baseline transition-transform duration-300 ease-out group-hover/arrow-link:translate-x-1&quot; data-icon=&quot;bi:arrow-right&quot;&gt;   &lt;symbol id=&quot;ai:bi:arrow-right&quot; viewBox=&quot;0 0 16 16&quot;&gt;&lt;path fill=&quot;currentColor&quot; fill-rule=&quot;evenodd&quot; d=&quot;M1 8a.5.5 0 0 1 .5-.5h11.793l-3.147-3.146a.5.5 0 0 1 .708-.708l4 4a.5.5 0 0 1 0 .708l-4 4a.5.5 0 0 1-.708-.708L13.293 8.5H1.5A.5.5 0 0 1 1 8&quot;/&gt;&lt;/symbol&gt;&lt;use href=&quot;#ai:bi:arrow-right&quot;&gt;&lt;/use&gt;  &lt;/svg&gt;&lt;/a&gt;</content:encoded><author>Tessa Walsh</author></item><item><title>Create, Use, and Automate Actions With Custom Behaviors in Browsertrix</title><link>https://webrecorder.net/blog/2025-05-28-create-use-and-automate-actions-with-custom-behaviors-in-browsertrix/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-05-28-create-use-and-automate-actions-with-custom-behaviors-in-browsertrix/</guid><description>It is now easier than ever to automate custom page actions in Browsertrix.</description><pubDate>Wed, 28 May 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We are thrilled to introduce a new feature available starting in Browsertrix 1.15 that will be very exciting for our Browsertrix power users: support for &lt;strong&gt;custom behaviors&lt;/strong&gt; that let you automate in-page actions while crawling specific websites. You can now easily specify which custom behaviors to use directly in the crawl workflow editor. We’ve also updated our documentation to guide developers and advanced users in creating their own custom behaviors. 
Plus, we’ve added support for a new type of custom behavior that can be set up right in the Chrome DevTools, with no coding required.&lt;/p&gt;
&lt;h2 id=&quot;the-story-of-behaviors-in-browsertrix&quot;&gt;The Story of Behaviors in Browsertrix&lt;/h2&gt;
&lt;p&gt;A big part of Browsertrix’s promise of high-fidelity web archiving is its ability to automate actions inside real browsers during crawling. This is made possible through Behaviors: code or JSON documents (more on that later) that specify what actions the browser should take when visiting a web page.&lt;/p&gt;
&lt;p&gt;Behaviors themselves aren’t new to Browsertrix. In fact, Browsertrix and Browsertrix Crawler have supported built-in behaviors for several years. A number of these are &lt;strong&gt;background behaviors&lt;/strong&gt;, which quietly run on every web page, constantly checking for changes and taking action when needed. Some always run, like autoplay, which plays video and audio on the page to ensure it is captured. Others can be enabled or disabled in Browsertrix, like autoscroll, which scrolls down the page until it reaches the end or its timeout expires.&lt;/p&gt;
&lt;p&gt;Another type of built-in behavior that has long been supported in Browsertrix is &lt;strong&gt;site-specific behaviors&lt;/strong&gt;. These only run on particular websites and are designed to perform actions tailored to those sites. This includes our built-in behaviors for social media sites like Instagram, Twitter/X, Facebook, and TikTok. You can find more detailed information about built-in behaviors in the &lt;a href=&quot;https://crawler.docs.browsertrix.com/user-guide/behaviors/#built-in-behaviors&quot; target=&quot;_blank&quot;&gt;Browsertrix Crawler documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But here’s the exciting part: with this release, creating and using your own custom behaviors is easier than ever. And if you encounter a website that is tricky to crawl because it requires interactivity, you can now create and use your own behaviors immediately!&lt;/p&gt;
&lt;h2 id=&quot;browsertrix-support-for-behaviors&quot;&gt;Browsertrix Support for Behaviors&lt;/h2&gt;
&lt;p&gt;You’ll now find everything related to behaviors in the new &lt;a href=&quot;https://docs.browsertrix.com/user-guide/workflow-setup/#page-behavior&quot; target=&quot;_blank&quot;&gt;Page Behavior&lt;/a&gt; section of the crawl workflow editor. This update combines our new autoclick behavior and support for custom behaviors with existing settings like autoscroll, page timeouts, and delays (which were previously scattered across the workflow editor).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/page-behavior.COJmxxlY_ZMLSi.webp&quot; alt loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;2084&quot; height=&quot;1630&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;introducing-autoclick-behavior&quot;&gt;Introducing Autoclick Behavior&lt;/h3&gt;
&lt;p&gt;One exciting addition in the Page Behavior section is the &lt;strong&gt;autoclick&lt;/strong&gt; behavior. This built-in feature automatically clicks on elements in the page without navigating away. By default, this will click on anchor (&lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt;) tags, which can be useful for websites that use these anchor links in non-standard ways. For example, some sites use JavaScript in place of the standard &lt;code&gt;href&lt;/code&gt; attribute to create a hyperlink, while others use &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; tags in place of &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;s to reveal in-page content.&lt;/p&gt;
&lt;p&gt;Need it to click on something other than links, like all the &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt; elements on a page? No problem! Just specify a different CSS selector for autoclick directly in the workflow editor.&lt;/p&gt;
&lt;h3 id=&quot;custom-behaviors&quot;&gt;Custom Behaviors&lt;/h3&gt;
&lt;p&gt;Want to use new and existing custom behaviors in your crawls? Starting in Browsertrix 1.15, you can now specify custom behaviors to use in crawl workflows by pointing to behavior files that are hosted at any public URL or in a public Git repository. This means you can not only create and use your own custom behaviors in your crawls, but also tap into the Browsertrix community’s shared behaviors.&lt;/p&gt;
&lt;h2 id=&quot;how-to-create-custom-behaviors&quot;&gt;How To Create Custom Behaviors&lt;/h2&gt;
&lt;p&gt;Adding support for custom behaviors in Browsertrix is just one part of the solution. We also want to make it easier for you to create them. That’s why we’ve created &lt;a href=&quot;https://crawler.docs.browsertrix.com/user-guide/behaviors/#creating-custom-behaviors&quot; target=&quot;_blank&quot;&gt;new documentation&lt;/a&gt; that walks you through two ways to build custom behaviors: &lt;strong&gt;JavaScript behaviors&lt;/strong&gt; and &lt;strong&gt;JSON Flow behaviors&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Traditionally, custom behaviors in Browsertrix have been written in JavaScript. This approach is still the most flexible and powerful, but it does require coding skills. For developers, our updated documentation covers how to make a JavaScript behavior, including an overview of the expected format, as well as important references and helpful suggestions.&lt;/p&gt;
&lt;p&gt;We’ve also added a much more accessible option that doesn’t require writing a single line of code: JSON Flow behaviors. Thanks to Chrome’s built-in &lt;a href=&quot;https://developer.chrome.com/docs/devtools/recorder&quot; target=&quot;_blank&quot;&gt;DevTools Recorder&lt;/a&gt;, you can simply record your actions on a webpage: click around and interact with the content, and when you’re done, export the recording as a JSON file. Upload that file somewhere with a public URL, like a &lt;a href=&quot;https://gist.github.com/&quot; target=&quot;_blank&quot;&gt;GitHub Gist&lt;/a&gt;, &lt;a href=&quot;https://pastebin.com/&quot; target=&quot;_blank&quot;&gt;Pastebin&lt;/a&gt;, or a public Git repository, and you’re ready to go! Just point your crawl workflows to that JSON file and Browsertrix will replay your recorded actions on that page automatically while crawling.&lt;/p&gt;
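&lt;p&gt;To give a sense of what these recordings look like, here is a minimal, hypothetical example of the JSON that a DevTools Recorder export produces; the title, URL, and selectors below are placeholders:&lt;/p&gt;

```json
{
  "title": "Example recording",
  "steps": [
    {
      "type": "navigate",
      "url": "https://example.org/gallery"
    },
    {
      "type": "click",
      "target": "main",
      "selectors": [["aria/Next page"], ["a.next"]]
    }
  ]
}
```

&lt;p&gt;Each entry in &lt;code&gt;steps&lt;/code&gt; is one recorded action, replayed in order while the page is being crawled.&lt;/p&gt;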
&lt;p&gt;For visual learners, we recommend checking out the following YouTube video, which demonstrates how to use the DevTools Recorder and download your recording as a JSON file:&lt;/p&gt;
&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/PhQX0MiSGeA?si=UjszP74KlQmoOqI0&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;Browsertrix will even &lt;a href=&quot;https://crawler.docs.browsertrix.com/user-guide/behaviors/#user-flow-extensions&quot; target=&quot;_blank&quot;&gt;extend some of the actions in your JSON Flow behavior&lt;/a&gt;. For example, if it detects that you repeated an action (like clicking “Next” in a paginated list) more than three times, it will keep repeating that step until it can no longer do so successfully.&lt;/p&gt;
&lt;p&gt;Of course, for more complex behaviors that involve loops, such as scrolling through and loading comment threads on a social media site, or other complicated actions, JavaScript behaviors will still be the go-to solution. But we are happy to offer a simpler and more accessible way that lowers the barrier to entry for anyone wanting to create custom behaviors. &lt;/p&gt;
&lt;p&gt;Behaviors: one more way Browsertrix makes it easy to capture the web exactly the way you want.&lt;/p&gt;</content:encoded><author>Tessa Walsh</author></item><item><title>Our Commitment to Provide Accessible Tools</title><link>https://webrecorder.net/blog/2025-04-30-our-commitment-to-provide-accessible-tools/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-04-30-our-commitment-to-provide-accessible-tools/</guid><description>This is how Webrecorder supports archivists at a time when cultural organizations are navigating challenges with funding, budgets, and limited staff.</description><pubDate>Wed, 30 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Funding for cultural institutions has been a constant uphill battle for decades. No matter the organization’s size, from small community archives to local public libraries and even larger national libraries and universities, we have all encountered similar funding challenges. Thankfully, budget cuts and limited staff support have not stopped archivists, librarians, and web archivists from creating new pathways forward. Archivists across the globe continue to take unique approaches to maximize budgets to cover costs for various archives initiatives, especially web archiving.&lt;/p&gt;
&lt;p&gt;As communities continue to create and publish content online, the scope of web archiving initiatives has grown. But even with funding challenges, budget cuts, and staffing uncertainty, our wide range of users in the Webrecorder community are tapping into our robust resources and tools. &lt;/p&gt;
&lt;h2 id=&quot;our-full-suite-of-tools&quot;&gt;Our Full Suite of Tools&lt;/h2&gt;
&lt;p&gt;Since 2014, Webrecorder has been committed to developing a full suite of tools to empower users across the globe to archive, curate, and replay content captured on the web. Our team continues to create and update these tools, prioritizing multiple access points for users to capture complex, interactive content on the web, including dynamic websites, social media platforms, and content behind paywalls.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/Blog-Webrecorder-Commitment-Tools.CLJ4gnbb_5a3vu.webp&quot; alt loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1500&quot; height=&quot;500&quot;&gt;&lt;/p&gt;
&lt;p&gt;We encourage you to take advantage of our various tools to accomplish your web archiving goals; no matter your budget, we have solutions accessible to you: &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://webrecorder.net/browsertrix/&quot;&gt;&lt;strong&gt;Browsertrix Hosted Service&lt;/strong&gt;&lt;/a&gt; is our browser-based, high-fidelity crawling tool, with innovative quality assurance and collaborative organization that empowers users to preserve, curate, and share archived web content confidently. Users can curate and publish unlimited collections, capture content behind paywalls/logins, and work collaboratively on the same multi-user platform. Whether you invite your organization and colleagues as Admins, Crawlers, or Viewers, our tool is designed to encourage collaborative workflows, bringing departments out of silos. You can access a free 7-day trial &lt;a href=&quot;https://webrecorder.net/browsertrix/#get-started&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;People can also &lt;a href=&quot;https://github.com/webrecorder/browsertrix&quot;&gt;access the open source code to deploy Browsertrix locally&lt;/a&gt; and connect directly with users worldwide in the &lt;a href=&quot;https://forum.webrecorder.net/&quot;&gt;Webrecorder forum online&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://ArchiveWeb.page&quot;&gt;&lt;strong&gt;ArchiveWeb.page&lt;/strong&gt;&lt;/a&gt; is a free Google Chrome browser extension that allows users to create high-fidelity web archives directly in their browser. When enabled, &lt;a href=&quot;http://ArchiveWeb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt; will record the network traffic on a given tab and store the data in the browser for later viewing. Archives created with &lt;a href=&quot;http://ArchiveWeb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt; can be viewed from right within the app or using Webrecorder’s free &lt;a href=&quot;http://ReplayWeb.page&quot;&gt;ReplayWeb.page&lt;/a&gt; viewer. Files can be exported in standard WARC and WACZ formats.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://webrecorder.net/replaywebpage/&quot;&gt;&lt;strong&gt;ReplayWeb.page&lt;/strong&gt;&lt;/a&gt; is a free, browser-based tool that allows users to view archived items. Features also include full text search, Flash support, and Google Drive support.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/webrecorder/pywb&quot;&gt;&lt;strong&gt;pywb&lt;/strong&gt;&lt;/a&gt;, the oldest of Webrecorder’s tools, is a free and open source web archive replay system, or “wayback machine”. At its core, pywb provides a calendar-based interface for exploring and replaying web archive collections. It also includes some related functionality, such as a recording mode for creating web archives, as well as a proxy mode for serving web archives. More information on pywb’s features can be found in the &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are an organization or individual archivist that needs support, feel free to send an email to &lt;a href=&quot;mailto:info@webrecorder.org&quot;&gt;info@webrecorder.org&lt;/a&gt;. If you’d like access to our pricing plans for larger crawling projects, schedule a meeting with us &lt;a href=&quot;https://calendly.com/c-lawrence&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can also explore our &lt;a href=&quot;https://webrecorder.net/resources/&quot;&gt;Resources page&lt;/a&gt; for more information on our tools, forums, presentations, and publications about Webrecorder!&lt;/p&gt;</content:encoded><author>Camille Lawrence</author></item><item><title>Our New Resources Page for Web Archivists</title><link>https://webrecorder.net/blog/2025-04-16-our-new-resources-page-for-web-archivists/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-04-16-our-new-resources-page-for-web-archivists/</guid><description>A growing hub for learning, exploring, and diving deeper into web archiving.</description><pubDate>Wed, 16 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We have launched a brand new &lt;a href=&quot;https://webrecorder.net/resources/&quot;&gt;&lt;strong&gt;Resources page&lt;/strong&gt;&lt;/a&gt; where you can learn more about web archiving, how our tools work, and how people are using them across different fields.&lt;/p&gt;
&lt;p&gt;Our goal with this evolving hub is to support and inspire everyone in the web archiving community — whether you’re just getting started or already deep in the work. We’ve gathered helpful guides, reference materials, presentations, and research, all in one spot: &lt;a href=&quot;http://webrecorder.net/resources&quot;&gt;webrecorder.net/resources&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/Blog-Resources-Page01.CZA8WTcX_1mt0s8.webp&quot; alt loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1200&quot; height=&quot;977&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;what-youll-find&quot;&gt;What You’ll Find&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Introduction to Web Archiving:&lt;/strong&gt; An overview of what it is, why it matters, and how our tools fit in.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Glossary of Terms:&lt;/strong&gt; A go-to reference for common web archiving terms and concepts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Formats and Specs:&lt;/strong&gt; Technical documentation for formats like WACZ and CDXJ.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Talks and Webinars:&lt;/strong&gt; Recordings of past presentations, demos, webinars, and more.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Academic &amp;amp; Press Mentions:&lt;/strong&gt; A collection of research papers and media pieces featuring our tools in action.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;send-us-your-feedback&quot;&gt;Send Us Your Feedback&lt;/h2&gt;
&lt;p&gt;Whether you’re here to learn, teach, build, or contribute, this section is designed to grow with you and the wider community.&lt;/p&gt;
&lt;p&gt;If you want to see something else included, email us at &lt;a href=&quot;mailto:info@webrecorder.org&quot;&gt;info@webrecorder.org&lt;/a&gt;. We’d love to hear from you!&lt;/p&gt;</content:encoded><author>Webrecorder Team</author></item><item><title>Introducing GovArchive.us &amp; Mirroring Entire Sites with Web Archives</title><link>https://webrecorder.net/blog/2025-03-25-govarchive-us-and-mirroring-sites-with-web-archives/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-03-25-govarchive-us-and-mirroring-sites-with-web-archives/</guid><description>Introducing GovArchive.us and tooling to mirror web sites using web archives.</description><pubDate>Tue, 25 Mar 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We’re excited to announce the launch of &lt;strong&gt;&lt;a href=&quot;https://govarchive.us&quot; target=&quot;_blank&quot;&gt;GovArchive.us&lt;/a&gt;&lt;/strong&gt;, a dedicated site for exploring our &lt;a href=&quot;https://app.browsertrix.com/explore/usgov-archive&quot; target=&quot;_blank&quot;&gt;US Government Web Archive&lt;/a&gt; on Browsertrix. The project also introduces a &lt;strong&gt;brand new approach&lt;/strong&gt; for viewing web archives: the &lt;strong&gt;ability to host a full-site “mirror” from any web archive&lt;/strong&gt;, keeping original links intact while hosting them on a new domain.&lt;/p&gt;
&lt;p&gt;One example of this is our archived version of the previous &lt;a href=&quot;https://usaid.gov&quot; target=&quot;_blank&quot;&gt;usaid.gov&lt;/a&gt; website, which is now accessible at &lt;a href=&quot;https://usaid.govarchive.us&quot; target=&quot;_blank&quot;&gt;usaid.govarchive.us&lt;/a&gt;. Unlike traditional web archive replay, this “mirror” archive preserves the original URL structure, making the site as easy to navigate and reference as the original site. For instance, the archived version of a page originally hosted at &lt;strong&gt;&lt;a href=&quot;https://usaid.gov/about-us/mission-vision-values&quot;&gt;https://usaid.gov/about-us/mission-vision-values&lt;/a&gt;&lt;/strong&gt; can be viewed at &lt;a href=&quot;https://usaid.govarchive.us/about-us/mission-vision-values&quot; target=&quot;_blank&quot;&gt;https://usaid.govarchive.us/about-us/mission-vision-values&lt;/a&gt;, by simply replacing the domain &lt;em&gt;usaid.gov&lt;/em&gt; with &lt;em&gt;usaid.govarchive.us&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;We’ve reserved the &lt;code&gt;*.govarchive.us&lt;/code&gt; domain and subdomains to be able to dynamically add more archives of US Government sites from our collections to this system.&lt;/p&gt;
&lt;h2 id=&quot;what-is-available-now&quot;&gt;What is Available Now?&lt;/h2&gt;
&lt;p&gt;Here’s a selection of a few ‘mirror’ sites that we have available from govarchive.us. Each &lt;strong&gt;mirror is a static site&lt;/strong&gt; that loads an archived version from our collection, hosted on a dedicated domain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://usaid.govarchive.us/&quot; target=&quot;_blank&quot;&gt;usaid.govarchive.us&lt;/a&gt; as a mirror of &lt;a href=&quot;https://usaid.gov&quot; target=&quot;_blank&quot;&gt;usaid.gov&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://cdc.govarchive.us/&quot; target=&quot;_blank&quot;&gt;cdc.govarchive.us&lt;/a&gt; as a mirror of &lt;a href=&quot;https://cdc.gov/&quot; target=&quot;_blank&quot;&gt;cdc.gov&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://fema.govarchive.us/&quot; target=&quot;_blank&quot;&gt;fema.govarchive.us&lt;/a&gt; as a mirror of &lt;a href=&quot;https://fema.gov/&quot; target=&quot;_blank&quot;&gt;fema.gov&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://epa.govarchive.us/&quot; target=&quot;_blank&quot;&gt;epa.govarchive.us&lt;/a&gt; as a mirror of &lt;a href=&quot;https://epa.gov/&quot; target=&quot;_blank&quot;&gt;epa.gov&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://climate.govarchive.us/&quot; target=&quot;_blank&quot;&gt;climate.govarchive.us&lt;/a&gt; as a mirror of &lt;a href=&quot;https://climate.gov/&quot; target=&quot;_blank&quot;&gt;climate.gov&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://govarchive.us/&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;/_astro/govarchiveus-screenshot.DyL4inKi_Z2omyLe.webp&quot; alt=&quot;Screenshot of GovArchive.us&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1920&quot; height=&quot;1080&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Check &lt;a href=&quot;https://govarchive.us&quot; target=&quot;_blank&quot;&gt;GovArchive.us&lt;/a&gt; for an up-to-date list as we add more mirrors from our archives!&lt;/p&gt;
&lt;h2 id=&quot;mirroring-sites-with-web-archives--getting-started&quot;&gt;Mirroring Sites with Web Archives — Getting Started&lt;/h2&gt;
&lt;p&gt;This approach can be used by anyone to &lt;strong&gt;mirror a dynamic website&lt;/strong&gt; hosted as a &lt;strong&gt;static site powered by web archives!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you run a particular domain, you can set up a web archive as a static site, and point the domain to the static version of the site instead!&lt;/p&gt;
&lt;p&gt;Or, you can host a mirror elsewhere, as we have done. This can be used to migrate off costly or obsolete infrastructure, while still preserving a site at the highest fidelity!&lt;/p&gt;
&lt;p&gt;We provide the following template to get started with a single site mirror created from a web archive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/webrecorder/web-archive-site-mirror&quot; target=&quot;_blank&quot;&gt;&lt;span class=&quot;inline-flex align-icon leading-[0] place-content-center bg-white rounded ring-1 p-0.5 ring-stone-400/25 shadow&quot;&gt; &lt;svg width=&quot;1em&quot; height=&quot;1em&quot; class=&quot;inline-block size-4&quot; data-icon=&quot;bi:github&quot;&gt;   &lt;symbol id=&quot;ai:bi:github&quot; viewBox=&quot;0 0 16 16&quot;&gt;&lt;path fill=&quot;currentColor&quot; d=&quot;M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59c.4.07.55-.17.55-.38c0-.19-.01-.82-.01-1.49c-2.01.37-2.53-.49-2.69-.94c-.09-.23-.48-.94-.82-1.13c-.28-.15-.68-.52-.01-.53c.63-.01 1.08.58 1.23.82c.72 1.21 1.87.87 2.33.66c.07-.52.28-.87.51-1.07c-1.78-.2-3.64-.89-3.64-3.95c0-.87.31-1.59.82-2.15c-.08-.2-.36-1.02.08-2.12c0 0 .67-.21 2.2.82c.64-.18 1.32-.27 2-.27s1.36.09 2 .27c1.53-1.04 2.2-.82 2.2-.82c.44 1.1.16 1.92.08 2.12c.51.56.82 1.27.82 2.15c0 3.07-1.87 3.75-3.65 3.95c.29.25.54.73.54 1.48c0 1.07-.01 1.93-.01 2.2c0 .21.15.46.55.38A8.01 8.01 0 0 0 16 8c0-4.42-3.58-8-8-8&quot;/&gt;&lt;/symbol&gt;&lt;use href=&quot;#ai:bi:github&quot;&gt;&lt;/use&gt;  &lt;/svg&gt; &lt;/span&gt; Web Archive Site Mirror Template&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using the above template, you can host your own web archive mirror entirely on GitHub Pages!&lt;/p&gt;
&lt;h2 id=&quot;how-it-works-govarchive-and-wildcard-subdomains&quot;&gt;How it works: GovArchive and Wildcard Subdomains&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;GovArchive.us&lt;/em&gt; demonstrates a more complex setup with wildcard subdomains.&lt;/p&gt;
&lt;p&gt;We’ve set up a wildcard DNS record that points any &lt;code&gt;*.govarchive.us&lt;/code&gt; host to a static site.
(For this, we use Bunny CDN, as GitHub Pages does not support wildcard subdomains pointing to the same repo.)&lt;/p&gt;
&lt;p&gt;Then, we dynamically choose the correct site to mirror in the browser based on the subdomain. A specific Browsertrix collection is chosen based on the current subdomain, allowing for maximum flexibility to add more collections.&lt;/p&gt;
&lt;p&gt;Nested subdomains are flattened by replacing each &lt;code&gt;.&lt;/code&gt; with a &lt;code&gt;-&lt;/code&gt;, so &lt;code&gt;more.subdomains.example.gov&lt;/code&gt; collapses to a single subdomain level, &lt;code&gt;more-subdomains-example.govarchive.us&lt;/code&gt;, allowing a single wildcard SSL cert to cover every mirror.
For example, &lt;a href=&quot;https://nca2023-globalchange.govarchive.us/&quot; target=&quot;_blank&quot;&gt;nca2023-globalchange.govarchive.us&lt;/a&gt; mirrors &lt;a href=&quot;https://nca2023.globalchange.gov&quot; target=&quot;_blank&quot;&gt;nca2023.globalchange.gov&lt;/a&gt;.&lt;/p&gt;
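&lt;p&gt;As an illustrative sketch (not the actual GovArchive.us code, and the function name is our own), the flattening rule can be expressed in a few lines of JavaScript:&lt;/p&gt;

```javascript
// Hypothetical sketch of the subdomain-flattening rule described above:
// strip the trailing ".gov", replace each remaining "." with "-", and
// append the mirror domain, so one wildcard SSL cert covers every mirror.
function flattenGovHost(host) {
  const label = host.replace(/\.gov$/, "").replace(/\./g, "-");
  return label + ".govarchive.us";
}

console.log(flattenGovHost("nca2023.globalchange.gov"));
// "nca2023-globalchange.govarchive.us"
```

&lt;p&gt;Note that this mapping is ambiguous when an original hostname label already contains a &lt;code&gt;-&lt;/code&gt;, which is one reason the static site maps each flattened subdomain to a specific Browsertrix collection rather than reversing the transformation.&lt;/p&gt;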
&lt;p&gt;With GovArchive.us, we also provide a custom banner and loading screen. If the archive is already initialized, it will load right away;
otherwise, the bootstrap script runs and a loading screen is shown while the service worker is being initialized.
Finally, the top-level site just provides a landing page index, hosted in a different repo.&lt;/p&gt;
&lt;p&gt;As always, the whole thing is open source, and further details are available on our GitHub repos:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/webrecorder/govarchive-replay&quot; target=&quot;_blank&quot;&gt;&lt;span class=&quot;inline-flex align-icon leading-[0] place-content-center bg-white rounded ring-1 p-0.5 ring-stone-400/25 shadow&quot;&gt; &lt;svg width=&quot;1em&quot; height=&quot;1em&quot; viewBox=&quot;0 0 16 16&quot; class=&quot;inline-block size-4&quot; data-icon=&quot;bi:github&quot;&gt;   &lt;use href=&quot;#ai:bi:github&quot;&gt;&lt;/use&gt;  &lt;/svg&gt; &lt;/span&gt; GovArchive.us Replay and Content&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/webrecorder/govarchive-us-index&quot; target=&quot;_blank&quot;&gt;&lt;span class=&quot;inline-flex align-icon leading-[0] place-content-center bg-white rounded ring-1 p-0.5 ring-stone-400/25 shadow&quot;&gt; &lt;svg width=&quot;1em&quot; height=&quot;1em&quot; viewBox=&quot;0 0 16 16&quot; class=&quot;inline-block size-4&quot; data-icon=&quot;bi:github&quot;&gt;   &lt;use href=&quot;#ai:bi:github&quot;&gt;&lt;/use&gt;  &lt;/svg&gt; &lt;/span&gt; GovArchive.us Landing Index Page&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The replay itself is provided with our low-level browser-based replay engine, &lt;a href=&quot;https://github.com/webrecorder/wabac.js&quot; target=&quot;_blank&quot;&gt;wabac.js&lt;/a&gt;, which is also used in &lt;a href=&quot;http://ReplayWeb.page&quot; target=&quot;_blank&quot;&gt;ReplayWeb.page&lt;/a&gt;. (In the future, the mirror capability may be added to ReplayWeb.page itself).&lt;/p&gt;
&lt;p&gt;We hope GovArchive.us provides a much needed resource, as well as an example of how web archive-powered site mirrors can be done at scale.&lt;/p&gt;
&lt;p&gt;If you need help setting up your own web archive mirror, reach out and we’d be happy to support your efforts!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Introducing Public Collections in Browsertrix</title><link>https://webrecorder.net/blog/2025-03-05-public-collections/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-03-05-public-collections/</guid><description>Now you can curate, personalize, and share all your crawls in one place.</description><pubDate>Wed, 05 Mar 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;At Webrecorder, we’re always working to improve &lt;a href=&quot;/browsertrix&quot;&gt;Browsertrix&lt;/a&gt;, our high-fidelity web archiving service. With our latest release (1.14), we’re excited to introduce major improvements to our collections feature, including the Public Collections Gallery, a new way to personalize, curate, and share your very own web archives with the world!&lt;/p&gt;
&lt;p&gt;Collections provide a way to dynamically combine and group multiple individual crawls and uploads into a contextual, unified archive replay experience.&lt;/p&gt;
&lt;p&gt;With this release, you can now curate and showcase collections on a public gallery page for your organization, customizing thumbnails, titles, descriptions, and more.
You can also choose to make collections fully downloadable, and allow embedding of collections in other websites using &lt;a href=&quot;https://replayweb.page&quot;&gt;ReplayWeb.page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The public collections gallery is available to all users of Browsertrix, and we’re excited to get your feedback on this feature!&lt;/p&gt;
&lt;h2 id=&quot;us-government-web-archive&quot;&gt;US Government Web Archive&lt;/h2&gt;
&lt;p&gt;We previously shared &lt;a href=&quot;/blog/2025-02-06-preserving-government-websites-with-browsertrix/&quot;&gt;our work on archiving federal US government websites in collaboration with others (end-of-term crawling)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are now excited to share our first batch of public collections, available in our
&lt;a href=&quot;https://app.browsertrix.com/explore/usgov-archive&quot;&gt;US Gov Web Archive&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;As we add more US government collections to this archive, they will appear at this URL — check back for more updates!&lt;/p&gt;
&lt;video autoplay muted=&quot;true&quot; playsinline loop disablepictureinpicture disableremoteplayback class=&quot;lazy aspect-[3024/1886] w-full rounded-md border border-brand-green/30 bg-white&quot;&gt;&lt;source data-src=&quot;/assets/video/collections-showcase-av1.mp4&quot; type=&quot;video/mp4; codecs=&amp;#34;av01.0.12M.08.0&amp;#34;&quot;/&gt;&lt;source data-src=&quot;/assets/video/collections-showcase-h264.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;
&lt;hr/&gt;
&lt;p&gt;Of course, we also continue to support &lt;em&gt;unlisted&lt;/em&gt; collections, only viewable to those with a specific collection URL,
as well as fully private collections, available only to logged-in users (the default option).&lt;/p&gt;
&lt;p&gt;Browsertrix now also includes a private collections gallery, available to all members of your archiving organization.&lt;/p&gt;
&lt;h3 id=&quot;start-using-public-collections-today&quot;&gt;Start Using Public Collections Today&lt;/h3&gt;
&lt;p&gt;Using Browsertrix already? Check out our new walkthrough &lt;strong&gt;&lt;a href=&quot;https://docs.browsertrix.com/user-guide/public-collections-gallery/&quot; target=&quot;_blank&quot;&gt;Enabling Public Collections Gallery&lt;/a&gt;&lt;/strong&gt; in the Browsertrix docs for detailed steps and additional videos on how to enable public collections for your organization.&lt;/p&gt;
&lt;p&gt;Not yet using Browsertrix? You can &lt;a href=&quot;https://webrecorder.net/browsertrix/#get-started&quot; target=&quot;_blank&quot;&gt;sign-up for a free trial today&lt;/a&gt; and test out this feature for yourself!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;If you have feedback, questions, or more ideas on how to make Browsertrix even better for you, drop us a line at &lt;a href=&quot;mailto:info@webrecorder.net&quot;&gt;info@webrecorder.net&lt;/a&gt; or join the conversation on &lt;a href=&quot;https://bsky.app/profile/webrecorder.net&quot;&gt;Bluesky&lt;/a&gt;, &lt;a href=&quot;https://digipres.club/@webrecorder&quot;&gt;Mastodon&lt;/a&gt;, and &lt;a href=&quot;https://x.com/webrecorder_io&quot;&gt;X.com&lt;/a&gt;.&lt;/p&gt;</content:encoded><author>Webrecorder Team</author></item><item><title>Preserving Government Websites with Browsertrix</title><link>https://webrecorder.net/blog/2025-02-06-preserving-government-websites-with-browsertrix/</link><guid isPermaLink="true">https://webrecorder.net/blog/2025-02-06-preserving-government-websites-with-browsertrix/</guid><description>A collaborative effort alongside the End of Term Web Archive to capture and save history. </description><pubDate>Thu, 06 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Update 2025-03-07: This work is now publicly available on Browsertrix as the &lt;a href=&quot;https://app.browsertrix.com/explore/usgov-archive&quot; target=&quot;_blank&quot;&gt;Webrecorder US Gov Archive&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At Webrecorder, we’re dedicated to making web archiving easy and accessible for everyone. We believe that preserving digital history is essential — especially when it comes to vital records of public information, such as government websites. That’s why we’re proud to be one of the partners in the &lt;a href=&quot;https://eotarchive.org/&quot;&gt;&lt;strong&gt;End of Term Web Archive&lt;/strong&gt;&lt;/a&gt; (EOT) effort to capture these sites and keep them accessible, using our own high-fidelity web archiving service, &lt;a href=&quot;/browsertrix&quot;&gt;&lt;strong&gt;Browsertrix&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;why-the-end-of-term-web-archive-matters&quot;&gt;Why the End of Term Web Archive Matters&lt;/h2&gt;
&lt;p&gt;Every four years, as the U.S. transitions into a new presidential term, government websites change — some are updated, some move, and others disappear entirely. The EOT partners work to safeguard this history, ensuring these sites are archived before they are lost. This effort has special urgency this time, given the extensive changes and deletion of federal government websites.&lt;/p&gt;
&lt;p&gt;We have selected sites that were nominated by EOT crawl participants, submitted by users through their URL nomination tool, and contributed by other partners like &lt;a href=&quot;https://commoncrawl.org/&quot;&gt;The Common Crawl Foundation (CCF)&lt;/a&gt;. Our focus has been on identifying and crawling high-risk federal content, such as environmental justice, healthcare, and climate change, as well as other content vulnerable to removal, like LGBTQIA+ sites.&lt;/p&gt;
&lt;h2 id=&quot;how-webrecorder-ensures-accurate-archiving&quot;&gt;How Webrecorder Ensures Accurate Archiving&lt;/h2&gt;
&lt;p&gt;Our tools at Webrecorder go beyond simple static archives — they are designed to &lt;strong&gt;preserve modern, interactive websites exactly as they appear&lt;/strong&gt;, handling everything from dynamic content, maps, and dashboards, to complex graphics. With Browsertrix, we can capture everything in high fidelity and ensure that archived government websites can be presented as accurately as possible, and navigated as they originally existed. To achieve this, we set up a dedicated space to run crawls of the selected federal government websites.&lt;/p&gt;
&lt;video autoplay muted=&quot;true&quot; playsinline loop disablepictureinpicture disableremoteplayback class=&quot;lazy aspect-[3020/1642] rounded-md border border-brand-green/30 bg-white&quot;&gt;&lt;source data-src=&quot;/assets/video/eot-noaa-av1.mp4&quot; type=&quot;video/mp4; codecs=&amp;#34;av01.0.12M.08.0&amp;#34;&quot;/&gt;&lt;source data-src=&quot;/assets/video/eot-noaa-h264.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;
&lt;p&gt;Additionally, we are using our browser extension &lt;a href=&quot;http://ArchiveWeb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt; to augment high-fidelity crawls with manual archiving of difficult-to-archive and highly interactive content. Both manual and automated captures are then merged into collections hosted with our Browsertrix service.&lt;/p&gt;
&lt;video autoplay muted=&quot;true&quot; playsinline loop disablepictureinpicture disableremoteplayback class=&quot;lazy aspect-[2234/1924] rounded-md border border-brand-green/30 bg-white&quot;&gt;&lt;source data-src=&quot;/assets/video/eot-fa-av1.mp4&quot; type=&quot;video/mp4; codecs=&amp;#34;av01.0.12M.08.0&amp;#34;&quot;/&gt;&lt;source data-src=&quot;/assets/video/eot-fa-h264.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;
&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next?&lt;/h2&gt;
&lt;p&gt;Webrecorder is honored to support the EOT effort in this large-scale preservation task, and we hope to share the results of our EOT crawling in the near future. All of the data will be publicly accessible as part of the EOT initiative on &lt;a href=&quot;http://eotarchive.org&quot;&gt;eotarchive.org&lt;/a&gt;, and if you have other pages to contribute, you can submit them &lt;a href=&quot;https://eotarchive.org/contribute/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Browsertrix 1.13: The Translations and Internationalization Release</title><link>https://webrecorder.net/blog/2024-12-18-browsertrix-1-13/</link><guid isPermaLink="true">https://webrecorder.net/blog/2024-12-18-browsertrix-1-13/</guid><description>¡Browsertrix para todos! With your help, we’re translating Browsertrix into new languages.</description><pubDate>Wed, 18 Dec 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Web archiving tools are overwhelmingly only available in English, and while Webrecorder tools are no exception to this, we want to change that. To further our mission of making web archiving more accessible for all, we have started the process of translating Browsertrix, our premier crawling service, into other languages, including Spanish, French, Portuguese, and German.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-dashboard-es.vmdoIPmU_Z1tjL08.webp&quot; alt=&quot;A screenshot of Browsertrix’s dashboard, displayed in Spanish&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1980&quot; height=&quot;1000&quot;&gt;&lt;/p&gt;
&lt;p&gt;To achieve this, we’ve integrated support for the popular translation tool Weblate, which provides a friendly UI for entering and editing translations.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://hosted.weblate.org/engage/browsertrix/&quot;&gt;&lt;img src=&quot;/_astro/browsertrix-weblate.ox9sUh5U_29uizr.webp&quot; alt=&quot;A screenshot of Weblate’s interface, showing a variety of different languages at different levels of completion.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1684&quot; height=&quot;962&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We’d love your help getting Browsertrix translated into more languages! &lt;a href=&quot;https://hosted.weblate.org/engage/browsertrix/&quot;&gt;Join our Weblate project&lt;/a&gt; to help out.&lt;/p&gt;
&lt;h2 id=&quot;setting-your-preferred-language-and-formatting-options&quot;&gt;Setting Your Preferred Language and Formatting Options&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-language-settings.DNeXJDIi_ZoAnoQ.webp&quot; alt=&quot;A screenshot of Browsertrix’s language settings, showing a preferred
language dropdown set to English, a &amp;#34;prefer browser language settings for
formatting numbers and dates&amp;#34; setting turned on, and some examples of dates,
durations, and numbers formatted using the rules for Canadian English. Below,
there’s a call to action to help translate
Browsertrix.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1394&quot; height=&quot;552&quot;&gt;&lt;/p&gt;
&lt;p&gt;In your &lt;a href=&quot;https://app.browsertrix.com/account/settings&quot;&gt;Account Settings&lt;/a&gt;, you’ll see a &lt;strong&gt;Language&lt;/strong&gt; section where you can set your preferred language, as well as how you’d like values such as dates, durations, and numbers to be formatted. The &lt;strong&gt;Use browser language settings for formatting numbers and dates&lt;/strong&gt; setting formats these values according to your browser’s locale rather than your selected display language, which may be more familiar to you.&lt;/p&gt;
&lt;h2 id=&quot;contributions-and-development&quot;&gt;Contributions and Development&lt;/h2&gt;
&lt;p&gt;Thanks to community contributions, the Spanish translation of the Browsertrix UI is over 50% complete — immense thanks to Weblate user &lt;a href=&quot;https://hosted.weblate.org/user/Kamborio15/&quot;&gt;Kamborio&lt;/a&gt;, who made 628 contributions in one day!&lt;/p&gt;
&lt;p&gt;For users deploying Browsertrix on their own infrastructure, we have &lt;a href=&quot;https://docs.browsertrix.com/develop/localization/&quot;&gt;documentation on how to add new languages locally&lt;/a&gt;, if you’d like to test out a local instance with a new language.&lt;/p&gt;
&lt;p&gt;We also welcome requests for more languages! If you’d like to see Browsertrix in your language, please &lt;a href=&quot;https://github.com/webrecorder/browsertrix/issues/new?assignees=&amp;labels=localization&amp;projects=&amp;template=localization-request.yml&amp;title=%5BL10N%5D%3A+&quot;&gt;submit a request on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To see all of the changes in this update, &lt;a href=&quot;https://github.com/webrecorder/browsertrix/releases/tag/v1.13.0&quot;&gt;check out the release on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;Not a Browsertrix user yet? &lt;a data-astro-prefetch=&quot;true&quot; class=&quot;group/arrow-link inline&quot; href=&quot;/browsertrix/#get-started&quot;&gt;Archive what matters to you with a 7-day free trial&lt;svg width=&quot;1em&quot; height=&quot;1em&quot; class=&quot;relative top-[3px] ml-1 inline-block size-4 align-baseline transition-transform duration-300 ease-out group-hover/arrow-link:translate-x-1&quot; data-icon=&quot;bi:arrow-right&quot;&gt;   &lt;symbol id=&quot;ai:bi:arrow-right&quot; viewBox=&quot;0 0 16 16&quot;&gt;&lt;path fill=&quot;currentColor&quot; fill-rule=&quot;evenodd&quot; d=&quot;M1 8a.5.5 0 0 1 .5-.5h11.793l-3.147-3.146a.5.5 0 0 1 .708-.708l4 4a.5.5 0 0 1 0 .708l-4 4a.5.5 0 0 1-.708-.708L13.293 8.5H1.5A.5.5 0 0 1 1 8&quot;/&gt;&lt;/symbol&gt;&lt;use href=&quot;#ai:bi:arrow-right&quot;&gt;&lt;/use&gt;  &lt;/svg&gt;&lt;/a&gt;&lt;/p&gt;</content:encoded><author>Ilya Kreymer, Emma Segal-Grossman, Sua Yoo, and Clara Itzel</author></item><item><title>Browsertrix 1.12: Proxies, Crawling Defaults, and Simplified Workflow Creation</title><link>https://webrecorder.net/blog/2024-11-07-browsertrix-1-12/</link><guid isPermaLink="true">https://webrecorder.net/blog/2024-11-07-browsertrix-1-12/</guid><description>Proxies, crawling defaults, and simplified workflow creation!</description><pubDate>Thu, 07 Nov 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;proxies&quot;&gt;Proxies&lt;/h2&gt;
&lt;p&gt;We are very excited to announce the release of Browsertrix 1.12, including a long-anticipated new feature: crawling through dedicated proxies! Browsertrix can now be configured to direct crawling traffic through dedicated proxy servers, allowing websites to be crawled from a specific geographic location regardless of where Browsertrix itself is deployed. Want to crawl geographically-restricted content? Ensure your archived items reflect a local user experience? Crawl from a static local IP without the maintenance burden of self-hosting? All of this is now possible!&lt;/p&gt;
&lt;p&gt;In our &lt;a href=&quot;https://webrecorder.net/browsertrix/&quot;&gt;hosted Browsertrix service&lt;/a&gt;, proxies will be an optional paid feature available to users of Pro plans. We’ve started testing proxies with a few existing Pro users and will continue to refine our proxy offerings based on what we learn in practice.&lt;/p&gt;
&lt;p&gt;If you are a current Browsertrix service customer and want to try a local proxy in your region, let us know.&lt;/p&gt;
&lt;p&gt;We’ve also added detailed documentation for folks who self-deploy Browsertrix on &lt;a href=&quot;https://docs.browsertrix.com/deploy/proxies/&quot;&gt;how to configure proxies to use with Browsertrix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once proxy servers are configured and made available to an organization, they can be selected per-crawl workflow or set as an organizational default so that all new workflows use a specific proxy by default (see more about crawling defaults below!).&lt;/p&gt;
&lt;h2 id=&quot;simplified-workflow-creation&quot;&gt;Simplified Workflow Creation&lt;/h2&gt;
&lt;p&gt;We’ve simplified the process for creating a new crawl workflow. Previously, users had to select between &lt;em&gt;URL List&lt;/em&gt; and &lt;em&gt;Seeded Crawl&lt;/em&gt; workflow types before any other configuration. We heard loud and clear that this was confusing.&lt;/p&gt;
&lt;p&gt;Now all crawl scope types are available in the same interface without needing to go through a dialog, and we’ve quick-linked the different scope types from the New Workflow button in the Crawling tab. Your browser will even remember the last crawl scope type you used and set it again the next time you create a crawl workflow.&lt;/p&gt;
&lt;p&gt;We’ve also added a Single Page scope type to make it as clear as possible what to do when you only need to archive a single page.&lt;/p&gt;
&lt;h2 id=&quot;in-case-you-missed-it&quot;&gt;In Case You Missed It&lt;/h2&gt;
&lt;p&gt;We added a few features in 1.11 point releases since the last blog post, so here are a few words on those!&lt;/p&gt;
&lt;h3 id=&quot;crawling-defaults&quot;&gt;Crawling Defaults&lt;/h3&gt;
&lt;p&gt;Browsertrix now includes new per-org crawling defaults! We heard from users that it would be helpful to be able to set certain crawling defaults for things like crawl limits, browser profiles, user agents, and the browser’s language setting. You can now find these in the Crawling Defaults section of Org Settings. Set it once and forget it, with the confidence that your default settings will be used in every new crawl workflow unless you manually override them.&lt;/p&gt;
&lt;h3 id=&quot;breadcrumb-navigation&quot;&gt;Breadcrumb Navigation&lt;/h3&gt;
&lt;p&gt;We’ve added breadcrumb navigation to many pages in the app to help you navigate and situate yourself. When you’re viewing crawl workflows, archived items, collections, and browser profiles, take a look up towards the top of the page for the new breadcrumbs.&lt;/p&gt;
&lt;h3 id=&quot;new-documentation-sidebar&quot;&gt;New Documentation Sidebar&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-docs-sidebar.BY2ooiJw_2tNU20.webp&quot; alt=&quot;A screenshot of Browsertrix&apos;s crawl workflow settings documentation, available as a sidebar on the right side of the screen&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1853&quot; height=&quot;1080&quot;&gt;&lt;/p&gt;
&lt;p&gt;Last but certainly not least, we’ve integrated documentation right into Browsertrix in specific places where referencing the docs might help you most! Specifically, check out the “Setup Guide” button in the upper-right corner of the workflow editor. Getting help with understanding the purpose of workflow options has never been easier.&lt;/p&gt;
&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;We’re already hard at work on features that will go into Browsertrix 1.13, including custom org storage. Soon you’ll be able to bring your own S3 bucket to use for crawling outputs, browser profiles, and other data generated by Browsertrix, giving you more flexibility and control than ever.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;a href=&quot;/browsertrix&quot;&gt;Sign up and start crawling with Browsertrix&lt;/a&gt;!&lt;/p&gt;</content:encoded><author>Tessa Walsh and Emma Segal-Grossman</author></item><item><title>Browsertrix 1.11: Self Sign-Up, QA Improvements, Easier Downloading and new APIs</title><link>https://webrecorder.net/blog/2024-08-06-browsertrix-1-11/</link><guid isPermaLink="true">https://webrecorder.net/blog/2024-08-06-browsertrix-1-11/</guid><description>Self sign-up, easier downloads, and better crawl analysis stats!</description><pubDate>Tue, 06 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;self-sign-up&quot;&gt;Self Sign-up&lt;/h2&gt;
&lt;p&gt;A lot of our work in this release cycle has been focused on our internal tooling to allow you to sign up for our Browsertrix hosted service in a fully automated way, with billing integration via Stripe. You can &lt;a href=&quot;/browsertrix/pricing&quot;&gt;now sign up to use Browsertrix on your own&lt;/a&gt;, choosing from one of our newly offered plans.
Once you sign up, you can always update your subscription from the new “Billing” pane in Org settings, including automatically switching to a different plan as your crawling needs change!&lt;/p&gt;
&lt;h2 id=&quot;better-quality-assurance-qa-stats&quot;&gt;Better Quality Assurance (QA) Stats&lt;/h2&gt;
&lt;p&gt;In our &lt;a href=&quot;/blog/2024-06-10-browsertrix-1-10&quot;&gt;last release&lt;/a&gt;, we introduced our new QA system. In 1.11, we’ve made a few improvements to make the system even more useful.&lt;/p&gt;
&lt;p&gt;The QA analysis meters now update in real time as the analysis runs, letting you see immediately how the analysis is going without waiting for it to check every page. This should provide more immediate feedback about the quality of a larger crawl!&lt;/p&gt;
&lt;p&gt;We’ve also added a few extra stats that help our image and text comparison meters make a little more sense. If you haven’t noticed anything funky up until now, that’s great, continue on as you were! For the inquisitive folks, explaining this change fully is a little more involved…&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-match-analysis-graph-v2.CxUJkUN0_2cKOqB.webp&quot; alt=&quot;A screenshot of Browsertrix&apos;s updated page match analysis graph&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1550&quot; height=&quot;797&quot;&gt;&lt;/p&gt;
&lt;p&gt;When archiving a website, sometimes the crawler encounters media such as PDFs, images, or video files that are surfaced on pages as links. These are treated as “pages” in the archive because they were linked to like any other HTML page would be (as opposed to being embedded as &lt;em&gt;part of&lt;/em&gt; a page) but unlike actual webpages, these “pages” are just static files. Based on this fact, we can say with 100% certainty that, if these files are present within the archive, they’re going to be accurate copies of what was downloaded, and we don’t have to bother assessing them in a QA run, saving you time (and money!)&lt;/p&gt;
&lt;p&gt;We’ve never actually run analysis on these files, for the reason above, but our bar graph breakdowns didn’t account for this properly: non-HTML files captured as pages were grouped, along with outright failures, in with un-assessed pages. That meant a 100% complete analysis run could look like it had un-assessed pages when really they just weren’t relevant! In Browsertrix 1.11, we list these files above the meters as a separate stat, so the HTML page match analysis graphs should &lt;em&gt;always&lt;/em&gt; be fully filled when an analysis run is 100% complete, as they only display HTML pages that &lt;em&gt;can&lt;/em&gt; be analyzed!&lt;/p&gt;
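The bookkeeping above amounts to partitioning pages by type before computing the meters. A minimal sketch in Python — the page structure and the 0.9 threshold are illustrative assumptions, not Browsertrix’s actual data model:

```python
# Illustrative only: compute QA meter stats while excluding non-HTML
# "pages" (PDFs, images, video files) from the analysis denominator.
# Field names and the match threshold are hypothetical.

def summarize_analysis(pages):
    html_pages = [p for p in pages if p["is_html"]]
    analyzed = [p for p in html_pages if p.get("screenshot_match") is not None]
    return {
        "non_html_files": len(pages) - len(html_pages),  # reported separately, never analyzed
        "html_total": len(html_pages),                   # meter denominator
        "html_analyzed": len(analyzed),
        "good_match": sum(1 for p in analyzed if p["screenshot_match"] >= 0.9),
    }

pages = [
    {"is_html": True, "screenshot_match": 0.98},
    {"is_html": True, "screenshot_match": 0.42},
    {"is_html": False, "screenshot_match": None},  # e.g. a linked PDF
]
print(summarize_analysis(pages))
```

With the non-HTML file counted separately, the meter denominator is 2 and both HTML pages are analyzed, so the graph fills completely.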
&lt;p&gt;Who knew bar graphs could be so involved?!&lt;/p&gt;
&lt;h2 id=&quot;easier-archived-item-downloads&quot;&gt;Easier Archived Item Downloads&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-download-item-dropdown.BBYjoThz_Z1Nf2mu.webp&quot; alt=&quot;A screenshot of the archived item list dropdown menu with a new option titled &amp;#34;Download Item&amp;#34; selected&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1335&quot; height=&quot;761&quot;&gt;&lt;/p&gt;
&lt;p&gt;Archived items often contain multiple WACZ files; typically one is generated per crawler instance, and they are split into ~10 GB increments. You’ve always been able to download these from the “Files” tab within an archived item’s details page, but we’ve never been satisfied with the number of clicks required to accomplish this task. Today, that changes! Much like collections, any archived item can now be downloaded with a single click from the actions menu. This packs the associated WACZ files into a single “multi-WACZ” file, which can be replayed in ReplayWeb.page as you’d expect.&lt;/p&gt;
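Because WACZ files are ZIP-based containers, a downloaded file can be inspected with standard ZIP tooling. A small sketch using Python’s zipfile module — the member names below stand in for real contents and are not the exact multi-WACZ layout:

```python
# Build a toy ZIP standing in for a WACZ and list its members.
# WACZ files are ZIP-based, so zipfile can read them; the member
# names here are illustrative only.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("datapackage.json", '{"profile": "data-package"}')
    z.writestr("archive/data.warc.gz", b"")

with zipfile.ZipFile(buf) as z:
    members = z.namelist()
print(members)
```

The same approach works on a real download: point `zipfile.ZipFile` at the `.wacz` path and `namelist()` shows what it contains.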
&lt;h2 id=&quot;fixes--small-things&quot;&gt;Fixes &amp;amp; Small Things&lt;/h2&gt;
&lt;p&gt;As always, a full list of fixes (and additions) can be found on our &lt;a href=&quot;https://github.com/webrecorder/browsertrix/releases/&quot;&gt;GitHub releases page&lt;/a&gt;. Here are the highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you’re a part of multiple orgs, the list of orgs in the quick switcher is now alphabetically sorted.&lt;/li&gt;
&lt;li&gt;We’ve turned off behaviors for crawl analysis. This greatly reduces the time it takes to run!&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;changes-for-self-hosting-users-and-developers&quot;&gt;Changes for Self-Hosting Users and Developers&lt;/h2&gt;
&lt;p&gt;While the bulk of the work in this release has been focused on our hosted service, users who are self-hosting Browsertrix can also benefit from a number of improvements to our API and webhooks.&lt;/p&gt;
&lt;h3 id=&quot;new-webhooks&quot;&gt;New Webhooks&lt;/h3&gt;
&lt;p&gt;We now have additional webhooks which can notify when a crawl has been reviewed in the QA process, and when assistive QA analysis has started or finished.&lt;/p&gt;
&lt;h3 id=&quot;org-import--export&quot;&gt;Org Import &amp;amp; Export&lt;/h3&gt;
&lt;p&gt;Superadmins on self-hosted instances can now export an organization’s data from the database to a JSON file, and import an organization from an exported JSON file into the database. This API-only feature can be used to move organizations from one instance of Browsertrix to another, or to export all information from an organization for backup purposes.&lt;/p&gt;
&lt;p&gt;Documentation for org import and export has been added to the &lt;a href=&quot;https://docs.browsertrix.com/deploy/admin/org-import-export/&quot;&gt;Browsertrix deployment documentation&lt;/a&gt;.&lt;/p&gt;
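Conceptually, the export/import round-trip serializes an org’s records to a JSON file and reads them back on another instance. A toy illustration with Python’s json module — the field names are hypothetical, not the actual export schema:

```python
# Toy org export/import round-trip. All field names below are
# hypothetical, not the real Browsertrix export schema.
import json
import tempfile

org = {
    "name": "example-org",
    "workflows": [{"id": "wf-1"}],
    "quotas": {"maxExecMinutes": 1000},
}

# "Export" the org to a JSON file...
with tempfile.NamedTemporaryFile("w+", suffix=".json", delete=False) as f:
    json.dump(org, f)
    path = f.name

# ...then "import" it back, e.g. on another instance.
with open(path) as f:
    restored = json.load(f)
print(restored == org)
```

The real feature is driven through superadmin API endpoints rather than local files; see the deployment documentation linked above for the supported procedure.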
&lt;h3 id=&quot;org-deletion--crawling-restrictions&quot;&gt;Org Deletion &amp;amp; Crawling Restrictions&lt;/h3&gt;
&lt;p&gt;Superadmins on self-hosted instances can now delete orgs. To make sure you know what you’re deleting before you remove it from existence forever, we’ve added a nice verification screen that makes you type in the org name.&lt;/p&gt;
&lt;p&gt;Superadmins can also turn off all crawling abilities for an org.&lt;/p&gt;
&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;Support for crawling through proxies in different geographic locations and custom storage options for crawling to your own S3 bucket are currently being worked on and will be available in an upcoming release.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;To sign up and start crawling with Browsertrix, check out the details at: &lt;a href=&quot;https://browsertrix.com&quot;&gt;Browsertrix.com&lt;/a&gt;&lt;/p&gt;</content:encoded><author>Emma Segal-Grossman, Henry Wilkinson, Tessa Walsh, and Ilya Kreymer</author></item><item><title>Browsertrix 1.10: Now with Assistive QA!</title><link>https://webrecorder.net/blog/2024-06-10-browsertrix-1-10/</link><guid isPermaLink="true">https://webrecorder.net/blog/2024-06-10-browsertrix-1-10/</guid><description>Tired of visiting every single page in your web archive to ensure it was captured properly? So are we!</description><pubDate>Mon, 10 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After some rest and a few solid weeks of polish after our demo at IIPC’s 2024 Web Archiving Conference, we’re proud to release Browsertrix 1.10: Now with Assistive QA!&lt;/p&gt;
&lt;h2 id=&quot;assistive-quality-assurance&quot;&gt;Assistive Quality Assurance&lt;/h2&gt;
&lt;p&gt;Quality assurance for web archives has long been a challenging and time-consuming task. The best methods of ensuring a page was captured properly typically fall to a discerning archivist manually scoring various aspects of the page replay to get an overall picture of crawl quality. While we wanted to retain the human element of curation given the vast diversity of the web, our goal in developing these features is to dramatically speed up the review process by providing meaningful heuristics that help direct attention towards pages that need it most.&lt;/p&gt;
&lt;p&gt;The crawl analysis and review process is the culmination of these efforts! Browsertrix can now analyze archived webpages by crawling pages from the captured WACZ files and comparing their replay with what the browser saw on the page during the initial crawl — a feature uniquely made possible due to the tight integration of Browsertrix, Browsertrix Crawler, and ReplayWeb.page.&lt;/p&gt;
&lt;h3 id=&quot;crawl-analysis&quot;&gt;Crawl Analysis&lt;/h3&gt;
&lt;p&gt;After crawling has completed, the first step in the review process is conducting an analysis run for the archived item in question. On the crawl’s details page in the Quality Assurance tab, press the &lt;em&gt;Review Crawl&lt;/em&gt; button to begin the analysis process.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-analysis-in-progress.DVOyHeVq_Z223SC1.webp&quot; alt=&quot;A screenshot of Browsertrix&apos; Quality Assurance tab with an analysis run in progress&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1917&quot; height=&quot;915&quot;&gt;&lt;/p&gt;
&lt;p&gt;Like crawling, running crawl analysis also uses execution time. While we generally recommend letting analysis runs complete to get a full picture of the success of your crawl, there may be cases (say, a website with &lt;em&gt;many&lt;/em&gt; similar pages that you believe were captured successfully but want some evidence to confirm it) where a stopped analysis run is enough to get the data you’re looking for.&lt;/p&gt;
&lt;p&gt;Once analysis is complete, Browsertrix graphs the matching scores of two analysis dimensions — screenshot and text match comparison.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-match-analysis-graph.BtLzJ3UB_1xmWy.webp&quot; alt=&quot;A screenshot of Browsertrix&apos; page match analysis, almost everything is a good match but text matching has a few pages that should be assessed further.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1403&quot; height=&quot;545&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.browsertrix.com/user-guide/archived-items/#crawl-analysis&quot;&gt;Details on what to expect when re-running crawl analysis can be found in the documentation!&lt;/a&gt;&lt;/p&gt;
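A text comparison meter of this kind can be approximated by diffing the text extracted during crawling against the text extracted on replay. A rough sketch using Python’s difflib — an illustration of the idea, not Browsertrix’s actual scoring algorithm:

```python
# Rough sketch of a text-match heuristic: score replay text against
# crawl-time text with a sequence similarity ratio. Illustrative only;
# Browsertrix's actual scoring may differ.
from difflib import SequenceMatcher

def text_match(crawl_text: str, replay_text: str) -> float:
    return SequenceMatcher(None, crawl_text, replay_text).ratio()

full = "Welcome to the archive. All content loaded."
partial = "Welcome to the archive."
print(text_match(full, full))     # identical text scores 1.0
print(text_match(full, partial))  # missing text lowers the score
```

Pages scoring well below 1.0 are the ones worth opening in the review interface, since the replay is missing text the crawler originally saw.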
&lt;h3 id=&quot;reviewing-crawls&quot;&gt;Reviewing Crawls&lt;/h3&gt;
&lt;p&gt;Now that crawl analysis has finished, let’s take a look at its results in context and decide if our archive is any good! Generally, pages with high scores are less problematic, and you’ll want to direct more attention to pages with lower scores. For an archived item with at least one finished analysis run, pressing the &lt;em&gt;Review Crawl&lt;/em&gt; button will open the Crawl Review page.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-review-text.Br1EYgBG_2vWeE.webp&quot; alt=&quot;A screenshot of Browsertrix&apos; crawl review interface looking at text comparison. Some text appears to be missing on replay.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1911&quot; height=&quot;913&quot;&gt;&lt;/p&gt;
&lt;p&gt;In just a few seconds, we can sort the list of pages by the heuristic we’re interested in (here text comparison) and jump straight to where there might be problems, and indeed there are! Some older ReplayWeb.page embed example pages aren’t loading everything that was found when crawling. In this case it’s because we’re trying to load ReplayWeb.page &lt;em&gt;inside&lt;/em&gt; ReplayWeb.page, and it doesn’t seem to handle recursive instances of itself very well, a known limitation! This example is pretty specific to our tools and website; more common text matching discrepancies we’ve encountered come from video players’ UI text, cookie &amp;amp; consent popups, and embedded content that loaded while archiving but 404s due to a replay issue. Any time you see ReplayWeb.page’s “Archived Page Not Found” error show up here, there’s probably something worth investigating further!&lt;/p&gt;
&lt;p&gt;You can vote the page down and leave a comment if it’s a serious issue you’d like to flag for others, or vote the page up and do the same if the issues aren’t ones you need to worry about. Generally, we wouldn’t recommend assessing and voting on &lt;em&gt;every&lt;/em&gt; page; instead, try to assemble a few key examples that exhibit common problems or consistent successes to give other curators concise information about the page quality they can expect from the archived item.&lt;/p&gt;
&lt;p&gt;Once you’re satisfied with your assessment of pages, press the &lt;em&gt;Finish Review&lt;/em&gt; button to score the success of the entire crawl and update the description with any additional details! This assessment score is reflected in the Archived Items list and will be used elsewhere in the app to assist with organization and discovery in the future.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-finish-review.CNNJZ6IA_ZOlUhc.webp&quot; alt=&quot;A screenshot of Browsertrix&apos; finish review dialog with a 5 point scale ranging from Excellent to Bad and an option to update the archived item&apos;s description&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1046&quot; height=&quot;474&quot;&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;That’s the general process, but it’s not quite everything! The tabs not covered here include a screenshot comparison of captures taken while crawling and on replay, a standard replay tab for checking the heuristics against the real thing, and a resource comparison table displaying counts of successful vs. failed page resource fetches. For more information on exactly what you can expect from each heuristic, &lt;a href=&quot;https://docs.browsertrix.com/user-guide/review/#heuristics&quot;&gt;check out the documentation for crawl analysis!&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We’ve been working on this feature since November of 2023 and we’re all very excited to finally get it into your hands! As you might have noticed in the screenshots above, we’re releasing our crawl analysis and review tools with a “beta” tag attached and we’d really like your feedback! Your thoughts are always appreciated &lt;a href=&quot;https://forum.webrecorder.net/&quot;&gt;on the forum&lt;/a&gt; or &lt;a href=&quot;https://github.com/webrecorder/browsertrix/issues/1752&quot;&gt;on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;fixes--small-things&quot;&gt;Fixes &amp;amp; Small Things&lt;/h2&gt;
&lt;p&gt;As always, a full list of fixes (and additions) can be found on our &lt;a href=&quot;https://github.com/webrecorder/browsertrix/releases/&quot;&gt;GitHub releases page&lt;/a&gt;. Here are the highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Browsertrix joins ReplayWeb.page with updated branding! You can see it in the screenshots above, and on this very website!&lt;/li&gt;
&lt;li&gt;Emails are now displayed for both pending and current users in the org settings.&lt;/li&gt;
&lt;li&gt;You can now offset the crawl queue to view URLs from any part of the upcoming pages list.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;changes-for-developers&quot;&gt;Changes for Developers&lt;/h2&gt;
&lt;p&gt;As a key component of Browsertrix, Browsertrix Crawler 1.1+ CLI also supports analysis runs! Running QA with Browsertrix Crawler can also output screenshot diffs to a separate directory for local debugging when using the &lt;code&gt;--qaDebugImageDiff&lt;/code&gt; option. Check out our &lt;a href=&quot;https://crawler.docs.browsertrix.com/user-guide/qa/&quot;&gt;crawler QA documentation&lt;/a&gt; and &lt;a href=&quot;https://crawler.docs.browsertrix.com/user-guide/cli-options/&quot;&gt;full list of CLI options&lt;/a&gt; for more info about this feature.&lt;/p&gt;
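Conceptually, a screenshot comparison reduces to measuring how many pixels differ between the crawl-time and replay captures. A stdlib-only sketch over raw pixel lists — purely illustrative; the crawler diffs actual screenshot images:

```python
# Toy screenshot diff: fraction of matching pixels between two
# equally-sized captures, represented here as flat lists of RGB
# tuples. Illustrative only; the real crawler works on image data.

def screenshot_match(a, b):
    if len(a) != len(b):
        raise ValueError("screenshots must be the same size")
    same = sum(1 for pa, pb in zip(a, b) if pa == pb)
    return same / len(a)

crawl = [(255, 255, 255)] * 8 + [(0, 0, 0)] * 2
replay = [(255, 255, 255)] * 10  # two pixels failed to render on replay
print(screenshot_match(crawl, replay))
```

A score well below 1.0 flags a page whose replay rendering diverged from what the browser saw during the crawl, which is exactly what the `--qaDebugImageDiff` output helps you inspect visually.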
&lt;p&gt;We have created a Helm repo for Browsertrix! You can add our repository with &lt;code&gt;helm repo add browsertrix https://docs.browsertrix.com/helm-repo/&lt;/code&gt;. &lt;a href=&quot;https://docs.browsertrix.com/deploy/local/#launching-browsertrix-with-helm-repository&quot;&gt;See our deployment documentation for details.&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;We’re already hard at work on Browsertrix 1.11, with the beginnings of proxy support and improvements to the heuristic meters shown above currently in progress. Look for them in the next major release!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;If you’re interested in signing up to crawl (and assess the quality of your captures) with Browsertrix, check out the details at: &lt;a href=&quot;https://browsertrix.com&quot;&gt;Browsertrix.com&lt;/a&gt;&lt;/p&gt;</content:encoded><author>Henry Wilkinson</author></item><item><title>ReplayWeb.page 2.0</title><link>https://webrecorder.net/blog/2024-04-23-replaywebpage-2-0/</link><guid isPermaLink="true">https://webrecorder.net/blog/2024-04-23-replaywebpage-2-0/</guid><description>New branding, adblock for embeds, code refactor, and so much more!</description><pubDate>Tue, 23 Apr 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We’re thrilled to announce the release of ReplayWeb.page, our embeddable browser-based web archive viewer. This release features updated branding, reorganized documentation, various UI improvements, experimental ad-blocking support, and a more robust codebase with TypeScript.&lt;/p&gt;
&lt;p&gt;Read below for additional details.&lt;/p&gt;
&lt;h2 id=&quot;new-branding&quot;&gt;New Branding&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/replaywebpage-branding-showcase.CqJIUIS-_Zptwt3.webp&quot; alt=&quot;A showcase of the new ReplayWeb.page branding, featuring the updated logo and app icons&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1280&quot; height=&quot;564&quot;&gt;&lt;/p&gt;
&lt;p&gt;ReplayWeb.page has a new logo and new app icons for macOS, Windows, and Linux! Behind the scenes we’ve been working on updating our branding since October of 2023, and ReplayWeb.page is the first of our primary tools to launch with the new logos! More of these are on their way with Browsertrix and ArchiveWeb.page to follow.&lt;/p&gt;
&lt;h2 id=&quot;page-snapshots-dropdown&quot;&gt;Page Snapshots Dropdown&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/replaywebpage-snapshot-dropdown.DS1V7A2a_Z1ECiEv.webp&quot; alt=&quot;A screenshot of the expanded page snapshot dropdown showing two snapshots available.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;2479&quot; height=&quot;1356&quot;&gt;&lt;/p&gt;
&lt;p&gt;Multiple captures of the same URL can now be accessed directly in the navigation bar! While all page snapshots are still listed in the Pages list, this should make finding snapshots of the same page a little faster.&lt;/p&gt;
&lt;h2 id=&quot;page-thumbnail-support&quot;&gt;Page Thumbnail Support&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/replaywebpage-thumbnails.BT_WRB5Y_Z1xkrG9.webp&quot; alt=&quot;A screenshot of ReplayWeb.page&apos;s pages list with four snapshots, all of which have thumbnail previews next to their link titles.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;3034&quot; height=&quot;1570&quot;&gt;&lt;/p&gt;
&lt;p&gt;A thumbnail picture might not say &lt;em&gt;a thousand&lt;/em&gt; words, but it’s still helpful when combined with page titles and URLs!&lt;/p&gt;
&lt;p&gt;All new WACZ files created with &lt;a href=&quot;/browsertrix&quot;&gt;Browsertrix&lt;/a&gt; include thumbnail screenshots captured while crawling. For those using the command line Browsertrix Crawler application, see the &lt;a href=&quot;https://crawler.docs.browsertrix.com/user-guide/common-options/#screenshots&quot;&gt;various screenshot options available in the Browsertrix Crawler documentation&lt;/a&gt;… but the one you want here is &lt;code&gt;--screenshot thumbnail&lt;/code&gt;.&lt;/p&gt;
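<p>As a rough sketch, a Docker invocation with thumbnails enabled might look like the following. The mount path and example URL are placeholders; check the linked crawler documentation for the current option list.</p>

```shell
# Hypothetical crawl that captures thumbnail screenshots alongside the WACZ;
# the volume path and URL are placeholders
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --generateWACZ --screenshot thumbnail
```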
&lt;h2 id=&quot;adblock-embed-option&quot;&gt;Adblock Embed Option&lt;/h2&gt;
&lt;p&gt;Most web advertisements dynamically load content from ad servers, which send a relevant advertisement to the user. This setup poses a few challenges when replaying archived content; namely, the client-side JavaScript may not request the same advertisement files that it did originally. This makes ads one of the most challenging and least replicable parts of web archives to display. While some archivists work diligently to preserve advertising — a traditionally under-preserved aspect of culture and cherished element of the web loved by all — an alternate solution to these issues is to simply hide them and pretend they don’t exist!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/replaywebpage-adblock.BLkpasXu_14Nrf9.webp&quot; alt=&quot;A screenshot of ReplayWeb.page with adblock disabled VS enabled. The disabled version has a large black box with nothing in it — an advertisement that failed to replay properly.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1920&quot; height=&quot;584&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;useAdblock=&amp;quot;&amp;quot;&lt;/code&gt; embed attribute will use the &lt;a href=&quot;https://easylist.to/&quot;&gt;Easylist filter rules&lt;/a&gt; by default to hide ads on archived webpages loaded in embedded ReplayWeb.page. &lt;code&gt;adblockRulesUrl=&amp;quot;https://urlhere.com/file.txt&amp;quot;&lt;/code&gt; can also be used to set a custom filter list.&lt;/p&gt;
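<p>Put together, an embed using these attributes might look like the sketch below. The script URL, WACZ source, page URL, and filter-list URL are all placeholders; see the embedding documentation for the exact setup.</p>

```html
<!-- Hypothetical embed: the source, url, and filter list values are placeholders -->
<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
<replay-web-page
  source="https://example.com/my-archive.wacz"
  url="https://example.com/archived-page/"
  useAdblock=""
  adblockRulesUrl="https://example.com/custom-filters.txt">
</replay-web-page>
```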
&lt;p&gt;Just like rendering advertisements, blocking them &lt;em&gt;also&lt;/em&gt; brings new and unique challenges! We’re releasing ad blocking as a beta feature, available only through the embed options above, and expect it to evolve over time. Please let us know what works for you and what might need further attention!&lt;/p&gt;
&lt;h2 id=&quot;update-favicons-embed-option&quot;&gt;Update Favicons Embed Option&lt;/h2&gt;
&lt;p&gt;Adding the &lt;code&gt;updateFavicons=&amp;quot;&amp;quot;&lt;/code&gt; attribute will update the favicon of the page ReplayWeb.page is embedded within to the favicon of the embedded website. Currently this is only supported on Chrome.&lt;/p&gt;
&lt;p&gt;Details regarding the above embed options (and all the rest) can be found in the &lt;a href=&quot;https://replayweb.page/docs/embedding/#embedding-options&quot;&gt;“embedding” documentation section&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;documentation-overhaul&quot;&gt;Documentation Overhaul&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/replaywebpage-docs.CD8lFOsN_Z2wLxLt.webp&quot; alt=&quot;A screenshot of the overhauled ReplayWeb.page documentation site&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;3584&quot; height=&quot;1964&quot;&gt;&lt;/p&gt;
&lt;p&gt;We have completely overhauled the &lt;a href=&quot;https://replayweb.page/docs/&quot;&gt;ReplayWeb.page documentation&lt;/a&gt; and converted it to MkDocs. We’re also using the homepage to highlight some of the great organizations that have been building their tools around or integrating ReplayWeb.page.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://replayweb.page/docs/#archives-repositories-and-other-projects-using-replaywebpage&quot;&gt;Go check them out&lt;/a&gt; and if you think your project belongs in this list, get in touch!&lt;/p&gt;
&lt;p&gt;All pages have received some level of attention with hierarchy improvements and content corrections across the board. If something seems amiss, unclear, or incorrect, please let us know at: &lt;em&gt;docs-feedback [at] webrecorder.net&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&quot;fixes--small-things&quot;&gt;Fixes &amp;amp; Small Things&lt;/h2&gt;
&lt;p&gt;As always, a full list of fixes (and additions) can be found on our &lt;a href=&quot;https://github.com/webrecorder/replayweb.page/releases/&quot;&gt;GitHub releases page&lt;/a&gt;; here are the highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The nav bar controls have been reordered: navigation controls are now grouped on the left side, and the full screen button has been moved to the right.
&lt;ul&gt;
&lt;li&gt;The main navigation controls are now visible at smaller screen sizes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Archive info has been moved to a dialog available under the &lt;em&gt;More Replay Controls&lt;/em&gt; (three dots) menu.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;changes-for-developers&quot;&gt;Changes for Developers&lt;/h2&gt;
&lt;p&gt;The ReplayWeb.page codebase has been converted to TypeScript, which should make the code more robust and easier to maintain. We have also added support for the &lt;a href=&quot;https://shoelace.style/&quot;&gt;Shoelace component library&lt;/a&gt; (also used in Browsertrix) to further improve UI consistency between our tools in future releases, and streamlined the build process to no longer commit prebuilt artifacts, simplifying merges. We hope these changes — along with the documentation updates — will improve the developer experience, especially for new open source contributors!&lt;/p&gt;
&lt;p&gt;Although a major release, this version should generally be compatible with previous ReplayWeb.page releases, but if you run into any issues, please let us know!&lt;/p&gt;</content:encoded><author>Henry Wilkinson and Ilya Kreymer</author></item><item><title>Browsertrix 1.9</title><link>https://webrecorder.net/blog/2024-01-31-browsertrix-1-9/</link><guid isPermaLink="true">https://webrecorder.net/blog/2024-01-31-browsertrix-1-9/</guid><description>Browsertrix has had some big improvements since our last blog post, lets take a look at some of the more recent ones!</description><pubDate>Wed, 31 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;a-quick-look-back-at-2023&quot;&gt;A Quick Look Back at 2023&lt;/h2&gt;
&lt;p&gt;It has been almost two years since we initially announced &lt;a href=&quot;/browsertrix&quot;&gt;Browsertrix&lt;/a&gt;! Since then, we’ve been pretty much solely focused on developing our next generation cloud-based archiving platform. One of the downsides of this sole focus is that &lt;em&gt;sometimes&lt;/em&gt; you forget to update the company blog and actually tell people about what you made! I’m not going to go back and write update posts for &lt;em&gt;every&lt;/em&gt; major release we’ve done (there have been eight!), but here are some of the more recent highlights if you missed them:&lt;/p&gt;
&lt;h3 id=&quot;collections&quot;&gt;Collections&lt;/h3&gt;
&lt;p&gt;In 1.6 we added collections, the ability to add archived items to multiple different groups for sharing and export! Collections serve as the base for future curation features, but right now they allow for both crawls and uploads to be replayed, downloaded, or shared together as one package.&lt;/p&gt;
&lt;p&gt;Because both uploads and crawls share their data within a collection, they also allow you to manually patch automated crawls with captures made in our &lt;a href=&quot;https://archiveweb.page&quot;&gt;ArchiveWeb.page browser extension&lt;/a&gt;. If elements on a site you’ve tried to capture in Browsertrix aren’t available when replaying the crawl but &lt;em&gt;are&lt;/em&gt; available when you capture the page with ArchiveWeb.page, try uploading a WACZ from ArchiveWeb.page and adding both to a collection!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-collections-archived-items-list.DIx9EUW3_Z1kQfKS.webp&quot; alt=&quot;A screenshot of the Browsertrix collection archived item list tab with both crawls and uploads in the list.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1920&quot; height=&quot;810&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;dashboard--execution-time&quot;&gt;Dashboard &amp;amp; Execution Time&lt;/h3&gt;
&lt;p&gt;In 1.7 we added the Overview page which displays key org metrics for storage and crawling. In 1.9 we’ve updated the usage history table to give you more granular stats on your execution time, separately listing the execution minutes used per-month based on how you’re charged for them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-org-dashboard.BsAqYk72_ZNAsgJ.webp&quot; alt=&quot;Screenshot of the Org Dashboard page, the storage section shows a bar graph of how much data is being used for each archived item type, a table lists the breakdown of execution minutes used per month.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1920&quot; height=&quot;1111&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;documentation&quot;&gt;Documentation&lt;/h3&gt;
&lt;p&gt;We may not have updated the blog much, but we definitely upgraded our docs! Browsertrix now has a full &lt;a href=&quot;https://docs.browsertrix.com/user-guide/&quot;&gt;user guide&lt;/a&gt; using the excellent &lt;a href=&quot;https://squidfunk.github.io/mkdocs-material/&quot;&gt;Material for MKDocs&lt;/a&gt; theme. One page I’m personally proud of is our &lt;a href=&quot;https://docs.browsertrix.com/user-guide/workflow-setup/&quot;&gt;extensive list of every crawl workflow setting&lt;/a&gt;, a handy reference if the in-app help text isn’t &lt;em&gt;quite&lt;/em&gt; enough.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-docs.CnwvW3yU_15onco.webp&quot; alt=&quot;A screenshot of the Browsertrix workflow settings documentation, a long article that lists every setting available to users&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1920&quot; height=&quot;1051&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;19-release&quot;&gt;1.9 Release!&lt;/h2&gt;
&lt;h3 id=&quot;crawler-version-selection&quot;&gt;Crawler Version Selection&lt;/h3&gt;
&lt;p&gt;We frequently release beta versions of &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler/&quot;&gt;Browsertrix Crawler&lt;/a&gt;, the core component of our software actually responsible for capturing websites. Up until now, the version the app uses has been set by us; soon we’ll be providing you with some options! Release channels can now be set on a per-crawl basis, so if we’ve implemented a fix in the latest beta for a site you’re trying to crawl, you can use it — while keeping in mind that there may be &lt;em&gt;other&lt;/em&gt; unresolved issues that aren’t quite ready for prime time yet &lt;em&gt;and that’s why it’s a beta version&lt;/em&gt;. But you knew that already, right?&lt;/p&gt;
&lt;p&gt;More information can be found in the &lt;a href=&quot;https://docs.browsertrix.com/user-guide/workflow-setup/#crawler-release-channel&quot;&gt;Crawler Release Channel section&lt;/a&gt; of the workflow setup page.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-workflow-settings-release-channel.CC8kZNk8_2pbA78.webp&quot; alt=&quot;A screenshot of the release channel selection dropdown menu and custom user agent field&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1920&quot; height=&quot;582&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;custom-user-agents&quot;&gt;Custom User Agents&lt;/h3&gt;
&lt;p&gt;Release channels aren’t the only new crawling feature though, we’ve also added the ability to set a custom user agent that the crawler will use to identify itself to websites. In addition to bypassing sites that try to restrict which browsers can access them — &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/HTTP/Browser_detection_using_the_user_agent&quot;&gt;something they generally shouldn’t be doing anyway&lt;/a&gt; — this feature is also useful for some of our larger clients for coordinating with publications to ensure their crawls don’t get blocked.&lt;/p&gt;
&lt;h3 id=&quot;updated-collection-selection-ui&quot;&gt;Updated Collection Selection UI&lt;/h3&gt;
&lt;p&gt;While we removed the clunky multi-stage setup and editing process for collections in 1.8 (if you know, you know), the release of 1.9 completes our overhaul of the collection content editing process. “Auto add” can be toggled for workflows right in the collections editor, and archived items in and out of the collection are now displayed in the same window, giving us more room to display information about each item.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/browsertrix-collection-selection.D7LhreXV_1FdkxF.webp&quot; alt=&quot;The new collection item selection UI, a unified list of archived items with a checkbox to add or exclude from the collection.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1920&quot; height=&quot;944&quot;&gt;&lt;/p&gt;
&lt;h3 id=&quot;fixes--small-things&quot;&gt;Fixes &amp;amp; Small Things&lt;/h3&gt;
&lt;p&gt;As always, a full list of fixes (and additions) can be found on our &lt;a href=&quot;https://github.com/webrecorder/browsertrix/releases&quot;&gt;GitHub releases page&lt;/a&gt;, but here’s some of the big small stuff:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The “workflow settings” tab (which displays a workflow’s current settings) and the crawl settings tab (which displays the settings used for that crawl) now show the same data without any discrepancies. &lt;a href=&quot;https://github.com/webrecorder/browsertrix/pull/1473&quot;&gt;#1473&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;Useful for nailing down what might have changed when crawling the same site multiple times with different settings!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;We’ve increased the max width of the app. More data on the screen at once! &lt;a href=&quot;https://github.com/webrecorder/browsertrix/pull/1484&quot;&gt;#1484&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fixed a memory leak, now the server doesn’t have to restart every day! &lt;a href=&quot;https://github.com/webrecorder/browsertrix/pull/1468&quot;&gt;#1468&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Run on Save&lt;/em&gt; is now only toggled on by default when creating a new workflow. &lt;a href=&quot;https://github.com/webrecorder/browsertrix/pull/1458&quot;&gt;#1458&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;No more workflows running by accident!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;We’re busy developing the initial version of our assistive quality assurance tools, currently focused on screenshot analysis of captured content. While there’s still a little ways to go before it’s ready for testing, everyone is pretty excited to get that into your hands. Look for it in the next major release! 🙂&lt;/p&gt;
&lt;p&gt;If you’re interested in signing up to crawl with Browsertrix for your institution, check out the details at &lt;a href=&quot;https://browsertrix.com&quot;&gt;Browsertrix.com&lt;/a&gt;.&lt;/p&gt;</content:encoded><author>Henry Wilkinson</author></item><item><title>An update on the WACZ format</title><link>https://webrecorder.net/blog/2023-05-03-an-update-on-wacz/</link><guid isPermaLink="true">https://webrecorder.net/blog/2023-05-03-an-update-on-wacz/</guid><description>It has been over two years since we&apos;ve first introduced the WACZ format and I wanted to give a brief update on exciting new tools and integrations of WACZ, and also provide a glimpse of what&apos;s next in the evolution of the format.</description><pubDate>Wed, 03 May 2023 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;new-wacz-tools-and-integrations&quot;&gt;New WACZ Tools and Integrations&lt;/h2&gt;
&lt;p&gt;It has been &lt;a href=&quot;/blog/2021-01-18-wacz-format-1-0&quot;&gt;over two years&lt;/a&gt; since we first introduced the WACZ format, and I wanted to provide a brief update on exciting new tools and integrations of WACZ, as well as a glimpse of what’s next in the evolution of the format.&lt;/p&gt;
&lt;h3 id=&quot;wacz-support-in-perma-tools&quot;&gt;WACZ support in Perma Tools&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://tools.perma.cc/&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;https://raw.githubusercontent.com/harvard-lil/tools.perma.cc/main/perma-tools.png&quot; alt=&quot;Perma Tools&quot; class=&quot;no-border&quot; width=&quot;150&quot;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We are thrilled to share that our colleagues at Harvard LIL, who run the &lt;a href=&quot;https://perma.cc&quot;&gt;perma.cc&lt;/a&gt; service, have released several new tools that leverage the WACZ format, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/harvard-lil/js-wacz&quot;&gt;js-wacz&lt;/a&gt; — a JavaScript library for WACZ, designed to be compatible with our original Python &lt;a href=&quot;https://github.com/webrecorder/py-wacz&quot;&gt;py-wacz&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/harvard-lil/scoop&quot;&gt;Scoop&lt;/a&gt; — a standalone archiving tool for generating signed WACZ files for single pages. The tool adheres to the WACZ Signing spec and also uses our Browsertrix Behaviors for improved high-fidelity capture.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/harvard-lil/wacz-exhibitor&quot;&gt;wacz-exhibitor&lt;/a&gt; — a tool for bootstrapping the &lt;a href=&quot;https://replayweb.page/docs/embedding&quot;&gt;ReplayWeb.page embed system&lt;/a&gt; with a bundled Nginx server, a custom cache layer, and additional wrapping via an iframe.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can read more about Scoop &lt;a href=&quot;https://lil.law.harvard.edu/blog/2023/04/13/scoop-witnessing-the-web/&quot;&gt;in this blog post from Matteo Cargnelutti, the lead developer of the tool&lt;/a&gt;, and explore the rest of the Perma Tools suite at &lt;a href=&quot;https://tools.perma.cc/&quot;&gt;https://tools.perma.cc/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It is exciting to see a growing open source ecosystem around the format and high-fidelity archiving!&lt;/p&gt;
&lt;h3 id=&quot;save-wacz-now&quot;&gt;Save WACZ Now!&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/spn-wacz.CDPlb8tt_Z2cOG1C.webp&quot; alt=&quot;A screenshot of Internet Archive’s Save Page Now interface with an option to receive a capture as a WACZ file&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1284&quot; height=&quot;902&quot;&gt;&lt;/p&gt;
&lt;p&gt;We are also excited to share that &lt;a href=&quot;https://web.archive.org/save/&quot;&gt;Internet Archive’s Save Page Now&lt;/a&gt; system now supports emailing users a copy of a WACZ file created from an on-demand capture using the service.&lt;/p&gt;
&lt;p&gt;Our colleague Ed Summers describes testing out this feature &lt;a href=&quot;https://inkdroid.org/2023/04/03/spn-wacz/&quot;&gt;in a recent blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We appreciate IA’s support in helping make web archives more portable via the WACZ format!&lt;/p&gt;
&lt;h3 id=&quot;wacz-in-ap-news&quot;&gt;WACZ in AP News&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://apnews.com/article/technology-police-government-surveillance-covid-19-3f3f348d176bc7152a8cb2dbab2e4cc4&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;/_astro/ap-news-embed.CT96aUbq_ZCTHxu.webp&quot; alt=&quot;A screenshot of a news story from the Associated Press with ReplayWeb.page embedded into the story to display an archived tweet as an embed.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;2880&quot; height=&quot;1502&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Our collaboration with &lt;a href=&quot;https://www.starlinglab.org/&quot;&gt;Starling Lab&lt;/a&gt; has led to an experimental use of a signed WACZ for an embedded Tweet in an AP News article, which you can &lt;a href=&quot;https://apnews.com/article/technology-police-government-surveillance-covid-19-3f3f348d176bc7152a8cb2dbab2e4cc4&quot;&gt;read here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The tweet uses our &lt;a href=&quot;/blog/2022-11-10-showing-provenance-on-replaywebpage-embeds&quot;&gt;archival ‘receipts’ provenance view&lt;/a&gt; to indicate that this archived tweet was created on the specified date and time by Starling Lab’s server! Even if the original tweet is removed or edited, we have digital proof that this web archive was created by Starling Lab (signed by a certificate issued to &lt;code&gt;authsign.starlinglab.org&lt;/code&gt;). This is all made possible by using a WACZ signed according to the &lt;a href=&quot;https://specs.webrecorder.net/wacz-auth/0.1.0/&quot;&gt;WACZ Signing and Verification Specs&lt;/a&gt;, created using an instance of Browsertrix operated by Starling.&lt;/p&gt;
&lt;h2 id=&quot;whats-next-for-wacz-spec&quot;&gt;What’s next For WACZ Spec&lt;/h2&gt;
&lt;p&gt;These are just some examples of the growing adoption of the WACZ format. (If you have more examples, please share them with us!)&lt;/p&gt;
&lt;p&gt;We also wanted to provide an update on new specification work happening around the WACZ format.&lt;/p&gt;
&lt;h3 id=&quot;wacz-on-ipfs-custom-storage-spec&quot;&gt;WACZ on IPFS Custom Storage Spec&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/wacz-file-dag.Cxhtj8JT_19DKCs.svg&quot; alt=&quot;A diagram displaying the chunks of a WACZ file separated by ZIP local file headers. WARC files within the WACZ are further separated into their own component chunks.&quot; class=&quot;no-border-w-full&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;460&quot; height=&quot;492&quot;&gt;&lt;/p&gt;
&lt;p&gt;One of the new things we are working on is how to put WACZ files on IPFS in a way that maximizes deduplication, by splitting the files along file, WARC record, and WARC payload boundaries.&lt;/p&gt;
&lt;p&gt;The spec covers how to put general ZIP files and WARC files (compressed or uncompressed) onto IPFS, and the various trade-offs involved.&lt;/p&gt;
&lt;p&gt;By following the spec, it will be possible to leverage IPFS’s content addressing to automatically deduplicate the same archived content, even if stored in different WARC files inside different WACZ files!&lt;/p&gt;
&lt;p&gt;You can &lt;a href=&quot;https://github.com/webrecorder/specs/blob/main/wacz-ipfs/latest/index.md&quot;&gt;read the current draft of the spec here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://archiveweb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt; extension and our &lt;a href=&quot;https://webrecorder.github.io/save-tweet-now/&quot;&gt;Save Tweet to IPFS&lt;/a&gt; tool are already using this spec.&lt;/p&gt;
&lt;p&gt;For example, if two different users save the same tweet, the actual content will be automatically deduplicated, while the WARC headers will be new, resulting in overall storage savings.&lt;/p&gt;
&lt;p&gt;Look out for additional blog posts describing this spec in more detail!&lt;/p&gt;
&lt;h3 id=&quot;multi-wacz-or-wacz-collections-spec&quot;&gt;Multi-WACZ or WACZ Collections Spec&lt;/h3&gt;
&lt;p&gt;We are also working on a spec for how to combine multiple WACZ files to create collections. A single WACZ file can only be so big (though we exceeded 1 TB with the format last year), and we need a way to group WACZ files, either from the same crawl or multiple crawls, into a user-defined collection.&lt;/p&gt;
&lt;p&gt;The ‘Multi WACZ’ spec will be all about creating collections of persistent web archives, and will be a Frictionless Data package that specifies URLs to WACZ files logically grouped together.&lt;/p&gt;
&lt;p&gt;If you are interested in this spec, please &lt;a href=&quot;https://github.com/webrecorder/specs/pull/135&quot;&gt;see the pull request&lt;/a&gt; and &lt;a href=&quot;https://github.com/webrecorder/specs/issues/112&quot;&gt;GitHub Issue&lt;/a&gt; and feel free to provide feedback!&lt;/p&gt;
&lt;p&gt;We will provide additional information as this spec develops!&lt;/p&gt;
&lt;h3 id=&quot;improvements-and-suggestions-wanted&quot;&gt;Improvements and Suggestions Wanted!&lt;/h3&gt;
&lt;p&gt;We are always looking to further improve the spec as web archiving continues to evolve. Is there other data you’d like to see in the WACZ format, or do you have other feedback in general? If so, feel free to leave an issue directly on our &lt;a href=&quot;https://github.com/webrecorder/specs&quot;&gt;specs&lt;/a&gt; repo!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;strong&gt;EDIT 2024-05-22:&lt;/strong&gt; “Browsertrix” was previously referred to here as “Browsertrix Cloud”. This post has been updated to reflect the new name.&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Announcing pywb 2.7.0 release</title><link>https://webrecorder.net/blog/2022-11-23-pywb-27/</link><guid isPermaLink="true">https://webrecorder.net/blog/2022-11-23-pywb-27/</guid><description>We are excited to announce the release of pywb 2.7, with a new interactive banner and calendar interface!</description><pubDate>Wed, 23 Nov 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After several betas and months of development, we are excited to announce the release of &lt;a href=&quot;https://github.com/webrecorder/pywb&quot;&gt;pywb 2.7&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;This release includes a new banner and calendar user interface for pywb written in &lt;a href=&quot;https://vuejs.org/&quot;&gt;Vue&lt;/a&gt;. The new banner has the same localization/multi-language support as pywb 2.6 with a number of new additions and improvements, including an interactive timeline for navigation between captures and &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/vue-ui.html#vue-ui&quot;&gt;easier local theming&lt;/a&gt; of the banner via the &lt;code&gt;config.yaml&lt;/code&gt; configuration file.&lt;/p&gt;
&lt;p&gt;We hope that this new UI will be more flexible and easier to modify to meet user needs.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/pywb-27-capture.BtoStMsq_1CUSQM.webp&quot; alt=&quot;Screenshot of pywb 2.7 banner seen over a capture of dpconline.org&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;2472&quot; height=&quot;1750&quot;&gt;&lt;/p&gt;
&lt;p&gt;The new timeline and calendar can be independently toggled on and off as needed:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/pywb-27-capture-timeline-calendar.DokWZUYy_1ibThX.webp&quot; alt=&quot;Screenshot of pywb 2.7 banner seen over a capture of dpconline.org with the timeline and calendar visible&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;2472&quot; height=&quot;1750&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/&quot;&gt;pywb documentation&lt;/a&gt; now has a section on &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/vue-ui.html&quot;&gt;how to use and customize the new Vue UI&lt;/a&gt;, and a complete list of changes is also available in the pywb &lt;a href=&quot;https://github.com/webrecorder/pywb/blob/main/CHANGES.rst&quot;&gt;Changelist on GitHub&lt;/a&gt;. This release also adds a new &lt;a href=&quot;https://github.com/webrecorder/pywb/blob/main/CONTRIBUTING.md&quot;&gt;contributing guide&lt;/a&gt; to the pywb GitHub repository with information about how to submit issues, propose new features, and contribute code to pywb.&lt;/p&gt;
&lt;p&gt;This release builds on &lt;a href=&quot;/blog/2020-12-15-owb-to-pywb-transition-guide&quot;&gt;previous&lt;/a&gt; &lt;a href=&quot;/blog/2021-08-11-pywb-26&quot;&gt;rounds&lt;/a&gt; of work which were supported by the &lt;a href=&quot;https://netpreserve.org/&quot;&gt;IIPC&lt;/a&gt;. Webrecorder wishes to thank the IIPC membership for their beta testing and feedback for the 2.7.0 release.&lt;/p&gt;
&lt;h3 id=&quot;next-steps&quot;&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;The next release of pywb is planned to include support for the &lt;a href=&quot;https://specs.webrecorder.net/wacz/1.1.1/&quot;&gt;WACZ format&lt;/a&gt; created and used by other Webrecorder open source tools, including &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler&quot;&gt;browsertrix-crawler&lt;/a&gt;, &lt;a href=&quot;https://github.com/webrecorder/archiveweb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt;, and &lt;a href=&quot;https://github.com/webrecorder/replayweb.page&quot;&gt;ReplayWeb.page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Community consultation and roadmapping for a future pywb 3.0 release are also underway. Stay tuned in the coming months for updates!&lt;/p&gt;</content:encoded><author>Tessa Walsh and Ilya Kreymer</author></item><item><title>Showing Provenance on ReplayWeb.page Embeds</title><link>https://webrecorder.net/blog/2022-11-10-showing-provenance-on-replaywebpage-embeds/</link><guid isPermaLink="true">https://webrecorder.net/blog/2022-11-10-showing-provenance-on-replaywebpage-embeds/</guid><description>The ReplayWeb.page viewer allows web archives to be embedded and displayed almost anywhere. It can be used as a single-page embed, or integrated into existing services and digital repository systems.</description><pubDate>Thu, 10 Nov 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The &lt;a href=&quot;https://replayweb.page&quot;&gt;ReplayWeb.page&lt;/a&gt; viewer allows web archives to be embedded and displayed almost anywhere. It can be used as a single-page embed, or integrated into existing services and digital repository systems.&lt;/p&gt;
&lt;p&gt;Previously, ReplayWeb.page embeds either showed the location and nav bar UI, or displayed content without any additional UI or context. With the interface hidden, there is little to signal to users that the content is being served from a web archive, since a high-fidelity web archive should look and feel the same as the original.
At the same time, we want people viewing web archives to understand that the content within an archive may not be a perfect record of a live website, and that the content is frozen: it won’t be updated or deleted like content embedded through traditional means.&lt;/p&gt;
&lt;p&gt;To help with this, we’ve added a &lt;a href=&quot;https://replayweb.page/docs/embedding#embed-modes&quot;&gt;new embed mode&lt;/a&gt; that allows ReplayWeb.page embeds to be added seamlessly without extra UI, but with a dropdown ‘archive receipt’ showing the provenance of the archive, as seen below:&lt;/p&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/replaywebpage/ui.js&quot;&gt;&lt;/script&gt;
&lt;div&gt;&lt;replay-web-page style=&quot;height: 520px&quot; replaybase=&quot;/replay/&quot; embed=&quot;replay-with-info&quot; src=&quot;https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/misc/tweet-embed.wacz&quot; url=&quot;page:0&quot;&gt;&lt;/replay-web-page&gt;&lt;/div&gt;
&lt;p&gt;This mode can be enabled by setting the &lt;code&gt;embed=&amp;quot;replay-with-info&amp;quot;&lt;/code&gt; attribute in the &lt;code&gt;&amp;lt;replay-web-page&amp;gt;&lt;/code&gt; tag.&lt;/p&gt;
&lt;h3 id=&quot;archive-provenance&quot;&gt;Archive Provenance&lt;/h3&gt;
&lt;p&gt;The embed (above) now includes a dropdown, which expands to show the ‘archive receipt’ with info about the web archive:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/receipt.BVCZiFou_ZQ6QHh.webp&quot; alt=&quot;Screenshot of an expanded archive receipt dropdown showing provenance information for an archived page&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1744&quot; height=&quot;954&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the technical info section, the archival receipt can show basic provenance metadata such as the &lt;em&gt;Original URL&lt;/em&gt; and &lt;em&gt;Archived On&lt;/em&gt; date, as well as which tools were used to create the archive. Additional provenance info can be added here as needed. A download link at the bottom allows the full archive to be downloaded locally, and the size of the full archive file is included.&lt;/p&gt;
&lt;h3 id=&quot;validating-signed-web-archives&quot;&gt;Validating Signed Web Archives&lt;/h3&gt;
&lt;p&gt;A key use case for the receipt is to also show cryptographic signature metadata about the web archive. If the web archive is a signed WACZ file, created as per the &lt;a href=&quot;https://specs.webrecorder.net/wacz-auth/0.1.0/&quot;&gt;WACZ Signing Spec&lt;/a&gt;, the real-time validation status, cryptographic keys, and the hash of the full data package in the WACZ file are also shown.&lt;/p&gt;
&lt;p&gt;When loading a signed WACZ file, all data (WARC records, indexes, page lists, etc.) is validated on the fly as it is loaded (via wabac.js). The receipt includes a &lt;em&gt;Validation&lt;/em&gt; field, which shows the number of hashes validated thus far. If the WACZ file has been tampered with in any way, the hashes will not match, and this is also displayed to the user.&lt;/p&gt;
&lt;p&gt;The cryptographic metadata also relates to provenance, and includes either a public key, or a link to a trusted third-party observer certificate, to establish the creator of the web archive. (We’ll discuss how this works in a future blog post!)&lt;/p&gt;
&lt;h3 id=&quot;trusting-web-archives&quot;&gt;Trusting Web Archives&lt;/h3&gt;
&lt;p&gt;We hope the display of this information will be a first step in making distributed web archives more trusted. While some of this data is fairly technical and may not be relevant to most users, we believe it can encourage independent verification of archived content in the future.&lt;/p&gt;
&lt;p&gt;In a follow up blog post, we will discuss the different WACZ signing approaches and their implications for authenticity!&lt;/p&gt;</content:encoded><author>Ilya Kreymer and Henry Wilkinson</author></item><item><title>Perma.cc Upgrades to ReplayWeb.page</title><link>https://webrecorder.net/blog/2022-08-17-permacc-upgrades-to-replaywebpage/</link><guid isPermaLink="true">https://webrecorder.net/blog/2022-08-17-permacc-upgrades-to-replaywebpage/</guid><description>Perma.cc is now using ReplayWeb.page for web archive playback.</description><pubDate>Wed, 17 Aug 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I am thrilled to share that our colleagues at &lt;a href=&quot;https://lil.law.harvard.edu/&quot;&gt;Harvard Law Library Innovation Lab (LIL)&lt;/a&gt; who run the &lt;a href=&quot;https://perma.cc&quot;&gt;Perma.cc&lt;/a&gt; service have recently switched to using &lt;a href=&quot;/replaywebpage&quot;&gt;ReplayWeb.page&lt;/a&gt; as the default replay system for all Perma.cc archives! Read &lt;a href=&quot;http://blogs.harvard.edu/perma/2022/08/17/new-playback-software-improves-fidelity-of-your-perma-links/&quot;&gt;the announcement on their blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/perma-screenshot.NMm0mEW5_Z1HIvIz.webp&quot; alt=&quot;A screenshot of Perma.CC&apos;s website loading a web archive using ReplayWeb.page&quot; class=&quot;no-border-w-full&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;3100&quot; height=&quot;1648&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/replaywebpage&quot;&gt;ReplayWeb.page&lt;/a&gt; provides a fully browser-based viewer / web archive replay system with a variety of &lt;a href=&quot;https://replayweb.page/docs/embedding&quot;&gt;embedding and customization options&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;ReplayWeb.page includes a number of replay fidelity improvements, and Perma.cc users should see improved replay across the board. The new system also simplifies the replay architecture by allowing web archives to be loaded directly from Perma’s cloud-based storage. For added security, Perma.cc replays WARC files in a separate iframe that is only accessible from a Perma.cc URL.&lt;/p&gt;
&lt;p&gt;Perma.cc has a large archive of single-page WARC files, and ReplayWeb.page will download and index each WARC file on first load. Since Perma.cc captures a single page at a time, this is generally very performant for most of their archives. ReplayWeb.page supports both WARC and WACZ replay, allowing Perma.cc to easily experiment with using &lt;a href=&quot;https://specs.webrecorder.net/wacz/latest/&quot;&gt;WACZ&lt;/a&gt; files, should they wish to do so in the future.&lt;/p&gt;
&lt;p&gt;Perma.cc continues to be a pioneer and early adopter of the latest Webrecorder tools and technologies, and we have a long tradition of working together! In 2014, Perma.cc was one of the first adopters of the &lt;a href=&quot;https://webrecorder.net/tools#pywb&quot;&gt;pywb&lt;/a&gt; system, back when it was barely in alpha!&lt;/p&gt;
&lt;p&gt;In 2019, Perma.cc switched to a customized version of the then-current Webrecorder stack (which still powers the &lt;a href=&quot;https://conifer.rhizome.org/&quot;&gt;Conifer&lt;/a&gt; service). Over the years, Perma.cc developers have also contributed to various Webrecorder open source tools, supported collaborative development efforts, and contributed to important research around &lt;a href=&quot;https://labs.rhizome.org/presentations/security.html#/&quot;&gt;web archives and security&lt;/a&gt;, including the &lt;a href=&quot;https://specs.webrecorder.net/wacz-auth/0.1.0/&quot;&gt;WACZ signing specification&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I look forward to continuing our long-standing collaborations!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Webrecorder receives $1.3M open source development grant from the Filecoin Foundation</title><link>https://webrecorder.net/blog/2022-06-21-announcing-new-grant-from-filecoin/</link><guid isPermaLink="true">https://webrecorder.net/blog/2022-06-21-announcing-new-grant-from-filecoin/</guid><description>I’m really thrilled to announce that Webrecorder has received a two-year, $1.3M open source development grant from the Filecoin Foundation! The grant will support our mission of developing quality open source web archiving tools for all!</description><pubDate>Tue, 21 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/_astro/filecoin-solid.BlFTdESi_19DKCs.svg&quot; alt=&quot;Filecoin Foundation logo&quot; class=&quot;no-border-w-full&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;147&quot; height=&quot;41&quot;&gt;&lt;/p&gt;
&lt;p&gt;I’m really thrilled to announce that Webrecorder has received a two-year, $1.3M open source development grant from the Filecoin Foundation!&lt;/p&gt;
&lt;p&gt;The grant will support our mission of developing quality open source web archiving tools for all!&lt;/p&gt;
&lt;p&gt;This funding will help us to grow the Webrecorder team and make improvements across the broad Webrecorder ecosystem of tools.&lt;/p&gt;
&lt;p&gt;(Check out the &lt;a href=&quot;/jobs&quot; target=&quot;_blank&quot;&gt;jobs&lt;/a&gt; page for more info on current and future job postings!)&lt;/p&gt;
&lt;p&gt;From the beginning, Webrecorder’s mission has been to support decentralized web archiving that can be performed directly in the browser, where web archives can live anywhere that data can be stored. A key part of enabling decentralized web archiving is a system of decentralized or distributed storage. The IPFS protocol provides a powerful foundation, and the Filecoin storage network can go a long way in making decentralized web archiving a reality.&lt;/p&gt;
&lt;p&gt;Dietrich Ayala, IPFS Ecosystem Lead at Protocol Labs, agrees:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our collaboration with Webrecorder is key to the IPFS and Filecoin mission: making a web that works for the most impacted users in critical situations, and ensuring the safety of the digital human experience for future generations. Webrecorder provides the specs, libraries, tools and services to build bridges between the HTTP web and any of these new technologies, and bring the last 30+ years of the web along too.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I am very grateful for Filecoin Foundation’s continued support of the Webrecorder project and our web archiving mission, and thankful to everyone who has supported our efforts thus far!&lt;/p&gt;
&lt;p&gt;You can also read our &lt;a href=&quot;https://filecoinfoundation.medium.com/dev-grant-spotlight-webrecorder-420195099af8&quot; target=&quot;_blank&quot;&gt;project spotlight on the Filecoin Foundation blog&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;highlights-from-recent-work&quot;&gt;Highlights from Recent Work&lt;/h2&gt;
&lt;p&gt;This grant &lt;a href=&quot;/blog/2021-10-13-devgrant-design-and-standards&quot;&gt;supersedes and expands on our previous open source development grant from Filecoin Foundation&lt;/a&gt;, which focused on design and research.&lt;/p&gt;
&lt;p&gt;Here are a few highlights from the progress we’ve made in design, research and browser-based tool integration over the last few months:&lt;/p&gt;
&lt;h3 id=&quot;ux-research-with-new-design-congress&quot;&gt;UX Research with New Design Congress&lt;/h3&gt;
&lt;p&gt;As part of our previous grant, New Design Congress has been working on extensive UX research around browser-based web archiving. They’ve shared their &lt;a href=&quot;https://www.youtube.com/watch?v=Sh-x3QmbRZc&amp;list=PLEUUEYdQPpeYKHZ1C8SheLx_TvkZKc_aE&quot; target=&quot;_blank&quot;&gt;initial findings in our last community call&lt;/a&gt; and a more detailed report is forthcoming! We will continue to collaborate around
research, and examine use cases and risks associated with browser-based web archiving.&lt;/p&gt;
&lt;h3 id=&quot;wacz-spec--use-cases-development&quot;&gt;WACZ Spec + Use Cases Development&lt;/h3&gt;
&lt;p&gt;We have formalized the WACZ spec, and added additional web archiving related specs, including a spec for the CDXJ format, and a spec for signing WACZ files. The specs are available at &lt;a href=&quot;https://specs.webrecorder.net/&quot; target=&quot;_blank&quot;&gt;specs.webrecorder.net&lt;/a&gt;, thanks in large part to the work of Ed Summers, our technical writer and editor for this effort. The work on the WACZ spec continues, focusing on full-text search, additional metadata, and recommendations for WACZ storage on IPFS.&lt;/p&gt;
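&lt;p&gt;To make one of these specs concrete: a CDXJ index line is simply a sorted (SURT-form) URL key, a timestamp, and a JSON block. Here is a minimal, hypothetical parsing sketch; the sample line is invented for illustration, and the field names follow the CDXJ spec:&lt;/p&gt;

```python
import json

def parse_cdxj_line(line):
    """Split a CDXJ line of the form '<urlkey> <timestamp> <json>'
    into its three parts, decoding the trailing JSON block."""
    urlkey, timestamp, json_block = line.split(" ", 2)
    return urlkey, timestamp, json.loads(json_block)

# A hypothetical index entry for an archived page:
line = 'org,example)/ 20220101120000 {"url": "https://example.org/", "status": "200", "mime": "text/html"}'
urlkey, timestamp, fields = parse_cdxj_line(line)
```

&lt;p&gt;Because the key is in sorted SURT form, plain lexicographic sorting of these lines yields a binary-searchable index.&lt;/p&gt;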
&lt;h3 id=&quot;archivewebpage-browser-integration&quot;&gt;ArchiveWeb.page Browser Integration&lt;/h3&gt;
&lt;p&gt;Part of our work in making web archives more accessible is to attempt to integrate web archiving directly into browsers. Mauve Software has released an update to their &lt;a href=&quot;https://github.com/AgregoreWeb/agregore-browser/releases/tag/v1.3.0&quot; target=&quot;_blank&quot;&gt;Agregore Browser&lt;/a&gt; which includes a proof-of-concept integration of web archiving via the ArchiveWeb.page extension.&lt;/p&gt;
&lt;p&gt;The Agregore Browser supports IPFS natively as well as many other p2p protocols, and support for browser-based archiving will soon allow users to share web archives they created directly through the browser itself.&lt;/p&gt;
&lt;h2 id=&quot;project-goals&quot;&gt;Project Goals&lt;/h2&gt;
&lt;p&gt;This new funding will help us continue these existing efforts, as well as support our software development goals for the Webrecorder ecosystem in several key areas.&lt;/p&gt;
&lt;p&gt;We will share a more detailed timeline later, but a few high-level goals for the next two years include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Browsertrix&lt;/strong&gt; - Continued development of our open-source, federated cloud-native SaaS service, with support for archival storage of data on IPFS/Filecoin as one option. The service will allow institutions as well as independent communities to be able to create archives on their own.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scalable web archive data model/specification and necessary tooling&lt;/strong&gt; - Building on the WACZ file format, and implementing a robust data model for storing larger web archive collections, from a single WACZ file to multi-TB or even multi-PB collections. The data model would support encryption, signing and storage of all data on IPFS/Filecoin, and an optional IPFS-based search index.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Web archive signing and validation framework&lt;/strong&gt; - Building tools and specifications for signed and verifiable web archives, to prove identity and authenticity. To support a variety of use cases, this will include multiple approaches, such as PKI-based, DID, and possibly blockchain-based solutions for verifying authenticity. Standalone validator tools that are deployable by independent institutions will also be created.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Replay Viewer and Embedding&lt;/strong&gt; - Continued development of the ReplayWeb.page viewer to keep up with the complexity of the web, including integration with additional CMS/digital preservation systems. Improvements for self-hosting the viewer for loading web archives from IPFS, and implementation of validation features in the viewer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Browser-Based Web Archiving Tooling&lt;/strong&gt;: Continued development and research around several browser-based archiving approaches, including our &lt;a href=&quot;https://archiveweb.page/&quot; target=&quot;_blank&quot;&gt;ArchiveWeb.page&lt;/a&gt; extension and the extension-less archiving via &lt;a href=&quot;https://express.archiveweb.page/&quot; target=&quot;_blank&quot;&gt;ArchiveWeb.page Express&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;next-steps&quot;&gt;Next Steps&lt;/h2&gt;
&lt;p&gt;The next two years will be an exciting time for Webrecorder, as we continue to build on this previous work and expand our efforts on &lt;a href=&quot;/browsertrix&quot; target=&quot;_blank&quot;&gt;Browsertrix&lt;/a&gt;, which has gained a lot of use over the last few months. (More details in an upcoming blog post!)&lt;/p&gt;
&lt;p&gt;In the short term, we will also be looking for &lt;a href=&quot;/jobs&quot; target=&quot;_blank&quot;&gt;additional help&lt;/a&gt;! If you would like to work with Webrecorder on achieving our mission of web archiving for all, please do not hesitate to reach out!&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;strong&gt;EDIT 2024-05-22:&lt;/strong&gt; “Browsertrix” was previously referred to here as “Browsertrix Cloud”. This post has been updated to reflect the new name.&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Introducing: Browsertrix</title><link>https://webrecorder.net/blog/2022-02-23-browsertrix-cloud/</link><guid isPermaLink="true">https://webrecorder.net/blog/2022-02-23-browsertrix-cloud/</guid><description>I&apos;m excited to announce that Webrecorder is embarking on perhaps our most ambitious development effort to date: the collaborative development of Browsertrix!</description><pubDate>Wed, 23 Feb 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I’m excited to announce that Webrecorder is embarking on perhaps our most ambitious development effort to date: the collaborative development of Browsertrix!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/browsertrix&quot;&gt;&lt;img src=&quot;/_astro/btrix-cloud.9ODzK17Q_19DKCs.svg&quot; alt=&quot;Browsertrix’s Logo&quot; class=&quot;no-border-w-full&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;672&quot; height=&quot;428&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can read more on our official site &lt;a href=&quot;https://browsertrix.com/&quot;&gt;browsertrix.com&lt;/a&gt;, including a list of key features.&lt;/p&gt;
&lt;p&gt;Browsertrix is a fully integrated open source browser-based crawling platform that will allow users to create their own high-fidelity web archives in an automated way at scale.&lt;/p&gt;
&lt;h3 id=&quot;current-status&quot;&gt;Current Status&lt;/h3&gt;
&lt;p&gt;The full source code for Browsertrix &lt;a href=&quot;https://github.com/webrecorder/browsertrix&quot;&gt;is available on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Development is currently in the early stages, with a focus on implementing core features and a user-friendly UI.&lt;/p&gt;
&lt;p&gt;Our Senior Frontend Developer &lt;a href=&quot;https://suayoo.com/&quot;&gt;Sua Yoo&lt;/a&gt; has been tackling the creation of a brand new user interface to manage crawls and crawl configurations. To keep up with the development progress, please follow the project on GitHub.&lt;/p&gt;
&lt;h2 id=&quot;planned-service-and-collaborative-development-with-iipc-community&quot;&gt;Planned Service and Collaborative Development with IIPC Community&lt;/h2&gt;
&lt;p&gt;Webrecorder plans to eventually offer Browsertrix as a service, and we will be rolling out testing gradually over the next few months.&lt;/p&gt;
&lt;p&gt;If you’re interested in participating in early testing, please &lt;a href=&quot;https://docs.google.com/forms/d/e/1FAIpQLSfU4emUsdaAFXZpEvWruZSqnIbH6ngAefOWSLef1EjMw0Kitw/viewform&quot;&gt;sign up for the Browsertrix info list&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But, Browsertrix won’t just be another siloed online service!&lt;/p&gt;
&lt;p&gt;Browsertrix is still in early stages of development, but we believe it is important to share this work, and more importantly, develop it in collaboration with our partners in the web archiving community.&lt;/p&gt;
&lt;p&gt;Towards this goal, we are also very excited to announce our collaboration with the IIPC community!&lt;/p&gt;
&lt;p&gt;The International Internet Preservation Consortium (IIPC) has agreed &lt;a href=&quot;https://netpreserve.org/projects/browser-based-crawling/&quot;&gt;to contribute funding towards the development of Browsertrix&lt;/a&gt; over the next one or two years.&lt;/p&gt;
&lt;p&gt;In addition to this funding and as part of this collaboration, several IIPC members, including &lt;em&gt;The Royal Danish Library&lt;/em&gt;, &lt;em&gt;British Library&lt;/em&gt;, the &lt;em&gt;National Library of New Zealand&lt;/em&gt;, and the &lt;em&gt;University of North Texas&lt;/em&gt; will also be deploying Browsertrix within their institutions.&lt;/p&gt;
&lt;p&gt;The goal of Browsertrix is to provide a kind of federated web archiving system, which can be deployed not just by Webrecorder, but by other institutions. The close collaboration with IIPC members from the start will ensure that this system can meet the broader goals of the web archiving community, from smaller institutions to large national libraries.&lt;/p&gt;
&lt;h3 id=&quot;open-source-and-open-web-archive-data&quot;&gt;Open Source and Open Web Archive Data&lt;/h3&gt;
&lt;p&gt;Webrecorder believes strongly that web archiving tools should be fully open-source to ensure long-term viability of the digital record. Web archiving is too critical to be relegated to proprietary processes and the whims of individual vendors.&lt;/p&gt;
&lt;p&gt;While the WARC format provides a standard way to store raw HTTP data, there are no standard formats for &lt;em&gt;everything else&lt;/em&gt;: crawl specifications, crawl logs, page lists and indexes, full-text search data, and curatorial metadata. With the &lt;a href=&quot;https://webrecorder.github.io/wacz-spec/&quot;&gt;WACZ format&lt;/a&gt;, Webrecorder is beginning to standardize some of these remaining components of the web archiving workflow.&lt;/p&gt;
&lt;p&gt;One of our goals with Browsertrix is to allow crawl outputs (WACZ or WARC) to be stored in any storage of the user’s choosing. Browsertrix will allow output to any S3 bucket, so that even if the service were to disappear or stop working, all of the data will still be accessible using existing tools like &lt;a href=&quot;https://replayweb.page&quot;&gt;ReplayWeb.page&lt;/a&gt;. This federated approach to storage will allow crawled data to be stored almost anywhere: from custom institutional repositories like Archipelago, to existing WARC data centers at national libraries, to any cloud S3 provider (like Amazon or Digital Ocean), to the local file system, as well as decentralized storage systems like IPFS.&lt;/p&gt;
&lt;p&gt;With Browsertrix, we hope to enable users to truly own all of their web archive data, and to be able to access and make use of it without relying on infrastructure from any single vendor (including Webrecorder!)&lt;/p&gt;
&lt;h3 id=&quot;collaborations-welcome&quot;&gt;Collaborations Welcome&lt;/h3&gt;
&lt;p&gt;We know this will be an ambitious project, and we are just getting started! Web archiving is becoming more critical and more difficult.&lt;/p&gt;
&lt;p&gt;If you would like to contribute to the development, testing, or are interested in a custom deployment of Browsertrix, please feel free to reach out directly via e-mail, GitHub or our forums.&lt;/p&gt;
&lt;p&gt;If you would like to support Webrecorder financially, please consider supporting Webrecorder via our &lt;a href=&quot;https://opencollective.com/webrecorder&quot;&gt;Open Collective&lt;/a&gt; or &lt;a href=&quot;https://github.com/sponsors/webrecorder&quot;&gt;GitHub Sponsors&lt;/a&gt; accounts and don’t hesitate to reach out to us with any questions.&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;strong&gt;EDIT 2024-05-22:&lt;/strong&gt; “Browsertrix” was previously referred to here as “Browsertrix Cloud”. This post has been updated to reflect the new name.&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Launch of Open Collective and First Institutional Sponsor</title><link>https://webrecorder.net/blog/2022-02-15-open-collective-rhizome/</link><guid isPermaLink="true">https://webrecorder.net/blog/2022-02-15-open-collective-rhizome/</guid><description>We are excited to announce the launch of Webrecorder&apos;s Open Collective page, and to welcome Rhizome as our first institutional sponsor through this platform</description><pubDate>Tue, 15 Feb 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We are excited to announce the launch of Webrecorder’s &lt;a href=&quot;https://opencollective.com/webrecorder&quot;&gt;Open Collective&lt;/a&gt; page. Open Collective is a crowd-funding platform that allows many kinds of groups, including open source projects, to fundraise and keep track of funding and expenses in an open and deliberate way.&lt;/p&gt;
&lt;p&gt;And, we’re thrilled to announce &lt;a href=&quot;https://rhizome.org/&quot;&gt;Rhizome&lt;/a&gt; as Webrecorder’s first institutional sponsor through the Open Collective platform!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://opencollective.com/webrecorder&quot;&gt;&lt;img src=&quot;/_astro/wr-oc.DLGfI6Wv_ZCMxOc.webp&quot; alt=&quot;A screenshot of Webrecorder&apos;s Open Collective page&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1139&quot; height=&quot;678&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Our team is extremely appreciative of Rhizome’s continued support. Rhizome was an early benefactor of the Webrecorder project, and Webrecorder found its home at Rhizome from 2016-2020. Thanks to the support from Rhizome and generous grant funding from the Andrew W. Mellon Foundation, we developed the original webrecorder.io hosting service for high-fidelity browser-based interactive web archiving.&lt;/p&gt;
&lt;p&gt;In 2020, the webrecorder.io service was renamed to &lt;a href=&quot;https://conifer.rhizome.org/&quot;&gt;Conifer&lt;/a&gt; and became an integral part of the Rhizome digital preservation program.&lt;/p&gt;
&lt;p&gt;Through this support, Rhizome remains committed to the continued success of open source web archiving tools, to web archiving as a cultural practice, and to the continuous development of infrastructure and research.&lt;/p&gt;
&lt;p&gt;We hope to further collaborate with Rhizome around improvements to shared tools and implementing new features, such as adding support for our &lt;a href=&quot;https://webrecorder.github.io/wacz-spec/&quot;&gt;WACZ format&lt;/a&gt; to the Conifer service.&lt;/p&gt;
&lt;p&gt;In addition to Open Collective, we are continuing to accept support via &lt;a href=&quot;https://github.com/sponsors/webrecorder&quot;&gt;GitHub Sponsors&lt;/a&gt;. We also want to thank all of our sponsors thus far who have supported Webrecorder via GitHub Sponsors!&lt;/p&gt;
&lt;p&gt;In particular, we would also like to thank &lt;a href=&quot;https://www.kiwix.org/&quot;&gt;Kiwix&lt;/a&gt;, who supported the &lt;a href=&quot;https://webrecorder.net/2021/02/22/introducing-browsertrix-crawler.html&quot;&gt;initial development of Browsertrix Crawler&lt;/a&gt; and have continued to sponsor Webrecorder via GitHub Sponsors.&lt;/p&gt;
&lt;p&gt;By starting an Open Collective page, we hope to provide our community with more ways to support the project, and to further explore ways to use the Open Collective structure to share updates and progress on our work.&lt;/p&gt;
&lt;p&gt;We would like to again thank Rhizome, Kiwix and all of our other supporters - your continued support helps make our work building open source web archiving tools possible!&lt;/p&gt;</content:encoded><author>Ilya Kreymer and Lorena Ramírez-López</author></item><item><title>Improving browser-based web archiving with standards and design research</title><link>https://webrecorder.net/blog/2022-01-18-grant-update-standards-and-design-research/</link><guid isPermaLink="true">https://webrecorder.net/blog/2022-01-18-grant-update-standards-and-design-research/</guid><description>After 30 years, we see how much the web and its users have grown and evolved and web archiving — “the process of collecting portions of the World Wide Web” must adapt technology, tools and workflows to evolve with it to ensure the access and use of these preserved collections in an archival format.</description><pubDate>Tue, 18 Jan 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After 30 years, we see how much the web and its users have grown and evolved. Web archiving, “the process of collecting portions of the World Wide Web,” must adapt its technology, tools, and workflows to evolve with the web, to ensure continued access to and use of these preserved collections in an archival format.&lt;/p&gt;
&lt;p&gt;The Webrecorder project has been working to both broaden and deepen web archiving practice by allowing everyday users of the web to create and share high-fidelity archives of web content using their web browser with our suite of open-source tools: &lt;a href=&quot;https://archiveweb.page/&quot;&gt;ArchiveWeb.page&lt;/a&gt;, &lt;a href=&quot;https://replayweb.page/&quot;&gt;ReplayWeb.page&lt;/a&gt;, &lt;a href=&quot;https://github.com/webrecorder/pywb&quot;&gt;pywb&lt;/a&gt;, &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler&quot;&gt;Browsertrix Crawler&lt;/a&gt;, and many other tools that can be found in our &lt;a href=&quot;https://webrecorder.net/tools&quot;&gt;Tools section&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And thanks to the support from the &lt;a href=&quot;https://github.com/webrecorder/devgrants/blob/browser-based-web-archiving/open-grants/open-proposal-browser-based-web-archiving.md&quot;&gt;Filecoin Foundation’s Open Source Development Grant&lt;/a&gt;, our team is working to improve browser support for web archives by initiating three new streams of work: standardization of our new format, &lt;a href=&quot;https://github.com/webrecorder/wacz-spec&quot;&gt;Web Archive Collection Zipped (WACZ)&lt;/a&gt;; design research; and the beginning of integration of this work into existing tools.&lt;/p&gt;
&lt;p&gt;The main value of this project is to create a formally standardized approach to creating, storing, and accessing web archives on p2p/decentralized systems. As part of these efforts, Webrecorder has partnered on design research with the &lt;a href=&quot;https://newdesigncongress.org/&quot;&gt;New Design Congress&lt;/a&gt;, a research organization that recognizes all infrastructure as expressions of power and sees interfaces and technologies as social, economic, political and ecological accelerants, and on browser integration with &lt;a href=&quot;https://github.com/AgregoreWeb/agregore-browser&quot;&gt;Mauve Software, developer of the Agregore browser&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And while we hope that our tools will further users’ ability to safely and easily create browser-based high-fidelity web archives, store them on decentralized systems, and efficiently access them, it’s important to remember that tools and even file formats can change in unanticipated ways.&lt;/p&gt;
&lt;p&gt;Web archiving remains a niche discipline but is a profoundly important one. The preservation and curation of digital material for cultural, legal or historical reasons, is just as crucial as its physical equivalents. Both within the US and internationally, the landscape for tooling is small. Almost all archive systems rely on just a handful of base tools to capture and maintain their collections. As the new decade begins, we have new challenges to consider. Gone are the days of technological optimism; as tool builders, we must acknowledge the challenges of what we make, and how those challenges evolve and change over regions, time and from unexpected outside influences.&lt;/p&gt;
&lt;p&gt;Meanwhile, the attitudes of tool-builders, policy makers, and infrastructure designers have not kept pace. Far too often in digital spaces, hasty and purportedly ethical answers are offered at scale in response to structural harms before the actual underlying problem is fully identified or its complexities accounted for. Perhaps unsurprisingly, there is little publicly available research or critical evaluation of the existing beliefs and practices of web archiving, and of how they manifest consequences for those who are involved with, subjected to, or interact with the web archiving process.&lt;/p&gt;
&lt;p&gt;Through the WACZ specification, we aim to produce a collection of deeper understandings of these threats, alongside proactive recommendations that will — alongside making WACZ a more resilient format — provide real, tangible examples of responding to the challenges inherent in designing and building digital tools.&lt;/p&gt;
&lt;p&gt;As part of our milestones for this project, we’ve already begun gathering use cases, threats, and anti-use cases on our GitHub. We’ve taken the conversation public on our last two community calls, in November and December of 2021, and over the coming weeks we’ll be conducting a series of interviews with archivists, journalists, and researchers, with the specific goals of both getting a better sense of individual and institutional archival practices and uncovering deeper concerns specific to archival tools. Each interview will take approximately 1 to 1¼ hours to complete. We will use these contributions to help inform a set of design recommendations for Webrecorder that are more resilient to the effects of weaponised design and other threats.&lt;/p&gt;
&lt;p&gt;Do you web archive, and want to help? Get in touch or fill out our &lt;a href=&quot;https://docs.google.com/forms/d/e/1FAIpQLSfdqT1QYCw_fXWdB3h1IwwIlrV-DyzvU-o3fABL8nlkN0MKCQ/viewform&quot;&gt;Google Form&lt;/a&gt;.&lt;/p&gt;</content:encoded><author>Cade Diehm, Ed Summers, and Lorena Ramírez-López</author></item><item><title>Web Archives on, of, and off the Web</title><link>https://webrecorder.net/blog/2021-11-26-webarchives-on-of-off-the-web/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-11-26-webarchives-on-of-off-the-web/</guid><description>Last month Webrecorder announced a new effort to improve browser support for web archives. They are soliciting use cases for the WACZ format.</description><pubDate>Fri, 26 Nov 2021 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;This is a guest post by Ed Summers, who is working with Webrecorder as a technical writer and designer on the WACZ spec. The post is cross-posted from his blog at: &lt;a href=&quot;https://inkdroid.org/2021/11/24/wacz/&quot;&gt;https://inkdroid.org/2021/11/24/wacz/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Last month Webrecorder &lt;a href=&quot;https://webrecorder.net/2021/10/13/devgrant-design-and-standards.html&quot;&gt;announced&lt;/a&gt; a new effort to improve browser support for web archives by initiating three new streams of work: standardization, design research and browser integration. They are soliciting &lt;a href=&quot;https://github.com/webrecorder/wacz-spec/labels/use-case&quot;&gt;use cases&lt;/a&gt; for the Web Archive Collection Zipped (WACZ) format, which could be of interest if you use, create or publish web archives…or develop tools to support those activities.&lt;/p&gt;
&lt;p&gt;Webrecorder’s &lt;a href=&quot;https://docs.google.com/forms/d/e/1FAIpQLScPlJF6i7Cm2n1L_dl0MeY2P2Gg83jOCS0GGswSL8gLYQSTrQ/viewform&quot;&gt;next community call&lt;/a&gt; will include a discussion of these use cases as well as upcoming design research that is being run by &lt;a href=&quot;https://newdesigncongress.org/&quot;&gt;New Design Congress&lt;/a&gt;. NDC specialize in thinking critically about design, especially with regards to how technical systems encode power, and how &lt;a href=&quot;https://newdesigncongress.org/en/pub/on-weaponised-design&quot;&gt;designs can be weaponized&lt;/a&gt;. I think this conversation could potentially be of interest to people who are working adjacently to the web archiving field, who want to better understand strategies for designing technology for the web we have, but don’t necessarily always want.&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;The next Webrecorder Community Call will be on: Nov 30th, 9am PT / 12pm ET / 17:00 GMT. We’ll be discussing use cases for WACZ format and plans for UX research around browser-based web archiving. More details and sign-up: &lt;a href=&quot;https://t.co/hXWFwjnzkS&quot;&gt;https://t.co/hXWFwjnzkS&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/WebArchiveWednesday?src=hash&amp;ref_src=twsrc%5Etfw&quot;&gt;#WebArchiveWednesday&lt;/a&gt;&lt;/p&gt;— Webrecorder (@webrecorder_io) &lt;a href=&quot;https://twitter.com/webrecorder_io/status/1463648886791106564?ref_src=twsrc%5Etfw&quot;&gt;November 24, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;I’m helping out by doing a bit of technical writing to support this work and thought I would jot down some notes about why I’m excited to be part of the project, and why I think WACZ is an important development for web archives.&lt;/p&gt;
&lt;p&gt;So what is WACZ and why do we need &lt;em&gt;another&lt;/em&gt; standard for web archives? Before answering that let’s take a quick look at the web archiving standards that we already have.&lt;/p&gt;
&lt;p&gt;Since 2009 &lt;a href=&quot;https://en.wikipedia.org/wiki/Web_ARChive&quot;&gt;WARC&lt;/a&gt; (&lt;a href=&quot;https://www.iso.org/standard/68004.html&quot;&gt;ISO 28500&lt;/a&gt;) has become the canonical file format for saving content that has been collected from the web. In addition to persisting the payload content, WARC allows essential metadata to be recorded, such as the HTTP requests and response headers that document when and how the web resources were retrieved, as well as information about how the content was crawled. ISO 28500 kicked off a decade of innovation that has resulted in the emergence of non-profit and commercial web archiving services, as well as a host of crawling, indexing and playback &lt;a href=&quot;https://github.com/iipc/awesome-web-archiving#readme&quot;&gt;tools&lt;/a&gt;.&lt;/p&gt;
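To make the record structure concrete, here is a minimal stdlib-Python sketch, illustrative only (real tooling such as the warcio library handles this properly), that assembles a single WARC/1.1 response record with the kinds of headers described above:

```python
from uuid import uuid4

def build_warc_response_record(target_uri: str, warc_date: str, http_payload: bytes) -> bytes:
    """Assemble one WARC/1.1 response record in the ISO 28500 layout:
    CRLF-separated header lines, a blank line, the captured HTTP
    payload, then two CRLFs terminating the record."""
    headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8")
    return head + b"\r\n\r\n" + http_payload + b"\r\n\r\n"

record = build_warc_response_record(
    "https://example.com/",
    "2021-11-24T00:00:00Z",
    b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello",
)
```

The point of the structure is that both the crawl context (target URI, capture date) and the raw HTTP exchange survive together in one record.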
&lt;p&gt;In 2013, after three years of development, the &lt;a href=&quot;https://web.archive.org/web/20250522160204/http://mementoweb.org/guide/quick-intro/&quot;&gt;Memento&lt;/a&gt; protocol was released as &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc7089&quot;&gt;RFC 7089&lt;/a&gt; at the IETF. Memento provides a uniform way to &lt;em&gt;discover&lt;/em&gt; and &lt;em&gt;retrieve&lt;/em&gt; previous versions of web resources using the web’s own protocol, HTTP. Memento is now supported in major web archive replay tools such as &lt;a href=&quot;https://netpreserve.org/web-archiving/openwayback/&quot;&gt;OpenWayback&lt;/a&gt; and &lt;a href=&quot;https://pypi.org/project/pywb/&quot;&gt;PyWB&lt;/a&gt; as well as services such as the &lt;a href=&quot;https://archive.org&quot;&gt;Internet Archive&lt;/a&gt;, &lt;a href=&quot;https://archive-it.org&quot;&gt;Archive-It&lt;/a&gt;, &lt;a href=&quot;https://archive.today&quot;&gt;archive.today&lt;/a&gt;, &lt;a href=&quot;https://perma.cc&quot;&gt;PermaCC&lt;/a&gt;, and cultural heritage organizations around the world. Memento adoption has made it possible to develop services like &lt;a href=&quot;https://memgator.cs.odu.edu/api.html&quot;&gt;Memgator&lt;/a&gt; that search across many archives to see which one might have a snapshot of a specific page, as well as software extensions that bring a versioned web to content management systems like &lt;a href=&quot;https://www.mediawiki.org/wiki/Extension:Memento&quot;&gt;Mediawiki&lt;/a&gt;.&lt;/p&gt;
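To illustrate the protocol, a Memento client asks a TimeGate for the snapshot closest to a desired datetime via the `Accept-Datetime` header defined in RFC 7089. A minimal sketch using only the Python standard library (the request is built but not sent; the Internet Archive TimeGate URL shown is illustrative):

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import formatdate

def timegate_request(timegate: str, target: str, when: datetime) -> urllib.request.Request:
    """Build (without sending) an RFC 7089 TimeGate request; the
    Accept-Datetime header asks for the memento closest to `when`."""
    req = urllib.request.Request(timegate + target)
    # Accept-Datetime uses the HTTP-date format, e.g.
    # "Wed, 24 Nov 2021 00:00:00 GMT"
    ts = when.replace(tzinfo=timezone.utc).timestamp()
    req.add_header("Accept-Datetime", formatdate(ts, usegmt=True))
    return req

req = timegate_request("https://web.archive.org/web/", "https://example.com/",
                       datetime(2021, 11, 24))
```

The TimeGate answers with a redirect to the best-matching memento plus `Link` headers pointing at the TimeMap of all known captures.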
&lt;p&gt;More recently, the &lt;a href=&quot;https://github.com/WASAPI-Community/data-transfer-apis/tree/master/general-specification&quot;&gt;Web Archiving Systems API (WASAPI)&lt;/a&gt; specification was developed to allow customers of web archiving services like &lt;a href=&quot;https://support.archive-it.org/hc/en-us/articles/360015225051-Find-and-download-your-WARC-files-with-WASAPI&quot;&gt;Archive-It&lt;/a&gt; and &lt;a href=&quot;https://www.lockss.org/use-lockss/industry-standards&quot;&gt;LOCKSS&lt;/a&gt; to itemize and download the WARC data that makes up their collections. This allows customers to automate the replication of their remote web archives data, for backup and access outside of the given services.&lt;/p&gt;
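In practice a WASAPI query is an authenticated HTTP GET against a provider's webdata endpoint, with filter parameters selecting which files to list. A hedged sketch of composing such a query URL (the endpoint shown is Archive-It's documented one, and the parameter names follow the general WASAPI specification; confirm which filters your provider actually supports):

```python
from urllib.parse import urlencode

def wasapi_query(base: str, **params: str) -> str:
    """Compose a WASAPI webdata query URL from a base endpoint and
    filter parameters (spec parameter names contain hyphens, so they
    are passed via a dict below)."""
    return base + "?" + urlencode(params)

url = wasapi_query(
    "https://warcs.archive-it.org/wasapi/v1/webdata",
    **{"store-time-after": "2021-01-01", "page": "1"},
)
```

The JSON response pages through file records, each with a checksum and download location, which is what makes automated replication possible.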
&lt;p&gt;So, if we have standards for writing, accessing and replicating web archives what more do we need?&lt;/p&gt;
&lt;p&gt;One constant that is running through these various standards is the infrastructure needed to implement them. Creating, storing, serving and maintaining WARC data with Memento and WASAPI usually requires the management of complex software and server infrastructure. In many ways web archives are similar to the brick-and-mortar institutions that preceded them, of which only “the most powerful, the richest elements in society have the greatest capacity to find documents, preserve them, and decide what is or is not available to the public” (Zinn, 1977). This was meant as a critique in 1977, and it remains valid today. But really it’s a simple observation of the resources that are often needed to create authoritative and persistent repositories of any kind.&lt;/p&gt;
&lt;p&gt;The Webrecorder project is working to both broaden and deepen web archiving practice, by allowing everyday users of the web to create and share high fidelity archives of web content using their web browser. Initial work on &lt;a href=&quot;https://webrecorder.net/2021/01/18/wacz-format-1-0.html&quot;&gt;WACZ v1.0&lt;/a&gt; began during the development of &lt;a href=&quot;https://ArchiveWeb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt; and &lt;a href=&quot;https://ReplayWeb.page&quot;&gt;ReplayWeb.page&lt;/a&gt;, which are client-side JavaScript applications for creating and sharing archived web content. That’s right, they run directly on your computer, using your browser, and don’t require servers or services running in the cloud (apart from the websites you are archiving).&lt;/p&gt;
&lt;p&gt;You can think of a WACZ as a predictable package for collected WARC data that includes an index to that content, as well as metadata that describes what can be found in that particular collection. Using the well understood and widely deployed ZIP format means that WACZ files can be placed directly on the web as a single file, and archived web pages can be read from the archive &lt;em&gt;on-demand&lt;/em&gt; without needing to retrieve the entire file, or by implementing a server side API like Memento.&lt;/p&gt;
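The on-demand property falls straight out of ZIP: a reader can consult the central directory and extract a single member without reading the rest of the file. A toy sketch with the Python standard library (the member names are modeled on the WACZ layout, but this in-memory package is not a complete WACZ):

```python
import io
import json
import zipfile

# Build a toy WACZ-like package in memory; the member names are
# illustrative of the layout (metadata, indexes, WARC data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("datapackage.json", json.dumps({"profile": "data-package"}))
    z.writestr("indexes/index.cdx.gz", b"...")
    z.writestr("archive/data.warc.gz", b"...")

# Random access: read just the metadata member, leaving the
# (potentially huge) WARC members untouched.
with zipfile.ZipFile(buf) as z:
    meta = json.loads(z.read("datapackage.json"))
```

Over HTTP the same trick works with range requests, which is why a WACZ can sit on any static file host and still be browsed page by page.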
&lt;p&gt;WACZ, and WACZ enabled tools, will be a game changer for sharing web archives because it makes web archive data into a media-type for the web, where a WACZ file can be moved from place to place as a simple file, without requiring complex server side cloud services to view and interact with it—just your browser.&lt;/p&gt;
&lt;p&gt;It’s important to remember that games can change in unanticipated ways, and that this is an important moment to think critically about the use cases a technology like WACZ will be enabling. You can see some of these &lt;a href=&quot;https://github.com/webrecorder/wacz-spec/issues?q=is%3Aissue+is%3Aopen+label%3Athreat&quot;&gt;threats&lt;/a&gt; starting to get documented in the WACZ spec repository alongside the &lt;a href=&quot;https://github.com/webrecorder/wacz-spec/issues?q=is%3Aissue+is%3Aopen+label%3Ause-case&quot;&gt;standard use cases&lt;/a&gt;. These threats are just as important to document as the desired use cases; perhaps they are even more consequential. Recognizing threats helps to delineate the positionality of a project like Webrecorder, and acknowledges that specifications and their implementations are not neutral, just like &lt;a href=&quot;https://ndsa.org/2017/02/15/archives-have-never-been-neutral-an-ndsa-interview-with-jarrett-drake.html&quot;&gt;the archives&lt;/a&gt; that they make possible.&lt;/p&gt;
&lt;p&gt;However, it’s important to open up the conversation around WACZ because there are potentially other benefits to having a standard for packaging up web archive data that are not necessarily exclusive to the &lt;a href=&quot;https://ArchiveWeb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt; and &lt;a href=&quot;https://ReplayWeb.page&quot;&gt;ReplayWeb.page&lt;/a&gt; applications. For example:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Traditional web archives (perhaps even non-public ones) might want to make data exports available to their users.&lt;/li&gt;
&lt;li&gt;It might be useful to be able to package up archived web content so that it can be displayed in content management systems like Wordpress, Drupal or Omeka.&lt;/li&gt;
&lt;li&gt;A WACZ could be cryptographically signed to allow data to be delivered and made accessible for evidentiary purposes.&lt;/li&gt;
&lt;li&gt;Community archivists and other memory workers could collaborate on collections of web content from social media platforms that are made available on their collective’s website.&lt;/li&gt;
&lt;li&gt;Using a standard like &lt;a href=&quot;https://frictionlessdata.io/&quot;&gt;Frictionless Data&lt;/a&gt; could allow WACZ metadata to be simple to create, use, and reuse in different contexts as data.&lt;/li&gt;
&lt;/ol&gt;
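For the cryptographic-signing use case above, the core move is binding a signature to the package's exact bytes. The actual WACZ signing work defines its own structures, so treat this as only a generic sketch: compute a digest of the finished file, which is what a private key would then sign.

```python
import hashlib

def wacz_digest(package_bytes: bytes) -> str:
    """Hash the finished package; in a signed-WACZ workflow it is
    this digest, not the raw file, that a private key signs and a
    verifier recomputes before checking the signature."""
    return "sha256:" + hashlib.sha256(package_bytes).hexdigest()

digest = wacz_digest(b"example package bytes")
```

Any later modification of the package changes the digest, so a verified signature attests that the archived evidence is byte-for-byte what was signed.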
&lt;p&gt;Webrecorder are &lt;a href=&quot;https://docs.google.com/forms/d/e/1FAIpQLScPlJF6i7Cm2n1L_dl0MeY2P2Gg83jOCS0GGswSL8gLYQSTrQ/viewform&quot;&gt;convening&lt;/a&gt; an initial conversation about this work at their November community call. I hope to see you there! If you’d rather jump right in and submit a use case you can use the &lt;a href=&quot;https://github.com/webrecorder/wacz-spec/issues/new?assignees=&amp;labels=use+case&amp;template=use-case.yaml&amp;title=%5BUse+Case%5D&quot;&gt;GitHub issue tracker&lt;/a&gt;, which has a template to help you. Or, if you prefer, you can also send your idea to &lt;em&gt;info [at] webrecorder.net&lt;/em&gt;.&lt;/p&gt;
&lt;hr/&gt;</content:encoded><author>Ed Summers</author></item><item><title>Webrecorder receives a grant for Design and Standardization of Browser-Based Web Archives</title><link>https://webrecorder.net/blog/2021-10-13-devgrant-design-and-standards/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-10-13-devgrant-design-and-standards/</guid><description>I&apos;m excited to announce that Webrecorder has received a $100,000 Open Source Development Grant from the Filecoin Foundation to work on the standardization and design around the creation of browser-based web archives.</description><pubDate>Wed, 13 Oct 2021 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I’m excited to announce that Webrecorder &lt;a href=&quot;https://github.com/webrecorder/devgrants/blob/browser-based-web-archiving/open-grants/open-proposal-browser-based-web-archiving.md&quot;&gt;has received a $100,000 Open Source Development Grant&lt;/a&gt; from the &lt;a href=&quot;https://fil.org/&quot;&gt;Filecoin Foundation&lt;/a&gt; to work on standardization and design around creation of browser-based web archives.&lt;/p&gt;
&lt;p&gt;The creation of web archives through the browser has been a key goal for the Webrecorder project, and this work will help bring that goal closer to reality. The grant will be focused on three strands of work, explained in more detail below.&lt;/p&gt;
&lt;p&gt;I especially wish to thank Dietrich Ayala of Protocol Labs for collaborating on and supporting this work!&lt;/p&gt;
&lt;h3 id=&quot;wacz-standardization&quot;&gt;WACZ Standardization&lt;/h3&gt;
&lt;p&gt;The first area of work will be a more formal definition of the &lt;a href=&quot;https://github.com/webrecorder/wacz-format&quot;&gt;WACZ Format&lt;/a&gt;, a format designed to package standard WARC files along with other requisite components, such as CDXJ indexes, page lists and other metadata. (See our &lt;a href=&quot;/blog/2021-01-18-wacz-format-1-0&quot;&gt;previous post about WACZ&lt;/a&gt;.) We hope this will also help more formally define other formats that make web archives useful, such as CDXJ, which are currently underspecified. The WACZ format allows random access to web archives of any size, making it possible to efficiently retrieve a single page out of a larger collection and to efficiently load web archives from IPFS, Filecoin, or any other storage that supports random access. In this work, we hope to focus on browser-based web archives first, while also planning for how to store and access much larger crawl-based collections.&lt;/p&gt;
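CDXJ, mentioned above as underspecified, is in practice a sorted text index: each line carries a SURT-form URL key and a 14-digit timestamp, followed by a JSON blob whose fields (offset, length, filename) let a reader seek straight to one record inside a WARC. A hedged sketch of parsing one such line (the field names shown are the commonly used ones, not a fixed schema):

```python
import json

def parse_cdxj_line(line: str) -> dict:
    """Split a CDXJ line into its sorted lookup key, timestamp,
    and the JSON metadata block that locates the WARC record."""
    key, timestamp, blob = line.split(" ", 2)
    return {"key": key, "timestamp": timestamp, **json.loads(blob)}

entry = parse_cdxj_line(
    'com,example)/ 20211124000000 '
    '{"url": "https://example.com/", "offset": 0, "length": 512, '
    '"filename": "data.warc.gz"}'
)
```

Because the lines are sorted by key, a reader can binary-search the index, which is what makes random access into a large collection cheap.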
&lt;p&gt;On this effort, I will be collaborating with a long-time friend of Webrecorder and colleague, &lt;a href=&quot;https://inkdroid.org/about/&quot;&gt;Dr. Ed Summers&lt;/a&gt;, who will work as a technical writer and designer on the WACZ specification.&lt;/p&gt;
&lt;h3 id=&quot;ux-research-on-privacy-preserving-web-archiving&quot;&gt;UX Research on Privacy-preserving Web Archiving&lt;/h3&gt;
&lt;p&gt;Suppose browsers could natively create web archives of anything you browse. How do we ensure that users’ privacy is protected, and that users are able to make intelligent choices about what to archive and what not to, where to store the data, and with whom to share it? The second strand of this work will focus on critical UX research around privacy-preserving web archiving, threat modeling, and UX design that takes into account different scenarios for user-based web archiving. We hope to focus on use cases and users outside the traditional web archiving communities, including users facing high threat risks due to the nature of their work, such as journalists and human rights researchers.&lt;/p&gt;
&lt;p&gt;The UX research will be led by &lt;a href=&quot;https://shiba.computer/&quot;&gt;Cade Diehm&lt;/a&gt;, along with &lt;a href=&quot;https://newdesigncongress.org/en/&quot;&gt;New Design Congress&lt;/a&gt; (NDC), an independent research organization he founded which &lt;em&gt;“recognises all infrastructure as expressions of power, and sees interfaces and technologies as social, economic, political and ecological accelerants.”&lt;/em&gt; I am super excited to be collaborating with Cade and NDC on this effort and supporting much-needed privacy research around new forms of web archiving.&lt;/p&gt;
&lt;h3 id=&quot;implementation-and-browser-integration&quot;&gt;Implementation and Browser Integration&lt;/h3&gt;
&lt;p&gt;Finally, the last strand of work will focus on beginning to integrate the design and research from the other strands into our existing tools. We will likely update tools such as &lt;a href=&quot;https://github.com/webrecorder/py-wacz&quot;&gt;py-wacz&lt;/a&gt; and &lt;a href=&quot;https://archiveweb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt; to support the latest WACZ format specification and UX recommendations.&lt;/p&gt;
&lt;p&gt;In this phase, we will also be joined by &lt;a href=&quot;https://ranger.mauve.moe/&quot;&gt;Mauve&lt;/a&gt;, a developer specializing in open source decentralized tools and the creator of &lt;a href=&quot;https://agregore.mauve.moe/&quot;&gt;Agregore Browser&lt;/a&gt;, a &lt;em&gt;“minimal web browser for the distributed web”&lt;/em&gt; which already natively supports IPFS, hyper:// and other decentralized protocols. Mauve will work to integrate web archiving support into Agregore via our ArchiveWeb.page extension, combining a web browser, built-in web archiving support, and native decentralized storage via IPFS.&lt;/p&gt;
&lt;p&gt;Taken together, I hope that this work will make a significant impact on the web archiving field, advancing not only the technology for web archiving in more decentralized ways, but also our understanding of how more personalized archiving can empower users (and the risks involved). This grant supports the core of Webrecorder’s mission of bringing ‘Web Archiving for All’!&lt;/p&gt;
&lt;p&gt;I look forward to sharing more updates on this work in the upcoming months!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Webrecorder Website Update</title><link>https://webrecorder.net/blog/2021-10-13-website-update/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-10-13-website-update/</guid><description>The Webrecorder site got a slight overhaul today, thanks to the work of UX Designer Thomas Walskaar and our generalist developer, Emma Dickson!</description><pubDate>Wed, 13 Oct 2021 14:50:00 GMT</pubDate><content:encoded>&lt;p&gt;The Webrecorder site got a slight overhaul today, thanks to the work of UX Designer &lt;a href=&quot;https://www.walskaar.com/&quot;&gt;Thomas Walskaar&lt;/a&gt; and our generalist developer, Emma Dickson!&lt;/p&gt;
&lt;p&gt;As one of the main changes, we’ve added a new page for our &lt;a href=&quot;/community&quot;&gt;community&lt;/a&gt;, which includes information about upcoming community calls and links to previous calls.&lt;/p&gt;
&lt;p&gt;We’ve also updated the &lt;a href=&quot;/tools&quot;&gt;tools&lt;/a&gt; page to more accurately reflect our current tools.&lt;/p&gt;
&lt;p&gt;Here’s a screenshot of the on-going design draft from Figma, created by Thomas:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/figma-mockup.B6-nNY7d_2uHbfJ.webp&quot; alt=&quot;Figma mockup of Webrecorder&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1305&quot; height=&quot;685&quot;&gt;&lt;/p&gt;
&lt;p&gt;We are still tweaking a few things, but we hope this update will make it easier to learn about our tools, upcoming events and ways to reach us all in once place!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Announcing pywb 2.6.0 release</title><link>https://webrecorder.net/blog/2021-08-11-pywb-26/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-08-11-pywb-26/</guid><description>After several betas and months of development, I’m excited to announce the release of pywb 2.6!</description><pubDate>Wed, 11 Aug 2021 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After several betas and months of development, I’m excited to announce the release of &lt;a href=&quot;https://github.com/webrecorder/pywb&quot;&gt;pywb 2.6&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;This release, supported in large part by the IIPC (International Internet Preservation Consortium), includes several new features and documentation as well as many replay fidelity improvements and optimizations.&lt;/p&gt;
&lt;p&gt;The main new features of the release include improvements to the access control system and localization/multi-language support. The access control system has been expanded with a flexible date-range based embargo, allowing for automated exclusion of newer or older content. The release also includes the ability to configure pywb for different user access levels when running pywb behind an Nginx or Apache server. For more details on these features, see the &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/access-control.html&quot;&gt;Access Control Guide&lt;/a&gt; and &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/usage.html#configuring-access-control-header&quot;&gt;Deployment Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With this release, pywb also includes support for running in different languages and for configuring the main UI to switch between languages. All UI text is automatically extracted into CSV files for translation and imported back. For more details, see the &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/localization.html&quot;&gt;Localization / Multi-Language Guide&lt;/a&gt; section of the documentation.&lt;/p&gt;
&lt;p&gt;A complete list of changes is also available in the pywb &lt;a href=&quot;https://github.com/webrecorder/pywb/blob/main/CHANGES.rst&quot;&gt;Changelist on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This work is a follow-up to the &lt;a href=&quot;/blog/2020-12-15-owb-to-pywb-transition-guide&quot;&gt;first package of work supported by the IIPC&lt;/a&gt;, which resulted in the creation of a &lt;a href=&quot;https://webrecorder.net/2020/12/15/owb-to-pywb-transition-guide.html&quot;&gt;transition guide for users of OpenWayback&lt;/a&gt;. Webrecorder wishes to thank the IIPC for their support of pywb development.&lt;/p&gt;
&lt;p&gt;The next release of pywb, corresponding to the final batch of work sponsored in this collaboration with IIPC, will include several improvements to the pywb user-interface and navigation.&lt;/p&gt;
&lt;p&gt;For more discussion on this release and upcoming work, please join the upcoming &lt;a href=&quot;https://netpreserve.org/events/iipc-tss-webinar-pywb/&quot;&gt;IIPC-hosted webinar on pywb&lt;/a&gt; on &lt;em&gt;Tuesday, August 31, 2021&lt;/em&gt; or stay tuned for the restart of our Webrecorder Community Calls starting this fall!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Autopilot: Testable Automated Behaviors for ArchiveWeb.page and Browsertrix</title><link>https://webrecorder.net/blog/2021-04-21-autopilot-testable-automated-behaviors/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-04-21-autopilot-testable-automated-behaviors/</guid><description>Web archiving can be complex and often tedious work, especially when trying to archive dynamic, infinitely complex content such as social media. A key goal of Webrecorder tools is to make web archiving simpler, and we&apos;ve taken an important step with latest update to our tools. Over the last week, the Webrecorder team has been quietly testing our new automated, in-page behavior system, sometimes also known as Autopilot!</description><pubDate>Wed, 21 Apr 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;autopilot-in-archivewebpage&quot;&gt;Autopilot in ArchiveWeb.page&lt;/h2&gt;
&lt;p&gt;Web archiving can be complex and often tedious work, especially when trying to archive dynamic, infinitely complex content such as social media. A key goal of Webrecorder tools is to make web archiving simpler, and we’ve taken an important step with the latest update to our tools. Over the last week, the Webrecorder team has been quietly testing our new automated, in-page behavior system, also known as Autopilot!&lt;/p&gt;
&lt;p&gt;The system is available in the latest release of &lt;a href=&quot;https://archiveweb.page&quot; target=&quot;_blank&quot;&gt;ArchiveWeb.page&lt;/a&gt; extension and desktop app.&lt;/p&gt;
&lt;p&gt;The ArchiveWeb.page Guide has been updated with a new page on &lt;a href=&quot;https://archiveweb.page/guide/features/autopilot&quot; target=&quot;_blank&quot;&gt;how to use Autopilot&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://archiveweb.page/guide/features/autopilot&quot;&gt;&lt;img src=&quot;/_astro/autopilot.CORF19Zk_Z1h1CSC.webp&quot; alt=&quot;GIF showing autopilot in action&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;640&quot; height=&quot;400&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The automated behavior system allows the browser to perform custom interactions with a page, automating repetitive tasks such as clicking and scrolling. The default Autoscroll behavior is designed to support any site with infinite scroll. (It works well on Yahoo Answers, helping archive those pages before they disappear!)&lt;/p&gt;
&lt;p&gt;The system includes site-specific behaviors for the most commonly requested sites: Twitter, Instagram, and even Facebook!&lt;/p&gt;
&lt;p&gt;The behavior for Facebook pages is the newest and most experimental, but we hope it will make the job of those trying to archive social media slightly easier.&lt;/p&gt;
&lt;p&gt;These behaviors perform complex interactions designed to capture the highly interactive elements of these sites, including infinite feeds, videos, photos and comments. The guide also includes a &lt;a href=&quot;https://archiveweb.page/guide/features/behaviors&quot; target=&quot;_blank&quot;&gt;detailed overview of each behavior’s functionality and limitations&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;autopilot-in-browsertrix-crawler&quot;&gt;Autopilot in Browsertrix Crawler&lt;/h2&gt;
&lt;p&gt;The behavior system that forms the basis for Autopilot is actually part of the Browsertrix suite of tools, and is known as &lt;a href=&quot;https://github.com/webrecorder/browsertrix-behaviors&quot; target=&quot;_blank&quot;&gt;Browsertrix Behaviors&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The behaviors are also enabled by default when using Browsertrix Crawler, and can be further customized with &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler#behaviors&quot; target=&quot;_blank&quot;&gt;command-line options for Browsertrix Crawler&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Browsertrix Crawler provides additional options for choosing which behaviors are enabled and provides options to view the behavior status log as the behavior is running.&lt;/p&gt;
&lt;h2 id=&quot;the-hard-part--automated-automated-testing&quot;&gt;The Hard Part — Automated Automated Testing&lt;/h2&gt;
&lt;p&gt;The first iteration of Autopilot was initially &lt;a href=&quot;https://blog.conifer.rhizome.org/2019/08/14/autopilot.html&quot; target=&quot;_blank&quot;&gt;launched for Webrecorder hosted service (now Conifer), and Webrecorder Desktop App&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Over time, we’ve learned that as hard as it is to make the automated behaviors, maintaining them is even harder! Social media sites are not only complex, but also change frequently, and the web archiving community must inevitably play catch-up.&lt;/p&gt;
&lt;p&gt;There is no doubt that the site-specific behaviors will break and will require consistent upkeep.&lt;/p&gt;
&lt;p&gt;To make this a bit easier, all Autopilot/Browsertrix Behaviors are automatically tested daily, using GitHub actions.&lt;/p&gt;
&lt;p&gt;The tests run a small crawl using Browsertrix Crawler on a fixed social media account, created specifically for testing, to ensure the basic functionality of a behavior (clicking on photos, playing videos, going through feeds, etc.) remains unchanged. Each branch or pull request for the behavior system is also tested with a basic crawl. Of course, these tests are a bare minimum given the potentially infinite complexity of archiving dynamic social media sites, but we hope this is a start toward making behaviors more maintainable.&lt;/p&gt;
&lt;p&gt;We’ve also learned that it is important to help users manage expectations. With these tests, we can quickly find out when particular behaviors break, and users of Webrecorder tools can also see which behaviors are currently working and which are not, from the &lt;a href=&quot;https://archiveweb.page/guide/features/behaviors&quot; target=&quot;_blank&quot;&gt;behaviors overview page&lt;/a&gt; or from GitHub.&lt;/p&gt;
&lt;p&gt;With this testing in place, we hope to be able to address broken behaviors more quickly, and let users know when they are broken.&lt;/p&gt;
&lt;h3 id=&quot;browsertrix-behaviors--just-add-browser&quot;&gt;Browsertrix Behaviors — Just add Browser&lt;/h3&gt;
&lt;p&gt;The behavior system is intentionally designed to run entirely in the browser and can work in any modern browser. While we test it with Browsertrix Crawler, the behaviors can be injected directly into a browser in any way (including just &lt;a href=&quot;https://github.com/webrecorder/browsertrix-behaviors#copy--paste-behaviors-for-testing&quot; target=&quot;_blank&quot;&gt;copy and paste!&lt;/a&gt;) and the system is not tied to a particular crawler.&lt;/p&gt;
&lt;p&gt;The goal was to make the behavior system usable in any kind of browser-based crawler, and encourage community contributions of new behaviors!&lt;/p&gt;
&lt;p&gt;Are there certain site-specific behaviors you’d like to see, and can you help create? If so, feel free to &lt;a href=&quot;https://github.com/webrecorder/browsertrix-behaviors/issues&quot; target=&quot;_blank&quot;&gt;open an issue&lt;/a&gt; on GitHub, or &lt;a href=&quot;https://forum.webrecorder.net/&quot; target=&quot;_blank&quot;&gt;discuss on the forum&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We hope to create more guidelines and documentation on how to contribute behaviors in the future. Stay tuned!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Announcing New ArchiveWeb.page App, Deprecating Older Tools</title><link>https://webrecorder.net/blog/2021-02-22-archiveweb-page-app-new-tools/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-02-22-archiveweb-page-app-new-tools/</guid><description>Over the years, the Webrecorder project has developed a lot of tools to make web archiving easier and accessible for all. To continue pushing the boundaries of high-fidelity web archiving and make tools that are easy to use and easy to maintain, it is sometimes necessary to discontinue older tools and focus on new ones.</description><pubDate>Mon, 22 Feb 2021 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Over the years, the Webrecorder project has developed a lot of tools to make web archiving easier and accessible for all. To continue pushing the boundaries of high-fidelity web archiving and make tools that are easy to use and easy to maintain, it is sometimes necessary to discontinue older
tools and focus on new ones.&lt;/p&gt;
&lt;p&gt;If you are currently using the following tools, we recommend transitioning to the newer tools mentioned below.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If you’re using &lt;a href=&quot;https://github.com/webrecorder/webrecorder-desktop&quot;&gt;Webrecorder Desktop&lt;/a&gt;, you should switch to the &lt;a href=&quot;https://archiveweb.page&quot;&gt;ArchiveWeb.page&lt;/a&gt; Extension or Desktop App.
See below for more details on ArchiveWeb.page. Webrecorder Desktop development has been discontinued.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you’re using &lt;a href=&quot;https://github.com/webrecorder/browsertrix&quot;&gt;Browsertrix&lt;/a&gt;, you should switch to &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler&quot;&gt;Browsertrix Crawler&lt;/a&gt;, a more modular, self-contained crawler. See below for more details on Browsertrix Crawler.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you’re using &lt;a href=&quot;https://github.com/webrecorder/webrecorder-player&quot;&gt;Webrecorder Player&lt;/a&gt;, you should switch to &lt;a href=&quot;https://github.com/webrecorder/replayweb.page/releases&quot;&gt;ReplayWeb.page App&lt;/a&gt; or use the &lt;a href=&quot;https://replayweb.page&quot;&gt;https://replayweb.page&lt;/a&gt; web site.
ReplayWeb.page was released last year, and Webrecorder Player development was discontinued at that time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;archivewebpage-desktop-app-now-available&quot;&gt;ArchiveWeb.page Desktop App Now Available&lt;/h2&gt;
&lt;p&gt;Last month, we &lt;a href=&quot;/blog/2021-01-18-archiveweb-page-extension&quot;&gt;announced the release of the ArchiveWeb.page Chrome Extension&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;During our last community call, we also announced the &lt;a href=&quot;https://github.com/webrecorder/archiveweb.page/releases&quot;&gt;initial beta release of the ArchiveWeb.page App&lt;/a&gt;, which complements the extension.&lt;/p&gt;
&lt;p&gt;The desktop app uses the same code base as the extension and updates will be released to both at around the same time.&lt;/p&gt;
&lt;h3 id=&quot;extension-vs-app&quot;&gt;Extension vs App&lt;/h3&gt;
&lt;p&gt;The extension is preferable for many use cases, as it integrates directly with the browser and makes it easy to start recording. When using the extension in their existing Chromium-based browser, users can archive exactly what they see, including all sites they’re already logged into.&lt;/p&gt;
&lt;p&gt;The app may be useful in cases where the extension has difficulty, particularly due to certain restrictions in the browser. For example, in Chrome, many Google sites have native apps, and security settings may prevent archiving Google Docs, etc. Archiving these sites should work in the standalone app.&lt;/p&gt;
&lt;p&gt;The extension does require a Chromium-based browser (Chrome, Brave, Edge), so the app may be an alternative for those who do not wish to install one of these browsers.&lt;/p&gt;
&lt;p&gt;Users who are familiar with Webrecorder Desktop, or who have an existing workflow built around it, should find the ArchiveWeb.page App easy to use.&lt;/p&gt;
&lt;p&gt;Webrecorder is committed to making it as easy as possible to archive any site, and will continue to offer ArchiveWeb.page as both an app and an extension.&lt;/p&gt;
&lt;h4 id=&quot;deprecation-of-webrecorder-desktop&quot;&gt;Deprecation of Webrecorder Desktop&lt;/h4&gt;
&lt;p&gt;With the release of ArchiveWeb.page App and extension, the existing Webrecorder Desktop app is now deprecated.&lt;/p&gt;
&lt;p&gt;Webrecorder Desktop was developed by migrating a system designed to be a cloud-based service into an app, which resulted in an overly complex architecture that was difficult to maintain. While the app was based on Electron, it also bundled two separate native executables, a Python app and an external Redis binary, which made it very hard to keep up-to-date with the latest macOS and Windows releases.&lt;/p&gt;
&lt;p&gt;The ArchiveWeb.page app and extension are designed from the ground up to run as local archiving systems on your machine.&lt;/p&gt;
&lt;p&gt;If you are starting a new archive, please use ArchiveWeb.page.&lt;/p&gt;
&lt;p&gt;If you have existing collections in Webrecorder Desktop, you can export them as WARC files and view via ReplayWeb.page.&lt;/p&gt;
&lt;p&gt;The ArchiveWeb.page app and extension will both have a way to import WARC files from Webrecorder Desktop in an upcoming update.&lt;/p&gt;
&lt;p&gt;We plan to release more instructions for how to migrate in the near future!&lt;/p&gt;
&lt;h3 id=&quot;crawling-tools-update-refactoring-browsertrix-into-the-new-browsertrix-crawler&quot;&gt;Crawling Tools Update: Refactoring Browsertrix into the new Browsertrix Crawler&lt;/h3&gt;
&lt;p&gt;With the release of the &lt;a href=&quot;/blog/2021-02-22-introducing-browsertrix-crawler&quot;&gt;modular Browsertrix Crawler crawling system&lt;/a&gt;, the older, all-in-one Browsertrix is no longer being developed in favor of &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler&quot;&gt;Browsertrix Crawler&lt;/a&gt;. The original system had too many ‘moving parts’: a crawler, a remote browser system, behavior system, a scheduler, a UI and a CLI tool, all split across many Docker containers and repos.&lt;/p&gt;
&lt;p&gt;All of those are important, but it became difficult to maintain all of the components as designed. The idea of Browsertrix lives on in a more modular setup with Browsertrix Crawler,
which focuses on the core use case of being able to run an automated high-fidelity crawl of a small or medium-sized site.&lt;/p&gt;
&lt;p&gt;Additional features, such as a scheduler or a UI, may be added in the future, but will be kept separate from Browsertrix Crawler. Above all, we want the core Browsertrix Crawler to be easy to use and focused on providing high-fidelity crawling via a single command.&lt;/p&gt;
&lt;p&gt;See the &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler/issues&quot;&gt;Browsertrix Crawler repository issues&lt;/a&gt; for more details on current development of the crawler.&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Introducing Browsertrix Crawler</title><link>https://webrecorder.net/blog/2021-02-22-introducing-browsertrix-crawler/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-02-22-introducing-browsertrix-crawler/</guid><description>I wanted to more publicly announce Webrecorder&apos;s new automated browser-based crawling system: Browsertrix Crawler.</description><pubDate>Mon, 22 Feb 2021 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I wanted to more publicly announce Webrecorder’s new automated browser-based crawling system: &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler&quot;&gt;Browsertrix Crawler&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The premise of the crawler is simple: to run a single command that produces a high-fidelity crawl (based on the specified params and config options).&lt;/p&gt;
&lt;p&gt;Browsertrix Crawler is a single, self-contained Docker image that can run a full browser-based crawl using Puppeteer.&lt;/p&gt;
&lt;p&gt;The Docker image contains pywb, a recent version of Chrome, Puppeteer and a customizable JavaScript ‘driver’.&lt;/p&gt;
&lt;p&gt;The crawler is currently designed to run a single-site crawl using one or more Chrome browsers in parallel, capturing data via a pywb proxy.&lt;/p&gt;
&lt;p&gt;The default driver simply &lt;a href=&quot;https://github.com/webrecorder/browsertrix-crawler/blob/0688674f6f7ca6f8d77a1ef6613e14762a1a6181/defaultDriver.js&quot;&gt;loads a page, waits for it to load and extracts links&lt;/a&gt; using Puppeteer and the provided crawler interfaces. A more complex driver could perform custom operations targeted at a specific site, fully customizing the crawling process.&lt;/p&gt;
&lt;p&gt;The output of the crawler is WARC files and, optionally, &lt;a href=&quot;https://github.com/webrecorder/wacz-format&quot;&gt;a WACZ file&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The goal is to make it as easy as possible to run a browser-based crawl on the command line, for example (using Docker Compose for simplicity):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8;overflow-x:auto&quot; tabindex=&quot;0&quot; data-language=&quot;sh&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;docker-compose&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; run&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; crawler&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; crawl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --url&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; https://netpreserve.org/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --collection&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; my-crawl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --workers&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 3&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --generateWACZ&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After running the crawler:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;a WACZ file will be available for use with ReplayWeb.page at &lt;code&gt;./crawls/collections/my-crawl.wacz&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the WARC files will also be available (in standard pywb directory layout) in: &lt;code&gt;./crawls/collections/my-crawl/archive/&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the CDX index files will be available (in standard pywb directory layout) in: &lt;code&gt;./crawls/collections/my-crawl/indexes/&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Currently, Browsertrix Crawler supports a number of command line options and a more extensive crawl config is coming.&lt;/p&gt;
&lt;p&gt;Browsertrix Crawler represents a refactoring of the all-in-one Browsertrix system into a modular, easy-to-use crawler.&lt;/p&gt;
&lt;h3 id=&quot;use-case-zimit-project&quot;&gt;Use Case: Zimit Project&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://youzim.it&quot;&gt;&lt;img src=&quot;/_astro/youzim.Br9kZip1_Z2nxTjc.webp&quot; alt=&quot;Zimit screenshot&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1806&quot; height=&quot;1224&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The development of Browsertrix Crawler was initially created in collaboration with &lt;a href=&quot;https://kiwix.org&quot;&gt;Kiwix&lt;/a&gt; to support their brand new crawling system, &lt;a href=&quot;https://github.com/openzim/zimit&quot;&gt;Zimit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The system is publicly available at: &lt;a href=&quot;https://youzim.it&quot;&gt;https://youzim.it&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Kiwix is a non-profit producing customized archives for use primarily in offline environments, and open source tools to view these custom archives on a variety of mobile and desktop platforms.&lt;/p&gt;
&lt;p&gt;Kiwix’s core focus includes producing archived copies of Wikipedia. Kiwix maintains an existing Docker-based crawling system called “ZIM Farm” that runs each crawl in a single Docker container. To support this existing infrastructure, Browsertrix Crawler was architected to run a full crawl in a single Docker container. This versatile design makes Browsertrix Crawler easy to use as a standalone tool and adaptable to other environments.&lt;/p&gt;
&lt;p&gt;The Zimit system produces web archives in the &lt;a href=&quot;https://wiki.openzim.org/wiki/ZIM_file_format&quot;&gt;ZIM format&lt;/a&gt;, the core format developed by Kiwix for their &lt;a href=&quot;https://www.kiwix.org/en/download/&quot;&gt;offline viewers&lt;/a&gt; and the basis of their downloadable archives. ZIM files created by Zimit include a custom version of wabac.js service worker, which also powers ReplayWeb.page. Support for loading ZIM files created via Zimit is being further developed by Kiwix for all of their offline players.&lt;/p&gt;
&lt;p&gt;To stay up-to-date with the Zimit project, you can follow it on GitHub at: &lt;a href=&quot;https://github.com/openzim/zimit&quot;&gt;https://github.com/openzim/zimit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The development of the Zimit system was supported in part by a grant from the Mozilla Foundation. Webrecorder wishes to thank Kiwix, and indirectly, Mozilla, for their support in creating Browsertrix Crawler.&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Announcing WACZ Format 1.0</title><link>https://webrecorder.net/blog/2021-01-18-wacz-format-1-0/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-01-18-wacz-format-1-0/</guid><description>The Webrecorder team has just finished a new release for WACZ and we’re delighted to share it with you!</description><pubDate>Mon, 18 Jan 2021 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Webrecorder team has just finished a new release for WACZ and we’re delighted to share it with you!&lt;/p&gt;
&lt;p&gt;WACZ stands for &lt;em&gt;W&lt;/em&gt;eb &lt;em&gt;A&lt;/em&gt;rchive &lt;em&gt;C&lt;/em&gt;ollection &lt;em&gt;Z&lt;/em&gt;ipped, and is a new file format designed to make creating and hosting web archives quicker and easier. The format has been in development for a few months, and we’re excited to announce the release of &lt;a href=&quot;https://github.com/webrecorder/wacz-format&quot; target=&quot;_blank&quot;&gt;WACZ Format 1.0&lt;/a&gt;. The spec for the format can be found on GitHub.&lt;/p&gt;
&lt;p&gt;ReplayWeb.page and the newly announced &lt;a href=&quot;/blog/2021-01-18-archiveweb-page-extension&quot;&gt;ArchiveWeb.page extension&lt;/a&gt; both support the WACZ format 1.0. (ReplayWeb.page continues to support earlier iterations of the format as well)&lt;/p&gt;
&lt;h3 id=&quot;zip-based-packaging-for-warcs-indices-and-metadata&quot;&gt;ZIP-Based packaging for WARCs, Indices and Metadata&lt;/h3&gt;
&lt;p&gt;WACZ serves as a zipped package format for WARCs. Normally WARC files contain mostly the raw network data. WACZ files take the raw WARC files and zip them up, along with a CDX or compressed CDX index, and a full text index.&lt;/p&gt;
&lt;p&gt;This gives WACZ files a few distinct advantages over plain WARC files. Because WACZ files are essentially ZIP files, they can be read on-demand over the network without downloading the entire file. WACZ files also come packaged with everything you need to create and host a web archive collection: a random-access index of all raw data, a list of entry-point pages into the archive, and user-defined, editable metadata about the collection. As an added bonus, the full text extracted from web pages is also included, ready to be ingested into search engines like Solr or loaded on-the-fly along with the replay.&lt;/p&gt;
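&lt;p&gt;As a small illustration of this packaging (using made-up file names and stub contents, not a spec-complete WACZ), any ZIP tool can build and inspect a WACZ-shaped archive, and the ZIP central directory at the end of the file is what makes those on-demand reads possible:&lt;/p&gt;

```shell
# Build a tiny WACZ-shaped ZIP (stub contents, illustrative layout only).
mkdir -p wacz-demo/archive wacz-demo/indexes wacz-demo/pages
printf 'WARC/1.1\r\n' | tee wacz-demo/archive/data.warc
printf '{}\n' | tee wacz-demo/datapackage.json
touch wacz-demo/indexes/index.cdx wacz-demo/pages/pages.jsonl
python3 -m zipfile -c demo.wacz wacz-demo

# Since a WACZ is an ordinary ZIP, any unzip tool can list or extract it.
python3 -m zipfile -l demo.wacz

# The ZIP end-of-central-directory record sits in the last 22 bytes (magic
# bytes 50 4b 05 06), so a reader can locate the archive's index with a
# small read at the end of the file instead of downloading all of it.
tail -c 22 demo.wacz | head -c 4 | od -An -tx1
```

&lt;p&gt;This is only a sketch of the container layout; the actual contents of the indexes, pages list, and datapackage.json are defined by the WACZ spec linked above.&lt;/p&gt;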
&lt;p&gt;When using WACZ, ReplayWeb.page can quickly load large web archives without downloading the entire file. Using the &lt;a href=&quot;https://workspace.google.com/marketplace/app/replaywebpage/160798412227&quot; target=&quot;_blank&quot;&gt;ReplayWeb.page Google Drive extension&lt;/a&gt;, WACZ files can be loaded directly from Google Drive without downloading the entire file.&lt;/p&gt;
&lt;h4 id=&quot;frictionless-data-package&quot;&gt;Frictionless Data Package&lt;/h4&gt;
&lt;p&gt;In an effort to base WACZ on established formats, starting from 1.0, WACZ also conforms to the &lt;a href=&quot;https://specs.frictionlessdata.io/data-package/&quot; target=&quot;_blank&quot;&gt;Frictionless Data Package&lt;/a&gt; standard. The Data Package manifest adds integrity checks (via SHA-256 or MD5) for each file contained in the WACZ. We hope to expand this specification as well as collaborate with the Frictionless Data community to better standardize formats that are used in web archives.&lt;/p&gt;
&lt;h3 id=&quot;tools-for-creating-and-verifying-wacz&quot;&gt;Tools for creating and verifying WACZ&lt;/h3&gt;
&lt;p&gt;We have released the &lt;a href=&quot;https://pypi.org/project/wacz&quot; target=&quot;_blank&quot;&gt;wacz 0.2.0&lt;/a&gt; Python package, the official reference implementation for creating and validating WACZ files. The library supports packaging up WARC files, simple full-text extraction, and a variety of other options. The library can also validate existing WACZ files against the spec. (For extracting WACZ files, any unzip tool can be used, since WACZ files are also ZIP files.)&lt;/p&gt;
&lt;p&gt;See the &lt;a href=&quot;https://github.com/webrecorder/py-wacz&quot; target=&quot;_blank&quot;&gt;py-wacz&lt;/a&gt; page for more details on options or run &lt;code&gt;wacz -h&lt;/code&gt; after installing the python package.&lt;/p&gt;
&lt;p&gt;The ArchiveWeb.page extension has built-in Javascript-based support for creating WACZ files for web archives stored in the extension. We hope to release additional JS-based tools for working with WACZ files in the future.&lt;/p&gt;
&lt;h2 id=&quot;community-feedback&quot;&gt;Community Feedback&lt;/h2&gt;
&lt;p&gt;While Webrecorder is leading the development, we would like the WACZ format to be responsive to the needs of web archiving communities. If you have any suggestions or comments, feel free to &lt;a href=&quot;https://github.com/webrecorder/wacz-format&quot; target=&quot;_blank&quot;&gt;open an issue on the WACZ format GitHub&lt;/a&gt; or attend one of our upcoming community calls.&lt;/p&gt;</content:encoded><author>Ilya Kreymer and Emma Dickson</author></item><item><title>Introducing ArchiveWeb.page - Local High-Fidelity Web Archiving directly in your browser</title><link>https://webrecorder.net/blog/2021-01-18-archiveweb-page-extension/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-01-18-archiveweb-page-extension/</guid><description> I am excited to announce the launch of ArchiveWeb.page, a brand-new high-fidelity web archiving system available as a Chrome extension from the Chrome Web Store.</description><pubDate>Mon, 18 Jan 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h4 id=&quot;introducing-archivewebpage-chrome-extension&quot;&gt;Introducing ArchiveWeb.page Chrome Extension&lt;/h4&gt;
&lt;p&gt;I am excited to announce the launch of &lt;a href=&quot;https://archiveweb.page/&quot;&gt;ArchiveWeb.page&lt;/a&gt;, a brand-new high-fidelity web archiving system available as a &lt;a href=&quot;https://chrome.google.com/webstore/detail/webrecorder-archivewebpag/fpeoodllldobpkbkabpblcfaogecpndd&quot; target=&quot;_blank&quot;&gt;Chrome extension from the Chrome Web Store&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The extension has been tested in the latest versions of Chrome, as well as with the Edge and Brave browsers.&lt;/p&gt;
&lt;p&gt;In classic Webrecorder style, the extension allows users to ‘record’ highly interactive websites, including social media, video, customized content, and even local intranet content.&lt;/p&gt;
&lt;p&gt;When the original webrecorder.io was launched nearly six years ago, the goal was to allow users to record/capture exactly what is loaded in their browser. At the time, this was not possible entirely within a browser extension, and an outside proxy server (running on webrecorder.io) was necessary. Now, thanks to the evolution of browser technologies, this original vision of archiving entirely in your browser can finally be realized!&lt;/p&gt;
&lt;p&gt;The ArchiveWeb.page extension turns the browser into a full web archiving system, allowing users to turn on ‘recording’ mode in any tab, which will then capture/record all the elements of a page exactly as they are loaded. The archived data is then stored in the browser itself, and can be replayed/accessed even when offline. ArchiveWeb.page builds on and complements the &lt;a href=&quot;https://replayweb.page&quot;&gt;ReplayWeb.page&lt;/a&gt; system, &lt;a href=&quot;/blog/2020-06-11-webrecorder-conifer-and-replayweb-page&quot; target=&quot;_blank&quot;&gt;announced last year&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;user-guide&quot;&gt;User Guide&lt;/h3&gt;
&lt;p&gt;To get users started with the extension right away, we’ve also launched a detailed &lt;a href=&quot;https://archiveweb.page/guide&quot; target=&quot;_blank&quot;&gt;User Guide&lt;/a&gt;, created by our Community Manager, Lorena Ramírez-López.&lt;/p&gt;
&lt;p&gt;Read on below for an overview of some key features in ArchiveWeb.page.&lt;/p&gt;
&lt;h3 id=&quot;archiving-flash-with-ruffle-emulator&quot;&gt;Archiving Flash with Ruffle Emulator&lt;/h3&gt;
&lt;p&gt;ArchiveWeb.page embeds the &lt;a href=&quot;https://ruffle.rs&quot; target=&quot;_blank&quot;&gt;Ruffle&lt;/a&gt; emulator, allowing users to archive and replay Flash-based works. Ruffle is automatically enabled on pages that have Flash.&lt;/p&gt;
&lt;p&gt;Not all Flash pages will work with Ruffle, but many will. See our &lt;a href=&quot;/blog/2021-01-04-flash-aint-dead-yet&quot; target=&quot;_blank&quot;&gt;on-going efforts to ensure Flash remains accessible&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;page-oriented-archiving-and-deduplication&quot;&gt;Page-oriented Archiving and Deduplication&lt;/h3&gt;
&lt;p&gt;In ArchiveWeb.page, the smallest unit is the page. The extension archive keeps track of which resources are loaded from which page. This allows for individual pages to be downloaded, and deleted, as necessary, and will help ensure archived pages are accurately replayed. Resources shared across multiple pages are automatically deduplicated to save storage.&lt;/p&gt;
&lt;p&gt;This is a bit different than in Webrecorder Desktop, where the smallest unit was a session and individual pages could not be deleted or separated. Support for removing individual pages was an oft-requested feature, and this is now available in ArchiveWeb.page.&lt;/p&gt;
&lt;h3 id=&quot;full-text-for-web-pages-and-pdfs&quot;&gt;Full-Text for Web Pages and PDFs&lt;/h3&gt;
&lt;p&gt;ArchiveWeb.page includes built-in full-text search support. When recording a page, the text of the page is automatically extracted and indexed (when the page is first loaded and again when leaving the page). Text for any PDFs recorded is also extracted.&lt;/p&gt;
&lt;p&gt;When replaying pages, enter text queries in the location bar to search pages by text.&lt;/p&gt;
&lt;h3 id=&quot;download-archives-in-wacz-or-warc&quot;&gt;Download Archives in WACZ or WARC&lt;/h3&gt;
&lt;p&gt;The extension fully supports exporting entire web archive collections or individual pages in the new &lt;a href=&quot;https://github.com/webrecorder/specs&quot; target=&quot;_blank&quot;&gt;WACZ Format 1.0&lt;/a&gt;. This format, which contains WARCs, indices and other data, makes it easy to share web archives and load them quickly using &lt;a href=&quot;https://replayweb.page&quot; target=&quot;_blank&quot;&gt;ReplayWeb.page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;See our &lt;a href=&quot;/blog/2021-01-18-wacz-format-1-0&quot;&gt;blog post&lt;/a&gt; on the WACZ format.&lt;/p&gt;
&lt;p&gt;Of course, the extension also supports downloading as plain WARC files as well.&lt;/p&gt;
&lt;h3 id=&quot;peer-to-peer-sharing-using-ipfs&quot;&gt;Peer-to-peer sharing using IPFS&lt;/h3&gt;
&lt;p&gt;The ArchiveWeb.page extension includes experimental &lt;em&gt;peer-to-peer&lt;/em&gt; sharing of web archives, using &lt;a href=&quot;https://ipfs.io/&quot; target=&quot;_blank&quot;&gt;IPFS&lt;/a&gt;. This feature allows users to share a web archive collection directly from their browser!&lt;/p&gt;
&lt;p&gt;ReplayWeb.page has been updated to support loading web archives directly from IPFS, allowing shared archives from the extension to be quickly shared with others, without having to download and send full WACZ files.&lt;/p&gt;
&lt;p&gt;This feature is still experimental, see the &lt;a href=&quot;https://archiveweb.page/en/features/sharing/&quot; target=&quot;_blank&quot;&gt;guide page on sharing&lt;/a&gt; for some caveats.&lt;/p&gt;
&lt;h2 id=&quot;video&quot;&gt;Video&lt;/h2&gt;
&lt;p&gt;Here’s a brief video of the ArchiveWeb.page extension being used to archive a Twitter feed, including video, archive a MOMA exhibition page with Flash, replay each page, search by text, and then download selected pages in WACZ format:&lt;/p&gt;
&lt;video controls playsinline muted=&quot;true&quot;&gt;&lt;source src=&quot;/assets/video/awp-demo.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;
&lt;h2 id=&quot;further-work--coming-soon&quot;&gt;Further Work / Coming Soon&lt;/h2&gt;
&lt;p&gt;This is only the initial release of ArchiveWeb.page. Here’s some additional work that is in the pipeline for future improvements.&lt;/p&gt;
&lt;h3 id=&quot;archivewebpage-desktop-app&quot;&gt;ArchiveWeb.page Desktop App&lt;/h3&gt;
&lt;p&gt;For those that may prefer a standalone desktop app instead of an extension, we’re also working on an ArchiveWeb.page desktop app.&lt;/p&gt;
&lt;p&gt;This app shares the same system as the extension, but runs as a standalone desktop app. The ArchiveWeb.page App will replace the existing Webrecorder Desktop app, and we hope to offer a migration path to the new app once it’s available. Stay tuned for more details.&lt;/p&gt;
&lt;p&gt;A development version can be built locally using the &lt;a href=&quot;https://github.com/webrecorder/archiveweb.page&quot; target=&quot;_blank&quot;&gt;ArchiveWeb.page GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;autopilot-system&quot;&gt;Autopilot System&lt;/h3&gt;
&lt;p&gt;The autopilot system from Webrecorder Desktop, which runs automated behaviors on certain sites, is not yet in this version of the extension, but rest assured that we plan to add this system to ArchiveWeb.page, both extension and app, in an upcoming release. We’ll be sure to make an announcement once it is ready!&lt;/p&gt;
&lt;h2 id=&quot;feedback&quot;&gt;Feedback&lt;/h2&gt;
&lt;p&gt;Try out &lt;a href=&quot;https://archiveweb.page&quot; target=&quot;_blank&quot;&gt;archiveweb.page&lt;/a&gt;, read the &lt;a href=&quot;https://archiveweb.page/guide&quot; target=&quot;_blank&quot;&gt;guide&lt;/a&gt; and let us know if you have any feedback on this new tool! We want to hear from you!&lt;/p&gt;
&lt;p&gt;You can reach out via the &lt;a href=&quot;https://forum.webrecorder.net/&quot; target=&quot;_blank&quot;&gt;forum&lt;/a&gt; or attend our &lt;a href=&quot;https://forum.webrecorder.net/t/webrecorder-community-call-next-tuesday-january-19th-2021/93&quot; target=&quot;_blank&quot;&gt;upcoming community call&lt;/a&gt;.&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Flash Ain&apos;t Dead Yet! Even More Ways to Run Flash using OldWeb.today</title><link>https://webrecorder.net/blog/2021-01-04-flash-aint-dead-yet/</link><guid isPermaLink="true">https://webrecorder.net/blog/2021-01-04-flash-aint-dead-yet/</guid><description>A new version of OldWeb.today was released two weeks ago, switching to in-browser Javascript and WebAssembly emulation.</description><pubDate>Mon, 04 Jan 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h3 id=&quot;faster-emulation-more-browsers&quot;&gt;Faster Emulation, More Browsers&lt;/h3&gt;
&lt;p&gt;A new version of OldWeb.today &lt;a href=&quot;/blog/2020-12-23-new-oldweb-today&quot;&gt;was released two weeks ago&lt;/a&gt;, switching to in-browser Javascript and WebAssembly emulation.
Today, one of the emulators used in OldWeb.today, v86, &lt;a href=&quot;https://github.com/copy/v86/pull/388&quot;&gt;received a major upgrade with WebAssembly&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OldWeb.today has now also been updated to support this new version, allowing Windows- and Linux-based browsers to run even faster. This should also offer a noticeable upgrade to Flash emulation in these browsers.&lt;/p&gt;
&lt;p&gt;OldWeb.today now supports five versions of Flash (including a version of Shockwave with Director support) in nine different browsers. Three versions of Java are supported as well.&lt;/p&gt;
&lt;h4 id=&quot;old-linux-browsers-with-latest-flash&quot;&gt;Old Linux Browsers with Latest Flash&lt;/h4&gt;
&lt;p&gt;Thanks to the updated v86, two new browsers, Opera 12 and Firefox 10 ESR, have been added, using a recent version of Tiny Core Linux.&lt;/p&gt;
&lt;p&gt;These browsers are pre-installed with the latest Flash plugin for Linux, Flash Player 32, ensuring that even the latest Flash player is covered.&lt;/p&gt;
&lt;p&gt;And of course, OWT also includes &lt;a href=&quot;https://ruffle.rs/&quot;&gt;Ruffle&lt;/a&gt;, which is a Flash-specific emulator that runs Flash directly in your current browser.&lt;/p&gt;
&lt;p&gt;You can try out these browsers and more at: &lt;a href=&quot;https://oldweb.today/&quot;&gt;https://oldweb.today&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/owt-browsers.vE88CJV8_1iUapx.webp&quot; alt=&quot;A screenshot of OldWeb.Today&apos;s browser selection options including Netscape Navigator, Internet Explorer, Mosaic, and others&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;672&quot; height=&quot;1200&quot;&gt;&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Announcing the New OldWeb.today</title><link>https://webrecorder.net/blog/2020-12-23-new-oldweb-today/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-12-23-new-oldweb-today/</guid><description>Just over five years ago, at the beginning of December 2015, I released the initial version of OldWeb.today, which demonstrated running emulated browsers connected to web archives.</description><pubDate>Wed, 23 Dec 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Just over five years ago, at the beginning of December 2015, I released the initial version of &lt;a href=&quot;https://oldweb.today/&quot;&gt;OldWeb.today&lt;/a&gt;, which demonstrated running emulated browsers connected to web archives. This system used Docker to run emulated versions of browsers in the cloud, and required significant resources to maintain and could only support a fixed number of users at a time. (The old version of OWT is still available as &lt;a href=&quot;http://classic.oldweb.today&quot;&gt;classic.oldweb.today&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;I imagined that this would be a temporary solution, and that eventually emulators would run fully in the browser.&lt;/p&gt;
&lt;p&gt;Today, thanks to the work of numerous emulator developers and advances in Javascript and related technologies, I am excited to announce a fully Javascript/WebAssembly, browser-based &lt;a href=&quot;https://oldweb.today&quot;&gt;OldWeb.today&lt;/a&gt;. This version supports three different emulators, all running entirely in the browser, limited only by your own CPU! Sound should fully work in this version as well.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/owt-screenshot.B8y-F9ng_NqDVG.webp&quot; alt=&quot;Screenshot of Netscape 3&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;2880&quot; height=&quot;1800&quot;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://basilisk.cebix.net/&quot; target=&quot;_blank&quot;&gt;Basilisk II&lt;/a&gt; emulator is used for running MacOS up to System 7 and runs several early browsers, including early versions of MacLynx, Mosaic, Netscape, and IE.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://copy.sh/v86/&quot; target=&quot;_blank&quot;&gt;v86&lt;/a&gt; emulator is used to run Windows 98, which comes with versions of Netscape, IE 5, and IE 6. (That’s right, IE 6 is back!)&lt;/p&gt;
&lt;p&gt;These emulators were modified to support a custom in-browser network stack (ethernet, tcp/ip) implementation, developed by the bwFla Emulation as a Service team. The stack allows connections from these emulators to be handled in your current browser, and directed to either a web archive or to the live web.&lt;/p&gt;
&lt;p&gt;The final emulator included is &lt;a href=&quot;https://ruffle.rs/&quot; target=&quot;_blank&quot;&gt;Ruffle&lt;/a&gt;, an open-source Flash emulator that is used to replay any web page with Flash enabled.&lt;/p&gt;
&lt;h3 id=&quot;differences-from-classic-oldwebtoday&quot;&gt;Differences from classic oldweb.today&lt;/h3&gt;
&lt;p&gt;Unlike the original, the entire system can be deployed as a static site, and easily integrated with existing web archives, if desired.&lt;/p&gt;
&lt;p&gt;More details on the architecture and deployment &lt;a href=&quot;https://github.com/oldweb-today/oldweb-today&quot; target=&quot;_blank&quot;&gt;can be found on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Compared to classic oldweb.today, this version includes only browsers that can be run in a JS-based emulator.&lt;/p&gt;
&lt;p&gt;This version includes IE 5, IE 6 and additional 68K Mac based browsers. All emulated browsers also support sound output.&lt;/p&gt;
&lt;p&gt;Currently, this version focuses on Mac and Windows browsers and Flash, but Linux-based browsers could certainly be added as well.&lt;/p&gt;
&lt;p&gt;Support for multiple web archive sources, similar to the original, is planned for a future update.&lt;/p&gt;
&lt;h2 id=&quot;not-gone-in-a-flash&quot;&gt;(Not) Gone in a Flash?&lt;/h2&gt;
&lt;p&gt;Much has been said about Flash ‘no longer being available’, but in reality, the end of Flash is not really the end. Thanks to easily accessible emulation, Flash can continue to be made accessible in a variety of emulation environments. OldWeb.today features old versions of Macromedia Shockwave in MacOS browsers, Flash Player 9 in IE 5 and IE 6, and the new Ruffle emulator running in your own browser. Does this cover &lt;em&gt;all&lt;/em&gt; Flash works? Not yet, but advancements in emulation technology will continue to ensure that Flash remains accessible.&lt;/p&gt;
&lt;p&gt;Here are just a couple of examples of Flash-based works, loaded with Ruffle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://oldweb.today/?browser=ruffle#1996/http://www.flashcentral.com/Tech/HawaiiMap/&quot; target=&quot;_blank&quot;&gt;Flash interacting with JS, using Ruffle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://oldweb.today/?browser=ruffle#https://www.moma.org/interactives/exhibitions/2002/russian/main.html&quot; target=&quot;_blank&quot;&gt;A MOMA Exhibition&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, here are two ways to view another MOMA exhibition piece that requires Flash:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://oldweb.today/?browser=ruffle#https://www.moma.org/interactives/projects/2001/whatisaprint/print.html&quot; target=&quot;_blank&quot;&gt;View using Ruffle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://oldweb.today/?browser=ie6#https://www.moma.org/interactives/projects/2001/whatisaprint/print.html&quot; target=&quot;_blank&quot;&gt;View using IE 6&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Ruffle version is the fastest, but may not replay as accurately. Alternatively, you can try the IE 6 version, which is more sluggish (an entire Windows 98 + IE 6 environment is loaded in your browser, after all!) but should generally be more accurate.&lt;/p&gt;
&lt;p&gt;For example, an animation (from the excellent list of &lt;a href=&quot;https://web.archive.org/web/20201112033204/https://faraday.physics.utoronto.ca/GeneralInterest/Harrison/Flash/&quot; target=&quot;_blank&quot;&gt;Physics Flash Animations&lt;/a&gt;), can be loaded in multiple browsers using OldWeb.today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://oldweb.today/?browser=ns4-mac#2007/http://faraday.physics.utoronto.ca/PVB/Harrison/Flash/ClassMechanics/SHM/TwoSHM.html&quot; target=&quot;_blank&quot;&gt;Using Netscape 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://oldweb.today/?browser=ie5#2007/http://faraday.physics.utoronto.ca/PVB/Harrison/Flash/ClassMechanics/SHM/TwoSHM.html&quot; target=&quot;_blank&quot;&gt;Using IE 5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://oldweb.today/?browser=ruffle#2007/http://faraday.physics.utoronto.ca/PVB/Harrison/Flash/ClassMechanics/SHM/TwoSHM.html&quot; target=&quot;_blank&quot;&gt;Using Ruffle&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As the underlying emulators (Basilisk II, v86, and Ruffle) continue to improve, I expect the number of viable options for viewing Flash will only continue to grow!&lt;/p&gt;
&lt;h3 id=&quot;remember-java-applets&quot;&gt;Remember Java Applets?&lt;/h3&gt;
&lt;p&gt;Before Flash, there was another technology that reached end-of-life: Java applets. Yet many applets do remain in web archives and online, especially within academic department websites.&lt;/p&gt;
&lt;p&gt;Today these applets can continue to be accessed &lt;a href=&quot;https://oldweb.today/?browser=ns3-mac#1997/http://sprott.physics.wisc.edu/java/attract/attract.htm&quot; target=&quot;_blank&quot;&gt;directly in your browser using OldWeb.today&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;old-web-and-web-archives&quot;&gt;Old Web and Web Archives&lt;/h2&gt;
&lt;p&gt;With the relaunch of OldWeb.today, Webrecorder is committed to making the archived web as accessible as possible.&lt;/p&gt;
&lt;p&gt;The examples above include both live web sites and sites loaded from public archives. Currently, OldWeb.today can load from the Internet Archive Wayback Machine, and future improvements will support loading from multiple archives, similar to the previous version of OldWeb.today.&lt;/p&gt;
&lt;p&gt;It is easy to deploy OldWeb.today as part of your web archive, pointing to a different wayback machine endpoint. See the &lt;a href=&quot;https://github.com/oldweb-today/oldweb-today#production-deployment----static-site-with-local-archive&quot; target=&quot;_blank&quot;&gt;README&lt;/a&gt; for more details or &lt;a href=&quot;mailto:info@webrecorder.net&quot;&gt;get in touch&lt;/a&gt; with any questions.&lt;/p&gt;
&lt;p&gt;The system can support additional browsers, and the goal is to add more browsers in the future when time permits.&lt;/p&gt;
&lt;p&gt;The biggest risk to Flash, as with any web content, is that the data is no longer available.&lt;/p&gt;
&lt;p&gt;Web archiving with Webrecorder tools, such as Webrecorder Desktop, can help you archive any Flash content to ensure that it does not disappear.&lt;/p&gt;
&lt;p&gt;Stay tuned for additional tools from Webrecorder to make archiving Flash even simpler!&lt;/p&gt;
&lt;h3 id=&quot;an-open-source-thank-you&quot;&gt;An open source thank you!&lt;/h3&gt;
&lt;p&gt;It is important to acknowledge that this project was only possible by building on many open source tools and projects. In particular, it was possible thanks to the tireless work of emulator developers: Fabian, who created v86; Christian Bauer, who developed Basilisk II; James Friend, who ported it to Javascript; Rafael Gieschke of &lt;em&gt;Emulation as a Service&lt;/em&gt;, who built the Javascript network stack; and of course all the developers working on the Ruffle project. Continued access to digital web heritage is only possible through continued commitment to and support of the open source software ecosystem.&lt;/p&gt;
&lt;h2 id=&quot;have-feedback&quot;&gt;Have feedback?&lt;/h2&gt;
&lt;p&gt;Have any thoughts/feedback/suggestions on OldWeb.today? Feel free to discuss and share on our &lt;a href=&quot;https://forum.webrecorder.net&quot; target=&quot;_blank&quot;&gt;forum&lt;/a&gt;!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>OpenWayback to pywb Transition Guide and pywb update</title><link>https://webrecorder.net/blog/2020-12-15-owb-to-pywb-transition-guide/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-12-15-owb-to-pywb-transition-guide/</guid><description>Earlier this year, members of the IIPC, after an internal survey, recommended the adoption of Webrecorder&apos;s pywb as the primary replay system for their members&apos; web archives. Webrecorder and IIPC established a multi-part collaboration to help with this transition and advance the development of pywb.
</description><pubDate>Tue, 15 Dec 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Earlier this year, members of the IIPC (International Internet Preservation Consortium), after an internal survey, recommended &lt;a href=&quot;https://netpreserveblog.wordpress.com/2020/06/16/the-future-of-playback/&quot; target=&quot;_blank&quot;&gt;the adoption of Webrecorder pywb&lt;/a&gt; as the primary replay system for their members’ web archives. Webrecorder and IIPC &lt;a href=&quot;/blog/2020-06-17-working-with-iipc-to-adopt-pywb&quot;&gt;established a multi-part collaboration&lt;/a&gt; to help with this transition and advance the development of pywb.&lt;/p&gt;
&lt;p&gt;To meet these goals, I’m excited to announce the launch of an official guide for migrating from OpenWayback to Webrecorder pywb, available at:&lt;/p&gt;
&lt;h4 id=&quot;httpspywbreadthedocsioenlatestmanualowb-transitionhtml&quot;&gt;&lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/owb-transition.html&quot; target=&quot;_blank&quot;&gt;https://pywb.readthedocs.io/en/latest/manual/owb-transition.html&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;This guide was created with input from IIPC members and marks the completion of the first package of the &lt;a href=&quot;https://netpreserve.org/projects/pywb/&quot; target=&quot;_blank&quot;&gt;IIPC project on pywb&lt;/a&gt;. This guide is now part of the standard pywb documentation and provides examples of various OpenWayback configurations and how they can be adapted to analogous options in pywb. The guide covers updating the index, WARC storage and exclusion systems to run in pywb with minimal changes.&lt;/p&gt;
&lt;p&gt;For best results, deployment of &lt;a href=&quot;https://github.com/nla/outbackcdx&quot; target=&quot;_blank&quot;&gt;OutbackCDX&lt;/a&gt;, an open-source standalone web archive indexing system developed by the National Library of Australia, alongside pywb is the recommended setup for managing web archive indexes. See the guide for more details and additional options.&lt;/p&gt;
&lt;h2 id=&quot;sample-deployment-configurations&quot;&gt;Sample Deployment Configurations&lt;/h2&gt;
&lt;p&gt;Alongside the guide, pywb now also includes a few working deployments (via Docker Compose) running pywb with Nginx, Apache, and OutbackCDX.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/owb-to-pywb-deploy.html#working-docker-compose-examples&quot; target=&quot;_blank&quot;&gt;Details about the sample deployments&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/webrecorder/pywb/tree/master/sample-deploy&quot; target=&quot;_blank&quot;&gt;View Samples on GitHub&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
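As a rough illustration of what such a deployment looks like, here is a minimal Docker Compose sketch wiring pywb to OutbackCDX behind Nginx. This is a hypothetical sketch, not one of the official samples: the image names, ports, and volume paths are assumptions, and the tested configurations are the sample deployments in the pywb repository.

```yaml
# Hypothetical sketch only: image tags, ports, and paths are assumptions.
# Consult the pywb sample-deploy examples for tested configurations.
version: "3"

services:
  outbackcdx:
    image: nlagovau/outbackcdx        # assumed image name
    volumes:
      - ./cdx-data:/cdx-data          # index storage

  pywb:
    image: webrecorder/pywb
    volumes:
      # The collection's config.yaml would point its index source at the
      # outbackcdx service (endpoint shape is an assumption).
      - ./webarchive:/webarchive
    depends_on:
      - outbackcdx

  nginx:
    image: nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # proxy_pass to the pywb service
    depends_on:
      - pywb
```

The point of the layout is that only Nginx is exposed publicly, while pywb queries OutbackCDX internally for index lookups.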
&lt;p&gt;These deployments will be part of the upcoming pywb release and will be updated as pywb and configuration options evolve.&lt;/p&gt;
&lt;h2 id=&quot;next-steps&quot;&gt;Next Steps&lt;/h2&gt;
&lt;p&gt;Next on the immediate roadmap for pywb is an upcoming release, which will feature numerous fixes in addition to the guide. (See the pywb &lt;a href=&quot;https://github.com/webrecorder/pywb/blob/master/CHANGES.rst&quot; target=&quot;_blank&quot;&gt;changelog&lt;/a&gt; for more details on upcoming and new features.)&lt;/p&gt;
&lt;p&gt;The next iteration of pywb, which will be released in the first half of 2021, will include improved support for access controls, including a time-based access ‘embargo’, location-based access controls, and improved support for localization, in line with the work outlined in pywb project &lt;a href=&quot;https://netpreserve.org/projects/pywb/&quot; target=&quot;_blank&quot;&gt;Package B&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;feedback-wanted&quot;&gt;Feedback Wanted!&lt;/h2&gt;
&lt;p&gt;We hope the guide will be useful for those updating from OpenWayback to pywb. We are also looking for input from IIPC members about any use cases for improved access control and localization for the next iteration.&lt;/p&gt;
&lt;p&gt;If you have any questions, run into issues, or find anything missing,
please send feed feedback via the IIPC mailing lists or directly to Webrecorder, via email or via the &lt;a href=&quot;https://forum.webrecorder.net/&quot; target=&quot;_blank&quot;&gt;forum&lt;/a&gt;&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Web Object Encapsulation Complexity (Part I)</title><link>https://webrecorder.net/blog/2020-11-09-encapsulation-complexity/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-11-09-encapsulation-complexity/</guid><description>As the web transitioned from static documents to interactive web applications, the challenge of archiving and preserving the web have only increased.</description><pubDate>Mon, 09 Nov 2020 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;what-does-it-take-to-archive-a-web-pageproject&quot;&gt;What does it take to archive a web page/project?&lt;/h2&gt;
&lt;p&gt;As the web transitioned from static documents to interactive web applications, the challenges of archiving and preserving the web have only increased.&lt;/p&gt;
&lt;p&gt;But some web pages/projects/publications - let’s refer to them as ‘web objects’ - are easier to archive than others. Some require no effort at all, while others require significant effort and still cannot be correctly archived. Sure, the number of pages, or how ‘interactive’ a project is, plays a role, but it’s not the full story.&lt;/p&gt;
&lt;p&gt;This is a key question for Webrecorder, and something I have wondered about for a while in determining what is possible
and how much effort may be required.&lt;/p&gt;
&lt;p&gt;To fully express this difficulty, a new methodology is needed, something I’ve decided to call ‘encapsulation complexity’.&lt;/p&gt;
&lt;h2 id=&quot;introducing-encapsulation-complexity&quot;&gt;Introducing Encapsulation Complexity&lt;/h2&gt;
&lt;p&gt;At its core, web archiving is really a reproducibility problem: the problem of capturing web objects and replaying, or reproducing, them later, as accurately as possible, in an isolated environment, from archival storage. To reproduce a captured web object, it must first be encapsulated, meaning all dependencies must be determined and also captured.&lt;/p&gt;
&lt;p&gt;How difficult it is to encapsulate any web object and reproduce it later can be called the ‘encapsulation complexity’
of the object, which depends on a number of factors, such as external dependencies, explained further below.&lt;/p&gt;
&lt;p&gt;Different levels of encapsulation complexity require different digital preservation approaches and tools and lead to different expectations. Sometimes a web object can be saved as a single page, or a web archive stored in WARC or WACZ files is sufficient. But sometimes it is necessary to also encapsulate a fully running web server running in an emulation system, and for other cases, even that is not feasible.&lt;/p&gt;
&lt;p&gt;This complexity can be categorized into the following levels, with each level being an order of magnitude ‘harder’ than the previous one.&lt;/p&gt;
&lt;h3 id=&quot;level-0-single-page-encapsulatable&quot;&gt;Level 0 (Single-Page Encapsulatable)&lt;/h3&gt;
&lt;p&gt;A single-page web object, with zero external dependencies. Images, CSS, Javascript, if any, are fully inlined in the page. The page can be directly loaded in any browser. The page need not be fully static, but it should not have any external resources, and does not require any web server.&lt;/p&gt;
&lt;p&gt;A Level 0 object does not require much effort for encapsulation, as it can simply be saved using the browser’s ‘Save Page’ functionality and replayed again, as it is already self-encapsulating in a sense.&lt;/p&gt;
&lt;p&gt;A screenshot or a PDF can also fit into this category, as they are single-page web objects that can be loaded in a web browser.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Possible Tools&lt;/em&gt;: Built-in Save Page As, Single-Page Extension, Taking a Screenshot, Saving as PDF.&lt;/p&gt;
&lt;h3 id=&quot;level-1-web-archive-encapsulatable&quot;&gt;Level 1 (Web Archive Encapsulatable)&lt;/h3&gt;
&lt;p&gt;Level 1 web objects consist of a finite number of URL resources that can be exhaustively captured or crawled.
The resources can be loaded from any number of different web servers, including embeds. The object can be arbitrarily complex on the client, running complex Javascript and requiring arbitrary user input, but its web server interaction must be limited to a fixed amount of data from a fixed number of URLs. Other dynamic network data, such as websocket connections, cannot be included. A Level 1 object can be fully encapsulated as a web archive in a single WARC or WACZ file, and can load directly in a browser.&lt;/p&gt;
&lt;p&gt;Most web objects that work well within web archives fit into this level of complexity, from small or single-page projects to large sites requiring crawling.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Possible Tools&lt;/em&gt;: Browser-based capture (Webrecorder), web crawling (Browsertrix, Heritrix, etc…)&lt;/p&gt;
&lt;h3 id=&quot;level-2-web-archive--server-emulation-encapsulatable&quot;&gt;Level 2 (Web Archive + Server Emulation Encapsulatable)&lt;/h3&gt;
&lt;p&gt;Level 2 web objects require a fixed, known web server to also be encapsulated to be fully functional, along with a WARC/WACZ based web archive. The web server must be encapsulated and reproducible in a self-contained computing environment. The server can have other dependencies, such as a database, as long as the database is deployed alongside the service and not externally. The client web object can make any number of dynamic URL requests to the fixed web server(s), including with websockets, that are running within the encapsulated environments.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Possible tools&lt;/em&gt;: Orchestrated web server containers (Docker Compose, Kubernetes) combined with web archives, web server emulation&lt;/p&gt;
&lt;h3 id=&quot;level-3-not-fully-encapsulatable&quot;&gt;Level 3+ (Not Fully Encapsulatable)&lt;/h3&gt;
&lt;p&gt;Any web objects that have an unknown number of external dependencies, or dependencies that simply cannot be enumerated
in an encapsulation/preservation system, are Level 3 objects or higher. Examples include web objects that make dynamic requests to external servers outside the control of the user, such as performing a search on Google, and web objects that rely on dynamic external data, such as specific camera, microphone, or geolocation inputs, or network speed.
Of course, there is no limit to how complex such objects can get, and examining them further is not useful, as they are not ‘encapsulatable’ at full fidelity.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Possible Tools&lt;/em&gt;: None currently; requires migration to a Level 2, 1, or 0 object.&lt;/p&gt;
&lt;h2 id=&quot;determining-encapsulatability-so-what-does-it-take-to-archive-a-particular-web-object&quot;&gt;Determining Encapsulatability: So what &lt;em&gt;does&lt;/em&gt; it take to archive a particular web object?&lt;/h2&gt;
&lt;p&gt;Coming back to the original question: when looking at a particular web object, determining its encapsulation
complexity can greatly inform how difficult it will be to encapsulate/preserve and what the options may be.&lt;/p&gt;
&lt;p&gt;Given the above methodology, this determination can still be tricky without examining ‘how’ a web object ‘works’, looking at the network traffic, interacting with it, and even examining the code.&lt;/p&gt;
&lt;p&gt;For one, in web objects consisting of multiple pages, the complexity level of each individual page may vary.
For example, a mostly static blog (Level 1) may contain a page with a YouTube embed (a Level 3 object).
The whole blog would therefore be a Level 3 object, because the YouTube video brings with it
an infinitely interactive external dependency (including recommendations, related videos, etc.).&lt;/p&gt;
&lt;p&gt;There are a few telltale signs that can help, though, such as embeds of external services like YouTube.
As another example, a web object that has server-side search will be at least
a Level 2 object, as it must make dynamic requests to the server for search.&lt;/p&gt;
&lt;p&gt;If the search is implemented entirely client-side using a JSON search index and a client-side framework
like &lt;em&gt;FlexSearch&lt;/em&gt;, then the same object could become Level 1, all other things being equal.
However, without looking at the network traffic, it may not be evident if server or client-side search is used.&lt;/p&gt;
&lt;p&gt;For the author of a web object that will need to be encapsulated/preserved, it may be worth it to choose client-side search over server-side search to make encapsulation as a web archive easier down the road.&lt;/p&gt;
&lt;p&gt;Alternatively, it may be a reasonable trade-off to ‘migrate’ a Level 2 object with server-side search to a Level 1 encapsulated web archive, which could gain a built-in web archive search but lose its original search features. Indeed, web archives can fully encapsulate web objects up to Level 1, and anything higher must be migrated to Level 1 to be encapsulated as a web archive.&lt;/p&gt;
&lt;p&gt;The encapsulation complexity level provides an upper bound on how hard it may be to encapsulate a particular web object,
as well as what the maintenance costs can be.&lt;/p&gt;
&lt;p&gt;In a future blog post, I’ll provide additional approaches to determine this complexity and discuss migration of web objects to a lower complexity level and the trade-offs involved.&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Lorena Ramírez-López joins Webrecorder as Community Manager</title><link>https://webrecorder.net/blog/2020-11-09-welcome-lorena/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-11-09-welcome-lorena/</guid><description>I&apos;m excited to announce that Lorena Ramírez-López has joined Webrecorder team as a part-time community manager!</description><pubDate>Mon, 09 Nov 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I’m excited to announce that Lorena Ramírez-López (&lt;a href=&quot;https://dalelore.com/&quot;&gt;DaleLore.com&lt;/a&gt;) has joined Webrecorder team as a part-time community manager!&lt;/p&gt;
&lt;p&gt;Lorena is a trained moving image specialist for film, video and digital collections. Her main interests focus on the preservation and conservation of time-based media art as well as research and access to Net Art and web archives. A native New Yorker from Queens, Lorena believes in access and sharing resources, which is why she participates and collaborates in open-source projects, hackathons, and international communities with the Audiovisual Preservation Exchange from NYU.&lt;/p&gt;
&lt;p&gt;Lorena will help develop and improve documentation, help answer questions on the forums and generally make all Webrecorder tools more accessible to all!&lt;/p&gt;
&lt;p&gt;Stay tuned for more updates soon!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Web Archives as Digital Publications / Digital Publications as Web Archives</title><link>https://webrecorder.net/blog/2020-10-22-webarchives-as-publications/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-10-22-webarchives-as-publications/</guid><description>Web archiving is often done after the fact — a digital publication is designed, built, published and only then, archived for preservation.</description><pubDate>Thu, 22 Oct 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Web archiving is often done after the fact — a digital publication is designed, built, published and only then, archived for preservation.&lt;/p&gt;
&lt;p&gt;But what if the archiving process became part of the publication pipeline, complementary to online publishing, or an alternative distribution medium
free from hosting requirements and available for offline use?&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://parametric.press/about/&quot; target=&quot;_blank&quot;&gt;Parametric Press&lt;/a&gt;, a digital magazine focused on interactive, data-driven content, has been experimenting with this approach from the beginning.
With the release of each issue, they’ve also simultaneously released a downloadable high-fidelity web archive, which can be accessed and viewed online or offline using ReplayWeb.page and the ReplayWeb.page app.&lt;/p&gt;
&lt;p&gt;The magazine highlights interactive, browser-based articles created using &lt;a href=&quot;https://idyll-lang.org/&quot; target=&quot;_blank&quot;&gt;Idyll&lt;/a&gt;, an open-source authoring tool that helps writers produce digital work that combines text with interactive graphics and data visualizations. To use it, authors write in a markdown dialect that’s been imbued with a reactive variable system and syntax to embed dynamic JavaScript components. Idyll comes with a set of useful components, but users can also bring their own, using libraries like React, D3, or P5.&lt;/p&gt;
&lt;p&gt;Parametric Press utilizes and builds upon Idyll and is designed to serve as a platform for digital writers who want to incorporate more of the interactive potential of the web into their work. For example, in the &lt;a href=&quot;https://parametric.press/issue-02/&quot; target=&quot;_blank&quot;&gt;newly released second issue&lt;/a&gt;, authors utilize simulations and data visualizations to highlight different aspects of the climate crisis. All of the project’s code is &lt;a href=&quot;https://github.com/ParametricPress/&quot; target=&quot;_blank&quot;&gt;open-source&lt;/a&gt; and available as a technical blueprint for others wishing to build their own interactive publishing platform.&lt;/p&gt;
&lt;p&gt;As the issue was being prepared, I worked with Matthew Conlen, senior editor of Parametric Press and creator of the Idyll project, using Webrecorder tools, including Webrecorder Desktop, to capture each of the articles. Webrecorder’s high fidelity web archiving ensures that all of the interactive elements, visualizations and even external maps created in Idyll can be captured and replayed.&lt;/p&gt;
&lt;p&gt;The complete web archive of this issue is packaged in the &lt;a href=&quot;https://github.com/webrecorder/wacz-format&quot; target=&quot;_blank&quot;&gt;WACZ format&lt;/a&gt; and available for either online or offline use. The WACZ file can of course also be hosted elsewhere, such as in a digital preservation system or on any static hosting service.&lt;/p&gt;
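For publishers hosting their own WACZ files, embedding a replayable copy in a page looks roughly like the following. This sketch is based on replayweb.page’s web-component embedding pattern; the CDN path and attribute names are from memory and the URLs are placeholders, so verify everything against the current embedding documentation (a copy of the replay service worker typically must also be hosted alongside the page).

```html
<!-- Load the replayweb.page UI script (CDN path is an assumption; check the docs) -->
<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>

<!-- Embed a hosted WACZ; source and url below are placeholder values -->
<replay-web-page
  source="https://example.org/archives/issue-02.wacz"
  url="https://parametric.press/issue-02/">
</replay-web-page>
```

Since the WACZ and the page are both static files, this embedding needs nothing beyond ordinary static hosting.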
&lt;p&gt;You can thus access the interactive content in Parametric Press Issue 02 in the following ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;“Live” Version - &lt;a href=&quot;https://parametric.press/issue-02&quot; target=&quot;_blank&quot;&gt;https://parametric.press/issue-02&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Archived Version - &lt;a href=&quot;https://replayweb.page/?source=https%3A%2F%2Fparametric-press-archives.s3-us-west-2.amazonaws.com%2Fissue-02.wacz#view=pages&amp;url=https%3A%2F%2Fparametric.press%2Fissue-02%2F&amp;ts=20201019182644&quot; target=&quot;_blank&quot;&gt;View in ReplayWeb.page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Archived Version - &lt;a href=&quot;https://parametric-press-archives.s3-us-west-2.amazonaws.com/issue-02.wacz&quot; target=&quot;_blank&quot;&gt;Download for Offline Viewing&lt;/a&gt; with &lt;a href=&quot;https://github.com/webrecorder/replayweb.page/releases/latest&quot; target=&quot;_blank&quot;&gt;ReplayWeb.page App&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Web archiving can be seen as a process of encapsulating online content and making it available as a standalone digital object that can be reproduced in the browser
and eventually stored in digital preservation systems.&lt;/p&gt;
&lt;p&gt;For traditional print-based or static publications, this can also be accomplished by creating a PDF or EPUB. But for dynamic publications, we hope that the approach taken by Parametric Press can serve as an example for interactive publishing. By releasing publications as web archives, publishers can blur the line between an archived version and the published version, creating self-contained, offline-accessible and preservable digital objects of their publications from the beginning.&lt;/p&gt;</content:encoded><author>Ilya Kreymer and Matthew Conlen</author></item><item><title>Emma Dickson joins Webrecorder as Generalist Developer</title><link>https://webrecorder.net/blog/2020-10-07-welcome-emma-dickson/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-10-07-welcome-emma-dickson/</guid><description>I&apos;m excited to announce that the Webrecorder team has expanded, and that Emma Dickson has joined Webrecorder as a part-time Generalist Developer!</description><pubDate>Wed, 07 Oct 2020 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;emma-dickson-joins-webrecorder-as-a-generalist-developer&quot;&gt;Emma Dickson Joins Webrecorder as a Generalist Developer&lt;/h2&gt;
&lt;p&gt;I’m excited to announce that the Webrecorder team has expanded, and that &lt;a href=&quot;https://emmadickson.info/&quot;&gt;Emma Dickson&lt;/a&gt; has joined Webrecorder as a part-time Generalist Developer!&lt;/p&gt;
&lt;p&gt;Emma is fascinated by outdated technology and the process of translation and obsolescence in technical languages. They love creating archives and archival tools. In addition to their work as a conservation technician on time-based media projects, Emma also produces net art and new media sculptures. Their art, including the 2018 piece “Mixed Connections”, explores identity, community, and longing through tech.&lt;/p&gt;
&lt;p&gt;Emma’s expertise in software development and digital preservation will be of great help in improving all aspects of Webrecorder’s open source toolset.&lt;/p&gt;
&lt;p&gt;Stay tuned for more updates!&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Next Generation Web Archiving: Loading Complex Web Archives On-Demand in the Browser</title><link>https://webrecorder.net/blog/2020-08-12-next-generation-web-archive/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-08-12-next-generation-web-archive/</guid><description>I&apos;m excited to present an exciting new milestone for Webrecorder, the release of six high-fidelity web archives of complex digital publications, accessible directly in any modern browser.</description><pubDate>Wed, 12 Aug 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I’m excited to present an exciting new milestone for Webrecorder, the release of &lt;a href=&quot;https://sup.webrecorder.net/&quot; target=&quot;_blank&quot;&gt;six high-fidelity web archives&lt;/a&gt; of complex digital publications, accessible directly in any modern browser. These projects represent the entire catalog of Stanford University Press’s Mellon-funded digital publications, and are the culmination of a multi-year collaboration between Webrecorder and Stanford University Press (SUP).&lt;/p&gt;
&lt;p&gt;You can read more about this collaboration, and additional details on each of the publications &lt;a href=&quot;https://blog.supdigital.org/sup-webrecorder-partnership&quot; target=&quot;_blank&quot;&gt;on the corresponding blog post from SUP&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://sup.webrecorder.net/&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;/_astro/sup-webarchives.EsWfR-Gt_2rgBId.webp&quot; alt=&quot;SUP Digital Web Archives&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;627&quot; height=&quot;392&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It has been really exciting to collaborate with SUP on these boundary-defining digital publications, which allow Webrecorder to also push the boundaries of what is possible with web archiving.&lt;/p&gt;
&lt;p&gt;The web archives cover a variety of digital publication platforms and complexities: Scalar sites, embedded videos, non-linear navigation, 3D models, interactive maps. Four of the projects are also fully searchable with an accompanying full-text search index.&lt;/p&gt;
&lt;p&gt;All six projects are preserved at or near full fidelity (or will be soon: two are in final stages of completion). But digital preservation and web archiving require long-term maintenance, and my goal, in addition to creating web archives, is to lower the barrier to web archive maintenance.&lt;/p&gt;
&lt;p&gt;For example, maintaining the live versions of these publications requires hosting Scalar, or Ruby on Rails, or other infrastructure. Maintaining these web archives simply means hosting six large files, ranging from ~200MB to 17GB, and a static web site online.&lt;/p&gt;
&lt;p&gt;To present fully accessible, searchable versions of these projects, only static web hosting is necessary. There is no ‘wayback machine’, no Solr, and no additional infrastructure. The web site page for this project is currently &lt;a href=&quot;https://github.com/webrecorder/sup-digital-web-archives&quot; target=&quot;_blank&quot;&gt;hosted as a static site via GitHub&lt;/a&gt;, while the web archive files are hosted via Digital Ocean’s S3-like bucket storage.&lt;/p&gt;
&lt;p&gt;In the near future, we plan to transfer hosting from Webrecorder to SUP, which will simply involve deploying the static site from GitHub in another location and transferring the static files from one cloud host to SUP’s digital repository, and that’s it! Once transferred, SUP will not require any additional expertise, beyond simple website hosting, to keep these web archives accessible.&lt;/p&gt;
&lt;p&gt;My hope is that this will demonstrate a new model for more sustainable web archiving, allowing complex web archives to be stored and hosted alongside other digital objects.&lt;/p&gt;
&lt;p&gt;Below, I’ll explain the technologies used in making this possible, and steps taken to create these archives.&lt;/p&gt;
&lt;h2 id=&quot;replaywebpage-and-wacz&quot;&gt;ReplayWeb.page and WACZ&lt;/h2&gt;
&lt;p&gt;This is all possible due to two new technologies from Webrecorder: The &lt;a href=&quot;https://replayweb.page&quot; target=&quot;_blank&quot;&gt;ReplayWeb.page system&lt;/a&gt; and a new collection format currently in development, the &lt;a href=&quot;https://github.com/webrecorder/wacz-format&quot; target=&quot;_blank&quot;&gt;Web Archive Collection Zip (WACZ) format&lt;/a&gt;, explained in more detail below.&lt;/p&gt;
&lt;p&gt;ReplayWeb.page, &lt;a href=&quot;/2020/06/11/webrecorder-conifer-and-replayweb-page.html&quot; target=&quot;_blank&quot;&gt;initially announced in June&lt;/a&gt;, allows loading of web archives directly in the browser, and is a full web archive replay system implemented in Javascript.&lt;/p&gt;
&lt;p&gt;To better support this project, the &lt;a href=&quot;https://replayweb.page/docs/embedding&quot;&gt;replayweb.page embedding functionality&lt;/a&gt; has been further improved, and now supports search features, including page title and text search.&lt;/p&gt;
&lt;p&gt;For example, it is now possible to link directly to certain pages or search queries in an embedded archive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://sup.webrecorder.net/black-quotidian.html#view=replay&amp;url=http%3A%2F%2Fblackquotidian.supdigital.org%2Fbq%2Fintroduction&quot; target=&quot;_blank&quot;&gt;Load the ‘Introduction’ page in Black Quotidian&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://sup.webrecorder.net/black-quotidian.html#view=pages&amp;query=community&quot; target=&quot;_blank&quot;&gt;Search for ‘community’ in &lt;em&gt;Black Quotidian&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://sup.webrecorder.net/when-melodies-gather.html#view=pages&amp;query=listener&quot; target=&quot;_blank&quot;&gt;Search for ‘listener’ in &lt;em&gt;When Melodies Gather Web Archive&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
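&lt;p&gt;These deep links are ordinary URL fragments. As a rough sketch (not Webrecorder code), a replay or search link like those above can be assembled with standard percent-encoding; the &lt;code&gt;replay_link&lt;/code&gt; helper below is hypothetical, but the &lt;code&gt;view=replay&lt;/code&gt;/&lt;code&gt;view=pages&lt;/code&gt; fragment parameters mirror the links listed above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
from urllib.parse import quote, urlencode

def replay_link(embed_page, url=None, query=None):
    """Build a deep link into an embedded web archive (hypothetical helper).

    embed_page -- the page hosting the embed
    url        -- archived URL to load directly (view=replay)
    query      -- page title/text search query (view=pages)
    """
    if url is not None:
        # quote_via=quote percent-encodes ':' and '/' in the archived URL
        frag = "view=replay&" + urlencode({"url": url}, quote_via=quote)
    else:
        frag = "view=pages&" + urlencode({"query": query}, quote_via=quote)
    return f"{embed_page}#{frag}"

print(replay_link("https://sup.webrecorder.net/black-quotidian.html",
                  url="http://blackquotidian.supdigital.org/bq/introduction"))
print(replay_link("https://sup.webrecorder.net/black-quotidian.html",
                  query="community"))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first call reproduces the ‘Introduction’ link shown above, with the archived URL percent-encoded inside the fragment.&lt;/p&gt;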
&lt;p&gt;ReplayWeb.page embeds can be added to any web page. Here are some further examples of ReplayWeb.page embeds presented on this site, including the &lt;em&gt;Filming Revolution&lt;/em&gt; project:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://webrecorder.net/embed-demo-1.html&quot; target=&quot;_blank&quot;&gt;Embed Demo 1 - Simple Examples&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://webrecorder.net/embed-demo-2.html&quot; target=&quot;_blank&quot;&gt;Embed Demo 2 - &lt;em&gt;Filming Revolution&lt;/em&gt; Demo&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;web-archive-collection-zip-wacz---more-compact-portable-web-collections&quot;&gt;Web Archive Collection Zip (WACZ) - More Compact, Portable Web Collections&lt;/h3&gt;
&lt;p&gt;An astute reader might wonder: loading web archives in the browser works great for small archives, but does this work for many GBs of data? Surely loading large WARCs in the browser will be slow and unreliable? And what about full text search and other metadata?&lt;/p&gt;
&lt;p&gt;Indeed, loading very large WARC files, the standard format in web archiving, directly in the browser is not ideal, although ReplayWeb.page natively supports WARC loading as well. An individual WARC file is intended to be part of a larger collection, and lacks its own index. Given only a WARC file, the entire file must be read by the system to determine what data it contains.&lt;/p&gt;
&lt;p&gt;Further, the WARC is not designed to store metadata: titles, description, list of pages, or any other information &lt;em&gt;about&lt;/em&gt; the data to make it useful and accessible for users (though Webrecorder has managed to squeeze this data into the WARC in the past).&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/webrecorder/wacz-format&quot; target=&quot;_blank&quot;&gt;Web Archive Collection Zip (WACZ) format&lt;/a&gt; attempts to address all of these issues.&lt;/p&gt;
&lt;p&gt;The format provides a single file bundle, which can contain other files, including WARCs, indexes (CDX), page lists, and any other metadata. All of this data can be packaged into a single bundle, a standard ZIP file, which can then be read and created by existing ZIP tools, creating a portable format for web archive collection data.&lt;/p&gt;
&lt;p&gt;The ZIP format has another essential property: It is possible to read parts of a file in a ZIP without reading the entire file! Unlike a WARC, a ZIP file has a built-in index of its content. Thus, it is possible to read a portion of the CDX index, then lookup the portion of a WARC file specified in the index, and get only what is needed to render a single page. The WACZ spec relies on this behavior and ReplayWeb.page takes full advantage of this functionality.&lt;/p&gt;
&lt;p&gt;For example, the &lt;em&gt;&lt;a href=&quot;https://sup.webrecorder.net/filming-revolution.html&quot;&gt;Filming Revolution&lt;/a&gt;&lt;/em&gt; project is loaded from a 17GB WACZ file, which contains 400+ Vimeo videos. Using only a regular WARC, the entire 17GB+ file would need to be downloaded, and this would take far too long for most users and would not be a good user experience.&lt;/p&gt;
&lt;p&gt;Using WACZ, the system loads only the initial HTML page when loading the project. When viewing additional videos, they are each streamed on-demand. The system would only load the entire 17GB if a user watched &lt;em&gt;every single video in the archive&lt;/em&gt;, but a more casual user can get a glimpse of the project by browsing a few videos quickly.&lt;/p&gt;
&lt;p&gt;Even though the web archive is presented in the browser, the browser need not download the full archive all at once!&lt;/p&gt;
&lt;p&gt;(This requires the static hosting to support HTTP range requests, a standard HTTP feature supported since the mid-90s.)&lt;/p&gt;
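&lt;p&gt;The partial-read behavior can be illustrated with a small, self-contained sketch using plain Python &lt;code&gt;zipfile&lt;/code&gt; (not the actual WACZ tooling; the member names are illustrative, not the spec). A wrapper counts how many bytes are actually read when extracting one small index member from a multi-megabyte ZIP, mimicking what range requests would fetch from remote storage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import io
import zipfile

class RangeCountingFile(io.RawIOBase):
    """File wrapper that records how many bytes are actually read,
    standing in for the byte ranges a remote client would fetch."""
    def __init__(self, raw):
        self.raw = raw
        self.bytes_read = 0
    def readable(self): return True
    def seekable(self): return True
    def seek(self, pos, whence=0): return self.raw.seek(pos, whence)
    def tell(self): return self.raw.tell()
    def read(self, n=-1):
        data = self.raw.read(n)
        self.bytes_read += len(data)
        return data

# Build a toy "WACZ-like" ZIP: a small index plus several large members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("indexes/index.cdx", "example.com/ 20200812 response\n")
    for i in range(5):
        zf.writestr(f"archive/data-{i}.warc", b"X" * 1_000_000)

total_size = buf.getbuffer().nbytes
counted = RangeCountingFile(io.BytesIO(buf.getvalue()))
with zipfile.ZipFile(counted) as zf:
    # zipfile seeks to the built-in central directory, then reads
    # only this one member -- the large WARC members are untouched.
    cdx = zf.read("indexes/index.cdx")

print(f"read {counted.bytes_read} of {total_size} bytes")
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only a few kilobytes out of over five megabytes are read, because the ZIP’s central directory makes it possible to seek straight to one member, which is exactly the property WACZ and ReplayWeb.page exploit over HTTP range requests.&lt;/p&gt;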
&lt;h3 id=&quot;creating-wacz-bundles&quot;&gt;Creating WACZ Bundles&lt;/h3&gt;
&lt;p&gt;WACZ files are standard ZIP files, and initially were created using existing command-line tools for ZIP files.&lt;/p&gt;
&lt;p&gt;To simplify the creation of WACZ files, a new &lt;a href=&quot;https://github.com/webrecorder/py-wacz&quot; target=&quot;_blank&quot;&gt;command-line tool for converting WARCs to WACZ is being developed&lt;/a&gt;. The tool is not yet ready for production use, but you can follow the development on the repo.&lt;/p&gt;
&lt;p&gt;The format is still being standardized, and if you have any suggestions or thoughts for what should be in the WACZ format, please open an issue on &lt;a href=&quot;https://github.com/webrecorder/wacz-format&quot; target=&quot;_blank&quot;&gt;the main wacz-format repository on GitHub&lt;/a&gt; or &lt;a href=&quot;https://forum.webrecorder.net/t/wacz-format-discussion/44&quot; target=&quot;_blank&quot;&gt;leave a comment on the discussion thread on the Webrecorder Forum&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;creating-the-sup-web-archives-from-web-archives-to-containers-and-back&quot;&gt;Creating the SUP Web Archives: From Web Archives to Containers and Back&lt;/h3&gt;
&lt;p&gt;I wanted to share a bit more about how these six archives were created.&lt;/p&gt;
&lt;p&gt;All of the archives were created with a combination of the &lt;a href=&quot;https://github.com/webrecorder/webrecorder-desktop&quot; target=&quot;_blank&quot;&gt;Webrecorder Desktop App&lt;/a&gt;, &lt;a href=&quot;https://github.com/webrecorder/browsertrix&quot; target=&quot;_blank&quot;&gt;Browsertrix&lt;/a&gt;, and &lt;a href=&quot;https://github.com/webrecorder/warcit&quot; target=&quot;_blank&quot;&gt;warcit&lt;/a&gt;, with the exception of &lt;em&gt;Enchanting the Desert&lt;/em&gt;, which was captured by SUP’s Jasmine Mulliken using Webrecorder.io a few years ago, then converted to WACZ and transferred to the new system for completeness.&lt;/p&gt;
&lt;p&gt;At the beginning of our collaboration, Jasmine at SUP provided full backups of each project and my initial preservation approach was to attempt to run each project in a Docker container, and then overlay them with web archives. An early prototype of the system employed this approach, with ReplayWeb.page seamlessly routing to the WACZ or to a remote, containerized server.&lt;/p&gt;
&lt;p&gt;Ultimately, one by one, I realized that, at least for these six projects, running the full server &lt;em&gt;did not&lt;/em&gt; provide any additional fidelity. Most of the complexity was in the web archive anyway, and little was gained by the additional server component.&lt;/p&gt;
&lt;p&gt;To run the server + web archive version, a Docker setup and Kubernetes cluster were necessary, significantly complicating operations, while a web-archive-only version could be run with no extra dependencies or maintenance requirements.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;Filming Revolution&lt;/em&gt; project is a PHP-based single-page application, but contains 400+ Vimeo videos. The PHP backend was not at all needed to run the single-page application, but the Vimeo videos needed to be captured via a web archive. After obtaining the list of video ids from the data, it was just a matter of running a Browsertrix crawl to archive all of them at the right URLs.&lt;/p&gt;
&lt;p&gt;The three Scalar-based projects, &lt;em&gt;Black Quotidian&lt;/em&gt;, &lt;em&gt;When Melodies Gather&lt;/em&gt;, and &lt;em&gt;Constructing the Sacred&lt;/em&gt;, at first also seemed like they would require a running Scalar instance. However, using the Scalar APIs, along with warcit to get any resources in the local Scalar directory, it was possible to obtain a full list of URLs that needed to be crawled. Using Browsertrix to archive ~1000-1200 URLs from each Scalar project, and using the Webrecorder Desktop App to archive certain navigation elements, resulted in a fairly complete archive.&lt;/p&gt;
&lt;p&gt;Finally, &lt;em&gt;The Chinese Deathscape&lt;/em&gt; is a Ruby on Rails application, with a database and dependent data sets. It would have taken some time to migrate it to run in Docker and on Kubernetes, but that did not matter: the entire site is just a few pages, easily capturable with Webrecorder Desktop, while most of the complexity lies in the numerous embedded maps. The project contains four or five different maps of China, at different zoom levels, loaded dynamically from OpenStreetMap and ArcGIS. To fully archive this project, I first used Webrecorder Desktop to chart out the ‘bounding box’ of the map, and then attempted to automate capturing the remaining tiles via a script.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The Chinese Deathscape&lt;/em&gt; requires another QA pass to check on the map tiles, and &lt;em&gt;Constructing the Sacred&lt;/em&gt; requires another pass to capture the tiles of the 3D models used in that project. Webrecorder tools are capable of capturing both, though it is currently a manual process.&lt;/p&gt;
&lt;p&gt;Overall, it turned out that running the server in a container did not help with the ‘hard parts’ of preservation, with perhaps one exception: full text search.&lt;/p&gt;
&lt;h3 id=&quot;embedded-full-text-search&quot;&gt;Embedded Full-Text Search&lt;/h3&gt;
&lt;p&gt;The Scalar projects all include a built-in search, and I wanted to see if a search could be implemented entirely client-side, as part of the ReplayWeb.page system.&lt;/p&gt;
&lt;p&gt;By using Browsertrix to generate a full-text search index over the same pages that were crawled, it is possible to create a
compact index that can be loaded in the browser, using the &lt;a href=&quot;https://github.com/nextapps-de/flexsearch&quot; target=&quot;_blank&quot;&gt;FlexSearch&lt;/a&gt; javascript search engine. Over the last week, I was able to add an experimental full-text search index to the three Scalar projects and even &lt;em&gt;Chinese Deathscape&lt;/em&gt;. This results in a nearly-100% fidelity web archive. Since the live &lt;em&gt;Chinese Deathscape&lt;/em&gt; does not come with a search engine, the web archive is arguably &lt;em&gt;more&lt;/em&gt; complete than the original site.&lt;/p&gt;
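&lt;p&gt;FlexSearch itself is a Javascript library with its own, far more sophisticated index structure. Purely to illustrate the general idea of a precomputed, client-loadable full-text index, here is a toy inverted index in Python (an assumption-laden sketch, not FlexSearch’s actual data structure; the page URLs and text are made up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import re
from collections import defaultdict

# A toy inverted index: maps each token to the set of page URLs that
# contain it. The principle is the same as the Scalar archives' search:
# build the index once at crawl time, then answer queries entirely
# client-side without a server.
def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index

def search(index, query):
    # Pages containing every query term (a simple AND search).
    results = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*results) if results else set()

# Hypothetical page texts, standing in for crawled Scalar pages.
pages = {
    "/bq/introduction": "African American newspapers and community history",
    "/bq/1973-01-25": "Shirley Chisholm speaks on community organizing",
}
print(search(build_index(pages), "community"))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An index like this can be serialized alongside the archive and queried in the browser, which is roughly what shipping a FlexSearch index inside the collection achieves.&lt;/p&gt;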
&lt;p&gt;Search can be initiated by entering text into the archive’s location bar, as shown in the video below.&lt;/p&gt;
&lt;p&gt;The following video demonstrates &lt;a href=&quot;https://sup.webrecorder.net/black-quotidian.html#view=pages&amp;query=shirley&quot; target=&quot;_blank&quot;&gt;searching for “Shirley” in &lt;em&gt;Black Quotidian&lt;/em&gt; web archive&lt;/a&gt; to a page containing the video of Shirley Chisholm:&lt;/p&gt;
&lt;video controls playsinline muted=&quot;true&quot;&gt;&lt;source src=&quot;/assets/video/bq.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;
&lt;h3 id=&quot;future-work-and-improvements&quot;&gt;Future Work and Improvements&lt;/h3&gt;
&lt;p&gt;One of the remaining challenges is how to create better automation around capturing complex projects.
I believe a combined automated + manual capture by a user familiar with a project will be necessary to fully archive such complex digital publications.&lt;/p&gt;
&lt;p&gt;For example, archiving Scalar sites and lists of Vimeo videos can be easily automated, thanks to the existence of concise APIs to discover the page list, but some pages may require a manual ‘patching’ approach using interactive browser-based capture.
Archiving map tiles and 3D models may continue to be a bit more challenging, as the discovery of tiles may prove to be complex or always require a human driver.&lt;/p&gt;
&lt;p&gt;Other projects, with more dynamic requirements beyond text search, may yet require a functioning server for full preservation. For these projects, the ReplayWeb.page system can provide an extensible mechanism for routing web replay between a web archive and a remote running web server.&lt;/p&gt;
&lt;p&gt;A major goal for Webrecorder is to create tools to allow anyone to archive digital publications into their own portable, self-hostable WACZ files. The plan is to release better tools to further automate capture of difficult sites, including server preservation, combined with the user-driven web archiving approach that remains the cornerstone of Webrecorder’s high fidelity archiving.&lt;/p&gt;
&lt;p&gt;If you have any questions/comments/suggestions about this work, please feel free to reach out, or better yet, &lt;a href=&quot;https://forum.webrecorder.net&quot; target=&quot;_blank&quot;&gt;start a discussion on the Webrecorder Community Forum&lt;/a&gt;&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>Supporting IIPC community in transitioning to Webrecorder pywb</title><link>https://webrecorder.net/blog/2020-06-17-working-with-iipc-to-adopt-pywb/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-06-17-working-with-iipc-to-adopt-pywb/</guid><description>I&apos;m excited to announce an exciting new collaboration between Webrecorder and International Internet Preservation Consortium (IIPC), a group of national, university and regional libraries and archives involved in web archiving all over the world. The IIPC will recommend the adoption of Webrecorder pywb, the core Python web archiving toolset developed by Webrecorder as the &apos;go to&apos; web archive replay system.</description><pubDate>Wed, 17 Jun 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/_astro/pywb-lockup-color.C9qpVD_D_Z2iw5xt.webp&quot; alt=&quot;PYWB Logo&quot; class=&quot;no-border&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;284&quot; height=&quot;62&quot;&gt;
&lt;img src=&quot;/_astro/iipc-extended-color.CvafMPdl_19DKCs.svg&quot; alt=&quot;IIPC Logo&quot; class=&quot;no-border&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;284&quot; height=&quot;88&quot;&gt;&lt;/p&gt;
&lt;p&gt;I’m excited to announce a new collaboration between Webrecorder and &lt;a href=&quot;https://netpreserve.org/&quot;&gt;International Internet Preservation Consortium (IIPC)&lt;/a&gt;, a group of national, university and regional libraries and archives involved in web archiving all over the world. The IIPC will recommend the adoption of &lt;a href=&quot;https://github.com/webrecorder/pywb&quot;&gt;Webrecorder pywb&lt;/a&gt;, the core Python web archiving toolset developed by Webrecorder, as the ‘go-to’ web archive replay system.&lt;/p&gt;
&lt;p&gt;To support IIPC members in switching to pywb, I will be developing a migration guide, additional documentation and features to ensure a smooth transition for users of OpenWayback.&lt;/p&gt;
&lt;p&gt;The pywb project was started in 2014 and has grown to be a fully-fledged replay and capture system for web archives, and has already been adopted by several IIPC members, including &lt;a href=&quot;https://www.webarchive.org.uk/&quot;&gt;UK Web Archive&lt;/a&gt;, &lt;a href=&quot;https://arquivo.pt&quot;&gt;arquivo.pt&lt;/a&gt;, and the &lt;a href=&quot;https://webarchive.nla.gov.au/&quot;&gt;Australian Web Archive&lt;/a&gt;. The latest &lt;a href=&quot;https://pypi.org/project/pywb/&quot;&gt;pywb 2.4.x release&lt;/a&gt; includes a &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/access-control.html&quot;&gt;new access control system&lt;/a&gt; and was supported in part by the UK Web Archive. I am especially thankful to these early adopters for paving the way for broader use, and look forward to working with the IIPC Tools Development group and the broader community on this project!&lt;/p&gt;
&lt;p&gt;Read more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://netpreserveblog.wordpress.com/2020/06/16/the-future-of-playback/&quot;&gt;IIPC Blog Post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://netpreserve.org/projects/pywb/&quot;&gt;Full project plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://pywb.readthedocs.org/&quot;&gt;Latest pywb documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded><author>Ilya Kreymer</author></item><item><title>A New Phase for Webrecorder Project, Conifer and ReplayWeb.page</title><link>https://webrecorder.net/blog/2020-06-11-webrecorder-conifer-and-replayweb-page/</link><guid isPermaLink="true">https://webrecorder.net/blog/2020-06-11-webrecorder-conifer-and-replayweb-page/</guid><description>Today, I’m excited to announce a new phase for the Webrecorder Project, and several major releases/updates.
First, welcome to https://webrecorder.net/ - the new official site of the Webrecorder Project. Feel free to look around, and pardon the dust.
This site will contain all news and updates from Webrecorder, and the tools page is being updated to maintain a current index of all Webrecorder software.
</description><pubDate>Thu, 11 Jun 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today, I’m excited to announce a new phase for the Webrecorder Project, and several major releases/updates.&lt;/p&gt;
&lt;p&gt;First, welcome to &lt;strong&gt;&lt;a href=&quot;https://webrecorder.net/&quot;&gt;https://webrecorder.net/&lt;/a&gt;&lt;/strong&gt; - the new official site of the Webrecorder Project. Feel free to look around, and pardon the dust.&lt;/p&gt;
&lt;p&gt;This site will contain all news and updates from Webrecorder, and the &lt;a href=&quot;/tools&quot;&gt;tools&lt;/a&gt; page is being updated to maintain a current index of all Webrecorder software.&lt;/p&gt;
&lt;h2 id=&quot;long-live-webrecorder-long-live-conifer&quot;&gt;Long Live Webrecorder, Long Live Conifer&lt;/h2&gt;
&lt;p&gt;In 2014, I created Webrecorder with the goal of building high quality tooling to support “web archives for all”, allowing anyone to create and share exactly what they experience in their browser, capturing interactive web sites as accurately as possible. Webrecorder started with a hosted service at webrecorder.io, but has since grown into an ecosystem of open source tools and free products to support web archiving capture and replay.&lt;/p&gt;
&lt;p&gt;Thanks to generous support from the Mellon Foundation, Webrecorder was able to join Rhizome, and together we developed the hosted service into a robust offering, with features including remote browsers and Autopilot automation, providing high-fidelity web archiving backed by a trusted cultural and digital arts institution. The hosted service known as webrecorder.io has now been renamed Conifer, and Rhizome will continue to run this unique service and build new features around it. See &lt;a href=&quot;https://rhizome.org/editorial/2020/jun/11/introducing-conifer/&quot;&gt;Rhizome’s blog post for more details about Conifer&lt;/a&gt; and a &lt;a href=&quot;https://blog.conifer.rhizome.org/2020/06/11/webrecorder-conifer.html&quot; target=&quot;_blank&quot;&gt;brief post on the Conifer blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With the launch of Conifer, Rhizome will focus on running a brand new web archiving service, while Webrecorder will focus on software development.&lt;/p&gt;
&lt;p&gt;As of 2020, Webrecorder is once again an independent project. Organizationally, I have formed ‘Webrecorder Software LLC’ to represent my work but the goals of the project remain the same as ever.&lt;/p&gt;
&lt;p&gt;The Webrecorder Project will continue to work with Rhizome, and with many other &lt;a href=&quot;/#who-is-using-webrecorder-tools&quot;&gt;partners&lt;/a&gt; to develop and maintain the best free and open source web archiving tools, and to further push the web archiving field forward with accessible, free and easy to use tools for all.&lt;/p&gt;
&lt;p&gt;With that, I’d like to introduce the newest addition to the Webrecorder tool suite.&lt;/p&gt;
&lt;h2 id=&quot;introducing-replaywebpage&quot;&gt;Introducing &lt;a href=&quot;https://replayweb.page/&quot;&gt;ReplayWeb.page&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/rwp-screen-1.Dm3L_iTZ_ZgnUvg.webp&quot; alt=&quot;Screenshot the ReplayWeb.page homepage, showing a few loaded archives listed&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;3104&quot; height=&quot;1826&quot;&gt;
&lt;img src=&quot;/_astro/rwp-screen-2.DxgtqsR-_Zfj89U.webp&quot; alt=&quot;Screenshot of ReplayWeb.page displaying an archived YouTube video&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;3104&quot; height=&quot;1826&quot;&gt;&lt;/p&gt;
&lt;p&gt;In an uncertain world, web archiving is becoming more critical than ever. A key, if not defining, reason for creating web archives is to be able to access (“replay”) the archived web sites at a later time. The Webrecorder project has always focused on the accuracy and fidelity of capture and replay: recording and reproducing interactive websites as accurately as possible compared to the original.&lt;/p&gt;
&lt;p&gt;Six years ago, when Webrecorder.io was started, high-fidelity web archive replay was possible only through a hosted service/centralized server. But with advances in browser and web technology, along with growth in decentralized web and storage, this is certainly no longer the case! Today’s browsers can natively run all sorts of applications, including a fully-featured web archive replay system.&lt;/p&gt;
&lt;p&gt;Previously, I’ve &lt;a href=&quot;https://blog.conifer.rhizome.org/2019/10/03/client-side-replay.html&quot; target=&quot;_blank&quot;&gt;introduced wabac.js&lt;/a&gt;, as a fully Javascript, browser-based experiment for rendering web archive replay. Today, I want to announce &lt;a href=&quot;https://replayweb.page/&quot; target=&quot;_blank&quot;&gt;ReplayWeb.page&lt;/a&gt;, a new, fully featured browser-based replay system (or ‘wayback machine’) that further develops this idea.&lt;/p&gt;
&lt;p&gt;The entire system is implemented as a static web page running from GitHub (&lt;a href=&quot;https://github.com/webrecorder/replayweb.page&quot;&gt;https://github.com/webrecorder/replayweb.page&lt;/a&gt;), and is bundled as just two Javascript files: one for UI and one for the backend/service worker. ReplayWeb.page can load web archives located anywhere on the web, or from your local machine. No data is uploaded anywhere, and the browser stores the web archive (or loads it directly from the file system). ReplayWeb.page provides a brand new interface and a new replay engine, but should remain fully compatible with the existing Webrecorder (now Conifer) system, including supporting familiar curatorial features such as lists.&lt;/p&gt;
&lt;p&gt;Webrecorder’s original goal of ‘web archives for all’ can only be realized when users have the tools to create and view web archives on their own devices, with a choice as to where to store their data. ReplayWeb.page takes a step further in this direction by allowing a wide array of options for web archive storage. Have web archives in an institutional repository, on S3, or in any cloud storage? No problem! ReplayWeb.page can load web archives directly from there!&lt;/p&gt;
&lt;p&gt;Have web archives on Google Drive that should only be shared with select collaborators? The &lt;a href=&quot;https://gsuite.google.com/u/2/marketplace/app/replaywebpage/160798412227&quot; target=&quot;_blank&quot;&gt;ReplayWeb.page Google Drive integration&lt;/a&gt; should allow that! Interested in storing data on the decentralized web? ReplayWeb.page is designed to be able to support IPFS and Dat/Hypercore protocols in the future as well.&lt;/p&gt;
&lt;p&gt;For larger local archives, or for archives requiring Flash, there is also the &lt;a href=&quot;https://github.com/webrecorder/replayweb.page/releases/latest&quot; target=&quot;_blank&quot;&gt;ReplayWeb.page App&lt;/a&gt;, which provides the same interface, but can support Flash even if your current browser can not. The ReplayWeb.page App fully replaces Webrecorder Player for offline use.&lt;/p&gt;
&lt;p&gt;See the full &lt;a href=&quot;https://replayweb.page/docs/&quot;&gt;User Docs&lt;/a&gt; for more details or just try out &lt;a href=&quot;https://replayweb.page/&quot;&gt;https://replayweb.page/&lt;/a&gt; for yourself!&lt;/p&gt;
&lt;p&gt;Not sure where to start? Here’s some example of web archive links to get started with replayweb.page:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://replayweb.page/?source=https%3A%2F%2Freplayweb.page%2Fexamples%2Fnetpreserve-twitter.warc#view=replay&amp;url=https%3A%2F%2Ftwitter.com%2Fnetpreserve&amp;ts=20190603053135&quot; target=&quot;_blank&quot;&gt;Replay an old Twitter feed&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://replayweb.page/?source=https%3A%2F%2Freplayweb.page%2Fexamples%2Fnetpreserve-twitter.warc#view=pages&amp;query=WARC&quot; target=&quot;_blank&quot;&gt;Load a collection and search for the word ‘WARC’&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;more-updates--summary-of-changes&quot;&gt;More Updates / Summary of Changes&lt;/h3&gt;
&lt;p&gt;That was a lot of info! Here are a few more recent releases and an overall summary of changes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;pywb 2.4.0 is now out, and includes &lt;a href=&quot;https://github.com/webrecorder/pywb/blob/master/CHANGES.rst#pywb-240-changelist&quot;&gt;a whole host of features and fixes&lt;/a&gt; developed in partnership with the &lt;a href=&quot;https://www.webarchive.org.uk/&quot;&gt;UK Web Archive&lt;/a&gt;, including a brand new &lt;a href=&quot;https://pywb.readthedocs.io/en/latest/manual/access-control.html&quot;&gt;Access Control&lt;/a&gt; system.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new release of &lt;a href=&quot;https://github.com/webrecorder/webrecorder-desktop/releases/tag/v2.0.2&quot; target=&quot;_blank&quot;&gt;Webrecorder Desktop 2.0.2&lt;/a&gt; is now out. This release features a few minor improvements, including a new Twitter Autopilot behavior and capture and fidelity improvements via pywb 2.4.0.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/webrecorder/replayweb.page/releases/latest&quot;&gt;ReplayWeb.page App&lt;/a&gt; along with &lt;a href=&quot;https://replayweb.page&quot;&gt;https://replayweb.page&lt;/a&gt; supersede the Webrecorder Player, which will no longer be maintained. But don’t worry, ReplayWeb.page should support all the same features and work better.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Webrecorder.io is now &lt;a href=&quot;https://conifer.rhizome.org&quot;&gt;https://conifer.rhizome.org&lt;/a&gt; and fully rebranded! All existing features of Webrecorder.io are maintained by Rhizome. &lt;a href=&quot;https://rhizome.org/editorial/2020/jun/11/introducing-conifer/&quot; target=&quot;_blank&quot;&gt;More info in blog post from Rhizome&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/webrecorder/webrecorder&quot; target=&quot;_blank&quot;&gt;webrecorder/webrecorder&lt;/a&gt; repository will be rebranded for Conifer. It will be maintained for Rhizome’s hosted service, but will not be developed separately as ‘webrecorder’.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The previous Webrecorder Blog from 2016-2019 is now the new &lt;a href=&quot;https://blog.conifer.rhizome.org/&quot; target=&quot;_blank&quot;&gt;Conifer Blog&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Other GitHub repositories associated with Conifer (such as the &lt;a href=&quot;https://guide.conifer.rhizome.org/&quot;&gt;user guide&lt;/a&gt;) are also being renamed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;webrecorder-or-conifer&quot;&gt;Webrecorder or Conifer?&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/_astro/wr-to-conifer.-1tG49jR_Z1lWkkk.webp&quot; alt=&quot;Diagram showing the Webrecorder tools and the transition of the hosted service to Rhizome&#39;s Conifer&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; fetchpriority=&quot;auto&quot; width=&quot;1280&quot; height=&quot;1032&quot;&gt;&lt;/p&gt;
&lt;p&gt;Perhaps you are confused about which tools will be provided by whom. Don’t worry! Here is a clearer delineation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If you are interested in running web archiving tools on your own — the desktop app, pywb, ReplayWeb.page, etc. — Webrecorder will continue maintaining these tools! If you would like to integrate web archiving into other software or services, Webrecorder is here to help!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you are looking for a well-established arts non-profit, long committed to digital preservation, to provide a dedicated web archiving service, then &lt;a href=&quot;https://conifer.rhizome.org/&quot;&gt;Rhizome’s Conifer&lt;/a&gt; is for you!&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course, you can continue to use both, and we will continue to work together to expand the web archiving ecosystem.&lt;/p&gt;
&lt;h3 id=&quot;thank-yous&quot;&gt;Thank Yous&lt;/h3&gt;
&lt;p&gt;I want to thank &lt;a href=&quot;https://ashleyblewer.com/&quot;&gt;Ashley Blewer&lt;/a&gt; for her help in making this site and the new Webrecorder logo.&lt;/p&gt;
&lt;p&gt;I would also like to thank the folks at Rhizome for their continued support of Webrecorder from 2016 to 2020: the Webrecorder team members at Rhizome over the years (Mark, Pat, Anna, John, and especially Dragan), and Zachary Kaplan, who supported Webrecorder all the way as Executive Director of Rhizome. I look forward to further collaboration as we continue to work on Webrecorder and Conifer.&lt;/p&gt;
&lt;h3 id=&quot;stay-in-touch&quot;&gt;Stay in touch&lt;/h3&gt;
&lt;p&gt;Follow &lt;a href=&quot;https://twitter.com/webrecorder_io&quot;&gt;@webrecorder_io&lt;/a&gt; or &lt;a href=&quot;https://twitter.com/IlyaKreymer&quot;&gt;@IlyaKreymer&lt;/a&gt;, or check back on this blog for the latest updates on Webrecorder.&lt;/p&gt;
&lt;p&gt;You can also reach me &lt;a href=&quot;mailto:info@webrecorder.net&quot;&gt;via e-mail&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Stay tuned for a lot more updates in the coming weeks!&lt;/p&gt;
&lt;p&gt;Ilya&lt;/p&gt;</content:encoded><author>Ilya Kreymer</author></item></channel></rss>