The Webrecorder team has just finished a new release for WACZ and we’re delighted to share it with you!
WACZ stands for Web Archive Collection Zipped, and is a new file format designed to make creating and hosting web archives quicker and easier. The format has been in development for a few months, and we’re excited to announce the release of WACZ Format 1.0. The spec for the format can be found on GitHub.
ReplayWeb.page and the newly announced ArchiveWeb.page extension both support WACZ Format 1.0. (ReplayWeb.page also continues to support earlier iterations of the format.)
ZIP-Based packaging for WARCs, Indices and Metadata
WACZ serves as a zipped package format for WARCs. WARC files normally contain mostly raw network data. A WACZ file takes the raw WARC files and zips them up, along with a CDX or compressed CDX index and a full-text index.
This gives WACZ files a few distinct advantages over plain WARC files. Because WACZ files are essentially ZIP files, they benefit from a property of ZIP files that allows them to be read on demand over the network without downloading the entire file. WACZ files come packaged with everything you need to create and host a web archive collection: a random-access index of all raw data, a list of entry-point pages into the archive, and user-defined, editable metadata about the web archive collection. As an added bonus, the full-text data extracted from web pages is also included, ready to be ingested into search engines like Solr or loaded on the fly along with the replay.
When using WACZ, ReplayWeb.page can quickly load large web archives without downloading the entire file. With the ReplayWeb.page Google Drive extension, the same applies to WACZ files loaded directly from Google Drive.
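To make the on-demand property concrete, here is a minimal sketch using only Python's standard zipfile module. It builds a small stand-in package in memory and then reads a single member while counting how many bytes are actually touched. The directory names (archive/, indexes/, pages/), the file contents, and the CountingReader helper are illustrative assumptions, not taken from the spec, and an in-memory buffer stands in for what would be HTTP range requests over a network.

```python
import io
import json
import zipfile


class CountingReader(io.BytesIO):
    """In-memory file that records how many bytes are actually read."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_example_package():
    """Build a tiny stand-in package (hypothetical layout and contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # A large placeholder for WARC data, stored without compression.
        zf.writestr("archive/data.warc", b"x" * 500_000,
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("indexes/index.cdx", b"placeholder index\n")
        page = {"url": "https://example.com/", "title": "Example"}
        zf.writestr("pages/pages.jsonl", json.dumps(page) + "\n")
    return buf.getvalue()


package = build_example_package()
reader = CountingReader(package)

with zipfile.ZipFile(reader) as zf:
    # Listing members only needs the directory at the end of the file.
    names = zf.namelist()
    # Reading one small member seeks straight to it, skipping the WARC data.
    first_page = json.loads(zf.read("pages/pages.jsonl").splitlines()[0])

print(names)
print(first_page["url"])
print(f"read {reader.bytes_read} of {len(package)} bytes")
```

The count printed at the end should be a small fraction of the package size, since the large WARC placeholder is never read. A network client can apply the same idea by fetching only the byte ranges it needs.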
Frictionless Data Package
In an effort to base WACZ on established formats, starting from 1.0, WACZ also conforms to the Frictionless Data Package standard. The Data Package manifest adds integrity checks (via SHA-256 or MD5) for each file contained in the WACZ. We hope to expand this specification as well as collaborate with the Frictionless Data community to better standardize formats that are used in web archives.
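As a rough illustration of how such integrity checks might be used, the sketch below recomputes each file's digest and compares it with the value recorded in the manifest. It assumes the manifest is a datapackage.json file at the root of the ZIP, with a resources list whose entries carry path and hash fields, and that hashes are written as "sha256:<hex>" (a bare digest is treated as MD5). These details follow the general Data Package conventions and should be checked against the WACZ spec. The file contents and the helper names build_package and verify_package are made up for the example.

```python
import hashlib
import io
import json
import zipfile


def build_package(files, manifest):
    """Zip the given files plus a datapackage.json manifest into bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path, data in files.items():
            zf.writestr(path, data)
        zf.writestr("datapackage.json", json.dumps(manifest))
    return buf.getvalue()


def verify_package(package_bytes):
    """Return {path: bool} saying whether each listed file matches its hash."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        manifest = json.loads(zf.read("datapackage.json"))
        for resource in manifest["resources"]:
            # "sha256:<hex>" names the algorithm; a bare digest is treated as MD5.
            algo, _, expected = resource["hash"].rpartition(":")
            data = zf.read(resource["path"])
            actual = hashlib.new(algo or "md5", data).hexdigest()
            results[resource["path"]] = actual == expected
    return results


# Hypothetical package contents, for illustration only.
files = {
    "archive/data.warc": b"placeholder WARC bytes",
    "pages/pages.jsonl": b'{"url": "https://example.com/"}\n',
}
manifest = {
    "profile": "data-package",
    "resources": [
        {
            "path": path,
            "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        }
        for path, data in files.items()
    ],
}

intact = verify_package(build_package(files, manifest))

# Changing a file after the manifest was written should be detected.
tampered_files = {**files, "pages/pages.jsonl": b"edited\n"}
tampered = verify_package(build_package(tampered_files, manifest))

print(intact)
print(tampered)
```

For real WACZ files, the validation built into the wacz Python package is the better choice. The sketch only shows the idea behind per-file hashes.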
Tools for creating and verifying WACZ
We have released the wacz 0.2.0 Python package, the official reference implementation for creating and validating WACZ files. The library supports packaging up WARC files, simple full-text extraction, and a variety of other options. It can also validate existing WACZ files against the spec. (For extracting WACZ files, any unzip tool can be used, since WACZ files are also ZIP files.)
See the py-wacz page for more details on options, or run wacz -h after installing the Python package.
While Webrecorder is leading the development, we would like the WACZ format to be responsive to the needs of web archiving communities. If you have any suggestions or comments, feel free to open an issue on the WACZ format GitHub repository or attend one of our upcoming community calls.
Have thoughts or comments on this post? Feel free to start a discussion on our forum!