Glossary of Terms

A handy reference for commonly used language.

By Webrecorder Team

Understanding the web requires some knowledge of the technical terms and jargon involved. These terms get more specialized when we begin archiving websites! This glossary offers explanations of commonly used web-related technical terms that relate to Webrecorder’s web archiving software and the broader web archiving ecosystem. By breaking down specialized vocabulary, we aim to create a shared understanding and make navigating the intricacies of web archiving more approachable.

Archived Item (noun): A file or grouping of files representing web content that is generated and/or stored by Webrecorder software. May be referred to as “item(s)” when there is sufficient context. Archived items are further categorized by how they were created.
Archiving (verb): The act of capturing web content and saving it to a file.
Archiving Session (noun): An item created by Webrecorder software through an interactive process of browsing pages and saving them as they are viewed by a user.
Browser (noun): A program that retrieves and displays pages from the Web. Also known as a web browser. Examples include Google Chrome, Mozilla Firefox, Apple Safari, Microsoft Edge, and Opera Browser. Note that Google Chrome is different from the Google search engine.
Browser Profile (noun): A saved package of browser sessions and other settings, interactively configured by a user, for use during crawling.
Collection (noun): A group of archived items related by a common theme or subject matter.
Cookies (noun): Small pieces of information left on a visitor’s computer by a website, via a web browser.
Crawl (noun): An item archived by a crawler.
Crawl Start URL (noun): The first URL a crawler will visit to start looking for other pages. Also known as a seed URL.
Crawl Workflow (noun): A particular configuration used by a crawler for automated crawling. Crawl workflows also contain a list of all crawls that have been run with them. May be referred to as a “workflow” when there is sufficient context.
Crawler (noun): A program which systematically browses the Web to collect data from webpages. Also known as a web crawler, bot, or robot.
Crawler Trap (noun): A structural part of a website that leads a crawler to capture a large number of URLs that don’t result in new unique content or links. Also known as a spider trap.
Crawling (verb): The automated act of browsing the web and saving its contents. Also known as running a harvest or capture.
CSS (noun): A declarative language that controls how webpages look in the browser. Stands for Cascading Style Sheets.
Domain (noun): A unique, easy-to-remember name that identifies a website’s address on the internet. It is divided into a top-level domain (e.g. org, net, com) and a second-level domain (e.g. the name of the site, like webrecorder in webrecorder.net). Also known as a domain name.
Heritrix (noun): An open-source web crawler developed by the Internet Archive, launched in 2004, and currently used by the Library of Congress. This is a web archiving term from the Internet Archive.
HTML (noun): A descriptive language that specifies a webpage’s structure. Stands for HyperText Markup Language.
HTTP (noun): A network protocol that enables the transfer of hypermedia documents on the Web, typically between a browser and a server. Stands for HyperText Transfer Protocol.
HTTPS (noun): An encrypted version of the HTTP protocol, providing a secure connection between a browser and a server. Stands for HyperText Transfer Protocol Secure.
IP Address (noun): A series of numbers used to uniquely address each device connected to the Internet. IP stands for Internet Protocol.
JS (noun): A programming language used most often for dynamic client-side scripts on webpages. Stands for JavaScript.
Link (noun): A reference that connects webpages or data items to one another. Also known as a hyperlink.
Quality Assurance (QA) (noun): In digital preservation, QA refers to the processes designed to ensure that materials are preserved in a manner that maintains their integrity, accessibility, and usability over the long term.
Replay (verb): To display web archives stored in WARC or WACZ files.
Scope (noun): An outline of what the archive contains and what it aims to encompass, based on a common theme or subject matter.
Server (noun): A computer that provides services and resources to other computers, known as clients, over a network.
Sitemap (noun): A list of pages within a website.
SPA (noun): A web app implementation that loads only a single webpage, with the rest of the navigation handled with JavaScript. Stands for single-page application.
Status Codes (noun): Three-digit codes included in an HTTP response that indicate the outcome of the request. Examples of status codes: 404 (resource not found), 500 (internal server error).
Subdomain (noun): A prefix added to a domain name to create a separate part of a website (e.g. the “app” in app.browsertrix.com).
URL (noun): A text string that identifies the location of a resource (such as a web page, image, or video) on the internet. Stands for Uniform Resource Locator. Also known as a web address or link.
WACZ File (noun): An open-source format developed by Webrecorder that packages WARC files together with indexes and other metadata. Stands for Web Archive Collection Zipped.
WARC File (noun): An open-source format for web archives, developed by the Internet Archive and standardized as ISO 28500. Stands for Web ARChive file. This is a web archiving term from the Internet Archive.
Web Archive (noun): One or more web pages preserved so that they can be viewed even if the original site goes offline. Sometimes also used to refer to an institution that archives websites.
Web Archiving (verb): The process of collecting web pages.
Website (noun): One or more web pages, typically located under the same domain.
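Several of the terms above (URL, domain, subdomain, HTTPS, status codes) can be seen in action with a short script. This is a minimal sketch using only Python’s standard library; the URL is a hypothetical example chosen to match the subdomain entry, not an actual address being fetched.

```python
from http import HTTPStatus
from urllib.parse import urlparse

# A hypothetical URL, used here only to label its parts.
url = "https://app.browsertrix.com/docs/index.html"
parts = urlparse(url)

print(parts.scheme)  # prints: https  (the protocol, here HTTPS)
print(parts.netloc)  # prints: app.browsertrix.com  (the host)
print(parts.path)    # prints: /docs/index.html  (the resource on the server)

# The host splits into subdomain, second-level domain, and top-level domain.
subdomain, second_level, top_level = parts.netloc.split(".")
print(subdomain, second_level, top_level)  # prints: app browsertrix com

# Status codes indicate the outcome of an HTTP request.
for code in (404, 500):
    print(code, HTTPStatus(code).phrase)
    # prints: 404 Not Found, then 500 Internal Server Error
```

No network request is made; the script only takes the URL apart and looks up the standard reason phrases for the two status codes mentioned in the glossary.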

General web terms are primarily sourced from the MDN Web Docs by Mozilla. Web archiving terms are listed as used in Webrecorder’s software, alongside usage compiled from other organizations such as the International Internet Preservation Consortium, the Internet Archive, the Library of Congress, and the Society of American Archivists.