The stack: High fidelity web collecting at scale with Brozzler

November 3rd, 2020

by Karl-Rainer Blumenthal, Web Archivist for Archive-It

This is the first in a blog series about the web archiving software and hardware that Archive-It partners share. Want to know more about a tool in our box? Your suggestions or questions for future posts are very welcome here.  

Brozzler logo

Browser + Crawler = Brozzler!

With support from the Andrew W. Mellon Foundation, the Internet Archive developed a web browser-based collecting tool to help web archivists collect “hard to archive” modern and interactive websites. Many Archive-It partners helped directly in testing and improving its performance as an automated collection tool. You can watch the video below for an introduction and demonstration, and read on for more information about Brozzler.

Watch the recorded “Introduction to Brozzler” webinar here or anytime in the Help Center

The modern, personalized web is a choose-your-own-adventure experience, one in which each choice alters the landscape. Preserving it meaningfully therefore requires tools that can make decisions judiciously, intelligently, and at scale. Web crawlers like Heritrix, developed at the Internet Archive, have long been used by web archivists around the world to collect the static documents that comprise “live” websites. As the web grows more interactive, however, collecting only those documents leaves gaps in its increasingly extensible code environment. To preserve a website more ecologically, a collecting tool must activate all of its scripts, play all of its embedded media, and otherwise experience the site more as a person would.

Screenshots of the DC Moore Gallery website's code as it renders in a command line terminal and a web browser

How a web crawler (left) and browser (right) interpret the same web page’s code

Just as any Archive-It partner would start a traditional crawl with their web application, they can now also direct Brozzler to automatically deploy web browsers, cycle them through a growing list of user behaviors that are necessary to experience a website in full, and deposit the results into the trusty Web ARChive (WARC) files that we all need for storage, access, and reuse. By automating the experience of a person at a web browser, web archivists can create more accurate and complete records than a crawling robot can.

Schematic diagram of capture technologies deployed to a seed URL

In fact, Archive-It partners started using web browsers automatically like this in 2014, with the introduction of Umbra–a technology that still runs in parallel with the Heritrix web crawler. Umbra prompts a web browser (we use Chromium, the open-source browser on which Google’s popular Chrome is based) on a virtual machine to open each web crawl’s seed URLs, activate and run those pages’ client-side scripts, and add the additional resources that the process reveals to the crawler’s archiving queue.
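The handoff described above amounts to a simple loop: the browser renders a seed URL, and the resource URLs that its scripts reveal are fed back into the crawler’s queue, skipping anything already seen. The sketch below illustrates that idea in plain Python; all names and URLs are illustrative, not Umbra’s or Heritrix’s actual APIs.

```python
from collections import deque

def enqueue_discovered(frontier: deque, seen: set, discovered: list) -> int:
    """Add browser-discovered URLs to the crawl frontier, skipping duplicates."""
    added = 0
    for url in discovered:
        if url not in seen:
            seen.add(url)
            frontier.append(url)
            added += 1
    return added

# A seed page's static HTML may reference little, while running its
# client-side scripts reveals stylesheets, players, and media files.
frontier = deque(["https://example.org/"])
seen = set(frontier)
script_revealed = [
    "https://example.org/style.css",
    "https://example.org/player.js",
    "https://example.org/audio/track1.mp3",
    "https://example.org/style.css",   # duplicate; will be skipped
]
enqueue_discovered(frontier, seen, script_revealed)
print(len(frontier))  # seed + 3 unique discovered resources -> 4
```

Deduplicating against a `seen` set is what keeps the crawler’s queue from ballooning as scripts on different pages reveal the same shared resources.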

Schematic diagram of Brozzler deployed to a seed URL

Brozzler simplifies and expands upon this model to its logical scale. Instead of feeding ever more moving targets into a web crawler’s queue, Brozzler uses warcprox–a Python-based “man-in-the-middle” proxy–to record all browser requests and web server responses straight into WARC files for storage and eventual replay. And whereas Umbra only runs on seed URLs, Brozzler browses every page of every site that it experiences through warcprox. Archive-It partner and staff collaboration and testing have made it possible to scale up that capacity without sacrificing the speed and efficiency of a web crawler or the fidelity enabled by the growing list of custom user behaviors.
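The WARC format that warcprox writes into is itself straightforward: each record is a block of named header fields followed by the captured payload, such as a full HTTP response. As a rough illustration of the format–a simplified sketch, not warcprox’s actual implementation, which also handles request records, digests, deduplication, and compression–a minimal “response” record can be assembled like this:

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(uri: str, http_payload: bytes) -> bytes:
    """Assemble a minimal WARC 1.0 'response' record (simplified sketch)."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # The header block and the payload are separated by a blank line,
    # and each record is terminated by two CRLF sequences.
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + http_payload + b"\r\n\r\n"

# The payload is the captured server response: status line, headers, body.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = warc_response_record("https://example.org/", payload)
print(record.decode().splitlines()[0])  # WARC/1.0
```

Because records are just concatenated blocks like this, a single WARC file can accumulate everything a browsing session touches–pages, scripts, stylesheets, and media alike–ready for storage and replay.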


Brozzler opens locally and navigates a dynamic web page automatically (left) while warcprox writes its contents into a WARC file for preservation (right).

Developing Brozzler began with Archiving the Websites of Contemporary Composers, a partnership with NYU Libraries funded by Mellon. The composer websites in NYU’s collecting scope offered a useful lab environment for experimentation, given their many interactive audio-visual features. The beta testing team has grown steadily since. Partners from Oxford’s Bodleian Libraries to the Alaska State Library, and from the Kansas Historical Society to the Smithsonian Institution, all adopted Brozzler early and contributed use cases to expand its capabilities.

Brozzler now comes standard in the web application that every Archive-It partner uses, and is the recommended collecting technology for many types of sites, including most forms of social media. That’s a lot of web browsers to keep happy and healthy! The Archive-It engineering team continues to find and create the efficiencies that support this continually increasing use of partners’ shared computing resources.

Partners who have not yet used Brozzler can start with the aptly named How and when to use Brozzler guide in the Archive-It Help Center. Not a partner? Or want to collaborate in future development? You can also read Brozzler’s docs on GitHub, run it locally, and let us know how it might be even further enhanced. We’re eager to see and share how this new technology improves web archive fidelity at scale going forward.