Analyze your Web Archives at Scale: The Archives Unleashed Cloud

September 19th, 2018

by Samantha Fritz, Project Manager, and Ian Milligan, Project Lead, for Archives Unleashed at the University of Waterloo

Nova Scotia Municipal Government web archive in the Archives Unleashed Cloud

Do you have great collections, but wish you had more and different ways to explore them? Archive-It provides a fantastic service to curate, collect, and preserve content from the Web. Wayback browsing and search interfaces work for some applications, but with the rise of the digital humanities, the computational social sciences, and web science more generally, even more analysis tools can help to express archived data in new and enlightening ways.

Enter the Archives Unleashed Project. As researchers look for increasingly sophisticated approaches and methods for working with data at scale, our priority is to design and implement tools to allow academics and other researchers to work with the raw ARCs and WARCs that comprise Archive-It crawls. Funded by the Andrew W. Mellon Foundation, we’ve created a web interface that allows Archive-It subscribers to analyze their collections at scale. And best of all, it’s all currently free.

You can check out the Archives Unleashed Cloud at https://cloud.archivesunleashed.org right now, or read on to find out why we’re so excited about it!

Why do we care so much about Web Archives?

We might be preaching to the choir here on the Archive-It blog, but like many of you, we’re convinced that web archives are key to the future of scholarly research. Given the exponential rise of digital-born content, the opportunities for scholarly inquiry using digital content as a primary source of exploration are boundless. Previous work in the web archiving field has demonstrated that the sheer scale of working with web archives can be overwhelming. Web archives are big, they come in often unfamiliar file formats, and they consist of so many different types of data.

As a team made up of humanists, librarians, archivists, and computer scientists, we appreciate the struggles faced by web archiving scholars and practitioners. The Archives Unleashed Cloud (AUK) is accordingly motivated by a desire to lower barriers to working with web archives at scale by providing a convenient, practical, and user-centered tool that takes the stress out of complicated coding scripts and the dreaded command line.

The Archives Unleashed Cloud and Archive-It

Supported by the rest of our team, project co-lead Nick Ruest spent just over five months building the infrastructure of the Archives Unleashed Cloud. A Ruby on Rails application, the Cloud uses the Web Archiving Systems API (or WASAPI) to download and populate information about web archive collections. In practice, this means that any institution with an Archive-It account can be fully connected with the Archives Unleashed Cloud out of the box.

Since launching this past spring, the Cloud has had over sixty users, processed almost 2,000 jobs, and analyzed over 100 TB of web archives. Over the past few months, we had the opportunity to work with several Archive-It partners to help download and analyze their WARC files.

For example, Sarah McTavish, a PhD candidate at the University of Waterloo, used Archives Unleashed Cloud data from the University of Alberta to explore what a researcher could discover about the Alberta Oil Sands from the university’s “Alberta Oil Sands” collection. In a post that recently appeared on our blog, she began with network information – files that show which crawled domains within a collection link to certain files or other domains – to see which webpages had the most influence.

A network graph of links among documents in the Alberta Oil Sands collection

By using this approach, McTavish could surface the web pages and sites within this large Archive-It collection that might deserve closer investigation for their importance to the topic. Rather than having to rely on stabs in the dark using keyword searches, the information in the Archives Unleashed Cloud helped her research the project and even begin to annotate pages of interest for future researchers to use.

Experience the Power Yourself and Try the Archives Unleashed Cloud!

 

Archives Unleashed Cloud: A Tour

 

The Cloud offers a suite of tools for ingesting and analyzing ARC and WARC files, as in the example above. You can watch the full video above to learn the complete process. In a nutshell, though, Cloud users sync their web archive collections and select the ones that they want to have analyzed; AUK then processes these files “behind the scenes,” and users can explore the results themselves.

Snowden web archive in the Archives Unleashed Cloud

Users can also download the “derivative” datasets that AUK creates in the process to support these analyses. Currently, we produce three main types of files:

A full text file that contains the extracted plain text of all of the HTML documents in a collection, which can support text mining across a web collection;

A network diagram that allows you to see the major linking patterns within an Archive-It collection and where the sites you crawled linked to; and

Basic statistics on how often certain domains appear within a crawl (e.g., there are 10,000 captures of pages within the “liberal.ca” domain, or 5,000 from “twitter.com”).
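To give a rough sense of what the domain statistics and the link network capture, here is a minimal Python sketch. It is not AUK's actual code: the function names, the sample URLs, and the choice to strip a leading "www." are all assumptions made for illustration. It counts captures per domain and aggregates page-to-page links into weighted domain-to-domain edges.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.' (an illustrative choice)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls):
    """Count how many captured pages fall under each domain."""
    return Counter(domain_of(u) for u in urls)


def link_edges(links):
    """Aggregate (source_url, target_url) pairs into weighted domain-to-domain edges."""
    return Counter((domain_of(src), domain_of(dst)) for src, dst in links)


# Hypothetical sample data standing in for records extracted from a crawl.
captures = [
    "http://www.liberal.ca/",
    "http://www.liberal.ca/news",
    "https://twitter.com/liberal_party",
]
links = [
    ("http://www.liberal.ca/", "https://twitter.com/liberal_party"),
    ("http://www.liberal.ca/news", "https://twitter.com/liberal_party"),
]

counts = domain_counts(captures)
edges = link_edges(links)

print(counts.most_common())  # domains ordered by number of captures
print(edges.most_common())   # domain pairs ordered by number of links
```

On a real collection, the inputs would come from the crawl itself rather than hand-written lists, and the weighted edges could be loaded into a graph tool to draw a network like the one shown earlier.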

We also provide documentation and learning resources to help users fully explore their web collections. Before diving in, we suggest that people take a quick read of our Documentation, as well as explore the Derivative Learning Guides to see what’s possible with the derivative files above.

Seeing the Cloud come to life has been a really exciting time for our team, especially as we see the real-life impact it has for researchers as they dig into and explore their content. Over the next few weeks, our blog at https://news.archivesunleashed.org will feature more of these researcher stories.

We’d like to invite any Archive-It subscribers to try the Archives Unleashed Cloud out for free! To get started visit: https://cloud.archivesunleashed.org.

Connect with Archives Unleashed

Have you already tested out the Cloud? We’d love to connect with you. Whether you’d like to chat about web archives, share your experience with AUK, contribute a case study, or join our growing community, there are a number of ways to connect with our team and community: