Take your web archives to go with these Archive-It APIs and integrations

October 10th, 2019

by Karl-Rainer Blumenthal, Web Archivist for Archive-It

The web supports delivering information where and when people need it. Web archives can do the same.

Archive-It is a platform that brings together different tools that have been developed to acquire, store, preserve, and share primary source material from the World Wide Web, a kind of “one-stop shop” for web archiving tech. More than that, though, it is the community of partners who share that infrastructure, shape its development, and build its corpus: nearly 700 organizations and individuals archiving over a petabyte of data across 40 billion (with a b!) URLs since the first pilot partners ran their first web crawls in 2005. Archive-It software and hardware are built to meet the needs that span this scale, but one size rarely fits all.

The goal is to ensure that every Archive-It partner can access, manage, migrate, and represent their web archives where and how they need them. For most, that means using Archive-It’s web application to collect web archives, the Internet Archive’s repository to store and preserve them, and this website to share them. However, partners also have options to preserve original data themselves, to apply custom access layers, to curate the same kinds of data and metadata as the Archive-It web application, and more. To read a high-level summary of each option, review example use cases, and find detailed documentation and instructions for implementation, start here: Archive-It APIs and integrations.

Illustration of Archive-It as a to-go container

Consider this your menu for web archive data and metadata to go: W/ARC files, descriptive and technical metadata, Wayback and full-text search indexes, and much more information about Archive-It partners and their web archive collections. All of this Archive-It data can be retrieved selectively through a web browser, broadly from a command-line interface, and automatically with custom scripts. What happens next is up to your imagination.

The primary way that Archive-It shares data is through application programming interfaces, or “APIs.” Whatever your knowledge of and experience with APIs might be, you’ve seen them in action on this website or in the Archive-It web application, where they connect front-end interfaces with the latest data from their back-end stores:

Schematic diagram of APIs and how they connect data to the Archive-It web application

A sketch of APIs that connect disparate data stores to the Archive-It web application. Browser icons by Linseed Studio from the Noun Project.

Partners may access, curate, preserve, and share these same data directly by following the instructions in the documentation above. Peers in the digital and web archiving communities can also use these to access public data from Archive-It partners, like their Wayback capture points and descriptive metadata. This kind of unmediated access makes some neat new things possible.

The Princeton Theological Seminary Web Archive is a helpful demonstration. It plugs three of these data pipelines into a web template to provide direct access to archives and their metadata on PTSEM’s own site, without anyone having to catalog and manually update records every time that a change is made or a new crawl runs in the background:

Animated annotation of PTSEM front end, to identify data pulled through APIs

Data pulled directly into PTSEM’s Web Archive via three Archive-It APIs.
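
For readers curious about what “plugging a data pipeline into a web template” can look like in practice, here is a minimal Python sketch of the general pattern. It is not PTSEM’s implementation and it does not reproduce any actual Archive-It API: the endpoint URL, the record fields (`url`, `title`, `captures`), and the function names are all assumptions made for illustration. The real endpoints and response formats are in the documentation linked above.

```python
import json
from html import escape
from string import Template
from urllib.request import urlopen

# HYPOTHETICAL endpoint, used only as a placeholder. Substitute the real URL
# from the Archive-It APIs and integrations documentation.
METADATA_URL = "https://example.org/api/collections/123/seeds.json"

# One list item per archived site; the field names are assumptions.
ROW = Template('<li><a href="$url">$title</a> ($captures captures)</li>')


def fetch_records(url=METADATA_URL):
    """Download and decode a JSON list of records from the given endpoint."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def render_list(records):
    """Turn records into an HTML list, escaping every text value."""
    rows = [
        ROW.substitute(
            url=escape(str(rec["url"]), quote=True),
            # Fall back to the URL when a record has no descriptive title.
            title=escape(str(rec.get("title") or rec["url"])),
            captures=int(rec.get("captures", 0)),
        )
        for rec in records
    ]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


# Sample data in the assumed shape, so the sketch runs without network access.
SAMPLE = [
    {"url": "https://example.org/", "title": "Example & Co.", "captures": 12},
    {"url": "https://example.org/news"},
]

if __name__ == "__main__":
    print(render_list(SAMPLE))
```

Because the page is rebuilt from whatever the API returns, a new crawl or a metadata edit shows up the next time the template is rendered, with no hand-maintained catalog record in between. Swapping `SAMPLE` for `fetch_records()` is the only change needed once a real endpoint and its field names are known.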

Connecting directly to the source means that you can have your data when and where you need them. That’s a foundational value and benefit of the web, so we’re eager to support it in web archiving, too. Use cases continue to drive development in this space, though! Let us know when you’ve used one of the APIs described in the Help article above, or make a feature request if there’s a new integration that you’d like to use, and we’ll be back to highlight more examples and ideas.