For safekeeping: An automated preservation workflow for Archive-It content

January 12th, 2021

by Adriane Hanson, Head of Digital Stewardship, University of Georgia Libraries

Adapted from my presentation of the same name at the Archive-It Partner Meeting on October 7, 2020.

In this blog post, I’ll describe the University of Georgia Libraries’ almost entirely automated workflow for preserving a copy of our Archive-It content locally. We’ve used it successfully in production since July 2020. If you want to skip this post and head straight to the code, here is the link to our Python scripts and documentation for the workflow.

Capture from the University of Georgia Student Life web archive collection

Why Keep a Local Copy?

We appreciate that Archive-It provides very good digital preservation for the copies they maintain. So why would we also want a local copy? Well, these are irreplaceable resources for us and of high importance for our collecting mission. In many cases, they are either our only documentation of something (e.g. student groups) or a central part of that documentation (e.g. political activism), and these sites can be short-lived on the live web. So we just were not comfortable relying entirely on a third party to guarantee their long-term preservation, even a third party as reputable as the Internet Archive.

We do have the luxury of an in-house digital preservation system, so there is a good place to put the local copies. However, if we didn’t have that, we’d still want a copy to put on our servers or even hard drives just in case. It also doesn’t take much staff time since we use Archive-It’s very handy APIs to automate most of the process. So at the end of the day, we decided to keep a copy of most of our Archive-It content locally as a backup of last resort. If the worst comes to pass and the Archive-It service ends, we can pivot to relying on our own copies without having to scramble to download everything at once.

Archive-It APIs

What We Download

Every quarter, we download all WARCs saved during that quarter. We picked quarterly to give us a balance between not doing it so rarely that there is a huge amount to download and not doing it so often that it is annoying for staff. Pro tip: we picked February, May, August, and November as our quarters since December/January is such a busy time of year!
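
To make the quarterly window concrete, here is a minimal sketch of how the date range for one download could be computed. It is illustrative only: the assumption that the download runs on the first of the month and covers the preceding three months is mine, not necessarily how our script handles its dates.

```python
from datetime import date

def quarterly_window(run_day: date):
    """Return (start, end) ISO dates covering the three months before run_day.

    Illustrative sketch: assumes the download runs on the first of February,
    May, August, or November and should cover the preceding three months.
    """
    # First day of the month the download runs in (e.g. 2020-11-01).
    end = run_day.replace(day=1)
    # Step back three calendar months to get the start of the window.
    start_month = end.month - 3
    start_year = end.year
    if start_month < 1:
        start_month += 12
        start_year -= 1
    start = date(start_year, start_month, 1)
    return start.isoformat(), end.isoformat()

# Example: a download run on 2020-11-01 covers 2020-08-01 up to 2020-11-01.
print(quarterly_window(date(2020, 11, 1)))  # ('2020-08-01', '2020-11-01')
```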

We also download six metadata reports related to each of the WARCs: seed, seed scope, collection, collection scope, crawl job, and crawl definition. Together, these reports give enough context for the WARCs to be understandable without access to the wealth of information available via the Archive-It interface. They document the limits on and success of the crawl and also include the metadata we create for the collection and seed.
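
To give a sense of what pulling one of those reports looks like, below is a minimal sketch that requests the seed report for a single collection from the Archive-It Partner API and saves it as a CSV. This is a simplified sketch, not our production code: the endpoint path, credentials, collection ID, and field names are placeholders, so check them against the Partner API documentation and the scripts on GitHub. The metadata completeness check described later in the workflow works from this same seed metadata.

```python
import csv
import requests

# Placeholder values for illustration; substitute your own.
BASE_URL = "https://partner.archive-it.org/api"        # Partner API base URL
AUTH = ("your-archive-it-username", "your-password")   # HTTP basic auth
COLLECTION_ID = 12345                                  # hypothetical collection

# Request the seed report for one collection as JSON.
response = requests.get(f"{BASE_URL}/seed",
                        params={"collection": COLLECTION_ID},
                        auth=AUTH, timeout=60)
response.raise_for_status()
seeds = response.json()

# Save a simple CSV version of the report alongside the WARCs.
with open(f"collection_{COLLECTION_ID}_seed_report.csv", "w", newline="") as report:
    writer = csv.writer(report)
    writer.writerow(["seed_id", "url", "metadata"])
    for seed in seeds:
        # "id", "url", and "metadata" are the field names I expect in the
        # response; verify against the actual API output before relying on them.
        writer.writerow([seed.get("id"), seed.get("url"), seed.get("metadata")])
```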

Workflow Summary

I wrote Python scripts that use the Archive-It APIs for this workflow. I’m fairly new to Python and brand new to APIs, so it is not that fancy, and I did my best to document within the script what each piece of the code is doing. Hopefully, that will help you make sense of our workflow and make it easier to customize to your own institution should you so desire. There is more detail on these steps in the presentation and in the documentation on GitHub if you want to dig deeper.

Workflow Diagram (Steps in blue are done with Python scripts)

1. A week in advance of our planned download, I run a script to check for metadata completeness. This generates a spreadsheet with the descriptive metadata in Archive-It for the WARCs that will be downloaded. If there is missing metadata, I share it with collections staff so they can fill in the gaps prior to download day. We don’t want to be preserving something locally with incomplete metadata, and some of those fields are used by the script to determine which WARCs to download. As a bonus, it is a handy way for collections staff to review all their metadata without having to learn the API.

2. The download is done with another script: 

It first makes one folder per seed and downloads all the WARCs and metadata for each seed using the Archive-It APIs. The APIs are what let us limit which WARCs to include in the download, based on the crawl date. We also select WARCs by collection, since some UGA departments use our Archive-It account but not this preservation workflow. (There is a sketch of this kind of filtered API call right after this list.)

It next packages the seed folders according to the AIP requirements of our digital preservation system. Making the AIP involves extracting and reformatting technical metadata and then bagging, tarring, and zipping the folder. (A bagging sketch also follows the list.)

To help with quality control, the script also logs known errors (e.g. WARC fixity changed after download) and checks the AIPs for any missing WARCs or metadata files.

3. I do a brief review of the downloaded AIPs, mostly relying on the error log and completeness documentation generated by the script. I also open up a small sample of the AIPs to look for anything so wildly wrong with the preservation metadata file or the directory structure that the script didn’t catch it.

4. I transfer the AIPs to our ingest server and schedule them to go into our preservation system.
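
As mentioned in step 2, the download itself is driven by Archive-It’s WASAPI endpoint. Here is a minimal sketch of the kind of call that limits a download by collection and store date and then verifies fixity, as in the error logging in step 2. This is a simplified sketch rather than our production script: the filter names and response fields follow my reading of the WASAPI documentation, and the credentials, collection ID, and dates are placeholders.

```python
import hashlib
import requests

# Placeholder credentials and filters; substitute your own values.
AUTH = ("your-archive-it-username", "your-password")
WASAPI = "https://warcs.archive-it.org/wasapi/v1/webdata"
params = {
    "collection": 12345,                # hypothetical collection ID
    "store-time-after": "2020-08-01",   # quarterly window start
    "store-time-before": "2020-11-01",  # quarterly window end
    "page_size": 500,
}

url = WASAPI
while url:
    result = requests.get(url, params=params, auth=AUTH, timeout=120).json()
    params = None  # the filters are already baked into the "next" page URL
    for warc in result["files"]:
        # Stream each WARC to disk and hash it as it downloads.
        md5 = hashlib.md5()
        with requests.get(warc["locations"][0], auth=AUTH, stream=True, timeout=600) as download, \
             open(warc["filename"], "wb") as target:
            download.raise_for_status()
            for chunk in download.iter_content(chunk_size=1024 * 1024):
                target.write(chunk)
                md5.update(chunk)
        # Log a fixity error if the download does not match Archive-It's checksum.
        if md5.hexdigest() != warc["checksums"]["md5"]:
            print(f"Fixity error: {warc['filename']} changed during download")
    url = result.get("next")  # follow pagination until every page is processed
```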
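
And here is a rough sketch of the bagging and tarring end of step 2, using the bagit-python library and Python’s standard tarfile module. Our real AIP structure, checksum choices, and compression tooling may differ; the folder name below is a made-up example, not our exact identifier.

```python
import tarfile
import bagit  # pip install bagit

# Hypothetical seed folder produced by the download step.
aip_folder = "harg-0000-web-202011-0001"

# Bag the folder in place, which adds manifests with fixity information.
bagit.make_bag(aip_folder, checksums=["md5", "sha256"])

# Tar and gzip the bag so it is a single file ready for ingest.
with tarfile.open(f"{aip_folder}.tar.gz", "w:gz") as package:
    package.add(aip_folder)
```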

Easy Opportunities for Customization

You’re welcome to use and remix the scripts however you would like. There are a few places that would be particularly simple to customize with little Python knowledge. The slides from my presentation (starting on slide 9) show you where to make each of these changes in the code.

1. Update the test for metadata completeness to match your rules.  

2. Update the API filters for the WARCs to include in a download. For instance, quarterly may be too often or too infrequent for you, depending on how much you crawl. 

3. Change how the unique identifier is constructed. Our schema combines department, related collection, download date, and a sequential number (see the sketch after this list).

4. Adjust which metadata reports you want to include. There are 26 and we picked 6.

5. Swap out our “make an AIP” process for yours.
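
For item 3, here is a small sketch of how an identifier along those lines could be built. The format string and example values are illustrative placeholders, not our exact schema.

```python
def make_aip_id(department: str, collection: str, download_date: str, sequence: int) -> str:
    """Combine department, related collection, download date, and a sequential
    number into one identifier (format is illustrative, not UGA's exact schema)."""
    return f"{department}-{collection}-web-{download_date}-{sequence:04d}"

# Example with made-up values:
print(make_aip_id("harg", "0000", "202011", 1))  # harg-0000-web-202011-0001
```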

If you want to take the workflow for a test spin, there is a version of the script that just downloads all WARCs and metadata for a single seed (web_aip_single.py).

In Conclusion

We’ve done three successful downloads at this point. Do be warned that it can take a while to run because of the time required to download many large WARCs from Archive-It. Our first download was for 1.5 years’ worth of crawling and took about 24 hours to complete. 

Ultimately, it gives us peace of mind to have backup copies in our own system. We hope we never need to use them and that the Internet Archive lasts forever, but in digital preservation we want to be prepared as much as we can for as many things as we can! Since the APIs and scripts made it easy, we think it is worth the investment in time and storage space to have local copies.

Here is that link to the scripts one more time, as well as a link to all our web archiving documentation. If you might be interested in doing something similar and have questions, please do reach out (ahanson [at] uga [dot] edu). I’m happy to help!