Recap: A coffee chat with web archivists in US government programs

April 6th, 2022

by Tanya Ulmer, Web Archivist for Archive-It

US Federal organizations’ web archiving programs were the hot topics for a Coffee Chat on March 29. In case you missed it, we’ll top you up on the conversation hosted by the Web Archiving Section (WAS) of the Society of American Archivists (SAA).

As attendees settled in with their mugs, five panelists from four organizations presented overviews of their programs. Each presenter summed up their program’s collections and achievements in a handful of slides. For the full overviews, watch the recording, available until April 30.

National Library of Medicine (NLM)

Christie Moffat’s slides highlighted many of NLM’s collections, from the original Health and Medicine Blogs to Global Health Events, which now has over 14,000 URLs, many covering the COVID-19 pandemic. NLM uses a variety of models and resource guides to select content, with input from historians, medical museum curators, librarians, archivists, and subject-matter experts, as well as public recommendations gathered through their blog posts.

Screenshot of the National Library of Medicine's slide for their newer tools for outreach

NLM’s examples of newer initiatives for reaching out to the public

Federal Depository Library Program (FDLP)

Dory Bower covered how the FDLP’s legally mandated web archiving continues the Government Publishing Office’s (GPO) mission of “Keeping America Informed,” as Federal agencies increasingly disseminate directly through their own websites. Content selected when the program began in 2012 tended to be smaller agencies’ websites or content that wasn’t available in print. Since then, 213 collections have been developed under the guidance of the Public Policy Statement and can now also be accessed through their catalog.

Smithsonian Institution Archives

Lynda Schmitz Fuhrig’s slides showcased how, as the institution’s official repository, the Smithsonian’s program currently archives nearly 400 of its web presences. They started using Archive-It in 2011 to crawl their main sites once or twice a year. More frequent crawling started in early 2020 to capture the pandemic closings and reopenings, and the increased need for online content.

Screenshot of Smithsonian Distance Learning archived website

The Smithsonian’s Lynda Schmitz Fuhrig speaks to the increased need for online content in 2020

Library of Congress (LoC)

Meghan Lyon and Lauren Baker described how their web archiving team at LoC manages 174 collections. Each collection’s content is determined by subject experts known as Recommending Officers (ROs). ROs nominate seed URLs, which the team tracks in Digiboard, its in-house workflow tool. They contract with an outside vendor for crawling, but the in-house team reviews the crawls, then indexes the results and provides access through LoC’s own Wayback portal.

Each panelist then had the opportunity to weigh in on a series of questions moderated by the current WAS Chair, Melissa Wertheimer.

Collecting differences between government institutions and cultural heritage organizations

Legal mandates, copyright permissions, and embargoes were three of the main differences the panelists described between their collecting and that of cultural heritage organizations. Moffat said that having a Federal legal mandate to collect biomedical materials for NLM meant they could justify more resources. Bower said the FDLP’s legal mandate meant they need not worry about copyright permissions or embargoes. At LoC, strict procedures are in place for obtaining copyright permissions from website owners, including both permission to crawl and permission to display. LoC also keeps archived content embargoed for one year before display.

Quality Assurance (QA) processes

Lyon outlined QA processes recently adopted at LoC: the team first performs a pre-crawl analysis of the seed URLs before they’re sent for contract crawling, followed by a post-crawl analysis when the results come back, which informs future crawls’ scoping and focuses attention for replay evaluation. At the Smithsonian, Schmitz Fuhrig starts by evaluating replay, focusing on the home page and a few links off of each seed URL, then runs patch crawls, and escalates issues to the Archive-It team when those don’t suffice.
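For readers curious what a pre-crawl analysis might involve, here is a minimal Python sketch of one possible automated triage step: checking that each nominated seed URL still responds before crawl time. The file names and the HEAD-request approach are our own assumptions for illustration, not a description of LoC’s Digiboard or vendor pipeline.

```python
# Hypothetical sketch: a lightweight pre-crawl check of seed URLs.
# This only illustrates the kind of triage a pre-crawl analysis
# might automate; it is not any panelist's actual tooling.
import csv
import requests

def check_seeds(seed_file: str, report_file: str) -> None:
    """Read one seed URL per line and record its HTTP status."""
    with open(seed_file) as seeds, open(report_file, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["seed_url", "status", "final_url"])
        for line in seeds:
            url = line.strip()
            if not url:
                continue
            try:
                # HEAD keeps the check lightweight; follow redirects so
                # moved seeds can be re-nominated at their new address.
                resp = requests.head(url, timeout=10, allow_redirects=True)
                writer.writerow([url, resp.status_code, resp.url])
            except requests.RequestException as exc:
                writer.writerow([url, f"error: {exc.__class__.__name__}", ""])

if __name__ == "__main__":
    check_seeds("seeds.txt", "precrawl_report.csv")
```

A report like this could flag dead or redirected seeds before they consume crawl budget, leaving replay evaluation to focus on the pages that actually capture.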

Screenshot of the Library of Congress' Webcomics web archive

The Webcomics collection is one of the Library of Congress’ most used

Access metrics

All four organizations use Google Analytics to track the usage of their web archives. Analyzing that data showed Baker that the Webcomics collection is among LoC’s most used. Bower sees most of the FDLP’s traffic coming from agencies using their repository. Schmitz Fuhrig noted an uptick in use when the pandemic started, followed more recently by a downward trend, whereas Moffat hopes that NLM’s collections will be used increasingly by researchers as the pandemic winds down further.

Check out the FDLP’s Google Analytics case study if you ever want to integrate Google Analytics with your own Archive-It account or enhance an existing integration with data cleaning resources.
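As a taste of the kind of cleanup that case study covers, here is a minimal Python sketch that tidies a hypothetical Google Analytics CSV export of archived-page views. The column names (“Page”, “Pageviews”) and the Wayback-style path prefix are assumptions for illustration, not the case study’s actual schema.

```python
# Hypothetical sketch: tidy a Google Analytics CSV export so that
# views of the same archived page aggregate. Column names and the
# Wayback-style path prefix are assumptions, not a real schema.
import pandas as pd

def clean_ga_export(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Pageview counts often export as strings with thousands separators.
    df["Pageviews"] = (
        df["Pageviews"].astype(str).str.replace(",", "", regex=False).astype(int)
    )
    # Collapse per-capture paths like /wayback/20200401000000/... down to
    # the underlying URL so every capture of a page counts together.
    df["Page"] = df["Page"].str.replace(r"^/wayback/\d{14}/", "", regex=True)
    grouped = df.groupby("Page", as_index=False)["Pageviews"].sum()
    return grouped.sort_values("Pageviews", ascending=False)

print(clean_ga_export("ga_export.csv").head(10))
```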

Web archiving awareness and future planning

Asked to peek a little into their futures, panelists immediately thought of their seed tracking spreadsheets, which seemed to be growing “crazy big.” Moffat wished for a workflow tool that would make them easier for smaller organizations to “grapple with.” Lyon offered attendees hope for tools to come by linking to ongoing projects funded by the IIPC.
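To make that wish concrete, here is a minimal Python sketch of one small chore such a workflow tool could automate: normalizing and deduplicating seed URLs in a tracking spreadsheet. The file name and single “seed_url” column are hypothetical.

```python
# Hypothetical sketch: deduplicate a seed tracking spreadsheet by
# normalizing URLs first. File and column names are assumptions.
from urllib.parse import urlsplit, urlunsplit
import pandas as pd

def normalize(url: str) -> str:
    """Lowercase scheme and host and drop fragments/trailing slashes
    so near-duplicate seed entries match."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), parts.query, ""))

df = pd.read_csv("seed_tracking.csv")
df["normalized"] = df["seed_url"].map(normalize)
deduped = df.drop_duplicates(subset="normalized")
deduped.to_csv("seed_tracking_deduped.csv", index=False)
print(f"{len(df) - len(deduped)} duplicate seeds removed")
```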

And Schmitz Fuhrig reminded us that awareness of web archiving’s importance is on the rise: there’s a real light being shone on web archiving at the moment with the work that Saving Ukrainian Cultural Heritage Online is doing.

Attendees left the coffee chat wide awake to the current state of web archiving for US Federal government organizations. Thank you to the SAA WAS and UBC for hosting it on Zoom!