2023 Archive-It Partner Meeting Recap

August 10th, 2023

by Tanya Ulmer, Web Archivist for Archive-It


The annual Archive-It Partner Meeting took place on July 26th in Washington, D.C. Around 30 partners and friends of Archive-It gathered in person at the D.C. Public Library’s Martin Luther King Jr. Memorial branch. This event was scheduled to coincide with the Society of American Archivists’ ARCHIVES*RECORDS 2023 conference starting later that week.

After enjoying lunch together from Delizique, proceedings kicked off with a welcome to the newly renovated conference facility. DCPL staff oriented attendees in Room 401A, on the same floor as the People’s Archive and DCPL’s permanent exhibition, Up from the People, honoring Dr. King and local activism, politics, and history. Attendees then introduced themselves one by one, going around the room in an orderly manner.


Kody Willis describes the process for Collaborative Collecting.

The first hour wrapped up with an Archiving and Data Services (ADS) update from Product Operations Manager Kody Willis. This update touched on newer services like ARCH (Archives Research Compute Hub) and Vault, in addition to Archive-It’s recent developments, including a redesigned Help Center and a preview of Collaborative Collecting, a new feature for sharing seeds across Archive-It accounts. Forthcoming developments mentioned include an updated look and feel for the public website, archive-it.org, and improvements around scheduling crawls and managing data budgets. Features requested by the crowd included the ability to use the Wayback QA tool on test crawls and a way to better tie patch crawls to their original crawls.

In the second hour, Ruth Bryan, University Archivist at the University of Kentucky (UK) Special Collections Research Center (SCRC), presented UK Libraries’ Wildcat Histories: Preserving Student Activist Social Media Content. Ruth started with a broad overview of the web archiving program at the University of Kentucky Libraries, leading up to their Project STAND mini-grant. Then Taylor Leigh spoke to us by video about the Wildcat Histories project funded by that grant, which engaged the student activist population through web archiving workshops, exhibits, and documentation of their content creation. Also by video, Emily Collier spoke to the technical challenges of preserving the social media content these student groups produce and the workarounds the UK team has adopted. Ruth followed up with the lessons learned, which included asking others what worked, documenting procedures in manuals, and ensuring more staff time is dedicated to web archiving. She then fielded questions about the team’s positionality as part of the institution and the possibilities for documenting more of the student groups’ internal communications.

Ruth Bryan wraps up her presentation with her colleagues who presented by video, Emily Collier and Taylor Leigh.

Elizabeth England, Senior Digital Preservation Specialist from the National Archives and Records Administration (NARA), then presented her lightning talk, Web archiving at NARA. She spoke to NARA’s Congressional and Presidential web archiving efforts up to 2020. She then described the considerations that the current Web Resource Archives Project (WRAP) is researching as it develops a new direction that more consistently collects federal web presences, as different public interest groups and users of web archives have come to expect.

Olga Holownia, Senior Program Officer at the International Internet Preservation Consortium (IIPC) and Council on Library and Information Resources (CLIR), then presented her lightning talk, Analysing the IIPC Novel Coronavirus Collection: ARCH and the Archives Unleashed Cohort. Olga started by reminding us that it was the 20th anniversary of the founding of the IIPC and explaining how it came to develop thematic, transnational collections of importance to research, including the Novel Coronavirus collection. She then showed us how the Analysing Web Archives of the COVID-19 Crisis (AWAC2) research cohort partnered with Archives Unleashed to quantitatively answer research questions using datasets from the IIPC’s Novel Coronavirus collection and earlier versions of ARCH. AWAC2 evaluated two lists of research questions; the second list, which focused more on women’s lives during COVID-19, was determined by a vote and topic brainstorming during 3-4 datathons. She concluded with a list of AWAC2’s disseminated research and a call for proposals for the IIPC’s next web archiving conference in Paris in 2024.

Karl Blumenthal demonstrates Voyant for text mining analysis on ARCH datasets.

This AWAC2 research example set the stage nicely for Web Archivist Karl Blumenthal’s Web Archives as Data section of the program. This section introduced ARCH to the attendees in a more hands-on fashion. Karl started by reminding us how unwieldy WARC files can be for computational analysis, then introduced the categories of lighter-weight datasets ARCH provides (Collection, Network, Text, and File Format). He then covered the types of analysis that can be performed on these datasets (Natural Language Processing, Binary Analysis, and Network Analysis), following each with a list of tools useful for performing it. He concluded with a mini-tutorial using Voyant for text mining that attendees could follow along with on their laptops. Karl then fielded questions about a potential ARCH repository of datasets, File Format dataset specifics, and any plans to use Artificial Intelligence on collections. For that last question, he pointed to Daniel Van Strien’s blog post as an example of how a “non-evil AI company” like Hugging Face can apply computer vision to these ARCH datasets.

Rounding out the day, Skip Kendall from Harvard University Archives walked us through how metadata is applied at his institution. Skip does his descriptive work up front when adding the seeds so that the seeds are ready to go public as soon as possible. He finds looking through the website useful for this, particularly going through the menus for publications and events, looking for the Named Entities. He’ll also use existing catalog records or permalinks when they’re available, with the understanding that permalinks won’t actually be hyperlinked in the metadata field (yet). He likes to avoid editing seed metadata when possible, but he will add dates in parentheses to seed Titles when URLs change. This sparked healthy debate when others compared how they’ve been handling such changes in URLs over time.

Skip Kendall shows us Harvard University Archives’ descriptive metadata.

This healthy debate (and others on other topics) moved on to the Havana Room at nearby Cuba Libre over mojitos and nojitos for the rest of the afternoon, but not before Archive-It staff announced the randomly selected winners of swag and offered a round of heartfelt thank-yous:

Thank you to the DCPL for hosting us!

And a special thank you to Laura Farley and Robert LaRose for making it happen!

Thank you to Delizique and Cuba Libre for delicious food and drinks!

Thank you to the presenters for such timely and expert information for us all in 2023!

Thank you to all the attendees who made it there in person!