The Human Rights web archive collection at Columbia University Libraries

March 1st, 2023

by Alex Thurman, Head of Web Collecting, Columbia University Libraries

Columbia University Libraries’ web collecting program began in 2008, with the goal of ensuring that selected freely available web resources remain accessible for current and future researchers. We are building both thematic collections of external web content and collections devoted to Columbia’s own web domain and other affiliated content. We chose Human Rights as the focus of our first and largest thematic web archive collection due to the importance of human rights as a field of study at Columbia University, where we have two academic centers devoted to the subject: the Human Rights Institute at Columbia Law School; and the interdisciplinary Institute for the Study of Human Rights. Additionally, the Libraries’ Center for Human Rights Documentation and Research hosts the physical archives of several notable organizations: Amnesty International USA; Physicians for Human Rights; Human Rights First; and Human Rights Watch.

Collection Scope

We began building our Human Rights web collection by spending a year identifying and classifying relevant content on the web. After testing and subscribing to Archive-It, we started our web collecting program in earnest, at first with 2 grant-funded web curator positions, but since 2013 with just my own Columbia-funded position. Our initial focus was on non-governmental organizations (NGOs) based outside of North America and Europe (as we believed these geographic areas to be under-represented in existing web archiving efforts), with content in all languages in scope. Later we expanded our scope to include NGOs based anywhere, and added new content types, including: National Human Rights Institutions (NHRIs); national sections of Amnesty International; Truth Commissions; Tribunals; Courts; and relevant Blogs by individuals.

In-scope organizations are not limited to those that identify their mission as “human rights” as a whole; they also include groups focused on narrower sub-topics, including but not limited to: democracy; women’s rights; transitional justice; indigenous peoples; torture; genocide; refugees. Our approach is to collect full websites (in all their language versions); we do not include individual news articles, academic reports, social media posts/feeds, or isolated documents as seeds.

New websites have been added steadily over the years, and recently we consulted the Yearbook of International Organizations online database to identify other relevant organizations. This led us to include websites devoted to previously underrepresented subspecialties of human rights work, such as associations working on behalf of seafarers or of sex workers. The most recent thematic cluster of added sites relates to MMIW (missing and murdered indigenous women) in North America.

Two men on board a large boat in the water

Human Rights at Sea website (https://www.humanrightsatsea.org/)

Prominent image is of an older woman embraced

MMIWhoismissing website (https://mmiwhoismissing.org/)

Description and Access

All collection seed websites receive Archive-It metadata for the following fields: Subject (LCSH); Creator (LC/NAF); Description; Language; and Coverage (Country). We also added a custom field “Status on live web” with the value “Unavailable” for seed websites that have become unavailable on the live web at their original seed host domain and for which no newer host domain could be located. This field allows users to click on the “Unavailable” facet link to see the subset (currently 199 seed URLs) of our archived content no longer available outside of web archives, which we hope will underline the value of web archiving.

In addition to Archive-It metadata, each selected website receives a MARC record in CLIO, Columbia’s library catalog; there are currently 866 of these records (and a collection-level record). As the Libraries have previously collected resources in print or other formats from many of the same human rights organizations featured in the web archive, including these MARC records for archived websites helps integrate all their content across our collections for users. In the Archive-It interface, capture, description, and access are organized around the seed URL. But individual websites, understood as conceptual entities, may migrate across multiple host domains over their lifespans, so the MARC record can serve as the integration point linking to a given website’s past and present host domains.

Screen shot of a CLIO catalog record from Columbia University Libraries

CLIO record for Generación Y

Technical

Comparing the number of website-level MARC records for the collection to the number of public seeds in Archive-It, we can make the following assertion: To capture the evolving web presence of 866 organizations/entities has required archiving 1147 seed URLs.
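The relationship between website-level records and seed URLs can be illustrated with a minimal sketch. The `WebsiteRecord` class, the titles, and the example.org-style URLs below are hypothetical placeholders, not the actual CLIO or Archive-It data model; only the figures 866 and 1147 come from the collection itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebsiteRecord:
    """Hypothetical stand-in for one website-level catalog record."""
    title: str
    # Every host domain the website has lived on becomes its own seed URL.
    seed_urls: List[str] = field(default_factory=list)


# Illustrative records only: one site that changed host domain, one that did not.
records = [
    WebsiteRecord("Example NGO A", ["https://ngo-a.example.org/",
                                    "https://ngo-a.example.net/"]),
    WebsiteRecord("Example Blog B", ["https://blog-b.example.com/"]),
]

total_seeds = sum(len(r.seed_urls) for r in records)

# Figures reported for the collection: 866 records, 1147 public seeds.
reported_records = 866
reported_seeds = 1147
extra_seeds = reported_seeds - reported_records
seeds_per_record = reported_seeds / reported_records

print(f"Sample: {len(records)} records, {total_seeds} seeds")
print(f"Collection: {extra_seeds} seeds beyond one per record, "
      f"{seeds_per_record:.2f} seeds per record on average")
```

By this arithmetic, the collection holds 281 more seed URLs than website-level records, or roughly 1.32 seeds per organization/entity.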

Crawls are quarterly. Of the public seeds in Archive-It, 707 remain active and included in quarterly crawls. Due to the volume of seeds and the complexity and depth of some of the large NGO websites included, our established workflow of crawling all the active seeds together in one automated quarterly crawl has had to be modified. Last year the 707 seeds were split into 8 smaller groups for manual re-crawls of fewer than 100 seeds each, each with long durations but careful total data limits. I have recurring quarterly calendar reminders to prompt me to run these manual crawls.
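As a rough illustration of the grouping described above, the sketch below divides a list of 707 seeds into 8 near-equal batches. The even split, the `split_into_groups` helper, and the placeholder seed URLs are assumptions made for the example; the groups actually used for the collection may have been formed differently (for instance by site size or region).

```python
from typing import List


def split_into_groups(seeds: List[str], n_groups: int) -> List[List[str]]:
    """Split seeds into n_groups contiguous batches whose sizes differ by at most one."""
    base, extra = divmod(len(seeds), n_groups)
    groups = []
    start = 0
    for i in range(n_groups):
        # The first `extra` groups take one additional seed each.
        size = base + (1 if i < extra else 0)
        groups.append(seeds[start:start + size])
        start += size
    return groups


# Placeholder seed URLs standing in for the 707 active seeds.
seeds = [f"https://example.org/seed-{i}" for i in range(707)]

groups = split_into_groups(seeds, 8)
sizes = [len(g) for g in groups]
print(sizes)  # [89, 89, 89, 88, 88, 88, 88, 88]
```

Each batch stays under 100 seeds, which matches the constraint mentioned for the manual re-crawls.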

New seeds are added to the collection year-round as they are discovered and receive full Quality Assessment (QA) review for the first captures. Since we have only one web archivist and no dedicated staff help, QA for re-crawls is heavily driven by crawl report analysis. Such analysis helps uncover when seed websites have disappeared from the live web or moved to other host domains. The collection to date represents 19.2 TB of data, with about another 1.5 TB added each year.

Though it has grown ever larger and more unwieldy, Columbia’s Human Rights collection has served over the years as a testing dataset for numerous web archiving visualization and collection management tools (including ODU’s TimeVis tool, used to create the GIF shown below of multiple captures across different host domains of the Cuban blog “Generación Y”). A current Archives Unleashed cohort is working with ARCH data derived from the collection, and I look forward to learning more about their results. We welcome any inquiries about the collection from fellow librarians or researchers.

GIF alternating 6 screen shots from the Generación Y website

Generación Y GIF
