Feature Spotlight: Internal Seed Groups

April 12th, 2023

by Tanya Ulmer, Web Archivist for Archive-It

Seed groups have been around for a while, allowing partners to organize their seeds on their public collections’ pages at archive-it.org. But did you know that seed groups are no longer necessarily displayed publicly? Starting with the official release of the Seed Sharing feature in June 2022, seed groups can be visible internally only. In fact, any seed group created after September 12, 2022 will not be visible on archive-it.org unless you choose to make it publicly visible.

Screenshot of an internal seed group with the visibility checkbox unchecked and circled.

Click that checkbox to make the seed group visible on archive-it.org.

Why would a partner organization want to use internal seed groups? These are just a few of the efficiencies we could think of:

Screenshot of a seed list with an arrow pointing to a group and a check over a group's checkbox.

Select the group for those actions by clicking on the gear then the checkbox, or click directly on the group name to see the group’s page (above), where you can make the group publicly visible.

But naturally, partners found their own ways to organize their workflows using these groups. Between June 2022 and March 2023, 91 Archive-It partners created 779 internal seed groups. And some of the groups’ names surprised us! As of March 2023, the top names given to these internal seed groups were:

  • ‘Fix’ (21)
  • ‘7’ (7)
  • A three-way tie: ‘Twitter’ (5), ‘QA seeds’ (5), ‘needs metadata’ (5)

And of course there were many unique group names, meaningful perhaps only to those at the organizations. We’re very fortunate to have four partners share with us how they’ve used these groups to their organization’s advantage.

Margaret Herrick Library of the Academy of Motion Picture Arts and Sciences

Caroline Jorgenson tells us that her team used these groups to track weekly releases of qualifying films:

The web archiving that we do at Margaret Herrick focuses on content related to films that qualify for the Academy Awards, so we do a weekly crawl of websites related to the previous weekend’s Oscar-qualifying theatrical film releases. The ability to subdivide our film website collection into weekly crawl groups by release date is INVALUABLE throughout our workflow, enabling us to differentiate between duplicate film titles and quickly locate archived web content for a given film based on its release week. We can then embed each Wayback link in our public-facing database alongside our list of physical holdings relating to that film production.

Bensenville Community Public Library

Emily Mamrak explains how these groups have helped her track similar-looking URLs in the library’s collections:

I found the Internal Seed Groups feature to be more useful in some collections and less applicable in others. In smaller collections or collections with less distinct groups of seeds, I was less likely to use the Internal Seed Groups feature. But in larger collections with groupable seeds, the Internal Seed Groups feature was very helpful. The first time I used the Internal Seed Groups feature was to organize seeds that were different parts of the same website. Crawling a whole website can be useful but in certain cases it can also take up a good amount of storage. Crawling the whole website once and then creating separate seeds to crawl individual web pages regularly was how I kept up with pages that changed frequently without having to recrawl the whole website. Unfortunately, this filled my seed list with a lot of similar-looking seeds. In order to more quickly find the seed I wanted, I used the Internal Seed Groups to label the seeds. For example, in the Village of Bensenville collection, each seed is labeled either Library, Fire, Village, or Parks. In the Businesses collection, each seed is labeled either Service, Store, Restaurant, or Industrial. Being able to deal with a smaller list of seeds saved time and made larger collections less complicated. For the Businesses collection specifically, the Internal Seed Groups were also useful for me to easily see that I had a good sampling of each type of business present in the collection. Overall, I am happy to have this feature available to use and am looking forward to finding new ways to apply this feature in the future.

The Senator John Heinz History Center

Sierra Green and Carly Lough share that these groups have helped them connect URLs with their creators and better reflect their other systems:

We noted how mutual aid groups, pharmacies, non-profits, and other organizations were utilizing free, web-based tools such as Calendly, the Google Workspace Suite, and GoFundMe to nimbly gather information, assess needs, distribute resources, engage their constituencies, and evolve their missions during the pandemic. Wrangling these URLs proved challenging, given that they almost never included words or phrases that could be traced back to their creator. As a result, we began to use internal seed groups to track these seeds within Archive-It and tie them back to their creator. In this manner, we could efficiently use the collection search function within Archive-It to instantaneously call all seeds tied to a given creator in a way that mimicked our internal content management system for web archiving that is independent of Archive-It. We also noted that these groups could serve as a building block for seed-level metadata.

More recently, we’ve been considering the use of internal seed groups to help us manage one-off or smaller scale crawls in a separate collection that incorporates seeds relating to a wide variety of subjects and collecting initiatives. In this collection, we plan to prototype the internal seed groups as a content management tool and preliminary metadata on two fronts. Firstly, consistent with its use within our Western Pennsylvania COVID-19 Response Collection, we intend to use internal seed groups to tie URLs back to their creator. Additionally, we also intend to use this designation to connect seeds to categories of special collections long established within our broader library and archival holdings. These special collections include, but are not limited to, ethnic groups, industries, historically marginalized communities, and conflicts. Discovering that seeds could be associated with two collections sparked our use of internal seed groups in this way.

Library and Archives Canada

Russell White describes how these groups have added further structure to some larger collections and helped the team there:

At Library and Archives Canada, we use internal seed groups in a variety of ways, and find them useful for adding structure to projects. Since we joined Archive-It as a partner in 2013, nine staff members have contributed over 25,000 seeds to our account across 77 collections. Our scale is large, and seed groups help keep us organized.

Within collections, we typically assign seed groups to seeds that share themes or technical characteristics. For example, our COVID-19 collection seeds are grouped by “COVID Protests,” “News,” “Arts and Culture,” and so on, which provides a quick means for staff to focus on their seed assignments for QA or crawling purposes, or to change seed settings in bulk. Groups like “RSS feeds” or “patch seeds” provide context to technical decisions, and help guide QA effort.

We also use seed groups to create informal subcollections across Archive-It collections, which we track separately in an external database. For example, the seed group “Ukraine” spans our News Media, Government of Canada, and Events collections, acting much like a hashtag to bind together thematically similar content. Sometimes this approach makes more sense than creating a new, dedicated collection.

One feature we would like to see going forward would be the ability to assign seeds to groups upfront when we add them to collections, as opposed to assigning groups after the fact (which can be time consuming for large batches). We submitted this as a Feature Request recently, and would be interested to hear what other partners think about this idea.

Do you have large collections that could use some added structure? Collections that could better reflect another system? Collections with many similar-looking URLs? Collections that need to be tracked on a schedule? Collections that have seeds with similar technical difficulties? Try out Internal Seed Groups to help organize those collections for your team!