What I did this summer: Smithsonian Institution Archives internship
I should have been blogging all summer long. But the prospect of redesigning the whole site and starting to write in earnest was so daunting that I just ignored these goals altogether, like any responsible person would. But now summer is coming to an end. I'm back in Champaign, IL, going crazy over the new semester. So I'll write about summer while it's still fresh in my mind!
During the months of June and July, I was the web preservation intern at the Smithsonian Institution Archives, specifically in the Electronic Records Program in the Digital Records Division. It was an honor and a pleasure to work with the good people at the Smithsonian, to be able to draw from their experience in the field and see how a major cultural institution is run — and to really dig into the tools, strategies, and philosophies of web preservation.
In a nutshell, the main objective of my internship was to develop a workflow for preserving the Smithsonian's many websites. I configured Heritrix, an open-source web crawler from the Internet Archive, and Wayback, an open-source, local implementation of the Wayback Machine, for the Archives' needs. The ultimate goal is to take an annual snapshot of all of the Smithsonian sites. Each crawl performed by Heritrix bundled the web content into .warc files, which were then reviewable in Wayback. I ended up doing very focused crawls, only one or two sites per crawl, so that I could be very specific about what was captured (e.g., Flickr content on blogs) and what wasn't (e.g., a thousand off-site PDFs linked from the facilities department's site). As I discovered new solutions and new challenges, I wrote about 20 pages of documentation for the Archives, so that their next intern won't have as steep a learning curve as I did. And of course, more documentation is never a bad thing.
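To give a feel for what those .warc files actually contain: a WARC file is just a sequence of records, each with a block of headers (record type, target URL, content length) followed by the captured payload. The sketch below writes and re-reads one such record using only the Python standard library. It's a deliberately simplified illustration of the WARC 1.0 layout — real Heritrix records carry many more headers (WARC-Date, a record ID, and so on) — and the URL and payload are made-up examples, not actual crawl data.

```python
import io

def write_warc_record(stream, target_uri, payload):
    """Write a single, simplified WARC 1.0 'response' record to a binary stream."""
    body = payload.encode("utf-8")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    stream.write(headers.encode("utf-8"))
    stream.write(body)
    stream.write(b"\r\n\r\n")  # records are separated by two CRLF pairs

def read_warc_records(stream):
    """Return a list of (headers_dict, body_bytes) for each record in the stream."""
    records = []
    data = stream.read()
    pos = 0
    while pos < len(data):
        header_end = data.index(b"\r\n\r\n", pos)
        header_lines = data[pos:header_end].decode("utf-8").split("\r\n")
        # First line is the version ("WARC/1.0"); the rest are name: value pairs.
        headers = dict(line.split(": ", 1) for line in header_lines[1:])
        length = int(headers["Content-Length"])
        body_start = header_end + 4
        records.append((headers, data[body_start:body_start + length]))
        pos = body_start + length + 4  # skip the trailing record separator
    return records

buf = io.BytesIO()
write_warc_record(buf, "https://example.si.edu/page.html", "<html>snapshot</html>")
buf.seek(0)
records = read_warc_records(buf)
print(records[0][0]["WARC-Target-URI"])  # https://example.si.edu/page.html
```

This record-per-capture structure is what makes tools like Wayback possible: the viewer can index each record by its WARC-Target-URI and serve the stored payload back, without ever touching the live site.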
Blog posts on The Bigger Picture
You can read more about what I was doing in a post I wrote for the Archives' excellent blog, The Bigger Picture. Link: "Saving the Smithsonian's Web." The post was picked up by RAINbyte, a daily bundle of stories of Records and Archives in the News, as well as being mentioned on the Society of American Archivists (SAA) listserv. During the research phase at the beginning of my internship, I couldn't find much out there on how other institutions were preserving their web content, which means they're either not preserving it (yikes) or they're just not writing about it. I'm hoping this post on The Bigger Picture will remedy the latter a little, since it's through collaborating and sharing experiences that we find the best practices.
I also had the chance to write another blog post, "Five Tips for Designing Preservable Websites," wherein I analyzed the page captures to see what kinds of web design practices were good for preservation purposes (e.g., maintaining stable URLs) and not so good (e.g., only allowing searching of online collections, not browsing). This post was linked to by the Library of Congress' digital preservation blog, The Signal. It was also picked up by the Smithsonian's main Twitter account (@smithsonian) and retweeted over 100 times! (Tiny, yes, but I'm letting it go to my head.) Preservation isn't something web developers think about that much, since the nature of the Internet is marked by change and transience. But it's something librarians, archivists, and curators are worrying about all the time. The preservation process of a digital object starts from the beginning of the object's life. Curators know they need to be talking to digital humanities scholars and gatherers of scientific data as data management plans are drafted. But in my experience, there isn't as much focus on talking to web developers, perhaps because they're not necessarily working at universities or because most web projects aren't seen as anything but cultural chaff. It ain't right!
Why preserve web content?
To close, I'll mention that whenever I told people what I was doing at the Smithsonian internship, there was a 40% chance they'd ask (perhaps after a beer), "But isn't that what the Internet Archive is for?" i.e., why are you doing what's already being done by someone else? It's true that the Internet Archive does capture a lot of the Smithsonian's web pages. But their crawls are sporadic and not as deep as the Smithsonian archivists would like. The Digital Services Division of the Archives is also in contact with web developers all over the Institution, so they can prioritize the preservation of websites that will soon be taken down or redesigned. And more importantly, the Internet Archive's crawls are something the public can view on its online Wayback Machine — but not something the Smithsonian Institution Archives can accession. Performing web preservation locally ensures that the Archives has complete physical, intellectual, and administrative control over what gets saved.