Reading Experience Database text mining project
Cross-posted at my Day of DH 2012 blog.
About the project
I have really been enjoying Dr. Cathy Blake's Text Mining class this semester, in large part because I've been given access to data that really excites me. The kind souls at the Reading Experience Database (or RED, hosted at the UK's Open University) sent me a .csv snapshot of the database from August 2011 for use in my final project. I first came across the Reading Experience Database in Bonnie Mak's History of the Book course in 2010 as I pursued my interest in reading history. RED aims to collect all information about reading experiences in Britain from 1450-1945. (See an example record.) The data is crowd-sourced, and anyone can contribute a reading experience by filling out a detailed webform. A contribution must include the text of the evidence of a reading experience, that is, a reference to someone reading something in a manuscript or published work. The 26,000 records in the RED make a truly incredible resource.
My final class project is still taking shape, but my goal is to perform a sentiment analysis with the records to explore British readers' attitudes toward literature throughout this period in history. What is particularly awesome about RED is that it includes details like the reader's socio-economic group (e.g. "Gentry") and the type of experience (e.g. "aloud, in company"), so I may be able to analyze sentiment within these different subsets of the data. Too cool!
Dr. Blake warned us that pre-processing the data may be the most time-consuming part of our projects, and as I've been playing with the RED data over the weekend, I don't doubt it. Crowd-sourcing information is one of the best things about the internet (ordinary citizens can discover galaxies!), and the many thousands of RED records are only possible because of this technology. For a student text miner, however, crowd-sourced data can be pretty messy, especially when there is no authority control. There's a lot of redundant data, or data that is simply inaccurate:
I'd hoped that a good way to get my feet wet with this project would be to make a quick series of infographics that reflected attributes of the authors and readers in RED, partly for practice but mainly to understand any biases the RED may have, such as having mostly male readers/authors (though comparing this data to accurate historical data is probably outside my project scope). But it will take me some time to groom the data into something more manageable. So far, the most interesting and (reasonably) accurate data I've been able to extract has been a list of the most popular authors listed as read in the RED, as determined by how often a reading experience involves a book attributed to these authors. I shall present it as a clumsy HTML table (how long has it been since I used an HTML table?!).
The 50 most popular authors in the Reading Experience Database:
| # | First name | Last name | # RED records |
|---|---|---|---|
| 4 | George Gordon, Lord | Byron | 267 |
A note about the data
I have been using Oracle SQL Developer to explore the raw data, and I edited an exported CSV in Excel to refine things for this list. I consolidated various name spellings for the top-cited authors. I collapsed the various "n/a" and "anon/Anon./anonymous" etc. attributions into [n/a], and all the "unknown/Unknown/not known" etc. data into unknown. Many of the [n/a] texts are, as you would expect, holy scriptures, but there are also many, many newspapers listed. Unknown may indicate author anonymity, reader uncertainty, or record contributor uncertainty. Note that I have not taken into account any duplicate records in the RED (where 2+ record contributors may have read the same memoir and noted the same historical reader's experience). Note also that this data:
- reflects only the data in the RED, who would contribute to RED (one busy participant entered 8,000 records), and what contributors read
- reflects what historically was published and sold in Britain (e.g. by my count, there are 8 female authors in this set of 50 named authors)
- was not created with authority control
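The clean-up described in this note could also be scripted. Below is a minimal Python sketch, not the process actually used for the table above (that was done by hand in SQL Developer and Excel). The variant lists and the names `normalize_author` and `top_authors` are hypothetical, and the sample attributions are made up for illustration; the real snapshot contains more spellings than these. It only collapses the placeholder attributions and counts records per author; it does not consolidate different spellings of a real author's name.

```python
from collections import Counter

# Hypothetical variant lists (assumption): extend these after inspecting the data.
NA_VARIANTS = {"n/a", "na", "anon", "anonymous"}
UNKNOWN_VARIANTS = {"unknown", "not known"}


def normalize_author(raw):
    """Collapse placeholder attributions into '[n/a]' or 'unknown'."""
    # Ignore surrounding whitespace, leading/trailing periods, and case.
    key = raw.strip().strip(".").lower()
    if key in NA_VARIANTS:
        return "[n/a]"
    if key in UNKNOWN_VARIANTS:
        return "unknown"
    # Anything else is treated as a real author name and left as entered.
    return raw.strip()


def top_authors(attributions, n=50):
    """Count records per normalized author; return the n most common."""
    counts = Counter(normalize_author(a) for a in attributions)
    return counts.most_common(n)


# Made-up sample attributions, one per reading-experience record.
sample = ["Anon.", "anonymous", "n/a", "Unknown", "not known",
          "Byron", "Byron", "anon"]

for author, count in top_authors(sample):
    print(author, count)
```

Because duplicates in the RED are not removed first, counts produced this way would carry the same caveat as the table: they measure how often an author appears in records, not how many distinct reading experiences there were.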
Data is never free of its context. Case in point: Ernest E. Unwin is not actually a widely read author, but he kept the minutes of the "XII Book Club" and is often listed in the RED reading his own work. Case in point II: The entries that cite Samuel Johnson refer to both Dr. Samuel Johnson (of the Dictionary) and the Reverend Doctor Samuel Johnson. Ideally, I'd be able to assign an authorID to each individual author based on the publication title. Realistically, not gonna happen in the next month. I do still want to make some data visualizations though, so stay tuned.
What does this data tell us? Most of these authors are included in various iterations of the Western or British canon, with the exception of Unwin as well as the many unnamed journalists whose newspapers were mentioned. What's interesting to me is the diversity of these top authors' output. The top 10 alone includes playwrights, poets, novelists, even a lexicographer. There's also a lack of diversity, of course: only 8 of 50 are women. I will also confess that my English-major pride was a little hurt that there were some authors I hadn't heard of before, like John Galsworthy and Harriet Martineau. There is always so much more to read! What else do you find interesting about this table?