Reading Experience Database text mining project

March 27, 2012

Cross-posted at my Day of DH 2012 blog.

About the project

I have really been enjoying Dr. Cathy Blake's Text Mining class this semester, in large part because I've been given access to data that really excites me. The kind souls at the Reading Experience Database (or RED, hosted at the UK's Open University) sent me a .csv snapshot of the database from August 2011, for use in my final project. I first came across the Reading Experience Database in Bonnie Mak's History of the Book course in 2010 as I pursued my interest in reading history. RED aims to collect all information about reading experiences in Britain from 1450-1945. (See an example record.) The data is crowd-sourced, and anyone can contribute a reading experience by filling out a detailed webform. A contribution must include the text of the evidence of a reading experience: that is, a reference to someone reading something in a manuscript or published work. The 26,000 records in the RED make a truly incredible resource.

My final class project is still taking shape, but my goal is to perform a sentiment analysis with the records to explore British readers' attitudes toward literature throughout this period in history. What is particularly awesome about RED is that it includes details like the reader's socio-economic group (e.g. "Gentry") and the type of experience (e.g. "aloud, in company"), so I may be able to analyze sentiment within these different subsets of the data. Too cool!

Dr. Blake warned us that pre-processing the data may be the most time-consuming part of our projects, and as I've been playing with the RED data over the weekend, I don't doubt it. Crowd-sourcing information is one of the best things about the internet (ordinary citizens can discover galaxies!), and the many thousands of RED records are only possible because of this technology. For a student text miner, however, crowd-sourced data can be pretty messy, especially when there is no authority control. There's a lot of redundant data, or data that is simply inaccurate.

I'd hoped that a good way to get my feet wet with this project would be to make a quick series of infographics that reflected attributes of the authors and readers in RED, partly for practice but mainly to understand any biases the RED may have, such as having mostly male readers/authors (though comparing this data to accurate historical data is probably outside my project scope). But it will take me some time to groom the data to something more manageable. So far, the most interesting and (reasonably) accurate data I've been able to extract has been a list of the most popular authors listed as read in the RED, as determined by how often a reading experience involves a book attributed to these authors. I shall present it as a clumsy HTML table (how long has it been since I used an HTML table?!).

The 50 most popular authors in the Reading Experience Database:

#	First name	Last name	# RED records
0		[n/a]	1956
0		[unknown]	1893
1	William	Shakespeare	513
2	Walter	Scott	414
3	Jane	Austen	272
4	George Gordon, Lord	Byron	267
5	Charles	Dickens	222
6	Alfred	Tennyson	217
7	John	Milton	208
8	William	Wordsworth	160
9	Samuel	Johnson	145
10	H. G.	Wells	143
11	Samuel	Richardson	127
12	--	Homer	123
13	--	Plato	120
14	William Makepeace	Thackeray	119
15	Robert	Browning	112
16	Alexander	Pope	105
17	John	Galsworthy	102
18	Charlotte	Bronte	98
19	Thomas	Carlyle	96
20	Percy Bysshe	Shelley	96
21	Robert	Southey	91
22	John	Ruskin	89
23	Victor	Alexander	88
24	Thomas	Moore	87
25	--	Virgil	87
26	John	Keats	86
27	--	Voltaire	86
28	Margaret	Dilks	85
29	George	Eliot	85
30	Robert Louis	Stevenson	84
31	Daniel	Defoe	81
32	Ernest E.	Unwin	80
33	William	Godwin	79
34	Maria	Edgeworth	77
35	Edward	Gibbon	76
36	Jean Jacques	Rousseau	74
37	George Bernard	Shaw	74
38	Dante	Alighieri	73
39	Samuel Taylor	Coleridge	72
40	Edmund	Spenser	71
41	Jonathan	Swift	70
42	Thomas	Hardy	69
43	James	Boswell	68
44	George	Meredith	68
45	Oliver	Goldsmith	67
46	Harriet	Martineau	67
47	Elizabeth	Gaskell	65
48	Henry	James	64
49	Arnold	Bennett	62
50	Frances	Burney	61

---

A note about the data

I have been using Oracle SQL Developer to explore the raw data, and I edited an exported CSV in Excel to refine things for this list. I consolidated various name spellings for the top-cited authors. I collapsed the various "n/a" and "anon/Anon./anonymous" etc. attributions into [n/a], and all the "unknown/Unknown/not known" etc. data into [unknown]. Many of the [n/a] texts are, as you would expect, holy scriptures, but there are also many, many newspapers listed as well. [unknown] may indicate author anonymity, reader uncertainty, or record contributor uncertainty. Note that I have not taken into account any duplicate records in the RED (where 2+ record contributors may have read the same memoir and noted the same historical reader's experience). Note also that this data:

reflects only the data in the RED, who would contribute to RED (one busy participant entered 8,000 records), and what contributors read
reflects what historically was published and sold in Britain (e.g. by my count, there are 8 female authors in this set of 50 named authors)
was not created with authority control
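
For anyone who would rather script the cleanup than do it by hand in Excel, here is a minimal Python sketch of the consolidation and counting described above. It is not what I actually ran. The variant lists, the function names, and the sample rows are all assumptions made up for illustration; the real snapshot contains other spellings and would need its own lists.

```python
from collections import Counter

# Hypothetical variant spellings; the real RED snapshot will contain others.
NA_VARIANTS = {"n/a", "na", "anon", "anon.", "anonymous"}
UNKNOWN_VARIANTS = {"unknown", "not known"}


def normalize_author(first_name, last_name):
    """Return a (first, last) pair with placeholder attributions collapsed."""
    first = (first_name or "").strip()
    last = (last_name or "").strip()
    key = last.lower()
    if key in NA_VARIANTS:
        return ("", "[n/a]")
    if key in UNKNOWN_VARIANTS:
        return ("", "[unknown]")
    return (first, last)


def count_authors(records):
    """Count records per normalized author; records are (first, last) pairs."""
    return Counter(normalize_author(first, last) for first, last in records)


# Made-up rows for illustration only, not taken from RED.
sample = [
    ("William", "Shakespeare"),
    ("", "Anon."),
    ("", "anonymous"),
    ("", "not known"),
    ("William ", "Shakespeare"),
]

# Print authors from most to least frequently cited.
for (first, last), n in count_authors(sample).most_common():
    print(first, last, n)
```

This only collapses placeholder attributions and stray whitespace. Consolidating genuine spelling variants of a named author would still need a hand-built mapping, and duplicate records are not handled at all.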

Data is never free of its context. Case in point: Ernest E. Unwin is not actually a widely-read author, but he kept the minutes of the "XII Book Club" and is often listed in the RED reading his own work. Case in point II: The entries that cite Samuel Johnson refer to both Dr. Samuel Johnson (of the Dictionary) and the Reverend Doctor Samuel Johnson. Ideally, I'd be able to assign an authorID to each individual author based on the publication title. Realistically, not gonna happen in the next month. I do still want to make some data visualizations though, so stay tuned.
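
To make the authorID idea a little more concrete, here is a rough Python sketch of a title-based lookup. Everything in it is hypothetical: the ID strings, the `assign_author_id` helper, and the second title are placeholders I invented, and the lookup table would have to be built by hand from the actual records.

```python
# Hypothetical lookup: (last name, first name, title) -> authorID.
# The IDs and the second title are placeholders, not values from RED.
TITLE_TO_AUTHOR_ID = {
    ("johnson", "samuel", "a dictionary of the english language"): "johnson-samuel-1",
    ("johnson", "samuel", "placeholder sermon title"): "johnson-samuel-2",
}


def assign_author_id(first_name, last_name, title):
    """Return an authorID when the title disambiguates the name, else None."""
    key = (
        last_name.strip().lower(),
        first_name.strip().lower(),
        " ".join(title.lower().split()),  # lowercase and collapse whitespace
    )
    return TITLE_TO_AUTHOR_ID.get(key)
```

Any record that comes back as `None` would still need a human to look at it, which is exactly why this is more than a month's work.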

---

What does this data tell us? Most of these authors are included in various iterations of the Western or British canon, with the exception of Unwin as well as the many unnamed journalists whose newspapers were mentioned. What's interesting to me is the diversity of these top authors' output. The top 10 alone includes playwrights, poets, novelists, even a lexicographer. There's also a lack of diversity, of course: only 8 of 50 are women. I will also confess that my English-major pride was a little hurt that there were some authors I hadn't heard of before, like John Galsworthy and Harriet Martineau. There is always so much more to read! What else do you find interesting about this table?