Gephi + MALLET + EMDA
As I prepare to attend Early Modern Digital Agendas next week, I've been exploring a few tools that have been on my to-try list for a while -- things that have come up at many a DH-related event over the years.
Gephi
I'm embarrassed to say that I haven't really done any data visualization myself beyond the occasional wordle. Reading a handful of data viz blogs or Tufte, for example, has therefore been an act of imagination rather than practicality. But today I tinkered with Gephi to get a visual glimpse of where EMDA participants' interests lay. Here's one of the better visualizations I made (words are stemmed):
To do this, I dumped all of our application essays into one big .txt file, stripping out essay titles and name/page number headers. Then I processed the text using Python and NLTK to make a Gephi-friendly XML file, following the algorithm and file format described in the article "Identifying the Pathways for Meaning Circulation using Text Network Analysis." You can see my script on GitHub. (Don't make too much fun of my novice code.)
This Python script spits out each stemmed non-stopword as a node, and counts word-pairs as edges. That is, an edge occurs whenever one word occurs within 4 words of another word. The edge weight increases with the frequency of the word-pair. So "digit human" is a strong word pair because we mention digital humanities quite a lot.
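For the curious, here is a minimal sketch of that edge-counting idea. It is not my actual script: it uses only the standard library, the tiny stopword list is a made-up stand-in for NLTK's, the stemming step is left as an optional `stem` function, and I'm assuming the 4-word window is measured after stopwords are dropped.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption; a stand-in for NLTK's list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "we", "is", "for"}


def cooccurrence_edges(tokens, window=4, stopwords=STOPWORDS, stem=None):
    """Count undirected word-pair edges between words at most `window` apart.

    `stem` is an optional function applied to each kept word; by default
    words are left unstemmed.
    """
    stem = stem or (lambda w: w)
    # Drop stopwords first, then normalise what is left.
    words = [stem(t.lower()) for t in tokens if t.lower() not in stopwords]
    edges = Counter()
    for i, word in enumerate(words):
        # Pair each word with the next `window` words that follow it.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:  # ignore self-pairs
                edges[tuple(sorted((word, other)))] += 1
    return edges


text = "digital humanities tools for digital humanities"
edges = cooccurrence_edges(text.split())
print(edges.most_common(1))  # [(('digital', 'humanities'), 4)]
```

Each key in `edges` is a node pair and each count is the edge weight, so writing the result out in whatever format Gephi should import is a separate step.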
My script's output gave me 2,900 nodes and 11,000 edges. I filtered out nodes with a degree below 17 so we'd only be looking at the top 175 nodes. Then I used the modularity algorithm, which detects 'communities' (almost like topics?). With a modularity resolution of 2.0, I narrowed it down to 10 communities, which are indicated by color in the visualization above. They're sort of clustered. I'm not really sure if this is a good visualization. It seems like it is, but I'm not experienced enough to critique knowledgeably.
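I did the degree filtering inside Gephi's interface, not in code, but the idea is simple enough to sketch in Python. This is only an illustration under my own assumptions: edges are stored as a dictionary from word pairs to weights, a node's degree is its number of distinct neighbours, and the filter is applied once rather than repeatedly. The function name and the toy data are made up.

```python
from collections import defaultdict


def filter_by_degree(edges, min_degree):
    """Keep only edges whose two endpoints each have >= min_degree neighbours."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Degrees are computed once on the full graph, then low-degree nodes go.
    keep = {node for node, adj in neighbours.items() if len(adj) >= min_degree}
    return {
        pair: weight
        for pair, weight in edges.items()
        if pair[0] in keep and pair[1] in keep
    }


# Toy graph: "d" has a single neighbour, so it is dropped at min_degree=2.
toy_edges = {("a", "b"): 1, ("a", "c"): 2, ("a", "d"): 1, ("b", "c"): 1}
filtered = filter_by_degree(toy_edges, min_degree=2)
print(sorted(filtered))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

With a threshold of 17 instead of 2, this is roughly the cut that took my graph down to its top 175 nodes.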
And what does it look like if the visualization considers all 2,900 nodes? Here's one look:
2,900 nodes in 39 communities, not really clustered at all, no labels, data party!
Circle layout, ordered by community. Crisscrossing lines show relationships between word communities.
MALLET
I also tried out topic modeling using MALLET on the same essay dump. Here are the topics when the model is limited to 5:
network seminar social historical milton reading scholarship field make approach sdfb form networks terms community long society fact benefit |
digital humanities work early projects institute research english teaching scholarly editions current library working students future experience part scholarship |
digital research shakespeare institute studies university tools methods dh hope language graduate develop study analysis based corpus large focus |
early modern project texts agendas scholars resources eebo questions folger ways period works books online tcp information bring existing |
data digital literary media history book time text database century archives order learn political share cultural narrative eager press |
And limited to 10:
early modern scholars network social texts agendas words scholarship sdfb approach inquiry criticism mining world chapter ontologies actors persons |
texts ways eebo questions folger period tcp bring existing online understand books reading corpus neh work present develop understanding |
digital humanities modern university tools english teaching study part projects scholarship development practice future professional provide past technology developing |
data work digital opportunity seminar database archives discussions literary narrative text eager press conversations relationships relationship archive interface interested |
shakespeare research work dh hope language methods graduate analysis agendas based application plan training university literature approaches writing benefit |
media milton interest field historical society theory means performance larger arts prose write reflect professor team college readings basic |
projects resources library students experience faculty collections london make explore information place john practical curation center important end moeml |
history early book time project technologies build share paper agendas experiences scale poems writers space ocr thinking courses form |
early modern institute research project studies scholarly editions working current summer knowledge edition large collaborative textual electronic renaissance participation |
literary century order political text methods historical works topic natural learn great scientific computer public complex discuss eighteenth applying |
Well, this was rather fun. And all this from a relatively small text. Methinks my MacBook Air would explode if I cranked whole corpora through these exercises.