Gephi + MALLET + EMDA

July 03, 2013

As I prepare to attend Early Modern Digital Agendas next week, I've been exploring a few tools that have been on my to-try list for a while -- things that have come up at many a DH-related event over the years.

Gephi

I'm embarrassed to say that I haven't really done any data visualization myself beyond the occasional wordle. Reading a handful of data viz blogs or Tufte, for example, has therefore been an act of imagination rather than practicality. But today I tinkered with Gephi to get a visual glimpse of where EMDA participants' interests lay. Here's one of the better visualizations I made (words are stemmed):

To do this, I dumped all of our application essays into one big .txt file, stripping out essay titles and name/page number headers. Then I processed the text using Python and NLTK to make a Gephi-friendly XML file, following the algorithm and file format described in the article "Identifying the Pathways for Meaning Circulation using Text Network Analysis." You can see my script on GitHub. (Don't make too much fun of my novice code.)

This Python script spits out each stemmed non-stopword as a node, and counts word-pairs as edges. That is, an edge occurs whenever one word occurs within 4 words of another word. The edge weight increases with the frequency of the word-pair. So "digit human" (the stems of "digital humanities") is a strong word pair because we mention digital humanities quite a lot.
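For anyone who wants to see the idea in miniature, here is a rough sketch of the pair-counting step, not the actual script. It assumes the tokens are already stemmed and stopword-free, treats pairs as unordered, and guesses that the "Gephi-friendly XML" is GEXF 1.2. The function names `cooccurrence_edges` and `to_gexf` are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def cooccurrence_edges(tokens, window=4):
    """Count unordered word pairs whose positions are at most `window` apart."""
    edges = Counter()
    for i, left in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:  # skip self-loops
                edges[tuple(sorted((left, right)))] += 1
    return edges


def to_gexf(edges):
    """Serialize weighted edges as a GEXF 1.2 string (the format is an assumption)."""
    words = sorted({word for pair in edges for word in pair})
    ids = {word: str(n) for n, word in enumerate(words)}
    root = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(root, "graph", mode="static", defaultedgetype="undirected")
    nodes_el = ET.SubElement(graph, "nodes")
    for word in words:
        ET.SubElement(nodes_el, "node", id=ids[word], label=word)
    edges_el = ET.SubElement(graph, "edges")
    for n, ((a, b), weight) in enumerate(sorted(edges.items())):
        ET.SubElement(edges_el, "edge", id=str(n), source=ids[a],
                      target=ids[b], weight=str(weight))
    return ET.tostring(root, encoding="unicode")


# Toy input standing in for the stemmed, stopword-free essay text.
tokens = "digit human tool digit human project".split()
edges = cooccurrence_edges(tokens)
gexf = to_gexf(edges)
```

On this toy input, the pair `("digit", "human")` ends up with the heaviest edge, which is the same effect described above. Whether the window should be measured before or after stopwords are removed is a design choice this sketch does not settle.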

My script's output gave me 2,900 nodes and 11,000 edges. I filtered out nodes with a degree below 17 so we'd only be looking at the top 175 nodes. Then I used the modularity algorithm, which detects 'communities' (almost like topics?). With a modularity resolution of 2.0, I narrowed it down to 10 communities, which are indicated by color in the visualization above. They're sort of clustered. I'm not really sure if this is a good visualization. It seems like it is, but I'm not experienced enough to critique knowledgeably.
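The degree filter was done inside Gephi, but the same idea can be sketched in plain Python. This is only an approximation under some assumptions: "degree" means the number of distinct neighbors, the filter is a single pass over the full graph, and the edges are stored as a dict keyed by word pairs. The helper name `filter_by_degree` is hypothetical.

```python
from collections import defaultdict


def filter_by_degree(edges, min_degree=17):
    """Keep only edges whose two endpoints both have degree >= min_degree."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Degree is measured once, on the unfiltered graph.
    keep = {node for node, adj in neighbors.items() if len(adj) >= min_degree}
    return {pair: w for pair, w in edges.items()
            if pair[0] in keep and pair[1] in keep}


# Toy graph: "d" has a single neighbor, so a threshold of 2 drops it.
toy = {("a", "b"): 1, ("a", "c"): 2, ("b", "c"): 3, ("c", "d"): 1}
filtered = filter_by_degree(toy, min_degree=2)
```

Community detection is left out of the sketch on purpose; that part stays in Gephi.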

And what does it look like if the visualization considers all 2,900 nodes? Here's one look:

2,900 nodes in 39 communities, not really clustered at all, no labels, data party!

Circle layout, ordered by community. Crisscrossing lines show relationships between word communities.

MALLET

I also tried out topic modeling using MALLET on the same essay dump. Here are the topics when the model is limited to 5:

network seminar social historical milton reading scholarship field make approach sdfb form networks terms community long society fact benefit

digital humanities work early projects institute research english teaching scholarly editions current library working students future experience part scholarship

digital research shakespeare institute studies university tools methods dh hope language graduate develop study analysis based corpus large focus

early modern project texts agendas scholars resources eebo questions folger period works books online tcp information bring existing

data digital literary media history book time text database century archives order learn political share cultural narrative eager press

And limited to 10:

early modern scholars network social texts agendas words scholarship sdfb approach inquiry criticism mining world chapter ontologies actors persons

texts ways eebo questions folger period tcp bring existing online understand books reading corpus neh work present develop understanding

digital humanities modern university tools english teaching study part projects scholarship development practice future professional provide past technology developing

data work digital opportunity seminar database archives discussions literary narrative text eager press conversations relationships relationship archive interface interested

shakespeare research work dh hope language methods graduate analysis agendas based application plan training university literature approaches writing benefit

media milton interest field historical society theory means performance larger arts prose write reflect professor team college readings basic

projects resources library students experience faculty collections london make explore information place john practical curation center important end moeml

history early book time project technologies build share paper agendas experiences scale poems writers space ocr thinking courses form

early modern institute research project studies scholarly editions working current summer knowledge edition large collaborative textual electronic renaissance participation

literary century order political text methods historical works topic natural learn great scientific computer public complex discuss eighteenth applying
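Topic lists like the two above come from MALLET's command-line tools. Here is a hedged sketch of driving them from Python. It assumes the essays sit as separate files in a directory (the post only mentions one big dump, so this is a guess), and the file names and the helper `mallet_commands` are invented for illustration. The subcommands and flags themselves (`import-dir`, `train-topics`, `--num-topics`, `--output-topic-keys`) are standard MALLET options.

```python
import subprocess


def mallet_commands(input_dir, num_topics, mallet_bin="bin/mallet"):
    """Build the import and training command lines; nothing is executed here."""
    corpus = "essays.mallet"  # hypothetical intermediate file name
    import_cmd = [mallet_bin, "import-dir",
                  "--input", input_dir,
                  "--output", corpus,
                  "--keep-sequence", "--remove-stopwords"]
    train_cmd = [mallet_bin, "train-topics",
                 "--input", corpus,
                 "--num-topics", str(num_topics),
                 "--output-topic-keys", f"topic_keys_{num_topics}.txt"]
    return [import_cmd, train_cmd]


def run(commands):
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Build (but do not run) the commands for the 5-topic and 10-topic models.
five = mallet_commands("essays", 5)
ten = mallet_commands("essays", 10)
```

Calling `run(five)` would need a local MALLET install, with `mallet_bin` pointing at its launcher script.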

Well, this was rather fun. And all this from a relatively small text. Methinks my MacBook Air would explode if I cranked whole corpora through these exercises.