
DH Curation Guide presentations

Project overview

Slide deck (PDF)
August 2012
Various locations

In 2010-12, I was involved with the DH Curation Guide, a wonderful community resource guide to data curation in the digital humanities.

Text modeling

Slide from the presentation deck

Slide deck (PDF)
September 27, 2011
Graduate School of Library and Information Science, University of Illinois Urbana-Champaign

I spoke about our text modeling choices and our publishing workflow in my Electronic Publishing class, taught by the inimitable Julia Flanders.

Notes & script for Text Modeling presentation:

Outline of presentation

My involvement as UI designer & co-editor
Intro to UI of DHCuration
Sidenote review of HTML and CSS
XSLT: what is it for? How do I write it? Close look at code (+ application sharing?)
Why push article content from XML to XSLT to HTML? What flexibilities?
Modeling choices like FAQ and Glossary as straight-up HTML

[screenshot of DHCuration main page]

Intro to my involvement
I'm currently a research assistant within DCEP-H, where one of the projects is this guide to data curation in the digital humanities. I am a co-editor with Trevor and Julia, and right now my main priorities are developing the user interface and planning strategies for the launch and the site's sustainability.

I became involved last year as an hourly, purely as a UI designer because I've worked a lot in the past with HTML and CSS as well as graphic design tools like the Adobe Creative Suite. In undergrad, I was a web designer for grant-funded humanities projects, so I had a background with academic user interfaces. But this project, the Guide, was more complex than what I'd worked with before.

Intro to UI
The site will contain around a dozen articles that are half general intro to a topic, and half portal to other resources. These pages are encoded using a version of the TEI that has been customized by Trevor and Julia. They're the ones who have masterminded the XML.

[screenshot of article]

Here's what the user interface for the articles in the project currently looks like. It's not quite polished and it's a little bare-bones, and you can blame my love of minimalism for that.

I'll talk a tiny bit about interface decisions, specifically the commenting aspect. I liked the functionality of that CommentPress plugin we linked to earlier, but I'll be frank, I thought the design was so ugly. And the comments were hidden until you clicked to expand them. So for inspiration, I thought about what a document would be like if I gave it to my friends and asked them to make comments. I have a few friends who are really into office supplies, so I imagined a bunch of Post-It notes all over the page. Eye-catching, and more importantly, the content (or most of the content) of the comment would be visible without having to do anything. The commenting functionality has a way to go, and we just hired a JavaScript programmer to make all of my dreams come true, but this is part of the reason why we decided to make our interface and publishing system from scratch.

I should also mention that we're using off-the-shelf commenting software called Disqus, and I'll talk a little bit about how we're using that later.

So, if you scroll around the page, you might be able to see how the article has been split up: into sections, paragraphs, resources, and groups of resources. All of this is described by the XML. So are things like glossary terms, article authors, and the titles of sections. Those are all elements in the TEI-based XML schema that Trevor and Julia wrote.

But this document is in HTML. And all the pretty stuff, like the slight shadow behind section titles and the yellow boxes of the comments, that's all in CSS. So how do we get from this, the XML, to this, the pretty page? [XML, arrow, HTML] Well, I use XSLT. But first, let me take a sidenote and talk briefly about HTML and CSS.

I'm sure many of us are pretty familiar with HTML and CSS, but just to give you a quick background, CSS (Cascading Style Sheets) provides a way to give your HTML (HyperText Markup Language) documents a lot of style in their presentation. [HTML slide] HTML says, "This content here is a paragraph and shall be a block of text", and [CSS slide] CSS is a layer on top of that that says, "Yes, well, give it a drop shadow, make all the text all caps, and put it in the middle of the page, for some reason." And this is how most web pages, nice-looking and not-so-nice-looking, are designed now, with some help from other web tools like JavaScript and, God forbid, Flash. The other nice thing about CSS is that you can use one stylesheet for lots of different HTML pages, which is another example of single-source publishing.

XSLT
So, XSLT. I began teaching myself XSLT with help from Kevin Trainor. - Jeni Tennison book -

XSLT stands for Extensible Stylesheet Language Transformations. It's an XML-based language. Its main use is to take XML documents and rearrange their contents for a different output.

[diagram of xml-xslt-html]

XSLT can process XML documents to make other XML documents, or to make PDFs, or in our case, to take the character data from the XML-encoded articles and turn that into HTML and CSS. It's just one long document full of rules to rearrange XML content. And all we need is one document to transform many documents in a uniform way.

It can transform documents in a very automated way, as in on the fly — you upload your XML documents and it's immediately spit out online into HTML. But we have a more small-scale model, with static documents. So I, by hand, take every XML document we have and run it through the XSLT processor and upload the resulting HTML file to the site.

[3 big screenshots]

So, I'm doing everything in oXygen — checking on the XML, but also writing the XSLT, HTML, and CSS. oXygen has built-in processors that are a kind of behind-the-curtain machinery for getting the transformation done.

[closeup highlights]

Here are some example closeups of sections of code from each language that deal with the same content: the author's name and affiliation and the article title. I've color-coded it to make it easier to see. Each snippet of text is described in the XML: this is the family name, this is the given name. Then the XSLT document says, for any XML that comes in here, any values marked family name go here or there. They're values selected by the path, that is, by where in the XML tree the values are. Once we press the button and hit 'transform', it spits out an HTML document with these values in the places the XSLT defined. You can see it's in a different order than in the XML. And finally, this is how it looks in your browser.
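
The slides with the actual code are not reproduced here, so what follows is only a minimal sketch of the idea, not the Guide's real markup or stylesheet. The element names (article, title, author, surname, forename, affiliation) and the sample values are placeholders chosen for illustration. The Guide's customized TEI will differ, and real TEI would also carry a namespace. A hypothetical input might look like this:

```xml
<!-- Hypothetical input: element names and values are illustrative only -->
<article>
  <title>Sample Article Title</title>
  <author>
    <surname>Doe</surname>
    <forename>Jane</forename>
    <affiliation>Example University</affiliation>
  </author>
</article>
```

A stylesheet along these lines would select each value by its path in the XML tree and place it in the HTML, here putting the given name before the family name even though the XML lists them the other way around:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>

  <!-- One rule for the whole (hypothetical) article element -->
  <xsl:template match="/article">
    <html>
      <head>
        <title><xsl:value-of select="title"/></title>
      </head>
      <body>
        <!-- The article title becomes the page heading -->
        <h1><xsl:value-of select="title"/></h1>
        <p class="byline">
          <!-- Values are picked by path, so output order is up to us -->
          <xsl:value-of select="author/forename"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="author/surname"/>
          <xsl:text>, </xsl:text>
          <xsl:value-of select="author/affiliation"/>
        </p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

The class name byline is also an assumption. It is there only to suggest where a CSS rule could hook in to style the result.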

[marked up Intro screenshot]

XSLT can do a lot of different things though. It's really a programming language that allows us to control the XML content very tightly. We're also using it to give each paragraph a unique identifier, which is then used to attach a comment thread to that paragraph. We want every element of the page to be commentable: the whole article, each section, each paragraph. Which means that we need about 40 comment threads to load on the same page, each for the right object, which is ensured by the unique and persistent identifiers that XML and XSLT can set from the get-go. This is important because we've also numbered the paragraphs on the page. These numbers are separate from the paragraph identifier, because the author might go back later and add or delete a paragraph. Renumbering 40 paragraphs can get tedious, so XSLT just does it automatically. But we still want the comments on a certain paragraph to stick with that one. I hope that isn't terribly confusing. In a nutshell, this system we have going allows for flexibility of representation and convenience of editing the content.
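
To make the difference between identifiers and numbers concrete, here is a rough sketch, not the Guide's actual stylesheet. It assumes each paragraph in the source XML already carries a persistent xml:id attribute, and that paragraphs are p elements inside section elements under a root article element. All of those names, and the para-number class, are hypothetical. The identifier is copied through unchanged, while the visible number is recomputed on every transformation:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/article">
    <html>
      <body>
        <xsl:apply-templates select="section/p"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="p">
    <!-- Stable identifier: copied from the source, so a comment
         thread keyed on it stays with this paragraph -->
    <p id="{@xml:id}">
      <!-- Display number: counted in document order at transform
           time, so it updates when paragraphs are added or removed -->
      <span class="para-number">
        <xsl:number level="any" count="p"/>
      </span>
      <xsl:text> </xsl:text>
      <xsl:apply-templates/>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

With this arrangement, inserting a new paragraph near the top would shift every later display number by one, but each existing paragraph would keep the same id value, which is what a commenting system would need to find its thread again.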

[modeling choices]

Modeling choices
But what is the point of doing all of this? Trevor talked about the value of publishing in XML, and I'll just emphasize this. You might say we could just as easily do this straight-up in HTML and CSS without having to do anything in XML and XSLT. Well, this is true, but it would not be nearly as flexible or convenient. In the model we're using, the article content can be poured into many different kinds of output with minimal effort. Describing the data of the document with meaning adds value to the data, because it can be easily reused and retooled in the future.
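
As one illustration of pouring the same content into a different output, a second, much shorter stylesheet could produce a plain-text outline instead of a web page. This is a sketch under assumptions: it supposes a root article element with a title, and section elements that each have a head child holding the section title. Those names are placeholders, not the Guide's schema.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Same kind of source document, but plain text out instead of HTML -->
  <xsl:output method="text"/>

  <xsl:template match="/article">
    <!-- Article title on the first line -->
    <xsl:value-of select="title"/>
    <xsl:text>&#10;</xsl:text>
    <!-- One indented line per section heading -->
    <xsl:for-each select="section">
      <xsl:text>  - </xsl:text>
      <xsl:value-of select="head"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The XML articles themselves would not need to change at all. Only the stylesheet is swapped, which is the flexibility that hand-written HTML would not offer.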
