In my first CHI Blog Post, I described this project this way:
I’m undertaking a large-scale text analysis of the Armed Services Editions, a collection of novels sent to US Soldiers during WWII to “fight the war on ideas,” to consider issues of politics and literary form.
This is a large mandate. I broke this project down into three phases, and worked through Phase One this year as a CHI Fellow.
Phase One is a book history phase, in which I develop richer, descriptive data of the corpus as a whole. To date, there has been no substantive analysis of the ASEs as a collection. Institutional histories of the Council on Books in Wartime (CBW) have been written, as have histories of the ASEs’ role in the development of a reading public. But each of these studies, however excellent, takes the composition of the corpus for granted.
Phase Two will be a more analytic phase, working backward from the corpus to assemble a working definition of “democracy.” Phase Three, ideally, will involve supervised machine learning.
Working through Phase One of my project as a CHI Fellow, I did the following:
- I visited the Mudd Manuscript Library at Princeton University, where the Armed Services Editions/Council on Books in Wartime archive is held. My goal was to identify more clearly the institutional aims of the program, and some of the personal aims of the publishers involved. Most importantly, I’d hoped to find the selection criteria for the ASEs. (No selection criteria were written during the founding of the CBW, but criteria were later identified with the 1944 passage of the Soldiers’ Vote Act.)
- I assembled the corpus. When full text was available, I ran (and corrected) OCR. The bulk of the year was spent on this process.
- I worked on learning a new programming language. Using this dataset, I worked my way through two textbooks on humanities data in R.
- I conducted some very basic, exploratory analysis. Voilà.
- I secured funding to continue this project and conduct more advanced analysis over the summer.
What you see here is the very beginning of that process.
A few things that I’d ask my viewers to consider:
This is a work in progress. A proof of concept. The work that I’ve done here relies on just a small fraction of my corpus: what I was able to find in full text. I’ll complete a more thorough and sophisticated analysis of these data this summer.
The majority of digital work happens behind the scenes. Think of Hemingway’s iceberg. Building a corpus (assembling it, cleaning it) takes a lot of time, and has to happen before any sort of data analysis can be undertaken. I was fortunate to have access to the LEADR lab at Michigan State University, where I was able to commandeer four computers at once to run OCR. And still, it took much longer than I expected. What you see here is one small fraction of the work that’s been completed this year.
This is a record of learning. While I was working on this project, I taught myself how to perform these functions. I’ve been working through R textbooks, learning a programming language. Some of this analysis seems basic: it is. It was conducted at different points of the year, and reflects my developing skill level. It’s very likely that users familiar with the textbooks I used will be able to match up the analysis here with the related chapters and problem sets.
I’ll continue to post data analysis as I conduct it, and I’m looking forward to collaborating with some fantastic friends and colleagues in the coming months.