BUDSC18 has ended
Saturday, October 6 • 10:45am - 12:15pm
Creating large text datasets for humanities research using reproducible methods

In this interactive session, I will guide the audience through methods for preparing a large, uniform dataset from the writings of Philip K. Dick, suitable for using algorithmic methods to discover how language around gender is used within the corpus. I will work with source material in a variety of media (scanned manuscript pages, printed books, PDFs, web-based text, etc.) and will use the Python programming language, along with various libraries, to extract text, format the data, and retain the metadata required to reference the original sources. I will document my work so that future researchers can read and understand the nuances of my methodology in enough detail to recreate my dataset from the original sources (whether or not they appear in the same medium), even using an entirely different programming language or suite of tools.
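
As a rough illustration of the kind of workflow the abstract describes, and not the presenter's actual code, the sketch below normalizes a piece of extracted text and stores it with the metadata needed to trace it back to its source. It uses only the Python standard library. The field names (`source_id`, `medium`, `locator`, `raw_sha256`), the helper functions, the normalization steps, and the sample sentence are all assumptions made up for this example.

```python
import hashlib
import json
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Apply a small, explicitly documented set of normalization steps."""
    # Step 1: use one canonical Unicode form so equal strings compare equal.
    text = unicodedata.normalize("NFC", raw)
    # Step 2: rejoin words hyphenated across a line break ("ma-\nchine").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Step 3: collapse every run of whitespace into a single space.
    return " ".join(text.split())


def make_record(raw: str, source_id: str, medium: str, locator: str) -> dict:
    """Pair normalized text with provenance metadata (hypothetical schema)."""
    return {
        "source_id": source_id,  # which work or document the text came from
        "medium": medium,        # e.g. "scanned page", "PDF", "web"
        "locator": locator,      # page, folio, or URL fragment in the source
        # Checksum of the text before normalization, so the raw extraction
        # can be verified independently of these normalization choices.
        "raw_sha256": hashlib.sha256(raw.encode("utf-8")).hexdigest(),
        "text": normalize_text(raw),
    }


def to_jsonl(records: list) -> str:
    """Serialize records as one JSON object per line."""
    lines = (json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample text standing in for OCR output from a scanned page.
    sample = "The ma-\nchine   spoke\nsoftly."
    record = make_record(sample, "example-source", "scanned page", "p. 1")
    print(to_jsonl([record]), end="")
```

Writing each normalization step as a separate, commented line and storing the result as plain JSON is one way to keep the method readable by someone working in a different language or toolset; other formats and schemas would serve the same purpose.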


Saturday October 6, 2018 10:45am - 12:15pm
Room 241
