The first question colleagues often ask when I talk to them about large amounts of encoded text is what such a data set is for, or how to use it. Encoding every word of Shakespeare does not strike most academics or other interested parties as particularly useful. Isn’t it just a concordance? But the advantage of a digitally encoded corpus is the ability to collaborate with an expert in writing applications that manipulate the data through its metadata. This process requires a good deal of imagination to form questions with which we can ply a corpus of text. The applications and scripts allow interested parties two main options. The first is the novel or unlooked-for discovery within a body of text that was previously inaccessible, or obscured by the sheer amount of data. The second is to design a large-scale question (one that may need revision) whose answer previously required speculation or generalization. I am not claiming that big data renders these types of rhetorical and logical moves obsolete, but it can open doors to questions that seem out of reach for individual researchers.
Let me try to be a little more specific. Topic modelling is one way that we are able to dynamically sift through large amounts of information. David M. Blei provides an introduction to the concept here:
The trick is to figure out the ways in which interested parties can, as he puts it, “zoom in” and “zoom out” to relevant information and data. Topic modelling gives readers the opportunity to find patterns and connections, and to manipulate discourse. A more specific example still (and one that I find fascinating as a fan and an intellectual) is the Philip K. Dick android built by Hanson Robotics and researchers, like Andrew Olney, at the Institute for Intelligent Systems at the University of Memphis. The android was built to look like Dick, complete with beard, eyes, and facial expressions, but more than this, it was programmed to speak like him.
The software developers used typical conversational models called bots to generate the grammatical glue of the conversation, but they also used topic models and concept maps of Dick’s corpus of writing to create responses to spoken questions that come from the writer’s own words. After hearing Olney give a lecture on how he managed to achieve a conversable android Philip K. Dick, I got excited not only about the possibility of electric sheep, but also about new ways in which to manipulate discourse. I am still looking for the right questions to ask of this methodology, but tracking concepts between images, poetry, fiction, non-fiction, journals, etc. can allow us to recreate not just discourse, but echoes of a discussion.