The digital age has produced an abundance of digitized sources. Text analysis deals with large corpora of sources and with how to access this vast and mostly available historical record, from nineteenth-century British novels to early American newspapers. Humanities scholars who wish to use these sources should know how to navigate this wealth of digitized material. Numerous methodologies allow for this navigation, such as topic modeling, word frequencies, token distribution analysis, and even network analysis, but they come with caveats.
Most of these tools involve distant reading, a concept discussed by Italian literary scholar Franco Moretti. Distantly reading the entire run of an eighteenth-century newspaper could allow patterns to emerge that would not emerge through traditional close reading. One of the central ideas in text analysis is that distant reading complements close reading. Textual analysis of over one million words provides context for each source that is closely read. It generates new research questions and places hundreds of sources in the context of the thousands or millions of words in the corpus. Ideally, the digital methodology then illuminates patterns or anomalies in the sources that prompt close reading and analysis. It is a matter of scale. The scholar can “read” a large amount, closely read a selection, then interpret and provide a representative argument for the corpus.
One caveat of text analysis is whether the technology drives the historical questions and whether distant reading really tells us anything surprising. When typing in a search term or terms, the scholar already expects to find that pattern in the text, as Cohen and Gibbs made clear in “A Conversation with Data: Prospecting Victorian Words and Ideas.” The supposed advantage of topic modeling is that the program generates topics of collocated words across the “bag of words,” rather than someone manually defining the search terms. Yet the scholar still determines which words to omit from the computational analysis (the stop words), sets the number of topics generated, assigns meaning to those topics, and explores them across texts and time. In “Quantifying Kissinger,” Micki Kaufman states that she used forty topics for that corpus because Robert Nelson did so in Mining the Dispatch. Matthew Jockers, in Macroanalysis: Digital Methods and Literary History, generated about five hundred topics for his data set of over 3,000 nineteenth-century books, consulting with numerous colleagues across disciplines to name them. Would their research change with different numbers of topics, or with different ways of interpreting and visualizing those topics?
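To make the scholar's role in these choices concrete, here is a minimal sketch of the step that comes before any topic model: reducing documents to a “bag of words” after removing stop words. It is not a topic model and not the method used in any of the projects mentioned above. The stop-word list, the function name `bag_of_words`, and the two sample sentences are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a much longer one,
# and the choice of list is itself an interpretive decision.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "her", "his", "with"}


def bag_of_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens,
    and count every token that is not a stop word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


# Invented sample sentences, not drawn from any real source.
documents = [
    "The patient complained of pain and of weakness in the limbs.",
    "The physician treated the patient with rest and a change of diet.",
]

# Word order is discarded; only the counts across the corpus remain.
corpus_counts = Counter()
for doc in documents:
    corpus_counts.update(bag_of_words(doc))

print(corpus_counts.most_common(3))
```

Adding or removing a single word in `STOP_WORDS` changes the counts that any later analysis would see, which is one small way the scholar's decisions shape the output.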
Benjamin Schmidt argues in “Words Alone: Dismantling Topic Models in the Humanities” that scholars need to focus on the actual words in each topic and compare them, because this methodology can be misleading. I agree with his caveats about this methodology, as I experimented with topic modeling in the second semester of Digital Humanities coursework at GMU and ran into problems. In Clio II, I used topic modeling in an attempt to distantly read a small corpus of nineteenth-century medical journal articles on hysteria. I wanted to compare case studies of male and female hysteria in the nineteenth century to explore the differences and similarities in symptoms, diagnoses, prognoses, and proposed cures and therapies. I wanted to analyze those results in the context of American medical discourse about the body, mind, and gender.
I learned a lot about how not to approach text analysis. First, I used only a limited number of search terms to find the articles in ProQuest and other databases with digitized medical journals. I searched for “hysteria” and “hysterical.” I found only about seventy case studies for males and females combined, though those often discussed more than one patient. By using only those search terms, I probably (definitely) missed other names for and ideas of “hysteria” in the nineteenth century. Amassing a much larger corpus of medical journals devoted to mental illness and separating the articles by gender would provide a more exploratory and less contrived experience. It would also provide a better context in which to situate ideas about hysteria.
Second, because my corpus was not large enough to generate more accurate topic models, the models changed as I moved from ten to twenty to forty topics. When I modeled the two genders together, a set of words suggesting the reproductive system appeared, but it did not appear in the topics generated for each gender separately. That seemed like an important anomaly, because the root of “hysteria” is Greek for uterus. Third, I had trouble naming the topics and was not wholly satisfied with the names I settled on. The project resulted in more of a proposal for a better use of topic modeling for the research questions. As in the projects for this week, text analysis seems to work better as a methodology for exploring an already amassed corpus or corpora of text. It is very useful for “reading” patterns and anomalies that then generate interesting research questions for close reading.