Coolest Topics (all with stopwords removed)
Fulltext, 50 Topics, 1000 Iterations (20 topic word printings)
“Pamela in a Nutshell” → mrs sir master good pamela mr dear poor…
“Your Average Domestic Novel” → dear heart lady love father latter good hand hope…
Fulltext, 25 Topics, 100 Iterations (10 topic word printings)
“Time for Army Adventures” → king people war english england general adventures army…
“A Lovely Letter” → love dear letter heart adieu happy happiness moment friendship emily
Chunks, 25 Topics, 100 Iterations (10 topic word printings)
“Adventure = Money” → adventures guinea made make money moment moment sir master give
“Nouns R Important” → lady friend men woman world heart lord thing present happiness
Chunks, 25 Topics, 200 Iterations (10 topic word printings)
“A Brief History of England” → king people england war prince english power army
Chunks, 50 Topics, 1000 Iterations (20 topic word printings)
“A Gentleman’s Handbook” → honour time power thought favour give part liberty
“Intro to English Gov” → king people england duke france kingdom prince queen parliament…
“Probably Plot of Chrysal” → master guinea adventures made directly service general business person
Chunks, 50 Topics, 1000 Iterations (5 topic word printings)
“Nonsense” → de ia le ft la
Most interesting things in fulltext topic modeling: When I did the first run-through with 50 topics, 1000 iterations, and 20 topic word printings, I was struck by how specific some topics were to a single book. For example, 40% of the words in Pamela were assigned to the topic I named “Pamela in a Nutshell.” Throughout my experiments with different numbers of topics, iterations, and topic word printings (all using the fulltext folder), I found that it was indeed possible to find genres in some of these topics, and some of them are reminiscent of Tristram Shandy. However, I think this is hard to evaluate, because I’m relying on what I already know about existing genres to “check” these topics, so I have no way to check topics that reflect genre nuances I’m unaware of.
Cool things about chunks of novels topic modeling: It seems like with the chunks of novels there is a higher probability that each topic will correspond more directly with a particular book, though I think this depends on the topic itself, because some are more general and others are more specific (like the ones about HistoryEngland). One weirdness was a suddenly nonsensical topic (which I named “Nonsense”) that felt like stopwords had edged their way in. When I looked at the topic more closely, 92% of the words in Shandy1_22 were assigned to it, and, upon further investigation, that chunk turns out to be mostly in French, so the topic confusion makes sense: presumably the stopword removal only catches English words, not French function words like “de,” “le,” and “la.” Going further, it would be so helpful if a next version of this algorithm could scan for different languages within a text and either make a note of that or just create the topics in that language. Luckily, our corpus of works is (almost) all in English, so we don’t require this for our current use of the program, but it would be cool. Very cool.
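The language scan wished for above could be approximated before modeling with a very rough check on the raw chunk text. This is only a sketch under my own assumptions, not a feature of the topic modeling program: the two hint-word lists, the `margin` value, and the function names are all hypothetical, and a real language detector would do much better. It has to run on the text before stopwords are removed.

```python
import re

# Hypothetical, deliberately tiny lists of very common function words.
FRENCH_HINTS = {"de", "le", "la", "les", "et", "un", "une", "que", "qui", "dans", "est", "pour"}
ENGLISH_HINTS = {"the", "and", "of", "to", "in", "that", "is", "was", "for", "with", "it", "his"}

def hint_share(text, hints):
    """Fraction of the word tokens in `text` that appear in `hints`."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in hints for token in tokens) / len(tokens)

def looks_french(text, margin=0.05):
    """True when French hint words clearly outnumber English ones.

    `margin` is an arbitrary cushion so that near-ties are not flagged.
    """
    return hint_share(text, FRENCH_HINTS) > hint_share(text, ENGLISH_HINTS) + margin
```

Running `looks_french` over every chunk and noting the ones that come back `True` would give a list to set aside or inspect by hand, which is roughly the "make a note of that" option.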
General observations: The more specific novels have an easier time getting a topic all to themselves; for example, the HistoryEngland doc was represented with topics like “Time for Army Adventures” in the fulltext topic modeling and the topics “A Brief History of England” and “Intro to English Gov” in the chunks of novels topic modeling. All of the topic modeling feels a bit like a reality effect exercise, because it starts by essentially stripping away any possible meanings of the words and simply grouping them with other words they appear close to. This idea that words/objects could be in a novel without a meaning, to just be there, feels like what Barthes was talking about. (Maybe? I don’t have a full grasp on all the details of the reality effect.)
I preferred topic modeling the chunks of novels over modeling whole novel docs because modeling the chunks made it easier for me to understand how topics related to each individual book. Certain docs showed up with topics highly represented (a high percentage of the words in the doc assigned to the topic), which tells us that the topic in question is very important for that section/chunk of the novel. The other possibility was that one novel dominated the list of docs that included the topic in question, which means that the topic is important throughout the whole book because it shows up in more than one chunk of the novel. I liked being able to compare the significance of topics for whole books with their significance for sections of those books -- this could be really useful for tracking themes throughout novels, and then throughout the history of novels, but I feel like we’d need a tool that lets us be more precise about what we want to do.
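The two cases described above (one chunk where a topic spikes, versus a novel where the topic keeps showing up) can be pulled apart with a small script. This is a sketch, assuming the per-chunk topic percentages have already been loaded into a dictionary, and assuming chunk names follow the pattern of "Shandy1_22", where everything before the last underscore identifies the book. The function names, the 0.5 threshold, the topic numbers, and the sample numbers are all made up for illustration.

```python
from collections import defaultdict

def novel_of(chunk_name):
    # Assumption: names look like "Shandy1_22", i.e. "<book>_<chunk number>".
    return chunk_name.rsplit("_", 1)[0]

def summarize_topic(doc_topics, topic, threshold=0.5):
    """doc_topics maps chunk name -> {topic id: share of the chunk's words}.

    Returns (peaks, averages): the chunks where the topic's share is at least
    `threshold`, highest first, and the topic's mean share per book.
    """
    peaks = []
    by_novel = defaultdict(list)
    for chunk, shares in doc_topics.items():
        share = shares.get(topic, 0.0)  # a missing topic counts as 0
        by_novel[novel_of(chunk)].append(share)
        if share >= threshold:
            peaks.append((chunk, share))
    peaks.sort(key=lambda pair: pair[1], reverse=True)
    averages = {novel: sum(values) / len(values) for novel, values in by_novel.items()}
    return peaks, averages

# Made-up numbers, only to show the shape of the input.
doc_topics = {
    "Shandy1_21": {7: 0.02, 3: 0.40},
    "Shandy1_22": {7: 0.92},
    "Pamela_1": {3: 0.50},
    "Pamela_2": {3: 0.30},
}
peaks, averages = summarize_topic(doc_topics, topic=7)
```

A topic with one entry in `peaks` but a modest book average is important for a section; a topic with a high book average is important across the whole novel.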