
Topic-Modeling Metadata


At first glance, topic modeling doesn’t seem to provide us with a lot of metadata; the outputs the algorithm generates are pretty straightforward strings of words. But I think there’s more to topic modeling than meets the eye, and I’d be interested in exploring and analyzing what little metadata it has to offer. My research question would be something along the lines of: What can the metadata of topic models tell us about topic modeling as a practice, and about the novels being modeled?

In order to analyze the metadata of topics effectively, we’d first have to create, or at least document, the metadata that’s available for each one. I would begin by tagging each topic with the number of other topics produced alongside it, the number of iterations, the number of printed words, and whether or not stop words are present. I would then create some basic content-related labels for the topics based on what we’ve seen so far, such as “money” or “family” or “hilarious” or “???” depending on the topic. I’d also like to figure out a way to assign each topic a “relevancy score”, some metric that indicates how much the topic “makes sense” to a human reader or how much meaning we can draw from it.
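To make the tagging scheme a bit more concrete, here is a minimal sketch of what one record per topic might look like in Python. Everything in it is an assumption for illustration: the `TopicRecord` name, the field names, the 1-to-5 scale for the relevancy score, and the three sample topics (which are made up, not real output from any model run).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TopicRecord:
    """Hypothetical metadata record for a single topic."""
    words: List[str]              # the printed words of the topic
    num_topics: int               # number of topics requested in that run
    num_iterations: int           # number of iterations in that run
    num_printed_words: int        # number of words printed per topic
    stop_words_present: bool      # whether stop words appear in the topic
    labels: List[str] = field(default_factory=list)  # hand-assigned labels
    relevancy: Optional[int] = None  # assumed 1-5 human rating, None if unrated


def mean_relevancy_by_topic_count(records: List[TopicRecord]) -> Dict[int, float]:
    """Average the relevancy rating for each number-of-topics setting."""
    grouped: Dict[int, List[int]] = {}
    for record in records:
        if record.relevancy is None:
            continue  # skip topics nobody has rated yet
        grouped.setdefault(record.num_topics, []).append(record.relevancy)
    return {n: sum(scores) / len(scores) for n, scores in grouped.items()}


# Made-up sample records, purely for illustration.
records = [
    TopicRecord(["money", "pounds", "fortune"], 10, 100, 3, False, ["money"], 4),
    TopicRecord(["said", "one", "would"], 10, 100, 3, True, ["???"], 2),
    TopicRecord(["mother", "sister", "home"], 20, 100, 3, False, ["family"], 5),
]

summary = mean_relevancy_by_topic_count(records)
```

Grouping by `num_topics` is just one possible cut; the same records could be grouped by iterations, by label, or by whether stop words are present.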

I think this research question, and its answers, would provide us with a SUPER macro-level picture of what’s going on in a corpus of novels, one worth discussing.