I think I'm going to approach this a bit more artistically and a bit less scholarly than may be intended, but I can't help myself.
I was most taken with the topics generated from the prescribed settings (50 topics, 1000 iterations, 20 words per topic):
Space Pirates: strap captain narcissa ship chap board time behaviour morgan surgeon immediately body uncle mate cried expence put told banter thomson
time travel, space ships, best friends, beautiful aliens, and witty remarks from the medic.
The next three all have exceptionally good final three words. I wonder how much the order matters to my understanding of the topics, and how random the order is.
Evening Passion: eyes purpose attention voice tears peace stood silence instantly fixed ground soul night distress place led felt length rose equally
two lovers part in a moonlit garden.
Americans Abroad: peregrine pm pickle lord pipes hero commodore gentleman mrs emilia hatchway love trunnion lieutenant jolter painter company view french behaviour
men in double-breasted suits aboard steamers talk of art and war over lunch.
I'm having a lot of fun with these. They remind me of poems without linebreaks. The NY Times has a running column that makes poems out of missed connections postings on Craigslist, which remind me of this. It makes me really want to write found poetry for my experimental bibliography.
I generated two other lists of topics: one simple, and one complex. Both were disappointing.
10 topics, 100 iterations, 10 words per topic:
pastoral epic: time power pleasure present life nature happiness english country thousand
In fact, the simplicity of the settings has led me to the most complex, or at the very least, most abstract topics. I have asked the computer to distill novels to their basest forms for me. If I consider novels an imitation or representation of reality, then I am nearly asking a computer to find the meaning of life. That, of course, did not work out so nicely in The Hitchhiker's Guide to the Galaxy.
20 topics, 2000 iterations, 15 words, no stopwords:
This run took 30 minutes to complete.
The most interesting topic was this:
an ode to ee cummings: the to of i in a it not that but for be have as my
Satisfyingly the opposite of the "simple" results, but otherwise too basic, too superficial.
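This is the predictable result of running without a stopword list: function words so outnumber content words that they dominate every topic. A minimal sketch of the effect, using an invented sentence and a toy stopword list (MALLET and similar tools ship much fuller English lists):

```python
from collections import Counter

# Hypothetical sentence standing in for the corpus, not a line from the novels.
text = ("it is a truth that the captain of the ship told the crew "
        "that the voyage to the island would not be an easy one")
tokens = text.split()

# A small illustrative stopword list.
stopwords = {"the", "to", "of", "a", "it", "is", "that",
             "not", "be", "an", "would", "one"}

# Without filtering, function words win by sheer frequency.
print(Counter(tokens).most_common(3))

# With filtering, content words finally surface.
filtered = [t for t in tokens if t not in stopwords]
print(Counter(filtered).most_common(3))
```

The same mechanism explains the "ode to ee cummings" topic: nothing in the model privileges meaning over frequency.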
Of course, I could write about the easily labeled topics: church, or voyages, or one topic that was very obviously Pamela. What's the fun in that? Topic modeling effectively takes something sciencey and relieves it of any obligation to be scientific. We take all these data, collected by the most scrupulously unbiased process, and then require that they be nearly arbitrarily (certainly subjectively) named, labeled, and sorted.
I've been thinking about applications for topic modeling. Is it practical for learning about large amounts of writing? How could I actually use it in a real situation? Not just by generating lists, I think. But what about connecting the words in the topics to the full information? Could we hyperlink each word to direct back to its appearance(s) in the original text(s)? I'm thinking about something along the model of The Perseus Project. Could we create topic concordances, with links to the location of every instance of a word chosen in a topic? Could we generate statistical metadata, showing frequency, placement, etc.? Lastly, could we superscore iterations of topics? Is that already being done by iterating (I don't have a strong enough grip on the actual process)? I'd like to see a super topic model, where only the strongest words remain, only those used over and over and over again, or used from topic to topic to topic. Is that close to my simple settings? What would happen if I asked the program to iterate once, generate one topic, and choose one word? I assume it would choose the most used word in 1760s novels.
I did this and it returned "sir". With two topics and two words each, I got "time made" and "sir lady." I'm intrigued by "time made." I will leave it on this note.
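The superscore idea above can be sketched without any modeling at all: run the model several times, then keep only the words that survive every run's top-word lists. A minimal sketch with invented word lists (not my actual results):

```python
# Hypothetical top-word lists from three separate runs of the model.
run1 = ["time", "sir", "lady", "made", "captain"]
run2 = ["sir", "time", "night", "made", "voyage"]
run3 = ["made", "time", "sir", "tears", "ship"]

# "Super topic": only words that appear in every run remain.
super_topic = set(run1) & set(run2) & set(run3)
print(sorted(super_topic))
```

This is blunter than what the sampler does internally during its iterations, but it captures the intuition: the words used from topic to topic to topic are the ones that remain.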