
Topic-Modeling Metadata

2 min read

At first glance, topic modeling doesn’t seem to provide us with a lot of metadata; the outputs that the algorithm generates for us are pretty straightforward strings of words. But I think there’s more to topic modeling than meets the eye, and I would be interested in exploring and analyzing what little metadata topic modeling has to offer us. My research question would be something along the lines of: What can the metadata of topic modeling tell us about topic modeling as a practice, and about the novels it is attempting to model?

In order to effectively analyze the metadata of topics, we’d be burdened with the task of creating, or at least documenting, the metadata that’s available for each one. I would begin by tagging each topic with the number of other topics produced alongside it, the number of iterations, the number of printed words, and the presence of stop words or not. I would then create some basic content-related labels surrounding the topics based on what we’ve seen so far, such as “money” or “family” or “hilarious” or “???” depending on the topic. I’d also like to figure out a way to assign the topics a “relevancy score”, or some metric that indicates how much the topic “makes sense” to a human reader or how much meaning we can draw from it.
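One rough way to picture the tagging scheme described above is as a small record per topic. This is only a sketch: the class name `TopicRecord`, the field names, and the 1-to-5 scale for the hand-assigned "relevancy score" are all assumptions made for illustration, not part of any topic-modeling tool. The two sample topics are ones that appear elsewhere on this blog.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopicRecord:
    """Hand-documented metadata for one topic (hypothetical schema)."""
    words: list[str]               # the printed words of the topic
    num_topics: int                # how many topics were produced alongside it
    iterations: int                # number of iterations for the run
    num_words: int                 # number of printed words per topic
    stop_words_removed: bool       # whether stop words were removed
    label: str = "???"             # human-assigned content label
    relevancy: Optional[int] = None  # assumed 1-5 score assigned by a human reader


records = [
    TopicRecord("king people don made war".split(), 5, 1000, 5, True,
                label="Imperialism"),
    TopicRecord("peregrine pm mr pickle pipes".split(), 50, 1000, 5, True),
]


def by_label(items):
    """Group topic records by their human-assigned label."""
    groups = {}
    for item in items:
        groups.setdefault(item.label, []).append(item)
    return groups


for label, group in by_label(records).items():
    print(label, [" ".join(r.words) for r in group])
```

Once every topic has a record like this, comparing labels or relevancy scores across runs with different topic counts or iteration counts becomes a matter of filtering and grouping.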

I think this research question, and its answers, would provide us with a SUPER macro-level picture of what’s going on in a corpus of novels, one worth discussing.


Vanishing Point - The New Yorker

Extraordinarily relevant to our discussions on imagined communities and the relationship between social media and the novel.

Capitalism, pickles, and a very sad guy named Harry

3 min read

(All topics were produced with stop words removed)

The highlights:

5 topics, 1000 iterations, 5 words, full text corpus

Title: Imperialism. “king people don made war”.

50 topics, 1000 iterations, 5 words, full text corpus

Title: Social mobility. “time made mr house great”.

Title: ??? “peregrine pm mr pickle pipes”.

Title: Novels are EVIL. “author genius book bad devil”.

Title: Harry had a rough day. “harry mr fool quality cried”.

Title: France and England, a History. “war prince french england english”.

Title: The Structure of a Novel. “set hundred laid part story”.

Title: Virtue’s Fear of Flying. “honour heart flie foul happened”.

5 topics, 1000 iterations, 5 words, ONLY the chunks of Pamela.

Title: Sparknotes version of Pamela. “mrs jewkes thought poor thing”

10 topics, 3000 iterations, seven words, full text corpus.

Title: England Over Everything. “king people country power england english time”.


The most striking thing I’m sure you all will notice about my topics is that they’re exceptionally short when compared to the baseline of 20 that the exercise guidelines outlined as an ideal starting point. I played with several different versions of topic modeling outputs over the course of this assignment, and I found that the five-word topic made the most sense to me and felt like I could draw the most meaning out of it. Thus, five-word topics abound in my highlights section.

Personally I have a tough time drawing a lot of meaning out of these topics, even in my more concise versions. I feel like the algorithm just doesn’t really create topics in a way that allows us to draw powerful conclusions from them, at least in this setting. Maybe there’s something I’m not seeing, but many of the topics just seem terribly incoherent.

I did notice there’s a little bit of Armstrong-y type stuff going on here, specifically with relation to “author genius book bad devil”. This reminded me of our discussions on Pamela and Shamela, and how as the female domestic novel emerged onto the literary scene, it often doubled/masqueraded as/was supposed to be a conduct book of sorts while it helped to shape what we know of today as the novel.

There’s also some imagined communities stuff floating around, a prime example being this “king people country power england english time” thing. This topic seems to suggest a unified English identity through the unity of king, people, and country over the course of time. It’s difficult to say whether this topic was drawn primarily from a single text within the corpus or not, but if it wasn’t, this topic gives powerful evidence to support Anderson’s theory that the novel was playing a crucial role in creating these imagined communities.

It’s possible to glean some interpretation from these topics, but as it stands right now, I liked them better for their comedic qualities.


Thinking about getting a tattoo?

2 min read

While doing my traditional descriptive bibliography assignment for Wollstonecraft’s “Mary, A Fiction,” the thing that struck me the most about this novel was its structural and paratextual simplicity, which contrasts sharply with its controversial and empowering textual contents. The Sparknotes version of the novel goes something like:

Mary is a sort of wallflower type daughter figure until her brother dies and leaves all of her lukewarm family’s inheritance in her name. Her mother suddenly starts paying attention to her/forces her to marry someone random and she has to oblige. Almost immediately afterwards he leaves for the continent and Mary develops deep personal relationships with Ann and Henry, who eventually both die from consumption. Charles comes back and they live mediocre-ly ever after.

So, despite being something of a proto-queer/feminist novel, the paratextual aspects of the copy I looked at on ECCO are shockingly simple. It is printed in one volume, the title is just three words, and the title page itself is quite bare. There are no bells and whistles, no fanfare declaring this to be a noteworthy or game-changing novel. To be honest, that sort of bothered me, and I want to convey those aspects of the novel in my experimental bibliography.

In order to catalogue the novel’s close relationship specifically to gender, sexuality, and the body, I would like to literally fuse the paratext to physical bodies and present these fusions as a photo series. Naturally the first thing that comes to mind is tattoos, and I wouldn’t ask anyone to tattoo an eighteenth-century novel’s title page on themselves unless they really wanted to. I was thinking of using similar but non-permanent methods to fuse text to bodies, like temporary tattoos, other sources of ink, or maybe even something completely different. By fusing text to bodies, I would like to explore how the human and the literary interact on a very physical level, and frame the series of images in a way that conveys the cultural and social implications of this novel in a way that the paratext currently fails to do.


I can Descriptive Bibliography?

2 min read

Descriptive Bibliography for Wollstonecraft's Mary, A Fiction

Bobby Zipp

Wollstonecraft, Mary. MARY, A FICTION. 1st ed. London: J. Johnson, 1788.

MARY, | A | FICTION. | [Bulging line extending horizontally from center to first and fourth quartiles of page] | L'exercise des plus [long s]ublimes vertus éleve et nourri: | le génie. [...] ROUSSEAU. | [Floral inscription] | LONDON: | PRINTED FOR J. JOHNSON, ST. PAUL's CHURCH-YARD. | MDCCLXXXVIII. |•|

Volume 1/1:

Pagination: i - vi, 1, 2-187. 8mo. Collation: i2,A2, B-M12, N6.


i1r: Half-title. i1v: Title. A1r-A2v: Advertisement. A3r-N6r: Text.


Published anonymously. 31 Chapters. The half-title is a signature of the fictitious "Mary Bayley", top of page, centered. Translation of the Rousseau epigraph on title page: "the exercise of the most sublime virtues raises and nourishes genius". Advertisement is signed, "Mary". Each chapter ends with the text "CHAP" in the bottom right corner of the page it ends on. The word "As" is reprinted as the last word on C2r and the first word on C2v. The same pattern occurs with "There" on D1r and D1v. Also with "receive" on both sides of D2. Pattern becomes almost constant after this point. When dialogue is present, there are quotation marks on the right hand side of all of the text that is considered dialogue, not just the first and last lines. N6r includes the text "END" at the bottom of the page.

Sourced from the British Library. Digital facsimile retrieved from Eighteenth Century Collections Online. Gale. Swarthmore College Lib TRICO(PALCI). 8 Nov. 2015. No blank pages included in digital facsimile. Gale Document Number CW3312951500. ESTC Number: T039008. Microfilm Reel#: Eighteenth Century Collections Online, Range 4967.

Sections of this work included in The Young Gentleman and Lady's Instructor; Or, New Reader and Speaker: Being a Choice Collection of Pieces in Poetry and Prose, Etc. Lewes: Sussex Press, 1808.

Metadata is like... soooo meta

I think exercise 6 was my favorite exercise yet. Despite its limitations, I think Google FusionTables does the things it’s supposed to do pretty well.

Regarding the PubLocation, I noticed that for the first time, we saw an American publishing location! Maybe. Maybe we’ve seen New York before, or talked about it, but this is the first time I’ve seen/remembered an American publisher making it into our datasets. And I’m not surprised at all that Cambridge is the first one to pop up; isn’t it their thing to pretentiously brag about how they were the first at everything?

In all seriousness, I really enjoyed studying the heat map of the title locations mentioned in this corpus. I started to get a sneaking suspicion that the locations being mentioned in this particular map were all parts of the British empire at the time, and Google tells me that I was kind of sort of right! Compare the heat map in this post with the map of the 18th century British empire and see for yourself:

I guess it could be explained by the fact that early novelists either traveled to or heard a lot about other places in the empire, far more than they heard about places that weren’t British territory. I know as a writer I often subconsciously draw on the things or experiences I hear most often, so it makes sense that British authors would mention British-controlled places the most frequently.

Regarding the bar graphs: I decided to filter PubDate by VolumeStatement, and found that the number of volumes spiked in parallel with the overall proliferation of novels that peaked in about 1769. After that date, though, it looks like the number of novels published overall decreases faster than the average number of volumes. Maybe publishers realized they could make more money if they printed books in more volumes?

Funny note on the narrative form pie chart: Mine said that the form “4” comprised five percent of the Early Novels Database. Am I completely unaware of a secret narrative form that I’m just being exposed to through this exercise? Probably not. It’s probably an error. But it’s still funny.

Regarding the word cloud exercise: I used Excel’s Find and Replace function and it worked just fine for me, no TextWrangler needed. I also removed the words “first”, “second”, and “third”, as well as “two”, “three”, and “four”, because they were boring. (Sorry, number enthusiasts.) What was left behind was pretty interesting: the new most frequent words included “entertaining”, “original”, “young”, “curious”, “great”, “secret”, “real”, and “moral”. I guess novelists were really concerned with making sure readers knew their works were going to be fresh-faced and spunky before they actually sat down to read them.
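For anyone who would rather script this step than use Find and Replace, here is a minimal sketch of the same idea: drop a basic stopword list plus the "boring" number words, then count what is left. The two sample titles are invented for illustration, and the short `BASIC_STOPWORDS` set is a stand-in assumption for whatever stopword list the word cloud tool actually uses.

```python
import re
from collections import Counter

# The number words removed by hand in the post.
EXTRA_STOPWORDS = {"first", "second", "third", "two", "three", "four"}
# A tiny stand-in for a real English stopword list (assumption).
BASIC_STOPWORDS = {"the", "a", "an", "of", "and", "in", "or", "to", "by"}


def title_word_counts(titles, stopwords):
    """Count lowercase words across titles, skipping anything in stopwords."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


# Invented sample titles, not rows from the actual dataset.
sample_titles = [
    "The Entertaining History of a Young Lady, in Two Volumes",
    "A Curious and Entertaining Story. The Second Edition",
]

counts = title_word_counts(sample_titles, BASIC_STOPWORDS | EXTRA_STOPWORDS)
print(counts.most_common(5))
```

Keeping the extra stopwords in their own set makes it easy to rerun the count with and without the number words and see how much the ranking changes.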

Overall, I really like FusionTables, and I think it’s a powerful tool for low-level data analysis. If you want to do anything more strenuous, you can always switch over to Stata or RStudio, so I don’t really mind how lightweight it is, and I wouldn’t really want it to have more firepower if the option was available. It’s just accessible enough and user-friendly enough that anyone can create pretty neat observations fairly quickly.

I wonder what would happen if we put the END into Stata?

Bibliography Overload

The amount of data and information included in this assignment was absolutely massive! Sometimes I felt overwhelmed by all the texts, data, and information provided in Garside’s introduction.

A major trend I noticed in just the bibliography was the way that the title, author, and content of the work appeared to influence the way that critics would read and subsequently review a book. It felt like many of the reviews of novels published anonymously, by women, or containing content that occupied a feminine sphere (such as the memoirs of older women or the histories/letters of younger women) were held to a different standard than those published by men or about men. This is clearly evidenced in the comments on the anonymous “The Wedding Ring; or the History of Miss Sidney”, in which CR refers to something called “…the female library.” Thus, it sort of feels like anything related to the feminine sphere is being relegated or written off as “other” or atypical in some way, which supports Armstrong’s claim that the separation of gender spheres in domestic fiction contributed to the rise of the individual subjectivity. In other words, this “female library” may be abnormal, but it may be on to something, and the critics don’t really know how to handle it.

It’s also worth noting that a work by the famous Voltaire is included in this bibliography, and the comments on his work are particularly of note. In essence, the comments on “Young James Or the Sage and the Atheist” boil down to something along the lines of: “Oh yeah, it’s Voltaire. Of course this is a good book,” whereas many of the other authors discussed in the bibliography are subject to a pretty thorough critique of even the minor characters in their work (like the Captain in Evelina). It appears that at least in the literary/scholarly world, your name can still carry some weight when it comes to critical reception of your early novel.

The five early novels I chose to compare were: Clara Reeve’s The champion of virtue. A Gothic story. By the editor of The Phoenix; a translation of Barclay's Argenis; Voltaire’s Young James or the sage and the atheist. An English story. From the French of M. de Voltaire; Louisa Wharton’s Louisa Wharton. A story founded on facts: written by herself, in a series of letters to a friend. Wherein is Displayed Some particular ... and so on; Sutton-Abbey. A novel. In a series of letters, founded on facts; and Evelina.

When comparing these five early novels, the most noticeable difference between Evelina and the other novels I selected was the fact that Evelina was the only one out of the five that did not claim to be fiction or a story in any way on its title page. Young James is a “story”, and so are Louisa Wharton and “The champion of virtue.” Sutton-Abbey is the only one of the five to claim itself as a novel on the title page, but in the preface argues that it was “…not intended for publication.” Whether this claim is merely to convince readers that this is somehow a more authentic narrative or that it really was never meant to be printed lies beyond the scope of the exercise at hand. Regardless, the main conclusion that I drew from this discrepancy is that by not claiming to be a fictionalized story or a novel in any way on the title page, Evelina is actually succeeding the most at being exactly that: a fictionalized story AND a novel. This is evidenced by the generally favorable comments included in the bibliography, which stand in stark contrast to those of Sutton-Abbey, which gets called nothing more than second-rate. So, is the trick to being a popular 18th-century novel to be a “hipster novel”, by keeping silent about your novel-ness and allowing others to marvel at your talents? Maybe.

Moving on to Artemis: Just for funsies, I tried to search the time period 1760-1800 with the “Novel” filter under document type and got no results back. Does that mean Artemis doesn’t recognize any 18th century works as novels? I don’t know the answer. I might have just formatted it wrong, but it’s a question worth investigating in the future.

When I did get some results using the normal search methods, I was fascinated at the results of both data visualization tools. I saw that a spoke in the metaphorical wheel of words and connections was “Author”, but the only names underneath those spokes were female names; actually, “Ladies” was a major subcategory of the spoke “Author”. Again, I think that says a lot about the relationship between the feminine sphere and the rise of the individual subjectivity and its primacy in the early novel, as argued by Armstrong. Another piece of data that supports her claim is the fact that one of the subcategories of “Lady” is “Facts”, suggesting that those two are deeply related in some way.

For my term frequency graph, I plotted a few different terms together and got something pretty neat back. By including “novel”, “story”, “life”, “letters”, and “lady” together on the same graph, it’s easy to see that all of these terms are closely related in some way. All of these words show a pretty steady increase as the 18th century draws to a close, signaling what could be interpreted as the beginning of the rise of the novel as a major player on the literary scene, and that “life” is an essential part of the “story” that “novel”s are trying to tell. Also, more so than any other two terms, “letters” and “lady” follow each other so closely that it’s hard to tell which line is which at some points in the graph. Essentially, this tells us that if there’s going to be the mention of a lady in an early novel, it’s probably because it’s an epistolary novel and a lady is the one writing it. And in this way, we can also speculate that the rise of “lady” and “letters” relates directly to the overall rise of the novel in general.

OCRs and I don't mix.

4 min read

I hit some roadblocks when trying to find a good OCR software to use for this assignment. Starting with the 1760 version of Vol. 10 chapter 1, I attempted to upload the PDF of the chapter into Google Docs, but to no avail. The only software applications I was able to access (or that Google would let me access) were the ones directly related to viewing strictly PDFs. I guess Google didn’t think it was a document of text in disguise like I had hoped. I also tried to download the trial version of ABBYY FineReader Pro and it insisted that the software was only for Windows computers. But then, miraculously, Prizmo worked for me and I successfully uploaded the text.

While doing this exercise, I did notice that searching the images on ECCO returned surprisingly accurate results. It recognizes the typographically archaic S when searching for things, at least in some cases. Searching for “state” returns the typographically abnormal “state” in the images of Tristram Shandy accurately, which I found quite impressive. It was able to also return accurate results with the word “loss”. However, searching for “stead” in the ECCO did not return any results, even though it is very clearly legible on image 43. Even more interestingly, searching for both “Yorick” and “Torick” will return the name “Yorick” in the ECCO, but the two different searches produce entirely different lists of reference locations. For example, searching “Torick” may return the textual “Yorick” on (a hypothetical) image 57, but searching “Yorick” will return “Yorick” on (a hypothetical) image 43. I have no idea what causes the ECCO to differentiate between the different occurrences of the archaic “S” character or the different printings of “Yorick”, but I’d be curious to learn why.

When it came to actually using Prizmo, I was actually pretty disappointed in its firepower. First, it broke up the pages I chose to use (in this case, page 35) in seemingly random and arbitrary places, sometimes in the middle of sentences, making the OCR’d text very difficult to read. I had to manually adjust and readjust the text boxes so it read as one full page instead of a bunch of awkward blocks of text. Even as a full page, there’s a ton of problems with the text. Some of the most noticeable misfires deal with the text in italics, and I guess understandably so, since those look the least like standard English characters. But in one instance, “Rosinante” becomes “t ofi;,.a. te”, which just isn’t even recognizable. Maybe I’m being a little too much like Simon Cowell here, but I honestly didn’t like using the OCR on Tristram Shandy. The time it takes to go back and fix and decipher all of its little bugs and errors is greater than the time it would have taken me to get through the original text and type it by hand.

Despite my negative experience with an OCR, I can see why they are a pretty useful tool for thinking about almost all novels in ways we previously thought were impossible. I’m sure there are better OCRs out there that can and have gotten through Tristram Shandy and others and made them into super-awesome squeaky-clean digital versions of texts, which we can then use for cool digital humanities stuff like we’re doing here. But in creating these digital texts of early novels, we sort of steamroll them into a format that they weren’t really intended to be consumed in, and we run the risk of losing some of that novel’s meaning that is meant to be conveyed by the physical structure of the pages it’s printed on. I think digital facsimiles are a good intermediary, because they retain some of the key parts of the novel’s physicality that are essential to its meaning while presenting it in an easily accessible, digitized format.


Excel is Useful, Sometimes.

I’ve attached a photo of a little graph I made in Excel to this post; if anyone wants to learn how to make it, just find me in class/around campus and I can show you! It’s pretty easy.

This is a visual representation of the clusters of words I pulled out of my word cloud that I referenced in my last blog post. It compares some of the most frequently used words to the overall number of words used in the novel. As you can see, these 11 words account for almost 25% of all words in the novel!
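The same comparison can be done outside Excel. The sketch below computes what share of all the words in a text belongs to a chosen cluster of words. It is an illustration under assumptions: the function name `cluster_share` is made up, the sample sentence is invented rather than quoted from the novel, and the four cluster words are just examples, since the full list of 11 words is not given here.

```python
import re
from collections import Counter


def cluster_share(text, cluster):
    """Return the fraction of all word tokens in text that belong to cluster."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in cluster)
    return hits / len(tokens)


# Invented sample sentence; swap in the full text of the novel to reproduce the graph.
sample = "Poor Pamela thought her good lady would know what to say"
share = cluster_share(sample, {"poor", "pamela", "good", "lady"})
print(f"{share:.1%} of the words belong to the cluster")
```

Whether stop words are counted in the total changes the percentage a lot, so it is worth noting which version of the text the denominator comes from.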

Lies and Feelings, as Told by Pamela.

First and foremost, I really enjoyed this assignment, and I’d be interested in doing this sort of textual analysis on other novels. Is there a way to get clean versions of other novels? I was thinking of doing this with Americanah.

One of my most interesting findings occurred when I was looking at my word cloud, after using the standard list of English stopwords and then adding my own to the list. I also removed mrs, said, went, quite, shall, sir, dear, and mr from my word cloud. And when I looked at what was left behind, I noticed that the top words fell into a few distinct clusters that I could recognize. The first of these clusters was the group good, poor, lady, little and Pamela. When taken into consideration together as some of the most frequent words in the novel, it raises questions about how the novel is attempting to get us to look at Pamela as a character, as a person, and as a woman. In a sense, the high frequency of these words, often used together, is priming us as readers to think of Pamela as a tiny, poor, virtuous woman throughout the novel. It is not enough to just mention “poor Pamela” once or twice; it happens all the time. The repetition of these words throughout the novel may be, on an almost subconscious level, informing our perceptions of Pamela and of female characters and of female subjectivities in the novel in general without us even realizing it. Another cluster of words I noticed were the words “think, thought, know, and say”. The high frequency of these words helps to show how Pamela is truly an early novel form, and not something else. In Pamela, the primary method of characterization is through what the characters say, think, feel, and do, not necessarily (but sometimes) the societal forces being pressed upon them. It may be possible to conclude that the heavy use of these words is helping to form the idea that characters should, and indeed often do, have individual thoughts, feelings, and subjectivities that make them who they are.

Another interesting thing that I found was during the part of the exercise where we explored the frequency of given words throughout the novel as a whole, in a sort of “chronological” sense. I noticed that the word “Honesty” is used semi-frequently throughout the first half of the novel, but then its use drops off almost completely in the second half. Is this because Pamela no longer feels the need to assert her honesty (often in reference to her Virtue), or is it because Pamela is becoming less of an honest character as the novel continues? And what does that say about her reliability as a narrator? All questions I still have after this exercise.
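A minimal sketch of that "chronological" view, assuming nothing about how the class tool works internally: split the text into equal-sized segments of words and count how often a word appears in each one. The function name `segment_counts` and the toy text are invented for illustration.

```python
import re


def segment_counts(text, word, segments=10):
    """Count occurrences of word in each of `segments` equal slices of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Ceiling division, so that every token falls inside one of the segments.
    size = max(1, -(-len(tokens) // segments))
    counts = [0] * segments
    target = word.lower()
    for index, token in enumerate(tokens):
        if token == target:
            counts[min(index // size, segments - 1)] += 1
    return counts


# Toy text: "honesty" only shows up in the first half.
toy_text = "honesty " * 3 + "filler " * 7 + "virtue " * 10
print(segment_counts(toy_text, "honesty", segments=2))
print(segment_counts(toy_text, "virtue", segments=2))
```

Run on the full text with two segments, a count that is high in the first slot and near zero in the second would match the drop-off in “Honesty” described above. Using more segments shows where in the novel the drop actually happens.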

Small bug note: if I wanted to compare relative frequencies of two words that were not on the same page (like page 3/14 for the list of frequencies), I had trouble getting both of them to show up on the graph.

Different methods, same old Euro-centrism.

2 min read

I was honestly kind of disappointed when I uploaded my .csv file into MyMaps and it returned a map with mostly American and European locations on it, sometimes erroneously and sometimes not.

Some highlights: Trinidad, Colorado. Amazon, Montana. Yorkshire, Maryland. Also, MyMaps just decided that a random place in Tanzania is a good place to label as "Africa". Did that happen to anyone else?

I know that I can attribute most of the American locations to the errors inherent within MyMaps' algorithms/US-centrism, but I still think there is something worth noting about the fact that a novel that takes place mostly on an uninhabited island in the middle of the Atlantic includes so many European locations. And I think we can attribute the abundance to the fact that Robinson Crusoe is, at its heart, a European novel. Its purpose was not necessarily to bridge different cultures, nor to inform people of diverse narratives of South American/Caribbean/African peoples, but to entertain a mostly European audience. And thus, I think the reason for the inclusion of so many European places was to make it more appealing to a European audience. And it makes sense; as a writer, you have to know your audience, and Defoe did just that when constructing his novel in this way.

I also think that this Euro-centrism is evidenced when we compare the map that was included with RC to the contemporary MyMaps map. If we look closely at what is labeled in the RC-era map, it's clear that the cartographer constructed Europe to look like something that many of us in the 21st century would recognize. However, large portions of North America are just plain missing, and there are some significant inaccuracies in the renderings of South America and Africa. And let's not even get into the fact that one of the major regions of Africa is "Negroland". My point is, it makes perfect sense that RC is full of references to European cities and countries because that's the world Defoe and his contemporaries knew the best; their ideas about Africa and the "New World" were probably in flux as new discoveries were being made.

I found a listicle on why we're fascinated by lists. Enjoy! #Assignment1

Assignment 1: Dates are weird.

2 min read

First and foremost, I think the experience of using the NER was fascinating. I've always wanted to think about novels in this sort of broad, quantitative way and I'm glad the NER finally allows us to do that.

The list that caught my eye the most was definitely the "DATE" list, because of the way that the formatting and presentation of dates, and thus the reader's sense of time, evolves over the course of the novel. As I pored through the entries (which I assume are presented in TextWrangler in the order in which they appear in Robinson Crusoe), I noticed two things: first, that the specificity of the dates mentioned in the novel experiences a gradual decline from the full month, date, and year (such as entry 5: 1st September 1659) to just the month, day, or even just the season by the end of the novel/list (entries 300 and 335: summer), and second, that Defoe really, really, REALLY likes Fridays for some reason. I guess we all like Fridays in a way, but it seems unusual for Defoe to mention Friday as many times as he did; I count well over 100 mentions of the word "Friday". It's also interesting to note that most of those mentions occur in the latter half of the list.

I was particularly drawn to the "DATE" list because, as an aspiring writer of novels, I often find myself having difficulty accurately conveying a sense of time, and deciding how much to explicitly write down versus letting my readers figure out on their own what time period I'm writing in. I enjoyed being able to see how frequently Defoe bothers to tell people what year or even what day of the week it is. I don't think I'd necessarily model my revelation of time after Defoe's in particular, but I'd love to try this exercise with novels by writers that I'm aspiring to write like, such as Chimamanda Ngozi Adichie or Jennifer Egan.

Questions I'm left with: If it's possible, how do we measure the distance in between two entries on these lists? Why does Defoe like Fridays so much? What other novels can we use the NER for?
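On the first question, one possible approach (a sketch, not something the NER itself provides as far as I know) is to look each list entry up in the text, record how many words come before it, and subtract. The function names and the sample sentence below are invented for illustration. Only the entries "1st September 1659", "summer", and "Friday" come from the list discussed above, and the sketch assumes the entries are given in the order they appear in the text.

```python
import re


def entity_offsets(text, entities):
    """Return the word offset of each entity, searching forward from the last match."""
    offsets = []
    cursor = 0
    for entity in entities:
        position = text.find(entity, cursor)
        if position == -1:
            offsets.append(None)  # entry not found after the previous match
            continue
        # Number of whitespace-separated words that come before the entity.
        offsets.append(len(re.findall(r"\S+", text[:position])))
        cursor = position + len(entity)
    return offsets


def gaps(offsets):
    """Distances, in words, between consecutive entries that were found."""
    found = [offset for offset in offsets if offset is not None]
    return [later - earlier for earlier, later in zip(found, found[1:])]


# Invented sentence standing in for the full text of the novel.
sample_text = ("On the 1st September 1659 I went on board. "
               "It was summer when Friday arrived.")
offsets = entity_offsets(sample_text, ["1st September 1659", "summer", "Friday"])
print(offsets, gaps(offsets))
```

Long gaps between consecutive DATE entries would then point to stretches of the novel where the narrator stops telling the reader what day it is.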


P. 14-15

1 min read

"As to going home, she opposed the best motions that offered to my thoughts; and it immediately occurr'd to me how I should be laugh'd at among the neighbors, and should be asham'd to see, not my father and mother only, but even every body else; from whence I have since often observed, how incongruous and irrational the common temper of mankind is, especially of youth, to that reason which ought to guide them in such cases, viz. that they are not asham'd to sin, and yet are asham'd to repent; not asham'd of the action for which they ought justly to be esteemed fools, but are asham'd of the returning, which only can make them be esteem'd wise men."