Exercise 6: Metadata

Playing around with the metadata confirmed a trend we’ve talked about in class: most of the novels in the corpus were published in London, with a smaller but significant percentage coming out of Dublin and the remainder scattered among other locations. Between 1700 and 1740, 94.9% of the novels in the sample were published in London, only 2.2% in Dublin, and none in other publishing cities such as Edinburgh or Bath. Meanwhile, in the second half of the sampled era, between 1740 and 1779, only 85.7% of the novels were published in London, with 12.3% in Dublin and small but notable percentages in Edinburgh, Glasgow, and a couple of other cities. I think it’s safe to say that this trend speaks to the increasing popularity of novels outside of centers like London and the growing tradition of reprinting and pirating books. However, this sample of 855 novels is dubiously representative of All 18th Century Novels, and it seems possible that this trend and others speak just as much to idiosyncrasies and oversampling in this particular corpus as to actual patterns. Partly because of this, and because of the unwieldy and idiosyncratic nature of categories like TitleNouns and AuthorDates, I had trouble seeing the utility of the metadata and finding anything really exciting in it when playing around with Google Fusion.
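For anyone who wants to recompute percentages like these outside Google Fusion, here is a minimal Python sketch. The field names `year` and `place` and the sample rows are hypothetical stand-ins, not the corpus's actual columns or data. I also assume half-open date ranges so that a novel from 1740 is counted in only one period.

```python
from collections import Counter

def place_shares(rows, start, end):
    """Return the percentage of novels per place with start <= year < end."""
    places = [row["place"] for row in rows if start <= row["year"] < end]
    total = len(places)
    if total == 0:
        return {}
    return {place: round(100 * n / total, 1) for place, n in Counter(places).items()}

# Hypothetical rows for illustration only; not taken from the corpus.
rows = [
    {"year": 1705, "place": "London"},
    {"year": 1722, "place": "London"},
    {"year": 1731, "place": "London"},
    {"year": 1738, "place": "Dublin"},
    {"year": 1752, "place": "London"},
    {"year": 1760, "place": "Dublin"},
]

early = place_shares(rows, 1700, 1740)
late = place_shares(rows, 1740, 1780)
print(early)
print(late)
```

Comparing `early` and `late` side by side gives the same kind of London-versus-Dublin contrast described above, and the same function would work for any other pair of periods.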

The data, and the tools we have to analyze it, are somewhat limited. I thought it would be interesting to trace the prominence of particular types of paratext in conjunction with each other over time. Specifically, I wanted to see how often pieces of paratext coded as “Preface” and pieces of paratext coded as “To the reader” occurred in the same novel over time. Their co-occurrence might be a rough proxy for the amount of hedging, snark, and/or authorial self-abasement addressed to readers and editors. However, because all the types of paratext (preface, advertisement, errata, etc.) are lumped together in one column (paratextTitleControlled), charting the rise of a couple of individual types of paratext doesn’t seem to be possible. For instance, I would want a filter to pick up on Samuel Richardson’s Clarissa as having both a preface and a “To the reader” section, as the novel is described as having “Preface, Character information, To the reader, Errata.” But from what I can figure out, a Google Fusion bar chart of publication date, filtered by “Preface” and “To the reader” in the category paratextTitleControlled, would only show novels whose paratext has been coded as exactly those two items in exactly that order, leaving out novels whose paratext was coded in a different order or with other items in between. The search treats “Preface, Character information, To the reader, Errata” as a different value from “Preface, To the reader” rather than recognizing it as the combined occurrence of a preface AND a “To the reader” along with some other stuff (Errata and Character information) that’s irrelevant to my search.

To illustrate, Image 1 is a chart of paratext over time, filtered by “Preface, To the reader.”

Image 2 is a chart of paratext over time, filtered by “To the reader, Preface.” Even though this chart covers the same two types of paratext, just listed in a different order, it is totally different.

Notably, neither of the two charts above includes Clarissa at all, since the filter can only match the two types of paratext when they appear on their own in the exact order listed.
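The kind of filter I was hoping for could be sketched in a few lines of Python: split each paratextTitleControlled cell on commas and check for the wanted labels as a set, so order and extra items stop mattering. This is only a rough sketch under my own assumptions. The helper names `parse_values` and `has_all` are made up, and the rows other than Clarissa are hypothetical examples.

```python
def parse_values(cell):
    """Split a comma-separated cell into a set of trimmed, lowercased labels."""
    return {part.strip().lower() for part in cell.split(",") if part.strip()}

def has_all(cell, wanted):
    """Return True if every wanted label appears in the cell, in any order."""
    return {w.lower() for w in wanted} <= parse_values(cell)

# Only the Clarissa value is quoted from the metadata; the others are hypothetical.
rows = [
    {"title": "Clarissa",
     "paratextTitleControlled": "Preface, Character information, To the reader, Errata"},
    {"title": "Example A", "paratextTitleControlled": "To the reader, Preface"},
    {"title": "Example B", "paratextTitleControlled": "Preface, Errata"},
]

wanted = ["Preface", "To the reader"]
matches = [r["title"] for r in rows if has_all(r["paratextTitleControlled"], wanted)]
print(matches)  # ['Clarissa', 'Example A']
```

Treating the cell as a set rather than a single string is what lets Clarissa match here even though its paratext list has other items between the two I care about.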

Similarly, it might be interesting to look at the volume and frequency of particular title nouns (or adjectives, but I was looking at nouns) over time. The word cloud I made (Image 3) shows which nouns occur most often, as raw counts, across the whole corpus we have data on, but it doesn’t let you visualize changes over time. A bar chart would be more helpful for that, but again, if you wanted to look at, say, “French” and “amour” in conjunction over time to see what, if anything, you could learn about how novelists imagined the French, the filter would only be able to pick out titles where the coder had listed “French” and “amour” in the same order you entered the terms into the filter.
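The same set-based idea could, in principle, be extended to count co-occurring title nouns per decade, which is roughly the bar chart I wanted. Again, this is a sketch under assumptions: I assume TitleNouns is a comma-separated list, the `year` field name is hypothetical, and the sample rows are invented for illustration.

```python
from collections import Counter

def split_terms(cell):
    """Split a comma-separated cell into a set of trimmed, lowercased terms."""
    return {t.strip().lower() for t in cell.split(",") if t.strip()}

def cooccurrence_by_decade(rows, terms):
    """Count, per decade, the rows whose TitleNouns contain all the given terms."""
    wanted = {t.lower() for t in terms}
    counts = Counter()
    for row in rows:
        if wanted <= split_terms(row["TitleNouns"]):
            decade = (row["year"] // 10) * 10
            counts[decade] += 1
    return dict(sorted(counts.items()))

# Hypothetical rows for illustration only; not taken from the corpus.
rows = [
    {"year": 1712, "TitleNouns": "amour, French"},
    {"year": 1718, "TitleNouns": "French, history, amour"},
    {"year": 1745, "TitleNouns": "French, lady"},
    {"year": 1751, "TitleNouns": "amour, count, French"},
]

result = cooccurrence_by_decade(rows, ["French", "amour"])
print(result)  # {1710: 2, 1750: 1}
```

The resulting decade counts could then be fed into any charting tool, and the order in which the coder listed the nouns would no longer affect the result.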

The human arbitrariness of the way the novels were coded (e.g. some novels’ paratext is listed as “To the reader, Preface” while others’ is listed as “Preface, To the reader”), the way the categories are formatted, and the relative simplicity of Google Fusion combine to make it difficult to look at how multiple values interact. More sophisticated analysis tools, and a more sophisticated understanding of how to use them on my part, would let me get at the more multidimensional ways that values interact within and across the different categories of metadata.