Shakespeare through the eyes of a chemist

Shakespeare’s plays have been analysed in some depth, but one question remains unanswered: what were his favourite chemicals? Let’s try to infer this from the most frequently occurring chemicals in his entire oeuvre.

NextMove’s LeadMine is state-of-the-art chemical text-mining software, provided as a Java library and a command-line application. I wrote a Jython script that uses the Java library to extract chemical entities from HTML files of Shakespeare’s plays as downloaded from MIT. Fortunately, LeadMine knows how to handle HTML (and XML in general – e.g. a .docx file). I also added a custom dictionary of a couple of well-known Shakespearean phrases just to see how that worked. The code is shown below, but first here are the results (which took 26 seconds, not including download time):

gold 229
water 147
silver 71
salt 50
iron 45
metal 26
ice 18
mercury 15
diamond 14
brine 9
ides of march 7
copper 7
curan 6
sulphur 4
metals 3
quicksilver 2
to be, or not to be 1
tin 1
vile jelly 1
ees 1
triton 1
silver water 1
carbon 1

It’s interesting to look at some of the more quirky results. Curan is a minor character in King Lear, and also a natural product. Triton is a Greek god of the sea as well as being the nucleus of a hydrogen-3 (tritium) atom. “ees” comes from an error in the text, “kn ees”, but probably should not have been identified in any case. “carbon” is also the result of an error in the original text: “carbon ado” instead of “carbonado”.
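
To see why a stray space is enough to produce a hit like “carbon”, here is a minimal sketch of whole-word dictionary matching. It is not how LeadMine works internally; the `TERMS` list, the `count_terms` helper and the two sample strings are all made up for illustration, and the code is plain Python rather than the Jython used for the real run.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary, for illustration only.
TERMS = ["carbon", "triton", "gold"]

def count_terms(text, terms=TERMS):
    """Count whole-word, case-insensitive occurrences of each term."""
    counts = Counter()
    for term in terms:
        # \b only matches at a word boundary, so "carbon" inside
        # "carbonado" is skipped, but "carbon ado" gives a hit.
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] += len(hits)
    return counts

# Made-up sample strings: one intact, one with the stray space.
clean = count_terms("a carbonado of gold")
broken = count_terms("a carbon ado of gold")
```

With the intact word, only “gold” is counted; with the stray space, “carbon” turns into a word of its own and is counted too.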

And here’s the Jython script (most of which is concerned with downloading the plays):

from __future__ import with_statement  # enables "with" on Python/Jython 2.5
import os
import urllib
import com.nextmovesoftware.leadmine as lm
import com.nextmovesoftware.leadmine.fsmgenerator as fsmgen
from collections import defaultdict

def processPlay(name, counts):
    """Download one play (unless already cached) and tally its entities."""
    print name
    filename = "%s.html" % name
    if not os.path.isfile(filename):
        urllib.urlretrieve("http://shakespeare.mit.edu/%s/full.html" % name, filename)

    with open(filename, "r") as f:
        text = f.read()
    # "engine" is the module-level ExtractEngine created in the main block
    results = engine.processString(text)
    for entity in results.entities:
        # Lower-case so that "Gold" and "gold" are counted together
        counts[entity.text.lower()] += 1

if __name__ == "__main__":
    # Custom dictionary of well-known Shakespearean phrases
    mydict = fsmgen.CfxDictFromStrings.convertToCfxDictionary(
               ["to be, or not to be", "vile jelly", "ides of march"],
               "Shakey", False)

    # Add it to the dictionaries from the LeadMine configuration
    dictionaries = lm.LeadMineConfig().dictionaries
    dictionaries.add(mydict)
    engine = lm.ExtractEngine(dictionaries)

    counts = defaultdict(int)
    if not os.path.isfile("main.html"):
        urllib.urlretrieve("http://shakespeare.mit.edu/", "main.html")
    with open("main.html") as f:
        for line in f:
            idx = line.find("a href")
            # Skip lines without a link, and the links to the poetry
            if idx >= 0 and line.find("Poetry") < 0:
                # Slice the play's directory name out of href="<name>/index.html"
                name = line[idx+8:line.find("index.html")-1]
                processPlay(name, counts)
    # Most frequent entities first
    ans = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    for k, v in ans:
        print k, v
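
The slicing used to pull each play’s name out of the index page is terse, so here is a small sketch of what it does on one line, next to a regular-expression alternative. The `sample` line is an assumption about the shape of the links on the MIT index page, and both helper names are invented for this illustration; the code is plain Python and does not need LeadMine.

```python
import re

def play_name_by_slice(line):
    """Same arithmetic as the script: skip past 'a href="' (8 characters
    from the 'a') and stop just before the '/' preceding 'index.html'."""
    idx = line.find("a href")
    if idx >= 0 and line.find("Poetry") < 0:
        return line[idx + 8:line.find("index.html") - 1]
    return None

def play_name_by_regex(line):
    """Alternative: only accept links of the form href="<name>/index.html"."""
    m = re.search(r'href="([^"/]+)/index\.html"', line)
    return m.group(1) if m else None

# Assumed shape of a link line on the index page (not copied from the site).
sample = '<a href="hamlet/index.html">Hamlet</a><br>'

name_from_slice = play_name_by_slice(sample)
name_from_regex = play_name_by_regex(sample)
```

Both helpers agree on a line of this shape. The difference shows up on links that do not end in index.html: the slice then returns a meaningless fragment, whereas the regular expression returns None and the line can simply be skipped.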

Image credit: tonynetone on Flickr
