Shakespeare’s plays have been analysed in some depth, but one question remains unanswered: what were his favourite chemicals? Let’s try to infer this from the most frequently occurring chemicals in his entire oeuvre.
NextMove’s LeadMine is a state-of-the-art chemical text-mining tool, provided as a Java library and a command-line application. I wrote a Jython script that uses the Java library to extract chemical entities from HTML files of Shakespeare’s plays as downloaded from MIT. Fortunately, LeadMine knows how to handle HTML (and XML in general, e.g. a .docx file). I also added a custom dictionary of a few well-known Shakespearean phrases just to see how that worked. The code is shown below, but first here are the results (which took 26 seconds, not including download time):
gold 229
water 147
silver 71
salt 50
iron 45
metal 26
ice 18
mercury 15
diamond 14
brine 9
ides of march 7
copper 7
curan 6
sulphur 4
metals 3
quicksilver 2
to be, or not to be 1
tin 1
vile jelly 1
ees 1
triton 1
silver water 1
carbon 1
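The ranking above is a case-insensitive tally of entity mentions, ordered by count. As a minimal sketch of that step in isolation (plain Python, with a made-up list of strings standing in for the entities LeadMine returns; `rank_entities` and `sample` are illustrative names, not part of my script):

```python
from collections import Counter

def rank_entities(mentions):
    """Tally mentions case-insensitively and rank them by descending count."""
    counts = Counter(m.lower() for m in mentions)
    # most_common() returns (name, count) pairs, highest count first
    return counts.most_common()

# Hypothetical stand-in for the entity strings an extraction engine would return
sample = ["Gold", "water", "gold", "Silver", "GOLD", "water"]

for name, count in rank_entities(sample):
    print("%s %d" % (name, count))
```

Lower-casing before counting is what merges “Gold” and “gold” into a single row.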
It’s interesting to look at some of the more quirky results. Curan is a minor character in King Lear, and also a natural product. Triton is a Greek god of the sea as well as the nucleus of a hydrogen-3 (tritium) atom. “ees” comes from an error in the text, “kn ees”, but probably should not have been identified in any case. “carbon” is also the result of an error in the original text: “carbon ado” instead of “carbonado”.
And here’s the code (most of which is concerned with downloading the plays):
from __future__ import with_statement
import os
import urllib
import com.nextmovesoftware.leadmine as lm
import com.nextmovesoftware.leadmine.fsmgenerator as fsmgen
from collections import defaultdict

def processPlay(name, counts):
    # Download the play once, then count every entity LeadMine finds in it
    print name
    if not os.path.isfile("%s.html" % name):
        urllib.urlretrieve("http://shakespeare.mit.edu/%s/full.html" % name, "%s.html" % name)
    with open("%s.html" % name, "r") as f:
        text = f.read()
    results = engine.processString(text)
    for entity in results.entities:
        counts[entity.text.lower()] += 1  # case-insensitive tally

if __name__ == "__main__":
    # Custom dictionary of Shakespearean phrases, added to the default dictionaries
    mydict = fsmgen.CfxDictFromStrings.convertToCfxDictionary(
        ["to be, or not to be", "vile jelly", "ides of march"], "Shakey", False)
    dictionaries = lm.LeadMineConfig().dictionaries
    dictionaries.add(mydict)
    engine = lm.ExtractEngine(dictionaries)

    counts = defaultdict(int)
    if not os.path.isfile("main.html"):
        urllib.urlretrieve("http://shakespeare.mit.edu/", "main.html")
    # Each play is linked from the index page as <name>/index.html; skip the poetry
    for line in open("main.html"):
        idx = line.find("a href")
        if idx >= 0 and line.find("Poetry") < 0:
            name = line[idx+8:line.find("index.html")-1]
            processPlay(name, counts)

    # Sort by count (not by name), most frequent first
    ans = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    for k, v in ans:
        print k, v
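The string slicing used to pull play names out of the index page is a little fragile. One possible alternative, sketched here with a regular expression, assumes only that each play is linked as `<name>/index.html`, which is the layout the script relies on. The names `PLAY_LINK` and `play_names` are illustrative, and the sample markup is invented rather than copied from the MIT site:

```python
import re

# Matches href values of the form "<name>/index.html" and captures <name>
PLAY_LINK = re.compile(r'href="([^"/]+)/index\.html"', re.IGNORECASE)

def play_names(index_html):
    """Return the play directory names found in the index page, in page order."""
    return PLAY_LINK.findall(index_html)

# Hypothetical excerpt of an index page, not the real MIT markup
sample = '''
<a href="allswell/index.html">All's Well That Ends Well</a>
<a href="Poetry/sonnets.html">The Sonnets</a>
<a href="hamlet/index.html">Hamlet</a>
'''

print(play_names(sample))
```

Links that do not end in `/index.html`, such as the poetry link in the sample, simply fail to match, so no separate check for “Poetry” is needed under this assumption.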
Image credit: tonynetone on Flickr