{"id":710,"date":"2013-10-28T09:45:59","date_gmt":"2013-10-28T09:45:59","guid":{"rendered":"http:\/\/nextmovesoftware.com\/blog\/?p=710"},"modified":"2015-06-22T16:46:55","modified_gmt":"2015-06-22T15:46:55","slug":"shakespeare-through-the-eyes-of-a-chemist","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2013\/10\/28\/shakespeare-through-the-eyes-of-a-chemist\/","title":{"rendered":"Shakespeare through the eyes of a chemist"},"content":{"rendered":"<p><a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2013\/10\/shakey.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"\/blog\/wp-content\/uploads\/2013\/10\/shakey.jpg\" alt=\"shakey\" width=\"239\" height=\"240\" class=\"alignright size-full wp-image-725\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2013\/10\/shakey.jpg 239w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2013\/10\/shakey-150x150.jpg 150w\" sizes=\"(max-width: 239px) 100vw, 239px\" \/><\/a>Shakespeare&#8217;s plays have been analysed in some depth, but one question remains unanswered: what were his favourite chemicals? Let&#8217;s try to infer this from the most frequently occurring chemicals in his entire oeuvre.<\/p>\n<p>NextMove&#8217;s <a href=\"http:\/\/nextmovesoftware.com\/leadmine\">LeadMine<\/a> is a <a href=\"http:\/\/nextmovesoftware.com\/blog\/2013\/10\/11\/biocreative-announce-chemical-text-mining-competition-results\/\">state-of-the-art<\/a> chemical text-mining tool, provided as a Java library and command-line application. I wrote a Jython script that uses the Java library to extract chemical entities from HTML files of Shakespeare&#8217;s plays downloaded from <a href=\"http:\/\/shakespeare.mit.edu\/\">MIT<\/a>. Fortunately, LeadMine knows how to handle HTML (and XML in general &#8211; e.g. a .docx file). I also added a custom dictionary of a few well-known Shakespearean phrases just to see how that worked. 
The code is shown below, but first here are the results (which took 26 seconds, not including download time):<\/p>\n<pre>\r\ngold 229\r\nwater 147\r\nsilver 71\r\nsalt 50\r\niron 45\r\nmetal 26\r\nice 18\r\nmercury 15\r\ndiamond 14\r\nbrine 9\r\nides of march 7\r\ncopper 7\r\ncuran 6\r\nsulphur 4\r\nmetals 3\r\nquicksilver 2\r\nto be, or not to be 1\r\ntin 1\r\nvile jelly 1\r\nees 1\r\ntriton 1\r\nsilver water 1\r\ncarbon 1\r\n<\/pre>\n<p>It&#8217;s interesting to look at some of the more quirky results. Curan is a minor character in King Lear, and also a <a href=\"http:\/\/www.chem.qmul.ac.uk\/iupac\/sectionF\/alkaloid\/alk20.html\">natural product<\/a>. Triton is a Greek god of the sea, and also the name of the hydrogen-3 (tritium) nucleus. &#8220;ees&#8221; comes from an error in the source text, &#8220;kn ees&#8221;, but probably should not have been identified in any case. Similarly, &#8220;carbon&#8221; comes from an error in the source text, &#8220;carbon ado&#8221; instead of &#8220;carbonado&#8221;.<\/p>\n<p>And here&#8217;s the code (most of which is concerned with downloading the plays):<\/p>\n<style>\npre { font-family: monospace; color: #000000; background-color: #ffffff; }\n.Constant { color: #ff00ff; }\n.Identifier { color: #008080; }\n.Comment { color: #0000ff; }\n.Statement { color: #804040; font-weight: bold; }\n.PreProc { color: #a020f0; }\n<\/style>\n<pre>\r\n<span class=\"PreProc\">from<\/span> __future__ <span class=\"PreProc\">import<\/span> with_statement\r\n<span class=\"PreProc\">import<\/span> os\r\n<span class=\"PreProc\">import<\/span> urllib\r\n<span class=\"PreProc\">import<\/span> com.nextmovesoftware.leadmine <span class=\"Statement\">as<\/span> lm\r\n<span class=\"PreProc\">import<\/span> com.nextmovesoftware.leadmine.fsmgenerator <span class=\"Statement\">as<\/span> fsmgen\r\n<span class=\"PreProc\">from<\/span> collections <span class=\"PreProc\">import<\/span> defaultdict\r\n\r\n<span class=\"Statement\">def<\/span> <span 
class=\"Identifier\">processPlay<\/span>(name, counts):\r\n    <span class=\"Identifier\">print<\/span> name\r\n    <span class=\"Statement\">if<\/span> <span class=\"Statement\">not<\/span> os.path.isfile(<span class=\"Constant\">&quot;%s.html&quot;<\/span> % name):\r\n        urllib.urlretrieve(<span class=\"Constant\">&quot;<a href=\"http:\/\/shakespeare.mit.edu\/%s\/full.html\">http:\/\/shakespeare.mit.edu\/%s\/full.html<\/a>&quot;<\/span> % name, <span class=\"Constant\">&quot;%s.html&quot;<\/span> % name)\r\n\r\n    <span class=\"Statement\">with<\/span> <span class=\"Identifier\">open<\/span>(<span class=\"Constant\">&quot;%s.html&quot;<\/span> % name, <span class=\"Constant\">&quot;r&quot;<\/span>) <span class=\"Statement\">as<\/span> f:\r\n        text = f.read()\r\n        results = engine.processString(text)\r\n        <span class=\"Statement\">for<\/span> entity <span class=\"Statement\">in<\/span> results.entities:\r\n            counts[entity.text.lower()] += <span class=\"Constant\">1<\/span>\r\n\r\n<span class=\"Statement\">if<\/span> __name__ == <span class=\"Constant\">&quot;__main__&quot;<\/span>:\r\n    mydict = fsmgen.CfxDictFromStrings.convertToCfxDictionary(\r\n               [<span class=\"Constant\">&quot;to be, or not to be&quot;<\/span>, <span class=\"Constant\">&quot;vile jelly&quot;<\/span>, <span class=\"Constant\">&quot;ides of march&quot;<\/span>],\r\n               <span class=\"Constant\">&quot;Shakey&quot;<\/span>, <span class=\"Identifier\">False<\/span>)\r\n\r\n    dictionaries = lm.LeadMineConfig().dictionaries\r\n    dictionaries.add(mydict)\r\n    engine = lm.ExtractEngine(dictionaries)\r\n\r\n    counts = defaultdict(<span class=\"Identifier\">int<\/span>)\r\n    <span class=\"Statement\">if<\/span> <span class=\"Statement\">not<\/span> os.path.isfile(<span class=\"Constant\">&quot;main.html&quot;<\/span>):\r\n        urllib.urlretrieve(<span class=\"Constant\">&quot;<a 
href=\"http:\/\/shakespeare.mit.edu\/\">http:\/\/shakespeare.mit.edu\/<\/a>&quot;<\/span>, <span class=\"Constant\">&quot;main.html&quot;<\/span>)\r\n    <span class=\"Statement\">for<\/span> line <span class=\"Statement\">in<\/span> <span class=\"Identifier\">open<\/span>(<span class=\"Constant\">&quot;main.html&quot;<\/span>):\r\n        idx = line.find(<span class=\"Constant\">&quot;a href&quot;<\/span>)\r\n        <span class=\"Statement\">if<\/span> idx &gt;= <span class=\"Constant\">0<\/span> <span class=\"Statement\">and<\/span> line.find(<span class=\"Constant\">&quot;Poetry&quot;<\/span>) &lt; <span class=\"Constant\">0<\/span>:\r\n            <span class=\"Comment\"># idx+8 skips the 8-character href attribute prefix; -1 drops the slash before index.html<\/span>\r\n            name = line[idx+<span class=\"Constant\">8<\/span>:line.find(<span class=\"Constant\">&quot;index.html&quot;<\/span>)-<span class=\"Constant\">1<\/span>]\r\n            processPlay(name, counts)\r\n    ans = <span class=\"Identifier\">sorted<\/span>(counts.items(), key=<span class=\"Statement\">lambda<\/span> x: x[<span class=\"Constant\">1<\/span>], reverse=<span class=\"Identifier\">True<\/span>)\r\n    <span class=\"Statement\">for<\/span> k, v <span class=\"Statement\">in<\/span> ans:\r\n        <span class=\"Identifier\">print<\/span> k, v\r\n\r\n<\/pre>\n<p><b>Image credit:<\/b> <a href=\"http:\/\/www.flickr.com\/photos\/tonynetone\/\">tonynetone<\/a> on Flickr<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Shakespeare&#8217;s plays have been analysed in some depth, but one question remains unanswered: what were his favourite chemicals? Let&#8217;s try to infer this from the most frequently occurring chemicals in his entire oeuvre. NextMove&#8217;s LeadMine is a state-of-the-art chemical text-mining tool, provided as a Java library and command-line application. 
I wrote a Jython &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2013\/10\/28\/shakespeare-through-the-eyes-of-a-chemist\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Shakespeare through the eyes of a chemist<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/710"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=710"}],"version-history":[{"count":22,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/710\/revisions"}],"predecessor-version":[{"id":1438,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/710\/revisions\/1438"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=710"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=710"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=710"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}