Shakespeare through the eyes of a chemist

Shakespeare’s plays have been analysed in some depth, but one question remains unanswered: what were his favourite chemicals? Let’s try to infer this from the most frequently occurring chemicals in his entire oeuvre.

NextMove’s LeadMine is state-of-the-art chemical text-mining software, provided as a Java library and a command-line application. I wrote a Jython script that uses the Java library to extract chemical entities from HTML files of Shakespeare’s plays as downloaded from MIT. Fortunately, LeadMine knows how to handle HTML (and XML in general, e.g. a .docx file). I also added a custom dictionary of a few well-known Shakespearean phrases just to see how that worked. The code is shown below, but first here are the results (which took 26 seconds, not including download time):

gold 229
water 147
silver 71
salt 50
iron 45
metal 26
ice 18
mercury 15
diamond 14
brine 9
ides of march 7
copper 7
curan 6
sulphur 4
metals 3
quicksilver 2
to be, or not to be 1
tin 1
vile jelly 1
ees 1
triton 1
silver water 1
carbon 1

It’s interesting to look at some of the more quirky results. Curan is a minor character in King Lear, and also a natural product. Triton is a Greek god of the sea as well as the nucleus of a hydrogen-3 atom. “ees” comes from an error in the text (“kn ees” instead of “knees”), but probably should not have been identified in any case. “carbon” is likewise the result of an error in the original text: “carbon ado” instead of “carbonado”.
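To see why a stray space is enough to produce hits like “ees” and “carbon”, here is a minimal Python sketch of whole-word dictionary matching. It is only an illustration: the function `find_terms` and the three-term dictionary are hypothetical, and this is not how LeadMine itself is implemented.

```python
import re

def find_terms(text, terms):
    """Return the dictionary terms that occur in text as whole words."""
    # Longest terms first so that multi-word phrases win over their parts.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [m.group(1).lower() for m in pattern.finditer(text)]

# Hypothetical mini-dictionary, for illustration only.
terms = {"carbon", "ees", "gold"}

print(find_terms("a carbonado", terms))    # intact word: no whole-word match
print(find_terms("a carbon ado", terms))   # split word: "carbon" is found
print(find_terms("on his kn ees", terms))  # split word: "ees" is found
```

With the word intact there is nothing to match; once the text is split, the fragment is indistinguishable from a genuine mention.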

And here’s the code (most of which is concerned with downloading the plays):

from __future__ import with_statement
import os
import urllib
import com.nextmovesoftware.leadmine as lm
import com.nextmovesoftware.leadmine.fsmgenerator as fsmgen
from collections import defaultdict

def processPlay(name, counts, engine):
    """Download a play (unless already cached on disk) and tally its entities."""
    print name
    filename = "%s.html" % name
    if not os.path.isfile(filename):
        urllib.urlretrieve("http://shakespeare.mit.edu/%s/full.html" % name, filename)

    with open(filename, "r") as f:
        text = f.read()
    results = engine.processString(text)
    for entity in results.entities:
        # Lower-case so that "Gold" and "gold" are counted together
        counts[entity.text.lower()] += 1

if __name__ == "__main__":
    # Custom dictionary of Shakespearean phrases
    mydict = fsmgen.CfxDictFromStrings.convertToCfxDictionary(
               ["to be, or not to be", "vile jelly", "ides of march"],
               "Shakey", False)

    # Add the custom dictionary to the default ones and create the engine
    dictionaries = lm.LeadMineConfig().dictionaries
    dictionaries.add(mydict)
    engine = lm.ExtractEngine(dictionaries)

    counts = defaultdict(int)
    if not os.path.isfile("main.html"):
        urllib.urlretrieve("http://shakespeare.mit.edu/", "main.html")
    with open("main.html") as index:
        for line in index:
            # Each play is linked as <a href="NAME/index.html">; skip the poetry
            idx = line.find("a href")
            if idx >= 0 and line.find("Poetry") < 0:
                name = line[idx+8:line.find("index.html")-1]
                processPlay(name, counts, engine)

    # Print the entities, most frequent first
    ans = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    for k, v in ans:
        print k, v
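For readers without Jython or LeadMine to hand, the two non-LeadMine steps of the script (finding the play names in the index page and tallying the mentions) can be sketched in Python 3. This is a sketch under assumptions rather than a drop-in replacement: the link shape `<a href="NAME/index.html">` is inferred from the string slicing in the script above, the sample HTML is made up, and `play_names` and `tally` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed link shape, inferred from the slicing in the script above:
#   <a href="hamlet/index.html">
PLAY_LINK = re.compile(r'a href="([^"/]+)/index\.html"', re.IGNORECASE)

def play_names(index_html):
    """Return the play directory names linked from the index page."""
    names = []
    for line in index_html.splitlines():
        if "Poetry" in line:  # same filter as the original script
            continue
        names.extend(PLAY_LINK.findall(line))
    return names

def tally(mentions):
    """Count mentions case-insensitively, most frequent first."""
    return Counter(m.lower() for m in mentions).most_common()

# Made-up sample input, for illustration only.
sample_index = (
    '<td><a href="hamlet/index.html">Hamlet</a></td>\n'
    '<td><a href="lear/index.html">King Lear</a></td>\n'
    '<td><a href="Poetry/sonnets.html">Sonnets</a></td>\n'
)

print(play_names(sample_index))
print(tally(["Gold", "water", "gold"]))
```

One difference in design: the regular expression only accepts links that end in index.html, whereas the slicing approach assumes that every non-poetry link on the page has that form.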

Image credit: tonynetone on Flickr

BioCreative announce chemical text mining competition results

NextMove recently participated in the BioCreative CHEMDNER (Chemical compound and drug name recognition) task. This task involved annotating chemical mentions in PubMed abstracts. BioCreative have annotated 10,000 abstracts, of which 7,000 were provided to participants for training. In mid-September participants were asked to identify mentions in the unseen test corpus of 3,000 abstracts (which, to avoid cheating, was combined with 17,000 decoy abstracts).

In total, 27 teams (23 academic and 4 commercial) submitted results. We achieved 85.0% recall at a precision of 88.7%, giving an F-score of 86.9%. Our solution ranked amongst the best submitted, being only 0.53% from the best-performing solution in the chemical entity mentions task and significantly ahead of the other commercial solutions. Inter-annotator agreement was 91%, indicating that with recent advances in machine annotation, automated systems are rapidly approaching the quality of human abstractors.
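As a reminder of how the headline number relates to the other two, the sketch below computes the F-score as the harmonic mean of precision and recall. It assumes that the figure quoted is the balanced F1 measure; the function name `f_score` is just for illustration.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded figures quoted above
print(round(100 * f_score(0.887, 0.850), 1))
```

With the rounded inputs this comes out at about 86.8%, so the reported 86.9% was presumably calculated from the unrounded precision and recall.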

Participation in this competition has driven recent developments in LeadMine including improved coverage of non-systematic chemical entities and detection of abbreviations.

If you want to know the full details, our proceedings paper is available here, and you can find out how we compared in the full proceedings here (results on p14, list of teams on p31). The presentation below, which I gave at the workshop, summarises our system:

Compile once, run anywhere – Portable cheminformatics applications

Often when you install C++ software, a number of different DLLs are installed along with the principal one. For example, if you install Open Babel on Windows, as well as the Open Babel core DLL there are DLLs for cairo, png, xml2 and various plugin DLLs for Open Babel functionality. In theory this allows the user to swap in a new version of a DLL or remove particular plugins that are unnecessary. In general, however, this flexibility is not exploited and for most users this is all just clutter. A better user experience might be to present a single file containing all of the necessary functionality. This is something I’ve been working on recently for our own software.

As another step towards portable applications, we have been moving away from using MSVC to using MinGW (and now MinGW-64) to statically compile our software. With the right flags a C++ application can be created that statically includes the C++ runtime and so is no longer dependent on the infamous MSVC++ runtime DLLs (which may or may not be present), or on any Cygwin or MinGW DLLs. As a result, it will run on any Windows version from WinXP onwards.

When you figure out how to generate these statically-linked libraries and applications, they work just great. Getting to that point is a bit tricky though. Luckily for me we have some in-house expertise (“ROGER – it’s happened again!”). The following presentation which I gave as a lightning talk at the recent RDKit UGM gives an overview of the approach and rationale:

Competitive text-mining and molecular data exchange at ACS Indy

Two talks on very different topics were presented by NextMovers at the recent ACS National Meeting in Indianapolis.

Daniel described how LeadMine uses two-state grammars to find dictionary terms, as well as some of the clever techniques he has used to improve performance for the BioCreative CHEMDNER chemical compound text-mining challenge (results not yet announced). Separately, Roger spoke about some of the challenges in the supposedly trivial tasks of reading and writing molecular file formats (for more on the MDLValence benchmark discussed here, see an earlier blog post).