Handling biologics: A file format problem?

The increasing importance of biological therapeutics, or biologics, to the pharmaceutical industry is well-known. For example, data from Drugs.com show that of the top 15 best selling therapies in the US in Q4 2012, six were biologics. Monoclonal antibodies are a typical example; these are glycoproteins, comprising of short oligosaccharides attached to a multi-chain polypeptide.

It is clear that handling such molecules requires a different approach than that taken for small-molecules. For example, here is an all-atom depiction of the peptide crambin:
Crambin
No – it’s not a cyclic peptide. It just happens to have three disulfide bridges. A more useful depiction can be generated if we follow the IUPAC or FDA guidelines for peptide depiction; here the primary structure is much clearer as is the presence of the disulfide bonds:

FDA
FDA Style
IUPAC
IUPAC Style

However, to create these sorts of depictions, and otherwise handle biopolymers more appropriately, we need to know the polymer structure.

Some consider this a file format problem. Some file formats which have been developed to store or represent biopolymer structures include the CHUCKLES and CHORTLES languages from Chiron and Daylight, HELM (Hierarchical Editing Language for Macromolecules) from Pfizer, Protein Line Notation from Biochemfusion and SCSR (Self-Contained Sequence Representation, an MDL V3000 extension) from Accelrys. Naturally, Wisswesser Line Notation has also been extended to handle this problem.

In particular, the HELM format has recently received support from the Pistoia Alliance. See for example this post on the Pistoia blog which describes how HELM “gives us a single consistent way to describe macromolecules which can be used across industry and academia” so that “researchers do not have to spend time creating their own notations”.

But is a new file format the best way to achieve this goal? (I can’t resist inserting the xkcd comic on standards at this point 🙂 )

From xkcd, the web comic: http://xkcd.com/927/

While NextMove’s software for handling biopolymers, Sugar & Splice, will handle popular file formats such as HELM, I will describe a different view of the problem in the follow-up blog post.

Looking for a C++ cheminformatics toolkit?

If you are writing a C++ application that uses a cheminformatics toolkit, a key step during the configuration process is to tell the build system the location of the toolkit libraries and include files. However, in an ideal world, the build system would find these automatically based on standard install locations or hints such as the value of RDBASE for RDKit.

We have incorporated some of these ideas into CMake modules to find Open Babel, OEChem or RDKit. As a service to the community, we are placing these into the public domain (E&OE). Enjoy.

# FindOpenBabel.cmake
# Placed in the public domain by NextMove Software in 2013
# Try to find Open Babel headers and libraries
# Defines:
#
#  OPENBABEL_FOUND - system has Open Babel
#  OPENBABEL_INCLUDE_DIRS - the Open Babel include directories
#  OPENBABEL_LIBRARIES - Link these to use Open Babel

if(OPENBABEL_INCLUDE_DIRS AND OPENBABEL_LIBRARIES)
  # in cache already or user-specified
  set(OPENBABEL_FOUND TRUE)

else()

  if(NOT OPENBABEL_INCLUDE_DIRS)
    if(WIN32)
      find_path(OPENBABEL_INCLUDE_DIR openbabel/obconversion.h
        PATHS
        ${OPENBABEL_DIR}\\include
        $ENV{OPENBABEL_INCLUDE_DIR}
        $ENV{OPENBABEL_INCLUDE_DIR}\\openbabel-2.0
        $ENV{OPENBABEL_INCLUDE_PATH}
        $ENV{OPENBABEL_INCLUDE_PATH}\\openbabel-2.0
        $ENV{OPENBABEL_DIR}\\include
        $ENV{OPENBABEL_PATH}\\include
        $ENV{OPENBABEL_BASE}\\include
        C:\\OpenBabel\\include
      )
      find_path(OPENBABEL_INCLUDE_DIR2 openbabel/babelconfig.h
        PATHS
        ${OPENBABEL_DIR}/windows-vc2008/build/include
        $ENV{OPENBABEL_DIR}/windows-vc2008/build/include
        )
      set(OPENBABEL_INCLUDE_DIRS ${OPENBABEL_INCLUDE_DIR} ${OPENBABEL_INCLUDE_DIR2})
    else()
      find_path(OPENBABEL_INCLUDE_DIRS openbabel/obconversion.h
        PATHS
        ${OPENBABEL_DIR}/include
        $ENV{OPENBABEL_INCLUDE_DIR}/openbabel-2.0
        $ENV{OPENBABEL_INCLUDE_DIR}
        $ENV{OPENBABEL_INCLUDE_PATH}/openbabel-2.0
        $ENV{OPENBABEL_INCLUDE_PATH}
        $ENV{OPENBABEL_DIR}/include/openbabel-2.0
        $ENV{OPENBABEL_DIR}/include
        $ENV{OPENBABEL_PATH}/include/openbabel-2.0
        $ENV{OPENBABEL_PATH}/include
        $ENV{OPENBABEL_BASE}/include/openbabel-2.0
        $ENV{OPENBABEL_BASE}/include
        /usr/include/openbabel-2.0
        /usr/include
        /usr/local/include/openbabel-2.0
        /usr/local/include
        /usr/local/openbabel/include/openbabel-2.0
        /usr/local/openbabel/include
        /usr/local/openbabel-2.0/include/openbabel-2.0
        /usr/local/openbabel-2.0/include
        ~/include/openbabel-2.0
        ~/include
      )
    endif()
    if(OPENBABEL_INCLUDE_DIRS)
      message(STATUS "Found Open Babel include files at ${OPENBABEL_INCLUDE_DIRS}")
    endif()
  endif()

  if(NOT OPENBABEL_LIBRARY_DIR)
    find_library(OPENBABEL_LIBRARIES NAMES openbabel openbabel-2
      PATHS
      ${OPENBABEL_DIR}/lib
      ${OPENBABEL_DIR}/windows-vc2008/build/src/Release
      $ENV{OPENBABEL_LIBRARIES}
      $ENV{OPENBABEL_DIR}/lib
      $ENV{OPENBABEL_DIR}/windows-vc2008/build/src/Release
      $ENV{OPENBABEL_PATH}/lib
      $ENV{OPENBABEL_BASE}/lib
      /usr/lib
      /usr/local/lib
      ~/lib
      $ENV{LD_LIBRARY_PATH}
    )
    if(OPENBABEL_LIBRARIES)
      message(STATUS "Found Open Babel library at ${OPENBABEL_LIBRARIES}")
    endif()
  endif()

  if(OPENBABEL_INCLUDE_DIRS AND OPENBABEL_LIBRARIES)
    set(OPENBABEL_FOUND TRUE)
  endif()

  mark_as_advanced(OPENBABEL_INCLUDE_DIRS OPENBABEL_LIBRARIES)
endif()

# FindOEChem.cmake
# Placed in the public domain by NextMove Software in 2013
# Try to find OEChem headers and libraries
# Defines:
#
#  OECHEM_FOUND - system has OEChem
#  OECHEM_INCLUDE_DIR - the OEChem include directory
#  OECHEM_LIBRARIES - Link these to use OEChem

if(OECHEM_INCLUDE_DIR AND OECHEM_LIBRARIES)
  # in cache already or user-specified
  set(OECHEM_FOUND TRUE)

else()

  if(NOT OECHEM_INCLUDE_DIR)
    if(WIN32)
      find_path(OECHEM_INCLUDE_DIR oechem.h
        PATHS
        $ENV{OECHEM_INCLUDE_DIR}
        $ENV{OECHEM_INCLUDE_PATH}
        $ENV{OECHEM_BASE}\\toolkits\\include
        $ENV{OECHEM_BASE}\\include
        C:\\OpenEye\\toolkits\\include
        C:\\OpenEye\\include
      )
    else()
      find_path(OECHEM_INCLUDE_DIR oechem.h
        PATHS
          $ENV{OECHEM_INCLUDE_DIR}
          $ENV{OECHEM_INCLUDE_PATH}
          $ENV{OECHEM_BASE}/toolkits/include
          $ENV{OECHEM_BASE}/include
          /usr/local/openeye/toolkits/include
          /usr/local/openeye/include
          /usr/include
          /usr/local/include
          ~/include
      )
      if(OECHEM_INCLUDE_DIR)
         message(STATUS "Found OEChem include files at ${OECHEM_INCLUDE_DIR}")
      endif()
    endif()
  endif()

  if(NOT OECHEM_LIBRARIES)
    find_library(OECHEM_LIB NAMES oechem
      HINTS
        ${OECHEM_INCLUDE_DIR}/../lib
      PATHS
        $ENV{OECHEM_LIB_DIR}
        $ENV{OECHEM_LIB_PATH}
        $ENV{OECHEM_LIBRARIES}
        $ENV{OECHEM_BASE}/toolkits/lib
        $ENV{OECHEM_BASE}/lib
        /usr/openeye/toolkits/lib
        /usr/openeye/lib
        /usr/local/openeye/toolkits/lib
        /usr/local/openeye/lib
        ~/openeye/toolkits/lib
        ~/openeye/lib
        $ENV{LD_LIBRARY_PATH}
    )
    if(OECHEM_LIB)
       GET_FILENAME_COMPONENT(OECHEM_LIBRARY_DIR ${OECHEM_LIB} PATH)
       message(STATUS "Found OEChem libraries at ${OECHEM_LIBRARY_DIR}")

      # Note that the order of the following libraries is significant!!
      find_library(OESYSTEM_LIB NAMES oesystem HINTS ${OECHEM_LIBRARY_DIR})
      find_library(OEPLATFORM_LIB NAMES oeplatform HINTS ${OECHEM_LIBRARY_DIR})
      set (OECHEM_LIBRARIES ${OECHEM_LIB} ${OESYSTEM_LIB} ${OEPLATFORM_LIB})
    endif()
  endif()

  if(OECHEM_INCLUDE_DIR AND OECHEM_LIBRARIES)
    set(OECHEM_FOUND TRUE)
  endif()

  mark_as_advanced(OECHEM_INCLUDE_DIR OECHEM_LIBRARIES)
endif()
# FindRDKit.cmake
# Placed in the public domain by NextMove Software in 2013
# Try to find RDKit headers and libraries
# Defines:
#
#  RDKIT_FOUND - system has RDKit
#  RDKIT_INCLUDE_DIR - the RDKit include directory
#  RDKIT_LIBRARIES - Link these to use RDKit

if(RDKIT_INCLUDE_DIR AND RDKIT_LIBRARIES)
  # in cache already or user-specified
  set(RDKIT_FOUND TRUE)

else()

  if(NOT RDKIT_INCLUDE_DIR)
    if(WIN32)
      find_path(RDKIT_INCLUDE_DIR GraphMol/RDKitBase.h
        PATHS
        ${RDKIT_DIR}\\Code
        $ENV{RDKIT_INCLUDE_DIR}
        $ENV{RDKIT_INCLUDE_PATH}
        $ENV{RDKIT_BASE}\\Code
        $ENV{RDBASE}\\Code
        C:\\RDKit\\include
        C:\\RDKit\\Code
      )
    else()
      find_path(RDKIT_INCLUDE_DIR GraphMol/RDKitBase.h
        PATHS
          ${RDKIT_DIR}/Code
          $ENV{RDKIT_INCLUDE_DIR}
          $ENV{RDKIT_INCLUDE_PATH}
          $ENV{RDKIT_BASE}/Code
          $ENV{RDBASE}/Code
          /usr/local/rdkit/include/Code
          /usr/local/rdkit/include
          /usr/local/rdkit/Code
          ~/rdkit/Code
      )
    endif()
    if(RDKIT_INCLUDE_DIR)
       message(STATUS "Found RDKit include files at ${RDKIT_INCLUDE_DIR}")
    endif()
  endif()

  if(NOT RDKIT_LIBRARIES)
    find_library(FILEPARSERS_LIB NAMES FileParsers
      PATHS
        ${RDKIT_DIR}/lib
        $ENV{RDKIT_LIB_DIR}
        $ENV{RDKIT_LIB_PATH}
        $ENV{RDKIT_LIBRARIES}
        $ENV{RDKIT_BASE}/lib
        $ENV{RDBASE}/lib
        /usr/local/rdkit/lib
        ~/rdkit/lib
        $ENV{LD_LIBRARY_PATH}
    )
    if(FILEPARSERS_LIB)
       GET_FILENAME_COMPONENT(RDKIT_LIBRARY_DIR ${FILEPARSERS_LIB} PATH)
       message(STATUS "Found RDKit libraries at ${RDKIT_LIBRARY_DIR}")

      # Note that the order of the following libraries is significant!!
      find_library(SMILESPARSE_LIB NAMES SmilesParse
                                   HINTS ${RDKIT_LIBRARY_DIR})
      find_library(DEPICTOR_LIB NAMES Depictor
                                HINTS ${RDKIT_LIBRARY_DIR})
      find_library(GRAPHMOL_LIB NAMES GraphMol
                                HINTS ${RDKIT_LIBRARY_DIR})
      find_library(RDGEOMETRYLIB_LIB NAMES RDGeometryLib
                                     HINTS ${RDKIT_LIBRARY_DIR})
      find_library(RDGENERAL_LIB NAMES RDGeneral
                                 HINTS ${RDKIT_LIBRARY_DIR})
      set (RDKIT_LIBRARIES ${FILEPARSERS_LIB} ${SMILESPARSE_LIB}
              ${DEPICTOR_LIB} ${GRAPHMOL_LIB} ${RDGEOMETRYLIB_LIB}
              ${RDGENERAL_LIB})
    endif()
    if(RDKIT_LIBRARIES)
            message(STATUS "Found RDKit library files at ${RDKIT_LIBRARIES}")
    endif()
  endif()

  if(RDKIT_INCLUDE_DIR AND RDKIT_LIBRARIES)
    set(RDKIT_FOUND TRUE)
  endif()

  mark_as_advanced(RDKIT_INCLUDE_DIR RDKIT_LIBRARIES)
endif()

Building and climbing a Chemical Ladder

A recent post by Jake Vanderplas described Word Ladders, and gave me an idea. A Word Ladder is a game developed by Lewis Carroll that involves converting one word into another one letter at a time, while passing through only valid words. For example, to convert MAD into HAT, a valid word ladder would be MAD->MAT->HAT or MAD->HAD->HAT.

Here I propose the term Chemical Ladder for a Word Ladder that is restricted to chemical names. For example, try converting from Barbamate to Carbazide in 4 steps only using valid chemical names. Or from Anginine to Arsonite.

So, how did I come up with these examples? Well, NextMove Software’s CaffeineFix can do chemical spelling correction based on a dictionary (or grammar). The spelling suggestions provided by CaffeineFix are substitutions, deletions and insertions, but if we consider just the 1-letter substitutions, this is exactly the transformation needed to build a word ladder, e.g.

>>> from caffeinefix import CaffeineFix
>>> cf = CaffeineFix("mydictionary.cfx")
>>> list(cf.suggest("azite"))
["azote", "azine"]

To begin with I downloaded the list of synonyms from PubChem, and filtered to remove database identifiers and various other cruft. I compiled these into a CaffeineFix dictionary, and edited the suggest method to just return substitutions (suggest_substitutions in the code below). The code shown below then uses the suggested substitutions for each chemical name to create a graph that I visualised to identify Chemical Ladders (see example below). A longer version of the code could be written to identify the Chemical Ladders more automatically.

Just in case any highly-respected and discerning chemistry society wants to include Chemical Ladders in its weekly or monthly magazine, I’ve decided not to include the full output of the program in this blogpost, apart from the image above. What do you think? Could this be the next ChemDoku?

from collections import defaultdict

from caffeinefix import CaffeineFix

def difference(a, b):
    for i, (d, e) in enumerate(zip(a,b)):
        if d != e:
            return i

def nearest(name):
    suggestions = list(cf.suggest_substitutions(name))
    suggestions.remove(name)
    return (name, suggestions)

replacements = [("0", "zero"), ("%", "PERCENT"), ("+", "PLUS"), ("4", "FOUR"),
                ("7", "SEVEN"), ("9", "NINE"), ("6", "SIX"), ("8", "EIGHT"),
                (")", "RBRACKET"), ("'", "APOSTROPHE"), ("@", "AT"),
                ("}", "RBRACKETB"), ("{", "LBRACKETB"), (":", "COLON"),
                ("/", "FSLASH"), (".", "PERIOD"), ("&", "AMPERSAND"),
                ("^", "CIRCUMFLEX"), ("[", "LBRACKETC"), ("]", "RBRACKETC"),
                ("|", "PIPE"), (";", "SEMICOLON")]

def fix(name):
    name = name.replace("-", "_").replace("1", "one").replace("2", "two").replace("3", "three").replace("5", "five").replace(",", "_").replace("?", "Q").replace("(", "LB").replace(" ", "SPACE")
    for x, y in replacements:
        name = name.replace(x, y)
    return name

if __name__ == "__main__":
    cf = CaffeineFix(r"C:\Tools\LargeData\PubChem_Synonyms\pubchem.cfx")
    names = [x.strip() for x in open(r"C:\Tools\LargeData\PubChem_Synonyms\lowercase_sorted_uniq.txt", "r") if x.rstrip()
             and len(x) == 10]
    results = map(nearest, names)

    output = open("wordladder_10.txt", "w")
    output.write("graph graphname {\n")

    for x, y in results:
        if len(y) > 1:
            collate = defaultdict(list)
            for w in y:
                collate[difference(x, w)].append(w)
            if len(collate) > 1:
                for v in collate.values():
                    output.write('%s [label="%s"];\n' % (fix(x), x))
                    for z in v:
                        output.write("%s -- %s\n" % (fix(x), fix(z)))
                        output.write('%s [label="%s"];\n' % (fix(z), z))

    output.write("}\n")
    output.close()

Note: A couple of people have asked why are there two edges for only some of the connections in the graph. This would be the case if I retained all of the original edges, as if A is a spelling correction of B, then B is a spelling correction of A. However, since a word ladder can only exist if a node in the graph has at least two connections, I filter out all those cases where a node has only a single connection (otherwise you end up with a lot of ‘word ladders’ composed of just two words). So, if I have A->B, B->(A, C), C->(B, D), D->C, then A->B and D->C will be removed, and the graph will be A-B=C-D.

Patently wrong – Tracing the origin of an unusual molecule in PubChem

Greg Landrum of Novartis posted a link to PubChem structure CID60140829 on Google Plus, with the comment:

This one is from a patent with the title “Apparatus and method for encoding and decoding block low density parity check codes with a variable coding rate”. I bet it’s the result of an overly zealous (and insufficiently error checked) image->structure conversion.

A good guess, but not quite right…

First of all, let’s look at the PubChem image for the deposited structure, SID143481705 (see right). Hmmm…

These structures are from the US Patent 20050283708 via SCRIPDB. For each patent the USPTO makes available a PDF, and very usefully for chemists, the ChemDraw files associated with the patent (along with MOL, and TIFF). NextMove Software’s PatFetch (bundled with LeadMine) makes it easy to extract the corresponding PNG, CDX and MOL files for a particular structure (it runs in a web browser, and you just click on an image to obtain the CDX file). In this case, the image corresponding to the structure is as follows:
Don’t ask me what it is, but I can confirm that it’s not a crossword.

But here’s the thing. If you download the CDX file and open it in ChemDraw…you get exactly this image. 🙂 In other words, the good people at the USPTO appear to use ChemDraw as a generic drawing tool, and in particular, seem to favour carbon-carbon bonds over the actual box or line tools. Actually, now that I see how useful a grid of carbon-carbon bonds can be to create a nice table, I think I might dump Excel for good too.

Visualising a hierarchy of substructures Part II

In an earlier post, I described a simple procedure to generate a hierarchy of substructures, and depicted the hierarchy with GraphViz. Pat Walters at Vertex realised that it is possible to add images as node labels in GraphViz and updated the script so that the image includes the actual chemical depictions (see below). He has also adapted the script to use OpenEye’s OEChem.

Update (8/11/2012): Wolf-D. Ihlenfeldt has implemented this with the Cactvs Toolkit and in addition has shown how to integrate the graph display with Knime.

#!/usr/bin/env python

import copy
import pickle
import os

from openeye.oechem import *
from openeye.oedepict import *

def mol2image(mol,imageFileName,width=200,height=200):
    clearcoords = True
    suppressH   = False
    opts = OE2DMolDisplayOptions(width, height, OEScale_AutoScale)
    itf = OEInterface()
    OESetup2DMolDisplayOptions(opts, itf)
    OEPrepareDepiction(mol, clearcoords, suppressH)
    disp = OE2DMolDisplay(mol, opts)
    ofs = oeofstream(imageFileName)
    name,ext = os.path.splitext(imageFileName)
    OERenderMolecule(ofs,ext[1:],disp)

def create_tree(structures):
    tree = {"Root":{}}
    stack = [(tree["Root"], structures.keys())]

    while len(stack) > 0:
        leaf, subset = stack.pop()

        max_matches = ("", [])
        for name in subset:
            smiles = structures[name]
            smarts = OESubSearch(smiles)
            matches = []
            for nameb in subset:
                if nameb != name:
                    molb = OEGraphMol()
                    OEParseSmiles(molb,structures[nameb])
                    if smarts.SingleMatch(molb):
                        matches.append(nameb)
            if len(matches) >= len(max_matches[1]):
                max_matches = (name, matches)
        if False: # Debug statement
            print max_matches

        for name in [max_matches[0]] + max_matches[1]:
            subset.remove(name)
        leaf[max_matches[0]] = {}
        if len(subset) > 0:
            stack.append( (leaf, subset) )
        if len(max_matches[1]) > 0:
            stack.append( (leaf[max_matches[0]], copy.deepcopy(max_matches[1])))

    with open("tmp.pickle", "w") as f:
        pickle.dump(tree, f)

def fix(name):
    return name.replace("-", "_").replace("1", "one").replace("2", "two").replace("3", "three").replace("5", "five").replace(",", "_")

def visit(name, leafdict):
    for x, y in leafdict.iteritems():
        if name != "Root":
            print ' %s -> %s;' % (name,x)
        print ' %s [label=<<TABLE><TR><TD><IMG SRC="%s/%s.png"/></TD></TR></TABLE>>];' % (x,os.getcwd(),x)
        visit(x, y)

if __name__ == "__main__":
    structureDict = dict([('001', 'C(O)[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[CH]1(O)'),
                       ('002', 'C(O)[C@@H](O1)[C@@H](O)[C@H](O)[C@H](O)[CH]1(O)'),
                       ('003', 'C(O)[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H]1(O)'),
                       ('004', 'C(O)[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1(O)'),
                       ('005', 'C(O)[CH](O1)[CH](O)[CH](O)[CH](O)[CH]1(O)'),
                       ('006', 'c1ccccc1'),
                       ('007', 'c1ccccc1Br'),
                       ('008', 'c1ccc(Br)cc1Br'),
                       ('009', 'c1cccc(Br)c1Br'),
                       ('010', 'c1c(Br)cc(Br)cc1Br')])

    for k in sorted(structureDict.keys()):
        imageFileName = "%s.png" % (k)
        smi = structureDict[k]
        mol = OEGraphMol()
        OEParseSmiles(mol,smi)
        mol2image(mol,imageFileName)

    create_tree(structureDict)
    with open("tmp.pickle", "r") as f:
        tree = pickle.load(f)
    print "digraph graphname {"
    print "node [shape=plaintext];"
    visit("Root", tree["Root"])
    print "}"

"""
Redirect the output to a file and then convert to a graph as follows:
"C:\Program Files (x86)\Graphviz 2.28\bin\dot.exe" -Tpng -ooutput.png dot_input.txt
"""

Using PubChem’s REST interface from Python

Version 1.0 of PubChem’s REST interface (PUG REST) went live on Sept 13. The tutorial gives examples of use, while full details are in the spec.

The following code shows illustrates basic usage from Python. If you want to simultaneously query large numbers of molecules, or do a substructure search, you need to perform a query “in the background” using the ListKey approach shown by SubstructureToCids:

import os
import sys
import time

import urllib2

def getresult(url):
    try:
        connection = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        return ""
    else:
        return connection.read().rstrip()

def NameToSmiles(name):
    return getresult("http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%s/property/IsomericSMILES/TXT" % name)

def NameToCid(name):
    return getresult("http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%s/cids/TXT" % name)

def CidToSmiles(cid):
    return getresult("http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%s/property/IsomericSMILES/TXT" % cid)

def SubstructureToCids(smiles):
    result = getresult("http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/%s/XML" % smiles)
    if not result:
        return []
    listkey = ""
    for line in result.split("\n"):
        if line.find("ListKey")>=0:
            listkey = line.split("ListKey>")[1][:-2]
    assert listkey

    url = "http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/%s/cids/TXT" % listkey
    delta = 2
    while True:
        time.sleep(delta) # ...perchance to dream
        if delta < 8:
            delta += 1
        result = getresult(url).split("\n")
        if result[0] != "Your request is running":
            break
    return result

if __name__ == "__main__":
    print NameToSmiles("Gleevec")
    print NameToCid("Gleevec")
    print CidToSmiles("123596")
    print SubstructureToCids("CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5")

Visualising a hierarchy of substructures

Given a set of chemical structures as SMILES, how can you visualise the substructure/superstructure relationship between them?

For example, the following picture shows the relationship between members of a set of structures containing several benzene derivatives and monosaccharides:
This was created using the following Python script, which iteratively looks for the structure that matches the largest number of molecules in the set, building up a tree as it does so. The output is the tree in a form suitable for depiction using Graphviz’s dot program. The Python script uses Open Babel, but could easily be adapted for other toolkits.

Using Python for batch conversion of ChemSketch files to Mol files

There are several tools out there for batch conversion of chemical file formats. However, you may not have access to those tools or else a particular file format may not be supported. Sometimes your only option is to open each file in the original software, and export it as a more useful file format. When dealing with a large number of files, this can be quite tedious or just impossible.

Recently, faced with the challenge of converting 100s of ChemSketch files to Mol files, myself and Daniel looked up the literature (i.e. googled the web) and found that Rajarshi Guha has described how to automate ChemDraw using the Python module WATSUP. Unfortunately this no longer seems to be available. Rich Apodaca has described an alternative approach using AppleScript but this does not work so well on Windows.

Instead we used a Python library, pywinauto, to automate the process of using the ChemSketch GUI to File/Open the ChemSketch file and File/Export a Mol file. The script is shown below. The trickiest part is choosing the delays so that the process runs as fast as possible without failing (due to a ChemSketch operation taking longer than expected):