Identifying suspect InChIs in Wikipedia Chemboxes using Chemical Name to Structure

Wikipedia is a highly useful source of chemistry and also of chemical nomenclature. A limitation of chemical name to structure software, such as OPSIN, is that a trivial name that resembles a systematic name may be misinterpreted if the program has never encountered it. The nature of Wikipedia means that the most important chemicals, and hence the most prevalent trivial names, are included. So surely Wikipedia would be a great resource for finding name/InChI pairs where the name to structure software was at fault?

I used Matthew Gamble’s code for extracting chemboxes as RDF to quickly grab the contents of all the current chemboxes. From the output of this tool it was simple to get the name/InChI pairs. As I was interested in trivial names I used the title of each Wikipedia page as the input for name to structure.

430 cases were flagged for a range of reasons: ring/chain tautomerism, intentionally underspecified names, under- or over-specification of stereochemistry and, of course, the type of error I was expecting. However, there were also a significant number of cases where the InChI clearly described a different compound. For the records I’ve corrected so far, the root cause appeared to be an incorrect reference to ChemSpider, which then allows script-assisted updates to pull in inappropriate InChIs/SMILES.
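The comparison step can be sketched roughly as follows. This is a minimal illustration, not the code used here: name_to_inchi is a hypothetical callable standing in for whatever wraps the name to structure software, and the stereochemistry check simply drops the InChI /b, /t, /m and /s layers before comparing, which is an assumption about how such cases might be separated out.

```python
# Sketch: flag Wikipedia title/InChI pairs that disagree with name-to-structure.
# `name_to_inchi` is a hypothetical callable (e.g. a wrapper around OPSIN) that
# returns an InChI string, or None if the name could not be interpreted.

STEREO_PREFIXES = ("b", "t", "m", "s")  # double bond, tetrahedral, and related layers


def strip_stereo(inchi):
    """Return the InChI with its stereochemical layers removed."""
    parts = inchi.split("/")
    # parts[0] is the version prefix and parts[1] the formula; keep both.
    kept = parts[:2] + [p for p in parts[2:] if not p.startswith(STEREO_PREFIXES)]
    return "/".join(kept)


def classify(chembox_inchi, generated_inchi):
    """Classify how the chembox InChI relates to the generated one."""
    if generated_inchi is None:
        return "unparsed"
    if chembox_inchi == generated_inchi:
        return "match"
    if strip_stereo(chembox_inchi) == strip_stereo(generated_inchi):
        return "stereo_only"
    return "different"


def flag_suspects(pairs, name_to_inchi):
    """Return (title, verdict) for every pair that parsed but did not match."""
    flagged = []
    for title, chembox_inchi in pairs:
        verdict = classify(chembox_inchi, name_to_inchi(title))
        if verdict in ("stereo_only", "different"):
            flagged.append((title, verdict))
    return flagged
```

Pairs reported as "different" are the ones worth checking by hand first, since they cover both name to structure errors and chemboxes whose InChI describes another compound.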

Example of a previously incorrect page.

Increasing the precision of identification of these incorrect name/structural identifier pairs should be possible if the IUPAC names were used as input…

Visualising a hierarchy of substructures

Given a set of chemical structures as SMILES, how can you visualise the substructure/superstructure relationship between them?

For example, the following picture shows the relationship between members of a set of structures containing several benzene derivatives and monosaccharides:
This was created using the following Python script, which iteratively looks for the structure that matches the largest number of molecules in the set, building up a tree as it does so. The output is the tree in a form suitable for depiction using Graphviz’s dot program. The Python script uses Open Babel, but could easily be adapted for other toolkits.
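The script itself is not reproduced in this text. As a rough sketch of the approach described, the tree building can be written independently of any toolkit by passing in the substructure test as a function. Here is_substructure(a, b) is a hypothetical callable that returns True when structure a matches structure b; with Open Babel it might wrap a SMARTS match, but that wiring is an assumption and is left out.

```python
# Sketch: build a substructure tree greedily and emit it in Graphviz dot format.
# `is_substructure(a, b)` is a hypothetical callable supplied by the caller.


def build_tree(structures, is_substructure):
    """Return a list of (parent, child) edges, rooted at the label "root"."""
    edges = []

    def recurse(parent, members):
        remaining = list(members)
        while remaining:
            # Pick the structure that matches the most remaining structures.
            best = max(
                remaining,
                key=lambda s: sum(1 for t in remaining if is_substructure(s, t)),
            )
            matched = [t for t in remaining if t != best and is_substructure(best, t)]
            edges.append((parent, best))
            # Everything matched by `best` is placed somewhere beneath it.
            recurse(best, matched)
            remaining = [t for t in remaining if t != best and t not in matched]

    recurse("root", structures)
    return edges


def to_dot(edges):
    """Format the edges as a dot digraph, quoting labels such as SMILES."""

    def quote(label):
        return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph G {"]
    for parent, child in edges:
        lines.append("  %s -> %s;" % (quote(parent), quote(child)))
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and passed to dot to produce an image. One consequence of the greedy choice is that a structure matched by two unrelated parents appears only under the first one selected.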

Using Python for batch conversion of ChemSketch files to Mol files

There are several tools out there for batch conversion of chemical file formats. However, you may not have access to those tools or else a particular file format may not be supported. Sometimes your only option is to open each file in the original software, and export it as a more useful file format. When dealing with a large number of files, this can be quite tedious or just impossible.

Recently, faced with the challenge of converting hundreds of ChemSketch files to Mol files, Daniel and I looked up the literature (i.e. googled the web) and found that Rajarshi Guha has described how to automate ChemDraw using the Python module WATSUP. Unfortunately, this no longer seems to be available. Rich Apodaca has described an alternative approach using AppleScript, but this does not work so well on Windows.

Instead we used a Python library, pywinauto, to automate the process of using the ChemSketch GUI to File/Open the ChemSketch file and File/Export a Mol file. The script is shown below. The trickiest part is choosing the delays so that the process runs as fast as possible without failing (due to a ChemSketch operation taking longer than expected):
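The script itself is not reproduced in this text. On the question of delays, one alternative to a fixed sleep after File/Export is to poll for the exported Mol file and move on as soon as it appears. The helper below is a sketch of that idea only; it is not part of the original script, it does not touch pywinauto or ChemSketch, and the default timeout and polling interval are arbitrary assumptions.

```python
# Sketch: wait for an exported file to appear instead of sleeping a fixed time.
# Not from the original script; timeout and poll defaults are arbitrary.
import os
import time


def wait_for_file(path, timeout=10.0, poll=0.1):
    """Return True once `path` exists and is non-empty, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A batch loop could call this after triggering each export and log any file that times out, so that a single slow operation does not stall or silently corrupt the rest of the run.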