## Radioactivity — It’s in the air for you and me

Radioactivity, discovered by Madame Curie, is the process by which the nucleus of an “unstable” atom decays to a different form.  As mentioned in a previous blog post, atoms are composed of protons, neutrons and electrons; protons are easy to handle in cheminformatics, electrons are incredibly difficult and here we discuss the neutrons.

Checking whether the number of neutrons specified with an atom (the isotope) is plausible and reasonable is a non-trivial challenge.  At the most simplistic level, many cheminformatics applications and file formats ignore isotopes altogether and assume every atom has the default terrestrial isotopic composition/abundance as prescribed/recommended by IUPAC. The next level of sophistication is to treat the atomic symbols “D” and “T” as corresponding to deuterium, [2H], and tritium, [3H] respectively.

More usually, such as with MDL’s SD files or SMILES, allow the optional specification of a mass number (number of nucleons, i.e. protons+neutrons, in the nucleus).  If not specified, the element again has the IUPAC recommended composition.  A common misunderstanding with these semantics is that [12C] is not the same as [C].  Although terrestrial carbon is predominantly carbon-12 (98.89% by the latest 2009 recommendations) the presence of trace amounts of [13C] keep these distinct.  Having said that, 22 elements do have a unique isotope officially used to determine their atomic weight and hence [4Be], [9F], [11Na], [13Al], [15P], [21Sc], [25Mn], [27Co], [33As], [39Y], [41Nb], [45Rh], [53I], [55Cs], [59Pr], [65Tb], [67Ho], [69Tm], [79Au], [83Bi], [90Th] and [91Pa] may legitimately be canonicalised without the isotopic specification, i.e. [Be], F, [Na] and so on.

The most advanced cheminformatics file formats, such as Perkin Elmer Informatics’ (formerly CambridgeSoft’s) ChemDraw CDX and CDXML file format can even specify enrichment and depletion in specific isotopes.

Unfortunately, having a specified isotope is often confused with being radioactive.  For example, RSC’s ChemSpider abuses the international icon for radioactivity to actually mean “Non-standard isotope”, though this is clearly stated.  This is because testing for a specified mass number is relatively easy, with many toolkits supporting the SMARTS semantics that [!0*] matches any specified isotope.  Although this is useful for identifying compounds whose isotopes need to be checked, it doesn’t correspond to radioactivity.  For example, deuterium, [2H] has a specified isotope but isn’t radioactive, whilst uranium, [U], even without a specified isotope is radioactive.

To address this I’ll describe how to ascertain whether a compound is radioactive, a useful descriptor especially when dealing with the “Health & Safety” parts of a pharmaceutical company.  A molecule is radioactive if any of its atoms is radioactive, and an atom is radioactive if it isn’t stable.  If an isotope isn’t specified, the element must have at least one stable isotope to be considered stable (these are the elements from hydrogen, [#1], to lead, [#82], with the exceptions of technetium [#43] and prometium [#61]), otherwise the specified isotope must correspond to one of the 255 known stable nuclides.  Hence, SMARTS pattern is_stable corresponds to [0#1,1#1,2#1,0#2,3#2,4#2,...].  Using De Morgan’s laws this atom expression can be negated to produce is_radioactive as [!0,!#1;!1,!#1;!2,!#2;…].

The complete SMARTS pattern for is_radioactive is shown below:

[!0,!#1;!1,!#1;!2,!#1;!0,!#2;!3,!#2;!4,!#2;!0,!#3;!6,!#3;!7,!#3;!0,!#4;!9,!#4;!0,!#5;!10,!#5;!11,!#5;!0,!#6;!12,!#6;!13,!#6;!0,!#7;!14,!
#7;!15,!#7;!0,!#8;!16,!#8;!17,!#8;!18,!#8;!0,!#9;!19,!#9;!0,!#10;!20,!#10;!21,!#10;!22,!#10;!0,!#11;!23,!#11;!0,!#12;!24,!#12;!25,!
#12;!26,!#12;!0,!#13;!27,!#13;!0,!#14;!28,!#14;!29,!#14;!30,!#14;!0,!#15;!31,!#15;!0,!#16;!32,!#16;!33,!#16;!34,!#16;!36,!#16;!0,
!#17;!35,!#17;!37,!#17;!0,!#18;!36,!#18;!38,!#18;!40,!#18;!0,!#19;!39,!#19;!41,!#19;!0,!#20;!40,!#20;!42,!#20;!43,!#20;!44,!#20;
!46,!#20;!0,!#21;!47,!#21;!0,!#22;!46,!#22;!47,!#22;!48,!#22;!49,!#22;!50,!#22;!0,!#23;!51,!#23;!0,!#24;!50,!#24;!52,!#24;!53,!#
24;!54,!#24;!0,!#25;!55,!#25;!0,!#26;!54,!#26;!56,!#26;!57,!#26;!58,!#26;!0,!#27;!59,!#27;!0,!#28;!58,!#28;!60,!#28;!61,!#28;!62,
!#28;!64,!#28;!0,!#29;!63,!#29;!65,!#29;!0,!#30;!64,!#30;!66,!#30;!67,!#30;!68,!#30;!70,!#30;!0,!#31;!69,!#31;!71,!#31;!0,!#32;!
70,!#32;!72,!#32;!73,!#32;!74,!#32;!0,!#33;!75,!#33;!0,!#34;!74,!#34;!76,!#34;!77,!#34;!78,!#34;!80,!#34;!0,!#35;!79,!#35;!81,!#
35;!0,!#36;!79,!#36;!80,!#36;!82,!#36;!83,!#36;!84,!#36;!86,!#36;!0,!#37;!85,!#37;!0,!#38;!84,!#38;!86,!#38;!87,!#38;!88,!#38;!0,
!#39;!89,!#39;!0,!#40;!90,!#40;!91,!#40;!92,!#40;!94,!#40;!96,!#40;!0,!#41;!93,!#41;!0,!#42;!92,!#42;!94,!#42;!95,!#42;!96,!#42;
!97,!#42;!98,!#42;!0,!#44;!96,!#44;!98,!#44;!99,!#44;!100,!#44;!101,!#44;!102,!#44;!104,!#44;!0,!#45;!103,!#45;!0,!#46;!102,!#4
6;!104,!#46;!105,!#46;!106,!#46;!108,!#46;!110,!#46;!0,!#47;!107,!#47;!109,!#47;!0,!#48;!106,!#48;!108,!#48;!110,!#48;!111,!#
48;!112,!#48;!114,!#48;!0,!#49;!113,!#49;!0,!#50;!112,!#50;!114,!#50;!115,!#50;!116,!#50;!117,!#50;!118,!#50;!119,!#50;!120,!
#50;!122,!#50;!124,!#50;!0,!#51;!121,!#51;!123,!#51;!0,!#52;!120,!#52;!122,!#52;!123,!#52;!124,!#52;!125,!#52;!126,!#52;!0,!#5
3;!127,!#53;!0,!#54;!124,!#54;!126,!#54;!128,!#54;!129,!#54;!130,!#54;!131,!#54;!132,!#54;!134,!#54;!136,!#54;!0,!#55;!133,!#
55;!0,!#56;!130,!#56;!132,!#56;!134,!#56;!135,!#56;!136,!#56;!137,!#56;!138,!#56;!0,!#57;!139,!#57;!0,!#58;!136,!#58;!138,!#5
8;!140,!#58;!142,!#58;!0,!#59;!141,!#59;!0,!#60;!142,!#60;!143,!#60;!145,!#60;!146,!#60;!148,!#60;!0,!#62;!144,!#62;!149,!#62;
!150,!#62;!152,!#62;!154,!#62;!0,!#63;!153,!#63;!0,!#64;!154,!#64;!155,!#64;!156,!#64;!157,!#64;!158,!#64;!160,!#64;!0,!#65;!1
59,!#65;!0,!#66;!156,!#66;!158,!#66;!160,!#66;!161,!#66;!162,!#66;!163,!#66;!164,!#66;!0,!#67;!165,!#67;!0,!#68;!162,!#68;!16
4,!#68;!166,!#68;!167,!#68;!168,!#68;!170,!#68;!0,!#69;!169,!#69;!0,!#70;!168,!#70;!170,!#70;!171,!#70;!172,!#70;!173,!#70;!1
74,!#70;!176,!#70;!0,!#71;!175,!#71;!0,!#72;!176,!#72;!177,!#72;!178,!#72;!179,!#72;!180,!#72;!0,!#73;!180,!#73;!181,!#73;!0,!
#74;!182,!#74;!183,!#74;!184,!#74;!186,!#74;!0,!#75;!185,!#75;!0,!#76;!184,!#76;!187,!#76;!188,!#76;!189,!#76;!190,!#76;!192,
!#76;!0,!#77;!191,!#77;!193,!#77;!0,!#78;!192,!#78;!194,!#78;!195,!#78;!196,!#78;!198,!#78;!0,!#79;!197,!#79;!0,!#80;!196,!#80
;!198,!#80;!199,!#80;!200,!#80;!201,!#80;!202,!#80;!204,!#80;!0,!#81;!203,!#81;!205,!#81;!0,!#82;!204,!#82;!206,!#82;!207,!#8
2;!208,!#82]


Image credit: LimeTech on Flickr

## Patently wrong – Tracing the origin of an unusual molecule in PubChem

Greg Landrum of Novartis posted a link to PubChem structure CID60140829 on Google Plus, with the comment:

This one is from a patent with the title “Apparatus and method for encoding and decoding block low density parity check codes with a variable coding rate”. I bet it’s the result of an overly zealous (and insufficiently error checked) image->structure conversion.

A good guess, but not quite right…

First of all, let’s look at the PubChem image for the deposited structure, SID143481705 (see right). Hmmm…

These structures are from the US Patent 20050283708 via SCRIPDB. For each patent the USPTO makes available a PDF, and very usefully for chemists, the ChemDraw files associated with the patent (along with MOL, and TIFF). NextMove Software’s PatFetch (bundled with LeadMine) makes it easy to extract the corresponding PNG, CDX and MOL files for a particular structure (it runs in a web browser, and you just click on an image to obtain the CDX file). In this case, the image corresponding to the structure is as follows:
Don’t ask me what it is, but I can confirm that it’s not a crossword.

But here’s the thing. If you download the CDX file and open it in ChemDraw…you get exactly this image. 🙂 In other words, the good people at the USPTO appear to use ChemDraw as a generic drawing tool, and in particular, seem to favour carbon-carbon bonds over the actual box or line tools. Actually, now that I see how useful a grid of carbon-carbon bonds can be to create a nice table, I think I might dump Excel for good too.

## Visualising a hierarchy of substructures Part II

In an earlier post, I described a simple procedure to generate a hierarchy of substructures, and depicted the hierarchy with GraphViz. Pat Walters at Vertex realised that it is possible to add images as node labels in GraphViz and updated the script so that the image includes the actual chemical depictions (see below). He has also adapted the script to use OpenEye’s OEChem.

Update (8/11/2012): Wolf-D. Ihlenfeldt has implemented this with the Cactvs Toolkit and in addition has shown how to integrate the graph display with Knime.

#!/usr/bin/env python

import copy
import pickle
import os

from openeye.oechem import *
from openeye.oedepict import *

def mol2image(mol,imageFileName,width=200,height=200):
clearcoords = True
suppressH   = False
opts = OE2DMolDisplayOptions(width, height, OEScale_AutoScale)
itf = OEInterface()
OESetup2DMolDisplayOptions(opts, itf)
OEPrepareDepiction(mol, clearcoords, suppressH)
disp = OE2DMolDisplay(mol, opts)
ofs = oeofstream(imageFileName)
name,ext = os.path.splitext(imageFileName)
OERenderMolecule(ofs,ext[1:],disp)

def create_tree(structures):
tree = {"Root":{}}
stack = [(tree["Root"], structures.keys())]

while len(stack) > 0:
leaf, subset = stack.pop()

max_matches = ("", [])
for name in subset:
smiles = structures[name]
smarts = OESubSearch(smiles)
matches = []
for nameb in subset:
if nameb != name:
molb = OEGraphMol()
OEParseSmiles(molb,structures[nameb])
if smarts.SingleMatch(molb):
matches.append(nameb)
if len(matches) >= len(max_matches[1]):
max_matches = (name, matches)
if False: # Debug statement
print max_matches

for name in [max_matches[0]] + max_matches[1]:
subset.remove(name)
leaf[max_matches[0]] = {}
if len(subset) > 0:
stack.append( (leaf, subset) )
if len(max_matches[1]) > 0:
stack.append( (leaf[max_matches[0]], copy.deepcopy(max_matches[1])))

with open("tmp.pickle", "w") as f:
pickle.dump(tree, f)

def fix(name):
return name.replace("-", "_").replace("1", "one").replace("2", "two").replace("3", "three").replace("5", "five").replace(",", "_")

def visit(name, leafdict):
for x, y in leafdict.iteritems():
if name != "Root":
print ' %s -> %s;' % (name,x)
print ' %s [label=<<TABLE><TR><TD><IMG SRC="%s/%s.png"/></TD></TR></TABLE>>];' % (x,os.getcwd(),x)
visit(x, y)

if __name__ == "__main__":
structureDict = dict([('001', 'C(O)[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[CH]1(O)'),
('002', 'C(O)[C@@H](O1)[C@@H](O)[C@H](O)[C@H](O)[CH]1(O)'),
('003', 'C(O)[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H]1(O)'),
('004', 'C(O)[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1(O)'),
('005', 'C(O)[CH](O1)[CH](O)[CH](O)[CH](O)[CH]1(O)'),
('006', 'c1ccccc1'),
('007', 'c1ccccc1Br'),
('008', 'c1ccc(Br)cc1Br'),
('009', 'c1cccc(Br)c1Br'),
('010', 'c1c(Br)cc(Br)cc1Br')])

for k in sorted(structureDict.keys()):
imageFileName = "%s.png" % (k)
smi = structureDict[k]
mol = OEGraphMol()
OEParseSmiles(mol,smi)
mol2image(mol,imageFileName)

create_tree(structureDict)
with open("tmp.pickle", "r") as f:
print "digraph graphname {"
print "node [shape=plaintext];"
visit("Root", tree["Root"])
print "}"

"""
Redirect the output to a file and then convert to a graph as follows:
"C:\Program Files (x86)\Graphviz 2.28\bin\dot.exe" -Tpng -ooutput.png dot_input.txt
"""

## Ye cannae change the laws of physics

One of the problems with curating chemical databases of small organic molecules is filtering out bogus connection tables from legitimate molecules. One aspect of this challenge has been termed [Ed: “I” have termed] EOCWR standing for “Explodes On Contact With Reality”.

An interesting class of broken molecules, that are often overlooked by reasonableness filters, are those that defy the standard model of physics. In this view of matter, atoms are composed of whole numbers of protons, neutrons and electrons. To paraphrase, Democritus “all that exists are groupings of protons, neutrons and electrons in empty space, all else is opinion”. Naturally, under this model the formal charge on an atom cannot exceed its atomic number. Whilst [Ed: “While”] an arbitrary number of electrons may be associated with an atom, it cannot have fewer electrons than zero; hence the positive charge is bounded by the number of protons in the nucleus. However, many cheminformatics file formats record the formal charge rather than the electron count leading to the ability to represent impossible molecules. Checking for these is relatively trivial and allows compounds such as [H+2] or [C+7] to be flagged as erroneous.

Another example of testing for EOCWR is the work of Dr Jonathan Goodman and colleagues at the University of Cambridge on the challenges on embedding alkanes in three dimensions (here and here). Their work explains that in some molecules, although all atomic valences are reasonable, steric crowding would produce sufficient strain that the molecule would fall apart. Hence although graph theoretically an sp3 carbon may have four neighbours that each itself has three additional neighbours (all unique) in reality there is an energetic upper bound of 10 second neighbours in alkanes. As above, checking the number of second (and third) neighbours of an atom provides a convenient and efficient way of distinguishing plausible molecules from the artifacts of erroneous molecule processing (termed “robochemistry” by NCBI PubChem’s Evan Bolton).

Image credit: Graham Lees (Tram Painter on Flickr)

## Using PubChem’s REST interface from Python

Version 1.0 of PubChem’s REST interface (PUG REST) went live on Sept 13. The tutorial gives examples of use, while full details are in the spec.

The following code shows illustrates basic usage from Python. If you want to simultaneously query large numbers of molecules, or do a substructure search, you need to perform a query “in the background” using the ListKey approach shown by SubstructureToCids:

import os
import sys
import time

import urllib2

def getresult(url):
try:
connection = urllib2.urlopen(url)
except urllib2.HTTPError, e:
return ""
else:

def NameToSmiles(name):
return getresult("http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%s/property/IsomericSMILES/TXT" % name)

def NameToCid(name):
return getresult("http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%s/cids/TXT" % name)

def CidToSmiles(cid):
return getresult("http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%s/property/IsomericSMILES/TXT" % cid)

def SubstructureToCids(smiles):
result = getresult("http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/%s/XML" % smiles)
if not result:
return []
listkey = ""
for line in result.split("\n"):
if line.find("ListKey")>=0:
listkey = line.split("ListKey>")[1][:-2]
assert listkey

url = "http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/%s/cids/TXT" % listkey
delta = 2
while True:
time.sleep(delta) # ...perchance to dream
if delta < 8:
delta += 1
result = getresult(url).split("\n")
if result[0] != "Your request is running":
break
return result

if __name__ == "__main__":
print NameToSmiles("Gleevec")
print NameToCid("Gleevec")
print CidToSmiles("123596")
print SubstructureToCids("CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5")


## Lazy File Reading with mmap

The example source code often provided with cheminformatics libraries to demonstrate the available functionality often try to address two conflicting requirements. On the one hand, these attempt to be pedagogical teaching examples, explaining how to perform a task often to a new or inexperienced user. On the other, these also attempt to address realistic problems and have utility in their own right. Alas in some cases, the best or most efficient way to do something is not the easiest to understand, so efficiency is sacrificed for clarity, or the other way around.

A recent example of this dilema arose for an example program to demonstrate how NextMove Software’s CaffeineFix library can be used with jQuery to provide efficient autocompletion/suggestion in chemical search engine text boxes, such as provided by Google. A minor implementation detail of this approach was the reading of CaffeineFix’s binary .cfx file format from disk.

In C or C++, a simple way to perform this shown by LoadDictionary1 below, that uses the standard I/O library to read in the dictionary in sequential chunks of 64 Kbytes at a time.

#include <stdlib.h>
#include <stdio.h>

unsigned char *LoadDictionary1(const char *fname) {
unsigned int alloc = 65536
unsigned char *result;
unsigned int len = 0;

FILE *fp = fopen(fname,"rb");
if (!fp) {
fprintf(stderr,"Error: Unable to read dictionary file %s\n",fname);
return (unsigned char*)0;
}

result = (unsigned char*)malloc(65536);
if (!result) {
fprintf(stderr,"Error: Unable to allocate 64K buffer!\n");
return (unsigned char*)0;
}

for(;;) {
len += chunk;

if (chunk != 65536) break;
alloc += 65536;
result = (unsigned char*)realloc(result,alloc);
if (!result) {
fprintf(stderr,"Error: Unable to reallocate %uK buffer!\n",alloc>>10);
exit(1);
}
}
fclose(fp);
return result;
}



This is functional, highly portable between systems and even works with streams, such as stdin.

Ultimately, however, the most efficient implementation for reading binary CaffeineFix dictionaries on UNIX-like operating systems is to use “mmap“. Memory mapping lets the operating system decide where to return contents of a file, but with a clever trick of virtual memory that the data is (typically) only read from disk when a memory location is accessed.

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/mman.h>
static unsigned char *LoadDictionary4(const char *fname)
{
unsigned char *result;
unsigned int len;
struct stat buf;

int fd = open(fname,O_RDONLY);
if (fd < 0) {
fprintf(stderr,"Error: Unable to read dictionary file %s\n",fname);
return (unsigned char*)0;
}

if (fstat(fd,&buf) < 0) {
fprintf(stderr,"Error: Unable to determine file size\n");
return (unsigned char*)0;
}

len = (unsigned int)buf.st_size;
if (result == MAP_FAILED) fprintf(stderr,"Error: Unable to memory map dictionary!\n");
return result;
}

For the example use-case described above, autocompletion using CaffeineFix typically only (randomly) accesses a small fraction of the binary file. In practice, this means that only a small part need be read from disk.

To quantify the performance advantage, we consider autocompleting the text “bisul” using NCBI pubchem’s synonym dictionary of 46.6 million compound names. The original 976 Mbyte ASCII dictionary file is very efficiently encoded as a 718 Mbyte binary CFX file. Using LoadDictionary1 above, autocompletion of “bisul” to return 10 possible completions takes 68 seconds on my Apple laptop, over 99.9% spent in file I/O. Using memory mapped file I/O with LoadDictionary4, the same task takes under half a second.

In practice, if using a persistent process on a web server there’s very little difference between approaches, as once the dictionary has been read into memory suggestions can be made as fast as a user types. However, for applications where this start-up time is an issue, memory mapping is clearly superior.

As a final word, I should point out that memory mapped file I/O is also available to programmers on Microsoft Windows using the MapViewOfFile APIs, and even to Java programmers using the FileChannel.map method to return a ByteBuffer. [Ed: It’s also available in the Python standard library!]

Image credit: James Nash (aka Cirrus) on Flickr

## Identifying suspect InChIs in Wikipedia Chemboxes using Chemical Name to Structure

Wikipedia is a highly useful source of Chemistry and also of chemical nomenclature. A limitation in chemical name to structure software, such as OPSIN, is that trivial names that are similar to systematic names may be misinterpreted if the program has never encountered the trivial names. The nature of Wikipedia means that the most important chemicals and hence the most prevalent trivial names are included so surely Wikipedia would be a great resource to look for name to InChI relationships where the name to structure software was at fault?

I used Matthew Gamble’s code for extracting chemboxes as RDF to quickly grab the contents of all the current chemboxes. From the output of this tool it was simple to get the name/InChI pairs. As I was interested in trivial names I used the title of each Wikipedia page as the input for name to structure.

430 cases were flagged up for a range of reasons: ring/chain tautomerism, intentionally underspecified names, under or over specification of stereochemistry and of course the type of error I was expected. However there were also a significant number of cases where the InChI clearly described a different compound. Upon investigation for the records I’ve corrected so far the root cause appeared to be an incorrect reference to ChemSpider. This then allows script assisted updates to pull in inappropriate InChIs/SMILES.

Example of a previously incorrect page.

Increasing the precision of identificiation of these incorrect name/strucutral identifiers pairs should be possible if the IUPAC names were used as input…

## Visualising a hierarchy of substructures

Given a set of chemical structures as SMILES, how can you visualise the substructure/superstructure relationship between them?

For example, the following picture shows the relationship between members of a set of structures containing several benzene derivatives and monosaccharides:
This was created using the following Python script, which iteratively looks for the structure that matches the largest number of molecules in the set, building up a tree as it does so. The output is the tree in a form suitable for depiction using Graphviz’s dot program. The Python script uses Open Babel, but could easily be adapted for other toolkits.

## Using Python for batch conversion of ChemSketch files to Mol files

There are several tools out there for batch conversion of chemical file formats. However, you may not have access to those tools or else a particular file format may not be supported. Sometimes your only option is to open each file in the original software, and export it as a more useful file format. When dealing with a large number of files, this can be quite tedious or just impossible.

Recently, faced with the challenge of converting 100s of ChemSketch files to Mol files, myself and Daniel looked up the literature (i.e. googled the web) and found that Rajarshi Guha has described how to automate ChemDraw using the Python module WATSUP. Unfortunately this no longer seems to be available. Rich Apodaca has described an alternative approach using AppleScript but this does not work so well on Windows.

Instead we used a Python library, pywinauto, to automate the process of using the ChemSketch GUI to File/Open the ChemSketch file and File/Export a Mol file. The script is shown below. The trickiest part is choosing the delays so that the process runs as fast as possible without failing (due to a ChemSketch operation taking longer than expected):