Name to Structure, not just for systematic names

While advanced software now exists for converting systematic chemical names to structures, the humble chemical line formula has for the most part avoided the limelight. For a long time LeadMine has been able to recognize chemical line formulae but we are now adding the ability to interpret them.

A simple example would be CHF3, which is interpreted as trifluoromethane (fluoroform).

Line formulae are interpreted from left to right, but in such a way that valency rules are respected; e.g. the fluorine atoms in the above example bond to the C, not the H.
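To make the left-to-right, valency-respecting idea concrete, here is a minimal sketch in Python. It is not LeadMine's algorithm. The valence table, the function name `parse_line_formula`, and the rule "bond each new atom to the most recent earlier atom that still has free valence" are all assumptions chosen for illustration. It only handles simple unbranched formulae with single bonds.

```python
import re

# Assumed typical valences for a handful of elements (illustration only).
VALENCE = {"C": 4, "H": 1, "F": 1, "Cl": 1, "Br": 1, "I": 1, "O": 2, "N": 3}


def parse_line_formula(formula):
    """Toy interpreter: returns (atoms, bonds) for a simple line formula.

    atoms is a list of element symbols; bonds is a list of (i, j) index
    pairs, all single bonds. Hypothetical helper, not a real API.
    """
    atoms = []  # element symbol of each atom, in reading order
    free = []   # remaining free valence of each atom
    bonds = []  # (earlier_atom, new_atom) pairs
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        for _ in range(int(count) if count else 1):
            new = len(atoms)
            atoms.append(symbol)
            free.append(VALENCE[symbol])
            if new == 0:
                continue  # the first atom has nothing to bond to yet
            # Walk backwards to the most recent atom with free valence.
            for partner in range(new - 1, -1, -1):
                if free[partner] > 0:
                    bonds.append((partner, new))
                    free[partner] -= 1
                    free[new] -= 1
                    break
            else:
                raise ValueError(f"no free valence left for {symbol}")
    return atoms, bonds


atoms, bonds = parse_line_formula("CHF3")
print(atoms)  # ['C', 'H', 'F', 'F', 'F']
print(bonds)  # [(0, 1), (0, 2), (0, 3), (0, 4)]
```

Because the hydrogen is already saturated, each fluorine skips over it and bonds to the carbon, which is the behaviour described above. Explicit bond orders, abbreviations and repeated groups are outside the scope of this sketch.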

More complicated examples include:

CF2=CF-O-CF2CF(CF3)O-CF2CF2-CH2OH [has an explicit double bond] {from US20020002258A1}

HO2C-CH2CH(NHFmoc)-CONH-(CH2)10CH3 [contains an abbreviated prefix: Fmoc] {from US20010038824A1}

(NH2NHCOCH2CH2)2N(CH2)11CONHNH2 [has a repeated substituent, repeated infix and implicit double bonds for the carbonyls] {from US20050081961A1}

This is useful for extracting the reagents of chemical reactions, and for cases where the compound described by the formula is important in its own right.

A small corner of SmallWorld

SmallWorld is a graph database in which the nodes are molecules or molecular fragments and the edges are directed, pointing to substructures. While it is frequently stated in the literature that calculating the maximum common substructure (MCS) by enumerating all possible substructures is impossible, what is really meant is that the time/space tradeoff is steep. SmallWorld is proof that, with enough disk space, the time required drops considerably.

While finding the MCS is still an NP-complete problem, precomputing the canonical substructures of molecules factors out the NP-complete part, so that implementing algorithms just involves traversing the graph. For example, substructure searching means going up the graph from a single node, while finding a multi-molecule MCS means simultaneously going down the graph from several nodes.
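The two traversals can be sketched on a toy graph. Everything here is hypothetical: the node labels, the `SUBSTRUCTURES` dictionary and the function names are invented for illustration and say nothing about how SmallWorld stores or queries its data. For simplicity the MCS function walks down from each node separately and intersects the results, rather than descending simultaneously.

```python
from collections import deque

# Hypothetical toy graph: each node points to its one-bond-smaller substructures.
SUBSTRUCTURES = {
    "CCCO": ["CCC", "CCO"],
    "CCCN": ["CCC", "CCN"],
    "CCC": ["CC"],
    "CCO": ["CC", "CO"],
    "CCN": ["CC", "CN"],
    "CC": [],
    "CO": [],
    "CN": [],
}

# Reverse edges, so that we can also walk "up" towards superstructures.
SUPERSTRUCTURES = {node: [] for node in SUBSTRUCTURES}
for parent, children in SUBSTRUCTURES.items():
    for child in children:
        SUPERSTRUCTURES[child].append(parent)


def reachable(start, edges):
    """Breadth-first walk; returns every node reachable from start (inclusive)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def substructure_search(query):
    """Going up from one node: everything that contains the query."""
    return reachable(query, SUPERSTRUCTURES) - {query}


def multi_mcs(nodes):
    """Going down from several nodes: the largest node common to all walks."""
    common = set.intersection(*(reachable(n, SUBSTRUCTURES) for n in nodes))
    # In this toy graph the label length stands in for the number of bonds.
    return max(common, key=len) if common else None


print(sorted(substructure_search("CO")))  # ['CCCO', 'CCO']
print(multi_mcs(["CCCO", "CCCN"]))        # CCC
```

Neither function does any graph matching at query time; all of the expensive work is assumed to have been done when the edges were precomputed.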

To mark the growth of SmallWorld to 4 billion molecular substructures, I thought it’d be interesting to take an in-depth look at a small corner: all those acyclic structures with 8 bonds (in the molecular skeleton). There are 35. Here they are arranged by the number of subgraphs each has. (Notice anything about the order?)

[Figure B8R0: the 35 acyclic structures with 8 bonds, arranged by number of subgraphs]

Some of these structures occur more often than others. Here is their frequency in ChEMBL molecules (only including acyclic molecules with up to 20 bonds).

[Figure B8R0_2: frequency of each structure in ChEMBL]

What is the smallest ChEMBL molecule that is a superstructure of as many of these as possible? The following molecule, CHEMBL1992288, is a superstructure of all but one.

[Figure: CHEMBL1992288]

These are just toy examples of the sorts of analyses possible. As part of a current collaboration, we are assessing how well SmallWorld compares to other methods for similarity searching. It’ll be interesting to find out.
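As an aside, the count of 35 quoted above can be reproduced with a short enumeration. The sketch below assumes that the skeletons are unlabeled trees in which no node has more than four neighbours. That reading is not spelled out in the post, but it gives the same number (which is also the number of nonane isomers). The function names are hypothetical, and this is in no way how SmallWorld enumerates structures.

```python
def canonical(adj):
    """Canonical string for an unlabeled tree given as an adjacency list."""
    n = len(adj)
    degree = [len(neighbours) for neighbours in adj]
    removed = [False] * n
    leaves = [i for i in range(n) if degree[i] <= 1]
    remaining = n
    # Peel off leaves layer by layer until only the 1 or 2 centres remain.
    while remaining > 2:
        next_leaves = []
        for leaf in leaves:
            removed[leaf] = True
            remaining -= 1
            for nb in adj[leaf]:
                if not removed[nb]:
                    degree[nb] -= 1
                    if degree[nb] == 1:
                        next_leaves.append(nb)
        leaves = next_leaves

    def encode(node, parent):
        # Sorting the child encodings makes the string order-independent.
        children = sorted(encode(c, node) for c in adj[node] if c != parent)
        return "(" + "".join(children) + ")"

    return min(encode(centre, -1) for centre in leaves)


def grow(trees, max_degree):
    """Add one leaf in every allowed position and keep one tree per shape."""
    unique = {}
    for adj in trees:
        for node in range(len(adj)):
            if len(adj[node]) < max_degree:
                new_adj = [list(neighbours) for neighbours in adj] + [[node]]
                new_adj[node].append(len(adj))
                unique.setdefault(canonical(new_adj), new_adj)
    return list(unique.values())


def count_skeletons(bonds, max_degree=4):
    """Number of distinct acyclic skeletons with the given number of bonds."""
    trees = [[[]]]  # start from a single node with no bonds
    for _ in range(bonds):
        trees = grow(trees, max_degree)
    return len(trees)


print(count_skeletons(8))  # 35
```

Lifting the degree cap (for example `count_skeletons(8, max_degree=8)`) gives 47, the number of all unlabeled trees on nine nodes, so the four-neighbour limit is what brings the total down to 35.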