SmallWorld is a graph database where the nodes are molecules or molecular fragments and where the edges are directed and point to substructures. While frequently stated in the literature that calculation of the MCS through enumerating all possible substructures is impossible, really what they mean is that the time/space tradeoff is quite high. SmallWorld is the proof that with enough disk space, the time reduces quite a lot.
While finding the MCS is still an NP-complete problem, by precomputing canonical substructures of molecules, the NP-complete part is already factored out so that implementing algorithms just involves traversing the tree. For example, substructure searching means going up the graph from a single node, while finding a multi-molecule MCS means simultaneously going down the graph from several nodes.
To mark the growth of SmallWorld to 4 billion molecular substructures, I thought it’d be interesting to take an in-depth look at a small corner, all those acyclic structures with 8 bonds (in the molecular skeleton). There are 35. Here they are arranged by the number of subgraphs each has. (Notice anything about the order?)Some of these structures occur more often than others. Here is their frequency in ChEMBL molecules (only including acyclic molecules with up to 20 bonds).What is the smallest ChEMBL molecule that is a superstructure of as many of these as possible? The following molecule, CHEMBL1992288, is a superstructure of all but one.These are just toy examples of the sorts of analyses possible. As part of a current collaboration, we are assessing how well SmallWorld compares to other methods for similarity searching. It’ll be interesting to find out.