We’re just back from the 7th Joint Sheffield Conference on Chemoinformatics, where I presented the poster below comparing the ability of structural fingerprints to measure structural similarity. As it happens, the corresponding paper also came out today:
Noel M. O’Boyle and Roger A. Sayle. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminf. 2016, 8, 36.
What we’ve tried to do is create a ground-truth dataset for structural similarity (in the context of med chemistry), and then test fingerprints against that. One approach to creating this dataset would be to crowd-source it to medicinal chemists – this is something that Pedro Franco has done, and he was actually presenting some updated results at Sheffield.
We’ve taken an alternative approach: we’ve used the med chemistry literature as collated by the ChEMBL database. On the basis that a team of medicinal chemists have selected these molecules for synthesis and testing as part of the same med chem programme, we regard molecules that appear in the same ChEMBL assay as structurally similar (after removing molecules that appear in 5 or more papers, and some other simple filters).
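To make the pairing idea concrete, here is a minimal sketch in Python. It is not the code we actually ran: the `(molecule, assay, paper)` record layout, the function name `similar_pairs` and the `max_papers` parameter are all assumptions made for illustration, and the "other simple filters" are left out.

```python
from collections import defaultdict
from itertools import combinations


def similar_pairs(records, max_papers=5):
    """Return pairs of molecules that share an assay.

    records: iterable of (molecule_id, assay_id, paper_id) tuples.
    This layout is hypothetical, not the ChEMBL schema.
    """
    papers_per_mol = defaultdict(set)
    mols_per_assay = defaultdict(set)
    for mol, assay, paper in records:
        papers_per_mol[mol].add(paper)
        mols_per_assay[assay].add(mol)

    # Molecules reported in max_papers or more papers are dropped,
    # since they say little about any one med chem programme.
    too_common = {m for m, p in papers_per_mol.items() if len(p) >= max_papers}

    pairs = set()
    for mols in mols_per_assay.values():
        kept = sorted(mols - too_common)
        # Every pair of surviving molecules in the same assay counts as similar.
        pairs.update(combinations(kept, 2))
    return pairs
```

Sorting the molecules before pairing means each pair comes out in one fixed order, so the same pair found in two assays is only stored once.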
This gives us pairs of molecules that are similar, but we really want to have a series of molecules with decreasing similarity to a reference, and then see if the various fingerprints can reproduce the series order. To create such a series, we hop from one paper to the next through molecules in common, thereby moving further and further away in terms of similarity from the original molecule. Inspired by Sereina Riniker and Greg Landrum, all of the data and scripts are available at our GitHub repo.
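The hopping step can be sketched along the same lines. Again, this is a simplified illustration rather than the scripts in the repo: the `paper_mols` dictionary, the function name `hop_series` and the greedy "take the first link in sorted order" choice are all assumptions, and each molecule added to the series is simply the one shared between the current paper and the next.

```python
def hop_series(paper_mols, start_paper, reference, length=5):
    """Build a series of molecules by hopping from paper to paper.

    paper_mols: dict mapping paper_id -> set of molecule ids (hypothetical layout).
    Returns a list starting with the reference; it may be shorter than
    `length` if the walk reaches a dead end.
    """
    series = [reference]
    visited = [start_paper]
    while len(series) < length:
        current = visited[-1]
        # Molecules from earlier papers are too close to the reference
        # to serve as the next, more distant, member of the series.
        too_close = set(series)
        for paper in visited[:-1]:
            too_close |= paper_mols[paper]

        hop = None
        for mol in sorted(paper_mols[current] - too_close):
            for paper in sorted(paper_mols):
                if paper not in visited and mol in paper_mols[paper]:
                    hop = (mol, paper)
                    break
            if hop:
                break
        if hop is None:
            break  # no molecule links the current paper to an unvisited one

        series.append(hop[0])
        visited.append(hop[1])
    return series
```

For example, with four hypothetical papers chained through shared molecules:

```python
papers = {"p1": {"A", "B"}, "p2": {"B", "C"}, "p3": {"C", "D"}, "p4": {"D", "E"}}
print(hop_series(papers, "p1", "A", length=4))
```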