Substructure Search Face-off: Are the slowest queries the same between tools?

At the recent Cambridge Cheminformatics Network Meeting (CCNM) we presented a performance benchmark of substructure searching tools using the same queries, target dataset, and hardware. Whilst many tools publish figures for isolated benchmarks, the use of different query sets and variations in target database size makes it impossible to determine how tools compare to each other.

The talk compared the performance of various tools and offered insight into their performance characteristics.

A question asked at the talk was whether the slowest queries are always the same across tools. As expected there is some correlation (benzene is always slow), but there are some rather dramatic differences within and between tools. For example, the time taken to query Anthracene or Zinc varies: some tools find Anthracene hits faster (marked as <) while others find Zinc faster (marked as >).
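Part of why common queries like benzene are consistently slow is the fingerprint pre-screen used by most of these tools: a screen can only discard candidates that lack some feature of the query, so a very common substructure passes nearly everything through to the expensive atom-by-atom match. A minimal illustrative sketch (the feature sets below are made-up stand-ins for real path/circular fingerprint bits, not any tool's actual implementation):

```python
# Hypothetical per-molecule feature sets; a real system uses bit vectors.
targets = {
    "aspirin":    {"C", "O", "aromatic_ring", "ester", "acid"},
    "hexane":     {"C"},
    "anthracene": {"C", "aromatic_ring", "fused_rings"},
    "ethanol":    {"C", "O"},
}

def screen(query_features, database):
    """Return names of molecules whose feature set is a superset of the
    query's; only these survivors need the expensive graph match."""
    return [name for name, feats in database.items()
            if query_features <= feats]

# A rare query prunes aggressively...
print(screen({"aromatic_ring", "fused_rings"}, targets))  # ['anthracene']
# ...while a benzene-like query passes most aromatics through to the
# slow atom-by-atom matcher, dominating total search time.
print(screen({"aromatic_ring"}, targets))  # ['aspirin', 'anthracene']
```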

The rank of slowest queries (per tool) is provided as a guide to how many queries took longer than the ones listed here.

                 Anthracene                      Zinc
Tool             Time (s)    Rank (slow)        Time (s)    Rank (slow)
arthor              2.254        3        >        0.357       2602
arthor+fp           0.022      285        >        0.001       1667
rdcart              0.698      794        <      202              4
rdlucene           27.126      566        >       23.87         600
pgchem             28.231      138        >       18.181        197
mychem             48.289      108        >       34.145        159
fastsearch        396           99        >      285            126
bingo-nosql         0.448      451        <        1.311        260
bingo-pgsql         0.392      638        >        0.060       1228
tripod-ss          21.797      350        <     1441             18
orchem             27.075      906        >        0.721       2390

As promised, the query and target ids are available here.

If this is an area of interest to you feel free to get in touch.

2 thoughts on “Substructure Search Face-off: Are the slowest queries the same between tools?”

  1. This is a really interesting and (of course) careful analysis. It does, IMO, suffer a bit from the quality of the set of sample queries.

    Though Andrew has done the best he can with the SQC set, I don’t think it is particularly representative for one very common (at least in my experience) query type: “find all compounds that match a particular scaffold”. Coincidentally, I did a blog post about this last week: http://rdkit.blogspot.ch/2015/05/a-set-of-scaffolds-from-chembl-papers.html.

    Another very common SSS use case is for filtering out “bad” compounds. Here the PAINS queries might be an interesting thing to add to the comparison set. Those will likely make some of the fingerprints look even worse.
