Search-as-you-draw for large databases – Arthor

At the recent ICCS meeting in Noordwijkerhout, Roger presented “Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution”, a description of several related substructure and similarity searching technologies (see slides below). In particular, Roger described some of the technologies behind Arthor. This is a chemical database search system that is efficient enough to enable search-as-you-draw for both similarity and substructure searching, even for databases of the size of Enamine REAL (300 million and growing fast).

Similarity searching

The presentation describes the tricks Arthor uses to speed up similarity searching. For example, it turns out that the speed of similarity searching, if you request anything beyond a small number of hits, is dominated by the time required to sort the results. To avoid this problem, we use a two-pass counting sort (Trick#4); in the demo video (at 1:43), you can see that this enables the user to scroll instantly to any point in the results and see the hits.

Apart from this, the biggest win is to reduce the number of popcount instructions using Just-In-Time (JIT) compilation (Tricks#5a-e). Given that the query is constant, it’s possible to do a certain amount of analysis in advance in order to generate machine instructions that minimise the number of popcounts required. This starts with the observation that at least some of the words in a typical binary fingerprint are 0, and ends with an implementation of the graph coloring algorithm in order to combine multiple popcount instructions into one.

Substructure searching

The typical approach to substructure search is to pre-screen with a fingerprint, followed by all-atom matching. While quite a lot of published work has focussed on improving the screen-out, in real-world applications the majority of compute time is spent dealing with pathological queries that require matching to a significant fraction of the database. Our approach has been to focus on making the matcher as fast as possible, by performing the match directly on a memory-mapped binary representation of the molecule database. More information is available here.