Search-as-you-draw for large databases – Arthor

At the recent ICCS meeting in Noordwijkerhout, Roger presented “Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution”, a description of several related substructure and similarity searching technologies (see slides below). In particular, Roger described some of the technologies behind Arthor. This is a chemical database search system that is efficient enough to enable search-as-you-draw for both similarity and substructure searching, even for databases of the size of Enamine REAL (300 million and growing fast).

Similarity searching

The presentation describes the tricks Arthor uses to speed up similarity searching. For example, it turns out that the speed of similarity searching, if you request anything beyond a small number of hits, is dominated by the time required to sort the results. To avoid this problem, we use a two-pass counting sort (Trick#4); in the demo video (at 1:43), you can see that this enables the user to scroll instantly to any point in the results and see the hits.

Apart from this, the biggest win is to reduce the number of popcount instructions using Just-In-Time (JIT) compilation (Tricks#5a-e). Given that the query is constant, it’s possible to do a certain amount of analysis in advance in order to generate machine instructions that minimise the number of popcounts required. This starts with the observation that at least some of the words in a typical binary fingerprint are 0, and ends with an implementation of the graph coloring algorithm in order to combine multiple popcount instructions into one.

Substructure searching

The typical approach to substructure search is to pre-screen with a fingerprint, followed by all-atom matching. While quite a lot of published work has focussed on improving the screen-out, in real-world applications the majority of compute time is spent dealing with pathological queries that require matching to a significant fraction of the database. Our approach has been to focus on making the matcher as fast as possible, by performing the match directly on a memory-mapped binary representation of the molecule database. More information is available here.

Can we agree on the structure represented by a SMILES string?

I’m just back from the 11th International Conference on Chemical Structures in Noordwijkerhout, the Netherlands, where I presented a poster with the title: “Can we agree on the structure represented by a SMILES string? A benchmark dataset.”

This benchmark is aimed at improving SMILES interoperability, and compares the hydrogen counts found when different software reads the same SMILES string. The benchmark focuses on SMILES reading rather than writing as only once we can all agree on the meaning of a particular SMILES string, can we start talking about errors in SMILES writing. The full dataset and results are available from http://github.com/nextmovesoftware/smilesreading, and the poster below summarises a cross-section of results.

From presenting the poster, and especially focusing on the example of the SMILES string “N(C)(C)(C)C”, it’s clear that even cheminformatics experts have a misconception of what a SMILES string represents. A typical conversation went something like this:

“How many hydrogens are on the nitrogen in the molecule represented by this SMILES string?”
“There’s something wrong here – that’s a charged nitrogen, right?”
“If it were charged, then it would be written [N+].”
“Oh yes, right, but there’s some mistake – they must have omitted the charge.”
“They may have, or they wrote an extra methyl, or they are referring to a radical present in Jupiter’s clouds and intended [N], or perhaps they are trying to accurately represent a dodgy structure that a student drew on the blackboard. But to come back to the original question, how many hydrogens do you think are on the nitrogen in the molecule represented by this SMILES string?”
“None”
“One”

In short, it is commonly believed that the count of implicit hydrogens on each atom in a SMILES is whatever is chemically sensible (which of course means different things to different people), whereas in fact there is a table in the Daylight specification that describes exactly how many hydrogens are implicit on each atom. Part of the confusion is that people think that you add hydrogens when reading a SMILES, whereas in fact the job of a SMILES reader is simply to set the number of hydrogens to that present in the original molecule. As you can see from the results below, while one may choose not to follow the Daylight specification, the inevitable consequence is that hydrogens magically appear and disappear as SMILES strings move between spec-compliant and spec-non-compliant software.

The poster generated quite a bit of interest, and so I’m hoping to work with more vendors/developers to improve SMILES interoperability before I present this work at the Fall ACS. Please do get in touch if this is of interest.