Arthor and data science interoperability


The 3.0 release of Arthor brings many new features with it, including an overhauled Python interface. The new interface is written using Cython bindings to Arthor's C++ core, with two goals: native speed for performing chemical searches, and effortless interoperability with the existing Python data science ecosystem, in particular numpy and pandas.

Loading and searching databases

Databases can be opened using either the arthor.SubDb or arthor.SimDb class, for substructure and similarity searches respectively. Once opened, each class has a .search() method, which takes a SMARTS or SMILES query string respectively and returns a ResultSet object.

Here we load a similarity database (pubchem.smi.atfp) indexing the PubChem database, also providing the corresponding .smi file so that we can cross-reference against it later.

In [1]:
import arthor
In [2]:
simdb = arthor.SimDb('/nvme/arthor/pubchem.smi.atfp',
                     smifile='/nvme/arthor/pubchem.smi')
simdb.set_num_processors(16)
print(simdb.get_num_records())
102597760

We can then search for the top 100 most similar hits for a particular compound. Here we’re searching for a candidate molecule suggested in the recent Open Source Malaria Series 4 meeting.

The returned arthor.ResultSet object acts as a cursor over the hits found.

In [3]:
%%time

rs = simdb.search('C24=NN=C(C1=CC=C(OC(C)(C)C)C=C1)N2C(OCC(C3=CC(F)=C(F)C=C3)O)=CN=C4', limit=100)
CPU times: user 1.11 s, sys: 1.82 ms, total: 1.11 s
Wall time: 74.3 ms

Exporting to pandas

This ResultSet object can be iterated over and sliced, but a more interesting option is that it also has .read(nbytes) and .readline() methods, allowing it to behave like an open file handle onto the results.
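
pandas accepts any object exposing read and readline as a text source, which is all that is needed for the trick in the next section. A minimal sketch of such a file-like cursor (a hypothetical stand-in for illustration, not Arthor's actual implementation) shows the protocol at work:

```python
import io

class ListResultSet:
    """A file-like cursor over search hits, mimicking the read()/readline()
    interface that lets an object be passed straight to pandas.read_csv."""

    def __init__(self, hits):
        # hits: iterable of (smiles, id, similarity) tuples, rendered
        # as tab-separated lines in an in-memory text buffer
        self._buf = io.StringIO(
            "\n".join("{}\t{}\t{}".format(*hit) for hit in hits)
        )

    def read(self, nbytes=-1):
        return self._buf.read(nbytes)

    def readline(self):
        return self._buf.readline()
```

Anything implementing this pair of methods, an arthor.ResultSet included, can then be handed to pandas.read_csv directly.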

One use of this is to pass the arthor.ResultSet object directly to pandas.read_csv, letting the results of the search be slurped efficiently into a pandas.DataFrame:

In [4]:
import pandas as pd

df = pd.read_csv(rs, sep=r'\s+', names=['smiles', 'id', 'similarity'])

print(df.head())
                                              smiles         id  similarity
0  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76311347       0.831
1  CN[C@@H](COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(...   76318585       0.702
2  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(CO)C4=CC(=C...   76314954       0.694
3  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76314948       0.686
4  CN(C)C(COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(F)...  129012036       0.678

Creating databases on the fly

Another new feature in the 3.0 release of Arthor is the ability to create substructure and similarity databases in memory. This is exposed in Python via .from_smiles classmethods, which take an iterable of SMILES strings (e.g. a list, numpy array, or pandas Series) and create a searchable chemical database.

Here we create a substructure database from our previous results, and search for hits which feature the difluoro-substituted phenyl ring. The result set is then converted directly (with the to_array() method) into a numpy array, allowing it to index the original DataFrame and pull out the rows which contain this substructure.

In [5]:
%%time

subdb = arthor.SubDb.from_smiles(df['smiles'])

filtered = df.iloc[subdb.search('c1(F)c(F)cccc1').to_array()]

print(len(filtered))
69
CPU times: user 5.82 ms, sys: 0 ns, total: 5.82 ms
Wall time: 5.6 ms

To quickly visualise this, we can drop the results into rdkit:

In [6]:
from rdkit import Chem
from rdkit.Chem import Draw

Draw.MolsToGridImage(
    filtered.iloc[:6]['smiles'].map(Chem.MolFromSmiles),
    legends=list(filtered.iloc[:6]['similarity'].map(lambda x: str(x)[:4]))
)
Out[6]:

Scalability of in-memory databases

The databases in this notebook are toy examples intended to give an idea of what is possible with the new API.

For those who are curious, these are the times for creating similarity and substructure databases of the entire PubChem database (currently 102 million molecules) within a notebook (at present the .from_smiles method is single-threaded):

In [7]:
%%time

pubchem = pd.read_csv('/nvme/arthor/pubchem.smi', sep='\t', names=['smiles', 'id'])
CPU times: user 59 s, sys: 4.96 s, total: 1min 4s
Wall time: 1min 5s
In [8]:
%%time

pb_simdb = arthor.SimDb.from_smiles(pubchem['smiles'])
CPU times: user 16min 8s, sys: 10.2 s, total: 16min 18s
Wall time: 16min 10s
In [9]:
%%time

pb_subdb = arthor.SubDb.from_smiles(pubchem['smiles'])
CPU times: user 26min 52s, sys: 17.3 s, total: 27min 9s
Wall time: 26min 56s

Bioactivity databases

For more modestly sized databases, for example ChEMBL 25, the times to create a database are short enough for interactive use:

In [10]:
%%time

chembl = pd.read_csv('/nvme/arthor/chembl_25.smi', sep='\t', names=['smiles', 'id'])

print("ChEMBL 25 with {} rows".format(len(chembl)))
ChEMBL 25 with 1870461 rows
CPU times: user 1.67 s, sys: 132 ms, total: 1.8 s
Wall time: 1.8 s
In [11]:
%%time

chembl_simdb = arthor.SimDb.from_smiles(chembl['smiles'])
CPU times: user 19 s, sys: 125 ms, total: 19.2 s
Wall time: 19 s
In [12]:
%%time

chembl_subdb = arthor.SubDb.from_smiles(chembl['smiles'])
CPU times: user 25.5 s, sys: 329 ms, total: 25.8 s
Wall time: 25.6 s

Conclusion

This isn’t the end of the story by far, but just a first pass of improvements to the Python API of NextMove’s tools. We’ve got plenty more ideas for bringing these usability and productivity enhancements to Arthor and our other products. We’d love to hear what you think!

Optimising struct equality checks

Whilst working on an upcoming feature for our Arthor search system (the topic of a future blogpost!), I ran the feature through some performance profiling tools, partly out of curiosity and partly to check I hadn’t done anything too horrendous :).

Looking through the flame chart for this new feature showed mostly what I was expecting, along with an unexpected visitor: the equality operator (operator==) for our AtomType struct. For the sake of argument, it looks roughly like this:

// in AtomType.h
struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;

  bool operator==(const AtomType& other) const;
};

// in AtomType.cpp
#include "AtomType.h"

bool AtomType::operator==(const AtomType &other) const {
  return ((this->element == other.element) &&
          (this->isotope == other.isotope) &&
          (this->charge == other.charge) &&
          (this->implicit_valence == other.implicit_valence) &&
          (this->explicit_valence == other.explicit_valence) &&
          (this->implicit_hydrogens == other.implicit_hydrogens) &&
          (this->explicit_hydrogens == other.explicit_hydrogens) &&
          (this->is_aromatic == other.is_aromatic));
}

The method is nothing fancy, just going through each field in the struct and checking that their values are equal. How bad can this be?

Digging down

To try and understand why this was taking long enough to show up in a profile, I turned to Compiler Explorer to view the x86 assembly generated for this function. All the compilers we looked at produced output similar to the following (this particular listing is from clang 9.0):

_ZNK8AtomTypeeqERKS_:
        mov     al, byte ptr [rdi]
        cmp     al, byte ptr [rsi]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 1]
        cmp     al, byte ptr [rsi + 1]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 2]
        cmp     al, byte ptr [rsi + 2]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 3]
        cmp     al, byte ptr [rsi + 3]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 4]
        cmp     al, byte ptr [rsi + 4]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 5]
        cmp     al, byte ptr [rsi + 5]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 6]
        cmp     al, byte ptr [rsi + 6]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 7]
        cmp     al, byte ptr [rsi + 7]
        sete    al
        ret
.LBB0_8:
        xor     eax, eax
        ret

Here we can see that the processor is being instructed to load a single byte (mov al, byte ptr [rdi]) and compare (cmp) it to another byte for every single field in the struct. Should any comparison fail, the code jumps to the .LBB0_8 label and exits as false (xor eax, eax; ret); otherwise the result of the 8th comparison is given as the return value (sete al; ret).

Recognising that the 8 members of the struct are contiguous in memory, this seems like a laborious way to approach the problem. Instead, we can check that they are all equal with a single call to memcmp:

#include <cstring>

bool AtomType::operator==(const AtomType &other) const {
  return memcmp(this, &other, sizeof(AtomType)) == 0;
}

Better yet, this function call gets inlined, and the assembly produced by all compilers now performs the operation in far fewer instructions and without any branching!

_ZNK8AtomTypeeqERKS_:
        xor       eax, eax
        mov       rdx, QWORD PTR [rdi]
        mov       rcx, QWORD PTR [rsi]
        cmp       rdx, rcx
        sete      al                                     
        ret

Here all 8 bytes of each struct are loaded (mov dst, QWORD PTR [src]) and compared (cmp rdx, rcx) in a single pass. (The instruction tables on Agner Fog’s website confirm that these quadword instructions take the same time as their single-byte equivalents.)

With uncommon sized structs

Things get more interesting when the struct is larger than the simple 8-byte struct originally presented. If instead we have an awkwardly sized 11-byte struct…

// in AtomType.h
struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;
  unsigned char fieldA, fieldB, fieldC;

  bool operator==(const AtomType& other) const;
};

We get different assembly from each compiler as they try to optimise how to instruct the processor to perform the memcmp call defined in our implementation. First, gcc 9.2:

_ZNK8AtomTypeeqERKS_:
        mov     rax, QWORD PTR [rsi]
        cmp     QWORD PTR [rdi], rax
        je      .L5
.L2:
        mov     eax, 1
.L3:
        test    eax, eax
        sete    al
        ret
.L5:
        movzx   eax, WORD PTR [rsi+8]
        cmp     WORD PTR [rdi+8], ax
        jne     .L2
        movzx   eax, BYTE PTR [rsi+10]
        cmp     BYTE PTR [rdi+10], al
        jne     .L2
        xor     eax, eax
        jmp     .L3

Here gcc first checks the first 8 bytes (mov QWORD PTR, cmp), jumps down to label .L5, then checks the next 2 bytes (WORD PTR), and finally the last byte. Any failed comparison jumps to label .L2, which returns 0 (false). The 11 bytes have been split into much more computer-friendly (i.e. addressable) chunks of 8, 2 and 1, requiring 3 loads and comparisons in total. Can this be improved upon? Enter clang 9.0:

_ZNK8AtomTypeeqERKS_:
        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 3]
        xor     rax, qword ptr [rsi]
        xor     rcx, qword ptr [rsi + 3]
        or      rcx, rax
        sete    al
        ret

Here we can see that clang instead does 2 qword loads, which initially looks like 16 bytes of data(?), until we notice that the second load starts only 3 bytes after the first, so in effect the middle 5 bytes of the struct are loaded twice. The loaded words are then compared using xor (instead of cmp), which produces the same result but leaves it in a general-purpose register (whereas cmp stores its result in the flags register, where it must be checked before it is overwritten!). The results of the two comparisons are then combined using an or, and this combined result is returned.
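
To see that the overlapping loads really do compute equality, the same trick can be written out in portable C++: two 8-byte loads at offsets 0 and 3 cover all 11 bytes, with the middle 5 simply compared twice. The struct and helper below are a sketch for illustration (hypothetical names, not Arthor code):

```cpp
#include <cstdint>
#include <cstring>

struct AtomType11 {
  unsigned char bytes[11];  // stand-in for the eleven one-byte fields
};

bool equal_shingled(const AtomType11 &a, const AtomType11 &b) {
  std::uint64_t a_lo, a_hi, b_lo, b_hi;
  // Load bytes 0-7 and bytes 3-10: together they cover all 11 bytes,
  // overlapping on bytes 3-7 (which are harmlessly checked twice).
  std::memcpy(&a_lo, a.bytes, 8);
  std::memcpy(&a_hi, a.bytes + 3, 8);
  std::memcpy(&b_lo, b.bytes, 8);
  std::memcpy(&b_hi, b.bytes + 3, 8);
  // xor is zero iff the corresponding words match; or-ing folds both
  // checks into one branchless test, as in clang's generated code.
  return ((a_lo ^ b_lo) | (a_hi ^ b_hi)) == 0;
}
```

The memcpy calls are how the unaligned loads are expressed portably; a decent optimiser lowers each one to a single mov rather than an actual function call.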

The overlapping-read approach is more elegant, as only 2 loads are required to read the entire struct, and it is now affectionately called “memory shingling” in the office :). Finally, we can check what icc 19.0 gives us:

_ZNK8AtomTypeeqERKS_:
        push      rsi
        mov       edx, 11
        call      _intel_fast_memcmp
        xor       edx, edx
        test      eax, eax
        sete      dl
        mov       eax, edx
        pop       rcx
        ret

OK, so Intel just calls out to _intel_fast_memcmp, which isn’t very fun for the purposes of this blog. A bit anticlimactic, really.

Conclusions

It’s not entirely clear to me why the compiler can’t perform these optimisations itself. Even if the struct is declared as simply having unsigned char fields[8] (so struct padding isn’t an issue), the compiler still checks each byte individually rather than jumping to comparing all the bytes at once.
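
For reference, that fields[8] variant looks like the sketch below (hypothetical names). The two functions are observationally identical, yet per the observation above only the memcmp version was lowered by the compilers we tried to a single 8-byte compare:

```cpp
#include <cstring>

struct PackedType {
  unsigned char fields[8];  // no padding possible: one byte per element
};

// Compilers we tried compared this loop one byte at a time, presumably
// because of the early exit on the first mismatching byte.
bool eq_bytewise(const PackedType &a, const PackedType &b) {
  for (int i = 0; i < 8; ++i)
    if (a.fields[i] != b.fields[i]) return false;
  return true;
}

// ...whereas this version was turned into one quadword load and compare.
bool eq_memcmp(const PackedType &a, const PackedType &b) {
  return std::memcmp(&a, &b, sizeof(PackedType)) == 0;
}
```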

This little find sent me on a bit of an optimisation binge, which has turned up some other fun hacks elsewhere in our codebase too! Hopefully these will filter out into our releases this year (alongside the new Arthor features I’ll blog about soon!).

Thanks for reading my first blog! – Richard