SmallWorld 5.0: Faster, Smaller and Easier to use

SmallWorld allows chemical space to be searched by graph edit distance. It works by enumerating the common subgraphs of chemical structures and connecting them into a large network. Similar molecules are found by traversing the network with a breadth-first search (BFS) to find the shortest path (minimum number of changes) between the molecules.

The latest database release (2021) contains 675 billion graphs and 9.16 trillion directed connections. To search the database we need efficient techniques to store, access and handle the data. We’ve just released version 5.0 of the SmallWorld search software which has some exciting improvements I want to highlight.

Faster

Searches now run faster thanks to improved data structures, IO scheduling and format changes. 

Compressed Bitset

To perform a BFS of the network we need to maintain the set of visited nodes. This was previously accomplished with a binary tree (std::set), which used ~40 bytes per entry with access times on the order of log N. We now use a compressed bitmap (inspired by Roaring Bitmaps) which uses ~1 byte per entry and has roughly constant time access. We avoid unnecessary calls to malloc by storing entries inline using a C union where possible.


A compressed bitset divides values into containers (buckets) and then chooses the most compact representation for each container. An array container is accessed with binary search; small array containers can be stored inline. Bitmap containers are used for dense ranges of the set and use a traditional bitset.
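For illustration, here is a minimal sketch of such a container (hypothetical types and sizes, not the actual SmallWorld implementation), with small sorted arrays kept inline in a union so no malloc is needed:

#include <cstdint>
#include <vector>
#include <algorithm>

// Hypothetical container holding the low 16 bits of visited node ids for one
// bucket (the high bits select the bucket), in the spirit of Roaring Bitmaps.
struct Container {
  enum Kind { InlineArray, HeapArray, Bitmap } kind = InlineArray;
  uint16_t size = 0;                      // number of stored values (array kinds)
  union {
    uint16_t inlined[12];                 // small sorted arrays stored inline (no malloc)
    std::vector<uint16_t>* array;         // larger sorted arrays on the heap
    uint64_t* bits;                       // 65536-bit bitmap for dense buckets
  };

  bool contains(uint16_t low) const {
    switch (kind) {
      case InlineArray: return std::binary_search(inlined, inlined + size, low);
      case HeapArray:   return std::binary_search(array->begin(), array->end(), low);
      case Bitmap:      return (bits[low >> 6] >> (low & 63)) & 1;
    }
    return false;
  }
};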

Route Taken

To score chemical entries we need to know the route the search took through the network. We could store the route using back-pointers that record where we came from each time a connection is traversed. Unfortunately this doubles the memory use and, since our hits are relatively sparse, it is wasteful. All we really need for scoring is the Maximum Common Subgraph (MCS) between the query and hit, and this can be captured by recording the inflection points of a search. This is much more efficient to store and adds minimal overhead to the traversal.


Example route taken during a search. This shows how to transform the molecule on the left into the one on the right in three steps: adding a linker (-O-), adding a terminal (-NH2), and finally closing the ring.
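As a very rough sketch of the idea (the bookkeeping in the real traversal is more involved), each frontier entry can carry its inflection point instead of a chain of back-pointers:

#include <cstdint>

// Hypothetical frontier entry: rather than recording where we came from at
// every node, each path remembers only the smallest graph it has passed
// through, i.e. the point where it stopped deleting atoms/bonds and started
// adding them. That graph is the common subgraph shared with the query,
// which is all the scorer needs.
struct Frontier {
  uint64_t node;        // current graph id
  uint64_t inflection;  // smallest graph reached on this path so far
  uint8_t  dist;        // edit distance from the query
};

// When following a "down" edge (to a smaller graph) the inflection point
// moves with the path; when following an "up" edge it stays frozen.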

IO Stalls

Typically when performing IO, a program requests data from a file and then waits for it to be returned. While the storage device is retrieving the data the program can’t do anything and stalls. This is somewhat hidden when reading a file sequentially as the operating system (OS) speculates about what data is needed next. SmallWorld searches do not access data predictably meaning IO operations create a bottleneck. One method of reducing stalls and increasing throughput is with Asynchronous IO (AIO). This allows multiple IO requests to be queued at once and resolved when ready. Linux recently introduced io_uring which makes this easier than previous implementations but it requires a very recent kernel version. The method we use to reduce the stalls is to advise the kernel what data we will need before we read it. This is achieved with the madvise and fadvise system calls – a preliminary pass over the data issues the calls and allows the OS to fetch these in the background. This provides a substantial speed up even when using the old database format.
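A minimal sketch of those advisory calls (the file layout and offsets here are illustrative, not SmallWorld's actual format); a preliminary pass over the edges to be visited issues these before the real reads happen:

#include <fcntl.h>     // posix_fadvise
#include <sys/mman.h>  // madvise
#include <cstddef>

// Tell the kernel we will soon need this byte range of the connection file,
// so it can start fetching it in the background instead of stalling us later.
void prefetch_file_region(int fd, off_t offset, off_t length) {
  posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
}

// Same idea for a memory-mapped region of the database.
void prefetch_mapped_region(void* addr, size_t length) {
  madvise(addr, length, MADV_WILLNEED);
}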


Speed comparison of v4.2 (the previous release) and v5 for selected queries in thousands of traversed edges per second (KTEPS). Times are recorded on a RAID0 array of spinning HDDs.

Smaller

The number of graphs and connections in the network grows with each release: novel chemicals are inserted and their subgraphs enumerated. As the network grows it requires more storage space. Each graph in the SmallWorld network is assigned an id based on its bond and ring count plus a unique index (B{b}R{r}.{idx}). The connections are specified as pairs of indexes. Previously we stored the connection data in a compressed-sparse-row format using packed 5-byte integers (indexes up to 2^40); this averaged ~7.2 bytes per connection. Our initial compression approaches focused on bit-packing and VarByte encoding, but these only brought us down to ~5.8 bytes on average. Using a technique similar to Grabowski and Bieniecki (2014), where edges are grouped into blocks, we were able to bring this down to ~3 bytes on average, significantly reducing the storage space required.
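To give a feel for the baseline, here is a plain VarByte delta encoding of one node's sorted neighbour list; this is roughly the ~5.8 bytes/edge approach mentioned above, shown only as a sketch, while the blocked format actually used in v5 is more involved:

#include <cstdint>
#include <vector>

// Encode a sorted adjacency list as gaps, each gap packed 7 bits per byte
// with the high bit marking "more bytes follow". Small gaps take one byte.
std::vector<uint8_t> encode_neighbours(const std::vector<uint64_t>& sorted_ids) {
  std::vector<uint8_t> out;
  uint64_t prev = 0;
  for (uint64_t id : sorted_ids) {
    uint64_t delta = id - prev;
    prev = id;
    while (delta >= 0x80) {
      out.push_back(uint8_t(delta) | 0x80);
      delta >>= 7;
    }
    out.push_back(uint8_t(delta));
  }
  return out;
}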


Year   Graphs        Connections     Old Format   New Format
2019   230 billion   2.75 trillion   20.6TB       8.1TB
2020   471 billion   6.12 trillion   44.6TB       18.3TB
2021   675 billion   9.16 trillion   66.0TB       27.6TB

Not only do the storage requirements decrease, the new format is also more efficient to access:


Speed comparison of the old vs new format on selected queries in millions of traversed edges per second (MTEPS)

Hardware

Storage technology is evolving rapidly and, given the new smaller database size, we decided to invest in an NVMe RAID setup providing 56TB of fast storage (7 drives, the maximum in stock at the time):

New hardware: 7x 8TB Sabrent Rocket Q NVMe drives and a HighPoint SSD7540

This allows even faster searches, and we believe that with further algorithm tuning, parallelisation and AIO it should be possible to go faster still:



Speed comparison of spinning HDD vs NVMe RAID0 array. Speed is in millions of traversed edges per second (MTEPS)

Easier to use

The old storage format was very simple and it was easy to implement a BFS in any programming language; we had implementations in C++, Java and Python. Decoding the new format is more complex, so we decided it was time to provide a unified common API and bindings. A disadvantage of the multiple implementations was that some were more advanced than others. Providing a common base allows SmallWorld searches from Python to run almost as fast as C++:

import pysmallworld

db = pysmallworld.Db("chembl_29")

num_hits = 0
for hit in db.search("Clc1ccccc1", max_dist=4):
  print (num_hits, hit.vector, hit.dbref, hit.smiles)
  num_hits += 1

One of the new features that comes with the API is the ability to perform a multi-source BFS where hits are reported that are the closest to any of the provided queries:

import pysmallworld

db = pysmallworld.Db("chembl_29")

query = db.new_query()
query.add_smiles("Clc1ccccc1")
query.add_smiles("Clc1c(C(C)(C)C)cccc1")
query.set_max_dist(4)

num_hits = 0
for hit in db.search_query(query):
  print (num_hits, hit.vector, hit.dbref, hit.smiles)
  num_hits += 1

SmallWorld v5.0 is now available to download and is better than ever.

We’re hiring! Engineers with a passion for cheminformatics and algorithms, head over to our Careers page for more info.

13,118,970 Reactions and Counting

In 2012, 2014 and 2017 Daniel Lowe (while at NextMove Software) released a large collection of reactions extracted from USPTO patent applications under CC-Zero. We have made updates (currently quarterly) for customers of our Pistachio query tool and extended the reaction data to include USPTO sketches and European patents.

The next version of Pistachio will include data from WIPO PCT documents along with several enhancements to content and representation. Highlighting these in a blog post made more sense than the usual release notes.

Number of Reactions

The next release will contain >13,118,970 reactions from the following sources:

Source              Count       Latest Extraction
USPTO Grant Text    3,290,056   2021-03-16
USPTO Appl. Text    3,595,510   2021-03-18
WIPO PCT Text       1,484,646   2021-03-18
EPO Grant Text      1,060,397   2021-03-17
EPO Appl. Text        696,578   2021-03-17
USPTO Grant Sketch  1,186,924   2021-03-16
USPTO Appl. Sketch  1,804,859   2021-03-18

Pistachio covers 1976 to the present day; however, 4,447,418 reactions (33%) were published in the last five years and 8,166,128 (62%) in the last ten years.

The reaction data is document (or citation) centric: the same reaction text often occurs in both an application and a grant. It can also be published by multiple authorities (e.g. USPTO, EPO and WIPO). Often the description text is identical but not always; sometimes a product yield/quantity may be mistyped or omitted. The number of unique reactions by RInChI is 4,212,894.

All reactions extracted from text are Atom-Atom Mapped, by NameRxn where the reaction is recognised and by falling back to Indigo otherwise. 9,383,607 (71.5%) are currently recognised by NameRxn.

What’s New?

Through improvements related to the WIPO PCT inclusion we observed a ~15% increase in recall for the existing USPTO and EPO data. This is one of the many advantages of automated extraction over manually curated databases: tweaks can be made and mistakes fixed, then applied in bulk over all the original source documents. Re-extraction currently takes a few days on a single machine (8 cores). Where mis-extraction mistakes are spotted we welcome the feedback and aim to resolve them where possible.

Embedded Heading Detection

One of the biggest challenges with handling WIPO PCT data is the English text is primarily OCRd. The submitted document quality can vary considerably which leads to a wide spectrum of related problems. 

OCR is well known to have issues with the non-standard characters used in systematic chemical names. Fortunately the majority of issues are simple character transliterations and extra white space; these can be handled effectively by our spelling correction algorithms. Some very badly corrupted names are beyond all hope:

WO 2020/243135 A1
Part of the chemical name could not be OCRd and remains as an image

Another common issue with OCR is the detection of paragraph breaks and a lack of title markup. Chemical reaction descriptions often use anaphoric references such as “title compound”, “desired product”, and “product from Step B” that need to be resolved. If the title is not found we can’t resolve them.

Paragraph breaks may be omitted:

WO 2020/239862 A1
There should be a paragraph break before “Intermediate 2:”

or in the wrong place (here splitting a chemical name):

WO 2020239862 A1
The “Step H” compound has a paragraph break in the middle of the name.

To compensate for this, new algorithms were introduced to detect patterns of embedded headings where the break was missed by OCR. Existing inline heading (start of paragraph) detection was also improved.

These errors are not unique to WIPO data and occasionally occur in the USPTO documents too:

US 2020/0071296 A1
There should be a break before “Step-2:”

Patentscope (WIPO) recently announced a new OCR extraction framework that should improve paragraph detection and chemical formula recognition (from Feb 11th 2021). We have not yet found an improvement in the extraction recall rate.

Multi-Paragraph Reactions

In previous versions, the reaction step parsing started a new reaction on every new paragraph. This logic was tweaked to allow a reaction to span multiple paragraphs and handle the less reliable breaks in WIPO data. Regressions in USPTO extraction helped identify places in the reaction descriptions where a yielded product action was missed.

A side effect of this is that we now sometimes extract cases where there was an unknown intermediate (A -> ?, ? -> B) as a single reaction. We are considering how to handle these.

Prefer Connected Representations

Chemical structures are generated from systematic names, line formulae, and dictionaries. Where possible we have updated the structure representations to favour connected representations:

[K]OC(=O)O[K] instead of [K+].[K+].[O-]C(=O)[O-]
[Na][Cl] instead of [Na+].[Cl-]
etc

Using CXSMILES fragment groups it is possible to keep the counter-ions grouped. The reaction components were also listed separately in the raw JSON files. Not all downstream tools can consume the CXSMILES representation, and some users invented or adapted syntaxes to handle it in their use cases, e.g.

>[K+]+[K+]+[O-]C(=O)[O-]>
>[K+]..[K+]..[O-]C(=O)[O-]>

The philosophy in preferring the connected component is that it is easier to split things apart than to piece them back together (Humpty Dumpty). Note that not all counter-ions have been bonded, due to undesirable valence representations (e.g. HATU, NH4Cl), and so the CXSMILES fragment groups remain a useful extension.

Representation of stereoisomer mixtures

I recently added the ability for OPSIN to capture racemic and relative stereo information in systematic names. In total, around 1% of reactions now have this information captured in CXSMILES. In a simple example, an unknown mixture of enantiomers is formed:

BrC1N2C=NC=C2SC=1C1CC1.C(C1N(C2C=CC(OC)=CC=2)N=NC=1C=O)C>>C1(C2SC3=CN=CN3C=2[C@H](C2N=NN(C3C=CC(OC)=CC=3)C=2CC)O)CC1 |&1:38|	US20200405696A1_1398	Example 279

A more complex example is where one stereocenter configuration is known but the other is not:

ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CN(C(OC(C)(C)C)=O)CC3=N2)=NC=CC=1C(F)(F)F>ClCCl.FC(F)(F)C(O)=O>ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CNCC3=N2)=NC=CC=1C(F)(F)F |&1:8,55,a:13,60| US20200375986A1_0652 Example 4

Both of these cases could alternatively be represented by simply removing the configuration from the racemic atom (and we may choose to normalise to that in future). The most important cases are when the relative configuration of two centres is known but there is a mixture of enantiomers:

C(OC([C@H]1[C@H](C)CN(CC2C=CC=CC=2)C1)=O)C.CC(OC(OC(OC(C)(C)C)=O)=O)(C)C>C(O)C.[OH-].[OH-].[Pd+2]>C[C@H]1CN(C(OC(C)(C)C)=O)C[C@@H]1C(OCC)=O |f:3.4.5,&1:3,4,40,51|	US20200369658A1_0530	Intermediate 2, Step b

NameRxn

Recent updates to NameRxn include limited support for some common non-balanced Functional Group Interconversion and Addition reactions, e.g. “Hydroxy to chloro” and “Amination”. This work also allowed a small performance boost. The total number of reaction types named is now 1,528 (up from 1,297); one source of additional reactions has been RXNO. NameRxn was not originally designed to provide Atom-Atom Maps; as that has become more of an interest, we have made improvements to the AAM where a functional group source was unknown.

Solvent Mixture Representations

More information about solvents and solvent mixtures is captured. For example, in “5-chloro-3-((trimethylsilyl)ethynyl)pyrazin-2-amine (100 mg, 0.44 mmol) in THF (8 mL)” (US20200085822A1, Example 1, Step 6) the THF solvent is associated with the reactant by reference.

Finer-grained details on the component/volume fractions of solvent mixtures are also captured when described, for example as “1M HCl Et2O” or “THF/MeOH (1 mL, 1:1)”.

Sequence and Step Labels

Where found in the text, we now include and attach the sequence and step labels to reactions, for example “Example 4, Step A” or “Compound 7, Step 2”. Pistachio will allow searching by these labels and resolving queries:

US 2020/0405696 A1 Example 313, Step 2
Azide-alkyne Huisgen cycloaddition (4.1.4)

Improved cross reference handling

Previously the cross-reference “Compound 1” was indexed and resolved just as “1”. We now use the “reference type” to disambiguate cases when there is both a “Compound 1” and “Intermediate 1”. The variety of recognised identifier values was also extended.

Summary

We have made several improvements to reaction extraction. These will ship in the new version of Pistachio, available at the start of next month.

Arthor and data science interoperability


The 3.0 release of Arthor brings many new features with it, including an overhauled Python interface. This new interface has been written using Cython bindings to the C++ core of Arthor, with the goals of native speed for chemical searches and effortless interoperability with the existing Python data science ecosystem, e.g. numpy and pandas.

Loading and searching databases

Databases can be opened using either the arthor.SubDb or arthor.SimDb classes for substructure and similarity searches respectively. Once opened, these classes have a .search() method which takes a SMILES or SMARTS string respectively and returns a ResultSet object.

Here we load a similarity database (pubchem.smi.atfp) indexing the PubChem database, and also provide the corresponding smi file so we can later cross reference against it.

In [1]:
import arthor
In [2]:
simdb = arthor.SimDb('/nvme/arthor/pubchem.smi.atfp',
                     smifile='/nvme/arthor/pubchem.smi')
simdb.set_num_processors(16)
print(simdb.get_num_records())
102597760

We can then search for the top 100 most similar hits for a particular compound. Here we’re searching for a candidate molecule suggested in the recent Open Source Malaria Series 4 meeting.

The returned arthor.ResultSet object acts as a cursor over the hits found.

In [3]:
%%time

rs = simdb.search('C24=NN=C(C1=CC=C(OC(C)(C)C)C=C1)N2C(OCC(C3=CC(F)=C(F)C=C3)O)=CN=C4', limit=100)
CPU times: user 1.11 s, sys: 1.82 ms, total: 1.11 s
Wall time: 74.3 ms

Exporting to pandas

This ResultSet object can be iterated and sliced etc, but a more interesting option is that it has both a .read(nbytes) and .readline() method, allowing it to behave like an open file handle onto the results.

A use of this is to pass the arthor.ResultSet object directly into pandas.read_csv, allowing the results of the search to be efficiently slurped into a pandas.DataFrame:

In [4]:
import pandas as pd

df = pd.read_csv(rs, sep='\s+', names=['smiles', 'id', 'similarity'])

print(df.head())
                                              smiles         id  similarity
0  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76311347       0.831
1  CN[C@@H](COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(...   76318585       0.702
2  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(CO)C4=CC(=C...   76314954       0.694
3  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76314948       0.686
4  CN(C)C(COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(F)...  129012036       0.678

Creating on the fly databases

Another new feature in the 3.0 release of Arthor is the ability to create substructure or similarity databases in memory. This is exposed in Python via the .from_smiles classmethods, which take an iterable of SMILES (e.g. a list, numpy array or pandas Series) to create a searchable chemical database.

Here we create a substructure database of our previous results, and search for hits which feature the double fluoro substituted phenyl ring. The result set is then directly converted (with the to_array() method) into a numpy.array allowing it to index the original dataframe directly and pull out the rows which feature this substructure.

In [5]:
%%time

subdb = arthor.SubDb.from_smiles(df['smiles'])

filtered = df.iloc[subdb.search('c1(F)c(F)cccc1').to_array()]

print(len(filtered))
69
CPU times: user 5.82 ms, sys: 0 ns, total: 5.82 ms
Wall time: 5.6 ms

To quickly visualise this, we can drop the results into rdkit:

In [6]:
from rdkit import Chem
from rdkit.Chem import Draw   # Chem.Draw is not available until Draw is imported

Draw.MolsToGridImage(
    filtered.iloc[:6]['smiles'].map(Chem.MolFromSmiles),
    legends=list(filtered.iloc[:6]['similarity'].map(lambda x: str(x)[:4]))
)
Out[6]:

Scalability of in-memory databases

The example databases in this notebook are toy examples to give an idea of the possibilities with the new API.

For those who are curious, these are the times for creating similarity and substructure databases of the entire PubChem database (currently 102 million molecules) within a notebook (this method is currently single-threaded):

In [7]:
%%time

pubchem = pd.read_csv('/nvme/arthor/pubchem.smi', sep='\t', names=['smiles', 'id'])
CPU times: user 59 s, sys: 4.96 s, total: 1min 4s
Wall time: 1min 5s
In [8]:
%%time

pb_simdb = arthor.SimDb.from_smiles(pubchem['smiles'])
CPU times: user 16min 8s, sys: 10.2 s, total: 16min 18s
Wall time: 16min 10s
In [9]:
%%time

pb_subdb = arthor.SubDb.from_smiles(pubchem['smiles'])
CPU times: user 26min 52s, sys: 17.3 s, total: 27min 9s
Wall time: 26min 56s

Bioactivity databases

For more modestly sized databases, for example ChEMBL 25, the creation times are much more reasonable for interactive use:

In [10]:
%%time

chembl = pd.read_csv('/nvme/arthor/chembl_25.smi', sep='\t', names=['smiles', 'id'])

print("Chembl 25 with {} rows".format(len(chembl)))
Chembl 25 with 1870461 rows
CPU times: user 1.67 s, sys: 132 ms, total: 1.8 s
Wall time: 1.8 s
In [11]:
%%time

chembl_simdb = arthor.SimDb.from_smiles(chembl['smiles'])
CPU times: user 19 s, sys: 125 ms, total: 19.2 s
Wall time: 19 s
In [12]:
%%time

chembl_subdb = arthor.SubDb.from_smiles(chembl['smiles'])
CPU times: user 25.5 s, sys: 329 ms, total: 25.8 s
Wall time: 25.6 s

Conclusion

This isn’t the end of the story by far, but is just a first pass of improvements to the Python API of NextMove’s tools. We’ve got plenty more ideas for bringing these usability and productivity enhancements to Arthor and our other products. We’d love to hear what you think!

Optimising struct equality checks

Whilst working on an upcoming feature for our Arthor search system (the topic of another future blogpost!), I ran the feature through some performance profiling tools, partly out of curiosity and partly to check I hadn’t done anything too horrendous :).

Looking through the flame chart for this new feature showed mostly what I was expecting, as well as an unexpected visitor, the equality operator (operator==) for our AtomType struct. For the sake of argument, it looks roughly like this:

// in AtomType.h
struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;

  bool operator==(const AtomType& other) const;
};

// in AtomType.cpp
#include "AtomType.h"

bool AtomType::operator==(const AtomType &other) const {
  return ((this->element == other.element) &&
          (this->isotope == other.isotope) &&
          (this->charge == other.charge) &&
          (this->implicit_valence == other.implicit_valence) &&
          (this->explicit_valence == other.explicit_valence) &&
          (this->implicit_hydrogens == other.implicit_hydrogens) &&
          (this->explicit_hydrogens == other.explicit_hydrogens) &&
          (this->is_aromatic == other.is_aromatic));
}

The method is nothing fancy, just going through each field in the struct and checking that their values are equal. How bad can this be?

Digging down

To try and understand why this was taking long enough to show up in a profile, I turned to compiler explorer to view the x86 assembly generated by the compiler to perform this function call. All compilers we looked at produced output similar to this: (this particular code is from clang 9.0)

_ZNK8AtomTypeeqERKS_:
        mov     al, byte ptr [rdi]
        cmp     al, byte ptr [rsi]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 1]
        cmp     al, byte ptr [rsi + 1]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 2]
        cmp     al, byte ptr [rsi + 2]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 3]
        cmp     al, byte ptr [rsi + 3]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 4]
        cmp     al, byte ptr [rsi + 4]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 5]
        cmp     al, byte ptr [rsi + 5]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 6]
        cmp     al, byte ptr [rsi + 6]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 7]
        cmp     al, byte ptr [rsi + 7]
        sete    al
        ret
.LBB0_8:
        xor     eax, eax
        ret

Here we can see that the processor is being instructed to load a single byte (mov al, byte ptr [rdi]) and compare (cmp) it to another byte for every single field in the struct. Should any comparison fail, the code jumps to the .LBB0_8 label and exits as false (xor eax, eax; ret), otherwise the result of the 8th comparison is given as the return value (sete al; ret).

Recognising that the 8 members of the struct are contiguous in memory, this seems like a laborious way to approach the problem. Instead we can check that they are all equal with a single call to memcmp.

#include <cstring>

bool AtomType::operator==(const AtomType &other) const {
  return memcmp(this, &other, sizeof(AtomType)) == 0;
}

Better yet, this function call gets inlined and the assembly produced by all compilers now shows that this operation is being done in far fewer instructions and without any branching!

_ZNK8AtomTypeeqERKS_:
        xor       eax, eax
        mov       rdx, QWORD PTR [rdi]
        mov       rcx, QWORD PTR [rsi]
        cmp       rdx, rcx
        sete      al                                     
        ret

Here all 8 bytes of each struct are being loaded (mov dst QWORD PTR [src]) and compared (cmp rdx, rcx) in a single pass. (Checking the instruction tables over on Agner Fog’s website will let you check that these instructions take the same time as the single byte equivalents.)

With uncommon sized structs

Things get more interesting when the struct is larger than the simple 8-byte struct originally presented. If instead we have an awkwardly sized 11-byte struct:

// in AtomType.h
struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;
  unsigned char fieldA, fieldB, fieldC;

  bool operator==(const AtomType& other) const;
};

We get different assembly produced from each compiler as they try and optimise how to instruct the processor to perform the memcmp call defined in our implementation. Firstly, gcc 9.2:

_ZNK8AtomTypeeqERKS_:
        mov     rax, QWORD PTR [rsi]
        cmp     QWORD PTR [rdi], rax
        je      .L5
.L2:
        mov     eax, 1
.L3:
        test    eax, eax
        sete    al
        ret
.L5:
        movzx   eax, WORD PTR [rsi+8]
        cmp     WORD PTR [rdi+8], ax
        jne     .L2
        movzx   eax, BYTE PTR [rsi+10]
        cmp     BYTE PTR [rdi+10], al
        jne     .L2
        xor     eax, eax
        jmp     .L3

Here gcc first checks the first 8 bytes (mov QWORD PTR, cmp), jumps down to label L5, then checks the next 2 bytes (WORD PTR), and finally the last byte. Any failed comparison jumps to label L2, which returns 0 (false). This splits the 11 bytes into much more computer-friendly (i.e. addressable) chunks of 8, 2 and 1, requiring 3 loads and comparisons in total. Can this be improved upon? Enter clang 9.0:

_ZNK8AtomTypeeqERKS_:
        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 3]
        xor     rax, qword ptr [rsi]
        xor     rcx, qword ptr [rsi + 3]
        or      rcx, rax
        sete    al
        ret

Here we can see that clang is instead doing 2 QWORD PTR movs, which initially looks like 16 bytes of data(?) until we notice that the second load starts only 3 bytes after the first, so in effect the middle 5 bytes of the struct have been loaded twice. Next this memory is compared using xor (instead of cmp), which produces the same result but operates in place (as opposed to cmp, which stores the result in a special flags register, requiring it to be checked before it is lost!). The results of the two comparisons are then combined using an or and this result is returned.

The overlapping read approach is more elegant as only 2 loads are required to read the entire struct, and is now affectionately called “memory shingling” in the office :). Finally we can check out what icc 19.0 will give us:

_ZNK8AtomTypeeqERKS_:
        push      rsi
        mov       edx, 11
        call      _intel_fast_memcmp
        xor       edx, edx
        test      eax, eax
        sete      dl
        mov       eax, edx
        pop       rcx
        ret

Ok, so Intel just calls out to _intel_fast_memcmp, which isn’t very fun for the purposes of this blog. A bit anticlimactic really.

Conclusions

It’s not entirely clear to me why the compiler can’t do these optimisations itself. Even if the struct is declared as just having unsigned char fields[8] (so struct padding isn’t an issue) the compiler will still check each byte individually and won’t jump to comparing all bytes at once.

This little find put me on a bit of an optimisation binge, which has turned up some other fun hacks elsewhere in our codebase too! Hopefully these will be filtering out into our releases this year (alongside the new features to Arthor I’ll blog about soon!)

Thanks for reading my first blog! – Richard

See you at the San Diego ACS

Roger, Richard and I will all be attending the upcoming 258th ACS National Meeting in San Diego, so come say hi (and get your bandit sticker!). Our presentations are shown below, along with talks by Jian Zhang and John Irwin where we are listed as co-authors.

As an addition to the official ACS agenda app, I’ve put together a spreadsheet showing all the talks in CINF, COMP, MEDI and ORGN ordered by session and time which you may find useful.

Sunday
10:40 Grand Ballroom A, Omni CINF 5: Current challenges in text-mining for chemical information (Roger)
18:30 Exhibit Hall A, Conv Center CINF 44: Medicinal chemistry based measure of R group similarity (Noel, CINF reception poster)

Monday
09:10 Grand Ballroom D, Omni CINF 55: Making a hash of it: Advantage of selectively leaving out structural information (Noel)
14:05 Grand Ballroom A, Omni CINF 70: Using DOCK and ZINC to visualize ultra-large chemical libraries (John Irwin, UCSF)
16:20 Grand Ballroom A, Omni CINF 74: Data visualization for compound library enhancement: Application of artificial intelligence algorithms from computer chess (Roger)
20:00 Exhibit Hall B, Conv Center CINF 44: Medicinal chemistry based measure of R group similarity (Noel, Sci-Mix poster)

Tuesday
08:40 Grand Ballroom E, Omni CINF 106: Measuring R group similarity using medicinal chemistry data (Noel)
13:35 Grand Ballroom E, Omni CINF 125: Biologics information in PubChem (Jian Zhang, PubChem)
14:25 Grand Ballroom E, Omni CINF 127: Representational and algorithmic challenges in biologic informatics 2019 (Roger)

Putting PubMed peptides in PubChem

Following on from the work described in the last post, I have put together a dataset of text-mined peptides which I’ve uploaded to PubChem. This involved extending our biopolymer grammar (which I use to textmine with LeadMine) and improving the ability of Sugar&Splice to interpret the text-mined entities.

The dataset consists of 5350 unique peptides extracted from PubMed abstracts (up to 2017), comprising 8699 peptide line notation/PubMed ID (PMID) pairs. Although the grammar identifies many more peptides, I’ve excluded them as they are either very short, appear to be fragments of a larger peptide, or are not currently interpreted by Sugar&Splice.

The most popular peptide is “Cbz-Val-Ala-Asp-CH2F”, which I found in 296 abstracts in a variety of forms, e.g. “N-benzyloxycarbonyl-Val-Ala-Asp-fluoromethylketone” (PMID 8628997), “Z-Val-Ala-Asp-fluoromethyl-ketone” (PMID 10554885). This is an irreversible pan-caspase inhibitor, and appears to be used in studies of apoptosis. Not surprisingly, this was already in PubChem (CID 5497171). Its Asp(OMe) form also ranks highly as the fourth most popular peptide line notation (88 abstracts).

At the other extreme is the peptide that appears as “Pro-Gly-Phe-Ser-Pro-Phe-Arg” in PMID 5120. This was not found in any other abstract, and is a new entry in PubChem despite the 42 years since its publication in Biochemistry. It can now be found in PubChem at CID 134824750. If you go there and scroll down to “Depositor Provided PubMed Citations”, you will find the link to the PubMed abstract. This describes the action of a peptidase on various peptides. On the right-hand side under “Related Information”, if you click on “PubChem Compound” you will see listed two peptides mentioned in the text that were extracted by LeadMine, as well as an entry for bradykinin.

To see the complete NextMove Software biologics dataset, click on “8699 Live Substances” on the source page. For more information on using LeadMine to mine PubMed abstracts, see this post. Obviously, there is nothing here specific to PubMed Abstracts – other interesting datasets would be the patent literature or internal company documents.

How to avoid inventing peptide monomer names

As part of my work on Sugar&Splice, I regularly search through PubChem for biopolymers where we recognise most of the components but maybe miss one or two. My typical approach is to extract the SMILES strings of these monomers, and consider the most commonly occurring ones as candidates for addition to our internal library. When I do this, I try to avoid making up our own names for monomers, and instead defer to IUPAC, vendor catalogues, or what’s commonly used in the field.

But identifying what’s commonly used in the field isn’t always obvious from a few Google searches. In a talk I presented at the Fall ACS in Boston (see below), I described a text-mining approach (using LeadMine) to determine commonly-used monomer names that we were missing.

The talk focussed on peptides, but the principle is general: I took a corpus of scientific text (e.g. US patents or PubMed abstracts) and searched for matches to a peptide line notation grammar – you can think of this as a regular expression that matches IUPAC condensed notation like H-Ala-Cys-OH. Next, I looked for gaps between text that is recognised (see the image below) and extracted the text. I then looked for the most frequently occurring phrases and added them to our internal library. These were often synonyms of monomers we already had, but not always.

Identification of “Igl” as a missing monomer name

The same principle can be applied to terminal modifications (look at the text that occurs before or after matches) and side-chain modifications (look for text within parentheses after an amino acid).

This approach worked really well. However, some care is needed to ensure that different peptide scientists really mean the same thing when they use the same term. An example is “Dpa”, which it turned out meant diaminopropionic acid to some people but 3-(2,4-dinitrophenyl)-2,3-diaminopropionic acid to others.

See you at the Boston ACS?

Roger, John and I will all be presenting talks at the upcoming 256th ACS National Meeting in Boston. There’s also a related talk by the folks at Chemspace about their integration of Arthor for substructure and similarity search. As an addition to the official ACS agenda app, John has put together a spreadsheet showing all the talks in CINF (and most in COMP) ordered by session and time which you may find useful.

Sunday
9:50 Lewis, Westin CINF 11: De facto standard or a free-for-all? A benchmark for reading SMILES (Noel)
13:45 Harbor Ballroom III, Westin CINF 35: Structure searching for patent information: The need for speed (John)

Monday
16:25 Grand Ballroom E, Westin CINF 77: Building a bridge between human-readable and machine-readable representations of biopolymers (Noel)

Thursday
10:25 Grand Ballroom A, Westin CINF 162: NextMove for Chemspace: Millisecond search in a database of 100 million structures (Oleksii Gavrylenko, Chemspace)
10:40 Lewis, Westin CINF 170: Regioselectivity: An application of expert systems and ontologies to chemical (named) reaction analysis (Roger)

Search-as-you-draw for large databases – Arthor

At the recent ICCS meeting in Noordwijkerhout, Roger presented “Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution”, a description of several related substructure and similarity searching technologies (see slides below). In particular, Roger described some of the technologies behind Arthor. This is a chemical database search system that is efficient enough to enable search-as-you-draw for both similarity and substructure searching, even for databases of the size of Enamine REAL (300 million and growing fast).

Similarity searching

The presentation describes the tricks Arthor uses to speed up similarity searching. For example, it turns out that the speed of similarity searching, if you request anything beyond a small number of hits, is dominated by the time required to sort the results. To avoid this problem, we use a two-pass counting sort (Trick#4); in the demo video (at 1:43), you can see that this enables the user to scroll instantly to any point in the results and see the hits.
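A minimal sketch of that idea, assuming similarity scores quantised to a fixed number of levels (the resolution used here is an assumption, not Arthor's actual value):

#include <cstdint>
#include <vector>

// Two-pass counting sort: pass 1 builds a histogram of scores, pass 2 places
// each hit directly at its final rank. Because the key range is tiny and
// fixed, this is O(N) and never compares hits against each other.
std::vector<uint32_t> sort_by_similarity(const std::vector<uint16_t>& score,    // one 0..1000 value per hit
                                         const std::vector<uint32_t>& hit_id) { // parallel array of ids
  const size_t kLevels = 1001;
  std::vector<size_t> count(kLevels, 0);
  for (uint16_t s : score) ++count[s];                  // pass 1: histogram

  std::vector<size_t> start(kLevels, 0);                // highest score placed first
  size_t offset = 0;
  for (size_t s = kLevels; s-- > 0;) { start[s] = offset; offset += count[s]; }

  std::vector<uint32_t> ordered(hit_id.size());
  for (size_t i = 0; i < hit_id.size(); ++i)            // pass 2: scatter
    ordered[start[score[i]]++] = hit_id[i];
  return ordered;
}

Since the bucket offsets are known after the first pass, the result viewer can jump straight to any rank without sorting the whole hit list, which is what makes the instant scrolling in the demo possible.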

Apart from this, the biggest win is to reduce the number of popcount instructions using Just-In-Time (JIT) compilation (Tricks#5a-e). Given that the query is constant, it’s possible to do a certain amount of analysis in advance in order to generate machine instructions that minimise the number of popcounts required. This starts with the observation that at least some of the words in a typical binary fingerprint are 0, and ends with an implementation of the graph coloring algorithm in order to combine multiple popcount instructions into one.
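A sketch of the simplest of those savings, without the JIT step: precompute which words of the (constant) query fingerprint are non-zero, and only popcount those positions for each database entry:

#include <cstdint>
#include <vector>
#include <bit>  // std::popcount (C++20)

// Positions and values of the non-zero query words, computed once per search.
struct QueryPlan {
  std::vector<uint64_t> words;
  std::vector<size_t>   index;
};

QueryPlan make_plan(const std::vector<uint64_t>& query_fp) {
  QueryPlan p;
  for (size_t i = 0; i < query_fp.size(); ++i)
    if (query_fp[i]) { p.words.push_back(query_fp[i]); p.index.push_back(i); }
  return p;
}

// Popcount of the intersection, skipping every word that is zero in the query.
int intersection_popcount(const QueryPlan& p, const uint64_t* db_fp) {
  int bits = 0;
  for (size_t k = 0; k < p.words.size(); ++k)
    bits += std::popcount(p.words[k] & db_fp[p.index[k]]);
  return bits;
}

Arthor takes this further by generating machine code specialised to the query, which is where the remaining popcount-combining tricks come in.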

Substructure searching

The typical approach to substructure search is to pre-screen with a fingerprint, followed by all-atom matching. While quite a lot of published work has focussed on improving the screen-out, in real-world applications the majority of compute time is spent dealing with pathological queries that require matching to a significant fraction of the database. Our approach has been to focus on making the matcher as fast as possible, by performing the match directly on a memory-mapped binary representation of the molecule database. More information is available here.
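For context, the screening step itself is the standard bit-subset test (this is the generic technique, not Arthor's code): a molecule can only contain the query if every bit set in the query fingerprint is also set in the molecule fingerprint.

#include <cstdint>
#include <cstddef>

// Returns true if the molecule passes the screen and must go on to the
// expensive all-atom match; any query bit missing from the molecule rules it out.
bool passes_screen(const uint64_t* query_fp, const uint64_t* mol_fp, size_t nwords) {
  for (size_t i = 0; i < nwords; ++i)
    if (query_fp[i] & ~mol_fp[i])
      return false;
  return true;
}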

Can we agree on the structure represented by a SMILES string?

I’m just back from the 11th International Conference on Chemical Structures in Noordwijkerhout, the Netherlands, where I presented a poster with the title: “Can we agree on the structure represented by a SMILES string? A benchmark dataset.”

This benchmark is aimed at improving SMILES interoperability, and compares the hydrogen counts found when different software reads the same SMILES string. The benchmark focuses on SMILES reading rather than writing as only once we can all agree on the meaning of a particular SMILES string, can we start talking about errors in SMILES writing. The full dataset and results are available from http://github.com/nextmovesoftware/smilesreading, and the poster below summarises a cross-section of results.

From presenting the poster, and especially focusing on the example of the SMILES string “N(C)(C)(C)C”, it’s clear that even cheminformatics experts have a misconception of what a SMILES string represents. A typical conversation went something like this:

“How many hydrogens are on the nitrogen in the molecule represented by this SMILES string?”
“There’s something wrong here – that’s a charged nitrogen, right?”
“If it were charged, then it would be written [N+].”
“Oh yes, right, but there’s some mistake – they must have omitted the charge.”
“They may have, or they wrote an extra methyl, or they are referring to a radical present in Jupiter’s clouds and intended [N], or perhaps they are trying to accurately represent a dodgy structure that a student drew on the blackboard. But to come back to the original question, how many hydrogens do you think are on the nitrogen in the molecule represented by this SMILES string?”
“None”
“One”

In short, it is commonly believed that the count of implicit hydrogens on each atom in a SMILES is whatever is chemically sensible (which of course means different things to different people), whereas in fact there is a table in the Daylight specification that describes exactly how many hydrogens are implicit on each atom. Part of the confusion is that people think that you add hydrogens when reading a SMILES, whereas in fact the job of a SMILES reader is simply to set the number of hydrogens to that present in the original molecule. As you can see from the results below, while one may choose not to follow the Daylight specification, the inevitable consequence is that hydrogens magically appear and disappear as SMILES strings move between spec-compliant and spec-non-compliant software.
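For the curious, the rule in that table can be written down in a few lines. This sketch covers neutral atoms written without brackets (aromatic atoms, charges and bracket atoms need additional handling):

#include <map>
#include <string>
#include <vector>

// Daylight "normal valences" for the organic subset. The implicit hydrogen
// count is the lowest normal valence that is >= the sum of explicit bond
// orders, minus that sum; if the sum exceeds every normal valence, it is 0.
int implicit_hydrogens(const std::string& element, int bond_order_sum) {
  static const std::map<std::string, std::vector<int>> normal_valence = {
      {"B", {3}}, {"C", {4}}, {"N", {3, 5}}, {"O", {2}}, {"P", {3, 5}},
      {"S", {2, 4, 6}}, {"F", {1}}, {"Cl", {1}}, {"Br", {1}}, {"I", {1}}};
  for (int v : normal_valence.at(element))
    if (bond_order_sum <= v) return v - bond_order_sum;
  return 0;
}

// For N(C)(C)(C)C the nitrogen has four single bonds, so bond_order_sum = 4;
// the lowest normal valence >= 4 is 5, giving one implicit hydrogen.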

The poster generated quite a bit of interest, and so I’m hoping to work with more vendors/developers to improve SMILES interoperability before I present this work at the Fall ACS. Please do get in touch if this is of interest.