CXSMILES – Part 2: Component Grouping

This post is a follow up on the previous introduction – Part 1. Here I examine how we can capture fragment grouping in CXSMILES and other extensions.

Fragment Grouping

Fragment grouping (or component grouping) allows you group together separate fragments/components of a molecule. It is critical for reaction representation and therefore several independent SMILES extensions that have emerged. Common cases include keeping counter-ions, hydrates, and salts together as a single “molecule”.

EP2305640A2 Example 11, Step v
EP2305640A2 Example 11, Step v
(CXSMILES with fragment grouping)


Here is a simple example annotated with the fragment indexes, we want to group together (0,1) and (3,4,5):

[Na+].[OH-].c1ccccc1.[Cs+].[O-]C(=O)[O-].[Cs+]>> |f:0.1,3.4.5|
--0-- --1-- ----2--- --3-- -----4------- --5--

Component indexes span the entire reaction, so we can for example move to the agents and the CXSMILES encoding does not change:

[Na+].[OH-].c1ccccc1>[Cs+].[O-]C(=O)[O-].[Cs+]> |f:0.1,3.4.5|
--0-- --1-- ----2--- --3-- -----4------- --5--

Does it only apply to reactions?

Toolkit dependent. ChemAxon appears to only read/write it on reactions (MarvinJS v22.9.0) but it’s also useful on molecules to capture formulations/mixtures (e.g. Artifical seawater) . In reaction, fragment grouping of agents (between the two “>”) appears to be ignored in MarvinJS – so my example images aren’t valid examples. One of our customers tested Marvin Desktop v21.15.1 for me and confirmed it round-trips correctly.

Do component terms need to be adjacent?

Toolkit dependent. In older versions of ChemAxon desktop tools (I no longer have access) I remember it would reject non-adjacent components:

[Na+].c1ccccc1.[OH-]>> |f:0.2|

This does not seem to be the case in MarvinJS and again a customer again confirmed it round-trips ok in Marvin Desktop.

Spanning Different Roles?

An input where the roles of the components being grouped (e.g. a reactant and a product) could be rejected as inconsistent:

c1ccccc1.[Na+]>>[OH-] |f:1.2|
c1ccccc1.[Na+]>>[OH-] |f:2.1|

MarvinJS and CDK default to the role of the first component encountered so those two inputs are different. As the author of the CDK logic I will note this is consistent only by coincidence.

Implicit grouping?

We can implicitly group components with multi-attach (m:) and Sgroup brackets. For example consider the following, which is preferred?

**.c1ccccc1 |$R1$,m:1:|
**.c1ccccc1 |$R1$,f:0.1,m:1:|


Daylight SMILES5

As with the cis/trans specification, SMILES5 had a solution:

Molecule-level Components: There will be another level of components within a molecule and reaction object which will allow easier handling of complex mixtures.” – Futures, MUG 2005

SMARTS has component grouping using zero-level brackets already and it would likely have followed a similar syntax:


Notice the wording that it applies to both molecule and reactions.


An extension used by Eli Lilly’s LillyMol is to treat the “.” to separate fragments and use a “+” to separate the molecules.


Note that LillyMol also supports CXSMILES.

IBM RXN for Chemistry

Schwaller et al 2020 describe how they use “~” to group together fragments. They note in the supplementary information how this is more useful than CXSMILES for their purposes since it enforces the fragments are kept together:


NextMove (proposed) / OntoChem

In 2013 Roger proposed a double-dot “..” for a similar purpose. The advantage being that “relaxed” SMILES parsers will simply ignore the repeated dot:


OntoChem also use this representation in reactions but I cannot find a link to relevant material.

NextMove (actual)

In Pistachio we use CXSMILES for fragment grouping in reactions. In recent releases we have tried to use alternative representations that avoid the problem where possible:


Where needed we still use CXSMILES since it is the most widely supported convention:

HATU reagent
US20200138797A1 Intermediate 3

The fragment grouping also gets captured in the JSON format of reactions albeit much less compactly:

role: "Product",
orgName: "title compound",
name: "Methyl (2S)-2-amino-3-(2-chlorophenyl)propanoate hydrochloride",
smiles: "Cl.N[C@H](C(=O)OC)CC1=C(C=CC=C1)Cl",
quantities: [ {type: "Mass", value: 5.89, text: "5.89 g"},
              {type: "Yield", value: 94, text: "94%"}],
stoichiometry: 1

We also support interconversion of the LillyMol syntax in our reaction processing tool set, HazELNut:

$ echo "*>[Na+].[OH-].[Cs+].[O-]C(=O)[O-].[Cs+]>* |f:1.2.3,4.5|" | ./filbert .smi .iwsmi

$ echo "*>C(=O)([O-])[O-].[Cs+]+[OH-].[Na+].[Cs+]>*" | ./filbert .iwsmi .smi
*>C(=O)([O-])[O-].[OH-].[Na+].[Cs+].[Cs+]>* |f:1.4,2.3.5| 

CXSMILES Gotchas – Part 1: Bond Indexes

ChemAxon Extended SMILES and SMARTS (CXSMILES) has become more popular in recent years for its ability to capture additional information on top of the core structural connectivity.

At NextMove Software we think it’s great and have increasingly used CXSMILES over the years to capture information more precisely.

Such structures can be captured as CXSMILES:

c12cccc1C=CC=C2.*[Zr](*)(Cl)Cl.c12cccc1C=CC=C2 |m:9:,11:| US20150376312A1 (2)
[c-]12cccc1C=CC=C2.*[Zr++](*)(Cl)Cl.[c-]12cccc1C=CC=C2 |m:9:,11:| US20150376312A1 (2)
C1CCOC1.C(Cl)Cl |Sg:c:0,1,2,3,4::,Sg:c:5,6,7::,Sg:mix:0,1,2,3,4,5,6,7::,SgH:2:0.1| 20% THF/DCM
[N+](=O)([O-])C1=CC=C(C(=O)O[C@H]2[C@H](CCCC2)N2N=C(C(=C2)NC(=O)C=2C=NN3C2N=CC=C3)C3=C(C=CC(=C3)Cl)OC)C=C1 |r| US20200002345A1 Ex 136


Originally created by ChemAxon as a way to to store a CTfile (i.e. MOLfile) in SMILES without losing information – It has evolved to be useful on its own right. Recent versions of the Enamine REAL use it for stereochemistry groups and there are efforts by InChI to better represent inorganic, mixtures, and reactions which could make use of it. Since CXSMILES has started to gain traction with more toolkits supporting the format it is becoming a convenient lingua franca for advanced chemical representations.

The following toolkits support CXSMILES to various degrees:


ChemAxon provides public documentation on CXSMILES but there are corner cases and some wonky areas I want to discuss. I planned to get this captured in one post but quickly realised the first topic alone is enough on it’s own. I will update this links below as new posts appear:

  • Bond Indexes
  • Component Grouping
  • Enhanced Stereo Canonicalisation
  • SRU Ambiguity
  • Atom Labels
  • EPAM Highlight Extension
  • Dative bond valence
  • Multi- vs Variable- attachment
  • Wish-list – beyond CXSMILES
    • Partial feature set
    • Atropisomers
    • More compact coordinates
    • cis/trans- stereo groups

Bond Indexes

Atoms and bonds in CXSMILES are referenced by index (0 <= idx < n) which is the position in the SMILES string:

CC=CC |c:1|
CC=CC |t:1|
CC=CC |ctu:1|

In this case it’s easy to see, there are three bonds at index: 0, 1, 2. The bond at index 1 is specified as cis (c:) or trans (t:) or unspecified (ctu:). As a linear notation, some bonds get written twice. Does the index increment twice? Is the ring open (first occurrence) or ring close (second occurrence) the “reference”?

Using a dot-disconnection trick we can reorder the previous example and probe behaviour of what ChemAxon accepts:

C1C.CC=1 |c:0| wrong
C1C.CC=1 |c:2| correct
C=1C.CC1 |c:0| wrong
C=1C.CC1 |c:2| correct - less desirable IMO

The “correct” choice is |c:2| – the closure bond is the reference. I should note this is somewhat an artefact of how the SMILES is parsed. A useful efficiency trick in SMILES is to use partial bonds (leave one atom temporarily undefined) which in this case would give you the incorrect index unless extra steps were taken.

A similar issue exists in vanilla SMILES when bond types mismatch: C#1C.C=1 or C/1=C/C.C/1. The bond that takes precedent is toolkit dependent, hopefully you would get a warning or error.

In case you don’t like the dot-disconnection being used like that, here is one in a macro cycle:

C=1CCCCCCCCCC1 |c:0| wrong
C1CCCCCCCCCC=1 |c:10| correct

The double counting question is already answered since it was |c:2| and not |c:3| but to confirm bonds do not get counted twice:

C1CCCCC1C=CC |c:8| wrong
C1CCCCC1C=CC |c:7| correct

Double bond configurations in rings are rare but bond indexes also apply to dative/hydrogen bonds and it is important there is consensus on how it works.

SMILES already supports cis/trans specification so why does CXSMILES add this? It turns out to avoid error propagation you need to be able to specify unknown configuration and this is not always possible in normal SMILES. SMILES uses "/" and "\" – it looks pretty but causes problems. Pause for a second and see if you can write the SMILES for the following structure:

Can you write the SMILES for this?

Maybe you wrote something like this:


Indeed most toolkits will do exactly that! Unfortunately we’ve added information that wasn’t there and inadvertently defined the middle bond. It is actually possible if you add explicit hydrogens:


But this only gets you so far, what if we had nitrogens? CXSMILES allows us to encode this unambiguously:

CC=NC=CN=CC |c:1,5|

This may seem like a narrow corner case but if you try to parse PubChem you will find a lot of them – or rather inconsistency warnings. Here are some warnings from CDK:

Ignoring invalid directional bonds
C1=C/C/2=C/N=C3/C(=C/4\C(=N/C=C/5/C(=C2/C=C1)/C=CC=C5)C=CC=C4)/C=CC=C3 5379414
                    ^ ^
Ignoring invalid directional bonds
CO/C(=C(\C#N)/C=C(/C=C(/C(=O)OC)\C#N)/C=C(/C(=O)OC)\C#N)/O 5720097
                  ^                  ^

Depending on the traversal and bond direction assignment we may accidentally define the configuration of this bond and no warning would be generated or alternatively it will come out with an “invalid” syntax.

PubChem CID 5379414
PubChem CID 5720097

It’s problematic and Daylight had planned to address it. At the user group meeting in 2005 a Futures talk makes references to preliminary work on SMILES5: “A unification of the EZ stereo representation with atom-based stereo is proposed. This will allow better specification of multiple conjugated EZ centers and also allows more robust specification of relative stereochemistry.”.

It may have looked something like this:

C[C@H]=[C@H]C=C[C@H]=[C@H]C cis,cis-
C[C@H]=[C@H]C=C[C@H]=[C@@H]C cis,trans-
C[C@H]=[C@H]C=C[C@@H]=[C@H]C cis,trans-

I actually added support for this in CDK but in hindsight I think using CXSMILES is simpler but less elegant:

CC=CC=CC=CC |c:1,5|
C/C=C\C=C\C=C/C |ctu:3|
C/C=C\C=C\C=C/C |c:1,5,ctu:3|

Which is best depends on the application – ignoring the CXSMILES you either get no configuration or the wrong configuration. With CXSMILES these should all canonicalise to the same thing.

SmallWorld 5.0: Faster, Smaller and Easier to use

SmallWorld allows the searching of chemical space by graph edit distance. It works by enumerating and connecting the common subgraphs of chemical structures into a large network. Similar molecules are found by traversing the network using a breath-first-search (BFS) to find the shortest path (minimum number of changes) between the molecules. 

The latest database release (2021) contains 675 billion graphs and 9.16 trillion directed connections. To search the database we need efficient techniques to store, access and handle the data. We’ve just released version 5.0 of the SmallWorld search software which has some exciting improvements I want to highlight.


Searches now run faster thanks to improved data structures, IO scheduling and format changes. 

Compressed Bitset

To perform a BFS of the network we need to maintain the set of visited nodes. This was previously accomplished with a binary-tree (std::set) and used ~40 bytes per entry with access times on the order of log N. We now use a compressed bitmap (inspired by Roaring Bitmaps) which uses ~1 byte per entry and has roughly constant time access. We avoid unnecessary calls to malloc by storing entries inline using a C union where possible.

A compressed bitset divides values into containers (buckets) and then chooses the most compact representation for that container. An array container is accessed with binary search, small array containers can be stored inline. The bitmap containers are used for dense ranges of the set and use a traditional bitset. 

Route Taken

To score chemical entries we need to know the route the search took through the network. We can store the route using back-pointers that record where we came from each time a connection is traversed. Unfortunately this uses twice the memory and since our hits are relatively sparse it is wasteful. All we really need for scoring is the Maximum Common Subgraph (MCS) between the query and hit and this can be captured by recording the inflection points of a search. This is much more efficient to store and adds minimal overhead to the traversal.

Example route taken during a search. This allows us to know how to transform the molecule on the left to the one on the right in three steps. Adding a linker (-O-), a terminal (-NH2), and finally closing the ring.

IO Stalls

Typically when performing IO, a program requests data from a file and then waits for it to be returned. While the storage device is retrieving the data the program can’t do anything and stalls. This is somewhat hidden when reading a file sequentially as the operating system (OS) speculates about what data is needed next. SmallWorld searches do not access data predictably meaning IO operations create a bottleneck. One method of reducing stalls and increasing throughput is with Asynchronous IO (AIO). This allows multiple IO requests to be queued at once and resolved when ready. Linux recently introduced io_uring which makes this easier than previous implementations but it requires a very recent kernel version. The method we use to reduce the stalls is to advise the kernel what data we will need before we read it. This is achieved with the madvise and fadvise system calls – a preliminary pass over the data issues the calls and allows the OS to fetch these in the background. This provides a substantial speed up even when using the old database format.

Speed comparison of v4.2 (last release) and v5 for selected queries in thousands of traversed edges per second (KTEPS). Times are recorded on a Raid0 array of spinning HDD.


The number of graphs and connections in the network grows with each release; Novel chemicals are inserted and their sub-graphs enumerated. As the network grows this requires more storage space. Each graph in the SmallWorld network is assigned an id based on its bond and ring count and a unique index (B{b}R{r}.{idx}). The connections are specified as pairs of indexes. Previously we stored the connection data in a compressed-sparse row format using packed 5-byte integers (indexes up to 240) this averaged about ~7.2 bytes. Our initial compression approaches focused on bit-packing and VarByte encoding but this was only able to bring us down to ~5.8 bytes average. Using a technique similar to Grabowski and Bieniecki (2014) where edges are grouped in blocks, we were able to bring this down to ~3 bytes average – significantly reducing the storage space required.

Graphs Connections Old Format New Format
2019 230 billion 2.75 trillion 20.6TB 8.1TB
2020 471 billion 6.12 trillion 44.6TB 18.3TB
2021 675 billion 9.16 trillion 66.0TB 27.6TB

Not only do the storage requirements decrease but it happens to be more efficient to access the data:

Speed comparison of the old vs new format on selected queries in millions of traversed edges per second (MTEPS)


Storage technology is evolving rapidly and given the new smaller database size we decided to invest in an NVME raid setup providing 56TB of fast storage (max of 7 in stock):

New hardware: 7x 8TB Sabrent Rocket Q NVME Drives and HighPoint SSD7540

This allows even faster searches and we believe with further algorithm turning, parallelisation and AIO it should be possible to go faster still:

Speed comparison of spinning HDD vs NVME Raid0 array. Speed is in millions of traversed edges per second (MTEPS)

Easier to use

The old storage format was very simple and it was easy to implement a BFS in any programming language. We had implementations in C++, Java, or Python. Decoding the new format is more complex and we decided it was time to provide a unified common API and bindings. A disadvantage of the multiple implementations were some were more advanced than others. Providing a common base allows SmallWorld searches from Python to run almost as fast as C++:

import pysmallworld

db = pysmallworld.Db("chembl_29")

num_hits = 0
for hit in"Clc1ccccc1", max_dist=4):
  print (num_hits, hit.vector, hit.dbref, hit.smiles)
  num_hits += 1

One of the new features that comes with the API is the ability to perform a multi-source BFS where hits are reported that are the closest to any of the provided queries:

import pysmallworld

db = pysmallworld.Db("chembl_29")

query = db.new_query()

num_hits = 0
for hit in db.search_query(query):
  print (num_hits, hit.vector, hit.dbref, hit.smiles)
  num_hits += 1

SmallWorld v5.0 is now available to download and is better than ever.

We’re hiring! Engineers with a passion for cheminformatics and algorithms, head over to our Careers page for more info.

13,118,970 Reactions and Counting

In 2012, 2014 and 2017 Daniel Lowe (while at NextMove Software) released a large collection of reactions extract from USPTO patent applications as CC-Zero. We have made updates (currently quarterly) for customers of our Pistachio query tool and extended the reaction data to include USPTO Sketches and European patents.

The next version of Pistachio will include data from WIPO PCT documents and include several enhancements to content and representation. Highlight these in a blog post made more sense than the usual release notes. 

Number of Reactions

The next release will contain >13,118,970 reactions from the following sources:

Source Count Latest Extraction
USPTO Grant Text 3,290,056 2021-03-16
USPTO Appl. Text 3,595,510 2021-03-18
WIPO PCT Text 1,484,646 2021-03-18
EPO Grant Text 1,060,397 2021-03-17
EPO Appl. Text 696,578 2021-03-17
USPTO Grant Sketch 1,186,924 2021-03-16
USPTO Appl. Sketch 1,804,859 2021-03-18

Pistachio covers 1976-current day however 4,447,418 (33%) were published in the last five years and 8,166,128 (62%) in the last ten years.

The reaction data is document (or citation) centric, the same reaction text often occurs in an application and grant. It can also be published in multiple authorities (e.g. USPTO, EPO and WIPO). Often the description text is identical but not always, sometimes a product yield/quantity may be miss-typed or omitted. The number of unique reactions by RInChI is 4,212,894.

All reactions extracted from text are Atom-Atom Mapped either by NameRxn or if the reaction is unrecognised we fallback to Indigo. 9,383,607 (71.5%) are currently recognized by NameRxn.

What’s New?

Through improvements related to the WIPO PCT inclusion we observed a ~15% increase in recall for existing USPTO and EPO data. This is one of the many advantages of automated extraction over manually curated databases. Tweaks can be made and mistakes can fixed then applied in bulk over all the original source documents. Re-extraction currently takes a few days on a single machine (8 cores). Where miss-extraction mistakes are spotted we welcome the feedback and aim to resolve these where possible.

Embedded Heading Detection

One of the biggest challenges with handling WIPO PCT data is the English text is primarily OCRd. The submitted document quality can vary considerably which leads to a wide spectrum of related problems. 

OCR is well known to have issues with the non-standard characters   used in systematic chemical names. Fortunately the majority of issues are simple character transliterations and extra white space. These can be effectively handled with our spelling correction algorithms. Some very badly corrupted names are beyond all hope:

WO 2020/243135 A1
Part of the chemical name could not be OCRd and remains as an image

Another common issue with OCR is the detection of paragraph breaks and lack of title markup. Often chemical reaction descriptions use anaphoric references “title compound”, “desired product”, and “product from Step B” that need to be resolved. If the title is not found we can’t resolve it.

Paragraph breaks may be omitted:

WO 2020/239862 A1
There should be a paragraph break before “Intermediate 2:”

or in the wrong place (here splitting a chemical name):

WO 2020239862 A1
The “Step H” compound has a paragraph break in the middle of the name.

To compensate for this, new algorithms were introduced to detect patterns of embedded headings where the break was missed by OCR. Existing inline heading (start of paragraph) detection was also improved.

These errors are not unique to WIPO data and occasionally occur in the USPTO documents too:

US 2020/0071296 A1
There should be a break before “Step-2:”

Patentscope (WIPO) recently announced a new OCR extraction framework that should improve paragraph detection and chemical formula recognition (from Feb 11th 2021). We have not yet found an improvement in the extraction recall rate.

Multi-Paragraph Reactions

In previous versions, the reaction step parsing started a new reaction on every new paragraph. This logic was tweaked to allow a reaction to span multiple paragraphs and handle the less reliable breaks in WIPO data. Regressions in USPTO extraction helped identify places in the reaction descriptions where a yielded product action was missed.

A side effect of this is we now sometimes extract cases where there was some unknown intermediate (A -> ?, ? -> B) as a single reaction. We are considering how to handle these.

Prefer Connected Representations

Chemical structures are generated from systematic names, line formulae, and dictionaries. Where possible we have updated the structure representations to favour connected representations:

[K]OC(=O)O[K] instead of [K+].[K+].[O-]C(=O)[O-]
[Na][Cl] instead of [Na+].[Cl-]

Using CXSMILES fragment groups it is possible to keep the grouping of the counter-ions. The reaction components were also listed separately in the raw JSON files. Not all downstream tools can consume the CXSMILES representation and some users invented/adapted syntaxes to handle in their use cases e.g.


The philosophy in preferring the connected component is that it is easier to split things apart then piece them back together (Humpty Dumpty). Note that not all counter-ions have been bonded due to undesirable valence representations (e.g. HATU, NH4Cl) and so the CXSMILES fragment groups remain a useful extension.

Representation of stereoisomer mixtures

I recently added the ability to OPSIN to capture racemic and relative stereo information in systematic names. In total around ~1% of reactions now have this information captured in CXSMILES. In a simple example we have an unknown mixture of enantiomers formed:

BrC1N2C=NC=C2SC=1C1CC1.C(C1N(C2C=CC(OC)=CC=2)N=NC=1C=O)C>>C1(C2SC3=CN=CN3C=2[C@H](C2N=NN(C3C=CC(OC)=CC=3)C=2CC)O)CC1 |&1:38|	US20200405696A1_1398	Example 279

A more complex example is where one stereocenter configuration is known but the other is not:

ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CN(C(OC(C)(C)C)=O)CC3=N2)=NC=CC=1C(F)(F)F>ClCCl.FC(F)(F)C(O)=O>ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CNCC3=N2)=NC=CC=1C(F)(F)F |&1:8,55,a:13,60| US20200375986A1_0652 Example 4

Both of these cases could alternative be represented by simply removing the configuration from the racemic atom (and we may choose to normalise to that in future). The most important cases are when the relative configuration of two centres is known but that there is a mixture of enatiomers.

C(OC([C@H]1[C@H](C)CN(CC2C=CC=CC=2)C1)=O)C.CC(OC(OC(OC(C)(C)C)=O)=O)(C)C>C(O)C.[OH-].[OH-].[Pd+2]>C[C@H]1CN(C(OC(C)(C)C)=O)C[C@@H]1C(OCC)=O |f:3.4.5,&1:3,4,40,51|	US20200369658A1_0530	Intermediate 2, Step b


Recent updates to NameRxn include limited support for some common non-balanced Functional Group Interconversion and Addition reactions, i.e, “Hydroxy to chloro” and “Amination”.  This work also allowed a small performance boost. The total number of reaction types named is now 1,528 (from 1,297 previously), one source for additional reactions has been RXNO. NameRxn was not originally designed to provide Atom-Atom Maps; as that has become more of an interest we have made improvement to AAM where a functional group source was unknown.

Solvent Mixture Representations

More information about solvents and solvent mixtures is captured, for example: “5-chloro-3-((trimethylsilyl)ethynyl)pyrazin-2-amine (100 mg, 0.44 mmol) in THF (8 mL)” (US20200085822A1 Example 1 Step 6) the THF solvent is associated by reference from the reactant.

Finer grained details on component/volume fractions of solvent mixtures is also captured when described as “1M HCl Et2O”, “THF/MeOH (1 mL, 1:1)

Sequence and Step Labels

Where found in the text we now include and attach the sequence and step labels to reactions. For example “Example 4, Step A”, “Compound 7, Step 2”. Pistachio will allow searching by the labels and resolving queries:

US 2020/0405696 A1 Example 313, Step 2
Azide-alkyne Huisgen cycloaddition (4.1.4)

Improved cross reference handling

Previously the cross-reference “Compound 1” was indexed and resolved just as “1”. We now use the “reference type” to disambiguate cases when there is both a “Compound 1” and “Intermediate 1”. The variety of recognised identifier values was also extended.


We have made several improvements to reaction extraction. These will be available in the new version of Pistachio that will be available at the start of next month.

Arthor and data science interoperability


The 3.0 release of Arthor brings many new features with it including an overhauled Python interface. This new interface has been written using Cython bindings to the C++ core of Arthor, with the goals of native speed for performing chemical searches as well as effortless interoperability with the existing Python data science ecosystem, i.e. numpy and pandas.

Loading and searching databases

Databases can be opened using either the arthor.SubDb or arthor.SimDb classes for substructure and similarity searches respectively. Once opened, these classes have a .search() method which takes a SMILES or SMARTS string respectively and returns a ResultSet object.

Here we load a similarity database (pubchem.smi.atfp) indexing the PubChem database, and also provide the corresponding smi file so we can later cross reference against it.

In [1]:
import arthor
In [2]:
simdb = arthor.SimDb('/nvme/arthor/pubchem.smi.atfp',

We can then search for the top 100 most similar hits for a particular compound. Here we’re searching for a candidate molecule suggested in the recent Open Source Malaria Series 4 meeting.

The returned arthor.ResultSet object acts as a cursor over the hits found.

In [3]:

rs ='C24=NN=C(C1=CC=C(OC(C)(C)C)C=C1)N2C(OCC(C3=CC(F)=C(F)C=C3)O)=CN=C4', limit=100)
CPU times: user 1.11 s, sys: 1.82 ms, total: 1.11 s
Wall time: 74.3 ms

Exporting to pandas

This ResultSet object can be iterated and sliced etc, but a more interesting option is that it has both a .read(nbytes) and .readline() method, allowing it to behave like an open file handle onto the results.

A use of this is to pass the arthor.ResultSet object directly into pandas.read_csv, allowing the results of the search to be efficiently slurped into a pandas.DataFrame:

In [4]:
import pandas as pd

df = pd.read_csv(rs, sep='\s+', names=['smiles', 'id', 'similarity'])

                                              smiles         id  similarity
0  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76311347       0.831
1  CN[C@@H](COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(...   76318585       0.702
2  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(CO)C4=CC(=C...   76314954       0.694
3  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76314948       0.686
4  CN(C)C(COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(F)...  129012036       0.678

Creating on the fly databases

Another new feature with the 3.0 release of Arthor is the ability to create substructure or similarity databases in memory. This is exposed in Python via .from_smiles classmethods which take an iterable of smiles (i.e. list, array or pandas series) to create a searchable chemical Database.

Here we create a substructure database of our previous results, and search for hits which feature the double fluoro substituted phenyl ring. The result set is then directly converted (with the to_array() method) into a numpy.array allowing it to index the original dataframe directly and pull out the rows which feature this substructure.

In [5]:

subdb = arthor.SubDb.from_smiles(df['smiles'])

filtered = df.iloc['c1(F)c(F)cccc1').to_array()]

CPU times: user 5.82 ms, sys: 0 ns, total: 5.82 ms
Wall time: 5.6 ms

To quickly visualise this, we can drop the results into rdkit:

In [6]:
from rdkit import Chem

    legends=list(filtered.iloc[:6]['similarity'].map(lambda x: str(x)[:4]))

Scalability of in-memory databases

The example databases in this notebook are toy examples to give an idea of the possibilities with the new API.

For those who are curious, these are the times for creating (currently this method is only single threaded) similarity and substructure databases of the entire PubChem datbase (currently 102 million molecules) within a notebook:

In [7]:

pubchem = pd.read_csv('/nvme/arthor/pubchem.smi', sep='\t', names=['smiles', 'id'])
CPU times: user 59 s, sys: 4.96 s, total: 1min 4s
Wall time: 1min 5s
In [8]:

pb_simdb = arthor.SimDb.from_smiles(pubchem['smiles'])
CPU times: user 16min 8s, sys: 10.2 s, total: 16min 18s
Wall time: 16min 10s
In [9]:

pb_subdb = arthor.SubDb.from_smiles(pubchem['smiles'])
CPU times: user 26min 52s, sys: 17.3 s, total: 27min 9s
Wall time: 26min 56s

Bioactivity databases

For more reasonably sized databases, for example Chembl 25, the times to create a database are much more reasonable for interactive use:

In [10]:

chembl = pd.read_csv('/nvme/arthor/chembl_25.smi', sep='\t', names=['smiles', 'id'])

print("Chembl 25 with {} rows".format(len(chembl)))
Chembl 25 with 1870461 rows
CPU times: user 1.67 s, sys: 132 ms, total: 1.8 s
Wall time: 1.8 s
In [11]:

chembl_simdb = arthor.SimDb.from_smiles(chembl['smiles'])
CPU times: user 19 s, sys: 125 ms, total: 19.2 s
Wall time: 19 s
In [12]:

chembl_subdb = arthor.SubDb.from_smiles(chembl['smiles'])
CPU times: user 25.5 s, sys: 329 ms, total: 25.8 s
Wall time: 25.6 s


This isn’t the end of the story by far, but is just a first pass of improvements to the Python API of NextMove’s tools. We’ve got plenty of more ideas for bringing these usability and productivity enhancements to Arthor, and our other products We’d love to hear what you think!

Optimising struct equality checks

Whilst working on an upcoming feature for our Arthor search system (the topic of another future blogpost!), I ran the feature through some performance profiling tools, partly out of curiosity and partly to check I hadn’t done anything too horrendous :).

Looking through the flame chart for this new feature showed mostly what I was expecting, as well as an unexpected visitor, the equality operator (operator==) for our AtomType struct. For the sake of argument, it looks roughly like this:

// in AtomType.h
struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;

  bool operator==(const AtomType& other) const;

// in AtomType.cpp
#include "AtomType.h"

bool AtomType::operator==(const AtomType &other) const {
  return ((this->element == other.element) &&
          (this->isotope == other.isotope) &&
          (this->charge == other.charge) &&
          (this->implicit_valence == other.implicit_valence) &&
          (this->explicit_valence == other.explicit_valence) &&
          (this->implicit_hydrogens == other.implicit_hydrogens) &&
          (this->explicit_hydrogens == other.explicit_hydrogens) &&
          (this->is_aromatic == other.is_aromatic));

The method is nothing fancy, just going through each field in the struct and checking that their values are equal. How bad can this be?

Digging down

To try and understand why this was taking long enough to show up in a profile, I turned to compiler explorer to view the x86 assembly generated by the compiler to perform this function call. All compilers we looked at produced output similar to this: (this particular code is from clang 9.0)

        mov     al, byte ptr [rdi]
        cmp     al, byte ptr [rsi]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 1]
        cmp     al, byte ptr [rsi + 1]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 2]
        cmp     al, byte ptr [rsi + 2]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 3]
        cmp     al, byte ptr [rsi + 3]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 4]
        cmp     al, byte ptr [rsi + 4]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 5]
        cmp     al, byte ptr [rsi + 5]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 6]
        cmp     al, byte ptr [rsi + 6]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 7]
        cmp     al, byte ptr [rsi + 7]
        sete    al
        xor     eax, eax

Here we can see that the processor is being instructed to load a single byte (mov al, byte ptr [rdi]) and compare (cmp) it to another byte for every single field in the struct. Should any comparison fail, the code jumps to the.LBB0_8 label and exit as false (xor eax eax, ret), otherwise the result of the 8th comparison is given as the return (sete al, ret).

Recognising that the 8 different members of the struct will be contiguous in memory, this seems like a laborious way to approach this. Instead we can instead check that they are all equal in a single function call by using memcmp.

#include <cstring>

bool AtomType operator==(const AtomType &other) const {
  return memcmp(this, &other, sizeof(AtomType)) == 0;

Better yet, this function call gets inlined and the assembly produced by all compilers now shows that this operation is being done in far fewer instructions and without any branching!

        xor       eax, eax
        mov       rdx, QWORD PTR [rdi]
        mov       rcx, QWORD PTR [rsi]
        cmp       rdx, rcx
        sete      al                                     

Here all 8 bytes of each struct are being loaded (mov dst QWORD PTR [src]) and compared (cmp rdx, rcx) in a single pass. (Checking the instruction tables over on Agner Fog’s website will let you check that these instructions take the same time as the single byte equivalents.)

With uncommon sized structs

Things get more interesting when the size of the struct is larger than the simple 8 byte struct originally presented. Instead if we have an awkwardly sized 11 byte struct….

// in AtomType.h
struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;
  unsigned char fieldA, fieldB, fieldC;

  bool operator==(const AtomType& other) const;

We get different assembly produced from each compiler as they try and optimise how to instruct the processor to perform the memcmp call defined in our implementation. Firstly, gcc 9.2:

        mov     rax, QWORD PTR [rsi]
        cmp     QWORD PTR [rdi], rax
        je      .L5
        mov     eax, 1
        test    eax, eax
        sete    al
        movzx   eax, WORD PTR [rsi+8]
        cmp     WORD PTR [rdi+8], ax
        jne     .L2
        movzx   eax, BYTE PTR [rsi+10]
        cmp     BYTE PTR [rdi+10], al
        jne     .L2
        xor     eax, eax
        jmp     .L3

Here gcc first checks the first 8 bytes (mov QWORD PTR, cmp), jumps down to label L5 then checks the next 2 bytes (WORD PTR), then finally the last byte. Any failed comparisons will jump to label L2 which will return 0 (false). This has split the 11 bytes into much more computer friendly (i.e. addressable) chunks of 8, 2 and 1, required 3 total loads and comparisons. Can this be improved upon? Enter clang 9.0:

        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 3]
        xor     rax, qword ptr [rsi]
        xor     rcx, qword ptr [rsi + 3]
        or      rcx, rax
        sete    al

Here we can see that clang is instead doing 2 QWORD PTR movs, which initially looks like 16 bytes of data(?) until we see that the second load starts only 3 bytes after the first, so in effect the middle 5 bytes of the struct have been loaded twice. Next this memory is compared using xor (instead of cmp) which produces the same result but the result will operate in place (as opposed to cmp which will store the result in a special flag register, requiring the result to be checked before it is lost!). The result of the two comparisons are then combined using an or and this result returned.

The overlapping read approach is more elegant as only 2 loads are required to read the entire struct, and is now affectionately called “memory shingling” in the office :). Finally we can check out what icc 19.0 will give us:

        push      rsi
        mov       edx, 11
        call      _intel_fast_memcmp
        xor       edx, edx
        test      eax, eax
        sete      dl
        mov       eax, edx
        pop       rcx

Ok, so Intel just calls out to _intel_fast_memcmp, which isn’t very fun for the purposes of this blog. A bit anticlimactic really.


It’s not entirely clear to me why the compiler can’t do these optimisations itself. Even if the struct is declared as just having unsigned char fields[8] (so struct padding isn’t an issue) the compiler will still check each byte individually and won’t jump to comparing all bytes at once.

This little find put me on a bit of an optimisation binge, which has turned up some other fun hacks elsewhere in our codebase too! Hopefully these will be filtering out into our releases this year (alongside the new features to Arthor I’ll blog about soon!)

Thanks for reading my first blog! – Richard

See you at the San Diego ACS

Roger, Richard and I will all be attending the upcoming 258th ACS National Meeting in San Diego, so come say hi (and get your bandit sticker!). Our presentations are shown below, along with talks by Jian Zhang and John Irwin where we are listed as co-authors.

As an addition to the official ACS agenda app, I’ve put together a spreadsheet showing all the talks in CINF, COMP, MEDI and ORGN ordered by session and time which you may find useful.

10:40 Grand Ballroom A, Omni CINF 5: Current challenges in text-mining for chemical information (Roger)
18:30 Exhibit Hall A, Conv Center CINF 44: Medicinal chemistry based measure of R group similarity (Noel, CINF reception poster)

09:10 Grand Ballroom D, Omni CINF 55: Making a hash of it: Advantage of selectively leaving out structural information (Noel)
14:05 Grand Ballroom A, Omni CINF 70: Using DOCK and ZINC to visualize ultra-large chemical libraries (John Irwin, UCSF)
16:20 Grand Ballroom A, Omni CINF 74: Data visualization for compound library enhancement: Application of artificial intelligence algorithms from computer chess (Roger)
20:00 Exhibit Hall B, Conv Center CINF 44: Medicinal chemistry based measure of R group similarity (Noel, Sci-Mix poster)

08:40 Grand Ballroom E, Omni CINF 106: Measuring R group similarity using medicinal chemistry data (Noel)
13:35 Grand Ballroom E, Omni CINF 125: Biologics information in PubChem (Jian Zhang, PubChem)
14:25 Grand Ballroom E, Omni CINF 127: Representational and algorithmic challenges in biologic informatics 2019 (Roger)

Putting PubMed peptides in PubChem

Following on from the work described in the last post, I have put together a dataset of text-mined peptides which I’ve uploaded to PubChem. This involved extending our biopolymer grammar (which I use to textmine with LeadMine) and improving the ability of Sugar&Splice to interpret the text-mined entities.

The dataset consists of 5350 unique peptides extracted from PubMed abstracts (up to 2017), comprising 8699 peptide line notation/PubMed ID (PMID) pairs. Although the grammar identifies many more peptides, I’ve excluded them as they are either very short, appear to be fragments of a larger peptide, or are not currently interpreted by Sugar&Splice.

The most popular peptide is “Cbz-Val-Ala-Asp-CH2F”, which I found in 296 abstracts in a variety of forms, e.g. “N-benzyloxycarbonyl-Val-Ala-Asp-fluoromethylketone” (PMID 8628997), “Z-Val-Ala-Asp-fluoromethyl-ketone” (PMID 10554885). This is an irreversible pan-caspase inhibitor, and appears to be used in studies of apoptosis. Not surprisingly, this was already in PubChem (CID 5497171). Its Asp(OMe) form also ranks highly as the fourth most popular peptide line notation (88 abstracts).

At the other extreme is the peptide that appears as “Pro-Gly-Phe-Ser-Pro-Phe-Arg” in PMID 5120. This was not found in any other abstract, and is a new entry in PubChem despite the 42 years since its publication in Biochemistry. It can now be found in PubChem at CID 134824750. If you go there and scroll down to “Depositor Provided PubMed Citations”, you will find the link to the PubMed abstract. This describes the action of a peptidase on various peptides. On the right-hand side under “Related Information”, if you click on “PubChem Compound” you will see listed two peptides mentioned in the text that were extracted by LeadMine, as well as an entry for bradykinin.

To see the complete NextMove Software biologics dataset, click on “8699 Live Substances” on the source page. For more information on using LeadMine to mine PubMed abstracts, see this post. Obviously, there is nothing here specific to PubMed Abstracts – other interesting datasets would be the patent literature or internal company documents.

How to avoid inventing peptide monomer names

As part of my work on Sugar&Splice, I regularly search through PubChem for biopolymers where we recognise most of the components but maybe miss one or two. My typical approach is to extract the SMILES strings of these monomers, and consider the most commonly occuring ones as candidates for addition to our internal library. When I do this, I try to avoid making up our own names for monomers, and instead defer to IUPAC or vendor catalogues or what’s commonly used in the field.

But identifying what’s commonly used in the field isn’t always obvious from a few Google searches. In a talk I presented at the Fall ACS in Boston (see below), I described a text-mining approach (using LeadMine) to determine commonly-used monomer names that we were missing.

The talk focussed on peptides, but the principle is general: I took a corpus of scientific text (e.g. US patents or PubMed abstracts) and searched for matches to a peptide line notation grammar – you can think of this as a regular expression that matches IUPAC condensed notation like H-Ala-Cys-OH. Next, I looked for gaps between text that is recognised (see the image below) and extracted the text. I then looked for the most frequently occurring phrases and added them to our internal library. These were often synonyms of monomers we already had, but not always.

Identification of “Igl” as a missing monomer name

The same principle can be applied to terminal modifications (look at the text that occurs before or after matches) and side-chain modifications (look for text within parentheses after an amino acid).

This approach worked really well. However, some care is needed to ensure that different peptide scientists really mean the same thing when they use the same term. An example is “Dpa”, which it turned out meant diaminopropionic acid to some people but 3-(2,4-dinitrophenyl)-2,3-diaminopropionic acid to others.

See you at the Boston ACS?

Roger, John and I will all be presenting talks at the upcoming 256th ACS National Meeting in Boston. There’s also a related talk by the folks at Chemspace about their integration of Arthor for substructure and similarity search. As an addition to the official ACS agenda app, John has put together a spreadsheet showing all the talks in CINF (and most in COMP) ordered by session and time which you may find useful.

9:50 Lewis, Westin CINF 11: De facto standard or a free-for-all? A benchmark for reading SMILES (Noel)
13:45 Harbor Ballroom III, Westin CINF 35: Structure searching for patent information: The need for speed (John)

16:25 Grand Ballroom E, Westin CINF 77: Building a bridge between human-readable and machine-readable representations of biopolymers (Noel)

10:25 Grand Ballroom A, Westin CINF 162: NextMove for Chemspace: Millisecond search in a database of 100 million structures (Oleksii Gavrylenko, Chemspace)
10:40 Lewis, Westin CINF 170: Regioselectivity: An application of expert systems and ontologies to chemical (named) reaction analysis (Roger)