CXSMILES – Part 2: Component Grouping

This post is a follow up on the previous introduction – Part 1. Here I examine how we can capture fragment grouping in CXSMILES and other extensions.

Fragment Grouping

Fragment grouping (or component grouping) allows you group together separate fragments/components of a molecule. It is critical for reaction representation and therefore several independent SMILES extensions that have emerged. Common cases include keeping counter-ions, hydrates, and salts together as a single “molecule”.

EP2305640A2 Example 11, Step v
EP2305640A2 Example 11, Step v
(CXSMILES with fragment grouping)


Here is a simple example annotated with the fragment indexes, we want to group together (0,1) and (3,4,5):

[Na+].[OH-].c1ccccc1.[Cs+].[O-]C(=O)[O-].[Cs+]>> |f:0.1,3.4.5|
--0-- --1-- ----2--- --3-- -----4------- --5--

Component indexes span the entire reaction, so we can for example move to the agents and the CXSMILES encoding does not change:

[Na+].[OH-].c1ccccc1>[Cs+].[O-]C(=O)[O-].[Cs+]> |f:0.1,3.4.5|
--0-- --1-- ----2--- --3-- -----4------- --5--

Does it only apply to reactions?

Toolkit dependent. ChemAxon appears to only read/write it on reactions (MarvinJS v22.9.0) but it’s also useful on molecules to capture formulations/mixtures (e.g. Artifical seawater) . In reaction, fragment grouping of agents (between the two “>”) appears to be ignored in MarvinJS – so my example images aren’t valid examples. One of our customers tested Marvin Desktop v21.15.1 for me and confirmed it round-trips correctly.

Do component terms need to be adjacent?

Toolkit dependent. In older versions of ChemAxon desktop tools (I no longer have access) I remember it would reject non-adjacent components:

[Na+].c1ccccc1.[OH-]>> |f:0.2|

This does not seem to be the case in MarvinJS and again a customer again confirmed it round-trips ok in Marvin Desktop.

Spanning Different Roles?

An input where the roles of the components being grouped (e.g. a reactant and a product) could be rejected as inconsistent:

c1ccccc1.[Na+]>>[OH-] |f:1.2|
c1ccccc1.[Na+]>>[OH-] |f:2.1|

MarvinJS and CDK default to the role of the first component encountered so those two inputs are different. As the author of the CDK logic I will note this is consistent only by coincidence.

Implicit grouping?

We can implicitly group components with multi-attach (m:) and Sgroup brackets. For example consider the following, which is preferred?

**.c1ccccc1 |$R1$,m:1:|
**.c1ccccc1 |$R1$,f:0.1,m:1:|


Daylight SMILES5

As with the cis/trans specification, SMILES5 had a solution:

Molecule-level Components: There will be another level of components within a molecule and reaction object which will allow easier handling of complex mixtures.” – Futures, MUG 2005

SMARTS has component grouping using zero-level brackets already and it would likely have followed a similar syntax:


Notice the wording that it applies to both molecule and reactions.


An extension used by Eli Lilly’s LillyMol is to treat the “.” to separate fragments and use a “+” to separate the molecules.


Note that LillyMol also supports CXSMILES.

IBM RXN for Chemistry

Schwaller et al 2020 describe how they use “~” to group together fragments. They note in the supplementary information how this is more useful than CXSMILES for their purposes since it enforces the fragments are kept together:


NextMove (proposed) / OntoChem

In 2013 Roger proposed a double-dot “..” for a similar purpose. The advantage being that “relaxed” SMILES parsers will simply ignore the repeated dot:


OntoChem also use this representation in reactions but I cannot find a link to relevant material.

NextMove (actual)

In Pistachio we use CXSMILES for fragment grouping in reactions. In recent releases we have tried to use alternative representations that avoid the problem where possible:


Where needed we still use CXSMILES since it is the most widely supported convention:

HATU reagent
US20200138797A1 Intermediate 3

The fragment grouping also gets captured in the JSON format of reactions albeit much less compactly:

role: "Product",
orgName: "title compound",
name: "Methyl (2S)-2-amino-3-(2-chlorophenyl)propanoate hydrochloride",
smiles: "Cl.N[C@H](C(=O)OC)CC1=C(C=CC=C1)Cl",
quantities: [ {type: "Mass", value: 5.89, text: "5.89 g"},
              {type: "Yield", value: 94, text: "94%"}],
stoichiometry: 1

We also support interconversion of the LillyMol syntax in our reaction processing tool set, HazELNut:

$ echo "*>[Na+].[OH-].[Cs+].[O-]C(=O)[O-].[Cs+]>* |f:1.2.3,4.5|" | ./filbert .smi .iwsmi

$ echo "*>C(=O)([O-])[O-].[Cs+]+[OH-].[Na+].[Cs+]>*" | ./filbert .iwsmi .smi
*>C(=O)([O-])[O-].[OH-].[Na+].[Cs+].[Cs+]>* |f:1.4,2.3.5| 

CXSMILES Gotchas – Part 1: Bond Indexes

ChemAxon Extended SMILES and SMARTS (CXSMILES) has become more popular in recent years for its ability to capture additional information on top of the core structural connectivity.

At NextMove Software we think it’s great and have increasingly used CXSMILES over the years to capture information more precisely.

Such structures can be captured as CXSMILES:

c12cccc1C=CC=C2.*[Zr](*)(Cl)Cl.c12cccc1C=CC=C2 |m:9:,11:| US20150376312A1 (2)
[c-]12cccc1C=CC=C2.*[Zr++](*)(Cl)Cl.[c-]12cccc1C=CC=C2 |m:9:,11:| US20150376312A1 (2)
C1CCOC1.C(Cl)Cl |Sg:c:0,1,2,3,4::,Sg:c:5,6,7::,Sg:mix:0,1,2,3,4,5,6,7::,SgH:2:0.1| 20% THF/DCM
[N+](=O)([O-])C1=CC=C(C(=O)O[C@H]2[C@H](CCCC2)N2N=C(C(=C2)NC(=O)C=2C=NN3C2N=CC=C3)C3=C(C=CC(=C3)Cl)OC)C=C1 |r| US20200002345A1 Ex 136


Originally created by ChemAxon as a way to to store a CTfile (i.e. MOLfile) in SMILES without losing information – It has evolved to be useful on its own right. Recent versions of the Enamine REAL use it for stereochemistry groups and there are efforts by InChI to better represent inorganic, mixtures, and reactions which could make use of it. Since CXSMILES has started to gain traction with more toolkits supporting the format it is becoming a convenient lingua franca for advanced chemical representations.

The following toolkits support CXSMILES to various degrees:


ChemAxon provides public documentation on CXSMILES but there are corner cases and some wonky areas I want to discuss. I planned to get this captured in one post but quickly realised the first topic alone is enough on it’s own. I will update this links below as new posts appear:

  • Bond Indexes
  • Component Grouping
  • Enhanced Stereo Canonicalisation
  • SRU Ambiguity
  • Atom Labels
  • EPAM Highlight Extension
  • Dative bond valence
  • Multi- vs Variable- attachment
  • Wish-list – beyond CXSMILES
    • Partial feature set
    • Atropisomers
    • More compact coordinates
    • cis/trans- stereo groups

Bond Indexes

Atoms and bonds in CXSMILES are referenced by index (0 <= idx < n) which is the position in the SMILES string:

CC=CC |c:1|
CC=CC |t:1|
CC=CC |ctu:1|

In this case it’s easy to see, there are three bonds at index: 0, 1, 2. The bond at index 1 is specified as cis (c:) or trans (t:) or unspecified (ctu:). As a linear notation, some bonds get written twice. Does the index increment twice? Is the ring open (first occurrence) or ring close (second occurrence) the “reference”?

Using a dot-disconnection trick we can reorder the previous example and probe behaviour of what ChemAxon accepts:

C1C.CC=1 |c:0| wrong
C1C.CC=1 |c:2| correct
C=1C.CC1 |c:0| wrong
C=1C.CC1 |c:2| correct - less desirable IMO

The “correct” choice is |c:2| – the closure bond is the reference. I should note this is somewhat an artefact of how the SMILES is parsed. A useful efficiency trick in SMILES is to use partial bonds (leave one atom temporarily undefined) which in this case would give you the incorrect index unless extra steps were taken.

A similar issue exists in vanilla SMILES when bond types mismatch: C#1C.C=1 or C/1=C/C.C/1. The bond that takes precedent is toolkit dependent, hopefully you would get a warning or error.

In case you don’t like the dot-disconnection being used like that, here is one in a macro cycle:

C=1CCCCCCCCCC1 |c:0| wrong
C1CCCCCCCCCC=1 |c:10| correct

The double counting question is already answered since it was |c:2| and not |c:3| but to confirm bonds do not get counted twice:

C1CCCCC1C=CC |c:8| wrong
C1CCCCC1C=CC |c:7| correct

Double bond configurations in rings are rare but bond indexes also apply to dative/hydrogen bonds and it is important there is consensus on how it works.

SMILES already supports cis/trans specification so why does CXSMILES add this? It turns out to avoid error propagation you need to be able to specify unknown configuration and this is not always possible in normal SMILES. SMILES uses "/" and "\" – it looks pretty but causes problems. Pause for a second and see if you can write the SMILES for the following structure:

Can you write the SMILES for this?

Maybe you wrote something like this:


Indeed most toolkits will do exactly that! Unfortunately we’ve added information that wasn’t there and inadvertently defined the middle bond. It is actually possible if you add explicit hydrogens:


But this only gets you so far, what if we had nitrogens? CXSMILES allows us to encode this unambiguously:

CC=NC=CN=CC |c:1,5|

This may seem like a narrow corner case but if you try to parse PubChem you will find a lot of them – or rather inconsistency warnings. Here are some warnings from CDK:

Ignoring invalid directional bonds
C1=C/C/2=C/N=C3/C(=C/4\C(=N/C=C/5/C(=C2/C=C1)/C=CC=C5)C=CC=C4)/C=CC=C3 5379414
                    ^ ^
Ignoring invalid directional bonds
CO/C(=C(\C#N)/C=C(/C=C(/C(=O)OC)\C#N)/C=C(/C(=O)OC)\C#N)/O 5720097
                  ^                  ^

Depending on the traversal and bond direction assignment we may accidentally define the configuration of this bond and no warning would be generated or alternatively it will come out with an “invalid” syntax.

PubChem CID 5379414
PubChem CID 5720097

It’s problematic and Daylight had planned to address it. At the user group meeting in 2005 a Futures talk makes references to preliminary work on SMILES5: “A unification of the EZ stereo representation with atom-based stereo is proposed. This will allow better specification of multiple conjugated EZ centers and also allows more robust specification of relative stereochemistry.”.

It may have looked something like this:

C[C@H]=[C@H]C=C[C@H]=[C@H]C cis,cis-
C[C@H]=[C@H]C=C[C@H]=[C@@H]C cis,trans-
C[C@H]=[C@H]C=C[C@@H]=[C@H]C cis,trans-

I actually added support for this in CDK but in hindsight I think using CXSMILES is simpler but less elegant:

CC=CC=CC=CC |c:1,5|
C/C=C\C=C\C=C/C |ctu:3|
C/C=C\C=C\C=C/C |c:1,5,ctu:3|

Which is best depends on the application – ignoring the CXSMILES you either get no configuration or the wrong configuration. With CXSMILES these should all canonicalise to the same thing.

SmallWorld 5.0: Faster, Smaller and Easier to use

SmallWorld allows the searching of chemical space by graph edit distance. It works by enumerating and connecting the common subgraphs of chemical structures into a large network. Similar molecules are found by traversing the network using a breath-first-search (BFS) to find the shortest path (minimum number of changes) between the molecules. 

The latest database release (2021) contains 675 billion graphs and 9.16 trillion directed connections. To search the database we need efficient techniques to store, access and handle the data. We’ve just released version 5.0 of the SmallWorld search software which has some exciting improvements I want to highlight.


Searches now run faster thanks to improved data structures, IO scheduling and format changes. 

Compressed Bitset

To perform a BFS of the network we need to maintain the set of visited nodes. This was previously accomplished with a binary-tree (std::set) and used ~40 bytes per entry with access times on the order of log N. We now use a compressed bitmap (inspired by Roaring Bitmaps) which uses ~1 byte per entry and has roughly constant time access. We avoid unnecessary calls to malloc by storing entries inline using a C union where possible.

A compressed bitset divides values into containers (buckets) and then chooses the most compact representation for that container. An array container is accessed with binary search, small array containers can be stored inline. The bitmap containers are used for dense ranges of the set and use a traditional bitset. 

Route Taken

To score chemical entries we need to know the route the search took through the network. We can store the route using back-pointers that record where we came from each time a connection is traversed. Unfortunately this uses twice the memory and since our hits are relatively sparse it is wasteful. All we really need for scoring is the Maximum Common Subgraph (MCS) between the query and hit and this can be captured by recording the inflection points of a search. This is much more efficient to store and adds minimal overhead to the traversal.

Example route taken during a search. This allows us to know how to transform the molecule on the left to the one on the right in three steps. Adding a linker (-O-), a terminal (-NH2), and finally closing the ring.

IO Stalls

Typically when performing IO, a program requests data from a file and then waits for it to be returned. While the storage device is retrieving the data the program can’t do anything and stalls. This is somewhat hidden when reading a file sequentially as the operating system (OS) speculates about what data is needed next. SmallWorld searches do not access data predictably meaning IO operations create a bottleneck. One method of reducing stalls and increasing throughput is with Asynchronous IO (AIO). This allows multiple IO requests to be queued at once and resolved when ready. Linux recently introduced io_uring which makes this easier than previous implementations but it requires a very recent kernel version. The method we use to reduce the stalls is to advise the kernel what data we will need before we read it. This is achieved with the madvise and fadvise system calls – a preliminary pass over the data issues the calls and allows the OS to fetch these in the background. This provides a substantial speed up even when using the old database format.

Speed comparison of v4.2 (last release) and v5 for selected queries in thousands of traversed edges per second (KTEPS). Times are recorded on a Raid0 array of spinning HDD.


The number of graphs and connections in the network grows with each release; Novel chemicals are inserted and their sub-graphs enumerated. As the network grows this requires more storage space. Each graph in the SmallWorld network is assigned an id based on its bond and ring count and a unique index (B{b}R{r}.{idx}). The connections are specified as pairs of indexes. Previously we stored the connection data in a compressed-sparse row format using packed 5-byte integers (indexes up to 240) this averaged about ~7.2 bytes. Our initial compression approaches focused on bit-packing and VarByte encoding but this was only able to bring us down to ~5.8 bytes average. Using a technique similar to Grabowski and Bieniecki (2014) where edges are grouped in blocks, we were able to bring this down to ~3 bytes average – significantly reducing the storage space required.

Graphs Connections Old Format New Format
2019 230 billion 2.75 trillion 20.6TB 8.1TB
2020 471 billion 6.12 trillion 44.6TB 18.3TB
2021 675 billion 9.16 trillion 66.0TB 27.6TB

Not only do the storage requirements decrease but it happens to be more efficient to access the data:

Speed comparison of the old vs new format on selected queries in millions of traversed edges per second (MTEPS)


Storage technology is evolving rapidly and given the new smaller database size we decided to invest in an NVME raid setup providing 56TB of fast storage (max of 7 in stock):

New hardware: 7x 8TB Sabrent Rocket Q NVME Drives and HighPoint SSD7540

This allows even faster searches and we believe with further algorithm turning, parallelisation and AIO it should be possible to go faster still:

Speed comparison of spinning HDD vs NVME Raid0 array. Speed is in millions of traversed edges per second (MTEPS)

Easier to use

The old storage format was very simple and it was easy to implement a BFS in any programming language. We had implementations in C++, Java, or Python. Decoding the new format is more complex and we decided it was time to provide a unified common API and bindings. A disadvantage of the multiple implementations were some were more advanced than others. Providing a common base allows SmallWorld searches from Python to run almost as fast as C++:

import pysmallworld

db = pysmallworld.Db("chembl_29")

num_hits = 0
for hit in"Clc1ccccc1", max_dist=4):
  print (num_hits, hit.vector, hit.dbref, hit.smiles)
  num_hits += 1

One of the new features that comes with the API is the ability to perform a multi-source BFS where hits are reported that are the closest to any of the provided queries:

import pysmallworld

db = pysmallworld.Db("chembl_29")

query = db.new_query()

num_hits = 0
for hit in db.search_query(query):
  print (num_hits, hit.vector, hit.dbref, hit.smiles)
  num_hits += 1

SmallWorld v5.0 is now available to download and is better than ever.

We’re hiring! Engineers with a passion for cheminformatics and algorithms, head over to our Careers page for more info.

13,118,970 Reactions and Counting

In 2012, 2014 and 2017 Daniel Lowe (while at NextMove Software) released a large collection of reactions extract from USPTO patent applications as CC-Zero. We have made updates (currently quarterly) for customers of our Pistachio query tool and extended the reaction data to include USPTO Sketches and European patents.

The next version of Pistachio will include data from WIPO PCT documents and include several enhancements to content and representation. Highlight these in a blog post made more sense than the usual release notes. 

Number of Reactions

The next release will contain >13,118,970 reactions from the following sources:

Source Count Latest Extraction
USPTO Grant Text 3,290,056 2021-03-16
USPTO Appl. Text 3,595,510 2021-03-18
WIPO PCT Text 1,484,646 2021-03-18
EPO Grant Text 1,060,397 2021-03-17
EPO Appl. Text 696,578 2021-03-17
USPTO Grant Sketch 1,186,924 2021-03-16
USPTO Appl. Sketch 1,804,859 2021-03-18

Pistachio covers 1976-current day however 4,447,418 (33%) were published in the last five years and 8,166,128 (62%) in the last ten years.

The reaction data is document (or citation) centric, the same reaction text often occurs in an application and grant. It can also be published in multiple authorities (e.g. USPTO, EPO and WIPO). Often the description text is identical but not always, sometimes a product yield/quantity may be miss-typed or omitted. The number of unique reactions by RInChI is 4,212,894.

All reactions extracted from text are Atom-Atom Mapped either by NameRxn or if the reaction is unrecognised we fallback to Indigo. 9,383,607 (71.5%) are currently recognized by NameRxn.

What’s New?

Through improvements related to the WIPO PCT inclusion we observed a ~15% increase in recall for existing USPTO and EPO data. This is one of the many advantages of automated extraction over manually curated databases. Tweaks can be made and mistakes can fixed then applied in bulk over all the original source documents. Re-extraction currently takes a few days on a single machine (8 cores). Where miss-extraction mistakes are spotted we welcome the feedback and aim to resolve these where possible.

Embedded Heading Detection

One of the biggest challenges with handling WIPO PCT data is the English text is primarily OCRd. The submitted document quality can vary considerably which leads to a wide spectrum of related problems. 

OCR is well known to have issues with the non-standard characters   used in systematic chemical names. Fortunately the majority of issues are simple character transliterations and extra white space. These can be effectively handled with our spelling correction algorithms. Some very badly corrupted names are beyond all hope:

WO 2020/243135 A1
Part of the chemical name could not be OCRd and remains as an image

Another common issue with OCR is the detection of paragraph breaks and lack of title markup. Often chemical reaction descriptions use anaphoric references “title compound”, “desired product”, and “product from Step B” that need to be resolved. If the title is not found we can’t resolve it.

Paragraph breaks may be omitted:

WO 2020/239862 A1
There should be a paragraph break before “Intermediate 2:”

or in the wrong place (here splitting a chemical name):

WO 2020239862 A1
The “Step H” compound has a paragraph break in the middle of the name.

To compensate for this, new algorithms were introduced to detect patterns of embedded headings where the break was missed by OCR. Existing inline heading (start of paragraph) detection was also improved.

These errors are not unique to WIPO data and occasionally occur in the USPTO documents too:

US 2020/0071296 A1
There should be a break before “Step-2:”

Patentscope (WIPO) recently announced a new OCR extraction framework that should improve paragraph detection and chemical formula recognition (from Feb 11th 2021). We have not yet found an improvement in the extraction recall rate.

Multi-Paragraph Reactions

In previous versions, the reaction step parsing started a new reaction on every new paragraph. This logic was tweaked to allow a reaction to span multiple paragraphs and handle the less reliable breaks in WIPO data. Regressions in USPTO extraction helped identify places in the reaction descriptions where a yielded product action was missed.

A side effect of this is we now sometimes extract cases where there was some unknown intermediate (A -> ?, ? -> B) as a single reaction. We are considering how to handle these.

Prefer Connected Representations

Chemical structures are generated from systematic names, line formulae, and dictionaries. Where possible we have updated the structure representations to favour connected representations:

[K]OC(=O)O[K] instead of [K+].[K+].[O-]C(=O)[O-]
[Na][Cl] instead of [Na+].[Cl-]

Using CXSMILES fragment groups it is possible to keep the grouping of the counter-ions. The reaction components were also listed separately in the raw JSON files. Not all downstream tools can consume the CXSMILES representation and some users invented/adapted syntaxes to handle in their use cases e.g.


The philosophy in preferring the connected component is that it is easier to split things apart then piece them back together (Humpty Dumpty). Note that not all counter-ions have been bonded due to undesirable valence representations (e.g. HATU, NH4Cl) and so the CXSMILES fragment groups remain a useful extension.

Representation of stereoisomer mixtures

I recently added the ability to OPSIN to capture racemic and relative stereo information in systematic names. In total around ~1% of reactions now have this information captured in CXSMILES. In a simple example we have an unknown mixture of enantiomers formed:

BrC1N2C=NC=C2SC=1C1CC1.C(C1N(C2C=CC(OC)=CC=2)N=NC=1C=O)C>>C1(C2SC3=CN=CN3C=2[C@H](C2N=NN(C3C=CC(OC)=CC=3)C=2CC)O)CC1 |&1:38|	US20200405696A1_1398	Example 279

A more complex example is where one stereocenter configuration is known but the other is not:

ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CN(C(OC(C)(C)C)=O)CC3=N2)=NC=CC=1C(F)(F)F>ClCCl.FC(F)(F)C(O)=O>ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CNCC3=N2)=NC=CC=1C(F)(F)F |&1:8,55,a:13,60| US20200375986A1_0652 Example 4

Both of these cases could alternative be represented by simply removing the configuration from the racemic atom (and we may choose to normalise to that in future). The most important cases are when the relative configuration of two centres is known but that there is a mixture of enatiomers.

C(OC([C@H]1[C@H](C)CN(CC2C=CC=CC=2)C1)=O)C.CC(OC(OC(OC(C)(C)C)=O)=O)(C)C>C(O)C.[OH-].[OH-].[Pd+2]>C[C@H]1CN(C(OC(C)(C)C)=O)C[C@@H]1C(OCC)=O |f:3.4.5,&1:3,4,40,51|	US20200369658A1_0530	Intermediate 2, Step b


Recent updates to NameRxn include limited support for some common non-balanced Functional Group Interconversion and Addition reactions, i.e, “Hydroxy to chloro” and “Amination”.  This work also allowed a small performance boost. The total number of reaction types named is now 1,528 (from 1,297 previously), one source for additional reactions has been RXNO. NameRxn was not originally designed to provide Atom-Atom Maps; as that has become more of an interest we have made improvement to AAM where a functional group source was unknown.

Solvent Mixture Representations

More information about solvents and solvent mixtures is captured, for example: “5-chloro-3-((trimethylsilyl)ethynyl)pyrazin-2-amine (100 mg, 0.44 mmol) in THF (8 mL)” (US20200085822A1 Example 1 Step 6) the THF solvent is associated by reference from the reactant.

Finer grained details on component/volume fractions of solvent mixtures is also captured when described as “1M HCl Et2O”, “THF/MeOH (1 mL, 1:1)

Sequence and Step Labels

Where found in the text we now include and attach the sequence and step labels to reactions. For example “Example 4, Step A”, “Compound 7, Step 2”. Pistachio will allow searching by the labels and resolving queries:

US 2020/0405696 A1 Example 313, Step 2
Azide-alkyne Huisgen cycloaddition (4.1.4)

Improved cross reference handling

Previously the cross-reference “Compound 1” was indexed and resolved just as “1”. We now use the “reference type” to disambiguate cases when there is both a “Compound 1” and “Intermediate 1”. The variety of recognised identifier values was also extended.


We have made several improvements to reaction extraction. These will be available in the new version of Pistachio that will be available at the start of next month.

Pistachio: Search and Faceting of Large Reaction Databases

I recently gave a talk at the Washington ACS on a reaction database and search system: Pistachio. We built Pistachio to browse and search reactions extracted from patents. The system brings together many of our existing products and technology components including: LeadMine, PatFetch, NameRxn (HazELNut), and Arthor. This post summarises the key innovations of Pistachio with more details on the searching to follow in another post.

Data The core deployment of Pistachio currently contains ~6.9M reaction details. The majority are extracted from experimental procedure text in patents (~4.2M USPTO, ~0.9M EPO). The remaining ~1.8M are extracted from sketches in U.S. patents (see Sketchy Sketches). Each reaction record is linked back (e.g. via PatFetch) to the location in the patent where it was extracted. Reactions from in-house electronic lab notebooks can also be added.

Reaction Diagrams As the majority of reactions are from text we must re-generate a reaction diagram from SMILES. To this end I’ve spent some time improving the reaction depiction in the Chemistry Development Kit (CDK). An example is shown in Figure 1 and compared with other tools in the talk on Slides 12-15 of the talk below.

Figure 1. A Chloro Suzuki Coupling generated from SMILES, source text: US20160016966 [0517]

Classification/Atom-Atom Mapping Every reaction is run through NameRxn to classify it, and simultaneously assign an Atom-Atom Mapping. Atom-Atom Mapping programs typically utilise Maximum Common Substructure (MCS) that can be slow and fail to correctly map certain reactions. Since NameRxn does not utilise MCS it is fast to process reactions and provides high quality atom maps (Figure 2).

Figure 2. Cyclic Beckmann Rearrangement, MCS-based Atom-Atom Mapping programs would find it difficult to map this correctly.

Search Queries are issued as natural language through an omni-box interface (Figure 3). The input text is interpreted with LeadMine and transformed in to the database query expression. I’ll expand more on the searching technology and capabilities in a follow up post.

Figure 3. Example of a Pistachio query.

What to know more? Additional information and a video demonstration of Pistachio working is on the product page. Pistachio is currently deployed as a Docker image and if you work for a large pharmaceutical company you may find you already have Pistachio running in-house. If you are interested in Pistachio or other areas of reaction informatics please contact us.

When compression makes things bigger

We’ve been looking into supporting Self-Contained Sequence Representation (SCSR) in Sugar&Splice (NextMove Software’s biologics perception, conversion, and depiction toolkit, as used by PubChem). SCSR is reported (Chen et al. 2011) as a “compressed format that retains chemistry detail”.

At NextMove, we’ve long argued that the best way to store peptides for registration is as the full connection table rather than as a compressed form. The primary advantage of this is that existing infrastructure for compound registration can be reused with minimal or no changes. On modern hardware, traditional cheminformatics algorithms can easily handle much larger structures (Sayle et al. 2015). An obvious problem is that without peptide perception (e.g. using Sugar&Splice), duplicates are missed if a user inputs a fully expanded structure instead of a compressed representation. A more subtle problem emerges with modified amino-acids in compressed representations, e.g. pyroglutamic acid may be considered different it was entered as modified glutamic acid or proline.

Having distinct registration systems for peptides and compounds is more complex and therefore more error prone, and more expensive to maintain.


When I generated the SCSR output I noticed that each line for a monomer looked longer than the SMILES for each fully expanded monomer. This means that while in theory this is a compressed format, it’s actually still larger than an uncompressed SMILES string. To demonstrate here are different representations of Beefy Meaty Peptide:

IUPAC Condensed:H-Lys-Gly-Asp-Glu-Glu-Ser-Leu-Ala-OH


  0  0  0     0  0            999 V3000
M  V30 COUNTS 8 7 0 0 0
M  V30 1 Lys 1.0 1.0 0 0 CLASS=AA ATTCHORD=(2 2 Br) SEQID=1
M  V30 2 Gly 2.0 1.0 0 0 CLASS=AA ATTCHORD=(4 1 Al 3 Br) SEQID=2
M  V30 3 Asp 3.0 1.0 0 0 CLASS=AA ATTCHORD=(4 2 Al 4 Br) SEQID=3
M  V30 4 Glu 4.0 1.0 0 0 CLASS=AA ATTCHORD=(4 3 Al 5 Br) SEQID=4
M  V30 5 Glu 5.0 1.0 0 0 CLASS=AA ATTCHORD=(4 4 Al 6 Br) SEQID=5
M  V30 6 Ser 6.0 1.0 0 0 CLASS=AA ATTCHORD=(4 5 Al 7 Br) SEQID=6
M  V30 7 Leu 7.0 1.0 0 0 CLASS=AA ATTCHORD=(4 6 Al 8 Br) SEQID=7
M  V30 8 Ala 8.0 1.0 0 0 CLASS=AA ATTCHORD=(2 7 Al) SEQID=8
M  V30 1 1 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 1 4 5
M  V30 5 1 5 6
M  V30 6 1 6 7
M  V30 7 1 7 8


To test how the size of these representations scales with the peptide length, random linear unmodified peptides were generated of increasing size. The formats listed above were tested as well as the fully expanded molfile and BIOVIA generated SCSR (BIOVIA Direct 2017). The difference between the BIOVIA SCSR and the NextMove SCSR (shown above) is that the expanded template for each occurring standard amino acid is included (i.e. a monomer definition). This has a little storage overhead that varies depending on the number of unique monomers.

The results are shown below. The molfile gets reasonably large (max 500KB+), though even this could still be stored on modern hardware. The SMILES (max 16KB+) peaks just above the more condensed formats of FASTA (max 1KB), HELM (max 2KB+), and Condensed (max 4KB+).


Using a log2 scale it’s easier to read the storage size:


An observation from the chart is that for small peptides the SCSR produced by BIOVIA (with monomer definitions) is actually larger than the molfile (also produced by BIOVIA). Crambin (e.g 1CRN) is often considered the boundary between a small-molecule and a protein. At 46 amino acids, it turns out that crambin reduced is smaller when stored as a fully expanded molfile compared to the SCSR representation:

Format Bytes
SCSR (BIOVIA) 20,448
Molfile 18,130


Sketchy Sketches

Chemical structure diagrams are essential in describing and conveying chemistry. Extracting chemistry from documents using text-mining (see NextMove Software’s LeadMine) is extremely useful but will miss anything described only by an image.

As a general approach to mining chemistry from images, one may consider using image-to-structure programs such as: OSRA, CliDE, ChemOCR, and Imago OCR. However, image-to-structure is not easy or quick and can be prone to compounding errors (e.g. OCR).

At NextMove we approach this problem slightly differently. It turns out that in some cases the source sketch files used to create the chemical diagrams may be available and provide a ‘cleaner’ data source than the raster images.

Although the data is ‘cleaner’ in terms of digital representation, naïvely exporting the connection table stored in a sketch file can lead to artificial and erroneous structures. The main problems stem from the stored representation (connection table) imprecisely reflecting what is displayed. To account for these issues, the NextMove Software converter (code name: Praline) applies correction, interpretation, and categorisation to sketches. The transformed connection table (currently written as ChemAxon Extended SMILES [CXSMILES]) better reflects what is actually displayed.

Let’s take a look at what’s possible with three examples:

1) US 2015 344500 A1

Method 9 in US 2015 344500 A1 describes a four step synthesis:


Using image-to-structure SureChEMBL extracts four structures, I’ve added the titles to make it easier to pair up:

Compound 2-2 (OCR error)
SCHEMBL17309138 / CID 118554493
Compound 9-1 (part)
SCHEMBL12363 / CID 10008
Compound 9-2
SCHEMBL17307813 / CID 118553325
Compound 9-3
SCHEMBL17309143 / CID 118554498

Compound 2-2 was not correctly extracted and looks like OCR has mistakenly recognised the -OBn as -OBu. The flurobenzene probably comes from Compound 9-1 where the label (Boc)2N- is difficult to recognise. The products of Step 4 contain valence errors and were probably thrown out as a recognition error.
However, by reading the ChemDraw files directly it’s possible to extract everything “warts and all”. To process this sketch the key interpretation phases are:

  • Line Formula Parsing – Using a strict yet comprehensive algorithm condensed labels are corrected and expanded.
  • Reaction Role Assignment – The reaction scheme layout is common to patents and made easier by looking for the USPTO-specific ‘splitter’ tag. To make valid reactions, reactants are duplicated and added to the previous step.
  • Agent Parsing – Based on the location the complete label “Boc2, DIEA” can be correctly processed. Agents can be a mix of trivial names, systematic names and formulas.
  • Clear Ambiguous Stereochemistry – One of the hashed wedges in Compounds 9-1, 9-2, and 9-3 is poorly placed between two stereocenters. In the stored representation both stereocentres are defined but we remove the definition at the wide end of the wedge.
  • Category Assignment – Based on the content we tag the output with a category for quick filtering. This is described more in the poster (see below).

Here are the results of our extraction, categorised as specific reactions:










Compounds 2-2 and 9-1 are now correctly extracted and actually novel to PubChem. We don’t try to correct author errors and so the bad valence is also preserved as drawn in Step 4.

2) US 7092578 B2

US 7092578 B2 is not a chemical patent but does have ChemDraw files. Here ChemDraw has been misused to draw tables, and direct export results in a cyclobutane grid. These are a well known class of bad structures in PubChem and have been referred to as chessboardanes. In addition to extracting the chemical structure, Praline assigns a categorisation code. This allows us to flag structures with potential problems as well as those with no real chemistry at all.


Resulting PubChem Compound CID 21040251:



Praline assigns the category No Connection Table and so it can be easily ignored.

3) US 6531452 B1

Strange connection tables don’t just come from non-chemistry patents, US 6531452 B1 like many chemistry patents contain a generic (Markush) claim. Earlier we saw the label -OBn misread by OCR. Even without OCR a condensed label may be expanded wrongly in the underlying representation, particular if the structure is generic.

“…at least one of R2 and R3 is”

In PubChem you’ll find the compound CID 22976968 has been extracted from this sketch:



Where did it come from? Well it turns out the generic label >C(R41)(R41) has been automatically expanded and stored in the file as:


Somewhere the Rs have been promoted to carbons and submitted to PubChem. Praline recognises and interprets generic labels and the attachment points (drawn here as tert-butyl) and categorised the sketch as a generic substituent. Here’s the output:


C1(C(C(C(N1*)(*)*)(*)*)C(*)(*)*)=O.*CCC(N)=O |$;;;;;R41;R41;R41;R41;R41;;_AP1;R41;R41;;_AP1$,Sg:n:3,6,7:n:ht|


Image-to-structure is slow; due to this, SureChEMBL currently has only processed images using image-to-structure from 2007 onwards (Papadatos G. 2015). In contrast Praline can process the entire archive of US Patent Applications and Grants with more than 24 million ChemDraw files (2001 onwards) in only 5 hours (single threaded).

Although the naïve molfile exports from the ChemDraw sketches are provided by the USPTO they have less information than the source ChemDraw sketch file. Reading the pre-exported molfile is significantly less accurate than interpreting the ChemDraw sketch, and even image-to-structure often produces more accurate results.

Other than U.S. Patents, this technology can be applied to sketch files extracted from Electronic Lab Notebooks (see NextMove Software’s HazELNut) as well as Journals where the publishers have held on to the sketch file submissions.

At the upcoming ACS in Philadelphia, Daniel will be presenting how some structures can only be extracted when the output from text-mining and sketches are combined. “The whole is greater than the sum of the parts” – Aristotle.

A poster on this work was presented at the 7th Joint Sheffield Conference on Chemoinformatics:


Biopolymer Canonicalisation Scaling Between Toolkits

We’ve previously shown using all-atom structure representations is a tractable approach to handling biologics (see Handling biologics in this way allows you to reuse existing registration infrastructure (e.g. Canonical SMILES/InChI/CACTVS keys).

At the Fall ACS ’15, Roger presented an update to this on-going work showing that many popular open-source cheminformatics toolkits can already handle peptides < 500 AA (the size of immunoglobulin heavy chains) in less than a second. We timed the generation of a canonical SMILES string (from the internal representation) over SwissProt. With the exception of Indigo/CDK (that hit hard error limits) the lines stop due to time constraints.


One thing the timings highlighted was recent improvements in RDKit that show faster canonicalisation and and reduced scatter (similar size structures ~ same amount of time). CDK was originally limited by the number of primes listed (it uses product of primes for refinement); patching the CDK to use more primes allows it to encode biopolymers of over 1000 AA.


Roger’s full talk is available here:

Substructure Search Face-off: Are the slowest queries the same between tools?

At the recent Cambridge Cheminformatics Network Meeting (CCNM) we presented a performance benchmark of substructure searching tools using the same queries, target dataset, and hardware. Whilst many tools publish figures for isolated benchmarks, the use of different query sets and variations in target database size makes it impossible to determine how tools compare to each other.

The talk compared the performance of various tools and offers insight in to the performance characteristics.

A question was asked at the talk as to whether the slowest queries were always the same. As expected there is some correlation (benzene is always bad) but there are some rather dramatic differences within and between tools. For example, the time taken to query Anthracene or Zinc varies with some tools finding Anthracene hits faster (marked as <) and others finding Zinc faster (marked as >).

The rank of slowest queries (per tool) is provided as a guide to how many queries took more time than listed here.

Anthracene Zinc
Tool Query Time (s) Rank (slow) Query Time (s) Rank (slow)
arthor 2.254 3 > 0.357 2602
arthor+fp 0.022 285 > 0.001 1667
rdcart 0.698 794 < 202 4
rdlucene 27.126 566 > 23.87 600
pgchem 28.231 138 > 18.181 197
mychem 48.289 108 > 34.145 159
fastsearch 396 99 > 285 126
bingo-nosql 0.448 451 < 1.311 260
bingo-pgsql 0.392 638 > 0.060 1228
tripod-ss 21.797 350 < 1441 18
orchem 27.075 906 > 0.721 2390

As promised the query and target ids are available: here.

If this is an area of interest to you feel free to get in touch.

Casandra – Chemical Hazard Alerting For The 21st Century

At BioIT World, Roger presented a poster on NextMove Software’s Casandra. Casandra is a server for alerting of chemical and reactive hazards.

“Patents you wouldn’t want to work with”

Readers of Derek Lowe’s In The Pipeline may be familiar with a series of posts titled “Things I won’t work with”. In the spirit of those posts, we recently ran Casandra on one million reactions extracted from US Patents. One patent highlighted in this preliminary analysis contained an extremely energetic compound.

US20100081811A1 [paragraph:18]
It turns out the above patent is actually from a defence agency and they were intending to make explosives. A more subtle reactive hazard was found in US20020173655A1.

US20020173655A1 [paragraph:374]
Here the amide (DMF) and metal hydride (NaH) can react exothermically in a self-accelerating reaction[1,2].

The Casandra poster is available here. If this is an area of interest to you feel free to get in touch with us.

[1] J. Buckley et al., Chemical & Engineering News, Jul. 12, 1982, page 5
[2] G. DeWail, Chemical & Engineering News, Sep. 13, 1982