SMILES – NextMove Software

CXSMILES Part 3: Repeat Groups

This post continues a series (see Part 1 and Part 2) examining some features and insights when using ChemAxon Extended SMILES/SMARTS (CXSMILES) to represent repeat groups.

Repeat Groups

Inherited from CTfiles (Molfile), there are two ways to specify repeat groups in CXSMILES: you can use either a Structural Repeating Unit (SRU) Sgroup or a Link Node. Historically link nodes are used for queries and SRUs for polymers, however the boundaries are a little blurry. From a user’s perspective you probably want to handle them interchangeably – particularly if this is a query structure input. Recent versions of ChemAxon’s MarvinSketch will encode a repeat as a link node if possible. In third party tools similar interconversion is useful, with a preferred (canonical) representation generated on output.

Link nodes are more limited than SRU groups but allow for a terser encoding. Firstly, any crossing bonds (bonds that cross the brackets) must connect to the same repeated atom. You must also define a lower and upper bound for the number of times to repeat (e.g. 1 to 3). In CTfiles the lower bound must be 1 – in practice any lower bound (including 0) is reasonable. A simple example is shown below: atom idx 1 repeats 1 to 3 times.

NCC=O |LN:1:1.3| (Ia)

As an SRU Sgroup we can write it like this:

NCC=O |Sg:n:1:1-3:| (Ib)

I have specified the subscript as “1-3”, which isn’t semantically encoded but is sufficient. SRU Sgroups can have whatever you like as the subscript; you will frequently see n or m for any number of repeats, but it can be anything.
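For a single-atom link node with no outer atoms, the rewrite from (Ia) to (Ib) is mechanical. The following is a minimal sketch, assuming only the field layouts used in this post (`LN:atom:min.max` and `Sg:n:atoms:subscript:connectivity`); the function name is hypothetical and does not come from any toolkit.

```python
def link_node_to_sru(ln_field: str) -> str:
    """Rewrite a simple 'LN:atom:min.max' field as an SRU Sgroup field.

    Only the two-bound form is handled. A link node that also lists
    outer atoms is rejected, since more than one atom may then repeat.
    """
    tag, atom, bounds = ln_field.split(":")
    if tag != "LN":
        raise ValueError(f"not a link node field: {ln_field!r}")
    parts = bounds.split(".")
    if len(parts) != 2:
        raise ValueError("link nodes with outer atoms are not handled here")
    low, high = (int(p) for p in parts)
    # The range becomes the free-text subscript; connectivity is left empty.
    return f"Sg:n:{int(atom)}:{low}-{high}:"


print(link_node_to_sru("LN:1:1.3"))  # Sg:n:1:1-3:
```

Going the other way is only possible when the subscript happens to be a parseable range, which is one reason a tool might prefer the link node form on output.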

c1ccccc1C(O)CC |Sg:n:6,7:n:ht| (II)
c1ccccc1C(O)CC |Sg:n:6,7:>1:ht| (III)

The connectivity superscript of the bracket can be [head-to-tail (ht), head-to-head (hh), either-unspecified (eu)]. If the repeated part is symmetric (like in II) then it is redundant and can be omitted.
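To make the field layout concrete, here is a small sketch that splits an SRU Sgroup field into its parts. It assumes only the first five colon-separated fields shown in the examples (tag, type, atoms, subscript, connectivity) and ignores anything after them; the `Sru` class and `parse_sru` function are illustrative names, not a toolkit API.

```python
from dataclasses import dataclass


@dataclass
class Sru:
    kind: str          # Sgroup type, "n" for an SRU
    atoms: list        # repeated atom indices (0-based)
    subscript: str     # free text such as "n", "1-3" or ">1"
    connectivity: str  # "ht", "hh", "eu", or "" when omitted


def parse_sru(field: str) -> Sru:
    """Parse 'Sg:type:atoms:subscript:connectivity' (later fields ignored)."""
    parts = field.split(":")
    if parts[0] != "Sg" or len(parts) < 4:
        raise ValueError(f"not an Sgroup field: {field!r}")
    parts += [""] * (5 - len(parts))  # tolerate a missing connectivity field
    _, kind, atoms, subscript, connectivity = parts[:5]
    return Sru(kind, [int(a) for a in atoms.split(",")], subscript, connectivity)


print(parse_sru("Sg:n:6,7:>1:ht"))
```

A subscript that itself contains a colon would break this naive split, so a real reader needs to handle escaping.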

Link nodes allow you to specify the outer atoms of crossing bonds; Sgroups only support this for ladder-type polymers (as per documentation). You need the outer atoms for link nodes if there are more than two bonds – this means multiple atoms repeat. With SRU Sgroups you just specify all the atoms that repeat. I’ve strained the brackets to demonstrate below:

c1ccccc1C(O)CC |LN:6:1.2.6.8| (IVa)
c1ccccc1C(O)CC |Sg:n:6,7:1-2:ht| (IVb)

c1ccccc1C(O)CC |LN:6:1.2.6.7| (Va)
c1ccccc1C(O)CC |Sg:n:6,8,9:1-2:ht| (Vb)

If the link node is in a ring, then the crossing bonds are implicitly the ones in the ring. However, for portability the outer atoms should be specified whenever there are more than two bonds connected.

C1(O)CCCCC1 |LN:0:1.3| (VIa) acceptable
C1(O)CCCCC1 |LN:0:1.3.1.6| (VIb) preferred
C1(O)CCCCC1 |Sg:n:0,1:1-3:ht| (VIc)

Something that may not be obvious when you first encounter repeat groups is that although in polymer chemistry they are typically linear, for structure queries we can also have what I will call radial repeats. With a radial repeat the repeat unit “rotates” around a single external atom – they are implicitly bounded by the number of bonds that atom can have. They can have an odd or even number of crossing bonds and it depends on the grouping of the crossing bonds as to what type of repeat you have. The table below shows this, I have put some hypothetical examples in light grey to show the pattern repeats.

As with linear repeats, radial repeats can be represented as link nodes if the range is known and they are a single atom:

CS(=O)CC |Sg:n:2:1-2:| (VIIa)
CS(=O)CC |LN:2:1.2| (VIIb)

N1CCCC(Cl)C1 |Sg:n:5:1-2:| (VIIIa)
N1CCCC(Cl)C1 |LN:5:1.2| (VIIIb)

*Cl.n1ccccc1 |Sg:n:1:1-2:,m:0:2.3.4.5| (IXa)
*Cl.n1ccccc1 |m:0:2.3.4.5,LN:1:1.2| (IXb)

**.n1ccccc1 |$;_R1$,m:0:2.3.4.5,Sg:n:1:1-2:| (Xa)
**.n1ccccc1 |$;_R1$,m:0:2.3.4.5,LN:1:1.2| (Xb)

For non ladder-type polymers CXSMILES Sgroups only capture the atoms that repeat and not the grouping of the crossing bonds. Therefore we have an ambiguity with radial spiro-repeats. A reasonable rule of thumb is if the outer atom of each crossing bond is the same then it is likely a spiro-repeat rather than linear.

C=1C2=C3[N]([Ir][N]4=CC=CC(C=C2)=C34)=CC1 |Sg:n:0,1,2,3,5,6,7,8,9,10,11,12,13,14:n:ht| (XI)

CTfiles actually have the same issue if you don’t store the coordinates. Fortunately these repeat types are relatively rare but here are some in the wild:

**US 2016/0049599 A1**
(US20160049599A1-20160218-C00566)

**US 2015/0380666 A1**
(US20150380666A1-20151231-C00746)

We extract these from patents as CXSMILES with our Text-Mining tool LeadMine (Poster).

Explicit hydrogens may be required to define the Sgroup SRU properly. If an explicit hydrogen is used, care needs to be taken when suppressing explicit hydrogens. If you suppress/remove the hydrogen in (XII), the meaning changes from a PEG linear repeat to a radial repeat:

c1ccccc1CCO[H] |Sg:n:6,7,8:m:ht| (XII)

Atom indices can be written in any order, but ideally you should sort the list of indices when writing CXSMILES.

c1ccccc1CCOCCO |Sg:n:6,7,8:n:ht| (XIIa)
c1ccccc1CCOCCO |Sg:n:6,8,7:n:ht| (XIIb)
c1ccccc1CCOCCO |Sg:n:8,7,6:n:ht| (XIIc)
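Sorting on output is a one-liner once the field is split. This sketch assumes the atom list is the third colon-separated field, as in the examples above; the function name is hypothetical.

```python
def sort_sgroup_atoms(field: str) -> str:
    """Return the Sgroup field with its atom index list in ascending order."""
    parts = field.split(":")
    atoms = sorted(int(a) for a in parts[2].split(","))
    parts[2] = ",".join(str(a) for a in atoms)
    return ":".join(parts)


for field in ("Sg:n:6,7,8:n:ht", "Sg:n:6,8,7:n:ht", "Sg:n:8,7,6:n:ht"):
    print(sort_sgroup_atoms(field))  # Sg:n:6,7,8:n:ht each time
```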

As touched upon previously, if the repeated unit is symmetric then the connectivity superscript is irrelevant; for registration these should all be considered the same:

c1ccccc1OCOCCO |Sg:n:6,7,8:n:hh| (XIIIa)
c1ccccc1OCOCCO |Sg:n:6,7,8:n:ht| (XIIIc)
c1ccccc1OCOCCO |Sg:n:6,7,8:n:eu| (XIIIb)
c1ccccc1OCOCCO |Sg:n:6,7,8:n:| (XIIId)

My interest is mainly in repeat variation for structure queries. There is a lot more that can be said on polymer registration; for example, you need to canonicalise the repeat unit (frame shift). Four of these are the same structure and one isn’t:

[H]NCC(=O)NCC(=O)O[H] |Sg:n:1,2,3,4:n:ht| (XIVa)
[H]NCC(=O)NCC(=O)O[H] |Sg:n:2,3,4,5:n:ht| (XIVb)
[H]NCC(=O)NCC(=O)O[H] |Sg:n:3,4,5,6:n:ht| (XIVc)
[H]NCC(=O)NCC(=O)O[H] |Sg:n:5,6,7,8:n:ht| (XIVd)
[H]NCC(=O)NCC(=O)O[H] |Sg:n:6,7,8,9:n:ht| (XIVe)
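A toy way to see the frame shift is to treat the repeat unit as a cyclic sequence of backbone atoms and pick its smallest rotation. The sketch below hard-codes the backbone of this one molecule (an assumption made purely for illustration) and ignores reading direction and any real graph canonicalisation, so it is not a registration algorithm.

```python
def canonical_frame(units):
    """Return the lexicographically smallest rotation of a repeat unit."""
    units = list(units)
    rotations = [tuple(units[i:] + units[:i]) for i in range(len(units))]
    return min(rotations)


# Backbone of [H]NCC(=O)NCC(=O)O[H], keyed by atom index. The carbonyl
# oxygens (atoms 4 and 8) are folded into the label of their carbon.
BACKBONE = {1: "N", 2: "C", 3: "C(=O)", 5: "N", 6: "C", 7: "C(=O)", 9: "O"}


def repeat_unit(atoms):
    """Backbone labels covered by an Sgroup atom list, in chain order."""
    return [BACKBONE[a] for a in sorted(atoms) if a in BACKBONE]


sgroups = {
    "XIVa": [1, 2, 3, 4],
    "XIVb": [2, 3, 4, 5],
    "XIVc": [3, 4, 5, 6],
    "XIVd": [5, 6, 7, 8],
    "XIVe": [6, 7, 8, 9],
}
for label, atoms in sgroups.items():
    print(label, canonical_frame(repeat_unit(atoms)))
```

Under these assumptions (XIVa) to (XIVd) all reduce to `('C', 'C(=O)', 'N')`, while (XIVe) gives `('C', 'C(=O)', 'O')` because its brackets enclose the terminal oxygen instead of a nitrogen.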

A final comment is something I noticed when preparing this post. It is common in polymer chemistry to draw external attachments as a plain bond. Here is the Wikipedia depiction of polyvinyl chloride:

*CC(Cl)* |Sg:n:1,2,3:n:ht| (XVa)

EPAM’s Ketcher looks like it automatically converts any methyl capping groups to ‘*’ on load… but this might just be a display setting. It’s possibly reasonable and it doesn’t seem to be default behaviour in the Indigo API but is an interesting auto-conversion to be aware of.

CCC(Cl)C |Sg:n:1,2,3:n:ht| (XVb)

CXSMILES Part 2: Component Grouping

This post is a follow up on the previous introduction – Part 1. Here I examine how we can capture fragment grouping in CXSMILES and other extensions.

Fragment Grouping

Fragment grouping (or component grouping) allows you to group together separate fragments/components of a molecule. It is critical for reaction representation and therefore several independent SMILES extensions have emerged. Common cases include keeping counter-ions, hydrates, and salts together as a single “molecule”.

`EP2305640A2 Example 11, Step v`
(SMILES)

`EP2305640A2 Example 11, Step v`
(CXSMILES with fragment grouping)

Syntax

Here is a simple example annotated with the fragment indexes, we want to group together (0,1) and (3,4,5):

[Na+].[OH-].c1ccccc1.[Cs+].[O-]C(=O)[O-].[Cs+]>> |f:0.1,3.4.5|
--0-- --1-- ----2--- --3-- -----4------- --5--

Component indexes span the entire reaction, so we can, for example, move the caesium carbonate to the agents and the CXSMILES encoding does not change:

[Na+].[OH-].c1ccccc1>[Cs+].[O-]C(=O)[O-].[Cs+]> |f:0.1,3.4.5|
--0-- --1-- ----2--- --3-- -----4------- --5--
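The indexing can be reproduced in a few lines. This is a minimal sketch that assumes the line holds a reaction SMILES followed by a single `|f:...|` field and nothing else; the function names are illustrative only.

```python
def split_components(rxn_smiles):
    """Return (role, SMILES) for every dot-separated component, in order."""
    roles = ("reactant", "agent", "product")
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected a reaction SMILES with two '>'")
    return [(role, frag)
            for role, part in zip(roles, parts)
            for frag in part.split(".") if frag]


def parse_fragment_groups(field):
    """'f:0.1,3.4.5' -> [[0, 1], [3, 4, 5]]"""
    if not field.startswith("f:"):
        raise ValueError(f"not a fragment grouping field: {field!r}")
    return [[int(i) for i in group.split(".")] for group in field[2:].split(",")]


def grouped(line):
    """Resolve each group in the f: field to its (role, SMILES) members."""
    smiles, _, ext = line.partition(" ")
    components = split_components(smiles)
    groups = parse_fragment_groups(ext.strip("|"))
    return [[components[i] for i in group] for group in groups]


for line in (
    "[Na+].[OH-].c1ccccc1.[Cs+].[O-]C(=O)[O-].[Cs+]>> |f:0.1,3.4.5|",
    "[Na+].[OH-].c1ccccc1>[Cs+].[O-]C(=O)[O-].[Cs+]> |f:0.1,3.4.5|",
):
    print(grouped(line))
```

Both lines resolve to the same two groups of SMILES; only the role attached to the caesium carbonate members changes from reactant to agent.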

Does it only apply to reactions?

Toolkit dependent. ChemAxon appears to only read/write it on reactions (MarvinJS v22.9.0) but it’s also useful on molecules to capture formulations/mixtures (e.g. Artificial seawater). In reactions, fragment grouping of agents (between the two “>”) appears to be ignored in MarvinJS – so my example images aren’t valid examples. One of our customers tested Marvin Desktop v21.15.1 for me and confirmed it round-trips correctly.

Do component terms need to be adjacent?

Toolkit dependent. In older versions of ChemAxon desktop tools (I no longer have access) I remember it would reject non-adjacent components:

[Na+].c1ccccc1.[OH-]>> |f:0.2|

This does not seem to be the case in MarvinJS, and again a customer confirmed it round-trips OK in Marvin Desktop.

Spanning Different Roles?

An input where the components being grouped have different roles (e.g. a reactant and a product) could be rejected as inconsistent:

c1ccccc1.[Na+]>>[OH-] |f:1.2|
c1ccccc1.[Na+]>>[OH-] |f:2.1|

MarvinJS and CDK default to the role of the first component encountered so those two inputs are different. As the author of the CDK logic I will note this is consistent only by coincidence.
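The two behaviours (reject, or take the first role) can be sketched as follows. This assumes “first component encountered” means the first index listed in the `f:` group, which is one reading of the text above rather than a documented rule; the names are hypothetical.

```python
ROLES = ("reactant", "agent", "product")


def component_roles(rxn_smiles):
    """Role of every dot-separated component, indexed across the reaction."""
    parts = rxn_smiles.split(">")
    return [role
            for role, part in zip(ROLES, parts)
            for frag in part.split(".") if frag]


def group_role(rxn_smiles, group, strict=False):
    """Role assigned to a fragment group; optionally reject mixed roles."""
    roles = component_roles(rxn_smiles)
    seen = [roles[i] for i in group]
    if strict and len(set(seen)) > 1:
        raise ValueError(f"group {group} spans roles {sorted(set(seen))}")
    return seen[0]  # lenient mode: role of the first listed component


rxn = "c1ccccc1.[Na+]>>[OH-]"
print(group_role(rxn, [1, 2]))  # reactant
print(group_role(rxn, [2, 1]))  # product
```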

Implicit grouping?

We can implicitly group components with multi-attach (m:) and Sgroup brackets. For example consider the following, which is preferred?

**.c1ccccc1 |$R1$,m:1:2.3.4.5.6.7|
**.c1ccccc1 |$R1$,f:0.1,m:1:2.3.4.5.6.7|

Alternatives

Daylight SMILES5

As with the cis/trans specification, SMILES5 had a solution:

“Molecule-level Components: There will be another level of components within a molecule and reaction object which will allow easier handling of complex mixtures.” – Futures, MUG 2005

SMARTS has component grouping using zero-level brackets already and it would likely have followed a similar syntax:

([Na+].[OH-]).c1ccccc1.([Cs+].[O-]C(=O)[O-].[Cs+])>>

Notice the wording: it applies to both molecules and reactions.

LillyMol

An extension used by Eli Lilly’s LillyMol is to use “.” to separate fragments and “+” to separate molecules.

[Na+].[OH-]+c1ccccc1+[Cs+].[O-]C(=O)[O-].[Cs+]>>

Note that LillyMol also supports CXSMILES.

IBM RXN for Chemistry

Schwaller et al 2020 describe how they use “~” to group together fragments. They note in the supplementary information how this is more useful than CXSMILES for their purposes since it enforces the fragments are kept together:

[Na+]~[OH-].c1ccccc1.[Cs+]~[O-]C(=O)[O-]~[Cs+]>>

NextMove (proposed) / OntoChem

In 2013 Roger proposed a double-dot “..” for a similar purpose. The advantage being that “relaxed” SMILES parsers will simply ignore the repeated dot:

[Na+]..[OH-].c1ccccc1.[Cs+]..[O-]C(=O)[O-]..[Cs+]>>

OntoChem also use this representation in reactions but I cannot find a link to relevant material.
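The three delimiter-based conventions above (LillyMol, IBM RXN, and the double-dot) differ only in which string joins fragments inside a molecule and which joins molecules. A minimal sketch of rendering a CXSMILES grouping in each of them follows; `regroup` is a hypothetical helper, and a group that spans roles is simply placed in the role of its first member.

```python
def regroup(rxn_smiles, groups, inner, outer):
    """Rewrite a reaction SMILES so grouped fragments are joined by `inner`
    and separate molecules by `outer`."""
    parts = rxn_smiles.split(">")
    frags = {}  # global component index -> (role number, SMILES)
    for role, part in enumerate(parts):
        for frag in part.split("."):
            if frag:
                frags[len(frags)] = (role, frag)
    member_of = {i: tuple(group) for group in groups for i in group}
    molecules = [[] for _ in parts]
    done = set()
    for i, (role, _) in frags.items():
        if i in done:
            continue
        members = member_of.get(i, (i,))
        done.update(members)
        molecules[role].append(inner.join(frags[m][1] for m in members))
    return ">".join(outer.join(mols) for mols in molecules)


rxn = "[Na+].[OH-].c1ccccc1.[Cs+].[O-]C(=O)[O-].[Cs+]>>"
groups = [[0, 1], [3, 4, 5]]  # from |f:0.1,3.4.5|
print(regroup(rxn, groups, ".", "+"))   # LillyMol style
print(regroup(rxn, groups, "~", "."))   # IBM RXN style
print(regroup(rxn, groups, "..", "."))  # double-dot style
```

Non-adjacent group members are pulled together at the position of the first member, which is the step a delimiter-based syntax forces and the CXSMILES index list avoids.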

NextMove (actual)

In Pistachio we use CXSMILES for fragment grouping in reactions. In recent releases we have tried to use alternative representations that avoid the problem where possible:

[Na]O.c1ccccc1.[Cs]OC(=O)O[Cs]>>

Where needed we still use CXSMILES since it is the most widely supported convention:

The fragment grouping also gets captured in the JSON format of reactions albeit much less compactly:

{
  "role": "Product",
  "orgName": "title compound",
  "name": "Methyl (2S)-2-amino-3-(2-chlorophenyl)propanoate hydrochloride",
  "smiles": "Cl.N[C@H](C(=O)OC)CC1=C(C=CC=C1)Cl",
  "quantities": [
    {"type": "Mass", "value": 5.89, "text": "5.89 g"},
    {"type": "Yield", "value": 94, "text": "94%"}
  ],
  "stoichiometry": 1
}

We also support interconversion of the LillyMol syntax in our reaction processing tool set, HazELNut:

$ echo "*>[Na+].[OH-].[Cs+].[O-]C(=O)[O-].[Cs+]>* |f:1.2.3,4.5|" | ./filbert .smi .iwsmi
*>C(=O)([O-])[O-].[Cs+]+[OH-].[Na+].[Cs+]>*

$ echo "*>C(=O)([O-])[O-].[Cs+]+[OH-].[Na+].[Cs+]>*" | ./filbert .iwsmi .smi
*>C(=O)([O-])[O-].[OH-].[Na+].[Cs+].[Cs+]>* |f:1.4,2.3.5|

CXSMILES Gotchas – Part 1: Bond Indexes

ChemAxon Extended SMILES and SMARTS (CXSMILES) has become more popular in recent years for its ability to capture additional information on top of the core structural connectivity.

At NextMove Software we think it’s great and have increasingly used CXSMILES over the years to capture information more precisely.

Such structures can be captured as CXSMILES:

c12cccc1C=CC=C2.*[Zr](*)(Cl)Cl.c12cccc1C=CC=C2 |m:9:0.1.2.3.4,11:14.15.16.17.18| US20150376312A1 (2)
[c-]12cccc1C=CC=C2.*[Zr++](*)(Cl)Cl.[c-]12cccc1C=CC=C2 |m:9:0.1.2.3.4,11:14.15.16.17.18| US20150376312A1 (2)
C1CCOC1.C(Cl)Cl |Sg:c:0,1,2,3,4::,Sg:c:5,6,7::,Sg:mix:0,1,2,3,4,5,6,7::,SgH:2:0.1| 20% THF/DCM
[N+](=O)([O-])C1=CC=C(C(=O)O[C@H]2[C@H](CCCC2)N2N=C(C(=C2)NC(=O)C=2C=NN3C2N=CC=C3)C3=C(C=CC(=C3)Cl)OC)C=C1 |r| US20200002345A1 Ex 136

Background

Originally created by ChemAxon as a way to store a CTfile (i.e. MOLfile) in SMILES without losing information, it has evolved to be useful in its own right. Recent versions of Enamine REAL use it for stereochemistry groups and there are efforts by InChI to better represent inorganics, mixtures, and reactions which could make use of it. Since CXSMILES has started to gain traction, with more toolkits supporting the format, it is becoming a convenient lingua franca for advanced chemical representations.

The following toolkits support CXSMILES to various degrees:

ChemAxon – JChem, Marvin etc
Indigo
CDK
RDKit
OPSIN
LillyMol

Gotcha!

ChemAxon provides public documentation on CXSMILES but there are corner cases and some wonky areas I want to discuss. I planned to get this captured in one post but quickly realised the first topic alone is enough on its own. I will update the links below as new posts appear:

Bond Indexes
Component Grouping (Part 2)
Repeat Groups (Part 3)
Enhanced Stereo Canonicalisation
Atom Labels
EPAM Highlight Extension
Dative bond valence
Multi- vs Variable- attachment
Wish-list – beyond CXSMILES
- Partial feature set
- Atropisomers
- More compact coordinates
- cis/trans- stereo groups

Bond Indexes

Atoms and bonds in CXSMILES are referenced by index (0 <= idx < n) which is the position in the SMILES string:

CC=CC |c:1|
CC=CC |t:1|
CC=CC |ctu:1|

In this case it’s easy to see, there are three bonds at index: 0, 1, 2. The bond at index 1 is specified as cis (c:) or trans (t:) or unspecified (ctu:). As a linear notation, some bonds get written twice. Does the index increment twice? Is the ring open (first occurrence) or ring close (second occurrence) the “reference”?

Using a dot-disconnection trick we can reorder the previous example and probe behaviour of what ChemAxon accepts:

C1C.CC=1 |c:0| wrong
C1C.CC=1 |c:2| correct
C=1C.CC1 |c:0| wrong
C=1C.CC1 |c:2| correct - less desirable IMO

The “correct” choice is |c:2| – the closure bond is the reference. I should note this is somewhat an artefact of how the SMILES is parsed. A useful efficiency trick in SMILES is to use partial bonds (leave one atom temporarily undefined) which in this case would give you the incorrect index unless extra steps were taken.

A similar issue exists in vanilla SMILES when bond types mismatch: C#1C.C=1 or C/1=C/C.C/1. The bond that takes precedence is toolkit dependent; hopefully you would get a warning or error.

In case you don’t like the dot-disconnection being used like that, here is one in a macrocycle:

C=1CCCCCCCCCC1 |c:0| wrong
C1CCCCCCCCCC=1 |c:10| correct

The double counting question is already answered since it was |c:2| and not |c:3| but to confirm bonds do not get counted twice:

C1CCCCC1C=CC |c:8| wrong
C1CCCCC1C=CC |c:7| correct
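The two rules established above (the ring closure is the reference, and a ring bond is counted once) can be written down as a small bond counter. This is a toy sketch with a deliberately simplified tokenizer (organic subset, bracket atoms, ring closures, branches, dots); it follows the behaviour described in this post and is not taken from any toolkit's source.

```python
import re

# Toy tokenizer: bracket atoms, organic-subset atoms, ring closures,
# bond symbols, dots and branches. Anything else is silently skipped.
TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]|\*|%\d\d|\d|[-=#$:/\\.()]")
BOND_SYMBOLS = set("-=#$:/\\")


def bond_list(smiles):
    """Return (begin, end, symbol) tuples in CXSMILES bond index order."""
    bonds, branch_stack, ring_open = [], [], {}
    prev, pending, natoms = None, "", 0
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok == ".":
            prev, pending = None, ""
        elif tok in BOND_SYMBOLS:
            pending = tok
        elif tok[0].isdigit() or tok[0] == "%":
            num = int(tok.lstrip("%"))
            if num in ring_open:
                # The bond is created, and so indexed, at the ring CLOSE.
                begin, symbol = ring_open.pop(num)
                bonds.append((begin, prev, pending or symbol))
            else:
                # Ring open: remember it, but do not allocate an index yet.
                ring_open[num] = (prev, pending)
            pending = ""
        else:  # an atom
            if prev is not None:
                bonds.append((prev, natoms, pending))
            prev, pending = natoms, ""
            natoms += 1
    return bonds


def double_bond_indexes(smiles):
    return [i for i, (_, _, symbol) in enumerate(bond_list(smiles)) if symbol == "="]


for smi in ("CC=CC", "C1C.CC=1", "C=1C.CC1", "C1CCCCCCCCCC=1", "C1CCCCC1C=CC"):
    print(smi, double_bond_indexes(smi))
```

Run on the examples above this gives 1, 2, 2, 10 and 7, matching the indexes marked “correct”. A parser that allocates the bond at the ring open would report 0 for the ring-closure cases.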

Double bond configurations in rings are rare but bond indexes also apply to dative/hydrogen bonds and it is important there is consensus on how it works.

SMILES already supports cis/trans specification so why does CXSMILES add this? It turns out to avoid error propagation you need to be able to specify unknown configuration and this is not always possible in normal SMILES. SMILES uses "/" and "\" – it looks pretty but causes problems. Pause for a second and see if you can write the SMILES for the following structure:

Maybe you wrote something like this:

C/C=C\C=C/C=C\C
C\C=C/C=C\C=C/C

Indeed most toolkits will do exactly that! Unfortunately we’ve added information that wasn’t there and inadvertently defined the middle bond. It is actually possible if you add explicit hydrogens:

C/C=C(/[H])C=C/C=C\C
C\C=C(\[H])C=C\C=C/C
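To see which double bonds a SMILES string ends up defining, one rough check is to flag every double bond whose two atoms each touch a `/` or `\` bond. The self-contained sketch below uses a toy tokenizer and that simplified rule; it is an illustration of the problem, not how any particular toolkit decides.

```python
import re

TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]|\*|%\d\d|\d|[-=#$:/\\.()]")
BOND_SYMBOLS = set("-=#$:/\\")


def bond_list(smiles):
    """Return (begin, end, symbol) tuples in the order bonds are created."""
    bonds, branch_stack, ring_open = [], [], {}
    prev, pending, natoms = None, "", 0
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok == ".":
            prev, pending = None, ""
        elif tok in BOND_SYMBOLS:
            pending = tok
        elif tok[0].isdigit() or tok[0] == "%":
            num = int(tok.lstrip("%"))
            if num in ring_open:
                begin, symbol = ring_open.pop(num)
                bonds.append((begin, prev, pending or symbol))
            else:
                ring_open[num] = (prev, pending)
            pending = ""
        else:  # an atom
            if prev is not None:
                bonds.append((prev, natoms, pending))
            prev, pending = natoms, ""
            natoms += 1
    return bonds


def defined_double_bonds(smiles):
    """Indexes of double bonds flanked by a directional bond on both ends."""
    bonds = bond_list(smiles)
    directional_atoms = set()
    for begin, end, symbol in bonds:
        if symbol in ("/", "\\"):
            directional_atoms.update((begin, end))
    return [i for i, (begin, end, symbol) in enumerate(bonds)
            if symbol == "=" and begin in directional_atoms and end in directional_atoms]


print(defined_double_bonds(r"C/C=C\C=C/C=C\C"))       # [1, 3, 5]
print(defined_double_bonds(r"C/C=C(/[H])C=C/C=C\C"))  # [1, 6]
```

In the first string the middle double bond (index 3) is caught between the directional bonds of its neighbours. With the explicit hydrogen, the middle double bond (now index 4) is no longer flanked on both sides and stays undefined.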

But this only gets you so far, what if we had nitrogens? CXSMILES allows us to encode this unambiguously:

CC=NC=CN=CC |c:1,5|

This may seem like a narrow corner case but if you try to parse PubChem you will find a lot of them – or rather inconsistency warnings. Here are some warnings from CDK:

Ignoring invalid directional bonds
C1=C/C/2=C/N=C3/C(=C/4\C(=N/C=C/5/C(=C2/C=C1)/C=CC=C5)C=CC=C4)/C=CC=C3 5379414
                    ^ ^
Ignoring invalid directional bonds
CO/C(=C(\C#N)/C=C(/C=C(/C(=O)OC)\C#N)/C=C(/C(=O)OC)\C#N)/O 5720097
                  ^                  ^

Depending on the traversal and bond direction assignment we may accidentally define the configuration of this bond and no warning would be generated or alternatively it will come out with an “invalid” syntax.

It’s problematic and Daylight had planned to address it. At the user group meeting in 2005 a Futures talk made reference to preliminary work on SMILES5: “A unification of the EZ stereo representation with atom-based stereo is proposed. This will allow better specification of multiple conjugated EZ centers and also allows more robust specification of relative stereochemistry.”

It may have looked something like this:

C[C@H]=[C@H]C=C[C@H]=[C@H]C cis,cis-
C[C@H]=[C@H]C=C[C@H]=[C@@H]C cis,trans-
C[C@H]=[C@H]C=C[C@@H]=[C@H]C cis,trans-

I actually added support for this in CDK but in hindsight I think using CXSMILES is simpler but less elegant:

CC=CC=CC=CC |c:1,5|
C/C=C\C=C\C=C/C |ctu:3|
C/C=C\C=C\C=C/C |c:1,5,ctu:3|

Which is best depends on the application – ignoring the CXSMILES you either get no configuration or the wrong configuration. With CXSMILES these should all canonicalise to the same thing.
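Reading these fields has one small wrinkle: the comma separates both fields and the values within a field, so an index with no key belongs to the preceding field. A minimal sketch, assuming the extension only needs its `c:`, `t:` and `ctu:` entries extracted (other fields are skipped, and the function name is hypothetical):

```python
def parse_double_bond_fields(ext):
    """Map bond index -> 'c', 't' or 'ctu' from a CXSMILES extension string."""
    config = {}
    key = None
    for token in ext.strip("|").split(","):
        if ":" in token:
            # A new field starts here; remember its key for later bare values.
            key, token = token.split(":", 1)
        if key in ("c", "t", "ctu") and token.isdigit():
            config[int(token)] = key
    return config


print(parse_double_bond_fields("|c:1,5|"))        # {1: 'c', 5: 'c'}
print(parse_double_bond_fields("|ctu:3|"))        # {3: 'ctu'}
print(parse_double_bond_fields("|c:1,5,ctu:3|"))  # {1: 'c', 5: 'c', 3: 'ctu'}
```

Only the third form states all three bonds explicitly; for the other two a reader has to combine the extension with whatever the SMILES string itself says.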