{"id":3026,"date":"2022-04-08T15:15:25","date_gmt":"2022-04-08T14:15:25","guid":{"rendered":"https:\/\/nextmovesoftware.com\/blog\/?p=3026"},"modified":"2025-09-22T14:44:05","modified_gmt":"2025-09-22T13:44:05","slug":"cxsmiles-gotchas-part-1-bond-indexes","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2022\/04\/08\/cxsmiles-gotchas-part-1-bond-indexes\/","title":{"rendered":"CXSMILES Gotchas &#8211; Part 1: Bond Indexes"},"content":{"rendered":"\n<p><a href=\"https:\/\/docs.chemaxon.com\/display\/docs\/ChemAxon_Extended_SMILES_and_SMARTS_-_CXSMILES_and_CXSMARTS.html\">ChemAxon Extended SMILES and SMARTS<\/a>&nbsp;(CXSMILES)&nbsp;has become more popular in recent years for its ability to capture additional information on top of the core structural connectivity.  <\/p>\n\n\n\n<p>At NextMove Software we think it&#8217;s great and have increasingly used CXSMILES over the years to capture information more precisely.<\/p>\n\n\n\n<div class=\"wp-block-columns are-vertically-aligned-center is-layout-flex wp-container-core-columns-layout-1 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-full is-resized\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.04.08.png\"><img loading=\"lazy\" decoding=\"async\" width=\"452\" height=\"490\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.04.08.png\" alt=\"\" class=\"wp-image-3035\" style=\"width:245px;height:266px\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.04.08.png 452w, 
https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.04.08-277x300.png 277w\" sizes=\"(max-width: 452px) 100vw, 452px\" \/><\/a><figcaption class=\"wp-element-caption\">US20150376312A1 (2)<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.24.05.png\"><img loading=\"lazy\" decoding=\"async\" width=\"896\" height=\"808\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.24.05.png\" alt=\"\" class=\"wp-image-3037\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.24.05.png 896w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.24.05-300x271.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.24.05-768x693.png 768w\" sizes=\"(max-width: 896px) 100vw, 896px\" \/><\/a><figcaption class=\"wp-element-caption\">US20060205715A1 [0343]<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.23.07.png\"><img loading=\"lazy\" decoding=\"async\" width=\"572\" height=\"336\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.23.07.png\" alt=\"\" class=\"wp-image-3036\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.23.07.png 572w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/Screenshot-2022-04-08-at-10.23.07-300x176.png 300w\" sizes=\"(max-width: 572px) 100vw, 572px\" \/><\/a><figcaption class=\"wp-element-caption\">US20200002345A1 Ex 136<\/figcaption><\/figure>\n<\/figure>\n<\/div>\n<\/div>\n\n\n\n<p>Such structures can be captured as 
CXSMILES:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>c12cccc1C=CC=C2.*&#91;Zr](*)(Cl)Cl.c12cccc1C=CC=C2 |m:9:0.1.2.3.4,11:14.15.16.17.18| US20150376312A1 (2)\n&#91;c-]12cccc1C=CC=C2.*&#91;Zr++](*)(Cl)Cl.&#91;c-]12cccc1C=CC=C2 |m:9:0.1.2.3.4,11:14.15.16.17.18| US20150376312A1 (2)\nC1CCOC1.C(Cl)Cl |Sg:c:0,1,2,3,4::,Sg:c:5,6,7::,Sg:mix:0,1,2,3,4,5,6,7::,SgH:2:0.1| 20% THF\/DCM\n&#91;N+](=O)(&#91;O-])C1=CC=C(C(=O)O&#91;C@H]2&#91;C@H](CCCC2)N2N=C(C(=C2)NC(=O)C=2C=NN3C2N=CC=C3)C3=C(C=CC(=C3)Cl)OC)C=C1 |r| US20200002345A1 Ex 136<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Background<\/h2>\n\n\n\n<p>CXSMILES was originally created by ChemAxon as a way to store a <a href=\"https:\/\/discover.3ds.com\/sites\/default\/files\/2020-08\/biovia_ctfileformats_2020.pdf\">CTfile<\/a> (i.e. MOLfile) in SMILES without losing information, but it has evolved to be useful in its own right. Recent versions of <a href=\"https:\/\/enamine.net\/compound-collections\/real-compounds\/real-database\">Enamine REAL<\/a> use it for stereochemistry groups, and there are efforts by <a href=\"https:\/\/www.inchi-trust.org\/\">InChI<\/a> to better represent inorganics, mixtures, and reactions which could make use of it. As more toolkits add support for the format, CXSMILES is becoming a convenient lingua franca for advanced chemical representations. 
<\/p>\n\n\n\n<p>The following toolkits support CXSMILES to various degrees:<\/p>\n\n\n\n<ul>\n<li>ChemAxon &#8211; JChem,&nbsp;Marvin&nbsp;etc<\/li>\n\n\n\n<li>Indigo<\/li>\n\n\n\n<li>CDK<\/li>\n\n\n\n<li>RDKit<\/li>\n\n\n\n<li>OPSIN<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/EliLillyCo\/LillyMol\">LillyMol<\/a><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Gotcha!<\/strong><\/h2>\n\n\n\n<p>ChemAxon provides public <a href=\"https:\/\/docs.chemaxon.com\/display\/docs\/ChemAxon_Extended_SMILES_and_SMARTS_-_CXSMILES_and_CXSMARTS.html\">documentation<\/a> on CXSMILES, but there are corner cases and some wonky areas I want to discuss. I planned to capture this in one post but quickly realised the first topic alone is enough on its own. 
I will update the links below as new posts appear:<\/p>\n\n\n\n<ul>\n<li><a href=\"#bond-indexes\">Bond Indexes<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/nextmovesoftware.com\/blog\/2022\/04\/20\/cxsmiles-part-2-component-grouping\/\">Component Grouping<\/a> (Part 2)<\/li>\n\n\n\n<li><a href=\"https:\/\/nextmovesoftware.com\/blog\/2025\/09\/22\/cxsmiles-part-3-repeat-groups\/\">Repeat Groups<\/a> (Part 3)<\/li>\n\n\n\n<li>Enhanced Stereo Canonicalisation<\/li>\n\n\n\n<li>Atom Labels<\/li>\n\n\n\n<li>EPAM Highlight Extension<\/li>\n\n\n\n<li>Dative bond valence<\/li>\n\n\n\n<li>Multi- vs Variable- attachment<\/li>\n\n\n\n<li>Wish-list &#8211; beyond CXSMILES\n<ul>\n<li>Partial feature set<\/li>\n\n\n\n<li>Atropisomers<\/li>\n\n\n\n<li>More compact coordinates<\/li>\n\n\n\n<li>cis\/trans- stereo groups<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"bond-indexes\"><strong>Bond&nbsp;Indexes<\/strong><\/h2>\n\n\n\n<p>Atoms and bonds in CXSMILES are referenced by a zero-based index (<em>0 &lt;= idx &lt; n<\/em>), which is their position in the SMILES string:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CC=CC |c:1|\nCC=CC |t:1|\nCC=CC |ctu:1|<\/code><\/pre>\n\n\n\n<p>In this case it&#8217;s easy to see: there are three bonds, at indexes <code>0, 1, 2<\/code>. The bond at index <code>1<\/code> is specified as cis (<code>c:<\/code>), trans (<code>t:<\/code>), or unspecified (<code>ctu:<\/code>). SMILES is a linear notation, so ring bonds get written twice: once at the ring open and once at the ring close. Does the index increment twice? 
Is the ring open (first occurrence) or ring close (second occurrence) the &#8220;reference&#8221;?<\/p>\n\n\n\n<p>Using a dot-disconnection trick we can reorder the previous example and probe what ChemAxon accepts:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>C1C.CC=1 |c:0| wrong\nC1C.CC=1 |c:2| correct\nC=1C.CC1 |c:0| wrong\nC=1C.CC1 |c:2| correct - less desirable IMO<\/code><\/pre>\n\n\n\n<p>The &#8220;correct&#8221; choice is |c:2|: the ring closure bond is the reference. I should note this is somewhat an artefact of how the SMILES is parsed. A useful efficiency trick when parsing SMILES is to use partial bonds (leave one atom temporarily undefined), which in this case would give you the incorrect index unless extra steps were taken.<\/p>\n\n\n\n<p>A similar issue exists in vanilla SMILES when bond types mismatch: <code>C#1C.C=1<\/code> or <code>C\/1=C\/C.C\/1<\/code>. The bond that takes precedence is toolkit dependent; hopefully you would get a warning or error.<\/p>\n\n\n\n<p>In case you don&#8217;t like the dot-disconnection being used like that, here is one in a macrocycle:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>C=1CCCCCCCCCC1 |c:0| wrong\nC1CCCCCCCCCC=1 |c:10| correct<\/code><\/pre>\n\n\n\n<p>The double counting question is already answered, since it was <code>|c:2|<\/code> and not <code>|c:3|<\/code>, but to confirm that bonds do <strong><span style=\"text-decoration: underline;\">not<\/span><\/strong> get counted twice:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>C1CCCCC1C=CC |c:8| wrong\nC1CCCCC1C=CC |c:7| correct<\/code><\/pre>\n\n\n\n<p>Double bond configurations in rings are rare, but bond indexes also apply to dative\/hydrogen bonds, so it is important there is consensus on how they work.<\/p>\n\n\n\n<p>SMILES already supports cis\/trans specification, so why does CXSMILES add this? It turns out that, to avoid error propagation, you need to be able to specify an <strong>unknown configuration<\/strong>, and this is not always possible in normal SMILES. 
SMILES uses <code>\"\/\"<\/code> and <code>\"\\\"<\/code> &#8211; it looks pretty but causes problems. Pause for a second and see if you can write the SMILES for the following structure:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/1-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"173\" height=\"69\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/1-1.png\" alt=\"\" class=\"wp-image-3064\"\/><\/a><figcaption class=\"wp-element-caption\">Can you write the SMILES for this?<\/figcaption><\/figure><\/div>\n\n\n<p>Maybe you wrote something like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>C\/C=C\\C=C\/C=C\\C\nC\\C=C\/C=C\\C=C\/C<\/code><\/pre>\n\n\n\n<p>Indeed most toolkits will do exactly that! Unfortunately we&#8217;ve added information that wasn&#8217;t there and inadvertently defined the middle bond. It is actually possible if you add explicit hydrogens:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>C\/C=C(\/&#91;H])C=C\/C=C\\C\nC\\C=C(\\&#91;H])C=C\\C=C\/C<\/code><\/pre>\n\n\n\n<p>But this only gets you so far, what if we had nitrogens? CXSMILES allows us to encode this unambiguously:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CC=NC=CN=CC |c:1,5|<\/code><\/pre>\n\n\n\n<p>This may seem like a narrow corner case but if you try to parse PubChem you will find a lot of them &#8211; or rather inconsistency warnings. 
Here are some warnings from CDK:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Ignoring invalid directional bonds\nC1=C\/C\/2=C\/N=C3\/C(=C\/4\\C(=N\/C=C\/5\/C(=C2\/C=C1)\/C=CC=C5)C=CC=C4)\/C=CC=C3 5379414\n                    ^ ^\nIgnoring invalid directional bonds\nCO\/C(=C(\\C#N)\/C=C(\/C=C(\/C(=O)OC)\\C#N)\/C=C(\/C(=O)OC)\\C#N)\/O 5720097\n                  ^                  ^<\/code><\/pre>\n\n\n\n<p>Depending on the traversal and bond direction assignment we may accidentally define the configuration of this bond and no warning would be generated or alternatively it will come out with an &#8220;invalid&#8221; syntax.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/5379414.png\"><img loading=\"lazy\" decoding=\"async\" width=\"212\" height=\"173\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/5379414.png\" alt=\"\" class=\"wp-image-3043\"\/><\/a><figcaption class=\"wp-element-caption\">PubChem CID 5379414<\/figcaption><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/CID-5720097.png\"><img loading=\"lazy\" decoding=\"async\" width=\"337\" height=\"226\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/CID-5720097.png\" alt=\"\" class=\"wp-image-3040\" style=\"width:258px;height:173px\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/CID-5720097.png 337w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/CID-5720097-300x201.png 300w\" sizes=\"(max-width: 337px) 100vw, 337px\" \/><\/a><figcaption class=\"wp-element-caption\">PubChem CID 5720097 <\/figcaption><\/figure><\/div>\n\n\n<p>It&#8217;s problematic and Daylight had planned to address it. 
At the user group meeting in 2005 a <a href=\"https:\/\/www.daylight.com\/meetings\/mug05\/Delany\/futures.html\">Futures<\/a> talk makes references to preliminary work on SMILES5: <em>&#8220;A unification of the EZ stereo representation with atom-based stereo is proposed. This will allow better specification of multiple conjugated EZ centers and also allows more robust specification of relative stereochemistry.&#8221;<\/em>. <\/p>\n\n\n\n<p>It may have looked something like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>C&#91;C@H]=&#91;C@H]C=C&#91;C@H]=&#91;C@H]C cis,cis-\nC&#91;C@H]=&#91;C@H]C=C&#91;C@H]=&#91;C@@H]C cis,trans-\nC&#91;C@H]=&#91;C@H]C=C&#91;C@@H]=&#91;C@H]C cis,trans-<\/code><\/pre>\n\n\n\n<p>I actually added support for this in CDK but in hindsight I think using CXSMILES is simpler but less elegant:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CC=CC=CC=CC |c:1,5|\nC\/C=C\\C=C\\C=C\/C |ctu:3|\nC\/C=C\\C=C\\C=C\/C |c:1,5,ctu:3|<\/code><\/pre>\n\n\n\n<p>Which is best depends on the application &#8211; ignoring the CXSMILES you either get no configuration or the wrong configuration. With CXSMILES these should all canonicalise to the same thing.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>ChemAxon Extended SMILES and SMARTS&nbsp;(CXSMILES)&nbsp;has become more popular in recent years for its ability to capture additional information on top of the core structural connectivity. At NextMove Software we think it&#8217;s great and have increasingly used CXSMILES over the years to capture information more precisely. 
Such structures can be captured as CXSMILES: Background Originally created &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2022\/04\/08\/cxsmiles-gotchas-part-1-bond-indexes\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">CXSMILES Gotchas &#8211; Part 1: Bond Indexes<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26,23,25],"tags":[28,29,27],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3026"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=3026"}],"version-history":[{"count":29,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3026\/revisions"}],"predecessor-version":[{"id":3155,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3026\/revisions\/3155"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=3026"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=3026"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=3026"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}