{"id":3059,"date":"2022-04-20T12:32:46","date_gmt":"2022-04-20T11:32:46","guid":{"rendered":"https:\/\/nextmovesoftware.com\/blog\/?p=3059"},"modified":"2025-09-22T09:47:14","modified_gmt":"2025-09-22T08:47:14","slug":"cxsmiles-part-2-component-grouping","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2022\/04\/20\/cxsmiles-part-2-component-grouping\/","title":{"rendered":"CXSMILES Part 2: Component Grouping"},"content":{"rendered":"\n<p>This post is a follow-up to the previous introduction &#8211; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2022\/04\/08\/cxsmiles-gotchas-part-1-bond-indexes\/\">Part 1<\/a>. Here I examine how we can capture fragment grouping in CXSMILES and other extensions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Fragment Grouping<\/strong><\/h2>\n\n\n\n<p>Fragment grouping (or component grouping) allows you to group together separate fragments\/<a href=\"https:\/\/en.wikipedia.org\/wiki\/Component_(graph_theory)\">components<\/a> of a molecule. It is critical for reaction representation and therefore several independent SMILES extensions have emerged. 
Common cases include keeping counter-ions, hydrates, and salts together as a single &#8220;molecule&#8221;.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/1-4.png\"><img loading=\"lazy\" decoding=\"async\" width=\"458\" height=\"136\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/1-4.png\" alt=\"\" class=\"wp-image-3086\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/1-4.png 458w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/1-4-300x89.png 300w\" sizes=\"(max-width: 458px) 100vw, 458px\" \/><\/a><figcaption><code>EP2305640A2 Example 11, Step v<\/code><br>(SMILES)<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/f_3.4.56.7_-2.png\"><img loading=\"lazy\" decoding=\"async\" width=\"417\" height=\"104\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/f_3.4.56.7_-2.png\" alt=\"\" class=\"wp-image-3087\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/f_3.4.56.7_-2.png 417w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/f_3.4.56.7_-2-300x75.png 300w\" sizes=\"(max-width: 417px) 100vw, 417px\" \/><\/a><figcaption><code>EP2305640A2 Example 11, Step v<\/code><br>(CXSMILES with fragment grouping)<\/figcaption><\/figure><\/div>\n\n\n\n<p><strong>Syntax<\/strong><\/p>\n\n\n\n<p>Here is a simple example annotated with the fragment indexes; we want to group together <code>(0,1)<\/code> and <code>(3,4,5)<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Na+].&#91;OH-].c1ccccc1.&#91;Cs+].&#91;O-]C(=O)&#91;O-].&#91;Cs+]&gt;&gt; |f:0.1,3.4.5|\n--0-- --1-- ----2--- --3-- -----4------- --5--<\/code><\/pre>\n\n\n\n<p>Component indexes span the entire reaction, so we 
can, for example, move the caesium carbonate to the agents and the CXSMILES encoding does not change:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Na+].&#91;OH-].c1ccccc1&gt;&#91;Cs+].&#91;O-]C(=O)&#91;O-].&#91;Cs+]&gt; |f:0.1,3.4.5|\n--0-- --1-- ----2--- --3-- -----4------- --5--<\/code><\/pre>\n\n\n\n<p><strong><em>Does it only apply to reactions?<\/em> <\/strong><\/p>\n\n\n\n<p><strong>Toolkit dependent<\/strong>. ChemAxon appears to only read\/write it on reactions (MarvinJS v22.9.0) but it&#8217;s also useful on molecules to capture formulations\/mixtures (e.g. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Artificial_seawater\">Artificial seawater<\/a>). In reactions, fragment grouping of agents (between the two &#8220;&gt;&#8221;) appears to be ignored in MarvinJS &#8211; so my example images aren&#8217;t valid examples. One of our customers tested Marvin Desktop v21.15.1 for me and confirmed it round-trips correctly. <\/p>\n\n\n\n<p><strong><em>Do component terms need to be adjacent?<\/em> <\/strong><\/p>\n\n\n\n<p><strong>Toolkit dependent<\/strong>. In older versions of ChemAxon desktop tools (I no longer have access) I remember it would reject non-adjacent components:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Na+].c1ccccc1.&#91;OH-]&gt;&gt; |f:0.2|<\/code><\/pre>\n\n\n\n<p>This does not seem to be the case in MarvinJS, and again a customer confirmed it round-trips OK in Marvin Desktop.<\/p>\n\n\n\n<p><strong><em>Spanning Different Roles?<\/em> <\/strong><\/p>\n\n\n\n<p>An input where the roles of the components being grouped differ (e.g. a reactant and a product) could be rejected as inconsistent:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>c1ccccc1.&#91;Na+]&gt;&gt;&#91;OH-] |f:1.2|\nc1ccccc1.&#91;Na+]&gt;&gt;&#91;OH-] |f:2.1|<\/code><\/pre>\n\n\n\n<p>MarvinJS and CDK default to the role of the first component encountered, so those two inputs are interpreted differently. As the author of the CDK logic I will note this is consistent only by coincidence. 
<\/p>\n\n\n\n<p><strong><em>Implicit grouping?<\/em><\/strong><\/p>\n\n\n\n<p>We can implicitly group components with multi-attach (<strong>m:<\/strong>) and Sgroup brackets. For example, consider the following: which is preferred?<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/R1m_1_2.3.4.5.6.7_.png\"><img loading=\"lazy\" decoding=\"async\" width=\"49\" height=\"63\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/R1m_1_2.3.4.5.6.7_.png\" alt=\"\" class=\"wp-image-3091\"\/><\/a><\/figure><\/div>\n\n\n\n<pre class=\"wp-block-code\"><code>**.c1ccccc1 |$R1$,m:1:2.3.4.5.6.7|\n**.c1ccccc1 |$R1$,f:0.1,m:1:2.3.4.5.6.7|<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Alternatives<\/h2>\n\n\n\n<p><strong>Daylight SMILES5<\/strong><\/p>\n\n\n\n<p>As with the cis\/trans specification, SMILES5 had a solution: <\/p>\n\n\n\n<p>&#8220;<em>Molecule-level Components: There will be another level of components within a molecule and reaction object which will allow easier handling of complex mixtures.<\/em>&#8221; &#8211; <a href=\"https:\/\/www.daylight.com\/meetings\/mug05\/Delany\/futures.html\">Futures, MUG 2005<\/a><\/p>\n\n\n\n<p>SMARTS has component grouping using zero-level brackets already and it would likely have followed a similar syntax:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>(&#91;Na+].&#91;OH-]).c1ccccc1.(&#91;Cs+].&#91;O-]C(=O)&#91;O-].&#91;Cs+])&gt;&gt;<\/code><\/pre>\n\n\n\n<p>Notice the wording: it applies to both <strong>molecules and reactions<\/strong>.<\/p>\n\n\n\n<p><strong>LillyMol<\/strong><\/p>\n\n\n\n<p>An extension used by Eli Lilly&#8217;s <a href=\"https:\/\/github.com\/EliLillyCo\/LillyMol\">LillyMol<\/a> is to use &#8220;.&#8221; to separate fragments and &#8220;+&#8221; to <a 
href=\"https:\/\/github.com\/EliLillyCo\/LillyMol\/blob\/007b2ad05b01bc3b7691cd5cbd6f720def759949\/src\/Molecule_Tools\/rxn_standardize.cc#L649-L663\">separate the molecules<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Na+].&#91;OH-]+c1ccccc1+&#91;Cs+].&#91;O-]C(=O)&#91;O-].&#91;Cs+]&gt;&gt;<\/code><\/pre>\n\n\n\n<p>Note that LillyMol also supports CXSMILES.<\/p>\n\n\n\n<p><strong>IBM RXN for Chemistry<\/strong><\/p>\n\n\n\n<p><a href=\"https:\/\/pubs.rsc.org\/en\/content\/articlelanding\/2020\/SC\/c9sc05704h\"><em>Schwaller et al<\/em> 2020<\/a> describe how they use &#8220;~&#8221; to group together fragments. They note in the supplementary information how this is more useful than CXSMILES for their purposes since it enforces that the fragments are kept together:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Na+]~&#91;OH-].c1ccccc1.&#91;Cs+]~&#91;O-]C(=O)&#91;O-]~&#91;Cs+]&gt;&gt;<\/code><\/pre>\n\n\n\n<p><strong>NextMove (proposed) \/ OntoChem<\/strong><\/p>\n\n\n\n<p>In 2013 Roger <a href=\"https:\/\/nextmovesoftware.com\/talks\/Sayle_FileFormats_ACS_201309.pdf\">proposed a double-dot<\/a> &#8220;..&#8221; for a similar purpose. The advantage is that &#8220;relaxed&#8221; SMILES parsers will simply ignore the repeated dot:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Na+]..&#91;OH-].c1ccccc1.&#91;Cs+]..&#91;O-]C(=O)&#91;O-]..&#91;Cs+]&gt;&gt;<\/code><\/pre>\n\n\n\n<p><a href=\"https:\/\/ontochem.com\/\">OntoChem<\/a> also use this representation in reactions but I cannot find a link to relevant material.<\/p>\n\n\n\n<p><strong>NextMove (actual)<\/strong><\/p>\n\n\n\n<p>In <a href=\"https:\/\/nextmovesoftware.com\/Pistachio\">Pistachio<\/a> we use CXSMILES for fragment grouping in reactions. 
In <a href=\"https:\/\/nextmovesoftware.com\/blog\/2021\/03\/24\/13118970-reactions-and-counting\/\">recent releases<\/a> we have tried to use alternative representations that avoid the problem where possible:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Na]O.c1ccccc1.&#91;Cs]OC(=O)O&#91;Cs]&gt;&gt;<\/code><\/pre>\n\n\n\n<p>Where needed we still use CXSMILES since it is the most widely supported convention:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/6.png\"><img loading=\"lazy\" decoding=\"async\" width=\"121\" height=\"106\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/6.png\" alt=\"\" class=\"wp-image-3098\"\/><\/a><figcaption>HATU reagent<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/f_3.4_.png\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"72\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/f_3.4_.png\" alt=\"\" class=\"wp-image-3107\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/f_3.4_.png 384w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2022\/04\/f_3.4_-300x56.png 300w\" sizes=\"(max-width: 384px) 100vw, 384px\" \/><\/a><figcaption>US20200138797A1 Intermediate 3<\/figcaption><\/figure><\/div>\n\n\n\n<p>The fragment grouping also gets captured in the JSON format of reactions albeit much less compactly:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-json\" data-lang=\"JSON\"><code>{\nrole: &quot;Product&quot;,\norgName: &quot;title compound&quot;,\nname: &quot;Methyl (2S)-2-amino-3-(2-chlorophenyl)propanoate hydrochloride&quot;,\nsmiles: &quot;Cl.N[C@H](C(=O)OC)CC1=C(C=CC=C1)Cl&quot;,\nquantities: [ {type: &quot;Mass&quot;, value: 5.89, text: 
&quot;5.89 g&quot;},\n              {type: &quot;Yield&quot;, value: 94, text: &quot;94%&quot;}],\nstoichiometry: 1\n}<\/code><\/pre><\/div>\n\n\n\n<p>We also support interconversion of the LillyMol syntax in our reaction processing tool set, <a href=\"https:\/\/nextmovesoftware.com\/hazelnut\">HazELNut<\/a>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ echo \"*&gt;&#91;Na+].&#91;OH-].&#91;Cs+].&#91;O-]C(=O)&#91;O-].&#91;Cs+]&gt;* |f:1.2.3,4.5|\" | .\/filbert .smi .iwsmi\n*&gt;C(=O)(&#91;O-])&#91;O-].&#91;Cs+]+&#91;OH-].&#91;Na+].&#91;Cs+]&gt;*\n\n$ echo \"*&gt;C(=O)(&#91;O-])&#91;O-].&#91;Cs+]+&#91;OH-].&#91;Na+].&#91;Cs+]&gt;*\" | .\/filbert .iwsmi .smi\n*&gt;C(=O)(&#91;O-])&#91;O-].&#91;OH-].&#91;Na+].&#91;Cs+].&#91;Cs+]&gt;* |f:1.4,2.3.5| <\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>This post is a follow-up to the previous introduction &#8211; Part 1. Here I examine how we can capture fragment grouping in CXSMILES and other extensions. Fragment Grouping Fragment grouping (or component grouping) allows you to group together separate fragments\/components of a molecule. 
It is critical for reaction representation and therefore several independent SMILES extensions &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2022\/04\/20\/cxsmiles-part-2-component-grouping\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">CXSMILES Part 2: Component Grouping<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24,26,23,25],"tags":[],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3059"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=3059"}],"version-history":[{"count":45,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3059\/revisions"}],"predecessor-version":[{"id":3147,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3059\/revisions\/3147"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=3059"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=3059"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=3059"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}