{"id":2975,"date":"2021-03-24T12:56:46","date_gmt":"2021-03-24T12:56:46","guid":{"rendered":"https:\/\/nextmovesoftware.com\/blog\/?p=2975"},"modified":"2021-03-24T13:49:14","modified_gmt":"2021-03-24T13:49:14","slug":"13118970-reactions-and-counting","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2021\/03\/24\/13118970-reactions-and-counting\/","title":{"rendered":"13,118,970 Reactions and Counting"},"content":{"rendered":"\n<p>In <a href=\"https:\/\/web.archive.org\/web\/20200622054556\/https:\/\/bitbucket.org\/dan2097\/patent-reaction-extraction\/downloads\/\">2012, 2014<\/a> and <a href=\"https:\/\/figshare.com\/articles\/dataset\/Chemical_reactions_from_US_patents_1976-Sep2016_\/5104873\">2017<\/a> Daniel Lowe (while at NextMove Software) released a large collection of reactions extracted from USPTO patent applications as <a href=\"https:\/\/creativecommons.org\/share-your-work\/public-domain\/cc0\/\">CC-Zero<\/a>. We have made updates (currently quarterly) for customers of our <a href=\"https:\/\/www.nextmovesoftware.com\/pistachio.html\">Pistachio<\/a> query tool and extended the reaction data to include USPTO Sketches and European patents.<\/p>\n\n\n\n<p>The next version of Pistachio will include data from WIPO PCT documents, along with several enhancements to content and representation. Highlighting these in a blog post made more sense than the usual release notes.<\/p>\n\n\n\n<p><strong><em>Number of Reactions<\/em><\/strong><\/p>\n\n\n\n<p>The next release will contain &gt;<strong>13,118,970<\/strong>&nbsp;reactions from the following sources:<\/p>\n\n\n\n<table class=\"wp-block-table\"><tbody><tr><td>\n<strong>Source<\/strong>\n<\/td><td>\n<strong>Count<\/strong>\n<\/td><td>\n<strong>Latest Extraction<\/strong>\n<\/td><\/tr><tr><td> <strong>USPTO Grant Text<\/strong> <\/td><td>\n3,290,056\n<\/td><td>\n2021-03-16\n<\/td><\/tr><tr><td> <strong>USPTO Appl. 
Text<\/strong> <\/td><td>\n3,595,510\n<\/td><td>\n2021-03-18\n<\/td><\/tr><tr><td>\n<strong>WIPO PCT Text<\/strong>\n<\/td><td>\n1,484,646\n<\/td><td>\n2021-03-18\n<\/td><\/tr><tr><td> <strong>EPO Grant Text<\/strong> <\/td><td>\n1,060,397\n<\/td><td>\n2021-03-17\n<\/td><\/tr><tr><td> <strong>EPO Appl. Text<\/strong> <\/td><td>\n696,578\n<\/td><td>\n2021-03-17\n<\/td><\/tr><tr><td> <strong>USPTO Grant Sketch<\/strong><\/td><td>\n1,186,924\n<\/td><td>\n2021-03-16\n<\/td><\/tr><tr><td><strong> USPTO Appl. Sketch<\/strong><\/td><td>\n1,804,859\n<\/td><td>\n2021-03-18\n<\/td><\/tr><\/tbody><\/table>\n\n\n\n<p>Pistachio covers <strong>1976 to the present day<\/strong>; however, <strong>4,447,418<\/strong> (33%) of the reactions were published in the last five years and <strong>8,166,128<\/strong> (62%) in the last ten years.<\/p>\n\n\n\n<p>The reaction data is document (or citation) centric: the same reaction text often occurs in both an application and a grant. It can also be published in multiple authorities (e.g. USPTO, EPO and WIPO). Often the description text is identical, but not always; sometimes a product yield\/quantity may be mistyped or omitted. The number of unique reactions by <strong>RInChI<\/strong> is <strong>4,212,894<\/strong>.<\/p>\n\n\n\n<p>All reactions extracted from text are Atom-Atom Mapped, either by <strong><a href=\"https:\/\/www.nextmovesoftware.com\/namerxn.html\">NameRxn<\/a><\/strong> or, if the reaction is unrecognised, by falling back to Indigo. <strong>9,383,607 (71.5%)<\/strong> are currently recognized by NameRxn.<\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>What\u2019s New?<\/strong><\/p>\n\n\n\n<p>Through improvements related to the WIPO PCT inclusion, we observed a <strong>~15%<\/strong> increase in recall for existing USPTO and EPO data. This is one of the many advantages of automated extraction over manually curated databases. Tweaks can be made and mistakes fixed, then applied in bulk over all the original source documents. 
Re-extraction currently takes a few days on a single machine (8 cores). Where extraction mistakes are spotted we welcome the feedback and aim to resolve these where possible.<\/p>\n\n\n\n<p><strong><em>Embedded Heading Detection<\/em><\/strong><\/p>\n\n\n\n<p>One of the biggest challenges with handling WIPO PCT data is that the English text is primarily OCRd. The submitted document quality can vary considerably, which leads to a wide spectrum of related problems.<\/p>\n\n\n\n<p>OCR is well known to have issues with the non-standard characters used in systematic chemical names. Fortunately the majority of issues are simple character transliterations and extra white space. These can be effectively handled with our spelling correction algorithms. Some very badly corrupted names are beyond all hope:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"808\" height=\"138\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image.png\" alt=\"\" class=\"wp-image-2976\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image.png 808w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-300x51.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-768x131.png 768w\" sizes=\"(max-width: 808px) 100vw, 808px\" \/><figcaption>WO 2020\/243135 A1<br>Part of the chemical name could not be OCRd and remains as an image<\/figcaption><\/figure><\/div>\n\n\n\n<p>Another common issue with OCR is the detection of paragraph breaks and lack of title markup. Often chemical reaction descriptions use anaphoric references such as \u201ctitle compound\u201d, \u201cdesired product\u201d, and \u201cproduct from Step B\u201d that need to be resolved. 
If the title is not found we can\u2019t resolve it.<\/p>\n\n\n\n<p>Paragraph breaks may be omitted:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"220\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-1-1024x220.png\" alt=\"\" class=\"wp-image-2977\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-1-1024x220.png 1024w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-1-300x64.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-1-768x165.png 768w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-1.png 1245w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>WO 2020\/239862 A1<br>There should be a paragraph break before &#8220;Intermediate 2:&#8221;<\/figcaption><\/figure>\n\n\n\n<p>or in the wrong place (here splitting a chemical name):<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"221\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-4-1024x221.png\" alt=\"\" class=\"wp-image-2980\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-4-1024x221.png 1024w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-4-300x65.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-4-768x166.png 768w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-4.png 1213w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>WO 2020239862 A1<br>The &#8220;Step H&#8221; compound has a paragraph break in the middle of the name.<\/figcaption><\/figure>\n\n\n\n<p>To compensate for this, new algorithms were introduced to detect patterns of embedded headings where the break was missed by OCR. 
Existing inline heading (start of paragraph) detection was also improved.<\/p>\n\n\n\n<p>These errors are not unique to WIPO data and occasionally occur in the USPTO documents too:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"798\" height=\"161\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-3.png\" alt=\"\" class=\"wp-image-2979\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-3.png 798w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-3-300x61.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-3-768x155.png 768w\" sizes=\"(max-width: 798px) 100vw, 798px\" \/><figcaption>US 2020\/0071296 A1<br>There should be a break before &#8220;Step-2:&#8221;<\/figcaption><\/figure>\n\n\n\n<p><a href=\"https:\/\/www.wipo.int\/patentscope\/en\/\">Patentscope<\/a> (WIPO) recently announced a new OCR extraction framework that should improve paragraph detection and chemical formula recognition (from Feb 11th 2021). We have not yet found an improvement in the extraction recall rate.<\/p>\n\n\n\n<p><strong><em>Multi-Paragraph Reactions<\/em><\/strong><\/p>\n\n\n\n<p>In previous versions, the reaction step parsing started a new reaction on every new paragraph. This logic was tweaked to allow a reaction to span multiple paragraphs and handle the less reliable breaks in WIPO data. Regressions in USPTO extraction helped identify places in the reaction descriptions where a yielded product action was missed.<\/p>\n\n\n\n<p>A side effect of this is we now sometimes extract cases where there was some unknown intermediate (A -&gt; ?, ? -&gt; B) as a single reaction. We are considering how to handle these.<\/p>\n\n\n\n<p><strong><em>Prefer Connected Representations<\/em><\/strong><\/p>\n\n\n\n<p>Chemical structures are generated from systematic names, line formulae, and dictionaries. 
Where possible we have updated the structure representations to favour connected representations:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>[K]OC(=O)O[K]<\/strong> instead of <strong>[K+].[K+].[O-]C(=O)[O-]<\/strong>\n<strong>[Na][Cl]<\/strong> instead of <strong>[Na+].[Cl-]<\/strong>\netc<\/pre>\n\n\n\n<p>Using <a href=\"https:\/\/docs.chemaxon.com\/display\/docs\/chemaxon-extended-smiles-and-smarts-cxsmiles-and-cxsmarts.md\">CXSMILES<\/a> fragment groups it is possible to keep the grouping of the counter-ions. The reaction components were also listed separately in the raw JSON files. Not all downstream tools can consume the CXSMILES representation, and some users invented\/adapted syntaxes to handle this in their use cases, e.g.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>&gt;[K+]+[K+]+[O-]C(=O)[O-]&gt;\n&gt;[K+]..[K+]..[O-]C(=O)[O-]&gt;<\/strong><\/pre>\n\n\n\n<p>The philosophy in preferring the connected component is that it is easier to split things apart than to piece them back together (Humpty Dumpty). Note that not all counter-ions have been bonded due to undesirable valence representations (e.g. <a href=\"https:\/\/en.wikipedia.org\/wiki\/HATU\">HATU<\/a>, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Ammonium_chloride\">NH4Cl<\/a>) and so the CXSMILES fragment groups remain a useful extension.<\/p>\n\n\n\n<p><strong><em>Representation of Stereoisomer Mixtures<\/em><\/strong><\/p>\n\n\n\n<p>I recently added to <a href=\"https:\/\/github.com\/dan2097\/opsin\/pull\/140\">OPSIN<\/a> the ability to capture racemic and relative stereo information in systematic names. In total around 1% of reactions now have this information captured in CXSMILES. 
In a simple example we have an unknown mixture of enantiomers formed:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"741\" height=\"326\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-5.png\" alt=\"\" class=\"wp-image-2982\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-5.png 741w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-5-300x132.png 300w\" sizes=\"(max-width: 741px) 100vw, 741px\" \/><\/figure><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">BrC1N2C=NC=C2SC=1C1CC1.C(C1N(C2C=CC(OC)=CC=2)N=NC=1C=O)C&gt;&gt;C1(C2SC3=CN=CN3C=2[C@H](C2N=NN(C3C=CC(OC)=CC=3)C=2CC)O)CC1 |&amp;1:38|\tUS20200405696A1_1398\tExample 279<\/pre>\n\n\n\n<p>A more complex example is where one stereocenter configuration is known but the other is not:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"720\" height=\"251\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-6.png\" alt=\"\" class=\"wp-image-2983\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-6.png 720w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-6-300x105.png 300w\" sizes=\"(max-width: 720px) 100vw, 720px\" \/><\/figure><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CN(C(OC(C)(C)C)=O)CC3=N2)=NC=CC=1C(F)(F)F&gt;ClCCl.FC(F)(F)C(O)=O&gt;ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CNCC3=N2)=NC=CC=1C(F)(F)F |&amp;1:8,55,a:13,60| US20200375986A1_0652 Example 4<\/pre>\n\n\n\n<p>Both of these cases could alternatively be represented by simply removing the configuration from the racemic atom (and we may choose to normalise to that in future). 
The most important cases are when the relative configuration of two centres is known but there is a mixture of enantiomers.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"503\" height=\"153\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-7.png\" alt=\"\" class=\"wp-image-2984\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-7.png 503w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-7-300x91.png 300w\" sizes=\"(max-width: 503px) 100vw, 503px\" \/><\/figure><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">C(OC([C@H]1[C@H](C)CN(CC2C=CC=CC=2)C1)=O)C.CC(OC(OC(OC(C)(C)C)=O)=O)(C)C&gt;C(O)C.[OH-].[OH-].[Pd+2]&gt;C[C@H]1CN(C(OC(C)(C)C)=O)C[C@@H]1C(OCC)=O |f:3.4.5,&amp;1:3,4,40,51|\tUS20200369658A1_0530\tIntermediate 2, Step b<\/pre>\n\n\n\n<p><strong><em>NameRxn<\/em><\/strong><\/p>\n\n\n\n<p>Recent updates to NameRxn include limited support for some common non-balanced Functional Group Interconversion and Addition reactions, e.g. \u201cHydroxy to chloro\u201d and \u201cAmination\u201d. This work also allowed a small performance boost. The total number of reaction types named is now <strong>1,528<\/strong> (from <strong>1,297<\/strong> previously); one source of additional reactions has been RXNO. 
NameRxn was not originally designed to provide Atom-Atom Maps; as interest in that has grown, we have made improvements to AAM where a functional group source was unknown.<\/p>\n\n\n\n<p><strong><em>Solvent Mixture Representations<\/em><\/strong><\/p>\n\n\n\n<p>More information about solvents and solvent mixtures is now captured. For example, in \u201c<em>5-chloro-3-((trimethylsilyl)ethynyl)pyrazin-2-amine (100 mg, 0.44 mmol) in THF (8 mL)<\/em>\u201d (US20200085822A1 Example 1 Step 6), the THF solvent is associated by reference from the reactant.<\/p>\n\n\n\n<p>Finer-grained details on component\/volume fractions of solvent mixtures are also captured when described as, for example, \u201c<strong>1M HCl Et2O<\/strong>\u201d or \u201c<strong>THF\/MeOH (1 mL, 1:1)<\/strong>\u201d.<\/p>\n\n\n\n<p><strong><em>Sequence and Step Labels<\/em><\/strong><\/p>\n\n\n\n<p>Where found in the text, we now include and attach the sequence and step labels to reactions, for example \u201cExample 4, Step A\u201d or \u201cCompound 7, Step 2\u201d. Pistachio will allow searching by the labels and resolving queries:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"970\" height=\"277\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-8.png\" alt=\"\" class=\"wp-image-2985\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-8.png 970w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-8-300x86.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/03\/image-8-768x219.png 768w\" sizes=\"(max-width: 970px) 100vw, 970px\" \/><figcaption><strong>US 2020\/0405696 A1 Example 313, Step 2 <\/strong><br> Azide-alkyne Huisgen cycloaddition (4.1.4)<\/figcaption><\/figure>\n\n\n\n<p><strong><em>Improved Cross-Reference Handling<\/em><\/strong><\/p>\n\n\n\n<p>Previously the cross-reference \u201cCompound 1\u201d was indexed and resolved just as \u201c1\u201d. 
We now use the \u201creference type\u201d to disambiguate cases where there is both a \u201cCompound 1\u201d and an \u201cIntermediate 1\u201d. The variety of recognised identifier values was also extended.<\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Summary<\/strong><\/p>\n\n\n\n<p>We have made several improvements to reaction extraction. These will be included in the new version of <a href=\"https:\/\/www.nextmovesoftware.com\/pistachio.html\">Pistachio<\/a>, which will be available at the start of next month.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In 2012, 2014 and 2017 Daniel Lowe (while at NextMove Software) released a large collection of reactions extracted from USPTO patent applications as CC-Zero. We have made updates (currently quarterly) for customers of our Pistachio query tool and extended the reaction data to include USPTO Sketches and European patents. The next version of Pistachio will &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2021\/03\/24\/13118970-reactions-and-counting\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">13,118,970 Reactions and 
Counting<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[18,19],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2975"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=2975"}],"version-history":[{"count":15,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2975\/revisions"}],"predecessor-version":[{"id":2999,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2975\/revisions\/2999"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=2975"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=2975"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=2975"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}