Unleashing over a million reactions into the wild

Unlike with small molecules, there are currently no large sets of publicly available reaction data.

To remedy this situation, we have extracted over a million reactions from United States patent applications (2001-2013) and the same again from patent grants (1976-2013). This contrasts with the original data release of “only” 420 thousand reactions (from 2008-2011 applications), released whilst I was in the PMR group.

The reactions are available as reaction SMILES or CML from here, as 7zip archives. The CML representation includes quantities and yields where these were found. A documentation zip provides further information on the format of the data. This data is made available under CC-Zero, i.e. without copyright. [Update 24/08/2017: A newer version of the dataset described here is available on FigShare]
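
As a rough illustration of how the reaction SMILES might be consumed, the sketch below splits a reaction SMILES string into its reactants, agents and products. It relies only on the standard `reactants>agents>products` layout, with `.` separating molecules. The class name `ReactionSmiles` is made up for this example, and any additional fields that may accompany each reaction in the archive files are not handled (the documentation zip is the place to check the actual layout).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative splitter for a bare reaction SMILES string (hypothetical helper, not part of the data release). */
final class ReactionSmiles {
    final List<String> reactants;
    final List<String> agents;
    final List<String> products;

    private ReactionSmiles(List<String> reactants, List<String> agents, List<String> products) {
        this.reactants = reactants;
        this.agents = agents;
        this.products = products;
    }

    /** Parses "reactants>agents>products"; any of the three sections may be empty. */
    static ReactionSmiles parse(String reactionSmiles) {
        // A limit of -1 keeps empty sections, e.g. the empty agents part of "CCO>>CC=O".
        String[] parts = reactionSmiles.trim().split(">", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected reactants>agents>products: " + reactionSmiles);
        }
        return new ReactionSmiles(molecules(parts[0]), molecules(parts[1]), molecules(parts[2]));
    }

    /** Splits one section on '.', so the components of a salt are returned as separate entries. */
    private static List<String> molecules(String section) {
        if (section.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(section.split("\\."));
    }
}
```

For example, `ReactionSmiles.parse("CCO>>CC=O")` gives one reactant, no agents and one product. Interpreting the individual SMILES strings would still need a cheminformatics toolkit.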

It is hoped that making this data resource available will facilitate analyses that require a large number of reactions.

NextMove Software is currently looking into what insights can be obtained from such data sets. For example, using our reaction classification software, we can show a broad correlation between the type of reaction and its yield, and that this trend could be reproduced from ELN data (presentation here). This is just the beginning of the sorts of analyses that can be performed with access to so many reactions. Expect to hear more at the upcoming ICCS and UK-QSAR meetings.

More information about how the reactions were extracted can be found in my PhD thesis and a presentation I gave at the ACS.

BioCreative announce chemical text mining competition results

NextMove recently participated in the BioCreative CHEMDNER (Chemical compound and drug name recognition) task. This task involved annotating chemical mentions in PubMed abstracts. BioCreative have annotated 10,000 abstracts, of which 7,000 were provided to participants for training. In mid-September, participants were asked to identify mentions in the unseen test corpus of 3,000 abstracts (which, to avoid cheating, was combined with 17,000 decoy abstracts).

In total, 27 teams (23 academic and 4 commercial) submitted results. We achieved 85.0% recall at a precision of 88.7%, giving an F-score of 86.9%. Our solution ranked amongst the best submitted, being only 0.53% from the best performing solution in the chemical entity mentions task and significantly ahead of the other commercial solutions. Inter-annotator agreement was 91%, indicating that, with recent advances in machine annotation, automated systems are rapidly approaching the quality of human abstractors.

Participation in this competition has driven recent developments in LeadMine, including improved coverage of non-systematic chemical entities and detection of abbreviations.

If you want to know the full details, our proceedings paper is available here, and you can find out how we compared in the full proceedings here (results on p14, list of teams on p31). The presentation below, which I gave at the workshop, summarises our system:

Java 6 vs Java 7: When implementation matters

Java 7u6 last year brought with it a change to the implementation of String#substring. In previous versions of Java, Strings created by substring shared the same char array as the original String, with an internal offset being used to make sure the correct characters were retrieved. This has the advantage of making substring a very cheap operation, O(1), but has the potential to create a significant memory leak: if a substring is taken of a long String, the char array of the long String cannot be garbage collected whilst the substring remains accessible. Java 7u6 (and later) changes this and instead returns truly independent Strings… but this requires copying part of the char array, meaning substring is now an O(n) operation in the length of the substring, and hence repeatedly taking long substrings should be avoided.

A case where this behaviour can occur is in a tokenizer that, after recognizing each token, redefines the remaining String using substring. The longer the String in question, the greater the effect on performance.

This behaviour is found in OPSIN’s parser, accounting for a ~13% decrease in performance when moving from JDK6 to JDK7.

This performance regression can be tackled in at least two ways: implementing a cheap substring operator using a decorating class (an example), or not using substring at all and instead keeping track of an index in the string from which to read. The first approach is hampered by String being final, so a decorating class must instead implement CharSequence, which is far less frequently used. Hence for OPSIN I chose the approach of keeping track of the index tokenization had reached:

Comparison of performance between JDK6 and JDK7







Substrings are still used to capture the tokens, which may explain why the performance is still a bit slower.
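
To make the difference concrete, here is a minimal sketch of the two tokenization styles discussed above. It is not OPSIN’s parser: the class `TokenizerSketch`, its toy vocabulary and the longest-match rule are all assumptions made for illustration. The first method re-slices the remaining input after every token, which copies the whole tail on Java 7u6 and later; the second only advances an index and copies just the short token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy tokenizer contrasting substring re-slicing with index tracking (illustrative only). */
final class TokenizerSketch {
    // Hypothetical vocabulary; a real chemical name parser uses far richer rules.
    private static final List<String> VOCAB = Arrays.asList("meth", "eth", "prop", "yl", "ane", "ol");

    /** Redefines the remaining input with substring after each token. */
    static List<String> tokenizeWithSubstring(String input) {
        List<String> tokens = new ArrayList<String>();
        String remaining = input;
        while (!remaining.isEmpty()) {
            String match = longestMatch(remaining, 0);
            tokens.add(match);
            // On Java 7u6+ this copies the entire tail, so the loop is quadratic overall.
            remaining = remaining.substring(match.length());
        }
        return tokens;
    }

    /** Keeps the input intact and tracks the position tokenization has reached. */
    static List<String> tokenizeWithIndex(String input) {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < input.length()) {
            String match = longestMatch(input, pos);
            // Only the (short) token itself is copied here.
            tokens.add(input.substring(pos, pos + match.length()));
            pos += match.length();
        }
        return tokens;
    }

    /** Returns the longest vocabulary entry found at the given offset. */
    private static String longestMatch(String input, int offset) {
        String best = null;
        for (String candidate : VOCAB) {
            if (input.startsWith(candidate, offset)
                    && (best == null || candidate.length() > best.length())) {
                best = candidate;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No token found at offset " + offset);
        }
        return best;
    }
}
```

Both methods return the same tokens for the same input; only the amount of copying differs. `String#startsWith(String, int)` is what lets the index-based version test for a token without creating an intermediate String.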

More details on the substring implementation change:


Bringing sugars to OPSIN

I’m pleased to announce the release of OPSIN 1.4.0. This new release brings significant improvements to OPSIN’s coverage of carbohydrate nomenclature. It also complements NextMove Software’s Sugar & Splice project that aims to make the conversion between carbohydrate and small molecule representations effortless.

Below is the effect this improvement to OPSIN has had on the conversion of IUPAC names in ChEBI. (This is one of the data sets used in the OPSIN publication [free access])

Number of names convertible to InChI on IUPAC names from ChEBI (Sept 2010)

Examples of new nomenclature supported (pictures generated by the OPSIN web service)

3-Deoxy-alpha-D-manno-oct-2-ulopyranosonic acid

beta-D-Fructofuranosyl alpha-D-glucopyranoside

Methyl 2,3,4-tri-O-acetyl-alpha-D-glucopyranosyluronate bromide


OPSIN 1.4.0 is available from Bitbucket and Maven Central. The full release notes are below:

  • Added support for dialdoses, diketoses, ketoaldoses, alditols, aldonic acids, uronic acids, aldaric acids, glycosides and oligosaccharides, named systematically or from trivial stems, in cyclic or acyclic form
  • Added support for ketoses named using dehydro
  • Added support for anhydro
  • Added more trivial carbohydrate names
  • Added support for sn-glycerol
  • Improved heuristics for phospho substitution
  • Added hydrazido and anilate suffixes
  • Allowed more functional class nomenclature to apply to amino acids
  • Added support for inverting CAS names with substituted functional terms e.g. Acetaldehyde, O-methyloxime
  • Double substitution of a deoxy chiral centre now uses the CIP rules to decide which substituent replaced the hydroxy group
  • Unicode right arrows, superscripts and the soft hyphen are now recognised

Making Sense of Patent Tables

Tabular data in patents is a useful source of experimental data and chemical structures. USPTO patents are available back to 1976 in formats where tables are explicitly annotated. For more recent patents these are XML tables similar in structure to what would be expected in HTML. Unfortunately, the format used from 1976-2000 is not so straightforward to interpret: naive interpretations produce output that does not at all resemble the actual table, often with chemical name fragments scattered across it:

USPTO Patent Full-Text and Image Database:




The format for these tables is briefly documented by the USPTO but the description raises as many questions as answers:

  • Columns are delimited by one or more spaces… but a cell may contain spaces!
  • An overly long cell may be split over multiple lines due to the format being limited to 80 characters per line
  • Where in the printed patent a cell spanned multiple rows it spans multiple lines in the format.

As the format is based on how the tables were printed, perfect reproduction of the semantics of these tables appears impossible, but a good approximation can be achieved.

After processing, PatFetch produces:

Much better 🙂

(the colouring of Example 22 is due to “tertbutyl” being recognised as a misspelling of “tert-butyl”)

The method broadly works by:

  • Identifying the header, body and footer
  • Producing a putative table layout
  • Splitting cells where a single space is determined to be a split point between two columns
  • Merging cells that are determined to be a continuation of a previous cell
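
To give a flavour of the layout and merging steps, the sketch below shows one naive way they could be approached. It is not PatFetch’s algorithm: the class `FixedWidthTableSketch` and both heuristics are assumptions made for illustration. Columns are taken to be the runs of character positions that are non-blank in at least one line, and a line whose first cell is empty is treated as a continuation of the previous row. Header and footer detection is not attempted.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive fixed-width table reader (illustrative heuristics only, not PatFetch). */
final class FixedWidthTableSketch {

    /** Putative layout: [start, end) ranges of positions used by at least one line. */
    static List<int[]> inferColumns(List<String> lines) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }
        boolean[] used = new boolean[width];
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) != ' ') {
                    used[i] = true;
                }
            }
        }
        List<int[]> columns = new ArrayList<int[]>();
        int start = -1;
        for (int i = 0; i <= width; i++) {
            boolean inColumn = i < width && used[i];
            if (inColumn && start < 0) {
                start = i; // a new column begins
            } else if (!inColumn && start >= 0) {
                columns.add(new int[] {start, i}); // a gap (or the end) closes the column
                start = -1;
            }
        }
        return columns;
    }

    /** Cuts one line into trimmed cells using the inferred column ranges. */
    static List<String> splitLine(String line, List<int[]> columns) {
        List<String> cells = new ArrayList<String>();
        for (int[] column : columns) {
            int from = Math.min(column[0], line.length());
            int to = Math.min(column[1], line.length());
            cells.add(line.substring(from, to).trim());
        }
        return cells;
    }

    /** Splits every line, merging lines with an empty first cell into the previous row. */
    static List<List<String>> parse(List<String> lines) {
        List<int[]> columns = inferColumns(lines);
        List<List<String>> rows = new ArrayList<List<String>>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            List<String> cells = splitLine(line, columns);
            if (!rows.isEmpty() && cells.get(0).isEmpty()) {
                List<String> previous = rows.get(rows.size() - 1);
                for (int c = 0; c < cells.size(); c++) {
                    if (!cells.get(c).isEmpty()) {
                        previous.set(c, (previous.get(c) + " " + cells.get(c)).trim());
                    }
                }
            } else {
                rows.add(cells);
            }
        }
        return rows;
    }
}
```

A heuristic this simple breaks down quickly on real patent tables, for instance when a cell in the first column is itself wrapped, or when a space inside a cell happens to line up across every row. That is exactly why deciding whether a single space is a split point needs more care.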


Identifying suspect InChIs in Wikipedia Chemboxes using Chemical Name to Structure

Wikipedia is a highly useful source of chemistry and also of chemical nomenclature. A limitation of chemical name to structure software, such as OPSIN, is that trivial names that are similar to systematic names may be misinterpreted if the program has never encountered them. The nature of Wikipedia means that the most important chemicals, and hence the most prevalent trivial names, are included, so surely Wikipedia would be a great resource to look for name to InChI relationships where the name to structure software was at fault?

I used Matthew Gamble’s code for extracting chemboxes as RDF to quickly grab the contents of all the current chemboxes. From the output of this tool it was simple to get the name/InChI pairs. As I was interested in trivial names I used the title of each Wikipedia page as the input for name to structure.
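
The comparison itself can be sketched roughly as follows. This is only an outline under stated assumptions: `ChemboxCheckSketch` and its `NameToInchi` interface are hypothetical names standing in for whichever name to structure tool is used (OPSIN could be wrapped behind it), and the input is assumed to already be a map from page title to the InChI found in the chembox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Flags pages whose chembox InChI disagrees with the InChI generated from the page title (illustrative). */
final class ChemboxCheckSketch {

    /** Hypothetical stand-in for a name to structure tool; returns null if the name cannot be interpreted. */
    interface NameToInchi {
        String convert(String name);
    }

    /** Returns the titles where a structure was generated but its InChI differs from the chembox InChI. */
    static List<String> flagMismatches(Map<String, String> titleToChemboxInchi, NameToInchi converter) {
        List<String> flagged = new ArrayList<String>();
        for (Map.Entry<String, String> entry : titleToChemboxInchi.entrySet()) {
            String generated = converter.convert(entry.getKey());
            // Names the tool cannot parse are skipped: they tell us nothing either way.
            if (generated != null && !generated.equals(entry.getValue())) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }
}
```

An exact string comparison like this will also flag differences in tautomer or stereochemistry specification, so each flagged pair still needs to be looked at by hand.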

430 cases were flagged up for a range of reasons: ring/chain tautomerism, intentionally underspecified names, under or over specification of stereochemistry and, of course, the type of error I was expecting. However, there were also a significant number of cases where the InChI clearly described a different compound. Upon investigation, for the records I’ve corrected so far, the root cause appeared to be an incorrect reference to ChemSpider, which then allows script-assisted updates to pull in inappropriate InChIs/SMILES.

Example of a previously incorrect page.

Increasing the precision of identification of these incorrect name/structural identifier pairs should be possible if the IUPAC names were used as input…