Unlike with small molecules, there are currently no large sets of publicly available reaction data.
To remedy this situation, we have extracted over a million reactions from United States patent applications (2001-2013) and the same again from patent grants (1976-2013). This contrasts with the original data release of “only” 420 thousand reactions (from 2008-2011 applications) made whilst I was in the PMR group.
The reactions are available as reaction SMILES or CML from here, as 7zip archives. The CML representation includes quantities and yields where these were found. A documentation zip provides further information on the format of the data. This data is made available under CC-Zero i.e. without copyright. [Update 24/08/2017: A newer version of the dataset described here is available on FigShare]
It is hoped that making this data resource available will facilitate analyses that require a large number of reactions.
NextMove Software is currently looking into what insights can be obtained from such data sets. For example, using our reaction classification software, we can show a broad correlation between the type of reaction and its yield, and that this trend can be reproduced from ELN data (presentation here). This is just the beginning of the sorts of analyses that can be performed with access to so many reactions. Expect to hear more at the upcoming ICCS and UK-QSAR meetings.
More information about how the reactions were extracted can be found in my PhD thesis and a presentation I gave at the ACS.
NextMove recently participated in the BioCreative CHEMDNER (Chemical compound and drug name recognition) task. This task involved annotating chemical mentions in PubMed abstracts. BioCreative annotated 10,000 abstracts, of which 7,000 were provided to participants for training. In mid-September participants were asked to identify mentions in the unseen test corpus of 3,000 abstracts (which, to avoid cheating, was combined with 17,000 decoy abstracts).
In total 27 teams (23 academic and 4 commercial) submitted results. We achieved 85.0% recall at a precision of 88.7%, giving an F-score of 86.9%. Our solution ranked amongst the best submitted, being only 0.53% from the best performing solution in the chemical entity mentions task and significantly ahead of the other commercial solutions. Inter-annotator agreement was 91%, indicating that with recent advances in machine annotation, automated systems are rapidly approaching the quality of human abstractors.
Participation in this competition has driven recent developments in LeadMine including improved coverage of non-systematic chemical entities and detection of abbreviations.
If you want to know the full details, our proceedings paper is available here, and you can see how we compared in the full proceedings here (results on p14, list of teams on p31). The presentation below, which I gave at the workshop, summarises our system:
Java 7u6 last year brought with it a change to the implementation of String#substring. In previous versions of Java, Strings created by substring shared the same char array as the original String, with an internal offset being used to make sure the correct characters were retrieved. This has the advantage of making substring a very cheap operation, O(1), but has the potential to create a significant memory leak: if a substring is taken of a long String, the char array of the long String cannot be garbage collected whilst the substring remains accessible. Java 7u6 (and later) changes this and instead returns truly independent Strings… but this requires copying part of the char array, meaning substring is now an O(n) operation, and hence repeatedly taking long substrings should be avoided.
A case where this behaviour can occur is in a tokenizer that after recognizing each token redefines the remaining String using substring. The longer the String in question the greater the effect on performance.
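The pattern described above can be sketched as follows. This is a minimal illustration, not OPSIN's actual parser: the class name `SubstringTokenizer` and the choice of splitting on single spaces are assumptions made purely for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the problematic pattern; not taken from OPSIN.
public class SubstringTokenizer {

    /** Splits the input on spaces by repeatedly shrinking the remaining String. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String remaining = input;
        while (!remaining.isEmpty()) {
            int end = remaining.indexOf(' ');
            if (end < 0) {
                end = remaining.length();
            }
            if (end > 0) {
                tokens.add(remaining.substring(0, end));
            }
            // On Java 7u6 and later this call copies every character that is
            // left, so the total copying grows quadratically with input length.
            remaining = remaining.substring(Math.min(end + 1, remaining.length()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("2-chloro ethyl acetate"));
    }
}
```

Each pass through the loop re-creates `remaining`, which was effectively free when substrings shared the parent's char array but now costs a copy proportional to the text still to be tokenized.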
This behaviour is found in OPSIN’s parser, accounting for a ~13% decrease in performance when moving from JDK6 to JDK7.
Resolving this performance regression can be tackled in at least two ways: implementing a cheap substring operator using a decorating class (an example), or not using substring at all and instead keeping track of an index in the string from which to read. The first approach is hampered by String being final, so a decorating class must instead implement CharSequence, which is far less frequently used. Hence for OPSIN I chose the approach of keeping track of the index tokenization had reached:
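The index-tracking approach can be sketched roughly as below. This is not OPSIN's code: the class name `IndexTokenizer` and the space-delimited grammar are assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of index-based tokenization; not taken from OPSIN.
public class IndexTokenizer {

    /** Splits the input on spaces while only advancing an index into it. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int length = input.length();
        int pos = 0; // the index tokenization has reached
        while (pos < length) {
            int end = input.indexOf(' ', pos);
            if (end < 0) {
                end = length;
            }
            if (end > pos) {
                // substring is still used here, but it copies only the token,
                // never the rest of the input.
                tokens.add(input.substring(pos, end));
            }
            pos = end + 1;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("2-chloro ethyl acetate"));
    }
}
```

The original String is never re-created, so the only copying left is one copy per captured token.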
Substrings are still used to capture the tokens which may explain why the performance is still a bit slower.
More details on the substring implementation change:
I’m pleased to announce the release of OPSIN 1.4.0. This new release brings significant improvements to OPSIN’s coverage of carbohydrate nomenclature. It also complements NextMove Software’s Sugar & Splice project that aims to make the conversion between carbohydrate and small molecule representations effortless.
Below is the effect this improvement to OPSIN has had on the conversion of IUPAC names in ChEBI. (This is one of the data sets used in the OPSIN publication [free access])
Tabular data in patents is a useful source of experimental data and chemical structures. USPTO patents are available back to 1976 in formats where tables are explicitly annotated. For more recent patents these are XML tables similar in structure to what would be expected in HTML. Unfortunately the format used from 1976-2000 is not quite so straightforward to interpret, and naive interpretations produce output that does not at all resemble the actual table, often with chemical name fragments scattered:
Wikipedia is a highly useful source of chemistry and also of chemical nomenclature. A limitation of chemical name to structure software, such as OPSIN, is that trivial names that are similar to systematic names may be misinterpreted if the program has never encountered those trivial names. The nature of Wikipedia means that the most important chemicals, and hence the most prevalent trivial names, are included, so surely Wikipedia would be a great resource to look for name to InChI relationships where the name to structure software was at fault?
I used Matthew Gamble’s code for extracting chemboxes as RDF to quickly grab the contents of all the current chemboxes. From the output of this tool it was simple to get the name/InChI pairs. As I was interested in trivial names I used the title of each Wikipedia page as the input for name to structure.
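The comparison step can be sketched as below. This is only an illustration under stated assumptions: the class `InchiMismatchFinder` is hypothetical, the name to structure program is represented by a generic `Function<String, String>` rather than any real library call, and flagging on exact InChI string inequality is an assumption about how pairs were compared.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; the converter is a stand-in for name to structure software.
public class InchiMismatchFinder {

    /**
     * Returns the page titles whose generated InChI differs from the InChI
     * recorded in the chembox. Titles the converter cannot interpret
     * (null result) are skipped rather than flagged.
     */
    public static List<String> findMismatches(Map<String, String> titleToInchi,
                                              Function<String, String> nameToInchi) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, String> entry : titleToInchi.entrySet()) {
            String generated = nameToInchi.apply(entry.getKey());
            if (generated != null && !generated.equals(entry.getValue())) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Toy chembox data: the "ethanol" record deliberately carries the wrong InChI.
        Map<String, String> chemboxes = new LinkedHashMap<>();
        chemboxes.put("methane", "InChI=1S/CH4/h1H4");
        chemboxes.put("ethanol", "InChI=1S/H2O/h1H2");

        // Stub converter standing in for the real name to structure program.
        Map<String, String> stub = new HashMap<>();
        stub.put("methane", "InChI=1S/CH4/h1H4");
        stub.put("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3");

        System.out.println(findMismatches(chemboxes, stub::get));
    }
}
```

A flagged title only says the two sources disagree; deciding whether the name to structure software or the Wikipedia record is at fault still needs manual inspection.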
430 cases were flagged up for a range of reasons: ring/chain tautomerism, intentionally underspecified names, under or over specification of stereochemistry and, of course, the type of error I was expecting. However, there were also a significant number of cases where the InChI clearly described a different compound. Upon investigation, for the records I’ve corrected so far, the root cause appeared to be an incorrect reference to ChemSpider. This then allows script assisted updates to pull in inappropriate InChIs/SMILES.