Unleashing over a million reactions into the wild

Reaction Extraction Workflow

Unlike for small molecules, there are currently no large sets of publicly available reaction data.

To remedy this situation, we have extracted over a million reactions from United States patent applications (2001-2013) and the same again from patent grants (1976-2013). This contrasts with the original data release of “only” 420 thousand reactions (from 2008-2011 applications), made whilst I was in the PMR group.

The reactions are available as reaction SMILES or CML from here, as 7zip archives. The CML representation includes quantities and yields where these were found. A documentation zip provides further information on the format of the data. This data is made available under CC-Zero, i.e. without copyright.
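For readers who have not worked with reaction SMILES before, here is a minimal Python sketch of how one record might be split into its parts. It relies only on the general reaction SMILES convention (reactants>agents>products, with molecules separated by dots). The function name parse_reaction_smiles is hypothetical, and the assumption that any extra columns follow the reaction SMILES after whitespace is mine, not something taken from the documentation zip, so check the format notes before relying on it.

```python
def parse_reaction_smiles(line):
    """Split one 'reactants>agents>products' record into three lists of SMILES."""
    # Assumption: anything after the first run of whitespace is extra metadata.
    fields = line.strip().split(None, 1)
    if not fields:
        raise ValueError("empty line")
    parts = fields[0].split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators: %r" % fields[0])
    # Molecules within each role are dot-separated; an empty role gives an empty list.
    reactants, agents, products = (
        [smiles for smiles in part.split(".") if smiles] for part in parts
    )
    return reactants, agents, products


if __name__ == "__main__":
    # Illustrative example (not taken from the data set): acid-catalysed esterification.
    print(parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
```

Note that splitting on dots treats every dot-separated fragment as its own molecule, so a salt written as two ions will come back as two entries. For anything beyond quick counting, a cheminformatics toolkit is the safer choice.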

It is hoped that making this data resource available will facilitate analyses that require a large number of reactions.

NextMove Software is currently looking into what insights can be obtained from such data sets. For example, using our reaction classification software we can show a broad correlation between the type of reaction and its yield, and that this trend can be reproduced from ELN data (presentation here). This is just the beginning of the sorts of analyses that can be performed with access to so many reactions. Expect to hear more at the upcoming ICCS and UK-QSAR meetings.

More information about how the reactions were extracted can be found in my PhD thesis and a presentation I gave at the ACS.


3 Responses to Unleashing over a million reactions into the wild

  1. Well done! Great license, ummm, waiver! BTW, please consider uploading the data to Dryad (http://www.datadryad.org/) or FigShare (http://figshare.com/), so that the data can be more easily cited (DOI!) and the impact counted.

  2. daniel says:

    FigShare is definitely a possibility (it’s free for submitters!). I am slightly concerned that you then have the data duplicated (I used Bitbucket as it’s where the code also resides… although it’s been pointed out to me that people downloading the extracted reactions don’t actually need to run the code themselves!). Having the data on FigShare would definitely make things clearer if there hypothetically was a v2 of this data (there’s always room for improvement in chemical text mining!).

  3. Pingback: A novel procedure towards accurate estimation of room temperature utilising the patent literature | NextMove Software
