Unlike with small molecules, there are currently no large sets of publicly available reaction data.
To remedy this situation, we have extracted over a million reactions from United States patent applications (2001-2013) and the same again from patent grants (1976-2013). This contrasts with the original data release of “only” 420 thousand reactions (from 2008-2011 applications), made whilst I was in the PMR group.
The reactions are available as reaction SMILES or CML from here, as 7zip archives. The CML representation includes quantities and yields where these were found. A documentation zip provides further information on the format of the data. This data is made available under CC-Zero, i.e. without copyright. [Update 24/08/2017: A newer version of the dataset described here is available on FigShare]
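For anyone who has not worked with reaction SMILES before, here is a minimal sketch of how a single reaction string can be pulled apart. It relies only on the general reaction SMILES convention (`reactants>agents>products`, with `.` separating molecules); the function name and the example reaction are my own illustrations and are not taken from the dataset. The released files may carry extra columns alongside each reaction, so check the documentation zip for the actual layout.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of molecule SMILES.

    Hypothetical helper for illustration; it does no chemical validation.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got {len(parts) - 1}")
    # '.' separates molecules within each role; an empty role gives [].
    reactants, agents, products = (
        [s for s in part.split(".") if s] for part in parts
    )
    return Reaction(reactants, agents, products)


# Illustrative esterification, not a record from the patent data.
example = parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(example.reactants, example.agents, example.products)
```

For anything beyond simple splitting (canonicalisation, atom maps, validation), a cheminformatics toolkit is the better choice.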
It is hoped that making this data resource available will facilitate analyses that require a large number of reactions.
NextMove Software is currently looking into what insights can be obtained from such data sets. For example, using our reaction classification software, we can show a broad correlation between the type of reaction and its yield, and that this trend could be reproduced from ELN data (presentation here). This is just the beginning of the sorts of analyses that can be performed with access to so many reactions. Expect to hear more at the upcoming ICCS and UK-QSAR meetings.
More information about how the reactions were extracted can be found in my PhD thesis and a presentation I gave at the ACS.
Well done! Great license, ummm, waiver! BTW, please consider uploading the data to Dryad (http://www.datadryad.org/) or FigShare (http://figshare.com/), so that the data can be more easily cited (DOI!) and the impact counted.
FigShare is definitely a possibility (it’s free for submitters!). I am slightly concerned that you then have the data duplicated (I used Bitbucket as it’s where the code also resides… although it’s been pointed out to me that people downloading the extracted reactions don’t actually need to run the code themselves!) Having the data on FigShare would definitely make things clearer if there hypothetically was a v2 of this data (there’s always room for improvement in chemical text mining!)