Unleashing over a million reactions into the wild

[Figure: Reaction extraction workflow]

Unlike with small molecules, there are currently no large sets of publicly available reaction data.

To remedy this situation, we have extracted over a million reactions from United States patent applications (2001-2013) and the same again from patent grants (1976-2013). This contrasts with the original data release of “only” 420 thousand reactions (from 2008-2011 applications), made whilst I was in the PMR group.

The reactions are available as reaction SMILES or CML from here, as 7zip archives. The CML representation includes quantities and yields where these were found. A documentation zip provides further information on the format of the data. This data is made available under CC-Zero, i.e. without copyright. [Update 24/08/2017: A newer version of the dataset described here is available on FigShare]
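For readers who have not worked with reaction SMILES before, here is a minimal sketch of how a single reaction string can be split into its parts. It relies only on the general reaction SMILES convention (`reactants>agents>products`, with `.` separating molecules) and does not describe the exact file layout of the archives; the function name `parse_reaction_smiles` and the example esterification are hypothetical, and any extra columns in the released files would need to be stripped first (see the documentation zip).

```python
def parse_reaction_smiles(rxn):
    """Split a reaction SMILES into reactant, agent and product lists.

    Assumes the plain 'reactants>agents>products' convention, with '.'
    separating the individual molecules in each section. Any section
    may be empty (for example, a reaction with no agents).
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators: %r" % rxn)
    # Drop empty strings so that an empty section becomes an empty list.
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Hypothetical example: acid-catalysed esterification of acetic acid with ethanol.
example = parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(example["reactants"], example["agents"], example["products"])
```

This only separates the components; it does not validate the SMILES or interpret atom maps, for which a cheminformatics toolkit would be needed.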

It is hoped that making this data resource available will facilitate analyses that require a large number of reactions.

NextMove Software is currently looking into what insights can be obtained from such data sets. For example, using our reaction classification software, we can show a broad correlation between the type of reaction and its yield, and that this trend can be reproduced from ELN data (presentation here). This is just the beginning of the sorts of analyses that can be performed with access to so many reactions. Expect to hear more at the upcoming ICCS and UK-QSAR meetings.

More information about how the reactions were extracted can be found in my PhD thesis and a presentation I gave at the ACS.