Assembling a large data set for melting point prediction: Text-mining to the rescue!

As part of a project initiated by Tony Williams and the Royal Society of Chemistry, I have been working with Igor Tetko to text-mine melting and decomposition point data from the US patent literature so that he could then produce a melting point prediction model. This model showed an improvement over previous models, which is likely due to the overwhelmingly large size of the data set compared with the smaller curated data sets used by those models.

The results of this work have now been published in the Journal of Cheminformatics: The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from Patents

From the text-mining side, this involved identifying compounds and their melting and decomposition points, associating each value with the right compound, and then normalizing the representation of the melting points (e.g. “182-4°C” means the same as “182 to 184°C”). Values that were likely to be typos in the original text were also flagged.
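To give a flavour of what the normalization step involves, here is a minimal Python sketch that expands an abbreviated range such as “182-4°C” into an explicit pair of values. It is only an illustration and not the code used in this work: the function name normalize_melting_point, the regular expression, and the rule for expanding a shortened upper bound are all my own assumptions, and it deliberately ignores negative temperatures, other units, and decomposition annotations.

```python
import re

# Hypothetical pattern (an assumption, not the pattern used in the paper):
# a number, optionally a range separator and a second number, then "°C".
_MP_PATTERN = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|–|to)\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*°\s*C"
)


def normalize_melting_point(text):
    """Return (low, high) in degrees Celsius, or None if nothing matches.

    A single value such as "95°C" is returned as (95.0, 95.0).
    Negative temperatures are not handled in this sketch.
    """
    match = _MP_PATTERN.search(text)
    if match is None:
        return None

    low_text = match.group("low")
    high_text = match.group("high")
    low = float(low_text)
    if high_text is None:
        return (low, low)

    # Expand an abbreviated upper bound: if it has fewer integer digits
    # than the lower bound, borrow the missing leading digits from it,
    # so "182-4" becomes "182" and "184".
    low_int = low_text.split(".")[0]
    high_int = high_text.split(".")[0]
    if len(high_int) < len(low_int):
        high_text = low_int[: len(low_int) - len(high_int)] + high_text

    return (low, float(high_text))


if __name__ == "__main__":
    for example in ["182-4°C", "182 to 184°C", "mp 95 °C"]:
        print(example, "->", normalize_melting_point(example))
```

With this approach both “182-4°C” and “182 to 184°C” end up as the same pair of numbers, which is the property the normalization needs. A real implementation would have to cope with far more variation than this.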

As mentioned in the paper, the resulting set of hundreds of thousands of melting points is available as an SDF file from Figshare, while the model Igor developed is available from OCHEM.

Image credit: Iain George on Flickr (CC-BY-SA)