Assembling a large data set for melting point prediction: Text-mining to the rescue!

[Image: Gallenkamp melting point apparatus]

As part of a project initiated by Tony Williams and the Royal Society of Chemistry, I have been working with Igor Tetko to text-mine melting and decomposition point data from the US patent literature so that he could then produce a melting point prediction model. This model showed an improvement over previous models, which is likely due to the overwhelmingly large size of the dataset compared to the smaller curated data sets used by those models.

The results of this work have now been published in the Journal of Cheminformatics here: “The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from Patents”.

From the text-mining side this involved identifying compounds, melting and decomposition points, performing the association between them, and then normalizing the representation of the melting points (e.g. “182-4°C” means the same as “182 to 184°C”). Values that were likely to be typos in the original text were also flagged.
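To give a flavour of the normalization step, here is a minimal Python sketch (not the actual code used for the paper, just an illustration) of how a shorthand range such as “182-4°C” might be expanded into an explicit low/high pair, with inverted ranges flagged as likely typos:

```python
import re

# Matches "182-4°C", "182-184°C" or "182 to 184°C" style melting point ranges.
MP_RANGE = re.compile(r"(?P<lo>\d+(?:\.\d+)?)\s*(?:-|–|to)\s*(?P<hi>\d+(?:\.\d+)?)\s*°?C")

def normalize_range(text):
    """Return (low, high) in °C, expanding truncated upper bounds like '182-4'."""
    m = MP_RANGE.search(text)
    if m is None:
        return None
    lo, hi = m.group("lo"), m.group("hi")
    # "182-4" abbreviates the upper value: copy the missing leading digits from the lower one.
    if len(hi) < len(lo) and "." not in hi:
        hi = lo[: len(lo) - len(hi)] + hi
    low, high = float(lo), float(hi)
    if high < low:  # flag likely typos such as "184-172°C"
        raise ValueError(f"suspicious range in {text!r}")
    return low, high

print(normalize_range("182-4°C"))       # (182.0, 184.0)
print(normalize_range("182 to 184°C"))  # (182.0, 184.0)
```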

As mentioned in the paper, the resultant set of 100,000s of melting points is available as an SDF file from Figshare, while the model Igor developed is available from OCHEM.

Image credit: Iain George on Flickr (CC-BY-SA)

How the AUC of a ROC curve is like the Journal Impact Factor

The Journal Impact Factor (or JIF) is the mean number of citations to articles published in a journal in the previous 2 years. Now, the mean is often a good measure of the average, but not always. To decide whether it’s a good measure, it is often sufficient to look at a histogram of the data. The image above, from a blogpost by Steve Royle, shows the citation data for Nature. It is exactly as you would expect: a large number of papers have a small number of citations, while a small number of papers have a large number of citations. In other words, it is exactly the sort of curve for which the mean does not provide any meaningful (an ironic pun) result.

Why? Well, it’s the long tail that really kills it (although we could also talk about how skewed the distribution is). Take 101 papers, 100 of which have 1 citation while one has 100. What’s the mean? About 2.0. If that one paper had 1000 citations instead, the mean jumps to about 10.9. The mean is heavily influenced by outliers, and here the long tail provides lots of these. For this reason, the mean does not give any useful measure of the average number of citations, as it is simply pulled up and down by whichever papers were cited the most.
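Here is the arithmetic from that toy example worked through in Python (the comparison with the median is just an aside, to show a measure of the average that the outlier leaves untouched):

```python
from statistics import mean, median

# 101 papers: 100 with a single citation, plus one outlier with 100
citations = [1] * 100 + [100]
print(round(mean(citations), 2), median(citations))   # 1.98 1

# The same set, but with the outlier at 1000 citations
citations = [1] * 100 + [1000]
print(round(mean(citations), 2), median(citations))   # 10.89 1
```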

So what’s the link to the AUC of a ROC curve in a typical virtual screening experiment? The AUC has a linear dependence on the mean rank of the actives (see the BEDROC paper), and guess what, that distribution looks very similar to that for citations. For any virtual screening method that is better than random, most of the actives are clustered at the top of the ranked list, while any active that is not recognised by the method floats at random among the inactives. So the AUC is at best a measure of the rank of the actives not recognised by the method, and at worst a random value.
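To make the linear relationship concrete, here is a small simulation (the counts and score distributions are made up purely for illustration, with higher score meaning more likely active) comparing scikit-learn’s ROC AUC with the value computed directly from the mean rank of the actives:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_actives, n_inactives = 50, 5000
N = n_actives + n_inactives
y = np.array([1] * n_actives + [0] * n_inactives)
# Most actives score near the top; missed actives mingle with the inactives.
scores = np.concatenate([rng.normal(1.5, 1.0, n_actives),
                         rng.normal(0.0, 1.0, n_inactives)])

auc = roc_auc_score(y, scores)

# Rank 1 = best-scored compound; compute the mean rank of the actives.
order = np.argsort(-scores)
ranks = np.empty(N)
ranks[order] = np.arange(1, N + 1)
mean_rank_actives = ranks[y == 1].mean()

# AUC = 1 - (mean rank of actives - (n_actives + 1)/2) / n_inactives
auc_from_rank = 1 - (mean_rank_actives - (n_actives + 1) / 2) / n_inactives
print(round(auc, 4), round(auc_from_rank, 4))  # identical up to tie handling
```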

Naturally, the AUC is the most widely used measure of ranking performance in the field.