A novel procedure towards accurate estimation of room temperature utilising the patent literature

When chemists report that a reaction took place at room temperature, what exactly do they mean? Clearly the best way to approach this problem is to textmine reaction conditions from all US patent applications since 2001 and thus infer room temperature.

As previously discussed, Daniel has extracted reactions from US patents. The textmining software that Daniel has been working on, LeadMine, now has the ability to extract reaction conditions. Considering just those reactions where the temperature is explicitly given (as opposed to specified as “room temperature” or some such), the following graph is obtained (this is interactive – use the toolbar to zoom/pan; data included for the interval -273 to 800 °C):

You will immediately notice a preference among chemists for temperatures that are multiples of 5, and in particular, multiples of 10. In our determination of likely room temperature, such values are probably not useful. Once we remove them, the remaining data is as follows:

Zooming in on the 20–25 °C region, we can infer that room temperature is 23 °C or thereabouts – QED. Other peaks in the plot indicate particular reaction conditions that are common in organic chemistry: for example, both 78 °C and -78 °C are favourites (remember why?).
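The filtering step above can be sketched in a few lines of Python. This is only an illustration, not the code behind the plots: the function names, the 15–30 °C search window and the toy temperature list are all assumptions, and the temperatures are taken to be whole numbers of degrees Celsius.

```python
from collections import Counter


def drop_round_values(temps_c):
    """Discard temperatures that are multiples of 5 (this also removes multiples of 10)."""
    return [t for t in temps_c if t % 5 != 0]


def room_temperature_estimate(temps_c, low=15, high=30):
    """Return the most frequent non-round temperature within [low, high].

    The window is an assumed, hypothetical search range; returns None if
    no value survives the filtering.
    """
    counts = Counter(t for t in drop_round_values(temps_c) if low <= t <= high)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy data, not the patent-derived data set
temps = [20, 25, 23, 23, 22, 23, 80, 78, -78, 21, 22, 0, 100]
print(drop_round_values(temps))          # [23, 23, 22, 23, 78, -78, 21, 22]
print(room_temperature_estimate(temps))  # 23
```

Taking the mode rather than the mean keeps the estimate insensitive to the long tails of heated and cooled reactions.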

This analysis of temperature data was based on data presented by Daniel at the Fall ACS in San Francisco. His talk, “Chemistry and reactions from non-US patents”, covered:

  • Coverage of European vs United States patents
  • For novel compounds, which patent authority published first and how long was the lag
  • Trends in gene/protein mentions over time
  • Melting/boiling point extraction
  • Analysis of text mined reactions (yields vs scale, grouping them into synthetic routes, trends in solvent usage)

Finding the signal in activity data

There is a 2008 paper by the folks at Abbott that Roger refers to as “the paper that killed matched pairs for activity data”. This is the Hajduk and Sauer paper, which looked at matched pair transformations across 84K compounds and 30 protein targets, and found that the potency changes associated with most matched pair transformations were (nearly) normally distributed around zero.

I don’t have access to the AbbVie data, but I can do a similar analysis with ChEMBL data. For those assays in ChEMBL which have pIC50 data for ethyl, propyl and butyl as substituents at the same location in three molecules, here is a histogram of the pIC50 for the ethyl analog minus that of the butyl analog.
The resulting histogram is pretty much in agreement with what Hajduk and Sauer found: changing a butyl to an ethyl is as likely to increase activity as to decrease it. If we think about it, it’s fairly obvious why this might be – we are pooling data from different binding environments in different proteins, and the effect of the change depends on the binding environment.

So either matched pairs are dead for activity, or we have to restrict the analysis to data for a particular pocket in a particular protein.

But what if we consider additional activity data? What if we already know the relative activities of propyl and butyl, for example? Let’s say that we know in a particular case that the propyl analog has a greater pIC50 than butyl, and then plot the ΔpIC50 for ethyl minus butyl:
…or vice versa, where propyl has a smaller pIC50 than butyl:
Interesting, eh? The point is that knowing some activity information improves our predictive ability. If we know that propyl > butyl, then the chance that changing butyl to ethyl will increase the activity goes up.
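The conditioning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used for the histograms: each record is assumed to hold pIC50 values for the ethyl, propyl and butyl analogs from one assay, the function names are hypothetical, and the numbers are invented toy values rather than ChEMBL data.

```python
def delta_ethyl_butyl(series, keep=lambda s: True):
    """Return pIC50(ethyl) - pIC50(butyl) for every series accepted by `keep`."""
    return [s["ethyl"] - s["butyl"] for s in series if keep(s)]


def fraction_improved(deltas):
    """Fraction of cases where swapping butyl for ethyl raised the pIC50."""
    if not deltas:
        return None
    return sum(d > 0 for d in deltas) / len(deltas)


# Invented toy values (hypothetical), one dict per assay/scaffold
series = [
    {"ethyl": 6.5, "propyl": 6.2, "butyl": 6.0},
    {"ethyl": 7.1, "propyl": 7.0, "butyl": 6.4},
    {"ethyl": 5.0, "propyl": 5.6, "butyl": 5.3},
    {"ethyl": 5.2, "propyl": 5.5, "butyl": 6.0},
    {"ethyl": 6.0, "propyl": 6.1, "butyl": 6.6},
    {"ethyl": 7.4, "propyl": 6.8, "butyl": 7.0},
]

pooled = delta_ethyl_butyl(series)
propyl_up = delta_ethyl_butyl(series, keep=lambda s: s["propyl"] > s["butyl"])
propyl_down = delta_ethyl_butyl(series, keep=lambda s: s["propyl"] < s["butyl"])

print(fraction_improved(pooled))       # pooled: no conditioning
print(fraction_improved(propyl_up))    # given propyl > butyl
print(fraction_improved(propyl_down))  # given propyl < butyl
```

In this toy set the pooled fraction sits at one half, while conditioning on the propyl/butyl order shifts it in either direction, which mirrors the qualitative behaviour of the histograms.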

The question then arises, how best to extract and apply this information? One approach would be to throw more matched pairs at the problem. But actually, a simpler and more elegant approach is to look beyond matched pairs to the general concept of matched series.

Here are the slides I presented on “Evidence-based medicinal chemistry with matched series” at the recent UK-QSAR meeting in Cambridge.