Big Data, Big Bad Data

Chemical data is increasing exponentially and soon we will be engulfed in an unmanageable torrent of molecules that threatens to overwhelm our computers and our very sanity. Or maybe not.

At yesterday’s RSC CICAG meeting, “From Big Data to Chemical Information”, in London, I presented a talk entitled:
100 million compounds, 100K protein structures, 2 million reactions, 4 million journal articles, 20 million patents and 15 billion substructures: Is 20TB really Big Data?

According to Wikipedia, the term “Big Data” refers to data sets so large or complex that traditional data processing applications are inadequate. Turning this on its head, it implies that without sufficiently efficient algorithms and tools, essentially any dataset could be Big Data. This statement is not entirely facetious: many cheminformatics algorithms work fine for tens of thousands of molecular structures but cannot handle a PubChem-sized dataset in a reasonable length of time.

I discussed the approach used by NextMove Software to develop highly performant tools to tackle a variety of cheminformatics problems on large datasets. These problems include calculating the maximum common subgraph, substructure searching, identifying matched series in activity datasets, canonicalising protein structures, and extracting and naming reactions from patents and the literature. In particular, I described for the first time improvements in substructure searching (200 times faster than the state of the art) and canonicalisation (up to 200 times faster than the state of the art) developed in the last few months by Roger and John. More details to follow…


Upcoming webinar on matched series

[Update 23/04/2015 – A recording of this webinar as well as the slides are now available online.]

Next Tuesday (14 April) I’ll be presenting a webinar entitled “Beyond Matched Pairs: Applying Matsy to predict new optimisation strategies”. The webinar will be hosted by our collaborators at Optibrium who have incorporated the Matsy algorithm into StarDrop (see earlier post). Here’s the abstract:

Join our webinar with Noel O’Boyle of NextMove Software to learn how matched series analysis can predict new chemical substitutions that are most likely to improve target activity for your projects.

The Matsy™ algorithm for matched molecular series analysis grew out of a collaboration with computational chemists at AstraZeneca with the goal of supporting lead optimisation projects. Specifically, it was designed to answer the question, “What compound should I make next?”.

Matsy has been developed to generate and search in-house or public domain databases of matched molecular series to identify chemical substitutions that are most likely to improve target activity (J. Med. Chem., 2014, 57(6), pp 2704–2713). This goes beyond conventional ‘matched molecular pair analysis’ by using data from longer series of matched compounds (and not just pairs) to make more relevant predictions for a particular chemical series of interest. In addition, all predictions are backed by experimental results which can be viewed and assessed by the medicinal chemist when considering the predictions.

Matsy is applied in StarDrop’s Nova™ module, which automatically generates new compound structures to stimulate the search for optimisation strategies related to initial hit or lead compounds. StarDrop’s unique capabilities for multi-parameter optimisation and predictive modelling will enable efficient prioritisation of the resulting ideas to identify high quality compounds with the best chance of success.

I’ll be focusing on the science behind the algorithm itself rather than the specifics of its integration into StarDrop, so even if you’re not currently an Optibrium customer you may find this of interest.

To register, just click here.

Paper on Reaction Fingerprints now out

Congrats to our collaborators Nadine and Greg at Novartis who have just published a paper on using reaction fingerprints to classify and measure similarity of reactions:

Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity
Nadine Schneider, Daniel M. Lowe, Roger A. Sayle, and Gregory A. Landrum
J. Chem. Inf. Model., 2015, 55, 39

Two of the names on the author list are our very own Daniel and Roger, who provided the initial dataset used as a basis for training the reaction fingerprints, and developed the NameRxn software which names and classifies reactions according to the RSC’s Reaction Name Ontology.

I think this paper is the first to cite a post from this blog, and so of course it deserves a special mention… :-)

And just a note on the reaction below which is taken from Figure 4 in the paper; this is now classified more specifically as an Iodo N-methylation by NameRxn rather than an Iodo N-alkylation:
…and rumours of Roger’s move to Novartis have been greatly exaggerated. 🙂

Looking for leads in a 2D activity matrix

There is a nice dataset in the supporting information for Pickett et al. that illustrates how the sort order of rows and columns in an activity heatmap can hinder/help the identification of gaps which should be filled.

First of all, here is the heatmap in question, a 50×50 array. It shows activity values for a series of analogs with the same scaffold but 50 different R groups at R1 and R2. Black squares are gaps where activity information isn’t available, white squares indicate inactive molecules, while the remaining colours indicate levels of activity from highly active (green) to lowly active (red).
The heatmap is depicted as shown in Figure 3 (top) in the paper where the rows (and separately the columns) are sorted by the most active molecule in the row. This has the effect of clustering the green squares in the bottom left of the array, and would suggest that the molecules in grid positions (2, 1) and (4, 1) should be tested next.

However, I would suggest trying (11, 2) and (20, 2) instead, and would say that the molecule corresponding to (4, 1) in particular is very unlikely to be worth pursuing.

How so? Well, each row (and separately each column) can be considered a matched series; that is, the entire molecule is unchanged apart from a single R group. With this in mind, the sort order of the 2D array should be based on properties of these series rather than of any individual molecules in them, and especially not the extreme value which is unlikely to be representative of the row/column as a whole, and may indeed be a fluctuation due to experimental error.

A simple way to do this is to choose the row (and separately the column) with the largest number of filled boxes, and measure the average relative shift (that is, the average deviation) of every other row against it. If the sort order is based on this shift, the following heatmap is obtained:

This image has a much clearer band structure. The two gaps at (2, 2) and (5, 2) are much better candidates than those inferred from the original heatmap. In particular, the gap at (4, 1) in the original is now at (44, 3) and can be seen, in context, to be a poor choice despite the green box in the same column.
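The reordering just described can be sketched in a few lines of Python (the function name and the None-for-gap convention are my own; ties and rows with no overlap are handled arbitrarily here):

```python
def sort_rows_by_shift(matrix):
    """Order the rows of an activity matrix by their average shift
    relative to a reference row, chosen as the row with the fewest
    gaps. Gaps are None; returns the sorted row indices."""
    filled = [sum(v is not None for v in row) for row in matrix]
    ref = filled.index(max(filled))

    def shift(i):
        # Average deviation from the reference row over the columns
        # where both rows have a measurement.
        diffs = [a - b for a, b in zip(matrix[i], matrix[ref])
                 if a is not None and b is not None]
        return sum(diffs) / len(diffs) if diffs else float("inf")

    return sorted(range(len(matrix)), key=shift)
```

Sorting the columns works the same way, applied to the transposed matrix.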

If you are interested in more information on tools for SAR transfer, get in touch…

Coming soon: Matsy in StarDrop

For the last few months, we have been working together with Optibrium to integrate Matsy into their StarDrop platform for lead optimisation, the result of which will be released later this year:

Optibrium™ and NextMove Software, developers of software and chemoinformatics solutions for drug discovery, today announced an agreement to collaborate on the integration of NextMove Software’s Matsy technology with Optibrium’s StarDrop software suite. This combination will help to guide scientists’ optimisation strategies to quickly identify compounds with a high chance of success for their drug discovery projects.

The Matsy algorithm has been developed by NextMove Software to generate and search databases of matched molecular series to identify chemical substitutions that are most likely to improve target activity (J. Med. Chem., 2014, 57(6), pp 2704–2713). This goes beyond conventional ‘matched molecular pair analysis’ by using data from longer series of matched compounds (and not just pairs) to make more relevant predictions for a particular chemical series of interest. As part of the collaboration with Optibrium, Matsy will be applied in StarDrop’s Nova™ module, which automatically generates new compound structures to stimulate the search for optimisation strategies related to initial hit or lead compounds. StarDrop’s unique capabilities for multi-parameter optimisation and predictive modelling will enable efficient prioritisation of the resulting ideas to identify high quality compounds with the best chance of success.

For further information, see the full press release.

Introducing new formats for handling macromolecules: SMILES and InChI

SMILES and InChI are two line notations widely used for handling small molecules, but how well do they perform for macromolecules? As a starting point, we take the hypothesis that such molecules “are too large and ungainly to represent atom-by-atom”. Let’s test this hypothesis!

So, can we generate canonical representations of macromolecules using the existing widely used line notations SMILES and InChI, or do we need to come up with a whole new ‘standard’? Our dataset is the SwissProt database of protein sequences, excluding those with ambiguous residues (X, B, Z, or J); in short, a total of 452737 proteins.

For the conversion to InChI, we can use Open Babel. Since InChI has (by design) a limit of 1024 input atoms, we modified the code to extend this limit as far as we easily could, allowing it to handle structures of up to 32766 atoms (99.4% of cases). For the conversion to canonical SMILES, we used our own Sugar & Splice.

In the following plots, the green dots indicate canonical SMILES while the blue dots indicate InChI. First, a scatterplot of the timings, followed by a zoomed-in view. The point in the top right is the largest protein in the database, TITIN_MOUSE, with 35213 amino acids and 312675 atoms, which took 334s to generate a canonical SMILES string (of length 654K). The longest sequence handled by the modified InChI code was UTP10_KLULA, with 1774 amino acids and 28509 atoms, which took 73.2s to generate an InChI (of length 117K).
The following graphs show a different view of the results, and indicate that the majority of the proteins are handled quickly: 96% within 10s for the InChI and 99% within 0.2s for the SMILES.
What does this mean for the use of SMILES and InChI for macromolecules? Well, I think it shows that performance is not a problem, if that is what is meant by “ungainly” in the original hypothesis. That’s not to say that all aspects of handling macromolecules are supported by SMILES or InChI. For example, the presence of ambiguity, variable attachments or variable composition are out of scope (although ChemAxon’s extended SMILES syntax may be able to handle some of these). But the size of these molecules is not in itself a problem (though InChI performance could still be improved).
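As a back-of-the-envelope check on why InChI’s default 1024-atom limit is so restrictive for proteins, here is a heavy-atom count per residue (the lookup table and helper function are my own illustration; hydrogens would add substantially to these counts):

```python
# Heavy atoms contributed by each residue within a peptide chain
# (backbone N, CA, C, O plus the side chain); the free C-terminus
# contributes one extra oxygen.
RESIDUE_HEAVY_ATOMS = {
    'G': 4, 'A': 5, 'S': 6, 'C': 6, 'P': 7, 'T': 7, 'V': 7,
    'L': 8, 'I': 8, 'N': 8, 'D': 8, 'M': 8,
    'Q': 9, 'E': 9, 'K': 9, 'H': 10, 'F': 11, 'R': 11,
    'Y': 12, 'W': 14,
}

def heavy_atom_count(sequence):
    """Heavy-atom count of a peptide from its one-letter sequence."""
    return sum(RESIDUE_HEAVY_ATOMS[aa] for aa in sequence) + 1  # + C-terminal O

# A 140-residue protein containing each amino acid seven times is
# already well past 1024 heavy atoms, and most proteins are longer.
print(heavy_atom_count('ACDEFGHIKLMNPQRSTVWY' * 7))  # → 1170
```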

The above is taken from a talk that Roger gave at the recent InChI for Large Molecule Meeting, hosted by the NCBI:

A novel procedure towards accurate estimation of room temperature utilising the patent literature

When chemists report that a reaction took place at room temperature, what exactly do they mean? Clearly the best way to approach this problem is to textmine reaction conditions from all US patent applications since 2001 and thus infer room temperature.

As previously discussed, Daniel has extracted reactions from US patents. The textmining software that Daniel has been working on, LeadMine, now has the ability to extract reaction conditions. Considering just those reactions where the temperature is explicitly given (as opposed to specified as “room temperature” or some such), the following graph is obtained (this is interactive – use the toolbar to zoom/pan; data included for the interval -273 to 800 °C):

You will immediately notice a preference among chemists for temperatures that are multiples of 5, and in particular, multiples of 10. In our determination of likely room temperature, such values are probably not useful. Once we remove them, the remaining data is as follows:

If you zoom in around the 20-25 degree area, we can infer that room temperature is 23°C or thereabouts – QED. Other peaks in the plot indicate particular reaction conditions that are common in organic chemistry: for example, both 78°C and -78°C are favourites (remember why?).
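The filtering step can be sketched as follows (the toy data and function name are invented for illustration; the real analysis of course runs over the full set of mined values):

```python
from collections import Counter

def inferred_room_temperature(temps):
    """Mode of the reported temperatures after discarding the 'round'
    values (multiples of 5 °C) that chemists favour when rounding.
    `temps` is a list of integer temperatures in °C."""
    counts = Counter(t for t in temps if t % 5 != 0)
    return counts.most_common(1)[0][0]

# toy data: round numbers dominate, but 23 is the commonest non-round value
temps = [25] * 50 + [20] * 30 + [0] * 40 + [23] * 12 + [22] * 5 + [78] * 3
print(inferred_room_temperature(temps))  # → 23
```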

This analysis is based on data presented by Daniel at the Fall ACS in San Francisco. His talk, “Chemistry and reactions from non-US patents”, covered:

  • Coverage of European vs United States patents
  • For novel compounds, which patent authority published first and how long was the lag
  • Trends in gene/protein mentions over time
  • Melting/boiling point extraction
  • Analysis of text mined reactions (yields vs scale, grouping them into synthetic routes, trends in solvent usage)


Finding the signal in activity data

There is a 2008 paper by the folks at Abbott that Roger refers to as “the paper that killed matched pairs for activity data”. This is the Hajduk and Sauer paper which looked at matched pair transformations across 84K compounds and 30 protein targets, and found that the potency changes associated with most matched pairs transformations were (nearly) normally distributed around zero.

I don’t have access to the Abbott data, but I can do a similar analysis with ChEMBL data. For those assays in ChEMBL which have pIC50 data for ethyl, propyl and butyl as substituents at the same location in three molecules, here is a histogram of the pIC50 for the ethyl analog minus that of the butyl analog.
The resulting histogram is pretty much in agreement with what Hajduk found: changing a butyl to an ethyl is as likely to increase activity as to decrease it. If we think about it, it’s fairly obvious why this might be: we are pooling data from different binding environments in different proteins, and the effect of the change depends on the binding environment.

So matched pairs are dead for activity. Or we have to restrict the analysis to data just for a particular pocket in a particular protein.

But what if we consider additional activity data? What if we already know the relative activities of propyl and butyl, for example? Let’s say that we know in a particular case that the propyl analog has a greater pIC50 than the butyl, and then plot the ΔpIC50 for ethyl minus butyl:
…or vice versa, where propyl has a smaller pIC50 than butyl:
Interesting, eh? The point is that knowing some activity information improves our predictive ability. If we know that propyl > butyl, then it increases the chance that changing butyl to ethyl will increase the activity.
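The conditioning behind the two plots can be sketched like this (the function name and tuple layout are my own):

```python
def split_by_prior(triples):
    """Partition the ethyl-minus-butyl pIC50 differences by what is
    already known about propyl vs butyl. `triples` is a list of
    (ethyl, propyl, butyl) pIC50 values; returns the two conditional
    distributions of delta = ethyl - butyl."""
    propyl_better, butyl_better = [], []
    for ethyl, propyl, butyl in triples:
        delta = ethyl - butyl
        (propyl_better if propyl > butyl else butyl_better).append(delta)
    return propyl_better, butyl_better
```

Histogramming each returned list separately reproduces the two conditional plots; the fraction of positive deltas in each is the number that changes once the prior is taken into account.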

The question then arises, how best to extract and apply this information? One approach would be to throw more matched pairs at the problem. But actually, a simpler and more elegant approach is to look beyond matched pairs to the general concept of matched series.

Here are the slides I presented on “Evidence-based medicinal chemistry with matched series” at the recent UK-QSAR meeting in Cambridge.

Using experimental data to update the Topliss Tree

A recurring task in medicinal chemistry is the optimization of R groups around a ring in order to improve biological activity. In a landmark paper in 1972, John Topliss (then at Schering) described a scheme for deciding what substituted phenyl to make next based on the relative potencies of the compounds made so far. This has become known as the Topliss Tree.

To begin with, Topliss suggested making the unsubstituted phenyl and the 4-chlorophenyl. Depending on their relative potencies, he went on to suggest either 4-methoxyphenyl or 3,4-dichlorophenyl. Based on the relative potency of that, the tree continued downwards. An abbreviated version of the tree is shown below.

Topliss Tree

So how did Topliss come up with his suggestions? He based it all on the Hansch parameters (σ and π) for different phenyl substituents, and the inferred relationship (either positive or negative) between the Hansch parameters and the potency. (This is less complicated than it sounds – it’s like identifying that potency is increasing with electron-withdrawing strength and so deciding to try a stronger EWG.)
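The abbreviated branch described above amounts to a lookup from an observed potency order to the next substituent to try. A toy encoding (the keys and function name are mine, and only the edges mentioned in this post are included):

```python
# Topliss's suggestions, keyed by the observed potency order.
TOPLISS_NEXT = {
    '4-Cl > H': '3,4-diCl',   # chlorine helped: push sigma/pi further
    'H > 4-Cl': '4-OMe',      # chlorine hurt: reverse direction
    '4-Cl > 3,4-diCl': '4-CF3',
}

def suggest_next(observed_order):
    """Return the Topliss suggestion for an observed potency order,
    or None if the order isn't in this abbreviated tree."""
    return TOPLISS_NEXT.get(observed_order)

print(suggest_next('H > 4-Cl'))  # → 4-OMe
```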

Today we have access to large amounts of experimental data on potency orders (for example, in ChEMBL), and so we can check whether that data agrees with the suggestions of the Topliss Tree. Back in March, we published a paper describing a method that uses a database of experimental results to suggest which R group would increase activity based on an observed activity order. This is exactly the scenario described by Topliss, and so we can come at the same question from a completely different angle. I briefly addressed this question in the paper, but I returned to it in more detail (and with more experimental data) for a talk I gave to MEDI at the recent ACS meeting in San Francisco: Revising the Topliss decision tree based on 30 years of medicinal chemistry literature [PDF]

The conclusions were that in the main the data in ChEMBL agreed with Topliss. However, particular points of disagreement were the suggestion of 4-OMe instead of 4-OH (if H > 4-Cl), and the suggestion of 4-CF3 (if 4-Cl > 3,4-diCl). This then raises the question, what would we recommend instead?…enter the Matsy Tree:

Matsy Tree

Since we can generate a similar tree for any situation or data, we can limit the data to particular targets (e.g. kinases) or apply ligand-efficiency rules to the predictions. For more details see the talk above.

If you are interested in an evaluation version of the software used here, Matsy, send an email to matsy@nextmovesoftware.com.

What is a med chemist’s favourite phenyl substituent?

In the course of preparing a talk for the recent ACS meeting (more on this later), I thought it would be interesting to give an overview of the ChEMBL data on substituted phenyls. What I did was take all those matched series* with associated IC50 data containing 4 or more phenyl substituents, and then count the frequency of each particular phenyl.

In other words, when a medicinal chemist was trying to optimize the substituents around a phenyl ring, which were the most frequent groups tested?

The most popular substituted phenyls

The order of popularity at the 4 position is OMe > Cl > F > Me, while at the 2 and 3 positions it’s Cl > OMe > F > Me. For these groups, in general the corresponding frequencies are in the order 4 >> 3 > 2. It would be interesting to know whether this corresponds to the ease of synthesis of these groups (in the general case) or whether other factors are at play.
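The counting itself is simple; here is a sketch (the function name and toy data are mine; the real input is the set of ChEMBL matched series described above):

```python
from collections import Counter

def substituent_frequencies(series_list):
    """Count how often each phenyl substituent occurs across matched
    series, keeping only series with four or more members (mirroring
    the filter described above). Each series is a list of labels."""
    counts = Counter()
    for series in series_list:
        if len(series) >= 4:
            counts.update(series)
    return counts

series_list = [
    ['4-OMe', '4-Cl', '4-F', '4-Me', '3-Cl'],
    ['4-OMe', '4-Cl', '2-Cl'],            # fewer than four members: skipped
    ['4-OMe', '4-Cl', '4-F', '3-OMe'],
]
print(substituent_frequencies(series_list).most_common(3))
```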

In response to a query about whether the preferences have changed over time, I’ve generated the following image (click for bigger) that provides this information for the period 1990-2013 (the x-axis). The y-axis shows frequencies divided by the total number of substituted phenyls that year.

Changes in frequencies over time
It’s a bit hard to draw any conclusions, but possibly 4-nitrile is becoming more popular, along with 3-F, while 2-NO2 and 2,3,4-OMe are going down.

*A matched (molecular) series is a series of analogs with the same scaffold but different R groups (all at the same position). In this context, each matched series contains only molecules from the same assay and paper.