See you at the ACS?

NextMovers will be present and presenting at the 248th ACS National Meeting in San Francisco, which starts this Sunday.

As well as having a booth (come by if interested in picking up an evaluation copy of Matsy), we will be giving the following presentations:

Sunday
11 – Classification, representation, and analysis of cyclic peptides and peptide-like analogs
Authors: Dr Roger A Sayle, Dr Daniel M Lowe, Dr Noel M O’Boyle
Division: CINF: Division of Chemical Information
Date/Time: Sunday, August 10, 2014 – 08:35 AM
Session Info: Computational Methods and the Development/Production of Biologics and Biosimilars (08:30 AM – 09:35 AM)
Location: Palace Hotel
Room: California Parlor

6 – Chemistry and reactions from non-US patents
Authors: Dr Daniel M Lowe, Dr Roger A Sayle
Division: CINF: Division of Chemical Information
Date/Time: Sunday, August 10, 2014 – 09:20 AM
Session Info: Hunting for Hidden Treasures: Chemistry Text Mining in Patents and Other Documents (08:40 AM – 12:00 PM)
Location: Palace Hotel
Room: Presidio

22 – Revising the Topliss decision tree based on 30 years of medicinal chemistry literature
Authors: Noel M O’Boyle, Jonas Boström, Roger A Sayle, Adrian Gill
Division: MEDI: Division of Medicinal Chemistry
Date/Time: Sunday, August 10, 2014 – 11:30 AM
Session Info: General Oral Session (08:30 AM – 12:10 PM)
Location: Moscone Center, West Bldg.
Room: 3008

Tuesday
394 – Using matched series to predict R groups that improve biological activity
Authors: Noel M O’Boyle, Jonas Boström, Roger A Sayle, Adrian Gill
Division: COMP: Division of Computers in Chemistry
Date/Time: Tuesday, August 12, 2014 – 06:00 PM
Session Info: Poster Session (06:00 PM – 08:00 PM)
Location: San Francisco Marriott Marquis
Room: Golden Gate Section A/B

A small corner of SmallWorld

SmallWorld is a graph database where the nodes are molecules or molecular fragments and where the edges are directed and point to substructures. While it is frequently stated in the literature that calculating the MCS by enumerating all possible substructures is impossible, what is really meant is that the time/space tradeoff is steep. SmallWorld is proof that, with enough disk space, the time required drops considerably.

While finding the MCS is still an NP-complete problem, precomputing canonical substructures of molecules factors out the NP-complete part, so that implementing algorithms just involves traversing the graph. For example, substructure searching means going up the graph from a single node, while finding a multi-molecule MCS means simultaneously going down the graph from several nodes.
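
As a rough illustration of the two traversals just described, here is a minimal Python sketch on a hand-built toy graph of eight alkane skeletons. The dictionary layout and the function names are assumptions made for this example and say nothing about how SmallWorld itself stores or walks its graph.

```python
from collections import defaultdict

# Toy graph: each skeleton points to the substructures obtained by
# removing one terminal atom. Hand-built for illustration only.
SUBSTRUCTURES = {
    "CCCCC": ["CCCC"],                # pentane skeleton
    "CCC(C)C": ["CCCC", "CC(C)C"],    # isopentane skeleton
    "CC(C)(C)C": ["CC(C)C"],          # neopentane skeleton
    "CCCC": ["CCC"],
    "CC(C)C": ["CCC"],
    "CCC": ["CC"],
    "CC": ["C"],
    "C": [],
}

# Reverse edges, used for walking "up" towards superstructures.
SUPERSTRUCTURES = defaultdict(list)
for parent, children in SUBSTRUCTURES.items():
    for child in children:
        SUPERSTRUCTURES[child].append(parent)


def reachable(start, edges):
    """Return every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def substructure_search(query):
    """Go up the graph from one node: everything that contains query."""
    return reachable(query, SUPERSTRUCTURES)


def multi_mcs(molecules):
    """Go down the graph from several nodes and keep the largest shared node."""
    common = None
    for mol in molecules:
        below = reachable(mol, SUBSTRUCTURES) | {mol}
        common = below if common is None else common & below
    # Size is the atom count; the string itself is only a tie-breaker.
    return max(common, key=lambda s: (s.count("C"), s))
```

With this toy data, substructure_search("CC(C)C") returns the two branched five-atom skeletons, and multi_mcs(["CCCCC", "CC(C)(C)C"]) returns "CCC". All the expensive graph matching is assumed to have happened when the dictionary was built; the searches themselves are plain graph walks.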

To mark the growth of SmallWorld to 4 billion molecular substructures, I thought it’d be interesting to take an in-depth look at a small corner: all those acyclic structures with 8 bonds (in the molecular skeleton). There are 35. Here they are arranged by the number of subgraphs each has. (Notice anything about the order?)

[Image: B8R0]

Some of these structures occur more often than others. Here is their frequency in ChEMBL molecules (only including acyclic molecules with up to 20 bonds).

[Image: B8R0_2]

What is the smallest ChEMBL molecule that is a superstructure of as many of these as possible? The following molecule, CHEMBL1992288, is a superstructure of all but one.

[Image: CHEMBL1992288]

These are just toy examples of the sorts of analyses possible. As part of a current collaboration, we are assessing how well SmallWorld compares to other methods for similarity searching. It’ll be interesting to find out.
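
The count of 35 acyclic structures with 8 bonds can be checked independently of SmallWorld. The Python sketch below assumes that an acyclic skeleton is simply an unlabelled tree in which no atom has more than four neighbours (that degree limit is an assumption, not something stated in the post), and it grows all such trees one atom at a time. The function names are made up for this example.

```python
def canonical(adj):
    """Canonical string for an unlabelled tree: the smallest nested-bracket
    encoding over every possible choice of root."""
    def encode(node, parent):
        parts = sorted(encode(nb, node) for nb in adj[node] if nb != parent)
        return "(" + "".join(parts) + ")"
    return min(encode(root, None) for root in adj)


def grow(trees, max_degree=4):
    """Attach one new atom to every atom with a free position, keeping
    one representative per distinct skeleton."""
    unique = {}
    for adj in trees:
        new_atom = len(adj)
        for atom in adj:
            if len(adj[atom]) < max_degree:
                bigger = {a: set(nbs) for a, nbs in adj.items()}
                bigger[atom].add(new_atom)
                bigger[new_atom] = {atom}
                unique.setdefault(canonical(bigger), bigger)
    return list(unique.values())


def count_acyclic_skeletons(bonds, max_degree=4):
    """Number of distinct acyclic skeletons with the given number of bonds."""
    trees = [{0: set()}]  # start from a single atom with no bonds
    for _ in range(bonds):
        trees = grow(trees, max_degree)
    return len(trees)


if __name__ == "__main__":
    print(count_acyclic_skeletons(8))
```

Under that assumption the script prints 35, which is also the number of constitutional isomers of nonane. Trying every atom as the root is wasteful but keeps the canonical form easy to trust at this size.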

Peptide informatics at Bio-IT World

Congratulations to the Pistoia Alliance for winning this year’s Bio-IT World Best Practices Award for Informatics for the development of HELM, the Hierarchical Editing Language for Macromolecules. As NextMove’s Sugar & Splice was cited in the submission as supporting HELM, we can claim a small part in helping them achieve this recognition.

Handling and interconverting peptide representations was also the topic of Lisa Sach-Peltason’s talk, “Peptide Informatics – Bridging the gap between small-molecule and large-molecule systems”. She described the peptide registration system that Roche have developed, in which Sugar & Splice has played a major role. One of the many challenges is that although non-standard amino acids make up 7% by frequency of amino acids used by Roche, over 88% of their peptides contain at least one non-standard amino acid.


And finally Roger presented a poster around the same topic, on the challenges of handling peptide line notations for biologics registration and patent filings:

Qualitative structure-activity at UK-QSAR

I’ve been wondering whether my Matched Series work falls under QSAR, as it does not use a numerical model nor does it make absolute activity predictions. Everything it does is based on relative activity/property orders. So perhaps this is not a Quantitative Structure-Activity Relationship, but rather a Qualitative Structure-Activity Relationship (let’s call it QualSAR for short). Is this a useful distinction? Are there additional areas of cheminformatics that would fall under this (MMPA springs to mind)?

But until the whole QualSAR field takes off as a separate discipline, I guess I’m going to continue to file my work under QSAR. Earlier this year, I was invited to contribute a brief description of the algorithm to the current UK-QSAR newsletter (direct link here).

And last week I presented the Matsy algorithm at UK-QSAR, hosted by Eli Lilly. Here’s the poster I presented, which summarises the recent paper and also has some additional examples of the sort of predictions generated (better quality PDF available here).

Talks on non-standard peptides and normalising ELN reactions from the Dallas ACS

At the recent ACS National Meeting in Dallas, Roger presented two talks describing our recent work in the areas of biologic representation and standardization of ELN reactions.

1. Representation and display of non-standard peptides using semi-systematic amino acid monomer naming
The registration of peptide and peptide-like structures in chemical databases poses a number of technical challenges. In particular, for post-translationally modified, D-, cyclic and non-standard peptides, we propose the use of semi-systematic monomer names, based upon readily recognizable chemical line formulae, for the encoding and display of traditionally difficult to handle peptides. These rules lead to amino acid names such as N(Me)Ser(tBuOH) that are similar to those seen/used in vendor catalogs and scientific publications.

2. Standardized Representations of ELN Reactions for Categorization and Duplicate/Variation Identification
In an ELN, there is no intrinsic notion of a duplicate, and each experiment is represented by its own notebook page. Here we describe approaches and challenges to identifying reaction ‘variations’ in large reaction databases.


For more talks from NextMove Software, see our Slideshare page.

How to predict R groups that improve biological activity – Paper now out

Our latest work has just been published in the Journal of Medicinal Chemistry. It’s a collaboration with AstraZeneca Mölndal on an algorithm (which we call Matsy) that predicts R groups that improve biological activity, given some existing SAR information at the same R group position:

N.M. O’Boyle, J. Boström, R.A. Sayle, A. Gill. Using Matched Molecular Series as a Predictive Tool To Optimize Biological Activity. J. Med. Chem. 2014, In press.

Here’s a deliberately basic example from the paper. Imagine that you have synthesised and tested three alkane R groups for some scaffold and found that the pIC50s are in the following order (bigger is better): propyl > ethyl > methyl. What should you make next? Kind of obvious, you might say: a longer alkane. Using ChEMBL data and the Matsy algorithm described in the paper, we find that the most likely R group to increase activity is n-hexyl, which increased activity in this situation 75% of 53 times.

But what about ethyl > propyl > methyl? Now there is no correlation with molecular weight. Perhaps ethyl is just the right size, while methyl is too small but propyl is too big. In any case, the question remains: what should you try next? According to Matsy, tert-butyl is most likely to increase activity, based on 39% of 23 times.

To a large extent the algorithm is just doing what a medicinal chemist would; except that where a medicinal chemist would draw on previous experience or intuition, Matsy works out the answer from a database of previous work. To make up your mind about a particular prediction, you can always look at where the data came from and decide if it’s applicable in your case.
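
The idea can be sketched in a few lines of Python 3. This is a simplified illustration, not the published Matsy algorithm: the function name, the data layout and all of the activity values below are invented for the example.

```python
from collections import defaultdict

# Invented toy data: each dict is one matched series (same scaffold,
# same assay), mapping R group name to a made-up pIC50.
TOY_SERIES = [
    {"methyl": 5.0, "ethyl": 5.5, "propyl": 6.0, "hexyl": 6.8, "tert-butyl": 5.2},
    {"methyl": 6.1, "ethyl": 6.4, "propyl": 7.0, "hexyl": 7.3},
    {"methyl": 4.0, "ethyl": 4.9, "propyl": 5.1, "hexyl": 5.0, "tert-butyl": 5.6},
    {"methyl": 5.0, "ethyl": 6.2, "propyl": 5.5, "tert-butyl": 6.9},
]


def suggest_r_groups(series_db, observed_order):
    """observed_order lists the tested R groups from most to least active.
    Returns (r_group, times_improved, times_observed) tuples, best first."""
    tally = defaultdict(lambda: [0, 0])  # r_group -> [improved, observed]
    for series in series_db:
        if not all(r in series for r in observed_order):
            continue
        acts = [series[r] for r in observed_order]
        # Keep only series showing the same strict activity order.
        if any(a <= b for a, b in zip(acts, acts[1:])):
            continue
        best = acts[0]
        for r_group, activity in series.items():
            if r_group in observed_order:
                continue
            tally[r_group][1] += 1
            if activity > best:
                tally[r_group][0] += 1
    # Rank by fraction improved, then by how often the R group was seen.
    return sorted(
        ((r, imp, obs) for r, (imp, obs) in tally.items()),
        key=lambda t: (-t[1] / t[2], -t[2], t[0]),
    )
```

With the made-up series above, suggest_r_groups(TOY_SERIES, ["propyl", "ethyl", "methyl"]) ranks hexyl first (improved in 2 of 3 series), while the order ["ethyl", "propyl", "methyl"] matches only the last series and suggests tert-butyl. Returning the raw counts alongside each suggestion keeps the evidence visible, in the same spirit as the "75% of 53 times" figures quoted earlier.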

For more information, check out the paper (*). The talk below summarises the key points; I gave it a few weeks ago at the 1st Joint CICAG and Cambridge Cheminformatics Network Meeting (hosted by the CCDC):

See other talks from NextMove Software

* The paper will be made freely available soon. Until then, we are allowed to give away 50 copies, so if you don’t have access to the journal and would like one, email noel@nextmovesoftware.com.

Shakespeare through the eyes of a chemist

Shakespeare’s plays have been analysed in some depth, but one question remains unanswered: what were his favourite chemicals? Let’s try to infer this from the most frequently occurring chemicals in his entire oeuvre.

NextMove’s LeadMine is state-of-the-art chemical text-mining software, provided as a Java library and command-line application. I wrote a Jython script that uses the Java library to extract chemical entities from HTML files of Shakespeare’s plays as downloaded from MIT. Fortunately, LeadMine knows how to handle HTML (and XML in general – e.g. a .docx file). I also added in a custom dictionary of a couple of well-known Shakespearean phrases just to see how that worked. The code is shown below, but first here are the results (which took 26 seconds not including download time):

gold 229
water 147
silver 71
salt 50
iron 45
metal 26
ice 18
mercury 15
diamond 14
brine 9
ides of march 7
copper 7
curan 6
sulphur 4
metals 3
quicksilver 2
to be, or not to be 1
tin 1
vile jelly 1
ees 1
triton 1
silver water 1
carbon 1

It’s interesting to look at some of the more quirky results. Curan is a minor character in King Lear, and also a natural product. Triton is a Greek god of the sea as well as being a hydrogen-3 atom. “ees” was an error in the text from “kn ees”, but probably should not have been identified in any case. Actually “carbon” is also an error in the original text, “carbon ado” instead of “carbonado”.

And here’s the code (most of which is concerned with downloading the plays):

from __future__ import with_statement
import os
import urllib
import com.nextmovesoftware.leadmine as lm
import com.nextmovesoftware.leadmine.fsmgenerator as fsmgen
from collections import defaultdict

def processPlay(name, engine, counts):
    # Download the full text of the play unless a local copy already exists
    print name
    filename = "%s.html" % name
    if not os.path.isfile(filename):
        urllib.urlretrieve("http://shakespeare.mit.edu/%s/full.html" % name, filename)

    with open(filename, "r") as f:
        text = f.read()
    # Count each entity found by LeadMine, ignoring case
    results = engine.processString(text)
    for entity in results.entities:
        counts[entity.text.lower()] += 1

if __name__ == "__main__":
    # Custom dictionary of well-known Shakespearean phrases
    mydict = fsmgen.CfxDictFromStrings.convertToCfxDictionary(
               ["to be, or not to be", "vile jelly", "ides of march"],
               "Shakey", False)

    # Add it to the default dictionaries and build the extraction engine
    dictionaries = lm.LeadMineConfig().dictionaries
    dictionaries.add(mydict)
    engine = lm.ExtractEngine(dictionaries)

    counts = defaultdict(int)
    if not os.path.isfile("main.html"):
        urllib.urlretrieve("http://shakespeare.mit.edu/", "main.html")
    # Each play is linked from the index page as <a href="NAME/index.html">
    with open("main.html") as index:
        for line in index:
            idx = line.find("a href")
            end = line.find("index.html")
            if idx >= 0 and end >= 0 and line.find("Poetry") < 0:
                name = line[idx+8:end-1]  # text between 'a href="' and '/index.html'
                processPlay(name, engine, counts)
    # Print the entities, most frequent first
    ans = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    for k, v in ans:
        print k, v

Image credit: tonynetone on Flickr

Compile once, run anywhere – Portable cheminformatics applications

Often when you install C++ software, a number of different DLLs are installed along with the principal one. For example, if you install Open Babel on Windows, as well as the Open Babel core DLL there are DLLs for cairo, png, xml2 and various plugin DLLs for Open Babel functionality. In theory this allows the user to swap in a new version of a DLL or remove particular plugins that are unnecessary. In general, however, this flexibility is not exploited and for most users this is all just clutter. A better user experience might be to present a single file containing all of the necessary functionality. This is something I’ve been working on recently for our own software.

As another step towards portable applications, we have been moving away from using MSVC to using MinGW (and now MinGW-64) to statically compile our software. With the right flags a C++ application can be created that statically includes the C++ runtime and so is no longer dependent on the infamous MSVC++ runtime DLLs (which may or may not be present), or on any Cygwin or MinGW DLLs. As a result, it will run on any Windows version from WinXP onwards.

When you figure out how to generate these statically-linked libraries and applications, they work just great. Getting to that point is a bit tricky though. Luckily for me we have some in-house expertise (“ROGER – it’s happened again!”). The following presentation which I gave as a lightning talk at the recent RDKit UGM gives an overview of the approach and rationale:

Competitive text-mining and molecular data exchange at ACS Indy

Two talks on very different topics were presented by NextMovers at the recent ACS National Meeting in Indianapolis.

Daniel described how LeadMine uses two-state grammars to find dictionary terms, as well as some of the clever techniques he has used to improve performance for the BioCreative CHEMDNER chemical compound text-mining challenge (results not yet announced). Separately, Roger spoke about some of the challenges in the supposedly trivial tasks of reading/writing molecular file formats (for more on the MDLValence benchmark discussed here see an earlier blog post).

Matched Molecular Reactants

What happens when you cross Matched Molecular Pair Analysis (MMPA) with reactions? Why, of course, you get a new paradigm in drug discovery, Matched Molecular Reactants!

Well, let’s think about it for a second. If you take the reactants and the products and look for matched molecular pairs combining both, what you will find are reactions that involve single R group transformations. We can call these Matched Molecular Reactants, but they are probably more commonly known as functional group transformations, e.g. -OH to -Cl.

So, what are the most common functional group transformations in a typical ELN? Well, I can’t show you that but I can show you the results when this analysis is applied to reactions in the US patent literature (this data courtesy of Daniel). The following table shows SMILES for the R group together with the observed frequency for the 15 most common transformations:

*[N+](=O)[O-] --> *N                                    21456
*C --> *[H]                                             21165
*[H] --> *C                                             15583
*CC --> *[H]                                            12914
*C(=O)OC --> *C(=O)O                                    11729
*C(=O)OC(C)(C)C --> *[H]                                 9149
*NC(=O)OC(C)(C)C --> *N                                  8054
*C(=O)OCC --> *C(=O)O                                    7673
*Cc1ccccc1 --> *[H]                                      6695
*OC --> *O                                               6339
*[H] --> *Br                                             6141
*O --> *Cl                                               4852
*OCc1ccccc1 --> *O                                       4662
*[H] --> *CC                                             3980
*C(=O)O --> *C(=O)OC                                     3888

It seems that the majority of reactions tend to make molecules smaller. If this keeps up, we’ll soon be left with nothing!
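
The claim about shrinking molecules can be checked against the 15 rows of the table. Here is a small Python sketch: it counts heavy atoms in each R group with a simple regular expression (adequate for these particular SMILES, not a general SMILES parser) and sums the frequencies by whether the R group gets smaller, bigger or stays the same size. The helper names are made up for this example.

```python
import re

# The 15 transformations and frequencies from the table above.
TABLE = """
*[N+](=O)[O-] --> *N 21456
*C --> *[H] 21165
*[H] --> *C 15583
*CC --> *[H] 12914
*C(=O)OC --> *C(=O)O 11729
*C(=O)OC(C)(C)C --> *[H] 9149
*NC(=O)OC(C)(C)C --> *N 8054
*C(=O)OCC --> *C(=O)O 7673
*Cc1ccccc1 --> *[H] 6695
*OC --> *O 6339
*[H] --> *Br 6141
*O --> *Cl 4852
*OCc1ccccc1 --> *O 4662
*[H] --> *CC 3980
*C(=O)O --> *C(=O)OC 3888
"""

# Two-letter halogens first, then one-letter organic and aromatic atoms.
# Hydrogen and the attachment point (*) are deliberately not counted.
HEAVY_ATOM = re.compile(r"Br|Cl|[BCNOPSFI]|[cnos]")


def heavy_atoms(smiles):
    """Count non-hydrogen atoms in a simple R group SMILES."""
    return len(HEAVY_ATOM.findall(smiles))


def tally_size_changes(table):
    """Sum the frequencies by whether the R group shrinks, grows or neither."""
    totals = {"smaller": 0, "bigger": 0, "same": 0}
    for line in table.strip().splitlines():
        transformation, count = line.rsplit(None, 1)
        before, after = (s.strip() for s in transformation.split("-->"))
        change = heavy_atoms(after) - heavy_atoms(before)
        key = "smaller" if change < 0 else "bigger" if change > 0 else "same"
        totals[key] += int(count)
    return totals
```

On these rows, tally_size_changes(TABLE) gives 109836 occurrences where the R group loses heavy atoms, 29592 where it gains them and 4852 where the count is unchanged, so about 76% of the top-15 occurrences do make the molecule smaller. That only covers the 15 most common transformations, not the full patent data set.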