Peptide informatics at Bio-IT World

Congratulations to the Pistoia Alliance for winning this year’s Bio-IT World Best Practices Award for Informatics for the development of HELM, the Hierarchical Editing Language for Macromolecules. As NextMove’s Sugar & Splice was cited in the submission as supporting HELM, we can claim a small part in helping them achieve this recognition.

Handling and interconverting peptide representations was also the topic of Lisa Sach-Peltason’s talk, “Peptide Informatics – Bridging the gap between small-molecule and large-molecule systems”. She described the peptide registration system that Roche have developed, in which Sugar & Splice has played a major role. One of the many challenges is that although non-standard amino acids make up 7% by frequency of amino acids used by Roche, over 88% of their peptides contain at least one non-standard amino acid.


And finally, Roger presented a poster on the same topic, covering the challenges of handling peptide line notations for biologics registration and patent filings:

Qualitative structure-activity at UK-QSAR

I’ve been wondering whether my Matched Series work falls under QSAR, as it does not use a numerical model nor does it make absolute activity predictions. Everything it does is based on relative activity/property orders. So perhaps this is not a Quantitative Structure-Activity Relationship, but rather a Qualitative Structure-Activity Relationship (let’s call it QualSAR for short). Is this a useful distinction? Are there additional areas of cheminformatics that would fall under this (MMPA springs to mind)?

But until the whole QualSAR field takes off as a separate discipline, I guess I’m going to continue to file my work under QSAR. Earlier this year, I was invited to contribute a brief description of the algorithm to the current UK-QSAR newsletter (direct link here).

And last week I presented the Matsy algorithm at UK-QSAR, hosted by Eli Lilly. Here’s the poster I presented, which summarises the recent paper and also has some additional examples of the sort of predictions generated (better quality PDF available here).

Talks on non-standard peptides and normalising ELN reactions from the Dallas ACS

At the recent ACS National Meeting in Dallas, Roger presented two talks describing our recent work in the areas of biologic representation and standardization of ELN reactions.

1. Representation and display of non-standard peptides using semi-systematic amino acid monomer naming
The registration of peptide and peptide-like structures in chemical databases poses a number of technical challenges. In particular, for post-translationally modified, D-, cyclic and non-standard peptides, we propose the use of semi-systematic monomer names, based upon readily recognizable chemical line formulae, for the encoding and display of traditionally difficult to handle peptides. These rules lead to amino acid names such as N(Me)Ser(tBuOH) that are similar to those seen/used in vendor catalogs and scientific publications.

2. Standardized Representations of ELN Reactions for Categorization and Duplicate/Variation Identification
In an ELN, there is no intrinsic notion of a duplicate, and each experiment is represented by its own notebook page. Here we describe approaches and challenges to identifying reaction ‘variations’ in large reaction databases.


For more talks from NextMove Software, see our Slideshare page.

How to predict R groups that improve biological activity – Paper now out

Our latest work has just been published in the Journal of Medicinal Chemistry. It’s a collaboration with AstraZeneca Mölndal on an algorithm (which we call Matsy) that predicts R groups that improve biological activity, given some existing SAR information at the same R group position:

N.M. O’Boyle, J. Boström, R.A. Sayle, A. Gill. Using Matched Molecular Series as a Predictive Tool To Optimize Biological Activity. J. Med. Chem. 2014, In press.

Here’s a deliberately basic example from the paper. Imagine that you have synthesised and tested three alkane R groups for some scaffold and found that the pIC50s are in the following order (bigger is better): propyl > ethyl > methyl. What should you make next? Kind of obvious, you might say: a longer alkane. Using ChEMBL data and the Matsy algorithm described in the paper, we find that the most likely R group to increase activity is n-hexyl, which increased activity in 75% of the 53 cases where this order was observed.

But what about ethyl > propyl > methyl? Now there is no correlation with molecular weight. Perhaps ethyl is just the right size, while methyl is too small but propyl is too big. In any case, the question remains: what should you try next? According to Matsy, tert-butyl is most likely to increase activity, having done so in 39% of 23 cases.

To a large extent the algorithm is just doing what a medicinal chemist would; except that where a medicinal chemist would draw on previous experience or intuition, Matsy works out the answer from a database of previous work. To make up your mind about a particular prediction, you can always look at where the data came from and decide if it’s applicable in your case.
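As a rough illustration of the idea (a toy sketch, not the published Matsy implementation), the lookup can be thought of as follows: find every series in a database where the already-tested R groups show the same activity order, then count how often each other R group in those series beat the best one tested so far. The database contents, the function name `suggest_r_groups` and the ranking rule below are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy database: each entry is one series, mapping
# R group -> pIC50 measured on a shared scaffold in a single assay.
TOY_SERIES = [
    {"methyl": 5.0, "ethyl": 5.5, "propyl": 6.0, "hexyl": 6.8},
    {"methyl": 4.2, "ethyl": 4.9, "propyl": 5.1, "hexyl": 5.4, "phenyl": 5.6},
    {"methyl": 6.0, "ethyl": 5.8, "propyl": 5.5, "hexyl": 7.0},
    {"methyl": 5.1, "ethyl": 5.3, "propyl": 5.9, "phenyl": 5.2},
]


def suggest_r_groups(observed_order, database):
    """Rank untested R groups by how often they improved activity.

    observed_order lists the tested R groups from most to least active.
    Returns (r_group, times_improved, times_seen) tuples, best first.
    """
    improved = defaultdict(int)
    seen = defaultdict(int)
    for series in database:
        # The series must contain every tested R group...
        if not all(r in series for r in observed_order):
            continue
        # ...and show them in the same strictly decreasing activity order.
        acts = [series[r] for r in observed_order]
        if any(a <= b for a, b in zip(acts, acts[1:])):
            continue
        best = acts[0]
        for r_group, activity in series.items():
            if r_group in observed_order:
                continue
            seen[r_group] += 1
            if activity > best:
                improved[r_group] += 1
    ranked = [(r, improved[r], seen[r]) for r in seen]
    # Highest fraction of improvements first, then most observations.
    ranked.sort(key=lambda t: (-(t[1] / t[2]), -t[2], t[0]))
    return ranked


for r_group, n_better, n_seen in suggest_r_groups(
        ["propyl", "ethyl", "methyl"], TOY_SERIES):
    print("%s: improved activity %d of %d times" % (r_group, n_better, n_seen))
```

Because each suggestion carries the raw counts, it stays possible to go back to the contributing series and judge whether they are relevant to the case at hand.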

For more information, check out the paper (*). The talk below summarises the key points; I gave it a few weeks ago at the 1st Joint CICAG and Cambridge Cheminformatics Network Meeting (hosted by the CCDC):

See other talks from NextMove Software

* The paper will be made freely available soon. Until then, we are allowed to give 50 copies away, so if you don’t have access to the journal, email noel@nextmovesoftware.com for a copy.

Unleashing over a million reactions into the wild

Unlike with small molecules, there are currently no large sets of publicly available reaction data.

To remedy this situation, we have extracted over a million reactions from United States patent applications (2001-2013) and the same again from patent grants (1976-2013). This contrasts with the original data release of “only” 420 thousand reactions (from 2008-2011 applications), made whilst I was in the PMR group.

The reactions are available as reaction SMILES or CML from here, as 7zip archives. The CML representation includes quantities and yields where these were found. A documentation zip provides further information on the format of the data. This data is made available under CC-Zero, i.e. without copyright. [Update 24/08/2017: A newer version of the dataset described here is available on FigShare]
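For anyone who has not worked with reaction SMILES before, the general form is reactants>agents>products, with the molecules in each part separated by dots. The sketch below splits such a line using plain string handling. The function name and the example reaction are made up for illustration (the reaction is not taken from the dataset), and keeping only the first whitespace-separated field is an assumption about the file layout, so check the documentation zip for the actual columns.

```python
def split_reaction_smiles(line):
    """Split a reaction SMILES into (reactants, agents, products) lists."""
    # Assumption: anything after the first whitespace is extra metadata.
    core = line.split()[0]
    parts = core.split(">")
    if len(parts) != 3:
        raise ValueError("expected reactants>agents>products: %r" % core)
    # An empty part (e.g. no agents) becomes an empty list.
    return tuple([smi for smi in part.split(".") if smi] for part in parts)


# Illustrative acid-catalysed esterification, written by hand.
reactants, agents, products = split_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(reactants)
print(agents)
print(products)
```

A cheminformatics toolkit would be the better choice for anything beyond counting and filtering, since it can also validate and canonicalise the individual molecules.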

It is hoped that making this data resource available will facilitate analyses that require a large number of reactions.

NextMove Software is currently looking into what insights can be obtained from such data sets. For example, using our reaction classification software, we can show a broad correlation between the type of reaction and its yield, and that this trend can be reproduced from ELN data (presentation here). This is just the beginning of the sorts of analyses that can be performed with access to so many reactions. Expect to hear more at the upcoming ICCS and UK-QSAR meetings.

More information about how the reactions were extracted can be found in my PhD thesis and a presentation I gave at the ACS.

On the other other hand

My earlier “On the other hand” blog post considered some of the issues of representing D-amino acids. In this post, I discuss the representation of amino acids with sidechain stereochemistry in nomenclature and peptide registration systems. Handling of chiral sidechains is tricky, as indicated by the Pistoia Alliance’s HELM editor, which restricts the user to only 17 (of the 19) standard D-form amino acids, explicitly prohibiting the specification of D-threonine and D-isoleucine.

Threonine (Thr) and Isoleucine (Ile)

The most frequently encountered cases of sidechain stereochemistry occur in the naturally occurring amino acids threonine and isoleucine, which each contain a chiral carbon atom at their beta carbon position.

L-Thr aka (2S,3R)  PDB Code: THR  CID6288
L-Ile aka (2S,3S)  PDB Code: ILE  CID6306

By convention, the D-forms of these amino acids flip both stereocenters.

D-Thr aka (2R,3S)  PDB Code: DTH  CID69435
D-Ile aka (2R,3R)  PDB Code: DIL  CID76551

The forms of these amino acids where just the sidechain stereochemistry is inverted are referred to as “allo-” forms, allothreonine (written aThr or alloThr) and alloisoleucine (written aIle or alloIle).

L-aThr aka (2S,3S)  PDB Code: ALO  CID99289
D-aThr aka (2R,3R)  PDB Code: 2TL  CID90624
L-aIle aka (2S,3R)  PDB Code: IIL  CID99288
D-aIle aka (2R,3S)  PDB Code: ???  CID94206

Things really get interesting when stereochemistry is unspecified (either a racemate or unresolved chiral center) at either of these stereocenters. This is not uncommon when working with SMILES strings or MOL files, but almost always indicates some loss of information, as the biology/chemistry will nearly universally refer to one of the four fully specified stereoisomers above.

Perhaps the easiest case to denote is the case of unspecified tetrahedral stereochemistry at the alpha carbon position, for which the “DL-” prefix is conventionally used.

DL-Thr aka (2?,3R)  CID17757244
DL-aThr aka (2?,3S)  CID17757249
DL-Ile aka (2?,3S)  CID10396882
DL-aIle aka (2?,3R)  CID17757247

A less widely appreciated convention is the use of the Greek letter xi (ξ) in amino acid and natural product nomenclature for chiral centers of unknown configuration (3AA-4.5). Here I propose the use of the prefix “xi” or “xi-” in an identical way to “allo” or “allo-” to produce xi-threonine (xiThr) and xi-isoleucine (xiIle) when the beta carbon stereochemistry is undefined/unspecified.

L-xiThr aka (2S,3?)  CID11768555
D-xiThr aka (2R,3?)  CID6399258
DL-xiThr aka (2?,3?)  CID205
L-xiIle aka (2S,3?)  CID5351546
D-xiIle aka (2R,3?)  CID11051686
DL-xiIle aka (2?,3?)  CID791

4-Hydroxyproline, Hyp

An example of a non-natural (but frequently occurring) amino acid with sidechain stereochemistry is “4-hydroxyproline”. Here the symbol Hyp is understood to refer to the more common trans- form, so the prefix “cis” or “cis-” is used to refer to the alternate configuration, as in the symbol “cis-Hyp”.

L-Hyp aka (2S,4R)  PDB Code: HYP  CID5810
L-cisHyp aka (2S,4S)  PDB Code: HZP  CID440015
D-Hyp aka (2R,4S)  PDB Code: ???  CID440074
D-cisHyp aka (2R,4R)  PDB Code: ???  CID440014

Once again unspecified configurations at the alpha- and gamma- carbon locants of Hyp can be described by “DL-” and “xi-” prefixes as before.

DL-Hyp aka (2?,4R)  CID54196981
DL-cisHyp aka (2?,4S)  CID21353534
L-xiHyp aka (2S,4?)  CID69248
D-xiHyp aka (2R,4?)  CID5318330
DL-xiHyp aka (2?,4?)  CID825
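The conventions tabulated so far for Thr, Ile and Hyp are regular enough to capture in a few lines of code. The following is a minimal sketch, assuming CIP descriptors ("R", "S" or "?" for unspecified) are already available for the alpha carbon and the sidechain stereocenter; the function and table names are made up for illustration and have nothing to do with any particular registration system.

```python
# For each residue: the sidechain descriptor found in the standard L form,
# and the prefix used when the sidechain centre is inverted.
SIDECHAIN_RULES = {
    "Thr": ("R", "a"),    # L-Thr is (2S,3R); the inverted form is aThr
    "Ile": ("S", "a"),    # L-Ile is (2S,3S); the inverted form is aIle
    "Hyp": ("R", "cis"),  # L-Hyp is (2S,4R); the inverted form is cisHyp
}

# For these three residues the L form has an S alpha carbon.
ALPHA_PREFIX = {"S": "L-", "R": "D-", "?": "DL-"}
FLIP = {"R": "S", "S": "R"}


def monomer_symbol(code, alpha, sidechain):
    """Build a symbol such as 'D-aThr' from two stereo descriptors."""
    l_form, inverted_prefix = SIDECHAIN_RULES[code]
    prefix = ALPHA_PREFIX[alpha]
    if sidechain == "?":
        # Unknown sidechain configuration takes the xi prefix.
        return prefix + "xi" + code
    # D forms flip both centres, so the expected descriptor flips too;
    # an unspecified alpha carbon is compared against the L form,
    # matching the DL- rows in the tables above.
    expected = FLIP[l_form] if alpha == "R" else l_form
    return prefix + ("" if sidechain == expected else inverted_prefix) + code


print(monomer_symbol("Thr", "R", "R"))
print(monomer_symbol("Ile", "?", "?"))
print(monomer_symbol("Hyp", "S", "S"))
```

Keeping the per-residue rules in a table rather than in code means that further residues with a default sidechain configuration could be added without touching the function.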

Note that although a few sources refer to names such as “cis-D-Hyp”, it is more usual to order terms consistently (where possible) with the “D-”, “L-” or “DL-” prefix at the start and the “allo”, “xi”, “cis”, “nor” and “homo” prefixes adjacent to the three-letter code.

Methionine sulfoxide, Met(O)

A simpler case of sidechain stereochemistry occurs when the amino acid name doesn’t imply a default stereochemistry. In these cases, the usual Cahn, Ingold and Prelog (CIP) rules can be used to assign R and S (or E and Z) descriptors appropriately. A simple example of this is methionine sulfoxide, which is commonly represented by the symbol “Met(O)”. In this case, the sulfur atom bearing the substitution may adopt one of two configurations, requiring an “R-” or “S-” prefix to the substituent suffix.

L-Met(O)  CID158980
D-Met(O)  CID148508
DL-Met(O)  CID847
L-Met(R-O)  CID10062737
L-Met(S-O)  CID10909908
D-Met(R-O)  CID11829787
D-Met(S-O)  CID9577091
DL-Met(R-O)  CID ???
DL-Met(S-O)  CID57148329

Image credit: EmsiProduction on Flickr

Shakespeare through the eyes of a chemist

Shakespeare’s plays have been analysed in some depth, but one question remains unanswered: what were his favourite chemicals? Let’s try to infer this from the most frequently occurring chemicals in his entire oeuvre.

NextMove’s LeadMine is state-of-the-art chemical text-mining software, provided as a Java library and command-line application. I wrote a Jython script that uses the Java library to extract chemical entities from HTML files of Shakespeare’s plays as downloaded from MIT. Fortunately, LeadMine knows how to handle HTML (and XML in general – e.g. a .docx file). I also added in a custom dictionary of a couple of well-known Shakespearean phrases just to see how that worked. The code is shown down below, but first here are the results (which took 26 seconds not including download time):

gold 229
water 147
silver 71
salt 50
iron 45
metal 26
ice 18
mercury 15
diamond 14
brine 9
ides of march 7
copper 7
curan 6
sulphur 4
metals 3
quicksilver 2
to be, or not to be 1
tin 1
vile jelly 1
ees 1
triton 1
silver water 1
carbon 1

It’s interesting to look at some of the more quirky results. Curan is a minor character in King Lear, and also a natural product. Triton is a Greek god of the sea as well as being a hydrogen-3 atom. “ees” was an error in the text from “kn ees”, but probably should not have been identified in any case. Actually “carbon” is also an error in the original text, “carbon ado” instead of “carbonado”.

And here’s the code (most of which is concerned with downloading the plays):

from __future__ import with_statement  # enables "with" on Python/Jython 2.5
import os
import urllib
import com.nextmovesoftware.leadmine as lm
import com.nextmovesoftware.leadmine.fsmgenerator as fsmgen
from collections import defaultdict

def processPlay(name, counts):
    """Download a play (unless already cached) and tally the entities found."""
    print name
    filename = "%s.html" % name
    if not os.path.isfile(filename):
        urllib.urlretrieve("http://shakespeare.mit.edu/%s/full.html" % name, filename)

    with open(filename, "r") as f:
        text = f.read()
    # "engine" is the module-level ExtractEngine created in the main block
    results = engine.processString(text)
    for entity in results.entities:
        counts[entity.text.lower()] += 1

if __name__ == "__main__":
    # Custom dictionary of Shakespearean phrases, added to the default dictionaries
    mydict = fsmgen.CfxDictFromStrings.convertToCfxDictionary(
               ["to be, or not to be", "vile jelly", "ides of march"],
               "Shakey", False)

    dictionaries = lm.LeadMineConfig().dictionaries
    dictionaries.add(mydict)
    engine = lm.ExtractEngine(dictionaries)

    counts = defaultdict(int)
    if not os.path.isfile("main.html"):
        urllib.urlretrieve("http://shakespeare.mit.edu/", "main.html")
    with open("main.html") as index:
        for line in index:
            # Take the play name from links of the form <a href="NAME/index.html">,
            # skipping any line that mentions Poetry
            idx = line.find("a href")
            if idx >= 0 and line.find("Poetry") < 0:
                name = line[idx+8:line.find("index.html")-1]
                processPlay(name, counts)
    # Print the entities, most frequent first
    ans = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    for k, v in ans:
        print k, v

Image credit: tonynetone on Flickr

BioCreative announce chemical text mining competition results

NextMove recently participated in the BioCreative CHEMDNER (Chemical compound and drug name recognition) task, which involved annotating chemical mentions in PubMed abstracts. BioCreative annotated 10,000 abstracts, of which 7,000 were provided to participants for training. In mid-September, participants were asked to identify mentions in the unseen test corpus of 3,000 abstracts (which, to avoid cheating, was combined with 17,000 decoy abstracts).

In total, 27 teams (23 academic and 4 commercial) submitted results. We achieved 85.0% recall at a precision of 88.7%, giving an F-score of 86.9%. Our solution ranked amongst the best submitted, being only 0.53% from the best performing solution in the chemical entity mentions task and significantly ahead of the other commercial solutions. Inter-annotator agreement was 91%, indicating that with recent advances in machine annotation, automated systems are rapidly approaching the quality of human abstractors.
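For reference, the F-score here is taken to be the usual balanced F1, the harmonic mean of precision and recall (an assumption, but the numbers are consistent with it). A quick check with the rounded figures quoted above gives about 86.8%, so the reported 86.9% presumably reflects the unrounded precision and recall, which are in the proceedings.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded percentages as quoted in the text.
f1 = f_score(88.7, 85.0)
print(round(f1, 1))  # 86.8
```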

Participation in this competition has driven recent developments in LeadMine including improved coverage of non-systematic chemical entities and detection of abbreviations.

If you want to know the full details, our proceedings paper is available here, and you can find out how we compared in the full proceedings here (results on p14, list of teams on p31). The presentation below, which I gave at the workshop, summarises our system:

Compile once, run anywhere – Portable cheminformatics applications

Often when you install C++ software, a number of different DLLs are installed along with the principal one. For example, if you install Open Babel on Windows, as well as the Open Babel core DLL there are DLLs for cairo, png, xml2 and various plugin DLLs for Open Babel functionality. In theory this allows the user to swap in a new version of a DLL or remove particular plugins that are unnecessary. In general, however, this flexibility is not exploited and for most users this is all just clutter. A better user experience might be to present a single file containing all of the necessary functionality. This is something I’ve been working on recently for our own software.

As another step towards portable applications, we have been moving away from using MSVC to using MinGW (and now MinGW-64) to statically compile our software. With the right flags a C++ application can be created that statically includes the C++ runtime and so is no longer dependent on the infamous MSVC++ runtime DLLs (which may or may not be present), or on any Cygwin or MinGW DLLs. As a result, it will run on any Windows version from WinXP onwards.

When you figure out how to generate these statically-linked libraries and applications, they work just great. Getting to that point is a bit tricky though. Luckily for me we have some in-house expertise (“ROGER – it’s happened again!”). The following presentation, which I gave as a lightning talk at the recent RDKit UGM, gives an overview of the approach and rationale:

Competitive text-mining and molecular data exchange at ACS Indy

Two talks on very different topics were presented by NextMovers at the recent ACS National Meeting in Indianapolis.

Daniel described how LeadMine uses two-state grammars to find dictionary terms, as well as some of the clever techniques he has used to improve performance for the BioCreative CHEMDNER chemical compound text-mining challenge (results not yet announced). Separately, Roger spoke about some of the challenges in the supposedly trivial tasks of reading/writing molecular file formats (for more on the MDLValence benchmark discussed here see an earlier blog post).