Finding all types of every ‘mer

John Overington over at the ChEMBLog has recently discussed the task of finding redox pairs in a database. As John points out, these are neither isomers nor tautomers but are of interest in any case.

It turns out that, after a minor modification, NextMove’s Equalizer software (insert link from future here) can be used to find these. The software generates canonical “hashes” for molecules, such that different forms of the same molecule hash to the same representation [Footnote]. It’s not a new idea by any means (think of the layered structure of the InChI) but the way we do this is pretty neat as it uses canonical SMILES to do the hashing. By altering the information encoded in the SMILES, different forms of the same structure can be identified; for example mesomers, structural isomers, tautomers [1], protomers… and now redox pairs.

For redox pairs, a SMILES is generated after setting all bond orders to 1 and all atom charges to 0. If the resulting canonical SMILES is identical but the overall (original) charge is different, two structures can be considered a redox pair. Here are some examples found in ChEMBLdb 15:

[Figure: Redox pairs]
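The redox-pair test described above can be sketched in a few lines of plain Python. This is only an illustration and not how Equalizer works internally: instead of a canonical SMILES it uses a simple neighbourhood-refinement hash (Weisfeiler-Lehman style), which is independent of atom order but is not a true canonical labelling and can occasionally give two different skeletons the same hash. The molecule representation (lists of atoms and bonds, heavy atoms only), the function names and the example molecules are all assumptions made for the sketch.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _digest(text):
    """Short, deterministic label for a string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def skeleton_hash(atoms, bonds, rounds=3):
    """Hash the heavy-atom skeleton, ignoring charges and bond orders.

    atoms: list of (element, formal_charge)
    bonds: list of (atom_index, atom_index, bond_order)
    """
    neighbours = defaultdict(list)
    for i, j, _order in bonds:  # bond order deliberately dropped
        neighbours[i].append(j)
        neighbours[j].append(i)

    labels = [element for element, _charge in atoms]  # charge dropped
    for _ in range(rounds):
        # Relabel every atom from its own label plus its neighbours' labels.
        labels = [
            _digest(labels[i] + "|" + ",".join(sorted(labels[n] for n in neighbours[i])))
            for i in range(len(atoms))
        ]
    return _digest(",".join(sorted(labels)))


def find_redox_pairs(molecules):
    """Return name pairs with the same skeleton hash but different total charge."""
    groups = defaultdict(list)
    for name, (atoms, bonds) in molecules.items():
        charge = sum(c for _element, c in atoms)
        groups[skeleton_hash(atoms, bonds)].append((name, charge))

    pairs = []
    for members in groups.values():
        for (name1, charge1), (name2, charge2) in combinations(sorted(members), 2):
            if charge1 != charge2:
                pairs.append((name1, name2))
    return sorted(pairs)


# Hypothetical toy input: O=O, [O-][O] and [C-]#[O+]
EXAMPLES = {
    "dioxygen": ([("O", 0), ("O", 0)], [(0, 1, 2)]),
    "superoxide": ([("O", 0), ("O", -1)], [(0, 1, 1)]),
    "carbon monoxide": ([("C", -1), ("O", 1)], [(0, 1, 3)]),
}

if __name__ == "__main__":
    print(find_redox_pairs(EXAMPLES))
```

On this toy input, dioxygen and superoxide share a skeleton and differ in overall charge, so they are reported as a pair; carbon monoxide hashes differently and is left alone.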

[Footnote] This is similar to the classic way of finding anagrams given a dictionary of words (I read about this in Jon Bentley’s Programming Pearls). Take each word, sort its letters (this sorted word is the “hash” in this case; words with the same hash will be anagrams) and write it to an output file followed by the original word on the same line. Sort the output file, and look for adjacent duplicates in the first column. The corresponding anagrams will be in the second column.
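For the curious, here is a minimal in-memory sketch of that anagram trick. It sorts a list of (hash, word) pairs rather than an output file, and the function name and word list are just for illustration.

```python
from itertools import groupby


def find_anagrams(words):
    """Group words that share the same sorted-letter "hash"."""
    # One (hash, word) pair per word, sorted so equal hashes are adjacent.
    signed = sorted(("".join(sorted(word)), word) for word in words)

    anagram_sets = []
    for _signature, group in groupby(signed, key=lambda pair: pair[0]):
        members = [word for _sig, word in group]
        if len(members) > 1:  # a lone word has no anagram partner
            anagram_sets.append(members)
    return anagram_sets


if __name__ == "__main__":
    print(find_anagrams(["pots", "stop", "tops", "pans", "snap", "word"]))
```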

[1] RA Sayle, So you think you understand tautomerism? JCAMD, 2010, 24, 485.

Pharma’s favourite reactions

Derek Lowe of In the Pipeline has previously commented on the fact that relying on the same old reliable chemical transformations may restrict medicinal chemists to particular regions of chemical space. In order to reach beyond these, it is necessary to use reactions that are “often not so general and reliable”.

As part of a poster presentation at the recent Bio-IT World meeting in Boston, Roger has analysed the Electronic Lab Notebook of a major pharmaceutical company using our NameRXN software to extract the major classes of reaction as well as the specific top five transformations:

[Figure: NameRXN]

Update 2013/04/22: See Derek Lowe’s analysis of these results over at In the Pipeline.

The full poster explains how the data was extracted from the PKI/CambridgeSoft ELN using HazELNut, and also shows how data quality issues in the reaction entry may be flagged up by combining the NameRXN information with atom-mapping algorithms (PDF version).

Sweet! NextMove helps PubChem import UniCarbKB

[Figure: Sugar Cubes]

Last week, PubChem announced the following:

Structures from UniCarbKB are now available in PubChem. UniCarbKB is an initiative to create an online information storage and search platform for glycomics and glycobiology research. UniCarbKB curates information from the scientific literature on glycoprotein derived glycan structures, building on previous curation efforts by GlycoSuiteDB and continuing the principles of EUROCarbDB to provide an open-access platform for glycoinformatics…

NextMove Software has been collaborating with PubChem on the development of Sugar & Splice, software to handle and interconvert representations of peptides, oligonucleotides and oligosaccharides. The import of UniCarbKB into PubChem is the first large application of the software and requires the conversion to SMILES strings of IUPAC condensed line notation for oligosaccharides (this looks like Fuc(a1-4)GlcNAc).
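To give a feel for the notation, here is a rough sketch that splits a linear condensed string into residues and linkages. It is not how Sugar & Splice does it: it handles only unbranched strings with fully specified linkages, makes no attempt to build a SMILES, and the function name and the tuple layout are invented for the example. In a linkage such as (a1-4), the letter is the anomeric configuration, the first number is the position on the residue to the left and the second the position on the residue to the right.

```python
import re

# A linkage such as "(a1-4)": anomer, position on the left-hand residue,
# position on the right-hand residue.
LINKAGE = re.compile(r"\(([ab])(\d)-(\d)\)")


def parse_linear_glycan(text):
    """Split e.g. 'Fuc(a1-4)GlcNAc' into residue names and linkages."""
    if "[" in text or "]" in text:
        raise ValueError("branched structures are not handled by this sketch")

    # re.split keeps the captured groups, so the result alternates:
    # residue, anomer, pos, pos, residue, anomer, pos, pos, residue, ...
    parts = LINKAGE.split(text)
    residues = parts[0::4]
    if any(not name or "(" in name or ")" in name for name in residues):
        raise ValueError("could not parse: " + text)

    linkages = []
    for k in range(len(residues) - 1):
        anomer, left_pos, right_pos = parts[4 * k + 1 : 4 * k + 4]
        linkages.append(
            (residues[k], anomer, int(left_pos), int(right_pos), residues[k + 1])
        )
    return residues, linkages


if __name__ == "__main__":
    print(parse_linear_glycan("Fuc(a1-4)GlcNAc"))
```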

A nice validation of the software was that it was able to identify a small number of duplicates in the database, as well as some errors in the exported structures (e.g. mismatched brackets). These have been reported back to the database.
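The mismatched-bracket type of error is easy to illustrate. The following is a generic stack-based check, offered only as a sketch of the idea rather than the validation the software actually performs; the function name is made up.

```python
def brackets_balanced(text):
    """Return True if every ( and [ in text is closed in the right order."""
    closing_to_opening = {")": "(", "]": "["}
    stack = []
    for ch in text:
        if ch in "([":
            stack.append(ch)
        elif ch in closing_to_opening:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    return not stack  # leftover openers mean something was never closed


if __name__ == "__main__":
    for name in ["Fuc(a1-4)[Gal(b1-3)]GlcNAc", "Fuc(a1-4]GlcNAc"]:
        print(name, brackets_balanced(name))
```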

Image credit: Michael Allen Smith (on Flickr)

Handling biologics: A file format problem?

The increasing importance of biological therapeutics, or biologics, to the pharmaceutical industry is well known. For example, data from show that of the top 15 best-selling therapies in the US in Q4 2012, six were biologics. Monoclonal antibodies are a typical example; these are glycoproteins, consisting of short oligosaccharides attached to a multi-chain polypeptide.

It is clear that handling such molecules requires a different approach from that taken for small molecules. For example, here is an all-atom depiction of the peptide crambin:
No – it’s not a cyclic peptide. It just happens to have three disulfide bridges. A more useful depiction can be generated if we follow the IUPAC or FDA guidelines for peptide depiction; here the primary structure is much clearer as is the presence of the disulfide bonds:

[Figure: FDA Style]

However, to create these sorts of depictions, and otherwise handle biopolymers more appropriately, we need to know the polymer structure.

Some consider this a file format problem. File formats that have been developed to store or represent biopolymer structures include the CHUCKLES and CHORTLES languages from Chiron and Daylight, HELM (Hierarchical Editing Language for Macromolecules) from Pfizer, Protein Line Notation from Biochemfusion and SCSR (Self-Contained Sequence Representation, an MDL V3000 extension) from Accelrys. Naturally, Wiswesser Line Notation has also been extended to handle this problem.

In particular, the HELM format has recently received support from the Pistoia Alliance. See for example this post on the Pistoia blog which describes how HELM “gives us a single consistent way to describe macromolecules which can be used across industry and academia” so that “researchers do not have to spend time creating their own notations”.

But is a new file format the best way to achieve this goal? (I can’t resist inserting the xkcd comic on standards at this point 🙂 )

From xkcd, the web comic:

While NextMove’s software for handling biopolymers, Sugar & Splice, will handle popular file formats such as HELM, I will describe a different view of the problem in the follow-up blog post.

NextMove Software at Bio-IT World

[Update 2013/04/22: View the poster online here] From Tuesday to Thursday next week, Roger will be attending Bio-IT World 2013 in Boston.

He will be presenting a poster on Extraction, analysis, atom mapping, classification and naming of reactions from Pharmaceutical ELNs. This is an application of the HazELNut software and associated utilities. One of the interesting results is the list of top 5 reactions in the ELN of a major pharmaceutical company. The following image gives an overview of the process (click for larger):

[Figure: HazELNut Poster]

If you’re attending Bio-IT World and want to meet Roger to discuss this or anything else, drop him a line at

NextMove Software at the New Orleans ACS

From Sun to Thurs next week, I’ll be attending the 245th ACS National Meeting in New Orleans. You’ll find me hanging around the CINF sessions, not least because I’ll be presenting some recent work there.

In particular, I’ll be talking about Roundtripping between small-molecule and biopolymer representations on the Tuesday (3:10pm 9th April, Room 349), which looks at the challenges I’ve encountered in the development of NextMove Software’s Sugar & Splice software. This software can be used to perceive, depict and interconvert between various biopolymer representations, and currently supports peptides, nucleotides and sugars (and mixtures thereof, e.g. glycoproteins).

If you’re interested in meeting up to discuss this or anything else, drop me a line at

Here’s the abstract. Slides will follow after the event:

Roundtripping between small-molecule and biopolymer representations

Noel M. O’Boyle,1 Evan Bolton,2 Roger A. Sayle1

1 NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge,
2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda MD 20894, USA

Existing cheminformatics toolkits provide a mature set of tools to handle small-molecule data, from generating depictions, to creating and reading linear representations (such as SMILES and InChI). However, such tools do not translate well to the domain of biopolymers where the key information is the identity of the repeating unit and the nature of the connections between them. For example, a typical all-atom 2D depiction of all but the smallest protein or oligosaccharide obscures this key structural information.

We describe a suite of tools which allow seamless interconversion between appropriate structure representations for small molecules and biopolymers (with a focus on polypeptides and oligosaccharides). For example:
SMILES: OC[C@H]1O[C@@H](O[C@@H]2[C@@H](CO)OC([C@@H]([C@H]2O)NC(=O)C)O)[C@@H]([C@H]([C@H]1O)O[C@@]1(C[C@H](O)[C@H]([C@@H](O1)[C@@H]([C@@H](CO)O)O)NC(=O)C)C(=O)O)O
Shortened IUPAC format: NeuAc(a2-3)Gal(b1-4)GlcNAc

I will discuss the challenge of supporting a variety of biopolymer representations, handling chemically-modified structures, and handling biopolymers with unknown attachment points (e.g. from mass spectrometry).