Java 6 vs Java 7: When implementation matters

Java 7u6 last year brought with it a change to the implementation of String#substring. In previous versions of Java, Strings created by substring shared the same char array as the original String, with an internal offset used to make sure the correct characters were retrieved. This has the advantage of making substring a very cheap operation, O(1), but has the potential to create a significant memory leak: if a substring is taken of a long String, the char array of the long String cannot be garbage collected for as long as the substring remains accessible. Java 7u6 (and later) changes this and instead returns truly independent Strings… but this requires copying part of the char array, meaning substring is now an O(n) operation, and hence repeatedly taking long substrings should be avoided.

A case where this behaviour can occur is in a tokenizer that, after recognizing each token, redefines the remaining String using substring. The longer the String in question, the greater the effect on performance.

This behaviour is found in OPSIN’s parser, which accounts for a ~13% decrease in performance when moving from JDK6 to JDK7.

This performance regression can be tackled in at least two ways: implementing a cheap substring operator using a decorating class (an example), or not using substring at all and instead keeping track of an index in the string from which to read. The first approach is hampered by String being final, so a decorating class must instead implement CharSequence, which is far less frequently used. Hence for OPSIN I chose the approach of keeping track of the index tokenization had reached:

[Figure: Comparison of performance between JDK6 and JDK7]

Substrings are still used to capture the tokens which may explain why the performance is still a bit slower.
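
The index-tracking idea can be sketched as follows. This is only an illustration in Python, not OPSIN's actual Java code; the token pattern and the `tokenize` function are hypothetical. Python slices copy characters, much as substring does in Java 7u6 and later, so the same trick applies: match at a moving offset rather than re-slicing the remaining input.

```python
import re

# Hypothetical token pattern for illustration only (not OPSIN's grammar):
# a run of letters, a run of digits, or any single other character.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|.", re.DOTALL)


def tokenize(text):
    """Split text into tokens while tracking an index into the original string.

    The remaining input is never copied; only the matched tokens are
    extracted as new strings.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        # Pattern.match accepts a start position, so no slice of the
        # remaining input is created.
        m = TOKEN.match(text, pos)
        tokens.append(m.group())
        pos = m.end()
    return tokens
```

Each token is still copied out of the input, which corresponds to the remaining use of substring mentioned above.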


Accessing SMILES atom order

In the course of my work, I sometimes have to search the dustier corners of cheminformatics toolkits to find features which are seldom used and may be undocumented. One example of this is how to relate the atoms of a toolkit molecule to their order in an output SMILES string. The various toolkits that I use allow one to do this, but the exact method is somewhat different in each case.

Open Babel stores it in a property of a molecule which you can access after writing a SMILES string. The value returned is a string containing the atom indices separated by spaces. This must be parsed before it can be used as a lookup:

OBPairData *pd = (OBPairData*) mol->GetData("SMILES Atom Order");
std::string atomOrder = pd->GetValue();
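
The parsing step itself is small. Here is a minimal sketch in Python that works on the returned string only, so it does not depend on Open Babel; the function names and the example value are made up for illustration.

```python
def parse_atom_order(value):
    """Convert a space-separated string of atom indices into a list of ints."""
    return [int(tok) for tok in value.split()]


def smiles_position_lookup(order):
    """Map each atom index to its zero-based position in the SMILES output."""
    return {atom_idx: pos for pos, atom_idx in enumerate(order)}


# Hypothetical property value, as it might be returned for a 4-atom molecule.
order = parse_atom_order("3 1 2 4")
lookup = smiles_position_lookup(order)
```

With this lookup, `lookup[3]` gives the position in the SMILES string of the atom whose toolkit index is 3.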

RDKit also does something similar but it returns the desired vector of atom indices directly:

// The property is set once a SMILES string has been written for mol
std::vector<unsigned int> atomOrder;
mol->getProp("_smilesAtomOutputOrder", atomOrder);

In contrast, OEChem fills the atom order information into a data structure that you (optionally) provide when calling the function to create a SMILES string. To get the atom order as indices you need to remember the atom order of the current atoms, and then iterate over the data structure accessing the second item of the pair, and looking up the corresponding index.

typedef std::pair<const OEChem::OEAtomBase*, const OEChem::OEAtomBase*> AtomPair;
size_t count = mol->GetMaxAtomIdx();
std::vector<AtomPair> atmord(count);  // a vector avoids the unmatched malloc
OEChem::OECreateSmiString(smiles, *mol, OEChem::OESMILESFlag::AtomStereo | OEChem::OESMILESFlag::BondStereo, atmord.data());

Could the real explicit hydrogens please stand up?

The OpenSMILES specification led by Craig James of eMolecules tries to iron out the handling of various corner cases in SMILES. Furthermore it describes best practices when writing SMILES so as to aid ease of interpretation. The goal of all of this is to avoid loss or corruption of chemical information when exchanged as SMILES between different software.

One issue which was not addressed, and indeed perhaps does not come under the OpenSMILES remit, is the question of how many hydrogens should be explicitly created in the internal representation of a molecule read from a SMILES string.

Which hydrogens are explicit?

The SMILES strings “C” and “C([H])([H])([H])[H]” both represent methane, but the first is typically converted into a single carbon atom with four implicit hydrogens, while the second is converted to a carbon atom with four explicit hydrogens attached. With Open Babel, we can see this difference by calling NumAtoms() on the molecule:

>>> pybel.readstring("smi", "C").OBMol.NumAtoms()
1
>>> pybel.readstring("smi", "C([H])([H])([H])[H]").OBMol.NumAtoms()
5

So far so good. Most toolkits would show the same behaviour.

However, if instead we take the SMILES string “[CH3][C@@H](Br)Cl”, Daylight and ChemAxon will have 4 atoms, OEChem will have 5 (the stereo hydrogen is added), and Open Babel will have 8 (all the hydrogens mentioned in a square bracket are added).

Why it matters

It shouldn’t really matter how a molecule is stored internally in a toolkit except perhaps for performance. But wouldn’t you know, there is a common case where it does matter. If you use SMARTS for substructure searching/matching, then the distinction between explicit and implicit hydrogens is going to affect your results.

You see, there are various SMARTS terms that distinguish between implicit and explicit hydrogens. I don’t think this was one of Daylight’s finest moments; in general SMARTS expressions correspond to molecular substructures, except for the crazy terms that match substructures of particular internal representations. The offending terms are:

h<n>   Atom has <n> implicit hydrogens attached
D<n>   Atom has <n> explicit connections

In short, if you use these terms to match against molecules read from SMILES strings, you will get different results with different toolkits because of the reasons discussed above. The OpenSMARTS draft specification (led by Tim Vandermeersch) also makes this point.

To avoid this problem, you should only use ‘portable terms’ in SMARTS expressions (i.e. ones which do not depend on whether atoms are explicit or implicit). Going forward, it would be useful for all toolkits to agree to adopt Daylight’s approach on the internal representation resulting from reading SMILES.

Handling biologics: A perception problem?

In an earlier post, I described the importance of knowing the biopolymer structure when handling biologics. I also discussed various file formats that have been proposed to address this.

But rather than regarding this as a file format problem, why not consider it instead a perception problem? If we can perceive the biopolymer structure from an all-atom representation, then interconverting between any file format (whether one of the proposed biopolymer formats or existing all-atom representations such as SMILES) is straightforward. Can it be done? Well, that’s exactly what PDB file writers do; they perceive the amino acid sequence from the all-atom structure and fill in the relevant columns in the PDB file.

There are several benefits to this approach. To begin with, it avoids the cost associated with a new registry system based on a macromolecular file format. There are no problems with new and unusual monomers; these will be faithfully stored in the all-atom representation. The de-facto standards for chemical information interchange, SMILES and MOL files, can be used as always for exchange of data. Tools for small-molecule analysis (e.g. SMARTS searching) can be combined with analyses based on biopolymer structure (e.g. HELM depiction, Smith-Waterman searching). And finally it’s worth considering that it may be difficult to migrate at a later date if a registry system is based on a particular file format.

What I’ve described here and in the previous post is the introduction from my ACS presentation on Roundtripping between small-molecule and biopolymer representations. This describes the development of the Sugar & Splice software for handling oligopeptides, oligonucleotides and oligosaccharides (including modified residues and mixtures of different biopolymers). Note that the presentation is somewhat sugar-centric; for more info on the peptide and nucleotide side of things see Roger’s Spring 2012 ACS presentation.

(For more presentations from NextMove Software, see our SlideShare page.)

Finding all types of every ‘mer

John Overington over at the ChEMBLog has recently discussed the task of finding redox pairs in a database. As John points out, these are neither isomers nor tautomers but are of interest in any case.

It turns out that after a minor modification, NextMove’s Equalizer software (insert link from future here) can be used to find these. This software can be used to generate canonical “hashes” for molecules which cause different forms of the same molecule to hash to the same representation [Footnote]. It’s not a new idea by any means (think of the layered structure of the InChI) but the way we do this is pretty neat as it uses canonical SMILES to do the hashing. By altering the information encoded in the SMILES different forms of the same structure can be identified; for example mesomers, structural isomers, tautomers [1], protomers…and now redox pairs.

For redox pairs, a SMILES is generated after setting all bond orders to 1 and all atom charges to 0. If the resulting canonical SMILES is identical but the overall (original) charge is different, two structures can be considered a redox pair. Here are some examples found in ChEMBLdb 15:

[Figure: Redox pairs]
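
The comparison step can be sketched in a few lines of Python. This is not the Equalizer implementation; it assumes that some toolkit has already produced the "skeleton" canonical SMILES (all bond orders set to 1, all atom charges set to 0) and the original overall charge for each structure. The `find_redox_pairs` name and the record layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_redox_pairs(records):
    """Return pairs of identifiers that share a skeleton but differ in charge.

    records: iterable of (identifier, skeleton_smiles, total_charge), where
    skeleton_smiles is assumed to be a canonical SMILES written after
    setting all bond orders to 1 and all atom charges to 0.
    """
    groups = defaultdict(list)
    for identifier, skeleton, charge in records:
        groups[skeleton].append((identifier, charge))

    pairs = []
    for members in groups.values():
        for (id_a, q_a), (id_b, q_b) in combinations(members, 2):
            if q_a != q_b:  # same skeleton, different overall charge
                pairs.append((id_a, id_b))
    return pairs
```

Structures with the same skeleton and the same charge are not reported, since those would be isomers or other forms rather than redox pairs.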

[Footnote] This is similar to the classic way of finding anagrams given a dictionary of words (I read about this in Jon Bentley’s Programming Pearls). Take each word, sort its letters (this sorted word is the “hash” in this case; words with the same hash will be anagrams) and write it to an output file followed by the original word on the same line. Sort the output file, and look for adjacent duplicates in the first column. The corresponding anagrams will be in the second column.

[1] RA Sayle, So you think you understand tautomerism? JCAMD, 2010, 24, 485.
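
For the curious, the anagram procedure described in the footnote can be sketched in Python as below. The sorting and the scan for adjacent duplicates are done in memory rather than through an output file, and the `find_anagrams` name is my own.

```python
from itertools import groupby


def find_anagrams(words):
    """Group words that are anagrams of each other.

    Each word is paired with its sorted letters (the "hash"), the pairs
    are sorted, and runs of adjacent identical hashes are collected.
    """
    keyed = sorted(("".join(sorted(word)), word) for word in words)
    result = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        members = [word for _, word in group]
        if len(members) > 1:
            result.append(members)
    return result
```

The molecular case has the same shape: replace the sorted letters with the canonical SMILES of the altered structure.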

Pharma’s favourite reactions

Derek Lowe of In the Pipeline has previously commented on the fact that relying on the same old reliable chemical transformations may restrict medicinal chemists to particular regions of chemical space. In order to reach beyond these, it is necessary to use reactions that are “often not so general and reliable”.

As part of a poster presentation at the recent Bio-IT World meeting in Boston, Roger has analysed the Electronic Lab Notebook of a major pharmaceutical company using our NameRXN software to extract the major classes of reaction as well as the specific top five transformations:

[Figure: NameRXN]

Update 2013/04/22: See Derek Lowe’s analysis of these results over at In the Pipeline.

The full poster explains how the data was extracted from the PKI/CambridgeSoft ELN using HazELNut, and also shows how data quality issues in the reaction entry may be flagged up by combining the NameRXN information with atom-mapping algorithms (PDF version):

Sweet! NextMove helps PubChem import UniCarbKB

[Image: Sugar Cubes]

Last week, PubChem announced the following:

Structures from UniCarbKB are now available in PubChem. UniCarbKB is an initiative to create an online information storage and search platform for glycomics and glycobiology research. UniCarbKB curates information from the scientific literature on glycoprotein derived glycan structures, building on previous curation efforts by GlycoSuiteDB and continuing the principles of EUROCarbDB to provide an open-access platform for glycoinformatics…

NextMove Software has been collaborating with PubChem on the development of Sugar & Splice, software to handle and interconvert representations of peptides, oligonucleotides and oligosaccharides. The import of UniCarbKB into PubChem is the first large application of the software and requires the conversion to SMILES strings of IUPAC condensed line notation for oligosaccharides (this looks like Fuc(a1-4)GlcNAc).
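
To give a feel for the notation, here is a small Python sketch that splits an unbranched condensed string into residues and linkages. It is not how Sugar & Splice works and it does not generate SMILES; the `parse_linear_glycan` function is hypothetical and assumes linkages of the form (a1-4) or (b1-4) with single-digit positions and no branching or unknown positions.

```python
import re

# Assumed linkage syntax: anomer (a or b), then two single-digit positions.
LINKAGE = re.compile(r"\(([ab])(\d)-(\d)\)")


def parse_linear_glycan(text):
    """Return (residues, linkages) for an unbranched condensed string."""
    # Splitting on a pattern with three groups yields
    # [residue, anomer, from, to, residue, anomer, from, to, residue, ...]
    parts = LINKAGE.split(text)
    residues = parts[0::4]
    linkages = [
        (anomer, int(src), int(dst))
        for anomer, src, dst in zip(parts[1::4], parts[2::4], parts[3::4])
    ]
    # Leftover brackets mean the input used syntax this sketch does not cover.
    if any(not r or set(r) & set("()[]") for r in residues):
        raise ValueError("unsupported or malformed notation: %r" % text)
    return residues, linkages
```

For example, `parse_linear_glycan("Fuc(a1-4)GlcNAc")` gives the two residue names and a single alpha 1-4 linkage, while a string with mismatched brackets raises an error.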

A nice validation of the software was that it was able to identify a small number of duplicates in the database, as well as some errors in the exported structures (e.g. mismatched brackets). These have been reported back to the database.

Image credit: Michael Allen Smith (on Flickr)

Handling biologics: A file format problem?

The increasing importance of biological therapeutics, or biologics, to the pharmaceutical industry is well-known. For example, data from show that of the top 15 best selling therapies in the US in Q4 2012, six were biologics. Monoclonal antibodies are a typical example; these are glycoproteins, consisting of short oligosaccharides attached to a multi-chain polypeptide.

It is clear that handling such molecules requires a different approach than that taken for small molecules. For example, here is an all-atom depiction of the peptide crambin:

No – it’s not a cyclic peptide. It just happens to have three disulfide bridges. A more useful depiction can be generated if we follow the IUPAC or FDA guidelines for peptide depiction; here the primary structure is much clearer, as is the presence of the disulfide bonds:

FDA Style

However, to create these sorts of depictions, and otherwise handle biopolymers more appropriately, we need to know the polymer structure.

Some consider this a file format problem. Some file formats which have been developed to store or represent biopolymer structures include the CHUCKLES and CHORTLES languages from Chiron and Daylight, HELM (Hierarchical Editing Language for Macromolecules) from Pfizer, Protein Line Notation from Biochemfusion and SCSR (Self-Contained Sequence Representation, an MDL V3000 extension) from Accelrys. Naturally, Wiswesser Line Notation has also been extended to handle this problem.

In particular, the HELM format has recently received support from the Pistoia Alliance. See for example this post on the Pistoia blog which describes how HELM “gives us a single consistent way to describe macromolecules which can be used across industry and academia” so that “researchers do not have to spend time creating their own notations”.

But is a new file format the best way to achieve this goal? (I can’t resist inserting the xkcd comic on standards at this point 🙂 )

From xkcd, the web comic:

While NextMove’s software for handling biopolymers, Sugar & Splice, will handle popular file formats such as HELM, I will describe a different view of the problem in the follow-up blog post.

NextMove Software at Bio-IT World

[Update 2013/04/22: View the poster online here] From Tuesday to Thursday next week, Roger will be attending Bio-IT World 2013 in Boston.

He will be presenting a poster on Extraction, analysis, atom mapping, classification and naming of reactions from Pharmaceutical ELNs. This is an application of the HazELNut software and associated utilities. One of the interesting results is the list of top 5 reactions in the ELN of a major pharmaceutical company. The following image gives an overview of the process:

[Image: HazELNut Poster]

If you’re attending Bio-IT World and want to meet Roger to discuss this or anything else, drop him a line at

NextMove Software at the New Orleans ACS

From Sun to Thurs next week, I’ll be attending the 245th ACS National Meeting in New Orleans. You’ll find me hanging around the CINF sessions, not least because I’ll be presenting some recent work there.

In particular, I’ll be talking about Roundtripping between small-molecule and biopolymer representations on the Tuesday (3:10pm 9th April, Room 349), which looks at the challenges I’ve encountered in the development of NextMove Software’s Sugar & Splice software. This software can be used to perceive, depict and interconvert between various biopolymer representations, and currently supports peptides, nucleotides and sugars (and mixtures thereof, e.g. glycoproteins).

If you’re interested in meeting up to discuss this or anything else, drop me a line at

Here’s the abstract. Slides will follow after the event:

Roundtripping between small-molecule and biopolymer representations

Noel M. O’Boyle,1 Evan Bolton,2 Roger A. Sayle1

1 NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge,
2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda MD 20894, USA

Existing cheminformatics toolkits provide a mature set of tools to handle small-molecule data, from generating depictions, to creating and reading linear representations (such as SMILES and InChI). However, such tools do not translate well to the domain of biopolymers where the key information is the identity of the repeating unit and the nature of the connections between them. For example, a typical all-atom 2D depiction of all but the smallest protein or oligosaccharide obscures this key structural information.

We describe a suite of tools which allow seamless interconversion between appropriate structure representations for small molecules and biopolymers (with a focus on polypeptides and oligosaccharides). For example:
SMILES: OC[C@H]1O[C@@H](O[C@@H]2[C@@H](CO)OC([C@@H]([C@H]2O)NC(=O)C)O)[C@@H]([C@H]([C@H]1O)O[C@@]1(C[C@H](O)[C@H]([C@@H](O1)[C@@H]([C@@H](CO)O)O)NC(=O)C)C(=O)O)O
Shortened IUPAC format: NeuAc(a2-3)Gal(b1-4)GlcNAc

I will discuss the challenge of supporting a variety of biopolymer representations, handling chemically-modified structures, and handling biopolymers with unknown attachment points (e.g. from mass spectrometry).