Could the real explicit hydrogens please stand up?

The OpenSMILES specification, led by Craig James of eMolecules, tries to iron out the handling of various corner cases in SMILES. It also describes best practices for writing SMILES so that they are easier to interpret. The goal of all of this is to avoid loss or corruption of chemical information when SMILES are exchanged between different software.

One issue which was not addressed, and indeed perhaps does not come under the OpenSMILES remit, is the question of how many hydrogens should be explicitly created in the internal representation of a molecule read from a SMILES string.

Which hydrogens are explicit?

The SMILES strings “C” and “C([H])([H])([H])[H]” both represent methane, but the first is typically converted into a single carbon atom with four implicit hydrogens, while the second is converted to a carbon atom with four explicit hydrogens attached. With Open Babel, we can see this difference by calling NumAtoms() on the molecule:

>>> import pybel  # Open Babel 2.x; on 3.x use "from openbabel import pybel"
>>> pybel.readstring("smi", "C").OBMol.NumAtoms()
1
>>> pybel.readstring("smi", "C([H])([H])([H])[H]").OBMol.NumAtoms()
5

So far so good. Most toolkits would show the same behaviour.

However, if instead we take the SMILES string “[CH3][C@@H](Br)Cl”, Daylight and ChemAxon will have 4 atoms, OEChem will have 5 (the stereo hydrogen is added), and Open Babel will have 8 (all the hydrogens mentioned in a square bracket are added).
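
The three conventions just described can be mimicked with a toy counter. This is only a sketch in plain Python, not code from any of the toolkits: the function name count_atoms and the policy labels "none", "stereo" and "all" are made up for illustration, and the tokenizer only understands simple SMILES (organic-subset atoms and bracket atoms with @/@@ chirality).

```python
import re

# Toy tokenizer: bracket atoms, two-letter halogens, then the rest of the
# organic subset (aliphatic and aromatic). Bonds, branches and ring closures
# are skipped because only atoms are being counted.
ATOM_TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")

# Inside a bracket atom: optional isotope, element symbol, optional @/@@
# chirality, optional hydrogen count ("H" on its own means one hydrogen).
BRACKET_ATOM = re.compile(
    r"\[(?:\d+)?(?P<symbol>[A-Z][a-z]?|[a-z]{1,2})"
    r"(?P<chiral>@{1,2})?(?:H(?P<hcount>\d*))?"
)

POLICIES = ("none", "stereo", "all")


def count_atoms(smiles, policy):
    """Count the atoms a reader would create under a hypothetical policy.

    "none"   - hydrogens written inside brackets stay implicit
    "stereo" - only hydrogens on chiral bracket atoms become atoms
    "all"    - every hydrogen written inside a bracket becomes an atom
    """
    if policy not in POLICIES:
        raise ValueError("unknown policy: %r" % (policy,))
    total = 0
    for token in ATOM_TOKEN.findall(smiles):
        total += 1  # the atom itself; a standalone [H] is counted here
        if not token.startswith("["):
            continue
        match = BRACKET_ATOM.match(token)
        if match is None or match.group("hcount") is None:
            continue  # no hydrogen count written in this bracket
        hydrogens = int(match.group("hcount") or 1)
        is_chiral = match.group("chiral") is not None
        if policy == "all" or (policy == "stereo" and is_chiral):
            total += hydrogens
    return total


for policy in POLICIES:
    print(policy, count_atoms("[CH3][C@@H](Br)Cl", policy))
```

Under these assumptions the "none" policy corresponds to the Daylight and ChemAxon counts, "stereo" to OEChem, and "all" to Open Babel as described above. Real toolkits may behave differently depending on version and options, so it is worth checking the atom count in whichever one you actually use.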

Why it matters

It shouldn’t really matter how a molecule is stored internally in a toolkit except perhaps for performance. But wouldn’t you know, there is a common case where it does matter. If you use SMARTS for substructure searching/matching, then the distinction between explicit and implicit hydrogens is going to affect your results.

You see, there are various SMARTS terms that distinguish between implicit and explicit hydrogens. I don’t think this was one of Daylight’s finest moments; in general SMARTS expressions correspond to molecular substructures, except for the crazy terms that match substructures of particular internal representations. The offending terms are:

h<n>   Atom has <n> implicit hydrogens attached
D<n>   Atom has <n> explicit connections

In short, if you use these terms to match against molecules read from SMILES strings, you will get different results with different toolkits, for the reasons discussed above. The OpenSMARTS draft specification (led by Tim Vandermeersch) also makes this point.

To avoid this problem, you should only use ‘portable terms’ in SMARTS expressions (i.e. ones which do not depend on whether atoms are explicit or implicit). Going forward, it would be useful for all toolkits to agree to adopt Daylight’s approach on the internal representation resulting from reading SMILES.
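
To make the difference concrete, here is a small sketch of how the SMARTS primitives would evaluate for the stereocentre of “[CH3][C@@H](Br)Cl” under two internal representations. The AtomView class is hypothetical and is not part of any toolkit. The choice of H<n> (total hydrogen count) and X<n> (total connections) as examples of portable terms is my assumption, based on how Daylight defines those primitives.

```python
from dataclasses import dataclass


@dataclass
class AtomView:
    """Hypothetical view of one atom as stored by a toolkit."""
    heavy_neighbors: int  # bonds to non-hydrogen atoms
    explicit_h: int       # hydrogens stored as real atoms in the graph
    implicit_h: int       # hydrogens stored only as a count on the atom

    @property
    def D(self):
        # Explicit connections: every neighbour that exists as an atom.
        return self.heavy_neighbors + self.explicit_h

    @property
    def h(self):
        # Implicit hydrogen count.
        return self.implicit_h

    @property
    def H(self):
        # Total hydrogen count, however the hydrogens are stored.
        return self.explicit_h + self.implicit_h

    @property
    def X(self):
        # Total connections, including implicit hydrogens.
        return self.D + self.implicit_h


# The stereocentre has three heavy neighbours (C, Br, Cl) and one hydrogen.
hydrogen_implicit = AtomView(heavy_neighbors=3, explicit_h=0, implicit_h=1)
hydrogen_explicit = AtomView(heavy_neighbors=3, explicit_h=1, implicit_h=0)

for label, atom in (("implicit H", hydrogen_implicit),
                    ("explicit H", hydrogen_explicit)):
    print(label, "D =", atom.D, "h =", atom.h, "H =", atom.H, "X =", atom.X)
```

In this model D and h change with the representation, while H and X stay the same, which is the sense in which a term is portable.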

Handling biologics: A perception problem?

In an earlier post, I described the importance of knowing the biopolymer structure when handling biologics. I also discussed various file formats that have been proposed to address this.

But rather than regarding this as a file format problem, why not consider it a perception problem instead? If we can perceive the biopolymer structure from an all-atom representation, then interconverting between any file formats (whether one of the proposed biopolymer formats or existing all-atom representations such as SMILES) is straightforward. Can it be done? Well, that’s exactly what PDB file writers do; they perceive the amino acid sequence from the all-atom structure and fill in the relevant columns in the PDB file.

[Figure: Roundtripping]

There are several benefits to this approach. To begin with, it avoids the cost associated with a new registry system based on a macromolecular file format. There are no problems with new and unusual monomers; these will be faithfully stored in the all-atom representation. The de facto standards for chemical information interchange, SMILES and MOL files, can be used as always for exchange of data. Tools for small-molecule analysis (e.g. SMARTS searching) can be combined with analyses based on biopolymer structure (e.g. HELM depiction, Smith-Waterman searching). And finally, it’s worth considering that it may be difficult to migrate at a later date if a registry system is based on a particular file format.

What I’ve described here and in the previous post is the introduction from my ACS presentation on Roundtripping between small-molecule and biopolymer representations. This describes the development of the Sugar & Splice software for handling oligopeptides, oligonucleotides and oligosaccharides (including modified residues and mixtures of different biopolymers). Note that the presentation is somewhat sugar-centric; for more info on the peptide and nucleotide side of things see Roger’s Spring 2012 ACS presentation.

(For more presentations from NextMove Software, see our SlideShare page.)