Could the real explicit hydrogens please stand up?

The OpenSMILES specification led by Craig James of eMolecules tries to iron out the handling of various corner cases in SMILES. Furthermore it describes best practices when writing SMILES so as to aid ease of interpretation. The goal of all of this is to avoid loss or corruption of chemical information when exchanged as SMILES between different software.

One issue which was not addressed, and indeed perhaps does not come under the OpenSMILES remit, is the question of how many hydrogens should be explicitly created in the internal representation of a molecule read from a SMILES string.

Which hydrogens are explicit?

The SMILES strings “C” and “C([H])([H])([H])H” both represent methane, but the first is typically converted into a single carbon atom with four implicit hydrogens, while the second is converted to a carbon atom with four explicit hydrogens attached. With Open Babel, we can see this difference by calling NumAtoms() on the molecule:

>>> pybel.readstring("smi", "C").OBMol.NumAtoms()
1
>>> pybel.readstring("smi", "C([H])([H])([H])[H]").OBMol.NumAtoms()
5

So far so good. Most toolkits would show the same behaviour.

However, if instead we take the SMILES string “[CH3][C@@H](Br)Cl”, Daylight and ChemAxon will have 4 atoms, OEChem will have 5 (the stereo hydrogen is added), and Open Babel will have 8 (all the hydrogens mentioned in a square bracket are added).

Why it matters

It shouldn’t really matter how a molecule is stored internally in a toolkit except perhaps for performance. But wouldn’t you know, there is a common case where it does matter. If you use SMARTS for substructure searching/matching, then the distinction between explicit and implicit hydrogens is going to affect your results.

You see, there are various SMARTS terms that distinguish between implicit and explicit hydrogens. I don’t think this was one of Daylight’s finest moments; in general SMARTS expressions correspond to molecular substructures, except for the crazy terms that match substructures of particular internal representations. The offending terms are:

h<n>   Atom has <n> implicit hydrogens attached
D<n>   Atom has <n> explicit connections

In short, if you use these terms to match against molecules read from SMILES strings, you will get different results with different toolkits because of the reasons discussed above. The OpenSMARTS draft specification (led by Tim Vandermeersch) also makes this point.

To avoid this problem, you should only use ‘portable terms’ in SMARTS expressions (i.e. ones which do not depend on whether atoms are explicit or implicit). Going forward, it would be useful for all toolkits to agree to adopt Daylight’s approach on the internal representation resulting from reading SMILES.