Explicit and Implicit Hydrogens: Taking liberties with valence

The field of cheminformatics has two concepts of implicit vs. explicit when it comes to talking about hydrogen atoms.  The first concept is at the internal data structure level where explicit hydrogens are each stored as an atom object (like any other element) vs. implicit hydrogens that exist only as a hydrogen count field associated with a parent atom.  The second concept is at the file format or line notation representation level where the number of hydrogens or valence is either explicitly specified vs. encodings where the hydrogen count is implicitly defined by some (often undocumented) valence model.  The existence of these two related but independent concepts is a recent personal realization after discovering what I’d previously called an “explicit hydrogen” differed from what OpenBabel or RDKit call an explicit hydrogen.

To demonstrate the difference consider the following three SMILES for methane:  “C”, “[CH4]” and “[H][C]([H])([H])[H]“.  All these SMILES represent the exact same molecule.  Most(?) cheminformatics would represent the first as a single atom internally where both the representation and count is implicit; the second by a single atom where the representation is implicit but the count is explicit, and the third by five atoms where the hydrogen atoms are explicit.  I say most toolkits as to make life interesting OpenBabel allocates real hydrogen atoms for “[CH4]“, and equally confusingly OpenEye’s OEChem toolkit will allocate a real hydrogen atom for implicit hydrogens at atoms with specified stereochemistry, i.e. as in “[C@H]” and “[C@@H]“.

An interesting side-effect of the above distinction is that molecular hydrogen, the most common molecule in the observable universe, can be represented by either “[H][H]” or “[HH]“.

Although described in relation to SMILES, the exact same distinction also holds for connection table file formats, such as the MDL/Symyx mol file format. The equivalent three variants of methane can be created as files with explicit atom records/lines for hydrogen atoms, as a single carbon atom line with an explicit valence of four, and as a single carbon atom line with a default valence (specified as a zero in the vvv columns of the atom block).  It’s important to honor the valence field vvv in an MDL/Symyx mol file, as for example “sodium metal” and “sodium hydride” are represented by connection tables with a single atom line, distinguished only by their valence.

Where things get fascinating (a.k.a problematic) are the cases that Donald Rumsfeld might call the “implicit implicits”, where (often to conserve space) the hydrogen count is to be implied from an atom’s environment; specifically its atomic number, formal charge and the bond orders of the bonds it participates in. This is the world of the valence model.

A valence model is used for defining (deriving) the number of hydrogens implicitly bonded to an atom.  For example, a neutral carbon may be assumed to be four-valent, implying that a neutral carbon with two explicit single bonds should have two implicit hydrogens.

Myth #1: There is a single universal valence model.

A common misconception is that there is or can exist a single valence model for chemical informatics.  The reality is that many vendors and cheminformatics toolkits implement their own interpretations of chemical valence, and correctly interacting with these toolkits requires matching the native models.  As a simple proof consider a neutral iodine with four single bonds to it, i.e. “CI(C)(C)C”.  In Daylight’s valence model this has no implicit hydrogens and is equivalent to “C[I](C)(C)C”, but in MDL’s valence model, neutral iodine is potentially one, three and five valent, so this compound has one implicit hydrogen and is therefore equivalent to “C[IH](C)(C)C”.  These two cases show that a single “model” or function can’t be used to process both SMILES and MDL mol files; much like aromaticity models, the definitions need to be matched to vendor file format.

Myth #2: Valence models follow simple patterns.

Dmitri Mendeleev made the whole problem of chemistry look easy by showing that elements could be conveniently arranged in a periodic table, to help predict their properties.  But such patterns can be deceptive.  If one observes that oxygen which typically has valence two in its neutral form, becomes three valent as a cation (+1 formal charge) and single valent as an anion (-1 formal charge), and that nitrogen which is typically three-valent when neutral, becomes four valent as a cation and two valent as an anion, one could jump to an obvious rule:  That cations have a valence one higher than the neutral form, and anions have a valence one less. Extrapolating this pattern would predict that a carbon anion would be three valent (which is correct), and that a carbon cation would be five valent (which is incorrect).

The truth is actually much closer to Mendeleev’s view of the periodic table.  The cation of an element (with one less electron) is isoelectronic to (and therefore tends to behave like) the element to its left in the periodic table, and the anion (with one more electron) is isoelectronic to the element to its right.  This predicts that N+ has the same valence as neutral carbon, and N- has the same valence as neutral oxygen.  More importantly it reveals that C- behaves like nitrogen (three valent) and that C+ behaves like boron (also three valent).

Alas, in the real world of cheminformatics life is much more complicated, with the MDL valence model defining different valences even for isoelectronic element/charge states.  Neutral thallium and lead(+1) would be expected to have the same set of valences, but the former is 1 and 3 valent, with the latter only 3 valent. The best that can be done is to explicitly encode complex valence models, such as MDL’s as large tables and enumerate the applicable cases.

The ultimate arbiter of a valence model is the author of a file format; hence Daylight define (and document) the valence model for SMILES, MDL/Symyx/Accelrys define the valence model for Mol files, CambridgeSoft/PKI define the valence model for ChemDraw files, and Tripos define the valence model for Sybyl Mol2 files.  An easy way to determine the “correct” MDL/Symyx valence for an atom is to draw the element and formal charge in ISIS/Draw, Symyx/Draw or Accelrys/Draw, and select “AutoPosition Hydrogens”.

Myth #3: A connection table representing a molecule is interpreted the same way by all programs.

Whilst most cheminformatics software tends to agree on the valences of common elements and charge states, such as neutral carbon and quaternary ammoniums, they start to rapidly diverge the further one goes from core organic chemistry.  Consider a boron atom with a -4 formal charge, which according to the MDL/Symyx valence model should be interpreted as being single valent.  At the end of 2012, OpenEye’s OEChem treated B4- as zero valent,  OpenBabel treated it as three valent and RDKit treated it as seven valent.  The same connection table results in four different molecular formulae in four different programs.

To assess the level of consistency in the cheminformatics community when interpreting connection tables, I performed a simple experiment and presented the results of a small “mdlbench” benchmark at the 2012 RDKit User Group Meeting in London. Firstly many thanks to Greg Landrum for allowing me to present, and the numerous toolkit developers who generous assisted by benchmarking and tweaking their software.  The evaluation considered the 199 unique atom types observed in Accerlys’s MDDR database of drug and drug-like molecules, and compared their interpretation across no less than 24 different chemistry toolkits/programs (mdlbench.sdf and mdlbench.smi).  The diversity of results was a surprise even to me, and some of the numbers were a shock; I certainly wasn’t expecting the toolkit graciously hosting the meeting, RDKit, to appear at the bottom of the list.  [No longer the case, read on...]

To demonstrate examples of differences: Bi2+ with a single single bond, occurs 7266 times in MDDR 2011.2, and is defined by MDL/Symyx to be three valent, and therefore have two implicit hydrogens.  ChemDraw, Corina, Pipeline Pilot and Dotmatics Pinpoint all add no implicit hydrogens, whilst ChemAxon and RDKit add four instead of two.  B2- with a single single bond, occurs 370 times and and again should be three valent, with two implicit hydrogens.  GGA Indigo adds no implicit hydrogens, whereas Daylight, RDKit, ChemAxon and Pipeline Pilot v8.5 all add four implicit hydrogens.  Interestingly, this case has been fixed in the upcoming Pipeline Pilot v9, a dividend to the community of Accelrys acquiring Symyx.

Myth #4: Everything is broken in cheminformatics and nothing can be done to fix it.

Actually things aren’t as bad as they seem – the mdlbench analysis revealed that the vast majority of molecules are unaffected by differences between toolkits – but most importantly the exercise revealed the large degree of co-operation between developers in improving interoperability between tools.  Most significantly both Open Babel and RDKit open source projects have since accepted patches (Open Babel mdlvalence.h, RDKit MDLValence.h) to fully implement the MDL valence model and now achieve a perfect 100% on this benchmark.  Commercial developers at InfoChem and Optibrium have made improvements to their MDL file readers, as has the MayaChem tools project, and copies of the “mdlbench” test have even been requested by pharmaceutical companies to validate their in-house tool development efforts.  The great news for the community is that there is now less “corruption” of chemistry when sharing data between software, so the molecules read from PubChem or ChEMBL into your analysis programs or editors are more likely to match the molecules that the NCBI or EBI intended them to be.

This entry was posted in Uncategorized. Bookmark the permalink.

9 Responses to Explicit and Implicit Hydrogens: Taking liberties with valence

  1. Do you have links to the RDKit implementation too? It would be great to see a CC0 licensed data table, or BSD/MIT licensed reference implementation if available. I would also be interested in having mdlbench available for validation if possible, thanks for posting this great write up.

  2. John May says:

    Very nice read, I tried patching this in the CDK last year but reversed as I didn’t trust my judgement on “vvv: shows number of bonds to this atom, including bonds to implied H’s”. What was interesting is that the existing code reads the valence value for non-pseudo atoms but only writes it for pseudo atoms… wat! I also believe the value is ignored by the atom type decision tree so it’s pretty amazing to get that number of correct atom types. Incidentally, the search for the Chlorine atom is slow as the atom type runs automatically and it massive resource drain. I’ve been meaning to patch this out of the parser and make it optional but it’s not very high on the todo list.

    Marcus, the mdlbench.sdf.zip is linked in the post and the RDKit patch is p/rdkit/code/2238/.

  3. Roger, thanx for this nice post. And I am proud that the CDK shows up so well; it further supports me in my idea that the isolated atom typing in the CDK is a good thing.

  4. Markus Sitzmann says:

    Nice post, important post – put it into a paper :-)

  5. Keith Taylor says:

    Very interesting post.

    Without detracting from the information, I would like to deemphasize the role of the molfile. We should not imbue the molfile with more power than it has. The molfile is a connection table style, data exchange format. The reading application interprets it in the context of its data model and that is where the differences arise.

    Note the official Accelrys style for the file type: molfile (and its relatives: rxnfile, SDfile, RDfile, and XDfile).

  6. Pingback: Competitive text-mining and molecular data exchange at ACS Indy | NextMove Software

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>