Uncategorized – Page 10 – NextMove Software

Handling biologics: A file format problem?

The increasing importance of biological therapeutics, or biologics, to the pharmaceutical industry is well-known. For example, data from Drugs.com show that of the top 15 best selling therapies in the US in Q4 2012, six were biologics. Monoclonal antibodies are a typical example; these are glycoproteins, comprising of short oligosaccharides attached to a multi-chain polypeptide.

It is clear that handling such molecules requires a different approach than that taken for small-molecules. For example, here is an all-atom depiction of the peptide crambin:

No – it’s not a cyclic peptide. It just happens to have three disulfide bridges. A more useful depiction can be generated if we follow the IUPAC or FDA guidelines for peptide depiction; here the primary structure is much clearer as is the presence of the disulfide bonds:

However, to create these sorts of depictions, and otherwise handle biopolymers more appropriately, we need to know the polymer structure.

Some consider this a file format problem. Some file formats which have been developed to store or represent biopolymer structures include the CHUCKLES and CHORTLES languages from Chiron and Daylight, HELM (Hierarchical Editing Language for Macromolecules) from Pfizer, Protein Line Notation from Biochemfusion and SCSR (Self-Contained Sequence Representation, an MDL V3000 extension) from Accelrys. Naturally, Wisswesser Line Notation has also been extended to handle this problem.

In particular, the HELM format has recently received support from the Pistoia Alliance. See for example this post on the Pistoia blog which describes how HELM “gives us a single consistent way to describe macromolecules which can be used across industry and academia” so that “researchers do not have to spend time creating their own notations”.

But is a new file format the best way to achieve this goal? (I can’t resist inserting the xkcd comic on standards at this point 🙂 )

From xkcd, the web comic: http://xkcd.com/927/

While NextMove’s software for handling biopolymers, Sugar & Splice, will handle popular file formats such as HELM, I will describe a different view of the problem in the follow-up blog post.

NextMove Software at Bio-IT World

[Update 2013/4/022: View the poster online here] From Tuesday to Thursday next week, Roger will be attending Bio-IT World 2013 in Boston.

He will be presenting a poster on Extraction, analysis, atom mapping, classification and naming of reactions from Pharmaceutical ELNs. This is an application of the HazELNut software and associated utilities. One of the interesting results is the list of top 5 reactions in the ELN of a major pharmaceutical company. The following image gives an overview of the process (click for larger):If you’re attending Bio-IT World and want to meet Roger to discuss this or anything else, drop him a line at roger@nextmovesoftware.com.

NextMove Software at the New Orleans ACS

From Sun to Thurs next week, I’ll be attending the 245th ACS National Meeting in New Orleans. You’ll find me hanging around the CINF sessions, not least because I’ll be presenting some recent work there.

In particular, I’ll be talking about Roundtripping between small-molecule and biopolymer representations on the Tuesday (3:10pm 9th April, Room 349), which looks at the challenges I’ve encountered in the development of NextMove Software’s Sugar & Splice software. This software can be used to perceive, depict and interconvert between various biopolymer representations, and currently supports peptides, nucleotides and sugars (and mixtures thereof, e.g. glycoproteins).

If you’re interested in meeting up to discuss this or anything else, drop me a line at noel@nextmovesoftware.com.

Here’s the abstract. Slides will follow after the event:

Roundtripping between small-molecule and biopolymer representations

Noel M. O’Boyle,¹ Evan Bolton,² Roger A. Sayle¹

1 NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge,
CB4 0EY, UK
2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda MD 20894, USA

Existing cheminformatics toolkits provide a mature set of tools to handle small-molecule data, from generating depictions, to creating and reading linear representations (such as SMILES and InChI). However, such tools do not translate well to the domain of biopolymers where the key information is the identity of the repeating unit and the nature of the connections between them. For example, a typical all-atom 2D depiction of all but the smallest protein or oligosaccharide obscures this key structural information.

We describe a suite of tools which allow seamless interconversion between appropriate structure representations for small molecules and biopolymers (with a focus on polypeptides and oligosaccharides). For example:
SMILES: OC[C@H]1O[C@@H](O[C@@H]2[C@@H](CO)OC([C@@H]([C@H]2O)NC(=O)C)O)[C@@H]([C@H]([C@H]1O)O[C@@]1(C[C@H](O)[C@H]([C@@H](O1)[C@@H]([C@@H](CO)O)O)NC(=O)C)C(=O)O)O
Shortened IUPAC format: NeuAc(a2-3)Gal(b1-4)GlcNAc

I will discuss the challenge of supporting a variety of biopolymer representations, handling chemically-modified structures, and handling biopolymers with unknown attachment points (e.g. from mass spectrometry).

Explicit and Implicit Hydrogens: Taking liberties with valence

The field of cheminformatics has two concepts of implicit vs. explicit when it comes to talking about hydrogen atoms. The first concept is at the internal data structure level where explicit hydrogens are each stored as an atom object (like any other element) vs. implicit hydrogens that exist only as a hydrogen count field associated with a parent atom. The second concept is at the file format or line notation representation level where the number of hydrogens or valence is either explicitly specified vs. encodings where the hydrogen count is implicitly defined by some (often undocumented) valence model. The existence of these two related but independent concepts is a recent personal realization after discovering what I’d previously called an “explicit hydrogen” differed from what OpenBabel or RDKit call an explicit hydrogen.

To demonstrate the difference consider the following three SMILES for methane: “C”, “[CH4]” and “[H][C]([H])([H])[H]”. All these SMILES represent the exact same molecule. Most(?) cheminformatics would represent the first as a single atom internally where both the representation and count is implicit; the second by a single atom where the representation is implicit but the count is explicit, and the third by five atoms where the hydrogen atoms are explicit. I say most toolkits as to make life interesting OpenBabel allocates real hydrogen atoms for “[CH4]”, and equally confusingly OpenEye’s OEChem toolkit will allocate a real hydrogen atom for implicit hydrogens at atoms with specified stereochemistry, i.e. as in “[C@H]” and “[C@@H]”.

An interesting side-effect of the above distinction is that molecular hydrogen, the most common molecule in the observable universe, can be represented by either “[H][H]” or “[HH]”.

Although described in relation to SMILES, the exact same distinction also holds for connection table file formats, such as the MDL/Symyx mol file format. The equivalent three variants of methane can be created as files with explicit atom records/lines for hydrogen atoms, as a single carbon atom line with an explicit valence of four, and as a single carbon atom line with a default valence (specified as a zero in the vvv columns of the atom block). It’s important to honor the valence field vvv in an MDL/Symyx mol file, as for example “sodium metal” and “sodium hydride” are represented by connection tables with a single atom line, distinguished only by their valence.

Where things get fascinating (a.k.a problematic) are the cases that Donald Rumsfeld might call the “implicit implicits”, where (often to conserve space) the hydrogen count is to be implied from an atom’s environment; specifically its atomic number, formal charge and the bond orders of the bonds it participates in. This is the world of the valence model.

A valence model is used for defining (deriving) the number of hydrogens implicitly bonded to an atom. For example, a neutral carbon may be assumed to be four-valent, implying that a neutral carbon with two explicit single bonds should have two implicit hydrogens.

Myth #1: There is a single universal valence model.

A common misconception is that there is or can exist a single valence model for chemical informatics. The reality is that many vendors and cheminformatics toolkits implement their own interpretations of chemical valence, and correctly interacting with these toolkits requires matching the native models. As a simple proof consider a neutral iodine with four single bonds to it, i.e. “CI(C)(C)C”. In Daylight’s valence model this has no implicit hydrogens and is equivalent to “C[I](C)(C)C”, but in MDL’s valence model, neutral iodine is potentially one, three and five valent, so this compound has one implicit hydrogen and is therefore equivalent to “C[IH](C)(C)C”. These two cases show that a single “model” or function can’t be used to process both SMILES and MDL mol files; much like aromaticity models, the definitions need to be matched to vendor file format.

Myth #2: Valence models follow simple patterns.

Dmitri Mendeleev made the whole problem of chemistry look easy by showing that elements could be conveniently arranged in a periodic table, to help predict their properties. But such patterns can be deceptive. If one observes that oxygen which typically has valence two in its neutral form, becomes three valent as a cation (+1 formal charge) and single valent as an anion (-1 formal charge), and that nitrogen which is typically three-valent when neutral, becomes four valent as a cation and two valent as an anion, one could jump to an obvious rule: That cations have a valence one higher than the neutral form, and anions have a valence one less. Extrapolating this pattern would predict that a carbon anion would be three valent (which is correct), and that a carbon cation would be five valent (which is incorrect).

The truth is actually much closer to Mendeleev’s view of the periodic table. The cation of an element (with one less electron) is isoelectronic to (and therefore tends to behave like) the element to its left in the periodic table, and the anion (with one more electron) is isoelectronic to the element to its right. This predicts that N⁺ has the same valence as neutral carbon, and N^– has the same valence as neutral oxygen. More importantly it reveals that C^– behaves like nitrogen (three valent) and that C⁺ behaves like boron (also three valent).

Alas, in the real world of cheminformatics life is much more complicated, with the MDL valence model defining different valences even for isoelectronic element/charge states. Neutral thallium and lead(+1) would be expected to have the same set of valences, but the former is 1 and 3 valent, with the latter only 3 valent. The best that can be done is to explicitly encode complex valence models, such as MDL’s as large tables and enumerate the applicable cases.

The ultimate arbiter of a valence model is the author of a file format; hence Daylight define (and document) the valence model for SMILES, MDL/Symyx/Accelrys define the valence model for Mol files, CambridgeSoft/PKI define the valence model for ChemDraw files, and Tripos define the valence model for Sybyl Mol2 files. An easy way to determine the “correct” MDL/Symyx valence for an atom is to draw the element and formal charge in ISIS/Draw, Symyx/Draw or Accelrys/Draw, and select “AutoPosition Hydrogens”.

Myth #3: A connection table representing a molecule is interpreted the same way by all programs.

Whilst most cheminformatics software tends to agree on the valences of common elements and charge states, such as neutral carbon and quaternary ammoniums, they start to rapidly diverge the further one goes from core organic chemistry. Consider a boron atom with a -4 formal charge, which according to the MDL/Symyx valence model should be interpreted as being single valent. At the end of 2012, OpenEye’s OEChem treated B^4- as zero valent, OpenBabel treated it as three valent and RDKit treated it as seven valent. The same connection table results in four different molecular formulae in four different programs.

To assess the level of consistency in the cheminformatics community when interpreting connection tables, I performed a simple experiment and presented the results of a small “mdlbench” benchmark at the 2012 RDKit User Group Meeting in London. Firstly many thanks to Greg Landrum for allowing me to present, and the numerous toolkit developers who generous assisted by benchmarking and tweaking their software. The evaluation considered the 199 unique atom types observed in Accerlys’s MDDR database of drug and drug-like molecules, and compared their interpretation across no less than 24 different chemistry toolkits/programs (mdlbench.sdf and mdlbench.smi). The diversity of results was a surprise even to me, and some of the numbers were a shock; I certainly wasn’t expecting the toolkit graciously hosting the meeting, RDKit, to appear at the bottom of the list. [No longer the case, read on…]

To demonstrate examples of differences: Bi²⁺ with a single single bond, occurs 7266 times in MDDR 2011.2, and is defined by MDL/Symyx to be three valent, and therefore have two implicit hydrogens. ChemDraw, Corina, Pipeline Pilot and Dotmatics Pinpoint all add no implicit hydrogens, whilst ChemAxon and RDKit add four instead of two. B^2- with a single single bond, occurs 370 times and and again should be three valent, with two implicit hydrogens. GGA Indigo adds no implicit hydrogens, whereas Daylight, RDKit, ChemAxon and Pipeline Pilot v8.5 all add four implicit hydrogens. Interestingly, this case has been fixed in the upcoming Pipeline Pilot v9, a dividend to the community of Accelrys acquiring Symyx.

Myth #4: Everything is broken in cheminformatics and nothing can be done to fix it.

Actually things aren’t as bad as they seem – the mdlbench analysis revealed that the vast majority of molecules are unaffected by differences between toolkits – but most importantly the exercise revealed the large degree of co-operation between developers in improving interoperability between tools. Most significantly both Open Babel and RDKit open source projects have since accepted patches (Open Babel mdlvalence.h, RDKit MDLValence.h) to fully implement the MDL valence model and now achieve a perfect 100% on this benchmark. Commercial developers at InfoChem and Optibrium have made improvements to their MDL file readers, as has the MayaChem tools project, and copies of the “mdlbench” test have even been requested by pharmaceutical companies to validate their in-house tool development efforts. The great news for the community is that there is now less “corruption” of chemistry when sharing data between software, so the molecules read from PubChem or ChEMBL into your analysis programs or editors are more likely to match the molecules that the NCBI or EBI intended them to be.

Looking for a C++ cheminformatics toolkit?

If you are writing a C++ application that uses a cheminformatics toolkit, a key step during the configuration process is to tell the build system the location of the toolkit libraries and include files. However, in an ideal world, the build system would find these automatically based on standard install locations or hints such as the value of RDBASE for RDKit.

We have incorporated some of these ideas into CMake modules to find Open Babel, OEChem or RDKit. As a service to the community, we are placing these into the public domain (E&OE). Enjoy.

# FindOpenBabel.cmake
# Placed in the public domain by NextMove Software in 2013
# Try to find Open Babel headers and libraries
# Defines:
#
#  OPENBABEL_FOUND - system has Open Babel
#  OPENBABEL_INCLUDE_DIRS - the Open Babel include directories
#  OPENBABEL_LIBRARIES - Link these to use Open Babel

if(OPENBABEL_INCLUDE_DIRS AND OPENBABEL_LIBRARIES)
  # in cache already or user-specified
  set(OPENBABEL_FOUND TRUE)

else()

  if(NOT OPENBABEL_INCLUDE_DIRS)
    if(WIN32)
      find_path(OPENBABEL_INCLUDE_DIR openbabel/obconversion.h
        PATHS
        ${OPENBABEL_DIR}\\include
        $ENV{OPENBABEL_INCLUDE_DIR}
        $ENV{OPENBABEL_INCLUDE_DIR}\\openbabel-2.0
        $ENV{OPENBABEL_INCLUDE_PATH}
        $ENV{OPENBABEL_INCLUDE_PATH}\\openbabel-2.0
        $ENV{OPENBABEL_DIR}\\include
        $ENV{OPENBABEL_PATH}\\include
        $ENV{OPENBABEL_BASE}\\include
        C:\\OpenBabel\\include
      )
      find_path(OPENBABEL_INCLUDE_DIR2 openbabel/babelconfig.h
        PATHS
        ${OPENBABEL_DIR}/windows-vc2008/build/include
        $ENV{OPENBABEL_DIR}/windows-vc2008/build/include
        )
      set(OPENBABEL_INCLUDE_DIRS ${OPENBABEL_INCLUDE_DIR} ${OPENBABEL_INCLUDE_DIR2})
    else()
      find_path(OPENBABEL_INCLUDE_DIRS openbabel/obconversion.h
        PATHS
        ${OPENBABEL_DIR}/include
        $ENV{OPENBABEL_INCLUDE_DIR}/openbabel-2.0
        $ENV{OPENBABEL_INCLUDE_DIR}
        $ENV{OPENBABEL_INCLUDE_PATH}/openbabel-2.0
        $ENV{OPENBABEL_INCLUDE_PATH}
        $ENV{OPENBABEL_DIR}/include/openbabel-2.0
        $ENV{OPENBABEL_DIR}/include
        $ENV{OPENBABEL_PATH}/include/openbabel-2.0
        $ENV{OPENBABEL_PATH}/include
        $ENV{OPENBABEL_BASE}/include/openbabel-2.0
        $ENV{OPENBABEL_BASE}/include
        /usr/include/openbabel-2.0
        /usr/include
        /usr/local/include/openbabel-2.0
        /usr/local/include
        /usr/local/openbabel/include/openbabel-2.0
        /usr/local/openbabel/include
        /usr/local/openbabel-2.0/include/openbabel-2.0
        /usr/local/openbabel-2.0/include
        ~/include/openbabel-2.0
        ~/include
      )
    endif()
    if(OPENBABEL_INCLUDE_DIRS)
      message(STATUS "Found Open Babel include files at ${OPENBABEL_INCLUDE_DIRS}")
    endif()
  endif()

  if(NOT OPENBABEL_LIBRARY_DIR)
    find_library(OPENBABEL_LIBRARIES NAMES openbabel openbabel-2
      PATHS
      ${OPENBABEL_DIR}/lib
      ${OPENBABEL_DIR}/windows-vc2008/build/src/Release
      $ENV{OPENBABEL_LIBRARIES}
      $ENV{OPENBABEL_DIR}/lib
      $ENV{OPENBABEL_DIR}/windows-vc2008/build/src/Release
      $ENV{OPENBABEL_PATH}/lib
      $ENV{OPENBABEL_BASE}/lib
      /usr/lib
      /usr/local/lib
      ~/lib
      $ENV{LD_LIBRARY_PATH}
    )
    if(OPENBABEL_LIBRARIES)
      message(STATUS "Found Open Babel library at ${OPENBABEL_LIBRARIES}")
    endif()
  endif()

  if(OPENBABEL_INCLUDE_DIRS AND OPENBABEL_LIBRARIES)
    set(OPENBABEL_FOUND TRUE)
  endif()

  mark_as_advanced(OPENBABEL_INCLUDE_DIRS OPENBABEL_LIBRARIES)
endif()

# FindOEChem.cmake
# Placed in the public domain by NextMove Software in 2013
# Try to find OEChem headers and libraries
# Defines:
#
#  OECHEM_FOUND - system has OEChem
#  OECHEM_INCLUDE_DIR - the OEChem include directory
#  OECHEM_LIBRARIES - Link these to use OEChem

if(OECHEM_INCLUDE_DIR AND OECHEM_LIBRARIES)
  # in cache already or user-specified
  set(OECHEM_FOUND TRUE)

else()

  if(NOT OECHEM_INCLUDE_DIR)
    if(WIN32)
      find_path(OECHEM_INCLUDE_DIR oechem.h
        PATHS
        $ENV{OECHEM_INCLUDE_DIR}
        $ENV{OECHEM_INCLUDE_PATH}
        $ENV{OECHEM_BASE}\\toolkits\\include
        $ENV{OECHEM_BASE}\\include
        C:\\OpenEye\\toolkits\\include
        C:\\OpenEye\\include
      )
    else()
      find_path(OECHEM_INCLUDE_DIR oechem.h
        PATHS
          $ENV{OECHEM_INCLUDE_DIR}
          $ENV{OECHEM_INCLUDE_PATH}
          $ENV{OECHEM_BASE}/toolkits/include
          $ENV{OECHEM_BASE}/include
          /usr/local/openeye/toolkits/include
          /usr/local/openeye/include
          /usr/include
          /usr/local/include
          ~/include
      )
      if(OECHEM_INCLUDE_DIR)
         message(STATUS "Found OEChem include files at ${OECHEM_INCLUDE_DIR}")
      endif()
    endif()
  endif()

  if(NOT OECHEM_LIBRARIES)
    find_library(OECHEM_LIB NAMES oechem
      HINTS
        ${OECHEM_INCLUDE_DIR}/../lib
      PATHS
        $ENV{OECHEM_LIB_DIR}
        $ENV{OECHEM_LIB_PATH}
        $ENV{OECHEM_LIBRARIES}
        $ENV{OECHEM_BASE}/toolkits/lib
        $ENV{OECHEM_BASE}/lib
        /usr/openeye/toolkits/lib
        /usr/openeye/lib
        /usr/local/openeye/toolkits/lib
        /usr/local/openeye/lib
        ~/openeye/toolkits/lib
        ~/openeye/lib
        $ENV{LD_LIBRARY_PATH}
    )
    if(OECHEM_LIB)
       GET_FILENAME_COMPONENT(OECHEM_LIBRARY_DIR ${OECHEM_LIB} PATH)
       message(STATUS "Found OEChem libraries at ${OECHEM_LIBRARY_DIR}")

      # Note that the order of the following libraries is significant!!
      find_library(OESYSTEM_LIB NAMES oesystem HINTS ${OECHEM_LIBRARY_DIR})
      find_library(OEPLATFORM_LIB NAMES oeplatform HINTS ${OECHEM_LIBRARY_DIR})
      set (OECHEM_LIBRARIES ${OECHEM_LIB} ${OESYSTEM_LIB} ${OEPLATFORM_LIB})
    endif()
  endif()

  if(OECHEM_INCLUDE_DIR AND OECHEM_LIBRARIES)
    set(OECHEM_FOUND TRUE)
  endif()

  mark_as_advanced(OECHEM_INCLUDE_DIR OECHEM_LIBRARIES)
endif()

# FindRDKit.cmake
# Placed in the public domain by NextMove Software in 2013
# Try to find RDKit headers and libraries
# Defines:
#
#  RDKIT_FOUND - system has RDKit
#  RDKIT_INCLUDE_DIR - the RDKit include directory
#  RDKIT_LIBRARIES - Link these to use RDKit

if(RDKIT_INCLUDE_DIR AND RDKIT_LIBRARIES)
  # in cache already or user-specified
  set(RDKIT_FOUND TRUE)

else()

  if(NOT RDKIT_INCLUDE_DIR)
    if(WIN32)
      find_path(RDKIT_INCLUDE_DIR GraphMol/RDKitBase.h
        PATHS
        ${RDKIT_DIR}\\Code
        $ENV{RDKIT_INCLUDE_DIR}
        $ENV{RDKIT_INCLUDE_PATH}
        $ENV{RDKIT_BASE}\\Code
        $ENV{RDBASE}\\Code
        C:\\RDKit\\include
        C:\\RDKit\\Code
      )
    else()
      find_path(RDKIT_INCLUDE_DIR GraphMol/RDKitBase.h
        PATHS
          ${RDKIT_DIR}/Code
          $ENV{RDKIT_INCLUDE_DIR}
          $ENV{RDKIT_INCLUDE_PATH}
          $ENV{RDKIT_BASE}/Code
          $ENV{RDBASE}/Code
          /usr/local/rdkit/include/Code
          /usr/local/rdkit/include
          /usr/local/rdkit/Code
          ~/rdkit/Code
      )
    endif()
    if(RDKIT_INCLUDE_DIR)
       message(STATUS "Found RDKit include files at ${RDKIT_INCLUDE_DIR}")
    endif()
  endif()

  if(NOT RDKIT_LIBRARIES)
    find_library(FILEPARSERS_LIB NAMES FileParsers
      PATHS
        ${RDKIT_DIR}/lib
        $ENV{RDKIT_LIB_DIR}
        $ENV{RDKIT_LIB_PATH}
        $ENV{RDKIT_LIBRARIES}
        $ENV{RDKIT_BASE}/lib
        $ENV{RDBASE}/lib
        /usr/local/rdkit/lib
        ~/rdkit/lib
        $ENV{LD_LIBRARY_PATH}
    )
    if(FILEPARSERS_LIB)
       GET_FILENAME_COMPONENT(RDKIT_LIBRARY_DIR ${FILEPARSERS_LIB} PATH)
       message(STATUS "Found RDKit libraries at ${RDKIT_LIBRARY_DIR}")

      # Note that the order of the following libraries is significant!!
      find_library(SMILESPARSE_LIB NAMES SmilesParse
                                   HINTS ${RDKIT_LIBRARY_DIR})
      find_library(DEPICTOR_LIB NAMES Depictor
                                HINTS ${RDKIT_LIBRARY_DIR})
      find_library(GRAPHMOL_LIB NAMES GraphMol
                                HINTS ${RDKIT_LIBRARY_DIR})
      find_library(RDGEOMETRYLIB_LIB NAMES RDGeometryLib
                                     HINTS ${RDKIT_LIBRARY_DIR})
      find_library(RDGENERAL_LIB NAMES RDGeneral
                                 HINTS ${RDKIT_LIBRARY_DIR})
      set (RDKIT_LIBRARIES ${FILEPARSERS_LIB} ${SMILESPARSE_LIB}
              ${DEPICTOR_LIB} ${GRAPHMOL_LIB} ${RDGEOMETRYLIB_LIB}
              ${RDGENERAL_LIB})
    endif()
    if(RDKIT_LIBRARIES)
            message(STATUS "Found RDKit library files at ${RDKIT_LIBRARIES}")
    endif()
  endif()

  if(RDKIT_INCLUDE_DIR AND RDKIT_LIBRARIES)
    set(RDKIT_FOUND TRUE)
  endif()

  mark_as_advanced(RDKIT_INCLUDE_DIR RDKIT_LIBRARIES)
endif()

Bringing sugars to OPSIN

I’m pleased to announce the release of OPSIN 1.4.0. This new release brings significant improvements to OPSIN’s coverage of carbohydrate nomenclature. It also complements NextMove Software’s Sugar & Splice project that aims to make the conversion between carbohydrate and small molecule representations effortless.

Below is the effect this improvement to OPSIN has had on the conversion of IUPAC names in ChEBI. (This is one of the data sets used in the OPSIN publication [free access])

Number of names convertible to InChI on IUPAC names from ChEBI (Sept 2010)

Examples of new nomenclature supported (pictures generated by the OPSIN web service)

3-Deoxy-alpha-D-manno-oct-2-ulopyranosonic acid	beta-D-Fructofuranosyl alpha-D-glucopyranoside
Methyl 2,3,4-tri-O-acetyl-alpha-D-glucopyranosyluronate bromide	3-O-beta-D-galactosyl-sn-glycerol

OPSIN 1.4.0 is available from Bitbucket and Maven Central. The full release notes are below:

Added support for dialdoses,diketoses,ketoaldoses,alditols,aldonic acids,uronic acids,aldaric acids,glycosides,oligosacchardides, named systematically or from trivial stems, in cyclic or acyclic form
Added support for ketoses named using dehydro
Added support for anhydro
Added more trivial carbohydrate names
Added support for sn-glcyerol
Improved heuristics for phospho substitution
Added hydrazido and anilate suffixes
Allowed more functional class nomenclature to apply to amino acids
Added support for inverting CAS names with substituted functional terms e.g. Acetaldehyde, O-methyloxime
Double substitution of a deoxy chiral centre now uses the CIP rules to decide which substituent replaced the hydroxy group
Unicode right arrows, superscripts and the soft hyphen are now recognised

On the other hand

The vast majority of amino acid residues that appear in peptides and proteins appear as their natural L-form enantiomer. This is the form of all amino acids as translated by the ribosome. However post-translational modification, such as by racemases, or peptide synthetic methods can be used to introduce the mirror image D-forms of amino acids into peptidic compounds.

Though rare, it is often important to keep track of whether individual amino acids are L-form, D-form, unknown or a racemic mixture of the two (DL-form). Rather unhelpfully, IUPAC rule 3AA-3.3 from IUPAC’s “Nomenclature and Symbolism for amino acids and peptides” states that the configuration prefix may be omitted for amino acids from a natural protein source (where the configuration may be assumed to be L) and for amino acids from synthetic sources (where the configuration is assumed to be an equimolecular mixture of enantiomers).

In the RCSB’s PDB file format, three letter residue codes have been assigned for both the L- form and the D- form of the traditional 19 amino acids other than glycine. Glycine (PDB residue code GLY) has an achiral α-carbon and therefore does not have an L-form and a D-form. For convenience, the table below lists the correspondence between PDB’s three letter codes for these amino acids.

Name	L-form	D-form
Alanine	ALA	DAL
Arginine	ARG	DAR
Asparagine	ASN	DSG
Aspartic Acid	ASP	DAS
Cysteine	CYS	DCY
Glutamine	GLN	DGN
Glutamic Acid	GLU	DGL
Histidine	HIS	DHI
Isoleucine	ILE	DIL
Leucine	LEU	DLE
Lysine	LYS	DLY
Methionine	MET	MED
Phenylalanine	PHE	DPN
Proline	PRO	DPR
Serine	SER	DSN
Threonine	THR	DTH
Tryptophan	TRP	DTR
Tyrosine	TYR	DTY
Valine	VAL	DVA

The above table is believed to be the only internet resource conveniently linking the two PDB residue codes for enantiomeric forms of amino acids.

Unfortunately for the two recent natural amino acids, selenocysteine (PDB residue code CSE) and pyrrolysine (PDB residue code PYH), no PDB residue codes have yet been assigned for their enantiomeric D-forms.

Image credit: The Joneses (where are the joneses on Flickr)

Making Sense of Patent Tables

Tabular data in patents is a useful source of experimental data and chemical structures. USPTO patents are available back to 1976 in formats where tables are explicitly annotated. For more recent patents these are XML tables similar in structure to what would be expected in HTML. Unfortunately the format used from 1976-2000 is not quite so straightforward to interpret leading to naive interpretations producing output that does not at all resemble the actual table, often with chemical name fragments scattered:

USPTO Patent Full-Text and Image Database:

FreePatentsOnline:

The format for these tables is briefly documented by the USPTO but the description raises as many questions as answers:

Columns are delimited by one or more spaces… but a cell may contain spaces!
An overly long cell may be split over multiple lines due the format being limited to 80 characters per line
Where in the printed patent a cell spanned multiple rows it spans multiple lines in the format.

As the format is based on the how the tables were printed perfect reproduction of the semantics of these tables appears impossible, but a good approximation can be achieved.

After processing PatFetch produces:

Much better 🙂

(the colouring of Example 22 is due to “tertbutyl” being recognised as a misspelling of “tert-butyl”)

The method broadly works by:

Identifiying the header, body and footer
Producing a putative table layout
Splitting cells where a single space is determined to be a split point between two columns
Merging cells that are determined to be a continuation of a previous cell

Text Mining for a Worthy Cause

I recently received an e-mail from the charity “jeans for genes” introducing me to “black bone disease“, a rare genetic disease without a cure. It is more formally known as “Alkaptonuria” (OMIM entry) and is a defect in the homogentisate 1,2-dioxygenase gene (HGD) which leads to a toxic build-up of homogentisic acid in the blood, causing the symptoms of the disease.

Interestingly a re-purposed herbicide, nitisinone, is currently being investigated as a possible treatment for the disease based on its previous re-purposing as a therapy in related genetic disorder, Type 1 Tyrosinemia.

The story starts in 1977 when a researcher in California observed that relatively few weeds were growing under the bottlebrush (Callistemon) plants in his backyard. Analytical chemistry of the soil fractions revealed the active compound to be the natural product Leptospermone. Traditional ligand based optimization of this compound led to the effective herbicides mesotrione (Syngenta’s Callisto) and nitisinone being synthesized and tested in 1984, with the first patents on this class of herbicides appearing in 1986 (e.g. US 4780127). At the point these patents were filed/granted, the mechanism of action and protein target weren’t yet known, although they were experimentally proven to be toxic to plants but harmless to mammals. Much later it was discovered that these compounds worked by inhibiting the enzyme 4-hydroxyphenylpyruvate dioxygenase (HPPD) which blocks the synthesis of chlorophyll and leads to “bleaching” and eventual plant death.

It is the role that HPPD plays in human metabolism that make these herbicides so interesting as therapeutic agents. The pathway diagram below describes the five enzymatic steps (arrows) in the degradation metabolism of tyrosine.

Defects in these various enzymes responsible for each step lead to a number of related diseases: Problems with the first step, tyrosine-transaminase, cause type 2 tyrosinemia; the second step, p-Hydroxylphenylpyruvate-dioxygenase (HPPD) is our herbicide target for which defects cause type 3 tyrosinemia; step three, homogentisate dioxygenase (HGD) causes alkaptonuria (aka black bone disease); and step 5, 4-fumaryl-acetoacetate hydrolase causes type 1 tyrosinemia.

In the case of type 1 tyrosinemia, it was noticed that those patients with active HPPD had a more severe form of the disease, so it was hypothesized that a HPPD inhibitor may be beneficial. At the time Zeneca worked on both pharmaceuticals and crop protection and were able to evaluate their proven-safe herbicide nitisinone directly in the clinic. In what seems incredible by the standards of today’s pharmaceutical pipelines, their US 5550165 patent filing describes the administration to, and recovery of, sick infants and children, where it is now more usual for a drug candidate to spend years in phase I, II and III clinical trials after a patent is granted before it gets approved by the FDA.

HPPD inhibitors can be anticipated to treat alkaptonuria by much the same mechanism:
By blocking the formation of the toxic metabolite homogentisate, and causing tyrosine
to be metabolised via alternate routes.

One of the goals of modern text mining is to automatically discover links such as those between the above two patents, US4780127 and US5550165. Unfortunately, a range of technical issues complicate the process: In common with many pharmaceutical patent filings, the drug target is not known or not mentioned, so it is necessary to identify and annotate compound classes or modes of action such as “kinase inhibitor”, “beta-blocker”, “herbicide” or “antibiotic”. The large number of synonyms and typographical variants of enzyme and disease names requires the use of synonym dictionaries or ontologies to recognize that “tyrosine transaminase” is the same entity as “tyrosine aminotransferase” is the same as “EC 2.6.1.5“. Finally, as revealed by the mistake “tyosinemia” in the title of the above US 5550165, documents in real life frequently contain spelling errors, making it impossible to find the most relevant documents when searching for a keyword like “tyrosinemia” (without automatic spelling correction).

These are exactly the types of challenges our LeadMine software attempts to tackle.

Building and climbing a Chemical Ladder

A recent post by Jake Vanderplas described Word Ladders, and gave me an idea. A Word Ladder is a game developed by Lewis Carroll that involves converting one word into another one letter at a time, while passing through only valid words. For example, to convert MAD into HAT, a valid word ladder would be MAD->MAT->HAT or MAD->HAD->HAT.

Here I propose the term Chemical Ladder for a Word Ladder that is restricted to chemical names. For example, try converting from Barbamate to Carbazide in 4 steps only using valid chemical names. Or from Anginine to Arsonite.

So, how did I come up with these examples? Well, NextMove Software’s CaffeineFix can do chemical spelling correction based on a dictionary (or grammar). The spelling suggestions provided by CaffeineFix are substitutions, deletions and insertions, but if we consider just the 1-letter substitutions, this is exactly the transformation needed to build a word ladder, e.g.

>>> from caffeinefix import CaffeineFix
>>> cf = CaffeineFix("mydictionary.cfx")
>>> list(cf.suggest("azite"))
["azote", "azine"]

To begin with I downloaded the list of synonyms from PubChem, and filtered to remove database identifiers and various other cruft. I compiled these into a CaffeineFix dictionary, and edited the suggest method to just return substitutions (suggest_substitutions in the code below). The code shown below then uses the suggested substitutions for each chemical name to create a graph that I visualised to identify Chemical Ladders (see example below). A longer version of the code could be written to identify the Chemical Ladders more automatically.

Just in case any highly-respected and discerning chemistry society wants to include Chemical Ladders in its weekly or monthly magazine, I’ve decided not to include the full output of the program in this blogpost, apart from the image above. What do you think? Could this be the next ChemDoku?

from collections import defaultdict

from caffeinefix import CaffeineFix

def difference(a, b):
    for i, (d, e) in enumerate(zip(a,b)):
        if d != e:
            return i

def nearest(name):
    suggestions = list(cf.suggest_substitutions(name))
    suggestions.remove(name)
    return (name, suggestions)

replacements = [("0", "zero"), ("%", "PERCENT"), ("+", "PLUS"), ("4", "FOUR"),
                ("7", "SEVEN"), ("9", "NINE"), ("6", "SIX"), ("8", "EIGHT"),
                (")", "RBRACKET"), ("'", "APOSTROPHE"), ("@", "AT"),
                ("}", "RBRACKETB"), ("{", "LBRACKETB"), (":", "COLON"),
                ("/", "FSLASH"), (".", "PERIOD"), ("&", "AMPERSAND"),
                ("^", "CIRCUMFLEX"), ("[", "LBRACKETC"), ("]", "RBRACKETC"),
                ("|", "PIPE"), (";", "SEMICOLON")]

def fix(name):
    name = name.replace("-", "_").replace("1", "one").replace("2", "two").replace("3", "three").replace("5", "five").replace(",", "_").replace("?", "Q").replace("(", "LB").replace(" ", "SPACE")
    for x, y in replacements:
        name = name.replace(x, y)
    return name

if __name__ == "__main__":
    cf = CaffeineFix(r"C:\Tools\LargeData\PubChem_Synonyms\pubchem.cfx")
    names = [x.strip() for x in open(r"C:\Tools\LargeData\PubChem_Synonyms\lowercase_sorted_uniq.txt", "r") if x.rstrip()
             and len(x) == 10]
    results = map(nearest, names)

    output = open("wordladder_10.txt", "w")
    output.write("graph graphname {\n")

    for x, y in results:
        if len(y) > 1:
            collate = defaultdict(list)
            for w in y:
                collate[difference(x, w)].append(w)
            if len(collate) > 1:
                for v in collate.values():
                    output.write('%s [label="%s"];\n' % (fix(x), x))
                    for z in v:
                        output.write("%s -- %s\n" % (fix(x), fix(z)))
                        output.write('%s [label="%s"];\n' % (fix(z), z))

    output.write("}\n")
    output.close()

Note: A couple of people have asked why are there two edges for only some of the connections in the graph. This would be the case if I retained all of the original edges, as if A is a spelling correction of B, then B is a spelling correction of A. However, since a word ladder can only exist if a node in the graph has at least two connections, I filter out all those cases where a node has only a single connection (otherwise you end up with a lot of ‘word ladders’ composed of just two words). So, if I have A->B, B->(A, C), C->(B, D), D->C, then A->B and D->C will be removed, and the graph will be A-B=C-D.