When compression makes things bigger

We’ve been looking into supporting Self-Contained Sequence Representation (SCSR) in Sugar&Splice (NextMove Software’s biologics perception, conversion, and depiction toolkit, as used by PubChem). SCSR is reported (Chen et al. 2011) as a “compressed format that retains chemistry detail”.

At NextMove, we’ve long argued that the best way to store peptides for registration is as the full connection table rather than as a compressed form. The primary advantage is that existing infrastructure for compound registration can be reused with minimal or no changes. On modern hardware, traditional cheminformatics algorithms can easily handle much larger structures (Sayle et al. 2015). An obvious problem with compressed storage is that, without peptide perception (e.g. using Sugar&Splice), duplicates are missed if a user inputs a fully expanded structure instead of a compressed representation. A more subtle problem emerges with modified amino acids in compressed representations, e.g. pyroglutamic acid may be considered different depending on whether it was entered as a modified glutamic acid or a modified proline.

Having distinct registration systems for peptides and compounds is more complex, and therefore more error-prone and more expensive to maintain.


When I generated the SCSR output I noticed that each line for a monomer looked longer than the SMILES for each fully expanded monomer. This means that while in theory this is a compressed format, it’s actually still larger than an uncompressed SMILES string. To demonstrate, here are different representations of Beefy Meaty Peptide:

IUPAC Condensed: H-Lys-Gly-Asp-Glu-Glu-Ser-Leu-Ala-OH

SCSR (NextMove):
  0  0  0     0  0            999 V3000
M  V30 COUNTS 8 7 0 0 0
M  V30 1 Lys 1.0 1.0 0 0 CLASS=AA ATTCHORD=(2 2 Br) SEQID=1
M  V30 2 Gly 2.0 1.0 0 0 CLASS=AA ATTCHORD=(4 1 Al 3 Br) SEQID=2
M  V30 3 Asp 3.0 1.0 0 0 CLASS=AA ATTCHORD=(4 2 Al 4 Br) SEQID=3
M  V30 4 Glu 4.0 1.0 0 0 CLASS=AA ATTCHORD=(4 3 Al 5 Br) SEQID=4
M  V30 5 Glu 5.0 1.0 0 0 CLASS=AA ATTCHORD=(4 4 Al 6 Br) SEQID=5
M  V30 6 Ser 6.0 1.0 0 0 CLASS=AA ATTCHORD=(4 5 Al 7 Br) SEQID=6
M  V30 7 Leu 7.0 1.0 0 0 CLASS=AA ATTCHORD=(4 6 Al 8 Br) SEQID=7
M  V30 8 Ala 8.0 1.0 0 0 CLASS=AA ATTCHORD=(2 7 Al) SEQID=8
M  V30 1 1 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 1 4 5
M  V30 5 1 5 6
M  V30 6 1 6 7
M  V30 7 1 7 8


To test how the size of these representations scales with peptide length, random linear unmodified peptides of increasing size were generated. The formats listed above were tested, as well as the fully expanded molfile and BIOVIA-generated SCSR (BIOVIA Direct 2017). The difference between the BIOVIA SCSR and the NextMove SCSR (shown above) is that the BIOVIA output includes the expanded template (i.e. a monomer definition) for each standard amino acid that occurs. This adds a small storage overhead that varies with the number of unique monomers.
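The line-notation part of this experiment can be approximated in a few lines of Python. This is a minimal sketch rather than the script behind the charts: the helper names are hypothetical, the FASTA form is assumed to be the bare one-letter sequence with no header line, and the HELM string assumes the simple single-chain form.

```python
import random

# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def random_peptide(length, rng):
    """Random linear, unmodified peptide as a one-letter sequence."""
    alphabet = sorted(THREE_LETTER)
    return "".join(rng.choice(alphabet) for _ in range(length))


def iupac_condensed(seq):
    """IUPAC condensed form with free N- and C-termini."""
    return "H-" + "-".join(THREE_LETTER[aa] for aa in seq) + "-OH"


def helm(seq):
    """Single-chain HELM string (assumed simple form, no connections)."""
    return "PEPTIDE1{" + ".".join(seq) + "}$$$$"


def sizes_in_bytes(seq):
    """Storage size of each line notation for one sequence."""
    forms = {"FASTA": seq, "HELM": helm(seq), "Condensed": iupac_condensed(seq)}
    return {name: len(text.encode("ascii")) for name, text in forms.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (8, 64, 512):
        print(n, sizes_in_bytes(random_peptide(n, rng)))
```

Under these assumptions the sizes grow as n, 2n + 13, and 4n + 4 bytes for a peptide of n residues, so the three notations stay in a roughly 1:2:4 ratio.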

The results are shown below. The molfile gets reasonably large (max 500KB+), though even this could still be stored on modern hardware. The SMILES (max 16KB+) peaks just above the more condensed formats of FASTA (max 1KB), HELM (max 2KB+), and Condensed (max 4KB+).


Using a log2 scale it’s easier to read the storage size:


An observation from the chart is that for small peptides the SCSR produced by BIOVIA (with monomer definitions) is actually larger than the molfile (also produced by BIOVIA). Crambin (e.g. 1CRN) is often considered the boundary between a small molecule and a protein. At 46 amino acids, it turns out that reduced crambin is smaller when stored as a fully expanded molfile than as SCSR:

Format         Bytes
SCSR (BIOVIA)  20,448
Molfile        18,130


Sketchy Sketches

Chemical structure diagrams are essential in describing and conveying chemistry. Extracting chemistry from documents using text-mining (see NextMove Software’s LeadMine) is extremely useful but will miss anything described only by an image.

As a general approach to mining chemistry from images, one may consider using image-to-structure programs such as OSRA, CLiDE, ChemOCR, and Imago OCR. However, image-to-structure is neither easy nor quick and can be prone to compounding errors (e.g. OCR).

At NextMove we approach this problem slightly differently. It turns out that in some cases the source sketch files used to create the chemical diagrams may be available and provide a ‘cleaner’ data source than the raster images.

Although the data is ‘cleaner’ in terms of digital representation, naïvely exporting the connection table stored in a sketch file can lead to artificial and erroneous structures. The main problems stem from the stored representation (connection table) imprecisely reflecting what is displayed. To account for these issues, the NextMove Software converter (code name: Praline) applies correction, interpretation, and categorisation to sketches. The transformed connection table (currently written as ChemAxon Extended SMILES [CXSMILES]) better reflects what is actually displayed.

Let’s take a look at what’s possible with three examples:

1) US 2015 344500 A1

Method 9 in US 2015 344500 A1 describes a four step synthesis:


Using image-to-structure, SureChEMBL extracts four structures. I’ve added the titles to make it easier to pair them up:

Compound 2-2 (OCR error)
SCHEMBL17309138 / CID 118554493
Compound 9-1 (part)
SCHEMBL12363 / CID 10008
Compound 9-2
SCHEMBL17307813 / CID 118553325
Compound 9-3
SCHEMBL17309143 / CID 118554498

Compound 2-2 was not correctly extracted and it looks like OCR has mistakenly recognised the -OBn as -OBu. The fluorobenzene probably comes from Compound 9-1, where the label (Boc)2N- is difficult to recognise. The products of Step 4 contain valence errors and were probably thrown out as a recognition error.
However, by reading the ChemDraw files directly it’s possible to extract everything “warts and all”. To process this sketch the key interpretation phases are:

  • Line Formula Parsing – Using a strict yet comprehensive algorithm, condensed labels are corrected and expanded.
  • Reaction Role Assignment – The reaction scheme layout is common to patents and made easier by looking for the USPTO-specific ‘splitter’ tag. To make valid reactions, reactants are duplicated and added to the previous step.
  • Agent Parsing – Based on its location, the complete label “Boc2, DIEA” can be correctly processed. Agents can be a mix of trivial names, systematic names and formulas.
  • Clear Ambiguous Stereochemistry – One of the hashed wedges in Compounds 9-1, 9-2, and 9-3 is poorly placed between two stereocentres. In the stored representation both stereocentres are defined, but we remove the definition at the wide end of the wedge.
  • Category Assignment – Based on the content, we tag the output with a category for quick filtering. This is described more in the poster (see below).
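To give a feel for what strict line formula parsing involves, here is a minimal sketch that expands a condensed label into a SMILES fragment. It is not Praline’s algorithm: the token table, the function names, and the restriction to “linker atoms followed by one trailing group” are all assumptions made for illustration.

```python
# Hypothetical token tables. Each fragment is written so that its first
# atom is the one bonded to whatever precedes it in the label.
LINKER_ATOMS = {"O": "O", "S": "S", "N": "N"}
GROUPS = {"Me": "C", "Et": "CC", "Bu": "CCCC", "Bn": "Cc1ccccc1"}
TOKENS = {**LINKER_ATOMS, **GROUPS}


def tokenise(label):
    """Split a label into known tokens, longest match first.

    Strict by design: unrecognised text raises instead of being guessed.
    """
    candidates = sorted(TOKENS, key=len, reverse=True)
    tokens, pos = [], 0
    while pos < len(label):
        token = next((t for t in candidates if label.startswith(t, pos)), None)
        if token is None:
            raise ValueError(f"unrecognised text at position {pos} in {label!r}")
        tokens.append(token)
        pos += len(token)
    return tokens


def expand_label(label):
    """Expand a label such as 'OBn' into a SMILES fragment."""
    tokens = tokenise(label)
    # Simplification: a multi-atom group may only appear as the last token.
    if any(token in GROUPS for token in tokens[:-1]):
        raise ValueError(f"only a trailing group is supported: {label!r}")
    return "".join(TOKENS[token] for token in tokens)


if __name__ == "__main__":
    print(expand_label("OBn"))  # benzyloxy
    print(expand_label("OBu"))  # butoxy, the OCR misreading
```

The point of being strict is that a label the parser does not fully understand is reported rather than silently turned into a plausible-looking but wrong structure.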

Here are the results of our extraction, categorised as specific reactions:










Compounds 2-2 and 9-1 are now correctly extracted and actually novel to PubChem. We don’t try to correct author errors and so the bad valence is also preserved as drawn in Step 4.

2) US 7092578 B2

US 7092578 B2 is not a chemical patent but does have ChemDraw files. Here ChemDraw has been misused to draw tables, and direct export results in a cyclobutane grid. These are a well-known class of bad structures in PubChem and have been referred to as chessboardanes. In addition to extracting the chemical structure, Praline assigns a categorisation code. This allows us to flag structures with potential problems as well as those with no real chemistry at all.


Resulting PubChem Compound CID 21040251:



Praline assigns the category No Connection Table and so it can be easily ignored.
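As an illustration of how such a sketch might be flagged, the following is a deliberately crude heuristic, not the categorisation Praline actually performs. It assumes a connected drawing given as 2D atom coordinates and bond index pairs, and treats it as a drawn table when every atom is an unlabelled carbon, every bond is horizontal or vertical, and there are at least two rings.

```python
def looks_like_drawn_table(atoms, bonds, tol=1e-3):
    """Heuristic check for a 'chessboardane'.

    atoms: list of (symbol, x, y); bonds: list of (i, j) index pairs.
    Assumes the drawing is a single connected component.
    """
    if not atoms or any(symbol != "C" for symbol, _, _ in atoms):
        return False
    for i, j in bonds:
        _, x1, y1 = atoms[i]
        _, x2, y2 = atoms[j]
        horizontal = abs(y1 - y2) <= tol
        vertical = abs(x1 - x2) <= tol
        if not (horizontal or vertical):
            return False
    # Number of independent rings in a connected graph.
    ring_count = len(bonds) - len(atoms) + 1
    return ring_count >= 2


def grid(rows, cols):
    """Build a rows x cols lattice of carbons, as a table export would."""
    atoms = [("C", float(c), float(r)) for r in range(rows) for c in range(cols)]
    bonds = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                bonds.append((i, i + 1))
            if r + 1 < rows:
                bonds.append((i, i + cols))
    return atoms, bonds


if __name__ == "__main__":
    print(looks_like_drawn_table(*grid(3, 3)))  # a 2 x 2 cell table
    print(looks_like_drawn_table(*grid(2, 2)))  # a lone square (cyclobutane)
```

A real ladderane drawn on a square grid would also trip this check, which is one reason a production categoriser needs more context than the connection table alone.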

3) US 6531452 B1

Strange connection tables don’t just come from non-chemistry patents. US 6531452 B1, like many chemistry patents, contains a generic (Markush) claim. Earlier we saw the label -OBn misread by OCR. Even without OCR, a condensed label may be expanded wrongly in the underlying representation, particularly if the structure is generic.

“…at least one of R2 and R3 is”

In PubChem you’ll find the compound CID 22976968 has been extracted from this sketch:



Where did it come from? Well it turns out the generic label >C(R41)(R41) has been automatically expanded and stored in the file as:


Somewhere the Rs have been promoted to carbons and submitted to PubChem. Praline recognises and interprets generic labels and the attachment points (drawn here as tert-butyl) and categorises the sketch as a generic substituent. Here’s the output:


C1(C(C(C(N1*)(*)*)(*)*)C(*)(*)*)=O.*CCC(N)=O |$;;;;;R41;R41;R41;R41;R41;;_AP1;R41;R41;;_AP1$,Sg:n:3,6,7:n:ht|
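The atom labels in this output can be read back with a small amount of code. The sketch below only handles the `$...$` atom-label block of a CXSMILES extension and assumes it is the first `$`-delimited block present; the function names are hypothetical and a full CXSMILES parser needs to do considerably more.

```python
import re

EXAMPLE = (
    "C1(C(C(C(N1*)(*)*)(*)*)C(*)(*)*)=O.*CCC(N)=O "
    "|$;;;;;R41;R41;R41;R41;R41;;_AP1;R41;R41;;_AP1$,Sg:n:3,6,7:n:ht|"
)


def split_cxsmiles(text):
    """Separate the SMILES from the extension between the '|' characters."""
    smiles, _, extension = text.partition(" |")
    return smiles, extension.rstrip("|")


def atom_labels(extension):
    """Map atom index -> label from the '$...$' block (empty fields skipped)."""
    match = re.search(r"\$(.*?)\$", extension)
    if match is None:
        return {}
    fields = match.group(1).split(";")
    return {index: label for index, label in enumerate(fields) if label}


if __name__ == "__main__":
    smiles, extension = split_cxsmiles(EXAMPLE)
    for index, label in sorted(atom_labels(extension).items()):
        print(index, label)
```

Labels are positional: the n-th semicolon-separated field belongs to the n-th atom of the SMILES, so the `R41` and `_AP1` (attachment point) labels line up with the `*` atoms.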


Image-to-structure is slow; because of this, SureChEMBL has so far only processed images from 2007 onwards with image-to-structure (Papadatos G. 2015). In contrast, Praline can process the entire archive of US Patent Applications and Grants, with more than 24 million ChemDraw files (2001 onwards), in only 5 hours (single-threaded).

Although the naïve molfile exports from the ChemDraw sketches are provided by the USPTO, they have less information than the source ChemDraw sketch file. Reading the pre-exported molfile is significantly less accurate than interpreting the ChemDraw sketch, and even image-to-structure often produces more accurate results.

Other than U.S. Patents, this technology can be applied to sketch files extracted from Electronic Lab Notebooks (see NextMove Software’s HazELNut) as well as Journals where the publishers have held on to the sketch file submissions.

At the upcoming ACS in Philadelphia, Daniel will be presenting how some structures can only be extracted when the output from text-mining and sketches are combined. “The whole is greater than the sum of the parts” – Aristotle.

A poster on this work was presented at the 7th Joint Sheffield Conference on Chemoinformatics:


Biopolymer Canonicalisation Scaling Between Toolkits

We’ve previously shown using all-atom structure representations is a tractable approach to handling biologics (see https://nextmovesoftware.com/blog/2014/11/). Handling biologics in this way allows you to reuse existing registration infrastructure (e.g. Canonical SMILES/InChI/CACTVS keys).

At the Fall ACS ’15, Roger presented an update on this ongoing work showing that many popular open-source cheminformatics toolkits can already handle peptides < 500 AA (the size of immunoglobulin heavy chains) in less than a second. We timed the generation of a canonical SMILES string (from the internal representation) over SwissProt. With the exception of Indigo/CDK (which hit hard error limits), the lines stop due to time constraints.


One thing the timings highlighted was recent improvements in RDKit that show faster canonicalisation and reduced scatter (similar size structures ~ same amount of time). CDK was originally limited by the number of primes listed (it uses product of primes for refinement); patching the CDK to use more primes allows it to encode biopolymers of over 1000 AA.
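To see why a fixed list of primes becomes a hard limit, here is a simplified sketch of one product-of-primes refinement step. It is not the CDK’s code: the function names and the data layout (a list of ranks and a list of neighbour lists) are assumptions. Each atom’s rank is mapped to a prime, neighbouring primes are multiplied, and atoms are re-ranked; once an atom’s rank exceeds the number of primes available the step cannot proceed.

```python
def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def refine_once(ranks, neighbours, primes):
    """One refinement step: split rank classes using neighbour prime products.

    ranks: 1-based rank per atom; neighbours: list of neighbour index lists.
    Raises ValueError when a rank has no corresponding prime in the table.
    """
    products = []
    for nbrs in neighbours:
        product = 1
        for nbr in nbrs:
            rank = ranks[nbr]
            if rank > len(primes):
                raise ValueError(f"rank {rank} exceeds prime table of {len(primes)}")
            product *= primes[rank - 1]
        products.append(product)
    # Re-rank on (old rank, product) so existing distinctions are preserved.
    keys = sorted(set(zip(ranks, products)))
    new_rank = {key: position + 1 for position, key in enumerate(keys)}
    return [new_rank[key] for key in zip(ranks, products)]


if __name__ == "__main__":
    # 2-methylbutane heavy atoms: 0-1-2-3 with atom 4 attached to atom 1.
    neighbours = [[1], [0, 2, 4], [1, 3], [2], [1]]
    ranks = [1, 3, 2, 1, 1]  # initial ranks from the degree
    print(refine_once(ranks, neighbours, first_primes(3)))
```

The number of distinct ranks grows towards the number of atoms as refinement proceeds, so a structure with thousands of heavy atoms needs a prime table of comparable length.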


Roger’s full talk is available here:

Substructure Search Face-off: Are the slowest queries the same between tools?

At the recent Cambridge Cheminformatics Network Meeting (CCNM) we presented a performance benchmark of substructure searching tools using the same queries, target dataset, and hardware. Whilst many tools publish figures for isolated benchmarks, the use of different query sets and variations in target database size make it impossible to determine how tools compare to each other.

The talk compared the performance of various tools and offered insight into their performance characteristics.

A question was asked at the talk as to whether the slowest queries were always the same. As expected there is some correlation (benzene is always bad) but there are some rather dramatic differences within and between tools. For example, the time taken to query Anthracene or Zinc varies with some tools finding Anthracene hits faster (marked as <) and others finding Zinc faster (marked as >).

The rank of slowest queries (per tool) is provided as a guide to how many queries took more time than listed here.

             Anthracene                       Zinc
Tool         Query Time (s)  Rank (slow)     Query Time (s)  Rank (slow)
arthor                2.254            3  >           0.357         2602
arthor+fp             0.022          285  >           0.001         1667
rdcart                0.698          794  <             202            4
rdlucene             27.126          566  >           23.87          600
pgchem               28.231          138  >          18.181          197
mychem               48.289          108  >          34.145          159
fastsearch              396           99  >             285          126
bingo-nosql           0.448          451  <           1.311          260
bingo-pgsql           0.392          638  >           0.060         1228
tripod-ss            21.797          350  <            1441           18
orchem               27.075          906  >           0.721         2390
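The `<` and `>` marks can be reproduced directly from the timings. The snippet below simply copies the query times from the table above into a dictionary (the variable and function names are ours) and lists the tools for which Anthracene was the faster of the two queries.

```python
# (anthracene_seconds, zinc_seconds) per tool, copied from the table above.
TIMES = {
    "arthor": (2.254, 0.357),
    "arthor+fp": (0.022, 0.001),
    "rdcart": (0.698, 202),
    "rdlucene": (27.126, 23.87),
    "pgchem": (28.231, 18.181),
    "mychem": (48.289, 34.145),
    "fastsearch": (396, 285),
    "bingo-nosql": (0.448, 1.311),
    "bingo-pgsql": (0.392, 0.060),
    "tripod-ss": (21.797, 1441),
    "orchem": (27.075, 0.721),
}


def mark(anthracene, zinc):
    """'<' when Anthracene was faster, '>' when Zinc was faster."""
    return "<" if anthracene < zinc else ">"


def anthracene_faster(times):
    """Tools that answered the Anthracene query faster than the Zinc query."""
    return [tool for tool, (a, z) in times.items() if a < z]


if __name__ == "__main__":
    for tool, (a, z) in TIMES.items():
        print(f"{tool:<12} {a:>8} {mark(a, z)} {z:>8}")
    print(anthracene_faster(TIMES))
```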

As promised, the query and target IDs are available here.

If this is an area of interest to you, feel free to get in touch.

Casandra – Chemical Hazard Alerting For The 21st Century

At BioIT World, Roger presented a poster on NextMove Software’s Casandra. Casandra is a server for alerting of chemical and reactive hazards.

“Patents you wouldn’t want to work with”

Readers of Derek Lowe’s In The Pipeline may be familiar with a series of posts titled “Things I won’t work with”. In the spirit of those posts, we recently ran Casandra on one million reactions extracted from US Patents. One patent highlighted in this preliminary analysis contained an extremely energetic compound.

US20100081811A1 [paragraph:18]
It turns out the above patent is actually from a defence agency and they were intending to make explosives. A more subtle reactive hazard was found in US20020173655A1.

US20020173655A1 [paragraph:374]
Here the amide (DMF) and metal hydride (NaH) can react exothermically in a self-accelerating reaction[1,2].
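A pairwise incompatibility check of this kind can be sketched very simply. This is an illustration only and says nothing about how Casandra is implemented: the reagent-to-class table, the single rule, and the function name are all hypothetical, and a real system would classify reagents from their structures rather than their names.

```python
from itertools import combinations

# Hypothetical name-to-class lookup for the reagents in this example.
REAGENT_CLASSES = {
    "DMF": "amide",
    "NaH": "metal hydride",
    "THF": "ether",
}

# Hypothetical rule table keyed on unordered pairs of reagent classes.
INCOMPATIBLE = {
    frozenset({"amide", "metal hydride"}): "exothermic, self-accelerating reaction",
}


def hazard_alerts(reagents):
    """Return (reagent_a, reagent_b, reason) for every flagged pair."""
    alerts = []
    for a, b in combinations(reagents, 2):
        classes = frozenset({REAGENT_CLASSES.get(a), REAGENT_CLASSES.get(b)})
        reason = INCOMPATIBLE.get(classes)
        if reason:
            alerts.append((a, b, reason))
    return alerts


if __name__ == "__main__":
    for alert in hazard_alerts(["DMF", "NaH", "THF"]):
        print(alert)
```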

The Casandra poster is available here. If this is an area of interest to you feel free to get in touch with us.

[1] J. Buckley et al., Chemical & Engineering News, Jul. 12, 1982, page 5
[2] G. DeWail, Chemical & Engineering News, Sep. 13, 1982

For every fingerprint optimisation, there is an equal and opposite fingerprint deterioration

Chemical fingerprints are used for both similarity and substructure searching. When used for similarity, a score accounts for the features shared by, and different between, compounds. For substructure searching, the fingerprint provides a prescreen of potential hits by enforcing that every feature encoded in the query fingerprint must be present in the reference. If a single feature is found in the query but not in the reference, can the reference safely be discarded?
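The two uses can be written down in a few lines. In this sketch fingerprints are plain Python integers used as bit sets; the function names are ours and the bit patterns are arbitrary examples rather than output from any toolkit.

```python
def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(fp_a, fp_b):
    """Similarity: shared bits divided by bits set in either fingerprint."""
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 0.0


def may_match(query_fp, reference_fp):
    """Substructure prescreen: every query bit must be set in the reference."""
    return (query_fp & reference_fp) == query_fp


def prescreen(query_fp, database):
    """Names of references that survive the prescreen and need a full match."""
    return [name for name, fp in database.items() if may_match(query_fp, fp)]


if __name__ == "__main__":
    query = 0b0110
    database = {"ref1": 0b1110, "ref2": 0b1010}
    print(prescreen(query, database))
    print(tanimoto(query, database["ref1"]))
```

The prescreen is only safe if a bit set for the query is guaranteed to be set for every molecule that contains the query as a substructure.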

Common types of fingerprints include: substructure keys (MACCS, CACTVS), path (Daylight), circular (ECFP, Morgan), tree, and n-gram (LINGOS, IBM). The fingerprint examples described below are often documented as being similarity fingerprints or “optimised for similarity”, but it isn’t always stated that their use should be avoided for substructure screening. A fingerprint intended for similarity will often screen out results from a substructure search that do actually match (false negatives).


Circular and n-gram fingerprints inherently cannot be used for substructure filtering as they capture the absence as well as the presence of neighbours [1].


As seen with the circular fingerprint, the number of neighbours (degree) is not invariant between the query and reference: the degree of a reference atom must be equal to or greater than that of the matching query atom. The connectivity/degree therefore cannot be encoded as a feature in the other types of fingerprints if they are to be used for substructure screening.
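A toy example makes the false negative concrete. Here molecules are just lists of element symbols plus neighbour lists, and the “fingerprint” is a set of feature strings; both are simplifications of our own and not how any particular toolkit stores fingerprints. Propane is a substructure of neopentane, yet a feature that records the exact degree screens it out.

```python
def exact_degree_features(symbols, neighbours):
    """One feature per atom encoding the element and its exact degree."""
    return {f"{s}:deg={len(n)}" for s, n in zip(symbols, neighbours)}


def element_features(symbols):
    """One feature per atom encoding only the element."""
    return set(symbols)


def passes_prescreen(query_features, reference_features):
    """All query features must be present in the reference."""
    return query_features <= reference_features


# Heavy-atom graphs: propane C-C-C and neopentane C(C)(C)(C)C.
PROPANE = (["C"] * 3, [[1], [0, 2], [1]])
NEOPENTANE = (["C"] * 5, [[1, 2, 3, 4], [0], [0], [0], [0]])

if __name__ == "__main__":
    print(passes_prescreen(exact_degree_features(*PROPANE),
                           exact_degree_features(*NEOPENTANE)))
    print(passes_prescreen(element_features(PROPANE[0]),
                           element_features(NEOPENTANE[0])))
```

The middle carbon of propane has degree two, but no atom in neopentane does, so the exact-degree feature is missing from the reference even though the match exists.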

Hydrogen count

Similar to connectivity, the hydrogen count of a reference atom may be less than or greater than that of the query atom. The MACCS 166 substructure keys (as used in the open source toolkits) were reoptimised for similarity [2,3]. As some keys match hydrogen counts, they should not be used as a substructure fingerprint:


In the compounds above, the MACCS keys 118 (‘[#6H2]([#6H2]*)*’>1) and 129 (‘[#6H2](~*~*~[#6H2]~*)~*’) are found in the query (left) but not the reference (right). The CACTVS substructure keys also match hydrogens (e.g. bits 329 and 335) and have the same property. As with MACCS, the documentation states that the CACTVS keys are intended for similarity.


Attempting to encode hybridisation is also problematic. Consider the following query and target.


The left is not considered a substructure of the right with the CDK’s hybridisation fingerprinter, as an sp2 carbon in the query is sp1 in the reference.


Care should also be taken with ring size; in particular, the smallest ring size of an atom or bond is not invariant.


This behaviour is observed with the CDK’s ShortestPath fingerprint, where the query (left) has atoms in a smallest ring of size six but the reference (right) has the same atoms in a smaller ring of size five. More subtle issues are found when using the non-unique SSSR [4]. Some effects of the use of (E)SSSR are observed in the CACTVS substructure keys (intended for similarity as stated in the manual).


For these two PubChem Compound entries (CID 135973, CID 9249) the query (left) encodes a four membered ring while the reference (right) does not.

It is possible to encode and match the degree and hydrogen count in a fingerprint, just not as a single feature. Encoding the degree in the feature or layering properties (à la RDKit Fingerprint) can be done safely but is redundant and leads to a denser fingerprint. Ring size information can also be encoded: rather than encoding only the smallest ring, all ring sizes (up to some length) need to be encoded.
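One way to read “not as a single feature” is to emit a separate feature for each degree threshold an atom reaches, so that a higher-degree reference atom still carries every feature of a lower-degree query atom. The sketch below uses the same toy representation (element symbols plus neighbour lists) and is our own illustration, not the scheme used by RDKit or any other toolkit.

```python
def degree_threshold_features(symbols, neighbours):
    """Emit 'element:deg>=k' for every k up to each atom's degree."""
    features = set()
    for symbol, nbrs in zip(symbols, neighbours):
        for k in range(1, len(nbrs) + 1):
            features.add(f"{symbol}:deg>={k}")
    return features


def passes_prescreen(query_features, reference_features):
    """All query features must be present in the reference."""
    return query_features <= reference_features


# Heavy-atom graphs: propane C-C-C and neopentane C(C)(C)(C)C.
PROPANE = (["C"] * 3, [[1], [0, 2], [1]])
NEOPENTANE = (["C"] * 5, [[1, 2, 3, 4], [0], [0], [0], [0]])

if __name__ == "__main__":
    query = degree_threshold_features(*PROPANE)
    reference = degree_threshold_features(*NEOPENTANE)
    print(sorted(query))
    print(sorted(reference))
    print(passes_prescreen(query, reference))
```

The cost is visible in the example: neopentane now sets four degree features where an exact encoding would set two, which is the redundancy and extra density mentioned above.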

Take home message

Different fingerprints exist for different purposes and surprisingly few are truly suitable for substructure filtering. Path and tree fingerprints are generally okay, but care must be taken to ensure variant properties are not encoded. The keen-eyed may notice there is no mention of issues with aromaticity in fingerprints; there are unfortunately too many to list in a single post.

  1. http://pubs.acs.org/doi/abs/10.1021/ci100050t
  2. http://pubs.acs.org/doi/abs/10.1021/ci010132r
  3. http://www.dalkescientific.com/writings/diary/archive/2011/01/20/implementing_cactvs_keys.html
  4. http://docs.eyesopen.com/toolkits/oechem/cplusplus/ring.html#smallest-set-of-smallest-rings-sssr-considered-harmful

Image credit: CPOA