PubChem as a biologics database: The ACS presentation

Keen readers of the blog will have noticed a recent series of blog posts on the topic of PubChem and sequence databases. This was in preparation for my recent ACS presentation on “PubChem as a biologics database” (see below), the goal of which was (a) to convince the audience that PubChem is actually a biologics database (with some small molecules thrown in for disguise), and (b) to show that applying structure perception of biologics (e.g. via Sugar&Splice) on top of an existing chemical database yields useful insights.

Figure 1 – Counts of peptide monomers in PubChem

One perennial question in the field of biologics, at least from the point of view of biologic registration systems, is how many monomers are there and how best to handle them? This depends on how you slice-and-dice them of course, and how many of the long tail of possible monomers you actually support. I show that one view considers PubChem biologics to contain ~27K monosaccharide monomers, and ~8K amino acid monomers (Figure 1).

One of the surprising results was that there are more peptides in PubChem than in the PDB (~500K vs ~110K). This is not quite the case for oligosaccharides, but the number is not too far off the total in GlyTouCan (~67K vs ~80K). The type of peptide/sugar can be quite different though, as the entries in PubChem are those that chemists have gotten their hands on, and it’s full of reaction intermediates with a whole host of protecting groups not usually observed in vivo. Similarly, as I discussed in an earlier blog post, when one looks at sequence variation in PubChem, you’re not getting a view of the million-year process of evolution but rather the sites of variation that a chemist determined might best modulate activity.

Thank you for clicking through my slides, and I’d be happy to take any questions.

See you at the Washington ACS?

Roger, John and I will be presenting talks and a poster at the upcoming 254th ACS National Meeting in Washington. It’s always great to reconnect with people we know, but also see new faces, so say hi if you see us (and ask us for some bandit stickers!).

CINF 13: Pistachio: Search and faceting of large reaction databases
John Mayfield, 11:10 AM – 11:35 AM, Sun, Aug 20, Junior Ballroom 2 – Washington Marriott at Metro Center

We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable ‘read-only’ Electronic Lab Notebook.

In addition to reactions details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross concept querying (e.g. ‘GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction’). Reactions are classified and assigned to leafs in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.

Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.


CINF 17: Comparing CIP implementations: The need for an open CIP
John Mayfield, 2:15 PM – 2:40 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center

The Cahn-Ingold-Prelog (CIP) priority rules have been the corner stone in written communication of stereo-chemical configuration for more than half a century. The rules rank ligands around a stereocentre allowing an atom order and layout invariant stereo-descriptor to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.

There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provide CIP labels and where they disagree. Providing an IUPAC verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy in exchange of written chemical information for all.


CINF 18: We need to talk about kekulization, aromaticity and SMILES
Noel O’Boyle and John Mayfield. 2:40 PM – 3:05 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center

The SMILES format developed by Dave Weininger at the EPA Environmental Research Laboratory in 1986, and subsequently at Daylight Chemical Information Systems, quickly became a de facto standard for chemical information interchange and storage. It is still popular today as a compact representation of a chemical structure that captures atom- and bond-based stereochemistry and is reasonably human-readable and writable. Despite its popularity, there was still some ambiguity around certain aspects of the SMILES notation, leading to a divergence in how different cheminformatics toolkits wrote and interpreted them. To address this, the OpenSMILES effort attempted to document these corner cases, an effort which was largely successful but foundered with the end in sight on the topic of aromaticity in SMILES.

This presentation will cover the following topics, which are currently not described in the OpenSMILES specification:

  • How to read an aromatic SMILES
  • Why are some aromatic SMILES strings not read by toolkits?
  • Should the reader ‘fix’ aromatic SMILES that are not correct?
  • Is ring perception necessary when reading SMILES?
  • Is aromaticity perception necessary when reading SMILES?
  • What is the Daylight aromaticity model?
  • How to write an aromatic SMILES

The goal of this talk is to clarify the discussion around kekulization, aromaticity and SMILES; to distinguish between bugs in implementation and errors in understanding; and ultimately to push towards an updated OpenSMILES specification that describes how to handle these issues.


CINF 64: Chemistry Development Kit v2.0
John Mayfield and Egon Willighagen, 4:00 PM – 4:25 PM, Mon, Aug 21, Junior Ballroom 1 – Washington Marriott at Metro Center

The Chemistry Development Kit (CDK) is an open-source Java library for cheminformatics. It has been developed over the last 16 years by more than 90 contributors, mostly volunteers. This talk will discuss new and improved features of the v2.0 major release and future plans. From previous toolkit versions, performance and robustness issues have been addressed in many areas including: SMILES handling, stereochemistry, depiction, substructure pattern matching, and canonicalisation. A benchmark and discussion on how these improvements were made will be presented. Overall the toolkit now provides a solid foundation upon which advanced cheminformatics systems can be and have been built.


CINF 17: Comparing CIP implementations: The need for an open CIP (Poster at SciMix)
John Mayfield, 8:00 PM – 10:00 PM, Mon, Aug 21, Halls D/E – Walter E. Washington Convention Center

(see above for abstract)


CINF 90: Challenges and successes in machine interpretation of Markush descriptions
Roger Sayle (in lieu of Daniel Lowe), 9:40 AM – 10:10 AM, Tue, Aug 22, Junior Ballroom 2 – Washington Marriott at Metro Center

When scientists think of Markush, the complex structural descriptions present in patents typically come to mind. Although text-mining of the patent literature has allowed specific structures to be indexed, structures described by these Markush structures have remained elusive.

Here we report our progress in interpreting sketches describing generic structure cores, including positional variation, structural repeat units and homology groups. We have successfully combined these generic cores with tables of R-group definitions to provide the specific compounds described. The majority of these compounds are not present in public databases e.g. PubChem. We discuss how generic R-group definitions can be combined with these generic cores to automatically extract Markush structures from patents.

We also demonstrate a proof of concept system to allow generic structural queries, by translation of textual generic terms into predicates (e.g. alkane, spiro, heterocycle) and operators that act on these predicates (e.g. contains, is, and, or). For example “heterocyclic nitrogen containing compound”, ‘cationic ring systems’, ‘cyclic alkanes’, ‘branched acyclic alkanes’, ‘transuranic elements’, ‘zinc compounds” .


CHAS 35: Pharmaceutical industry best practices in lessons learned: ELN implementation of Merck’s reaction review policy
Roger Sayle, 9:25 AM – 9:50 AM, Wed, Aug 23, Room 209C – Walter E. Washington Convention Center

To quote Otto von Bismarck, ‘Only a fool learns from his own mistakes. The wise man learns from the mistakes of others’. Sayle’s corollary: ‘Wise men learn from fatal mistakes’.

In the pharmaceutical industry, sharing chemical safety polices and adopting those of other companies is considered best practice. Recently, for example, the Pistoia Alliance has begun a Chemical Safety Library (CSL) project to formalize and share such information. Here we describe the efforts of one pharmaceutical company to implement Merck’s Reaction Review Policy [1] via automatic alerting within their Electronic Laboratory Notebooks (ELNs). Technical challenges such as capturing the scale of a reaction (volume of reaction vessel) and the concentrations of highly toxic or dangerous reagents will be described. Unfortunately, differences in risk mitigation and risk management between industry and academia (at bench, prep and pilot scales) limits the applicability of such solutions. Even in industry, Chemical Health & Safety investment is rare without a motivating casualty or fatality.


CINF 112: PubChem as a biologics database
Noel O’Boyle, 10:40 AM – 11:05 AM, Wed, Aug 23, Junior Ballroom 1 – Washington Marriott at Metro Center

PubChem, as the standard bearer for online chemical databases, has long been the deposition site of choice for chemical data. In addition to this, it also contains a wealth of information on biologics, as oligosaccharides, oligopeptides and oligonucleotides are essentially medium to large chemicals. Recent years have seen direct depositions of biologic databases into PubChem, in addition to their appearance in vendor catalogs.

Biologics are typically represented using a reduced graph notation, where the constituent monomers are represented by a short name or indeed a single letter, whereas a small molecules uses an all-atom representation. Our system can interconvert between these representations thus enabling a biologics lens on existing chemical data.

Here we describe an analysis of the biologics contained in PubChem, using information publicly available from PubChem under the term “Biologic Description”. Using the biologics subset of PubChem, we will look at the distribution of non-standard amino acids and attached substituents, and investigate questions such as how many knottins are present, are different disulfide bridging architectures present for the same peptide, and how the use of a reference database of named peptides (derived from vendor catalogs, ChEBI, Wikipedia, UniProt, for example) can be leveraged to name peptides as derivatives of the reference entries.

We also consider possible ways of searching these data from grepping the IUPAC condensed representation to more sophisticated methods similar to SMARTS on the underlying data structure.

Chemically-related patent families

The term patent family is generally used to describe a set of patents that cover the same invention but which are filed with different patent authorities. Here instead we look at finding groups of patents within a single authority (the USPTO) where the patents are linked by chemical structures.

It turns out that it is not usual for essentially the same chemical information to appear in multiple patent applications within the USPTO, often with the same or similar title. I’m not sure of the reason for this – perhaps corrections, rewrites, or separate applications for different targets. In any case, it is useful to identify such cases for the purposes of linking or collation, or indeed to discard if looking for truly novel chemistry.

Here’s an approach that appears to work reasonably well: we regard as “chemically-related” two patents that share at least N key (but rare) molecules in common. All that remains is to define “N”, “key”, and “rare”:

  • A key compound is one associated with a compound number (which may be in the text or a ChemDraw file) or associated with an experimental property (taken from a table, and possibly described in terms of R groups that need to be attached to a scaffold).
  • A rare molecule is one that appears in 30 or fewer patents.
  • N was defined as 8.

Naturally, these cutoffs could benefit from some tweaking with a testset (e.g. patents with the same title and assignee), but for the purposes of this blog post they seem to work well. Here is a typical example of a highly-connected chemical patent family, where the labels are the number of key (but rare) molecules in common:

US Patent Application Title
US20030032623A1 Tnf-alpha production inhibitors
US20050014800A1 Angiogenesis inhibitor
US20060229342A1 TNF-a production inhibitors
US20060241155A1 TNF-alpha production inhibitors
US20080161270A1 Angiogenesis inhibitors
US20080182881A1 TNF-alpha production inhibitors
US20100016380A1 TNF-alpha production inhibitors

These patents appear to be all from Santen Pharmaceutical Co., though the company name is not listed as assignee on some of the patents. Equally interesting are those related families where the members are less highly connected. Here’s an example from GSK along with representative examples of the patent titles:

US Patent Application Title
US20100174065A1 COMPOUNDS

As ever, if this sparks some ideas and you’re interested in collaborating, drop us a line.

The perils of using __lzcnt with MSVC

TLDR; Don’t ever use __lzcnt without a corresponding __cpuid check.

I recently ran into a problem with a port of some g++ code to MSVC (2013). It was doing some bit-twiddling and needed an operator to count the leading zeros. It turns out that MSVC provides an intrinsic just for this purpose, __lzcnt.

Everything seemed to work, but a bug was reported and we traced it to this statement. The funny thing was, a simple test case (printing the leading zeros for a few different integers) gave different results on different machines and, for the value of 0, generated different answers each time.

We eventually worked out the root cause. The ‘lzcnt’ instruction is only provided by certain CPUs, and __lzcnt is just directly turned into this instruction regardless of whether it’s available or not. The funny (not so funny) thing is that instead of getting an ‘illegal instruction’ result when you run it, Intel (in their infinite wisdom) decided to reuse or repurpose existing opcodes so that CPUs without ‘lzcnt’ instead did a ‘bsr’ (bit scan reverse). This was why (a) the results were different/wrong, and (b) why a value of 0 gave gibberish (the docs for ‘bsr’ say the results are undefined in that case).

For background, see this StackOverflow answer from BeeOnRope:

…What happened is that Intel used the invalid sequence rep bsr to encode the new lzcnt instruction. Using a rep prefix on bsr (and many other instructions) was not a defined behavior, but all previous Intel CPUs just ignore redundant rep prefixes (indeed, they are allowed in some places where they have no effect, e.g., to make longer nop instructions).

So if you happen to execute lzcnt on a CPU that doesn’t support it, it will execute as bsr. Of course, this fallback is not exactly intentional, and it gives the wrong result…

Careful reading of the __lzcnt docs does say this in the Remarks: “If you run code that uses this intrinsic on hardware that does not support the lzcnt instruction, the results are unpredictable.”. I think this could be made a bit more obvious – hence this blog post for future googlers.

PubChem as a sequence database: Sequence variation

Earlier posts considered exact matches to sequence representations in PubChem. Now, let’s look at what should be considered similar matches. It is a failing of structural fingerprints that (to a first approximation) all oligopeptides are similar to all other oligopeptides because the paths (or atom environments) become saturated. A better way to measure similarity in this context would be to use edit distance. This can be done on the all-atom representation itself (e.g. using an MCS-based approach such as SmallWorld) or, more commonly for biopolymers, using a sequence representation.

Here we consider single mutations from a particular query. Some of the hits found will be due to an evolutionary process, and some due to humans exploring SAR. Naturally, there may also be some “mutations” due to errors by depositors – for the purposes of this blogpost we will minimise these by requiring strict matching on the conserved residues of the sequence (i.e. applying rules 1b, 2b, 3b from the previous blog post).

Sequence logos summarising the results are shown below for a set of queries against (a) the whole of PubChem, then (b) that subset derived from ChEMBL depositions.

Peptide PubChem ChEMBL
neuromedin N

Given that ChEMBL is a depositor into PubChem, it follows that the number of mutants found in ChEMBL must be a subset of those present in PubChem. It is still interesting to see that additional mutants are present, as it shows that PubChem has value above and beyond ChEMBL when it comes finding positions where SAR has been explored for a particular bioactive peptide.

Notes: Sequence logos were generated with WebLogo3. For more details see Noel O’Blog.

PubChem as a sequence database: Identity search

The previous post introduced the concept of treating PubChem as a sequence database. In that post, structures with the same sequence were collated to search for alternative disulfide bridging patterns. Here we explore the general concept of using sequence identity to search PubChem with the goal of answering the question, what results would (or should) be found for an exact sequence search of a chemical database?

Let’s take for example, kemptide. When written as a sequence, this is represented exactly by LRRASLG:

There are a number of choices to be made when converting a chemical structure to a sequence. For example:
1. We can write uppercase characters for all of D-/L-/DL-amino acids (1a), or we can use lowercase for D- (1b)
2. We can treat sidechain stereochemistry variants of Thr and Ile all as if they were Thr or Ile (i.e. T/I) (2a), or else handle the allo- and ξ versions with ‘X’ (2b)
3. We can consider all sidechain modifications as the parent aminoacid (3a, e.g. Ser(PO3H2) as Ser (and so ‘S’ instead of ‘X’) , instead of distinguishing between them (3b)

If we generate sequences for PubChem following the least specific rules (i.e. 1a, 2a, 3a), then 16 structures are found that have the same sequence as kemptide. These can then be partitioned by generating sequences with more specific rules, for example, distinguishing based on the presence of D- stereochemistry (i.e. 1b, 2a, 3a). As shown below, this first level of separation splits the sequences into those corresponding to LRRASLG, LRraSLG, LrRASLG, lRRASLG and lrRASLG.

In this particular case, only the first of these contains multiple strutures. These can be further split by applying rules 2b and 3b, which here separate based on the phosphorylation of the serines. After this, the ties can be split by considering the IUPAC condensed representation which shows differences in N- and C-terminal modifications, or the presence of a cosolvate.

        Ac-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH 78069426 acetyl-kemptide
        Ac-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 71429096 acetyl-kemptide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-NH2 85062657 kemptide amide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH 100074 kemptide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH.TFA 118797564 
        H-Leu-Arg-Arg-Ala-Ser-Leu-Gly-NH2 9897033 kemptide amide
        H-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 9962276 kemptide
        Unk-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 11650926,101224399,101878757 
        H-Leu-Arg-Arg-Ala-Ser(PO3H2)-Leu-Gly-NH2 102212089 [Ser(PO3H2)-5]kemptide amide
        H-Leu-Arg-Arg-Ala-Ser(PO3H2)-Leu-Gly-OH 13783725 [Ser(PO3H2)-5]kemptide
        H-Leu-Arg-D-Arg-D-Ala-Ser-Leu-Gly-OH 53393688 [D-Arg3,D-Ala4]kemptide
        H-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864041 [D-Arg2]kemptide
        H-D-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864040 [D-Leu1]kemptide
        H-D-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864042 [D-Leu1,D-Arg2]kemptide

Kemptide was chosen here because it’s fairly obscure and so only 16 hits were found. More popular peptides yield many more exact identity hits. For example, oxytocin has 87 hits, octreotide 207 hits, and substance P 608. In fact, turning this on its head, we can also use these exact identity matches to find ‘popular’ peptides that are missing from our internal peptide database.

PubChem as a sequence database: Disulfide bridging patterns

While PubChem is best associated with small molecules, it contains an increasing amount of biopolymers through depositions of databases of molecules of biological interest (e.g. ChEBI, GuideToPharmacology) not to mention a large number of vendors. As every good bioinformatician knows, biopolymers should be represented as a sequence of letters, preferably capital letters. Let’s see what we can do with a representation of PubChem as a sequence database.

Here we focus on searching for peptides with the same sequence but that have different disulfide bridges. Rather esoteric, perhaps, but it illustrates the general approach. We’ll exclude from our analysis structures where a bridge is reduced or is protected. The diagram below illustrates an example of what we’re looking for; these two peptides have the same primary sequence but are structural isomers due to the difference in the disulfide bridges.

As a bit of background, such alternative structures do not occur with natural peptides (as far as I can tell) – so-called non-native disulfides are corrected during disulfide-bond formation in the ER. Any instances we find are either errors by the depositor, or artificially created.

To begin with, I converted PubChem SMILES to peptide sequences using Sugar&Splice’s tseq format, which treats all aminoacid stereo forms as ‘L-‘, and all Thr/Ile sidechain stereo forms as the parent Thr/Ile. For our purposes, the most important point is that it ignores disulfide bridges. So all of the different disulfide bridging forms (including the reduced form) will have the same sequence. Once generated, I filtered for sequences containing 4 or more cysteines and collated the results.

In total, I found 16 potentially interesting cases, of which 12 were errors but 4 were real. Here’s one of the real ones, α-conotoxin SI (CID11480353 and CID101041637), which was deposited by Nikajii and originally extracted from Controlled syntheses of natural and disulfide-mispaired regioisomers of α-conotoxin SI (B. Hargittai, G. Barany. J. Peptide Res. 1999, 54, 468).

  1212 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(1)-Gly-Pro-Lys-Tyr-Ser-Cys(2)-NH2 CID11480353
  1221 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(2)-Gly-Pro-Lys-Tyr-Ser-Cys(1)-NH2 CID101041637

Where the entry was erroneous, it was typically the case that the correct entry was associated with more depositors. But not always – for the case below (MCD peptide), the incorrect bridging structure has 10 depositors (the 1221 below) while the correct one has 2. It’s nice to see that the correct structure also has defined stereochemistry, in contrast to the incorrect one.

  1212 H-Ile-Lys-Cys(1)-Asn-Cys(2)-Lys-Arg-His-Val-Ile-Lys-Pro-His-Ile-Cys(1)-Arg-Lys-Ile-Cys(2)-Gly-Lys-Asn-NH2 CID16136550
  1221 H-DL-xiIle-DL-Lys-DL-Cys(1)-DL-Asn-DL-Cys(2)-DL-Lys-DL-Arg-DL-His-DL-Val-DL-xiIle-DL-Lys-DL-Pro-DL-His-DL-xiIle-DL-Cys(2)-DL-Arg-DL-Lys-DL-xiIle-DL-Cys(1)-Gly-DL-Lys-DL-Asn-NH2 CID16132290

Are more bioactivities available from patents than from the academic literature?

Patents, such as those freely available from the US patent office, are a rich source of bioactivity data. One argument for favoring these data over data extracted from the academic literature is timeliness: a recent publication by Stefan Senger suggests an average delay of 4 years between the publication of compound-target interaction pairs in the patent literature compared to the academic literature.

However, another argument is simply the quantity of data. Daniel has been working on the general problem of extracting data from tables in patents, a certain proportion of which are bioactivity data. The following graph shows the amount of bioactivity data per (publication) year in ChEMBL versus extracted by LeadMine from US patents. Note that for the purposes of this comparison, the ChEMBL data excludes data extracted from patents by BindingDB.

The rise in the amount of patent data is due to an increase in the size of patents as well as the number thereof. If the trend continues, patents will become increasingly important as a source of bioactivity data.

Daniel presented the details of the text-mining procedure at the recent ACS meeting in San Francisco. The talk below also includes a comparison between the data extracted by LeadMine and that extracted manually by BindingDB. If you’re interested in seeing a poster on the topic, Daniel will be presenting at UK-QSAR this Wednesday.

Workshop at Bio-IT World on extraction of information from medicinal chemistry patents

Next month, at Bio-IT World, I will be co-hosting a workshop with Chris Southan (Guide to Pharmacology) and Paul Thiessen (PubChem) entitled “Digging Bioactive Chemistry Out of Patents Using Open Resources”. Chris Southan recently wrote about some of the untapped potential for the patent literature in drug discovery here.

The workshop will cover the following topics:

  • Outline the statistics of patent chemistry in various open sources
  • Introduce a spectrum of open resources and tools
  • Enable a deeper understanding of target identification, bioactivity and SAR extraction from patents and also papers
  • Show ways to engage with medicinal chemistry patent mining
  • Include hands-on exercises

The workshop is scheduled for Tuesday 23rd May and the deadline for signups is the 14th of April registration is still open. For more information on the agenda of the workshop and to sign up head to here.

Nh, Mc, Ts and Og spell trouble

John and Roger recently published a commentary in the Journal of Cheminformatics on the “Technical implications of new IUPAC elements in cheminformatics”. It’s fairly short, and focuses on ambiguities that may arise in two areas: (1) interpreting chemical sketches and (2) SMARTS patterns.

Regarding (1), the main point is that Ts, the new symbol for Tennessine, is currently widely used to indicate Tosyl. While one could instead use Tos, a quick look at usage in sketches from recent patents indicates that Ts is 20 times more common than Tos.

Section (2) is a bit more technical, and covers ambiguities which must be addressed for writers of SMARTS parsers and generators, which may misinterpret existing SMARTS when adding support for the new elements.