Chemically-related patent families

The term patent family is generally used to describe a set of patents that cover the same invention but which are filed with different patent authorities. Here instead we look at finding groups of patents within a single authority (the USPTO) where the patents are linked by chemical structures.

It turns out that it is not unusual for essentially the same chemical information to appear in multiple patent applications within the USPTO, often with the same or similar title. I’m not sure of the reason for this – perhaps corrections, rewrites, or separate applications for different targets. In any case, it is useful to identify such cases for the purposes of linking or collation, or indeed to discard them if looking for truly novel chemistry.

Here’s an approach that appears to work reasonably well: we regard as “chemically-related” two patents that share at least N key (but rare) molecules in common. All that remains is to define “N”, “key”, and “rare”:

  • A key compound is one associated with a compound number (which may be in the text or a ChemDraw file) or associated with an experimental property (taken from a table, and possibly described in terms of R groups that need to be attached to a scaffold).
  • A rare molecule is one that appears in 30 or fewer patents.
  • N was defined as 8.
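As a rough illustration of these definitions, here is a minimal Python sketch. It assumes the key molecules have already been extracted and are available as a mapping from patent ID to a set of molecule identifiers (for example canonical SMILES); that input format, the function name and the parameter names are hypothetical and not taken from the actual pipeline.

```python
from collections import Counter
from itertools import combinations

def related_patents(key_mols, n=8, max_patents=30):
    """Return {(patent_a, patent_b): shared_count} for chemically-related pairs.

    key_mols maps a patent ID to the set of key molecules found in it.
    This input format is an assumption made for the sketch.
    """
    # Count the number of patents in which each key molecule appears
    patent_counts = Counter(mol for mols in key_mols.values() for mol in mols)
    # For each patent, keep only the molecules that are rare
    rare = {
        patent: {mol for mol in mols if patent_counts[mol] <= max_patents}
        for patent, mols in key_mols.items()
    }
    related = {}
    for a, b in combinations(sorted(rare), 2):
        shared = len(rare[a] & rare[b])
        if shared >= n:
            related[(a, b)] = shared
    return related
```

The all-pairs loop is fine for a toy example; for a full patent collection one would more likely build an index from each rare molecule to the patents containing it and only compare patents that share at least one.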

Naturally, these cutoffs could benefit from some tweaking with a test set (e.g. patents with the same title and assignee), but for the purposes of this blog post they seem to work well. Here is a typical example of a highly-connected chemical patent family, where the labels are the number of key (but rare) molecules in common:

US Patent Application Title
US20030032623A1 Tnf-alpha production inhibitors
US20050014800A1 Angiogenesis inhibitor
US20060229342A1 TNF-a production inhibitors
US20060241155A1 TNF-alpha production inhibitors
US20080161270A1 Angiogenesis inhibitors
US20080182881A1 TNF-alpha production inhibitors
US20100016380A1 TNF-alpha production inhibitors

These patents appear to be all from Santen Pharmaceutical Co., though the company name is not listed as assignee on some of the patents. Equally interesting are those related families where the members are less highly connected. Here’s an example from GSK along with representative examples of the patent titles:

US Patent Application Title
US20130012491A1 PYRIMIDINE DERIVATIVES FOR USE AS SPHINGOSINE 1-PHOSPHATE 1 (S1P1) RECEPTOR AGONISTS
US20120094979A1 THIAZOLE OR THIADIAZOLE DERIVATIVES FOR USE AS SPHINGOSINE 1-PHOSPHATE 1 (S1P1) RECEPTOR AGONISTS
US20100273771A1 OXADIAZOLE DERIVATIVES ACTIVE ON SPHINGOSINE-1-PHOSPHATE (SIP)
US20100174065A1 COMPOUNDS
US20120101083A1 S1P1 AGONISTS COMPRISING A BICYCLIC N-CONTAINING RING

As ever, if this sparks some ideas and you’re interested in collaborating, drop us a line.

The perils of using __lzcnt with MSVC

TL;DR: Don’t ever use __lzcnt without a corresponding __cpuid check.

I recently ran into a problem with a port of some g++ code to MSVC (2013). It was doing some bit-twiddling and needed an operator to count the leading zeros. It turns out that MSVC provides an intrinsic just for this purpose, __lzcnt.

Everything seemed to work, but a bug was reported and we traced it to this statement. The funny thing was, a simple test case (printing the leading zeros for a few different integers) gave different results on different machines and, for the value of 0, generated different answers each time.

We eventually worked out the root cause. The ‘lzcnt’ instruction is only provided by certain CPUs, and __lzcnt is just directly turned into this instruction regardless of whether it’s available or not. The funny (not so funny) thing is that instead of getting an ‘illegal instruction’ result when you run it, Intel (in their infinite wisdom) decided to reuse or repurpose existing opcodes so that CPUs without ‘lzcnt’ instead did a ‘bsr’ (bit scan reverse). This was why (a) the results were different/wrong, and (b) a value of 0 gave gibberish (the docs for ‘bsr’ say the results are undefined in that case).

For background, see this StackOverflow answer from BeeOnRope:

…What happened is that Intel used the invalid sequence rep bsr to encode the new lzcnt instruction. Using a rep prefix on bsr (and many other instructions) was not a defined behavior, but all previous Intel CPUs just ignore redundant rep prefixes (indeed, they are allowed in some places where they have no effect, e.g., to make longer nop instructions).

So if you happen to execute lzcnt on a CPU that doesn’t support it, it will execute as bsr. Of course, this fallback is not exactly intentional, and it gives the wrong result…

Careful reading of the __lzcnt docs does say this in the Remarks: “If you run code that uses this intrinsic on hardware that does not support the lzcnt instruction, the results are unpredictable.” I think this could be made a bit more obvious – hence this blog post for future googlers.
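To make the difference between the two instructions concrete, here is a small Python model of what each one computes for a 32-bit operand. It is only a model for illustration (the function names are made up and nothing here calls the MSVC intrinsic): ‘lzcnt’ counts the zero bits above the highest set bit, whereas ‘bsr’ returns the index of that bit, so for non-zero input the two answers always add up to 31.

```python
def lzcnt32(x):
    """Number of leading zero bits in a 32-bit value, as 'lzcnt' computes it."""
    return 32 - x.bit_length()  # gives 32 for x == 0

def bsr32(x):
    """Index of the highest set bit, as 'bsr' computes it."""
    if x == 0:
        # The real instruction leaves its destination undefined for 0
        raise ValueError("bsr result is undefined for 0")
    return x.bit_length() - 1

for x in (1, 2, 255, 0x80000000):
    # For non-zero input the two results always sum to 31
    assert lzcnt32(x) + bsr32(x) == 31
    print(f"{x:#010x}  lzcnt={lzcnt32(x):2d}  bsr={bsr32(x):2d}")
```

So on a CPU that silently executes ‘bsr’, a non-zero input gives 31 minus the expected answer, and zero gives whatever happened to be in the destination register.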

PubChem as a sequence database: Sequence variation

Earlier posts considered exact matches to sequence representations in PubChem. Now, let’s look at what should be considered similar matches. It is a failing of structural fingerprints that (to a first approximation) all oligopeptides are similar to all other oligopeptides because the paths (or atom environments) become saturated. A better way to measure similarity in this context would be to use edit distance. This can be done on the all-atom representation itself (e.g. using an MCS-based approach such as SmallWorld) or, more commonly for biopolymers, using a sequence representation.

Here we consider single mutations from a particular query. Some of the hits found will be due to an evolutionary process, and some due to humans exploring SAR. Naturally, there may also be some “mutations” due to errors by depositors – for the purposes of this blogpost we will minimise these by requiring strict matching on the conserved residues of the sequence (i.e. applying rules 1b, 2b, 3b from the previous blog post).
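For illustration, here is a minimal Python sketch of such a search, assuming that a “single mutation” means a substitution at exactly one position (insertions and deletions are ignored) and that the database sequences are available as plain strings. The function name and the example sequences other than the query are made up.

```python
def single_mutants(query, sequences):
    """Yield (position, residue, sequence) for each single-substitution hit.

    Positions are 1-based. Sequences of a different length are skipped,
    so insertions and deletions are deliberately not considered.
    """
    for seq in sequences:
        if len(seq) != len(query):
            continue
        diffs = [i for i, (q, s) in enumerate(zip(query, seq)) if q != s]
        if len(diffs) == 1:
            pos = diffs[0]
            yield pos + 1, seq[pos], seq

# Hypothetical database entries, compared against a 7-residue query
hits = list(single_mutants("LRRASLG", ["LRRASLG", "LRRAALG", "LRRASL", "ARRASLA"]))
print(hits)
```

Tallying the (position, residue) pairs from the hits gives the per-position counts that a sequence logo summarises.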

Sequence logos summarising the results are shown below for a set of queries against (a) the whole of PubChem, then (b) that subset derived from ChEMBL depositions.

Peptide PubChem ChEMBL
casokefamide
glumitocin
neuromedin N
setmelanotide
spinorphin
thymulin

Given that ChEMBL is a depositor into PubChem, it follows that the mutants found in ChEMBL must be a subset of those present in PubChem. It is still interesting to see that additional mutants are present, as it shows that PubChem has value above and beyond ChEMBL when it comes to finding positions where SAR has been explored for a particular bioactive peptide.

Notes: Sequence logos were generated with WebLogo3. For more details see Noel O’Blog.

PubChem as a sequence database: Identity search

The previous post introduced the concept of treating PubChem as a sequence database. In that post, structures with the same sequence were collated to search for alternative disulfide bridging patterns. Here we explore the general concept of using sequence identity to search PubChem with the goal of answering the question, what results would (or should) be found for an exact sequence search of a chemical database?

Let’s take for example, kemptide. When written as a sequence, this is represented exactly by LRRASLG:

There are a number of choices to be made when converting a chemical structure to a sequence. For example:
1. We can write uppercase characters for all of D-/L-/DL-amino acids (1a), or we can use lowercase for D- (1b)
2. We can treat sidechain stereochemistry variants of Thr and Ile all as if they were Thr or Ile (i.e. T/I) (2a), or else handle the allo- and ξ versions with ‘X’ (2b)
3. We can treat all sidechain-modified residues as the parent amino acid (3a), e.g. Ser(PO3H2) as Ser (and so ‘S’ instead of ‘X’), or we can distinguish between them (3b)
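The three choices can be sketched in Python as follows. This is only an illustration: it assumes the structure has already been perceived as a list of residues, and the Residue class, its fields and the to_sequence function are hypothetical names rather than part of any toolkit. Writing a lowercase ‘x’ for a D- residue that is also reported as ‘X’ is a choice made here, not something the rules above specify.

```python
from dataclasses import dataclass

@dataclass
class Residue:
    letter: str             # one-letter code of the parent amino acid
    stereo: str = "L"       # "L", "D" or "DL"
    allo: bool = False      # allo- or xi- sidechain stereo (Thr/Ile only)
    modified: bool = False  # sidechain modification, e.g. Ser(PO3H2)

def to_sequence(residues, rule1="a", rule2="a", rule3="a"):
    """Write a residue list as a sequence under the chosen rule variants."""
    out = []
    for r in residues:
        # Rules 2b and 3b replace the parent letter with 'X'
        if (rule2 == "b" and r.allo) or (rule3 == "b" and r.modified):
            symbol = "X"
        else:
            symbol = r.letter
        # Rule 1b marks D- residues with lowercase (an assumption for 'X')
        if rule1 == "b" and r.stereo == "D":
            symbol = symbol.lower()
        out.append(symbol)
    return "".join(out)

# [D-Arg2]kemptide: H-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH
d_arg2 = [Residue("L"), Residue("R", stereo="D"), Residue("R"),
          Residue("A"), Residue("S"), Residue("L"), Residue("G")]
print(to_sequence(d_arg2))             # least specific rules
print(to_sequence(d_arg2, rule1="b"))  # distinguishing D- residues
```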

If we generate sequences for PubChem following the least specific rules (i.e. 1a, 2a, 3a), then 16 structures are found that have the same sequence as kemptide. These can then be partitioned by generating sequences with more specific rules, for example, distinguishing based on the presence of D- stereochemistry (i.e. 1b, 2a, 3a). As shown below, this first level of separation splits the sequences into those corresponding to LRRASLG, LRraSLG, LrRASLG, lRRASLG and lrRASLG.

In this particular case, only the first of these contains multiple structures. These can be further split by applying rules 2b and 3b, which here separate based on the phosphorylation of the serines. After this, the ties can be split by considering the IUPAC condensed representation which shows differences in N- and C-terminal modifications, or the presence of a cosolvate.

LRRASLG
    LRRASLG
        Ac-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH 78069426 acetyl-kemptide
        Ac-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 71429096 acetyl-kemptide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-NH2 85062657 kemptide amide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH 100074 kemptide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH.TFA 118797564 
        H-Leu-Arg-Arg-Ala-Ser-Leu-Gly-NH2 9897033 kemptide amide
        H-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 9962276 kemptide
        Unk-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 11650926,101224399,101878757 
    LRRAXLG
        H-Leu-Arg-Arg-Ala-Ser(PO3H2)-Leu-Gly-NH2 102212089 [Ser(PO3H2)-5]kemptide amide
        H-Leu-Arg-Arg-Ala-Ser(PO3H2)-Leu-Gly-OH 13783725 [Ser(PO3H2)-5]kemptide
LRraSLG
    LRraSLG
        H-Leu-Arg-D-Arg-D-Ala-Ser-Leu-Gly-OH 53393688 [D-Arg3,D-Ala4]kemptide
LrRASLG
    LrRASLG
        H-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864041 [D-Arg2]kemptide
lRRASLG
    lRRASLG
        H-D-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864040 [D-Leu1]kemptide
lrRASLG
    lrRASLG
        H-D-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864042 [D-Leu1,D-Arg2]kemptide

Kemptide was chosen here because it’s fairly obscure and so only 16 hits were found. More popular peptides yield many more exact identity hits. For example, oxytocin has 87 hits, octreotide 207 hits, and substance P 608. In fact, turning this on its head, we can also use these exact identity matches to find ‘popular’ peptides that are missing from our internal peptide database.

PubChem as a sequence database: Disulfide bridging patterns

While PubChem is best associated with small molecules, it contains an increasing number of biopolymers through depositions of databases of molecules of biological interest (e.g. ChEBI, GuideToPharmacology) not to mention a large number of vendors. As every good bioinformatician knows, biopolymers should be represented as a sequence of letters, preferably capital letters. Let’s see what we can do with a representation of PubChem as a sequence database.

Here we focus on searching for peptides with the same sequence but that have different disulfide bridges. Rather esoteric, perhaps, but it illustrates the general approach. We’ll exclude from our analysis structures where a bridge is reduced or is protected. The diagram below illustrates an example of what we’re looking for; these two peptides have the same primary sequence but are structural isomers due to the difference in the disulfide bridges.

As a bit of background, such alternative structures do not occur with natural peptides (as far as I can tell) – so-called non-native disulfides are corrected during disulfide-bond formation in the ER. Any instances we find are either errors by the depositor, or artificially created.

To begin with, I converted PubChem SMILES to peptide sequences using Sugar&Splice’s tseq format, which treats all amino acid stereo forms as ‘L-’, and all Thr/Ile sidechain stereo forms as the parent Thr/Ile. For our purposes, the most important point is that it ignores disulfide bridges. So all of the different disulfide bridging forms (including the reduced form) will have the same sequence. Once generated, I filtered for sequences containing 4 or more cysteines and collated the results.
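The filter-and-collate step can be sketched in a few lines of Python. This assumes the sequences have already been generated and are available as (CID, sequence) pairs; the function name and input format are hypothetical, and the conversion from SMILES is not shown.

```python
from collections import defaultdict

def collate_cysteine_rich(records, min_cys=4):
    """Group CIDs by sequence, keeping sequences with enough cysteines.

    records is an iterable of (cid, sequence) pairs. Only sequences that
    are shared by more than one CID are returned, since a single entry
    cannot show alternative bridging patterns.
    """
    groups = defaultdict(list)
    for cid, seq in records:
        if seq.count("C") >= min_cys:
            groups[seq].append(cid)
    return {seq: cids for seq, cids in groups.items() if len(cids) > 1}
```

Each surviving group still needs to be inspected, for example to drop entries where a bridge is reduced or protected.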

In total, I found 16 potentially interesting cases, of which 12 were errors but 4 were real. Here’s one of the real ones, α-conotoxin SI (CID11480353 and CID101041637), which was deposited by Nikkaji and originally extracted from Controlled syntheses of natural and disulfide-mispaired regioisomers of α-conotoxin SI (B. Hargittai, G. Barany. J. Peptide Res. 1999, 54, 468).

ICCNPACGPKYSC
  1212 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(1)-Gly-Pro-Lys-Tyr-Ser-Cys(2)-NH2 CID11480353
  1221 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(2)-Gly-Pro-Lys-Tyr-Ser-Cys(1)-NH2 CID101041637
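The short labels such as 1212 and 1221 can be read straight off the condensed representation by collecting the bridge label on each cysteine in sequence order. Here is a minimal Python sketch of that idea; the function name is made up, and it assumes the bridge labels are single digits written as Cys(n), as in the examples here.

```python
import re

def bridging_pattern(condensed):
    """Return the disulfide bridge labels in sequence order, e.g. '1212'."""
    return "".join(re.findall(r"Cys\((\d+)\)", condensed))

native = "H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(1)-Gly-Pro-Lys-Tyr-Ser-Cys(2)-NH2"
mispaired = "H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(2)-Gly-Pro-Lys-Tyr-Ser-Cys(1)-NH2"
print(bridging_pattern(native), bridging_pattern(mispaired))
```

Two entries with the same sequence but different patterns are then candidates for alternative bridging, provided the labels are assigned consistently (here, in order of first appearance).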


Where the entry was erroneous, it was typically the case that the correct entry was associated with more depositors. But not always – for the case below (MCD peptide), the incorrect bridging structure has 10 depositors (the 1221 below) while the correct one has 2. It’s nice to see that the correct structure also has defined stereochemistry, in contrast to the incorrect one.

IKCNCKRHVIKPHICRKICGKN
  1212 H-Ile-Lys-Cys(1)-Asn-Cys(2)-Lys-Arg-His-Val-Ile-Lys-Pro-His-Ile-Cys(1)-Arg-Lys-Ile-Cys(2)-Gly-Lys-Asn-NH2 CID16136550
  1221 H-DL-xiIle-DL-Lys-DL-Cys(1)-DL-Asn-DL-Cys(2)-DL-Lys-DL-Arg-DL-His-DL-Val-DL-xiIle-DL-Lys-DL-Pro-DL-His-DL-xiIle-DL-Cys(2)-DL-Arg-DL-Lys-DL-xiIle-DL-Cys(1)-Gly-DL-Lys-DL-Asn-NH2 CID16132290

Are more bioactivities available from patents than from the academic literature?

Patents, such as those freely available from the US patent office, are a rich source of bioactivity data. One argument for favoring these data over data extracted from the academic literature is timeliness: a recent publication by Stefan Senger suggests an average delay of 4 years between the publication of compound-target interaction pairs in the patent literature and their publication in the academic literature.

However, another argument is simply the quantity of data. Daniel has been working on the general problem of extracting data from tables in patents, a certain proportion of which are bioactivity data. The following graph shows the amount of bioactivity data per (publication) year in ChEMBL versus extracted by LeadMine from US patents. Note that for the purposes of this comparison, the ChEMBL data excludes data extracted from patents by BindingDB.

The rise in the amount of patent data is due to an increase in the size of patents as well as the number thereof. If the trend continues, patents will become increasingly important as a source of bioactivity data.

Daniel presented the details of the text-mining procedure at the recent ACS meeting in San Francisco. The talk below also includes a comparison between the data extracted by LeadMine and that extracted manually by BindingDB. If you’re interested in seeing a poster on the topic, Daniel will be presenting at UK-QSAR this Wednesday.

Nh, Mc, Ts and Og spell trouble

John and Roger recently published a commentary in the Journal of Cheminformatics on the “Technical implications of new IUPAC elements in cheminformatics”. It’s fairly short, and focuses on ambiguities that may arise in two areas: (1) interpreting chemical sketches and (2) SMARTS patterns.

Regarding (1), the main point is that Ts, the new symbol for Tennessine, is currently widely used to indicate Tosyl. While one could instead use Tos, a quick look at usage in sketches from recent patents indicates that Ts is 20 times more common than Tos.

Section (2) is a bit more technical, and covers ambiguities which must be addressed by writers of SMARTS parsers and generators, which may misinterpret existing SMARTS when adding support for the new elements.

Searching ChEMBL in the browser

A previous post (see the slidedeck from slide 40) described some of the work we have done on the development of fast substructure search, a project code-named Arthor. At the time, it ran about two orders of magnitude faster than any of the other programs benchmarked. Such speed makes possible interactive searches of large databases. That’s pretty obvious, and so rather than discuss that here, here’s something else that’s a bit more novel: interactive substructure search of moderately sized datasets, entirely client-side in the browser.

It is important to note this is not the first time that substructure search has been implemented entirely in the browser: Peter Ertl and co. developed the Wikipedia Structure Explorer which searches almost 15K structures from Wikipedia using the Actelion Java library compiled to JavaScript. However, with Arthor (also compiled to JavaScript), it is possible to search the whole of ChEMBL22_1, 1.68 million molecules, in the browser. It even works on my mid-range phone (Moto G 3rd gen, 2GB RAM), although there it is limited by memory constraints to 1.0 million molecules.

Time for the timings. Note that times quoted for the native code do not include the use of a fingerprint screen, so as to be like-for-like with the JavaScript, where it is not possible to use fingerprints for the whole of ChEMBL due to RAM constraints. The native and JavaScript times were measured on the same machine (Core i7 6900K CPU, 3.20GHz), and all are times to find the total number of hits (rather than the first 10 or 100 or whatever) using a single thread. Phone times are for 1.0 million molecules. All times are in ms unless otherwise stated.

Query Hits Native (1.68M mols) JavaScript (1.68M mols) Phone (1.00M mols)
c1ccccc1 1420663 419 663 3.24s
Br 75132 113 197 819
CCO 754842 230 368 1.32s
OOO 1 99 300 1.12s
[X5] 160 102 186 817

Imagine a future where the computationally expensive step of substructure searching no longer requires a server, but is done client-side. Impossible, or only a matter of time?

Just what you wanted for Christmas – a compiler for Gaussian

One of Roger’s main interests is compilation as applied to code, SMARTS patterns, and indeed anything else. For a period back in the noughties, Roger even moonlighted as a middle-end maintainer for the GCC project.

So when, during a sabbatical, he was faced with the task of compiling Gaussian, he naturally turned to GFortran. However, given that this would not compile it, he tweaked the compiler and submitted patches to the FSF (see for example, the MOPAC changes on page 43 of this summary). When not all of these patches were accepted into mainline GFortran, he packaged the remaining pieces into a Fortran pre-processor that emulates the (non-standard) behaviour of commercial compilers.

The result, gXXfortran, is now available on GitHub. In theory, it should work for a standard Linux or Mac system. However, as we don’t have access to the Gaussian source code, your mileage may vary.

This package provides a “pgf77” script that emulates the Portland Group’s PGI Fortran 77 compiler, using the Free Software Foundation’s GNU gfortran compiler instead. This emulation is sufficient to allow packages such as Gaussian03, that would otherwise require a commercial compiler, to be built using open source tools.

In addition, this package also allows Gaussian03 to be built on a case-insensitive file system (such as when using Mac OS X, cygwin or a FAT32 drive) by overriding the behaviour of “cp” and “gau-cpp” such that they don’t cause problems when used by Gaussian’s build scripts on non case-sensitive file systems.

Buying a ring, or making one yourself

When synthesising a molecule containing one or more rings, the chemist may decide on a synthetic route that includes ring-forming reactions or may instead be able to rely on starting materials that already incorporate the desired rings. The choice depends on many factors, including cost of starting materials, likely yields, and ease of access to additional analogs.

Some ring systems are very common – a phenyl ring springs to mind, of course – yet they are not often formed as part of a typical synthetic route. Let’s automate the process of finding whether a particular ring system is likely to be formed in a reaction.

As a dataset we will use reactions extracted from the text of US and European patents by LeadMine and where Indigo produces an atom-mapping. Only one reaction per patent is used, and exact duplicates are discarded. Given these 212K reactions, here are the most common ring systems (*) in the products along with their frequency:
[Figure: the most common ring systems in the products, with their frequencies]
Next, we use the mapping to identify those instances where a ring was formed. For each of these reactions, we take each of the common ring systems in turn and see whether it appears on the right-hand side (RHS) but not on the left-hand side. Here are the most commonly formed rings:
[Figure: the most commonly formed ring systems]
Finally, we divide the corresponding figures from the diagram above to calculate the likelihood that, given a particular ring system on the RHS, it was formed by the reaction. For example, for the phenyl ring, the likelihood is 807/151983 or 0.5%. Here are the rings with the highest likelihoods:
[Figure: rings with the highest likelihood of being formed]
…and those with the lowest:
[Figure: rings with the lowest likelihood of being formed]
So what is it about these rings that places them at the top and bottom of the likelihood lists? Comments welcome…
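For anyone who wants to repeat the calculation on their own data, here is a minimal Python sketch of the counting. It is a simplification: it assumes each reaction has already been reduced to two sets of ring-system identifiers (one for each side) and treats a ring as formed when it is on the RHS but not the LHS, without using the atom-mapping directly. The function name and input format are hypothetical.

```python
from collections import Counter

def ring_formation_likelihood(reactions):
    """Return {ring: formed / present} over all reactions.

    reactions is an iterable of (lhs_rings, rhs_rings) pairs, each a set
    of ring-system identifiers (e.g. canonical SMILES of the ring system).
    """
    present = Counter()  # reactions with the ring system on the RHS
    formed = Counter()   # ...where it is absent from the LHS
    for lhs_rings, rhs_rings in reactions:
        for ring in rhs_rings:
            present[ring] += 1
            if ring not in lhs_rings:
                formed[ring] += 1
    return {ring: formed[ring] / present[ring] for ring in present}

# The phenyl example from the text: formed 807 times out of 151983
print(f"{807 / 151983:.1%}")  # prints 0.5%
```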

* Depending on how you slice-and-dice molecules to find ring systems, the exact results will vary. Here I included exocyclic double bonds as part of the ring system. In addition, I hashed tautomers to the same representation and removed any stereochemistry.