PubChem as a sequence database: Sequence variation

Earlier posts considered exact matches to sequence representations in PubChem. Now, let’s look at what should be considered similar matches. It is a failing of structural fingerprints that (to a first approximation) all oligopeptides are similar to all other oligopeptides because the paths (or atom environments) become saturated. A better way to measure similarity in this context would be to use edit distance. This can be done on the all-atom representation itself (e.g. using an MCS-based approach such as SmallWorld) or, more commonly for biopolymers, using a sequence representation.

Here we consider single mutations from a particular query. Some of the hits found will be due to an evolutionary process, and some due to humans exploring SAR. Naturally, there may also be some “mutations” due to errors by depositors – for the purposes of this blogpost we will minimise these by requiring strict matching on the conserved residues of the sequence (i.e. applying rules 1b, 2b, 3b from the previous blog post).

Sequence logos summarising the results are shown below for a set of queries against (a) the whole of PubChem, then (b) that subset derived from ChEMBL depositions.

Peptide PubChem ChEMBL
neuromedin N

Given that ChEMBL is a depositor into PubChem, it follows that the number of mutants found in ChEMBL must be a subset of those present in PubChem. It is still interesting to see that additional mutants are present, as it shows that PubChem has value above and beyond ChEMBL when it comes finding positions where SAR has been explored for a particular bioactive peptide.

Notes: Sequence logos were generated with WebLogo3. For more details see Noel O’Blog.

PubChem as a sequence database: Identity search

The previous post introduced the concept of treating PubChem as a sequence database. In that post, structures with the same sequence were collated to search for alternative disulfide bridging patterns. Here we explore the general concept of using sequence identity to search PubChem with the goal of answering the question, what results would (or should) be found for an exact sequence search of a chemical database?

Let’s take for example, kemptide. When written as a sequence, this is represented exactly by LRRASLG:

There are a number of choices to be made when converting a chemical structure to a sequence. For example:
1. We can write uppercase characters for all of D-/L-/DL-amino acids (1a), or we can use lowercase for D- (1b)
2. We can treat sidechain stereochemistry variants of Thr and Ile all as if they were Thr or Ile (i.e. T/I) (2a), or else handle the allo- and ξ versions with ‘X’ (2b)
3. We can consider all sidechain modifications as the parent aminoacid (3a, e.g. Ser(PO3H2) as Ser (and so ‘S’ instead of ‘X’) , instead of distinguishing between them (3b)

If we generate sequences for PubChem following the least specific rules (i.e. 1a, 2a, 3a), then 16 structures are found that have the same sequence as kemptide. These can then be partitioned by generating sequences with more specific rules, for example, distinguishing based on the presence of D- stereochemistry (i.e. 1b, 2a, 3a). As shown below, this first level of separation splits the sequences into those corresponding to LRRASLG, LRraSLG, LrRASLG, lRRASLG and lrRASLG.

In this particular case, only the first of these contains multiple strutures. These can be further split by applying rules 2b and 3b, which here separate based on the phosphorylation of the serines. After this, the ties can be split by considering the IUPAC condensed representation which shows differences in N- and C-terminal modifications, or the presence of a cosolvate.

        Ac-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH 78069426 acetyl-kemptide
        Ac-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 71429096 acetyl-kemptide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-NH2 85062657 kemptide amide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH 100074 kemptide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH.TFA 118797564 
        H-Leu-Arg-Arg-Ala-Ser-Leu-Gly-NH2 9897033 kemptide amide
        H-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 9962276 kemptide
        Unk-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 11650926,101224399,101878757 
        H-Leu-Arg-Arg-Ala-Ser(PO3H2)-Leu-Gly-NH2 102212089 [Ser(PO3H2)-5]kemptide amide
        H-Leu-Arg-Arg-Ala-Ser(PO3H2)-Leu-Gly-OH 13783725 [Ser(PO3H2)-5]kemptide
        H-Leu-Arg-D-Arg-D-Ala-Ser-Leu-Gly-OH 53393688 [D-Arg3,D-Ala4]kemptide
        H-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864041 [D-Arg2]kemptide
        H-D-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864040 [D-Leu1]kemptide
        H-D-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864042 [D-Leu1,D-Arg2]kemptide

Kemptide was chosen here because it’s fairly obscure and so only 16 hits were found. More popular peptides yield many more exact identity hits. For example, oxytocin has 87 hits, octreotide 207 hits, and substance P 608. In fact, turning this on its head, we can also use these exact identity matches to find ‘popular’ peptides that are missing from our internal peptide database.

PubChem as a sequence database: Disulfide bridging patterns

While PubChem is best associated with small molecules, it contains an increasing amount of biopolymers through depositions of databases of molecules of biological interest (e.g. ChEBI, GuideToPharmacology) not to mention a large number of vendors. As every good bioinformatician knows, biopolymers should be represented as a sequence of letters, preferably capital letters. Let’s see what we can do with a representation of PubChem as a sequence database.

Here we focus on searching for peptides with the same sequence but that have different disulfide bridges. Rather esoteric, perhaps, but it illustrates the general approach. We’ll exclude from our analysis structures where a bridge is reduced or is protected. The diagram below illustrates an example of what we’re looking for; these two peptides have the same primary sequence but are structural isomers due to the difference in the disulfide bridges.

As a bit of background, such alternative structures do not occur with natural peptides (as far as I can tell) – so-called non-native disulfides are corrected during disulfide-bond formation in the ER. Any instances we find are either errors by the depositor, or artificially created.

To begin with, I converted PubChem SMILES to peptide sequences using Sugar&Splice’s tseq format, which treats all aminoacid stereo forms as ‘L-‘, and all Thr/Ile sidechain stereo forms as the parent Thr/Ile. For our purposes, the most important point is that it ignores disulfide bridges. So all of the different disulfide bridging forms (including the reduced form) will have the same sequence. Once generated, I filtered for sequences containing 4 or more cysteines and collated the results.

In total, I found 16 potentially interesting cases, of which 12 were errors but 4 were real. Here’s one of the real ones, α-conotoxin SI (CID11480353 and CID101041637), which was deposited by Nikajii and originally extracted from Controlled syntheses of natural and disulfide-mispaired regioisomers of α-conotoxin SI (B. Hargittai, G. Barany. J. Peptide Res. 1999, 54, 468).

  1212 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(1)-Gly-Pro-Lys-Tyr-Ser-Cys(2)-NH2 CID11480353
  1221 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(2)-Gly-Pro-Lys-Tyr-Ser-Cys(1)-NH2 CID101041637

Where the entry was erroneous, it was typically the case that the correct entry was associated with more depositors. But not always – for the case below (MCD peptide), the incorrect bridging structure has 10 depositors (the 1221 below) while the correct one has 2. It’s nice to see that the correct structure also has defined stereochemistry, in contrast to the incorrect one.

  1212 H-Ile-Lys-Cys(1)-Asn-Cys(2)-Lys-Arg-His-Val-Ile-Lys-Pro-His-Ile-Cys(1)-Arg-Lys-Ile-Cys(2)-Gly-Lys-Asn-NH2 CID16136550
  1221 H-DL-xiIle-DL-Lys-DL-Cys(1)-DL-Asn-DL-Cys(2)-DL-Lys-DL-Arg-DL-His-DL-Val-DL-xiIle-DL-Lys-DL-Pro-DL-His-DL-xiIle-DL-Cys(2)-DL-Arg-DL-Lys-DL-xiIle-DL-Cys(1)-Gly-DL-Lys-DL-Asn-NH2 CID16132290