While PubChem is best associated with small molecules, it contains an increasing number of biopolymers, deposited by databases of molecules of biological interest (e.g. ChEBI, GuideToPharmacology), not to mention a large number of vendors. As every good bioinformatician knows, biopolymers should be represented as a sequence of letters, preferably capital letters. Let’s see what we can do if we treat PubChem as a sequence database.
Here we focus on searching for peptides with the same sequence but that have different disulfide bridges. Rather esoteric, perhaps, but it illustrates the general approach. We’ll exclude from our analysis structures where a bridge is reduced or is protected. The diagram below illustrates an example of what we’re looking for; these two peptides have the same primary sequence but are structural isomers due to the difference in the disulfide bridges.
As a bit of background, such alternative structures do not occur in natural peptides (as far as I can tell); so-called non-native disulfides are corrected during disulfide-bond formation in the endoplasmic reticulum (ER). Any instances we find are either errors by the depositor or artificially created.
To begin with, I converted PubChem SMILES to peptide sequences using Sugar&Splice’s tseq format, which treats all amino acid stereo forms as ‘L-’, and all Thr/Ile sidechain stereo forms as the parent Thr/Ile. For our purposes, the most important point is that it ignores disulfide bridges, so all of the different disulfide bridging forms (including the reduced form) have the same sequence. Once the sequences were generated, I filtered for those containing 4 or more cysteines and collated the results by sequence.
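The filter-and-collate step can be sketched in a few lines of Python. This is only an illustration, not the code actually used: it assumes the tseq conversion has already been done elsewhere and that the results are available as (CID, sequence) pairs. The function name `collate_by_sequence` and the last two records are hypothetical.

```python
from collections import defaultdict

def collate_by_sequence(records, min_cys=4):
    """Group CIDs by sequence, keeping sequences with at least min_cys cysteines.

    `records` is an iterable of (cid, sequence) pairs; producing them from
    PubChem SMILES is assumed to have happened upstream.
    """
    groups = defaultdict(list)
    for cid, seq in records:
        if seq.count("C") >= min_cys:
            groups[seq].append(cid)
    # Only a sequence shared by two or more CIDs can differ in its bridging.
    return {seq: cids for seq, cids in groups.items() if len(cids) > 1}

records = [
    (11480353, "ICCNPACGPKYSC"),
    (101041637, "ICCNPACGPKYSC"),
    (0, "ACDCK"),      # hypothetical record: too few cysteines
    (1, "GCCACCK"),    # hypothetical record: only one CID for this sequence
]

candidates = collate_by_sequence(records)
print(candidates)  # {'ICCNPACGPKYSC': [11480353, 101041637]}
```

Each surviving group still has to be checked by hand or by comparing the bridging, since CIDs that share a sequence may differ only in stereochemistry or in a reduced or protected cysteine.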
In total, I found 16 potentially interesting cases, of which 12 were errors and 4 were real. Here’s one of the real ones, α-conotoxin SI (CID11480353 and CID101041637), which was deposited by Nikkaji and originally extracted from “Controlled syntheses of natural and disulfide-mispaired regioisomers of α-conotoxin SI” (B. Hargittai, G. Barany. J. Peptide Res. 1999, 54, 468).
ICCNPACGPKYSC
1212 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(1)-Gly-Pro-Lys-Tyr-Ser-Cys(2)-NH2 CID11480353
1221 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(2)-Gly-Pro-Lys-Tyr-Ser-Cys(1)-NH2 CID101041637
Where the entry was erroneous, it was typically the case that the correct entry was associated with more depositors. But not always – for the case below (MCD peptide), the incorrect bridging structure has 10 depositors (the 1221 below) while the correct one has 2. It’s nice to see that the correct structure also has defined stereochemistry, in contrast to the incorrect one.
IKCNCKRHVIKPHICRKICGKN
1212 H-Ile-Lys-Cys(1)-Asn-Cys(2)-Lys-Arg-His-Val-Ile-Lys-Pro-His-Ile-Cys(1)-Arg-Lys-Ile-Cys(2)-Gly-Lys-Asn-NH2 CID16136550
1221 H-DL-xiIle-DL-Lys-DL-Cys(1)-DL-Asn-DL-Cys(2)-DL-Lys-DL-Arg-DL-His-DL-Val-DL-xiIle-DL-Lys-DL-Pro-DL-His-DL-xiIle-DL-Cys(2)-DL-Arg-DL-Lys-DL-xiIle-DL-Cys(1)-Gly-DL-Lys-DL-Asn-NH2 CID16132290
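Labels such as 1212 and 1221 in the listings above can be read straight off the bridge numbers attached to each Cys. The following is a minimal sketch of that idea, assuming the bridged cysteines are always written as `Cys(n)` with a numeric label, as they are in the listings here; the function names are my own for illustration and are not part of Sugar&Splice.

```python
import re
from collections import defaultdict

# Matches a bridged cysteine such as Cys(1) or DL-Cys(2) and captures the label.
CYS_LABEL = re.compile(r"Cys\((\d+)\)")

def bridge_pattern(annotated):
    """Concatenate the bridge labels in sequence order, e.g. '1212'."""
    return "".join(CYS_LABEL.findall(annotated))

def bridge_pairs(annotated):
    """Return the bridges as sorted tuples of cysteine ordinals (1st Cys, 2nd Cys, ...)."""
    positions = defaultdict(list)
    for ordinal, label in enumerate(CYS_LABEL.findall(annotated), start=1):
        positions[label].append(ordinal)
    return sorted(tuple(p) for p in positions.values())

cid_11480353 = "H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(1)-Gly-Pro-Lys-Tyr-Ser-Cys(2)-NH2"
cid_101041637 = "H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(2)-Gly-Pro-Lys-Tyr-Ser-Cys(1)-NH2"

print(bridge_pattern(cid_11480353), bridge_pairs(cid_11480353))    # 1212 [(1, 3), (2, 4)]
print(bridge_pattern(cid_101041637), bridge_pairs(cid_101041637))  # 1221 [(1, 4), (2, 3)]
```

Comparing the pairs rather than the raw label strings avoids treating two entries as different when the depositor has merely numbered the same bridges in a different order.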