Cross-checking peptide SMILES from Wikipedia

Spot the error in this structure of Bombesin
Here at NextMove Towers, we find Wikipedia a very useful resource. In fact, Roger gave a talk on this at the recent ACS meeting. But here’s a completely different application: a comparison of the SMILES/names generated by Sugar & Splice for oligopeptides with those present in Wikipedia.

The background to this is that, while there are an enormous number of possible short peptides, the number with trivial names (such as oxytocin and neuropeptide S) is fairly small. However, since IUPAC define how to name derivatives of peptides, these trivial names can be used as reference points to cover a wider range of peptides of potential therapeutic interest, e.g. [2-alanine]oxytocin and neuropeptide S (3-8).

One nice feature of Wikipedia is its use of categories: pages about peptides are marked with the category Peptides. Well, almost. They may also, or instead, be marked as belonging to a subcategory of Peptides, e.g. Neuropeptides. Anyhoo, with a bit of Python code that accessed the Wikipedia API, I was able to download all of the pages on peptides, 561 in total. I then searched the text of these pages for SMILES strings (typically matching “SMILES *= *(.*)\n” in a Chembox or Drugbox), and finally converted each SMILES string to a peptide name with Sugar & Splice.
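For anyone wanting to try something similar, here is a minimal sketch of the download-and-search steps described above. It is not the script actually used: the function names, the User-Agent string and the choice of the English Wikipedia endpoint are all assumptions made for illustration, and it relies on the standard MediaWiki query modules (list=categorymembers and prop=revisions). The final SMILES-to-name conversion with Sugar & Splice is not shown.

```python
import json
import re
import urllib.parse
import urllib.request

# Assumed endpoint: the English Wikipedia MediaWiki API.
API_URL = "https://en.wikipedia.org/w/api.php"

# The pattern quoted in the post: a SMILES field in a Chembox or Drugbox.
SMILES_FIELD = re.compile(r"SMILES *= *(.*)\n")


def extract_smiles(wikitext):
    """Return the non-empty SMILES values found in a page's wikitext."""
    return [s.strip() for s in SMILES_FIELD.findall(wikitext) if s.strip()]


def api_get(params):
    """Send one GET request to the API and return the decoded JSON."""
    query = urllib.parse.urlencode({**params, "format": "json"})
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        # Hypothetical User-Agent; replace with one identifying your script.
        headers={"User-Agent": "peptide-smiles-check/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_pages(category, seen=None):
    """Collect page titles in a category and, recursively, its subcategories."""
    seen = set() if seen is None else seen
    if category in seen:  # guard against category loops
        return set()
    seen.add(category)
    titles = set()
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
    }
    while True:
        data = api_get(params)
        for member in data["query"]["categorymembers"]:
            if member["ns"] == 14:  # namespace 14 is Category
                titles |= category_pages(member["title"], seen)
            else:
                titles.add(member["title"])
        if "continue" not in data:
            break
        params = {**params, **data["continue"]}  # fetch the next batch
    return titles


def page_wikitext(title):
    """Fetch the current wikitext of one page (assumes the page exists)."""
    data = api_get({
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "formatversion": "2",
    })
    return data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]
```

Calling category_pages("Category:Peptides") and then extract_smiles(page_wikitext(title)) for each title would give the SMILES strings to pass on for naming. Note that the pattern only matches a field followed by a newline, and that fields with other names (e.g. numbered SMILES fields) would need a looser pattern.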

For those cases where Sugar & Splice generated a peptide name, the names were mostly in agreement with the title of the Wikipedia page…but not always. For example, the SMILES for Tuftsin was named as [4-D-arginine]tuftsin: the sequence of tuftsin is Thr-Lys-Pro-Arg, but the SMILES was actually for Thr-Lys-Pro-D-Arg. Bombesin was named as [8-BLAH]bombesin: the 8th residue is supposed to be tryptophan, but the bond to the indole was in the wrong location, so Sugar & Splice identifies it as Ala(indol-2-yl) instead of Trp. Interestingly, if you look at the talk page for Bombesin you can see that someone pointed out this very error in the diagram back in 2011. For those cases where we have found such errors, we will be updating Wikipedia.

Of course, other examples provide cases that need to be added to Sugar & Splice’s dictionary, e.g. Felypressin is named as [2-L-phenylalanine]lypressin, and Morphiceptin as [4-L-proline]endomorphin-2. So overall, Wikipedia provides a nice source of named peptides which we can use to improve our software, and at the same time we are happy to contribute back fixes for any problems we observe.
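The two kinds of disagreement described above can be told apart with a simple string comparison. The following is only a rough sketch under that assumption; the classify function and its three labels are hypothetical and not part of Sugar & Splice. A generated name that is the page title plus a modification (a bracketed prefix or a fragment suffix) suggests the Wikipedia SMILES may be wrong, while a name built on a different parent suggests a trivial name that may be missing from the dictionary.

```python
def classify(title, generated_name):
    """Compare a Wikipedia page title with the name generated from its SMILES."""
    t = title.strip().lower()
    n = generated_name.strip().lower()
    if n == t:
        return "match"
    # e.g. "[4-D-arginine]tuftsin" or "neuropeptide S (3-8)"
    if n.endswith("]" + t) or n.startswith(t + " ("):
        return "variant-of-title"  # the SMILES on the page may be in error
    return "different-name"  # the title may be a missing dictionary entry


print(classify("Tuftsin", "[4-D-arginine]tuftsin"))
print(classify("Felypressin", "[2-L-phenylalanine]lypressin"))
```

Either flag still needs checking by hand, since a real derivative could legitimately have its own Wikipedia page.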

Image credit: Image by Megac7
