Keen readers of the blog will have noticed a recent series of posts on PubChem and sequence databases. This was in preparation for my recent ACS presentation on “PubChem as a biologics database” (see below), the goal of which was (a) to convince the audience that PubChem is actually a biologics database (with some small molecules thrown in for disguise), and (b) to show that applying biologics structure perception (e.g. via Sugar&Splice) to an existing chemical database yields useful insights.
One perennial question in the field of biologics, at least from the point of view of biologic registration systems, is how many monomers there are and how best to handle them. The answer depends on how you slice-and-dice them of course, and on how much of the long tail of possible monomers you actually support. I show that, on one view, the biologics in PubChem contain ~27K monosaccharide monomers and ~8K amino acid monomers (Figure 1).
One of the surprising results was that there are more peptides in PubChem than in the PDB (~500K vs ~110K). This is not quite the case for oligosaccharides, but the number is not too far off the total in GlyTouCan (~67K vs ~80K). The type of peptide/sugar can be quite different though, as the entries in PubChem are those that chemists have gotten their hands on, and it’s full of reaction intermediates with a whole host of protecting groups not usually observed in vivo. Similarly, as I discussed in an earlier blog post, when one looks at sequence variation in PubChem, you’re not getting a view of the million-year process of evolution but rather the sites of variation that a chemist determined might best modulate activity.
Thank you for clicking through my slides, and I’d be happy to take any questions.
How many oligonucleotides represented as sequence structures can be found in PubChem? And what nomenclature is used for modified nucleotides?
The number that we currently find is 3448. However, I believe this underestimates the actual number considerably, something which became apparent in the course of preparing the talk. Over time we’ve adapted the peptide and sugar recognition to support arbitrary substituents; I think that we now need to adapt the nucleotide code in a similar way.
The nomenclature for modified nucleotides is inspired by the University at Albany’s RNA modification database, http://mods.rna.albany.edu/mods/, e.g. br5Cyt for 5-bromocytosine, m1Ade for 1-methyladenine. The reason I say “inspired” is that the database describes terms for nucleotides, but we prefer to describe modifications at the nucleobase level so that we can combine the modifications with a modified sugar and/or a modified phosphate linker.
For more info, see Roger’s Fall ACS 2016 presentation at:
https://www.slideshare.net/NextMoveSoftware/line-notations-for-nucleic-acids-both-natural-and-therapeutic
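To make the naming pattern in the two examples above concrete, here is a minimal Python sketch that splits a symbol such as br5Cyt into a modification prefix, a ring position and a base code. This is only an illustration and not how Sugar&Splice works: the function name `expand`, the lookup tables and the regular expression are all assumptions, and the tables cover nothing beyond the two symbols quoted in the answer.

```python
import re

# Hypothetical lookup tables: only the two examples from the text are covered.
PREFIXES = {"br": "bromo", "m": "methyl"}
BASES = {"Cyt": "cytosine", "Ade": "adenine"}

# Assumed shape of a symbol: lowercase prefix, position digits, 3-letter base.
_SYMBOL = re.compile(r"^([a-z]+)(\d+)([A-Z][a-z]{2})$")


def expand(symbol):
    """Expand a symbol like 'br5Cyt' into a name like '5-bromocytosine'."""
    match = _SYMBOL.match(symbol)
    if match is None:
        raise ValueError(f"unrecognised symbol: {symbol!r}")
    prefix, position, base = match.groups()
    if prefix not in PREFIXES or base not in BASES:
        raise ValueError(f"unsupported prefix or base in: {symbol!r}")
    return f"{position}-{PREFIXES[prefix]}{BASES[base]}"


if __name__ == "__main__":
    for sym in ("br5Cyt", "m1Ade"):
        print(sym, "->", expand(sym))
```

Because the modification is attached to the base code alone, the same symbol could in principle be paired with a separate description of the sugar and the phosphate linker, which is the reason given above for working at the nucleobase level.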