Biopolymer Canonicalisation Scaling Between Toolkits

We’ve previously shown using all-atom structure representations is a tractable approach to handling biologics (see https://nextmovesoftware.com/blog/2014/11/). Handling biologics in this way allows you to reuse existing registration infrastructure (e.g. Canonical SMILES/InChI/CACTVS keys).

At the Fall ACS ’15, Roger presented an update to this on-going work showing that many popular open-source cheminformatics toolkits can already handle peptides < 500 AA (the size of immunoglobulin heavy chains) in less than a second. We timed the generation of a canonical SMILES string (from the internal representation) over SwissProt. With the exception of Indigo/CDK (that hit hard error limits) the lines stop due to time constraints.


sp-all_1000

One thing the timings highlighted was recent improvements in RDKit that show faster canonicalisation and and reduced scatter (similar size structures ~ same amount of time). CDK was originally limited by the number of primes listed (it uses product of primes for refinement); patching the CDK to use more primes allows it to encode biopolymers of over 1000 AA.


sp-scatter1000

Roger’s full talk is available here: