I attended the ever-excellent Sheffield Cheminformatics – sorry – Chemoinformatics Conference last week, where I presented a poster on Sugar & Splice, "Macromolecules or Big Small-Molecules? Handling Biopolymers in a Chemical Registry System" (click on the image below to access the PDF):
If you’re familiar with the HELM format, a new format for describing macromolecules, you may be interested to note the HELM string in the bottom-left of the poster which represents a cyclic peptide connected to a cysteine through a disulfide bridge:
In this case, the HELM string is much longer than the corresponding IUPAC Condensed string, and indeed also longer than the all-atom SMILES string. Unfortunately, while both tyrosine and phosphate are supported as monomers by the current HELM release, phosphotyrosine is not, nor can it be constructed by connecting the phosphate to the tyrosine (there is no R3 locant). As a result, the phosphotyrosine is represented as a CHEM object using its SMILES string.
GSK’s Socrates Search project made a splash at BIO-IT World back in April. This chemically-aware enterprise search system won the 2013 Bio-IT World Best Practices Award for Knowledge Management. Congratulations to Andrew Wooster and his team. This system showcased the use of NextMove Software’s LeadMine and HazELNut products, in conjunction with HP Autonomy and ChemAxon’s JChem Cartridge.
A recent article by Matt Luchette over at the Bio-IT World website gives an overview of the project and explains why existing enterprise search tools need to be made chemically-aware:
What they realized, though, was that the search requirements for their scientists were different than those of a standard text search engine…Most importantly, the engineers wanted the program to search the company’s entire library of electronic lab notebooks and recognize chemicals through their various generic and scientific names, as well as drawings and substructures.
Socrates Search, as the project came to be known, was made by combining a number of commercial search programs… Autonomy’s text search and ChemAxon’s JChem Oracle cartridge, which allows users to search for chemicals with their various names or structure, were already a part of GSKSearch, but now had added capabilities, including improved text analytics and data extraction with software from NextMove, and web integration with Microsoft’s C# ASP.NET libraries. The result was a new program that could search through the company’s archived electronic lab notebooks and recognize a vast library of scientific terms, bringing once inaccessible data to scientists’ fingertips.
I’ve been doing some work on the peptide depictions generated by Sugar & Splice and thought it might be nice to show a variety of the more interesting structures that are present in PubChem.
The following animated gif shows examples of cyclic peptides, disulfide bridges, D-amino acids, terminal modifications as well as main-chain and side-chain modifications. The depiction style used is that recommended by IUPAC.
Note: To create the animated gif, I combined several pngs using ImageMagick (see this gist).
Java 7u6 last year brought with it a change to the implementation of String#substring. In previous versions of Java, Strings created by substring shared the same char array as the original String, with an internal offset being used to make sure the correct characters were retrieved. This has the advantage of making substring a very cheap operation, O(1), but has the potential to create a significant memory leak: if a substring is taken of a long String, the char array of the long String cannot be garbage collected whilst the substring remains accessible. Java 7u6 and later versions change this and instead return truly independent Strings. This requires copying part of the char array, meaning substring is now an O(n) operation, and hence repeatedly taking long substrings should be avoided.
A case where this behaviour can occur is in a tokenizer that after recognizing each token redefines the remaining String using substring. The longer the String in question the greater the effect on performance.
This behaviour is found in OPSIN’s parser, which accounts for a ~13% decrease in performance when moving from JDK6 to JDK7.
This performance regression can be tackled in at least two ways: implementing a cheap substring operator using a decorating class (an example), or not using substring at all and instead keeping track of an index in the string from which to read. The first approach is hampered by String being final, so a decorating class must instead implement CharSequence, which is far less frequently used. Hence for OPSIN I chose the approach of keeping track of the index that tokenization had reached.
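As a rough illustration of the index-tracking approach (this is not OPSIN's actual code), the sketch below tokenizes a name by greedy longest match against a small, made-up token list. The class name, method name and vocabulary are all hypothetical. The point is that the remaining input is never re-sliced: String#startsWith(prefix, offset) tests for a token at the current index, and substring is called only to capture each recognized token.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexTokenizer {
    // Hypothetical vocabulary for illustration only; not OPSIN's real grammar.
    private static final String[] TOKENS = {"methyl", "ethyl", "prop", "but", "an", "e", "ol"};

    /** Greedy longest-match tokenization that advances an index instead of re-slicing the input. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int index = 0; // position that tokenization has reached
        while (index < input.length()) {
            String best = null;
            for (String candidate : TOKENS) {
                // startsWith(prefix, offset) compares in place, with no copying of the remainder
                if (input.startsWith(candidate, index)
                        && (best == null || candidate.length() > best.length())) {
                    best = candidate;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("No token recognized at index " + index);
            }
            // substring is used only for the (short) token itself
            tokens.add(input.substring(index, index + best.length()));
            index += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("methylbutane")); // prints [methyl, but, an, e]
    }
}
```

With this shape, the copying done per token is proportional to the token's length rather than to the length of the remaining input.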
Substrings are still used to capture the tokens which may explain why the performance is still a bit slower.
More details on the substring implementation change:
In the course of my work, I sometimes have to search the dustier corners of cheminformatics toolkits to find features which are seldom used and may be undocumented. One example of this is how to relate the atoms of a toolkit molecule to their order in an output SMILES string. The various toolkits that I use allow one to do this, but the exact method is somewhat different in each case.
Open Babel stores the atom order in a property of the molecule, which you can access after writing a SMILES string. The value returned is a string containing the atom indices separated by spaces. This must be parsed before it can be used as a lookup.
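The parsing step might look something like the sketch below. It leaves out the toolkit call that fetches the property and starts from the space-separated string itself. It assumes the indices are Open Babel's 1-based atom indices, listed in the order the atoms appear in the SMILES string; the class and method names are made up for illustration.

```java
import java.util.Arrays;

final class SmilesAtomOrder {
    /**
     * Turns a space-separated list of 1-based atom indices (assumed to be in SMILES
     * output order) into a lookup where result[atomIndex - 1] is the 0-based
     * position of that atom in the SMILES string.
     */
    static int[] positionByAtom(String atomOrder) {
        String[] fields = atomOrder.trim().split("\\s+");
        int[] position = new int[fields.length];
        Arrays.fill(position, -1); // -1 marks atoms not mentioned in the order string
        for (int smilesPos = 0; smilesPos < fields.length; smilesPos++) {
            int atomIdx = Integer.parseInt(fields[smilesPos]); // 1-based index (assumption)
            position[atomIdx - 1] = smilesPos;
        }
        return position;
    }
}
```

For an order string of "3 1 2", atom 3 is written first, so the lookup gives position 0 for atom 3, position 1 for atom 1 and position 2 for atom 2.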
In contrast, OEChem fills the atom order information into a data structure that you (optionally) provide when calling the function to create a SMILES string. To get the atom order as indices you need to remember the atom order of the current atoms, and then iterate over the data structure accessing the second item of the pair, and looking up the corresponding index.