Roger presented a talk entitled Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions from Pharmaceutical ELNs at the recent 6th Joint Sheffield Conference on Chemoinformatics. The talk shows some of the problems encountered when handling data from pharmaceutical ELNs.
For more information on HazELNut see its webpage.
Sugar & Splice at 6th Joint Sheffield Conference on Chemoinformatics
I attended the ever-excellent Sheffield Cheminformatics – sorry – Chemoinformatics Conference last week where I presented a poster on Sugar & Splice, Macromolecules or Big Small-Molecules? Handling Biopolymers in a Chemical Registry System (click on the image below to access the PDF):
If you’re familiar with the HELM format, a new format for describing macromolecules, you may be interested to note the HELM string in the bottom-left of the poster which represents a cyclic peptide connected to a cysteine through a disulfide bridge:
PEPTIDE1{[ac].C}|PEPTIDE2{N.V.P.C}|CHEM1{*N[C@@H](Cc1ccc(cc1)OP(=O)(O)O)C(=O)* |$_R1;;;;;;;;;;;;;;;;;_R2$|}|PEPTIDE3{V}$PEPTIDE1,PEPTIDE2,2:R3-4:R3|PEPTIDE2,CHEM1,4:R2-1:R1|CHEM1,PEPTIDE3,1:R2-1:R1|PEPTIDE3,PEPTIDE2,1:R2-1:R1$$$
In this case, the HELM string is much longer than the corresponding IUPAC Condensed string, and indeed also longer than the all-atom SMILES string. Unfortunately, while both tyrosine and phosphate are supported as monomers by the current HELM release, phosphotyrosine is not, nor can it be constructed by connecting the phosphate to the tyrosine (there is no R3 locant). As a result, the phosphotyrosine is represented as a CHEM object using the SMILES string.
HazELNut and Leadmine in the news: Saving millions for GSK
GSK’s Socrates Search project made a splash at Bio-IT World back in April. This chemically-aware enterprise search system won the 2013 Bio-IT World Best Practices Award for Knowledge Management. Congratulations to Andrew Wooster and his team. The system showcased the use of NextMove Software’s LeadMine and HazELNut products, in conjunction with HP Autonomy and ChemAxon’s JChem Cartridge.
A recent article by Matt Luchette over at the Bio-IT World website gives an overview of the project and explains why existing enterprise search tools need to be made chemically-aware:
What they realized, though, was that the search requirements for their scientists were different than those of a standard text search engine…Most importantly, the engineers wanted the program to search the company’s entire library of electronic lab notebooks and recognize chemicals through their various generic and scientific names, as well as drawings and substructures.
…
Socrates Search, as the project came to be known, was made by combining a number of commercial search programs… Autonomy’s text search and ChemAxon’s JChem Oracle cartridge, which allows users to search for chemicals with their various names or structure, were already a part of GSKSearch, but now had added capabilities, including improved text analytics and data extraction with software from NextMove, and web integration with Microsoft’s C# ASP.NET libraries. The result was a new program that could search through the company’s archived electronic lab notebooks and recognize a vast library of scientific terms, bringing once inaccessible data to scientists’ fingertips.
“Searching for Gold: GSK’s New Search Program that Saved Them Millions”, Matt Luchette, June 2013, Bio-IT World website
PubChem peptide depictions
I’ve been doing some work on the peptide depictions generated by Sugar & Splice and thought it might be nice to show a variety of the more interesting structures that are present in PubChem.
The following animated gif shows examples of cyclic peptides, disulfide bridges, D-amino acids, terminal modifications as well as main-chain and side-chain modifications. The depiction style used is that recommended by IUPAC.
Note: To create the animated gif, I combined several pngs using ImageMagick (see this gist).
Java 6 vs Java 7: When implementation matters
Java 7u6 last year brought with it a change to the implementation of String#substring. In previous versions of Java, Strings created by substring shared the same char array as the original String with an internal offset being used to make sure the correct characters were retrieved. This has the advantage of making substring a very cheap operation, O(1), but has the potential to create a significant memory leak. If a substring is taken of a long String, whilst the substring remains accessible the char array of the long String cannot be garbage collected. Java 7u6 (and later) change this and instead return truly independent Strings… but this requires copying part of the char array meaning substring is now an O(n) operation and hence repeatedly taking long substrings should be avoided.
A case where this behaviour can occur is in a tokenizer that after recognizing each token redefines the remaining String using substring. The longer the String in question the greater the effect on performance.
This behaviour is found in OPSIN’s parser, which accounts for a ~13% decrease in performance when moving from JDK6 to JDK7.
This performance regression can be tackled in at least two ways: implementing a cheap substring operator using a decorating class (an example), or not using substring at all and instead keeping track of an index in the string from which to read. The first approach is hampered by String being final, so a decorating class must instead implement CharSequence, which is far less frequently used. Hence for OPSIN I chose the approach of keeping track of the index that tokenization had reached.
Substrings are still used to capture the tokens which may explain why the performance is still a bit slower.
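The difference between the two tokenizer styles can be sketched as follows. This is a minimal illustration in Python rather than OPSIN's Java, and it is not OPSIN's actual code: the token pattern and the function names are made up for the example. Python slicing copies characters much as Java 7u6's substring does, so the same trade-off applies.

```python
import re

# Hypothetical token pattern, for illustration only: a run of letters,
# a run of digits, or any single other character.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|.", re.DOTALL)

def tokenize_by_slicing(text):
    """Redefine the remaining input after every token (copies the tail each time)."""
    tokens = []
    remaining = text
    while remaining:
        match = TOKEN.match(remaining)
        tokens.append(match.group())
        remaining = remaining[match.end():]  # copies the rest of the string
    return tokens

def tokenize_by_index(text):
    """Leave the input intact and advance an index instead."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)  # start matching at pos; the tail is not copied
        tokens.append(match.group())    # only the token itself is copied
        pos = match.end()
    return tokens

tokens = tokenize_by_index("2-chloro3")
```

Both functions return the same tokens; only the second avoids copying the unread remainder of the input on every step.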
More details on the substring implementation change:
http://www.javaadvent.com/2012/12/changes-to-stringsubstring-in-java-7.html
Accessing SMILES atom order
In the course of my work, I sometimes have to search the dustier corners of cheminformatics toolkits to find features which are seldom used and may be undocumented. One example of this is how to relate the atoms of a toolkit molecule to their order in an output SMILES string. The various toolkits that I use allow one to do this, but the exact method is somewhat different in each case.
Open Babel stores it in a property of a molecule which you can access after writing a SMILES string. The value returned is a string containing the atom indices separated by spaces. This must be parsed before it can be used as a lookup:
OBPairData *pd = (OBPairData*) mol->GetData("SMILES Atom Order");
std::string atomOrder = pd->GetValue();
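The parsing step might look something like the sketch below, shown in plain Python with no toolkit required. The function names and the example value are hypothetical; the only thing taken from the description above is that the value is a string of atom indices separated by spaces.

```python
def parse_atom_order(value):
    """Turn a space-separated string of atom indices into a list of ints."""
    return [int(field) for field in value.split()]

def smiles_position_lookup(order):
    """Map each toolkit atom index to its position in the output SMILES."""
    return {atom_idx: pos for pos, atom_idx in enumerate(order)}

# Hypothetical property value, for illustration only
order = parse_atom_order("3 1 2 4")
lookup = smiles_position_lookup(order)
```

Here `order[i]` gives the toolkit index of the i-th atom written to the SMILES string, and `lookup` answers the reverse question.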
RDKit also does something similar but it returns the desired vector of atom indices directly:
// Declare the vector itself (not an uninitialized pointer) and let getProp fill it
std::vector<unsigned int> atomOrder;
mol->getProp("_smilesAtomOutputOrder", atomOrder);
In contrast, OEChem fills the atom order information into a data structure that you (optionally) provide when calling the function to create a SMILES string. To get the atom order as indices you need to remember the atom order of the current atoms, and then iterate over the data structure accessing the second item of the pair, and looking up the corresponding index.
// Allocate one (atom, atom) pair per possible atom index
typedef std::pair<const OEChem::OEAtomBase*, const OEChem::OEAtomBase*> AtomPair;
size_t count = mol->GetMaxAtomIdx();
AtomPair *atmord = (AtomPair*) malloc(count * sizeof(AtomPair));
OEChem::OECreateSmiString(smiles, *mol,
    OEChem::OESMILESFlag::AtomStereo | OEChem::OESMILESFlag::BondStereo,
    atmord);
Could the real explicit hydrogens please stand up?
The OpenSMILES specification led by Craig James of eMolecules tries to iron out the handling of various corner cases in SMILES. Furthermore it describes best practices when writing SMILES so as to aid ease of interpretation. The goal of all of this is to avoid loss or corruption of chemical information when exchanged as SMILES between different software.
One issue which was not addressed, and indeed perhaps does not come under the OpenSMILES remit, is the question of how many hydrogens should be explicitly created in the internal representation of a molecule read from a SMILES string.
Which hydrogens are explicit?
The SMILES strings “C” and “C([H])([H])([H])[H]” both represent methane, but the first is typically converted into a single carbon atom with four implicit hydrogens, while the second is converted to a carbon atom with four explicit hydrogens attached. With Open Babel, we can see this difference by calling NumAtoms() on the molecule:
>>> pybel.readstring("smi", "C").OBMol.NumAtoms()
1
>>> pybel.readstring("smi", "C([H])([H])([H])[H]").OBMol.NumAtoms()
5
So far so good. Most toolkits would show the same behaviour.
However, if instead we take the SMILES string “[CH3][C@@H](Br)Cl”, Daylight and ChemAxon will have 4 atoms, OEChem will have 5 (the stereo hydrogen is added), and Open Babel will have 8 (all the hydrogens mentioned in a square bracket are added).
Why it matters
It shouldn’t really matter how a molecule is stored internally in a toolkit except perhaps for performance. But wouldn’t you know, there is a common case where it does matter. If you use SMARTS for substructure searching/matching, then the distinction between explicit and implicit hydrogens is going to affect your results.
You see, there are various SMARTS terms that distinguish between implicit and explicit hydrogens. I don’t think this was one of Daylight’s finest moments; in general SMARTS expressions correspond to molecular substructures, except for the crazy terms that match substructures of particular internal representations. The offending terms are:
h<n> Atom has <n> implicit hydrogens attached
D<n> Atom has <n> explicit connections
In short, if you use these terms to match against molecules read from SMILES strings, you will get different results with different toolkits because of the reasons discussed above. The OpenSMARTS draft specification (led by Tim Vandermeersch) also makes this point.
To avoid this problem, you should only use ‘portable terms’ in SMARTS expressions (i.e. ones which do not depend on whether atoms are explicit or implicit). Going forward, it would be useful for all toolkits to agree to adopt Daylight’s approach on the internal representation resulting from reading SMILES.
Handling biologics: A perception problem?
In an earlier post, I described the importance of knowing the biopolymer structure when handling biologics. I also discussed various file formats that have been proposed to address this.
But rather than regarding this as a file format problem, why not consider it instead a perception problem? If we can perceive the biopolymer structure from an all-atom representation, then interconverting between any file format (whether one of the proposed biopolymer formats or existing all-atom representations such as SMILES) is straightforward. Can it be done? Well, that’s exactly what PDB file writers do; they perceive the amino acid sequence from the all-atom structure and fill in the relevant columns in the PDB file.
There are several benefits to this approach. To begin with, it avoids the cost associated with a new registry system based on a macromolecular file format. There are no problems with new and unusual monomers; these will be faithfully stored in the all-atom representation. The de-facto standards for chemical information interchange, SMILES and MOL files, can be used as always for exchange of data. Tools for small-molecule analysis (e.g. SMARTS searching) can be combined with analyses based on biopolymer structure (e.g. HELM depiction, Smith-Waterman searching). And finally it’s worth considering that it may be difficult to migrate at a later date if a registry system is based on a particular file format.
What I’ve described here and in the previous post is the introduction from my ACS presentation on Roundtripping between small-molecule and biopolymer representations. This describes the development of the Sugar & Splice software for handling oligopeptides, oligonucleotides and oligosaccharides (including modified residues and mixtures of different biopolymers). Note that the presentation is somewhat sugar-centric; for more info on the peptide and nucleotide side of things see Roger’s Spring 2012 ACS presentation.
(For more presentations from NextMove Software, see our SlideShare page.)
Finding all types of every ‘mer
John Overington over at the ChEMBLog has recently discussed the task of finding redox pairs in a database. As John points out, these are neither isomers nor tautomers but are of interest in any case.
It turns out that after a minor modification, NextMove’s Equalizer software (insert link from future here) can be used to find these. This software can be used to generate canonical “hashes” for molecules which cause different forms of the same molecule to hash to the same representation [Footnote]. It’s not a new idea by any means (think of the layered structure of the InChI) but the way we do this is pretty neat as it uses canonical SMILES to do the hashing. By altering the information encoded in the SMILES different forms of the same structure can be identified; for example mesomers, structural isomers, tautomers [1], protomers…and now redox pairs.
For redox pairs, a SMILES is generated after setting all bond orders to 1 and all atom charges to 0. If the resulting canonical SMILES is identical but the overall (original) charge is different, two structures can be considered a redox pair. Here are some examples found in ChEMBLdb 15:
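Separately from those examples, the comparison step described above can be sketched in a few lines of Python. This is only an illustration, not the Equalizer implementation: it assumes the reduced canonical SMILES has already been computed elsewhere, and the record names and skeleton strings are placeholders rather than real structures.

```python
from collections import defaultdict
from itertools import combinations

def find_redox_pairs(records):
    """records: iterable of (name, skeleton_hash, overall_charge) tuples.

    skeleton_hash stands in for the canonical SMILES generated after setting
    all bond orders to 1 and all atom charges to 0.
    """
    by_skeleton = defaultdict(list)
    for name, skeleton, charge in records:
        by_skeleton[skeleton].append((name, charge))

    pairs = []
    for members in by_skeleton.values():
        for (name_a, charge_a), (name_b, charge_b) in combinations(members, 2):
            if charge_a != charge_b:  # same skeleton, different overall charge
                pairs.append((name_a, name_b))
    return pairs

# Placeholder data: not real molecules or real canonical SMILES
records = [
    ("mol_A", "SKEL_1", 0),
    ("mol_B", "SKEL_1", 1),
    ("mol_C", "SKEL_2", 0),
    ("mol_D", "SKEL_1", 0),
]
pairs = find_redox_pairs(records)
```

With this data, mol_B pairs with both mol_A and mol_D, while mol_A and mol_D share a skeleton and a charge and so are not reported.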
[Footnote] This is similar to the classic way of finding anagrams given a dictionary of words (I read about this in Jon Bentley’s Programming Pearls). Take each word, sort its letters (this sorted word is the “hash” in this case; words with the same hash will be anagrams) and write it to an output file followed by the original word on the same line. Sort the output file, and look for adjacent duplicates in the first column. The corresponding anagrams will be in the second column.
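As a small sketch of the anagram trick in Python: note that it groups words in a dictionary keyed by the sorted letters, instead of sorting an output file and scanning for adjacent duplicates as described above. The idea is the same, and the word list is made up for the example.

```python
from collections import defaultdict

def anagram_classes(words):
    """Group words that share the same sorted-letter "hash"."""
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))  # the hash: the word's letters in sorted order
        groups[key].append(word)
    # Keep only hashes shared by more than one word
    return [group for group in groups.values() if len(group) > 1]

classes = anagram_classes(["stop", "pots", "tops", "atom", "moat", "bond"])
```

Swapping the sorted-letter key for a canonical SMILES with selected information removed gives the molecular version of the same idea.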
References:
[1] RA Sayle, So you think you understand tautomerism? JCAMD, 2010, 24, 485.
Pharma’s favourite reactions
Derek Lowe of In the Pipeline has previously commented on the fact that relying on the same old reliable chemical transformations may restrict medicinal chemists to particular regions of chemical space. In order to reach beyond these, it is necessary to use reactions that are “often not so general and reliable”.
As part of a poster presentation at the recent Bio-IT World meeting in Boston, Roger has analysed the Electronic Lab Notebook of a major pharmaceutical company using our NameRXN software to extract the major classes of reaction as well as the specific top five transformations:
Update 2013/04/22: See Derek Lowe’s analysis of these results over at In the Pipeline.
The full poster explains how the data was extracted from the PKI/CambridgeSoft ELN using HazELNut, and also shows how data quality issues in the reaction entry may be flagged up by combining the NameRXN information with atom-mapping algorithms (PDF version):