Nh, Mc, Ts and Og spell trouble

John and Roger recently published a commentary in the Journal of Cheminformatics on the “Technical implications of new IUPAC elements in cheminformatics”. It’s fairly short, and focuses on ambiguities that may arise in two areas: (1) interpreting chemical sketches and (2) SMARTS patterns.

Regarding (1), the main point is that Ts, the new symbol for tennessine, is already widely used to indicate tosyl. While one could use Tos instead, a quick look at sketches from recent patents indicates that Ts is 20 times more common than Tos.

Section (2) is a bit more technical, and covers ambiguities that writers of SMARTS parsers and generators must address, since adding support for the new elements may cause existing SMARTS to be misinterpreted.

Searching ChEMBL in the browser

A previous post (see the slidedeck from slide 40) described some of the work we have done on the development of fast substructure search, a project code-named Arthor. At the time, it ran about two orders of magnitude faster than any of the other programs benchmarked. Such speed makes possible interactive searches of large databases. That’s pretty obvious, and so rather than discuss that here, here’s something else that’s a bit more novel: interactive substructure search of moderately sized datasets, entirely client-side in the browser.

It is important to note this is not the first time that substructure search has been implemented entirely in the browser: Peter Ertl and co. developed the Wikipedia Structure Explorer which searches almost 15K structures from Wikipedia using the Actelion Java library compiled to JavaScript. However, with Arthor (also compiled to JavaScript), it is possible to search the whole of ChEMBL22_1, 1.68 million molecules, in the browser. It even works on my mid-range phone (Moto G 3rd gen, 2GB RAM), although there it is limited by memory constraints to 1.0 million molecules.

Time for the timings. Note that the times quoted for the native code do not include the use of a fingerprint screen, so as to be like-for-like with the JavaScript, where it is not possible to use fingerprints for the whole of ChEMBL due to RAM constraints. The native and JavaScript times were measured on the same machine (Core i7 6900K CPU, 3.20GHz), and all are times to find the total number of hits (rather than the first 10 or 100 or whatever) using a single thread. Phone times are for 1.0 million molecules. All times are in ms unless otherwise stated.

Query      Hits      Native (1.68M mols)  JavaScript (1.68M mols)  Phone (1.00M mols)
c1ccccc1   1420663   419                  663                      3.24s
Br         75132     113                  197                      819
CCO        754842    230                  368                      1.32s
OOO        1         99                   300                      1.12s
[X5]       160       102                  186                      817
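To put the native and JavaScript columns side by side, here is a small sketch that computes how much slower the JavaScript build is for each query. The numbers are copied from the table above; the function and variable names are just for illustration.

```python
# Timings in ms for the full 1.68M molecules: (native, JavaScript).
timings_ms = {
    "c1ccccc1": (419, 663),
    "Br": (113, 197),
    "CCO": (230, 368),
    "OOO": (99, 300),
    "[X5]": (102, 186),
}


def slowdown(native_ms, javascript_ms):
    """Return how many times slower the JavaScript build is than native code."""
    return javascript_ms / native_ms


slowdowns = {query: slowdown(*times) for query, times in timings_ms.items()}
for query, ratio in slowdowns.items():
    print(f"{query:10s} {ratio:.1f}x")
```

For these five queries the JavaScript build comes out between roughly 1.6 and 3 times slower than native code.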

Imagine a future where the computationally expensive step of substructure searching no longer requires a server, but is done client-side. Impossible, or only a matter of time?

Just what you wanted for Christmas – a compiler for Gaussian

One of Roger’s main interests is compilation as applied to code, SMARTS patterns, and indeed anything else. In fact, for a period back in the noughties, Roger moonlighted as a middle-end maintainer for the GCC project.

So when, during a sabbatical, he was faced with the task of compiling Gaussian, he naturally turned to GFortran. However, given that this would not compile it, he tweaked the compiler and submitted patches to the FSF (see for example, the MOPAC changes on page 43 of this summary). When not all of these patches were accepted into mainline GFortran, he packaged the remaining pieces into a Fortran pre-processor that emulates the (non-standard) behaviour of commercial compilers.

The result, gXXfortran, is now available on GitHub. In theory, it should work for a standard Linux or Mac system. However, as we don’t have access to the Gaussian source code, your mileage may vary.

This package provides a “pgf77” script that emulates the Portland Group’s PGI Fortran 77 compiler, using the Free Software Foundation’s GNU gfortran compiler instead. This emulation is sufficient to allow packages such as Gaussian03, that would otherwise require a commercial compiler, to be built using open source tools.

In addition, this package also allows Gaussian03 to be built on a case-insensitive file system (such as when using Mac OS X, cygwin or a FAT32 drive) by overriding the behaviour of “cp” and “gau-cpp” such that they don’t cause problems when used by Gaussian’s build scripts on non case-sensitive file systems.

Buying a ring, or making one yourself

When synthesising a molecule containing one or more rings, the chemist may decide on a synthetic route that includes ring-forming reactions or may instead be able to rely on starting materials that already incorporate the desired rings. The choice depends on many factors, including cost of starting materials, likely yields, and ease of access to additional analogs.

Some ring systems are very common (a phenyl ring springs to mind, of course), yet they are not often formed as part of a typical synthetic route. Let’s automate the process of finding whether a particular ring system is likely to be formed in a reaction.

As a dataset we will use reactions extracted by LeadMine from the text of US and European patents, keeping those for which Indigo produces an atom-mapping. Only one reaction per patent is used, and exact duplicates are discarded. Given these 212K reactions, here are the most common ring systems (*) in the products along with their frequency:
[Figure: most common ring systems in the products, with their frequencies]
Next, we use the mapping to identify those instances where a ring was formed. For each of these reactions, we take each of the common ring systems in turn and see whether it appears on the right-hand side (RHS) but not on the left-hand side. Here are the most commonly formed rings:
[Figure: most commonly formed ring systems]
Finally, we divide the corresponding figures from the two diagrams above to calculate the likelihood that, given a particular ring system on the RHS, it was formed by the reaction. For example, for the phenyl ring, the likelihood is 807/151983 or 0.5%. Here are the rings with the highest likelihoods:
[Figure: ring systems with the highest likelihood of being formed]
…and those with the lowest:
[Figure: ring systems with the lowest likelihood of being formed]
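The counting described above can be sketched as follows. This is a simplified illustration rather than the actual pipeline: it assumes each reaction has already been reduced to two sets of ring-system labels (reactant side and product side), it skips the atom-mapping step, and the toy reactions and label strings are made up.

```python
from collections import Counter


def ring_formation_likelihoods(reactions):
    """reactions: iterable of (reactant_rings, product_rings) pairs of sets."""
    present = Counter()  # reactions whose products contain the ring system
    formed = Counter()   # ...and whose reactants do not contain it
    for reactant_rings, product_rings in reactions:
        for ring in product_rings:
            present[ring] += 1
            if ring not in reactant_rings:
                formed[ring] += 1
    return {ring: formed[ring] / present[ring] for ring in present}


# Hypothetical toy data: ring systems are represented by plain labels.
toy_reactions = [
    ({"benzene"}, {"benzene", "oxadiazole"}),
    ({"benzene"}, {"benzene"}),
    ({"benzene", "oxadiazole"}, {"benzene", "oxadiazole"}),
]
likelihoods = ring_formation_likelihoods(toy_reactions)

# The phenyl example from the text: formed 807 times out of 151983 occurrences.
phenyl_likelihood = 807 / 151983
print(f"{100 * phenyl_likelihood:.1f}%")
```

In the toy data the benzene ring is never formed, while the oxadiazole label is formed in one of the two reactions in which it appears.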
So what is it about these rings that places them at the top and bottom of the likelihood lists? Comments welcome…

* Depending on how you slice-and-dice molecules to find ring systems, the exact results will vary. Here I included exocyclic double bonds as part of the ring system. In addition, I hashed tautomers to the same representation and removed any stereochemistry.

Comparing structural fingerprints using a literature-based similarity benchmark

We’re just back from the 7th Joint Sheffield Conference on Chemoinformatics, where I presented the poster below on comparing the ability of structural fingerprints to measure structural similarity. As it happens, the corresponding paper has also just come out today:

Noel M. O’Boyle and Roger A. Sayle. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminf. 2016, 8, 36.

What we’ve tried to do is create a ground-truth dataset for structural similarity (in the context of med chemistry), and then test fingerprints against that. One approach to creating this dataset would be to crowd-source it out to medicinal chemists. This is something that Pedro Franco has done, and he was actually presenting some updated results at Sheffield.

We’ve taken an alternative approach: we’ve used the med chemistry literature as collated by the ChEMBL database. On the basis that a team of medicinal chemists have selected these molecules for synthesis and testing as part of the same med chem programme, we regard molecules that appear in the same ChEMBL assay as structurally similar (after removing molecules that appear in 5 or more papers, and some other simple filters).

This gives us pairs of molecules that are similar, but we really want to have a series of molecules with decreasing similarity to a reference, and then see if the various fingerprints can reproduce the series order. To create such a series, we hop from one paper to the next through molecules in common, thereby moving further and further away in terms of similarity from the original molecule. Inspired by Sereina Riniker and Greg Landrum, all of the data and scripts are available at our GitHub repo.
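As a rough illustration of the hopping idea, here is one greedy way to build such a series. It is only a sketch under simple assumptions (each paper is a set of molecule identifiers, and the first available hop in sorted order is taken); the paper names, molecule names and the function `hop_series` are hypothetical, and the scripts in the GitHub repo remain the reference for what was actually done.

```python
def hop_series(papers, start_paper, reference, length):
    """Greedily build a series by hopping between papers via shared molecules.

    papers: dict mapping a paper id to the set of molecule ids it contains.
    """
    series = [reference]
    visited = {start_paper}
    current = start_paper
    while len(series) < length:
        # Find a molecule in the current paper that also occurs in an unvisited paper.
        hop = next(
            (
                (mol, paper)
                for mol in sorted(papers[current])
                if mol not in series
                for paper in sorted(papers)
                if paper not in visited and mol in papers[paper]
            ),
            None,
        )
        if hop is None:
            break  # no further hop is possible
        mol, current = hop
        series.append(mol)
        visited.add(current)
    return series


# Hypothetical toy data: three papers chained by shared molecules.
toy_papers = {
    "paperA": {"mol1", "mol2"},
    "paperB": {"mol2", "mol3"},
    "paperC": {"mol3", "mol4"},
}
series = hop_series(toy_papers, "paperA", "mol1", 3)
print(series)
```

Each molecule added to the series is one more paper away from the reference, which is the sense in which similarity is assumed to decrease along the series.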

Fishing for matched series in a sea of structure representations

When searching for matched pairs/series, the typical approach is to use a fragmentation scheme and then collate the results for the same scaffold. Leaving aside other issues, we come to the question of how to ensure that all matched pairs for the same scaffold are actually found given the following representation issues: tautomeric forms (e.g. keto-enol), charge states (e.g. COO- versus COOH) and charge-separated/hypervalent forms (e.g. nitro as N(=O)=O or [N+](=O)[O-]).

Let’s take assay data in ChEMBL as an example. While the other issues are fairly well nailed down, the tautomer stored in ChEMBL is the first one encountered in the literature. This can lead to situations where molecules from the same assay have the same tautomeric form in the paper but different forms in ChEMBL (e.g. CHEMBL496754 and CHEMBL522563 from CHEMBL1009882):
[Figure: CHEMBL496754 and CHEMBL522563 as stored in ChEMBL]

There are two approaches to sorting out these sorts of problems. The first is to try to generate a canonical representation of the molecule up-front. Note that this need not be the most preferred representation, just one that is canonical. An alternative approach is to create a hash for the structure that is invariant to representation issues and to use this hash to collate the scaffolds. This is actually quite a bit easier than the former approach. In an earlier blogpost, we described this method in the context of finding redox pairs, but it’s one of those ideas that bears repeating as it can be applied to several different problems.

I’ll call this method Sayle Hashing (after all, this fits with the nautical theme of the title). In this particular case, the Sayle Hash consists of two parts, a SMILES string and an integer. The integer is the total number of hydrogens on the non-carbon atoms minus the total of the formal charges on the scaffold, while the SMILES string is the canonical SMILES for the scaffold after setting all bond orders to 1 and the hydrogen counts on non-carbon atoms to 0. An example may be useful at this point. Here is a matched pair we would like to identify:
[Figure: two representations of the same matched pair]
Once fragmented at the halogen bond, we get the following non-identical scaffold SMILES:

*c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O
*c1c(c(C(=O)N)cc2[nH]c(nc12)C(=O)O)[N+](=O)[O-]

However, the corresponding Sayle Hashes are identical:

*[C]1[C]([C]([CH][C]2[C]1[N][C]([N]2)[C]([O])[O])[C]([O])[N])N([O])[O]_4
*[C]1[C]([C]([CH][C]2[C]1[N][C]([N]2)[C]([O])[O])[C]([O])[N])N([O])[O]_4

[Figure: depiction of the Sayle Hash]
Neat, huh? By the way, the values of 4 are from a hydrogen count of 3 and charge of -1, and a hydrogen count of 4 and charge of 0, respectively. This allows us to match these two scaffolds, arbitrarily picking one of the original representations to serve as the common scaffold.
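The integer part of the hash is easy to check by hand. The sketch below takes it as hydrogens on non-carbon atoms minus total formal charge, so that it reproduces the 4 in the hashes above. The atom lists are transcribed by hand from the two scaffold SMILES (heteroatoms only, plus the one aromatic CH to show that hydrogens on carbon are ignored); the `Atom` type and function name are made up for illustration, and no cheminformatics toolkit is involved.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    symbol: str
    charge: int
    hydrogens: int


def charge_hydrogen_term(atoms):
    """Hydrogens on non-carbon atoms minus the total formal charge."""
    hydrogens = sum(atom.hydrogens for atom in atoms if atom.symbol != "C")
    charge = sum(atom.charge for atom in atoms)
    return hydrogens - charge


# *c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O
first = [
    Atom("N", 0, 1), Atom("O", 0, 1),    # C(=N)O: imidic acid tautomer
    Atom("N", 0, 0), Atom("N", 0, 1),    # ring n and [nH]
    Atom("O", 0, 0), Atom("O", -1, 0),   # C(=O)[O-]: carboxylate
    Atom("N", 0, 0), Atom("O", 0, 0), Atom("O", 0, 0),  # N(=O)=O
    Atom("C", 0, 1),                     # aromatic CH, ignored by the count
]

# *c1c(c(C(=O)N)cc2[nH]c(nc12)C(=O)O)[N+](=O)[O-]
second = [
    Atom("O", 0, 0), Atom("N", 0, 2),    # C(=O)N: amide tautomer
    Atom("N", 0, 1), Atom("N", 0, 0),    # [nH] and ring n
    Atom("O", 0, 0), Atom("O", 0, 1),    # C(=O)O: carboxylic acid
    Atom("N", 1, 0), Atom("O", 0, 0), Atom("O", -1, 0),  # [N+](=O)[O-]
    Atom("C", 0, 1),                     # aromatic CH, ignored by the count
]

print(charge_hydrogen_term(first), charge_hydrogen_term(second))
```

Moving a proton between heteroatoms (tautomerism) leaves the hydrogen count unchanged, and adding or removing a proton changes the hydrogen count and the charge by the same amount, which is why the integer is the same for both representations.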

Supporting the updated Symbol Nomenclature for Glycans (SNFG)

Even C&EN reported the recent standardisation efforts by the oligosaccharide community on symbols to use for glycan* depiction. These guidelines are available online in Appendix 1B of Essentials of Glycobiology and will be updated over time.

As a test case for Sugar & Splice support, I depicted the oligosaccharide below whose structure is strangely reminiscent of Table 1 in the guidelines. For those of you glycan enthusiasts who wish to print T-shirts with this emblazoned on the front, here is an Inkscape-friendly SVG file.
[Figure: test oligosaccharide depicted with SNFG symbols]
However, such a diverse set of monosaccharide symbols is not present in the typical oligosaccharide. I’ve searched PubChem for the entry with the most symbols and found CID91850542 below with 11. (For an alternative depiction of the same structure, see GlyTouCan. Interestingly, the CSDB entry for the same paper describes a different but very similar glycan.):
[Figure: depiction of PubChem CID91850542]

In fact, having many symbols often indicates a dodgy structure as in the following example (PubChem CID101754793) deposited by Nikkaji which has 9 monosaccharide symbols. Looking at the original source, the SMILES not only has nitro as [N+](=O)O (must have been corrected by PubChem) but many of the sugars have incorrect stereochemistry (compared to the provided IUPAC name). The D/L in several of the symbols, indicating the presence of the rarer stereoisomer, is also a red flag.
[Figure: depiction of PubChem CID101754793 as deposited]
If you put the IUPAC name through OPSIN (after a minor mod), and then depict the resulting SMILES using Sugar & Splice, you get the correct structure:
[Figure: depiction of the corrected structure]

* Glycans are “compounds consisting of a large number of monosaccharides linked glycosidically”, via Wikipedia and the IUPAC gold book.

Sugar&Splice supports PubChem’s support for biologics

Half a million molecules on PubChem have just had a new section added entitled “Biologic Description”. This includes a depiction of the oligomer structure and several line notations including IUPAC condensed and HELM, all of which were generated using Sugar&Splice through perception from the all-atom representation. Since the original development of Sugar&Splice was as part of a collaboration with PubChem, it is great to see these annotations finally appearing as part of this important resource.

Previous blog posts have shown examples of the sorts of peptide depictions that Sugar&Splice can generate. Here is how one appears on PubChem (CID118753634).

[Figure: peptide depiction on PubChem (CID118753634)]

Sugar&Splice also supports CFG-style depiction of oligosaccharides (CID71297593):

[Figure: CFG-style oligosaccharide depiction on PubChem (CID71297593)]

As ever, there is always more work to be done on improving depictions and perception, and we look forward to further increasing the coverage of biologics in PubChem over the coming months.

Popular med chem replacements

When people talk about bioisosteres (e.g. tetrazole and carboxylic acid) they are usually referring to R group replacements that have similar biological properties. Identifying new bioisosteres can expand a med chemist’s toolbox, and so a number of studies have analysed activity databases to search for previously unknown bioisosteric replacements (e.g. [1]).

Here instead we will analyse what med chemists already consider to be bioisosteres. That is, we will look at the set of med chem replacements observed in the medicinal chemistry literature without any regard to the corresponding activity.

What I’ve done is take all (non-duplicate) IC50, EC50 and Ki data from ChEMBL and generated matched series on a per-assay basis (e.g. an assay with halide analogues will be converted to [*Br, *Cl, *F]). The corresponding matched pairs (e.g. [*Br, *F], [*Br, *Cl], [*F, *Cl]) are then associated with the paper from which the assay is taken, and any duplicates for the same paper are removed.
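The series-to-pairs step can be sketched like this. It is an illustration only: the input format (a dict from a paper identifier to the matched series found in its assays), the paper identifier and the function name are all assumptions.

```python
from itertools import combinations


def pairs_per_paper(series_by_paper):
    """Expand matched series into unordered matched pairs, de-duplicated per paper."""
    result = {}
    for paper, series_list in series_by_paper.items():
        pairs = set()
        for series in series_list:
            # Every unordered pair of distinct R groups in the series is a matched pair.
            pairs.update(frozenset(pair) for pair in combinations(sorted(set(series)), 2))
        result[paper] = pairs
    return result


# Hypothetical paper with two assays: the second series repeats a pair from the first.
toy_series = {"paper1": [["*Br", "*Cl", "*F"], ["*Br", "*Cl"]]}
toy_pairs = pairs_per_paper(toy_series)
print(len(toy_pairs["paper1"]))
```

Storing each pair as a `frozenset` makes [*Br, *Cl] and [*Cl, *Br] the same pair, and collecting them in a set per paper removes the duplicates.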

Having done this, we can then ask what is a popular replacement for *Br. As it turns out, the top answer after *I is ethynyl. This comes from the fact that *Br occurs in 5497 of the 32,158 papers, and ethynyl in 322, so if they occurred independently we would expect to see them co-occur in 55 papers. Given that they actually co-occur in 103, this is an enrichment (or “lift” as recommender systems [2] call it) of 1.9 times what you would expect to see by chance. Here are the others with positive enrichment:

R                    Occurrence  Co-occur  Expected  Enrichment
*I                   1553        901       265.5     3.4
*C#C                 322         103       55.0      1.9
*Cl                  10769       3263      1840.8    1.8
*[N+](=O)[O-]        3910        1179      668.4     1.8
*C=C                 334         91        57.1      1.6
*C#N                 3373        883       576.6     1.5
*SC(F)(F)F           63          16        10.8      1.5
*F                   9048        2261      1546.6    1.5
*OC(F)(F)F           1149        279       196.4     1.4
*C(F)(F)F            4984        1130      852.0     1.3
*S(=O)(=O)C(F)(F)F   51          10        8.7       1.1
*SC                  1337        252       228.5     1.1
*C#CC                76          14        13.0      1.1
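The expected and enrichment columns follow directly from the counts. Here is a minimal sketch of the arithmetic, using the figures quoted in the text for *Br, ethynyl and *I; the function and constant names are just for illustration.

```python
def expected_cooccurrence(count_a, count_b, total):
    """Papers expected to contain both groups if they occurred independently."""
    return count_a * count_b / total


def lift(cooccur, count_a, count_b, total):
    """Observed co-occurrence divided by the co-occurrence expected by chance."""
    return cooccur / expected_cooccurrence(count_a, count_b, total)


TOTAL_PAPERS = 32158
BR_PAPERS = 5497

ethynyl_expected = expected_cooccurrence(BR_PAPERS, 322, TOTAL_PAPERS)
ethynyl_lift = lift(103, BR_PAPERS, 322, TOTAL_PAPERS)
iodo_lift = lift(901, BR_PAPERS, 1553, TOTAL_PAPERS)
print(f"{ethynyl_expected:.1f} {ethynyl_lift:.1f} {iodo_lift:.1f}")
```

A lift above 1 means the two groups turn up together in more papers than independence would predict.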

I’ve put together an animation that summarises these data. This cycles through the most popular R group replacements that have positive enrichment and that have not previously been shown (in the animation, that is). The suggestions seem to make a lot of sense, especially when you remember that no fingerprint or MCS calculation is used: the co-occurrences come completely from the data.

References:
[1] Wassermann AM, Bajorath J. Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Med Chem. 2011, 3, 425-436.
[2] Boström J, Falk N, Tyrchan C. Exploiting personalized information for reagent selection in drug design. Drug Discov Today. 2011, 16, 181-187.

How the AUC of a ROC curve is like the Journal Impact Factor

The Journal Impact Factor (or JIF) is the mean number of citations to articles published in a journal in the previous 2 years. Now, the mean is often a good measure of the average but not always. To decide whether it’s a good measure, it is often sufficient to look at a histogram of the data. The image above from a blogpost by Steve Royle shows the citation data for Nature. It is exactly as you would expect: a large number of papers have a small number of citations, while a small number of papers have a large number of citations. In other words, it is exactly the sort of curve for which the mean does not provide any meaningful (an ironic pun) result.

Why? Well, it’s the long tail that really kills it (although we could talk about how skewed it is too). Take 101 papers, 100 of which have 1 citation but one has 100. What’s the mean? About 2.0. If that one had 1000 citations instead, the mean would be about 10.9. The mean is heavily influenced by outliers, and here the long tail provides lots of these. For this reason, the mean does not give any useful measure of the average number of citations as it is just pulled up and down by whatever papers got most cited.
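The 101-paper example is easy to reproduce, and adding the median shows what a more robust measure of the average would report. This is just a quick sketch using the standard library.

```python
from statistics import mean, median

typical = [1] * 100  # 100 papers with a single citation each

summary = {}
for outlier in (100, 1000):
    citations = typical + [outlier]  # plus one highly cited paper
    summary[outlier] = (mean(citations), median(citations))
    print(outlier, round(summary[outlier][0], 1), summary[outlier][1])
```

The mean moves from about 2 to about 11 depending on a single paper, while the median stays at 1 in both cases.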

So what’s the link to the AUC of a ROC curve in a typical virtual screening experiment? The AUC has a linear dependence on the mean rank of the actives (see the BEDROC paper), and guess what, that distribution looks very similar to that for citations. For any virtual screening method that is better than random, most of the actives are clustered at the top of the ranked list, while any active that is not recognised by the method floats at random among the inactives. So the AUC is at best a measure of the rank of the actives not recognised by the method, and at worst a random value.
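To make the linear dependence concrete, here is a small sketch that computes the AUC two ways: by counting active/inactive pairs, and from the mean rank of the actives. It assumes rank 1 is the top of the list and that there are no ties; the closed form used here is AUC = 1 - mean_rank/N + (n + 1)/(2N) for n actives and N inactives, and the example ranks are made up.

```python
def auc_by_pairs(active_ranks, total):
    """Fraction of (active, inactive) pairs in which the active is ranked higher."""
    actives = set(active_ranks)
    inactive_ranks = [rank for rank in range(1, total + 1) if rank not in actives]
    wins = sum(a < i for a in active_ranks for i in inactive_ranks)
    return wins / (len(active_ranks) * len(inactive_ranks))


def auc_from_mean_rank(active_ranks, total):
    """The same AUC, written as a linear function of the mean rank of the actives."""
    n = len(active_ranks)
    inactives = total - n
    mean_rank = sum(active_ranks) / n
    return 1 - mean_rank / inactives + (n + 1) / (2 * inactives)


# Hypothetical screens of 100 molecules: four actives found at the top,
# and one unrecognised active that lands at rank 60 or rank 95.
for ranks in ([1, 2, 3, 4, 60], [1, 2, 3, 4, 95]):
    print(ranks, round(auc_by_pairs(ranks, 100), 3), round(auc_from_mean_rank(ranks, 100), 3))
```

The four actives at the top contribute the same amount in both runs, so the difference in AUC comes entirely from where the one unrecognised active happens to land.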

Naturally, the AUC is the most widely used ranking method in the field.