Sketchy Sketches

Chemical structure diagrams are essential in describing and conveying chemistry. Extracting chemistry from documents using text-mining (see NextMove Software’s LeadMine) is extremely useful but will miss anything described only by an image.

As a general approach to mining chemistry from images, one may consider using image-to-structure programs such as OSRA, CliDE, ChemOCR, and Imago OCR. However, image-to-structure is neither easy nor quick, and can be prone to compounding errors (e.g. from OCR).

At NextMove we approach this problem slightly differently. It turns out that in some cases the source sketch files used to create the chemical diagrams may be available and provide a ‘cleaner’ data source than the raster images.

Although the data is ‘cleaner’ in terms of digital representation, naïvely exporting the connection table stored in a sketch file can lead to artificial and erroneous structures. The main problems stem from the stored representation (connection table) imprecisely reflecting what is displayed. To account for these issues, the NextMove Software converter (code name: Praline) applies correction, interpretation, and categorisation to sketches. The transformed connection table (currently written as ChemAxon Extended SMILES [CXSMILES]) better reflects what is actually displayed.

Let’s take a look at what’s possible with three examples:

1) US 2015 344500 A1

Method 9 in US 2015 344500 A1 describes a four-step synthesis:

US20150344500A1-20151203-C00112

Using image-to-structure, SureChEMBL extracts four structures; I’ve added the titles to make them easier to pair up:

  • Compound 2-2 (OCR error): SCHEMBL17309138 / CID 118554493
  • Compound 9-1 (part): SCHEMBL12363 / CID 10008
  • Compound 9-2: SCHEMBL17307813 / CID 118553325
  • Compound 9-3: SCHEMBL17309143 / CID 118554498

Compound 2-2 was not correctly extracted: it looks like OCR has mistakenly recognised the -OBn as -OBu. The fluorobenzene probably comes from Compound 9-1, where the label (Boc)2N- is difficult to recognise. The products of Step 4 contain valence errors and were probably thrown out as a recognition error.
However, by reading the ChemDraw files directly it’s possible to extract everything “warts and all”. To process this sketch, the key interpretation phases are:

  • Line Formula Parsing – Using a strict yet comprehensive algorithm, condensed labels are corrected and expanded (see the sketch after this list).
  • Reaction Role Assignment – This reaction scheme layout is common in patents, and interpretation is made easier by looking for the USPTO-specific ‘splitter’ tag. To make valid reactions, reactants are duplicated and added to the previous step.
  • Agent Parsing – Based on its location, the complete label “Boc2O, DIEA” can be correctly processed. Agents can be a mix of trivial names, systematic names, and formulas.
  • Clear Ambiguous Stereochemistry – One of the hashed wedges in Compounds 9-1, 9-2, and 9-3 is poorly placed between two stereocentres. In the stored representation both stereocentres are defined, but we remove the definition at the wide end of the wedge.
  • Category Assignment – Based on the content we tag the output with a category for quick filtering. This is described more in the poster (see below).
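To make the first of these phases concrete, here is a toy sketch of condensed-label expansion in Python. The table and function are purely illustrative (Praline’s parser is a full grammar, not a fixed lookup table), but it shows why -OBn and -OBu must never be confused:

# Toy condensed-label expander (illustrative only; not Praline's algorithm).
ABBREVIATIONS = {
    "OBn":     "OCc1ccccc1",       # benzyloxy - what the sketch actually says
    "OBu":     "OCCCC",            # n-butoxy  - what OCR misread it as
    "NHBoc":   "NC(=O)OC(C)(C)C",
    "N(Boc)2": "N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C",
}

def expand_label(label):
    """Return a SMILES fragment for a condensed label, or None if unknown."""
    return ABBREVIATIONS.get(label)

print(expand_label("OBn"))  # OCc1ccccc1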

Here are the results of our extraction, categorised as specific reactions:

US20150344500A1-20151203-C00112_Step1

C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F>[Li]CCCC.CC=O>C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2C(C)O)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F

US20150344500A1-20151203-C00112_Step2

C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2C(C)O)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F>B(Br)(Br)Br.C(Cl)Cl>C1=CC=C(C(=C1)[C@]23NC(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)=N)F

US20150344500A1-20151203-C00112_Step3

C1=CC=C(C(=C1)[C@]23NC(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)=N)F>CC(C)(C)OC(=O)OC(=O)OC(C)(C)C.CCN(C(C)C)C(C)C>C1=CC=C(C(=C1)[C@]23N=C(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)NC(=O)OC(C)(C)C)F

US20150344500A1-20151203-C00112_Step4

C1=CC=C(C(=C1)[C@]23N=C(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)NC(=O)OC(C)(C)C)F>[N+](=O)(O)[O-].OS(=O)(=O)O>C1(=CC=C(C(=C1)[C@]23[NH]=C(C(S([C@@H]3[C@@H](OC2)C)(=O)=O)(C)C)=N)F)[N+](=O)[O-].C1(=CC=C(C(=C1)[C@]23[NH]=C(C(S([C@@H]3[C@H](OC2)C)(=O)=O)(C)C)=N)F)[N+](=O)[O-]

 

Compounds 2-2 and 9-1 are now correctly extracted and actually novel to PubChem. We don’t try to correct author errors and so the bad valence is also preserved as drawn in Step 4.
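As an aside, it is easy to verify downstream that such errors really are preserved: load the SMILES without sanitisation and ask the toolkit what it objects to. A minimal RDKit sketch (the pentavalent-carbon SMILES here is a stand-in; the same check applies to the Step 4 product SMILES above):

from rdkit import Chem

# Parse without sanitisation so the bad valence survives, then list problems.
mol = Chem.MolFromSmiles("CC(C)(C)(C)C", sanitize=False)  # pentavalent carbon
for problem in Chem.DetectChemistryProblems(mol):
    print(problem.GetType(), "-", problem.Message())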

2) US 7092578 B2

US 7092578 B2 is not a chemical patent but does have ChemDraw files. Here ChemDraw has been misused to draw tables, and direct export results in a cyclobutane grid. These are a well-known class of bad structures in PubChem and have been referred to as chessboardanes. In addition to extracting the chemical structure, Praline assigns a categorisation code. This allows us to flag structures with potential problems, as well as those with no real chemistry at all.

US07092578-20060815-C00001

Resulting PubChem Compound CID 21040251:


C1C2C3C4C5C6C7C8CC9C%108C%117C%126C%135C%144C%153C%162C1C%17C%18%16C%19%15C%20%14C%21%13C%22%12C%23%11C%24%10C9C%25C%26%24C%27%23C%28%22C%29%21C%30%20C%31%19C%32%18C%17C%33C%34%32C%35%31C%36%30C%37%29C%38%28C%39%27C%40%26C%25C%41C%42%40C%43%39C%44%38C%45%37C%46%36C%47%35C%48%34C%33C%49C%50%48C%51%47C%52%46C%53%45C%54%44C%55%43C%56%42C%41C%57C%58%56C%59%55C%60%54C%61%53C%62%52C%63%51C%64%50C%49CC%64C%63C%62C%61C%60C%59C%58C%57

Praline assigns the category No Connection Table and so it can be easily ignored.
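For the curious, a crude version of this check is easy to write yourself: a connection table that is almost entirely fused four-membered rings is far more likely a drawn table than real chemistry. A sketch (the threshold is arbitrary, and this is not how Praline actually categorises):

from rdkit import Chem

def looks_like_chessboardane(smiles):
    """Crude heuristic: mostly fused four-membered rings suggests a drawn grid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    rings = mol.GetRingInfo().AtomRings()
    four_membered = sum(1 for ring in rings if len(ring) == 4)
    return len(rings) >= 4 and four_membered / len(rings) > 0.9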

3) US 6531452 B1

Strange connection tables don’t just come from non-chemistry patents: like many chemistry patents, US 6531452 B1 contains a generic (Markush) claim. Earlier we saw the label -OBn misread by OCR. Even without OCR, a condensed label may be expanded wrongly in the underlying representation, particularly if the structure is generic.

“…at least one of R2 and R3 is”
US06531452-20030311-C00142

In PubChem you’ll find the compound CID 22976968 has been extracted from this sketch:


C1C2C13C24C35C46C57C68C79C81C92C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C2C1

Where did it come from? Well, it turns out the generic label >C(R41)(R41) has been automatically expanded and stored in the file as:

CR41R41

Somewhere along the way the Rs have been promoted to carbons and the result submitted to PubChem. Praline recognises and interprets generic labels and attachment points (drawn here as tert-butyl), and categorises the sketch as a generic substituent. Here’s the output:

US06531452-20030311-C00142_praline

C1(C(C(C(N1*)(*)*)(*)*)C(*)(*)*)=O.*CCC(N)=O |$;;;;;R41;R41;R41;R41;R41;;_AP1;R41;R41;;_AP1$,Sg:n:3,6,7:n:ht|
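In the CXSMILES above, the R41 groups and attachment points survive as labelled pseudoatoms. A sketch of the key idea, tokenising condensed labels so that R-group references survive as R-groups rather than being promoted to atoms (the regex is illustrative; Praline’s label grammar is far richer):

import re

# Treat "R" followed by digits as one R-group token, so the digits are never
# mistaken for atom counts or ring closures.
TOKEN = re.compile(r"R\d+|[A-Z][a-z]?|\d+")

print(TOKEN.findall("CR41R41"))  # ['C', 'R41', 'R41'] - one carbon, two R41s
# A naive element-by-element scan instead promotes each R to an atom, which
# is how the bogus carbon cage above ended up in PubChem.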

Conclusion

Image-to-structure is slow; because of this, SureChEMBL has currently only processed images from 2007 onwards (Papadatos G. 2015). In contrast, Praline can process the entire archive of US Patent Applications and Grants, more than 24 million ChemDraw files (2001 onwards), in only 5 hours (single-threaded).

Although the USPTO provides naïve molfile exports of the ChemDraw sketches, these contain less information than the source sketch files. Reading the pre-exported molfile is significantly less accurate than interpreting the ChemDraw sketch, and even image-to-structure often produces more accurate results.

Beyond US patents, this technology can be applied to sketch files extracted from Electronic Lab Notebooks (see NextMove Software’s HazELNut), as well as journals where the publishers have held on to the sketch file submissions.

At the upcoming ACS in Philadelphia, Daniel will be presenting how some structures can only be extracted when the outputs from text-mining and from sketches are combined. “The whole is greater than the sum of the parts” – Aristotle.

A poster on this work was presented at the 7th Joint Sheffield Conference on Chemoinformatics:


Comparing structural fingerprints using a literature-based similarity benchmark

We’re just back from the 7th Joint Sheffield Conference on Chemoinformatics where I presented the poster below on comparing the ability of structural fingerprints to measure structural similarity. As it happens, the corresponding paper has just come out today also:

Noel M. O’Boyle and Roger A. Sayle. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminform. 2016, 8, 36.

What we’ve tried to do is create a ground-truth dataset for structural similarity (in the context of medicinal chemistry), and then test fingerprints against that. One approach to creating this dataset would be to crowd-source it out to medicinal chemists – this is something that Pedro Franco has done, and he was actually presenting some updated results at Sheffield.

We’ve taken an alternative approach: we’ve used the med chemistry literature as collated by the ChEMBL database. On the basis that a team of medicinal chemists have selected these molecules for synthesis and testing as part of the same med chem programme, we regard molecules that appear in the same ChEMBL assay as structurally similar (after removing molecules that appear in 5 or more papers, and some other simple filters).

This gives us pairs of molecules that are similar, but we really want to have a series of molecules with decreasing similarity to a reference, and then see if the various fingerprints can reproduce the series order. To create such a series, we hop from one paper to the next through molecules in common, thereby moving further and further away in terms of similarity from the original molecule. Inspired by Sereina Riniker and Greg Landrum, all of the data and scripts are available at our GitHub repo.
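In Python-flavoured pseudocode, the hopping procedure looks something like the sketch below; the data structures and function are mine for illustration, and the real scripts are in the repo:

from collections import defaultdict

def build_series(papers, start_mol, n_hops):
    """Hop from paper to paper via shared molecules, drifting away from start_mol.

    papers: dict mapping paper id -> set of molecule ids appearing in it.
    """
    by_mol = defaultdict(set)  # invert the mapping: molecule -> papers
    for paper, mols in papers.items():
        for mol in mols:
            by_mol[mol].add(paper)

    series, used_papers, current = [start_mol], set(), start_mol
    for _ in range(n_hops):
        candidates = by_mol[current] - used_papers   # unvisited papers
        if not candidates:
            break
        paper = min(candidates)                      # deterministic pick for the sketch
        used_papers.add(paper)
        new_mols = papers[paper] - set(series)       # hop to a molecule not yet seen
        if not new_mols:
            break
        current = min(new_mols)
        series.append(current)
    return series

papers = {"P1": {"A", "B"}, "P2": {"B", "C"}, "P3": {"C", "D"}}
print(build_series(papers, "A", 3))  # ['A', 'B', 'C', 'D']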

Fishing for matched series in a sea of structure representations

When searching for matched pairs/series, the typical approach is to use a fragmentation scheme and then collate the results for the same scaffold. Leaving aside other issues, we come to the question of how to ensure that all matched pairs for the same scaffold are actually found, given the following representation issues: tautomeric forms (e.g. keto-enol), charge states (e.g. COO- versus COOH) and charge-separated/hypervalent forms (e.g. nitro as N(=O)=O or [N+](=O)[O-]).

Let’s take assay data in ChEMBL as an example. While the other issues are fairly well nailed down, the tautomer stored in ChEMBL is the first one encountered in the literature. This can lead to situations where molecules from the same assay have the same tautomer in the paper but not in ChEMBL (e.g. CHEMBL496754 and CHEMBL522563 from CHEMBL1009882).

There are two approaches to sorting out these sorts of problems. The first is to try to generate a canonical representation of the molecule up-front. Note that this need not be the most preferred representation, just one that is canonical. An alternative approach is to create a hash for the structure that is invariant to representation issues and to use this hash to collate the scaffolds. This is actually quite a bit easier than the former approach. In an earlier blogpost, we described this method in the context of finding redox pairs, but it’s one of those ideas that bears repeating as it can be applied to several different problems.

I’ll call this method Sayle Hashing (after all, this fits with the nautical theme of the title). In this particular case, the Sayle Hash consists of two parts, a SMILES string and an integer. The integer is the total of the formal charges on the scaffold minus the number of hydrogens on each non-carbon atom, while the SMILES string is the canonical SMILES for the scaffold after setting all bond orders to 1 and hydrogen counts to 0. An example may be useful at this point. Here is a matched pair we would like to identify.
Once fragmented at the halogen bond, we get the following non-identical scaffold SMILES:

*c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O
*c1c(c(C(=O)N)cc2[nH]c(nc12)C(=O)O)[N+](=O)[O-]

However, the corresponding Sayle Hashes are identical:

*[C]1[C]([C]([CH][C]2[C]1[N][C]([N]2)[C]([O])[O])[C]([O])[N])N([O])[O]_4
*[C]1[C]([C]([CH][C]2[C]1[N][C]([N]2)[C]([O])[O])[C]([O])[N])N([O])[O]_4

Neat, huh? By the way, the values of 4 come from a hydrogen count of 3 and charge of -1, and a hydrogen count of 4 and charge of 0, respectively. This allows us to match these two scaffolds, arbitrarily picking one of the original representations to serve as the common scaffold.
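Here is a minimal RDKit reconstruction of the hash from the description above (not NextMove’s code; I take the absolute value of the integer so the suffix matches the _4 shown):

from rdkit import Chem

def sayle_hash(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Integer part: formal charges minus hydrogens on non-carbon atoms,
    # computed before that information is erased below.
    n = 0
    for atom in mol.GetAtoms():
        n += atom.GetFormalCharge()
        if atom.GetAtomicNum() != 6:
            n -= atom.GetTotalNumHs()
    # SMILES part: set every bond order to 1 and every hydrogen count and
    # charge to 0, so tautomers and charge states collapse together.
    flat = Chem.RWMol(mol)
    Chem.Kekulize(flat, clearAromaticFlags=True)
    Chem.RemoveStereochemistry(flat)
    for bond in flat.GetBonds():
        bond.SetBondType(Chem.BondType.SINGLE)
    for atom in flat.GetAtoms():
        atom.SetFormalCharge(0)
        atom.SetNoImplicit(True)
        atom.SetNumExplicitHs(0)
    flat.UpdatePropertyCache(strict=False)
    return "%s_%d" % (Chem.MolToSmiles(flat), abs(n))

a = "*c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O"
b = "*c1c(c(C(=O)N)cc2[nH]c(nc12)C(=O)O)[N+](=O)[O-]"
print(sayle_hash(a) == sayle_hash(b))  # True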

Supporting the updated Symbol Nomenclature for Glycans (SNFG)

Even C&EN reported the recent standardisation efforts by the oligosaccharide community on symbols to use for glycan* depiction. These guidelines are available online in Appendix 1B of Essentials of Glycobiology and will be updated over time.

As a test case for Sugar & Splice support, I depicted the oligosaccharide below whose structure is strangely reminiscent of Table 1 in the guidelines. For those of you glycan enthusiasts who wish to print T-shirts with this emblazoned on the front, here is an Inkscape-friendly SVG file.
However, such a diverse set of monosaccharide symbols is not present in the typical oligosaccharide. I’ve searched PubChem for the entry with the most symbols and found CID91850542 below, with 11. (For an alternative depiction of the same structure, see GlyTouCan. Interestingly, the CSDB entry for the same paper describes a different but very similar glycan.)

In fact, having many symbols often indicates a dodgy structure as in the following example (PubChem CID101754793) deposited by Nikkaji which has 9 monosaccharide symbols. Looking at the original source, the SMILES not only has nitro as [N+](=O)O (must have been corrected by PubChem) but many of the sugars have incorrect stereochemistry (compared to the provided IUPAC name). The D/L in several of the symbols, indicating the presence of the rarer stereoisomer, is also a red flag.
If you put the IUPAC name through OPSIN (after a minor mod), and then depict the resulting SMILES using Sugar & Splice, you get the correct structure.

* Glycans are “compounds consisting of a large number of monosaccharides linked glycosidically”, via Wikipedia and the IUPAC gold book.

Analysing the last 40 years of medicinal chemistry reactions

In collaboration with Novartis (with particular thanks to Nadine Schneider) we have published a paper on the analysis of reactions that we have text-mined from 40 years of US medicinal chemistry patents.

The paper covers the evolution of common reaction types over time, using NameRxn to provide the reaction classification. The classification is hierarchical, allowing a reaction to be classified at various levels of granularity: for example, a Chloro Suzuki coupling is a Suzuki coupling, which in turn is a C-C bond formation reaction. Analysis of the properties of the reaction products was also performed, revealing trends such as an increase in the number of rings over time.
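As a sketch of what “various levels of granularity” means in practice, a dotted classification code can simply be rolled up level by level. The codes and names below are made up for illustration, not actual NameRxn identifiers:

def rollup(code, names):
    """Expand a dotted classification code to every level of granularity."""
    parts = code.split(".")
    return [names[".".join(parts[:i])] for i in range(1, len(parts) + 1)]

# Hypothetical code -> name table, for illustration only.
names = {
    "3": "C-C bond formation",
    "3.1": "Suzuki coupling",
    "3.1.2": "Chloro Suzuki coupling",
}
print(rollup("3.1.2", names))
# ['C-C bond formation', 'Suzuki coupling', 'Chloro Suzuki coupling']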

The reactions were extracted using a workflow based on LeadMine for identifying and normalizing chemicals and physical quantities. One quantity of particular interest that is extracted and associated with each reaction is the yield. This allowed the identification of reaction types with consistently low/high yields, and revealed a trend towards slightly lower yields over time.

Greg Landrum has kindly hosted interactive versions of some of the graphs from the paper here. In the Pipeline has also blogged positively about the paper here.

Sugar&Splice supports PubChem’s support for biologics

Half a million molecules on PubChem have just had a new section added entitled “Biologic Description”. This includes a depiction of the oligomer structure and several line notations including IUPAC condensed and HELM, all of which were generated using Sugar&Splice through perception from the all-atom representation. Since the original development of Sugar&Splice was part of a collaboration with PubChem, it is great to see these annotations finally appearing as part of this important resource.

Previous blog posts have shown examples of the sorts of peptide depictions that Sugar&Splice can generate. Here is how one appears on PubChem (CID118753634).

pubchem-peptide

Sugar&Splice also supports CFG-style depiction of oligosaccharides (CID71297593):

pubchem-sacc

As ever, there is always more work to be done on improving depictions and perception, and we look forward to further increasing the coverage of biologics in PubChem over the coming months.

PhD positions available in Big Data Analysis in Chemistry

NextMove Software is a partner in the Horizon 2020 MSC ITN EID BigChem project. Ten PhD positions are available in the area of “Big Data Analysis in Chemistry”, all of which offer a mix of time spent in academia and with industrial partners. The following position involves a placement with us for 3 months:

ESR2: Computational compound profiling by large-scale mining of pharmaceutical data

This position is announced within the BIGCHEM project. Read about the career development perspectives.

Check eligibility rules as well as recruitment details and apply for this position before 20 March 2016.

Objectives: In the life-sciences, data is being generated and published at unprecedented rates. This wealth of data provides unique opportunities to get insights into the mechanisms of disease and to identify starting points for treatments. At the same time, the size, complexity and heterogeneity of available data sets pose substantial challenges for computational analysis and design.

The aim of this project is to address the challenges posed by large, heterogeneous, incomplete, and noisy datasets. Specifically, we aim to:

  • apply machine learning technologies to derive predictive QSAR models from real-world life science data sets;
  • analyze trade-offs between training data accuracy and quantity, in particular, in the context of high-throughput screening data;
  • develop and apply methods to systematically account for noise and experimental errors in the search for active compounds.

Planned secondments: Three months at NextMove to work with data automatically extracted from patents using the company’s unique technology. Three months at HMGU to collect data from public databases such as ChEMBL, OCHEM, and PubChem.

Employment: 36 months total, including Boehringer Ingelheim, Biberach, Germany (months 1-18) and the University of Bonn, Germany (months 19-36).

Enrollment in PhD program: The ESR will be supervised by Prof. J. Bajorath from the University of Bonn and by supervisors from Boehringer Ingelheim.

Salary details are described here.

Organization: Boehringer Ingelheim GmbH & Co KG & University of Bonn
Location: Germany
Region: Germany
Employment type: Full time
Years of experience: 4 years or less (see eligibility rules)
Required languages: English
Required general skills: Experience in data mining and statistics; good knowledge of medicinal chemistry is a plus.
Required IT skills: Good knowledge of programming in mainstream computer languages and the UNIX/Linux operating system.
Required degree level: Master’s degree in Chemistry, Bioinformatics, Medicinal Chemistry, Informatics/Data Science or a closely related field.

 

Popular med chem replacements

When people talk about bioisosteres (e.g. tetrazole and carboxylic acid) they are usually referring to R group replacements that have similar biological properties. Identifying new bioisosteres can expand a med chemist’s toolbox, and so a number of studies have analysed activity databases to search for previously unknown bioisosteric replacements (e.g. [1]).

Here instead we will analyse what med chemists already consider to be bioisosteres. That is, we will look at the set of med chem replacements observed in the medicinal chemistry literature without any regard to the corresponding activity.

What I’ve done is take all (non-duplicate) IC50, EC50 and Ki data from ChEMBL and generated matched series on a per-assay basis (e.g. an assay with halide analogues will be converted to [*Br, *Cl, *F]). The corresponding matched pairs (e.g. [*Br, *F], [*Br, *Cl], [*F, *Cl]) are then associated with the paper from which the assay is taken, and any duplicates for the same paper are removed.

Having done this, we can then ask: what is a popular replacement for *Br? As it turns out, the top answer after *I is ethynyl. This comes from the fact that *Br occurs in 5497 of the 32,158 papers, and ethynyl in 322, so if they occurred independently we would expect to see them co-occur in 55 papers. Given that they actually co-occur in 103, this is an enrichment (or “lift”, as recommender systems [2] call it) of 1.9 times what you would expect to see by chance. Here are the others with positive enrichment (the computation is sketched in code after the table):

R Occurrence Co-occur Expected Enrichment
*I 1553 901 265.5 3.4
*C#C 322 103 55.0 1.9
*Cl 10769 3263 1840.8 1.8
*[N+](=O)[O-] 3910 1179 668.4 1.8
*C=C 334 91 57.1 1.6
*C#N 3373 883 576.6 1.5
*SC(F)(F)F 63 16 10.8 1.5
*F 9048 2261 1546.6 1.5
*OC(F)(F)F 1149 279 196.4 1.4
*C(F)(F)F 4984 1130 852.0 1.3
*S(=O)(=O)C(F)(F)F 51 10 8.7 1.1
*SC 1337 252 228.5 1.1
*C#CC 76 14 13.0 1.1
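To make the arithmetic explicit, here is a small sketch of both steps: turning a matched series into matched pairs, and computing the expected co-occurrence and lift for the *Br/ethynyl numbers quoted above.

from itertools import combinations

# Matched series -> matched pairs, as done per assay.
series = ["*Br", "*Cl", "*F"]
print(list(combinations(sorted(series), 2)))
# [('*Br', '*Cl'), ('*Br', '*F'), ('*Cl', '*F')]

# Lift for *Br / ethynyl, using the paper counts from the text.
n_papers, n_br, n_ethynyl, n_both = 32158, 5497, 322, 103
expected = n_br * n_ethynyl / n_papers
print(round(expected))              # 55 papers expected by chance
print(round(n_both / expected, 1))  # 1.9 = enrichment ("lift")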

I’ve put together an animation that summarises these data. It cycles through the most popular R group replacements that have positive enrichment and that have not previously been shown (in the animation, that is). The suggestions seem to make a lot of sense, especially when you remember that no fingerprint or MCS calculation is used – the co-occurrences come entirely from the data.

References:
[1] Wassermann AM, Bajorath J. Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Med Chem. 2011, 3, 425-436.
[2] Boström J, Falk N, Tyrchan C. Exploiting personalized information for reagent selection in drug design. Drug Discov Today. 2011, 16, 181-187.

Assembling a large data set for melting point prediction: Text-mining to the rescue!

As part of a project initiated by Tony Williams and the Royal Society of Chemistry, I have been working with Igor Tetko to text-mine melting and decomposition point data from the US patent literature so that he could then produce a melting point prediction model. This model showed an improvement over previous models, which is likely due to the overwhelmingly large size of the dataset compared to the smaller curated data sets used by those previous models.

The results of this work have now been published in the Journal of Cheminformatics here: The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from Patents

From the text-mining side this involved identifying compounds, melting and decomposition points, performing the association between them, and then normalizing the representation of the melting points (e.g. “182-4°C” means the same as “182 to 184°C”). Values that were likely to be typos in the original text were also flagged.
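As a concrete example, the “182-4°C” rule can be implemented by padding the upper value with the leading digits of the lower one. A minimal sketch (illustrative only; LeadMine’s grammar covers many more variants):

import re

def expand_range(text):
    """Normalise a melting-point range like '182-4' or '182 to 184'."""
    m = re.match(r"(\d+)\s*(?:-|to)\s*(\d+)", text)
    lo, hi = m.group(1), m.group(2)
    if len(hi) < len(lo):                    # '4' -> '184'
        hi = lo[: len(lo) - len(hi)] + hi
    return int(lo), int(hi)

print(expand_range("182-4"))       # (182, 184)
print(expand_range("182 to 184"))  # (182, 184)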

As mentioned in the paper, the resultant set of 100,000s of melting points is available as an SDF from Figshare, while the model Igor developed is available from OCHEM.

Image credit: Iain George on Flickr (CC-BY-SA)

How the AUC of a ROC curve is like the Journal Impact Factor

The Journal Impact Factor (or JIF) is the mean number of citations to articles published in a journal in the previous 2 years. Now, the mean is often a good measure of the average, but not always. To decide whether it’s a good measure, it is often sufficient to look at a histogram of the data. The histogram of citation data for Nature, from a blogpost by Steve Royle, is exactly as you would expect: a large number of papers have a small number of citations, while a small number of papers have a large number of citations. In other words, it is exactly the sort of curve for which the mean does not provide any meaningful (an ironic pun) result.

Why? Well, it’s the long tail that really kills it (although we could talk about how skewed it is too). Take 101 papers, 100 of which have 1 citation but one has 100. What’s the mean? About 2.0. Say that one paper had 1000 citations instead; then the mean is about 11. The mean is heavily influenced by outliers, and here the long tail provides lots of these. For this reason, the mean does not give any useful measure of the average number of citations, as it is just pulled up and down by whatever papers got most cited.
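The same example in a few lines of Python, with the median shown for contrast:

from statistics import mean, median

citations = [1] * 100 + [100]
print(round(mean(citations), 1), median(citations))  # 2.0 1

citations[-1] = 1000
print(round(mean(citations), 1), median(citations))  # 10.9 1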

So what’s the link to the AUC of a ROC curve in a typical virtual screening experiment? The AUC has a linear dependence on the mean rank of the actives (see the BEDROC paper), and guess what, that distribution looks very similar to that for citations. For any virtual screening method that is better than random, most of the actives are clustered at the top of the ranked list, while any active that is not recognised by the method floats at random among the inactives. So the AUC is at best a measure of the rank of the actives not recognised by the method, and at worst a random value.
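To see that linear dependence concretely, here is a small sketch using the Mann-Whitney identity (the function and numbers are mine, not from the BEDROC paper):

def auc_from_ranks(active_ranks, n_total):
    """AUC as a linear function of the mean rank of the actives (rank 1 = top)."""
    n_act = len(active_ranks)
    n_inact = n_total - n_act
    mean_rank = sum(active_ranks) / n_act
    return 1.0 - (mean_rank - (n_act + 1) / 2.0) / n_inact

print(auc_from_ranks(range(1, 11), 1000))                # 1.0  (all actives on top)
print(auc_from_ranks(list(range(1, 10)) + [700], 1000))  # ~0.93 (one straggler)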

Naturally, the AUC is the most widely used performance measure in the field.