Fishing for matched series in a sea of structure representations

When searching for matched pairs/series, the typical approach is to use a fragmentation scheme and then collate the results for the same scaffold. Leaving aside other issues, we come to the question of how to ensure that all matched pairs for the same scaffold are actually found given the following representation issues: tautomeric forms (e.g. keto-enol), charge states (e.g. COO- versus COOH) and charge-separated/hypervalent forms (e.g. nitro as N(=O)=O or [N+]([O-])O).

Let’s take assay data in ChEMBL as an example. While the other issues are fairly well nailed down, the tautomer stored in ChEMBL is the first one encountered in the literature. This can lead to situations where molecules from the same assay have the same tautomer in the paper but not in ChEMBL (e.g. CHEMBL496754 and CHEMBL522563 from CHEMBL1009882).

There are two approaches to sorting out these sorts of problems. The first is to try to generate a canonical representation of the molecule up-front. Note that this need not be the most preferred representation, just one that is canonical. An alternative approach is to create a hash for the structure that is invariant to representation issues and to use this hash to collate the scaffolds. This is actually quite a bit easier than the former approach. In an earlier blogpost, we described this method in the context of finding redox pairs, but it’s one of those ideas that bears repeating as it can be applied to several different problems.

I’ll call this method Sayle Hashing (after all, this fits with the nautical theme of the title). In this particular case, the Sayle Hash consists of two parts, a SMILES string and an integer. The integer is the number of hydrogens on the non-carbon atoms of the scaffold minus the total of the formal charges, while the SMILES string is the canonical SMILES for the scaffold after setting all bond orders to 1 and the hydrogen counts on non-carbon atoms to 0. An example may be useful at this point: take a matched pair that we would like to identify, where the two molecules are drawn in different representations.
Once fragmented at the halogen bond, we get the following non-identical scaffold SMILES:

*c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O
*c1c(c(C(=O)N)cc2[nH]c(nc12)C(=O)O)[N+](=O)[O-]

However, the corresponding Sayle Hashes are identical:

*[C]1[C]([C]([CH][C]2[C]1[N][C]([N]2)[C]([O])[O])[C]([O])[N])N([O])[O]_4
*[C]1[C]([C]([CH][C]2[C]1[N][C]([N]2)[C]([O])[O])[C]([O])[N])N([O])[O]_4

Neat, huh? By the way, the values of 4 are from a hydrogen count of 3 and charge of -1, and a hydrogen count of 4 and charge of 0, respectively. This allows us to match these two scaffolds, arbitrarily picking one of the original representations to serve as the common scaffold.
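To make the idea concrete, here is a minimal Python sketch on a toy molecule representation. It is not the code used for the results above. It takes the integer as hydrogens on non-carbon atoms minus formal charge (which is what reproduces the _4 in the hashes above), and in place of canonical SMILES it uses a simple neighbourhood-refinement key as a stand-in, which could in principle give the same key for two different skeletons. The `sayle_hash` function and the tuple-based input format are made up for this example.

```python
def sayle_hash(atoms, bonds, rounds=3):
    """Return (skeleton_key, integer) for a toy molecule.

    atoms: list of (element, formal_charge, hydrogen_count)
    bonds: list of (atom_index, atom_index, bond_order)
    """
    # Integer part: hydrogens on non-carbon atoms minus the total formal charge.
    hetero_h = sum(h for element, _, h in atoms if element != "C")
    total_charge = sum(charge for _, charge, _ in atoms)
    integer = hetero_h - total_charge

    # Skeleton part: ignore bond orders, charges and heteroatom hydrogens.
    neighbours = [[] for _ in atoms]
    for i, j, _order in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    labels = [f"CH{h}" if element == "C" else element for element, _, h in atoms]
    for _ in range(rounds):
        # Extend each label with the sorted labels of its neighbours.
        labels = [
            labels[i] + "(" + ",".join(sorted(labels[j] for j in neighbours[i])) + ")"
            for i in range(len(atoms))
        ]
    return "|".join(sorted(labels)), integer


# *C(=O)N versus its tautomer *C(O)=N (atoms listed in a different order)
amide = ([("*", 0, 0), ("C", 0, 0), ("O", 0, 0), ("N", 0, 2)],
         [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
imidic = ([("*", 0, 0), ("C", 0, 0), ("N", 0, 1), ("O", 0, 1)],
          [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
# *C(=O)O versus *C(=O)[O-]
acid = ([("*", 0, 0), ("C", 0, 0), ("O", 0, 0), ("O", 0, 1)],
        [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
carboxylate = ([("*", 0, 0), ("C", 0, 0), ("O", 0, 0), ("O", -1, 0)],
               [(0, 1, 1), (1, 2, 2), (1, 3, 1)])

print(sayle_hash(*amide) == sayle_hash(*imidic))      # True
print(sayle_hash(*acid) == sayle_hash(*carboxylate))  # True
print(sayle_hash(*acid) == sayle_hash(*amide))        # False
```

With a real toolkit you would generate the canonical SMILES of the modified scaffold instead of the refinement key; the point here is just that both parts of the hash are unchanged by tautomerism and protonation state.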

Supporting the updated Symbol Nomenclature for Glycans (SNFG)

Even C&EN reported the recent standardisation efforts by the oligosaccharide community on symbols to use for glycan* depiction. These guidelines are available online in Appendix 1B of Essentials of Glycobiology and will be updated over time.

As a test case for Sugar & Splice support, I depicted the oligosaccharide below whose structure is strangely reminiscent of Table 1 in the guidelines. For those of you glycan enthusiasts who wish to print T-shirts with this emblazoned on the front, here is an Inkscape-friendly SVG file.
However, such a diverse set of monosaccharide symbols is not present in the typical oligosaccharide. I’ve searched PubChem for the entry with the most symbols and found CID91850542, which has 11. (For an alternative depiction of the same structure, see GlyTouCan. Interestingly, the CSDB entry for the same paper describes a different but very similar glycan.)

In fact, having many symbols often indicates a dodgy structure as in the following example (PubChem CID101754793) deposited by Nikkaji which has 9 monosaccharide symbols. Looking at the original source, the SMILES not only has nitro as [N+](=O)O (must have been corrected by PubChem) but many of the sugars have incorrect stereochemistry (compared to the provided IUPAC name). The D/L in several of the symbols, indicating the presence of the rarer stereoisomer, is also a red flag.
If you put the IUPAC name through OPSIN (after a minor mod), and then depict the resulting SMILES using Sugar & Splice, you get the correct structure.

* Glycans are “compounds consisting of a large number of monosaccharides linked glycosidically”, via Wikipedia and the IUPAC gold book.

Sugar&Splice supports PubChem’s support for biologics

Half a million molecules on PubChem have just had a new section added entitled “Biologic Description”. This includes a depiction of the oligomer structure and several line notations including IUPAC condensed and HELM, all of which were generated using Sugar&Splice through perception from the all-atom representation. Since the original development of Sugar&Splice was as part of a collaboration with PubChem, it is great to see these annotations finally appearing as part of this important resource.

Previous blog posts have shown examples of the sorts of peptide depictions that Sugar&Splice can generate. Here is how one appears on PubChem (CID118753634).


Sugar&Splice also supports CFG-style depiction of oligosaccharides (CID71297593):


As ever, there is always more work to be done on improving depictions and perception, and we look forward to further increasing the coverage of biologics in PubChem over the coming months.

Popular med chem replacements

When people talk about bioisosteres (e.g. tetrazole and carboxylic acid) they are usually referring to R group replacements that have similar biological properties. Identifying new bioisosteres can expand a med chemist’s toolbox, and so a number of studies have analysed activity databases to search for previously unknown bioisosteric replacements (e.g. [1]).

Here instead we will analyse what med chemists already consider to be bioisosteres. That is, we will look at the set of med chem replacements observed in the medicinal chemistry literature without any regard to the corresponding activity.

What I’ve done is take all (non-duplicate) IC50, EC50 and Ki data from ChEMBL and generated matched series on a per-assay basis (e.g. an assay with halide analogues will be converted to [*Br, *Cl, *F]). The corresponding matched pairs (e.g. [*Br, *F], [*Br, *Cl], [*F, *Cl]) are then associated with the paper from which the assay is taken, and any duplicates for the same paper are removed.
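As a rough sketch of the bookkeeping (not the actual pipeline), the following Python turns each matched series into its pairs and counts each pair at most once per paper. The function name `papers_per_pair` and the toy input are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def papers_per_pair(series_by_paper):
    """Count, for each unordered R group pair, the number of papers it appears in."""
    counts = Counter()
    for series_list in series_by_paper.values():
        pairs_in_paper = set()
        for series in series_list:
            # Every pair of distinct R groups in a series is a matched pair.
            for a, b in combinations(sorted(set(series)), 2):
                pairs_in_paper.add((a, b))
        counts.update(pairs_in_paper)  # duplicates within a paper count once
    return counts


# Toy data: two assays from one paper, one assay from another
series_by_paper = {
    "paper_1": [["*Br", "*Cl", "*F"], ["*Br", "*Cl"]],
    "paper_2": [["*Br", "*I"]],
}
counts = papers_per_pair(series_by_paper)
print(counts[("*Br", "*Cl")])  # 1
```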

Having done this, we can then ask what is a popular replacement for *Br? As it turns out, the top answer is ethynyl, after *I. This comes from the fact that *Br occurs in 5497 of the 32,158 papers, and ethynyl in 322, so if they occurred independently we would expect to see them co-occur in 55 papers. Given that they actually co-occur in 103, this is an enrichment (or “lift” as recommender systems [2] call it) of 1.9 times what you would expect to see by chance. Here are the others with positive enrichment:

R                      Occurrence  Co-occur  Expected  Enrichment
*I                           1553       901     265.5         3.4
*C#C                          322       103      55.0         1.9
*Cl                         10769      3263    1840.8         1.8
*[N+](=O)[O-]                3910      1179     668.4         1.8
*C=C                          334        91      57.1         1.6
*C#N                         3373       883     576.6         1.5
*SC(F)(F)F                     63        16      10.8         1.5
*F                           9048      2261    1546.6         1.5
*OC(F)(F)F                   1149       279     196.4         1.4
*C(F)(F)F                    4984      1130     852.0         1.3
*S(=O)(=O)C(F)(F)F             51        10       8.7         1.1
*SC                          1337       252     228.5         1.1
*C#CC                          76        14      13.0         1.1
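The expected and enrichment columns follow directly from the paper counts. Here is a small Python check using the ethynyl numbers quoted above; the `lift` helper is just a name I picked for this example.

```python
def lift(n_a, n_b, n_both, n_papers):
    """Return (expected co-occurrences under independence, enrichment)."""
    expected = n_a * n_b / n_papers
    return expected, n_both / expected


# *Br in 5497 papers, *C#C in 322, together in 103, out of 32158 papers
expected, enrichment = lift(5497, 322, 103, 32158)
print(f"{expected:.1f} {enrichment:.1f}")  # 55.0 1.9
```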

I’ve put together an animation that summarises these data. This cycles through the most popular R group replacements that have positive enrichment and that have not previously been shown (in the animation, that is). The suggestions seem to make a lot of sense, especially when you remember that no fingerprint or MCS calculation is used – the co-occurrences come completely from the data.

References:
[1] Wassermann AM, Bajorath J. Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Med Chem. 2011, 3, 425-436.
[2] Boström J, Falk N, Tyrchan C. Exploiting personalized information for reagent selection in drug design. Drug Discov Today. 2011, 16, 181-187.

How the AUC of a ROC curve is like the Journal Impact Factor

The Journal Impact Factor (or JIF) is the mean number of citations to articles published in a journal in the previous 2 years. Now, the mean is often a good measure of the average but not always. To decide whether it’s a good measure, it is often sufficient to look at a histogram of the data. The image above from a blogpost by Steve Royle shows the citation data for Nature. It is exactly as you would expect: a large number of papers have a small number of citations, while a small number of papers have a large number of citations. In other words, it is exactly the sort of curve for which the mean does not provide any meaningful (an ironic pun) result.

Why? Well, it’s the long tail that really kills it (although we could talk about how skewed it is too). Take 101 papers, 100 of which have 1 citation but one has 100. What’s the mean? About 2.0. If that one had 1000 citations instead, the mean would be about 10.9. The mean is heavily influenced by outliers, and here the long tail provides lots of these. For this reason, the mean does not give any useful measure of the average number of citations as it is just pulled up and down by whatever papers got most cited.
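If you want to play with the numbers yourself, here is the same toy example in Python, with the median shown alongside for comparison (the median is my addition; it simply illustrates a statistic that ignores the outlier).

```python
from statistics import mean, median

# 100 papers with 1 citation each, plus one paper with 100 citations
citations = [1] * 100 + [100]
print(round(mean(citations), 2), median(citations))  # 1.98 1

# Same again, but the outlier now has 1000 citations
citations[-1] = 1000
print(round(mean(citations), 2), median(citations))  # 10.89 1
```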

So what’s the link to the AUC of a ROC curve in a typical virtual screening experiment? The AUC has a linear dependence on the mean rank of the actives (see the BEDROC paper), and guess what, that distribution looks very similar to that for citations. For any virtual screening method that is better than random, most of the actives are clustered at the top of the ranked list, while any active that is not recognised by the method floats at random among the inactives. So the AUC is at best a measure of the rank of the actives not recognised by the method, and at worst a random value.
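The linear dependence is easy to check numerically. The sketch below assumes a ranked list with no ties, where rank 1 is the top of the list, n actives and N molecules in total; under those assumptions AUC = 1 - (mean rank of actives - (n + 1)/2) / (N - n). The function names and the example ranks are mine.

```python
def auc_pairwise(active_ranks, n_total):
    """Fraction of (active, inactive) pairs in which the active is ranked higher."""
    actives = set(active_ranks)
    inactive_ranks = [r for r in range(1, n_total + 1) if r not in actives]
    wins = sum(a < i for a in active_ranks for i in inactive_ranks)
    return wins / (len(active_ranks) * len(inactive_ranks))


def auc_from_mean_rank(active_ranks, n_total):
    """The same AUC, computed only from the mean rank of the actives."""
    n = len(active_ranks)
    mean_rank = sum(active_ranks) / n
    return 1 - (mean_rank - (n + 1) / 2) / (n_total - n)


# Four actives at the top of a list of 100, one "missed" active lower down
for missed_rank in (50, 100):
    ranks = [1, 2, 3, 4, missed_rank]
    print(missed_rank, round(auc_pairwise(ranks, 100), 3),
          round(auc_from_mean_rank(ranks, 100), 3))
```

Moving the single missed active from rank 50 to rank 100 is the only thing that changes the AUC here, which is the outlier problem in miniature.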

Naturally, the AUC is the most widely used ranking method in the field.

Identifying novel chemical-disease relationships

As part of the BioCreative V competition, Daniel developed software to find chemical-disease relationships in PubMed abstracts. I’m going to describe a proof-of-concept that uses that code to identify new relationships extracted from the literature. This could be useful both for finding new adverse drug effects and for finding new therapeutic applications.

Most common relationships

Daniel ran the software over all PubMed abstracts in high-precision mode and found 1392503 putative relationships (of which 282604 were unique). To begin with, I looked at the most common relationships found. However, learning that “alcohol is associated with alcoholism” and “cyanide is associated with poisoning” is not super-useful. It is unfortunately the case that the information about which you can be most confident (i.e. it is found multiple times) is also the least useful, as by definition it’s already well known. That said, I didn’t actually know the top relation found, that “streptozotocin is associated with diabetes”; it turns out that streptozotocin is used to produce an animal model for diabetes.

Searching for novel relationships

Really what’s most interesting are novel relationships, ones that haven’t previously been described. To find these I looked at any relationships attributed to this month (i.e. Sep 2015 at the time of writing) or later that were not in earlier abstracts. This gave 847 relationships. When I looked at the sentences associated with these relationships I found that 6 of them explicitly stated that this was the first report of a particular interaction and that in each case we identified the correct relationship.* (Just for interest, I searched the “known” relationships from September for similar phrases stating that they were the first report, but did not find any.)

26228174	D010269	D013262	paraquat	TEN	To our knowledge, this is the first case report of TEN related to paraquat 	Dermatology (Basel, Switzerland)	Sep 2015
25619447	C079703	D000380	rufinamide	agranulocytosis	To the best of our knowledge, this is the first reported case of agranulocytosis induced by rufinamide.	Brain & development	Sep 2015
26356743	D011345	D016553	Fenofibrate	Immune Thrombocytopenia	A Case of Fenofibrate-Induced Immune Thrombocytopenia: First Report.	Puerto Rico health sciences journal	Sep 2015
26370487	D017706	D010996	lisinopril	pleural effusion	We report the first case of eosinophilic pleural effusion occurring due to lisinopril treatment.	Revue des maladies respiratoires	Sep 2015
25588686	C118667	D013262	Dronedarone	Toxic Epidermal Necrolysis	Toxic Epidermal Necrolysis During Dronedarone Treatment: First Report of a Severe Serious Adverse Event Of A New Antiarrhythmic Drug.	Cardiovascular toxicology	Oct 2015
26308264	C033249	D010024	HAR	bone loss	The current study describes for the first time that HAR inhibits receptor activator of nuclear factor κB ligand (RANKL)-induced osteoclastogenesis in vitro and suppresses inflammation-induced bone loss in a mouse model.	Journal of natural products	Sep 2015

Then there are mentions of novel compounds, but I guess there are different degrees of novelty:

26386102	C530299	D050197	Vorapaxar	atherosclerotic	Vorapaxar is a novel antiplatelet agent that has demonstrated efficacy in reducing atherosclerotic events in patients with a history of 	American journal of health-system pharmacy : AJHP : official journal of the American Society of Health-System Pharmacists	Oct 2015
25969859	C509120	D007249	2-Chloroacetamidine	inflammation	2-Chloroacetamidine, a novel immunomodulator, suppresses antigen-induced mouse airway inflammation.	Allergy	Sep 2015

A number of other relationships mention potential as a therapeutic agent:

26201693	C005274	D013274	Naringin	gastric carcinoma	Thus, the present finding suggests that Naringin induced autophagy- mediated growth inhibition shows potential as an alternative therapeutic agent for human gastric carcinoma.	International journal of oncology	Sep 2015
26079694	D002762	D007889	vitamin D3	uterine fibroids (leiomyomas)	To provide a detailed summary of current scientific knowledge on uterine fibroids (leiomyomas) in-vitro and in in-vivo animal models, as well as to postulate the potential role of vitamin D3 as an effective, inexpensive, safe, long-term treatment option for 	Fertility and sterility	Sep 2015
26192096	C054989	D000544	sulfuretin	Alzheimer's disease	Our results also indicate that sulfuretin-induced induction of Nrf2-dependent HO-1 expression via the PI3K/Akt signaling pathway has preventive and/or therapeutic potential for the management of Alzheimer's disease.	Neuroscience	Sep 2015
26239378	D011374	D003093	progesterone	UC	Collagenase, progesterone, heparin, urokinase, nadh and adenosine drugs demonstrated potential for use in treatment of CRC and UC.	Molecular medicine reports	Oct 2015
26234785	C101789	D010523	SA4503	neuropathy	, and the Sig-1R agonist SA4503 could serve as a potential candidate for the treatment of chemotherapeutic-induced neuropathy.	Synapse (New York, N.Y.)	Nov 2015
26301726	C550822	D007249	Fijiolide A	inflammation	Fijiolide A is a secondary metabolite isolated from a marine-derived actinomycete and displays inhibitory activity against TNF-α-induced activation of NFκB, an important transcription factor and a potential target for the treatment of different cancers and inflammation related diseases.	Journal of the American Chemical Society	Sep 2015
26245494	C469689	D009369	Tricetin	cancers	Tricetin, a natural flavonoid, was demonstrated to inhibit the growth of various cancers, but the effect of 	Expert opinion on therapeutic targets	Oct 2015
26203774	C581182	D001943	DMDD	breast cancer	 cells in vitro and further examined the molecular mechanisms of DMDD-induced apoptosis in human breast cancer cells.	Oncotarget	Sep 2015

Filtering using the CTD database

The Comparative Toxicogenomics Database (CTD) contains curated and inferred chemical-disease relationships (among other data) and is freely available to download. The latest update is from Aug 2015 and appears to contain 89039 unique curated relationships and 4.0 million inferred ones (I note that these figures do not agree with the ones reported by CTD so I could be mistaken).

If the novel relationships from Sep 2015 are filtered using the curated CTD set, 813 remain and none of the results above change (note that I didn’t take any advantage of the MESH hierarchy for this proof of concept). Of these, 254 are present in the much larger CTD inferred relationship set. Interestingly, the link between toxic epidermal necrolysis (TEN) and paraquat, first reported in Sep 2015, is one of these.
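In code, this filtering is just set arithmetic on (chemical MESH Id, disease MESH Id) pairs. The sketch below uses tiny toy sets rather than the real CTD download; apart from the paraquat/TEN pair taken from the results above, the identifiers are invented placeholders, as is the `filter_known` name.

```python
def filter_known(novel, curated, inferred):
    """Drop curated relationships, then flag those that CTD has inferred."""
    remaining = novel - curated
    also_inferred = remaining & inferred
    return remaining, also_inferred


# Toy data: each relationship is a (chemical MESH Id, disease MESH Id) pair
novel = {("D010269", "D013262"),   # paraquat / TEN
         ("X000001", "X000002"),   # invented placeholder
         ("X000003", "X000004")}   # invented placeholder
curated = {("X000001", "X000002")}
inferred = {("D010269", "D013262")}

remaining, also_inferred = filter_known(novel, curated, inferred)
print(len(remaining), sorted(also_inferred))  # 2 [('D010269', 'D013262')]
```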

Conclusions

Hopefully the above discussion and results show the potential of this approach. To do this properly would probably require more work on the text-mining to target therapies (this was outside the scope of the BioCreative V competition) and a manual assessment of the quality of the results. If you’d like to collaborate on this, get in touch.

* Note: The format used is PubMed Id, Chemical MESH Id, Disease MESH Id, Chemical text, Disease Text, Relationship text, Journal, Publication Date (it may have appeared online prior to this)
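For anyone wanting to work with records in this format, here is a minimal Python sketch that parses a tab-separated record and picks out relationships first seen on or after a cutoff date. It is an illustration, not the code used for the analysis: the field names, function names and the two records marked as invented are all my own.

```python
from datetime import datetime

FIELDS = ["pmid", "chemical_id", "disease_id", "chemical_text",
          "disease_text", "sentence", "journal", "date"]


def parse_record(line):
    """Split one tab-separated record into a dict, parsing the date."""
    record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    record["date"] = datetime.strptime(record["date"], "%b %Y")
    return record


def novel_relationships(records, cutoff):
    """Records dated on/after cutoff whose chemical-disease pair is not seen earlier."""
    earlier = {(r["chemical_id"], r["disease_id"])
               for r in records if r["date"] < cutoff}
    return [r for r in records
            if r["date"] >= cutoff
            and (r["chemical_id"], r["disease_id"]) not in earlier]


lines = [
    "\t".join(["25619447", "C079703", "D000380", "rufinamide", "agranulocytosis",
               "To the best of our knowledge, this is the first reported case of "
               "agranulocytosis induced by rufinamide.",
               "Brain & development", "Sep 2015"]),
    # The next two records are invented: an old report and a recent repeat of it
    "\t".join(["00000001", "X000001", "X000002", "chemical", "disease",
               "An invented sentence.", "An invented journal", "Jan 2010"]),
    "\t".join(["00000002", "X000001", "X000002", "chemical", "disease",
               "Another invented sentence.", "An invented journal", "Sep 2015"]),
]
records = [parse_record(line) for line in lines]
novel = novel_relationships(records, datetime(2015, 9, 1))
print([r["chemical_text"] for r in novel])  # ['rufinamide']
```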

Shakespeare through the eyes of a chemist Part II

In an earlier post I looked at the chemicals found in Shakespeare’s plays. Following on from the improved text-mining of diseases described in the previous post, let’s look at diseases this time.

First of all, I should point out that it is actually useful to us to run LeadMine on arbitrary texts. It helps to find errors in the dictionaries we use, but also makes us aware that certain terms may be fine if used to mine PubMed abstracts or patents, but may produce false positives on general text.

Here are the most common disease terms found in Shakespeare’s plays, with counts, MESH Id, then the text as it appeared in the play:

176 D010146 : ('pains', 91) ('pain', 66) ('painful', 8) ('sorely', 6) ('aches', 5)
124 D000435 : ('drunk', 67) ('drunken', 19) ('drunkard', 13) ('drooping', 9) ('drunkards', 6) ('drunkenness', 4) ('besotted', 2) ('intemperance', 2) ('being drunk', 1) ('buzzed', 1)
109 D004332 : ('drown', 74) ('drowned', 22) ('drowning', 9) ('drowns', 4)
107 D010930 : ('plague', 85) ('plagues', 13) ('the plague', 9)
68 D020521 : ('stroke', 41) ('strokes', 23) ('apoplexy', 4)
48 D001733 : ('sting', 26) ('bites', 11) ('stings', 8) ('stinging', 3)
44 D018746 : ('sirs', 44)
39 D006470 : ('bleeding', 31) ('bleeds', 7) ('loss of blood', 1)
34 D005076 : ('rash', 33) ('a rash', 1)
33 D003221 : ('confusion', 33)
32 D013217 : ('starve', 19) ('famine', 11) ('starving', 1) ('starves', 1)
29 D002921 : ('scars', 15) ('scar', 10) ('cicatrice', 3) ('cicatrices', 1)
28 D003141 : ('infect', 21) ('infectious', 6) ('infecting', 1)
27 D018908 : ('weakness', 25) ('decrepit', 2)
27 D012614 : ('scurvy', 27)
27 D003288 : ('bruised', 9) ('bruise', 8) ('black and blue', 4) ('bruising', 4) ('contusions', 1) ('bruises', 1)
27 D002056 : ('burns', 27)
25 D034381 : ('deaf', 24) ('hard of hearing', 1)
23 D005334 : ('fever', 22) ('fevers', 1)
20 D004487 : ('swelling', 19) ('dropsy', 1)
19 D005221 : ('wearied', 9) ('weariness', 3) ('wearies', 2) ('weariest', 1) ('wearying', 1) ('wearily', 1) ('languor', 1) ('unwearied', 1)
19 D001237 : ('smother', 15) ('suffocating', 1) ('smothered', 1) ('suffocation', 1) ('smothering', 1)
18 D014202 : ('trembling', 17) ('tremor', 1)
18 D007239 : ('infection', 17) ('infections', 1)
18 D004216 : ('distemper', 18)
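A summary like the one above can be built from a list of (MESH Id, matched text) hits with a couple of counters. This is only a sketch of the tallying step, with invented hits and an invented function name; it says nothing about how LeadMine itself finds the terms, and lower-casing the matched text is an assumption on my part.

```python
from collections import Counter, defaultdict


def summarise(hits):
    """Group matched terms by MESH Id and format one line per Id, most common first."""
    by_id = defaultdict(Counter)
    for mesh_id, text in hits:
        by_id[mesh_id][text.lower()] += 1
    rows = sorted(by_id.items(), key=lambda kv: -sum(kv[1].values()))
    return [
        f"{sum(terms.values())} {mesh_id} : "
        + " ".join(str(item) for item in terms.most_common())
        for mesh_id, terms in rows
    ]


# Invented hits, for illustration only
hits = [("D005334", "fever"), ("D012614", "scurvy"),
        ("D005334", "Fever"), ("D005334", "fevers")]
for line in summarise(hits):
    print(line)
# 3 D005334 : ('fever', 2) ('fevers', 1)
# 1 D012614 : ('scurvy', 1)
```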

This already has highlighted some changes that we need to make (and have already made). For example, SIRS should only be matched uppercase, “unwearied” may redirect to “wearied” on Wikipedia but it’s the opposite, “besotted” no longer means drunk (except with love) and “buzzed” is probably not a useful synonym. 🙂

But overall, the software seems to be in good health, although Shakespeare’s protagonists may not be. Don’t they all die at the end? [SPOILER ALERT]

3 D058734 : ('bleed to death', 3)
3 D003645 : ('sudden death', 3)

Image credit: Ryan Ruppe on Flickr

Cross-checking peptide SMILES from Wikipedia

Spot the error in this structure of Bombesin
Here at NextMove Towers, we find Wikipedia a very useful resource. In fact Roger gave a talk on this at the recent ACS meeting. But here’s a completely different application, a comparison of the SMILES/names generated by Sugar & Splice for oligopeptides and those present in Wikipedia.

The background to this is that while there are an enormous number of possible short peptides, the number with trivial names (such as oxytocin and neuropeptide S) is fairly small. However, since IUPAC define how to name derivatives of peptides, these names can be used as references to cover a wider range of peptides of potential therapeutic interest, e.g. [2-alanine]oxytocin and neuropeptide S (3-8).

One nice feature of Wikipedia is the use of categories, as pages about peptides are marked with category Peptides. Well, almost – they may also or instead be marked as belonging to a subcategory of Peptides, e.g. Neuropeptides. Anyhoo, with a bit of Python code that accessed the Wikipedia API, I was able to download all pages on peptides, a number that totalled 561. I then searched the text on these pages for SMILES strings (typically as “SMILES *= *(.*)\n” in a Chembox or Drugbox), and finally converted the SMILES string to a peptide name with Sugar & Splice.
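The SMILES extraction step looks something like the following Python, which applies the regular expression quoted above to a page’s wikitext. The download via the Wikipedia API and the conversion with Sugar & Splice are left out, and the wikitext snippet is a simplified, made-up stand-in for a real Chembox.

```python
import re

# The pattern quoted in the text: the SMILES field runs to the end of the line
SMILES_FIELD = re.compile(r"SMILES *= *(.*)\n")


def extract_smiles(wikitext):
    """Return all non-empty SMILES field values found in a page's wikitext."""
    return [value.strip() for value in SMILES_FIELD.findall(wikitext)
            if value.strip()]


# Simplified, made-up infobox text (the SMILES is that of glycylglycine)
wikitext = "{{Chembox\n| Name = Glycylglycine\n| SMILES = NCC(=O)NCC(=O)O\n}}\n"
print(extract_smiles(wikitext))  # ['NCC(=O)NCC(=O)O']
```

Empty fields (a template with `SMILES =` and nothing after it) are skipped rather than passed on as empty strings.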

For those cases where Sugar & Splice generated a peptide name, the names were mostly in agreement with the title of the Wikipedia page…but not always. For example the SMILES for Tuftsin was named as [4-D-arginine]tuftsin – the sequence for tuftsin is Thr-Lys-Pro-Arg but the SMILES was actually for Thr-Lys-Pro-D-Arg. Bombesin was named as [8-BLAH]bombesin – the 8th residue is supposed to be tryptophan but the bond to the indole was in the wrong location, and Sugar & Splice identifies it as Ala(indol-2-yl) instead of Trp. Interestingly, if you look at the talk page for Bombesin you can see that someone pointed out this very error in the diagram back in 2011. For those cases where we have found such errors, we will be updating Wikipedia.

Of course, other examples provide cases that need to be added to Sugar & Splice’s dictionary, e.g. Felypressin is named as [2-L-phenylalanine]lypressin, and Morphiceptin as [4-L-proline]endomorphin-2. So overall, Wikipedia provides a nice source of named peptides which we can use to improve our software, and at the same time we are happy to contribute back fixes for any problems we observe.

Image credit: Image by Megac7

Visualising Matched Molecular Series

At the recent Boston ACS, Herman Skolnik Awardee Jürgen Bajorath described the concept of Matched Molecular Pairs (MMPs) as one of the most powerful ideas in medicinal chemistry. However, I would argue that the more general concept of Matched Molecular Series that he himself has developed puts MMPs in the shade.

In my own presentation at the meeting I described what I called the “Matched Pair Mentality” which prevented for some years the realisation that the same methodology applies to series longer than two. By describing the concept in terms of pairs, chemists could not think beyond two molecules as the word “pair” is somewhat special and cannot easily be replaced with a term for three: would this be a triplet, a triad, or a trio? Furthermore, the concept of matched pairs has become synonymous for many with “a matched pair transformation” (that is, a replacement of a terminal R Group), and this cemented the idea of two R groups as a fundamental concept rather than just a specific instance of a general case. Overall, this puts me in mind of the inhabitants of Flatland unable to conceive of a 3rd or higher dimension.

My talk was part of the “Visualizing Chemistry Data to Guide Optimization” symposium organised by Matt Segall and Erin Davis, and focused on the interface we developed for our matched series prediction method, Matsy. This is a visual interface based around R groups as first-class objects (see slide 14 below for example). One advantage of this approach is that it makes it clear that predictions are based solely on the R groups and not the scaffold. It should also help break the matched pair mentality by illustrating that matched pairs are just a subset of matched series: drag down one R group and the predictions are based on matched pairs, drag down another and the predictions are based on series of length 3, and so on. Finally, this interface makes it easy to play around with series, swapping the order, adding new R groups in, and moving between predictions for improving a property versus making it worse.

The elephant in the room is that this may not be the interface you want, despite my attempts to convince you that this is the One True Way. You may indeed want a dataset-centric approach and enough of this malarkey. If so, we’ve got that base covered too as we’ve partnered with Optibrium to introduce this to StarDrop. Their approach integrates both Matsy predictions and predictions from SAR transfer into a single interface, and shows the underlying series from which the predictions come. You can see a demo of this in the webinar linked to at the top of an earlier blogpost.

NextMovers at Boston ACS

We’ve been busy preparing for the Boston ACS. As well as a booth (#643), we’ll be giving various talks, a poster and organising a session. Note that the CINF talk times are not necessarily correct in the PDF or printed schedule. Correct talk times and abstracts are as follows (also available online here):

CINF 1: Generating canonical identifiers for glycoproteins and other chemically modified biopolymers
Roger Sayle, 8:35am – 9:05am, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

Bioinformatics dogma asserts that all-atom representations, capable of encoding details such as disulfide bridging and post-translationally modified amino acids, are too unwieldy to be of practical use. In this presentation, we show how recent advances in computer power, software algorithms and storage technology require us to question this precept. We show how InChI, InChI keys and canonical SMILES can be generated for the largest known proteins, and even for nucleic acid sequences as large as viral and prokaryotic genomes. Indeed, unique identifiers derived from all-atom nucleic acid representations, allow the capture of epigenetic methylation information and circular DNA; feats that are impossible with the one-letter codes used by bioinformaticians. These unique identifiers allow the linking of mature antibodies to the unique identifiers of the plasmids used to express them. Finally, we discuss the possibility of polymer-specific implementations/optimizations of standard InChI, by showing how InChIs and InChI keys may be generated efficiently for specific classes of polymer with over a million atoms.

 

CINF 4: Naming algorithms for derivatives of peptide-like natural products
Roger Sayle, 10:35am – 11:00am, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

The nomenclature of natural products is a highly specialized field of biochemistry. Fortunately, some classes of natural products are more amenable to computer analysis than others. Non-ribosomal peptides and heavily post-translationally modified peptides, such as derivatives of the homodetic cycles gramicidin S and the cyclic depsi-peptide valinomycin and the natural product cyclic isopeptides anantin and sungsanpin push the current state-of-the-art in automated natural product naming. Where a compound is structurally related to an existing peptide, perceiving this relationship is required for generating succinct human understandable names. In this talk, we describe the use of databases/dictionaries based upon HELM notation and IUPAC’s condensed line notations for specifying ‘parent’ peptides from which derivatives and analogues can be named. Using the described techniques the name ‘[5-L-valine]dichotomin C’ may be assigned to the cyclic peptide CHEMBL478596. These techniques have been successfully used to identify and correct naming issues in the UniProt and IUPhar/BPS guide to pharmacology databases, which have then been updated by their curators.

 

CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
Roger Sayle, 2:25pm – 2:45pm, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

The resources provided by the Wikimedia Foundation provide an unprecedented resource for chemists, information professionals and natural language processing researchers in the annotation of pharmaceutically-relevant information in documents. A widely publicized example of the use of Wikipedia in artificial intelligence research is IBM’s Watson’s participation in the Jeopardy! quiz show. In this presentation, we present several chemical research applications of Wikipedia-derived data sets, including named-entity dictionaries and synonym lists for linking ontologies. The global community of volunteer contributors to these projects deserves continual recognition for the invaluable resource they enable.

 

CINF 29: Visualization and manipulation of Matched Molecular Series for decision support
Noel O’Boyle, 3:00pm – 3:25pm, Sun, Aug 16, Room 104B – Boston Convention & Exhibition Center

A Matched Molecular Series (MMS) is a set of molecules that differ by substitution at the same scaffold location [1]. For two molecules, this is equivalent to a Matched Molecular Pair.

We present a graphical interface for querying a database of bioactivity or physicochemical property data using a matched series. Using the database, predictions are made using the Matsy method [2] which suggests what R groups will improve the particular property value of interest.

An interesting aspect of our approach is that the interface treats the distinct R groups attached to a particular scaffold as first-class entities that can be manipulated and rearranged to see the effect on the predictions. This makes it easy, for example, to compare the predictions based simply on matched-pair information versus information from longer length series.

References:
[1] Wawer, M.; Bajorath, J. J. Med. Chem. 2011, 54, 2944.
[2] O’Boyle, N.M.; Boström, J.; Sayle, R.A.; Gill, A. J. Med. Chem. 2014, 57, 2704.

 

CINF 51: Analyzing success rates of supposedly ‘easy’ reactions
Roger Sayle, 10:35am – 11:00am, Mon, Aug 17, Room 104A – Boston Convention & Exhibition Center

Chemists, like insects, come in a bewildering number of varieties and specializations. Traditional retrosynthesis tools are aimed at expert synthetic chemists to assist them with challenging total syntheses, or at process chemists searching for optimal routes via obscure reaction mechanisms. In this talk, we instead consider the role of computer software to support non-experts in synthetic chemistry, such as medicinal and computational chemists. Here the challenge is not in choosing the reaction, but instead preventing silly mistakes with the most widely applied classes of named reactions. Anecdotal experience with the content of pharmaceutical ELNs shows that low yield reactions often correlate with the presence of known incompatible functional groups, such as a second halide in Suzuki couplings.

 

CINF 74: Unlocking chemical information from tables and legacy articles
Daniel Lowe, 2:20pm – 2:45pm, Mon, Aug 17, Room 104B – Boston Convention & Exhibition Center

Many tools for text-mining are designed to work with unstructured text. Here we present the results of our efforts to apply text-mining to the semi-structured content of tables.
We will cover the difficulties of coping with the many ways in which the contents of a table may be specified and the challenges of resolving references to elsewhere in the document. We report on the extraction of melting points, boiling points, NMR spectra and biological activity data from tabular data.
In collaboration with the Royal Society of Chemistry (RSC), we have also investigated the application of these tools to the RSC’s back archive, both in tables and free text. We cover the difficulties in adapting tools optimized for patents to journal articles and in handling the older, less structured text that dates from as far back as 1841.
The information extracted from this project, both from patents and the RSC’s back archive, will form a key contribution to the RSC’s public data repository. In all cases the evidence text for the extracted information is provided along with a link back to the document from which it was extracted, ensuring that the provenance of the information can be verified.
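
To give a flavour of what "the many ways a table may be specified" means in practice, here is a minimal sketch of pulling melting or boiling points out of a table column. It is not the method used in the talk: the header patterns, the function names (`classify_header`, `parse_cell`) and the output dictionary are all assumptions made for illustration, and a real system would need far more variants.

```python
import re

# Hypothetical header patterns; real tables use many more spellings.
HEADER_PATTERNS = {
    "melting_point": re.compile(r"\b(m\.?\s?p\.?|melting\s+point)", re.IGNORECASE),
    "boiling_point": re.compile(r"\b(b\.?\s?p\.?|boiling\s+point)", re.IGNORECASE),
}

# A cell may hold a single value ("98") or a range ("123-125" or "123–125").
VALUE = re.compile(r"(-?\d+(?:\.\d+)?)(?:\s*[-–]\s*(-?\d+(?:\.\d+)?))?")


def classify_header(header):
    """Guess (property, unit) from a column header; None if unrecognised."""
    for prop, pattern in HEADER_PATTERNS.items():
        if pattern.search(header):
            if "°C" in header:
                unit = "°C"
            elif re.search(r"\bK\b", header):
                unit = "K"
            else:
                unit = None  # unit not stated in the header
            return prop, unit
    return None


def parse_cell(cell):
    """Parse one table cell into a low/high range; None if no number found."""
    match = VALUE.search(cell)
    if match is None:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    # "dec." commonly marks decomposition at the melting point.
    return {"low": low, "high": high, "decomposes": "dec" in cell.lower()}
```

With a header such as "m.p. (°C)", `classify_header` identifies the column as a melting point in °C, and `parse_cell` can then be applied to each cell beneath it. Cells with no numeric content (e.g. "n.d.") come back as `None` rather than being guessed at.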

 

CINF 91: Chemistry enabling Chinese, Japanese, and Korean patents (Poster)
Daniel Lowe, 8:00pm – 10:00pm, Mon, Aug 17, Hall C – Boston Convention & Exhibition Center

Chinese, Japanese and Korean (CJK) patents account for over half of all national patent filings and hence are of increasing importance to patent informatics. In chemistry, searching for relevant patents relies heavily on the ability to index by the chemical structures mentioned. Chemical names are typically given in the native language of the patent, significantly complicating their identification and interpretation by conventional chemical text mining tools. Here we present our approach to the translation of chemical names from CJK text and give examples of the wealth of chemical knowledge that can be unlocked.
As novel compounds are described using systematic chemical nomenclature, our approach has been developed to be especially adept at translating systematic names. Systematic chemical nomenclature in CJK languages generally follows the rules described by IUPAC [1], meaning that after translation there will exist a corresponding English name, which can then be used with conventional chemical text mining tools.
Strategies for translation vary between languages. In Chinese, each morpheme of an English chemical name is represented by one or more Hanzi. The interpretation of a Hanzi may be context dependent, which is handled by looking at the environment in which it occurs. Japanese and Korean chemical names, by contrast, are mostly transliterations of English/German chemical nomenclature into Katakana and Hangul respectively.
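
As a rough illustration of the Japanese case, the sketch below maps Katakana name fragments back to English using a greedy longest-match lookup. This is only a toy under stated assumptions: the morpheme table is a tiny hand-picked sample, the function name `translate_name` is hypothetical, and nothing here is meant to describe how the actual translation software works.

```python
# Hypothetical, deliberately tiny morpheme table (Katakana -> English).
KATAKANA_TO_ENGLISH = {
    "メタン": "methane",
    "エタン": "ethane",
    "エタノール": "ethanol",
    "ベンゼン": "benzene",
    "メチル": "methyl",
    "エチル": "ethyl",
    "クロロ": "chloro",
}
MAX_LEN = max(len(key) for key in KATAKANA_TO_ENGLISH)


def translate_name(name):
    """Translate a Katakana chemical name by greedy longest-match lookup.

    Characters that start no known morpheme (locants, hyphens, commas)
    are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(name):
        # Try the longest candidate first so "エタノール" beats shorter entries.
        for length in range(min(MAX_LEN, len(name) - i), 0, -1):
            chunk = name[i:i + length]
            if chunk in KATAKANA_TO_ENGLISH:
                out.append(KATAKANA_TO_ENGLISH[chunk])
                i += length
                break
        else:
            out.append(name[i])
            i += 1
    return "".join(out)
```

For example, `translate_name("2-クロロエタノール")` gives `"2-chloroethanol"`. Chinese would need more than a lookup table, since (as noted above) the reading of a Hanzi can depend on its context.
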
As a case study, we applied our approach to 44 thousand Korean patents (1990-2013) that were likely to contain chemistry and extracted 1.5 million distinct compounds. Of these, 177 thousand were not found by a comparable analysis of US patents. Of the 759 thousand compounds first disclosed between 2006 and 2013 and found in both a US and a Korean patent, 362 thousand were published earlier in the Korean patent.

 

CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alerting system
John May, 2:30pm – 2:50pm, Tue, Aug 18, Waterfront 1A/1B – Seaport Hotel and World Trade Center

Of the many chemical reactions performed by synthetic chemists in the pharmaceutical industry and academia, some are potentially more hazardous than others. Fortunately, best practices, compliance and education help ensure that incidents are rare, but as highlighted by the recent explosion and building evacuation at two UK universities in March 2015, constant vigilance is necessary to ensure a safe work environment. The primary problem is not that chemical safety information, for example from MSDS/SDS data sheets, Bretherick’s Handbook or the internet, is unavailable, but that the volume of such information makes it difficult for an experimentalist to identify relevant risks in a timely manner.

In this talk, we describe our attempts to encode the Environmental Protection Agency’s (EPA’s) 1980 guidance entitled ‘A Method for Determining Compatibility of Hazardous Waste’ in an XML file format. Current state-of-the-art methods for alerting chemists to potential chemical safety hazards, for example in ELNs, typically just annotate reactants with codes extracted from their MSDS/SDS data sheets, such as those of the Globally Harmonized System of Classification and Labelling of Chemicals (GHS) or EU R-phrases/S-phrases, leaving the chemist to manually assess whether any of the described incompatibilities is relevant. In this work, we use combinations of SMARTS patterns (for chemical classes) and InChIs (for specific molecules) to capture known incompatibilities between reagents that may be safe in isolation. Specific alerts describe documented incompatibilities between compounds (e.g. acetone and H2O2), while more general alerts can capture known or inferred incompatibilities between functional groups (e.g. ketones and peroxides). The encoding is hierarchical, allowing only the most relevant warnings to be triggered. Alerts are encoded in a flexible XML format, facilitating extension and exchange. Advanced features of this XML format include the ability to specify reactant quantities and the use of predicted properties (such as those of products) in rules.
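
To make the idea of hierarchical alerts a little more concrete, here is a minimal sketch of how a specific alert might override a general one. The XML element and attribute names (`alert`, `reactant`, `parent`, `class`, `inchi`) are invented for this example and are not the schema presented in the talk. To keep the sketch free of toolkit dependencies, SMARTS matching is replaced by an assumption that each reagent already carries a set of chemical class labels, which in a real system would come from SMARTS matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical alert file: one general (class-based) alert and one
# specific (InChI-based) alert that names the general alert as its parent.
ALERTS_XML = """
<alerts>
  <alert id="ketone-peroxide" message="Incompatible: ketone with peroxide">
    <reactant class="ketone"/>
    <reactant class="peroxide"/>
  </alert>
  <alert id="acetone-h2o2" parent="ketone-peroxide"
         message="Acetone and hydrogen peroxide can form explosive acetone peroxide">
    <reactant inchi="InChI=1S/C3H6O/c1-3(2)4/h1-2H3"/>
    <reactant inchi="InChI=1S/H2O2/c1-2/h1-2H"/>
  </alert>
</alerts>
"""


def matches(reactant, reagent):
    """Match by exact InChI if given, otherwise by pre-assigned class label."""
    inchi = reactant.get("inchi")
    if inchi is not None:
        return reagent["inchi"] == inchi
    return reactant.get("class") in reagent["classes"]


def triggered_alerts(reagents, xml_text=ALERTS_XML):
    """Return ids of alerts that fire, dropping any general alert that is
    superseded by a more specific child alert."""
    root = ET.fromstring(xml_text)
    fired = {}
    for alert in root.iter("alert"):
        first, second = alert.findall("reactant")
        for a in reagents:
            for b in reagents:
                if a is not b and matches(first, a) and matches(second, b):
                    fired[alert.get("id")] = alert
    superseded = {alert.get("parent") for alert in fired.values()}
    return [alert_id for alert_id in fired if alert_id not in superseded]
```

Each reagent here is a plain dictionary with an `"inchi"` string and a `"classes"` set. Acetone plus hydrogen peroxide fires both rules, but only the specific `acetone-h2o2` alert is reported; a different ketone with hydrogen peroxide falls back to the general `ketone-peroxide` alert.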

 

CINF 141: So I have an SD File…what do I do next?
Rajarshi Guha and Noel O’Boyle, 1:35pm – 1:55pm, Wed, Aug 19, Room 104A – Boston Convention & Exhibition Center

Cheminformatics tasks cover a wide range of topics, from manipulating chemical structure file formats to predicting properties of chemical structures. The common theme underlying all these tasks is the handling of chemical structures. Yet key aspects of structural information are frequently lost, altered or ignored during even the most routine of processing tasks, whether through a misunderstanding of how tools work, limitations of the tools used or unfamiliarity with the features (or lack thereof) of particular chemical file formats.

Here we present a compendium of the “Dos and Don’ts of cheminformatics”. Using examples drawn from over a decade of involvement with open source cheminformatics toolkits [1] [2] and a variety of cheminformatics applications, as well as from recent commentaries on chemical structure databases, we illustrate some misconceptions regarding how chemistry data is stored, propose best practices for preserving chemical information intact, and end with a cautionary suggestion: “don’t trust, but verify”.

References:
[1] Steinbeck, C. et al., J. Chem. Inf. Comput. Sci., 2003, 43, 493-500
[2] O’Boyle, N.M. et al., J. Cheminf., 2011, 3, 33