Supporting the updated Symbol Nomenclature for Glycans (SNFG)

Even C&EN reported the recent standardisation efforts by the oligosaccharide community on symbols to use for glycan* depiction. These guidelines are available online in Appendix 1B of Essentials of Glycobiology and will be updated over time.

As a test case for Sugar & Splice support, I depicted the oligosaccharide below whose structure is strangely reminiscent of Table 1 in the guidelines. For those of you glycan enthusiasts who wish to print T-shirts with this emblazoned on the front, here is an Inkscape-friendly SVG file.
However, such a diverse set of monosaccharide symbols is not present in the typical oligosaccharide. I’ve searched PubChem for the entry with the most symbols and found CID91850542, shown below, with 11. (For an alternative depiction of the same structure, see GlyTouCan. Interestingly, the CSDB entry for the same paper describes a different but very similar glycan.)

In fact, having many symbols often indicates a dodgy structure as in the following example (PubChem CID101754793) deposited by Nikkaji which has 9 monosaccharide symbols. Looking at the original source, the SMILES not only has nitro as [N+](=O)O (must have been corrected by PubChem) but many of the sugars have incorrect stereochemistry (compared to the provided IUPAC name). The D/L in several of the symbols, indicating the presence of the rarer stereoisomer, is also a red flag.
If you put the IUPAC name through OPSIN (after a minor mod), and then depict the resulting SMILES using Sugar & Splice, you get the correct structure:

* Glycans are “compounds consisting of a large number of monosaccharides linked glycosidically”, via Wikipedia and the IUPAC gold book.

Analysing the last 40 years of medicinal chemistry reactions

In collaboration with Novartis (with particular thanks to Nadine Schneider) we have published a paper on the analysis of reactions that we have text-mined from 40 years of US medicinal chemistry patents.

The paper covers the evolution of common reaction types over time, using NameRxn to provide the reaction classification. The reaction classification is hierarchical, allowing a reaction to be classified at various levels of granularity. For example, a Chloro Suzuki coupling is a Suzuki coupling, which is a C-C bond formation reaction. Analysis of the properties of the reaction products was also performed, revealing trends such as an increase in the number of rings over time.
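To make the idea of classifying at different levels of granularity concrete, here is a minimal sketch. The parent map below only encodes the single example from the text; it is an assumption for illustration and not NameRxn's actual data model or output format.

```python
from collections import Counter

# Hypothetical parent map, built only from the example in the text.
PARENT = {
    "Chloro Suzuki coupling": "Suzuki coupling",
    "Suzuki coupling": "C-C bond formation",
}

def lineage(name):
    """Return the class itself followed by each broader class above it."""
    chain = [name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def count_all_levels(reactions):
    """Count each reaction once at every level of its lineage."""
    counts = Counter()
    for name in reactions:
        counts.update(lineage(name))
    return counts

reactions = ["Chloro Suzuki coupling", "Chloro Suzuki coupling", "Suzuki coupling"]
counts = count_all_levels(reactions)
for label, n in counts.most_common():
    print(label, n)
```

Counting at every level like this means the same set of reactions can be summarised either finely or coarsely without reclassifying anything.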

The reactions were extracted using a workflow based on the use of LeadMine for identifying and normalizing chemicals and physical quantities. One quantity of especial interest that is extracted and associated with the reaction is the yield. This allowed the identification of reaction types with consistently low/high yield and revealed a trend towards slightly lower yields over time.

Greg Landrum has kindly hosted interactive versions of some of the graphs from the paper here. In the Pipeline has also blogged positively about the paper here.

Sugar&Splice supports PubChem’s support for biologics

Half a million molecules on PubChem have just had a new section added entitled “Biologic Description”. This includes a depiction of the oligomer structure and several line notations including IUPAC condensed and HELM, all of which were generated using Sugar&Splice through perception from the all-atom representation. Since the original development of Sugar&Splice was as part of a collaboration with PubChem, it is great to see these annotations finally appearing as part of this important resource.

Previous blog posts have shown examples of the sorts of peptide depictions that Sugar&Splice can generate. Here is how one appears on PubChem (CID118753634).


Sugar&Splice also supports CFG-style depiction of oligosaccharides (CID71297593):


As ever, there is always more work to be done on improving depictions and perception, and we look forward to further increasing the coverage of biologics in PubChem over the coming months.

PhD positions available in Big Data Analysis in Chemistry

NextMove Software is a partner in the Horizon 2020 MSC ITN EID BigChem project. Ten PhD positions are available in the area of “Big Data Analysis in Chemistry”, all of which offer a mix of time spent in academia and with industrial partners. The following position involves a placement with us for 3 months:

ESR2: Computational compound profiling by large-scale mining of pharmaceutical data

This position is announced within the BIGCHEM project. Read about the career development perspectives.

Check eligibility rules as well as recruitment details and apply for this position before 20 March 2016.

Objectives: In the life-sciences, data is being generated and published at unprecedented rates. This wealth of data provides unique opportunities to get insights into the mechanisms of disease and to identify starting points for treatments. At the same time, the size, complexity and heterogeneity of available data sets pose substantial challenges for computational analysis and design.

The aim of this project is to address the challenges posed by large, heterogeneous, incomplete, and noisy datasets. Specifically, we aim to:

  • apply machine learning technologies to derive predictive QSAR models from real-world life science data sets;
  • analyze trade-offs between training data accuracy and quantity, in particular, in the context of high-throughput screening data;
  • develop and apply methods to systematically account for noise and experimental errors in the search for active compounds.

Planned secondments: Three months' stay at NextMove to work with data automatically extracted from patents using the company's unique technology. Three months at HMGU to collect data from public databases such as ChEMBL, OCHEM and PubChem.

Employment: 36 months total, including Boehringer Ingelheim, Biberach, Germany (months 1-18) and the University of Bonn, Germany (months 19-36).

Enrollment in PhD program: The ESR will be supervised by Prof. J. Bajorath from the University of Bonn and by supervisors from Boehringer Ingelheim.

Salary details are described here.

Boehringer Ingelheim GmbH & Co KG & University of Bonn
Employment type: Full time
Years of experience: 4 years or less (see eligibility rules)
Required languages:
Required general skills: Experience in data mining and statistics. Good knowledge of medicinal chemistry is a plus.
Required IT skills: Good knowledge of programming in mainstream computer languages and of the UNIX/Linux operating system.
Required degree level: Master’s degree in Chemistry, Bioinformatics, Medicinal Chemistry, Informatics/Data Science or closely related fields.


Popular med chem replacements

When people talk about bioisosteres (e.g. tetrazole and carboxylic acid) they are usually referring to R group replacements that have similar biological properties. Identifying new bioisosteres can expand a med chemist’s toolbox, and so a number of studies have analysed activity databases to search for previously unknown bioisosteric replacements (e.g. [1]).

Here instead we will analyse what med chemists already consider to be bioisosteres. That is, we will look at the set of med chem replacements observed in the medicinal chemistry literature without any regard to the corresponding activity.

What I’ve done is take all (non-duplicate) IC50, EC50 and Ki data from ChEMBL and generated matched series on a per-assay basis (e.g. an assay with halide analogues will be converted to [*Br, *Cl, *F]). The corresponding matched pairs (e.g. [*Br, *F], [*Br, *Cl], [*F, *Cl]) are then associated with the paper from which the assay is taken, and any duplicates for the same paper are removed.
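As a rough sketch of the series-to-pairs step just described (the input tuples and names are hypothetical, and this is not the code actually used), each matched series can be expanded into unordered pairs and collected in a set per paper so that duplicates within a paper disappear:

```python
from itertools import combinations

# Hypothetical input: (paper id, assay id, R groups in the matched series).
series = [
    ("paper1", "assayA", ["*Br", "*Cl", "*F"]),
    ("paper1", "assayB", ["*Br", "*Cl"]),  # repeats the *Br/*Cl pair for paper1
    ("paper2", "assayC", ["*Br", "*I"]),
]

def pairs_per_paper(series):
    """Return {paper id: set of unordered R group pairs seen in that paper}."""
    result = {}
    for paper, _assay, rgroups in series:
        pairs = result.setdefault(paper, set())
        # Sorting gives each unordered pair a single canonical form.
        for a, b in combinations(sorted(set(rgroups)), 2):
            pairs.add((a, b))  # the set drops duplicates within a paper
    return result

paper_pairs = pairs_per_paper(series)
for paper, pairs in sorted(paper_pairs.items()):
    print(paper, sorted(pairs))
```

Storing pairs in canonical (sorted) order means [*Br, *F] and [*F, *Br] are counted as the same replacement.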

Having done this, we can then ask what is a popular replacement for *Br? As it turns out, the top answer after *I is ethynyl. This comes from the fact that *Br occurs in 5497 of the 32,158 papers, and ethynyl in 322, so if they occurred independently we would expect to see them co-occur in 55 papers (5497 × 322 / 32,158). Given that they actually co-occur in 103, this is an enrichment (or “lift” as recommender systems [2] call it) of 1.9 times what you would expect to see by chance. Here are the others with positive enrichment:

R                    Occurrence  Co-occur  Expected  Enrichment
*I                   1553        901       265.5     3.4
*C#C                 322         103       55.0      1.9
*Cl                  10769       3263      1840.8    1.8
*[N+](=O)[O-]        3910        1179      668.4     1.8
*C=C                 334         91        57.1      1.6
*C#N                 3373        883       576.6     1.5
*SC(F)(F)F           63          16        10.8      1.5
*F                   9048        2261      1546.6    1.5
*OC(F)(F)F           1149        279       196.4     1.4
*C(F)(F)F            4984        1130      852.0     1.3
*S(=O)(=O)C(F)(F)F   51          10        8.7       1.1
*SC                  1337        252       228.5     1.1
*C#CC                76          14        13.0      1.1
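The expected and enrichment columns follow directly from the counts. A small sketch of the arithmetic, using the *Br and ethynyl numbers quoted above (the function names are my own):

```python
def expected_cooccurrence(n_a, n_b, n_papers):
    """Papers expected to contain both groups if they occurred independently."""
    return n_a * n_b / n_papers

def lift(n_a, n_b, n_both, n_papers):
    """Observed co-occurrence divided by the expectation under independence."""
    return n_both / expected_cooccurrence(n_a, n_b, n_papers)

N_PAPERS = 32158  # papers in the data set
N_BR = 5497       # papers containing *Br

exp = expected_cooccurrence(N_BR, 322, N_PAPERS)  # ethynyl occurs in 322 papers
enrich = lift(N_BR, 322, 103, N_PAPERS)           # they co-occur in 103 papers
print(f"expected={exp:.1f} enrichment={enrich:.1f}")
```

A lift above 1 means the two groups appear together more often than chance would predict, which is why only rows with enrichment above 1 are listed.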

I’ve put together an animation that summarises these data. This cycles through the most popular R group replacements that have positive enrichment and that have not previously been shown (in the animation, that is). The suggestions seem to make a lot of sense, especially when you remember that no fingerprint or MCS calculation is used – the co-occurrences come completely from the data.

[1] Wassermann AM, Bajorath J. Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Med Chem. 2011, 3, 425-436.
[2] Boström J, Falk N, Tyrchan C. Exploiting personalized information for reagent selection in drug design. Drug Discov Today. 2011, 16, 181-187.

Assembling a large data set for melting point prediction: Text-mining to the rescue!

As part of a project initiated by Tony Williams and the Royal Society of Chemistry, I have been working with Igor Tetko to text-mine melting and decomposition point data from the US patent literature so that he could then produce a melting point prediction model. This model showed an improvement over previous models, which is likely due to the overwhelmingly large size of the dataset compared to the smaller curated data sets used by these previous models.

The results of this work have now been published in the Journal of Cheminformatics here: The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from Patents

From the text-mining side this involved identifying compounds, melting and decomposition points, performing the association between them, and then normalizing the representation of the melting points (e.g. “182-4°C” means the same as “182 to 184°C”). Values that were likely to be typos in the original text were also flagged.
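To illustrate the kind of normalization meant by the “182-4°C” example, here is a minimal sketch. The regular expression and the rule for expanding an abbreviated upper bound are my own assumptions for illustration, not the LeadMine implementation, and the sketch ignores negative temperatures and other units.

```python
import re

# Two numbers separated by a hyphen, en dash or "to", followed by degrees C.
RANGE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:-|\u2013|to)\s*(\d+(?:\.\d+)?)\s*°\s*C\s*$"
)

def normalize_range(text):
    """Return (low, high) in degrees C, or None if unparsable or suspicious."""
    m = RANGE.match(text)
    if m is None:
        return None
    low_s, high_s = m.groups()
    low, high = float(low_s), float(high_s)
    # Abbreviated upper bound: "182-4" borrows the leading digits of "182".
    if high < low and "." not in low_s + high_s and len(high_s) < len(low_s):
        high = float(low_s[: len(low_s) - len(high_s)] + high_s)
    if high < low:
        return None  # still descending, so treat as a likely typo
    return (low, high)

print(normalize_range("182-4°C"))
print(normalize_range("182 to 184°C"))
```

Returning None for a range that is still descending after expansion is one simple way to flag a probable typo for later inspection.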

As mentioned in the paper, the resultant set of hundreds of thousands of melting points is available as SDF from Figshare, while the model Igor developed is available from OCHEM.

Image credit: Iain George on Flickr (CC-BY-SA)

How the AUC of a ROC curve is like the Journal Impact Factor

The Journal Impact Factor (or JIF) is the mean number of citations to articles published in a journal in the previous 2 years. Now, the mean is often a good measure of the average but not always. To decide whether it’s a good measure, it is often sufficient to look at a histogram of the data. The image above from a blogpost by Steve Royle shows the citation data for Nature. It is exactly as you would expect: a large number of papers have a small number of citations, while a small number of papers have a large number of citations. In other words, it is exactly the sort of curve for which the mean does not provide any meaningful (an ironic pun) result.

Why? Well, it’s the long tail that really kills it (although we could talk about how skewed it is too). Take 101 papers, 100 of which have 1 citation but one has 100. What’s the mean? About 2.0 (200/101). If that one paper had 1000 citations instead, the mean would be about 10.9 (1100/101). The mean is heavily influenced by outliers, and here the long tail provides lots of these. For this reason, the mean does not give any useful measure of the average number of citations as it is just pulled up and down by whatever papers got most cited.
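The toy example is easy to check with the standard library. In this sketch the single highly cited paper moves the mean from about 2 to about 11, while the median stays at 1:

```python
from statistics import mean, median

# 100 papers with 1 citation each, plus a single highly cited paper.
modest = [1] * 100 + [100]
extreme = [1] * 100 + [1000]

mean_modest = mean(modest)
mean_extreme = mean(extreme)
median_modest = median(modest)
median_extreme = median(extreme)

print(f"mean: {mean_modest:.2f} -> {mean_extreme:.2f}")
print(f"median: {median_modest} -> {median_extreme}")
```

The median is shown only as a contrast: it describes the typical paper and ignores how far out the tail extends.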

So what’s the link to the AUC of a ROC curve in a typical virtual screening experiment? The AUC has a linear dependence on the mean rank of the actives (see the BEDROC paper), and guess what, that distribution looks very similar to that for citations. For any virtual screening method that is better than random, most of the actives are clustered at the top of the ranked list, while any active that is not recognised by the method floats at random among the inactives. So the AUC is at best a measure of the rank of the actives not recognised by the method, and at worst a random value.
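The linear dependence can be written down explicitly. Assuming no ties, n actives among N compounds and rank 1 as the best rank, AUC = 1 - (mean rank - (n + 1)/2) / (N - n). The sketch below checks that expression against a direct pairwise count on made-up ranks (the rank lists are hypothetical):

```python
def auc_from_mean_rank(active_ranks, n_total):
    """AUC computed from the mean rank of the actives (rank 1 is best, no ties)."""
    n = len(active_ranks)
    mean_rank = sum(active_ranks) / n
    return 1.0 - (mean_rank - (n + 1) / 2) / (n_total - n)

def auc_pairwise(active_ranks, n_total):
    """Fraction of (active, inactive) pairs in which the active ranks higher."""
    actives = set(active_ranks)
    inactives = [r for r in range(1, n_total + 1) if r not in actives]
    wins = sum(1 for a in actives for i in inactives if a < i)
    return wins / (len(actives) * len(inactives))

# Four actives at the very top, one unrecognised active elsewhere in the list.
auc_a = auc_from_mean_rank([1, 2, 3, 4, 60], 100)
auc_b = auc_from_mean_rank([1, 2, 3, 4, 95], 100)
print(f"{auc_a:.3f} {auc_b:.3f}")
```

In this toy case the four well-ranked actives are identical in both lists, so the whole difference in AUC comes from where the single unrecognised active happens to land.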

Naturally, the AUC is the most widely used performance measure in the field.

Identifying novel chemical-disease relationships

As part of the BioCreative V competition, Daniel developed software to find chemical-disease relationships in PubMed abstracts. I’m going to describe a proof-of-concept that uses that code to identify new relationships extracted from the literature. This could be useful both for finding new adverse drug effects and for finding new therapeutic applications.

Most common relationships

Daniel ran the software over all PubMed abstracts in high-precision mode and found 1392503 putative relationships (of which 282604 were unique). To begin with, I looked at the most common relationships found. However, learning that “alcohol is associated with alcoholism” and “cyanide is associated with poisoning” is not super-useful. It is unfortunately the case that the information about which you can be most confident (i.e. it is found multiple times) is also the least useful, as by definition it’s already well known. That said, I didn’t know the top relation found, that “streptozotocin is associated with diabetes”; it turns out that streptozotocin is used to produce an animal model for diabetes.

Searching for novel relationships

Really what’s most interesting are novel relationships, ones that haven’t previously been described. To find these I looked at any relationships attributed to this month (i.e. Sep 2015 at the time of writing) or later that were not in earlier abstracts. This gave 847 relationships. When I looked at the sentences associated with these relationships I found that 6 of them explicitly stated that this was the first report of a particular interaction and that in each case we identified the correct relationship.* (Just for interest, I searched the “known” relationships from September for similar phrases stating that they were the first report, but did not find any.)
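The date-based novelty test can be sketched as a set difference. The records and identifiers below are placeholders invented for illustration, and the real analysis ran over the full set of extracted relationships:

```python
from datetime import date

# Hypothetical records: (chemical id, disease id, publication date).
records = [
    ("CHEM_A", "DIS_X", date(2012, 5, 1)),
    ("CHEM_A", "DIS_X", date(2015, 9, 1)),   # seen before, so not novel
    ("CHEM_B", "DIS_Y", date(2015, 9, 1)),
    ("CHEM_C", "DIS_Z", date(2015, 10, 1)),
]

def novel_relationships(records, cutoff):
    """Relationships dated on or after the cutoff that never appear before it."""
    earlier = {(chem, dis) for chem, dis, when in records if when < cutoff}
    recent = {(chem, dis) for chem, dis, when in records if when >= cutoff}
    return recent - earlier

novel = novel_relationships(records, date(2015, 9, 1))
print(sorted(novel))
```

Because the comparison is on exact identifier pairs, a relationship reported earlier under a broader or narrower disease term would still count as novel here.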

26228174	D010269	D013262	paraquat	TEN	To our knowledge, this is the first case report of TEN related to paraquat 	Dermatology (Basel, Switzerland)	Sep 2015
25619447	C079703	D000380	rufinamide	agranulocytosis	To the best of our knowledge, this is the first reported case of agranulocytosis induced by rufinamide.	Brain & development	Sep 2015
26356743	D011345	D016553	Fenofibrate	Immune Thrombocytopenia	A Case of Fenofibrate-Induced Immune Thrombocytopenia: First Report.	Puerto Rico health sciences journal	Sep 2015
26370487	D017706	D010996	lisinopril	pleural effusion	We report the first case of eosinophilic pleural effusion occurring due to lisinopril treatment.	Revue des maladies respiratoires	Sep 2015
25588686	C118667	D013262	Dronedarone	Toxic Epidermal Necrolysis	Toxic Epidermal Necrolysis During Dronedarone Treatment: First Report of a Severe Serious Adverse Event Of A New Antiarrhythmic Drug.	Cardiovascular toxicology	Oct 2015
26308264	C033249	D010024	HAR	bone loss	The current study describes for the first time that HAR inhibits receptor activator of nuclear factor κB ligand (RANKL)-induced osteoclastogenesis in vitro and suppresses inflammation-induced bone loss in a mouse model.	Journal of natural products	Sep 2015

Then there are mentions of novel compounds, but I guess there are different degrees of novelty:

26386102	C530299	D050197	Vorapaxar	atherosclerotic	Vorapaxar is a novel antiplatelet agent that has demonstrated efficacy in reducing atherosclerotic events in patients with a history of 	American journal of health-system pharmacy : AJHP : official journal of the American Society of Health-System Pharmacists	Oct 2015
25969859	C509120	D007249	2-Chloroacetamidine	inflammation	2-Chloroacetamidine, a novel immunomodulator, suppresses antigen-induced mouse airway inflammation.	Allergy	Sep 2015

A number of other relationships mention potential as a therapeutic agent:

26201693	C005274	D013274	Naringin	gastric carcinoma	Thus, the present finding suggests that Naringin induced autophagy- mediated growth inhibition shows potential as an alternative therapeutic agent for human gastric carcinoma.	International journal of oncology	Sep 2015
26079694	D002762	D007889	vitamin D3	uterine fibroids (leiomyomas)	To provide a detailed summary of current scientific knowledge on uterine fibroids (leiomyomas) in-vitro and in in-vivo animal models, as well as to postulate the potential role of vitamin D3 as an effective, inexpensive, safe, long-term treatment option for 	Fertility and sterility	Sep 2015
26192096	C054989	D000544	sulfuretin	Alzheimer's disease	Our results also indicate that sulfuretin-induced induction of Nrf2-dependent HO-1 expression via the PI3K/Akt signaling pathway has preventive and/or therapeutic potential for the management of Alzheimer's disease.	Neuroscience	Sep 2015
26239378	D011374	D003093	progesterone	UC	Collagenase, progesterone, heparin, urokinase, nadh and adenosine drugs demonstrated potential for use in treatment of CRC and UC.	Molecular medicine reports	Oct 2015
26234785	C101789	D010523	SA4503	neuropathy	, and the Sig-1R agonist SA4503 could serve as a potential candidate for the treatment of chemotherapeutic-induced neuropathy.	Synapse (New York, N.Y.)	Nov 2015
26301726	C550822	D007249	Fijiolide A	inflammation	Fijiolide A is a secondary metabolite isolated from a marine-derived actinomycete and displays inhibitory activity against TNF-α-induced activation of NFκB, an important transcription factor and a potential target for the treatment of different cancers and inflammation related diseases.	Journal of the American Chemical Society	Sep 2015
26245494	C469689	D009369	Tricetin	cancers	Tricetin, a natural flavonoid, was demonstrated to inhibit the growth of various cancers, but the effect of 	Expert opinion on therapeutic targets	Oct 2015
26203774	C581182	D001943	DMDD	breast cancer	 cells in vitro and further examined the molecular mechanisms of DMDD-induced apoptosis in human breast cancer cells.	Oncotarget	Sep 2015

Filtering using the CTD database

The Comparative Toxicogenomics Database (CTD) contains curated and inferred chemical-disease relationships (among other data) and is freely available to download. The latest update is from Aug 2015 and appears to contain 89039 unique curated relationships and 4.0 million inferred ones (I note that these figures do not agree with the ones reported by CTD so I could be mistaken).

If the novel relationships from Sep 2015 are filtered using the curated CTD set, 813 remain and none of the results above change (note that I didn’t take any advantage of the MeSH hierarchy for this proof of concept). Of these, 254 are present in the much larger CTD inferred relationship set. Interestingly, the link between toxic epidermal necrolysis (TEN) and paraquat, first reported in Sep 2015, is one of these.
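The two filtering steps amount to a set difference followed by an intersection. A sketch with placeholder identifiers (loading the actual CTD download is not shown, and as in the text the match is on exact ID pairs with no use of the hierarchy):

```python
def filter_against_ctd(novel, curated, inferred):
    """Drop curated relationships, then report which survivors CTD infers."""
    not_curated = novel - curated
    also_inferred = not_curated & inferred
    return not_curated, also_inferred

# Placeholder (chemical id, disease id) pairs for illustration only.
novel = {("CHEM_A", "DIS_X"), ("CHEM_B", "DIS_Y"), ("CHEM_C", "DIS_Z")}
curated = {("CHEM_A", "DIS_X")}
inferred = {("CHEM_B", "DIS_Y"), ("CHEM_D", "DIS_W")}

remaining, in_inferred = filter_against_ctd(novel, curated, inferred)
print(len(remaining), len(in_inferred))
```

Relationships that survive the curated filter but appear in the inferred set are arguably the most interesting, since the text-mining supplies direct evidence for something that was only inferred.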


Hopefully the above discussion and results show the potential of this approach. To do this properly would probably require more work on the text-mining to target therapies (this was outside the scope of the BioCreative V competition) and a manual assessment of the quality of the results. If you’d like to collaborate on this, get in touch.

* Note: The format used is PubMed ID, chemical MeSH ID, disease MeSH ID, chemical text, disease text, relationship text, journal, publication date (it may have appeared online prior to this)
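For anyone wanting to work with rows in this format, here is a small parsing sketch. It assumes the eight fields are tab-separated as shown above; the class and field names are my own.

```python
from typing import NamedTuple

class Relationship(NamedTuple):
    pubmed_id: str
    chemical_mesh_id: str
    disease_mesh_id: str
    chemical_text: str
    disease_text: str
    sentence: str
    journal: str
    published: str

def parse_row(line):
    """Split one tab-separated row into its eight named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    return Relationship(*fields)

row = (
    "25619447\tC079703\tD000380\trufinamide\tagranulocytosis\t"
    "To the best of our knowledge, this is the first reported case of "
    "agranulocytosis induced by rufinamide.\tBrain & development\tSep 2015"
)
rel = parse_row(row)
print(rel.chemical_text, "->", rel.disease_text)
```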

Shakespeare through the eyes of a chemist Part II

In an earlier post I looked at the chemicals found in Shakespeare’s plays. Following on from the improved text-mining of diseases described in the previous post, let’s look at diseases this time.

First of all, I should point out that it is actually useful to us to run LeadMine on arbitrary texts. It helps to find errors in the dictionaries we use, but also makes us aware that certain terms may be fine if used to mine PubMed abstracts or patents, but may produce false positives on general text.

Here are the most common disease terms found in Shakespeare’s plays, with counts, MeSH ID, then the text as it appeared in the play:

176 D010146 : ('pains', 91) ('pain', 66) ('painful', 8) ('sorely', 6) ('aches', 5)
124 D000435 : ('drunk', 67) ('drunken', 19) ('drunkard', 13) ('drooping', 9) ('drunkards', 6) ('drunkenness', 4) ('besotted', 2) ('intemperance', 2) ('being drunk', 1) ('buzzed', 1)
109 D004332 : ('drown', 74) ('drowned', 22) ('drowning', 9) ('drowns', 4)
107 D010930 : ('plague', 85) ('plagues', 13) ('the plague', 9)
68 D020521 : ('stroke', 41) ('strokes', 23) ('apoplexy', 4)
48 D001733 : ('sting', 26) ('bites', 11) ('stings', 8) ('stinging', 3)
44 D018746 : ('sirs', 44)
39 D006470 : ('bleeding', 31) ('bleeds', 7) ('loss of blood', 1)
34 D005076 : ('rash', 33) ('a rash', 1)
33 D003221 : ('confusion', 33)
32 D013217 : ('starve', 19) ('famine', 11) ('starving', 1) ('starves', 1)
29 D002921 : ('scars', 15) ('scar', 10) ('cicatrice', 3) ('cicatrices', 1)
28 D003141 : ('infect', 21) ('infectious', 6) ('infecting', 1)
27 D018908 : ('weakness', 25) ('decrepit', 2)
27 D012614 : ('scurvy', 27)
27 D003288 : ('bruised', 9) ('bruise', 8) ('black and blue', 4) ('bruising', 4) ('contusions', 1) ('bruises', 1)
27 D002056 : ('burns', 27)
25 D034381 : ('deaf', 24) ('hard of hearing', 1)
23 D005334 : ('fever', 22) ('fevers', 1)
20 D004487 : ('swelling', 19) ('dropsy', 1)
19 D005221 : ('wearied', 9) ('weariness', 3) ('wearies', 2) ('weariest', 1) ('wearying', 1) ('wearily', 1) ('languor', 1) ('unwearied', 1)
19 D001237 : ('smother', 15) ('suffocating', 1) ('smothered', 1) ('suffocation', 1) ('smothering', 1)
18 D014202 : ('trembling', 17) ('tremor', 1)
18 D007239 : ('infection', 17) ('infections', 1)
18 D004216 : ('distemper', 18)

This has already highlighted some changes that we need to make (and have already made). For example, SIRS should only be matched in uppercase, “unwearied” may redirect to “wearied” on Wikipedia but it means the opposite, “besotted” no longer means drunk (except with love) and “buzzed” is probably not a useful synonym. 🙂

But overall, the software seems to be in good health, although Shakespeare’s protagonists may not be. Don’t they all die at the end? [SPOILER ALERT]

3 D058734 : ('bleed to death', 3)
3 D003645 : ('sudden death', 3)

Image credit: Ryan Ruppe on Flickr

Using Wikipedia to understand disease names

We recently got back from the BioCreative V meeting. In this we participated in two tracks, one of which was to extract chemicals/genes/proteins from patent abstracts, while the other was to identify diseases and identify causal relationships between chemicals and diseases.

As the latter task required normalizing the disease mentions to concepts (in this case MeSH IDs) we used Wikipedia to significantly improve our coverage of disease terms and how they can be linked to MeSH. As redirects in Wikipedia are intentionally designed to help people find the appropriate page, they are a great source of common names for diseases, as well as terms that imply a disease state e.g. diabetic.
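The general idea of harvesting redirects can be sketched as follows. Everything in this example is hypothetical: the redirect pairs, the page title and the placeholder ID are invented for illustration, and this is not the pipeline we actually used.

```python
# Hypothetical redirect table: redirect title -> target page title.
redirects = {
    "Diabetic": "Diabetes",
    "Sugar diabetes": "Diabetes",
    "Unrelated redirect": "Some other page",
}

# Hypothetical mapping from disease pages to concept IDs (placeholder value).
page_to_concept = {"Diabetes": "MESH_PLACEHOLDER_1"}

def build_term_dictionary(redirects, page_to_concept):
    """Map lower-cased page titles and their redirects to a concept ID."""
    terms = {title.lower(): cid for title, cid in page_to_concept.items()}
    for alias, target in redirects.items():
        if target in page_to_concept:  # keep redirects to known disease pages
            terms[alias.lower()] = page_to_concept[target]
    return terms

terms = build_term_dictionary(redirects, page_to_concept)
print(sorted(terms))
```

A dictionary built this way picks up adjectival forms such as “diabetic” for free, which is where much of the extra recall comes from, but it also inherits any redirect that does not really denote the disease.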

Sixteen teams participated in the task, and our use of Wikipedia allowed us to achieve the highest recall (86.2%). Unfortunately this recall came at some expense to precision (partially due to genuine mistakes in our Wikipedia dictionary and partially due to terms that are not directly in MeSH being more likely to not be annotated or attributed to different MeSH IDs in the gold standard). On F1-score our solution was ranked 2nd, marginally behind (0.34%) the winning entrant due to the lower precision, which we are already working to improve.

We also spent a couple of weeks writing a simple pattern-based system for identifying chemical-induced disease relationships. This performed surprisingly well compared to the numerous machine learning solutions (18 teams participated), with only two solutions producing better results. On closer inspection, while our system relied solely on the text given to it, both of these solutions used knowledge bases of known chemical-disease relationships as features in their models!

You can find out more in the presentation we gave at BioCreative V (bottom of this post) and our two workshop proceedings papers (here and here). Due to the orders of magnitude speed difference between our solution and many of the other solutions, we started our presentation by discussing this before getting to the science of how we use Wikipedia terms.